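arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.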

2603.07561 2026-05-20 cs.CV

PureCC: Pure Learning for Text-to-Image Concept Customization

Zhichao Liao, Xiaole Xian, Qingyu Li, Wenyu Qin, Meng Wang, Weicheng Xie, Siyang Song, Pingfa Feng, Long Zeng, Liang Pan

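AI summary: This paper proposes PureCC, a pure learning method for text-to-image concept customization that decouples the learning objective to balance customization fidelity against preservation of the original model.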

Comments Accepted to CVPR 2026

Abstract

Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale $λ^\star$ to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization. The code is available at https://github.com/lzc-sg/PureCC.

2603.03066 2026-05-20 cs.CV

EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

Baoliang Chen, Xinlong Bu, Hanwei Zhu, Lingyu Zhu, Jieyu Zhan

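AI summary: This study proposes EduVQA, a framework with a Structured 2D Mixture-of-Experts architecture that assesses concept correctness in educational AI-generated videos, a requirement that conventional quality-assessment methods overlook.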

Abstract

Existing AI-generated video quality assessment (AIGVQA) methods mainly focus on global perceptual realism and coarse text-video alignment, while overlooking a critical requirement in educational scenarios: concept correctness. In early mathematics education, subtle errors in numerical quantities, geometric relations, or spatial configurations may fundamentally alter the conveyed knowledge despite visually plausible generation. To address this problem, we introduce EduAVQABench, the first benchmark for concept-aware educational AIGV assessment, containing 1,130 videos generated by ten state-of-the-art T2V models together with over 310,650 fine-grained human annotations spanning perceptual quality and semantic alignment. Built upon this benchmark, we further propose EduVQA, a concept-aware AIGVQA framework equipped with a Structured 2D Mixture-of-Experts (S2D-MoE) architecture. By jointly modeling fine-grained concept assessment and overall quality prediction through shared experts and adaptive two-dimensional routing, EduVQA effectively captures subtle concept-level inconsistencies overlooked by conventional global scoring methods. Extensive experiments demonstrate that EduVQA consistently outperforms existing AIGVQA approaches across both perceptual and semantic evaluation tasks while exhibiting strong generalization capability on unseen benchmarks. Code and dataset will be publicly available at: https://github.com/EduVQA/EduVQA.

2603.01009 2026-05-20 cs.CL

Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays

Hoor Elbahnasawi, Marwan Sayed, Sohaila Eltanbouly, Fatima Brahamia, Tamer Elsayed

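AI summary: This paper presents Qayyem, a platform offering an integrated workflow for Arabic essay scoring; a user-friendly interface simplifies interaction with the scoring server API, and several state-of-the-art Arabic essay scoring models are deployed.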

Comments Accepted at ACL 2026

2602.23622 2026-05-20 cs.CV cs.AI

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao

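AI summary: This paper proposes DLEBench, the first benchmark dedicated to evaluating how well instruction-based image editing models edit small-scale objects. Its 1889 samples cover complex scenarios and reveal significant performance gaps in small-object editing, underscoring the need for specialized benchmarks.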

Abstract

Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
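
The 1%-10% area criterion in the abstract above can be made concrete with a small filter. This is only a sketch under assumptions: the function names, the `(x0, y0, x1, y1)` box format, and the use of bounding-box area (rather than mask area) are illustrative choices, not DLEBench's actual tooling.

```python
def area_ratio(box, image_size):
    """Fraction of the image covered by an axis-aligned box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    width, height = image_size
    # Clip the box to the image before measuring it.
    box_w = max(0, min(x1, width) - max(x0, 0))
    box_h = max(0, min(y1, height) - max(y0, 0))
    return (box_w * box_h) / (width * height)


def is_small_object(box, image_size, low=0.01, high=0.10):
    """True when the box covers between `low` and `high` of the image area."""
    return low <= area_ratio(box, image_size) <= high


# A 100x80 box in a 1024x768 image covers roughly 1% of the area.
small = is_small_object((10, 10, 110, 90), (1024, 768))
# A box covering half the image falls outside the small-object range.
large = is_small_object((0, 0, 512, 768), (1024, 768))
```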

2602.17038 2026-05-20 cs.AI

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

Shengtian Yang, Yu Li, Shuo He, Yewen Li, Qingpeng Cai, Peng Jiang, Lei Feng

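AI summary: This paper proposes Phase-Aware Mixture of Experts (PA-MoE) to address the fragmentation of phase-consistent patterns caused by token-level routing in conventional Mixture-of-Experts (MoE) models; it learns latent phase boundaries to strengthen expert specialization.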

Abstract

Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.
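
The contrast between token-level and phase-level routing can be illustrated with a toy example. This is a sketch under stated assumptions, not the PA-MoE router: the phase boundaries are passed in by hand (the paper learns them from the RL objective), and picking the expert by summed router scores is a stand-in rule chosen for illustration.

```python
def token_level_routing(scores):
    """Route each token independently to its highest-scoring expert."""
    return [max(range(len(s)), key=lambda e: s[e]) for s in scores]


def phase_level_routing(scores, boundaries):
    """Route every token in a phase to a single expert.

    `boundaries` lists the start index of each phase and must begin with 0.
    It is a hypothetical hand-written input; the expert for a phase is the
    one with the largest summed score, an illustrative stand-in rule.
    """
    assignments = []
    edges = list(boundaries) + [len(scores)]
    n_experts = len(scores[0])
    for start, end in zip(edges, edges[1:]):
        totals = [sum(s[e] for s in scores[start:end]) for e in range(n_experts)]
        expert = max(range(n_experts), key=lambda e: totals[e])
        assignments.extend([expert] * (end - start))
    return assignments


def count_switches(assignments):
    """Number of positions where the expert changes between neighbours."""
    return sum(a != b for a, b in zip(assignments, assignments[1:]))


# Router scores for 6 tokens and 2 experts; tokens 0-2 and 3-5 form two phases.
scores = [[0.6, 0.4], [0.45, 0.55], [0.7, 0.3], [0.2, 0.8], [0.55, 0.45], [0.1, 0.9]]
token_assign = token_level_routing(scores)
phase_assign = phase_level_routing(scores, boundaries=[0, 3])
```

In this toy input, per-token routing alternates between experts at almost every step, while phase-level routing changes expert only at the phase boundary.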

2602.15752 2026-05-20 cs.LG

Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching

Ren Kishimoto, Rikiya Takehi, Koichi Tanaka, Masahiro Nomura, Riku Togashi, Yoji Tomita, Yuta Saito

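AI summary: This paper proposes a two-sided matching approach that maximizes user retention rather than match count or fairness alone. Its dynamic learning-to-rank algorithm, MRet, uses personalized retention curves to optimize recommendations and raise overall user retention.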

Comments Published as a conference paper at ICLR 2026

Abstract

On two-sided matching platforms such as online dating and recruiting, recommendation algorithms often aim to maximize the total number of matches. However, this objective creates an imbalance, where some users receive far too many matches while many others receive very few and eventually abandon the platform. Retaining users is crucial for many platforms, such as those that depend heavily on subscriptions. Some may use fairness objectives to solve the problem of match maximization. However, fairness in itself is not the ultimate objective for many platforms, as users do not suddenly reward the platform simply because exposure is equalized. In practice, where user retention is often the ultimate goal, casually relying on fairness will leave the optimization of retention up to luck. In this work, instead of maximizing matches or axiomatically defining fairness, we formally define the new problem setting of maximizing user retention in two-sided matching platforms. To this end, we introduce a dynamic learning-to-rank (LTR) algorithm called Matching for Retention (MRet). Unlike conventional algorithms for two-sided matching, our approach models user retention by learning personalized retention curves from each user's profile and interaction history. Based on these curves, MRet dynamically adapts recommendations by jointly considering the retention gains of both the user receiving recommendations and those who are being recommended, so that limited matching opportunities can be allocated where they most improve overall retention. Naturally but importantly, empirical evaluations on synthetic and real-world datasets from a major online dating platform show that MRet achieves higher user retention, since conventional methods optimize matches or fairness rather than retention.
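
The idea of allocating limited matching opportunities where they most improve retention can be sketched with a greedy rule. Everything here is an assumption for illustration: the exponential curve shape, the per-user `rates`, and the one-sided greedy allocation are hypothetical and much simpler than MRet, which learns the curves and accounts for both sides of each match.

```python
import heapq
import math


def retention(k, matches):
    """Hypothetical concave retention curve: probability that a user stays."""
    return 1.0 - math.exp(-k * matches)


def allocate_matches(rates, budget):
    """Greedily give each match to the user with the largest retention gain."""
    counts = [0] * len(rates)
    # Max-heap via negated gains: (negative marginal gain, user index).
    heap = [(-(retention(k, 1) - retention(k, 0)), i) for i, k in enumerate(rates)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        gain = retention(rates[i], counts[i] + 1) - retention(rates[i], counts[i])
        heapq.heappush(heap, (-gain, i))
    return counts


rates = [2.0, 0.5, 0.1]  # illustrative per-user curve steepness
counts = allocate_matches(rates, budget=4)
total = sum(retention(k, m) for k, m in zip(rates, counts))
```

Because each curve is concave, the marginal gain of a further match shrinks as a user accumulates matches, so the greedy rule moves on to other users instead of piling every match onto the first one.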

2602.13466 2026-05-20 cs.CL cs.AI cs.LG

Language Model Memory and Memory Models for Language

Benjamin L. Badger

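AI summary: This study compares how well language models and memory models store input information. It finds that language model embeddings carry relatively little input information, whereas autoencoders trained for input regeneration form nearly perfect memories; it proposes a parallelizable encoder-decoder memory model architecture and combines causal and information-retention objectives to improve memory formation and decoding.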

Abstract

The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of 'memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.

2602.11910 2026-05-20 cs.SD cs.LG

TADA! Tuning Audio Diffusion Models through Activation Steering

Łukasz Staniszewski, Katarzyna Zaleska, Mateusz Modrzejewski, Kamil Deja

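AI summary: Using activation patching, this paper reveals a semantic bottleneck in audio diffusion models and shows that localized activation steering achieves new state-of-the-art performance in audio concept modulation.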

Comments Preprint

Abstract

Audio diffusion models can synthesize high-fidelity music from text, yet achieving fine-grained control over specific musical attributes remains challenging, as their internal mechanisms for representing high-level concepts are poorly understood. In this work, we use activation patching to demonstrate that recent audio diffusion architectures exhibit a semantic bottleneck, where a small, shared subset of consecutive attention layers controls distinct musical concepts, such as the presence of specific instruments, vocals, or genres. Building on this, we systematically evaluate a broad spectrum of steering paradigms, comparing activation steering against prompt-level, score-space, and weight-space interventions, analyzing the interaction between the steering mechanism and the intervention site. Our new benchmark, supported by an extensive user study, demonstrates that localized activation steering establishes a new state-of-the-art in audio concept modulation.

2602.11767 2026-05-20 cs.AI cs.CL cs.LG

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Heiko Ludwig, Holger Boche

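AI summary: This paper proposes TSR, a training-time method that improves per-turn rollout generation; a lightweight tree-style search constructs high-quality trajectories, improving rollout quality and learning stability in multi-turn RL tasks.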

Abstract

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using state-based feedback. This improves rollout quality and stabilizes learning while remaining compatible with standard policy gradient optimizers, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a modest, one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a modular and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
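
The best-of-N instantiation mentioned in the abstract can be sketched on a toy environment. This is a minimal sketch, assuming a 1-D walk toward a goal, a uniform random policy standing in for the LLM, and a distance-based score standing in for state-based feedback; none of these names or choices come from the TSR code.

```python
import random


def rollout(policy, score, start, turns, n_candidates, rng):
    """Build one trajectory, keeping the best of N sampled actions per turn."""
    state, trajectory = start, []
    for _ in range(turns):
        candidates = [policy(state, rng) for _ in range(n_candidates)]
        # Choose the action whose resulting state scores highest.
        action = max(candidates, key=lambda a: score(state + a))
        state += action
        trajectory.append((action, state))
    return trajectory


def random_policy(state, rng):
    """Stand-in for an LLM policy: step left or right uniformly at random."""
    return rng.choice([-1, 1])


def distance_score(state, goal=5):
    """State-based feedback: higher is better, zero at the goal."""
    return -abs(goal - state)


rng = random.Random(0)
# n_candidates=1 reduces to naive sampling; larger N adds per-turn search.
naive = rollout(random_policy, distance_score, start=0, turns=10, n_candidates=1, rng=rng)
searched = rollout(random_policy, distance_score, start=0, turns=10, n_candidates=4, rng=rng)
```

The resulting trajectories would then be passed to a standard policy-gradient update; that step is omitted here.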

2602.09872 2026-05-20 cs.CV cs.HC

BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices

Mridankan Mandal

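AI summary: This paper proposes BabyMamba-HAR, a family of lightweight selective state space models for efficient human activity recognition on resource-constrained devices; its two lightweight architectures combine high accuracy with low resource use.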

Abstract

Human activity recognition (HAR) on resource constrained devices requires high accuracy across diverse sensor setups. Selective state space models (SSMs) offer efficient linear time sequence processing, presenting a compelling alternative to attention mechanisms. However, their TinyML design space remains unexplored. This paper introduces BabyMamba-HAR, comprising two lightweight architectures: (1) CI-BabyMamba-HAR, utilizing a channel independent stem for noise robustness, and (2) Crossover-BiDir-BabyMamba-HAR, utilizing an early fusion stem for channel count independent complexity. Both integrate weight tied bidirectional scanning and gated temporal attention pooling. Across eight benchmarks, Crossover-BiDir-BabyMamba-HAR averages an 86.52% F1-score with 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets. On-device deployment on the Raspberry Pi Pico 2 and ESP32 utilized a mixed precision C++ runtime (INT8 projections, float32 states). A fused computation strategy with lifetime aware memory management reduces peak memory footprint from $O(B \cdot d_{\text{model}} \cdot L \cdot d_{\text{state}})$ to $O(B \cdot d_{\text{model}} \cdot d_{\text{state}})$, adapting to support weight-tied bidirectional and channel-streaming execution. Both architectures achieved full 8/8 dataset coverage with >99.2% PyTorch parity, whereas INT8 quantized TFLite baselines showed degraded coverage and parity (TinyHAR: 7/8 and 4/8 coverage at 60.4% and 88.6% parity, TinierHAR: 8/8 and 6/8 at 54.2% and 90.8%, DeepConvLSTM: 1/8 and 0/8 on Pico 2 and ESP32, respectively). Crossover-BiDir-BabyMamba-HAR averages 154.4 ms latency on ESP32 and 481.9 ms on Pico 2. Ablations confirm bidirectional scanning and gated attention improve F1-scores by up to 8.42% and 8.94%, respectively, establishing practical principles for TinyML SSM deployment.
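
The memory reduction described above, dropping the sequence-length factor from the peak footprint, comes from not materializing every hidden state. The sketch below shows the general idea on a single-channel linear recurrence with fixed coefficients. It is a simplification under assumptions: a selective SSM uses input-dependent coefficients, and the function names and the plain-Python form are illustrative, not the paper's C++ runtime.

```python
def scan_materialized(x, a, b, c):
    """Keep every hidden state: memory grows with sequence length L."""
    states = [[0.0] * len(a)]
    for x_t in x:
        prev = states[-1]
        states.append([a_i * h_i + b_i * x_t for a_i, b_i, h_i in zip(a, b, prev)])
    outputs = [sum(c_i * h_i for c_i, h_i in zip(c, h)) for h in states[1:]]
    return outputs, len(states) - 1  # number of stored state vectors


def scan_streaming(x, a, b, c):
    """Emit each output immediately and keep only the current state."""
    h = [0.0] * len(a)
    outputs = []
    for x_t in x:
        # h_t = a * h_{t-1} + b * x_t, followed by the readout y_t = c . h_t
        h = [a_i * h_i + b_i * x_t for a_i, b_i, h_i in zip(a, b, h)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return outputs, 1  # a single live state vector


x = [1.0, 0.5, -0.25, 2.0]
a, b, c = [0.5, 0.9], [1.0, 0.1], [1.0, -1.0]
full, stored_full = scan_materialized(x, a, b, c)
stream, stored_stream = scan_streaming(x, a, b, c)
```

Both functions produce the same outputs; only the number of state vectors held in memory differs.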

2602.09259 2026-05-20 cs.RO cs.HC

Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

Yizhou Li, Shuyuan Yang, Jiaji Su, Zonghe Chua

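AI summary: This study examines the design of learning-based surgical gaze perception models in multi-task simulation. Using a paired active-passive gaze dataset, it evaluates how the source of gaze supervision affects what attention models learn and points to a practical path toward scalable, crowd-sourced gaze supervision.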

Comments 8 pages, conference pre-print

Abstract

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.

2602.09023 2026-05-20 cs.RO

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

Qinwen Xu, Jiaming Liu, Rui Zhou, Shaojun Shi, Nuowei Han, Zhuoyang Liu, Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia, Shanghang Zhang

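AI summary: This paper proposes TwinRL, a framework that trains collaboratively across a digital twin and the real world to improve the exploration efficiency and convergence speed of Vision-Language-Action models in real-world settings, achieving high success rates and fast convergence.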

Abstract

Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and limited real-world interaction. While online reinforcement learning (RL) has shown promise, its application to real-world VLA manipulation is hindered by low exploration efficiency and restricted exploration coverage. Through systematic real-world experiments, we observe that the effective exploration space of online RL is largely constrained by the trajectory distribution induced during supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative post-training framework that expands and guides RL exploration for VLA models through three stages: SFT warm-up, twin RL warm-up, and real-world RL. TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes. During the SFT stage, we introduce an exploration space expansion strategy that expands the support of the trajectory distribution beyond real demonstrations, reshaping the exploration space for more effective RL. Rather than treating the twin as a data augmentation tool, we propose a twin RL warm-up strategy that enables it to act as an exploration guide for real-world RL. Specifically, TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer and stabilize subsequent real-world RL learning. This process also identifies failure-prone yet informative configurations, enabling targeted human-in-the-loop rollouts to further improve on-robot efficiency. Across four tasks, TwinRL achieves near-100% success in both in-distribution and out-of-distribution regions, delivering over 30% faster convergence than prior real-world RL methods with only 20 minutes of on-robot interaction.

2602.07008 2026-05-20 cs.CV cs.LG

Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making

Ruoyu Chen, Shangquan Sun, Xiaoqing Guo, Sanyi Zhang, Kangwei Liu, Shiming Liu, Zhangcheng Wang, Qunli Zhang, Hua Zhang, Xiaochun Cao

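AI summary: This paper proposes an attribution-based human prior alignment method that uses subset-selection-based attribution to constrain the model to rely on human prior regions, improving the reliability of its decisions.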

Abstract

Reliable models should not only predict correctly, but also justify decisions with acceptable evidence. Yet conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy through shortcut correlations rather than the intended evidence. Human priors can help constrain such behavior, but aligning models to these priors remains challenging because learned representations often diverge from human perception. To address this challenge, we propose an attribution-based human prior alignment method. We encode human priors as input regions that the model is expected to rely on (e.g., bounding boxes), and leverage a highly faithful subset-selection-based attribution approach to expose the model's decision evidence during training. When the attribution region deviates substantially from the prior regions, we penalize reliance on off-prior evidence, encouraging the model to shift its attribution toward the intended regions. This is achieved through a training objective that imposes attribution constraints induced by the human prior. We validate our method on both image classification and click decision tasks in MLLM-based GUI agent models. Across conventional classification and autoregressive generation settings, human prior alignment consistently improves task accuracy while also enhancing the model's decision reasonability.
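
Penalizing reliance on off-prior evidence can be pictured as measuring how much attribution falls outside the prior region. The sketch below is an assumption-laden illustration: the grid-shaped attribution map, the mass-fraction penalty, and the `weight` term are hypothetical, and the paper's subset-selection attribution and actual training objective are not reproduced here.

```python
def off_prior_penalty(attribution, box):
    """Fraction of non-negative attribution mass that falls outside `box`.

    `attribution` is a 2-D list of scores and `box` is (row0, col0, row1, col1)
    with exclusive upper bounds. Both are illustrative stand-ins.
    """
    r0, c0, r1, c1 = box
    total = inside = 0.0
    for r, row in enumerate(attribution):
        for c, value in enumerate(row):
            mass = max(value, 0.0)
            total += mass
            if r0 <= r < r1 and c0 <= c < c1:
                inside += mass
    return 0.0 if total == 0 else 1.0 - inside / total


def total_loss(task_loss, attribution, box, weight=0.5):
    """Task loss plus a weighted off-prior penalty (the weight is hypothetical)."""
    return task_loss + weight * off_prior_penalty(attribution, box)


# Three quarters of the attribution mass sits inside the prior box.
attribution = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
penalty = off_prior_penalty(attribution, box=(1, 1, 2, 2))
loss = total_loss(1.0, attribution, box=(1, 1, 2, 2))
```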

2602.06462 2026-05-20 cs.CL cs.LG

Diffusion-State Policy Optimization for Masked Diffusion Language Models

Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

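AI summary: This paper proposes Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer for masked diffusion language models that directly optimizes intermediate filling decisions; experiments show it improves on existing terminal-feedback baselines on math and planning benchmarks.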

Abstract

Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo.

2602.05709 2026-05-20 cs.AI

Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions

Yihao Ouyang, Shiwei Li, Haozhao Wang, Xiandi Luo, Zhuoqi Hu, Yuetong Song, Qiyu Qin, Yichen Li, Ruixuan Li

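AI summary: This paper proposes GenLoRA, which replaces the explicit basis-vector storage of conventional low-rank adapters with basis vectors generated by lightweight radial basis functions, improving parameter efficiency and fine-tuning performance.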

Abstract

Low-rank adaptation (LoRA) approximates the update of a pretrained weight matrix using the product of two low-rank matrices. However, standard LoRA follows an explicit-rank paradigm, where increasing model capacity requires adding more rows or columns (i.e., basis vectors) to the low-rank matrices, leading to substantial parameter growth. In this paper, we find that these basis vectors exhibit significant parameter redundancy and can be compactly represented by lightweight nonlinear functions. Therefore, we propose Generative Low-Rank Adapter (GenLoRA), which replaces explicit basis vector storage with nonlinear basis vector generation. Specifically, GenLoRA maintains a latent vector for each low-rank matrix and employs a set of lightweight radial basis functions (RBFs) to synthesize the basis vectors. Each RBF requires far fewer parameters than an explicit basis vector, enabling higher parameter efficiency in GenLoRA. Extensive experiments across multiple datasets and architectures show that GenLoRA attains higher effective LoRA ranks under smaller parameter budgets, resulting in superior fine-tuning performance. The code is available at https://anonymous.4open.science/r/GenLoRA.
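
The parameter argument in the abstract can be made concrete by comparing the cost of storing basis vectors explicitly with the cost of generating one from a few function parameters. The sketch below is hedged accordingly: `lora_param_count` follows the standard LoRA shapes, while `rbf_vector` uses a Gaussian RBF parameterization invented here for illustration, since the abstract does not spell out GenLoRA's actual generator or its latent vectors.

```python
import math


def lora_param_count(d_in, d_out, rank):
    """Parameters in an explicit LoRA update B @ A with B: d_out x r, A: r x d_in."""
    return rank * (d_in + d_out)


def rbf_vector(length, centers, widths, weights):
    """Generate a vector of `length` entries from a few Gaussian RBF parameters.

    Entry i evaluates sum_j w_j * exp(-(t_i - mu_j)^2 / (2 * s_j^2)) at
    t_i = i / (length - 1). This parameterization is a hypothetical
    illustration, not the one used by GenLoRA.
    """
    vector = []
    for i in range(length):
        t = i / (length - 1)
        vector.append(sum(
            w * math.exp(-((t - mu) ** 2) / (2.0 * s ** 2))
            for mu, s, w in zip(centers, widths, weights)
        ))
    return vector


explicit = lora_param_count(d_in=4096, d_out=4096, rank=8)
basis = rbf_vector(4096, centers=[0.2, 0.7], widths=[0.1, 0.05], weights=[1.0, -0.5])
generated_params = 3 * 2  # two RBFs, each with a center, a width, and a weight
```

In this toy setup one explicit basis vector of length 4096 costs 4096 parameters, whereas the generated vector is described by six numbers; how much expressiveness survives that compression is exactly what the paper evaluates.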

2602.04998 2026-05-20 cs.LG cs.AI cs.CL

Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning

Yu-Ang Lee, Ching-Yun Ko, Pin-Yu Chen, Mi-Yen Yeh

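AI summary: Through extensive hyperparameter searches, this paper re-evaluates nine representative LoRA variants alongside vanilla LoRA on mathematical reasoning, commonsense reasoning, code generation, and instruction following, and finds that different LoRA methods favor different learning rate ranges. Once the learning rate is properly tuned, all methods reach similar peak performance, suggesting that vanilla LoRA remains a competitive baseline and that gains reported under a single training configuration may not reflect a consistent methodological advantage.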

Comments Project page: https://github.com/yuang-lee/lr-matters-lora

Abstract

Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies, architectural modifications, and optimization adjustments, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate nine representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches over learning rate, batch size, rank, and training duration. Across tasks spanning mathematical reasoning, commonsense reasoning, code generation, and instruction following at diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under a single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.
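The link between the largest Hessian eigenvalue and the usable learning rate is the classical stability bound for gradient descent on a quadratic: the iteration converges only when the learning rate is below 2 / lambda_max. The sketch below illustrates that textbook result on a toy 2x2 Hessian. The matrix and the helper names are assumptions for illustration, and the paper's own analysis of LoRA variants is not reproduced here.

```python
def matvec(H, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def largest_eigenvalue(H, iters=200):
    """Estimate the largest eigenvalue of a symmetric matrix by power iteration."""
    v = [1.0] * len(H)
    lam = 0.0
    for _ in range(iters):
        w = matvec(H, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(H, v)))  # Rayleigh quotient
    return lam


def gd_final_norm(H, lr, steps=100):
    """Run gradient descent on f(x) = 0.5 * x^T H x and return the final ||x||."""
    x = [1.0] * len(H)
    for _ in range(steps):
        grad = matvec(H, x)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return sum(xi * xi for xi in x) ** 0.5


H = [[4.0, 1.0], [1.0, 3.0]]            # toy Hessian, chosen for illustration
lam_max = largest_eigenvalue(H)
threshold = 2.0 / lam_max                 # classical stability limit
stable = gd_final_norm(H, 0.9 * threshold)     # just below the limit: shrinks
unstable = gd_final_norm(H, 1.1 * threshold)   # just above the limit: blows up
```

Two methods whose loss surfaces differ in lambda_max therefore have different stable learning rate ranges, which is consistent with the abstract's point that a single shared learning rate can favor one method over another.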

2602.04663 2026-05-20 cs.LG cs.AI

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

Jaemoo Choi, Yuchen Zhu, Wei Guo, Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin, Molei Tao, Yongxin Chen

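AI summary: This paper studies the design space of reinforcement learning for diffusion models by disentangling three factors: the policy-gradient objective, the likelihood estimator, and the rollout sampling scheme. It finds that an evidence lower bound (ELBO) based model likelihood estimator is the dominant factor for effective, efficient, and stable RL optimization, outweighing the choice of policy-gradient loss.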

Comments 23 pages, 11 figures

Abstract

Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT without reward hacking.

2602.04555 2026-05-20 cs.LG

Finding Structure in Continual Learning

Pourya Shamsolmoali, Masoumeh Zareapoor

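AI summary: This paper proposes a continual learning framework based on Douglas-Rachford Splitting that negotiates between stability and plasticity through two decoupled objectives, aiming for more efficient and stable continual learning.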

Comments There is a bug in the algorithm and implementation

Abstract

Learning from a stream of tasks usually pits plasticity against stability: acquiring new knowledge often causes catastrophic forgetting of past information. Most methods address this by summing competing loss terms, creating gradient conflicts that are managed with complex and often inefficient strategies such as external memory replay or parameter regularization. We propose a reformulation of the continual learning objective using Douglas-Rachford Splitting (DRS). This reframes the learning process not as a direct trade-off, but as a negotiation between two decoupled objectives: one promoting plasticity for new tasks and the other enforcing stability of old knowledge. By iteratively finding a consensus through their proximal operators, DRS provides a more principled and stable learning dynamic. Our approach achieves an efficient balance between stability and plasticity without the need for auxiliary modules or complex add-ons, providing a simpler yet more powerful paradigm for continual learning systems.
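The "negotiation through proximal operators" can be illustrated with the textbook Douglas-Rachford iteration on two quadratic objectives. This is a minimal sketch under stated assumptions: the plasticity term pulls the parameters toward a hypothetical new-task optimum, the stability term pulls them toward a hypothetical old-task optimum, and the weights and step size are arbitrary. It shows the generic splitting scheme, not the paper's algorithm.

```python
def prox_quadratic(v, target, weight, step):
    """Proximal operator of f(x) = 0.5 * weight * ||x - target||^2.

    Solves argmin_x f(x) + ||x - v||^2 / (2 * step) in closed form.
    """
    return [(vi + step * weight * ti) / (1.0 + step * weight)
            for vi, ti in zip(v, target)]


def douglas_rachford(new_opt, old_opt, w_new=1.0, w_old=3.0, step=1.0, iters=200):
    """Minimize plasticity(x) + stability(x) using only the two prox operators."""
    z = [0.0] * len(new_opt)
    for _ in range(iters):
        x = prox_quadratic(z, new_opt, w_new, step)           # plasticity step
        reflected = [2.0 * xi - zi for xi, zi in zip(x, z)]
        y = prox_quadratic(reflected, old_opt, w_old, step)   # stability step
        z = [zi + yi - xi for zi, yi, xi in zip(z, y, x)]     # consensus update
    return prox_quadratic(z, new_opt, w_new, step)


# Hypothetical optima for the new and old tasks in a 2-parameter model.
consensus = douglas_rachford(new_opt=[2.0, -1.0], old_opt=[0.0, 1.0])
```

For these quadratics the iteration converges to the weighted average `(w_new * new_opt + w_old * old_opt) / (w_new + w_old)`, so neither objective's gradient is ever summed with the other; each is handled only through its own proximal step.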

2602.04381 2026-05-20 cs.CV cs.AI

Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture

Weihao Gao, Zhuo Deng, Zheng Gong, Lan Ma

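AI summary: This paper proposes the UltraSeg family of lightweight, CPU-native segmentation models that perform real-time colonoscopic polyp segmentation without a GPU. The core design uses grouped multi-rate dilated convolutions and attention-gated cross-layer fusion, and the main contribution is the first strong baseline for accurate real-time polyp segmentation on commodity CPUs.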

Comments 18 pages, 4 figures

Abstract

Real-time polyp segmentation is essential for early colorectal cancer detection, yet clinical deployment remains blocked by GPU dependency. We introduce the UltraSeg family, a set of CPU-native segmentation models operating below 0.3M parameters. UltraSeg-108K (0.108M) establishes the extreme-compression frontier, while UltraSeg-130K (0.130M) integrates cross-layer lightweight fusion for enhanced multi-center generalization. The architecture replaces parameter-heavy components with grouped multi-rate dilated convolutions and attention-gated cross-layer fusion, achieving real-time throughput on a single CPU core (exceeding 50 FPS at 256*256 and 30 FPS at 352*352) without sacrificing clinical-grade accuracy. Evaluated on seven public datasets, UltraSeg-130K attains Dice scores exceeding 0.8 at both resolutions, substantially outperforming all existing sub-0.3M competitors. Notably, it approaches or exceeds UNet-Medium (7.76M parameters) on zero-shot external validations while using only 1.7% of its parameters, establishing the first strong baseline for CPU-native real-time polyp segmentation. When scaled to 4.38M parameters, UltraSeg achieves accuracy competitive with heavyweight state-of-the-art models while maintaining an order-of-magnitude parameter advantage, demonstrating that the proposed design principles yield intrinsic representational gains across the entire efficiency spectrum. By delivering the first clinically deployable, CPU-native real-time solution, this work provides an immediately usable tool for resource-limited settings and a reproducible blueprint for real-time medical AI beyond endoscopy. Source code is publicly available.
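Two general facts explain why grouped dilated convolutions suit a sub-0.3M budget: grouping divides a convolution's weight count by the number of groups, and dilation enlarges the receptive field without adding weights. The sketch below works through that arithmetic. The channel widths, group count, and dilation rates are hypothetical and are not UltraSeg's actual configuration; only the 0.130M and 7.76M parameter totals come from the abstract.

```python
def conv2d_params(c_in, c_out, kernel, groups=1, bias=True):
    """Parameter count of a 2D convolution; dilation does not change it."""
    assert c_in % groups == 0 and c_out % groups == 0
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


dense = conv2d_params(64, 64, 3)               # ordinary 3x3 convolution
grouped = conv2d_params(64, 64, 3, groups=4)   # same layer split into 4 groups
fields = [effective_kernel(3, d) for d in (1, 2, 4)]  # one rate per branch

# Parameter ratio quoted in the abstract: UltraSeg-130K versus UNet-Medium.
ratio = 0.130 / 7.76
```

With these illustrative sizes, the grouped layer needs roughly a quarter of the dense layer's weights while the three dilation rates cover extents of 3, 5, and 9 pixels, and the quoted totals do work out to about 1.7%.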

2602.03454 2026-05-20 cs.CV

Contextualized Visual Personalization in Vision-Language Models

Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, Sungroh Yoon

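AI summary: This paper proposes a contextualized visual personalization approach that improves personalized image captioning in vision-language models through reinforcement-learning-based post-training and caption-augmented generation. Diagnostic evaluations verify whether models truly use visual context, and CoViP shows holistic gains across downstream personalization tasks.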

Comments Accepted at ICML 2026

Abstract

Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user's accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.

2602.03139 2026-05-20 cs.CV

Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

Tianhe Wu, Ruibin Li, Lei Zhang, Kede Ma

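AI summary: This paper proposes diversity-preserved distribution matching distillation (DP-DMD), a role-separated distillation strategy that preserves sample diversity under few-step sampling while maintaining competitive visual quality, offering a simple and stable alternative to other DMD variants.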

Abstract

Distribution matching distillation (DMD) facilitates few-step image generation by aligning a distilled student with a reference multi-step teacher. In practice, however, optimizing DMD can reduce sample diversity in few-step synthesis, and existing remedies typically rely on perceptual or adversarial regularization, leading to stability and scalability challenges during training. Here, we describe diversity-preserved DMD (DP-DMD), a role-separated distillation method inspired by the complementary roles of early and late denoising steps. Specifically, the first distillation step is trained with a teacher-derived target-prediction objective (e.g., v-prediction) to preserve sample diversity, while the remaining steps are optimized with the standard DMD loss to refine perceptual quality. DP-DMD, with no perceptual or adversarial regularization, no additional modules, and no teacher-generated reference samples, preserves sample diversity while maintaining competitive visual quality under few-step sampling, providing a simple and stable alternative to other DMD variants.

2602.02513 2026-05-20 cs.LG cond-mat.mtrl-sci

Learning ORDER-Aware Multimodal Representations for Composite Materials Design

Xinyao Li, Hangwei Qian, Jingjing Li, Lei Zhu, Ivor Tsang

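AI summary: This study proposes ORDER, an ordinality-based multimodal pretraining framework for composite materials design that integrates heterogeneous data sources to capture fiber distributions, enabling effective property prediction and microstructure generation in a continuous design space.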

Abstract

Artificial intelligence has shown remarkable success in materials discovery and property prediction, particularly for crystalline and polymer systems where material properties and structures are dominated by discrete graph representations. Such a graph-centric paradigm breaks down on composite materials, which possess continuous and nonlinear design spaces. General composite descriptors, e.g., fiber volume and misalignment angle, cannot fully capture the fiber distributions that determine microstructural characteristics, necessitating the integration of heterogeneous data sources through multimodal learning. Existing alignment-oriented frameworks have proven effective on abundant crystal or polymer data under discrete, unique graph-property mapping assumptions, but fail to address the highly continuous composite design space under extreme data scarcity. In this work, we introduce ORDinal-aware imagE-tabulaR alignment (ORDER), a multimodal pretraining framework that establishes ordinality as a core principle for material representations. ORDER ensures that materials with similar target properties occupy nearby regions in the latent space, which effectively preserves the continuous nature of composite properties and enables meaningful interpolation between sparsely observed designs. We evaluate ORDER on a nanofiber-reinforced composite dataset and a carbon fiber T700 dataset. ORDER and its variants outperform both alignment-oriented and customized property-aware contrastive baselines across property prediction, cross-modal retrieval, and microstructure generation tasks. We further introduce physics-based ordinal surrogate signals, avoiding the need for full property annotation during pretraining. Our work demonstrates that learning continuous multimodal features is fundamental for composite materials, and provides a reliable pathway toward data-efficient universal multimodal intelligent systems.

2601.22478 2026-05-20 cs.LG

Transformation-Augmented GRPO for Enhancing Exploration in Reasoning of Large Language Models

Khiem Le, Phuc Nguyen, Youssef Mroueh, Chi-Heng Lin, Shangqian Gao, Ting Hua, Nitesh V. Chawla

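AI summary: This paper proposes Transformation-Augmented GRPO (TA-GRPO), which uses question rephrasing to address gradient vanishing and diversity collapse in reinforcement learning for large language models, improving exploration on reasoning tasks.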

Abstract

Group Relative Policy Optimization (GRPO) has become the dominant method for reinforcement learning with verifiable rewards in large language models, but it suffers from two critical limitations: gradient vanishing and diversity collapse. When training questions are too easy or too hard, all sampled responses receive identical rewards, yielding zero gradients. Meanwhile, the model tends to collapse its responses toward a single reasoning pattern rather than exploring diverse strategies. We propose Transformation-Augmented GRPO (TA-GRPO), a simple but effective method that addresses both issues via question rephrasing. For each training question, we automatically generate multiple problem-equivalent rephrasings that alter wording, format, and information order while preserving the underlying meaning. Because these rephrasings shift the model's perceived difficulty, pooling responses across the original and its rephrasings yields mixed rewards and more diverse reasoning paths. TA-GRPO jointly computes advantages over this expanded response set and aligns all importance ratios to the original question, enabling the model to learn from a richer set of solution attempts. Experiments on four LLMs (Qwen3-1.7B, Qwen3-4B, Llama-3.2-1B, Llama-3.2-3B) show that TA-GRPO consistently improves pass@$k$ on competition-level benchmarks (AMC, OlympiadBench, AIME24, AIME25) and out-of-distribution benchmarks (Minerva, GPQA-Diamond). Notably, it improves the average pass@32 of Qwen3-1.7B and Qwen3-4B by 4.97 and 4.34 points, respectively, and matches the exploration quality of baselines trained on up to 2.5$\times$ more data.
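The gradient-vanishing failure and the effect of pooling can be seen directly in the group-relative advantage, which standardizes each reward against its group. The sketch below uses the common mean-and-standard-deviation normalization. The reward values are made up, and the step that aligns importance ratios to the original question is not modeled, so this illustrates the pooling idea only, not the full TA-GRPO update.

```python
def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# A question that is too hard: every sampled response gets reward 0.
original = [0.0, 0.0, 0.0, 0.0]
plain = group_advantages(original)      # all zeros, so no learning signal

# Hypothetical rewards for responses to two rephrasings of the same question.
rephrased = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
pooled_rewards = original + [r for group in rephrased for r in group]
pooled = group_advantages(pooled_rewards)
```

Once the rewards are pooled, the four responses to the original question receive a negative advantage instead of zero, because the expanded group now contains successes to compare against.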

2601.21484 2026-05-20 cs.LG

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

Xiuyu Li, Jinkai Zhang, Mingyang Yi, Yu Li, Longqiang Wang, Yue Wang, Ju Fan

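AI summary: This paper proposes a training-free inference method that samples directly from the optimal reinforcement learning policy. The transition probability for masked language modeling combines a reference policy model with an energy term, and the key energy term is estimated with online Monte Carlo, improving generation quality.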

Comments Accepted by ICML 2026

Abstract

Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design. The code is available at https://github.com/sheriyuo/ETS.
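Sampling from an optimal policy without training can be illustrated on a toy discrete problem. The sketch assumes the "optimal RL policy" is the standard KL-regularized optimum, proportional to the reference policy times exp(reward / beta), and approximates it by drawing candidates from the reference policy and resampling them by weight. The three-outcome distribution, the rewards, and the function names are hypothetical, and this is plain importance resampling rather than the paper's ETS estimator for masked-language-model transitions.

```python
import math
import random


def optimal_policy(ref_probs, rewards, beta):
    """Exact target: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sample_energy_guided(ref_probs, rewards, beta, n_candidates, rng):
    """Draw candidates from the reference policy, then resample one of them
    with weight exp(r / beta) (self-normalized importance resampling)."""
    candidates = rng.choices(range(len(ref_probs)), weights=ref_probs, k=n_candidates)
    weights = [math.exp(rewards[c] / beta) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
ref = [0.7, 0.2, 0.1]   # toy reference policy over three outputs
rew = [0.0, 1.0, 2.0]   # toy rewards
target = optimal_policy(ref, rew, beta=1.0)
draws = [sample_energy_guided(ref, rew, 1.0, 64, rng) for _ in range(5000)]
empirical = [draws.count(i) / len(draws) for i in range(len(ref))]
```

The resampled distribution approaches the target as the number of candidates grows, which is the usual compute-for-quality trade in test-time scaling: more reference samples per step, no gradient updates.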

2601.20308 2026-05-20 cs.CV cs.GR

Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Shuoyan Wei, Feng Li, Chen Zhou, Runmin Cong, Yao Zhao, Huihui Bai

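AI summary: This paper proposes OSDEnhancer, a framework that achieves robust space-time video super-resolution with one-step diffusion. It targets complex, unknown real-world degradations and uses a linear initialization and a divide-and-conquer strategy to improve spatiotemporal dynamics and texture recovery.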

Comments 12 pages, 9 figures

Abstract

Diffusion models have demonstrated exceptional success in video super-resolution (VSR), exhibiting powerful capabilities for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic high-resolution visual content but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simple degradation assumptions, thus failing in real-world scenarios with complex unknown degradations. To address these challenges, we propose OSDEnhancer, the first framework that achieves robust STVSR in one-step diffusion. OSDEnhancer begins with a linear initialization to establish essential spatiotemporal structures and adapt the model for one-step reconstruction. It then applies a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference for enhanced overall performance. A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation. Experimental results demonstrate that the proposed method attains state-of-the-art performance with superior generalization in real-world scenarios. The code is available at https://github.com/W-Shuoyan/OSDEnhancer.

2601.18993 2026-05-20 cs.CV cs.AI cs.GR

FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction

Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, Yaoyao Liu

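AI summary: This paper proposes FreeOrbit4D, a training-free framework that resolves the geometric ambiguity of large-angle camera redirection by recovering a foreground-complete 4D proxy, producing more faithful and temporally coherent videos.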

Comments 12 pages, 10 figures. Accepted to SIGGRAPH Conference Papers 2026

DOI: 10.1145/3799902.3811122

Abstract

Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective training-free framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/

2601.16823 2026-05-20 cs.CL cs.AI

Disentangling generalization and memorization in large language models using chess

Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsaecker

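AI summary: Using chess as a testbed, this paper studies the distinction between generalization and memorization in large language models. Performance drops sharply when relevant priors are sparse, suggesting limited systematic generalization and a need for mechanisms beyond scale to achieve robust performance.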

Abstract

Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall or genuine reasoning ability. We introduce chess as a controlled testbed aimed at disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in density of relevant priors - ranging from common states solvable by memorization to completely novel ones requiring generalization. Crucially, our approach achieves this distinction without requiring explicit knowledge of the models' training data. Applying this taxonomy, we combine a longitudinal analysis of the GPT lineage with a rigorous evaluation of contemporary models, including Claude Opus and Gemini. Our analysis reveals a steep gradient: performance consistently degrades as the density of relevant priors decreases. Notably, for tasks with few relevant priors, base model performance regresses to the random-play baseline. While newer models improve, progress slows significantly for tasks with sparse priors. Furthermore, while reasoning-augmented inference improves performance, its relative marginal benefit per token decreases in the absence of relevant priors. These results suggest limitations in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust performance when deprived of relevant priors.

2601.14822 2026-05-20 cs.CV cs.AI

Multimodal system for skin cancer detection

Volodymyr Sydorskyi, Igor Krashenyi, Oleksii Yakubenko

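AI summary: This paper proposes a multimodal skin cancer detection system that combines conventional photo images with tabular metadata such as patient demographics and lesion characteristics. A multimodal neural network and a two-step model improve detection accuracy, a three-stage pipeline further refines predictions, and the system shows significant performance gains on an imbalanced dataset.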

Comments Accepted to System research and information technologies

DOI: 10.20535/SRIT.2308-8893.2026.1.03
Journal ref: System Research and Information Technologies, no. 1, pp. 33-57, 2026

Abstract

Melanoma detection is vital for early diagnosis and effective treatment. While deep learning models on dermoscopic images have shown promise, they require specialized equipment, limiting their use in broader clinical settings. This study introduces a multi-modal melanoma detection system using conventional photo images, making it more accessible and versatile. Our system integrates image data with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. It employs a multi-modal neural network combining image and metadata processing and supports a two-step model for cases with or without metadata. A three-stage pipeline further refines predictions by boosting algorithms and enhancing performance. To address the challenges of a highly imbalanced dataset, specific techniques were implemented to ensure robust training. An ablation study evaluated recent vision architectures, boosting algorithms, and loss functions, achieving a peak Partial ROC AUC of 0.18068 (0.2 maximum) and top-15 retrieval sensitivity of 0.78371. Results demonstrate that integrating photo images with metadata in a structured, multi-stage pipeline yields significant performance improvements. This system advances melanoma detection by providing a scalable, equipment-independent solution suitable for diverse healthcare environments, bridging the gap between specialized and general clinical practices.
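A partial ROC AUC with a ceiling of 0.2 is what results when only the part of the ROC curve with true positive rate at or above 0.8 is counted. The sketch below computes that quantity under this reading. The 0.8 threshold is an assumption inferred from the stated maximum, not something the abstract states; the function name is hypothetical, and the code assumes no tied scores between a positive and a negative example.

```python
def partial_auc_above_tpr(labels, scores, min_tpr=0.8):
    """Area between the ROC curve and the line TPR = min_tpr.

    Walks the examples from highest to lowest score. Each negative advances
    the false positive rate by 1 / n_neg and contributes the part of the
    current true positive rate that lies above min_tpr.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    true_pos = 0
    area = 0.0
    for _, label in pairs:
        if label == 1:
            true_pos += 1
        else:
            area += max(true_pos / n_pos - min_tpr, 0.0) / n_neg
    return area


# A perfect ranking reaches the ceiling of 1 - min_tpr.
perfect = partial_auc_above_tpr([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])

# One positive ranked below a negative loses half of the available area.
mixed = partial_auc_above_tpr([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

A metric of this shape rewards rankings that keep sensitivity high, which suits screening settings where missing a malignant lesion costs more than a false alarm.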

2601.14234 2026-05-20 cs.LG cs.AI cs.RO stat.ML

Q-learning with Adjoint Matching

Qiyang Li, Sergey Levine

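AI summary: This paper proposes QAM, a TD-based reinforcement learning algorithm for a long-standing challenge in continuous-action RL: efficiently optimizing an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization needs the critic's first-order information, but backpropagating through the multi-step denoising process is numerically unstable, and existing workarounds either discard gradient information or rely on approximations that sacrifice policy expressivity or bias the learned policy. QAM uses adjoint matching, a recently proposed generative modeling technique, to turn the critic's action gradient into a step-wise objective that avoids unstable backpropagation while yielding an unbiased, expressive policy at the optimum. Combined with temporal-difference backups for critic learning, QAM consistently outperforms prior approaches on hard, sparse-reward tasks in offline and offline-to-online RL.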

Comments 32 pages, 8 figures, 7 tables

Abstract

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.


2601.12707 2026-05-20 cs.LG stat.ML

Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization

解码竞争性博弈中的奖励：带熵正则化的逆博弈论

Junyi Liao, Zihan Zhu, Ethan Fang, Zhuoran Yang, Vahid Tarokh

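AI summary: This paper studies the recovery of unknown reward functions in competitive games through inverse game theory with entropy regularization. It proposes a unified framework that learns reward functions in both static and dynamic settings, and supports it with theoretical guarantees and numerical experiments.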

Comments Extended journal version of ICML 2025 paper. Submitted to Operations Research


AI Chinese abstract

估计驱动智能体行为的未知奖励函数，是逆强化学习和博弈论的核心问题。为此，我们针对带熵正则化的二人零和矩阵博弈和马尔可夫博弈，建立了一个统一的奖励函数恢复框架，目标是根据观测到的玩家策略和动作重建潜在的奖励函数。这项任务的难点在于逆问题固有的歧义性、可行奖励的非唯一性以及观测数据覆盖有限。为应对这些挑战，我们在线性假设下利用量化响应均衡（QRE）建立了奖励函数的可识别性。在此理论基础上，我们提出了一种从观测动作中学习奖励函数的新算法。该算法适用于静态和动态两种设置，并可结合不同的方法，例如最大似然估计（MLE）。我们为算法的可靠性和样本效率提供了有力的理论保证。此外，我们进行了大量数值研究，以验证所提框架的实际有效性，并为竞争环境中的决策提供新的见解。

English abstract

Estimating the unknown reward functions driving agents' behaviors is of central interest in inverse reinforcement learning and game theory. To tackle this problem, we develop a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization, where we aim to reconstruct the underlying reward functions given observed players' strategies and actions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish the reward function's identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building upon this theoretical foundation, we propose a novel algorithm to learn reward functions from observed actions. Our algorithm works in both static and dynamic settings and is adaptable to incorporate different methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm. Further, we conduct extensive numerical studies to demonstrate the practical effectiveness of the proposed framework, offering new insights into decision-making in competitive environments.
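As a rough illustration of the forward problem behind this abstract, the sketch below computes a quantal response equilibrium for a small entropy-regularized two-player zero-sum matrix game. It is not the paper's algorithm. The payoff matrix `A`, the temperature `tau`, the sign convention (row player maximizes, column player minimizes), and the damped fixed-point iteration are all assumptions chosen for this example, and the iteration is not guaranteed to converge for arbitrary games or temperatures.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of floats.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def row_payoffs(A, nu):
    # Expected payoff of each row action against column strategy nu.
    return [sum(A[i][j] * nu[j] for j in range(len(nu))) for i in range(len(A))]

def col_payoffs(A, mu):
    # Expected payoff conceded by each column action against row strategy mu.
    return [sum(A[i][j] * mu[i] for i in range(len(mu))) for j in range(len(A[0]))]

def solve_qre(A, tau, num_iters=2000, damping=0.5):
    # Damped fixed-point iteration on the QRE conditions (assumed convention):
    #   mu proportional to exp(A nu / tau), nu proportional to exp(-A^T mu / tau).
    mu = [1.0 / len(A)] * len(A)
    nu = [1.0 / len(A[0])] * len(A[0])
    for _ in range(num_iters):
        mu_br = softmax([u / tau for u in row_payoffs(A, nu)])
        nu_br = softmax([-u / tau for u in col_payoffs(A, mu)])
        mu = [(1 - damping) * p + damping * q for p, q in zip(mu, mu_br)]
        nu = [(1 - damping) * p + damping * q for p, q in zip(nu, nu_br)]
    return mu, nu

A = [[2.0, -1.0], [-1.0, 1.0]]  # hypothetical payoff matrix
tau = 2.0                       # hypothetical regularization temperature
mu, nu = solve_qre(A, tau)

# At a QRE, log-probability ratios of the row player reveal payoff differences:
#   tau * (log mu[0] - log mu[1]) = (A nu)[0] - (A nu)[1]
u = row_payoffs(A, nu)
payoff_gap_from_policy = tau * (math.log(mu[0]) - math.log(mu[1]))
```

The final identity follows directly from the softmax form of the equilibrium. It shows why observed strategies constrain payoff differences rather than the payoffs themselves, which is one source of the non-uniqueness the abstract mentions.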
