arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15157 2026-05-21 cs.RO cs.LG

Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention

Zhuohang Li, Liqun Huang, Wei Xu, Zhengming Zhu, Nie Lin, Xiao Ma, Xinjun Sheng, Ruoshi Wen

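AI summary: This paper proposes Hand-in-the-Loop, a method that seamlessly integrates human intervention with autonomous policy execution, reducing abrupt hand-configuration changes and improving the robustness and efficiency of bimanual dexterous manipulation.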

English abstract

Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human correction data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the intervention moment, which causes abrupt robot-hand configuration changes, or "gesture jumps". We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with taking over control using direct teleoperation, HandITL reduces intervention jitter by 99.8% and preserves robust post-intervention manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect correction data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.

2605.15104 2026-05-21 cs.CL

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Md Tahmid Rahman Laskar, Xue-Yong Fu, Seyyed Saeed Sarfjoo, Quinten McNamara, Jonas Robertson, Shashi Bhushan TN

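AI summary: This paper proposes a reproducible and verifiable framework for evaluating voice-based tool-calling LLM agents. It uses text-to-speech conversion and environmental noise simulation to evaluate tool-calling performance without re-annotating tool schemas or gold labels.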

English abstract

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

2605.14417 2026-05-21 cs.RO cs.CV

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

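AI summary: This work proposes DAJI, a framework that learns an anticipatory joint-intent interface between language generation and closed-loop control, addressing the need to anticipate future physical transitions in language-conditioned humanoid control. It achieves strong results on HumanML3D-style generation and on BABEL.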

English abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

2605.14382 2026-05-21 cs.CV cs.GR cs.MM

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Yuheng Wu, Xiangbo Gao, Tianhao Chen, Xinghao Chen, Qing Yin, Zhengzhong Tu, Dongman Lee

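AI summary: This paper proposes Delta Forcing, which constrains unreliable teacher supervision within an adaptive trust region to improve the consistency of autoregressive video generation while preserving responsiveness to new events.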

English abstract

Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.

2605.14364 2026-05-21 cs.LG

MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data

Jiaqi Sun, Boyang Sun, Rasmy M. H., Xiangchen Song, Kun Zhang

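AI summary: This paper proposes MoRe, a modular-representation framework for principled continual learning on sequential data. Its core contribution is decomposing knowledge into an identifiable hierarchy of modules, which enables module reuse, alignment, and expansion and improves plasticity and stability while preserving old modules.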

English abstract

Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal interference to existing representations. Most existing approaches address this challenge by modifying model parameters or architectures in a supervised, task-specific manner. However, the underlying issue is representational: tasks require distinct yet structured representations that can be selectively updated without disrupting representations, while structure should reflect intrinsic organization in the data rather than task boundaries. In sequential data, time-delayed dependencies provide a natural signal for uncovering this organization, revealing how fundamental representations give rise to more specific ones. Inspired by the modular organization of the human brain, we propose MoRe, a framework that identifies modularity in the representation itself rather than allocating it at the architectural level. MoRe decomposes knowledge into a hierarchy of fundamental and specific modules with identifiability guarantees, enabling principled module reuse, alignment, and expansion during adaptation while preserving old modules by construction. Experiments on synthetic benchmarks and real-world LLM activations demonstrate interpretable hierarchical structure and improved plasticity-stability trade-offs, suggesting MoRe as a principled foundation for continual adaptation.

2605.14259 2026-05-21 cs.AI cs.CL

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

Ling Wang, Xin Liu, Songnan Liu, Jianan Wang, Cheng Cheng, Yihan Zhu, Enyu Li, Yu Xiao, Jiangyong Xie, Duogong Yan, Jiangyi Chen

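AI summary: This paper proposes HEAR, an enterprise agentic reasoner built on a stratified hypergraph ontology. Its graph layer and hyperedge layer support structured multi-hop analysis without retraining the LLM. It reaches up to 94.7% accuracy on supply-chain tasks and shows adaptive efficiency.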

English abstract

Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

2605.14201 2026-05-21 cs.RO cs.CV

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Meysam Sadeghigooghari, Hanno Ackermann, Litian Liu, Pranav Desai, Fatih Porikli, Mohammad Ghavamzadeh, Hong Cai

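AI summary: This paper proposes MAPLE, a framework that performs reactive multi-agent rollouts in the latent space of a vision-language-action model to address the brittleness of imitation-learned models in closed-loop settings. Combining supervised fine-tuning with reinforcement learning and diversity rewards, it enables scalable closed-loop training without external simulators and improves the robustness of end-to-end autonomous driving systems.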

Comments 19 pages, 9 figures

English abstract

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

2605.13475 2026-05-21 cs.CV

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

Huan Wang, Jun Shen, Haoran Li, Zhenyu Yang, Jun Yan, Ousman Manjang, Yanlong Zhai, Di Wu, Guansong Pang

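AI summary: This paper proposes FedHPro, a framework that introduces hyper-prototypes optimized by gradient matching to improve inter-class separability and intra-class consistency in federated learning. Experiments show state-of-the-art performance on several benchmark datasets.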

Comments 23 pages, ICML 2026 Camera-ready Version

English abstract

Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at https://github.com/mala-lab/FedHPro.

2605.13302 2026-05-21 cs.LG cs.SY eess.SY

Safe Bayesian Optimization for Uncertain Correlation Matrices in Linear Models of Co-Regionalization

Jannis Lübsen, Annika Eichler

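AI summary: This paper extends safety guarantees for multi-task Bayesian optimization from intrinsic co-regionalization models to linear models of co-regionalization, which model inter-task correlations more flexibly by composing multiple features. It derives uniform error bounds for vector-valued functions sampled from a Gaussian process with a linear-model-of-co-regionalization kernel and shows potential performance gains in a numerical comparison on a safe multi-task Bayesian optimization benchmark.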

Comments Accepted at IFAC WC26

English abstract

This paper extends safety guarantees for multi-task Bayesian optimization with uncertain co-regionalization matrices from intrinsic co-regionalization models to linear models of co-regionalization. The latter allows for more flexible modeling of the inter-task correlations by composing multiple features. We derive uniform error bounds for vector-valued functions sampled from a Gaussian process with a linear model of co-regionalization kernel. Furthermore, we show the potential performance gains of linear models of co-regionalization in a numerical comparison on a safe multi-task Bayesian optimization benchmark.
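
As a rough illustration of the two model classes named in this abstract (not the paper's code): in the usual textbook formulation, a linear model of co-regionalization (LMC) kernel sums several latent kernels, each weighted by its own task co-regionalization matrix, and the intrinsic co-regionalization model (ICM) is the special case with a single term. The RBF latent kernels, the lengthscales, and the rank-1 matrices below are assumptions chosen for the sketch.

```python
import math


def rbf(x, y, lengthscale):
    """Squared-exponential kernel on scalar inputs."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def outer(a):
    """Rank-1 co-regionalization matrix B = a a^T (symmetric, PSD)."""
    return [[ai * aj for aj in a] for ai in a]


def lmc_kernel(x, i, y, j, coreg, lengthscales):
    """Covariance between task i at input x and task j at input y.

    k((x, i), (y, j)) = sum_q B_q[i][j] * k_q(x, y)
    """
    return sum(B[i][j] * rbf(x, y, ls) for B, ls in zip(coreg, lengthscales))


# Hypothetical two-task setup with Q = 2 latent features.
coreg = [outer([1.0, 0.8]), outer([0.2, -0.5])]
lengthscales = [1.0, 0.3]

# ICM is the same formula restricted to a single term (Q = 1).
icm_value = lmc_kernel(0.0, 0, 0.5, 1, coreg[:1], lengthscales[:1])
lmc_value = lmc_kernel(0.0, 0, 0.5, 1, coreg, lengthscales)
```

With more than one term, the cross-task covariance can vary with the input distance at several lengthscales at once, which is the extra flexibility the abstract attributes to composing multiple features.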

2605.13081 2026-05-21 cs.CV

PRA-PoE: Robust Multimodal Alzheimer's Diagnosis with Arbitrary Missing Modalities

Guangqian Yang, Ye Du, Wenlong Hou, Qian Niu, Shujun Wang

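AI summary: This work proposes PRA-PoE, a framework that uses prototype-anchored representation alignment and an uncertainty-aware product-of-experts fusion mechanism to address the representation shift caused by missing modalities in multimodal learning, improving diagnostic robustness and accuracy under varying missingness patterns.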

Comments Early accepted by MICCAI 2026

English abstract

Missing modalities are prevalent in real-world Alzheimer's disease (AD) assessment and pose a significant challenge to multimodal learning, particularly when the distribution of observed modality subsets differs between training and deployment. Such missingness pattern mismatch induces a conditional representation shift across modality subsets. Existing approaches that rely on implicit imputation or modality synthesis often fail to explicitly model modality availability and uncertainty, leading to overconfident dependence on synthesized features, reduced robustness, and miscalibrated uncertainty estimates. To address these limitations, we propose PRA-PoE, an incomplete multimodal learning framework that is equipped with Prototype-anchored Representation Alignment (PRA) and an Uncertainty-aware Product of Experts (UA-PoE) fusion mechanism. First, PRA uses learnable global prototypes and availability-conditioned tokens to encode modality availability, distinguish observed from missing modalities, re-synthesize features for missing modalities, and adaptively refine observed representations to align latent spaces across modality subsets, with the goal of reducing representation shift under varying missingness patterns. Second, UA-PoE models each modality as a Gaussian expert and performs closed-form Product of Experts fusion, where experts with higher uncertainty are automatically down-weighted via lower precision, improving uncertainty reliability. We evaluate PRA-PoE under a clinically realistic protocol by training with naturally missing data and testing on all non-empty modality combinations. PRA-PoE consistently outperforms the state-of-the-art across datasets, achieving a 5.4% relative improvement in average accuracy on ADNI and a 10.9% relative gain in average F1 on OASIS-3 over the strongest baseline across all non-empty modality subsets.
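
The closed-form fusion mentioned in this abstract can be illustrated with the standard product of univariate Gaussian experts, where precisions (inverse variances) add and the fused mean is the precision-weighted average. This is a minimal sketch under that assumption; the abstract does not say whether the paper adds a prior expert or works with multivariate latents, and the function name `poe_fuse` is hypothetical.

```python
def poe_fuse(means, variances):
    """Fuse Gaussian experts N(mean_i, variance_i) into one Gaussian.

    Precision is 1 / variance. Precisions add, and the fused mean is
    the precision-weighted average of the expert means.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return fused_mean, 1.0 / total_precision


# Two experts that disagree; the second one is less certain.
equal_mean, equal_var = poe_fuse([0.0, 2.0], [1.0, 1.0])
skewed_mean, skewed_var = poe_fuse([0.0, 2.0], [1.0, 4.0])
```

When the second expert's variance grows from 1.0 to 4.0, the fused mean moves toward the first expert, which is the automatic down-weighting of uncertain experts that the abstract describes.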

2605.12960 2026-05-21 cs.CL

DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze

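AI summary: This work proposes DiM3, which merges residual updates in the shared language model backbone to integrate multilingual and multimodal capabilities, improving multilingual performance while retaining multimodal ability.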

English abstract

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.

2605.12483 2026-05-21 cs.LG cs.AI

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard

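AI summary: This paper proposes an empirical sparse-to-dense reward principle for language-model post-training: use sparse reward on the teacher model for exploration and discovery, then compress the resulting behavior into the deployment model through dense supervision. On math problems this outperforms direct GRPO.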

English abstract

In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated to the model and reward density where it is most informative. We identify a reward-density principle that governs this allocation: sparse sequence-level reward is most useful on models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller deployment model. The principle yields a simple allocation rule: use scarce labeled data upstream on the strongest available teacher, then transfer the reward-shaped behavior downstream as dense supervision. We evaluate this rule through a four-stage workflow (teacher RL, forward-KL warmup, on-policy distillation, optional post-bridge student RL) on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student (79.3% vs. 75.9% on MATH; 25.2% vs. 19.8% on AIME 2024, avg@16), while transfer from the same teacher before RL underperforms. A component ablation confirms that each stage is load-bearing: replacing the RL-improved teacher with a raw teacher costs 7.8 MATH points, removing the forward-KL warmup costs 1.7, and removing on-policy distillation costs 3.3. The teacher-quality ordering (raw-teacher transfer < direct GRPO < RL-teacher transfer) replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher. The operational lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
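
To make the sparse versus dense distinction in this abstract concrete, the sketch below contrasts the two kinds of training signal. It assumes the commonly used GRPO-style advantage (one scalar reward per sampled sequence, normalized within its group) and a per-token forward KL from teacher to student. The function names and toy numbers are hypothetical, and the paper's exact losses are not given in the abstract.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Sparse signal: one scalar reward per sampled sequence,
    normalized by the mean and standard deviation of its group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def forward_kl(teacher_probs, student_probs):
    """Dense signal: KL(teacher || student) over the vocabulary at a
    single token position. Assumes student probabilities are strictly
    positive, as softmax outputs are."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


# Four sampled answers to one prompt, scored 1 if verified correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# One token position with a toy two-word vocabulary.
kl_at_token = forward_kl([0.5, 0.5], [0.9, 0.1])
```

The first function yields one number per whole sequence, while the second yields a value at every token position, which is why the abstract calls the teacher signal dense.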

2605.12334 2026-05-21 cs.AI

Reinforcing VLAs in Task-Agnostic World Models

Yucen Wang, Rui Yu, Fengming Zhang, Junjie Lu, Xinyao Qin, Tianxiang Zhang, Kaixin Wang, Li Zhao

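AI summary: This paper proposes RAW-Dream, which decouples world-model learning from downstream task dependencies and uses a pre-trained world model together with an off-the-shelf vision-language model for zero-shot inference, improving VLA adaptation without task-specific data.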

English abstract

Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.

2605.12321 2026-05-21 cs.AI cs.CY cs.ET

LIDSA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management

Abderrahmane Lakas, Mohamed Amine Ferrag, Merouane Debbah

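AI summary: This paper proposes LIDSA, a framework that uses a large language model for intent-driven speed advisory to achieve signal-free autonomous intersection management. Comparisons against fixed-cycle control, SCATS, AIM, and GLOSA demonstrate the effectiveness of LLMs for real-time intersection management.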

Comments Renamed LISA to LIDSA to avoid naming ambiguity with existing traffic-control software. No technical changes

English abstract

Large language models (LLMs) show strong potential for Intelligent Transportation Systems (ITS), particularly in tasks requiring situational reasoning and multi-agent coordination. These capabilities make them well suited for cooperative driving, where rule-based approaches struggle in complex and dynamic traffic environments. Intersection management remains especially challenging due to conflicting right-of-way demands, heterogeneous vehicle priorities, and vehicle-specific kinematic constraints that must be resolved in real time. However, existing approaches typically use LLMs as auxiliary components on top of signal-based systems rather than as primary decision-makers. Signal controllers remain vehicle-agnostic, reservation-based methods lack intent awareness, and recent LLM-based systems still depend on signal infrastructure. In addition, LLM inference latency limits their use in sub-second control settings. We propose LIDSA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management. LIDSA uses an LLM to reason over declared vehicle intents, incorporating priority classes, queue pressure, and energy preferences. We evaluate LIDSA against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads. Results show that LIDSA reduces mean control delay by up to 89.1% and maintains Level of Service C while all non-LLM baselines degrade to Level of Service F. Under near-saturated demand, LIDSA reduces mean waiting time by 93% and peak queue length by 60.6% relative to fixed-cycle control. It also lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, compared to 61.2% for the best non-LLM method. These results demonstrate that LLM-based reasoning can enable real-time, signal-free intersection management.

2605.12196 2026-05-21 cs.LG

ECTO: Exogenous-Conditioned Temporal Operator for Ultra-Short-Term Wind Power Forecasting

ECTO:用于超短期风功率预测的外源条件化时间算子

Cao Yuan, Junjun Wang

AI总结 本文提出统一框架ECTO,通过物理基础变量选择(PGVS)和外源条件化工况细化(ECRR)两个模块,对超短期风功率预测中非平稳、条件依赖的风力发电进行建模,在三个气候、容量和外源变量维度各不相同的风场上均取得最低的均方误差。

Comments 42 pages, 10 figures, 9 tables

详情
AI中文摘要

准确的超短期风功率预测对于电网调度和备用管理至关重要,但由于风力发电具有非平稳性和条件依赖性,这一任务仍然具有挑战性。气象外源变量包含大量预测信息,但最有信息量的变量组合会因站点、运行条件和预测时间跨度而异。现有的深度学习方法要么通过统一混合或软门控将外源输入视为通用的辅助通道,要么依赖PCA等固定的预处理步骤,而没有利用气象变量的物理结构。我们提出ECTO(外源条件化时间算子),一个统一的框架,将外源变量建模分解为两个互补的模块。物理基础变量选择(PGVS)使用领域知识指导的物理先验和sparsemax激活,对外源变量进行层次化、组感知的稀疏选择,产生一个紧凑、随条件自适应的外源上下文。外源条件化工况细化(ECRR)通过专家混合范式,将预测路由到学习得到的工况(regime)专家,由其施加增益-偏置校准和特定时间跨度的校正。在三个跨越不同气候、容量(66-200 MW)和外源变量维度(11-13个变量)的风场上的实验表明,ECTO在所有站点上均实现了最低的均方误差,相对于最强基线的相对改进范围为2.2%到5.2%,在较长的预测时间跨度(H=32)时扩大到8.6%。消融分析确认每个与外源变量相关的组件都有正向贡献(PGVS +1.84%,ECRR +2.86%),可解释性分析表明PGVS学习到了具有物理意义的、特定站点的变量选择模式,而ECRR收敛到区分明确且在各站点间一致的校准策略。

英文摘要

Accurate ultra-short-term wind power forecasting is critical for grid dispatch and reserve management, yet remains challenging due to the non-stationary, condition-dependent nature of wind generation. Meteorological exogenous variables carry substantial predictive information, but the most informative variable combination varies across sites, operating conditions, and prediction horizons. Existing deep learning approaches either treat exogenous inputs as generic auxiliary channels through uniform mixing or soft gating, or rely on fixed preprocessing steps such as PCA, without exploiting the physical structure of meteorological variables. We propose ECTO (Exogenous-Conditioned Temporal Operator), a unified framework that decomposes exogenous variable modeling into two complementary modules. Physically-Grounded Variable Selection (PGVS) performs hierarchical, group-aware sparse selection over exogenous variables using a domain-informed physical prior and sparsemax activations, producing a compact, condition-adaptive exogenous context. Exogenous-Conditioned Regime Refinement (ECRR) routes the forecast through learned regime experts that apply gain-bias calibration and horizon-specific corrections via a mixture-of-experts paradigm. Experiments on three wind farms spanning different climates, capacities (66-200 MW), and exogenous dimensions (11-13 variables) demonstrate that ECTO achieves the lowest MSE across all sites, with relative improvements over the strongest baseline ranging from 2.2% to 5.2%, widening to 8.6% at the longer prediction horizon ($H=32$). Ablation analysis confirms that each exogenous-related component contributes positively (PGVS +1.84%, ECRR +2.86%), and interpretability analysis reveals that PGVS learns physically meaningful, site-specific variable selection patterns, while ECRR converges to well-separated calibration strategies consistent across sites.
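
The abstract names sparsemax as the activation behind PGVS's sparse selection. As a rough illustration of why sparsemax yields exact zeros (unlike softmax), here is a minimal pure-Python sketch of the standard sparsemax projection onto the probability simplex. It is not the authors' PGVS module; the function name and the example scores are assumptions made for illustration.

```python
def sparsemax(scores):
    """Project a list of scores onto the probability simplex (sparsemax).

    Unlike softmax, low-scoring entries receive exactly zero weight.
    """
    ordered = sorted(scores, reverse=True)
    running_sum = 0.0
    threshold = 0.0
    for j, value in enumerate(ordered, start=1):
        running_sum += value
        # Entry j stays in the support while 1 + j * z_(j) > sum of the top-j scores.
        if 1 + j * value > running_sum:
            threshold = (running_sum - 1) / j
    return [max(s - threshold, 0.0) for s in scores]


# Hypothetical relevance scores for three exogenous variables.
weights = sparsemax([1.0, 0.8, 0.1])
print(weights)
```

With these assumed scores the third variable is dropped entirely and the remaining weight is split between the first two, which is the kind of compact selection the abstract describes.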

2605.11866 2026-05-21 cs.SD

AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

AuDirector:一种用于沉浸式音频叙事的自反思闭环框架

Yiming Ren, Xuenan Xu, Ziyang Zhang, Wen Wu, Baoxiang Li, Chao Zhang

AI总结 本文提出AuDirector框架,通过自反思闭环多智能体方法解决长期音频叙事中一致性、情感表达和音频保真度的问题,提升语音生成的质量和用户交互性。

详情
AI中文摘要

尽管在文本和视觉生成方面取得了进展,但创建连贯的长格式音频叙事仍然具有挑战性。现有框架往往存在角色设定与语音表现不匹配、自我纠正机制不足和人机交互有限等问题。为了解决这些挑战,我们提出AuDirector,一种自反思闭环多智能体框架。具体而言,它包括一个身份感知预制作机制,将叙事文本转换为角色档案和语句层面的情感指令,以检索合适的语音候选人并指导表达性语音合成,从而促进上下文对齐的语音适应。为了提高质量,协作合成与纠正模块引入闭环自我纠正机制,系统地审核和重新生成缺陷的音频组件。此外,由人类引导的交互细化模块通过解释自然语言反馈来促进用户控制,从而交互式地细化底层脚本。实验表明,AuDirector在结构连贯性、情感表达性和音频保真度方面均优于最先进的基线模型。音频样本可在https://anonymous-itsh.github.io/上找到。

英文摘要

Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self-correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self-reflective closed-loop multi-agent framework. Specifically, it involves an Identity-Aware Pre-production mechanism that transforms narrative texts into character profiles and utterance-level emotional instructions to retrieve suitable voice candidates and guide expressive speech synthesis, thereby promoting context-aligned voice adaptation. To enhance quality, a Collaborative Synthesis and Correction module introduces a closed-loop self-correction mechanism to systematically audit and regenerate defective audio components. Furthermore, a Human-Guided Interactive Refinement module facilitates user control by interpreting natural language feedback to interactively refine the underlying scripts. Experiments demonstrate that AuDirector achieves superior performance compared to state-of-the-art baselines in structural coherence, emotional expressiveness, and acoustic fidelity. Audio samples can be found at https://anonymous-itsh.github.io/.

2605.11302 2026-05-21 cs.LG cs.AI cs.CL

A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode Collapse

时间敏感语言生成理论:稀疏幻觉战胜模式崩溃

Atul Ganju, Travis McVoy, Shaddin Dughmi, Shang-Hua Teng

AI总结 本文研究带有全局偏好序的极限语言生成,并引入时效性要求;证明最终一致的生成器无法实现时效性生成,而在允许幻觉率随时间消失的条件下,可以相对于任何超线性截止函数达到最优密度,且该结果是紧的。

详情
AI中文摘要

我们研究在字符串的全局偏好序下的极限语言生成(language generation in the limit),该设定由Kleinberg和Wei引入。与以往工作一样,我们追求广度,但额外增加了时效性要求:排名更高的字符串应更早生成。一个字符串只有在截止时间前生成才会被计入,其截止时间由一个函数确定,该函数将字符串在目标语言中的排名映射到它必须被生成的时间。这与机器学习中的一个核心考虑相一致,即在其他条件相同的情况下,归纳偏置倾向于“更简单”或“更合理”的输出。我们证明,对于最终一致的生成器(大多数先前相关工作的研究对象),时效性生成在很强的意义上是不可能的。在可能是最温和的一致性自然放松下,即幻觉率随时间消失,我们证明可以绕过这一不可能性结果。特别是,我们可以相对于任何超线性截止函数实现最优密度。我们还通过排除线性截止时间与消失幻觉率下的时效性生成,证明这一结果是紧的。

英文摘要

We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As is done in previous work, we aim for breadth, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors "simpler" or "more plausible" outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators, the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.
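
The crediting rule in the abstract (a string counts only if it appears by the deadline assigned to its rank) can be made concrete with a small finite example. The sketch below is only an illustration under stated assumptions: the paper's density notion is a limiting quantity, whereas this computes a finite fraction; the inclusive comparison `t <= deadline(rank)`, the function name, and the sample data are all hypothetical.

```python
def credited_fraction(generation_times, deadline):
    """Fraction of ranked strings generated by their deadline.

    generation_times[r - 1] is the step at which the rank-r string was
    produced, or None if it was never produced. `deadline` maps a rank to
    the latest step at which that string still earns credit (assumed inclusive).
    """
    credited = sum(
        1
        for rank, t in enumerate(generation_times, start=1)
        if t is not None and t <= deadline(rank)
    )
    return credited / len(generation_times)


# Hypothetical run: ranks 1..4 were produced at steps 1, 5, 4, and never.
times = [1, 5, 4, None]
linear = credited_fraction(times, lambda rank: 2 * rank)
superlinear = credited_fraction(times, lambda rank: rank * rank + 1)
print(linear, superlinear)
```

On this toy data the looser superlinear deadline credits more strings than the linear one, which mirrors (but does not prove) the qualitative gap between the two regimes discussed in the abstract.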

2605.11151 2026-05-21 cs.AI cs.RO

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

RankQ: 通过自监督动作排名实现离线到在线强化学习

Andrew Choi, Wei Xu

AI总结 该研究提出RankQ方法,通过自监督多项排名损失增强时序差分学习,以在大状态-动作空间中更准确地学习评论家(critic),从而在稀疏奖励D4RL基准和基于视觉的机器人学习中实现更高效的离线到在线微调。

详情
AI中文摘要

离线到在线强化学习(RL)通过在在线交互之前利用预先收集的数据集来提高样本效率。然而,一个关键挑战是在数据集覆盖有限的情况下,在大规模状态-动作空间中学习准确的评论家(critic)。为了减轻价值过估计带来的有害更新,先前方法通过降低分布外(OOD)动作相对于数据集动作的权重来引入悲观主义。这种做法虽然有效,但本质上充当了一个行为克隆锚点,当数据集动作并非最优时会阻碍后续的在线策略改进。我们提出RankQ,一种离线到在线的Q学习目标,通过在时序差分学习中加入自监督的多项排名损失来强制结构化的动作排序。通过学习相对动作偏好而不是对未见过的动作施加统一惩罚,RankQ塑造Q函数,使动作梯度指向更高质量的行为。在稀疏奖励的D4RL基准上,RankQ的性能与七种先前方法相当或更优。在基于视觉的机器人学习中,RankQ能够在低数据环境下对预训练的视觉-语言-动作(VLA)模型进行有效的离线到在线微调,模拟成功率平均比排名第二的方法高42.7%。在高数据环境下,RankQ的模拟性能比排名第二的方法高13.7%,并实现了良好的仿真到现实迁移,将现实世界立方体堆叠成功率从VLA初始性能的43.1%提升到88.9%。

英文摘要

Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 88.9% relative to the VLA's initial performance.

2605.10830 2026-05-21 cs.CV cs.LG

Predicting 3D structure by latent posterior sampling

通过潜在后验采样预测3D结构

Azmi Haider, Dan Rosenbaum

AI总结 本文提出了一种结合NeRF表示和扩散模型的概率建模方法,用于从不同类型的观测数据(如单视角、多视角、噪声图像、稀疏像素和稀疏深度数据)中准确预测3D结构。

详情
AI中文摘要

生成模型在2D图像和神经场表示在3D场景中的显著成就提供了一个有吸引力的机会,将两种方法的优势结合起来。在本工作中,我们提出了一种方法,将基于NeRF的3D场景表示与扩散模型的概率建模和推理相结合。我们将3D重建视为一个具有内在不确定性的感知问题,从而可以受益于概率推断方法。核心思想是将3D场景表示为一个随机的潜在变量,我们可以学习其先验分布,并在给定一组观测数据的情况下进行后验推断。我们通过扩散模型的分数推理方法进行后验采样,并结合从重建模型计算出的似然项(包括体渲染)。我们通过两阶段过程训练模型:首先训练重建模型并自动解码潜在表示以处理3D场景的数据集,然后在潜在空间上训练扩散模型的先验。通过使用模型从后验中生成样本,我们证明了各种3D重建任务可以执行,根据所使用的输入观测类型不同。我们展示了从单视角、多视角、噪声图像、稀疏像素和稀疏深度数据的重建。这些观测在提供的场景信息量上有所不同,我们展示了我们的方法能够建模与每个任务相关的不同水平的内在不确定性。我们的实验表明,这种方法产生了一种全面的方法,能够准确地从各种观测类型中预测3D结构。

英文摘要

The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.

2605.10787 2026-05-21 cs.AI cs.SE

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

ComplexMCP: 评估LLM代理在动态、相互依赖和大规模工具沙箱中的表现

Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, Hongyang Chen

AI总结 本文提出ComplexMCP基准,用于评估LLM代理在动态、相互依赖和大规模工具环境中的性能,揭示了现有模型在复杂任务中的不足,指出三个关键瓶颈:工具检索饱和、过度自信和战略投降倾向。

详情
AI中文摘要

当前LLM代理擅长调用孤立的API,但在商业软件自动化的“最后一公里”上表现不佳。在现实场景中,工具并非相互独立,而是原子性的、相互依赖的,且易受环境噪声影响。我们引入ComplexMCP,一个用于在这些严苛条件下评估代理的基准。ComplexMCP基于Model Context Protocol(MCP)构建,提供超过300个经过严格测试的工具,这些工具来自7个有状态沙箱,涵盖从办公套件到金融系统的场景。与现有数据集不同,我们的基准采用种子驱动架构来模拟动态环境状态和不可预测的API故障,确保评估既确定又多样。我们在全上下文和RAG范式下评估了多种LLM,揭示了显著的性能差距:即使顶级模型的成功率也难以超过60%,远低于人类90%的表现。细粒度轨迹分析识别出三个根本瓶颈:(1)随动作空间扩大而出现的工具检索饱和;(2)过度自信,即代理跳过必要的环境验证;(3)战略投降倾向,即倾向于合理化失败而非寻求恢复。这些发现凸显了当前代理在相互依赖工作流中的不足,使ComplexMCP成为下一代鲁棒自主系统的关键测试平台。

英文摘要

Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents under these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, well below the 90% achieved by humans. Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over-confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.

2605.10603 2026-05-21 cs.CV

Segment Anything with Robust Uncertainty-Accuracy Correlation

具有鲁棒不确定性-准确性相关性的分割一切

Hongyou Zhou, Marc Toussaint, Ling Shao, Zihan Ye

AI总结 本文提出了一种名为RUAC的分割方法,通过引入轻量级不确定性头和对抗性训练,提高在外观和形变转移下的像素级不确定性估计,从而提升分割质量和不确定性准确性相关性。

Comments ICML 2026

详情
AI中文摘要

尽管零样本性能强劲,SAM在域偏移下并不可靠,原因在于掩码级置信度混淆(MCC),即单个基于IoU的掩码分数无法反映边界附近的像素级可靠性。受神经网络中偏向纹理的捷径与人类视觉中以形状为中心的处理之间的对比启发,我们将域外变化建模为外观偏移和非刚性形变,二者共同对校准构成压力。我们提出Segment Anything with Robust Uncertainty-Accuracy Correlation(RUAC),以在外观和形变偏移下实现鲁棒的像素级不确定性估计。RUAC添加了一个轻量级的不确定性头,用联合扰动纹理和几何的协同风格-形变攻击对其进行训练,并应用不确定性-准确性对齐,以确保即使在对抗性扰动下,不确定性仍能一致地突出错误像素。在23个零样本领域上,RUAC提高了分割质量,并给出更忠实的不确定性,其不确定性-准确性相关性更强。项目页面:https://hongyouzhou.github.io/ruac/.

英文摘要

Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation. Project page: https://hongyouzhou.github.io/ruac/.

2605.10181 2026-05-21 cs.CV cs.AI

A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection

机器学习与深度学习在分布外检测中的比较研究

Jihyeon Baek, Seunghoon Lee, Gitaek Kwon, Doohyun Park

AI总结 本文比较了传统机器学习和深度学习在分布外检测任务中的性能,发现轻量级机器学习方法在保持同等准确性的同时,具有显著更低的计算成本,适用于视觉复杂度有限的任务。

Comments Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore

详情
AI中文摘要

分布外(OOD)检测对于构建可靠的人工智能系统至关重要,因为对无效输入仍然产生输出的模型无法被信任。尽管深度学习(DL)通常被认为优于传统机器学习(ML),但医学影像数据通常是在标准化协议下获取的,导致在OOD检测任务中图像变化相对受限。这促使我们在该设置下直接比较ML和DL方法。两种方法在包含超过60,000张眼底和非眼底图像、涵盖多种分辨率的开放数据集上进行了评估。两种方法在内部和外部验证集上均实现了1.000的AUROC和0.999至1.000之间的准确率,显示出相当的检测性能。然而,ML方法在保持同等准确率的同时,端到端延迟显著更低,表明其计算效率更高。这些结果表明,对于视觉复杂度有限的OOD检测任务,轻量级ML方法可以以显著更低的计算成本达到DL级别的性能,有利于实际部署。

英文摘要

Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.

2605.10165 2026-05-21 cs.CV cs.AI

Task-Agnostic Noisy Label Detection via Standardized Loss Aggregation

通过标准化损失聚合进行任务无关的噪声标签检测

Inhyuk Park, Doohyun Park

AI总结 本文提出了一种任务无关的噪声标签检测方法SLA,通过聚合标准化的交叉验证损失来量化标签可靠性,实验表明SLA在各种噪声水平下均优于硬计数基线,并在低噪声比情况下收敛更快,有助于高效重新标注和提升数据集可靠性。

Comments Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore

详情
AI中文摘要

由于观察者间差异和模糊病例,大规模医学影像数据集中的噪声标签很常见。我们提出了一种有统计依据且任务无关的框架,即标准化损失聚合(SLA),用于在样本层面检测噪声标签。SLA通过在重复交叉验证运行中聚合标准化的折叠级验证损失来量化标签可靠性。这种公式将离散的硬计数方案推广为一个连续估计器,能够同时捕捉性能偏差的频率和幅度,从而产生可解释且统计上稳定的噪声分数。在公共眼底数据集上的实验表明,SLA在所有噪声水平下均优于硬计数基线,且收敛速度显著更快,在细微损失变化具有信息量的低噪声比情况下尤其如此。SLA分数高的样本提示可能模糊或标注错误的病例,从而指导高效的重新标注,并提高任何分类任务的数据集可靠性。

英文摘要

Noisy labels are common in large-scale medical imaging datasets due to inter-observer variability and ambiguous cases. We propose a statistically grounded and task-agnostic framework, Standardized Loss Aggregation (SLA), for detecting noisy labels at the sample level. SLA quantifies label reliability by aggregating standardized fold-level validation losses across repeated cross-validation runs. This formulation generalizes discrete hard-counting schemes into a continuous estimator that captures both the frequency and magnitude of performance deviations, yielding interpretable and statistically stable noisiness scores. Experiments on a public fundus dataset demonstrate that SLA consistently outperforms the hard-counting baseline across all noise levels and converges substantially faster, especially under low noise ratios where subtle loss variations are informative. Samples with high SLA scores indicate potentially ambiguous or mislabeled cases, guiding efficient re-annotation and improving dataset reliability for any classification task.
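
The aggregation idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that "standardized" means a z-score of each sample's validation loss within its fold and that aggregation is a plain mean across repeated runs. The function name, data layout, and loss values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def sla_scores(runs):
    """Average per-sample z-scored validation loss over repeated CV runs.

    `runs` is a list of cross-validation runs; each run is a list of folds;
    each fold maps a sample id to that sample's validation loss.
    Higher scores suggest a label that is more likely noisy or ambiguous.
    """
    z_by_sample = defaultdict(list)
    for folds in runs:
        for fold in folds:
            losses = list(fold.values())
            mu = mean(losses)
            sigma = pstdev(losses) or 1.0  # guard against a zero-variance fold
            for sample_id, loss in fold.items():
                z_by_sample[sample_id].append((loss - mu) / sigma)
    return {sample_id: mean(zs) for sample_id, zs in z_by_sample.items()}


# Two hypothetical repeated runs of 2-fold CV; sample "d" is consistently hard.
runs = [
    [{"a": 0.10, "b": 0.20, "c": 0.15}, {"d": 2.00, "e": 0.10, "f": 0.20}],
    [{"a": 0.10, "d": 1.80, "e": 0.20}, {"b": 0.20, "c": 0.10, "f": 0.15}],
]
scores = sla_scores(runs)
print(max(scores, key=scores.get))
```

Because the score is continuous, it retains how far a sample's loss sits from its fold's average, which a hard count of threshold crossings would discard.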

2605.09860 2026-05-21 cs.AI

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

何时重新承诺:为长时间视觉-语言推理发现时间抽象

Chen Li, Zhantao Yang, Fangyi Chen, Han Zhang, Anudeepsekhar Bolimera, Marios Savvides

AI总结 本文提出了一种可学习的状态条件化承诺深度方法,用于长时间视觉-语言推理任务,通过动态调整承诺深度,提高了求解率并减少了基本动作数量,优于固定深度基线和现有模型。

详情
AI中文摘要

长时域推理不仅需要决定采取什么行动,还需要决定在下一次观察之前承诺多深。我们将其形式化为“承诺深度”:两次重新规划之间以开环方式执行的原始动作数量。承诺深度在重新规划成本和执行误差累积之间产生权衡,但大多数现有长时域系统将其固定为手工设计的标量。在本文中,我们将其视为策略本身的一个可学习、以状态为条件的变量。我们在一个模型原生的视觉-语言策略中实现了这一点,该策略联合预测执行什么以及持续多久。在Sliding Puzzle和Sokoban任务上,所得到的自适应策略帕累托支配每一个非退化的固定深度基线,求解率最多高出12.5个百分点,同时每回合使用的原始动作减少约25%。尽管使用7B主干,我们的方法在两个任务上均优于GPT-5.5和Claude Sonnet,而每个被测试的开放权重视觉-语言模型的零样本成功率均为0%。我们进一步给出理论分析,表明在标准的承诺深度代理(surrogate)下,只要局部最优深度随状态变化,以状态为条件的承诺就严格优于任何固定深度。

英文摘要

Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as commitment depth: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision-language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision-language model achieves 0% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.
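
To make the notion of commitment depth concrete, the sketch below shows a control loop in which the policy returns both a plan and the number of actions to execute open-loop before the next observation. Everything here is a toy assumption for illustration: the one-dimensional `LineWorld`, the hand-written halving rule in `adaptive_policy` (the paper learns this quantity instead), and all names.

```python
class LineWorld:
    """Toy 1-D environment: start at position 0, reach `goal` by stepping +1."""

    def __init__(self, goal):
        self.pos = 0
        self.goal = goal

    def observe(self):
        return {"pos": self.pos, "goal": self.goal, "done": self.pos == self.goal}

    def step(self, action):
        self.pos += action


def adaptive_policy(obs):
    """Return (plan, depth): commit deeper when far from the goal (hand-written rule)."""
    remaining = obs["goal"] - obs["pos"]
    plan = [1] * remaining
    depth = max(1, remaining // 2)
    return plan, depth


def run_episode(env, policy, max_replans=50):
    """Execute `depth` actions open-loop between observations; count both costs."""
    replans = primitive_actions = 0
    while replans < max_replans:
        obs = env.observe()
        if obs["done"]:
            break
        plan, depth = policy(obs)
        replans += 1
        for action in plan[:depth]:  # no observation happens inside this loop
            env.step(action)
            primitive_actions += 1
    return replans, primitive_actions


print(run_episode(LineWorld(goal=7), adaptive_policy))
```

In this deterministic toy world deeper commitment only saves replans; the trade-off the abstract describes appears once execution is noisy, because every open-loop action then risks compounding error.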

2605.09586 2026-05-21 cs.CV cs.RO

DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos

DeformMaster: 一个从视频中学习可变形物体的交互式物理-神经世界模型

Can Li, Zhoujian Li, Ren Li, Jie Gu, Lei Lei, Jingmin Chen, Lei Sun

AI总结 本研究提出DeformMaster,一种基于视频的交互物理-神经世界模型,能够从真实交互视频中生成变形物体的统一动态-外观框架,通过保留结构化的物理推演并利用神经残差补偿未建模效应,实现高保真4D外观生成,实验表明其在动态预测和外观渲染方面优于现有方法。

Comments Project page: https://can-lee.github.io/deformmaster-web/

详情
AI中文摘要

世界模型用于变形物体应恢复不仅几何和外观,还应包含底层物理动态、交互基础和材料行为。从真实视频中学习此类模型具有挑战性,因为变形的线性、平面和体积物体在高维变形、噪声交互和复杂材料响应下演变。因此,模型必须从视觉观测中推断物理状态,通过新交互推进,并以高视觉保真度渲染结果。我们提出了DeformMaster,一种视频衍生的交互物理-神经世界模型,将真实交互视频转化为统一动态-外观框架中的变形物体在线交互模型。DeformMaster保留了结构化的物理推演,同时利用神经残差补偿未建模效应,将稀疏手部运动作为分布式合规执行器用于手-连续体交互,用空间变化的本构专家表示材料响应,并从预测的物理演变中驱动高保真4D外观。在真实世界变形物体序列上的实验表明,DeformMaster能够推演未来动态并渲染动态外观,优于现有最先进基线,同时支持新动作推演、材料参数变化和动态新视角合成。项目页面:https://can-lee.github.io/deformmaster-web/

英文摘要

World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics-neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as distributed compliant actuator for hand-continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster's ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis. Project page: https://can-lee.github.io/deformmaster-web/

2605.08858 2026-05-21 cs.CV

ProDG: Prototypes for Data-Free Generative Post-Hoc Explainability

ProDG:用于无数据生成式事后可解释性的原型

Piotr Borycki, Magdalena Trędowicz, Jacek Tabor, Łukasz Struski, Przemysław Spurek

AI总结 本文提出ProDG,一种无需数据的后置可解释性框架,通过生成模型直接从冻结模型的权重中合成纯高保真原型,从而摆脱了对任何外部数据的依赖,为隐私敏感领域提供了稳健的视觉可解释性。

详情
AI中文摘要

基于原型的前置可解释性方法通过利用直观的'这看起来像那'推理范式提供高度准确的解释。另一方面,后置模型可以在不依赖底层数据集或需要昂贵神经网络重新训练的情况下解释单个图像的预测。最近的方法成功解决了原型网络的重新训练问题。然而,它们仍然面临一个根本限制:它们需要访问数据子集(例如测试或验证集)来搜索并提取视觉原型。在本文中,我们解决了这一问题,并引入了ProDG:用于无数据后置可解释性的生成原型,一种新的框架,利用生成模型直接从冻结模型的权重中合成纯、高保真的原型,完全消除了对任何外部数据的依赖。通过在无数据XAI领域建立新的前沿,ProDG为隐私敏感领域解锁了稳健的视觉可解释性,其中原始数据受到严格限制或根本无法访问。项目页面:https://github.com/piotr310100/ProDG

英文摘要

Ante-hoc interpretability methods based on prototypes provide highly accurate explanations by utilizing the intuitive "this looks like that" reasoning paradigm. On the other hand, post-hoc models can explain predictions for a single image without relying on an underlying dataset or requiring costly neural network retraining. Recent approaches successfully solve the retraining problem for prototype-based networks. However, they still face a fundamental limitation: they require access to a subset of data (e.g., a test or validation set) to search for and extract the visual prototypes. In this paper, we address this issue and introduce ProDG: Generative Prototypes for Data-Free Post-Hoc Explainability, a novel framework that leverages generative models to synthesize pure, high-fidelity prototypes directly from the frozen model's weights, completely eliminating the dependency on any external data. By establishing this new frontier in Data-Free XAI, ProDG unlocks robust visual interpretability for privacy-sensitive domains, where original data is strictly restricted or fundamentally inaccessible. Project page: https://github.com/piotr310100/ProDG

2605.08123 2026-05-21 cs.LG cs.CL

Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge

分块可微的Sinkhorn注意力:带间隙感知dustbin桥接的尾部细化梯度

Dylan Forde

AI总结 本文研究在TPU上通过基座停止(stopped-base)、固定深度的尾部细化代理实现长上下文平衡熵正则最优传输注意力,证明了R=2情形下精确的单参考分块反向调度,给出O((T+R)LW)的分块计算代价和O(L)的额外HBM占用,并报告了在Pfam数据上的端到端TPU运行结果。

详情
AI中文摘要

我们研究如何通过一种基座停止(stopped-base)、固定深度的尾部细化代理,在TPU硬件上实现长上下文的平衡熵正则最优传输(OT)注意力。在停止的T步Sinkhorn求解之后,我们展开一小段细化尾部,并对该代理进行精确微分。对于所报告的R=2 TPU路径,反向传播包含四个阶梯状的传输计划因子。我们证明了一种精确的单参考分块调度:R=2的分数余切量等于单个参考计划分块乘以一个由向量余切量和对偶变量差构建的显式修正场。在平衡的固定支撑路径上,对于固定的头维度d和带宽W,这给出了O((T+R)LW)的分块计算代价、O(Ld)的输入存储和O(L)的额外HBM占用。我们还将当前的dustbin_block路径形式化为增广支撑上的同一单位目标代理,因此伴随调度可以提升到我们TPU运行中使用的单活跃dustbin路径;这一桥接是代数性的,并不主张一般的KL非平衡模型或任意容量的间隙模型。我们给出了局部代理偏差界、后验偏差证书,以及针对严格为正的活跃块的射影收缩证书。在合成掩码问题上,优化后的内核与同一中心化代理的精确自动微分结果的差异在10^-5至10^-10范围内。在TPU v6e-8上,一次包含四种配置的Pfam筛选完成了端到端运行,一个被选中晋级的平衡R=2运行在三小时预算内保持每秒约8.5个样本,达到第1437步。在留出的Pfam测试分片上,相对于第0步,重建指标从5.57改善到2.05,稀疏CE从5.53改善到5.30,其中CE仅作为诊断指标记录而非直接优化;目标重心对齐指标没有实质性改善,而确定性的对角参考在这些指标上仍然更强。

英文摘要

We study long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate. After a stopped $T$-step Sinkhorn solve, we unroll a short refinement tail and differentiate that surrogate exactly. For the reported $R=2$ TPU path, the backward pass contains four staircase plan factors. We prove an exact one-reference-tile schedule: the $R=2$ score cotangent is a single reference plan tile times an explicit modifier field built from vector cotangents and dual differences. This yields block-wise cost $O((T+R)LW)$, $O(Ld)$ input storage, and $O(L)$ additional HBM usage for fixed head dimension $d$ and band width $W$ on the balanced fixed-support path. We also formalize the current \texttt{dustbin\_block} path as the same unit-target surrogate on an augmented support, so the adjoint schedule lifts to the single-active-dustbin path used in our TPU runs; this bridge is algebraic and does not claim a general KL-unbalanced or arbitrary-capacity gap model. We provide a local surrogate-bias bound, an a posteriori bias certificate, and a projective contraction certificate for strictly positive active blocks. On synthetic masked problems, the optimized kernel matches exact autodiff of the same centered surrogate to within $10^{-5}$--$10^{-10}$. On TPU v6e-8, a four-configuration Pfam screen completes end-to-end, and a promoted balanced $R=2$ run sustains roughly $8.5$ examples per second through a three-hour budget, reaching step $1437$. Held-out Pfam test shards improve reconstruction from $5.57$ to $2.05$ and sparse CE from $5.53$ to $5.30$ relative to step $0$, with CE logged diagnostically rather than optimized directly; target-barycenter alignment metrics do not materially improve, and a deterministic diagonal reference remains stronger on those metrics.
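
For readers unfamiliar with the forward solve that the surrogate builds on, here is a minimal sketch of plain balanced Sinkhorn iteration on a small dense score matrix. It is not the paper's block-wise TPU kernel and includes no tail refinement, banding, dustbin, or gradient logic; the unit row and column targets, the function name, and the example scores are assumptions for illustration.

```python
import math


def sinkhorn_plan(scores, eps=1.0, steps=200):
    """Run `steps` Sinkhorn iterations on a square score matrix.

    Returns a transport plan whose rows and columns each sum to roughly 1
    (unit targets are an assumption of this sketch).
    """
    n = len(scores)
    kernel = [[math.exp(s / eps) for s in row] for row in scores]
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(steps):
        # Alternate scaling so rows, then columns, match the unit targets.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [1.0 / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(n)]


# Hypothetical 3x3 attention scores.
plan = sinkhorn_plan([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
for row in plan:
    print([round(p, 4) for p in row])
```

The abstract's surrogate runs T such steps without differentiating through them, then differentiates only a short refinement tail of R extra steps, which is how it avoids storing the full unrolled solve for the backward pass.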

2605.07926 2026-05-21 cs.AI

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

AgentEscapeBench: 评估LLM代理的域外工具引导推理能力

Zhengkang Guo, Yiyang Li, Lin Qiu, Xiaohua Wang, Jingwen Xv, Dongyu Ru, Xiaoyu Li, Xiaoqing Zheng, Xuezhi Cao, Xunliang Cai

AI总结 本文提出AgentEscapeBench基准测试,用于评估LLM代理在熟悉的工作流和短程交互之外维持工具引导推理的能力,通过密室逃脱风格的任务测试代理在显式长距离依赖约束下推断、执行和修订新工具使用流程的能力,结果显示代理的表现随依赖深度增加而显著下降。

详情
AI中文摘要

随着基于LLM的代理越来越多地依赖外部工具,评估其在熟悉的工作流和短程交互之外维持工具引导推理的能力变得十分重要。我们引入了AgentEscapeBench,一个密室逃脱风格的基准测试,用于测试代理是否能够在显式的长距离依赖约束下推断、执行和修订新的工具使用流程。每个任务在工具和物品上定义了一个有向无环依赖图,要求代理调用真实的外部函数、跟踪逐步揭示的隐藏状态、传播中间结果,并提交一个可确定性验证的最终答案。AgentEscapeBench包含分布在五个难度层级上的270个实例,并支持全自动评估。对十六个LLM代理和人类参与者的实验表明,随着依赖深度的增加,表现急剧下降:人类的成功率从难度5级的98.3%降至难度25级的80.0%,而最佳模型从90.0%降至60.0%。轨迹分析表明,模型失败主要源于长距离状态跟踪、线索遵循和中间结果传播的失效。这些发现表明,当前代理通常能够处理局部工具使用,但在深层上下文依赖方面仍存在困难。我们希望AgentEscapeBench可以作为诊断测试平台,用于衡量当前代理的能力,并为未来的训练工作提供参考,以实现更鲁棒的通用推理、行动和适应能力。

英文摘要

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.

2605.07731 2026-05-21 cs.CL cs.AI

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

将EngGPT2-16B-A3B与可比的意大利及国际开源大语言模型进行基准对比

Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman

AI总结 本文研究了EngGPT2-16B-A3B在多个基准测试中的性能,与同等规模的开源MoE和密集模型进行比较,展示了其在国际和意大利基准测试中的表现。

详情
AI中文摘要

本报告对ENGINEERING Ingegneria Informatica S.p.A.的EngGPT2MoE-16B-A3B大语言模型进行了基准测试,该模型是一个具有3B活跃参数的16B参数混合专家(MoE)模型。我们在多种有代表性的基准测试上考察了其性能,并与规模相当的开源MoE模型和密集模型进行了比较。与流行的意大利模型FastwebMIIA-7B、Minerva-7B、Velvet-14B和LLaMAntino-3-ANITA-8B相比,EngGPT2MoE-16B-A3B在国际基准测试(ARC-Challenge、GSM8K、AIME24、AIME25、MMLU和HumanEval(HE))上表现相当或更好。它在RULER基准测试的最长上下文设置(32k)中取得了最佳性能。在意大利语基准数据集ITALIC上,该模型的表现与其他模型相当或更好,只有Velvet-14B优于它。与规模相当的流行MoE模型相比,新模型在所有考虑的基准测试上的得分都高于DeepSeek-MoE-16B-Chat。它在HE、MMLU、AIME24、AIME25、GSM8K和32k RULER设置上的得分高于Moonlight-16B-A3B,但在BFCL以及部分ARC和ITALIC设置上较低。最后,它在大多数基准测试上的得分低于GPT-OSS-20B,包括HE、MMLU、AIME24、AIME25、GSM8K、ARC、BFCL和RULER 32k。与流行的密集模型相比,EngGPT2MoE-16B-A3B在AIME24和AIME25上的得分高于Llama-3.1-8B-Instruct、Gemma-3-12b-it和Ministral-3-8BInstruct-2512-BF16,但在ITALIC、BFCL和32k上下文的RULER上较低。将所有基准测试指标汇总来看,EngGPT2MoE-16B-A3B的表现优于所评估的意大利模型,但低于一些性能最强的国际模型(特别是GPT-5 nano和Qwen3-8B)。总体而言,我们的结果表明新模型是原生意大利大语言模型向前迈出的一步。

English abstract

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B-parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is evaluated across a wide variety of representative benchmarks and compared against comparably sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well as or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower values on BFCL and some ARC and ITALIC settings. Finally, it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B outperforms the Italian models under evaluation but scores lower than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings show the new model to be a step forward for native Italian Large Language Models.

2605.07021 2026-05-21 cs.AI

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

Christopher Z. Cui, Taylor W. Killian, Prithviraj Ammanabrolu

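AI summary: This study proposes Behavior Cue Reasoning, which introduces behavior cues to make large language model reasoning more controllable and monitorable, pruning up to 50% of wasted reasoning tokens in complex math problem solving and raising the success rate in safe-action recovery from 46% to 96%.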

English abstract

Reasoning in Large Language Models (LLMs) poses a challenge for oversight, as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only the information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cues allow the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability at no cost to performance. More broadly, our work advances scalable oversight by demonstrating how the monitored model itself can be trained to reason in ways that are more tractable to oversight. Code: https://github.com/christopherzc/behavior-cues