arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2605.29584 2026-06-04 cs.CL

GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering

Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang

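Affiliations: University of Science and Technology of China; NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences; ShanghaiTech University; Ant Group

AI summary: Proposes the GAPD framework, which aligns gold action sequences with the on-policy student policy through mid-anchor matching, providing dense token-level guidance for RL-based knowledge base question answering and achieving state-of-the-art results on multiple benchmarks.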

English abstract

Reinforcement learning (RL) is a natural fit for agentic knowledge base question answering (KBQA), where a model must issue executable actions, observe knowledge-base feedback, and eventually return an answer. However, current RL-based KBQA systems mainly optimize sparse rewards from the final answer, leaving intermediate action errors weakly supervised. This is especially limiting for logical-form annotated KBQA benchmarks: gold logical forms can be converted into executable action sequences, but existing pipelines use them mainly for warm-start data construction rather than for on-policy RL updates. We propose GAPD, a training-time Gold-Action Policy Distillation framework that adds dense token-level guidance to outcome-based RL. To align gold actions with on-policy student rollouts, GAPD uses MID-ANCHOR MATCHING: it treats the intermediate entities reached during student exploration and gold execution as state anchors, and matches student states to gold states through these explored entity sets. The current policy conditioned on this aligned gold action serves as a stop-gradient teacher, whose token distribution is distilled back to the ordinary student policy over generated action-token spans. GAPD consistently surpasses the current state of the art on WebQSP, GrailQA, and GraphQ.
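
The abstract does not specify how student states are matched to gold states through their explored entity sets. The following is a minimal sketch, assuming Jaccard overlap as the similarity measure and a hypothetical `min_overlap` threshold. The function names, the threshold, and the entity identifiers are illustrative assumptions, not the authors' implementation.

```python
from typing import FrozenSet, Optional, Sequence


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two entity sets (two empty sets count as identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def match_gold_state(
    student_entities: FrozenSet[str],
    gold_states: Sequence[FrozenSet[str]],
    min_overlap: float = 0.5,  # hypothetical threshold, not from the paper
) -> Optional[int]:
    """Return the index of the gold state whose entity set best overlaps the
    student's explored entities, or None if no gold state is close enough."""
    best_idx: Optional[int] = None
    best_score = 0.0
    for idx, gold_entities in enumerate(gold_states):
        score = jaccard(student_entities, gold_entities)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx if best_score >= min_overlap else None
```

Under these assumptions, a student rollout that has reached the same intermediate entities as some step of the gold execution is paired with that step, and the gold action at that step is what the stop-gradient teacher would be conditioned on.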

2511.05924 2026-06-04 cs.LG

DiScoFormer: Plug-In Density and Score Estimation with Transformers

Vasily Ilin, Peter Sushko, Ranjay Krishna

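Affiliations: Department of Mathematics, University of Washington, Seattle, USA; Math AI Lab, University of Washington, Seattle, USA; Allen Institute for Artificial Intelligence, Seattle, USA; Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA

AI summary: Proposes DiScoFormer, a train-once, infer-anywhere equivariant Transformer that uses self-attention to estimate densities and scores across distributions and sample sizes; the authors prove that it generalizes kernel density estimation and show that it outperforms KDE.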

Comments Accepted in ICML 2026 (oral)

English abstract

Estimating probability density and its score from samples remains a core problem in generative modeling, Bayesian inference, and kinetic theory. Existing methods are bifurcated: classical kernel density estimators (KDE) generalize across distributions but suffer from the curse of dimensionality, while modern neural score models achieve high precision but require retraining for every target distribution. We introduce DiScoFormer (Density and Score Transformer), a "train-once, infer-anywhere" equivariant Transformer that maps i.i.d. samples to both density values and score vectors, generalizing across distributions and sample sizes. Analytically, we prove that self-attention can recover normalized KDE, establishing it as a functional generalization of kernel methods; empirically, individual attention heads learn multi-scale, kernel-like behaviors. The model converges faster and achieves higher precision than KDE for density estimation, and provides a high-fidelity plug-in score oracle for score-debiased KDE, Fisher information computation, and Fokker-Planck-type PDEs.
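
One standard way to see the link between softmax attention and kernel estimators is sketched below. It is an illustration only and not necessarily the construction used in the paper's proof. If the attention scores are negative squared distances scaled by a bandwidth, the softmax weights coincide with normalized Gaussian kernel weights. The function names and the `bandwidth` parameter are assumptions made for this example.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention_weights(queries: np.ndarray, samples: np.ndarray, bandwidth: float) -> np.ndarray:
    """Softmax attention with scores s_ij = -||q_i - x_j||^2 / (2 h^2)."""
    diff = queries[:, None, :] - samples[None, :, :]
    scores = -np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2)
    return softmax(scores)


def normalized_gaussian_kernel_weights(
    queries: np.ndarray, samples: np.ndarray, bandwidth: float
) -> np.ndarray:
    """Gaussian kernel values K(q_i, x_j), normalized to sum to one per query."""
    diff = queries[:, None, :] - samples[None, :, :]
    kernel = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2))
    return kernel / kernel.sum(axis=-1, keepdims=True)
```

The two functions return the same matrix up to floating-point error, because the softmax normalization is exactly the per-query normalization of the kernel values.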

2509.23694 2026-06-04 cs.AI cs.CL cs.CR

SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu

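Affiliations: University of Science and Technology of China

AI summary: Proposes SafeSearch, an automated red-teaming framework that systematically evaluates the safety of LLM-based search agents across five risk categories; it finds an attack success rate of up to 90.5% for GPT-4.1-mini in a search-workflow setting and shows that common defenses offer limited protection.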

Comments Accepted by ICML 2026

English abstract

Search agents connect LLMs to the Internet, enabling them to access broader and more up-to-date information. However, this also introduces a new threat surface: unreliable search results can mislead agents into producing unsafe outputs. Real-world incidents and our two in-the-wild observations show that such failures can occur in practice. To study this threat systematically, we propose SafeSearch, an automated red-teaming framework that is scalable, cost-efficient, and lightweight, enabling sandboxed safety evaluation of search agents. Using this, we generate 300 test cases spanning five risk categories (e.g., misinformation and prompt injection) and evaluate three search agent scaffolds across 17 representative LLMs. Our results reveal substantial vulnerabilities in LLM-based search agents, with the highest ASR reaching 90.5% for GPT-4.1-mini in a search-workflow setting. Moreover, we find that common defenses, such as reminder prompting, offer limited protection. Overall, SafeSearch provides a practical way to measure and improve the safety of LLM-based search agents.

2504.12988 2026-06-04 cs.LG stat.ML

Why Ask One When You Can Ask $k$? Learning-to-Defer to the Top-$k$ Experts

Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

Affiliations: School of Computing, National University of Singapore; Fédération ENAC, ISAE-SUPAERO, ONERA, Université de Toulouse; Agency for Science, Technology and Research, Institute for Infocomm Research

AI summary: Proposes a Top-$k$ Learning-to-Defer framework that allocates each query to the $k$ best experts to enable multi-expert collaboration, and develops a surrogate loss that is independent of $k$, achieving better accuracy-cost trade-offs.

English abstract

Existing Learning-to-Defer (L2D) frameworks are limited to single-expert deferral, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for Top-$k$ Learning-to-Defer, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose Top-$k(x)$ Learning-to-Defer, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.
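
The selection step can be sketched as follows, assuming each entity (the model itself or an expert) already has a scalar score where a higher value means more cost-effective. How the paper's surrogate loss produces such scores is not shown here, and `top_k_entities` is a hypothetical helper written for this example.

```python
import numpy as np


def top_k_entities(scores: np.ndarray, k: int) -> list:
    """Return the indices of the k highest-scoring entities, best first.

    `scores` is assumed to hold one learned cost-effectiveness score per
    entity; ties are broken by the lower index.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of entities")
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()
```

Because the ranking does not depend on $k$, the Top-1 rule is the first element of every Top-$k$ selection, which mirrors the abstract's point that a single learned policy can be deployed across different values of $k$.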

2410.15761 2026-06-04 cs.CL cs.LG stat.ML

Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees

Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

Affiliations: School of Computing, National University of Singapore; Fédération ENAC ISAE-SUPAERO ONERA, Université de Toulouse, France; Institute for Infocomm Research (A*STAR), Singapore; IPAL, IRL 2955, Singapore

AI summary: Proposes a Learning-to-Defer framework that allocates queries to specialized experts, optimizing computational efficiency while ensuring high-confidence predictions; evaluations on SQuADv1, SQuADv2, and TriviaQA show improved answer reliability and reduced computational overhead.

Comments 25 pages, 17 main paper

English abstract

Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.

2605.29280 2026-06-04 cs.LG cs.AI cs.IR

LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation

Shali Jiang, Hua Zheng, Boyang Liu, Laming Chen, Kenny Lov, Chuanqi Xu, Lisang Ding, Qinghai Zhou, Can Cui, Xiaolong Liu, Xiaoyi Liu, Yasmine Badr, Xin Xu, Jiyan Yang, Ellie Dingqiao Wen, Gerard Jonathan Mugisha Akkerhuis, Chenxiao Guan, Rong Jin, Ruichao Qiu, Xian Chen, Shifu Xu, Zhehui Zhou, Ping Chen, Rui Yang, Haicheng Chen, Xiangge Meng, Song Zhou, Dharak Kharod, Shuyu Xu, Qiang Jin, Qiao Yang, Wankun Zhu, Qin Huang, Yuzhen Huang, Darren Liu, Parish Aggarwal, Hui Zhou, Erzhuo Wang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Huayu Li

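Affiliations: Meta

AI summary: To address the diminishing transfer ratio that results when knowledge distillation passes only a scalar, proposes the LoopFM framework, which feeds the foundation model's intermediate embeddings to downstream vertical models as input features for high-bandwidth knowledge transfer, and demonstrates its effectiveness both theoretically and experimentally.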

Comments Shali Jiang, Hua Zheng, Boyang Liu contributed equally to this work

English abstract

Knowledge distillation (KD) transfers a single scalar prediction from a large foundation model (FM) to compact vertical models (VMs), suffering from diminishing transfer ratio -- the fraction of FM improvement captured by the VM -- as a single scalar cannot convey the rich intermediate knowledge that larger FMs learn. To address this bottleneck, we propose LoopFM (Learning frOm HistOrical RePresentations of FM), a framework that opens a high-bandwidth transfer channel by structuring FM intermediate embeddings as input features (e.g., user history sequence) for downstream VMs, without requiring real-time FM inference at serving and architectural coupling between FM and VM. We provide a theoretical framework for LoopFM with a gain decomposition and transfer-ratio analysis. On three public benchmarks, LoopFM demonstrates strong AUC improvements (e.g., 6%+ on TaobaoAd) and complementary knowledge transfer capability with KD. On industrial-scale systems (billions of examples, trillion-parameter FMs), LoopFM approximately doubles the knowledge transfer ratio on top of KD, delivering a +0.5% conversion improvement in the first half after its initial launch, and +1.03% and +1.22% conversion improvement from two individual launches in the subsequent half.

2605.29076 2026-06-04 cs.CL cs.AI cs.LG

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

Tianyang Zhou, Wenbo Chen, Pierre Jinghong Liang, Leman Akoglu

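Affiliations: Carnegie Mellon University; Amazon

AI summary: Proposes the eXTC framework, which combines structured prompt optimization, SOP-grounded reasoning distillation, and reinforcement-learning-based expansion, significantly outperforming existing paradigms in classification performance and explanation quality.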

English abstract

LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating Procedure (SOP, or rulebook) in natural language via a new Structured Prompt Optimization algorithm; (2) SOP-grounded reasoning distillation from a large teacher LLM into a compact LM; and (3) expanding reasoning capabilities beyond the initial SOP via reinforcement learning. This design enables eXTC to provide (i) fast inference via a compact LM, with (ii) inference-time local reasoning traces, alongside a global, modular explanation of its learned domain rules, while (iii) significantly outperforming existing paradigms across diverse benchmarks in both classification performance and explanation quality, with stage-by-stage gains.

2605.28829 2026-06-04 cs.CL cs.AI cs.CY

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

Ritvik Rastogi, Vishal Singh, Tejas Chaudhari, Sandeep Varma

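Affiliations: PhysicsWallah

AI summary: Presents Aryabhata 2, a language model whose reasoning on competitive STEM examinations is improved through reinforcement-learning post-training; it outperforms its base model on benchmarks such as JEE and NEET while using up to 64% fewer output tokens.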

English abstract

Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving. We introduce Aryabhata 2, a reasoning-focused language model for competitive STEM examinations, trained via reinforcement-learning post-training. Using PhysicsWallah's internal question banks, we construct a high-quality training curriculum and post-train GPT-OSS-20B through reinforcement learning with verifiable rewards. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes. We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out-of-distribution reasoning datasets such as AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA. Results show that Aryabhata 2 outperforms its base model GPT-OSS-20B on competitive STEM reasoning while requiring substantially fewer output tokens (up to 64% fewer).

2605.25402 2026-06-04 cs.CV cs.AI

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation

Chunzheng Zhu, Yijun Wang, Jianxin Lin, Feng Wang, Hongwei Wang, Lei Zhao, Shengli Li, Kenli Li

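Affiliations: Hunan University; Shenzhen Maternity and Child Healthcare Hospital

AI summary: Proposes ANAUS, an anatomy-anchored ultrasound self-supervision framework that achieves annotation-free anatomy segmentation through a learnable latent prompt engine and domain adaptation, and designs a dual-policy self-supervised scheme (semantics-aware anatomy-separating alignment and contextual core-region prediction) to strengthen representation learning, surpassing existing methods on six public datasets.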

Comments MICCAI 2026 Accepted Paper; Anatomy-Anchored Ultrasound Self-Supervision

English abstract

The self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet existing methods for ultrasound (US) images operate at the image or frame level, overlooking the anatomical context needed for clinically aligned representation learning. In this work, we propose an anatomy-anchored ultrasound self-supervision framework, ANAUS, that shifts representation learning from generic visual regions to clinically meaningful anatomical structures. Utilizing a learnable latent prompt engine alongside a one-time domain adaptation on existing public image-mask pairs, we empower the LP-SAM module to achieve annotation-free anatomy delineation at scale. Building upon this anatomical grounding, we propose a dual-policy self-supervised learning paradigm consisting of inter-view semantics-aware anatomy-separating alignment and contextual core-region prediction to enhance representation learning. Specifically, the former enforces feature invariance within identical anatomical regions while promoting discriminability across distinct structures; the latter compels the model to reconstruct corrupted regions, thereby capturing fine-grained structural details. Extensive evaluations on six public datasets demonstrate that ANAUS consistently outperforms current state-of-the-art methods while maintaining the computational efficiency essential for clinical deployment. Code is available at https://github.com/zhcz328/ANAUS.

2605.11130 2026-06-04 cs.LG cs.AI

HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series

Jonas Petersen, Gian-Alessandro Lombardi, Riccardo Maggioni, Camilla Mazzoleni, Federico Martelli, Philipp Petersen

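Affiliations: ETH Zurich; Forgis; University of Vienna

AI summary: Proposes the HEPA architecture, which pretrains a causal Transformer encoder via a Joint-Embedding Predictive Architecture (JEPA) and finetunes only the predictor to produce a monotonic survival cumulative distribution function; it exceeds models such as PatchTST on at least 10 of 14 benchmarks with an order of magnitude fewer tuned parameters and, on lifecycle datasets, an order of magnitude less labeled data.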

Comments Spotlight at FMSD, ICML 2026. Code: https://github.com/Forgis-Labs/HEPA

English abstract

Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is scarce because such events are rare and costly to annotate. We introduce HEPA (Horizon-conditioned Event Predictive Architecture), built on two key principles. First, a causal Transformer encoder is pretrained via a Joint-Embedding Predictive Architecture (JEPA): a horizon-conditioned predictor learns to forecast future representations rather than future values, forcing the encoder to capture predictable temporal dynamics from unlabeled data alone. Second, we freeze the encoder and finetune only the predictor toward the target event, producing a monotonic survival cumulative distribution function (CDF) over horizons. With fixed architecture and optimiser hyperparameters across all benchmarks, HEPA handles water contamination, cyberattack detection, volatility regimes, and eight further event types across 11 domains, exceeding leading time-series architectures including PatchTST, iTransformer, MAE, and Chronos-2 on at least 10 of 14 benchmarks, with an order of magnitude fewer tuned parameters and, on lifecycle datasets, an order of magnitude less labeled data.
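
The abstract does not say how the survival CDF is made monotonic over horizons. One common construction, shown here as an assumption rather than as HEPA's actual predictor head, is to treat each horizon's output as a discrete hazard and accumulate survival probabilities; the result is non-decreasing by construction. The function name and the per-horizon logits are hypothetical.

```python
import numpy as np


def survival_cdf_from_logits(logits: np.ndarray) -> np.ndarray:
    """Map per-horizon logits to a monotone event CDF.

    Each logit is squashed to a hazard in (0, 1): the assumed probability
    that the event occurs at that horizon given it has not occurred earlier.
    The CDF at horizon h is 1 - prod_{j <= h} (1 - hazard_j).
    """
    hazards = 1.0 / (1.0 + np.exp(-logits))
    survival = np.cumprod(1.0 - hazards, axis=-1)
    return 1.0 - survival
```

Since every factor `1 - hazard_j` lies in (0, 1), the survival curve can only shrink as the horizon grows, so the returned CDF never decreases regardless of the logits.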

2605.09081 2026-06-04 cs.LG cs.AI

FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models

Karim Othman, Jonas Petersen, Matei Ignuta-Ciuncanu, Camilla Mazzoleni, Federico Martelli, Alessandro Lombardi, Riccardo Maggioni, Philipp Petersen

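Affiliations: ETH Zurich

AI summary: Presents FactoryNet, the first universal pretraining corpus for industrial time series, whose unified schema enables zero-shot cross-embodiment transfer and parameter-efficient anomaly detection.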

Comments Accepted at AI4Physics and FMSD, ICML 2026. Code: https://github.com/Forgis-Labs/FactoryNet

English abstract

We introduce the first universal pretraining corpus for industrial time-series data: FactoryNet. It comprises 51M datapoints across 23k end-to-end task executions (13.3k real, 9.8k synthetic) on six embodiments, unified by a shared schema that enables robust zero-shot cross-embodiment transfer and highly parameter-efficient anomaly detection. We introduce a novel schema, Setpoint, Effort, Feedback, Context (S-E-F-C), which underlies the whole pipeline and maps any actuated system into a common representational frame. The corpus spans 27 annotated anomaly types alongside healthy baselines and counterfactual pairs across robotic manipulation and machining domains. Cross-embodiment transfer experiments yield positive results: under bias-aware metrics our model demonstrates fair cross-embodiment transfer capabilities on the evaluated source-target pair, while 24 schema-aligned signals achieve competitive anomaly detection performance compared to high-dimensional baselines. We release FactoryNet as a growing, multi-embodiment dataset to drive progress toward industrial foundation models.

2603.07523 2026-06-04 cs.LG

Breaking the Scale Barrier: One-Shot Knowledge Transfer via Frequency Transform

Chinese-language title (translated): Universal Model Initialization Based on Frequency-Domain Knowledge

Jianlu Shen, Fu Feng, Yucheng Xie, Jiaqi Lv, Xin Geng

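Affiliations: School of Computer Science and Engineering, Southeast University; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, China

AI summary: Proposes the FRONT framework, which uses the Discrete Cosine Transform to extract the low-frequency components of weights as a "learngene", initializes models of arbitrary size without training via truncation or padding, and optionally applies a spectral regularizer to improve transferability; it accelerates convergence by up to 15x in vision tasks and reduces training compute by an average of 40.5% in language tasks.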

English abstract

Transferring knowledge by fine-tuning large-scale pre-trained networks has become a standard paradigm for downstream tasks, yet the knowledge of a pre-trained model is tightly coupled with a monolithic architecture, which restricts flexible reuse across models of varying scales. In response to this challenge, recent approaches typically resort to either parameter selection, which fails to capture the interdependent structure of this knowledge, or parameter prediction using generative models that depend on impractical access to large network collections. In this paper, we identify the low-frequency components of model weights as the concrete carrier of foundational, task-agnostic knowledge, which we call the "learngene", and validate this by demonstrating its efficient inheritance by downstream models and tasks. Based on this insight, we propose FRONT (FRequency dOmain kNowledge Transfer), a novel framework that uses the Discrete Cosine Transform (DCT) to isolate the low-frequency "learngene". This learngene can be seamlessly adapted to initialize models of arbitrary size via simple truncation or padding, a process that is entirely training-free. For enhanced performance, we propose an optional low-cost refinement process that introduces a spectral regularizer to further improve the learngene's transferability. Extensive experiments demonstrate that FRONT achieves state-of-the-art performance, accelerates convergence by up to $15\times$ in vision tasks, and reduces training FLOPs by an average of 40.5% in language tasks. Code is available at https://github.com/LUcy0505/FRONT.
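
The truncation-or-padding idea can be sketched for a single 2-D weight matrix as follows. This is a minimal illustration under stated assumptions, not the FRONT code: it uses an orthonormal DCT-II built with NumPy, keeps the top-left (low-frequency) block of coefficients, zero-pads when the target is larger, and rescales so that constant weights keep their value. The rescaling factor and the function names are choices made for this example; the paper may treat scaling and layer structure differently.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C of size n x n (C @ C.T equals the identity)."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat


def resize_weights_via_dct(weights: np.ndarray, target_shape: tuple) -> np.ndarray:
    """Initialize a target_shape matrix from the low-frequency DCT block of `weights`."""
    m, n = weights.shape
    p, q = target_shape
    coeffs = dct_matrix(m) @ weights @ dct_matrix(n).T  # 2-D DCT of the source
    resized = np.zeros((p, q))
    r, c = min(m, p), min(n, q)
    resized[:r, :c] = coeffs[:r, :c]  # truncate, or pad with zeros
    resized *= np.sqrt((p * q) / (m * n))  # assumed rescaling for the size change
    return dct_matrix(p).T @ resized @ dct_matrix(q)  # inverse 2-D DCT
```

With this rescaling, resizing to the original shape returns the original matrix, and a constant matrix stays constant at any target size, which is a convenient sanity check for the transform conventions.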

2602.05725 2026-06-04 cs.LG math.OC stat.ML

Muon in Associative Memory Learning: Training Dynamics and Scaling Laws

Binghui Li, Kaifei Wang, Han Zhong, Pinyan Lu, Liwei Wang

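Affiliations: University of Science and Technology of China

AI summary: Studies the training dynamics and scaling laws of the Muon optimizer in an associative memory model, proving an exponential speedup over gradient descent in the noiseless case and superior scaling efficiency in the noisy case.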

Comments Published as a conference paper at ICML 2026; 53 pages

English abstract

Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain unclear in theory. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked by low-frequency components. In contrast, the Muon optimizer mitigates this imbalance, leading to faster and more uniform progress. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-law frequency spectrum, we derive Muon's scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be interpreted as an implicit matrix preconditioner arising from adaptive task alignment and block-symmetric gradient structure. In contrast, the preconditioner with coordinate-wise sign operator could match Muon under oracle access to unknown task representations, which is infeasible for SignGD in practice. Experiments on synthetic long-tail classification and LLaMA-style pre-training corroborate the theory.
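
The matrix sign update mentioned in the abstract can be illustrated with an exact SVD. This is a simplified sketch: for a gradient with SVD $G = U \Sigma V^\top$, the matrix sign is taken to be $U V^\top$. Practical Muon implementations typically approximate this with Newton-Schulz iterations and add momentum, both of which are omitted here; `muon_like_step` and its learning-rate argument are hypothetical names for this example.

```python
import numpy as np


def matrix_sign(grad: np.ndarray) -> np.ndarray:
    """Return U @ V^T from the thin SVD of grad, i.e. grad with all
    nonzero singular values replaced by one."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def muon_like_step(weight: np.ndarray, grad: np.ndarray, lr: float) -> np.ndarray:
    """One plain descent step along the matrix sign of the gradient."""
    return weight - lr * matrix_sign(grad)
```

Because every singular direction of the gradient is rescaled to the same magnitude, the step makes equal progress along strongly and weakly represented directions, which is the intuition behind the abstract's claim that Muon mitigates the imbalance seen with gradient descent.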

2605.25200 2026-06-04 cs.CL

GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning

Xiang Cheng, Yulan Hu, Lulu Zheng, Zheng Pan, Xin Li, Yong Liu

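Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; AMAP, Alibaba Group

AI summary: Proposes the GroupTravelBench benchmark, which uses multi-user, multi-turn dialogue tasks to evaluate LLM agents on three capabilities: preference elicitation, conflict coordination, and fair planning.

Comments Work in progress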

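AI-generated Chinese abstract (translated)

Travel planning is a realistic task for evaluating the planning and tool-use abilities of LLM agents. However, existing benchmarks typically assume a single user, sidestepping one of the most challenging aspects of real-world scenarios: an agent's ability to identify and resolve conflicts among multiple users. To fill this gap, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Based on real user profiles, POI data, and ticket-price data, we synthesize 650 tasks and divide them into three difficulty levels. Beyond the standard capabilities required for single-user itinerary planning (such as multi-step reasoning and tool use), our benchmark further evaluates three key capabilities required of travel agents: (i) elicitation, proactively conducting multi-turn dialogue to collect each user's preferences; (ii) coordination, resolving inter-user conflicts through compromise or subgrouping strategies; and (iii) planning, searching for travel plans that maximize overall group utility while preserving fairness and feasibility. To simulate realistic conversational itinerary planning while ensuring reliable tool use and offline evaluation, we build an interactive sandbox environment with cached real tool data. We evaluate a variety of LLMs and find that even frontier models still show significant weaknesses in preference coverage and group fairness. GroupTravelBench provides a practical and reproducible benchmark for advancing research on LLM agents in realistic travel planning.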

English abstract

Travel planning in the real world is overwhelmingly a group activity, yet existing LLM travel-planning benchmarks reduce it to a single user, where the field is approaching saturation. This single-user assumption sidesteps what makes group planning hard for an agent: discovering private preferences across multiple users, surfacing conflicts, and balancing utility against fairness. To bring the task back to its multi-user reality, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Built from real user profiles, POI data, and ticket prices, it comprises 650 tasks across three difficulty levels, each running in a synchronous group-chat sandbox with cached tool data for reproducible offline evaluation. Beyond the multi-step reasoning and tool use that single-user benchmarks already test, GroupTravelBench probes three group-specific capabilities: (i) elicitation of private preferences through multi-turn dialogue; (ii) coordination of inter-user conflicts via compromise or subgrouping; and (iii) planning that balances group utility against fairness. We pair this with a complementary evaluation framework combining rule-based outcome metrics and LLM-judge process metrics. Across a wide range of frontier models, even the strongest agents fall short on all four rule-based outcome metrics, with plan validity below 12%, suggesting that group-level outcome quality is a key open challenge for LLM travel-planning agents.

2605.24782 2026-06-04 cs.LG

The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench

Dingling Yao, Andrea Polesello, Adeel Pervez, Caroline Muller, Francesco Locatello

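Affiliations: ETH Zurich; DeepMind; University of Cambridge; University of Amsterdam; University of Toronto

AI summary: Introduces the concept of scientific alignment, derives a hierarchy of necessary conditions through structural isomorphism, and releases the TC-Bench benchmark dataset, revealing that vision foundation models rely on visual shortcuts rather than scientific reasoning under extreme conditions.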

Comments Accepted at ICML 2026

English abstract

While Vision Foundation Models (VFMs) excel at predictive tasks on satellite imagery, their performance can arise from visual correlations rather than underlying structural invariants, making even perception-based out-of-distribution accuracy a poor proxy for scientific utility. As a result, models may look correct without reasoning correctly, a discrepancy we term the Perception-Physics Paradox. To address this gap, we introduce scientific alignment as an implicit objective for representation learning in scientific domains. We study a principled, testable aspect of scientific alignment through structural isomorphism, which requires latent representations to uniquely identify physical systems up to a linear reparameterization. This perspective induces a hierarchy of necessary conditions and yields a systematic probing protocol for physical and causal interpretability. To operationalize this framework, we release TC-Bench, a global, reproducible benchmark dataset with an automated construction pipeline for tropical cyclone research, and show that current VFMs rely on visual shortcuts that collapse in intense regimes, indicating that scientific alignment does not arise as a natural byproduct of scaling alone.

2605.24602 2026-06-04 cs.CV cs.AI

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

纠正注意力分散引起的视觉模糊以减少幻觉:算法与理论

Quanjiang Li, Zhiming Liu, Wei Luo, Tingjin Luo, Chenping Hou

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 本文揭示多模态大语言模型中的物体幻觉与类人注意力分散现象相关,并提出一种无需额外训练的注意力聚焦方法(AFIP)通过跨头注意力增强和动态历史注意力强化来纠正视觉模糊,从而减少幻觉。

Journal ref ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)经常遭受物体幻觉的困扰,但导致这一失败的视觉感知机制仍知之甚少。在这项工作中,我们揭示幻觉与一种类人注意力分散现象密切相关,其中人类在注意力分散下会经历视觉清晰度下降并产生不准确的描述,而在模型中,同样的机制表现为解码过程中多头注意力的空间不一致性以及对图像令牌注意力的时间衰减。我们进一步提供了理论见解,表明注意力分散会增加模型复杂度并降低分类泛化能力。受这些发现的启发,我们提出了一种用于改进图像感知的注意力聚焦方法(AFIP),该方法通过跨头注意力丰富来纠正注意力分散,并通过动态历史注意力增强来强化视觉基础。在多个基准和模型上的大量实验验证了AFIP的有效性,且无需额外训练。

英文摘要

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.

2605.17273 2026-06-04 cs.LG cs.AI

Position: State-of-the-Art Claims Require State-of-the-Art Evidence

立场:声称最先进需要最先进的证据

YongKyung Oh

发表机构 * YongKyung Oh(永庆欧)

AI总结 本文指出人工智能和机器学习研究中普遍存在的声称最先进(SOTA)与证据不足之间的差距,通过分析十个跨领域基准测试发现,超过一半的顶级模型比较中至少一项常见的优越性假设不成立,并呼吁声明语言应反映证据强度。

详情
AI中文摘要

最先进(SOTA)声称在人工智能(AI)和机器学习(ML)研究中普遍存在。这些声称基于基准评估,其中模型根据跨任务的总分进行排名。公共基准或排行榜是最明显的实例,但相同的结构也出现在文献中的论文表格中。然而,这种微弱的证据往往无法支持这些强有力的声称。我们识别出AI基准测试中普遍存在的声称-证据差距。声称SOTA隐含着超越平均分数优越性的假设,表明模型在大多数任务上显著优于替代方案。然而,平均分数的边际改进仅表明平均排名靠前,而非真正的优越性。通过分析来自公共排行榜的十个跨领域基准测试,我们发现超过一半的顶级模型比较中,至少一项常见的优越性假设不成立。这些属性包括有意义的效应大小、跨任务的一致性,或对数据集移除的鲁棒性。相反,总分提升往往由异常数据集驱动。即使在任务众多的基准测试中,这种脆弱性仍然存在。我们认为,声称语言应反映潜在证据的强度。这不需要额外的实验,只需诚实地报告结果实际显示的内容,从而实现跨模型更精确和可解释的比较。

英文摘要

State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark evaluations, where models are ranked by aggregate scores across tasks. Public benchmarks or leaderboards are the most visible instance, but the same structure appears in paper tables throughout the literature. However, such minimal evidence often cannot support these strong claims. We identify a widespread claim-evidence gap in AI benchmarking. Claiming SOTA carries implicit assumptions beyond mean score superiority, suggesting that a model meaningfully outperforms alternatives across most tasks. However, a marginal improvement in the mean score merely indicates a top average rank rather than true superiority. Analyzing ten cross-domain benchmarks from public leaderboards, we found that in more than half of top-model comparisons, at least one commonly assumed property of superiority does not hold. These properties include meaningful effect size, consistency across tasks, or robustness to dataset removal. Instead, aggregate gains are frequently driven by outlier datasets. This fragility persists even in benchmarks with many tasks. We argue that claim language should reflect the strength of the underlying evidence. This requires no additional experiments, only honest reporting of what results actually show, enabling more precise and interpretable comparisons across models.
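The three properties named in the abstract (meaningful effect size, consistency across tasks, robustness to dataset removal) can be checked from a per-task score table alone. The following is a minimal sketch, not the paper's procedure: the function name `claim_evidence_report`, the `min_effect` threshold, and the scores are all hypothetical, chosen so that a single outlier task drives the mean gain.

```python
from statistics import mean


def claim_evidence_report(scores_a, scores_b, min_effect=1.0):
    """Compare model A against model B on per-task scores (higher is better).

    min_effect is an assumed, user-chosen threshold for a "meaningful" mean gain.
    """
    tasks = sorted(scores_a)
    diffs = {t: scores_a[t] - scores_b[t] for t in tasks}
    mean_gain = mean(diffs.values())
    # Consistency: share of tasks on which A actually beats B.
    win_rate = sum(d > 0 for d in diffs.values()) / len(tasks)
    # Robustness: does A still lead on average after dropping any single task?
    loo_gains = {t: mean(d for u, d in diffs.items() if u != t) for t in tasks}
    robust = all(g > 0 for g in loo_gains.values())
    return {
        "mean_gain": mean_gain,
        "meaningful_effect": mean_gain >= min_effect,
        "win_rate": win_rate,
        "robust_to_removal": robust,
    }


# Hypothetical scores: A leads on average only because of task t4.
model_a = {"t1": 70.0, "t2": 64.0, "t3": 81.0, "t4": 95.0}
model_b = {"t1": 71.0, "t2": 65.0, "t3": 82.0, "t4": 80.0}
report = claim_evidence_report(model_a, model_b)
print(report)
```

In this toy table the mean gain is positive, yet A wins on only one of four tasks and loses its lead once t4 is removed, which is the kind of fragile aggregate gain the abstract describes.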

2605.22740 2026-06-04 cs.LG

Ternary Decision Trees with Locally-Adaptive Uncertainty Zones

三元决策树与局部自适应不确定性区域

William Smits

发表机构 * Avathon

AI总结 本文提出三元决策树,通过在每个分裂节点引入局部自适应的不确定性区域,改进传统二元决策树的决策准确性,并在多个数据集上验证了其优越性。

Comments V2: Major revision. Added decision-theoretic framework deriving optimal delta* as a node-local cost minimisation problem; four formal theoretical properties (Propositions 1-4); motivating example figure (Figure 5); strengthened related work and limitations analysis. 15 pages, 5 figures, 5 appendix sections. Submitted to Data Mining and Knowledge Discovery (DAMI)

详情
AI中文摘要

决策树通过硬二元阈值划分特征空间,对远离决策边界和直接位于边界上的实例赋予相同的置信度。我们引入三元决策树,每个分裂节点附加一个半宽为delta的不确定性区域,位于最优阈值中心。该区域内实例的预测由两个子树的加权混合生成,并被标记为边界不确定,提示下游应用可能以不同方式处理这些预测。关键的是,delta在每个节点本地计算,基于标准CART分裂寻找过程中已有的统计信息,无需外部噪声指定。我们提出并评估了五种delta估计方法:质量平台(分裂标准曲线的平台宽度)、类别重叠(经验类别分布重叠)、增益比(分裂质量相对于分裂熵)、节点自助法(节点层面重采样下的阈值方差)以及边缘(受SVM启发的最近跨类训练实例距离)。在72个OpenML-CC18数据集上进行5折交叉验证后,所有五种方法结合概率路由显著优于标准CART在决定准确性上(Wilcoxon符号秩检验,p < 0.001)。边缘方法在效率上最佳(每个边界不确定标志率单位获得0.104准确性提升),在42个数据集上获胜,且不需要额外超参数。对三个Breiman合成基准的分析显示,边缘方法在干净数据上自我校准,而节点自助法和质量平台方法最佳跟踪理论不可约误差。在四个医疗和金融数据集上的实验展示了实际价值:在乳腺X线摄影中,节点自助法通过将10.8%的筛查病例标记为边界不确定,实现了+0.71%的决定准确性提升。

英文摘要

Decision trees assign identical confidence to instances near and far from each split threshold. We introduce ternary decision trees, which augment each split node with an uncertainty zone of half-width delta. A decision-theoretic framework characterises the optimal zone width delta* as the solution to a node-local cost-minimisation problem; four formal properties are established: accuracy decomposition, a sufficiency condition for decided accuracy improvement, an exact efficiency characterisation (eta = Dec-Acc minus Acc_u, the accuracy gap between decided and boundary-uncertain predictions), and asymptotic consistency of the margin method. Instances within the zone receive predictions by weighted blending of both child subtrees and are flagged as boundary-uncertain. We propose and evaluate five delta-estimation methods: quality-plateau (plateau width of the split criterion curve), class-overlap (empirical class-distribution overlap), gain-ratio (split quality relative to split entropy), node-bootstrap (threshold variance under node-level resampling), and margin (SVM-inspired distance to the nearest cross-class training example). All methods reuse statistics already computed during standard CART split finding, requiring no external noise specification. Evaluated across 71 of the 72 OpenML-CC18 datasets with 5-fold cross-validation, all five methods with probabilistic routing significantly outperform standard CART on decided accuracy (Wilcoxon signed-rank, p < 0.001). The margin method achieves the best efficiency (0.104 accuracy gain per unit flagging rate), wins on 42 of 72 datasets, and requires zero hyperparameters. Analysis on Breiman synthetic benchmarks confirms margin is self-calibrating on clean data. On mammography, node-bootstrap achieves +0.71% decided accuracy by flagging 10.8% of cases as boundary-uncertain.
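To make the routing rule concrete, here is a minimal sketch of one ternary split node. It assumes `delta` has already been estimated by one of the five methods, and it assumes a linear blending weight inside the zone; the abstract says only "weighted blending", so the weighting scheme and the name `ternary_predict` are illustrative choices rather than the author's implementation.

```python
def ternary_predict(x, threshold, delta, left_proba, right_proba):
    """Return (class probabilities, boundary_uncertain flag) for one split node.

    left_proba / right_proba are the class-probability lists produced by the
    left and right child subtrees for this instance.
    """
    if delta <= 0 or abs(x - threshold) > delta:
        # Outside the uncertainty zone: ordinary hard routing to one child.
        return (left_proba if x <= threshold else right_proba), False
    # Inside the zone: the right child's weight grows linearly from 0 to 1 as
    # x moves from threshold - delta to threshold + delta (assumed scheme).
    w_right = (x - (threshold - delta)) / (2 * delta)
    blended = [(1 - w_right) * pl + w_right * pr
               for pl, pr in zip(left_proba, right_proba)]
    return blended, True


# An instance exactly on the threshold gets an even blend and is flagged.
on_boundary = ternary_predict(5.0, 5.0, 1.0, [0.9, 0.1], [0.2, 0.8])
# An instance far from the threshold is routed as in a standard CART node.
far_away = ternary_predict(7.0, 5.0, 1.0, [0.9, 0.1], [0.2, 0.8])
print(on_boundary, far_away)
```

Decided accuracy, as used in the abstract, would then be computed only over predictions whose flag is `False`.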

2605.22240 2026-06-04 cs.AI

Unlocking Proactivity in Task-Oriented Dialogue

解锁任务导向型对话中的主动性

Azure Zhang, Ning Gao, Yuqin Dai, Ruiyuan Wu, Jinpeng Wang, Rena Wei Gao, Bingdong Tan, Shuzheng Gao, Zongjie Li, Chaozheng Wang

发表机构 * Keeta AI, Meituan(Keeta AI,美团) Independent Researcher(独立研究者) CUHK(香港中文大学) HKUST(香港科技大学)

AI总结 针对任务导向型对话中主动性问题,提出认知用户模拟器和模拟器诱导的非对称视角策略优化,通过建模用户潜在关注实现主动对话。

详情
AI中文摘要

主动任务导向型对话(如外呼销售)需要一个有说服力的代理,能够主动探询用户的关注点,并在有限轮次内引导对话走向接受。然而,后训练的LLM本质上是保守的,而奖励塑造强化学习(如GRPO)效果不佳,因为它仅重新加权被动策略已采样的内容。我们表明,以用户的潜在关注为条件可以解锁任何采样量都无法破坏的主动能力,从而将这些关注确立为关键的训练时信号。为将这一发现付诸实践,我们构建了**认知用户模拟器**,它将每个用户建模为一个分层角色,包括可观察的外部特征和隐藏的内部关注。该模拟器产生忠实且多样化的交互,同时输出每轮状态动态以跟踪说服进展。然后,我们引入**模拟器诱导的非对称视角策略优化**,将建模的关注和模拟状态转换转化为互补的训练目标:(1)*非对称在线自蒸馏*,将关注感知行为从同一策略的特权视角转移到其可部署的、仅对话视角;(2)*状态转换策略优化*...

英文摘要

Proactive task-oriented dialogue (TOD), such as outbound sales, demands a persuasive agent that actively probes the user's concerns and steers the conversation toward acceptance within a bounded number of turns. Yet post-trained LLMs are inherently conservative, and reward-shaping RL (e.g., GRPO) struggles since it only re-weights what an already passive policy samples. We show that conditioning on the user's latent concerns unlocks proactive capability that no amount of sampling can undermine, establishing these concerns as a pivotal training-time signal. To operationalize this finding, we build the Cognitive User Simulator, which models each user as a stratified persona comprising observable external traits and hidden internal concerns. The simulator produces faithful and diverse interactions, while emitting per-turn state dynamics that track persuasion progress. We then introduce Simulator-Induced Asymmetric-View Policy Optimization, which converts the modeled concerns and the simulation state transition into complementary training objectives: (1) Asymmetric On-Policy Self-Distillation that transfers concern-aware behavior from a privileged view of the same policy into its deployable, conversation-only view; and (2) State-Transition Policy Refinement ...

2605.18102 2026-06-04 cs.CV

DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos

DanceHMR: 从单目视频中恢复手部感知的全身人体网格

Wenhao Shen, Ming Zhou, Hengyuan Zhang, Siyuan Bian, Youjiang Xu, Yuan Zhang

发表机构 * ByteDance Intelligent Creation(字节跳动智能创作)

AI总结 提出一种基于残差体手融合的时序一致全身HMR框架,通过身体上下文与手部观测的融合以及特写增强,实现稳定身体运动与精细手部恢复。

Comments Project page: https://shenwenhao01.github.io/dancehmr/

详情
AI中文摘要

单目视频人体网格恢复对于数字人、虚拟角色动画和具身模拟至关重要,需要时间稳定性和表现力丰富的全身运动。现有视频HMR方法能生成连贯的身体运动,但常忽略精细的手部关节;而基于图像的全身体方法逐帧独立恢复SMPL-X网格,常导致手部运动抖动且不准确。我们提出一种针对具有挑战性的野外单目视频的时序一致全身体HMR框架。我们的模型通过残差体手融合统一身体上下文和特定部分的手部观测,在单个时序架构中实现稳定的身体运动和精细的手部恢复。我们进一步引入特写感知增强,以提高上半身构图下的鲁棒性。在全身体和仅身体基准上的实验表明,手部重建得到改善,身体精度具有竞争力。我们的方法在具有挑战性的真实世界视频中也产生了时间稳定且2D一致的SMPL-X运动。

英文摘要

Monocular video human mesh recovery is essential for digital humans, avatar animation, and embodied simulation, where both temporal stability and expressive whole-body motion are required. Existing video HMR methods produce coherent body motion but often overlook detailed hand articulation, while image-based whole-body methods recover SMPL-X meshes independently per frame, often leading to jittery and inaccurate hand motion. We present a temporally coherent whole-body HMR framework for challenging in-the-wild monocular videos. Our model unifies body context and part-specific hand observations through residual body-hand fusion, enabling stable body motion and detailed hand recovery within a single temporal architecture. We further introduce close-up-aware augmentation to improve robustness under upper-body framing. Experiments on whole-body and body-only benchmarks demonstrate improved hand reconstruction and competitive body accuracy. Our method also produces temporally stable and 2D-consistent SMPL-X motion in challenging real-world videos.

2605.21446 2026-06-04 cs.RO cs.AI

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

迷失在雾中:传感器扰动暴露驾驶VLA的推理脆弱性

Abhinaw Priyadershi, Jelena Frtunikj

发表机构 * NVIDIA Corporation, USA(NVIDIA公司,美国) NVIDIA GmbH, Germany(NVIDIA德国公司)

AI总结 通过受控传感器扰动实验,发现因果链解释的一致性可作为轨迹可靠性的高保真指标,并证明启用因果链生成可提升轨迹精度。

详情
AI中文摘要

可解释的自主驾驶规划器不仅依赖于生成解释,还依赖于这些解释在真实传感器退化下的可靠性。本文对自主驾驶中视觉-语言-动作(VLA)模型的鲁棒性进行了受控扰动研究,评估了Alpamayo R1(10B参数)在八种传感器扰动(四种强度的高斯噪声、两种光照极端条件和两种雾浓度;约18,000次推理试验)下的1,996个场景。我们发现推理一致性是轨迹可靠性的高保真指标:当扰动后因果链(CoC)解释发生变化时,轨迹偏差激增5.3倍(21.8米 vs 4.1米),跨攻击类型的相关系数r=0.99,每样本点双列相关系数r_pb=0.53(Cohen's d=1.12)。受控消融实验表明,在匹配的推理设置下,启用CoC生成与轨迹精度提升相关(平均提升11.8%;p<0.0001)。在测试的噪声范围(σ∈{10,30,50,70})内,退化近似线性(R²=0.957),而标准输入预处理防御仅提供边际缓解。综上,这些结果将CoC一致性确立为规划安全的定量代理,并激励基于推理的运行时监控以实现更安全的VLA部署。

英文摘要

Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable under real-world sensor degradation. In this paper we present a controlled perturbation study of Vision-Language-Action (VLA) robustness in autonomous driving, evaluating Alpamayo R1 (10B parameters) across 1,996 scenarios under eight sensor perturbations (Gaussian noise at four intensities, two lighting extremes, and two fog levels; ~18,000 inference trials). We find that reasoning consistency is a high-fidelity indicator of trajectory reliability: when Chain-of-Causation (CoC) explanations change after perturbation, trajectory deviation spikes 5.3× (21.8 m vs 4.1 m), with r = 0.99 across attack types and r_pb = 0.53 per-sample (Cohen's d = 1.12). A controlled ablation provides evidence that enabling CoC generation is associated with improved trajectory accuracy (11.8% on average across conditions; p < 0.0001) under matched inference settings. Over the tested noise range (σ ∈ {10, 30, 50, 70}), degradation is approximately linear (R² = 0.957), while standard input preprocessing defenses provide only marginal relief. Together, these results establish CoC consistency as a quantitative proxy for planning safety and motivate reasoning-based runtime monitoring for safer VLA deployment.

2605.21268 2026-06-04 cs.CV

Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

视觉Transformer与卷积神经网络在土地利用场景分类中的应用

Arun D. Kulkarni

发表机构 * Computer Science Department, University of Texas at Tyler(德克萨斯大学泰勒分校计算机科学系)

AI总结 本文比较了视觉Transformer和CNN在遥感土地利用场景分类中的性能,发现CNN在有限训练样本和局部纹理特征强的场景中表现稳健,而ViT在数据充足时能更好地捕捉全局空间关系,但计算成本更高。

Comments 11 pages

详情
AI中文摘要

来自遥感影像的土地利用场景分类在环境监测、城市规划和可持续资源管理中起着关键作用。近年来,深度学习方法显著推动了该领域的发展,其中卷积神经网络因其强大的局部空间特征捕获能力而占据主导地位。然而,视觉Transformer的出现引入了一种新范式,通过自注意力机制建模长距离依赖关系,可能实现更好的全局上下文理解。本文对视觉Transformer和基于CNN的架构在遥感土地利用场景分类中进行了比较评估。使用基准遥感数据集(包括UC Merced土地利用和EuroSAT土地利用数据集)评估了代表性CNN模型(如AlexNet)和视觉Transformer。研究考察了分类准确率、精确率、召回率、F1分数和计算复杂度,以提供全面的性能比较。实验结果表明,在训练样本有限且局部纹理特征强的数据集上,CNN表现稳健;而在训练数据充足时,视觉Transformer在捕获复杂场景中的全局空间关系方面表现出更优性能。然而,ViT通常需要更多的计算资源和更大的训练数据集才能达到最优性能。本研究的结果为两种架构的优势和局限性提供了见解,并为遥感土地利用场景分类应用中选择合适模型提供了指导。

英文摘要

Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architectures for remote sensing land use scene classification. Representative CNN models, such as AlexNet, are evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the UC Merced Land Use and EuroSAT Land Use datasets. The study examines classification accuracy, precision, recall, F1-score, and computational complexity to provide a comprehensive performance comparison. Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance. The findings of this study provide insights into the strengths and limitations of both architectures and offer guidance for selecting appropriate models for remote sensing land use scene classification applications.

2605.20654 2026-06-04 cs.LG cs.AI

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

REFLECTOR: 内化逐步反思以对抗间接越狱

Jiachen Ma, Jiawen Zhang, Xiangtian Li, Bo Zou, Chaochao Lu, Chao Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出REFLECTOR两阶段框架,通过教师引导生成反思数据并进行监督微调,再结合强化学习内化自主反思能力,在复杂间接攻击下实现超过90%的防御成功率,同时提升通用性能。

Comments ICML 2026

详情
AI中文摘要

尽管大型语言模型(LLMs)展现出卓越的能力,但它们仍然容易受到复杂的多步越狱攻击,这些攻击通过利用内部生成过程来规避传统的表面安全对齐。为了解决这些漏洞,我们提出了REFLECTOR,一个原则性的两阶段框架,将自我反思内化在生成轨迹中。REFLECTOR首先利用教师引导生成高质量反思数据用于监督微调(SFT),建立结构化的反思模式。随后,它使用强化学习(RL)结合结果驱动和奖励有效性监督,以培养稳健、自主的自我反思能力。实验结果表明,REFLECTOR在复杂的间接攻击下实现了超过90%的防御成功率(DSR),同时在不同威胁场景中具有稳健的泛化能力。值得注意的是,该框架增强了任务特定和通用效用,在GSM8K上获得了5.85%的提升,并在知识密集型基准测试中表现更佳。通过内化轨迹级安全性,REFLECTOR克服了表面对齐的基本限制,且没有显著的计算开销,为开发安全且能力强大的LLMs提供了一种高效且可扩展的解决方案。

英文摘要

While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.

2605.19398 2026-06-04 cs.CV cs.AI

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

重新平衡参考帧主导性以改善图像到视频模型中的运动

Wooseok Jeon, Seungho Park, Seunghyun Shin, Sangeyl Lee, Hyeonho Jeong, Hae-Gon Jeon

发表机构 * Yonsei University(延世大学) GIST(光州科学技术院) Adobe Research(Adobe研究院)

AI总结 针对图像到视频模型生成视频过于静态的问题,提出无需训练且模型无关的DyMoS方法,通过重新平衡去噪初期生成帧对参考帧的注意力来增强运动,同时保持视觉质量和保真度。

Comments Preprint. Project page: https://sh0xed98b8.github.io/DyMoS/

详情
AI中文摘要

与文本到视频模型相比,图像到视频模型通常生成的视频过于静态。先前的方法通过削弱或修改图像条件信号来缓解这一问题,但往往需要额外训练或牺牲对参考图像的保真度。在这项工作中,我们识别出参考帧主导性是运动抑制的关键机制。我们观察到,I2V模型中的非参考帧将过多的自注意力分配给参考帧的关键词元,导致参考信息随时间过度传播,从而抑制了帧间动态。基于这一发现,我们提出了DyMoS(动态运动滑块),一种无需训练且模型无关的方法,在初始去噪步骤中重新平衡从生成帧到参考帧的注意力路径。DyMoS保持输入图像和模型权重不变,并引入单个标量参数以连续控制运动强度。在多个最先进的I2V骨干网络上的实验表明,DyMoS在保持视觉质量和参考图像保真度的同时,一致地改善了运动动态。

英文摘要

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.
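The abstract does not spell out how the attention pathway is rebalanced, so the following is only one plausible reading, sketched for a single query token: reference-frame keys receive a negative additive bias on their attention logits, controlled by one scalar. The function name `rebalanced_attention`, the `strength` parameter, and the log-bias form are assumptions for illustration, not the DyMoS implementation.

```python
import math


def rebalanced_attention(logits, is_reference, strength):
    """Softmax over keys after down-weighting reference-frame keys.

    strength lies in [0, 1): 0 leaves attention unchanged, larger values
    multiply the unnormalized weight of each reference key by (1 - strength).
    """
    bias = math.log(1.0 - strength)
    shifted = [z + bias if ref else z for z, ref in zip(logits, is_reference)]
    m = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# One query from a generated frame attending to two reference-frame keys
# and two keys from other frames (hypothetical logits).
logits = [2.0, 2.0, 0.0, 0.0]
is_reference = [True, True, False, False]
baseline = rebalanced_attention(logits, is_reference, strength=0.0)
rebalanced = rebalanced_attention(logits, is_reference, strength=0.5)
print(sum(baseline[:2]), sum(rebalanced[:2]))
```

Under this reading, the bias would be applied only during the initial denoising steps, leaving the input image and model weights untouched as the abstract states.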

2605.18879 2026-06-04 cs.LG cs.AI cs.CL

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

ZeroUnlearn:大语言模型中的少样本知识遗忘

Yujie Lin, Chengyi Yang, Zhishang Xiang, Yiping Song, Jinsong Su

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ZeroUnlearn框架,通过模型编辑将机器遗忘重新定义为精确的知识重映射问题,利用封闭解乘法参数更新实现高效、定向的少样本遗忘。

详情
AI中文摘要

大型语言模型由于在海量网络语料上训练,不可避免地会保留敏感信息(定义为可能引发有害生成的输入),从而引发隐私和安全担忧。现有的机器遗忘方法主要依赖于重训练或激进微调,这些方法要么计算成本高,要么容易降低相关知识并损害整体模型效用。在这项工作中,我们通过模型编辑将机器遗忘重新表述为一个精确的知识重映射问题。我们提出了ZeroUnlearn,一个少样本遗忘框架。它通过将敏感输入映射到中性目标状态并移除其原始表示来覆盖敏感输入。ZeroUnlearn通过封闭解形式的乘法参数更新强制执行表示正交性,从而实现高效且有针对性的遗忘。我们进一步将ZeroUnlearn扩展到基于梯度的变体,用于多样本遗忘。实验表明,我们的方法在保持模型整体效用的同时优于现有基线。我们的代码可在github上获取:https://github.com/XMUDeepLIT/ZeroUnlearn。

英文摘要

Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem via model editing. We propose ZeroUnlearn, a few-shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available on GitHub: https://github.com/XMUDeepLIT/ZeroUnlearn.

2605.20468 2026-06-04 cs.LG stat.ME stat.ML

CASCADE Conformal Prediction: Uncertainty-Adaptive Prediction Intervals for Two-Stage Clinical Decision Support

CASCADE 共形预测:两阶段临床决策支持的不确定性自适应预测区间

Ricardo Diaz-Rincon, Muxuan Liang, Adolfo Ramirez-Zamora, Benjamin Shickel

发表机构 * University of Florida(佛罗里达大学) MD Anderson Cancer Center(MD安德森癌症中心) University of Louisville(路易斯维尔大学)

AI总结 提出 CASCADE 共形预测框架,通过传播分类器认知不确定性动态调整回归预测区间,在帕金森病用药管理中实现高效且鲁棒的区间估计。

Comments Accepted to ICML 2026 AgenticUQ Workshop. 14 Pages, 3 Figures

详情
AI中文摘要

由于疾病进展的异质性、患者反应的差异性以及药物副作用,帕金森病(PD)的有效用药管理具有挑战性。虽然AI模型可以预测左旋多巴等效日剂量(LEDD)作为用药需求的度量,但标准的不确定性量化通常无法传达这些预测的可靠性,将高置信度和低置信度的临床决策等同对待。我们引入了CASCADE(通过共形和分布估计的校准自适应缩放),一种新颖的共形预测框架,它将来自筛选分类器的认知不确定性传播以自适应下游预测。与依赖辅助残差回归的标准共形方法不同,我们利用来自主要分类任务(识别是否需要改变用药)的认知不确定性,动态缩放次要回归任务(预测改变多少)的预测区间。通过将Venn-Abers多概率不确定性直接映射到非一致性分数,我们的框架实现了连续的风险自适应。我们证明,这种“级联效应”为高置信度患者产生高效的区间(比标准共形基线窄38.9%),同时自动扩展区间以确保对不确定病例的鲁棒覆盖,弥合了PD中离散临床决策与连续剂量预测之间的差距。

英文摘要

Effective medication management in Parkinson's Disease (PD) is challenging due to heterogeneous disease progression, variable patient response, and medication side effects. While AI models can forecast levodopa equivalent daily dose (LEDD) as a measure of medication needs, standard uncertainty quantification often fails to communicate the reliability of these predictions, treating high- and low-confidence clinical decisions identically. We introduce CASCADE (Calibrated Adaptive Scaling via Conformal And Distributional Estimation), a novel conformal prediction framework that propagates epistemic uncertainty from a screening classifier to adapt downstream predictions. Unlike standard conformal methods that rely on auxiliary residual regression, we leverage epistemic uncertainty from a primary classification task (identifying whether a medication change is needed) to dynamically scale the prediction intervals of a secondary regression task (predicting how much change). By mapping Venn-Abers multi-probabilistic uncertainty directly to non-conformity scores, our framework achieves continuous risk adaptation. We demonstrate that this "cascade effect" produces highly efficient intervals for confident patients (38.9% narrower than standard conformal baselines) while automatically expanding intervals to ensure robust coverage for uncertain cases, bridging the gap between discrete clinical decision-making and continuous dose forecasting in PD.
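The idea of scaling regression intervals by classifier uncertainty can be illustrated with a normalized split-conformal sketch. This is not the CASCADE method itself: the uncertainty value `u` is assumed to be something like the width of a Venn-Abers probability interval in [0, 1], and the scaling function `1 + lam * u`, the parameter `lam`, and all numbers are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of the calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def scale(u, lam=4.0):
    """Assumed mapping from classifier uncertainty u to an interval scale."""
    return 1.0 + lam * u


def calibrate(y_true, y_pred, uncert, alpha=0.1):
    """Non-conformity score: absolute residual divided by the uncertainty scale."""
    scores = [abs(y - p) / scale(u) for y, p, u in zip(y_true, y_pred, uncert)]
    return conformal_quantile(scores, alpha)


def interval(pred, u, q):
    """Prediction interval whose half-width grows with classifier uncertainty."""
    half = q * scale(u)
    return pred - half, pred + half


# Hypothetical calibration set: dose values, predictions, and uncertainties.
cal_true = [500, 620, 410, 700, 550, 300, 480, 900, 650]
cal_pred = [510, 600, 400, 730, 545, 340, 470, 820, 660]
cal_unc = [0.0, 0.1, 0.0, 0.25, 0.0, 0.5, 0.05, 0.75, 0.0]
q = calibrate(cal_true, cal_pred, cal_unc, alpha=0.1)
confident = interval(600, 0.0, q)
uncertain = interval(600, 0.5, q)
print(q, confident, uncertain)
```

With the same point prediction, the confident case receives a narrower interval than the uncertain one, which mirrors the adaptive behavior described in the abstract.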

2605.19852 2026-06-04 cs.CL

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

工具总是有益的吗?学习自适应调用工具以实现双模式多模态大语言模型推理

Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai, Yinghuan Shi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出AutoTool模型,通过强化学习框架自适应决定是否调用工具,结合双模式推理策略和模式特定奖励函数,在提升准确率的同时降低推理开销。

Comments Accepted to ICML 2026

详情
AI中文摘要

工具增强推理已成为增强多模态大语言模型(MLLMs)推理能力的一个有前景的方向。然而,现有研究主要关注使模型能够执行工具调用,而忽略了调用工具的必要性。我们认为工具使用并非总是有益的,因为冗余或不恰当的调用会大大增加推理开销,甚至误导模型预测。为解决这一问题,我们引入了AutoTool,一个根据每个查询的特征自适应决定是否调用工具的模型。在强化学习框架内,我们设计了一种显式的双模式推理策略,并配以模式特定的奖励函数,以引导模型产生准确的响应。此外,为防止过早偏向单一推理模式,AutoTool在整个训练过程中共同探索并平衡工具辅助推理和文本中心推理,并在后期促进自由探索。大量实验表明,AutoTool表现出卓越的性能和高效率,在V*基准测试上相比基础模型准确率提升21.8%,在POPE基准测试上相比现有工具增强方法效率提升44.9%。代码可在https://github.com/MQinghe/AutoTool获取。

英文摘要

Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations substantially increase reasoning overhead and can even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8% accuracy gain on the V* benchmark compared to the base model, and a 44.9% improvement in efficiency over existing tool-augmented methods on the POPE benchmark. Code is available at https://github.com/MQinghe/AutoTool.

2605.19294 2026-06-04 cs.RO cs.AI

DEFLECT: Temporal Counterfactual Preference Learning for Delay-Robust Asynchronous VLAs

DEFLECT: 面向延迟鲁棒异步VLA的时间反事实偏好学习

Yixiang Zhu, Yonghao Chen, Zijie Yang, Yusong Hu, Xinyu Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) One Robotics

AI总结 针对异步视觉-语言-动作(VLA)策略中陈旧观测导致的预测-执行不匹配问题,提出离线后训练框架DEFLECT,通过反事实偏好监督学习偏好与执行时间对齐的动作,无需人工标注、在线部署或架构修改,显著提升高延迟下的任务成功率。

详情
AI中文摘要

视觉-语言-动作(VLA)策略越来越依赖异步推理,将大模型延迟隐藏在持续的机器人运动背后。虽然这避免了同步动作块执行的“走走停停”行为,但产生了预测-执行不匹配:下一个动作块是根据推理开始时的陈旧观测计算得出的,但仅在机器人和场景发生变化后才执行。因此,适合预测时状态的动作可能与执行时状态不对齐。现有的运行时修复、行为克隆和偏好对齐方法并未直接教导策略解决这种陈旧输入不匹配问题。我们提出DEFLECT,一个面向延迟鲁棒异步VLA的离线后训练框架。DEFLECT将延迟引起的不匹配转化为反事实偏好监督:冻结的参考VLA从未来的执行时间观测生成偏好块,并从陈旧的预测时间观测生成拒绝块。可训练策略在相同的部署时间输入下对两个块进行评分,学习偏好与执行时间对齐的动作,同时监督微调锚点保留专家动作流形。DEFLECT不需要人工偏好标签、奖励模型、在线机器人部署、架构更改或额外的推理时间计算。在Kinetix、LIBERO和三个真实机器人任务上,DEFLECT相比强异步VLA基线提高了延迟鲁棒性,在高延迟下成功率提升高达6.4个百分点,并在真实规模VLA的最长延迟下实现4.6个百分点的增益。

英文摘要

Vision-Language-Action (VLA) policies increasingly rely on asynchronous inference to hide large-model latency behind ongoing robot motion. While this avoids the stop-and-go behavior of synchronous action-chunk execution, it creates a prediction-execution mismatch: the next chunk is computed from a stale observation at inference start but executed only after the robot and scene have evolved. As a result, actions that fit the prediction-time state can become misaligned with the execution-time state. Existing runtime repair, behavior-cloning, and preference-alignment approaches do not directly teach the policy to resolve this stale-input mismatch. We propose DEFLECT, an offline post-training framework for delay-robust asynchronous VLAs. DEFLECT converts latency-induced mismatch into counterfactual preference supervision: a frozen reference VLA generates a preferred chunk from the future execution-time observation and a rejected chunk from the stale prediction-time observation. The trainable policy scores both chunks under the same deployment-time input, learning to favor execution-time-aligned actions while a supervised fine-tuning anchor preserves the expert action manifold. DEFLECT requires no human preference labels, reward models, online robot rollouts, architectural changes, or additional inference-time computation. Across Kinetix, LIBERO, and three real-robot tasks, DEFLECT improves delay robustness over strong asynchronous VLA baselines, raising high-latency success by up to 6.4 percentage points and achieving a 4.6 percentage-point gain at the longest delay on a real-scale VLA.

2605.18936 2026-06-04 cs.LG cs.CL

FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data

FedMental: 评估用于社交媒体数据心理健康检测的联邦学习

Nuredin Ali Abdelkadir, Anjali Ratnam, Zeerak Talat, Stevie Chancellor

发表机构 * University of Minnesota(明尼苏达大学) University of Edinburgh(爱丁堡大学)

AI总结 本文通过联邦学习和差分隐私联邦学习在抑郁和自杀危机检测任务上的实验,评估了隐私保护技术对心理健康检测性能的影响,发现联邦学习性能接近集中式训练,但差分隐私联邦学习存在显著的性能-隐私权衡。

Comments Association for Computational Linguistics (ACL) 2026 Main Conference

详情
AI中文摘要

社交媒体文本数据常用于训练机器学习模型以识别表现出高风险心理健康行为的用户。然而,共享这些敏感数据会带来隐私风险,并限制了基准数据集的发展。我们全面评估了隐私保护的机器学习技术是否能在保持性能的同时实现更安全的数据共享。具体来说,我们将联邦学习和差分隐私联邦学习应用于两个广泛研究的心理健康预测任务:X(Twitter)上的抑郁检测和Reddit上的自杀危机检测。通过将每个用户视为非独立同分布设置中的一个客户端,我们模拟了现实的数据共享场景,评估了不同的客户端比例、聚合策略和隐私预算。虽然联邦学习在抑郁识别上达到了与集中式训练相当的性能(集中式F1=85.63;最佳联邦学习模型F1=83.16),但我们发现差分隐私联邦学习即使在低噪声水平(epsilon=50)下也存在较大的性能-隐私权衡(F1下降高达27.01)。这是由于与心理健康相关的高信息量但稀疏的语言标记(如健康主题和情感词)被扭曲所致。本研究实证展示了当前隐私保护技术在心理健康推理任务中的潜力和局限性。

英文摘要

Social media text data are often used to train Machine Learning (ML) models to identify users exhibiting high-risk mental health behaviors. However, sharing this sensitive data poses privacy risks and limits the growth of benchmark datasets. We comprehensively evaluate whether privacy-preserving ML techniques can enable safer data sharing while preserving performance. Specifically, we apply federated learning (FL) and Differentially Private FL to two widely studied mental health prediction tasks: depression detection on X (Twitter) and suicide crisis detection on Reddit. We simulate realistic data-sharing scenarios by treating each user as a client in a non-IID setting, evaluating across different client fractions, aggregation strategies, and privacy budgets. While FL achieves performance comparable to centralized training (centralized F1 = 85.63; best FL model F1 = 83.16) on depression identification, we find that Differentially Private FL has a large performance-privacy trade-off (an F1 drop of up to 27.01) even with low levels of noise (epsilon = 50). This is due to the distortion of highly informative yet sparse linguistic markers related to mental health, like health topics and emotion words. This research empirically demonstrates the potential and limitations of current privacy preservation techniques for mental health inference tasks.
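A common way to make federated averaging differentially private is to clip each client's update and add Gaussian noise before averaging. The sketch below shows that generic recipe under stated assumptions; the abstract does not say which mechanism, clipping norm, or noise scale the paper uses, and the names `clip_update`, `dp_fedavg`, and `noise_multiplier` are illustrative. Privacy accounting (turning the noise level into an epsilon) is deliberately left out.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale a client update so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * factor for v in update]


def dp_fedavg(client_updates, max_norm=1.0, noise_multiplier=0.0, rng=None):
    """Average clipped client updates, adding Gaussian noise to the sum.

    noise_multiplier = 0 reduces this to plain federated averaging of
    clipped updates; each user is treated as one client.
    """
    rng = rng or random.Random(0)
    clipped = [clip_update(u, max_norm) for u in client_updates]
    n, dim = len(clipped), len(clipped[0])
    sigma = noise_multiplier * max_norm
    noisy_sum = [sum(u[j] for u in clipped) + rng.gauss(0.0, sigma)
                 for j in range(dim)]
    return [s / n for s in noisy_sum]


# Two hypothetical clients; the second coordinate stands for a rare but
# informative feature that only some users contribute to.
updates = [[3.0, 4.0], [0.0, 0.5]]
plain = dp_fedavg(updates, max_norm=1.0, noise_multiplier=0.0)
noisy = dp_fedavg(updates, max_norm=1.0, noise_multiplier=1.0)
print(plain, noisy)
```

Because the noise scale is independent of how many clients carry a given signal, coordinates that only a few users contribute to are proportionally the most distorted, which is consistent with the abstract's explanation for the F1 drop.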

2605.15949 2026-06-04 cs.RO

A Reproducible and Physically Feasible Dynamic Parameter Identification Framework for a Low-Cost Robot Arm

低成本机器人臂的可重复且物理可行的动力学参数辨识框架

Junji Oaki, Koki Yamane, Koki Inami, Sho Sakaino

发表机构 * Institute of Systems and Information Engineering, University of Tsukuba(筑波大学系统与信息工程研究所)

AI总结 针对低成本机器人臂CRANE-X7,提出一种结合最小二乘、半定规划投影和闭环输入误差精化的可重复且物理可行的动力学参数辨识方法,并通过主成分分析和惯性矩阵正定性审核确保模型统计一致性与物理可行性。

Comments 11 pages, 8 figures, 7 tables, 1 algorithm and 2 appendices

详情
AI中文摘要

本文针对由模块化智能驱动器驱动的低成本机器人臂CRANE-X7,提出了一种可重复且物理可行的动力学参数辨识框架。为提高实际可辨识性,根据近似连杆对称性移除惯性积,将刚体模型从65个基础参数减少至39个。辨识运动是在实际关节限位下,由结构化的单关节和相邻关节基元手工设计而成。所提出的流程结合了预处理、基于逆动力学回归的普通最小二乘(OLS)、用于可行性恢复的条件半定规划(SDP)投影以及闭环输入误差(CLIE)精化。在共同的主成分分析(PCA)空间中分析来自40个结构化轨迹的候选解,以选择一个统计上中心的代表性模型。由于统计中心性本身不能保证物理可接受性,最终选定的模型需通过所有位姿下的惯性矩阵正定性审核,并在必要时通过局部化的后CLIE SDP救援步骤进行修正。实验表明,参数云从OLS到SDP再到CLIE逐渐变得更加集中,而最终接受的模型在保留的验证运动上保持了高预测精度。这些结果为低成本机器人平台获得统计一致且物理可行的动力学模型提供了一条实用途径。

英文摘要

This paper presents a reproducible and physically feasible dynamic parameter identification framework for CRANE-X7, a low-cost robot arm driven by modular smart actuators. To improve practical identifiability, products of inertia are removed according to approximate link symmetry, reducing the rigid-body model from 65 to 39 base parameters. Identification motions are hand-designed from structured single-joint and adjacent-joint primitives under practical joint-range limits. The proposed pipeline combines preprocessing, inverse-dynamics-regressor-based ordinary least squares (OLS), conditional semidefinite-programming (SDP) projection for feasibility recovery, and closed-loop input error (CLIE) refinement. Candidate solutions from 40 structured trajectories are analyzed in a common principal component analysis (PCA) space to select a statistically central representative model. Because statistical centrality alone does not ensure physical acceptability, the selected model is finally screened by an all-pose positive-definiteness audit of the inertia matrix and, when necessary, corrected by a localized post-CLIE SDP rescue step. Experiments show that the parameter cloud becomes progressively more concentrated from OLS to SDP and CLIE, while the final accepted model preserves high predictive accuracy on held-out validation motions. These results demonstrate a practical route to statistically coherent and physically feasible dynamic models for low-cost robot platforms.