arXivDaily: arXiv Daily Academic Digest (updated Monday to Friday)

2606.04627 2026-06-09 cs.AI (updated version)

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Zhichao Yang, Yuanze Hu, Haojie Hao, Longkun Hao, Dongshuo Huang, Hongyu Lin, Gen Li, Lanqing Hong, Yihang Lou, Yan Bai

Affiliations: Beihang University; Northwestern Polytechnical University; Institute of Software, Chinese Academy of Sciences; National University of Singapore; Peking University

AI summary: Proposes the MIRAGE framework, which learns continuous latent representations from explicit reasoning traces so that mobile agents can reason internally and predict future screen states, reducing generated tokens while improving execution efficiency.

Abstract

Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.

2606.04421 2026-06-09 cs.AI cs.LG (updated version)

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

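Edward Y. Chang

Affiliations: Stanford University

AI summary: Proposes long-horizon temporal regret as a first-class objective that, together with outcome regret and epistemic regret, forms a falsifiable failure-analysis framework for causal-memory controllers. Proves that temporal miscalibration can still grow linearly when outcome regret is zero, whereas probe complexity with a persistent causal log is logarithmic.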

Comments: 62 pages, 12 tables, 12 figures

Abstract

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.
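To give a feel for the gap between linear and logarithmic probe counts mentioned in the abstract, here is a toy sketch. It is not the Trivium algorithm: the doubling schedule and both function names are assumptions chosen only because a geometric schedule is a simple way to get a probe count that grows like log E.

```python
def doubling_probe_schedule(num_episodes):
    """Hypothetical schedule: probe at episodes 1, 2, 4, 8, ... up to the horizon."""
    schedule = []
    episode = 1
    while episode <= num_episodes:
        schedule.append(episode)
        episode *= 2
    return schedule


def always_probe_schedule(num_episodes):
    """Baseline for comparison: probe at every episode, so the count is linear in E."""
    return list(range(1, num_episodes + 1))


# The horizons match the E = 100 and E = 500 runs named in the abstract.
for horizon in (100, 500):
    print(horizon,
          len(doubling_probe_schedule(horizon)),
          len(always_probe_schedule(horizon)))
```

Under this assumed schedule the number of probes is floor(log2 E) + 1, so a fivefold longer horizon adds only a couple of probes.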

2606.04409 2026-06-09 cs.CV cs.AI cs.LG (updated version)

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

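Yidi Zhouluo

Affiliations: School of Medical Information and Artificial Intelligence, Shandong First Medical University

AI summary: Uses experiments on a one-dimensional nonlinear function and the CIFAR datasets to empirically analyze how data scale, model complexity, and input modality affect visual generalization performance.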

Comments: 12 pages, 9 figures, 4 tables

Abstract

Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/YidiZhouluo/DeepLearning-Empirical-Studies/tree/main/Exp_01.
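The preliminary experiment described above (vary the number of training samples and the polynomial degree on a one-dimensional nonlinear function) can be sketched roughly as follows. The target function `sin(3x)`, the noise level, the sample sizes, and the degrees are all assumptions for illustration; the abstract does not state the authors' choices.

```python
import math
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    size = degree + 1
    gram = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    moments = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(gram, moments)


def mean_squared_error(coeffs, xs, ys):
    """Average squared gap between polynomial predictions and targets."""
    preds = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


def target(x):
    """Hypothetical nonlinear ground-truth function (an assumption)."""
    return math.sin(3.0 * x)


def sample(n, rng, noise=0.1):
    """Draw n noisy observations of the target on [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [target(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


rng = random.Random(0)
test_xs, test_ys = sample(500, rng)
for n_train in (10, 40, 160):          # data scale
    train_xs, train_ys = sample(n_train, rng)
    for degree in (1, 3, 5):           # model complexity
        coeffs = fit_polynomial(train_xs, train_ys, degree)
        print(n_train, degree, round(mean_squared_error(coeffs, test_xs, test_ys), 4))
```

Printing held-out error per (sample size, degree) pair gives the grid in which the two factors can be compared side by side.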

2606.04109 2026-06-09 cs.CL (updated version)

Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

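Jianguo Zhu, Xiangmei Li, Wenjie Liu

Affiliations (as listed): arXiv.org; GitHub

AI summary: Uses fixed-content probe experiments to study how discourse-role labels (such as Instruction, Reference, and Example) affect the rate at which language models adopt misleading information. Finds that labels shift the adoption rate by 56-84 percentage points and recommends that context-utilization and RAG benchmarks report and control wrapper labels.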

Comments: Revised version with updated author information, added clean baselines, clarified evaluation metrics, and tightened discussion of context-augmented settings

Abstract

Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, Misleading Adoption Rate shifts by 56-84 percentage points. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it. Paired tests, bootstrap intervals, final-instruction ablations, and Qwen final-step log-probability probes support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves smaller label gaps, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit adoption scope. A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The resulting claim is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.
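The adoption metric described above (did the model output the injected wrong option, broken down by wrapper label) might be computed along these lines. The record layout, function names, and toy records are assumptions made up for illustration; they are not the paper's data or code.

```python
from collections import defaultdict


def misleading_adoption_rate(records):
    """Fraction of items per label where the answer equals the injected wrong option.

    records: iterable of (label, model_answer, injected_wrong_option) tuples.
    """
    adopted = defaultdict(int)
    total = defaultdict(int)
    for label, answer, injected in records:
        total[label] += 1
        if answer == injected:
            adopted[label] += 1
    return {label: adopted[label] / total[label] for label in total}


def label_gap_points(rates):
    """Spread between the highest and lowest adoption rate, in percentage points."""
    return 100.0 * (max(rates.values()) - min(rates.values()))


# Made-up toy records: same injected option "B" under two different labels.
toy_records = [
    ("Instruction:", "B", "B"),
    ("Instruction:", "B", "B"),
    ("Instruction:", "C", "B"),
    ("Instruction:", "B", "B"),
    ("Example:", "C", "B"),
    ("Example:", "C", "B"),
    ("Example:", "B", "B"),
    ("Example:", "C", "B"),
]
rates = misleading_adoption_rate(toy_records)
print(rates)                    # adoption rate per label
print(label_gap_points(rates))  # gap between labels in percentage points
```

Because every label sees the same items and the same injected assertion, any gap in this number is attributable to the label rather than the content.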

2606.04029 2026-06-09 cs.LG cs.AI (updated version)

Position: Deployed Reinforcement Learning should be Continual

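Parnian Behdin, Kevin Roice, Golnaz Mesbahi

Affiliations: University of California, Berkeley

AI summary: Argues that deployed reinforcement learning systems should learn continually, analyzes four sources of post-deployment non-stationarity, and presents the advantages of continual RL and ways to implement it.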

Comments: Accepted to the ICML 2026 Position Paper Track. See https://icml.cc/virtual/2026/poster/67195

Abstract

Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until performance degrades and retraining becomes necessary. In this position paper, we argue that deploying an agent that is incapable of optimality, but receives an evaluative reward signal, is inherently a continual RL problem. We identify four sources of non-stationarity after deployment that necessitate never-ending learning, and highlight why the best deployed agents never stop adapting. We analyze successful examples of continual RL in the real world, and present the community with the advantages and measures to move away from the current train-then-fix paradigm.

2606.03787 2026-06-09 cs.RO (updated version)

Worth Remembering: Surprise-Gated Robot Episodic Memory

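Nicolas Gorlo, Derek K. Wise, Alberto Speranzon, Luca Carlone

Affiliations: Massachusetts Institute of Technology; Lockheed Martin

AI summary: Proposes a gating mechanism based on Bayesian surprise to selectively store high-utility episodic memories, computing surprise in the V-JEPA-2 latent space, and improves performance on robot question-answering tasks by at least 12%.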

Comments: 14 pages, 2 figures, 4 tables

Abstract

Robots solving generalist tasks need to be able to ground instructions in their past experience, since humans may refer to notable past events when giving a task (e.g., "Take me to where the chemical spill happened yesterday"). Since memory limits make storing all past events infeasible, long-term robot memory must be selective, ideally retaining only those episodes with high utility for future tasks. However, future tasks are not typically given a priori for generalist robots. To select generically useful memories, we propose Bayesian surprise as a gating mechanism for memory formation. We present an approach to compute surprise in a semantically rich deployment-agnostic latent space provided by V-JEPA-2. Using our gated episodic memory to augment 4D scene graph-based spatial memory, we show a consistent improvement over state-of-the-art benchmarks in robot question answering, outperforming prior robot memory methods by ≥12% for temporal, spatial, and binary questions, and surpassing the performance of supervised and non-causal methods with an unsupervised causal method in event segmentation tasks.
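Bayesian surprise is commonly defined as the KL divergence between the belief after an observation and the belief before it. The sketch below applies that definition as a memory gate on a scalar Gaussian belief. The scalar setting, the conjugate update, the threshold, and all names are assumptions for illustration; the paper computes surprise in the V-JEPA-2 latent space, which this toy does not attempt.

```python
import math


def gaussian_kl(mean_q, var_q, mean_p, var_p):
    """KL(q || p) between two univariate Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mean_q - mean_p) ** 2) / var_p
                  - 1.0)


def update_belief(mean, var, observation, obs_var):
    """Conjugate Gaussian update of the belief, assuming known observation variance."""
    gain = var / (var + obs_var)
    new_mean = mean + gain * (observation - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var


def gate_stream(observations, threshold, prior_mean=0.0, prior_var=1.0, obs_var=1.0):
    """Return the indices of observations whose Bayesian surprise exceeds the threshold."""
    mean, var = prior_mean, prior_var
    stored = []
    for step, obs in enumerate(observations):
        new_mean, new_var = update_belief(mean, var, obs, obs_var)
        surprise = gaussian_kl(new_mean, new_var, mean, var)  # posterior vs. prior
        if surprise > threshold:
            stored.append(step)  # only surprising steps become memories
        mean, var = new_mean, new_var
    return stored


# Hypothetical stream: mostly near zero, with one outlier at index 3.
print(gate_stream([0.1, -0.2, 0.0, 5.0, 0.1], threshold=0.5))
```

Only the step that moves the belief sharply passes the gate, which is the selectivity the abstract asks of long-term memory.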

2606.03576 2026-06-09 cs.CL (updated version)

AutoTail-BSFGM: Class-Balance-Aware Fine-Tuning for Chinese Scholarly Text Classification

Anling Xiang, Yuwen Yang, Yang Shen

Affiliations: Department of Intelligent Communication, School of Journalism and Communication, Minzu University of China; ZeeLin (Beijing) Technology Co., Ltd.; School of Journalism and Communication, Tsinghua University; College of AI, Tsinghua University

AI summary: Proposes AutoTail-BSFGM, which combines automatically gated tail adjustment, a weak Balanced Softmax auxiliary loss, and Fast Gradient Method adversarial regularization to address class imbalance and semantic adjacency in Chinese scholarly text classification, improving validation and lockbox accuracy on the CSL dataset.

Comments: 17 pages, 4 figures, 4 tables. Code and data: https://github.com/thu-nmrc/autotail-bsfgm-scholarly-classification

Abstract

Scholarly text classification supports literature organization, subject indexing, and research intelligence, but Chinese scholarly corpora often contain imbalanced and semantically adjacent disciplinary labels. We propose AutoTail-BSFGM, a class-balance-aware fine-tuning method that combines an automatically gated tail-prior adjustment, a weak Balanced Softmax auxiliary loss, and Fast Gradient Method adversarial regularization. The method changes only the training objective and procedure; inference uses the same single base-size encoder and linear classifier as the corresponding label-smoothed baseline. We evaluate the method on two CSL-based tasks: an abstract-to-discipline task with 67 labels and a title-to-category task with 13 categories. On the primary abstract task, AutoTail-BSFGM improves validation and lockbox accuracy under both Chinese RoBERTa-WWM and MacBERT-base. With MacBERT-base, validation accuracy increases by 0.83 percentage points and lockbox accuracy by 0.49 points, with a pooled paired McNemar signal on validation (p = 0.023). On the title task, the method improves validation accuracy by 0.70 points and validation balanced accuracy by 2.64 points; lockbox accuracy is approximately neutral while lockbox balanced accuracy improves by 1.22 points. The results support a bounded contribution: AutoTail-BSFGM improves class-balance-sensitive behavior and yields consistent gains for abstract-based scholarly classification, without uniformly improving every metric on every split.
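One ingredient named above, a Balanced Softmax auxiliary loss, can be sketched in a few lines: the logits are shifted by the log class prior before the usual cross-entropy. How AutoTail-BSFGM weights and gates this term is not stated in the abstract, so `aux_weight` and `combined_loss` are hypothetical, and the tail-prior gate and the Fast Gradient Method step are omitted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def cross_entropy(logits, target):
    """Standard cross-entropy for one example with an integer class target."""
    return -log_softmax(logits)[target]


def balanced_softmax_loss(logits, target, class_counts):
    """Cross-entropy on logits shifted by the log class prior (Balanced Softmax)."""
    total = sum(class_counts)
    shifted = [z + math.log(count / total) for z, count in zip(logits, class_counts)]
    return cross_entropy(shifted, target)


def combined_loss(logits, target, class_counts, aux_weight=0.1):
    """Main loss plus a weakly weighted auxiliary term (aux_weight is an assumption)."""
    return (cross_entropy(logits, target)
            + aux_weight * balanced_softmax_loss(logits, target, class_counts))


# Two classes with a 90/10 split; the example belongs to the rare class.
print(cross_entropy([0.0, 0.0], 1))
print(balanced_softmax_loss([0.0, 0.0], 1, [90, 10]))
```

With equal logits the prior shift makes the rare-class example cost more than under plain cross-entropy, which pushes training toward larger margins on tail classes. Inference still uses the unshifted logits, consistent with the abstract's note that only the training objective changes.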

2606.03371 2026-06-09 cs.CL (updated version)

See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

Honghui Zhang, Chenmeinian Guo, Yichen Yu, Guanyu Liu, Yujia Zhang, Yongming Qin, Chongguo Song, Mengyue Yang, Lei Yu, Tianyu Shi

Affiliations: Mita Technology; University of Bristol; University of Toronto; McGill University

AI summary: Proposes the See-Infer-Intervene (SII) framework and the Proactive Intent World Model (PIWM), which observe customer behavior, infer latent intent, and select an intervention to provide proactive assistance in retail settings. When conditioned on ground-truth customer state, PIWM reaches 0.641 macro F1 on the GuidanceSalesBench benchmark.

Comments: 16 pages, 3 figures, 9 tables. Preprint

Abstract

Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See-Infer-Intervene (SII) framework, where a device must see pre-interaction behavior, infer latent customer intent, and act by selecting an appropriate service intervention or choosing to wait. We instantiate SII with the Proactive Intent World Model (PIWM), which represents customer state with AIDA (Attention, Interest, Desire, Action) purchasing phases and BDI (belief, desire, intention) psychological fields, predicts action-conditioned intent transitions, and selects from five response classes: Greet, Elicit, Inform, Recommend, and Hold. We further construct GuidanceSalesBench, a smart-retail benchmark containing state manifests, pre-interaction videos, candidate responses, action-conditioned outcomes, and best-action labels. When conditioned on ground-truth customer state to isolate action selection, PIWM achieves 0.641 macro F1 on 30 held-out target videos, outperforming a zero-shot Qwen2.5-VL-7B baseline and training variants without balanced action supervision; end-to-end video-only selection drops to 0.295, below the 5-class balanced random baseline of 0.414, identifying video-to-state grounding as the dominant deployment-time bottleneck. A preliminary staged real-store pilot (recorded with paid participants performing scripted customer behaviors) reaches 0.579 action macro F1 on 20 fully annotated videos, with 10 additional accessible videos released with index-level labels.

2606.03328 2026-06-09 cs.LG cs.AI (updated version)

Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning

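Hu Xu, Zhaolong Xing, Congcong Liu, Jiaxing Wang, Zhida Jiang, Junshi Huang, Zhen Chen, Jianfeng Xu

Affiliations: Shanghai Jiao Tong University; JD.com

AI summary: Decomposes post-pruning capability into dimensions and analyzes 15 calibration sources, finding that calibration perplexity correlates positively with general-capability retention but negatively with math and code retention. Proposes IGSP, a multi-source mixed calibration method, to balance performance across dimensions.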

Abstract

Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set, and recent work has concluded that the choice of calibration source has only modest impact on averaged post-pruning accuracy. We ask whether this conclusion survives once calibration impact is evaluated separately across distinct capability dimensions rather than aggregated. Decomposing post-pruning capability into General, Commonsense, Code, and Math, and analysing $n = 15$ calibration sources via Spearman correlations between OIT information metrics and per-dimension retention, we uncover an opposite-sign trade-off: calibration perplexity correlates positively with General retention ($\rho = +0.71$) but negatively with Math and Code retention ($\rho = -0.53$ and $-0.59$; $p < 0.05$), so no single source can preserve all capabilities. We respond with multi-source calibration mixing, and propose IGSP, an information-guided self-calibration protocol that automates multi-source construction without capability-aligned corpora by minimising 4-gram aggregation and balancing perplexity across dimensions. On LLaMA-3.1-8B at SparseGPT 60% sparsity, a uniform multi-source mix reaches 58.8% total retention, outperforming the best single source (MetaMath, 50.0%) by $+8.8$ and the C4 default (40.0%) by $+18.8$; IGSP improves over Self-Cal by $+2.4$ and SGS by $+4.8$.
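The opposite-sign pattern reported above rests on Spearman rank correlation, which is the Pearson correlation of the ranks. A minimal stdlib sketch follows. The perplexity and retention numbers are invented toy values (five sources rather than the paper's fifteen) chosen only to show one positive and one negative correlation against the same predictor.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented toy values, one entry per hypothetical calibration source.
perplexity = [5.0, 8.0, 12.0, 20.0, 35.0]
general_retention = [0.40, 0.45, 0.50, 0.48, 0.60]
math_retention = [0.70, 0.55, 0.60, 0.40, 0.30]

print(spearman(perplexity, general_retention))  # positive on this toy data
print(spearman(perplexity, math_retention))     # negative on this toy data
```

When two capability dimensions correlate with the same source property in opposite directions, no single value of that property is best for both, which is the argument for mixing sources.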

2606.03092 2026-06-09 cs.AI (updated version)

The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

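Xu Wan, Speed Zhu, Jianwei Cai, Guang Chen, XiMing Huang, Wiggin Zhou, Mingyang Sun

Affiliations: University of Science and Technology of China

AI summary: Takes an economic perspective, modeling inference budget allocation as a global constrained optimization problem, and proposes CLEAR, a shadow-price-based method that uses rational abandonment and resource reallocation to significantly improve the Pareto frontier of total token cost versus mean accuracy under resource scarcity.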

Abstract

Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem governed by economic principles. By modeling per-query reasoning utility with a shifted-surge function, we derive an optimal allocation policy based on a global shadow price that equilibrates marginal utility under resource scarcity. Based on this theory, we propose Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR). It performs rational abandonment and reallocates resources from insolvent queries to solvable queries near their emergence thresholds. Extensive experiments on several reasoning tasks with different traffic streams demonstrate that CLEAR significantly improves the Pareto frontier of total token cost versus mean accuracy. In resource-scarce regimes, CLEAR achieves up to a 3x improvement in global accuracy compared to uniform allocation.
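The shadow-price idea above can be illustrated with a standard Lagrangian argument: under a shared budget, each funded query receives tokens until its marginal utility equals a common price, and queries whose marginal utility never reaches that price receive nothing. The sketch below is not CLEAR. It swaps the paper's shifted-surge utility (not specified in the abstract) for an assumed concave utility u(b) = gain * (1 - exp(-b / scale)), and all names and numbers are hypothetical.

```python
import math


def allocation_at_price(price, gains, scales):
    """Budget per query at which marginal utility equals the price (0 if never worth it)."""
    budgets = []
    for gain, scale in zip(gains, scales):
        initial_marginal = gain / scale  # u'(0) for the assumed utility
        if initial_marginal <= price:
            budgets.append(0.0)          # abandoned: not worth the first token
        else:
            budgets.append(scale * math.log(initial_marginal / price))
    return budgets


def solve_shadow_price(total_budget, gains, scales, iterations=100):
    """Bisect (geometrically) for the price at which total spending meets the budget."""
    low = 1e-12
    high = max(g / s for g, s in zip(gains, scales))  # at this price nothing is funded
    for _ in range(iterations):
        mid = math.sqrt(low * high)
        spent = sum(allocation_at_price(mid, gains, scales))
        if spent > total_budget:
            low = mid    # overspending, so the price must rise
        else:
            high = mid   # within budget, so try a lower price
    return high


# Three hypothetical queries; the third has little to gain from extra tokens.
gains = [1.0, 1.0, 0.05]
scales = [100.0, 400.0, 200.0]
price = solve_shadow_price(300.0, gains, scales)
budgets = allocation_at_price(price, gains, scales)
print(price)
print([round(b, 1) for b in budgets])
```

In this toy setting the low-gain query is priced out entirely and its share goes to the other two, a simplified analogue of the abandonment and reallocation behavior the abstract describes.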

2606.02802 2026-06-09 cs.AI (updated version)

ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

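Bo-Hong Wang, Baicheng Peng, Ruilin Wang, Jun Bai, Ziyang Song, Yue Li

Affiliations: School of Computer Science, McGill University; Mila - Quebec AI Institute

AI summary: Proposes the ChatHealthAI framework, which uses a task-aware resampler to align structured representations from a pretrained EHR foundation model with the semantic space of a frozen large language model, enabling interpretable clinical reasoning while preserving predictive performance.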

Comments: Main paper with appendix, 13 pages

Abstract

Large language models (LLMs) exhibit strong natural-language reasoning abilities for clinical decision support, but struggle to effectively model structured longitudinal electronic health records (EHRs). In contrast, EHR foundation models can learn predictive patient representations, yet lack interpretable language-based reasoning. To bridge this gap, we propose ChatHealthAI, a multimodal reasoning framework that aligns structured EHR representations from a pretrained EHR foundation model with the semantic space of a frozen LLM through a task-aware resampler. By integrating longitudinal patient representations with refined clinical event descriptions, ChatHealthAI enables clinically grounded natural-language reasoning while maintaining accurate patient prediction. We evaluated ChatHealthAI on three clinical predictive tasks from the EHRSHOT benchmark. Results show that ChatHealthAI improves reasoning quality and interpretability while preserving competitive predictive performance. These findings highlight the potential of integrating EHR foundation models with pretrained LLMs for interpretable clinical prediction.

2606.02780 2026-06-09 cs.CL (updated version)

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

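Muyu He, Yuchen Liu, Qingya Huang, Li Zhang

Affiliations: Independent; Drexel University

AI summary: Proposes the Bank of Values (BoV) method, which uses context-free value vectors in deeper attention layers to preserve original token information, improving model performance while reducing compute and memory overhead.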

Comments: 13 pages, 5 figures. Code: https://github.com/RiddleHe/nanochat

Abstract

The success of the transformer architecture as the backbone of modern LLMs is in large part due to its use of attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and thereby produces context-dependent query, key, and value vectors. However, we find that model performance meaningfully improves when deeper layers learn only a context-free value vector to preserve the original token information, without drawing on any context from the residual stream. When the model has access to this context-free value vector, adding back the context-dependent component provides little additional benefit for aggregate benchmark performance. Such context-free value vectors can be stored as sparse model parameters, eliminating the need to recompute or persistently cache these values. Through systematic ablations on the key design choices for such context-free value vectors, we propose Bank of Values (BoV), a new way of computing value vectors in attention by learning a lookup table of token-specific value vectors for each of the last third of layers. Across 135M and 780M models, BoV improves validation loss over standard attention and, at 780M, the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector with less compute and memory.
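The core idea above, reading value vectors from a per-token lookup table instead of projecting the residual stream, can be sketched for a single head in one layer. This is an illustrative toy, not the authors' implementation: `value_bank`, the tiny dimensions, and the zero queries and keys are assumptions. In the described method queries and keys would still be computed from the residual stream.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def lookup_values(token_ids, value_bank):
    """Context-free values: one learned vector per token id, independent of position."""
    return [value_bank[token] for token in token_ids]


def causal_attention(queries, keys, values):
    """Single-head causal attention; each position mixes values at or before it."""
    dim = len(values[0])
    outputs = []
    for t, query in enumerate(queries):
        scale = math.sqrt(len(query))
        weights = softmax([dot(query, keys[s]) / scale for s in range(t + 1)])
        outputs.append([sum(w * values[s][d] for s, w in enumerate(weights))
                        for d in range(dim)])
    return outputs


# Hypothetical 3-token vocabulary with 2-dimensional value vectors.
value_bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_ids = [2, 0, 2]
values = lookup_values(token_ids, value_bank)

# Zero queries and keys give uniform attention, which makes the mixing easy to follow.
queries = [[0.0, 0.0] for _ in token_ids]
keys = [[0.0, 0.0] for _ in token_ids]
outputs = causal_attention(queries, keys, values)
print(values)
print(outputs)
```

Because the values depend only on token ids, positions 0 and 2 (the same token) carry identical value vectors regardless of context, and nothing value-related needs to be recomputed or cached per sequence.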

2606.02735 2026-06-09 cs.RO cs.AI cs.LG (updated version)

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

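Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

Affiliations: Airoa

AI summary: Proposes the S2 framework, which uses an explicit visual evidence budget and refined trajectory language to improve the generalization of VLA models under distractors, appearance shifts, and semantically similar tasks.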

Comments: Project page: https://s2.airoa.io

Abstract

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.

2606.02519 2026-06-09 cs.RO (updated version)

IMAC-AgriVLN: Can Agricultural Vision-and-Language Navigation Agents be Aware of Instruction Mistakes?

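Xiaobei Zhao, Xingqi Lyu, Xin Chen, Xiang Li

Affiliations: China Agricultural University; China Agricultural University-Sichuan Advanced Agricultural & Industrial Institute

AI summary: Addresses the possibility of erroneous instructions in agricultural VLN by proposing the A2A-MI benchmark and the IMAC module, which analyzes the instruction and the front-facing image to detect and correct mistakes, significantly improving navigation performance.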

Abstract

Agricultural robots serve as powerful assistants across a wide range of agricultural tasks, yet they still rely heavily on manual operation or railway systems for movement. The AgriVLN method and the A2A benchmark pioneered the extension of Vision-and-Language Navigation (VLN) to the agricultural domain, enabling a robot to navigate to a target position by following a natural language instruction. However, almost all prior methods assume that the given instructions are themselves correct, which does not match realistic scenarios, because anyone may give an instruction that contains mistakes. To bridge this gap, we propose the A2A-MI benchmark, for which we build a semi-automatic data annotator that inserts three classes of mistakes into each original instruction in a more diversified and efficient way. We test several state-of-the-art agricultural VLN agents on it and observe a substantial degradation (-57% on SR and -9% on NE), which suggests that an agricultural VLN agent tends to assume the given instruction is correct and therefore does not doubt it when the scenes it sees do not align with the instruction it receives. To build awareness of instruction mistakes, we propose the IMAC module, which analyzes the instruction and the current front-facing image to judge whether the instruction contains mistakes and attempts to correct it when needed. We integrate IMAC into the baseline model and observe a noteworthy improvement that substantially narrows the gap to the performance on instructions without mistakes. Project: https://github.com/AlexTraveling/IMAC-AgriVLN.

2606.02351 2026-06-09 cs.LG stat.ML 版本更新

Local Preferential Bayesian Optimization

局部偏好贝叶斯优化

Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger, Sebastian Trimpe

发表机构 * Institute for Data Science in Mechanical Engineering, RWTH Aachen University(机械工程数据科学研究所,亚琛工业大学) Department of Clinical Research, University of Bern(伯尔尼大学临床研究系) Center for Reproducible Science and Research Synthesis, University of Zurich(苏黎世大学可重复科学与研究综合中心) aiXopt GmbH(aiXopt公司)

AI总结 针对偏好贝叶斯优化在高维问题中效率低的问题,提出利用信任域和导数信息的局部偏好贝叶斯优化方法,显著降低累积遗憾。

详情
AI中文摘要

贝叶斯优化(BO)是一种流行且有效的调优昂贵、有噪声实验的方法,但需要制定明确的目标函数。偏好贝叶斯优化(PBO)通过从成对的人类反馈中学习来消除这一要求,然而现有方法由于其全局搜索策略,难以有效优化中低维以外的问题。我们通过开发一系列局部PBO方法来解决这一限制,这些方法将高维BO的关键思想迁移到偏好设置中。具体而言,我们引入了局部PBO方法,将信任域和导数信息局部搜索适应于成对偏好反馈,其中后者利用了拉普拉斯近似高斯过程后验的一阶和二阶导数。我们在GP样本路径、标准优化基准函数和策略搜索任务上的基准测试表明,局部PBO方法在具有陡峭最优值的高维和复杂景观中特别有效。与基于全局偏好的基线相比,它们可以显著减少累积遗憾,使其对于现实世界中基于偏好的优化任务(如策略搜索)特别有用。

英文摘要

Bayesian optimization (BO) is a popular and effective approach for tuning expensive, noisy experiments, but requires the formulation of an explicit objective function. Preferential BO (PBO) removes this requirement by learning from pairwise human feedback, yet existing methods struggle to efficiently optimize beyond low- and medium-dimensional problems due to their global search approaches. We address this limitation by developing a family of local PBO methods that transfer key ideas from high-dimensional BO to the preferential setting. In particular, we introduce local PBO methods which adapt trust-region and derivative-informed local search to pairwise preference feedback, where the latter exploits first- and second-order derivatives of the Laplace-approximated GP posterior. Our benchmark on GP sample paths, standard optimization benchmark functions, and policy-search tasks shows that local PBO methods are especially effective in high-dimensional and complex landscapes with steep optima. Compared with global preference-based baselines, they can substantially reduce cumulative regret, making them particularly useful for real-world preference-based optimization tasks such as policy search.

2606.02341 2026-06-09 cs.SD cs.LG 版本更新

Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

参数高效的双编码器架构与可微Choquet积分融合用于水下声学分类

Amirmohammad Mohammadi, Joshua Peeples, Alexandra Van Dine

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出一种双编码器神经网络架构,同时处理波形和频谱图,利用预训练骨干和参数高效微调模块,并通过基于Choquet积分的可微模糊聚合机制融合时域和频域表示,提高分类准确性和可解释性。

详情
Comments
9 pages, 7 figures
AI中文摘要

水下声学分类具有广泛的海事应用,但由于日益复杂的声学环境而面临挑战。波形和频谱图表示已被主要用作该领域分类任务的声学数据特征。频谱图建模谐波依赖性,但这些降维表示可能过滤掉与判别相关的声学特征。虽然波形的相位信息允许对信号进行完整表征,但原始波形可能嘈杂且复杂,使得模型难以直接处理该表示。本文提出一种双编码器神经网络架构,同时处理声学波形和频谱图,利用预训练骨干和参数高效微调模块,实现领域自适应。为了结合这些自适应分支,引入了一种基于Choquet积分的可微模糊聚合机制,以平衡时域和频谱表示。这种融合策略不仅提高了分类准确性,还提供了可解释性。具体来说,通过分析学习到的模糊测度,揭示了网络表示依赖性的类别特定变化。通过动态将注意力转移到受潜在非对称信道失真影响最小的表示上,所提出的门控机制缓解了水下环境的非平稳挑战。在DeepShip和ShipsEar数据集上的评估表明,所提出的架构相对于独立的单编码器基线实现了分类改进,同时限制了可训练参数空间。这减轻了在有限声学数据集上过拟合的风险,同时降低了与完全微调基础模型相关的计算成本。

英文摘要

Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination. While phase information from the waveform allows full characterization of the signal, the original waveform can be noisy and complex, rendering this representation difficult for models to process directly. This paper proposes a dual-encoder neural architecture to simultaneously process acoustic waveforms and spectrograms, leveraging pre-trained backbones and parameter-efficient fine-tuning modules, enabling a domain adaptation. To combine these adapted branches, a novel differentiable fuzzy aggregation mechanism based on the Choquet integral is introduced to balance the temporal and spectral representations. This fusion strategy not only yields higher classification accuracy but also provides interpretability. Specifically, by analyzing the learned fuzzy measures, insights are revealed about class-specific shifts in the network's representation reliance. By dynamically shifting attention to the representation least corrupted by potential asymmetric channel distortions, the proposed gating mechanism mitigates the non-stationary challenges of the underwater environment. Evaluations on the DeepShip and ShipsEar datasets demonstrate that the proposed architecture achieves classification improvements over independent single-encoder baselines, while simultaneously restricting the trainable parameter space. This mitigates the risk of overfitting on limited acoustic datasets while alleviating the computational costs associated with fully fine-tuning foundation models.
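
The Choquet-integral fusion described in this abstract can be illustrated with the standard discrete Choquet integral over two sources. This is a minimal sketch only: the function name `choquet_integral`, the source labels `"wave"` and `"spec"`, and the hand-set fuzzy measure are hypothetical, and the paper's version is learned end to end and differentiable, which this plain-Python version is not.

```python
def choquet_integral(scores, measure):
    """Discrete Choquet integral of non-negative scores w.r.t. a fuzzy measure.

    scores:  dict mapping source name -> score (e.g. a class confidence)
    measure: dict mapping frozenset of source names -> weight in [0, 1],
             monotone, with the full set mapped to 1.0
    """
    # Sort sources by score, largest first.
    order = sorted(scores, key=scores.get, reverse=True)
    total = 0.0
    for i, src in enumerate(order):
        # Gap to the next-largest score (0.0 after the last source).
        nxt = scores[order[i + 1]] if i + 1 < len(order) else 0.0
        # Coalition of all sources scoring at least as high as `src`.
        coalition = frozenset(order[: i + 1])
        total += (scores[src] - nxt) * measure[coalition]
    return total


# Hypothetical measure: the spectrogram branch alone is trusted more
# than the waveform branch alone.
measure = {
    frozenset({"wave"}): 0.3,
    frozenset({"spec"}): 0.6,
    frozenset({"wave", "spec"}): 1.0,
}
fused = choquet_integral({"wave": 0.8, "spec": 0.4}, measure)
```

With an additive measure the operator reduces to a weighted mean, and setting every single-source weight to 1 or 0 recovers the max or min, which is why inspecting the learned measure can indicate which representation a class relies on.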

2606.02274 2026-06-09 cs.RO 版本更新

Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

Dexterity-BEV: 对齐3D世界与动作以实现通用机器人策略学习

Huayi Zhou, Wei Gao, Dekun Lu, Ruiji Liu, Zhanqi Zhang, Ziyang Zhang, Jian Chen, Wenlve Zhou, Sheng Xu, Shumin Li, Kangyi Guo, Shichen Xu, Zixin Huang, Yongyi Su, Kui Jia

发表机构 * DexForce Technology(德克斯技术公司) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出Dexterity-BEV框架,通过对齐顶点图和顶点谱的3D表示以及鸟瞰图对齐,解决2D基础模型在3D操作中的局限性,提升机器人策略的泛化能力。

详情
Comments
under review
AI中文摘要

端到端操作策略结合大规模预训练的视觉-语言模型(VLM)展示了通用且灵巧的机器人操作的潜力。然而,它们继承了2D基础模型的两个关键局限性:1)依赖忽略操作内在3D性质的2D RGB输入;2)输入输出空间以及不同机器人形态、相机设置和轨迹数据集之间缺乏空间3D对齐。在本文中,我们提出了一系列贡献来解决这些问题。首先,我们引入对齐顶点图和顶点谱——一种逐像素的3D表示,利用相机标定和可选的深度将2D视觉输入提升到3D。这种新颖的输入表示将3D感知与2D大型VLM的泛化能力相结合。然后,我们提出通过将每个相机视图的逐像素3D信息和机器人动作表达到一个共享坐标系来对齐操作策略的输入和输出。基于此,我们指定一个规范的鸟瞰图(BEV)对齐框架,并创新性地提出构建BEV图像,产生对相机姿态变化鲁棒的视角不变表示。为了实现大规模训练和评估,我们开发了一个全面的数据处理流程来执行此类对齐;我们还引入了一种新颖的时间对齐方案,用于跨不同机器人、人类操作员和数据集的轨迹。这些贡献共同缓解了输入输出的时空错位,提高了真实世界操作的一致性和泛化能力。预训练检查点、源代码和数据处理流程可在 https://hnuzhy.github.io/projects/Dex-BEV 获取。

英文摘要

End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) the reliance on 2D RGB inputs that ignore the intrinsically 3D nature of manipulation; and 2) the lack of spatial 3D alignment between input and output spaces as well as across diverse robot embodiments, camera setups, and trajectory datasets. In this paper, we present a series of contributions to address these issues. First, we introduce the aligned vertex map and vertex spectrum, a pixel-wise 3D representation that lifts 2D visual inputs to 3D using camera calibration and optional depth. This novel input representation marries 3D awareness with the generalization of large 2D VLMs. Then, we propose to align the inputs and outputs of manipulation policies by expressing the per-pixel 3D information of each camera view and the robot actions in a shared coordinate frame. Based on this, we designate a canonical Bird's-Eye-View (BEV) alignment frame and propose to construct BEV images, producing a view-invariant representation robust to camera pose variations. To enable training and evaluation at scale, we develop a comprehensive data processing pipeline to perform such alignments; we also introduce a novel temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. These contributions collectively mitigate input and output spatial-temporal misalignments, improving the consistency and generalization of real-world manipulation. Pretrained checkpoints, source code, and the data processing pipeline are available at https://hnuzhy.github.io/projects/Dex-BEV.

2606.01869 2026-06-09 cs.AI 版本更新

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

WorldCoder-Bench:物理接地3D世界合成基准

Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang, Yongcan Yu, Yubin Wang, Haitao Yang, Yuxiang Zhang, Bin Wang, Ran He, Jian Liang

发表机构 * NLPR & MAIS, CASIA(中国科学院自动化研究所与模式识别国家重点实验室) Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 提出WorldCoder-Bench基准,通过StateProbe协议评估LLM生成Three.js 3D世界的物理正确性和交互可靠性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被要求不仅编写静态界面,还要从自然语言构建可执行的交互式世界。浏览器原生3D(通常使用Three.js构建)是下一个自然前沿:生成的程序必须集成资源、遵守空间和物理约束,并保持面向用户的控件与隐藏的运行时状态同步。然而,现有的网络生成基准和评估器主要只观察像素或DOM节点,而Three.js世界的机制在不透明的<canvas>内部展开。我们引入了WorldCoder-Bench,一个用于自主、物理接地3D世界合成的基准。WorldCoder-Bench包含2026个专家策划的任务,涵盖模拟、渲染和应用场景,带有可选的.glb资源和隐藏的行为契约。我们进一步提出了StateProbe,一种基于执行的协议,在沙盒浏览器中探测生成的程序,并验证运行时状态和转换上的隐藏、变异硬化契约。除了验证覆盖率,我们报告了自动化回报和时间效率乘数,以衡量正确性调整的成本和时间节省。在九个前沿模型中,最佳系统在WorldCoder-Core上仅达到27.8%的验证覆盖率,在WorldCoder-Robust上达到19.9%,失败主要由状态模式漂移和交互链断裂主导,而非缺失场景元素。效用指标进一步表明,廉价或快速的模型在较简单的领域仍能提供显著价值。WorldCoder-Bench可在https://anonymous.4open.science/r/WorldCoder-Bench/获取。

英文摘要

Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds from natural language. Browser-native 3D, commonly built with Three.js, is a natural next frontier: generated programs must integrate assets, obey spatial and physical constraints, and keep user-facing controls synchronized with hidden runtime state. Existing web-generation benchmarks and evaluators, however, largely observe only pixels or DOM nodes, while the mechanics of a Three.js world unfold inside an opaque <canvas>. We introduce WorldCoder-Bench, a benchmark for autonomous, physically grounded 3D world synthesis. WorldCoder-Bench contains 2,026 expert-curated tasks across Simulation, Rendering, and Application scenarios, with optional .glb assets and hidden behavioral contracts. We further propose StateProbe, an execution-based protocol that probes generated programs in a sandboxed browser and verifies hidden, mutation-hardened contracts over runtime states and transitions. Beyond verification coverage, we report Return on Automation and Time Efficiency Multiplier to measure correctness-adjusted cost and time savings. Across nine frontier models, the best system reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures dominated by state-schema drift and broken interaction chains rather than missing scene elements. Utility metrics further show that cheap or fast models can still provide substantial value on easier domains. WorldCoder-Bench is available at https://anonymous.4open.science/r/WorldCoder-Bench/.

2606.01736 2026-06-09 cs.CL cs.AI 版本更新

Argument Collapse: LLMs Flatten Long-Form Public Debate

论点坍缩:LLMs 扁平化长篇公共辩论

Yekyung Kim, Yapei Chang, Chau Minh Pham, Mohit Iyyer

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 研究大型语言模型在生成公共辩论文本时导致论点坍缩的现象,即不同模型生成的论文在主要论点、子论点和段落结构上趋于收敛,通过对比人类与LLM生成文本发现LLM的论点多样性显著降低。

详情
AI中文摘要

随着LLMs越来越多地被用于起草面向公众的论点,它们可能通过反复引入相同的、经过修饰的、看似合理的论点来扁平化公共辩论。我们研究了论点坍缩,即不同LLMs生成的论文倾向于收敛到更小的主要论点、子论点和段落级结构集合。我们比较了来自195场《纽约时报》辩论的1,039个人类回复、来自61场更长形式的《波士顿评论》论坛的448个人类回复以及23,384篇LLM生成的论文。在《纽约时报》语料库中,65.3%的人类主要论点在辩论中是唯一的,而LLM主要论点中这一比例为3.4%。要求LLMs生成多样化的答案会增加变异性,但一个典型模型只能恢复大约一半的不同人类主要论点,且增加的变异性大多落在观察到的人类论点空间之外。坍缩也出现在子论点中,在具有相同主要论点的论文中,41.0%的人类子论点是唯一的,而LLM回复中这一比例为9.1%。定性上,LLMs经常重复使用泛化和模糊的子论点,而人类更喜欢更具体和针对主题的子论点。在结构上,LLM生成的论文倾向于遵循更固定的弧线,通常以直接主张开头并迅速转向提议。同样的模式在更长的《波士顿评论》论文中也成立,表明论点坍缩不仅限于短篇回复。

英文摘要

As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished, plausible arguments. We study argument collapse, the tendency of essays generated by different LLMs to converge to a smaller set of main arguments, sub-arguments, and paragraph-level structures. We compare 1,039 human responses from 195 New York Times (NYT) debates, 448 human responses from 61 longer-form Boston Review (BR) forums, and 23,384 LLM-generated essays. In the NYT corpus, 65.3% of human main arguments are unique within a debate, compared to 3.4% of LLM main arguments. Asking LLMs to generate diverse answers adds variation, but a typical model recovers only about half of the distinct human main arguments, with much of the added variation falling outside the observed human argument space. Collapse also appears in sub-arguments, where among essays with the same main argument, 41.0% of human sub-arguments are unique versus 9.1% from LLM responses. Qualitatively, LLMs often reuse generalized and hedged sub-arguments, while humans prefer more concrete and topic-specific ones. Structure-wise, LLM-generated essays tend to follow a more fixed arc, often opening with a direct claim and moving quickly toward proposals. The same patterns hold in longer BR essays, suggesting that argument collapse extends beyond short-form responses.

2606.01637 2026-06-09 cs.CL cs.AI 版本更新

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

误导比纠正更容易:LLM 从众中的有害与有益修正

Jiaming Qu, Lucheng Fu, Yibo Hu

发表机构 * Amazon(亚马逊) Georgia Institute of Technology(佐治亚理工学院) Illinois Institute of Technology(伊利诺伊理工学院)

AI总结 通过控制实验,研究大语言模型在多智能体系统中面对同伴答案时的从众行为,发现同伴一致意见更容易误导原本正确的模型,而权威标签使模型更倾向于选择被认可的答案,且通用推理干预无法可靠地减少有害修正。

详情
AI中文摘要

大语言模型越来越多地用于多智能体系统,在这些系统中,它们会看到并回应其他智能体的答案。一个关键风险是从众:模型可能仅仅因为其他人同意不同的答案而放弃自己的答案。先前的研究表明,LLM 经常向多数答案修正,但仍不清楚这些修正帮助纠正错误的频率是否与引入新错误的频率相当。在本文中,我们进行了一项受控研究,其中 LLM 首先回答一个问题,然后在做出最终决定之前看到模拟的同伴回应。我们操纵两个社会线索:共识结构和分配给同伴的权威标签,并测量它们如何影响有益和有害的修正。在四个开放权重的 LLM 和七个问答数据集上,我们发现同伴一致意见使得误导原本正确的模型比纠正原本错误的模型容易得多。权威标签使模型更可能选择被认可的答案,无论其是否正确。更令人担忧的是,通用的推理干预(如思维链和反思)并不能在保留有益修正的同时可靠地减少有害修正。这些发现表明,多智能体 LLM 系统应该验证同伴答案,而不是简单地聚合它们。

英文摘要

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.
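
The distinction between harmful and beneficial revision can be made concrete with a small calculation. The sketch below assumes one plausible definition, which the abstract does not spell out: the harmful rate is the share of initially correct answers that end up wrong after seeing peers, and the beneficial rate is the share of initially wrong answers that end up correct. The function name `revision_rates` and the sample records are hypothetical.

```python
def revision_rates(records):
    """Return (harmful_rate, beneficial_rate).

    records: iterable of (initial_correct, final_correct) boolean pairs,
             one pair per question answered before and after peer exposure.
    """
    records = list(records)
    init_right = [final for initial, final in records if initial]
    init_wrong = [final for initial, final in records if not initial]
    # Harmful: was correct, became wrong after seeing peer responses.
    harmful = (
        sum(1 for final in init_right if not final) / len(init_right)
        if init_right else 0.0
    )
    # Beneficial: was wrong, became correct after seeing peer responses.
    beneficial = (
        sum(1 for final in init_wrong if final) / len(init_wrong)
        if init_wrong else 0.0
    )
    return harmful, beneficial


# Made-up records for illustration only, not data from the paper.
sample = [
    (True, False), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]
harmful_rate, beneficial_rate = revision_rates(sample)
```

Conditioning each rate on the initial correctness keeps the two quantities comparable even when a model starts out right far more often than wrong.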

2606.01619 2026-06-09 cs.AI cs.LG stat.ML 版本更新

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

ReSkill:在智能体强化学习中协调技能创建与策略优化

Zelin He, Haotian Lin, Boran Han, Wei Zhu, Haoyang Fang, Bernie Wang, Xuan Zhu, Runze Li, Matthew Reimherr

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出ReSkill框架,通过GRPO的组结构嵌入断言驱动技能创建、组内轨迹采样和自适应汤普森采样,实现技能与策略的协同进化,在多个领域超越现有方法。

详情
AI中文摘要

智能体强化学习使LLM智能体能够从环境奖励中持续改进,但由此产生的策略并未系统地积累可跨任务泛化的可重用策略。模块化技能可以提供此类可重用策略,然而现有的技能增强强化学习方法将技能创建与策略优化分离,存在采用与进化策略冲突的技能的风险。受Anthropic的Skill Creator启发,我们引入ReSkill,一种强化学习在环的技能创建框架,协调技能进化与策略学习。ReSkill利用GRPO的组结构自然嵌入三种机制,仅需少量额外开销:(1)断言驱动的技能创建器,从过去经验中诊断失败并提出基于条件的触发式技能修订;(2)组内轨迹采样,实现技能版本的可控比较,捕获哪个版本最能支持策略的持续学习;(3)自适应折扣的汤普森采样,在策略进化过程中平衡技能版本选择的探索与利用。在多个领域,ReSkill始终优于现有的基于记忆和技能的强化学习方法,在未见任务上提升最大。对技能生命周期的分析显示,随着策略改进,技能被自动创建、测试、精炼和修剪,展示了协调的技能-策略协同进化。

英文摘要

Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can provide such reusable strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization, risking adopting skills that conflict with the evolving policy. Inspired by Anthropic's Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to naturally embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy's ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.
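
Mechanism (3) can be illustrated with a Beta-Bernoulli Thompson sampler whose evidence decays over time, so that older outcomes count less as the policy changes. This is a sketch under stated assumptions, not ReSkill's implementation: the class name `DiscountedThompson`, the binary reward, and the fixed `discount` factor are hypothetical, whereas the paper describes an adaptive discount.

```python
import random


class DiscountedThompson:
    """Thompson Sampling over skill versions with exponentially decayed evidence."""

    def __init__(self, n_versions, discount=0.9, seed=0):
        # Beta(1, 1) prior for the success probability of each version.
        self.alpha = [1.0] * n_versions
        self.beta = [1.0] * n_versions
        self.discount = discount
        self.rng = random.Random(seed)

    def select(self):
        # Draw one sample per version and pick the largest draw.
        draws = [
            self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, version, reward):
        # Shrink accumulated evidence toward the prior for every version,
        # then add the new observation (reward in [0, 1]) to the chosen one.
        for i in range(len(self.alpha)):
            self.alpha[i] = 1.0 + self.discount * (self.alpha[i] - 1.0)
            self.beta[i] = 1.0 + self.discount * (self.beta[i] - 1.0)
        self.alpha[version] += reward
        self.beta[version] += 1.0 - reward

    def mean(self, version):
        # Posterior mean success probability of a version.
        return self.alpha[version] / (self.alpha[version] + self.beta[version])
```

A typical loop would call `select()` to choose the skill version for a rollout group and `update()` with the observed outcome; a smaller `discount` forgets old evidence faster and so re-explores more after the policy shifts.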

2606.01546 2026-06-09 cs.LG 版本更新

Flexible Online Representation Learning Based on Similarity Matching

基于相似性匹配的灵活在线表示学习

Shagesh Sridharan, Yanis Bahroun, Anirvan M. Sengupta

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种基于相似性匹配的在线生物合理学习算法,能够学习稀疏移位不变表示,适用于聚类、流形平铺或稀疏编码。

详情
Comments
6 pages, 3 figures. Originally accepted to IJCNN 2023 but not presented owing to visa issues
AI中文摘要

稀疏高维表示有助于在无监督数据探索中发现非平凡结构。这种表示可以处理与社区检测问题相关的图中的密集连接。然而,稀疏高维表示还能做更多事情,包括流形平铺和特征学习。传统算法在计算上难以处理的完全正矩阵空间中进行优化,或者将问题松弛到双非负矩阵空间,这些矩阵的规模随样本大小增长,使得它们对大数据集不实用。其中一些方法还施加了行和约束,例如双随机性。在流形平铺的背景下,行和约束具有平移不变性的额外优势。对输出相似性矩阵的行和约束需要非平凡的在线学习规则。针对这些需求,我们提出了一种通用的在线生物合理学习算法,能够学习稀疏移位不变表示,根据数据结构,可用于聚类、流形平铺或稀疏编码。

英文摘要

Sparse high-dimensional representations are conducive to uncovering nontrivial structures in unsupervised exploration of data. Such a representation can deal with the dense connectivity in graphs relevant to community detection problems. However, sparse high-dimensional representations are capable of doing more, including manifold tiling and feature learning. Conventional algorithms optimize in the space of computationally intractable completely positive matrices or relax the problem to the space of doubly nonnegative matrices that scale with sample size in a way rendering them impractical for large data sets. Some of these methods also impose a row sum constraint, such as double stochasticity. Row sum constraints have the added advantage of being shift-invariant, in the context of manifold tiling. Constraints on the row sum of output similarity matrices require nontrivial online learning rules. Addressing these needs, we propose a versatile online biologically plausible learning algorithm capable of learning sparse shift-invariant representations, useful for clustering, manifold tiling, or sparse coding, depending on the data structure.

2606.01478 2026-06-09 cs.RO cs.AI cs.MA cs.SY eess.SY 版本更新

Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

Crazyflow: 基于JAX的精确、GPU加速、可微分的无人机模拟器

Martin Schuck, Marcel P. Rath, Yufei Hua, Abhishek Goudar, SiQi Zhou, Angela P. Schoellig

发表机构 * Technical University of Munich(慕尼黑工业大学) University of Toronto(多伦多大学) Simon Fraser University(西蒙弗雷泽大学)

AI总结 提出Crazyflow模拟器,通过GPU加速和可微分设计,实现单机超高速仿真、数千架无人机集群模拟,并支持基于解析梯度的策略学习与采样避障,甚至能在0.38秒内从零训练飞行恢复策略。

详情
Comments
Fix minor metadata mistakes
AI中文摘要

来自仿真的高质量、大规模合成数据正成为推动机器人算法能力提升的基石。虽然空中机器人模拟器已独立发展出支持保真度、可微分性和集群等专门需求,但缺少一个能够跨所有领域合成数据的统一平台。在这项工作中,我们提出了Crazyflow,一个旨在突破空中机器人算法开发极限的模拟器,涵盖从基于模型到数据驱动的方法、从基于梯度到基于采样的方法、以及从单智能体到多智能体系统。与现有最先进的无人机模拟器相比,它实现了单个无人机超过一个数量级的速度提升,并能模拟数千个包含4000架无人机的集群。真实世界实验表明,Crazyflow既支持基于解析梯度的策略学习(无需域随机化即可实现亚厘米级轨迹跟踪精度),也支持每秒超过5亿步的采样避障。打破传统的先训练后部署范式,我们展示了其前所未有的速度甚至能够实现飞行中的强化学习:通过将物理无人机抛向空中,在0.38秒内从零开始训练恢复策略,成功稳定了无人机。Crazyflow支持多级仿真抽象,直接兼容所有开源Crazyflie模型,并通过提供轻量级系统辨识流程,支持跨自定义无人机平台和应用的快速重新配置。通过同时推动精度、速度和可微分性,Crazyflow作为合成数据生成的开源资源,具备在线执行学习和优化的大规模并行化新兴能力,为新型算法开发打开了大门。

英文摘要

High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.

2606.01379 2026-06-09 cs.LG 版本更新

Turning Back Without Forgetting: Selective Backward Refinement for Parameter-Efficient Continual Learning

在不遗忘的情况下回溯:面向参数高效持续学习的选择性反向精炼

Anushka Tiwari, Kaiyi Ji

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出SABER框架,通过基于提示梯度几何和损失分布相似性的任务相关性准则,在提示型参数高效持续学习中实现受控的正向反向知识迁移,无需重放。

详情
Comments
Accepted at ICML 2026
AI中文摘要

虽然基于提示的参数高效持续学习通过隔离任务特定提示来缓解灾难性遗忘,但这种隔离也限制了后续任务改进先前任务,导致反向知识迁移未被充分探索。我们通过提出选择性反向精炼以实现正向反向知识迁移(SABER)来解决这一限制,这是一个无需重放的框架,能够在基于提示的持续学习中实现受控的反向迁移。SABER利用基于提示梯度几何和损失分布相似性的互补任务相关性准则,判断何时进行反向精炼有益,并通过将更新限制在提示参数空间中的非干扰方向来安全执行精炼。在多个持续学习基准和不同预训练骨干网络(包括T5-Large、LLaMA和Qwen)上的大量实验表明,SABER在保持强大整体平均性能的同时,持续实现正向反向迁移。代码可在https://github.com/OptMN-Lab/SABER-ICML-2026/获取。

英文摘要

While prompt-based parameter-efficient continual learning mitigates catastrophic forgetting by isolating task-specific prompts, this isolation also limits later tasks from improving earlier ones, leaving backward knowledge transfer underexplored. We address this limitation by proposing Selective bAckward refinement for positive Backward knowledge transfER (SABER), a replay-free framework that enables controlled backward transfer in prompt-based continual learning. SABER determines when backward refinement is beneficial using complementary task-correlation criteria based on prompt-gradient geometry and loss-distribution similarity, and how to perform refinement safely by restricting updates to non-interfering directions in the prompt parameter space. Extensive experiments across multiple continual learning benchmarks and diverse pretrained backbones, including T5-Large, LLaMA, and Qwen, demonstrate that SABER consistently achieves positive backward transfer while maintaining strong overall average performance. Code is available at https://github.com/OptMN-Lab/SABER-ICML-2026/.

2606.01304 2026-06-09 cs.LG 版本更新

When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval

当硬负例有害时:弥合检索中硬负例生成的生成-判别鸿沟

Zhicheng Zhang, Jiwei Tang, Kuicai Dong, Xiaopeng Li, Jieming Zhu, Jingyu Li, Qianhui Zhu, Fengyuan Lu, Wang Jiaheng, Gang Wang, Hai-Tao Zheng, Zhaocheng Du

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Huawei Technologies Co., Ltd.(华为技术有限公司) City University of Hong Kong(香港城市大学) School of Cyber Science and Technology, Sun Yat-sen University(中山大学网络空间安全学院) School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院) The Hong Kong University of Science and Technology(香港科技大学) Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 针对检索中硬负例生成存在的生成-判别鸿沟问题,提出CausalNeg方法,通过CoT引导的反事实扰动和查询视角熵最大化来提升检索性能。

详情
Comments
Accepted at KDD 2026
AI中文摘要

硬负例挖掘已成为训练检索器的主流策略,但它面临内在局限性:负例受限于语料库可用性,由检索器分数而非诊断价值选择,并且随着检索器改进,假阳性污染日益严重。基于LLM的合成提供了一种原则性替代方案,其中负例不受约束、具有针对性且无假阳性风险。但我们表明,将生成的负例天真地融入对比学习通常会降低检索性能。我们识别并形式化根本原因为生成-判别鸿沟:LLM生成优化流畅、合理的文本,而对比学习要求在决策边界处进行战略性的相关性违反。我们的分析揭示了两种复合失败模式:判别无关生成,即LLM缺乏对查询信息需求的显式模型,默认生成通用或主题漂移的文本,不提供对比信号;以及源依赖捷径,即分布性伪影使模型能够根据来源而非相关性区分负例,导致梯度漂移,积极破坏优化。为弥合这一鸿沟,我们提出CausalNeg,包含两个主要模块:(1) CoT引导的反事实扰动用于数据构建:将文档满足查询的原因分解为显式信息需求,然后精确违反个别需求以构建具有可控、可解释硬度的负例。(2) 训练期间的查询视角熵最大化:将生成的负例分散到相似度谱中,最小化源身份与相似度分数之间的互信息,以抑制捷径利用。我们在https://github.com/mzhangzhicheng/CausalNeg公开代码。

英文摘要

Hard negative mining has become the dominant strategy for training retrievers, yet it faces intrinsic limitations: negatives are bounded by corpus availability, selected by retriever score rather than diagnostic value, and increasingly contaminated by false positives as the retriever improves. LLM-based synthesis offers a principled alternative, where negatives are unconstrained, targeted, and free from false positive risk. However, we show that naively incorporating generated negatives into contrastive learning often degrades retrieval performance. We identify and formalize the root cause as a generative-discriminative gap: LLM generation optimizes for fluent, plausible text, while contrastive learning demands strategic violations of relevance at the decision boundary. Our analysis reveals two compounding failure modes: discriminative-agnostic generation, where the LLM lacks an explicit model of query information needs and defaults to generic or topic-drifted text that provides no contrastive signal; and source-dependent shortcuts, where distributional artifacts enable the model to distinguish negatives by origin rather than relevance, causing gradient drift that actively corrupts optimization. To close this gap, we propose CausalNeg, which consists of two main modules: (1) CoT-guided counterfactual perturbation for data construction, which decomposes why a document satisfies a query into explicit information requirements, then surgically violates individual requirements to construct negatives with controlled, interpretable hardness; and (2) query-view entropy maximization during training, which disperses generated negatives across the similarity spectrum, minimizing the mutual information between source identity and similarity scores to suppress shortcut exploitation. We make our code publicly available at https://github.com/mzhangzhicheng/CausalNeg.

2606.01205 2026-06-09 cs.RO 版本更新

ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning

ImagineUAV:通过世界-动作建模和动力学规划实现空中视觉语言导航

Xuchen Liu, Jiawei Huang, Shihao Xia, Bingxi Liu, Jinqiang Cui, Jiankun Yang

发表机构 * Pengcheng Laboratory(鹏城实验室) School of Computer Science and Cyber Engineering, Guangzhou University(广州大学计算机科学与网络工程学院) Southern University of Science and Technology(南方科技大学)

AI总结 针对无人机视觉语言导航中几何不一致和动力学失配问题,提出基于潜视频扩散模型的世界-动作建模框架,通过生成未来观测推断6自由度运动并规划无碰撞轨迹,以1.3B参数在基准和实际飞行中超越先前方法。

详情
Comments
Video demo: https://www.youtube.com/watch?v=Ng1alP0yhc0
AI中文摘要

无人机的视觉语言导航(VLN)要求在部分可观测条件下将自由形式的指令接地到6自由度飞行中。虽然视觉-语言-动作(VLA)模型在语义推理方面表现出色,但由于几何不一致和动力学失配,它们存在脆弱性。为了解决这个问题,我们提出了ImagineUAV,一个利用级联世界-动作建模的想象驱动框架。ImagineUAV不是直接回归,而是采用潜视频扩散模型生成指令条件下的未来观测,明确想象环境演化,然后通过动作提取器推断6自由度运动。动力学规划器将这些估计优化为无碰撞轨迹。此外,步骤蒸馏推理流水线确保实时执行。仅凭1.3B参数,ImagineUAV在基准测试和实际飞行中优于先前的VLN和VLA基线,验证了想象驱动空中导航的实用性。

英文摘要

Vision-language navigation (VLN) for UAVs demands grounding free-form instructions into 6-DoF flight under partial observability. While Vision-Language-Action (VLA) models excel at semantic reasoning, they suffer from brittleness due to geometric inconsistency and dynamics mismatch. To address this, we propose ImagineUAV, an imagination-driven framework leveraging cascaded world-action modeling. Instead of direct regression, ImagineUAV employs a latent video diffusion model to generate instruction-conditioned future observations, explicitly imagining environmental evolution, from which 6-DoF motions are inferred via an action extractor. A kinodynamic planner then refines these estimates into collision-free trajectories. Additionally, a step-distilled inference pipeline ensures real-time execution. With only 1.3B parameters, ImagineUAV outperforms prior VLN and VLA baselines on benchmarks and real-world flights, validating the practicality of imagination-driven aerial navigation.

2606.01060 2026-06-09 cs.CL cs.AI cs.LG 版本更新

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

MENTIS: 对齐改变了什么信念?语言模型中多尺度潜在扭转的测量

Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Pragya Lab, BITS Pilani Goa, India(印度BITS Pilani果阿校区Pragya实验室) IIIT Delhi, India(印度德里Indraprastha信息技术学院) Amazon, USA(美国亚马逊) Meta, USA(美国Meta) Apple, USA(美国苹果)

AI总结 提出MENTIS框架,通过层间协方差扭转范数、谱扭转诊断和能量-辐射-激活度量,测量偏好对齐在语言模型内部计算中引起的选择性、深度局部的几何结构变化。

详情
Comments
Submitted to EMNLP 2026
AI中文摘要

偏好对齐显著改善了大语言模型的可观察行为,但尚不清楚对齐在内部改变了什么。对齐系统在越狱、提示注入和检索时损坏下仍然失败,表明仅行为级评估是不完整的。后训练应在内部计算中留下可测量的痕迹。我们问:当指令微调(IT)模型变为偏好对齐(PA)模型时,哪些几何结构发生了变化,这些变化集中在何处,以及它们在不同概念、提示和模型家族中的选择性如何? 我们引入MENTIS,一个几何优先的框架,用于测量配对检查点中对齐引起的内部重组。MENTIS使用基于层间协方差的主扭转范数(T1)、辅助谱扭转诊断(T2)和用于深度定位的能量-辐射-激活度量(ERA)来比较IT和PA模型。在LITMUS上的四个7-8B模型对中,我们的研究表明对齐引起的变化是选择性的而非均匀的:规范性概念平均表现出比事实性概念更大的扭转偏移;扭转与上下文熵负相关;峰值效应定位于架构特定的中后层。相同的模式出现在词级、提示级和模型级分析中。这些结果表明偏好对齐在内部计算中留下了结构化的、深度局部的几何特征,超越了仅行为级评估所能揭示的内容。

英文摘要

Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting behavior-level evaluation alone is incomplete. Post-training should leave measurable traces in internal computation. We ask: when an instruction-tuned (IT) model becomes a preference-aligned (PA) model, what geometric structure changes, where do those changes concentrate, and how selectively do they vary across concepts, prompts, and model families? We introduce MENTIS, a geometry-first framework for measuring alignment-induced internal reorganization in paired checkpoints. MENTIS compares IT and PA models using a primary layerwise covariance-based torsion norm (T1), a secondary spectral torsion diagnostic (T2), and an Energy-Radiance-Activation measure (ERA) for depth localization. Across four 7-8B model pairs on LITMUS, our study reveals that alignment-induced change is selective rather than uniform: normative concepts exhibit larger torsion shifts than factual concepts on average; torsion is negatively correlated with contextual entropy; and peak effects localize to architecture-specific mid-to-late layers. The same pattern appears across word-level, prompt-level, and model-level analyses. These results suggest preference alignment leaves structured, depth-localized geometric signatures in internal computation beyond what behavior-level evaluation alone can reveal.

2606.00967 2026-06-09 cs.CV 版本更新

MedSyn2: Flexible Control of 3D CT Generation via Text and Semantically-Defined Segmentation Prompts

MedSyn2: 通过文本和语义定义的分割提示灵活控制3D CT生成

Weicheng Dai, Chenyu Wang, Binxu Li, Shantanu Ghosh, Afrooz Zandifar, Christina LeBedis, Kayhan Batmanghelich

发表机构 * Boston University School of Engineering(波士顿大学工程学院) Stanford University(斯坦福大学) University of Pittsburgh Medical Center(匹兹堡大学医学中心) Boston University School of Medicine(波士顿大学医学院)

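AI summary: Proposes a flexible multimodal framework that controls 3D CT generation through text and optional segmentation prompts, producing high-resolution, anatomically consistent, and controllable volumetric data.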

Abstract

Generative models for volumetric medical images have found many applications in medical imaging, ranging from data augmentation to serving as priors for inverse problems. For these applications, generating high-resolution 3D images with strong controllability is essential but remains highly challenging. Existing approaches typically control generation either through radiology reports used as text prompts or through full image segmentation. While text-based prompting is flexible, it provides limited spatial control over the location, shape, and boundary of abnormalities. In contrast, segmentation-based methods receive precise spatial guidance but are restrictive in requiring full-organ annotations. In this work, we propose a flexible multimodal framework for controllable volumetric image generation that supports input from radiology reports and segmentation prompts (both optional). Our approach allows users to provide segmentation of a specific anatomy or abnormality without requiring full-organ annotations. The semantic meaning of the segmentation mask is specified through an accompanying text description, resulting in a highly flexible and scalable conditioning mechanism. We develop a memory-efficient architecture based on a modified diffusion transformer that jointly processes image and segmentation tokens. The model further incorporates gated attention to effectively attend to long radiology reports. Experiments demonstrate that our method achieves state-of-the-art perceptual and semantic scores (e.g., 24% relative improvement in mean FID), generates high-resolution anatomically consistent CT volumes, and improves data efficiency when used for data augmentation. Radiologists' evaluation further confirms strong alignment between generated and real medical images.

2606.00827 2026-06-09 cs.LG cs.AI (updated version)

Beyond Independent Manipulation: Individual Fairness-aware Strategic Classification with Peer Imitation

Xinpeng Lv, Chunyuan Zheng, Yunxin Mao, Renzhe Xu, Jinxuan Yang, Yuanlong Chen, Wangrong Huang, Shaowu Yang, Wenjing Yang, Xinwang Liu, Peng Cui, Haotian Wang

Affiliations: College of Computer Science and Technology, National University of Defense Technology; School of Mathematical Sciences, Peking University; Institute for Theoretical Computer Science, Shanghai University of Finance and Economics; Information Technology Development, Aetos Capital Group, Sydney; Faculty of Computing, Harbin Institute of Technology; Department of Computer Science and Technology, Tsinghua University

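AI summary: Proposes the individual fairness-aware strategic classification (IFSC) framework, which models peer-driven manipulation arising from individual fairness (imitating nearby accepted peers) and uses a robust learning process to handle uncertainty in peer observability, improving individual-fairness consistency and mitigating imitation-induced distortions.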

Comments
Accepted by SIGKDD2026
Abstract

Strategic classification (SC) investigates scenarios where agents manipulate their features to obtain favorable decisions from predictive models. Existing fairness-aware SC approaches primarily focus on group fairness and typically assume that agents respond independently. However, when individual fairness is required, ensuring similar individuals receive similar outcomes, agents' manipulation becomes interdependent: an agent's preferred manipulation depends on the neighborhoods' outcomes. This induces a mismatch between classical SC formulations and fairness-aware decision settings, where independent models no longer accurately characterize strategic manipulations. To address this issue, we introduce individual fairness-aware strategic classification (IFSC), a framework that models peer-driven manipulation arising from individual fairness, where agents imitate nearby positively decided peers to obtain favorable outcomes. IFSC characterizes strategic manipulation as similarity-based imitation toward visible accepted peers and learns classifiers under the resulting post-manipulation distributions. To account for uncertainty in peer observability, IFSC employs a robust learning process that introduces stochastic perturbations during manipulation simulation. Experiments on synthetic and real-world datasets demonstrate that IFSC improves individual-fairness consistency and mitigates imitation-induced distortions.
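
The peer-imitation idea in this abstract can be made concrete with a small sketch. This is an illustration only, not the IFSC algorithm from the paper: the function name `simulate_peer_imitation`, the Euclidean similarity, and the parameters `radius`, `step`, and `visibility` are all assumptions introduced here. Each rejected agent moves part of the way toward its nearest accepted peer, and peer observability is randomized to mimic the stochastic perturbation the abstract mentions.

```python
import numpy as np

def simulate_peer_imitation(X, accepted, radius=1.0, step=0.5, visibility=0.8, seed=0):
    """Hypothetical sketch: rejected agents imitate their nearest visible accepted peer.

    X          : (n, d) feature matrix
    accepted   : (n,) boolean mask of positively decided agents
    radius     : assumed maximum distance at which a peer counts as "nearby"
    step       : assumed fraction of the gap an agent closes when imitating
    visibility : assumed probability that a given accepted peer is observed
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    accepted = np.asarray(accepted, dtype=bool)
    X_new = X.copy()
    peers = np.flatnonzero(accepted)

    for i in np.flatnonzero(~accepted):
        # Stochastic observability: each accepted peer is seen independently.
        visible = peers[rng.random(peers.size) < visibility]
        if visible.size == 0:
            continue  # no peer observed, so the agent does not manipulate
        dists = np.linalg.norm(X[visible] - X[i], axis=1)
        j = int(np.argmin(dists))
        if dists[j] <= radius:
            # Move part of the way toward the nearest visible accepted peer.
            X_new[i] = X[i] + step * (X[visible[j]] - X[i])
    return X_new

if __name__ == "__main__":
    X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
    accepted = [False, True, False]
    # With full visibility, agent 0 moves halfway to agent 1; agent 2 is too far away.
    print(simulate_peer_imitation(X, accepted, radius=1.5, step=0.5, visibility=1.0))
```

A classifier would then be trained on the post-manipulation features returned by such a simulation; how IFSC actually does this is described in the paper, not here.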

2606.00793 2026-06-09 cs.CV (updated version)

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang, Chensheng Dai, Min Chen, Yifan Li, Yuxin Li, Yingjie Chen, Hao Liu, Chen Li, Jing Lyu, Yueqi Duan

Affiliations: Tsinghua University; WeChat Vision, Tencent Inc.; Peking University

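AI summary: Proposes the MBench benchmark, which systematically evaluates the long-term memory capability of video world models along three core dimensions (entity consistency, environment consistency, and causal consistency) and their 12 sub-dimensions, and reveals critical limitations of existing methods in long-term state retention.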

Comments
Project Page: https://peanutup.github.io/MBench-project/
Abstract

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative matrices and VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.