arXivDaily: arXiv daily academic digest, updated Monday through Friday

2605.24727 2026-06-02 cs.AI cs.CL cs.CY cs.IT math.IT

Fundamental Limitation in Explaining AI

Atsushi Suzuki, Jing Wang

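Affiliations: Department of Mathematics, Faculty of Science, The University of Hong Kong, Hong Kong SAR; School of Computing and Mathematical Sciences, Faculty of Engineering and Science, University of Greenwich, United Kingdom

AI summary: The paper mathematically proves a fundamental quadrilemma in explaining AI: an AI and its explanation cannot simultaneously satisfy four conditions (a complex environment, good AI performance, an interpretable explanation, and a completely faithful explanation). It follows that AI governance should be designed on the premise that explanation faithfulness is always incomplete.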

Comments: minor modifications

Abstract

While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importance of explainability in AI. Existing methods for explaining AI, however, are not designed to provide completely faithful explanations of the behavior of large-scale AI systems. Although a completely faithful and interpretable explanation of the behavior of an AI system might be useful for AI governance, it has not been known whether providing such an explanation is theoretically possible. In this paper, we mathematically prove a fundamental quadrilemma in explaining AI, stating that AI and its explanation cannot satisfy the following four conditions simultaneously: 1) the complexity of the operation environment, 2) the goodness of the AI's performance, 3) the interpretability of the AI's explanation, and 4) the complete faithfulness of the AI's explanation. This quadrilemma suggests that, in most applications where we cannot change the environment or sacrifice good AI performance and an interpretable explanation, we should give up complete faithfulness of explanations and should instead aim to explain only the parts that are important for applications. As a consequence, the quadrilemma implies that AI governance should be designed on the premise that the faithfulness of AI explanations is always incomplete.

2605.24716 2026-06-02 cs.CV eess.SP

Physics-Guided Self-Supervised Statistical Residual Learning for Sonar Despeckling with Improved Generalization

Swapna Pillai, Siddharth Singh Savner, Sujit Kumar Sahoo

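Affiliations: School of Electrical Sciences, Indian Institute of Technology Goa; Inria, Sophia Antipolis, France

AI summary: Proposes a physics-guided self-supervised framework that constrains residual consistency in the homomorphic log domain and combines a variance-targeted statistical loss, edge-aware regularization, and median-guided curriculum learning to despeckle sonar images without clean supervision, achieving state-of-the-art performance on multiple real datasets with cross-dataset robustness.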

Journal ref: IEEE Signal Processing Letters, Early Access, pp. 1-5, 2026

Abstract

This letter introduces a physics-informed self-supervised framework for sonar image despeckling that reformulates despeckling as residual consistency in the homomorphic log domain. By constraining the log-ratio residual to obey multiplicative speckle statistics, the proposed method eliminates the need for clean supervision while preventing degenerate identity solutions. A variance-targeted statistical loss combined with edge-aware structural regularization and median-guided curriculum stabilization enables effective speckle suppression with preserved structural fidelity. This formulation along with a lightweight neural network achieves state-of-the-art performance across multiple real sonar datasets and demonstrates excellent cross-dataset robustness, while remaining suitable for real-time deployment.
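
As an illustrative sketch only (not the authors' implementation), the residual-consistency idea can be written in a few lines of Python. It assumes single-look intensity speckle, modeled as unit-mean exponential noise whose logarithm has variance π²/6. The function names, the `target_var` parameter, and the synthetic signal are all hypothetical. The sketch shows why a variance target discourages the identity solution: when the estimate equals the input, the log-ratio residual is zero and the loss is large, whereas a clean estimate leaves a residual whose variance matches the assumed speckle statistics.

```python
import math
import random
import statistics

def log_ratio_residual(noisy, estimate, eps=1e-8):
    """Per-pixel residual log(noisy) - log(estimate) in the log domain."""
    return [math.log(y + eps) - math.log(x + eps) for y, x in zip(noisy, estimate)]

def variance_target_loss(noisy, estimate, target_var):
    """Squared gap between the residual variance and an assumed speckle log-variance."""
    residual = log_ratio_residual(noisy, estimate)
    return (statistics.pvariance(residual) - target_var) ** 2

random.seed(0)
# Hypothetical smooth 1-D "scene"; a real sonar image would be 2-D.
clean = [1.0 + 0.5 * math.sin(i / 10.0) for i in range(5000)]
# Assumption: multiplicative, unit-mean exponential speckle (single-look intensity).
noisy = [c * random.expovariate(1.0) for c in clean]
target = math.pi ** 2 / 6  # variance of the log of a unit-mean exponential variable

loss_identity = variance_target_loss(noisy, noisy, target)  # estimate == input
loss_clean = variance_target_loss(noisy, clean, target)     # ideal estimate
```

Here `loss_identity` equals the squared target variance because the residual is identically zero, while `loss_clean` is close to zero. The edge-aware regularization and the median-guided curriculum mentioned in the abstract are not modeled.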

2605.24681 2026-06-02 cs.CL cs.AI

Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

Bo Li, Tianyu Dong, Shaolin Zhu, Deyi Xiong

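Affiliations: School of Software, Tsinghua University; College of Intelligence and Computing, Tianjin University

AI summary: Proposes Mix-MoE, a framework that splits MoE layers into language-model experts and machine-translation experts and enhances routing with Fourier transform features, to address parameter interference when fine-tuning large language models for multilingual machine translation.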

Comments: Accepted by TASLP

Abstract

Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents major challenges, namely parameter interference. To address these issues, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.

2605.24528 2026-06-02 cs.AI cs.CL cs.LG

Hypothesis Generation and Inductive Inference in Children and Language Models

Jeffrey Qin, Wasu Top Piriyakulkij, Zhuangfei Gao, Mia Radovanovic, Jessica Sommerville, Kevin Ellis, Marta Kryven

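Affiliations: Computer Science, University of Waterloo; Department of Computer Science, Cornell University; Department of Computer Science, Dalhousie University; Department of Psychology, University of Toronto

AI summary: Uses an inductive inference Box Task, formalized as program induction with Bayesian particle-based inference, to compare hypothesis generation and evidence seeking under uncertainty in children and LLM-based agents. The two adapt similarly to environmental structure but differ in information-seeking costs and inductive biases.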

Abstract

Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.
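
A minimal sketch of particle-style Bayesian updating with a subjective evidence-reliability parameter may help make the formalization concrete. Everything specific here is an assumption rather than the paper's model: the three hand-written hypotheses about which key opens a box, the made-up observations, and a likelihood equal to `reliability` when a hypothesis agrees with the observed outcome and `1 - reliability` otherwise. Online hypothesis generation is not sketched.

```python
def update_weights(hypotheses, weights, observation, outcome, reliability):
    """One Bayes step: reweight each hypothesis by how well it explains the outcome."""
    updated = []
    for predict, weight in zip(hypotheses, weights):
        agrees = predict(observation) == outcome
        updated.append(weight * (reliability if agrees else 1.0 - reliability))
    total = sum(updated)
    return [w / total for w in updated]

def posterior(hypotheses, evidence, reliability):
    """Start from a uniform prior and fold in the evidence sequentially."""
    weights = [1.0 / len(hypotheses)] * len(hypotheses)
    for observation, outcome in evidence:
        weights = update_weights(hypotheses, weights, observation, outcome, reliability)
    return weights

# Hypothetical candidate rules: each maps a key description to "box opens?".
hypotheses = [
    lambda key: key["color"] == "red",
    lambda key: key["shape"] == "round",
    lambda key: True,
]
# Made-up observations: (key tried, whether the box opened).
evidence = [
    ({"color": "red", "shape": "square"}, True),
    ({"color": "blue", "shape": "round"}, False),
]

trusting = posterior(hypotheses, evidence, reliability=0.9)
skeptical = posterior(hypotheses, evidence, reliability=0.6)
```

With the same evidence, the learner that treats observations as more reliable concentrates more posterior mass on the "red key" rule, which is one simple way to express the discounting of unreliable evidence.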

2605.18838 2026-06-02 cs.LG cs.AI cs.CL

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Adil Amin

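Affiliations: ZEHEN Labs

AI summary: Analyzing 63 base models, the paper finds that at a particular scale threshold the relationship between reasoning and truthfulness shifts from anticorrelated to positively correlated, and it reveals internal mechanisms such as an output-projection bottleneck and zero competing attention heads.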

Comments: 15 pages, 8 figures, 2 tables. Companion paper: "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next" (https://doi.org/10.48550/arXiv.2605.18840). Code: https://github.com/adilamin89/cape-scaling. Dashboard: https://zehenlabs.com/cape/

Abstract

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale N_c, capabilities anticorrelate (r = -0.989, p = 4 x 10^{-5} nonparametric permutation test); above it, they cooperate. N_c ~ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 to 0.830 at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). A proof-of-concept intervention confirms the bottleneck is exploitable: adding a single truth-direction vector at the identified layer corrects 60% of misaligned outputs in the tax phase with zero retraining -- a surgical, per-inference correction that requires no weight modification. Code, data, an open-source steering CLI for any open-weight model, and an interactive dashboard for phase diagnosis are released: https://zehenlabs.com/cape/.
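
The abstract reports a correlation with a nonparametric permutation p-value. As a generic sketch of that kind of statistic (not the paper's pipeline or its exact coupling definition, which the abstract does not give), the following computes a Pearson correlation between two benchmark-score lists and a two-sided permutation p-value. The scores are invented for illustration, and the function names are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Made-up scores for eight small models in one family (not data from the paper).
reasoning = [0.21, 0.25, 0.30, 0.34, 0.41, 0.47, 0.52, 0.58]
truthfulness = [0.49, 0.47, 0.44, 0.43, 0.39, 0.36, 0.34, 0.30]

r = pearson_r(reasoning, truthfulness)
p = permutation_p_value(reasoning, truthfulness)
```

In this toy family, reasoning rises as truthfulness falls, so `r` is strongly negative and `p` is small, which is the qualitative pattern the abstract describes below the critical scale.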

2603.09095 2026-06-02 cs.CL cs.CV

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai

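Affiliations: Johns Hopkins University; Amazon; New York University; Texas A&M University

AI summary: Systematically diagnoses the modality gap that multimodal large language models show when text is presented as images, finds that it stems from the models' reduced willingness to reason rather than from perception failures, and proposes a lightweight self-distillation method that closes the gap.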

Abstract

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the gap is highly sensitive to rendering choices such as font and resolution, and that natural document images often exhibit much smaller gaps, suggesting the performance difference partly reflects evaluation artifacts rather than fundamental limitations. Through a grounded-theory error analysis of over 4,000 examples, we identify the primary cause: image input alone suppresses reasoning effort, with models producing 5-19x shorter outputs that skip step-by-step computation or reasoning. The reluctance to reason, not a failure of perception or knowledge retrieval, drives the performance gap, particularly on tasks requiring multi-step reasoning. We show that a simple, lightweight on-policy self-distillation method, which fine-tunes models on their own text-mode reasoning traces paired with image inputs, closes this gap, raising image-mode accuracy to match or exceed text-mode performance with over 50% improvement, and the gains transfer to unseen benchmarks without catastrophic forgetting. Overall, our results and analyses provide a systematic understanding of the modality gap and suggest a practical path toward improving visual text understanding in multimodal language models.

2602.23179 2026-06-02 cs.LG q-bio.BM

Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models

Gal Pomerants, Yaniv Nikankin, Anja Reusch, Tomer Tsaban, Ora Schueler-Furman, Yonatan Belinkov

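Affiliations: Weizmann Institute of Science

AI summary: By analyzing how protein language models behave in masked-token prediction, the paper reveals a two-stage mechanism for detecting exact and approximate repeats: the models first build feature representations, then induction heads attend to aligned tokens across repeated segments.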

Abstract

Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work has shown that protein language models (PLMs) identify repeats, by examining their behavior in masked-token prediction. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.

2601.19597 2026-06-02 cs.LG

The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-modal Divergence

Yichao Cai, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi

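Affiliations: University of Science and Technology of China

AI summary: Using a measure-theoretic framework, the paper proves that in the large-batch limit the InfoNCE objective is consistent in value and gradient with a deterministic energy landscape, reveals a geometric bifurcation between unimodal and symmetric multimodal regimes, and shows that a cross-modal divergence term gives rise to the modality gap.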

Comments: 54 Pages, ICML 2026 (Refined document aesthetics for clearer reading)

Abstract

While InfoNCE underlies modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment-uniformity decomposition. We develop a measure-theoretic framework in which representation measures evolve on a fixed embedding manifold. In the large-batch limit, we prove value and gradient consistency, linking the stochastic objective to explicit deterministic energy landscapes and revealing a geometric bifurcation between unimodal and symmetric multimodal regimes. In the unimodal case, the intrinsic energy is strictly convex and admits a unique Gibbs equilibrium, showing that entropy acts as a tie-breaker within the aligned basin. In the multimodal case, the intrinsic geometry becomes cross-coupled and contains a persistent negative symmetric divergence term: each modality's marginal reshapes the effective landscape of the other, allowing strong pairwise alignment to coexist with a persistent modality gap. Controlled synthetic experiments and analyses of pretrained CLIP representations support these predictions. Overall, our results shift the analytical lens from pointwise discrimination to population geometry, showing that pairwise alignment alone is insufficient to control cross-modal marginal structure.
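
For readers who want the finite-batch objective that the large-batch analysis starts from, here is a small sketch of the standard InfoNCE loss with cosine similarity and in-batch negatives. It is a textbook form written for illustration; the temperature value, the function names, and the toy embeddings are assumptions, and nothing here reproduces the paper's measure-theoretic construction.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] pairs with anchors[i], other rows are negatives."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)

# Toy 2-D embeddings (hypothetical).
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = info_nce(anchors, anchors)
misaligned_loss = info_nce(anchors, [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]])
collapsed_loss = info_nce(anchors, [[1.0, 0.0]] * 3)  # all positives identical
```

Perfectly aligned pairs give a loss near zero, mismatched pairs give a large loss, and fully collapsed positives give exactly the log of the batch size, since every candidate is then equally likely.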

2605.24202 2026-06-02 cs.AI cs.LG

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Yifan Zeng, Yiran Wu, Yaolun Zhang, Wentian Zhao, Kun Wan, Qingyun Wu, Huazheng Wang

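Affiliations: Oregon State University; Pennsylvania State University; Adobe Inc.; AG2AI, Inc.

AI summary: Studies end-to-end reinforcement learning training of multi-agent LLM workflows and finds that improvements depend on workflow, task, and scale, and that policy sharing does not provide uniform stability but redistributes failure modes.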

Abstract

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

2605.24005 2026-06-02 cs.AI cs.CL

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

Yanyu Chen, Jiyue Jiang, Dianzhi Yu, Zheng Wu, Jiahong Liu, Jiaming Han, Xiao Guo, Jinhu Qi, Yu Li, Yifei Zhang, Irwin King

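Affiliations: The Chinese University of Hong Kong; Shanghai Jiaotong University; Fudan University

AI summary: To address the scarcity of high-quality process data for LLM reasoning, proposes the LC-ERD framework, which achieves self-alignment and reasoning evolution through latent logic mining and consistency-regulated reward decomposition.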

Comments: Accepted in SIGKDD 2026 Research Track

Abstract

The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse-Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework framing self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at https://github.com/LC-ERD-repo/LC-ERD.

2605.11359 2026-06-02 cs.AI physics.data-an

CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

Ming Du, Xiangyu Yin, Yanqi Luo, Dishant Beniwal, Songyuan Tang, Hemant Sharma, Mathew J. Cherukara

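Affiliations: Argonne National Laboratory; Advanced Photon Source

AI summary: Proposes CVEvolve, a zero-code autonomous agentic framework that uses multi-round search and tool integration to automatically discover algorithms for processing unstructured scientific data, outperforming baseline methods on several tasks.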

Abstract

Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analyze their data but may not have extensive computing or image-processing expertise. This barrier is especially pronounced when data are noisy, have a high dynamic range, are sparsely labeled, or are only loosely specified. We introduce CVEvolve, an autonomous agentic harness with a zero-code interface for scientific data-processing algorithm discovery. CVEvolve combines a multi-round search strategy with tools for code execution, evaluation implementation, history management, holdout testing, and optional inspection of scientific data and visual outputs. The search alternates between discovery and improvement actions, and uses lineage-aware stochastic candidate sampling to balance exploration and exploitation. We demonstrate CVEvolve on X-ray fluorescence microscopy image registration, Bragg peak detection, high-energy diffraction microscopy image segmentation, and hybrid analytical-learning-based affine registration. Across these tasks, CVEvolve discovers algorithms that improve over baseline methods, while holdout test tracking helps identify candidates that generalize better than later over-optimized alternatives. These results show that zero-code, autonomous LLM-powered algorithm development can help domain scientists turn unstructured scientific image data into practical algorithms and downstream scientific discoveries.

2212.07944 2026-06-02 cs.LG math.OC q-fin.CP q-fin.PM q-fin.ST

Variable Clustering via Distributionally Robust Nodewise Regression

Kaizheng Wang, Xiao Xu, Xun Yu Zhou

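Affiliations: Department of Industrial Engineering and Operations Research & The Data Science Institute, Columbia University

AI summary: Proposes a distributionally robust nodewise regression method that uses a convex relaxation, data-driven selection of the robust region, and an ADMM algorithm to cluster variables under a multi-factor block model, and demonstrates superior performance in numerical experiments.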

Comments: ICML 2026

Abstract

We study a multi-factor block model for variable clustering and connect it to regularized subspace clustering through a distributionally robust version of nodewise regression. To solve the latter problem, we derive a convex relaxation, provide a data-driven approach for selecting the size of the robust region, and develop an ADMM algorithm for efficient implementation. We validate our method in extensive numerical studies and demonstrate its superior performance.

2605.23500 2026-06-02 cs.CV cs.LG

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

Mario Markov, Stefan Maria Ailuro, Mohammad Mahdi, Luc Van Gool, Danda Pani Paudel

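Affiliations: INSAIT, Sofia University "St. Kliment Ohridski"

AI summary: Proposes the B-GRTO framework, which jointly optimizes a policy and a differentiable segmentation decoder through bootstrapped pre-training and group relative tool optimization, substantially improving performance on complex referring segmentation.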

Abstract

Segmentation is a fundamental task in computer vision, underpinning pixel-level scene understanding and serving as a cornerstone for applications ranging from autonomous perception to medical image analysis. For complex referring segmentation, recent methods pair large vision-language models with segmentation decoders: the former analyzes the image and prompt, while the latter predicts the target mask. Although reinforcement learning improves reasoning-intensive vision-language systems, trainable tools such as segmentation decoders are typically optimized separately with differentiable objectives, and the principled integration of such objectives into reinforcement learning remains underexplored. Thus, we introduce group relative tool optimization (GRTO), a mathematically grounded framework for jointly optimizing a policy with differentiable tool use. GRTO reuses group relative policy optimization (GRPO) rollouts to optimize the auxiliary tool objective, letting decoder gradients complement policy rewards. Further, we derive Bootstrapped-GRTO (B-GRTO), a pre-training method that cheaply bootstraps the tool, leading to faster convergence and superior performance. Across three challenging referring segmentation settings, B-GRTO results in substantial improvements over plain GRPO, matching or surpassing domain-specific state-of-the-art methods. This demonstrates the value of unifying reinforcement learning with differentiable auxiliary objectives for reasoning-intensive segmentation.
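
GRTO builds on GRPO rollouts, so a brief sketch of the group-relative advantage that GRPO is commonly described as using may be helpful: each rollout's reward is normalized by the mean and standard deviation of its own group. The scalar rewards, the `eps` stabilizer, and the function name are assumptions for illustration. How GRTO attaches the decoder objective to these rollouts is not specified in the abstract and is not shown here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout reward by the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards (for example, mask IoU) for four rollouts of one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
best_rollout = max(range(len(rewards)), key=lambda i: advantages[i])
```

Rollouts that beat their group's average get positive advantages and the rest get negative ones, so no separate value network is needed to form a baseline.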

2605.23231 2026-06-02 cs.CV

Beyond Normal References: Discriminative Few-Shot Anomaly Detection

Huan Wang, Jun Shen, Jun Yan, Guansong Pang

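Affiliations: Singapore Management University, Singapore; University of Wollongong, Australia

AI summary: Proposes the IDEAL framework, which uses intrinsic deviation learning to exploit both normal and anomalous references, suppressing normal variations and extracting discriminative deviation vectors so that few-shot anomaly detection generalizes.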

Comments: 31 pages, 7 figures

Abstract

This paper considers a practical few-shot anomaly detection (FSAD) setting, termed discriminative FSAD, where a limited number of both normal and anomalous examples are available as references during inference. Existing FSAD methods rely on normal-only references through normality matching, ignoring the discriminative clues in anomalous references, while directly fitting both references can overfit to the seen anomalies. We introduce IDEAL, an intrinsic deviation learning framework that leverages both reference types to learn intrinsic deviation patterns characterizing generalizable abnormality as deviations from normality. IDEAL decomposes the learning process into two novel components: 1) a Normal Variation Eraser to suppress nuisance normal variations that may lead to noisy deviations from normality, thereby highlighting anomaly-relevant deviation representations; 2) an Intrinsic Deviation Encoder to decompose these denoised deviation representations into intrinsic deviation vectors capturing the most discriminative orthogonal deviation directions. At inference, IDEAL scores query-to-normal deviations preserved after projection onto the learned intrinsic deviation vectors, enabling generalization for both seen and unseen anomalies. Extensive experiments on eight real-world datasets show that IDEAL generalizes effectively to unseen anomalies and consistently outperforms existing state-of-the-art FSAD methods. Code and data will be available at https://github.com/mala-lab/IDEAL.
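
The inference step described above, scoring the part of a query-to-normal deviation that survives projection onto a set of deviation directions, can be sketched generically. This is an assumption-laden illustration and not the IDEAL code: it uses the mean of the normal references as the normal prototype, takes a hand-picked orthonormal basis in place of learned intrinsic deviation vectors, and works on toy 3-D features.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def deviation_score(query, normal_refs, deviation_basis):
    """Norm of the query-to-normal deviation after projection onto the basis."""
    dim = len(query)
    # Assumption: the normal prototype is the mean of the normal references.
    prototype = [sum(ref[k] for ref in normal_refs) / len(normal_refs) for k in range(dim)]
    deviation = [q - m for q, m in zip(query, prototype)]
    # Assumption: basis vectors are orthonormal, so coefficients are dot products.
    coeffs = [dot(deviation, b) for b in deviation_basis]
    return math.sqrt(sum(c * c for c in coeffs))

# Toy features: axis 0 carries nuisance normal variation, axis 2 carries anomalies.
normal_refs = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
basis = [[0.0, 0.0, 1.0]]  # hypothetical stand-in for learned deviation vectors

normal_query_score = deviation_score([3.0, 0.5, 0.0], normal_refs, basis)
anomalous_query_score = deviation_score([1.0, 0.0, 1.5], normal_refs, basis)
```

The normal query deviates from the prototype only along directions outside the basis, so its score is zero, while the anomalous query keeps its full offset along the retained direction.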

2605.23080 2026-06-02 cs.LG

The Attribution Contract: Feature Attribution for Generative Language Models

Giang Nguyen

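Affiliations: Guide Labs

AI summary: To address the ambiguity of feature attribution in generative language models, proposes the Attribution Contract, a specification that makes explicit the explained output, the eligible features, the generative process, and related elements, and uses autoregressive and diffusion models as case studies to show how attribution behaves under different contracts.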

详情
AI中文摘要

特征归因方法承诺识别哪些输入特征对模型输出重要。然而,在生成式语言模型中,首先往往不清楚什么应算作特征。在自回归语言模型中,先前生成的标记既是模型的输出,也是后续预测的输入。在扩散语言模型中,生成通过迭代去噪或去掩码进行,而非固定的从左到右预测,因此局部解释可能针对扩散状态而非下一个标记。我们认为这种模糊性不仅是实现细节,而是将分类器时代的特征归因直接带入生成式语言建模的概念局限。我们引入归因契约,这是一种特征归因声明的规范,它命名了被解释的输出、有资格获得归因的特征、假定的生成过程、保持不变的内容以及被归因的模型分数。该契约澄清了为什么相同的归因方法可以根据其实例化方式回答不同的问题。我们认为,生成式语言模型中关于特征归因的许多分歧并非关于归因算法的分歧,而是关于未明确说明的解释契约的分歧。以自回归和扩散语言模型为案例研究,我们展示了何时对先前生成的标记、中间状态或去噪阶段的归因是有信息的,何时是误导的,以及为什么生成式语言模型中的特征归因方法应作为方法-契约对进行评估。

英文摘要

Feature attribution methods promise to identify which input features matter for a model output. In generative language models, however, it is often unclear what should count as a feature in the first place. In autoregressive language models, earlier generated tokens are both outputs of the model and inputs to later predictions. In diffusion language models, generation proceeds through iterative denoising or unmasking rather than fixed left-to-right prediction, so local explanation may target a state of diffusion rather than a next token. We argue that this ambiguity is not merely an implementation detail, but a conceptual limitation of carrying classifier-era feature attribution directly into generative language modeling. We introduce the Attribution Contract, a specification for feature-attribution claims that names what output is being explained, which features are eligible to receive attribution, what generative process is assumed, what is held fixed, and what model score is being attributed. The contract clarifies why the same attribution method can answer different questions depending on how it is instantiated. We argue that many disagreements about feature attribution in generative language models are not disagreements about attribution algorithms, but about unstated explanatory contracts. Using autoregressive and diffusion language models as case studies, we show when attribution to earlier generated tokens, intermediate states, or denoising stages is informative, when it is misleading, and why feature-attribution methods in generative language models should be evaluated as method-contract pairs.

2605.22978 2026-06-02 cs.CL

A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

一种可复现的通用依赖风格管道用于Katharevousa希腊语议会文本

George Mikros, Fotios Fitsilis

发表机构 * Hamad Bin Khalifa University(哈马德·本·哈利法大学) Universidad Austral(南方大学)

AI总结 针对Katharevousa希腊语,提出一种可复现的通用依赖风格解析管道,结合OCR重建、LLM辅助标注、自动验证和模型比较,显著提升解析性能并开源全部资源。

详情
Comments
12 pages, 1 figure, 2 tables; companion to the kathnlp open-source release at https://github.com/gmikros/katharevousa-nlp-tooling
AI中文摘要

Katharevousa希腊语尽管在法律、行政和议会档案中具有重要性,但当代NLP管道对其支持不足。我们提出了一种可复现的工作流程,用于构建和评估针对希腊后军政府时期早期议会问题的通用依赖风格解析资源。该管道结合了OCR感知重建、模式约束的LLM辅助标注、自动验证、确定性CoNLL-U快照、固定分割评估和模型族比较。冻结的自动验证参考集包含1,697个句子,分为1,357个训练句子和340个保留测试句子。我们在相同评分协议下比较了现成的希腊语和古希腊语解析器、基于特征的解析器、mBERT、XLM-R和自定义Stanza训练。现成系统显示出显著的语域不匹配:最强的外部基线spaCy希腊语达到0.4183 LAS。最佳结构解析器XLM-R模型达到0.8893 UPOS准确率、0.7250依赖关系F1、0.6098 UAS和0.5162 LAS,比最佳外部基线绝对LAS提升0.0980。基于特征的模型在UPOS和关系标注上仍具竞争力,表明在此数据规模下透明的词汇-上下文特征仍然重要。除了分数,本文还贡献了一种可审计的方法论,用于将困难的历史议会OCR转化为可重用的句法NLP基础设施。整个管道——代码、模式、冻结参考标注、固定训练/测试分割和每个模型的基准报告——作为本文的开放获取附件发布。

英文摘要

Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1,697 sentences, split into 1,357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol. Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale. Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline -- code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports -- is released as an open-access companion to this paper.
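
For readers unfamiliar with the UAS and LAS figures quoted in this abstract, the sketch below shows how the two attachment scores are conventionally computed for token-aligned parses. It is a generic illustration, not the paper's scoring script; the function name and the toy sentence are hypothetical.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for token-aligned lists of (head, deprel) pairs.

    UAS counts tokens whose predicted head is correct; LAS additionally
    requires the dependency label to match.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    both_ok = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return head_ok / n, both_ok / n

# Hypothetical four-token sentence: token 3 gets the wrong head,
# token 4 gets the right head but the wrong label.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "nmod")]

uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

Because LAS requires both the head and the label to be right, it can never exceed UAS, which matches the ordering of the reported 0.6098 UAS and 0.5162 LAS.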

2605.22671 2026-06-02 cs.CV

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

从抽象到实例化:学习视觉-语言-动作模型的行为表示

Bing Hu, Zaijing Li, Rui Shao, Junda Chen, April Hua Liu, Wei-Shi Zheng, Liqiang Nie

AI总结 提出BehaviorVLA框架,通过因果Mamba架构的视觉运动行为编码器和相位条件行为解码器学习时间一致的行为表示,在分布偏移下实现鲁棒操作,在多个基准上达到最优成功率并展现数据效率。

详情
Comments
ICML 2026 Oral
AI中文摘要

视觉-语言-动作(VLA)模型在分布偏移下常出现性能下降,因为它们在跨不同环境学习泛化行为表示方面存在困难。现有方法尝试通过以动作为中心的潜变量构建行为表示,但常受限于短时间跨度的时间碎片化和静态执行对齐,导致复杂场景中的行为不一致。为解决这些限制,我们提出BehaviorVLA,一个通过学习时间一致的行为表示来促进鲁棒操作的框架。我们的方法包含两个对称组件:(1)视觉运动行为编码器(VBE),利用基于因果Mamba的架构将长时间跨度的轨迹信息聚合为统一的行为表示;(2)相位条件行为解码器(PBD),通过动态对齐任务级先验与实时执行进度,将该表示解码为精确动作。在RoboTwin 2.0、LIBERO和CALVIN上的实验分别达到了58%、98%和4.36(平均长度)的最优成功率。值得注意的是,在真实世界的仿真到现实迁移中,BehaviorVLA仅使用50%的演示数据就匹配了OpenVLA-OFT的性能,展示了其优越的数据效率和泛化能力。

英文摘要

Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution alignment, leading to inconsistent behaviors in complex scenarios. To address these limitations, we propose BehaviorVLA, a framework that facilitates robust manipulation by learning temporally coherent behavioral representations. Our approach features two symmetric components: (1) the Visuomotor Behavior Encoder (VBE), which uses a causal Mamba-based architecture to aggregate long-horizon trajectory information into a unified behavior representation; and (2) the Phase-conditioned Behavior Decoder (PBD), which decodes this representation into precise actions by dynamically aligning task-level priors with real-time execution progress. Experiments on RoboTwin 2.0, LIBERO, and CALVIN demonstrate state-of-the-art results: success rates of 58% and 98% on RoboTwin 2.0 and LIBERO, and an average length (Avg.Len) of 4.36 on CALVIN. Notably, in real-world sim-to-real transfer, BehaviorVLA matches the performance of OpenVLA-OFT using only 50% of the demonstration data, showcasing its superior data efficiency and generalization.

2604.13517 2026-06-02 cs.LG cs.AI

Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO

表征优于路由:诊断多时间尺度PPO中的时间路由病理

Jing Sun

发表机构 * Information Engineering School, Chengyi College, Jimei University(集美大学诚毅学院信息工程学院)

AI总结 本文通过形式化代理目标攻击和时间不确定性悖论,揭示了多时间尺度PPO中可微路由和基于误差路由的数值捷径问题,并研究目标解耦方法,消除演员侧可被利用的路由路径并改善观察到的最差种子回报。

详情
Comments
8 pages, 3 figures
AI中文摘要

强化学习中的时间信用分配通常通过引入多个折扣因子的价值估计来处理。一个自然的下一步是让演员在这些时间头之间动态路由,使用可微注意力或启发式不确定性权重。本文认为,这种路由可能产生数值捷径而非可靠的时间抽象。我们在LunarLander-v2上的受控PPO设置中研究此问题,将环境用作诊断故障模式的视觉沙箱。首先,我们形式化了代理目标攻击:暴露于PPO代理目标的可微softmax路由器会直接获得梯度,指向对当前更新数值有利的优势头,即使这种路由变化并不对应物理控制的改进。由于不同折扣因子的未归一化优势具有不同的有效尺度,这产生了尺度差异脆弱性。其次,我们在无梯度的基于误差的路由中识别了时间不确定性悖论:短视头可能获得最大的路由份额,因为其预测目标更容易,即使它们与延迟任务成功的对齐程度较低。作为结构性回应,我们研究了目标解耦:评论家可以保留多时间尺度辅助头,但演员仅使用长视优势进行更新。目标解耦并非作为广泛的性能提升器提出;在此运行集中,它消除了可被利用的演员侧路由路径,并改善了观察到的最差种子回报。代码可在 https://github.com/ben-dlwlrma/Representation-Over-Routing 获取。

英文摘要

Temporal credit assignment in reinforcement learning is often approached by introducing value estimates at multiple discount factors. A natural next step is to let the actor dynamically route among these temporal heads, using either differentiable attention or heuristic uncertainty weights. This paper argues that such routing can create a numerical shortcut rather than a reliable temporal abstraction. We study this issue in a controlled PPO setting on LunarLander-v2, using the environment as a visual sandbox for diagnosing failure modes. First, we formalize Surrogate Objective Hacking: a differentiable softmax router exposed to the PPO surrogate receives a direct gradient toward advantage heads that are numerically favorable for the current update, even when this routing change does not correspond to improved physical control. Because unnormalized advantages at different discount factors have different effective scales, this creates a scale-discrepancy vulnerability. Second, we identify the Paradox of Temporal Uncertainty in gradient-free error-based routing: short-horizon heads can receive the largest routing share because their prediction targets are easier, even when they are less aligned with delayed task success. As a structural response, we study Target Decoupling: the critic may retain multi-timescale auxiliary heads, but the actor is updated only with the long-horizon advantage. Target Decoupling is not presented as a broad performance booster; in this run set it removes the exploitable actor-side routing pathway and improves the observed worst-seed return. Code is available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
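
The scale discrepancy and the Target Decoupling rule described in this abstract can be sketched in a few lines. This is a simplified illustration, not the paper's code: it uses plain discounted returns-to-go in place of PPO advantages, and the reward sequence, discount factors, and function names are all hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Discounted return-to-go for each step of one episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def multi_timescale_targets(rewards, gammas):
    """One return-to-go sequence per discount factor (critic-side heads)."""
    return {g: discounted_returns(rewards, g) for g in gammas}

def actor_signal(targets):
    """Target decoupling as described above: the actor sees only the
    longest-horizon head; the other heads stay auxiliary critic targets."""
    return targets[max(targets)]

rewards = [0.0] * 9 + [100.0]  # hypothetical delayed reward at the final step
gammas = (0.5, 0.9, 0.99)      # illustrative discount factors
targets = multi_timescale_targets(rewards, gammas)

for g in gammas:
    # The same trajectory yields very different magnitudes per head.
    print(g, round(targets[g][0], 3))

actor_targets = actor_signal(targets)
```

With a delayed reward, the short-horizon head's target at the first step is orders of magnitude smaller than the long-horizon one, which is the kind of unnormalized scale gap a differentiable router could exploit; selecting a single head for the actor removes that routing choice entirely.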

2605.22305 2026-06-02 cs.LG

Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks

切比雪夫策略与山地车问题:低维控制任务的强化学习

Stefan Huber, Hannes Unger, Georg Schäfer, Jakob Rehrl

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 本文解析求解山地车问题,推导最优控制,并引入切比雪夫策略作为通用策略类,在低维控制任务中显著优于神经网络。

详情
Comments
ICML 2026 Oral
AI中文摘要

我们解析求解了强化学习中的经典基准问题——山地车问题,并推导出最优控制解,填补了36年来的空白。这使我们得以揭示两个令人惊讶的见解:最优控制非常简单,然而现代强化学习智能体与最优性之间存在巨大差距。受最优控制分析的启发,我们从基本原理出发,引入了切比雪夫策略作为强化学习策略的通用(即稠密)类。它们可以作为神经网络的即插即用替代品进行训练,将遗憾值降低4.18倍,同时所需参数减少277倍,从而促进样本效率、可解释性和实时能力。切比雪夫策略在进一步的强化学习任务上进行了评估,包括一个真实世界的非线性运动控制测试平台。在使用PPO、ARS和REINFORCE算法时,它们始终优于神经网络。我们的结果证明了切比雪夫策略在低维控制任务中作为神经网络的一种引人注目且轻量级的替代或补充方案的有效性。

英文摘要

We analytically solve the Mountain Car problem, a canonical benchmark in RL, and derive an optimal control solution, closing a gap after 36 years. This enables us to reveal two surprising insights: The optimal control is quite simple, yet modern RL agents display a large gap to optimality. Motivated by the analysis of the optimal control, we introduce Chebyshev policies as a universal (i.e. dense) class of RL policies from first principles. They can be trained as drop-in replacements of neural nets, reducing the regret by a factor of 4.18, while requiring 277 times fewer parameters, fostering sample efficiency, explainability and realtime capability. Chebyshev policies are evaluated on further RL tasks, including a real-world nonlinear motion control testbed. They consistently improve performance over neural nets with PPO, ARS and REINFORCE. Our results demonstrate how Chebyshev policies offer a compelling and lightweight alternative or addition to neural nets for low-dimensional control tasks.
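
To make the idea of a Chebyshev policy concrete, the sketch below evaluates a tensor-product Chebyshev series over a two-dimensional state and clips the result to an action range. It is an assumption-laden illustration rather than the authors' parameterization: the tensor-product form, the clipping, the coefficient values, and the use of the usual Gym Mountain Car state bounds are all choices made here.

```python
def chebyshev_basis(x, degree):
    """T_0(x)..T_degree(x) via the recurrence T_{n+1} = 2x T_n - T_{n-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]

def chebyshev_policy(state, coeffs, low, high):
    """Action = sum_ij coeffs[i][j] * T_i(s0) * T_j(s1), clipped to [-1, 1]."""
    # Rescale each state component to [-1, 1], the natural Chebyshev domain.
    scaled = [2.0 * (s - lo) / (hi - lo) - 1.0 for s, lo, hi in zip(state, low, high)]
    degree = len(coeffs) - 1
    t0 = chebyshev_basis(scaled[0], degree)
    t1 = chebyshev_basis(scaled[1], degree)
    action = sum(
        coeffs[i][j] * t0[i] * t1[j]
        for i in range(degree + 1)
        for j in range(degree + 1)
    )
    return max(-1.0, min(1.0, action))

# State = (position, velocity); bounds as in the usual Gym Mountain Car setup (assumed).
LOW, HIGH = [-1.2, -0.07], [0.6, 0.07]
# Hypothetical degree-1 coefficients: push in the direction of the current velocity.
COEFFS = [[0.0, 5.0], [0.0, 0.0]]

print(chebyshev_policy([-0.5, 0.035], COEFFS, LOW, HIGH))
print(chebyshev_policy([-0.5, -0.007], COEFFS, LOW, HIGH))
```

A degree-d policy of this form has only (d + 1)^2 trainable coefficients for a two-dimensional state, which gives a sense of how such a policy class can be far smaller than a neural network.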

2605.00941 2026-06-02 cs.LG cs.CV

Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching

散度即不确定性:流匹配的闭式后验协方差

Jiarui Xing, Song Wang, Jian Wang

发表机构 * Yale University(耶鲁大学) Shanxi University(山西大学) Harvard Medical School(哈佛医学院)

AI总结 本文通过扩展Tweedie公式到流匹配插值,推导出生成轨迹上每一点后验协方差的精确闭式表达式,该表达式仅依赖于学习速度场的散度,可在预训练模型上事后计算,无需重新训练或修改架构。

详情
Comments
9 Pages, 5 figures
AI中文摘要

流匹配已成为生成建模的领先框架,但量化其样本的不确定性仍是一个开放问题。现有方法使用辅助方差头重新训练模型、维护昂贵的集成或通过多个积分步骤传播近似协方差,在训练成本、推理成本或准确性之间进行权衡。我们表明这些权衡都不是必需的。通过将Tweedie公式从去噪设置扩展到流匹配插值,我们推导出生成轨迹上每一点后验协方差的精确闭式表达式。结果仅依赖于一个量,即学习速度场的散度,该散度可以在任何预训练的流匹配模型上事后计算,无需重新训练和架构修改。对于像MeanFlow这样的单步生成器,相同的公式在单次前向传递中产生端到端的生成不确定性,消除了所有先前方法所需的多步方差传播。在MNIST上的实验证实,得到的逐像素不确定性图在语义上有意义,集中在样本间变化最大的数字边界上,并且标量不确定性分数跟踪实际预测误差,所有计算量大约比集成或蒙特卡洛丢弃法少$10^4$倍。

英文摘要

Flow matching has become a leading framework for generative modeling, but quantifying the uncertainty of its samples remains an open problem. Existing approaches retrain the model with auxiliary variance heads, maintain costly ensembles, or propagate approximate covariance through many integration steps, trading off training cost, inference cost, or accuracy. We show that none of these trade-offs is necessary. By extending Tweedie's formula from the denoising setting to the flow matching interpolant, we derive an exact, closed-form expression for the posterior covariance at every point along the generative trajectory. The result depends on a single quantity, namely the divergence of the learned velocity field, which can be computed post-hoc on any pre-trained flow matching model, requiring no retraining and no architectural modification. For one-step generators such as MeanFlow, the same formula yields the end-to-end generation uncertainty in a single forward pass, eliminating the multi-step variance propagation required by all prior methods. Experiments on MNIST confirm that the resulting per-pixel uncertainty maps are semantically meaningful, concentrating on digit boundaries where inter-sample variation is highest, and that the scalar uncertainty score tracks actual prediction error, all at roughly $10^4 \times$ less total compute than ensembling or Monte Carlo dropout.
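
The single quantity this abstract relies on, the divergence of the velocity field, can be estimated numerically as sketched below. The sketch covers only the divergence itself, not the paper's covariance formula; the linear field stands in for a learned model and omits the time argument a real flow matching network would take, and all names are hypothetical.

```python
def divergence(velocity, x, eps=1e-5):
    """Central-difference estimate of sum_i d v_i / d x_i at point x."""
    total = 0.0
    for i in range(len(x)):
        plus = list(x)
        plus[i] += eps
        minus = list(x)
        minus[i] -= eps
        total += (velocity(plus)[i] - velocity(minus)[i]) / (2.0 * eps)
    return total

def linear_field(x):
    """Hypothetical stand-in for a learned velocity field: v(x) = A x."""
    a = [[-1.0, 0.5], [0.3, -2.0]]
    return [sum(a[r][c] * x[c] for c in range(2)) for r in range(2)]

# For a linear field the divergence equals trace(A) = -3.0 at every point.
print(divergence(linear_field, [0.7, -0.4]))
```

Finite differences need two field evaluations per dimension, so for image-sized inputs one would more plausibly use automatic differentiation or a stochastic trace estimator; which of these the authors use is not stated in the abstract.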

2604.17473 2026-06-02 cs.CV cs.AI

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

双锚定:解决视觉语言导航中的状态漂移问题

Kangyi Wu, Pengna Li, Kailin Lyu, Xi Lin, Lin Zhao, Qingrong He, Jinjun Wang, Jianyi Liu

发表机构 * National Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室) National Engineering Research Center for Visual Information and Applications(视觉信息与应用国家工程研究中心) Institute of Artificial Intelligence and Robotics(人工智能与机器人研究院) Xi’an Jiaotong University(西安交通大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Johns Hopkins University(约翰霍普金斯大学) Joy Future Academy, JD(京东未来学院)

AI总结 提出双锚定框架,通过指令进度锚定和记忆地标锚定分别解决进度漂移和记忆漂移,显著提升长场景导航成功率。

详情
AI中文摘要

视觉语言导航(VLN)要求智能体通过遵循自然语言指令在3D环境中导航。尽管最近的视频大语言模型(Video-LLMs)极大地推进了VLN,但在长场景中它们仍然非常容易受到状态漂移的影响。在这些情况下,智能体的内部状态偏离真实的任务执行状态,导致无目的漫游和无法执行指令中的关键操作。我们将这种失败归因于两种不同的认知缺陷:进度漂移,即智能体无法区分已完成的子目标和剩余的子目标;以及记忆漂移,即智能体的历史表示退化,使其无法跟踪已访问的地标。在本文中,我们提出了一个双锚定框架,明确锚定指令进度和历史表示。首先,为了解决进度漂移,我们引入了指令进度锚定,监督智能体生成结构化的文本标记,以描述已完成与剩余的子目标。其次,为了缓解记忆漂移,我们提出了记忆地标锚定,利用以地标为中心的世界模型回顾性地预测由Segment Anything模型提取的以对象为中心的嵌入,迫使智能体显式验证过去的观察并保留已访问地标的独特表示。为促进该框架,我们整理了两个大规模数据集:360万个带有显式进度描述的样本,以及93.7万个用于回顾性验证的接地地标数据。在模拟和真实环境中的大量实验证明了我们方法的优越性,在成功率上提高了15.2%,在长时程轨迹上获得了24.7%的显著提升。为促进进一步研究,我们将发布我们的代码、数据生成流程以及收集的数据集。

英文摘要

Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have substantially advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.

2602.04509 2026-06-02 cs.CL

Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models

Model-Dowser:无数据重要性探测以缓解多模态大语言模型中的灾难性遗忘

Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim

AI总结 提出Model-Dowser,一种基于重要性评分的稀疏微调方法,通过联合考虑权重幅度、输入激活和输出敏感性来缓解多模态大语言模型微调中的灾难性遗忘,在LLaVA和NVILA上优于现有方法。

详情
Comments
Accepted at ICML 2026. Code link: https://model-dowser.github.io
AI中文摘要

在特定任务数据上微调多模态大语言模型(MLLMs)是提高下游应用性能的有效方法。然而,这种适应通常会导致预训练任务泛化能力的下降,这种现象称为灾难性遗忘。现有的旨在缓解该问题的方法要么在微调语言解码器更深层时失效,要么随着模型规模增大而扩展性差。为了解决这些局限性,我们提出了Model-Dowser,一种用于MLLMs的新型稀疏微调方法。Model-Dowser通过联合考虑权重幅度、输入激活和输出敏感性,为每个模型参数测量相对于预训练泛化能力(在下游适应之前)的原则性重要性分数。在微调过程中,Model-Dowser选择性地保留高重要性参数并更新其余参数。在两个代表性MLLMs(LLaVA和NVILA)上的综合实验表明,Model-Dowser有效缓解了灾难性遗忘,并始终优于先前方法,同时保持资源高效且可扩展到数十亿参数模型。

英文摘要

Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining ones. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.

2602.02214 2026-06-02 cs.CV

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

因果强迫:自回归扩散蒸馏的正确方法,用于高质量实时交互式视频生成

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu

AI总结 针对双向扩散模型蒸馏为自回归模型时的架构差距问题,提出因果强迫方法,通过自回归教师进行ODE初始化并应用DMD过程,显著提升视频生成质量。

详情
Comments
Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing. ICML 2026
AI中文摘要

为了实现实时交互式视频生成,当前方法将预训练的双向视频扩散模型蒸馏为少步自回归(AR)模型,当全注意力被因果注意力替代时面临架构差距。然而,现有方法并未从理论上弥合这一差距。它们通过ODE蒸馏初始化AR学生模型,这需要帧级单射性,即在AR教师的PF-ODE下,每个噪声帧必须映射到唯一的干净帧。从双向教师蒸馏AR学生违反了这一条件,阻止了教师流映射的恢复,反而诱导出条件期望解,导致性能下降。为解决此问题,我们提出因果强迫(Causal Forcing),它使用自回归教师进行ODE初始化以弥合架构差距,然后应用与Self Forcing相同的DMD过程。实验结果表明,我们的方法在所有指标上优于所有基线,在动态程度、VisionReward和指令跟随上分别超过SOTA Self Forcing 19.3%、8.7%和16.7%。项目页面:https://thu-ml.github.io/CausalForcing.github.io/;代码:https://github.com/thu-ml/Causal-Forcing。

英文摘要

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in Self Forcing. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing.

2508.03556 2026-06-02 cs.LG

VRPRM: Process Reward Modeling via Visual Reasoning

VRPRM: 通过视觉推理的过程奖励建模

Xinquan Chen, Chongying Yue, Bangwei Liu, Xuhong Wang, Yingchun Wang, Chaochao Lu

发表机构 * PJ Lab(PJ实验室)

AI总结 提出VRPRM,一种结合视觉推理的过程奖励模型,通过两阶段训练策略(少量CoT-PRM SFT数据+大量非CoT-PRM RL数据)以较低成本实现高质量推理,性能提升达118%。

详情
Comments
20 pages, 11 figures
AI中文摘要

过程奖励模型(PRM)因其能够对生成内容的推理步骤进行细粒度评估,被广泛用于大型语言模型(LLM)的后训练中。然而,大多数PRM缺乏长期推理和深度思考能力。另一方面,尽管少数工作尝试将思维链(CoT)能力引入PRM,但CoT-PRM数据的标注成本过高,难以在各种任务中发挥稳定作用。为应对上述挑战,我们提出VRPRM,一种通过视觉推理的过程奖励模型,并设计了一种高效的两阶段训练策略。实验结果表明,仅使用3.6K CoT-PRM监督微调(SFT)数据和50K非CoT-PRM强化学习(RL)训练数据,VRPRM即可超越总数据量达400K的非思考PRM,并在BoN实验中相对于基模型实现了高达118%的相对性能提升。这一结果证实,所提出的组合训练策略能够以较低的数据标注成本实现更高质量的推理能力,从而为更高效数据利用的PRM训练提供了新范式。

英文摘要

Process Reward Models (PRMs) are widely used in the post-training of Large Language Models (LLMs) because they can perform fine-grained evaluation of the reasoning steps of generated content. However, most PRMs lack long-term reasoning and deep thinking capabilities. Although a few works have tried to introduce Chain-of-Thought (CoT) capability into PRMs, the annotation cost of CoT-PRM data is too high for such models to play a stable role across tasks. To address these challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that, using only 3.6K CoT-PRM Supervised Fine-Tuning (SFT) samples and 50K non-CoT PRM Reinforcement Learning (RL) training samples, VRPRM can surpass a non-thinking PRM trained on a total data volume of 400K and achieve a relative performance improvement of up to 118% over the base model in the BoN experiment. This result confirms that the proposed combined training strategy can achieve higher-quality reasoning capabilities at a lower data annotation cost, thus providing a new paradigm for PRM training with more efficient data utilization.

2605.21964 2026-06-02 cs.CV physics.optics

Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection

用于目标检测的双集成低延迟单透镜红外计算成像

Xuquan Wang, Guishuo Yang, Dapeng Yan, Yujie Xing, Xuanyu Qian, Kai Zhang, Xiong Dun, Jiande Sun

发表机构 * MOE Key Laboratory of Advanced Micro-Structured Materials(教育部先进微结构材料重点实验室) Institute of Precision Optical Engineering(精密光学工程研究院) School of Physics Science and Engineering(物理科学与工程学院) Shanghai Frontiers Science Center of Digital Optics(上海前沿科学中心数字光学中心) School of Computer Science and Artificial Intelligence(计算机科学与人工智能学院) Shandong Normal University(山东师范大学) Shandong Engineering Research Center for Multimodal Computing and Intelligent Decision Making(山东省多模态计算与智能决策中心)

AI总结 提出物理感知双集成网络(PDI-Net),通过嵌入光学先验并共享编码器特征,在单透镜红外相机上实现低延迟高精度目标检测。

详情
Comments
15 pages, 11 figures; supplementary material: 3 pages, 2 figures
AI中文摘要

计算成像能够实现紧凑的红外系统,但结合图像重建和目标检测的深度学习流程通常会引入显著的推理延迟。大多数现有的加速策略压缩重建网络,而忽略了来自光路的物理先验,从而在准确性和速度之间留下权衡。我们提出了物理感知双集成网络(PDI-Net),这是一个低延迟框架,它将红外重建与目标检测集成在一起,并进一步将光学先验嵌入到学习过程中。PDI-Net在训练期间使用监督U-Net,而在推理期间,半U-Net编码器直接与基于YOLO的检测器共享特征,避免了完整的图像重建。为了弥合面向保真度的重建特征与面向检测的语义之间的差距,我们引入了物理感知大小桥接(PALS-Bridge),它使用与视场相关的点扩散函数先验自适应地调制多尺度卷积分支。还开发了物理信息的光学退化模拟流程用于训练和验证。该方法部署在单透镜红外相机上,与传统多透镜设计相比,系统重量减轻约50%。在低信噪比条件下的M3FD基准上,与采用剪枝策略的Rec+Det相比,PDI-Net将推理时间减少了84.06%,同时将mAP@0.5:0.95提高了5.07%。这些结果展示了在资源受限平台上用于实时目标检测的紧凑、低延迟计算红外成像。

英文摘要

Computational imaging enables compact infrared systems, but deep-learning pipelines that combine image reconstruction and object detection often introduce substantial inference latency. Most existing acceleration strategies compress the reconstruction network while overlooking physical priors from the optical path, leaving a trade-off between accuracy and speed. We present Physics-aware Dual-Integrated Network (PDI-Net), a low-latency framework that integrates infrared reconstruction with object detection and further embeds optical priors into the learning process. PDI-Net uses a supervised U-Net during training, while a semi-U-Net encoder shares features directly with a YOLO-based detector during inference, avoiding full image reconstruction. To bridge the gap between fidelity-oriented reconstruction features and detection-oriented semantics, we introduce a physics-aware large-small bridge (PALS-Bridge), which uses field-dependent point spread function priors to adaptively modulate multiscale convolutional branches. A physics-informed optical degradation simulation pipeline is also developed for training and validation. The method is deployed on a single-lens infrared camera, reducing system weight by about 50% compared with traditional multi-lens designs. On the M3FD benchmark under low-SNR conditions, PDI-Net reduces inference time by 84.06% compared with the Rec+Det with pruning strategy while improving mAP@0.5:0.95 by 5.07%. These results demonstrate compact, low-latency computational infrared imaging for real-time object detection on resource-constrained platforms.

2605.21648 2026-06-02 cs.LG cond-mat.dis-nn cs.NE stat.ML

Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos

Dropout 普适性:混沌边缘的缩放定律与最优调度

Lucas Fernandez Sarmiento

发表机构 * Lucas Fernandez Sarmiento

AI总结 提出 dropout 作为临界信号传播扰动的平均场理论,发现前端加载的 dropout 调度在固定预算下可将 MLP 和 Vision Transformer 的测试损失降低 18-35%,并推导出相关缩放定律与普适类。

详情
Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 36 pages, 11 figures. Camera-ready version
AI中文摘要

我们发展了 dropout 作为混沌边缘临界信号传播扰动的平均场理论,并表明它预测了一个简单、零成本的实践改变:前端加载的 dropout 调度在固定预算下,在 MLP 和 Vision Transformer 中比恒定 dropout 降低测试损失 18-35%。理论机制是 dropout 移动了完美对齐固定点,使得即使在临界初始化下信息传播的深度尺度也变得有限。我们推导了相关衰减的临界和交叉缩放定律,并建立了平滑激活和带拐点的 ReLU 类激活构成不同的普适类,具有不同的临界指数以及在失谐和 dropout 强度下的通用两参数缩放塌缩。这种区别追溯到相关映射的解析结构:平滑激活在完美对齐附近允许泰勒展开,而带拐点的激活则出现具有普适非解析性的分支点。作为推论,该框架在固定预算下产生饱和的 dropout 轮廓;然后通过正则化可达性论证选择前端加载的调度,精度提升作为一致的次要效果。我们还讨论了相同的高斯核结构如何将理论从 MLP 扩展到 CNN 和残差架构。

英文摘要

We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos, and show that it predicts a simple, no-cost change to standard practice: front-loaded dropout schedules cut test loss by 18-35% over constant dropout in MLPs and Vision Transformers at fixed budget. The theoretical mechanism is that dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, ReLU-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a regularization-reach argument then selects front-loaded schedules, with accuracy gains as a consistent secondary effect. We also discuss how the same Gaussian-kernel structure extends the theory beyond MLPs toward CNNs and residual architectures.
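
The contrast between constant and front-loaded dropout at a fixed budget can be sketched as below. The abstract does not define the budget or the schedule precisely, so this rests on two assumptions made here: the budget is the sum of per-layer dropout rates, and "front-loaded" with a "saturated" profile means the earliest layers are set to a maximum rate until the budget is spent. Function names and numbers are hypothetical.

```python
def constant_schedule(num_layers, budget):
    """Spread the total dropout budget evenly over all layers."""
    return [budget / num_layers] * num_layers

def front_loaded_schedule(num_layers, budget, p_max):
    """Saturate the earliest layers at p_max until the budget is spent."""
    if not 0.0 < p_max < 1.0:
        raise ValueError("p_max must lie in (0, 1)")
    if budget < 0.0 or budget > num_layers * p_max:
        raise ValueError("budget must lie in [0, num_layers * p_max]")
    rates, remaining = [], budget
    for _ in range(num_layers):
        # Spend as much of the remaining budget as the cap allows.
        p = max(0.0, min(p_max, remaining))
        rates.append(p)
        remaining -= p
    return rates

layers, budget = 6, 1.5
print(constant_schedule(layers, budget))
print(front_loaded_schedule(layers, budget, p_max=0.5))
```

Both schedules spend the same total budget; they differ only in where along the depth of the network the dropout is applied.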

2605.20823 2026-06-02 cs.CV

RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses

RelWitness: 基于视觉-几何关系见证者的开放词汇3D场景图生成

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, Tuan Kiet Pham, Sui Yang Guang

发表机构 * Phenikaa University(费恩基亚大学)

AI总结 提出RelWitness框架,通过视觉-几何关系见证者从不完整关系监督中生成开放词汇3D场景图,解决关系标注稀疏和词汇扩展问题。

详情
AI中文摘要

开放词汇3D场景图生成旨在用灵活的自然语言谓词描述对象实例及其关系。核心难点不仅在于词汇扩展,还在于监督可靠性:3D场景图数据集中的关系标注具有选择性,许多有效的对象对关系未被标注。我们提出RelWitness,一个从带有位姿的RGB-D序列中生成开放词汇3D场景图的框架,可在不完整关系监督下工作。关键概念是关系见证者:一种具体的视觉-几何线索,使关系在捕获场景中可观察。支持关系需要接触和垂直排序;包含关系需要包围;邻近关系需要度量接近;朝向关系需要面对方向;稳定关系应在两个对象可见的视角间持续存在。RelWitness从RGB视图、深度图、重建的3D几何、角色敏感文本、对象先验空视图和多视角一致性构建关系见证记录。视觉-几何见证验证器将未标注的关系候选分配给验证的缺失正例、可靠负例或不确定未标注案例。然后,见证引导的正-无标记目标从不完整标注中学习,而不将每个缺失标签视为负例。我们进一步引入见证一致解码和RGB-D缺失关系审计协议。在3DSSG/3RScan和ScanNet派生的开放词汇分割上的模拟手稿规划实验显示了预期行为:改进的未见关系识别、更高的见证精度、更低的幻觉和减少的关系短语冗余。所有数值结果均为规划值,在提交前必须替换为复现的测量值。

英文摘要

Open-vocabulary 3D scene graph generation seeks to describe object instances and their relations with flexible natural-language predicates. The central difficulty is not only vocabulary expansion, but supervision reliability: relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated. We propose RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. The key concept is a relation witness: a concrete visual-geometric cue that makes a relation observable in the captured scene. Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views where both objects are visible. RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. We further introduce witness-consistent decoding and an RGB-D missing-relation audit protocol. Simulated manuscript-planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and reduced redundant relation phrases. All numerical results are planning values and must be replaced by reproduced measurements before submission.

2605.21422 2026-06-02 cs.LG

PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning

PRISM:基于偏好感知影响函数的数据选择方法用于高效微调

Qihao Lin, Guanxu Chen, Dongrui Liu, Jing Shao

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出PRISM方法,通过偏好感知影响函数对目标示例加权,构建偏好感知目标方向,优先选择有效驱动模型匹配目标行为的数据,提升高效微调和安全对齐微调性能。

详情
Comments
23 pages, 5 figures
English abstract

As LLMs continue to scale up, improving training efficiency heavily relies on effective data utilization. Data selection mitigates this issue by allocating the limited training budget to high-value examples that optimally facilitate the model's target behavior. Most existing approaches define target behavior via a set of target examples and score candidate training data based on their estimated influence on these samples. However, such methods uniformly treat all target examples as equally important, ignoring the varying relevance of individual examples to model optimization. Specifically, target examples that align closely with the model's inherent behavior deliver stronger supervisory signals, whereas discrepant examples yield only weak and ineffective local guidance. We propose PRISM, a Preference-aware Influence function based Data Selection Method. It leverages model preference to assign weights to target examples and builds a preference-aware target direction. PRISM evaluates candidate training samples according to their influence on this direction, and prioritizes data budget allocation to samples that effectively drive the model to match expected target behavior. Theoretical analysis verifies that weighted preference construction generates a superior first-order gradient direction for boosting target preference, compared with uniform aggregation strategies. Extensive experiments covering diverse model architectures and parameter scales demonstrate that PRISM achieves better performance in efficient fine-tuning and safety-aligned supervised fine-tuning rectification. The results validate that accurate characterization of target behavior serves as the core of cost-effective data selection.
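The selection idea in this abstract can be sketched as follows: weight the target-example gradients by a model-preference score, sum them into one target direction, and rank candidate training samples by their first-order alignment with that direction. This is a minimal sketch under stated assumptions, not the authors' implementation: the softmax weighting, the `temperature` parameter, the plain dot-product score (with no Hessian term), and all function names are hypothetical, since the abstract does not give the exact formulas.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def preference_weighted_direction(target_grads, preferences, temperature=1.0):
    """Combine per-target gradients into one direction.

    Targets the model already prefers get larger weights (assumed softmax
    weighting); uniform preferences reduce to plain averaging.
    """
    weights = softmax(preferences, temperature)
    dim = len(target_grads[0])
    return [
        sum(w * g[j] for w, g in zip(weights, target_grads))
        for j in range(dim)
    ]


def select_top_k(candidate_grads, direction, budget):
    """Rank candidates by gradient alignment with the target direction."""
    scores = [
        sum(c * d for c, d in zip(grad, direction))
        for grad in candidate_grads
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Toy data: two target examples, three candidate training samples.
target_grads = [[1.0, 0.0], [0.0, 1.0]]
preferences = [2.0, 0.0]  # the model prefers the first target
candidates = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]

direction = preference_weighted_direction(target_grads, preferences)
chosen = select_top_k(candidates, direction, budget=1)  # → [1]
```

With uniform preferences the direction is the plain average of the target gradients, which corresponds to the uniform aggregation baseline the abstract compares against.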

2605.21421 2026-06-02 cs.CV

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

Lauhitya Reddy, Trisha M. Kesar, Hyeokhyen Kwon

Affiliations: Department of Biomedical Informatics, Emory University; Department of Rehabilitation Medicine, Emory University; The Wallace H. Coulter Department of Biomedical Engineering, Emory University and Georgia Institute of Technology

AI summary: Proposes AIGaitor, a system that uses edge computing to run markerless monocular motion capture and deep-learning analysis on a smartphone, addressing barriers of cost, privacy, and ease of use.

Details
Comments
18 pages, 3 figures, 2 tables
English abstract

Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns. AIGaitor is a privacy-preserving, cloud-free motion analysis system that runs markerless monocular motion-capture pipelines and downstream deep-learning analysis entirely on a consumer smartphone using on-device neural accelerators. To motivate its design, we surveyed 74 rehabilitation clinicians: 92 percent said they would adopt an accurate, cost-effective, easy-to-use AI gait analysis tool, while 79.7 percent cited operating cost, 68.9 percent insufficient training, and 64.9 percent privacy concerns as leading barriers. We then optimized and benchmarked mobile iOS implementations of current monocular pipeline components, including 2D and 3D pose estimation, pose optimization, skeleton-based deep-learning analysis, and a vision-language model. A Time-Priority end-to-end on-device pipeline processes a 10 s 4K 60 fps video clip in 77 s on an iPhone 14, matching or beating the same pipeline on a high-end NVIDIA H200 cloud server when network transfer is included: 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi. Lightweight models such as ViTPose-s achieve real-time keypoint extraction, and skeleton-based action-recognition models provide sub-millisecond gait classification on the same clip. To our knowledge, AIGaitor is the first monocular system to demonstrate end-to-end on-device motion capture and downstream deep-learning analysis, supporting clinically applicable movement analysis that is low-cost, private, and accessible to smartphone users.

2603.02845 2026-06-02 cs.RO cs.AI

SPARC: Spatial-Aware Path Planning via Attentive Agent Communication

Sayang Mu, Xiangyu Wu, Bo An

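Affiliations: Nanyang Technological University

AI summary: Proposes Relation enhanced Multi Head Attention (RMHA), which embeds Manhattan distances into the attention weight computation so that messages from spatially close robots are prioritized. In zero-shot generalization from 8 robots to 128 robots on 40x40 grids, it reaches approximately 75% success at 30% obstacle density, more than 25 percentage points above the best baseline.

Details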
Comments
The manuscript is being withdrawn at the request of the first author for the purpose of revising content and re-uploading a revised version with updated data/figures/text. The revised manuscript will be resubmitted to arXiv promptly with the same author list and research theme.
English abstract

Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation enhanced Multi Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU-gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves a success rate of approximately 75 percent at 30 percent obstacle density, outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to the success-rate improvement in high-density environments. Index Terms: multi-robot path planning, graph attention mechanism, multi-head attention, communication optimization, cooperative decision-making.
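The distance-aware attention in this abstract can be illustrated with a single-head sketch in plain Python. It assumes one simple way of embedding the distance: a subtractive penalty proportional to the Manhattan distance on each attention logit, plus a hard mask beyond a communication radius. The `penalty` and `radius` parameters, the function names, and the subtractive form are illustrative assumptions; the abstract does not state RMHA's exact formulation, and the GRU-gated fusion and multi-head structure are omitted here.

```python
import math


def manhattan(a, b):
    """Manhattan distance between two grid cells given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def distance_aware_attention(queries, keys, positions, radius, penalty=1.0):
    """Return an n x n matrix of attention weights, one row per robot.

    Logit = scaled dot product minus penalty * Manhattan distance
    (assumed form). Robots farther than `radius` are masked out and get
    weight 0. Each robot attends to itself (distance 0), so every row has
    at least one unmasked entry.
    """
    n = len(queries)
    dim = len(queries[0])
    weights = []
    for i in range(n):
        logits = []
        for j in range(n):
            dist = manhattan(positions[i], positions[j])
            if dist > radius:
                logits.append(None)  # outside the distance-constrained mask
                continue
            score = sum(q * k for q, k in zip(queries[i], keys[j]))
            logits.append(score / math.sqrt(dim) - penalty * dist)
        peak = max(l for l in logits if l is not None)
        exps = [0.0 if l is None else math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three robots with identical features: only distance differentiates them.
features = [[1.0, 0.0]] * 3
positions = [(0, 0), (1, 0), (5, 5)]
attn = distance_aware_attention(features, features, positions, radius=3)
```

In this toy case robot 0 gives the distant robot 2 zero weight and weights itself above robot 1, the behavior the abstract describes as prioritizing spatially relevant neighbors.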