arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.07026 2026-06-08 cs.CV cs.AI cs.MM updated version

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu, Yi Xin, Yuhui Zhang, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Chen Liu, Xiaoxing Hu, Ziyue Qiao, Hao Tang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan

Affiliations: HKUST(GZ) (Hong Kong University of Science and Technology (Guangzhou)); NUS (National University of Singapore); sh AILab; SII; Stanford (Stanford University); UCLA (University of California, Los Angeles); Yale (Yale University); SJTU (Shanghai Jiao Tong University); GBU; PKU (Peking University)

AI summary: To address the modality gap in multimodal contrastive learning, the paper proposes the Fixed-frame Modality Gap Theory and, building on it, designs the training-free alignment strategy ReAlign and the scalable training paradigm ReVision, which use unpaired data to align visual and language representations efficiently.

Abstract

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
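As a rough illustration of aligning text representations to the image distribution using only unpaired statistics, the sketch below shifts text embeddings to the image mean and rescales them so that the total variance (the trace of the covariance) matches. This is generic moment matching under our own assumptions, not the paper's Anchor, Trace, and Centroid Alignment steps, whose details the abstract does not give. The helper names and the toy numbers are hypothetical, and the single scalar scale is itself an isotropic simplification of the kind the abstract criticizes.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def total_variance(vectors, center):
    """Trace of the covariance: mean squared distance to the center."""
    return sum(sum((x - c) ** 2 for x, c in zip(v, center)) for v in vectors) / len(vectors)

def align_text_to_image(text_emb, image_emb):
    """Shift and rescale text embeddings to match the image mean and total variance.

    Only per-modality statistics are used, so no image-text pairing is needed.
    Hypothetical helper for illustration; this is not the ReAlign procedure.
    """
    mu_t, mu_i = mean(text_emb), mean(image_emb)
    scale = math.sqrt(total_variance(image_emb, mu_i) / total_variance(text_emb, mu_t))
    return [[m_i + scale * (x - m_t) for x, m_t, m_i in zip(v, mu_t, mu_i)]
            for v in text_emb]

# Toy, unpaired embeddings (made-up numbers; the two sets differ in size).
text = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
image = [[10.0, 10.0], [11.0, 10.0], [10.0, 11.0], [11.0, 11.0], [10.5, 10.5]]
aligned = align_text_to_image(text, image)
```

After the call, `aligned` has the same mean and total variance as `image`, while the relative layout of the text points is preserved.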

2605.07496 2026-06-08 cs.RO updated version

PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation

Yijin Wang, Yuru Tian, Xijie Huang, Weiqi Gai, Mo Zhu, Xin Zhou, Yuze Wu, Fei Gao

Affiliations: Tsinghua University

AI summary: Proposes a navigation system that uses bird's-eye-view images as a global prior: an image generation model interprets natural-language intent and generates traversability masks, cross-view localization mitigates odometry drift, and a UAV platform completes a 160-meter outdoor long-range navigation task.

Comments
Work in progress. 16 pages, 13 figures

Abstract

Bird's-eye-view (BEV) images have been widely demonstrated to provide valuable prior information for navigation. Given the global information provided by such views, two key challenges remain: how to fully exploit this information and how to reliably use it during execution. In this paper, we propose a navigation system that uses BEV images as global priors and is designed for ground and near-ground robotic platforms. The system employs an image generation model to interpret human intent from natural language, identify the target destination, and generate traversability masks. During execution, we introduce cross-view localization to align the robot's odometry with the BEV map and mitigate long-term drift in conventional odometry. We conduct extensive benchmark experiments to evaluate the proposed method and further validate it on a UAV platform. Using only a conventional local motion planner, the UAV successfully completes a 160-meter outdoor long-range navigation task. This work demonstrates how the world-understanding capabilities of foundation models can be transferred to embodied navigation, enabling robots to benefit from the strong generalization ability of existing image generation models.

2605.05225 2026-06-08 cs.LG cs.AI updated version

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

Bo Li, Chuan Wu, Shaolin Zhu

Affiliations: School of Software, Tsinghua University, Beijing, China; TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China; School of New Media and Communication, Tianjin University, China

AI summary: Targets the efficiency bottleneck that information heterogeneity and modality dynamics cause in expert-parallel inference for multimodal MoE models, and proposes the training-free MACS framework, which optimizes resource allocation through an entropy-weighted load and a dynamic modality-adaptive capacity mechanism and significantly outperforms existing methods on multimodal benchmarks.

Comments
Accepted by ACL 2026

Abstract

Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue is worsened in the multimodal context, as existing token-count-based load balancing methods fail to address two unique challenges: (1) Information Heterogeneity, where numerous redundant visual tokens are treated equally to semantically critical ones, and (2) Modality Dynamics, where varying visual to text ratios across tasks lead to resource misallocation. To address these challenges, we propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework. Specifically, MACS introduces an Entropy-Weighted Load mechanism to quantify the semantic value of visual tokens, addressing information heterogeneity. Additionally, the Dynamic Modality-Adaptive Capacity mechanism allocates expert resources based on the real-time modal composition of the input. Extensive experiments demonstrate that MACS significantly outperforms existing methods on various multimodal benchmarks, providing a novel and robust solution for the efficient deployment of MoE MLLMs in EP inference.

2511.22581 2026-06-08 cs.LG cs.MA updated version

High entropy leads to symmetry-equivariant policies in Dec-POMDPs

Johannes Forkel, Constantin Ruhdorfer, Michael Beukman, Andreas Bulling, Jakob Foerster

Affiliations: FLAIR, Department of Engineering Science, University of Oxford; Collaborative Artificial Intelligence, University of Stuttgart

AI summary: Proves that in Dec-POMDPs sufficiently high entropy regularization ensures that policy gradient converges to a symmetry-equivariant joint policy, and finds experimentally that high entropy coefficients improve inter-seed cross-play returns.

Abstract

We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different initializations will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive evaluation of independent PPO, arguably the standard baseline deep multi-agent policy gradient algorithm, in the Hanabi, Overcooked and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the decrease in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi in particular we achieve a new SOTA in inter-seed cross-play this way. While we give examples of Dec-POMDPs in which one cannot learn the optimal symmetry-equivariant policy this way, both our theoretical and empirical results suggest that one should consider far higher entropy coefficients during hyperparameter sweeps in Dec-POMDPs than is typically done. Code for our experiments can be found at https://github.com/jforkel/JAX-OBL
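The effect described in the abstract can be seen in a toy example of our own (it is not taken from the paper's experiments). Two agents each pick one of two actions and are rewarded when the actions match, so swapping the action labels is a symmetry of the game. Each agent's policy is a sigmoid over one logit, and both follow the gradient of J = P(match) + tau * (H(p1) + H(p2)). In this toy, the uniform policy is the only stationary point once tau exceeds 1/2; for small tau, different initializations settle on different conventions. All names below are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(theta1, theta2, tau, lr=1.0, steps=5000):
    """Gradient ascent on J = P(match) + tau * (H(p1) + H(p2)).

    Agent i picks action A with probability p_i = sigmoid(theta_i);
    the team receives reward 1 when both agents pick the same action.
    """
    for _ in range(steps):
        p1, p2 = sigmoid(theta1), sigmoid(theta2)
        # dJ/dtheta_i = p_i * (1 - p_i) * (2 * p_j - 1 - tau * theta_i)
        g1 = p1 * (1 - p1) * (2 * p2 - 1 - tau * theta1)
        g2 = p2 * (1 - p2) * (2 * p1 - 1 - tau * theta2)
        theta1, theta2 = theta1 + lr * g1, theta2 + lr * g2
    return sigmoid(theta1), sigmoid(theta2)

def match_prob(p_a, p_b):
    """Expected return when agent 1 plays p_a and agent 2 plays p_b."""
    return p_a * p_b + (1 - p_a) * (1 - p_b)

def self_and_cross_play(tau):
    run_a = train(1.0, 1.0, tau)    # initialization leaning towards action A
    run_b = train(-1.0, -1.0, tau)  # initialization leaning towards action B
    self_play = match_prob(run_a[0], run_a[1])
    cross_play = match_prob(run_a[0], run_b[1])  # agent 1 of run A with agent 2 of run B
    return self_play, cross_play

low = self_and_cross_play(tau=0.1)   # runs settle on opposite conventions
high = self_and_cross_play(tau=1.0)  # both runs reach the same uniform policy
```

With the low coefficient, self-play is near 1 but cross-play is near 0; with the high coefficient, self-play and cross-play coincide, at the price of a lower self-play return, which mirrors the trade-off the abstract discusses.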

2605.05220 2026-06-08 cs.LG cs.AI updated version

MidSteer: Optimal Affine Framework for Steering Generative Models

Tatiana Gaintseva, Andrew Stepanov, Ziquan Liu, Martin Benning, Gregory Slabaugh, Jiankang Deng, Ismail Elezi

Affiliations: University of Basel; University of California, Berkeley; ETH Zurich; University of Cambridge; University of Washington

AI summary: Proposes MidSteer, an optimal affine-transformation framework for concept steering that switches concepts in generative models with minimal disturbance, and validates it on vision diffusion models and large language models.

Abstract

Steering intermediate representations has emerged as a powerful strategy for controlling generative models, particularly in post-deployment alignment and safety settings. However, despite its empirical success, it currently lacks a comprehensive theoretical framework. In this paper, we bridge this gap by formalizing the theory of concept steering. First, we establish a link between steering and affine concept erasure, proving that the standard approach for removing unwanted behaviors is a special case of LEACE (a closed-form method for affine erasure). Next, we formulate a principled theoretical framework for concept switching, LEACE-Switch, and characterize the assumptions under which it provides an optimal affine solution. Building on this analysis, we then introduce MidSteer (Minimal Disturbance concept Steering), a more general affine framework for concept manipulation that relaxes these assumptions and enables directed, minimal-disturbance transformations. We demonstrate that MidSteer performs favorably across a range of tasks, modalities, and architectures, including vision diffusion models and large language models.
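To make "affine steering" concrete, here is a minimal sketch of the generic mean-difference approach often used for representation steering: estimate a concept direction from two groups of activations, remove the component along it around a reference mean, and optionally set that component to a target value. This is not LEACE, LEACE-Switch, or MidSteer; those are characterized in the paper under assumptions the abstract does not spell out. The function names and toy activations are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concept_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    diff = [a - b for a, b in zip(mean(pos), mean(neg))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]

def erase(x, mu, direction):
    """Affine map x -> x - ((x - mu) . d) d: removes variation along d around mu."""
    coeff = dot([a - m for a, m in zip(x, mu)], direction)
    return [a - coeff * d for a, d in zip(x, direction)]

def steer(x, mu, direction, target):
    """Erase the concept component, then set it to `target` along d."""
    erased = erase(x, mu, direction)
    return [a + target * d for a, d in zip(erased, direction)]

# Toy 2-D activations: the concept varies along the first axis only.
pos = [[2.0, 1.0], [2.0, 3.0]]
neg = [[0.0, 1.0], [0.0, 3.0]]
d = concept_direction(pos, neg)
mu = mean(pos + neg)
erased = erase([2.0, 1.0], mu, d)           # concept component removed
switched = steer([2.0, 1.0], mu, d, -1.0)   # concept component moved to the other side
```

Both maps leave the coordinates orthogonal to `d` untouched, which is the simplest sense in which an edit can be called low-disturbance.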

2605.01642 2026-06-08 cs.LG updated version

Adaptive Pluralistic Alignment: A pipeline for dynamic artificial democracy

Rachel Freedman

Affiliations: GitHub

AI summary: Proposes the Adaptive Pluralistic Alignment (APA) pipeline, which uses low-rank reward basis decomposition and a jury voting mechanism to track evolving societal values and avoid value lock-in without repeated pretraining or large-scale data collection.

Abstract

Prevailing alignment methods target a fixed set of preferences and therefore risk forcing value lock-in as societal norms evolve over time. We introduce Adaptive Pluralistic Alignment (APA), a modular pipeline for updating pluralistically aligned AI systems to track evolving values and avoid value lock-in without repeating costly pretraining or large-scale data collection. APA has three stages: (1) learning compact personalized reward models via low-rank reward basis decomposition, (2) using these models as a jury that collectively selects among candidate outputs through social-choice-theoretic voting, and (3) efficiently adapting the jury over time by fitting new annotator weights over the fixed reward bases as values shift. The resulting system is efficient, explainable, steerable, and modular. We implement a proof-of-concept instantiation using the PRISM multi-user alignment dataset and simulated historical annotators, and provide preliminary analysis showing that jury composition and the choice of voting rule can substantially affect outcomes, particularly when jury preferences are heterogeneous. We provide full code and resulting preference datasets at https://github.com/RachelFreedman/apa.
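The three stages can be sketched in a few lines, under assumptions of our own. Each juror's reward is taken to be a weighted sum of scores from a small set of shared reward bases, the jury votes with a Borda count (one possible social-choice rule; the abstract does not say which rules the paper uses), and adapting the jury means swapping in new weight vectors while the bases stay fixed. The weights here are set by hand rather than fitted, and every name and number is hypothetical.

```python
def juror_reward(weights, basis_scores):
    """Personalized reward: weighted sum of shared basis reward scores."""
    return sum(w * s for w, s in zip(weights, basis_scores))

def borda_winner(jury_weights, candidates):
    """Each juror ranks candidates by personalized reward; Borda points pick the winner.

    candidates: dict mapping name -> list of basis reward scores (one per basis).
    jury_weights: list of per-juror weight vectors over the same bases.
    """
    points = {name: 0 for name in candidates}
    for weights in jury_weights:
        ranked = sorted(candidates, key=lambda n: juror_reward(weights, candidates[n]))
        for rank, name in enumerate(ranked):  # worst gets 0 points, best gets len - 1
            points[name] += rank
    return max(points, key=points.get), points

# Two fixed reward bases (say, helpfulness and caution) scored per candidate output.
candidates = {
    "direct": [0.9, 0.1],
    "hedged": [0.5, 0.6],
    "refusal": [0.1, 0.9],
}
earlier_jury = [[1.0, 0.2], [0.9, 0.1], [0.3, 1.0]]
later_jury = [[0.5, 0.6], [0.3, 1.0], [0.4, 0.9]]  # new weights, same bases

earlier_choice, earlier_points = borda_winner(earlier_jury, candidates)
later_choice, later_points = borda_winner(later_jury, candidates)
```

Only the small weight vectors change between the two juries, which is why this kind of update is cheap compared with retraining a reward model.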

2604.27011 2026-06-08 cs.LG cs.AI updated version

Automatic Causal Fairness Analysis with LLM-Generated Reporting

Alessia Berarducci, Eric Rossetto, Alessandro Antonucci, Marco Zaffalon

Affiliations: Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA), USI-SUPSI

AI summary: Proposes the FairMind prototype, which performs causal fairness analysis under the standard fairness model, computes causal effects through counterfactual queries, and uses an LLM in a zero-shot setup to generate fairness reports, with advantages over direct LLM analysis.

Comments
23 pages, 6 figures, 3 tables, LaTeX; added missing proof for Proposition 3, typos corrected, updated example 1 to have positive values for the Sankey

Abstract

AutoML, intended as the process of automating the application of machine learning to real-world problems, is a key step for AI popularisation. Most AutoML frameworks do not account for the potential lack of fairness in the training data and in the corresponding predictions. We introduce FairMind, a software prototype aiming to automatise fairness analysis at the dataset level. We achieve that by resorting to the assumptions of the standard fairness model, recently proposed by Plečko and Bareinboim. This allows for a sound fairness evaluation in terms of causal effects, based on counterfactual queries involving the target, possible confounders and mediators, and the different values of an input feature we regard as protected. After the necessary data preprocessing, the tool implements a closed-form computation of the effects. LLMs are then used to generate accurate reports on the fairness levels detected in the training dataset. We achieve that in a zero-shot setup and show by examples the expected advantages with respect to a direct analysis performed by the LLM. To favour applications, extensions to ordinal protected variables and continuous targets and novel decomposition results are also discussed.

2604.00270 2026-06-08 cs.CV updated version

OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

Taiting Lu, Kaiyuan Lin, Yuxin Tian, Mingjia Wang, Yubo Wang, Muchuan Wang, Sharique Khatri, Akshit Kartik, Yixi Wang, Amey Santosh Rane, Yida Wang, Sung-Liang Chen, Yifan Yang, Yi-Chao Chen, Yincheng Jin, Mahanth Gowda

Affiliations: Pennsylvania State University, USA; Independent Researcher; Binghamton University, USA; Shanghai Jiao Tong University, China; Microsoft Research

AI summary: Proposes OmniSch, the first multimodal benchmark for PCB schematic understanding, with four tasks that evaluate large models on visual grounding, graph reasoning, and geometric reasoning, revealing substantial gaps in how current models understand engineering diagrams.

Abstract

Recent large multimodal models (LMMs) have made rapid progress in visual grounding, document understanding, and diagram reasoning tasks. However, their ability to convert Printed Circuit Board (PCB) schematic diagrams into machine-readable spatially weighted netlist graphs, jointly capturing component attributes, connectivity, and geometry, remains largely underexplored, even though such graph representations are the backbone of practical electronic design automation (EDA) workflows. To bridge this gap, we introduce OmniSch, the first comprehensive benchmark designed to assess LMMs on schematic understanding and spatial netlist graph construction. OmniSch contains 1,854 real-world schematic diagrams and includes four tasks: (1) visual grounding for schematic entities, with 109.9K grounded instances aligning 423.4K diagram semantic labels to their visual regions; (2) diagram-to-graph reasoning, understanding topological relationships among diagram elements; (3) geometric reasoning, constructing layout-dependent weights for each connection; and (4) tool-augmented agentic reasoning for visual search, invoking external tools to accomplish (1)-(3). Our results reveal substantial gaps in how current LMMs interpret schematic engineering artifacts, including unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning, and inefficient visual exploration.

2604.23057 2026-06-08 cs.AI updated version

Don't Make the LLM Read the Graph: Make the Graph Think

Yuqi Sun, Tianqin Meng, George Liu, Yashraj Panwar, Lakshya Chaudhry, Munasib Ilham, Aman Chadha

Affiliations: Mindoverflow; University of Waterloo; Carnegie Mellon University; Foothill College; Purdue University; University of Wisconsin; Apple

AI summary: Through more than 3,000 controlled trials, studies whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning; finds that the integration architecture determines the graphs' value, identifies the "Planner Defiance" phenomenon, and reports preliminary evidence of diminishing returns from graph depth.

Comments
main body has 9 pages, 4 figures, under review for COLM 2026 conference

Abstract

We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanabi, we establish four findings. First, integration architecture determines whether belief graphs provide value: as prompt context, graphs are decorative for strong models and beneficial only for weak models on 2nd-order Theory of Mind (80% vs 10%, p<0.0001, OR=36.0); when graphs gate action selection through ranked shortlists, they become structurally essential even for strong models (100% vs 20% on 2nd-order ToM, p<0.001). Second, we identify "Planner Defiance," a model-family-specific failure where LLMs override correct planner recommendations at partial competence (90% override, replicated N=20); Gemini models show near-zero defiance while Llama 70B shows 90%, and models distinguish factual context (deferred to) from advisory recommendations (overridden). Third, full-game evidence confirms inter-agent conventions (+128% over baseline, p=0.003) outperform all single-agent interventions, and individual belief-graph components must be combined to produce gains. Fourth, preliminary scaling analysis (N=10/cell, exploratory) suggests graph depth has diminishing returns: shallow graphs provide the best cost-benefit ratio, while deeper ToM graphs appear harmful at larger player counts (-1.5 pts at 5-player, p=0.029).

2604.18401 2026-06-08 cs.CL updated version

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

AI summary: Proposes StepPO, a step-level policy optimization method that reformulates agentic reinforcement learning from a token-level MDP into a step-level MDP and introduces step-level credit assignment to resolve the mismatch between existing algorithms and the granularity of agent decisions; it outperforms a range of RL algorithms on tasks such as multi-hop question answering.

Abstract

Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic trajectory representations. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, StepPO optimizes agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how the step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.
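One way to picture step-level credit assignment is to compute a return for each interaction step and give every token generated in that step the same advantage, instead of estimating a separate value per token. The sketch below does exactly that with a discounted return-to-go and a mean baseline. Both choices are our assumptions for illustration, not StepPO's actual estimator, and the trajectory is made up.

```python
def step_returns(step_rewards, gamma=1.0):
    """Discounted return-to-go for each interaction step."""
    returns, running = [], 0.0
    for reward in reversed(step_rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]

def step_level_token_advantages(steps, gamma=1.0):
    """Assign one advantage per step and share it across that step's tokens.

    steps: list of (tokens, reward) pairs, one per agent-environment step.
    Baseline: mean of the step returns (a simple illustrative choice).
    """
    returns = step_returns([reward for _, reward in steps], gamma)
    baseline = sum(returns) / len(returns)
    return [[(tok, ret - baseline) for tok in tokens]
            for (tokens, _), ret in zip(steps, returns)]

# A three-step trajectory: search, read, answer; only the final step is rewarded.
trajectory = [
    (["search", "(", "query", ")"], 0.0),
    (["read", "doc"], 0.0),
    (["answer", "42"], 1.0),
]
returns = step_returns([r for _, r in trajectory], gamma=0.5)
advantages = step_level_token_advantages(trajectory, gamma=0.5)
```

Here the unit of credit is the step: the four tokens of the search call all carry one shared advantage, regardless of how many tokens the call happened to take.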

2604.17433 2026-06-08 cs.CL cs.AI cs.LG updated version

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

Raman Saparkhan, Majd Hawasly, Md Rizwan Parvez, Mohammad Raza

Affiliations: Carnegie Mellon University Qatar; Qatar Computing Research Institute

AI summary: Proposes a hybrid ensembling method that combines Chain-of-Thought and Program-of-Thought reasoning within self-consistency, cutting the number of samples required by a factor of 9.3, with 78.6% of tasks addressed using only two samples.

Comments
9 pages, 3 figures; accepted to Findings of ACL 2026

Abstract

Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled outputs, but it comes at a high computational cost due to extensive sampling. We introduce a hybrid ensembling approach that leverages the complementary strengths of two distinct modes of reasoning: Chain-of-Thought (CoT) and Program-of-Thought (PoT). We describe a general framework for combining these two forms of reasoning in self-consistency, as well as particular strategies for both full sampling and early-stopping. We show that CoT-PoT ensembling not only improves overall accuracy, but also drastically reduces the number of samples required for SC by a factor of 9.3x. In particular, the majority of tasks (78.6%) can be addressed with only two samples, which has not been possible with any prior SC methods.
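An early-stopping scheme consistent with the abstract might look like the following: draw one CoT answer and one PoT answer, stop if they agree, and otherwise keep alternating samples up to a budget before taking a majority vote. The stopping rule and the alternation order are our assumptions, since the abstract does not give the paper's exact strategy, and the samplers are stand-ins for LLM calls.

```python
from collections import Counter

def cot_pot_self_consistency(sample_cot, sample_pot, max_samples=10):
    """Alternate CoT and PoT samples; stop early once the two modes agree.

    sample_cot / sample_pot: zero-argument callables returning a final answer
    (hypothetical stand-ins for LLM calls). Requires max_samples >= 2.
    Returns (answer, samples_used).
    """
    answers = [sample_cot(), sample_pot()]
    if answers[0] == answers[1]:
        return answers[0], 2  # cross-mode agreement after only two samples
    while len(answers) < max_samples:
        sampler = sample_cot if len(answers) % 2 == 0 else sample_pot
        answers.append(sampler())
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, len(answers)

# Stub samplers that replay fixed answers, for illustration only.
easy = cot_pot_self_consistency(iter(["12"]).__next__, iter(["12"]).__next__)
hard = cot_pot_self_consistency(iter(["7", "8", "8"]).__next__,
                                iter(["8", "8"]).__next__,
                                max_samples=5)
```

The intuition is that agreement between a natural-language derivation and an executed program is stronger evidence than agreement between two samples of the same kind, so two samples can be enough when they match.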

2604.10578 2026-06-08 cs.CV updated version

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

Dehui Wang, Rong Wei, Yue Shi, Congsheng Xu, Shoufa Chen, Dingxiang Luo, Tianshuo Yang, Xiaokang Yang, Wei Sui, Yusen Qin, Rui Tang, Yao Mu

Affiliations: Shanghai Jiao Tong University; Manycore Tech Inc.; D-Robotics; The University of Hong Kong

AI summary: Proposes the Rein3D framework, which couples 3D Gaussian Splatting with video diffusion models and follows a "restore-and-refine" paradigm to generate globally consistent 360-degree indoor scenes from sparse inputs; it also constructs the PanoV2V-15K dataset and significantly improves long-range camera exploration.

Abstract

The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a "restore-and-refine" paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.

2604.10098 2026-06-08 cs.LG updated version

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, Jing Xiong, Hui Shen, Keyu Fan, Weihao Ye, Chaofan Tao, Taiqiang Wu, Zhongwei Wan, Tiantian Zhang, Bowen Yan, Zhen Li, Yiming Zhang, Congkai Xie, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong

Affiliations: Tsinghua University; Meituan LongCat Team; The University of Hong Kong; University of Michigan; Xiamen University; The Ohio State University; Columbia University; Shanghai Artificial Intelligence Laboratory; The Hong Kong Polytechnic University

AI summary: Presents the first systematic survey of the Attention Sink phenomenon in Transformers, organizing current research along three dimensions (fundamental utilization, mechanistic interpretation, and strategic mitigation) to guide future work.

Abstract

As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affects training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work makes a pivotal contribution by highlighting the key concepts and main trends in the field, guiding researchers through the evolution of AS-related studies. We envision this survey as a valuable resource, empowering researchers to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome-Attention-Sink.

2604.08168 2026-06-08 cs.RO cs.AI updated version

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Jindi Lv, Hao Li, Jie Li, Fankun Kong, Yang Wang, Pengfei Yi, Yifei Nie, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang

Affiliations: GigaAI; Sichuan University; Tsinghua University

AI summary: Proposes ViVa, which uses a pretrained video generator to jointly predict future proprioception and a scalar value, obtaining reliable value estimates from spatiotemporal priors; it achieves state-of-the-art results on three tasks and, combined with RECAP, an average success rate of 80%.

Abstract

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics and physical interactions, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator to jointly predict future proprioception and a scalar value. By grounding value estimation in anticipated embodiment dynamics, ViVa leverages spatiotemporal priors to intrinsically couple value with foresight beyond static snapshots. ViVa achieves state-of-the-art results in metric-based evaluation across three tasks, producing reliable value signals that accurately track task progress and detect execution errors. Integrated into RECAP, it achieves an average success rate of 80%, highlighting the promise of video-generative models for value estimation.

2604.07472 2026-06-08 cs.LG cs.NI updated version

Scalable Joint Resource Allocation for SLO-Constrained LLM Inference in Heterogeneous GPU Clouds

Jiaming Cheng, Duong Tung Nguyen

Affiliations: Ira A. Fulton Schools of Engineering, Arizona State University

AI summary: For SLO-constrained LLM inference in heterogeneous GPU clouds, proposes a scalable framework whose constraint-aware heuristics (GH and AGH) perform joint resource allocation, producing feasible, near-optimal solutions within seconds and substantially reducing cost and SLO violations.

Abstract

Serving large language model (LLM) inference in cloud environments requires jointly optimizing model selection, GPU provisioning, parallelism configuration, and workload routing under latency, accuracy, memory, and budget constraints. While mixed-integer linear programming (MILP) can model this problem, its computational cost limits frequent re-optimization under demand variability. Existing heuristics often optimize individual components separately and may become infeasible when system-wide constraints are enforced. This paper presents a scalable framework for SLO-constrained LLM inference. We formulate the problem as an MILP with a two-phase delay model capturing both prefill and autoregressive decoding under tensor and pipeline parallelism. To solve it efficiently, we develop two constraint-aware heuristics: a Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH). AGH extends GH through multi-start construction, local search, and GPU consolidation. Both methods maintain feasibility through parallelism-aware filtering, cost-based ranking, and adaptive parallelism scaling. Experiments based on the Azure LLM Inference Trace show that GH generates feasible solutions within one second, while AGH achieves near-optimal performance within three seconds and scales to large instances where exact solvers fail to converge. Under out-of-sample stress with up to 1.5x delay and accuracy inflation, AGH degrades gracefully through provisioned headroom, yielding substantially lower cost and SLO violations than cost-minimal MILP solutions. Across synthetic and real Azure workloads, AGH maintains SLO compliance at significantly lower cost than exact MILP solutions. These results demonstrate that high-quality allocations provide substantial robustness to demand variability while enabling rapid adaptation to workload changes.
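The abstract describes heuristics that keep solutions feasible by filtering configurations and ranking them by cost. A minimal Python sketch of that general pattern follows. It is not the paper's GH or AGH algorithm: the `Config` fields, the example numbers, and the `greedy_allocate` helper are all hypothetical, and the sketch ignores parallelism scaling, routing, accuracy, memory, and budget constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str              # hypothetical label for a GPU/parallelism choice
    cost_per_hour: float   # assumed price of one replica
    latency_ms: float      # assumed end-to-end latency of one request
    throughput_rps: float  # assumed requests per second one replica sustains


def greedy_allocate(configs, demand_rps, slo_ms):
    """Pick the cheapest-per-request feasible config and size it to demand."""
    # Step 1: feasibility filter, drop configs that violate the latency SLO.
    feasible = [c for c in configs if c.latency_ms <= slo_ms]
    if not feasible:
        return None
    # Step 2: cost-based ranking by cost per unit of throughput.
    best = min(feasible, key=lambda c: c.cost_per_hour / c.throughput_rps)
    # Step 3: provision enough replicas to cover demand (ceiling division).
    replicas = int(-(-demand_rps // best.throughput_rps))
    return best.name, replicas, replicas * best.cost_per_hour


configs = [
    Config("small-gpu", cost_per_hour=1.0, latency_ms=900.0, throughput_rps=4.0),
    Config("mid-gpu", cost_per_hour=2.5, latency_ms=400.0, throughput_rps=8.0),
    Config("large-gpu", cost_per_hour=6.0, latency_ms=200.0, throughput_rps=25.0),
]
plan = greedy_allocate(configs, demand_rps=40.0, slo_ms=500.0)
```

With these made-up numbers the SLO filter removes `small-gpu`, and `large-gpu` wins the ranking because its cost per request is lowest even though its hourly price is highest.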

2604.06684 2026-06-08 cs.LG 版本更新

GraphWalker: Patient Analogy Meets Information Gain for Clinical Reasoning with Large Language Models

GraphWalker: 患者类比与信息增益结合用于大型语言模型的临床推理

Yue Fang, Weibin Liao, Yuxin Guo, Jiaran Gao, Hongxin Ding, Jinyang Zhang, Xinke Jiang, Zhibang Yang, Junfeng Zhao, Yasha Wang, Liantao Ma

发表机构 * School of Computer Science, Peking University, Beijing, China(北京大学计算机学院,北京,中国) National Engineering Research Center for Software Engineering, Peking University, Beijing, China(软件工程国家工程研究中心,北京大学,北京,中国)

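AI summary: Proposes GraphWalker, a framework that retrieves patient cases from electronic health records for reasoning by analogy by combining data-driven and model-driven perspectives, discovering patient cohorts, and applying lazy greedy search, improving clinical reasoning without parameter updates.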

AI中文摘要

在电子健康记录(EHR)上进行临床推理是现代医疗中一项基本但具有挑战性的任务。虽然大型语言模型(LLM)通过上下文演示提供了一种有前景的范式,无需特定任务的参数更新,但现有的基于患者类比推理的方法在EHR设置中存在三个核心局限性:(1)视角局限性,数据驱动的相似性与LLM推理需求不一致,而模型驱动的信号受限于有限的临床能力;(2)队列意识,演示独立选择,未建模群体级结构;(3)信息聚合,忽略演示之间的冗余和交互效应。我们提出GraphWalker,一个无需训练的框架,让冻结的LLM通过检索到的患者案例进行类比推理。GraphWalker(i)联合利用数据驱动和模型驱动视角,(ii)发现患者队列以将检索基于群体级结构,(iii)采用带前沿扩展的懒惰贪心搜索来组合具有高边际信息增益的演示。在多个真实EHR基准上的大量实验表明,GraphWalker始终优于最先进的演示选择基线,并且在跨数据集分布偏移下保持更强的鲁棒性,无需特定任务的参数更新。GraphWalker进一步泛化到黑盒LLM,并自然地与智能体推理框架组合,使其成为基于LLM的临床工作流中可插拔的患者类比技能。我们的代码可在https://github.com/PuppyKnightUniversity/GraphWalker获取。

英文摘要

Clinical reasoning over electronic health records (EHRs) is a fundamental yet challenging task in modern healthcare. While large language models (LLMs) offer a promising paradigm via in-context demonstrations that requires no task-specific parameter updates, existing methods for reasoning by patient analogy in EHR settings suffer from three core limitations: (1) Perspective Limitation, where data-driven similarity misaligns with LLM reasoning needs while model-driven signals are constrained by limited clinical competence; (2) Cohort Awareness, as demonstrations are selected independently without modeling population-level structure; and (3) Information Aggregation, where redundancy and interaction effects among demonstrations are ignored. We propose GraphWalker, a training-free framework that lets frozen LLMs reason by analogy over retrieved patient cases. GraphWalker (i) jointly leverages data-driven and model-driven perspectives, (ii) discovers patient cohorts to ground retrieval in population-level structure, and (iii) employs a lazy greedy search with frontier expansion to compose demonstrations with high marginal information gain. Extensive experiments on multiple real-world EHR benchmarks show that GraphWalker consistently outperforms state-of-the-art demonstration selection baselines, and remains substantially more robust under cross-dataset distribution shift, without task-specific parameter updates. GraphWalker further generalizes to black-box LLMs and composes naturally with agentic reasoning frameworks, positioning it as a pluggable patient-analogy skill in LLM-based clinical workflows. Our code is available at https://github.com/PuppyKnightUniversity/GraphWalker.
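The abstract mentions a lazy greedy search that composes demonstrations with high marginal information gain. The sketch below shows the generic lazy greedy pattern only, assuming the gain function is submodular (marginal gains never increase as the selection grows). The toy coverage gain, the case names, and the `lazy_greedy` helper are hypothetical; GraphWalker's actual gain function and its frontier expansion are not reproduced here.

```python
import heapq


def lazy_greedy(candidates, gain, k):
    """Select up to k items, recomputing marginal gains only when needed."""
    selected = []
    # Max-heap via negated gains; each entry stores the selection size
    # at which its gain was last computed.
    heap = [(-gain(c, selected), 0, c) for c in candidates]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, stamp, item = heapq.heappop(heap)
        if stamp == len(selected):
            # The gain is current, so under submodularity it is the maximum.
            if -neg_gain <= 0:
                break
            selected.append(item)
        else:
            # Stale upper bound: recompute it and push the item back.
            heapq.heappush(heap, (-gain(item, selected), len(selected), item))
    return selected


# Toy data: each hypothetical patient case covers a set of findings.
cases = {
    "p1": {"fever", "cough"},
    "p2": {"fever", "cough", "rash"},
    "p3": {"rash"},
    "p4": {"headache"},
}


def coverage_gain(case, selected):
    """Number of findings the case adds beyond those already covered."""
    covered = set().union(*(cases[s] for s in selected)) if selected else set()
    return len(cases[case] - covered)


picked = lazy_greedy(list(cases), coverage_gain, k=2)
```

The laziness matters when each gain evaluation is expensive (for example, an LLM call): most candidates are never re-scored because their stale bound is already below the current best.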

2510.24561 2026-06-08 cs.LG cs.AI 版本更新

LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

LoRA-DA:基于渐近分析的低秩自适应数据感知初始化

Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang

发表机构 * arXiv.org University of Science and Technology of China(中国科学技术大学)

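AI summary: Proposes LoRA-DA, which uses asymptotic analysis to optimize the initialization of low-rank adaptation, combining a Fisher-gradient formulation with Fisher information to minimize parameter discrepancy and improving fine-tuning accuracy and convergence stability.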

Comments
Published at ICML 2026
AI中文摘要

LoRA已成为广泛采用的PEFT方法,其初始化方法日益受到关注。然而,现有方法存在显著局限性:许多方法未纳入目标域数据,而基于梯度的方法仅通过依赖一步梯度分解在浅层利用数据。本文建立了数据感知LoRA初始化的理论框架。从最小化微调模型与目标模型之间参数偏差的期望出发,我们推导出一个包含两项的优化问题:偏差项,与微调模型和目标模型之间的参数距离相关,并使用Fisher梯度公式近似以保持各向异性;方差项,通过Fisher信息考虑采样随机性引入的不确定性。求解该问题得到LoRA的最优初始化策略,基于此我们开发了高效算法LoRA-DA。跨多个基准的实验结果表明,LoRA-DA在最终准确率上持续优于现有初始化方法。附加研究显示其收敛更快、更稳定,跨秩鲁棒性强,且初始化开销小。源代码见https://github.com/zqy0126/LoRA-DA。

英文摘要

LoRA has become a widely adopted method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition. In this paper, we establish a theoretical framework for data-aware LoRA initialization. Starting from minimizing the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. Solving this problem yields an optimal initialization strategy for LoRA, based on which we develop an efficient algorithm, LoRA-DA. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code is available at https://github.com/zqy0126/LoRA-DA.

2604.03779 2026-06-08 cs.LG cs.AI 版本更新

CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data

CountsDiff: 一种用于计数数据生成和插补的自然数扩散模型

Renzo G. Soatto, Anders Hoel, Greycen Ren, Shorna Alam, Stephen Bates, Nikolaos P. Daskalakis, Caroline Uhler, Maria Skoularidou

发表机构 * Princeton University(普林斯顿大学) Stanford University(斯坦福大学) University of California, Berkeley(加州大学伯克利分校)

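AI summary: Proposes CountsDiff, a diffusion framework that simplifies Blackout diffusion through a survival probability schedule and explicit loss weighting, adds continuous-time training, classifier-free guidance, and churn/remasking reverse dynamics, and matches or surpasses existing methods on natural images and single-cell RNA-seq imputation.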

Comments
39 Pages, 11 figures. To appear in the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

扩散模型在连续和基于token的领域中的生成任务表现出色,但其在离散序数数据上的应用仍不成熟。我们提出CountsDiff,一个旨在对自然数上的分布进行建模的扩散框架。CountsDiff扩展了Blackout扩散框架,通过直接参数化(基于生存概率调度和显式损失加权)简化其公式。这通过引入与现有扩散建模框架中直接类似的设计参数,带来了灵活性。除了这种重新参数化,CountsDiff还引入了现代扩散模型中的特性,这些特性此前在基于计数的领域中缺失,包括连续时间训练、无分类器引导以及允许非单调逆轨迹的搅动/重掩码逆动态。我们提出了CountsDiff的初始实例化,并在自然图像数据集(CIFAR-10、CelebA)上进行了验证,探索了所引入的设计参数在一个复杂、经过充分研究且可解释的数据领域中的影响。然后,我们强调生物计数分析作为一个自然用例,在胎儿和心脏细胞图谱中评估了CountsDiff在单细胞RNA-seq插补上的表现。值得注意的是,我们发现即使这种简单的实例化也能匹配或超越最先进的离散生成模型和领先的scRNA-seq插补方法的性能,同时通过未来工作中优化的设计选择,仍有很大的提升空间。

英文摘要

Diffusion models have excelled at generative tasks for both continuous and token-based domains, but their application to discrete ordinal data remains underdeveloped. We present CountsDiff, a diffusion framework designed to model distributions on the natural numbers. CountsDiff extends the Blackout diffusion framework by simplifying its formulation through a direct parameterization in terms of a survival probability schedule and an explicit loss weighting. This introduces flexibility through design parameters with direct analogues in existing diffusion modeling frameworks. Beyond this reparameterization, CountsDiff introduces features from modern diffusion models, previously absent in counts-based domains, including continuous-time training, classifier-free guidance, and churn/remasking reverse dynamics that allow non-monotone reverse trajectories. We propose an initial instantiation of CountsDiff and validate it on natural image datasets (CIFAR-10, CelebA), exploring the effects of the introduced design parameters in a complex, well-studied, and interpretable data domain. We then highlight biological count assays as a natural use case, evaluating CountsDiff on single-cell RNA-seq imputation in fetal and heart cell atlases. Remarkably, we find that even this simple instantiation matches or surpasses the performance of a state-of-the-art discrete generative model and leading scRNA-seq imputation methods, while leaving substantial headroom for further gains through optimized design choices in future work.
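To make the idea of a survival probability schedule concrete, here is a minimal sketch of a forward corruption step on count data. It assumes the forward process is binomial thinning, in which each unit of a count independently survives to time t with probability s(t). The cosine-shaped `survival_schedule` and the `forward_thin` helper are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def survival_schedule(t):
    """Hypothetical schedule: survival probability falls from 1 at t=0 toward 0 at t=1."""
    return np.cos(0.5 * np.pi * t) ** 2


def forward_thin(x0, t, rng):
    """Sample x_t given x_0 by keeping each unit count with probability s(t)."""
    return rng.binomial(x0, survival_schedule(t))


rng = np.random.default_rng(0)
x0 = np.array([0, 3, 12, 250])        # toy counts, e.g. gene expression reads
x_start = forward_thin(x0, 0.0, rng)  # s(0) = 1, so nothing is removed
x_half = forward_thin(x0, 0.5, rng)   # s(0.5) = 0.5, roughly half survive
```

Because thinning only removes units, every corrupted count stays a natural number no larger than the original, which is what keeps the process on the natural numbers.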

2604.02029 2026-06-08 cs.AI 版本更新

The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

潜空间:基础、演化、机制、能力与展望

Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Guanting Dong, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Zhucun Xue, Cheng Tan, Jiangning Zhang, Wenqi Ren, Yanwei Fu, Yong Liu, Yu Wang, Xiangyu Yue, Yu-Gang Jiang, Shuicheng Yan

发表机构 * National University of Singapore(新加坡国立大学) Fudan University(复旦大学) Tsinghua University(清华大学) Zhejiang University(浙江大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Renmin University of China(中国人民大学) The Chinese University of Hong Kong(香港中文大学) The Hong Kong University of Science and Technology(香港科技大学) DeepWisdom(深智科技) Nanjing University(南京大学) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) Tencent Hunyuan(腾讯混元) QuantaAlpha(量子阿尔法) Beijing University of Posts and Telecommunications(北京邮电大学) Zhejiang Lab(浙江实验室) University of Chinese Academy of Sciences(中国科学院大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Sun Yat-sen University(中山大学)

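AI summary: Surveys the foundation, evolution, mechanisms, and abilities of latent space in language-based models, argues that it overcomes structural limitations of explicit-space computation, and outlines future research directions.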

AI中文摘要

潜空间正迅速成为基于语言的模型的原生基底。虽然现代系统仍常通过显式词元级生成来理解,但越来越多的工作表明,许多关键内部过程在连续潜空间中比在人类可读的言语痕迹中更自然地执行。这一转变由显式空间计算的结构性限制驱动,包括语言冗余、离散化瓶颈、顺序低效和语义损失。本综述旨在提供语言模型中潜空间的统一且最新的全景。我们按五个连续视角组织综述:基础、演化、机制、能力和展望。首先,我们界定潜空间的范围,将其与显式或言语空间以及生成式视觉模型中常见的潜空间区分开来。然后,我们追溯该领域从早期探索性工作到当前大规模扩展的演化过程。为组织技术全景,我们通过机制和能力两个互补视角审视现有工作。从机制视角,我们识别出四个主要发展路线:架构、表示、计算和优化。从能力视角,我们展示潜空间如何支持广泛的能力谱系,涵盖推理、规划、建模、感知、记忆、协作和具身。除整合外,我们讨论关键开放挑战,并概述未来研究的有前景方向。我们希望本综述不仅作为现有工作的参考,而且作为理解潜空间作为下一代智能的通用计算和系统范式的基础。

英文摘要

Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.

2604.01313 2026-06-08 cs.LG nucl-ex physics.data-an physics.ins-det 版本更新

ScatterPrism: convergence for generative simulation and inverse problems in particle and nuclear physics

ScatterPrism:粒子与核物理中生成模拟与逆问题的收敛性

Zeyu Xia, Tyler Kim, Trevor Reed, Judy Fox, Geoffrey Fox, Adam Szczepaniak

发表机构 * University of Maryland(马里兰大学)

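AI summary: Addresses the premature plateau of the Conditional Flow Matching (CFM) training loss in particle physics simulation by proposing ScatterPrism, a generative surrogate model that uses physics-informed metrics to ensure true kinematic fidelity, with potential extension to high-energy physics and other domains.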

Comments
21 pages, 16 figures. Accepted for publication in JINST (AI4EIC 2025 proceedings)
AI中文摘要

高保真模拟和复杂逆问题(如探测器建模和解折叠)是亚原子物理中计算密集的瓶颈,但对于准确的物理解释至关重要。虽然条件流匹配(CFM)提供了一种稳健的加速方法,但我们证明其标准训练损失从根本上具有误导性。具体而言,利用杰斐逊实验室核物理(NP)运动学数据集($\gamma p \to \rho^0 p \to \pi^+\pi^- p$),我们发现CFM损失过早进入平台期,掩盖了持续的物理改进。为了验证这种脱节是与数据集无关的病理现象,我们引入了ScatterPrism,这是一种高效的生成代理模型,在NP数据和模拟挑战性一维分布拓扑的合成压力测试上进行了评估。结合这些基准测试,我们确定物理信息指标在标准损失收敛后仍持续改进。因此,我们提出了一种多指标诊断协议,以确保在没有数据记忆的情况下实现真正的运动学保真度。受即将到来的电子-离子对撞机(EIC)相关NP挑战的驱动,这一统一机制有潜力扩展到高能物理(HEP)应用,如喷注建模。此外,该框架有望应用于需要严格生成可靠性的更广泛领域,包括医学成像、天体物理学和定量金融。

英文摘要

High-fidelity simulations and complex inverse problems, such as detector modeling and unfolding, are computationally intensive bottlenecks across subatomic physics, yet essential for accurate physical interpretation. While Conditional Flow Matching (CFM) offers a robust acceleration approach, we demonstrate that its standard training loss is fundamentally misleading. Specifically, using a Jefferson Lab Nuclear Physics (NP) kinematic dataset ($\gamma p \to \rho^0 p \to \pi^+\pi^- p$), we show that the CFM loss plateaus prematurely, obscuring ongoing physical refinement. To verify that this disconnect is a dataset-agnostic pathology, we introduce ScatterPrism, an efficient generative surrogate evaluated against both the NP data and synthetic stress tests modeling challenging 1D distribution topologies. Coupling these benchmarks, we establish that physics-informed metrics continue improving long after the standard loss converges. Consequently, we propose a multi-metric diagnostic protocol to ensure true kinematic fidelity without data memorization. Driven by NP challenges relevant to the forthcoming Electron-Ion Collider (EIC), this unified machinery has strong potential to extend to High-Energy Physics (HEP) applications, such as jet modeling. Furthermore, the framework holds promise for broader domains requiring rigorous generative reliability, including medical imaging, astrophysics, and quantitative finance.

2603.28304 2026-06-08 cs.CL 版本更新

The Necessity of Setting Temperature in LLM-as-a-Judge

LLM作为评判者中设置温度的必要性

Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State

发表机构 * University of Luxembourg(卢森堡大学) ETH Zürich(苏黎世联邦理工学院)

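AI summary: Systematically studies how temperature affects LLM judge behavior, finding that higher temperatures reduce consistency but expose latent uncertainty; low temperatures suit tasks that prioritize stability, higher temperatures suit complex or ambiguous scenarios, and temperature should be treated as a task-dependent design choice.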

Comments
17 pages
AI中文摘要

使用大型语言模型(LLM)作为评判者来评估模型输出已成为自动化评估的重要范式。然而,在LLM作为评判者的设置中,解码温度的选择在很大程度上仍然是经验性的,缺乏关于其影响的系统证据。为了解决这一差距,我们系统研究了温度如何影响不同LLM评判模型、提示策略和评估范式下的评判行为。我们的结果表明,较高的温度通常会降低评判一致性并增加格式错误,同时也会暴露潜在的不确定性,这种不确定性在低温解码下往往被抑制,尤其是在模糊案例中。进一步的分析表明,较高的温度可以作为探索机制,并可能提高复杂或不确定评估场景中的评判性能。总体而言,低温设置更适合优先考虑稳定性和可重复性的任务,而高温设置更适合涉及大量模糊性或复杂性的场景,在这些场景中,探索评判者的决策空间是有益的。这些发现表明,在LLM作为评判者的系统中,温度不应被视为固定的超参数,而应被视为可控的、任务相关的设计选择,它调节了可靠性与探索之间的权衡。

英文摘要

Using large language models (LLMs) as judges for evaluating model outputs has emerged as an important paradigm for automated evaluation. However, the decoding temperature in LLM-as-a-judge settings is still largely chosen empirically, with limited systematic evidence on its impact. To address this gap, we conduct a systematic study of how temperature affects judgment behavior across different LLM judge models, prompting strategies, and evaluation paradigms. Our results show that higher temperatures generally decrease judgment consistency and increase formatting errors, while also exposing latent uncertainty that tends to remain suppressed under low-temperature decoding, particularly in ambiguous cases. Further analysis suggests that higher temperatures can serve as an exploratory mechanism and may improve judging performance in complex or uncertain evaluation scenarios. Overall, low-temperature settings are better suited to tasks that prioritize stability and reproducibility, whereas higher-temperature settings are more appropriate for scenarios involving substantial ambiguity or complexity, where exploration of the judge's decision space is beneficial. These findings suggest that, in LLM-as-a-judge systems, temperature should be treated not as a fixed hyperparameter, but as a controllable, task-dependent design choice that mediates the trade-off between reliability and exploration.
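For readers less familiar with decoding temperature, the sketch below shows the standard temperature-scaled softmax: logits are divided by the temperature before normalization, so low temperatures concentrate probability on the top verdict and higher temperatures spread it out. The three-way verdict logits are made-up values for illustration, not data from the paper.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature must be positive."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()


def entropy(probs):
    """Shannon entropy in nats, a simple measure of verdict uncertainty."""
    return float(-(probs * np.log(probs)).sum())


# Hypothetical judge logits for the verdicts "A better", "B better", "tie".
logits = [2.0, 1.0, 0.5]
p_low = softmax_with_temperature(logits, 0.2)
p_high = softmax_with_temperature(logits, 1.5)
```

At temperature 0.2 almost all mass sits on the first verdict, so repeated samples agree; at 1.5 the second and third verdicts receive substantial probability, which lowers consistency but makes the judge's underlying uncertainty visible across samples.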

2511.14019 2026-06-08 cs.CV 版本更新

RISE: Single Static Radar-based Indoor Scene Understanding

RISE:基于单静态雷达的室内场景理解

Kaichen Zhou, Laura Dodds, Sayed Saad Afzal, Fadel Adib

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Cartesian Systems

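AI summary: Proposes RISE, which exploits the geometric cues encoded in mmWave radar multipath reflections (traditionally treated as noise) and uses Bi-Angular Multipath Enhancement with a simulation-to-reality Hierarchical Diffusion framework for layout reconstruction and object detection; on a 50,000-frame dataset it reduces the layout-reconstruction Chamfer Distance by 60% and delivers the first mmWave-based object detection.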

AI中文摘要

鲁棒且保护隐私的室内场景理解仍然是一个基本开放问题。虽然光学传感器(如RGB和LiDAR)提供高空间保真度,但它们在室内环境中遭受严重遮挡并引入隐私风险。相比之下,毫米波雷达保护隐私并穿透障碍物,但其固有的低空间分辨率使得可靠的几何推理变得困难。我们介绍了RISE,这是首个用于单静态雷达室内场景理解的基准和系统,同时针对布局重建和物体检测。RISE基于一个关键洞察:多径反射——传统上被视为噪声——编码了丰富的几何线索。为了利用这一点,我们提出了一种双角度多径增强方法,显式建模到达角和离开角,以恢复二次(鬼影)反射并揭示不可见结构。在这些增强观测的基础上,一个模拟到现实的分层扩散框架将碎片化的雷达响应转化为完整的布局重建和物体检测。我们的基准包含100条真实室内轨迹中收集的50,000帧数据,形成了首个专门用于单静态雷达室内场景理解的大规模数据集。大量实验表明,与最先进的毫米波布局重建方法相比,RISE将倒角距离降低了60%(降至16厘米),并实现了首个基于毫米波雷达的物体检测,IoU达到58%。这些结果确立了RISE作为使用单静态雷达进行几何感知和隐私保护室内场景理解的新基础。我们的网站和代码可在https://rise-cvpr.github.io获取。

英文摘要

Robust and privacy-preserving indoor scene understanding remains a fundamental open problem. While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments. In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult. We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection. RISE is built upon the key insight that multipath reflections, traditionally treated as noise, encode rich geometric cues. To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections and reveal invisible structures. On top of these enhanced observations, a simulation-to-reality Hierarchical Diffusion framework transforms fragmented radar responses into complete layout reconstruction and object detection. Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to single, static, radar-based indoor scene understanding. Extensive experiments show that RISE reduces the Chamfer Distance by 60% (down to 16 cm) compared to the state of the art in mmWave layout reconstruction, and delivers the first mmWave-based object detection, achieving 58% IoU. These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar. Our website and code are available at https://rise-cvpr.github.io.
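The layout results above are reported as Chamfer Distance. As a reference point, here is a minimal sketch of one common definition: the mean nearest-neighbour distance from each set to the other, summed over both directions. Conventions vary (some works use squared distances or average the two terms), and the abstract does not say which variant RISE uses, so treat this form and the toy points as assumptions.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, D) and b (M, D)."""
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbour distance from a to b, plus from b to a.
    return d.min(axis=1).mean() + d.min(axis=0).mean()


# Toy 2D layouts (units are arbitrary here).
predicted = np.array([[0.0, 0.0], [1.0, 0.0]])
reference = np.array([[0.0, 0.0], [1.0, 1.0]])
cd = chamfer_distance(predicted, reference)
```

The pairwise matrix makes this quadratic in memory, which is fine for small layouts; larger point clouds would normally use a spatial index for the nearest-neighbour queries.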

2603.26846 2026-06-08 cs.LG cs.AI 版本更新

Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry

稳定推理,不稳定响应:通过稳定性不对称缓解大语言模型欺骗

Guoxi Zhang, Jiawei Chen, Tianzhuo Yang, Lang Qin, Juntao Dai, Yaodong Yang, Jingwei Yi

发表机构 * Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) Beijing Academy of Artificial Intelligence(北京人工智能研究院) School of Chinese as a Second Language, Peking University(北京大学第二语言学院)

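AI summary: To address intrinsic deception in large language models, proposes Stability Asymmetry Regularization (SAR), which suppresses deception by penalizing the asymmetry between internal chain-of-thought stability and external response stability; experiments show it is effective without degrading general model capability.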

AI中文摘要

随着大语言模型(LLMs)在能力和应用范围上的扩展,其可信度变得至关重要。一个关键风险是内在欺骗,即模型策略性地误导用户以实现自身目标。现有的基于思维链(CoT)监控的对齐方法监督显式的推理轨迹。然而,在优化压力下,模型被激励隐藏欺骗性推理,使得语义监督从根本上不可靠。基于认知心理学,我们假设一个欺骗性LLM在其CoT中保持稳定的内部信念,而其外部响应在扰动下仍然脆弱。我们将此现象称为稳定性不对称,并通过测量扰动下内部CoT稳定性与外部响应稳定性之间的对比来量化它。基于这一结构特征,我们提出了稳定性不对称正则化(SAR),一种在强化学习期间惩罚这种分布不对称性的新型对齐目标。与CoT监控不同,SAR针对模型输出的统计结构,使其对语义隐藏具有鲁棒性。大量实验证实,稳定性不对称可靠地识别欺骗行为,并且SAR有效抑制内在欺骗而不降低一般模型能力。

英文摘要

As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrinsic deception, wherein models strategically mislead users to achieve their own objectives. Existing alignment approaches based on chain-of-thought (CoT) monitoring supervise explicit reasoning traces. However, under optimization pressure, models are incentivized to conceal deceptive reasoning, rendering semantic supervision fundamentally unreliable. Grounded in cognitive psychology, we hypothesize that a deceptive LLM maintains a stable internal belief in its CoT while its external response remains fragile under perturbation. We term this phenomenon stability asymmetry and quantify it by measuring the contrast between internal CoT stability and external response stability under perturbation. Building on this structural signature, we propose the Stability Asymmetry Regularization (SAR), a novel alignment objective that penalizes this distributional asymmetry during reinforcement learning. Unlike CoT monitoring, SAR targets the statistical structure of model outputs, rendering it robust to semantic concealment. Extensive experiments confirm that stability asymmetry reliably identifies deceptive behavior, and that SAR effectively suppresses intrinsic deception without degrading general model capability.

2603.24963 2026-06-08 cs.AI cs.LG 版本更新

Design Once, Deploy at Scale: Template-Driven ML Development for Large Model Ecosystems

一次设计,大规模部署:面向大型模型生态的模板驱动ML开发

Jiang Liu, John Martabano Landy, Yao Xuan, Swamy Muddu, Nhat Le, Munaf Sahaf, Luc Kien Hang, Rupinder Khandpour, Kevin De Angeli, Chang Yang, Shouyuan Chen, Shiblee Sadik, Anirudh Agrawal, Djordje Gligorijevic, Jingzheng Qin, Peggy Yao, Alireza Vahdatpour

发表机构 * Meta AI

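AI summary: To address inefficient ML development across large model ecosystems, proposes the Standard Model Template (SMT) framework, which reduces technique propagation complexity from O(n·2^k) to O(n+k); in Meta's ads ranking system it yields a 0.63% improvement in cross-entropy, a 92% reduction in per-model iteration engineering time, and a 6.3x increase in technique-model pair adoption throughput.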

AI中文摘要

现代计算广告平台通常依赖推荐系统来预测用户响应,如点击率、转化率和其他优化事件。为了支持多样化的产品表面和广告主目标,这些平台经常维护一个广泛的机器学习(ML)模型生态系统。然而,在这种规模下运营带来了显著的开发和效率挑战。需要大量的工程努力来定期刷新ML模型并传播新技术,这导致在生态系统中部署ML创新时出现长延迟。我们提出了一项大规模实证研究,比较了标准化模型构建方法与推荐系统中独立每模型优化之间的模型性能、效率和ML技术传播。为了促进这种标准化,我们提出了标准模型模板(SMT)——一个生成适应不同数据分布和优化事件的高性能模型的框架。通过利用标准化、可组合的ML模型组件,SMT将技术传播复杂度从O(n·2^k)降低到O(n+k),其中n是模型数量,k是技术数量。在Meta的生产广告排名生态系统中,对四个全球开发周期内的广泛模型套件进行评估,我们的结果表明:(1)在服务容量保持不变的情况下,交叉熵平均提高0.63%;(2)每模型迭代工程时间减少92%;(3)技术-模型对采用吞吐量增加6.3倍。这些发现挑战了多样化优化目标本质上需要多样化ML模型设计的传统观点。

英文摘要

Modern computational advertising platforms typically rely on recommendation systems to predict user responses, such as click-through rates, conversion rates, and other optimization events. To support a wide variety of product surfaces and advertiser goals, these platforms frequently maintain an extensive ecosystem of machine learning (ML) models. However, operating at this scale creates significant development and efficiency challenges. Substantial engineering effort is required to regularly refresh ML models and propagate new techniques, which results in long latencies when deploying ML innovations across the ecosystem. We present a large-scale empirical study comparing model performance, efficiency, and ML technique propagation between a standardized model-building approach and independent per-model optimization in recommendation systems. To facilitate this standardization, we propose the Standard Model Template (SMT) -- a framework that generates high-performance models adaptable to diverse data distributions and optimization events. By utilizing standardized, composable ML model components, SMT reduces technique propagation complexity from $O(n \cdot 2^k)$ to $O(n + k)$ where $n$ is the number of models and $k$ the number of techniques. Evaluating an extensive suite of models over four global development cycles within Meta's production ads ranking ecosystem, our results demonstrate: (1) a 0.63% average improvement in cross-entropy at neutral serving capacity, (2) a 92% reduction in per-model iteration engineering time, and (3) a $6.3\times$ increase in technique-model pair adoption throughput. These findings challenge the conventional wisdom that diverse optimization goals inherently require diversified ML model design.
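To give a feel for the gap between the two complexity expressions quoted in the abstract, the sketch below simply plugs illustrative values into n·2^k and n + k. The values n = 100 and k = 10 are made up for the example and do not come from the paper, and the function names are hypothetical.

```python
def per_model_units(n_models, k_techniques):
    """Evaluate the n * 2**k expression for independent per-model optimization."""
    return n_models * 2 ** k_techniques


def template_units(n_models, k_techniques):
    """Evaluate the n + k expression for a shared, composable template."""
    return n_models + k_techniques


n, k = 100, 10  # illustrative values only
independent = per_model_units(n, k)
templated = template_units(n, k)
ratio = independent / templated
```

For these values the first expression gives 102,400 and the second gives 110. Each additional technique doubles the first figure but adds only one to the second.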

2603.26394 2026-06-08 cs.SD 版本更新

CA-TCN: A Causal-Anticausal Temporal Convolutional Network for Direct Auditory Attention Decoding

CA-TCN: 一种用于直接听觉注意解码的因果-反因果时序卷积网络

Iñigo García-Ugarte, Rubén Eguinoa, Ricardo San Martín, Daniel Paternain, Carmen Vidaurre

发表机构 * Department of Science, Universidad Pública de Navarra (UPNA)(纳瓦拉公共大学科学系) BCBL, Basque Center on Cognition Brain and Language(巴斯克认知脑语言中心) Department of Statistics, Computer Sciences and Mathematics, Universidad Pública de Navarra (UPNA)(纳瓦拉公共大学统计、计算机科学与数学系) Ikerbasque, Basque Foundation for Science(巴斯克科学基金会)

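AI summary: Proposes CA-TCN, a Causal-Anticausal Temporal Convolutional Network that directly classifies the attended speaker, aligning auditory stimuli and neural responses with separate causal and anticausal convolutions and improving decoding accuracy over AADNet by 0.5% to 3.2% across datasets.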

Comments
10+2(refs) pages, 5 figures, 4 Tables, IEEE transactions preprint
AI中文摘要

在复杂听觉环境中引导听觉注意的一种有前景的方法依赖于听觉注意解码(AAD),其旨在从神经记录中识别多说话者场景中注意的语音流。基于夹带的AAD方法通常假设可以访问干净的语音源和脑电图(EEG)信号,以利用神经响应与注意刺激之间的低频相关性。在本研究中,我们提出了CA-TCN,一种因果-反因果时序卷积网络,直接对注意说话者进行分类。所提出的架构整合了卷积神经网络在序列处理任务中的若干最佳实践。重要的是,它通过分别采用具有不同感受野、在相反时间方向上操作的因果和反因果卷积,显式地对齐听觉刺激和神经响应。通过与三个基线AAD模型比较获得的实验结果表明,CA-TCN在数据集和决策窗口上一致地提高了解码精度,与次优模型AADNet相比,在受试者无关模型中增益范围为0.5%至3.2%,在受试者特定模型中增益范围为0.8%至2.9%。此外,在比较最小期望切换持续时间分布时,这些改进在六个评估设置中的四个中具有统计显著性。除了准确性,该模型在不同条件下表现出空间鲁棒性,因为EEG空间滤波器在数据集上表现出稳定的模式。总体而言,这项工作引入了一个准确且统一的AAD模型,其性能优于现有方法,同时考虑了在线处理场景的实际优势。这些发现有助于推进AAD的发展及其在现实世界系统中的适用性。

英文摘要

A promising approach for steering auditory attention in complex listening environments relies on Auditory Attention Decoding (AAD), which aims to identify the attended speech stream in a multi-speaker scenario from neural recordings. Entrainment-based AAD approaches typically assume access to clean speech sources and electroencephalography (EEG) signals to exploit low-frequency correlations between the neural response and the attended stimulus. In this study, we propose CA-TCN, a Causal-Anticausal Temporal Convolutional Network that directly classifies the attended speaker. The proposed architecture integrates several best practices from convolutional neural networks in sequence processing tasks. Importantly, it explicitly aligns auditory stimuli and neural responses by employing separate causal and anticausal convolutions, respectively, with distinct receptive fields operating in opposite temporal directions. Experimental results, obtained through comparisons with three baseline AAD models, demonstrated that CA-TCN consistently improved decoding accuracy across datasets and decision windows, with gains ranging from 0.5% to 3.2% for subject-independent models and from 0.8% to 2.9% for subject-specific models compared with the next best-performing model, AADNet. Moreover, these improvements were statistically significant in four of the six evaluated settings when comparing Minimum Expected Switch Duration distributions. Beyond accuracy, the model demonstrated spatial robustness across different conditions, as the EEG spatial filters exhibited stable patterns across datasets. Overall, this work introduces an accurate and unified AAD model that outperforms existing methods while considering practical benefits for online processing scenarios. These findings contribute to advancing the state of AAD and its applicability in real-world systems.
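The distinction between causal and anticausal convolutions can be shown with a few lines of NumPy. This is a generic 1D illustration, not the CA-TCN architecture: a causal filter at time t sees only the current and past samples, while an anticausal filter sees only the current and future samples. The helper names and the toy signal are hypothetical.

```python
import numpy as np


def causal_conv(x, kernel):
    """y[t] = sum_i kernel[i] * x[t - i], using zeros before the signal starts."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # padded[t:t + k] holds x[t-k+1..t]; reversing it aligns x[t - i] with kernel[i].
    return np.array([padded[t:t + k][::-1] @ kernel for t in range(len(x))])


def anticausal_conv(x, kernel):
    """y[t] = sum_i kernel[i] * x[t + i], using zeros after the signal ends."""
    # Time-reverse the input, apply the causal filter, then reverse back.
    return causal_conv(x[::-1], kernel)[::-1]


x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 1.0])
y_causal = causal_conv(x, kernel)          # each output adds the previous sample
y_anticausal = anticausal_conv(x, kernel)  # each output adds the next sample
```

With the two-tap kernel, the causal output is x[t] + x[t-1] and the anticausal output is x[t] + x[t+1], so the two receptive fields extend in opposite temporal directions from the same time index.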

2603.24576 2026-06-08 cs.RO cs.AI cs.CV 版本更新

Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation

Chameleon: 用于视觉运动操控的索引控制前瞻记忆

Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Yuhang Han, Ying Sun, Yang Xiao, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(南洋理工大学MARS实验室) Institute for Infocomm Research, A*STAR, Singapore(新加坡A*STAR信息与通信研究所) National University of Singapore(新加坡国立大学)

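AI summary: Proposes Chameleon, a policy that addresses observation-action delay with control-indexed prospective memory, raising decision success on Camo-Dataset from 22.5% to 80.8% and setting the state of the art among same-size models on several benchmarks.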

Comments
Code is available at https://github.com/gxyes/MARS_Chameleon
AI中文摘要

机器人常常在观察到某个信息后很久才执行相应的动作。例如,在藏球游戏中,机器人首先看到哪个杯子藏有球,观察杯子移动,然后才需要选择正确的杯子。仅凭最后的观察不足以做出决策:正确的动作依赖于更早的事件。我们将这种时间间隔称为观察-动作延迟。它使得记忆成为一个策略面对的问题:策略必须保持相似历史记录的可区分性,检索与当前决策相关的过去事件,并将该回忆转换为动作就绪状态。我们将这些需求称为可分离性、可寻址性和前瞻性。我们引入了Chameleon,一个约60M参数的视觉运动策略,用于索引控制的前瞻记忆。Chameleon写入具身事件记忆,保留可分离的历史记录,检索控制相关的痕迹,并训练生成的工作状态具有前瞻性。我们还引入了Camo-Dataset,这是一个真实机器人基准,通过使决策场景视觉模糊来隔离观察-动作延迟,从而必须从早期观察中推断正确动作。Chameleon在Camo-Dataset上将决策/端到端成功率从22.5%/21.3%提高到80.8%/71.3%。在公开的长时记忆基准上,它在LIBERO-10上达到87.1% ± 0.8%,在MemoryBench上达到97.3% ± 4.5%,在MIKASA-Robo上达到75.1% ± 1.4%,在相同规模模型中达到最先进水平,并在报告协议下超过多个更大的VLA基线。探针和消融实验表明,Chameleon学习了可分离、可寻址和前瞻的记忆,并且这些特性驱动了其性能提升。

英文摘要

Robots often observe information that determines a future action long before that action is executed. In a shell game, for example, a robot first sees which cup hides the ball, watches the cups move, and only later needs to choose the correct cup. The final observation alone is not enough for a decision: the correct action depends on an earlier event. We refer to this temporal gap as observation-action delay. It makes memory a policy-facing problem: a policy must keep similar histories distinct, retrieve the past event relevant to the current decision, and convert that recall into an action-ready state. We call these requirements separability, addressability, and prospectiveness. We introduce Chameleon, a ~60M-parameter visuomotor policy for control-indexed prospective memory. Chameleon writes embodied event memory, preserves separable histories, retrieves control-relevant traces, and trains the resulting working state to be prospective. We also introduce Camo-Dataset, a real-robot benchmark that isolates observation-action delay by making the decision scene visually ambiguous, so the correct action must be inferred from earlier observations. Chameleon improves decision/end-to-end success on Camo-Dataset from 22.5%/21.3% to 80.8%/71.3%. On public long-horizon memory benchmarks, it achieves 87.1% ± 0.8% on LIBERO-10, 97.3% ± 4.5% on MemoryBench, and 75.1% ± 1.4% on MIKASA-Robo, setting the state of the art for same-size models and exceeding multiple larger VLA baselines under the reported protocols. Probes and ablations show that Chameleon learns separable, addressable, and prospective memory, and that these properties drive its performance gains.

2603.24481 2026-06-08 cs.AI cs.CL cs.LG 版本更新

Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

基于一致性验证的多智能体推理改进医学多项选择题问答中的不确定性校准

John Ray B. Martinez

发表机构 * Department of Data Science and Analytics(数据科学与分析系)

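AI summary: Proposes a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion, substantially reducing calibration error and improving discrimination in medical MCQA.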

Comments
20 pages, 6 figures. Preprint under review
AI中文摘要

校准不良的置信度分数是AI在临床环境中部署的实际障碍。总是过度自信的模型无法为延迟决策提供有用信号。我们提出了一个多智能体框架,结合领域特定专家智能体与两阶段验证(Wu等人,2024)和S分数加权融合,以改进医学多项选择题问答中的校准和判别能力。四个专家智能体(呼吸科、心脏病科、神经科、胃肠科)使用Qwen2.5-7B-Instruct生成独立诊断。每个诊断经历两阶段自我验证过程,测量内部一致性并产生专家置信度分数(S分数)。S分数驱动加权融合策略,选择最终答案并校准报告的置信度。我们在MedQA-USMLE和MedMCQA的高分歧子集(100和250个问题)上进行评估。所有结果均针对此过滤后的设置。在MedQA-250上,完整系统实现了ECE=0.091(比单专家基线降低74.4%)和AUROC=0.630(+0.056),准确率为59.2%。在所有四种设置中,校准增益保持在49-74%。消融分析表明,两阶段验证驱动ECE降低,而多智能体推理驱动AUROC提升,表明一致性检查和集成聚合解决了LLM不确定性的不同失败模式。由此产生的置信度信号是否足以在实践中支持临床延迟决策,仍是未来研究的方向。

英文摘要

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification (Wu et al., 2024) and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis undergoes a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate on high-disagreement subsets of MedQA-USMLE and MedMCQA (100 and 250 questions). All results are specific to this filtered regime. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Calibration gains of 49-74% hold across all four settings. Ablation analysis reveals that Two-Phase Verification drives ECE reduction while multi-agent reasoning drives AUROC improvement, suggesting that consistency checking and ensemble aggregation address different failure modes of LLM uncertainty. Whether the resulting confidence signal is sufficient to support clinical deferral decisions in practice remains a direction for future investigation.
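The calibration metric reported above, Expected Calibration Error (ECE), can be illustrated with a minimal sketch. It uses the common equal-width binning estimator; the abstract does not state the binning scheme or bin count, so `n_bins=10` and the function name `expected_calibration_error` are assumptions made here for illustration, not the paper's implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    confidences: floats in [0, 1], one per question.
    correct: booleans (or 0/1), True where the chosen answer was right.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; confidence 1.0 goes into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(is_correct)))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that reports 0.9 confidence on four questions but answers only three correctly has a gap of 0.9 - 0.75 in its single occupied bin, so the estimator returns roughly 0.15. Lower values mean reported confidence tracks observed accuracy more closely.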

2603.22278 2026-06-08 cs.CV cs.LG 版本更新

The Dual Mechanisms of Spatial Variable Binding in Vision-Language Models

视觉-语言模型中空间变量绑定的双重机制

Kelly Cui, Nikhil Prakash, Shoval Messica, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham

Affiliations * MIT CSAIL; Northeastern University; Sony Playstation

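AI summary Shows that vision-language models implement spatial variable binding through two mechanisms, content-independent spatial relation encoding in the language backbone and a global layout representation in the vision encoder, with the vision encoder playing the dominant role.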

详情
Comments
37 pages, 53 figures
AI中文摘要

许多多模态任务,如图像描述和视觉问答,要求视觉-语言模型(VLM)将对象与其属性和空间关系绑定。然而,这种关联在VLM中如何以及在哪里计算仍不清楚。在这项工作中,我们展示了VLM依赖两种并发机制来表示空间变量绑定。在语言模型骨干中,中间层在对应对象的视觉标记之上表示内容无关的空间关系。然而,这种机制在塑造模型预测中仅起次要作用。相反,空间信息的主要来源是视觉编码器,其表示编码了对象的布局,并被语言模型骨干直接利用。值得注意的是,这种空间信号全局分布在视觉标记中,从对象区域扩展到周围的背景区域。我们表明,增强这些源自视觉的空间表示(跨所有图像标记)可以改善不同规模模型在COCO数据集复杂自然图像上的空间变量绑定性能。总之,我们的结果阐明了VLM中空间变量绑定的计算方式,并强调了视觉编码器在实现这一功能中的核心作用。

英文摘要

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to bind objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent spatial variable binding. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial variable binding performance across models of various sizes on complex natural images from the COCO dataset. Together, our results clarify how spatial variable binding is computed within VLMs and highlight the central role of vision encoders in enabling it.

2603.19100 2026-06-08 cs.AI 版本更新

LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

LuMamba: 用于电极拓扑不变且高效的EEG建模的潜在统一Mamba

Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson, Yawei Li, Luca Benini

Affiliations * ETH Zürich, Institute of Information Systems

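AI summary Proposes LuMamba, a framework combining topology-invariant encoding with a linear-complexity state-space model: LUNA's learned-query cross-attention unifies channels and FEMBA's bidirectional Mamba blocks model time efficiently. With 4.6M parameters, it is evaluated on five downstream tasks and reaches state-of-the-art performance on Alzheimer's detection.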

详情
Comments
EUSIPCO 2026, 5 pages, 2 figures, 4 tables
AI中文摘要

脑电图(EEG)能够在临床和神经技术应用中无创监测大脑活动,但由于电极拓扑结构不同以及Transformer架构的二次序列复杂度,构建EEG基础模型仍然具有挑战性。作为联合解决方案,我们提出了LuMamba(潜在统一Mamba),一个自监督框架,结合了拓扑不变编码和线性复杂度状态空间建模,使用LUNA的学习查询交叉注意力机制进行通道统一,以及FEMBA的双向Mamba块进行高效时序建模。在此架构内,我们首次系统研究了用于生物信号学习的潜在-欧几里得联合嵌入预测架构(LeJEPA)。在来自TUEG语料库的超过21,000小时未标记EEG上预训练后,LuMamba在五个下游任务上进行了评估,涵盖异常检测、伪影识别和精神状态分类,电极配置从16到26通道不等。在预训练目标中,仅掩码重建产生结构化但泛化性较差的表示,而仅LeJEPA产生扩散嵌入;结合两个目标实现了最稳健的性能。仅用4.6M参数,LuMamba在TUAB上达到80.99%的平衡准确率,在阿尔茨海默病检测上达到SOTA性能(0.97 AUPR),同时在等效序列长度下所需FLOPS比SOTA模型少377倍,并在达到典型GPU内存限制前可扩展到12倍更长的序列。代码可在https://github.com/pulp-bio/biofoundation获取。

英文摘要

Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundation models for EEG remains challenging because electrode topologies differ and because Transformer architectures incur quadratic sequence complexity, which limits computational scalability. As a joint solution, we propose LuMamba (Latent Unified Mamba), a self-supervised framework combining topology-invariant encodings with linear-complexity state-space modeling, using LUNA's learned-query cross-attention mechanism for channel unification, and FEMBA's bidirectional Mamba blocks for efficient temporal modeling. Within this architecture, we provide the first systematic investigation of the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) for biosignal learning. Pre-trained on over 21,000 hours of unlabeled EEG from the TUEG corpus, LuMamba is evaluated on five downstream tasks spanning abnormality detection, artifact recognition, and mental condition classification across electrode configurations ranging from 16 to 26 channels. In the pre-training objective, masked reconstruction alone yields structured but less generalizable representations, while LeJEPA alone produces diffuse embeddings; combining both objectives achieves the most robust performance. With only 4.6M parameters, LuMamba attains 80.99% balanced accuracy on TUAB and achieves state-of-the-art performance on Alzheimer's detection (0.97 AUPR), while requiring 377x fewer FLOPs than state-of-the-art models at equivalent sequence lengths and scaling to 12x longer sequences before reaching typical GPU memory limits. Code is available at https://github.com/pulp-bio/biofoundation.

2603.16689 2026-06-08 cs.LG 版本更新

Predictive Statistics Shape Emergent World Representations of Grid Walkers

预测统计塑造网格行走者的涌现世界表征

Sasha Brenner, Thomas R. Knösche, Nico Scherf

Affiliations * Max Planck Institute for Human Cognitive and Brain Sciences; Leipzig University; ScaDS.AI

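AI summary Using constrained random-walk experiments, finds that the first attention block of decoder-only transformers extracts the sufficient statistic for prediction and that later layers transform it into predictive geometry, yielding a readable world model, whereas recurrent networks do not separate this stage.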

详情
Comments
24 pages, 15 figures
AI中文摘要

下一个词预测器通常似乎会发展出对潜在世界及其规则的内部表征。这些模型的概率性质表明世界结构与概率分布几何之间存在深层联系。为了更精确地理解这种联系,我们使用一个最小随机过程作为受控设置:在二维格点上的约束随机游走,必须在预定步数后到达固定终点。对该过程的最优预测仅取决于由游走者相对于目标的位置和剩余时间范围决定的充分向量;换句话说,概率分布由世界的网格几何参数化。我们在从这些游走的精确分布中采样的前缀上训练解码器仅Transformer和循环网络,并通过跨层测量对齐和线性可读性,将其隐藏激活与预测的充分统计量进行比较。我们发现Transformer的计算分为两个阶段:第一个注意力块从输入中提取充分统计量,后续层将其转化为下一步预测几何。在不同约束变体中,注意力后的表征是通用的:一个共享的晶格世界状态,可以直接作为世界模型读取,追溯到数据的预测几何。后续层随后将其专门化到每个变体的下一步分布。循环网络达到相同的贝叶斯最优损失,但未将这个世界状态隔离为一个单独阶段,表明世界模型几何也依赖于架构。尽管在玩具系统中演示,结果表明预测分布的几何是理解神经网络如何内化其数据结构的有用视角。

英文摘要

Next-token predictors often appear to develop internal representations of the latent world and its rules. The probabilistic nature of these models suggests a deep connection between the structure of the world and the geometry of probability distributions. To understand this link more precisely, we use a minimal stochastic process as a controlled setting: constrained random walks on a two-dimensional lattice that must reach a fixed endpoint after a predetermined number of steps. Optimal prediction of this process depends solely on a sufficient vector determined by the walker's position relative to the target and the remaining time horizon; in other words, the probability distributions are parametrized by the world's grid geometry. We train decoder-only transformers and recurrent networks on prefixes sampled from the exact distribution of these walks and compare their hidden activations to sufficient statistics of prediction, by measuring alignment and linear readability across layers. We find that the transformer's computation factors into two stages: the first attention block extracts the sufficient statistic from the input, and later layers transform it into the next-step predictive geometry. Across constraint variants, the post-attention representation is universal: a shared world-state of the lattice that can be read directly as a world model, traced to the predictive geometry of the data. Later layers then specialize it to each variant's next-step distribution. Recurrent networks reach the same Bayes-optimal loss but do not isolate this world-state as a separate stage, showing that the world-model geometry also depends on architecture. Although demonstrated in a toy system, the results suggest that the geometry of the predictive distribution is a useful lens on how neural networks internalize the structure of their data.
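The claim that optimal prediction depends only on the walker's displacement from the target and the remaining horizon can be made concrete with a small sketch. It assumes the simplest variant: nearest-neighbour steps on an unbounded 2D lattice, with every walk that reaches the endpoint on time equally likely. The abstract does not specify the paper's constraint variants, so the move set and the names `count_paths` and `next_step_distribution` are illustrative assumptions.

```python
from functools import lru_cache

# Assumed move set: unit steps east, west, north, south.
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


@lru_cache(maxsize=None)
def count_paths(dx, dy, t):
    """Number of t-step walks that cover the displacement (dx, dy) exactly."""
    if abs(dx) + abs(dy) > t:
        return 0  # target is too far to reach in the remaining steps
    if t == 0:
        return 1  # the check above guarantees dx == dy == 0 here
    return sum(count_paths(dx - mx, dy - my, t - 1) for mx, my in MOVES)


def next_step_distribution(dx, dy, t):
    """Exact next-step probabilities given the sufficient statistic (dx, dy, t).

    (dx, dy) is target minus current position; t is the number of steps left.
    Each move is weighted by how many completing walks begin with it.
    """
    total = count_paths(dx, dy, t)
    if t == 0 or total == 0:
        raise ValueError("no valid continuation from this state")
    return {
        (mx, my): count_paths(dx - mx, dy - my, t - 1) / total
        for mx, my in MOVES
    }
```

For instance, `next_step_distribution(0, 0, 2)` assigns probability 0.25 to each move, since any first step can be undone by the second, while `next_step_distribution(1, 0, 1)` puts all mass on `(1, 0)`. Two prefixes with the same `(dx, dy, t)` always give the same output, which is the sense in which that triple is a sufficient statistic under these assumptions.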