arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.15762 2026-05-19 cs.LG

Zero-Shot Scalable Resilience in UAV Swarms: A Decentralized Imitation Learning Framework with Physics-Informed Graph Interactions

Huan Lin, Lianghui Ding

AI summary: This paper proposes a decentralized imitation learning framework that encodes local interactions with a physics-informed graph neural network, enabling UAV swarms to recover robustly from large-scale failures and fragmented topologies.

Abstract

Large-scale Unmanned Aerial Vehicle (UAV) failures can split a UAV swarm network into disconnected sub-networks, making decentralized recovery both urgent and difficult. Centralized recovery methods depend on global topology information and become communication-heavy after severe fragmentation. Decentralized heuristics and multi-agent reinforcement learning methods are easier to deploy, but their performance often degrades when the swarm scale and damage severity vary. We present the Physics-informed Graph Adversarial Imitation Learning algorithm (PhyGAIL), which adopts centralized training with decentralized execution. PhyGAIL builds bounded local interaction graphs from heterogeneous observations and uses a physics-informed graph neural network to encode directional local interactions as gated message passing with explicit attraction and repulsion. This gives the policy a physically grounded coordination bias while keeping local observations scale-invariant. It also uses scenario-adaptive imitation learning to improve training under fragmented topologies and variable-length recovery episodes. Our analysis establishes bounded local graph amplification, bounded interaction dynamics, and controlled variance of the terminal success signal. A policy trained on 20-UAV swarms transfers directly to swarms of up to 500 UAVs without fine-tuning, and achieves better performance across reconnection reliability, recovery speed, motion safety, and runtime efficiency than representative baselines.

2604.09609 2026-05-19 cs.AI cs.RO

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

Samir H. A. Mohammad, Wouter Mooi, Arkady Zgonnikov

AI summary: This paper studies general-purpose LLMs as models of human driver behavior. It embeds two general-purpose LLMs in a simplified one-dimensional merging scenario and compares them with human data through quantitative and qualitative analyses. The models reproduce human-like intermittent operational control and tactical dependence on spatial cues, but they diverge from humans in their response to dynamic velocity cues and in safety performance, so their failure modes need further study before they can be relied on as models of human driving behavior.

Comments: To be published in proceedings of IEEE ITSC 2026

Abstract

Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.

2604.09450 2026-05-19 cs.LG cs.AI eess.IV

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu

AI summary: This paper proposes ECHO, an efficient diffusion-based vision-language model for chest X-ray report generation. One-step-per-block diffusion and a Response-Asymmetric Diffusion training strategy substantially improve generation efficiency and textual coherence while keeping clinical accuracy largely intact.

Abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving up to 8× inference speedup with negligible degradation in clinical accuracy.
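The mean-field bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not ECHO's denoiser or its DCD objective; the two-token phrases and their probabilities are hypothetical values chosen for illustration. It compares a joint distribution over a phrase with the product of its per-token marginals, which is the distribution a token-factorized one-step sampler would draw from.

```python
from itertools import product

# Hypothetical joint distribution over two-token phrases (illustrative values).
joint = {
    ("lungs", "clear"): 0.5,
    ("heart", "enlarged"): 0.5,
}

def marginals(joint):
    """Per-position marginal distributions of a joint over token tuples."""
    length = len(next(iter(joint)))
    margs = [dict() for _ in range(length)]
    for tokens, p in joint.items():
        for pos, tok in enumerate(tokens):
            margs[pos][tok] = margs[pos].get(tok, 0.0) + p
    return margs

def factorized(margs):
    """Product-of-marginals (mean-field) approximation of the joint."""
    out = {}
    for combo in product(*(m.items() for m in margs)):
        tokens = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        out[tokens] = prob
    return out

mean_field = factorized(marginals(joint))
# Probability mass placed on phrases that the joint never produces.
spurious_mass = sum(p for t, p in mean_field.items() if t not in joint)
print(mean_field)
print("mass on phrases outside the joint:", spurious_mass)
```

In this toy case every marginal is correct, yet half of the factorized mass falls on mixed phrases such as "lungs enlarged". This is the kind of incoherence that supervision encoding joint token dependencies is meant to avoid.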

2604.04932 2026-05-19 cs.CL

Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection

Yang Li, Qiang Sheng, Zhengjia Wang, Yehan Yang, Danding Wang, Juan Cao

AI summary: This paper proposes RACE, which models the dual roles of creator and editor for fine-grained LLM-generated text detection, distinguishing text types more precisely and offering a policy-aligned solution for LLM regulation.

Comments: ACL 2026 (Oral)

Abstract

The misuse of large language models (LLMs) requires precise detection of synthetic text. Existing works mainly follow binary or ternary classification settings, which can at best distinguish pure human text, pure LLM text, and collaborative text. This remains insufficient for nuanced regulation, as LLM-polished human text and humanized LLM text often trigger different policy consequences. In this paper, we explore fine-grained LLM-generated text detection under a rigorous four-class setting. To handle such complexities, we propose RACE (Rhetorical Analysis for Creator-Editor Modeling), a fine-grained detection method that characterizes the distinct signatures of creator and editor. Specifically, RACE utilizes Rhetorical Structure Theory (RST) to construct a logic graph for the creator's foundation while extracting Elementary Discourse Unit (EDU)-level features for the editor's style. Experiments show that RACE outperforms 12 baselines in identifying fine-grained types with low false alarms, offering a policy-aligned solution for LLM regulation.

2604.01658 2026-05-19 cs.AI

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

AI summary: This paper proposes CORAL, a framework for autonomous multi-agent evolution on open-ended problems, and shows that agent autonomy and multi-agent evolution substantially improve open-ended discovery.

Abstract

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.

2603.27341 2026-05-19 cs.AI cs.CV cs.LG

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

AI summary: Using state-of-the-art AI methods available in 2026, this paper studies the performance and limits of surgical tool detection. Even with multi-billion-parameter models and extensive training, current vision language models fall short on tool detection in neurosurgery, and increasing model size and training time brings only diminishing gains, indicating that current AI still faces significant obstacles in surgical applications.

Abstract

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites. Since surgery requires integrating disparate tasks, generally capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

2603.25723 2026-05-19 cs.CL cs.AI

Natural-Language Agent Harnesses

Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

AI summary: This paper proposes Natural-Language Agent Harnesses (NLAHs), executable natural-language objects that describe the harness policy of a task run, and the Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. On coding, terminal-use, and computer-use benchmarks, NLAHs achieve task outcomes comparable to code and prompted realizations while exposing much shorter static harness policies.

Comments: revise paper

Abstract

Agent performance is strongly shaped by the surrounding harness: the external execution system around a model that organizes a task run. Yet this logic is usually buried in tightly coupled controller code, which makes harnesses hard to inspect, compare, transfer, and ablate. This paper asks whether the reusable design pattern of an agent harness can be represented as an executable natural-language object. We introduce Natural-Language Agent Harnesses (NLAHs), editable documents that describe run-level harness policy, and Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Across coding, terminal-use, and computer-use benchmarks, IHR-executed NLAHs achieve comparable task outcomes to code and prompted realizations, while exposing much shorter static harness policies. Module ablations further show that explicit harness modules are analyzable. These results suggest that agent harnesses can be turned from incidental glue around models into scientific representation objects.

2603.23672 2026-05-19 cs.RO cs.CV

Bio-Inspired Event-Based Visual Servoing for Ground Robots

Maral Mordad, Kian Behzad, Debojyoti Biswas, Noah J. Cowan, Milad Siami

AI summary: This paper proposes a bio-inspired 1D event-based visual servoing framework for ground robots operating in structured environments. Using a Dynamic Vision Sensor and a multi-pattern stimulus, it directly synthesizes a nonlinear state feedback term, achieving efficient, low-latency control.

Abstract

Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper introduces a principled 1D event-based visual servoing framework for ground robots operating in structured environments. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific combinations of kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot's velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state feedback term entirely without traditional state estimation. To overcome the inescapable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller. Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extremely low latency, and computational efficiency of the proposed direct-sensing approach.

2603.23231 2026-05-19 cs.AI

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen, Tong Xu

AI summary: This paper proposes PERMA, a benchmark that evaluates the long-term consistency of personalized memory agents through event-driven preferences and realistic task environments. It adds text variability and linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. Experiments show that advanced memory systems extract precise preferences and reduce token consumption, but more robust personalized memory management is still needed.

Abstract

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. Existing evaluations of this capability typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events driving user preference evolution. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems extract precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.

2603.22056 2026-05-19 cs.CL

Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill

AI summary: This paper studies dual-space knowledge distillation with key-query matching for large language models with mismatched vocabularies. It analyzes the attention mechanism to reveal strengths and limitations, and proposes a new method based on generative adversarial learning to address the mismatch between key and query distributions.

Comments: Copyright 2026 IEEE. Published in ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), scheduled for 4-8 May 2026 in Barcelona, Spain

Abstract

Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.
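ROUGE-L, the metric reported above, scores a candidate text by its longest common subsequence (LCS) with a reference. The following is a minimal sketch, assuming whitespace tokenization and an F-measure with beta = 1. The evaluation script used in the paper is not given here and may tokenize or weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure on whitespace tokens (beta = 1 is an assumption)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

# The LCS here is "the cat on the mat" (5 tokens out of 6 on each side).
score = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(f"ROUGE-L F1: {score:.4f}")
```

Because the score depends on subsequence overlap rather than exact n-gram matches, it tolerates inserted or substituted words as long as word order is preserved.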

2603.21787 2026-05-19 cs.CV

Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTevent

Lokeshwaran Manohar, Moritz Roidl

AI summary: This paper benchmarks recurrent ReYOLOv8s on the MTevent dataset for industrial multi-class recognition, uses non-recurrent YOLOv8s as a baseline to analyze the effect of temporal memory, and finds that event-domain pretraining yields the larger performance gain.

Comments: Accepted at the Neuromorphic Field Robotics and Automation Workshop, ICRA 2026

Abstract

Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTevent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTevent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the non-recurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.
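The relative improvement quoted above follows directly from the reported mAP50 values. The short check below reproduces that arithmetic. The dictionary labels are informal names chosen here, and comparing the pretrained variants against the non-recurrent baseline is this sketch's own choice rather than a figure stated in the abstract.

```python
def relative_improvement(new, baseline):
    """Relative change of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline

BASELINE_MAP50 = 0.260  # non-recurrent YOLOv8s, from the abstract

# mAP50 values as reported in the abstract; labels are informal.
reported = {
    "ReYOLOv8s from scratch (C21)": 0.285,
    "GEN1-initialized, clip length 21": 0.329,
    "PEDRo-initialized": 0.251,
}

for name, value in reported.items():
    change = 100 * relative_improvement(value, BASELINE_MAP50)
    print(f"{name}: {change:+.1f}% vs. baseline")
```

The first line of output corresponds to the 9.6% figure given in the abstract.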

2603.18972 2026-05-19 cs.LG

Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives

S Akash, Pratik Gajane, Jawar Singh

AI summary: This paper proposes unified best-of-both-worlds algorithms for multi-dueling bandits under Condorcet and Borda objectives. They cover both stochastic and adversarial environments without prior knowledge of the regime, with optimal regret in the Condorcet setting and near-optimal regret in the Borda setting.

Abstract

Multi-dueling bandits, where a learner selects $m \geq 2$ arms per round and observes only the winner, arise naturally in many applications including ranking and recommendation systems, yet a fundamental question has remained open: can a single algorithm perform optimally in both stochastic and adversarial environments, without knowing which regime it faces? We answer this affirmatively, providing the first best-of-both-worlds algorithms for multi-dueling bandits under both Condorcet and Borda objectives. For the Condorcet setting, we propose $\texttt{MetaDueling}$, a black-box reduction that converts any dueling bandit algorithm into a multi-dueling bandit algorithm by transforming multi-way winner feedback into an unbiased pairwise signal. Instantiating our reduction with $\texttt{Versatile-DB}$ yields the first best-of-both-worlds algorithm for multi-dueling bandits: it achieves $O(\sqrt{KT})$ pseudo-regret against adversarial preferences and the instance-optimal $O\left(\sum_{i \neq a^\star} \frac{\log T}{\Delta_i}\right)$ pseudo-regret under stochastic preferences, both simultaneously and without prior knowledge of the regime. For the Borda setting, we propose $\texttt{SA-MiDEX}$, a stochastic-and-adversarial algorithm that achieves $O\left(K^2 \log KT + K \log^2 T + \sum_{i: \Delta_i^{\mathrm{B}} > 0} \frac{K\log KT}{(\Delta_i^{\mathrm{B}})^2}\right)$ regret in stochastic environments and $O\left(K \sqrt{T \log KT} + K^{1/3} T^{2/3} (\log K)^{1/3}\right)$ regret against adversaries, again without prior knowledge of the regime. We complement our upper bounds with matching lower bounds for the Condorcet setting. For the Borda setting, our upper bounds are near-optimal with respect to the lower bounds (within a factor of $K$) and match the best-known results in the literature.
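The Condorcet and Borda objectives named above can pick different arms, which is why the paper treats them separately. The sketch below uses the standard definitions on a hypothetical 3-arm preference matrix, where `P[i][j]` is the probability that arm `i` beats arm `j`. The matrix values are invented for illustration, and neither `MetaDueling` nor `SA-MiDEX` is implemented here.

```python
# Hypothetical pairwise preference matrix: P[i][j] = Pr(arm i beats arm j).
P = [
    [0.50, 0.55, 0.55],
    [0.45, 0.50, 0.90],
    [0.45, 0.10, 0.50],
]

def condorcet_winner(P):
    """Arm that beats every other arm with probability > 1/2, or None."""
    K = len(P)
    for i in range(K):
        if all(P[i][j] > 0.5 for j in range(K) if j != i):
            return i
    return None

def borda_scores(P):
    """Average probability of beating a uniformly chosen other arm."""
    K = len(P)
    return [sum(P[i][j] for j in range(K) if j != i) / (K - 1) for i in range(K)]

scores = borda_scores(P)
borda_winner = max(range(len(P)), key=lambda i: scores[i])
print("Condorcet winner:", condorcet_winner(P))
print("Borda scores:", scores)
print("Borda winner:", borda_winner)
```

Here arm 0 narrowly beats both rivals and is the Condorcet winner, while arm 1 has the highest Borda score because it beats arm 2 by a wide margin.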

2603.18702 2026-05-19 cs.LG

Off-Policy Learning with Limited Supply

Koichi Tanaka, Ren Kishimoto, Bushun Kawagishi, Yusuke Narita, Yasuo Yamamoto, Nobuyuki Shimizu, Yuta Saito

AI summary: This paper studies off-policy learning in contextual bandits under limited supply and proposes OPLS, a method that allocates limited-supply items more efficiently by considering expected rewards relative to other users. Experiments show that it outperforms existing methods in limited-supply settings.

Comments: Published as a conference paper at WWW 2026

Abstract

We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.
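The depletion effect described above can be seen in a two-user, two-item toy problem. This is a hedged illustration, not the OPLS estimator: the reward table, the supply of one unit per item, the arrival order, and the "reward minus average reward across users" score are all assumptions made for this sketch.

```python
# Hypothetical expected rewards, supplies, and arrival order.
expected_reward = {
    "user_1": {"A": 1.0, "B": 0.9},
    "user_2": {"A": 0.8, "B": 0.1},
}
supply = {"A": 1, "B": 1}
arrival = ["user_1", "user_2"]

def allocate(score):
    """Serve users in arrival order, giving each the best-scoring item in stock."""
    remaining = dict(supply)
    assignment, total = {}, 0.0
    for user in arrival:
        available = [item for item, n in remaining.items() if n > 0]
        if not available:
            break
        item = max(available, key=lambda it: score(user, it))
        remaining[item] -= 1
        assignment[user] = item
        total += expected_reward[user][item]
    return assignment, total

def greedy_score(user, item):
    """Conventional choice: the item's expected reward for this user."""
    return expected_reward[user][item]

def relative_score(user, item):
    """Assumed stand-in: reward relative to the average over all users."""
    mean = sum(r[item] for r in expected_reward.values()) / len(expected_reward)
    return expected_reward[user][item] - mean

for name, fn in [("greedy", greedy_score), ("relative", relative_score)]:
    assignment, total = allocate(fn)
    print(name, assignment, round(total, 2))
```

Greedy selection gives item A to the first user and leaves the second user with a low-value item, whereas the relative score reserves A for the user who benefits from it most compared with the alternative.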

2603.14462 2026-05-19 cs.LG cs.AI

STAG-CN: Spatio-Temporal Apiary Graph Convolutional Network for Disease Onset Prediction in Beehive Sensor Networks

Sungwoo Kang

AI summary: This study proposes STAG-CN, a spatio-temporal graph convolutional network that models inter-hive relationships to predict disease onset, combining physical co-location with climatic sensor correlation. It reports that shared environmental response patterns are more predictive than spatial proximity.

Comments: Null result after running with 10 seeds

Abstract

Honey bee colony losses threaten global pollination services, yet current monitoring systems treat each hive as an isolated unit, ignoring the spatial pathways through which diseases spread across apiaries. This paper introduces the Spatio-Temporal Apiary Graph Convolutional Network (STAG-CN), a graph neural network that models inter-hive relationships for disease onset prediction. STAG-CN operates on a dual adjacency graph combining physical co-location and climatic sensor correlation among hive sessions, and processes multivariate IoT sensor streams through a temporal-spatial-temporal sandwich architecture built on causal dilated convolutions and Chebyshev spectral graph convolutions. Evaluated on the Korean AI Hub apiculture dataset (dataset #71488) with expanding-window temporal cross-validation, STAG-CN achieves an F1 score of 0.607 at a three-day forecast horizon. An ablation study reveals that the climatic adjacency matrix alone matches full-model performance (F1 = 0.607), while the physical adjacency alone yields F1 = 0.274, indicating that shared environmental response patterns carry stronger predictive signal than spatial proximity for disease onset. These results establish a proof-of-concept for graph-based biosecurity monitoring in precision apiculture, demonstrating that inter-hive sensor correlations encode disease-relevant information invisible to single-hive approaches.
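The Chebyshev spectral graph convolution mentioned above can be sketched in a few lines. This is a generic illustration of the standard recurrence, not the STAG-CN layer: it uses one scalar feature per node, hand-picked coefficients `theta`, and the common approximation `lambda_max = 2`, all of which are assumptions of this sketch.

```python
import math

def scaled_laplacian(adj, lambda_max=2.0):
    """Return 2L/lambda_max - I for the symmetric normalized Laplacian L."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            norm_adj = 0.0
            if adj[i][j] and deg[i] > 0 and deg[j] > 0:
                norm_adj = adj[i][j] / math.sqrt(deg[i] * deg[j])
            identity = 1.0 if i == j else 0.0
            out[i][j] = 2.0 * (identity - norm_adj) / lambda_max - identity
    return out

def matvec(m, v):
    """Dense matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def cheb_conv(adj, x, theta):
    """y = sum_k theta[k] * T_k(L_scaled) x, with T_k = 2 L T_{k-1} - T_{k-2}."""
    lap = scaled_laplacian(adj)
    t_prev, t_curr = list(x), matvec(lap, x)
    out = [theta[0] * v for v in t_prev]
    if len(theta) > 1:
        out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        t_next = [2.0 * a - b for a, b in zip(matvec(lap, t_curr), t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out

# Three hives in a line: 0 - 1 - 2, with a signal only at hive 0.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(cheb_conv(path_graph, [1.0, 0.0, 0.0], theta=[0.5, 0.3, 0.2]))
```

A filter with coefficients up to order K only mixes information from nodes at most K hops apart, so the choice of adjacency matrix (physical or climatic in the paper) determines which hives can influence each other.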

2603.13652 2026-05-19 cs.CV

Causal Attribution via Activation Patching

Amirmohammad Izadi, Mohammadali Banayeeanzade, Alireza Mirrokni, Hosein Hasani, Mobin Bagherian, Faridoun Mehri, Mahdieh Soleymani Baghshah

AI summary: This paper proposes CAAP, a causal attribution method that estimates the contribution of image patches to a Vision Transformer's prediction by directly intervening on internal activations, producing more faithful and localized attributions.

Abstract

Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing attribution methods face several limitations, with gradient-based, relevance-propagation, and attention-based methods relying on local approximations, while perturbation or optimization-based methods intervene on inputs, tokens, or surrogates rather than internal patch representations. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers; methods that operate only on input changes, attention weights, or backward relevance signals may therefore provide indirect proxies for patch importance rather than directly testing the predictive effect of contextualized patch representations. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing semantic evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP consistently outperforms existing methods in various settings and produces more faithful and localized attributions.

2603.12145 2026-05-19 cs.LG cs.AI cs.SE

Automatic Generation of High-Performance RL Environments

自动生成高性能强化学习环境

Seth Karten, Rahul Dev Appapogu, Chi Jin

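AI summary This paper presents a closed-loop methodology that produces equivalent high-performance reinforcement learning environments at minimal compute cost. It demonstrates three workflows, including the creation of a new environment, and verifies the absence of a sim-to-sim gap across five environments.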

Comments 20 pages, 5 figures

详情
AI中文摘要

将复杂的强化学习(RL)环境转换为高性能实现传统上需要数月的专业工程工作。我们提出了一种闭环方法,以最小的计算成本生成等效的高性能环境。我们的方法使用通用提示模板、分层验证(属性、交互和运行测试)、迭代修复和跨后端策略转移来验证无仿真到仿真的差距。我们展示了三个不同的工作流程跨越五个环境:(1)从Game Boy模拟器PyBoy直接翻译到我们的EmuRust(通过Rust IPC)和从Pokemon Showdown翻译到我们的PokeJAX(通过JAX);(2)通过与现有高性能实现的吞吐量一致性进行验证,如Puffer Pong、MJX和Brax在匹配的GPU批次大小下;(3)新环境的创建:TCGJax,第一个Pokemon TCG Pocket环境,从网页提取的规范中创建。在2亿个参数下,环境开销低于训练时间的4%。我们的闭环方法验证了所有五个环境的等效性。TCGJax,由一个不在公共存储库中的私有参考合成,用于控制代理预训练数据的污染问题。

英文摘要

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a closed-loop methodology that produces equivalent high-performance environments for minimal compute cost. Our method uses a generic prompt template, hierarchical verification (property, interaction, and rollout tests), iterative repair, and cross-backend policy transfer to verify no sim-to-sim gap. We demonstrate three distinct workflows across five environments: (1) Direct translation (no prior performance implementation exists) from Game Boy emulator PyBoy to our EmuRust (via Rust IPC) and from Pokemon Showdown to our PokeJAX (via JAX); (2) Translation verified against existing performance implementations via throughput parity with Puffer Pong, MJX and Brax at matched GPU batch sizes; and (3) New environment creation: TCGJax, the first Pokemon TCG Pocket environment, created from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Our closed-loop methodology confirms equivalence for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns.

2603.11689 2026-05-19 cs.AI

Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

显式逻辑通道用于验证和增强用于零样本任务的前沿多模态大语言模型

Mei Chee Leong, Ying Gu, Hui Li Tan, Liyuan Li, Nancy Chen

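AI summary This paper proposes an Explicit Logic Channel for validating and enhancing multimodal large language models on zero-shot tasks, using explicit logical reasoning to improve explainability and trustworthiness.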

详情
AI中文摘要

前沿多模态大语言模型(MLLMs)在视觉-语言理解(VLC)任务中表现出显著能力。然而,它们通常以黑盒方式部署到新任务中。验证和理解这些模型的行为对于应用到新任务变得重要。我们提出显式逻辑通道,与黑盒模型通道并行,以进行显式逻辑推理用于模型验证、选择和增强。前沿MLLM,封装潜在的视觉语言知识,可以被视为隐式逻辑通道。所提出的显式逻辑通道,模仿人类逻辑推理,结合了一个LLM、一个VFM和逻辑推理与概率推理,用于事实、反事实和关系推理,基于显式视觉证据。提出了一种一致性率(CR)用于跨通道验证和模型选择,即使没有地面真相注释。此外,跨通道整合进一步提高了MLLM在零样本任务中的性能,基于显式视觉证据以增强可信度。在两个代表性的VLC任务,即MC-VQA和HC-REC上,对三个具有挑战性的基准进行综合实验,使用11个最近的开源MLLMs,来自四个前沿家族。我们的系统评估证明了所提出的ELC和CR在增强可解释性和可信度的MLLM模型验证、选择和改进中的有效性。

英文摘要

Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solutions to new tasks in a black-box manner. Validating and understanding the behavior of these models becomes important for application to new tasks. We propose an Explicit Logic Channel (ELC), in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered as an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates an LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over the explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves performance in zero-shot tasks over MLLMs, grounded with explicit visual evidence to enhance trustworthiness. Comprehensive experiments are conducted on two representative VLC tasks, i.e., MC-VQA and HC-REC, across three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of the proposed ELC and CR for model validation, selection and improvement on MLLMs with enhanced explainability and trustworthiness.

2603.10935 2026-05-19 cs.LG cs.AI cs.CV

Spherical VAE with Cluster-Aware Feasible Regions: Guaranteed Prevention of Posterior Collapse

具有聚类感知可行区域的球形VAE:保证防止后验崩溃

Zegu Zhang, Jian Zhang

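AI summary This paper proposes a framework that theoretically guarantees non-collapsed solutions by using spherical shell geometry and cluster-aware constraints to prevent posterior collapse in VAEs, achieving 100% collapse prevention on synthetic and real-world datasets.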

Comments 8 pages, 6 figures

详情
AI中文摘要

变分自编码器(VAEs)经常受到后验崩溃的影响,其中潜在变量在近似后验退化为先验时变得无信息。尽管最近的研究将崩溃描述为由数据协方差属性决定的相变,但现有方法主要旨在避免而非消除崩溃。我们引入了一种新的框架,通过利用球壳几何和聚类感知约束,从理论上保证非崩溃解。我们的方法将数据转换为球壳,通过K-means计算最优聚类分配,并定义一个在聚类内方差W和崩溃损失δ-collapse之间的可行区域。我们证明当重构损失被限制在这个区域内时,崩溃解在数学上被排除在可行参数空间之外。关键的是,我们引入了规范约束机制,确保解码器输出保持与球壳几何兼容,而不限制表示能力。与以往方法不同,我们的方法提供了严格的理论保证,计算开销小,且不施加对解码器输出的限制。在合成和现实数据集上的实验表明,在传统VAE完全失败的条件下,实现了100%的崩溃预防,重构质量匹配或超过最先进的方法。我们的方法不需要显式的稳定性条件(例如σ² < λ_max),并且适用于任意神经网络架构。代码可在https://github.com/tsegoochang/spherical-vae-with-Cluster获取。

英文摘要

Variational autoencoders (VAEs) frequently suffer from posterior collapse, where the latent variables become uninformative as the approximate posterior degenerates to the prior. While recent work has characterized collapse as a phase transition determined by data covariance properties, existing approaches primarily aim to avoid rather than eliminate collapse. We introduce a novel framework that theoretically guarantees non-collapsed solutions by leveraging spherical shell geometry and cluster-aware constraints. Our method transforms data to a spherical shell, computes optimal cluster assignments via K-means, and defines a feasible region between the within-cluster variance $W$ and collapse loss $δ_{\text{collapse}}$. We prove that when the reconstruction loss is constrained to this region, the collapsed solution is mathematically excluded from the feasible parameter space. Critically, we introduce norm constraint mechanisms that ensure decoder outputs remain compatible with the spherical shell geometry without restricting representational capacity. Unlike prior approaches, our method provides a strict theoretical guarantee with minimal computational overhead without imposing constraints on decoder outputs. Experiments on synthetic and real-world datasets demonstrate 100% collapse prevention under conditions where conventional VAEs completely fail, with reconstruction quality matching or exceeding state-of-the-art methods. Our approach requires no explicit stability conditions (e.g., $σ^2 < λ_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/spherical-vae-with-Cluster.

2603.03328 2026-05-19 cs.CL cs.AI

StructLens: A Structural Lens for Language Models via Maximum Spanning Trees

StructLens:通过最大生成树实现语言模型的结构镜像

Haruki Sakajo, Frederikus Hudi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

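AI summary This paper proposes StructLens, a framework that analyzes the structure of language-model representations through maximum spanning trees, revealing how models organize token representations across layers and training stages.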

详情
AI中文摘要

语言具有内在结构,这一特性解释了语言习得和语言变化。鉴于此特性,我们预期语言模型也会表现出自身的内部结构。尽管可解释性研究已经探讨了模型如何通过注意力模式和稀疏自编码器计算表示,但所得到的表示的组织方式却被忽视。为解决这一差距,我们引入StructLens,一个通过整体结构视角分析表示的框架。StructLens基于残差流中的语义表示构建最大生成树,受依赖解析中树表示的启发,并在表示空间中提供token关系的摘要。我们分析了连续token在表示空间中也彼此接近,并发现中间层显示出最强的局部跨度组织。此外,对预训练检查点的分析表明,较小的局部单元在预训练早期变得可检测,而较大的单元则在后期才变得可检测。我们的发现表明,StructLens提供了关于模型在不同层和训练过程中如何组织token表示的见解。我们的代码可在https://github.com/naist-nlp/structlens获取。

英文摘要

Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest their own internal structures as well. While interpretability research has investigated how models compute representations mechanistically through attention patterns and Sparse AutoEncoders, the organization of the resulting representations is overlooked. To address this gap, we introduce StructLens, a framework to analyze representations through a holistic structural view. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, inspired by tree representation in dependency parsing, and provides summaries of token relationships in representation space. We analyze how contiguous tokens are also nearby in representation space and find that middle layers show the strongest local-span organization. Moreover, analysis of pre-training checkpoints reveals that smaller local units become detectable earlier in pre-training, and larger units later. Our findings demonstrate that StructLens provides insights into how models organize token representations across layers and training. Our code is available at https://github.com/naist-nlp/structlens.
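The tree construction described in the abstract above can be illustrated with a minimal sketch. It assumes cosine similarity as the edge weight and uses toy 2-D vectors in place of residual-stream representations; the similarity measure, the function names, and the `adjacent_edge_fraction` statistic are illustrative assumptions, not StructLens's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maximum_spanning_tree(vectors):
    """Prim's algorithm on the complete graph weighted by cosine similarity.

    Returns a list of (i, j) edges, where i was already in the tree
    when j was attached.
    """
    n = len(vectors)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None  # (weight, i, j) of the heaviest edge leaving the tree
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                w = cosine(vectors[i], vectors[j])
                if best is None or w > best[0]:
                    best = (w, i, j)
        _, i, j = best
        in_tree.add(j)
        edges.append((i, j))
    return edges

def adjacent_edge_fraction(edges):
    """Hypothetical statistic: share of tree edges linking contiguous positions."""
    return sum(1 for i, j in edges if abs(i - j) == 1) / len(edges)

# Toy "token representations": the direction rotates slowly with position,
# so neighbouring positions are the most similar (an assumption for the demo).
toy_vectors = [(math.cos(0.3 * t), math.sin(0.3 * t)) for t in range(5)]
tree = maximum_spanning_tree(toy_vectors)
local_fraction = adjacent_edge_fraction(tree)
```

On these toy vectors every tree edge joins neighbouring positions, which is the kind of local-span organization the abstract reports for middle layers; on real model representations the fraction would have to be measured per layer.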

2603.03308 2026-05-19 cs.CL cs.AI

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

旧习惯难改:对话历史如何几何学地困住大语言模型

Adi Simhi, Fazl Barez, Martin Tutek, Yonatan Belinkov, Shay B. Cohen

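AI summary This study examines how conversational history biases the subsequent behavior of large language models. It proposes the History-Echoes framework, which analyzes this bias from probabilistic and geometric perspectives, and shows that behavioral persistence manifests as a geometric trap in latent space.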

Comments Accepted to ICML 2026

详情
AI中文摘要

大语言模型(LLMs)的对话历史如何影响其未来表现?近期研究表明,LLMs受对话历史影响的方式出人意料。例如,先前交互中的幻觉可能影响后续模型响应。在本工作中,我们引入History-Echoes框架,研究对话历史如何偏移后续生成。该框架从两个角度探索这种偏差:概率上,我们将对话建模为马尔可夫链以量化状态一致性;几何上,我们测量连续隐藏表示的一致性。在三个模型家族和六个涵盖多样化现象的数据集上,我们的分析揭示了两种视角之间的强相关性。通过连接这些视角,我们证明行为持续性表现为几何陷阱,即潜在空间中的间隙会限制模型轨迹。代码可在https://github.com/technion-cs-nlp/OldHabitsDieHard获取。

英文摘要

How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History-Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory. Code available at https://github.com/technion-cs-nlp/OldHabitsDieHard.
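The probabilistic view in the abstract above, modeling conversations as Markov chains to quantify state consistency, can be sketched as follows. The state labels ("ok", "halluc"), the toy conversations, and the use of the self-transition probability as a persistence score are assumptions made for illustration; the abstract does not specify how History-Echoes defines states or consistency.

```python
from collections import Counter, defaultdict

def transition_matrix(sequences):
    """Estimate first-order Markov transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

def persistence(matrix, state):
    """Probability of staying in `state` on the next turn (self-transition)."""
    return matrix.get(state, {}).get(state, 0.0)

# Hypothetical per-turn labels for two conversations.
conversations = [
    ["ok", "ok", "halluc", "halluc", "halluc"],
    ["ok", "halluc", "halluc", "ok", "ok"],
]
P = transition_matrix(conversations)
halluc_persistence = persistence(P, "halluc")  # 3 of 4 transitions stay in "halluc"
ok_persistence = persistence(P, "ok")          # 2 of 4 transitions stay in "ok"
```

A self-transition probability well above the state's overall frequency would indicate that the behavior tends to persist once it appears, which is the kind of history bias the framework sets out to measure.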

2603.03190 2026-05-19 cs.AI q-bio.NC

Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

期望与听觉神经网络表示增强从脑活动识别音乐

Shogo Noguchi, Taketo Akama, Tai Nakamura, Shun Minamikawa, Natalia Polouliakh

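AI summary This study improves EEG-based music identification by using acoustic and expectation-related neural network representations as distinct teacher targets, shows that representation learning can be guided by neural encoding, and points toward advances in predictive music cognition and neural decoding.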

Comments 47 pages, 12 figures

详情
AI中文摘要

在音乐聆听过程中,皮层活动编码了听觉和期望相关信息。先前工作已表明,ANN表示类似于皮层表示,并可作为EEG识别的监督信号。本文显示,将听觉和期望相关的ANN表示作为教师目标进行区分,能提高基于EEG的音乐识别性能。预训练以预测任一表示的模型优于非预训练基线,且结合它们可获得互补增益,超过通过不同随机初始化形成的强种子集合。这些发现表明,教师表示类型影响下游性能,且表示学习可以由神经编码引导。本工作为预测音乐认知和神经解码的发展指明了方向。我们的期望表示直接从原始信号计算得出,无需人工标签,反映了超越起始或音高的预测结构,使能够研究跨多样刺激的多层预测编码。其可扩展性表明,未来可能开发出基于皮层编码原理的通用EEG模型。

英文摘要

During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.

2603.03099 2026-05-19 cs.LG cs.AI

Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

为何Adam能胜过SGD:二阶矩归一化产生更尖锐的尾部

Ruinan Jin, Yingbin Liang, Shaofeng Zou

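AI summary This paper identifies a key second-moment normalization in Adam and, through a stopping-time/martingale analysis under the classical bounded-variance model, proves a better high-probability convergence guarantee for Adam than for SGD: Adam has a δ^{-1/2} dependence on the confidence parameter δ, whereas SGD incurs at least a δ^{-1} dependence.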

Comments 68 pages

详情
AI中文摘要

尽管Adam在许多应用中表现出比SGD更快的实证收敛速度,但现有的大多数理论保证与SGD几乎相同,无法充分解释实证性能差距。在本文中,我们揭示了Adam中的关键二阶矩归一化,并开发了一种停止时间/鞅分析,该分析在经典有界方差模型(一个二阶矩假设)下,能够证明Adam在高概率收敛行为上优于SGD。具体而言,我们建立了两种方法高概率收敛行为之间的第一个理论区分:Adam对置信参数δ的依赖为δ^{-1/2},而SGD对应的高概率保证至少需要δ^{-1}的依赖。

英文摘要

Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $δ^{-1/2}$ dependence on the confidence parameter $δ$, whereas the corresponding high-probability guarantee for SGD necessarily incurs at least a $δ^{-1}$ dependence.
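For readers unfamiliar with the second-moment normalization the abstract above refers to, here is a minimal scalar sketch of the standard Adam update next to plain SGD. It illustrates only the update rule, not the paper's stopping-time analysis; the class name `AdamScalar` and the hyperparameter defaults are choices made for this example.

```python
import math

def sgd_step(theta, grad, lr):
    """Plain SGD: the step size scales linearly with the gradient."""
    return theta - lr * grad

class AdamScalar:
    """Standard Adam update for a single scalar parameter."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running first moment of the gradient
        self.v = 0.0  # running second moment of the gradient
        self.t = 0    # step counter for bias correction

    def step(self, theta, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Dividing by sqrt(v_hat) is the second-moment normalization.
        return theta - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

# First step from theta = 0 with a large and a tiny gradient.
adam_large = AdamScalar(lr=0.1).step(0.0, 1000.0)
adam_small = AdamScalar(lr=0.1).step(0.0, 0.001)
sgd_large = sgd_step(0.0, 1000.0, lr=0.1)
```

Because of the normalization, the first Adam step has magnitude close to the learning rate for both gradients, while the SGD step grows with the gradient. This insensitivity to gradient scale gives some intuition for why heavy gradient noise might affect the two methods differently, though the formal separation is the paper's result, not something this sketch shows.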

2603.00631 2026-05-19 cs.AI

LiTS: A Modular Framework for LLM Tree Search

LiTS:一个用于LLM树搜索的模块化框架

Xinzhe Li, Yaguang Tao

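AI summary This paper presents LiTS, a modular framework for LLM reasoning via tree search, demonstrates its composability on language reasoning, environment planning, and tool-use tasks, and finds that LLM policy diversity is the bottleneck for effective tree search in infinite action spaces.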

Comments ACL 2026 Demo

详情
AI中文摘要

LiTS是一个模块化的Python框架,用于通过树搜索进行LLM推理。它将树搜索分解为三个可重用的组件(策略、转移和奖励模型),这些组件可以插入到MCTS和BFS等算法中。基于装饰器的注册机制使领域专家能够通过注册组件扩展到新领域,使算法研究人员能够实现自定义的搜索算法。我们在MATH500(语言推理)、Crosswords(环境规划)和MapEval(工具使用)上展示了可组合性,证明了组件和算法的正交性:组件可以在每个任务类型内跨算法重用,而算法可以在所有组件和领域中工作。我们还报告了一个模式崩溃发现:在无限动作空间中,LLM策略多样性(而不是奖励质量)是有效树搜索的瓶颈。演示视频可在https://youtu.be/nRGX43YrR3I获取。该包在Apache 2.0许可证下发布于https://github.com/xinzhel/lits-llm,包含安装说明和可运行示例,使用户能够重现演示的工作流。

英文摘要

LiTS is a modular Python framework for LLM reasoning via tree search. It decomposes tree search into three reusable components (Policy, Transition, and RewardModel) that plug into algorithms like MCTS and BFS. A decorator-based registry enables domain experts to extend to new domains by registering components, and algorithmic researchers to implement custom search algorithms. We demonstrate composability on MATH500 (language reasoning), Crosswords (environment planning), and MapEval (tool use), showing that components and algorithms are orthogonal: components are reusable across algorithms within each task type, and algorithms work across all components and domains. We also report a mode-collapse finding: in infinite action spaces, LLM policy diversity (not reward quality) is the bottleneck for effective tree search. A demonstration video is available at https://youtu.be/nRGX43YrR3I. The package is released under the Apache 2.0 license at https://github.com/xinzhel/lits-llm, including installation instructions and runnable examples that enable users to reproduce the demonstrated workflows.

2603.00607 2026-05-19 cs.CV cs.AI

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

IdGlow: 多主体生成中的动态身份调节

Honghao Cai, Xiangyuan Wang, Jing Li, Yunhao Bai, Tianze Zhou, Haohua Chen, Chao Hui, Changhao Qiao, Runqi Wang, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li

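AI summary This paper proposes IdGlow, a framework that uses task-adaptive timestep scheduling and a vision-language model to address the stability-plasticity dilemma in multi-subject generation, improving facial fidelity and commercial-grade aesthetic quality.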

详情
AI中文摘要

多主体图像生成需要在一致的场景中无缝协调多个参考身份。然而,现有方法依赖刚性空间掩码或局部注意力,往往在需要复杂结构变形的任务中(如保持身份的年龄变换)面临'稳定性-可塑性困境'。为此,我们提出IdGlow,一种基于流匹配扩散模型的无掩码、分阶段框架。在监督微调(SFT)阶段,我们引入任务自适应的时间步调度,与扩散生成动力学对齐:一种线性衰减调度,逐步放松约束以生成自然群体组成,以及一个时间门控机制,将身份注入集中于关键语义窗口,成功保留成人面部语义而不覆盖儿童样结构。为解决属性泄漏和语义模糊问题而无需显式布局输入,我们进一步整合了基于badcase驱动的视觉语言模型(VLM)进行精确的上下文感知提示合成。在第二阶段,我们设计了细粒度群体级直接偏好优化(DPO)方法,采用加权边距公式,同时消除多主体伪影、提升纹理和谐度,并重新校准身份保真度以适应现实分布。在两个具有挑战性的基准测试——直接多人物融合和年龄变换群体生成——上的大量实验表明,IdGlow从根本上缓解了稳定性-可塑性冲突,实现了在最先进的面部保真度和商业级美学质量之间的优越帕累托平衡。

英文摘要

Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.

2602.24238 2026-05-19 cs.LG

Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

时间序列基础模型在交通预测中的强大基准作用:一项大规模基准分析

Javier Yanes-Pulido, Filipe Rodrigues

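AI summary By evaluating the zero-shot performance of the time-series model Chronos-2 on ten real-world datasets, this paper shows that general-purpose time-series foundation models are effective for transportation forecasting, matching or exceeding classical statistical baselines and specialized deep learning architectures on most datasets, particularly at longer horizons.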

Comments 6 pages

详情
AI中文摘要

准确预测交通动态对于城市交通和基础设施规划至关重要。尽管近期工作在深度学习模型中取得了优异表现,但这些方法通常需要特定数据集的训练、架构设计和超参数调整。本文评估了通用时间序列基础模型是否能作为交通任务的预测器,通过在十个涵盖高速公路交通量和流、城市交通速度、自行车共享需求和电动汽车充电站数据的真实世界数据集上,对最新模型Chronos-2的零样本性能进行基准测试。在一致的评估协议下,我们发现,即使没有任何任务特定的微调,Chronos-2在大多数数据集上均达到或超越了传统统计基线和专用深度学习架构的准确性,特别是在长预测范围。除了点预测外,我们还通过预测区间覆盖和锐度评估其原生概率输出,证明Chronos-2在无需特定数据集训练的情况下也提供了有用的不确定性量化。总体而言,本研究支持将时间序列基础模型作为交通预测研究的关键基准。

英文摘要

Accurate forecasting of transportation dynamics is essential for urban mobility and infrastructure planning. Although recent work has achieved strong performance with deep learning models, these methods typically require dataset-specific training, architecture design and hyper-parameter tuning. This paper evaluates whether general-purpose time-series foundation models can serve as forecasters for transportation tasks by benchmarking the zero-shot performance of the state-of-the-art model, Chronos-2, across ten real-world datasets covering highway traffic volume and flow, urban traffic speed, bike-sharing demand, and electric vehicle charging station data. Under a consistent evaluation protocol, we find that, even without any task-specific fine-tuning, Chronos-2 delivers state-of-the-art or competitive accuracy across most datasets, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. Beyond point forecasting, we evaluate its native probabilistic outputs using prediction-interval coverage and sharpness, demonstrating that Chronos-2 also provides useful uncertainty quantification without dataset-specific training. In general, this study supports the adoption of time-series foundation models as a key baseline for transportation forecasting research.
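The two uncertainty metrics named in the abstract above, prediction-interval coverage and sharpness, can be computed as in the sketch below. It assumes the common definitions (empirical coverage as the share of observations inside the interval, sharpness as the mean interval width); the paper's exact definitions and the toy numbers here are assumptions.

```python
def interval_coverage(y_true, lower, upper):
    """Share of observations that fall inside their prediction interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)

def interval_sharpness(lower, upper):
    """Mean interval width; narrower intervals are sharper."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Toy traffic counts with hypothetical interval bounds.
observed = [10, 12, 15, 20]
lower_bound = [8, 11, 16, 18]
upper_bound = [12, 14, 19, 23]

coverage = interval_coverage(observed, lower_bound, upper_bound)  # 3 of 4 inside
sharpness = interval_sharpness(lower_bound, upper_bound)          # widths 4, 3, 3, 5
```

The two metrics are read together: coverage close to the nominal level shows the intervals are calibrated, and among calibrated forecasters the one with smaller average width is preferred.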

2602.23566 2026-05-19 cs.LG cs.AI

Flowette: Flow Matching with Graphette Priors for Graph Generation

Flowette: 用于图生成的图结构先验的流匹配

Asiri Wijesinghe, Sevvandi Kandanaarachchi, Daniel M. Steinberg, Cheng Soon Ong

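AI summary This paper proposes Flowette, a flow matching framework in which a graph neural network-based transformer learns a velocity field over graph representations. It combines optimal-transport coupling, regularisation, and graphette structural priors, and its experiments indicate that combining structural priors with flow-based training is effective for graph generation.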

Comments 48 Pages

详情
AI中文摘要

我们研究具有重复子图motif的图生成建模。我们提出了Flowette,一个连续流匹配框架,利用基于图神经网络的transformer学习具有节点和边属性的图表示上的速度场。我们的模型通过基于最优传输的耦合实现拓扑感知对齐,并通过正则化促进全局结构一致性。为整合领域驱动的结构先验,我们引入图ettes,一种新的概率图结构模型家族,通过受控的结构编辑推广图ons以适用于环、星形和树等motif。我们理论分析了框架的耦合、不变性和结构性质,评估了其在合成和分子基准上的性能,并通过受控消融实验隔离了结构先验、最优传输耦合和正则化项的贡献。Flowette在多个基准上取得了竞争性性能,达到多个指标的最先进结果,突显了结合结构先验与流训练在建模复杂图分布中的有效性。

英文摘要

We study generative modeling of graphs with recurring subgraph motifs. We propose Flowette, a continuous flow matching framework that employs a graph neural network-based transformer to learn a velocity field over graph representations with node and edge attributes. Our model promotes topology-aware alignment through optimal transport-based coupling and encourages global structural coherence through regularisation. To incorporate domain-driven structural priors, we introduce graphettes, a new probabilistic family of graph structure models that generalize graphons via controlled structural edits for motifs such as rings, stars, and trees. We theoretically analyze the coupling, invariance, and structural properties of the framework, evaluate it on synthetic and molecular benchmarks, and isolate the contributions of the structural prior, the optimal-transport coupling, and the regularisation terms through controlled ablations. Flowette achieves competitive performance overall, attaining state-of-the-art results on several metrics across multiple benchmarks, highlighting the effectiveness of combining structural priors with flow-based training for modeling complex graph distributions.

2602.22667 2026-05-19 cs.CV

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

单目开放词汇占用预测用于室内场景

Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen

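AI summary This work proposes a geometry-only supervision approach to monocular open-vocabulary occupancy prediction for indoor scenes. An opacity-aware, Poisson-based method and a Progressive Temperature Decay schedule improve the stability and accuracy of geometric and semantic alignment, and the framework reaches 59.50 IoU and 21.05 mIoU on Occ-ScanNet.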

Comments Accepted at CVPR2026 Oral

详情
AI中文摘要

开放词汇3D占用对于具有体素的智能体至关重要,这些智能体需要理解具有丰富语义类别的复杂室内环境,并超越固定分类体系。尽管最近的研究在户外驾驶场景中探索了开放词汇占用,但这些方法在室内场景中表现不佳,因为几何更密集,布局更复杂,语义更细粒度。为了解决这些挑战,我们采用仅使用二元占用标签(占用vs自由)的几何-only监督范式。我们的框架基于3D语言嵌入高斯,这些高斯作为统一的中间表示,将细粒度3D几何与语言对齐的语义嵌入耦合在一起。在几何方面,我们发现现有高斯到占用运算符在如此弱的监督下无法收敛,我们引入了一种基于Poisson的透明度感知方法,稳定了体积分组。在语义方面,直接对渲染特征和开放词汇分割特征之间的对齐导致特征混合;因此,我们提出了一个逐步温度衰减调度,逐步在溅射过程中锐化透明度,加强高斯-语言对齐。在Occ-ScanNet上,我们的框架在开放词汇设置中实现了59.50 IoU和21.05 mIoU,超过了所有现有的占用方法在IoU,并在mIoU上大幅优于先前的开放词汇方法。代码将在https://github.com/JuIvyy/LegoOcc上发布。

英文摘要

Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.

2602.21426 2026-05-19 cs.LG stat.CO

Proximal-IMH: Proximal Posterior Proposals for Independent Metropolis-Hastings with Approximate Operators

Proximal-IMH: 用于独立Metropolis-Hastings的近端后验提议

Youguang Chen, George Biros

AI总结 本文提出了一种改进的独立Metropolis-Hastings算法,通过引入辅助优化问题来消除近似后验分布中的偏差,从而在保持精确模型的同时提高稳定性和采样效率。

详情
AI中文摘要

我们考虑了在科学、工程和成像中的贝叶斯反问题中从后验分布采样的问题。我们的方法属于独立Metropolis-Hastings(IMH)采样算法家族,常用于贝叶斯推断。依赖于存在一个更便宜但可能有显著偏差的近似后验分布,我们引入了Proximal-IMH,通过辅助优化问题纠正近似后验的样本,从而在精确模型和近似参考点周围获得局部调整。对于理想化设置,我们证明了近端校正能够收紧近似和精确后验之间的匹配,从而提高接受率和混合性。该方法适用于线性和非线性输入-输出算子,并特别适用于精确后验采样成本过高的反问题。我们展示了包含多模态和数据驱动先验的数值实验,结果表明Proximal-IMH在现有IMH变体中表现更优。

英文摘要

We consider the problem of sampling from a posterior distribution arising in Bayesian inverse problems in science, engineering, and imaging. Our method belongs to the family of independence Metropolis-Hastings (IMH) sampling algorithms, which are common in Bayesian inference. Relying on the existence of an approximate posterior distribution that is cheaper to sample from but may have significant bias, we introduce Proximal-IMH, a scheme that removes this bias by correcting samples from the approximate posterior through an auxiliary optimization problem. This yields a local adjustment that trades off adherence to the exact model against stability around the approximate reference point. For idealized settings, we prove that the proximal correction tightens the match between approximate and exact posteriors, thereby improving acceptance rates and mixing. The method applies to both linear and nonlinear input-output operators and is particularly suitable for inverse problems where exact posterior sampling is too expensive. We present numerical experiments including multimodal and data-driven priors with nonlinear input-output operators. The results show that Proximal-IMH reliably outperforms existing IMH variants.
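As background for the abstract above, the following sketch shows a plain independence Metropolis-Hastings sampler, in which proposals are drawn from a fixed approximate distribution and accepted according to the ratio of importance weights. It does not implement the proximal correction that the paper introduces; the 1-D Gaussian target and proposal, the function names, and the step count are assumptions chosen to keep the example self-contained.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def independence_mh(log_target, log_proposal, draw_proposal, n_steps, rng):
    """Independence Metropolis-Hastings: proposals ignore the current state."""
    x = draw_proposal(rng)
    log_w_x = log_target(x) - log_proposal(x)  # log importance weight of x
    samples, accepted = [], 0
    for _ in range(n_steps):
        y = draw_proposal(rng)
        log_w_y = log_target(y) - log_proposal(y)
        # Accept with probability min(1, w(y) / w(x)).
        accept_prob = math.exp(min(0.0, log_w_y - log_w_x))
        if rng.random() < accept_prob:
            x, log_w_x = y, log_w_y
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps

# Toy setup: the "exact posterior" is N(1, 1); the cheaper, biased
# "approximate posterior" used as the proposal is N(0, 2^2).
rng = random.Random(0)
samples, acceptance_rate = independence_mh(
    log_target=lambda x: log_normal_pdf(x, 1.0, 1.0),
    log_proposal=lambda x: log_normal_pdf(x, 0.0, 2.0),
    draw_proposal=lambda r: r.gauss(0.0, 2.0),
    n_steps=20000,
    rng=rng,
)
sample_mean = sum(samples) / len(samples)
```

The acceptance rate depends on how closely the proposal matches the target: the worse the match, the more proposals are rejected and the slower the chain mixes. This is the quantity the abstract says the proximal correction improves by tightening the match between approximate and exact posteriors.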

2602.21265 2026-05-19 cs.CL cs.LG cs.SE

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

ToolMATH: 一种用于在系统性工具目录约束下评估长周期工具使用的诊断基准

Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee

AI总结 本文提出ToolMATH,一种基于数学的诊断基准,用于评估在可控工具目录条件下长周期工具使用的性能,通过将分步MATH解决方案转换为可重用的Python工具,并配对需要顺序工具使用、中间输出重用和逻辑连接工具调用链的问题,从而评估模型在不同工具目录条件下的适应性、鲁棒性和工具连接性。

Comments Submitted to NeurIPS Evaluation & Dataset Track

详情
AI中文摘要

我们介绍了ToolMATH,一种用于评估在可控工具目录条件下长周期工具使用的数学基础诊断基准。ToolMATH将分步MATH解决方案转换为具有自然语言描述和类型化架构的可重用Python工具,并配对每个问题与一个需要顺序工具使用、中间输出重用和逻辑连接工具调用链的工具环境。ToolMATH通过构建黄金工具和难度分级的干扰项来控制工具可用性和目录难度。ToolMATH还结合了行为条件度量指标,使诊断评估超越最终准确性。基于这些测量,ToolMATH强调三个评估轴:(1)适应性衡量在黄金工具被完全替换为干扰项时保留的黄金成功程度;(2)鲁棒性衡量在添加干扰项作为噪声时的稳定性;(3)工具连接性衡量模型是否在长执行的工具调用链中保持准确性。此外,跟踪级失败分析描述了模型在每种工具目录条件下如何失败。这些诊断揭示了不同的模型特征:可靠的工具使用、工具回避、适应性替代以及不可靠工具目录的影响。总体而言,ToolMATH提供了一个受控的测试平台,用于评估语言模型如何适应变化的工具可用性,保持对干扰项的鲁棒性,并在长周期工具使用轨迹中保持正确性。

英文摘要

We introduce ToolMATH, a math-grounded diagnostic benchmark for evaluating long-horizon tool use under controllable tool-catalog conditions. ToolMATH converts stepwise MATH solutions into reusable Python tools with natural-language descriptions and typed schemas, and pairs each problem with a tool environment requiring sequential tool use, intermediate-output reuse, and logically connected tool-call chains. ToolMATH controls tool availability and catalog difficulty by constructing gold tools and graded distractors with varying similarity to gold tools. ToolMATH also incorporates behavior-conditioned metrics, enabling diagnostic evaluation beyond final accuracy. Building on these measurements, ToolMATH emphasizes three evaluation axes: (1) Adaptability measures how much Gold-only success is retained when gold tools are replaced entirely by distractors; (2) Robustness measures stability when distractors are added as noise; and (3) Tool Connectivity measures whether models preserve accuracy over long executed tool-call chains. Furthermore, trace-level failure analyses characterize how models fail under each tool-catalog condition. Together, these diagnostics reveal distinct model profiles: reliable tool use, tool avoidance, adaptive substitution, and impacts of unreliable tool catalogs. Overall, ToolMATH provides a controlled testbed for evaluating how language models adapt to changing tool availability, remain robust to distractors, and maintain correctness across long-horizon tool-use trajectories.

2602.20200 2026-05-19 cs.RO cs.AI cs.CV

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

全局先验与局部一致性:双内存增强的视觉-语言-动作模型用于高效机器人操作

Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie

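AI summary This paper proposes OptimusVLA, which introduces a Global Prior Memory and a Local Consistency Memory to address inefficient and fragile action generation in robotic manipulation, achieving higher success rates and faster inference across multiple benchmarks.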

Comments Accepted by CVPR 2026

详情
AI中文摘要

分层视觉-语言-动作(VLA)模型已成为机器人操作中的主导范式。它通常包括一个视觉-语言骨干网络用于感知和理解,以及一个生成性策略用于动作生成。然而,其性能越来越受到动作生成过程的限制。(i) 低推理效率。各向同性噪声先验与目标动作分布之间存在显著的分布差距,这会增加去噪步骤和不可行样本的发生率。(ii) 脆弱性差。现有策略仅基于当前观察,忽视了历史序列的约束,因此缺乏对任务进展和时间一致性意识。为了解决这些问题,我们引入OptimusVLA,一种具有全局先验内存(GPM)和局部一致性内存(LCM)的双内存VLA框架。GPM用从语义相似轨迹中检索到的任务级先验替代高斯噪声,从而缩短生成路径并减少函数评估次数(NFE)。LCM动态建模执行的动作序列以推断任务进展,并注入一个学习的一致性约束,强制轨迹的时间一致性和平滑性。在三个模拟基准测试中,OptimusVLA始终优于强大的基线:它在LIBERO上实现了98.6%的平均成功率,在CALVIN上比pi_0提高了13.5%,在RoboTwin 2.0 Hard上达到了38%的平均成功率。在现实世界评估中,OptimusVLA在泛化和长周期套件中排名第一,比pi_0分别高出42.9%和52.4%,同时实现了2.9倍的推理加速。

英文摘要

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency. There is a pronounced distributional gap between isotropic noise priors and target action distributions, which increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraints of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In real-world evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.