arXivDaily arXiv daily academic digest, updated Monday through Friday
2603.14987 2026-05-22 cs.CL cs.DB

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Jinhu Qi, Yifan Li, Minghao Zhao, Wentao Zhang, Zijian Zhang, Yaoman Li, Irwin King

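AI summary This paper proposes a five-property definition of agentic trustworthiness and introduces the Holographic Agent Assessment Framework (HAAF), which assesses the trustworthiness of agent systems in socio-technical scenarios through static policy analysis, sandbox simulation, social-ethical alignment assessment, and distribution-aware sampling over a scenario manifold. It reports a cross-family transfer experiment on 13 systems from seven model families.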

Comments 9 pages, 3 figures, 8 tables. Submitted to the Agent4IR Workshop at KDD 2026

English abstract

Agentic AI systems increasingly act through tool-augmented, multi-step workflows whose failures (unsafe tool use, unauthorised actions, social harm) carry deployment-level consequences. Evaluation practice remains fragmented across isolated benchmark slices, and "trustworthiness" is frequently invoked but rarely defined operationally. We argue the central limitation is twofold: (i) the absence of a measurable specification of what agent trustworthiness means, and (ii) the lack of a principled notion of representativeness allowing assessment over a socio-technical scenario distribution rather than disconnected benchmark instances. We address (i) by defining agentic trustworthiness as a five-property profile (Reliability, Robustness, Safety, Social-Ethical Alignment, Operational Integrity) grounded in current AI risk frameworks, and (ii) with the Holographic Agent Assessment Framework (HAAF), which measures this profile over a scenario manifold through static policy analysis, sandbox simulation, social-ethical alignment assessment, and distribution-aware sampling, connected through an iterative Trustworthy Optimization Factory that converts red-team diagnoses into blue-team interventions. Our contributions are: (1) an operational five-property definition of agentic trustworthiness; (2) a distribution-aware scenario-sampling framework that surfaces property-level trade-offs invisible to scalar leaderboards; and (3) a cross-family transfer experiment in which interventions designed from a single focal model generalise -- without per-model or per-scenario tuning -- to 13 systems from seven model families (Llama, Mistral, Kimi, GLM, Qwen, GPT, DeepSeek) on a 100-scenario suite, where all 13 systems improve and two reach a perfect risk-weighted profile, establishing HAAF's Factory as a model-agnostic deployment-readiness pipeline. Code: https://github.com/TonyQJH/haaf-pilot

2603.11679 2026-05-22 cs.AI

LLMs can construct powerful representations and streamline sample-efficient supervised learning

Ilker Demirel, Lawrence Shi, Zeshan Hussain, David Sontag

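AI summary This paper proposes an LLM-based agentic pipeline that synthesizes a global rubric to improve representations of multimodal data, and it significantly outperforms conventional methods on 15 clinical tasks.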

English abstract

As real-world datasets become more complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data, such as time-series, free text, and structured records, often requires non-trivial domain expertise. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned interpretive summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric approaches significantly outperform count-feature models, naive LLM baselines, and a clinical foundation model pretrained on orders of magnitude more data. Beyond performance, rubrics offer operational advantages such as being easy to audit, cost-effectiveness at scale, and facilitating tabular representations.

2603.11642 2026-05-22 cs.RO

Noise-Space Attribution and Control of Chunk-Boundary Artifact

Rui Wang

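AI summary This paper studies the mechanism behind chunk-boundary artifacts in generative visuomotor policies. By analyzing variables in noise space, it shows that controlling latent noise modulates the artifact and that artifact changes can affect final task outcomes.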

English abstract

Action chunking is widely used in generative visuomotor policies, yet the recurring execution discontinuities at chunk boundaries still lack a mechanistic explanation. This paper treats chunk-boundary artifact as an analyzable mechanism variable. We first show that successful and failed episodes separate stably on artifact metrics. We then show that, in stochastic action-chunked policies, fixing the observation context and changing only latent noise is sufficient to modulate artifact systematically. On the same Diffusion Policy checkpoint, comparisons among DDPM, zero-variance DDPM, and DDIM further show that this local controllability depends on whether the information path from initial noise to action output remains intact. Finally, from controlled interventions at fixed local execution states, we find that artifact changes can carry through to final outcome, and that the preferred direction can reverse even within the same task: some contexts achieve higher success under lower artifact, whereas others achieve higher success under higher artifact. In a representative high-artifact-favoring key context selected by held-out matched-continuation validation, success rate increases from 0.033 to 0.717. These results show that chunk-boundary artifact is not a mere execution-side by-product, but a variable in noise space that can be attributed, controlled, and mechanistically linked to task outcome.

2603.03784 2026-05-22 cs.AI

Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism

Zheyu Chen, Huiteng Zhuang, Zhuohuan Li, Chuanhao Li

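AI summary This paper proposes synthesizing discrete-event world models online from natural-language specifications, combining the reliability of explicit simulators with the adaptability of neural models. It adopts the DEVS formalism and a staged LLM-based generation pipeline that separates structural inference from component-level event and timing logic, and it validates consistency and verifiability with benchmark suites.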

Comments 36 pages, 6 figures

English abstract

World models are central to LLM agents that must evaluate actions over long horizons. Yet much existing work focuses on environments governed by physical dynamics or spatial structure, whereas many high-impact domains, including supply chains, procurement networks, and business processes, evolve through discrete events, timing constraints, and causal dependencies. These settings call for discrete-event world models. Existing approaches to constructing world models often fall near two extremes: hand-engineered simulators provide consistency and reproducibility, but are costly to build and adapt; neural models are flexible, but can suffer from compounding inconsistency over long-horizon rollouts. We seek a principled middle ground by synthesizing discrete-event world models online from natural-language specifications, retaining the reliability of explicit simulators while gaining the adaptability of neural models. We adopt the DEVS formalism and introduce a staged LLM-based generation pipeline that separates structural inference over component interactions from component-level event and timing logic. For evaluation, we develop benchmark suites in which simulators emit structured event traces, which are then validated against specification-derived temporal, causal, and semantic constraints. This enables reproducible verification and localized diagnostics. Together, these contributions produce world models that remain consistent over long-horizon rollouts, can be verified from observable behavior, and can be synthesized efficiently on demand during online execution.
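To make the DEVS vocabulary concrete, here is a minimal sketch of an atomic DEVS-style model (time advance, output, internal and external transitions) together with a simulator that emits a structured event trace and a checker for one causal and one temporal constraint. It is an illustration only, not the paper's generated code or benchmark: the `Processor` model, the `simulate` loop, and `check_trace` are hypothetical names, and ties between internal and external events are resolved by a simple rule instead of a full confluent transition function.

```python
INF = float("inf")


class Processor:
    """Hypothetical atomic model: a single server with a FIFO queue."""

    def __init__(self, service_time):
        self.service_time = service_time
        self.queue = []        # jobs waiting for service
        self.current = None    # job currently in service
        self.remaining = INF   # time until the next internal event

    def time_advance(self):
        return self.remaining

    def output(self):
        # Output function: called just before an internal transition.
        return ("done", self.current)

    def internal(self):
        # Internal transition: start the next queued job, or go idle.
        self.current = self.queue.pop(0) if self.queue else None
        self.remaining = self.service_time if self.current is not None else INF

    def external(self, elapsed, job):
        # External transition: a job arrives `elapsed` after the last event.
        if self.current is None:
            self.current, self.remaining = job, self.service_time
        else:
            self.queue.append(job)
            self.remaining -= elapsed


def simulate(model, arrivals):
    """Run the model on time-sorted (time, job) arrivals; return the event trace."""
    trace, now, pending = [], 0, list(arrivals)
    while True:
        t_int = now + model.time_advance()
        t_ext = pending[0][0] if pending else INF
        if t_int == INF and t_ext == INF:
            return trace
        if t_int <= t_ext:  # simplification: internal events win ties
            now = t_int
            trace.append((now,) + model.output())
            model.internal()
        else:
            t, job = pending.pop(0)
            model.external(t - now, job)
            now = t
            trace.append((now, "arrive", job))


def check_trace(trace, service_time):
    """Causal: every 'done' follows its 'arrive'. Temporal: no job finishes early."""
    arrived = {}
    for time, kind, job in trace:
        if kind == "arrive":
            arrived[job] = time
        elif job not in arrived or time - arrived[job] < service_time:
            return False
    return True


trace = simulate(Processor(service_time=3), [(0, "a"), (1, "b"), (10, "c")])
print(trace, check_trace(trace, service_time=3))
```

Because the checker reads only the trace, the same constraints can be applied to any simulator that emits events in this shape, which is the property the abstract's trace-based evaluation relies on.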

2603.02604 2026-05-22 cs.LG

Heterogeneous Agent Collaborative Reinforcement Learning

Zhixia Zhang, Zixuan Huang, Gongxun Li, Huaiyang Wang, Chengyi Yuan, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban

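AI summary This paper introduces HACRL, a new Reinforcement Learning from Verifiable Reward (RLVR) problem in which heterogeneous agents share verified rollouts for collaborative optimization, addressing the inefficiency of isolated multi-agent on-policy optimization. It also proposes the HACPO algorithm to maximize sample utilization and cross-agent knowledge transfer.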

English abstract

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new Reinforcement Learning from Verifiable Reward (RLVR) problem that addresses the inefficiencies of isolated multi-agent on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional homogeneous teacher-to-student transfer. Building on this problem, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO with double rollouts by an average of 3.6% while using only half the rollout cost.

2602.23200 2026-05-22 cs.LG cs.CL

InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross

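AI summary This paper presents InnerQ, a hardware-aware KV cache quantization method that reduces decode latency without compromising evaluation performance. Its group-wise quantization strategy increases data reuse, and on Llama and Mistral models it improves few-shot evaluation scores relative to prior KV cache quantization methods.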

Comments 18 pages, 5 figures, 7 tables

English abstract

When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decoding step is therefore critical for efficient long-context generation. A major bottleneck is the key-value (KV) cache, whose size grows with sequence length and often dominates the model's memory footprint. Prior work has proposed quantization methods to compress the KV cache while minimizing its loss of precision. We present InnerQ, a hardware-aware KV cache quantization scheme that reduces decode latency without compromising evaluation performance. InnerQ performs group-wise quantization by grouping cache matrices along their inner dimension. This grouping strategy aligns dequantization with vector-matrix multiplication and increases data reuse across GPU compute units. As a result, InnerQ reduces memory access and accelerates dequantization, achieving an average $1.3\times$ speedup over prior KV cache quantization methods and $2.7\times$ over the non-quantized baseline. To maintain fidelity under aggressive compression, InnerQ incorporates three techniques: (i) hybrid quantization, which chooses symmetric or asymmetric quantization for each group based on local statistics; (ii) high-precision windows for both recent tokens and attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the model parameters to eliminate runtime overhead. Beyond reducing latency, experiments on Llama and Mistral models show that InnerQ also improves few-shot evaluation scores relative to prior KV cache quantization methods.
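The two ideas of grouping along the inner dimension and choosing symmetric or asymmetric quantization per group can be sketched in plain Python. This is a toy illustration under stated assumptions, not InnerQ's kernels: the selection rule in `quantize_group` (use symmetric when the group is roughly centred on zero) is a hypothetical stand-in for the paper's "local statistics", and all function names are invented for this example. `fused_dot` shows why grouping along the reduced dimension helps: the scale and offset are applied once per group instead of once per element.

```python
def quantize_group(values, bits=4):
    """Quantize one group; returns (mode, scale, zero, integer codes)."""
    lo, hi = min(values), max(values)
    # Hypothetical rule: symmetric if the range is roughly centred on zero.
    if lo < 0 < hi and abs(lo + hi) <= 0.25 * (hi - lo):
        qmax = (1 << (bits - 1)) - 1          # e.g. 7 for 4 bits
        scale = max(-lo, hi) / qmax
        return ("sym", scale, 0.0, [round(v / scale) for v in values])
    # Asymmetric: map [lo, hi] onto the full unsigned code range.
    scale = (hi - lo) / ((1 << bits) - 1) or 1.0
    return ("asym", scale, lo, [round((v - lo) / scale) for v in values])


def quantize_row(row, group_size, bits=4):
    """Split one cache row into contiguous groups along its inner dimension."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


def dequantize_row(groups):
    """Reconstruct approximate values: code * scale + zero."""
    return [c * scale + zero for _, scale, zero, codes in groups for c in codes]


def fused_dot(query, groups, group_size):
    """Dot product with dequantization folded in, one scale/zero per group."""
    total = 0.0
    for g, (_, scale, zero, codes) in enumerate(groups):
        q = query[g * group_size:(g + 1) * group_size]
        total += scale * sum(qi * c for qi, c in zip(q, codes)) + zero * sum(q)
    return total


row = [0.1, -0.2, 0.3, -0.4, 5.0, 5.5, 6.0, 6.5]
groups = quantize_row(row, group_size=4)
print([g[0] for g in groups])  # first group is zero-centred, second is not
```

The rounding error per element is at most half a quantization step in either mode, so the choice between modes only affects how large that step is for a given group.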

2602.18600 2026-05-22 cs.LG

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

Ziqiao Shang, Lingyue Ge, Zi-Jian Cheng, Shi-Yu Tian, Zhenyu Huang, Wenbo Fu, Weiming Wu, Yang Chen, Xiangwen Zhang, Yulan Hu, Bin Liu, Yu-Feng Li, Lan-Zhe Guo

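AI summary This paper introduces the MapTab benchmark for evaluating holistic multi-criteria reasoning in multimodal large language models through route planning tasks, and finds that current models face substantial challenges in multimodal reasoning.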

English abstract

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their reasoning capabilities under multi-criteria constraints. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate holistic multi-criteria reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key criteria: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs. Our code is available at https://github.com/Ziqiao-Shang/MapTab.
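For readers unfamiliar with multi-criteria route planning, the kind of query the benchmark poses can be sketched as a shortest-path search over edges that carry several attributes. The sketch below is not MapTab's data format or scoring code: the graph `METRO`, the attribute names, and the weighted-sum scalarization are assumptions chosen for illustration, and the weights must be non-negative for Dijkstra's algorithm to be valid.

```python
import heapq


def best_route(graph, start, goal, weights):
    """Dijkstra over a weighted sum of edge attributes.

    graph: {node: [(neighbor, {"time": ..., "price": ...}), ...]}
    weights: non-negative importance of each criterion.
    Returns (total_cost, path), or None if the goal is unreachable.
    """
    def cost(attrs):
        return sum(weights[name] * attrs[name] for name in weights)

    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == goal:
            return total, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, attrs in graph.get(node, []):
            if nxt not in settled:
                heapq.heappush(queue, (total + cost(attrs), nxt, path + [nxt]))
    return None


# Hypothetical four-station network: one fast but expensive line, one slow but cheap.
METRO = {
    "A": [("B", {"time": 10, "price": 5}), ("C", {"time": 5, "price": 20})],
    "B": [("D", {"time": 10, "price": 5})],
    "C": [("D", {"time": 5, "price": 20})],
}

print(best_route(METRO, "A", "D", {"time": 1.0, "price": 0.0}))
print(best_route(METRO, "A", "D", {"time": 0.0, "price": 1.0}))
```

Changing the weights flips the preferred route, which is the trade-off a model has to reason about when it reads the route attributes from a table and the topology from a map image.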

2602.17186 2026-05-22 cs.CV

Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

Seulbi Lee, Sangheum Hwang

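AI summary This paper proposes selective training of large vision language models guided by the Visual Information Gain (VIG) metric, prioritizing high-VIG samples and tokens to improve visual grounding and reduce language bias.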

Comments Accepted at ICML 2026

English abstract

Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
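One natural reading of a perplexity-based information gain is the drop in log-perplexity when the image is supplied, which at the token level is the difference of log-probabilities with and without the image. The sketch below follows that reading; the paper's exact normalization may differ, and the function names, the threshold, and the example log-probabilities are all made up for illustration.

```python
import math


def perplexity(logprobs):
    """Perplexity of a token sequence from its natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))


def token_vig(lp_with_image, lp_text_only):
    """Per-token gain: how much the image raises each token's log-probability."""
    return [a - b for a, b in zip(lp_with_image, lp_text_only)]


def sample_vig(lp_with_image, lp_text_only):
    """Sample-level gain: reduction in log-perplexity due to the image."""
    return math.log(perplexity(lp_text_only)) - math.log(perplexity(lp_with_image))


def select_tokens(tokens, gains, threshold):
    """Keep only tokens whose gain exceeds the (assumed) threshold."""
    return [t for t, g in zip(tokens, gains) if g > threshold]


tokens = ["The", "car", "is", "red"]
with_image = [-0.1, -0.5, -0.1, -0.2]   # made-up log-probs given image + text
text_only = [-0.1, -0.6, -0.1, -2.2]    # made-up log-probs given text only

gains = token_vig(with_image, text_only)
print(select_tokens(tokens, gains, threshold=1.0))
print(sample_vig(with_image, text_only))
```

In this toy case only the colour word depends on the image, so a selective training loss restricted to high-gain tokens would concentrate on it and skip the tokens a language prior already predicts.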

2602.16169 2026-05-22 cs.LG cs.CL

Discrete Stochastic Localization for Non-autoregressive Generation

Yunshu Wu, Jiayi Cheng, Longxuan Yu, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

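AI summary This paper proposes Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal SNR. It improves distributional faithfulness in discrete sequence generation and demonstrates its effectiveness on OpenWebText.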

English abstract

Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from $T=128$ to $T=1024$, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as $T=48$ total steps -- without distillation or retraining.

2602.15338 2026-05-22 cs.LG cs.CL

Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin

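AI summary This paper proposes the Obj-Disco framework, which automatically decomposes an alignment reward signal into interpretable objectives to address the shortcomings of existing methods. It validates the framework's robustness across diverse tasks and models, and uncovers latent misaligned incentives.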

Comments ICML 2026

English abstract

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.

2602.13294 2026-05-22 cs.CV cs.AI

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen

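AI summary This paper proposes VisPhyWorld, a framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations, and introduces the VisPhyBench benchmark to assess how well models reconstruct appearance and reproduce physical motion. It finds that state-of-the-art MLLMs struggle to accurately infer physical parameters and to simulate consistent physical dynamics.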

English abstract

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of benchmark runs before fallback. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics. Our code is available at https://github.com/TIGER-AI-Lab/VisPhyWorld

2602.11574 2026-05-22 cs.AI

Learning to Configure Agentic AI Systems

Aditya Taparia, Som Sagar, Ransalu Senanayake

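AI summary This paper formulates agent configuration as a semi-Markov decision process (SMDP) and introduces ARC, a policy that dynamically selects query-specific agent configurations, improving reasoning accuracy, tool-use accuracy, and τ-Bench (Airline) Pass^1 success across multiple benchmarks.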

Comments 22 pages, 12 figures

English abstract

Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed templates or hand-tuned heuristics that apply the same configuration regardless of query difficulty, leading to brittle behavior and wasted compute. To address this, we formulate agent configuration as a semi-Markov decision process (SMDP) where each configuration acts as a temporally extended option that determines how an agent system processes a query, and introduce ARC (Agentic Resource & Configuration learner), a lightweight hierarchical policy that dynamically selects query-specific agent configurations. Across reasoning, tool-use, and agentic benchmarks, ARC consistently improves over budget-matched tool-augmented LLMs, increasing average reasoning accuracy by 31.3%, tool-use accuracy by 13.95%, and doubling τ-Bench (Airline) Pass^1 success from 9.0% to 18.0%. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.

2602.10062 2026-05-22 cs.LG cs.CV

Vendi Novelty Scores for Out-of-Distribution Detection

Amey P. Pasarkar, Adji Bousso Dieng

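AI summary This paper proposes the Vendi Novelty Score (VNS), built on the Vendi Scores, which approaches out-of-distribution detection from a diversity perspective. The method requires no density modeling, is linear-time and non-parametric, and achieves state-of-the-art OOD detection performance on multiple image classification benchmarks.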

English abstract

Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
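The idea of scoring novelty by how much a test sample raises a diversity score can be sketched with the order-2 member of the Vendi Score family, which has a closed form that avoids an eigendecomposition: for a similarity matrix $K$ with unit diagonal over $n$ points, the sum of squared eigenvalues of $K/n$ equals $\|K\|_F^2 / n^2$, so the order-2 score is $n^2 / \|K\|_F^2$. The sketch below is an assumption-laden illustration and not the paper's VNS: the cosine kernel, the choice of order 2, and the plain difference used as the novelty score are all choices made here, and this naive version costs quadratic time per query whereas the abstract describes a linear-time method.

```python
import math


def cosine_kernel(a, b):
    """Cosine similarity; equals 1 for identical directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vendi_score_q2(points):
    """Order-2 Vendi Score: n^2 divided by the squared Frobenius norm of K."""
    n = len(points)
    frob_sq = sum(cosine_kernel(a, b) ** 2 for a in points for b in points)
    return n * n / frob_sq


def novelty(in_distribution, x):
    """Assumed novelty score: increase in diversity after adding x."""
    return vendi_score_q2(in_distribution + [x]) - vendi_score_q2(in_distribution)


features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]   # made-up in-distribution features
print(novelty(features, [0.95, 0.05]))            # similar sample: small increase
print(novelty(features, [0.0, 1.0]))              # dissimilar sample: larger increase
```

The score behaves like an effective number of distinct items: two identical points give 1 and two orthogonal points give 2, so a test sample that looks like the training features barely moves it.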

2602.06264 2026-05-22 cs.LG

Swap Regret Minimization Through Response-Based Approachability

Ioannis Anagnostides, Gabriele Farina, Maxwell Fishelson, Haipeng Luo, Jon Schneider

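AI summary This paper develops a simpler and more efficient algorithm that guarantees O(d√T) linear swap regret for convex sets preconditioned via the John ellipsoid, establishes a matching information-theoretic lower bound, shows that a classic algorithm is existentially optimal for minimizing linear swap regret, and extends the approach to sets of swap deviations with polynomial dimension.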

Comments V3 makes certain clarifications and improves the upper bound for general sets via symmetrization

English abstract

We consider the problem of minimizing different notions of swap regret in online optimization. These forms of regret are tightly connected to correlated equilibrium concepts in games, and have been more recently shown to guarantee non-manipulability against strategic adversaries. The only computationally efficient algorithm for minimizing linear swap regret over a general convex set in $\mathbb{R}^d$ was developed recently by Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '25). However, it incurs a highly suboptimal regret bound of $\Omega(d^4 \sqrt{T})$ and also relies on computationally intensive calls to the ellipsoid algorithm at each iteration. In this paper, we develop a significantly simpler, computationally efficient algorithm that guarantees $O(d \sqrt{T})$ linear swap regret for a general convex set that has been preconditioned via the John ellipsoid. Our algorithm leverages the powerful response-based approachability framework of Bernstein and Shimkin (JMLR '15) -- previously overlooked in the line of work on swap regret minimization -- and simultaneously minimizes profile swap regret, which was recently shown to guarantee non-manipulability. Moreover, we establish a matching information-theoretic lower bound: any learner must incur in expectation $\Omega(d \sqrt{T})$ linear swap regret for large enough $T$, even when the set is centrally symmetric. This also shows that the classic algorithm of Gordon, Greenwald, and Marks (ICML '08) is existentially optimal for minimizing linear swap regret, although it is computationally inefficient. Finally, we extend our approach to minimize regret with respect to the set of swap deviations with polynomial dimension, unifying and strengthening recent results in equilibrium computation and online learning.

2602.05286 2026-05-22 cs.LG cs.AI

HealthMamba: An Uncertainty-aware Spatiotemporal Graph State Space Model for Effective and Reliable Healthcare Facility Visit Prediction

Dahai Yu, Lin Jiang, Rongchao Xu, Guang Wang

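AI summary This paper proposes HealthMamba, an uncertainty-aware spatiotemporal graph state space model for effective and reliable healthcare facility visit prediction. The model has three key components: a unified spatiotemporal context encoder, a new graph state space model called GraphMamba, and a comprehensive uncertainty quantification module. Experiments show that HealthMamba improves prediction accuracy by about 6.0% and uncertainty quantification by 3.5% over state-of-the-art baselines.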

Comments IJCAI 2026

English abstract

Healthcare facility visit prediction is essential for optimizing healthcare resource allocation and informing public health policy. Despite advanced machine learning methods being employed for better prediction performance, existing works usually formulate this task as a time-series forecasting problem without considering the intrinsic spatial dependencies of different types of healthcare facilities, and they also fail to provide reliable predictions under abnormal situations such as public emergencies. To advance existing research, we propose HealthMamba, an uncertainty-aware spatiotemporal framework for accurate and reliable healthcare facility visit prediction. HealthMamba comprises three key components: (i) a Unified Spatiotemporal Context Encoder that fuses heterogeneous static and dynamic information, (ii) a novel Graph State Space Model called GraphMamba for hierarchical spatiotemporal modeling, and (iii) a comprehensive uncertainty quantification module integrating three uncertainty quantification mechanisms for reliable prediction. We evaluate HealthMamba on four large-scale real-world datasets from California, New York, Texas, and Florida. Results show HealthMamba achieves around 6.0% improvement in prediction accuracy and 3.5% improvement in uncertainty quantification over state-of-the-art baselines.

2602.03205 2026-05-22 cs.RO

HUSKY: Humanoid Skateboarding System via Physics-Aware Whole-Body Control

HUSKY:通过物理感知的全身控制实现人形滑板系统

Jinrui Han, Dewei Wang, Chenyun Zhang, Xinzhe Liu, Ping Luo, Chenjia Bai, Xuelong Li

AI总结 本文提出HUSKY框架,通过整合人形-滑板系统建模和物理感知的全身控制,解决高动态和复杂交互任务中的稳定动态操控问题,实现在真实场景中稳定灵活的滑板操控。

Comments Accepted to RSS2026

详情
AI中文摘要

尽管当前的人形全身控制框架大多依赖静态环境假设,但解决具有高动态性和复杂交互的任务仍是一个巨大的挑战。本文研究人形机器人滑板运动,这是一项极具挑战性的任务,需要在欠驱动轮式平台上实现稳定的动态操控。这一整体系统受非完整约束和紧密耦合的人-物交互支配。成功执行此任务需要同时掌握混合接触动力学,并在机械耦合、动态不稳定的滑板上实现稳健的平衡控制。为克服上述挑战,我们提出了HUSKY,一个基于学习的框架,整合了人形-滑板系统建模和物理感知的全身控制。我们首先对板面倾斜与桥(truck)转向角之间的耦合关系进行建模,从而能够对系统动力学进行原理性分析。在此基础上,HUSKY利用对抗运动先验(AMP)学习类人的蹬地动作,并采用物理引导的、面向航向的策略来实现倾身转向(lean-to-steer)行为。此外,轨迹引导机制确保了蹬地与转向之间平滑而稳定的过渡。在Unitree G1人形平台上的实验结果表明,我们的框架能够在真实场景中实现稳定而灵活的滑板操控。项目页面位于 https://husky-humanoid.github.io/ 。

英文摘要

While current humanoid whole-body control frameworks predominantly rely on static-environment assumptions, tasks characterized by high dynamism and complex interactions remain a formidable challenge. In this paper, we address humanoid skateboarding, a highly challenging task requiring stable dynamic maneuvering on an underactuated wheeled platform. This integrated system is governed by non-holonomic constraints and tightly coupled human-object interactions. Successfully executing this task requires simultaneous mastery of hybrid contact dynamics and robust balance control on a mechanically coupled, dynamically unstable skateboard. To overcome these challenges, we propose HUSKY, a learning-based framework that integrates humanoid-skateboard system modeling and physics-aware whole-body control. We first model the coupling relationship between board tilt and truck steering angles, enabling a principled analysis of system dynamics. Building upon this, HUSKY leverages Adversarial Motion Priors (AMP) to learn human-like pushing motions and employs a physics-guided, heading-oriented strategy for lean-to-steer behaviors. Moreover, a trajectory-guided mechanism ensures smooth and stable transitions between pushing and steering. Experimental results on the Unitree G1 humanoid platform demonstrate that our framework enables stable and agile maneuvering on skateboards in real-world scenarios. The project page is available at https://husky-humanoid.github.io/.

2602.03067 2026-05-22 cs.LG cs.AI cs.NA math.NA

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

FlashSinkhorn: GPU上的IO感知熵最优传输

Felix X.-F. Ye, Xingjie Li, An Yu, Ming-Ching Chang, Linsong Chu, Davis Wertheimer

AI总结 本文提出FlashSinkhorn,一种基于GPU的熵最优传输求解器,将稳定化的对数域Sinkhorn更新改写为逐行的LogSumExp归约,其归一化方式与Transformer注意力相同,从而支持FlashAttention风格的融合与分块,在保持线性内存操作的同时显著降低HBM IO。

详情
AI中文摘要

熵最优传输(EOT)通过Sinkhorn迭代在现代机器学习中广泛应用,但GPU求解器在大规模情况下仍效率低下。张量化实现因密集的n×m交互而产生二次的HBM流量,而现有在线后端虽然避免存储密集矩阵,但仍依赖融合程度有限的通用分块map-reduce归约内核。我们提出FlashSinkhorn,一种针对平方欧几里得成本的IO感知EOT求解器,将稳定化的对数域Sinkhorn更新改写为带偏置点积得分的逐行LogSumExp归约,其归一化方式与Transformer注意力相同。这使得FlashAttention风格的融合和分块成为可能:融合的Triton内核将分块流式送入片上SRAM,并在单次遍历中更新对偶势,在保持线性内存操作的同时显著减少每次迭代的HBM IO。我们进一步提供了用于传输应用的流式内核,支持可扩展的一阶和二阶优化。在A100 GPU上,FlashSinkhorn在点云OT上相对最先进的在线基线实现了最高32倍的前向传递加速和161倍的端到端加速,并提高了基于OT的下游任务的可扩展性。为了可重复性,我们在 https://github.com/ot-triton-lab/flash-sinkhorn 发布了开源实现。

英文摘要

Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce reduction kernels with limited fusion. We present FlashSinkhorn, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT, and improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at https://github.com/ot-triton-lab/flash-sinkhorn.
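The rewrite mentioned in the abstract can be checked on a toy problem. The sketch below is not the paper's Triton kernel; it only illustrates, in plain Python, the algebraic identity that makes a log-domain Sinkhorn update for squared Euclidean cost look like an attention-style row-wise LogSumExp. It assumes the common update $f_i = -\varepsilon \log \sum_j b_j \exp((g_j - \|x_i - y_j\|^2)/\varepsilon)$; all function names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def update_f_direct(xs, ys, g, log_b, eps):
    """Log-domain update using the full squared-distance cost."""
    return [
        -eps * logsumexp(
            [(g[j] - sq_dist(x, ys[j])) / eps + log_b[j] for j in range(len(ys))]
        )
        for x in xs
    ]


def update_f_attention_form(xs, ys, g, log_b, eps):
    """Same update written as a row-wise LogSumExp of biased dot products.

    Expanding ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y moves ||y_j||^2 into a
    per-column bias and ||x_i||^2 outside the reduction.
    """
    bias = [(g[j] - dot(ys[j], ys[j])) / eps + log_b[j] for j in range(len(ys))]
    out = []
    for x in xs:
        scores = [2.0 * dot(x, ys[j]) / eps + bias[j] for j in range(len(ys))]
        out.append(dot(x, x) - eps * logsumexp(scores))
    return out


if __name__ == "__main__":
    xs = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0]]
    ys = [[0.2, 0.1], [1.5, -1.0]]
    g = [0.3, -0.1]
    log_b = [math.log(0.5)] * 2
    a = update_f_direct(xs, ys, g, log_b, eps=0.5)
    b = update_f_attention_form(xs, ys, g, log_b, eps=0.5)
    print(max(abs(u - v) for u, v in zip(a, b)))
```

Because the second form only needs dot products plus a per-column bias, the score matrix never has to be materialized in full, which is the property that tiled, fused kernels exploit.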

2602.01935 2026-05-22 cs.LG cs.AI cs.PL

LiteCoOp: Lightweight Multi-LLM Shared-Tree Reasoning for Model-Serving Compiler Optimizations

LiteCoOp: 面向模型服务编译器优化的轻量级多LLM共享树推理

Annabelle Sujun Tang, Christopher Priebe, Lianhui Qin, Hadi Esmaeilzadeh

AI总结 本文提出LiteCoOp,一种轻量级框架,将优化搜索树本身作为多LLM协作机制,使异构LLM在编译器优化过程中协作,从而在降低编译成本的同时提升性能。

详情
AI中文摘要

LLM引导的编译器优化最近展现出潜力,但现有方法在整个搜索过程中依赖单一的大型LLM,成本高昂,也把较小的模型排除在外。我们提出如下研究问题:异构LLM能否在编译器优化过程中协作,同时使编译成本低于由单一大型LLM引导的优化?关键在于,这必须在不引入代理框架开销的情况下实现,否则会与降低编译成本的目标相悖。为实现这些相互竞争的目标,我们引入了LiteCoOp,一种轻量级框架,将优化搜索树本身作为多LLM协作的机制,使异构模型能够共享进展而无需外部代理协调。在每个优化步骤中,LiteCoOp查询一个LLM,由它提出编译器变换并选择下一步要查询的LLM。这些提案被记录在共享的MCTS树中,因此所有模型虽然被串行调用,却能参考彼此的决策。共享的MCTS回传奖励,使一个模型取得的进展能够影响其他模型后续的决策。这使得MCTS树本身成为协作推理的机制,避免了模型间通信、冗长的推理轨迹或代理基础设施。我们通过LLM感知的UCT将这一想法实例化,它使模型选择偏向较小的LLM以降低成本,同时保持编译器性能目标。在多样化的GPU(CPU)基准测试中,LiteCoOp始终优于单模型基线,并在将协作扩展到八个异构LLM时取得最佳结果。八模型配置将总编译时间减少1.95倍(1.74倍),将API成本减少4.47倍(4.32倍),仅在23.1%(23.9%)的调用中使用最大的模型,同时展示了协作的可扩展性。

英文摘要

LLM-guided compiler optimization has recently shown promise, but existing approaches rely on a single large LLM throughout search, making them expensive and excluding smaller models. We pose the research question: can heterogeneous LLMs collaborate during compiler optimization while reducing compilation cost below that of optimization guided by a single large LLM? Crucially, this must be achieved without introducing overhead from agentic frameworks, which would run counter to the goal of lower compilation cost. To achieve these competing objectives, we introduce LiteCoOp, a lightweight framework that turns the optimization search tree itself into the mechanism for multi-LLM collaboration, enabling heterogeneous models to share progress without external agentic coordination. At each optimization step, LiteCoOp queries one LLM to both propose a compiler transformation and select the LLM to query at the next step. These LLM proposals are recorded in a shared MCTS tree, so all models are invoked serially and yet are informed by each other's decisions. The shared MCTS backpropagates the rewards, allowing progress made by one model to influence later decisions by others. This makes the MCTS tree the collaborative reasoning mechanism itself, avoiding inter-model communication, heavy reasoning traces, or agentic infrastructure. We instantiate this idea with an LLM-aware UCT that biases model selection toward smaller LLMs to reduce cost while still preserving the compiler performance objective. Across diverse GPU and (CPU) benchmarks, LiteCoOp consistently outperforms single-model baselines, with the best results obtained when scaling collaboration to eight heterogeneous LLMs. This eight-model configuration reduces total compilation time by 1.95x (1.74x), reduces API cost by 4.47x (4.32x), and invokes the largest model for only 23.1% (23.9%) of total calls while demonstrating collaboration scalability.
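The abstract does not give the formula for the LLM-aware UCT, so the following is only a guess at the general shape: the standard UCT score with a hypothetical linear cost penalty that tilts selection toward cheaper models. The names `llm_aware_uct`, `cost_weight`, and the `cost` field are assumptions made for illustration, not the paper's definitions.

```python
import math


def llm_aware_uct(total_reward, visits, parent_visits, model_cost,
                  exploration=1.4, cost_weight=0.1):
    """Standard UCT score minus a hypothetical penalty on model cost."""
    if visits == 0:
        return math.inf  # always try an unvisited child first
    exploit = total_reward / visits
    explore = exploration * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore - cost_weight * model_cost


def select_child(children, parent_visits):
    """Pick the child node (a dict) with the highest cost-aware score."""
    return max(
        children,
        key=lambda c: llm_aware_uct(
            c["reward"], c["visits"], parent_visits, c["cost"]
        ),
    )


if __name__ == "__main__":
    children = [
        {"name": "large", "reward": 5.0, "visits": 10, "cost": 1.0},
        {"name": "small", "reward": 5.0, "visits": 10, "cost": 0.1},
    ]
    print(select_child(children, parent_visits=20)["name"])
```

With equal reward statistics the cheaper model wins the tie, while a clearly better reward estimate can still outweigh the penalty; how the actual method balances the two is not stated in the abstract.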

2602.01851 2026-05-22 cs.CV

How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

模型能多好地遵循视觉指令?VIBE:一个系统性的视觉指令驱动图像编辑基准

Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan

AI总结 本文提出VIBE基准,用于评估视觉指令驱动的图像编辑模型,通过三级交互层次评估指示性定位(deictic grounding)、形态操作和因果推理,并发现专有模型已具备初步的视觉指令遵循能力且始终优于开源模型,但性能随任务难度增加而明显下降。

Comments https://vibe-benchmark.github.io/

详情
AI中文摘要

最近的生成模型在图像编辑方面取得了显著进展。然而,现有系统和基准仍然主要是文本引导的。相比之下,人类交流本质上是多模态的,草图等视觉指令能高效传达空间和结构意图。为填补这一差距,我们引入VIBE,即面向图像编辑的视觉指令基准,其三级交互层次涵盖指示性定位(deictic grounding)、形态操作和因果推理。在这些层次中,我们精心构建了高质量且多样的测试用例,反映了视觉指令遵循中逐步增加的复杂性。我们进一步提出一个稳健的LMM-as-a-judge评估框架,配有任务特定的指标,以实现可扩展且细粒度的评估。通过全面评估17个有代表性的开源和专有图像编辑模型,我们发现专有模型已展现出初步的视觉指令遵循能力,并且始终优于开源模型。然而,随着任务难度的增加,即使是最强的系统性能也会显著下降,这指明了未来研究的有希望方向。

英文摘要

Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.

2602.01760 2026-05-22 cs.CV

MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement

MagicFuse: 单图像融合用于视觉与语义增强

Hao Zhang, Yanping Zha, Zizhuo Li, Meiqi Gong, Jiayi Ma

AI总结 本文提出MagicFuse单图像融合框架,通过扩散模型生成跨光谱场景表示,实现视觉与语义的双重约束,实验表明其性能优于多模态融合方法。

Comments Accepted by CVPR 2026

详情
AI中文摘要

本文聚焦于一个高度实用的场景:在恶劣条件下仅有可见光成像传感器可用时,如何继续受益于多模态图像融合的优势。为此,我们提出了一种新的单图像融合概念,将传统的数据层融合扩展到知识层。具体而言,我们开发了MagicFuse,一种新的单图像融合框架,能够从单张低质量可见光图像中推导出全面的跨光谱场景表示。MagicFuse首先引入了基于扩散模型的谱内知识增强分支和跨光谱知识生成分支。它们分别挖掘在可见光谱中被掩盖的场景信息,并学习迁移到红外光谱的热辐射分布模式。在此基础上,我们设计了一个多域知识融合分支,整合这两个分支扩散流中的概率噪声,从而通过连续采样获得跨光谱场景表示。然后,我们施加视觉和语义约束,确保该场景表示既能满足人类观察,又能支持下游语义决策。大量实验表明,尽管仅依赖单张退化的可见光图像,MagicFuse在视觉和语义表示性能上与采用多模态输入的最先进融合方法相当甚至更优。代码已公开于 https://github.com/zhayanping/MagicFuse 。

英文摘要

This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image. The code is publicly available at https://github.com/zhayanping/MagicFuse.

2602.01279 2026-05-22 cs.LG

Richer Bayesian Last Layers with Subsampled NTK Features

更丰富的贝叶斯最后层与子采样NTK特征

Sergio Calvo-Ordoñez, Jonathan Plenk, Richard Bergna, Álvaro Cartea, Yarin Gal, Jose Miguel Hernández-Lobato, Kamil Ciosek

AI总结 本文提出了一种改进贝叶斯最后层的方法,通过将神经切线核特征投影到由最后层特征张成的空间中,以更准确地估计不确定性,同时保持计算效率。

Comments Appearing in the Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

详情
AI中文摘要

贝叶斯最后层(BLLs)提供了一种方便且计算高效的神经网络不确定性估计方法。然而,由于只对最终层进行贝叶斯处理,忽略了前面各层引入的不确定性,它们会低估认知(epistemic)不确定性。我们提出了一种改进BLL的方法,将神经切线核(NTK)特征投影到由最后层特征张成的空间上,从而在保持标准BLL推理低计算成本的同时,实现考虑整个网络变异性的后验推断。我们证明了该方法得到的后验方差可证明地大于或等于标准BLL的后验方差,纠正了其低估认知不确定性的倾向。为进一步降低计算成本,我们引入了均匀子采样方案,用于估计投影矩阵和进行后验推断,并为这两类子采样推导了近似界。在图像和表格数据集上的UCI回归、上下文赌博机(contextual bandits)、图像分类和分布外检测任务中的实证评估表明,与标准BLL和有竞争力的基线相比,该方法改进了校准和不确定性估计,同时降低了计算成本。

英文摘要

Bayesian Last Layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, they underestimate epistemic uncertainty because they apply a Bayesian treatment only to the final layer, ignoring uncertainty induced by earlier layers. We propose a method that improves BLLs by leveraging a projection of Neural Tangent Kernel (NTK) features onto the space spanned by the last-layer features. This enables posterior inference that accounts for variability of the full network while retaining the low inference cost of a standard BLL. We show that our method yields posterior variances that are provably greater than or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference. We derive approximation bounds for both types of subsampling. Empirical evaluations on UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks on image and tabular datasets demonstrate improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines, while reducing computational cost.

2602.00851 2026-05-22 cs.AI cs.MA

Understanding Persuasion in Long-Running Agents

理解长期运行代理中的说服

Hyejun Jeong, Amir Houmansadr, Shlomo Zilberstein, Eugene Bagdasarian

AI总结 本文研究了执行长期任务的代理受到用户说服后的行为变化,提出了一种以行为为中心的评估框架,发现即时说服的影响弱且不一致,而信念预填充的代理比中性预填充的代理搜索次数更少、访问的唯一来源更少,表明说服会影响代理行为。

Comments Code available at https://github.com/HyejunJeong/persuasion-propagation

详情
AI中文摘要

现代AI代理越来越多地结合对话交互与自主任务执行,例如编码和网络研究,这引发了一个自然问题:当一个从事长期任务的代理受到用户说服时会发生什么?然而研究这一可能性具有挑战性,因为长期运行的代理行为具有噪声且难以重复,而且不清楚只有在扩展任务执行中才会出现哪些独特挑战。我们研究了信念层面干预如何影响下游任务行为,这种现象我们称之为说服传播。我们介绍了一种以行为为中心的评估框架,区分在任务执行期间或之前应用的说服。在网页研究和编码任务中,我们发现即时说服导致的行为影响弱且不一致。相反,当在任务时间显式指定信念状态时,信念预填充的代理平均进行26.9%更少的搜索,并访问16.9%更少的唯一来源,比中性预填充的代理。这些结果表明,即使在之前的交互中,说服也会影响代理的行为,从而推动对代理系统的行为层面评估。

英文摘要

Modern AI agents increasingly combine conversational interaction with autonomous task execution, such as coding and web research, raising a natural question: What happens when an agent engaged in long-horizon tasks is exposed to user persuasion? Yet studying this possibility is challenging because long-running agent behavior is noisy and costly to reproduce, and it remains unclear which unique challenges emerge only in extended task execution. We study how belief-level intervention can influence downstream task behavior, a phenomenon we name persuasion propagation. We introduce a behavior-centered evaluation framework that distinguishes between persuasion applied during or prior to task execution. Across web research and coding tasks, we find that on-the-fly persuasion induces weak and inconsistent behavioral effects. In contrast, when the belief state is explicitly specified at task time, belief-prefilled agents conduct on average 26.9% fewer searches and visit 16.9% fewer unique sources than neutral-prefilled agents. These results suggest that persuasion, even in prior interaction, can affect the agent's behavior, motivating behavior-level evaluation in agentic systems.

2601.11079 2026-05-22 cs.LG

Soft Bayesian Context Tree Models for Real-Valued Time Series

针对实值时间序列的软贝叶斯上下文树模型

Shota Saito, Yuta Nakahara, Toshiyasu Matsushima

AI总结 本文提出了一种新的软贝叶斯上下文树模型(Soft-BCT),用于实值时间序列。该模型采用概率性分裂上下文空间,而非传统上下文树模型中确定性的上下文空间分裂。基于变分推断提出学习算法,实验结果表明Soft-BCT在某些数据集上优于传统上下文树模型。

详情
AI中文摘要

本文提出软贝叶斯上下文树模型(Soft-BCT),这是一种新的实值时间序列的上下文树模型。Soft-BCT考虑了上下文空间的软(概率)分裂,而不是传统上下文树模型中上下文空间的硬(确定性)分裂。基于变分推断提出Soft-BCT的学习算法。实验结果表明,Soft-BCT在某些数据集上优于传统上下文树模型。

英文摘要

This paper proposes the soft Bayesian context tree model (Soft-BCT), a novel BCT model for real-valued time series. The Soft-BCT uses soft (probabilistic) splits of the context space instead of the hard (deterministic) splits used in the previous BCT for real-valued time series. A learning algorithm for the Soft-BCT is proposed based on variational inference. Experimental results show that the Soft-BCT outperforms the previous BCT on some datasets.

2601.10348 2026-05-22 cs.CL cs.AI cs.LG

Training-Trajectory-Aware Token Selection

基于训练轨迹的token选择

Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye, Zenan Huang, Yihong Zhuang, Guoshan Lu, Junlin Zhou, Junbo Zhao

AI总结 本文提出T3S方法,通过在token层面重构训练目标,清除未学习token的优化路径,从而在连续蒸馏中提升性能,实验表明在AR和dLLM设置中均取得显著效果。

Comments Accepted by ICML 2026

详情
AI中文摘要

高效的蒸馏是将昂贵的推理能力转化为可部署效率的关键途径,然而在前沿领域中,当学生模型已具备较强的推理能力时,朴素的连续蒸馏往往产生有限的收益甚至退化。我们观察到一种训练特征现象:即使损失单调下降,所有性能指标在几乎相同的瓶颈处会突然大幅下降,然后逐渐恢复。我们进一步揭示了token层面的机制:置信度会分裂成稳步增加的模仿锚点token,快速锚定优化,以及尚未学习的token,其置信度被抑制直到瓶颈之后。这两种类型token无法共存的特性是连续蒸馏失败的根本原因。为此,我们提出了基于训练轨迹的token选择(T3S)方法,以在token层面重建训练目标,清除未学习token的优化路径。T3S在AR和dLLM设置中均取得一致的收益:仅用数百个示例,Qwen3-8B在竞争性推理基准上超越DeepSeek-R1,Qwen3-32B接近Qwen3-235B,且T3训练的LLaDA-2.0-Mini超越其AR基线,达到所有16B级模型中的最先进性能。

英文摘要

Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The inability of these two token types to coexist is the root cause of the failure of continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.

2512.16739 2026-05-22 cs.AI

AI-Driven Prediction of Cancer Pain Episodes: A Hybrid Decision Support Approach

基于AI的癌症疼痛发作预测:一种混合决策支持方法

Yipeng Zhuang, Yifeng Guo, Yuewen Li, Yuheng Wu, Philip Leung-Ho Yu, Tingting Song, Zhiyong Wang, Kunzhong Zhou, Weifang Wang, Li Zhuang

AI总结 本研究提出了一种混合机器学习和大语言模型的方法,利用结构化和非结构化电子健康记录数据预测癌症患者在住院48和72小时内疼痛发作,通过整合时间序列药物趋势和模糊剂量记录,提高了敏感性和可解释性,实现了87.6%和91.7%的准确率。

详情
AI中文摘要

肺癌患者经常经历突破性疼痛发作,高达91%的患者需要及时干预。为了实现主动疼痛管理,我们提出了一种混合机器学习和大语言模型的管道,利用结构化和非结构化的电子健康记录数据预测住院48和72小时内的疼痛发作。分析了266名住院患者的历史队列,特征包括人口统计学数据、肿瘤分期、生命体征和WHO分级镇痛药使用情况。机器学习模块捕捉时间序列药物趋势,而大语言模型解释模糊的剂量记录和自由文本临床笔记。整合这些模态提高了灵敏度和可解释性。我们的框架在48小时和72小时的准确率分别为0.876和0.917,灵敏度分别提高了10.6%和10.7%,归因于大语言模型的增强。这种混合方法提供了一种临床可解释且可扩展的工具,用于早期疼痛发作预测,有望提高治疗精准度并优化肿瘤学护理中的资源分配。

英文摘要

Lung cancer patients frequently experience breakthrough pain episodes, with up to 91% requiring timely intervention. To enable proactive pain management, we propose a hybrid machine learning and large language model pipeline that predicts pain episodes within 48 and 72 hours of hospitalization using both structured and unstructured electronic health record data. A retrospective cohort of 266 inpatients was analyzed, with features including demographics, tumor stage, vital signs, and WHO-tiered analgesic use. The machine learning module captured temporal medication trends, while the large language model interpreted ambiguous dosing records and free-text clinical notes. Integrating these modalities improved sensitivity and interpretability. Our framework achieved an accuracy of 0.876 (48h) and 0.917 (72h), with improvements in sensitivity of 10.6% and 10.7%, respectively, attributable to large language model augmentation. This hybrid approach offers a clinically interpretable and scalable tool for early pain episode forecasting, with potential to enhance treatment precision and optimize resource allocation in oncology care.

2512.12744 2026-05-22 cs.LG

Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity

静息神经元,主动洞察:通过自发性增强LLM中的激活稀疏性

Haotian Xu, Jiannan Yang, Tian Gao, Tsui-Wei Weng, Tengfei Ma

AI总结 本文提出了一种通过引入自发神经元(SPON)来增强LLM中激活稀疏性的方法,解决了高稀疏率下模型精度下降的问题,通过分布匹配训练SPON,使模型在稀疏计算中保持稳定和泛化能力。

Comments ICML 2026

详情
AI中文摘要

激活稀疏性提供了一种有吸引力的途径来加速大型语言模型(LLM)的推理过程,通过选择性地抑制隐藏激活。然而,现有方法在高稀疏率下表现出严重的准确性下降。我们发现,这种失败源于表征不稳定:*激活稀疏性破坏了预训练期间学习的输入依赖激活,导致隐藏状态的分布偏移。*我们通过将激活稀疏性重新定义为表征对齐问题,并引入**自发神经元(SPON)**,一种受生物系统中自发神经活动启发的轻量机制。SPON注入一组小的可学习、输入无关的激活向量,作为稀疏计算中的持久表征锚点。这些向量通过分布匹配训练与密集模型匹配,并在训练后可吸收进偏置项中,带来极小的推理开销。在多个LLM架构上,SPON一致地恢复了性能,稳定了潜在表征,并保持了泛化能力。我们的结果确立了SPON作为可靠激活稀疏推理的有效且原则性解决方案,并为LLM的知识保留提供了新的见解。

英文摘要

Activation sparsity offers a compelling route to accelerate large language model (LLM) inference by selectively suppressing hidden activations, yet existing approaches exhibit severe accuracy degradation at high sparsity. We show that this failure stems from representational instability: *activation sparsity disrupts input-dependent activation learned during pretraining, inducing distribution shifts in hidden states.* We address this issue by reframing activation sparsity as a representational alignment problem and introducing **Spontaneous Neurons (SPON)**, a lightweight mechanism inspired by spontaneous neural activity in biological systems. SPON injects a small set of learnable, input-independent activation vectors that act as persistent representational anchors for sparse computation. These vectors are trained via distribution matching to the dense model and can be absorbed into bias terms after training, incurring negligible inference overhead. Across multiple LLM backbones, SPON consistently restores performance, stabilizes latent representations, and preserves generalization. Our results establish SPON as an effective and principled solution for reliable activation-sparse inference, and offer new insights into knowledge retention in LLMs.

2512.11587 2026-05-22 cs.LG cs.NA math.NA math.OC

Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration

梯度下降作为感知机算法:理解动态与隐式加速

Alexander Tyurin

AI总结 本文研究了梯度下降在神经网络训练中的优化动态和隐式加速现象,通过非线性模型分析显示梯度下降步骤等价于广义感知机算法,揭示了非线性模型在迭代复杂度上的优势。

详情
AI中文摘要

即使对于应用于神经网络训练的梯度下降(GD)方法,理解其优化动态(包括收敛速度、迭代轨迹、函数值振荡,尤其是隐式加速现象)仍然是一个具有挑战性的问题。我们分析了采用逻辑损失的非线性模型,并证明GD的步骤可归约为广义感知机算法(Rosenblatt, 1958)的步骤,从而为其动态提供了新的视角。这种归约得到了显著更简单的算法步骤,我们使用经典线性代数工具对其进行分析。借助这些工具,我们在一个极简示例上证明,双层模型中的非线性可以被证明带来更快的迭代复杂度$\tilde{O}(\sqrt{d})$,而线性模型为$\Omega(d)$,其中$d$是特征数量。这有助于解释神经网络中观察到的优化动态和隐式加速现象。理论结果得到了大量数值实验的支持。我们相信这种替代视角将进一步推动神经网络优化的研究。

英文摘要

Even for the gradient descent (GD) method applied to neural network training, understanding its optimization dynamics, including convergence rate, iterate trajectories, function value oscillations, and especially its implicit acceleration, remains a challenging problem. We analyze nonlinear models with the logistic loss and show that the steps of GD reduce to those of generalized perceptron algorithms (Rosenblatt, 1958), providing a new perspective on the dynamics. This reduction yields significantly simpler algorithmic steps, which we analyze using classical linear algebra tools. Using these tools, we demonstrate on a minimalistic example that the nonlinearity in a two-layer model can provably yield a faster iteration complexity $\tilde{O}(\sqrt{d})$ compared to $\Omega(d)$ achieved by linear models, where $d$ is the number of features. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks. The theoretical results are supported by extensive numerical experiments. We believe that this alternative view will further advance research on the optimization of neural networks.
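The link between GD on the logistic loss and perceptron-style updates can be seen already in the simplest linear case. The sketch below is not the paper's generalized perceptron or its two-layer analysis; it only shows the textbook fact that one GD step on $\log(1 + e^{-y\,w\cdot x})$ is a perceptron update scaled by $\sigma(-y\,w\cdot x)$, which approaches the classical perceptron step when the example is badly misclassified. Function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def margin(w, x, y):
    return y * sum(wi * xi for wi, xi in zip(w, x))


def logistic_loss(w, x, y):
    return math.log1p(math.exp(-margin(w, x, y)))


def gd_step(w, x, y, lr):
    """One gradient-descent step on the logistic loss of a linear model.

    The gradient is -sigmoid(-margin) * y * x, so the step adds a
    softly weighted copy of y * x to w.
    """
    weight = sigmoid(-margin(w, x, y))
    return [wi + lr * weight * y * xi for wi, xi in zip(w, x)]


def perceptron_step(w, x, y, lr):
    """Classical perceptron: add y * x only when the example is misclassified."""
    if margin(w, x, y) <= 0:
        return [wi + lr * y * xi for wi, xi in zip(w, x)]
    return list(w)


if __name__ == "__main__":
    w, x, y = [-10.0, 0.0], [1.0, 2.0], 1
    print(gd_step(w, x, y, lr=0.1))
    print(perceptron_step(w, x, y, lr=0.1))
```

For a strongly negative margin the two updates nearly coincide, while for a confidently correct example the GD step shrinks toward zero and the perceptron does nothing.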

2512.10719 2026-05-22 cs.CV

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

SpaceDrive: 在基于视觉语言模型的自动驾驶中引入空间感知

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell

AI总结 本文提出SpaceDrive框架,通过将空间信息作为显式位置编码来增强基于VLM的自动驾驶系统对细粒度3D空间关系的理解,从而提升规划精度和开环性能。

详情
AI中文摘要

基于视觉语言模型(VLM)的端到端自动驾驶方法因具备通用的视觉理解和强大的推理能力而迅速发展。然而,我们发现当前VLM难以理解细粒度的3D空间关系,而这是与物理世界交互的系统的基本要求。为了解决这一问题,我们提出了SpaceDrive,一个具备空间感知的基于VLM的驾驶框架,将空间信息作为显式位置编码(PEs)而非文本数字标记,从而实现对语义和空间表示的联合推理。SpaceDrive采用通用的位置编码器处理从多视角深度估计、历史自车状态和文本提示中得到的所有3D坐标。这些3D PE首先叠加到相应的2D视觉标记上以增强它们;同时,它们作为任务无关的坐标表示,取代逐位的数字标记作为VLM的输入和输出。这种机制使模型能够在空间推理中更好地索引特定的视觉语义,并直接回归轨迹坐标而非逐位生成,从而提升规划精度。大量实验验证了SpaceDrive在nuScenes数据集上实现了最先进的开环性能,并在Bench2Drive闭环基准上取得了78.02的Driving Score,在现有基于VLM的方法中排名第二。代码可在 https://github.com/zhenghao2519/SpaceDrive 获取。

英文摘要

End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark among existing VLM-based methods. Code is available at: https://github.com/zhenghao2519/SpaceDrive.

2512.02193 2026-05-22 cs.AI

From monoliths to modules: Decomposing transducers for efficient world modelling

从整体到模块:分解转换器以实现高效的世界建模

Alexander Boyd, Franz Nowak, David Hyland, Manuel Baltieri, Fernando E. Rosas

AI总结 本文提出了一种分解复杂世界模型的方法,在转换器(transducer)框架下将世界模型分解为多个模块,从而提高计算效率并支持分布式推理,为AI安全和现实应用奠定基础。

详情
AI中文摘要

世界模型最近被提出作为AI代理在部署前进行训练和评估的沙盒环境。尽管逼真的世界模型通常计算需求很高,但现实世界场景往往包含以模块化方式交互的子组件,利用这一事实通常可以缓解该问题。在本文中,我们通过开发一个框架来探索这一想法,该框架用于分解由转换器(transducer)表示的复杂世界模型,转换器是一类推广了POMDP的模型。尽管转换器的组合已被充分理解,我们的结果阐明了如何反转这一过程:推导出在不同输入-输出子空间上运行的子转换器,从而为整体式世界建模提供可并行化、可解释的替代方案,并支持分布式推理。总体而言,这些结果为连接现实世界推理所需的计算效率与AI安全所要求的结构透明性奠定了基础。

英文摘要

World models have been recently proposed as sandbox environments in which AI agents can be trained and evaluated before deployment. While realistic world models often have high computational demands, this can often be alleviated by exploiting the fact that real-world scenarios tend to involve subcomponents that interact in a modular manner. In this paper, we explore this idea by developing a framework for decomposing complex world models represented by transducers, a class of models generalising POMDPs. Whereas the composition of transducers is well understood, our results clarify how to invert this process by deriving sub-transducers operating on distinct input-output subspaces, enabling parallelizable and interpretable alternatives to monolithic world modelling that can support distributed inference. Overall, these results lay groundwork for bridging the computational efficiency required for real-world inference and the structural transparency demanded by AI safety.

2511.18159 2026-05-22 cs.LG

Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models

为扩散模型带来稳定性:分解和减少训练掩码扩散模型的方差

Mengni Jia, Mengyu Zhou, Yihao Liu, Xiaoxi Jiang, Guanjun Jiang

AI总结 本文研究了掩码扩散模型(MDMs)训练方差高导致不稳定的问题,通过分解方差来源并提出六种方差减少方法,显著提升了模型在复杂推理任务中的准确率,并将运行间变异性降低至自回归模型(ARMs)水平。

详情
AI中文摘要

Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. There has been no theoretical explanation or systematic solution. We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise, while ARMs are only affected by (C). This explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t sampler that minimizes training variance by sampling harder t values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce (A). Experiments show that compared to standard MDM training, our methods improve accuracy by 7-8% on complex reasoning tasks, while simultaneously reducing run-to-run variability to near ARM levels, substantially narrowing the gap with strong ARM baselines; in most settings, even the best baseline runs remain below the worst run of our method.

英文摘要

Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. There has been no theoretical explanation or systematic solution. We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise, while ARMs are only affected by (C). This explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t sampler that minimizes training variance by sampling harder t values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce (A). Experiments show that compared to standard MDM training, our methods improve accuracy by 7-8% on complex reasoning tasks, while simultaneously reducing run-to-run variability to near ARM levels, substantially narrowing the gap with strong ARM baselines; in most settings, even the best baseline runs remain below the worst run of our method.
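The abstract says only that MIRROR uses negatively correlated samples; its construction for masking patterns is not described here. As a generic illustration of why negative correlation lowers estimator variance, the sketch below compares independent pairs with classical antithetic pairs $(u, 1-u)$ when estimating $\mathbb{E}[f(U)]$ for uniform $U$. This is the textbook antithetic-variates technique, swapped in as a stand-in, and the function names are hypothetical.

```python
import random
import statistics


def independent_pair_estimates(f, n_pairs, rng):
    """Average f over two independent uniform draws, repeated n_pairs times."""
    return [0.5 * (f(rng.random()) + f(rng.random())) for _ in range(n_pairs)]


def antithetic_pair_estimates(f, n_pairs, rng):
    """Average f over a draw u and its mirror 1 - u (negatively correlated)."""
    estimates = []
    for _ in range(n_pairs):
        u = rng.random()
        estimates.append(0.5 * (f(u) + f(1.0 - u)))
    return estimates


if __name__ == "__main__":
    rng = random.Random(0)

    def square(u):
        return u * u

    indep = independent_pair_estimates(square, 5000, rng)
    anti = antithetic_pair_estimates(square, 5000, rng)
    print(statistics.variance(indep), statistics.variance(anti))
```

Both estimators use two function evaluations per estimate and target the same mean; for a monotone integrand such as this one, the mirrored pair has noticeably lower variance.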