arXivDaily: arXiv daily academic digest, updated Monday through Friday

2604.12325 2026-05-22 cs.LG cs.AI

Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks

Azza Fadhel, The Hung Tran, Trong Nghia Hoang, Jana Doppa

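Affiliations: School of EECS, Washington State University, Pullman, WA, USA

AI summary: This paper proposes OptBias, a meta-learning framework that generates synthetic tasks to tackle black-box optimization from small offline datasets. It learns a reusable optimization bias to improve performance in small-data regimes.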

Comments Accepted for Publication at International Conference on Artificial Intelligence and Statistics (AISTATS)

Abstract

We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of existing algorithms. Prior work has shown, both theoretically and empirically, that the performance of offline optimization algorithms depends on how well the surrogate model captures the optimization bias (i.e., the ability to rank input designs correctly), which is challenging to accomplish with limited experimental data. This paper proposes Surrogate Learning with Optimization Bias via Synthetic Task Generation (OptBias), a meta-learning framework that directly tackles data scarcity. OptBias learns a reusable optimization bias by training on synthetic tasks generated from a Gaussian process, and then fine-tunes the surrogate model on the small data for the target task. Across diverse continuous and discrete offline optimization benchmarks, OptBias consistently outperforms state-of-the-art baselines in small data regimes. These results highlight OptBias as a robust and practical solution for offline optimization in realistic small data settings.
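
To make the two ingredients named in the abstract concrete, here is a minimal sketch of (a) drawing one synthetic task from a Gaussian-process prior and (b) scoring how well a surrogate ranks designs. The RBF kernel, the length scale, the uniform input range, and the function names `sample_gp_task` and `pairwise_ranking_accuracy` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def sample_gp_task(n_points, dim, lengthscale=1.0, seed=0):
    """Draw one synthetic task: inputs X and values y ~ GP(0, RBF kernel)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_points, dim))
    # Squared Euclidean distance between every pair of inputs.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sq_dist / lengthscale**2)
    # A small jitter keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(K + 1e-6 * np.eye(n_points))
    y = chol @ rng.standard_normal(n_points)
    return X, y


def pairwise_ranking_accuracy(y_true, y_pred):
    """Fraction of design pairs whose order the surrogate predicts correctly."""
    i, j = np.triu_indices(len(y_true), k=1)
    agree = np.sign(y_true[i] - y_true[j]) == np.sign(y_pred[i] - y_pred[j])
    return float(np.mean(agree))
```

A meta-learning loop would presumably draw many such tasks with different seeds and kernel settings before fine-tuning on the small target dataset; that loop is omitted here.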

2604.08872 2026-05-22 cs.LG cond-mat.dis-nn cond-mat.stat-mech

How does Chain of Thought decompose complex tasks?

Amrut Nadgir, Vijay Balasubramanian, Pratik Chaudhari

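Affiliations: University of Pennsylvania

AI summary: This paper studies how chain of thought decomposes complex tasks. It shows that splitting a task into a sequence of smaller classification problems can substantially reduce prediction error, and it identifies a critical threshold for the degree above which an optimal decomposition depth exists.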

Abstract

Many language tasks can be modeled as classification problems where a large language model (LLM) is given a prompt and selects one among many possible answers. We show that the classification error in such problems scales as a power law in the number of classes. This has a dramatic consequence: the prediction error can be reduced substantially by splitting the overall task into a sequence of smaller classification problems, each with the same number of classes ("degree"). This tree-structured decomposition models chain-of-thought (CoT). It has been observed that CoT-based predictors perform better when they "think", i.e., when they develop a deeper tree, thus decomposing the problem into a larger number of steps. We identify a critical threshold for the degree, below which thinking is detrimental, and above which there exists an optimal depth that minimizes the error. It is impossible to surpass this minimal error by increasing the depth of thinking.
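
The trade-off between depth and per-step difficulty can be illustrated with a toy calculation. The sketch below assumes a specific power law, error = c * k**alpha for a k-way choice, and assumes that step errors are independent. The constants `c` and `alpha` and both function names are made up for illustration, and the sketch does not reproduce the paper's threshold analysis.

```python
def flat_error(n_classes, c=0.01, alpha=0.5):
    """Assumed power law: error of a single k-way classification."""
    return min(1.0, c * n_classes**alpha)


def tree_error(n_classes, depth, c=0.01, alpha=0.5):
    """Error of a decomposition into `depth` steps of equal degree."""
    degree = n_classes ** (1.0 / depth)
    step_error = flat_error(degree, c, alpha)
    # The final answer is right only if every step is right
    # (independence between steps is an assumption of this sketch).
    return 1.0 - (1.0 - step_error) ** depth


n_classes = 4096
errors = {depth: tree_error(n_classes, depth) for depth in range(1, 13)}
best_depth = min(errors, key=errors.get)
```

Under these assumed constants the error first falls and then rises again as depth grows, so `best_depth` lands strictly between the shallowest and the deepest tree.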

2604.08571 2026-05-22 cs.LG cs.AI cs.CL

Robust Reasoning Benchmark

Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey

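Affiliations: University of Toronto; Vector Institute

AI summary: This study introduces the Robust Reasoning Benchmark (RRB), which evaluates 8 state-of-the-art models under 13 deterministic textual perturbations. Claude refuses many transformed prompts, while open-weights models show several failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with average accuracy drops of up to 54%. The study also identifies Intra-Query Attention Dilution, an attention dilution caused by the model's own chain-of-thought. This suggests that intermediate reasoning steps pollute standard dense attention and that future architectures need explicit contextual resets for reliable reasoning.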

Abstract

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13 deterministic textual perturbations applied to AIME 2024 and AIME 2025. Evaluating 8 state-of-the-art models, we find that frontier models are largely resilient, with the notable exception of Claude, which categorically refuses many transformed prompts. Open-weights reasoning models exhibit a range of failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with up to 54% average accuracy drops across perturbations and up to 100% on some. We further study one of these failure modes in isolation: attention dilution caused by the model's own chain-of-thought. By tasking models with solving multiple independent mathematical problems sequentially within a single context window, we identify Intra-Query Attention Dilution. Open-weights models ranging from 7B to 120B parameters exhibit accuracy decay on subsequent problems, suggesting that intermediate reasoning steps progressively pollute standard dense attention mechanisms. We argue that in order to achieve reliable reasoning, future architectures need to integrate explicit contextual resets within models' own chain-of-thought, leading to open research questions regarding the optimal granularity of reasoning tasks.

2603.29735 2026-05-22 cs.AI

Unveiling the Reasoning Process of Large Language Models

Junjie Zhang, Zhen Shen, Xisong Dong, Gang Xiong

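Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary: By analyzing how attention heads and layers transform information, this paper finds that in mathematical and symbolic reasoning tasks the middle layers of large language models convert token-level information into reusable relational structure.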

Abstract

Large language models often reason beyond surface tokens, but the internal stage at which token-level information becomes abstract relational structure remains unclear. We investigate this question by analyzing how attention heads and layers transform information during autoregressive reasoning. Across mathematical and symbolic reasoning tasks, we observe a consistent layer-wise division of labor: outer layers mainly preserve and route input-related features, whereas middle layers reorganize them into more transferable rule-level representations. This interpretation is supported by representation geometry: middle-layer states occupy lower-dimensional manifolds and show stronger alignment across disjoint vocabularies that instantiate the same symbolic rules. It is further supported by causal interventions: removing middle-layer components identified by our interaction-based criterion produces substantially larger downstream changes and accuracy drops than removing components from other regions or at random. Together, these results suggest that abstract reasoning is not uniformly distributed across transformer layers, but is preferentially formed in a middle-layer computation stage that converts token-level information into reusable relational structure.
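
One common way to quantify "lower-dimensional manifolds" in hidden states is the participation ratio of the covariance spectrum. The sketch below is a generic illustration of that measure on synthetic data. The abstract does not say which dimensionality estimator the authors use, so the choice of participation ratio and the name `participation_ratio` are assumptions.

```python
import numpy as np


def participation_ratio(states):
    """Effective dimensionality of a (n_samples, n_features) activation matrix.

    Computed as (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the
    covariance matrix; it equals d when variance is spread evenly over d axes.
    """
    centered = states - states.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(states) - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(eig.sum() ** 2 / np.sum(eig**2))


rng = np.random.default_rng(0)
# Synthetic stand-ins for layer activations: one set confined to a 2-D
# subspace of a 64-D space, one filling all 64 dimensions.
low_dim = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 64))
full_dim = rng.standard_normal((500, 64))
pr_low = participation_ratio(low_dim)
pr_full = participation_ratio(full_dim)
```

Applied per layer to real activations, a dip of this quantity in the middle layers would be consistent with the geometry described in the abstract.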

2603.22508 2026-05-22 cs.RO cs.SY eess.SY

Parallel OctoMapping: A Scalable Framework for Enhanced Path Planning in Autonomous Navigation

Yihui Mao, Tian Tan, Xuehui Shen, Warren E. Dixon, Rushikesh Kamalapurkar

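Affiliations: Department of Mechanical and Aerospace Engineering, University of Florida; Department of Electrical and Systems Engineering, University of Pennsylvania

AI summary: This paper proposes Parallel OctoMapping (POMP), an efficient OctoMap-based mapping technique that refines the free-space representation at a fixed occupancy-grid resolution, improving path-planning efficiency and success rates, especially in cluttered environments.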

Abstract

Mapping is essential in robotics and autonomous systems because it provides the spatial foundation for path planning. Efficient mapping enables planning algorithms to generate reliable paths while ensuring safety and adapting in real time to complex environments. Fixed-resolution mapping methods often produce overly conservative obstacle representations that lead to suboptimal paths or planning failures in cluttered scenes. To address this issue, we introduce Parallel OctoMapping (POMP), an efficient OctoMap-based mapping technique that maximizes available free space and supports multi-threaded computation. To the best of our knowledge, POMP is the first method that, at a fixed occupancy-grid resolution, refines the representation of free space while preserving map fidelity and compatibility with existing search-based planners. It can therefore be integrated into existing planning pipelines, yielding higher pathfinding success rates and shorter path lengths, especially in cluttered environments, while substantially improving computational efficiency.

2603.21743 2026-05-22 cs.LG q-bio.QM

CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy

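Affiliations: Stanford University

AI summary: This paper proposes CellFluxRL, which post-trains a virtual cell model with reinforcement learning so that its generations better respect biological function, structural validity, and morphological correctness, making virtual cell modeling more biologically meaningful.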

Abstract

Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories (biological function, structural validity, and morphological correctness) and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond "visually realistic" generations towards "biologically meaningful" ones.

2603.21717 2026-05-22 cs.LG

Uncertainty-Aware Distribution-to-Distribution Flow Matching for Scientific Imaging

Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy, Emma Lundberg, Emily B. Fox

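Affiliations: Stanford University

AI summary: This paper proposes an uncertainty-aware distribution-to-distribution flow matching method for scientific imaging. Stochastic Flow Matching improves generalization under distribution shift, while Bayesian Stochastic Flow Matching and Antithetic Variance-reduction Uncertainty Quantification (AVUQ) estimate epistemic and aleatoric uncertainty in order to detect unreliable generations.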

Abstract

Distribution-to-distribution generative models support scientific imaging tasks ranging from modeling cellular perturbation responses to translating medical images across conditions. Trustworthy generation requires reliability, or generalization across labs, devices, and experimental conditions, and accountability, or detecting out-of-distribution cases where predictions may be unreliable. We leverage Stochastic Flow Matching (SFM), a marginal-preserving stochastic extension of flow matching for improved generalization under distribution shift. SFM augments deterministic flows with a diffusion term together with a learned score-based drift correction, retaining the learned transport marginals while modeling conditional variability. Building on this SFM framework, we introduce Bayesian Stochastic Flow Matching (BSFM) as a companion uncertainty quantification mechanism and develop AVUQ (Antithetic Variance-reduction Uncertainty Quantification) to approximately estimate epistemic and aleatoric uncertainty via sample-efficient antithetic sampling with approximate posterior inference. We further use AVUQ to yield anomaly scores for unreliable generation detection. Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse unseen scenarios show that SFM improves generalization while AVUQ provides effective uncertainty-based anomaly scores under practical sampling budgets.
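
The "sample-efficient antithetic sampling" mentioned above builds on a standard variance-reduction idea: pair each noise draw z with its mirror image -z. The sketch below shows only that generic idea on a scalar toy function. It is not AVUQ, and the test function `f` and both estimator names are assumptions chosen for illustration.

```python
import numpy as np


def plain_estimate(f, n_samples, seed):
    """Ordinary Monte Carlo estimate of E[f(z)] with z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    return float(np.mean(f(z)))


def antithetic_estimate(f, n_pairs, seed):
    """Antithetic estimate: every draw z is evaluated together with -z."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_pairs)
    # Components of f that are odd in z cancel exactly within each pair.
    return float(np.mean(0.5 * (f(z) + f(-z))))


def f(z):
    # Toy stand-in for a statistic of a generated sample; E[f] = 0.1.
    return z + 0.1 * z**2


# Both estimators use 200 evaluations of f per run, repeated over 50 seeds.
plain = [plain_estimate(f, 200, seed) for seed in range(50)]
anti = [antithetic_estimate(f, 100, seed) for seed in range(50)]
```

Comparing `np.var(plain)` with `np.var(anti)` shows how much spread the pairing removes at an equal evaluation budget, which is the sense in which antithetic sampling is sample-efficient.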

2603.21610 2026-05-22 cs.LG cs.AI stat.ML

Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains

Abdou-Raouf Atarmla

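Affiliations: Institut National des Postes et Télécommunications; Togo DataLab; Ministry of Digital Economy

AI summary: This paper proposes Rule-State Inference (RSI), a Bayesian framework that addresses three structural challenges of compliance monitoring in rule-governed domains: no labeled outcomes at deployment, observations that non-compliant entities strategically withhold, and a regulatory environment that changes faster than any supervised model can be retrained. RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the population's latent compliance state through variational inference with exact coordinate-ascent updates.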

Comments 18 pages. Experimental validation forthcoming

Abstract

Compliance monitoring in rule-governed domains (tax administration, clinical protocol adherence, environmental regulation) faces three structural obstacles that standard machine learning does not simultaneously address: the absence of labeled outcomes at deployment, strategically missing observations where non-compliant entities selectively withhold evidence, and a regulatory environment that changes faster than any supervised model can be retrained. We introduce Rule-State Inference (RSI), a Bayesian framework that reverses the usual paradigm. Rather than learning rules from data, RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through mean-field variational inference with exact coordinate-ascent updates. The central modeling object is a joint latent state per regulatory period: a global compliance-culture factor eta and per-rule components for activation, population compliance level, and parametric drift. RSI delivers three formal guarantees: O(n_k + K) regulatory adaptability per rule update; Bernstein-von Mises consistency for the identifiable continuous components; and monotone ELBO convergence at every iteration. We instantiate RSI on the Togolese fiscal system on a benchmark of 2,000 synthetic enterprises grounded in official regulatory law; full numerical validation is forthcoming. The framework is designed for direct extension to Sequential RSI, a state-space formulation where the posterior from one regulatory period becomes the prior for the next, yielding an exact Kalman filter for compliance-trajectory tracking and entity-level Bayesian scoring.
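
The sequential idea in the last sentence, where one period's posterior becomes the next period's prior, can be sketched with a scalar Kalman filter. This is a textbook one-dimensional filter under an assumed random-walk model. The function name, the compliance-rate numbers, and the noise variances are hypothetical and are not taken from the paper.

```python
def kalman_update(prior_mean, prior_var, observation, obs_var, drift_var=0.0):
    """One period of a scalar Kalman filter with random-walk dynamics."""
    # Predict: last period's posterior, widened by drift, is the new prior.
    predicted_var = prior_var + drift_var
    # Update: blend the prior and the observation by their precisions.
    gain = predicted_var / (predicted_var + obs_var)
    post_mean = prior_mean + gain * (observation - prior_mean)
    post_var = (1.0 - gain) * predicted_var
    return post_mean, post_var


# Hypothetical prior belief about a compliance level, e.g. from a rule set.
mean, var = 0.8, 0.04
history = []
for observed_rate in [0.70, 0.65, 0.72]:  # made-up per-period observations
    mean, var = kalman_update(mean, var, observed_rate,
                              obs_var=0.01, drift_var=0.001)
    history.append((mean, var))
```

Each pass through the loop reuses the previous `(mean, var)` as the prior, so the belief tracks the observations while its variance shrinks.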

2603.16077 2026-05-22 cs.LG

MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Scaling of Diffusion Language Models

Chen-Hao Chao, Wei-Fang Sun, Junwei Quan, Chun-Yi Lee, Rahul G. Krishnan

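Affiliations: University of Toronto; Vector Institute; NVIDIA AI Technology Center; National Taiwan University

AI summary: This paper proposes MDM-Prime-v2, which improves masked diffusion language models through Binary Encoding and Index Shuffling. It addresses two problems: the subtokenizer's functional form raises the cross-entropy loss when paired with BPE tokenizers, and there are no tools for choosing the token-granularity hyperparameter. The result is higher zero-shot accuracy on commonsense reasoning benchmarks.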

Abstract

Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we find that the functional form of the subtokenizer significantly increases the cross-entropy loss in the objective when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. Second, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. To address these limitations, we analyze the optimal design of the subtokenizer that minimizes MDM-Prime training objective and develop MDM-Prime-v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our analysis characterizes how token granularity and sub-token entropy influence the training objective and downstream performance, providing principled criteria for subtokenizer design. When extending the model size to 1.1B parameters, MDM-Prime-v2 demonstrates superior average zero-shot accuracy across eight commonsense reasoning benchmarks, outperforming similar-sized baselines including GPT-Neo, OPT, Pythia, Bloom, SMDM, and TinyLLaMA.
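
As a rough picture of what a subtokenizer with binary encoding and index shuffling could look like, the sketch below maps each token id through a fixed random permutation and then writes the result as a sequence of bits. The abstract does not define either operation precisely, so treating index shuffling as a random permutation of token ids, as well as the name `make_subtokenizer`, are assumptions.

```python
import numpy as np


def make_subtokenizer(vocab_size, seed=0):
    """Build encode/decode functions mapping token ids to binary sub-tokens."""
    n_bits = max(1, (vocab_size - 1).bit_length())
    rng = np.random.default_rng(seed)
    perm = rng.permutation(vocab_size)  # assumed form of "index shuffling"
    inv_perm = np.argsort(perm)         # inverse permutation for decoding
    shifts = np.arange(n_bits - 1, -1, -1)

    def encode(token_ids):
        shuffled = perm[np.asarray(token_ids)]
        # Most significant bit first; each bit is one sub-token.
        return (shuffled[..., None] >> shifts) & 1

    def decode(bits):
        shuffled = np.sum(np.asarray(bits) << shifts, axis=-1)
        return inv_perm[shuffled]

    return encode, decode, n_bits


encode, decode, n_bits = make_subtokenizer(vocab_size=50257)
token_ids = np.array([0, 1, 42, 50256])
sub_tokens = encode(token_ids)
```

With a vocabulary of 50,257 entries each token becomes 16 binary sub-tokens, and `decode(sub_tokens)` recovers the original ids.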

2603.14987 2026-05-22 cs.CL cs.DB

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Jinhu Qi, Yifan Li, Minghao Zhao, Wentao Zhang, Zijian Zhang, Yaoman Li, Irwin King

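Affiliations: The Chinese University of Hong Kong; Macao Polytechnic University; Jilin University

AI summary: This paper defines agentic trustworthiness as a five-property profile and introduces the Holographic Agent Assessment Framework (HAAF), which evaluates agent systems over a manifold of socio-technical scenarios through static policy analysis, sandbox simulation, social-ethical alignment assessment, and distribution-aware sampling. It reports a cross-family transfer experiment on 13 systems from seven model families.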

Comments 9 pages, 3 figures, 8 tables. Submitted to the Agent4IR Workshop at KDD 2026

Abstract

Agentic AI systems increasingly act through tool-augmented, multi-step workflows whose failures (unsafe tool use, unauthorised actions, social harm) carry deployment-level consequences. Evaluation practice remains fragmented across isolated benchmark slices, and "trustworthiness" is frequently invoked but rarely defined operationally. We argue the central limitation is twofold: (i) the absence of a measurable specification of what agent trustworthiness means, and (ii) the lack of a principled notion of representativeness allowing assessment over a socio-technical scenario distribution rather than disconnected benchmark instances. We address (i) by defining agentic trustworthiness as a five-property profile (Reliability, Robustness, Safety, Social-Ethical Alignment, Operational Integrity) grounded in current AI risk frameworks, and (ii) with the Holographic Agent Assessment Framework (HAAF), which measures this profile over a scenario manifold through static policy analysis, sandbox simulation, social-ethical alignment assessment, and distribution-aware sampling, connected through an iterative Trustworthy Optimization Factory that converts red-team diagnoses into blue-team interventions. Our contributions are: (1) an operational five-property definition of agentic trustworthiness; (2) a distribution-aware scenario-sampling framework that surfaces property-level trade-offs invisible to scalar leaderboards; and (3) a cross-family transfer experiment in which interventions designed from a single focal model generalise -- without per-model or per-scenario tuning -- to 13 systems from seven model families (Llama, Mistral, Kimi, GLM, Qwen, GPT, DeepSeek) on a 100-scenario suite, where all 13 systems improve and two reach a perfect risk-weighted profile, establishing HAAF's Factory as a model-agnostic deployment-readiness pipeline. Code: https://github.com/TonyQJH/haaf-pilot

2603.11679 2026-05-22 cs.AI

LLMs can construct powerful representations and streamline sample-efficient supervised learning

Ilker Demirel, Lawrence Shi, Zeshan Hussain, David Sontag

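Affiliations: MIT; Harvard Medical School

AI summary: This paper proposes an LLM-based agentic pipeline that synthesizes a global rubric to build better representations of multimodal data. On 15 clinical tasks it significantly outperforms count-feature models, naive LLM baselines, and a clinical foundation model.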

Abstract

As real-world datasets become more complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data, such as time-series, free text, and structured records, often requires non-trivial domain expertise. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned interpretive summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric approaches significantly outperform count-feature models, naive LLM baselines, and a clinical foundation model pretrained on orders of magnitude more data. Beyond performance, rubrics offer operational advantages such as being easy to audit, cost-effectiveness at scale, and facilitating tabular representations.

2603.11642 2026-05-22 cs.RO

Noise-Space Attribution and Control of Chunk-Boundary Artifact

Rui Wang

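Affiliations: Rui Wang

AI summary: This paper studies the mechanism behind chunk-boundary artifact in generative visuomotor policies. Treating the artifact as a variable in noise space, it shows that changing only the latent noise modulates the artifact and that artifact changes can carry through to the final task outcome.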

Abstract

Action chunking is widely used in generative visuomotor policies, yet the recurring execution discontinuities at chunk boundaries still lack a mechanistic explanation. This paper treats chunk-boundary artifact as an analyzable mechanism variable. We first show that successful and failed episodes separate stably on artifact metrics. We then show that, in stochastic action-chunked policies, fixing the observation context and changing only latent noise is sufficient to modulate artifact systematically. On the same Diffusion Policy checkpoint, comparisons among DDPM, zero-variance DDPM, and DDIM further show that this local controllability depends on whether the information path from initial noise to action output remains intact. Finally, from controlled interventions at fixed local execution states, we find that artifact changes can carry through to final outcome, and that the preferred direction can reverse even within the same task: some contexts achieve higher success under lower artifact, whereas others achieve higher success under higher artifact. In a representative high-artifact-favoring key context selected by held-out matched-continuation validation, success rate increases from 0.033 to 0.717. These results show that chunk-boundary artifact is not a mere execution-side by-product, but a variable in noise space that can be attributed, controlled, and mechanistically linked to task outcome.

2603.03784 2026-05-22 cs.AI

Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism

Zheyu Chen, Huiteng Zhuang, Zhuohuan Li, Chuanhao Li

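Affiliations: Zhili College, Tsinghua University; School of Transportation Science and Engineering, Beihang University; Department of Industrial Engineering, Tsinghua University

AI summary: This paper proposes synthesizing discrete-event world models online from natural-language specifications, combining the reliability of explicit simulators with the adaptability of neural models. It adopts the DEVS formalism with a staged LLM-based generation pipeline that separates structural inference from component-level event and timing logic, and it evaluates the generated models on benchmark suites that check event traces against specification-derived constraints.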

Comments 36 pages, 6 figures

Abstract

World models are central to LLM agents that must evaluate actions over long horizons. Yet much existing work focuses on environments governed by physical dynamics or spatial structure, whereas many high-impact domains, including supply chains, procurement networks, and business processes, evolve through discrete events, timing constraints, and causal dependencies. These settings call for discrete-event world models. Existing approaches to constructing world models often fall near two extremes: hand-engineered simulators provide consistency and reproducibility, but are costly to build and adapt; neural models are flexible, but can suffer from compounding inconsistency over long-horizon rollouts. We seek a principled middle ground by synthesizing discrete-event world models online from natural-language specifications, retaining the reliability of explicit simulators while gaining the adaptability of neural models. We adopt the DEVS formalism and introduce a staged LLM-based generation pipeline that separates structural inference over component interactions from component-level event and timing logic. For evaluation, we develop benchmark suites in which simulators emit structured event traces, which are then validated against specification-derived temporal, causal, and semantic constraints. This enables reproducible verification and localized diagnostics. Together, these contributions produce world models that remain consistent over long-horizon rollouts, can be verified from observable behavior, and can be synthesized efficiently on demand during online execution.
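
For readers unfamiliar with DEVS, the sketch below shows the shape of an atomic model (time advance, output, internal and external transitions) and a tiny loop that emits a structured event trace of the kind the abstract says is validated. The `Processor` model, its method names, and the `simulate` loop are hypothetical simplifications: there is a single atomic model with no coupling, and when an internal and an external event coincide the internal one simply runs first.

```python
import math


class Processor:
    """Atomic DEVS-style model: one server with a FIFO queue."""

    def __init__(self, service_time):
        self.service_time = service_time
        self.queue = []
        self.remaining = math.inf  # time until the next internal event

    def time_advance(self):
        return self.remaining

    def output(self):
        # Called just before the internal transition fires.
        return ("done", self.queue[0])

    def internal_transition(self):
        self.queue.pop(0)
        self.remaining = self.service_time if self.queue else math.inf

    def external_transition(self, elapsed, job):
        if self.queue:
            self.remaining -= elapsed  # the job in service keeps its schedule
        else:
            self.remaining = self.service_time
        self.queue.append(job)


def simulate(model, arrivals, end_time):
    """Run one atomic model against timed inputs and return its event trace."""
    trace, last = [], 0.0
    pending = sorted(arrivals)
    while True:
        t_internal = last + model.time_advance()
        t_external = pending[0][0] if pending else math.inf
        now = min(t_internal, t_external)
        if now > end_time:
            return trace
        if t_internal <= t_external:
            trace.append((now, *model.output()))
            model.internal_transition()
        else:
            _, job = pending.pop(0)
            model.external_transition(now - last, job)
            trace.append((now, "arrive", job))
        last = now


trace = simulate(Processor(service_time=2.0), [(0.0, "a"), (1.0, "b")], 10.0)
```

A specification-derived temporal constraint, such as "every job finishes at least one service time after it arrives", can then be checked directly against `trace`.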

2603.02604 2026-05-22 cs.LG

Heterogeneous Agent Collaborative Reinforcement Learning

Zhixia Zhang, Zixuan Huang, Gongxun Li, Huaiyang Wang, Chengyi Yuan, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban

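Affiliations: Beihang University; Bytedance China; Tsinghua University; Peking University; Apple

AI summary: This paper introduces HACRL, a new Reinforcement Learning from Verifiable Reward (RLVR) problem in which heterogeneous agents share verified rollouts to optimize collaboratively, addressing the inefficiency of isolated multi-agent on-policy optimization. It also proposes the HACPO algorithm to maximize sample utilization and cross-agent knowledge transfer.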

Abstract

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new Reinforcement Learning from Verifiable Reward (RLVR) problem that addresses the inefficiencies of isolated multi-agent on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional homogeneous teacher-to-student transfer. Building on this problem, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO with double rollouts by an average of 3.6% while using only half the rollout cost.

2602.23200 2026-05-22 cs.LG cs.CL

InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross

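Affiliations: Department of Electrical and Computer Engineering

AI summary: This paper proposes InnerQ, a hardware-aware KV cache quantization scheme that reduces decode latency without compromising evaluation performance. Its group-wise quantization along the inner dimension increases data reuse, and on Llama and Mistral models it also improves few-shot evaluation scores relative to prior KV cache quantization methods.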

Comments 18 pages, 5 figures, 7 tables

英文摘要

When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decoding step is therefore critical for efficient long-context generation. A major bottleneck is the key-value (KV) cache, whose size grows with sequence length and often dominates the model's memory footprint. Prior work has proposed quantization methods to compress the KV cache while minimizing its loss of precision. We present InnerQ, a hardware-aware KV cache quantization scheme that reduces decode latency without compromising evaluation performance. InnerQ performs group-wise quantization by grouping cache matrices along their inner dimension. This grouping strategy aligns dequantization with vector-matrix multiplication and increases data reuse across GPU compute units. As a result, InnerQ reduces memory access and accelerates dequantization, achieving an average $1.3\times$ speedup over prior KV cache quantization methods and $2.7\times$ over the non-quantized baseline. To maintain fidelity under aggressive compression, InnerQ incorporates three techniques: (i) hybrid quantization, which chooses symmetric or asymmetric quantization for each group based on local statistics; (ii) high-precision windows for both recent tokens and attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the model parameters to eliminate runtime overhead. Beyond reducing latency, experiments on Llama and Mistral models show that InnerQ also improves few-shot evaluation scores relative to prior KV cache quantization methods.
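
The hybrid group-wise scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says the symmetric/asymmetric choice is made per group "based on local statistics", so the selection rule here (keep whichever mode reconstructs the group with lower squared error) is an assumption, as are the function names, the 4-bit width, and the group size. A plain Python list stands in for one slice of a cache matrix along its inner dimension, and nothing about GPU data reuse is modeled.

```python
def quantize_group(values, bits=4):
    """Quantize one group and return (mode, dequantized values)."""
    # Asymmetric: map [min, max] onto the integer range [0, 2^bits - 1].
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale_a = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    asym = [round((v - lo) / scale_a) * scale_a + lo for v in values]

    # Symmetric: map [-amax, amax] onto [-qmax, qmax] with the zero-point fixed at 0.
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in values)
    scale_s = amax / qmax or 1.0
    sym = [max(-qmax, min(qmax, round(v / scale_s))) * scale_s for v in values]

    def squared_error(dequantized):
        return sum((a - b) ** 2 for a, b in zip(values, dequantized))

    # Hypothetical selection rule: keep the mode with the lower reconstruction error.
    if squared_error(sym) <= squared_error(asym):
        return "sym", sym
    return "asym", asym


def quantize_groupwise(row, group_size=4, bits=4):
    """Split `row` into contiguous groups and quantize each one independently."""
    modes, dequantized = [], []
    for start in range(0, len(row), group_size):
        mode, values = quantize_group(row[start:start + group_size], bits)
        modes.append(mode)
        dequantized.extend(values)
    return modes, dequantized


row = [0.0, 0.1, 0.2, 0.3, -1.0, 1.0, -0.5, 0.5]
modes, restored = quantize_groupwise(row)
```

Choosing the mode per group lets a skewed group use the full integer range while a zero-centered group can skip the zero-point, which is the trade-off the hybrid technique appears to target.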

2602.18600 2026-05-22 cs.LG

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

Ziqiao Shang, Lingyue Ge, Zi-Jian Cheng, Shi-Yu Tian, Zhenyu Huang, Wenbo Fu, Weiming Wu, Yang Chen, Xiangwen Zhang, Yulan Hu, Bin Liu, Yu-Feng Li, Lan-Zhe Guo

Affiliations National Key Laboratory for Novel Software Technology, Nanjing University; School of Intelligence Science and Technology, Nanjing University; AMAP, Alibaba Group; School of Computing and Artificial Intelligence, Southwest Jiaotong University

AI summary This paper proposes the MapTab benchmark for evaluating the holistic reasoning ability of multimodal large language models on multi-criteria route planning tasks, and finds that current models face substantial challenges in multimodal reasoning.

Abstract

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their reasoning capabilities under multi-criteria constraints. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate holistic multi-criteria reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key criteria: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs. Our code is available at https://github.com/Ziqiao-Shang/MapTab.

2602.17186 2026-05-22 cs.CV

Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

Seulbi Lee, Sangheum Hwang

Affiliations Department of Data Science, Seoul National University of Science and Technology

AI summary This paper proposes selective training of large vision language models guided by the Visual Information Gain (VIG) metric. Prioritizing high-VIG samples and tokens improves visual grounding and reduces language bias.

Comments Accepted at ICML 2026

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, it typically lacks a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this metric, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
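
One way to read "a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input" is sketched below. The abstract does not give the formula, so this is an assumed form: the token-level gain is the difference in log-probability with and without the image, and the sample-level gain is the difference in log-perplexity. The function names and the selection threshold are hypothetical, and the per-token log-probabilities are taken as given rather than computed by a model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def token_vig(logp_with_image, logp_text_only):
    """Assumed token-level gain: how much the image raises each token's log-probability."""
    return [w - t for w, t in zip(logp_with_image, logp_text_only)]


def sample_vig(logp_with_image, logp_text_only):
    """Assumed sample-level gain: drop in log-perplexity when the image is provided."""
    return math.log(perplexity(logp_text_only)) - math.log(perplexity(logp_with_image))


def select_tokens(gains, threshold=0.5):
    """Hypothetical selection step: keep token positions whose gain exceeds a threshold."""
    return [i for i, g in enumerate(gains) if g > threshold]


# Toy log-probabilities for a three-token answer; the middle token depends on the image.
with_image = [-0.1, -0.2, -0.1]
text_only = [-0.1, -2.0, -0.2]
gains = token_vig(with_image, text_only)
kept = select_tokens(gains)
```

Under this reading, tokens the language prior already predicts well get a gain near zero and would be down-weighted during training.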

2602.16169 2026-05-22 cs.LG cs.CL

Discrete Stochastic Localization for Non-autoregressive Generation

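Yunshu Wu, Jiayi Cheng, Longxuan Yu, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

Affiliations University of California Riverside; New York University

AI summary This paper proposes Discrete Stochastic Localization (DSL), a continuous-state framework that uses unit-sphere token embeddings to obtain a Bayes-optimal denoiser, improving distributional faithfulness in discrete sequence generation, and demonstrates its effectiveness on OpenWebText.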

Abstract

Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from $T=128$ to $T=1024$, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as $T=48$ total steps, all without distillation or retraining.

2602.15338 2026-05-22 cs.LG cs.CL

Discovering Implicit Large Language Model Alignment Objectives

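Edward Chen, Sanmi Koyejo, Carlos Guestrin

Affiliations Stanford University

AI summary This paper proposes Obj-Disco, a framework that automatically decomposes an alignment reward signal into interpretable objectives to address the limitations of existing methods. It validates the framework's robustness across diverse tasks and models, and uncovers latent misaligned incentives.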

Comments ICML 2026

Abstract

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.
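
The idea of greedily explaining a residual reward signal with a sparse, weighted set of objectives can be sketched as follows. This is a matching-pursuit-style stand-in, not the Obj-Disco algorithm: it assumes each candidate objective has already been scored numerically on every response, whereas the paper works with natural-language objectives and training checkpoints. The objective names, the data, and the `greedy_decompose` function are hypothetical.

```python
import math


def greedy_decompose(reward, candidates, max_objectives=2):
    """Greedily approximate `reward` as a sparse weighted sum of candidate score vectors."""
    residual = list(reward)
    remaining = dict(candidates)
    chosen = []
    for _ in range(max_objectives):
        if not remaining:
            break

        def gain(name):
            # Normalized correlation between a candidate and the current residual.
            v = remaining[name]
            norm = math.sqrt(sum(x * x for x in v))
            return abs(sum(r * x for r, x in zip(residual, v))) / norm if norm else 0.0

        best = max(remaining, key=gain)
        if gain(best) == 0.0:
            break  # no remaining candidate explains any of the residual
        v = remaining.pop(best)
        # One-dimensional least-squares weight for the chosen objective.
        weight = sum(r * x for r, x in zip(residual, v)) / sum(x * x for x in v)
        residual = [r - weight * x for r, x in zip(residual, v)]
        chosen.append((best, weight))
    total = sum(r * r for r in reward)
    explained = 1.0 - sum(r * r for r in residual) / total
    return chosen, explained


# Toy per-response scores for three hypothetical objectives.
candidates = {
    "helpful": [1.0, 0.0, 0.0, 0.0],
    "verbose": [0.0, 1.0, 0.0, 0.0],
    "polite": [0.0, 0.0, 1.0, 0.0],
}
reward = [2.0, 0.5, 0.0, 0.0]
chosen, explained = greedy_decompose(reward, candidates)
```

The `explained` fraction plays the role of a coverage figure: it reports how much of the reward variation the selected objectives account for in this toy setting.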

2602.13294 2026-05-22 cs.CV cs.AI

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

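Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen

Affiliations University of Waterloo; Autodesk AI Lab; Independent Researcher

AI summary This paper proposes VisPhyWorld, a framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations, and introduces the VisPhyBench benchmark to test how well models reconstruct appearance and reproduce physical motion. It finds that state-of-the-art MLLMs struggle to infer physical parameters accurately and to simulate consistent physical dynamics.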

Abstract

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of benchmark runs before fallback. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics. Our code is available at https://github.com/TIGER-AI-Lab/VisPhyWorld.

2602.11574 2026-05-22 cs.AI

Learning to Configure Agentic AI Systems

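Aditya Taparia, Som Sagar, Ransalu Senanayake

Affiliations School of Computing and Augmented Intelligence, Arizona State University

AI summary This paper formulates agent configuration as a semi-Markov decision process (SMDP) and introduces ARC, which dynamically selects query-specific agent configurations, improving reasoning accuracy, tool-use accuracy, and τ-Bench (Airline) Pass^1 success across several benchmarks.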

Comments 22 pages, 12 figures

Abstract

Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed templates or hand-tuned heuristics that apply the same configuration regardless of query difficulty, leading to brittle behavior and wasted compute. To address this, we formulate agent configuration as a semi-Markov decision process (SMDP) where each configuration acts as a temporally extended option that determines how an agent system processes a query, and introduce ARC (Agentic Resource & Configuration learner), a lightweight hierarchical policy that dynamically selects query-specific agent configurations. Across reasoning, tool-use, and agentic benchmarks, ARC consistently improves over budget-matched tool-augmented LLMs, increasing average reasoning accuracy by 31.3%, tool-use accuracy by 13.95%, and doubling τ-Bench (Airline) Pass^1 success from 9.0% to 18.0%. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.

2602.10062 2026-05-22 cs.LG cs.CV

Vendi Novelty Scores for Out-of-Distribution Detection

Amey P. Pasarkar, Adji Bousso Dieng

Affiliations Lewis-Sigler Institute for Integrative Genomics, Princeton University; Department of Computer Science, Princeton University

AI summary Building on the Vendi Scores, this paper proposes the Vendi Novelty Score (VNS), which approaches out-of-distribution detection from a diversity perspective. The method requires no density modeling, is linear-time and non-parametric, and achieves state-of-the-art OOD detection performance on multiple image classification benchmarks.

Abstract

Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
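
The notion of scoring a test sample by how much it increases the diversity of the in-distribution set can be illustrated with a small sketch. It makes several assumptions that the abstract does not confirm: it uses the order-2 member of the Vendi Score family, which reduces to n^2 divided by the squared Frobenius norm of the similarity matrix and so needs no eigendecomposition, and it uses a cosine kernel. It does not reproduce the paper's linear-time computation or its combination of class-conditional and dataset-level signals, and the function names are hypothetical.

```python
import math


def cosine_kernel(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def vendi_order2(points):
    """Order-2 Vendi Score: 1 / sum of squared eigenvalues of K/n, i.e. n^2 / ||K||_F^2."""
    n = len(points)
    frobenius_sq = sum(cosine_kernel(p, q) ** 2 for p in points for q in points)
    return n * n / frobenius_sq


def novelty(reference, x):
    """Assumed novelty score: increase in diversity when x joins the reference set."""
    return vendi_order2(reference + [x]) - vendi_order2(reference)


reference = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]  # toy in-distribution features
near = [0.95, 0.05]  # resembles the reference set
far = [0.0, 1.0]     # nearly orthogonal to it
```

A sample that resembles the reference set barely changes the score, while a dissimilar one raises it, which is the behavior the diversity framing relies on.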

2602.06264 2026-05-22 cs.LG

Swap Regret Minimization Through Response-Based Approachability

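Ioannis Anagnostides, Gabriele Farina, Maxwell Fishelson, Haipeng Luo, Jon Schneider

Affiliations Carnegie Mellon University; Massachusetts Institute of Technology; University of Southern California; Google Research

AI summary This paper proposes a simpler and more efficient algorithm that guarantees O(d√T) linear swap regret for convex sets preconditioned via the John ellipsoid, establishes a matching information-theoretic lower bound, shows that a classic algorithm is existentially optimal for minimizing linear swap regret, and extends the approach to sets of swap deviations with polynomial dimension.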

Comments V3 makes certain clarifications and improves the upper bound for general sets via symmetrization

Abstract

We consider the problem of minimizing different notions of swap regret in online optimization. These forms of regret are tightly connected to correlated equilibrium concepts in games, and have more recently been shown to guarantee non-manipulability against strategic adversaries. The only computationally efficient algorithm for minimizing linear swap regret over a general convex set in $\mathbb{R}^d$ was developed recently by Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '25). However, it incurs a highly suboptimal regret bound of $\Omega(d^4 \sqrt{T})$ and also relies on computationally intensive calls to the ellipsoid algorithm at each iteration. In this paper, we develop a significantly simpler, computationally efficient algorithm that guarantees $O(d \sqrt{T})$ linear swap regret for a general convex set that has been preconditioned via the John ellipsoid. Our algorithm leverages the powerful response-based approachability framework of Bernstein and Shimkin (JMLR '15), previously overlooked in the line of work on swap regret minimization, and simultaneously minimizes profile swap regret, which was recently shown to guarantee non-manipulability. Moreover, we establish a matching information-theoretic lower bound: any learner must incur in expectation $\Omega(d \sqrt{T})$ linear swap regret for large enough $T$, even when the set is centrally symmetric. This also shows that the classic algorithm of Gordon, Greenwald, and Marks (ICML '08) is existentially optimal for minimizing linear swap regret, although it is computationally inefficient. Finally, we extend our approach to minimize regret with respect to the set of swap deviations with polynomial dimension, unifying and strengthening recent results in equilibrium computation and online learning.

2602.05286 2026-05-22 cs.LG cs.AI

HealthMamba: An Uncertainty-aware Spatiotemporal Graph State Space Model for Effective and Reliable Healthcare Facility Visit Prediction

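Dahai Yu, Lin Jiang, Rongchao Xu, Guang Wang

Affiliations Department of Computer Science, Florida State University

AI summary This paper proposes HealthMamba, an uncertainty-aware spatiotemporal graph state space model for effective and reliable healthcare facility visit prediction. The model has three key components: a unified spatiotemporal context encoder, a new graph state space model called GraphMamba, and a comprehensive uncertainty quantification module. Experiments show that HealthMamba improves on state-of-the-art baselines by around 6.0% in prediction accuracy and 3.5% in uncertainty quantification.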

Comments IJCAI 2026

Abstract

Healthcare facility visit prediction is essential for optimizing healthcare resource allocation and informing public health policy. Despite advanced machine learning methods being employed for better prediction performance, existing works usually formulate this task as a time-series forecasting problem without considering the intrinsic spatial dependencies of different types of healthcare facilities, and they also fail to provide reliable predictions under abnormal situations such as public emergencies. To advance existing research, we propose HealthMamba, an uncertainty-aware spatiotemporal framework for accurate and reliable healthcare facility visit prediction. HealthMamba comprises three key components: (i) a Unified Spatiotemporal Context Encoder that fuses heterogeneous static and dynamic information, (ii) a novel Graph State Space Model called GraphMamba for hierarchical spatiotemporal modeling, and (iii) a comprehensive uncertainty quantification module integrating three uncertainty quantification mechanisms for reliable prediction. We evaluate HealthMamba on four large-scale real-world datasets from California, New York, Texas, and Florida. Results show HealthMamba achieves around 6.0% improvement in prediction accuracy and 3.5% improvement in uncertainty quantification over state-of-the-art baselines.

2602.03205 2026-05-22 cs.RO

HUSKY: Humanoid Skateboarding System via Physics-Aware Whole-Body Control

Jinrui Han, Dewei Wang, Chenyun Zhang, Xinzhe Liu, Ping Luo, Chenjia Bai, Xuelong Li

Affiliations Institute of Artificial Intelligence (TeleAI), China Telecom; The University of Hong Kong; University of Science and Technology of China; ShanghaiTech University

AI summary This paper proposes HUSKY, a framework that integrates humanoid-skateboard system modeling with physics-aware whole-body control to address stable dynamic maneuvering in highly dynamic, interaction-heavy tasks, achieving stable and agile skateboarding in real-world scenarios.

Comments Accepted to RSS2026

Abstract

While current humanoid whole-body control frameworks predominantly rely on static-environment assumptions, addressing tasks characterized by high dynamism and complex interactions presents a formidable challenge. In this paper, we address humanoid skateboarding, a highly challenging task requiring stable dynamic maneuvering on an underactuated wheeled platform. This integrated system is governed by non-holonomic constraints and tightly coupled human-object interactions. Successfully executing this task requires simultaneous mastery of hybrid contact dynamics and robust balance control on a mechanically coupled, dynamically unstable skateboard. To overcome these challenges, we propose HUSKY, a learning-based framework that integrates humanoid-skateboard system modeling and physics-aware whole-body control. We first model the coupling relationship between board tilt and truck steering angles, enabling a principled analysis of system dynamics. Building upon this, HUSKY leverages Adversarial Motion Priors (AMP) to learn human-like pushing motions and employs a physics-guided, heading-oriented strategy for lean-to-steer behaviors. Moreover, a trajectory-guided mechanism ensures smooth and stable transitions between pushing and steering. Experimental results on the Unitree G1 humanoid platform demonstrate that our framework enables stable and agile maneuvering on skateboards in real-world scenarios. The project page is available at https://husky-humanoid.github.io/.

2602.03067 2026-05-22 cs.LG cs.AI cs.NA math.NA

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

Felix X.-F. Ye, Xingjie Li, An Yu, Ming-Ching Chang, Linsong Chu, Davis Wertheimer

Affiliations Department of Mathematics & Statistics, University at Albany, Albany, NY, USA; Department of Mathematics & Statistics, University of North Carolina at Charlotte, Charlotte, NC, USA; Department of Computer Science, University at Albany, Albany, NY, USA; IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

AI summary This paper proposes FlashSinkhorn, a GPU-based entropic optimal transport solver that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions, the same normalization used in transformer attention. This enables FlashAttention-style fusion and tiling, substantially reducing HBM IO while retaining linear-memory operations.

Abstract

Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce reduction kernels with limited fusion. We present FlashSinkhorn, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT, and improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at https://github.com/ot-triton-lab/flash-sinkhorn.
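
The rewrite of a log-domain Sinkhorn update as a LogSumExp over biased dot-product scores can be checked numerically with a small sketch. It is plain Python rather than a Triton kernel and is not taken from the released code: the function names, the tile size, and the toy data are assumptions. For squared Euclidean cost, expanding $\|x - y_j\|^2 = \|x\|^2 + \|y_j\|^2 - 2\,x \cdot y_j$ turns the update $f = -\varepsilon \log \sum_j b_j \exp((g_j - \|x - y_j\|^2)/\varepsilon)$ into $f = \|x\|^2 - \varepsilon\,\mathrm{LSE}_j\big(2\,x \cdot y_j/\varepsilon + (g_j - \|y_j\|^2)/\varepsilon + \log b_j\big)$, which can be accumulated tile by tile with a running maximum, as in streaming softmax.

```python
import math


def f_update_direct(x, ys, g, log_b, eps):
    """Reference update: f = -eps * log sum_j b_j * exp((g_j - |x - y_j|^2) / eps)."""
    terms = [lb + (gj - sum((a - c) ** 2 for a, c in zip(x, y))) / eps
             for y, gj, lb in zip(ys, g, log_b)]
    m = max(terms)
    return -eps * (m + math.log(sum(math.exp(t - m) for t in terms)))


def f_update_streaming(x, ys, g, log_b, eps, tile=2):
    """Same update as a tiled, streaming LogSumExp over biased dot-product scores."""
    run_max, run_sum = -math.inf, 0.0
    for start in range(0, len(ys), tile):
        scores = []
        for j in range(start, min(start + tile, len(ys))):
            bias = (g[j] - sum(c * c for c in ys[j])) / eps + log_b[j]  # per-column bias
            dot = sum(a * c for a, c in zip(x, ys[j]))
            scores.append(2.0 * dot / eps + bias)
        new_max = max(run_max, max(scores))
        # Rescale the running sum to the new maximum before adding this tile.
        run_sum = run_sum * math.exp(run_max - new_max) + sum(math.exp(s - new_max) for s in scores)
        run_max = new_max
    return sum(a * a for a in x) - eps * (run_max + math.log(run_sum))


x = [0.5, -1.0]
ys = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0], [2.0, -1.0], [0.3, 0.3]]
g = [0.1, -0.2, 0.3, 0.0, 0.5]
log_b = [math.log(0.2)] * 5
direct = f_update_direct(x, ys, g, log_b, eps=0.5)
streamed = f_update_streaming(x, ys, g, log_b, eps=0.5)
```

Because only a running maximum and a running sum are kept per row, the dense cost matrix never has to be materialized, which is the property that makes attention-style fusion applicable.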

2602.01935 2026-05-22 cs.LG cs.AI cs.PL

LiteCoOp: Lightweight Multi-LLM Shared-Tree Reasoning for Model-Serving Compiler Optimizations

Annabelle Sujun Tang, Christopher Priebe, Lianhui Qin, Hadi Esmaeilzadeh

Affiliations Alternative Computing Technologies (ACT) Lab, University of California San Diego

AI summary This paper proposes LiteCoOp, a lightweight framework that turns the optimization search tree itself into the mechanism for multi-LLM collaboration, letting heterogeneous LLMs cooperate during compiler optimization and improving performance while lowering compilation cost.

Abstract

LLM-guided compiler optimization has recently shown promise, but existing approaches rely on a single large LLM throughout search, making them expensive and excluding smaller models. We ask whether heterogeneous LLMs can collaborate during compiler optimization while keeping compilation cost below that of optimization guided by a single large LLM. Crucially, this must be achieved without introducing overhead from agentic frameworks, which would run counter to the goal of lower compilation cost. To achieve these competing objectives, we introduce LiteCoOp, a lightweight framework that turns the optimization search tree itself into the mechanism for multi-LLM collaboration, enabling heterogeneous models to share progress without external agentic coordination. At each optimization step, LiteCoOp queries one LLM to both propose a compiler transformation and select the LLM to query at the next step. These LLM proposals are recorded in a shared MCTS tree, so all models are invoked serially and yet are informed by each other's decisions. The shared MCTS backpropagates the rewards, allowing progress made by one model to influence later decisions by others. This makes the MCTS tree the collaborative reasoning mechanism itself, avoiding inter-model communication, heavy reasoning traces, or agentic infrastructure. We instantiate this idea with an LLM-aware UCT that biases model selection toward smaller LLMs to reduce cost while still preserving the compiler performance objective. Across diverse GPU (and CPU) benchmarks, LiteCoOp consistently outperforms single-model baselines, with the best results obtained when scaling collaboration to eight heterogeneous LLMs. This eight-model configuration reduces total compilation time by 1.95x (1.74x), reduces API cost by 4.47x (4.32x), and invokes the largest model for only 23.1% (23.9%) of total calls while demonstrating collaboration scalability.
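
A cost-biased UCT rule of the kind described above might look like the following sketch. The abstract does not give the formula, so the form used here (the standard UCT score minus a penalty proportional to a per-model cost) is an assumption, and the function names, constants, and statistics are hypothetical.

```python
import math


def llm_aware_uct(mean_reward, visits, parent_visits, model_cost, c=1.4, cost_weight=0.2):
    """Standard UCT score with an assumed linear penalty on the model's relative cost."""
    if visits == 0:
        return math.inf  # always try an unvisited option first
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_reward + explore - cost_weight * model_cost


def pick_model(stats, parent_visits, cost_weight=0.2):
    """Choose the model whose tree statistics give the highest cost-aware score."""
    return max(
        stats,
        key=lambda name: llm_aware_uct(
            stats[name]["mean"], stats[name]["visits"], parent_visits,
            stats[name]["cost"], cost_weight=cost_weight,
        ),
    )


# Toy statistics: the large model is slightly better per call but far more expensive.
stats = {
    "small": {"mean": 0.50, "visits": 10, "cost": 0.1},
    "large": {"mean": 0.55, "visits": 10, "cost": 1.0},
}
with_bias = pick_model(stats, parent_visits=20)
without_bias = pick_model(stats, parent_visits=20, cost_weight=0.0)
```

With the penalty switched off the rule falls back to ordinary UCT and prefers the larger model, so the cost weight is the knob that trades search quality against API spend in this sketch.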

2602.01851 2026-05-22 cs.CV

How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan

Affiliations School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Language Technology Lab, University of Cambridge; Peking University; South China University of Technology; Nanjing University

AI summary This paper proposes the VIBE benchmark for evaluating visual instruction-driven image editing models. Its three-level interaction hierarchy covers deictic grounding, morphological manipulation, and causal reasoning, and the evaluation finds that proprietary models show early-stage capabilities and outperform open-source models, but that performance degrades as task difficulty increases.

Comments https://vibe-benchmark.github.io/

Abstract

Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.

2602.01760 2026-05-22 cs.CV

MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement

Hao Zhang, Yanping Zha, Zizhuo Li, Meiqi Gong, Jiayi Ma

Affiliations * Electronic Information School, Wuhan University, China; Suzhou Institute of Wuhan University, China; School of Automation, Wuhan University, China

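AI summary This paper proposes MagicFuse, a single-image fusion framework that uses diffusion models to generate a cross-spectral scene representation under both visual and semantic constraints. Experiments show performance comparable to or better than fusion methods that take multi-modal inputs.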

Comments Accepted by CVPR 2026

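Details
AI abstract (translated from Chinese)

This paper focuses on a highly practical scenario: how to keep benefiting from the advantages of multi-modal image fusion when only visible imaging sensors are available. To this end, we propose a new concept of single-image fusion, which extends fusion to the knowledge level. Specifically, we develop MagicFuse, a new single-image fusion framework that can derive a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch, both based on diffusion models. The former mines scene information obscured in the visible spectrum, and the latter learns thermal radiation distribution patterns transferred to the infrared spectrum. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of the two branches, so that a cross-spectral scene representation can be obtained through successive sampling. We then impose visual and semantic constraints to ensure that this scene representation satisfies human observation while supporting downstream semantic decision-making. Extensive experiments show that, despite relying only on a single degraded visible image, MagicFuse achieves visual and semantic representation performance comparable to or better than state-of-the-art fusion methods with multi-modal inputs. The code is publicly available at https://github.com/zhayanping/MagicFuse.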

English abstract

This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image. The code is publicly available at https://github.com/zhayanping/MagicFuse.

2602.01279 2026-05-22 cs.LG

Richer Bayesian Last Layers with Subsampled NTK Features

Sergio Calvo-Ordoñez, Jonathan Plenk, Richard Bergna, Álvaro Cartea, Yarin Gal, Jose Miguel Hernández-Lobato, Kamil Ciosek

Affiliations * Mathematical Institute, University of Oxford; Oxford-Man Institute, University of Oxford; OATML, University of Oxford; Department of Engineering, University of Cambridge

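AI summary This paper proposes a method that improves Bayesian last layers by projecting Neural Tangent Kernel features onto the space spanned by the last-layer features, giving more accurate uncertainty estimates while remaining computationally efficient.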

Comments Appearing in the Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

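Details
AI abstract (translated from Chinese)

Bayesian Last Layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, because they apply a Bayesian treatment only to the final layer and ignore the uncertainty induced by earlier layers, they underestimate epistemic uncertainty. We propose a method that projects Neural Tangent Kernel (NTK) features onto the space spanned by the last-layer features, enabling posterior inference that accounts for the variability of the full network while retaining the low computational cost of standard BLL inference. We prove that the resulting posterior variances are greater than or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference, and we derive approximation bounds for both types of subsampling. Empirical evaluations on UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks, across image and tabular datasets, show improved calibration and uncertainty estimates compared with standard BLLs and competitive baselines, while reducing computational cost.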

English abstract

Bayesian Last Layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, they underestimate epistemic uncertainty because they apply a Bayesian treatment only to the final layer, ignoring uncertainty induced by earlier layers. We propose a method that improves BLLs by leveraging a projection of Neural Tangent Kernel (NTK) features onto the space spanned by the last-layer features. This enables posterior inference that accounts for variability of the full network while retaining the low computational cost of inference of a standard BLL. We show that our method yields posterior variances that are provably greater or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference. We derive approximation bounds for both types of subsampling. Empirical evaluations on UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks in image and tabular datasets, demonstrate improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines, while reducing computational cost.
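For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of a standard Bayesian last layer, written as Bayesian linear regression on fixed features. It is not the paper's NTK-projection method. The function names, the random tanh feature map standing in for a trained network's last-layer features, and the values of `prior_precision` and `noise_variance` are all illustrative assumptions.

```python
import numpy as np


def bll_posterior(features, targets, prior_precision=1.0, noise_variance=0.1):
    """Gaussian posterior over last-layer weights, with the features held fixed."""
    dim = features.shape[1]
    # Posterior precision = prior precision + data term scaled by the noise variance.
    precision = prior_precision * np.eye(dim) + features.T @ features / noise_variance
    covariance = np.linalg.inv(precision)
    mean = covariance @ features.T @ targets / noise_variance
    return mean, covariance


def bll_predictive_variance(test_features, covariance):
    """Epistemic variance phi(x)^T Sigma phi(x) for each row of test_features."""
    return np.einsum("nd,de,ne->n", test_features, covariance, test_features)


rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained network's last-layer features.
W = rng.normal(size=(1, 16))
b = rng.normal(size=16)


def phi(x):
    return np.tanh(x[:, None] @ W + b)


x_train = rng.uniform(-1.0, 1.0, size=50)
y_train = np.sin(3.0 * x_train) + 0.1 * rng.normal(size=50)

mean, cov = bll_posterior(phi(x_train), y_train)
x_test = np.array([0.0, 5.0])  # one in-range input, one far from the training data
var = bll_predictive_variance(phi(x_test), cov)
```

Only the last-layer weights are treated as random here, so `var` reflects no uncertainty from earlier layers. This is the source of the underestimation that the abstract describes.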