arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02438 2026-06-02 cs.AI

LLM-Evolved Pattern Generators for Optimal Classical Planning

Windy Phung, Dominik Drexler, Arnaud Lequen, Jendrik Seipp

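AI summary: Proposes the first method for learning admissible heuristics for optimal classical planning via LLM-driven evolutionary program synthesis, combined with saturated cost partitioning so that A* search remains optimal.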

Abstract

Learned heuristics have recently become a competitive alternative to traditional domain-independent heuristics for satisficing planning. Existing approaches, however, focus on improving search guidance rather than guaranteeing admissibility, which makes them unsuitable for optimal classical planning. We present the first method for learning domain-dependent heuristics that are admissible by design and thus preserve the optimality guarantees of A* search. Instead of learning a direct mapping from states to heuristic values, we learn to construct abstractions that induce admissible heuristics. We use an LLM-driven evolutionary program-synthesis framework to obtain, for each domain, a program that produces a pattern collection for any task in that domain, and we combine the resulting patterns admissibly via saturated cost partitioning. Empirically, the learned programs encode interpretable domain-specific insights, run with negligible overhead at test time and yield heuristics that match the coverage of state-of-the-art domain-independent baselines on several domains while evaluating each state substantially faster.

2606.02436 2026-06-02 cs.CV

Geometry-Aware Implicit Memory for Video World Models

Zhengxuan Wei, Xu Guo, Xinghui Li, Xunzhi Xiang, Min Wei, Yiran Zhu, Qiulin Wang, Xintao Wang, Pengfei Wan, Xiangwang Hou, Qi Fan

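AI summary: Proposes the GIM-World framework, in which a lightweight Transformer encoder compresses variable-length history into fixed-size memory tokens and a camera-queryable geometry head distills 3D scene structure from a frozen foundation model during training, preserving geometric and visual consistency in long-horizon video generation.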

Comments
Project page: https://gim-world.github.io/
Abstract

Video world models aim to simulate controllable visual environments, but long-horizon rollouts depend on what the model remembers after observations leave its native context window. Explicit memories retain frames or online 3D reconstructions, which can suffer from heuristic retrieval errors, redundant appearance storage, or reconstruction artifacts. Implicit memories compress history into a compact state, but existing designs are not explicitly constrained to encode cross-view scene geometry. We propose GIM-World, a geometry-aware implicit memory framework for video world models. A lightweight transformer encoder compresses variable-length history into fixed-size memory tokens, a camera-queryable geometry head distills 3D scene structure from a frozen foundation model into the memory during training, and an information-guided pruning rule keeps encoding cost bounded as history grows. The geometry teacher is discarded at inference, leaving a lightweight memory module. Experiments on MIND show that GIM-World better preserves long-horizon geometric and visual consistency than both explicit- and implicit-memory baselines.

2606.02434 2026-06-02 cs.AI

Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization

Yusuke Ohtsubo, Kota Dohi, Koichiro Yawata, Koki Takeshita, Tatsuya Sasaki

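AI summary: Proposes a visual program synthesis framework that uses input binarization to strip texture and noise from scanning electron microscope images so that the vision-language model focuses on geometric structure, bridging the sim-to-real gap and raising the mean Dice coefficient on the MIIC dataset from 0.4393 to 0.5256.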

Abstract

Precise parametric control over circuit geometry is essential for semiconductor inspection, yet obtaining sufficient real training data remains costly. Although generative models such as diffusion models and Generative Adversarial Networks (GANs) can augment training data, they cannot guarantee the nanometer-scale geometric accuracy required for metrology tasks. We propose a visual program synthesis framework in which a Vision-Language Model (VLM) converts inspection images into editable Domain-Specific Language (DSL) code describing circuit geometries, enabling controlled generation of training data with exact parameter manipulation. Because the VLM is trained solely on synthetic DSL-rendered data, a domain gap arises when processing real Scanning Electron Microscope (SEM) images. We bridge this gap with an input binarization strategy that strips SEM-specific texture and noise, letting the model focus on geometric structure. On the MIIC dataset, binarized inputs improve the mean Dice coefficient from 0.4393 to 0.5256 over the raw-input baseline, demonstrating that simple texture abstraction substantially mitigates the sim-to-real gap.
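The two quantities named in this abstract, input binarization and the Dice coefficient, can be illustrated with a minimal sketch. The fixed global threshold and the toy 3x3 patch are assumptions made for illustration only; the abstract does not say how the authors binarize SEM images or which masks they compare.

```python
from typing import List

Image = List[List[float]]
Mask = List[List[int]]


def binarize(image: Image, threshold: float) -> Mask:
    """Map each pixel to 1 if it is at or above `threshold`, else 0.

    A global threshold is an assumption; the paper's rule is not given here.
    """
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def dice(pred: Mask, target: Mask) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * inter / total


# Hypothetical patch: a bright vertical line on a noisy dark background.
patch = [[0.9, 0.8, 0.1],
         [0.2, 0.7, 0.3],
         [0.1, 0.9, 0.2]]
mask = binarize(patch, threshold=0.5)
reference = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
score = dice(mask, reference)
```

Thresholding discards intensity variation inside each region, which is the kind of texture abstraction the abstract credits for narrowing the domain gap.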

2606.02433 2026-06-02 cs.IR cs.AI cs.CL cs.LG cs.MA

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia

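AI summary: Introduces a future forecasting and reasoning task for open-domain tabular question answering, builds the first dataset covering time-series forecasting and forecast-based reasoning, and addresses historical-data retrieval, forecasting limitations, and response standardization with TimeFore, an LLM agent-based framework (Retriever, Forecaster, Analyzer).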

Comments
This paper has been accepted by Findings of ACL 2026
Abstract

The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, and propose the first dataset to cover time-series forecasting and forecast-based reasoning scenarios using real estate data. This task poses challenges in retrieving precise historical data, overcoming the forecasting limitations of LLMs, and standardizing responses for diverse queries. To address these challenges, we propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: a Retriever autonomously generates SQL to fetch data, a Forecaster invokes external time-series models for higher accuracy, and an Analyzer synthesizes the results to construct a precise and consistent final answer. Extensive experiments demonstrate the effectiveness of TimeFore.

2606.02432 2026-06-02 cs.RO

NDPP-Grasp: Non-Differentiable Physical Plausibility Constraint-Guided Task-Oriented Dexterous Grasp Generation

Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu

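AI summary: Proposes a framework that injects non-differentiable physical plausibility constraints directly into the denoising process of a task-aligned grasp diffusion model, yielding physically plausible dexterous grasps while preserving task alignment.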

Abstract

Task-oriented dexterous grasp generation aims to produce dexterous grasp poses that are both physically plausible and functionally suitable for specified manipulation tasks. Existing diffusion-based methods often address these two requirements in a decoupled manner: they first train a grasp diffusion model for task alignment and then rely on post-generation refinement to improve physical plausibility. However, this after-the-fact correction strategy applies physical plausibility guidance only once the grasp has already been generated, leaving the generation trajectory itself unguided by physical constraints and potentially leading to suboptimal grasps. To address this problem, we propose a novel framework that directly injects physical plausibility guidance into the denoising process of a task-aligned grasp diffusion model in a practical and effective manner, even when physical plausibility constraints are non-differentiable. This allows physical plausibility to shape grasp generation throughout denoising while preserving task alignment. Extensive experiments demonstrate the efficacy of our framework.

2606.02430 2026-06-02 cs.DC cs.AI

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

Yafan Huang, Sheng Di, Guanpeng Li

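AI summary: Using the proposed LLMFI fault-injection framework, systematically studies how soft errors propagate in LLM inference, reveals critical vulnerability patterns, and proposes four low-overhead, software-level directions for improving reliability.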

Comments
Accepted at ICS'26
Abstract

Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery through diverse perspectives such as code generation and domain-specific decision-making. Yet, how soft errors propagate and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study on error propagation in LLM inference, enabled by our proposed LLMFI, a configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults across three open-weighted LLMs and thirteen representative tasks, covering reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions to improve reliability through software-only modification, offering practical guidance for future error detection and mitigation.

2606.02427 2026-06-02 math.NA cs.LG cs.NA

Spectral Audit of In-Context Operator Networks

Zhiwei Gao, Liu Yang, George Em Karniadakis

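AI summary: Proposes a Jacobian-based spectral audit that analyzes local spectral properties in in-context operator learning (frequency gains, phase structure, cross-mode coupling) to assess whether a model has learned the local dynamics of the PDE operator rather than only its output predictions.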

Abstract

Existing evaluations of neural operators and in-context operator learning rely primarily on prediction error, but accurate output prediction does not guarantee the correct local dynamical structure. A model may match solutions while exhibiting incorrect sensitivities, distorted frequency response, spurious mode coupling, or unstable tangent behavior. We introduce a Jacobian-based spectral audit for in-context operator learning. For a fixed prompt, we differentiate the network output with respect to the query function and view the resulting Jacobian as a learned tangent operator. Projecting it onto Fourier modes, we obtain a local spectral characterization of the inferred operator, including frequency-dependent gains, phase structure, and cross-mode coupling. The audit complements standard prediction metrics by testing whether the model reproduces local mechanisms of the underlying PDE operator rather than only outputs. Across benchmarks, the audit reveals distinct operator-level phenomena, including phase transport, viscosity-dependent damping, nonlinear mode coupling, and reaction-diffusion stability structure. It also detects failures partially hidden by prediction-error metrics, including high-frequency degradation, incorrect phase recovery, and prompt-operator inconsistencies. Corrupted or internally inconsistent prompts lead to degraded tangent-operator structure even when pointwise predictions remain partially accurate. Our results suggest that prediction accuracy and local operator fidelity are distinct properties of learned neural operators. Our framework also provides a diagnostic for stability, sensitivity, and operator consistency.
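The audit described above (differentiate the output with respect to the query, then project the Jacobian onto Fourier modes) can be sketched on a toy problem. Everything concrete here is an assumption for illustration: `shifted_damped` is a hypothetical stand-in for the network with a fixed prompt, the Jacobian is taken by finite differences rather than automatic differentiation, and the grid has only 8 points.

```python
import cmath
import math
from typing import Callable, List

Vec = List[float]


def shifted_damped(u: Vec) -> Vec:
    """Hypothetical stand-in for the model: periodic shift by one cell, halved."""
    n = len(u)
    return [0.5 * u[(m - 1) % n] for m in range(n)]


def jacobian(f: Callable[[Vec], Vec], u: Vec, eps: float = 1e-6) -> List[Vec]:
    """Central finite-difference Jacobian, J[m][j] = d f(u)[m] / d u[j]."""
    n = len(u)
    columns = []
    for j in range(n):
        up, down = u[:], u[:]
        up[j] += eps
        down[j] -= eps
        columns.append([(a - b) / (2 * eps) for a, b in zip(f(up), f(down))])
    # columns[j][m] stores column j; transpose into row-major J[m][j].
    return [[columns[j][m] for j in range(n)] for m in range(n)]


def fourier_projection(J: List[Vec]) -> List[List[complex]]:
    """Jhat[k][l] = (1/n) * sum_{m,j} e^{-2 pi i k m/n} J[m][j] e^{+2 pi i l j/n}."""
    n = len(J)

    def w(k: int, m: int) -> complex:
        return cmath.exp(-2j * math.pi * k * m / n)

    return [[sum(w(k, m) * J[m][j] * w(l, j).conjugate()
                 for m in range(n) for j in range(n)) / n
             for l in range(n)] for k in range(n)]


N = 8
query = [math.sin(2 * math.pi * i / N) for i in range(N)]
J_hat = fourier_projection(jacobian(shifted_damped, query))

gains = [abs(J_hat[k][k]) for k in range(N)]           # frequency-dependent gain
phases = [cmath.phase(J_hat[k][k]) for k in range(N)]  # phase structure
coupling = max(abs(J_hat[k][l])                        # cross-mode coupling
               for k in range(N) for l in range(N) if k != l)
```

For this translation-invariant toy operator the projected Jacobian is diagonal: every mode has gain 0.5, mode k picks up a phase of -2πk/N from the shift, and the off-diagonal coupling vanishes up to rounding. A learned model that matched outputs but showed large off-diagonal entries would be exhibiting the spurious mode coupling the abstract mentions.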

2606.02424 2026-06-02 cs.CV cs.AI cs.LG

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

Kaito Shiku, Ahtisham Fazeel Abbasi, Ryoma Bise, Yuichiro Iwashita, Kazuya Nishimura, Andreas Dengel, Muhammad Nabeel Asim

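AI summary: Proposes GC-MoE, which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts, together with a cell-type-specific co-expression-aware predictor and a cell-to-cell interaction attention module, to predict single-cell gene expression from histology images and cell locations; it outperforms existing methods on public datasets.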

Abstract

Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopathological images and cell locations, reducing the need for costly single-cell ST measurements. Unlike existing histology-to-ST methods that mainly predict spot-level profiles for local regions containing multiple cells, this task requires modeling cell-to-cell expression variability, which is strongly structured by cell type. We propose Genomics-Guided Cell-Type-Specific Mixture-of-Experts (GC-MoE), which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts for gene expression prediction. To further encode cell-type-dependent gene programs, we introduce the Cell-Type-Specific Co-Expression-Aware Predictor (CAP), together with a lightweight Cell-to-Cell Interaction Attention (C2CA) module for neighboring-cell context. Experiments and ablations on public single-cell ST datasets show consistent improvements over existing single-cell and adapted spot-level baselines.
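The soft combination of experts described above can be sketched generically. This is a plain mixture-of-experts forward pass, not the GC-MoE architecture itself: the router, the two constant experts, and the two-gene output are hypothetical placeholders.

```python
import math
from typing import Callable, List, Sequence

Features = Sequence[float]
Expert = Callable[[Features], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_predict(features: Features,
                router: Callable[[Features], Sequence[float]],
                experts: Sequence[Expert]) -> List[float]:
    """Weight each expert's gene-expression prediction by its routing probability."""
    probs = softmax(router(features))
    outputs = [expert(features) for expert in experts]
    n_genes = len(outputs[0])
    return [sum(p * out[g] for p, out in zip(probs, outputs))
            for g in range(n_genes)]


# Hypothetical setup: two cell types, two genes, an undecided router.
experts = [lambda f: [1.0, 0.0], lambda f: [0.0, 2.0]]
router = lambda f: [0.0, 0.0]
prediction = moe_predict([0.3, 0.7], router, experts)
```

Soft weighting keeps the prediction differentiable in the routing probabilities and avoids committing to a single cell type when the router is uncertain.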

2606.02423 2026-06-02 cs.CL cs.LG

Investigating and Alleviating Harm Amplification in LLM Interactions

Ruohao Guo, Wei Xu, Alan Ritter

AI summary: Proposes the HarmAmp benchmark and the TrajSafe monitor for evaluating and mitigating harm amplification by LLMs in multi-turn conversations.

Abstract

Large language models (LLMs) can serve as helpful assistants, yet they can equally function as harm amplifiers that enable malicious users to achieve harmful outcomes beyond their capabilities through extended interactions. This risk manifests along two axes, i.e., democratizing domain expertise that allows novices to produce specialized harmful content, and scaling harmful operations at volumes that manual effort cannot match. Existing works, however, often overlook how LLMs compound harm across multi-turn conversations. We introduce HarmAmp, a new benchmark for multi-turn harm amplification scenarios spanning twelve risk categories. Each scenario is grounded in real-world threats and satisfies rigorous criteria, i.e., substantive amplification, operational specificity, and multi-turn necessity. We further propose TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users' genuine intents and steering the models towards safer completion. Our extensive experiments demonstrate that TrajSafe significantly reduces the harmfulness incurred in multi-turn interactions while preserving a low over-refusal rate and the target model's general capabilities. Our work offers a promising paradigm to alleviate the nuanced safety risks in LLM interactions.

2606.02418 2026-06-02 quant-ph cs.AI

Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search

Juan Cruz-Benito, Andrew W. Cross, David Kremer, Ismael Faro

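AI summary: Proposes an LLM-guided evolutionary workflow that mutates Python programs generating bivariate bicycle codes and perturbed variants. Over about 1,650 iterations it screens about 2×10^5 candidate codes and finds 465 distinct candidates, including non-CSS perturbed codes and CSS codes, showing that LLM-guided program evolution is practical for structured quantum-code discovery.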

Abstract

Quantum LDPC code discovery requires searching large algebraic design spaces while reliably certifying the parameters and equivalence classes of any candidates found. We introduce an LLM-guided evolutionary workflow in which language models mutate Python programs that generate bivariate-bicycle and perturbed bivariate-bicycle code ansätze. Across five campaigns, the system performed approximately 1,650 evolutionary iterations, screened about 2 × 10^5 candidate codes, and required ~140 hours of computation and ~US$400 in LLM inference cost. Candidate codes are evaluated through a staged validation pipeline combining GF(2) rank computation, distance estimation and certification, mixed-integer linear programming, BLISS Tanner-graph deduplication, decomposability analysis, and local-Clifford equivalence checks. At block length n ≤ 360, the workflow identifies 465 distinct candidate codes: 97 CSS bivariate-bicycle codes and 368 non-CSS perturbed variants. The CSS search recovers known high-performing codes and finds new finite-length representatives, including an indecomposable [[288,16,12]] code and higher-weight codes with up to k = 50 at distance d = 8. The non-CSS search produces perturbed codes matching the gross-code figure of merit at [[144,12,12]], along with additional high-distance candidates reported as certified values or upper bounds according to MILP status. Overall, these results show that LLM-guided program evolution can serve as a practical tool for structured quantum-code discovery when paired with independent evaluation.
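The GF(2) rank step of such a validation pipeline can be sketched for a CSS bivariate-bicycle code. This is a minimal illustration, not the authors' pipeline. The construction Hx = [A | B], Hz = [Bᵀ | Aᵀ] with A and B sums of monomials in commuting cyclic shifts x and y follows the standard bivariate-bicycle definition; the specific parameters (l = 12, m = 6, A = x³ + y + y², B = y³ + x + x²) are recalled from the literature on the gross code and should be treated as an assumption.

```python
from typing import List, Tuple

Monomial = Tuple[int, int]  # (a, b) stands for x^a * y^b


def poly_matrix(terms: List[Monomial], l: int, m: int) -> List[int]:
    """Rows of the (l*m) x (l*m) binary matrix for a sum of monomials.

    Each row is a Python int used as a bitmask over columns.
    """
    rows = []
    for i in range(l):
        for j in range(m):
            row = 0
            for a, b in terms:
                col = ((i + a) % l) * m + (j + b) % m
                row ^= 1 << col  # XOR is addition over GF(2)
            rows.append(row)
    return rows


def transpose(rows: List[int], width: int) -> List[int]:
    out = [0] * width
    for r, row in enumerate(rows):
        for c in range(width):
            if (row >> c) & 1:
                out[c] |= 1 << r
    return out


def gf2_rank(rows: List[int]) -> int:
    """Gaussian elimination over GF(2), keyed by each row's leading bit."""
    pivots = {}
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = row
                break
            row ^= pivots[lead]
    return len(pivots)


def bb_checks(l: int, m: int, a_terms: List[Monomial], b_terms: List[Monomial]):
    """Return (Hx, Hz) with Hx = [A | B] and Hz = [B^T | A^T]."""
    size = l * m
    A, B = poly_matrix(a_terms, l, m), poly_matrix(b_terms, l, m)
    At, Bt = transpose(A, size), transpose(B, size)
    hx = [A[r] | (B[r] << size) for r in range(size)]
    hz = [Bt[r] | (At[r] << size) for r in range(size)]
    return hx, hz


# Assumed gross-code parameters (recalled from the literature, not from this abstract).
l, m = 12, 6
hx, hz = bb_checks(l, m, [(3, 0), (0, 1), (0, 2)], [(0, 3), (1, 0), (2, 0)])
n = 2 * l * m
# CSS condition: every X check overlaps every Z check on an even number of qubits.
commutes = all(bin(rx & rz).count("1") % 2 == 0 for rx in hx for rz in hz)
rank_x, rank_z = gf2_rank(hx), gf2_rank(hz)
k = n - rank_x - rank_z  # number of logical qubits
```

The rank computation yields n and k only; the distance d requires the separate estimation and certification stages listed in the abstract.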

2606.02406 2026-06-02 cs.CV

Edge Prediction for Roof Wireframe Reconstruction with Transformers

Gustav Hanning, Ludvig Dillén, Jonathan Astermark, Johanna Lidholm, Viktor Larsson

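AI summary: Proposes an end-to-end Transformer encoder-decoder architecture that reconstructs 3D roof wireframes from sparse SfM point clouds and semantic segmentation maps, achieving a Hybrid Structure Score of 0.6476 on the HoHo 22k dataset and second place in the challenge.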

Comments
Presented at the 3rd Urban Scene Modeling (USM3D) Workshop at CVPR 2026
Abstract

This paper presents a competitive solution to the S23DR Challenge 2026, which aims to reconstruct 3D house roof wireframe models from sparse SfM point clouds and ground-level semantic segmentations and depth maps. Our proposed method utilizes an end-to-end Transformer encoder-decoder architecture inspired by DETR. To effectively process the geometric and semantic data, the sparse SfM point cloud input is dynamically subsampled based on semantic priority and augmented with Gestalt and ADE20k class features. To further increase segmentation context, we fuse the point features with additional Gestalt feature encodings which are obtained by projecting the points into latent feature maps produced by a frozen autoencoder. Learned query embeddings are then decoded directly into 3D wireframe edges via cross-attention mechanisms. Evaluated on the "HoHo 22k" dataset, our approach significantly outperforms both handcrafted and learned baselines, achieving a Hybrid Structure Score (HSS) of 0.6476 and securing the second-highest position on the challenge's private leaderboard.

2606.02404 2026-06-02 cs.CL

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim

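AI summary: Builds K-BrowseComp, a 400-problem web-browsing agent benchmark grounded in Korean contexts, evaluates frontier models, finds their accuracy substantially lower than on BrowseComp, and releases the data and code.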

Abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.

2606.02402 2026-06-02 cs.CV

Explainable Forensics of Manipulated Segments in Untrimmed Long Videos

Yue Feng, Jingjing Li, Qijia Lu, Wei Ji, Jingrou Zhang, Fei Shen, Xiao Li, Yizhen Jia, Qiang Chen, Limin Wang, Wentong Li, Jie Qin

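AI summary: For localizing and explaining AI-generated segments in long videos, proposes the TASLE benchmark dataset and the coarse-to-fine forensic method MSLoc, covering temporal localization, authenticity detection, and interpretable analysis.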

Comments
Accepted to ICML 2026
Abstract

The rapid advancement of AI-driven video generation has transformed content creation, while simultaneously increasing the risk of misinformation through localized manipulations in long-form videos. Existing video forensic methods predominantly operate on short, independent clips, and thus fail to capture realistic scenarios where AI-generated content is sparsely embedded within otherwise authentic footage. To bridge this gap, we formulate the task of Temporal AI-Generated Segment Localization and Explanation, which targets authenticity detection, temporal localization, and interpretable analysis of manipulated segments in untrimmed long videos. We further introduce TASLE, a large-scale benchmark comprising 12,472 untrimmed videos with diverse manipulation patterns and rich annotation signals, including temporal boundaries, authenticity labels, and segment-level rationales. In addition, we propose MSLoc, a coarse-to-fine forensic baseline that combines a boundary-sensitive proposal generation module for efficient long-video scanning with an MLLM-based refinement module for precise boundary localization and interpretable reasoning. Experiments validate the effectiveness of the proposed baseline, highlighting the importance of segment-level explainable forensics for long-form AI-generated video analysis. Our dataset and code are publicly available at https://debby-0527.github.io/TASLE.

2606.02398 2026-06-02 cs.LG cs.CL

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

Lei Yang, Siyu Ding, Deyi Xiong

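AI summary: To address the degradation of one domain's performance during multi-domain RL training, proposes a local perturbation theory showing that later-domain training harms earlier domains mainly through a second-order damage term concentrated in a low-dimensional shared conflict subspace, and that a short domain refresh enables selective recovery.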

Abstract

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code → Math → QA → CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.

2606.02388 2026-06-02 cs.LG cs.AI

Policy and World Modeling Co-Training for Language Agents

Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang

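AI summary: Proposes the PaW framework, which adds auxiliary world-model supervision during reinforcement learning without changing the inference paradigm, improving language agents across multiple tasks.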

Comments
9 pages, 6 figures
Abstract

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
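The observation that each rollout transition already pairs an action with its resulting next observation can be made concrete with a small sketch. This is not PaW's implementation: the `Transition` record, the selection rule (keep transitions whose action entropy is at or above a threshold), and the threshold value are all hypothetical, since the abstract names action-entropy-based selection without specifying its direction or details.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Transition:
    observation: str
    action: str
    next_observation: str
    action_probs: Sequence[float]  # policy distribution over candidate actions


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def wm_examples(rollout: List[Transition],
                min_entropy: float) -> List[Tuple[Tuple[str, str], str]]:
    """Build ((observation, action), next_observation) pairs for world-model supervision.

    Keeping only high-entropy steps is an assumed rule, used here for illustration.
    """
    return [((t.observation, t.action), t.next_observation)
            for t in rollout if entropy(t.action_probs) >= min_entropy]


# Hypothetical two-step rollout from a text environment.
rollout = [
    Transition("door is closed", "open door", "door is open", [0.5, 0.5]),
    Transition("door is open", "walk through", "you are outside", [0.99, 0.01]),
]
pairs = wm_examples(rollout, min_entropy=0.5)
```

Because the pairs come from the same rollouts used for the policy update, no separate simulator or extra data collection is needed, which is the point the abstract makes.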

2606.02385 2026-06-02 q-bio.NC cs.LG

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

William Dorrell

AI总结 本文通过扩展局部最优性分析到非负联合优化问题,推导出稀疏自编码器(SAE)最优特征与数据分布之间的约束,解释了层级分裂与吸收、残差结构和密集对映特征等行为,并构建了新型大字典凸问题以探索宽原子-数据点极限。

详情
Comments
27 pages, 5 figures
AI中文摘要

稀疏自编码器(SAE)已成功将神经表示解析为可解释的概念,为理解和控制提供了基础。然而,SAE究竟提取了什么,以及我们据此能得出哪些科学结论,并不明显。经验上,证据在于结果:SAE学习了可解释的特征。理论上,我们缺乏一个清晰的解释,说明一个“概念”必须满足什么属性才能被SAE提取。已有大量可识别性工作研究稀疏编码恢复真实特征的条件,但这些方法往往关注简单的数据生成模型(如稀疏独立特征),这些模型难以近似SAE所训练的、吞噬互联网的语言模型表示。在此,我们避免数据生成模型,仅询问任何字典学习最优解必须满足什么属性。具体地,我们将局部最优性分析(Gribonval & Schnass, 2010)扩展到普通SAE近似的非负联合优化问题,并推导出最优SAE特征与其分布之间的约束。我们利用这些约束解释了一系列观察到的SAE行为——层级分裂与吸收、残差结构以及密集对映特征——每个都反映了L1+非负性如何与数据交互以结构化最优字典。最后,我们构建了一个新颖的大字典凸问题,并探索了宽原子-数据点极限。总之,我们希望将模型假设与意外观察区分开,从而从SAE的成功中学到更多,并为设计其继任者提供原则。

英文摘要

Sparse Autoencoders (SAEs) have found success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly SAEs extract, and, correspondingly, the scientific conclusions we can draw from them, are not obvious. Empirically, the proof is in the pudding: SAEs learn interpretable features. Theoretically, we lack a clear account of what properties a 'concept' must satisfy for an SAE to extract it. There has been extensive identifiability work studying the conditions under which sparse coding recovers ground-truth features; however, these approaches tend to focus on simple data-generating models (e.g., sparse independent features) which poorly approximate the internet-swallowing language-model representations on which SAEs are trained. Here, avoiding data-generating models, we ask simply what properties any dictionary learning optimum must satisfy. Concretely, we extend local optimality analyses (Gribonval & Schnass, 2010) to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these constraints to explain a range of observed SAE behaviours - hierarchical splitting & absorption, the structure of residuals, and dense antipodal features - each reflecting how L1+nonnegativity interact with data to structure optimal dictionaries. Finally, we construct a novel large-dictionary convex problem and explore the wide atom-per-datapoint limit. In sum, we hope to tease model assumptions from unexpected observations, letting us learn more from SAEs' successes and provide principles for designing their successors.
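For orientation, a nonnegative sparse dictionary-learning objective of the kind the abstract refers to is commonly written as follows. This is a generic textbook form, not necessarily the exact problem analysed in the paper: the factor of one half, the penalty weight λ, and the unit-norm constraint on dictionary atoms are assumptions.

```latex
\min_{D,\; a_i \ge 0} \;\; \sum_{i} \Big( \tfrac{1}{2}\,\lVert x_i - D a_i \rVert_2^2
  + \lambda\, \lVert a_i \rVert_1 \Big)
\quad \text{subject to} \quad \lVert d_j \rVert_2 \le 1
\;\; \text{for every column } d_j \text{ of } D .
```

Here x_i are the data points (model activations), D is the dictionary, and a_i are the nonnegative codes; "joint" refers to optimising over D and all a_i together, and the L1 term plus the nonnegativity constraint are the two ingredients the abstract singles out.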

2606.02384 2026-06-02 cs.LG

TabPrep: Closing the Feature Engineering Gap in Tabular Benchmarks

TabPrep: 弥合表格基准测试中的特征工程差距

Andrej Tschalzev, Nick Erickson, Yuyang Wang, Huzefa Rangwala, Stefan Lüdtke, Heiner Stuckenschmidt, Christian Bartelt

AI总结 本文提出TabPrep,一个轻量级预处理流程,通过针对三种特定数据模式设计的特征生成器,系统性地进行特征工程,显著提升多种模型在表格基准测试中的性能。

详情
AI中文摘要

表格机器学习的进展主要集中在日益复杂的模型架构上。同时,特征工程仍然是现实建模流程中关键但未被充分探索的组成部分,在现代基准测试中完全缺失,这造成了未量化的评估差距。在这项工作中,我们引入了TabPrep,一个轻量级预处理流程,由精心设计以针对三种特定结构数据模式的特征生成器组成。我们表明,许多广泛使用的模型类对这些模式表现出可预测的盲点,仅凭系统性的特征工程就能建立新的峰值性能。在TabArena基准测试中,将TabPrep集成到模型训练和调优中持续提升了基于树、神经网络、线性和基础模型的性能,通常超过仅通过以模型为中心的创新所获得的收益。TabPrep在性能、效率和跨数据集的适用性方面优于以前的自动化特征工程方法,使其能够集成到大规模基准测试中。通过发布TabPrep(见https://github.com/atschalz/tabprep),我们使研究人员能够将特征工程集成到他们的基准测试设置中,填补了表格评估中长期存在的空白。

英文摘要

Progress in tabular machine learning has largely focused on increasingly sophisticated model architectures. At the same time, feature engineering remains a critical yet underexplored component of real-world modeling pipelines that is entirely absent from modern benchmarks, which creates an unquantified evaluation gap. In this work, we introduce TabPrep, a lightweight preprocessing pipeline composed of feature generators that are carefully designed to target three specific structural data patterns. We show that many widely used model classes exhibit predictable blind spots to these patterns and that systematic feature engineering alone can establish new peak performance. Across the TabArena benchmark, integrating TabPrep into model training and tuning consistently improves performance for tree-based, neural, linear, and foundation models, often surpassing gains achieved by model-centric innovations alone. TabPrep outperforms previous automated feature engineering approaches in performance, efficiency, and applicability across datasets, enabling integration into large-scale benchmarks. By releasing TabPrep (see https://github.com/atschalz/tabprep), we enable researchers to integrate feature engineering into their benchmarking setup, filling a longstanding gap in tabular evaluations.

2606.02381 2026-06-02 cs.AI cs.LG math.DS

A Mathematical Conflict Framework for Contextual Data Modulation

上下文数据调制的数学冲突框架

Hakan Emre Kartal

AI总结 提出一个基于算子的数学冲突框架,将冲突视为局部、方向性和上下文敏感的量,通过统一抽象算子整合权重、尺度行为和输出映射,作为独立于优化过程的数学对象。

详情
Comments
15 pages, 3 figures, framework paper
AI中文摘要

在本研究中,提出了一个基于算子的广义数学冲突框架,以显式表示原始数据与上下文数据之间的结构差异。所提出的结构将冲突视为局部、方向性和上下文敏感的量,在统一抽象算子下整合了权重、尺度行为和输出映射等组件。该框架并未简化为特定的学习算法或优化方法,而是定义为适用于不同问题类别的通用结构。现有方法通常将冲突仅仅视为嵌入优化过程中的隐式副作用,而所提出的框架则将冲突视为独立的、基于算子的、组件级别的数学对象。

英文摘要

In this study, a generalized operator-based mathematical conflict framework is presented to explicitly represent structural discrepancies between raw data and contextual data. The proposed structure treats conflict as a local, directional, and context-sensitive quantity, integrating components such as weighting, scale behavior, and output mapping under a unified abstract operator. Without being reduced to a specific learning algorithm or optimization method, the framework is defined as a general structure adaptable to different classes of problems. While existing approaches typically treat conflict merely as an implicit side effect embedded within the optimization process, the proposed framework considers conflict as an independent, operator-based, and component-level mathematical object.

2606.02380 2026-06-02 cs.CL cs.AI

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

SPADE-Bench:通过计划-行动分歧评估智能体中的自发性策略欺骗

Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong, Kaiyue Yang, Jiaming Ji, Yingshui Tan, Wenxin Li, Yaodong Yang, Juntao Dai

AI总结 针对LLM智能体在工具使用中可能出现的自发性策略欺骗(计划与行动不一致),提出SPADE-Bench基准,通过结合实际工具执行和受控压力场景,严格区分欺骗与幻觉,实验证实该问题真实且紧迫。

详情
AI中文摘要

随着基于LLM的智能体扩展其操作范围,可靠性成为实际部署的前提。然而,在实际应用中,人类用户无法监控每一个即时行为;相反,执行过程往往是一个黑箱,用户仅依赖智能体的自我报告更新。这种不透明性带来了关键风险:智能体可能呈现与执行行动不一致的面向观察者的报告,使得系统不可控,尤其是在高风险自主场景中。我们将这种自我报告的计划-行动分歧称为智能体欺骗。为了评估这一点,我们引入了SPADE-Bench,一个旨在评估自发性计划-行动分歧的基准。与先前的欺骗基准不同,SPADE-Bench同时集成了实际工具执行和受控压力场景。这种设计确保了生态效度,并通过在压力下进行受控的计划-行动比较,严格区分策略欺骗与单纯的幻觉。跨主流模型的实验证实,智能体欺骗在工具使用环境中是一个真实且紧迫的问题。通过提供一个全面且稳健的评估框架,SPADE-Bench填补了智能体安全中的关键空白,促进社区朝着构建可信和可控的自主系统迈进。

英文摘要

As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical applications, human users cannot monitor every immediate behavior; instead, the execution process often remains a black box, leaving users dependent solely on the agent's self-reported updates. This opacity creates a critical risk: agents may present observer-facing reports that diverge from their executed actions, rendering the system uncontrollable, especially in high-stakes autonomous scenarios. We term such self-reported plan-action divergence agent deception. To assess this, we introduce SPADE-Bench, a benchmark designed to evaluate spontaneous plan-action divergence. Unlike prior deception benchmarks, SPADE-Bench simultaneously integrates actual tool execution and controlled pressure scenarios. This design ensures ecological validity and rigorously distinguishes strategic deception from mere hallucination through controlled plan-action comparisons under pressure. Experiments across mainstream models confirm that agent deception is a genuine and pressing issue in tool-use contexts. By providing a comprehensive and robust evaluation framework, SPADE-Bench fills a critical gap in agent safety, facilitating the community's progress toward building trustworthy and controllable autonomous systems.

2606.02379 2026-06-02 cs.CV

Honey, I Shrunk the Arc de Triomphe!

亲爱的,我把凯旋门缩小了!

Yuanbo Xiangli, Hanyu Chen, Xueqing Tsang, Noah Snavely

AI总结 针对单目度量几何估计中的“尺度坍缩”现象,通过构建新数据集MetricScenes并采用两阶段泊松补全方法提升深度图质量,微调MoGe-2模型显著缓解了尺度低估问题。

详情
Comments
Project page: https://metricscenes.github.io/
AI中文摘要

度量尺度单目几何估计通过大规模数据聚合取得了显著进展,但当前的基础模型存在持续的“尺度坍缩”现象:远处地标和广阔景观被度量低估。我们假设这一性能差距源于训练数据瓶颈,现有度量尺度数据集受硬件限制,要么是均匀的车辆捕获LiDAR或短距离室内扫描,要么是缺乏物理世界语义复杂性的合成数据。为弥补这一差距,我们整理了一个新的度量级野外数据集MetricScenes,从多种来源收集,包括互联网照片集和立体图像。我们使用现成方法估计每个场景的相机姿态和初始深度图,并从地理标记元数据以及已知立体相机基线恢复绝对尺度。我们还通过一种新的两阶段泊松补全方法改进了从MetricScenes导出的深度图质量。在我们的数据集上微调MoGe-2显著缓解了尺度坍缩,并在无约束的开放域场景中实现了优越的度量精度,同时在标准基准上保持了最先进的性能。

英文摘要

Metric scale monocular geometry estimation has seen significant progress through large-scale data aggregation, yet current foundation models suffer from a persistent "scale-collapse" phenomenon: distant landmarks and vast landscapes are metrically underestimated. We hypothesize that this performance gap stems from a training data bottleneck, where existing metric-scale datasets are hardware-constrained to homogeneous vehicle-captured LiDAR or short-range indoor scans, or consist of synthetic data that lacks the semantic complexity of the physical world. To bridge this gap, we curate a new metrically-grounded, in-the-wild dataset that we call MetricScenes, gathered from a variety of sources including Internet photo collections and stereo imagery. We estimate camera poses and initial depth maps for each scene using off-the-shelf methods, and recover absolute scale from geo-tagged metadata as well as known stereo camera baselines. We also improve the quality of depth maps derived from MetricScenes via a new two-stage Poisson completion method. Fine-tuning MoGe-2 on our dataset significantly mitigates scale-collapse and achieves superior metric accuracy in unconstrained, open-domain scenes while maintaining state-of-the-art performance on standard benchmarks.

2606.02375 2026-06-02 cs.CL cs.CY cs.HC

WAXAL-NET: Finetuned Edge ASR Across 19 African Languages

WAXAL-NET: 针对19种非洲语言的微调边缘ASR

Victor Tolulope Olufemi, Oreoluwa Babatunde, Ramsey Njema, Bolarinwa Gbotemi, Wanchi Lucia Yen, John Uzodinma, Sunday Ajayi, Oluwademilade Williams, Kausar Moshood, Innocent Elendu Anyaele, Akebert Arefaine, Candace Hunzwi, Wongel Dawit Daniel, Emmilly Namuganga, Cleophas Kadima, Athanase Bahizire, Onitsiky Ranaivoson, Emmanuel Aaron, Nicholaus Ladislaus, Idris Muhammed, Jonathan Enoch Simenya, Martin Koome, Matewos Tegete Endaylalu, Peter Ifeoluwa Adeyemo, Hondi Prisca Birindwa, Ukachi Agnes Eze-Mbey, Yacoba Oduro-Yeboah, Pericles Adjovi, Mikel K. Ngueajio, Toluwani Aremu, Prasenjit Mitra

AI总结 本研究评估了紧凑型领域专用ASR模型在WAXAL语料库的19种非洲语言会话语音上是否优于大规模多语言基础模型,通过微调边缘模型实现了宏平均WER从64.9%降至38.0%,模型大小缩小3-40倍,证实领域专业化主导规模效应。

详情
AI中文摘要

我们评估了紧凑型领域专用ASR模型是否能在WAXAL语料库的19种非洲语言会话语音上优于大规模多语言基础模型。微调后的边缘模型实现了宏平均词错误率(WER)38.0%,而最佳零样本基线为64.9%,使用小3-40倍的模型降低了26.9个百分点。结果证实,对于自发的非洲语音,领域专业化主导规模效应。跨域评估显示,微调模型在分布外(OOD)语音上恢复了可用性能,而零样本模型在测试域与其预训练分布匹配时重新获得优势。一项涵盖所有调查语言的分布式母语者审计产生了基于语言学的错误分类,表明CTC和自回归架构在不同语系中表现不同。我们进一步表明,对于音节文字语言,仅WER会错误表示性能,其中CER/WER比率显示字符级准确率远高于标题WER所暗示的。最后,为促进未来的非洲ASR研究,我们发布了所有模型权重、微调和评估脚本,以及涵盖全部19种语言的清洗后的WAXAL子集。

英文摘要

We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of $38.0\%$ compared to $64.9\%$ for the best zero-shot baseline, a $26.9$ percentage-point reduction using models $3$ to $40\times$ smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically-grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests. Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all $19$ languages.
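
The metrics named in this abstract can be made concrete with a minimal sketch, assuming the standard definitions: WER is the word-level edit distance divided by the number of reference words, CER is the same quantity at character level, and the macro average is the unweighted mean over languages. The function names and toy strings are illustrative and are not taken from the paper's released scripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


def macro_average(scores):
    """Unweighted mean of per-language scores (each language counts equally)."""
    return sum(scores.values()) / len(scores)
```

For the toy pair `"the cat sat"` versus `"the cat sit"`, one of three words is wrong (WER = 1/3) but only one of eleven characters is wrong (CER = 1/11), which is the kind of gap between the two metrics the abstract describes for syllabary-script languages.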

2606.02374 2026-06-02 cs.AI

Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models

超越像素的空间表示学习:统一栅格数据和向量语义以构建以人为中心的地理空间基础模型

Steffen Knoblauch, Hao Li, Gengchen Mai, Konstantin Klemmer, Song Gao, WenWen Li

AI总结 本文提出统一栅格感知与向量推理的联合空间表示学习范式,旨在解决当前地球观测基础模型仅依赖栅格模态、忽略向量数据中丰富结构化信息的局限性。

详情
AI中文摘要

地球观测(EO)从根本上改变了环境过程和人类活动的监测,达到了行星尺度。自监督学习的最新进展催生了地球观测基础模型(EOFMs),这些模型利用PB级未标记EO数据学习跨广泛下游地理空间任务的可迁移表示。尽管取得了这些进展,当前的EOFMs仍然局限于栅格模态,忽视了诸如OpenStreetMap和Overture等可公开访问的向量数据源中编码的丰富结构化信息。向量数据提供了地理实体的显式和紧凑表示,包括几何、拓扑和语义关系,提供了在图像中通常模糊或难以获取的关键上下文信号。因此,栅格和向量数据代表了地理空间的互补视图:栅格数据捕捉连续的物理和光谱模式,而向量数据编码离散对象及其关系结构,并且通常更多地代表人类系统而非物理系统(例如社会或人口数据)。然而,现有的地理空间表示学习范式孤立地处理这些模态,依赖于不完美且常有损的转换来桥接它们。这篇观点文章呼吁向联合空间表示学习(SRL)的范式转变,即在统一的嵌入空间中整合栅格感知与基于向量的推理。基于多模态地理空间学习的新兴努力,我们强调了对齐异构空间数据源的概念基础、技术挑战和有前景的方向。我们认为,这种整合对于开发能够更准确、可解释且语义扎实地理解地球的下一代地理空间AI系统至关重要。

英文摘要

Earth Observation (EO) has fundamentally transformed the monitoring of environmental processes and human activities up to planetary scale. Recent advances in self-supervised learning have given rise to Earth Observation Foundation Models (EOFMs), which leverage petabyte-scale unlabeled EO data to learn transferable representations across a wide range of downstream geospatial tasks. Despite these advances, current EOFMs remain largely confined to raster modalities, overlooking the rich, structured information encoded in openly-accessible vector data sources such as OpenStreetMap and Overture. Vector data provides explicit and compact representations of geographic entities, including geometry, topology, and semantic relationships, offering critical contextual signals that are often ambiguous or inaccessible in imagery alone. Raster and vector data thus represent complementary views of geographic space: raster data captures continuous physical and spectral patterns, while vector data encodes discrete objects and their relational structure and often represents more of the human rather than the physical systems (e.g. social or demographic data). However, existing geospatial representation learning paradigms treat these modalities in isolation, relying on imperfect and often lossy transformations to bridge them. This perspective paper calls for a paradigm shift toward joint Spatial Representation Learning (SRL) in a unified embedding space that integrates raster perception with vector-based reasoning. Building on emerging efforts in multimodal geospatial learning, we highlight conceptual foundations, technical challenges, and promising directions for aligning heterogeneous spatial data sources. We contend that such integration is essential for developing next-generation geospatial AI systems capable of more accurate, interpretable, and semantically grounded understanding of the Earth.

2606.02373 2026-06-02 cs.AI cs.CL cs.IR

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Harness-1:基于状态外化harness的搜索智能体强化学习

Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han

AI总结 提出Harness-1,一个20B参数的搜索智能体,通过强化学习在有状态搜索harness中训练,将常规状态管理外化到环境,在八个检索基准上平均召回率0.730,超越现有开源搜索子智能体11.4个百分点。

详情
AI中文摘要

搜索智能体通常被训练为基于不断增长的转录的策略:模型必须决定如何搜索,同时记住它看到了什么、哪些证据有用、哪些约束仍然开放、哪些声明已被检查。我们认为这种表述将过多的常规状态管理放在策略内部:强化学习被迫同时优化语义搜索决策和可恢复的簿记,而环境可以更可靠地维护这些簿记。我们引入Harness-1,一个20B参数的搜索智能体(检索子智能体),在有状态搜索harness内通过强化学习训练。该harness维护环境端的工作记忆,包括候选池、重要性标记的精选集、紧凑的证据链接、验证记录、压缩和去重的观察结果,以及预算感知的上下文渲染。策略保留语义决策:搜索什么、保留或丢弃哪些文档、验证什么以及何时停止。在涵盖网络、金融、专利和多跳问答的八个检索基准上,Harness-1实现了0.730的平均精选召回率,比次强的开源搜索子智能体高出11.4个百分点,并与更大的前沿模型搜索器保持竞争力。其优势在保留的迁移基准上尤为显著,表明基于显式搜索状态的强化学习可以产生超越训练领域的检索行为。我们的代码可在https://github.com/pat-jj/harness-1获取。

英文摘要

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.

2606.02372 2026-06-02 cs.AI cs.CL

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

COMAP:面向LLM智能体的世界模型与智能体策略协同进化

Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li

AI总结 提出COMAP框架,通过闭环交互协同进化文本世界模型和智能体策略,在具身任务规划、网页导航和工具使用基准上显著提升性能。

详情
AI中文摘要

为语言智能体配备世界模型使其能够在执行前预测环境动态并评估候选动作。然而,现有的文本世界模型通常在训练后固定不变,无法适应由进化中的智能体引发的策略内状态-动作分布。同时,智能体改进方法往往依赖外部奖励或验证器,限制了其在现实交互环境中的适用性。本文提出COMAP,一种通过闭环交互协同进化文本世界模型和智能体策略的新框架。在每个决策步骤,世界模型预测候选动作的未来状态反馈,智能体通过估计该反馈的可靠性并相应调整动作来进行未来感知反思。由此产生的策略内轨迹随后通过自蒸馏用于更新世界模型,使其更好地匹配智能体不断演化的交互分布。在具身任务规划、网页导航和工具使用基准上,COMAP始终优于竞争基线,例如使用Qwen3-4B相对提升16.75%。进一步分析表明,协同进化循环随时间提高了世界模型的预测准确性,并导致更有效的长程决策。我们的代码可在https://github.com/loyiv/CoMAP获取。

英文摘要

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action distributions induced by an evolving agent. Meanwhile, agent-improvement methods often rely on external rewards or verifiers, limiting their applicability in realistic interactive environments. In this paper, we propose COMAP, a novel framework that co-evolves textual world models and agent policies through closed-loop interaction. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection by estimating the reliability of this feedback and refining its action accordingly. The resulting on-policy trajectories are then used to update the world model via self-distillation, allowing it to better match the agent's evolving interaction distribution. Across embodied task planning, Web navigation, and tool-use benchmarks, COMAP consistently outperforms competitive baselines, e.g., +16.75% relative improvement with Qwen3-4B. Further analyses show that the co-evolutionary loop improves the world model's prediction accuracy over time and leads to more effective long-horizon decision-making. Our code is available at: https://github.com/loyiv/CoMAP.

2606.02370 2026-06-02 cs.RO

A Simulation Platform for Flapping-Wing Vehicles

扑翼飞行器仿真平台

Haichuan Li, Tomi Westerlund

AI总结 针对扑翼飞行器仿真与现实差距大的问题,提出FWAV-Sim高保真仿真平台,集成复合气动模型、湍流生成和真实传感器模拟,提升自主系统开发效果。

详情
AI中文摘要

扑翼飞行器(FWAVs)表现出卓越的敏捷性,但由于其对气动扰动的高敏感性和有限的传感器有效载荷能力,面临着巨大的自主性挑战。当前的仿真平台通常依赖于过度简化的层流假设和理想化的传感器模型,无法捕捉实际运行中遇到的复杂湍流模式和感知限制。这种仿真与现实的差距严重阻碍了FWAVs鲁棒自主系统的发展。我们引入了FWAV-Sim,一个基于Unity的高保真仿真框架,它集成了:(1)结合准稳态叶片单元理论和钝体阻力效应的复合气动模型,(2)通过分形噪声合成生成时空相关的湍流,以及(3)包括噪声IMU测量、LiDAR点云和RGB相机馈送的真实传感器模拟。我们的平台能够可扩展地生成包含真实车辆状态、气动力、湍流风场和多模态传感器流的同步数据集。实验验证表明,在FWAV-Sim中开发的自主流水线(包括控制器和感知系统)表现出显著提高的仿真能力,从而推进了扑翼飞行系统基于仿真的开发的卓越性能。

英文摘要

Flapping-wing aerial vehicles (FWAVs) demonstrate remarkable agility but face substantial autonomy challenges due to their high sensitivity to aerodynamic disturbances and limited sensor payload capacity. Current simulation platforms typically rely on oversimplified laminar flow assumptions and idealized sensor models, failing to capture the complex turbulence patterns and perceptual limitations encountered in real-world operation. This simulation-to-reality discrepancy significantly impedes the development of robust autonomy systems for FWAVs. We introduce FWAV-Sim, a high-fidelity Unity-based simulation framework that integrates: (1) a composite aerodynamic model combining quasi-steady blade-element theory with bluff-body drag effects, (2) spatiotemporally correlated turbulence generation through fractal noise synthesis, and (3) realistic sensor simulation including noisy IMU measurements, LiDAR point clouds, and RGB camera feeds. Our platform enables scalable generation of synchronized datasets containing ground-truth vehicle states, aerodynamic forces, turbulent wind fields, and multi-modal sensor streams. Experimental validation demonstrates that autonomy pipelines (including both controllers and perception systems) developed in FWAV-Sim exhibit significantly improved simulation capability, thereby advancing the outstanding performance in simulation-based development for flapping-wing aerial systems.
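
As a rough illustration of what "fractal noise synthesis" can mean, the sketch below sums several octaves of smoothed value noise into one correlated signal. This is a generic technique and an assumption on our part, not FWAV-Sim's implementation; it is one-dimensional (for example, a gust component over time), and the function names and default parameters are hypothetical.

```python
import math
import random


def value_noise_1d(x, seed=0):
    """Smoothly interpolate pseudo-random lattice values in [-1, 1] at position x."""
    def lattice(i):
        # Deterministic pseudo-random value attached to integer lattice point i.
        return random.Random(f"{seed}:{i}").uniform(-1.0, 1.0)

    i0 = math.floor(x)
    t = x - i0
    t = t * t * (3.0 - 2.0 * t)  # smoothstep easing between lattice points
    return (1.0 - t) * lattice(i0) + t * lattice(i0 + 1)


def fractal_noise_1d(x, octaves=4, lacunarity=2.0, gain=0.5, seed=0):
    """Sum octaves of value noise: each octave has higher frequency and lower amplitude."""
    total, norm = 0.0, 0.0
    amplitude, frequency = 1.0, 1.0
    for octave in range(octaves):
        total += amplitude * value_noise_1d(x * frequency, seed + octave)
        norm += amplitude
        amplitude *= gain
        frequency *= lacunarity
    return total / norm  # normalised so the result stays within [-1, 1]
```

Nearby inputs give nearby outputs, so sampling `fractal_noise_1d(t)` over time yields a correlated disturbance with both slow drifts and fast fluctuations, unlike independent white noise per step.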

2606.02366 2026-06-02 cs.CV

PRIMA: Boosting Animal Mesh Recovery with Biological Priors and Test-Time Adaptation

PRIMA: 利用生物先验和测试时自适应提升动物网格恢复

Xiaohang Yu, Ti Wang, Mackenzie Weygandt Mathis

AI总结 提出PRIMA框架,通过生物先验(BioCLIP嵌入)和测试时自适应策略,解决严重物种和姿态不平衡下的3D四足动物网格恢复问题,实现高泛化性能并构建大规模伪3D数据集Quadruped3D。

详情
AI中文摘要

我们提出PRIMA(*PRI*ors for *M*esh *A*daptation),一个在严重物种和姿态不平衡下进行鲁棒3D四足动物网格恢复的框架。现有的动物重建方法由于有限的3D监督和长尾物种分布,往往回归到平均形状和姿态,导致对欠代表性动物和罕见关节的泛化能力差。PRIMA通过三个关键贡献解决了这一挑战。首先,我们将BioCLIP嵌入作为生物先验,将语义和形态学知识注入重建过程,从而在多样化的四足动物中实现更准确和可泛化的形状预测。其次,我们引入了一种测试时自适应(TTA)策略,该策略利用2D重投影约束和辅助关键点指导来优化SMAL预测,改进了姿态和形状估计,同时能够从现有2D数据集中生成高质量的伪3D标注。第三,利用这个TTA框架,我们构建了Quadruped3D,一个大规模伪3D数据集,涵盖多样化的物种和姿态变化,以系统性地提升模型性能。在Animal3D、CtrlAni3D、Quadruped2D和Animal Kingdom上的大量实验表明,PRIMA达到了最先进的结果,在欠代表性物种和挑战性姿态上尤其有显著改进。我们的结果强调了生物先验和自适应驱动的数据扩展对于可扩展和可泛化的动物网格恢复的重要性。代码可在https://github.com/AdaptiveMotorControlLab/PRIMA获取。

英文摘要

We present PRIMA (*PRI*ors for *M*esh *A*daptation), a framework for robust 3D quadruped mesh recovery under severe species and pose imbalance. Existing animal reconstruction methods often regress toward mean shapes and poses due to limited 3D supervision and long-tailed species distributions, resulting in poor generalization to underrepresented animals and rare articulations. PRIMA addresses this challenge through three key contributions. First, we incorporate BioCLIP embeddings as biological priors to inject semantic and morphological knowledge into the reconstruction process, enabling more accurate and generalizable shape prediction across diverse quadrupeds. Second, we introduce a test-time adaptation (TTA) strategy that refines SMAL predictions using 2D reprojection constraints together with auxiliary keypoint guidance, improving pose and shape estimation while enabling the generation of high-quality pseudo-3D annotations from existing 2D datasets. Third, leveraging this TTA framework, we construct Quadruped3D, a large-scale pseudo-3D dataset that covers diverse species and pose variations to systematically improve model performance. Extensive experiments on Animal3D, CtrlAni3D, Quadruped2D, and Animal Kingdom demonstrate that PRIMA achieves state-of-the-art results, with particularly strong improvements on underrepresented species and challenging poses. Our results highlight the importance of biological priors and adaptation-driven data expansion for scalable and generalizable animal mesh recovery. Code is available at https://github.com/AdaptiveMotorControlLab/PRIMA.

2606.02365 2026-06-02 cs.LG cs.AI

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

FOAM:基于频率和算子误差的自适应阻尼方法,用于减少Shampoo的陈旧性误差

Kyunghun Nam, Sumyeong Ahn

AI总结 提出FOAM算法,通过自适应控制阻尼因子和特征分解频率来抑制陈旧性误差,在保持收敛的同时减少Shampoo的计算时间。

详情
Comments
9 pages, ICML 2026 camera-ready version
AI中文摘要

Shampoo因其在大规模优化基准上的卓越性能而备受关注,但它面临一个重要的实际瓶颈:矩阵求逆的过高计算开销。为了缓解这一问题,从业者通常依赖陈旧的预条件子更新,这在计算效率和优化保真度之间产生了根本性的权衡。在这项工作中,我们通过收敛性和稳定性的互补视角对陈旧性进行了理论研究。虽然陈旧性提高了计算效率,但它固有地降低了性能并引入了数值不稳定性。关键的是,我们发现作为数值稳定器的阻尼可以有效抑制这些负面影响。在此分析指导下,我们提出了FOAM,一种自适应算法,通过基于陈旧性误差的近似动态控制阻尼因子和特征分解频率来稳定训练。实验结果表明,与标准Shampoo相比,FOAM在保持稳健收敛的同时减少了挂钟时间。

英文摘要

Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks; yet it faces a significant practical bottleneck: the prohibitive computational overhead of matrix inversion. To mitigate this, practitioners typically rely on stale preconditioner updates, creating a fundamental trade-off between computational efficiency and optimization fidelity. In this work, we provide a theoretical study of staleness through the complementary lenses of convergence and stability. While staleness improves computational efficiency, it inherently degrades performance and introduces numerical instability. Crucially, we identify that damping, acting as a numerical stabilizer, can effectively suppress these negative effects. Guided by this analysis, we propose FOAM, an adaptive algorithm that stabilizes training by dynamically controlling both the damping factor and the eigendecomposition frequency based on an approximation of the staleness-oriented error. Experimental results demonstrate that FOAM reduces wall-clock time compared to standard Shampoo while maintaining robust convergence.

2606.02363 2026-06-02 cs.LG stat.ML

Minimax-Optimal Policy Regret in Partially Observable Markov Games

部分可观测马尔可夫博弈中的极小极大最优策略遗憾

Raman Arora

AI总结 针对部分可观测马尔可夫博弈,提出基于epoch的乐观最大似然算法,实现了与聚合Eluder维数相关的$\tilde{O}(\sqrt{T})$策略遗憾,并证明了匹配的下界。

详情
AI中文摘要

我们研究了部分可观测环境中面对战略、自适应对手的序贯决策问题,建模为部分可观测马尔可夫博弈(POMG)。核心挑战在于从部分观测中学习潜在动态,同时面对行为依赖于学习者策略的对手,这使得标准遗憾概念不适用。我们证明,对于固定问题参数,基于epoch的乐观最大似然算法实现了$\tilde{O}(\sqrt{T})$的策略遗憾,显式依赖于视界、对手记忆、置信半径以及可观测算子类的聚合Eluder维数。该算法在每个几何增长的epoch中选择一个策略,使用从过去数据累积构建的置信集,这将比较跨策略的对手响应的成本控制在$T$的对数级别。我们还证明了与$\sqrt{T}$和聚合Eluder维数依赖相匹配的下界(至多问题相关和对数因子)。最后,我们将框架扩展到视界自适应保证和具有几何衰减记忆的对手。

英文摘要

We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We prove that an epoch-based optimistic maximum-likelihood algorithm achieves $\tilde{O}(\sqrt{T})$ policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data, which keeps the cost of comparing adversary responses across policies logarithmic in $T$. We also prove a lower bound matching the $\sqrt{T}$ and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. Finally, we extend the framework to horizon-adaptive guarantees and adversaries with geometric fading memory.
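
The statement that one policy per geometrically growing epoch keeps the number of policy changes logarithmic in $T$ can be checked with a small sketch. The doubling schedule below is an assumption for illustration; the abstract does not state the growth factor, and `epoch_lengths` is a hypothetical helper.

```python
def epoch_lengths(T, growth=2):
    """Split T rounds into epochs of length 1, growth, growth**2, ...

    The final epoch is truncated so that the lengths sum to exactly T.
    """
    lengths = []
    length, used = 1, 0
    while used < T:
        step = min(length, T - used)
        lengths.append(step)
        used += step
        length *= growth
    return lengths
```

With doubling, 1,000 rounds need 10 epochs and 1,000,000 rounds need only 20, so the number of distinct policies played grows like $\log_2 T$ rather than like $T$.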

2606.02359 2026-06-02 cs.AI

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

MOC:基于LLM的多智能体系统中的多阶通信

Yao Guan, Lin Wang, Zhihu Lu, Ziyi Wang, Wenzhu Yan, Qiang Duan

AI总结 提出多阶通信(MOC)方案,通过重构智能体间通信以捕获多跳依赖,并设计结构消息合并策略,在多个数据集上提升任务性能并降低通信成本。

详情
AI中文摘要

尽管基于大语言模型(LLM)的多智能体系统取得了显著进展,但大多数研究侧重于优化协调拓扑,而同样关键的问题——如何有效地在智能体之间传输和优化消息——却很大程度上未被充分探索。当前的通信方案通常依赖于一阶邻居响应的直接拼接,这导致了受限的证据感受野,并使得关键信息在多跳路径上被稀释。为了解决这些局限性,我们提出了多阶通信(MOC)方案,该方案重构了智能体间通信以捕获多跳依赖,并引入了一种结构消息合并策略以确保效率。具体来说,我们形式化了通信机制以构建结构化的多阶证据流,随后设计了一种语义-拓扑合并算法,以在令牌约束内优化语义保真度。在六个不同数据集和不同参数规模的LLM骨干上的大量实验表明,MOC一致地提升了任务性能并降低了通信成本。代码可在 https://github.com/yao-guan/MOC 获取。

英文摘要

Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimize messages among agents effectively? Current communication schemes typically rely on the direct concatenation of first-order neighbor responses, which induces a restricted evidence receptive field and leads to the dilution of crucial insights over multi-hop paths. To address these limitations, we propose the Multi-Order Communication (MOC) scheme, which reconstructs the inter-agent communication to capture multi-hop dependencies and incorporates a structural message consolidation strategy to ensure efficiency. Specifically, we formalize the communication mechanism to construct a structured multi-order evidence stream, and subsequently design a Semantic-Topological Merging algorithm to optimize semantic fidelity within token constraints. Extensive experiments across six diverse datasets and LLM backbones of varying parameter scales demonstrate that MOC consistently improves task performance and reduces communication costs. The code is available at https://github.com/yao-guan/MOC.
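
To make the contrast between first-order and multi-order communication concrete, the sketch below groups the agents that can reach a given agent by hop distance. It illustrates only the multi-hop idea; it is not the MOC algorithm, and the graph encoding (each agent mapped to the agents it receives messages from) and the function name are assumptions.

```python
def neighbors_by_order(graph, agent, max_order):
    """Group agents by hop distance (1..max_order) from `agent` using breadth-first search.

    `graph` maps each agent to the list of agents it directly receives messages from.
    """
    seen = {agent}
    frontier = [agent]
    by_order = {}
    for order in range(1, max_order + 1):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        if not next_frontier:
            break  # no agents exist at this hop distance or beyond
        by_order[order] = next_frontier
        frontier = next_frontier
    return by_order
```

On a chain where D receives from C, C from B, and B from A, a first-order scheme (`max_order=1`) exposes D only to C, whereas `max_order=3` also surfaces B and A as second- and third-order evidence sources.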

2606.02357 2026-06-02 cs.CV cs.AI

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

多模态智能体真的从工具使用中受益吗?能力增益的系统性研究

Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Qinghao Wang, Minpeng Liao

AI总结 通过对比工具增强与无工具的多模态智能体在多项任务上的表现,发现工具使用并未带来一致的性能提升,智能体更多是学会了工具调用模式而非真正利用工具扩展能力。

详情
AI中文摘要

工具增强的多模态智能体在基准测试中表现出显著提升,这常被视为智能体已学会使用工具的证据。我们认为这种解读可能为时过早:仅凭工具调用轨迹并不能证明工具提供了答案关键信息。我们研究了两种代表性的“用图像思考”智能体,Thyme 和 DeepEyesV2,在真实世界理解、OCR、图表理解和数学推理任务上的表现。每个智能体与其无工具版本以及从同一源池训练但不含工具调用轨迹的纯文本推理器进行比较。工具访问并未带来一致的总体改进,未能可靠地降低生成令牌成本,并且仅留下一个很小的仅工具解决集:DeepEyesV2 的 93% 工具解决问题和 Thyme 的 96% 也被至少一种无工具设置解决。机制消融进一步表明,完整的工具使用循环并不始终优于单独的工具调用格式或返回的执行结果。在我们研究的设置中,所分析的智能体似乎更可靠地学习了工具调用模式而非工具贡献的能力,这表明评估应区分工具的可用性与工具是否真正扩展了智能体可解决的问题。

英文摘要

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative "thinking with images" agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.
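
The overlap statistic quoted in this abstract (the share of tool-solved problems that some non-tool setting also solves) comes down to plain set operations. The sketch below uses made-up problem IDs; the setting names echo the abstract, but the data and the function name are hypothetical.

```python
def tool_only_breakdown(tool_solved, non_tool_settings):
    """Return (fraction of tool-solved problems also solved without tools, tool-only set).

    `tool_solved` is the set of problem IDs solved with tool access.
    `non_tool_settings` maps a setting name to the set of IDs it solves.
    """
    solved_without_tools = set().union(*non_tool_settings.values())
    overlap = tool_solved & solved_without_tools
    tool_only = tool_solved - solved_without_tools
    return len(overlap) / len(tool_solved), tool_only
```

For example, with `tool_solved = {1, 2, 3, 4}` and non-tool settings solving `{1, 2}` and `{2, 3, 5}`, three of the four tool-solved problems are also solved without tools (0.75), leaving `{4}` as the tool-only solved set.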