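arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.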

2605.17423 2026-05-19 cs.CV

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

Yiren Song, Huilin Zhong, Kevin Qinghong Lin, Haofan Wang, Mike Zheng Shou

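AI summary: This study proposes Soap2Soap, a multi-agent framework for long cinematic video remaking that addresses long-term consistency and narrative fidelity in video-to-video generation.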

Abstract

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.


2605.17421 2026-05-19 cs.RO

MUSE: Multimodal Uncertainty Quantification of State Estimation

Minkyung Kim, Henry Che, Bhargav Chandaka, Bhumsitt Pramuanpornsatid, Chengyu Yang, Sheng Cheng, Xiaofeng Wang, Naira Hovakimyan, Shenlong Wang

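AI summary: This paper proposes MUSE, a real-time learning-based framework that uses Mamba's strong and efficient sequence modeling to estimate localization uncertainty from multiple asynchronous sensor streams, improving the reliability and robustness of state estimation.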

Comments Code and dataset: https://github.com/hungdche/MUSE

Abstract

Accurate visual state estimation has been a central topic in robotics with a wide range of applications in robot navigation, autonomous driving, and autonomous flight. Recent advances in robot perception have led to significant improvements in the accuracy and robustness of state estimation, yet a fundamental challenge remains in how to quantify and calibrate its precision, i.e., how confident we are in an estimate and whether failures can be detected. This issue is particularly pronounced in visual-inertial odometry (VIO), where the heteroscedastic and multimodal nature of the problem makes uncertainty quantification especially difficult. This paper introduces MUSE (Multimodal Uncertainty Quantification of State Estimation), a novel real-time learning-based framework that leverages the strong and efficient sequential modeling capacity of Mamba to estimate localization uncertainty from multiple asynchronous sensor streams. Experiments on both public and in-house datasets demonstrate that MUSE achieves superior reliability and robustness compared to existing uncertainty quantification methods, and ablation studies justify the benefits of its key design choices.


2605.17419 2026-05-19 cs.LG cs.AI

Learning Displacement-Robust Representations for Landslide Early Warning under Rainfall Forecast Uncertainty

Ren Ozeki, Hamada Rizk, Hirozumi Yamaguchi

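AI summary: This paper proposes a landslide early warning system that is robust to rainfall field displacement. It learns latent representations of rainfall and terrain data to improve landslide prediction under rainfall forecast uncertainty.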

Abstract

Rainfall-induced landslides pose a growing risk worldwide as climate change intensifies extreme rainfall events. To provide sufficient evacuation time, landslide early warning systems (LEWS) for real-time disaster monitoring must estimate near-future landslide risk by integrating observed rainfall with short-term rainfall forecasts from spatio-temporal environmental data streams. Although recent landslide prediction methods have improved predictive performance using statistical and deep learning approaches, most assume accurate rainfall inputs. In operational settings, however, landslide prediction relies on rainfall forecasts, which often contain spatial displacement of rainfall fields due to forecasting uncertainties. Such displacement can alter local accumulated rainfall and degrade prediction accuracy. To address this challenge, we propose a novel LEWS robust to rainfall field displacement. The key idea is to learn latent representations from rainfall and terrain data that remain stable under displacement in rainfall field motion, enabling reliable geospatial data integration for landslide risk estimation. The landslide prediction model is trained using Rainfall-Motion-Aware Contrastive Learning (RMCL), which introduces temporally correlated rainfall field perturbations to emulate forecast-induced displacement in rainfall-driven spatio-temporal environmental data streams. Experiments were conducted using two years of rainfall and terrain data across Japan, covering 19 regions with landslide events. The proposed system achieved up to 37% higher precision than state-of-the-art baselines. These results demonstrate that modeling rainfall as a moving spatial field and addressing rainfall field displacement during learning significantly improve the reliability of short-term landslide prediction in operational early warning systems.


2605.17410 2026-05-19 cs.AI

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

Ou Wu, Yingjun Deng

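AI summary: This paper examines the computational challenges that arise when tokens are treated as economic primitives in large language model systems. It introduces the concept of computational token economics and a three-part theory of token economics, aiming to set a research agenda that connects token economics with AI system design.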

Comments 43 pages

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

Wei Duan, Junyu Xuan, En Yu, Xiaoyu Yang, Jie Lu

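AI summary: This paper proposes Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which give a theoretically grounded answer to two questions in multi-agent reinforcement learning: which edges a coordination graph should contain, and how much message capacity each should carry. It uses the information bottleneck to build a group-aligned block-diagonal prior that justifies both edge existence and message capacity.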

Abstract

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention (determining which edges should exist and at what density per group block) and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.
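
The abstract states that capacity allocation follows a water-filling principle but does not give the objective. For readers unfamiliar with the term, here is the textbook water-filling allocation, offered as a sketch rather than as HIBCG's actual rule. The function name, the per-channel `noise` levels, and the `budget` are assumptions made for this illustration.

```python
def water_filling(noise, budget):
    """Textbook water-filling: split `budget` across channels.

    Each channel i receives max(0, level - noise[i]), where the common
    water level is chosen so that the allocations sum to `budget`.
    `noise` and `budget` are hypothetical inputs for illustration only.
    """
    order = sorted(noise)
    level = 0.0
    # Try filling the k quietest channels, shrinking k until the water
    # level actually rises above the noisiest channel being filled.
    for k in range(len(order), 0, -1):
        level = (budget + sum(order[:k])) / k
        if level > order[k - 1]:
            break
    return [max(0.0, level - n) for n in noise]


allocation = water_filling([1.0, 2.0, 5.0], budget=3.0)
print(allocation)
```

Quieter channels receive more capacity, and channels whose noise sits above the water level receive none.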


2605.17382 2026-05-19 cs.AI cs.CL cs.GR

UAM: A Two-Stream Perspective on Forgetting in VLA Training

Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

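AI summary: This paper proposes UAM, a two-stream architecture that addresses the loss of multimodal capability caused by relying on a single encoder in VLA training. It shows that semantic preservation can come from architectural separation rather than frozen weights or auxiliary data, while achieving high success rates across a range of tasks.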

Abstract

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.


2605.15694 2026-05-19 cs.LG

Learning Normalized Energy Models for Linear Inverse Problems

Nicolas Zilberstein, Santiago Segarra, Eero Simoncelli, Florentin Guth

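AI summary: This paper proposes a new energy-based model for linear inverse problems, trained with a covariance-based regularization term that enforces consistency across measurement conditions. The model computes normalized posterior densities without additional retraining or fine-tuning, and enables energy-guided adaptive sampling, unbiased Metropolis-Hastings correction steps, and estimation of the degradation operator via Bayes' rule.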

Comments ICML 2026

Journal ref: Int'l Conf Machine Learning (ICML), Jul 2026. https://openreview.net/forum?id=PlFJwgaaDK

Abstract

Generative diffusion models can provide powerful prior probability models for inverse problems in imaging, but existing implementations suffer from two key limitations: (i) the prior density is represented implicitly, and (ii) they rely on likelihood approximations that introduce sampling biases. We address these challenges by introducing a new energy-based model trained for denoising with a covariance-based regularization term that enforces consistency across different measurement conditions. The trained model can compute normalized posterior densities for diverse linear inverse problems, without additional retraining or fine-tuning. In addition to preserving the sampling capabilities of diffusion models, this enables previously unavailable capabilities: energy-guided adaptive sampling that adjusts schedules on-the-fly, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes' rule. We validate the method on multiple datasets (ImageNet, CelebA, AFHQ) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.


2605.15377 2026-05-19 cs.AI

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

Eugene Koran, Yejun Yun, Samantha Tetef, Benjamin Arnav, Pablo Bernabeu-Pérez

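AI summary: This paper studies how combining multiple monitoring signals improves detection of misaligned AI actions. It finds that diverse monitor ensembles are more effective than single monitors or homogeneous ensembles, and that fine-tuned monitors have an edge in detection capability.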

Abstract

As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity - not scale - drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.


2605.15177 2026-05-19 cs.AI

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, Jingbo Shang

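AI summary: This study proposes OpenDeepThink, a framework that scales test-time compute to improve LLM reasoning. It samples candidates in parallel and uses Bradley-Terry aggregation of pairwise comparisons to remove the selection bottleneck, improving performance on Codeforces and other domains.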

Comments 19 pages, 4 figures

Abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
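
As a rough illustration of the aggregation step described in this abstract, the sketch below fits Bradley-Terry strengths to pairwise votes with the standard minorization-maximization update and sorts candidates by strength. It is not the authors' implementation: the function names, the `prior` smoothing term, and the toy votes are assumptions made for this example, and the LLM judging and mutation steps are omitted.

```python
def bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    `prior` adds a small number of virtual wins in both directions for
    every pair, so strengths stay finite when a candidate never wins or
    never loses. It is an assumption of this sketch.
    """
    wins = [[0.0 if i == j else prior for j in range(n_items)]
            for i in range(n_items)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0

    strength = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            # Number of games against j divided by the combined strength.
            denom = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                        for j in range(n_items) if j != i)
            updated.append(total_wins / denom)
        scale = n_items / sum(updated)
        strength = [s * scale for s in updated]  # keep the mean at 1
    return strength


def rank(strength):
    """Candidate indices ordered from strongest to weakest."""
    return sorted(range(len(strength)), key=lambda i: -strength[i])


# Toy votes: candidate 0 mostly beats 1 and 2, candidate 1 mostly beats 2.
votes = [(0, 1)] * 3 + [(1, 2)] * 3 + [(0, 2)] * 2 + [(2, 1)]
order = rank(bradley_terry(3, votes))
print(order)
```

Because the ranking comes from a global fit rather than from win counts alone, a candidate that was compared only a few times is still placed relative to the whole population.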


2605.14133 2026-05-19 cs.AI

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao

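AI summary: ClawForge generates executable interactive benchmarks to resolve the tension between scalable construction and realistic workflow evaluation, systematically testing how agents cope with state conflicts.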

Abstract

Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present ClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.


2605.14038 2026-05-19 cs.AI

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei, Soheil Feizi

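AI summary: This paper studies when large language models actually need external tools. It proposes a model-adaptive definition of tool necessity grounded in each model's own performance and, comparing four models on arithmetic and factual QA datasets, finds substantial mismatch between tool necessity and actual tool-call behavior, revealing a knowing-doing gap in LLM tool use.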

Abstract

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.


2605.14005 2026-05-19 cs.CL cs.LG

Yixu Feng, Zinan Zhao, Yanxiang Ma, Chenghao Xia, Chengbin Du, Yunke Wang, Chang Xu

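AI summary: This paper proposes a compression method for vision-language-action models based on differentiable grid sampling. Continuous token resampling preserves key spatial information while keeping fewer than 10% of the original visual tokens, giving a 76% reduction in FLOPs with no loss in success rate.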

Journal ref: Proceedings of the Forty-third International Conference on Machine Learning, 2026

Abstract

Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that, while validating the lowest feasible visual token count reported to date, GridS achieves a 76% reduction in FLOPs with no degradation in the success rate. The code is available at https://github.com/Fediory/Grid-Sampler.


2605.11567 2026-05-19 cs.CV

Dynamic Execution Commitment of Vision-Language-Action Models

Feng Chen, Xianghui Wang, Yuxuan Chen, Boying Li, Yefei He, Zeyu Zhang, Yicheng Wu

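AI summary: This paper proposes A3, a mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem, addressing the trade-off between execution robustness and inference throughput for vision-language-action models in dynamic or out-of-distribution settings.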

Comments code is available at https://inceptionwang.github.io/A3/

Abstract

Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
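
The prefix-closed acceptance rule in this abstract can be pictured with a few lines of Python. This is only a sketch under assumptions: `verified` is a hypothetical list of per-action boolean verification results, and how A3 actually computes them (group sampling, consensus scoring, re-decoding) is not shown.

```python
def longest_verified_prefix(verified):
    """Count the leading actions that all passed verification.

    Execution stops at the first unverified action, so a verified action
    that comes after a failed one is never executed.
    """
    horizon = 0
    for ok in verified:
        if not ok:
            break
        horizon += 1
    return horizon


# Hypothetical action chunk and verification flags.
chunk = ["reach", "grasp", "lift", "place"]
flags = [True, True, False, True]
to_execute = chunk[:longest_verified_prefix(flags)]
print(to_execute)
```

Under this rule the execution horizon is an output of verification rather than a constant tuned per task, which is the point the abstract makes.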


2605.11461 2026-05-19 cs.AI cs.LG

Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

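AI summary: This paper proposes Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation, improving both the accuracy and the solution diversity of large language models on reasoning tasks.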

Abstract

Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or a diversity bonus. However, these approaches do not change the winner-takes-all nature, where rollouts still compete for individual advantage rather than cooperating to maximize global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.
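
To make the idea of a determinant volume with average marginal contributions more concrete, here is a small sketch. It rests on assumptions not stated in the abstract: coverage is taken to be log det(I + G), where G is the Gram matrix of reward-weighted embeddings, and the average marginal contribution is computed exactly over all orderings, which is only practical for tiny groups. The function names and toy embeddings are hypothetical, and GCPO's actual formulation may differ.

```python
import math
from itertools import permutations


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix."""
    a = [row[:] for row in matrix]
    n = len(a)
    total = 0.0
    for c in range(n):
        pivot = a[c][c]  # stays positive for positive-definite input
        total += math.log(pivot)
        for r in range(c + 1, n):
            factor = a[r][c] / pivot
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return total


def coverage(vectors):
    """Assumed coverage score: log det(I + Gram matrix of `vectors`)."""
    n = len(vectors)
    gram = [[sum(x * y for x, y in zip(vectors[i], vectors[j]))
             + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    return log_det(gram)


def average_marginal_contributions(embeddings, rewards):
    """Average each rollout's coverage gain over every join order."""
    weighted = [[r * x for x in e] for e, r in zip(embeddings, rewards)]
    credit = [0.0] * len(weighted)
    orders = list(permutations(range(len(weighted))))
    for order in orders:
        team, previous = [], 0.0
        for i in order:
            team.append(weighted[i])
            current = coverage(team)
            credit[i] += current - previous
            previous = current
    return [c / len(orders) for c in credit]


# Toy group: two identical correct rollouts, one distinct correct rollout,
# and one incorrect rollout (reward 0).
emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
rew = [1.0, 1.0, 1.0, 0.0]
credit = average_marginal_contributions(emb, rew)
print(credit)
```

In this toy setup the incorrect rollout earns no credit, the two duplicated rollouts share a discounted credit, and the distinct correct rollout earns the most, which mirrors the behaviour the abstract describes.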
