arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13681 2026-06-12 cs.CL New submission

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

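Institutions: National University of Singapore; Singapore Management University; University of Washington; University College London; University of Pennsylvania; Nanyang Technological University; Recursive; Massachusetts Institute of Technology

AI summary: Proposes the EvoArena benchmark suite, which simulates progressive environment changes across terminal, software, and social domains, and designs EvoMem, a patch-based memory paradigm that records structured update histories so that agents can reason about environment evolution through changes in memory. Experiments show that current agents perform poorly in dynamic environments and that EvoMem consistently improves performance.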

Abstract

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
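The difference between per-task accuracy and the chain-level accuracy mentioned above can be illustrated with a minimal sketch. The data layout (a list of chains, each a list of pass/fail outcomes) and the sample results are assumptions for illustration, not EvoArena's actual format or numbers.

```python
from typing import List


def task_accuracy(chains: List[List[bool]]) -> float:
    """Fraction of individual subtasks solved, pooled over all chains."""
    outcomes = [ok for chain in chains for ok in chain]
    return sum(outcomes) / len(outcomes)


def chain_accuracy(chains: List[List[bool]]) -> float:
    """Fraction of chains in which every consecutive subtask was solved."""
    return sum(all(chain) for chain in chains) / len(chains)


# Hypothetical outcomes for three chains of evolving subtasks.
results = [[True, True, True], [True, False, True], [True, True, False]]
print(task_accuracy(results), chain_accuracy(results))
```

A single failed subtask breaks its whole chain, so chain-level accuracy is never higher than the share of chains with no failures, which is why it is the stricter of the two measures.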

2606.13680 2026-06-12 cs.CL cs.AI New submission

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya, Hanjie Chen, Vicente Ordonez

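Institutions: Meta Superintelligence Labs; Rice University

AI summary: Proposes the RA-RFT framework, which trains a retriever via gold-relevance distillation and combines it with reinforcement fine-tuning on analogous reasoning traces to improve mathematical reasoning performance.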

Abstract

Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively -- suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.
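As a reading aid for the average@32 figure, the sketch below assumes the common interpretation of average@k: sample k answers per problem, take the fraction that are correct, and average that fraction over problems. The function name and the toy data (k = 4 rather than 32) are illustrative assumptions, not taken from the paper.

```python
from typing import List


def average_at_k(correct: List[List[int]]) -> float:
    """Mean over problems of the fraction of the k sampled answers that are correct."""
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return sum(per_problem) / len(per_problem)


# Hypothetical grading results: 2 problems, k = 4 samples each (1 = correct).
graded = [[1, 0, 1, 1], [0, 0, 1, 0]]
print(average_at_k(graded))
```

Under this reading, a gain of 7.1 points means the mean per-sample accuracy rose by 0.071 in absolute terms.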

2606.13679 2026-06-12 cs.CV New submission

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li

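Institutions: CUHK MMLab (Multimedia Laboratory, The Chinese University of Hong Kong); Meituan; CUHK IMIXR (IMIXR Lab, The Chinese University of Hong Kong)

AI summary: Proposes InterleaveThinker, the first multi-agent pipeline that gives existing image generators interleaved generation capability through planner and critic agents, and uses GRPO to reinforce single-step instruction correction, substantially improving generation performance.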

Comments
Project Page: this https URL Code: this https URL

Abstract

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
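The abstract names GRPO as the reinforcement method but does not spell out its reward shaping. As background only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: rewards for a group of sampled responses to the same prompt are centered and scaled by the group's statistics. The reward values are hypothetical, and the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center each reward on the group mean and scale by the group's standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four critic outputs sampled for one generation step.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because each group is normalized independently, a single generation step can be scored without rolling out the whole trajectory, which fits the single-step training setup the abstract describes.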

2606.13677 2026-06-12 cs.RO cs.AI cs.CV cs.LG New submission

Mana: Dexterous Manipulation of Articulated Tools

Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

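Institutions: UC Berkeley; CMU (Carnegie Mellon University); Stanford University; Amazon FAR

AI summary: Proposes the Mana framework, which recasts dexterous manipulation as an animation problem and automatically generates manipulation trajectories through a coarse-to-fine pipeline, achieving zero-shot sim-to-real transfer for articulated tools.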

Comments
Project Page: this https URL

Abstract

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

2606.13676 2026-06-12 cs.CV New submission

Modality Forcing for Scalable Spatial Generation

Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park

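Institutions: Carnegie Mellon University; World Labs

AI summary: Proposes Modality Forcing, which assigns a separate noise level to each modality to enable joint image-depth generation with a single DiT trained on sparse depth data. The method inherits the scalability of T2I pre-training and achieves competitive depth estimation performance.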

Abstract

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. this https URL
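For readers unfamiliar with the AbsRel metric, here is a minimal sketch of the standard absolute relative depth error, evaluated only where ground truth exists, as is usual with sparse depth. The convention that a ground-truth value of 0 marks a missing measurement, and the sample values, are assumptions for illustration.

```python
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground-truth depth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


# Hypothetical flattened depth maps; gt == 0.0 marks a pixel with no measurement.
pred = [1.1, 2.0, 3.0, 5.0]
gt = [1.0, 0.0, 4.0, 5.0]
print(abs_rel(pred, gt))
```

Lower is better, so a 57% relative reduction means the new error is 0.43 times the baseline's AbsRel.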

2606.13675 2026-06-12 cs.RO New submission

Improving Robotic Generalist Policies via Flow Reversal Steering

Andy Tang, William Chen, Andrew Wagenmaker, Chelsea Finn, Sergey Levine

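Institutions: Stanford University; UC Berkeley

AI summary: Proposes Flow Reversal Steering (FRS), which runs the flow policy in reverse to find the latent noise of suboptimal actions and maps it to the generalist policy's action modes, improving zero-shot control, behavioral cloning, and reinforcement learning.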

Abstract

Generalist policies can learn a wide range of skills from diverse robot datasets. In order to solve or improve on challenging new tasks, we need a way to infer and invoke the appropriate actions from the policy's rich behavioral prior, especially when directly commanding the policy fails. We focus on flow matching generalists and propose Flow Reversal Steering (FRS): a method that takes suboptimal but "reasonable" actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. These gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions -- showing up to 95% absolute task success rate boosts in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks that standard RL fails to improve on.
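The inversion step, passing an action through a flow in reverse to recover its latent noise, can be sketched on a toy one-dimensional flow. Everything here is an assumption for illustration: the velocity field, the Euler integrator, and the step count. The abstract does not specify how FRS then maps the recovered noise to generalist action modes, so that part is not shown.

```python
def velocity(x: float, t: float) -> float:
    """Toy velocity field (an assumption): pulls samples toward one action mode at 2.0."""
    return 2.0 - x


def integrate(x: float, steps: int = 1000, reverse: bool = False) -> float:
    """Euler-integrate the flow ODE from t=0 to t=1, or from t=1 back to t=0 if reverse."""
    h = 1.0 / steps
    for k in range(steps):
        if reverse:
            t = 1.0 - k * h
            x = x - h * velocity(x, t)  # step backward in time
        else:
            t = k * h
            x = x + h * velocity(x, t)  # step forward in time
    return x


action = 1.5                             # hypothetical suboptimal action
noise = integrate(action, reverse=True)  # latent noise that the flow maps near this action
roundtrip = integrate(noise)             # pushing the noise forward approximately returns it
print(noise, roundtrip)
```

With a discrete integrator the round trip is only approximate; the residual shrinks as the number of steps grows.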

2606.13674 2026-06-12 cs.CV New submission

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Junke Wang, Qihang Zhang, Shuai Yang, Yiming Luo, Yujun Shen, Zuxuan Wu, Yu-Gang Jiang, Yinghao Xu

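Institutions: Institute of Trustworthy Embodied AI, Fudan University; Robbyant, Ant Group; Hong Kong University of Science and Technology

AI summary: Proposes RepWAM, a world action model built on representation visual-action tokenizers that jointly models future visual states and latent actions, achieving strong performance on real-world and simulated robot manipulation tasks.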

Comments
Project page: this https URL

Abstract

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at this https URL.

2606.13673 2026-06-12 cs.CV cs.AI New submission

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

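Institutions: KAIST; NVIDIA

AI summary: Proposes the SpatialClaw framework, which uses code as the action interface: a stateful Python kernel with perception and geometry primitives lets a VLM agent execute step by step and flexibly compose intermediate results. It reaches 59.9% average accuracy on 20 3D/4D spatial reasoning benchmarks, 11.2 points above the existing method.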

Comments
Project page: this https URL

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
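The idea of a stateful kernel, where each cell runs in a namespace that persists across steps so later cells can build on earlier results, can be sketched with the standard library. This is a minimal illustration, not SpatialClaw's implementation: the class name, the `frames` variable, and the cell contents are hypothetical, and `exec` on untrusted code is unsafe outside a sandbox.

```python
import contextlib
import io


class StatefulKernel:
    """Runs code cells one at a time in a namespace that persists between cells."""

    def __init__(self, **preloaded):
        # Objects made available to every cell (e.g. input frames, helper functions).
        self.namespace = dict(preloaded)

    def run_cell(self, source: str) -> str:
        """Execute one cell and return whatever it printed as the observation."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


kernel = StatefulKernel(frames=["frame0", "frame1"])  # hypothetical preloaded inputs
out1 = kernel.run_cell("n = len(frames)\nprint(n)")   # step 1 defines n
out2 = kernel.run_cell("print(n * 2)")                # step 2 reuses n from step 1
print(out1, out2)
```

In contrast to single-pass execution, the agent sees `out1` before deciding what the second cell should contain.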

2606.13672 2026-06-12 cs.RO New submission

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

Arnav Kumar Jain, Yilin Wu, Jesse Farebrother, Gokul Swamy, Andrea Bajcsy

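Institutions: Mila - Québec AI Institute; Université de Montréal; Carnegie Mellon University; McGill University

AI summary: Proposes the WEAVER world model architecture, which trains multi-view latent prediction with a flow-matching loss to achieve high fidelity, long-horizon consistency, and efficient inference together, substantially improving policy evaluation, policy improvement, and test-time planning on robotic manipulation tasks.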

Abstract

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: (i) fidelity (i.e., producing simulated trajectories that correlate with reality), (ii) consistency (i.e., producing simulated trajectories that are coherent over long horizons), and (iii) efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. WEAVER is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER in robotic hardware, demonstrating its effectiveness at policy evaluation ($\rho = 0.870$ correlation with real-world success rate), policy improvement (real-world success rate improvement of 38% on top of the $\pi_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of 14% with a 5-10$\times$ speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: this https URL.

2606.13670 2026-06-12 cs.AI New submission

Automated reproducibility assessments in the social and behavioral sciences using large language models

Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, Stefan Feuerriegel

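Institutions: LMU Munich; Munich Center for Machine Learning; University of Cologne

AI summary: Uses large language models (LLMs) to automate reproducibility assessment for social and behavioral science studies. On a set of 76 studies, the LLM recovered the original effect sizes in 41% of studies and reached the same qualitative conclusion as the original study in 96% of cases, outperforming human reanalysis.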

Abstract

Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
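The recovery criterion, a reanalysis counts as recovering the original effect when its Cohen's d lies within +/-0.05 of the published value, can be sketched as below. The pooled-standard-deviation form of Cohen's d is the textbook definition; the function names and sample data are assumptions for illustration and are not drawn from the study's pipeline.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def recovered(d_original: float, d_reanalysis: float, tol: float = 0.05) -> bool:
    """True if the reanalysis effect size is within the tolerance of the original."""
    return abs(d_original - d_reanalysis) <= tol


# Hypothetical treatment and control measurements.
treatment = [5.0, 6.0, 7.0, 8.0]
control = [4.0, 5.0, 6.0, 7.0]
d = cohens_d(treatment, control)
print(d, recovered(0.50, 0.54), recovered(0.50, 0.56))
```

Note that the abstract reports the 41% recovery rate over the studies remaining after the 7 for which no viable estimate was produced, i.e. 69 of the 76.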

2606.13669 2026-06-12 cs.AI New submission

Agents-K1: Towards Agent-native Knowledge Orchestration

Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong, Zijie Guo, Tianshuo Peng, Zhuo Liu, Yi Xie, Xiang Zhuang, Yue Fan, Runmin Ma, Shiyang Feng, Xiangchao Yan, Anran Liu, Peng Ye, Wenlong Zhang, Shufei Zhang, Chunfeng Song, Fenghua Ling, Jie Zhou, Liang He, Bo Zhang, Lei Bai

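Institutions: PJLab (Shanghai Artificial Intelligence Laboratory)

AI summary: Proposes the Agents-K1 pipeline, which converts raw documents into agent-native scientific knowledge graphs using a multimodal parser, a 4B information-extraction backbone trained with GRPO, and a tri-source agent interface, supporting scientific information extraction, knowledge graph construction, and multi-hop reasoning.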

Abstract

Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat "cites" edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce Scholar-KG, of which we release a one-million-paper subset; the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

2606.13668 2026-06-12 cs.CL New submission

Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

Dimitri Kachler, Damien Sileo, Pascal Denis

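Institutions: Centre Inria de l’Université de Lille, CRIStAL, Université de Lille

AI summary: To address the high compute and storage cost of influence-function methods for attributing LLM training data, proposes Influcoder, which distills decoder gradient influence rankings into an encoder for fast, cost-effective data attribution at scale.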

Comments
8 pages, 2 figures

Abstract

As the capabilities of large language models (LLMs) grow, there has been an increasing push to curate high-quality datasets by filtering samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset can precondition a model to generate certain outputs. As an example, one might be interested in which samples in the data could be the source of toxic behavior after training the LLM. Many methods quantify this conditioning through the paradigm of influence functions. While methods of this family are effective at this task, they lack the processing speed and storage compactness needed to be practical on large datasets. We propose Influcoder, a quick and cost-effective approach to influence-based Data Attribution at scale.

2606.13663 2026-06-12 cs.CL New submission

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang, Shuo Tang, Tuney Zheng, Bryan Dai, Jian Yang, Siheng Chen

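Institutions: Shanghai Jiao Tong University; IQuest Research; Beijing University of Aeronautics and Astronautics

AI summary: To address the execution-granularity mismatch caused by step-wise calls in tool-augmented LLMs, proposes HyperTool, a unified executable interface that folds deterministic tool subroutines into a single call, substantially improving accuracy on multi-step tool tasks.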

Abstract

Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce HyperTool, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, and surpasses GPT-OSS and Kimi-k2.5 on average accuracy, showing that HyperTool can substantially improve multi-step tool use.
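The folding idea can be sketched as follows: instead of three separate model-visible tool calls, the model submits one code block that calls the tools, passes intermediate values locally, and returns only a final result. This is a toy illustration under stated assumptions, not the HyperTool interface: the tools `search_flights` and `book`, the `hyper_call` helper, and the `result` convention are all hypothetical, and running model-written code with `exec` would need sandboxing in practice.

```python
from typing import Any, Callable, Dict


def search_flights(city: str) -> list:
    """Hypothetical tool: returns candidate flights for a destination."""
    return [{"id": "F1", "price": 320}, {"id": "F2", "price": 210}]


def book(flight_id: str) -> str:
    """Hypothetical tool: books a flight and returns a confirmation string."""
    return f"booked {flight_id}"


TOOLS: Dict[str, Callable[..., Any]] = {"search_flights": search_flights, "book": book}


def hyper_call(code: str, tools: Dict[str, Callable[..., Any]]) -> Any:
    """Run one code block with the tools in scope; only `result` goes back to the model."""
    scope: Dict[str, Any] = dict(tools)
    exec(code, scope)
    return scope.get("result")


block = """
flights = search_flights("SIN")
cheapest = min(flights, key=lambda f: f["price"])
result = book(cheapest["id"])
"""
observation = hyper_call(block, TOOLS)
print(observation)
```

The intermediate flight list never enters the reasoning trace; the model sees a single observation for the whole subroutine.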

2606.13662 2026-06-12 cs.AI cs.CL New submission

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li

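Institutions: Department of Computer Science and Technology, Tsinghua University; Zhipu AI

AI summary: Proposes EurekAgent, an environment-engineering framework designed along four dimensions (permissions, artifacts, budget, and human-in-the-loop interaction), which sets new state-of-the-art results on mathematics, kernel engineering, and machine learning tasks, including 26-circle packing results found for under $11 in total API cost.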

Abstract

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

2606.13657 2026-06-12 cs.LG 新提交

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

密集监督,稀疏更新:论策略蒸馏的稀疏性与几何结构

Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye

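Affiliations: School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University; Amap, Alibaba Group

AI summary: This paper analyzes the sparsity and geometry of parameter updates in on-policy distillation (OPD), finds that the updates are sparse and concentrated on coordinates with small source weights, and verifies that training only the sparse subnetwork is effective.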

详情
Comments
Code is available at this https URL
AI中文摘要

策略蒸馏(\textsc{OPD})最近成为一种重要的后训练方法,因为它结合了两个理想的要素:策略学生轨迹和密集教师监督,但这种混合如何改变模型参数仍不清楚。在多个语言和视觉-语言模型对及用例中,我们的分析得出两个主要发现。关于稀疏性,\textsc{OPD}风格的更新小且坐标稀疏。它们分布在各层,通常以前馈网络(FFN)为主。这种稀疏结构在操作上有用:仅训练发现的子网络几乎能恢复完整\textsc{OPD}的性能。然而,在我们的优化器消融实验中,诱导稀疏性的SGD优化器表现不如AdamW,可能是因为密集教师监督保留了异质的坐标梯度尺度,而AdamW的自适应缩放仍然有用。关于几何结构,更新在数值上是满秩的,但谱集中;它们主要位于源权重的奇异子空间之外,并且不成比例地落在源权重接近零的坐标上。这些发现表明,密集教师监督并不会使\textsc{OPD}变成普通的密集参数重写;相反,\textsc{OPD}保留了策略后训练的重要几何特征。

英文摘要

On-policy distillation (\textsc{OPD}) has recently become a prominent post-training recipe as it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision, yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, \textsc{OPD}-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full \textsc{OPD}. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn \textsc{OPD} into ordinary dense parameter rewriting; instead, \textsc{OPD} retains important geometric signatures of on-policy post-training.
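The two measurements described in the abstract (how many coordinates an update touches, and whether the touched coordinates sit where the source weights are near zero) can be illustrated with a minimal Python sketch. The function names, the tolerance `tol`, and the threshold `small` are assumptions made for illustration only; they are not taken from the paper or its code.

```python
def update_sparsity(before, after, tol=1e-6):
    """Fraction of coordinates whose value changed by at most `tol`."""
    unchanged = sum(1 for b, a in zip(before, after) if abs(a - b) <= tol)
    return unchanged / len(before)


def small_weight_share(before, after, tol=1e-6, small=0.01):
    """Among changed coordinates, the share whose source weight is near zero."""
    changed = [b for b, a in zip(before, after) if abs(a - b) > tol]
    if not changed:
        return 0.0
    return sum(1 for b in changed if abs(b) < small) / len(changed)


# Toy flattened weights before and after a hypothetical update.
source = [0.50, 0.001, -0.30, 0.002, 0.80, -0.004]
updated = [0.50, 0.011, -0.30, 0.002, 0.80, -0.014]

sparsity = update_sparsity(source, updated)      # 4 of 6 coordinates unchanged
share = small_weight_share(source, updated)      # both changed coordinates are near zero
```

In this toy case only two coordinates move, and both start close to zero, which mirrors the qualitative pattern the abstract reports. Real analyses would run over full weight tensors rather than short lists.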

2606.13652 2026-06-12 cs.CV cs.GR 新提交

World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

世界追踪:超越可见表面的生成式像素对齐几何

Hao Zhang, Mohamed El Banani, Jen-Hao Cheng, Paul Zhang, Yi Hua, Ben Mildenhall, Christoph Lassner, Narendra Ahuja, Gengshan Yang

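Affiliations: World Labs; University of Illinois Urbana-Champaign

AI summary: Proposes World Tracing, a generative pixel-aligned geometry representation that uses a diffusion transformer to predict ordered point stacks, reconstructing visible surfaces while generating occluded geometry, and outperforms depth-prediction and image-to-3D methods on multiple benchmarks.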

详情
Comments
World Labs Technical Report; Page: this https URL
AI中文摘要

图像到3D方法常常在忠实度和完整性之间权衡:深度估计器锚定于输入像素但止于可见表面,而图像到3D模型生成完整形状却往往与输入不对齐。我们引入世界追踪(World Tracing),一种生成式像素对齐几何表示,它预测与观测像素对齐的3D点,同时完成可见表面之外的几何。对于每个输入像素,世界追踪预测一个有序的相机空间3D点栈,其中第一层表示可见表面,后续层表示与遮挡表面的从前到后交点。我们通过一个世界追踪扩散变压器WT-DiT实例化该表示,该变压器将多个几何层视为独立的去噪令牌,并通过分解和全局注意力耦合。WT-DiT使用像素空间流匹配和混合噪声调度进行训练,平衡可见表面重建与遮挡几何生成。世界追踪在物体、场景和动态基准上,在可见表面重建和完整几何生成方面均取得了强劲性能,超越了深度预测器和图像到3D生成器。它还保留了2D到3D对应关系,实现了文本驱动的3D场景编辑、几何条件的新视角视频合成,以及与纹理网格生成器的无训练集成。

英文摘要

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.

2606.13649 2026-06-12 cs.CL cs.LG 新提交

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Operadic一致性:LLM中组合推理失败的无标签信号

Nathaniel Bottman, Yinhong Liu, Kyle Richardson

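Affiliations: Incubilate; University of Cambridge; Allen Institute for Artificial Intelligence

AI summary: Proposes operadic consistency (OC) as a label-free signal for detecting compositional reasoning failures in large language models; on four multi-hop QA datasets it correlates strongly with accuracy (Pearson r ≥ 0.86) and outperforms methods such as self-consistency.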

详情
AI中文摘要

在推理时检测LLM推理失败而无需真实标签,催生了广泛的置信度基线,包括自一致性、语义熵和P(True),这些方法基于问题内采样和自我评估。Operad理论,即通过迭代替换构建系统的形式化方法,提出了一种补充性诊断:模型对组合查询的直接回答应与通过组合同一查询的分解陈述所产生的回答一致。我们将这一思想实例化为Operadic一致性(OC),一个每问题信号。在四个多跳QA数据集上的十二个指令微调LLM(4B到671B参数,开源和闭源)上,OC与每个数据集上的准确率强相关(Pearson r ∈ [0.86, 0.94],所有p ≤ 0.0004),并且是我们评估的所有信号中唯一在所有四个数据集上均匀达到r ≥ 0.85的信号。思维链自一致性(CoT-SC;Wang等人,2023)在HotpotQA和DROP上与OC匹配(r = 0.93, 0.87),但在MuSiQue和StrategyQA上降至r ≈ 0.45。在每问题层面,OC在每个数据集上提供了超出CoT-SC和语义熵的信息(OC系数的聚类稳健p ≤ 10^{-16}),并且该结论在额外控制构造的分解感知基线时依然稳健(p ≤ 10^{-13})。相同的信号在等成本K = 3预算下,相对于调优的CoT-SC基线产生了选择性预测改进(固定覆盖率下的准确率提升)(AUARC提升+0.086至+0.096,AUROC提升+0.092至+0.164;95%置信区间在每个单元上排除零)。在五个前沿思维模型上,其中分解从模型自身的思维链中提取,相同的等成本比较在所有测试的16个(数据集、预算、指标)单元上给出了正的选择性预测点估计提升,其中12个单元的95%置信区间排除零。

英文摘要

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
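The idea of comparing a model's direct answer with the answer obtained by composing a decomposition, and then correlating the resulting score with accuracy, can be sketched in a few lines of Python. This is only an illustration: the case-insensitive exact-match agreement rule and both function names are assumptions, and the paper may define agreement differently.

```python
from math import sqrt


def operadic_consistency(direct_answers, composed_answers):
    """Share of questions where the direct answer matches the composed one."""
    pairs = list(zip(direct_answers, composed_answers))
    agree = sum(1 for d, c in pairs if d.strip().lower() == c.strip().lower())
    return agree / len(pairs)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical answers from one model on four compositional questions.
direct = ["Paris", "1912", "yes", "Nile"]
composed = ["paris", "1912", "no", "Nile"]
oc_score = operadic_consistency(direct, composed)  # 3 of 4 answers agree
```

Computing one such score per model and passing the scores together with per-model accuracies to `pearson_r` gives the kind of dataset-level correlation the abstract reports.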

2606.13647 2026-06-12 cs.CL cs.AI cs.LG 新提交

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

SkMTEB:斯洛伐克大规模文本嵌入基准与模型适配

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová

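Affiliations: Comenius University in Bratislava; Cisco Systems; Technical University of Košice; Kempelen Institute of Intelligent Technologies

AI summary: For Slovak, a low-resource West Slavic language, builds SkMTEB, the first MTEB-style text embedding benchmark (31 datasets, 7 task types), and develops the efficient, locally deployable models e5-sk-small and e5-sk-large, which use vocabulary trimming and fine-tuning to stay competitive with commercial APIs despite size reductions of up to 62%.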

详情
Comments
ACL 2026
AI中文摘要

我们介绍了SkMTEB,这是首个针对斯洛伐克语(一种低资源西斯拉夫语)的全面MTEB风格文本嵌入基准,包含31个数据集,覆盖7种任务类型——几乎是现有斯洛伐克语多语言基准覆盖深度的4倍。我们对31个嵌入模型的评估表明,大型指令调优多语言模型表现最强,而现有的针对NLU任务训练的斯洛伐克语特定模型在嵌入任务上迁移效果不佳。为了满足高效、可本地部署的斯洛伐克语嵌入需求,我们通过对多语言E5模型进行词汇裁剪和微调,开发了\texttt{e5-sk-small}(45M参数)和\texttt{e5-sk-large}(365M)模型。尽管模型尺寸缩小了高达62%,我们的开源模型在性能上与专有API相当,同时仍可本地部署用于语义搜索和检索增强生成(RAG)。我们公开了基准、模型、数据集和代码,希望我们的方法能为其他资源匮乏的语言提供可复现的路径。

英文摘要

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

2606.13644 2026-06-12 cs.CV 新提交

Surflo: Consistent 3D Surface Flow Model with Global State

Surflo:具有全局状态的一致3D表面流模型

Antoine Guédon, Shu Nakamura, Nicolas Dufour, Jiahui Lei, Ko Nishino, Angjoo Kanazawa

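Affiliations: LIX, École polytechnique; Kyoto University; Kyutai; UC Berkeley

AI summary: Proposes Surflo, which compresses a variable number of unposed RGB views into a global latent and uses flow matching to transport 3D surface points independently from noise, enabling consistent surface reconstruction at arbitrary resolution; photometric-gradient guidance at inference time suppresses local inconsistencies.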

详情
Comments
Project webpage: this https URL
AI中文摘要

几何形状对视角具有不变性,这使得任何图像集合都是单个3D状态的冗余编码。现有的前馈重建模型未能充分利用这一点:逐视角方法会生成重叠且未对齐的点云,其数量随输入数量线性增长;而全局潜在方法则局限于固定的低分辨率输出。我们提出Surflo,它将可变数量的无位姿RGB视图压缩为K个潜在令牌(一个全局状态),并通过流匹配将带方向的3D表面点从噪声独立传输到表面上进行解码。这使得输出不受任何固定网格或令牌预算的限制:相同的潜在变量在单次前向传播中即可生成从几千到一百万个点。为了抑制独立逐点解码固有的局部不一致性,我们在ODE积分过程中注入光度梯度,通过推理时的引导项关联邻近点。Surflo在表面指标上匹配或超越前馈基线,运行速度比需要数百个视图的基于优化的方法快一个数量级,并且是唯一结合全局潜在变量与任意分辨率解码的前馈方法。

英文摘要

Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (one global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.

2606.13643 2026-06-12 cs.CL 新提交

Recursive Agent Harnesses

递归智能体框架

Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah

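Affiliations: PricewaterhouseCoopers, U.S.

AI summary: Proposes the Recursive Agent Harness (RAH), a code-first form of harness recursion that extends model recursion, and substantially improves coding-agent performance on long-context reasoning.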

详情
AI中文摘要

递归语言模型(RLM)表明,模型调用的递归是长上下文推理的有效策略,而生产级编码智能体已开始编写大规模生成子智能体的代码,最近如Anthropic的动态工作流。我们命名并研究了这两条工作线之间的模式,其中递归单元是一个完整的智能体框架,包含文件系统工具、代码执行和规划,而不是没有工具的模型调用。我们将其称为递归智能体框架(RAH),并将其视为框架递归,即RLM模型递归的代码优先扩展。父智能体生成并执行一个可执行脚本,该脚本并行生成子智能体框架以处理细粒度工作负载,并使用结构化函数调用处理小子任务。我们在长上下文推理上提供了受控评估。在固定主干为GPT-5以匹配已发布的Codex和RLM基线的情况下,RAH在Oolong-Synthetic(199个样本,13个上下文长度桶,最高4M令牌)上将Codex编码智能体基线从71.75%提高到81.36%,这一增益归因于框架而非模型。使用更强的骨干Claude Sonnet 4.5,同一设计达到89.77%。

英文摘要

Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. We provide a controlled evaluation on long-context reasoning. With the backbone held fixed at GPT-5 to match the published Codex and RLM baselines, RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%.

2606.13640 2026-06-12 cs.SD 新提交

The Moving Drone: Negotiating Agency Between the Voice and the Virtual

移动的持续音:在人声与虚拟之间协商能动性

Nithya Shikarpur, Victor Arul, Anna Huang

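Affiliations: Massachusetts Institute of Technology; Harvard University

AI summary: Rooted in Hindustani music, the work uses Max/MSP loopers and the generative AI model GaMaDHaNi to turn the traditionally static drone into a dynamic, proactive virtual musical agent, exploring agency in human-machine collaboration.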

详情
Comments
Published in NIME music track 2026
AI中文摘要

印度斯坦音乐中的旋律材料通常与一个主音相关联,该主音通常由坦布拉(一种四弦持续音乐器)持续维持。植根于印度斯坦音乐,《移动的持续音》将传统静态的持续音置于运动中,在表演过程中逐渐获得能动性,从反应性角色过渡到更主动的角色。该作品在Max/MSP中使用四个独立的循环器作为“虚拟”持续音。当歌手即兴演唱时,这些循环器实时循环填充,在人声与虚拟持续音之间创建一个有机且不断演变的反馈回路。这种关系通过音高移位循环进一步在旋律上演变,引入了突然、显式运动的维度。然后,通过集成GaMaDHaNi(一种以歌手为条件的音高到人声生成式AI模型)来重新合成循环音频,从而在音色上发生变化。虽然当前的音乐AI方法优先考虑生成内容的高保真度和逼真度,这引发了音乐界对工作替代的焦虑,但本作品有意使用低保真生成输出,进一步需要人类解释和情境背景才能完成。《移动的持续音》将技术和生成式AI置于既定的社会文化音乐实践中,提出虚拟持续音作为一种主动、响应性和共同创造的音乐代理。

英文摘要

Melodic material in Hindustani music is presented in relation to a tonic, usually sustained by the tanpura, a four-stringed drone instrument. Rooted in Hindustani music, 'The Moving Drone' sets the traditionally static drone into motion; throughout the performance, the drone gains increasing agency, transitioning from reactive to more proactive roles. The work employs four independent loopers in Max/MSP to function as 'virtual' drones. They are populated cyclically in real time as the vocalist improvises, creating an organic and evolving feedback loop between the voice and the virtual drone. This relationship further evolves melodically by pitch-shifting the loops, which introduces a dimension of sudden, explicit movement. It then changes timbrally through the integration of GaMaDHaNi, a singer-conditioned pitch-to-voice generative AI model, to resynthesize looped audio. While current music AI approaches prioritize high fidelity and realism of generated content, which has sparked anxiety over job replacement in the music community, this work intentionally uses low-fidelity generative outputs that require human interpretation and situational context to be complete. 'The Moving Drone' positions technology and generative AI within established socio-cultural musical practices, proposing a virtual drone as an active, responsive, and co-creative musical agent.

2606.13630 2026-06-12 cs.CL 新提交

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

从词元到面部:探究用于3D面部动画的离散语音表示

Pedro Correa, Olivier Perrotin, Samir Sadok, Paula Costa, Thomas Hueber

Affiliations: Univ. Estadual de Campinas (UNICAMP), Brazil; Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France; Inria at Univ. Grenoble Alpes, CNRS, LJK, France

AI summary: Evaluates four families of speech representations for 3D facial synthesis, finds that encoding phonetic classes helps predict facial animation accurately, and builds an audio-visual text-to-speech pipeline on that basis.

详情
Comments
This work has been accepted in Interspeech 2026
AI中文摘要

语音表示的选择在语音驱动的3D面部动画中至关重要。不同表示在编码内容上有所差异:SSL特征强调音段和语义线索,神经编解码器产生优化用于声学重建的潜在表示,而ASR风格的目标产生基于标签的空间。我们评估了四种用于3D面部合成的语音表示族,通过客观指标和感知评估比较了它们在两个面部解码器上的面部重建质量。此外,我们进行了探测分析,将分词表示与音素单元和发音变形联系起来。我们发现,编码音素类别有利于在语义和基于标签的表示上准确预测面部动画,且面部动画质量相当。基于后者,我们引入了一个音频视觉文本到语音(AVTTS)管线,该管线利用离散表示作为共享空间来解码语音和3D面部运动。

英文摘要

The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We found that encoding phonetic classes is beneficial for accurate facial animation prediction on both semantic and label-based representations with comparable facial animation quality. From the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that leverages, as a shared space, discrete representations to decode speech and 3D facial motion.

2606.13626 2026-06-12 cs.SD cs.LG 新提交

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches

巴赫风格符号音乐的生成建模:自回归、潜变量和对抗方法的比较研究

Kyuil Lee, Dezhi Yu, Yongkang Huang

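Affiliations: Stanford University

AI summary: Compares autoregressive LSTMs, latent-variable models, and generative adversarial networks for generating Bach-style piano music; the autoregressive LSTM with attention produces the most coherent music, vector quantization mitigates posterior collapse, and the adversarial approach captures local pitch patterns but is hard to train.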

详情
Comments
11 pages, 13 figures
AI中文摘要

我们使用共享的MIDI语料库和三个模型家族研究巴赫风格符号钢琴音乐的生成建模:带注意力的自回归LSTM、包括循环VAE和向量量化VAE的潜变量模型,以及生成对抗网络。我们比较它们对复调音符序列建模、学习有用潜在表示以及生成风格连贯作品的能力。实验表明,带注意力的自回归LSTM生成最音乐连贯的样本,而向量量化有助于缓解后验塌陷,并产生比传统循环VAE更结构化的输出。对抗方法捕捉局部音高模式,但训练困难且对巴赫风格的泛化可靠性较低。这些结果突出了自回归、潜变量和对抗方法在符号音乐生成中的相对优势和失败模式。

英文摘要

We study generative modeling of Bach-style symbolic piano music using a shared MIDI corpus and three model families: autoregressive LSTMs with attention, latent-variable models including recurrent VAEs and vector-quantized VAEs, and generative adversarial networks. We compare their ability to model polyphonic note sequences, learn useful latent representations, and generate stylistically coherent compositions. Our experiments show that the autoregressive LSTM with attention produces the most musically coherent samples, while vector quantization helps mitigate posterior collapse and yields more structured outputs than conventional recurrent VAEs. The adversarial approach captures local pitch patterns but remains difficult to train and generalizes less reliably to Bach's style. These results highlight the relative strengths and failure modes of autoregressive, latent-variable, and adversarial approaches for symbolic music generation.

2606.13625 2026-06-12 cs.CV 新提交

Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios

重新审视长尾监控场景中的车辆颜色识别

Vinícius Orrú, Bruno H. Foggiatto, Gabriel E. Lima, David Menotti, Rayson Laroca

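Affiliations: Pontifical Catholic University of Paraná; National High Court of Brazil; Federal University of Paraná

AI summary: To address highly imbalanced vehicle color distributions in surveillance scenarios, this paper combines generative data augmentation, modern visual representations, loss reweighting, and other techniques, reaching 94.6% micro accuracy and 79.7% macro accuracy on the UFPR-VeSV dataset, a gain of 8.2 percentage points in macro accuracy over recent literature.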

详情
Comments
Accepted for presentation at the 2026 International Conference on Pattern Recognition (ICPR) - V3SC Workshop
AI中文摘要

车辆颜色识别是监控系统中车辆识别的重要线索,尤其是在车牌因低分辨率、遮挡、运动模糊或光照不足而难以辨认时。然而,真实世界的车辆颜色分布高度不平衡,使得整体准确率不足以评估在罕见但操作相关的颜色上的性能。本文利用UFPR-VeSV(一个具有挑战性的真实世界监控数据集),对严重类别不平衡下的车辆颜色识别进行了全面研究。我们通过两种现成的生成策略探索合成少数类增强:使用RunDiffusion/JuggernautXL的文本条件图像生成和使用Gemini 2.0 Flash的图像条件颜色编辑。精心策划的合成数据与现代视觉表征、损失重加权、学习率调度、颜色安全增强、前景感知预处理和集成融合相结合。表现最佳的方法达到了94.6%的微平均准确率和79.7%的宏平均准确率,宏平均准确率比近期文献提高了8.2个百分点。手动错误分析进一步表明,许多剩余的失败即使在人工标注者看来也是视觉上模糊的,这凸显了在无约束监控图像中基于颜色的车辆识别的实际局限性。生成的图像和源代码可在以下网址公开获取:this https URL

英文摘要

Vehicle color recognition is an important cue for vehicle identification in surveillance systems, especially when license plates are illegible due to low resolution, occlusion, motion blur, or poor illumination. However, real-world vehicle color distributions are highly imbalanced, making overall accuracy insufficient to assess performance on rare but operationally relevant colors. This paper presents a comprehensive study of vehicle color recognition under severe class imbalance using UFPR-VeSV, a challenging real-world surveillance dataset. We investigate synthetic minority-class augmentation through two off-the-shelf generative strategies: text-conditioned image generation with RunDiffusion/JuggernautXL and image-conditioned color editing with Gemini 2.0 Flash. The curated synthetic data are combined with modern visual representations, loss reweighting, learning-rate scheduling, color-safe augmentation, foreground-aware preprocessing, and ensemble fusion. The best-performing approach achieves 94.6% micro accuracy and 79.7% macro accuracy, improving macro accuracy by 8.2 percentage points over recent literature. A manual error analysis further shows that many remaining failures are visually ambiguous even for human annotators, highlighting the practical limits of color-based vehicle identification in unconstrained surveillance imagery. The generated images and source code are publicly available at this https URL
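The gap between the reported micro and macro accuracy is easier to read with a small worked example. The Python sketch below assumes the usual definitions (micro accuracy as overall accuracy, macro accuracy as the unweighted mean of per-class accuracy); the function name and the toy labels are illustrative and do not come from the paper.

```python
from collections import defaultdict


def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy; macro averages per-class accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    micro = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro


# Toy long-tailed labels: 8 white cars, 2 green cars.
y_true = ["white"] * 8 + ["green"] * 2
y_pred = ["white"] * 8 + ["white", "green"]
micro, macro = micro_macro_accuracy(y_true, y_pred)
```

Here one mistake on the rare class leaves micro accuracy at 0.9 but pulls macro accuracy down to 0.75, which is why macro accuracy is the more informative number under class imbalance.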

2606.13624 2026-06-12 cs.CL 新提交

Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

超越统一令牌:时间序列语言模型的自适应压缩

Jialin Gan, Xin Qiu, Guangzhe Chen, Xue Wang

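Affiliations: Zhejiang University; Harbin Institute of Technology; Shandong University

AI summary: To address token inefficiency in time series language models, proposes an adaptive token budgeting framework that compresses time series tokens via frequency-domain structure and reduces prompt tokens layer by layer, achieving up to 7.68× inference acceleration and performance gains in 78% of evaluated settings.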

详情
AI中文摘要

大型语言模型(LLM)通过共享令牌接口联合建模数值观测和文本上下文,实现了时间序列(TS)分析。然而,TS令牌和提示令牌表现出根本不同的信息结构,使得统一令牌处理效率低下。在本文中,我们从非对称令牌的角度研究TS语言建模中的令牌效率。我们表明,TS令牌具有高度不均匀的频谱贡献,其中许多令牌共享冗余频率模式,而一小部分保留了关键的时间证据。我们还观察到,提示令牌的影响随模型深度衰减,表明在所有层中完全保留提示是不必要的。基于这些发现,我们开发了一个自适应令牌预算框架,通过频域结构压缩TS令牌,并逐层减少提示令牌。在预测、分类、插补和异常检测上的实验表明,在\textit{\textbf{78\%}}的评估设置中实现了高达\textit{\textbf{7.68$\times$}}的推理加速和性能提升,显示了非对称令牌压缩对于可扩展TS基础模型的有效性。

英文摘要

Large language models (LLMs) have enabled time series (TS) analysis by jointly modeling numerical observations and textual context through a shared token interface. However, TS tokens and prompt tokens exhibit fundamentally different information structures, making uniform token processing inefficient. In this paper, we study token efficiency in TS language modeling from an asymmetric-token perspective. We show that TS tokens have highly uneven spectral contributions, where many tokens share redundant frequency patterns while a small subset preserves critical temporal evidence. We also observe that prompt-token influence attenuates with model depth, suggesting that full prompt retention across all layers is unnecessary. Based on these findings, we develop an adaptive token budgeting framework that compresses TS tokens via frequency-domain structure and progressively reduces prompt tokens across layers. Experiments across forecasting, classification, imputation, and anomaly detection demonstrate up to \textit{\textbf{7.68$\times$}} inference acceleration and performance gains in \textit{\textbf{78\%}} of evaluated settings, showing the effectiveness of asymmetric token compression for scalable TS foundation models.

2606.13621 2026-06-12 cs.AI cs.CR cs.GT cs.LG cs.MA 新提交

Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

超越运行时强制:作为对抗网络可防御性分析的盾牌合成

Achraf Hsain, Sultan Almuhammadi

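Affiliations: Information and Computer Science Department, King Fahd University of Petroleum and Minerals

AI summary: Reinterprets shield synthesis as a design-time analysis tool: a constrained two-player safety game yields a defensibility verdict, which is combined with topology-level metrics and reinforcement learning behavior into a defensibility fingerprint that reveals structural insights about system security.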

详情
Comments
26 pages, 7 figures, 7 tables. Under review at JAIR. Code: this https URL
AI中文摘要

盾牌强化学习通常被呈现为一种运行时安全机制,它将时序逻辑规范编译成限制智能体行为的自动机。我们认为这是错误的产品。同样的自动机理论机制——规范编译、乘积博弈构建、吸引子计算和获胜区域提取——更适合被解读为一种设计时分析工具,其输出是关于系统的结构性见解,而非对已部署智能体的运行时约束。我们通过一个用于网络防御的约束双人安全博弈来实例化这一点。两个规范被不对称地执行:防御者规范定义了博弈的不安全区域,而攻击者规范在吸引子计算期间限制了对手的合法行为。求解该博弈产生一个可防御性判定——一个形式化证书,表明拓扑-规范对是否可防御——以及相关的获胜区域和盾牌。除了二元判定,我们还从吸引子结构中推导出拓扑级度量,并将其与盾牌约束的对抗性多智能体强化学习的后收敛行为相结合。这些共同构成了一个可防御性指纹,捕捉了网络的形式安全属性及其在自适应博弈下的操作行为。假设分析表明,形式可防御性和操作有效性捕捉了安全的不同方面:小的架构变化可能导致操作结果的巨大变化,而形式安全裕度几乎不变。因此,盾牌合成最有价值的不是作为安全智能体的部署机制,而是作为回答关于系统是否、在哪里以及如何可以被防御的架构问题的框架。可防御性判定是输出,而非安全策略。

英文摘要

Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery -- specification compilation, product game construction, attractor computation, and winning-region extraction -- is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict -- a formal certificate that a topology-specification pair is or is not defensible -- with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
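The attractor computation and winning-region extraction named in the abstract can be sketched for a small turn-based safety game. This is a simplified illustration under stated assumptions, not the paper's construction: states are owned by either the attacker ("A") or the defender ("D"), the `edges` of attacker states are assumed to already contain only the moves the attacker specification allows, and states without successors are treated as safe dead ends. All names are hypothetical.

```python
def attacker_attractor(owner, edges, unsafe):
    """States from which the attacker can force a visit to `unsafe`.

    owner: dict state -> "A" (attacker moves) or "D" (defender moves)
    edges: dict state -> list of successor states
    """
    attractor = set(unsafe)
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for state, succs in edges.items():
            if state in attractor or not succs:
                continue
            if owner[state] == "A":
                # The attacker needs only one move into the attractor.
                pulled_in = any(s in attractor for s in succs)
            else:
                # The defender is trapped only if every move leads there.
                pulled_in = all(s in attractor for s in succs)
            if pulled_in:
                attractor.add(state)
                changed = True
    return attractor


def defensibility_verdict(owner, edges, unsafe, initial):
    """Return (defensible?, defender winning region) for a safety game."""
    losing = attacker_attractor(owner, edges, unsafe)
    winning = set(owner) - losing
    return initial in winning, winning


# Toy network game: the defender at s0 chooses between s1 and s2.
owner = {"s0": "D", "s1": "A", "s2": "A", "bad": "A"}
edges = {"s0": ["s1", "s2"], "s1": ["bad", "s0"], "s2": ["s0"], "bad": ["bad"]}
defensible, winning = defensibility_verdict(owner, edges, {"bad"}, "s0")
```

In this toy game the defender can stay safe forever by always moving from `s0` to `s2`, so the verdict is positive and the winning region is `{"s0", "s2"}`. A shield in this setting would simply forbid the defender move from `s0` to `s1`, since it leaves the winning region.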

2606.13610 2026-06-12 cs.CL cs.AI 新提交

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

一个被污染的页面就够了:评估生成式推荐系统中的网页内容污染

Minghao Luo, Liang Chen

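Affiliations: The Chinese University of Hong Kong

AI summary: Introduces the FORGE benchmark to evaluate how vulnerable search-augmented LLMs are to recommending fake products when retrieval results are polluted; a single polluted page yields fooled rates of up to 27%, and reasoning does not mitigate the problem.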

详情
AI中文摘要

搜索增强的大语言模型通过检索实时网页内容越来越多地介入日常消费者推荐。这带来了新的风险:生成式推荐系统可能消费被污染的网页内容,例如旨在误导推荐的虚假评论和推广页面。我们提出:在消费被污染的检索结果时,搜索增强的LLM在多大程度上会成为虚假产品的无意推广者?为此,我们引入FORGE(生成环境中的虚假在线推荐),这是一个在受控网页内容污染下衡量虚假产品推荐的基准。给定上游搜索结果,FORGE将检索到的网页中的真实产品本地重写为虚假产品,以模拟网页内容污染,并测量LLM推荐虚假产品的频率。FORGE涵盖15个类别和5个消费者场景下的225个真实世界产品。在12个商业和开源LLM中,所有模型都易受影响:单个被污染的页面即可导致高达27%的被欺骗率,而完全替换前三个结果则将此比例提升至73.8%。不同类别间的脆弱性差异显著,当模型缺乏相关产品的稳定先验知识时,脆弱性增加。推理并不能缓解这种脆弱性;相反,它常常生成虚假的社会证明来为错误推荐辩护。我们评估了三种防御措施:怀疑提示和共识过滤(基于模型先验或跨文档证据)。怀疑可能加剧脆弱性,类似于推理,而过滤则可能抑制合法产品。我们在以下网址发布FORGE:this https URL。

英文摘要

Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and consensus filtering (over model priors or cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at this https URL.

2606.13608 2026-06-12 cs.AI cs.LG 新提交

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

AgentBeats:面向开放性、标准化和可复现性的智能体评估代理化

Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy, Alexandre Drouin, Alexandre Lacoste, Ramayya Krishnan, Elham Tabassi, Yu Su, Victor Barres, Chenguang Wang, Wenbo Guo, Dawn Song

Affiliations: University of California, Berkeley; Purdue University; University of Ljubljana; University of Washington; Oasis Labs; University of Maryland; IBM Research; Mila; McGill University; ServiceNow Research; Carnegie Mellon University; National Institute of Standards and Technology; The Ohio State University; University of Cambridge; University of California, Santa Barbara

AI summary: Proposes the Agentified Agent Assessment (AAA) framework, which unifies the evaluation interface through standardized protocols (A2A and MCP) to enable open, reproducible, multi-agent evaluation, and validates its coverage, practicality, and fidelity with the AgentBeats system through a large-scale competition and a case study.

详情
AI中文摘要

智能体系统在各领域快速进步,但其评估仍然碎片化。大多数基准测试依赖于固定的、以LLM为中心的测试框架,需要大量集成,造成测试与生产环境不匹配,并限制了不同智能体设计之间的公平比较。根本问题在于缺乏开放的、与智能体无关的评估接口。我们倡导代理化智能体评估(AAA),其中评估由裁判智能体执行,所有参与者通过标准化协议交互:A2A用于任务管理,MCP用于工具访问。传统基准测试定义了两个独立的接口(一个用于基准测试,一个用于智能体),而AAA只需要一个;这产生了一个通用的统一框架,将评估逻辑与智能体实现分离,并支持可复现、可互操作和多智能体评估。我们进一步引入AgentBeats作为AAA的具体实现:我们确定了五种实际操作模式,使标准化评估与开放性、隐私性和可复现性的现实约束兼容。为了大规模评估我们的设计,我们进行了两项研究:一项为期五个月的开放竞赛,吸引了来自独立参与者的12个类别的298个裁判智能体和467个主题智能体,表明AAA适用于异构基准测试范围;以及一项关于编码智能体的案例研究,证实代理化评估在保留与公开记录一致性的同时,揭示了先前缺失的直接比较结果,产生了关于智能体设计的研究见解。结合社区规模实地研究和受控编码案例研究,我们验证了AAA在异构场景下大规模提供覆盖性、实用性和保真度。AAA和AgentBeats共同为开放、标准化和可复现的智能体评估提供了清晰路径。

英文摘要

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.

2606.13607 2026-06-12 cs.AI New submission

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

Zach Studdiford, Gary Lupyan

Affiliations: University of Wisconsin–Madison

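AI summary: By comparing the error patterns of humans and 25 LLMs in everyday causal reasoning, the study finds that both exhibit reasoning driven by pattern matching rather than by abstract world models, and it identifies attention heads that drive LLM responses and can predict human reasoning errors.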

Comments
13 pages main text, 51 pages supplementary text
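AI Chinese abstract (English translation)

When large language models (LLMs) fail to generalize in reasoning or make haphazard errors, this is usually taken as evidence that LLMs are not truly reasoning but are instead performing some kind of pattern matching. The implication is that human behavior does not show the same types of failures, because human reasoning uses principled, abstract world models. We evaluated the ability of human participants and 25 LLMs to carry out common-sense reasoning in a variety of everyday situations, and observed similar error patterns in people and models. We then identified the set of attention heads that drive LLM responses and found that these heads implement a form of pattern matching. These attention heads allow us to predict seemingly inexplicable human reasoning errors caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern matching than with abstract world models.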

English abstract

When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common-sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern-matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern-matching than with abstract world models.

2606.13604 2026-06-12 cs.AI cs.LG cs.MA New submission

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

Haochen Wu, Yi Hou, Shiguang Xie

Affiliations: DoorDash

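AI summary: Presents a reinforcement learning system deployed at DoorDash that uses delayed signals to adapt dispatch objective weights, applying offline policy learning under noisy and coupled feedback to tune the tradeoff between delivery quality and batching efficiency.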

Comments
Accepted at ICML 2026 Workshop on Reinforcement Learning from World Feedback (RLxF)
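AI Chinese abstract (English translation)

Dispatch in three-sided marketplaces offers a natural setting for reinforcement learning from world feedback: decisions are evaluated through delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We describe a reinforcement learning system deployed at DoorDash that uses delayed signals to adapt dispatch objective weights in a large-scale food-delivery marketplace. The system does not replace the combinatorial assignment optimizer; instead, a store-level policy learned from logged marketplace data selects a discrete multiplier that changes the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface makes offline policy learning possible under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increased batching and reduced courier-side time costs without degrading customer-facing delivery quality. The results show how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.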

English abstract

Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
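As a rough illustration of the training recipe named in the abstract above (Double Q-learning targets plus a conservative regularizer on logged data), here is a minimal tabular sketch in Python. Everything specific in it is an assumption made for illustration: the abstract does not say which regularizer is used, so a CQL-style penalty (log-sum-exp of the Q-values minus the Q-value of the logged action) is swapped in, and the multiplier values, state encoding, learning rate, and function names are hypothetical. The deployed system uses a shared value function over store-level features, not a table.

```python
import numpy as np

# Hypothetical discrete multipliers a store-level policy could choose from.
MULTIPLIERS = [0.8, 1.0, 1.2]


def double_q_target(q_online, q_target, reward, next_state, gamma=0.9):
    """Double Q-learning target: the online table selects the next action,
    the target table evaluates it."""
    best = int(np.argmax(q_online[next_state]))
    return reward + gamma * q_target[next_state, best]


def conservative_update(q_online, q_target, transition, lr=0.1, alpha=0.5, gamma=0.9):
    """One update on a logged (state, action, reward, next_state) tuple."""
    s, a, r, s2 = transition
    td_error = double_q_target(q_online, q_target, r, s2, gamma) - q_online[s, a]

    # Assumed CQL-style penalty: logsumexp(Q(s, .)) - Q(s, a_logged).
    # Its gradient w.r.t. Q(s, .) is softmax(Q(s, .)) minus a one-hot at a.
    logits = q_online[s] - q_online[s].max()
    probs = np.exp(logits) / np.exp(logits).sum()
    penalty_grad = probs.copy()
    penalty_grad[a] -= 1.0

    q_online[s] -= lr * alpha * penalty_grad  # push down actions not in the log
    q_online[s, a] += lr * td_error           # standard TD step on the logged action
    return td_error


def train(transitions, n_states, n_actions, epochs=50, sync_every=10):
    """Offline training loop over a fixed list of logged transitions."""
    q_online = np.zeros((n_states, n_actions))
    q_target = q_online.copy()
    step = 0
    for _ in range(epochs):
        for t in transitions:
            conservative_update(q_online, q_target, t)
            step += 1
            if step % sync_every == 0:
                q_target = q_online.copy()  # periodic target-table sync
    return q_online


# Toy log: one store state; multiplier index 1 earned reward, index 0 did not,
# and index 2 never appears in the log.
logged = [(0, 1, 1.0, 0), (0, 0, 0.0, 0)]
q = train(logged, n_states=1, n_actions=len(MULTIPLIERS))
print("chosen multiplier:", MULTIPLIERS[int(np.argmax(q[0]))])
```

In this toy, the action that never appears in the log only ever receives the penalty's downward push, which is the intended effect of a conservative regularizer: values for out-of-distribution actions are kept low rather than overestimated.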