arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09828 2026-06-09 cs.CV New submission

Latent Spatial Memory for Video World Models

Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen, Zeyu Zhang, Yefei He, Zicheng Duan, Donny Y. Chen, Yuqing Yang, Bohan Zhuang

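Affiliations: Zhejiang University; Microsoft Research; Adelaide University; Monash University

AI summary: Proposes Mirage, a latent spatial memory framework that builds and queries a 3D cache directly in the diffusion latent space, avoiding pixel-space reconstruction for efficient video generation, with up to 10.57× faster generation and a 55× smaller memory footprint.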

Comments
Project Page: https://aka.ms/latent-spatial-memory, Code: https://github.com/microsoft/LatentSpatialMemory
Abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and a 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
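
Illustrative sketch (not from the paper): the two memory operations named in the abstract, depth-guided back-projection and latent-space warping, can be pictured with a few lines of NumPy. Everything here is an assumption made for illustration: the pinhole intrinsics `K`, one depth value per latent token, nearest-pixel splatting with a z-buffer, and the function names `backproject` and `warp_to_view`. Mirage's actual construction may differ.

```python
import numpy as np

def backproject(latent, depth, K):
    """Lift an (H, W, C) latent grid to 3D points in the source camera frame.
    Assumes one depth value per latent token (depth has shape (H, W))."""
    H, W, C = latent.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T          # pixel -> normalized camera ray
    points = rays * depth.reshape(-1, 1)     # scale each ray by its depth
    return points, latent.reshape(-1, C)

def warp_to_view(points, feats, K, R, t, H, W):
    """Project cached 3D tokens into a new camera (x_new = R x + t).
    Nearest-pixel splat; a z-buffer keeps the closest token per pixel."""
    cam = points @ R.T + t
    z = cam[:, 2]
    out = np.zeros((H, W, feats.shape[1]))
    zbuf = np.full((H, W), np.inf)
    idx = np.where(z > 1e-6)[0]              # keep tokens in front of the camera
    proj = cam[idx] @ K.T
    u = np.rint(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.rint(proj[:, 1] / proj[:, 2]).astype(int)
    for i, ui, vi in zip(idx, u, v):
        if 0 <= ui < W and 0 <= vi < H and z[i] < zbuf[vi, ui]:
            zbuf[vi, ui] = z[i]
            out[vi, ui] = feats[i]
    return out, np.isfinite(zbuf)            # warped latent and coverage mask
```

The returned mask marks latent positions that the cache covers; uncovered positions would be left for the generator to fill. No pixel-space rendering or VAE encoding appears in either step, which is the point the abstract makes.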

2606.09827 2026-06-09 cs.RO cs.CV New submission

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Hao Shi, Weiye Li, Bin Xie, Yulin Wang, Renping Zhou, Tiancai Wang, Xiangyu Zhang, Ping Luo, Gao Huang

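Affiliations: Tsinghua University; The University of Hong Kong; Dexmal; StepFun

AI summary: Proposes MemoryVLA++, a framework that achieves full temporal modeling through working memory, a perceptual-cognitive memory bank, and a world model that imagines future states, substantially improving performance on long-horizon, memory-dependent, and imagination-dependent tasks in simulation and on real robots.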

Comments
The project is available at https://shihao1895.github.io/MemoryVLA-PP-Web
Abstract

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web
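
Illustrative sketch (not from the paper): a memory bank with redundancy-aware consolidation and query-based retrieval could look roughly like the class below. The merge rule (average the most similar adjacent pair once capacity is exceeded), the softmax retrieval, and the name `MemoryBank` are all assumptions; the abstract does not specify these details.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class MemoryBank:
    """Hypothetical fixed-capacity store of past tokens."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def add(self, token):
        self.entries.append(np.asarray(token, dtype=float))
        if len(self.entries) > self.capacity:
            self._consolidate()

    def _consolidate(self):
        # Assumed rule: merge the most similar pair of temporally adjacent entries.
        sims = [cosine(self.entries[i], self.entries[i + 1])
                for i in range(len(self.entries) - 1)]
        i = int(np.argmax(sims))
        merged = 0.5 * (self.entries[i] + self.entries[i + 1])
        self.entries[i:i + 2] = [merged]

    def retrieve(self, query):
        # Softmax attention of the query over stored entries.
        memory = np.stack(self.entries)
        scores = memory @ query / np.sqrt(len(query))
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ memory
```

Merging near-duplicate neighbours keeps the bank bounded while preserving entries that differ from their surroundings, which is one plausible reading of "redundancy-aware".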

2606.09826 2026-06-09 cs.CV cs.AI New submission

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi

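Affiliations: The University of Hong Kong; LIGHTSPEED; The Chinese University of Hong Kong; Tsinghua University

AI summary: Proposes OmniGameArena, a unified benchmark of 12 UE5 games, together with the Improvement Dynamics Curve (IDC), which uses a reflection mechanism to evaluate VLM agents' cold-start scores, improvement dynamics, and generalization.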

Abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

2606.09825 2026-06-09 cs.LG cs.AI cs.SY eess.SY math.OC New submission

An Agency-Transferring Model-Free Policy Enhancement Technique

Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko

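Affiliations: Center for Engineering Systems and Sciences; Central University; Sirius University of Science and Technology

AI summary: Proposes a method that embeds a suboptimal baseline policy into reinforcement learning training and gradually transfers agency from the baseline to a learnable policy, improving training efficiency and ultimately yielding a standalone policy that outperforms the baseline.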

Abstract

Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.
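
Illustrative sketch (not from the paper): one simple way to arbitrate between a baseline and a learning policy is a Bernoulli switch whose probability of deferring to the baseline decays over training. The linear schedule, the function names, and the toy 1-D policies below are assumptions; the abstract does not state the paper's arbitration rule.

```python
import random

def baseline_weight(step, total_steps):
    """Probability of deferring to the baseline: linear decay from 1 to 0 (assumed schedule)."""
    return max(0.0, 1.0 - step / total_steps)

def arbitrate(state, step, total_steps, baseline_policy, learning_policy, rng):
    """Pick the acting policy for this step and return (action, source)."""
    if rng.random() < baseline_weight(step, total_steps):
        return baseline_policy(state), "baseline"
    return learning_policy(state), "learner"

# Toy 1-D placeholders: both policies push the state toward the goal at 0.
baseline = lambda s: -0.5 * s
learner = lambda s: -0.9 * s

rng = random.Random(0)
sources = [arbitrate(1.0, k, 100, baseline, learner, rng)[1] for k in range(101)]
```

At step 0 the baseline always acts, and at the final step only the learner acts, mirroring the baseline-free final stage described in the abstract.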

2606.09821 2026-06-09 cs.LG New submission

Rethinking the Divergence Regularization in LLM RL

Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo, Tianyu Pang

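Affiliations: Tencent Hunyuan; UIUC (University of Illinois Urbana-Champaign); NUS (National University of Singapore)

AI summary: Addresses the problem that hard clipping or hard masking in methods such as PPO is a poor proxy for distributional shift in long-tailed vocabularies. Proposes DRPO, which replaces the hard mask with a smooth advantage-weighted quadratic regularizer, preserving the trust-region geometry while providing continuous gradient weights and improving training stability and efficiency.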

Abstract

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
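
Illustrative sketch (not from the paper): the difference between ratio clipping, a hard divergence mask, and a smooth quadratic penalty can be seen in the per-token gradient weight each one produces. The third function uses a hypothetical surrogate, L(p) = A·p/p_old − |A|·(p − p_old)²/(2·δ·p_old), chosen only because its gradient weight vanishes at an absolute probability shift of δ and turns negative beyond it. This is an assumption and not DRPO's published loss; `eps` and `delta` are likewise illustrative values.

```python
import math

def ppo_clip_weight(p_new, p_old, adv, eps=0.2):
    """PPO-style ratio clipping: weight 1 inside the clip range, 0 once clipped."""
    ratio = p_new / p_old
    clipped = (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps)
    return 0.0 if clipped else 1.0

def hard_mask_weight(p_new, p_old, adv, delta=0.1):
    """Hard mask on absolute probability shift in the direction the advantage pushes."""
    shift = math.copysign(1.0, adv) * (p_new - p_old)
    return 0.0 if shift > delta else 1.0

def quadratic_weight(p_new, p_old, adv, delta=0.1):
    """Gradient weight of the hypothetical quadratic surrogate: 1 - shift / delta.
    Continuous, zero at the boundary, negative (corrective) beyond it."""
    shift = math.copysign(1.0, adv) * (p_new - p_old)
    return 1.0 - shift / delta

# A rare token: probability 0.001 -> 0.003 triples the ratio but barely moves mass.
rare = (0.003, 0.001, 1.0)
# A common token that has drifted well past the boundary.
drifted = (0.5, 0.3, 1.0)
```

For the rare token, ratio clipping drops the gradient even though the absolute shift is tiny, which is the long-tail mismatch the abstract describes. For the drifted token, the hard mask discards the gradient while the quadratic weight becomes negative and pulls the policy back.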

2606.09816 2026-06-09 cs.CV cs.AI math.PR New submission

PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws

Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins, Xiaojie Wang, Ke Chen, Yue Wu

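Affiliations: University of Pennsylvania; University of Cambridge; University of Oxford; Harvard University; MIT; University of Washington

AI summary: Proposes PTL-Diffusion, whose forward noising process converges to a periodic family of Gaussian terminal laws rather than a single distribution, explicitly embedding phase structure to improve distribution matching on low-dimensional manifolds and reducing errors on point-cloud and face datasets.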

Abstract

Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.
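
Illustrative sketch (not from the paper): for a scalar periodically forced Ornstein-Uhlenbeck process dX = −θ(X − a·cos(ωt))dt + σ dW, the forward marginal is Gaussian with a closed-form mean and variance, and the large-time mean stays periodic instead of settling to a constant. The cosine forcing and the parameters `theta`, `a`, `omega`, and `sigma` are assumptions chosen for illustration; the paper's forcing may be different.

```python
import math

def forward_mean(x0, t, theta, a, omega):
    """E[X_t | X_0 = x0] for dX = -theta (X - a cos(omega t)) dt + sigma dW."""
    forced = a * theta * (theta * math.cos(omega * t) + omega * math.sin(omega * t)
                          - theta * math.exp(-theta * t)) / (theta ** 2 + omega ** 2)
    return x0 * math.exp(-theta * t) + forced

def forward_var(t, theta, sigma):
    """Var[X_t | X_0]; the forcing shifts the mean but not the variance."""
    return sigma ** 2 * (1.0 - math.exp(-2.0 * theta * t)) / (2.0 * theta)

def terminal_mean(t, theta, a, omega):
    """Large-time limit of the mean: periodic in t with period 2*pi/omega."""
    return a * theta * (theta * math.cos(omega * t)
                        + omega * math.sin(omega * t)) / (theta ** 2 + omega ** 2)

def euler_mean(x0, t, theta, a, omega, n=20_000):
    """Numerical check: explicit Euler on the mean ODE dm/dt = -theta (m - a cos(omega t))."""
    dt, m = t / n, x0
    for k in range(n):
        m += -theta * (m - a * math.cos(omega * k * dt)) * dt
    return m
```

In this toy setting the terminal family is N(terminal_mean(t), σ²/(2θ)), a Gaussian whose mean depends on the phase ωt, which is the kind of nonconstant periodic terminal law the abstract refers to.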

2606.09813 2026-06-09 cs.RO cs.CV New submission

iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

Zhenyu Wu, Xiuwei Xu, Yukun Zhou, Yifan Li, Qiuping Deng, Xiaofeng Wang, Zheng Zhu, Bingyao Yu, Ziwei Wang, Jiwen Lu, Haibin Yan

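Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; GigaAI; Nanyang Technological University

AI summary: Proposes the iMac framework, which uses raw visual images as the action representation and combines an image-action encoder with a dynamic predictor for high-fidelity future-state prediction and closed-loop control, outperforming conventional vector-based action control in prediction accuracy, task success rate, and cross-scene generalization.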

Comments
Project page: https://imac-wm.github.io/
Abstract

Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposes iMac (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints, and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. Extensive experiments are conducted on public embodied manipulation benchmarks and real-world robotic scenarios. The results demonstrate that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, realizing flexible and universal control for heterogeneous embodied agents. This work provides an innovative visual-action perspective for embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.

2606.09811 2026-06-09 cs.RO cs.AI cs.CV New submission

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu

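Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Baidu AI Cloud; The University of Hong Kong

AI summary: Proposes AHA-WAM, an asynchronous horizon-adaptive world-action model built on a dual Diffusion Transformer, which decouples timing between a low-frequency world planner and a high-frequency action executor for efficient closed-loop control and reaches state-of-the-art performance on RoboTwin and real-world tasks.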

Comments
Project page: https://serene-sivy.github.io/aha-wam/
Abstract

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.

2606.09806 2026-06-09 cs.LG cs.AI New submission

Topological Neural Operators

Lennart Bastian, Samuel Leventhal, Mustafa Hajij, Tolga Birdal

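Affiliations: Imperial College London; University of San Francisco

AI summary: Proposes Topological Neural Operators (TNOs), which use Discrete Exterior Calculus for cross-dimensional coupling on cell complexes and a hierarchical structure to improve long-range information propagation, outperforming existing operators on PDE benchmarks.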

Abstract

We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features defined on cells of varying dimension and model their interactions through Discrete Exterior Calculus, enabling explicit cross-dimensional coupling via gradient-, curl-, and divergence-type operators. The key design principle is to decouple where information flows, as governed by fixed topological operators, from how it is transformed (which is learned), yielding models that respect the geometric support of physical quantities and expose conservation and compatibility structure. We further propose Hierarchical TNOs (HTNOs), which incorporate learned coarse complexes to propagate long-range and topology-dependent information. Our framework subsumes existing NOs as a special case, providing a unified perspective on operator learning across discretizations. Across a range of PDE benchmarks, including irregular-geometry flow problems, TNOs and HTNOs improve accuracy; controlled studies further isolate the benefits of native higher-rank and topological structure. Project page: https://circle-group.github.io/research/TNO
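
Illustrative sketch (not from the paper): in Discrete Exterior Calculus, gradient-type and curl-type operators are signed incidence matrices between cells of adjacent dimension. The toy complex below (two triangles sharing an edge), the identity Hodge stars, and the single-layer function `cross_dim_layer` are assumptions for illustration and are not the TNO architecture.

```python
import numpy as np

# Toy complex: 4 vertices, 5 oriented edges (i -> j with i < j), 2 triangles.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
faces = [(0, 1, 2), (1, 2, 3)]

def build_d0(n_vertices, edges):
    """Vertex-to-edge coboundary (gradient-type): (d0 f)[e] = f[j] - f[i]."""
    d0 = np.zeros((len(edges), n_vertices))
    for e, (i, j) in enumerate(edges):
        d0[e, i], d0[e, j] = -1.0, 1.0
    return d0

def build_d1(edges, faces):
    """Edge-to-face coboundary (curl-type): circulation around each triangle."""
    index = {e: k for k, e in enumerate(edges)}
    d1 = np.zeros((len(faces), len(edges)))
    for f, (a, b, c) in enumerate(faces):
        # Boundary of (a, b, c) is (a, b) + (b, c) - (a, c).
        d1[f, index[(a, b)]] += 1.0
        d1[f, index[(b, c)]] += 1.0
        d1[f, index[(a, c)]] -= 1.0
    return d1

d0, d1 = build_d0(4, edges), build_d1(edges, faces)

vertex_feat = np.array([1.0, 4.0, 2.0, 7.0])
grad = d0 @ vertex_feat        # edge features
curl_of_grad = d1 @ grad       # identically zero because d1 @ d0 = 0
div_of_grad = d0.T @ grad      # divergence-type map back to vertices (identity Hodge stars, up to sign)

def cross_dim_layer(vertex_features, weight):
    """Fixed routing (d0) followed by a learned transform (weight) and a nonlinearity."""
    return np.tanh((d0 @ vertex_features) @ weight)
```

The incidence matrices are fixed by the complex and decide where information flows, while `weight` is the only learned part. This mirrors the decoupling the abstract describes, in a much reduced form.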

2606.09803 2026-06-09 cs.CV cs.GR cs.LG New submission

Echo-Memory: A Controlled Study of Memory in Action World Models

Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

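Affiliations: Joy Future Academy

AI summary: Proposes Echo-Memory, a framework that studies memory mechanisms in action-conditioned world models through controlled comparisons, finding that raw context capacity and block-wise state-space recurrence are critical for open-domain return tasks.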

Comments
9 figures and 28 pages. Code: https://github.com/Echo-Team-Joy-Future-Academy-JD/Echo-Memory
Abstract

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.

2606.09798 2026-06-09 cs.RO New submission

SynManDex: Synthesizing Human-like Dexterous Grasps from Synthetic Human Pre-Grasps

Yanming Shao, Zanxin Chen, Wenwei Lin, Mingjie Zhou, Tianxing Chen, Xiaokang Yang, Yichen Chi, Yao Mu

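Affiliations: Shanghai AI Lab; Shanghai Jiaotong University; Shenzhen University; Fudan University; University of Hong Kong; ZTE Corporation

AI summary: Proposes the SynManDex pipeline, which uses generated human pre-grasps as guidance and achieves force-closure contacts through robot-native optimization, generating human-like dexterous grasps with high success rates and human-likeness in simulation and on real robots.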

Abstract

Human hand-object interactions encode functional intent, but direct transfer to robotic hands often fails under morphology, contact, and reachability constraints. We present SynManDex, a synthetic pipeline that uses generated human pre-grasps as affordance-aware proposals and resolves the final contacts with robot-native optimization. SynManDex samples object-conditioned digital human pre-grasps, retargets them to dexterous robotic hand poses, optimizes force-closure contacts on the target embodiment, and admits trajectories that pass checks from each step. The resulting keyframes support both grasp-and-lift demonstrations and various prehensile manipulation tasks such as tea pouring, photo taking, and flute playing, designed via VLM agents. As a result, SynManDex combines high grasp quality (86.4% grasp stability) with 4.67/5 human-likeness (93.4%). It achieves an 80.7% success rate in simulation and 25/30 (83.3%) real-robot successes when applied to a 36-DOF bimanual dexterous robotic platform.

2606.09794 2026-06-09 cs.CV cs.GR New submission

Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction

Ewa Miazga, Jorge Condor, Piotr Didyk

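Affiliations: École Polytechnique Fédérale de Lausanne; Università della Svizzera Italiana

AI summary: Systematically evaluates a range of spherical functions and proposes the Normalized Anisotropic Spherical Gabor function, a compact representation that efficiently models high-frequency appearance effects, achieving up to fivefold memory savings and better quality in radiance-field reconstruction.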

Comments
19 pages, 11 figures
Abstract

View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on spherical harmonics (SH). However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from these experiments, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function, that enables efficient modeling and learning of high-frequency appearance effects while maintaining a compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate. We validate its performance in radiance-field reconstruction tasks.
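
Illustrative sketch (not from the paper): the memory cost of spherical harmonics follows from a standard count, since an expansion up to degree L has (L + 1)² basis functions per color channel. The second function evaluates a plain spherical Gaussian lobe, a well-known compact alternative used here only as a stand-in. It is not the paper's Normalized Anisotropic Spherical Gabor function, whose formula the abstract does not give.

```python
import math

def sh_coeffs_per_primitive(degree, channels=3):
    """Number of SH coefficients stored per primitive for a given maximum degree."""
    return channels * (degree + 1) ** 2

# Coefficient count grows quadratically with the degree.
sh_table = {degree: sh_coeffs_per_primitive(degree) for degree in range(9)}

def spherical_gaussian(view, axis, sharpness, amplitude):
    """Standard spherical Gaussian lobe: amplitude * exp(sharpness * (view . axis - 1)).
    `view` and `axis` are unit 3-vectors given as tuples."""
    cos_angle = sum(v * a for v, a in zip(view, axis))
    return amplitude * math.exp(sharpness * (cos_angle - 1.0))
```

A degree-3 RGB expansion already stores 48 numbers per primitive, and sharper highlights push the degree higher, whereas a single lobe describes a narrow highlight with a direction, a sharpness, and an amplitude. That trade-off is the motivation the abstract gives for looking beyond SH.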

2606.09792 2026-06-09 cs.CV New submission

End-to-End Optimization of Incoherent Imaging for Classification Under Detector-Limited Readout

Archer Wang, Joshua Chen, Sachin Vaidya, Marin Soljačić

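Affiliations: Research Laboratory of Electronics, Massachusetts Institute of Technology; Department of Physics, Massachusetts Institute of Technology

AI summary: For detector-limited readout, improves incoherent-imaging classification through end-to-end optimization of a phase mask; shows theoretically that full readout yields no gain, while under limited readout significant improvements come from increased class separability.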

Abstract

End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks, yet a formalism characterizing when and why such systems outperform conventional lens-based imaging is largely lacking. This paper focuses on object classification, a central imaging task, and asks when end-to-end optimization of a phase mask for incoherent imaging improves performance over a conventional focusing lens. We find that these gains arise primarily under constrained detector readout and are limited under full detector readout. In the latter setting, we prove that no incoherent phase mask exceeds the ideal-channel mutual information between detector measurements and class labels; a conventional focusing lens approaches this ceiling, and joint optimization yields no empirical gain. When detector readout is constrained -- by coarse spatial sampling or a limited number of measurements -- optimized optics can substantially improve classification by increasing class separability in the detector measurements. These gains are largest under low detector noise and shrink as noise grows, because the optics shape the signal before it reaches the detector but cannot remove noise added afterward. The advantage also depends on the spectral structure of the task: co-design helps most when class-discriminative content is concentrated at lower spatial frequencies than within-class variation. We develop a theoretical framework formalizing these distinctions and test its predictions on synthetic data and standard benchmarks (MNIST, FashionMNIST, SVHN).
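
Illustrative sketch (not from the paper): the forward model behind this setting can be written in a few lines under textbook Fourier-optics assumptions. The incoherent point-spread function is the squared magnitude of the Fourier transform of the pupil, the image is the scene intensity convolved with it, and constrained readout is modelled as pixel binning followed by additive noise. The circular aperture, grid size, binning factor, Gaussian noise, and function names are assumptions chosen for illustration.

```python
import numpy as np

def incoherent_psf(phase, aperture):
    """PSF = |FFT(aperture * exp(i * phase))|^2, normalized to unit sum."""
    pupil = aperture * np.exp(1j * phase)
    field = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(pupil)))
    psf = np.abs(field) ** 2
    return psf / psf.sum()

def render(scene, psf):
    """Incoherent image: circular convolution of scene intensity with the PSF."""
    otf = np.fft.fft2(np.fft.ifftshift(psf))
    return np.real(np.fft.ifft2(np.fft.fft2(scene) * otf))

def binned_readout(img, factor):
    """Coarse spatial sampling: sum intensity over factor x factor pixel blocks."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))

def measure(scene, phase, aperture, factor, noise_std, rng):
    """Optics act before the detector; noise is added after readout."""
    clean = binned_readout(render(scene, incoherent_psf(phase, aperture)), factor)
    return clean + rng.normal(scale=noise_std, size=clean.shape)

# Example setup: 32 x 32 grid, circular aperture of radius 8, flat phase as a lens-like reference.
yy, xx = np.mgrid[-16:16, -16:16]
aperture = (xx ** 2 + yy ** 2 <= 8 ** 2).astype(float)
flat_phase = np.zeros((32, 32))
```

In an end-to-end setup, `phase` would be a trainable array optimized jointly with the classifier. Because the noise enters after `binned_readout`, the mask can reshape the signal but cannot undo the noise, which matches the abstract's explanation for why gains shrink as noise grows.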

2606.09788 2026-06-09 cs.CV 新提交

POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

POTATR: 一种用于页面级表格提取的轻量级图像到图模型

Brandon Smock, Libin Liang, Max Sokolov, Amrit Ramesh, Valerie Faucon-Morin, Tayyibah Khanam, Maury Courtland

发表机构 * Kensho Technologies

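AI summary: Proposes POTATR, a lightweight image-to-graph model (29M parameters) that outperforms frontier models on page-level table extraction while running over 130x faster at roughly 300x lower cost, reaching a GriTS_Con of 0.964 with spatially grounded output.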

详情
Comments
16 pages, split from PubTables-v2 paper
AI中文摘要

大规模文档处理需要上下文感知的表格提取(TE),既准确又高效。然而,当前方法需要数十亿参数、数百个自回归步骤或昂贵的API推理。受此启发,我们引入了页面对象表格Transformer(POTATR),这是一个轻量级的29M参数图像到图模型,扩展了表格Transformer(TATR)用于上下文感知的页面级TE。在PubTables-v2单页面基准测试中,POTATR超越了所有测试模型(包括前沿MLLM),实现了0.964的$\textrm{GriTS}_\textrm{Con}$,同时运行速度提高130倍以上,成本降低约300倍。此外,POTATR的输出是空间可解释的:每个识别元素都有一个边界框,支持视觉验证和几何文本分配。因此,POTATR在执行统一的页面级TE的同时,可以与其他模型组合,通过外部OCR扩展到扫描文档,并通过跨页面合并等技术扩展到全文档TE。代码和模型将发布。

英文摘要

Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark -- including frontier MLLMs -- achieving $\textrm{GriTS}_\textrm{Con}$ of 0.964 while running over 130$\times$ faster at roughly 300$\times$ lower cost. Further, POTATR's output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.

2606.09787 2026-06-09 cs.LG cs.NI 新提交

Zero Touch Predictive Orchestration: Automating Time-Series Models for the Cloud-Edge Continuum

零接触预测性编排:为云边连续体自动化时间序列模型

Abd Elghani Meliani, Arora Sagar, Adlen Ksentini, Raymond Knopp

发表机构 * Eurecom OpenAirInterface

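AI summary: To address the node cold-start problem in the Cloud-Edge Continuum, proposes an automated time-series forecasting architecture that combines data mixing with Neural Architecture Search, improving forecasting accuracy and accelerating convergence.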

详情
Comments
19 pages, 14 figures
AI中文摘要

云边连续体(CEC)通过将资源分布到远边缘来支持延迟关键型应用,但其极端波动性使得通过时间序列预测进行主动零接触管理至关重要。然而,编排器面临严重的“冷启动”问题:新发现的节点缺乏训练局部预测模型所需的历史数据,而通用模型无法捕捉独特的硬件和微服务行为。为解决此问题,我们提出了一种由新颖的数据混合方法驱动的全自动时间序列预测架构。在基础设施层面,我们引入了一个轻量级、技术无关的资源暴露器(RE),它动态发现节点并持续收集可定制的遥测数据(例如,计算、网络、能源)。为了克服这些初始局部样本的稀疏性,我们的框架自动将它们与TimeTrack(我们公开的高分辨率数据集,以45秒间隔收集)合并。这协同了TimeTrack的基础高频时间模式与局部节点数据的精确校准。通过神经架构搜索(NAS)引擎处理,系统自动生成高精度的基线模型。实验结果表明,将目标数据与TimeTrack合并有效缓解了冷启动挑战。与仅使用稀疏局部样本训练、仅使用通用数据集训练或将目标数据与标准替代数据集混合相比,这种集成显著提高了以均方误差(MSE)、平均绝对误差(MAE)和平均绝对百分比误差(MAPE)衡量的预测准确性,并加速了收敛,为持续MLOps部署奠定了坚实基础。

英文摘要

The Cloud-Edge Continuum (CEC) enables latency-critical applications by distributing resources to the far edge, but its extreme volatility makes proactive Zero Touch Management via time-series forecasting essential. However, orchestrators face a severe "cold start" problem: newly discovered nodes lack the historical data required to train localized predictive models, while generalized models fail to capture unique hardware and microservice behaviors. To solve this, we propose a fully automated time-series prediction architecture driven by a novel data-mixing methodology. At the infrastructure level, we introduce a lightweight, technology-agnostic Resource Exposer (RE) that dynamically discovers nodes and continuously collects customizable telemetry (e.g., compute, network, energy). To overcome the sparsity of these initial local samples, our framework automatically merges them with TimeTrack, our publicly available, high-resolution dataset collected at 45-second intervals. This synergizes TimeTrack's foundational, high-frequency temporal patterns with the precise calibration of the local node data. Processed through a Neural Architecture Search (NAS) engine, the system automatically generates highly accurate baseline models. Experimental results demonstrate that merging the target data with TimeTrack effectively mitigates the cold start challenge. This integration significantly improves forecasting accuracy measured in Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) and accelerates convergence compared to training on the sparse local samples alone, training solely on generic datasets, or mixing the target data with standard alternative datasets, establishing a robust foundation for continuous MLOps deployment.
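The three error metrics named in the abstract can be sketched in a few lines. This is only an illustration of the standard definitions, not the authors' evaluation code; the function name `forecast_errors` and the sample values in the usage lines are assumptions made for the example.

```python
def forecast_errors(actual, predicted):
    """Return (MSE, MAE, MAPE in percent) for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # MAPE divides by the actual value, so zeros are rejected up front.
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined for zero actual values")
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mape = 100.0 * sum(abs(r / a) for r, a in zip(residuals, actual)) / n
    return mse, mae, mape


# Hypothetical telemetry readings and forecasts (illustrative values only).
mse, mae, mape = forecast_errors([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
```

MAPE is scale-free, which makes it convenient when telemetry channels (compute, network, energy) use different units, but it breaks down near zero readings; MSE and MAE do not have that problem.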

2606.09780 2026-06-09 cs.SD cs.NE 新提交

Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

声音生成中的质量-多样性搜索:用于音频探索的创新引擎研究

Björn Þór Jónsson, Çağrı Erdem, Stefano Fasciani, Kyrre Glette

发表机构 * University of Oslo(奥斯陆大学)

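AI summary: Combines Quality Diversity algorithms with a supervised discriminative model, generates diverse synthetic sounds with multi-band CPPNs and DSP graphs, and analyses evolutionary paths and temporal niches, showing the potential of Innovation Engines for sound discovery.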

详情
Comments
This is an extended version of the previously published conference paper "Towards Sound Innovation Engines Using Pattern-Producing Networks and Audio Graphs": https://doi.org/10.1007/978-3-031-56992-0_14
AI中文摘要

本研究解决了作曲家和声音设计师在创建和优化工具以实现其音乐目标时所面临的挑战。通过利用进化过程促进多样性并培养偶然发现,我们自动化了在未知声音空间中的搜索以发现声音,认为促进多样性的算法可以弥合声音的理论实现与实际可访问性之间的差距。我们描述了一个生成式声音合成系统,该系统将质量多样性(QD)算法与监督判别模型相结合,灵感来自创新引擎算法,并探索了不同配置以及所选合成方法与判别模型之间的相互作用。我们研究了组合模式生成网络(CPPN)和数字信号处理(DSP)图之间的交互,引入了一种新颖的方法,该方法使用多个专门针对不同频率范围的CPPN;这产生了更简单的网络,同时保持了与单CPPN设置相当的性能。我们还通过分析音乐和非音乐背景之间的目标切换来研究进化垫脚石,揭示了谱系如何穿越看似不可能的路径到达当前精英。将先前研究的行为空间扩展到包括各种声音持续时间,我们发现了时间生态位内的特化。结果表明,CPPN和DSP图与多维表型精英档案(MAP-Elites)和深度学习分类器相结合,可以生成大量多样的合成声音,在时间和上下文维度上具有多样性和创新性。我们通过在线探索器和渲染的声音文件呈现生成的声音对象,并在音乐创作的背景下,展示了一个实验性应用,该应用展示了它们在不同持续时间和上下文中的创造潜力。

英文摘要

This study addresses the challenges composers and sound designers face in creating and refining tools to achieve their musical goals. Using evolutionary processes to promote diversity and foster serendipitous discoveries, we automate the search through uncharted sonic spaces for sound discovery, arguing that diversity-promoting algorithms can bridge the gap between the theoretical realisation and practical accessibility of sounds. We describe a system for generative sound synthesis combining Quality Diversity (QD) algorithms with a supervised discriminative model, inspired by the Innovation Engine algorithm, and explore different configurations and the interplay between the chosen synthesis approach and the discriminative model. We examine the interaction between Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs, introducing a novel approach that uses multiple specialised CPPNs for different frequency ranges; this yields simpler networks while maintaining performance comparable to single-CPPN setups. We also investigate evolutionary stepping stones by analysing goal switches between musical and non-musical contexts, revealing how lineages traverse unlikely paths to current elites. Expanding the behaviour space of a previous study to include various sound durations, we uncover specialisation within temporal niches. Results indicate that CPPN and DSP graphs coupled with a Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) and a deep learning classifier can generate a substantial variety of synthetic sounds, diverse and innovative across temporal and contextual dimensions. We present the generated sound objects through an online explorer and as rendered sound files, and, in the context of music composition, an experimental application that showcases their creative potential across various durations and contexts.
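The MAP-Elites loop mentioned in the abstract can be sketched on a toy problem. Everything below is an assumption made for illustration: in the paper the genome is a CPPN plus DSP graph, the fitness comes from a classifier, and the behaviour space includes sound class and duration, whereas here the genome is two numbers, the fitness is a simple quadratic, and the behaviour is the first gene.

```python
import random


def map_elites(evaluate, random_genome, mutate, n_bins, iterations, seed=0):
    """Minimal MAP-Elites: keep the best genome found in each behaviour bin."""
    rng = random.Random(seed)
    archive = {}  # bin index -> (fitness, genome)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Usually mutate a randomly chosen elite from the archive.
            _, parent = archive[rng.choice(sorted(archive))]
            child = mutate(parent, rng)
        else:
            # Occasionally (and always at the start) sample a fresh genome.
            child = random_genome(rng)
        fitness, behaviour = evaluate(child)  # behaviour is expected in [0, 1]
        cell = min(int(behaviour * n_bins), n_bins - 1)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, child)
    return archive


# Toy stand-ins (hypothetical): two genes in [0, 1].
def toy_random_genome(rng):
    return [rng.random(), rng.random()]


def toy_mutate(genome, rng):
    return [min(1.0, max(0.0, g + rng.gauss(0.0, 0.1))) for g in genome]


def toy_evaluate(genome):
    fitness = -(genome[1] - 0.5) ** 2   # best when the second gene is 0.5
    behaviour = genome[0]               # the first gene decides the niche
    return fitness, behaviour


archive = map_elites(toy_evaluate, toy_random_genome, toy_mutate, n_bins=10, iterations=2000)
```

The point of the archive is that a mediocre solution in an empty niche is kept, so later mutations can use it as a stepping stone toward elites in other niches.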

2606.09777 2026-06-09 cs.RO 新提交

AetheRock: An Arm-Worn Robot Teaching System for Force-Guided Vision-Tactile Learning

AetheRock: 一种用于力引导视觉触觉学习的臂戴式机器人教学系统

Hong Li, Yue Xu, Yihan Tang, Yankang Dong, Chenyuan Liu, Chenyang Yu, Xuyang Li, Siyuan Huang, Yujun Shen, Nan Xue, Yong-Lu Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团) Shanghai Innovation Institute(上海创新研究院) Beijing Institute for General Artificial Intelligence (BIGAI)(北京通用人工智能研究院)

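AI summary: Proposes AetheRock, an arm-worn device for collecting gripper-force, vision, and tactile data, and ForceVT, a framework that uses force and vision to guide tactile learning, addressing the incompatible sensor assembly that hampers force-aware robot learning.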

详情
AI中文摘要

力和触觉感知在接触密集操作中不可或缺。然而,由于手持或可穿戴设备中触觉和力传感器的不兼容装配,力感知机器人学习面临关键挑战。为解决这些限制,我们首先引入AetheRock用于夹爪力、视觉和触觉数据收集,这是一种臂戴式设备,指尖配备模块化且易于制造的视觉触觉传感器GelSlim-MiniFab,人体手指接触区域配备电阻式压力传感器,定制PCB模块,以及用于舒适和稳健收集的可穿戴套件。在此基础上,我们提出ForceVT,一种表示学习框架,利用力和视觉引导保真度无关的触觉学习,实现在任何触觉情况下的鲁棒推理。实际实验表明,AetheRock实现了合格的数据效率,且ForceVT有效缓解了视觉触觉传感器在制造和使用不一致时的低效问题。总体而言,我们的工作通过创新的硬件设计和算法减轻了夹爪力-视觉-触觉机器人学习的局限性。

英文摘要

Force and tactile sensing are indispensable in contact-rich manipulation. However, force-aware robot learning faces critical challenges due to the incompatible assembly of tactile and force sensors in handheld or wearable devices. To address these limitations, we first introduce AetheRock for gripper-force, vision, and tactile data collection, which is an arm-worn device featuring a modular and easily manufactured visuo-tactile sensor, GelSlim-MiniFab, at the fingertip, a resistive pressure sensor at the human finger contact region, a customized PCB module, and a wearable kit for comfortable and robust collection. Building on this, we propose ForceVT, a representation learning framework that uses force and vision to guide fidelity-agnostic tactile learning, enabling robust inference in any tactile situation. Real-world experiments show that AetheRock achieves qualified data efficiency and that ForceVT effectively alleviates inefficiencies when visuo-tactile sensors exhibit manufacturing and utilization inconsistencies. Overall, our work mitigates the limitations of gripper-force vision-tactile robot learning through innovative hardware design and algorithms.

2606.09774 2026-06-09 cs.AI cs.CL 新提交

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

SIGA: 用于科学模拟的自演化编码智能体适配器

Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin

发表机构 * University of California, San Diego(加利福尼亚大学圣迭戈分校)

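AI summary: Proposes SIGA, an adapter that uses retrieval, procedural memory, in-trajectory validation, and validation-enforced termination to turn general coding agents into operators of scientific simulation software, achieving a roughly 36x speedup on GEOS, with self-evolution further improving performance.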

详情
AI中文摘要

高级科学模拟器暴露了专门的输入语言,将模拟目标转化为可执行配置,但学习这些语言可能需要领域科学家花费数小时到数天。我们将模拟器设置研究为智能体-工具接口接地问题:需要哪些最小的模拟器特定适配才能使现成的编码智能体操作真实的科学软件?我们的直觉是,编码智能体已经知道如何导航文件、编辑代码、运行命令和修复输出,但它们缺乏模拟器的可执行契约:其词汇、结构约束、验证规则和终止条件。我们介绍了SIGA,一个模拟器接口接地适配器,通过检索、程序记忆、轨迹内验证和验证强制终止来提供此契约。我们主要在GEOS上评估SIGA,GEOS是一个用于地下科学的开源多物理场模拟器。SIGA在大约五分钟内生成完整的GEOS输入文件,TreeSim高于0.90,与花费大约三小时的扩展预算人类专家相当,实现了大约36倍的挂钟加速。在更难的保留集上,接地将TreeSim从0.720提高到0.789,相对于裸智能体提高了大约10%,并且可以将跨种子的标准差降低16倍。自演化通过从先前轨迹重写适配器内容进一步改进SIGA,产生了最高的保留GEOS平均值,并匹配或超过了最强的手工设计配置。迁移到OpenFOAM和LAMMPS表明,主导机制因接口而异:当结构完整性是瓶颈时,验证最重要;而当领域正确性是瓶颈时,记忆和检索最重要。这些结果表明,轻量级、可自我改进的接地层可以将通用编码智能体转变为科学软件的实用操作员。

英文摘要

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
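Validation-enforced termination, as described in the abstract, can be pictured as a loop that refuses to stop until a validator reports no errors. The sketch below is a guess at the general shape only: `run_until_valid`, the toy agent, and the section names `mesh`, `solver`, and `output` are hypothetical and do not reflect SIGA's code or the GEOS input schema.

```python
def run_until_valid(propose, validate, max_steps=10):
    """Call propose() repeatedly; only a candidate with no errors may end the run."""
    feedback = None
    for step in range(1, max_steps + 1):
        candidate = propose(feedback)
        errors = validate(candidate)
        if not errors:
            return candidate, step   # termination is allowed only here
        feedback = errors            # in-trajectory validation feeds the next step
    raise RuntimeError("step budget exhausted without a valid configuration")


REQUIRED_SECTIONS = ("mesh", "solver", "output")  # hypothetical section names


def toy_validate(config):
    """Return the list of missing sections (empty list means valid)."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


def make_toy_agent():
    """A stand-in agent that adds one missing section per step."""
    config = {}

    def propose(feedback):
        if feedback:
            config[feedback[0]] = {}
        return dict(config)

    return propose


config, steps = run_until_valid(make_toy_agent(), toy_validate)
```

Separating the validator from the agent means the stopping rule cannot be talked around by the model: an incomplete configuration simply never counts as finished.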

2606.09772 2026-06-09 cs.CV 新提交

SemDINO: A DINOv3-Driven Network for Cross-Temporal Semantic Alignment in Change Detection

SemDINO: 一种基于DINOv3的跨时间语义对齐变化检测网络

Xinyu Tong, Meihua Zhou, Jinxiao Sun, Yingjie Tang, Lei Wang

发表机构 * Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences(中国科学院新疆生态与地理研究所) University of Chinese Academy of Sciences(中国科学院大学) School of Computer Science, Xiangtan University(湘潭大学计算机科学学院) College of Information and Communication Engineering, Harbin Engineering University(哈尔滨工程大学信息与通信工程学院)

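AI summary: Proposes the SemDINO network, which uses a dual-branch encoder, multi-scale temporal interaction, semantic purification, and change enhancement modules to address insufficient cross-temporal alignment, weak multi-scale representation, and poor robustness to pseudo-changes in semantic change detection.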

详情
AI中文摘要

语义变化检测(SCD)旨在同时定位土地覆盖变化并识别转变前后的语义类别。然而,现有方法存在跨时间对齐不足、多尺度表示弱以及对光照、季节和配准噪声引起的伪变化鲁棒性差的问题。为了解决这些问题,我们提出了一种名为SemDINO的新型端到端语义变化检测网络,它将双分支编码器、多尺度时序交互、语义净化、变化增强和解耦多任务预测集成到一个统一框架中。具体来说,我们构建了一个双分支编码器,通过门控金字塔融合将CNN骨干网络和冻结的DINOv3特征相结合,实现丰富的多尺度语义表示。然后,提出了一种多尺度时序双向变换器交互(M-TBTT)模块,以实现全局跨时间特征对齐和信息交互。为了进一步增强真实变化并抑制伪变化,我们协同引入了语义净化(SCP)、双向变化增强(BiChangeEnhance)和多尺度变化增强(MCE)模块。最后,设计了一个多分支CD预测头,用于联合输出二值变化掩码、双时相语义图和边缘约束。在公开遥感CD数据集上的大量实验表明,SemDINO在复杂干扰因素场景下,相比最先进方法取得了优越的性能和泛化能力。

英文摘要

Semantic change detection (SCD) aims to simultaneously locate land-cover changes and identify the semantic categories before and after the transition. However, existing methods suffer from insufficient cross-temporal alignment, weak multi-scale representation, and poor robustness to pseudo-changes caused by illumination changes, seasonal variation, and registration noise. To address these issues, we propose a novel end-to-end semantic change detection network named SemDINO, which integrates a dual-branch encoder, multi-scale temporal interaction, semantic purification, change enhancement, and decoupled multi-task prediction into a unified framework. Specifically, we construct a dual-branch encoder that combines a CNN backbone and frozen DINOv3 features via gated pyramid fusion, enabling rich multi-scale semantic representation. Then, a multi-scale temporal bidirectional transformer interaction (M-TBTT) module is proposed to achieve global cross-temporal feature alignment and information interaction. To further enhance genuine changes and suppress pseudo-variations, we jointly introduce semantic purification (SCP), bidirectional change enhancement (BiChangeEnhance), and multi-scale change enhancement (MCE) modules. Finally, a multi-branch CD prediction head is designed to jointly output a binary change mask, bi-temporal semantic maps, and an edge constraint. Extensive experiments on public remote sensing CD datasets demonstrate that SemDINO achieves superior performance and generalization compared with state-of-the-art methods, especially in complex scenarios with interference factors.

2606.09767 2026-06-09 cs.CL cs.AI cs.LG 新提交

Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

低资源神经机器翻译的数据合成与参数高效微调:以Q'eqchi'玛雅语为例

Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee

发表机构 * University of Houston(休斯顿大学) MasterWord Services, Inc.(MasterWord Services公司) University of Washington(华盛顿大学)

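AI summary: For low-resource Indigenous languages, proposes a data synthesis method (generating a synthetic corpus from community dictionaries) combined with LoRA parameter-efficient fine-tuning; on Q'eqchi' Mayan it achieves high structural acquisition (BLEU 42.02) but shows a structural-semantic gap, so curriculum learning with authentic data is needed.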

详情
Comments
Accepted to the 29th International Conference on Text, Speech and Dialogue (TSD 2026). This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections
AI中文摘要

对于数字低资源土著语言的神经机器翻译,通常因极端数据稀缺而受阻,促使依赖抽取式网络爬取。为确保数据主权,本研究引入了一种数据合成方法,无需爬取目标语言平行文本即可引导NMT模型。以Q'eqchi'玛雅语为重点,我们将社区来源的词典转换为大规模合成语料,利用通过LoRA适配器在mT5-base模型上的参数高效微调(PEFT)。领域内评估显示出高度的结构习得(BLEU 42.02),证明合成约束有效地教授了复杂的黏着形态和VOS语序。然而,针对有机词汇表的评估揭示了结构-语义差距(BLEU 0.59),模型保持了语法完整性但缺乏自然语言的词汇基础。模型表现出对合成模板受限结构方差的过拟合;尽管流程中具有高语义熵,模型仍难以应对自然语言的句法流动性,将有机输入强制转换为僵化的学习模式。此外,利用多任务学习架构的消融研究导致了负迁移,表明辅助任务在LoRA适配器内竞争有限的参数容量,导致对合成标记的过度优化而牺牲了有机灵活性。最终,我们确定合成引导是一种高度有效的结构入门,但需要通过课程学习使用真实数据进行语义细化。

英文摘要

Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), proving that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59), where the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model exhibits overfitting to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language, forcing organic inputs into rigid learned patterns. Furthermore, an ablation study utilizing a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but requires authentic data for semantic refinement via Curriculum Learning.

2606.09764 2026-06-09 cs.LG cs.CL 新提交

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

iOSWorld:个人智能手机代理的基准测试

Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

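AI summary: Proposes iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity, with 26 new apps and 133 tasks; it evaluates agents on single-app, multi-app, and memory and personalization tasks, where the best configuration reaches 52% overall but only 37% on multi-app tasks.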

详情
AI中文摘要

一个有用的手机代理需要具备个人智能。它应该能够推理设备上存在的用户身份、历史记录和偏好,而不仅仅是在非个性化的沙箱中遵循孤立的指令。现有的移动代理基准缺乏这种个性化。我们引入了iOSWorld,这是第一个基于持久用户身份构建的交互式原生iOS模拟器基准,该身份跨越26个新构建的iOS应用。这些应用包含连接的数据,如交易、消息、旅行记录、社交关系和财务活动。iOSWorld包括133个任务,分为三个难度递增的类别。单应用任务(27个)测试一个应用,多应用任务(60个)跨越2到8个应用,记忆和个性化任务(46个)要求代理从个人数据中推断模式。我们在仅视觉和特权视觉+XML设置下评估了前沿和开源计算机使用模型。最佳配置整体达到52%,但在多应用任务上仅为37%。特权视觉+XML访问将前沿模型提升了最多26个百分点,而较小的模型并未从增加的辅助功能树输入中受益。我们将iOSWorld作为开源基准发布,包含所有应用、种子数据、任务、评分标准和评估代码。

英文摘要

A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52\% overall but only 37\% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.

2606.09758 2026-06-09 cs.RO cs.AI cs.LG 新提交

Difference-Aware Retrieval Policies for Imitation Learning

差异感知的模仿学习检索策略

Quinn Pfeifer, Ethan Pronovost, Paarth Shah, Khimya Khetarpal, Siddhartha Srinivasa, Abhishek Gupta

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington(华盛顿大学保罗·G·艾伦计算机科学与工程学院) Toyota Research Institute(丰田研究所) Google DeepMind(谷歌DeepMind) Mila

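AI summary: Proposes DARP, a semi-parametric retrieval-based imitation learning method that reparameterizes the problem in terms of local neighborhood structure built from k-nearest neighbors, addressing the out-of-distribution generalization problem of behavior cloning and improving performance by 15-46% on continuous control and robotic manipulation tasks.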

详情
Comments
12 pages, 7 figures, 3 tables. Accepted to ICLR 2026. Code and demos available at https://weirdlabuw.github.io/darp-site/
AI中文摘要

通过行为克隆的参数化模仿学习可能因部署期间的复合误差而在分布外状态上泛化能力差。我们表明,在推理期间通过半参数检索式模仿学习方法重用训练数据可以缓解这一挑战。我们提出差异感知的模仿学习检索策略(DARP),这是一种半参数检索式模仿学习方法,通过根据局部邻域结构而非直接的状态到动作映射来重新参数化模仿学习问题,从而解决这一局限性。DARP不学习全局策略,而是训练一个模型,基于专家演示中的k近邻、它们对应的动作以及邻居状态与查询状态之间的相对距离向量来预测动作。DARP不需要超出标准行为克隆所做的额外假设——它不需要额外的数据收集、在线专家反馈或任务特定知识。我们在不同领域(包括连续控制和机器人操作)以及不同表示(包括高维视觉特征)上展示了比标准行为克隆持续15-46%的性能提升。代码和演示可在https://weirdlabuw.github.io/darp-site/获取。

英文摘要

Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data during inference via a semi-parametric retrieval-based imitation learning approach can alleviate this challenge. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a semi-parametric retrieval-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on $k$-nearest neighbors from expert demonstrations, their corresponding actions, and the relative distance vectors between neighbor states and query states. DARP requires no additional assumptions beyond those made for standard behavior cloning -- it does not require additional data collection, online expert feedback, or task-specific knowledge. We demonstrate consistent performance improvements of 15-46% over standard behavior cloning across diverse domains, including continuous control and robotic manipulation, and across different representations, including high-dimensional visual features. Code and demos are available at https://weirdlabuw.github.io/darp-site/.
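The inputs the abstract lists for DARP (the $k$ nearest expert states, their actions, and the difference vectors to the query) can be assembled as follows. This is a sketch under assumptions: the function names are invented for the example, and `distance_weighted_action` is a hand-written inverse-distance average standing in for the trained model, which the abstract does not specify.

```python
import math


def retrieve_neighbors(query, demo_states, demo_actions, k):
    """Return the k nearest demonstrations as (difference_vector, action) pairs."""
    order = sorted(range(len(demo_states)),
                   key=lambda i: math.dist(query, demo_states[i]))
    neighbors = []
    for i in order[:k]:
        # Relative vector from the query state to the neighbour state.
        diff = [s - q for s, q in zip(demo_states[i], query)]
        neighbors.append((diff, demo_actions[i]))
    return neighbors


def distance_weighted_action(neighbors, eps=1e-8):
    """Placeholder predictor (NOT the learned model): inverse-distance average."""
    weights = [1.0 / (math.hypot(*diff) + eps) for diff, _ in neighbors]
    total = sum(weights)
    dim = len(neighbors[0][1])
    return [sum(w * action[d] for w, (_, action) in zip(weights, neighbors)) / total
            for d in range(dim)]


# Hypothetical 2-D states with 1-D actions.
states = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
actions = [[0.0], [1.0], [9.0]]
neighbors = retrieve_neighbors([0.25, 0.0], states, actions, k=2)
action = distance_weighted_action(neighbors)
```

Because the predictor sees difference vectors rather than absolute states, the same local relationship can be reused wherever similar demonstrations exist, which is the intuition behind reparameterizing around neighborhoods.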

2606.09751 2026-06-09 cs.AI cs.CL cs.HC 新提交

Collaborative Human-Agent Protocol (CHAP)

协作式人机协议 (CHAP)

Arsalan Shahid, Gordon Suttie, Philip Black

发表机构 * Brightbeam AI

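AI summary: Proposes the CHAP protocol, which uses structured event records (diff, rationale, hash) and composable profiles to address the loss of human-judgement signals in multi-human, multi-agent collaboration.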

详情
AI中文摘要

基础模型正从响应生成转向操作角色。它们跨步骤规划、调用工具、请求人类输入、与其他智能体协调,并越来越多地承担影响客户、索赔、代码、合同和临床决策的工作。生产部署不再是单个人类监督单个模型,而是跨团队、时区和信任边界的多人类、多智能体协作。这种协作的技术界面仍然定义不清。当智能体起草响应,人类在发布前编辑它时,人类判断的时刻是系统中最有价值的信号。在当前实践中,该信号(如果有记录)仅存在于应用程序代码、聊天线程、工单评论和集体记忆中。两个协议标准解决了相邻问题:MCP标准化了智能体对工具和数据的访问,A2A标准化了智能体间的互操作性。两者都没有定义人类和智能体共同执行可问责工作的共享工作空间。本文提出了CHAP,即协作式人机协议。在CHAP下,原本会消失在聊天线程中的覆盖操作变成了一个结构化事件,包含差异、理由和内容哈希。班次交接变成了可移植的信封,而不是置顶消息。人类对智能体草稿的批准变成了一个不可否认的签名决策,可在多年后重放。该协议通过一个小的核心(工作空间、参与者、任务、工件和仅追加的证据日志)以及可组合的配置文件(根据部署需要添加审查、模式、路由、审议、交接、身份、签名和透明度支持的审计)来实现。规范、参考实现、一致性测试套件和示例可在以下网址获取:https://github.com/BrightbeamAI/chap

英文摘要

Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human-Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent's draft becomes a non-repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append-only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency-backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap
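A structured override event of the kind the abstract describes (a diff, a rationale, and a content hash, stored in an append-only log) might look like the sketch below. The field names, the `EvidenceLog` class, and the sample strings are assumptions for illustration; the actual wire format is defined by the CHAP specification in the linked repository, which is not reproduced here.

```python
import difflib
import hashlib


def make_override_event(actor, draft, edited, rationale):
    """Build an illustrative override record for a human edit of an agent draft."""
    diff = "\n".join(difflib.unified_diff(
        draft.splitlines(), edited.splitlines(),
        fromfile="agent_draft", tofile="human_edit", lineterm=""))
    return {
        "type": "override",          # hypothetical field names throughout
        "actor": actor,
        "diff": diff,
        "rationale": rationale,
        "content_hash": hashlib.sha256(edited.encode("utf-8")).hexdigest(),
    }


class EvidenceLog:
    """Append-only log: events can be added and read back, never edited in place."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(dict(event))   # store a copy of the record
        return len(self._events) - 1       # position of the new event

    def events(self):
        return tuple(dict(e) for e in self._events)  # copies, so callers cannot mutate


log = EvidenceLog()
event = make_override_event(
    "alice", "Refund approved.", "Refund approved pending review.",
    "Policy requires a second review")
index = log.append(event)
```

Hashing the edited content lets a later reader check that the artefact they hold is the one that was approved, independently of wherever the text itself is stored.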

2606.09749 2026-06-09 cs.RO cs.LG 新提交

Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models

你的模型已经知道:面向视觉-语言-动作模型的注意力引导安全过滤器

Seongbin Park, Fan Zhang, Baharan Mirzasoleiman, Shahriar Talebi, Nader Sehatbakhsh

发表机构 * University of California Los Angeles(加州大学洛杉矶分校)

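AI summary: Finds that a small number of attention heads in VLA models reliably localize the target object and uses this to build a training-free safety framework that combines a Control Barrier Function with a real-time object tracker, enabling collision avoidance around dynamic obstacles and improving performance by 43% in dynamic scenes.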

详情
Comments
Under review
AI中文摘要

视觉-语言-动作(VLA)模型在多种机器人操作任务中展现了令人印象深刻端到端性能。然而,这些策略无法保证避免与场景中任务无关的物体发生碰撞。现有的安全过滤器通过查询视觉-语言模型(VLM)来识别障碍物及其位置,从而回避了这个问题。但这在控制循环中运行速度太慢,只能在情节初始化时调用,使得过滤器无法跟踪移动障碍物。我们发现,VLA模型中的少数注意力头能够可靠地定位策略意图接近的目标物体。这些注意力头可以在一个无需训练的安全框架中利用,该框架每一步从注意力头获取活动目标,将场景其余部分视为障碍物,并将其输入控制障碍函数(CBF)过滤器。结合轻量级实时目标跟踪器,这允许对非静态障碍物进行碰撞避免。我们在SafeLIBERO上评估了我们的框架,并扩展了移动障碍物。在原始静态基准测试中,我们的方法性能与使用特权模拟器状态识别目标(模拟在情节初始化时运行一次的基于VLM的识别步骤)的oracle相当。在动态变体中,oracle的初始目标分配变得过时,我们的方法平均优于它43%。我们的发现表明,实时安全过滤所需的感知信号已经存在于VLA策略中,并且可以在无需额外训练或重型辅助模型的情况下加以利用。

英文摘要

Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene. Existing safety filters sidestep this problem by querying a vision-language model (VLM) to identify obstacles and their locations. This, however, is too slow to run in the control loop and can only be invoked at episode initialization, leaving the filter unable to track moving obstacles. We discover that a small number of attention heads within a VLA model reliably localize the object the policy intends to approach. These heads can be exploited within a training-free safety framework that obtains the active target from the attention heads at every step, treats the remainder of the scene as obstacles, and feeds these into a Control Barrier Function (CBF) filter. Together with a lightweight real-time object tracker, this allows for collision avoidance for non-static obstacles. We evaluate our framework on SafeLIBERO, which we extend with moving obstacles. On the original static benchmark, our method performs comparably to an oracle that uses privileged simulator state to identify the target, emulating a VLM-based identification step run once at episode initialization. On the dynamic variant, where the oracle's init-time target assignment becomes stale, our method substantially outperforms it by 43%, on average. Our findings suggest that the perceptual signals needed for real-time safety filtering are already present within VLA policies and can be exploited without additional training or heavy auxiliary models.
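A Control Barrier Function filter can be illustrated on the simplest possible case. The sketch assumes a point whose velocity is commanded directly and a single spherical obstacle, with barrier $h(x) = \|x - c\|^2 - r^2$ and the condition $\nabla h(x) \cdot u + \alpha h(x) \ge 0$; the paper's robot model, obstacle representation, and CBF formulation are not given in the abstract and are likely more involved.

```python
def cbf_filter(position, nominal_velocity, obstacle_center, obstacle_radius, alpha=1.0):
    """Return the velocity closest to the nominal one that satisfies the CBF condition."""
    offset = [p - c for p, c in zip(position, obstacle_center)]
    h = sum(o * o for o in offset) - obstacle_radius ** 2   # h >= 0 means safe
    grad = [2.0 * o for o in offset]                         # gradient of h
    slack = sum(g * u for g, u in zip(grad, nominal_velocity)) + alpha * h
    if slack >= 0.0:
        return list(nominal_velocity)   # nominal command already satisfies the constraint
    grad_sq = sum(g * g for g in grad)
    if grad_sq == 0.0:
        raise ValueError("position coincides with the obstacle center")
    # Closed-form projection onto the half-space grad . u + alpha * h >= 0.
    return [u - slack * g / grad_sq for u, g in zip(nominal_velocity, grad)]


# Hypothetical 2-D example: heading straight at an obstacle at the origin.
safe = cbf_filter([2.0, 0.0], [-10.0, 0.0], [0.0, 0.0], 1.0)
```

With one linear constraint the quadratic program has this closed-form solution, so the filter is cheap enough to run every control step, which is what makes per-step target identification useful in the first place.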

2606.09748 2026-06-09 cs.AI cs.CL cs.LG 新提交

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

深度研究智能体在过程级反馈下的多轮评估

Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan

发表机构 * Google DeepMind OpenAI Perplexity AI LangChain AI

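AI summary: To address the limits of single-shot evaluation of deep research agents (DRAs), proposes Research Gap Inference (RGI) to provide process-level feedback; a single round of process feedback raises scores by 8-15 points, but multi-turn improvement is hard to sustain because of regressions.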

详情
Comments
Published as a workshop paper at SCALE - ICML 2026 (Oral)
AI中文摘要

现有的深度研究智能体(DRA)基准仅评估单次输出,忽略了一个关键问题:DRA能否在反馈指导下改进其报告?为此,我们在两种反馈设置下对DRA进行多轮评估:自我反思(智能体在无外部诊断信号的情况下修改报告)和过程级反馈(智能体接收针对其研究策略缺口的指导)。为提供过程级反馈,我们设计了研究缺口推断(RGI),该方法通过分析满足和未满足的评分标准模式来推断研究过程缺口。我们的分析揭示了三个关键发现:(i)在自我反思下,智能体以几乎相等的速率纳入和退步评分标准,导致净改进可忽略;(ii)单轮过程级反馈带来显著收益,将归一化分数提高约8-15分,并产生约35-40%的纳入率;(iii)这些收益在后续轮次中不会累积,因为智能体在重写完整报告以解决剩余缺口时,会退步多达24%的先前满足的标准。即使有针对性指导,我们所评估的DRA架构仍无法实现可靠的多轮改进。我们的代码和结果公开在 https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs。

英文摘要

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately $8$-$15$ points and yielding a roughly $35$-$40\%$ incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to $24\%$ of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.
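The incorporation and regression rates discussed in the abstract can be made concrete with a small calculation. The definitions used here are an assumption, since the abstract does not spell them out: incorporation is taken as the share of previously unsatisfied criteria that become satisfied after a turn, and regression as the share of previously satisfied criteria that become unsatisfied.

```python
def turn_rates(before, after):
    """before/after map criterion id -> bool (satisfied). Returns (incorporation, regression)."""
    unmet = [c for c, ok in before.items() if not ok]
    met = [c for c, ok in before.items() if ok]
    incorporated = sum(1 for c in unmet if after[c])
    regressed = sum(1 for c in met if not after[c])
    incorporation_rate = incorporated / len(unmet) if unmet else 0.0
    regression_rate = regressed / len(met) if met else 0.0
    return incorporation_rate, regression_rate


# Hypothetical rubric with six criteria, before and after one revision turn.
before = {"a": True, "b": True, "c": False, "d": False, "e": True, "f": True}
after = {"a": True, "b": False, "c": True, "d": False, "e": True, "f": True}
incorporation, regression = turn_rates(before, after)
```

In this toy turn one criterion is gained and one is lost, so the net score does not move even though the report changed, which mirrors the self-reflection finding in the abstract.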

2606.09746 2026-06-09 cs.CV cs.AI cs.LG 新提交

Hybrid Robustness Verification for Spatio-Temporal Neural Networks

时空神经网络的混合鲁棒性验证

Sherwin Varghese, Matthew Wicker, Alessio Lomuscio

发表机构 * Imperial College London(伦敦帝国学院)

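AI summary: For robustness verification of 3D CNNs on video and volumetric inputs, proposes spatio-temporal constraint modelling and the STBP framework, which combines exact closed-form propagation with scalable approximations and achieves 1.7x higher certified robust accuracy on benchmarks such as UCF-101.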

详情
Comments
Accepted at the 9th International Symposium on AI Verification (SAIV 2026)
AI中文摘要

随着人工智能越来越多地部署在安全关键系统中,为底层模型提供形式化的鲁棒性保证至关重要。现有的验证方法要么依赖过于保守的近似,要么产生难以承受的计算成本。例如,在视频设置中使用lp-范数扰动编码了对手可以在每个视频帧中注入噪声的信念。实际上,对抗性扰动表现出结构化的时空相关性,被约束在低维、语义上有意义的子空间中。在这项工作中,我们研究了处理视频和体素输入的3D CNN的鲁棒性验证,针对动作识别(UCF-101)、自动驾驶(Udacity)和医学成像(MedMNIST)中的应用,通过将对抗强度建模为时空约束——攻击者可以修改一组连续帧中的子集或补丁——来利用关于对抗强度的现实假设。我们证明,建模现实约束能够实现更紧的近似。我们引入了时空边界传播(STBP),这是一个验证框架,它计算第一卷积层的精确闭式表征,并通过可扩展的近似传播认证边界。计算精确闭式为第一卷积层提供了最紧的边界。因此,我们在网络的其余部分使用近似方法。为了推动该领域的进一步发展,我们提出了ST-Bench,一个用于自动驾驶和活动识别的验证基准,以系统评估可验证的鲁棒性。与现有的基于验证的方法相比,STBP在相同的扰动预算下提供了更强的鲁棒性保证,并显著提高了可扩展性,实现了1.7倍更高的认证鲁棒准确率。

英文摘要

With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of $\ell_p$-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs processing video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST). We exploit realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints, where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form provides the tightest bounds for the first convolutional layer; for the remainder of the network, we use approximation methods. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.
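Why restricting the perturbation to a subset of frames tightens the bounds can be seen on a single linear unit, which is what one output of a convolution is before the activation. The sketch below is generic interval reasoning under an $\ell_\infty$ budget, not the STBP algorithm itself; the function name and the numbers are made up for the example.

```python
def linear_output_bounds(weights, bias, x, eps, perturbed):
    """Exact output range of w . x + b when only inputs in `perturbed` may move by +/- eps."""
    center = sum(w * v for w, v in zip(weights, x)) + bias
    # Each perturbable input can shift the output by at most eps * |w_i|,
    # and the extremes are reached simultaneously, so the bound is exact.
    radius = eps * sum(abs(weights[i]) for i in perturbed)
    return center - radius, center + radius


# Hypothetical unit with four inputs; imagine inputs 0-1 come from one frame
# and inputs 2-3 from another.
weights = [1.0, -2.0, 0.5, 3.0]
x = [1.0, 1.0, 1.0, 1.0]
all_frames = linear_output_bounds(weights, 0.0, x, 0.1, range(4))
one_frame = linear_output_bounds(weights, 0.0, x, 0.1, [0, 1])
```

The interval for the single-frame attacker is strictly contained in the interval for the every-frame attacker, and any slack removed at the first layer does not get amplified by the approximations used in later layers.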

2606.09740 2026-06-09 cs.RO new submission

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models


Fan Zhang, Seongbin Park, Baharan Mirzasoleiman, Shariar Talebi, Nader Sehatbakhsh

Affiliations: University of California, Los Angeles

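AI summary: Proposes the ProbeAct framework, which uses a lightweight hidden-state probe, a kinematic state machine, and a hierarchical control barrier function to detect and recover from grasping and placement failures in VLA models without any training, raising the success rate on LIBERO-plus from 69.6% to 74.1%.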

Comments
under review
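AI abstract (translated from Chinese)

Vision-Language-Action (VLA) models perform strongly on language-conditioned robotic manipulation within their training distribution, but their ability to generalize remains fundamentally limited. They lack robustness to perturbations and often fail under lighting changes, viewpoint shifts, or small changes in the initial state. We propose PROBEACT, a training-free runtime intervention framework that detects and recovers from grasping and placement failures in pre-trained VLA policies without modifying weights or requiring additional demonstrations. PROBEACT combines three components: (i) a lightweight multi-target hidden-state probe that predicts the 3D positions of task-relevant objects from intermediate VLA features and uses Hungarian-matched identity tracking to handle multi-object scenes; (ii) an object-agnostic kinematic state machine that detects grasp, transport, and placement failures using only gripper-internal signals and end-effector kinematics; and (iii) a hierarchical Control Barrier Function (CBF) filter that encodes locations of repeated failure as soft safe-set constraints and corrects VLA actions minimally while preserving baseline behavior. As a plug-and-play, training-free intervention loop, PROBEACT is orthogonal to existing training pipelines. Evaluation on the LIBERO-plus benchmark shows that the framework, acting as a general-purpose safety net, raises the success rate of the OpenVLA-OFT model from 69.6% to 74.1% and applies broadly to both base and fine-tuned VLA policies.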

English abstract

Vision-Language-Action (VLA) models demonstrate strong performance on language-conditioned robotic manipulation within their training distribution, yet their generalization capabilities remain fundamentally limited. They lack the robustness required to handle perturbations, frequently failing when confronted with lighting changes, altered camera viewpoints, or small initial-state variations. We propose PROBEACT, a training-free runtime intervention framework that detects and recovers from grasping and placement failures in pre-trained VLA policies without modifying their weights or requiring additional demonstrations. PROBEACT combines three components: (i) a lightweight multi-target hidden-state probe that predicts the 3D positions of task-relevant objects from intermediate VLA features, with Hungarian-matched identity tracking for multi-object scenes; (ii) an object-agnostic kinematic state machine that detects grasp, transport, and placement failures using only gripper-internal signals and end-effector kinematics; and (iii) a hierarchical Control Barrier Function (CBF) filter that encodes repeated-failure locations as soft safe-set constraints, minimally correcting VLA actions while preserving baseline behavior. As a plug-and-play, training-free intervention loop, PROBEACT is orthogonal to existing training pipelines. Evaluated on the LIBERO-plus benchmark, our framework acts as a universal safety net, improving the success rate of the OpenVLA-OFT model from 69.6% to 74.1%, while demonstrating broad applicability to both base and fine-tuned VLA policies.
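
The "minimal correction" performed by a control barrier function filter can be illustrated with a small sketch. This is not the hierarchical, soft-constrained filter of the paper: it assumes a single hard constraint, single-integrator end-effector dynamics $\dot{x} = u$, and a spherical keep-out region around one failure location with barrier $h(x) = \lVert x - c \rVert^2 - r^2$. Under these assumptions the quadratic program "stay as close as possible to the nominal action while keeping $\dot{h} + \alpha h \ge 0$" has a closed-form solution. The name `cbf_filter` and all numbers are hypothetical.

```python
from typing import List, Sequence


def cbf_filter(
    u_nom: Sequence[float],
    x: Sequence[float],
    centre: Sequence[float],
    radius: float,
    alpha: float = 1.0,
) -> List[float]:
    """Return the action closest to u_nom (in Euclidean norm) that satisfies
    grad_h(x) . u + alpha * h(x) >= 0, with h(x) = ||x - centre||^2 - radius^2
    and dynamics x_dot = u."""
    diff = [xi - ci for xi, ci in zip(x, centre)]
    h = sum(d * d for d in diff) - radius ** 2
    grad = [2.0 * d for d in diff]
    slack = sum(g * u for g, u in zip(grad, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)  # nominal action already satisfies the constraint
    norm_sq = sum(g * g for g in grad)
    if norm_sq == 0.0:
        return list(u_nom)  # gradient vanishes at the centre; no direction to push
    scale = -slack / norm_sq
    # Project onto the constraint boundary along the barrier gradient.
    return [u + scale * g for u, g in zip(u_nom, grad)]


position = [1.0, 0.0, 0.0]
failure_site = [0.0, 0.0, 0.0]

toward = cbf_filter([-1.0, 0.0, 0.0], position, failure_site, radius=0.5)
sideways = cbf_filter([0.0, 1.0, 0.0], position, failure_site, radius=0.5)
print("moving toward the failure site:", toward)
print("moving sideways (unchanged):", sideways)
```

An action that already respects the constraint is returned untouched, which mirrors the abstract's point that baseline behavior is preserved unless a correction is needed.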

2606.09738 2026-06-09 cs.CV new submission

HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents


Letian Li, Chao Shen, Shuzhao Xie, Chenghao Gu, ZhengXiao He, Yu Meng, Xin Yang, Wenyuan Jiang, Zhi Wang

Affiliations: SIGS, Tsinghua University; Nankai University; University of Arizona; Zhejiang University; ETH Zurich

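AI summary: Proposes the HDSL language, which represents indoor scenes as a tree and combines LLM-agent generation, multimodal retrieval, and force-directed layout optimization to achieve structured scene generation and localized editing, significantly improving object coverage and editing efficiency.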

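AI abstract (translated from Chinese)

Text-driven indoor scene generation and editing need an intermediate representation that a language model can both generate and modify. Existing LLM-based systems usually rely on scene graphs or global constraint lists, which are compact but underspecify local geometry, making instruction-based edits hard to localize. We treat the problem as structured program generation and local program repair, and propose the Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, which makes complex scenes easier to plan recursively and to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation (HRAG) retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. On our reproduced benchmark, HDSL outperforms full text-to-scene baselines in object coverage, text-scene alignment, and generation time while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.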

English abstract

Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation (HRAG) retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.
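
A deterministic three-way merge of an edited subtree can be sketched as follows. This is an illustration under simplifying assumptions, not the HDSL implementation: nodes are flattened to a dictionary from a hypothetical node id to a serialized attribute string, `base` is the subtree as it was retrieved, `edited` is the LLM's rewrite, and `current` is the scene state at merge time. Conflicting changes raise an error rather than being resolved silently.

```python
from typing import Dict, Optional

Subtree = Dict[str, str]  # hypothetical flat form: node id -> attribute string


def three_way_merge(base: Subtree, edited: Subtree, current: Subtree) -> Subtree:
    """Merge `edited` into `current` relative to their common ancestor `base`.
    A missing key means the node does not exist in that version."""
    merged: Subtree = {}
    # Sorted iteration keeps the result independent of dict insertion order.
    for key in sorted(set(base) | set(edited) | set(current)):
        b: Optional[str] = base.get(key)
        e: Optional[str] = edited.get(key)
        c: Optional[str] = current.get(key)
        if e == c:
            value = e  # both sides agree
        elif e == b:
            value = c  # only the current scene changed this node
        elif c == b:
            value = e  # only the edit changed this node
        else:
            raise ValueError(f"conflicting changes at node {key!r}")
        if value is not None:
            merged[key] = value
    return merged


base = {"sofa": "x=1", "lamp": "x=2", "rug": "x=0"}
edited = {"sofa": "x=3", "lamp": "x=2"}               # edit moves sofa, deletes rug
current = {"sofa": "x=1", "lamp": "x=5", "rug": "x=0"}  # scene moved lamp meanwhile

print(three_way_merge(base, edited, current))
# prints {'lamp': 'x=5', 'sofa': 'x=3'}
```

Because nodes untouched by the edit are taken from the current scene, unrelated objects survive the merge unchanged, which is one plausible reading of why a localized rewrite preserves the rest of the scene.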

2606.09735 2026-06-09 cs.CL new submission

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model


Wendy K. Tam

Affiliations: Vanderbilt University; University of Illinois at Urbana-Champaign; National Center for Supercomputing Applications

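AI summary: Studies the effect of RLHF on the partisan leaning of Llama 3.1 8B and finds that RLHF only compresses the variance of the partisan signal to produce neutral output rather than removing the partisan structure, and that feature-level steering can bypass the alignment.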

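AI abstract (translated from Chinese)

The goal of alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is not transparent: which values are encoded, whose values are they, and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. Taking partisan political orientation as an example, we conduct a mechanistic case study that compares the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced, non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality not by erasing the model's knowledge of partisanship but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional rather than structural, so the underlying geometry that enables partisan steering remains intact. Mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.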

English abstract

The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is opaque. What values are being encoded, whose values are they, and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural, so the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.
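
The distinction between compressing the variance of a signal along a direction and removing that direction can be shown with a toy calculation. The sketch below uses made-up two-dimensional vectors, not activations from Llama 3.1 8B, and the helper names `project` and `steer` are hypothetical. It only illustrates the geometry: the projections onto a fixed direction can have much smaller variance while the direction itself still exists and can be used for additive steering.

```python
from statistics import pvariance
from typing import List, Sequence


def unit(direction: Sequence[float]) -> List[float]:
    """Normalize a direction vector to unit length."""
    norm = sum(d * d for d in direction) ** 0.5
    return [d / norm for d in direction]


def project(acts: Sequence[Sequence[float]], direction: Sequence[float]) -> List[float]:
    """Scalar projection of each activation vector onto the direction."""
    u = unit(direction)
    return [sum(a * ui for a, ui in zip(vec, u)) for vec in acts]


def steer(vec: Sequence[float], direction: Sequence[float], strength: float) -> List[float]:
    """Add `strength` units of the direction to an activation vector."""
    u = unit(direction)
    return [v + strength * ui for v, ui in zip(vec, u)]


direction = [1.0, 0.0]  # toy stand-in for a learned direction

# Toy "before" activations, and "after" activations with the first
# coordinate shrunk by a factor of 10 (illustrative numbers only).
before = [[2.0, 0.0], [-2.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
after = [[0.1 * v[0], v[1]] for v in before]

print("variance before:", pvariance(project(before, direction)))
print("variance after:", pvariance(project(after, direction)))

# The direction still exists: adding it back moves the projection again.
steered = steer(after[2], direction, strength=3.0)
print("projection after steering:", project([steered], direction)[0])
```

In this toy setting the spread along the direction shrinks by a factor of 100, yet a steering vector along the same direction still shifts the projection, which is the qualitative pattern the abstract describes.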

2606.09730 2026-06-09 cs.AI new submission

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research


Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang, Qianggang Cao, Xinyu Kong, Zujie Wen, Zhiqiang Zhang, Jun Zhou

Affiliations: Tsinghua University; Peking University; Ant Group; Gaoling School of Artificial Intelligence, Renmin University of China

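AI summary: Proposes the SearchSwarm framework, which internalizes task decomposition and delegation decisions into model weights through supervised fine-tuning, achieving the best performance among models of comparable scale on BrowseComp and BrowseComp-ZH.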

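AI abstract (translated from Chinese)

Large language models increasingly need to handle complex, long-horizon real-world tasks whose context demands can grow without bound, while model context windows are inherently finite. Recent work explores a paradigm in which a main agent decomposes the task and dispatches subtasks to subagents, which execute them and return only summarized results, saving the main agent's context budget. Doing this well, however, requires delegation intelligence: the ability to decompose complex tasks, decide when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is rare in natural text, and to our knowledge, how to synthesize such data and train models to acquire the capability remains largely unexplored in the open-source community. To fill this gap, we present a preliminary exploration of deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation while constraining subagents to return results properly in support of the main agent's workflow. The trajectories produced under the harness naturally encode correct delegation decisions, and we use them as supervised fine-tuning data to internalize delegation intelligence into the model weights. Our model, SearchSwarm-30B-A3B, reaches 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.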

English abstract

Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.
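
The context-saving effect of delegation described above can be sketched with a toy loop. This is not the SearchSwarm harness: `MainAgent`, `fake_subagent`, and `last_line` are hypothetical stand-ins, with plain functions in place of LLM calls. The only point illustrated is that the main agent's context receives the summary of each subtask and never the subagent's full working trace.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MainAgent:
    run_subagent: Callable[[str], str]  # returns the subagent's full trace
    summarize: Callable[[str], str]     # reduces a trace to a short result
    context: List[str] = field(default_factory=list)

    def delegate(self, subtask: str) -> str:
        """Run a subtask in a separate context and keep only its summary."""
        trace = self.run_subagent(subtask)
        summary = self.summarize(trace)
        self.context.append(f"{subtask}: {summary}")
        return summary


def fake_subagent(subtask: str) -> str:
    """Stand-in for a subagent: many intermediate steps, then an answer line."""
    steps = [f"step {i}: searched and read a page for '{subtask}'" for i in range(50)]
    return "\n".join(steps + [f"ANSWER: result for {subtask}"])


def last_line(trace: str) -> str:
    """Stand-in summarizer: keep only the final line of the trace."""
    return trace.splitlines()[-1]


agent = MainAgent(run_subagent=fake_subagent, summarize=last_line)
subtasks = ["find publication year", "find first author"]
for subtask in subtasks:
    agent.delegate(subtask)

trace_chars = sum(len(fake_subagent(s)) for s in subtasks)
context_chars = sum(len(entry) for entry in agent.context)
print("characters in subagent traces:", trace_chars)
print("characters kept in main context:", context_chars)
```

In this sketch the decomposition into subtasks is hard-coded; the abstract's claim is that a trained model makes those decisions itself.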