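arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Robotics / Embodied AI

Robotics, embodied AI, robot learning, manipulation, navigation, and embodied world models.

Indexed for today / current date: 79 Sources: cs.RO, cs.AI, cs.CV, cs.LG

1. Robot learning (23 papers)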

2606.20104 2026-06-19 cs.LG cs.AI New submission 85%

Sensorimotor World Models: Perception for Action via Inverse Dynamics

Petr Ivashkov, Randall Balestriero, Bernhard Schölkopf

Affiliations Max Planck Institute for Intelligent Systems; Department of Computer Science, Brown University; ELLIS Institute; ETH Zürich

Topic match Robot learning: world models for robot control

AI summary Proposes the sensorimotor world model (SMWM), a latent world model trained end-to-end with inverse dynamics regularization, which prevents representation collapse and learns compact, action-aligned representations; it achieves competitive planning performance on 2D and 3D control tasks.

English abstract

Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.
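
The collapse argument in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the 1-D world, the hand-written encoder, predictor, and inverse dynamics functions, and the name `smwm_style_losses` are all assumptions made for illustration. A collapsed encoder makes the latent prediction loss trivially zero, but the inverse dynamics term stays high because the action can no longer be recovered from the latents.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def smwm_style_losses(encode, predict, inverse, transitions):
    """Return (latent prediction loss, inverse dynamics loss) for a batch.

    encode, predict, and inverse are hypothetical stand-ins for the learned
    encoder, latent predictor, and inverse dynamics head.
    """
    z = [encode(o) for o, _, _ in transitions]
    z_next = [encode(o_next) for _, _, o_next in transitions]
    actions = [a for _, a, _ in transitions]
    z_pred = [predict(zi, a) for zi, a in zip(z, actions)]
    a_pred = [inverse(zi, zn) for zi, zn in zip(z, z_next)]
    return mse(z_pred, z_next), mse(a_pred, actions)


# Assumed toy 1-D world: next_obs = obs + action, with actions in {-1, +1}.
TRANSITIONS = [(0.0, 1.0, 1.0), (1.0, -1.0, 0.0), (2.0, 1.0, 3.0), (3.0, -1.0, 2.0)]

# Informative latent: z = obs, so the action is recoverable as z_next - z.
informative = smwm_style_losses(
    lambda o: o, lambda z, a: z + a, lambda z, zn: zn - z, TRANSITIONS
)

# Collapsed latent: every observation maps to 0, so no head can recover the action.
collapsed = smwm_style_losses(
    lambda o: 0.0, lambda z, a: 0.0, lambda z, zn: 0.0, TRANSITIONS
)
```

Both models reach zero prediction loss, so only the inverse dynamics term separates the useful representation from the collapsed one.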

2606.20056 2026-06-19 cs.RO New submission 85%

VFILC: Accurate Frequency Extrapolations in Imitation Learning via Sampling Frequency ILC

Nozomu Masuya, Toshiaki Tsuji, Sho Sakaino

Affiliations Grad. School of Science Technology, University of Tsukuba, Tsukuba, Japan; Engineering, Saitama University, Saitama, Japan; Information Engineering, University of Tsukuba, Tsukuba, Japan

Topic match Robot learning: proposes an imitation learning method for extrapolating robot motion speed.

AI summary Proposes VFILC, which combines variable-frequency imitation learning with feedforward-feedback iterative learning control and achieves accurate speed extrapolation on three tasks, reducing frequency error by up to 81%.

Comments 8 pages, 17 figures. Accepted at IROS 2026

English abstract

Conventional neural network (NN)-based imitation learning methods for variable-speed motion either restricted their scope to interpolated speeds, or generated unpredictable motions when extrapolating beyond trained velocity ranges. Variable-frequency imitation learning (VFIL) enabled extrapolations of speeds by linking the NN model's sampling frequency to the motion frequency, whereas its open-loop configuration caused frequency errors, especially in the extrapolated high-frequency settings. This study proposes variable-frequency imitation learning with iterative learning control (VFILC) based on a combination of VFIL and iterative learning control (ILC) with both feedforward and feedback parts, the former taking advantage of VFIL and the latter adjusting the frequency errors. The experimental results showed that the proposed method successfully and accurately extrapolated motion speeds and reduced frequency errors in all three tasks, and that the feedback especially reduced the frequency errors by a remarkable 81% in the wiping task and 50% in the shaking task, both compared to simple feedforward VFIL, when extrapolating at double the average speed in the training data. The proposed method also improved accuracy by 27% compared with VFIL even at an interpolated frequency for a contact-rich mixing task affected by complex friction traits.
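
The abstract does not state the control law, so the following is only a generic iterative learning control sketch of the feedforward-plus-feedback idea. The plant model `lagging_plant`, the gain, and the function names are hypothetical. The feedforward term commands the target frequency directly, and a correction learned across trials removes the remaining frequency error.

```python
def run_frequency_ilc(target_hz, plant, gain=0.5, trials=20):
    """Generic ILC-style loop: correction[k+1] = correction[k] + gain * error[k]."""
    correction = 0.0
    errors = []
    for _ in range(trials):
        command = target_hz + correction   # feedforward command plus learned correction
        achieved = plant(command)          # frequency actually realized in this trial
        error = target_hz - achieved
        errors.append(error)
        correction += gain * error         # feedback update applied between trials
    return errors


def lagging_plant(command_hz):
    """Assumed open-loop behavior: the motion reaches only 80% of the commanded frequency."""
    return 0.8 * command_hz


errors = run_frequency_ilc(2.0, lagging_plant)
```

With this assumed plant the error shrinks by a factor of 0.6 per trial, so the open-loop error of 0.4 Hz decays toward zero.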

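2606.20048 2026-06-19 cs.RO New submission 85%

MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs

Zheyu Zhuang, Ruiyu Wang, Giovanni Luca Marchetti, Florian T. Pokorny, Danica Kragic

Affiliations Division of Robotics, Perception and Learning

Topic match Robot learning: proposes mirrored-demonstration augmentation for behaviour cloning in robot learning.

AI summary Proposes MirrorDuo, which uses reflection consistency to generate a mirrored copy of each original demonstration as data augmentation; under the same data budget it significantly improves behaviour cloning performance and supports zero- and few-shot skill transfer.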

Comments Published in CoRL 2025

Journal ref CoRL 2025

English abstract

Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras. However, it remains constrained by the cost of collecting diverse demos, especially for generalizing across workspace variations. We propose MirrorDuo, a reflection-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving "collect one, get one for free". It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning or diffusion policy, or as a structural prior for reflection-equivariant policy networks. By leveraging the overlap between the original and mirrored domains, MirrorDuo achieves significantly improved performance under the same data budget when demonstrations are evenly distributed across both sides of the workspace. When demonstrations are confined to one side, MirrorDuo enables efficient skill transfer to the mirrored workspace with as few as zero or five demos in the target arrangement.
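
A minimal sketch of how one demonstration tuple might be mirrored. It assumes a reflection across the x-z plane (y maps to -y) and a camera placed so that this reflection corresponds to a horizontal image flip. The paper's actual frame conventions are not given in the abstract, and every name below is hypothetical. Positions are reflected as p' = M p, and rotation matrices as R' = M R M, which stays a proper rotation.

```python
MIRROR = (1.0, -1.0, 1.0)  # assumed reflection across the x-z plane: y -> -y


def mirror_position(p):
    """Reflect a 3-D position: p' = M p."""
    return [m * x for m, x in zip(MIRROR, p)]


def mirror_rotation(R):
    """Reflect a 3x3 rotation matrix: R' = M R M (determinant stays +1)."""
    return [[MIRROR[i] * R[i][j] * MIRROR[j] for j in range(3)] for i in range(3)]


def mirror_image(image_rows):
    """Flip an image left to right; image_rows is a list of pixel rows."""
    return [row[::-1] for row in image_rows]


# A +90 degree rotation about z becomes a -90 degree rotation after mirroring.
ROT_Z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
mirrored_rot = mirror_rotation(ROT_Z_90)
```

Applying any of the three functions twice returns the original input, which gives a quick sanity check for a mirroring pipeline.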

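2606.19990 2026-06-19 cs.AI New submission 85%

Reward as An Agent for Embodied World Models

Pu Li, Zhigang Lin, Qiang Wu, Yongxuan Lv, Fei Wang, Shan You

Affiliations ACE Robotics

Topic match Robot learning: proposes a reward-agent framework for embodied world models

AI summary Proposes the Reward as an Agent framework and dynamic-aware rollout diversification, using robust verification to support broader exploration, mitigate reward hacking, and improve world-model performance.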

English abstract

While RL has become a promising tool for refining world models, existing methods largely rely on conservative rollouts near the training distribution, limiting exploration, behavioral diversity, and richer dynamic discovery. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, where policies exploit imperfect rewards without achieving genuine improvement. To evaluate this motivation, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals and mitigate reward hacking under distribution shifts. On the exploration side, we introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO, which explicitly expands action-space exploration to diversify trajectories, broaden state-action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff-GRPO, we enable RL on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while yielding significant accuracy gains across multiple open-source world models, thereby demonstrating that broader exploration can scale successfully when grounded in robust verification.
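
The abstract does not describe how DynDiff-GRPO modifies GRPO, so the sketch below shows only the standard group-relative advantage that GRPO-style methods use, under the assumption that the reward agent's scores serve as the per-rollout rewards. The choice of population standard deviation and the function name are also assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against the group sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts from one group, scored by a hypothetical verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are relative within a group, a reward that every rollout can exploit equally yields zero advantage for all of them, which is one reason the reliability of the reward signal matters when exploration is widened.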

2606.19928 2026-06-19 cs.RO New submission 85%

SWAP: Symmetric Equivariant World-Model for Agile Robot Parkour

Kaixin Lan, Ze Wang, Hongyi Li, Lei Jiang, Chaojie Fu, Chengkai Su, Choi Lam Wong, Yongbin Jin, Hongtao Wang

Affiliations Center for X-Mechanics, Zhejiang University; ZJU-Hangzhou Global Scientific and Technology Innovation Center; Mirrorme Technology Co., Ltd.

Topic match Robot learning: proposes a symmetric equivariant world model for quadruped robot parkour

AI summary Proposes the SWAP framework, which embeds symmetry equivariance into the world model and the actor-critic networks; the quadruped sets parkour records (leaping a 2.13 m gap and climbing a 1.63 m platform) and shows geometric generalization to unseen mirrored terrains and zero-shot transfer.

English abstract

While latent world models enable the proactive predictions required for extreme parkour, their purely data-driven nature forces them to redundantly encode left-right symmetric interactions as independent patterns. This inflates the learning burden and hinders the capture of geometric regularities, restricting the latent space's efficiency for downstream policies. To address this, we propose SWAP, an end-to-end equivariant symmetric world model. This framework embeds symmetry directly into both the world model and the actor-critic networks. In real-world tests, the robot leaps across a 2.13 m gap and climbs a 1.63 m platform, breaking records for quadruped parkour. Furthermore, the framework exhibits robust geometric generalization to unseen mirrored terrains and exceptional zero-shot transferability across diverse outdoor environments. These results demonstrate that symmetry equivariance is an effective structural prior for pushing the physical boundaries of learned legged locomotion.

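2606.19774 2026-06-19 cs.RO New submission 85%

Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection

Trong-Bao Ho, Quang-Tan Nguyen, Thien-Loc Ha, Gia-Binh Nguyen, Viet-Thanh Nguyen, Long Dinh, Minh N. Vu, Duy M. H. Nguyen, An Thai Le, Ngo Anh Vien

Affiliations VinRobotics; VinUniversity; DFKI (German Research Center for Artificial Intelligence); University of Stuttgart; IMPRS-IS (International Max Planck Research School for Intelligent Systems)

Topic match Robot learning: uses initial noise selection to resolve action-chunk inconsistency in asynchronous robot execution.

AI summary To address inconsistency at action-chunk boundaries when flow-based policies execute asynchronously, proposes PAINT, a training-free method that achieves prefix consistency through initial noise selection rather than trajectory steering, improving execution consistency and task performance on 12 simulated and 6 real-world manipulation tasks.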

Comments First version 19 pages, project site: https://paint-action-chunking.github.io

English abstract

Action chunking enables robot policies to produce temporally coherent behavior, but generating multi-step action sequences with flow-based policies incurs latency that is incompatible with real-time control. Under asynchronous execution, the robot continues executing the current chunk while the next one is generated, causing even minor delays to create inconsistencies at chunk boundaries. Existing methods address this problem by steering generation toward the already executed action prefix. We instead show that prefix consistency can be achieved by selecting an appropriate initial noise before generation begins, allowing the unmodified flow ODE to produce a coherent next chunk. This reframes asynchronous inference as a noise selection problem rather than a trajectory steering problem. We introduce PAINT, a training-free method that finds this noise via backward Euler inversion and constructs the final chunk through a repainting rule. In summary, PAINT requires no gradients, retraining, or policy modification; yet it improves execution consistency and task performance across 12 simulated benchmarks and 6 real-world manipulation tasks spanning single-arm, bimanual, and humanoid embodiments. Website: https://paint-action-chunking.github.io

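2606.19752 2026-06-19 cs.RO cs.AI New submission 85%

Temporal Self-Imitation Learning

Yinsen Jia, Boyuan Chen

Affiliations Duke University

Topic match Robot learning: temporal self-imitation learning improves the efficiency of long-horizon robot manipulation.

AI summary Proposes a temporal self-imitation learning framework that mines efficient successful trajectories and converts them into reusable supervision, improving learning efficiency and robustness on long-horizon robot manipulation tasks.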

English abstract

Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.

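2606.19633 2026-06-19 cs.RO cs.AI New submission 85%

CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion

Francisco Affonso, Matheus P. Angarola, Ana Luiza Mineiro, Aditya Potnis, Marcelo Becker, Girish Chowdhary

Affiliations University of Illinois Urbana-Champaign; University of São Paulo

Topic match Robot learning: proposes CTS-MoE for perceptive locomotion with implicit terrain adaptation.

AI summary For perceptive locomotion over discontinuous terrain, proposes CTS-MoE, which composes shared behaviors through a dense mixture-of-experts policy with perception-based gating and uses a multi-critic to prevent value interference; it trains end-to-end, adapts to terrain implicitly, and outperforms baselines in simulation and on hardware.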

English abstract

Perceptive legged locomotion over discontinuous terrain (e.g., stairs, gaps, and obstacles) requires adaptive behavior, as a single conservative gait cannot produce the anticipatory maneuvers needed for abrupt topology changes. Cast as multi-task reinforcement learning, this problem introduces a tension between sharing and separation. Tasks use a common locomotion base but have conflicting rewards, so a policy must share behavior while avoiding value interference. Prior work addresses only one side, with monolithic policies sacrificing specialization and hierarchical sub-policies sacrificing generalization across transitions and unseen terrain. We propose CTS-MoE, which combines a dense mixture-of-experts actor with perception-based gating to compose shared behaviors and a multi-critic with task-specific value heads to prevent interference. The model is trained end-to-end in a single-stage concurrent teacher-student setup that handles partial observability and avoids sequential distillation, with task labels used only during training. At deployment, routing depends solely on perception, allowing terrain adaptation without a high-level selector or terrain classifier. Experiments on a Unitree Go1 in simulation and on hardware across seen and unseen terrains show task-aware specialization, with lower tracking error and higher success rates than monolithic baselines. Project Website: https://cts-moe.github.io/ .
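
The dense mixture-of-experts actor described here can be sketched as a softmax-weighted blend of expert outputs. The network details are not in the abstract, so the gate logits and expert actions below are hypothetical placeholders for what the perception-based gate and the experts would produce.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_moe_action(gate_logits, expert_actions):
    """Blend every expert's action; 'dense' means no expert is switched off."""
    weights = softmax(gate_logits)
    dim = len(expert_actions[0])
    return [sum(w * action[d] for w, action in zip(weights, expert_actions))
            for d in range(dim)]


# Two hypothetical experts with 2-D actions and equal gate scores.
blended = dense_moe_action([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the gate scores depend only on perception, no task label or terrain classifier is needed at deployment. The blend shifts smoothly as the gate logits change.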

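2606.19598 2026-06-19 cs.RO New submission 85%

Fail-RAG: A Retrieval Augmented Generation Informed Framework for Robot Failure Identification

Ameya Salvi, Jie Hu

Affiliations Hitachi America, Ltd.

Topic match Robot learning: failure detection for warehouse robot operations, which falls under robot learning

AI summary Proposes the Fail-RAG framework, which uses retrieval-augmented generation and vision-language models: failure images and context information are embedded and queried against a database to detect robot operation failures, raising average detection accuracy by 25 percentage points on warehouse automation tasks.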

English abstract

Industry automation is witnessing an evolution in robotics driven by both technological breakthroughs and societal changes: progress towards generalist robots, embodied and physical artificial intelligence (AI), and an increasing labor shortage in manufacturing. An intelligent autonomous robot needs to not only act according to planned motions but also react to any unexpected events. In this study, we focus on such unexpected events in warehouses where robots are used for material handling. Specifically, we refer to any unexpected events as failures and develop methods to detect failures related to robot operations. Rule-based detection methods may break since the form of failures could change due to the dynamic nature of both environments and tasks. We propose 'Fail-RAG', a Retrieval Augmented Generation (RAG)-based failure detection framework where failure images and context information are embedded and queried against a failure database by calculating their similarities. Vision-Language Models (VLMs) are further used to analyze failures and provide details by following our instruction template. We evaluated the performance of Fail-RAG by conducting both simulation and physical experiments using fixed robot arms and a mobile manipulator for multiple tasks that are common in warehouse automation. Fail-RAG achieved 25 percentage points higher failure detection accuracy on average across five types of robot operations compared to using off-the-shelf VLMs, indicating its effectiveness for real-world failure detection.
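
The retrieval step described here, embedding a query and comparing it against a failure database by similarity, might look like the sketch below. The abstract does not name the similarity measure or the embedding model. Cosine similarity, the three-dimensional toy embeddings, and the database entries are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_failures(query_embedding, database, top_k=1):
    """Return the top_k database entries most similar to the query embedding."""
    ranked = sorted(
        database,
        key=lambda entry: cosine_similarity(query_embedding, entry["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical failure database; real embeddings would come from an image/text encoder.
FAILURE_DB = [
    {"label": "dropped item", "embedding": [1.0, 0.0, 0.0]},
    {"label": "collision", "embedding": [0.0, 1.0, 0.0]},
    {"label": "missed grasp", "embedding": [0.0, 0.0, 1.0]},
]

matches = retrieve_failures([0.9, 0.1, 0.0], FAILURE_DB)
```

In a pipeline of this kind, the retrieved entries would then be passed to the VLM as context alongside the instruction template.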

2606.19531 2026-06-19 cs.CV cs.RO New submission 85%

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin

Affiliations Shanghai Jiao Tong University; Eastern Institute of Technology; Tencent Robotics X; Tsinghua University; Zhongguancun Academy

Topic match Robot learning: uses image editing models for robot action prediction

AI summary Proposes the ImageWAM framework, which replaces video generation with a pretrained image editing model for robot action prediction and uses the KV caches from editing denoising as the world-action context; it outperforms baselines in multiple simulated and real-world experiments while cutting compute to 1/6 and latency to 1/4.

Comments Project Page: https://zhangwenyao1.github.io/ImageWAM/

English abstract

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulated and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.

2510.05013 2026-06-19 stat.ML cs.LG 85%

Curiosity-Driven Development of Action and Language in Robots Through Self-Exploration

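Theodore Jerome Tinker, Kenji Doya, Jun Tani

Affiliations Okinawa Institute of Science and Technology

Topic match Robot learning: curiosity-driven learning of action and language in robots

AI summary Through curiosity-driven robot self-exploration, with active inference amortized using Q-learning, this study finds compositional generalization, faster learning, rote pairing that precedes composition, and a U-shaped developmental pattern caused by exception handling, offering an explanation for efficient human language acquisition.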

Comments 27 pages, 22 pages of supplementary material

English abstract

Infants acquire language with generalization from minimal experience, whereas large language models require billions of training tokens. What underlies efficient development in humans? We investigated this problem through experiments wherein robotic agents learn to perform actions associated with imperative sentences (e.g., push red cube) via curiosity-driven self-exploration. Our approach amortizes active inference using Q-learning, enabling intrinsically motivated developmental learning. The simulations reveal key findings corresponding to observations in developmental psychology. i) Generalization improves drastically as the scale of compositional elements increases. ii) Curiosity-driven exploration enables faster learning. iii) Rote pairing of sentences and actions precedes compositional generalization. iv) Exception-handling induces U-shaped developmental performance, a pattern like representational redescription in child language learning. These results suggest that curiosity-driven active inference accounts for how intrinsically motivated sensorimotor-linguistic learning supports scalable compositional generalization and exception handling in humans and artificial agents.

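2606.20549 2026-06-19 cs.RO New submission 80%

Generating Robot Hands from Human Demonstrations

Sha Yi, Nicklas Hansen, Xueqian Bai, Carmelo Sferrazza, Michael T. Tolley, Xiaolong Wang

Affiliations University of California San Diego; Amazon Frontier AI & Robotics

Topic match Robot learning: a design framework that generates robot hands from human demonstrations.

AI summary Proposes a data-driven framework that uses more than 4 million frames of human fingertip motion from everyday manipulation and matches fingertip positions through inverse kinematics to optimize tree-structured robot hand designs, yielding a general-purpose 6-DoF hand and lower-DoF task-specific hands; a reinforcement learning agent is trained to accelerate the design search.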

English abstract

Robot learning has advanced rapidly in learning control, but learning the physical body of a robot remains much more difficult because jointly searching over design and control creates a very large combinatorial problem. Here, we present a data-driven framework for generating robot hands from human demonstrations. Instead of learning a complex controller together with each candidate design, we generate robot hand designs using the same simple control policy used after fabrication: matching fingertip positions through inverse kinematics. Using more than 4 million frames of human fingertip motion from everyday manipulation, our algorithm optimizes tree-structured robot hands to reproduce desired target motions. The framework produced both a 6-degree-of-freedom (DoF) general-purpose hand and lower-DoF task-specific hands with spatial four-bar mimic joints. To accelerate the search over designs, we trained a reinforcement-learning (RL) actor to propose good hand designs and joint angles, reducing search time from hours to minutes. We fabricated the mechanisms directly as one-piece articulated structures with print-in-place joints. In real-world experiments, the 6-DoF hand achieved highly accurate teleoperated fingertip tracking better than available commercial robot hands, whereas the specialized 3-DoF hands reproduced structured human and synthetic trajectories with reduced mechanical complexity. These results showed that large-scale human motion data can be used not only to train robot controllers but also as a reference for optimizing and generating the physical embodiment of robots.
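
The control policy named in this abstract, matching fingertip positions through inverse kinematics, can be illustrated on a much simpler case than the paper's tree-structured 3-D hands: a planar finger with two revolute joints. The link lengths, target, and function names are hypothetical, and the closed-form solution below is the standard two-link result rather than the authors' solver.

```python
import math


def two_link_ik(x, y, l1, l2):
    """Joint angles (q1, q2) that place a planar two-link fingertip at (x, y)."""
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    cos_q2 = max(-1.0, min(1.0, cos_q2))  # clamp so unreachable targets do not raise
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def two_link_fk(q1, q2, l1, l2):
    """Fingertip position for joint angles (q1, q2)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


# Hypothetical finger: 4 cm and 3 cm links, fingertip target at (5 cm, 3 cm).
L1, L2 = 0.04, 0.03
q1, q2 = two_link_ik(0.05, 0.03, L1, L2)
tip_x, tip_y = two_link_fk(q1, q2, L1, L2)
```

A design search in this spirit could score a candidate finger by how closely `two_link_fk(two_link_ik(...))` reproduces each recorded fingertip position, since unreachable targets show up as tracking error.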

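2606.20428 2026-06-19 cs.RO New submission 80%

ARC: Adaptive Robust Joint State and Covariance Estimation

Alexandre Hadji-Thomas, Andrew Stirling, James R. Forbes

Topic match Robot learning: robust state estimation for processing robot sensor data

AI summary Proposes a unified block-coordinate descent framework that combines an adaptive robust loss, an iteratively reweighted least-squares state update, and a minimum weighted covariance determinant estimator for adaptive joint estimation of state and covariance in the presence of outliers.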

Comments Submitted to IEEE Robotics and Automation Letters (RA-L), June 2026. 8 pages, 7 figures, 1 table

English abstract

Sensor measurements are frequently corrupted by outliers and non-Gaussian noise. These imperfections in the sensor data can cause classical state estimators to generate biased and unreliable state and uncertainty estimates. Robust estimators reject or downweight outliers but do not perform measurement covariance estimation, whereas joint state and covariance estimators assume Gaussian residuals and fixed loss shape parameters. Integrating these two capabilities into a single framework is an opportunity to simultaneously estimate both state and covariance in the presence of outliers. This paper proposes a unified Block-Coordinate Descent framework that combines a norm-aware adaptive robust loss, an Iteratively Reweighted Least-Squares state update, and a Minimum Weighted Covariance Determinant covariance estimator, yielding a self-tuning joint state and covariance estimator. The framework is evaluated in a Monte-Carlo simulation and on real-world ultra-wideband localization experiments in cluttered non-line-of-sight environments. Results show that the proposed estimator consistently recovers the true inlier measurement covariance and matches or exceeds the state estimation accuracy of all baselines, without requiring any manual parameter tuning.
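
Of the three components named in this abstract, only the iteratively reweighted least-squares (IRLS) state update is sketched below, on a scalar location problem. This is a simplification: it uses a fixed Huber loss with a hand-picked threshold `delta`, whereas the paper's loss is adaptive and is paired with covariance estimation. The data and names are hypothetical.

```python
def huber_weight(residual, delta):
    """IRLS weight for the Huber loss: 1 inside the threshold, delta/|r| outside."""
    magnitude = abs(residual)
    return 1.0 if magnitude <= delta else delta / magnitude


def irls_location(measurements, delta=1.0, iterations=20):
    """Robustly estimate a scalar state from noisy measurements with outliers."""
    estimate = sorted(measurements)[len(measurements) // 2]  # median-like start
    for _ in range(iterations):
        weights = [huber_weight(m - estimate, delta) for m in measurements]
        # Weighted least-squares update with the current weights held fixed.
        estimate = sum(w * m for w, m in zip(weights, measurements)) / sum(weights)
    return estimate


# Five inliers near 1.0 and one gross outlier at 10.0.
DATA = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]
robust_estimate = irls_location(DATA)
plain_mean = sum(DATA) / len(DATA)
```

On this data the plain mean is 2.5, while the Huber estimate settles near 1.2 because the outlier's influence is capped at `delta`. A self-tuning method would additionally adapt the loss shape instead of fixing it by hand.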

2606.20197 2026-06-19 cs.RO New submission 80%

Stable Transformer-Actor-Critic Model Predictive Control: A Contraction Analysis Approach

Antonio Marino, Valerio Modugno, Marco Cognetti

Affiliations University of Cambridge; University College London; LAAS-CNRS

Topic match Robot learning: Transformer-Actor-Critic MPC for drone control.

AI summary Proposes a Transformer-Actor-Critic MPC architecture: it proves that Transformers can satisfy incremental input-to-state stability, analyzes the interconnected dynamics with Riemannian contraction theory, and uses the resulting theoretical bounds as a training regularizer to obtain a certifiably robust control policy.

English abstract

Actor-Critic Model Predictive Control (MPC) effectively addresses complex, non-convex control problems, but guaranteeing the closed-loop stability of sequence-based learning models within these pipelines remains challenging. This paper introduces a novel Transformer-Actor-Critic MPC architecture with formal robustness guarantees. First, we prove that Transformer networks can satisfy global incremental Input-to-State Stability ($\delta$ISS). We then leverage Riemannian contraction theory to analyze the interconnected dynamics between the physical plant and the predictive neural network. Finally, we integrate these theoretical bounds as a training regularizer to yield a certifiably robust policy. The framework is validated on a nonlinear 3D drone model executing target-reaching and obstacle-avoidance maneuvers.

2606.20031 2026-06-19 cs.RO cs.AI New submission 80%

A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems

Junzhe Xu, Zecui Zeng, Lusong Li, Yuetong Fang, Renjing Xu

Affiliations The Hong Kong University of Science and Technology (Guangzhou); JD Explore Academy

Topic match Robot learning: neuromorphic reinforcement learning for robot pathfinding.

AI summary Proposes the SDQN-RMFS framework, which uses ANN-to-SNN conversion and hard-label knowledge distillation to run ultra-low-power pathfinding on a neuromorphic chip, with up to 11,281× energy savings and nearly half the latency of a GPU baseline.

English abstract

Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281$\times$ energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.

2606.19935 2026-06-19 cs.AI 新提交 80%

PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation

PhysDrift: 弥合人形机器人共语动作生成中的具身差距

Zhangzhao Liang, Xiaofen Xing, Mingyue Yang, Wenlve Zhou, Xiangmin Xu

发表机构 * South China University of Technology(华南理工大学) DexForce Technology(DexForce科技公司) Foshan University(佛山大学)

专题命中 机器人学习 :提出人形机器人共语动作生成框架PhysDrift

AI总结 针对人形机器人共语动作生成中人体运动流形与机器人具身约束不匹配的问题,提出IK-EER框架和PhysDrift模型,直接预测可执行关节轨迹,提升运动对齐、物理合理性和实时交互能力。

详情
AI中文摘要

人形机器人需要的共语动作不仅要富有表现力且与语音对齐,还要在具身约束下物理可执行。现有的共语动作生成流程主要以人为中心:首先以人体表示(如SMPL-X)生成动作,随后重定向到人形机器人。在这项工作中,我们指出了这种范式中存在的根本性具身差距,即人体运动流形与人形机器人具身约束之间的不匹配会在运动迁移和物理执行过程中破坏具身一致性。通过广泛分析,我们表明,尽管重定向可以保留粗粒度的运动语义,但它显著压缩了运动多样性并削弱了韵律-动作同步,从而限制了富有表现力的人形机器人行为。为解决此问题,我们首先提出IK-EER,一种保留韵律的人形机器人运动数据整理(curation)框架,在重定向过程中联合优化运动学可行性和语音-运动时间对齐。基于整理得到的机器人原生运动数据集,我们进一步引入PhysDrift,一种具身感知的共语动作生成框架,它直接从语音预测可执行的人形机器人关节轨迹,无需依赖中间人体表示。与传统的以人为中心的流程不同,PhysDrift在训练和推理过程中始终保持具身一致性,同时引入物理正则化以稳定机器人运动动态。大量实验和真实世界人形机器人部署表明,具身感知的机器人原生生成显著改善了语音-运动对齐、物理合理性、运动平滑性、推理效率和实时交互能力。

英文摘要

Humanoid robots require co-speech motions that are not only expressive and speech-aligned, but also physically executable under embodiment constraints. Existing co-speech generation pipelines are predominantly human-centric: motions are first generated in human-body representations such as SMPL-X and subsequently retargeted to humanoid robots. In this work, we identify a fundamental embodiment gap in this paradigm, where the mismatch between human motion manifolds and humanoid embodiment constraints disrupts embodiment consistency during motion transfer and physical execution. Through extensive analysis, we show that although retargeting can preserve coarse motion semantics, it significantly compresses motion diversity and weakens prosody-motion synchronization, limiting expressive humanoid behaviors. To address this problem, we first propose IK-EER, a prosody-preserving humanoid motion curation framework that jointly optimizes kinematic feasibility and speech-motion temporal alignment during retargeting. Building upon the curated robot-native motion dataset, we further introduce PhysDrift, an embodiment-aware co-speech motion generation framework that directly predicts executable humanoid joint trajectories from speech without relying on intermediate human-body representations. Unlike conventional human-centric pipelines, PhysDrift maintains embodiment consistency throughout both training and inference while incorporating physical regularization to stabilize robot motion dynamics. Extensive experiments and real-world humanoid deployment demonstrate that embodiment-aware robot-native generation substantially improves speech-motion alignment, physical plausibility, motion smoothness, inference efficiency, and real-time interaction capability.

2606.19914 2026-06-19 cs.RO cs.AI 新提交 80%

Co-policy: Responsive Human-Robot Co-Creation for Musical Performances

Co-policy: 响应式人机音乐共创框架

Xuetao Li, Wenke Huang, Mang Ye, Zijian Liu, Jinhua Xie, Jifeng Xuan, Miao Li

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Automation, Wuhan University of Technology(武汉理工大学自动化学院) School of Geodesy and Geomatics, Wuhan University(武汉大学测绘学院) School of Robotics, Wuhan University(武汉大学机器人学院)

专题命中 机器人学习 :提出人机音乐共创框架Co-policy

AI总结 提出Co-policy框架,通过语义锚定、受约束的音乐变奏和视觉运动策略实现人机音乐实时共创,在真实钟琴实验中优于扩散策略基线。

详情
AI中文摘要

长期以来,艺术一直是人类创造力的重要表达方式。具身人工智能为生成模型提供了一条途径,使其能够通过物理动作而非脱离实体的数字内容参与创造。在机器人音乐共创中,将语义层面的音乐理解与实时且物理可执行的演奏联系起来颇具挑战。我们提出了Co-policy,一个人机音乐共创框架,它将语义意图落地(grounding)、受约束的音乐变奏和视觉运动执行三者解耦。为了使音乐语义落地,Co-policy使用预推理语义锚点和微调的Qwen-vl规划器(F-Qwen),将语音、实时音乐种子和视觉观察转化为结构化的共创计划。为了支持低延迟执行,Co-policy引入了高斯混合视觉运动策略(GMP),其实现为一个条件混合密度策略,在单次前向传播中将目标音符和视觉上下文映射为多模态机器人动作。与仅复现用户指定音符的机器人回放系统不同,Co-policy在音乐约束和物理约束下生成互补的音乐响应。真实机器人钟琴实验、消融研究和专家评估表明,与扩散策略基线和消融基线相比,该方法在意图对齐、执行准确性和响应频率上均有提升,这支持了将物理落地的动作生成作为具身人类-AI共创关键要求的观点。

英文摘要

Art has long stood as a pivotal expression of human creativity. Embodied artificial intelligence offers a route for generative models to participate in that creativity through physical action rather than disembodied digital content. In robotic music co-creation, it is challenging to connect semantic musical understanding with real-time and physically executable performance. We present Co-policy, a framework for human-robot musical co-creation that separates semantic intent grounding, constrained musical variation, and visuomotor execution. To ground musical semantics, Co-policy uses pre-inference semantic anchors and a fine-tuned Qwen-vl planner (F-Qwen) to transform speech, live musical seeds, and visual observations into structured co-creation plans. To support low-latency execution, Co-policy introduces a Gaussian-Mixture Visuomotor Policy (GMP), implemented as a conditional mixture-density policy that maps target notes and visual context to multimodal robot actions in a single forward pass. Unlike robotic playback systems that merely reproduce user-specified notes, Co-policy generates complementary musical responses under both musical and physical constraints. Real-robot chime experiments, ablations, and expert evaluation show improved intent alignment, execution accuracy, and response frequency over diffusion-policy and ablated baselines, supporting physically grounded action generation as a key requirement for embodied human-AI co-creation.
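
The idea of a mixture-density policy that yields a multimodal action in a single forward pass can be sketched as follows. This is a minimal illustration, assuming the network has already produced mixture logits, per-component means, and per-component standard deviations; `mixture_weights`, `sample_action`, and `most_likely_mode` are hypothetical names, not part of Co-policy.

```python
import math
import random

def mixture_weights(logits):
    """Softmax over component logits, giving the mixture weights."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, means, stds, rng):
    """Pick one Gaussian component, then sample each action dimension from it."""
    weights = mixture_weights(logits)
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(mu, sd) for mu, sd in zip(means[k], stds[k])]

def most_likely_mode(logits, means):
    """Mean of the highest-weight component, a deterministic alternative."""
    weights = mixture_weights(logits)
    k = max(range(len(weights)), key=weights.__getitem__)
    return means[k]

# Hypothetical network outputs for a 2-D action with two modes.
logits = [2.0, 0.0]
means = [[0.1, 0.2], [0.8, 0.9]]
stds = [[0.0, 0.0], [0.0, 0.0]]  # zero spread so samples land exactly on a mode
action = sample_action(logits, means, stds, random.Random(0))
```

Unlike an iterative denoising policy, sampling here needs only one evaluation of the network, which is consistent with the low-latency motivation stated in the abstract.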

2606.19711 2026-06-19 cs.RO cs.LG cs.SY eess.SY 新提交 80%

A Differentiable Composite Approximation Framework for Autonomous Underwater Vehicle Maneuvering Modeling from Sea-Trial Data

一种可微复合近似框架:基于海试数据的自主水下航行器机动建模

Aobo Wang, Aifei Xia, Zihao Wang, Lizhu Hao

发表机构 * College of Shipbuilding Engineering, Harbin Engineering University(哈尔滨工程大学船舶工程学院) China Academy of Aerospace Aerodynamics(中国航天空气动力技术研究院) Institute of Artificial Intelligence, Shanghai University(上海大学人工智能研究院) China Ship Scientific Research Center(中国船舶科学研究中心)

专题命中 机器人学习 :提出AUV机动建模的可微复合近似框架。

AI总结 提出可微复合近似框架,联合校准多项式基与数据自适应基,并引入基于回转运动的海流估计与补偿,提升AUV机动预测精度。

详情
AI中文摘要

基于机载测量的现场数据建模能够得到反映真实运行特性的自主水下航行器(AUV)机动模型。从近似的角度看,传统机动模型使用预定义的约束多项式基,而数据驱动模型使用数据自适应基。受这一基函数视角的启发,本文提出一种可微复合近似形式,将多项式基分量和数据自适应基分量视为同一预测器的可微组成部分并联合校准。针对全尺寸AUV机动预测,本文开发了一种基于梯度的协同校准方法,其中灵敏度感知机制约束多项式参数的有界更新,而神经残差在共享的预测目标下捕获剩余的非线性差异。为了考虑现场数据中的海流影响,本文引入了一种基于回转运动的海流估计与补偿流程,用于构建经海流补偿的学习目标,供训练和滚动预测(rollout)使用。该框架使用一艘7米长AUV在多种机动条件下采集的海试数据进行评估。结果表明,与纯多项式、纯神经网络和冻结先验的混合基线相比,所提方法改进了递归轨迹预测和速度预测,证明了其在基于现场数据的AUV机动建模中的适用性。

英文摘要

Field-based modeling from onboard measurements can produce autonomous underwater vehicle (AUV) maneuvering models that reflect real operating characteristics. From an approximation perspective, conventional maneuvering models use predefined constraint polynomial bases, whereas data-driven models use data-adaptive bases. Motivated by this basis-function view, this paper presents a differentiable composite-approximation formulation, in which the polynomial-basis component and the data-adaptive basis component are treated as differentiable parts of a single predictor and calibrated jointly. A gradient-based co-calibration method is developed for full-scale AUV maneuvering prediction, where a sensitivity-aware mechanism regulates bounded polynomial updates while the neural residual captures remaining nonlinear discrepancies under a shared prediction objective. To account for ocean-current effects in field data, a turning-motion-based current estimation and compensation procedure is incorporated to construct current-compensated learning targets for training and rollout. The framework is evaluated using sea-trial data collected from a 7-meter AUV under multiple maneuvering conditions. Results show that the proposed method improves recursive trajectory and velocity prediction compared with polynomial-only, neural-only, and frozen-prior hybrid baselines, demonstrating its applicability to field-data-based AUV maneuvering modeling.
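
A toy version of jointly calibrating a bounded polynomial term and a data-adaptive residual under one prediction loss is sketched below. It is a one-dimensional illustration under strong assumptions: the "polynomial basis" is a single linear term, the "neural residual" is a single `tanh` feature with one weight, and the paper's sensitivity-aware mechanism is replaced by plain clipping of the coefficient around its prior. None of the names or numbers come from the paper.

```python
import math

def features(x):
    """Return (polynomial basis value, stand-in for a data-adaptive basis value)."""
    return x, math.tanh(3.0 * x)

def mse(theta, w, data):
    """Mean squared error of the composite predictor theta*p(x) + w*r(x)."""
    return sum((theta * features(x)[0] + w * features(x)[1] - y) ** 2
               for x, y in data) / len(data)

def co_calibrate(data, theta_prior, bound, lr=0.1, steps=200):
    """Joint gradient descent; theta stays within +/- bound of its prior."""
    theta, w = theta_prior, 0.0
    for _ in range(steps):
        g_theta = g_w = 0.0
        for x, y in data:
            p, r = features(x)
            err = theta * p + w * r - y
            g_theta += 2.0 * err * p / len(data)
            g_w += 2.0 * err * r / len(data)
        theta -= lr * g_theta
        theta = min(max(theta, theta_prior - bound), theta_prior + bound)
        w -= lr * g_w
    return theta, w

# Synthetic data from a made-up ground truth, not sea-trial measurements.
data = [(i / 10, 2.0 * (i / 10) + 0.5 * math.tanh(3.0 * i / 10)) for i in range(-10, 11)]
theta, w = co_calibrate(data, theta_prior=1.5, bound=0.2)
```

The design point is that both parts receive gradients from the same loss, so the residual absorbs what the bounded polynomial coefficient is not allowed to explain.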

2606.19675 2026-06-19 cs.RO 新提交 80%

ForEnt: A Multi-Modal Dataset for Characterizing Quadruped Robot Entrapments in Forest Environments

ForEnt: 用于表征四足机器人在森林环境中被困的多模态数据集

Natapat Kirdwichai, Danesh Tarapore

发表机构 * University of Southampton(南安普顿大学)

专题命中 机器人学习 :四足机器人森林被困多模态数据集,支持基准测试

AI总结 针对四足机器人在森林中因植被缠绕而倾覆的问题,提出多模态数据集ForEnt,包含RGB-D、LiDAR、本体感知和第三人称视频,记录69次被困事件,支持可重复的基准测试。

Comments 8 pages, 7 figures

详情
AI中文摘要

腿式机器人越来越多地被部署在森林中进行生态调查和监测,但由于穿越森林环境带来的挑战,它们的自主性经常中断。森林被困,例如当机器人的腿被藤蔓或其他植被缠住时,会导致失去稳定性并翻倒。此类事件不仅中断任务并需要人工干预,还可能损坏机器人硬件。为了解决缺乏专门数据集来研究森林环境中这些故障模式的问题,我们提出了ForEnt,这是一个多模态数据集,使用低成本的Unitree Go2四足机器人在英国南安普顿公共林地的八个森林地点收集。在我们的数据集中,进行了约1.7公里的穿越,共11个序列,记录了69次被困事件。ForEnt包括时间同步的RGB-D图像、LiDAR扫描、本体感知数据和第三人称视频,能够分析导致被困的地形因素,并提供标记的传感器流用于可重复的基准测试。通过支持被困检测策略的评估,ForEnt降低了在具有挑战性的森林环境中开发稳健四足机器人部署的门槛。

英文摘要

Legged robots are increasingly deployed in forests for ecological surveying and monitoring, yet their autonomy is often interrupted consequent to the challenges posed in traversing forest environments. Forest entrapments, for example, when a robot's legs are ensnared in vines or other vegetation, result in loss of stability and toppling. Such events not only disrupt the mission and require manual intervention, but also risk damage to the robot hardware. To address the absence of a dedicated dataset to investigate these failure modes in forest environments, we present ForEnt, a multi-modal dataset collected with the low-cost Unitree Go2 quadruped across eight forest sites in the Southampton Common Woodlands, UK. For our dataset, over approximately 1.7 km of traversals in 11 sequences were conducted, yielding 69 recorded entrapment events. ForEnt includes time-synchronized RGB-D images, LiDAR scans, proprioceptive data, and third-person video, enabling analysis of terrain factors contributing to entrapment and providing labeled sensor streams for reproducible benchmarking. By supporting the evaluation of entrapment detection strategies, ForEnt lowers the barrier to developing robust quadruped robot deployments in challenging forest environments.

2606.19656 2026-06-19 cs.RO cs.LG 新提交 80%

DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning

DF-ExpEnse: 面向样本高效微调的扩散过滤探索

Calvin Luo, Chen Sun, Shuran Song

发表机构 * Stanford University(斯坦福大学) Brown University(布朗大学)

专题命中 机器人学习 :利用扩散滤波探索提升机器人微调样本效率

AI总结 提出DF-ExpEnse探索技术,利用生成控制策略的多模态建模能力和评论家集成,在微调中高效收集在线经验,提升样本效率。

Comments ICML 2026

详情
AI中文摘要

智能机器人决策的一种自然方案是:从已总结离线经验的预训练生成式控制策略出发进行初始化,再使其适应自行收集的在线经验。我们提出了DF-ExpEnse,一种探索技术,它可以提高在线经验收集的质量,从而提升微调的样本效率。DF-ExpEnse利用生成式控制策略的多模态建模能力,构建一个表达力强且易于评估的候选动作集;然后利用评论家(critic)集成,找出在动作质量与探索价值之间取得最佳平衡的动作。在多机器人机群(fleet)场景中,DF-ExpEnse还支持跨智能体通信,以促进群体协作探索。DF-ExpEnse可以无缝集成到现有的、通过强化学习微调预训练生成式控制策略的方法中。我们通过实验验证,在多种操作和运动(locomotion)任务中,与默认微调和其他动作选择方案相比,DF-ExpEnse能够持续带来样本效率上的优势。项目主页见 https://df-expense.github.io

英文摘要

A natural recipe for intelligent robotic decision-making is initializing from pretrained generative control policies, which have summarized offline experience, and adapting them to self-collected online experience. We present DF-ExpEnse, an exploration technique that improves the quality of online experience collection, thus increasing finetuning sample-efficiency. DF-ExpEnse leverages the multimodal modeling capabilities of the generative control policy to create an expressive and tractably evaluatable candidate set. It then utilizes an ensemble of critics to identify the action that best balances quality with high exploration interest. In fleet settings, DF-ExpEnse further enables cross-agent communication to facilitate collaborative exploration as a group. DF-ExpEnse can be seamlessly integrated with existing strategies that finetune pretrained generative control policies via reinforcement learning. We experimentally validate consistent sample-efficiency benefits through DF-ExpEnse across a variety of manipulation and locomotion tasks, compared to default finetuning and alternative action selection schemes. Project can be found at https://df-expense.github.io.
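
One way to picture "balancing quality with exploration interest" over a candidate set is an ensemble-mean plus ensemble-disagreement score. The sketch below assumes that scoring rule; the abstract does not state the exact criterion, and `select_action`, `beta`, and the lookup-table critics are hypothetical. In the paper the candidates would be sampled from the generative policy, whereas a fixed list stands in for them here.

```python
from statistics import mean, pstdev

def select_action(candidates, critics, beta):
    """Pick the candidate maximizing ensemble mean + beta * ensemble std."""
    def score(action):
        qs = [critic(action) for critic in critics]
        return mean(qs) + beta * pstdev(qs)
    return max(candidates, key=score)

# Two toy critics that agree on "safe" and disagree on "novel".
critic_a = {"safe": 1.2, "novel": 0.0}.get
critic_b = {"safe": 1.2, "novel": 2.0}.get
candidates = ["safe", "novel"]

greedy_choice = select_action(candidates, [critic_a, critic_b], beta=0.0)
curious_choice = select_action(candidates, [critic_a, critic_b], beta=1.0)
```

With `beta=0.0` the rule is purely greedy on estimated quality; raising `beta` favors actions where the critics disagree, which is a common proxy for states that are worth exploring.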

2606.19632 2026-06-19 cs.RO cs.AI cs.LG cs.LO cs.MA 新提交 80%

Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

通过决策树蒸馏对学习到的多智能体通信策略进行形式化验证

Ahmad Farooq, Kamran Iqbal

发表机构 * University of Arkansas at Little Rock(阿肯色大学小石城分校)

专题命中 机器人学习 :通过决策树蒸馏验证多智能体通信策略的安全性。

AI总结 提出通过决策树蒸馏将多智能体强化学习策略转化为可解释模型,并利用PRISM进行形式化验证,确保安全属性转移至原始网络,在无人机编队任务中实现88.9%属性满足率。

Comments 9 pages, 3 figures, 7 tables. Accepted at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026), Pittsburgh, Pennsylvania, USA, September 27-October 1, 2026

详情
AI中文摘要

多智能体强化学习(MARL)使智能体能够通过涌现通信形成协调策略,但神经网络策略缺乏无人机群和自动驾驶车队等安全关键型机器人部署所需的形式化安全保证。我们提出了首个通过策略抽象对学习到的多智能体通信策略进行安全验证的端到端框架:先将神经网络策略蒸馏为可解释的决策树,再对其进行形式化验证,并通过实证验证确认已验证的安全属性能够迁移到原始网络。我们的四阶段流程包括:从智能体观测中提取领域特定特征;决策树蒸馏,对神经网络策略的保真度达到97.9% +/- 1.2%;自动翻译为PRISM概率模型检查器的规范,并保持完整的特征到状态变量对应关系;以及通过成对分解、联合界(union bound)聚合和经验邻居建模,对概率计算树逻辑(PCTL)属性进行组合验证。我们在5-7个智能体的多无人机协调任务上评估了矢量量化变分信息瓶颈(VQ-VIB)策略,验证了涵盖安全性、活性和合作性的18个时序逻辑属性,属性满足率达到88.9%,且全部五个安全阈值均得到满足(碰撞概率0.3%,阈值为1%)。对原始神经网络策略的蒙特卡洛验证确认,已验证的安全属性在迁移时的偏差<=0.6个百分点(95%置信区间)。与连续方法相比,离散的VQ-VIB消息带来+11.6至+13.6个百分点的保真度优势,并使验证速度提高3-4倍。我们的框架为蒸馏得到的策略抽象提供了经实证验证的安全验证,是连接深度MARL与多机器人部署形式化安全工作流的实用桥梁。

英文摘要

Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
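
The pairwise decomposition with union-bound aggregation mentioned above rests on a simple inequality: the probability that any pair collides is at most the sum of the per-pair collision probabilities. A minimal sketch, with made-up per-pair probabilities that are not results from the paper:

```python
from itertools import combinations

def union_bound(pairwise_probs):
    """Upper bound on P(at least one pair collides), capped at 1."""
    return min(1.0, sum(pairwise_probs.values()))

def meets_threshold(pairwise_probs, threshold):
    """True when the union bound alone already certifies the safety threshold."""
    return union_bound(pairwise_probs) <= threshold

# Hypothetical: 5 agents, every pair verified to collide with probability 0.0005.
agents = range(5)
pairwise = {pair: 0.0005 for pair in combinations(agents, 2)}
print(len(pairwise), union_bound(pairwise), meets_threshold(pairwise, 0.01))
```

The bound is conservative because it ignores overlap between pairwise events, but it lets each pair be model-checked separately instead of building one joint model whose state space grows with the number of agents.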

2606.19512 2026-06-19 cs.RO cs.SY eess.SY 新提交 80%

Proprioceptive Invariant State Estimation for Humanoid Robots on Non-Inertial Ground

非惯性地面上仿人机器人的本体感觉不变状态估计

Falak Mandali, Zijian He, Yan Gu

发表机构 * Purdue University(普渡大学)

专题命中 机器人学习 :仿人机器人在非惯性地面的状态估计

AI总结 提出一种仅使用本体感觉的InEKF方法,利用足部IMU和运动学约束,实现非惯性地面上仿人机器人的实时状态估计,收敛速度提升96%,位置误差降低80%。

详情
AI中文摘要

本文提出了一种不变扩展卡尔曼滤波(InEKF)方法,使在非惯性地面上运行的仿人机器人仅依靠机载本体感觉传感即可进行实时状态估计。所提方法估计机器人相对于运动地面坐标系的基座位置和速度,无需直接测量地面运动,也无需外部安装的传感器。该滤波器借助安装在足部的IMU利用支撑脚处的运动学约束,在保持完全本体感觉的同时,考虑了过程模型和测量模型中由地面运动引起的非线性。估计器被设计为具有右不变测量模型,从而在初始不确定性较大时仍具有良好的误差动态特性。可观测性分析给出了机器人基座相对于非惯性地面坐标系的相对位置和速度可观测的条件。Digit仿人机器人在摇摆和俯仰的地面上站立和下蹲的实验表明,与现有InEKF相比,该方法的收敛速度提高了96%,位置估计误差减少了80%。在单轴旋转地面上的行走实验中,当初始误差高达1米时,平均估计误差小于9厘米。

英文摘要

This paper presents an invariant extended Kalman filtering (InEKF) approach for real-time state estimation of humanoid robots operating on non-inertial ground using only onboard proprioceptive sensing. The proposed approach estimates the robot's base position and velocity relative to the moving ground frame without requiring direct measurements of ground motion or externally mounted sensors. By exploiting kinematic constraints at the stance foot through foot-mounted IMUs, the filter accounts for ground-induced nonlinearities in the process and measurement models while remaining fully proprioceptive. The estimator is formulated to admit a right-invariant measurement model, enabling favorable error dynamics under large initial uncertainties. Observability analysis establishes conditions under which the robot's relative base position and velocity are observable with respect to the non-inertial ground frame. Experiments with the Digit humanoid robot standing and squatting atop a swaying and pitching ground showcase a 96% speedup in convergence rate and an 80% reduction in position estimate errors over existing InEKFs. Walking experiments on a uni-axially rotating ground achieve an average estimation error of less than 9 cm for an initial error of up to 1 m.

2606.19031 2026-06-19 cs.RO 新提交 80%

Congestion-Aware Robot Tour Planning in Crowded Environments

拥挤环境中的拥塞感知机器人巡视规划

Stefano Bernagozzi, Charlie Street, Masoumeh Mansouri, Lorenzo Natale

发表机构 * Istituto Italiano di Tecnologia(意大利理工学院) Università di Genova(热那亚大学) University of Birmingham(伯明翰大学)

专题命中 机器人学习 :拥挤环境机器人巡视规划,属于机器人学习

AI总结 提出一种基于概率的巡视规划器,通过学习人流预测模型并在线构建马尔可夫决策过程,在拥挤环境中高效规划机器人路径,减少拥塞影响。

Comments Accepted to IEEE IROS 2026

Journal ref IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2026

详情
AI中文摘要

自主移动服务机器人常常需要完成巡视任务,即在环境中依次前往一组地点。典型应用包括在购物中心为人们引路、在配送中心递送包裹,或在博物馆提供导览。然而,在拥挤环境中,人群的存在可能对机器人性能产生负面影响。例如,行人会触发机器人的避碰动作,从而使机器人减速。人群的移动具有随机性,并且在一天之中不断变化。本文提出一种面向拥挤环境的概率巡视规划器,它显式地对人群拥塞进行推理。我们学习圆-线性流场(CLiFF)地图,用于在给定初始观测的情况下预测行人轨迹。然后,我们利用这些预测在线构建并求解马尔可夫决策过程,从而高效地规划机器人穿过环境的路线。我们的方法具有足够的可扩展性,能够在观察到新的行人时重新规划。我们在一个购物中心的真实人群数据集上评估了该方法。

英文摘要

Autonomous mobile service robots are often required to complete tours that require navigating through a set of locations in an environment. Example domains include guiding people through a shopping mall, delivering packages in a fulfilment centre, or giving guided tours in a museum. However, in crowded environments, the presence of people may negatively impact robot performance. For example, humans will activate robot collision avoidance manoeuvres that slow the robot down. Crowds move stochastically and vary throughout the day. In this paper we present a probabilistic tour planner for crowded environments which explicitly reasons over human congestion. We learn circular linear flow field (CLiFF) maps which predict human trajectories given an initial observation. We then use these predictions to build and solve a Markov decision process online which efficiently routes the robot through the environment. Our approach is scalable enough to re-plan as new people are observed. We evaluate our approach on a real-world crowd dataset in a shopping mall.
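
To make the "build and solve an MDP" step concrete, here is a deliberately small sketch. It assumes a graph whose moves are deterministic but whose traversal time is random: each edge is slow with some congestion probability and fast otherwise. In the paper those probabilities would come from CLiFF-based predictions; here they are hard-coded, and `plan` is a hypothetical name.

```python
def plan(edges, goal, sweeps=50):
    """Value iteration for expected time-to-goal.

    edges[s][s2] = (free_time, jammed_time, p_jam) describes moving from s to s2.
    """
    states = set(edges) | {goal} | {n for nbrs in edges.values() for n in nbrs}
    value = {s: (0.0 if s == goal else float("inf")) for s in states}
    policy = {}
    for _ in range(sweeps):
        for s, nbrs in edges.items():
            if s == goal:
                continue
            for nxt, (free_t, jam_t, p_jam) in nbrs.items():
                # Bellman backup: expected edge time plus value of the successor.
                q = p_jam * jam_t + (1.0 - p_jam) * free_t + value[nxt]
                if q < value[s]:
                    value[s], policy[s] = q, nxt
    return value, policy

# Hypothetical map: the corridor via B is short but jammed half the time.
edges = {
    "A": {"B": (2.0, 10.0, 0.5), "C": (4.0, 4.0, 0.0)},
    "B": {"D": (1.0, 1.0, 0.0)},
    "C": {"D": (2.0, 2.0, 0.0)},
}
value, policy = plan(edges, goal="D")
```

Ignoring congestion, the route through B would take 3 time units and look best; once the expected delay is included, its cost rises to 7 and the planner prefers the route through C.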

2. 机器人操作 3 篇

2606.20092 2026-06-19 cs.CV 新提交 85%

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

EventVLA: 面向长程视觉-语言-动作策略的事件驱动视觉证据记忆

Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, Tai Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai AI Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学) Dalian University of Technology(大连理工大学) Huawei Technologies Co., Ltd.(华为技术有限公司) The University of Hong Kong(香港大学) Tsinghua University(清华大学) Peking University(北京大学)

专题命中 机器人操作 :长程机器人操作记忆方法

AI总结 针对长程机器人操作中记忆瓶颈问题,提出EventVLA框架,通过动态关键帧证据记忆模块自主捕获任务关键视觉事件,在17个模拟和4个真实任务中平均成功率提升40%。

详情
AI中文摘要

记忆仍然是长程机器人操作的关键瓶颈,因为标准的视觉-语言-动作(VLA)策略在任务相关线索随时间变得遮挡或不可观测时常常失败。虽然现有的记忆增强方法利用历史上下文,但它们要么遭受严重的信息瓶颈,通过解耦的双系统引入高延迟,要么依赖积累大量视觉冗余的无选择性缓冲区。为了解决这些限制,我们引入了EventVLA,一个基于稀疏视觉证据记忆概念的端到端框架,包含两个核心组件:用于保留初始和短期上下文的基础视觉锚点,以及动态关键帧证据记忆(KEM)模块。具体来说,KEM直接从VLA的潜在嵌入中预测未来关键帧概率,以自主捕获和存储稀疏的、任务关键的视觉事件。这种前瞻驱动的机制使策略能够动态评估当前观测的未来因果效用,在瞬态视觉证据变得不可观测之前将其保留。此外,我们提出了RoboTwin-MeM,一个专门设计用于评估具有交互式视觉证据的非马尔可夫操作任务的诊断基准。大量评估表明,在17个需要记忆的模拟任务和4个真实世界双臂任务中,EventVLA相比最先进的记忆增强VLA实现了平均成功率提升+40%。

英文摘要

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
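
The notion of a sparse, event-driven memory can be illustrated with a small buffer that keeps a frame only when its predicted keyframe probability is high. This sketch assumes a fixed threshold and a lowest-probability eviction rule; both are illustrative choices, and `KeyframeMemory` is a hypothetical class, not the KEM module itself.

```python
class KeyframeMemory:
    """Keep at most `capacity` frames whose keyframe probability passes a threshold."""

    def __init__(self, threshold=0.5, capacity=4):
        self.threshold = threshold
        self.capacity = capacity
        self.entries = []  # (step, probability, frame)

    def observe(self, step, frame, keyframe_prob):
        """Store the frame if it looks like a keyframe; evict the weakest on overflow."""
        if keyframe_prob < self.threshold:
            return
        self.entries.append((step, keyframe_prob, frame))
        if len(self.entries) > self.capacity:
            weakest = min(self.entries, key=lambda e: e[1])
            self.entries.remove(weakest)

    def context(self):
        """Stored frames in temporal order, ready to be fed back to the policy."""
        return [frame for _, _, frame in sorted(self.entries, key=lambda e: e[0])]

memory = KeyframeMemory(threshold=0.5, capacity=2)
for step, prob in enumerate([0.1, 0.9, 0.2, 0.8, 0.7]):
    memory.observe(step, f"frame{step}", prob)
print(memory.context())
```

Compared with an unselective sliding window, a buffer like this grows with the number of task-relevant events rather than with episode length.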

2606.19586 2026-06-19 cs.RO 新提交 85%

One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies

一个演示胜过千条轨迹:用于视觉运动策略的动作-视角增强

Chuer Pan, Litian Liang, Dominik Bauer, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Shuran Song

发表机构 * Stanford University(斯坦福大学) Columbia University(哥伦比亚大学) Toyota Research Institute(丰田研究所)

专题命中 机器人操作 :提出动作-视角增强框架提升操作策略成功率

AI总结 提出一种数据增强框架,通过高斯泼溅和轨迹优化生成逼真的鱼眼图像序列和物理可行的动作轨迹,提升操作策略在场景变化和障碍物下的成功率。

Comments Project website: https://chuerpan.com/1001-demos.github.io/. Published at CoRL 2025

Journal ref Proceedings of The 9th Conference on Robot Learning, PMLR 305:3902-3914, 2025

详情
AI中文摘要

用于操作的视觉运动策略在建模复杂机器人行为方面展现出显著潜力,但机器人初始配置的微小变化和未见障碍物容易导致分布外观测。在没有大量数据收集工作的情况下,这些会导致灾难性的执行失败。在这项工作中,我们引入了一个有效的数据增强框架,该框架从真实世界的眼在手演示中生成视觉上逼真的鱼眼图像序列和相应的物理上可行的动作轨迹,这些演示使用带有单个鱼眼摄像头的便携式平行夹爪捕获。我们引入了一种新颖的高斯泼溅公式,适用于广角鱼眼摄像头,以重建和编辑带有未见物体的3D场景。我们利用轨迹优化生成平滑、无碰撞、视图渲染友好的动作轨迹,并从相应新视角渲染视觉观测。在仿真和现实世界中的综合实验表明,我们的增强框架提高了各种操作任务在相同场景和需要避障的增强场景中的成功率。

英文摘要

Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot's initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.

2606.18960 2026-06-19 cs.CV cs.RO 新提交 85%

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World:用于持久机器人操作的记忆增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

发表机构 * Dalian University of Technology(大连理工大学) Samsung R&D Institute China-Beijing (SRCB)(三星中国北京研究院)

专题命中 机器人操作 :记忆增强世界模型用于机器人操作

AI总结 提出Mem-World,通过以腕部视角为中心的4D面元(surfel)索引记忆W-VMem,解决操作中因遮挡和运动导致的场景遗忘问题,实现持久世界建模,提升策略评估与改进效果。

详情
AI中文摘要

动作条件世界模型已成为机器人学习中一种有前景的范式,它通过生成与动作一致的视频推演(rollout),为昂贵的真实世界实验提供了可扩展的替代方案。然而,在操作任务中实现持久的世界建模仍然具有挑战性:频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图,导致模型遗忘先前帧中见过的场景细节或产生幻觉。现有的记忆检索策略在动态操作场景中往往无法找出有信息量的历史。为解决这一局限,我们提出了Mem-World,一种记忆增强的多视图动作条件世界模型。其核心是W-VMem,一种以腕部视角为中心的4D面元(surfel)索引记忆,它将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素在何时、何地被观测到,W-VMem能够以未来动作为条件,对相关历史帧进行几何感知的检索。在生成过程中,通过基于面元的渲染和评分来选择相关历史帧,为预测提供信息丰富且不冗余的上下文。大量实验表明,Mem-World能够在复杂操作场景中生成持久的推演;与Ctrl-World相比,它能实现更可靠的策略评估,将与真实世界性能的皮尔逊相关系数提高14.5%;并且通过合成数据生成支持有效的策略改进,在长时域任务中将成功率从58%提升到72%。

英文摘要

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.

3. 其他机器人 2 篇

2606.19769 2026-06-19 cs.RO cs.AI 新提交 85%

Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI

人形机器人数据标准:物理AI缺失的基础设施

Shaoshan Liu, Xiugong Qin, Xuan Wu, Xuan Xia, Ning Ding, Jialu Liu, Jie Tang

专题命中 其他机器人 :讨论人形机器人数据标准,属于机器人基础设施。

AI总结 本文论证数据标准是人形机器人可扩展性的关键基础设施,通过提出ISO/WD 26264-1标准,解决数据非累积性问题,使具身经验可解释、可共享、可追溯和可复用。

详情
AI中文摘要

人形机器人的可扩展性不仅取决于模型和硬件,还取决于物理经验能否在机器人、任务、组织及时间维度上积累。基于作者在ISO/TC 299/WG 16内制定ISO/WD 26264-1《人形机器人数据集——第1部分:通用要求》的工作,本文论证数据标准正成为物理AI的基础设施。我们提出三个见解:第一,人形机器人数据是具身交互数据,而非孤立数字样本的集合;有用的数据集必须保留机器人本体、动作、任务、场景、执行轨迹和结果之间的关系。第二,其价值取决于物理一致性:多模态流仅在时序、坐标系、标定、运动学、单位和同步假设可检查时才可复用。第三,主要瓶颈不仅是数据稀缺,更是由高采集成本、数据孤岛和不一致评估导致的非累积性数据。我们认为人形机器人数据标准通过使具身经验可解释、可共享、可追溯和可复用来解决这些瓶颈。通用标准应为生命周期管理、元数据、来源、质量、版本控制和可追溯性提供横向基础设施,而能力特定部分应定义操作、移动、人机交互、认知及未来人形能力的领域语法。随着AI从屏幕进入实体,数据标准必须从组织数字信息演变为结构化物理交互。

英文摘要

The scalability of humanoid robots will depend not only on models and hardware, but also on whether physical experience can accumulate across robots, tasks, organizations, and time. Drawing on the authors' work in developing ISO/WD 26264-1, Humanoid robot datasets -- Part 1: General requirements, within ISO/TC 299/WG 16, this article argues that data standards are becoming foundational infrastructure for Physical AI. We develop three insights. First, humanoid robot data is embodied interaction data, not a collection of isolated digital samples; a useful dataset must preserve the relationship among robot body, action, task, scene, execution trace, and outcome. Second, its value depends on physical coherence: multimodal streams are reusable only when timing, coordinate frames, calibration, kinematics, units, and synchronization assumptions remain inspectable. Third, the main bottleneck is not only data scarcity, but non-cumulative data caused by high collection costs, data silos, and inconsistent evaluation. We argue that humanoid robot data standards address these bottlenecks by making embodied experience interpretable, shareable, traceable, and reusable. A general standard should provide horizontal infrastructure for lifecycle management, metadata, provenance, quality, versioning, and traceability, while capability-specific parts should define domain grammar for manipulation, locomotion, human-robot interaction, cognition, and future humanoid capabilities. As AI moves from screens into bodies, data standards must evolve from organizing digital information to structuring physical interaction.

2606.19920 2026-06-19 cs.RO cs.LG cs.MA 新提交 80%

Deep-Unfolded Coordination

深度展开协调

Hunter Kuperman, Minchan Jung, Rahul V. Ghosh, Alex Oshin, Evangelos A. Theodorou

发表机构 * Autonomous Control and Decision Systems Laboratory, Georgia Institute of Technology, United States(佐治亚理工学院自主控制与决策系统实验室)

专题命中 其他机器人 :提出深度展开框架学习分布式优化超参数

AI总结 提出Deep Coordinator框架,通过深度展开ADMM-DDP迭代学习动态调整超参数,实现非凸优化器求解时自适应惩罚参数,在车队和四旋翼仿真中速度提升6.18-9.44倍且可扩展至8倍规模。

Comments The second and third authors contributed equally (equal second authorship). 35 pages (10 pages main text), 17 figures, 3 tables

详情
AI中文摘要

分布式优化是一种高度可扩展且结构透明的技术,用于解决多机器人问题;然而,这类方法通常需要高度专门化、针对特定问题的超参数调整。在这项工作中,我们提出了Deep Coordinator,一个深度展开框架,学习在求解时根据优化器性能动态调整ADMM-DDP(一种流行的机器人任务分布式求解器)的超参数。我们的架构包括将固定数量的ADMM-DDP迭代展开成一个神经网络,层之间具有可学习的函数,将优化器状态映射到下一个超参数。据我们所知,Deep Coordinator是第一个在求解时调整非凸优化器惩罚参数的深度展开框架;我们展示了主流的监督方法在训练此类模型时可能产生退化解,并提出了一种无监督学习方案。在车队和四旋翼飞行器的仿真中,Deep Coordinator生成的轨迹质量与常规求解器相当,但速度快6.18-9.44倍。此外,当部署到比训练规模大8倍的系统时,Deep Coordinator仍能保持其性能优势。

英文摘要

Distributed optimization is a highly scalable and structurally transparent technique to solve multi-agent robotics problems; however, such methods often suffer from the need for highly-specialized, problem-specific hyperparameter tunings. In this work, we propose Deep Coordinator, a deep-unfolding framework that learns to dynamically adjust the hyperparameters of ADMM-DDP, a popular distributed solver for robotics tasks, at solve-time in response to optimizer performance. Our architecture consists of unrolling a fixed number of ADMM-DDP iterations into a neural network with learnable functions between layers mapping the optimizer state to the next hyperparameters. To the best of our knowledge, Deep Coordinator is the first deep-unfolding framework to adapt the penalty parameters of a non-convex optimizer at solve-time; we show that the mainstream supervised approach can yield degenerate solutions when training such models, and propose an unsupervised learning scheme. On simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18-9.44x faster than conventional solvers. Furthermore, Deep Coordinator retains its performance benefits when deployed to systems up to 8x larger than trained on.
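
Deep unfolding treats each solver iteration as a network layer with its own tunable quantities. The sketch below unrolls scaled-form consensus ADMM on a toy convex averaging problem, with one penalty value per layer. It is not ADMM-DDP and involves no learning: in a learned variant the `penalties` list would be produced by trainable functions of the optimizer state, whereas here it is supplied by hand.

```python
def unrolled_consensus_admm(targets, penalties):
    """Run len(penalties) ADMM iterations for min sum_i (x_i - a_i)^2 / 2 s.t. x_i = z.

    Each entry of `penalties` is the rho used by one unrolled 'layer'.
    """
    n = len(targets)
    z = 0.0
    duals = [0.0] * n
    for rho in penalties:
        # Local step: closed-form minimizer of each agent's augmented objective.
        local = [(a + rho * (z - u)) / (1.0 + rho) for a, u in zip(targets, duals)]
        # Consensus step: average of local solutions plus scaled duals.
        z = sum(x + u for x, u in zip(local, duals)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        duals = [u + x - z for u, x in zip(duals, local)]
    return z

targets = [1.0, 2.0, 3.0]  # the optimum is their mean, 2.0
converged = unrolled_consensus_admm(targets, [1.0] * 50)
err_default = abs(unrolled_consensus_admm(targets, [1.0] * 5) - 2.0)
err_tuned = abs(unrolled_consensus_admm(targets, [0.1] * 5) - 2.0)
```

Even on this toy problem the penalty schedule changes how close five iterations get to the optimum, which hints at why adapting it at solve-time can reduce the iteration count on harder, non-convex problems.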

4. 具身导航 1 篇

2606.20045 2026-06-19 cs.CV cs.AI 新提交 80%

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

See-and-Reach: 视场内的精确视觉语言导航用于无人机

Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang, Yang Yang, Xindi Wang, Jiande Sun

发表机构 * School of Information Science and Engineering, Shandong University(山东大学信息科学与工程学院) Faculty of Engineering and Information Technology, University of Technology Sydney(悉尼科技大学工程与信息技术学院) School of Computer Science and Technology, Shandong University(山东大学计算机科学与技术学院) School of Artificial Intelligence, Shandong University(山东大学人工智能学院) School of Computer Science and Artificial Intelligence, Shandong Normal University(山东师范大学计算机科学与人工智能学院) Interdisciplinary Research Center of General Artificial Intelligence, Shandong Normal University(山东师范大学通用人工智能跨学科研究中心)

专题命中 具身导航 :无人机视觉语言导航属于具身导航。

AI总结 针对无人机视觉语言导航中目标可见后精确到达能力评估不足的问题,提出UAV-VLN-FOV任务和3DG-VLN框架,通过动态3D方向线索增强细粒度视觉定位与空间对齐,在基准和真实实验中显著提升成功率。

Comments 12 pages, 7 figures

详情
AI中文摘要

无人机视觉语言导航(UAV-VLN)通常被表述为一个整体的“搜索并到达”问题,其中远距离目标发现和最终目标接近被联合优化和评估。这种表述难以评估空中具身智能体的一项关键能力,即一旦目标进入视场,无人机能否准确地定位(ground)可见目标,并将视觉语言证据转化为精确的3D运动。为了解决这一局限,我们引入了UAV-VLN-FOV,这是一个目标可见的导航任务,它将“看到并到达”阶段单独分离出来,从而能够对末段到达能力进行更具诊断性的评估。我们进一步提出了3DG-VLN,一种由动态3D方向线索引导的视觉语言航点预测框架,用于增强细粒度视觉定位和空间方向对齐,从而实现精确的目标到达。具体来说,3DG-VLN自适应地处理高分辨率的前视和下视观测,以保留目标定位所需的细粒度视觉和几何细节。它还在闭环导航过程中在线更新目标相对方向,使智能体能够保持与目标的空间对齐,并减少累积的方向漂移。为了支持该任务,我们构建了一个专用的高分辨率基准,包含2,717条轨迹,并配有面向目标的高层指令、高分辨率的前视和下视第一人称观测以及连续的3D航点标注。实验表明,3DG-VLN优于有竞争力的UAV-VLN基线,成功率提高了13.82%。真实世界试验进一步展示了3DG-VLN在实际“看到并到达”导航中的潜力。源代码和基准可在 https://github.com/xuefanfu/3DG-VLN 获取。

English abstract

UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at https://github.com/xuefanfu/3DG-VLN.
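
The abstract mentions updating the target-relative direction online during closed-loop navigation. The sketch below shows one simple way such a cue could be computed and refreshed at every step. It is not the 3DG-VLN implementation: the function names, the yaw-only body frame (x forward, y left, z up), the fixed step size, and the assumption that a 3D target position estimate is already available are all hypothetical choices made for illustration.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def target_direction_body(position: Vec3, yaw: float, target: Vec3) -> Vec3:
    """Unit vector from the UAV to the target, expressed in a yaw-rotated
    body frame (x forward, y left, z up). Returns zeros if already at the target."""
    dx, dy, dz = (t - p for t, p in zip(target, position))
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    if dist == 0.0:
        return (0.0, 0.0, 0.0)
    c, s = math.cos(yaw), math.sin(yaw)
    # World-to-body rotation about the vertical axis, then normalization.
    return ((c * dx + s * dy) / dist, (-s * dx + c * dy) / dist, dz / dist)


def fly_to_target(position: Vec3, yaw: float, target: Vec3,
                  step: float = 1.0, tol: float = 0.1, max_steps: int = 100) -> Vec3:
    """Closed loop: recompute the direction cue each step, then move one waypoint."""
    for _ in range(max_steps):
        dist = math.dist(position, target)
        if dist <= tol:
            break
        bx, by, bz = target_direction_body(position, yaw, target)
        # Rotate the body-frame cue back into the world frame to place the waypoint.
        c, s = math.cos(yaw), math.sin(yaw)
        wx, wy, wz = c * bx - s * by, s * bx + c * by, bz
        move = min(step, dist)  # never overshoot the target
        position = (position[0] + move * wx,
                    position[1] + move * wy,
                    position[2] + move * wz)
    return position
```

Because the direction is recomputed from the current position on every iteration, a heading error made in one step does not carry into the next, which is the intuition behind reducing accumulated direction drift.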

5. Embodied Reasoning (1 paper)

2606.19383 2026-06-19 cs.RO cs.CV New submission 80%

3D Scene Graphs: Open Challenges and Future Directions

Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

Affiliations: University of Stuttgart; IMPRS-IS; Sapienza University of Rome; Google; MIT; University of Freiburg; UTN; University of Montreal; Mila; TU Munich

Topic match: Embodied Reasoning. 3DSGs are used for robot manipulation and navigation.

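AI summary: This survey gives a unified review of how 3D scene graphs (3DSGs) are constructed, applied, and evaluated, analyzing existing modeling choices and open challenges with the aim of advancing robust deployment.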

Comments Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

English abstract

3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.
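
For readers new to the representation, the following is a minimal sketch of a hierarchical scene graph with node attributes, relational edges, and parent links. It is not the common definition formalized in the survey. The class names, the layer labels ("object", "room"), and the relation strings are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    layer: str                               # assumed layer labels, e.g. "object" or "room"
    position: Tuple[float, float, float]     # geometric grounding
    attributes: Dict[str, str] = field(default_factory=dict)  # semantic labels
    parent: Optional[str] = None             # hierarchical link to the layer above


@dataclass
class Edge:
    source: str
    target: str
    relation: str                            # assumed relation names, e.g. "on"


@dataclass
class SceneGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str, relation: str) -> None:
        # Reject edges that reference nodes missing from the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, relation))

    def children(self, node_id: str) -> List[str]:
        """IDs of nodes whose parent is node_id, sorted for stable output."""
        return sorted(n.node_id for n in self.nodes.values() if n.parent == node_id)

    def related(self, node_id: str, relation: str) -> List[str]:
        """Targets reachable from node_id through edges with the given relation."""
        return [e.target for e in self.edges
                if e.source == node_id and e.relation == relation]
```

A task planner could, for example, call `children("kitchen")` to list the objects in a room and `related("cup", "on")` to find what a cup is resting on.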