arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02577 2026-06-02 cs.RO cs.CV Version update

RoboDream: Compositional World Models for Scalable Robot Data Synthesis

Junjie Ye, Rong Xue, Basile Van Hoorick, Runhao Li, Harshitha Rajaprakash, Pavel Tokmakov, Muhammad Zubair Irshad, Vitor Guizilini, Yue Wang

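Affiliations: USC Physical Superintelligence (PSI) Lab; Toyota Research Institute

AI summary: Proposes an embodiment-centric compositional world model that decouples trajectory execution from environment synthesis, synthesizes photorealistic demonstrations with novel viewpoints, scenes, and objects, and demonstrates its effectiveness for data scaling and for reducing real-world data requirements.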

Comments Project page: https://junjieye.com/RoboDream/

Abstract

Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-consuming. While video diffusion models offer a promising avenue for data scaling, existing generative approaches are often limited to superficial visual augmentation, or suffer from embodiment hallucinations that yield physically infeasible motions. We present a generalizable embodiment-centric world model that achieves scalable data generation by synthesizing photorealistic demonstrations with novel objects, in novel scenes, and from novel viewpoints. Our approach anchors generation to rendered robot motion while conditioning on explicit scene and object priors, effectively decoupling trajectory execution from environment synthesis. This formulation has the potential to unlock two powerful data scaling capabilities: (1) retrieval and rebirth, which repurposes existing trajectories into entirely new contexts without new motion data; and (2) prop-free teleoperation, where operators manipulate empty air and the model hallucinates the target objects and scene afterwards, eliminating reset time. We demonstrate with real-world experiments that our generated data consistently improves downstream policy performance and significantly reduces real-world data requirements across diverse manipulation tasks.

2606.02562 2026-06-02 cs.RO cs.AI cs.LG cs.SY eess.SY Version update

Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics

Haimin Hu

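Affiliations: Department of Computer Science, Johns Hopkins University, USA

AI summary: To address the safety problems that human uncertainty creates in interactive robotics, proposes a conformal-prediction-based method for verifying belief-space safety filters that guarantees high-probability safety while accounting for inference reliability, and reduces conservativeness.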

Comments Accepted to the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR 2026)

Abstract

Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferences, goals, competency, and willingness to cooperate. Safety filters are a popular approach for ensuring safety in interactive robotics, since their modular design separates safety from performance, allowing robots to operate safely around people with minimal impact on task efficiency. While traditional safety filters typically operate only in the physical space, neglecting the robot's ability to learn and adapt online, the recently proposed belief-space safety filter (BeliefSF) reasons about robot safety in closed-loop with runtime inference that actively reduces the robot's uncertainty online, thereby reducing conservativeness in filtering. However, providing formal safety guarantees for robots deploying BeliefSF remains a significant challenge due to errors in runtime inference and neural approximation of safety filters required to handle the high dimensionality of belief spaces. In this paper, we propose an algorithmic approach to certify high-probability safety of BeliefSF using conformal prediction, while explicitly accounting for the reliability of the robot's runtime inference module. Our method leverages the structure of belief-space safety filtering by focusing verification on a region where inference is expected to be reliable. It preserves the simplicity and sample complexity of standard conformal prediction, yet can certify a substantially less conservative safety filter. Through a simulated human-vehicle interaction benchmark, we show that our approach verifies a significantly more permissive belief-space safety filter than a standard conformal prediction baseline.
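
For orientation, the standard split conformal prediction baseline mentioned in the abstract can be sketched as below. This is a minimal illustration, not the paper's BeliefSF verification procedure: the function name `conformal_threshold`, the choice of score, and the calibration values are all assumptions made for the example.

```python
import math

def conformal_threshold(scores, alpha):
    """Return the split-conformal quantile of calibration scores.

    With n exchangeable calibration scores, a fresh score is at most the
    returned value with probability at least 1 - alpha.
    """
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few samples for this confidence level
    return sorted(scores)[rank - 1]

# Hypothetical calibration scores, e.g. one safety-violation score per rollout.
calibration = [0.12, 0.05, 0.31, 0.08, 0.22, 0.17, 0.02, 0.27, 0.10]
threshold = conformal_threshold(calibration, alpha=0.25)
```

With 9 calibration scores and alpha = 0.25, the rank is ceil(10 × 0.75) = 8, so the threshold is the 8th smallest score.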

2606.02551 2026-06-02 cs.RO cs.CV Version update

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao

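Affiliations: University of Michigan; University of California, San Diego; NVIDIA

AI summary: Proposes the AFUN model, which predicts a task-conditional functional mask and a 3D post-contact motion curve from a single RGB-D image and a language task description, achieves open-world generalization through a large-scale standardized data pipeline, and significantly outperforms existing methods on multiple benchmarks.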

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7-61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN

2606.02510 2026-06-02 cs.CV cs.RO Version update

Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis

Xiang Xu, Alan Liang, Youquan Liu, Xian Sun, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu

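Affiliations: NUAA; NUS; FDU; Duke; NTU; NJUPT; SKL-TI

AI summary: Proposes the U4D framework, which uses spatial uncertainty to guide LiDAR scene generation: entropy maps identify high-uncertainty regions, which are synthesized first, and the remaining regions are then completed, yielding high-fidelity 4D scenes.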

Comments CVPR 2026 E2E3D Workshop; GitHub at https://github.com/worldbench/U4D

Abstract

Constructing faithful 4D worlds from LiDAR-acquired sequences is crucial for embodied AI, yet current generative frameworks apply uniform modeling capacity across all spatial regions. This ignores that perceptual difficulty varies dramatically within a single scan: distant surfaces, occluded boundaries, and small-scale objects carry far higher uncertainty than well-observed structures. We present U4D, a new framework that explicitly leverages spatial uncertainty to guide LiDAR scene generation in a "hard-to-easy" schedule. U4D derives per-point uncertainty maps via Shannon Entropy from a pretrained segmentor, then applies an unconditional diffusion stage to synthesize high-entropy areas with precise geometry, followed by a conditional completion stage that fills in the remaining regions using these structures as priors. A MoST (Mixture of Spatio-Temporal) block further maintains cross-frame coherence by dynamically balancing spatial detail and temporal continuity. Extensive experiments on nuScenes and SemanticKITTI demonstrate state-of-the-art scene fidelity, temporal consistency, and downstream performance.
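
The per-point uncertainty step described above can be illustrated with a small sketch. It assumes the segmentor exposes per-point class probabilities; the function names, the threshold value, and the example probabilities are hypothetical, and the paper's actual pipeline may differ.

```python
import math

def shannon_entropy(probs):
    """Entropy (in nats) of one point's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def split_by_uncertainty(point_probs, threshold):
    """Partition point indices into high-entropy and low-entropy sets."""
    hard, easy = [], []
    for idx, probs in enumerate(point_probs):
        (hard if shannon_entropy(probs) > threshold else easy).append(idx)
    return hard, easy

# Hypothetical softmax outputs of a segmentor for three LiDAR points.
point_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # near-uniform prediction
    [0.60, 0.30, 0.10],
]
hard, easy = split_by_uncertainty(point_probs, threshold=0.8)
```

In a "hard-to-easy" schedule, the `hard` indices would be generated first and the `easy` indices completed afterwards, conditioned on them.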

2606.02486 2026-06-02 cs.RO Version update

Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation

Shahram Najam Syed, Arthur Jakobsson, Haoran Hao, Jeffrey Ichnowski

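Affiliations: Robotics Institute, Carnegie Mellon University

AI summary: Proposes the AHEAD framework, which predicts future visual features with a latent-space world model so that a frozen VLA model achieves high manipulation success rates in dynamic scenes.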

Comments 28 pages, 7 figures, 16 tables, Su

Abstract

Vision-Language-Action (VLA) models generalize across static manipulation but fail when objects move during task execution. They map the current observation to an action and assume the scene is stationary between observation and execution, so at any non-trivial object speed the resulting latency exceeds the time available to grasp. We close this gap with AHEAD (Anticipatory Horizon Extrapolation with Adaptive Dynamics), a predict-then-act wrapper that augments a frozen VLA with a motion-aware latent world model. A small world model trained on manipulation video forecasts future patch tokens in the VLA's feature space, conditioned on per-token velocity and acceleration from optical flow. A language-and-motion saliency mask concentrates prediction on task-relevant patches, and the model rolls forward for an adaptive horizon, halting when prediction uncertainty crosses a threshold. The frozen action decoder then receives the predicted future tokens in place of the current ones. AHEAD adds 4.9M parameters to a frozen 7B OpenVLA and reaches 79 to 97% success across 20 dynamic simulation scenarios where the strongest baseline reaches 31 to 58%. On a physical UFactory xArm 7, AHEAD succeeds on 29/30 to 30/30 on three conveyor and rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching where every baseline scores 0/30.
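
The adaptive-horizon rollout can be pictured as a loop that stops once predictions become untrustworthy. The sketch below is a toy under stated assumptions: `step_fn` is a hypothetical stand-in for the latent world model, and the constant-velocity dynamics and linear uncertainty growth are invented for illustration.

```python
def rollout_adaptive(state, step_fn, max_horizon, threshold):
    """Roll a predictor forward until its uncertainty crosses `threshold`.

    `step_fn(state)` returns (next_state, uncertainty). The function returns
    the last state whose uncertainty stayed at or below the threshold,
    together with the number of steps taken.
    """
    steps = 0
    for _ in range(max_horizon):
        next_state, uncertainty = step_fn(state)
        if uncertainty > threshold:
            break  # prediction no longer trusted; stop extrapolating
        state = next_state
        steps += 1
    return state, steps

# Toy stand-in for a world model: constant-velocity motion whose
# uncertainty grows with the distance travelled so far.
def toy_step(state):
    position, velocity, travelled = state
    travelled += abs(velocity)
    return (position + velocity, velocity, travelled), 0.1 * travelled

final_state, horizon = rollout_adaptive((0.0, 2.0, 0.0), toy_step, 10, 0.5)
```

A faster object accumulates uncertainty sooner in this toy, so the loop halts after fewer steps; a slow one runs until `max_horizon`.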

2606.02432 2026-06-02 cs.RO Version update

NDPP-Grasp: Non-Differentiable Physical Plausibility Constraint-Guided Task-Oriented Dexterous Grasp Generation

Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu

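Affiliations: Lancaster University

AI summary: Proposes a framework that injects non-differentiable physical plausibility constraints directly into the denoising process of a task-aligned grasp diffusion model, achieving physical-plausibility-guided dexterous grasp generation while preserving task alignment.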

Abstract

Task-oriented dexterous grasp generation aims to produce dexterous grasp poses that are both physically plausible and functionally suitable for specified manipulation tasks. Existing diffusion-based methods often address these two requirements in a decoupled manner: they first train a grasp diffusion model for task alignment and then rely on post-generation refinement to improve physical plausibility. However, this after-the-fact correction strategy applies physical plausibility guidance only once the grasp has already been generated, leaving the generation trajectory itself unguided by physical constraints and potentially leading to suboptimal grasps. To address this problem, we propose a novel framework that directly injects physical plausibility guidance into the denoising process of a task-aligned grasp diffusion model in a practical and effective manner, even when physical plausibility constraints are non-differentiable. This allows physical plausibility to shape grasp generation throughout denoising while preserving task alignment. Extensive experiments demonstrate the efficacy of our framework.

2606.02370 2026-06-02 cs.RO Version update

A Simulation Platform for Flapping-Wing Vehicles

Haichuan Li, Tomi Westerlund

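AI summary: To address the large sim-to-real gap for flapping-wing vehicles, proposes FWAV-Sim, a high-fidelity simulation platform that integrates a composite aerodynamic model, turbulence generation, and realistic sensor simulation to improve autonomy system development.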

Abstract

Flapping-wing aerial vehicles (FWAVs) demonstrate remarkable agility but face substantial autonomy challenges due to their high sensitivity to aerodynamic disturbances and limited sensor payload capacity. Current simulation platforms typically rely on oversimplified laminar flow assumptions and idealized sensor models, failing to capture the complex turbulence patterns and perceptual limitations encountered in real-world operation. This simulation-to-reality discrepancy significantly impedes the development of robust autonomy systems for FWAVs. We introduce FWAV-Sim, a high-fidelity Unity-based simulation framework that integrates: (1) a composite aerodynamic model combining quasi-steady blade-element theory with bluff-body drag effects, (2) spatiotemporally correlated turbulence generation through fractal noise synthesis, and (3) realistic sensor simulation including noisy IMU measurements, LiDAR point clouds, and RGB camera feeds. Our platform enables scalable generation of synchronized datasets containing ground-truth vehicle states, aerodynamic forces, turbulent wind fields, and multi-modal sensor streams. Experimental validation demonstrates that autonomy pipelines (including both controllers and perception systems) developed in FWAV-Sim exhibit significantly improved simulation capability, thereby advancing the outstanding performance in simulation-based development for flapping-wing aerial systems.

2606.02313 2026-06-02 cs.RO Version update

Towards Precise Intent-Aligned VLA Aerial Navigation via Expert-Guided GRPO

Tianyang Chen, Wenjun Li, Xin Zhou, Yuze Wu, Fei Gao

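Affiliations: Zhejiang University Differential Robotics

AI summary: Proposes the EG-GRPO framework, which augments online rollouts with expert data and uses a heterogeneous parallel pipeline to address the intent-alignment problems that data scarcity and inefficient exploration cause for VLA models in UAV navigation; it raises the success rate to 2.13x that of the SFT baseline and improves intent alignment performance by 60.9%.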

Abstract

Vision-Language-Action (VLA) models offer a promising end-to-end paradigm for unmanned aerial vehicles (UAVs) to accomplish complex tasks specified by fine-grained instructions. However, standard supervised fine-tuning (SFT) suffers from data scarcity, limited generalization, and weak supervision for nuanced and complicated human intents. Reinforcement fine-tuning offers a natural way to mitigate these challenges and align policy behaviors with human intents through designable feedback, but applying it to aerial navigation remains challenging due to inefficient exploration in expansive continuous spaces. To address these challenges, we introduce an efficient reinforcement learning (RL) framework for VLA-based aerial navigation. At its core, we propose EG-GRPO (Expert-Guided Group Relative Policy Optimization) to augment online rollouts with few-shot expert data. Additionally, we design a heterogeneous pipeline enabling parallel simulation and inference, which reduces rollout time by 43.5%. Across multiple tasks specified by complex human intents, EG-GRPO improves the success rate to 2.13x that of the SFT baseline, while improving intent alignment performance by 60.9%. These results demonstrate that our framework can move aerial navigation toward precise intent-aligned flight.
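
The group-relative part of GRPO normalizes each rollout's reward against the other rollouts sampled for the same task. The sketch below shows that normalization and one plausible reading of expert guidance, namely appending an expert trajectory's reward to the group. The abstract does not specify how EG-GRPO mixes expert data, so this mixing rule and all names are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards: every online rollout failed, one expert demo succeeded.
online_rewards = [0.0, 0.0, 0.0, 0.0]
expert_rewards = [1.0]

online_only = group_relative_advantages(online_rewards)
with_expert = group_relative_advantages(online_rewards + expert_rewards)
```

When every online rollout fails, the group has zero variance and every advantage is zero, so there is no learning signal; adding one successful expert sample restores a nonzero signal in this toy setting.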

2606.02307 2026-06-02 cs.RO Version update

FATE-VLA: Failure-aware test generation for vision-language-action models

Arusa Kanwal, Pablo Valle, Shaukat Ali, Aitor Arrieta

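Affiliations: Mondragon University; Simula Research Laboratory

AI summary: Proposes a failure-aware test generation method that combines diversity-driven exploration with surrogate models to actively discover sparse, clustered failures of VLA models in high-dimensional embodied spaces, finding up to 29.7% more failures than baselines on four state-of-the-art models.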

Abstract

Vision-Language-Action (VLA) models are increasingly used as generalist robot policies, yet their evaluation still relies largely on static benchmarks that randomly sample task scenes. In high-dimensional embodied spaces, failures are sparse and clustered, so static benchmarking can underestimate robustness risks. We reframe VLA evaluation as an active failure-discovery problem and propose a failure-aware test-generation approach that combines diversity-driven exploration with surrogate models learned from observed executions. The method steers testing toward high-risk yet diverse scene regions. Across four state-of-the-art VLA models, it uncovers substantially more failures (up to +29.7% over selected baselines) while revealing more diverse failure modes. This means that, for instance, in the case of GR00T-N1.6, the success rate dropped from 64.4% to 34.7%. More broadly, our findings call for a shift in VLA evaluation: from passive measurement on fixed task suites to adaptive, failure-seeking test generation that exposes the structure of model weaknesses before deployment.
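
One way to picture steering tests toward high-risk yet diverse regions is a greedy selection that scores each candidate scene by predicted failure risk plus distance from scenes already chosen. This is a sketch only: the scoring rule, the `weight` parameter, and the toy surrogate are assumptions, not the paper's algorithm.

```python
import math

def select_tests(candidates, failure_prob, budget, weight=0.5):
    """Greedily pick scenes that are both risky and far from earlier picks."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        def score(scene):
            risk = failure_prob(scene)
            # Distance to the nearest chosen scene; 1.0 when none chosen yet.
            novelty = min((math.dist(scene, s) for s in selected), default=1.0)
            return weight * risk + (1.0 - weight) * novelty
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy surrogate (hypothetical): predicted risk is highest near the origin.
def toy_failure_prob(scene):
    return 1.0 - min(1.0, math.dist(scene, (0.0, 0.0)))

candidates = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.5, 0.5)]
mixed = select_tests(candidates, toy_failure_prob, budget=2)
risk_only = select_tests(candidates, toy_failure_prob, budget=2, weight=1.0)
```

Here the mixed score picks the far corner (1.0, 1.0) second, while the risk-only score picks the near-duplicate (0.1, 0.0).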

2606.02296 2026-06-02 cs.RO Version update

A Kinetic Theory of Encounter-Based Information Propagation in Multi-Robot Systems

Alkesh K. Srivastava, Philip Dames

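Affiliations: Temple University

AI summary: Proposes a kinetic theory that analyzes the performance limits of multi-robot target tracking in terms of encounter-driven information propagation, staleness, and geometric constraints.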

Abstract

Multi-robot systems cannot assume persistent network connectivity. We study this problem through target tracking, where performance depends on how quickly target information is sensed, transported through the team, and used before it becomes stale. When robots exchange information only through physical encounters, tracking becomes a kinetic information-transport problem: robot motion induces encounters, encounters carry target-state estimates, information age determines staleness, and stale information produces tracking error. This paper develops a kinetic theory of encounter-based information propagation and identifies three limits. The first is an access limit: information cannot support team-level coordination unless it spreads beyond the robots that sensed it. The second is a staleness limit: even propagated information loses value as the target moves. The third is a geometry limit: when target motion outpaces information transport, tracking error approaches a saturation regime where communication improvements alone have diminishing returns. We evaluate the theory through large-scale simulations varying team size, operating area, communication range, and target speed. Results support the proposed access-staleness-geometry decomposition: communication coverage governs the access transition; once information is accessible, tracking error is shaped by target displacement; and this response is locally linear in restricted regimes but nonlinear over broader ranges because of sensing refreshes and bounded geometry. Across controlled sweeps and joint variation, the derived access and staleness coordinates reliably describe tracking performance. Together, these results establish a kinetic-theoretic framework for predicting and designing encounter-based multi-robot systems.

2606.02280 2026-06-02 cs.RO Version update

Dynamics Are Learned, Not Told: Semi-Supervised Discovery of Latent Dynamics Geometries For Zero-Shot Policy Adaptation

Zhiming Xu, Weitao Zhou, Xianghui Pan, Nanshan Deng, Chengju Liu, Qijun Chen, Chenpeng Yao

AI summary: To address policy failure caused by dynamics shifts in robot reinforcement learning, proposes a contrastive-learning-based semi-supervised method that builds a smooth, task-relevant latent topology to achieve zero-shot policy adaptation, outperforming parameter-centric methods on MuJoCo benchmarks.

Comments Proceedings of the 43rd International Conference on Machine Learning

Abstract

Real-world dynamics shifts pose a critical challenge for reinforcement learning in robotics, as policies tightly coupled to nominal environments often fail catastrophically when physical conditions change. Most existing methods rely on encoding explicitly identified physical parameters into a latent context, a parameter-centric paradigm that depends on pre-specified axes of variation and becomes brittle under unmodeled or compound dynamics changes. We revisit dynamics adaptation from an outcome-centric perspective: rather than telling policies what the dynamics are, we enable them to learn how dynamics affect interaction outcomes. Theoretically, this is grounded in a monotonic relationship between target-domain regret and the Lipschitz constant of a trajectory dynamics encoder. Practically, this constant can be upper-bounded through contrastive learning, yielding a smooth, task-relevant latent topology without privileged dynamics information. On MuJoCo benchmarks, our method consistently outperforms parameter-centric baselines under severe dynamics shifts, including unmodeled and time-varying parameters, while also improving in-distribution stability and latent interpretability. Overall, these results validate that controlling latent geometry is a principled mechanism for robust adaptation.

2606.02277 2026-06-02 cs.RO Version update

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Bin Yu, Yao Zhang, Haishan Liu, Shijie Lian, Yuliang Wei, Xiaopeng Lin, Zhaolong Shen, Changti Wu, Ruina Hu, Bailing Wang, Cong Huang, Kai Chen

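Affiliations: HIT; ZGCA; ZGCI; WHU; HUST; HKUST(GZ); BUAA; ECNU; DeepCybo

AI summary: Proposes the RoboSemanticBench benchmark, which uses multiple-choice question-answering tasks to evaluate whether VLA models use instruction semantics to select the correct object; it finds that models select the semantically correct object at near-random rates, revealing a gap between semantic understanding and action prediction.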

Comments GitHub: https://github.com/ZGC-EmbodyAI/RoboSemanticBench

Abstract

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.
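
Controlling for grasp success amounts to scoring semantic choice only on episodes where a block was actually grasped, then comparing against the chance level (1/4 or 1/10 for the two suites). A minimal sketch follows; the episode encoding and function name are hypothetical, and the benchmark's exact metric may be defined differently.

```python
def semantic_accuracy(episodes):
    """Fraction of successful grasps that picked the correct block.

    Each episode is (grasped_block, correct_block); grasped_block is None
    when the robot failed to grasp any candidate block.
    """
    grasped = [(g, c) for g, c in episodes if g is not None]
    if not grasped:
        return None  # undefined when nothing was grasped
    return sum(g == c for g, c in grasped) / len(grasped)

# Hypothetical four-choice episodes.
episodes = [("A", "A"), ("B", "A"), (None, "C"), ("D", "C"), ("C", "C")]
accuracy = semantic_accuracy(episodes)
chance_level = 1 / 4
```

An accuracy close to `chance_level` would indicate that a policy grasps reliably but does not use the question's semantics to choose the block.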

2606.02251 2026-06-02 cs.RO cs.AI eess.SP Version update

FW-NKF: Frequency-Weighted Neural Kalman Filters

Adnan Harun Dogan, Berken Utku Demirel, Christian Holz

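Affiliations: Department of Computer Science, ETH Zürich

AI summary: Proposes the Frequency-Weighted Neural Kalman Filter (FW-NKF), which embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation and transition networks to suppress band-limited noise, reducing localization error by up to 10% on tasks such as chaotic systems and inertial pose estimation.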

Comments Published at ICRA 2026

Abstract

Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and model mismatch such as sensor vibrations, electromagnetic interference, and periodic noise. Although Deep Kalman Filter (DKF) variants extend the Extended Kalman Filtering (EKF) framework by learning latent transitions, they lack explicit mechanisms to suppress band-limited noise components that typically corrupt sensor measurements in real-world scenarios. We introduce the Frequency-Weighted Neural Kalman Filter (FW-NKF), a unified hybrid approach that embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation and transition networks. By adapting both the filter spectrum and the latent state representation, FW-NKF attenuates the noise-dominated frequency bands while capturing complex residual structures. We conduct extensive experiments on four heterogeneous benchmarks, including chaotic systems such as multi-dimensional Lorenz systems and full-body inertial pose estimation, and find a reduction in localization error of up to 10% as well as marked improvements in orientation accuracy. Our ablation studies confirm that frequency weighting and deep latent-state modeling contribute to overall performance.

2606.02107 2026-06-02 cs.RO cs.AI cs.LG Version update

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

Youssef Mahran, Zeyad Gamal, Aamir Ahmad, Ayman El-Badawy

Affiliations: Mechatronics Engineering Department, German University in Cairo (GUC), Egypt; Institute of Flight Mechanics and Control (IFR), Head of Flight Robotics, University of Stuttgart, Germany; Faculty of EMS, Head of Mechatronics Engineering Department, German University in Cairo (GUC), Egypt

AI summary: Proposes a network-distributed multi-agent reinforcement learning framework that uses the communication graph to realize distributed policies and trains a high-level planner with MASAC, achieving zero-shot scaling to 250 agents.

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

Journal ref
2026 IEEE 23rd Mediterranean Electrotechnical Conference (MELECON)
Abstract

This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared to conventional MARL formulations that rely on centralized planning or fully decentralized execution, ND-MARL incorporates the swarm communication graph into the decision process. Under a 2-Neighbor communication topology, each agent observes information from only two neighbors and outputs an action through a distributed policy. A high-level distributed consensus planner is trained using Multi-Agent Soft Actor-Critic (MASAC) and embedded in a hierarchical stack to generate reference target positions tracked by a low-level quadcopter controller. Results demonstrate smooth consensus trajectories and planner-tracker integration when compared to a centralized MARL controller. Most notably, the learned controller exhibits zero-shot scalability, as policies trained on a three-agent system are deployed to swarms of up to 250 agents under the same 2-Neighbor communication topology without retraining or fine-tuning, achieving consistent convergence with increasing steady-state spread at large team sizes due to sparse information propagation. These findings highlight ND-MARL as a stable framework for distributed, communication-aware quadcopter consensus control.
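
To make the 2-Neighbor topology concrete, the sketch below arranges agents on a ring (an assumed layout) and replaces the learned MASAC policy with a hand-written averaging rule. It only illustrates that a rule defined over two neighbors applies unchanged to any team size; it is not the paper's controller.

```python
def ring_neighbors(i, n):
    """Indices of the two neighbors of agent i on a ring of n agents."""
    return ((i - 1) % n, (i + 1) % n)

def consensus_step(positions, gain=0.5):
    """Move every agent part of the way toward its neighbors' midpoint."""
    n = len(positions)
    updated = []
    for i, x in enumerate(positions):
        left, right = ring_neighbors(i, n)
        target = 0.5 * (positions[left] + positions[right])
        updated.append(x + gain * (target - x))
    return updated

def spread(positions):
    """Distance between the two most separated agents (1D positions)."""
    return max(positions) - min(positions)

# Same rule, two team sizes (1D positions for simplicity).
small = [0.0, 4.0, 10.0]
large = [float(i) for i in range(20)]
for _ in range(50):
    small = consensus_step(small)
    large = consensus_step(large)
```

After the same number of steps the larger ring retains a larger spread in this toy, since information travels only one neighbor per step.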

2606.02058 2026-06-02 cs.CV cs.RO Version update

TIDES: Time-Derivative Event Simulation via Deformable Reconstruction

Christopher Thirgood, Dipon Kumar Ghosh, Simon Hadfield

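Affiliations: University of Surrey

AI summary: Proposes TIDES, a continuous-time event simulator built on dynamic Gaussian splatting that derives per-pixel intensity dynamics from an explicit 3D scene representation for accurate threshold-crossing prediction and uses occlusion to guide adaptive time stepping, achieving state-of-the-art event-stream fidelity.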

Abstract

Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times, a failure mode we term timestamp batching that worsens under fast motion and occlusion. We present TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learnt geometry and motion, it can derive per-pixel intensity dynamics directly from the scene, rather than by differencing rendered frames. This enables accurate threshold-crossing prediction, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple models of brightness change unreliable. Finally, we model finite sensor bandwidth using a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. Across paired RGB-event benchmarks, TIDES attains state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than those from competing simulators.
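The difference between batched and continuous timestamps can be illustrated with a single pixel. The sketch below assumes the log intensity changes linearly within one rendering step, which is a simplification chosen here and not necessarily how TIDES models intensity dynamics; `crossing_events` and its parameters are hypothetical names.

```python
def crossing_events(ref_level, log_i_start, rate, dt, threshold):
    """Exact threshold-crossing times for one pixel within a step of length dt.

    Assumes log intensity L(t) = log_i_start + rate * t on [0, dt].
    An event fires each time L(t) moves one `threshold` away from the level
    at which the previous event fired (`ref_level`).
    Returns (events, new_ref_level), with events as (time, polarity) pairs.
    """
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")
    events = []
    if rate == 0.0:
        return events, ref_level
    polarity = 1 if rate > 0.0 else -1
    level = ref_level
    while True:
        target = level + polarity * threshold
        t = (target - log_i_start) / rate
        if t > dt:
            break
        # A crossing that is already overdue at the start fires at t = 0.
        events.append((max(t, 0.0), polarity))
        level = target
    return events, level


events, new_ref = crossing_events(
    ref_level=0.0, log_i_start=0.0, rate=1.0, dt=0.35, threshold=0.1
)
for time, polarity in events:
    print(round(time, 3), polarity)
```

A frame-differencing simulator would see the same pixel only at the end of the step and stamp all three events with `t = dt`; solving for each crossing gives three distinct timestamps inside the step.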

2606.02027 2026-06-02 cs.RO cs.LG cs.MA 版本更新

World-Task Factorization for Robot Learning

世界-任务分解用于机器人学习

Eduardo Sebastián, Adrian Pfisterer, Vito Mengers, Oliver Brock, Amanda Prorok

发表机构 * Department of Computer Science and Technology, University of Cambridge, United Kingdom(计算机科学与技术系,剑桥大学,英国) Robotics and Biology Laboratory, Technische Universität Berlin(机器人与生物学实验室,柏林工业大学) Science of Intelligence (SCIoI), Cluster of Excellence, Berlin, Germany(智能科学(SCIoI),卓越集群,柏林,德国) Robotics Institute Germany(德国机器人研究所)

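AI summary: Factors the policy into world factors and task factors by pairing AICON, a differentiable graph model, with a compact learned policy, generalizing zero-shot to new configurations and transferring to real hardware.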

详情
AI中文摘要

机器人学习必须产生能够泛化到新的约束、队友和环境组合的策略。为此,我们必须对策略进行结构性分解,这种选择决定了哪些部分泛化、哪些需要重新训练、哪些保持纠缠。现有方法涵盖从期望结构从数据扩展中涌现,到通过层次结构、技能库或学习专门化手工设计。在本文中,我们研究我们认为机器人学中最基本的分解:将世界与任务分离。我们研究了这种分解有原则的条件。世界因子是具身系统和环境的属性;它们独立于意图存在。任务因子由任务在世界所允许的事物上的逻辑定义。我们通过贝叶斯模型证据形式化这种不对称性:它与数据生成过程一致,通过分析世界模型保持高似然,并减少奥卡姆剃刀对任务参数的惩罚。我们通过将AICON(一个可微分的递归估计器和互连图,具有组合性,无需任务特定数据即可运行,并将成本梯度传播到执行器)与一个紧凑的学习策略配对来实例化这种分解,该策略调节梯度路径。梯度作为两个因子之间的接口:它们通过图携带世界结构,通过成本携带任务结构,从而在保持结构泛化的同时实现低维学习。我们在三个问题上测试了世界/任务分解,这些问题包含异构机器人、环境、任务逻辑和感觉运动模态。我们的框架在所有设置中优于端到端基线和分析启发式方法,零样本泛化到分布外配置,并无需重新训练即可迁移到真实硬件。

英文摘要

Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task's logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam's razor penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.
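The role of Occam's razor in this argument can be made concrete with the textbook Laplace approximation to the model evidence. The expression below is a standard result and not the paper's own derivation; the symbols $\mathcal{D}$ (data), $\mathcal{M}$ (model), $\theta$ (the $d$ learned parameters), $\hat{\theta}$ (their most probable value), and $H$ (the Hessian of the negative log joint at $\hat{\theta}$) are notation assumed here.

```latex
p(\mathcal{D}\mid\mathcal{M})
  = \int p(\mathcal{D}\mid\theta,\mathcal{M})\,p(\theta\mid\mathcal{M})\,d\theta
  \approx
  \underbrace{p(\mathcal{D}\mid\hat{\theta},\mathcal{M})}_{\text{best-fit likelihood}}
  \;
  \underbrace{p(\hat{\theta}\mid\mathcal{M})\,(2\pi)^{d/2}\,(\det H)^{-1/2}}_{\text{Occam factor}}
```

Under this reading, an analytical world model contributes to the likelihood term without adding learned parameters, so the Occam factor is paid only on the low-dimensional task parameters.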

2606.01970 2026-06-02 cs.RO cs.MA cs.SY eess.SY 版本更新

Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions

搜救任务中安全关键无人机群的市场化重规划

Luiz Giacomossi, Andrea Haglund, Claire Namatovu, Emily Zainali, Esaias Målqvist, Yonatan M. Beyene, Ivan Tomasic, Baran Çürüklü, Håkan Forsberg

发表机构 * KTH Royal Institute of Technology(皇家理工学院) Swedish Defence Research Agency(瑞典国防研究机构)

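AI summary: Proposes IRDS, a distributed coordination architecture that uses a reverse-auction market mechanism and a geometric consensus protocol to reallocate tasks autonomously when UAVs fail, maintaining a 93% mission success rate under 25% workforce degradation.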

Comments 6 pages, 4 figures, accepted at MIPRO 2026

详情
AI中文摘要

搜救任务中可靠自主无人机群需要能够容忍代理退化并维持操作的容错协调。本文介绍了智能重规划无人机群(IRDS),一种为资源受限环境设计的分布式协调架构。所提出的框架采用反向拍卖市场机制,其中代理基于距离加权成本函数竞标服务搜索区域,并结合几何共识协议进行目标验证。我们通过物理仿真(N=8个代理,8x8网格)评估该方法,并施加随机故障注入。结果表明,无人机群能够以相对于总任务持续时间较低的延迟自主重新分配来自故障代理的任务,在25%劳动力退化下保持93%的任务成功率。所提出的框架展示了一种稳健的、经过实证测试的空中机器人自愈协调方法。

英文摘要

Reliable autonomous UAV swarms in Search and Rescue (SAR) missions require fault-tolerant coordination capable of sustaining operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework employs a Reverse-Auction market mechanism where agents bid to service search sectors based on a distance-weighted cost function, coupled with a geometric consensus protocol for target verification. We evaluate the approach through physics-based simulations (N=8 agents, 8x8 grid) subjected to stochastic fault injection. Results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a mission success rate of 93% under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested method for self-healing aerial robotic coordination.
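A reverse auction of this kind can be sketched in a few lines: every agent quotes a cost for a search sector and the lowest bid wins. The cost function below (travel distance plus current workload), the re-auction rule, and all names are assumptions made for illustration; the abstract does not specify IRDS's actual bid formula or protocol messages.

```python
import math


def bid(agent_pos, sector, load, distance_weight=1.0):
    """Cost an agent quotes for a sector: weighted travel distance plus workload."""
    return distance_weight * math.dist(agent_pos, sector) + load


def allocate(agents, sectors, assignment=None):
    """Award each sector to the lowest bidder (ties broken by agent name)."""
    assignment = dict(assignment or {})
    load = {name: 0.0 for name in agents}
    for winner in assignment.values():
        if winner in load:
            load[winner] += 1.0
    for sector in sectors:
        winner = min(
            agents,
            key=lambda name: (bid(agents[name], sector, load[name]), name),
        )
        assignment[sector] = winner
        load[winner] += 1.0
    return assignment


def replan_after_failure(agents, assignment, failed):
    """Re-auction the failed agent's sectors among the surviving agents."""
    survivors = {name: pos for name, pos in agents.items() if name != failed}
    kept = {s: w for s, w in assignment.items() if w != failed}
    orphaned = [s for s, w in assignment.items() if w == failed]
    return allocate(survivors, orphaned, kept)


agents = {"uav_a": (0.0, 0.0), "uav_b": (7.0, 7.0)}
sectors = [(1.0, 1.0), (6.0, 6.0)]
plan = allocate(agents, sectors)
recovered = replan_after_failure(agents, plan, failed="uav_b")
print(plan)
print(recovered)
```

Only the orphaned sectors are re-auctioned, so surviving agents keep their existing work; this is one simple way to keep replanning latency small relative to the mission.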

2606.01955 2026-06-02 cs.RO cs.CV 版本更新

WALL-WM: Carving World Action Modeling at the Event Joints

WALL-WM:在事件关节处雕刻世界动作建模

Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

发表机构 * X Square Robot Team(X Square机器人团队)

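AI summary: Proposes WALL-WM, a World Action Model that uses event-level vision-language-action pretraining to resolve the granularity mismatch that fixed-length action chunks create among language, vision, and action; it generalizes across language, scenes, and tasks and reaches state-of-the-art performance in large-scale real-world evaluation.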

详情
AI中文摘要

WALL-WM是一种世界动作模型,它将视频-动作学习从以块为中心的优化转变为以事件为基础的视觉-语言-动作预训练,使用语义连贯的动作事件作为学习的基本单元。现有的WAM通常从多模态或视频基础模型初始化,然后直接基于当前观测和指令优化固定长度的动作块。尽管方便,但这种以块为中心的公式造成了基本的粒度不匹配。语言描述语义目标和事件,视觉通过连续场景动态演变,动作在控制级时间尺度上运行;将三者强制纳入相同的固定长度预测窗口,使得VLA训练变成短视的相关性拟合。WALL-WM通过围绕语义事件组织监督和数据来解决这种不匹配。具体来说,它将基于事件的VLA预训练与由事件级标题和聚类平衡采样构建的数据生态系统配对,从而实现对多样化行为、场景和任务结构的可扩展学习。从相同的事件预训练骨干出发,WALL-WM支持两种互补的推理模式。事件模式消耗下一事件描述并实现可变长度的执行块,而统一模式使用带有阶梯式解码的VLM来调节传统的固定长度块推理,同时保留梯度连续的VLA路径。结合基于Muon优化器的大规模预训练基础设施,WALL-WM为通用WAM提供了实用的规模化方案。实验表明,WALL-WM在语言、场景和任务上广泛泛化,在大规模真实世界泛化评估中达到了最先进的性能。

英文摘要

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

2606.01951 2026-06-02 cs.RO 版本更新

Co-training with Ego-centric Video and Demonstration for Robot Navigation Task

基于自我中心视频与示范的机器人导航任务协同训练

Shoya Kuno, Yumo Ouchi, Kanata Suzuki

发表机构 * Department of Informatics, Graduate School of Informatics, Kyoto University(信息学系,京都大学研究生院) Spatial Robotics Research Center, Fujitsu Limited(空间机器人研究中心,富士通有限公司)

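AI summary: Proposes a framework that converts egocentric walking videos into imitation-learning datasets for mobile robots, and jointly trains a VLA model on human-derived and robot-collected data to improve language understanding and action generation.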

详情
AI中文摘要

视觉-语言-动作(VLA)模型在多种机器人任务中展现出潜力,但其性能严重依赖于大规模高质量训练数据,而在真实机器人上收集这些数据成本高昂且耗时。虽然先前的工作已经探索了利用自我中心人类视频来增强操作数据集,但由于运动过程中的视角变化,将此类方法应用于移动机器人导航仍然具有挑战性。在本文中,我们提出了一个框架,将自我中心行走视频转化为移动机器人模仿学习的数据集。该方法从人类视频中估计相机运动,并将其转换为与地面移动机器人兼容的动作表示。通过联合训练基于人类数据和机器人收集数据的VLA模型,该模型在语言理解和鲁棒动作生成方面比单独使用任一数据源训练取得了更好的性能。在水果搜索导航任务上的实验表明,人类自我中心视频为移动机器人学习提供了有效且可扩展的数据源。

英文摘要

Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to viewpoint changes during locomotion. In this paper, we propose a framework that converts egocentric walking videos into datasets for mobile robot imitation learning. The proposed method estimates camera motion from human videos and transforms it into action representations compatible with ground mobile robots. By jointly training a VLA model on human-derived and robot-collected datasets, the model achieves improved language understanding and more robust action generation than training with either data source alone. Experiments on a fruit-search navigation task demonstrate that human egocentric videos provide an effective and scalable data source for mobile robot learning.
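One way to turn estimated camera motion into ground-robot actions is to project consecutive planar poses onto a unicycle model. The sketch below assumes the camera trajectory has already been estimated and flattened to `(x, y, yaw)` poses on the ground plane; the unicycle action space and the function name are assumptions, since the abstract does not state which action representation the authors use.

```python
import math


def poses_to_unicycle_actions(poses, dt):
    """Convert planar poses (x, y, yaw) sampled every dt seconds into
    (linear velocity, angular velocity) commands for a ground robot."""
    actions = []
    for (x0, y0, yaw0), (x1, y1, yaw1) in zip(poses, poses[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Keep only the displacement along the heading at the start pose;
        # sideways motion (e.g. body sway while walking) is discarded.
        forward = math.cos(yaw0) * dx + math.sin(yaw0) * dy
        # Wrap the heading change into (-pi, pi].
        dyaw = math.atan2(math.sin(yaw1 - yaw0), math.cos(yaw1 - yaw0))
        actions.append((forward / dt, dyaw / dt))
    return actions


walk = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, math.pi / 2)]
for v, w in poses_to_unicycle_actions(walk, dt=0.5):
    print(round(v, 3), round(w, 3))
```

Discarding the lateral component is a deliberate choice here: a differential-drive base cannot execute sideways motion, so only the part of the human trajectory the robot could reproduce is kept.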

2606.01950 2026-06-02 cs.RO cs.CV cs.LG 版本更新

Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

学习面向刚性物体的动作条件、以物体为中心的高斯溅射世界模型

Jens U. Kreber, Lukas Mack, Joerg Stueckler

发表机构 * Intelligent Perception in Technical Systems Group(技术系统智能感知组)

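AI summary: Proposes MRO-GWM, which uses object-centric Gaussian representations and a spatio-temporal transformer to learn action-conditional 3D dynamics of rigid objects, supporting future-motion prediction in multi-object scenes and under partial observations.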

详情
AI中文摘要

世界模型使智能体能够预测其动作对环境的影响。在本文中,我们提出了多刚性物体高斯世界模型(MRO-GWM),一种学习刚性物体在3D中动作条件动力学的新模型。通过用对象中心高斯表示场景,我们可以表示任意物体形状和多物体场景。我们开发了一种新颖的时空变换器架构,该架构根据物体高斯的历史和未来动作预测未来的刚体运动。物体通过其在规范坐标系中的高斯表示,从而可以将物体运动描述为刚体变换。我们的模型在多视角重建上进行训练,这要求模型处理因遮挡导致的物体部分观测。我们分析了该方法在由典型家庭物体组成的合成数据集上的预测性能,这些数据集包含多物体动力学和机器人末端执行器的交互。我们还在模拟中评估了模型在非抓取操作中的模型预测控制性能。

英文摘要

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene with object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as a rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze the prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions with a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.

2606.01946 2026-06-02 cs.RO 版本更新

Closed-Form Pose Estimation of Endoluminal Medical Devices via Gradiometer-Based Electromagnetic Localization System

基于梯度计的电磁定位系统实现腔内医疗器械的闭式位姿估计

Zhiwei Wu, Jiahao Luo, Yubo Pu, Siyi Wei, Yuankai Chen, Jinhui Zhang

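AI summary: Proposes GELS, a gradiometer-based electromagnetic localization system in which a compact magnetometer array acts as a quasi-gradiometer to estimate local magnetic fields and gradient tensors; these are mapped to displacements by the Euler homogeneous relation, and multi-source Procrustes registration then yields a closed-form pose estimate without pre-calibrated field maps or iterative optimization.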

详情
AI中文摘要

嵌入式磁跟踪对于腔内医疗器械的远程导航具有极具吸引力的前景。然而,现有的六自由度位姿恢复方法通常需要预校准的工作空间场图或迭代非线性优化。本文提出了一种基于梯度计的电磁定位系统(GELS),这是一种闭式跟踪框架,使用紧凑型磁力计阵列作为嵌入式准梯度计来估计局部磁场和梯度张量。这些量通过欧拉齐次关系映射为源与阵列之间的位移,随后利用至少三个非共线源的多源Procrustes配准恢复阵列的方向和位置。该算法需要已知的源位置和阵列几何结构,但无需预校准的工作空间场图、初始位姿猜测或校准的激励源矩。恢复的位姿还可作为移动磁参考框架,实现概念验证的子级偶极子定位任务。跨传感器阵列配置和激励模式的台架实验显示,序列平均位置误差为10.80 mm至15.57 mm,最快更新率为14.49 Hz,中位求解器运行时间为172.00 µs。基于扰动的误差传播分析进一步确定了传感器间不一致性和偶极子模型失配是主要的精度限制因素,从而为未来进一步减少位姿估计误差的传感器阵列和磁源设计提供指导。

英文摘要

Embedded magnetic tracking holds highly attractive prospects for remote navigation of endoluminal medical devices. However, existing six-degree-of-freedom pose recovery approaches often require pre-calibrated workspace field maps or iterative nonlinear optimization. This letter presents a Gradiometer-Based Electromagnetic Localization System (GELS), a closed-form tracking framework that uses a compact magnetometer array as an embedded quasi-gradiometer to estimate local magnetic fields and gradient tensors. These quantities are mapped by the Euler homogeneous relation to displacements between source and array, from which multi-source Procrustes registration recovers the array orientation and position using at least three non-collinear sources. The algorithm requires known source positions and array geometry, but no pre-calibrated workspace field maps, initial pose guesses, or calibrated excitation-source moments. The recovered pose also enables a proof-of-concept sub-level dipole localization task by serving as a mobile magnetic reference frame. Benchtop experiments across sensor-array configurations and excitation modes demonstrate sequence-averaged position errors of 10.80 mm to 15.57 mm, a fastest update rate of 14.49 Hz, and a median solver runtime of 172.00 µs. A perturbation-based error propagation analysis further identifies inter-sensor inconsistency and dipole-model mismatch as the dominant accuracy limits, thereby informing future sensor array and magnetic source design for further reducing pose-estimation error.
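The Euler homogeneous relation mentioned above follows from the fact that a point-dipole field is homogeneous of degree -3 in the displacement r, so the gradient tensor G (with entries dB_i/dr_j) satisfies G r = -3 B, and r = -3 G^{-1} B wherever G is invertible. The sketch below checks this numerically for a single ideal dipole; the finite-difference gradient is a stand-in for the magnetometer-array estimate, the constant prefactor is dropped, and all names are assumptions. The multi-source Procrustes step is not covered.

```python
def dipole_field(r, m):
    """Field of a point dipole with moment m at displacement r (prefactor omitted)."""
    r2 = r[0] * r[0] + r[1] * r[1] + r[2] * r[2]
    r5 = r2 ** 2.5
    m_dot_r = m[0] * r[0] + m[1] * r[1] + m[2] * r[2]
    return [3.0 * r[i] * m_dot_r / r5 - m[i] * r2 / r5 for i in range(3)]


def gradient_tensor(r, m, h=1e-5):
    """G[i][j] = dB_i/dr_j by central differences."""
    G = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        plus, minus = list(r), list(r)
        plus[j] += h
        minus[j] -= h
        b_plus, b_minus = dipole_field(plus, m), dipole_field(minus, m)
        for i in range(3):
            G[i][j] = (b_plus[i] - b_minus[i]) / (2.0 * h)
    return G


def det3(a):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def solve3(a, b):
    """Solve the 3x3 system a x = b with Cramer's rule."""
    d = det3(a)
    solution = []
    for k in range(3):
        replaced = [[b[i] if j == k else a[i][j] for j in range(3)] for i in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def displacement_from_field(B, G):
    """Closed-form displacement from Euler's relation G r = -3 B."""
    return solve3(G, [-3.0 * b for b in B])


r_true = [0.10, -0.05, 0.20]   # metres, source to sensor (illustrative values)
moment = [0.0, 0.0, 1.0]       # arbitrary dipole moment
B = dipole_field(r_true, moment)
G = gradient_tensor(r_true, moment)
print(displacement_from_field(B, G))
```

Scaling the moment scales B and G by the same factor, so the recovered displacement does not depend on the moment's magnitude, which is consistent with the abstract's claim that calibrated excitation-source moments are not needed. For an ideal dipole the tensor becomes singular when r is perpendicular to the moment, so such geometries would need separate handling.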

2606.01865 2026-06-02 cs.RO 版本更新

Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections

集合监督扩散策略:通过修正学习动作分块扩散

Zhaoting Li, Gang Chen, Javier Alonso-Mora, Cosimo Della Santina, Jens Kober

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

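AI summary: Proposes Set-Supervised Diffusion Policy (SDP), which uses contrastive action-chunk data from human corrections to construct a set of desired action chunks for training diffusion policies, mitigating distributional shift and improving robustness.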

详情
AI中文摘要

扩散策略最近已成为机器人操作的一个强大框架。然而,与其他行为克隆方法一样,它仍然容易受到分布偏移的影响,通常需要人在回路中进行干预以纠正部署过程中的失败。这些交互自然提供了成对监督,形式为机器人的不期望动作和人类教师的纠正动作。然而,现有的数据聚合流程和标准行为克隆损失在很大程度上忽略了来自不期望动作的负面信号,导致对教师动作的过拟合以及对昂贵专家数据的日益依赖。为了解决这一限制,我们提出了集合监督扩散策略(SDP),这是一种新颖的学习框架,利用对比动作分块数据从人类修正中训练扩散策略。从配对的正负动作分块中,SDP构建了一组期望的动作分块,并设计了一个训练流程,鼓励扩散策略与该集合对齐。通过在多个机器人操作任务上的大量实验,我们证明了SDP持续提高了策略性能,在对噪声数据的鲁棒性方面尤其显著。此外,SDP生成了高质量的聚合数据集,使得从人在回路修正中进行更高效、更可靠的策略学习成为可能。我们的代码可在 https://set-supervised-diffusion-policy.github.io/ 获取。

英文摘要

Diffusion policies have recently emerged as a powerful framework for robotic manipulation. However, like other behavior cloning methods, they remain vulnerable to distributional shift, often requiring human-in-the-loop interventions to correct failures during deployment. These interactions naturally provide paired supervision in the form of the robot's undesired actions and the human teacher's corrective actions. Yet existing data aggregation pipelines and standard behavior cloning losses largely ignore this negative signal from undesired actions, leading to overfitting to the teacher's actions and an increasing reliance on costly expert data. To address this limitation, we propose Set-Supervised Diffusion Policy (SDP), a novel learning framework that utilizes contrastive action-chunk data to train diffusion policies from human corrections. From paired positive and negative action-chunks, SDP constructs a set of desired action-chunks and designs a training pipeline that encourages the diffusion policy to align with the set. Through extensive experiments across multiple robotic manipulation tasks, we demonstrate that SDP consistently improves policy performance, with particularly strong gains in robustness to noisy data. Moreover, SDP yields high-quality aggregated datasets, enabling more efficient and reliable policy learning from human-in-the-loop corrections. Our code is available at https://set-supervised-diffusion-policy.github.io/.

2606.01847 2026-06-02 cs.RO cs.LG 版本更新

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

我们说的谎言:通过切空间上的分数匹配纠正视觉-语言-动作策略中的欧几里得谬误

Bing-Cheng Chuang, I-Hsuan Chu, Bor-Jiun Lin, YuanFu Yang, Min Sun, Chun-Yi Lee

发表机构 * National Taiwan University(台湾大学)

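AI summary: Targets the Euclidean Fallacy, in which diffusion-based vision-language-action policies represent SE(3) poses as flat R^12 vectors. The proposed Lie Diffuser Actor (LDA) injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map, which eliminates manifold drift by construction and guarantees coordinate-frame equivariance and geodesic optimality; on CALVIN ABC→D it raises average task length from 3.27 to 3.51.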

Comments ICML 2026 Accepted

详情
AI中文摘要

基于扩散的视觉-语言-动作策略在机器人操作中取得了显著成功,但犯了一个我们称之为$\textbf{欧几里得谬误}$的基本几何错误:将SE(3)位姿表示为平坦的$\mathbb{R}^{12}$向量。这种近似导致(1)违反SO(3)约束的流形漂移,(2)坐标变换下等变性的破坏,以及(3)具有过高运动学代价的非测地线轨迹。我们提出$\textbf{Lie Diffuser Actor (LDA)}$,一个本质上在SE(3)上运行的扩散框架。我们的方法通过左不变SDE注入噪声,在切空间中预测分数,并通过指数映射回缩样本。这种表述通过构造消除了流形漂移,同时保证了坐标框架等变性和测地线最优性。在CALVIN ABC$\rightarrow$D上,LDA将平均任务长度从$3.27$提升到$3.51$($+7.3\%$)。我们进一步在真实机器人上验证了该方法,结果表明我们的方法在大多数任务上优于基线。

英文摘要

Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce $\textbf{Lie Diffuser Actor (LDA)}$, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.27$ to $3.51$ ($+7.3\%$). We further validate our method on a real robot, where it outperforms the baseline on the majority of tasks.
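The manifold-drift point can be reproduced on the rotation part alone. The sketch below perturbs a rotation matrix in two ways: by adding Gaussian noise to its nine entries, and by sampling noise in the tangent space and retracting it with the SO(3) exponential map (Rodrigues' formula). It covers only SO(3), a single noise step, and no score network, so it is an illustration of the geometry under those assumptions and not the LDA method itself; all function names are hypothetical.

```python
import math
import random


def hat(w):
    """Skew-symmetric matrix of a 3-vector (an element of the Lie algebra so(3))."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def so3_exp(w):
    """Exponential map so(3) -> SO(3) via Rodrigues' formula."""
    theta = math.sqrt(sum(x * x for x in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1 - cos t)/t^2
    else:
        a, b = math.sin(theta) / theta, (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def orthogonality_error(R):
    """Largest entry of |R^T R - I|; zero for an exact rotation."""
    return max(abs(sum(R[k][i] * R[k][j] for k in range(3))
                   - (1.0 if i == j else 0.0))
               for i in range(3) for j in range(3))


def perturb_on_manifold(R, sigma, rng):
    """Body-frame noise: sample in the tangent space, retract with exp, right-multiply."""
    eps = [rng.gauss(0.0, sigma) for _ in range(3)]
    return matmul(R, so3_exp(eps))


def perturb_euclidean(R, sigma, rng):
    """Flat-vector noise: add independent Gaussian noise to every matrix entry."""
    return [[R[i][j] + rng.gauss(0.0, sigma) for j in range(3)] for i in range(3)]


rng = random.Random(0)
R = so3_exp([0.3, -0.2, 0.5])
print(orthogonality_error(perturb_on_manifold(R, 0.5, rng)))
print(orthogonality_error(perturb_euclidean(R, 0.5, rng)))
```

The tangent-space sample stays a rotation up to floating-point rounding, while the entry-wise sample leaves SO(3) and would need a projection step afterwards. Right-multiplying by the exponential is the form a body-frame (left-invariant) perturbation takes.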

2606.01824 2026-06-02 cs.RO 版本更新

DisFlow: Scene Flow from Distance Field for Object Pose, Velocity Tracking, and Dynamic Object Reconstruction

DisFlow: 基于距离场的场景流用于物体姿态、速度跟踪和动态物体重建

Lan Wu, Sheila Sutjipto, Jennifer Wakulicz, Teresa Vidal-Calleja

发表机构 * Robotics Institute, University of Technology Sydney(悉尼科技大学机器人研究所) School of Engineering, University of Western Australia(西澳大学工程学院)

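AI summary: Proposes DisFlow, which estimates scene flow from a distance field represented by Gaussian Process Implicit Surfaces, enabling 6DoF dynamic object pose estimation, motion tracking, and surface reconstruction.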

详情
AI中文摘要

我们提出了DisFlow,一种新颖的从距离场进行在线场景流估计的框架,能够实现6自由度动态物体姿态估计、运动跟踪和表面重建。场景由高斯过程隐式曲面(GPIS)表示,表面法线作为导数约束,使得在表面附近能够进行准确的符号距离计算和带不确定性的梯度查询。以此表示为基石,我们从距离场计算场景流,描述表面点如何在连续帧中随时间传输。通过我们的流,我们可以通过优雅的闭式优化逐步注册新观测的点云来估计物体的姿态和运动。与先前在相机或世界坐标系中操作的方法不同,我们的方法直接在物体坐标系中进行概率融合,其中物体随时间保持几何一致性。DisFlow方法在空间和时间上的紧密耦合产生了密集几何、表面法线、物体姿态轨迹、速度和不确定性,且均达到实时速率。我们在动态物体序列上评估了DisFlow,并证明它在同时重建高质量物体表面的同时,实现了准确的姿态和运动跟踪。代码公开于https://github.com/LanWu076/disflow_ros2。

英文摘要

We present DisFlow, a novel framework for online scene flow estimation from a distance field that enables 6DoF dynamic object pose estimation, motion tracking, and surface reconstruction. The scene is represented by Gaussian Process Implicit Surfaces (GPIS), with surface normals serving as derivative constraints, enabling accurate signed distance computations near the surface and gradient queries with uncertainty. With this representation as a foundation, we compute a scene flow from the distance field that describes how surface points are transported over time in consecutive frames. Through our flow, we can estimate an object's pose and motion by incrementally registering each newly observed point cloud via an elegant closed-form optimisation. Unlike prior methods that operate in the camera or world frame, our approach performs probabilistic fusion directly in the object frame, where the object remains geometrically consistent over time. The tight coupling of the DisFlow method in space and time yields dense geometry, surface normals, object pose trajectories, velocities, and uncertainty, all at real-time rates. We evaluate DisFlow on dynamic object sequences and demonstrate that it achieves accurate pose and motion tracking while simultaneously reconstructing high-quality object surfaces. Code is publicly available at https://github.com/LanWu076/disflow_ros2.

2606.01777 2026-06-02 cs.RO 版本更新

Trans2Occ: Voxel Occupancy Estimation and Grasp for Transparent Objects from Simulation to Reality

Trans2Occ: 从仿真到现实的透明物体体素占用估计与抓取

Yixuan Yang, Sha Zhang, Rui Li, Zhenfei Yin, Xinzhu Ma, Yiran Qin, Lei Bai, Xudong Xu, Shilin Shan, Wangmeng Zuo, Yanyong Zhang, Wanli Ouyang, Feng Zheng, Shixiang Tang, Dongzhan Zhou

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) SUSTech(南方科技大学) CUHK(香港中文大学) Harbin Institute of Technology(哈尔滨工业大学) University of Oxford(牛津大学) Beihang University(北京航空航天大学) Nanyang Technological University(南洋理工大学) University of Science and Technology of China(中国科学技术大学)

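AI summary: Proposes a voxel occupancy prediction framework that takes single-view RGB input and combines simulated training data with a rule-based grasping strategy for robust 3D perception and manipulation of transparent objects.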

详情
AI中文摘要

透明物体由于折射和反射导致的深度感知不可靠,对机器人感知构成挑战。先前的方法依赖多视图重建或深度补全,但往往难以在真实机器人系统中扩展或部署。本文提出一个基于单视图RGB输入的透明物体感知与操作实用框架。我们的方法直接从单张图像预测体素空间占用,提供支持下游机器人抓取的几何感知表示。为实现大规模训练,我们构建了一个仿真流水线,在不同材质和光照条件下生成配对的RGB图像和体素占用标注。我们证明预测的占用表示对领域偏移具有鲁棒性,并能从仿真有效迁移到真实机器人设置,无需微调。基于占用构建的简单规则抓取策略进一步实现了透明物体的可靠抓取性能。在仿真和真实环境中的大量实验表明,我们的框架提供了准确的3D理解,并实现了透明物体的实用操作。这些结果表明,单视图占用预测为机器人中的透明物体感知提供了一种可扩展且有效的解决方案。

英文摘要

Transparent objects remain challenging for robotic perception due to unreliable depth sensing caused by refraction and reflection. While prior approaches rely on multi-view reconstruction or depth completion, they are often difficult to scale or deploy in real-world robotic systems. In this paper, we present a practical framework for transparent object perception and manipulation based on single-view RGB input. Our approach predicts voxel-space occupancy directly from a single image, providing a geometry-aware representation that supports downstream robotic grasping. To enable large-scale training, we construct a simulation pipeline that generates paired RGB images and voxel occupancy annotations under diverse materials and lighting conditions. We demonstrate that the predicted occupancy representation is robust to domain shifts and transfers effectively from simulation to real-world robotic setups without fine-tuning. A simple rule-based grasping strategy built on top of the occupancy further achieves reliable grasp performance on transparent objects. Extensive experiments in both simulation and real-world environments show that our framework provides accurate 3D understanding and enables practical manipulation of transparent objects. These results suggest that single-view occupancy prediction offers a scalable and effective solution for transparent object perception in robotics.

2606.01734 2026-06-02 cs.CV cs.LG cs.RO 版本更新

FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds

FlatVPR: 用于基础模型特征流形几何校正的即插即用地线性残差适配器

Rai Hisada, Kanji Tanaka

发表机构 * Fundamental Engineering for Knowledge-Based Society, Graduate School of Engineering, University of Fukui(知识社会基础工程,工程研究生院,福井大学)

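AI summary: Proposes FlatVPR, which suppresses feature-manifold curvature with a learnable residual adapter and a Pullback Flatness Loss so that descriptors between sparse anchors can be reconstructed by linear interpolation, significantly improving visual place recognition accuracy on the NCLT dataset.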

Comments 5 pages, 1 figure, technical report

详情
AI中文摘要

本文提出“FlatVPR”,一种新颖的几何校正范式,通过强制特征流形结构,使得两个相邻锚点 $\mathbf{z}_A$ 和 $\mathbf{z}_B$ 之间的任何描述符都可以通过线性插值 $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$(其中 $t \in [0,1]$ 表示相对位置)精确重建,从而有效平衡视觉位置识别(VPR)中地图轻量化和定位精度之间的权衡。尽管最先进的基础模型(如DINOv2-ViT-S/14)提供了鲁棒的语义特征,但其潜在流形表现出显著的曲率,将物理空间中的均匀线性运动投影到特征空间中高度非线性的轨迹上,这阻碍了稀疏锚点条件下的可靠重建。为了实现上述基于插值的重建,我们对原始基础特征 $\mathbf{z}$ 引入残差变换 $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$,其中 $\text{Res}(\cdot)$ 表示可学习的适配器。我们的方法通过数学上严谨的Pullback Flatness Loss显式抑制流形曲率,该损失最小化中间特征与连接相邻锚点的线性段之间的偏差,从而最小化流形的内在曲率。通过这种空间展平,地图构建被公式化为期望最大化(EM)框架,解耦为用于流形适应的连续M步和用于最优锚点选择准则的概念性E步。在NCLT数据集上的实验表明,即使在100米间隔的极端稀疏锚点和极端季节变化条件下,应用我们的适配器也能带来显著的性能提升。

英文摘要

This paper proposes "FlatVPR," a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100 m intervals and extreme seasonal changes.
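The interpolation formula and the idea behind the flatness penalty can be sketched directly. The loss below is the mean squared deviation of intermediate descriptors from the segment between two anchors, which matches the verbal description in the abstract; the exact form of the paper's Pullback Flatness Loss is not given there, so this is an assumed stand-in, and the adapter itself is omitted.

```python
import math


def lerp(z_a, z_b, t):
    """Pseudo-descriptor (1 - t) * z_a + t * z_b at relative position t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def flatness_loss(z_a, z_b, intermediates):
    """Mean squared distance between intermediate descriptors and the anchor segment.

    `intermediates` is a list of (t, z) pairs, where z is the (adapted)
    descriptor observed at relative position t between the two anchors.
    """
    total = 0.0
    for t, z in intermediates:
        target = lerp(z_a, z_b, t)
        total += sum((zi - ti) ** 2 for zi, ti in zip(z, target))
    return total / len(intermediates)


z_a, z_b = [1.0, 0.0], [0.0, 1.0]
ts = [0.25, 0.5, 0.75]

# Curved manifold: descriptors travel along a circular arc between the anchors.
curved = [(t, [math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)]) for t in ts]
# Flat manifold: descriptors already lie on the segment.
flat = [(t, lerp(z_a, z_b, t)) for t in ts]

print(flatness_loss(z_a, z_b, curved))
print(flatness_loss(z_a, z_b, flat))
```

In training, a loss of this shape would be applied to adapter outputs so that the curved case is pushed toward the flat one; the toy two-dimensional descriptors here only illustrate the geometry.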

2606.01713 2026-06-02 cs.RO cs.SY eess.SY math.OC 版本更新

FlipItRight: Stable Pose-Targeted Throw-Flip Across Diverse Objects

FlipItRight: 面向多样物体的稳定姿态目标投掷翻转

Axel Dawne, Shinkyu Park

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

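AI summary: Proposes FlipItRight, which decomposes the task into an object-level planner and a robot-level planner with the release state as an explicit intermediate representation, achieving stable planar pose-targeted throw-flip with a high-DoF manipulator without prior data or learned models, with a 90% success rate over 120 trials.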

详情
AI中文摘要

我们提出了FlipItRight,一个用于高自由度机械臂进行稳定平面姿态目标投掷翻转的框架。该任务被分解为一个物体级规划器,它生成满足期望着陆姿态的候选释放状态,以及一个机器人级规划器,它评估可执行性并构建可行的摆动运动。将释放状态视为显式中间表示,能够实现原则性的候选过滤、释放和预摆动配置的自适应选择,以及结构化的近释放运动设计——特别是在最终摆动阶段保持近似恒定的末端执行器速度,以提高对释放时间不确定性的鲁棒性。我们在一个真实平台上对不同形状、大小和质量的物体进行了验证,在120次试验中达到了90%的成功率。消融研究证实,每个设计选择都对投掷性能有所贡献,并且该框架不需要先验数据或学习模型,能够直接部署到新物体和目标上,无需特定环境的校准或数据收集。

英文摘要

We propose FlipItRight, a framework for stable planar pose-targeted throw-flip with a high-DoF manipulator. The task is decomposed into an object-level planner, which generates candidate release states satisfying the desired landing pose, and a robot-level planner, which evaluates executability and constructs a feasible swing motion. Treating the release state as an explicit intermediate representation enables principled candidate filtering, adaptive selection of release and pre-swing configurations, and structured near-release motion design -- in particular, approximately constant end-effector velocities during the final swing phase to improve robustness to release-timing uncertainty. We validate on a real platform across objects of varying shape, size, and mass, achieving a 90% success rate across 120 trials. Ablation studies confirm that each design choice contributes to throwing performance, and the framework requires no prior data or learned model, enabling direct deployment on new objects and targets without environment-specific calibration or data collection.
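An object-level planner of this kind can be illustrated with drag-free planar ballistics: for a chosen flight time and number of full flips, the release velocity and spin that hit a target landing pose follow in closed form. The model (point-mass flight, constant angular velocity, no contact dynamics at landing) and all names below are assumptions for illustration, not the paper's planner.

```python
import math

GRAVITY = 9.81  # m/s^2


def release_candidates(release_xz, release_angle, landing_xz, landing_angle,
                       flight_times, flips=(0, 1, 2)):
    """Enumerate release states (vx, vz, omega) that reach the landing pose."""
    dx = landing_xz[0] - release_xz[0]
    dz = landing_xz[1] - release_xz[1]
    candidates = []
    for T in flight_times:
        vx = dx / T
        vz = (dz + 0.5 * GRAVITY * T * T) / T
        for k in flips:
            omega = (landing_angle - release_angle + 2.0 * math.pi * k) / T
            candidates.append({"T": T, "flips": k, "vx": vx, "vz": vz, "omega": omega})
    return candidates


def simulate(release_xz, release_angle, candidate):
    """Propagate a candidate through free flight and return the landing pose."""
    T = candidate["T"]
    x = release_xz[0] + candidate["vx"] * T
    z = release_xz[1] + candidate["vz"] * T - 0.5 * GRAVITY * T * T
    angle = release_angle + candidate["omega"] * T
    return x, z, angle


release, target = (0.0, 1.0), (1.2, 0.0)
options = release_candidates(release, 0.0, target, math.pi, flight_times=[0.4, 0.6])
for c in options:
    print(c["T"], c["flips"], round(c["vx"], 2), round(c["vz"], 2), round(c["omega"], 2))
```

A robot-level planner would then discard candidates whose release velocity or spin the arm cannot reach, which mirrors the candidate-filtering role the abstract assigns to the explicit release state.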

2606.01600 2026-06-02 cs.CV cs.CL cs.RO 版本更新

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

RoboTrustBench:机器人操作视频世界模型的可信度基准测试

Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, Bin Zhu

发表机构 * Singapore Management University(新加坡管理大学) Fudan University(复旦大学) Princeton University(普林斯顿大学)

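AI summary: Addresses the trustworthiness of video world models for robotic manipulation with RoboTrustBench, a benchmark covering Normal, Constraint-Sensitive, Counterfactual, and Adversarial scenarios. Using expert-validated instruction-image pairs and a six-dimensional evaluation protocol, it finds that current models fall short in constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression.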

Comments Project: https://huiqiongli.github.io/RoboTrustBench/

详情
AI中文摘要

视频世界模型越来越多地用于机器人操作,然而现有基准大多在有效、可行和安全的指令下评估它们。我们引入了RoboTrustBench,一个用于评估视频世界模型在四种场景下可信度的基准:正常、约束敏感、反事实和对抗。基于真实世界的DROID片段构建,RoboTrustBench包含1,207个专家验证的指令-图像对和一个六维评估协议,包含13个细粒度标准。通过人类和MLLM评估七个代表性的视频世界模型,我们发现当前模型通常生成视觉上连贯的视频,但在约束推理、反事实基础、物理交互和不安全指令抑制方面存在困难。这些结果表明,视觉质量和表面级别的指令遵循不足以实现可信赖的机器人视频世界建模。

英文摘要

Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.

2606.01597 2026-06-02 cs.RO cs.MA 版本更新

Physics-Informed Modeling and Control of Emergent Behaviors in Robot Swarms

机器人群体涌现行为的物理信息建模与控制

Zixuan Jin, Wenzhuo Zhang, Shuxian Quan, Zirui Dong, Fangwen Ye, Yuchen Shi, Cheng Xu

发表机构 * School of Computer and Communication Engineering, University of Science and Technology Beijing(北京科技大学计算机与通信工程学院) Shunde Innovation School, University of Science and Technology Beijing(北京科技大学顺德创新学院)

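AI summary: Proposes PhySwarm, which couples a multi-phase advection-diffusion-reaction macroscopic model and an equivalent deterministic-motion microscopic model with a neural-physics controller to model and control multi-stage emergent behaviors in robot swarms.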

详情
AI中文摘要

机器人群体可以通过局部感知、有限通信和分散决策展现连贯的集体行为,但当行为在多阶段展开时,建模和控制这种涌现仍然具有挑战性。本文介绍了PhySwarm,一个物理信息微观-宏观框架,将多阶段群体涌现表示为受物理约束的密度场演化,并与可执行的机器人运动耦合。在宏观层面,多阶段对流-扩散-反应模型(Macro-ADR)通过定向传输、基于扩散的空间调节和行为阶段转换来描述阶段依赖的群体密度演化。在微观层面,等效确定性运动模型(Micro-EDM)通过势场对流、密度梯度补偿以及速率或事件门控的阶段切换来实现这些机制。神经物理控制器(NPC)将局部观测和时间记忆映射到有界物理参数,并通过强化学习-PINN目标进行训练,该目标结合了任务奖励、宏观尺度密度残差和微观尺度运动一致性约束。在几个概念验证的群体任务中——包括路径引导的觅食、队形可重构的导航和角色自适应的搜索与救援——我们证明了PhySwarm可以在统一的物理信息建模框架内生成不同的多阶段涌现行为。学习到的密度场和物理参数提供了可解释的证据,表明对流、扩散和反应如何共同调节多阶段群体组织。这些结果为学习、解释和控制机器人群体中的涌现行为建立了一条物理信息路径。

英文摘要

Robot swarms can exhibit coherent collective behaviors through local perception, limited communication and decentralized decision-making, yet modeling and controlling such emergence remains challenging when behaviors unfold over multiple phases. Here we introduce PhySwarm, a physics-informed micro-macro framework that represents multi-stage swarm emergence as physically constrained density-field evolution coupled to executable robot motion. At the macroscopic level, a multi-phase advection-diffusion-reaction model (Macro-ADR) describes phase-dependent swarm-density evolution through directed transport, diffusion-based spatial regulation and behavioral phase transitions. At the microscopic level, an equivalent deterministic motion model (Micro-EDM) realizes these mechanisms through potential-field advection, density-gradient compensation and rate- or event-gated phase switching. A neural-physics controller (NPC) maps local observations and temporal memory to bounded physical parameters, and is trained with a combined reinforcement-learning and PINN objective that joins task rewards with macro-scale density residuals and micro-scale motion-consistency constraints. In several proof-of-concept swarm missions (trail-guided foraging, formation-reconfigurable navigation and role-adaptive search and rescue), we demonstrate that PhySwarm can generate distinct multi-stage emergent behaviors within a unified physics-informed modeling framework. The learned density fields and physical parameters provide interpretable evidence of how advection, diffusion and reaction jointly regulate multi-stage swarm organization. These results establish a physics-informed route for learning, interpreting and controlling emergent behaviors in robot swarms.
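
For orientation, a generic multi-phase advection-diffusion-reaction system of the kind the abstract describes can be written as below. This is a textbook form, not the paper's exact Macro-ADR model: the symbols $\rho_k$ (density of robots in behavioral phase $k$), $\mathbf{v}_k$ (transport velocity), $D_k$ (diffusion coefficient) and $r_{jk}$ (switching rate from phase $j$ to phase $k$) are all assumptions introduced here for illustration.

```latex
\frac{\partial \rho_k}{\partial t}
  + \nabla \cdot \left( \rho_k \, \mathbf{v}_k \right)
  = D_k \, \nabla^2 \rho_k
  + \sum_{j \neq k} \left( r_{jk} \, \rho_j - r_{kj} \, \rho_k \right),
  \qquad k = 1, \dots, K .
```

Under these assumptions the reaction terms cancel when summed over $k$, so with no-flux boundaries the total swarm density $\sum_k \rho_k$ is conserved: phase switching only moves robots between phases.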

2606.01565 2026-06-02 cs.RO cs.CV 版本更新

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

层级语义增强导航:面向视觉语言导航的最优传输与图驱动推理

Xiang Fang, Wanlong Fang, Changshuo Wang

发表机构 * School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件学院) Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore(新加坡南洋理工大学交叉学科研究生项目) University College London(伦敦大学学院)

AI总结 提出层级语义增强导航框架,通过动态层级语义场景图、基于最优传输的拓扑规划器与图感知强化学习策略,解决连续环境中的视觉语言导航难题,实现最优性能。

Comments Published in NeurIPS 2025, address some typos

详情
AI中文摘要

连续环境中的视觉语言导航(VLN-CE)对自主智能体构成严峻挑战,要求无缝整合自然语言指令与视觉观察以在复杂3D室内空间导航。现有方法在长程任务中常因场景理解有限、规划效率低下及缺乏稳健决策框架而表现不佳。我们引入层级语义增强导航(HSAN)框架,这是一种开创性方法,通过三项协同创新重新定义VLN-CE。首先,HSAN构建动态层级语义场景图,利用视觉语言模型捕捉从物体到区域再到分区的多级环境表示,实现细粒度空间推理。其次,它采用基于最优传输的拓扑规划器,以Kantorovich对偶为基础,通过平衡语义相关性与空间可达性来选择长期目标,并具有理论最优性保证。第三,图感知强化学习策略确保精确的低层控制,在稳健避障的同时导航至各子目标。通过整合谱图理论、最优传输和先进的多模态学习,HSAN解决了先前工作中静态地图和启发式规划器的缺陷。在多个具有挑战性的VLN-CE数据集上的大量实验表明,HSAN实现了最先进的性能,在导航成功率和泛化到未见环境方面均有显著提升。

英文摘要

Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.
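
To make the optimal-transport idea concrete, here is a minimal sketch of entropic optimal transport solved with Sinkhorn iterations. It is not HSAN's planner: the abstract only says the planner is grounded in Kantorovich's duality, so the Sinkhorn solver, the cost matrix (imagined as semantic mismatch plus travel effort), the weights `a` and `b`, and the goal-selection rule at the end are all illustrative assumptions.

```python
import math

def sinkhorn(cost, a, b, eps=0.2, iters=500):
    """Entropic OT plan between histograms a (rows) and b (columns)."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical data: 2 instruction phrases, 3 candidate graph nodes.
cost = [[0.1, 0.6, 0.9],   # phrase 0 vs. nodes (lower = better match, closer)
        [0.8, 0.2, 0.5]]   # phrase 1 vs. nodes
a = [0.5, 0.5]             # assumed importance of each phrase
b = [0.4, 0.4, 0.2]        # assumed accessibility weight of each node
plan = sinkhorn(cost, a, b)

# Assumed selection rule: node with the lowest average transported cost.
avg_cost = [sum(plan[i][j] * cost[i][j] for i in range(2)) / b[j] for j in range(3)]
goal = min(range(3), key=lambda j: avg_cost[j])
```

The plan's row and column sums reproduce `a` and `b`, which is what lets a single coupling trade off semantic relevance against accessibility instead of ranking nodes by one criterion.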

2606.01545 2026-06-02 cs.RO 版本更新

Hierarchical Object Representation for Spatial Robot Perception: Points, Meshes, and Superquadrics

用于空间机器人感知的层次化物体表示:点云、网格和超二次曲面

Ceng Zhang, Wan Su, Mohamed Samshad, Gregory S. Chirikjian, Rajat Talak

发表机构 * National University of Singapore (NUS)(新加坡国立大学) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出一种层次化物体表示方法,从原始传感器数据逐步抽象为稠密网格和超二次曲面,用于高保真重建、鲁棒重定位和高效碰撞检测,并在室内外场景中验证其有效性。

Comments 18 pages, 5 figures, 4 tables

详情
AI中文摘要

层次化3D场景图(3DSG)已成为一种可操作且可扩展的表示方法,用于融合度量、语义和拓扑信息的长期自主导航。然而,3DSG中物体的几何表示问题一直被忽视,大多数方法使用简化几何模型,如部分点云或3D边界框。本文提出一种层次化物体表示,可用于高保真物体级重建、基于物体的鲁棒重定位或地图对齐,以及密集杂乱环境中安全机器人导航规划的高效解析碰撞检测。该表示结构上分为四个不同层次,从原始传感器数据逐步抽象为稠密3D网格,再到解析基元(如超二次曲面),从而提供物体几何的稀疏解析表示。我们开发了一个流程,从机器人捕获的RGB-D图像流构建层次化物体表示,并在室内外真实开放集物体场景中验证其效果。在包括HOPE、ReplicaCAD、Kimera-Multi以及使用Unitree B2机器人收集的NUS校园数据集等多个数据集上的大量实验,验证了该流程在室内外环境中的有效性。我们展示了基于超二次曲面的地图对齐方法优于当前最先进的基于物体的地图对齐方法ROMAN。代码见https://github.com/perceptica-robotics/Hickory。

英文摘要

Hierarchical 3D Scene Graphs (3DSG) have emerged as an actionable and scalable representation for long-term autonomy incorporating metric, semantic, and topological information in the scene. However, the question of geometric representation of objects in 3DSG has been overlooked, as most methods use simplified geometric models such as partial point clouds or 3D bounding boxes. In this work, we introduce a hierarchical object representation that can be leveraged for high-fidelity object-level reconstruction, object-based robust re-localization or map alignment, and efficient, analytical collision checking for safe robot navigation planning in dense and cluttered environments. The representation is structurally organized into four distinct layers, progressively abstracting the scene from raw sensor data to dense 3D meshes to analytical primitives such as superquadrics, which provide a sparse and analytical representation for object geometry. We develop a pipeline that builds the hierarchical object representation from an RGB-D image stream captured by a robot, and demonstrate that it works in real-world open-set object scenes in both indoor and outdoor environments. Extensive experiments across diverse datasets, including HOPE, ReplicaCAD, Kimera-Multi, and the NUS Campus Dataset collected using a Unitree B2 robot, validate our pipeline in both indoor and outdoor environments. We show that our superquadric-based map alignment method outperforms the current state-of-the-art object-based map alignment method ROMAN. Our code can be found at https://github.com/perceptica-robotics/Hickory.
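
The appeal of superquadrics for analytical collision checking comes from their closed-form inside-outside function. The sketch below uses the standard superquadric definition rather than code from the Hickory repository; the function name is hypothetical, and the object pose is omitted, so the query point is assumed to be already expressed in the superquadric's own frame.

```python
def superquadric_inside_outside(point, scale, exponents):
    """Standard inside-outside function F: F < 1 inside, F == 1 on the surface."""
    x, y, z = point
    a1, a2, a3 = scale          # semi-axis lengths
    e1, e2 = exponents          # shape exponents (1, 1 gives an ellipsoid)
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)

def point_collides(point, scale, exponents):
    """A point collides with the object when it lies inside or on the surface."""
    return superquadric_inside_outside(point, scale, exponents) <= 1.0

# Ellipsoid with semi-axes 1, 2, 3.
on_surface = superquadric_inside_outside((1.0, 0.0, 0.0), (1.0, 2.0, 3.0), (1.0, 1.0))
# Small exponents give a box-like shape, so a corner-ish point is inside.
boxy_hit = point_collides((0.9, 0.9, 0.9), (1.0, 1.0, 1.0), (0.1, 0.1))
sphere_hit = point_collides((0.9, 0.9, 0.9), (1.0, 1.0, 1.0), (1.0, 1.0))
```

Five shape parameters plus a pose cover boxes, cylinders and ellipsoids with a single test, which is far cheaper than querying a dense mesh.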

2606.01526 2026-06-02 cs.RO 版本更新

Spatio-Temporal Reconnection for Multi-Robot Networks using Adaptive Prescribed-Time CBFs

基于自适应预设时间CBF的多机器人网络时空重连

Hao Liu, Yupeng Yang, Yanze Zhang, Wenhao Luo

发表机构 * Department of Computer Science, University of Illinois Chicago(伊利诺伊大学芝加哥分校计算机科学系) Department of Computer Science, University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校计算机科学系)

AI总结 提出自适应预设时间控制屏障函数框架,使多机器人系统能在可调预设时间内断开并重连通信,结合触发机制提升任务效率。

Comments 6 pages, 6 figures, accepted by IFAC 2026

详情
AI中文摘要

在多机器人系统中,维持持续的通信图连接往往过于严格,特别是当机器人通信范围有限但在大环境中运行时。相反,允许机器人暂时断开连接并在之后重新连接,通常更有利于高效执行任务,同时确保团队内及时的信息共享。在本文中,我们提出了一种自适应预设时间控制屏障函数(自适应PT-CBF)框架,使机器人能够在可调且可行的预设时间内暂时断开连接并重新进入通信范围。此外,我们引入了一种重连触发机制,该机制联合考虑任务执行和重连紧迫性,从而提供了一种原则性的方式来决定何时应发生重连。理论分析证明了在预设有限时间内收敛到满足重连的合理性。实验结果验证了我们提出的自适应PT-CBF的性能,具有改进的任务效率和令人满意的重连。

英文摘要

In multi-robot systems, maintaining persistent communication graph connectivity is often overly restrictive, especially when robots have limited communication ranges but operate in large environments. Instead, allowing robots to temporarily disconnect and later reconnect is often more desirable for efficient task execution while still ensuring timely information sharing across the team. In this paper, we propose an adaptive prescribed-time control barrier function (adaptive PT-CBF) framework that enables robots to temporarily disconnect and re-enter the communication range within an adjustable and feasible prescribed time. Moreover, we introduce a reconnection triggering mechanism that jointly considers task execution and reconnection urgency, thereby providing a principled way to decide when reconnection should occur. Theoretical analysis justifies convergence to the satisfying reconnection within a prescribed finite time. Experimental results validate the performance of our proposed adaptive PT-CBF with improved task efficiency and satisfying reconnections.
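
A rough feel for deadline-driven reconnection can be had from a time-varying barrier whose allowed separation shrinks to the communication range by the prescribed time `T`. This is a simplified stand-in, not the paper's adaptive PT-CBF: the single-integrator robot, the fixed neighbour at the origin, the linear shrink schedule and all parameter values are assumptions made for illustration.

```python
import math

def radius(t, r0, r_comm, T):
    """Allowed separation: shrinks linearly from r0 to r_comm by time T."""
    return r0 + (r_comm - r0) * min(t / T, 1.0)

def radius_rate(t, r0, r_comm, T):
    return (r_comm - r0) / T if t < T else 0.0

def safe_velocity(p, u_nom, t, r0, r_comm, T, alpha=1.0):
    """Minimally change u_nom so h(p, t) = R(t)^2 - |p|^2 stays non-negative."""
    R = radius(t, r0, r_comm, T)
    h = R * R - (p[0] ** 2 + p[1] ** 2)
    grad = (-2.0 * p[0], -2.0 * p[1])                  # dh/dp
    b = 2.0 * R * radius_rate(t, r0, r_comm, T) + alpha * h
    slack = grad[0] * u_nom[0] + grad[1] * u_nom[1] + b
    if slack >= 0.0:
        return u_nom                                   # nominal task velocity is fine
    k = slack / (grad[0] ** 2 + grad[1] ** 2)
    # Closed-form projection onto the half-space grad . u + b >= 0.
    return (u_nom[0] - k * grad[0], u_nom[1] - k * grad[1])

def simulate(p0, u_nom, r0=10.0, r_comm=5.0, T=10.0, dt=0.01):
    """Euler-integrate the relative position p; return the separation at time T."""
    p = p0
    for k in range(int(round(T / dt))):
        u = safe_velocity(p, u_nom, k * dt, r0, r_comm, T)
        p = (p[0] + dt * u[0], p[1] + dt * u[1])
    return math.hypot(p[0], p[1])

final_distance = simulate((8.0, 0.0), (1.0, 0.0))     # task pushes the robot away
```

In this toy run the robot drifts away while the barrier is slack, then is pulled back so that its separation is within the 5 m range at `T`. The paper additionally adapts the prescribed time and decides when to trigger reconnection, neither of which is modeled here.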

2606.01487 2026-06-02 math.OC cs.RO cs.SY eess.SY 版本更新

Global Convergence of a Line-Search Filter Differential Dynamic Programming Method

线搜索滤波微分动态规划方法的全局收敛性

Ming Xu, Iman Shames

发表机构 * School of Computer and Communication Sciences, EPFL(洛桑联邦理工学院计算机与通信科学学院) Department of Electrical and Electronic Engineering, University of Melbourne(墨尔本大学电气与电子工程系)

AI总结 本文证明了FilterDDP算法的全局收敛性,该算法扩展了Mayne和Jacobson的离散时间微分动态规划以处理状态和控制上的非线性约束,采用线搜索滤波过程进行步长接受。

详情
AI中文摘要

在本文中,我们建立了FilterDDP算法的全局收敛性质,该算法扩展了Mayne和Jacobson [International Journal of Control, 3 (1966), pp. 85-95] 的离散时间微分动态规划(DDP)算法,使其除动力学之外还能处理状态和控制上的非线性约束。FilterDDP采用线搜索滤波过程进行步长接受。然而,与一般非线性规划设置中应用的阻尼牛顿步不同,试验点的计算涉及应用后向递归和前向模拟。我们遵循Wächter和Biegler [SIAM Journal on Optimization, 16 (2005), pp. 1-31] 的分析,证明对于一类受约束的最优控制问题,就建立线搜索滤波方法的全局收敛性而言,这种后向-前向过程满足与牛顿步相同的性质,从而建立了FilterDDP的全局收敛性。

英文摘要

In this article, we establish the global convergence properties of the FilterDDP algorithm, which extends the discrete-time differential dynamic programming (DDP) algorithm of Mayne and Jacobson [International Journal of Control, 3 (1966), pp. 85-95] to handle nonlinear constraints over states and controls, in addition to the dynamics. FilterDDP adopts a line-search filter procedure for step acceptance. However, instead of a damped Newton step applied in the general nonlinear programming setting, the computation of a trial point involves applying a backward recursion and a forward simulation. We establish the global convergence of FilterDDP by showing that, for a subset of constrained optimal control problems, this backward-forward procedure satisfies the same properties as a Newton step for the purpose of establishing global convergence of a line-search filter method, following the analysis of Wächter and Biegler [SIAM Journal on Optimization, 16 (2005), pp. 1-31].
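
The step-acceptance idea can be sketched as follows: a filter stores (constraint violation, objective) pairs, and a trial point passes if it sufficiently improves on every stored pair in at least one of the two measures. This is a simplified illustration in the spirit of Wächter and Biegler's line-search filter, not the FilterDDP implementation. It omits the switching and Armijo conditions, the margins `gamma_theta` and `gamma_phi` are assumed values, and `toy_trial` is a hypothetical stand-in for the backward recursion plus forward simulation.

```python
def acceptable(filter_entries, theta, phi, gamma_theta=1e-5, gamma_phi=1e-5):
    """True if (theta, phi) improves on every filter entry in violation or objective."""
    return all(
        theta <= (1.0 - gamma_theta) * th_j or phi <= ph_j - gamma_phi * th_j
        for th_j, ph_j in filter_entries
    )

def filter_line_search(trial, filter_entries, alpha_min=1e-8, shrink=0.5):
    """Backtrack on the step size until trial(alpha) = (theta, phi) passes the filter."""
    alpha = 1.0
    while alpha >= alpha_min:
        theta, phi = trial(alpha)
        if acceptable(filter_entries, theta, phi):
            return alpha, theta, phi
        alpha *= shrink
    return None  # a full method would fall back to a restoration phase here

def toy_trial(alpha):
    """Hypothetical rollout: the full step overshoots and increases the violation."""
    theta = 1.0 - alpha + 1.5 * alpha * alpha   # constraint violation
    phi = alpha                                 # objective value
    return theta, phi

current_filter = [(1.0, 0.0)]                   # (violation, objective) at the current iterate
result = filter_line_search(toy_trial, current_filter)
```

Here the full step is rejected because it worsens both measures, and the halved step is accepted because it reduces the violation. No penalty parameter has to be tuned, which is the usual motivation for filter methods.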

2606.01458 2026-06-02 cs.RO 版本更新

LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World

LEGS: 在具身高斯泼溅世界中免遥操作微调VLA用于人形机器人全身操控

Hojune Kim, Timothy Chen, Jiankai Sun, Lars W. Osterberg, Qianzhong Chen, Ke Wang, Mac Schwager

发表机构 * Stanford University(斯坦福大学)

AI总结 提出LEGS混合模拟器,通过程序化运动基元生成器和两阶段颜色校准,无需遥操作即可合成训练数据,使VLA策略在真实人形机器人操控任务中达到或超越遥操作训练效果。

Comments https://legsvla.github.io/

详情
AI中文摘要

训练用于人形机器人全身操控的视觉-语言-动作(VLA)策略受到收集人类遥操作演示的高成本和复杂性的限制。迄今为止,在模拟器中微调的VLA策略未能有效迁移到人形机器人全身操控任务中。我们提出LEGS(通过具身高斯泼溅实现全身操控),一种混合模拟器,将网格前景(机器人、物体、道具)合成到从手持场景捕获重建的光照真实3D高斯泼溅(3DGS)背景上。LEGS使用程序化运动基元生成器在无需人类遥操作的情况下大规模合成带标签的演示,并通过确定性两阶段颜色校准将渲染的3DGS图像对齐到机器人的部署相机。在Unitree G1人形机器人上,跨三个全身难度递增的抓取放置任务和三个VLA骨干网络(psi_0, pi_0.5, GR00T N1.6),仅使用LEGS数据训练的策略在每个实验中都匹配或超越了使用人类遥操作演示训练的策略。它还优于消融了3DGS背景效果的纯网格模拟基线,表明光照真实渲染是合成数据迁移的关键因素。LEGS中的人形运动独立于场景外观记录,使得相同的自动生成演示可以在新背景和物体网格下重新渲染——覆盖新场景的成本比遥操作低15倍以上——从而增强训练数据对场景变化的鲁棒性。在物体和场景外观联合偏移下,使用重新渲染的LEGS-AUG数据训练的策略保持任务成功,而使用遥操作数据训练的基线完全失败。我们的项目页面位于https://legsvla.github.io/。

英文摘要

Training vision-language-action (VLA) policies for humanoid loco-manipulation is constrained by the high cost and complexity of collecting human teleoperation demonstrations. VLA policies fine-tuned in simulators have, until now, failed to transfer effectively in humanoid loco-manipulation tasks. We present LEGS (Loco-manipulation via Embodied Gaussian Splatting), a hybrid simulator that composites a mesh foreground (robot, objects, props) over a photorealistic 3D Gaussian Splatting (3DGS) background reconstructed from a handheld scene capture. LEGS uses a procedural motion-primitive generator to synthesize labeled demonstrations at scale without human teleoperation, and a deterministic two-stage color calibration to align the rendered 3DGS image to the robot's deployment camera. On a Unitree G1 humanoid robot, across three pick-and-place tasks of increasing whole-body difficulty and three VLA backbones (psi_0, pi_0.5, GR00T N1.6), a policy trained purely on LEGS data matches or exceeds one trained on human teleoperation demos in every experiment. It also outperforms a mesh-only simulation baseline that ablates the effect of the 3DGS background, showing that photorealistic rendering is a key enabler for synthetic data transfer. Humanoid motion is recorded independently of scene appearance in LEGS, allowing the same auto-generated demonstrations to be re-rendered under new backgrounds and object meshes (covering a new scene at more than 15x lower cost than teleoperation) to augment training data for robustness to scene variations. Under combined object-and-scene appearance shift, the policy trained on re-rendered LEGS-AUG data maintains task success while the baseline trained on teleoperation data fails entirely. Our project page is located at https://legsvla.github.io/.

2606.01398 2026-06-02 cs.RO 版本更新

A Sonar-Visual Dataset for Cross-Modal Underwater Robot Perception

用于跨模态水下机器人感知的声纳-视觉数据集

Weitung Chen, Phil Tinn, Per Gunnar Auran, Martin Ludvigsen, Peter Halland Haro

发表机构 * Massachusetts Institute of Technology(麻省理工学院) SINTEF(挪威科技工业研究院) Norwegian University of Science and Technology(挪威科技大学)

AI总结 提出SOVIS数据集,包含76,000多对声纳-视觉帧,通过端到端管道同步和清洗数据,并利用交互式标注工具加速标注,在跨模态鱼类检测任务中实现mAP@0.10提升7倍。

Comments 6 pages, 7 figures, 3 tables. Accepted to IEEE ICRA 2026 S2S Workshop (From Sea to Space: Advancing Perception in Harsh Domains)

详情
AI中文摘要

水下机器人通常同时使用相机和声纳进行感知,以利用视觉丰富的语义细节和声学稳健的距离测量。然而,由于缺乏声纳-视觉配对数据集,通过跨模态预测学习这些模态之间的映射仍然探索不足。我们提出了SOVIS,一个用于跨模态水下感知的声纳-视觉数据集。SOVIS包含在特隆赫姆峡湾六个地点17次潜水中收集的超过76,000对配对帧,并得到端到端管道的支持,该管道清洁和同步跨模态传感器数据。我们还引入了一个交互式标注工具,旨在加速配对数据的标注过程。最后,我们使用一小部分标注数据展示了一个概念验证的跨模态鱼类检测任务,与单目相机基线相比,mAP@0.10提高了7倍。SOVIS是推进跨模态水下感知研究的第一步,支持从单目图像进行密集声纳预测等研究方向。

英文摘要

Underwater robots typically use both cameras and sonar for perception to leverage the rich semantic details of vision and the robust range measurements of acoustics. However, learning to map between these modalities via cross-modal prediction remains underexplored due to limited sonar-visual paired datasets. We present SOVIS, a sonar-visual dataset for cross-modal underwater perception. SOVIS comprises over 76,000 paired frames collected across 17 dives at six sites in the Trondheimfjord, supported by an end-to-end pipeline that cleans and synchronizes the cross-modal sensor data. We also introduce an interactive annotation tool designed to accelerate the labeling process for this paired data. Finally, we demonstrate a proof-of-concept cross-modal fish detection task using a small subset of labeled data, achieving a 7x improvement in mAP@0.10 over a monocular camera baseline. SOVIS serves as the first step toward advancing cross-modal underwater perception research, enabling research directions such as dense sonar prediction from monocular images.

2606.01397 2026-06-02 cs.RO cs.LG cs.SY eess.SY 版本更新

Autopilot-Preserving Residual Q-Learning with HJB-Inspired Finite-Action Risk Filtering for Fixed-Wing UAV Command Supervision

基于HJB启发有限动作风险滤波的保持自动驾驶仪的残差Q学习用于固定翼无人机指令监督

Mehmet Iscan, Batuhan Temiz

发表机构 * PythaLab, Yildiz Technical University, Istanbul, Turkey(土耳其伊斯坦布尔耶尔德兹技术大学PythaLab实验室) Turkish Aerospace (TUSAŞ), Ankara, Turkey(土耳其航空航天(TUSAŞ),安卡拉,土耳其)

AI总结 提出一种保持自动驾驶仪的残差指令监督框架,通过HJB方程启发的半离散值迭代评价器和控制Lyapunov/屏障函数启发的有限动作屏蔽,选择有限有界动作集中的残差,显著降低路径跟踪误差。

Comments 47 pages, 12 figures, 20 tables. Simulation-based study with a code-traceable benchmark, source code and a demonstration video are linked in the paper

详情
AI中文摘要

固定翼无人机必须在风、阵风和湍流下保持空速、高度和航向参考,而这些通道相互耦合,纠正其中一个可能恶化另一个。经典自动驾驶仪能很好地稳定机身,但在强侧风遇到激进转弯时适应能力差,而直接作用于舵面的强化学习(RL)策略将探索风险集中在执行器接口。我们在未改变的自动驾驶仪之上而非其内部放置一个学习型监督器:它从指令空速、高度和航向上的有限有界动作集中选择一个残差;修改后的参考在到达自动驾驶仪之前被投影到允许的指令包络内,自动驾驶仪仍然是唯一面向执行器的控制器。新颖之处在于残差的选择方式。HJB残差方法使用具有Hamilton-Jacobi-Bellman(HJB)方程精神的半离散值迭代评价器对候选动作评分,按相对于无操作的哈密顿优势排序,并通过受控制Lyapunov函数和控制屏障函数启发的有限动作屏蔽进行过滤,该屏蔽始终保留无操作回退。在固定被控对象、自动驾驶仪和执行器模型的共享12状态运行环境上(因此比较是在软件包层面进行的),HJB残差方法将平均均方根路径跟踪误差降低到44.809米,而基线自动驾驶仪为338.617米,表格Q残差为88.809米,相比基线降低86.77%,相比Q学习降低49.54%。增益集中在基线表现最差的区域,但伴随可测量的空速误差上升,因此没有方法在所有指标上占优。我们呈现这种保持自动驾驶仪的残差指令监督设计,并完整报告其权衡的基准测试。

英文摘要

A fixed-wing UAV must hold airspeed, altitude, and heading references under wind, gusts, and turbulence, channels coupled so that correcting one can degrade another. Classical autopilots stabilize the airframe well but adapt poorly when a hard crosswind meets an aggressive turn, while reinforcement-learning (RL) policies acting directly on the surfaces concentrate exploration risk at the actuator interface. We place a learned supervisor above an unchanged autopilot rather than inside it: it selects a residual from a finite, bounded action set on the commanded airspeed, altitude, and heading; the modified reference is projected into an admissible command envelope before reaching the autopilot, which stays the only actuator-facing controller. What is new is how the residual is chosen. HJB residual scores candidates with a semi-discrete value-iteration critic in the spirit of the Hamilton-Jacobi-Bellman (HJB) equation, ranks them by a no-op-relative Hamiltonian advantage, and filters them through a control-Lyapunov- and control-barrier-inspired finite-action shield that always keeps a no-op fallback. On a shared 12-state runtime holding the plant, autopilot, and actuator model fixed, so the comparison is at the package level, HJB residual lowers mean RMS path-tracking error to 44.809 m, against 338.617 m for the baseline autopilot and 88.809 m for a tabular-Q residual, an 86.77% reduction over the baseline and 49.54% over Q-learning. The gain concentrates where the baseline fails worst and comes with a measured rise in airspeed error, so no method dominates every metric. We present this autopilot-preserving residual command-supervision design and benchmark with its trade-offs reported intact.
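
The supervision pattern in this abstract (score a finite set of residuals, discard unsafe ones, keep a no-op fallback, then clip the reference) can be sketched in a few lines. This is not the authors' code. The score table stands in for their HJB-inspired critic and is assumed to be higher-is-better, the safety predicate stands in for their Lyapunov/barrier-inspired shield, and all numbers and names are hypothetical.

```python
NOOP = (0.0, 0.0, 0.0)   # residual on (airspeed, altitude, heading)

def select_residual(scores, actions, is_safe):
    """Pick the safe residual with the largest advantage over the no-op."""
    best, best_adv = NOOP, 0.0
    for action in actions:
        advantage = scores[action] - scores[NOOP]
        if action != NOOP and is_safe(action) and advantage > best_adv:
            best, best_adv = action, advantage
    return best                      # falls back to NOOP if nothing safe is better

def project(reference, residual, lower, upper):
    """Add the residual, then clip each channel into the command envelope."""
    return tuple(
        min(max(r + d, lo), hi)
        for r, d, lo, hi in zip(reference, residual, lower, upper)
    )

actions = [NOOP, (2.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 0.2)]
scores = {NOOP: 1.0, (2.0, 0.0, 0.0): 1.5, (0.0, 10.0, 0.0): 3.0, (0.0, 0.0, 0.2): 0.5}

def altitude_shield(action):
    """Toy shield: reject altitude residuals larger than 5 m."""
    return abs(action[1]) <= 5.0

chosen = select_residual(scores, actions, altitude_shield)
command = project((20.0, 100.0, 1.0), chosen,
                  lower=(12.0, 50.0, -3.14), upper=(21.0, 150.0, 3.14))
```

The highest-scoring residual is rejected by the shield, so the next best safe one is used and the resulting airspeed command is clipped to the envelope. Since the autopilot only ever receives an admissible reference, it stays the sole actuator-facing controller.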

2606.01367 2026-06-02 cs.RO cs.CV 版本更新

ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo

ActMVS:基于单目多视图立体的主动场景重建

Guo Pu, Yixuan Han, Zhouhui Lian

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机研究所)

AI总结 提出ActMVS框架,通过视图因子图构建和全局深度优化,实现单目相机在线生成高质量、全局一致的密集深度图,支持机器人/UAV的主动场景重建与安全轨迹规划。

Comments ICRA 2026

详情
AI中文摘要

主动场景重建使机器人/UAV能够自主规划轨迹并重建环境,无需昂贵的手动数据采集。与被动方法不同,主动重建需要实时构建高置信度占据地图以实现无碰撞导航。现有方法依赖深度传感器更新占据地图,增加了平台成本和重量。为推进空间智能,我们旨在实现纯视觉单目解决方案。然而,当前单目场景重建方法离线运行,无法在机器人/UAV导航所需的帧率下提供全局一致的密集深度。为弥补这一差距,我们引入ActMVS,这是首个单目主动重建框架。我们的框架集成了用于信息多视图立体深度预测的视图因子图构建,以及全局深度优化,从而实现在线生成高质量、全局一致的密集深度图。这使得单目机器人/UAV能够在重建过程中维护可靠的占据地图,以实现安全的轨迹规划。在Replica数据集上的实验表明,其性能与RGB-D方法相当。我们的代码和数据可在https://github.com/TrickyGo/ActMVS获取。

英文摘要

Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robot/UAV navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates a view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on the Replica dataset demonstrate performance competitive with RGB-D methods. Our code and data are available at https://github.com/TrickyGo/ActMVS.

2606.01332 2026-06-02 cs.RO 版本更新

S2M-Trek: From Single to Multi-Sphere Transport via Per-Frame Deep Sets on a Wheel-Legged Robot

S2M-Trek: 从单球到多球运输:基于轮腿机器人的逐帧深度集方法

Zong Chen, Xuebin Li, Jinpeng Xiao, Shaoyang Li, Ben Liu, Min Li, Zhouping Yin, Yiqun Li

发表机构 * School of Mechanical Science and Engineering, Huazhong University of Science and Technology(华中科技大学机械科学与工程学院) School of Mathematics, Harbin Institute of Technology(哈尔滨工业大学数学学院)

AI总结 针对轮腿四足机器人背部同时运输多个自由滚动球体的动态操作问题,提出逐帧深度集(PFDS)编码器,通过逐帧置换不变池化解决历史拼接编码器的置换对称性不匹配,实现五球100%无掉落运输。

详情
AI中文摘要

我们研究了将动态移动操作从单个自由滚动球体扩展到多个球体同时运输的问题,这些球体置于轮腿四足机器人背部,无需围栏、夹具或机械止动器。多个相同的自由滚动球体构成一个无序集合,没有持久身份:它们的顺序可能在每个历史帧中独立变化,产生一种逐帧置换对称性,而标准的历史拼接集合编码器并未显式强制这种对称性,这些编码器仅在整个历史上施加共享的对角置换对称性。我们表明,这种对称性不匹配会在基于课程的强化学习中导致具体的失败模式。在相同的PPO训练预算内,平坦MLP和分支编码器停滞在双球阶段或以下,而历史拼接深度集基线(HCDS)在我们的运行中无法超越双球阶段,除非在训练期间随机化球到槽的分配,这表明它利用槽索引作为课程捷径,而不是学习无身份的多球动力学。我们提出逐帧深度集(PFDS),它在时间读出之前在每个历史帧内执行置换不变池化;我们证明PFDS是$\Gframe$-不变的,并且能普遍逼近连续的$\Gframe$-不变策略。一个$2\times 2$消融实验(编码器架构和槽随机化)分离了架构和数据增强两条路径,PFDS在所有五个随机种子下均达到五球阶段,在模拟中实现100%无掉落运输。我们进一步通过DAgger将PFDS教师蒸馏为TactSet,用$16\times 16$布尔联合接触图替代特权球体状态观测,产生紧凑且自然$\Gframe$-不变的触觉表示。

英文摘要

We study the problem of scaling dynamic loco-manipulation from a single free-rolling sphere to multiple spheres transported simultaneously on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: their ordering may change independently at each history frame, creating a per-frame permutation symmetry that standard history-concatenation set encoders do not explicitly enforce, since these encoders impose only a shared, diagonal permutation symmetry over the full history. We show that this symmetry mismatch leads to a concrete failure mode in curriculum-based reinforcement learning. Within the same PPO training budget, flat MLPs and branch-wise encoders plateau at or below the two-sphere stage, while a history-concatenation Deep Sets baseline (HCDS) fails to progress past the two-sphere stage in our runs unless ball-to-slot assignments are randomised during training, suggesting that it exploits slot indices as a curriculum shortcut rather than learning identity-free multi-sphere dynamics. We propose Per-Frame Deep Sets (PFDS), which performs permutation-invariant pooling within each history frame before temporal readout; we prove that PFDS is $\Gframe$-invariant and universally approximates continuous $\Gframe$-invariant policies. A $2\times 2$ ablation over encoder architecture and slot randomisation separates the architectural and data-augmentation pathways, and PFDS reaches the five-sphere stage with 100% no-drop transport in simulation across all five random seeds. We further distill the PFDS teacher into TactSet via DAgger, replacing privileged sphere-state observations with a $16\times 16$ Boolean union contact map, yielding a compact and naturally $\Gframe$-invariant tactile representation.
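
The per-frame symmetry is easy to demonstrate with a toy encoder that pools within each frame before stacking frames in time. This is only a sketch of the pooling pattern: the feature map `phi` is a fixed toy nonlinearity standing in for a learned network, the 2D ball states are made up, and the temporal readout is omitted.

```python
import math
import random

def phi(ball):
    """Per-ball feature map (toy stand-in for a learned MLP)."""
    x, y = ball
    return (math.tanh(x + 0.5 * y), math.tanh(x * y), x * x + y * y)

def encode_frame(balls):
    """Sum-pool ball features; math.fsum makes the sum exactly order-independent."""
    feats = [phi(b) for b in balls]
    return tuple(math.fsum(f[k] for f in feats) for k in range(3))

def encode_history(history):
    """Concatenate the pooled code of each frame in temporal order."""
    code = []
    for frame in history:
        code.extend(encode_frame(frame))
    return code

history = [
    [(0.1, 0.2), (0.4, -0.3), (-0.5, 0.6)],   # frame t-2: three ball states
    [(0.2, 0.2), (0.3, -0.2), (-0.4, 0.5)],   # frame t-1
    [(0.3, 0.1), (0.2, -0.1), (-0.3, 0.4)],   # frame t
]

# Shuffle every frame independently, as an identity-free detector might.
rng = random.Random(0)
shuffled = [rng.sample(frame, len(frame)) for frame in history]
same_code = encode_history(shuffled) == encode_history(history)
```

Because pooling happens inside each frame, reordering the balls differently in every frame leaves the code unchanged. An encoder that first concatenates each slot's history and then pools is only invariant to one shared reordering across all frames, which is the mismatch the abstract points to.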

2606.01313 2026-06-02 cs.RO cs.AI 版本更新

PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making

PSG-Nav: 通过多元宇宙决策的概率场景图导航

Rufeng Chen, Yue Chang, Xiaqiang Tang, Hechang Chen, Sihong Xie

发表机构 * Tsinghua University(清华大学)

AI总结 提出PSG-Nav方法,通过构建3D概率场景图并利用多元宇宙决策从联合分布中采样最可能的世界设置,以处理开放词汇导航中的感知不确定性,并引入证据经验校准器实现在线终身适应,在多个基准上取得最新最优结果。

Comments 21 pages, 7 figures. ICML 2026

详情
AI中文摘要

开放词汇导航要求具身智能体管理由语义歧义和模型错误引起的显著感知不确定性。然而,大多数现有工作满足于局部最优的确定性方法,剥夺了在多个复合可能性上的复杂导航决策,而这些对于全局更优解至关重要。在本文中,我们提出概率场景图导航(PSG-Nav),它构建了一个3D概率场景图,使用完整的语义类别分布来考虑感知不确定性。为了有效利用局部分布来组合和推理最优导航地标,我们提出多元宇宙决策,从联合分布中采样多个最可能的世界设置,并基于地标与多元宇宙之间的兼容性评估导航地标。为了减轻开放词汇导航中因认知不确定性导致的误报,我们引入证据经验校准器,通过将检测与过去成功和失败的记忆进行交叉验证,实现在线终身适应。在广泛使用的基准MP3D、HM3D和HSSD上的大量实验表明,PSG-Nav建立了新的最先进结果,分别实现了66.1%、44.8%和67.9%的成功率。代码可在https://psg-nav.github.io/获取。

英文摘要

Open-vocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model errors. However, most existing works settle for local optimal deterministic approaches, depriving complex navigation decision-making over multiple composite possibilities that are critical for globally better solutions. In this paper, we propose Probabilistic Scene Graph Navigation (PSG-Nav), which constructs a 3D Probabilistic Scene Graph that uses full semantic categorical distributions to account for perception uncertainty. To efficiently use the local distributions to compose and reason about the optimal navigation landmarks, we propose Multiverse Decision to sample multiple most likely world settings from the joint distribution, and evaluate navigation landmarks based on the compatibility between landmarks and multiverses. To mitigate false positives due to epistemic uncertainty in open-vocabulary navigation, we introduce the Evidential Experience Calibrator, which enables online lifelong adaptation by cross-validating detections against memories of past successes and failures. Extensive experiments on widely-used benchmarks MP3D, HM3D, and HSSD demonstrate that PSG-Nav establishes new state-of-the-art results, achieving Success Rates of 66.1%, 44.8%, and 67.9%, respectively. Code is available at: https://psg-nav.github.io/
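
One way to picture reasoning over several likely worlds is to enumerate joint label assignments for the scene-graph nodes and score landmarks against the most probable ones. The sketch below is an illustrative simplification, not PSG-Nav's algorithm. It assumes node labels are independent, enumerates all worlds exhaustively (exponential in the number of nodes, so only viable for tiny graphs), and uses a hypothetical compatibility score.

```python
import itertools
import math

def top_worlds(node_dists, k):
    """Return the k most probable joint labelings as (assignment, probability)."""
    names = list(node_dists)
    worlds = []
    for combo in itertools.product(*(node_dists[n].items() for n in names)):
        prob = math.prod(p for _, p in combo)
        assignment = {n: label for n, (label, _) in zip(names, combo)}
        worlds.append((assignment, prob))
    worlds.sort(key=lambda w: w[1], reverse=True)
    return worlds[:k]

def landmark_score(landmark, goal, worlds):
    """Share of the retained probability mass in which the landmark is the goal class."""
    total = sum(p for _, p in worlds)
    return sum(p for w, p in worlds if w[landmark] == goal) / total

# Hypothetical per-node categorical distributions from an open-vocabulary detector.
node_dists = {
    "n1": {"sofa": 0.6, "bed": 0.4},
    "n2": {"bed": 0.7, "sofa": 0.3},
}
worlds = top_worlds(node_dists, k=3)
score_n1 = landmark_score("n1", "bed", worlds)
score_n2 = landmark_score("n2", "bed", worlds)
```

Keeping full distributions means the weaker "bed" hypothesis at `n1` still contributes to its score instead of being dropped after an argmax.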

2606.01277 2026-06-02 cs.RO cs.AI cs.CV cs.SY eess.IV eess.SY 版本更新

DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance

DeepIPCv3: 面向突发行人穿越避让的事件感知多模态传感器融合

Oskar Natan, Andi Dharmawan, Aufaclav Zatu Kusuma Frisky, Jazi Eko Istiyanto, Jun Miura

发表机构 * Department of Computer Science and Electronics, Universitas Gadjah Mada(加查马达大学计算机科学与电子系) Department of Computer Science and Engineering, Toyohashi University of Technology(丰桥技术科学大学计算机科学与工程系)

AI总结 提出DeepIPCv3框架,通过Transformer交叉模态注意力融合LiDAR点云与DVS事件流,实现突发行人穿越场景下的高反应性避让,在自定义多模态数据集上达到最优轨迹与控制精度。

详情
AI中文摘要

当前的端到端自动驾驶系统主要依赖基于帧的传感器,这类传感器在高度动态的突发行人穿越场景中存在固有的感知延迟和运动模糊问题。为解决这一关键安全漏洞,我们提出DeepIPCv3,一种新颖的多模态自主导航框架,它将LiDAR点云的密集3D空间几何与动态视觉传感器(DVS)的微秒级异步事件流协同融合。我们引入了一种受Transformer启发的交叉模态注意力机制,以动态关联这些不同模态,使网络能够即时优先处理高速动态更新,同时不牺牲场景结构感知。融合后的潜在表示通过一个混合策略网络映射到安全的局部路径点和可执行控制命令,该网络结合了启发式轨迹跟踪与直接神经预测。由于在真实场景中测试这些突发穿越场景存在严重物理风险,该框架使用在光照良好的正午和具有挑战性的傍晚条件下收集的自定义多模态数据集进行严格离线评估。广泛的对比和消融研究表明,DeepIPCv3达到了最先进的预测性能。通过有效消除曝光失败和运动模糊,所提出的LiDAR与DVS融合实现了最低的轨迹和控制命令误差,使得无论环境光照如何,都能实现高反应性、数学上有界的规避机动。为支持未来研究,我们将代码发布到GitHub仓库:https://github.com/oskarnatan/DeepIPCv3。

英文摘要

Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/DeepIPCv3.

2606.01238 2026-06-02 cs.RO cs.LG 版本更新

Training-Free Imitation Learning with Closed-Form Diffusion Policies

基于闭式扩散策略的免训练模仿学习

Raghav Mishra, Ian R. Manchester

发表机构 * Australian Center for Robotics, ARIAM Hub, and School of Aerospace, Mechanical and Mechatronic Engineering University of Sydney(澳大利亚机器人中心、ARIAM中心和悉尼大学航空航天、机械与机电工程学院)

AI总结 提出一种基于演示数据集闭式得分的无训练扩散策略(CFDP),实现毫秒级实时模仿学习,性能媲美需数小时训练的神经基线,并支持推理时策略编辑与演示增强。

详情
AI中文摘要

尽管基于扩散的策略具有令人印象深刻的性能和表达能力,但其长时间离线训练拖慢了数据收集和策略部署循环。我们引入了闭式扩散策略(CFDP),这是一类使用从演示数据集导出的闭式得分的无训练扩散策略,用于模仿学习。我们在硬件实验中用移动CPU进行实时推理部署CFDP,表明它能够在毫秒级内直接从数据集成功执行模仿,并且推理速度比神经扩散策略更快。在模仿学习基准实验中,我们展示了CFDP与需要数小时训练的神经基线相比具有竞争力,在训练时间和性能之间提供了有利的权衡。最后,我们展示了闭式扩散策略如何作为一种可组合原语,实现对预训练神经扩散策略的数据驱动推理时编辑,包括策略引导和新颖的演示增强。

英文摘要

While diffusion-based policies have impressive performance and expressivity, their long offline training slows down the data collection and policy deployment loop. We introduce Closed-Form Diffusion Policies (CFDP), a class of training-free diffusion-based policies for imitation learning using the closed-form score derived from the demonstration dataset. We deploy CFDP with real-time inference on a mobile CPU in hardware experiments, showing it can successfully perform imitation directly from the dataset in milliseconds and with faster inference than neural diffusion policies. In experiments on imitation learning benchmarks, we show that CFDP is competitive against neural baselines that require hours of training, providing a favorable tradeoff between training time and performance. Finally, we show how closed-form diffusion policies act as a composable primitive that enables data-driven inference-time editing of pre-trained neural diffusion policies, including policy guidance and novel demonstration augmentation.
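
A closed-form score is available whenever the data distribution is taken to be the empirical set of demonstrations smoothed by Gaussian noise: the score is then a softmax-weighted average of directions toward the data points. The sketch below shows that standard identity only. How CFDP conditions on observations, schedules noise levels or structures action chunks is not stated in the abstract and is not modeled, and the function name and toy data are assumptions.

```python
import math

def closed_form_score(x, data, sigma):
    """Score (gradient of log density) of the sigma-smoothed empirical distribution."""
    logits = [
        -sum((xk - dk) ** 2 for xk, dk in zip(x, d)) / (2.0 * sigma ** 2)
        for d in data
    ]
    m = max(logits)                                # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [
        sum(w * (d[k] - x[k]) for w, d in zip(weights, data)) / (z * sigma ** 2)
        for k in range(len(x))
    ]

# With one demonstration the score points straight at it, scaled by 1 / sigma^2.
single = closed_form_score([0.0, 0.0], [[1.0, 2.0]], sigma=0.5)
# Midway between two symmetric demonstrations the pulls cancel.
balanced = closed_form_score([0.0], [[-1.0], [1.0]], sigma=0.5)
```

No parameters are fitted, so changing the dataset changes the policy immediately. The cost is one pass over the demonstrations per score evaluation.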

2606.01170 2026-06-02 cs.MA cs.RO 版本更新

Coordinating Task Switching in a Robotics Multi-Agent System Using Behavior Trees

使用行为树协调机器人多智能体系统中的任务切换

Lucas Haug, Anarosa Alves Franco Brandão, Arthur Casals

发表机构 * LTI - Laboratório de Técnicas Inteligentes, Universidade de São Paulo, SP(圣保罗大学智能技术实验室)

AI总结 本文提出一种基于行为树的方法,用于在IEEE VSSS机器人足球多智能体系统中协调机器人行为,并通过与有限状态机的对比实验及竞赛验证其有效性。

Comments 7 pages, 7 figures. Preprint of a manuscript submitted to the XXVI Congresso Brasileiro de Automática (CBA 2026)

详情
AI中文摘要

多智能体系统在机器人领域的应用是一个极具挑战性的领域。为了促进以游戏为底层领域的策略和机制的研究与开发,人们提出了多项涉及此类系统的竞赛。其中,IEEE Very Small Soccer (VSSS) 类别的竞赛是本文描述的案例研究。在VSSS中,两支各由三个机器人组成的队伍在高度动态的足球比赛环境中竞争。因此,比赛中机器人行为的协调对获胜至关重要。本文提出了一种基于行为树的方法,用于支持圣保罗大学ThundeRatz机器人团队VSSS队伍中的多机器人协调。我们使用FIRASim模拟器将所提出的方法与之前基于有限状态机(FSM)的方法进行了比较。此外,该新策略的性能还在一次学术机器人竞赛中得到了进一步评估。

英文摘要

The application of multi-agent systems in robotics is a very challenging field. Several competitions involving such systems are proposed to foster research and development of strategies and mechanisms using games as the underlying domain. Among them are the ones from the IEEE Very Small Soccer (VSSS) category, which is the case study described in this paper. In VSSS, two teams of three robots each compete in the very dynamic environment of a soccer game. Thus, coordination of the robots' behavior during the game is crucial to winning it. In this paper, we present a Behavior-Tree-based approach to support multi-robot coordination within the VSSS team of the ThundeRatz robotics team from the Universidade de São Paulo. Moreover, a comparison between the proposed approach and the previous one, which was based on a Finite State Machine (FSM), was conducted using the FIRASim simulator. In addition, the performance of this new strategy was further evaluated in an academic robotics competition.
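
For readers unfamiliar with behavior trees, a minimal version of the two classic composite nodes is sketched below. This is generic behavior-tree machinery, not the ThundeRatz implementation. The blackboard keys, the 0.5 distance threshold and the attacker/defender roles are hypothetical.

```python
SUCCESS, FAILURE, RUNNING = "SUCCESS", "FAILURE", "RUNNING"

class Leaf:
    """Wraps a function that reads or writes the blackboard and returns a status."""
    def __init__(self, fn):
        self.fn = fn
    def tick(self, bb):
        return self.fn(bb)

class Sequence:
    """Ticks children in order; stops at the first one that does not succeed."""
    def __init__(self, *children):
        self.children = children
    def tick(self, bb):
        for child in self.children:
            status = child.tick(bb)
            if status != SUCCESS:
                return status
        return SUCCESS

class Fallback:
    """Ticks children in order; stops at the first one that does not fail."""
    def __init__(self, *children):
        self.children = children
    def tick(self, bb):
        for child in self.children:
            status = child.tick(bb)
            if status != FAILURE:
                return status
        return FAILURE

def ball_is_close(bb):
    return SUCCESS if bb["ball_dist"] < 0.5 else FAILURE

def attack(bb):
    bb["role"] = "attacker"
    return RUNNING

def defend(bb):
    bb["role"] = "defender"
    return RUNNING

# "Attack if the ball is close, otherwise defend."
tree = Fallback(Sequence(Leaf(ball_is_close), Leaf(attack)), Leaf(defend))
```

Because the tree is re-ticked from the root every control cycle, a robot switches role as soon as the condition changes, without the explicit transition table a finite state machine would need.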

2606.01169 2026-06-02 math.OC cs.RO 版本更新

Time-Optimal Collision Avoidance Via a Greedy Polynomial Backward Sweep

通过贪婪多项式后向扫描实现时间最优碰撞规避

Zeno Pavanello, Frank De Veld, Roberto Armellin

发表机构 * Department of Aerospace Science and Technology (DAER), Politecnico di Milano(米兰理工大学航空航天科学与技术系) Netherlands Aerospace Centre (NLR)(荷兰航空航天中心)

AI总结 提出一种贪婪时间最优后向扫描方法,通过迭代后向传播机动并选择局部最优推力方向,以确定低推力卫星最晚机动开始时间,实现高效碰撞规避。

详情
AI中文摘要

低推力卫星的航天器碰撞规避通常需要确定不仅如何机动,而且机动可以最晚何时开始同时确保安全。本文提出一种贪婪时间最优(GTO)后向扫描方法,以找到最晚机动开始时间。该方法从标称最接近时间开始,向后迭代传播机动,每一步选择局部最小化所选危险指标的推力方向。使用微分代数有效传播状态灵敏度并在线更新最接近时间。该方法在大量交会数据集上进行了测试,使用安全距离和碰撞概率作为安全指标。该方法实现了准确的结果,相对于最优控制基准只有很小的最优性损失,同时保持了适合机载实现的运行时间。

英文摘要

Spacecraft collision avoidance for low-thrust satellites often requires determining not only how to maneuver, but also how late a maneuver can begin while still ensuring safety. This paper presents a greedy time-optimal (GTO) backward-sweep method to find the latest maneuver initiation time. The method starts from the nominal time of closest approach and iteratively propagates the maneuver backward in time, selecting at each step the thrust direction that locally minimizes the chosen danger metric. Differential algebra is used to efficiently propagate state sensitivities and update the time of closest approach online. The method is tested on a large dataset of conjunctions, using both miss distance and probability of collision as safety metrics. The approach achieves accurate results and only a small loss of optimality relative to an optimal-control benchmark, while retaining runtimes suitable for on-board implementation.

2606.01112 2026-06-02 cs.RO 版本更新

Tether-Aware Dynamic Collision Avoidance for USV-HROV Systems

USV-HROV系统的系缆感知动态避碰

Yang Gu, Ziyang Hong, Xuanlin Chen, Hao Wei, Cheng Wang, Shujie Yang, Yulin Si

发表机构 * Zhejiang University(浙江大学)

AI总结 针对USV跟踪HROV时水下系缆与过往船只刮擦及系缆绷紧风险,提出一种系缆感知的动态避碰方法,通过引入系缆安全感知平面域和系缆绷紧感知速度障碍法,实现安全避碰并降低系缆绷紧可能性。

详情
AI中文摘要

由无人水面艇(USV)和混合遥控潜水器(HROV)组成的异构海洋机器人系统在海底电缆检测中展现出巨大潜力。在此类任务中,USV在水面跟踪HROV,同时通过脐带缆提供电力和通信。然而,USV在跟踪HROV时的动态避碰具有挑战性,因为水下系缆可能刮擦过往船只,而规避机动会增大USV-HROV间距,从而增加系缆绷紧的可能性并影响HROV操作。为解决这些挑战,本文提出了一种用于跟踪HROV的USV的系缆感知动态避碰方法。首先,引入系缆安全感知平面域,以表示系缆与障碍船之间的三维碰撞风险,无需显式系缆形状模型。其次,开发了系缆绷紧感知速度障碍法,以实现安全避碰并降低系缆绷紧的可能性。最后,该方法与视线制导集成,以协调HROV跟踪和避碰。基于Gazebo的仿真表明,所提方法能够避开动态障碍船,同时保持系缆安全并降低USV规避机动期间系缆绷紧的可能性。

英文摘要

Heterogeneous marine robotic systems composed of an unmanned surface vehicle (USV) and a hybrid remotely operated vehicle (HROV) have shown great potential for subsea cable inspection. In such missions, the USV tracks the HROV at the surface while supplying power and communication through an umbilical tether. However, dynamic collision avoidance for the USV during HROV tracking is challenging because the submerged tether may scrape against passing vessels, while evasive maneuvers can enlarge the USV--HROV separation, thereby increasing the likelihood of tether tautness and compromising HROV operations. To address these challenges, this work proposes a tether-aware dynamic collision avoidance method for a USV tracking an HROV. First, a tether safety-aware planar domain is introduced to represent the three-dimensional collision risk between the tether and obstacle vessels without an explicit tether shape model. Second, a tether tautness-aware velocity obstacle method is developed to achieve safe avoidance while reducing the likelihood of tether tautness. Finally, the method is integrated with line-of-sight guidance to coordinate HROV tracking and collision avoidance. Gazebo-based simulations show that the proposed method avoids dynamic obstacle vessels while maintaining tether safety and reducing the likelihood of tether tautness during USV evasive maneuvers.

2606.01098 2026-06-02 cs.RO cs.AI 版本更新

Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry

隐式漂移策略:通过条件专家几何实现单步动作生成

Zemin Yang, Yaoyu He, Yiming Zhong, Yuhao Zhang, Xinge Zhu, Yao Mu, Qingqiu Huang, Yuexin Ma

发表机构 * ShanghaiTech University(上海科技大学) Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学) Morphi Robot(Morphi机器人)

AI总结 提出隐式漂移策略(IDP),一种单步模仿学习框架,通过条件专家几何隐式引入训练时的漂移校正,无需显式向量场估计,在2D、3D及真实世界操作任务中有效保持有效动作流形,性能优于显式漂移方法并达到强单步基线水平。

详情
AI中文摘要

基于扩散或流匹配的生成动作策略在行为克隆中表现出色,但其迭代采样对于高频机器人控制来说过于耗时。尽管最近的单步公式缓解了这种延迟,但它们不可避免地丢弃了提供关键动作校正的中间轨迹演化。由于条件演示极端稀疏,通过显式估计训练时漂移场直接恢复这一机制在数学上是不适定的。我们提出了隐式漂移策略(IDP),一种单步模仿学习框架,无需显式向量场估计即可将训练时的漂移校正引入策略学习。IDP从观测相似专家动作的局部变化中提取条件专家几何,并将其与全局参考几何进行比较,以分离条件特定的约束。这种局部几何结构自适应地加权一个标量势目标。结合专家近端终端评估,IDP在训练期间直接对单步生成器施加流形约束。在2D、3D和真实世界操作任务上的广泛评估表明,IDP有效保持了对有效动作流形的遵循,优于显式漂移方法,并达到了与强单步基线相当的性能。

英文摘要

Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for high-frequency robot control. While recent one-step formulations alleviate this latency, they inevitably discard the intermediate trajectory evolution that provides crucial action correction. Directly recovering this mechanism by explicitly estimating a training-time drifting field is mathematically ill-posed due to extreme conditional demonstration sparsity. We introduce Implicit Drifting Policy (IDP), a one-step imitation learning framework that brings the training-time correction of Drifting into policy learning without explicit vector field estimation. IDP extracts a conditional expert geometry from the local variation of observation-similar expert actions, and compares it against a global reference geometry to isolate condition-specific constraints. This local geometric structure adaptively weights a scalar potential objective. Combined with an expert-proximal terminal evaluation, IDP directly enforces manifold constraints on the one-step generator during training. Extensive evaluations across 2D, 3D, and real-world manipulation tasks show IDP effectively maintains adherence to valid action manifolds, improving upon explicit drifting methods and achieving competitive performance with strong one-step baselines.

2606.01095 2026-06-02 cs.RO cs.AI 版本更新

Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA

超越任务成功:WAM 和 VLA 的行为与表征诊断

Hung Mai, Bin Zhu, Tuan Do

发表机构 * National Economics University, Vietnam(越南国家经济大学) Singapore Management University(新加坡管理大学) Phenikaa University, Vietnam(越南Phenikaa大学)

AI总结 本文提出一个模型无关的诊断框架,通过行为分析和基于稀疏自编码器的特征分析,比较世界动作模型(WAM)与视觉-语言-动作(VLA)策略在机器人操作中的行为与表征差异,发现WAM在目标选择和行为改进上优于VLA但计算成本更高,且不同WAM架构对未来信息的编码方式不同。

详情
AI中文摘要

视觉-语言-动作(VLA)策略和世界动作模型(WAM)代表了机器人操作中两种日益重要的范式。然而,尚不清楚WAM中的未来预测是否在最终任务成功之外带来行为上有意义的改进。在本文中,我们探究WAM是否仅仅增加了未来预测,还是以对控制可操作的方式改变了机器人行为和内部表征。我们引入一个模型无关的诊断框架,通过两个互补的视角比较WAM和VLA:行为 rollout 分析和基于稀疏自编码器的特征分析。行为协议测量动作动态一致性、目标物体进展、干扰物干扰和运行时成本。特征空间协议将内部表征表征为记忆型、反应型或预测型,揭示模型是否编码了面向未来的结构。在LIBERO和RoboTwin2.0上,我们评估了7种策略,涵盖直接VLA以及联合、顺序和辅助WAM。我们的结果表明,仅凭成功隐藏了关键差异:WAM通常改善物体级行为和目标选择性,但其收益依赖于架构并导致更高的推理成本。顺序WAM显示出最清晰的预测结构,而辅助和联合WAM分别压缩或纠缠未来信息。这些发现为WAM设计提供了未来方向,以保留行为可操作的未来表征,实现高效操作。

英文摘要

Vision-language-action (VLA) policies and World-Action Models (WAMs) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal representations in ways that are actionable for control. We introduce a model-agnostic diagnostic framework that compares WAMs and VLAs through two complementary lenses: behavioral rollout analysis and sparse-autoencoder-based feature analysis. The behavioral protocol measures action dynamics consistency, target-object progress, distractor disturbance, and runtime cost. The feature-space protocol characterizes internal representations as memorized, reactive, or predictive, revealing whether models encode future-oriented structure. Across LIBERO and RoboTwin2.0, we evaluate 7 policies spanning direct VLAs and joint, sequential, and auxiliary WAMs. Our results show that success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information. These findings suggest future directions for WAM design that preserve behaviorally actionable future representations for efficient manipulation.

2606.01047 2026-06-02 cs.RO 版本更新

Learning Multi-Modal Trajectory Policies for Data-Efficient Robotic Manipulation

学习多模态轨迹策略以实现数据高效的机器人操作

Zijia Chen, Yuenan Hou, Xinhua Jiang, Yu Li, Weijie Li, Li Liu

发表机构 * College of Electronic Science and Technology, National University of Defense Technology(电子科学学院,国防科技大学) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 针对机器人操作中数据稀缺导致的多模态干扰问题,提出基于混合专家模型的多模态轨迹预测框架MATE,通过细粒度子令牌特征解耦和跨模态余弦路由器实现稳定专家分配,在LIBERO基准和真实乒乓球实验中取得显著性能提升。

详情
AI中文摘要

机器人操作需要有效整合异构输入,包括视觉观察、语言指令和轨迹表示,以生成精确的动作。现有的基于Transformer的策略通常在一个共享参数空间内处理这些异构模态,这往往导致模态干扰和低效的表示学习,尤其是在数据稀缺的场景下。虽然混合专家模型(MoE)通过专家专业化提供了可扩展的解决方案,但传统的路由机制通常对这类跨模态表示差异敏感,导致专家分配不稳定和专家崩溃。在这项工作中,我们提出了MATE(多模态轨迹策略),一种基于MoE的新型轨迹预测框架。具体来说,我们引入了一种多模态MoE架构以实现细粒度的子令牌特征解耦,并设计了一个跨模态余弦路由器,用于跨异构模态的稳定且尺度不变的专家分配。我们进一步采用温度控制路由和随机噪声注入,以改善专家平衡并防止在稀缺演示下过早的路由崩溃。在LIBERO基准上的实验表明,我们的MATE在数据稀缺情况下始终优于先前的工作。与轨迹引导的对应方法相比,平均成功率提高了4.75%。在真实世界的乒乓球机器人实验也表明,预测的轨迹可以为下游机器人执行提供有用的指导,进一步证明了我们算法的实际可行性。

英文摘要

Robotic manipulation requires the effective integration of heterogeneous inputs, including visual observations, language instructions, and trajectory representations, to generate accurate actions. Existing transformer-based policies typically process these heterogeneous modalities within a shared parameter space, which often leads to modality interference and inefficient representation learning, especially in data-scarce scenarios. While Mixture-of-Experts (MoE) offers a scalable solution through expert specialization, conventional routing mechanisms are often sensitive to such cross-modal representation discrepancies, resulting in unstable expert assignment and expert collapse. In this work, we propose MATE (Multi-ModAl TrajEctory Policies), a novel trajectory prediction framework built upon MoE. Specifically, we introduce a Multi-Modal MoE architecture to achieve fine-grained sub-token feature decoupling, and design a cross-modal cosine router for stable and scale-invariant expert assignment across heterogeneous modalities. We further employ temperature-controlled routing and stochastic noise injection to improve expert balance and prevent premature routing collapse under scarce demonstrations. Experiments on the LIBERO benchmark show that our MATE consistently outperforms prior work under data scarcity. It achieves a 4.75% improvement in average success rate over the trajectory-guided counterpart. Real-world experiments on robotic ping-pong also suggest that the predicted trajectories can provide useful guidance for downstream robotic execution, further indicating the practical feasibility of our algorithm.
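
To make the idea of a scale-invariant cosine router concrete, here is a minimal NumPy sketch. It assumes one learned key vector per expert, top-k selection, a softmax temperature, and optional Gaussian noise on the routing logits; the abstract does not give these details, so the function `cosine_route` and all of its parameters are hypothetical and this is not the MATE implementation.

```python
import numpy as np

def cosine_route(tokens, expert_keys, temperature=0.1, top_k=2,
                 noise_std=0.0, rng=None):
    """Route each token to its top_k experts by cosine similarity.

    tokens:      (T, D) token features, possibly from different modalities
    expert_keys: (E, D) one key vector per expert (assumed learned)
    Returns (indices, gates), both of shape (T, top_k).
    """
    eps = 1e-8
    t = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + eps)
    e = expert_keys / (np.linalg.norm(expert_keys, axis=-1, keepdims=True) + eps)
    logits = (t @ e.T) / temperature            # cosine similarity, sharpened by temperature
    if noise_std > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        logits = logits + rng.normal(0.0, noise_std, size=logits.shape)
    indices = np.argsort(-logits, axis=-1)[:, :top_k]
    picked = np.take_along_axis(logits, indices, axis=-1)
    picked = picked - picked.max(axis=-1, keepdims=True)
    gates = np.exp(picked)
    gates = gates / gates.sum(axis=-1, keepdims=True)   # softmax over the chosen experts
    return indices, gates

# Usage: rescaling a token does not change its routing.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tokens = np.array([[1.0, 0.0], [0.0, 2.0]])
idx_a, gates_a = cosine_route(tokens, keys)
idx_b, gates_b = cosine_route(100.0 * tokens, keys)
```

Because both tokens and keys are normalized, modalities with very different feature magnitudes compete on direction only, which is one plausible reading of "scale-invariant expert assignment".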

2606.01038 2026-06-02 cs.RO 版本更新

Robust Integrated Planning and Control for Quadrotors in Dynamic Environments via NMPC with CBF Penalties

动态环境中四旋翼飞行器的鲁棒集成规划与控制:基于带CBF惩罚的NMPC

Zeinab Shayan, Mohammadreza Izadi, Reza Faieghi

发表机构 * Autonomous Vehicles Laboratory, Department of Aerospace Engineering, Toronto Metropolitan University(多伦多都会大学航空航天工程系自主车辆实验室)

AI总结 提出一种将控制障碍函数作为指数惩罚嵌入非线性模型预测控制的鲁棒集成规划与控制策略,通过高增益扰动观测器和卡尔曼滤波器增强系统鲁棒性,实现动态环境中的安全避障。

Comments Accepted to Conference on Robots and Vision (CRV 2026), Vancouver, Canada

详情
AI中文摘要

本文提出了一种新的多旋翼无人飞行器鲁棒集成规划与控制策略。我们提出了一种非线性模型预测控制公式,将控制障碍函数作为指数惩罚嵌入,在严格输入约束下提高可行性并确保平滑避障。惩罚权重提供了一个实用的调节旋钮,用于在跟踪精度和避障激进程度之间进行权衡。我们通过采用高增益扰动观测器来估计和补偿外部扰动,从而增强系统鲁棒性。我们还结合了卡尔曼滤波器,用于计算高效的实时障碍物运动预测,从而实现对移动障碍物的规避。与传统的NMPC以及带有硬CBF约束的NMPC的对比研究,在Gazebo和硬件实验中得到了验证,展示了优越的可行性、安全性和鲁棒性。据我们所知,这是首个经过硬件验证的NMPC-CBF IPC框架,为四旋翼飞行器在动态环境中的安全部署迈出了实际的一步。

英文摘要

This paper presents a new robust integrated planning and control (IPC) strategy for multirotor uncrewed aerial vehicles. We propose a nonlinear model predictive control (NMPC) formulation that embeds control barrier functions (CBFs) as exponential penalties, improving feasibility while ensuring smooth obstacle avoidance under tight input bounds. The penalty weights provide a practical tuning knob to trade off tracking accuracy against avoidance aggressiveness. We enhance the system robustness by employing a high-gain disturbance observer (HGDO) to estimate and compensate for external disturbances. We also incorporate a Kalman filter (KF) for computationally efficient, real-time prediction of obstacle motion, enabling avoidance of moving obstacles. Comparative studies against both conventional NMPC and NMPC with hard CBF constraints, validated in Gazebo and hardware experiments, demonstrate superior feasibility, safety, and robustness. To the best of our knowledge, this is the first hardware-validated NMPC-CBF IPC framework, offering a practical step toward safe quadrotor deployment in dynamic environments.

2606.01036 2026-06-02 cs.RO 版本更新

Position: Good Embodied Reward Models Need Bad Behavior Data

立场:好的具身奖励模型需要不良行为数据

Ran Tian, Yilin Wu, Andrea Bajcsy

AI总结 本文主张为获得可靠的具身奖励模型,社区必须投资于“不良”机器人数据(失败、次优、易错甚至危险行为),并通过实验证明即使少量真实不良数据也能改善与人类偏好的一致性。

Comments This position paper has been accepted by the ICML 2026 position track as a spotlight paper

详情
AI中文摘要

这篇立场论文认为,为了获得可靠的具身奖励模型,社区必须投资于“不良”机器人数据:失败、次优、易错甚至危险的行为。虽然奖励模型是任何基础模型生命周期的核心,但今天的具身奖励模型主要基于成功行为进行训练。我们分析了三个最先进的具身奖励模型,发现它们系统性地过度奖励那些真实人类评估者会惩罚的行为,包括不安全交互、糟糕执行以及仅表面满足任务的捷径策略。我们将这些失败归因于一个关键的数据缺口:负面具身数据的稀缺性,这些数据收集成本高昂,并且在现有的机器人数据集中经常被过滤掉或未予公开。此外,我们表明,即使是少量真实不良行为数据也能改善与人类偏好的一致性,并减少代价高昂的误报。因此,我们呼吁具身AI社区整理并发布他们的不良机器人数据,构建合成不良数据生成引擎,开发更去中心化的物理评估系统,并设计用于细粒度具身奖励模型评估的基准。

英文摘要

This position paper argues that to obtain reliable embodied reward models, the community must invest in ``bad'' robot data: failed, suboptimal, error-prone, and even hazardous behaviors. While reward models are central to any foundation model's lifecycle, today's embodied reward models are trained primarily on successful behaviors. We analyze three state-of-the-art embodied reward models and find that they systematically over-reward behaviors that real human evaluators would penalize, including unsafe interactions, poor execution, and shortcut strategies that only superficially satisfy tasks. We attribute these failures to a key data gap: the scarcity of negative embodied data which is costly to collect and often filtered out or withheld in existing robotics datasets. Furthermore, we show that even modest exposure to real bad behavior data can improve alignment with human preferences and reduce costly false positives. We therefore call on the embodied AI community to curate and release their bad robot data, build synthetic bad data generation engines, develop more decentralized physical evaluation systems, and design benchmarks for fine-grained embodied reward model evaluations.

2606.01027 2026-06-02 cs.RO 版本更新

$τ_0$-WM: A Unified Video-Action World Model for Robotic Manipulation

$\tau_0$-WM:一种用于机器人操作的统一视频-动作世界模型

Pengfei Zhou, Shengcong Chen, Di Chen, Jiaxu Wang, Rongjun Jin, Bingwen Zhu, Yike Pan, Songen Gu, Kuanning Wang, Shufeng Nan, Xingyu Qiu, Chenhao Qiu, Pu Yang, Yunuo Cai, Jianxiong Gao, Yifan Li, Yanwei Fu, Xiangyu Yue, Zhi Chen, Jianlan Luo

发表机构 * Shanghai Innovation Institute(上海创新研究院) AGIBOT Finch

AI总结 提出$\tau_0$-WM,一个统一视频-动作世界模型,通过共享视频扩散骨干集成策略学习、视频预测和动作评估,在长时域和精细操作任务上优于基线。

Comments Our project homepage: https://finch.agibot.com/research/tau0-wm

详情
AI中文摘要

机器人操作需要能够生成可执行动作并在物理执行前预测和评估其未来后果的模型。我们提出$\tau_0$-世界模型($\tau_0$-WM),一个统一的视频-动作世界模型,在单个未来预测框架内整合了策略学习、视频预测和动作评估。基于共享的视频扩散骨干,$\tau_0$-WM提供两个互补接口。首先,一个视频动作模型从多视角观察、语言指令和机器人状态中联合预测未来视觉潜变量和连续动作块。其次,一个动作条件视频模拟器将候选动作块展开为多视角未来并预测密集的任务进度分数。该模型在大约27,300小时的实机遥操作、UMI风格交互、自我中心人类视频以及使用模态特定监督掩码的展开或失败轨迹上进行训练。在推理时,$\tau_0$-WM利用测试时计算来采样动作候选,通过重新去噪一致性对其进行排序,并对低质量候选调用基于模拟器的修正。在具有挑战性的长时域和精细机器人操作任务上,$\tau_0$-WM表现出优于其他相关基线的性能。

英文摘要

Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present $τ_0$-World Model ($τ_0$-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, $τ_0$-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately 27,300 hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, $τ_0$-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, $τ_0$-WM shows superior performance over other relevant baselines.

2606.01015 2026-06-02 cs.RO cs.AI cs.NI cs.SY eess.SY 版本更新

AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics

AI-IoT-机器人集成:框架、新兴趋势及迈向互联机器人的路径综述

Ranulfo Bezerra, Satoshi Tadokoro, Kazunori Ohno

发表机构 * Tohoku University(日本东北大学)

AI总结 本文综述了人工智能、物联网和机器人三者融合的现状,提出了模块化系统架构,并强调了小语言模型(SLM)和大型语言模型(LLM)在分布式认知与自主决策中的作用,为下一代互联机器人和物理AI生态系统提供了概念和技术路线图。

Comments 15 pages, 3 figures, 3 tables. Published in IEEE Internet of Things Journal

详情
Journal ref
IEEE Internet of Things Journal, vol. 13, no. 10, pp. 20398-20412, 15 May 2026
AI中文摘要

人工智能、物联网和机器人的融合不再是未来的愿景;它正迅速成为实时、智能和上下文感知系统的基础。AI实现感知和推理,IoT提供可扩展的感知和通信,而机器人则提供具身驱动。尽管在AIoT和物联网机器人(IoRT)等两两组合方面取得了显著进展,但仍缺乏完全整合这三者的统一设计框架。本综述综合了这些领域的最新进展,强调了边缘端的小语言模型(SLM)和云端的大型语言模型(LLM)在分布式认知和自主决策中的新兴作用。我们提出了一个符合这些趋势的模块化系统架构,分析了互操作性和反馈控制中存在的持续差距,并根据集成深度对现有工作进行了分类。我们的综述强调了混合SLM-LLM系统与IoT基础设施和机器人代理相结合时,如何应对实时适应、可扩展性和可靠性方面的挑战。这项工作为设计模块化、可解释且能够在动态环境中学习的下一代AI-IoT-机器人生态系统提供了概念和技术路线图,为新兴的互联机器人和物理AI范式铺平了道路。

英文摘要

The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming the foundation of real-time, intelligent, and context-aware systems. AI enables perception and reasoning, IoT provides scalable sensing and communication, and robotics delivers embodied actuation. Despite significant progress in pairwise combinations such as AIoT and the Internet of Robotic Things (IoRT), there remains a lack of unified design frameworks that fully integrate all three. This survey synthesizes the state-of-the-art across these domains, emphasizing the emerging role of Small Language Models (SLMs) at the edge and Large Language Models (LLMs) in the cloud for distributed cognition and autonomous decision-making. We propose a modular system architecture that aligns with these trends, analyze persistent gaps in interoperability and feedback control, and classify existing work by integration depth. Our review highlights how hybrid SLM-LLM systems, when coupled with IoT infrastructure and robotic agents, can address challenges in real-time adaptation, scalability, and reliability. This work offers a conceptual and technical roadmap for designing next-generation AI-IoT-Robotic ecosystems that are modular, interpretable, and capable of learning within dynamic environments, paving the way for the emerging paradigm of Connected Robotics and Physical AI.

2606.00998 2026-06-02 cs.RO 版本更新

GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping

GraspGen-X: 跨形态6自由度扩散抓取

Beining Han, Yu-Wei Chao, Erwin Coumans, Clemens Eppner, Balakumar Sundaralingam, Jia Deng, Stan Birchfield, Adithyavairavan Murali

发表机构 * NVIDIA Princeton University(普林斯顿大学)

AI总结 提出一种基于扩散模型的跨形态6自由度抓取方法,通过扫描体积启发式编码夹爪表示,在20亿抓取数据上训练,实现对新物体、场景和夹爪形态的零样本泛化。

详情
AI中文摘要

我们研究跨形态6自由度机器人抓取。与先前工作不同,我们要求模型不仅泛化到新物体/场景,还要泛化到新夹爪形态和物理抓取过程。我们的方法将基于扩散模型的生成式6自由度抓取模型扩展到对额外夹爪表示的条件化。我们提出一种用于编码夹爪的扫描体积启发式方法。我们使用程序化生成的夹爪和一个包含20亿抓取的大规模数据集训练跨形态模型。在仿真实验中,我们的模型在零样本泛化到新型真实世界夹爪和物体方面优于基线方法。我们的模型也可作为微调以适应新夹爪的良好初始化。在消融实验中,我们展示了扫描体积夹爪表示和程序化夹爪训练数据集的效率。最后,我们展示了在6自由度抓取中对真实世界新型夹爪的零样本泛化,在跨形态泛化方面超越了基线。

英文摘要

We study cross-embodiment 6-DOF robot grasping. Unlike prior work, we require the model to generalize not only to novel objects and scenes but also to novel gripper morphologies and physical grasping processes. Our method extends diffusion-based generative 6-DOF grasping models to condition on an additional gripper representation. We propose a swept-volume heuristic for encoding the gripper. We train our cross-embodiment model with procedural grippers and a large-scale dataset of 2 billion grasps. In simulation experiments, our model achieves better zero-shot generalization to novel real-world grippers and objects than baseline methods. Our model also serves as a good initialization for fine-tuning to adapt to novel grippers. In ablations, we demonstrate the efficiency of our swept-volume gripper representation and our procedural gripper training dataset. Last, we show zero-shot generalization to real-world novel grippers for 6-DOF grasping, surpassing baselines in cross-embodiment generalization.

2606.00990 2026-06-02 cs.RO 版本更新

OSCAR: Obstacle Survival Curves for Adaptive Robot Navigation

OSCAR: 用于自适应机器人导航的障碍物生存曲线

Hshmat Sahak, Aoran Jiao, Nicholas Rhinehart, Tim Barfoot

发表机构 * University of Toronto(多伦多大学)

AI总结 提出OSCAR框架,利用生存模型学习障碍物清除时间分布,并通过图规划器动态调整等待与重路由的阈值,以减少导航时间。

Comments 8 pages main text, appendices included

详情
AI中文摘要

一个沿已知路线图行驶的移动机器人在临时障碍物阻塞关键边时可能会犯代价高昂的导航错误:在停放的推车后面等待太久浪费时间,但立即绕过一个几秒钟后会移动的人也是低效的。标准的反应式避障处理障碍物周围的局部运动,而固定的等待或重路由规则忽略了不同障碍物类型通常持续的时间。我们提出了OSCAR:一种用于具有临时阻塞的基于图的导航的自适应生存建模框架。假设在遇到障碍物时可以获得障碍物类别标签,机器人从在线经验中学习类别条件的残余清除时间分布,包括在重路由之前未观察到清除时的右删失观测。这些生存模型被集成到一个时间相关的图规划器中,该规划器维护障碍物记忆并计算每个阻塞边的耐心阈值:在采取替代路线之前等待多长时间。该方法在多个回合中持续更新其清除估计,并使用它们来平衡等待与重路由。我们在仿真中和真实移动机器人上(在大学中庭,障碍物包括人、椅子、垃圾桶和管道)评估了该方法。在仿真中,学习策略的目标时间在每类障碍物少于20次观测后收敛到具有真实清除分布的神谕的1%以内,优于所有启发式基线。实际部署证实该策略在线改进,从50个导航回合的经验中调整其耐心阈值。

英文摘要

A mobile robot following a graph of known routes can make costly navigation errors when a temporary obstacle blocks a critical edge: waiting too long behind a parked cart wastes time, but immediately rerouting around a person who would move in a few seconds is also inefficient. Standard reactive obstacle avoidance addresses local motion around obstacles, while fixed wait-or-reroute rules ignore how long different obstacle types tend to persist. We propose OSCAR: an adaptive survival-modeling framework for graph-based navigation with temporary blockages. Assuming obstacle class labels are available at encounter time, the robot learns class-conditioned residual clearance-time distributions from online experience, including right-censored observations when it reroutes before observing clearance. These survival models are integrated into a time-dependent graph planner that maintains obstacle memory and computes a patience threshold at each blocked edge: how long to wait before taking an alternate route. The method continuously updates its clearance estimates across episodes and uses them to balance waiting against rerouting. We evaluate the approach in simulation and on a real mobile robot in a university atrium with obstacles including people, chairs, bins, and tubes. In simulation, the learned policy's time-to-goal converges to within 1% of an oracle with access to ground-truth clearance distributions after fewer than 20 observations per obstacle class, outperforming all heuristic baselines. Real-world deployment confirms that the policy improves online, adapting its patience thresholds from experience across 50 navigation episodes.
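
The role of right-censored observations can be illustrated with a Kaplan-Meier estimator, a standard survival-analysis tool. The abstract does not say which survival model OSCAR uses, so this is a sketch under that assumption. The patience rule below (wait until time `w`, then pay a fixed `detour` cost if the obstacle is still there) is likewise a simplified, hypothetical stand-in for the paper's time-dependent graph planner.

```python
def kaplan_meier(durations, cleared):
    """Kaplan-Meier survival curve for obstacle clearance times.

    durations: how long the robot observed each blockage
    cleared:   True if the obstacle cleared at that time, False if the
               robot rerouted first (right-censored observation)
    Returns a list of (t, S(t)) at each distinct clearance time.
    """
    events = sorted(zip(durations, cleared))
    at_risk = len(events)
    surv = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        n_cleared = 0
        n_removed = 0
        while i < len(events) and events[i][0] == t:
            n_cleared += 1 if events[i][1] else 0
            n_removed += 1
            i += 1
        if n_cleared > 0:
            surv *= 1.0 - n_cleared / at_risk
            curve.append((t, surv))
        at_risk -= n_removed      # censored samples leave the risk set without a drop
    return curve

def expected_cost(curve, wait, detour):
    """Expected time lost when waiting up to `wait`, then rerouting:
    E[min(T, wait)] + S(wait) * detour, with S the step-function KM curve."""
    total, prev_t, s = 0.0, 0.0, 1.0
    for t, s_next in curve:
        if t > wait:
            break
        total += s * (t - prev_t)
        prev_t, s = t, s_next
    total += s * (wait - prev_t)
    return total + s * detour

def patience_threshold(curve, detour):
    """Pick the wait time (0 or a clearance time) with the lowest expected cost."""
    candidates = [0.0] + [t for t, _ in curve]
    return min(candidates, key=lambda w: expected_cost(curve, w, detour))

# Usage: one class of obstacle, four encounters, one of them censored.
curve = kaplan_meier([2.0, 4.0, 4.0, 6.0], [True, True, False, True])
wait = patience_threshold(curve, detour=10.0)
```

Keeping one curve per obstacle class gives class-conditioned estimates, and discarding the censored encounters instead would bias clearance times downward.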

2606.00985 2026-06-02 cs.RO 版本更新

Make Your VLA More Robust Without More Data By Interleaving Motion Planning

通过交错运动规划使您的VLA更鲁棒而无需更多数据

Dan BW Choe, Sundhar Vinodh Sangeetha, Samuel Coogan, Shreyas Kousik

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出MPVI框架,将基于模型的运动规划与视觉-语言-动作模型交错结合,通过VLM完成检查和本体感受触发实现可靠切换,无需额外训练即可提升长时域移动操作任务的鲁棒性,在BEHAVIOR-1K基准上任务进度提升113%。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在移动操作方面取得了显著进展,但在长时域任务上的表现仍然较差。这些任务尤其具有挑战性,因为(1)必须在空间分布的子任务的长序列中保持对高层目标的进展,并且(2)早期执行错误会在任务时域内迅速累积。尽管在大规模人类遥操作移动操作数据上进行了微调,这些挑战仍然存在,表明仅靠更多数据可能无法解决问题。为了应对这些挑战,我们提出了MPVI:运动规划器/VLA交错框架,该框架将基于模型的运动规划与VLA集成,无需进一步训练即可提高鲁棒性。所提出的集成通过开放词汇目标检测、前沿探索和运动规划,实现了在杂乱场景中对远处或遮挡目标物体的定位和导航。然而,这种集成并非易事,需要模块之间的可靠切换;我们通过基于VLM的完成检查与本体感受触发器展示了一种可行的方法。我们在BEHAVIOR-1K基准上评估了我们的方法,并展示了在任务进度上比顶级端到端VLA基线提升113%。更多详情请访问项目页面:https://mpvi.netlify.app/。

英文摘要

Vision-Language-Action (VLA) models have shown remarkable progress for mobile manipulation, but their performance on long-horizon tasks remains poor. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across extended sequences of spatially distributed subtasks, and (2) early execution errors compound rapidly over the task horizon. These challenges persist despite finetuning on large human teleoperated mobile manipulation data, indicating that more data alone may not resolve the problem. To address these challenges, we propose MPVI: Motion Planner / VLA Interleaving, a framework that integrates model-based motion planning with VLAs to improve robustness without further training. The proposed integration enables localization and navigation to distant or occluded target objects through cluttered scenes using open-vocabulary object detection, frontier exploration and motion planning. However, such integration is non-trivial, requiring reliable switching between modules; we show one way forward via VLM-based completion checking with proprioceptive triggers. We evaluate our approach on the BEHAVIOR-1K benchmark and demonstrate 113% improvement in task progress over a top end-to-end VLA baseline. Additional details are available at the project page: https://mpvi.netlify.app/.

2606.00966 2026-06-02 cs.RO 版本更新

Threading Optimization for Vision-Language-Action Model Inference in Low-Cost Smart Agricultural Manipulation

低成本智能农业机械臂中视觉-语言-动作模型推理的线程优化

Keith Truongcao, Christopher Nhu, Zijian An, Phong Nguyen, Siwei Cai, Lifeng Zhou

发表机构 * Department of Electrical Engineering, Drexel University(德雷塞尔大学电气工程系)

AI总结 针对低成本机械臂上VLA模型推理慢、精细动作调整难的问题,通过优化RTAC算法的线程实现,降低了端到端延迟并提高了响应性,在农产品操作任务中验证了控制稳定性和速度的提升。

详情
AI中文摘要

视觉-语言-动作(VLA)模型仍然面临推理速度慢和难以进行精细运动调整等挑战,限制了它们在工业中的广泛应用。虽然实时动作分块(RTAC)算法已被提出以解决这些瓶颈,但从伪代码算法到低成本机械臂上稳定、实际部署的桥梁仍然是一个挑战。在这项工作中,我们提出了一个完整的系统级RTAC实现,专门针对低成本机器人操纵系统。我们通过优化策略推理和控制管道的线程实现,超越了原始的高级伪代码,在不修改底层策略的情况下减少了端到端延迟并提高了响应性。我们在涉及农产品(特别是大蒜球和核桃)操作的任务上评估了该系统。实验结果表明,与RTAC的基本实现相比,我们的自定义线程实现显著提高了控制稳定性和速度。

英文摘要

Vision-Language-Action (VLA) models continue to face challenges such as slow inference speed and difficulty performing fine-grained motion adjustments, limiting their widespread adoption in industry. While the Real-Time Action Chunking (RTAC) algorithm has been proposed to address these bottlenecks, bridging the gap between the algorithm as published in pseudocode and a stable, real-world deployment on a low-cost robotic arm remains a challenge. In this work, we present a complete system-level implementation of RTAC tailored for a low-cost robotic manipulation system. We advance beyond the original high-level pseudocode by optimizing the threading implementation for the policy inference and control pipeline, reducing end-to-end latency and improving responsiveness without modifying the underlying policy. We evaluate this system on tasks involving the manipulation of agricultural produce, specifically garlic bulbs and walnuts. Experimental results demonstrate that our custom threading implementation significantly improves control stability and speed compared to the base implementation of RTAC.

2606.00933 2026-06-02 cs.RO 版本更新

Generative Multi-Robot Motion Planning via Diffusion Modeling with Multi-Agent Reinforcement Learning Guidance

基于扩散建模与多智能体强化学习引导的生成式多机器人运动规划

Suk Ki Lee, Venkata Sai Deepak Mutta, Hyunwoong Ko

发表机构 * School of Manufacturing Systems and Networks, Arizona State University, Mesa, AZ(制造系统与网络学院,亚利桑那州立大学,梅萨,AZ) Michael W. Hall School of Mechanical Engineering, Mississippi State University, Starkville, MS(迈克尔·W·霍尔机械工程学院,密西西比州立大学,斯塔克维尔,MS)

AI总结 提出一种结合扩散模型与多智能体强化学习的框架,通过值函数引导反向扩散过程实现交互感知的轨迹生成,降低多机器人冲突率。

Comments 11 pages, 6 figures, 1 table. This paper has been accepted for publication in the proceedings of ASME IDETC-CIE 2026

详情
AI中文摘要

在共享环境中协调多个机器人需要为每个智能体生成可行轨迹,同时考虑智能体间的交互。集中式规划方法随着机器人数量增加而难以扩展,而允许每个智能体独立规划的分散式方法则无法固有地处理智能体间的交互。本文提出一种协调多机器人运动规划的框架,将分散式生成轨迹规划与基于多智能体强化学习(MARL)的协调相结合。每个机器人使用在单智能体运动数据上训练的扩散模型独立生成候选轨迹,利用生成模型生成可行且多样化轨迹的能力。为了减少智能体间的冲突,通过基于梯度的引导,使用MARL训练的集中式值函数指导反向扩散过程,从而在不进行集中式联合规划或重新训练生成模型的情况下实现交互感知的轨迹生成。这种引导遵循指数倾斜公式,其中值函数将去噪分布偏向于具有更高期望多智能体回报的轨迹。该框架在包含四个移动机器人的模拟迷宫环境中进行评估。实验结果表明,所提出的值引导扩散规划将智能体间干扰率从55.4%降低到41.8%,证明在保持分散式轨迹生成可扩展性的同时,可以有效实现协调。这些结果表明,基于MARL的值引导可以在不需要完全联合的多机器人模型的情况下,有效地将协调引入分散式生成规划器。

英文摘要

Coordinating multiple robots in shared environments requires generating feasible trajectories for each agent while accounting for interactions among agents. Centralized planning approaches become difficult to scale as the number of robots increases, while decentralized approaches that allow each agent to plan independently do not inherently account for inter-agent interactions. This paper presents a framework for coordinated multi-robot motion planning that combines decentralized generative trajectory planning with multi-agent reinforcement learning (MARL)-based coordination. Each robot independently generates candidate trajectories using a diffusion model trained on single-agent motion data, leveraging the generative model's ability to produce feasible and diverse trajectories. To reduce conflicts between agents, a centralized value function trained via MARL guides the reverse diffusion process through gradient-based steering, enabling interaction-aware trajectory generation without centralized joint planning or retraining of the generative model. This guidance follows an exponential tilting formulation, in which the value function biases the denoising distribution toward trajectories with higher expected multi-agent return. The framework is evaluated in a simulated maze environment with four mobile robots. Experimental results show that the proposed value-guided diffusion planning reduces the inter-agent interference rate from 55.4% to 41.8%, demonstrating that coordination can be effectively achieved while preserving the scalability of decentralized trajectory generation. These results suggest that MARL-based value guidance can effectively introduce coordination into decentralized generative planners without requiring a fully joint multi-robot model.
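
The exponential tilting mentioned above can be written out in its usual form. The symbols here are assumptions, since the abstract gives no notation: $p_\theta$ is the trajectory distribution of the single-agent diffusion model, $V$ the MARL-trained value function, and $\lambda > 0$ a temperature. This is a sketch of the general technique, not the paper's exact formulation.

```latex
% Tilted trajectory distribution: reweight the generative prior by the value.
\tilde{p}(\tau) \;\propto\; p_\theta(\tau)\,\exp\!\left(\frac{V(\tau)}{\lambda}\right)

% Taking the log-gradient gives the guided score: the model's own score
% plus a value-gradient steering term.
\nabla_\tau \log \tilde{p}(\tau)
  \;=\; \nabla_\tau \log p_\theta(\tau) \;+\; \frac{1}{\lambda}\,\nabla_\tau V(\tau)
```

The second line is exact for the tilted distribution itself. Applying the same steering term at intermediate noise levels of the reverse diffusion process is an approximation, commonly used in guided diffusion. A smaller $\lambda$ steers harder toward high-value trajectories at the cost of moving further from the learned single-agent prior.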

2606.00922 2026-06-02 physics.med-ph cs.RO 版本更新

A Machine-to-Machine Knowledge-Guided LLM Agent for Generalizable Radiotherapy Treatment Planning

一种机器到机器知识引导的LLM智能体用于泛化放射治疗计划

Md Mainul Abrar, Xun Jia, Yujie Chi

发表机构 * National Institutes of Health (NIH)(美国国立卫生研究院) Department of Physics, The University of Texas at Arlington(德克萨斯大学阿灵顿分校物理系) Department of Radiation Oncology and Molecular Radiation Sciences, Johns Hopkins University(约翰霍普金斯大学放射肿瘤学与分子放射科学系)

AI总结 提出一种机器到机器知识引导的大语言模型框架,通过深度强化学习发现的治疗计划参数分布知识迁移至LLM智能体,实现无需人工干预的自主迭代规划,在多种病例中显著提升规划质量和泛化能力。

Comments 10 pages, 6 figures

详情
AI中文摘要

在这项工作中,我们提出了一种机器到机器(M2M)知识引导的大语言模型(LLM)原型框架,用于自动化放射治疗计划。在所提出的范式中,由深度强化学习(DRL)智能体发现的治疗计划参数(TPP)分布知识通过上下文学习迁移至LLM智能体,使其能够在无需人工干预的情况下自主进行迭代规划。虽然基于LLM的标准规划通常缺乏物理直觉且难以收敛,但整合DRL导出的引导将智能体约束在物理有效的参数空间内。我们在三种不同的规划场景中进行了实验评估:基础前列腺病例、危及器官(OAR)约束增多的复杂前列腺配置以及肝脏病例。评估结果表明,与无引导规划相比,引导的LLM智能体在显著减少迭代次数的同时,始终达到最优规划评分。对最终TPP配置的分析显示,该智能体成功学习了目标的层次优先级,有效恢复了参数调整与剂量学结果之间的逻辑“因果”关系。至关重要的是,该原型框架展现出强大的泛化能力,无论患者具体解剖结构、治疗部位或初始计划质量如何,都能保持高规划质量。通过将DRL的专业优化与LLM的自适应推理相结合,该M2M框架为迈向泛化的自主治疗计划建立了可扩展的基础,最终在现实环境中惠及临床实践。

英文摘要

In this work, we propose a prototype machine-to-machine (M2M) knowledge-guided Large Language Model (LLM) framework for automated radiotherapy treatment planning. In the proposed paradigm, Treatment Planning Parameter (TPP) distribution knowledge discovered by a Deep Reinforcement Learning (DRL) agent is transferred to an LLM agent through in-context learning, enabling autonomous iterative planning without human intervention. While standard LLM-based planning often lacks physical intuition and struggles with convergence, the integration of DRL-derived guidance constrains the agent to a physically valid parameter space. Experimental evaluations are performed across three diverse planning scenarios: basic prostate cases, complex prostate configurations with increased organ-at-risk (OAR) constraints, and liver cases. The evaluation results demonstrate that the guided LLM agent consistently achieves optimal planning scores while significantly reducing the number of iterations compared to unguided planning. Analysis of the final TPP configurations reveals that the agent successfully learns a hierarchical priority of objectives, effectively restoring a logical "cause-and-effect" relationship between parameter tuning and dosimetric outcomes. Crucially, this prototype framework exhibits robust generalizability, maintaining high planning quality regardless of specific patient anatomy, treatment site, or initial plan quality. By bridging the specialized optimization of DRL with the adaptive reasoning of LLMs, this M2M framework establishes a scalable foundation towards generalizable autonomous treatment planning, ultimately benefiting clinical practice in realistic environments.

2606.00886 2026-06-02 cs.CV cs.RO 版本更新

GABI: Geometry-Aware Boundary Integration for Spacecraft Segmentation

GABI: 用于航天器分割的几何感知边界集成

Iason Georgios Velentzas, Dhruv Ahuja, Panagiotis Tsiotras

Affiliations * Georgia Institute of Technology

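AI summary: Proposes GABI, a lightweight boundary-aware multi-task segmentation architecture that augments a convolutional backbone with an auxiliary distance-field prediction head, improving spacecraft segmentation accuracy while keeping model complexity low; on the SPARK benchmark it improves Average Precision by up to 5%, and in generalization experiments by more than 50% over the baseline.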

Comments Accepted to AI4Space at CVPR 2026

详情
AI中文摘要

精确分割对于自主航天器至关重要,因为它直接影响与3D态势感知相关的下游任务。然而,太空恶劣的照明条件会产生外观高度变化的图像,阻碍分割方法在不同航天器和环境中的泛化。在这项工作中,我们提出了GABI,一种轻量级的边界感知多任务分割架构,它通过一个辅助的距离场预测头增强卷积骨干网络。距离场在物体边界周围提供密集的几何监督,鼓励网络学习航天器结构的空间一致表示,同时保持适合机载感知系统的低模型复杂度。我们在一个既定的卷积基线和更重的基于Transformer的架构上评估了GABI。在SPARK基准上,距离场监督使基线在平均精度上提高了5%,同时实现了与Transformer模型相当的性能。在泛化实验中,GABI的平均精度比基线提高了50%以上。在跨域评估中,轻量级GABI变体在IoU和F1分数上与更重的Transformer模型相差5%以内,而体积大约小十倍。同时,更重的GABI变体在保持近三倍轻量的同时超越了Transformer架构。

英文摘要

Accurate segmentation is crucial for autonomous spacecraft, as it directly affects downstream tasks related to 3D situational awareness. The harsh illumination conditions of space, however, produce images with high variability in appearance, hindering the generalization of segmentation approaches across different spacecraft and environments. In this work, we propose GABI, a lightweight boundary-aware multi-task segmentation architecture that augments a convolutional backbone with an auxiliary distance-field prediction head. The distance field provides dense geometric supervision around object boundaries, encouraging the network to learn spatially consistent representations of spacecraft structures while maintaining low model complexity suitable for onboard perception systems. We evaluated GABI against both an established convolutional baseline and a heavier transformer-based architecture. On the SPARK benchmark, distance-field supervision improves the baseline by up to $5\%$ in Average Precision while achieving performance comparable to the transformer models. In generalization experiments, GABI improves Average Precision by more than $50\%$ over the baseline. In cross-domain evaluation, the lightweight GABI variant performs within $5\%$ in IoU and F1-score of the heavier transformer model while being approximately ten times smaller. At the same time, the heavier GABI variant surpasses the transformer architectures while remaining nearly three times lighter.
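To make the distance-field supervision idea concrete, the sketch below builds a distance-field training target from a binary mask. The abstract does not say whether the field is signed, truncated, or normalized, so this assumes an unsigned Euclidean distance to the nearest boundary pixel, clipped at a hypothetical `clip` value. The function names are illustrative, and the brute-force search is only suitable for small examples.

```python
import math

def boundary_pixels(mask):
    """Foreground pixels that have at least one 4-connected background neighbor."""
    h, w = len(mask), len(mask[0])
    edges = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and not mask[rr][cc]:
                    edges.append((r, c))
                    break
    return edges

def distance_field(mask, clip=5.0):
    """Unsigned distance from every pixel to the nearest boundary pixel, clipped."""
    h, w = len(mask), len(mask[0])
    edges = boundary_pixels(mask)
    field = [[clip] * w for _ in range(h)]
    if not edges:
        return field  # no boundary found: every pixel keeps the clip value
    for r in range(h):
        for c in range(w):
            d = min(math.hypot(r - er, c - ec) for er, ec in edges)
            field[r][c] = min(d, clip)
    return field

if __name__ == "__main__":
    # 5x5 image with a 3x3 foreground block in the middle.
    mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
    for row in distance_field(mask):
        print([round(v, 2) for v in row])
```

An auxiliary head would regress this field alongside the usual mask loss, so that pixels near the boundary receive a dense geometric signal rather than only a binary label.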

2606.00857 2026-06-02 cs.RO cs.AI 版本更新

From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction

从线索到视野:轨迹预测的动态风险视界剖面

Xinyi Ning, Zilin Bian, Dachuan Zuo, Semiha Ergan, Kaan Ozbay

Affiliations * Department of Civil and Urban Engineering, New York University; Department of Civil Engineering Technology and Environmental Management Safety, Rochester Institute of Technology

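AI summary: Proposes a risk horizon profiling (RHP) module that models future risk distributions with a continuous, learnable potential field to improve trajectory prediction accuracy, reducing 5 s RMSE by 25.0% on highD and 5 s minFDE by 29.1% on SHRP2.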

Comments 11 pages, 7 figures, submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)

详情
AI中文摘要

准确可靠的车辆轨迹预测对于安全自动驾驶至关重要。最近的研究将安全风险纳入轨迹预测,以量化周围代理带来的危险。然而,大多数风险感知方法将过去的风险信息作为辅助信号来帮助决策,忽视了其未来的演变和不确定性。在本文中,我们提出了一种风险视界剖面(RHP)模块,该模块结合了连续、可学习的势场模型,用于风险感知轨迹预测。RHP模块计算周围物体的时空接近度,以描绘未来视界上的风险分布,通过自适应识别人类驾驶员认为的关键时刻,支持更好的轨迹预测。我们在两个不同驾驶设置的数据集上评估了我们的方法:highD(高速公路走廊)和SHRP2(城市街道),涵盖了包括安全、近碰撞和碰撞事件在内的多种风险场景。与基线方法相比,我们的框架在highD数据集上实现了5秒RMSE降低25.0%,在SHRP2上实现了5秒minFDE降低29.1%。这些结果表明,该方法在短视界和长视界预测中均表现出色,并且在高速公路和城市场景中具有强大的泛化能力。所提出的方法能够实现更真实的自动驾驶车辆路径规划和策略选择,从而支持更安全的自动驾驶和更先进的驾驶员辅助系统。本工作的源代码可在以下网址获取:https://github.com/bilab-nyu/RHP

英文摘要

Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk into trajectory prediction to quantify dangers posed by surrounding agents. However, most risk-aware approaches use past risk information as a secondary signal to help guide decisions, overlooking its future evolution and uncertainty. In this paper, we propose a risk horizon profiling (RHP) module that incorporates a continuous, learnable potential field model for risk-aware trajectory prediction. The RHP module calculates the spatial-temporal proximity of surrounding objects to profile risk distributions across future horizons, which supports better trajectory prediction by adaptively identifying what human drivers perceive as critical moments. We evaluate our method on two datasets from different driving settings, highD for highway corridors and SHRP2 for urban streets, which cover diverse risk scenarios including safe, near-crash, and crash events. Compared to the baseline methods, our framework achieves a 25.0\% reduction in 5s RMSE on the highD dataset and a 29.1\% reduction in 5s minFDE on SHRP2. These results indicate strong performance for both short and long horizon prediction and robust generalization across highway and urban scenarios. The proposed method enables more realistic AV path planning and strategic selection, thereby supporting safer autonomous driving and more advanced driver-assistance systems. The source code for this work is available at: https://github.com/bilab-nyu/RHP

2606.00837 2026-06-02 cs.RO cs.LG 版本更新

Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning

粗到细的组合扩散用于长时域规划

Byoungwoo Park, Utkarsh A. Mishra, Jaemoo Choi, Juho Lee, Yongxin Chen

Affiliations * KAIST; Georgia Institute of Technology

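AI summary: Proposes Coarse-to-Fine Compositional Diffusion (CoFi), which first forms a global scaffold and then refines local detail, improving global coherence and local quality in long-horizon robotic planning, panoramic image generation, and long video generation while requiring 2-8x fewer denoiser evaluations.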

Comments Project page: https://cofi-diffusion.github.io

详情
AI中文摘要

扩散模型为生成结构化数据提供了强先验,但许多任务需要输出超出这些模型通常训练规模的范围。组合生成通过将来自预训练短时域先验的重叠局部计划组合成长时域输出来解决这一问题。然而,标准组合主要强制相邻局部计划之间的一致性,产生局部一致性而不直接指定完整组合的全局结构。因此,局部兼容的计划仍可能形成不合理的路线、任务序列或时间演化。现有方法通过重复传播局部一致性信号或添加推理时优化来提高全局连贯性,但随着局部计划数量或维度的增加,这些过程变得昂贵。我们提出粗到细组合扩散(CoFi),一种推理时采样器,将全局结构形成与局部细节细化分离。CoFi首先将局部去噪估计围绕共享的粗结构对齐,产生捕获长程任务级排列的全局骨架。然后将该骨架扩散到中间噪声水平,并使用相同的预训练局部先验去噪,在保留骨架诱导的全局连贯性的同时恢复局部精细结构。在长时域机器人规划、全景图像生成和长视频生成中,CoFi不仅比先前的组合基线提高了全局连贯性和局部样本质量,而且需要2-8倍更少的去噪评估次数。

英文摘要

Diffusion models provide strong priors for generating structured data, but many tasks require outputs beyond the scale on which these models are typically trained. Compositional generation addresses this by composing overlapping local plans from a pretrained short-horizon prior into a long-horizon output. However, standard composition primarily enforces agreement between neighboring local plans, yielding local consistency without directly specifying the global structure of the full composition. As a result, locally compatible plans may still form an implausible route, task sequence, or temporal evolution. Existing methods improve global coherence by repeatedly propagating local consistency signals or by adding inference-time optimization, but these procedures become expensive as the number or dimensionality of local plans increases. We propose Coarse-to-Fine Compositional Diffusion (CoFi), an inference-time sampler that separates global structure formation from local detail refinement. CoFi first aligns local denoised estimates around a shared coarse structure, producing a global scaffold that captures the long-range task-level arrangement. It then diffuses this scaffold to an intermediate noise level and denoises it with the same pretrained local prior, restoring local fine structure while preserving the scaffold-induced global coherence. Across long-horizon robotic planning, panoramic image generation, and long video generation, CoFi not only improves both global coherence and local sample quality over prior compositional baselines, but also requires 2-8x fewer denoiser evaluations.

2606.00773 2026-06-02 cs.RO 版本更新

SafeVLA-Bench: A Benchmark for the Success-Safety Gap in Vision-Language-Action Models

SafeVLA-Bench: 视觉-语言-动作模型中成功-安全差距的基准

Jialiang Fan, Weizhe Xu, Oleg Sokolsky, Insup Lee, Fanxin Kong

Affiliations * University of Notre Dame; University of Pennsylvania

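AI summary: Proposes SafeVLA-Bench, a post-hoc safety-evaluation framework based on Signal Temporal Logic that quantifies the safety violations VLA policies commit while completing tasks, exposing the gap between success and safety.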

Comments 27 pages, 5 figures

详情
AI中文摘要

视觉-语言-动作(VLA)基准衡量策略是否完成指定的操作任务,但二元成功可能隐藏与安全相关的轨迹行为:在施加过度接触、干扰旁观物体、使被持物体不稳定或进入机器人自接触的同时达到目标。我们提出了SafeVLA-Bench,一个用于现有基于模拟器的VLA基准的后验安全评估框架。它将任务感知的安全要求形式化为信号时序逻辑(STL)规范,并用两个不安全成功指标报告原生成功:Succ-But-Unsafe(SBU),即既成功又违反安全策略的滚动比例,以及Violation Severity Index(VSI),一个有界的最坏违规深度分数。我们在LIBERO和RoboCasa-365上实例化SafeVLA-Bench,评估了九个策略基准条目,涵盖桌面和厨房操作任务。高任务成功并不意味安全执行:高SR的桌面基线仍然有13%到15%的不安全情节率,而36%到56%的成功RoboCasa-365滚动违反了至少一个活跃安全条款。项目页面:https://safevla.org。

英文摘要

Vision-language-action (VLA) benchmarks measure whether a policy completes a requested manipulation task, but binary success can hide safety-relevant trajectory behavior: reaching the goal while applying excessive contact, disturbing bystander objects, destabilizing the held object, or entering robot self-contact. We present SafeVLA-Bench, a post-hoc safety-evaluation framework for existing simulator-based VLA benchmarks. It formalizes task-aware safety requirements as Signal Temporal Logic (STL) specifications and reports native success with two unsafe-success metrics: Succ-But-Unsafe (SBU), the fraction of rollouts that both succeed and violate safety, and Violation Severity Index (VSI), a bounded worst-violation depth score. We instantiate SafeVLA-Bench on LIBERO and RoboCasa-365, evaluating nine policy-benchmark entries across tabletop and kitchen manipulation tasks. High task success does not imply safe execution: high-SR tabletop baselines still leave 13 to 15 percent unsafe-episode rates, and 36 to 56 percent of successful RoboCasa-365 rollouts violate at least one active safety clause. Project page: https://safevla.org.
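As a rough illustration of how an STL-based check feeds the SBU metric, the sketch below scores a single hypothetical safety clause, "contact force always stays at or below a limit", using the standard quantitative semantics of the STL "always" operator (the minimum margin over time). The rollout format, the `force` signal, and the limit value are assumptions made for this example. VSI is not sketched because the abstract does not give its formula.

```python
def always_leq_robustness(signal, limit):
    """Robustness of G(signal <= limit): the smallest margin over the trace.

    A negative value means the clause is violated at some time step.
    """
    return min(limit - s for s in signal)

def succ_but_unsafe(rollouts, limit):
    """Fraction of all rollouts that both succeed and violate the safety clause."""
    flagged = sum(
        1
        for r in rollouts
        if r["success"] and always_leq_robustness(r["force"], limit) < 0.0
    )
    return flagged / len(rollouts)

if __name__ == "__main__":
    rollouts = [
        {"success": True, "force": [1.0, 2.0, 1.5]},   # success, safe
        {"success": True, "force": [1.0, 3.0, 2.0]},   # success, unsafe
        {"success": False, "force": [4.0, 1.0, 1.0]},  # failure, unsafe
        {"success": True, "force": [2.6, 0.5, 0.5]},   # success, unsafe
    ]
    print(succ_but_unsafe(rollouts, limit=2.5))
```

Because the check runs on logged traces, it can be applied after the fact to an existing benchmark's rollouts without changing the policy or the simulator's own success criterion.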

2606.00762 2026-06-02 cs.RO 版本更新

STEM: Semantic Target Search and Exploration using MAVs in Cluttered Environments

STEM: 杂乱环境中使用MAV的语义目标搜索与探索

Nikhil Sethi, Max Lodel, Laura Ferranti, Robert Babuška, Javier Alonso-Mora

Affiliations * Department of Cognitive Robotics; Delft University of Technology; CIIRC

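AI summary: Proposes a framework built on a semantically guided viewpoint planner that uses an MAV to minimize target search and exploration time in unstructured 3D environments, combining a combinatorial planner with an active perception pipeline for efficient semantic exploration.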

Comments Accepted to Autonomous Robots Journal. Nikhil Sethi and Max Lodel contributed equally

详情
AI中文摘要

自主目标搜索对于在应急响应和救援任务中部署微型飞行器(MAV)至关重要。现有方法要么专注于结构化环境中的2D语义导航(在复杂3D环境中效果较差),要么专注于杂乱空间中的机器人探索(通常缺乏高效目标搜索所需的语义推理)。本文通过提出一种新颖框架克服了这些限制,该框架利用语义引导的视点规划器,使用MAV在非结构化3D环境中最小化目标搜索和探索时间。具体来说,我们开发了一个组合规划器,通过优先考虑可能导向目标的视点来生成高效的语义探索计划。为了引导规划器朝向目标,开发了一个主动感知管道,将观察到的物体的语义优先级传播到相邻的前沿体素中,以计算前沿视点的语义信息增益。此外,我们展示了如何利用基于LLM的相似度分数作为我们管道的语义优先级输入。在两个不同模拟环境中的评估表明,所提方法通过快速找到目标同时保持合理的探索时间,始终优于基线方法。使用MAV的真实世界实验进一步证明了该方法处理实际约束(如有限电池寿命、小传感器范围和语义不确定性)的能力。

英文摘要

Autonomous target search is crucial for deploying Micro Aerial Vehicles (MAVs) in emergency response and rescue missions. Existing approaches focus either on 2D semantic navigation in structured environments, which is less effective in complex 3D settings, or on robotic exploration in cluttered spaces, which often lacks the semantic reasoning needed for efficient target search. This paper overcomes these limitations by proposing a novel framework that utilizes a semantically-guided viewpoint planner to minimize target search and exploration time in unstructured 3D environments using an MAV. Specifically, we develop a combinatorial planner that generates efficient semantic exploration plans by prioritizing viewpoints that are likely to lead to the target. To guide the planner towards the target, an active perception pipeline is developed that propagates semantic priorities of observed objects into neighboring frontier voxels for computing semantic information gains of frontier viewpoints. In addition, we demonstrate how LLM-based similarity scores can be leveraged as semantic priority input to our pipeline. Evaluations in two distinct simulation environments show that the proposed method consistently outperforms baselines by quickly finding the target while maintaining reasonable exploration times. Real-world experiments with an MAV further demonstrate the method's ability to handle practical constraints like limited battery life, small sensor range, and semantic uncertainty.

2606.00737 2026-06-02 cs.RO math.OC 版本更新

Beyond Pure Sampling: Hybrid Optimization Mechanisms for Non-Convex Model Predictive Control

超越纯采样:非凸模型预测控制的混合优化机制

Yuichiro Aoyama, Minchan Jung, Akash Ratheesh, Evangelos A. Theodorou

Affiliations * School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA, USA; Development Division, Komatsu Ltd., Tokyo, Japan; Department of Electrical and Computer Engineering, Inha University, Incheon, Republic of Korea

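AI summary: Presents a dual-step optimization mechanism for non-convex model predictive control that combines a gradient-exploiting DDP phase with sampling from policies characterized by the inverse Hessian of the action-value function; across several robot navigation tasks it achieves higher and more stable success rates than pure sampling methods such as MPPI.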

Comments 28 pages, 13 figures

详情
AI中文摘要

本文研究了使用最大熵微分动态规划(ME-DDP)框架的非凸模型预测控制(MPC)的优化机制。由非线性动力学、多个障碍物等引起的非凸代价景观仍然是机器人学中的一个基本挑战,其中基于梯度的方法经常收敛到次优局部最小值。我们展示了一种旨在克服这些陷阱的双步优化机制:(1)使用DDP利用代价景观梯度的初始阶段,随后(2)通过从由动作-价值函数的逆Hessian表征的策略中采样来破坏优化。我们对三种ME-DDP变体:单峰高斯ME-DDP、多峰高斯ME-DDP和Stein变分DDP的采样机制进行了严格分析。此外,通过在杂乱环境下的四个机器人系统的导航任务,我们对三种ME-DDP变体与确定性DDP以及最成功的基于采样的方案之一——模型预测路径积分(MPPI)控制(具有与ME-DDP对应的三种策略参数化和更新律)进行了广泛的基准测试。结果表明,在代价景观相对简单且局部信息足够代表性的低维系统中,我们的框架始终优于MPPI。在高维系统中,MPPI有时能够发现激进的机动,使其比基于DDP的方法更快地引导系统,而我们的方法保持更高、更稳定的成功率。最后,我们通过四旋翼飞行器在密集非凸障碍场中导航的硬件实验验证了该框架的实际功效,确认了所提框架在实际部署中的鲁棒性。

英文摘要

This paper investigates the optimization mechanisms of non-convex Model Predictive Control (MPC) using the Maximum Entropy Differential Dynamic Programming (ME-DDP) framework. Navigating non-convex cost landscapes induced by nonlinear dynamics, multiple obstacles, and similar factors remains a fundamental challenge in robotics, where gradient-based methods frequently converge to suboptimal local minima. We demonstrate a dual-step optimization mechanism designed to overcome these traps: (1) an initial phase that uses DDP to exploit the gradient of the cost landscape, followed by (2) disruption of the optimization via sampling from policies characterized by the inverse Hessian of the action-value function. We provide a rigorous analysis of this sampling mechanism for three ME-DDP variants: Unimodal Gaussian ME-DDP, Multimodal Gaussian ME-DDP, and Stein Variational DDP. Furthermore, on navigation tasks with four robotic systems in cluttered environments, we extensively benchmark the three ME-DDP variants against deterministic DDP and against Model Predictive Path Integral (MPPI) control, one of the most successful sampling-based schemes, with three policy parameterizations and update laws that correspond to those of the ME-DDP variants. The results show that in low-dimensional systems, where the cost landscapes are relatively simple and local information is sufficiently representative, our framework consistently outperforms the MPPI variants. In high-dimensional systems, MPPI can occasionally discover aggressive maneuvers that enable it to steer the systems faster than DDP-based methods, whereas our method maintains a higher, more stable success rate. Finally, we validate the practical efficacy of the framework through hardware experiments with a quadrotor navigating a dense, non-convex obstacle field, confirming the robustness of the proposed framework for real-world deployment.
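The phrase "policies characterized by the inverse Hessian of the action-value function" can be unpacked under two assumptions that the abstract does not state explicitly: the action-value function $Q$ is expanded to second order around the nominal trajectory, as in DDP, and the maximum-entropy policy is a Boltzmann distribution with temperature $\alpha$. The symbols below are illustrative, and $Q_{uu}$ is assumed positive definite.

$$
\pi(\delta u \mid \delta x) \propto \exp\!\left(-\tfrac{1}{\alpha}\, Q(\delta x, \delta u)\right),
\qquad
Q(\delta x, \delta u) \approx Q_0 + Q_x^\top \delta x + Q_u^\top \delta u + \tfrac{1}{2}\,\delta x^\top Q_{xx}\,\delta x + \delta u^\top Q_{ux}\,\delta x + \tfrac{1}{2}\,\delta u^\top Q_{uu}\,\delta u
$$

$$
\Longrightarrow \quad
\pi(\delta u \mid \delta x) = \mathcal{N}\!\left(\delta u;\; -Q_{uu}^{-1}\left(Q_u + Q_{ux}\,\delta x\right),\; \alpha\, Q_{uu}^{-1}\right)
$$

The mean is the familiar DDP feedforward and feedback correction, while the covariance $\alpha\, Q_{uu}^{-1}$ spreads samples widely along directions in which the cost is locally flat and narrowly along directions of high curvature.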

2606.00730 2026-06-02 cs.RO 版本更新

Infeasible optimization problems and the hierarchical augmented Lagrangian method in imitation learning

模仿学习中的不可行优化问题与分层增广拉格朗日方法

Roland Andrews, Justin Carpentier, Ajay Sathya

Affiliations * University of Cambridge

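AI summary: To address unstable training caused by infeasible constraints in imitation learning, proposes a remedy based on the augmented Lagrangian method that drives the policy toward the solution of a closest-feasible constrained problem, illustrated on a toy driving example.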

详情
AI中文摘要

模仿学习(IL)是训练复杂机器人策略的有效方法。最近的研究将硬约束引入模仿学习优化问题,以确保所学策略的安全性、稳定性和鲁棒性。然而,我们认为这些约束有时是不可行的,这可能导致不稳定或困难的训练动态。我们基于不可行设置下增广拉格朗日方法的最新理论结果,研究了一种针对此类情况的简单补救措施。我们表明,我们的方法将所学策略引导至具有理想属性的最近可行约束IL问题的解。该方法在一个具有总加速度约束和行人安全约束的玩具驾驶示例中进行了说明,该设置中不可行性自然出现,同时仍允许安全的所学策略。

英文摘要

Imitation learning (IL) is an effective approach to train complex robotics policies. Recent works have introduced hard constraints into imitation-learning optimization problems to ensure safety, stability, and robustness of the learned policy. However, we argue that these constraints are sometimes infeasible, which can lead to unstable or difficult training dynamics. We study a simple remedy for such situations based on recent theoretical results on the augmented Lagrangian method in infeasible settings. We show that our approach drives the learned policy toward the solution of a closest-feasible constrained IL problem with desirable properties. The method is illustrated on a toy driving example with a total-acceleration constraint and pedestrian-safety constraints, a setting in which infeasibility can naturally arise while still allowing a safe learned policy.
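The behavior of the augmented Lagrangian method under infeasibility can be seen on a deliberately tiny problem, unrelated to the paper's driving example: minimize (x - 2)^2 subject to the contradictory equality constraints x = 1 and x = -1. The sketch below is an assumption-laden toy with a fixed penalty `rho` and a closed-form inner minimization. In this toy the iterate settles at x = 0, the point that minimizes the squared constraint violation, while the individual multipliers grow without bound.

```python
def alm_infeasible(rho=1.0, iters=50):
    """Augmented Lagrangian iterations on an infeasible toy problem.

    Objective: (x - 2)^2. Constraints: x - 1 = 0 and x + 1 = 0 (jointly infeasible).
    """
    lam1 = 0.0
    lam2 = 0.0
    x = 0.0
    for _ in range(iters):
        # Setting the derivative of the augmented Lagrangian to zero:
        # 2(x - 2) + lam1 + lam2 + rho(x - 1) + rho(x + 1) = 0
        x = (4.0 - lam1 - lam2) / (2.0 + 2.0 * rho)
        # Standard first-order multiplier updates.
        lam1 += rho * (x - 1.0)
        lam2 += rho * (x + 1.0)
    return x, lam1, lam2

if __name__ == "__main__":
    x, lam1, lam2 = alm_infeasible()
    print(x, lam1, lam2)
```

Only the sum of the two multipliers converges here, which hints at why a naive implementation that monitors multiplier magnitude can look unstable even though the primal iterate is well behaved.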

2606.00709 2026-06-02 cs.RO 版本更新

BEVIO: Efficient Bird's-Eye-View based Sparse-Update Visual-Inertial Odometry for Lunar Day-Night Navigation

BEVIO: 基于鸟瞰图的稀疏更新视觉-惯性里程计用于月球昼夜导航

Mohit Singh, Shehryar Khattak, Ashish Goel, Michael Paton, Kostas Alexis, Issa A. Nesnas

Affiliations * Jet Propulsion Laboratory, California Institute of Technology; Autonomous Robots Lab at the Norwegian University of Science and Technology

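AI summary: Proposes a bird's-eye-view-based image matching scheme that enables reliable visual-inertial odometry at very low visual update rates, suited to day and night navigation on resource-constrained lunar rovers.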

Comments Accepted at the 2026 IEEE International Conference on Robotics and Automation, Vienna

详情
AI中文摘要

视觉-惯性里程计(VIO)提供平滑、高频率的状态估计,已广泛应用于地面和行星应用的机器人导航。然而,其性能通常依赖于视觉更新的频率,这对于在极端资源约束和低帧率下运行的行星车来说是一个挑战。本文研究如何为月球车应用实现具有极稀疏视觉更新的可靠VIO,解决昼夜操作中自照明条件下特征关联特别困难的问题。我们提出了一种基于鸟瞰图(BEV)的图像匹配方案,该方案在较大的帧间运动和显著的视觉外观变化下仍能保持鲁棒性,实现更可靠的特征匹配。我们通过高保真照片级月球仿真和半比例月球车在加利福尼亚州普拉斯特城进行的长期昼夜部署实时机器人实验,广泛评估了我们提出的BEVIO方法。结果表明,我们的方法能够在低至0.25 Hz的视觉更新率下实现可靠的昼夜自照明穿越,突显了其在功耗和计算受限的月球车导航中的适用性。

英文摘要

Visual-Inertial Odometry (VIO) provides smooth, high-rate state estimates and has been widely used for robotic navigation in both terrestrial and planetary applications. However, its performance typically depends on the frequency of visual updates, which is a challenge for planetary rovers operating under extreme resource constraints and low frame rates. This work investigates enabling reliable VIO with very sparse visual updates for lunar rover applications, addressing both day and night-time operations, where feature associations become especially difficult under self-illumination conditions. We propose a Bird's Eye View (BEV)-based image matching scheme that remains robust to larger inter-frame motions and enables more reliable feature matching despite significant visual appearance changes. We extensively evaluate our proposed approach, BEVIO, through high-fidelity photorealistic lunar simulations and real-time robotic experiments conducted using a half-scale lunar rover in a long-term day-night deployment at Plaster City, CA, USA. The results demonstrate that our method enables reliable day and nighttime self-illuminated traverses at visual update rates as low as 0.25 Hz, underscoring its suitability for navigation on power- and compute-limited lunar rovers.

2606.00702 2026-06-02 cs.RO cs.AI 版本更新

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

塑造你的身体:用于多形态机器人设计的价值梯度

Nico Bohlinger, Jan Peters

Affiliations * Technical University of Darmstadt; Robotics Institute Germany (RIG); German Research Center for AI (DFKI); hessian.AI

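AI summary: Proposes turning generalist multi-embodiment value functions into reusable models that optimize robot designs through value gradients, without rerunning a reinforcement learning co-design loop for each robot.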

详情
AI中文摘要

我们提出将通用多形态价值函数转化为可复用的机器人设计模型。不是为每个机器人运行新的强化学习协同设计循环,而是首先在多种机器人设计上训练一个感知形态的策略和价值函数。训练后,冻结的价值函数被用作可微分的代理,通过价值梯度优化候选形态。我们在不同的机器人设计设置中评估了我们的方法,从受扰动的单个机器人到跨形态类别的保留机器人,使用在多达50个机器人和超过1100个连续形态参数的设计空间上训练的单个模型。除了优化完整形态,我们还展示了价值梯度可以识别限制性能的设计和控制参数,从而能够优化和分析新的机器人设计。

英文摘要

We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcement learning co-design loop for each robot, we first train an embodiment-aware policy and value function across many robot designs. After training, the frozen value function is used as a differentiable surrogate to optimize candidate embodiments through value gradients. We evaluate our approach across different robot design settings, from perturbed single robots to held-out robots across morphology classes, with single models trained on up to 50 robots and design spaces of over 1100 continuous embodiment parameters. Beyond optimizing complete embodiments, we show that value gradients can identify performance-limiting design and control parameters, enabling both the optimization and the analysis of new robot designs.
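The idea of using a frozen value function as a differentiable surrogate can be sketched as plain gradient ascent over design parameters. Everything below is a stand-in chosen for illustration: `surrogate_value` is a hand-written quadratic rather than a learned critic, the two design parameters are hypothetical, and central finite differences replace the automatic differentiation a real implementation would likely use.

```python
def surrogate_value(design):
    """Hypothetical stand-in for a frozen value function V(design)."""
    leg_length, body_width = design
    return -(leg_length - 0.3) ** 2 - 2.0 * (body_width - 0.5) ** 2

def numeric_grad(fn, point, eps=1e-5):
    """Central finite-difference gradient of fn at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2.0 * eps))
    return grad

def optimize_design(fn, start, lr=0.1, steps=200):
    """Gradient ascent on the design parameters; the value model stays fixed."""
    design = list(start)
    for _ in range(steps):
        g = numeric_grad(fn, design)
        design = [d + lr * gi for d, gi in zip(design, g)]
    return design

if __name__ == "__main__":
    print(optimize_design(surrogate_value, [1.0, 0.0]))
```

The same gradient vector also indicates which parameters the value is most sensitive to, which is one way to read the abstract's claim about identifying performance-limiting parameters.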

2606.00664 2026-06-02 cs.RO cs.CV 版本更新

SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models

SKIP: 用于高效具身世界模型的稀疏关键帧插值范式

Ziheng He, Yixiang Chen, Ning Yang, Zhanqian Wu, Qisen Ma, Yuan Xu, Jiabing Yang, Peiyan Li, Xiangnan Wu, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu, Yan Huang

Affiliations * UCAS; CASIA; NJU; GigaAI; THU; FiveAges

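AI summary: Proposes the Sparse Keyframe Interpolation Paradigm (SKIP), which identifies task-relevant keyframes, generates only those frames, and interpolates the missing frames from robot actions; on LIBERO it is 4.16x faster than a dense baseline and reduces aggregate FVD by 89.0%, with only a small drop in policy performance when the generated videos are used as training data.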

Comments 25 pages, 10 figures

详情
AI中文摘要

具身世界模型通过预测机器人动作如何影响周围场景,已成为机器人学中一种有前景的范式。然而,在像素空间中进行 rollout 推理在计算上仍然昂贵,因为长时程操作视频通常必须逐帧生成。这种成本不能通过不加区分地丢弃帧来轻易降低,因为下游策略依赖于对稀疏任务相关事件(如接近、接触、抓取和释放)的完整保留。为了解决这一挑战,我们提出了稀疏关键帧插值范式(SKIP),这是一种事件保留的稀疏到密集框架,避免了密集的逐帧生成。SKIP 首先通过利用机器人感知的多模态特征来识别任务相关的关键帧。然后,它仅用稀疏视频扩散模型合成这些关键帧。一个学习到的间隙预测器和一个动作条件插值器随后根据机器人动作重建缺失的间隔。在 LIBERO 上,SKIP 生成密集 rollouts 的速度比密集基线快 4.16 倍,同时提高了视觉保真度并将聚合 FVD 降低了 89.0%。重要的是,SKIP 生成的视频是有效的策略训练数据。即使它们完全替代真实演示,π_{0.5} 的成功率在 LIBERO 模拟中仅下降 1.3 个百分点,在真实机器人上下降 6.7 个百分点,而完全密集的逐帧生成则下降 48 到 58 个百分点。

英文摘要

Embodied world models have emerged as a promising paradigm in robotics by predicting how robot actions affect the surrounding scene. However, the rollout inference remains computationally expensive in pixel space, as long-horizon manipulation videos typically have to be generated frame by frame. This cost cannot be easily reduced by indiscriminately dropping frames, since downstream policies rely on complete preservation of sparse task-relevant events such as approach, contact, grasp, and release. To address this challenge, we propose Sparse Keyframe Interpolation Paradigm (SKIP), an event-preserving sparse-to-dense framework that avoids dense frame-by-frame generation. SKIP first identifies task-relevant keyframes by leveraging robot-aware multimodal features. It then synthesizes only these keyframes with a sparse video diffusion model. A learned gap predictor and an action-conditioned interpolator subsequently reconstruct the missing intervals according to the robot actions. On LIBERO, SKIP generates dense rollouts $4.16\times$ faster than a dense baseline while improving visual fidelity and reducing aggregate FVD by $89.0\%$. Importantly, SKIP-generated videos are effective policy-training data. Even when they fully replace real demonstrations, $π_{0.5}$ success drops only $1.3$ pp in LIBERO simulation and $6.7$ pp on the real robot, whereas fully dense frame-by-frame generation collapses by $48$ to $58$ pp.

2606.00637 2026-06-02 cs.RO 版本更新

Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion

全局-局部注意力分解用于人形感知运动中的地形编码

Shengcheng Fu, Yang Zhang, Zhanxiang Cao, Liyun Yan, Yizhi Chen, Yunpeng Yin, Yue Gao

Affiliations * Tongji University; Shanghai Innovation Institute; Shanghai Jiao Tong University; Humanoid Robot (Shanghai) Co., Ltd.

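AI summary: Proposes Global-Local Attention Decomposition (GLAD), a coarse-to-fine encoder that separates global terrain awareness from local foothold selection, enabling robust humanoid locomotion on sparse-foothold terrain and in constrained environments.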

详情
AI中文摘要

尽管强化学习显著推进了人形运动,感知策略在稀疏立足点地形和受限环境中仍然存在困难。在这些场景中成功需要广泛的地形感知和精确的立足点选择,而传统编码器常常纠缠这两种感知角色。为了解决这一挑战,我们提出了用于人形运动地形编码的全局-局部注意力分解(GLAD)。通过基于机器人中心高程图的粗到细编码器实现,GLAD明确分离了这些目标:全局注意力分支利用注意力池化总结周围地形上下文,而状态条件局部注意力分支稀疏化并编码精确的立足点相关几何。这种显式注意力分解防止了细粒度空间线索的稀释,同时减少了训练开销。实验表明,GLAD能够在具有挑战性的间隙、踏脚石和楼梯上实现可靠运动。此外,学习到的策略表现出涌现的地形响应行为,在简单速度指令下自主跟随狭窄路径并避开障碍物,无需显式导航规划器。在搭载机载LiDAR的Unitree G1人形机器人上的实际部署中,所提方法在多种稀疏立足点和障碍物丰富领域实现了鲁棒的零样本仿真到现实迁移。

英文摘要

Although reinforcement learning has significantly advanced humanoid locomotion, perceptive policies still struggle on sparse-foothold terrain and constrained environments. Success in these scenarios requires both broad terrain awareness and precise foothold selection, two perceptual roles that conventional encoders often entangle. To address this challenge, we propose Global-Local Attention Decomposition (GLAD) for terrain encoding in humanoid locomotion. Realized by a coarse-to-fine encoder over a robot-centric elevation map, GLAD explicitly separates these objectives: a global attention branch utilizes attention pooling to summarize the surrounding terrain context, while a state-conditioned local attention branch sparsifies and encodes precise foothold-relevant geometry. This explicit attention decomposition prevents the dilution of fine-grained spatial cues while reducing training overhead. Experiments demonstrate that GLAD enables reliable locomotion over challenging gaps, stepping stones, and stairs. Furthermore, the learned policy exhibits emergent terrain-responsive behaviors, autonomously following narrow paths and avoiding obstacles under simple velocity commands without explicit navigation planners. In real-world deployment on a Unitree G1 humanoid robot using onboard LiDAR, the proposed method achieves robust zero-shot sim-to-real transfer across diverse sparse-foothold and obstacle-rich domains.

2606.00576 2026-06-02 cs.RO 版本更新

Dynamic Resilient Spatio-Semantic Memory with Hybrid Localization for Mobile Manipulation

面向移动操作的动态弹性时空语义记忆与混合定位

Zhijie Yan, Shufei Li, Ze Zhang, Xin Liu, Yuhang Zheng, Zuoxu Wang

Affiliations * School of Mechanical Engineering and Automation, Beihang University; Department of Systems Engineering, City University of Hong Kong; School of Computing, National University of Singapore

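AI summary: Proposes the DREAM framework, which builds an online spatio-semantic voxel memory and combines Redundancy-Aware Memory Pruning with hybrid localization for mobile manipulation in dynamic indoor environments without a pre-built map, raising long-horizon task success rates from 40%-60% with DynaMem to 55%-70%.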

Comments Code, CAD model, and real-robot demonstrations are available at https://bjhyzj.github.io/dream-web

详情
AI中文摘要

动态室内环境中的可靠移动操作需要一种场景表示,该表示在环境变化时保持几何一致性、可语义查询且计算量可控。现有系统通常依赖预建地图、静态场景假设或高精度相机位姿,当目标物体被重新放置或位姿估计被修正时,可能导致场景信息过时或错位。本文提出DREAM,一个真实机器人移动操作框架,它集成感知、记忆、定位、导航和操作,在无预建地图的未知室内环境中运行。DREAM通过由LiDAR-惯性-视觉SLAM后端注册的RGB-D观测构建在线时空语义体素记忆。它进一步引入位姿图感知的冗余感知记忆剪枝(RMP),在位姿修正后更新历史观测,同时保持长时观测历史有界。对于目标定位和重新获取,DREAM结合语言条件3D检索、开放词汇图像检测和基于多模态大语言模型的语义验证。在四个动态室内实验室场景中的真实机器人实验表明,DREAM将长时任务成功率从DynaMem的40%-60%提升至55%-70%,同时在各场景中保持0.37-0.63 GB的内存占用和0.43-0.53秒的在线记忆更新时间。

英文摘要

Reliable mobile manipulation in dynamic indoor environments requires a scene representation that remains geometrically consistent, semantically queryable, and computationally bounded as the environment changes. Existing systems often rely on pre-built maps, static-scene assumptions, or highly accurate camera poses, which can lead to stale or misaligned scene information when target objects are relocated or pose estimates are corrected. This paper presents DREAM, a real-robot mobile manipulation framework that integrates perception, memory, localization, navigation, and manipulation in previously unseen indoor environments without a pre-built map. DREAM constructs an online spatio-semantic voxel memory from RGB-D observations registered by a LiDAR-inertial-visual SLAM backend. It further introduces pose-graph-aware Redundancy-Aware Memory Pruning (RMP) to update historical observations after pose corrections while keeping long-horizon observation history bounded. For target localization and reacquisition, DREAM combines language-conditioned 3D retrieval, open-vocabulary image detection, and multimodal large language model based semantic verification. Real-robot experiments in four dynamic indoor laboratory scenes show that DREAM improves long-horizon task success rates from 40%-60% with DynaMem to 55%-70%, while maintaining a memory footprint of 0.37-0.63 GB and an online memory-update time of 0.43-0.53 s across scenes.

2606.00552 2026-06-02 cs.OS cs.DC cs.NI cs.RO cs.SY eess.SY 版本更新

Edge-Based QoS-Aware Adaptive Task Placement: A Closed-Loop Control in Multi-Robot Systems

基于边缘的QoS感知自适应任务放置:多机器人系统中的闭环控制

Thien Tran, Jonathan Kua, Thuong Hoang, Minh Tran, Honghao Lyu, Jiong Jin

Affiliations * Deakin University; RMIT University; Zhejiang University; Swinburne University of Technology

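AI summary: Proposes a QoS-aware Adaptive Task Placement (ATP) controller that scores placements with a multi-metric cost in a closed loop and dynamically switches task placement on a shared edge node to reduce tail latency and deadline violations.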

Comments 6 pages, 2 figures, 1 algorithm, accepted as a regular paper at the 24th IEEE International Conference on Industrial Informatics (INDIN), 26-29 July, 2026, Melbourne, Australia

详情
AI中文摘要

多机器人系统(MRS)越来越多地将计算密集型感知任务卸载到边缘节点,以满足严格的时间敏感服务质量(QoS)约束。然而,共享边缘节点上的静态任务编排可能因网络延迟、抖动和边缘资源争用而严重降低QoS。我们使用Raspberry Pi节点构建了一个以边缘为中心的MRS试验平台,评估了三种模式下的相机到机械臂流水线:本地执行、静态卸载和QoS感知的自适应任务放置(ATP)控制器。ATP通过两秒控制窗口内的多指标成本(归一化延迟、CPU利用率和切换开销)对候选放置进行评分。该闭环视觉伺服试验平台配备了亚毫秒级时钟同步、网络仿真以及跨节点的多指标详细监控,以捕获真实抖动。在计算压力和网络故障场景下的实验结果表明,静态边缘卸载降低了板载CPU负载,但放大了尾延迟和截止时间违规。相比之下,QoS感知的ATP控制器通过基于测量延迟和利用率阈值切换任务放置,持续降低了截止时间违规和尾延迟。总体而言,结果将ATP定位为MRS的实用边缘侧控制原语,并为云-边缘机器人部署在更广泛的云-雾自动化中提供了具体设计指南,同时激励了面向工业信息物理系统的QoS感知多目标工作负载编排。

英文摘要

Multi-robot systems (MRS) increasingly offload compute-intensive perception tasks to edge nodes to meet strict time-sensitive Quality-of-Service (QoS) constraints. However, static task orchestration on a shared edge node can severely degrade QoS due to network latency, jitter, and edge-resource contention. We present a pilot edge-centric MRS testbed using Raspberry Pi nodes to evaluate a camera-to-manipulator pipeline under three modes: local execution, static offloading, and a QoS-aware Adaptive Task Placement (ATP) controller. ATP scores candidate placements using a multi-metric cost (normalized latency, CPU utilization, and switching overhead) over two-second control windows. The closed-loop visual servoing testbed is instrumented with sub-millisecond clock synchronization, network emulation, and detailed monitoring of multiple metrics across nodes to capture realistic jitter. Experimental results under compute-stress and network-fault scenarios show that static edge offloading reduces on-board CPU load but amplifies tail latency and deadline misses. In contrast, the QoS-aware ATP controller, by switching task placement based on measured latency and utilization thresholds, consistently lowers deadline violations and tail latency. Overall, the results position ATP as a practical edge-side control primitive for MRS and provide concrete design guidelines for Cloud-Edge Robotics deployments within the broader context of cloud-fog automation, while motivating QoS-aware multi-objective workload orchestration for industrial cyber-physical systems.
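The multi-metric scoring step can be sketched as a weighted sum evaluated once per control window. The abstract names the three cost terms but not their weights or normalization, so the weights, the deadline-based latency normalization, the unit switching penalty, and the field names below are all assumptions made for illustration.

```python
def placement_cost(latency_ms, cpu_util, switching, deadline_ms, weights=(0.6, 0.3, 0.1)):
    """Weighted cost of one candidate placement for the current control window.

    latency_ms: measured end-to-end latency, normalized here by the deadline.
    cpu_util:   CPU utilization in [0, 1] on the node that would run the task.
    switching:  True if choosing this placement means migrating the task.
    """
    w_latency, w_cpu, w_switch = weights
    return (
        w_latency * (latency_ms / deadline_ms)
        + w_cpu * cpu_util
        + w_switch * (1.0 if switching else 0.0)
    )

def choose_placement(current, window_stats, deadline_ms):
    """Return the lowest-cost placement name given per-placement window statistics."""
    costs = {
        name: placement_cost(s["latency_ms"], s["cpu"], name != current, deadline_ms)
        for name, s in window_stats.items()
    }
    return min(costs, key=costs.get)

if __name__ == "__main__":
    # Network fault: edge latency far exceeds the 50 ms deadline, so the task moves back.
    stats = {
        "local": {"latency_ms": 40.0, "cpu": 0.9},
        "edge": {"latency_ms": 120.0, "cpu": 0.3},
    }
    print(choose_placement("edge", stats, deadline_ms=50.0))
```

The switching term acts as hysteresis: a candidate must beat the current placement by more than the migration penalty, which discourages oscillating between nodes from one window to the next.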

2606.00550 2026-06-02 cs.HC cs.ET cs.RO 版本更新

A Four-Tier Communication Architecture and Sim-to-Real Validation of a Graphical Open-Source Platform for Robotic Engineering Education

用于机器人工程教育的四层通信架构与图形化开源平台的仿真到现实验证

Thien Tran, Khang Duong, Minh Tran, Jonathan Kua, Thuong Hoang, Jiong Jin

发表机构 * Deakin University(迪肯大学) RMIT University(皇家墨尔本理工大学) Swinburne University of Technology(斯威本科技大学)

AI总结 针对大学实验室中机械臂教育规模化面临的商业数字孪生成本高和ROS门槛高的问题,提出一种四层通信架构,基于图形化开源平台(GOSP)实现虚拟环境与物理机器人的数据桥接,并通过仿真到现实验证其硬件无关的可行性。

Comments 4 pages, 4 figures, accepted as a Work-in-Progress (WiP) paper at the 24th IEEE International Conference on Industrial Informatics (INDIN), 26-29 July, 2026, Melbourne, Australia

详情
AI中文摘要

在大学实验室中规模化开展真实的机械臂教育面临一个结构性难题:商业数字孪生通常成本高昂且脚本僵化,而开源机器人中间件(ROS)对新手来说存在陡峭的技术和语法门槛。为解决这一后勤和教育上的摩擦,本工作进展(WiP)论文提出了一种可扩展的四层通信架构,专为可持续的机器人课程设计。我们的研究不关注软件应用设计,而是考察桥接视觉概念环境与物理机器人端点所需的基础数据交换机制,并以图形化开源平台(GOSP)作为基础实例化。本WiP详细介绍了该框架的技术集成,包括3D视觉骨架建模与强大的ROS中间件后端,重点阐述了复杂通信例程的序列化、路由和封装。使用多轴空间轨迹进行的初步仿真到现实验证表明,封装这些通信管道提供了一条足够保真度的硬件无关路径。通过桥接虚拟设计与物理执行,该架构蓝图为工程教育提供了可行的基础设施。

英文摘要

The persistent challenge in scaling authentic manipulator education within university laboratories is a structural dichotomy: commercial digital twins are often cost-prohibitive and rigidly scripted, whereas open-source robotics middleware (ROS) imposes steep technical and syntax barriers for novices. To resolve this logistical and educational friction, this Work-in-Progress (WiP) paper proposes a scalable four-tier communication architecture tailored for sustainable robotic curricula. Rather than focusing on software application design, our study examines the underlying data exchange mechanisms required to bridge visual conceptual environments with physical robotic endpoints, utilizing the Graphical Open-Source Platform (GOSP) as a foundational instantiation. This WiP details the framework's technical integration of 3D visual armature modeling with a robust ROS middleware backend, emphasizing the serialization, routing, and encapsulation of intricate communication routines. Preliminary sim-to-real validation using multi-axis spatial trajectories confirms that encapsulating these communication pipelines provides a hardware-agnostic pathway of sufficient fidelity. By bridging virtual design and physical execution, this architectural blueprint offers a viable infrastructure for engineering education.

2606.00537 2026-06-02 cs.RO 版本更新

PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking

PACE: 面向动作分块策略的相位感知分块执行方法

Junnan Nie, Jiayi Li, Jiachen Zhang, Junyi Lao, Chenghao Liu, Tianle Zhang, Songfang Huang

发表机构 * Peking University(北京大学) JD Explore Academy(京东探索研究院)

AI总结 提出PACE方法,通过在线预测动作分块中的低速过渡点作为重规划边界,自适应选择执行步长,无需重新训练即可提升机器人策略成功率。

Comments 21 pages, 7 figures, 6 tables. Preprint

详情
AI中文摘要

最近的视觉-语言-动作和基于扩散的机器人策略通常使用动作分块,其中每次策略查询预测一系列未来动作,机器人执行一个开环前缀后再重新查询。虽然这种接口改善了局部运动连续性,但部署时仍需选择执行步长:在获取新观测之前应执行每个预测分块的多少。然而,我们的实验表明,成功率强烈依赖于任务且相对于执行步长非单调,这使得单一恒定步长成为不可靠的部署规则。我们提出PACE(相位感知分块执行),一种无需训练的测试时执行方法,从预测分块本身在线选择执行步长。PACE通过识别预测速度剖面中的低速过渡点,利用操作轨迹的相位相关运动学结构,将其作为候选重规划边界。由于PACE仅使用预测的动作分块,因此即插即用,无需重新训练或访问策略内部。我们通过在仿真和真实机器人环境中的大规模评估验证了PACE。在50个RoboTwin2.0任务上,PACE将平均成功率从57.8%提升至64.2%。在双臂ALOHA和单臂Franka平台上的真实机器人实验中,PACE将平均任务得分从60.7提升至77.7,平均成功率从50.7%提升至70.4%。消融实验和轨迹级分析表明,PACE跨操作阶段自适应调整执行步长,在过渡附近缩短执行,同时在连贯运动中保持较长执行。

英文摘要

Recent vision-language-action and diffusion-based robot policies often use action chunking, where each policy query predicts a sequence of future actions and the robot executes an open-loop prefix before re-querying. While this interface improves local motion continuity, deployment still requires choosing the execution horizon: how much of each predicted chunk should be executed before acquiring a new observation. However, our experiments show that success is strongly task-dependent and non-monotonic with respect to the execution horizon, making a single constant horizon an unreliable deployment rule. We propose PACE (Phase-Aware Chunk Execution), a training-free test-time execution method that selects the execution horizon online from the predicted chunk itself. PACE exploits the phase-dependent kinematic structure of manipulation trajectories by identifying low-speed transition points in the predicted speed profile and using them as candidate replanning boundaries. Because PACE uses only the predicted action chunk, it is plug-and-play and requires no retraining or access to policy internals. We validate PACE through large-scale evaluations in both simulation and real-robot settings. On 50 RoboTwin2.0 tasks, PACE raises the average success rate from 57.8% to 64.2%. In real-robot experiments on bimanual ALOHA and single-arm Franka platforms, PACE improves the average task score from 60.7 to 77.7 and the average success rate from 50.7% to 70.4%. Ablations and rollout-level analyses show that PACE adapts execution horizons across manipulation phases, shortening near transitions while preserving longer execution during coherent motion.
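As a rough illustration of choosing an execution horizon from the predicted chunk alone, the sketch below scans the speed profile for the first low-speed local minimum and stops execution there. The relative threshold, the minimum horizon, and the function names are assumptions, not details from the paper, which may detect transitions differently.

```python
import math


def speed_profile(chunk):
    """Per-step speed: Euclidean distance between consecutive predicted actions."""
    return [math.dist(a, b) for a, b in zip(chunk, chunk[1:])]


def select_horizon(chunk, min_h=2, rel_thresh=0.2):
    """Return how many actions of the chunk to execute before re-querying.

    A step counts as a candidate replanning boundary when its speed is a local
    minimum and falls below rel_thresh times the peak speed of the chunk.
    Both parameters are hypothetical tuning knobs.
    """
    speeds = speed_profile(chunk)
    peak = max(speeds, default=0.0)
    if peak == 0.0:
        return len(chunk)  # no motion predicted: execute the whole chunk
    for i in range(max(min_h, 1), len(speeds) - 1):
        is_local_min = speeds[i] <= speeds[i - 1] and speeds[i] <= speeds[i + 1]
        if is_local_min and speeds[i] <= rel_thresh * peak:
            return i + 1  # execute up to and including the slow step
    return len(chunk)      # coherent motion: keep the full horizon


if __name__ == "__main__":
    # 1-D toy chunk: fast approach, brief slow phase, then fast motion again.
    chunk = [(0.0,), (1.0,), (2.0,), (3.0,), (3.05,), (3.1,), (4.0,), (5.0,)]
    print(select_horizon(chunk))
```

Because the rule reads only the predicted actions, it can wrap any chunking policy without retraining, which matches the plug-and-play property the abstract claims.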

2606.00519 2026-06-02 cs.RO 版本更新

DriveAnchor: Progressive Anchor-based Flow Learning for Autonomous Driving Planning

DriveAnchor: 用于自动驾驶规划的渐进式基于锚点的流学习

Limin Yan, Haoyun Tang, Yutao Qiu, Hongqing Liu, Haoyu Xu

发表机构 * Meituan Autonomous Driving(美团自动驾驶) Xi’an Jiaotong University(西安交通大学) Beijing Institute of Technology(北京理工大学)

AI总结 提出三阶段框架DriveAnchor,通过示范流预训练、引导流后训练和奖励精炼流微调,实现行为多样性、可控性和安全性,在200万场景中近距碰撞率降低89%,平均奖励提升32%。

详情
AI中文摘要

我们提出DriveAnchor,一个用于自动驾驶规划的三阶段框架,在可组合流水线中实现行为多样性、可控性和安全性。示范流预训练通过最远点采样构建的2398个轨迹形状词汇表替代无结构高斯先验,在词汇覆盖中结构化地奠定行为多样性基础。引导流后训练联合后训练一个能量场模块与流匹配(FM),仅以静态道路几何为条件,在流生成前将锚点重新定位到用户指定的走廊多边形,无需可微引导即可增加可控性;在第二阶段后,新的走廊预设只需更新能量场,无需重新训练FM。奖励精炼流微调应用零阶强化学习,使每个锚点的输出与避碰目标对齐:由于流匹配模型在单步模式下是确定性前馈网络,每个锚点唯一确定输出轨迹,将奖励优化简化为锚点空间中的方向搜索,无需对数似然计算或ODE到SDE转换。在约200万个保留驾驶场景上的评估表明,DriveAnchor将近距碰撞率降低89%,平均奖励提升32%,且模仿精度不下降,在NVIDIA Drive Orin上推理时间为2.06毫秒。DriveAnchor已通过真实车辆测试验证,确认其适用于生产部署。

英文摘要

We present DriveAnchor, a three-stage framework for autonomous driving planning that achieves behavioral diversity, controllability, and safety in a composable pipeline. Demonstration Flow Pretraining replaces the unstructured Gaussian prior with a vocabulary of 2,398 trajectory shapes constructed by farthest-point sampling, structurally grounding behavioral diversity in vocabulary coverage. Guided Flow Post-training jointly post-trains an Energy Field module with flow matching (FM), conditioning the Energy Field on static road geometry alone, to relocate anchors toward user-specified corridor polygons before flow generation, adding controllability without differentiable guidance; after Stage 2, new corridor presets require only Energy Field updates, not FM retraining. Reward-Refined Flow Fine-tuning applies zeroth-order reinforcement learning to align each anchor's output with collision-avoidance objectives: because the flow-matching model is a deterministic feedforward network in single-step mode, each anchor uniquely determines the output trajectory, reducing reward optimization to a direction search in anchor space without log-likelihood computation or ODE-to-SDE conversion. Evaluated on approximately 2 million held-out driving scenarios, DriveAnchor reduces near-range collision rates by 89% and improves mean reward by 32% without degradation in imitation accuracy, with 2.06 ms inference on NVIDIA Drive Orin. DriveAnchor has been validated through real-world vehicle testing, confirming its practicality for production deployment.
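The anchor vocabulary is described as being built by farthest-point sampling over trajectory shapes. A minimal greedy version is sketched below; representing each trajectory as a flattened list of waypoints, using Euclidean distance, and starting from index 0 are assumptions for illustration, since the abstract does not specify them.

```python
import math


def flatten(trajectory):
    """Turn a list of (x, y) waypoints into one flat feature vector."""
    return [coord for point in trajectory for coord in point]


def farthest_point_sampling(trajectories, k):
    """Greedily pick k trajectory indices that are mutually far apart.

    Starts from index 0 (an arbitrary choice) and repeatedly adds the
    trajectory whose distance to its nearest chosen anchor is largest.
    """
    feats = [flatten(t) for t in trajectories]
    chosen = [0]
    min_dist = [math.dist(f, feats[0]) for f in feats]
    while len(chosen) < min(k, len(feats)):
        nxt = max(range(len(feats)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(f, feats[nxt])) for d, f in zip(min_dist, feats)]
    return chosen


if __name__ == "__main__":
    straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    nearly_straight = [(0.0, 0.0), (1.0, 0.01), (2.0, 0.02)]
    left = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    right = [(0.0, 0.0), (1.0, -1.0), (2.0, -2.0)]
    print(farthest_point_sampling([straight, nearly_straight, left, right], k=3))
```

In the toy example the near-duplicate of the straight trajectory is skipped in favor of the two turning shapes, which is the coverage behavior a shape vocabulary needs.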

2606.00515 2026-06-02 cs.RO cs.AI cs.SY eess.SY 版本更新

PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation

PaCo-VLA: 用于富接触视觉-语言-动作操控的被动屏蔽柔顺先验

Haofan Cao, Zhaoyang Li, Zhichao You, Liang Guo, Tianrui Li

发表机构 * Southwest Jiaotong University(西南交通大学) University of Leeds(利兹大学)

AI总结 提出PaCo-VLA框架,通过被动屏蔽将VLA模型输出转化为任务级柔顺建议,并利用能量罐和边界检查防止无效预测绕过底层接触物理,实现安全精确的富接触操控。

Comments Under review, code will be available soon

详情
AI中文摘要

富接触操控既需要高层语义推理,也需要对高频接触动态的安全调节。虽然视觉-语言-动作(VLA)模型提供了前所未有的语义泛化能力,但其低速率输出缺乏在力敏感任务中直接控制执行器所需的可靠性。为弥合这一语义到控制的鸿沟,我们引入PaCo-VLA,一种被动屏蔽的柔顺先验,重新定义了VLA接口。PaCo-VLA不将直接电机指令托付给VLA,而是将网络输出视为任务级柔顺建议:语义绑定、任务阶段和导纳调度。一个高频、建议无关的被动屏蔽通过能量罐核算和边界检查来管理这些建议,防止无效、过时或未经验证的模型预测绕过底层接触物理。这种解耦架构还支持因果评估,将语义贡献与几何捷径分离。大量仿真和真实世界的连接器插入实验表明,PaCo-VLA在无屏蔽VLA基线上实现了卓越的精度,即使在对抗性柔顺偏移下也能保持零被动违规。该框架在导纳端口建立了一个可证明的采样被动运行时契约,并为在富接触领域部署基础模型提供了运行时接口。

英文摘要

Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.
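To make the idea of energy-tank accounting with boundary checks concrete, here is a toy shield for a single variable-stiffness axis. Everything specific in it is an assumption: the `Proposal` fields, the stiffness bounds, the staleness limit, and the rule that a stiffness change costs 0.5 * (k_new - k_old) * x^2 joules drawn from the tank. The paper's actual shield is not specified in the abstract.

```python
from typing import NamedTuple


class Proposal(NamedTuple):
    """A compliance proposal from the slow policy (fields are hypothetical)."""
    stiffness: float  # proposed stiffness in N/m
    age_s: float      # time since the proposal was produced


class PassivityShield:
    """Accepts a proposal only if it is fresh, in bounds, and affordable."""

    def __init__(self, tank_j=1.0, tank_max_j=2.0, k_min=50.0, k_max=800.0, max_age_s=0.5):
        self.tank_j = tank_j
        self.tank_max_j = tank_max_j
        self.k_min = k_min
        self.k_max = k_max
        self.max_age_s = max_age_s
        self.stiffness = k_min

    def step(self, proposal, displacement_m, dissipated_j):
        """Run one fast control tick and return the stiffness actually applied."""
        # Energy dissipated by damping refills the tank up to its cap.
        self.tank_j = min(self.tank_max_j, self.tank_j + max(0.0, dissipated_j))
        # Boundary checks: drop missing, stale, or out-of-range proposals.
        if proposal is None or proposal.age_s > self.max_age_s:
            return self.stiffness
        if not (self.k_min <= proposal.stiffness <= self.k_max):
            return self.stiffness
        # Raising stiffness at displacement x adds spring potential energy.
        injected = 0.5 * (proposal.stiffness - self.stiffness) * displacement_m ** 2
        if injected > self.tank_j:
            return self.stiffness  # the tank cannot pay for it: keep the old value
        self.tank_j -= max(0.0, injected)
        self.stiffness = proposal.stiffness
        return self.stiffness


if __name__ == "__main__":
    shield = PassivityShield(tank_j=0.1)
    print(shield.step(Proposal(400.0, 0.1), displacement_m=0.02, dissipated_j=0.0))
    print(shield.step(Proposal(800.0, 0.1), displacement_m=0.02, dissipated_j=0.0))
```

The shield never inspects how the proposal was generated, which mirrors the proposal-independent property described in the abstract.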

2606.00470 2026-06-02 cs.RO cond-mat.soft 版本更新

A passive universal grasping mechanism based on an everting shell

基于外翻壳体的被动通用抓取机构

Mythra V. S. Balakuntala, Safvan Palathingal, G. K. Ananthasuresh

发表机构 * Indian Institute of Science(印度科学研究院)

AI总结 提出一种基于弹性可变形双稳态壳体外翻的被动单片柔性抓取机构,通过梁段构成的抓取臂与外翻壳体协同工作,实现对任意形状刚性物体的包络抓取。

详情
AI中文摘要

概念化了一种基于弹性可变形双稳态壳体外翻的被动单片柔性抓取机构。它由梁段构成的抓取臂与外翻壳体协同工作。该抓取器能够抓取任意形状的刚性物体,最大尺寸和重量受限于机构设计。双稳态壳体在接触物体时外翻,使抓取臂包裹物体形成封闭空间。机构保持该构型直到再次被驱动,使壳体恢复原始构型,从而打开封闭空间释放物体。臂的刚度决定机构的有效载荷,臂的尺寸决定可抓取的最大物体。臂具有分布式柔性,可适应物体形状而不施加过大压力。

英文摘要

A passive monolithic compliant grasping mechanism that works based on the eversion of an elastically deformable bistable shell is conceptualized. It comprises grasping arms made of beam segments that work in conjunction with the everting shell. The grasper is capable of picking up a stiff object of any shape up to a maximum size and weight. The bistable shell everts upon contact with the object, enabling the grasping arms to envelop the object and form an enclosure. The mechanism then stays in that configuration until it is actuated again to turn the shell back to its original configuration, thereby opening the enclosure to release the object. The stiffness of the arms decides the payload of the mechanism. The size of the arms decides the largest object that can be grasped and held. The arms have distributed compliance so that they can conform to the shape of the object without applying undue force on it.

2606.00459 2026-06-02 cs.RO cs.SY eess.SY 版本更新

Adaptive PD Gains for Energy-Conscious Control in Physical Human-Robot Interaction

物理人机交互中节能控制的自适应PD增益

Danyal Saqib, Francisco Andrade Chavez, Marie Charbonneau

发表机构 * University of Calgary(卡尔加里大学) University of Waterloo RoboHub(滑铁卢大学RoboHub)

AI总结 提出一种自适应PD控制器,通过限制机器人动能和势能实现安全物理人机交互,并给出稳定性证明与实验验证。

详情
Journal ref
Proceedings of the 23rd Conference on Robots and Vision, 2026
AI中文摘要

柔顺力或力矩控制是常被研究以实现安全物理人机交互(pHRI)的方法。然而,这些方法存在局限性。力控制要求机器人配备外部力传感器以跟踪施加力的幅度和方向。力矩控制需要在每个关节进行力矩感知或估计。由于并非所有机器人都具备这些条件,基于能量的方法提供了一种有前景的替代方案。此类方法旨在通过限制机器人的机械能来实现安全的pHRI。当前利用基于能量方法的方案往往实现复杂,且部分可能需要进一步稳定性验证。因此,我们提出一种自适应比例-微分(PD)控制器,能够在任意给定限制下限制机器人的能量,以实现安全的pHRI。所提出的控制器可以同时限制机器人的动能和势能,并且控制器增益的行为可通过多种参数进行塑造,精确界定截止限制和锐度。我们为控制器构建了稳定性证明,并定义了确保控制器稳定性的条件。所提出控制器的行为和柔顺性在PAL Robotics的TALOS机器人上进行了仿真和硬件测试,验证了控制器预期的柔顺和能量限制行为。

英文摘要

Compliant force control and torque control are approaches often investigated to achieve safe physical human-robot interaction (pHRI). However, these approaches have limitations. Force control requires a robot to be equipped with external force sensors to track the amplitude and direction of applied forces. Torque control requires torque sensing or estimation in each joint. As this is not available on every robot, energy-based approaches offer a promising alternative. Such approaches aim to achieve safe pHRI by limiting the mechanical energy of the robot. Current schemes leveraging an energy-based approach tend to have a complex implementation, and some may require further stability verification. We hence propose an adaptive proportional-derivative (PD) controller that can keep a robot's energy below any given limit to achieve safe pHRI. The proposed controller can limit both the kinetic and potential energy of a robot, and the behaviour of the controller gains can be shaped using various parameters, defining precisely the cutoff limit and sharpness. We construct a stability proof for the controller and define a condition to ensure the controller's stability. The proposed controller's behaviour and compliance are tested on the TALOS robot from PAL Robotics both in simulation and on hardware, verifying the expected compliant and energy-limiting behaviour of the controller.

2606.00449 2026-06-02 cs.RO 版本更新

ROG-Grasp: Root-Oriented Geometry for Robotic Grasping and Placement

ROG-Grasp:面向根部的几何方法用于机器人抓取与放置

Zijian An, Augustus Sroka, Ran Yang, Bill Cai, Satoru Eto, Brian Poon, Kelvin Cai, Shijie Geng, Feng Liu, Yiming Feng, Lifeng Zhou

发表机构 * Department of Electrical and Computer Engineering, Drexel University(德雷塞尔大学电气与计算机工程系) Virginia Seafood Agricultural Research and Extension Center, and Department of Biological Systems Engineering, Virginia Tech(弗吉尼亚理工学院生物系统工程系和弗吉尼亚海鲜农业研究与推广中心) Amazon Store Foundation AI (SFAI)(亚马逊商店基础人工智能(SFAI))

AI总结 提出基于根部表面几何的ROG-Grasp框架,通过RGB-D感知估计农产品朝向,结合YOLO检测器和点云平面拟合生成稳定抓取姿态,在番茄和洋葱实验中实现高成功率与快速执行。

Comments 7 pages, 6 figures. Video: https://youtu.be/Ir2UtGODdMo

详情
AI中文摘要

朝向感知操作在采后农业加工中至关重要,其中农产品必须以一致的配置被抓取和放置。本文提出ROG-Grasp,一种基于几何的机器人抓取和放置框架,通过RGB-D感知从根部表面几何估计农产品朝向。使用基于YOLO的根部检测器和点云平面拟合来推断根部法线,从而生成稳定的抓取姿态和朝向约束的笛卡尔运动规划。在番茄和洋葱上的实验表明,在孤立和杂乱场景中均具有高成功率和稳定的执行时间。与视觉-语言-动作(VLA)策略相比,所提出的方法实现了更可靠、更准确的抓取完成,且执行速度更快。这些结果突显了几何驱动感知对于实际朝向控制操作任务的有效性。我们的论文视频可在网上获取:https://youtu.be/Ir2UtGODdMo。

英文摘要

Orientation-aware manipulation is essential in post-harvest agricultural processing, where produce must be grasped and placed in consistent configurations. This paper presents ROG-Grasp, a geometry-based robotic grasping and placement framework that estimates the produce orientation from root surface geometry using RGB-D perception. A YOLO-based root detector and point cloud plane fitting are used to infer the root normal, enabling stable grasp pose generation and orientation-constrained Cartesian motion planning. Experiments on tomatoes and onions demonstrate high success rates and stable execution time in both isolated and cluttered scenarios. Compared with vision-language-action (VLA) policies, the proposed method achieves more reliable and accurate grasp completion with faster execution. These results highlight the effectiveness of geometry-driven perception for practical orientation-controlled manipulation tasks. A video of our paper is available online at https://youtu.be/Ir2UtGODdMo.
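The plane-fitting step that infers a root normal can be sketched without external libraries. The version below is a generic least-squares fit, not the authors' implementation: it takes the 3-D points inside a detected root region, builds their scatter matrix, and extracts the direction of least variance by power iteration. The function name, the iteration count, and the convention of a camera at the origin are assumptions.

```python
import math


def plane_normal(points, iters=200):
    """Estimate the unit normal of the best-fit plane through 3-D points.

    Needs at least three non-collinear points. The normal is the eigenvector
    of the scatter matrix C with the smallest eigenvalue, found here as the
    dominant eigenvector of (trace(C) * I - C) by power iteration.
    """
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    scatter = [[sum(q[i] * q[j] for q in centered) for j in range(3)] for i in range(3)]
    trace = scatter[0][0] + scatter[1][1] + scatter[2][2]
    shifted = [[(trace if i == j else 0.0) - scatter[i][j] for j in range(3)]
               for i in range(3)]
    # Arbitrary start vector; it must not be exactly orthogonal to the normal.
    v = [0.3, 0.5, 0.8]
    for _ in range(iters):
        w = [sum(shifted[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate input (all points identical)
        v = [x / norm for x in w]
    # Orient the normal toward a camera assumed to sit at the origin.
    if sum(v[i] * centroid[i] for i in range(3)) > 0.0:
        v = [-x for x in v]
    return v


if __name__ == "__main__":
    # Flat patch 0.5 m in front of the camera, parallel to the image plane.
    patch = [(0.01 * x, 0.01 * y, 0.5) for x in range(5) for y in range(5)]
    print(plane_normal(patch))
```

In practice a robust variant such as RANSAC would likely be needed on noisy depth data; this sketch shows only the geometric core.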

2606.00418 2026-06-02 cs.RO cs.HC 版本更新

Literary Emotions in Motion: A Soft Robotics Installation for Tactile Storytelling

文学情感在运动中:用于触觉叙事的软体机器人装置

Carolina Silva-Plata, Abraham Villavicencio-Carmona, Miguel Silva Plata, Stefan Escaida, Ruben Fernandez

发表机构 * Department of Mechanical Engineering, University of Chile(智利大学机械工程系) Independent Researcher(独立研究员) Bolivian Catholic University(玻利维亚天主教大学) Institute of Engineering Sciences, University of O’Higgins(奥希金斯大学工程科学研究所)

AI总结 提出一种将叙事文本语义情感分析映射到软体气动模块可变刚度的交互装置,通过用户研究评估刚度与LED强度多感官耦合对情感感知的影响。

Comments 8 pages, 8 figures

详情
Journal ref
IEEE Robotics and Automation Magazine, 2026
AI中文摘要

软体机器人越来越多地在艺术语境中被探索,其中触觉交互为观众提供了超越视觉或听觉信号的具身参与。本作品展示了一个交互装置,将叙事文本的语义情感分析映射到软体气动模块的可变刚度。一个自然语言模型从预定义的六种情感中识别出两种主导情感,驱动七个六边形排列的软体执行器充气。中心执行器代表主要情感,而周围的执行器表达次要情感。我们开发并机械表征了称为软模块的硅胶执行器,其具有薄膜层,展示了这种形态控制如何扩展可实现的刚度范围,同时保持简单性和低成本制造。一项包含十名参与者的用户研究进一步评估了刚度和LED强度的多感官耦合如何影响情感感知。结果表明,伴随颜色变化的刚度调制可以支持软体机器人装置中具有情感意义和吸引力的触觉交互。

英文摘要

Soft robotics is increasingly explored in artistic contexts, where tactile interaction provides audiences with embodied engagement beyond visual or auditory signals. This work presents an interactive installation that maps semantic emotion analysis of narrative text into variable stiffness of soft pneumatic modules. A natural language model identifies two dominant emotions from a predefined set of six, driving the inflation of seven hexagonally arranged soft actuators. The central actuator represents the primary emotion, while the surrounding ones express the secondary. We develop and mechanically characterize silicone actuators, called soft modules, featuring a thin membrane layer, demonstrating how this morphological control expands the achievable stiffness range while preserving simplicity and low-cost fabrication. A user study with ten participants further evaluates how multisensory coupling of stiffness and LED intensity influences emotional perception. The results suggest that stiffness modulation accompanied by color change can support emotionally meaningful and engaging tactile interaction in soft robotic installations.

2606.00397 2026-06-02 cs.RO cs.SY eess.SY 版本更新

SoFiE: Soft Finger Exoskeleton for Intelligent Grasping

SoFiE: 用于智能抓取的软手指外骨骼

Magnus Malthe Sigsgaard Nielsen, Nicklas Nikolaj Grønvall, Xiaofeng Xiong, Saravana Prashanth Murali Babu

发表机构 * SDU Soft Robotics, SDU Biorobotics, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark (SDU)(SDU柔性机器人实验室、SDU生物机器人实验室、马士基麦金尼莫勒研究所、南丹麦大学)

AI总结 本文提出一种模块化软手指外骨骼SoFiE,采用3D打印柔性材料、肌腱驱动和集成触觉传感,实现轻量化、低轮廓的抓取辅助与智能感知。

详情
AI中文摘要

软体可穿戴机器人系统已成为辅助手部功能减退个体的有前景解决方案。本文提出SoFiE,一种模块化软手指外骨骼,旨在辅助抓取任务中的食指屈曲。该系统主要采用3D打印柔性材料制造,实现了轻量、低轮廓和模块化设计。驱动通过紧凑型直流电机驱动的肌腱机构实现,而被动伸展由柔性导电弹簧提供。该元件称为StretchSense,通过变形下的电阻变化也作为本体感受传感器。此外,引入了一种新颖的触觉传感方法MagSense,使用嵌入软指尖结构中的磁铁和磁力计对来估计接触力和物体柔顺性。该系统完全无线,并由嵌入式微控制器控制。此外,通过电机编码器反馈的驱动器级传感能够估计系统状态,为安全和自适应控制策略提供基础。实验验证表明,该系统能够提供可靠的姿态估计,区分不同刚度的材料,并在不同抓取任务中生成独特的传感器特征。本文详细介绍了所提出的外骨骼的设计、制造和传感概念,作为模块化、软体和辅助可穿戴机器人的概念验证。

英文摘要

Soft wearable robotic systems have emerged as a promising solution for assisting individuals with reduced hand function. This paper presents SoFiE, a modular soft finger exoskeleton designed to assist index-finger flexion during grasping tasks. The proposed system is primarily fabricated using 3D-printed flexible materials, enabling a lightweight, low-profile, and modular design. Actuation is achieved through a tendon-driven mechanism powered by a compact DC motor, while passive extension is provided by a compliant conductive spring. This element, termed StretchSense, also functions as a proprioceptive sensor by exhibiting resistance changes under deformation. Furthermore, a novel tactile sensing approach, MagSense, is introduced, using a magnet and magnetometer pair embedded in a soft fingertip structure to estimate contact force and object compliance. The system is fully untethered and controlled by an embedded microcontroller. In addition, actuator-level sensing through motor encoder feedback enables estimation of the system state, providing a foundation for safe and adaptive control strategies. Experimental validation demonstrates the capability of the system to provide reliable pose estimation, distinguish between materials with different stiffness, and generate distinct sensor signatures across different grasping tasks. This paper details the design, fabrication, and sensing concepts of the proposed exoskeleton as a proof of concept toward modular, soft, and assistive wearable robotics.

2606.00383 2026-06-02 cs.RO cs.LG cs.SY eess.SY 版本更新

Behavior Cloning of MPC for 3-DOF Robotic Manipulators

三自由度机械臂MPC的行为克隆

Theo Guegan, Dexter Wen Jie Teo

发表机构 * University of Waterloo(滑铁卢大学) Universite de Technologie de Compiègne(贡比涅技术大学) Nanyang Technological University(南洋理工大学) Polytechnique Montréal(蒙特利尔理工学院)

AI总结 针对MPC实时计算负担重的问题,采用行为克隆方法近似MPC策略,通过多种神经网络架构实现三自由度机械臂的实时控制,在宽松容差下推理延迟降低3倍,成功率84.98%。

Comments Accepted at the IEEE ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning (RL4IL), 6 pages excluding references

详情
AI中文摘要

虽然模型预测控制(MPC)提供了强大的稳定性和鲁棒性,但它给实时系统带来了显著的计算负担。本文研究了行为克隆在近似MPC策略以实时控制三自由度机械臂中的应用。我们提出了一个结合逆运动学与MPC的基线控制器,并评估了从经典回归算法到深度学习模型(包括深度MLP和RNN)的神经网络架构,以推导计算高效的替代策略。我们分析了泛化能力、稳定性考虑以及不同架构选择固有的权衡。我们的实证研究采用了在线和离线评估,以评估在准确性、计算效率和对原始MPC策略的忠实度方面的性能。结果表明,行为克隆可以有效减少三自由度机械臂MPC策略的计算负担,在宽松容差下推理延迟降低3倍,成功率达到84.98%。值得注意的是,我们发现静态架构优于时间变体,证实了瞬时状态观测对此任务的充分性。然而,在严格容差下我们观察到精度差距,这表明虽然行为克隆捕获了全局最优轨迹,但需要进一步研究以最小化终端稳态误差。

英文摘要

While Model Predictive Control (MPC) provides strong stability and robustness, it imposes a significant computational burden on real-time systems. This paper investigates the application of Behavior Cloning to approximate MPC policies for the real-time control of a 3-degree-of-freedom robotic manipulator. We present a baseline controller combining Inverse Kinematics with MPC and evaluate neural network architectures, ranging from classical regression algorithms to deep learning models including Deep MLPs and RNNs, to derive computationally efficient surrogate policies. We analyze generalization capabilities, stability considerations, and the trade-offs inherent in different architectural choices. Our empirical study employs both online and offline evaluations to assess performance regarding accuracy, computational efficiency, and fidelity to the original MPC policy. Our results demonstrate that Behavior Cloning can effectively reduce the computational burden of MPC policies for 3-DOF robotic manipulators, achieving a 3x reduction in inference latency with an 84.98% success rate under relaxed tolerances. Notably, we find that static architectures outperform temporal variants, confirming the sufficiency of instantaneous state observations for this task. However, we observe a precision gap under strict tolerances, which suggests that while Behavior Cloning captures the global optimal trajectory, further research is needed to minimize terminal steady-state error.

2606.00374 2026-06-02 cs.RO 版本更新

Constrained Whole-Body Tracking for Humanoid Robots

人形机器人的约束全身跟踪

Daniel Morton, Pranit Mohnot, Marco Pavone

发表机构 * Stanford University(斯坦福大学) NVIDIA Research(NVIDIA研究院)

AI总结 提出 ConstrainedMimic 框架,结合操作空间控制与控制障碍函数,在强化学习跟踪策略中实现实时约束满足,用于人形机器人全身运动跟踪与遥操作。

详情
AI中文摘要

强化学习的最新进展已展示出人形机器人令人印象深刻的全身灵活性,但确保安全性和满足约束(特别是训练后指定的约束)仍然是一个挑战。为此,我们提出了 ConstrainedMimic,一个利用全身运动学和动力学在 RL 跟踪策略中实时执行约束的控制框架。通过整合操作空间控制和控制障碍函数(CBF)的原理,我们能够满足对运动学参考运动和底层动力学的任意运行时约束。在(模拟的)Unitree G1 上使用学习策略进行的全身运动跟踪和遥操作实验中,我们展示了碰撞避免(包括机器人身体和外部障碍物)、关节限制和质心稳定性约束。通过保持与当前接触模式和跟踪目标一致,我们在约束激活时最小化地限制了策略的能力。我们的方法完全可微,可在 CPU、GPU 和 TPU 上运行,并能以高达 300-500 Hz 的频率部署。所有软件将在发表后免费提供。

英文摘要

Recent advances in reinforcement learning (RL) have demonstrated impressive whole-body agility for humanoid robots, yet ensuring safety and satisfying constraints -- particularly those specified after training -- remains a challenge. Towards this goal, we present ConstrainedMimic, a control framework that leverages whole-body kinematics and dynamics for real-time constraint enforcement within RL tracking policies. By integrating principles from operational space control and control barrier functions (CBFs), we enable the satisfaction of arbitrary runtime constraints on both the kinematic reference motion and the underlying dynamics. In whole-body motion-tracking and teleoperation experiments on a (simulated) Unitree G1 with a learned policy, we demonstrate collision avoidance (both with the robot body and external obstacles), joint limits, and center of mass stability constraints. By remaining consistent with the current contact mode and tracking objectives, we minimally restrict the capabilities of the policy when constraints are active. Our method is fully differentiable, runs on CPU, GPU, and TPU, and can be deployed at up to 300-500 Hz. All software will be freely available upon publication.
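The control-barrier-function idea can be shown on a toy problem that is far simpler than the whole-body setting of the paper. The sketch assumes a single-integrator point (the state derivative equals the command) that must stay outside a circular obstacle, with barrier h(x) = |x - c|^2 - r^2. With a single constraint grad_h . u >= -alpha * h, the minimally invasive correction to the nominal command has a closed form, so no QP solver is needed. The gain `alpha` and all names are illustrative.

```python
def cbf_filter(x, u_nom, center, radius, alpha=5.0):
    """Project a nominal velocity command onto the CBF-safe half-space.

    Solves  min |u - u_nom|^2  subject to  a . u >= b,
    where a = grad h(x) and b = -alpha * h(x), in closed form.
    """
    diff = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in diff) - radius ** 2  # h > 0 means outside the obstacle
    a = [2.0 * d for d in diff]                 # gradient of h at x
    b = -alpha * h
    violation = b - sum(ai * ui for ai, ui in zip(a, u_nom))
    a_sq = sum(ai * ai for ai in a)
    if violation <= 0.0 or a_sq == 0.0:
        return list(u_nom)                      # nominal command is already safe
    scale = violation / a_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


if __name__ == "__main__":
    # Point at (1, 0) commanded straight at an obstacle of radius 0.5 at the origin.
    print(cbf_filter((1.0, 0.0), (-5.0, 0.0), (0.0, 0.0), 0.5))
```

The filter leaves the command untouched whenever the constraint is inactive, which corresponds to the abstract's claim of minimally restricting the policy.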

2606.00355 2026-06-02 cs.RO 版本更新

FAIR^2 Drones: An AI-Ready Standard for Cross-Domain Wildlife Drone Datasets

FAIR^2 Drones:跨领域野生动物无人机数据集的AI就绪标准

Jenna Kline, Kilian Meier, Vandita Shukla, Edouard G. A. Rolland, Elena Iannino, Lucie Laporte-Devylder, Constanza Andrea Molina Catricheo, Blair Costelloe, Elizabeth Campolongo, Henrik S. Midtiby, Devis Tuia, Benjamin Risse, Ulrik P. S. Lundquist, Anders Lyhne Christensen, Fabio Remondino, Thomas Richardson, Tanya Berger-Wolf

发表机构 * The Ohio State University, Department of Computer Science and Engineering(俄亥俄州立大学计算机科学与工程系) School of Civil, Aerospace and Design Engineering, University of Bristol(布里斯托尔大学土木、航空航天与设计工程学院) 3D Optical Metrology (3DOM), Fondazione Bruno Kessler (FBK)(3D光学计量(3DOM),布鲁诺·凯斯勒基金会(FBK)) Computer Vision and Machine Learning Systems Group, Institute for Geoinformatics, University of Muenster(明斯特大学地理信息研究所计算机视觉与机器学习系统组) Unmanned Aerial Systems Center, University of Southern Denmark(南丹麦大学无人机系统中心) Department of Collective Behavior, Max Planck Institute of Animal Behavior(马克斯·普朗克动物行为研究所集体行为系) Department of Biology, University of Konstanz(康斯坦茨大学生物系) Department of Biology, University of Southern Denmark(南丹麦大学生物系)

AI总结 提出FAIR^2 Drones标准,通过整合FAIR和AI就绪数据框架并添加平台元数据和标注规范,使无人机数据集同时支持生态分析、机器人算法开发和计算机视觉基准测试。

详情
AI中文摘要

使用无人机收集动物生态数据需要大量的时间、专业知识和财务资源。然而,大多数现有数据集仅服务于单一研究社区,限制了跨学科重用。我们提出了一个统一的无人机数据集标准FAIR^2 Drones,该标准基于现有的FAIR和AI就绪数据框架,通过添加必要的平台元数据和标注规范,桥接了生态学、机器人和计算机视觉。我们的标准使数据集能够同时支持生态分析、机器人算法开发和计算机视觉基准测试。我们提供了开源验证工具、参考实现以及多模态扩展,将无人机图像与互补传感器(如相机陷阱、GPS和声学)连接起来。通过跨学科标准化元数据,该框架最大化了昂贵现场部署的科学投资回报,并加速了环境监测中的跨领域合作。

英文摘要

Animal ecology data collection using drones represents a substantial investment of time, expertise, and financial resources. Yet most existing datasets serve only a single research community, limiting interdisciplinary reuse. We propose a unified drone dataset standard, FAIR^2 Drones, that bridges ecology, robotics, and computer vision by building on existing FAIR and AI-ready data frameworks while adding essential platform metadata and annotation specifications. Our standard enables datasets to simultaneously support ecological analysis, robotics algorithm development, and computer vision benchmarking. We provide open-source validation tools, reference implementations, and multimodal extensions linking drone imagery with complementary sensors such as camera traps, GPS, and acoustics. By standardizing metadata across disciplines, this framework maximizes the scientific return on investment for costly field deployments and accelerates cross-domain collaboration in environmental monitoring.

2606.00318 2026-06-02 cs.RO cs.CV 版本更新

Belief Consistency Between Foundation-Model Evidence and Geometric Perception in Persistent Robotic Maps

持久机器人地图中基础模型证据与几何感知之间的信念一致性

Christoffer Heckman, Harel Biggie, Brendan Crowe, Nicholas Roy

发表机构 * Department of Computer Science, University of Colorado, Boulder(科罗拉多大学博尔德分校计算机科学系) Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology(麻省理工学院计算机科学与人工智能实验室)

AI总结 提出一种更新算子,通过每类校准提交门和每事件冲突丢弃窗口,解决基础模型语义通道与几何感知通道在持久地图中的矛盾,显著提升地图精度。

详情
AI中文摘要

自主机器人使用的持久地图越来越多地将一个断言特征良好的几何感知栈与一个产生语义声明但未校准可靠性的基础模型通道融合到同一场景中。当代建图系统通过将基础模型通道视为每个元素后验的额外投票者来集成这两个通道,但未针对其自身的每类可靠性进行校准,也没有机制在给定时刻标记两个通道相互矛盾的情况。我们提出了一种具有两个协作机制的更新算子:一个每类校准的提交门,以及一个每事件冲突丢弃窗口,该窗口拒绝提交在声明时刻与几何通道矛盾的基础模型声明。我们在KITTI-360和ScanNet上进行了评估,使用oracle几何通道(全景真值)和现成的在线语义分割器(Mask2Former)来展示真实世界性能。该算子生成的提交地图精度显著更高(KITTI中汽车提交精度99.7%对比仅校准算子的43.9%;平均每类IoU 0.522对比0.180),并且在更高精度下保留了比整体式组合VLM提示更多的组合真阳性。该框架在oracle和现成分割器几何通道上均达到部署质量,并且对基础模型替换具有不变性。

英文摘要

Persistent maps used by autonomous robots increasingly fuse a geometric perception stack whose assertions are well-characterized with a foundation-model channel that produces semantic claims without calibrated reliability about the same scene. Contemporary mapping systems integrate the two channels by treating the foundation-model channel as an additional voter into a per-element posterior, uncalibrated for its own per-class reliability and without machinery to flag when the two channels contradict each other at a given moment. We propose an update operator with two cooperating mechanisms: a per-class calibrated commit gate, and a per-event conflict-drop window that refuses to commit foundation-model claims contradicted by the geometric channel at the moment of the claim. We evaluate on KITTI-360 and ScanNet, with an oracle geometric channel (panoptic ground truth) and an off-the-shelf online semantic segmenter (Mask2Former) to demonstrate real-world performance. The operator produces substantially more accurate committed maps (KITTI is car commit precision 99.7% vs. 43.9% for the calibration-only operator; mean per-class IoU 0.522 vs. 0.180), and retains more compositional true positives at higher precision than a monolithic compositional VLM prompt. The framework operates at deployment quality across both oracle and off-the-shelf-segmenter geometric channels, and is invariant under foundation-model substitution.
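A hedged sketch of how the two mechanisms might compose is given below. The threshold table, the window length, and the representation of geometric conflicts as timestamps are all assumptions; the abstract describes the gate and the window only at a conceptual level.

```python
def should_commit(claim_class, score, claim_time_s,
                  class_thresholds, conflict_times_s, window_s=0.5):
    """Decide whether a foundation-model claim may be written into the map.

    class_thresholds: per-class calibrated score thresholds (hypothetical values).
    conflict_times_s: times at which the geometric channel contradicted
                      claims about this map element.
    """
    threshold = class_thresholds.get(claim_class)
    # Commit gate: unknown classes and under-threshold scores are rejected.
    if threshold is None or score < threshold:
        return False
    # Conflict-drop window: reject if geometry disagreed around the claim time.
    return not any(abs(claim_time_s - t) <= window_s for t in conflict_times_s)


if __name__ == "__main__":
    thresholds = {"car": 0.8, "bench": 0.6}
    print(should_commit("car", 0.9, 10.0, thresholds, conflict_times_s=[3.0]))
    print(should_commit("car", 0.9, 10.0, thresholds, conflict_times_s=[10.2]))
```

Keeping the gate per class lets a reliable class commit at a lower score than an unreliable one, while the window handles momentary disagreements that calibration alone cannot catch.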

2606.00313 2026-06-02 cs.RO cs.AI 版本更新

DRL-Based Pose Control for Double-Ackermann Robots Under Actuation Uncertainties

基于深度强化学习的双阿克曼机器人驱动不确定性下的位姿控制

Oussama Zaim, Mélodie Daniel, Aly Magassouba, Miguel Aranda, Olivier Ly

发表机构 * Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800(波尔多大学、法国国家科学研究中心、波尔多国立理工学院、LaBRI研究所、UMR 5800) School of Computer Science, University of Nottingham, UK(诺丁汉大学计算机科学学院) Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza(阿拉贡工程研究所(I3A)、萨拉戈萨大学)

AI总结 针对双阿克曼转向移动机器人在驱动不确定性下的控制问题,提出基于ManeuverNet框架的位姿控制扩展,采用sim-to-sim-to-real方法结合多环境DRL(SAC和CrossQ)学习鲁棒策略,显著缩小仿真到现实的性能差距。

Comments 6 pages, 4 figures, 2 tables. Accepted for Uncertainty in Open-World Robotics, an IEEE International Conference on Robotics & Automation (ICRA 2026) workshop

详情
AI中文摘要

由于仿真与现实动力学之间的差异,深度强化学习策略在实际机器人上的鲁棒部署仍然具有挑战性。我们针对双阿克曼转向移动机器人的机动问题处理这一问题,这类机器人因其非完整特性引入了额外约束。基于DRL框架ManeuverNet,我们将其目标从位置控制扩展到完整的位姿控制,从而产生更具挑战性的任务。我们进一步研究了驱动相关不确定性对策略迁移的影响。在扩展策略训练期间使用简化驱动模型可能导致泛化能力差,表现为在更严格的评估条件下,成功率从PyBullet中的100%下降到Gazebo中的25%。为解决这一限制,我们采用sim-to-sim-to-real方法,将在Gazebo中观察到的驱动效应纳入PyBullet训练环境。通过使用SAC和CrossQ的多环境DRL,我们学习到即使在建模不准确的情况下也能保持鲁棒的策略。该方法可以显著缩小不同仿真器之间的性能差距,在Gazebo中实现高达92%的成功率,并在更严格阈值下保持69%的成功率,且无需额外调整即可成功迁移到真实机器人。

英文摘要

Robust deployment of deep reinforcement learning (DRL) policies on real robots remains challenging due to discrepancies between simulation and real-world dynamics. We address this issue in the context of maneuvering with double-Ackermann-steering mobile robots, which introduce additional constraints due to their non-holonomic nature. Building upon the DRL framework ManeuverNet, we extend its objective from position control to full pose control, resulting in a more challenging task. We further investigate the impact of actuation-related uncertainties on policy transfer. The use of simplified actuation models during training of the extended policy can lead to poor generalization, shown by a success rate drop from 100% in PyBullet to 25% in Gazebo under stricter evaluation conditions. To address this limitation, we adopt a sim-to-sim-to-real approach, where actuation effects observed in Gazebo are incorporated into the PyBullet training environment. Using multi-environment DRL with SAC and CrossQ, we learn policies that remain robust despite modeling inaccuracies. This approach can significantly reduce the performance gap across simulators, achieving up to 92% success rate in Gazebo and maintaining 69% under stricter thresholds, with successful transfer to a real robot without additional tuning.

2606.00307 2026-06-02 cs.RO Updated version

ScaRF-SLAM: Scale-Consistent Reconstruction with Feed-Forward Models and Classical Visual SLAM

Yuhao Zhang, Yifu Tao, Frank Dellaert, Maurice Fallon

Affiliations * Oxford Robotics Institute, University of Oxford; College of Computing, Georgia Institute of Technology

AI summary Proposes a decoupled framework that uses classical feature-based SLAM for robust tracking and geometric foundation models only for mapping, achieving high-quality, consistent dense reconstruction through scale optimization and submap fusion, with a 10%-20% improvement in reconstruction precision on a building-scale dataset.

Comments 8 pages

Details
Abstract

Recent works have explored unifying SLAM with geometric foundation models (GFMs). However, directly using GFM predictions for tracking is highly sensitive to model capability and uncertainty, as geometric inaccuracies in the predictions can adversely affect pose estimation. To address this limitation, we present a decoupled framework that integrates classical feature-based SLAM with GFMs, which achieves higher quality and more consistent dense reconstruction. In brief, we use classical visual SLAM for robust low-latency tracking and use GFMs exclusively for mapping. By anchoring mapping to poses produced by the SLAM module and optimizing across depth scales, the proposed design avoids propagating inaccuracies from GFM predictions into pose estimation while imposing geometric constraints on the reconstruction. The system builds submaps from multiple posed keyframes and enforces scale consistency via lightweight frame and submap scale optimization. It also performs projection-based point cloud fusion within each submap, and updates submaps online to reflect trajectory updates from the feature-based SLAM. To evaluate the tracking and reconstruction performance of our method, we introduce a loop-rich, building-scale indoor dataset with accurate sensor trajectories and LiDAR ground truth. Experiments show that our approach achieves superior trajectory accuracy while improving reconstruction precision by 10%-20% over existing methods, with about 2 cm reconstruction error per 10 m chunk on the building-scale dataset. On large-scale outdoor datasets, it attains 10 cm error per 30 m chunk (w.r.t. LiDAR ground-truth models).
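
The idea of anchoring predicted depth to SLAM geometry through a per-frame scale can be illustrated with a closed-form least-squares fit. This is a minimal sketch under stated assumptions, not the authors' optimizer: the function name `fit_depth_scale` and the sample depths are hypothetical, and it assumes that sparse SLAM landmark depths are available for some pixels of a keyframe.

```python
def fit_depth_scale(pred_depths, slam_depths):
    """Return the scale s that minimizes sum((s * p - d)^2) over matched depths.

    pred_depths: depths predicted by a (hypothetical) geometric model, arbitrary scale.
    slam_depths: depths of the same points triangulated by SLAM, metric scale.
    """
    if len(pred_depths) != len(slam_depths) or not pred_depths:
        raise ValueError("need equally sized, non-empty depth lists")
    numerator = sum(p * d for p, d in zip(pred_depths, slam_depths))
    denominator = sum(p * p for p in pred_depths)
    if denominator == 0.0:
        raise ValueError("predicted depths are all zero")
    return numerator / denominator


# Hypothetical matched depths: the prediction is off by a constant factor.
pred = [1.0, 2.0, 4.0]
slam = [2.5, 5.0, 10.0]
scale = fit_depth_scale(pred, slam)
rescaled = [scale * p for p in pred]
```

A single scalar per frame is the simplest possible model; the abstract also mentions submap-level scale optimization, which this sketch does not attempt.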

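2606.00297 2026-06-02 eess.SY cs.RO cs.SY Updated version

Predicted-Flow Control Barrier Functions for Real-Time Safe Optimal Control

Amirsaeid Safari, Jesse B. Hoagg

Affiliations * Department of Mechanical and Aerospace Engineering, University of Kentucky

AI summary Proposes predicted-flow control barrier functions (P-CBFs), which generalize the CBF to a functional of a predicted flow and, combined with a terminal candidate P-CBF and a planning-time shift, provide a safety certificate over a finite prediction horizon and unify finite-horizon integral-cost optimization with safety certification.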

Details
Abstract

Control barrier functions (CBFs) provide real-time safety guarantees through pointwise conditions on the state. However, synthesizing a valid CBF is difficult and the resulting controllers are myopic. To address myopia, this article introduces predicted-flow control barrier functions (P-CBFs), which generalize the CBF from a function of the current state to a functional of a predicted flow under a parametrized control plan over a finite prediction horizon. For safety, a P-CBF can certify that the predicted flow is in a safe set over the entire prediction horizon. However, candidate P-CBFs suffer from the same challenge as candidate CBFs, namely, control constraints make it difficult to guarantee that the P-CBF is valid. This article resolves this challenge by introducing a terminal candidate P-CBF requiring that the predicted flow end in a backup safe set at the terminal time, and a planning-time shift that modulates the prediction horizon, providing an additional degree of freedom to ensure feasibility. The real-time control and the evolution of the control-plan parameter and planning-time shift are determined jointly by a single convex optimization that is guaranteed to be feasible and renders the associated safe set forward invariant. The resulting safe optimal flow control provides a safety certificate over the entire prediction horizon and unifies finite-horizon integral-cost optimization with safety certification. This optimization reduces to a quadratic program (QP) if the control constraints are a convex polytope. The QP implementation, termed FlowBarrier, is validated on a nonholonomic ground robot navigating a dense environment. FlowBarrier is compared to nonlinear model predictive control and two CBF-based safety filter methods across 100 trials, where FlowBarrier achieves the highest goal-reaching rate, zero safety violations, and the lowest computation time.
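
For context, the pointwise CBF safety filter that the abstract describes as myopic can be written in closed form when there is a single constraint. The sketch below is the standard textbook filter, not P-CBF or FlowBarrier. It assumes a single-integrator model (state derivative equals the control), one circular obstacle, and a linear class-K function with gain `gamma`; all names and numbers are hypothetical.

```python
def cbf_filter(x, u_nom, center, radius, gamma=1.0):
    """Minimally modify u_nom so that dh/dt + gamma * h >= 0.

    Barrier: h(x) = ||x - center||^2 - radius^2 (h >= 0 means safe).
    For a single integrator, dh/dt = a . u with a = 2 * (x - center).
    """
    dx = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in dx) - radius ** 2
    a = [2.0 * d for d in dx]
    # Constraint value at the nominal control: a . u_nom + gamma * h
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + gamma * h
    if slack >= 0.0:
        return list(u_nom)  # nominal control already satisfies the condition
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("gradient vanishes at the obstacle center")
    # Project u_nom onto the half-space {u : a . u + gamma * h >= 0}
    return [ui - slack * ai / norm_sq for ui, ai in zip(u_nom, a)]


# Robot at (2, 0) driving straight at a unit-radius obstacle at the origin.
u_safe = cbf_filter(x=[2.0, 0.0], u_nom=[-5.0, 0.0], center=[0.0, 0.0], radius=1.0)
# A nominal control that moves away from the obstacle is left unchanged.
u_free = cbf_filter(x=[2.0, 0.0], u_nom=[1.0, 0.0], center=[0.0, 0.0], radius=1.0)
```

The filter only looks at the current state, which is why it can slow the robot without planning around the obstacle; the abstract's predicted-flow formulation targets that limitation.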

2606.00267 2026-06-02 cs.CV cs.AI cs.LG cs.RO Updated version

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy

Affiliations * Carnegie Mellon University; NVIDIA Research; University of Washington; Stanford University

AI summary Proposes StressDream, which optimizes the initial noise of diffusion-based video world models to steer generation at inference time toward high-impact yet plausible futures, supporting robust policy evaluation and improvement.

Comments Project page: https://junwon.me/StressDream/

Details
Abstract

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.

2606.00253 2026-06-02 cs.RO cs.LG Updated version

Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation

Pau Montagut Bofi, Mario García Blasco, Tessa Pulli, Markus Vincze

Affiliations * University of California, Berkeley; ETH Zurich

AI summary When fine-tuning vision-language-action models for mobile manipulators with heterogeneous joint spaces, the checkpoint with the lowest total MSE is found not to be the best performer in practice; per-group error is proposed as a more reliable checkpoint-selection metric.

Comments 4 pages, 3 figures, 3 tables. Accepted as a poster at the ICRA 2026 Workshop "From Data to Decisions: VLA Pipelines for Real Robots". Code: https://github.com/paumontagut/per-group-mse-vla

Details
Abstract

Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the real robot. We argue this is a predictable consequence of collapsing heterogeneous joint groups (arm, gripper, head, wheeled base) into a single metric, where easy-to-predict joints can mask joints that still fail. We fine-tune SmolVLA (450M, action-expert only) on the 11-DoF Toyota HSR and compare it against $π_{0.5}$ (3.3B), a stronger pretrained baseline. Per-group analysis exposes two patterns: in SmolVLA, the mobile base converges slowest and limits overall performance. In expert-only fine-tuning of $π_{0.5}$ (training only the action head, backbone frozen), total MSE drops below the baseline but arm accuracy degrades. On 60 real-robot trials (20 per model), $π_{0.5}$ 80k (4.0/4) significantly outperforms both fine-tuned variants (expert-only 3k: 3.75/4; HSR-SmolVLA: 3.5/4; Mann-Whitney $p \leq 0.010$), despite expert-only 3k having the lowest total MSE. This separation is most consistent with the offline arm-group error, not total MSE or base-group error. We conclude that per-group error is a more reliable signal than total MSE for checkpoint selection on robots with heterogeneous action spaces. Code: https://github.com/paumontagut/per-group-mse-vla
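
The masking effect described above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the linked repository: the index layout of the 11-dimensional action vector (5 arm, 1 gripper, 2 head, 3 base) and the two synthetic checkpoints are assumptions.

```python
# Hypothetical layout of an 11-DoF action vector, grouped by joint type.
GROUPS = {
    "arm": range(0, 5),
    "gripper": range(5, 6),
    "head": range(6, 8),
    "base": range(8, 11),
}


def total_mse(pred, target):
    """Mean squared error over every dimension of every action vector."""
    errors = [(p - t) ** 2 for pv, tv in zip(pred, target) for p, t in zip(pv, tv)]
    return sum(errors) / len(errors)


def per_group_mse(pred, target, groups=GROUPS):
    """Mean squared error computed separately for each joint group."""
    result = {}
    for name, indices in groups.items():
        errors = [(pv[i] - tv[i]) ** 2 for pv, tv in zip(pred, target) for i in indices]
        result[name] = sum(errors) / len(errors)
    return result


target = [[0.0] * 11 for _ in range(4)]
# Checkpoint A: perfect everywhere except the arm.
ckpt_a = [[0.3] * 5 + [0.0] * 6 for _ in range(4)]
# Checkpoint B: a moderate error spread over all joints.
ckpt_b = [[0.21] * 11 for _ in range(4)]

total_a, total_b = total_mse(ckpt_a, target), total_mse(ckpt_b, target)
arm_a = per_group_mse(ckpt_a, target)["arm"]
arm_b = per_group_mse(ckpt_b, target)["arm"]
```

In this synthetic case checkpoint A wins on total MSE while being roughly twice as bad on the arm group, which is the kind of disagreement the abstract reports between offline total MSE and real-robot results.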

2606.00252 2026-06-02 cs.RO cs.LG Updated version

HOIST: Humanoid Optimization with Imitation and Sample-efficient Tuning for Manipulating Suspended Loads

Songyang Liu, Shunyu Yao, Dingyuan Huang, Shuai Li

Affiliations * Department of Civil and Coastal Engineering, University of Florida

AI summary Proposes HOIST, which combines imitation learning with sample-efficient batched reinforcement learning to improve the placement accuracy and stopping behavior of a humanoid robot manipulating suspended loads.

Details
Abstract

Manipulating suspended payloads with humanoid robots is challenging because the robot can only influence an underactuated, oscillatory load through whole-body motion and intermittent contact. Imitation learning provides safe initial behavior but does not directly optimize final placement, while reinforcement learning from scratch is unsafe and sample-inefficient on real humanoids. We present HOIST (Humanoid Optimization with Imitation and Sample-efficient Tuning) for manipulating suspended loads. HOIST first fine-tunes a high-level vision-language-action (VLA) policy from virtual-reality (VR) teleoperation demonstrations and executes its commands through a whole-body controller. It then uses VLA rollouts and iterative batched RL to improve placement accuracy and stopping behavior. Experiments in simulation and on a real humanoid show that HOIST improves over imitation-only and additional-demonstration baselines; compared with pure VLA rollouts, HOIST reduces translational placement error by 19.9 cm and raw angular error by 3.56 degrees, demonstrating the potential of humanoids for underactuated material-handling tasks.

2606.00201 2026-06-02 cs.RO Updated version

Series-Parallel Integrated Nonlinear Elastic Actuator applied to the lean motion of a bicycle simulator

Christina Kohler, Michiel Plooij, Nuria Peña-Perez, Arend L. Schwab, Heike Vallery

Affiliations * Institute of Automatic Control, RWTH Aachen University; Demcon Life Sciences & Health; Hapticlink Technologies; Department of BioMechanical Engineering, Delft University of Technology; Department of Rehabilitation Medicine, Erasmus MC

AI summary Proposes the Series-Parallel Integrated Nonlinear Elastic Actuator (SPINEA), in which a nonlinear transmission lets a single elastic element act in series and in parallel at once, delivering high torque and precise torque tracking; it is applied to the lean motion of a bicycle simulator.

Details
Abstract

Designing robots for high-torque, high-fidelity haptic interaction is challenging. Parallel Elastic Actuators (PEAs) use elastic elements in parallel to smaller motors to complement torques, and Series Elastic Actuators (SEAs) use elastic elements in series to decouple motor impedance and improve force control. Recent work combines SEAs and PEAs to obtain both benefits but requires separate elastic elements or clutching. This paper presents the Series-Parallel Integrated Nonlinear Elastic Actuator (SPINEA), which merges SEA and PEA such that a single elastic element takes on both roles, parallel and series, simultaneously. This is achieved by a nonlinear transmission in which the motor and load have misaligned rotation axes and are elastically connected. This geometry enables both high peak torque and precise torque tracking. We apply SPINEA to actuate the lean of a haptic bicycle simulator, which requires high moments and precise rendering for safe and realistic rider interactions. We realized a prototype and performed experiments, both with an external excitation setup and with riders cycling. Our results confirm SPINEA's low impedance and precise torque tracking, up to 4.25 Hz with the bicycle frame fixed and up to 4 Hz with riders. The benefits may transfer to other applications requiring compact, high-performance actuation.

2606.00197 2026-06-02 cs.RO Updated version

Cuttlebot: a platform demonstration for complex, autonomous, bio-inspired swimmers

Alexander Nicholas White, Ang Leo Li, Alexander Yin, Derrick Roseman, Valeria Saro-Cortes, Hannah Wiswell, Aimy Wissa, Mihai Duduta

Affiliations * School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut; Department of Mechanical and Industrial Engineering, University of Toronto; University of Connecticut, Institute of Materials Science; Department of Mechanical and Aerospace Engineering, Princeton University

AI summary Presents CORE, an autonomous robotic platform that drives six artificial muscles while sensing visual and spatial information, and develops Cuttlebot, a cuttlefish-inspired robot that swims in three dimensions with undulatory fins, to validate the platform.

Details
Abstract

Increasing interest in deep-sea operations and resources motivates the development of ecologically sensitive but environmentally durable robots. Dielectric elastomer actuator artificial muscles are good candidates for powering such systems due to their pressure and temperature tolerance and soft makeup, but they are difficult to integrate with robotic systems. This work presents an autonomous robotic platform: the CORE, capable of driving six artificial muscles while sensing visual and spatial information. To validate the platform, we developed the Cuttlebot - a cuttlefish-inspired robot that swims in three dimensions using undulatory fin locomotion. The Cuttlebot has four primary artificial muscles in its fins in addition to a tentacle-inspired soft gripper. The robot was evaluated in a series of tethered and untethered swimming tests, demonstrating a top speed of 2.5 centimeters per second translation and 10 degrees per second rotation. Furthermore, the CORE system was capable of driving specialized control signals into the artificial muscles to controllably output force and torque in six axes. This work provides a platform for developing complex, bio-inspired swimming robots for ocean exploration and monitoring, laying the foundation with our leading example: the Cuttlebot.

2606.00191 2026-06-02 cs.RO cs.CV Updated version

Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models

Nishad Sahu, Kalpana Panda, Congyuan Yu, Changzhong Qian, Shounak Sural, Ragunathan Rajkumar

Affiliations * Carnegie Mellon University; Birla Institute of Technology and Science Pilani

AI summary Addressing the brittleness of end-to-end autonomous driving models in common safety-critical scenarios, proposes the Safe2Drive test suite and the SafeDriving Score (SDS); evaluation shows that leading models' driving scores drop sharply in safety scenarios and that their SDS is low.

Details
Journal ref
CVPR Workshops 2026
Abstract

Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.

2606.00162 2026-06-02 cs.RO cs.CV cs.LG Updated version

Modeling Robotics Dataset Construction as an Artifact-Based Build Process

Leon Pohl, Lukas Beer, George Sebastian, Mirko Maehlisch

Affiliations * Institute for Autonomous Driving, University of the Bundeswehr Munich

AI summary Models robotics dataset construction as an artifact-based build process and implements the open-source tool Bagzel, which substantially reduces dataset update latency through dependency-graph management and incremental builds; experiments show speedups of up to 386x in iterative workflows.

Comments Accepted at the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), 6 pages, 6 figures, 2 tables

Details
Abstract

Robotic systems generate large volumes of multimodal sensor data, but converting ROS bag recordings into machine learning datasets is often handled by ad hoc sequential scripts, creating engineering overhead and slow iteration cycles. We model dataset construction as an artifact-based build process over a dependency graph and implement this approach in Bagzel, an open-source Bazel extension for reproducible, incremental dataset generation (including nuScenes-format export). We compare Bagzel and Bagzel-xattr (server-side digest management) against a sequential rosbag2nuscenes baseline. Bagzel reduces runtime in all evaluated execution modes, with the largest gains in iterative workflows (up to 386.26x in warm builds and 7.21x in incremental builds on a 20.4 GB dataset). Across dataset sizes from 5.1 to 20.4 GB, Bagzel variants show markedly better scaling behavior than the baseline, especially in warm and incremental modes. Bagzel-xattr provides additional gains, with a mean runtime reduction of 5.9% compared to Bagzel in the input granularity study. Overall, modeling robotics dataset construction as an artifact-based build process substantially reduces dataset update latency while maintaining a deterministic build design that supports reproducibility. Bagzel is publicly available at https://github.com/UniBwTAS/bagzel.
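
The reason warm and incremental builds can be so much faster than a sequential script is that an artifact-based build skips any step whose inputs are unchanged. The sketch below illustrates that principle with a content-addressed cache. It is not Bagzel's or Bazel's API: the class `ArtifactCache`, the `convert` action, and the file names are hypothetical stand-ins.

```python
import hashlib


class ArtifactCache:
    """Caches action outputs, keyed by the action name and a digest of its input."""

    def __init__(self):
        self._store = {}
        self.executions = 0  # number of times an action actually ran

    def build(self, action_name, input_bytes, action):
        key = hashlib.sha256(action_name.encode() + b"\0" + input_bytes).hexdigest()
        if key not in self._store:
            self._store[key] = action(input_bytes)
            self.executions += 1
        return self._store[key]


def convert(bag_bytes):
    """Stand-in for an expensive conversion of one recording."""
    return bag_bytes.upper()


def build_all(cache, bags):
    return {name: cache.build("convert", data, convert) for name, data in bags.items()}


cache = ArtifactCache()
bags = {"drive_a.bag": b"lidar+camera", "drive_b.bag": b"camera only", "drive_c.bag": b"radar"}

build_all(cache, bags)                      # cold build: every recording is converted
cold_runs = cache.executions
build_all(cache, bags)                      # warm build: nothing changed
warm_runs = cache.executions - cold_runs
bags["drive_b.bag"] = b"camera v2"          # one input changes
outputs = build_all(cache, bags)            # incremental build
incremental_runs = cache.executions - cold_runs - warm_runs
```

A real build system also tracks dependencies between artifacts and persists the cache across runs; this sketch keeps only the skip-if-unchanged logic.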

2606.00145 2026-06-02 cs.RO cs.AI Updated version

Completion at the Boundary (CaB): Deployable Switching with Completion-Aware Control under Limited Calibration

Yusuke Sano, Takeshi Itoga

Affiliations * Intelligent Systems Laboratory, SECOM Co., Ltd.

AI summary Proposes Completion at the Boundary (CaB), which retains two-sided evidence through Boundary-Phase Tokens (Before/Hit/After) and enables completion-aware switching for VLA agents under limited calibration, improving composite instruction execution and handoff quality.

Details
Abstract

Vision-language-action (VLA) agents can execute natural-language instructions, yet deployed systems still lack an operational interface: deciding when the instruction is complete. This gap is acute in short composites ("do A, then B"), where mistimed handoffs cascade into downstream failures. Completion is inherently closed-loop because switching is an intervention that changes the instruction context and thus future actions and observations. We study completion under a deployable low-calibration regime motivated by open-ended instruction spaces, enforcing no test-time relearning and a single globally calibrated switching rule selected once on the development set and reused unchanged on the test set. Under this constraint, collapsing asymmetric boundary evidence into a single scalar can be brittle under polarity shifts across tasks. We propose Completion at the Boundary (CaB), which predicts an event-local completion object in the form of Boundary-Phase Tokens (Before/Hit/After), retaining two-sided boundary evidence under this discipline. CaB-When converts this completion object into a minimal, auditable switching decision (when), while CaB-How reuses the same completion object to condition action generation for boundary-stable control through handoffs (how). Using an intervention-aware E1/E2 protocol, we show that CaB improves composite execution and handoff quality on a first-person Minecraft VLA benchmark under matched capacity and deployability constraints.

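2606.00119 2026-06-02 cs.RO cs.AI Updated version

V2I Work Zone Geometry Reconstruction with Pose-Conditioned UWB Range Denoising

Jiaxi Liu, Hangyu Li, Yang Cheng, Rui Gana, Junwei You, Weizhe Tang, Peng Zhang, Steven T. Parker, Xiaopeng Li, Bin Ran

Affiliations * Department of Civil & Environmental Engineering, University of Wisconsin-Madison

AI summary To address the degradation of UWB ranging by bursty outliers, non-line-of-sight errors, and pose uncertainty in V2I work zone geometry reconstruction, proposes a pose-conditioned, permutation-equivariant predictive denoiser that uses shared anchor-wise temporal prediction, symmetric set aggregation, and pose-conditioned residual decoding to substantially improve range accuracy and geometry reconstruction quality.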

Details
Abstract

Reliable work zone mapping is important for connected and autonomous vehicles (CAVs) to navigate safely and smoothly through work zone areas. Cone-mounted ultra-wideband (UWB) roadside units (RSU) offer a cost-effective way for work zone layout inference, as roadside anchors and vehicle tags provide direct vehicle-to-infrastructure (V2I) range constraints for work zone geometry reconstruction. However, UWB range estimation is degraded by bursty outliers, non-line-of-sight (NLOS) errors, arbitrary anchor-ordering issues, and vehicle pose uncertainties in practical field deployments. To address these challenges, this study proposes a pose-conditioned, permutation-equivariant predictive denoiser for multi-anchor UWB ranging. The model employs shared anchor-wise temporal prediction to capture range dynamics, symmetric set aggregation to handle unordered and missing anchors, and pose-conditioned residual decoding to incorporate vehicle motion as a geometric prior. A two-stage training strategy first learns prediction from observed ranges, and then fine-tunes the denoiser with NLOS-weighted supervision. The method is evaluated on rare real-world V2I UWB field data collected with a CAV, as well as on controlled large-scale simulation benchmarks for ablative insights. Results show that the proposed method substantially improves range accuracy, cone localization, and work zone geometry reconstruction in challenging NLOS-dominated regimes, remains robust to anchor re-indexing and moderate anchor dropout, and reduces measurement-weighted field MSE by 66.9% relative to the raw input.
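
Symmetric set aggregation is what makes a model insensitive to anchor ordering and tolerant of missing anchors. The sketch below shows the simplest instance, masked mean pooling over per-anchor feature vectors. It is an illustration of the general technique, not the paper's architecture: the function name, the use of `None` for a dropped anchor, and the feature values are assumptions.

```python
def aggregate_anchors(features):
    """Mean-pool per-anchor feature vectors, ignoring missing anchors (None).

    Because the mean does not depend on element order, re-indexing the
    anchors leaves the pooled vector unchanged.
    """
    present = [f for f in features if f is not None]
    if not present:
        raise ValueError("at least one anchor must be present")
    dim = len(present[0])
    return [sum(f[k] for f in present) / len(present) for k in range(dim)]


# Hypothetical 2-D features for three anchors.
anchors = [[1.0, 4.0], [2.0, 6.0], [4.5, 8.0]]
pooled = aggregate_anchors(anchors)
pooled_reordered = aggregate_anchors([anchors[2], anchors[0], anchors[1]])
# Anchor 1 drops out; pooling still works on the remaining two.
pooled_dropout = aggregate_anchors([anchors[0], None, anchors[2]])
```

In an equivariant model, a pooled vector like this would typically be fed back to each anchor's branch so that per-anchor outputs follow the input ordering while the shared context stays order-independent.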

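2606.00117 2026-06-02 cs.RO Updated version

Ontology-Guided Reasoning for Affordance-Based Explanations of Robot Navigation

Amar Halilovic, Vahidin Hasic, Senka Krivic

Affiliations * Institute of Artificial Intelligence, Ulm University; Faculty of Electrical Engineering, University of Sarajevo

AI summary Proposes an ontology-guided reasoning approach that represents entities, affordance states, and spatial relations in a local affordance ontology and evaluates hypothetical object-affordance state changes as explanation factors, producing explanations that are semantically grounded and actionable; accuracy and robustness are validated in a robot librarian scenario.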

Details
Abstract

This paper proposes ontology-guided reasoning for affordance-based explanations of robot navigation. In human environments, it is not sufficient for a robot to detect that its route is blocked. It must also reason about what nearby objects afford, which state changes are possible, and which of these changes would allow it to continue safely. We address this problem by representing nearby entities, their affordances, affordance states, and qualitative spatial relations in a local affordance ontology and by evaluating hypothetical object-affordance state changes as candidate explanation factors. This yields explanations that are not only semantically grounded but also actionable. We instantiate the approach in a lightweight benchmark centered on a robot librarian scenario and evaluate it on procedurally generated navigation cases. The results show that ontology-guided reasoning identifies relevant explanation factors more accurately than a semantic-only baseline and remains robust as semantic clutter increases. Overall, the paper argues that affordance ontologies can serve not merely as semantic descriptions of the environment, but as reasoning foundations for explainability and reliable robot autonomy.

2606.00113 2026-06-02 cs.RO Updated version

World Models for Robotic Manipulation: A Survey

Fangyuan Wang, Ziyuan Wang, Guorui Pei, Mengshi Zhang, Canxi Liang, Jun Hu, Zhongxuan Li, Jinsong Wu, Ning Han, Zeqing Zhang, Jiaming Qi, Hongmin Wu, Shiyao Zhang, Pai Zheng, Jia Pan, David Navarro-Alarcon, Sichao Liu, Peng Zhou

Affiliations * Department of Mechanical Engineering, The Hong Kong Polytechnic University; Department of Mechanical Engineering and Automation, Harbin Institute of Technology; School of Advanced Engineering, Great Bay University; College of Robotics Science and Engineering, Taiyuan University of Technology; School of Data Science, City University of Hong Kong (Dongguan); Department of Mechatronic Engineering, Guangdong Polytechnic Normal University; School of Computing and Data Science, The University of Hong Kong; School of Electrical and Electronic Engineering, Nanyang Technological University; College of Mechanical and Electrical Engineering, Northeast Forestry University; Greater Bay Area National Center of Technology Innovation; Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University

AI summary Systematically surveys world models for robotic manipulation through three questions (what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline), defining them as action-conditioned predictive systems, organizing them into five representation families, proposing a functional taxonomy, and summarizing infrastructure roles, datasets, and evaluation protocols; it reveals an evolution from task-specific dynamics predictors to predictive infrastructure, along with open challenges.

Details
Abstract

Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.

2606.00110 2026-06-02 cs.CV cs.RO 版本更新

General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling

广义协变动作建模:通过时空解耦构建广义流形

Huaihai Lyu, Chaofan Chen, Mingyu Cao, Yuheng Ji, Changsheng Xu

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出广义动作流形框架,通过时间不变性和几何不变性解耦实现广义协变,提升从稀疏演示中泛化的鲁棒性。

详情
AI中文摘要

从有限数据中实现鲁棒泛化是具身智能的核心挑战。现有方法通过回归绝对坐标失败,这违反了广义协变原理。根本上,这混淆了内在任务几何与刚性执行模式,将策略绑定到特定运动风格和固定速度。为解决此问题,我们提出广义动作流形(GAM)框架,通过结构解耦强制执行广义协变。具体地,GAM通过强制两个正交维度的不变性来实现流形:(1)时间不变性,利用弧长参数化将空间路径几何与时间动力学正交化,确保对速度变化的鲁棒性;(2)几何不变性,其中模式-仿射-分解机制将轨迹映射到姿态归一化坐标框架中的规范“世界线”。这区分了不变几何模式与仿射调制,确保空间泛化性。通过将GAM集成到结构化视觉-语言-动作(VLA)架构中,我们使稀疏演示能够密集填充连续有效的动作流形。实验结果表明,GAM实现了优越的迁移和鲁棒性,优于几何无关基线。

英文摘要

Achieving robust generalization from limited data is a central challenge in embodied intelligence. Prevailing methods fail by regressing absolute coordinates, which violates the principle of general covariance. Fundamentally, this conflates the intrinsic task geometry with rigid execution patterns, binding policies to specific motion styles and fixed speeds. To resolve this, we propose the Generalized Action Manifold (GAM) framework that enforces general covariance through structural disentanglement. Specifically, GAM realizes the manifold by enforcing invariance across two orthogonal dimensions: (1) Temporal Invariance, utilizing an Arc-Length Parameterizer to orthogonalize the spatial path geometry from temporal dynamics, ensuring robustness to velocity variations; (2) Geometric Invariance, where a Schema-Affine-Factorization mechanism maps trajectories to canonical "world lines" in a pose-normalized coordinate frame. This distinguishes invariant geometric schemas from affine modulations, ensuring spatial generalizability. By integrating GAM within a structured Vision-Language-Action (VLA) architecture, we enable sparse demonstrations to densely populate a continuous, valid action manifold. Empirical results demonstrate that GAM enables superior transfer and robustness capabilities, outperforming geometry-agnostic baselines.
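
The temporal-invariance idea in this abstract (separating path geometry from timing through arc-length parameterization) can be illustrated with a minimal sketch. This is not the paper's Arc-Length Parameterizer; the function name `resample_by_arc_length` and the polyline approximation are assumptions made for illustration only.

```python
import numpy as np

def resample_by_arc_length(points, num_samples):
    """Resample a sampled trajectory at equal arc-length spacing.

    The original timestamps are discarded, so two executions of the same
    path at different speeds map to (nearly) the same output.
    """
    points = np.asarray(points, dtype=float)
    # Length of each polyline segment and cumulative arc length s.
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    if s[-1] == 0.0:
        # Degenerate case: the trajectory never moves.
        return np.repeat(points[:1], num_samples, axis=0)
    targets = np.linspace(0.0, s[-1], num_samples)
    # Interpolate every coordinate as a function of arc length.
    cols = [np.interp(targets, s, points[:, d]) for d in range(points.shape[1])]
    return np.stack(cols, axis=1)

# Same straight-line path, traversed at constant speed and with slow start.
t_uniform = np.linspace(0.0, 1.0, 50)
t_slow_start = t_uniform ** 3
path_a = np.stack([t_uniform, 2.0 * t_uniform], axis=1)
path_b = np.stack([t_slow_start, 2.0 * t_slow_start], axis=1)
resampled_a = resample_by_arc_length(path_a, 5)
resampled_b = resample_by_arc_length(path_b, 5)
```

Both resampled arrays describe the same five points along the line, which is the sense in which the representation is insensitive to velocity variations.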

2606.00104 2026-06-02 cs.RO cs.AI 版本更新

PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs

PEACE: 一种用于无人机的带约束执行的规划-执行智能体

Erdem Uysal, Timo Kehrer, Sebastiano Panichella

发表机构 * Institute of Computer Science, University of Bern(伯尔尼大学计算机科学研究所) AI4I - The Italian Institute of Artificial Intelligence(意大利人工智能研究所)

AI总结 提出一种基于大语言模型的规划-执行智能体架构,通过解耦高层任务规划与低层控制,并引入约束执行层和有限重规划,实现无人机可解释、可约束的自主飞行。

Comments Accepted to ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction

详情
AI中文摘要

基础模型越来越多地被用于驱动自主系统,然而现有方法要么将模型保持在紧密的控制循环中,增加延迟和幻觉风险,要么将自然语言编译成不透明的端到端策略,难以解释、约束,且需要特定领域的数据集和微调。我们提出一种用于基于PX4的无人机的规划-执行智能体,将高层任务规划与低层控制解耦。大语言模型执行单次任务规划,而执行通过结构化的ROS 2工具调用接口(桥接到MAVLink)处理。该系统通过将模块化2D检测器(如YOLO或视觉语言模型)与用于3D物体定位的针孔深度投影模块相结合,构建世界模型。约束执行层强制执行高度限制和水平地理围栏,有限重规划能够从执行时的动作失败中恢复。我们将我们的方法定位在基于基础模型的机器人系统的三种常见设计模式中,并在Gazebo中的PX4软件在环仿真中展示其可行性。结果突出了与紧密耦合的LLM控制相比,改进的可解释性、约束执行和减少的LLM调用。代码、数据集、视频和其他材料可在以下链接找到:https://github.com/erdemuysalx/PEACE

英文摘要

Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end-to-end policies that are hard to explain and constrain and that require domain-specific datasets and fine-tuning. We propose a planner-executor agent for PX4-based drones that decouples high-level mission planning from low-level control. A large language model performs single-pass task planning, while execution is handled through a structured ROS 2 tool-calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision-language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer enforces altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution-time action failures. We position our approach within three common design patterns for foundation-model-based robotics systems and demonstrate its feasibility in PX4 software-in-the-loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: https://github.com/erdemuysalx/PEACE
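
Two mechanisms named in this abstract, pinhole depth projection and altitude/geofence enforcement, can be sketched in a few lines. This is a hedged illustration, not code from the PEACE repository: the names `backproject`, `Envelope`, and `check_waypoint`, the circular fence shape, and all numeric values are assumptions.

```python
import math
from dataclasses import dataclass

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole model: pixel (u, v) with metric depth -> camera-frame (x, y, z)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

@dataclass(frozen=True)
class Envelope:
    min_alt_m: float
    max_alt_m: float
    fence_radius_m: float  # assumed: circular horizontal fence around home

def check_waypoint(x, y, z, env):
    """Return (ok, reason) so a rejected action can be reported to a planner."""
    if not (env.min_alt_m <= z <= env.max_alt_m):
        return False, "altitude"
    if math.hypot(x, y) > env.fence_radius_m:
        return False, "geofence"
    return True, "ok"

# Hypothetical limits: fly between 1 m and 30 m, within 50 m of home.
env = Envelope(min_alt_m=1.0, max_alt_m=30.0, fence_radius_m=50.0)
inside = check_waypoint(10.0, 10.0, 5.0, env)
too_high = check_waypoint(10.0, 10.0, 45.0, env)
too_far = check_waypoint(60.0, 0.0, 5.0, env)
```

Returning a reason string alongside the verdict is one way a bounded replanning step could be told why an action was refused.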

2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO 版本更新

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

弥合2D-3D鸿沟:面向视觉语言导航的分层语义几何地图

Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) Bosch Corporate Research(博世企业研究院) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 提出分层语义几何地图(HSGM),将3D几何信息转化为VLM可理解的结构化表示,结合VLM高层语义规划与经典路径规划,实现零样本视觉语言导航,在R2R-CE和RxR-CE基准上达到最先进性能。

详情
AI中文摘要

视觉语言导航(VLN)使具身智能体能够通过遵循语言指令在未知环境中到达目标位置。尽管近期视觉语言模型(VLM)取得了进展,但仍存在关键的语义-几何鸿沟:VLM擅长语言和2D视觉理解,但在3D空间推理方面表现不佳,且无法捕捉动作与空间转换之间的因果动态,导致导航不可靠,尤其在零样本设置中。为弥合这一鸿沟,我们提出分层语义几何地图(HSGM),将3D几何信息转化为与VLM兼容的结构化表示,有效将其与物理世界连接。具体而言,HSGM表示为多通道俯视图,组织为三个层次:(1)几何层,记录可导航区域和障碍物;(2)语义层,表示物体及其关系;(3)决策层,支持高层任务推理和目标选择。导航过程中,VLM作为高层语义规划器,解释HSGM编码的空间布局以选择几何有效航点,而航点间的低层无碰撞运动由经典路径规划算法执行,从而将语义推理与动作执行完全解耦。此外,复杂指令被分解为子任务,以缓解长程导航中的进度遗忘或幻觉问题。在R2R-CE和RxR-CE基准上的大量实验表明,我们的零样本框架达到了最先进性能,甚至优于若干监督方法。代码见 https://github.com/Teacher-Tom/HSGM_public。

英文摘要

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.

2606.00090 2026-06-02 cs.RO cs.AI 版本更新

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

物理AI中的静默故障:自主系统运行时动作授权的文献综述

Barak Or

发表机构 * STATE16

AI总结 本文综述了物理AI系统中黑箱模型发出看似合理但实际错误的物理动作导致的静默故障问题,提出了运行时防护栏的分类和评估要求。

Comments 23 pages

详情
AI中文摘要

物理AI系统越来越多地将多模态观测、语言指令和学习的世界表示映射为具有物理后果的动作。机器人基础模型、视觉-语言-动作模型和基于世界模型的自主系统可以决定移动车辆、机器人、无人机和工业机器的决策。这种转变暴露了一个传统AI内容审核或经典机器人安全无法完全捕获的安全问题:黑箱模型可能发出一个物理后果的动作,同时表现出自信、合理和语义对齐。由此产生的故障可能是静默的,源于传感器漂移、遮挡、状态估计误差、分布偏移、幻觉的可供性,或在下游硬件控制器检测到违规之前的无效物理假设。在具身基础模型、世界模型、机器人仿真、具身安全基准、安全控制、运行时保证、不确定性估计、验证和防护栏评估中,模型能力和安全机制沿着大致分离的技术轨道发展。这里综合的一个反复出现的差距是,本综述调查的单一流都没有提供黑箱物理AI模型与物理执行之间的完整运行时授权边界。由此产生的分析发展了一个有界的问题表述、静默物理动作故障的定义、运行时防护栏功能的分类,以及比较防护栏作为物理AI保证机制的评估要求。

英文摘要

Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically consequential actions. Robotics foundation models, vision-language-action models, and world-model-based autonomous systems can condition decisions that move vehicles, robots, drones, and industrial machines. This transition exposes a safety problem that is not fully captured by conventional AI content moderation or by classical robot safety alone: a black-box model may issue a physically consequential action while appearing confident, plausible, and semantically aligned. The resulting failure can be silent, arising from sensor drift, occlusion, state-estimation error, distribution shift, hallucinated affordances, or invalid physical assumptions before downstream hardware controllers detect a violation. Across embodied foundation models, world models, robotics simulation, embodied safety benchmarks, safe control, runtime assurance, uncertainty estimation, verification, and guardrail evaluation, model capability and safety mechanisms have advanced along largely separate technical tracks. A recurring gap synthesized here is that no single stream surveyed in this review supplies a complete runtime authorization boundary between black-box Physical AI models and physical execution. The resulting analysis develops a bounded problem formulation, a definition of silent physical-action failure, a taxonomy of runtime guardrail functions, and evaluation requirements for comparing guardrails as Physical AI assurance mechanisms.

2606.00089 2026-06-02 cs.RO cs.AI 版本更新

Can Predicted Dynamics Exist in the Physical World?

预测的动态能否存在于物理世界中?

Barak Or

发表机构 * STATE16 Technion - Israel Institute of Technology(以色列理工学院) Reichman University(赖希曼大学) Google-Reichman AI Tech School(Google-赖希曼人工智能技术学院)

AI总结 本文提出物理可接受性作为预测-控制接口,通过运动学、动力学和直接到组合的视界条件评估解码提案的物理可执行性,实验表明该方法能有效识别无效提案并保持任务进度。

Comments 17 pages

详情
AI中文摘要

预测性物理AI系统输出状态展开、动作块和潜在计划,但低均方根误差(RMSE)并不意味特定提案在物理上可执行。我们将物理可接受性定义为预测-控制接口:在执行前,将解码提案视为候选动态,并使用运动学、动力学和直接到组合的视界条件进行评估。通过不是任务成功的证明;拒绝标识指定物理包络的违反,并给出组件级原因。在Hugging Face LeRobot PushT上,受控伪造表明一步预测RMSE和标准化动态残差达到接收者操作特征曲线下面积(AUC)0.982和0.972,仅运动学条件达到AUC 0.592,完整门控达到AUC 0.957并带有条件级归因。在基于重放的干预实验中,基于残差的过滤器和完整物理可接受性门控阻止了87-89%的无效提案,同时保持平均进度接近0.998。

英文摘要

Predictive Physical AI systems output state rollouts, action chunks, and latent plans, yet a low root-mean-square error (RMSE) does not imply that a particular proposal is physically executable. We formulate physical admissibility as a prediction-control interface: before execution, a decoded proposal is treated as candidate dynamics and evaluated using kinematic, dynamic, and direct-to-composed horizon conditions. Passing is not a certificate of task success; rejection identifies violation of the specified physical envelope and gives a component-level reason. On Hugging Face LeRobot PushT, controlled falsification shows that one-step prediction-RMSE and standardized dynamics residuals reach area under the receiver operating characteristic curve (AUC) 0.982 and 0.972, kinematic-only conditions reach AUC 0.592, and the full gate reaches AUC 0.957 with condition-level attribution. In replay-based intervention experiments, residual-based filters and the full physical-admissibility gate prevent 87-89% of invalid proposals while preserving mean progress near 0.998.
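
The idea of gating a proposal before execution and returning a component-level reason can be illustrated with a kinematic-only check. This is a minimal sketch, not the paper's gate: the function `kinematic_gate`, the finite-difference estimates, and the limits used are assumptions, and the dynamic and horizon conditions from the abstract are not modelled here.

```python
import numpy as np

def kinematic_gate(states, dt, v_max, a_max):
    """Check a proposed state rollout of shape (T, D) against speed and
    acceleration limits. Returns (admissible, reason)."""
    states = np.asarray(states, dtype=float)
    vel = np.diff(states, axis=0) / dt   # finite-difference velocity
    acc = np.diff(vel, axis=0) / dt      # finite-difference acceleration
    if vel.size and np.linalg.norm(vel, axis=1).max() > v_max:
        return False, "velocity"
    if acc.size and np.linalg.norm(acc, axis=1).max() > a_max:
        return False, "acceleration"
    return True, "ok"

# A smooth proposal and one with a teleport-like jump (hypothetical data).
smooth = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]
jump = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
smooth_verdict = kinematic_gate(smooth, dt=0.1, v_max=2.0, a_max=5.0)
jump_verdict = kinematic_gate(jump, dt=0.1, v_max=2.0, a_max=5.0)
```

As the abstract stresses, passing such a check says nothing about task success; it only says the proposal stays inside the specified envelope.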

2606.00086 2026-06-02 cs.RO 版本更新

Whole-Body Inverse Kinematics with Graph Diffusion

基于图扩散的全身逆运动学

Helong Huang, Kai Tan, Feng Wen, Guowei Huang, Xingyue Quan

发表机构 * Large Model Algorithm Lab, Huawei(华为大模型算法实验室)

AI总结 提出GraphDiff-IK,一种结构感知的图扩散逆运动学框架,通过将机器人表示为运动学图并引入分层消息传递和躯干感知条件,实现了多分支机器人的准确稳定IK求解。

详情
AI中文摘要

逆运动学(IK)是机器人学中的一个基本问题,需要生成满足目标末端执行器位姿的关节配置。现有方法通常难以在多种机器人形态间泛化,并且无法有效建模IK的多模态特性,特别是在具有多个运动学分支的关节系统中。在这项工作中,我们提出了GraphDiff-IK,一种结构感知的图扩散逆运动学框架。具体来说,我们将机器人表示为从机器人URDF构建的运动学图,其中节点对应驱动关节,边编码运动学依赖关系。基于这种表示,我们将IK表述为条件图扩散过程,直接在机器人图上生成关节配置。为了更好地捕捉关节系统中的结构依赖关系,我们进一步引入了一种结构感知的图推理框架,具有分层阶段式消息传递和针对多分支机器人的躯干感知条件。此外,我们结合了带噪声的正向运动学反馈和任务空间监督,以提高去噪过程中的几何一致性。所提出的框架提供了一种统一的公式,自然支持单臂机器人、双臂系统以及具有躯干或腰部结构的关节机器人。在多种机器人平台上的大量实验表明,所提出的方法实现了准确且稳定的IK性能,同时保留了为冗余机器人系统生成多个可行解的能力。

英文摘要

Inverse kinematics (IK) is a fundamental problem in robotics, requiring the generation of joint configurations that satisfy target end-effector poses. Existing approaches often struggle to generalize across diverse robot morphologies and to effectively model the multi-modal nature of IK, particularly in articulated systems with multiple kinematic branches. In this work, we propose GraphDiff-IK, a structure-aware graph diffusion framework for inverse kinematics. Specifically, we represent the robot as a kinematic graph constructed from the robot URDF, where nodes correspond to actuated joints and edges encode kinematic dependencies. Building upon this representation, we formulate IK as a conditional graph diffusion process that directly generates joint configurations on the robot graph. To better capture structural dependencies in articulated systems, we further introduce a structure-aware graph reasoning framework with hierarchical stage-wise message passing and torso-aware conditioning for multi-branch robots. In addition, we incorporate noisy forward kinematics feedback and task-space supervision to improve geometric consistency during denoising. The proposed framework provides a unified formulation that naturally supports single-arm robots, dual-arm systems, and articulated robots with torso or waist structures. Extensive experiments on diverse robotic platforms demonstrate that the proposed method achieves accurate and stable IK performance while preserving the ability to generate multiple feasible solutions for redundant robotic systems.

2606.00085 2026-06-02 cs.RO 版本更新

Balancing Accuracy and Efficiency: Adaptive Dynamics Orchestration for Model Predictive Control

平衡精度与效率:模型预测控制的自适应动力学编排

Francesco Cancelliere, Aniket Datar, Giovanni Muscato, Xuesu Xiao

发表机构 * Department of Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI, USA(密歇根大学电气与计算机工程系,美国密歇根州安娜堡)

AI总结 提出自适应动力学编排(ADO)框架,通过在线反事实滚动评估模型残差,动态选择最适合当前导航上下文的动力学模型,在计算效率与预测精度之间取得平衡。

Comments 8 pages, 7 figures

详情
AI中文摘要

自主导航的模型预测控制(MPC)面临模型精度与实时效率之间的基本权衡。高保真动力学模型能够准确预测轨迹展开过程中复杂的车辆-地形交互,但计算成本高,增加推理延迟并降低控制频率。相反,轻量级模型支持快速更新和密集采样,但在安全关键条件下可能产生错误预测,导致灾难性故障如车辆侧翻。为解决这一权衡,我们提出自适应动力学编排(ADO),一种根据当前导航上下文动态选择最合适动力学模型的框架。ADO维护一个涵盖不同精度-效率特征的模型库,并通过在线反事实滚动(即执行的控制动作在模型库中重放以评估预测差异)的残差误差,持续细化地形条件性能估计。这些估计实时指导模型选择,平衡计算效率与预测精度。在越野地面机器人上的真实实验表明,与固定低延迟基线相比,ADO显著降低了建模误差,同时接近最高保真模型的精度而不产生其计算成本,从而在复杂地形中实现更可靠和有效的导航。

英文摘要

Model Predictive Control (MPC) for autonomous navigation faces a fundamental trade-off between model accuracy and real-time efficiency. High-fidelity dynamics models can accurately predict complex vehicle-terrain interactions during trajectory rollouts, but incur significant computational cost, increasing inference latency and reducing control frequency. Conversely, lightweight models enable fast updates and dense sampling, yet may produce erroneous predictions under safety-critical conditions, potentially leading to catastrophic failures such as vehicle rollover. To address this trade-off, we propose Adaptive Dynamics Orchestration (ADO), a framework that dynamically selects the most appropriate dynamics model for the current navigation context. ADO maintains a library of models spanning diverse accuracy-efficiency profiles and continuously refines terrain-conditioned performance estimates using residual errors from online counterfactual rollouts, where executed control actions are replayed across the model library to assess predictive discrepancy. These estimates guide model selection in real time, balancing computational efficiency and predictive accuracy. Real-world experiments on an off-road ground robot demonstrate that ADO significantly reduces modeling error compared to a fixed low-latency baseline, while approaching the accuracy of the highest-fidelity model without incurring its computational cost, resulting in more reliable and effective navigation in challenging terrain.
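
The selection loop described in this abstract (keep terrain-conditioned error estimates per model, then pick a model that balances cost and accuracy) might look roughly as follows. This is a sketch under stated assumptions rather than the ADO algorithm: the class `Orchestrator`, the exponential moving average, the error tolerance rule, and the fallback to the costliest model are all hypothetical choices.

```python
class Orchestrator:
    """Pick the cheapest dynamics model whose estimated error is tolerable."""

    def __init__(self, costs, tol, alpha=0.2):
        self.costs = dict(costs)   # model name -> relative compute cost
        self.tol = tol             # acceptable residual error (assumed rule)
        self.alpha = alpha         # smoothing factor for the running estimate
        self.err = {}              # (terrain, model) -> running residual

    def update(self, terrain, model, residual):
        """Fold one replayed-rollout residual into the running estimate."""
        key = (terrain, model)
        prev = self.err.get(key)
        if prev is None:
            self.err[key] = residual
        else:
            self.err[key] = (1.0 - self.alpha) * prev + self.alpha * residual

    def select(self, terrain):
        by_cost = sorted(self.costs, key=self.costs.get)
        for model in by_cost:
            est = self.err.get((terrain, model))
            if est is not None and est <= self.tol:
                return model
        # No model is known to be good enough: fall back to the costliest
        # one, assumed here to be the highest-fidelity model.
        return by_cost[-1]

orch = Orchestrator(costs={"light": 1.0, "heavy": 10.0}, tol=0.1)
orch.update("grass", "light", 0.05)
orch.update("rock", "light", 0.50)
orch.update("rock", "heavy", 0.02)
```

With these hypothetical residuals, the light model is chosen on grass and the heavy model on rock or on terrain that has no estimates yet.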

2606.00083 2026-06-02 cs.LG cs.AI cs.RO 版本更新

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

从演示到奖励:VLM奖励模型的测试时提示优化

Christian Gumbsch, Leonardo Barcellona, Lennard Schünemann, Platon Karageorgis, Andrii Zadaianchuk, Zehao Wang, Sergey Zakharov, Fabien Despinoy, Rahaf Aljundi, Efstratios Gavves

发表机构 * University of Amsterdam(阿姆斯特丹大学) Catholic University of Leuven(鲁汶天主大学) Toyota Research Institute(丰田研究院) Toyota Motor Europe(丰田欧洲公司)

AI总结 提出Demo2Reward方法,利用少量专家演示在测试时优化VLM奖励模型的提示指令,减少假阳性并保持真阳性,无需额外训练即可提升下游策略学习。

详情
AI中文摘要

强化学习依赖于准确的奖励函数,但在现实应用(如机器人技术)中,这些函数通常是手工设计的,甚至不可用。最近的研究探索了预训练视觉-语言模型(VLM)作为奖励模型的零样本推理能力。然而,如果没有仔细的提示工程,这些方法往往会产生次优的奖励,其中假阳性预测会严重降低下游策略学习。在机器人技术中,通常收集包含专家演示的有限数据集来引导策略学习。这种场景提供了在策略训练之前优化奖励模型的机会。我们提出Demo2Reward,一种测试时自适应技术,基于少量演示(3-10条轨迹)优化奖励模型的语言指令,以减少假阳性同时保持真阳性。关键是,这在策略学习期间不需要额外的模型训练或计算资源。我们表明,Demo2Reward在一系列模拟机器人任务和策略骨干上始终优于现有的零样本和少样本VLM奖励模型。最后,我们证明Demo2Reward有效迁移到真实世界的机器人学习场景,无需手动设计奖励函数即可实现策略学习。

英文摘要

Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications, such as robotics. Recent work has explored the zero-shot reasoning capabilities of pre-trained Vision-Language Models (VLMs) as reward models. However, without careful prompt engineering, these approaches tend to produce suboptimal rewards, where false positive predictions can severely degrade downstream policy learning. In robotics, limited datasets comprising expert demonstrations are often collected to bootstrap policy learning. This scenario provides an opportunity to optimize a reward model prior to policy training. We propose Demo2Reward, a test-time adaptation technique that optimizes the language instruction of a reward model based on a few demonstrations (3-10 trajectories) to reduce false positives while preserving true positives. Crucially, this requires no additional model training or computation resources during policy learning. We show that Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across a range of simulated robotic tasks and policy backbones. Finally, we demonstrate that Demo2Reward effectively transfers to a real-world robotic learning scenario, enabling policy learning without manually engineering a reward function.

2606.00069 2026-06-02 cs.RO eess.IV 版本更新

Invascal: Inverse-Vacuity Self-Calibration for Uncertainty-Aware LiDAR Range-View Semantic Segmentation

Invascal: 面向不确定性感知激光雷达距离视图语义分割的逆空性自校准

Kerim Turacan, Hannes Reichert, Andrei Bolandut, Konrad Doll

发表机构 * Faculty of Engineering and Computer Science, University of Applied Sciences Aschaffenburg(阿沙芬堡应用科学大学工程与计算机科学学院)

AI总结 提出一种与架构无关的不确定性感知适配器头,通过偏好头和强度头分解预测,并设计逆空性自校准目标(Invascal)来监督强度信号,实现可靠且校准良好的不确定性估计,同时保持分割精度。

Comments Accepted for publication at the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC)

详情
AI中文摘要

激光雷达语义分割是自动驾驶车辆和移动机器人的核心感知能力。然而,安全运行还取决于知道预测何时不可靠。现有方法通常依赖softmax置信度,这往往校准不良且过度自信,而来自蒙特卡洛dropout或集成方法的更强不确定性估计对于实时使用通常计算成本高昂。为此,我们引入了一种新颖的、与架构无关的不确定性感知适配器头。它将预测分解为用于类别排名的偏好头和用于细化不确定性评估的强度头,从而能够原则性地构建证据狄利克雷表示。基于此设计,我们提出了逆空性自校准目标(Invascal),它直接监督强度信号以产生可靠且校准良好的不确定性估计,同时防止证据无节制增长。我们在多个激光雷达数据集和骨干架构上评估了我们的框架。我们与确定性训练、蒙特卡洛dropout和集成方法以及先前的证据方法进行了比较。我们的方法在最小计算开销下,持续改进了不确定性校准,优于传统的确定性方法。同时,它保持了有竞争力的分割精度,而先前的证据方法往往会出现性能下降。

英文摘要

LiDAR semantic segmentation is a core perception capability for autonomous vehicles and mobile robots. However, safe operation also depends on knowing when predictions are unreliable. Existing approaches typically rely on softmax confidence, which is often miscalibrated and overconfident, while stronger uncertainty estimates from Monte Carlo dropout or ensembles are often computationally expensive for real-time use. To this end, we introduce a novel, architecture-agnostic uncertainty-aware Adapter Head. It decomposes the prediction into a Preference Head for class ranking and a Strength Head that refines uncertainty assessment, thereby enabling a principled construction of evidential Dirichlet representations. Building on this design, we propose our inverse-vacuity self-calibration objective (Invascal), which directly supervises the strength signal to produce reliable and well-calibrated uncertainty estimates while preventing runaway evidence growth. We evaluate our framework across multiple LiDAR datasets and backbone architectures. We compare against deterministic training, Monte Carlo dropout and ensembles, and prior evidential methods. Our approach consistently improves uncertainty calibration over traditional deterministic methods with minimal computational overhead. At the same time, it preserves competitive segmentation accuracy, where prior evidential methods often suffer performance degradation.
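
For readers unfamiliar with evidential Dirichlet outputs, the sketch below shows how a class-ranking signal and a scalar strength could be combined into Dirichlet parameters, and how vacuity is then computed. The combination rule `alpha = 1 + strength * softmax(logits)` is an assumption for illustration; the paper's exact head design and the Invascal objective are not reproduced here.

```python
import numpy as np

def dirichlet_from_heads(pref_logits, strength):
    """Assumed construction: spread `strength` units of evidence over the
    classes in proportion to the softmax of the preference logits."""
    z = np.asarray(pref_logits, dtype=float)
    z = z - z.max()                      # stabilise the softmax
    p = np.exp(z) / np.exp(z).sum()
    return 1.0 + strength * p            # Dirichlet concentration alpha

def vacuity(alpha):
    """Subjective-logic vacuity: K / sum(alpha); 1.0 means no evidence."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha.shape[-1] / alpha.sum(axis=-1)

logits = [2.0, 0.5, -1.0, 0.0]           # hypothetical 4-class point
alpha_none = dirichlet_from_heads(logits, strength=0.0)
alpha_some = dirichlet_from_heads(logits, strength=12.0)
```

Under this construction vacuity equals K / (K + strength), so the class ranking and the uncertainty level are controlled by separate quantities.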

2606.00063 2026-06-02 cs.RO math-ph math.MP physics.flu-dyn 版本更新

Linear Motility Maps in Nonlinear Viscous Fluids

非线性粘性流体中的线性运动映射

Yishun Zhou, Shai Revzen

发表机构 * Department of Robotics, University of Michigan(机器人学系,密歇根大学) Departments of Electrical Engineering and Computer Science, and Ecology and Evolutionary Biology(电气工程与计算机科学系、生态与进化生物学系)

AI总结 研究在低雷诺数流体中,线性运动映射扩展到幂律流体,并发现Carreau-Yasuda流体可违反该线性性质实现净运动,方向可随速度改变。

详情
AI中文摘要

已知在低雷诺数流体中运动的系统受“运动映射”支配,该映射线性地将形状变化率与通过流体的本体框架速度联系起来。其结果是“珀塞尔扇贝定理”——经历时间上前后相同路径的形状变化(往复身体变形)的运动系统无法实现净位移,无论这些变化的速度如何。我们证明线性速度运动映射扩展到任何幂律粘度(即Ostwald-de Waele流体),因此也适用于中间剪切范围内的许多生物流体。我们还表明,在Carreau-Yasuda流体中,线性速度性质可以被违反,使用由两个不等质量且具有不等阻力系数的质量组成的“尺蠖”模型进行往复运动,从而产生净运动。有趣的是,运动方向可以通过改变速度来切换。我们的结果表明,几何力学的线性运动映射可用于分析和设计幂律流体中的运动,并且某些非线性阻力关系(如Carreau-Yasuda)可用于产生净运动,看似违反了“扇贝定理”。

英文摘要

Systems moving in low-Reynolds-number fluid regimes are known to be governed by a "motility map" which linearly relates their shape change rates to their body-frame velocity through the fluid. A consequence of this is "Purcell's Scallop Theorem": a locomotion system that undergoes shape changes that follow the same path forward and backward in time (reciprocal body deformations) cannot achieve net displacement, regardless of the pacing of those changes. We show that linear-in-velocity motility maps extend to any power-law viscosity (a.k.a. Ostwald-de Waele fluid), and therefore to many biological fluids in intermediate shear ranges. We also show that the linear-in-velocity property can be violated in Carreau-Yasuda fluids to produce net motion using an "inchworm" model consisting of two unequal masses with unequal drag coefficients performing reciprocal motions. Interestingly, the direction of motion can be switched by changing speeds. Our results show that the linear motility map of geometric mechanics can be used to analyze and design locomotion in power-law fluids, and that some nonlinear drag relationships such as Carreau-Yasuda can be exploited to generate net locomotion in seeming violation of the "scallop theorem".
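
The scallop theorem mentioned in this abstract follows directly from linearity: if body velocity is A(r) times the shape rate, then displacement is the path integral of A(r) dr, and time drops out. The sketch below checks this numerically for a one-dimensional shape variable. The connection `A` is a made-up example, and sample counts stand in for pacing; nothing here models the paper's power-law or Carreau-Yasuda analysis.

```python
import numpy as np

def net_displacement(A, shape_path):
    """Integrate dx = A(r) dr along a sampled 1-D shape path (midpoint rule)."""
    r = np.asarray(shape_path, dtype=float)
    mid = 0.5 * (r[1:] + r[:-1])
    return float(np.sum(A(mid) * np.diff(r)))

def A(r):
    # Hypothetical linear motility map (local connection) for illustration.
    return 1.0 / (1.0 + r ** 2)

# Reciprocal stroke: open quickly (few samples), close slowly (many samples).
open_fast = np.linspace(0.0, 1.0, 200)
close_slow = np.linspace(1.0, 0.0, 2000)
stroke = np.concatenate([open_fast, close_slow[1:]])

half_stroke = net_displacement(A, open_fast)   # close to arctan(1) = pi/4
full_stroke = net_displacement(A, stroke)      # close to zero
```

The opening half moves the body, but retracing the same shape path cancels it regardless of how the two halves are paced, which is the property the abstract says can fail once the drag law is no longer linear in velocity.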

2606.00059 2026-06-02 cs.RO cs.LG 版本更新

Reinforcement Learning for Optimal Experiment Design in Parameter Identification of Mechatronic Systems

机电系统参数辨识中最优实验设计的强化学习方法

Julian Langschwert, Georg Schaefer, Jakob Rehrl, Stefan Huber, Simon Hirlaender

发表机构 * Josef Ressel Centre for Intelligent and Secure Industrial Automation, Salzburg University of Applied Sciences, Salzburg, Austria(约瑟夫·雷塞尔智能与安全工业自动化中心,萨尔茨堡应用科学大学,奥地利萨尔茨堡) Paris Lodron University of Salzburg, Salzburg, Austria(萨尔茨堡大学,奥地利萨尔茨堡)

AI总结 提出一种强化学习智能体,通过奖励塑形自主满足安全约束,为Quanser Aero 2测试平台学习最优激励信号,在三个辨识参数上均达到竞争性估计精度,且安全违规率仅0.75%。

Comments Accepted at DEXA AI4IP 2026

详情
AI中文摘要

信息丰富的激励信号对于机电系统的精确系统辨识至关重要,然而经典系统辨识方法需要专家知识和手工设计的信号以满足硬件安全约束,限制了其通用性。我们提出一种强化学习智能体,为Quanser Aero 2测试平台学习最优激励信号,同时通过奖励塑形自主强制执行安全约束。在10个独立训练种子的评估中,我们的综合智能体在所有三个辨识参数上均实现了具有竞争力的估计精度,优于经典基线方法,且仅产生0.75%的安全违规。

英文摘要

Informative excitation signals are critical for accurate system identification of mechatronic systems, yet classical system identification (SI) approaches require expert knowledge and hand-crafted signal design to respect hardware safety constraints, limiting their generalizability. We propose a reinforcement learning (RL) agent that learns optimal excitation signals for a Quanser Aero 2 testbed while autonomously enforcing safety constraints through reward shaping. Evaluated across 10 independent training seeds, our comprehensive agent achieves competitive estimation accuracy across all three identified parameters, outperforming classical baselines while incurring only 0.75% safety violations.

2606.00054 2026-06-02 cs.RO cs.AI cs.CV 版本更新

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

从人类视频到机器人操作:基于人类中心数据的可扩展视觉-语言-动作学习综述

Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, Zongqing Lu, Oier Mees, Marc Pollefeys, Jiaolong Yang, Baining Guo

发表机构 * Tsinghua University(清华大学) HKUST(香港科技大学) Xi’an Jiaotong University(西安交通大学) Fudan University(复旦大学) Microsoft Research Asia(微软亚洲研究院) Peking University(北京大学) Microsoft Zurich Project(微软苏黎世实验室)

AI总结 本文综述了如何将丰富的人类视频转化为视觉-语言-动作(VLA)模型的有效知识,分类了四种方法(潜在动作表示、预测世界模型、显式2D监督、显式3D重建),并指出了结构化非结构化视频、跨具身和视角的动作映射、以及评估协议设计三大挑战。

Comments Accepted to IJCAI 2026 Survey Track. Project page: https://aaronfengzy.github.io/HumanCentricToVLA-Survey/

详情
AI中文摘要

近期在可泛化具身控制方面的进展由大规模预训练的视觉-语言-动作(VLA)模型驱动。然而,大多数现有方法依赖于大量机器人演示数据,这些数据获取成本高昂且与特定具身紧密耦合。相比之下,人类视频丰富且捕捉了丰富的交互,为真实世界操作提供了多样的语义和物理线索。然而,具身差异以及任务对齐标注的频繁缺失使得它们直接用于VLA模型具有挑战性。本综述提供了一个统一的视角,探讨如何将人类视频转化为VLA模型的有效知识。我们根据所提取的动作相关信息将现有方法分为四类:(i) 编码帧间变化的潜在动作表示;(ii) 预测未来帧的预测世界模型;(iii) 提取图像平面线索的显式2D监督;(iv) 恢复几何或运动的显式3D重建。除分类外,我们强调了该领域的三个关键开放挑战:将非结构化视频结构化为可训练的片段、在具身和视角异质性下将视频导出的监督接地到机器人可执行动作中,以及设计能更好预测真实世界部署性能和迁移效率的评估协议,从而为未来研究方向提供参考。论文和资源的精选列表见 https://github.com/AaronFengZY/HumanCentricToVLA-Survey。

英文摘要

Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real-world manipulation. Yet, embodiment differences and the frequent absence of task-aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action-related information they derive: (i) latent action representations that encode inter-frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image-plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training-ready episodes, grounding video-derived supervision into robot-executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real-world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at https://github.com/AaronFengZY/HumanCentricToVLA-Survey.

2606.00053 2026-06-02 cs.RO 版本更新

VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis

VLAMotor: 通过基于智能体的数据合成实现视觉-语言-动作模型的测试引导增强

Zeqin Liao, Peifan Ren, Zixu Gao, Hongyu Gong, Lianyu Hu, Wenbing Tang, Yuhong Nan, Zibin Zheng, Yang Liu

发表机构 * School of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Software Engineering, Sun Yat-sen University(中山大学软件工程学院) GuangDong Engineering Technology Research Center of Blockchain, China(广东省区块链工程技术研究中心,中国) Northwest A&F University(西北农林科技大学)

AI总结 提出VLAMotor框架,通过距离感知测试暴露失败案例,并利用基于智能体的数据合成生成成功轨迹微调VLA模型,显著提升模型在仿真和真实环境中的成功率。

详情
AI中文摘要

视觉-语言-动作(VLA)模型遵循数据驱动范式,受训练数据覆盖范围的限制,在部署后容易在边缘情况配置上失败。为了减轻此类风险,必须暴露高质量失败模式,并将由此产生的失败转化为监督数据用于模型增强。现有研究大多止步于失败检测,缺乏利用发现的失败进行模型修复的机制。我们提出VLAMotor,这是首个用于VLA增强的分析框架,它集成了距离感知模型测试以暴露失败,以及基于智能体的数据合成以进行模型微调。首先,VLAMotor基于与训练样本的距离估计输入不确定性,并将不确定性排序与冗余消除相结合,构建暴露多样化失败的紧凑测试集。然后,VLAMotor将失败轨迹抽象为结构化语义表示,并规划参数化的修复技能序列,通过逆运动学和运动执行将其实现为可执行轨迹。由此产生的成功轨迹被自动标注并用于微调原始VLA模型,从而得到增强的VLA模型。在四个代表性机器人操作任务上的评估表明,VLAMotor生成的仿真测试用例中有92.33%触发了VLA失败,并且VLAMotor将测试覆盖率相比最先进工具提高了18.93%。通过使用从失败测试用例中导出的合成数据微调VLA模型,VLAMotor进一步将VLA模型的总体成功率提高了49.25%。当部署在真实硬件上时,仿真增强模型相比原始VLA模型成功率提高了57.50%,展示了VLA增强的一种有效且低成本的方向。

英文摘要

Vision-Language-Action (VLA) models follow a data-driven paradigm and are constrained by the coverage of training data, making them prone to failure on edge-case configurations after deployment. To mitigate such risks, it is essential to expose high-quality failure modes and convert the resulting failures into supervisory data for model enhancement. Existing studies largely stop at failure detection and lack a mechanism for leveraging discovered failures for model repair. We propose VLAMotor, the first analysis framework for VLA enhancement, which integrates distance-aware model testing for failure exposure and agent-based data synthesis for model fine-tuning. First, VLAMotor estimates input uncertainty based on the distance to training samples, and combines uncertainty ranking with redundancy elimination to build compact test sets that expose diverse failures. Then, VLAMotor abstracts failure trajectories into structured semantic representations, and plans parameterized repair-skill sequences, which are then realized as executable trajectories through inverse kinematics and motion execution. The resulting successful trajectories are automatically labeled and used to fine-tune the original VLA model, yielding an enhanced VLA model. Evaluation on four representative robotic manipulation tasks shows that 92.33% of the in-simulation test cases generated by VLAMotor trigger VLA failures, and VLAMotor improves test coverage over the state-of-the-art tool by 18.93%. By fine-tuning VLA models with synthetic data derived from failed test cases, VLAMotor further enhances the overall success rate of VLA models by 49.25%. When deployed on real hardware, the simulation-enhanced models improve the success rate over the original VLA models by 57.50%, demonstrating an effective and low-cost direction for VLA enhancement.

2605.30877 2026-06-02 cs.RO 版本更新

Wall-OSS-0.5 Technical Report

Wall-OSS-0.5 技术报告

Ryan Yu, Pushi Zhang, Starrick Liu, Brae Liu, Miracle Kang, Shalfun Li, Lights Shi, Ellie Ma, Ping Yang, Chris Pan, Jerry Chen, Dongxiu Liu, Rain Sun, Miles Guo, Byron Zhang, Hugo Zhou, Zach Xu, Vincent Chen, Harrison Huang, James Wang, Dance Kuzi, Andy Zhai, Hang Su, Roy Gan, Lucy Liang, Hao Wang, Qian Wang

发表机构 * arXiv

AI总结 本文提出Wall-OSS-0.5,一个基于3B VLM骨干网络并增强动作生成组件的4B开源VLA模型,通过梯度桥接联合训练策略,在超过20个实体上预训练,实现零样本真实机器人行为,并在微调后超越π_0.5,证明VLA预训练本身即可产生可执行的机器人能力。

详情
AI中文摘要

大规模视觉-语言-动作(VLA)预训练正日益成为机器人策略的基础,然而预训练VLA的证据几乎总是在任务特定微调后报告。这留下了一个基本问题未解答:VLA预训练本身是否产生可执行的机器人行为,还是仅仅为下游策略学习提供更好的初始化?我们提出Wall-OSS-0.5,一个基于3B VLM骨干网络并增强动作生成组件的开源4B VLA,设计使得预训练的机器人能力可直接在物理硬件上测量。该模型在超过20个实体上进行预训练,每轮处理超过一百万个机器人轨迹以及一个多模态语料库。我们采用梯度桥接联合训练方案,其中三个目标扮演不同且互补的角色:离散动作预测将强大的VLM原生梯度注入骨干网络,多模态预测保持基于视觉-语言的理解,连续流匹配作为部署时的动作接口。在任务特定微调之前,预训练检查点实现了非平凡的零样本真实机器人行为,在17个任务套件中完成了包括一个保留的变形操作任务在内的多个任务,并取得了高任务进度。微调后,同一检查点作为更强的适应先验,在15个真实机器人任务上达到60.5%的平均任务进度,比π_0.5高出17.5%。多模态评估进一步证实动作训练不会侵蚀基于视觉-语言的能力:模型在保持广泛视觉-语言能力的同时增强了具身基础。总之,这些结果将VLA预训练从初始化策略重新定位为可直接测试且已经有用的机器人能力来源。

英文摘要

Large-scale Vision-Language-Action (VLA) pretraining is increasingly adopted as the foundation for robot policies, yet the evidence for pretrained VLAs is almost invariably reported after task-specific fine-tuning. This leaves a foundational question unanswered: does VLA pretraining itself yield executable robot behavior, or does it merely furnish a better initialization for downstream policy learning? We present Wall-OSS-0.5, an open-source 4B VLA built upon a 3B VLM backbone augmented with action-generation components, designed so that pretrained robotic capability is directly measurable on physical hardware. The model is pretrained across more than 20 embodiments, processing over one million robot trajectories per epoch alongside a grounded multimodal corpus. We adopt a gradient-bridged co-training recipe in which three objectives play distinct and complementary roles: discrete action prediction routes strong VLM-native gradients into the backbone, multimodal prediction preserves grounded vision-language understanding, and continuous flow matching serves as the deployment-time action interface. Before task-specific fine-tuning, the pretrained checkpoint achieves non-trivial zero-shot real-robot behavior, completing several tasks, including a held-out deformable manipulation task, at high task progress on a 17-task suite. After fine-tuning, the same checkpoint serves as a stronger adaptation prior, reaching 60.5% average task progress on 15 real-robot tasks and outperforming π_0.5 by 17.5%. Multimodal evaluations further confirm that action training does not erode grounded vision-language competence: the model preserves broad vision-language ability while strengthening embodied grounding. Together, these results reposition VLA pretraining from an initialization strategy to a directly testable, already useful source of robot capability.

2605.30581 2026-06-02 cs.CV cs.AI cs.RO 版本更新

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

工业视觉模拟到现实中的先验可用性:CAD引导与CAD不可用机制的综述

Chenxi Tao, Seung-Kyum Choi

发表机构 * George W. Woodruff School of Mechanical Engineering(乔治·W·伍德鲁夫机械工程学院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文通过先验可用性视角重新组织工业视觉模拟到现实问题,区分CAD可用、CAD不可用和边界先验三种机制,并基于T-LESS/BOP、MVTec AD和VisA数据集进行实证分析,揭示了源分布设计、检测器容量和真实校准的重要性,以及CAD在测试时提供的独特验证通道。

Comments Review article; 103 references; 9 main figures; empirical anchors on T-LESS/BOP, MVTec AD, and VisA

详情
AI中文摘要

工业视觉模拟到现实通常被描述为从合成图像到真实图像的迁移,但工业部署通常涉及可用证据与所需决策之间更广泛的错配。系统可能基于CAD渲染、模拟RGB-D观测、正常参考图像、合成缺陷、预训练特征空间或语言提示构建,却在不同的传感器、光照、材料、夹具、校准、生产变化和罕见缺陷模式下部署。本综述将工业视觉模拟到现实重新定义为由先验可用性组织的域差距问题。我们区分了CAD可用设置(其中显式物体几何可支持渲染、校准、姿态估计、分割和测试时几何验证)、CAD不可用设置(其中几何被正常参考外观、特征分布、师生残差、合成异常假设、基础特征或视觉语言先验取代)以及边界先验设置(其中近似模型、模板、参考视图或语义对应仅保留CAD的部分作用)。这一框架将基于CAD的检测和6D姿态估计文献与通常单独综述的工业异常和表面检测文献联系起来。为使分类具体化,我们使用T-LESS/BOP、MVTec AD和VisA上的实证锚点。这些锚点表明,仅靠CAD渲染数量并不能弥合迁移;源分布设计、检测器容量和小规模真实校准可能更为重要。它们还表明,测试时的CAD通过掩码、姿态和深度一致性创建了独特的验证通道,而CAD不可用的检测则依赖于校准的正常性和特征偏差。因此,本综述反对单一跨任务排行榜,而是询问什么先验支撑了部署决策。

英文摘要

Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually involves a broader mismatch between available evidence and required decisions. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, yet deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects CAD-based detection and 6D pose-estimation literature with industrial anomaly and surface-inspection literature that is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close transfer; source-distribution design, detector capacity, and small real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.

2605.27180 2026-06-02 cs.RO 版本更新

Towards Drone-based Mapping of Volcanic Gases using Gas Tomography

面向基于无人机的火山气体测绘:使用气体断层成像

Marius Schaab, Niklas Karbach, Antonia Rabe, Thomas Wiedemann, Patrick Hinsen, Dmitriy Shutin, Thorsten Hoffmann, Achim J. Lilienthal

发表机构 * German Research Foundation (DFG)(德国研究基金会) Istituto Nazionale di Geofisica e Vulcanologia (INGV)(意大利国家地震与火山观测研究所)

AI总结 针对无人机旋翼下洗流干扰问题,提出基于拉格朗日模型的模型驱动气体断层成像方法,实现火山气体排放的准确测绘。

详情
AI中文摘要

火山排放大量二氧化碳,直接影响人类生活。测绘火山气体排放有助于预测喷发并了解火山对气候和环境的影响。基于无人机的气体传感显著降低了火山监测的风险,但在测量气体时面临技术限制,因为旋翼下洗流会在检测前驱散气体羽流。使用远程气体传感的气体断层成像解决了这一挑战。在Salinelle dei Cappuccini泥火山,我们证明,尽管无人机搭载的原位传感器因空气动力学干扰未能检测到CO2排放,但开路路径传感成功实现了远程气体分布测绘。我们提出了一种新颖的基于模型的气体断层重建方法,该方法结合拉格朗日模型来补偿风引起的平流。所得气体分布图与手动收集的原位测量结果一致,证实了基于模型的气体断层成像有效克服了下洗流限制,并实现了火山排放的准确测绘。

英文摘要

Volcanoes emit large amounts of CO2, directly influencing human lives. Mapping volcanic gas emissions helps to forecast eruptions and understand the impact of volcanoes on climate and the environment. Drone-based gas sensing significantly reduces risks in volcanic monitoring but faces technical limitations when measuring gas, as rotor downwash disperses the gas plume before detection. Gas Tomography using remote gas sensing addresses this challenge. At the Salinelle dei Cappuccini mud volcanoes, we demonstrate that while drone-mounted in-situ sensors failed to detect CO2 emissions due to aerodynamic disturbance, open-path sensing successfully enabled remote gas distribution mapping. We present a novel model-based gas tomographic reconstruction approach that incorporates a Lagrangian model to compensate for wind-induced advection. The resulting gas distribution maps align with manually collected in-situ measurements, confirming that model-based gas tomography effectively overcomes downwash limitations and enables accurate mapping of volcanic emissions.

2605.26625 2026-06-02 cs.RO cs.SY eess.SY 版本更新

Provably Safe Motion Planning Under Unknown Disturbances

未知扰动下的可证明安全运动规划

Ibon Gracia, Qi Heng Ho, Luca Laurenti, Morteza Lahijanian

发表机构 * Department of Aerospace Engineering Sciences at the University of Colorado Boulder(科罗拉多大学博尔德分校航空航天工程科学系) Department of Aerospace and Ocean Engineering(航空航天与海洋工程系) Delft University of Technology(代尔夫特理工大学) Italian Institute of Artificial Intelligence(意大利人工智能研究所)

AI总结 针对未知分布随机扰动下的机器人系统,提出一种基于Wasserstein模糊管的学习型采样运动规划算法,实现概率完备且保守性低的安全规划。

详情
AI中文摘要

我们提出了一种可证明安全的基于采样的运动规划算法,适用于受未知分布随机扰动影响的机器人系统。我们考虑具有线性或可线性化动力学的系统,在具有任意形状障碍物的工作空间中运行,并受状态和控制约束。安全要求被表述为机会约束。我们的方法利用系统轨迹的数据来学习Wasserstein模糊管,即一系列模糊集,该模糊管以高置信度包含系统状态分布的轨迹。然后,该模糊管被用于一个概率完备的算法中,以构建一个尊重问题约束的基于采样的运动规划树。我们表明,学习几个低维模糊管而不是单个高维模糊管可以有效降低保守性并提高可扩展性。此外,我们设计了一种高效的基于bandit的有效性检查器,在不牺牲概率完备性的情况下显著提高了我们算法的经验性能。案例研究表明,我们的算法在严格安全阈值下的杂乱环境中找到了有效规划,优于最先进的方法。

英文摘要

We present a provably safe sampling-based motion planning algorithm for robotic systems affected by random disturbances of unknown distribution. We consider systems with linear or linearizable dynamics evolving in a workspace with arbitrarily shaped obstacles, subject to state and control constraints. Safety requirements are formulated as chance constraints. Our approach leverages data from trajectories of the system to learn a Wasserstein ambiguity tube, i.e., a sequence of ambiguity sets, which contains the trajectory of the system's state distribution with high confidence. This ambiguity tube is then used in a probabilistically complete algorithm to grow a sampling-based motion planning tree that respects the constraints of the problem. We show that learning several lower-dimensional ambiguity tubes instead of a single high-dimensional one effectively reduces conservatism and boosts scalability. Additionally, we design an efficient bandit-based validity checker that markedly increases the empirical performance of our approach without sacrificing probabilistic completeness. Case studies show that our algorithm finds valid plans in cluttered environments under strict safety thresholds, outperforming state-of-the-art methods.

2605.30280 2026-06-02 cs.RO cs.AI cs.CL 版本更新

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qwen-VLA:统一跨任务、环境和机器人形态的视觉-语言-动作建模

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen

发表机构 * Qwen Team(通义实验室)

AI总结 提出Qwen-VLA,一种基于DiT动作解码器的统一具身基础模型,通过大规模联合预训练和具身感知提示,将操作、导航和轨迹预测统一为动作-轨迹预测框架,实现跨任务、环境和机器人形态的泛化。

Comments 34 pages

详情
AI中文摘要

具身智能通常通过针对单个任务(如操作或导航)的专用模型进行研究,导致能力碎片化,且跨任务、环境和机器人形态的泛化能力有限。在这项工作中,我们研究了异构的具身决策问题是否可以在单个视觉-语言-动作模型中统一。我们提出了Qwen-VLA,一个统一的具身基础模型,它通过基于DiT的动作解码器将Qwen的视觉-语言建模栈从感知、理解和推理扩展到连续动作和轨迹生成。Qwen-VLA通过大规模联合预训练方案在多样化的数据源上进行训练,包括机器人操作轨迹、人类自我中心演示、合成模拟数据、视觉-语言导航数据、轨迹中心监督和辅助视觉-语言数据。为了支持多种机器人平台,我们引入了具身感知提示调节,其中特定于机器人的文本描述指定了当前的具身形态和控制约定。我们进一步将操作、导航和轨迹预测统一为一个动作-轨迹预测框架,实现了跨机器人形态、任务族和环境的可迁移视觉基础、空间推理和连续动作生成。在操作、导航和轨迹中心基准上的实验显示,在场景布局、背景、光照、物体配置和机器人形态变化下,具有一致的多任务性能和分布外泛化能力。Qwen-VLA-Instruct在LIBERO上达到97.9%,在Simpler-WidowX上达到73.7%,在RoboTwin-Easy/Hard上达到86.1%/87.2%,在R2R上达到69.0% OSR,在RxR上达到59.6% SR,在真实世界ALOHA实验中平均OOD成功率为76.9%,在DOMINO动态操作上零样本成功率为26.6%。

英文摘要

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.

2605.29973 2026-06-02 cs.RO 版本更新

Replicable Simulation-Based Robot Validation through Provenance

通过数据溯源实现可复现的基于仿真的机器人验证

Argentina Ortega, Samuel Wiest, Frederik Pasch, Nico Hochgeschwender

发表机构 * Argentine Institute of Technology(阿根廷技术研究所)

AI总结 针对基于仿真的机器人验证可复现性不足的问题,提出将数据溯源与FAIR原则集成到测试流程中,通过追踪工件链接和附加机器可读元数据来增强可复现性,并在移动机器人导航数据集上验证了该方法。

Comments Accepted for publication at 2026 IEEE RAS International Conference on Engineering Reliable Autonomous Systems (ERAS)

详情
AI中文摘要

机器人行为通常通过基于仿真的测试来验证,然而此类测试活动的可复现性关键取决于测试配置、执行和后处理过程的透明文档化。我们认为,数据溯源结合FAIR原则(可发现、可访问、可互操作、可重用)通过显式追踪工件之间的链接以及附加关于文件来源和关键设计决策的机器可读元数据,弥补了这一不足。此外,溯源和元数据不能被视为仅局限于最终数据集的后续补充;它们必须集成到生成这些数据集的测试过程中,以便能够端到端地重建证据。我们通过为现有的基于仿真的测试框架增加溯源追踪和元数据收集机制,并利用这些扩展为移动机器人导航数据集添加结构化溯源和符合FAIR原则的元数据来证明这一点。最后,我们讨论了在此集成过程中遇到的障碍——例如词汇对齐、属性选择和领域标准的采纳——并提供了在机器人验证工作流中实现以溯源为中心的FAIR元数据的可操作建议。

英文摘要

Robot behavior is often validated through simulation-based testing, yet the replicability of such campaigns depends critically on transparent documentation of how tests are configured, executed, and post-processed. We argue that data provenance, coupled with the FAIR principles (findability, accessibility, interoperability, and reusability), addresses this gap by explicitly tracking links between artifacts and by attaching machine-readable metadata about file origins and key design decisions. Moreover, provenance and metadata cannot be treated as an afterthought confined to final datasets; they must be integrated into the testing processes that generate those datasets so that evidence can be reconstructed end-to-end. We demonstrate this by augmenting an existing simulation-based testing framework with provenance tracking and metadata collection mechanisms, and by using these extensions to enrich a mobile robot navigation dataset with structured provenance and FAIR-aligned metadata. Finally, we discuss obstacles encountered in this integration -- such as vocabulary alignment, attribute selection, and adoption of domain standards -- and provide actionable recommendations for implementing provenance-centric, FAIR metadata in robotics validation workflows.
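The idea of attaching machine-readable metadata about file origins to each test artifact can be illustrated with a minimal sketch. It is not the authors' framework: the record layout, the field names (`activity`, `used`, `generated`, `parameters`), and the example file names are all hypothetical, chosen only to show how content hashes let a reader trace which inputs produced which outputs.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Content hash used as a stable identifier for an artifact."""
    return hashlib.sha256(data).hexdigest()

def provenance_record(activity, inputs, outputs, parameters):
    """Build a machine-readable record linking one test step to its artifacts.

    `inputs` and `outputs` map artifact names to their raw bytes.
    All field names here are illustrative, not a published vocabulary.
    """
    return {
        "activity": activity,
        "parameters": parameters,
        "used": {name: sha256_hex(blob) for name, blob in inputs.items()},
        "generated": {name: sha256_hex(blob) for name, blob in outputs.items()},
    }

# Hypothetical artifacts of one simulated navigation test run.
scenario = b"map: warehouse\nobstacles: 12\n"
rosbag = b"...binary log of the simulated run..."

record = provenance_record(
    activity="simulate_navigation_test",
    inputs={"scenario.yaml": scenario},
    outputs={"run_001.bag": rosbag},
    parameters={"simulator_seed": 42},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Because the record is created by the step that produces the artifact, rather than reconstructed afterwards, a later post-processing step can list the same hash under its own `used` field and the chain can be followed end to end.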

2605.13548 2026-06-02 cs.RO cs.AI 版本更新

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

AttenA+: 纠正机器人基础模型中的动作不平等性

Daojie Peng, Fulong Ma, Jiahang Cao, Qiang Zhang, Xupeng Xie, Jian Guo, Ping Luo, Andrew F. Luo, Boyu Zhou, Jun Ma

发表机构 * HKUST(GZ)(香港科技大学(广州)) HKU(香港大学) USTC(中国科学技术大学) IDEA Research(IDEA研究院) SUSTech(南方科技大学) X-Humanoid

AI总结 针对机器人基础模型忽视动作物理重要性的问题,提出AttenA+框架,通过速度驱动的动作注意力重加权训练目标,提升复杂长程任务性能。

详情
AI中文摘要

现有的机器人基础模型虽然强大,但基于一个隐含的时间同质性假设:在优化过程中将所有动作视为同等信息量。这种从语言模型继承的“平坦”训练范式,对操作的内在物理层次结构无动于衷。实际上,机器人轨迹本质上是异质的,其中低速段通常通过需要精确交互来决定任务成功,而高速运动则作为容错过渡。这种均匀损失权重与物理关键性之间的错位从根本上限制了当前视觉-语言-动作(VLA)模型和世界-动作模型(WAM)在复杂长程任务中的性能。为了纠正这一点,我们引入了AttenA+,一个与架构无关的框架,通过速度驱动的动作注意力优先考虑运动学关键段。通过基于逆速度场重新加权训练目标,AttenA+自然地使模型的学习能力与操作的物理需求对齐。作为一种即插即用的增强,AttenA+可以集成到现有骨干网络中,无需结构修改或额外参数。大量实验表明,AttenA+显著提升了当前最先进模型的上限。具体来说,它在Libero基准上将OpenVLA-OFT提升至98.6%(+1.5%),并将FastWAM在RoboTwin 2.0上推进至92.4%(+0.6%)。在Franka机械臂上的真实世界验证进一步展示了其鲁棒性和跨任务泛化能力。我们的工作表明,挖掘动作序列的内在结构先验为标准缩放定律提供了一种高效、物理感知的补充,为通用机器人控制开辟了新路径。

英文摘要

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.
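The abstract describes reweighting the training objective by the inverse velocity field but does not give the formula. The sketch below is one plausible reading, stated as an assumption: each time step gets a weight proportional to 1 / (speed + eps), where speed is a finite difference of consecutive actions, and the weights are rescaled to mean 1 so the overall loss scale is unchanged. The function names, the `eps` smoothing term, and the mean-1 normalization are illustrative choices, not taken from the paper.

```python
import math

def inverse_velocity_weights(actions, dt=1.0, eps=1e-3):
    """Return per-step loss weights proportional to 1 / (speed + eps), scaled to mean 1."""
    n = len(actions)
    speeds = []
    for t in range(n):
        # Backward finite difference; step 0 borrows the first forward difference.
        prev_idx, next_idx = (t - 1, t) if t > 0 else (0, min(1, n - 1))
        speeds.append(math.dist(actions[prev_idx], actions[next_idx]) / dt)
    raw = [1.0 / (s + eps) for s in speeds]
    mean = sum(raw) / n
    return [w / mean for w in raw]

def weighted_mse(pred, target, weights):
    """Mean over time steps of weight * per-step mean squared error."""
    per_step = [
        sum((p - q) ** 2 for p, q in zip(p_t, y_t)) / len(p_t)
        for p_t, y_t in zip(pred, target)
    ]
    return sum(w * e for w, e in zip(weights, per_step)) / len(per_step)

# A fast reach followed by a slow, precise approach (2-D end-effector targets).
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.01, 0.0), (2.02, 0.0)]
weights = inverse_velocity_weights(trajectory)
```

With this trajectory the two slow final steps receive much larger weights than the three fast ones, which matches the abstract's claim that low-velocity segments should dominate the objective. No parameters are added to the model; only the loss changes.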

2510.01711 2026-06-02 cs.RO cs.LG 版本更新

Contrastive Representation Regularization for Vision-Language-Action Models

视觉-语言-动作模型的对比表示正则化

Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin

发表机构 * KAIST(韩国科学技术院)

AI总结 提出机器人状态感知对比损失(RS-CL),通过对比学习对齐VLM表示与机器人本体感受状态,提升VLA模型在机器人操作任务中的性能。

Comments ICML 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过利用预训练视觉-语言模型(VLM)的丰富表示,在机器人操作中展现了强大的能力。然而,它们的表示可以说仍然次优,缺乏对控制动作和本体感受信息等机器人信号的敏感性。为了解决这个问题,我们引入了机器人状态感知对比损失(RS-CL),一种简单有效的VLA模型表示正则化方法,旨在弥合VLM表示与机器人信号之间的差距。特别地,RS-CL通过使用状态之间的相对距离作为软监督,使表示更紧密地对齐机器人的本体感受状态。作为原始动作预测目标的补充,RS-CL增强了控制相关表示学习,同时轻量级且与标准VLA训练流程完全兼容。我们的实验结果表明,RS-CL显著提升了最先进VLA模型的性能;它将先前技术在RoboCasa-Kitchen基准上的性能提升至69.7%,达到最先进水平,并在具有挑战性的真实机器人操作任务中将成功率从45.0%提升至58.3%。

英文摘要

Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models: it raises the prior art to 69.7% on the RoboCasa-Kitchen benchmark, achieving state-of-the-art performance, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.

2605.24881 2026-06-02 cs.RO 版本更新

Learning Transferable Motor Skills for Geometry-Aware Robotic Surface Tasks

面向几何感知的机器人表面任务的可迁移运动技能学习

Miroslav David, Karla Stepanova, Robert Babuska

发表机构 * Czech Institute of Informatics, Robotics, and Cybernetics(捷克信息学、机器人学与自动化研究所) Czech Technical University in Prague(布拉格捷克技术大学) Delft University of Technology(代尔夫特理工大学)

AI总结 提出一种模块化框架,将几何运动规划与执行级专家行为解耦,通过可解释的原子运动规则和神经网络推断,实现跨几何形状的迁移学习。

Comments In: Workshop on Geometry in the Age of Data-Driven Robotics, ICRA 2026, Vienna, 2026

详情
AI中文摘要

机器人表面交互任务,如喷涂或焊接,需要精确的几何规划和精确的运动执行。虽然现代运动规划器能够生成有效的几何路径,但它们通常缺乏人类操作员所具备的专家运动模式。相反,从示范中学习往往将任务执行紧密耦合到特定的训练几何形状,限制了可迁移性。我们提出了一种模块化框架,将几何运动规划与执行级专业知识解耦。专家行为被表示为一个可解释的、原子的运动规则词汇表,例如速度缩放和方向偏移,这些规则系统地修改几何规划的参考路径。我们训练了一个多模态神经网络,从运动轨迹数据和CAD模型几何中联合推断规则参数。我们通过在L形和窗形物体上的动态仿真评估了我们的方法,证明了模型在两种拓扑结构上成功提取了速度和方向规则。

英文摘要

Robotic surface-interaction tasks, such as spray painting or welding, require both accurate geometric planning and precise motion execution. While modern motion planners generate valid geometric paths, they often lack the expert motor patterns observed in human operators. Conversely, learning from demonstration often tightly couples task execution to the specific training geometry, limiting transferability. We propose a modular framework that decouples geometric motion planning from execution-level expertise. Expert behavior is represented as a vocabulary of interpretable, atomic motor rules, such as velocity scaling and orientation offsets, that systematically modify a geometrically planned reference path. We train a multimodal neural network to infer rule parameters jointly from kinematic trajectory data and CAD model geometry. We evaluate our approach through dynamic simulation on L-shaped and window-shaped objects, demonstrating on simulated data that the model successfully extracts velocity and orientation rules across both topologies.

2603.02845 2026-06-02 cs.RO cs.AI 版本更新

SPARC: Spatial-Aware Path Planning via Attentive Agent Communication

SPARC: 通过注意力智能体通信实现空间感知路径规划

Sayang Mu, Xiangyu Wu, Bo An

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出关系增强多头注意力(RMHA)机制,通过嵌入曼哈顿距离到注意力权重计算,优先处理空间邻近机器人的消息,在40x40网格上从8机器人零样本泛化到128机器人时,在30%障碍密度下实现约75%成功率,超越基线25个百分点以上。

Comments The manuscript is being withdrawn at the request of the first author for the purpose of revising content and re-uploading a revised version with updated data/figures/text. The revised manuscript will be resubmitted to arXiv promptly with the same author list and research theme.

详情
AI中文摘要

高效通信对于分散式多机器人路径规划(MRPP)至关重要,然而现有的学习型通信方法平等对待所有邻近机器人,而不考虑它们的空间接近性,导致在协调最重要的拥挤区域注意力被稀释。我们提出关系增强多头注意力(RMHA),这是一种通信机制,它将成对曼哈顿距离显式嵌入到注意力权重计算中,使每个机器人能够动态优先处理来自空间相关邻居的消息。结合距离约束注意力掩码和GRU门控消息融合,RMHA与MAPPO无缝集成,实现稳定的端到端训练。在从8个训练机器人到128个测试机器人在40x40网格上的零样本泛化中,RMHA在30%障碍密度下实现了约75%的成功率,比最佳基线高出超过25个百分点。消融研究证实,距离关系编码是高密度环境中成功率提高的关键因素。索引词-多机器人路径规划,图注意力机制,多头注意力,通信优化,协作决策。

英文摘要

Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation-enhanced Multi-Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU-gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves approximately 75 percent success rate at 30 percent obstacle density, outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to success rate improvement in high-density environments. Index Terms: multi-robot path planning, graph attention mechanism, multi-head attention, communication optimization, cooperative decision-making.
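To make the mechanism concrete, here is a single-head sketch of attention whose logits depend on pairwise Manhattan distance and whose neighbors are masked beyond a radius. How the paper embeds distance is not specified in the abstract; the additive linear penalty `-penalty * d`, the hard `radius` cutoff, and all names below are assumptions for illustration only.

```python
import math

def manhattan(p, q):
    """Manhattan (L1) distance between two grid cells."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def distance_biased_attention(queries, keys, positions, i, radius=5, penalty=0.5):
    """Attention weights of robot i over other robots within `radius` grid cells."""
    d_k = len(keys[0])
    logits = {}
    for j, key in enumerate(keys):
        if j == i:
            continue
        d = manhattan(positions[i], positions[j])
        if d > radius:
            continue  # distance-constrained mask: far robots are ignored
        score = sum(a * b for a, b in zip(queries[i], key)) / math.sqrt(d_k)
        logits[j] = score - penalty * d  # nearer neighbors get higher logits
    if not logits:
        return {}
    m = max(logits.values())
    exps = {j: math.exp(v - m) for j, v in logits.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

# Four robots with identical message content, so only distance separates them.
positions = [(0, 0), (1, 0), (3, 1), (20, 20)]
features = [[1.0, 0.0]] * 4
attn = distance_biased_attention(features, features, positions, i=0)
```

Robot 3 is outside the radius and receives no weight, while robot 1 (distance 1) outweighs robot 2 (distance 4) even though their content scores are equal. A plain softmax over content scores alone would have split attention evenly across all three.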

2512.05335 2026-06-02 cs.RO 版本更新

State-Conditional Adversarial Learning: An Off-Policy Visual Domain Transfer Method for End-to-End Imitation Learning

状态条件对抗学习:一种用于端到端模仿学习的离策略视觉域迁移方法

Yuxiang Liu, Shengfan Cao

发表机构 * University of California, Berkeley, CA, USA(加州大学伯克利分校)

AI总结 针对目标域数据严格离策略、无专家且稀缺的挑战,提出状态条件对抗学习(SCAL),通过状态条件潜在KL散度的判别器估计对齐分布,实现鲁棒的视觉域迁移。

详情
AI中文摘要

我们在一个现实且具有挑战性的设置中研究端到端模仿学习的视觉域迁移,其中目标域数据严格离策略、无专家且稀缺。我们首先提供理论分析,表明目标域模仿损失可以由源域损失加上源和目标观测模型之间的状态条件潜在KL散度上界。受此结果启发,我们提出状态条件对抗学习(SCAL),一种离策略对抗框架,使用基于判别器的条件KL项估计来对齐基于系统状态的潜在分布。在基于BARC-CARLA模拟器的视觉多样化自动驾驶环境中的实验表明,SCAL实现了鲁棒的迁移和强大的样本效率。

英文摘要

We study visual domain transfer for end-to-end imitation learning in a realistic and challenging setting where target-domain data are strictly off-policy, expert-free, and scarce. We first provide a theoretical analysis showing that the target-domain imitation loss can be upper bounded by the source-domain loss plus a state-conditional latent KL divergence between source and target observation models. Guided by this result, we propose State-Conditional Adversarial Learning (SCAL), an off-policy adversarial framework that aligns latent distributions conditioned on system state using a discriminator-based estimator of the conditional KL term. Experiments on visually diverse autonomous driving environments built on the BARC-CARLA simulator demonstrate that SCAL achieves robust transfer and strong sample efficiency.

2605.12689 2026-06-02 cs.RO 版本更新

3D RL-DWA: A Hybrid Reinforcement Learning and Dynamic Window Approach for Goal-Directed Local Navigation in Multi-DoF Robots

3D RL-DWA:一种用于多自由度机器人目标导向局部导航的混合强化学习与动态窗口方法

Chiara Castellani, Enrico Turco, Domenico Prattichizzo

发表机构 * European Union’s Horizon Europe Research and Innovation Programme(欧洲联盟的地平线欧洲研究与创新计划)

AI总结 提出结合强化学习与动态窗口方法的混合框架,利用稀疏点云数据动态调整可变形微型机器人的运动和形状,在复杂受限环境中实现目标导航并最大化占据体积,实验表明该方法在变形和导航能力上优于纯强化学习和基于模型的方法。

Comments Accepted for publication in the Proceedings of the IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM 2026)

详情
AI中文摘要

在本文中,我们提出了一种新颖的混合方法,将强化学习与动态窗口方法相结合,用于高自由度机器人系统的自适应3D局部导航。我们的方法利用稀疏点云数据动态调整可变形微型机器人的运动和形状,使系统能够在复杂受限环境中导航到目标,同时最大化占据体积。我们在模拟血管网络中评估了我们的框架。基于1080次试验的实验结果表明,与纯强化学习和基于模型的方法相比,将强化学习与基于DWA的局部规划器集成显著增强了变形和导航能力。特别是,所提出的自主控制器在训练过程中始终实现高变形和近乎完美的路径完成,并在未见过的场景中保持稳健性能。这些发现突显了混合规划策略在稀疏感知条件下实现高效自适应3D导航的潜力。

英文摘要

In this paper, we present a novel hybrid approach that combines Reinforcement Learning (RL) with Dynamic Window Approach (DWA) for adaptive 3D local navigation of high-degree-of-freedom robotic systems. Our method leverages sparse point cloud data to dynamically adjust both the motion and the shape of a deformable microrobot, enabling the system to navigate toward a goal in complex, constrained environments while maximizing the occupied volume. We evaluate our framework in a simulated vascular network. Experimental results, based on 1080 trials, indicate that integrating RL with a DWA-based local planner significantly enhances both deformation and navigation capabilities compared to pure RL and model-based methods. In particular, the proposed autonomous controller consistently achieves high deformation and near-perfect path completion during training and maintains robust performance in unseen scenarios. These findings highlight the potential of hybrid planning strategies for efficient and adaptive 3D navigation under sparse sensory conditions.

2605.12369 2026-06-02 cs.RO 版本更新

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

GuidedVLA: 通过即插即用的动作注意力特化指定任务相关因素

Xiaosong Jia, Bowen Yang, Zuhao Ge, Xian Nie, Yuchen Zhou, Cunxin Fan, Yufeng Li, Yilin Chai, Chao Jing, Zijian Liang, Qingwen Bu, Haidong Cao, Chao Wu, Qifeng Li, Zhenjie Yang, Chenhe Zhang, Hongyang Li, Zuxuan Wu, Junchi Yan, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI (TEAI)(可信具身人工智能研究院) Shanghai Key Laboratory of Multimodal Embodied AI(上海多模态具身人工智能重点实验室) Shanghai Jiao Tong University(上海交通大学) OpenDriveLab, The University of Hong Kong(OpenDrive实验室,香港大学)

AI总结 提出GuidedVLA框架,通过为动作解码器中的注意力头分配人工定义的辅助信号(如物体定位、空间几何、时序技能逻辑),显式引导模型关注任务相关因素,提升VLA模型在域内和域外场景的成功率。

Comments Accepted to RSS 2026. Project page: https://guidedvla.github.io/project_page/

详情
AI中文摘要

视觉-语言-动作(VLA)模型旨在通过将动作作为模态与强大的视觉-语言模型(VLM)对齐来实现通用机器人学习。现有的VLA依赖端到端监督隐式地使动作解码过程学习任务相关特征。然而,在没有显式指导的情况下,这些模型常常过拟合虚假相关性,例如视觉捷径或环境噪声,限制了其泛化能力。在本文中,我们介绍了GuidedVLA,一个旨在手动引导动作生成聚焦于任务相关因素的框架。我们的核心见解是将动作解码器视为功能组件的集合,而非单一的学习器。个体注意力头通过人工定义的辅助信号进行监督,以捕获不同的因素。作为初步研究,我们用三个特化头实例化该范式:物体定位、空间几何和时序技能逻辑。在仿真和真实机器人实验中,与强VLA基线相比,GuidedVLA在域内和域外设置中均提高了成功率。最后,我们展示了这些特化因素的质量与任务性能正相关,并且我们的机制产生了解耦的、高质量的特征。我们的结果表明,显式引导动作解码器学习是构建更鲁棒和通用VLA模型的有前景方向。

英文摘要

Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.

2602.08058 2026-06-02 cs.CV cs.AI cs.RO cs.SY eess.SY 版本更新

Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

Picasso: 基于物理约束采样的整体场景重建

Xihang Yu, Rajat Talak, Lorenzo Shaikewitz, Luca Carlone

发表机构 * Massachusetts Institute of Technology(麻省理工学院) National University of Singapore(新加坡国立大学)

AI总结 提出Picasso,一种通过快速拒绝采样推理多物体交互并考虑几何、非穿透和物理约束的整体场景重建方法,在物理合理性和重建精度上显著优于现有技术。

Comments 15 pages, accepted to Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

在存在遮挡和测量噪声的情况下,几何精确的场景重建(即拟合传感器数据)仍然可能在物理上不正确。例如,当估计场景中物体的姿态和形状并将结果导入模拟器时,微小误差可能导致不合理的配置,包括物体相互穿透或不稳定平衡。这使得使用数字孪生预测场景的动态行为变得困难,而这是基于模拟的接触丰富行为规划和控制的重要步骤。在本文中,我们认为物体姿态和形状估计需要对场景进行整体推理(而不是孤立地推理每个物体),考虑物体交互和物理合理性。为此,我们的第一个贡献是Picasso,一个受物理约束的重建流水线,通过考虑几何、非穿透和物理来构建多物体场景重建。Picasso依赖于一种快速拒绝采样方法,该方法推理多物体交互,利用推断的物体接触图来指导采样。其次,我们提出了Picasso数据集,这是一个包含10个接触丰富真实场景的集合,带有真实标注,以及一个量化物理合理性的指标,我们将其作为基准测试的一部分开源。最后,我们在新引入的数据集和YCB-V数据集上对Picasso进行了广泛评估,结果表明它在提供物理合理且更符合人类直觉的重建的同时,大幅优于现有技术。

英文摘要

In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.

2602.04672 2026-06-02 cs.CV cs.GR cs.RO 版本更新

AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation

AGILE: 通过代理生成从视频重建手-物体交互

Jin-Chuan Shi, Binhong Ye, Tao Liu, Junzhe He, Yangjinhui Xu, Xiaoyang Liu, Zeju Li, Hao Chen, Chunhua Shen

发表机构 * State Key Lab of CAD & CG, Zhejiang University(浙江大学计算机辅助设计与图形学国家重点实验室) Zhejiang University of Technology(浙江工业大学)

AI总结 提出AGILE框架,利用视觉语言模型引导生成完整物体网格,结合锚定-跟踪策略和接触感知优化,从单目视频鲁棒重建手-物体交互,生成可直接用于仿真的资产。

Comments 16 pages, SIGGRAPH 2026

详情
AI中文摘要

从单目视频重建动态手-物体交互对于灵巧操作数据收集以及为机器人和VR创建逼真的数字孪生至关重要。然而,当前方法面临两个难以逾越的障碍:(1) 依赖神经渲染通常在严重遮挡下产生碎片化、不可用于仿真的几何体;(2) 依赖脆弱的运动恢复结构(SfM)初始化导致在野外视频中频繁失败。为克服这些限制,我们提出AGILE,一个鲁棒的框架,将范式从重建转变为交互学习的代理生成。首先,我们采用代理流水线,其中视觉语言模型(VLM)引导生成模型合成一个完整、水密的物体网格,具有高保真纹理,不受视频遮挡影响。其次,完全绕过脆弱的SfM,我们提出一种鲁棒的锚定-跟踪策略。我们使用基础模型在单个交互起始帧初始化物体姿态,并通过利用生成资产与视频观测之间的强视觉相似性在时间上传播姿态。最后,接触感知优化整合语义、几何和交互稳定性约束以强制执行物理合理性。在HO3D、DexYCB、ARCTIC和野外视频上的大量实验表明,AGILE在全局几何精度上优于基线,同时在先前技术经常崩溃的具有挑战性的序列上表现出卓越的鲁棒性。通过优先考虑物理有效性,我们的方法生成可直接用于仿真的资产,并通过真实到仿真重定向在机器人应用中验证。项目页面:https://agile-hoi.github.io。

英文摘要

Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, ARCTIC, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior arts frequently collapse. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications. Project page: https://agile-hoi.github.io.

2411.13109 2026-06-02 cs.RO 版本更新

Special Unitary Parameterized Estimators of Rotation

旋转的特殊酉参数化估计器

Akshay Chandrasekhar

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过特殊酉矩阵重新审视旋转估计问题,提出两种新的连续表示用于神经网络中的旋转学习,并通过实验验证其有效性。

Comments Published at ICLR 2026; clarified paper contribution and theoretical narrative; 33 pages

详情
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR 2026)
AI中文摘要

本文通过特殊酉矩阵的视角重新审视旋转估计问题。我们首先使用$SU(2)$重新表述Wahba问题,推导出多个解,从而得到对应四元数参数的线性约束。然后,我们通过为相关问题制定高效方法来探索这些约束的应用。最后,基于这一理论基础,我们提出了两种新的连续表示,用于神经网络中的旋转学习。大量实验验证了所提方法的有效性。

英文摘要

This paper revisits the topic of rotation estimation through the lens of special unitary matrices. We begin by reformulating Wahba's problem using $SU(2)$ to derive multiple solutions that yield linear constraints on corresponding quaternion parameters. We then explore applications of these constraints by formulating efficient methods for related problems. Finally, from this theoretical foundation, we propose two novel continuous representations for learning rotations in neural networks. Extensive experiments validate the effectiveness of the proposed methods.
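For readers unfamiliar with Wahba's problem, the sketch below solves it with the classical Davenport q-method, which reduces the problem to a 4x4 symmetric eigenproblem over unit quaternions. This is a textbook baseline, not the SU(2) formulation of the paper; the function names `quat_to_rot` and `davenport_q` and the (w, x, y, z) quaternion ordering are assumptions made for this illustration.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z), active convention."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def davenport_q(src, dst, weights=None):
    """Unit quaternion maximizing sum_i w_i * dst_i . (R(q) src_i).

    Maximizing this objective is equivalent to minimizing Wahba's loss
    sum_i w_i * ||dst_i - R src_i||^2 over rotations R.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.ones(len(src)) if weights is None else np.asarray(weights, dtype=float)
    B = dst.T @ (w[:, None] * src)          # B = sum_i w_i dst_i src_i^T
    S = B + B.T
    z = np.array([B[2, 1] - B[1, 2], B[0, 2] - B[2, 0], B[1, 0] - B[0, 1]])
    K = np.empty((4, 4))
    K[0, 0] = np.trace(B)
    K[0, 1:] = z
    K[1:, 0] = z
    K[1:, 1:] = S - np.trace(B) * np.eye(3)
    # The objective equals q^T K q, so the optimum is the top eigenvector of K.
    _, vecs = np.linalg.eigh(K)             # eigenvalues in ascending order
    return vecs[:, -1]
```

The quaternion is recovered only up to sign, since q and -q encode the same rotation, so comparisons are best made on the rotation matrices.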

2604.22896 2026-06-02 cs.RO cs.LG 版本更新

Magnetic Indoor Localization through CNN Regression and Rotation Invariance

基于CNN回归和旋转不变性的磁室内定位

Helge Rosé, Konstantin Klipp, Tom Koubek, Bernd Schäufele, Ilja Radusch

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 提出使用旋转不变特征(磁场强度和重力轴投影)训练轻量级CNN模型,实现无需方向校准的室内定位,在MagPie数据集上达到或超越现有最优精度。

Comments Published and presented at the 2026 4th International Conference on Mechatronics, Control and Robotics (ICMCR)

详情
AI中文摘要

室内定位是GNSS拒止环境中广泛应用的关键技术,包括室内导航和物联网系统。结合卷积神经网络(CNN)和基于磁场特征的方法,提供了一种低成本、无需基础设施的精确定位解决方案。尽管磁指纹是室内定位的一种有前景的方法,但基于原始3D磁力计数据训练的模型对设备方向高度敏感。我们通过使用从3D磁场导出的两个旋转不变特征来解决这个问题:磁场强度(Mn)和重力轴投影(Mg)。我们在磁序列上训练轻量级7层扩张CNN(MagNetS/XL),直接回归(x, y)位置。使用MagPie数据集(三栋建筑,手持轨迹),我们系统评估了测试和/或训练数据的固定和随机旋转。原始3D输入(Mx, My, Mz)在固定90°旋转下表现出各向同性误差增加,并随着随机旋转增大而进一步恶化。相比之下,2D输入(Mn, Mg)保持旋转不变精度,并且一旦旋转超过三个参考建筑的特定阈值(Loomis大建筑0°,Talbot中建筑5°,CSL小建筑6°),其性能就超过3D输入。MagNetXL在MagPie数据集上达到或超越了现有最优精度,而MagNetS以约三分之一的参数实现了相似性能,有利于移动部署。这些结果表明,在实际使用中,从旋转不变输入获得的鲁棒性超过了输入维度降低的损失,从而无需方向校准或额外基础设施即可进行地图构建和定位。

英文摘要

Indoor positioning is an essential technology for a wide range of applications in GNSS-denied environments, including indoor navigation and IoT systems. Combining convolutional neural networks (CNNs) and magnetic field-based features offers a low-cost, infrastructure-free solution for precise positioning. While magnetic fingerprints are a promising approach for indoor positioning, models trained on raw 3D magnetometer data are highly sensitive to device orientation. We address this by using two rotation invariant features derived from the 3D magnetic field: the norm (Mn) and the projection onto the gravity axis (Mg). We train a lightweight 7-layer dilated CNN (MagNetS/XL) on magnetic sequences to directly regress (x, y) positions. Using the MagPie dataset (three buildings, handheld trajectories), we systematically evaluate fixed and random rotations of test and/or train data. Raw 3D inputs (Mx, My, Mz) exhibit isotropic error increases under fixed 90° rotations and further degrade with growing random rotations. In contrast, 2D (Mn, Mg) inputs maintain rotation invariant accuracy and surpass the 3D inputs once rotation exceeds building-specific thresholds for three reference buildings: 0° for Loomis (large), 5° for Talbot (medium), and 6° for CSL (small). MagNetXL achieves or exceeds state-of-the-art accuracy on the MagPie dataset, and MagNetS delivers similar performance with roughly one third of the parameters, favoring mobile deployment. These results show that the robustness gained from rotation invariant inputs outweighs the loss of input dimensionality in realistic usage, allowing mapping and localization without orientation alignment or added infrastructure.
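The two rotation-invariant features named in the abstract can be sketched as follows. The sketch assumes a gravity estimate is available in the same device frame as the magnetometer reading (for example from an accelerometer); how the authors obtain it is not stated in the abstract, and the function name is hypothetical.

```python
import numpy as np

def rotation_invariant_features(mag, grav):
    """Map device-frame magnetometer samples to (Mn, Mg).

    mag, grav: arrays of shape (N, 3) in the same device frame.
    Mn is the field magnitude; Mg is the field component along gravity.
    Both are unchanged when the device frame is rotated.
    """
    mag = np.asarray(mag, dtype=float)
    grav = np.asarray(grav, dtype=float)
    g_unit = grav / np.linalg.norm(grav, axis=-1, keepdims=True)
    mn = np.linalg.norm(mag, axis=-1)
    mg = np.sum(mag * g_unit, axis=-1)
    return np.stack([mn, mg], axis=-1)
```

Because a rotation preserves both vector norms and inner products, applying the same rotation to `mag` and `grav` leaves the output unchanged, which is the property the abstract relies on.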

2603.15956 2026-06-02 cs.RO cs.AI 版本更新

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

ExpertGen: 从非完美行为先验的可扩展仿真到现实专家策略学习

Zifan Xu, Ran Gong, Maria Vittoria Minniti, Kausik Sivakumar, Ahmet Salih Gundogdu, Eric Rosen, Riedana Yan, Tushar Kusnur, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper

发表机构 * Robotics and AI Institute(机器人与人工智能研究院) University of Texas at Austin(德克萨斯大学奥斯汀分校) Sony AI(索尼人工智能)

AI总结 提出ExpertGen框架,通过扩散策略初始化行为先验并结合强化学习优化噪声,在仅稀疏奖励下生成高质量专家策略,实现从仿真到现实的可扩展迁移。

详情
AI中文摘要

学习通用且鲁棒的行为克隆策略需要大量高质量的机器人数据。虽然人类演示(例如通过遥操作)是专家行为的标准来源,但在现实世界中大规模获取此类数据成本过高。本文介绍了ExpertGen,一个在仿真中自动化专家策略学习的框架,以实现可扩展的仿真到现实迁移。ExpertGen首先使用在非完美演示(可能由大语言模型合成或由人类提供)上训练的扩散策略初始化行为先验。然后,通过优化扩散模型的初始噪声同时保持原始策略冻结,使用强化学习将该先验引导至高的任务成功率。通过保持预训练的扩散策略冻结,ExpertGen将探索正则化到安全、类人的行为流形内,同时仅使用稀疏奖励即可实现有效学习。在具有挑战性的操作基准上的实证评估表明,ExpertGen无需奖励工程即可可靠地生成高质量的专家策略。在工业装配任务中,ExpertGen实现了90.5%的整体成功率,而在长时域操作任务中达到了85%的整体成功率,优于所有基线方法。所得策略表现出灵巧的控制,并在不同的初始配置和失败状态下保持鲁棒。为了验证仿真到现实的迁移,学习到的基于状态的专家策略通过DAgger进一步提炼为视觉运动策略,并成功部署在真实的机器人硬件上。

英文摘要

Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.

2604.14344 2026-06-02 cs.RO 版本更新

CART: Context-Aware Terrain Adaptation using Temporal Sequence Selection for Legged Robots

CART: 基于时间序列选择的上下文感知地形自适应方法用于腿式机器人

Kartikeya Singh, Youngjin Kim, Yash Turkar, Karthik Dantu

发表机构 * DRONES LAB, University at Buffalo, NY, USA(无人机实验室,布法罗大学,纽约州,美国)

AI总结 提出CART高层控制器,通过融合本体感觉和外部感知的上下文信息,提升腿式机器人在复杂地形上的稳定行走能力,在仿真中将平均成功率较基线提高5%,并在仿真和真实实验中分别将基座振荡降低最多41%和22%。

详情
AI中文摘要

自然界中的动物结合多种模态(如视觉和触觉)来感知地形,并发展出在不平坦地形上高效行走的理解。同样,腿式机器人需要通过发展对视觉和本体感觉之间关系的理解,来增强其在复杂地形上稳定行走的能力。目前大多数地形自适应方法在复杂的越野地形上仍然容易失败,因为它们没有明确建模外部感知地形外观与本体感觉物理交互之间的上下文关系。这种基于经验的学习往往会在所见与真实感受之间产生视觉-纹理悖论。在这项工作中,我们引入了CART,一种基于上下文感知地形自适应方法的高层控制器,它集成了来自机载传感器的本体感觉和外部感知,以实现对地形的鲁棒理解。我们在多种地形上使用Unitree Go2和ANYmal-C机器人在IsaacSim模拟器中进行评估,并在真实世界实验中使用Boston Dynamics SPOT机器人。为了评估学习到的上下文是否能在各种悖论情况下改善运动行为,我们在仿真和真实实验中测量了机器人的稳定性、穿越成功率和任务完成时间。我们将CART与多种地形条件下的最先进运动控制和地形自适应基线进行比较。CART在仿真中将平均成功率比基线提高了5%,同时改善了上下文条件化的运动行为,包括在仿真中将基座振荡降低最多41%,在真实世界中降低22%,且不增加完成运动任务所需的时间。

英文摘要

Animals in nature combine multiple modalities, such as sight and feel, to perceive terrain and develop an understanding of how to walk on uneven terrain in an efficient manner. Similarly, legged robots need to develop their ability to stably walk on complex terrains by developing an understanding of the relationship between vision and proprioception. Most current terrain-adaptation methods remain susceptible to failure on complex off-road terrain because they do not explicitly model the context between exteroceptive terrain appearance and proprioceptive physical interaction. This experience-based learning often creates a Visual-Texture Paradox between what has been seen and how it actually feels. In this work, we introduce CART, a high-level controller built on a context-aware terrain adaptation approach that integrates proprioception and exteroception from onboard sensing to achieve a robust understanding of terrain. We evaluate our method on multiple terrains using the Unitree Go2 and ANYmal-C robots in the IsaacSim simulator and a Boston Dynamics SPOT robot for our real-world experiments. To evaluate whether the learned context improves locomotion behavior under the various paradox circumstances, we measure the robot's stability, traversal success, and task completion time in both simulation and real-world experiments. We compare CART against state-of-the-art locomotion and terrain-adaptation baselines across diverse terrain conditions. CART improves the average success rate by 5% over the baselines in simulation, while improving context-conditioned locomotion behavior, including up to 41% lower base oscillation in simulation and 22% in the real world, without increasing the time required to complete the locomotion tasks.

2504.08278 2026-06-02 math.OC cs.RO cs.SY eess.SY 版本更新

Line-Search Filter Differential Dynamic Programming for Optimal Control with Nonlinear Equality Constraints

带非线性等式约束最优控制的线搜索滤波微分动态规划

Ming Xu, Stephen Gould, Iman Shames

发表机构 * School of Computer and Communication Sciences, EPFL(洛桑联邦理工学院计算机与通信科学学院) School of Computing, Australian National University(澳大利亚国立大学计算学院) Department of Electrical and Electronic Engineering, University of Melbourne(墨尔本大学电子与电气工程系)

AI总结 提出FilterDDP算法,通过线搜索和步长滤波器处理非线性等式约束,并证明局部二次收敛性,在机器人接触隐式轨迹优化中验证有效性。

Comments Accepted for publication in the IEEE International Conference on Robotics and Automation (ICRA) 2026. Revised version with more exposition in methodology and updated results with improved implementation

详情
AI中文摘要

我们提出FilterDDP,一种用于求解带非线性等式约束的离散时间最优控制问题的微分动态规划算法。与基于评价函数(merit function)或增广拉格朗日类算法的先前方法不同,FilterDDP使用步长滤波器结合线搜索来处理等式约束。我们确定了步长滤波器准则的两个重要设计选择,这些选择带来了鲁棒的数值性能:1)在步长接受准则中使用拉格朗日函数而非代价函数;2)在反向传播中扰动值函数Hessian矩阵。这两个选择都有严格的理论依据,特别是对于2),我们给出了局部二次收敛的形式化证明。除了提供处理同时含等式和不等式约束的最优控制问题的原始-对偶内点扩展外,我们还在机器人学中出现的三个接触隐式轨迹优化问题上验证了FilterDDP。

英文摘要

We present FilterDDP, a differential dynamic programming algorithm for solving discrete-time optimal control problems (OCPs) with nonlinear equality constraints. Unlike prior methods based on merit functions or the augmented Lagrangian class of algorithms, FilterDDP uses a step filter in conjunction with a line search to handle equality constraints. We identify two important design choices for the step filter criteria which lead to robust numerical performance: 1) we use the Lagrangian instead of the cost in the step acceptance criterion, and 2) in the backward pass, we perturb the value function Hessian. Both choices are rigorously justified, for 2) in particular by a formal proof of local quadratic convergence. In addition to providing a primal-dual interior point extension for handling OCPs with both equality and inequality constraints, we validate FilterDDP on three contact-implicit trajectory optimisation problems which arise in robotics.

2604.12792 2026-06-02 cs.RO 版本更新

Actuation space reduction to facilitate insightful shape matching in a novel reconfigurable tendon driven continuum manipulator

驱动空间缩减以促进新型可重构腱驱动连续体机器人的形状匹配洞察

Sabyasachi Dash, John Golden, Girish Krishnan

发表机构 * Department of Industrial and Enterprise Systems Engineering, University of Illinois Urbana-Champaign(工业与企业系统工程系,伊利诺伊大学厄巴纳-香槟分校) Department of Mechanical Science and Engineering, University of Illinois, Urbana-Champaign(机械科学与工程系,伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出一种通过旋转间隔盘重构腱路径的连续体机器人设计,利用曲率-扭转中间空间简化从期望骨架曲线到驱动器输入的映射,实现无模型的分步形状匹配策略。

详情
AI中文摘要

在腱驱动连续体机器人(TDCM)中,重构腱路径可以实现骨架的定制空间变形。本文提出一种设计,其中腱可以在驱动之前或之后通过主动旋转各个间隔盘来重新布线。每个盘的旋转因此为驱动空间增加了一个自由度,使得从期望骨架曲线到相应驱动器输入的映射复杂化。然而,当骨架形状投影到由曲率和扭转(C-T)定义的中间空间时,会出现一些模式,突出显示哪些盘对实现全局形状最有影响。这种洞察力使得一种简化的顺序形状匹配策略成为可能:首先,旋转近端和中间盘以近似全局形状;然后,调整远端盘以微调末端执行器位置,同时对整体形状影响最小。所提出的驱动框架为传统控制方法提供了一种无模型替代方案,绕过了建模可重构TDCM的复杂性。

英文摘要

In tendon driven continuum manipulators (TDCMs), reconfiguring the tendon routing enables tailored spatial deformation of the backbone. This work presents a design in which tendons can be rerouted either prior to or after actuation by actively rotating the individual spacer disks. Each disk rotation thus adds a degree of freedom to the actuation space, complicating the mapping from a desired backbone curve to the corresponding actuator inputs. However, when the backbone shape is projected into an intermediate space defined by curvature and torsion (C-T), patterns emerge that highlight which disks are most influential in achieving a global shape. This insight enables a simplified, sequential shape-matching strategy: first, the proximal and intermediate disks are rotated to approximate the global shape; then, the distal disks are adjusted to fine-tune the end-effector position with minimal impact on the overall shape. The proposed actuation framework offers a model-free alternative to conventional control approaches, bypassing the complexities of modeling reconfigurable TDCMs.
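As a reference for the curvature-torsion (C-T) space mentioned above, the sketch below evaluates the textbook Frenet formulas on a sampled backbone curve using finite differences. It is an illustrative assumption about how such a projection could be computed, not the authors' procedure; the function name and the sampling scheme are hypothetical.

```python
import numpy as np

def curvature_torsion(points, t):
    """Curvature and torsion of a sampled 3D curve r(t).

    points: (N, 3) samples of the curve; t: (N,) parameter values.
    Uses kappa = |r' x r''| / |r'|^3 and
         tau   = (r' x r'') . r''' / |r' x r''|^2,
    with derivatives approximated by np.gradient. Values near the two
    ends of the curve are less accurate, and tau is undefined where
    the curvature vanishes (straight segments).
    """
    points = np.asarray(points, dtype=float)
    d1 = np.gradient(points, t, axis=0)
    d2 = np.gradient(d1, t, axis=0)
    d3 = np.gradient(d2, t, axis=0)
    cross = np.cross(d1, d2)
    cross_norm = np.linalg.norm(cross, axis=1)
    kappa = cross_norm / np.linalg.norm(d1, axis=1) ** 3
    tau = np.einsum("ij,ij->i", cross, d3) / cross_norm ** 2
    return kappa, tau
```

A circular helix is a convenient sanity check because both quantities are constant along it: for r(t) = (a cos t, a sin t, b t), the curvature is a/(a² + b²) and the torsion is b/(a² + b²).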

2604.10579 2026-06-02 cs.RO cs.AI 版本更新

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

AffordGen: 通过可供性对应生成多样化演示以实现通用物体操作

Jiawei Zhang, Kaizhe Hu, Yingqian Huang, Yuanchen Ju, Zhengrong Xue, Huazhe Xu

发表机构 * Shanghai Qi Zhi Institute(上海期智研究院) Tsinghua University(清华大学) Fudan University(复旦大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出AffordGen框架,利用3D生成模型和视觉基础模型在大规模3D网格上的语义对应生成多样化操作轨迹,训练鲁棒的闭环视觉运动策略,实现零样本泛化到未见物体。

详情
AI中文摘要

尽管现代模仿学习方法在机器人操作中取得了近期成功,但其性能常常受到数据多样性不足导致的几何变化的限制。利用强大的3D生成模型和视觉基础模型(VFMs),所提出的AffordGen框架通过利用大规模3D网格上有意义关键点的语义对应来生成新的机器人操作轨迹,从而克服了这一限制。然后,这个大规模、可供性感知的数据集被用于训练一个鲁棒的、闭环的视觉运动策略,结合了可供性的语义泛化能力和端到端学习的反应性鲁棒性。在仿真和现实世界中的实验表明,使用AffordGen训练的策略实现了高成功率,并能够零样本泛化到真正未见过的物体,显著提高了机器人学习中的数据效率。项目页面:https://jiaweiz9.github.io/AffordGen-release/

英文摘要

Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric variations due to limited data diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning. Project Page: https://jiaweiz9.github.io/AffordGen-release/

2604.09877 2026-06-02 cs.CV cs.AI cs.RO 版本更新

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

Genie 4D:语义先验引导的4D动态场景重建

Yiru Yang, Zhuojie Wu, Nishant Kumar Singh, Max Schulthess

发表机构 * University of Zurich(苏黎世大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出Genie 4D框架,结合实时视觉惯性高斯泼溅前端和前馈4D骨干网络,利用冻结的DINOv3特征作为结构先验抑制身份漂移,并通过条件扩散精炼器恢复高频细节,最终通过轻量级潜在动作头实现用户可控的4D世界模型重建。

详情
AI中文摘要

在计算机视觉与机器人感知的交汇处,动态场景的4D重建将低层几何感知与高层语义理解联系起来。我们提出Genie 4D,一个将手持手机拍摄转化为语义化、动作可控的4D世界模型的框架。Genie 4D将用于度量几何的实时视觉惯性高斯泼溅前端与由冻结的DINOv3特征(作为结构先验)正则化的前馈4D骨干网络相结合。语义先验抑制了动态跟踪中的身份漂移,而短条件扩散精炼器恢复了回归骨干网络平滑掉的高频表面细节。最后,一个轻量级潜在动作头将重建的4D状态暴露给以JEPA风格下一嵌入目标训练的Genie式世界模型,使得场景可以在用户动作下向前推进。在Point Odyssey和TUM-Dynamics基准测试上,Genie 4D保留了前馈基线的线性时间复杂度O(T),同时提高了3D跟踪精度(APD)和重建完整性,并且可以在单个消费级GPU(RTX 5090)上通过iPhone、Mac、Windows和Linux采集客户端交互式运行。Genie 4D为走向物理基础的世界模型提供了一条实用的、语义先验引导的路径。

英文摘要

At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing with high-level semantic understanding. We present Genie 4D, a framework that turns hand-held phone capture into a semantically grounded, action-controllable 4D world model. Genie 4D couples a real-time visual-inertial Gaussian splatting front-end for metric geometry with a feed-forward 4D backbone regularized by frozen DINOv3 features acting as structural priors. The semantic priors suppress identity drift during dynamic tracking, while a short conditional diffusion refiner recovers high-frequency surface detail that regression backbones smooth away. Finally, a lightweight latent-action head exposes the reconstructed 4D state to a Genie-style world model trained with a JEPA-style next-embedding objective, so that the scene can be rolled forward under user actions. On the Point Odyssey and TUM-Dynamics benchmarks, Genie 4D retains the linear time complexity O(T) of feed-forward baselines while improving 3D tracking accuracy (APD) and reconstruction completeness, and it runs interactively on a single consumer GPU (RTX 5090) from iPhone, Mac, Windows, and Linux capture clients. Genie 4D offers a practical, semantic-prior-guided path toward physically grounded world models.

2604.09487 2026-06-02 cs.RO cs.LG 版本更新

Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks

基于广义执行器网络的肌肉驱动机器人仿真到现实迁移

Jan Schneider, Mridul Mahajan, Le Chen, Simon Guist, Bernhard Schölkopf, Ingmar Posner, Dieter Büchler

AI总结 提出广义执行器网络(GenAN),通过从关节位置轨迹学习执行器模型,实现肌肉驱动机器人从仿真到现实的策略迁移,首次成功在四自由度肌肉驱动机器人臂上完成动态任务。

详情
AI中文摘要

肌腱驱动配合软肌肉执行器使机器人更快、更安全,同时可能加速技能获取。然而,由于固有的非线性、摩擦和迟滞,这些系统在实际中很少使用,这给建模和控制带来了复杂性。到目前为止,这些挑战阻碍了策略从仿真到真实系统的迁移。为弥合这一差距,我们提出了一种仿真到现实的流程,该流程学习这种复杂执行器的神经网络模型,并利用成熟的刚体仿真来处理手臂动力学和与环境的交互。我们的方法称为广义执行器网络(GenAN),通过直接从关节位置轨迹学习,而不是需要扭矩传感器,从而能够在广泛的机器人上进行执行器模型识别。在PAMY2(一种由气动人工肌肉驱动的肌腱驱动机器人)上使用GenAN,我们成功部署了完全在仿真中训练的、动态但精确的到达目标、杯中球和乒乓球策略。据我们所知,这一结果构成了四自由度肌肉驱动机器人臂首次成功的仿真到现实迁移。

英文摘要

Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems are rarely used in practice due to inherent nonlinearities, friction, and hysteresis, which complicate modeling and control. So far, these challenges have hindered policy transfer from simulation to real systems. To bridge this gap, we propose a sim-to-real pipeline that learns a neural network model of this complex actuation and leverages established rigid body simulation for the arm dynamics and interactions with the environment. Our method, called Generalized Actuator Network (GenAN), enables actuation model identification across a wide range of robots by learning directly from joint position trajectories rather than requiring torque sensors. Using GenAN on PAMY2, a tendon-driven robot powered by pneumatic artificial muscles, we successfully deploy dynamic but precise goal-reaching, ball-in-a-cup, and table tennis policies, trained entirely in simulation. To the best of our knowledge, this result constitutes the first successful sim-to-real transfer for a four-degrees-of-freedom muscle-actuated robot arm.

2604.02878 2026-06-02 cs.RO cs.SY eess.SY 版本更新

An Asynchronous Two-Speed Kalman Filter for Real-Time UUV Cooperative Navigation Under Acoustic Delays

一种用于声学延迟下实时UUV协同导航的异步双速卡尔曼滤波器

Shuyue Li, Miguel López-Benítez, Eng Gee Lim, Fei Ma, Qian Dong, Mengze Cao, Limin Yu, Xiaohui Qin

发表机构 * Xi’an Jiaotong-Liverpool University(西交利物浦大学) Suzhou Municipal Key Laboratory Broadband Wireless Access Technology(苏州市宽带无线接入技术重点实验室) XJTLU-JITRI Academy(XJTLU-JITRI学院)

AI总结 针对水声通信延迟导致实时状态估计困难的问题,提出一种异步双速卡尔曼滤波器(TSKF),通过变分历史蒸馏(VHD)机制解耦估计过程,实现高频实时控制与延迟协同信息处理,在严重延迟下保持与批量优化方法相当的轨迹误差。

Comments 6 pages, 6 figures. Accepted for publication in the 2026 IEEE International Conference on Industrial Informatics (INDIN). © 2026 IEEE. Personal use of this material is permitted. See PDF for the full IEEE copyright notice

详情
AI中文摘要

在全球导航卫星系统(GNSS)受限的水下环境中,单个无人水下航行器(UUV)会遭受无界航位推算漂移,因此协同导航(CN)对于精确状态估计至关重要。然而,水声信道固有的严重通信延迟对实时状态估计构成了严峻挑战。传统滤波器,如扩展卡尔曼滤波器(EKF)或无迹卡尔曼滤波器(UKF),通常在等待延迟数据时阻塞主控制回路,或者有效丢弃乱序测量(OOSM),导致严重漂移。为了解决这一问题,我们提出了一种由新颖投影机制——变分历史蒸馏(VHD)增强的异步双速卡尔曼滤波器(TSKF)。所提出的架构将估计过程解耦为两个并行线程:一个快速线程利用高斯过程(GP)补偿的航位推算来保证高频实时控制,另一个慢速线程专门处理异步延迟的协同信息。通过引入有限长度循环状态缓冲区(FLCSB),该算法将延迟测量应用于对应的历史状态,并利用基于VHD的投影将修正快速前向传播到当前时刻,而无需计算密集的重新计算。仿真结果表明,所提出的TSKF在严重延迟(高达30秒)下保持了与计算密集的批量优化方法相当的轨迹误差。在亚毫秒时间内执行,它显著优于标准EKF/UKF。结果展示了一种有效的控制、通信和计算(3C)协同设计,显著增强了自主海洋自动化系统的鲁棒性。

英文摘要

In Global Navigation Satellite System (GNSS)-denied underwater environments, individual unmanned underwater vehicles (UUVs) suffer from unbounded dead-reckoning drift, making collaborative navigation (CN) crucial for accurate state estimation. However, the severe communication delay inherent in underwater acoustic channels poses serious challenges to real-time state estimation. Traditional filters, such as Extended Kalman Filters (EKFs) or Unscented Kalman Filters (UKFs), usually block the main control loop while waiting for delayed data, or effectively discard Out-of-Sequence Measurements (OOSMs), resulting in serious drift. To address this, we propose an Asynchronous Two-Speed Kalman Filter (TSKF) enhanced by a novel projection mechanism, which we term Variational History Distillation (VHD). The proposed architecture decouples the estimation process into two parallel threads: a fast-rate thread that utilizes Gaussian Process (GP) compensated dead reckoning to guarantee high-frequency real-time control, and a slow-rate thread dedicated to processing asynchronously delayed collaborative information. By introducing a Finite-Length Circular State Buffer (FLCSB), the algorithm applies delayed measurements to their corresponding historical states, and utilizes a VHD-based projection to fast-forward the correction to the current time without computationally heavy recalculations. Simulation results demonstrate that the proposed TSKF maintains a trajectory error comparable to computationally intensive batch-optimization methods under severe delays (up to 30 s). Executing in sub-millisecond time, it significantly outperforms standard EKF/UKF. The results demonstrate an effective control, communication, and computing (3C) co-design that significantly enhances the resilience of autonomous marine automation systems.
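The buffering idea in the abstract, matching a delayed measurement to the historical state it refers to, can be illustrated with a fixed-capacity buffer of timestamped states. This is only a sketch of the lookup step under assumed names (`StateBuffer`, `push`, `lookup`); the VHD projection that carries the correction forward to the current time is specific to the paper and is not reproduced here.

```python
from bisect import bisect_left
from collections import deque

class StateBuffer:
    """Fixed-capacity buffer of (timestamp, state) pairs, oldest dropped first."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def push(self, t, state):
        """Append a state; timestamps must be strictly increasing."""
        if self._items and t <= self._items[-1][0]:
            raise ValueError("timestamps must be strictly increasing")
        self._items.append((t, state))

    def lookup(self, t_meas):
        """Return the buffered (t, state) closest in time to t_meas.

        Returns None when the measurement is older than everything still
        buffered, in which case it can no longer be applied.
        """
        if not self._items or t_meas < self._items[0][0]:
            return None
        times = [t for t, _ in self._items]
        i = bisect_left(times, t_meas)
        if i == len(times):
            return self._items[-1]
        if i == 0:
            return self._items[0]
        before, after = self._items[i - 1], self._items[i]
        return before if t_meas - before[0] <= after[0] - t_meas else after
```

The capacity bounds both memory use and the largest delay that can still be corrected, so in a design like this it would be chosen from the expected worst-case acoustic latency and the fast-thread update rate.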

2602.12724 2026-06-02 cs.RO 版本更新

TRANS: Terrain-aware Reinforcement Learning for Agile Navigation of Quadruped Robots under Social Interactions

TRANS:面向社交互动下四足机器人敏捷导航的地形感知强化学习

Wei Zhu, Irfan Tito Kurniawan, Ye Zhao, Mitsuhiro Hayashibe

发表机构 * Department of Robotics, Graduate School of Engineering, Tohoku University(日本东北大学工学研究科机器人学系) Laboratory for Intelligent Decision and Autonomous Robots, Woodruff School of Mechanical Engineering, Georgia Institute of Technology(佐治亚理工学院伍德拉夫机械工程学院智能决策与自主机器人实验室)

AI总结 提出一个名为TRANS的两阶段深度强化学习框架,通过三个DRL流水线(TRANS-Loco、TRANS-Nav和统一TRANS)实现四足机器人在非结构化地形上的社交导航,克服了传统方法中运动规划与运动控制分离、缺乏地形感知以及假设静态环境等局限。

详情
AI中文摘要

本研究介绍了TRANS:面向社交互动下敏捷导航的地形感知强化学习,这是一个用于四足机器人在非结构化地形上进行社交导航的深度强化学习(DRL)框架。传统的四足导航通常将运动规划与运动控制分离,忽略了全身约束和地形感知。另一方面,端到端方法更加集成,但需要高频传感,这通常噪声大且计算成本高。此外,大多数现有方法假设静态环境,限制了它们在有人环境中的使用。为了解决这些限制,我们提出了一个包含三个DRL流水线的两阶段训练框架。(1)TRANS-Loco采用非对称演员-评论家(AC)模型进行四足运动,无需显式的地形或接触观测即可穿越不平坦地形。(2)TRANS-Nav采用对称AC框架进行社交导航,在差速驱动运动学下直接将变换后的LiDAR数据映射到自我智能体动作。(3)统一流水线TRANS集成了TRANS-Loco和TRANS-Nav,支持在不平坦和社交互动环境中的地形感知四足导航。针对运动导航和社交导航基线的全面基准测试证明了TRANS的有效性。硬件实验进一步证实了其从仿真到实际迁移的潜力。

英文摘要

This study introduces TRANS: Terrain-aware Reinforcement learning for Agile Navigation under Social interactions, a deep reinforcement learning (DRL) framework for quadrupedal social navigation over unstructured terrains. Conventional quadrupedal navigation typically separates motion planning from locomotion control, neglecting whole-body constraints and terrain awareness. On the other hand, end-to-end methods are more integrated but require high-frequency sensing, which is often noisy and computationally costly. In addition, most existing approaches assume static environments, limiting their use in human-populated settings. To address these limitations, we propose a two-stage training framework with three DRL pipelines. (1) TRANS-Loco employs an asymmetric actor-critic (AC) model for quadrupedal locomotion, enabling traversal of uneven terrains without explicit terrain or contact observations. (2) TRANS-Nav applies a symmetric AC framework for social navigation, directly mapping transformed LiDAR data to ego-agent actions under differential-drive kinematics. (3) A unified pipeline, TRANS, integrates TRANS-Loco and TRANS-Nav, supporting terrain-aware quadrupedal navigation in uneven and socially interactive environments. Comprehensive benchmarks against locomotion and social navigation baselines demonstrate the effectiveness of TRANS. Hardware experiments further confirm its potential for sim-to-real transfer.

2603.28439 2026-06-02 cs.RO 版本更新

A Predictive Control Strategy to Offset-Point Tracking for Agricultural Mobile Robots

农业移动机器人偏移点跟踪的预测控制策略

Stephane Ngnepiepaye Wembe, Vincent Rousseau, Johann Laconte, Roland Lenain

发表机构 * Université Clermont Auvergne, INRAE, UR TSCF(克莱蒙特-奥弗涅大学,法国国家农业、食品与环境研究院(INRAE),TSCF研究单位) SABI AGRI(SABI农业)

AI总结 针对农业机器人忽略机具位置导致跟踪误差大的问题,提出一种闭式(closed-form)预测控制策略,显式建模刚性偏移点机具并考虑侧滑和杠杆臂效应,田间实验表明中位跟踪误差降低24%-56%,峰值误差降低最多70%。

Comments Accepted in the journal IEEE Transactions on Field Robotics

详情
AI中文摘要

机器人越来越多地被部署在农业中,以支持可持续实践并提高生产力。它们为实现精确、高效和环保的操作提供了巨大潜力。然而,现有的大多数路径跟踪控制器仅关注机器人的运动中心,忽略了所连接机具的空间占用和动力学。在实践中,诸如机械除草机或弹簧齿中耕机之类的机具通常是大型的、刚性安装的,并直接与作物和土壤相互作用;忽略它们的位置会降低跟踪性能并增加作物受损的风险。为解决这一局限,我们提出了一种闭式(closed-form)预测控制策略,扩展了文献[1]中介绍的方法。该方法专门针对阿克曼型农业车辆开发,将机具显式建模为刚性偏移点,同时考虑横向滑移和杠杆臂效应。该方法与最先进的基线控制器进行了基准测试,包括反应式几何方法、反应式反步法和基于模型的预测方案。使用两种不同机具的实际农业实验表明,所提方法将中位跟踪误差降低了24%至56%,并在曲率过渡期间将峰值误差降低了高达70%。这些改进转化为增强的操作安全性,特别是在机具靠近作物行作业的场景中。

英文摘要

Robots are increasingly being deployed in agriculture to support sustainable practices and improve productivity. They offer strong potential to enable precise, efficient, and environmentally friendly operations. However, most existing path-following controllers focus solely on the robot's center of motion and neglect the spatial footprint and dynamics of attached implements. In practice, implements such as mechanical weeders or spring-tine cultivators are often large, rigidly mounted, and directly interacting with crops and soil; ignoring their position can degrade tracking performance and increase the risk of crop damage. To address this limitation, we propose a closed-form predictive control strategy extending the approach introduced in [1]. The method is developed specifically for Ackermann-type agricultural vehicles and explicitly models the implement as a rigid offset point, while accounting for lateral slip and lever-arm effects. The approach is benchmarked against state-of-the-art baseline controllers, including a reactive geometric method, a reactive backstepping method, and a model-based predictive scheme. Real-world agricultural experiments with two different implements show that the proposed method reduces the median tracking error by 24% to 56%, and decreases peak errors during curvature transitions by up to 70%. These improvements translate into enhanced operational safety, particularly in scenarios where the implement operates in close proximity to crop rows.
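The lever-arm effect mentioned above follows from planar rigid-body kinematics: a point mounted away from the vehicle's reference point moves sideways whenever the vehicle yaws. The sketch below shows that relationship only; it ignores lateral slip and is not the paper's controller. The function names and the body-frame offsets `dx`, `dy` are assumptions for illustration.

```python
import math

def offset_point(x, y, theta, dx, dy):
    """World position of a point rigidly mounted at (dx, dy) in the vehicle frame.

    (x, y, theta) is the pose of the vehicle reference point, with theta the
    heading in radians; dx is forward and dy is to the left.
    """
    c, s = math.cos(theta), math.sin(theta)
    return (x + dx * c - dy * s, y + dx * s + dy * c)

def offset_velocity(v, omega, theta, dx, dy):
    """World-frame velocity of the offset point, assuming no slip.

    v is the forward speed of the reference point and omega the yaw rate.
    Rigid-body kinematics give v_point = v_ref + omega x r, where r is the
    lever arm expressed in the world frame.
    """
    c, s = math.cos(theta), math.sin(theta)
    vx = v * c - omega * (dx * s + dy * c)
    vy = v * s + omega * (dx * c - dy * s)
    return (vx, vy)
```

For a rear-mounted implement (negative `dx`), a positive yaw rate produces a lateral velocity opposite to the turn, which is why tracking only the reference point can leave the implement off the desired path during curvature transitions.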

2504.05033 2026-06-02 cs.RO cs.CV 版本更新

CloSE: A Geometric Shape-Agnostic Cloth State Representation

CloSE: 一种几何形状无关的布料状态表示

Jay Kamat, Júlia Borràs, Carme Torras

发表机构 * Institut de Robòtica i Informàtica Industrial, CSIC-UPC(西班牙工业机器人与信息技术研究所,CSIC-UPC)

AI总结 提出一种基于拓扑索引的dGLI圆盘表示,并从中抽象出紧凑、连续的CloSE表示,用于预测布料折叠位置并支持语义标注与规划。

Comments Accepted at ICRA 2026 (8 pages, 11 figures, 1 table). Project page: https://close-representation.github.io/

详情
AI中文摘要

布料操作是一个难题,主要是因为布料的非刚性特性,这使得对变形的良好表示至关重要。我们提出了一种新的布料变形状态表示。首先,我们提出了基于拓扑索引的dGLI圆盘表示,这些索引是针对排列在圆形网格上的布料边界边缘段计算的。dGLI圆盘的热力图揭示了与布料状态特征相对应的模式,这些模式对于不同形状、尺寸或方向的布料是一致的。然后,我们将这些重要特征从dGLI圆盘中抽象成一个圆,称为布料状态表示(CloSE)。这种表示紧凑、连续,且适用于不同形状。我们表明,这种表示能够准确预测多个仿真布料数据集中的折叠位置。最后,我们还展示了这种表示在两个相关应用中的优势:语义标注以及高层和低层规划。代码和数据集可从以下网址获取:https://close-representation.github.io/

英文摘要

Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation based on topological indices computed for edge segments of the cloth border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent for different shapes, sizes or orientation of the cloth. We then abstract these important features from the dGLI disk into a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. We show that this representation is able to accurately predict the fold locations for several simulation clothing datasets. Finally, we also show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code and the dataset can be accessed from: https://close-representation.github.io/

2603.14010 2026-06-02 cs.RO 版本更新

URDF-Anything+: End-to-End Generation for Simulation-Ready Articulated Assets

URDF-Anything+:面向仿真就绪铰接资产的端到端生成

Zhuangzhe Wu, Yue Xin, Chengkai Hou, Minghao Chen, Yaoxu Lyu, Jieyu Zhang, Shanghang Zhang

发表机构 * Peking University(北京大学) Visual Geometry Group, University of Oxford(牛津大学视觉几何组) University of Washington(华盛顿大学)

AI总结 提出URDF-Anything+,一种端到端自回归扩散框架,从单张RGB图像直接生成仿真就绪的URDF模型,统一建模部件几何与铰接结构。

详情
AI中文摘要

铰接物体是机器人学、物理仿真和交互式虚拟环境的基础。然而,从视觉观测中恢复它们本质上具有挑战性,因为图像仅提供关于部件几何及其底层运动学结构的部分和模糊线索。现有方法通常依赖多阶段流水线、从资产库检索或显式部件分割。我们提出URDF-Anything+,一种端到端自回归扩散框架,直接从单张RGB图像生成仿真就绪的URDF模型。以视觉观测和物体几何为条件,URDF-Anything+在结构化潜在空间中运行,并在统一生成过程中联合建模部件几何和铰接。具体而言,模型顺序预测每个铰接部件及其关联的关节参数,同时一个终止标记动态确定部件数量。这种设计使得无需外部检索或后处理阶段即可直接生成完全可执行的URDF。在大规模铰接物体基准上的实验表明,URDF-Anything+在几何重建质量、关节参数估计和物理可执行性方面优于先前方法,同时比现有多阶段方法显著更高效。此外,生成的URDF作为忠实数字孪生,使得纯仿真训练的操作策略能够零样本迁移。

英文摘要

Articulated objects are fundamental for robotics, simulation of physics, and interactive virtual environments. However, recovering them from visual observations is inherently challenging, as images provide only partial and ambiguous cues about both part geometry and their underlying kinematic structure. Existing approaches typically rely on multi-stage pipelines, retrieval from asset libraries, or explicit part segmentation. We present URDF-Anything+, an end-to-end autoregressive diffusion framework that generates simulation-ready URDF models directly from a single RGB image. Conditioned on visual observations and object geometry, URDF-Anything+ operates in a structured latent space and jointly models part geometry and articulation in a unified generation process. Specifically, the model sequentially predicts each articulated part together with its associated joint parameters, while a termination token dynamically determines the number of parts. This design enables direct generation of fully executable URDFs without external retrieval or post-processing stages. Experiments on large-scale articulated object benchmarks demonstrate that URDF-Anything+ outperforms prior methods in geometric reconstruction quality, joint parameter estimation, and physical executability, while being substantially more efficient than existing multi-stage approaches. Furthermore, the generated URDFs serve as faithful digital twins, enabling the zero-shot transfer of manipulation policies trained purely in simulation.
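To make the target format concrete, the sketch below assembles a minimal URDF string from a list of parts and joint parameters using the Python standard library. The `make_urdf` helper and its input dictionary layout are hypothetical and unrelated to the paper's model; geometry, collision, and inertial elements, which a simulation-ready asset also needs, are omitted for brevity.

```python
import xml.etree.ElementTree as ET

def _fmt(values):
    """Format a 3-vector as the space-separated string URDF expects."""
    return " ".join(str(v) for v in values)

def make_urdf(name, parts):
    """Build a URDF string from part descriptions.

    parts: list of dicts. Every dict has a "link" name; every non-root part
    also has "parent", "type", "origin", "axis", and "limits" (lower, upper).
    """
    robot = ET.Element("robot", name=name)
    for part in parts:
        ET.SubElement(robot, "link", name=part["link"])
    for part in parts:
        if "parent" not in part:
            continue  # the root link has no joint attaching it to a parent
        joint = ET.SubElement(
            robot, "joint",
            name=part["parent"] + "_to_" + part["link"], type=part["type"],
        )
        ET.SubElement(joint, "parent", link=part["parent"])
        ET.SubElement(joint, "child", link=part["link"])
        ET.SubElement(joint, "origin", xyz=_fmt(part["origin"]), rpy="0 0 0")
        ET.SubElement(joint, "axis", xyz=_fmt(part["axis"]))
        lower, upper = part["limits"]
        # Effort and velocity limits are placeholder values for this sketch.
        ET.SubElement(joint, "limit", lower=str(lower), upper=str(upper),
                      effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")

cabinet = make_urdf("cabinet", [
    {"link": "body"},
    {"link": "door", "parent": "body", "type": "revolute",
     "origin": (0.3, 0.0, 0.0), "axis": (0, 0, 1), "limits": (0.0, 1.57)},
])
```

Emitting links first and joints second mirrors the sequential part-plus-joint prediction described in the abstract, although the actual output representation of URDF-Anything+ is not specified there.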

2603.11653 2026-06-02 cs.LG cs.RO 版本更新

Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

简单配方有效:视觉-语言-动作模型通过强化学习成为自然持续学习者

Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, Roberto Martin-Martin

发表机构 * University of Southern California(南加州大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of California, Berkeley(加州大学伯克利分校)

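AI summary A systematic study finds that, for large pretrained vision-language-action models, simple sequential fine-tuning with low-rank adaptation shows high plasticity, little to no forgetting, and strong zero-shot generalization in continual reinforcement learning, frequently outperforming more complex methods.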

Comments Accepted at RLC 2026

详情
AI中文摘要

持续强化学习(CRL)用于视觉-语言-动作(VLA)模型是一个有前景的方向,旨在实现能够在开放、不断变化的环境中适应的自我改进具身智能体。然而,持续学习的传统观点认为,简单的顺序微调(Seq. FT)会导致灾难性遗忘,需要复杂的CRL策略。在这项工作中,我们退一步,对大型预训练VLA在多种终身RL基准上的CRL进行了系统研究。我们发现,与既定信念相反,使用低秩适配(LoRA)的简单Seq. FT非常强大:它实现了高可塑性,几乎没有遗忘,并保持了强大的零样本泛化,通常优于更复杂的CRL方法。通过详细分析,我们表明这种鲁棒性源于大型预训练模型、参数高效适配和在线RL之间的协同作用。这些组件共同重塑了稳定性-可塑性权衡,使持续适应既稳定又可扩展。我们的结果将顺序微调定位为VLA持续RL的强大方法,并为大模型时代的终身学习提供了新见解。代码可在github.com/UT-Austin-RobIn/continual-vla-rl获取。

英文摘要

Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in open-ended, evolving environments. However, conventional wisdom from continual learning suggests that naive Sequential Fine-Tuning (Seq. FT) leads to catastrophic forgetting, necessitating complex CRL strategies. In this work, we take a step back and conduct a systematic study of CRL for large pretrained VLAs across diverse lifelong RL benchmarks. We find that, contrary to established belief, simple Seq. FT with low-rank adaptation (LoRA) is remarkably strong: it achieves high plasticity, exhibits little to no forgetting, and retains strong zero-shot generalization, frequently outperforming more sophisticated CRL methods. Through detailed analysis, we show that this robustness arises from a synergy between the large pretrained model, parameter-efficient adaptation, and on-policy RL. Together, these components reshape the stability-plasticity trade-off, making continual adaptation both stable and scalable. Our results position Sequential Fine-Tuning as a powerful method for continual RL with VLAs and provide new insights into lifelong learning in the large model era. Code is available at github.com/UT-Austin-RobIn/continual-vla-rl.
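As a rough illustration of the low-rank adaptation (LoRA) mentioned in the abstract, the sketch below applies a LoRA-style update to one frozen weight matrix in plain Python. The shapes, the `alpha / r` scaling convention, and all names (`lora_forward`, `W`, `A`, `B_zero`, `B_trained`) are illustrative assumptions and are not taken from the paper's code.

```python
# Minimal LoRA-style forward pass: y = W x + (alpha / r) * B (A x).
# W stays frozen; only the small matrices A (r x d_in) and B (d_out x r)
# would be trained. All names and sizes here are hypothetical.

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    r = len(A)                           # adapter rank = number of rows in A
    base = matvec(W, x)                  # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))   # trainable low-rank path
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
A = [[1.0, 1.0]]               # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # B = 0 makes the adapter a no-op at the start
B_trained = [[0.5], [-0.5]]    # a made-up "trained" up-projection
x = [2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x)        # identical to the frozen model
y_adapted = lora_forward(W, A, B_trained, x)  # frozen output plus low-rank correction
```

Because the pretrained weight is never modified, each new task only changes the small adapter matrices, which is one intuition for why such updates might disturb earlier skills less than full fine-tuning.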

2603.11537 2026-06-02 cs.RO 版本更新

MiNI-Q: A Miniature, Wire-Free Quadruped with Unbounded, Independently Actuated Leg Joints

MiNI-Q:一种微型、无线四足机器人,具有无界、独立驱动的腿关节

Daniel Koh, Suraj Shah, Yufeng Wu, Dennis Hong

发表机构 * Stanford University(斯坦福大学)

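AI summary Introduces MiNI-Q^2, a miniature wire-free quadruped whose mechanically unbounded, independently actuated leg joints enable multiple locomotion modes; it reaches speeds of up to 0.46 m/s, folds to a height of 2.5 cm, and all design files are released open source.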

Comments 7 pages, 11 figures. Submitted to the IEEE RAS Conference on Ubiquitous Robots (UR 2026)

详情
AI中文摘要

物理关节限制在腿式机器人中很常见,会限制工作空间、约束步态设计并增加硬件损坏风险。本文介绍MiNI-Q^2,一种微型、无线四足机器人,具有独立驱动、机械无界的2自由度腿关节。我们提出了所设计机器人的机械设计、运动学分析和实验验证。该腿机构可实现振荡步态和旋转运动,同时允许机器人折叠至2.5 cm的最小高度。实验上,MiNI-Q达到0.46 m/s的速度,并展示了低间隙爬行、爬楼梯、倒立运动、跳跃和后空翻。无线架构扩展了我们之前的Q8bot设计,提高了微型尺度的装配可靠性。所有机械和电气设计文件均已开源,以支持可重复性和进一步研究。

英文摘要

Physical joint limits are common in legged robots and can restrict workspace, constrain gait design, and increase the risk of hardware damage. This paper introduces MiNI-Q^2, a miniature, wire-free quadruped robot with independently actuated, mechanically unbounded 2-DOF leg joints. We present the mechanical design, kinematic analysis, and experimental validation of the proposed robot. The leg mechanism enables both oscillatory gaits and rotary locomotion while allowing the robot to fold to a minimum height of 2.5 cm. Experimentally, MiNI-Q achieves speeds up to 0.46 m/s and demonstrates low-clearance crawling, stair climbing, inverted locomotion, jumping, and backflipping. The wire-free architecture extends our previous Q8bot design, improving assembly reliability at miniature scale. All mechanical and electrical design files are released open source to support reproducibility and further research.

2603.10282 2026-06-02 cs.RO 版本更新

Update-Free On-Policy Steering via Verifiers

基于验证器的免更新在线策略引导

Maria Attarian, Ian Vyse, Claas Voelcker, Jasper Gerigk, Evgenii Opryshko, Anas Almasri, Sumeet Singh, Yilun Du, Igor Gilitschenski

发表机构 * University of Toronto(多伦多大学) Google DeepMind(谷歌DeepMind) University of Alberta(阿尔伯塔大学) UT Austin(得克萨斯大学奥斯汀分校) Harvard University(哈佛大学)

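AI summary Proposes UF-OPS, which uses verifier functions trained on policy-evaluation rollouts to steer a base policy toward actions with a higher likelihood of success, improving black-box diffusion policies without parameter updates and achieving an average 49% improvement in success rate across 5 real tasks.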

Comments 9 pages, 6 figures

详情
AI中文摘要

近年来,行为克隆(BC)已成为从人类演示中学习操作的最流行方法之一。尽管取得了成功,但BC策略通常脆弱且难以进行精确操作。为了克服这些问题,我们提出了UF-OPS,一种免更新的在线策略引导方法,使机器人能够预测其动作的成功可能性,并在执行时调整策略。我们通过使用在策略初始评估期间获得的策略回滚数据来训练验证器函数来实现这一点。这些验证器随后用于将基础策略引导向具有更高成功可能性的动作。我们的方法在不改变基础参数的情况下提高了黑箱扩散策略的性能,使其轻量且灵活。我们展示了来自仿真和真实世界数据的结果,并在5个真实任务中实现了比基础策略平均49%的成功率提升。

英文摘要

In recent years, Behavior Cloning (BC) has become one of the most prevalent methods for learning manipulation from human demonstrations. Despite their successes, BC policies are often brittle and struggle with precise manipulation. To overcome these issues, we propose UF-OPS, an Update-Free On-Policy Steering method that enables the robot to predict the success likelihood of its actions and adapt its strategy at execution time. We accomplish this by training verifier functions using policy rollout data obtained during an initial evaluation of the policy. These verifiers are subsequently used to steer the base policy toward actions with a higher likelihood of success. Our method improves the performance of black-box diffusion policies, without changing the base parameters, making it lightweight and flexible. We present results from both simulation and real-world data and achieve an average 49% improvement in success rate over the base policy across 5 real tasks.
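The abstract does not spell out how the verifiers steer the policy. One simple possibility, shown here purely as an assumption, is best-of-N selection: sample several candidate actions from the frozen base policy and execute the one the verifier scores highest. The functions `steer`, `toy_policy`, and `toy_verifier` are hypothetical stand-ins, not the authors' implementation.

```python
import random

def steer(base_policy, verifier, observation, num_candidates=8, rng=None):
    """Sample candidates from a frozen policy and keep the one the verifier prefers."""
    rng = rng or random.Random(0)
    candidates = [base_policy(observation, rng) for _ in range(num_candidates)]
    scores = [verifier(observation, action) for action in candidates]
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy stand-ins (hypothetical): a noisy 1-D policy, and a verifier that
# predicts higher success the closer the action is to the target 0.5.
def toy_policy(observation, rng):
    return observation + rng.uniform(-1.0, 1.0)

def toy_verifier(observation, action):
    return 1.0 / (1.0 + abs(action - 0.5))

action, score = steer(toy_policy, toy_verifier, observation=0.0)
```

The base policy's parameters are never touched in this sketch; only the choice among its samples changes, which matches the "update-free" framing in the abstract.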

2603.09292 2026-06-02 cs.RO cs.CV 版本更新

See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

看、规划、回退:面向鲁棒机器人操作的进度感知视觉-语言-动作模型

Tingjun Dai, Mingfei Han, Tingwen Du, Zhiheng Liu, Zihao Zhang, Zhihui Li, Salman Khan, Jun Yu, Xiaojun Chang

发表机构 * School of Information Science and Technology, University of Science and Technology of China(信息科学与技术学院,中国科学技术大学) University of Technology Sydney(悉尼科技大学) Department of Computer Vision, Mohamed Bin Zayed University of Artificial Intelligence(计算机视觉系,穆罕默德·本·扎耶德人工智能大学) The University of Hong Kong(香港大学) Institute of AI for Industry, Chinese Academy of Sciences(产业人工智能研究所,中国科学院) School of Intelligent Science and Engineering, Harbin Institute of Technology (Shenzhen)(智能科学与工程学院,哈尔滨工业大学(深圳))

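AI summary Proposes SPR, a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals and recovers from errors through closed-loop progress monitoring; it outperforms the MolmoAct baseline by 5% on LIBERO and shows state-of-the-art robustness on LIBERO-Plus.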

Comments Suggested to CVPR Findings. https://tingjundai.github.io/SPRVLA/

详情
AI中文摘要

通过明确的、可操作的里程碑来测量任务进度对于鲁棒机器人操作至关重要。这种进度感知使模型能够把握当前任务状态,预期可验证的中间状态,并在进度停滞时检测和恢复失败。为体现这一能力,我们引入了看、规划、回退(See, Plan, Rewind,简称SPR),一个进度感知的视觉-语言-动作框架,它动态地将语言指令接地到一系列空间子目标中。SPR通过一个连续的核心循环运行:观察当前状态和即将到来的里程碑,规划朝向下一个2D航点的轨迹,并在失败时通过监控与预期序列的进度来回退到可恢复状态。这种闭环方法无需额外训练数据或辅助模型即可实现鲁棒的错误纠正。大量实验证明了该框架的有效性、泛化能力和鲁棒性:SPR在LIBERO基准上比MolmoAct基线高出5%。在具有未见指令和初始状态的挑战性LIBERO-Plus基准上,SPR实现了最先进的鲁棒性,性能下降最小,超越了OpenVLA-OFT和UniVLA,展示了优越的分布外鲁棒性。

英文摘要

Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle, Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework's effectiveness, generalization and robustness: SPR outperforms the MolmoAct baseline by 5% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA, demonstrating superior out-of-distribution robustness.

2603.07578 2026-06-02 cs.RO 版本更新

Approximate Imitation Learning for Event-based Quadrotor Flight in Cluttered Environments

基于事件的四旋翼飞行器在杂乱环境中的近似模仿学习

Nico Messikommer, Jiaxu Xing, Leonard Bauersfeld, Marco Cannici, Elie Aljalbout, Davide Scaramuzza

发表机构 * Robotics and Perception Group, University of Zurich, Switzerland(苏黎世大学机器人与感知组,瑞士)

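AI summary Proposes Approximate Imitation Learning, a framework that separates representation learning from policy search, cutting training time for event-camera quadrotor flight policies from 52.44 hours to 1.86 hours (a 28x speedup) and validating high-speed flight in simulation and the real world.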

详情
AI中文摘要

事件相机具有高时间分辨率和低延迟,使其成为高速机器人应用的理想传感器,而传统相机会因运动模糊而失效。然而,它们在机器人学习中的广泛应用受到在线训练期间模拟高频事件数据计算成本的严重制约。在这项工作中,我们提出了近似模仿学习,这是一个从根本上解决这一瓶颈的新框架,将复杂、敏捷无人机飞行的策略训练时间从52.44小时减少到仅1.86小时——实现了28倍的计算加速。我们的关键见解是将表征学习与策略搜索分离。我们首先利用大规模离线数据集学习特定于任务的表征空间。随后,通过仅依赖轻量级状态信息的在线交互对策略进行微调,完全消除了在主动策略搜索阶段渲染事件的需求。这种训练范式极大地降低了开发开销,并使基于事件的控制策略能够扩展到复杂环境。此外,我们的方法在部署期间消除了对标准相机或中间表示的依赖,直接将事件映射到控制命令。在仿真中,我们的方法匹配或超过了需要完整在线事件渲染的标准模仿学习基线的性能。最后,我们在真实世界中成功验证了该框架,展示了通过这种超高效范式训练的策略使四旋翼飞行器能够在高度杂乱的环境中以前所未有的速度(高达9.8米/秒)飞行。

英文摘要

Event cameras offer high temporal resolution and low latency, making them ideal sensors for high-speed robotic applications where conventional cameras suffer from motion blur. However, their widespread adoption in robot learning is severely bottlenecked by the computational cost of simulating high-frequency event data during online training. In this work, we present Approximate Imitation Learning, a novel framework that fundamentally resolves this bottleneck, reducing policy training time for complex, agile drone flight from 52.44 hours to just 1.86 hours, a 28x computational speedup. Our key insight is to separate representation learning from policy search. We first leverage a large-scale offline dataset to learn a task-specific representation space. Subsequently, the policy is fine-tuned through online interactions that rely solely on lightweight state information, completely eliminating the need to render events during the active policy search phase. This training paradigm drastically reduces development overhead and enables event-based control policies to scale to complex environments. Furthermore, our approach eliminates the reliance on standard cameras or intermediate representations during deployment, mapping events directly to control commands. In simulation, our method matches or exceeds the performance of standard imitation learning baselines that require full online event rendering. Finally, we successfully validate the framework in the real world, demonstrating that a policy trained via this ultra-efficient paradigm enables a quadrotor to fly through highly cluttered environments at remarkable speeds of up to 9.8 m/s.

2602.01662 2026-06-02 cs.RO 版本更新

PLanAR: Planning-Language-Grounded Agentic Reasoning for Robot Manipulation

PLanAR:面向机器人操作的规划语言基础智能推理

Pengyuan Guo, Zhonghao Mai, Zhengtong Xu, Kaidi Zhang, Quan Khanh Luu, Heng Zhang, Zichen Miao, Arash Ajoudani, Zachary Kingston, Qiang Qiu, Yu She

发表机构 * Purdue University(普渡大学) Istituto Italiano di Tecnologia(意大利理工研究院)

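AI summary Proposes PLanAR, a framework that uses a planning-language interface to define the VLM reasoning space, enabling open-vocabulary, long-horizon robot manipulation with stepwise verification and replanning.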

Comments New version with updated framing, contributions, experiments, and figures

详情
AI中文摘要

近期视觉-语言模型(VLM)的进步推动了真实世界机器人操作的进展。然而,非结构化环境中的长时域操作要求VLM推理变化的场景状态、动作约束和执行结果,这仅靠自然语言推理仍然困难。我们提出PLanAR,一个规划语言基础的机器人智能体框架,用于开放词汇的长时域操作。PLanAR使用规划语言接口定义VLM推理空间:对象谓词表示场景状态,动作模式指定具有前提条件和效果的机器人技能,符号规划提供可执行的中间表示。该接口支持逐步验证:在每个动作之后,PLanAR利用机载观测检查预期符号效果是否实现,使基于VLM的智能体能够更新任务状态、检测失败,并在执行偏离预期时重新规划。在多种机器人形态、VLM后端以及包括堆叠、填字游戏和长时域厨房工作流程的任务中,PLanAR展示了强大的真实世界能力,同时揭示了当前VLM在具身推理中的关键局限性。

英文摘要

Recent advances in vision-language models (VLMs) have enabled increasing progress in real-world robot manipulation. However, long-horizon manipulation in unstructured environments requires VLMs to reason about changing scene states, action constraints, and execution outcomes, which remains difficult with natural language reasoning alone. We present PLanAR, a planning-language-grounded robot agent framework for open-vocabulary, long-horizon manipulation. PLanAR uses a planning-language interface to define the VLM reasoning space: object predicates represent scene states, action schemas specify robot skills with preconditions and effects, and symbolic plans provide executable intermediate representations. This interface enables stepwise verification: after each action, PLanAR uses onboard observations to check whether the expected symbolic effects have been achieved, allowing the VLM-based agent to update task states, detect failures, and replan when execution deviates from expectation. Across robot embodiments, VLM backends, and tasks including stacking, crossword solving, and long-horizon kitchen workflows, PLanAR demonstrates strong real-world capability while revealing key limitations of current VLMs in embodied reasoning.
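To make the planning-language interface more concrete, the following sketch models an action schema with preconditions and effects and checks the expected symbolic effects after a step, in the spirit of the stepwise verification described above. The predicate strings, the STRIPS-style add/delete split, and the names `ActionSchema`, `applicable`, `expected_state`, and `verify_step` are all assumptions made for illustration; the abstract does not specify PLanAR's actual representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionSchema:
    """A symbolic robot skill: what must hold before, and what changes after."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

def applicable(state, action):
    """An action can run only if every precondition holds in the state."""
    return action.preconditions <= state

def expected_state(state, action):
    """Symbolic state predicted after a successful execution."""
    return (state - action.del_effects) | action.add_effects

def verify_step(observed_state, action):
    """True if all added effects are observed and no deleted predicate remains."""
    return action.add_effects <= observed_state and not (action.del_effects & observed_state)

stack_a_on_b = ActionSchema(
    name="stack(a, b)",
    preconditions=frozenset({"holding(a)", "clear(b)"}),
    add_effects=frozenset({"on(a, b)", "clear(a)", "hand_empty"}),
    del_effects=frozenset({"holding(a)", "clear(b)"}),
)

state = frozenset({"holding(a)", "clear(b)", "on_table(b)"})
predicted = expected_state(state, stack_a_on_b)

# If the observation matches the prediction, the step is accepted.
success = verify_step(predicted, stack_a_on_b)
# If the gripper is still holding the block, the effects were not achieved,
# which would be the signal to update the task state and replan.
needs_replan = not verify_step(state, stack_a_on_b)
```

In a real system the observed state would come from perception (for example, a VLM labelling predicates from camera images) rather than from the symbolic prediction used here.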

2602.23694 2026-06-02 cs.RO cs.AI 版本更新

Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion

基于对数似然比融合的可解释多模态手势识别用于无人机和移动机器人遥操作

Seungyeol Baek, Jaspreet Singh, Lala Shakti Swarup Ray, Hymalai Bello, Paul Lukowicz, Sungho Suh

发表机构 * Department of Artificial Intelligence, Korea University(人工智能系,韩国大学) Department of Computer Science, RPTU Kaiserslautern-Landau(计算机科学系,RPTU凯撒斯劳滕-兰道) Embedded Intelligence, German Research Center for Artificial Intelligence (DFKI)(嵌入式智能,德国人工智能研究中心(DFKI))

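AI summary Proposes a multimodal gesture recognition framework that fuses inertial data from wrist-worn Apple Watches with capacitive sensing signals from custom gloves; a late fusion strategy based on the log-likelihood ratio improves performance and provides interpretability, matching a vision-based baseline at lower computational cost.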

详情
AI中文摘要

人类操作员仍经常暴露在危险环境中,如灾区及工业设施,在这些场景中,移动机器人和无人飞行器(UAV)的直观可靠遥操作至关重要。在此背景下,免手持遥操作增强了操作员的移动性和态势感知能力,从而提高了危险环境中的安全性。尽管基于视觉的手势识别已被探索作为免手持遥操作的一种方法,但其性能在遮挡、光照变化和杂乱背景下常会下降,限制了其在真实操作中的适用性。为克服这些限制,我们提出一种多模态手势识别框架,该框架融合来自双手腕上Apple Watch的惯性数据(加速度计、陀螺仪和方向)与来自定制手套的电容传感信号。我们设计了一种基于对数似然比(LLR)的后期融合策略,该策略不仅提升了识别性能,还通过量化模态特定贡献提供了可解释性。为支持本研究,我们引入了一个包含20种受飞机引导信号启发的手势的新数据集,包含同步的RGB视频、IMU和电容传感器数据。实验结果表明,我们的框架在显著降低计算成本、模型大小和训练时间的同时,达到了与最先进的视觉基线相当的性能,使其非常适合实时机器人控制。因此,我们强调了基于传感器的多模态融合作为手势驱动的移动机器人和无人机遥操作的鲁棒且可解释解决方案的潜力。

英文摘要

Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
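One common way to realise late fusion with log-likelihood ratios is sketched below: each modality's class probability is converted to a log-odds score, the scores are summed across modalities, and the per-modality terms for the winning class are read off as contributions. This is a generic formulation under stated assumptions (one-vs-rest log-odds, equal modality weights, made-up probabilities); the paper's exact LLR definition may differ, and `llr` and `fuse` are hypothetical names.

```python
import math

def llr(p, eps=1e-9):
    """One-vs-rest log-likelihood ratio (log-odds) of a class probability."""
    p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
    return math.log(p / (1.0 - p))

def fuse(posteriors):
    """posteriors: dict mapping modality name -> list of class probabilities."""
    num_classes = len(next(iter(posteriors.values())))
    fused = [
        sum(llr(probs[c]) for probs in posteriors.values())
        for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=lambda c: fused[c])
    # Per-modality evidence for the predicted class: positive values support
    # the decision, negative values argue against it.
    contributions = {m: llr(probs[predicted]) for m, probs in posteriors.items()}
    return predicted, fused, contributions

# Illustrative classifier outputs for three gesture classes (invented numbers).
posteriors = {
    "imu": [0.70, 0.20, 0.10],
    "capacitive": [0.40, 0.50, 0.10],
}
predicted, fused, contributions = fuse(posteriors)
```

Because the fused score is a plain sum, the contribution of each modality to a decision can be inspected directly, which is the kind of interpretability the abstract attributes to LLR fusion.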

2512.00470 2026-06-02 cs.RO 版本更新

LAP: Fast LAtent Diffusion Planner for Autonomous Driving

LAP:面向自动驾驶的快速潜在扩散规划器

Jinhao Zhang, Wenlong Xia, Zhexuan Zhou, Haoming Song, Youmin Gong, Jie Mei

发表机构 * University of Science and Technology of China(中国科学技术大学)

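AI summary Proposes LAP, a framework that plans in a VAE-learned latent space and produces high-quality plans in a single denoising step, achieving state-of-the-art closed-loop performance among learning-based planners on nuPlan with an inference speed-up of up to 10x.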

详情
AI中文摘要

扩散模型在自动驾驶中模拟类人驾驶行为方面展现出强大能力,但其迭代采样过程导致大量延迟,且直接对原始轨迹点操作迫使模型将容量用于低级运动学而非高级多模态语义。为解决这些限制,我们提出潜在规划器(LAP),该框架在VAE学习的潜在空间中进行规划,将高级意图与低级运动学解耦,使规划器能够捕获丰富的多模态驾驶策略。为弥合高级语义规划空间与向量化场景上下文之间的表征差距,我们引入中间特征对齐机制,促进鲁棒的信息融合。值得注意的是,LAP可在单步去噪中生成高质量规划,大幅降低计算开销。通过在大型nuPlan基准上的广泛评估,LAP在学习型规划方法中实现了最先进的闭环性能,同时推理速度相比先前SOTA方法提升高达10倍。

英文摘要

Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics, rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space that disentangles high-level intents from low-level kinematics, enabling our planner to capture rich, multi-modal driving strategies. To bridge the representational gap between the high-level semantic planning space and the vectorized scene context, we introduce an intermediate feature alignment mechanism that facilitates robust information fusion. Notably, LAP can produce high-quality plans in a single denoising step, substantially reducing computational overhead. Through extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods, while demonstrating an inference speed-up of up to 10x over previous SOTA approaches.

2603.03741 2026-06-02 cs.RO cs.AI 版本更新

HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization

HALO:通过异质智能体李雅普诺夫策略优化学习人机协作

Hao Zhang, Yaru Niu, Yikai Wang, Ding Zhao, H. Eric Tseng

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

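AI summary To address generalization and robustness in human-robot collaboration under diverse human behaviors and contexts, proposes heterogeneous-agent Lyapunov policy optimization (HALO), which stabilizes decentralized multi-agent reinforcement learning through Lyapunov-based contraction and rectifies gradients via optimal quadratic projections, ensuring monotonic contraction of the rationality gap and improving collaboration performance.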

Comments https://HaoZhang-THU.github.io/HALO/

详情
AI中文摘要

为了提高人机协作(HRC)的泛化性和韧性,机器人必须应对人类行为和情境的多种组合,这推动了多智能体强化学习(MARL)的应用。然而,机器人与人类之间的固有异质性造成了理性差距(RG),使得去中心化的策略更新偏离了合作联合优化。由此产生的学习问题是一个一般和可微博弈,因此独立的策略梯度更新在没有额外结构的情况下可能会振荡或发散。我们提出了异质智能体李雅普诺夫策略优化(HALO),这是一个通过强制策略参数空间中的李雅普诺夫收缩来稳定去中心化MARL的框架。与针对约束马尔可夫决策过程中状态/轨迹约束的基于李雅普诺夫的安全RL不同,HALO使用李雅普诺夫认证来稳定去中心化策略学习。HALO通过最优二次投影修正去中心化梯度,确保RG的单调收缩,并实现对开放式交互空间的有效探索。大量的仿真和真实人形机器人实验表明,这种认证的稳定性提高了协作边缘情况下的泛化性和鲁棒性。我们的项目网站位于https://HaoZhang-THU.github.io/HALO/。

英文摘要

To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG), where decentralized policy updates deviate from cooperative joint optimization. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALO), a framework that stabilizes decentralized MARL by enforcing Lyapunov-based contraction in policy-parameter space. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALO uses Lyapunov certification to stabilize decentralized policy learning. HALO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases. Our project website is available at https://HaoZhang-THU.github.io/HALO/.

2603.02650 2026-06-02 cs.LG cs.AI cs.RO 版本更新

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

通过自监督动作能量门控改进扩散规划器

Yuan Lu, Dongqi Han, Yansen Wang, Dongsheng Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

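AI summary Proposes SAGE, an inference-time method that re-ranks trajectories using a latent consistency signal and penalises dynamically inconsistent plans, improving the performance and robustness of diffusion planners.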

详情
AI中文摘要

扩散规划器是离线强化学习的一种强大方法,但当价值引导选择偏好得分高但局部与环境动态不一致的轨迹时,它们可能会失败,导致执行脆弱。我们提出了自监督动作能量门控(SAGE),一种推理时重排序方法,使用潜在一致性信号惩罚动态不一致的计划。SAGE在离线状态序列上训练联合嵌入预测架构(JEPA)编码器,并训练一个动作条件的潜在预测器用于短时域过渡。在测试时,SAGE为每个采样候选分配一个由其潜在预测误差给出的能量,并将此可行性得分与价值估计相结合以选择动作。SAGE可以集成到现有的扩散规划流程中,这些流程可以通过价值评分采样轨迹和选择动作;它不需要环境回滚,也不需要重新训练策略。在运动、导航和操作基准测试中,SAGE提高了扩散规划器的性能和鲁棒性。

英文摘要

Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
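The re-ranking step described above can be sketched as follows: each candidate plan gets an energy equal to its summed latent prediction error, and the plan with the best value-minus-energy score is selected. The linear combination with a `weight` parameter, the identity encoder, and the toy dynamics are assumptions for illustration only; the abstract says the two scores are combined but not how, and a real SAGE setup would use a trained JEPA encoder and latent predictor.

```python
def energy(states, actions, encoder, predictor):
    """Sum of squared latent prediction errors along a candidate plan."""
    total = 0.0
    for s_t, a_t, s_next in zip(states[:-1], actions, states[1:]):
        z_pred = predictor(encoder(s_t), a_t)   # predicted next latent
        z_next = encoder(s_next)                # latent of the planned next state
        total += sum((p - q) ** 2 for p, q in zip(z_pred, z_next))
    return total

def rerank(candidates, value_fn, encoder, predictor, weight=1.0):
    """Select the candidate maximising value minus weighted energy."""
    def score(candidate):
        e = energy(candidate["states"], candidate["actions"], encoder, predictor)
        return value_fn(candidate) - weight * e
    return max(candidates, key=score)

# Toy stand-ins (hypothetical): 1-D states, identity encoder, dynamics s' = s + a.
def toy_encoder(state):
    return [state]

def toy_predictor(latent, action):
    return [latent[0] + action]

def toy_value(candidate):
    return candidate["value"]

consistent = {"name": "consistent", "states": [0.0, 1.0, 2.0],
              "actions": [1.0, 1.0], "value": 1.0}
inconsistent = {"name": "inconsistent", "states": [0.0, 3.0, 2.0],
                "actions": [1.0, 1.0], "value": 1.5}

chosen = rerank([consistent, inconsistent], toy_value, toy_encoder, toy_predictor)
```

With `weight=0.0` the selection falls back to pure value ranking and picks the higher-valued but dynamically inconsistent plan, which is the failure mode the energy term is meant to penalise.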

2603.02478 2026-06-02 eess.SY cs.RO cs.SY 版本更新

Scalar-Measurement Attitude Estimation on $\mathbf{SO}(3)$ with Bias Compensation

$\mathbf{SO}(3)$ 上带偏差补偿的标量测量姿态估计

Alessandro Melis, Tarek Bouazza, Hassan Alnahhal, Sifeddine Benahmed, Soulaimane Berkane, Tarek Hamel

发表机构 * I3S, CNRS, Université Côte d’Azur, Sophia Antipolis, France(法国索菲亚-安蒂波利斯,蔚蓝海岸大学、法国国家科学研究中心I3S实验室) Institut Universitaire de France(法国大学研究院) Department of Technology & Innovation, Capgemini Engineering(Capgemini工程公司技术与创新部) Department of Computer Science and Engineering, Université du Québec en Outaouais (UQO)(魁北克大学乌塔韦分校计算机科学与工程系)

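AI summary Proposes nonlinear deterministic observers on $\mathbf{SO}(3)$ based on scalar measurements, with gyroscope bias compensation, that achieve uniform local exponential stability under suitable observability conditions, and shows that two scalar measurements suffice for attitude estimation under suitable excitation while three are enough in the static case.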

Comments 9 pages, 4 figures. Accepted to ICRA 2026

详情
AI中文摘要

姿态估计方法通常依赖于来自惯性传感器(如加速度计和磁力计)的完整矢量测量。本文表明,仅使用标量测量也能实现可靠估计,这些标量测量自然出现为矢量读数的分量或来自其他传感模态的独立约束。我们提出了 $\mathbf{SO}(3)$ 上的非线性确定性观测器,该观测器结合了陀螺仪偏差补偿,并在适当的可观测性条件下保证一致局部指数稳定性。该框架的一个关键特性是对部分感知的鲁棒性:即使只有矢量分量的子集可用,也能保持准确估计。在 BROAD 数据集上的实验验证确认了在逐步减少的测量配置下性能一致,即使在严重信息丢失的情况下估计误差仍然很小。据我们所知,这是第一项建立基本可观测性结果的工作,表明在适当激励下两个标量测量足以进行姿态估计,而在静态情况下三个足够。这些结果将基于标量测量的观测器定位为传统基于矢量方法的实用且可靠的替代方案。

英文摘要

Attitude estimation methods typically rely on full vector measurements from inertial sensors such as accelerometers and magnetometers. This paper shows that reliable estimation can also be achieved using only scalar measurements, which naturally arise either as components of vector readings or as independent constraints from other sensing modalities. We propose nonlinear deterministic observers on $\mathbf{SO}(3)$ that incorporate gyroscope bias compensation and guarantee uniform local exponential stability under suitable observability conditions. A key feature of the framework is its robustness to partial sensing: accurate estimation is maintained even when only a subset of vector components is available. Experimental validation on the BROAD dataset confirms consistent performance across progressively reduced measurement configurations, with estimation errors remaining small even under severe information loss. To the best of our knowledge, this is the first work to establish fundamental observability results showing that two scalar measurements under suitable excitation suffice for attitude estimation, and that three are enough in the static case. These results position scalar-measurement-based observers as a practical and reliable alternative to conventional vector-based approaches.

2603.01302 2026-06-02 cs.RO 版本更新

Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space

混合TD3:混合动作空间中的过估计偏差分析与稳定策略优化

Thanh-Tuan Tran, Thanh Nguyen Canh, Nak Young Chong, Xiem HoangVan

发表机构 * Department of Computer Science, University of California, Berkeley(加州大学伯克利分校计算机科学系)

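AI summary To address reinforcement learning in discrete-continuous hybrid action spaces, proposes Hybrid TD3, which analyses overestimation bias theoretically and introduces a weighted clipped Q-learning target for stable policy optimization.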

详情
AI中文摘要

离散-连续混合动作空间中的强化学习对机器人操作提出了基本挑战,其中高层任务决策和低层关节空间执行必须联合优化。现有方法要么离散化连续组件,要么将离散选择松弛为连续近似,这些方法在高维动作空间和域随机化下存在可扩展性限制和训练不稳定性。在本文中,我们提出Hybrid TD3,这是对双延迟深度确定性策略梯度(TD3)的扩展,以原则性方式原生处理参数化混合动作空间。我们对混合动作设置中的过估计偏差进行了严格的理论分析,推导了双评论家架构下的形式化界限,并在同步高斯误差假设下建立了五种算法变体之间的完整偏差排序。基于此分析,我们引入了一个加权裁剪Q学习目标,该目标对离散动作分布进行边缘化,在实现与标准裁剪最小化等效的偏差减少的同时,提高了策略平滑性。实验结果表明,Hybrid TD3在训练稳定性和竞争性能方面优于最先进的混合动作基线。

英文摘要

Reinforcement learning in discrete-continuous hybrid action spaces presents fundamental challenges for robotic manipulation, where high-level task decisions and low-level joint-space execution must be jointly optimized. Existing approaches either discretize continuous components or relax discrete choices into continuous approximations, which suffer from scalability limitations and training instability in high-dimensional action spaces and under domain randomization. In this paper, we propose Hybrid TD3, an extension of Twin Delayed Deep Deterministic Policy Gradient (TD3) that natively handles parameterized hybrid action spaces in a principled manner. We conduct a rigorous theoretical analysis of overestimation bias in hybrid action settings, deriving formal bounds under twin-critic architectures and establishing a complete bias ordering across five algorithmic variants under synchronized Gaussian error assumptions. Building on this analysis, we introduce a weighted clipped Q-learning target that marginalizes over the discrete action distribution, achieving equivalent bias reduction to standard clipped minimization while improving policy smoothness. Experimental results demonstrate that Hybrid TD3 achieves superior training stability and competitive performance against state-of-the-art hybrid action baselines.
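One plausible reading of a "weighted clipped Q-learning target that marginalizes over the discrete action distribution" is sketched below: for each discrete action, take the minimum of the two critics (the usual TD3 clipping), then average those clipped values under the policy's discrete-action probabilities instead of using only the single selected action. The exact form used in the paper may differ; `weighted_clipped_target` and its arguments are hypothetical.

```python
def weighted_clipped_target(reward, done, discrete_probs, q1_next, q2_next, gamma=0.99):
    """TD target: r + gamma * (1 - done) * sum_k pi(k | s') * min(Q1_k, Q2_k).

    q1_next[k] and q2_next[k] are the two critics' estimates at the next state
    for discrete action k, each paired with its own continuous parameters.
    """
    clipped = [min(a, b) for a, b in zip(q1_next, q2_next)]           # per-action clipping
    expected = sum(p * q for p, q in zip(discrete_probs, clipped))    # marginalise over k
    return reward + gamma * (1.0 - done) * expected

target = weighted_clipped_target(
    reward=1.0,
    done=0.0,
    discrete_probs=[0.5, 0.5],
    q1_next=[2.0, 4.0],
    q2_next=[3.0, 1.0],
    gamma=0.5,
)
# Clipped values are [2.0, 1.0]; their weighted mean is 1.5; target = 1.0 + 0.5 * 1.5.
```

Averaging over the discrete distribution, rather than bootstrapping from one sampled discrete action, would make the target vary more smoothly as the policy's probabilities shift, which is consistent with the smoothness benefit the abstract claims.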

2602.23204 2026-06-02 cs.CV cs.RO 版本更新

Motion-aware Event Suppression for Event Cameras

面向事件相机的运动感知事件抑制

Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza

发表机构 * Robotics and Perception Group, University of Zurich, Switzerland(苏黎世大学机器人与感知组,瑞士)

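AI summary Proposes the first motion-aware event suppression framework, which jointly segments independently moving objects in the current event stream and predicts their future motion to suppress dynamic events in advance; on the EVIMO benchmark it improves segmentation accuracy by 67% at a 53% higher inference rate.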

Comments Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

在这项工作中,我们引入了首个运动感知事件抑制框架,该框架学习实时过滤由独立运动物体和自身运动触发的事件。我们的模型联合分割当前事件流中的独立运动物体,同时预测其未来运动,从而在动态事件发生之前实现预期抑制。我们的轻量级架构在消费级GPU上实现了173 Hz的推理速度,内存使用不到1 GB,在具有挑战性的EVIMO基准上,分割精度比之前最先进的方法提高了67%,同时推理速度提高了53%。此外,我们展示了该方法对下游应用的显著益处:通过令牌剪枝,我们的方法将Vision Transformer推理速度提高了83%,并改进了基于事件的视觉里程计精度,将绝对轨迹误差降低了13%。

英文摘要

In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.

2602.20807 2026-06-02 cs.CV cs.RO 版本更新

RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction

RU4D-SLAM:面向4D场景重建的高斯溅射SLAM不确定性重加权

Yangfan Zhao, Hanwei Zhang, Ke Huang, Qiufeng Wang, Zhenzhou Shao, Dengyu Wu

发表机构 * Capital Normal University(首都师范大学) Saarland University(萨尔兰大学) Xi’an Jiaotong-Liverpool University(西交利物浦大学) King’s College London(伦敦国王学院)

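AI summary Proposes RU4D-SLAM, which introduces temporal factors, uncertainty-aware perception, and a semantic-guided reweighting mechanism to address tracking and 4D scene reconstruction for 3D Gaussian splatting SLAM in dynamic environments.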

详情
AI中文摘要

将3D高斯溅射与同时定位与地图构建(SLAM)相结合的方法因其能够在运动过程中实现连续3D环境重建而受到广泛关注。然而,现有方法在动态环境中表现不佳,尤其是移动物体使3D重建复杂化,进而阻碍了可靠的跟踪。4D重建的出现,特别是4D高斯溅射,为解决这些挑战提供了有前景的方向,但其在4D感知SLAM中的潜力尚未得到充分探索。沿着这一方向,我们提出了一种鲁棒且高效的框架,即面向4D场景重建的高斯溅射SLAM不确定性重加权(RU4D-SLAM),该框架将时间因子引入空间3D表示,同时结合了场景变化的不确定性感知、模糊图像合成和动态场景重建。我们通过集成运动模糊渲染增强了动态场景表示,并通过扩展原本为静态场景设计的逐像素不确定性建模来处理模糊图像,从而改进了不确定性感知跟踪。此外,我们提出了一种用于动态场景中逐像素不确定性估计的语义引导重加权机制,并引入可学习的不透明度权重以支持自适应4D映射。在标准基准上的大量实验表明,我们的方法在轨迹精度和4D场景重建方面显著优于最先进的方法,尤其是在存在移动物体和低质量输入的动态环境中。代码地址:https://ru4d-slam.github.io

英文摘要

Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments, where moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM remains largely underexplored. Along this direction, we propose a robust and efficient framework, namely Reweighting Uncertainty in Gaussian Splatting SLAM (RU4D-SLAM) for 4D scene reconstruction, that introduces temporal factors into spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. We enhance dynamic scene representation by integrating motion blur rendering, and improve uncertainty-aware tracking by extending per-pixel uncertainty modeling, which was originally designed for static scenarios, to handle blurred images. Furthermore, we propose a semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and introduce a learnable opacity weight to support adaptive 4D mapping. Extensive experiments on standard benchmarks demonstrate that our method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs. Code available: https://ru4d-slam.github.io

2602.17737 2026-06-02 cs.RO cs.LG cs.MA 版本更新

NestRL: A Nested Training Regime for Mutual Adaptation in Human-AI Teaming

NestRL: 一种用于人机团队中相互适应的嵌套训练机制

Upasana Biswas, Durgesh Kalwar, Subbarao Kambhampati, Sarath Sreedharan

发表机构 * School of Computing and AI, Arizona State University(计算与人工智能学院,亚利桑那州立大学) Department of Computer Science, Colorado State University(计算机科学系,科罗拉多州立大学)

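AI summary To address mutual adaptation in human-AI teaming, proposes NestRL, a nested training regime that trains agents at each level against adaptive agents from the level below, avoiding opaque coordination strategies and achieving higher task performance and adaptability in the Overcooked domain.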

详情
AI中文摘要

相互适应是人机团队中的一个核心挑战,因为人类会自然地根据AI代理的行为调整自己的策略。现有方法试图通过多样化训练伙伴来近似人类行为;然而,这些伙伴通常是静态的,无法捕捉人类队友的适应性。当代理在标准多智能体设置中联合训练时,它们常常收敛到不透明的协调策略,这些策略仅适用于共同训练的伙伴,导致泛化能力差。为了建模自适应的人类行为,我们将人机团队问题形式化为交互式部分可观测马尔可夫决策过程(I-POMDP)。我们提出NestRL,一种嵌套训练机制,通过在每个层级上训练代理对抗来自下一层级的自适应代理,来学习有限层级I-POMDP的解。这使代理暴露于自适应行为,同时防止出现不透明的协调策略。我们提供了理论分析,表明NestRL代理避免了收敛到特定伙伴的策略,并在Overcooked领域通过与最先进的基线进行实证验证。NestRL在与未见过的自适应代理和真实人类队友合作时均实现了更高的任务性能,同时在交互过程中表现出显著更强的适应性。

英文摘要

Mutual adaptation is a central challenge in human-AI teaming, as humans naturally adjust their strategies in response to an AI agent's behavior. Existing approaches attempt to approximate human behavior by diversifying training partners; however, these partners are typically static and fail to capture the adaptive nature of human teammates. When agents are trained jointly in standard multi-agent settings, they often converge to opaque coordination strategies that work only with their co-trained partners, leading to poor generalization. To model adaptive human behavior, we formulate human-AI teaming as an Interactive Partially Observable Markov Decision Process (I-POMDP). We propose NestRL, a nested training regime that learns the solution to a finite-level I-POMDP by training agents at each level against adaptive agents from the level below. This exposes agents to adaptive behavior while preventing emergence of opaque coordination strategies. We provide theoretical analysis showing that NestRL agents avoid convergence to partner-specific strategies, and validate this empirically in the Overcooked domain against state-of-the-art baselines. NestRL achieves higher task performance with both unseen adaptive agents and real human teammates, while exhibiting significantly greater adaptability over the course of interaction.

2602.11554 2026-06-02 cs.RO cs.CV cs.LG 版本更新

HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds

HyperDet: 基于超4D雷达点云的3D目标检测

Yichun Xiao, Runwei Guan, Jin Jin, Fangqiang Ding

发表机构 * University of Edinburgh(爱丁堡大学) HKUST (GZ)(香港科技大学(广州)) University of Oxford(牛津大学) MIT(麻省理工学院)

AI总结 提出一种与检测器无关的框架HyperDet,通过构建任务感知的超4D雷达点云,利用时空累积、跨传感器验证和多普勒引导的运动补偿以及前景生成增强,显著提升仅用雷达的3D目标检测性能。

Comments 11 pages, 3 figures, 3 tables

详情
AI中文摘要

仅使用4D雷达进行3D目标检测能达到什么程度?尽管现代4D雷达为自主感知提供了鲁棒天气和速度感知能力,但其点云仍然稀疏、嘈杂且不稳定,限制了仅用雷达的3D检测。我们提出HyperDet,一种与检测器无关的框架,在检测前构建任务感知的超4D雷达点云。HyperDet首先通过时空累积、跨传感器验证和多普勒引导的运动补偿来细化短窗口环视雷达观测,提高返回可靠性和时间一致性。然后,它利用仅在训练时可用的激光雷达引导的伪雷达监督进行前景生成增强,在保留测量雷达背景和雷达原生属性的同时丰富目标几何。在检测器训练期间,雷达感知的目标级增强进一步在几何重定位下保持多普勒一致性。在推理时,HyperDet仅需雷达输入,可直接与标准3D检测器配合使用。在两个公开的环视4D雷达数据集上的实验表明,与原始雷达输入相比,在标准3D检测器上均取得一致改进,验证了输入级雷达增强作为仅用雷达3D检测的有效方法。

英文摘要

How far can 3D object detection go using 4D radar alone? Despite offering weather-robust and velocity-aware sensing for autonomous perception, modern 4D radar still yields sparse, noisy, and unstable point clouds, limiting radar-only 3D detection. We present HyperDet, a detector-agnostic framework that constructs task-aware hyper 4D radar point clouds before detection. HyperDet first refines short-window surround-view radar observations through spatio-temporal accumulation, cross-sensor validation, and Doppler-guided motion compensation, improving return reliability and temporal coherence. It then performs foreground generative enhancement using LiDAR-guided pseudo-radar supervision available only during training, enriching object geometry while preserving measured radar background and radar-native attributes. During detector training, radar-aware object-level augmentation further preserves Doppler consistency under geometric relocation. At inference time, HyperDet requires radar input alone and can be directly paired with standard 3D detectors. Experiments on two public surround-view 4D radar datasets demonstrate consistent improvements over raw radar inputs across standard 3D detectors, validating input-level radar enhancement as an effective approach to radar-only 3D detection.

2509.18046 2026-06-02 cs.RO cs.AI cs.ET cs.SY eess.SP eess.SY 版本更新

HuMam: Humanoid Motion Control via End-to-End Deep Reinforcement Learning with Mamba

HuMam: 基于Mamba的端到端深度强化学习人形机器人运动控制

Yinuo Wang, Yuanyang Qi, Jinzhao Zhou, Pengxiang Meng, Xiaowen Tao

发表机构 * College of Graduate and Professional Studies, Trine University(特灵大学研究生与专业研究学院) Department of Civil Engineering, University of Hong Kong(香港大学土木工程系) Faculty of Engineering and Information Technology, University of Technology Sydney(悉尼科技大学工程与信息技术学院) National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University(吉林大学汽车底盘集成与仿生国家重点实验室) School of Computer Science and Statistics, Trinity College Dublin(都柏林圣三一学院计算机科学与统计学院)

AI总结 提出HuMam框架,使用单层Mamba编码器融合状态与步态目标,通过PPO优化实现人形机器人稳定高效的端到端运动控制。

Comments 12 pages

详情
Journal ref
2026 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM) (CIS-RAM 2026)
AI中文摘要

端到端强化学习(RL)用于人形机器人运动因其紧凑的感知-动作映射而具有吸引力,但实际策略常受训练不稳定、特征融合低效和高执行成本困扰。我们提出HuMam,一种以状态为中心的端到端RL框架,采用单层Mamba编码器融合机器人中心状态与定向脚步目标及连续相位时钟。策略输出由低级PD环跟踪的关节位置目标,并通过PPO优化。一个简洁的六项奖励平衡接触质量、摆动平滑度、脚部放置、姿态和身体稳定性,同时隐含促进节能。在mc-mujoco中的JVRC-1人形机器人上,HuMam在强前馈基线上持续提高了学习效率、训练稳定性和整体任务性能,同时降低了功耗和扭矩峰值。据我们所知,这是首个采用Mamba作为融合骨干的端到端人形机器人RL控制器,展示了在效率、稳定性和控制经济性方面的切实提升。

英文摘要

End-to-end reinforcement learning (RL) for humanoid locomotion is appealing for its compact perception-action mapping, yet practical policies often suffer from training instability, inefficient feature fusion, and high actuation cost. We present HuMam, a state-centric end-to-end RL framework that employs a single-layer Mamba encoder to fuse robot-centric states with oriented footstep targets and a continuous phase clock. The policy outputs joint position targets tracked by a low-level PD loop and is optimized with PPO. A concise six-term reward balances contact quality, swing smoothness, foot placement, posture, and body stability while implicitly promoting energy saving. On the JVRC-1 humanoid in mc-mujoco, HuMam consistently improves learning efficiency, training stability, and overall task performance over a strong feedforward baseline, while reducing power consumption and torque peaks. To our knowledge, this is the first end-to-end humanoid RL controller that adopts Mamba as the fusion backbone, demonstrating tangible gains in efficiency, stability, and control economy.

2602.09153 2026-06-02 cs.RO cs.AI cs.CV cs.GR 版本更新

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

SceneSmith: 面向仿真就绪室内场景的智能体生成

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Harvard University(哈佛大学)

AI总结 提出层次化智能体框架SceneSmith,通过VLM智能体协作从自然语言生成仿真就绪的室内场景,相比先前方法生成3-6倍物体且碰撞率低于2%。

Comments ICML 2026 Spotlight; Project page: https://scenesmith.github.io/

详情
AI中文摘要

仿真已成为大规模训练和评估家用机器人的关键工具,但现有环境未能捕捉真实室内空间的多样性和物理复杂性。当前的场景合成方法生成的房间稀疏布置,缺乏机器人操作所必需的密集杂乱、铰接式家具和物理属性。我们提出了SceneSmith,一个层次化智能体框架,能够从自然语言提示生成仿真就绪的室内环境。SceneSmith通过连续阶段构建场景——从建筑布局到家具放置再到小物体填充——每个阶段都实现为VLM智能体(设计师、评论家和编排者)之间的交互。该框架通过文本到3D合成生成静态物体、数据集检索获取铰接式物体以及物理属性估计,紧密集成了资产生成。SceneSmith生成的物体数量是先前方法的3-6倍,物体间碰撞率低于2%,且96%的物体在物理仿真下保持稳定。在205名参与者参与的用户研究中,与基线相比,平均真实感胜率达到92%,平均提示忠实度胜率达到91%。我们进一步证明了这些环境可用于端到端的自动机器人策略评估流程。

英文摘要

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages, from architectural layout to furniture placement to small object population, each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

2602.06925 2026-06-02 cs.RO cs.GT 版本更新

Strategizing at Speed: A Learned Model Predictive Game for Multi-Agent Drone Racing

高速策略化:多无人机竞速的学习型模型预测博弈

Andrei-Carlo Papuc, Lasse Peters, Sihao Sun, Laura Ferranti, Javier Alonso-Mora

发表机构 * Department of Cognitive Robotics, Delft University of Technology(认知机器人学系,代尔夫特理工大学)

AI总结 针对多无人机竞速中策略深度与计算延迟的权衡问题,提出一种学习型模型预测博弈(LMPG)方法,通过摊销模型预测博弈计算来降低延迟,在仿真和硬件实验中优于模型预测博弈和模型预测控制。

详情
AI中文摘要

自主无人机竞速推动了高速运动规划与多智能体策略决策的边界。在此领域,无人机不仅需要极限导航,还需预测并反制竞争对手的动作。本文研究一个基本问题:智能体在采取行动前应进行多深层次的策略思考?为此,我们比较两种规划范式:模型预测博弈(MPG)以较长的计算时间为代价获得交互感知策略,以及轮廓模型预测控制(MPC)快速计算策略但不考虑交互。我们进行了大量实验来研究这一权衡,发现MPG在中速时优于MPC,但在高速时因延迟而失去优势。为解决此问题,我们提出一种学习型模型预测博弈(LMPG)方法,通过摊销模型预测博弈计算来减少延迟。在仿真和硬件实验中,我们在头对头竞速中将我们的方法与MPG和MPC进行基准测试,发现LMPG优于两种基线方法。

英文摘要

Autonomous drone racing pushes the boundaries of high-speed motion planning and multi-agent strategic decision-making. Success in this domain requires drones not only to navigate at their limits but also to anticipate and counteract competitors' actions. In this paper, we study a fundamental question that arises in this domain: how deeply should an agent strategize before taking an action? To this end, we compare two planning paradigms: the Model Predictive Game (MPG), which finds interaction-aware strategies at the expense of longer computation times, and contouring Model Predictive Control (MPC), which computes strategies rapidly but does not reason about interactions. We perform extensive experiments to study this trade-off, revealing that MPG outperforms MPC at moderate velocities but loses its advantage at higher speeds due to latency. To address this shortcoming, we propose a Learned Model Predictive Game (LMPG) approach that amortizes model predictive gameplay to reduce latency. In both simulation and hardware experiments, we benchmark our approach against MPG and MPC in head-to-head races, finding that LMPG outperforms both baselines.

2601.22965 2026-06-02 cs.RO 版本更新

Self-Imitated Diffusion Policy for Efficient and Robust Visual Navigation

自模仿扩散策略用于高效鲁棒的视觉导航

Runhua Zhang, Junyi Hou, Changxu Cheng, Qiyi Chen, Tao Wang, Wuyue Zhao

发表机构 * Uni-Ubi Technology Co., Ltd.(Uni-Ubi技术有限公司) College of Control Science and Engineering, Zhejiang University(浙江大学控制科学与工程学院)

AI总结 提出自模仿扩散策略(SIDP),通过奖励引导的自模仿机制和课程学习范式,减少对大量采样和后过滤的依赖,实现高效鲁棒的视觉导航。

Comments Preprint

详情
AI中文摘要

扩散策略(DP)通过捕获多样的多模态轨迹分布,在视觉导航中展现出显著潜力。然而,大多数DP方法依赖的标准模仿学习(IL)往往继承专家演示中的次优性和冗余性,从而需要在推理期间依赖辅助选择器进行计算密集型的“生成-然后过滤”流程。为解决这些挑战,我们提出自模仿扩散策略(SIDP),一种新颖的框架,通过选择性模仿从自身采样的一组轨迹来学习改进的规划。具体来说,SIDP引入了一种奖励引导的自模仿机制,鼓励策略持续高效地生成高质量轨迹,而非质量不一致的输出,从而减少对大量采样和后过滤的依赖。在训练过程中,我们采用奖励驱动的课程学习范式来缓解低效的数据利用率,以及目标无关的探索进行轨迹增强以提高规划鲁棒性。在综合仿真基准上的广泛评估表明,SIDP显著优于先前方法,真实世界实验证实了其在多个机器人平台上的有效性。在Jetson Orin Nano上,SIDP的推理速度比基线NavDP快2.5倍,即110ms对比273ms,实现了高效的实时部署。

英文摘要

Diffusion policies (DP) have demonstrated significant potential in visual navigation by capturing diverse multi-modal trajectory distributions. However, standard imitation learning (IL), which most DP methods rely on for training, often inherits sub-optimality and redundancy from expert demonstrations, thereby necessitating a computationally intensive "generate-then-filter" pipeline that relies on auxiliary selectors during inference. To address these challenges, we propose Self-Imitated Diffusion Policy (SIDP), a novel framework that learns improved planning by selectively imitating a set of trajectories sampled from itself. Specifically, SIDP introduces a reward-guided self-imitation mechanism that encourages the policy to consistently and efficiently produce high-quality trajectories, rather than outputs of inconsistent quality, thereby reducing reliance on extensive sampling and post-filtering. During training, we employ a reward-driven curriculum learning paradigm to mitigate inefficient data utilization, and goal-agnostic exploration for trajectory augmentation to improve planning robustness. Extensive evaluations on a comprehensive simulation benchmark show that SIDP significantly outperforms previous methods, with real-world experiments confirming its effectiveness across multiple robotic platforms. On Jetson Orin Nano, SIDP delivers 2.5$\times$ faster inference than the baseline NavDP (110 ms vs. 273 ms), enabling efficient real-time deployment.

2601.14323 2026-06-02 cs.CR cs.AI cs.RO 版本更新

SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models

SilentDrift: 利用动作分块对视觉-语言-动作模型进行隐蔽后门攻击

Bingxin Xu, Yuzhang Shang, Binghui Wang, Emilio Ferrara

发表机构 * University of Southern California(南加州大学) University of Central Florida(中央佛罗里达大学) Illinois Institute of Technology(伊利诺伊理工学院)

AI总结 针对视觉-语言-动作模型中的动作分块与增量位姿表示导致的视觉开环漏洞,提出一种利用Smootherstep函数构建满足C2连续扰动的隐蔽黑盒后门攻击方法SilentDrift,并通过关键帧攻击策略实现高攻击成功率与低中毒率。

Comments Accepted to ACL Findings 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越多地部署在安全关键的机器人应用中,但其安全漏洞仍未得到充分探索。我们识别出现代VLA系统中的一个基本安全缺陷:动作分块与增量位姿表示的结合产生了块内视觉开环。该机制迫使机器人执行K步动作序列,允许每步扰动通过积分累积。我们提出SILENTDRIFT,一种利用此漏洞的隐蔽黑盒后门攻击。我们的方法采用Smootherstep函数构建具有保证C2连续性的扰动,确保轨迹边界处的速度和加速度为零,以满足严格的运动学一致性约束。此外,我们的关键帧攻击策略仅选择性地毒化关键的接近阶段,在最小化触发暴露的同时最大化影响。生成的毒化轨迹在视觉上与成功演示难以区分。在LIBERO上评估,SILENTDRIFT在低于2%的中毒率下实现了93.2%的攻击成功率,同时保持了95.3%的干净任务成功率。

英文摘要

Vision-Language-Action (VLA) models are increasingly deployed in safety-critical robotic applications, yet their security vulnerabilities remain underexplored. We identify a fundamental security flaw in modern VLA systems: the combination of action chunking and delta pose representations creates an intra-chunk visual open-loop. This mechanism forces the robot to execute K-step action sequences, allowing per-step perturbations to accumulate through integration. We propose SILENTDRIFT, a stealthy black-box backdoor attack exploiting this vulnerability. Our method employs the Smootherstep function to construct perturbations with guaranteed $C^2$ continuity, ensuring zero velocity and acceleration at trajectory boundaries to satisfy strict kinematic consistency constraints. Furthermore, our keyframe attack strategy selectively poisons only the critical approach phase, maximizing impact while minimizing trigger exposure. The resulting poisoned trajectories are visually indistinguishable from successful demonstrations. Evaluated on LIBERO, SILENTDRIFT achieves a 93.2% Attack Success Rate with a poisoning rate under 2%, while maintaining a 95.3% Clean Task Success Rate.
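
The continuity claim in this abstract can be checked numerically. The sketch below assumes "Smootherstep" refers to the widely used quintic S(t) = 6t^5 - 15t^4 + 10t^3 on [0, 1]; the function names are illustrative and are not taken from the paper. It only evaluates the curve and its first two derivatives, which vanish at both endpoints.

```python
def smootherstep(t: float) -> float:
    """Quintic smootherstep 6t^5 - 15t^4 + 10t^3, with t clamped to [0, 1]."""
    t = min(1.0, max(0.0, t))
    return t * t * t * (t * (6.0 * t - 15.0) + 10.0)


def smootherstep_d1(t: float) -> float:
    """First derivative on [0, 1]: 30 t^2 (t - 1)^2."""
    return 30.0 * t * t * (t - 1.0) ** 2


def smootherstep_d2(t: float) -> float:
    """Second derivative on [0, 1]: 60 t (t - 1) (2t - 1)."""
    return 60.0 * t * (t - 1.0) * (2.0 * t - 1.0)


if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(t, smootherstep(t), smootherstep_d1(t), smootherstep_d2(t))
```

Because both derivatives are zero at t = 0 and t = 1, a signal blended with this curve joins a constant segment without a jump in velocity or acceleration, which is the property the abstract relies on.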

2601.13574 2026-06-02 cs.RO 版本更新

Highly Deformable Proprioceptive Membrane for Real-Time 3D Shape Reconstruction

高度可变形本体感觉膜用于实时三维形状重建

Guanyu Xu, Jiaqi Wang, Dezhong Tong, Xiaonan Huang

AI总结 提出一种基于光波导传感的柔性可拉伸本体感觉硅胶膜,通过数据驱动模型解码变形相关光强信号,实现实时三维形状重建,在140mm方形膜上达到90Hz更新率和1.307mm平均重建误差。

Comments 13 pages, 9 figures

详情
AI中文摘要

重建物体表面的三维几何形状对于机器人感知至关重要,但基于视觉的方法在低光照或遮挡条件下效果不佳。这一局限性促使我们设计一种本体感觉膜,该膜贴合感兴趣表面并通过重建自身变形来推断三维几何形状。传统的变形感知膜通常依赖于电阻、电容或磁敏机制,但可能存在结构复杂、在大规模变形下顺应性有限以及易受电磁干扰等问题。本文提出一种基于光波导传感的柔软、灵活且可拉伸的本体感觉硅胶膜。该膜将边缘安装的LED和中心分布的光电二极管集成在多层弹性体复合材料中。丰富的变形相关光强信号通过数据驱动模型解码,以恢复膜的几何形状。在定制的140mm方形膜上,以90Hz的端到端更新率实现了实时重建,对于高达25mm的面外变形,平均重建误差为1.307mm。所提出的传感器还在大面内变形下展示了精确重建,在高达75%应变下实现了可靠的形状恢复,平均Chamfer距离为1.214mm。所提出的框架为可变形机器人系统中的全局形状感知提供了一种可扩展、稳健且低剖面的解决方案。

英文摘要

Reconstructing the three-dimensional (3D) geometry of object surfaces is essential for robot perception, yet vision-based approaches degrade under low illumination or occlusion. This limitation motivates the design of a proprioceptive membrane that conforms to the surface of interest and infers 3D geometry by reconstructing its own deformation. Conventional deformation-aware membranes typically rely on resistive, capacitive, or magneto-sensitive mechanisms, but can suffer from structural complexity, limited compliance during large-scale deformation, and susceptibility to electromagnetic interference. This work presents a soft, flexible, and stretchable proprioceptive silicone membrane based on optical waveguide sensing. The membrane integrates edge-mounted LEDs and centrally-distributed photodiodes (PDs) within a multilayer elastomeric composite. Rich deformation-dependent light-intensity signals are decoded by a data-driven model to recover the membrane geometry. Real-time reconstruction is demonstrated on a customized 140 mm square membrane at an end-to-end update rate of 90 Hz, achieving an average reconstruction error of 1.307 mm for out-of-plane deformation of up to 25 mm. The proposed sensor also demonstrates accurate reconstruction under large in-plane deformation, achieving reliable shape recovery up to 75% strain with an average Chamfer distance of 1.214 mm. The proposed framework provides a scalable, robust, and low-profile solution for global shape perception in deformable robotic systems.
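
For readers unfamiliar with the Chamfer distance reported above, here is a minimal sketch of one common variant: the average of the two directed mean nearest-neighbour distances, using unsquared Euclidean distance. The abstract does not say which variant the authors use, so the averaging convention and the function name are assumptions.

```python
import math
from typing import Sequence

Point = Sequence[float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point sets."""

    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    # Assumed convention: average the two directed terms.
    return 0.5 * (one_way(a, b) + one_way(b, a))


if __name__ == "__main__":
    flat = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    lifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(chamfer_distance(flat, lifted))
```

This brute-force version is quadratic in the number of points; it is meant to show the metric, not to be fast.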

2601.11460 2026-06-02 cs.RO cs.LG 版本更新

Semantic-Geometric Task Representations for Bimanual Manipulation from Human Demonstrations to Robot Action Planning

面向双臂操作的语义-几何任务表示:从人类示范到机器人动作规划

Franziska Herbert, Vignesh Prasad, Han Liu, Dorothea Koert, Georgia Chalvatzaki

发表机构 * Interactive Robot Perception & Learning (PEARL) Lab, Computer Science Dept., TU Darmstadt, Germany(达姆施塔特工业大学计算机科学系交互式机器人感知与学习实验室) Hessian.AI, Darmstadt, Germany(黑森州人工智能中心) Robotics Institute Germany (RIG)(德国机器人研究所) Interactive AI Algorithms & Cognitive Models for Human-AI Interaction (IKIDA), Computer Science Dept., TU Darmstadt, Germany(面向人机交互的交互式人工智能算法与认知模型(IKIDA),达姆施塔特工业大学计算机科学系) Center for Cognitive Science, TU Darmstadt, Germany(达姆施塔特工业大学认知科学中心)

AI总结 提出一种语义-几何图任务表示,通过消息传递神经网络编码器和Transformer解码器联合编码对象身份、语义关系和运动历史,实现从人类示范中学习结构化任务表示,并支持跨实体迁移和双臂操作规划。

Comments 9 pages, 7 figures, preprint

详情
AI中文摘要

从人类示范中学习结构化任务表示对于双臂操作至关重要,因为动作顺序、对象参与和交互几何在不同执行中变化显著。一个关键挑战在于以支持任务进展推理的形式,联合捕获离散的语义任务结构和对象中心几何关系的时间演化。我们引入一种基于语义-几何图的任务表示,通过消息传递神经网络(MPNN)编码器和基于Transformer的解码器,联合编码对象身份、对象间语义关系和每个对象的运动历史。编码器仅对时间场景图进行操作,产生与动作标签解耦的结构化表示。解码器则根据动作上下文预测未来动作、关联对象和对象运动。这种解耦学习了任务无关的表示,使得编码器可以通过仅在小型机器人数据集上微调解码器而跨实体复用。在两个数据集的十一个双臂任务中,我们发现结构化语义-几何表示相对于更简单的基于序列模型的优势随着动作顺序和对象参与的任务变异性增加而增长。在部署时,规划器将动作和运动预测与学习的概率运动基元相结合,在两个真实机器人双臂任务上实现了完全任务成功,并优于图消融、Transformer、仅解码器和微调的视觉-语言模型基线。

英文摘要

Learning structured task representations from human demonstrations is essential for bimanual manipulation, where action ordering, object involvement, and interaction geometry vary significantly across executions. A key challenge lies in jointly capturing the discrete semantic task structure and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. We introduce a semantic-geometric graph-based task representation that jointly encodes object identities, inter-object semantic relations, and per-object motion histories, via a Message Passing Neural Network (MPNN) encoder and a Transformer-based decoder. The encoder operates solely on the temporal scene graph, producing structured representations decoupled from action labels. The decoder then conditions on action-context to forecast future actions, associated objects, and object motions. This decoupling learns task-agnostic representations, enabling encoder reuse across embodiments through decoder-only finetuning on a small robot dataset. Across eleven bimanual tasks from two datasets, we find that the benefit of structured semantic-geometric representations over simpler sequence-based models grows with task variability in action ordering and object involvement. At deployment, a planner couples the action and motion predictions with learned Probabilistic Movement Primitives, achieving full task success on two real-robot bimanual tasks and outperforming graph ablations, Transformer, decoder-only, and finetuned vision-language model baselines.

2508.20072 2026-06-02 cs.CV cs.LG cs.RO 版本更新

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

离散扩散VLA:将离散扩散引入视觉-语言-动作策略中的动作解码

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出离散扩散VLA,通过将动作块离散化并在统一Transformer骨干内使用离散扩散模式进行渐进细化,实现自适应解码顺序和错误纠正,在多个基准上取得高性能并保留预训练的视觉-语言先验。

Comments Accepted by ICML 2026. 17 pages

详情
AI中文摘要

视觉-语言-动作(VLA)模型将大型视觉-语言骨干网络适配为将图像和指令映射为机器人动作。然而,当前的VLA要么以固定的从左到右顺序自回归生成动作,性能较差;要么在骨干网络外附加独立的扩散头,这会割裂信息通路并阻碍统一、可扩展的架构。相反,我们提出了离散扩散VLA,它将动作块离散化,并使用离散扩散模式在统一的Transformer骨干内保留渐进细化。我们的方法实现了自适应解码顺序,在解决较难的动作元素之前先解决高置信度的动作元素,并采用二次重掩码来重新审视不确定的预测,从而实现鲁棒的纠错。这种设计保留了预训练的视觉-语言先验,支持并行解码,并提高了效率。离散扩散VLA在LIBERO上达到96.4%的平均成功率,在SimplerEnv-Fractal上达到71.2%的视觉匹配,在SimplerEnv-Bridge上达到54.2%的整体性能。在LIBERO-Goal的分布外测试中,我们的方法仅表现出0.8%的语言退化(相比之下并行解码为8.0%),以及20.4%的视觉退化(相比之下连续扩散为29.0%),表明其很好地保留了预训练的视觉-语言能力。我们还在AgileX Cobot Magic平台上进行了两次真实机器人评估,以展示该方法的有效性。

英文摘要

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order, with poor performance, or attach separate diffusion heads outside the backbone, which fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA, which discretizes action chunks and models them with discrete diffusion, retaining progressive refinement inside the unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves efficiency. Discrete Diffusion VLA achieves 96.4% avg. success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% for parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating good retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on the AgileX Cobot Magic platform to show the method's effectiveness.
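
The decoding idea described above (commit high-confidence action tokens first, then re-mask doubtful ones) can be sketched as a toy loop. Everything here is hypothetical: the `predict` callback stands in for the transformer, and the step count, per-step quota, and `remask_below` threshold are illustrative choices, not the paper's schedule.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marker for an action token that has not been decoded yet

# Hypothetical model interface: given the partially decoded chunk, return a
# (token, confidence) proposal for every position.
Predictor = Callable[[List[Optional[int]]], List[Tuple[int, float]]]


def decode_chunk(predict: Predictor, length: int, steps: int = 4,
                 remask_below: float = 0.5) -> List[int]:
    """Iteratively unmask an action chunk, most confident positions first."""
    assert steps >= 1
    tokens: List[Optional[int]] = [MASK] * length
    for step in range(steps):
        proposals = predict(tokens)
        # Secondary re-masking: reopen committed tokens the model now doubts.
        # Skipped on the last step so that decoding always terminates.
        if step < steps - 1:
            for i, (_, conf) in enumerate(proposals):
                if tokens[i] is not MASK and conf < remask_below:
                    tokens[i] = MASK
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        # Spread the remaining masked positions over the remaining steps.
        quota = -(-len(masked) // (steps - step))  # ceiling division
        by_confidence = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in by_confidence[:quota]:
            tokens[i] = proposals[i][0]
    return [int(t) for t in tokens if t is not MASK]


if __name__ == "__main__":
    confidences = [0.9, 0.2, 0.7, 0.4]

    def toy_predict(tokens: List[Optional[int]]) -> List[Tuple[int, float]]:
        return [(10 * i, c) for i, c in enumerate(confidences)]

    print(decode_chunk(toy_predict, length=4, steps=2))
```

With the toy predictor, positions 0 and 2 are committed in the first step and positions 1 and 3 in the second, which is the "easy elements first" ordering the abstract describes.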

2512.18336 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

强化学习低层四旋翼控制中的动态熵调节:随机性与确定性

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department(机械电子工程系) The German University in Cairo(开罗德国大学)

AI总结 研究在四旋翼控制中,通过动态熵调节训练随机策略的强化学习算法,并与确定性策略算法对比,发现动态熵调节可防止灾难性遗忘并提高探索效率。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2024 IEEE 34th International Conference on Computer Theory and Applications (ICCTA)
AI中文摘要

本文探讨了在训练随机策略的强化学习算法中动态熵调节的影响,并将其性能与训练确定性策略的算法进行了比较。随机策略通过优化动作的概率分布来最大化奖励,而确定性策略则为每个状态选择一个确定的动作。本文研究了使用静态熵和动态熵训练随机策略,然后执行确定性动作来控制四旋翼的效果,并与训练确定性策略并执行确定性动作进行了对比。为此,随机算法选择了软演员-评论家(SAC)算法,确定性算法选择了双延迟深度确定性策略梯度(TD3)算法。训练和仿真结果表明,动态熵调节通过防止灾难性遗忘和提高探索效率,对控制四旋翼产生了积极影响。

英文摘要

This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its performance is compared against algorithms that train a deterministic one. Stochastic policies optimize a probability distribution over actions to maximize rewards, while deterministic policies select a single deterministic action per state. The effect of training a stochastic policy with both static entropy and dynamic entropy and then executing deterministic actions to control the quadcopter is explored. It is then compared against training a deterministic policy and executing deterministic actions. For the purpose of this research, the Soft Actor-Critic (SAC) algorithm was chosen for the stochastic algorithm while the Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen for the deterministic algorithm. The training and simulation results show the positive effect the dynamic entropy tuning has on controlling the quadcopter by preventing catastrophic forgetting and improving exploration efficiency.
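
The abstract does not spell out the tuning rule. A minimal sketch, assuming "dynamic entropy tuning" means the standard automatic temperature adjustment used with SAC, where the temperature alpha is updated so the policy entropy tracks a target value. Optimizing `log_alpha` rather than alpha is a common implementation choice, and the learning rate and target entropy below are illustrative.

```python
import math
from typing import Sequence


def update_log_alpha(log_alpha: float, log_probs: Sequence[float],
                     target_entropy: float, lr: float = 3e-4) -> float:
    """One gradient-descent step on J = -log_alpha * mean(log_prob + target_entropy)."""
    # dJ/d(log_alpha) = -mean(log_prob + target_entropy)
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return log_alpha - lr * grad


if __name__ == "__main__":
    # Sampled log-probabilities of -1.0 mean an entropy estimate of 1.0,
    # which is above the target of -2.0, so the temperature should shrink.
    new_log_alpha = update_log_alpha(0.0, [-1.0, -1.0], target_entropy=-2.0, lr=0.1)
    print(new_log_alpha, math.exp(new_log_alpha))
```

With a static entropy coefficient, this update is simply skipped and alpha stays fixed for the whole run, which is the comparison the abstract draws.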

2512.18333 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)

基于软演员-评论家(SAC)的四旋翼强化学习位置控制

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department(机械电子工程系) The German University in Cairo(开罗德国大学)

AI总结 提出一种基于强化学习的四旋翼推力矢量控制架构,使用软演员-评论家算法训练,相比传统RPM控制器训练更快、路径跟踪更平滑准确。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2024 IEEE 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES)
AI中文摘要

本文提出了一种新的基于强化学习(RL)的四旋翼控制架构。现有文献主要关注直接控制四个旋翼的转速,而本文旨在控制四旋翼的推力矢量。RL智能体计算沿四旋翼z轴的总推力百分比以及期望的滚转角(ϕ)和俯仰角(θ)。然后,智能体将计算出的控制信号连同当前四旋翼的偏航角(ψ)发送给姿态PID控制器。PID控制器再将控制信号映射为电机转速。采用软演员-评论家算法(一种无模型离策略随机RL算法)来训练RL智能体。训练结果表明,与传统的RPM控制器相比,所提出的推力矢量控制器训练时间更短。仿真结果表明,所提出的推力矢量控制器具有更平滑、更精确的路径跟踪性能。

英文摘要

This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. Whereas the literature focuses on controlling the four rotors' RPMs directly, this paper aims to control the quadrotor's thrust vector. The RL agent computes the percentage of overall thrust along the quadrotor's z-axis along with the desired roll ($\phi$) and pitch ($\theta$) angles. The agent then sends the calculated control signals, together with the quadrotor's current yaw angle ($\psi$), to an attitude PID controller. The PID controller then maps the control signals to motor RPMs. The Soft Actor-Critic algorithm, a model-free off-policy stochastic RL algorithm, was used to train the RL agents. Training results show the faster training time of the proposed thrust vector controller in comparison to the conventional RPM controllers. Simulation results show smoother and more accurate path-following for the proposed thrust vector controller.

2512.13356 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)

使用双延迟深度确定性策略梯度(TD3)控制双旋翼系统

Zeyad Gamal, Youssef Mahran, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department(机械电子工程系) The German University in Cairo(开罗德国大学)

AI总结 提出基于TD3算法的强化学习框架,用于控制双旋翼气动系统在俯仰和方位角上的稳定与轨迹跟踪,仿真和实验验证了其优于传统PID控制器的抗干扰能力。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2024 28th IEEE International Conference on System Theory, Control and Computing (ICSTCC)
AI中文摘要

本文提出了一种强化学习(RL)框架,用于在特定俯仰角和方位角下控制和稳定双旋翼气动系统(TRAS),并跟踪给定轨迹。TRAS的复杂动力学和非线性特性使得使用传统控制算法进行控制具有挑战性。然而,近年来RL的发展因其在多旋翼控制中的潜在应用而引起了兴趣。本文使用双延迟深度确定性策略梯度(TD3)算法来训练RL智能体。该算法适用于具有连续状态和动作空间的环境(类似于TRAS),因为它不需要系统的模型。仿真结果展示了RL控制方法的有效性。接下来,使用风扰形式的外部扰动来测试控制器与传统PID控制器相比的有效性。最后,在实验室装置上进行了实验,以确认控制器在实际应用中的有效性。

英文摘要

This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.
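
As background on the algorithm named above, the following sketch shows how TD3 forms its critic target: clipped noise is added to the target actor's action (target policy smoothing) and the smaller of two target critics is used (clipped double Q-learning). A scalar action is used for brevity; the callables and default hyperparameters are illustrative placeholders and do not describe the paper's TRAS setup.

```python
import random
from typing import Callable


def td3_target(reward: float, next_state: float, done: bool,
               target_actor: Callable[[float], float],
               target_q1: Callable[[float, float], float],
               target_q2: Callable[[float, float], float],
               gamma: float = 0.99, noise_std: float = 0.2,
               noise_clip: float = 0.5, act_limit: float = 1.0) -> float:
    """Bootstrapped TD3 target for a single transition."""
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = max(-noise_clip, min(noise_clip, random.gauss(0.0, noise_std)))
    next_action = max(-act_limit, min(act_limit, target_actor(next_state) + noise))
    # Clipped double Q-learning: take the more pessimistic critic estimate.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return reward + gamma * (0.0 if done else 1.0) * q_min


if __name__ == "__main__":
    y = td3_target(1.0, 0.0, False,
                   target_actor=lambda s: 0.5,
                   target_q1=lambda s, a: 2.0 * a,
                   target_q2=lambda s, a: 3.0,
                   gamma=0.9, noise_std=0.0)
    print(y)
```

The third TD3 ingredient, delayed actor updates, concerns the training loop rather than the target and is omitted here.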

2512.09065 2026-06-02 cs.RO cs.AI 版本更新

ShelfAware: Real-Time Semantic Localization in Quasi-Static Environments with Low-Cost Sensors

ShelfAware:准静态环境下基于低成本传感器的实时语义定位

Shivendra Agrawal, Jake Brawer, Ashutosh Naik, Alessandro Roncone, Bradley Hayes

发表机构 * Department of Computer Science, University of Colorado Boulder(科罗拉多大学波尔德分校计算机科学系)

AI总结 提出ShelfAware语义粒子滤波器,通过将场景语义建模为类别统计证据而非固定地标,结合深度似然与类别语义相似度,并利用预计算语义视角进行逆语义提议,实现低成本视觉硬件上的鲁棒全局定位。

Comments 8 pages

详情
Journal ref
IEEE Robotics and Automation Letters (RA-L), 2026
AI中文摘要

许多室内工作空间是准静态的:其全局几何布局稳定,但局部语义不断变化,产生重复几何结构、动态杂乱和感知噪声,使得标准基于视觉的定位失效。我们提出ShelfAware,一种用于鲁棒全局定位的语义粒子滤波器,它将场景语义视为对象类别的统计证据而非固定数量地标。ShelfAware融合深度似然与以类别为中心的语义相似度,并利用预计算的语义视角库在蒙特卡洛定位(MCL)中执行逆语义提议,从而在低成本、纯视觉硬件上实现快速、有针对性的假设生成。为了展示感知无关的可扩展性,我们在两个领域评估ShelfAware。在严格控制的模拟零售环境中,ShelfAware实现了97%的全局定位成功率,并在购物车、可穿戴和动态遮挡条件下保持了最高的跟踪成功率(66%)。此外,在利用开放词汇视觉管道的3,500平方英尺运营杂货店中,ShelfAware显著优于几何和固定数量语义基线。通过分布性建模语义并利用逆提议,ShelfAware解决了几何混叠问题,为动态真实环境中的移动和辅助机器人提供了无需基础设施的构建模块。

英文摘要

Many indoor workspaces are quasi-static: their global geometric layout is stable, but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat standard vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed-quantity landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside Monte Carlo Localization (MCL), yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. To demonstrate perception-agnostic scalability, we evaluate ShelfAware across two domains. In a rigorously controlled mock retail environment, ShelfAware achieves a 97% global localization success rate, maintaining the highest tracking success (66%) across cart, wearable, and dynamic occlusion conditions. Furthermore, in a 3,500 sq. ft. operational grocery store leveraging an open-vocabulary vision pipeline, ShelfAware significantly outperforms both geometric and fixed-quantity semantic baselines. By modeling semantics distributionally and leveraging inverse proposals, ShelfAware resolves geometric aliasing, providing an infrastructure-free building block for mobile and assistive robots in dynamic real-world environments.

2512.06182 2026-06-02 cs.RO cs.SY eess.SY 版本更新

Situation-Aware Interactive MPC Switching for Autonomous Driving

自动驾驶中的情境感知交互式MPC切换

Shuhao Qi, Qiling Aori, Luyao Zhang, Mircea Lazar, Sofie Haesaert

发表机构 * Eindhoven University of Technology(埃因霍温理工大学) Delft University of Technology(代尔夫特理工大学)

AI总结 针对自动驾驶中交互场景的挑战,提出一种基于神经网络的分类器,根据情境需求在不同MPC控制器间切换,以平衡性能和计算开销。

详情
AI中文摘要

在交互式交通场景中,由于车辆间的相互影响和周围智能体的固有不确定性,自动驾驶仍然具有挑战性。已经提出了几种模型预测控制(MPC)公式来解决这一挑战,每种公式采用不同的智能体间交互模型。虽然高保真交互模型能够实现更智能的行为,但它们会带来显著更高的计算成本。由于在实际交通中强交互仅偶尔出现,一种平衡性能和计算开销的实用策略是根据情境需求调用合适的控制器。为此,我们首先进行了一项比较研究,评估并层次化不同MPC公式的交互能力。基于这一层次结构,我们开发了一个基于神经网络的分类器,用于在这些控制器之间进行情境感知切换。我们证明,通过仅在罕见但关键的情况下调用最先进的交互式MPC,并在大多数情况下依赖基本MPC,情境感知切换显著提高了整体性能,同时大幅降低了计算负载。

英文摘要

Autonomous driving in interactive traffic scenarios remains challenging because of the mutual influence among vehicles and the inherent uncertainty of surrounding agents. Several model predictive control (MPC) formulations have been proposed to address this challenge, each adopting a different model of inter-agent interaction. While higher-fidelity interaction models enable more intelligent behavior, they incur substantially greater computational cost. Since strong interactions arise only occasionally in real traffic, a practical strategy for balancing performance and computational overhead is to invoke an appropriate controller based on situational demands. To this end, we first conduct a comparative study to assess and hierarchize the interactive capabilities of different MPC formulations. Building on this hierarchy, we then develop a neural network-based classifier for situation-aware switching among these controllers. We demonstrate that, by invoking the most advanced interactive MPC only in rare but critical situations and relying on a basic MPC in the majority of situations, situation-aware switching substantially improves overall performance while significantly reducing computational load.

2512.04069 2026-06-02 cs.CV cs.RO 版本更新

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

SpaceTools: 通过双交互强化学习实现工具增强的空间推理

Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay

发表机构 * NVIDIA University of Michigan(密歇根大学)

AI总结 提出双交互强化学习(DIRL)框架,通过两阶段训练让视觉语言模型学会协调多种工具(如深度估计、分割、姿态估计)进行精确空间推理,在多个基准上达到最优性能。

Comments CVPR 2026

详情
AI中文摘要

视觉语言模型(VLM)表现出强大的定性视觉理解能力,但在具身应用所需的度量精确空间推理方面存在困难。智能体范式承诺VLM可以使用多种工具来增强这些能力,例如深度估计器、分割模型和姿态估计器。然而,如何在不完全依赖手工设计的提示策略或强制固定预定义工具管道(限制VLM发现最优工具使用模式的能力)的情况下实现这一愿景仍然是一个开放挑战。强化学习可以克服这一差距,但迄今为止由于多工具推理中的大搜索空间,仅限于使用单个视觉工具进行推理。我们引入了双交互强化学习(DIRL),这是一个两阶段训练框架,其中VLM通过交互式探索和反馈学习协调多个工具。在教学阶段,我们将通过交互式RL训练的单个工具专家的演示与使用所有工具的前沿模型的轨迹相结合。在探索阶段,模型通过持续的RL进一步优化多工具协调。我们的模型SpaceTools具有工具增强的空间推理能力,在空间理解基准(RoboSpatial-Home、BLINK、BOP-ASK)上达到了最先进的性能,并展示了使用7自由度机器人作为工具的可靠现实世界操作。DIRL在普通SFT(RoboSpatial上+12%)和RL(RoboSpatial上+16%)基线上提供了显著改进。项目页面:https://spacetools.github.io/。

English abstract

Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.

2506.10239 2026-06-02 cs.RO Updated version

A Unified Framework for Probabilistic Dynamic-, Trajectory- and Vision-based Virtual Fixtures

基于概率的动态、轨迹和视觉虚拟夹具的统一框架

Maximilian Mühlbauer, Bernhard Weber, Sylvain Calinon, Freek Stulp, Alin Albu-Schäffer, João Silvério

Affiliations: School of Computation, Information and Technology, Sensor Based Robotic Systems and Intelligent Assistance Systems; German Aerospace Center (DLR), Robotics and Mechatronics Center (RMC); Idiap Research Institute; École Polytechnique Fédérale de Lausanne (EPFL)

AI summary: Proposes a unified probabilistic virtual fixture framework that switches adaptively among three modes (dynamical system, trajectory, and visual servoing), providing assistance that ranges from manual to fully automated operation, lowering interaction forces and improving usability.

Comments for the supplementary video, see https://youtu.be/eMl41ha7VJ4

Details
AI Chinese abstract

概率虚拟夹具(VF)能够根据学习或感知的不确定性,自适应地选择每个任务阶段最合适的触觉反馈。虽然保持人在回路中对于确保高精度等至关重要,但某些任务阶段的部分自动化对生产力至关重要。我们提出了一个统一的概率VF框架,可在手动夹具、半自动夹具(人类处理精确任务)和全自动之间无缝切换。我们引入了一种新颖的基于概率动态系统的VF用于粗略引导,使机器人能够在保持人类操作员在回路中的同时自主完成某些任务阶段。对于需要精确引导的任务,我们将基于概率位置的轨迹夹具与自动化相结合,实现无缝的人机交互、几何感知和最优阻抗增益。对于需要非常精确引导的手动任务,我们还扩展了视觉伺服夹具,使其具有相同的几何感知和阻抗行为。我们在不同的机器人上验证了我们的方法,包括专家用户评估,展示了操作模式、编程夹具的简便性、更低的交互力以及与基线相比更优的可用性。

English abstract

Probabilistic Virtual Fixtures (VFs) enable the adaptive selection of the most suitable haptic feedback for each phase of a task, based on learned or perceived uncertainty. While keeping the human in the loop remains essential, for instance, to ensure high precision, partial automation of certain task phases is critical for productivity. We present a unified framework for probabilistic VFs that seamlessly switches between manual fixtures, semi-automated fixtures (with the human handling precise tasks), and full autonomy. We introduce a novel probabilistic Dynamical System-based VF for coarse guidance, enabling the robot to autonomously complete certain task phases while keeping the human operator in the loop. For tasks requiring precise guidance, we extend probabilistic position-based trajectory fixtures with automation, allowing for seamless human interaction, geometry-awareness and optimal impedance gains. For manual tasks requiring very precise guidance, we also extend visual servoing fixtures with the same geometry-awareness and impedance behavior. We validate our approach on different robots, including an evaluation with expert users, showcasing operation modes, the ease of programming fixtures and lower interaction forces and favorable usability compared to a baseline.

2512.00062 2026-06-02 cs.RO cs.AI cs.LG Updated version

SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning

SpeedAug: 通过节奏增强策略和强化学习微调实现策略加速

Taewook Nam, Junmo Cho, Youngsoo Jang, Sung Ju Hwang

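Affiliations: KAIST; UNIST; DeepAuto.ai

AI summary: Proposes SpeedAug, a framework that combines a tempo-enriched prior policy with RL fine-tuning so that robot policies learn task-optimal execution tempo, substantially improving execution speed and sample efficiency while maintaining a high success rate.

Details
AI Chinese abstract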

针对复杂真实世界操作任务的机器人策略学习近期取得了快速进展,这在很大程度上得益于通过人类操作收集演示数据的能力。然而,从这些演示中训练出的策略通常执行任务的速度远低于机器人的物理能力,因为演示数据是在实际约束下收集的,这些约束倾向于保守的、以成功为导向的轨迹,而非执行速度。现有的策略加速方法通过数据预处理或启发式规则确定执行节奏,而不是学习针对任务优化的执行速度。在本文中,我们提出了SpeedAug,一个策略加速框架,使策略能够通过强化学习(RL)学习任务最优的执行节奏。SpeedAug首先从速度增强的演示中学习一个节奏增强的先验策略,该策略捕捉了多样的执行节奏。在此基础上,通过强化学习微调指导探索,以优化动作轨迹并高效优化执行节奏。在机器人操作基准上的实验表明,SpeedAug在保持高成功率的同时,显著提高了策略加速的样本效率,实现了快速且稳定的任务执行。应用于真实世界的操作任务时,SpeedAug仅用16分钟的在线交互就将任务吞吐量提高了1.8倍,且未降低成功率。

English abstract

Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to collect demonstrations through human operation. However, policies trained from such demonstrations often execute tasks far more slowly than the robot's physical capabilities, as demonstration data is collected under practical constraints that favor conservative, success-oriented trajectories over execution speed. Existing policy acceleration methods determine execution tempo through data preprocessing or heuristic rules, rather than learning execution speed optimized for the task. In this paper, we propose SpeedAug, a policy acceleration framework that enables policies to learn task-optimal execution tempo via reinforcement learning (RL). SpeedAug first learns a tempo-enriched prior policy from speed-augmented demonstrations that captures diverse execution tempos. Building on this tempo-enriched prior, RL fine-tuning guides exploration to refine action trajectories and optimize execution tempo efficiently. Experiments on robotic manipulation benchmarks demonstrate that SpeedAug substantially improves the sample efficiency of policy acceleration while maintaining high success rates, achieving fast and stable task execution. Applied to a real-world manipulation task, SpeedAug improves task throughput by 1.8x using only 16 minutes of online interactions without compromising the success rate.
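One simple way to picture speed augmentation is to resample a demonstration in time so that it plays back faster or slower. The sketch below is only an illustration under that assumption: the abstract does not say how SpeedAug constructs its speed-augmented demonstrations, and the function name `speed_augment`, the evenly spaced waypoints, and the linear interpolation are hypothetical choices made for this example.

```python
def speed_augment(trajectory, speed):
    """Resample evenly spaced waypoints so playback is `speed` times faster.

    Hypothetical helper for illustration; not taken from the SpeedAug paper.
    `trajectory` is a list of equal-length tuples of floats.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    n = len(trajectory)
    if n < 2:
        return list(trajectory)
    # A faster tempo covers the same path with fewer waypoints.
    new_len = max(2, round((n - 1) / speed) + 1)
    resampled = []
    for i in range(new_len):
        t = i * (n - 1) / (new_len - 1)  # fractional index into the original
        lo = int(t)
        hi = min(lo + 1, n - 1)
        frac = t - lo
        # Linear interpolation between the two neighbouring waypoints.
        resampled.append(tuple(a + frac * (b - a)
                               for a, b in zip(trajectory[lo], trajectory[hi])))
    return resampled
```

Calling `speed_augment(demo, 2.0)` keeps the start and end waypoints and roughly halves the number of steps in between, while a factor below 1 slows the demonstration down. Applying several factors to the same demonstration would give a set with varied tempos.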

2511.22445 2026-06-02 cs.RO Updated version

DIPOLE: Fusing Vision and Geometry for Robust Visuomotor Generalization

DIPOLE:融合视觉与几何实现鲁棒的视觉运动泛化

Yikai Tang, Haoran Geng, Jindou Jia, Yuxuan Hu, Sheng Zang, Jianfei Yang, Pieter Abbeel, Jitendra Malik

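Affiliations: University of California, Berkeley; Nanyang Technological University

AI summary: Proposes DIPOLE, which fuses visual and geometric information through training-time modality dropout and lightweight cross-attention, yielding policies that generalize robustly across changes such as lighting, texture, and viewpoint, with an average improvement of 39.1% across 18 simulated and 4 real-world tasks.

Details
AI Chinese abstract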

模仿学习已成为从演示中获取视觉运动技能的关键方法,其中设计有效的观测编码器对于策略泛化至关重要。然而,现有方法在测试条件与演示不同时往往难以应对,例如光照、纹理、视角、物体位置或物体身份的变化。为了解决这一挑战,我们提出了具有互补编码器的扩散策略(DIPOLE),这是一种通过训练时机制而非专门融合架构学习融合互补模态的视觉运动策略。模态级丢弃在每个训练步骤中屏蔽一个分支,鼓励每个模态保持独立的信息性。然后,一个轻量级的交叉注意力层在两者之间交换互补线索。这种设计赋予DIPOLE五个核心优势:跨不同任务的稳定高性能、对视觉变化的鲁棒性、亚厘米精度的空间泛化、超越任一模态的涌现能力以及零样本迁移到未见物体。在18个模拟和4个真实世界任务中,DIPOLE平均优于六个基线39.1%,在未见视觉干扰物下提升41.5%,在随机物体放置下提升15.2%。

English abstract

Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods tend to struggle once test-time conditions differ from the demonstrations, such as changes in lighting, texture, viewpoint, object placement, or object identity. To address this challenge, we propose DIffusion POlicy with compLementarity Encoders (DIPOLE), a visuomotor policy that learns to fuse complementary modalities through a training-time mechanism rather than a specialized fusion architecture. A modality-wise dropout masks one branch at each training step, encouraging each modality to remain individually informative. A lightweight cross-attention layer then exchanges complementary cues between the two. This design endows DIPOLE with five core strengths: stable high performance across diverse tasks, robustness to visual changes, spatial generalization at sub-centimeter precision, emergent capability beyond either modality, and zero-shot transfer to unseen objects. Across 18 simulated and 4 real-world tasks, DIPOLE outperforms six baselines by 39.1% on average, with gains of 41.5% under unseen visual distractors and 15.2% under randomized object placement.
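The modality-wise dropout described above can be pictured with a short sketch. It assumes two feature vectors (one visual, one geometric), picks the masked branch uniformly at random, and masks by zeroing; the abstract states only that one branch is masked at each training step, so the function name `modality_dropout`, the 50/50 choice, and the zero masking are assumptions for illustration.

```python
import random


def modality_dropout(visual_feat, geom_feat, rng=random):
    """Mask exactly one of two modality branches for a training step.

    Hypothetical sketch: the uniform choice and zero masking are assumptions,
    not details confirmed by the DIPOLE abstract.
    """
    if rng.random() < 0.5:
        # Drop the visual branch; the geometric branch must carry the step.
        return [0.0] * len(visual_feat), list(geom_feat)
    # Drop the geometric branch; the visual branch must carry the step.
    return list(visual_feat), [0.0] * len(geom_feat)
```

Because each branch is regularly forced to act alone during training, neither can rely entirely on the other, which is the stated motivation for keeping each modality individually informative.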

2511.17502 2026-06-02 cs.RO Updated version

RynnVLA-002: A Unified Vision-Language-Action and World Model

RynnVLA-002: 统一的视觉-语言-动作与世界模型

Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Bohan Hou, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, Hao Chen

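Affiliations: DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University

AI summary: Proposes RynnVLA-002, a framework that unifies a Vision-Language-Action (VLA) model and a world model; by jointly learning environment dynamics and action planning, it significantly raises success rates on simulated and real-world robot tasks.

Details
AI Chinese abstract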

我们介绍了RynnVLA-002,一个统一的视觉-语言-动作(VLA)和世界模型。世界模型利用动作和视觉输入来预测未来的图像状态,学习环境的底层物理规律以优化动作生成。相反,VLA模型从图像观测中产生后续动作,增强视觉理解并支持世界模型的图像生成。RynnVLA-002的统一框架使得环境动态和动作规划的联合学习成为可能。我们的实验表明,RynnVLA-002超越了单独的VLA和世界模型,展示了它们的相互增强。我们在仿真和真实世界机器人任务中评估了RynnVLA-002。在LIBERO仿真基准测试中,RynnVLA-002无需预训练即达到97.4%的成功率,而在真实世界的LeRobot实验中,其集成的世界模型将整体成功率提升了50%。

English abstract

We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation. The unified framework of RynnVLA-002 enables joint learning of environmental dynamics and action planning. Our experiments show that RynnVLA-002 surpasses individual VLA and world models, demonstrating their mutual enhancement. We evaluate RynnVLA-002 in both simulation and real-world robot tasks. RynnVLA-002 achieves 97.4% success rate on the LIBERO simulation benchmark without pretraining, while in real-world LeRobot experiments, its integrated world model boosts the overall success rate by 50%.

2511.10276 2026-06-02 cs.RO cs.AI Updated version

RoboBenchMart: Benchmarking Robots in Retail Environment

RoboBenchMart:零售环境中的机器人基准测试

Konstantin Soshin, Alexander Krapukhin, Andrei Spiridonov, Gregorii Bukhtuev, Andrey Kuznetsov, Vlad Shakhuro, Denis Shepelev

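Affiliations: FusionBrain Lab, Robotics Group; NUST MISIS; Lomonosov Moscow State University

AI summary: Proposes RoboBenchMart, an open-source simulated benchmark for mobile manipulation in retail environments, which evaluates generalist vision-language-action (VLA) models under dense object clutter and complex spatial configurations and finds that existing models still perform poorly on common retail tasks.

Details
AI Chinese abstract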

大多数现有的机器人操作基准专注于桌面或家庭场景。虽然这些设置推动了令人印象深刻的进展,但目前尚不清楚在这些场景中表现出色的通用VLA是否能够真正泛化到具有不同几何、语义和工作流程的领域。我们引入了RoboBenchMart,一个针对零售暗店环境的开源模拟基准,其中移动操作器必须对多样化的杂货物品执行复杂的操作任务。该设置提出了重大挑战,包括密集的物品杂乱和多样的空间配置,物品位于不同的高度、深度且紧密相邻。通过针对零售领域,我们的基准解决了一个具有近期自动化影响潜力的场景。利用生成的轨迹,我们为当前的通用VLA建模了一个标准、现实的微调设置,并评估了几种最先进的模型。我们发现,即使在常见的零售任务上,它们仍然表现挣扎,这表明这些模型尚未真正跨领域泛化。为了支持进一步研究,我们发布了RoboBenchMart套件,其中包括程序化商店布局生成器、轨迹生成管道、评估工具和微调基线模型。

English abstract

Most existing robotic manipulation benchmarks focus on tabletop or household scenarios. While these setups have driven impressive progress, it remains unclear whether generalist VLAs that excel there can truly generalize to domains with different geometry, semantics, and workflows. We introduce RoboBenchMart, an open-source simulated benchmark targeting retail dark-store environments, where a mobile manipulator must perform complex manipulation tasks with diverse grocery items. This setting presents significant challenges, including dense object clutter and varied spatial configurations, with items positioned at different heights and depths and in close proximity. By targeting the retail domain, our benchmark addresses a setting with strong potential for near-term automation impact. Using generated trajectories, we model a standard, realistic fine-tuning setup for current generalist VLAs and evaluate several state-of-the-art models. We find that they still struggle even on common retail tasks, indicating that these models are not yet truly general across domains. To support further research, we release the RoboBenchMart suite, which includes a procedural store layout generator, a trajectory generation pipeline, evaluation tools, and fine-tuned baseline models.

2511.02937 2026-06-02 cs.RO cs.SE cs.SY eess.SY Updated version

Toward an Agricultural Operational Design Domain: A Framework

面向农业运行设计域:一个框架

Mirco Felske, Jannik Redenius, Georg Happich, Julius Schöning

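AI summary: Addressing the particular challenges of agricultural autonomous systems operating in complex and variable environments, proposes an agricultural operational design domain framework comprising the Ag-ODD description concept, a 7-Layer Model, and an iterative verification process, to make environment descriptions structured, transparent, and verifiable.

Comments 18 pages, 7 figures, 2 tables

Details
Journal ref
Smart Agricultural Technology, Volume 14, August 2026
AI Chinese abstract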

农业部门越来越依赖在复杂多变环境中运行的自主系统。与道路应用不同,农业自动化集成了驾驶和工作过程,每个过程都施加了不同的运行约束。处理这种复杂性并确保整个开发和验证过程的一致性,需要对环境进行结构化、透明且经过验证的描述。然而,现有的运行设计域(ODD)概念尚未解决农业应用的独特挑战。因此,本文引入了农业ODD(Ag-ODD)框架,可用于描述和验证自主农业系统的运行边界。Ag-ODD框架由三个核心要素组成。首先,Ag-ODD描述概念,它提供了一种结构化方法,利用ASAM Open ODD和CityGML的概念明确定义环境和运行参数。其次,源自PEGASUS 6层模型的7层模型,已扩展包括一个过程层以捕获动态农业操作。第三,迭代验证过程,根据从7层模型导出的相应逻辑场景验证Ag-ODD,以确保Ag-ODD的完整性和一致性。这些要素共同提供了一种一致的方法来创建明确且可验证的Ag-ODD。演示用例展示了Ag-ODD框架如何支持自主农业系统环境描述的标准化和可扩展性。

English abstract

The agricultural sector increasingly relies on autonomous systems that operate in complex and variable environments. Unlike on-road applications, agricultural automation integrates driving and working processes, each of which imposes distinct operational constraints. Handling this complexity and ensuring consistency throughout the development and validation processes requires a structured, transparent, and verified description of the environment. However, existing Operational Design Domain (ODD) concepts do not yet address the unique challenges of agricultural applications. Therefore, this work introduces the Agricultural ODD (Ag-ODD) Framework, which can be used to describe and verify the operational boundaries of autonomous agricultural systems. The Ag-ODD Framework consists of three core elements. First, the Ag-ODD description concept provides a structured method for unambiguously defining environmental and operational parameters using concepts from ASAM Open ODD and CityGML. Second, the 7-Layer Model, derived from the PEGASUS 6-Layer Model, adds a process layer to capture dynamic agricultural operations. Third, the iterative verification process verifies the Ag-ODD against its corresponding logical scenarios, derived from the 7-Layer Model, to ensure the Ag-ODD's completeness and consistency. Together, these elements provide a consistent approach for creating unambiguous and verifiable Ag-ODDs. Demonstrative use cases show how the Ag-ODD Framework can support the standardization and scalability of environmental descriptions for autonomous agricultural systems.

2403.06524 2026-06-02 cs.LG cs.AI cs.RO Updated version

Tactical Decision Making for Autonomous Trucks by Deep Reinforcement Learning with Total Cost of Operation Based Reward

基于总运营成本奖励的深度强化学习自动驾驶卡车战术决策

Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani

Affiliations: Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg; Department of Mechanics and Maritime Sciences, Chalmers University of Technology; Safe and Efficient Driving, Volvo Group of Trucks Technology

AI summary: Proposes a deep reinforcement learning framework for tactical decision making (adaptive cruise control and lane changes) for autonomous trucks in highway scenarios, optimizing performance with a multi-objective reward function based on total cost of operation.

Comments Paper is accepted for publication in Artificial Intelligence Review

Details
Journal ref
Artificial Intelligence Review, Volume 59, Article number 27 (2026)
AI Chinese abstract

我们开发了一个深度强化学习框架,用于自动驾驶卡车的战术决策,特别是高速公路场景下的自适应巡航控制(ACC)和变道操作。我们的结果表明,将高层决策过程与低层控制动作分离,分别由强化学习智能体和基于物理模型的低层控制器执行是有益的。接下来,我们研究了使用不同方法基于卡车总运营成本(TCOP)的逼真多目标奖励函数来优化性能:通过添加奖励组件权重、通过归一化奖励组件以及通过使用课程学习技术。

English abstract

We develop a deep reinforcement learning framework for tactical decision making in an autonomous truck, specifically for Adaptive Cruise Control (ACC) and lane change maneuvers in a highway scenario. Our results demonstrate that it is beneficial to separate high-level decision-making processes and low-level control actions between the reinforcement learning agent and the low-level controllers based on physical models. We then study how to optimize performance with a realistic, multi-objective reward function based on the Total Cost of Operation (TCOP) of the truck, using three approaches: adding weights to reward components, normalizing the reward components, and using curriculum learning techniques.

2504.16129 2026-06-02 cs.MA cs.AI cs.LG cs.RO Updated version

MARFT: Multi-Agent Reinforcement Fine-Tuning

MARFT: 多智能体强化微调

Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang

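Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; OPPO Research Institute

AI summary: For LLM-based multi-agent systems, proposes the Multi-Agent Reinforcement Fine-Tuning (MARFT) framework, introducing the Flex-MG Markov Game formulation and a universal algorithm to address challenges such as asynchronous interaction and heterogeneous architectures and to improve system robustness and adaptability.

Comments 37 pages

Details
AI Chinese abstract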

基于大语言模型的多智能体系统(LaMAS)在需要多方面推理和协作的复杂智能体任务中展现出强大能力,从高质量演示生成到科学研究。同时,强化学习(RL)被广泛认可用于增强智能体智能,但用基础RL技术微调LaMAS的研究有限。由于LaMAS的独特机制,直接将传统多智能体强化学习(MARL)应用于LaMAS也带来了重大挑战。为解决这些挑战,本文对基于LLM的MARL进行了全面研究,并提出了多智能体强化微调(MARFT)。我们引入了Flex-MG,一种与真实世界LaMAS优化一致的新马尔可夫博弈公式,以及一个针对LaMAS定制的通用算法框架。我们回顾了从传统RL到强化微调(RFT)的演变,然后分析了多智能体对应部分。对于LaMAS,我们识别了经典MARL与MARFT之间的关键差异,包括异步智能体交互、轮廓感知智能体设计和异构架构。这些差异促使了面向LaMAS的RFT公式。我们提出了一个稳健且可扩展的MARFT框架,详细介绍了其模块化算法,并提供了开源实现以支持采用和进一步研究。本文进一步讨论了应用前景和开放挑战,包括动态环境建模、样本效率低下以及缺乏连贯框架。通过将理论基础与实践方法相结合,本文旨在作为推进MARFT向弹性、自适应和与人类一致的智能体系统发展的路线图。实现:https://github.com/jwliao-ai/MARFT。

English abstract

Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to scientific research. Meanwhile, Reinforcement Learning (RL) is widely recognized for enhancing agent intelligence, but limited work has studied fine-tuning LaMAS with foundational RL techniques. Directly applying conventional Multi-Agent Reinforcement Learning (MARL) to LaMAS also introduces major challenges due to the unique mechanisms of LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-MG, a new Markov Game formulation aligned with real-world LaMAS optimization, together with a universal algorithmic framework tailored to LaMAS. We review the evolution from traditional RL to Reinforcement Fine-Tuning (RFT), then analyze the multi-agent counterpart. For LaMAS, we identify key differences between classical MARL and MARFT, including asynchronous agent interactions, profile-aware agent design, and heterogeneous architectures. These differences motivate a LaMAS-oriented formulation of RFT. We present a robust and scalable MARFT framework, detail its modular algorithm, and provide an open-source implementation to support adoption and further research. The paper further discusses application perspectives and open challenges, including dynamic environment modeling, sample inefficiency, and the lack of cohesive frameworks. By connecting theoretical foundations with practical methodology, this work aims to serve as a roadmap for advancing MARFT toward resilient, adaptive, and human-aligned agentic systems. Implementation: https://github.com/jwliao-ai/MARFT.

2511.00306 2026-06-02 cs.RO Updated version

Degeneration of Sliding-Window Factor Graph Optimization into Iterated Extended Kalman Filtering

滑动窗口因子图优化退化为迭代扩展卡尔曼滤波

Baoshan Song, Ruijie Xu, Zhi Zhan, Li-Ta Hsu

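Affiliations: Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University

AI summary: By introducing recursive factor graph optimization (Re-FGO) and a two-stage marginalization pipeline, shows that under the Markov assumption and a single-state window, sliding-window factor graph optimization (SW-FGO) is theoretically equivalent to the iterated extended Kalman filter (IEKF), and validates this degeneration with simulations and real-world data.

Comments Accepted by Nature Partner Journal Wireless Technology

Details
AI Chinese abstract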

滑动窗口因子图优化(SW-FGO)因其鲁棒性而广受认可,但其与扩展卡尔曼滤波(EKF)的理论关系仍存在争议。本文建立了将SW-FGO与迭代扩展卡尔曼滤波(IEKF)联系起来的充分条件。我们引入了递归因子图优化(Re-FGO),这是一种概念性视角,采用两阶段边缘化流程,从数学上将因子图优化退化为IEKF递归更新。通过强制执行马尔可夫假设和单状态窗口,我们证明了IEKF与Re-FGO之间的理论等价性。这种退化通过仿真和真实城市GNSS与INS紧耦合融合实验得到了验证。结果证实,Re-FGO精确复现了IEKF的估计行为,表明两阶段边缘化流程是强制执行结构一致性的基础,从而成功地将基于图的平滑和滤波范式统一在优化原则下。

English abstract

Sliding window factor graph optimization (SW-FGO) is widely recognized for its robustness, yet its theoretical relationship with the extended Kalman filter (EKF) remains a subject of debate. This paper establishes the sufficient conditions to bridge SW-FGO with the iterated extended Kalman filter (IEKF). We introduce recursive FGO (Re-FGO), a conceptual perspective that employs a two-stage marginalization pipeline to mathematically degenerate the factor graph optimization to the IEKF recursive update. By enforcing the Markov assumption and a single-state window, we prove the theoretical equivalence between the IEKF and Re-FGO. This degeneration is validated through simulations and real-world urban GNSS and INS tightly coupled fusion experiments. The results confirm that Re-FGO exactly reproduces IEKF estimation behavior, demonstrating that the two-stage marginalization pipeline is foundational to enforce structural consistency, thereby successfully uniting graph-based smoothing and filtering paradigms under unified optimization principles.

2510.23057 2026-06-02 cs.RO cs.CV cs.SY eess.IV eess.SY Updated version

Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation

Seq-DeepIPC:足式机器人导航中用于端到端控制的顺序感知

Oskar Natan, Jun Miura

Affiliations: Department of Computer Science and Electronics, Universitas Gadjah Mada; Department of Computer Science and Engineering, Toyohashi University of Technology

AI summary: Proposes Seq-DeepIPC, a model that fuses multi-modal perception (RGB-D + GNSS) with temporal sequences for end-to-end navigation control of legged robots in real-world environments, validated on a robot dog.

Comments This work has been accepted for publication in the IEEE Sensors Journal. https://ieeexplore.ieee.org/document/11373257/

Details
AI Chinese abstract

我们提出了Seq-DeepIPC,一种用于足式机器人在真实环境中导航的顺序端到端感知到控制模型。Seq-DeepIPC通过将多模态感知(RGB-D+GNSS)与时间融合和控制紧密结合,推进了自主足式导航的智能感知。该模型联合预测语义分割和深度估计,为规划和控制提供更丰富的空间特征。为了在边缘设备上高效部署,我们使用轻量级模型作为编码器,在保持精度的同时减少计算量。通过移除噪声较大的IMU,转而通过顺序GNSS坐标的差分分析推导全局航向,简化了航向估计。我们收集了一个更大且更多样化的数据集,包括道路和草地地形,并在机器人狗上验证了Seq-DeepIPC。对比和消融研究表明,顺序输入改善了我们的模型中的感知和控制,而其他基线则没有受益。Seq-DeepIPC以合理的模型大小取得了具有竞争力或更好的结果;尽管仅使用GNSS的航向在高大建筑物附近可靠性较低,但在开阔区域是鲁棒的。总体而言,Seq-DeepIPC将端到端导航从轮式机器人扩展到更通用和具有时间感知能力的系统。为了支持未来的研究,我们将在GitHub仓库https://github.com/oskarnatan/Seq-DeepIPC发布代码。

English abstract

We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in real-world environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use a lightweight model as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead deriving global heading via differential analysis of sequential GNSS coordinates. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/Seq-DeepIPC.
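Deriving a global heading from two consecutive GNSS fixes can be illustrated with the standard forward-azimuth (initial bearing) formula on a sphere. This is a sketch of one common way to do it, not necessarily the differential analysis used in Seq-DeepIPC; the function name `gnss_heading_deg` and the spherical-Earth assumption are choices made for this example.

```python
import math


def gnss_heading_deg(lat1, lon1, lat2, lon2):
    """Heading in degrees clockwise from north, from fix 1 to fix 2.

    Inputs are latitude/longitude in decimal degrees. Illustrative helper
    using the spherical initial-bearing formula; not the authors' code.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    east = math.sin(dlon) * math.cos(phi2)
    north = (math.cos(phi1) * math.sin(phi2)
             - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    # atan2(east, north) measures the angle from north, positive toward east.
    return math.degrees(math.atan2(east, north)) % 360.0
```

When the two fixes are very close together, position noise dominates the difference and the estimate becomes unstable, which fits the abstract's remark that GNSS-only heading is less reliable where satellite reception is degraded, such as near tall buildings.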

2411.09241 2026-06-02 cs.RO eess.SP Updated version

BlueME: Robust Underwater Robot-to-Robot Communication Using Compact Magnetoelectric Antennas

BlueME: 使用紧凑型磁电天线的鲁棒水下机器人间通信

Mehron Talebi, Sultan Mahmud, Adam Khalifa, Md Jahidul Islam

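Affiliations: Department of ECE, University of Florida, USA

AI summary: Proposes and validates BlueME, a compact underwater communication system based on magnetoelectric antennas that maintains reliable communication at distances beyond 700 meters while consuming 10 watts, overcoming challenges such as turbidity, obstacles, and multipath interference.

Details
AI Chinese abstract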

我们介绍了BlueME的设计、开发和实验验证,这是一种用于水下机器人间通信的紧凑型磁电(ME)天线阵列系统。BlueME采用在其自然机械共振频率下工作的ME天线,以高效地在水下传输和接收甚低频(VLF)电磁信号。我们概述了所提出系统在低功耗嵌入式平台上的设计、仿真、制造和集成,重点关注便携式和可扩展的应用。为了评估性能,我们在开放水域现场试验中将BlueME部署在自主水面航行器(ASV)和遥控潜水器(ROV)上。海洋试验表明,BlueME在仅消耗10瓦功率的情况下,可在超过700米的距离内保持可靠的信号传输。现场试验显示,该系统在具有挑战性的水下条件下(如浑浊度、障碍物和多径干扰)有效运行——这些条件通常会影响声学和光学。我们的分析还考察了完全浸没对系统性能的影响,并确定了关键的部署考虑因素。这项工作代表了ME天线在实验室外的首次实际水下部署,并实现了迄今为止最大的VLF ME阵列系统。BlueME展示了在多机器人协作系统和远程传感器网络中用于海洋机器人和自动化的巨大潜力。

English abstract

We present the design, development, and experimental validation of BlueME, a compact magnetoelectric (ME) antenna array system for underwater robot-to-robot communication. BlueME employs ME antennas operating at their natural mechanical resonance frequency to efficiently transmit and receive very-low-frequency (VLF) electromagnetic signals underwater. We outline the design, simulation, fabrication, and integration of the proposed system on low-power embedded platforms, focusing on portable and scalable applications. For performance evaluation, we deployed BlueME on an autonomous surface vehicle (ASV) and a remotely operated vehicle (ROV) in open-water field trials. Ocean trials demonstrate that BlueME maintains reliable signal transmission at distances beyond 700 meters while consuming only 10 watts of power. Field trials show that the system operates effectively in challenging underwater conditions such as turbidity, obstacles, and multipath interference -- conditions that generally affect acoustics and optics. Our analysis also examines the impact of complete submersion on system performance and identifies key deployment considerations. This work represents the first practical underwater deployment of ME antennas outside the laboratory and implements the largest VLF ME array system to date. BlueME demonstrates significant potential for marine robotics and automation in multi-robot cooperative systems and remote sensor networks.

2510.10676 2026-06-02 cs.AR cs.CL cs.RO eess.AS Updated version

Bhasha-Rupantarika: Algorithm-Hardware Co-design approach for Multilingual Neural Machine Translation

Bhasha-Rupantarika: 面向多语言神经机器翻译的算法-硬件协同设计方法

Mukul Lokhande, Tanushree Dewangan, Mohd Sharik Mansoori, Tejas Chaudhari, Akarsh J., Damayanti Lokhande, Adam Teman, Santosh Kumar Vishvakarma

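Affiliations: Special Manpower Development Program for Chip to Start-Up (SMDP-C2S); Ministry of Electronics and Information Technology (MeitY)

AI summary: Proposes Bhasha-Rupantarika, a lightweight and efficient multilingual translation system built through algorithm-hardware co-design; using sub-octet precision quantization (FP8/INT8/INT4/FP4) on FPGA, it reduces model size by 4.1x and speeds up inference by 4.2x, offering a feasible route to multilingual AI deployment in resource-constrained settings.

Details
Journal ref
International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 2026
AI Chinese abstract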

本文介绍了Bhasha-Rupantarika,一个通过算法-硬件协同设计为资源受限环境量身定制的轻量高效多语言翻译系统。该方法研究了亚字节精度级别(FP8、INT8、INT4和FP4)的模型部署,实验结果表明模型大小减少4.1倍(FP4),推理速度提升4.2倍,对应吞吐量提高至66 tokens/s(提升4.8倍)。这凸显了超低精度量化对于使用FPGA加速器的物联网设备实时部署的重要性,实现了与预期相当的性能。我们的评估涵盖了印度语言和国际语言之间的双向翻译,展示了其在低资源语言环境中的适应性。FPGA部署显示LUT减少1.96倍,FF减少1.65倍,与OPU相比吞吐量提升2.2倍,与HPTA相比提升4.6倍。总体而言,该评估提供了一种基于量化感知翻译且兼顾硬件效率的可行解决方案,适用于可部署的多语言AI系统。完整的代码[https://github.com/mukullokhande99/Bhasha-Rupantarika/]和可复现数据集已公开,便于研究人员快速集成和进一步开发。

English abstract

This paper introduces Bhasha-Rupantarika, a lightweight and efficient multilingual translation system tailored through algorithm-hardware co-design for resource-limited settings. The method investigates model deployment at sub-octet precision levels (FP8, INT8, INT4, and FP4), with experimental results indicating a 4.1x reduction in model size (FP4) and a 4.2x inference speedup, corresponding to a throughput of 66 tokens/s (a 4.8x improvement). This underscores the importance of ultra-low precision quantization for real-time deployment in IoT devices using FPGA accelerators, achieving performance on par with expectations. Our evaluation covers bidirectional translation between Indian and international languages, showcasing its adaptability in low-resource linguistic contexts. The FPGA deployment demonstrated a 1.96x reduction in LUTs and a 1.65x decrease in FFs, resulting in a 2.2x enhancement in throughput compared to OPU and a 4.6x enhancement compared to HPTA. Overall, the evaluation provides a viable solution based on quantization-aware translation along with hardware efficiency suitable for deployable multilingual AI systems. The complete code [https://github.com/mukullokhande99/Bhasha-Rupantarika/] and the dataset for reproducibility are publicly available, facilitating rapid integration and further development by researchers.

2510.04074 2026-06-02 cs.RO Updated version

Feedback Matters: Augmenting Autonomous Dissection with Visual and Topological Feedback

反馈至关重要:利用视觉和拓扑反馈增强自主解剖

Chung-Pang Wang, Changwei Chen, Xiao Liang, Soofiyan Atar, Florian Richter, Michael Yip

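Affiliations: Department of Electrical and Computer Engineering, University of California San Diego

AI summary: Proposes a feedback-driven framework for autonomous tissue dissection that reasons about topological changes from endoscopic images, quantifies visibility, and actively manipulates tissue; combined with planning-based and learning-based methods, it significantly improves surgical autonomy and robustness.

Details
AI Chinese abstract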

自主手术系统必须适应高度动态的环境,其中组织特性和视觉线索快速演变。这种适应性的核心是反馈:在执行过程中感知、解释和响应变化的能力。尽管反馈机制已在手术机器人中得到探索,包括工具和组织跟踪以及错误检测,但现有方法在处理组织解剖的拓扑和感知挑战方面仍然有限。在这项工作中,我们提出了一种用于自主组织解剖的反馈驱动框架,该框架在每次解剖动作后明确地从内窥镜图像中推理拓扑变化。这种结构化反馈指导后续动作,使系统能够定位解剖进展并在线调整策略。为了提高这种反馈的可靠性,我们引入了量化组织暴露的可见性指标,并制定了主动操控组织以最大化可见性的最优控制器设计。最后,我们将这些反馈机制与基于规划和基于学习的解剖方法相结合,并通过实验证明,它们在复杂手术场景中显著增强了自主性,减少了错误,并提高了鲁棒性。

English abstract

Autonomous surgical systems must adapt to highly dynamic environments where tissue properties and visual cues evolve rapidly. Central to such adaptability is feedback: the ability to sense, interpret, and respond to changes during execution. While feedback mechanisms have been explored in surgical robotics, ranging from tool and tissue tracking to error detection, existing methods remain limited in handling the topological and perceptual challenges of tissue dissection. In this work, we propose a feedback-enabled framework for autonomous tissue dissection that explicitly reasons about topological changes from endoscopic images after each dissection action. This structured feedback guides subsequent actions, enabling the system to localize dissection progress and adapt policies online. To improve the reliability of such feedback, we introduce visibility metrics that quantify tissue exposure and formulate optimal controller designs that actively manipulate tissue to maximize visibility. Finally, we integrate these feedback mechanisms with both planning-based and learning-based dissection methods, and demonstrate experimentally that they significantly enhance autonomy, reduce errors, and improve robustness in complex surgical scenarios.

2509.20070 2026-06-02 cs.RO Updated version

LLM Trainer: Automated Robotic Data Generation via Demonstration Augmentation using LLMs

LLM Trainer:利用大语言模型通过演示增强自动生成机器人数据

Abraham George, Amir Barati Farimani

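Affiliations: Carnegie Mellon University

AI summary: Proposes LLM Trainer, a pipeline that uses the world knowledge of large language models to automatically expand a small number of human demonstrations into a large robot dataset, generating new trajectories through offline annotation and online keypose retargeting and optimizing annotations with Thompson sampling.

Comments 9 pages, 5 figures, 4 tables. Accepted in ICRA 2026

Details
AI Chinese abstract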

我们提出LLM Trainer,一个全自动管道,利用大语言模型(LLM)的世界知识,将少量人类演示(少至一个)转化为用于模仿学习的大型机器人数据集。我们的方法将演示生成分解为两个步骤:(1)离线演示标注,提取关键帧、显著物体和姿态-物体关系;(2)在线关键姿势重定向,在给定初始观察的情况下将这些关键帧适应到新场景。使用这些修改后的关键点,我们的系统对原始演示进行扭曲以生成新轨迹,然后执行该轨迹,如果成功,则保存生成的演示。由于标注可跨场景重用,我们使用汤普森采样优化标注,显著提高了生成成功率。我们在多种任务上评估了我们的方法,发现我们的数据标注方法始终优于专家设计的基线。我们进一步展示了一种集成策略,将优化的LLM前馈计划与学习到的反馈模仿学习控制器相结合。最后,我们在Franka Emika Panda机器人上演示了硬件可行性。更多材料和演示视频,请参见项目网站:https://sites.google.com/andrew.cmu.edu/llm-trainer

English abstract

We present LLM Trainer, a fully automated pipeline that leverages the world knowledge of Large Language Models (LLMs) to transform a small number of human demonstrations (as few as one) into a large robot dataset for imitation learning. Our approach decomposes demonstration generation into two steps: (1) offline demonstration annotation that extracts keyframes, salient objects, and pose-object relations; and (2) online keypose retargeting that adapts those keyframes to a new scene, given an initial observation. Using these modified keypoints, our system warps the original demonstration to generate a new trajectory, which is then executed, and the resulting demo, if successful, is saved. Because the annotation is reusable across scenes, we use Thompson sampling to optimize the annotation, significantly improving generation success rate. We evaluate our method on a range of tasks, and find that our data annotation method consistently outperforms expert-engineered baselines. We further show an ensemble policy that combines the optimized LLM feed-forward plan with a learned feedback imitation learning controller. Finally, we demonstrate hardware feasibility on a Franka Emika Panda robot. For additional materials and demonstration videos, please see the project website: https://sites.google.com/andrew.cmu.edu/llm-trainer
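The idea of using Thompson sampling to pick among reusable annotations can be sketched as a Beta-Bernoulli bandit, where each candidate annotation is an arm and each generation attempt is a success or a failure. This is a generic textbook form under assumed details: the class name `ThompsonSampler`, the uniform Beta(1, 1) prior, and the binary reward are illustrative choices, not the paper's implementation.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of candidates.

    Hypothetical sketch: each arm stands for one candidate annotation, and a
    reward of True means a generated demonstration succeeded.
    """

    def __init__(self, n_arms, rng=None):
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms
        self.rng = rng or random.Random()

    def select(self):
        # Draw one plausible success rate per arm from its Beta posterior,
        # then pick the arm whose draw is highest.
        draws = [self.rng.betavariate(s + 1, f + 1)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, success):
        # Record the outcome of one generation attempt for the chosen arm.
        if success:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```

In use, the loop alternates `select()`, one generation attempt with the chosen annotation, and `update()` with the outcome. Annotations that keep producing successful demonstrations are sampled more often over time, while weaker ones still receive occasional trials.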

2509.19246 2026-06-02 cs.RO cs.MA cs.SY eess.SY version update

Proactive-reactive detection and mitigation of intermittent faults in robot swarms

Sinan Oğuz, Emanuele Garone, Marco Dorigo, Mary Katherine Heinrich

Affiliations * IRIDIA Université Libre de Bruxelles; SAAS

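AI summary Targets intermittent faults in robot swarms, which are hard to detect because network topologies are transient, and proposes a proactive-reactive strategy based on self-organized backup layers and distributed consensus in a multiplex network, achieving high detection accuracy and a low false-positive rate.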

Details
Abstract

Intermittent faults are transient errors that sporadically appear and disappear. Although intermittent faults pose substantial challenges to reliability and coordination, existing studies of fault tolerance in robot swarms focus instead on permanent faults. One reason for this is that intermittent faults are prohibitively difficult to detect in the fully self-organized ad-hoc networks typical of robot swarms, as their network topologies are transient and often unpredictable. However, in the recently introduced self-organizing nervous systems (SoNS) approach, robot swarms are able to self-organize persistent network structures for the first time, easing the problem of detecting intermittent faults. To address intermittent faults in robot swarms that have persistent networks, we propose a novel proactive-reactive strategy for detection and mitigation, based on self-organized backup layers and distributed consensus in a multiplex network. Proactively, the robots self-organize dynamic backup paths before faults occur, adapting to changes in the primary network topology and the robots' relative positions. Reactively, robots use one-shot likelihood ratio tests to compare information received along different paths in the multiplex network, enabling early fault detection. Upon detection, communication is temporarily rerouted in a self-organized way, until the detected fault resolves. We validate the approach in representative scenarios of faulty positional data occurring during formation control, demonstrating that intermittent faults are prevented from disrupting convergence to desired formations, with high fault detection accuracy and low rates of false positives.
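A one-shot likelihood ratio test of the kind mentioned above can be sketched for a single scalar reading that arrives over two paths. The sketch assumes that the difference between the two copies is zero-mean Gaussian, with a small standard deviation when both paths are healthy and a large one when one path is faulty. The noise levels, the threshold, and the function names are illustrative assumptions, not values from the paper.

```python
import math


def gaussian_logpdf(x, sigma):
    """Log density of a zero-mean Gaussian with standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - x ** 2 / (2 * sigma ** 2)


def log_likelihood_ratio(primary, backup, sigma_ok=0.05, sigma_fault=1.0):
    """log p(d | fault) - log p(d | no fault) for d = primary - backup."""
    d = primary - backup
    return gaussian_logpdf(d, sigma_fault) - gaussian_logpdf(d, sigma_ok)


def is_faulty(primary, backup, threshold=0.0, **noise):
    """Flag a fault from one pair of readings (hence 'one-shot')."""
    return log_likelihood_ratio(primary, backup, **noise) > threshold


if __name__ == "__main__":
    print(is_faulty(1.00, 1.02))  # small disagreement between paths
    print(is_faulty(1.00, 2.00))  # large disagreement between paths
```

Raising `threshold` trades missed detections for fewer false positives, which is the usual tuning knob in this kind of test.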

2509.14143 2026-06-02 cs.RO version update

CLAW: A Vision-Language-Action Framework for Weight-Aware Robotic Grasping

Zijian An, Ran Yang, Yiming Feng, Lifeng Zhou

Affiliations * Department of Electrical and Computer Engineering, Drexel University

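AI summary Proposes the CLAW framework, which decouples condition evaluation from action generation: a fine-tuned CLIP model monitors weight thresholds and generates prompts for a flow-based VLA policy, enabling weight-aware robotic grasping that outperforms baseline models in single-object and mixed-object experiments.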

Comments 8 pages, 5 figures, Video: https://youtu.be/MuMYj2QgReI

Details
Abstract

Vision-language-action (VLA) models have recently emerged as a promising paradigm for robotic control, enabling end-to-end policies that ground natural language instructions into visuomotor actions. However, current VLAs often struggle to satisfy precise task constraints, such as stopping based on numeric thresholds, since their observation-to-action mappings are implicitly shaped by training data and lack explicit mechanisms for condition monitoring. In this work, we propose CLAW (CLIP-Language-Action for Weight), a framework that decouples condition evaluation from action generation. CLAW leverages a fine-tuned CLIP model as a lightweight prompt generator, which continuously monitors the digital readout of a scale and produces discrete directives based on task-specific weight thresholds. These prompts are then consumed by $π_0$, a flow-based VLA policy, which integrates the prompts with multi-view camera observations to produce continuous robot actions. This design enables CLAW to combine symbolic weight reasoning with high-frequency visuomotor control. We validate CLAW on three experimental setups: single-object grasping and mixed-object tasks requiring dual-arm manipulation. Across all conditions, CLAW reliably executes weight-aware behaviors and outperforms both raw-$π_0$ and fine-tuned $π_0$ models. A video accompanying the paper is available online at https://youtu.be/MuMYj2QgReI.
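The decoupling described above amounts to a small symbolic rule that sits in front of the policy. The sketch below shows only that rule and assumes the scale reading is already available as a number; in CLAW the reading comes from a fine-tuned CLIP model looking at the scale display. The directive strings, the tolerance, and the function name `weight_directive` are hypothetical.

```python
def weight_directive(reading_g, target_g, tolerance_g=2.0):
    """Map a scale reading (grams) to a discrete prompt for the policy."""
    if reading_g < target_g - tolerance_g:
        return "add more"
    if reading_g > target_g + tolerance_g:
        return "remove some"
    return "stop"


if __name__ == "__main__":
    # Hypothetical readings while filling toward a 100 g target.
    for reading in (40.0, 95.0, 99.0, 104.0):
        print(reading, "->", weight_directive(reading, target_g=100.0))
```

Keeping the threshold logic outside the policy means the numeric condition can be changed per task without retraining the action model.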

2509.14126 2026-06-02 cs.RO cs.MA version update

CrazyMARL: Decentralized Direct Motor Control Policies for Cooperative Aerial Transport of Cable-Suspended Payloads

Viktor Lorentz, Khaled Wahba, Sayantan Auddy, Marc Toussaint, Wolfgang Hönig

Affiliations * Technische Universität Berlin; Robotics Institute Germany (RIG)

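AI summary Proposes CrazyMARL, a decentralized reinforcement learning framework for multi-UAV transport of cable-suspended payloads that outperforms classical controllers in disturbance rejection and tracking precision and achieves zero-shot sim-to-real transfer.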

Comments International Conference on Robotics and Automation (ICRA), 2026

Details
Abstract

Collaborative transportation of cable-suspended payloads by teams of UAVs has the potential to enhance payload capacity, adapt to different payload shapes, and provide built-in compliance, making it attractive for applications ranging from disaster relief to precision logistics. However, multi-UAV coordination under disturbances, nonlinear payload dynamics, and slack-taut cable modes remains a challenging control problem. To our knowledge, no prior work has addressed these cable mode transitions in the multi-UAV context, instead relying on simplifying rigid-link assumptions. We propose CrazyMARL, a decentralized RL framework for multi-UAV cable-suspended payload transport. Simulation results demonstrate that the learned policies can outperform classical decentralized controllers in terms of disturbance rejection and tracking precision, achieving an 80% recovery rate from harsh conditions compared to 44% for the baseline method. We also achieve successful zero-shot sim-to-real transfer and demonstrate that our policies are highly robust under harsh conditions, including wind, random external disturbances, and transitions between slack and taut cable dynamics. This work paves the way for autonomous, resilient UAV teams capable of executing complex payload missions in unstructured environments. Code and videos can be found on the website: https://imrclab.github.io/CrazyMARL.
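The slack-taut distinction is what a rigid-link model cannot express: a cable pulls only when stretched and exerts no force otherwise. The sketch below uses a common unilateral-spring approximation to show the mode switch. The stiffness value, the function name, and the spring model itself are assumptions for illustration; the abstract does not state which cable model the simulator uses.

```python
import math


def cable_force_on_payload(uav_pos, payload_pos, rest_length, stiffness):
    """Force the cable applies to the payload, as an (x, y, z) tuple.

    The cable acts like a spring when taut and applies nothing when slack.
    """
    delta = [u - p for u, p in zip(uav_pos, payload_pos)]
    dist = math.sqrt(sum(c * c for c in delta))
    stretch = dist - rest_length
    if stretch <= 0.0 or dist == 0.0:
        return (0.0, 0.0, 0.0)  # slack mode: no tension
    tension = stiffness * stretch  # taut mode: pull toward the UAV
    return tuple(tension * c / dist for c in delta)


if __name__ == "__main__":
    print(cable_force_on_payload((0, 0, 1.1), (0, 0, 0), 1.0, 100.0))  # taut
    print(cable_force_on_payload((0, 0, 0.5), (0, 0, 0), 1.0, 100.0))  # slack
```

The force is discontinuous in its derivative at the slack-taut boundary, which is one reason these transitions are hard for model-based controllers.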

2508.19191 2026-06-02 cs.RO version update

RCM-ACT: Imitation Learning with Dynamic RCM Calibration for Autonomous Intraocular Foreign Body Removal

Yue Wang, Wenjie Deng, Haotian Xue, Di Cui, Yiqi Chen, Mingchuan Zhou, Haochao Ying, Jian Wu

Affiliations * College of Computer Science and Technology, Zhejiang University; State Key Laboratory of Transvascular Implantation Devices of the Second Affiliated Hospital, Zhejiang University School of Medicine; Dessight Biomedical; Center for Rehabilitation Medicine, Department of Ophthalmology, Zhejiang Provincial People’s Hospital; School of Biosystems Engineering and Food Science, Zhejiang University; School of Public Health and Second Affiliated Hospital, Zhejiang University School of Medicine; State Key Laboratory of Transvascular Implantation Devices of the Second Affiliated Hospital and School of Public Health, Zhejiang University School of Medicine; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence

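AI summary Proposes RCM-ACT, an imitation learning framework that uses dynamic RCM calibration and action chunking transformers to handle kinematic uncertainty in intraocular surgery, achieving autonomous ring grasping and positioning with a mean 3-D grasp deviation of 0.686 mm.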

Details
Abstract

Intraocular foreign body removal demands millimeter-level precision in confined intraocular spaces, yet existing robotic systems predominantly rely on manual teleoperation with steep learning curves. To address the challenges of autonomous manipulation, particularly kinematic uncertainties from variable motion scaling and Remote Center of Motion (RCM) point variation, we propose RCM-ACT, an imitation learning framework for autonomous intraocular foreign body ring manipulation. Our approach integrates RCM dynamic calibration to resolve coordinate system inconsistencies caused by intraocular instrument variation and introduces the RCM-ACT architecture, which combines action chunking transformers with episode-level kinematic realignment. Trained solely on stereo visual data and instrument kinematics from expert demonstrations in an artificial eye model, RCM-ACT successfully completes ring grasping and positioning tasks without explicit depth sensing. Experimental validation demonstrates the successful implementation of end-to-end autonomy under uncalibrated microscopy conditions, achieving a mean 3-D Euclidean grasp deviation of 0.686 mm and 11/20 full-task successes. The results provide a viable framework for developing intelligent eye surgical systems capable of complex intraocular procedures.

2406.09953 2026-06-02 cs.RO cs.AI version update

DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning

Zeyu Gao, Yao Mu, Jinye Qu, Mengkang Hu, Shijia Peng, Chengkai Hou, Lingyue Guo, Ping Luo, Shanghang Zhang, Yanfeng Lu

Affiliations * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS); School of Computer Science, Shanghai Jiao Tong University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Department of Computer Science, The University of Hong Kong; OpenGVLab, Shanghai AI Laboratory

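AI summary Proposes the DAG-Plan framework, which for the first time uses a directed acyclic graph as the central representation for dual-arm coordination; a single LLM parse produces a structured DAG that supports adaptive parallel execution, yielding on a dual-arm kitchen benchmark a 48% higher success rate than single-query linear-sequence methods and 84.1% higher execution efficiency than iterative-querying methods.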

Comments ICRA 2026

Details
Abstract

Dual-arm robots promise greater efficiency but require planning for complex tasks with nonlinear sub-task dependencies. Current methods using Large Language Models (LLMs) suffer from a fundamental trade-off: generating linear sequences is efficient but fails to model parallelism and adapt to changes, while iterative querying is adaptive but too slow and costly. To bridge this gap, we introduce DAG-Plan, a novel task planning framework that for the first time employs a Directed Acyclic Graph (DAG) as the central representation for dual-arm coordination. The key insight is that a DAG natively captures complex sub-task dependencies and explicitly reveals opportunities for parallel execution. Within this framework, an LLM is used only once as a powerful semantic parser to translate a natural language instruction into a structured DAG. During execution, our system dynamically assigns candidate nodes to the suitable arm based on real-time environmental observations, enabling truly adaptive and parallel operation. Extensive evaluation on a dual-arm kitchen benchmark shows that DAG-Plan's structured approach fundamentally outperforms existing paradigms. It achieves a 48% higher success rate than single-query linear sequence methods with dual arm by robustly managing dependencies, and an 84.1% higher execution efficiency than iterative querying methods by eliminating the latency of repeated LLM calls. Our work demonstrates that a principled, graph-based representation is the key to unlocking efficient and reliable LLM-based planning for complex robotic systems. More demos and code are available on https://sites.google.com/view/dag-plan.
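To see why a DAG exposes parallelism that a linear sequence hides, consider a scheduler that repeatedly collects the sub-tasks whose prerequisites are all finished and hands them to the available arms. The sketch below is a simplified illustration: it assigns ready nodes to arms in a fixed order, whereas the paper assigns them from real-time observations. The task names and the function `schedule_dual_arm` are hypothetical.

```python
def schedule_dual_arm(deps, arms=("left", "right")):
    """Group sub-tasks into rounds that respect the dependency graph.

    deps maps each sub-task to the set of sub-tasks it depends on.
    Returns a list of rounds; each round maps an arm to one sub-task.
    """
    remaining = {task: set(pre) for task, pre in deps.items()}
    done = set()
    rounds = []
    while remaining:
        # A node is ready once every prerequisite has been completed.
        ready = sorted(t for t, pre in remaining.items() if pre <= done)
        if not ready:
            raise ValueError("dependency cycle: graph is not a DAG")
        batch = ready[:len(arms)]
        rounds.append(dict(zip(arms, batch)))
        for task in batch:
            done.add(task)
            del remaining[task]
    return rounds


if __name__ == "__main__":
    deps = {
        "open_fridge": set(),
        "grab_cup": set(),
        "take_milk": {"open_fridge"},
        "pour_milk": {"take_milk", "grab_cup"},
    }
    for i, assignment in enumerate(schedule_dual_arm(deps), start=1):
        print(i, assignment)
```

In this example the four sub-tasks finish in three rounds because `grab_cup` and `open_fridge` have no mutual dependency and can run on different arms at the same time.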

2503.15371 2026-06-02 cs.RO cs.LG version update

GIFT: Geometry-Induced Functional Transfer for Category-level Object Manipulation

Cristiana de Farias, Luis Figueredo, Riddhiman Laha, Maxime Adjigble, Brahim Tamadazte, Rustam Stolkin, Sami Haddadin, Naresh Marturi

Affiliations * Extreme Robotics Laboratory, School of Metallurgy and Materials, University of Birmingham; Munich Institute of Robotics & Machine Intelligence, Technische Universität München (TUM); School of Computer Science, University of Nottingham; Sorbonne Université, ISIR, Paris, France

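AI summary Proposes the GIFT framework, which uses functional maps and screw interpolation to transfer complex object manipulation skills from a single human demonstration to new objects without additional training.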

Comments 8 pages, 6 figures. ICRA 2026

Details
Abstract

Robotic manipulation of unfamiliar objects in new environments is challenging due to limited generalisation capabilities. We propose a new skill transfer framework, GIFT (Geometry-Induced Functional Transfer), which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration. Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions. By leveraging the Functional Maps (FMC) framework, we efficiently map interaction functions between objects and their environments, allowing the robot to replicate task operations across objects of similar topologies or categories, even when they have significantly different shapes. Additionally, our method incorporates screw interpolation (ScLERP) for generating smooth, geometrically-aware robot paths to ensure the transferred skills adhere to the demonstrated task constraints. We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.
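Screw interpolation moves between two poses along a single constant screw motion, so rotation and translation stay coupled instead of being blended separately. The sketch below is a planar analogue rather than the dual-quaternion ScLERP used for full 3-D poses: in the plane, a rigid displacement with nonzero rotation is a rotation about a fixed pole, and interpolating the angle about that pole gives the constant-twist path. The pose format `(x, y, theta)` and the function name are assumptions for illustration.

```python
import math


def planar_screw_interp(pose_a, pose_b, t):
    """Interpolate from pose_a to pose_b (x, y, theta) with a constant twist."""
    xa, ya, tha = pose_a
    xb, yb, thb = pose_b
    # Shortest signed rotation from A to B.
    dth = math.atan2(math.sin(thb - tha), math.cos(thb - tha))
    if abs(dth) < 1e-9:
        # Pure translation: the screw motion reduces to a straight line.
        return (xa + t * (xb - xa), ya + t * (yb - ya), tha)
    c, s = math.cos(dth), math.sin(dth)
    # Translation part d of the world-frame displacement D = B * A^-1.
    dx = xb - (c * xa - s * ya)
    dy = yb - (s * xa + c * ya)
    # Pole of rotation: the fixed point of D, i.e. (I - R)^-1 d.
    det = 2.0 * (1.0 - c)
    px = ((1.0 - c) * dx - s * dy) / det
    py = (s * dx + (1.0 - c) * dy) / det
    # Rotate A's position about the pole by the fraction t of the angle.
    ct, st = math.cos(t * dth), math.sin(t * dth)
    rx, ry = xa - px, ya - py
    return (px + ct * rx - st * ry, py + st * rx + ct * ry, tha + t * dth)


if __name__ == "__main__":
    start, goal = (1.0, 0.0, 0.0), (0.0, 1.0, math.pi / 2)
    for t in (0.0, 0.5, 1.0):
        print(t, planar_screw_interp(start, goal, t))
```

For the example poses the path is a quarter circle about the origin, whereas interpolating position and heading independently would cut straight across the chord.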

2412.12036 2026-06-02 cs.LG cs.RO version update

LeARN: Learnable and Adaptive Representations for Nonlinear Dynamics in System Identification

Arunabh Singh, Joyjit Mukherjee

Affiliations * Visual Computing Lab, Indian Institute of Science; Department of Electrical and Electronics Engineering, BITS Pilani Hyderabad Campus

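AI summary Proposes the LeARN framework, which uses meta-learning to learn the library of basis functions directly from data without domain knowledge, enabling adaptive identification of nonlinear dynamics and achieving dynamical error performance comparable to SINDy on the Neural Fly dataset.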

Comments This work has been accepted at the 34th Mediterranean Conference on Control and Automation (MED 2026)

Details
Abstract

System identification, the process of deriving mathematical models of dynamical systems from observed input-output data, has undergone a paradigm shift with the advent of learning-based methods. Addressing the intricate challenges of data-driven discovery in nonlinear dynamical systems, these methods have garnered significant attention. Among them, Sparse Identification of Nonlinear Dynamics (SINDy) has emerged as a transformative approach, distilling complex dynamical behaviors into interpretable linear combinations of basis functions. However, SINDy's reliance on domain-specific expertise to construct its foundational 'library' of basis functions limits its adaptability and universality. In this work, we introduce a nonlinear system identification framework, LeARN, that transcends the need for prior domain knowledge by learning the library of basis functions directly from data. To enhance adaptability to evolving system dynamics under varying noise conditions, we employ a novel meta-learning-based system identification approach that utilizes a light-weight Deep Neural Network (DNN) to dynamically refine these basis functions. This not only captures intricate system behaviors but also adapts effectively to new dynamical regimes. We validate our framework on the Neural Fly dataset, showcasing its robust adaptation and generalization capabilities. Despite its simplicity, LeARN achieves dynamical error performance competitive with SINDy. This work presents a step towards autonomous discovery of dynamical systems, paving the way for a future where machine learning uncovers the governing principles of complex systems without requiring extensive domain-specific interventions.
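The SINDy idea that LeARN builds on can be shown with a hand-picked library, which is exactly the ingredient LeARN proposes to learn instead. The sketch below fits a derivative as a linear combination of two fixed basis functions, `x` and `x**2`, then zeroes small coefficients and refits, a one-pass version of sequentially thresholded least squares. The library, the threshold, the sample data, and the function name are illustrative assumptions and are not taken from the paper.

```python
def fit_sparse_two_term(xs, dxs, threshold=0.1):
    """Fit dx/dt ~ a*x + b*x**2, then prune coefficients below threshold."""
    f1 = list(xs)
    f2 = [x * x for x in xs]
    # Normal equations for the 2-column least-squares problem.
    s11 = sum(u * u for u in f1)
    s12 = sum(u * v for u, v in zip(f1, f2))
    s22 = sum(v * v for v in f2)
    r1 = sum(u * y for u, y in zip(f1, dxs))
    r2 = sum(v * y for v, y in zip(f2, dxs))
    det = s11 * s22 - s12 * s12
    a = (r1 * s22 - r2 * s12) / det
    b = (s11 * r2 - s12 * r1) / det
    keep_a, keep_b = abs(a) >= threshold, abs(b) >= threshold
    if keep_a and not keep_b:
        return r1 / s11, 0.0  # refit using only the x term
    if keep_b and not keep_a:
        return 0.0, r2 / s22  # refit using only the x**2 term
    if not keep_a and not keep_b:
        return 0.0, 0.0
    return a, b


if __name__ == "__main__":
    xs = [0.5, 1.0, 1.5, 2.0]
    dxs = [-2.0 * x for x in xs]  # synthetic data from dx/dt = -2x
    print(fit_sparse_two_term(xs, dxs))
```

The result is interpretable because each surviving coefficient names a term in the governing equation, but the method can only recover dynamics that the chosen library is able to express.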

2407.12014 2026-06-02 cs.HC cs.CY cs.RO version update

Surprising Performances of Students with Autism in Classroom with NAO Robot

Qin Yang, Huan Lu, Dandan Liang, Shengrong Gong, Huanghao Feng

Affiliations * School of Computer and Information Technology; Northeast Petroleum University; School of Computer Science and Engineering; Changshu Institute of Technology; Changshu Special Education School; School of Chinese Language and Literature; Nanjing Normal University

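AI summary Through a NAO robot-assisted group classroom experiment, this study finds that students with autism spectrum disorder show higher engagement and fewer stereotyped behaviors in the robot classroom, suggesting that social robots can improve classroom focus and may improve educational performance.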

Details
Journal ref
Frontiers of Digital Education 3(2), 2024
Abstract

Autism is a developmental disorder that manifests in early childhood and persists throughout life, profoundly affecting social behavior and hindering the acquisition of learning and social skills in those diagnosed. As technological advancements progress, an increasing array of technologies is being utilized to support the education of students with Autism Spectrum Disorder (ASD), aiming to improve their educational outcomes and social capabilities. Numerous studies on autism intervention have highlighted the effectiveness of social robots in behavioral treatments. However, research on the integration of social robots into classroom settings for children with autism remains sparse. This paper describes the design and implementation of a group experiment in a collective classroom setting mediated by the NAO robot. The experiment involved special education teachers and the NAO robot collaboratively conducting classroom activities, aiming to foster a dynamic learning environment through interactions among teachers, the robot, and students. Conducted in a special education school, this experiment served as a foundational study in anticipation of extended robot-assisted classroom sessions. Data from the experiment suggest that ASD students in classrooms equipped with the NAO robot exhibited notably better performance compared to those in regular classrooms. The humanoid features and body language of the NAO robot captivated the students' attention, particularly during talent shows and command tasks, where students demonstrated heightened engagement and a decrease in stereotypical repetitive behaviors and irrelevant minor movements commonly observed in regular settings. Our preliminary findings indicate that the NAO robot significantly enhances focus and classroom engagement among students with ASD, potentially improving educational performance and fostering better social behaviors.

2307.06647 2026-06-02 cs.RO cs.AI cs.CV version update

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

Oskar Natan, Jun Miura

Affiliations * Department of Computer Science and Electronics, Universitas Gadjah Mada; Department of Computer Science and Engineering, Toyohashi University of Technology

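AI summary Proposes DeepIPCv2, an end-to-end autonomous driving framework that builds robust scene representations from LiDAR point cloud segmentation and multi-view projection, then combines gated recurrent units, command-specific multi-layer perceptrons, and PID controllers to jointly estimate waypoints and navigational control commands, achieving the lowest total metric error and the fewest driving interventions under varying illumination.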

Comments This work has been accepted for publication in IEEE Access. https://ieeexplore.ieee.org/document/11313052

Details
Abstract

We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific control learning. Unlike prior camera-reliant models, DeepIPCv2 employs point cloud segmentation and multi-view projection to construct robust scene representations. These features are fused and decoded through a combination of gated recurrent units, command-specific multi-layer perceptrons, and PID controllers to estimate both waypoints and navigational control commands. This design enhances maneuverability and addresses action imbalance in driving datasets. To validate the model, we constructed a dataset covering diverse illumination conditions and conducted ablation studies and comparative tests against recent methods, including TransFuser. Results demonstrate that DeepIPCv2 achieves the lowest total metric error and the fewest driving interventions, highlighting both its robustness to illumination changes and its improved control accuracy. We plan to release the code at https://github.com/oskarnatan/DeepIPCv2 to support reproducibility and future advancements in end-to-end autonomous driving research.
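The PID controllers mentioned above turn a tracking error, such as the heading or speed error relative to a predicted waypoint, into a control command. The sketch below is a textbook discrete PID step, not the DeepIPCv2 controller; the gains, the time step, and the class name `PID` are assumptions for illustration.

```python
class PID:
    """Textbook discrete PID controller with a fixed set of gains."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        """Return the control output for the current error sample."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no history yet on the first call
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


if __name__ == "__main__":
    # Hypothetical heading errors (radians) sampled every 0.1 s.
    steering = PID(kp=1.0, ki=0.5, kd=0.1)
    for err in (2.0, 1.0, 0.5):
        print(steering.step(err, dt=0.1))
```

The proportional term reacts to the current error, the integral term removes steady offset, and the derivative term damps the response as the error shrinks.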