arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.18729 2026-05-19 cs.RO cs.CV version update

Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction

Nga Teng Chan, Yi Zhang, Yechi Liu, Renwen Cui, Fanhu Zeng, Zeyuan Ding, Xiancong Ren, Zhang Zhang, Qifeng Chen, Jian Liu, Yong Dai, Xiaozhu Ju

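Affiliations: The Hong Kong University of Science and Technology; X-Humanoid; Institute of Automation, CAS; Beijing University of Aeronautics and Astronautics

AI summary: This paper proposes Robo-Cortex, a framework that uses dual-grain cognitive memory and autonomous knowledge induction to let robots induce navigation heuristics on their own and refine their cognitive strategies, enabling autonomous navigation and exploration in complex environments.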

English abstract

The ability to navigate and interact with complex environments is central to real-world embodied agents, yet navigation in unseen environments remains challenging due to "experiential amnesia," where existing trajectory-driven or reactive policies fail to synthesize generalizable strategies from past interactions. We propose Robo-Cortex, a self-evolving framework that enables robots to autonomously induce navigation heuristics and refine cognitive strategies through a continuous reflection-adaptation loop. By abstracting success patterns and failure pitfalls into natural-language heuristics, Robo-Cortex enables a transition from passive execution to active strategy evolution. Our core innovation is an Autonomous Knowledge Induction (AKI) mechanism that distills multimodal trajectories into a structured Navigation Heuristic Library for knowledge generalization. The architecture further incorporates a Dual-Grain Cognitive Memory system, comprising a Short-term Reflective Memory (SRM) for real-time local progress analysis, and a Long-term Principle Memory (LPM) that abstracts past trajectories into reusable guiding and cautionary principles. To ensure robust decision-making, we introduce a multimodal Imagine-then-Verify loop, where a world model simulates potential outcomes and a VLM-based evaluator validates action plans. Extensive evaluations on IGNav, AR, and AEQA show that Robo-Cortex consistently outperforms strong baselines in both task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real-world robotic experiments further support the effectiveness of Robo-Cortex in physical settings.
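
The SPL gains quoted above are easier to read with the metric written out. The abstract does not define SPL; the sketch below assumes the standard Success weighted by Path Length definition from the embodied-navigation literature, in which each episode contributes success × shortest / max(actual, shortest). The episode values are made up for illustration and are not results from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length.

    Each episode is (success, shortest_path_length, actual_path_length).
    """
    total = 0.0
    for success, shortest, actual in episodes:
        if success:
            # A successful episode scores 1.0 only if the agent took the shortest path.
            total += shortest / max(actual, shortest)
    return total / len(episodes)


# Hypothetical episodes, for illustration only.
episodes = [
    (True, 10.0, 10.0),   # optimal path: contributes 1.0
    (True, 10.0, 20.0),   # twice the shortest path: contributes 0.5
    (False, 10.0, 5.0),   # failure: contributes 0.0
]
print(spl(episodes))  # 0.5
```

Under this definition a method can raise SPL either by succeeding more often or by taking shorter paths when it does succeed, which is why the abstract reports it alongside task success.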

2605.18727 2026-05-19 cs.RO cs.AI version update

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

Feng Chen, Tianzhe Chu, Li Sun, Pei Zhou, Zhuxiu Xu, Shenghua Gao, Yuexiang Zhai, Yanchao Yang, Yi Ma

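Affiliations: ShadowHand

AI summary: This paper presents DexHoldem, a real-world, system-level benchmark built on the ShadowHand for evaluating dexterous manipulation in Texas Hold'em. Using 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, the study tests agents' capabilities in perception, execution, and decision routing.

Comments: 30 pages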

English abstract

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, $π_{0.5}$ obtains the highest task completion rate ($61.2\%$), while $π_{0.5}$ and $π_0$ tie on scene-preserving success rate ($47.5\%$). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy ($34.3\%$), while GPT 5.5 obtains the best average field-wise accuracy ($66.8\%$), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.
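
The gap between strict problem-level accuracy (34.3%) and average field-wise accuracy (66.8%) follows from how the two are aggregated. The abstract does not spell out either metric, so the sketch below is one plausible reading: strict accuracy requires every field of a game state to be right, while field-wise accuracy averages per-field hit rates. The field names and values are hypothetical.

```python
def strict_accuracy(predictions, targets):
    """Fraction of problems where every field matches the target."""
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


def fieldwise_accuracy(predictions, targets):
    """Mean over fields of the per-field match rate."""
    fields = list(targets[0].keys())
    per_field = []
    for field in fields:
        hits = sum(1 for p, t in zip(predictions, targets) if p.get(field) == t[field])
        per_field.append(hits / len(targets))
    return sum(per_field) / len(per_field)


# Hypothetical structured game states (field names are assumptions).
targets = [
    {"pot": 120, "to_act": "player_2", "board": ("Ah", "Kd", "7c")},
    {"pot": 300, "to_act": "player_1", "board": ("2s", "2h", "9d")},
]
predictions = [
    {"pot": 120, "to_act": "player_2", "board": ("Ah", "Kd", "7c")},
    {"pot": 300, "to_act": "player_1", "board": ("2s", "2h", "9c")},  # one card misread
]

print(strict_accuracy(predictions, targets))     # 0.5
print(fieldwise_accuracy(predictions, targets))  # about 0.83
```

A single misread card fails the whole problem under the strict metric but costs only a fraction of one field under the field-wise metric, so a model can look strong on isolated sub-capabilities while rarely recovering a complete state.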

2605.18722 2026-05-19 cs.RO version update

Dexora: Open-source VLA for High-DoF Bimanual Dexterity

Zongzheng Zhang, Jingrui Pang, Zhuo Yang, Kun Li, Minwen Liao, Saining Zhang, Guoxuan Chi, Jinbang Guo, Huan-ang Gao, Modi Shi, Dongyun Ge, Yao Mu, Jiayuan Gu, Rui Chen, Hao Dong, Huazhe Xu, Li Yi, Yixin Zhu, Hang Zhao, Pengwei Wang, Shanghang Zhang, Guocai Yao, Jianyu Chen, Hongyang Li, Hao Zhao

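Affiliations: Tsinghua University; Beijing Academy of Artificial Intelligence; The University of Hong Kong; Shanghai Jiao Tong University; ShanghaiTech University; Peking University

AI summary: This paper presents Dexora, the first open-source VLA system targeting dual-arm, dual-hand high-DoF manipulation. A hybrid teleoperation pipeline decouples gross arm motion from fine finger motion and drives both a physical platform and a digital twin, which is used to build a large-scale training dataset. The paper also proposes a data-quality-aware training method, and experiments show strong performance on both basic and dexterous tasks.

Comments: Accepted by ICRA 2026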

English abstract

Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity.
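
The data-quality-aware recipe amounts to a weighted training objective. The sketch below is a minimal illustration, assuming each clip already has a scalar policy loss and a discriminator-assigned weight; the normalization by total weight and all numbers are assumptions, not details from the paper.

```python
def weighted_policy_loss(clip_losses, clip_weights):
    """Weighted mean of per-clip losses; low weights shrink a clip's influence."""
    if len(clip_losses) != len(clip_weights):
        raise ValueError("one weight is required per clip")
    total_weight = sum(clip_weights)
    if total_weight <= 0:
        raise ValueError("at least one clip must have positive weight")
    weighted_sum = sum(w * l for w, l in zip(clip_weights, clip_losses))
    return weighted_sum / total_weight


# Hypothetical values: the third clip is a noisy demonstration with a large loss.
clip_losses = [0.2, 0.4, 2.0]
clip_weights = [1.0, 1.0, 0.1]  # assumed discriminator output, higher means cleaner

uniform = sum(clip_losses) / len(clip_losses)
weighted = weighted_policy_loss(clip_losses, clip_weights)
print(uniform, weighted)  # the weighted loss is much less dominated by the noisy clip
```

Down-weighting rather than discarding keeps every demonstration in the dataset while limiting how far a poor one can pull the policy.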

2605.18720 2026-05-19 cs.RO version update

Data-Driven Dynamic Modeling of a Tendon-Actuated Continuum Robot

Harald Minde Hansen, Bjørn Kåre Sæbø, Kristin Y. Pettersen, Jan Tommy Gravdahl, Mario Di Castro

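Affiliations: Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU)

AI summary: This paper studies data-driven system identification methods for modeling a tendon-actuated continuum robot with rolling joints, finds that a dynamic model with only two degrees of freedom accurately captures the system dynamics, and demonstrates feasibility for real-time control.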

English abstract

Developing dynamic models for tendon-driven continuum robots is challenging due to their nonlinear, high-dimensional, and friction-dominated dynamics. This paper presents a comparative study of data-driven system identification methods, including N4SID, ARX, and SINDYc, for modeling a tendon-actuated continuum robot with rolling joints developed at CERN. Despite the high number of joints of the robot, experimental analysis reveals that a two-degree-of-freedom dynamic model can accurately capture the system dynamics, owing to strong kinematic dependencies between the joints. The models are validated against experimental data, and used in the design of a model predictive controller, demonstrating their feasibility for real-time control.
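
Of the three identification methods named, ARX is the simplest to show in a few lines. The sketch below is a toy example, not the authors' model: it fits a first-order, single-input ARX model y[k] = a·y[k-1] + b·u[k-1] by least squares on simulated, noise-free data. The model order, the input signal, and the true parameters are all assumptions chosen for illustration.

```python
def fit_arx1(u, y):
    """Least-squares fit of y[k] = a*y[k-1] + b*u[k-1] via the 2x2 normal equations."""
    s_yy = s_yu = s_uu = r_y = r_u = 0.0
    for k in range(1, len(y)):
        y_prev, u_prev, y_now = y[k - 1], u[k - 1], y[k]
        s_yy += y_prev * y_prev
        s_yu += y_prev * u_prev
        s_uu += u_prev * u_prev
        r_y += y_prev * y_now
        r_u += u_prev * y_now
    det = s_yy * s_uu - s_yu * s_yu
    if abs(det) < 1e-12:
        raise ValueError("regressors are degenerate; the input is not exciting enough")
    a = (r_y * s_uu - r_u * s_yu) / det
    b = (s_yy * r_u - s_yu * r_y) / det
    return a, b


def simulate(a, b, u):
    """Generate the output of the first-order model from a zero initial state."""
    y = [0.0]
    for k in range(1, len(u)):
        y.append(a * y[k - 1] + b * u[k - 1])
    return y


# Square-wave input and assumed true parameters a=0.9, b=0.5.
u = [1.0 if (k // 5) % 2 == 0 else -1.0 for k in range(50)]
y = simulate(0.9, 0.5, u)
a_hat, b_hat = fit_arx1(u, y)
print(a_hat, b_hat)  # close to 0.9 and 0.5 on noise-free data
```

A linear model of this kind is cheap to evaluate, which is what makes it usable inside a model predictive controller running in real time.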

2605.18617 2026-05-19 cs.RO cs.AI cs.CV version update

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

Ziyu Wei, Luting Wang, Chen Gao, Li Wen, Si Liu

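Affiliations: Beihang University; National University of Singapore; Hangzhou Innovation Institute, Beihang University

AI summary: This paper presents ManiSoft, a benchmark for studying vision-language manipulation with soft continuum robots. A tailored simulator couples realistic soft-body dynamics with contact-rich interactions, four tasks highlight different aspects of deformable control, an automated pipeline generates 6,300 diverse scenes with expert trajectories, and three representative policy models are evaluated.

Comments: Accepted at ICML 2026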

English abstract

Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track the waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate that ManiSoft will serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.

2605.18611 2026-05-19 cs.RO version update

Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors

Yidan Lu, Yichao Zhong, Liu Zhao, Wanyue Li, Peng Lu

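Affiliations: The University of Hong Kong

AI summary: This paper proposes a unified reinforcement learning framework that lets a single policy perform walking, running, and fall recovery on the Unitree G1 humanoid robot without explicit mode switching at deployment. The framework replaces the conventional global reference distribution with a state-dependent gate that routes each training transition to one of two discriminators: a dedicated recovery discriminator and a velocity-conditioned locomotion discriminator that jointly covers walking and running.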

English abstract

We propose a unified reinforcement learning framework that enables a single policy to perform walking, running, and fall recovery on the Unitree G1 humanoid robot, validated on physical hardware without any explicit mode-switching command at deployment. The framework extends Adversarial Motion Priors (AMP) by replacing the conventional global reference distribution with a state-dependent gate that routes each training transition to one of two discriminators: a dedicated recovery discriminator and a velocity-conditioned locomotion discriminator that jointly covers walking and running. The gate is defined by a single fixed threshold on projected gravity: the recovery discriminator is activated when body tilt exceeds approximately $37^\circ$ from vertical ($|g_z+1|>0.6$); otherwise the locomotion discriminator is used, with the normalized commanded velocity serving as a condition that selects the appropriate reference trajectory between walk and run clips. Only three LAFAN1 reference clips are required to regularize the complete behavior set. At deployment, a single frozen ONNX policy executes at 50 Hz with no runtime mode logic; hardware experiments demonstrate successful recovery from both prone and supine falls and smooth walk-to-run transitions under the same controller.
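
The state-dependent gate can be sketched as a pure function of the projected-gravity component. The code below takes the inequality |g_z + 1| > 0.6 exactly as written in the abstract and does not try to reproduce the quoted tilt angle, since that mapping depends on how g_z is normalized, which the abstract does not state. The function name, return format, and example inputs are assumptions.

```python
RECOVERY_THRESHOLD = 0.6  # threshold on |g_z + 1| quoted in the abstract


def route_transition(g_z, commanded_speed):
    """Choose the discriminator for one training transition.

    g_z: z-component of gravity projected into the body frame (-1.0 when upright,
         under the assumed unit-norm convention).
    commanded_speed: normalized commanded velocity, used only by the locomotion branch.
    Returns (discriminator_name, condition).
    """
    if abs(g_z + 1.0) > RECOVERY_THRESHOLD:
        # Large deviation from upright: score the transition against recovery motions.
        return ("recovery", None)
    # Near upright: the locomotion discriminator is conditioned on commanded speed.
    return ("locomotion", commanded_speed)


print(route_transition(-1.0, 0.3))  # ('locomotion', 0.3)
print(route_transition(0.0, 0.3))   # ('recovery', None)
```

Because the gate is applied only when routing training transitions, the deployed policy needs no mode logic at runtime, which matches the abstract's description of a single frozen policy.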

2605.18593 2026-05-19 cs.CR cs.AI cs.RO version update

Not What You Asked For: Typographic Attacks in Household Robot Manipulation

Ali Iranmanesh, Peng Liu

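Affiliations: Cyber Security Lab, The Pennsylvania State University, State College, USA

AI summary: This study examines how typographic attacks affect the full household robot manipulation pipeline. It introduces a decoupled perception architecture and finds that perceptual errors propagate through the persistent 3D semantic map into physical failures, showing that typographic misclassification poses a real threat to robot safety.

Comments: 10 pages, 1 figure, IEEE conference format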

English abstract

Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, the shared embedding space that enables this flexibility introduces a structural vulnerability to typographic attacks, where printed text in a physical scene semantically overrides visual judgment. While prior work has quantified this threat in static 2D benchmarks and 3D navigation tasks, its impact on the full Sense-Plan-Act pipeline of household robot manipulation remains unexplored. This work evaluates typographic attacks in a Habitat-based simulation using the HomeRobot benchmark. We introduce a decoupled perception architecture that exposes a frozen CLIP encoder to adversarial stickers while maintaining geometric grounding via DETIC. In a controlled evaluation pool of 59 attributable episodes, the attack achieves an overall Attack Success Rate (ASR) of 67.8%, rising to 70.0% among fully successful episodes, under uncontrolled viewing angles and occlusion with no perceptual optimization. Critically, we find that perceptual errors propagate through the persistent 3D semantic map to produce kinetic failures, defined here as physically executed grasping and transport of the wrong object driven by an adversarially poisoned semantic state. In these cases, the robot physically grasps and delivers the wrong object to a target receptacle. These results establish typographic misclassification as a real, measurable, and physically consequential threat to the safety of modular manipulation pipelines that prior typographic attack research has left unexamined.
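
The abstract gives the Attack Success Rate as a percentage of 59 episodes but not the underlying count. Assuming ASR is simply successful attacks divided by episodes and that the figure is rounded to one decimal place, the sketch below searches for the integer counts consistent with 67.8%. The inferred count is a back-calculation, not a number stated in the paper.

```python
def consistent_counts(rate_percent, n_episodes, decimals=1):
    """Return every success count k whose rounded rate k/n matches the reported percentage."""
    matches = []
    for k in range(n_episodes + 1):
        rounded = round(100.0 * k / n_episodes, decimals)
        if abs(rounded - rate_percent) < 1e-9:
            matches.append(k)
    return matches


print(consistent_counts(67.8, 59))  # [40]
```

The same check cannot pin down the 70.0% figure for fully successful episodes, because the abstract does not give the size of that subset.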

2605.18556 2026-05-19 cs.RO cs.AI version update

Key-Gram: Extensible World Knowledge for Embodied Manipulation

Jingjing Fan, Siyuan Li, Botao Ren, Zhidong Deng

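Affiliations: Department of Computer Science and Technology; Department of Automation

AI summary: This paper proposes Key-Gram, a framework that separates linguistic knowledge from visual-state reasoning to improve how embodied controllers understand and execute compositional language instructions. Its main contribution is an extensible external memory module that improves transfer and real-world manipulation performance.

Comments: 16 pages, 5 figures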

English abstract

Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone to devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. The logical memory table can be conveniently partitioned during training and, due to its $O(1)$ lookup pattern, efficiently placed on host memory during inference. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both $π_{0}$ and $π_{0.5}$ backbones, with average relative gains of $29.5\%/9.9\%$ on RoboTwin2.0, $35.8\%/4.5\%$ on LIBERO-Plus transfer without target-domain fine-tuning, and $15.4\%/8.1\%$ on real-world long-horizon tasks. These results demonstrate that externalized linguistic memory provides an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.
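
The deterministic hashed lookup is the part of the memory module that gives the O(1) access pattern. The sketch below illustrates that idea only: plain word bigrams stand in for the paper's task-specific key-grams (which the abstract does not define), SHA-256 is an arbitrary choice of hash, and the table size and stored vectors are placeholders. The gating and convolutional fusion steps are not shown.

```python
import hashlib

TABLE_SIZE = 1024  # hypothetical number of memory slots


def key_grams(instruction, n=2):
    """Split an instruction into word n-grams (a stand-in for task-specific key-grams)."""
    tokens = instruction.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def slot_index(gram, table_size=TABLE_SIZE):
    """Map a key-gram to a table slot; the same gram always lands in the same slot."""
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def retrieve(instruction, table):
    """Fetch one stored entry per key-gram with a constant-time index per lookup."""
    return [table[slot_index(gram, len(table))] for gram in key_grams(instruction)]


# Placeholder memory: each slot holds a small vector; in training these would be learned.
table = [[float(i)] * 4 for i in range(TABLE_SIZE)]
entries = retrieve("pick up the red block", table)
print(len(entries))  # 4 bigrams, so 4 retrieved entries
```

Since each lookup is a hash plus an index, the table does not need to sit in accelerator memory, which is consistent with the abstract's remark about placing it on host memory during inference.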

2605.18543 2026-05-19 cs.RO version update

Geometry-Aware Surrogate for Real-Time Hydrodynamics Estimation of Autonomous Ground Vehicles in Amphibious Environments

Ammar Waheed, Luke Gallantree, Zohaib Hasnain

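Affiliations: J. Mike Walker ’66 Department of Mechanical Engineering, Texas A&M University; Defence Science and Technology Laboratory

AI summary: This paper proposes a geometry-aware neural network surrogate for real-time estimation of the hydrodynamics of autonomous ground vehicles in amphibious environments. Trained on high-fidelity CFD data, it resolves how hydrodynamic loading varies with vehicle geometry, depth, and flow direction, and it is evaluated in real-world wading trials.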

English abstract

Autonomous ground vehicles operating in shallow water or flood-prone terrains require dynamic models that account for hydrodynamic forces. However, the simulation and planning tools currently available either lack physical fidelity or are too computationally expensive to run in real time. This work presents a per-surface neural network surrogate that bridges this gap by predicting geometry-resolved hydrodynamic forces at real-time rates, trained entirely on high-fidelity CFD data from two geometrically distinct vehicles. A vehicle-specific Signed Distance Field (SDF) provides per-surface submergence inputs, allowing the model to resolve how loading varies with vehicle geometry, depth, and flow direction. On held-out CFD data, the surrogate achieves a longitudinal-force symmetric MAPE (sMAPE) of 13% and a vertical-force sMAPE of 3-12%, with inference running under 0.9 ms per sample. To evaluate the model under real-world conditions, water wading trials of a full-scale vehicle at different submersion depths are used. Motion-capture-derived kinematics serve as the surrogate inputs, and the resulting predictions are tested for whether they reproduce known physical relationships between force, speed, and depth. The predicted drag follows quadratic speed scaling ($R^2 \geq 0.97$) and the buoyancy intercepts scale linearly with depth ($R^2 = 0.973$). Neither relationship is encoded in the model training loss; both emerge from the per-surface architecture summing individually predicted surface forces. The resulting framework provides a pathway for embedding physically grounded hydrodynamics into the simulation and planning loops that autonomous ground vehicles depend on in amphibious environments.
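
Two quantities carry the evaluation above: sMAPE on held-out CFD data and the R² of a quadratic drag-versus-speed fit. The sketch below shows how each might be computed. sMAPE has several variants; this one uses the common form with the mean of |actual| and |predicted| in the denominator. The drag fit assumes the form F = c·v² through the origin, which the abstract does not specify. All data values are invented.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    terms = []
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Define the term as zero when both values are zero to avoid dividing by zero.
        terms.append(0.0 if denom == 0 else abs(a - p) / denom)
    return 100.0 * sum(terms) / len(terms)


def quadratic_drag_fit(speeds, forces):
    """Least-squares fit of F = c * v**2; returns (c, R squared)."""
    x = [v * v for v in speeds]
    c = sum(xi * f for xi, f in zip(x, forces)) / sum(xi * xi for xi in x)
    mean_f = sum(forces) / len(forces)
    ss_res = sum((f - c * xi) ** 2 for xi, f in zip(x, forces))
    ss_tot = sum((f - mean_f) ** 2 for f in forces)
    return c, 1.0 - ss_res / ss_tot


# Invented numbers for illustration only.
print(smape([100.0, 200.0], [110.0, 180.0]))           # roughly 10 percent
print(quadratic_drag_fit([1.0, 2.0, 3.0], [2.0, 8.0, 18.0]))  # (2.0, 1.0)
```

An R² near 1 for the quadratic fit is what the abstract reports as evidence that the surrogate recovered the expected drag scaling without that relationship appearing in the training loss.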

2605.18482 2026-05-19 cs.RO version update

Bidirectional Optical sensors for Actuation Tracking (BOAT) in soft lattice systems

Petr Trunin, Carolina Gay, Anderson Brazil Nardin, Trevor Exley, Diana Cafiso, Lucia Beccai

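Affiliations: Soft BioRobotics and Perception Lab, Istituto Italiano di Tecnologia (IIT), Genoa, Italy

AI summary: This paper proposes BOAT, an optical sensor built from two waveguides arranged in an ellipsoidal geometry, for monitoring the global deformation of soft lattice structures, particularly compression and extension. Experiments over repeated pressure cycles show a highly repeatable and reliable response.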

English abstract

The growing adoption of lattice-based structures in soft robotics creates a need for advanced sensing solutions capable of monitoring their global deformation, particularly compression and extension. In this work, we address this challenge by introducing a novel optical sensor based on two patterned waveguides arranged in an ellipsoidal geometry. This Bidirectional Optical sensor for Actuation Tracking (BOAT) is seamlessly co-printed with a lattice structure actuated by an embedded pneumatic artificial muscle (PAM), and its performance is assessed. During PAM elongation or contraction, the bending of the embedded BOAT waveguides induces output signal variations that enable a clear discrimination between compression and extension states. The designs of both the individual waveguide structure (by surface patterning) and the sensorized lattice-based unit embedding two BOATs are supported by numerical simulations. Experimental calibration over 100 consecutive pressure cycles ranging from +50 kPa to -40 kPa demonstrates a highly repeatable response, allowing a reliable distinction between extension and compression. Finally, sensor feedback is used to implement a digital shadow, enabling continuous synchronization between the whole sensorized unit and its virtual counterpart. These results establish BOAT as a powerful and reliable approach for deformation monitoring in soft lattice-based robotic systems.

2605.18441 2026-05-19 cs.RO cs.SY eess.SY version update

REACT: Environment-Adaptive Architecture for Continuous Formation Navigation of Wheeled Mobile Robots

Jianghong Dong, Yifeng Zhang, Jiawei Wang, Mengchi Cai, Keqiang Li, Guillaume Sartoretti

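Affiliations: School of Vehicle and Mobility, Tsinghua University; Department of Mechanical Engineering, National University of Singapore; Department of Civil and Environmental Engineering, University of Michigan

AI summary: This paper proposes REACT, an architecture that combines centralized formation generation with distributed formation maintenance to make formation navigation of wheeled mobile robots adaptable to complex environments, achieving continuous formation navigation without trajectory conflicts.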

English abstract

Formation control of wheeled mobile robots (WMRs) has been extensively studied due to its broad applications in fields such as logistics transportation, environmental monitoring, and search and rescue. However, most existing works mainly focus on tracking predefined formations, which limits their adaptability to complex real-world environments. To address this, we propose REACT (Real-time Environment-Adaptive architecture for Continuous formation navigaTion), a hierarchical architecture integrating centralized formation generation and distributed formation maintenance. Specifically, our upper layer generates new environment-adaptive formations when necessary and uses our proposed TCF-R2T (Trajectory-Conflict-Free Robot-to-Target assignment) algorithm to compute conflict-free WMR-to-target assignments in polynomial time, enabling timely formation transitions without trajectory conflicts. At the lower layer, each WMR executes our developed JSTP (Joint Spatio-Temporal trajectory Planning) method to maintain the generated formation by simultaneously optimizing spatial positions and temporal durations, thereby enhancing coordination among WMRs and enabling continuous navigation in obstacle-rich environments and dynamic-obstacle scenarios. Both simulation and real-world experiments validate the effectiveness and practical applicability of REACT. Experimental videos are available on our project website: https://dongjh20.github.io/REACT-website.

2605.18423 2026-05-19 cs.RO cs.CY version update

REBAR: Reference Ethical Benchmark for Autonomy Readiness

Jonathan Diller, David Barnes, Rebekah Bogdanoff, Rhett Collier, Roddy Collins, Keith Fieldhouse, Yonatan Gefen, Cameron Johnson, Anuriha Kodali, Brad Kriel, Varun Murali, James Niehaus, Mish Sukharev, Joseph VanPelt, Anthony Hoogs, Vijay Kumar, Arslan Basharat

Affiliations: University of Pennsylvania; David Barnes, LLC; Kitware, Inc.; Duality Robotics, Inc.; Texas A&M University; Charles River Analytics

AI summary: This paper presents REBAR, a framework that uses rigorous testing to produce a computable Autonomy Readiness Level, quantifying ethical performance and addressing shortcomings of existing ethical AI frameworks.

Comments: To be presented at the 2026 Workshop on Robot Ethics - Ethical, Legal and User Perspectives in Robotics and Automation (WOROBET)

English abstract

As autonomous systems grow more advanced, objective metrics to evaluate their ethical and legal compliance are critical for informing end users of their limitations and ensuring accountability of those who misuse them. Current ethical embodied AI frameworks remain mostly qualitative, focusing on system design (through safety guardrails or targeted red teaming), and the realized guardrails often directly disallow unsafe behavior without providing the user with an override or interpretable reason. Instead, there is a need for computable metrics through rigorous testing that allow a user to determine the applicability of the system to the task. To address this gap, we introduce the Reference Ethical Benchmark for Autonomy Readiness (REBAR), a quantitative test and evaluation framework for autonomous systems. REBAR maps operating metrics into a computable Autonomy Readiness Level (ARL) rubric that can quantify ethical performance. Key innovations of the framework include a neuro-symbolic Large Language Model (LLM) approach to calculate and explain the ethical difficulty of scenarios, LLM-driven at-scale generation of test instances, and a versatile, photorealistic simulation environment. By evaluating white-box autonomy solutions through this rigorous testing pipeline, REBAR delivers an objective and repeatable benchmark score, bridging the gap between abstract principles and verifiable, accountable autonomy.

2605.18407 2026-05-19 cond-mat.mes-hall cond-mat.mtrl-sci cs.AI cs.RO version update

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Lihan Shi, Zhaoyi Joy Zheng, Xinzhe Juan, Yimin Wang, Ming Yin, Mayank Sengupta, Kristina Wolinski, Yanyu Jia, Jingzhi Shi, Derek Saucedo, Neill Saggi, Haosen Guan, Kenji Watanabe, Takashi Taniguchi, Ali Yazdani, Mengdi Wang, Sanfeng Wu

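AI summary: This paper presents Qumus, described as the first embodied AI quantum materials experimentalist capable of real-world scientific discovery. Operating in a robotic mini-laboratory, it creates and nano-processes atomically thin 2D materials and van der Waals structures, and it reports the first AI-creation of graphene and the first AI-fabrication of atomically thin field-effect transistors.

Comments: 29 pages in total. Supplementary demo videos are available at https://qumus.ai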

English abstract

While modern Large Language Models (LLMs) and agentic artificial intelligence (AI) have demonstrated transformative capabilities in digital domains, the realization of embodied AI capable of real-world scientific discovery remains a difficult frontier. The advancements are hindered by the inherent complexity of integrating high-level reasoning, multimodal information processing and real-time physical execution. Here we introduce Qumus, the first AI quantum materials experimentalist. Physically embodied within a robotic mini-laboratory, Qumus is an intelligent, multimodal, and multi-agent system designed for the creation and nano-processing of atomically thin two-dimensional (2D) materials and stacked van der Waals (vdW) structures. Qumus autonomously navigates the full scientific cycle, from hypothesis generation and protocol planning to multi-step experimental execution, result analysis and reporting, acting as an experimentalist. Markedly, the system has achieved, for the first time, the AI-creation of graphene, as well as the first AI-fabrication of complex nanodevices including atomically thin field-effect transistors via vdW stacking. Qumus excels at these tasks by demonstrating autonomous error correction and closed-loop experimentation. Our results establish a generalizable framework for self-improving embodied AI systems that learn directly from the quantum world, opening a pathway toward accelerated discovery in quantum materials, electronics and beyond.

2605.18385 2026-05-19 cs.RO cs.AI 版本更新

Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments

面向动态室内环境的无处不在的映射与定位

Halim Djerroud, Nico Steyn, Olivier Rabreau, Patrick Bonnin, Abderraouf Benali

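Affiliations: Tshwane University of Technology

AI summary: This paper proposes UbiSLAM, a solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras, it addresses the sensitivity of traditional SLAM systems to environmental changes and their reliance on mobile-unit sensors, improving the localization accuracy and responsiveness of robots in the environment.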

详情
Journal ref
Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025), Volume 1, pages 537-548, SciTePress, 2025. ISBN: 978-989-758-737-5, ISSN: 2184-433X
AI中文摘要

我们提出了UbiSLAM,一种用于动态室内环境实时映射和定位的创新解决方案。通过在工作空间内战略性地部署固定RGB-D相机网络,UbiSLAM解决了传统SLAM系统常见的局限性,如对环境变化的敏感性和对移动单元传感器的依赖。这种固定传感器方法实现了实时、全面的映射,提高了机器人在环境中的定位精度和响应性。由UbiSLAM生成的集中化地图持续更新,为机器人提供准确的全局视图,从而提高导航、减少碰撞并促进共享空间中更流畅的人机交互。除了其优势外,UbiSLAM还面临挑战,特别是在确保完整空间覆盖和管理盲区方面,这需要从机器人本身集成数据。在本文中,我们讨论了潜在的解决方案,如自动校准以获得最佳的相机位置和方向,以及增强的通信协议以实现实时数据共享。所提出的模型减少了对单个机器人单元的计算负载,使更复杂的机器人平台能够有效运行,同时增强了整个系统的鲁棒性。

英文摘要

We present UbiSLAM, an innovative solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras strategically throughout the workspace, UbiSLAM addresses limitations commonly encountered in traditional SLAM systems, such as sensitivity to environmental changes and reliance on mobile unit sensors. This fixed-sensor approach enables real-time, comprehensive mapping, enhancing the localization accuracy and responsiveness of robots operating within the environment. The centralized map generated by UbiSLAM is continuously updated, providing robots with an accurate global view, which improves navigation, minimizes collisions, and facilitates smoother human-robot interactions in shared spaces. Beyond its advantages, UbiSLAM faces challenges, particularly in ensuring complete spatial coverage and managing blind spots, which necessitate data integration from the robots themselves. In this paper we discuss potential solutions, such as automatic calibration for optimal camera placement and orientation, along with enhanced communication protocols for real-time data sharing. The proposed model reduces the computational load on individual robotic units, allowing less complex robotic platforms to operate effectively while enhancing the robustness of the overall system.

2605.18373 2026-05-19 cs.RO cs.LG math.DS math.OC 版本更新

Dynamic robotic cloth folding with efficient Koopman operator-based model predictive control

动态机器人布料折叠与高效的Koopman算子基于模型预测控制

Edoardo Caldarelli, Franco Coltraro, Adrià Colomé, Lorenzo Rosasco, Carme Torras

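Affiliations: Istituto Italiano di Tecnologia; Institut de Robòtica i Informàtica Industrial; MaLGa Center, DIBRIS, Università degli Studi di Genova

AI summary: This paper proposes a Koopman operator-based model predictive control method for quickly generating cloth-folding trajectories, combining physics-based simulation with efficient kernel-based Koopman operator regression to improve the efficiency and accuracy of folding.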

Comments Accepted for presentation at the 2026 IEEE International Conference on Robotics and Automation (ICRA)

详情
AI中文摘要

机器人布料折叠是一项具有挑战性的任务,尤其是在动态折叠任务中,需要通过快速运动利用布料的动力学特性进行折叠。当受到这种快速运动的影响时,布料动力学的复杂性会阻碍系统识别和折叠轨迹的规划,导致在使用物理布料模型时仿真到现实的转移困难。与人类在折叠任务中表现出的灵活性相比,机器人通常使用小而刚性的衣物,要么太慢,要么太快但不精确,需要多次尝试才能获得相对良好的折叠效果。在本文中,我们通过生成快速折叠轨迹来解决这些问题,采用了一种新的模型预测控制器,结合基于物理的布料动力学仿真和高效的核基Koopman算子回归。Koopman算子回归是一种日益流行的机器学习技术,用于非线性系统识别,用于获得被折叠布料的线性模型。此类代理模型,通过高保真的物理布料仿真器的数据进行训练,可以用于合适的模型预测控制算法中,替代昂贵的非线性模型,以高效地生成由机器人执行的折叠轨迹。在模拟和真实机器人实验中,我们展示了Koopman算子基于模型提供的线性化如何能够有效地生成未见过的姿势的快速折叠轨迹,而不牺牲折叠的准确性。

英文摘要

Robotic cloth folding is a challenging task, particularly when considering dynamic folding tasks, which aim at folding cloth by fast motions that leverage its dynamics. When subject to such fast motions, the complexity of cloth dynamics hinders both system identification and planning of folding trajectories, resulting in a difficult simulation-to-reality transfer when using physical models of cloth. Compared to the dexterity that humans exhibit when performing folding tasks, robotic approaches usually employ small garments with quite rigid dynamics, and are either too slow, or fast but imprecise, requiring several attempts to achieve a reasonably good fold. In this paper, we tackle these challenges by generating fast folding trajectories with a novel model predictive controller, integrating physics-based simulation of cloth dynamics and efficient, kernel-based Koopman operator regression. Koopman operator regression, an increasingly popular machine learning technique for nonlinear system identification, is used to obtain a linear model for the cloth being folded. Such a surrogate model, trained with data from a high-fidelity, physics-based cloth simulator, can then be employed within a suitable model predictive control algorithm, in place of the costly, nonlinear one, to efficiently generate folding trajectories to be executed by a robotic manipulator. Both in simulated and real-robot experiments, we show how the linearization supplied by the Koopman operator-based model can be employed to efficiently generate fast folding trajectories to unseen poses, without sacrificing folding accuracy.
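The idea of obtaining a linear surrogate model through Koopman operator regression can be illustrated with a minimal sketch. It is not the authors' kernel-based method: it swaps in a plain least-squares fit over a hand-picked dictionary of observables (in the style of extended dynamic mode decomposition) on a toy two-state system. The system, the dictionary `lift`, and all function names are assumptions made for illustration.

```python
def step(x1, x2, lam=0.9, mu=0.5, c=0.3):
    """Toy nonlinear system (assumed for illustration, unrelated to cloth)."""
    return lam * x1, mu * x2 + c * x1 * x1

def lift(x1, x2):
    """Hand-picked observables; the toy dynamics are exactly linear in them."""
    return [x1, x2, x1 * x1]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def fit_koopman(pairs):
    """Least-squares K with y_next ≈ K y, via the normal equations G K^T = A."""
    n = len(pairs[0][0])
    G = [[sum(y[i] * y[j] for y, _ in pairs) for j in range(n)] for i in range(n)]
    A = [[sum(y[i] * yn[j] for y, yn in pairs) for j in range(n)] for i in range(n)]
    Kt = solve(G, A)
    return [[Kt[j][i] for j in range(n)] for i in range(n)]

# One-step training pairs from a small grid of initial states.
pairs = [(lift(a, b), lift(*step(a, b)))
         for a in (-1.0, -0.5, 0.5, 1.0) for b in (-1.0, 0.0, 1.0)]
K = fit_koopman(pairs)
```

Because the lifted dynamics are linear, a predictive controller can roll the model forward with matrix products instead of calling the nonlinear simulator, which is the efficiency argument the abstract makes.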

2605.18303 2026-05-19 cs.LG cs.AI cs.CV cs.RO 版本更新

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

PH-Dreamer: 通过端口-哈密顿生成动力学构建一个物理驱动的世界模型

Xueyu Luan, Chenwei Shi

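AI summary: This paper proposes PH-Dreamer, a physics-driven world model based on the Port-Hamiltonian framework. Three synergistic mechanisms improve world models built on recurrent state-space architectures, yielding a more compact, physically structured representation and higher internal-simulator fidelity while reducing latent phase-space volume, energy consumption, and mean squared jerk.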

Comments 12 pages, 3 figures

详情
AI中文摘要

基于递归状态空间架构构建的世界模型能够实现高效的潜在想象,但仍然缺乏物理结构,导致动力学违反守恒和耗散原理。我们引入了一个统一的端口-哈密顿框架,通过三种协同机制来解决这一问题。首先,我们将隐含的物理先验嵌入到递归转换中,通过将投影的潜在演变建模为受流动和耗散控制的能量路由,使投影的PH相空间偏向于更紧凑且物理结构化的表示。其次,我们开发了一个具有运动学意识的能量世界模型,该模型从本体感觉观察估计哈密顿量和功率平衡,提供了一个明确的物理信号用于热力学推理。第三,利用这些能量梯度,我们建立了基于能量的Actor-Critic,利用拉格朗日乘数来正则化策略优化,使其朝着更低的能量和更平滑的控制方向发展。在视觉控制基准测试中,该范式不仅实现了更优的渐近回报,还通过在想象奖励和真实奖励之间建立更紧密且方差更低的对齐关系,提高了内部模拟器的保真度,同时将潜在相空间体积减少了4.18-8.41%,能量消耗降低了高达7.80%,平均加速度平方降低了高达9.38%。

英文摘要

World models built on recurrent state-space architectures enable efficient latent imagination, yet remain physically unstructured, producing dynamics that violate conservation and dissipative principles. We introduce a unified Port-Hamiltonian framework that remedies this through three synergistic mechanisms. First, we embed implicit physical priors into recurrent transitions by modeling projected latent evolution as action-controlled energy routing governed by flow and dissipation, biasing the projected PH phase space toward a more compact and physically structured representation. Second, we develop a kinematics-aware energy world model that estimates the Hamiltonian and power balance from proprioceptive observations, providing an explicit physical signal for thermodynamic reasoning. Third, leveraging these energy gradients, we establish an energy-guided Actor-Critic that uses Lagrangian multipliers to regularize policy optimization toward lower energy and smoother control. Across visual control benchmarks, this paradigm not only attains superior asymptotic returns but also elevates internal simulator fidelity by establishing a tighter, lower-variance alignment between imagined and real rewards, all while reducing latent phase-space volume by 4.18-8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.
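For readers unfamiliar with the Port-Hamiltonian form, the following minimal sketch shows the standard structure dx/dt = (J - R) ∇H(x) + G u and its power balance dH/dt = yᵀu - ∇Hᵀ R ∇H on a mass-spring-damper. It illustrates the textbook formulation only, not PH-Dreamer's latent model; the system, parameter values, and function names are assumptions.

```python
def grad_H(q, p, m=1.0, k=2.0):
    """Gradient of H(q, p) = p^2 / (2 m) + k q^2 / 2."""
    return k * q, p / m

def ph_rhs(q, p, u=0.0, d=0.5, m=1.0, k=2.0):
    """(J - R) grad H + G u with J = [[0, 1], [-1, 0]], R = diag(0, d), G = [0, 1]."""
    dHdq, dHdp = grad_H(q, p, m, k)
    dq = dHdp
    dp = -dHdq - d * dHdp + u
    return dq, dp

def power_balance(q, p, u=0.0, d=0.5, m=1.0, k=2.0):
    """Return (dH/dt along the flow, supplied power minus dissipated power)."""
    dHdq, dHdp = grad_H(q, p, m, k)
    dq, dp = ph_rhs(q, p, u, d, m, k)
    dH_dt = dHdq * dq + dHdp * dp
    supplied = dHdp * u          # port output y = G^T grad H times input u
    dissipated = d * dHdp ** 2   # grad H^T R grad H, never negative
    return dH_dt, supplied - dissipated
```

With `u = 0` the stored energy can only decrease, which is the kind of conservation and dissipation structure the abstract says unstructured recurrent world models violate.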

2605.18295 2026-05-19 cs.RO 版本更新

Assessing Localization Technologies for Pedestrian Collision Avoidance

评估用于行人碰撞避让的定位技术

Joshua Varughese, Joseba Gorospe, Novel Certad, Cristina Olaverri-Monreal

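Affiliations: Dept. Intelligent Transport Systems, Johannes Kepler University Linz

AI summary: This paper evaluates the localization accuracy of Ultra-Wideband technology and Bluetooth 6.0 for pedestrian collision alerting and benchmarks them against Global Navigation Satellite Systems, finding that they can serve as alternatives or complements in certain scenarios and improve situational awareness.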

详情
AI中文摘要

鲁棒的行人安全对于下一代智能交通系统至关重要。此类系统依赖于主动的行人定位和预测碰撞警报。行人定位可以借助超宽频技术和蓝牙6.0,这两种技术提供了高精度测距和低延迟通信,使其成为车辆碰撞预警系统有前途的候选者。本文评估了这些技术在行人警报中的定位精度,并将其性能与全球导航卫星系统进行对比。本文进行的实验评估聚焦于关键性能指标,包括定位精度和对环境条件的鲁棒性。初步结果表明,超宽频和蓝牙6.0可以在某些场景下作为全球导航卫星系统的替代或补充方案,提高环境感知能力,并实现及时的行人警报。

英文摘要

Robust pedestrian safety is crucial to the next generation of intelligent transportation systems. Such systems rely on active pedestrian localization and predictive collision alerts. Pedestrian localization can be supported by Ultra-Wideband technology and Bluetooth 6.0, which offer high-precision ranging and low-latency communication, making them promising candidates for vehicular collision warning systems. This paper assesses the localization accuracy of these technologies for pedestrian alerting and benchmarks their performance against Global Navigation Satellite Systems. The experimental evaluations focus on key performance metrics, including localization accuracy and robustness to environmental conditions. Preliminary results suggest that Ultra-Wideband and Bluetooth 6.0 can serve as viable alternatives or complements to Global Navigation Satellite Systems in certain scenarios, improving situational awareness and enabling timely pedestrian alerts.

2605.18287 2026-05-19 cs.CV cs.RO 版本更新

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

StableVLA: 向无额外数据的鲁棒视觉-语言-动作模型迈进

Yiyang Fu, Chubin Zhang, Shukai Gong, Yufan Deng, Kaiwei Sun, Qiyang Min, Qibin Hou, Yansong Tang, Jianan Wang, Daquan Zhou

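Affiliations: Peking University; Tsinghua University; Nanjing University; Nankai University

AI summary: This paper studies the robustness of Vision-Language-Action (VLA) models under unseen real-world visual disturbances and proposes IB-Adapter, a lightweight adapter module grounded in information theory that improves robustness while remaining efficient.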

Comments Accepted by ICML 2026. Code: https://github.com/DAGroup-PKU/HumanNet. Project website: https://dagroup-pku.github.io/StableVLA/

详情
AI中文摘要

在训练数据中无法涵盖所有可能的扰动,这引发了关于在遇到未见真实世界视觉扰动时,视觉-语言-动作(VLA)模型鲁棒性的问题。在本文中,我们基于最近最先进的VLA模型进行了系统研究,并揭示了当引入训练数据中没有的视觉扰动时,性能显著下降。为缓解这一问题,我们提出了一种基于信息理论的轻量级适配模块,称为信息瓶颈适配器(IB-Adapter),该模块能够选择性地从视觉输入中过滤潜在噪声。无需任何额外数据或增强策略,IB-Adapter在基线模型上平均提升了30%,同时添加少于10M参数,显示出显著的效率和效果。此外,即使使用14倍更小的主干(0.5B参数)且未在Open X-Embodiment数据集上预训练,我们的模型StableVLA也实现了与7B规模最先进的VLA相媲美的鲁棒性。在参数开销极小(<10M)的情况下,我们的方法在长周期任务上保持了准确性,并在合成和物理视觉扰动下超越了OpenPi。

英文摘要

It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.

2605.18262 2026-05-19 cs.RO 版本更新

On Improving Multimodal Pedestrian Trajectory Prediction with CVAE: A Study on Benchmark and Robot Data

基于CVAE的多模态行人轨迹预测改进:对基准数据和机器人数据的研究

Yuzhou Liu, Cristina Olaverri-Monreal

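Affiliations: Dept. Intelligent Transport Systems, Johannes Kepler University Linz

AI summary: This paper introduces a CVAE-based probabilistic formulation on top of Social-STGCNN to improve multimodal pedestrian trajectory prediction. Evaluation on benchmark datasets and a real-world robot-collected dataset shows improved endpoint accuracy and trajectory diversity across crowd configurations.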

详情
AI中文摘要

准确的行人轨迹预测对于在复杂环境中运行的自主系统至关重要,例如郊区或半结构化区域中的模块化巴士和送货机器人。Social Spatio-Temporal Graph Convolutional Neural Networks (Social-STGCNN) 通过建模社会互动展示了强大的性能;然而,生成多样且校准良好的未来轨迹仍然具有挑战性。在本文中,我们基于Social-STGCNN骨架,引入基于条件变分自动编码器(CVAE)的概率公式,以显式建模多模态未来轨迹。我们评估了该方法在ETH和UCY行人轨迹数据集以及由移动机器人收集的真实世界行人数据集上的性能。结果表明,在公共基准上取得了适度的提升,但在不同人群配置下表现出更一致的端点准确性和改进的轨迹多样性。在机器人收集的数据上的评估进一步证明了该方法在非定制基准之外的有效性,并支持其在实际部署中的适用性。

英文摘要

Accurate pedestrian trajectory prediction is crucial for autonomous systems operating in complex environments, such as modular buses and delivery robots in suburban or semi-structured areas. Social Spatio-Temporal Graph Convolutional Neural Networks (Social-STGCNN) have shown strong performance by modeling social interactions; however, producing diverse and well-calibrated future trajectories remains challenging. In this work, we build on a Social-STGCNN backbone and introduce a Conditional Variational Autoencoder (CVAE)-based probabilistic formulation to explicitly model multimodal future trajectories. We evaluate the method on the ETH and UCY pedestrian trajectory datasets as well as on a real-world pedestrian dataset collected by a mobile robot. Results show moderate gains on public benchmarks, but more consistent endpoint accuracy and improved trajectory diversity across different crowd configurations. Evaluation on robot-collected data further demonstrates the approach's effectiveness beyond curated benchmarks and supports its applicability in practical deployments.
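The CVAE ingredients behind multimodal prediction can be sketched in a few lines: a KL term that regularizes the latent posterior, reparameterized sampling, and best-of-K evaluation over sampled futures. This is a generic illustration under stated assumptions, not the paper's model: the prior is taken to be a standard normal, and `toy_decoder` is a hypothetical stand-in for the Social-STGCNN-based decoder.

```python
import math
import random

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps so the sample stays differentiable in mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def toy_decoder(z, start, steps=4):
    """Hypothetical decoder: z[0] sets the lateral drift, so samples give different modes."""
    x, y = start
    return [(x + k, y + z[0] * k) for k in range(1, steps + 1)]

def ade(pred, truth):
    """Average displacement error between two equally long trajectories."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)

def best_of_k_ade(samples, truth):
    """Score the closest of K sampled futures, a common multimodal metric."""
    return min(ade(s, truth) for s in samples)

rng = random.Random(0)
futures = [toy_decoder(reparameterize([0.0], [0.0], rng), (0.0, 0.0)) for _ in range(20)]
truth = [(float(k), 0.5 * k) for k in range(1, 5)]
score = best_of_k_ade(futures, truth)
```

During training, a reconstruction loss would be added to a weighted `kl_to_standard_normal` term; at test time, drawing several `z` values yields the diverse set of candidate trajectories the abstract refers to.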

2605.18197 2026-05-19 cs.RO cs.AI cs.CV 版本更新

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

仅RGB的主动3D场景图生成用于室内移动机器人

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

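Affiliations: Mobile Robotics Group (MRG); Visual and Multimodal Applied Learning Lab (VANDAL)

AI summary: This paper proposes an active 3D scene graph generation method that uses RGB input only. A structured representation shared by perception and planning removes the dependence of conventional methods on dedicated depth sensors, and the method is validated on the Replica dataset.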

详情
AI中文摘要

当前3D场景图生成方法依赖于专用深度传感器,如LiDAR或RGB-D相机,限制了部署到专用机器人平台,并排除了仅使用RGB相机的场景,如固定外部基础设施。现有流程通常基于被动收集的观测轨迹,而不是基于部分构建的场景表示选择视角,因此无法有效利用图中编码的语义和空间信息。本文提出了一种完全视觉框架,用于从仅RGB输入中主动、逐步构建3D场景图,解决了这两个限制。所提出的方法围绕共享的结构化表示统一感知和规划,该表示捕捉了物体语义、3D几何、关系上下文以及多视角信息。由于该框架是硬件无关的,并且仅依赖RGB观测,因此可以将机载机器人相机和固定外部相机的输入整合到同一表示中。在Replica数据集上的实验表明,仅RGB的流程在F1分数上与使用真实深度的基线相当。在ReplicaCAD上的主动探索实验进一步表明,语义驱动的视角选择在相同探索预算下能够检测到比基于几何前沿的基线多超过两倍的物体。最后,外部相机设置表明,互补的RGB视角可以有效启动场景图并提高上下文理解,而无需额外的探索成本。

英文摘要

Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.

2605.18184 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

固定外部摄像头作为主动3D场景图生成的共同先验地图

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

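Affiliations: Mobile Robotics Group (MRG); Visual and Multimodal Applied Learning Lab (VANDAL)

AI summary: This paper uses fixed external RGB cameras as Common Prior Maps for active, incremental 3D scene graph generation, fusing observations from onboard robot cameras and fixed external cameras to improve the efficiency and accuracy of scene understanding.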

详情
AI中文摘要

常用的先验信息,如BIM模型、平面图和遥感图像,可以为自主机器人系统提供有价值的几何和语义上下文。在本文中,我们将固定外部RGB摄像头的观测视为共同先验地图(CPMs):环境的广角视图,在任何机器人运动开始之前初始化一个语义和几何场景先验。我们提出一个仅使用RGB的框架,用于主动、渐进式的3D场景图(3DSG)生成,该框架在单一硬件无关的管道中无缝融合来自机器人 onboard 摄像头和固定外部摄像头的观测。通过仅依赖RGB观测并通过前馈3D重建模型进行处理,系统将所有摄像头——机器人 onboard 或外部——视为相同,无需硬件修改。基于图的主动语义探索框架然后直接利用部分场景图,引导机器人向高语义不确定性区域前进,逐步完成和细化先验。实验表明,使用单个外部摄像头初始化场景图可使初始物体召回率提高高达+79%,并且先验的更丰富上下文显著提高了后续主动探索的效率。

英文摘要

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras - onboard or external - identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.

2605.16015 2026-05-19 cs.RO cs.LG 版本更新

Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

通过强化学习实现四旋翼机的自适应外环控制

Vishnu Saj, Sushil Vemuri, Dileep Kalathil, Moble Benedict

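Affiliations: Texas A&M University

AI summary: This paper proposes a novel adaptive control architecture that combines reinforcement learning with a Residual Dynamics Predictor to improve quadrotor control under dynamic disturbances. Real-world experiments show more accurate trajectory tracking.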

详情
AI中文摘要

深度强化学习(DRL)在四旋翼飞行器控制中通常依赖于领域随机化(DR)进行仿真到现实的转移,导致过于保守的策略难以应对动态扰动。为了解决这个问题,我们提出了一种新的自适应控制架构,能够主动感知并响应即时扰动。首先,我们训练了一个最优的外环策略,然后用残差动力学预测器(RDP)替代其对地面真实扰动数据的依赖。RDP通过仅使用状态和控制动作的历史数据在线估计飞行器所受的外部力和力矩。为了实现无缝的硬件转移,我们引入了数据高效的线性校准桥和在线推力校正机制,利用仅几秒的飞行数据将模拟的潜在空间与现实对齐。在真实世界中对Crazyflie微型四旋翼的验证表明,我们的自适应控制器在严重不确定性下,包括质量变化、不对称载荷和动态悬挂载荷,均显著优于基线方法,保持了精确的轨迹跟踪性能。

英文摘要

Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates online the external forces and moments acting on the aircraft in flight, using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads.

2605.11654 2026-05-19 cs.CV cs.AI cs.RO 版本更新

Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

通过基于原型的语义部分发现实现抗天气的跨视角地理定位

Chi-Nguyen Tran, Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Long Tran-Thanh

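Affiliations: Faculty of Information Technology, University of Science, Vietnam National University; Department of Computer Science, University of Warwick

AI summary: This paper proposes SkyPart, a lightweight swappable head for patch-based vision transformers that performs explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes that compete for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, so that the retrieval embedding is altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.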

Comments 37 pages, 7 figures, 6 tables

详情
AI中文摘要

跨视角地理定位(CVGL),即匹配一个倾斜无人机视角到地理参考的卫星瓷砖,已成为在GPS信号被干扰、欺骗或不可用时自主无人机导航的关键替代方案。尽管近年来取得了显著进展,但仍然存在三个限制:(1)全局描述符设计将补丁网格压缩成一个向量,而没有在视角间隙中分离布局和纹理;(2)与海拔相关的尺度变化保留在学习嵌入中,而不是被边缘化;(3)多目标训练依赖于手动调整的标量损失,这些损失在不兼容的梯度尺度上。我们提出SkyPart,一种轻量级可替换头,用于基于补丁的视觉变换器(ViTs),在补丁网格上实施显式部分分组。SkyPart有四个理论基础的组件:(i)通过单次传递余弦分配学习可学习的原型以竞争补丁标记;(ii)在训练期间应用的海拔条件线性调制,使检索嵌入在推理时无海拔依赖;(iii)对活跃原型的图注意力读出;(iv)一种Kendall不确定性加权多目标损失,其平稳点是帕累托平稳点。在26.95M参数和22.14 GFLOPs下,SkyPart是表现最佳方法中最小的,并在SUES-200、University-1652和DenseUAV上设定了新的状态。其在十条件WeatherPrompt腐蚀基准下的优势优于最强基线。

英文摘要

Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
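Component (iv) replaces hand-tuned loss scalars with learned uncertainty weights. A minimal sketch of that idea follows, using a simplified form of Kendall-style weighting in which each task loss L_i is scaled by exp(-s_i) and penalized by s_i, where s_i is a learnable log-variance. Constant factors are dropped, and this is not claimed to be SkyPart's exact loss; the function names are illustrative.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Sum of exp(-s_i) * L_i + s_i over tasks; s_i is a learnable log-variance."""
    return sum(math.exp(-s) * L + s for L, s in zip(losses, log_vars))

def stationary_log_vars(losses):
    """Setting d/ds_i [exp(-s_i) L_i + s_i] = 0 gives s_i = log L_i."""
    return [math.log(L) for L in losses]

# Two losses on very different scales are balanced without manual scalars.
losses = [4.0, 0.25]
s_star = stationary_log_vars(losses)
total_at_optimum = uncertainty_weighted_loss(losses, s_star)
total_unweighted = uncertainty_weighted_loss(losses, [0.0, 0.0])
```

At the stationary point each task contributes `1 + log L_i`, so a loss with a large raw magnitude no longer dominates the gradient purely because of its scale.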

2604.26450 2026-05-19 cs.RO 版本更新

Reactive Motion Generation via Phase-varying Neural Potential Functions

通过相变神经势函数实现反应性运动生成

Ahmet Tekden, Dimitrios Kanoulas, Aude Billard, Yasemin Bekiroglu

Affiliations: Chalmers AI Research Center (CHAIR); Chalmers Gender Initiative for Excellence (Genie); Wallenberg AI, Autonomous Systems and Software Program (WASP); University College London; Ecole Polytechnique Federale de Lausanne (EPFL)

AI summary: This paper proposes a motion generation framework based on Phase-varying Neural Potential Functions (PNPF), which conditions the potential function on a phase variable estimated directly from state progression. It generalizes effectively across point-to-point, periodic, and full 6D motion tasks and is more robust on trajectories with intersections and under external disturbances.

Comments Accepted by IEEE Robotics and Automation Letters (RAL)

详情
AI中文摘要

动态系统(DS)方法在学习示范(LfD)中提供了从少量示范中获得稳定连续策略的能力。一阶动态系统(DS)在许多点对点和周期性任务中效果良好,只要为每个状态定义唯一的速度。对于具有交点的任务(例如绘制“8”),通常会使用扩展方法如二阶动态或相变量。然而,通过引入速度,二阶模型在交点附近对扰动敏感,因为速度用于区分运动方向。此外,这种区分可能在几乎相同的位移速度对对应不同后续运动时失效。相比之下,基于相位的方法依赖于开环时间或相变量,这限制了它们在扰动后恢复的能力。我们引入了相变神经势函数(PNPF),一种LfD框架,将势函数条件于直接从状态进展估计的相变量,而不是开环时间输入。该相变量使系统能够处理状态重访,而学习的势函数生成局部向量场用于反应性和稳定的控制。PNPF在点对点、周期性和全6D运动任务中表现出良好的泛化能力,在具有交点的轨迹上优于现有基线,并在实时机器人操作中表现出对外部扰动的鲁棒性。

英文摘要

Dynamical systems (DS) methods for Learning-from-Demonstration (LfD) provide stable, continuous policies from few demonstrations. First-order dynamical systems (DS) are effective for many point-to-point and periodic tasks, as long as a unique velocity is defined for each state. For tasks with intersections (e.g., drawing an "8"), extensions such as second-order dynamics or phase variables are often used. However, by incorporating velocity, second-order models become sensitive to disturbances near intersections, as velocity is used to disambiguate motion direction. Moreover, this disambiguation may fail when nearly identical position-velocity pairs correspond to different onward motions. In contrast, phase-based methods rely on open-loop time or phase variables, which limit their ability to recover after perturbations. We introduce Phase-varying Neural Potential Functions (PNPF), an LfD framework that conditions a potential function on a phase variable which is estimated directly from state progression, rather than on open-loop temporal inputs. This phase variable allows the system to handle state revisits, while the learned potential function generates local vector fields for reactive and stable control. PNPF generalizes effectively across point-to-point, periodic, and full 6D motion tasks, outperforms existing baselines on trajectories with intersections, and demonstrates robust performance in real-time robotic manipulation under external disturbances.
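Why a phase variable estimated from state progression helps at intersections can be seen in a toy sketch. The estimator below is purely hypothetical and is not the PNPF estimator: it searches for the closest point on a demonstrated figure-eight, but only within a short window ahead of the previous phase, so the two passes through the crossing point are told apart without an open-loop clock. The curve, window size, and function names are assumptions.

```python
import math

def figure_eight(s):
    """Demonstrated path for phase s in [0, 1); it crosses itself at s = 0 and s = 0.5."""
    return math.sin(2 * math.pi * s), math.sin(4 * math.pi * s)

def estimate_phase(state, prev_phase, demo=figure_eight, window=0.1, n=200):
    """Closest demo point within [prev_phase, prev_phase + window], wrapped to [0, 1)."""
    best_s, best_d = prev_phase, float("inf")
    for i in range(n + 1):
        s = prev_phase + window * i / n
        d = math.dist(state, demo(s % 1.0))
        if d < best_d:
            best_s, best_d = s, d
    return best_s % 1.0

# The same position (the crossing) maps to different phases depending on history.
phase_mid = estimate_phase((0.0, 0.0), prev_phase=0.45)
phase_end = estimate_phase((0.0, 0.0), prev_phase=0.95)
```

Because the estimate is driven by the measured state, a robot that is pushed back along the path does not advance its phase the way an open-loop timer would. A practical estimator would also need to search slightly behind the previous phase to recover from larger setbacks.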

2604.10895 2026-05-19 cs.HC cs.RO 版本更新

Teaching Robots to Interpret Social Interactions through Lexically-guided Dynamic Graph Learning

通过词法引导的动态图学习教授机器人解读社交互动

Tongfei Bian, Mathieu Chollet, Tanaya Guha

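Affiliations: University of Glasgow

AI summary: This paper proposes SocialLDG, a multi-task learning framework that uses dynamic graph learning to model the dynamic relationships among states. It achieves the best performance on human-robot social interaction datasets and supports task extension and analysis of influence over time.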

Comments submitted to ACM MM 26

详情
AI中文摘要

为了使机器人具备社交智能,它必须能够从用户当前行为推断其内部状态,预测用户未来行为,并在需要时做出适当回应。在本工作中,我们探讨了机器人如何通过建模用户内部状态(潜在)和动作(可观察状态)之间的动态关系来获得这种社交智能。我们的前提是这些状态源于相同的底层社会认知过程,并动态地相互影响。受认知科学理论的启发,我们提出了一种新的多任务学习框架,称为SocialLDG,它明确建模作为六个不同任务的状态之间的动态关系。我们的框架使用语言模型为每个任务引入词法先验,并利用动态图学习来建模随时间演变的任务亲和力。SocialLDG有三个优势:首先,它在两个具有挑战性的人类-机器人社交交互公开数据集上实现了最先进的性能。其次,它通过无缝学习新任务而支持强大的任务扩展能力,而不会产生灾难性遗忘。最后,得益于显式建模任务亲和力,它提供了关于不同互动随时间展开以及内部状态和可观察动作如何相互影响的见解。

英文摘要

For a robot to be called socially intelligent, it must be able to infer users' internal states from their current behaviour, predict the users' future behaviour, and, if required, respond appropriately. In this work, we investigate how robots can be endowed with such social intelligence by modelling the dynamic relationship between a user's internal states (latent) and actions (observable state). Our premise is that these states arise from the same underlying socio-cognitive process and influence each other dynamically. Drawing inspiration from theories in Cognitive Science, we propose a novel multi-task learning framework, termed SocialLDG, that explicitly models the dynamic relationship among the states, represented as six distinct tasks. Our framework uses a language model to introduce lexical priors for each task and employs dynamic graph learning to model task affinity evolving with time. SocialLDG has three advantages: First, it achieves state-of-the-art performance on two challenging, publicly available human-robot social interaction datasets. Second, it supports strong task scalability by learning new tasks seamlessly without catastrophic forgetting. Finally, benefiting from explicitly modelling task affinity, it offers insights into how different interactions unfold in time and how the internal states and observable actions influence each other in human decision making.

2604.02060 2026-05-19 cs.CV cs.RO 版本更新

CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

CompassAD: 基于意图的多功能竞争物体3D affordance 地标

Jingliang Li, Jindou Jia, Tuo An, Chuhao Zhou, Xiangyu Chen, Shilin Shan, Boyu Ma, Bofan Lyu, Gen Li, Jianfei Yang

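Affiliations: MARS Lab, Nanyang Technological University, Singapore

AI summary: This work proposes Intent-Driven Confusable Affordance Grounding, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object in a multi-object point cloud, conditioned on implicit natural-language intent. It builds the CompassAD benchmark, reports state-of-the-art results on confusing multi-object compositions with implicit intent, and validates transfer to real-world grasping on a robotic manipulator.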

详情
AI中文摘要

当被告知要“切蛋糕”时,机器人必须在附近的剪刀之上选择刀,尽管两个物体都提供相同的切割功能。在真实世界场景中,多个物体可能具有相同的affordance,但只有一个是给定任务上下文下的合适对象。我们称这种情况为混淆对。然而,现有的3D affordance方法大多回避了这一挑战,通过评估孤立的单个物体,通常伴有查询中提供的显式类别名称。我们正式提出了意图驱动的可混淆affordance地标,这是一种新的3D affordance设定,要求在多物体点云中预测正确物体的每点affordance掩码,基于隐含的自然语言意图。为了研究这个问题,我们构建了CompassAD,第一个专注于隐含意图的多物体组合基准。它包含30个混淆物体对,覆盖16种affordance类型,6,422个组合,以及88K+个查询-回答对。此外,我们提出了CompassNet,一个包含两个专门模块的框架,专为该任务定制。实例受限的交叉注入(ICI)在物体边界内约束语言-几何对齐,以防止跨物体语义泄漏。双级对比细化(BCR)在几何组和点级别上强制执行区分,使目标和可混淆表面之间的区别更加清晰。广泛的实验表明,在已见和未见查询上均取得了最先进的结果,并在机器人机械臂上的部署证实了其在真实世界抓取中的有效性。

英文摘要

When told to "cut the cake," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Intent-Driven Confusable Affordance Grounding, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusing multi-object compositions. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 compositions, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object compositions.

2604.00634 2026-05-19 cs.RO cs.CV 版本更新

LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics

LiPS: 为资源受限机器人设计的轻量级全景分割

Calvin Galagain, Martyna Poreba, François Goulette, Cyrill Stachniss

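Affiliations: Université Paris-Saclay, CEA LIST; U2IS, ENSTA, Institut Polytechnique de Paris; University of Bonn, Center for Robotics

AI summary: This paper proposes LiPS, a lightweight panoptic segmentation method that retains query-based decoding while streamlining the feature extraction and fusion pathway, substantially lowering computational demands and achieving accuracy comparable to heavier models at higher throughput.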

Comments Accepted to IEEE International Conference on Image Processing (ICIP) 2026, Paper #2070

详情
AI中文摘要

全景分割是机器人感知的关键使能器,因为它将语义理解与对象级推理统一起来。然而,随着最新模型复杂性的增加,它们不再适合在资源受限的平台上部署,如移动机器人。我们提出了一种名为LiPS的新方法,通过轻量级设计保留查询基于解码,同时引入流线型的特征提取和融合路径,旨在在大幅降低计算需求的同时提供强大的全景分割性能。在标准基准上的评估表明,LiPS在精度上与更重的基线相当,同时提供高达4.5倍的吞吐量(每秒帧数),并需要几乎6.8倍更少的计算。这种效率使LiPS成为现代全景模型与现实世界机器人应用之间的重要桥梁。

英文摘要

Panoptic segmentation is a key enabler for robotic perception, as it unifies semantic understanding with object-level reasoning. However, the increasing complexity of state-of-the-art models makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. We propose a novel approach called LiPS that addresses the challenge of efficient-to-compute panoptic segmentation with a lightweight design that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway. It aims to provide strong panoptic segmentation performance while substantially lowering the computational demands. Evaluations on standard benchmarks demonstrate that LiPS attains accuracy comparable to much heavier baselines, while providing up to 4.5 times higher throughput, measured in frames per second, and requiring nearly 6.8 times fewer computations. This efficiency makes LiPS a highly relevant bridge between modern panoptic models and real-world robotic applications.

2603.26720 2026-05-19 cs.RO cs.AI 版本更新

SutureFormer: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space

SutureFormer: 通过像素空间中的目标引导离线强化学习学习手术轨迹

Huanrong Liu, Chunlin Tian, Tongyu Jia, Tailai Zhou, Qin Liu, Yu Gao, Yutong Ban, Yun Gu, Guy Rosman, Xin Ma, Qingbiao Li

发表机构 * University of Macau(澳门大学) The Chinese PLA General Hospital(中国人民解放军总医院) Duke University(杜克大学) Shanghai Jiao Tong University(上海交通大学)

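AI Summary This paper proposes SutureFormer, a goal-conditioned offline reinforcement learning framework that interpolates sparse annotations into dense reward signals to learn surgical needle trajectory prediction, reducing Average Displacement Error by 58.6%.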

详情
AI中文摘要

从内窥镜视频预测手术针轨迹对于机器人辅助缝合至关重要,能够实现预见性规划、实时引导和更安全的运动执行。现有直接从视觉观测学习运动分布的方法往往忽视相邻运动步骤之间的序列依赖性。此外,稀疏路径点标注通常无法提供足够的监督,进一步增加了监督或模仿学习方法的难度。为了解决这些挑战,我们将基于图像的针轨迹预测 formulations 为一个序列决策问题,在其中将针尖视为一个在像素空间中逐步移动的智能体。这种 formulation 自然捕捉了针运动的连续性,并能够显式建模在时间上物理上合理的像素级状态转换。从这个角度来看,我们提出SutureFormer,一种目标引导的离线强化学习框架,通过三次样条插值将稀疏标注转换为密集奖励信号,鼓励策略在利用有限专家指导的同时探索合理的未来运动路径。SutureFormer 使用观察编码器编码可变长度片段,以捕捉局部空间线索和长距离时间动态,并通过由离散方向和连续幅度组成的操作自回归地预测未来路径点。为了实现从专家演示中稳定离线策略优化,我们采用保守Q学习与行为克隆正则化。在包含1,158条轨迹的新的肾伤口缝合数据集中进行的实验表明,与最强基线相比,SutureFormer将平均位移误差减少了58.6%,证明了将针轨迹预测建模为像素级序列动作学习的有效性。

英文摘要

Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide sufficient supervision, further increasing the difficulty of supervised or imitation learning methods. To address these challenges, we formulate image-based needle trajectory prediction as a sequential decision-making problem, in which the needle tip is treated as an agent that moves step by step in pixel space. This formulation naturally captures the continuity of needle motion and enables the explicit modeling of physically plausible pixel-wise state transitions over time. From this perspective, we propose SutureFormer, a goal-conditioned offline reinforcement learning framework that converts sparse annotations into dense reward signals via cubic spline interpolation, encouraging the policy to exploit limited expert guidance while exploring plausible future motion paths. SutureFormer encodes variable-length clips using an observation encoder to capture both local spatial cues and long-range temporal dynamics, and autoregressively predicts future waypoints through actions composed of discrete directions and continuous magnitudes. To enable stable offline policy optimization from expert demonstrations, we adopt Conservative Q-Learning with Behavioral Cloning regularization. Experiments on a new kidney wound suturing dataset containing 1,158 trajectories from 50 patients show that SutureFormer reduces Average Displacement Error by 58.6% compared with the strongest baseline, demonstrating the effectiveness of modeling needle trajectory prediction as pixel-level sequential action learning.
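To make the sparse-to-dense reward idea in the SutureFormer abstract concrete, here is a minimal sketch under stated assumptions. The abstract says only "cubic spline interpolation"; this sketch swaps in a Catmull-Rom cubic spline because it passes through every waypoint and needs no solver, so it may differ from the authors' spline. The eight-way direction set, the negative-distance reward, and all function names are hypothetical illustrations, not the paper's design.

```python
import math

# Assumed discretisation: eight compass directions as unit pixel steps.
DIRECTIONS = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)]


def catmull_rom(p0, p1, p2, p3, t):
    """Point at t in [0, 1] on the cubic segment that runs from p1 to p2."""
    return tuple(
        0.5 * (2 * b + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t ** 2
               + (3 * b - a - 3 * c + d) * t ** 3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def densify(waypoints, samples_per_segment=10):
    """Turn sparse annotated waypoints into a dense path through all of them."""
    pts = [waypoints[0]] + list(waypoints) + [waypoints[-1]]  # pad both ends
    path = []
    for i in range(1, len(pts) - 2):
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            path.append(catmull_rom(pts[i - 1], pts[i], pts[i + 1], pts[i + 2], t))
    path.append(tuple(float(v) for v in waypoints[-1]))
    return path


def step(position, direction_index, magnitude):
    """Apply one action: a discrete direction plus a continuous magnitude."""
    dx, dy = DIRECTIONS[direction_index]
    return (position[0] + magnitude * dx, position[1] + magnitude * dy)


def dense_reward(position, path):
    """Hypothetical dense reward: negative distance to the nearest path sample."""
    return -min(math.dist(position, q) for q in path)
```

With three annotated waypoints and five samples per segment, `densify` returns eleven path samples, so every predicted step can be scored rather than only the steps that coincide with an annotation.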

2603.17751 2026-05-19 cs.RO cs.SY eess.SY 版本更新

Multi-Source Human-in-the-Loop Digital Twin Testbed for Connected and Autonomous Vehicles in Mixed Traffic Flow

多源人机协同数字孪生测试平台用于混合交通流中的连接与自动驾驶车辆

Jianghong Dong, Chunying Yang, Mengchi Cai, Chaoyi Chen, Qing Xu, Jianqiang Wang, Jiawei Wang, Keqiang Li

发表机构 * School of Vehicle and Mobility, Tsinghua University(清华大学车辆与移动性学院) Department of Civil and Environmental Engineering, University of Michigan(密歇根大学土木与环境工程系)

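AI Summary This paper proposes the Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed (MSH-MCCT) for testing complex interactions between Connected and Autonomous Vehicles (CAVs) and human-driven vehicles (HDVs) in mixed traffic; its Mixed Digital Twin concept, which combines Mixed Reality with Digital Twin, improves experimental flexibility and scalability.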

详情
Journal ref
2026 in Journal of Intelligent and Connected Vehicles
AI中文摘要

在新兴的混合交通环境中,连接与自动驾驶车辆(CAVs)必须与周围的人类驾驶车辆(HDVs)进行交互。本文介绍MSH-MCCT(多源人机协同混合云控制测试平台),一种新的CAV测试平台,能够捕捉各种CAVs和HDVs之间的复杂交互。利用混合数字孪生概念,该概念结合了混合现实与数字孪生,MSH-MCCT整合了物理、虚拟和混合平台,以及多源控制输入。通过混合平台的连接,MSH-MCCT允许人类驾驶员和CAV算法在多个视野范围内同时操作物理和虚拟车辆。特别地,该测试平台促进了物理和虚拟CAVs与HDVs的共存和实时交互,显著提高了实验的灵活性和可扩展性。在混合交通中的车辆编队实验展示了MSH-MCCT通过不同保真度的驾驶模拟器进行多源真实人类驾驶员闭环CAV测试的潜力。实验视频可在我们的项目网站上获得:https://dongjh20.github.io/MSH-MCCT。

英文摘要

In emerging mixed traffic environments, Connected and Autonomous Vehicles (CAVs) have to interact with surrounding human-driven vehicles (HDVs). This paper introduces MSH-MCCT (Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed), a novel CAV testbed that captures complex interactions between various CAVs and HDVs. Utilizing the Mixed Digital Twin concept, which combines Mixed Reality with Digital Twin, MSH-MCCT integrates physical, virtual, and mixed platforms, along with multi-source control inputs. Bridged by the mixed platform, MSH-MCCT allows human drivers and CAV algorithms to operate both physical and virtual vehicles within multiple fields of view. In particular, this testbed facilitates the coexistence and real-time interaction of physical and virtual CAVs and HDVs, significantly enhancing experimental flexibility and scalability. Experiments on vehicle platooning in mixed traffic showcase the potential of MSH-MCCT to conduct CAV testing with multi-source real human drivers in the loop through driving simulators of diverse fidelity. The videos for the experiments are available at our project website: https://dongjh20.github.io/MSH-MCCT.

2603.14371 2026-05-19 cs.RO cs.AI 版本更新

OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism

OxyGen: 为多任务并行下的VLA推理提供统一的KV缓存管理

Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu

发表机构 * Institute for AI Industry Research (AIR)(人工智能产业研究院) Department of Electronic Engineering(电子工程系) University of Science and Technology of China(中国科学技术大学)

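AI Summary This paper proposes OxyGen, a unified KV cache management approach that improves VLA inference efficiency under multi-task parallelism; cross-task KV sharing and cross-frame continuous batching reduce redundant computation and resource contention, yielding higher on-device throughput and action frequency.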

Comments Preprint

详情
AI中文摘要

具身AI代理越来越多地需要在不同的时间约束下从共享观察中并行执行多个任务,如操作、对话和记忆构建。最近的混合变换器(MoT)视觉-语言-动作模型(VLAs)在架构上支持这种异构输出,但现有的推理系统由于冗余计算和资源竞争未能在设备部署中实现高效的多任务并行。我们发现孤立的KV缓存管理是根本原因。为此,我们提出了统一的KV缓存管理,一种将KV缓存作为跨任务和时间的第一类共享资源的推理设计。这种抽象使两种关键优化成为可能:跨任务的KV共享消除了共享观察的冗余预填充,而跨帧连续批处理将可变长度的语言解码与固定速率的动作生成解耦。我们为流行的MoT VLA π_{0.5} 实现了这种设计,并在NVIDIA GeForce RTX 4090和Jetson AGX Thor两个代表性的设备端VLA推理平台上进行了评估。OxyGen在孤立执行的情况下实现了高达3.7倍的加速,同时在不降低动作质量的情况下,实现了超过200 tokens/s的语言吞吐量和70 Hz的动作频率,并进一步在搭载Jetson AGX Thor的现实人形机器人上验证了这些收益。

英文摘要

Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment because of redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference design that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this design for $π_{0.5}$, a popular MoT VLA, and evaluate it on both NVIDIA GeForce RTX 4090 and Jetson AGX Thor, two representative platforms for on-device VLA inference. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without degrading action quality, and we further validate the gains on a real humanoid robot with on-board Jetson AGX Thor.
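The cross-task KV sharing described in the OxyGen abstract can be illustrated with a toy cache that runs prefill once per observation frame and hands the same result to every task head. This is a sketch of the general idea only: `SharedKVCache`, `toy_prefill`, and the frame-indexed keying are hypothetical names and choices, the "KV" here is a plain list standing in for attention tensors, and cross-frame continuous batching is not modelled.

```python
class SharedKVCache:
    """Toy cache that treats the prefill result as a resource shared by tasks."""

    def __init__(self, prefill_fn):
        self._prefill_fn = prefill_fn
        self._entries = {}        # frame_id -> cached prefill result
        self.prefill_calls = 0    # counts how often prefill actually ran

    def get(self, frame_id, observation):
        """Return the cached entry for this frame, running prefill only once."""
        if frame_id not in self._entries:
            self._entries[frame_id] = self._prefill_fn(observation)
            self.prefill_calls += 1
        return self._entries[frame_id]

    def evict_before(self, frame_id):
        """Drop entries for frames older than frame_id to bound memory use."""
        for key in [k for k in self._entries if k < frame_id]:
            del self._entries[key]


def toy_prefill(observation):
    """Stand-in for a transformer prefill pass over the shared observation."""
    return [len(token) for token in observation]
```

In this toy setup an action head and a language head that both call `cache.get(frame_id, observation)` for the same frame trigger a single prefill, which is the redundancy the abstract says isolated per-task caches fail to remove.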

2602.05156 2026-05-19 cs.RO cs.SY eess.SY 版本更新

PLATO Hand: Shaping Contact Behavior with Fingernails for Precise Manipulation

PLATO Hand:利用指甲形状接触行为实现精确操控

Dong Ho Kang, Aaron Kim, Mingyo Seo, Kazuto Yokoyama, Tetsuya Narita, Luis Sentis

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Sony Group Corporation(索尼集团)

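AI Summary This paper presents the PLATO Hand, a dexterous robotic hand whose hybrid fingertip combines a rigid fingernail, an embedded distal phalanx, and a compliant pulp to shape contact behavior. A strain-energy-based bending-indentation model guides the fingertip design and explains how material stiffness and contact geometry govern deformation partitioning within the fingertip. Experiments show improved pinch stability, fingernail-mediated dorsal-contact force transmission, and proprioceptive observability, along with successful edge-sensitive manipulation tasks such as paper singulation, card picking, and orange peeling. The results indicate that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism offers a principled approach to precise manipulation.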

详情
AI中文摘要

我们提出了PLATO手,一种具有混合指尖的灵活机器人手,该指尖结合了刚性指甲、嵌入式远节指骨和顺应性肉垫,以在操控过程中塑造接触行为。通过机械组织指尖接触的启动、支撑和传递方式,这种结构在多样化的物体几何形状和抓取方向上创造了稳定且任务相关的接触条件。我们开发了基于应变能的弯曲-压入模型,以指导指尖设计并解释材料刚度和接触几何如何控制指尖内的变形分配。实验显示提升了捏合稳定性、指甲介导的背侧接触力传输和本体感觉可观察性,并成功执行了敏感边缘操控任务,包括纸张分隔、卡片拾取和橙子剥皮。这些结果表明,结合机械结构的接触界面与力-运动透明手指机制提供了精确操控的原理性方法。我们的项目页面是:https://platohand.github.io

英文摘要

We present the PLATO Hand, a dexterous robotic hand with a hybrid fingertip that combines a rigid fingernail, embedded distal phalanx, and compliant pulp to shape contact behavior during manipulation. By mechanically organizing how contact is initiated, supported, and transmitted at the fingertip, this structure creates stable and task-relevant contact conditions across diverse object geometries and grasp orientations. We develop a strain-energy-based bending-indentation model to guide the fingertip design and to explain how material stiffness and contact geometry govern deformation partitioning within the fingertip. Experiments show improved pinch stability, improved fingernail-mediated dorsal-contact force transmission and proprioceptive observability, and successful execution of edge-sensitive manipulation tasks, including paper singulation, card picking, and orange peeling. These results show that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism provides a principled approach to precise manipulation. Our project page is at: https://platohand.github.io

2601.05653 2026-05-19 cs.RO cs.MA 版本更新

EvoQRE: Modeling Bounded Rationality in Safety-Critical Traffic Simulation via Evolutionary Quantal Response Equilibrium

EvoQRE: 通过进化量化反应均衡建模安全关键交通仿真中的有限理性

Phu-Hoa Pham, Chi-Nguyen Tran, Duy-Minh Dao-Sy, Phu-Quy Nguyen-Lam, Trung-Kiet Huynh

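AI Summary This paper proposes EvoQRE, a framework that models safety-critical traffic interactions with Quantal Response Equilibrium and evolutionary game dynamics, proves convergence to Logit-QRE under weak monotonicity assumptions, and reports better realism and safety metrics on the Waymo and nuPlan datasets.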

Comments This article is being withdrawn due to identified issues in the experimental evaluation and theoretical assumptions that may affect the validity of some reported conclusions. The authors plan to revise the methodology and provide a corrected version in future work.

详情
AI中文摘要

现有的自动驾驶交通仿真框架通常依赖于模仿学习或博弈论方法来求解纳什或粗相关均衡,隐含假设了完美理性的代理。然而,人类驾驶员表现出有限理性,在认知和感知限制下做出近似最优决策。我们提出EvoQRE,一种原理性的框架,将安全关键交通交互建模为一般和博弈,通过量化反应均衡(QRE)和进化博弈动态求解。EvoQRE整合了预训练的生成世界模型与熵正则化的复制动态,捕捉随机的人类行为同时保持均衡结构。我们提供了严格的理论结果,证明所提出的动态在双重时间尺度随机近似下收敛到Logit-QRE,具有显式的收敛速率O(log k / k^{1/3})在弱单调性假设下。我们进一步通过混合基和能量基策略表示扩展QRE到连续动作空间。在Waymo Open Motion Dataset和nuPlan基准测试中,EvoQRE实现了最先进的现实感,改进的安全指标,以及通过可解释的理性参数可控生成多样化的安全关键场景。

英文摘要

Existing traffic simulation frameworks for autonomous vehicles typically rely on imitation learning or game-theoretic approaches that solve for Nash or coarse correlated equilibria, implicitly assuming perfectly rational agents. However, human drivers exhibit bounded rationality, making approximately optimal decisions under cognitive and perceptual constraints. We propose EvoQRE, a principled framework for modeling safety-critical traffic interactions as general-sum Markov games solved via Quantal Response Equilibrium (QRE) and evolutionary game dynamics. EvoQRE integrates a pre-trained generative world model with entropy-regularized replicator dynamics, capturing stochastic human behavior while maintaining equilibrium structure. We provide rigorous theoretical results, proving that the proposed dynamics converge to Logit-QRE under a two-timescale stochastic approximation with an explicit convergence rate of O(log k / k^{1/3}) under weak monotonicity assumptions. We further extend QRE to continuous action spaces using mixture-based and energy-based policy representations. Experiments on the Waymo Open Motion Dataset and nuPlan benchmark demonstrate that EvoQRE achieves state-of-the-art realism, improved safety metrics, and controllable generation of diverse safety-critical scenarios through interpretable rationality parameters.
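For readers unfamiliar with the logit form of Quantal Response Equilibrium mentioned above, the following sketch shows the standard logit response for a single agent: actions are chosen with probability proportional to the exponential of their value scaled by a rationality parameter. It illustrates bounded rationality only and is not the EvoQRE algorithm; the function name and the example values are hypothetical, and the rationality parameter is assumed to be non-negative.

```python
import math


def logit_response(q_values, rationality):
    """Logit quantal response: P(a) is proportional to exp(rationality * Q(a)).

    rationality = 0 gives uniformly random choices; large values approach
    the greedy best response that a perfectly rational agent would play.
    Assumes rationality >= 0.
    """
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(rationality * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

In a Logit-QRE every agent plays such a response to the expected payoffs induced by the other agents' mixed strategies, so a single interpretable parameter moves simulated drivers between noisy and near-optimal behavior.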

2512.07765 2026-05-19 cs.RO 版本更新

Toward Seamless Physical Human-Humanoid Interaction: Insights from Control, Intent, and Modeling with a Vision for What Comes Next

迈向无缝的物理人机交互:从控制、意图和建模的角度见解以及未来发展的展望

Gustavo A. Cardona, Shubham S. Kumbhar, Panagiotis Artemiadis

发表机构 * University of Delaware(特拉华大学)

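AI Summary This review examines three core pillars of physical human-humanoid interaction (control, intent estimation, and computational human models), summarizes the state of the art, open challenges, and limitations, and proposes pathways for cross-pillar integration toward more robust, safe, and intuitive physical interaction.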

Comments 60 pages, 5 figures, 3 tables

详情
AI中文摘要

物理人机交互(pHHI)是一个快速发展的领域,对在无结构、以人为中心的环境中部署机器人具有重要意义。在本文综述中,我们通过三个核心支柱审视当前pHHI的现状:(i)人形机器人的建模与控制,(ii)人类意图估计,以及(iii)计算人类模型。对于每个支柱,我们调查了代表性方法,识别了开放挑战,并分析了当前限制,这些限制阻碍了鲁棒、可扩展和适应性交互的实现。这些包括需要能够处理不确定人类动态的全身控制策略、在有限感知下实时意图推断的需求,以及能够考虑人类身体状态变化的建模技术。尽管每个领域都取得了显著进展,但跨支柱的整合仍然有限。我们提出了统一这些领域的路径,以实现连贯的交互框架。这种结构不仅使我们能够映射当前的现状,还提出了未来研究的具体方向,旨在弥合这些领域之间的差距。此外,我们引入了一种基于模态的统一交互类型分类法,区分直接交互(如物理接触)和间接交互(如物体中介),并基于机器人参与的程度,从协助到合作和协作。对于此分类中的每个类别,我们提供了三个核心支柱,突出跨支柱整合的机会。我们的目标是建议推动鲁棒、安全和直观物理交互的途径,为未来研究提供路线图,使人形系统能够有效地理解、预测并与人类伙伴在多样化的现实环境中协作。

英文摘要

Physical Human-Humanoid Interaction (pHHI) is a rapidly advancing field with significant implications for deploying robots in unstructured, human-centric environments. In this review, we examine the current state of the art in pHHI through three core pillars: (i) humanoid modeling and control, (ii) human intent estimation, and (iii) computational human models. For each pillar, we survey representative approaches, identify open challenges, and analyze current limitations that hinder robust, scalable, and adaptive interaction. These include the need for whole-body control strategies capable of handling uncertain human dynamics, real-time intent inference under limited sensing, and modeling techniques that account for variability in human physical states. Although significant progress has been made within each domain, integration across pillars remains limited. We propose pathways for unifying methods across these areas to enable cohesive interaction frameworks. This structure enables us not only to map the current landscape but also to propose concrete directions for future research that aim to bridge these domains. Additionally, we introduce a unified taxonomy of interaction types based on modality, distinguishing between direct interactions (e.g., physical contact) and indirect interactions (e.g., object-mediated), and on the level of robot engagement, ranging from assistance to cooperation and collaboration. For each category in this taxonomy, we provide the three core pillars that highlight opportunities for cross-pillar unification. Our goal is to suggest avenues to advance robust, safe, and intuitive physical interaction, providing a roadmap for future research that will allow humanoid systems to effectively understand, anticipate, and collaborate with human partners in diverse real-world settings.

2511.00392 2026-05-19 cs.RO cs.AI cs.CV 版本更新

SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction via Plane Sweeping

SonarSweep: 通过平面扫描融合声纳与视觉以实现鲁棒的3D重建

Lingpeng Chen, Jiakun Tang, Apple Pui-Yi Chui, Ziyang Hong, Junfeng Wu

发表机构 * Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Chinese University of Hong Kong, Hong Kong(香港中文大学) Department of Automation, Harbin Institute of Technology(哈尔滨工业大学自动化系)

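AI Summary This paper proposes SonarSweep, an end-to-end deep learning framework that adapts the plane sweep algorithm to cross-modal fusion of sonar and visual data, overcoming the limits of single-modality 3D reconstruction in underwater environments and producing more accurate and stable depth maps.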

Comments 8 pages, 9 figures, conference

详情
AI中文摘要

在视觉退化的水下环境中实现准确的3D重建仍是一个严峻的挑战。单一模态方法不足:基于视觉的方法因可见性差和几何约束而失败,而声纳则因固有的高度歧义和低分辨率而受限。因此,先前的融合技术依赖于启发式方法和错误的几何假设,导致显著的伪影和无法建模复杂场景。在本文中,我们引入了SonarSweep,一种新颖的端到端深度学习框架,通过将原理性的平面扫描算法应用于声纳与视觉数据的跨模态融合,克服了这些限制。在高保真模拟和真实环境中的大量实验表明,SonarSweep能够一致地生成密集且准确的深度图,在挑战性条件下,特别是在高浊度情况下,显著优于最先进的方法。为了促进进一步研究,我们将公开我们的代码和一个新型的数据集,该数据集包含同步的立体相机和声纳数据,这是首次公开的此类数据集。

英文摘要

Accurate 3D reconstruction in visually degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Consequently, prior fusion techniques rely on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel, end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods across challenging conditions, particularly in high turbidity. To foster further research, we will publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind.

2510.26018 2026-05-19 cs.RO cs.AI 版本更新

RADRON: Cooperative Localization of Ionizing Radiation Sources by MAVs with Compton Cameras

RADRON:通过配备康普顿相机的微型飞行器进行离子化辐射源的协同定位

Petr Stibinger, Tomas Baca, Daniela Doubravova, Jan Rusnak, Jaroslav Solc, Jan Jakubek, Petr Stepan, Martin Saska

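AI Summary This work proposes a new method for cooperative localization of radioactive material by Micro Aerial Vehicles, using Compton cameras to estimate the radiation source position in real time with high sensitivity, even from sparse measurements.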

Comments 8 pages, 9 figures, submitted for review to IEEE RA-L

详情
AI中文摘要

我们提出了一种新型方法,通过合作微型飞行器(MAVs)定位放射性物质。我们的方法利用了最先进的单探测器康普顿相机,作为高灵敏度且微型的离子化辐射探测器。该探测器极低的重量(40克)为由协作敏捷MAVs进行的辐射检测开辟了新可能。我们提出了一种新的基本概念,将康普顿相机测量融合以实时估计辐射源位置,即使从极稀疏的测量中也能做到。数据读取和处理直接在机载上进行,结果用于动态反馈以驱动车辆运动。MAVs在紧密协作的群体中稳定,以最大化康普顿相机获取的信息,快速定位辐射源,甚至跟踪移动的辐射源。

英文摘要

We present a novel approach to localizing radioactive material by cooperating Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art single-detector Compton camera as a highly sensitive, yet miniature detector of ionizing radiation. The detector's exceptionally low weight (40 g) opens up new possibilities of radiation detection by a team of cooperating agile MAVs. We propose a new fundamental concept of fusing the Compton camera measurements to estimate the position of the radiation source in real time even from extremely sparse measurements. The data readout and processing are performed directly onboard and the results are used in a dynamic feedback to drive the motion of the vehicles. The MAVs are stabilized in a tightly cooperating swarm to maximize the information gained by the Compton cameras, rapidly locate the radiation source, and even track a moving radiation source.

2510.24680 2026-05-19 cs.RO 版本更新

InFeR: Informed Failure Resilience in Learned Visual Navigation Control

InFeR:在学习视觉导航控制中的有信息故障韧性

Zishuo Wang, Joel Loo, David Hsu

发表机构 * School of Computing & Smart Systems Institute, National University of Singapore(新加坡国立大学计算机学院与智能系统研究所)

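AI Summary This work proposes the InFeR framework, which restructures the latent space with a Variational Information Bottleneck loss to detect OOD failures and uses Grad-CAM to localize the failure source, enabling autonomous failure recovery without additional training data and improving the robustness of long-range navigation in complex environments.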

详情
AI中文摘要

尽管模仿学习(IL)已在许多常见环境中实现了成功的视觉导航,但在分布外(OOD)场景下,IL策略容易出现不可预测的故障。这需要具有故障韧性的策略,不仅能够检测故障,还能识别其来源并自主恢复。我们提出了InFeR,一种通用框架,用于构建具有有信息故障韧性的IL策略,而无需故障或恢复演示。InFeR通过变分信息瓶颈(VIB)损失重新训练IL策略,以结构化其潜在空间以检测OOD故障。它应用视觉可解释性技术Grad-CAM,以局部化图像区域作为故障源,并告知恢复的启发式策略。所有这些都在不需要额外训练数据的情况下实现。现实世界实验表明,InFeR在两种不同的策略架构上实现了有信息的故障恢复,从而在复杂环境中实现了稳健的长距离导航。

英文摘要

While imitation learning (IL) has enabled successful visual navigation in many common environments, IL policies are prone to unpredictable failures under out-of-distribution (OOD) scenarios. This necessitates failure-resilient policies, which not only detect failures, but also recognise their sources and recover from them autonomously. We propose InFeR, a general framework for building IL policies with informed failure resilience without failure or recovery demonstrations. InFeR retrains an IL policy with a Variational Information Bottleneck (VIB) loss to structure its latent space for OOD failure detection. It applies a visual explainability technique, Grad-CAM, to localise an image region as the source of failure and inform a heuristic policy for recovery. All these are achieved without requiring additional training data. Real-world experiments show that InFeR enables informed failure recovery across two different policy architectures, yielding robust long-range navigation in complex environments.
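The Variational Information Bottleneck loss named in the InFeR abstract is commonly written as a task loss plus a weighted KL term that pulls the encoder's latent distribution toward a standard normal prior. The sketch below shows that textbook form for a diagonal Gaussian encoder. The weight `beta`, the function names, and the use of a standard normal prior are assumptions; the abstract does not say how InFeR turns the structured latent space into an OOD score, so no detection rule is shown.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def vib_loss(task_loss, mu, log_var, beta=1e-3):
    """Textbook VIB objective: task loss plus beta times the KL compression term.

    beta is an assumed hyperparameter trading task accuracy against compression.
    """
    return task_loss + beta * kl_to_standard_normal(mu, log_var)
```

For an imitation learning policy, `task_loss` would be the usual behavior cloning loss, and `mu` and `log_var` would come from the encoder for the current camera image.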

2505.16278 2026-05-19 cs.CV cs.AI cs.RO 版本更新

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

DriveMoE:面向端到端自动驾驶的视觉-语言-动作混合专家模型

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

发表机构 * Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学计算机科学学院与人工智能学院) Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究院) Shanghai Key Laboratory of Multimodal Embodied AI(上海多模态具身人工智能重点实验室) AnyScale AI Project(AnyScale AI项目)

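AI Summary This paper proposes DriveMoE, a Mixture-of-Experts framework for end-to-end autonomous driving that handles complex driving scenarios with a scene-specialized Vision MoE and a skill-specialized Action MoE, showing the effectiveness of combining vision and action MoE in driving tasks.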

Comments Accepted by CVPR 2026, Project Page: https://thinklab-sjtu.github.io/DriveMoE/

详情
AI中文摘要

端到端自动驾驶(E2E-AD)需要有效处理多视角传感器数据和稳健处理多样且复杂的驾驶场景,特别是罕见的激进转弯等场景。最近混合专家(MoE)架构在大语言模型(LLMs)中的成功表明,参数的专业化能够实现强大的可扩展性。在本工作中,我们提出了DriveMoE,一种新的基于MoE的E2E-AD框架,包含场景专用的视觉MoE和技能专用的动作MoE。DriveMoE基于我们$π_0$视觉-语言-动作(VLA)基线(最初来自具身AI领域),称为Drive-$π_0$。具体而言,我们通过训练一个路由器,根据驾驶上下文动态选择相关摄像头,将视觉MoE添加到Drive-$π_0$中。这种设计模仿了人类驾驶认知,即司机选择性地关注关键视觉线索,而不是穷尽处理所有视觉信息。此外,我们通过训练另一个路由器来激活针对不同驾驶行为的专用专家模块,通过显式的行为专业化,DriveMoE能够处理多样化的场景而不受现有模型中模式平均的困扰。在Bench2Drive闭环评估实验中,DriveMoE实现了最先进的性能,证明了在自动驾驶任务中结合视觉和动作MoE的有效性。我们将发布DriveMoE和Drive-$π_0$的代码和模型。

英文摘要

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $π_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$π_0$. Specifically, we add Vision MoE to Drive-$π_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from the mode averaging seen in existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$π_0$.
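The camera-selecting router in the DriveMoE abstract follows the general top-k gating pattern used in Mixture-of-Experts models. The sketch below shows that pattern on hand-written scores: softmax over per-camera router scores, keep the k most probable cameras, and renormalize their weights. The camera names, the value of k, and the renormalization step are assumptions for illustration; in the paper the scores come from a trained router conditioned on the driving context.

```python
import math


def route_cameras(scores, k=2):
    """Top-k gating over cameras from raw router scores (higher is better)."""
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(exps.values())
    probs = {name: e / total for name, e in exps.items()}
    # Keep the k most probable cameras and renormalize their weights.
    chosen = sorted(probs, key=probs.get, reverse=True)[:k]
    norm = sum(probs[name] for name in chosen)
    return {name: probs[name] / norm for name in chosen}
```

With scores that favor the front and right cameras, only those two views would be passed on to the downstream model, which is the selective attention to relevant views that the abstract compares to human driving.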

2503.16492 2026-05-19 cs.HC cs.RO 版本更新

FAM-HRI: Foundation-Model Assisted Multi-Modal Human-Robot Interaction Combining Gaze and Speech

FAM-HRI: 基于基础模型的多模态人机交互框架,结合目光和语音

Yuzhi Lai, Shenghai Yuan, Peizheng Li, Boya Zhang, Benjamin Kiefer, Tianchen Deng, Andreas Zell

发表机构 * University of Tuebingen(图宾根大学) Nanyang Technological University(南洋理工大学) Mercedes-Benz(梅赛德斯-奔驰)

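AI Summary This paper proposes FAM-HRI, a multimodal human-robot interaction framework that uses foundation models to fuse language and gaze inputs, raising task success rates and interaction efficiency and offering a practical solution for people with physical impairments.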

Comments This work has been accepted for publication in IEEE Transactions on Automation Science and Engineering. © 2026 IEEE

详情
AI中文摘要

有效的机器人交互(HRI)对于增强现实世界中机器人应用的可访问性和可用性至关重要。然而,现有解决方案通常依赖于仅手势或仅语言的命令,使交互效率低下且模糊,尤其是对有肢体障碍的用户而言。在本文中,我们介绍了FAM-HRI,一种高效的多模态HRI框架,通过基础模型整合语言和目光输入。通过利用轻量级的Meta ARIA眼镜,我们的系统实时捕捉多模态信号,并利用大型语言模型(LLMs)将用户意图与场景上下文融合,实现直观且精确的机器人操作。我们的方法准确确定了目光固定时间间隔,减少了由于目光动态特性引起的噪声。实验评估表明,FAM-HRI在任务执行方面实现了高成功率,同时保持了低交互时间,为肢体活动受限或运动障碍者提供了实用的解决方案。为了支持社区,我们已发布系统设计、算法和解决方案,网址为https://github.com/laiyuzhi/FAM-HRI。

英文摘要

Effective Human-Robot Interaction (HRI) is crucial for enhancing accessibility and usability in real-world robotics applications. However, existing solutions often rely on gesture-only or language-only commands, making interaction inefficient and ambiguous, particularly for users with physical impairments. In this paper, we introduce FAM-HRI, an efficient multimodal framework for HRI that integrates language and gaze inputs via foundation models. By leveraging lightweight Meta ARIA glasses, our system captures real-time multimodal signals and utilizes large language models (LLMs) to fuse user intention with scene context, enabling intuitive and precise robot manipulation. Our method accurately determines the gaze fixation time interval, reducing noise caused by the dynamic nature of gaze. Experimental evaluations demonstrate that FAM-HRI achieves a high success rate in task execution while maintaining a low interaction time, providing a practical solution for individuals with limited physical mobility or motor impairments. To support the community, we have released our system design, algorithms, and solutions at https://github.com/laiyuzhi/FAM-HRI.

2503.13934 2026-05-19 cs.RO cs.AI 版本更新

COLSON: Controllable Learning-Based Social Navigation via Diffusion-Based Reinforcement Learning

COLSON: 通过基于扩散的强化学习实现可控的社会导航

Kohei Matsumoto, Yuki Tomita, Yuki Hyodo, Ryo Kurazume

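AI Summary This paper proposes a diffusion-based reinforcement learning method for social navigation whose flexible action distributions improve adaptability and controllability and allow adaptation to unseen scenarios.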

Comments ICRA 2026

详情
AI中文摘要

在动态环境中移动机器人导航面临行人交通的关键挑战,在自主移动服务机器人发展中尤为重要。最近,基于深度强化学习的方法被积极研究,并因其优化能力优于传统规则方法。其中,假设连续动作空间的方法通常依赖高斯分布,这限制了生成动作的灵活性。相比之下,将扩散模型应用于强化学习已取得进展,使动作分布比高斯策略方法更加灵活。在本研究中,我们应用基于扩散的强化学习方法进行社会导航,并验证其有效性。此外,通过利用扩散模型的特点,我们提出了能够适应以前未见过的场景而无需额外训练的扩展方法。作为具体场景示例,我们展示了适应环境中有静态障碍物的场景(这些障碍物在训练期间不存在),以及目标与训练不同的场景,例如在避免他人时陪同目标行人到达目的地。

英文摘要

Mobile robot navigation in dynamic environments with pedestrian traffic is a key challenge in the development of autonomous mobile service robots. Recently, deep reinforcement learning-based methods have been actively studied and have outperformed traditional rule-based approaches owing to their optimization capabilities. Among these methods, those that assume continuous action spaces typically rely on Gaussian distributions, which limit the flexibility of the generated actions. In contrast, the application of diffusion models to reinforcement learning has advanced, enabling more flexible action distributions than Gaussian policy-based approaches. In this study, we apply a diffusion-based reinforcement learning approach to social navigation and validate its effectiveness. Furthermore, by exploiting the characteristics of diffusion models, we propose extensions that enable adaptation to previously unseen scenarios without additional training. As concrete scenario examples, we demonstrate adaptability to scenarios in which static obstacles exist in the environment that were not present during training, as well as scenarios in which the objective differs from training, such as accompanying target pedestrians while avoiding others to reach the destination.

2605.18109 2026-05-19 cs.AI cs.CV cs.RO 版本更新

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

TaskGround:全场景家庭推理的结构化可执行任务推断

ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li, Keming Wu, Zhiying Du, Weijie Wang, Haoxiao Wang, Shuang Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

发表机构 * Tsinghua University(清华大学) Microsoft Research Asia(微软亚洲研究院) Peking University(北京大学) Fudan University(复旦大学) Zhejiang University(浙江大学)

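AI Summary This paper proposes TaskGround, a framework that improves full-scene household reasoning through structured task inference, and introduces the FullHome evaluation suite; the results confirm the importance of executable task-structure inference in household scenes and show that compact local models can be effective in practical household deployment.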

Comments Project page: https://aaronfengzy.github.io/TaskGround/

详情
AI中文摘要

在真实家庭部署中,家庭代理通常必须从完整的家庭场景和处于特定情境的家庭请求出发,而不是从干净的任务规范出发。此类请求要求代理识别与任务相关的实体,恢复意图的任务条件,并从周围场景上下文中解决顺序约束。我们正式将这种能力定义为全场景家庭推理:给定一个完整的家庭场景和一个处于特定情境的家庭请求,代理必须在生成接地技能级动作序列之前推断出可执行的任务结构。这种设置具有挑战性,因为完整的家庭场景包含大量与任务无关的信息,使直接完整场景提示效率低下且容易出错。在实际部署中,这一挑战进一步被隐私和本地计算限制放大,这些限制更倾向于紧凑的开放权重模型,其具有有限的长上下文推理能力。我们提出TaskGround,一种无需训练且模型无关的Ground-Infer-Execute框架,该框架将完整的场景接地为紧凑的任务相关场景切片,推断出可执行的任务结构,并将其编译为接地的技能级动作序列。为了评估这一设置,我们引入了FullHome,一个经过人类验证的400个家庭任务评估套件,涵盖多样化的家庭规模环境以及目标导向和过程约束要求。在FullHome上,TaskGround在专有和开放权重模型上均大幅提升了任务成功率。值得注意的是,它使Qwen3.5-9B在直接完整场景提示下与GPT-5竞争,同时将总输入token成本减少了多达18倍。我们的结果识别了执行任务结构推断为全场景家庭推理中的关键瓶颈,并表明结构化接地可以显著提高紧凑本地模型在实际家庭部署中的有效性。

英文摘要

In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.

2605.18074 2026-05-19 cs.RO 版本更新

4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

4DLidarOpen: 一个用于运动感知自动驾驶的开放4D FMCW激光雷达数据集

Kane Qian, Xin Zhao, Yining Shi, Rujun Yan, Zhengqing Pan, Kaojin Zhu, Mengmeng Yang, Kai Sun, Diange Yang, Kun Jiang

发表机构 * Tsinghua University(清华大学) Hesai Technology Co., Ltd.(禾赛科技有限公司)

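AI Summary This paper presents 4DLidarOpen, an autonomous driving dataset built on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. It includes point-wise radial velocity measurements, multiple Lidar types, surround-view cameras, and 6-DOF vehicle poses, uses a hybrid annotation strategy that pairs large-scale auto-labeling with human refinement, and establishes benchmarks for 3D object detection, bird's-eye view segmentation and flow prediction, and motion forecasting.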

Comments 15 pages, 9 figures

详情
AI中文摘要

我们提出了4DLidarOpen,一个大规模的开放多模态自动驾驶数据集,核心是基于4D频率调制连续波(FMCW)激光雷达传感。与传统飞行时间激光雷达数据集主要提供几何测量不同,4DLidarOpen包含来自前方4D FMCW激光雷达的点径向速度测量,以及多种类型的激光雷达,包括旋转、固态和盲 spot变种,环绕视图摄像头,以及6-DOF ego-vehicle姿态。该数据集在北京复杂城市环境中采集,涵盖了密集行人交互、拥堵交通、高速驾驶和无保护变道。4DLidarOpen提供同步多传感器数据和具有持久跟踪ID的3D边界框标注,跨五个物体类别。采用混合标注策略,其中大规模自动标注数据支持可扩展训练,而人类专家对人工标注的训练和验证集进行精修。基于此数据集,我们建立了3D目标检测、鸟瞰图(BEV)分割和流预测以及运动预测的基准测试。大量实验表明,直接来自4D FMCW激光雷达的速度测量为动态场景理解提供了互补的运动线索。与仅几何感知相比,速度感知表示提高了运动相关感知和下游预测和规划,特别是在涉及易受伤害道路使用者和快速移动物体的场景中。这些结果表明,4D FMCW激光雷达是运动感知自动驾驶的有前途的感知模式。数据集和评估工具包已公开发布,以支持4D场景理解、多激光雷达融合和速度感知感知和规划的研究。

英文摘要

We present 4DLidarOpen, a large-scale open multi-modal dataset for autonomous driving, centered on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. Unlike conventional time-of-flight Lidar datasets that mainly provide geometric measurements, 4DLidarOpen includes point-wise radial velocity measurements from a forward-facing 4D FMCW Lidar, together with multiple Lidars of different types, including rotating, solid-state, and blind-spot variants, surround-view cameras, and 6-DOF ego-vehicle poses. The dataset was collected in complex urban environments in Beijing and covers dense pedestrian interactions, congested traffic, high-speed driving, and unprotected maneuvers. 4DLidarOpen provides synchronized multi-sensor data and 3D bounding-box annotations with persistent track IDs across five object categories. A hybrid annotation strategy is adopted, where large-scale auto-labeled data support scalable training and human experts refine annotations for the human-annotated training and validation sets. Based on this dataset, we establish benchmarks for 3D object detection, bird's-eye view (BEV) segmentation and flow prediction, and motion forecasting with planning. Extensive experiments show that direct velocity measurements from 4D FMCW Lidar provide complementary motion cues for dynamic-scene understanding. Compared with geometry-only sensing, the velocity-aware representation improves motion-related perception and downstream forecasting and planning, especially in scenarios involving vulnerable road users and fast-moving objects. These results indicate that 4D FMCW Lidar is a promising sensing modality for motion-aware autonomous driving. The dataset and evaluation toolkit are publicly released to support research on 4D scene understanding, multi-Lidar fusion, and velocity-aware perception and planning.

2605.18059 2026-05-19 cs.RO 版本更新

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

Bench2Drive-Robust: 在部署扰动下闭环自动驾驶的基准测试

Zhiyuan Zhang, Zhenghao Jin, Yanlun Peng, Xianda Guo, Haoran Liu, Shaofeng Zhang, Xingjun Ma, Zuxuan Wu, Junchi Yan, Xiaosong Jia, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI (TEAI)(可信具身人工智能研究院) Great Wall Motor(长城汽车) Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学计算机学院及人工智能学院) School of Computer Science, Wuhan University(武汉大学计算机学院) University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出Bench2Drive-Robust,据作者所知,这是首个针对现实部署扰动下闭环端到端自动驾驶的、以设备为中心的鲁棒性基准测试。该基准评估了三种主要来源的部署相关扰动对自动驾驶系统的影响,揭示了传统图像级损坏评估未能完全捕捉的鲁棒性挑战。

详情
AI中文摘要

鲁棒性是在现实世界中部署自动驾驶系统的关键要求。现有的自动驾驶鲁棒性基准测试在研究图像级损坏(如恶劣天气或摄像头退化)对感知模块和开环规划输出的影响方面取得了重要进展。然而,部署还可能涉及系统级缺陷,如推理延迟和自车状态估计误差,这些在闭环端到端自动驾驶(E2E-AD)评估中仍较少被研究。这些缺陷会通过反馈回路累积并导致控制失稳。在本文中,我们提出了Bench2Drive-Robust,据我们所知,这是首个针对现实部署扰动下闭环端到端自动驾驶的、以设备为中心的鲁棒性基准测试。我们系统地评估了来自三个主要来源的面向部署的扰动:摄像头数据流故障(丢帧、部分观测)、自车状态估计误差(GPS噪声,以及速度或里程计误差)和计算引起的控制延迟(模型推理延迟)。我们评估了具有代表性的端到端驾驶方法,并分析它们在不同扰动严重程度下的鲁棒性。我们的结果表明,这些与部署相关的扰动会显著降低闭环驾驶性能,揭示了传统图像级损坏评估未能完全捕捉的鲁棒性挑战。通过建立闭环评估协议并展示这些面向部署的扰动的实质性影响,Bench2Drive-Robust为端到端自动驾驶界定了实际的鲁棒性问题,并鼓励进一步研究具备部署意识的鲁棒驾驶系统。

英文摘要

Robustness is a critical requirement for deploying autonomous driving systems in the real world. Existing robustness benchmarks for autonomous driving have made important progress in studying the effects of image-level corruptions, such as adverse weather or camera degradation, on perception modules and open-loop planning outputs. However, deployment can also involve system-level imperfections, such as inference latency and ego-state estimation errors, which remain less studied in closed-loop E2E-AD evaluation. These imperfections can accumulate through the feedback loop and destabilize control. In this work, we present Bench2Drive-Robust, to our knowledge the first device-centric robustness benchmark for closed-loop end-to-end autonomous driving under realistic deployment perturbations. We systematically evaluate deployment-oriented perturbations arising from three major sources: camera-stream failures (frame drop, partial observation), ego-state estimation errors (GPS noise, and speed or odometry errors), and compute-induced control delay (model inference delay). We evaluate representative end-to-end driving methods and analyze their robustness under different perturbation severities. Our results show that these deployment-related perturbations can substantially degrade closed-loop driving performance, revealing robustness challenges that are not fully captured by conventional image-level corruption evaluations. By establishing a closed-loop evaluation protocol and demonstrating the substantial impact of these deployment-oriented perturbations, Bench2Drive-Robust defines practical robustness problems for end-to-end autonomous driving and encourages further research on deployment-aware robust driving systems.
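
The three perturbation sources named in the abstract can be pictured as small wrappers placed around a closed-loop agent. The sketch below is illustrative only and is not the benchmark's code: the class names, the stale-frame behavior on a drop, the Gaussian GPS noise model, and the fixed-length delay buffer are all assumptions.

```python
import random
from collections import deque

class FrameDrop:
    """Camera-stream failure: with probability drop_prob, repeat the last frame."""
    def __init__(self, drop_prob, seed=0):
        self.drop_prob = drop_prob
        self.rng = random.Random(seed)
        self.last = None

    def __call__(self, frame):
        if self.last is not None and self.rng.random() < self.drop_prob:
            return self.last  # the agent sees a stale observation
        self.last = frame
        return frame

class GpsNoise:
    """Ego-state error: add zero-mean Gaussian noise (metres) to a planar position."""
    def __init__(self, sigma_m, seed=0):
        self.sigma_m = sigma_m
        self.rng = random.Random(seed)

    def __call__(self, xy):
        return (xy[0] + self.rng.gauss(0.0, self.sigma_m),
                xy[1] + self.rng.gauss(0.0, self.sigma_m))

class ControlDelay:
    """Compute-induced delay: the action applied now was computed delay_steps ago."""
    def __init__(self, delay_steps, idle_action):
        self.buffer = deque([idle_action] * delay_steps)

    def __call__(self, action):
        self.buffer.append(action)
        return self.buffer.popleft()
```

In a closed loop, each wrapper alters what the policy sees or when its output takes effect, so an error at one step changes the state the policy faces at the next step. That feedback path is what open-loop, image-level corruption tests do not exercise.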

2605.18045 2026-05-19 cs.RO cs.AI 版本更新

Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help?

置信度门控机器人自主性:不确定性何时真的有帮助?

Johannes A. Gaus, Jhon P. F. Charaja, Daniel Haeufle

发表机构 * Hertie Institute for Clinical Brain Research & Center for Integrative Neuroscience, University of Tübingen(赫尔特研究所临床脑研究与整合神经科学中心,图宾根大学)

AI总结 本文研究了不确定性在机器人自主性决策中的作用,发现当基础模型具备一定能力时,简单的不确定性代理足以实现选择性门控,但无法用于语义新颖性检测。

Comments ICRA 2026 workshop paper

详情
AI中文摘要

机器人系统常常使用预测不确定性来决定是自主行动还是交由备用策略处理。在阈值门控自主性中,不确定性主要通过其对可能错误进行排序的能力发挥作用。预期校准误差和AUROC等标准指标并不直接检验不确定性是否会改变行动/退避决策。因此,我们使用斯皮尔曼等级相关、配对bootstrap等价检验和行动/退避一致率来评估不确定性。在三个时序活动识别基准上,我们发现存在一个依赖于数据集的胜任区间,低于该区间时,不确定性只能提供微弱且不稳定的错误排序。高于该区间时,softmax启发式方法、MC Dropout和集成模型产生相似的门控行为,而阈值选择对执行结果的影响要大得多。一项多随机种子的具身仿真表明,在实际达到的自主率对齐之后,碰撞率和成本也呈现相同的模式。在时序协变量偏移下,排序质量保持稳定,但细粒度语义OOD检测仍接近随机水平。这些结果表明,一旦基础模型足够胜任,简单的不确定性代理足以实现选择性门控,但不足以用于语义新颖性检测。

英文摘要

Robotic systems often use predictive uncertainty to decide whether to act autonomously or defer to a fallback policy. In threshold-gated autonomy, uncertainty matters mainly through its ability to rank likely errors. Standard metrics such as expected calibration error and AUROC do not directly test whether uncertainty changes act/defer decisions. We therefore evaluate uncertainty using Spearman rank correlation, paired bootstrap equivalence testing, and act/defer agreement. Across three temporal activity-recognition benchmarks, we find a dataset-dependent competence regime below which uncertainty provides a weak and unstable error ranking. Above this regime, softmax heuristics, MC Dropout, and ensembles produce similar gating behavior, while threshold choice has a much larger effect on execution outcomes. A multi-seed embodied simulation shows the same pattern for collision rate and cost once realized autonomy is matched. Under temporal covariate shift, ranking quality remains stable, but fine-grained semantic OOD detection remains near chance. These results suggest that simple uncertainty proxies can suffice for selective gating once the base model is competent, but not for semantic novelty detection.
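
Two of the quantities named in the abstract, Spearman rank correlation and act/defer agreement, can be sketched in a few lines of standard-library Python. This is an illustration under assumptions rather than the authors' evaluation code: the gating rule "act when uncertainty is at or below the threshold" and all function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def gate(uncertainty, threshold):
    """Assumed rule: act autonomously when uncertainty <= threshold, else defer."""
    return [u <= threshold for u in uncertainty]

def act_defer_agreement(u_a, thr_a, u_b, thr_b):
    """Fraction of samples on which two uncertainty signals make the same decision."""
    da, db = gate(u_a, thr_a), gate(u_b, thr_b)
    return sum(a == b for a, b in zip(da, db)) / len(da)
```

Correlating an uncertainty signal with per-sample error indicators measures how well it ranks likely errors, while the agreement rate asks the more direct question of whether swapping one uncertainty estimator for another would change any act/defer decision.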

2605.18026 2026-05-19 cs.RO 版本更新

Scenario Generation in Roundabouts with Adjustable Interaction Intensity

交互强度可调的环形交叉口场景生成

Li Li, Till Temmen, Tobias Brinkmann, Björn Krautwig, Markus Eisenbarth, Jakob Andert

发表机构 * Chair of Mechatronics in Mobile Propulsion(移动动力机电一体化教席)

AI总结 本文提出了一种具有可调节交互强度的环形交叉口场景生成器,通过解耦几何路线和时间进度轮廓,并利用预训练的自编码器映射到潜在代码,再通过Wasserstein生成对抗网络生成场景,从而提高时间-潜在空间的保真度和交互响应的合理性,增强了安全测试的可控性和可扩展性。

详情
AI中文摘要

环形交叉口以频繁的汇入和让行交互为特征,仍然是智能驾驶功能开发和测试中安全关键的边缘场景。然而,从自然驾驶数据中提取足够多的近临界场景效率低下。大多数现有场景生成方法对交互强度和临界性的可控性有限,使得系统化的安全测试和详细分析难以开展。本文提出了一种交互感知的环形交叉口场景生成器,其交互强度连续可调。首先,将几何路线和时间进度曲线解耦,并使用预训练的自编码器将二者映射为潜在编码。然后,使用Wasserstein生成对抗网络(WGAN)进行条件潜在生成,以生成场景。让行被建模为接近入口路段上的一种可控时序干预,由一个紧凑的让行编码实现,交互强度则通过以因子$λ$缩放该编码来调节。结果表明,与基线模型相比,该方法提高了时序潜在编码的保真度,并给出了合理的交互响应。在经临界性校准的缩放下,增大$λ$会扩大安全裕度,从而提供一种可扩展且受控的测试机制。

英文摘要

Roundabouts, characterized by frequent merging and yielding interactions, remain a safety-critical corner case for the development and testing of intelligent driving functions. However, extracting sufficient near-critical scenarios from naturalistic data is inefficient. Most existing scenario generation methods provide limited controllability over interaction intensity and criticality, making systematic safety testing and detailed analysis difficult. This paper presents an interaction-aware roundabout scenario generator with continuously adjustable interaction intensity. Geometric routes and temporal progress profiles are first decoupled and mapped to latent codes using pretrained autoencoders. Conditional latent generation is then performed with Wasserstein Generative Adversarial Networks (WGAN) to generate scenarios. Yielding is modeled as a controllable timing intervention via a compact yield code during the approach-to-entry segment, where interaction intensity is modulated by scaling the code with a factor $λ$. Results demonstrate enhanced timing-latent fidelity and plausible interaction responses compared to a baseline model. Under criticality-calibrated scaling, increasing $λ$ expands the safety margin, providing a scalable and controlled testing mechanism.

2605.17984 2026-05-19 eess.IV cs.CV cs.RO 版本更新

See Silhouettes in Motion with Neuromorphic Vision

用神经形态视觉感知运动中的轮廓

Pei Zhang, Shijie Lin, Zhou Ge, Jinpeng Chen, Wei Pu

发表机构 * School of Electrical Engineering, Guangxi University(广西大学电气工程学院) Department of Computer Science, The University of Hong Kong(香港大学计算机科学系) School of Mechatronic Engineering and Automation, Shanghai University(上海大学机电工程与自动化学院) SHU General Intelligent Robotics Research Institute(SHU通用智能机器人研究院) School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications(北京邮电大学计算机科学学院(国家试点软件学院)) School of Information and Communication Engineering, University of Electronic Science and Technology of China(电子科技大学信息与通信工程学院)

AI总结 本文提出了一种双模方法,利用帧和事件的协同作用,在仅CPU的设备上实现实时高帧率二值化,有效减少运动模糊并提升在恶劣光照下的性能,为资源受限边缘平台的轻量感知和交互铺平道路。

Comments 12 pages, 12 figures, and 3 tables. This work is under review. Project page: https://github.com/pz-even/event_binarization

详情
AI中文摘要

文本、道路标志和条形码等准双峰(quasi-bimodal)物体在日常视觉交流中发挥着基础而关键的作用。二值化将它们简化为清晰的轮廓,以极简的表达传递必要的视觉线索,从而最大化下游效率。然而,基于帧的成像在无人机、自动驾驶汽车和水下航行器等移动平台上往往力不从心。在这些动态场景中,快速运动和恶劣光照会使成像失效,造成严重的运动模糊并抹去关键细节。为克服这些限制,基于事件相机的神经形态视觉凭借微秒级时间分辨率和高动态范围,成为自然的解决方案。在这一事件驱动的感知范式基础上,我们提出了一种简单而有效的双模态方法,利用帧与事件之间的协同作用,在仅有CPU的设备上实现实时、高帧率的二值化。广泛的评估表明,该方法在减少运动模糊方面可与领先技术相媲美,并在具有挑战性的光照条件下带来显著改进。此外,我们的异步工作流程规避了会使传统时间分箱重建失效的事件稀缺问题,即使在千赫兹级的极高帧率下也能保持清晰的目标形状。其二值化结果还可作为可靠的表示,服务于多种下游任务。本文为资源受限边缘平台上具身智能的轻量化感知与交互铺平了道路。

英文摘要

Quasi-bimodal objects, such as text, road signs, and barcodes, play a basic yet vital role in daily visual communication. By boiling these down to clear silhouettes, binarization uses a minimal language to convey essential vision cues for maximum downstream efficiency. The catch is that frame-based imaging often struggles on mobile platforms like drones, self-driving cars, and underwater vehicles. In these dynamic scenes, rapid motion and harsh lighting can blind the camera, causing severe motion blur and erasing crucial details. To overcome these limits, neuromorphic vision via event cameras, featuring microsecond-level temporal resolution and high dynamic range, steps in as a natural solution. Building upon this event-driven sensing paradigm, we introduce a simple yet effective dual-modal approach that harnesses the synergy between frames and events to achieve real-time, high-frame-rate binarization on CPU-only devices. Extensive evaluations show that it achieves competitive performance against leading techniques in reducing motion blur, while delivering impressive improvements under challenging illumination. Besides, our asynchronous workflow bypasses the event scarcity that breaks traditional time-binning reconstruction, maintaining clear target shapes even at extreme kilohertz frame rates. Its binary results further serve as reliable representations that facilitate a range of downstream tasks. This work paves the way towards lightweight perception and interaction in embodied intelligence on resource-constrained edge platforms.

2605.17928 2026-05-19 cs.RO cs.LG 版本更新

Transfer Learning for Customized Car Racing Environments

迁移学习用于定制化的赛车环境

Benedict Florance Arockiaraj, Richard Chang, Wesley Yee

发表机构 * seas(系统工程与科学学院)

AI总结 本文研究了迁移学习在深度强化学习中的应用,旨在通过在单一赛道上训练智能体,再经零样本迁移或进一步微调,在其他定制化赛车环境中获得更快的圈速,并比较了基于模型和无模型方法的性能。

详情
AI中文摘要

迁移学习是一种让模型/智能体利用在一项任务中获得的知识/专长来解决另一项密切相关任务的技术,常用于解决深度学习中的问题。通过本项目,我们探讨迁移学习在深度强化学习中的应用。具体而言,我们希望利用迁移学习在OpenAI的Car Racing环境中实现较快的圈速:先在单一赛道上训练智能体,再通过零样本迁移或额外微调,让它在其他定制的目标环境中比赛。此外,我们比较了基于模型和无模型方法的性能,观察到在该环境中基于模型的方法性能占优,且比无模型方法收敛得更快。我们还观察到,在大多数设置下,迁移学习不仅提升了目标域上的性能,而且在学习过程中也保持了较高的性能。

英文摘要

Transfer learning, a technique in which a model/agent exploits the knowledge/expertise gained from one task to solve another closely related task, is often used to tackle problems in deep learning. Through this project, we explore transfer learning in the purview of deep reinforcement learning. Specifically, we want to use transfer learning to achieve fast lap times in OpenAI's Car Racing environment by training the agent on one circuit and racing it on other customized target environments, either by zero-shot transfer or with additional fine-tuning. In addition, we compare the performance of model-based and model-free approaches, and observe that model-based approaches dominate in performance and converge faster than model-free approaches in this environment. We observe that, in most setups, transfer learning not only boosts performance on the target domain but also maintains high performance during learning.

2605.17927 2026-05-19 cs.RO 版本更新

Learning-Based Adaptive Control for Surgical Robotic Exposure Task on Deformable Tissues

面向可变形组织手术机器人暴露任务的基于学习的自适应控制

Jiayi Liu, Kaiqi Wei, Yiwei Wang, Huan Zhao, Han Ding

发表机构 * Huazhong University of Science and Technology(华中科技大学)

AI总结 本文提出了一种基于学习的自适应控制框架,用于解决手术中因覆盖组织的不规则几何形状、非线性生物力学特性及有限视野导致的自动组织牵开挑战,通过在线优化控制输入和深度变形估计模型实现零样本适应。

Comments Accepted to Robotics: Science and Systems (RSS) 2026. 12 pages, 9 figures

详情
AI中文摘要

在各类外科手术中,器官或病变等感兴趣区域(ROI)常被覆盖组织遮挡,外科医生需要实现充分暴露才能进行精确干预。然而,覆盖组织不规则的几何形状和非线性的生物力学特性,以及术中ROI的有限可见性,给组织牵开的自主执行带来了重大挑战。为此,我们为组织牵开任务建立了一个贴近实际的模型,并提出了一种基于学习的自适应控制框架来实现ROI暴露。该方法通过监测组织视觉边界的变化在线优化控制输入,同时利用在仿真数据上训练的深度形变估计模型来确定最优抓取点,并保证自适应控制器的收敛性和安全性。在不同可变形材料上的仿真和真实实验表明,该框架能够零样本适应相似任务,并完成从初始抓取点选择到ROI完全暴露的自主牵开过程。因此,它有潜力应用于实际的手术辅助场景。

英文摘要

In various surgical procedures, regions of interest (ROIs) such as organs or lesions are often occluded by overlying tissues, requiring surgeons to achieve adequate exposure for precise intervention. However, the irregular geometry and nonlinear biomechanical properties of overlying tissues, together with the limited intraoperative visibility of the ROI, pose significant challenges to the autonomous execution of tissue retraction. To address this, we formulate a realistic model of the tissue retraction task and propose a learning-based adaptive control framework for achieving ROI exposure. The method optimizes control inputs online by monitoring changes in the visual boundary of the tissue, while leveraging a deep deformation estimation model trained on simulation data to identify the optimal grasping point and ensure the convergence and safety of the adaptive controller. Simulations and real-world experiments on different deformable materials demonstrate that the framework adapts zero-shot to similar tasks and completes the autonomous retraction process, from initial grasp selection to full ROI exposure. It therefore has the potential to be applied in actual surgical assistance scenarios.

2605.17912 2026-05-19 cs.RO cs.CV 版本更新

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

WorldArena 2.0: 扩展模态、功能和平台的具身世界模型基准测试

Yu Shang, Yinzhou Tang, Yiding Ma, Zhuohang Li, Lei Jin, Weikang Su, Xin Jin, Zhaolu Wang, Ziyou Wang, Xin Zhang, Haisheng Su, Weizhen He, Wei Wu, Haoyi Duan, Gordon Wetzstein, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Chen Gao, Yong Li

发表机构 * Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学) Stanford University(斯坦福大学) The University of Hong Kong(香港大学) Princeton University(普林斯顿大学) Chinese Academy of Sciences(中国科学院) University of Science and Technology of China(中国科学技术大学) Peking University(北京大学) National University of Singapore(新加坡国立大学)

AI总结 本文提出WorldArena 2.0,扩展了具身世界模型的评估,涵盖模态、功能和平台三个维度,提供全面的测试平台以评估具身世界模型的进展。

详情
英文摘要

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned future and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 comprehensively evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.

2605.17851 2026-05-19 cs.RO 版本更新

A Dexterous and Compliant Gripper With Soft Hydraulic Actuation for Microgravity Manipulation

一种用于微重力操作的软液压驱动灵巧柔顺夹爪

William Su, Jordan Kam, Yixiao Wang, Jianshu Zhou

发表机构 * Aerospace Engineering Program, University of California, Berkeley(加州大学伯克利分校航空航天工程系) Department of Mechanical Engineering, University of California, Berkeley(加州大学伯克利分校机械工程系) Department of Mechanical Engineering, National University of Singapore(新加坡国立大学机械工程系)

AI总结 本文提出将DexCoHand灵活的双指六自由度机械手与Astrobee自由飞行机器人集成,以实现微重力环境下的灵活操作,该机械手在保持稳定接触的同时减少了对自由飞行基底的干扰,提高了操作的连续性和适应性。

Comments Accepted to the IEEE ICRA 2026 Space Robotics Workshop (SRW). 4 pages, 3 figures

详情
AI中文摘要

Astrobee现有的单自由度(DOF)欠驱动柔顺爪式夹爪能够在国际空间站(ISS)上抓握栖停,但其连续灵巧操作的能力有限。更复杂的微重力任务需要一个既能保持稳定接触、又能限制对自由飞行基座扰动的末端执行器,因为接触力会直接耦合到基座运动中。本文介绍了将DexCoHand(一种灵巧且柔顺的双指六自由度夹爪)与Astrobee自由飞行机器人集成,以实现微重力操作。该系统在MuJoCo中使用Astrobee的标准扶手栖停序列进行评估,包括接近、栖停以及随后的水平转动(pan)和俯仰(tilt)运动。与Astrobee现有的夹爪相比,DexCoHand在保持指令给定的水平转动和俯仰运动的同时,减少了非预期的跨轴基座运动。地面硬件实验进一步展示了DexCoHand的灵巧操作能力,以及其在适应性更强的智能操作任务中的潜力。

英文摘要

Astrobee's existing one-degree-of-freedom (DOF) underactuated compliant claw gripper enables perching on the International Space Station (ISS), but provides limited capability for continuous dexterous manipulation. More complex microgravity tasks require an end-effector that can maintain stable contact while limiting disturbance to the free-flying base, since contact forces directly couple into base motion. This article presents the integration of DexCoHand, a dexterous and compliant two-finger, 6-DOF gripper, with the Astrobee free-flying robot for microgravity manipulation. The system is evaluated in MuJoCo using Astrobee's standard handrail perching sequence, including approach, perching, and subsequent pan and tilt motions. Compared with Astrobee's existing gripper, DexCoHand preserves the commanded pan and tilt motions while reducing unintended cross-axis base motion. Hardware experiments on Earth further demonstrate DexCoHand's dexterous manipulation capabilities and its potential for more adaptable intelligent manipulation tasks.

2605.17815 2026-05-19 cs.RO cs.AI 版本更新

Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement

有序混沌之益:桌面堆叠重排中使用Topple动作的规划

Hao Lu, Rahul Shome

发表机构 * School of Computing at the Australian National University(澳大利亚国立大学计算学院)

AI总结 本文研究了桌面环境中的堆叠重排任务,通过引入更丰富的非抓取式聚合动作(特别是将物体从堆叠上推倒到桌面的Topple动作)来扩充任务规划域。核心方法是提出一种新的Topple聚合构件(gadget),将候选任务计划的计算转化为Pebble Motion问题的变体,并在基于IsaacSim的物理仿真中进行了验证,在执行速度上展现出明显优势。

Comments 8 pages, 7 figures

详情
AI中文摘要

高效的物体操作策略对自动化应用有重大影响。本文研究桌面环境中的堆叠重排任务,重点是用更丰富的非抓取式聚合动作(特别是将物体从堆叠上推倒到桌面的Topple动作)来扩充任务规划域。Topple可以压缩冗长的中间搬运动作序列。计算出的计划需要根据具体问题,在整个计划中交错执行抓取-放置动作和Topple动作。为了生成任务计划,并建立一种能够计算同时包含抓取-放置和Topple动作的解的抽象,本文引入了一种新的Topple聚合构件(gadget)。借助这一有向图抽象,候选任务计划的计算成为Pebble Motion问题的一个变体,其中物体被视为石子。随后,本文报告了在基于IsaacSim的物理仿真中的基准测试。结果表明,与仅使用抓取-放置动作相比,该方法在执行速度上具有明显优势。尽管本文主要研究Topple动作,但我们证明了类似的抽象也可以建模其他值得关注的聚合动作,如Scoop。本文的工作初步而有力地表明,抽象对于操作应用中丰富的物体交互具有可观的潜在益处。

英文摘要

Efficient object manipulation strategies have a significant impact in automation applications. In this work, stack rearrangement in tabletop settings is studied, with a focus on augmenting the task planning domain with richer nonprehensile aggregating actions, in particular the toppling of objects from a stack to the table. Toppling can compress long sequences of intermediate relocations. Computed plans need to interleave pick-and-place actions with topple actions throughout the plan, as the problem requires. In order to generate the task plan and model an abstraction to compute solutions that include both pick-and-place and topple actions, a novel aggregating gadget for topple is introduced. Using this directed graphical abstraction, candidate task plan computation becomes a variant of the pebble motion problem, treating objects as pebbles. Benchmarks are then reported in an IsaacSim-based physics simulation. Results show clearly faster execution than plans that use only pick-and-place actions. Though this work primarily investigates the topple action, we demonstrate that similar abstractions can model other aggregating actions of interest, like scoop. The current work provides a preliminary but strong indication of the promising benefits of abstractions for rich object interactions in manipulation applications.

2605.17800 2026-05-19 cs.RO cs.AI 版本更新

Optimal Knock-Pick Planning for Tightly Packed Tabletop Blocks With Parallel Grippers

紧密排列桌面积木的最优敲击抓取规划

Hao Lu, Rahul Shome

发表机构 * School of Computing(计算学院) Australian National University(澳大利亚国立大学)

AI总结 研究在平行夹具无法在物体周围获得足够空隙时,如何通过引入方向性敲击原语来优化敲击抓取策略,以减少动作数量。

Comments Accepted by WAFR 2026, 18 pages, 6 figures

详情
AI中文摘要

在平行夹具无法在物体周围获得足够空隙时,重新排列紧密堆积的桌面物体具有挑战性。本文研究了在实际应用中,均匀大小的积木放置在平面桌面网格位置时的问题特性。由于纯粹的抓取移除可能不可行,因此引入了方向性敲击原语,并将该问题的最优敲击抓取变体进行了建模。本文提出了一系列抽象,其中通过覆盖最小约束装置来识别必要的敲击。利用图抽象上的最大权重完美匹配,可以高效地在多项式时间内计算最优计划,以最小化动作数量。在合成环境以及IsaacSim中报告了随着网格大小增加的实验结果。理论观察为构建高效操作策略提供了有前途的基石,这些策略可以交错抓取和非抓取动作。

英文摘要

Rearranging densely packed tabletop objects is challenging when parallel-gripper picks are infeasible without sufficient clearance around an object. This work studies the problem characteristics for practically motivated settings with uniformly sized blocks placed at planar tabletop grid locations. Since purely prehensile removal can become infeasible, a directional knock primitive is introduced and the optimal knock-pick variant of the problem is formulated. The work proposes a series of abstractions wherein minimal constraining gadgets are covered to identify the necessary knocks. Utilizing a maximum-weight perfect matching on a graphical abstraction yields efficient polynomial-time computation of the optimal plan that minimizes the number of actions. Experiments are reported for increasing grid sizes in synthetic settings as well as in IsaacSim. The theoretical observations provide a promising stepping stone towards rigorously building efficient manipulation strategies that interleave prehensile and non-prehensile actions.

2605.17681 2026-05-19 cs.RO 版本更新

PRIME: Physically-consistent Robotic Inertial and Motion Estimation for Legged and Humanoid Robots

PRIME: 面向腿式与人形机器人的物理一致惯性与运动估计

Jiarong Kang, Kunzhao Ren, Tao Pang, Xiaobin Xiong

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Shanghai Innovation Institute(上海创新研究院)

AI总结 该研究提出PRIME方法,通过结合可微接触动力学和平滑互补约束,从机载传感器数据中获得物理一致的运动轨迹和惯性参数估计,从而提升机器人运动估计的准确性。

Comments Robotics: Science and Systems 2026

详情
AI中文摘要

人形机器人和腿式机器人通过间歇性接触与环境交互,因此准确的运动估计从根本上依赖于对接触动力学的推理。然而,标准的传感流程,无论是基于机载本体感觉的扩展卡尔曼滤波器(EKF)还是外部运动捕捉系统,都只能恢复运动学信息,而接触力、接触时序和惯性参数仍未被观测。因此,纯运动学重建往往违反刚体动力学,在接触丰富的运动中尤为明显。为了在实际部署中从机载运动学数据实现准确的运动估计,我们提出PRIME(Physically-consistent Robotic Inertial and Motion Estimation),一种最大后验(MAP)公式,将测量的运动学和执行器指令细化为动力学一致的轨迹,同时联合估计摩擦接触力和物理一致的惯性参数。我们的方法将可微接触动力学与平滑互补约束和Anitescu风格的摩擦模型相结合,得到一个在多样的接触切换中仍然可解的平滑优化问题。我们在四足机器人和Unitree G1人形机器人的接触丰富运动上评估了PRIME,展示了更好的轨迹一致性和准确的惯性参数辨识。除了借助校准后的惯性参数改进状态估计和反馈控制外,PRIME还能从实际部署的真实机器人中生成带有力和接触标注的运动重建,可为下游学习应用(包括大规模行为建模和机器人基础模型)提供高质量数据。

英文摘要

Humanoid and legged robots interact with the environment through intermittent contacts, making accurate motion estimation fundamentally dependent on reasoning about contact dynamics. However, standard sensing pipelines, whether based on onboard proprioception with Extended Kalman Filters (EKFs) or external motion capture systems, recover only kinematics, while contact forces, contact timing, and inertial parameters remain unobserved. As a result, purely kinematic reconstructions often violate rigid-body dynamics, particularly during contact-rich motions. To enable accurate motion estimation from onboard kinematics in real-world deployment, we propose PRIME (Physically-consistent Robotic Inertial and Motion Estimation), a Maximum A Posteriori (MAP) formulation that refines measured kinematics and actuator commands into a dynamically consistent trajectory while jointly estimating frictional contact forces and physically consistent inertial parameters. Our approach incorporates differentiable contact dynamics with smoothed complementarity constraints and an Anitescu-style friction model, yielding a smooth optimization problem that remains tractable across versatile contact transitions. We evaluate PRIME on contact-rich locomotion with quadrupedal robots and the Unitree G1 humanoid, demonstrating improved trajectory consistency and accurate inertial parameter identification. Beyond improving state estimation and feedback control with calibrated inertial parameters, PRIME produces force- and contact-annotated motion reconstructions from real robots in deployment, which can be used to provide high-quality data for downstream learning applications, including large-scale behavior modeling and robot foundation models.
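
The abstract refers to smoothed complementarity constraints without stating which smoothing is used. As one common possibility, and purely as an assumption here, the sketch below uses a smoothed Fischer-Burmeister function to encode the contact condition "gap >= 0, normal force >= 0, and at least one of them is zero". The function name and the `eps` parameter are hypothetical and are not taken from PRIME.

```python
import math

def smoothed_fischer_burmeister(gap, force, eps=0.0):
    """Residual of the complementarity condition 0 <= gap, 0 <= force, gap * force = 0.

    With eps = 0 the residual is zero exactly when the condition holds.
    With eps > 0 the square root never reaches zero, so the function is
    differentiable everywhere, including at gap = force = 0.
    """
    return gap + force - math.sqrt(gap * gap + force * force + eps)
```

A gradient-based optimizer can then drive this residual toward zero at every contact and time step. Larger values of `eps` give a smoother but looser approximation that permits a small force at a small positive gap, which is the usual trade-off when making contact dynamics differentiable.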

2605.17661 2026-05-19 cs.RO cs.CV 版本更新

Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping

Mono-Hydra++: 基于多任务学习的实时单目场景图构建用于3D室内映射

U. V. B. L. Udugama, George Vosselman, Francesco Nex

发表机构 * Department of Earth Observation Science, University of Twente(特文特大学地球观测科学系)

AI总结 本文提出Mono-Hydra++,一种基于多任务学习的实时单目RGB加IMU流水线,用于3D室内度量语义映射和分层3D场景图构建,通过结合M2H-MX多任务模型和深度特征视觉惯性里程计前端,实现了在资源受限的机器人平台上无需主动深度传感器的实时度量语义映射和场景图构建。

Comments Submitted to ISPRS Journal of Photogrammetry and Remote Sensing. 50 pages, figures and tables included. Code: https://github.com/BavanthaU/mono-hydra-pp.git

详情
AI中文摘要

自主敏捷机器人需要的不仅仅是度量几何:它们必须理解物体、房间、地点和空间关系,以进行搜索、检查、探索和人机交互。传统度量地图支持定位和避障,但不提供这种语义和关系结构。3D场景图通过将几何与物体级和房间级的理解连接起来,填补了这一空白。在敏捷平台上构建此类表示仍然困难,因为空中和轻量级机器人受到严格的载荷、电力和计算限制,使RGB-D相机和LiDAR传感器在许多机载设置中不切实际。我们提出了Mono-Hydra++,一种实时单目RGB加IMU流水线,用于室内度量语义映射和分层3D场景图构建。该系统结合了M2H-MX,一种基于DINOv3的多任务模型,用于深度和语义,以及深度特征视觉惯性里程计前端,稀疏预测深度约束在VIO推导的姿态图中,语义遮蔽用于动态区域,以及在Mono-Hydra后端体积融合前的姿态感知时间对齐。在Go-SLAM ScanNet评估子集中,Mono-Hydra++在仅使用单目RGB加IMU输入的情况下,其平均轨迹误差比我们比较中的最强RGB-D基线低1.6%,在校准的7-Scenes中,其平均ATE比最强的竞争校准基线提高了29.8%。我们进一步在真实ITC建筑部署中验证了Mono-Hydra++,使用RealSense RGB加IMU,并通过在Jetson Orin NX 16GB上部署ONNX/TensorRT FP16 M2H-MX-L感知模型,以25.53 FPS的速度证明了嵌入可行性。这些结果表明,Mono-Hydra++可以在不依赖主动深度传感器的情况下,为资源受限的机器人平台提供实时度量语义映射和场景图构建。

英文摘要

Autonomous agile robots need more than metric geometry: they must understand objects, rooms, places, and spatial relations for search, inspection, exploration, and human-robot interaction. Conventional metric maps support localization and collision avoidance, but do not provide this semantic and relational structure. 3D scene graphs address this gap by connecting geometry with object-level and room-level understanding. Building such representations on agile platforms remains difficult because aerial and lightweight robots operate under strict payload, power, and compute limits, making RGB-D cameras and LiDAR sensors impractical for many onboard settings. We present Mono-Hydra++, a real-time monocular RGB plus IMU pipeline for indoor metric-semantic mapping and hierarchical 3D scene graph construction. The system combines M2H-MX, a DINOv3-based multi-task model for depth and semantics, with a deep-feature visual-inertial odometry front end, sparse predicted depth constraints in the VIO-derived pose graph, semantic masking for dynamic regions, and pose-aware temporal alignment before volumetric fusion in the Mono-Hydra backend. On the Go-SLAM ScanNet evaluation subset, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline in our comparison, while using only monocular RGB plus IMU input. On calibrated 7-Scenes, it improves average ATE by 29.8% over the strongest competing calibrated baseline. We further validate Mono-Hydra++ in a real ITC building deployment using RealSense RGB plus IMU and demonstrate embedded feasibility by deploying the ONNX/TensorRT FP16 M2H-MX-L perception model at 25.53 FPS on a Jetson Orin NX 16GB. These results show that Mono-Hydra++ can provide real-time metric-semantic mapping and scene graph construction for resource-constrained robotic platforms without relying on active depth sensors.
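
The trajectory results above are reported as relative changes in trajectory error. As a reading aid, here is a minimal sketch of how an absolute trajectory error (ATE) RMSE and a relative improvement might be computed. It assumes the estimated and ground-truth positions are already time-associated and aligned to a common frame; the paper's exact evaluation protocol is not given in the abstract, and the function names are hypothetical.

```python
import math

def ate_rmse(estimated, ground_truth):
    """Root-mean-square position error between two pre-aligned trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))

def relative_improvement_percent(baseline_error, method_error):
    """Percentage by which method_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - method_error) / baseline_error
```

Because the improvement is a ratio, a large percentage can correspond to a small absolute difference when the baseline error is already low, so the two numbers are best read together.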

2605.17601 2026-05-19 cs.RO 版本更新

From a Single Demonstration to a General Policy for Contact-Rich Manipulation

从单次示范到通用的接触密集操纵策略

Xing Li, Oliver Brock

发表机构 * Robotics and Biology Laboratory, Technische Universität Berlin(技术大学柏林机器人与生物学实验室) Science of Intelligence, Research Cluster of Excellence, Berlin(智能科学,卓越研究集群,柏林) Robotics Institute Germany(德国机器人研究所)

AI总结 本文提出了一种示范学习(LfD)框架,利用环境约束作为归纳偏置,在多阶段、接触密集的操作任务中实现单样本泛化。该方法将示范表示为利用环境约束的行为序列,经抽象、自主探索消歧、人工修正和柔顺交互四个阶段构建策略,在七个真实世界任务上成功率超过90%。

Comments 21 pages, 22 figures, 7 tables

详情
AI中文摘要

我们提出了一种示范学习(Learning from Demonstration,LfD)框架,在多阶段、接触密集的操作任务中实现单样本泛化。我们方法的核心是利用环境约束作为归纳偏置。通过将示范表示为一系列利用环境约束的行为,机器人将任务通用结构(约束类型及其转换)与实例特定细节(如精确的示范轨迹、位姿和局部几何)分离。我们的四阶段流程在该表示之上构建完整策略:机器人首先将单次示范抽象为环境约束原语,然后通过自主探索消除其歧义,接着吸收有针对性的人工修正以处理分布外的变化,最后通过柔顺交互在线恢复被抽象掉的细节。由于最终策略遵循约束而非模仿轨迹,它能够在物体位姿、局部几何和未建模的接触动力学上泛化。我们在七个真实世界的多阶段接触密集操作任务上验证了该方法,成功率超过90%。这些广泛的实验结果确立了环境约束作为示范学习中高效泛化的基本构建块的地位。

英文摘要

We present a Learning from Demonstration (LfD) framework that achieves one-shot generalization in multi-stage, contact-rich manipulation tasks. Central to our approach is the utilization of environmental constraints as the inductive bias. By representing a demonstration as a sequence of behaviors that exploit environmental constraints, the robot separates task-general structure -- the constraint types and their transitions -- from instance-specific details such as exact demonstration trajectories, poses, and local geometries. Our four-stage pipeline builds a complete policy on this representation: the robot first abstracts a single demonstration into environmental-constraint primitives, then disambiguates them through self-guided exploration, next assimilates targeted human corrections that handle out-of-distribution variations, and finally recovers the abstracted-away details online through compliant interaction. Because the resulting policy follows constraints rather than mimics trajectories, it generalizes across object poses, local geometries, and unmodeled contact dynamics. We validate our approach on seven real-world multi-stage contact-rich manipulation tasks and achieve over 90% success. These extensive experimental results establish environmental constraints as fundamental building blocks for efficient generalization in learning from demonstration.

2605.17593 2026-05-19 cs.RO 版本更新

Motion-Uncertainty-Aware Next-Best-View Planning for Moving Object Reconstruction

面向移动物体重建的考虑运动不确定性的下一最佳视角规划

Karen Li, Mattia Mantovani, Robert J. Wood, Lorenzo Sabattini, Stephanie Gil

发表机构 * Harvard University(哈佛大学) University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学)

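AI summary: This paper proposes a motion-uncertainty-aware next-best-view planning framework for reconstructing an unknown rigid object. Using noisy planar position measurements and depth observations from a mobile robot, it estimates and predicts the object state with a fixed-lag Gaussian Process smoother, then generates and scores candidate viewpoints, improving reconstruction completeness.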

Comments This paper is accepted for publication for Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

主动重建移动物体需要在决策到执行延迟期间选择信息丰富的视角,同时考虑物体运动的不确定性。现有方法只解决了该问题的一部分:用于物体重建的下最佳视角(NBV)规划器通常优化表面覆盖但假设物体静止,而针对移动目标的运动感知主动感知方法考虑了目标运动,但优先考虑跟踪或可见性而非重建覆盖。本文提出了一种考虑运动不确定性的NBV框架,用于重建未知的刚体目标,该目标处于平面运动中。该框架利用目标的噪声平面位置测量和移动机器人的深度观测。关键思想是通过评估每个候选视角在由运动和测量不确定性诱导的可能未来目标状态下的预期观测质量,而不是在单一预测目标姿态上。为了获得这种预测信念,固定滞后高斯过程平滑器从噪声位置测量中估计和预测目标状态。所得信念用于生成围绕预测目标位置的候选视角,并通过可达性过滤它们,并估计其预期覆盖驱动的分数。仿真和实际实验表明,与非预测的NBV和仅预测的跟踪方法相比,重建完整性得到了改进,从而弥合了覆盖驱动的主动重建和预测驱动的跟踪之间。

英文摘要

Active 3D reconstruction of moving objects requires selecting informative viewpoints while accounting for object motion uncertainty during the decision-to-execution delay. Existing methods address only parts of this problem: next-best-view (NBV) planners for object reconstruction typically optimize surface coverage but assume static objects, while motion-aware active perception for moving targets accounts for target motion but prioritizes tracking or visibility over reconstruction coverage. This work presents a motion-uncertainty-aware NBV framework for reconstructing an unknown rigid object undergoing planar motion, using noisy planar position measurements of the object and depth observations from a mobile robot. The key idea is to evaluate each candidate viewpoint by its expected observation quality over plausible future object states induced by motion and measurement uncertainty, rather than at a single predicted object pose. To obtain this predictive belief, a fixed-lag Gaussian Process smoother estimates and predicts the object state from noisy position measurements. The resulting belief is used to generate candidate viewpoints around the predicted object location, filter them by reachability, and estimate their expected coverage-driven scores. Simulation and real-world experiments demonstrate improved reconstruction completeness over non-predictive NBV and prediction-only tracking methods, bridging coverage-driven active reconstruction and prediction-driven tracking.
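
The key idea in this abstract, scoring each candidate viewpoint by its expected quality over plausible future object states instead of at one predicted pose, can be illustrated with a small Monte Carlo sketch. Everything named here is an assumption made for illustration: the isotropic Gaussian belief stands in for the output of the fixed-lag Gaussian Process smoother, and `view_quality`, `preferred_range`, and `width` are hypothetical placeholders for the paper's coverage-driven score, which the abstract does not specify.

```python
import math
import random


def view_quality(view, obj, preferred_range=2.0, width=1.0):
    """Hypothetical quality in [0, 1]: highest when the camera sits at the
    preferred stand-off distance from the object (not the paper's metric)."""
    d = math.dist(view, obj)
    return math.exp(-((d - preferred_range) ** 2) / (2.0 * width ** 2))


def expected_scores(candidates, mean, std, n_samples=500, seed=0):
    """Average the quality of each candidate view over object positions
    sampled from an assumed isotropic Gaussian belief (mean, std)."""
    rng = random.Random(seed)
    samples = [
        (rng.gauss(mean[0], std), rng.gauss(mean[1], std))
        for _ in range(n_samples)
    ]
    return [
        sum(view_quality(v, s) for s in samples) / n_samples
        for v in candidates
    ]


# Two candidate camera positions; the object is believed to be near (2, 0).
candidates = [(0.0, 0.0), (10.0, 0.0)]
scores = expected_scores(candidates, mean=(2.0, 0.0), std=0.3)
best = max(range(len(candidates)), key=scores.__getitem__)
```

A reachability filter, as mentioned in the abstract, would simply drop entries from `candidates` before scoring. Averaging over samples rather than evaluating at the mean penalizes views whose quality collapses when the object deviates from its predicted position.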

2605.17556 2026-05-19 cs.RO cs.AI 版本更新

Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting

视觉雕刻:用于长周期机器人泥塑的视觉对齐规划表示

Peter Schaldenbrand, Jean Oh

发表机构 * The Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所)

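AI summary: This paper proposes a visually-aligned planning representation for long-horizon robot clay sculpting. By capturing lighting and texture features, it models deformable-material dynamics in a form compatible with visual planning, and it is evaluated with several deformable materials and end-effectors.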

Comments 8 pages, 14 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

泥塑是一种复杂的艺术任务,需要通过长周期规划实现高阶目标。作为机器人问题,我们将泥塑视为形状到形状的匹配挑战。先前的可变形物体 manipulation 工作要么需要为每个目标重新训练策略,要么依赖于动态模型,这些模型将状态表示为稀疏点云,无法良好捕捉泥塑的重要特征,如纹理。我们提出了一种方法,用于建模可变形材料的动力学,并在视觉对齐的表示中为机器人雕刻规划。通过三种不同的可变形材料和各种末端执行器,我们证明我们的动力学模型在性能上与最先进的方法相当,并且具有兼容视觉规划的优势。我们的动作被表示为单个末端执行器向泥塑施加的参数化推力,这已被证明适用于长周期(>100次动作)的泥塑浮雕。最后,我们展示了在视觉对齐表示中规划的好处,同时提供了分析,证明了与3D表示相比,这种表示在规划上更具挑战性。

英文摘要

Clay sculpting is a nuanced, artistic task involving dexterous manipulation with long-horizon planning to achieve high-level goals. As a robotics problem, we formulate clay sculpting as a shape-to-shape matching challenge. Prior work on deformable object manipulation either requires retraining a policy per goal or relies on dynamics models that represent state as sparse point clouds, which capture important clay features such as texture poorly. We present a method for modeling the dynamics of deformable materials and planning for robotic sculpting in a visually-aligned representation that captures lighting and texture features. With three different deformable materials and various end-effectors, we demonstrate that our dynamics model is comparable in performance to the state of the art, with the added benefit of being compatible with visual planning. Our actions are represented as parametrized pushes into clay with a single end-effector, which proved suitable for long-horizon (>100 actions) clay relief sculptures. Lastly, we show the benefits of planning in a visually-aligned representation, and we analyze why this representation is more challenging to plan in than 3D representations.

2605.17533 2026-05-19 eess.SY cs.RO cs.SY 版本更新

Distributed 3D Leader-Follower Formation Control with Field-of-View Safety via Control Barrier Functions

基于控制屏障函数、具有视场安全保障的分布式三维领航-跟随编队控制

Immanuel R. Santjoko, Richie R. Suganda, Miao Pan, Bin Hu

发表机构 * Department of Electrical and Computer Engineering, University of Houston(休斯敦大学电气与计算机工程系) Department of Engineering Technology, University of Houston(休斯敦大学工程技术系)

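AI summary: This paper proposes a distributed 3D leader-follower formation (3D-LFF) control framework for multi-UAV systems that achieves formation tracking while enforcing perception safety constraints. It combines a relative kinematic model and a distributed controller with a CBF-QP safety filter that keeps the leader inside the follower's camera field of view.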

Comments 9 pages

详情
AI中文摘要

本文提出了一种面向多无人机系统的分布式三维领航-跟随编队(3D-LFF)控制框架,在实现编队跟踪的同时强制满足感知安全约束。维持安全的基于视觉的3D-LFF具有挑战性,因为机载相机存在严格的视场(FOV)限制,而苛刻的编队指令可能使领航者离开跟随者的相机视锥体,导致可见性丢失。为了解决这个问题,我们开发了一种感知驱动的安全控制架构,从构造上保证可见性。首先,我们在视线坐标表示下推导了相对运动学模型,并设计了一个仅使用本地可得相对状态的分布式3D-LFF跟踪控制器。接下来,我们将标称编队控制器嵌入基于控制屏障函数的二次规划(CBF-QP)安全滤波器中,该滤波器对指令速度做最小修改,使领航者保持在跟随者的相机视锥体内,并在可行时保持编队跟踪。Gazebo仿真和Crazyflie硬件实验验证了所提出的方法,展示了精确的编队跟踪和有效的视场约束执行,包括标称期望编队与可见性约束相冲突的场景。

英文摘要

This letter proposes a distributed 3D leader-follower formation (3D-LFF) control framework for multi-UAV systems that achieves formation tracking while enforcing perception safety constraints. Maintaining safe, vision-based 3D-LFF is challenging because onboard cameras impose strict Field-of-View (FOV) limitations, and demanding formation commands can drive the leader outside the follower's camera frustum, resulting in loss of visibility. To address this issue, we develop a perception-aware safe control architecture that guarantees visibility by construction. First, we derive a relative kinematic model in a line-of-sight coordinate representation and design a distributed 3D-LFF tracking controller using only locally available relative states. Next, we embed the nominal formation controller within a Control Barrier Function-based Quadratic Program (CBF-QP) safety filter that minimally modifies the commanded velocities to maintain the leader inside the follower's camera frustum while preserving formation tracking whenever feasible. Gazebo simulations and Crazyflie hardware experiments validate the proposed approach, demonstrating accurate formation tracking and effective FOV enforcement, including scenarios in which the nominal desired formation conflicts with visibility constraints.
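
To make the "minimally modifies the commanded velocities" step concrete, here is a minimal sketch of a safety filter with a single affine constraint, for which the quadratic program has a closed-form solution. It assumes the barrier condition has already been reduced to the form a·u + b ≥ 0; the vector `a` and scalar `b` are hypothetical inputs, and the paper's actual FOV barrier function and multi-constraint QP are not given in the abstract.

```python
def cbf_qp_filter(u_nom, a, b):
    """Solve  min ||u - u_nom||^2  subject to  a . u + b >= 0  in closed form.

    u_nom: nominal command (e.g. from the formation controller).
    a, b:  assumed affine form of one barrier condition; hypothetical inputs.
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        # The nominal command already satisfies the constraint: leave it alone.
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("infeasible constraint: a is zero and b is negative")
    # Project u_nom onto the constraint boundary along the direction a.
    scale = slack / norm_sq
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]


# The nominal command violates the constraint -u_x + 0.5 >= 0, so it is clipped.
safe = cbf_qp_filter([1.0, 0.0], [-1.0, 0.0], 0.5)
# A command that is already safe passes through unchanged.
unchanged = cbf_qp_filter([0.2, 0.0], [-1.0, 0.0], 0.5)
```

With several simultaneous constraints (for example, horizontal and vertical FOV limits), the closed form no longer applies and a QP solver would be needed.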

2605.17522 2026-05-19 cs.RO 版本更新

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

RoboFlow4D: 一种轻量级的流世界模型,面向实时的流引导机器人操作

Sixu Lin, Junliang Chen, Huaiyuan Xu, Zhuohao Li, Guangming Wang, Yixiong Jing, Sheng Xu, Runyi Zhao, Brian Sheil, Lap-Pui Chau, Guiliang Liu

Affiliations: School of Data Science, The Chinese University of Hong Kong (Shenzhen); The Hong Kong Polytechnic University; University of Cambridge; Shenzhen Loop Area Institute

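AI summary: This paper proposes RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space, enabling real-time flow-guided robotic manipulation with improved success rates and computational efficiency.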

详情
AI中文摘要

在三维环境中进行规划和行动是现实世界中机器人操作的基本能力。尽管先前工作已经探索了预测流规划器来指导三维操作,但现有方法往往依赖于模块化管道堆叠多个子模型,导致计算开销高且实时性能有限。为了解决这些挑战,我们引入了RoboFlow4D,一种轻量级的流世界模型,通过估计物理3D空间中的时间运动来统一感知和规划。作为一种端到端框架,RoboFlow4D直接从视觉观察和文本指令中预测多帧3D流,提供显式的基于流的规划以指导动作生成。这种设计允许无缝集成到通用动作策略中,形成高效的观察-规划-执行闭环。通过流预测与动作控制之间的慢-快协作,RoboFlow4D实现了实时且资源高效的操纵。在模拟和现实世界设置中的大量实验表明,RoboFlow4D在操纵成功率和计算效率方面持续改进,推动了流引导规划在具身智能中的发展。

英文摘要

Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited real-time performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end-to-end framework, RoboFlow4D directly predicts multi-frame 3D flows from visual observations and textual instructions, providing explicit flow-based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation-planning-execution closed loop. Through slow-fast collaboration between flow prediction and action control, RoboFlow4D enables real-time and resource-efficient manipulation. Extensive experiments in both simulation and real-world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow-guided planning for embodied intelligence.

2605.17517 2026-05-19 cs.RO 版本更新

AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

AffordVLA: 通过隐式特征对齐将 affordance 表示注入到视觉-语言-动作模型中

Weijie Kong, Zhian Su, Wei Yu, Huixu Dong

发表机构 * Grasp Lab, School of Mechanical Engineering of Zhejiang University(浙大机械工程学院抓取实验室)

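AI summary: This paper proposes AffordVLA, a framework that injects manipulation-centric affordance representations into Vision-Language-Action models through implicit feature alignment to improve action accuracy. Experiments in simulation and the real world show that it outperforms strong baselines.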

Comments 13 pages, 10 figures

详情
AI中文摘要

最近在视觉-语言-动作(VLA)模型方面的进展显示出在通用机器人操作中的强大潜力。然而,大多数VLA模型的视觉表示往往由全局物体外观主导,难以聚焦于与任务相关的功能交互区域,这限制了它们在非结构化环境中的鲁棒性。现有的基于 affordance 的方法通常依赖于显式的掩码注入或外部感知模块,需要额外的注释,同时引入级联感知误差和推理开销。为了解决这些限制,我们提出 AffordVLA,一个增强的 VLA 框架,通过隐式表示对齐将以操作为中心的 affordance 感知内部化到 VLA 视觉表示中。具体来说,我们构建了一个零样本 affordance 教师,从 RGB 观察和语言指令中提取任务条件的 affordance 视觉表示。AffordVLA 对齐 VLA 的中间视觉表示与由教师提取的 affordance 视觉表示,从而隐式地将以操作为中心的 affordance 感知注入到 VLA 视觉表示中,提高动作准确性。广泛的仿真和现实世界实验表明,AffordVLA 及其 affordance 教师实现了最先进的性能,并优于强大的基线。消融分析显示,AffordVLA 有效重塑 VLA 视觉表示,同时保持推理效率,从而提高操作成功率和训练效率。

英文摘要

Recent advances in Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. However, the visual representations of most VLA models are often dominated by global object appearance and struggle to focus on task-relevant functional interaction regions, which limits their robustness in unstructured environments. Existing affordance-based methods typically rely on explicit mask injection or external perception modules, requiring additional annotations while introducing cascading perception errors and inference overhead. To address these limitations, we propose AffordVLA, an affordance-enhanced VLA framework that internalizes manipulation-centric affordance perception into VLA visual representations through implicit representation alignment. Specifically, we construct a zero-shot affordance teacher to extract task-conditioned affordance visual representations from RGB observations and language instructions. AffordVLA aligns the intermediate visual representations of the VLA with the affordance visual representations extracted by the teacher, thereby implicitly injecting manipulation-centric affordance perception into VLA visual representations and improving action accuracy. Extensive simulation and real-world experiments demonstrate that AffordVLA and its affordance teacher achieve state-of-the-art performance and outperform strong baselines. Ablation analyses show that AffordVLA effectively reshapes VLA visual representations while preserving inference efficiency, leading to improved manipulation success rates and training efficiency.

2605.17486 2026-05-19 cs.RO cs.LG 版本更新

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

DyGRO-VLA: 通过动态分组残差优化实现跨任务的视觉-语言-动作模型扩展

Sixu Lin, Yunpeng Qing, Litao Liu, Ming Zhou, Ruixing Jin, Xiaoyi Fan, Guiliang Liu

Affiliations: School of Data Science, The Chinese University of Hong Kong (Shenzhen); Shenzhen Loop Area Institute; Zhejiang University; Rutgers University-New Brunswick; Shanghai AI Laboratory; Jiangxing Intelligence Technology Inc.

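AI summary: This paper proposes DyGRO-VLA, a two-stage optimization framework that scales Vision-Language-Action models across tasks via dynamic grouped residual optimization, with the aim of improving generalization.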

详情
AI中文摘要

最近在强化学习(RL)方面的进展提供了一种系统的方法来优化视觉-语言-动作(VLA)模型,推动了从轨迹模仿到任务环境中的主动学习的转变。尽管在控制精度上有所改进,大多数RL优化器仍然任务特定,这使VLA模型从通用控制器退化为过度拟合狭窄任务集的策略。在本研究中,我们深入分析了这一现象,并强调了跨任务特征表示对提高VLA模型泛化能力的重要性。受这一发现的启发,我们引入了DyGRO-VLA,一种两阶段优化框架,1)基于信息论原理有效地捕捉跨任务潜在表示,2)通过混合的RL残差动态优化策略。DyGRO-VLA使RL优化器能够在优化过程中利用任务相关的潜在信息,同时战略性地减轻对学习表示的不利干扰。我们在LIBERO、RoboTwin2基准以及现实世界中评估了我们的方法,证明了在多任务训练和分布偏移下,与强基线相比,我们的方法具有持续的改进。

英文摘要

Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.

2605.17477 2026-05-19 cs.RO 版本更新

Rapid Vibration Suppression and Trajectory Tracking of a Serial Manipulator with Multi-Flexible Links

多柔性连杆串联 manipulator 的快速振动抑制与轨迹跟踪

Chengyi Wang, Yilong Huang, Ji Wang

发表机构 * School of Aerospace Engineering, Xiamen University(厦门大学航空航天工程学院)

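AI summary: This paper proposes a backstepping-based output-feedback framework for fast vibration suppression and tip tracking of a multi-link serial flexible manipulator, using a DeepONet approximation for real-time deployment and scalability.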

详情
AI中文摘要

柔性机器人机械臂(FRMs)在轻量化设计和大工作空间方面具有优势,但其结构柔性会引发振动、加速疲劳、降低跟踪性能并限制运行速度。这些挑战在多连杆串联机械臂中进一步加剧,因为总长度的增加带来更大的结构柔性。本文提出了一种 backstepping 输出反馈框架,用于 n 自由度串联柔性机械臂(nDSFMR)的快速振动抑制和末端跟踪,并采用基于 DeepONet 的近似以便实际部署。每个连杆-关节被建模为与 ODE 耦合的 Timoshenko 梁,并转换为带有边界动态的规范双曲型 PDE。在关节处设计了基于 backstepping 的边界控制器,等效地沿梁注入分布式阻尼,从而仅利用可获得的边界测量实现快速振动抑制和轨迹跟踪。为了实现实时实施和可扩展性,引入 DeepONet 神经算子来近似 backstepping 核,显著降低了计算成本,并便于在变化的工况下快速更新控制器。在双连杆柔性机械臂上的实验表明,与带前馈控制的线性二次调节器(LQR)相比,该方法的振动抑制更快,末端执行器收敛到期望轨迹的速度也更快。

英文摘要

Flexible robotic manipulators (FRMs) offer advantages in lightweight design and large workspace, but their structural flexibility induces vibrations, accelerates fatigue, degrades tracking performance, and limits operational speed. These challenges are further amplified in multi-link serial manipulators, where increased overall length leads to greater structural flexibility. This article presents a backstepping output-feedback framework for fast vibration suppression and tip tracking of an n-degree-of-freedom serial flexible manipulator robot (nDSFMR), with a DeepONet-based approximation for practical deployment. Each link-joint is modeled as a Timoshenko beam coupled with an ODE and transformed into a canonical hyperbolic PDE with boundary dynamics. A backstepping-based boundary controller at the joint is developed to equivalently inject distributed damping along the beam, enabling rapid vibration suppression and trajectory tracking using only the available boundary measurements. To enable real-time implementation and scalability, a DeepONet neural operator is introduced to approximate the backstepping kernels, significantly reducing computational cost and facilitating fast controller updates under varying operating conditions. Experiments on a two-link flexible manipulator demonstrate faster vibration suppression and faster convergence of the end-effector to the desired trajectory than a linear quadratic regulator (LQR) with feedforward control.

2605.17421 2026-05-19 cs.RO 版本更新

MUSE: Multimodal Uncertainty Quantification of State Estimation

MUSE:多模态状态估计不确定性量化

Minkyung Kim, Henry Che, Bhargav Chandaka, Bhumsitt Pramuanpornsatid, Chengyu Yang, Sheng Cheng, Xiaofeng Wang, Naira Hovakimyan, Shenlong Wang

发表机构 * Department of Mechanical Science and Engineering, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校机械科学与工程系) Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校塞贝尔计算与数据科学学院) Department of Electrical Engineering, University of South Carolina(南卡罗来纳大学电气工程系)

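AI summary: This paper proposes MUSE, a real-time learning-based framework that uses Mamba's sequence modeling capacity to estimate localization uncertainty from multiple asynchronous sensor streams, improving the reliability and robustness of uncertainty quantification for state estimation.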

Comments Code and dataset: https://github.com/hungdche/MUSE

详情
AI中文摘要

准确的视觉状态估计一直是机器人领域的重要课题,广泛应用于机器人导航、自动驾驶和自主飞行。最近的机器人感知进展显著提高了状态估计的精度和鲁棒性,但如何量化和校准其精度,即我们对估计的置信度以及能否检测失败仍然是一个根本性挑战。在视觉惯性里程计(VIO)中,异方差和多模态的性质使不确定性量化尤为困难。本文介绍了MUSE(多模态状态估计不确定性量化),一种新颖的实时学习框架,利用Mamba的强大且高效的序列建模能力,从多个异步传感器流中估计定位不确定性。在公开和内部数据集上的实验表明,MUSE相比现有不确定性量化方法在可靠性和鲁棒性方面表现更优,消融研究验证了其关键设计选择的优势。

英文摘要

Accurate visual state estimation has been a central topic in robotics with a wide range of applications in robot navigation, autonomous driving, and autonomous flight. Recent advances in robot perception have led to significant improvements in the accuracy and robustness of state estimation, yet a fundamental challenge remains in how to quantify and calibrate its precision, i.e., how confident we are in an estimate and whether failures can be detected. This issue is particularly pronounced in visual-inertial odometry (VIO), where the heteroscedastic and multimodal nature of the problem makes uncertainty quantification especially difficult. This paper introduces MUSE (Multimodal Uncertainty Quantification of State Estimation), a novel real-time learning-based framework that leverages the strong and efficient sequential modeling capacity of Mamba to estimate localization uncertainty from multiple asynchronous sensor streams. Experiments on both public and in-house datasets demonstrate that MUSE achieves superior reliability and robustness compared to existing uncertainty quantification methods, and ablation studies justify the benefits of its key design choices.

2605.15641 2026-05-19 cs.RO cs.CR 版本更新

Propagating Unsafe Actions in LLM Controlled Multi-Robot Collaboration via Single Robot Compromise

通过单个机器人入侵在LLM控制的多机器人协作中传播不安全行为

Zhen Huang, Zhihuang Liu, Mengxuan Luo, Weishang Wu, Zhiping Cai

发表机构 * College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院)

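AI summary: This paper studies security in LLM-controlled multi-robot collaboration and proposes a new attack paradigm in which the adversary interacts with only a single entry robot, which then propagates malicious intent to its peers and causes coordinated unsafe actions across the system. The process is quantified with three metrics (obedience, infectiousness, and stealthiness), and experiments show that the attack is efficient and persistent.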

Comments Accepted by the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026). 9 pages, 4 figures, 3 tables

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作具身智能中的通用规划器,使单个机器人和多机器人协作的高层协调和底层任务规划成为可能。这种对具身LLM规划器的依赖也引发了关键的安全问题,因为不一致或被操控的指令可以转化为物理动作。先前的工作已研究了单个机器人设置中的此类威胁,而LLM控制的多机器人协作中的安全风险,尤其是通过机器人间通信传播的风险,仍鲜有研究。为弥合这一差距,我们提出了一种新的攻击模式,攻击者仅与单个入口机器人交互。被入侵的机器人然后通过同伴通信传播恶意意图,导致系统中协调的不安全行为。我们的评估涵盖了高风险维度,如失职、隐私侵犯和公共安全危害,揭示了多机器人规划器中持续的安全对齐差距。我们通过三个指标量化这一过程:服从性、传染性和隐蔽性。实验显示了攻击者的持续控制和快速传播:在最强的情况下,服从性达到1.00,传染性上升到0.90。值得注意的是,该攻击非常高效,只需3.0轮次即可入侵所有机器人,同时保持隐蔽性得分为0.81。当机器人必须在关键时刻解决权衡问题,如紧急情况或权利冲突时,此类风险会加剧,因为协调机制可能无意中允许对抗性指令覆盖安全要求。代码可在https://github.com/TheFatInsect/InfectBot上获取。

英文摘要

Large language models (LLMs) are increasingly used as general planners in embodied intelligence, enabling high-level coordination and low-level task planning for both single-robot and multi-robot collaboration. This increasing reliance on embodied LLM planners also raises critical security concerns, since misaligned or manipulated instructions can be translated into physical actions. Prior work has studied such threats in single-robot settings, while security risks in LLM-controlled multi-robot collaboration, especially those propagated through inter-robot communication, remain largely unexplored. To bridge this gap, we propose a novel attack paradigm for multi-robot systems in which the adversary interacts with only a single entry robot. The compromised robot then propagates malicious intent through peer communication, leading to coordinated unsafe actions across the system. Our evaluation, covering the high-risk dimensions of dereliction of duty, privacy compromise, and public safety hazards, reveals a persistent safety alignment gap in multi-robot planners. We quantify this process with three metrics: obedience, infectiousness, and stealthiness. Experiments demonstrate both persistent attacker control and rapid propagation: obedience reaches 1.00 in the strongest cases, and infectiousness rises to 0.90. Notably, the attack is highly efficient, requiring as few as 3.0 rounds to compromise all the robots while maintaining a stealthiness score of 0.81. Such risks are amplified when robots must resolve trade-offs in critical situations, such as emergencies or conflicts of rights, because the coordination mechanism can unintentionally allow adversarial instructions to override safety requirements. The code is available at https://github.com/TheFatInsect/InfectBot.

2605.11817 2026-05-19 cs.RO cs.CV 版本更新

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

Yixu Feng, Zinan Zhao, Yanxiang Ma, Chenghao Xia, Chengbin Du, Yunke Wang, Chang Xu

发表机构 * The University of Sydney(悉尼大学) City University of Hong Kong(香港城市大学)

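AI summary: This paper proposes a compression method for Vision-Language-Action models based on differentiable grid sampling. Continuous token resampling preserves key spatial information while keeping fewer than 10% of the original visual tokens, cutting FLOPs by 76% without degrading the success rate.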

详情
Journal ref
Proceedings of the Forty-third International Conference on Machine Learning, 2026
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中表现出色,但其高计算成本限制了实时部署。现有token剪枝方法面临根本性的权衡:使用剪枝进行剧烈压缩会不可避免地丢弃关键几何细节,如接触点,导致性能严重下降。我们主张通过重新思考压缩作为几何感知的连续token重采样来打破这种权衡。为此,我们提出了可微网格采样器(GridS),一个即插即用的模块,用于在VLA中进行任务感知的连续重采样。通过自适应预测最小的显著坐标集并利用可微插值提取特征,GridS在保留关键空间信息的同时实现了大幅压缩(少于10%的原始视觉token)。在LIBERO基准和真实机器人平台上的实验表明,GridS实现了76%的FLOPs减少,而无需降级成功率。代码可在https://github.com/Fediory/Grid-Sampler上获得。

英文摘要

Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression through pruning inevitably discards critical geometric details such as contact points, leading to severe performance degradation. This forces a compromise that limits the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLAs. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in success rate, validating the lowest feasible visual token count reported to date. The code is available at https://github.com/Fediory/Grid-Sampler.
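
The phrase "extracting features via differentiable interpolation" can be illustrated with plain bilinear sampling of a feature grid at continuous coordinates. This is only a sketch under stated assumptions: the grid holds scalars rather than feature vectors, the coordinates are given by hand rather than predicted by a network, and `bilinear_sample` is a hypothetical helper, not the GridS implementation from the linked repository.

```python
import math


def bilinear_sample(grid, x, y):
    """Sample a 2D grid (list of rows, at least 2x2) at continuous pixel
    coordinates (x, y). The result is piecewise linear in x and y, which is
    what lets gradients flow back to the sampling coordinates."""
    h, w = len(grid), len(grid[0])
    # Clamp to the valid range so border coordinates remain well defined.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0 = min(int(math.floor(x)), w - 2)
    y0 = min(int(math.floor(y)), h - 2)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * grid[y0][x0] + fx * grid[y0][x0 + 1]
    bottom = (1 - fx) * grid[y0 + 1][x0] + fx * grid[y0 + 1][x0 + 1]
    return (1 - fy) * top + fy * bottom


# A toy 2x2 "feature map" resampled at three hand-picked coordinates.
grid = [[0.0, 1.0], [2.0, 3.0]]
coords = [(0.5, 0.5), (1.0, 0.0), (0.0, 1.0)]
tokens = [bilinear_sample(grid, x, y) for x, y in coords]
```

Unlike hard pruning, which keeps or drops whole tokens at fixed grid positions, sampling at continuous coordinates can place a token between grid cells, which is the property the abstract relies on to retain fine spatial detail with few tokens.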

2605.07308 2026-05-19 cs.RO 版本更新

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

AT-VLA: 用于增强视觉-语言-动作模型反馈反应的自适应触觉注入

Xiaoqi Li, Muhe Cai, Jiadong Xu, Juan Zhu, Hongwei Fan, Yan Shen, Guangrui Ren, Hao Dong

发表机构 * School of Computer Science, Peking University(北京大学计算机科学系) PrimeBot PKU Lab(北京大学实验室)

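AI summary: This paper proposes AT-VLA, whose Adaptive Tactile Injection mechanism dynamically decides when and where to inject tactile signals so as to limit interference with pretrained representations. A Tactile Reaction Dual-Stream mechanism adds fast, accurate tactile responses, improving Vision-Language-Action models on contact-rich manipulation tasks.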

详情
AI中文摘要

视觉-语言-动作(VLA)模型在增强机器人代理执行多样化任务的能力方面取得了显著进展;然而,它们仍然面临在需要精确物理交互的接触丰富操作场景中的挑战。为了解决这一限制,最近的研究尝试在下游任务中整合触觉信号,使预训练的VLA能够解释触觉反馈。然而,在微调过程中引入新的模态,这些模态在预训练阶段很少出现,可能会破坏VLA的预训练能力。此外,VLA固有的缓慢推理速度会阻碍实时响应,并限制触觉反馈在动作调整中的有效利用。为克服这些挑战,我们提出了自适应触觉视觉-语言-动作(AT-VLA),引入了新颖的自适应触觉注入机制。该机制动态确定触觉注入的合适时间和位置,在显著促进动作生成时才进行注入,从而最小化对预训练表示的干扰。此外,为了实现快速准确的触觉响应,我们提出了触觉反应双流机制,将感知处理分为一个慢的视觉-语言流用于低频感知推理和一个快的触觉控制流用于高频物理交互理解,从而在0.04秒内实现实时闭环响应。现实世界实验彻底验证了AT-VLA在接触丰富操作任务中的有效性。项目页面可在:https://sites.google.com/view/at-vla。

英文摘要

Vision-Language-Action (VLA) models have significantly advanced the capabilities of robotic agents in executing diverse tasks; however, they still face challenges in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning that are rarely present in the pretraining stage may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective use of tactile feedback for action adjustment. To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating tactile signals only when they contribute significantly to action generation, thereby minimizing interference with pretrained representations. Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time closed-loop responses within 0.04 s. Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks. The project page is available at: https://sites.google.com/view/at-vla.

2604.09609 2026-05-19 cs.AI cs.RO 版本更新

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

通用大语言模型作为人类驾驶员行为模型:简化合并案例

Samir H. A. Mohammad, Wouter Mooi, Arkady Zgonnikov

发表机构 * Department of Transport and Planning, Delft University of Technology(代尔夫特理工大学交通与规划系) Department of Cognitive Robotics(认知机器人学系)

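AI summary: This paper studies general-purpose LLMs as models of human driver behavior. Two general-purpose LLMs are embedded in a simplified one-dimensional merging scenario and compared against human data quantitatively and qualitatively. The models reproduce human-like intermittent operational control and tactical dependence on spatial cues, but neither consistently captures responses to dynamic velocity cues, and safety performance diverges between models. Further study of their failure modes is needed to establish their validity as models of human driving behavior.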

Comments To be published in proceedings of IEEE ITSC 2026

详情
AI中文摘要

人类行为模型在自动驾驶车辆(AVs)的虚拟安全评估中作为行为参考和模拟人类代理至关重要,但当前模型面临可解释性与灵活性之间的权衡。通用大语言模型(LLMs)提供了一种有前景的替代方案:一个模型可能在各种场景中无需参数拟合即可部署。然而,LLMs在捕捉人类驾驶行为方面能做什么、不能做什么仍不明确。我们通过将两个通用LLMs(OpenAI o3和Google Gemini 2.5 Pro)作为独立的闭环驾驶员代理嵌入简化的一维合并场景,并通过定量和定性分析将其行为与人类数据进行比较,来填补这一空白。两个模型能够再现人类样式的间歇性操作控制和对空间线索的战术依赖。然而,它们均无法一致地捕捉人类对动态速度线索的反应,且模型间的安全性能差异显著。系统性的提示消融研究揭示了提示组件作为模型特定的归纳偏置,这些偏置在不同LLMs之间不转移。这些发现表明,通用LLMs可能潜在地作为独立、即用型的人类行为模型在AV评估流程中发挥作用,但未来研究需要进一步理解其失效模式,以确保其作为人类驾驶行为模型的有效性。

英文摘要

Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.

2603.23672 2026-05-19 cs.RO cs.CV 版本更新

Bio-Inspired Event-Based Visual Servoing for Ground Robots

生物启发的基于事件的视觉伺服控制用于地面机器人

Maral Mordad, Kian Behzad, Debojyoti Biswas, Noah J. Cowan, Milad Siami

发表机构 * Department of Electrical & Computer Engineering, Northeastern University(东北大学电气与计算机工程系) Laboratory for Computational Sensing and Robotics, Johns Hopkins University(约翰霍普金斯大学计算感知与机器人实验室) Department of Mechanical Engineering, Johns Hopkins University(约翰霍普金斯大学机械工程系)

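AI summary: This paper proposes a bio-inspired 1D event-based visual servoing framework for ground robots operating in structured environments. Using a Dynamic Vision Sensor and a multi-pattern stimulus, it directly synthesizes a nonlinear state feedback term, achieving efficient, low-latency control.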

详情
AI中文摘要

生物感觉系统本质上是自适应的,能够过滤掉恒定刺激并优先处理相对变化,可能提高计算和代谢效率。受广泛动物主动感知行为的启发,本文介绍了一种原理性的1D基于事件的视觉伺服框架,用于在结构化环境中运行的地面机器人。利用动态视觉传感器(DVS),我们证明通过将固定的空间核应用于由结构化对数强度变化模式生成的异步事件流,所得到的网络事件流能够分析性地隔离特定的运动状态组合。我们建立了该事件率估计器的一般理论界,并证明线性和二次空间剖面分别隔离了机器人的速度和位置-速度乘积。利用这些特性,我们采用多模式刺激直接合成非线性状态反馈项,而无需传统状态估计。为克服事件感知中在平衡点固有的线性可观测性损失,我们提出了一种生物启发的主动感知极限环控制器。在1/10比例自主地面车辆上的实验验证证实了所提出直接感知方法的有效性、极低延迟和计算效率。

英文摘要

Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper introduces a principled 1D event-based visual servoing framework for ground robots operating in structured environments. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific combinations of kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot's velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state feedback term without any traditional state estimation. To overcome the unavoidable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller. Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extremely low latency, and computational efficiency of the proposed direct-sensing approach.

2602.20200 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

全局先验与局部一致性:双内存增强的视觉-语言-动作模型用于高效机器人操作

Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie

Affiliations: Harbin Institute of Technology, Shenzhen; PengCheng Laboratory; Shenzhen Loop Area Institute; Huawei Noah's Ark Lab

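AI summary: This paper proposes OptimusVLA, which adds a Global Prior Memory and a Local Consistency Memory to address low action-generation efficiency and poor robustness in robotic manipulation, achieving higher success rates and faster inference across several benchmarks.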

Comments Accepted by CVPR 2026

详情
AI中文摘要

分层视觉-语言-动作(VLA)模型已迅速成为机器人操作中的主导范式。它通常包括一个用于感知和理解的视觉-语言骨干网络,以及一个用于动作生成的生成式策略。然而,其性能越来越受到动作生成过程的限制。(i) 推理效率低。各向同性噪声先验与目标动作分布之间存在显著的分布差距,这会增加去噪步数和不可行样本的出现率。(ii) 鲁棒性差。现有策略仅以当前观测为条件,忽视了历史序列的约束,因此缺乏对任务进展和时间一致性的感知。为了解决这些问题,我们引入OptimusVLA,一种具有全局先验内存(GPM)和局部一致性内存(LCM)的双内存VLA框架。GPM用从语义相似轨迹中检索到的任务级先验替代高斯噪声,从而缩短生成路径并减少函数评估次数(NFE)。LCM对已执行的动作序列进行动态建模以推断任务进展,并注入一个学习到的一致性约束,以保证轨迹的时间连贯性和平滑性。在三个仿真基准测试中,OptimusVLA始终优于强基线:它在LIBERO上实现了98.6%的平均成功率,在CALVIN上比pi_0提高了13.5%,在RoboTwin 2.0 Hard上达到了38%的平均成功率。在真实世界评估中,OptimusVLA在泛化和长周期任务套件中表现最佳,分别比pi_0高出42.9%和52.4%,同时实现了2.9倍的推理加速。

英文摘要

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency. A pronounced distributional gap exists between isotropic noise priors and target action distributions, which increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraints of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In real-world evaluation, OptimusVLA ranks best on the Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering a 2.9x inference speedup.

2602.12978 2026-05-19 cs.RO cs.AI 版本更新

Learning Native Continuation for Action Chunking Flow Policies

学习原生延续以实现动作分块流策略

Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie, Junliang Guo, Dequan Wang, Yang Gao

Affiliations: Spirit AI

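AI summary: This paper proposes Legato, a training-time continuation method that improves action-chunked flow-based VLA policies, reducing discontinuities at chunk boundaries and spurious multimodal switching while improving trajectory smoothness and task completion time.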

Comments Accepted by Robotics: Science and Systems 2026 (RSS 2026). Project page: https://lyfeng001.github.io/Legato/

AI中文摘要

动作分块使Vision Language Action (VLA)模型能够实时运行,但朴素的分块执行常在分块边界处出现不连续性。实时分块(RTC)缓解了这一问题,但其作为外部策略导致伪多模态切换和非内在平滑的轨迹。我们提出Legato,一种针对动作分块流基于VLA策略的训练时延续方法。具体而言,Legato从具有调度形状的已知动作和噪声混合物初始化去噪,使模型接触部分动作信息。此外,Legato重塑学习的流动力学,确保在每步指导下去噪过程在训练和推理之间保持一致。Legato进一步在训练中使用随机调度条件以支持变化的推理延迟并实现可控的平滑度。实证结果表明,Legato产生更平滑的轨迹并减少执行中的伪多模态切换,导致较少的犹豫和更短的任务完成时间。广泛的现实世界实验表明,Legato在五个操作任务中始终优于RTC,实现了轨迹平滑度和任务完成时间的约10%的改进。

英文摘要

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses a randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.
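
As a rough illustration of the "schedule-shaped mixture of known actions and noise" mentioned in the abstract, the sketch below blends actions already committed from the previous chunk with Gaussian noise using a per-step weight. The function names, the linear ramp, and the `overlap` parameter are illustrative assumptions; the abstract does not specify the schedule's shape or how Legato implements it.

```python
import random

def ramp_schedule(horizon, overlap):
    """Hypothetical weight schedule: 1.0 = fully known action, 0.0 = pure noise.

    Weights decay linearly from 1 to 0 over the first `overlap` steps.
    """
    if overlap <= 0:
        return [0.0] * horizon
    return [max(0.0, 1.0 - t / overlap) for t in range(horizon)]

def init_denoising(known_actions, weights, rng):
    """Mix known actions with standard Gaussian noise, one step at a time."""
    return [w * a + (1.0 - w) * rng.gauss(0.0, 1.0)
            for a, w in zip(known_actions, weights)]

rng = random.Random(0)
weights = ramp_schedule(horizon=8, overlap=4)
x_init = init_denoising([0.5] * 8, weights, rng)
```

With this ramp, the first step of the new chunk starts exactly at the known action and later steps start from progressively noisier values, which is one simple way to expose a model to partial action information.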

2601.23087 2026-05-19 cs.RO 版本更新

CoLA-Flow Policy: Temporally Coherent Imitation Learning via Continuous Latent Action Flow Matching for Robotic Manipulation

CoLA-Flow Policy: 通过连续潜在动作流匹配实现机器人操作的时序一致模仿学习

Wu Songwei, Jiang Zhiduo, Sun Wandong, Xie Guanghu, Zhao Rui, Liu Hong, Liu Yang

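AI summary: This paper proposes CoLA-Flow Policy, a trajectory-level imitation learning framework built on a continuous latent action space. By learning an explicit latent-space flow, it decouples global motion structure from low-level control noise to achieve smooth and reliable long-horizon execution, and it adds geometry-aware point cloud conditioning and execution-time multimodal modulation to improve real-world robustness.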

Comments 9 pages, 9 figures

AI中文摘要

学习长时程的机器人操作需要同时实现表达能力强的行为建模、实时推断和稳定执行,这对现有的生成策略仍具有挑战性。基于扩散的方法具有强大的建模能力,但会导致较高的推断延迟,而流匹配方法能够在快速、近单步生成的同时,当直接在原始动作空间中操作时往往会出现执行不稳定的问题。我们提出了连续潜在动作流策略(CoLA-Flow Policy),一种轨迹级模仿学习框架,该框架在连续潜在动作空间中执行流匹配。通过将动作序列编码为时间一致的潜在轨迹,并学习显式的潜在空间流,CoLA-Flow Policy 解耦了全局运动结构与低层控制噪声,从而实现平滑且可靠的长时程执行。该框架进一步集成了几何感知点云条件和执行时多模态调节,利用视觉线索作为代表性模态以增强现实环境的鲁棒性。在仿真和真实机器人上的实验表明,CoLA-Flow Policy 实现了近单步推断,比原始动作空间流基线提高了93.7%的轨迹平滑度和25个百分点的任务成功率,同时比基于扩散的方法快得多。

英文摘要

Learning long-horizon robotic manipulation requires jointly achieving expressive behavior modeling, real-time inference, and stable execution, which remains challenging for existing generative policies. Diffusion-based approaches offer strong modeling capacity but incur high inference latency, while flow matching enables fast, near-single-step generation yet often suffers from unstable execution when operating directly in the raw action space. We propose Continuous Latent Action Flow Policy (CoLA-Flow Policy), a trajectory-level imitation learning framework that performs flow matching in a continuous latent action space. By encoding action sequences into temporally coherent latent trajectories and learning an explicit latent-space flow, CoLA-Flow Policy decouples global motion structure from low-level control noise, enabling smooth and reliable long-horizon execution. The framework further integrates geometry-aware point cloud conditioning and execution-time multimodal modulation, using visual cues as a representative modality to enhance real-world robustness. Experiments in simulation and on real robots show that CoLA-Flow Policy achieves near-single-step inference, improves trajectory smoothness by up to 93.7% and task success by up to 25 percentage points over raw action-space flow baselines, while remaining significantly faster than diffusion-based policies.

2601.18442 2026-05-19 cs.RO 版本更新

SG-CADVLM: A Context-Aware Decoding Powered Vision Language Model for Safety-Critical Scenario Generation

SG-CADVLM: 一种基于上下文感知解码的视觉语言模型,用于安全关键场景生成

Hongyi Zhao, Shuo Wang, Qijie He, Ziyuan Pu

Affiliations: School of Transportation, Southeast University

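AI summary: This paper proposes SG-CADVLM, a framework combining context-aware decoding with multimodal input processing to generate high-fidelity safety-critical scenarios from crash reports. By reducing vision-language model hallucination while generating road geometry and vehicle trajectories together, it improves the accuracy and practicality of the generated scenarios.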

AI中文摘要

自动驾驶(AV)需要在安全关键场景中进行严格测试以确保安全性验证,但其验证受到实地测试成本高和现有模拟在罕见安全关键事件中保真度不足的限制。碰撞报告提供了丰富的现实世界事故动态规范,使其成为大型语言模型和视觉语言模型生成高保真场景的有前景资源。然而,现有模型由于上下文抑制常偏离实际事故特征。为了解决这些限制,本文提出了SG-CADVLM,一种整合上下文感知解码与多模态输入处理的框架,用于从碰撞报告中生成安全关键场景。该框架在生成道路几何和车辆轨迹的同时减轻了VLMs的幻觉。实验结果表明,SG-CADVLM生成结合关键和高风险场景的速率比基线方法高88.1%(相比31.2%),代表了182%的提升,同时生成可用于自动驾驶测试的可执行模拟。

英文摘要

Autonomous vehicles (AVs) require rigorous testing in safety-critical scenarios for safety validation, yet their validation is hindered by the high cost of field testing and the lack of fidelity in current simulations for rare safety-critical events. Crash reports offer rich and authentic specifications of real-world accident dynamics, making them a promising resource for Large Language Models and Vision-Language Models (VLMs) to generate high-fidelity scenarios. However, existing models frequently deviate from actual accident characteristics due to context suppression. To address these limitations, this paper presents SG-CADVLM, a framework integrating Context-Aware Decoding with multimodal input processing to generate safety-critical scenarios from crash reports. The framework mitigates the hallucination of VLMs while generating road geometry and vehicle trajectories simultaneously. The experimental results demonstrate that SG-CADVLM generates combined critical and high-risk scenarios at a rate of 88.1% compared to 31.2% for the baseline methods, representing a 182% improvement, while producing executable simulations for autonomous vehicle testing.
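
The sketch below illustrates two points from the abstract under stated assumptions. First, it shows the contrastive logit adjustment commonly used for context-aware decoding, which amplifies the difference between logits computed with and without the context; whether SG-CADVLM uses exactly this form is an assumption, and `alpha` and the toy logits are hypothetical. Second, it checks the reported relative improvement from 31.2% to 88.1%.

```python
import math

def context_aware_logits(with_ctx, without_ctx, alpha=0.5):
    """Upweight what the context adds: (1 + alpha) * with_ctx - alpha * without_ctx."""
    return [(1.0 + alpha) * a - alpha * b for a, b in zip(with_ctx, without_ctx)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# Toy two-token vocabulary: the context (e.g. a crash report) favors token 0,
# while the model's context-free prior favors token 1.
adjusted = context_aware_logits([2.0, 1.0], [1.0, 2.0], alpha=0.5)
probs = softmax(adjusted)
gain = relative_improvement(88.1, 31.2)
```

Here `gain` evaluates to roughly 182.4, which matches the 182% figure quoted in the abstract when read as a relative improvement rather than a difference in percentage points (56.9).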

2601.01155 2026-05-19 cs.RO 版本更新

ORION: Option-Regularized Deep Reinforcement Learning for Cooperative Multi-Agent Online Navigation

ORION:用于合作多智能体在线导航的选项正则化深度强化学习

Shizhe Zhang, Jingsong Liang, Zhitao Zhou, Shuhan Ye, Yizhuo Wang, Ming Siang Derek Tan, Jimmy Chiun, Yuhong Cao, Guillaume Sartoretti

Affiliations: Department of Mechanical Engineering, College of Design and Engineering, National University of Singapore

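AI summary: This study proposes ORION, which uses an option-critic approach and a dual-stage cooperation strategy to balance path optimality against gathering environmental information in multi-agent navigation through partially known environments, enabling efficient real-time cooperation.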

AI中文摘要

现有多智能体导航方法通常假设环境完全已知,难以应对部分已知场景中过时或不完整的先验地图,如仓库或工厂 floor。在此类场景中,智能体需要在路径最优与收集和共享环境信息之间取得平衡。为此,我们提出了ORION,一种用于部分已知环境中合作多智能体在线导航的新型深度强化学习框架。从不完美的先验地图开始,ORION训练智能体进行去中心化决策,朝向个体目标协调,并通过在线感知共享在闭环感知-动作循环中主动减少任务相关的地图不确定性。我们首先设计了一个共享图编码器,将先验地图与在线感知融合为统一的表示,提供在环境差异下的鲁棒状态嵌入。ORION的核心是一个选项批评框架,学习转化为低层动作序列的高层合作模式,使智能体能够自适应地在个体导航和团队层面探索之间切换。我们进一步引入了双阶段合作策略,使智能体能够在地图不确定性下协助队友,从而减少总体完成时间。在广泛的迷宫状地图和大规模仓库环境中,ORION实现了高质量的实时去中心化协作,并可扩展到多达10个机器人,优于最先进的经典和学习基线。最后,我们在物理机器人团队上验证了ORION,证明了其在现实世界协作导航中的鲁棒性和实用性。

英文摘要

Existing methods for multi-agent navigation typically assume fully known environments, offering limited support for partially known scenarios with outdated or imperfect prior maps, such as warehouses or factory floors. There, agents need to balance path optimality with collecting and sharing environmental information to help teammates reach their own targets. To these ends, we propose ORION, a novel deep reinforcement learning framework for cooperative multi-agent online navigation in partially known environments. Starting from an imperfect prior map, ORION trains agents to make decentralized decisions, coordinate toward individual targets, and actively reduce task-relevant map uncertainty through online observation sharing in a closed perception-action loop. We first design a shared graph encoder that fuses prior map with online perception into a unified representation, providing robust state embeddings under environmental discrepancies. At the core of ORION is an option-critic framework that learns high-level cooperative modes translated into sequences of low-level actions, enabling adaptive switching between individual navigation and team-level exploration. We further introduce a dual-stage cooperation strategy that allows agents to assist teammates under map uncertainty, thereby reducing the overall makespan. Across extensive maze-like maps and large-scale warehouse environments, ORION achieves high-quality real-time decentralized cooperation while scaling to up to 10 robots, outperforming state-of-the-art classical and learning-based baselines. Finally, we validate ORION on physical robot teams, demonstrating its robustness and practicality for real-world cooperative navigation.

2512.24497 2026-05-19 cs.AI cs.LG cs.RO stat.ML 版本更新

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

在联合嵌入预测世界模型中成功因素是什么?

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

Affiliations: Meta FAIR; Inria Paris; Ecole normale supérieure / PSL; New York University

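AI summary: This paper studies what drives success when planning physical tasks with joint-embedding predictive world models (JEPA-WMs). By analyzing how model architecture, training objective, and planning algorithm affect planning success, it arrives at a model that outperforms existing baselines on navigation and manipulation tasks.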

Comments V2: added AdaLN-zero; added a table comparing JEPA-WMs with baselines, with std reflecting per-seed variability only (no variability across epochs); reordered figures in the main body of the paper. V3: added data scaling experiments and a theoretical appendix section on autoregressive rollout; accepted at TMLR.

AI中文摘要

人工智能领域长期存在的挑战是开发能够解决广泛物理任务并泛化到新、未见过的任务和环境的智能体。一种流行的近期方法是通过状态-动作轨迹训练世界模型,然后使用规划算法解决新任务。规划通常在输入空间中进行,但最近出现的一类方法引入了在学习的表示空间中优化的规划算法,其承诺通过抽象无关细节来提高规划效率。在本工作中,我们将此类模型称为JEPA-WMs,并研究使此类算法有效技术选择。我们提出了一项全面研究几个关键组件,旨在找到该类中的最佳方法。我们使用模拟环境和真实世界机器人数据进行了实验,并研究了模型架构、训练目标和规划算法对规划成功的影响。我们结合发现,提出了一种在导航和操作任务中优于两个现有基线方法(DINO-WM和V-JEPA-2-AC)的模型。代码、数据和检查点可在https://github.com/facebookresearch/jepa-wms上获得。

英文摘要

A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conduct experiments using both simulated environments and real-world robotic data, and study how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.

2511.20353 2026-05-19 cs.RO 版本更新

Quality-guided UAV Surface Exploration for 3D Reconstruction

基于质量引导的无人机表面探索用于3D重建

Benjamin Sportich, Kenza Boubakri, Olivier Simonin, Alessandro Renzaglia

Affiliations: Inria; INSA Lyon; CITI

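AI summary: This paper proposes a new modular Next-Best-View planning framework that uses a reconstruction quality objective to guide exploration planning, improving the efficiency and quality of 3D reconstruction.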

AI中文摘要

映射未知环境的自主机器人有广泛的原因,但在实践中,这些原因往往在制定规划策略时被忽视。快速获取信息和对建筑物的全面结构评估有不同的要求,因此需要不同的方法。在本文中,我们提出了一种新的模块化Next-Best-View (NBV) 规划框架,该框架明确使用重建质量目标来指导探索规划。特别是,我们的方法引入了新的高效视图生成和视角候选选择方法,这些方法能够适应用户定义的质量要求,充分利用截断符号距离场(TSDF)表示中编码的不确定性。这导致了有根据且高效的探索决策,以满足预定目标。最后,我们通过在现实环境中进行广泛的模拟验证了我们的方法。我们证明了该方法能够根据用户目标调整其行为,同时在覆盖范围、最终3D地图的质量和路径效率方面都优于传统NBV策略。

英文摘要

Reasons for mapping an unknown environment with autonomous robots are wide-ranging, but in practice, they are often overlooked when developing planning strategies. Rapid information gathering and comprehensive structural assessment of buildings have different requirements and therefore necessitate distinct methodologies. In this paper, we propose a novel modular Next-Best-View (NBV) planning framework for aerial robots that explicitly uses a reconstruction quality objective to guide the exploration planning. In particular, our approach introduces new and efficient methods for view generation and selection of viewpoint candidates that are adaptive to the user-defined quality requirements, fully exploiting the uncertainty encoded in a Truncated Signed Distance field (TSDF) representation of the environment. This results in informed and efficient exploration decisions tailored towards the predetermined objective. Finally, we validate our method via extensive simulations in realistic environments. We demonstrate that it successfully adjusts its behavior to the user goal while consistently outperforming conventional NBV strategies in terms of coverage, quality of the final 3D map and path efficiency.

2510.17363 2026-05-19 cs.CV cs.LG cs.RO 版本更新

M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

M2H:基于高效窗口交叉任务注意力的多任务学习用于单目空间感知

U. V. B. L Udugama, George Vosselman, Francesco Nex

Affiliations: Department of Earth Observation Science

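AI summary: This paper proposes M2H, a framework that uses an efficient window-based cross-task attention module to perform semantic segmentation, depth estimation, edge detection, and surface normal estimation from a monocular image, while being more computationally efficient than existing methods.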

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures

AI中文摘要

在边缘设备上部署实时空间感知需要高效的多任务模型,这些模型能够在利用互补任务信息的同时最小化计算开销。本文介绍了Multi-Mono-Hydra(M2H),一种新的多任务学习框架,用于从单张单目图像中进行语义分割、深度、边缘和表面法线估计。与传统方法依赖独立单任务模型或共享编码器-解码器架构不同,M2H引入了基于窗口的跨任务注意力模块,实现了结构化的特征交换同时保留任务特定的细节,提高了任务间预测的一致性。M2H基于轻量级的ViT-based DINOv2主干网络,优化了实时部署,并作为支持动态环境中3D场景图构建的单目空间感知系统的基础。全面评估显示,M2H在NYUDv2上优于最先进的多任务模型,在Hypersim上超越了单任务深度和语义基线,在Cityscapes数据集上实现了更优的性能,同时在笔记本硬件上保持计算效率。除了基准测试外,M2H还在真实世界数据上得到了验证,证明了其在空间感知任务中的实用性。

英文摘要

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

2509.19102 2026-05-19 cs.RO cs.AI cs.CV 版本更新

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon: 通过功能对象规范化学习姿态感知的动作原语以实现通用的机器人操作

Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

Affiliations: TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg; Technical University of Munich; Agile Robots SE

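AI summary: This paper proposes FUNCanon, a framework that learns pose-aware action primitives through functional object canonicalization for generalizable robotic manipulation. It decomposes long-horizon manipulation tasks into action chunks defined by an actor, verb, and object, which improves the composability and reusability of policies.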

Comments project website: https://sites.google.com/view/funcanon, 11 pages

AI中文摘要

通用机器人技能从端到端演示中通常会导致任务特定的策略,这些策略难以超越训练分布进行泛化。因此,我们引入FUNCanon框架,将长周期操作任务转换为一系列动作片段,每个片段由主体、动词和对象定义。这些片段将策略学习聚焦于动作本身,而不是孤立的任务,从而实现组合性和重用性。为了使策略具有姿态感知和类别通用性,我们对功能对象进行规范化,通过功能对齐和自动操作轨迹转移,利用大型视觉语言模型的 affordance 信息将对象映射到共享的功能框架中。一个以对象为中心和动作为中心的扩散策略FuncDiffuser在对齐的数据上进行训练,自然尊重对象的 affordances 和姿态,简化了学习并提高了泛化能力。在模拟和现实基准上的实验表明,该方法在类别层面实现了泛化,跨任务行为重用和鲁棒的sim2real部署,显示功能规范化为复杂操作领域可扩展模仿学习提供了强大的归纳偏置。演示细节和补充材料可在我们的项目网站上获得:https://sites.google.com/view/funcanon。

英文摘要

Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.

2508.20836 2026-05-19 cs.RO math.OC 版本更新

First Experimental Demonstration of Natural Hovering Extremum Seeking: A New Paradigm in Flapping Flight Physics

首次实验性演示自然悬停极值搜索:飞行力学领域的新范式

Ahmed A. Elgohary, Rohan Palanikumar, Simone Martini, Sameh A. Eisa

Affiliations: Department of Aerospace Engineering and Engineering Mechanics, University of Cincinnati, Cincinnati, Ohio 45221, USA

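AI summary: This paper presents the first experimental validation of the Natural Hovering Extremum Seeking (NH-ES) paradigm, showing how stable hovering flight can be achieved through a model-free, real-time feedback mechanism that relies on the flapping wing's own natural oscillations.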

AI中文摘要

在本文中,我们报告了首次实验性演示了最近出现的悬停和振翅飞行力学新范式,称为自然悬停极值搜索(NH-ES),该范式提出,通过无需模型的实时反馈机制,利用振翅翼的内置自然振荡作为控制和推进输入,可以生成自然界中通过振翅昆虫和蜂鸟观察到的稳定悬停飞行力学。我们进行了moth-like、光源导向的实验,使用振翅翼体在完全无模型的设置中进行,该设置不依赖形态学参数和身体/空气动力学模型。我们展示了使用NH-ES的振翅体能够自主增益高度并稳定控制负责振翅的伺服器,包括具有pitching动态(文献中认为是开环悬停不稳定的主要原因)。振翅体仅需局部光强度反馈即可有效稳定悬停在光源附近。我们的结果也实现了在延迟和噪声效应下的验证,支持了之前观察到的NH-ES对潜在处理延迟和噪声感觉的鲁棒性。

英文摘要

In this letter, we report the first experimental demonstration of a recently emerged paradigm in hovering and flapping flight physics called Natural Hovering Extremum Seeking (NH-ES) [doi.org/10.1103/4dm4-kc4g], which theorized that the stable hovering flight observed in nature in flapping insects and hummingbirds can be generated via a model-free, real-time, computationally basic, sensory-based feedback mechanism that only needs the built-in natural oscillations of the flapping wing as both the control and the propulsive input. We run moth-like, light-source-seeking experiments on a flapping-wing body in a fully model-free setting that is agnostic to morphological parameters and body/aerodynamic models. We show that the flapping body using NH-ES gains altitude and autonomously stabilizes the servos responsible for flapping, including in the presence of pitching dynamics (believed in the literature to be a main reason for instability in open-loop hovering). The flapping body hovers stably about the light source, needing only feedback of local measurements of light intensity. Our results were also achieved under delay and noise effects, supporting earlier observations that NH-ES is robust against potential processing delays and noisy sensing.

2508.05415 2026-05-19 cs.RO 版本更新

Do Robots Really Need Anthropomorphic Hands? A Comparison of Human and Robotic Hands

机器人真的需要拟人化的手吗?人类手与机器人手的比较

Alexander Fabisch, Wadhah Zai El Amri, Chandandeep Singh, Nicolás Navarro-Guerrero

Affiliations: Robotics Innovation Center, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI); Leibniz Universität Hannover, L3S Research Center

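AI summary: By comparing the biomechanics, sensing, and control mechanisms of human and robotic hands, this paper examines whether robots need anthropomorphic hands. It finds that complex hand designs are not necessary for all tasks, that mechanism complexity correlates with the breadth of tasks a hand can perform, and that sensor integration and intelligent manipulation strategies need further study.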

AI中文摘要

人类操控技能是其自愿运动功能的巅峰,需要协调多个自由度并处理高维传感器输入以实现卓越的灵活性。因此,我们试图回答是否人类手与其相关的生物力学特性、传感器和控制机制是机器人应追求的理想。机器人真的需要拟人化手吗?我们首先从生物力学和感知的角度提取人类手的特征,与目前商用的机器人手进行比较。通过这种比较,我们得出研究问题,将操控系统复杂性与技能 repertoire 大小和灵活性联系起来。我们通过系统文献综述来回答这些问题,在2019-2025年的125篇论文中分析了操控能力。尽管复杂的五指手常被认为是机器人操控器的终极目标,但并非所有任务都必需。我们发现,在手内操控并不受益于拟人化手设计,因为更简单的机制就足够,但机制复杂性与手能执行的操控任务的广度相关。传感器集成和智能操控策略仍处于探索阶段,这可能是因为与手设计的不匹配:而不是复制手指数量和自由度,关注鲁棒性和柔软性将允许更智能的控制和学习,以利用环境接触并集成更多传感器。最后,我们呼吁标准化的评估标准,以实现手部设计和操控系统系统的比较。

英文摘要

Human manipulation skills represent a pinnacle of their voluntary motor functions, requiring the coordination of many degrees of freedom and processing of high-dimensional sensor input to achieve remarkable dexterity. Thus, we set out to answer whether the human hand, with its associated biomechanical properties, sensors, and control mechanisms, is an ideal that we should strive for in robotics. Do robots need anthropomorphic hands? We start by extracting characteristics of the human hand in terms of biomechanics and perception to compare them with currently commercially available robotic hands. From this comparison, we derive our research questions that connect manipulation system complexity to skill repertoire size and dexterity. We attempt to answer these with a systematic literature review, in which we analyze the manipulation capabilities demonstrated in 125 papers from 2019-2025. Although complex five-fingered hands are often considered the ultimate goal for robotic manipulators, they are not necessary for all tasks. We find that in-hand manipulation does not benefit from anthropomorphic hand design as simpler mechanisms are sufficient, but mechanism complexity correlates with the breadth of manipulation tasks a hand can perform. Sensor integration and intelligent manipulation strategies remain underexplored, which may be because of a misalignment with hand design: instead of replicating the number of fingers and degrees of freedom, focusing on robustness and softness would allow more intelligent control and learning to exploit environmental contacts and integrate more sensors. Finally, we argue for standardized evaluation criteria to enable systematic comparison of hand designs and manipulation systems.

2507.16481 2026-05-19 cs.RO cs.SY eess.SY 版本更新

Guided Reinforcement Learning for Omnidirectional 3D Jumping in Quadruped Robots

为四足机器人提供全方位三维跳跃的引导强化学习

Riccardo Bussola, Michele Focchi, Giulio Turrisi, Claudio Semini, Luigi Palopoli

发表机构 * Dynamic Legged System (DLS), Istituto Italiano di Tecnologia (IIT)(动态腿系统(DLS),意大利技术研究院(IIT))

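
AI summary: This paper proposes a guided reinforcement learning approach that combines Bézier curves with a Uniformly Accelerated Rectilinear Motion (UARM) model to make 3D jumping in quadruped robots more efficient and explainable, and demonstrates its advantages in simulation and experiments.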

AI中文摘要

跳跃对四足机器人来说是一个重大挑战,尽管在许多操作场景中至关重要。虽然存在用于控制此类运动的优化方法,但它们往往耗时且需要大量的机器人和地形参数知识,使其在现实世界中不够稳健。强化学习(RL)正逐渐成为一种可行的替代方案,但传统端到端方法在样本复杂性方面效率低下,需要在模拟中进行大量训练,并且最终运动的可预测性差,这使得难以验证最终运动的安全性。为克服这些限制,本文介绍了一种新的引导强化学习方法,通过结合贝塞尔曲线与匀加速直线运动(UARM)模型,利用物理直觉实现高效且可解释的跳跃。广泛的仿真和实验结果清楚地证明了我们的方法相较于现有方法的优势。

英文摘要

Jumping poses a significant challenge for quadruped robots, despite being crucial for many operational scenarios. While optimisation methods exist for controlling such motions, they are often time-consuming and demand extensive knowledge of robot and terrain parameters, making them less robust in real-world scenarios. Reinforcement learning (RL) is emerging as a viable alternative, yet conventional end-to-end approaches are inefficient in terms of sample complexity, requiring extensive training in simulation, and offer poor predictability of the final motion, which makes it difficult to certify its safety. To overcome these limitations, this paper introduces a novel guided reinforcement learning approach that leverages physical intuition for efficient and explainable jumping, by combining Bézier curves with a Uniformly Accelerated Rectilinear Motion (UARM) model. Extensive simulation and experimental results clearly demonstrate the advantages of our approach over existing alternatives.
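
The two building blocks named in the abstract can be sketched in a few lines. The code below evaluates a cubic Bézier curve along one axis and applies the constant-acceleration (UARM) position formula to a ballistic flight phase. How the paper actually combines the two is not stated in the abstract, so the function names, the choice of a cubic curve, and the example numbers are illustrative assumptions.

```python
def cubic_bezier(p0, p1, p2, p3, s):
    """Evaluate a cubic Bezier curve at parameter s in [0, 1] (one coordinate)."""
    u = 1.0 - s
    return u**3 * p0 + 3 * u**2 * s * p1 + 3 * u * s**2 * p2 + s**3 * p3

def uarm_position(x0, v0, a, t):
    """Position under uniformly accelerated rectilinear motion."""
    return x0 + v0 * t + 0.5 * a * t**2

def flight_time(v_z, g=9.81):
    """Time for a body launched upward at v_z to return to its takeoff height."""
    return 2.0 * v_z / g

# Hypothetical thrust-phase height profile: 0.20 m crouch to 0.35 m lift-off.
mid_height = cubic_bezier(0.20, 0.20, 0.35, 0.35, 0.5)
# Hypothetical flight phase with a 2 m/s vertical lift-off velocity.
t_air = flight_time(2.0)
landing_height = uarm_position(0.0, 2.0, -9.81, t_air)
```

A parametric curve of this kind keeps the thrust trajectory low-dimensional, and the closed-form flight model makes the landing point predictable, which is in line with the abstract's emphasis on explainable motion.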

2507.16059 2026-05-19 cs.RO 版本更新

Therapist-Exoskeleton-Patient Interaction for Gait Therapy

治疗师-外骨骼-患者互动用于步态治疗

Emek Barış Küçüktabak, Matthew R. Short, Lorenzo Vianello, Daniel Ludvig, Levi Hargrove, Kevin Lynch, Jose Pons

Affiliations: Shirley Ryan AbilityLab; Center for Robotics and Biosystems; Department of Biomedical Engineering; Department of Mechanical Engineering; Department of Physical Medicine and Rehabilitation

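AI summary: This paper proposes a new gait rehabilitation approach based on physical Human-Robot-Human Interaction (pHRHI), in which the therapist and the stroke patient both wear lower-limb exoskeletons connected at the hips and knees through virtual spring-damper elements, enabling bidirectional interaction and improving rehabilitation outcomes.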

AI中文摘要

中风后,个体常因下肢无力和失去独立关节控制而出现运动和平衡障碍。步态恢复是康复的关键目标,传统上通过高强度的治疗师指导训练实现。然而,手动辅助对治疗师来说体力消耗大,并限制了治疗师同时与多个关节互动的能力。机器人外骨骼能够提供多关节支持,减少治疗师的负担,并提供客观反馈,但当前的控制策略往往限制了治疗师的参与和适应性。本文提出了一种基于物理人机人交互(pHRHI)的步态康复新范式,其中治疗师和中风患者均佩戴下肢外骨骼,并通过弹簧阻尼元件在髋膝处虚拟连接。这使得双向互动成为可能,允许治疗师引导运动并接收触觉反馈。在一项针对八名慢性中风患者的研究中,pHRHI训练优于传统治疗师指导的 treadmill 走行,导致关节活动范围、步态指标、肌肉激活和动机均有所增加。这些结果突显了pHRHI在结合机器人精度与治疗师直觉方面对改善康复结果的潜力。

英文摘要

Following a stroke, individuals often experience mobility and balance impairments due to lower-limb weakness and loss of independent joint control. Gait recovery is a key goal of rehabilitation, traditionally achieved through high-intensity therapist-led training. However, manual assistance can be physically demanding and limits the therapist's ability to interact with multiple joints simultaneously. Robotic exoskeletons offer multi-joint support, reduce therapist strain, and provide objective feedback, but current control strategies often limit therapist involvement and adaptability. We present a novel gait rehabilitation paradigm based on physical Human-Robot-Human Interaction (pHRHI), where both the therapist and the post-stroke individual wear lower-limb exoskeletons virtually connected at the hips and knees via spring-damper elements. This enables bidirectional interaction, allowing the therapist to guide movement and receive haptic feedback. In a study with eight chronic stroke patients, pHRHI training outperformed conventional therapist-guided treadmill walking, leading to increased joint range of motion, step metrics, muscle activation, and motivation. These results highlight pHRHI's potential to combine robotic precision with therapist intuition for improved rehabilitation outcomes.
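
A virtual spring-damper coupling between two matching joints can be sketched as follows. The torque applied to each wearer depends on the difference in joint angle and angular velocity between the two exoskeletons, so each side feels a pull toward the other. The stiffness `k`, damping `b`, and joint values are hypothetical; the abstract does not give the controller gains or its exact form.

```python
def coupling_torque(theta_self, theta_other, omega_self, omega_other, k, b):
    """Torque on 'self' from a virtual spring-damper linking two matching joints.

    Angles in rad, angular velocities in rad/s, k in N*m/rad, b in N*m*s/rad.
    """
    return k * (theta_other - theta_self) + b * (omega_other - omega_self)

# Hypothetical knee-joint states for the patient and the therapist.
tau_patient = coupling_torque(0.1, 0.3, 0.0, 0.5, k=20.0, b=2.0)
tau_therapist = coupling_torque(0.3, 0.1, 0.5, 0.0, k=20.0, b=2.0)
```

The two torques are equal and opposite, which is what makes the interaction bidirectional: the therapist guides the patient's joint and at the same time receives haptic feedback about the patient's motion.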

2507.04996 2026-05-19 cs.CY cs.CE cs.CL cs.HC cs.RO 版本更新

Agentic Vehicles for Human-Centered Mobility: Definition, Prospects, and Synergistic Co-Development with Vehicle Autonomy

面向人类中心的移动性:定义、前景以及与车辆自主性的协同发展

Jiangbo Yu, Raphael Frank, Luis Miranda-Moreno, Sasan Jafarnejad, Jonatas Augusto Manzolli, Fuqiang Liu, Jiyao Wang, Ali Eslami

Affiliations: Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg

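AI summary: This paper discusses human-centered mobility and introduces the concept of agentic vehicles, arguing that autonomy and agency are intertwined but conceptually distinct dimensions and emphasizing the need for their co-development.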

AI中文摘要

自主性,源自希腊语autos(自我)和nomos(法律),指的是根据内部规则运行而不受外部控制的能力。自动驾驶车辆(AuVs)因此被理解为能够感知环境并执行任务,且在一定程度上减少人类干预的车辆系统,这与SAE自动化驾驶级别所指示的方向一致。然而,最近的研究和部署越来越多地展示了车辆能力,这些能力虽然不违背自主性,但也不由自主性所涵盖,包括处理模糊目标、有目的的社会互动、外部工具使用、主动问题解决、持续学习以及在未见过且具有伦理重要性的环境中进行情境敏感推理,这在部分情况下得益于多模态语言模型。这些发展揭示了技术自主性与为人类中心移动性所需更广泛社会认知功能之间的差距,这些功能更精确地由代理性概念所捕捉。因此,而不是不断增加“自主”一词的修饰词,我们引入了代理车辆(AgVs)并建议自主性和代理性是相互交织但概念上不同的:如果自主性关注的是做什么和如何做(在内部规则下的任务执行),那么代理性则关注为什么做以及还能做什么(目标导向、适应性的行动)。我们提出自主性和代理性作为正交但相互促进的维度,并具有协同发展的意义。车辆代理标志着移动服务智能的新维度,预示着车辆作为社会中的目的性行为者。

英文摘要

Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Autonomous vehicles (AuVs) are therefore understood as vehicular systems that perceive their environment and execute tasks with minimal human intervention, consistent with the direction indicated by the SAE levels of automated driving. However, recent research and deployments increasingly showcase vehicular capabilities that, while not contradicting autonomy, are not entailed by it, including ambiguous goal handling, purposeful social engagement, external tool use, proactive problem solving, continuous learning, and context-sensitive reasoning in unseen and ethically salient situations, enabled in part by multimodal language models. These developments reveal a gap between technical autonomy and the broader social cognitive functions required for human-centered mobility, which are more precisely captured by the notion of agency. Therefore, rather than adding increasingly elaborate modifiers to "autonomous," we introduce agentic vehicles (AgVs) and suggest that autonomy and agency are intertwined but conceptually distinct: if autonomy concerns what to do and how to do it (task executions under internal rules), agency pertains to why to do it and what else can be done (goal-directed, adaptive actions). We present autonomy and agency as orthogonal yet synergistic dimensions with co-development implications. Vehicle agency marks a novel dimension of mobility service intelligence, heralding vehicles as purposeful actors in society.

2507.01099 2026-05-19 cs.CV cs.AI cs.LG cs.RO 版本更新

Geometry-aware 4D Video Generation for Robot Manipulation

面向机器人操作的几何感知4D视频生成

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, Shuran Song

Affiliations: Stanford University; Toyota Research Institute

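AI summary: This paper proposes a geometry-aware 4D video generation model trained with cross-view pointmap alignment to enforce multi-view 3D consistency. Given a single RGB-D image per view, it generates spatio-temporally consistent future video sequences and produces stable, spatially aligned predictions without relying on camera poses.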

Comments ICLR 2026; Project website: https://robot4dgen.github.io

AI中文摘要

理解并预测物理世界的动态可以增强机器人在复杂环境中的规划和交互能力。尽管最近的视频生成模型在建模动态场景方面显示出强大的潜力,但生成在不同摄像机视角下既时间一致又几何一致的视频仍然是一项重大挑战。为此,我们提出了一种4D视频生成模型,通过在训练过程中使用跨视角点图对齐来监督模型,以确保生成视频的多视角3D一致性。通过这种几何监督,模型学习了一个共享的3D场景表示,使其能够从单个RGB-D图像输入中,根据新的视角生成时空一致的未来视频序列,而无需依赖相机姿态作为输入。与现有基线方法相比,我们的方法在多个模拟和现实世界机器人数据集上产生了更稳定和空间对齐的预测。我们进一步表明,预测的4D视频可用于使用现成的6自由度姿态跟踪器恢复机器人末端执行器轨迹,从而生成在新相机视角下具有良好泛化能力的机器人操作策略。

英文摘要

Understanding and predicting dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.

2506.18024 2026-05-19 cs.DC cs.RO 版本更新

Leveraging Cloud-Fog Automation for Autonomous Collision Detection and Classification in Intelligent Unmanned Surface Vehicles

利用云雾自动化实现智能无人水面舰艇的自主碰撞检测与分类

Thien Tran, Quang Nguyen, Jonathan Kua, Minh Tran, Toan Luu, Thuong Hoang, Jiong Jin

发表机构 * Deakin University(迪肯大学) University of Birmingham(伯明翰大学) RMIT University(皇家墨尔本理工大学) VinUniversity(Vin大学) Swinburne University of Technology(斯威本科技大学) University of Tasmania(塔斯马尼亚大学)

AI总结 本文提出了一种针对智能无人水面舰艇的分布式云-边缘-IoT架构,通过云雾自动化范式解决海上ICPS的实时数据处理和预测建模限制问题,提升了计算效率、响应性和可扩展性。

Comments 6 pages, 5 figures, accepted paper on the 23rd IEEE International Conference on Industrial Informatics (INDIN), July 12-15, 2025, Kunming, China

详情
Journal ref
2025 IEEE 23rd International Conference on Industrial Informatics (INDIN), Kunming, China, 2025
AI中文摘要

工业信息物理系统(ICPS)技术是推动海上自主性的基础,尤其对于无人水面舰艇(USV)而言。然而,机载计算限制和通信延迟显著限制了实时数据处理、分析和预测建模,从而限制了海上ICPS的可扩展性和响应性。为克服这些挑战,我们借鉴最近提出的云雾自动化范式的设计原则,提出了一种专门面向海上ICPS的分布式云-边缘-IoT架构。我们的架构由三个层次组成:云层用于集中式和分布式数据聚合、高级分析和后续模型优化;边缘层执行本地AI驱动的处理和决策;物联网层负责低延迟传感器数据采集。我们的实验结果表明,计算效率、响应性和可扩展性均有所提高。与传统方法相比,我们实现了86%的分类准确率,并改进了延迟性能。通过采用云雾自动化,我们解决了海上ICPS应用中的低延迟处理限制和可扩展性挑战。我们的工作提供了一个实用、模块化和可扩展的框架,以推进未来海上ICPS中智能USV的稳健自主性和AI驱动的决策。

英文摘要

Industrial Cyber-Physical Systems (ICPS) technologies are foundational in driving maritime autonomy, particularly for Unmanned Surface Vehicles (USVs). However, onboard computational constraints and communication latency significantly restrict real-time data processing, analysis, and predictive modeling, hence limiting the scalability and responsiveness of maritime ICPS. To overcome these challenges, we propose a distributed Cloud-Edge-IoT architecture tailored for maritime ICPS by leveraging design principles from the recently proposed Cloud-Fog Automation paradigm. Our proposed architecture comprises three hierarchical layers: a Cloud Layer for centralized and decentralized data aggregation, advanced analytics, and future model refinement; an Edge Layer that executes localized AI-driven processing and decision-making; and an IoT Layer responsible for low-latency sensor data acquisition. Our experimental results demonstrated improvements in computational efficiency, responsiveness, and scalability. When compared with our conventional approaches, we achieved a classification accuracy of 86%, with an improved latency performance. By adopting Cloud-Fog Automation, we address the low-latency processing constraints and scalability challenges in maritime ICPS applications. Our work offers a practical, modular, and scalable framework to advance robust autonomy and AI-driven decision-making and autonomy for intelligent USVs in future maritime ICPS.

2506.17991 2026-05-19 cs.DC cs.RO 版本更新

CFTel: A Practical Architecture for Robust and Scalable Telerobotics with Cloud-Fog Automation

CFTel: 一种用于鲁棒且可扩展的远程机器人系统的实用架构

Thien Tran, Jonathan Kua, Minh Tran, Honghao Lyu, Thuong Hoang, Jiong Jin

发表机构 * Deakin University(迪肯大学) RMIT University(皇家墨尔本理工大学) Zhejiang University(浙江大学) Swinburne University of Technology(斯威本科技大学) University of Tasmania(塔斯马尼亚大学)

AI总结 本文提出了一种基于云雾自动化架构的远程机器人系统CFTel,旨在解决传统云基远程机器人系统的延迟、可靠性、可扩展性和容错问题,通过分布式云-边缘-机器人计算架构实现确定性连接、连接智能和网络化计算,从而提升实时控制、可扩展性和自主性。

Comments 6 pages, 1 figure, accepted paper on the 23rd IEEE International Conference on Industrial Informatics (INDIN), July 12-15, 2025, Kunming, China

详情
Journal ref
2025 IEEE 23rd International Conference on Industrial Informatics (INDIN), Kunming, China, 2025
AI中文摘要

远程机器人技术是自主工业信息物理系统(ICPS)的关键基础,能够实现跨多个领域的远程操作。然而,传统基于云的远程机器人系统存在延迟、可靠性、可扩展性和韧性问题,阻碍了关键应用中的实时性能。云雾远程机器人(CFTel)基于云雾自动化(CFA)范式,通过利用分布式云-边缘-机器人计算架构来解决这些限制,实现确定性连接、确定性连接智能和确定性网络化计算。本文综合了CFTel的最新进展,旨在突出其在促进可扩展、低延迟、自主和AI驱动的远程机器人系统中的作用。我们分析了相关的架构框架及其使能技术,包括5G超可靠低延迟通信、边缘智能、具身AI和数字孪生。研究表明,CFTel有潜力提高实时控制、可扩展性和自主性,同时支持面向服务的解决方案。我们还讨论了实际挑战,包括延迟限制、网络安全风险、互操作性问题和标准化工作。本文为从事未来远程机器人研究的研究人员、利益相关者和行业从业者提供了基础参考。

英文摘要

Telerobotics is a key foundation in autonomous Industrial Cyber-Physical Systems (ICPS), enabling remote operations across various domains. However, conventional cloud-based telerobotics suffers from latency, reliability, scalability, and resilience issues, hindering real-time performance in critical applications. Cloud-Fog Telerobotics (CFTel) builds on the Cloud-Fog Automation (CFA) paradigm to address these limitations by leveraging a distributed Cloud-Edge-Robotics computing architecture, enabling deterministic connectivity, deterministic connected intelligence, and deterministic networked computing. This paper synthesizes recent advancements in CFTel, aiming to highlight its role in facilitating scalable, low-latency, autonomous, and AI-driven telerobotics. We analyze architectural frameworks and technologies that enable them, including 5G Ultra-Reliable Low-Latency Communication, Edge Intelligence, Embodied AI, and Digital Twins. The study demonstrates that CFTel has the potential to enhance real-time control, scalability, and autonomy while supporting service-oriented solutions. We also discuss practical challenges, including latency constraints, cybersecurity risks, interoperability issues, and standardization efforts. This work serves as a foundational reference for researchers, stakeholders, and industry practitioners in future telerobotics research.

2505.07813 2026-05-19 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

DexWild:面向真实场景机器人策略的灵巧人类交互

Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, Deepak Pathak

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出DexWild框架,通过结合人类和机器人示范数据,提升机器人在多样化环境中的泛化能力,实验表明其在未见环境中的成功率显著高于传统方法。

Comments In RSS 2025. Website at https://dexwild.github.io

详情
AI中文摘要

大规模、多样化的机器人数据集已成为使灵巧操作策略泛化到新环境的有希望途径,但获取此类数据集存在诸多挑战。虽然远程操作能提供高保真的数据集,但其高成本限制了可扩展性。相反,如果人们可以像在日常生活中一样使用自己的手来收集数据呢?在DexWild中,一个多样化的数据收集团队使用他们的手在多种环境和物体上收集数小时的交互数据。为了记录这些数据,我们创建了DexWild-System,一种低成本、移动且易于使用的设备。DexWild学习框架在人类和机器人示范数据上共同训练,相较于单独训练每个数据集,其性能得到提升。这种组合产生了能够泛化到新环境、任务和形态的稳健机器人策略,只需少量额外的机器人特定数据。实验结果表明,DexWild显著提高了性能,在未见环境中实现了68.5%的成功率,几乎是仅使用机器人数据训练的策略的四倍,并提供了5.8倍更好的跨形态泛化能力。视频结果、代码库和说明可在https://dexwild.github.io上找到。

英文摘要

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments (nearly four times higher than policies trained with robot data only) and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io

2503.02087 2026-05-19 cs.RO cs.LG cs.SY eess.SY 版本更新

Uncertainty Representation in a SOTIF-Related Use Case with Dempster-Shafer Theory for LiDAR Sensor-Based Object Detection

基于Dempster-Shafer理论的LiDAR传感器目标检测SOTIF相关用例中的不确定性表示

Milin Patel, Rolf Jung

发表机构 * Institute for Driver Assistance and Connected Mobility(驾驶员辅助与车联网研究所) Kempten University of Applied Sciences(肯普滕应用科学大学)

AI总结 本文提出了一种系统的方法,利用Dempster-Shafer理论构建辨识框架,以表示LiDAR传感器目标检测中的不确定性,并通过基于方差的敏感性分析量化这些不确定性并确定其优先级,以支持自动驾驶场景的预期功能安全。

Comments submitted as extended paper of Vehicle Technology and Intelligent Transport Systems (VEHITS)2024 conference and will be published by Springer in a CCIS Series book later in 2025

详情
AI中文摘要

LiDAR传感器目标检测中的不确定性源于环境变化和传感器性能限制。表示这些不确定性对于确保预期功能安全(SOTIF)至关重要,SOTIF侧重于防止自动驾驶场景中的危险。本文提出了一种系统的方法,用于在SOTIF相关场景中识别、分类和表示基于LiDAR的目标检测中的不确定性。采用Dempster-Shafer理论(DST)构建辨识框架(FoD)以表示检测结果。根据已识别的不确定性来源之间的依赖关系,应用条件基本概率分配(BPA)。采用Yager组合规则来处理来自多个来源的冲突证据,为评估不确定性对检测准确性的影响提供结构化框架。研究应用基于方差的敏感性分析(VBSA)来量化不确定性并确定其优先级,详细说明其对检测性能的具体影响。

英文摘要

Uncertainty in LiDAR sensor-based object detection arises from environmental variability and sensor performance limitations. Representing these uncertainties is essential for ensuring the Safety of the Intended Functionality (SOTIF), which focuses on preventing hazards in automated driving scenarios. This paper presents a systematic approach to identifying, classifying, and representing uncertainties in LiDAR-based object detection within a SOTIF-related scenario. Dempster-Shafer Theory (DST) is employed to construct a Frame of Discernment (FoD) to represent detection outcomes. Conditional Basic Probability Assignments (BPAs) are applied based on dependencies among identified uncertainty sources. Yager's Rule of Combination is used to resolve conflicting evidence from multiple sources, providing a structured framework to evaluate uncertainties' effects on detection accuracy. The study applies variance-based sensitivity analysis (VBSA) to quantify and prioritize uncertainties, detailing their specific impact on detection performance.
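The combination step named in this abstract can be illustrated with a minimal sketch of Yager's rule. The two-element frame of discernment (`detected`, `missed`), the two evidence sources, and all mass values below are hypothetical and are not taken from the paper; the sketch only shows how Yager's rule moves conflicting mass to the full frame (ignorance) instead of normalising it away as Dempster's rule does.

```python
from itertools import product


def yager_combine(m1, m2, frame):
    """Combine two mass functions (dict: frozenset -> mass) with Yager's rule.

    Mass falling on empty intersections (conflict) is assigned to the
    full frame, i.e. treated as ignorance, rather than renormalised.
    """
    frame = frozenset(frame)
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    combined[frame] = combined.get(frame, 0.0) + conflict
    return combined


# Hypothetical frame of discernment and evidence sources (illustrative only).
FRAME = frozenset({"detected", "missed"})
D = frozenset({"detected"})
M = frozenset({"missed"})

m_weather = {D: 0.6, M: 0.3, FRAME: 0.1}  # assumed evidence from one source
m_range = {D: 0.7, M: 0.2, FRAME: 0.1}    # assumed evidence from another source

fused = yager_combine(m_weather, m_range, FRAME)
for subset, mass in fused.items():
    print(sorted(subset), round(mass, 4))
```

With these assumed inputs the conflicting mass (0.33) ends up on the full frame, so a high level of disagreement between sources shows up as increased ignorance rather than inflated belief in either outcome.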

2502.05462 2026-05-19 cs.RO cs.MA cs.SY eess.SY math.OC 版本更新

Motion Planning of Cooperative Nonholonomic Mobile Manipulators

协作式非完整移动机械臂的运动规划

Keshab Patra, Arpita Sinha, Anirban Guha

发表机构 * Department of Mechanical Engineering, Indian Institute of Technology Bombay(印度理工学院孟买分校机械工程系) Center for Systems and Control, Indian Institute of Technology Bombay(印度理工学院孟买分校系统与控制中心)

AI总结 本文提出了一种可实时实现的运动规划框架,用于非完整移动机械臂机器人在动态环境中协作运输物体。全局规划器在静态无障碍区域内找到从起点到目标的路径,并利用一种新颖、快速且计算轻量的基于椭圆的技术,在路径周围生成一组凸的静态无障碍区域。框架引入了基于非线性模型预测控制(NMPC)的可实时实现规划技术,联合规划移动底盘和机械臂的可行运动,并生成运动学-动力学可行的无碰撞轨迹以实现协作物体运输。仿真和硬件实验验证了所提规划框架的效率。

Comments Published in ASME Letters in Translational Robotics. This includes supplementary materials

详情
Journal ref
Patra, K., Sinha, A., and Guha, A. (May 2, 2026). "Motion Planning of Cooperative Nonholonomic Mobile Manipulators." ASME. Letters Trans. Robotics. December 2025; 1(4): 041003
AI中文摘要

我们提出了一种可实时实现的运动规划框架,用于非完整移动机械臂机器人(MMR)在动态环境中协作运输物体。我们的全局规划器在环境中的静态无障碍区域内找到一条从起点到目标的路径,并利用一种新颖、快速且计算轻量的基于椭圆的技术,在路径周围生成一组凸的静态无障碍区域。我们引入了一种基于非线性模型预测控制(NMPC)的可实时实现规划技术,该技术联合规划移动底盘和机械臂的可行运动,并生成运动学-动力学可行的无碰撞轨迹,以实现协作物体运输。仿真和硬件实验验证了所提规划框架的效率。

英文摘要

We propose a real-time implementable motion planning framework for cooperative object transportation by nonholonomic mobile manipulator robots (MMRs) in dynamic environments. Our global planner finds a path from start to goal through the static, obstacle-free regions in the environment and generates a set of convex, static, obstacle-free regions around the path using a novel, fast, and computationally lightweight ellipse-based technique. We introduce a real-time implementable planning technique based on nonlinear Model Predictive Control (NMPC) that jointly plans feasible motion for the mobile base and the manipulator arm and generates a kinodynamically feasible, collision-free trajectory for cooperative object transportation. Simulation and hardware experiments validate the efficiency of our proposed planning framework.

2409.12190 2026-05-19 cs.RO cs.CV 版本更新

Bundle Adjustment in the Eager Mode

急切模式下的捆绑调整

Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang

发表机构 * Spatial AI & Robotics (SAIR) Lab, University at Buffalo(空间人工智能与机器人实验室,布法罗大学) Georgia Institute of Technology(佐治亚理工学院) Purdue University(普渡大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出了一种与PyTorch无缝集成的高效急切模式捆绑调整库,通过稀疏感知的自动微分设计和GPU加速的稀疏运算,提升了在机器人应用中捆绑调整的运行效率和性能。

详情
AI中文摘要

捆绑调整(BA)是各种机器人应用中的关键技术,例如同步定位与建图(SLAM)、增强现实(AR)和摄影测量学。BA通过优化诸如相机位姿和3D地标等参数,使它们与观测结果对齐。随着深度学习在感知系统中的重要性日益增加,将BA与深度学习框架整合以提高可靠性和性能的需求日益迫切。然而,广泛使用的基于C++的BA库,如GTSAM、g²o和Ceres Solver,缺乏与PyTorch等现代深度学习库的原生整合。这种限制影响了它们的灵活性、调试简便性和整体实现效率。为了弥补这一差距,我们引入了一种与PyTorch无缝集成的高效急切模式BA库。我们的方法包括稀疏感知的自动微分设计和面向二阶优化的GPU加速稀疏运算。我们的GPU急切模式BA在所有基准测试中均实现了显著的运行时间效率,与GTSAM、g²o和Ceres相比,平均加速分别为18.5×、22×和23×。

英文摘要

Bundle adjustment (BA) is a critical technique in various robotic applications such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely used C++-based BA libraries, such as GTSAM, g²o, and Ceres Solver, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, ease of debugging, and overall implementation efficiency. To address this gap, we introduce a highly efficient eager-mode BA library that is seamlessly integrated with PyTorch. Our approach includes a sparsity-aware auto-differentiation design and GPU-accelerated sparse operations designed for second-order optimization. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5×, 22×, and 23× across all benchmarks compared to GTSAM, g²o, and Ceres, respectively.
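For readers unfamiliar with the objective that BA minimises, the following is a minimal pure-Python sketch of the reprojection residual for a pinhole camera. It is not the library's API: the function names, the intrinsics (`f`, `cx`, `cy`), and the toy scene are assumptions chosen for illustration, and a real solver would minimise this cost over camera poses and landmarks with a sparse second-order method.

```python
def project(point, R, t, f, cx, cy):
    """Pinhole projection of one 3D world point into pixel coordinates.

    R is a 3x3 rotation (nested lists), t a 3-vector, f the focal length
    in pixels, (cx, cy) the principal point. All names are illustrative.
    """
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    return (f * x / z + cx, f * y / z + cy)


def reprojection_residuals(points, observations, R, t, f, cx, cy):
    """Stack (u - u_obs, v - v_obs) for every landmark seen by one camera."""
    residuals = []
    for p, (u_obs, v_obs) in zip(points, observations):
        u, v = project(p, R, t, f, cx, cy)
        residuals.extend([u - u_obs, v - v_obs])
    return residuals


def cost(residuals):
    """Least-squares cost that BA minimises over poses and landmarks."""
    return 0.5 * sum(r * r for r in residuals)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]

# Synthetic observations generated from the true pose (identity, zero translation).
observations = [project(p, IDENTITY, (0.0, 0.0, 0.0), 100.0, 320.0, 240.0) for p in points]

# Residuals at the true pose vanish; a 10 cm translation error does not.
exact = reprojection_residuals(points, observations, IDENTITY, (0.0, 0.0, 0.0), 100.0, 320.0, 240.0)
perturbed = reprojection_residuals(points, observations, IDENTITY, (0.1, 0.0, 0.0), 100.0, 320.0, 240.0)
print(cost(exact), cost(perturbed))
```

Each observation touches only one camera and one landmark, which is why the Jacobian of this residual vector is sparse and why sparsity-aware differentiation matters at scale.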

2403.10629 2026-05-19 cs.RO cs.SY eess.SY 版本更新

Virtual Elastic Tether: a New Approach for Multi-agent Navigation in Confined Aquatic Environments

虚拟弹性缆绳:一种多智能体在受限水下环境中的导航新方法

Kanzhong Yao, Xueliang Cheng, Keir Groves, Barry Lennox, Ognjen Marjanovic, Simon Watson

发表机构 * Manchester Centre for Robotics and AI, Department of Electrical and Electronic Engineering, University of Manchester(曼彻斯特机器人与人工智能中心,电气与电子工程系,曼彻斯特大学)

AI总结 本文提出了一种虚拟弹性缆绳(VET)方法,用于解决水下环境中多智能体导航的挑战,通过在不完全状态测量条件下实现更稳定的导航性能。

Comments This work has been submitted to the Wiley for possible publication

详情
AI中文摘要

水下导航是移动机器人领域中一个具有挑战性的方向,因为水下环境在自定位和通信方面存在固有限制。其中一些挑战可以通过使用协作多智能体团队来缓解。然而,在水下应用时,由于无法获得可靠的测量数据,传统多智能体协作控制方法的鲁棒性受到很大限制。本文在状态测量不完整的背景下引入了虚拟弹性缆绳(VET)的概念,这是一种用于受限空间水下导航的创新方法。VET的概念借助协作水域航行器探索系统(CAVES)进行建模和验证,CAVES是一个从仿真到现实的多智能体水域机器人平台。在此框架内,我们开发了一种基于视觉的自主水下航行器-自主水面航行器领导者-跟随者编队方案。实验在仿真和物理平台上进行,并以传统的基于图像的视觉伺服方法为基准。结果表明,在离散扰动下,当机器人之间被拉开的距离在仿真中超过0.6米、在现实世界中超过0.3米时,基线方法的编队失效。相比之下,VET增强的系统在5秒内恢复到扰动前的距离。此外,结果展示了VET增强的CAVES在受限水池中成功导航,而基线方法无法充分完成任务。

英文摘要

Underwater navigation is a challenging area in the field of mobile robotics due to inherent constraints in self-localisation and communication in underwater environments. Some of these challenges can be mitigated by using collaborative multi-agent teams. However, when applied underwater, the robustness of traditional multi-agent collaborative control approaches is highly limited due to the unavailability of reliable measurements. In this paper, the concept of a Virtual Elastic Tether (VET) is introduced in the context of incomplete state measurements, which represents an innovative approach to underwater navigation in confined spaces. The concept of VET is formulated and validated using the Cooperative Aquatic Vehicle Exploration System (CAVES), which is a sim-to-real multi-agent aquatic robotic platform. Within this framework, a vision-based Autonomous Underwater Vehicle-Autonomous Surface Vehicle leader-follower formulation is developed. Experiments were conducted both in simulation and on a physical platform, benchmarked against a traditional Image-Based Visual Servoing approach. Results indicate that the formation maintained by the baseline approach fails under discrete disturbances when the induced distance between the robots exceeds 0.6 m in simulation and 0.3 m in the real world. In contrast, the VET-enhanced system recovers to pre-perturbation distances within 5 seconds. Furthermore, results illustrate the successful navigation of VET-enhanced CAVES in a confined water pond where the baseline approach fails to perform adequately.

2605.17336 2026-05-19 cs.RO cs.CV eess.SP 版本更新

Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

基于触觉的多模态融合在具身智能中的应用:视觉、语言和接触驱动范式的综述

Zhixiang Cao, Di Tian, Runwei Guan, Yanzhou Mu, Xiaolou Sun, Shaofeng Liang, Daizong Liu, Tao Huang, Yutao Yue, Henghui Ding, Bin Fang, Alex Zhou, Qing-Long Han, Hui Xiong

发表机构 * School of Electronic Science and Engineering, Xi’an Jiaotong University, China(西安交通大学电子科学与工程学院) Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China(香港科技大学(广州)人工智能学域) State Key Laboratory for Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室) Purple Mountain Laboratory, China(紫金山实验室) Institute for Math & AI, Wuhan University, China(武汉大学数学与人工智能研究院) Centre for AI and Data Science Innovation and the School of Science and Engineering, James Cook University, Australia(詹姆斯库克大学人工智能与数据科学创新中心及科学与工程学院) School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China(北京邮电大学人工智能学院) Institute of Big Data, Fudan University, China(复旦大学大数据研究院) Linkerbot (Beijing) Technology Co., Ltd, China(北京链动科技有限公司) School of Engineering, Swinburne University of Technology, Melbourne(斯威本科技大学工程学院)

AI总结 本文综述了多模态触觉融合在具身智能中的研究,探讨了如何通过整合视觉、语言和触觉信息来提升物理交互与语义推理的结合,提出了一种分层的分类体系,并总结了当前的研究挑战和未来方向。

Comments 20 pages, 8 figures

详情
AI中文摘要

触觉感知是具身智能中的基本模态,能够提供关于接触几何、材料属性和交互动态的独特且直接反馈,这无法被远程传感器所替代。然而,单一的触觉感知在空间覆盖稀疏和缺乏全局语义上下文方面存在固有局限。随着深度学习和大语言模型的迅速发展,将触觉与视觉和语言相结合已成为连接物理交互与语义推理的关键。尽管进展迅速,现有研究仍分散在不同的数据集、传感模态和任务中,缺乏统一的理论框架。为解决这一差距,本文提供了截至2026年第一季度的多模态触觉融合研究的全面综述。我们提出了一种分层的分类体系,将该领域分为两个主要维度:多模态数据集和多模态方法。在数据方面,我们对从触觉-视觉数据集、触觉-语言数据集、触觉-视觉-语言数据集以及触觉-视觉-其他数据集等资源进行了分类。在方法方面,我们把先前的工作分为三个核心支柱:(1)多模态感知与识别,专注于物体理解和抓取预测;(2)跨模态生成,专注于触觉、视觉和文本之间的双向翻译;(3)多模态交互,强调反馈控制和语言引导的操作。此外,我们总结了代表性的触觉传感硬件,回顾了常用的评估指标和基准设置,并讨论了当前的挑战和有前途的未来方向。

英文摘要

Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile sensing with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, existing research remains fragmented across disparate datasets, sensing modalities, and tasks, lacking a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources into Tactile-Vision datasets, Tactile-Language datasets, Tactile-Vision-Language datasets, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, which focuses on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, which emphasizes feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.

2605.17327 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

基于前馈3D模型的单目视觉-惯性系统高效无特征初始化

Yuantai Zhang, Jiaqi Yang, Huajian Zeng, Changhao Chen, Haoang Li, Liang Li, Dezhen Song, Xingxing Zuo

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学) HKUST (GZ)(香港科技大学(广州)) Zhejiang University(浙江大学)

AI总结 本文提出了一种无需视觉特征跟踪的初始化框架,利用前馈3D模型预测的点云,从而提高了单目视觉-惯性导航系统的初始化可靠性与效率,实验表明其初始化成功率超过90%且数据需求显著减少。

详情
AI中文摘要

快速且可靠的初始化对于单目视觉-惯性导航系统(VINS)至关重要,因为它为后续的状态估计建立了初始条件。尽管不断取得进展,大多数现有方法仍严重依赖视觉特征对应关系,并需要3-4秒的传感器数据才能成功初始化,这限制了它们的适用性和效率。随着能够直接从图像预测点云的前馈3D模型的出现,我们从更简洁的角度重新审视视觉-惯性初始化问题。在本文中,我们提出了一种无特征初始化框架,利用前馈3D模型预测的尺度未定点云,从而无需进行视觉特征跟踪和估计。这种设计显著降低了系统复杂性并提高了初始化的可靠性。在公开数据集上的实验表明,所提出的无特征初始化方法实现了最高成功率,超过90%,并且显著减少了成功初始化所需的数据时长,通常降至1.2秒以下。我们进一步在自采集的数据集上验证了我们的方法,该数据集覆盖各种室内和室外场景,结果展示了鲁棒的性能,特别是在现有方法常常失败的视觉退化环境中。代码和数据集可在https://github.com/Yuantai-Z/FF-VIO-Init获取。

英文摘要

Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai-Z/FF-VIO-Init.

2605.17302 2026-05-19 cs.RO 版本更新

Beyond Geometry: Efficient Topologically-Grounded Navigation in Complex 3D Environments

超越几何:在复杂3D环境中高效拓扑导向的导航

Yifan Du, Chengwei Zhang, Siyu Liao, Zhongfeng Wang

发表机构 * School of Integrated Circuits(集成电路学院)

AI总结 本文提出了一种表面提取框架,通过强制地面支撑、头顶净空和基于种子的连通性约束,构建了物理可达站立位置的简化状态空间,从而在复杂3D环境中实现高效的拓扑导向导航。

详情
AI中文摘要

在复杂的3D环境中,地面机器人导航常受到几何歧义的阻碍,其中家具等不可通行的结构与可通行地面具有相同的局部几何特性。此外,搜索大规模体素空间的计算成本仍然是重大挑战。为了解决这些问题,我们提出了一种表面提取框架,通过强制地面支撑、头顶净空和基于种子的连通性约束,构建了物理可达站立位置的简化状态空间。在五个Matterport3D室内场景和三个PCT基准场景上的评估显示,状态空间减少了超过80%,并在Matterport3D场景上实现了亚毫秒级的A*搜索,所有300个测试查询均实现了100%的规划成功。

英文摘要

Ground robot navigation in complex 3D environments is often hindered by geometric ambiguity, where non-traversable structures such as furniture share local geometric properties with navigable ground. Furthermore, the computational cost of searching massive voxel spaces remains a significant challenge. To address these issues, we present a surface extraction framework that constructs a reduced state space of physically reachable standing positions by enforcing ground support, overhead clearance, and seed-based connectivity constraints. Evaluation across five Matterport3D indoor scenes and three PCT benchmark scenes demonstrates over 80% state space reduction and sub-millisecond A* search on the Matterport3D scenes, with 100% planning success across all 300 tested queries.
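The three constraints named in this abstract can be sketched on a toy 2D grid: keep only cells with ground support and enough overhead clearance, flood-fill from a seed to retain the connected component, then run A* over that reduced set. This is a simplified illustration under assumed inputs (boolean `support` grid, per-cell `clearance` in metres, a hypothetical 1.8 m threshold, 4-connectivity), not the paper's surface extraction pipeline.

```python
import heapq
from collections import deque

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def reachable_cells(support, clearance, min_clearance, seed):
    """Flood-fill from `seed` over cells that have ground support and
    at least `min_clearance` of free space overhead."""
    rows, cols = len(support), len(support[0])

    def standable(r, c):
        return support[r][c] and clearance[r][c] >= min_clearance

    if not standable(*seed):
        return set()
    seen, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen and standable(nr, nc):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


def astar(cells, start, goal):
    """A* with a Manhattan heuristic over the reduced cell set."""
    if start not in cells or goal not in cells:
        return None

    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from, cost = {start: None}, {start: 0}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cur]:
            continue  # stale heap entry
        for dr, dc in NEIGHBOURS:
            nxt = (cur[0] + dr, cur[1] + dc)
            if nxt in cells and g + 1 < cost.get(nxt, float("inf")):
                cost[nxt] = g + 1
                came_from[nxt] = cur
                heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt))
    return None


# Toy scene: (1, 1) has no ground support, (0, 2) sits under a low obstacle.
support = [[True, True, True, True], [True, False, True, True], [True, True, True, True]]
clearance = [[2.0, 2.0, 0.5, 2.0], [2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0]]

cells = reachable_cells(support, clearance, min_clearance=1.8, seed=(0, 0))
path = astar(cells, start=(0, 0), goal=(0, 3))
print(len(cells), path)
```

Because the search only ever expands cells in the pre-filtered set, the planner never needs to re-check traversability during A*, which is where the reduction in state space pays off.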

2605.17300 2026-05-19 cs.RO 版本更新

HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds

HCLM:一种用于双四足机器人协同运动操作的分层框架

Qixuan Li, Chen Le, Jincheng Yu, Xinlei Chen

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China(清华大学深圳国际研究生院,中国深圳) Department of Electronic Engineering, and the Institute for Embodied Intelligence and Robotics, Tsinghua University, Beijing, China(清华大学电子工程系及具身智能与机器人研究院,中国北京)

AI总结 本文提出HCLM框架,通过分层结构实现双四足机器人在复杂环境中的协同运动操作,核心方法是采用集中式联合扩散策略和混合全身控制器,主要贡献是实现了高鲁棒性的多机器人协作控制。

详情
AI中文摘要

我们介绍了HCLM,一种面向双四足机器人系统通用协同运动操作的分层框架。由于空间协调、稳健移动和闭链物理交互的需求相互冲突,在浮动基座上协调多机器人协作操作极具挑战性。为了解决这一问题,我们的架构系统性地将高层协作推理与底层稳健运动执行解耦。在高层,一个集中式联合扩散策略利用SE(3)不变的任务空间表示来学习与坐标系无关的空间协调模式。为了将这些与坐标系无关的参考转换为物理运动,一个以任务为中心的混合全身控制器将用于无碰撞速度分配的前瞻式运动学模型预测控制与一个反应式执行层相结合。关键的是,这一反应层保证了精确末端执行器跟踪所需的快速响应,同时通过协作导纳方案引入主动力调节,以安全地化解运动学冲突,并在闭链交互中严格调节内部应力。我们在难度逐步增加的仿真场景中验证了该框架,包括协作搬运、装箱和交接,并成功将交接任务部署到现实世界。结果表明,该框架任务执行可靠、严格与构型无关,并对剧烈物理扰动具有出色的抗扰能力,为多机器人具身协调提供了一条高度稳健的路径。

英文摘要

We introduce HCLM, a hierarchical framework for general-purpose cooperative loco-manipulation with dual quadrupedal systems. Coordinating multi-robot collaborative manipulation across floating bases is highly challenging due to the conflicting demands of spatial coordination, robust locomotion, and closed-chain physical interactions. To resolve this, our architecture systematically decouples high-level collaborative reasoning from low-level robust motion execution. At the high level, a centralized Joint Diffusion Policy leverages an SE(3)-invariant task-space representation to learn coordinate-agnostic spatial coordination patterns. To translate these frame-agnostic references into physical motion, a task-centric hybrid Whole-Body Controller synergizes a proactive kinematic Model Predictive Control for collision-free velocity distribution with a reactive execution layer. Crucially, this reactive layer guarantees rapid responsiveness for precise end-effector tracking, while concurrently integrating active force regulation via a cooperative admittance scheme to safely resolve kinematic conflicts and strictly regulate internal stresses during closed-chain interactions. We validate the framework across progressively challenging simulated scenarios, including cooperative carrying, packing and handovers, and successfully deploy the latter in the real world. The results demonstrate reliable task execution, strict configuration agnosticism, and exceptional resilience against severe physical perturbations, offering a highly robust pathway for multi-robot embodied coordination.

2605.17293 2026-05-19 cs.RO cs.MA 版本更新

Task Capability Improvement Algorithm for Collaborative Manipulators

协作机械臂任务能力提升算法

Keshab Patra, Arpita Sinha, Anirban Guha

发表机构 * Department of Mechanical Engineering, Indian Institute of Technology Bombay(印度理工学院孟买分校机械工程系) Center for Systems and Control, Indian Institute of Technology Bombay(印度理工学院孟买分校系统与控制中心)

AI总结 本文提出利用附加力矩提高协作机械臂的任务能力,通过在非质心位置施加力产生额外力矩,从而增强单个机械臂及整个协作组的能力,实验结果显示任务能力提升了5.86%。

详情
AI中文摘要

本文介绍了一种利用附加力矩进行协作任务能力提升的方法。机械臂在物体的抓取点施加力。在非物体质心的位置施加力会产生不期望的力矩。这种不期望的力矩作为附加力矩,提高了单个机械臂的能力,从而提高了整个协作组的能力。任何任务能力的提升都会直接增加物体和运输能力。协作组增强的能力也有助于实现最优能力、最优资源分配和最大故障容忍性。我们的仿真结果表明,与不使用力矩增强机械臂能力相比,任务能力提升了5.86%。

英文摘要

This work introduces a method for improving cooperative task capability using additional moments. The manipulators apply forces at the object's grasp points. Applying a force at a point other than the object's center of gravity produces a moment that is normally undesired. Here, this moment acts as an additional moment: it improves the capability of an individual manipulator and, hence, of the entire collaborative group. Any improvement in task capability adds directly to the object and transportation capability. The group's enhanced capability also helps achieve optimal capability, optimal resource allocation, and maximum fault tolerance in object manipulation. Our simulation results show a 5.86% improvement in capability compared to when no moment is used to enhance the capability of the manipulators.
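The physical effect the abstract relies on is the standard rigid-body relation between a force applied away from the center of gravity and the moment it produces, namely the cross product of the offset and the force. The sketch below only illustrates that relation; the grasp offset and force values are hypothetical, and it does not reproduce the paper's capability metric or allocation algorithm.

```python
def cross(r, f):
    """Cross product r x f for 3-vectors given as tuples."""
    return (
        r[1] * f[2] - r[2] * f[1],
        r[2] * f[0] - r[0] * f[2],
        r[0] * f[1] - r[1] * f[0],
    )


# Hypothetical numbers: grasp point 0.5 m from the center of gravity along x,
# manipulator pushing upwards with 10 N along z.
grasp_offset = (0.5, 0.0, 0.0)  # metres, measured from the center of gravity
force = (0.0, 0.0, 10.0)        # newtons

moment = cross(grasp_offset, force)
print(moment)  # moment about the center of gravity, in newton-metres
```

A grasp exactly at the center of gravity gives a zero offset and therefore no moment, which is the baseline case the abstract compares against.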

2605.17284 2026-05-19 cs.CV cs.AI cs.LG cs.RO 版本更新

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

CLAP:用于端到端自动驾驶的对比潜在空间提示优化

Ruiyang Zhu, Yuehan He, Boyuan Zheng, Zesen Zhao, Ahmad Chalhoub, Qingzhao Zhang, Z. Morley Mao

发表机构 * University of Michigan(密歇根大学) University of Arizona(亚利桑那大学)

AI总结 本文提出CLAP方法,通过对比潜在空间提示优化解决自动驾驶中罕见但安全关键的长尾场景问题:为冻结的VLA驾驶模型配备基于众包数据优化、并通过V2X通信按需检索的逐道路块软提示,从而提升规划性能。

Comments 9 pages + appendix

详情
AI中文摘要

端到端自动驾驶系统通过视觉-语言-动作(VLA)模型在常见驾驶场景中表现出色,但在罕见但安全关键的长尾场景如活跃施工区和复杂让行几何中表现脆弱。本文提出了一种方法,超越数据扩展和模型训练,解决长尾挑战场景。我们引入CLAP(对比潜在空间提示优化),一种位置感知的适应框架,通过车辆到一切(V2X)通信按需检索,将冻结的VLA驾驶模型与每条道路块的软提示相结合。我们的方法基于VLA潜在空间的两个观察:(i)在VLA的隐藏状态层,来自相同道路块的场景紧密聚集并占据潜在空间的紧凑区域;(ii)在单个道路块内,长尾和正常帧在潜在表示中高度混合,难以改进其中一个而不影响另一个。CLAP通过两阶段流程解决此问题:监督对比学习发现道路块特定的困难场景方向,随后方向性正则化提示优化选择性改进挑战帧同时保持正常帧性能。在NAVSIM基准上,使用各种最先进的VLA后端,CLAP将挑战场景规划错误减少了24%,在不回归正常帧的情况下显著提高了规划性能。

英文摘要

End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that addresses the long-tail challenging scenes beyond data scaling and model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations from VLAs' latent space: (i) at the VLA's hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this via a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces challenging scenario planning error by 24% with no regression on normal frames, significantly improving planning performance.
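The first stage of the pipeline is described as supervised contrastive learning. As a generic illustration, the sketch below computes a supervised contrastive loss of the commonly used form (each anchor is pulled toward same-label samples and pushed from the rest) on toy embeddings. The embeddings, labels, and temperature are assumptions, inputs are assumed to be unit-normalised, and nothing here reflects how CLAP derives its hard-scene direction from the learned representation.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.5):
    """Supervised contrastive loss over unit-normalised embeddings.

    For each anchor i with at least one positive (same label, j != i):
    loss_i = -mean_p log( exp(z_i.z_p / T) / sum_{a != i} exp(z_i.z_a / T) )
    """
    n = len(embeddings)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {j: dot(embeddings[i], embeddings[j]) / temperature for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy latent vectors: two tight groups on opposite sides of the unit circle.
z = [(1.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (-1.0, 0.0)]

aligned = supcon_loss(z, labels=[0, 0, 1, 1])      # labels match the geometry
intermixed = supcon_loss(z, labels=[0, 1, 0, 1])   # labels cut across the groups
print(round(aligned, 4), round(intermixed, 4))
```

The loss is low when same-label samples already sit together and high when labels are intermixed in the latent space, which is the situation the abstract's second observation describes for long-tail and normal frames.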

2605.17264 2026-05-19 cs.RO 版本更新

Stretch-ICP: A Continuous-Trajectory Registration and Deskewing Algorithm in Scenarios of Aggressive Motions

Stretch-ICP: 一种剧烈运动场景下的连续轨迹配准与去畸变算法

Simon-Pierre Deschênes, Veronica Vannini, Philippe Giguère, François Pomerleau

发表机构 * GitHub

AI总结 本文提出Stretch-ICP算法,通过改进SLAM的鲁棒性,以提高在剧烈运动下的激光雷达-惯性导航状态估计的鲁棒性和一致性,同时减少了线速度和角速度的估计误差。

Comments 29 pages, 16 figures, published in Sensors 2026, 26(8), 2567, special issue "New Challenges and Sensor Techniques in Robot Positioning"

详情
Journal ref
Sensors 2026, 26(8), 2567
AI中文摘要

在复杂环境中实现鲁棒的机器人自主性仍然具有挑战性,在不平或湿滑地形上失去稳定性可能引发极端的加速度和角速度。这类运动会破坏传感器测量并降低状态估计的精度,因而需要提升算法的鲁棒性。为研究此问题,我们引入了Tumbling-Induced Gyroscope Saturation (TIGS)数据集,该数据集由机械式激光雷达和惯性测量单元(IMU)沿山坡翻滚而下时的记录组成。该数据集包含的角速度最高可达类似数据集的四倍,且已公开可用。随后,我们提出了两种互补的方法来提高同步定位与建图(SLAM)的鲁棒性,并在TIGS上对其进行评估。首先,Saturation-Aware Angular Velocity Estimation (SAAVE)在剧烈运动导致陀螺仪测量饱和时估计角速度,将角速度估计误差降低了83.4%。其次,Stretch-ICP是一种新的配准与去畸变算法,与经典的迭代最近点(ICP)算法相比,能够在剧烈运动下重建更平滑的六自由度(DOF)轨迹。Stretch-ICP在扫描边界处将线速度和角速度误差分别降低了95.2%和94.8%。这些贡献共同提高了剧烈运动下激光雷达-惯性状态估计的鲁棒性和一致性。

英文摘要

Robust robotic autonomy remains challenging in complex environments, where loss of stability on uneven or slippery terrain can induce extreme accelerations and angular velocities. Such motions corrupt sensor measurements and degrade state estimation, motivating the need for improved algorithmic robustness. To investigate this issue, we introduce the Tumbling-Induced Gyroscope Saturation (TIGS) dataset, which consists of recordings from a mechanical lidar and an Inertial Measurement Unit (IMU) tumbling down a hill. The dataset contains angular speeds up to four times higher than those in similar datasets and is publicly available. We then propose two complementary methods to improve Simultaneous Localization And Mapping (SLAM) robustness and evaluate them on TIGS. First, Saturation-Aware Angular Velocity Estimation (SAAVE) estimates angular velocities when gyroscope measurements become saturated during aggressive motions, reducing angular speed estimation error by 83.4%. Second, Stretch-ICP, a novel registration and deskewing algorithm, enables reconstruction of smoother 6-Degrees Of Freedom (DOF) trajectories under aggressive motions compared to classical Iterative Closest Point (ICP). Stretch-ICP reduces linear and angular velocity errors by 95.2% and 94.8%, respectively, at scan boundaries. Together, these contributions improve the robustness and consistency of lidar-inertial state estimation under aggressive motions.
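As a rough illustration of the problem SAAVE targets, the sketch below flags gyroscope samples that have reached the sensor's measurement limit. It is not the SAAVE estimator, which the abstract does not specify; the function name `flag_saturated`, the `full_scale` value, and the tolerance are all hypothetical.

```python
def flag_saturated(angular_rates, full_scale, tolerance=1e-3):
    """Return one boolean per sample: True where |rate| reaches the sensor limit.

    angular_rates: gyroscope readings for one axis, in rad/s.
    full_scale:    assumed symmetric measurement limit of the gyroscope, in rad/s.
    tolerance:     relative margin below the limit that still counts as saturated.
    """
    threshold = full_scale * (1.0 - tolerance)
    return [abs(rate) >= threshold for rate in angular_rates]


# Hypothetical readings from a sensor assumed to clip at +/-35 rad/s.
readings = [3.2, 18.7, 35.0, 35.0, -35.0, 12.1]
saturated = flag_saturated(readings, full_scale=35.0)
saturated_count = sum(saturated)
```

Samples flagged this way are the ones where the raw gyroscope value can no longer be trusted, so an estimate would have to be substituted for them.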

2605.17229 2026-05-19 cs.RO cs.SY eess.SY 版本更新

Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions

生成车辆-行人交互的安全关键场景

Qingwen Pu, Kun Xie, Yuan Zhu, Guocong Zhai

Affiliations * Transportation Informatics Lab, Department of Civil and Environmental Engineering, Old Dominion University; Inner Mongolia Center for Transportation Research, Inner Mongolia University; School of Transportation and Logistics, National Engineering Laboratory of Integrated Transportation Big Data Application Technology, National and Local Joint Engineering Research Center of Integrated Transportation Intelligence, Southwest Jiaotong University

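AI summary This paper proposes a three-stage framework that combines real-world data with adaptive simulation to generate behaviorally realistic safety-critical scenarios at scale. Using multi-agent state-space Transformer-enhanced DDPG (MA-SST-DDPG) agents, it reproduces evasive behaviors in vehicle-pedestrian interactions with low trajectory error and produces the VPSCI dataset.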

Comments 49 pages, 13 figures, 11 tables

详情
AI中文摘要

自动驾驶系统部署需要在安全关键的车辆-行人交互中进行严格验证,但现实世界数据集很少捕捉高风险场景,而模拟平台缺乏真实行为。为此,本研究提出了一种三阶段框架,结合现实数据与自适应模拟,生成行为真实的安全关键场景。第一阶段在现实安全关键数据上预训练多智能体状态空间Transformer增强DDPG(MA-SST-DDPG)智能体,通过数据驱动学习学习人类样式的避让行为。第二阶段在CARLA中部署预训练的多智能体进行在线强化学习,实现跨多样场景的泛化,整合现实知识与模拟经验,生成精炼的MA-SST-DDPG模型。第三阶段使用CARLA与精炼模型生成来自八个交叉口场景的超过198,000个高分辨率交互episode,最终生成车辆-行人安全关键交互(VPSCI)数据集。精炼的MA-SST-DDPG模型在复现真实避让行为上优于基线方法,实现了最低的轨迹误差(ADE=0.072 m,FDE=0.142 m)。统计比较证实生成数据与现实数据在冲突严重程度和行为响应分布上具有等价性。图灵测试确认三阶段框架生成的避让行为与现实交互无法区分。这些结果展示了该框架在生成高保真安全关键数据方面的有效性,为ADS开发和基于模拟的安全评估提供了有价值的来源。

英文摘要

Automated driving system (ADS) deployment requires rigorous validation across safety-critical vehicle-pedestrian interactions, yet real-world datasets rarely capture high-risk scenarios while simulation platforms lack realistic behavior. In response, this study proposes a three-stage framework that combines real-world grounding with adaptive simulation to generate behaviorally realistic safety-critical scenarios at scale. Stage 1 pre-trains multi-agent state-space Transformer-enhanced DDPG (MA-SST-DDPG) agents on real-world safety-critical data to learn human-like interactive evasive behaviors. Stage 2 deploys the pre-trained agents in CARLA for online reinforcement learning to generalize across diverse scenarios, integrating real-world knowledge with simulation experience to produce a refined MA-SST-DDPG model. Stage 3 uses CARLA with the refined model to generate over 198,000 high-resolution interaction episodes from eight intersection scenarios, culminating in the Vehicle-Pedestrian Safety-Critical Interaction (VPSCI) dataset. The refined MA-SST-DDPG model outperformed baseline methods in reproducing realistic evasive behaviors, achieving the lowest trajectory errors (ADE = 0.072 m, FDE = 0.142 m). Statistical comparison confirmed distributional equivalence between the generated and real-world data in both conflict severity and behavioral response. A Turing test confirmed that the evasive behaviors generated by the three-stage framework were indistinguishable from real-world interactions. These results demonstrate the framework's effectiveness in producing high-fidelity safety-critical data, offering a valuable resource for ADS development and simulation-based safety evaluation.
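The abstract reports ADE and FDE without defining them. Assuming the conventional meanings (average displacement error over all time steps and final displacement error at the last step), a minimal sketch of the computation could look like this. The function name and the toy trajectories are illustrative and not taken from the paper.

```python
import math


def displacement_errors(predicted, reference):
    """Return (ADE, FDE) in the units of the input coordinates.

    predicted, reference: equally long sequences of (x, y) positions.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    # Euclidean distance between the two trajectories at each time step.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    ade = sum(distances) / len(distances)  # mean over all steps
    fde = distances[-1]                    # error at the final step only
    return ade, fde


# Toy trajectories in meters (made up for illustration).
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.4)]
ade, fde = displacement_errors(predicted, reference)
```

Here the per-step errors are 0, 0.3, and 0.4 m, so ADE is their mean and FDE is the last value.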

2605.17204 2026-05-19 cs.RO cs.AI 版本更新

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

基于事件的稀疏自编码器用于视觉-语言-动作策略

Xinchen Jin, Aditya Chatterjee, Pranav Kumar, Rohan Paleja

Affiliations * Department of Computer Science, Purdue University, West Lafayette, IN 47907

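AI summary This paper proposes an event-grounded sparse autoencoder (SAE) analysis pipeline for interpreting Vision-Language-Action (VLA) policies. Anchoring SAE feature analysis to behavioral events rather than text contexts yields stronger causal effects on closed-loop behavior and improves interpretability.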

详情
AI中文摘要

视觉-语言-动作(VLA)策略将语言和视觉输入转化为机器人动作,其隐藏表示直接塑造闭环行为。然而,语言和视觉-语言模型中的机制可解释性工具无法直接转移到VLA中:输出是机器人动作而非人类可读的标记,干预只能通过昂贵的闭环回放测试。我们提出了一种基于事件的可解释性流程,将SAE特征分析锚定在行为事件而非文本上下文中。通过在每个任务中使用视觉、状态和时间线索对末端执行器关键帧进行聚类,将SAE特征与行为显著事件联系起来,并通过可选的VLM注释与语义上下文联系起来。据我们所知,我们的流程是首个将基于SAE的VLA分析锚定在闭环行为事件上的方法之一。在两个仿真架构和一个真实机器人研究中,基于事件的排名在OpenVLA上产生了最强的因果效应,并转移到了π_{0.5}的连续动作块中。SAE是一种稀疏但不完美的干预基础:实用性因架构和干预位置而异,激进干预揭示了安全性和可解释性的限制。总体而言,基于事件的SAE分析成为行为锚定VLA可解释性的一种实用起点,推动了未来关于SAE特征的研究,包括超越动作对齐坐标的更细致分析、更精细的闭环评估以及高风险VLA部署中的安全干预。代码可在https://github.com/xc-j/Event-SAE上获得。

英文摘要

Vision-Language-Action (VLA) policies translate language and visual inputs into robot actions, and their hidden representations directly shape closed-loop behavior. However, mechanistic interpretability tools from language and vision-language models do not transfer cleanly to VLAs: outputs are robot actions rather than human-readable tokens, and interventions can only be tested via expensive closed-loop rollouts. We propose an event-grounded interpretability pipeline that anchors SAE feature analysis to behavioral events rather than text contexts. End-effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events and, via optional VLM annotations, to semantic context. To our knowledge, our pipeline is among the first to ground SAE-based VLA analysis in closed-loop behavioral events. Across two simulation architectures and a real-robot study, event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of $\pi_{0.5}$. SAE is a sparse but imperfect intervention basis: usability varies with architecture and intervention site, and aggressive intervention reveals safety and interpretability limits. Overall, event-grounded SAE analysis emerges as a practical starting point for behavior-anchored VLA interpretability, motivating future work on SAE features beyond action-aligned coordinates, finer-grained closed-loop evaluation, and safe interventions for high-stakes VLA deployments. Code is available at https://github.com/xc-j/Event-SAE.

2605.17144 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

对比性概念激活引导(COAST):通过隐藏状态解锁视觉-语言-动作模型

Miranda Muqing Miao, Subin Kim, Brandon Yang, Lyle Ungar

Affiliations * University of Pennsylvania

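AI summary This paper proposes COAST, which uses conceptors to identify success-critical subspaces from a few success and failure rollouts and steers the latents of Vision-Language-Action models toward them at inference time, raising task success rates on robotic tasks.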

Comments Submitted to NeurIPS 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型利用大规模网络视觉-语言模型(VLM)预训练的强感知先验,但实际应用中却表现出惊人的脆弱性,常常在简单的机器人任务中失败。为缓解这一问题,我们提出了对比性概念激活引导(COAST)。COAST基于“概念”这一线性操作符,该操作符能将数据软投影到目标分布的主成分中。COAST利用概念来从少量的成功和失败轨迹中识别出目标机器人任务的成功子空间。在推理过程中,它将VLA的潜在表示引导到这些识别出的成功子空间中,以提高任务结果。在三种架构不同的神经策略(流匹配VLA、自回归VLA和扩散策略)上,COAST将绝对均值仿真和真实机器人任务的成功率分别提高了超过20%和40%。激活子空间几何表明,失败模式在不同任务中共享大量结构,而成功表示则主要任务特定。当任务共享相似的失败模式时,这种结构使之前拟合的概念能提升新任务的性能而无需重新拟合。最终,我们的结果表明,当前VLA在潜在表示中保留了大量任务相关的知识,而动作专家的解码瓶颈可以通过将残差流引导至任务相关子空间来缓解。COAST提供了一条轻量、无训练的路径,通过引导模型朝其自身的“成功”分布发展,来解锁这些潜在能力。

英文摘要

Vision-Language-Action (VLA) models leverage powerful perceptual priors from web-scale Vision-Language Model (VLM) pre-training, yet they remain surprisingly brittle in practice, frequently failing at simple robotic tasks. To mitigate this, we propose Contrastive Conceptor Activation Steering (COAST). COAST builds on the notion of a "conceptor", a linear operator that soft-projects data into the principal components of a target distribution. COAST uses conceptors to identify success-critical subspaces for a target robotic task from a few examples of success and failure rollouts. At inference time, it steers VLA latents into these identified success subspaces to improve task outcomes. Across three architecturally distinct neural policies (flow-matching VLA, autoregressive VLA, and Diffusion Policy), COAST improves mean task success rate in absolute terms by over 20% in simulation and over 40% on real robots. The activation subspace geometry reveals that failure modes share substantial structure across tasks while success representations remain largely task-specific. When tasks share similar failure modes, this structure enables previously fitted conceptors to improve performance on new tasks without refitting. Ultimately, our results suggest that current VLAs retain substantial task-relevant knowledge in their latent representations, and that the action expert's decoding bottleneck could be mitigated by steering its residual stream toward task-relevant subspaces. COAST provides a lightweight, training-free path to unlocking these latent capabilities by steering the model towards its own "success" distributions.
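The conceptor that the abstract builds on can be written down in a few lines. The sketch below assumes the standard definition C = R (R + a^-2 I)^-1, where R is the correlation matrix of the activations and a is the aperture, and it is restricted to two dimensions to stay dependency-free. How COAST contrasts success and failure conceptors and applies them during inference is not stated in the abstract and is not shown here. All function names are illustrative.

```python
def correlation_matrix(samples):
    """R = (1/n) * sum of outer products x x^T over the activation samples."""
    n = len(samples)
    dim = len(samples[0])
    return [[sum(s[i] * s[j] for s in samples) / n for j in range(dim)]
            for i in range(dim)]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def conceptor_2d(samples, aperture):
    """Conceptor C = R (R + aperture^-2 I)^-1 for 2-D activations."""
    r = correlation_matrix(samples)
    reg = 1.0 / aperture ** 2
    shifted = [[r[0][0] + reg, r[0][1]], [r[1][0], r[1][1] + reg]]
    return matmul_2x2(r, inverse_2x2(shifted))


def apply_matrix(m, v):
    """Matrix-vector product for a 2x2 matrix."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


# Toy activations that vary only along the first axis.
activations = [(2.0, 0.0), (-2.0, 0.0)]
C = conceptor_2d(activations, aperture=1.0)
projected = apply_matrix(C, [1.0, 1.0])
```

In this toy case the component along the axis the data occupies is kept with weight 0.8, while the unused axis is suppressed to zero, which is the "soft projection" behavior the abstract describes.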

2605.17123 2026-05-19 cs.HC cs.RO 版本更新

ATRACT: A Trustworthy Robotic Autonomous system to support Casualty Triage

ATRACT: 一种可靠的机器人自主系统以支持伤员分诊

Tasweer Ahmad, Rafael Pina, Sandip Pradhan, Arindam Sikdar, Mindula Illeperuma, Khizer Saeed, Peter Lee, Varuna De Silva, Ardhendu Behera

Affiliations * Department of Computer Science, Edge Hill University; Institute for Digital Technologies, Loughborough University London; School of Architecture, Technology and Engineering, University of Brighton; School of Criminology and Criminal Justice, University of Portsmouth

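AI summary This paper presents ATRACT, a human-in-the-loop decision support system that integrates drone video with wearable sensor data through multi-modal learning to support battlefield casualty triage while reducing the exposure of frontline medics.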

详情
AI中文摘要

在无人机日益与敌对行动相关联的背景下,我们重新利用它们用于人道主义和生命拯救应用。然而,将搜索和救援无人机适应于战场分诊仍极具挑战性;技术必须可靠以支持在极端不确定性、受限访问和显著个人风险下操作的前线医护人员。由于冲突地区伤员撤离的日益增长的脆弱性,本文提出了ATRACT(一种可靠的机器人自主系统以支持伤员分诊),一种新颖的人机协同决策支持系统,旨在在创伤后的关键时期实现早期战场分诊。ATRACT整合无人机捕获的视频与可穿戴传感器输入进行多模态学习,以支持伤员状态评估,从而解决现有系统的局限性。无人机视频捕获细粒度的行为线索,如姿势、体态,而可穿戴传感器提供互补的生理信号,包括心率、呼吸率和运动。通过结合两种模态,ATRACT为在直接接触伤员被延迟、风险或受限时提供证据支持医护人员的早期判断。为了缓解受伤动作数据真实性的差距,设计了一种条件变分自编码器用于数据增强。在我们的无人机捕获数据集上的实验结果表明,所提出的流程在动作分类上达到了85.7%的准确率;而我们的轻量级CNN视觉编码器在与更强的预训练视频骨干网络竞争时仍具有竞争力。总体而言,结果支持ATRACT作为向冲突环境中远程分诊迈出的实际有意义的一步,其中多模态传感、人类监督和可信的决策支持可以改善伤员优先级排序,并减少前线医护人员的暴露风险。

英文摘要

At a time when drones are increasingly associated with hostile operations, we re-purpose them for humanitarian and life-saving applications. However, adapting search and rescue drones for battlefield triage remains extremely challenging; the technology must perform reliably to support frontline medics who are forced to operate under extreme uncertainty, restricted access, and significant personal risk. Given the growing vulnerability of casualty evacuation in conflict zones, this paper presents ATRACT (A Trustworthy Robotic Autonomous system to support Casualty Triage), a novel human-in-the-loop decision support system to enable early battlefield triage during the critical post-trauma period. ATRACT integrates drone-captured video with wearable sensor input for multi-modal learning to support casualty-state assessment, thereby addressing the limitations of existing systems. Drone video captures fine-grained behavioural cues, such as pose and posture, while body-worn sensors provide complementary physiological signals, including heart rate, breathing rate, and movement. By combining the two modalities, ATRACT provides evidence to support the early judgement of medics when direct access to the casualty is delayed, risky, or restricted. To mitigate the data realism gap pertaining to injured actions, a conditional variational autoencoder is devised for data augmentation. Experimental results on our drone-captured dataset show that the proposed pipeline achieves 85.7% accuracy for action classification, while our lightweight CNN visual encoder remains competitive with stronger pre-trained video backbones. Overall, the results support ATRACT as a practically meaningful step towards remote triage in contested environments, where multi-modal sensing, human oversight, and trustworthy decision support can improve casualty prioritisation and lessen the exposure of frontline medics.

2605.17077 2026-05-19 cs.RO cs.AI 版本更新

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

如何指导你的机器人:密集语言标注助力机器人策略学习

Bosung Kim, Ruiyi Wang, David Acuna, Jaehun Jung, Alexander Trevithick, Brandon Cui, Yejin Choi, Prithviraj Ammanabrolu

Affiliations * University of California, San Diego; NVIDIA

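AI summary This study uses dense language annotations to get more out of existing robot demonstrations. It proposes DeMiAn, which relies on multi-aspect annotations generated by a vision-language model to improve both policy and world-model performance without collecting new demonstrations.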

详情
AI中文摘要

机器人策略学习受限于演示数据收集成本,而现有演示的语言标注相对廉价。我们研究语言密度作为提取固定机器人或第一人称视频数据集信号的杠杆。我们引入DeMiAn(密集多方面标注),一种两阶段方法,首先通过视觉语言模型生成四个互补方面的演示段落重标记:物理运动、场景组成、手臂姿态和推理。一个学习到的指导者将任务描述和初始场景快照映射到部署时的任务合适标注,异步运行以隐藏生成延迟。在超过100万机器人操作片段和5万EgoVerse人类第一人称视频上,DeMiAn在视觉语言-动作策略和基于视频的世界-动作模型上均未收集新演示的情况下提升了性能。在RoboCasa上,指导者在任务-only基线基础上提升了5个百分点,接近每任务oracle的3个百分点。没有固定标注方面在所有任务中占主导,表明选择正确的密集语言至关重要。DeMiAn还提高了复合任务和分布外性能,并在考虑标注生成FLOPs后,同时提升了中训练和后训练的计算-性能前沿。这些结果将密集重新标注定位为机器人策略学习的实用扩展杠杆。

英文摘要

Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.

2605.17033 2026-05-19 cs.RO 版本更新

Generalizable and Actionable Parts Pose Estimation with Symmetry Annotation-Free Learning Strategy

具有对称标注自由学习策略的通用且可操作的部件姿态估计

Wenxiao Chen, Xueyu Yuan, Liu Liu, Di Wu, Dan Guo

Affiliations * Hefei University of Technology, Hefei, Anhui, China; University of Science and Technology of China, Hefei, Anhui, China

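AI summary This paper proposes SAFAG, a symmetry annotation-free framework for Generalizable and Actionable Parts pose estimation. A two-stage stepwise-refinement framework and a self-supervised learning strategy handle symmetry prediction, improving pose estimation performance and robustness in data-scarce scenarios.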

Comments Accepted as a poster at the Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

迫切需要的通用机器人物体交互和操作要求高质量的跨类别物体感知。作为该领域的先驱,通用且可操作的部件(GAParts)理解吸引了越来越多相关研究人员的关注。然而,大多数最近的工作要么在对称问题的设计上不足,要么需要丰富的对称标注,这严重阻碍了在数据匮乏场景中精确的GAPart姿态估计。在本文中,我们提出SAFAG,一种新的无需对称标注的通用且可操作的部件姿态估计框架。具体而言,我们建议了一个分步细化的两阶段框架用于候选到最终的四元数回归,并将对称预测作为概率分布问题,通过自监督学习策略进行解决。实验结果证明了我们SAFAG的优越性能和鲁棒性。我们相信我们的工作在许多具身AI系统领域具有巨大的应用潜力。

英文摘要

Generalizable robot-object interaction and manipulation, which is urgently needed, requires high-quality cross-category object perception. As a pioneering direction in this area, Generalizable and Actionable Parts (GAParts) understanding has attracted increasing attention from researchers. However, most recent works either handle the symmetry issue inadequately or require rich symmetry annotation, which severely impedes precise GAPart pose estimation in data-scarce scenarios. In this paper, we propose SAFAG, a novel Symmetry Annotation-Free framework for Generalizable and Actionable Parts Pose Estimation. Specifically, we propose a two-stage, stepwise-refinement framework for candidate-to-final quaternion regression, and treat symmetry prediction as a probability distribution problem solved with a self-supervised learning strategy. The experimental results demonstrate the superior performance and robustness of SAFAG. We believe that our work has enormous potential for application in many areas of embodied AI systems.

2605.16979 2026-05-19 cs.RO 版本更新

NORM-Nav: Zero-Shot Mobile Robot Navigation with Natural Language Behavioral Constraints

NORM-Nav: 通过自然语言行为约束实现零样本移动机器人导航

Dongjie Huo, Junhui Wang, Chao Gao, Yan Qiao, Dong Zhang, Guyue Zhou

Affiliations * College of Information Science and Technology, Beijing University of Chemical Technology; Institute for AI Industry Research (AIR), Tsinghua University; Institute of Systems Engineering and Collaborative Laboratory for Intelligent Science and Systems, Macau University of Science and Technology; School of Vehicle and Mobility, Tsinghua University

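AI summary This paper proposes NORM-Nav, a framework that integrates natural language behavioral constraints into costmap-based planning so that mobile robots navigate human-centered environments in a socially appropriate way. Experiments indicate higher task success rates and trajectories closer to human references than the baselines.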

详情
AI中文摘要

移动机器人在人类环境中运行时,不仅要生成无碰撞路径,还必须生成遵循本地行为规范的轨迹。传统基于成本图的导航强调几何可行性,往往忽视这些要求,可能导致不恰当的社会行为。本文提出了NORM-Nav,一种零样本框架,将自然语言行为约束整合到基于成本图的规划中。一个大语言模型将每个指令解析为结构化约束,并通过实时视觉-激光雷达感知进行 grounding。这些约束被编码为多层成本图,代表几何、语义、方向和速度提示,并直接与标准栅格规划器兼容。仿真和现实世界实验表明,NORM-Nav提高了任务成功率,并产生比代表基线更接近人类参考的轨迹。项目网站可用 https://ei-nav.github.io/NORM-Nav。

英文摘要

Mobile robots operating in human-centered environments must generate not only collision-free paths but also trajectories that follow local behavioral conventions. Conventional costmap-based navigation emphasizes geometric feasibility and often overlooks such requirements, which can result in socially inappropriate behaviors. This paper presents NORM-Nav, a zero-shot framework that integrates natural language behavioral constraints into costmap-based planning. An LLM parses each instruction into structured constraints and grounds them using real-time vision-LiDAR perception. These constraints are encoded as multi-layer costmaps that represent geometric, semantic, directional, and velocity cues and are directly compatible with standard grid-based planners. Simulation and real-world experiments indicate that NORM-Nav improves task success rates and produces trajectories closer to human references than representative baselines. The project website is available at https://ei-nav.github.io/NORM-Nav.
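To make the idea of multi-layer costmaps concrete, here is a minimal sketch of one way several per-cell cost layers might be merged into the single grid a standard planner consumes. The abstract does not say how NORM-Nav combines its layers, so the weighted sum, the `LETHAL` constant, the layer names, and the weights below are all assumptions. Directional and velocity cues, which are not plain scalar costs, are left out.

```python
LETHAL = 1.0  # assumed cost value marking a cell as untraversable


def combine_costmaps(layers, weights):
    """Merge named cost layers (grids with values in [0, 1]) into one grid.

    Cells that are lethal in the geometric layer stay lethal; every other
    cell receives the weighted sum of all layers, capped at LETHAL.
    """
    names = list(layers)
    rows = len(layers[names[0]])
    cols = len(layers[names[0]][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if layers["geometric"][r][c] >= LETHAL:
                combined[r][c] = LETHAL
                continue
            total = sum(weights[n] * layers[n][r][c] for n in names)
            combined[r][c] = min(total, LETHAL)
    return combined


# One-row toy map: an obstacle in the second cell and a hypothetical
# "keep away from this area" semantic penalty in the first cell.
layers = {
    "geometric": [[0.0, 1.0]],
    "semantic": [[0.5, 0.0]],
}
weights = {"geometric": 1.0, "semantic": 0.6}
combined = combine_costmaps(layers, weights)
```

The first cell stays traversable but becomes more expensive (0.6 x 0.5), so a grid planner would avoid it when a cheaper route exists, while the obstacle cell remains blocked.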

2605.16932 2026-05-19 cs.RO 版本更新

MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation

MORN: 为资源理性长周期导航的元认知目标-目标调节

Xi Lin, Jiayi Li, Kangyi Wu, Jiaqiao Tang, Qingrong He, Lin Zhao

Affiliations * LCSR Lab, Johns Hopkins University; Xi’an Jiaotong University; JD Explore Academy, Beijing, China

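AI summary This paper proposes MORN, a metacognitive navigation architecture inspired by Dual-Process Theory. By adding resource rationality to the navigation loop, it addresses the budget waste that reactive navigation systems incur on long-horizon missions because they lack global resource awareness, improving goal completion rate and reducing wasted steps.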

详情
AI中文摘要

在无结构人类环境中部署的机器人必须频繁执行长周期任务,如找到杯子、然后椅子、然后打印机,这些任务受严格操作约束。尽管现代零样本物体导航(ObjectNav)代理利用视觉-语言模型(VLMs)有效定位语义目标,但它们本质上是纯粹的反应系统,缺乏全局资源意识。因此,这些代理由于部分可观测性而无意中耗尽关键预算,包括时间和电池,对不可行的子目标进行本地探索,未能在本地探索与全局任务可行性之间取得平衡。为了填补这一差距,通过在导航循环中注入资源理性,我们提出了MORN(元认知目标-目标调节导航),一种受认知科学双过程理论启发的执行架构。MORN在冻结的导航骨干上增加了一个System 2元控制器,持续监控System 1的移动。通过正式化三个神经认知状态,潜在指数、坚持门控和证据积累,MORN根据在线进度速度和感知不确定性的估计动态调节任务计划。这种机制有效消除了沉没成本谬误,使代理能够提前中止僵尸目标并果断承诺可行的目标。在HM3D数据集上的大量实验表明,MORN将目标完成率(CR)从0.23提高到0.30,并将浪费步分数(WSF)从0.90降低到0.70,证明在资源受限自主性中,元认知对全局资源的意识与反应能力导航同样关键。

英文摘要

Robots deployed in unstructured human environments must frequently execute long-horizon missions, such as find the mug, then the chair, then the printer, under strict operational constraints. While contemporary zero-shot Object Navigation (ObjectNav) agents leverage Vision-Language Models (VLMs) to effectively localize semantic targets, they operate as purely reactive systems that inherently lack global resource awareness. Consequently, these agents inadvertently exhaust critical budgets, including time and battery, on infeasible subgoals due to partial observability, failing to balance local exploration with global mission viability. To bridge this gap by injecting resource-rationality into the navigation loop, we present MORN (Metacognitive Object-goal Regulation Navigation), an executive architecture inspired by Dual-Process Theory in cognitive science. MORN augments frozen navigation backbones with a System 2 meta-controller that continuously monitors the System 1 locomotor. By formalizing three neuro-cognitive states, Potentiality Index, Persistence Gating, and Evidence Accumulation, MORN dynamically regulates the mission schedule based on online estimates of progress velocity and perceptual uncertainty. This mechanism effectively neutralizes the Sunk Cost Fallacy, enabling agents to abort zombie goals early and decisively commit to achievable ones. Extensive experiments on the HM3D dataset demonstrate that MORN improves Goal Completion Rate (CR) from 0.23 to 0.30 and reduces Wasted Step Fraction (WSF) from 0.90 to 0.70, establishing that in resource-constrained autonomy, the metacognitive awareness of global resources is as critical as the reactive ability to navigate.

2605.16894 2026-05-19 cs.RO cs.SY eess.SY 版本更新

Beyond Safety Filtering: Control Barrier Function-Informed Reinforcement Learning for Connected and Automated Vehicles

超越安全过滤:基于控制屏障函数的强化学习用于连接和自动化车辆

Jianye Xu, Bassam Alrifaee

Affiliations * Department of Computer Science, RWTH Aachen University, Germany

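AI summary This paper proposes a Control Barrier Function (CBF)-informed reward design for multi-agent reinforcement learning that converts CBF constraint values under joint actions into a reward signal to explicitly guide safe learning. Experiments at a four-way multi-lane intersection show higher task performance and lower sensitivity to reward hyperparameters than heuristic reward baselines.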

Comments This paper has been accepted for publication in the Proceedings of the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

详情
AI中文摘要

强化学习(RL)使用奖励来引导学习,然而奖励设计通常是通过启发式方法手动构建,这可能难以调整。我们提出了一种多智能体RL(MARL)中的控制屏障函数(CBF)引导的奖励设计,将联合MARL动作下的CBF约束值转换为奖励信号,以显式引导安全学习。我们在四向多车道交叉口中与连接和自动化车辆进行了对比实验,两种启发式奖励基线。结果表明,我们的方法在任务性能上最高,并且对奖励超参数的敏感性较低,在测试的超参数范围内始终表现出一致的强性能。用于重现实验结果的代码和视频演示可在https://github.com/bassamlab/SigmaRL上获得。

英文摘要

Reinforcement Learning (RL) uses rewards to guide learning, yet reward design is typically hand-crafted using heuristics that can be difficult to tune. We propose a Control Barrier Function (CBF)-informed reward design for Multi-Agent RL (MARL) that converts CBF constraint values under joint MARL actions into a reward signal that explicitly guides safe learning. We compare against two heuristic reward baselines in a four-way multi-lane intersection with connected and automated vehicles. Results show that our method achieves the highest task performance and is less sensitive to reward hyperparameters, yielding consistently strong performance across the tested hyperparameter range. Code for reproducing the experimental results and a video demonstration are available at https://github.com/bassamlab/SigmaRL.
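As a hedged illustration of turning a CBF constraint value into a reward, the sketch below uses the textbook CBF condition dh/dt + alpha * h >= 0 with a distance-based barrier h = distance - safe_distance. The abstract does not give the paper's barrier function or its mapping from constraint value to reward, so the penalty-only mapping, the function names, and all numbers here are assumptions.

```python
def cbf_constraint_value(distance, distance_rate, safe_distance, alpha):
    """Left-hand side of the CBF condition dh/dt + alpha * h >= 0.

    h = distance - safe_distance is positive while the vehicles are safely apart;
    distance_rate is the time derivative of the distance (negative when closing in).
    """
    h = distance - safe_distance
    return distance_rate + alpha * h


def cbf_reward(constraint_value, weight=1.0):
    """Hypothetical mapping: no reward change while the condition holds,
    and a penalty proportional to the violation otherwise."""
    return weight * min(constraint_value, 0.0)


# Far apart and closing slowly: the condition holds, so there is no penalty.
safe_value = cbf_constraint_value(distance=5.0, distance_rate=-1.0,
                                  safe_distance=2.0, alpha=0.5)
safe_reward = cbf_reward(safe_value)

# Close and closing fast: the condition is violated, so the reward is negative.
risky_value = cbf_constraint_value(distance=2.5, distance_rate=-2.0,
                                   safe_distance=2.0, alpha=0.5)
risky_reward = cbf_reward(risky_value)
```

Because the penalty grows with the size of the violation, the agent receives a graded signal before any collision occurs rather than a single terminal penalty.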

2605.16871 2026-05-19 cs.RO 版本更新

SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

SADP:基于基础模型生成示范的子目标感知扩散策略用于可解释机器人

Site Hu, Takato Horii

Affiliations * Department of Systems Innovation, Graduate School of Engineering Science, Osaka University

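AI summary This paper proposes SADP, a subgoal-aware diffusion policy for explainable robots. Foundation models autonomously generate subgoal-annotated demonstrations, and the diffusion policy trained on them exposes its subgoal structure and execution progress, achieving higher task success rates in long-horizon manipulation while supporting failure diagnosis.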

详情
AI中文摘要

可解释机器人不仅需要成功执行任务,还需要以用户友好的方式暴露内部决策过程。然而,大多数模仿学习方法仅在任务层面的示范上训练,没有显式建模子目标结构或执行进度。这种限制在标准机器人学习数据集中子目标级监督稀缺的情况下进一步加剧,限制了能够传达其执行子任务的机器人发展。为了解决这个问题,本文提出了Subgoal-Aware Diffusion Policy (SADP),一种利用基础模型自主生成子目标标注的示范数据,并在这些数据集上训练扩散策略的框架。SADP通过将动作生成条件化在任务层面和子目标层面的描述上,围绕人类可解释的子目标结构构建策略执行。一个轻量级的辅助头进一步预测子目标完成状态,使机器人能够暴露其当前执行阶段并监控子目标进展。在RLBench模拟和实际UR5e机器人上的实验表明,SADP在任务成功率方面优于强大的任务条件扩散基线,同时提供子目标级执行信号用于监控进度和故障诊断。这些结果表明,内置而非事后解释性可以与高任务性能共存。

英文摘要

Explainable robots require not only successful task execution but also the ability to expose their internal decision-making process in a user-friendly manner. However, most imitation learning methods are trained solely on task-level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal-level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long-horizon manipulation. To address this issue, this paper proposes Subgoal-Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal-annotated demonstrations and trains diffusion policies on these datasets. SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real-world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level execution signals for monitoring progress and diagnosing failures. These results highlight that built-in, rather than post-hoc, interpretability can coexist with high task performance.

2605.16870 2026-05-19 cs.RO 版本更新

SSTL: Self-Sensing Tendon Loop for Hysteresis Modeling and Compensation in Tendon-Sheath Mechanisms

SSTL:自感知腱环用于腱鞘机制的滞后模型与补偿

Myeongbo Park, Junhyun Park, Ihsan Ullah, Chunggil An, Minho Hwang

Affiliations * Department of Robotics and Mechatronics Engineering, DGIST; AI Research Lab, DEEPNOID

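AI summary This paper proposes the Self-Sensing Tendon Loop (SSTL) to address hysteresis caused by tendon-sheath friction and tendon elasticity in tendon-sheath mechanisms. Input and output tensions measured at the proximal end are used to estimate hysteresis parameters for compensation, improving control accuracy for flexible endoscopic robots.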

Comments 8 pages, 7 figures, 4 tables

详情
AI中文摘要

柔性内窥镜机器人通过自然孔道实现微创接入,但其控制精度受限于腱鞘机制(TSMs)中配置依赖的滞后现象。腱鞘摩擦和腱弹性导致输入和输出之间存在系统性差异,且该差异随插入管配置变化。为解决这一挑战,本文提出自感知腱环(SSTL),一种通过插入管双程路由并围绕远端滑轮缠绕的腱环结构,使输入和输出张力均可在近端测量,从而无需远端力或光纤传感器即可获得输入-输出张力剖面。由于SSTL与驱动TSM共享相同路由路径,两个TSM表现出高度相关的滞后行为。从SSTL张力剖面中,基于学习的映射估计驱动TSM的配置依赖滞后参数,这些参数随后被前馈控制器用于补偿驱动滞后。我们通过在三种不同插入管配置下跟踪驱动腱张力验证了所提方法。在正弦和随机轨迹上,所提方法将平均RMSE降低88.1%,达到直接识别方法的97.8%,后者需要直接测量驱动TSM的输入和输出张力剖面。

英文摘要

Flexible endoscopic robots enable minimally invasive access through natural orifices, but their control accuracy is limited by configuration-dependent hysteresis in the tendon-sheath mechanisms (TSMs). Tendon-sheath friction and tendon elasticity induce a systematic discrepancy between the proximal actuation input and distal output, and this discrepancy varies with the insertion tube configuration. To address this challenge, this paper proposes the Self-Sensing Tendon Loop (SSTL), a double-pass tendon loop that is routed through the insertion tube, wrapped around a distal pulley, and returned to the proximal end. The loop structure allows both the input and output tensions of the SSTL to be measured proximally, thereby providing an input-output tension profile without requiring distal force or fiber-optic sensors. Because the SSTL shares the same routing path as the actuation TSM, the two TSMs exhibit strongly correlated hysteresis behaviors. From the SSTL tension profile, a learning-based mapping estimates the configuration-dependent hysteresis parameters of the actuation TSM, which are then used by a feedforward controller to compensate for actuation hysteresis. We validate the proposed method by tracking actuation tendon tension under three different insertion tube configurations. Across sinusoidal and random trajectories, the proposed method reduces average RMSE by 88.1% compared with the uncompensated baseline, achieving 97.8% of the performance of direct identification, which requires direct measurement of the input and output tension profiles of the actuation TSM.
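To show why friction makes tendon-sheath hysteresis depend on the tube configuration, the sketch below uses the textbook capstan (Coulomb friction) model, in which transmitted tension decays exponentially with the accumulated bend angle. This is a stand-in for illustration only: the paper estimates hysteresis parameters with a learned mapping, and the friction coefficient, bend angle, and function name here are hypothetical.

```python
import math


def distal_tension(proximal_tension, friction_coeff, bend_angle, pulling):
    """Tension at the distal end of a tendon routed through a curved sheath.

    proximal_tension: tension applied at the actuator side, in newtons.
    friction_coeff:   assumed Coulomb friction coefficient between tendon and sheath.
    bend_angle:       accumulated bending angle of the sheath, in radians.
    pulling:          True while the tendon is being pulled, False while released.
    """
    # Friction opposes motion, so it reduces the transmitted tension while
    # pulling and holds extra tension at the distal end while releasing.
    sign = -1.0 if pulling else 1.0
    return proximal_tension * math.exp(sign * friction_coeff * bend_angle)


# Hypothetical values: 10 N input, mu = 0.1, sheath bent through 180 degrees.
loading = distal_tension(10.0, 0.1, math.pi, pulling=True)
unloading = distal_tension(10.0, 0.1, math.pi, pulling=False)
```

The same proximal tension maps to two different distal tensions depending on the direction of motion, and the gap widens as the bend angle grows, which is the configuration dependence the abstract refers to.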

2605.16863 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

先规划,后扩散:用于长视距扩散规划的外在图引导

Yaniv Hassidof, Adir Morgan, Yilun Du, Kiril Solovey

发表机构 * Technion(技术Ion大学) Harvard(哈佛大学)

AI总结 本文提出了一种外在搜索引导的扩散模型(XDiffuser),通过在状态空间图上先规划再引导扩散过程,以提高长视距规划的效率和效果,尤其在低质量数据和未见任务中表现优异。

详情
AI中文摘要

组合扩散模型通过去噪多个重叠的子轨迹并确保它们构成全局解,为长视距规划提供了一条有前途的路线。然而,强制在长链上执行局部行为往往不足以产生一致的全局结构。最近的工作通过内在搜索在去噪过程中探索多条路径来解决这一限制。尽管内在搜索提高了全局一致性,但代价是重复评估已经计算密集的模型。在本文中,我们主张在去噪过程之外进行外在搜索,为长视距规划提供更有效的探索模式,同时自然地使经典算法能够解决测试时的未见组合任务。我们的eXtrinsic搜索引导的Diffuser(XDiffuser)首先在状态空间图上计算一个计划——作为扩散模型的轻量级局部连接Oracle。该计划随后用于引导单条轨迹的去噪,有效地将探索负担转移出去。XDiffuser在长视距任务上优于基于扩散的基线,特别是在低质量数据领域和超出目标到达的未见任务中,包括多智能体协调和TSP风格推理。项目网站:https://yanivhass.github.io/XDiffuser-site/

英文摘要

Compositional diffusion models offer a promising route to long-horizon planning by denoising multiple overlapping sub-trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute-heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long-horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search-guided Diffuser (XDiffuser) first computes a plan over a state-space graph, serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion-based baselines on long-horizon tasks, with particularly large gains in the low-quality data regime and on unseen tasks beyond goal-reaching, including multi-agent coordination and TSP-style reasoning. Project website: https://yanivhass.github.io/XDiffuser-site/
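The "plan first" step can be pictured as an ordinary shortest-path search over a graph of states. The sketch below uses Dijkstra's algorithm as one classical choice; the abstract does not name the search algorithm XDiffuser uses, and the toy graph, node names, and edge costs are made up. The subsequent diffusion guidance is not shown.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra search; returns (total_cost, list_of_nodes) or (inf, [])."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Toy state-space graph: nodes stand for states, edge weights for the
# assumed cost of moving between locally connected states.
graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 1.0, "D": 5.0},
    "C": {"D": 1.0},
    "D": {},
}
cost, waypoints = shortest_path(graph, "A", "D")
```

The resulting waypoint sequence is the kind of coarse plan that could then condition the denoising of a single trajectory, so the search happens once, outside the diffusion model.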

2605.16858 2026-05-19 cs.RO cs.AI 版本更新

Pedestrian-Aware LLM-Driven Behavioral Planning for Autonomous Vehicles

面向行人的LLM驱动行为规划用于自动驾驶车辆

Aidana Baimbetova, Haruki Yonekura, Hamada Rizk, Hirozumi Yamaguchi

Affiliations * The University of Osaka, Japan; RIKEN Center for Computational Science, Japan; Tanta University, Egypt

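AI summary This paper proposes a large language model (LLM)-based decision-making framework for pedestrian-aware behavioral planning in autonomous vehicles. Structured scene observations are converted into natural-language reasoning prompts from which the LLM generates cautious driving decisions.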

Comments This paper has been accepted for presentation at the 29th IEEE International Conference on Intelligent Transportation Systems (ITSC)

详情
AI中文摘要

自动驾驶车辆(AVs)必须在行人行为多变、有时异常且训练中常未见的密集城市环境中做出可靠决策。基于强化学习(RL)的AV控制系统在结构化交通中表现良好,但在面对不可预测的行人交互和分布外场景时泛化能力较差。其依赖手工制定的奖励和不透明决策进一步限制了其在行人密集、安全关键环境中的适用性。为了解决这些限制,我们引入了一种基于大型语言模型(LLM)的决策框架,用于行人感知的行为规划。该系统将结构化的场景观测转换为自然语言推理提示,使LLM能够推断行人意图、预测风险并生成谨慎的战术驾驶决策。这些决策由运动规划器执行,以确保平滑且动力学可行的控制。我们在SUMO上评估了该框架,涵盖多个行人交互场景,包括意外闯红灯、回退过马路、犹豫和双向过马路。在零样本评估中,基于LLM的智能体实现了68%的无碰撞成功率,显著优于深度RL基线(17.7%)。在单行人场景中使用少量样本的episodic记忆,性能增加到96.0%,超过定制DQN控制器(82.0%)。跨行为评估进一步表明,来自回退交互的记忆可以转移到未见的犹豫和双向过马路场景,分别达到82.0%和90.0%的成功率。该系统能够更早地发起响应,维持更宽的安全缓冲区,并产生可解释、与人类一致的决策。

英文摘要

Autonomous Vehicles (AVs) must make reliable decisions in dense urban environments where pedestrian behavior is variable, sometimes abnormal, and often unseen during training. Reinforcement learning (RL)-based AV control systems perform well in structured traffic but struggle to generalize to unpredictable pedestrian interactions and out-of-distribution scenarios. Their reliance on handcrafted rewards and opaque decisions further limits their suitability for safety-critical, pedestrian-rich environments. To address these limitations, we introduce a Large Language Model (LLM)-based decision-making framework for pedestrian-aware behavioral planning. The system converts structured scene observations into natural-language reasoning prompts, enabling the LLM to infer pedestrian intent, anticipate risk, and generate cautious tactical driving decisions. These decisions are executed by a motion planner that ensures smooth, kinematically feasible control. We evaluate the framework in SUMO across multiple pedestrian-interaction scenarios, including unexpected jaywalking, turn-back crossing, hesitation, and bidirectional crossing. In zero-shot evaluation, the LLM-based agent achieves a 68% collision-free success rate, substantially outperforming deep RL baselines (17.7%). With few-shot episodic memory in a single-pedestrian scenario, performance increases to 96.0%, exceeding a custom DQN controller (82.0%). Cross-behavior evaluation further shows that memory derived from turn-back interactions transfers to unseen hesitation and bidirectional crossing scenarios, achieving 82.0% and 90.0% success, respectively. The system consistently initiates earlier responses, maintains wider safety buffers, and produces interpretable, human-aligned decisions.
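The conversion from structured scene observations to a natural-language prompt might look roughly like the sketch below. The observation schema (`ego_speed_mps`, `pedestrians`, `allowed_actions`, and the per-pedestrian fields) and the wording of the prompt are hypothetical; the abstract does not describe the actual prompt format, and no LLM call is made here.

```python
def build_prompt(observation):
    """Render a structured scene observation as a natural-language prompt."""
    lines = [
        f"Ego vehicle speed: {observation['ego_speed_mps']:.1f} m/s.",
    ]
    for ped in observation["pedestrians"]:
        lines.append(
            f"Pedestrian {ped['id']} is {ped['distance_m']:.1f} m ahead, "
            f"{ped['lateral_offset_m']:.1f} m from the lane center, "
            f"moving at {ped['speed_mps']:.1f} m/s and {ped['state']}."
        )
    lines.append(
        "Infer each pedestrian's intent, assess the collision risk, and choose "
        "one action from: " + ", ".join(observation["allowed_actions"]) + "."
    )
    return "\n".join(lines)


# Hypothetical observation for a single hesitating pedestrian.
observation = {
    "ego_speed_mps": 8.3,
    "pedestrians": [
        {"id": "p1", "distance_m": 12.0, "lateral_offset_m": 2.5,
         "speed_mps": 0.4, "state": "hesitating at the curb"},
    ],
    "allowed_actions": ["keep_speed", "slow_down", "stop"],
}
prompt = build_prompt(observation)
```

Restricting the answer to a fixed action list keeps the LLM's output easy to parse before it is handed to a motion planner.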

2605.16797 2026-05-19 cs.CV cs.RO 版本更新

EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices

EgoKit:迈向基于异构设备的统一低成本第一人称视角数据采集

Liuchuan Yu, Erdem Murat, Beichen Wang, Yan Zeng, Tingting Luo, Huizhen Zhou, Shanghao Li, Huining Feng, Zhigen Zhao, Ning Yang, Ke Jing, Yunhao Liu, Ruoya Sheng

发表机构 * George Mason University(乔治梅森大学) ByteDance(字节跳动)

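AI summary: This paper presents EgoKit, a toolkit that unifies egocentric data collection across six heterogeneous host devices. It addresses differing device SDKs and inconsistent capture workflows, and provides a uniform log format plus hand-tracking data on XR headsets.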

详情
AI中文摘要

第一人称视角视频越来越多地被用作机器人学习、活动理解和具身AI研究的数据源,但在实践中大规模采集仍然是碎片化的:每种候选主机设备,如Android手机、iPhone、iPad、智能眼镜或扩展现实(XR)头戴设备,都提供不同的SDK,对原始摄像头访问有不同的策略,对外接USB摄像头和设备端追踪也有不同的限制。因此,同步的第一人称视角与腕部视角采集通常只能通过绑定单一专有平台,或搭建无法跨设备迁移的一次性装置来实现。为弥补这一空白,我们提出了EgoKit,一个在六种异构主机设备上提供相同第一人称视角录制流程的工具包。在所有支持的设备上,EgoKit提供相同的录制交互,并生成本地存储的视频和统一的日志格式;在XR头戴设备上,它还会记录与视频流对齐的头部姿态和符合OpenXR标准的26关节手部追踪数据。配套配件包括两个带支架的腕部摄像头、一条头带和一个USB-C集线器,无需定制硬件即可为任何支持的主机增加腕部视角采集。EgoKit可在 https://egokit.chuange.org/ 获取。

英文摘要

Egocentric video is increasingly used as a data source for robot learning, activity understanding, and embodied AI research, but collecting it at scale remains fragmented in practice: each candidate host device, such as an Android phone, iPhone, iPad, smart glasses, or extended reality (XR) headset, exposes a different SDK, a different policy on raw camera access, and different limitations on external USB cameras and on-device tracking. Synchronized ego-view and wrist-view capture is therefore typically obtained by either committing to a single proprietary platform or building one-off rigs that do not transfer across devices. To address this gap, we present EgoKit, a toolkit that exposes the same egocentric recording workflow across six heterogeneous host devices. Across all supported devices, EgoKit presents the same recording interaction and produces locally stored video with a uniform log format; on XR headsets, it additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The companion accessories, including two wrist cameras with mounts, a head strap, and a USB-C hub, add wrist-view capture to any supported host without custom hardware fabrication. EgoKit is available at https://egokit.chuange.org/.

2605.16743 2026-05-19 cs.RO 版本更新

LACE: Latent Visual Representation for Cross-Embodiment Learning

LACE: 用于跨具身学习的潜在视觉表示

Yoo Sung Jang, Kanchana Ranasinghe, Cristina Mata, Yichi Zhang, Jorge Mendez-Mendez, Michael S. Ryoo

发表机构 * Stony Brook University(石溪大学) Salesforce AI Research(Salesforce AI研究院)

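AI summary: This paper proposes LACE, a framework that aligns human and robot visual representations in the latent space of self-supervised learning backbones, using correspondences between shared body parts across embodiments. It targets the visual gap between human and robot embodiments and improves robot policies when robot demonstrations are scarce.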

详情
AI中文摘要

从人类示范中进行跨具身学习受到人类与机器人具身之间视觉差距的阻碍。尽管自监督学习(SSL)骨干网络能够编码通用物体丰富的类间语义,但我们发现它们无法在人手与机器人手之间建立对应关系。我们提出了LACE框架,利用不同具身之间共享身体部位的对应关系作为稀疏监督,在这些骨干网络的潜在空间中对齐人类和机器人的视觉表示。这些标注可以通过正向运动学自动获得,并且单个机器人示范就足以训练模型。我们的语义对齐损失对对应特征所产生的分布进行匹配,将图像块级监督提升为语义级对齐,同时Gram损失保持预训练特征的质量。这种对齐使机器人策略能够在机器人示范稀缺时利用丰富的人类数据:在零样本迁移中,使用LACE-DINO的策略大幅优于使用DINO的策略(65%),并在低数据和分布外环境中取得一致的提升。

英文摘要

Cross-embodiment learning from human demonstrations is hindered by the visual gap between human and robot embodiments. While self-supervised learning (SSL) backbones encode rich inter-class semantics of general objects, we show they fail to establish correspondence between human and robot hands. We propose LACE, a framework that aligns human and robot visual representations in the latent space of these backbones by leveraging correspondences between shared body parts across embodiments as sparse supervision. These annotations can be automatically obtained via forward kinematics, and a single robot demonstration is sufficient to train the model. Our semantic alignment loss matches distributions incurred by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality. This alignment enables robot policies to leverage abundant human data when robot demonstrations are scarce: in zero-shot transfer, policies using LACE-DINO outperform those using DINO by a large margin (65%), with consistent gains in low-data regimes and out-of-distribution environments.
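
To make the role of a Gram loss more concrete, here is a small Python sketch of one common form: penalizing the squared difference between the patch-similarity (Gram) matrices of the fine-tuned features and the frozen pretrained features. The abstract does not give the exact formulation, so the L2 normalization, the mean-squared reduction, and the function names are assumptions for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are returned as-is)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def gram(features):
    """Cosine-similarity Gram matrix over a list of patch feature vectors."""
    rows = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def gram_loss(student, teacher):
    """Mean squared difference between student and teacher Gram matrices."""
    gs, gt = gram(student), gram(teacher)
    n = len(gs)
    return sum((gs[i][j] - gt[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]    # teacher rotated by 90 degrees
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]    # all patches identical
```

Under this form, `rotated` incurs essentially zero loss because rotating every feature leaves pairwise similarities unchanged, while `collapsed` is penalized. That is the sense in which such a term constrains the relational structure of the features without pinning each feature to its pretrained value.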

2605.16737 2026-05-19 cs.RO cs.CV 版本更新

DriveSafer: End-to-End Autonomous Driving with Safety Guidance

DriveSafer: 结合安全指导的端到端自动驾驶

Shounak Sural, Raj Rajkumar

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

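AI summary: This paper proposes DriveSafer, a framework that improves the safety of end-to-end autonomous driving by reducing catastrophic planning failures rather than raising average planning quality.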

详情
AI中文摘要

端到端(E2E)自动驾驶模型近年来能力不断增强,在越来越具有挑战性的基准测试上性能持续提升。然而,现代生成式E2E规划器在安全关键场景中仍然存在大量灾难性故障。我们发现许多此类故障源于对物理约束和安全要求的违反,从而导致不安全行为。受此发现启发,本文专注于改善生成式端到端驾驶的安全结果,有针对性地减少灾难性规划失败,而不是提升平均规划质量。为此,我们提出了DriveSafer,一种面向端到端规划器的故障感知安全框架。DriveSafer同时利用训练时的安全约束和推理时的安全指导,明确引导生成式规划器趋向安全行为。与最先进的DiffusionDrive模型相比,在NAVSIM基准测试中,DriveSafer将灾难性故障数量(PDMS=0)减少了48%,并将可行驶区域合规性故障减少了65%以上。

英文摘要

End-to-End (E2E) autonomous driving models have shown growing capability in recent years, with performance improving on increasingly challenging benchmarks. However, modern generative E2E planners still suffer from a substantial number of catastrophic failures in safety-critical scenarios. We find that many such failures arise from violations of physical constraints and safety requirements, leading to unsafe behavior. Motivated by this finding, in this paper, we focus on improving safety outcomes in generative end-to-end driving with a targeted reduction of catastrophic planning failures, instead of enhancing average planning quality. Towards this end, we propose DriveSafer, a failure-aware safety framework for end-to-end planners. DriveSafer explicitly steers generative planners towards safe behaviors leveraging both training-time safety constraints and inference-time safety guidance. Compared to the state-of-the-art DiffusionDrive model, on the NAVSIM benchmark, DriveSafer reduces the number of catastrophic failures (PDMS=0) by 48%, with over 65% reduction in drivable-area compliance failures.

2605.16673 2026-05-19 cs.RO 版本更新

Bayesian Networks for Path-Based Sensors: Gathering Information and Path Planning in Communication Denied Environments

基于路径的传感器的贝叶斯网络:在通信受限环境中收集信息和路径规划

Alkesh K. Srivastava, George P. Kontoudis, Donald Sofge, Michael Otte

发表机构 * University of Maryland, College Park, MD, US.(美国马里兰大学帕克分校) Temple University, Philadelphia, PA, US.(美国天普大学) Colorado School of Mines, Golden, CO, US.(美国科罗拉多矿业学院) U.S. Naval Research Lab (Retired), DC, US.(美国海军研究实验室(已退休))

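AI summary: This paper proposes a Bayesian-network-based belief update for path-based sensors in communication-denied environments. The method accounts for false positives and false negatives and makes the belief map converge faster than prior work.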

Comments This paper has been accepted for presentation at 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR 2026)

详情
AI中文摘要

“基于路径的传感器”沿一条连续路径只产生单个观测值。例如,布尔型路径传感器如果在路径上任意一点检测到感兴趣的事件,则返回单个“1”,否则返回“0”。值得注意的是,“1”并不提供事件发生在路径何处的直接信息。先前的工作表明,可以融合多个路径传感器的观测,构建关于底层事件或现象空间位置的贝叶斯信念图。此外,路径规划可以利用香农信息论来加快信念图的收敛速度。在本文中,我们提出了一种新方法,基于路径传感器观测更新信念图,然后规划路径以增加信息增益。与先前通过对备选事件历史取平均来近似后验的工作不同,我们引入了贝叶斯网络(BN)建模方式,对潜在变量与路径传感器测量之间的概率关系进行建模,从而实现更有原则的贝叶斯信念更新。我们以通信受限环境中的静态危险检测作为代表性问题设置。机器人从其路径返回对应于路径危险传感器读数为“0”(未检测到危险),而机器人未能返回则对应于读数为“1”(检测到危险)。我们同时考虑假阳性和假阴性。我们发现,在单机器人和多机器人情况下,新方法都能使信念图比先前工作更快收敛。

英文摘要

A "path-based sensor" produces a single observation along a continuous path. For example, a boolean path-based sensor returns a single "1" if an event of interest is detected at any point along the path and a "0" otherwise. Notably, a "1" provides no direct information about where along the path the event(s) may have occurred. Previous work has demonstrated that observations from multiple path-based sensors can be fused to create a Bayesian belief map over the spatial locations of the underlying event or phenomenon. Moreover, path planning can employ Shannon information theory to accelerate the rate of convergence of the belief map. In this paper, we present a new method to update the belief map based on a path-based sensor observation, and then plan paths to increase information gain. In contrast to prior work that approximates the posterior by averaging over the alternative event histories, we introduce a Bayesian Network (BN) formulation that models the probabilistic relationships between the latent variables and path-based sensor measurements, enabling a more principled Bayesian belief update. We consider static hazard detection in a communication-denied environment as a representative problem setting. The event of a robot returning from its path corresponds to a path-based hazard sensor reading of "0" (hazard not detected), while a robot failing to return corresponds to a reading of "1" (hazard detected). We consider false positives and false negatives. We find that the new method leads to quicker convergence of the belief map than prior work in both single- and multi-robot cases.
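
The difficulty with a boolean path-based reading, namely that a "1" does not say where along the path the hazard lies, can be seen in a toy Bayesian update. The Python sketch below is not the paper's Bayesian Network formulation. It assumes independent per-cell hazard priors and a single reading with false-positive rate `fp` and false-negative rate `fn`, and all names are hypothetical.

```python
from math import prod


def update_path_belief(priors, z, fp=0.0, fn=0.0):
    """Posterior hazard probability for each cell on one path.

    priors: prior hazard probability per traversed cell (assumed independent).
    z:      boolean path reading, 1 = hazard detected somewhere, 0 = not detected.
    """
    posteriors = []
    for i, p in enumerate(priors):
        # Probability that every *other* cell on the path is hazard-free.
        others_clear = prod(1.0 - q for j, q in enumerate(priors) if j != i)
        # Likelihood of reading "1" if cell i is hazardous / not hazardous.
        like1_h = 1.0 - fn
        like1_nh = (1.0 - fn) * (1.0 - others_clear) + fp * others_clear
        if z == 1:
            lh, lnh = like1_h, like1_nh
        else:
            lh, lnh = 1.0 - like1_h, 1.0 - like1_nh
        evidence = lh * p + lnh * (1.0 - p)
        posteriors.append(lh * p / evidence if evidence > 0 else p)
    return posteriors


detected = update_path_belief([0.5, 0.5], z=1)   # reading "1" on a two-cell path
returned = update_path_belief([0.5, 0.5], z=0)   # reading "0" on the same path
```

With noise-free sensing, a "1" on a two-cell path only raises each cell from 0.5 to 2/3, because the reading cannot say which cell is responsible, while a "0" clears both cells. Fusing many overlapping paths is what localizes the hazard.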

2605.10408 2026-05-19 cs.SE cs.RO 版本更新

VISOR: A Vision-Language Model-based Test Oracle for Testing Robots

VISOR:基于视觉-语言模型的机器人测试 oracle

Prasun Saurabh, Pablo Valle, Aitor Arrieta, Shaukat Ali, Paolo Arcaini

发表机构 * Simula Research Laboratory(Simula研究实验室) Oslo Metropolitan University(奥斯陆城市大学) Mondragon University(蒙德拉贡大学) National Institute of Informatics(日本国立情报学研究所)

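AI summary: VISOR is a Vision-Language Model (VLM)-based automated test oracle for robot testing that assesses task correctness and quality. Evaluated with two VLMs across four robotic tasks, Gemini achieves higher recall and GPT higher precision, but uncertainty correlates only weakly with correctness.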

详情
AI中文摘要

机器人测试需要评估机器人是否正确、可靠且高质量地完成预期任务,这在软件测试中被称为测试oracle问题。传统上,这种评估依赖任务特定的符号oracle来判断任务正确性,并依赖人工评估机器人行为,既耗时又主观,且容易出错。为此,我们提出VISOR,一种基于视觉-语言模型(VLM)的自动化测试oracle评估方法,免去了昂贵的人工评估。VISOR自动评估任务正确性和质量,弥补了现有符号测试oracle任务特定、只给出通过/失败判定而不显式量化任务质量的局限。鉴于VLM固有的不确定性,VISOR还会显式量化自身在测试评估中的不确定性。我们使用GPT和Gemini两个VLM,在四个机器人任务的1000多个视频上评估了VISOR。结果表明Gemini的召回率更高,而GPT的精确率更高。然而,两个模型的不确定性与正确性之间的相关性都很低,因此无法将不确定性用作正确性的预测指标。

英文摘要

Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on human manual evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)-based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test assessments. We evaluated VISOR using two VLMs, i.e., GPT and Gemini, across four robotic tasks on over 1,000 videos. Results show that Gemini achieves higher recall while GPT achieves higher precision. However, both models show low correlation between uncertainty and correctness, which prevents using uncertainty as a correctness predictor.

2605.02759 2026-05-19 cs.RO cs.CV 版本更新

DynoSLAM: Dynamic SLAM with Generative Graph Neural Networks for Real-World Social Navigation

DynoSLAM:基于生成图神经网络的动态SLAM用于现实世界的社交导航

Danil Tokhchukov, Veronika Morozova, Gonzalo Ferrer

发表机构 * Applied AI Institute(应用人工智能研究所)

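AI summary: This paper proposes DynoSLAM, which integrates a socially-aware graph neural network into SLAM to handle uncertainty in dynamic environments and improve robot navigation in crowded spaces.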

Comments Code & Project page at https://github.com/makriot/dynoslam

详情
AI中文摘要

传统的同时定位与建图(SLAM)算法严重依赖静态环境假设,这极大限制了其在存在行人等移动实体的现实空间中的适用性。本文提出DynoSLAM,一种紧耦合的动态GraphSLAM架构,将社交感知的图神经网络(GNN)直接整合到因子图优化中。与使用刚性恒速启发式或确定性单智能体神经先验的传统方法不同,我们的框架将行人运动预测表述为随机世界模型。通过对训练好的GNN进行蒙特卡洛推演(rollout),我们捕捉人类交互的多模态认知不确定性,并通过动态马氏距离因子将其嵌入SLAM图中。大量仿真实验表明,这种随机建模不仅保持了高精度的回溯跟踪,还能防止由确定性“argmax问题”导致的优化失败。最终,提取未来行人状态的经验均值和协方差矩阵,可为下游局部规划器提供数学上严格的概率安全包络,从而在人群密集的环境中实现具有预见性的无碰撞机器人导航。

英文摘要

Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly-coupled Dynamic GraphSLAM architecture that integrates socially-aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant-velocity heuristics or deterministic single-agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model. By utilizing Monte Carlo rollouts from a trained GNN, we capture the multimodal epistemic uncertainty of human interactions and embed it into the SLAM graph via a dynamic Mahalanobis distance factor. We demonstrate through extensive simulated experiments that this stochastic formulation not only maintains highly accurate retrospective tracking but also prevents the optimization failures caused by the deterministic "argmax problem". Ultimately, extracting the empirical mean and covariance matrices of future pedestrian states provides a mathematically rigorous, probabilistic safety envelope for downstream local planners, enabling anticipatory and collision-free robot navigation in densely crowded environments.
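
As a rough illustration of the last step, the Python sketch below takes a set of sampled future 2D pedestrian positions (standing in for Monte Carlo rollouts), computes their empirical mean and covariance, and evaluates the squared Mahalanobis distance of a query point. The sample data and function names are invented for the example, and the factor-graph integration described in the abstract is not reproduced.

```python
def mean_and_cov(samples):
    """Empirical mean and unbiased 2x2 covariance of 2D samples."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of a 2D point under (mean, cov)."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (_, d) = cov
    det = a * d - b * b
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Quadratic form with the closed-form inverse of a symmetric 2x2 matrix.
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det


# Hypothetical rollout endpoints for one pedestrian at one future time step.
rollouts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mu, sigma = mean_and_cov(rollouts)
d2 = mahalanobis_sq((3.0, 1.0), mu, sigma)
```

A local planner could, for instance, treat positions whose `d2` falls below a chosen chi-square threshold as lying inside the pedestrian's probabilistic envelope. The threshold choice is a design decision left open here.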

2604.21363 2026-05-19 cs.RO 版本更新

A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

可部署的具身视觉-语言导航系统:具有层次认知和上下文感知探索

Kuan Xu, Ruimeng Liu, Yizhuo Yang, Denan Liang, Tongxing Jin, Shenghai Yuan, Chen Wang, Lihua Xie

发表机构 * Center for Advanced Robotics Technology Innovation (CARTIN), School of Electrical and Electronic Engineering, Nanyang Technological University(先进机器人技术与创新中心(CARTIN)、电子工程学院、南洋理工大学) Spatial AI & Robotics Lab, Department of Computer Science and Engineering(空间人工智能与机器人实验室、计算机科学与工程系)

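AI summary: This paper presents a deployable embodied vision-language navigation system that uses hierarchical cognition and context-aware exploration to achieve efficient navigation and strong reasoning on real robots.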

Comments 10 pages, 5 figures

详情
AI中文摘要

在智能机器人系统中,弥合具身智能与嵌入式部署之间的差距仍然是一个关键挑战,其中感知、推理和规划必须在计算、内存、能耗和实时执行的严格约束下运行。在视觉-语言导航(VLN)中,现有方法在真实平台上常面临推理能力与部署效率之间的权衡。本文提出一种可部署的具身VLN系统,在真实机器人上同时实现了高效率和强大的高层推理。该系统被分解为快速感知-行动层和深度推理层,二者以不同时间尺度异步运行,并通过共享记忆层实现高效交互。为支持长时程推理,我们增量式地构建紧凑的记忆图,并逐步将分解后的子图输入视觉-语言模型(VLM)。此外,我们联合考虑推理结果和候选区域的空间分布,将探索表述为加权旅行修理工问题(Weighted Traveling Repairman Problem,WTRP)。在仿真和真实环境中的大量实验表明,该方法在导航成功率和效率上优于现有VLN方法,同时在资源受限的硬件上保持实时性能。代码和更多真实环境实验可在 https://github.com/xukuanHIT/HiCo-Nav 获取。

英文摘要

Bridging the gap between embodied intelligence and embedded deployment remains a key challenge in intelligent robotic systems, where perception, reasoning, and planning must operate under strict constraints on computation, memory, energy, and real-time execution. In vision-and-language navigation (VLN), existing approaches often face a trade-off between reasoning capability and deployment efficiency on real-world platforms. In this paper, we present a deployable embodied VLN system that achieves both high efficiency and strong high-level reasoning on real-world robots. The system is decomposed into a fast perception-action layer and a deep reasoning layer running asynchronously at different time scales, with a shared memory layer enabling efficient interaction between them. To support long-horizon reasoning, we incrementally construct a compact memory graph and progressively feed decomposed subgraphs into a vision-language model (VLM). Furthermore, we formulate exploration as a Weighted Traveling Repairman Problem (WTRP) by jointly considering reasoning outcomes and the spatial distribution of candidate regions. Extensive experiments in simulation and real-world environments demonstrate improved navigation success and efficiency over existing VLN approaches while maintaining real-time performance on resource-constrained hardware. Code and additional real-world experiments are available at https://github.com/xukuanHIT/HiCo-Nav.
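
The Weighted Traveling Repairman objective differs from the classic traveling salesman objective in that it minimizes the weighted sum of arrival times rather than total tour length, so important regions get visited early. The brute-force Python sketch below shows this on a toy instance. The region coordinates and weights are made up, and how the system derives weights from VLM reasoning is not specified in the abstract.

```python
from itertools import permutations
from math import dist


def weighted_latency(start, order, regions):
    """Sum over regions of weight * arrival time (unit speed assumed)."""
    t, pos, cost = 0.0, start, 0.0
    for name in order:
        xy, weight = regions[name]
        t += dist(pos, xy)      # arrival time accumulates along the route
        cost += weight * t
        pos = xy
    return cost


def best_order(start, regions):
    """Exhaustive search; only practical for a handful of candidate regions."""
    return min(permutations(regions), key=lambda o: weighted_latency(start, o, regions))


# Hypothetical candidates: name -> ((x, y), relevance weight).
regions = {"A": ((1.0, 0.0), 1.0), "B": ((-2.0, 0.0), 10.0)}
order = best_order((0.0, 0.0), regions)
```

Here the farther region `B` is visited first because its weight is high. With equal weights the nearer region `A` would come first. A practical system would replace the exhaustive search with a heuristic or approximate solver.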

2603.19199 2026-05-19 cs.RO cs.CV 版本更新

FASTER: Rethinking Real-Time Flow VLAs

FASTER:重新思考实时的基于流的VLA

Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao

发表机构 * The University of Hong Kong(香港大学) ACE Robotics(ACE机器人)

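AI summary: This paper proposes FASTER, which introduces a Horizon-Aware Schedule to substantially reduce the reaction latency of real-time flow-based Vision-Language-Action (VLA) policies while preserving trajectory quality in dynamic tasks.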

Comments Project page: https://innovator-zero.github.io/FASTER

详情
AI中文摘要

实时执行对于在物理世界中部署视觉-语言-动作(VLA)模型至关重要。现有的异步推理方法主要优化轨迹平滑度,却忽视了对环境变化做出反应时的关键延迟。通过重新思考动作分块策略中“反应”的概念,本文系统分析了决定反应时间的因素。我们证明,反应时间服从均匀分布,该分布由首个动作时间(TTFA)和执行时域共同决定。此外,我们揭示了在基于流的VLA中采用恒定调度的标准做法可能效率低下,它迫使系统在任何运动开始之前完成所有采样步骤,从而成为反应延迟的瓶颈。为解决这一问题,我们提出了用于即时反应的快速动作采样(FASTER)。通过引入时域感知调度(Horizon-Aware Schedule),FASTER在流采样过程中自适应地优先处理近期动作,将即时反应的去噪压缩十倍(例如在π_{0.5}和X-VLA中),缩减为单步完成,同时保持长时域轨迹的质量。结合流式客户端-服务器管线,FASTER显著降低了真实机器人上的有效反应延迟,在消费级GPU上部署时尤为明显。包括高动态乒乓球任务在内的真实世界实验证明,FASTER大幅提升了通用策略的实时响应能力,能够快速生成准确且平滑的轨迹。

英文摘要

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $π_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks substantially improved real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
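
The claim that reaction time is uniformly distributed can be made tangible with a simplified model. The Python sketch below assumes, purely for illustration, that an environmental change arrives at a uniformly random moment within an execution window of duration `horizon_s`, that the policy can only respond at the next inference, and that the first new action arrives `ttfa_s` later. Under these assumptions the reaction time is uniform on `[ttfa_s, ttfa_s + horizon_s]`. The paper's exact bounds may differ.

```python
import random


def reaction_time_bounds(ttfa_s, horizon_s):
    """Support of the uniform reaction-time distribution under the toy model."""
    return ttfa_s, ttfa_s + horizon_s


def expected_reaction_time(ttfa_s, horizon_s):
    """Mean of a uniform distribution is the midpoint of its support."""
    lo, hi = reaction_time_bounds(ttfa_s, horizon_s)
    return 0.5 * (lo + hi)


def simulate_mean_reaction(ttfa_s, horizon_s, n=100_000, seed=0):
    """Monte Carlo check: events land uniformly inside the execution window."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        event_time = rng.uniform(0.0, horizon_s)
        # Wait for the current chunk to finish, then for the first new action.
        total += (horizon_s - event_time) + ttfa_s
    return total / n


analytic = expected_reaction_time(ttfa_s=0.1, horizon_s=0.5)
empirical = simulate_mean_reaction(ttfa_s=0.1, horizon_s=0.5)
```

In this toy model the mean reaction time is `ttfa_s + horizon_s / 2`, which shows why shortening the time to first action moves the whole distribution, not just its lower bound.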

2603.12243 2026-05-19 cs.RO 版本更新

HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies

HandelBot:通过快速适应灵巧机器人策略实现真实世界钢琴演奏

Amber Xie, Haozhi Qi, Dorsa Sadigh

发表机构 * Stanford University(斯坦福大学) Amazon FAR (Frontier AI & Robotics)(亚马逊前沿人工智能与机器人实验室)

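AI summary: This paper proposes HandelBot, a framework that combines a simulation-trained policy with rapid real-world adaptation through a two-stage pipeline to achieve precise bimanual piano playing. Experiments show it outperforms direct simulation deployment on real hardware.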

Comments Website: https://amberxie88.github.io/handelbot

详情
AI中文摘要

用多指灵巧手掌握灵巧操作,几十年来一直是机器人领域的重大挑战。尽管潜力巨大,高质量数据的采集难度仍是高精度任务的主要瓶颈。虽然强化学习和仿真到现实的迁移提供了有前景的替代方案,但迁移后的策略在需要毫米级精度的任务(如双手钢琴演奏)中往往失效。在本文中,我们提出HandelBot,一个通过两阶段流程将仿真策略与快速适应相结合的框架。从仿真训练的策略出发,我们首先应用结构化细化阶段,根据真机推演(rollout)调整手指的侧向关节,以纠正空间对齐。接着,我们使用残差强化学习自主学习细粒度的纠正动作。通过在五首知名曲目上进行大量硬件实验,我们证明HandelBot能够成功完成精确的双手钢琴演奏。我们的系统比直接部署仿真策略的性能高出1.8倍,并且仅需30分钟的真实交互数据。

英文摘要

Mastering dexterous manipulation with multi-fingered hands has been a grand challenge in robotics for decades. Despite its potential, the difficulty of collecting high-quality data remains a primary bottleneck for high-precision tasks. While reinforcement learning and simulation-to-real-world transfer offer a promising alternative, the transferred policies often fail for tasks demanding millimeter-scale precision, such as bimanual piano playing. In this work, we introduce HandelBot, a framework that combines a simulation policy and rapid adaptation through a two-stage pipeline. Starting from a simulation-trained policy, we first apply a structured refinement stage to correct spatial alignments by adjusting lateral finger joints based on physical rollouts. Next, we use residual reinforcement learning to autonomously learn fine-grained corrective actions. Through extensive hardware experiments across five recognized songs, we demonstrate that HandelBot can successfully perform precise bimanual piano playing. Our system outperforms direct simulation deployment by a factor of 1.8x and requires only 30 minutes of physical interaction data.

2603.07126 2026-05-19 cs.RO 版本更新

Efficient Trajectory Optimization for Autonomous Racing via Formula-1 Data-Driven Initialization

通过F1数据驱动初始化实现自主赛车的高效轨迹优化

Samir Shehadeh, Lukas Kutsch, Nils Dengler, Sicong Pan, Maren Bennewitz

发表机构 * University of Bonn(波恩大学) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习与人工智能研究所) Center for Robotics(机器人中心) German Federal Ministry of Research, Technology and Space(德国联邦研究、科技与航天部)

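AI summary: This paper builds a multi-track trajectory dataset from Formula 1 telemetry and proposes a learning-based initialization strategy that predicts an expert-like raceline from local track geometry, accelerating optimal-control solver convergence and reducing runtime.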

详情
AI中文摘要

轨迹优化是快速高效的自主赛车的核心组成部分。然而,实际的优化流程对初始化仍然高度敏感,当以中心线或最小曲率路径等启发式轨迹作为初值时,可能收敛缓慢或陷入次优的局部解。为解决这一局限,我们将专家驾驶行为用作初始化先验,提出一种基于真实世界F1遥测数据的学习型初始化策略。为此,我们首先对带噪声的GPS遥测数据进行重建,并将其对齐到标准化的参考线表示,构建了涵盖17条赛道的多赛道F1轨迹数据集。在此基础上,我们提出一个神经网络,可直接根据局部赛道几何预测专家级赛车线偏移量,而无需显式建模车辆动力学或受力。预测的赛车线随后被用作最小时间最优控制求解器的有信息初值。在全部17条赛道上的实验表明,与传统几何基线相比,学习得到的初始化加快了求解器收敛并显著减少了运行时间,同时保持了最终优化的圈速不变。

英文摘要

Trajectory optimization is a central component of fast and efficient autonomous racing. However, practical optimization pipelines remain highly sensitive to initialization and may converge slowly or to suboptimal local solutions when seeded with heuristic trajectories such as the centerline or minimum-curvature paths. To address this limitation, we leverage expert driving behavior as an initialization prior and propose a learning-informed initialization strategy based on real-world Formula 1 telemetry. To this end, we first construct a multi-track Formula 1 trajectory dataset by reconstructing and aligning noisy GPS telemetry to a standardized reference-line representation across 17 tracks. Building on this, we present a neural network that predicts an expert-like raceline offset directly from local track geometry, without explicitly modeling vehicle dynamics or forces. The predicted raceline is then used as an informed seed for a minimum-time optimal control solver. Experiments on all 17 tracks demonstrate that the learned initialization accelerates solver convergence and significantly reduces runtime compared to traditional geometric baselines, while preserving the final optimized lap time.

2603.02642 2026-05-19 cs.RO cs.DC cs.SY eess.SY 版本更新

cuNRTO: GPU-Accelerated Nonlinear Robust Trajectory Optimization

cuNRTO:GPU加速的非线性鲁棒轨迹优化

Jiawei Wang, Arshiya Taj Abdul, Evangelos A. Theodorou

发表机构 * Georgia Institute of Technology, Atlanta(佐治亚理工学院,亚特兰大) University of California, San Diego(加州大学圣地亚哥分校) Deemos Corporation(德摩斯公司)

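AI summary: This paper proposes cuNRTO, a CUDA implementation of nonlinear robust trajectory optimization that uses Douglas-Rachford splitting and ADMM to handle second-order cone constraints. Experiments on several robot models show speedups of up to 139.6×.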

详情
AI中文摘要

鲁棒轨迹优化通过计算对所有有界扰动均满足约束的控制策略,使自主系统能够在不确定性下安全运行。然而,这些问题通常会产生规模庞大的二阶锥规划(SOCP)约束,计算代价高昂。本文提出CUDA非线性鲁棒轨迹优化(cuNRTO)框架,引入两种可直接应用于鲁棒决策、并在CUDA上实现的动态优化架构。第一种架构NRTO-DR利用Douglas-Rachford(DR)分裂法求解NRTO内层的SOCP子问题,通过并行SOC投影和稀疏直接求解显著降低计算负担。第二种架构NRTO-FullADMM是一种新变体,使用交替方向乘子法(ADMM)进一步利用问题结构以提升可扩展性。最后,我们给出了所提方法的GPU实现,其中SOC投影步骤使用自定义CUDA内核,反馈增益更新使用cuBLAS GEMM链。我们在独轮车模型、四旋翼和Franka机械臂模型上通过仿真实验验证了cuNRTO的性能,实现了最高139.6倍的加速。更多细节请访问 https://cunrto.github.io。

英文摘要

Robust trajectory optimization enables autonomous systems to operate safely under uncertainty by computing control policies that satisfy the constraints for all bounded disturbances. However, these problems often lead to large Second Order Conic Programming (SOCP) constraints, which are computationally expensive. In this work, we propose the CUDA Nonlinear Robust Trajectory Optimization (cuNRTO) framework by introducing two dynamic optimization architectures that have direct application to robust decision-making and are implemented on CUDA. The first architecture, NRTO-DR, leverages the Douglas-Rachford (DR) splitting method to solve the SOCP inner subproblems of NRTO, thereby significantly reducing the computational burden through parallel SOCP projections and sparse direct solves. The second architecture, NRTO-FullADMM, is a novel variant that further exploits the problem structure to improve scalability using the Alternating Direction Method of Multipliers (ADMM). Finally, we provide GPU implementations of the proposed methodologies using custom CUDA kernels for SOC projection steps and cuBLAS GEMM chains for feedback gain updates. We validate the performance of cuNRTO through simulated experiments on unicycle, quadcopter, and Franka manipulator models, demonstrating speedups of up to 139.6×. More details are available at https://cunrto.github.io.
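
The SOC projection step mentioned above has a well-known closed form, which is part of why it parallelizes well: each cone block can be projected independently. The Python sketch below shows the standard Euclidean projection onto the cone {(x, t) : ||x|| <= t} for a single block. It is a CPU reference for the math only, not the authors' CUDA kernel, and the function name is hypothetical.

```python
import math


def project_soc(x, t):
    """Euclidean projection of (x, t) onto the second-order cone ||x|| <= t."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= t:
        # Already inside the cone: nothing to do.
        return list(x), t
    if norm <= -t:
        # Inside the polar cone: the projection is the origin.
        return [0.0] * len(x), 0.0
    # Otherwise project onto the cone boundary.
    alpha = 0.5 * (norm + t)
    return [alpha * v / norm for v in x], alpha


px, pt = project_soc([3.0, 4.0], 0.0)
```

In a splitting method such as Douglas-Rachford or ADMM, a projection of this kind would be applied to every cone block at each iteration, which maps naturally onto one GPU thread or warp per block.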

2602.23058 2026-05-19 cs.CV cs.RO 版本更新

GeoWorld: Geometric World Models

GeoWorld:几何世界模型

Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley

发表机构 * ANU(澳大利亚国立大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

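AI summary: GeoWorld uses a Hyperbolic JEPA and Geometric Reinforcement Learning to address the neglect of geometric structure and the weak long-horizon prediction of existing energy-based predictive world models. Experiments show around 3% SR improvement in 3-step planning and 2% in 4-step planning.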

Comments Accepted to CVPR 2026

详情
AI中文摘要

基于能量的预测世界模型通过在潜在能量景观上进行推理而非生成像素,为多步视觉规划提供了一种有力的方法。但现有方法面临两个主要挑战:(i)其潜在表示通常在欧几里得空间中学习,忽略了状态之间潜在的几何和层次结构;(ii)难以进行长时域预测,导致在较长的推演(rollout)中性能迅速退化。为应对这些挑战,我们提出GeoWorld,一种几何世界模型,它通过双曲JEPA(Hyperbolic JEPA)将潜在表示从欧几里得空间映射到双曲流形上,从而保留几何结构和层次关系。我们进一步引入几何强化学习进行基于能量的优化,实现双曲潜在空间中稳定的多步规划。在CrossTask和COIN上的大量实验表明,与最先进的V-JEPA 2相比,3步规划的SR提升约3%,4步规划的SR提升约2%。项目网站:https://steve-zeyu-zhang.github.io/GeoWorld。

英文摘要

Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
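
One common way to map Euclidean latents onto a hyperbolic manifold is the exponential map at the origin of the Poincaré ball. The abstract does not say which hyperbolic model or curvature GeoWorld uses, so the Python sketch below is only an assumed illustration with curvature parameter `c` and hypothetical function names.

```python
import math


def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sc = math.sqrt(c)
    scale = math.tanh(sc * norm) / (sc * norm)
    return [scale * x for x in v]


def poincare_distance(x, y):
    """Geodesic distance on the unit Poincare ball (c = 1)."""
    def sq(u):
        return sum(a * a for a in u)
    diff = sq([a - b for a, b in zip(x, y)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - sq(x)) * (1.0 - sq(y))))


euclidean_latent = [0.3, 0.4]            # Euclidean norm 0.5
point = expmap0(euclidean_latent)        # lands strictly inside the unit ball
d = poincare_distance([0.0, 0.0], point)
```

For `c = 1` the distance from the origin to the mapped point is twice the Euclidean norm of the input, so latents with large norms are pushed toward the boundary of the ball, where distances grow rapidly. This is the property usually cited as making hyperbolic space suitable for hierarchical structure.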

2602.22801 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

释放扩散模型在端到端自动驾驶中的潜力

Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang, Jianwei Cui, Guang Chen, Kun Ma, Hangjun Ye, Long Chen, Ya-Qin Zhang, Xianyuan Zhan, Jingjing Liu

发表机构 * Institute for AI Industry Research (AIR), Tsinghua University(人工智能产业研究院(AIR),清华大学)

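AI summary: Using large-scale real-vehicle data and road testing, this paper systematically studies diffusion models as planners for end-to-end autonomous driving and proposes the Hyper Diffusion Planner (HDP), which achieves a 10x performance improvement over the base model.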

详情
AI中文摘要

扩散模型已成为机器人决策任务中的流行选择,近年来也开始被考虑用于解决自动驾驶任务。然而,其在自动驾驶中的应用和评估仍局限于模拟或实验室环境。本研究通过大规模实车数据和道路测试,系统研究了扩散模型作为端到端自动驾驶规划器的潜力。通过全面而受控的研究,我们识别了扩散损失空间、轨迹表示和数据缩放等关键洞察,显著影响端到端规划性能。此外,我们还提供了一种有效的强化学习后训练策略,进一步提升学习规划器的安全性和鲁棒性。所提出的扩散学习框架Hyper Diffusion Planner (HDP)在真实车辆平台上部署,并在6个城市驾驶场景和200公里的真实世界测试中,实现了相对于基模型的10倍性能提升。本文证明了当正确设计和训练时,扩散模型可以作为有效且可扩展的端到端自动驾驶规划器,用于复杂的真实世界自动驾驶任务。

英文摘要

Diffusion models have become a popular choice for decision-making tasks in robotics, and more recently, are also being considered for solving autonomous driving tasks. However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings. The full strength of diffusion models for large-scale, complex real-world settings, such as End-to-End Autonomous Driving (E2E AD), remains underexplored. In this study, we conducted a systematic and large-scale investigation to unleash the potential of the diffusion models as planners for E2E AD, based on a tremendous amount of real-vehicle data and road testing. Through comprehensive and carefully controlled studies, we identify key insights into the diffusion loss space, trajectory representation, and data scaling that significantly impact E2E planning performance. Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety and robustness of the learned planner. The resulting diffusion-based learning framework, Hyper Diffusion Planner (HDP), is deployed on a real-vehicle platform and evaluated across 6 urban driving scenarios and 200 km of real-world testing, achieving a notable 10x performance improvement over the base model. Our work demonstrates that diffusion models, when properly designed and trained, can serve as effective and scalable E2E AD planners for complex, real-world autonomous driving tasks.

2602.19710 2026-05-19 cs.CV cs.LG cs.RO 版本更新

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

面向通用视觉-语言-动作策略的通用姿态预训练

Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu

发表机构 * Tencent Robotics X(腾讯机器人X) Futian Laboratory(福田实验室) The Hong Kong University of Science and Technology(香港科技大学) Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

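AI summary: This paper proposes Pose-VLA, which decouples Vision-Language-Action training into a pre-training phase that extracts universal 3D spatial priors and a post-training phase that aligns them efficiently with a robot-specific action space, addressing feature collapse and low training efficiency.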

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project website: https://hetolin.github.io/PoseVLA

详情
Journal ref
Robotics: Science and Systems, 2026
AI中文摘要

现有的视觉-语言-动作(VLA)模型常常出现特征坍塌和训练效率低的问题,因为它们将高层感知与稀疏的、具身特定的动作监督纠缠在一起。由于这些模型通常依赖为视觉问答(VQA)优化的VLM骨干网络,它们擅长语义识别,却常常忽视决定不同动作模式的细微3D状态变化。为解决这些不匹配,我们提出了Pose-VLA,一种解耦范式,将VLA训练分为两个阶段:预训练阶段在统一的以相机为中心的空间中提取通用的3D空间先验,后训练阶段在机器人特定的动作空间内进行高效的具身对齐。通过引入离散姿态token作为通用表示,Pose-VLA将来自多种3D数据集的空间定位(grounding)与来自机器人演示的几何级轨迹无缝整合。我们的框架采用两阶段预训练流程,先通过姿态建立基本的空间定位,再通过轨迹监督实现运动对齐。大量评估表明,Pose-VLA在RoboTwin 2.0上以79.5%的平均成功率取得了最先进的结果,并在LIBERO上取得了96.0%的有竞争力的性能。真实世界实验进一步展示了在每个任务仅使用100个演示的情况下对多样化物体的鲁棒泛化能力,验证了我们预训练范式的效率。

英文摘要

Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.

2602.12633 2026-05-19 cs.RO 版本更新

Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning

通过物理一致的物体间推理实现高密度环境下的现实到仿真

Tianyi Xiang, Jiahang Cao, Sikai Guo, Guoyang Zhao, Andrew F. Luo, Jun Ma

发表机构 * Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)机器人与自主系统方向) Institute of Data Science, The University of Hong Kong(香港大学数据科学学院)

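AI summary: This paper proposes a physics-constrained Real-to-Sim pipeline that models spatial dependencies with a contact graph and jointly refines object poses and physical properties in highly cluttered environments, yielding simulation scenes with high physical fidelity.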

Comments Project page: https://physics-constrained-real2sim.github.io

详情
AI中文摘要

从单视角观测重建物理有效的3D场景是连接视觉感知与机器人控制之间的必要前提。然而,在需要精确接触推理的场景中,例如在高度杂乱的环境中进行机器人操作时,仅依靠几何保真度是不够的。标准感知流程往往忽视物理约束,导致无效状态,例如漂浮物体或严重的相互穿透,使下游仿真不可靠。为了解决这些限制,我们提出了一种新的物理约束的现实到仿真管道,该管道从单视角RGB-D数据中重建物理一致的3D场景。我们方法的核心是一个可微优化管道,通过接触图显式建模空间依赖性,通过可微刚体仿真联合优化物体姿态和物理属性。在模拟和现实设置中的广泛评估表明,我们重建的场景实现了高物理保真度,并忠实复制了现实中的接触动态,使稳定可靠的接触丰富操作成为可能。

英文摘要

Reconstructing physically valid 3D scenes from single-view observations is a prerequisite for bridging the gap between visual perception and robotic control. However, in scenarios requiring precise contact reasoning, such as robotic manipulation in highly cluttered environments, geometric fidelity alone is insufficient. Standard perception pipelines often neglect physical constraints, resulting in invalid states, e.g., floating objects or severe inter-penetration, rendering downstream simulation unreliable. To address these limitations, we propose a novel physics-constrained Real-to-Sim pipeline that reconstructs physically consistent 3D scenes from single-view RGB-D data. Central to our approach is a differentiable optimization pipeline that explicitly models spatial dependencies via a contact graph, jointly refining object poses and physical properties through differentiable rigid-body simulation. Extensive evaluations in both simulation and real-world settings demonstrate that our reconstructed scenes achieve high physical fidelity and faithfully replicate real-world contact dynamics, enabling stable and reliable contact-rich manipulation.

2602.10503 2026-05-19 cs.RO Updated version

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

迈向长寿命机器人:通过强化微调实现持续学习的VLA模型

Yuan Liu, Haoran Li, Shuai Tian, Yuxing Qin, Yuhui Chen, Yupeng Zheng, Yongzhen Huang, Dongbin Zhao

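Affiliations: CASIA (Institute of Automation, Chinese Academy of Sciences)

AI summary: This paper proposes LifeLong-RFT, which combines chunking-level on-policy reinforcement learning with a multi-dimensional process reward mechanism to improve VLA models in multi-task learning and enable continual learning.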

Details
AI-generated Chinese abstract

在大规模和多样化数据集上预训练的VLA模型表现出强大的泛化和适应能力作为通用机器人策略。然而,监督微调(SFT)作为适应VLA到下游领域的主要机制,需要大量任务特定数据且易发生灾难性遗忘。为解决这些限制,我们提出LifeLong-RFT,一种简单有效的强化微调(RFT)策略,独立于在线环境反馈和预训练奖励模型。通过整合分块级在线强化学习与所提出的多维过程奖励机制,LifeLong-RFT量化了中间动作分块在三个维度上的异质贡献,以促进策略优化。具体而言,(1)量化动作一致性奖励(QACR)确保在离散动作空间内的准确动作预测;(2)连续轨迹对齐奖励(CTAR)将解码的连续动作分块与参考轨迹对齐,以确保精确控制;(3)格式合规性奖励(FCR)保证输出的结构有效性。在SimplerEnv、LIBERO和现实任务中的全面实验表明,LifeLong-RFT在多任务学习中表现出色。此外,在LIBERO基准上的持续学习中,我们的方法在SFT基础上实现了22%的平均成功率提升,同时仅使用20%的训练数据即可有效适应新任务。整体而言,我们的方法提供了一种有前景的VLA后训练范式。项目页面可在<https://yuan-liu-lifelong-rft.github.io>获取。

English abstract

Pretrained on large-scale and diverse datasets, VLA models demonstrate strong generalization and adaptability as general-purpose robotic policies. However, Supervised Fine-Tuning (SFT), which serves as the primary mechanism for adapting VLAs to downstream domains, requires substantial amounts of task-specific data and is prone to catastrophic forgetting. To address these limitations, we propose LifeLong-RFT, a simple yet effective Reinforcement Fine-Tuning (RFT) strategy for VLA models independent of online environmental feedback and pre-trained reward models. By integrating chunking-level on-policy reinforcement learning with the proposed multi-dimensional process reward mechanism, LifeLong-RFT quantifies the heterogeneous contributions of intermediate action chunks across three dimensions to facilitate policy optimization. Specifically, (1) the Quantized Action Consistency Reward (QACR) ensures accurate action prediction within the discrete action space; (2) the Continuous Trajectory Alignment Reward (CTAR) aligns decoded continuous action chunks with reference trajectories to ensure precise control; (3) the Format Compliance Reward (FCR) guarantees the structural validity of outputs. Comprehensive experiments across SimplerEnv, LIBERO, and real-world tasks demonstrate that LifeLong-RFT exhibits strong performance in multi-task learning. Furthermore, for continual learning on the LIBERO benchmark, our method achieves a 22% gain in average success rate over SFT, while effectively adapting to new tasks using only 20% of the training data. Overall, our method provides a promising post-training paradigm for VLAs. The project page is available at <https://yuan-liu-lifelong-rft.github.io>.

2602.08167 2026-05-19 cs.RO cs.AI cs.CV cs.LG Updated version

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

行动预测式具身推理的自监督自举

Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

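Affiliations: Stanford; UC Berkeley; NVIDIA

AI summary: This paper proposes R&B-EnCoRe, which uses self-supervised refinement to let models bootstrap embodied reasoning strategies from internet-scale knowledge, improving manipulation and navigation performance and reducing the collision rate.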

Comments Robotics: Science and Systems (RSS) 2026

Details
AI-generated Chinese abstract

具身链式思维(CoT)推理显著提升了视觉-语言-动作(VLA)模型,但当前方法依赖刚性模板指定推理原语(如场景中的物体、高层计划、结构 affordances)。这些模板可能迫使策略处理无关信息,干扰关键动作预测信号。我们引入R&B-EnCoRe,使模型通过自监督细化从互联网规模知识中自推导具身推理。通过将推理视为重要加权变分推断中的潜在变量,模型可生成并提炼无外部奖励、验证者或人工标注的具身特定策略训练数据集。我们在各种VLA架构中验证R&B-EnCoRe,应用于 manipulation(Franka Panda在仿真中,WidowX在硬件中)、legged导航(双足、轮式、自行车、四足)和自动驾驶具身,参数规模为1B、4B、7B和30B。我们的方法在 manipulation 成功率提升28%,导航评分提高101%,碰撞率减少21%。R&B-EnCoRe使模型提炼出预测成功控制的推理,避免手动标注工程,同时将互联网规模知识接地于物理执行。

English abstract

Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.

2602.06807 2026-05-19 cs.RO cs.AI cs.LG Updated version

SuReNav: Superpixel Graph-based Constraint Relaxation for Navigation in Over-constrained Environments

SuReNav:基于超像素图的约束放松用于过约束环境中的导航

Keonyoung Koh, Moonkyeong Jung, Samuel Seungsup Lee, Daehyung Park

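Affiliations: School of Computing, Korea Advanced Institute of Science and Technology, Korea

AI summary: This paper proposes SuReNav, which builds regional constraints on a superpixel graph and relaxes them with a graph neural network to achieve safe and efficient navigation for over-constrained planning problems in semi-static environments, improving the human-likeness of navigation.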

Comments Accepted by ICRA 2026. Code and videos are available at https://sure-nav.github.io/

Details
AI-generated Chinese abstract

我们针对半静态环境中过约束规划问题,提出SuReNav方法,通过超像素图构建区域约束,利用图神经网络训练于人类示范数据,实现安全高效的导航。框架包含三个组件:1)带有区域约束的超像素图地图生成,2)利用图神经网络进行区域约束放松,3)放松、规划和执行的交织过程。在2D语义地图和3D OpenStreetMap地图上评估,实现最高的人类类比得分,同时保持效率与安全的平衡。最后在现实城市导航中展示其可扩展性和泛化能力。代码和视频可在https://sure-nav.github.io/获取。

English abstract

We address the over-constrained planning problem in semi-static environments. The planning objective is to find a best-effort solution that avoids all hard constraint regions while minimally traversing the least risky areas. Conventional methods often rely on pre-defined area costs, limiting generalizations. Further, the spatial continuity of navigation spaces makes it difficult to identify regions that are passable without overestimation. To overcome these challenges, we propose SuReNav, a superpixel graph-based constraint relaxation and navigation method that imitates human-like safe and efficient navigation. Our framework consists of three components: 1) superpixel graph map generation with regional constraints, 2) regional-constraint relaxation using graph neural network trained on human demonstrations for safe and efficient navigation, and 3) interleaving relaxation, planning, and execution for complete navigation. We evaluate our method against state-of-the-art baselines on 2D semantic maps and 3D maps from OpenStreetMap, achieving the highest human-likeness score of complete navigation while maintaining a balanced trade-off between efficiency and safety. We finally demonstrate its scalability and generalization performance in real-world urban navigation with a quadruped robot, Spot. Code and Videos are available at https://sure-nav.github.io/.

2602.02236 2026-05-19 cs.RO cs.LG cs.NE cs.SY eess.SY Updated version

Adaptive Control in Autonomous Driving via Real-Time Recurrent RL

通过实时递归强化学习实现自动驾驶中的自适应控制

Julian Lemmel, Felix Resch, Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu

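Affiliations: TU Wien; MIT CSAIL; Liquid AI

AI summary: This paper studies online fine-tuning of pretrained control policies for autonomous driving using Real-Time Recurrent Reinforcement Learning (RTRRL), combining offline behavioral cloning with online RTRRL fine-tuning to adapt to distribution shifts at deployment. Experiments in the CarRacing simulation and on a 1:10-scale RoboRacer platform validate the approach.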

Details
AI-generated Chinese abstract

我们研究了使用实时递归强化学习(RTRRL)对自动驾驶预训练控制策略进行在线微调,RTRRL是一种内存高效的算法,能够在每个时间步更新策略参数而无需反向传播时间。我们扩展RTRRL以支持最近提出的非线性对角状态空间模型(LrcSSM),并将离线行为克隆与在线RTRRL微调结合,以适应部署时的分布偏移。我们在CarRacing模拟和配备事件相机的1:10比例RoboRacer平台上验证了该方法,其中预训练策略在现实世界直线跟踪中进行在线微调。到目前为止,这是首次在标准(非脉冲)硬件上实现闭环控制中的在线强化学习微调,使用事件相机观测。基于LrcSSM的策略在两种设置中均表现出最佳且最一致的性能。

English abstract

We study online fine-tuning of pretrained control policies for autonomous driving using Real-Time Recurrent Reinforcement Learning (RTRRL), a memory-efficient algorithm that updates policy parameters at every time step without backpropagation through time. We extend RTRRL to support LrcSSM, a recently proposed nonlinear diagonal state-space model, and combine offline behavioral cloning with online RTRRL fine-tuning to adapt policies to distribution shifts at deployment. We validate the approach in the CarRacing simulation and on a 1:10-scale RoboRacer platform equipped with an event camera, where a pretrained policy is fine-tuned online during real-world line-following. To our knowledge, this is the first demonstration of online RL fine-tuning with event-camera observations on standard (non-spiking) hardware in closed-loop control. LrcSSM-based policies improve fastest and most consistently across both settings.

2601.08454 2026-05-19 cs.RO Updated version

Real2Sim via Active Perception with Behavior Trees Automatically Generated by VLMs

基于VLM自动生成行为树的主动感知Real2Sim

Alessandro Adami, Sebastian Zudaire, Ruggero Carli, Pietro Falco

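Affiliations: University of Padova, Dept. of Information Engineering; ABB Robotics

AI summary: This paper proposes an autonomous Real2Sim framework based on vision-language models that performs semantic task decomposition and automatically generates behavior trees to acquire physical parameters efficiently; experiments show gains in efficiency and safety.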

Details
AI-generated Chinese abstract

构建物理准确的仿真环境(Real2Sim)传统上依赖于手动系统识别或刚性、详尽的探索流程。这些任务无关的流程往往无法利用语义场景上下文,导致冗余的物理交互和低效的数据采集。本文提出了一种自主的、意图驱动的Real2Sim框架,利用视觉语言模型(VLMs)进行语义任务分解。给定一个高层的自然语言请求、不完整的仿真描述和视觉观察,该框架会自动识别出模拟任务所需的最小子集缺失的物理参数。然后生成一个由原子运动和感知原语组成的行为树(BT),以选择性地通过接触丰富的机器人交互来获取这些参数。在扭矩控制的Franka Emika Panda上的大量现实世界实验表明,我们的方法能够准确估计物体质量、表面几何和导出参数如摩擦。定量评估显示,与详尽基线方法相比,我们的方法在操作效率上有显著提升,而消融研究证实了提示架构在不同最先进的VLMs上的鲁棒性。此外,行为树的反应层级充当一个确定性的安全过滤器,成功缓解了生成VLM的幻觉并防止了不安全的物理异常。最终,这项工作提供了一种可扩展、高效且可解释的管道,用于从无结构的人类意图直接构建物理感知的数字双胞胎。

English abstract

Constructing physically accurate simulation environments (Real2Sim) traditionally relies on manual system identification or rigid, exhaustive exploration routines. These task-agnostic pipelines often fail to leverage semantic scene context, leading to redundant physical interactions and inefficient data acquisition. In this paper, we present an autonomous, intent-driven Real2Sim framework that leverages Vision-Language Models (VLMs) for Semantic Task Decomposition. Given a high-level natural language request, an incomplete simulation description, and a visual observation, the framework autonomously identifies the minimal subset of missing physical parameters required for the simulation task. It then generates a reactive Behavior Tree (BT) composed of atomic motion and sensing primitives to selectively acquire these parameters through contact-rich robotic interaction. Extensive real-world experiments on a torque-controlled Franka Emika Panda demonstrate that our approach accurately estimates object mass, surface geometry, and derived parameters such as friction. Quantitative evaluations reveal significant operational efficiency gains compared to exhaustive baseline methods, while ablation studies confirm the robustness of the prompt architecture across different state-of-the-art VLMs. Furthermore, the reactive hierarchy of the BT acts as a deterministic safety filter, successfully mitigating generative VLM hallucinations and preventing unsafe physical anomalies. Ultimately, this work provides a scalable, efficient, and interpretable pipeline for building physics-aware digital twins directly from unstructured human intent.

2509.17666 2026-05-19 cs.RO Updated version

Robust and Resilient Soft Robotic Object Insertion with Compliance-Enabled Contact Formation and Failure Recovery

基于柔顺性接触形成与故障恢复的鲁棒且具韧性的软体机器人物体插入

Mimo Shirasaka, Cristian C. Beltran-Hernandez, Masashi Hamaya, Yoshitaka Ushiku

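Affiliations: OMRON SINIC X Corporation

AI summary: This paper proposes an object insertion method built on a passively compliant soft wrist that absorbs contact through large deformations, enabling safe contact formation and failure recovery; in simulation it achieves an 83% success rate under randomized conditions.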

Details
AI-generated Chinese abstract

物体插入任务在位姿不确定性和环境变化下容易失败,通常需要手动微调或重新训练控制器。本文提出了一种鲁棒且具韧性的物体插入方法,利用被动柔顺的软腕通过大变形安全地吸收接触,无需高频控制或力传感。该方法将插入过程结构化为柔顺性支持的接触形成,即逐步约束自由度的一系列接触状态,并整合了自动故障恢复策略。关键见解是腕部柔顺性允许安全、重复的恢复尝试,因此我们称之为柔顺性支持的故障恢复。我们采用预训练的视觉-语言模型(VLM),根据终止位姿和图像评估每次技能执行,识别故障模式,并通过选择技能和更新目标提出恢复动作。在仿真中,我们的方法实现了83%的成功率,能够从随机化条件引发的故障中恢复,包括最大5度的抓取偏差、最大20毫米的孔位姿误差、五倍的摩擦增加以及未见过的方形/矩形轴,并在真实机器人上进一步验证了该方法。项目页面可在https://omron-sinicx.github.io/compliance-enabled-failure-recovery/上访问。

English abstract

Object insertion tasks are prone to failure under pose uncertainty and environmental variation, often requiring manual fine-tuning or controller retraining. We present a novel approach for robust and resilient object insertion using a passively compliant soft wrist that enables safe contact absorption through large deformations, without high-frequency control or force sensing. Our method structures insertion as compliance-enabled contact formations, sequential contact states that progressively constrain degrees of freedom, and integrates automated failure recovery strategies. Our key insight is that wrist compliance permits safe, repeated recovery attempts; hence, we refer to it as compliance-enabled failure recovery. We employ a pre-trained vision-language model (VLM) that assesses each skill execution from terminal poses and images, identifies failure modes, and proposes recovery actions by selecting skills and updating goals. In simulation, our method achieved an 83% success rate, recovering from failures induced by randomized conditions, including grasp misalignments up to 5 degrees, hole-pose errors up to 20 mm, fivefold increases in friction, and unseen square/rectangular pegs, and we further validated the approach on a real robot. Project page is available at https://omron-sinicx.github.io/compliance-enabled-failure-recovery/.

2508.03018 2026-05-19 cs.AI cs.RO Updated version

Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

超越策略优化:一种数据整理飞轮用于稀疏奖励长周期规划

Yutong Wang, Pengliang Ji, Kaixin Li, Baolong Bi, Tao Feng, Guillaume Sartoretti

Affiliations: Department of Mechanical Engineering, National University of Singapore; Robotics Institute, Carnegie Mellon University; School of Computing, National University of Singapore; Institute of Computing Technology, Chinese Academy of Sciences; Department of Computer Science and Technology, Tsinghua University

AI summary: This paper proposes BPO, a framework that builds a self-improving data flywheel to develop robust reasoning models for sparse-reward, long-horizon multi-round agentic planning, achieving efficient reasoning with significant token efficiency.

Details
AI-generated Chinese abstract

大型语言推理模型在静态任务中表现出色,但将其应用于交互环境中的多轮智能体规划面临两大根本挑战:其一,难以处理的信用分配问题使传统强化学习在稀疏奖励设置中失效;其二,冗长的逐步推理历史带来过高的计算开销。为此,我们提出BPO,一个三阶段框架(自举、外推和精炼),通过建立自改进的数据飞轮,为长周期、稀疏奖励环境开发稳健的推理模型。框架首先利用所提出的规划四元组和长短链式思维融合来自举高效推理,然后通过复杂度分层的课程学习外推到分布外任务,最后仅在经奖励门控拒绝采样选出的经验上学习,从而迭代地自我精炼。在ALFWorld、ScienceWorld和WebShop上的实验表明,本方法以显著的token效率达到了最先进水平,为智能体规划中的推理模型提供了新的方案。

English abstract

Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art with significant token efficiency, providing a new recipe for reasoning models in agentic planning.
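The refinement stage described above selects training experiences by reward-gated rejection sampling. A minimal sketch of that selection step is shown here, assuming each sampled trajectory carries a scalar episode reward; the function name, the dictionary keys, and the toy data are hypothetical, and the abstract does not say how BPO defines its gate.

```python
def reward_gated_filter(trajectories, reward_gate):
    """Keep only trajectories whose episode reward clears the gate.

    trajectories: list of dicts with hypothetical keys "steps" and "reward"
    reward_gate: minimum reward for a trajectory to enter the training set
    """
    return [t for t in trajectories if t["reward"] >= reward_gate]


# Toy rollouts in a sparse-reward task: reward is 1.0 only on success.
sampled = [
    {"steps": ["look", "open drawer", "take key"], "reward": 1.0},
    {"steps": ["look", "wait"], "reward": 0.0},
    {"steps": ["go north", "take key"], "reward": 1.0},
]
kept = reward_gated_filter(sampled, reward_gate=1.0)
```

With a sparse binary reward, the gate reduces to keeping successful episodes only, which is why the surviving set can be used as supervised data without a per-step credit signal.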

2507.09344 2026-05-19 cs.RO cs.SY eess.SY Updated version

C-ZUPT: Stationarity-Aided Aerial Hovering

C-ZUPT:静止性辅助的空中悬停

Daniel Engelsman, Itzik Klein

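Affiliations: Hatter Department of Marine Technologies, Charney School of Marine Sciences, University of Haifa

AI summary: This paper proposes C-ZUPT, which defines an uncertainty threshold to identify quasi-static equilibria and deliver precise velocity updates to the estimation filter, reducing inertial drift and control effort and improving navigation stability and hovering energy efficiency.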

Comments 14 Pages, 16 Figures, 9 Tables

Details
Journal ref
IEEE Transactions on Aerospace and Electronic Systems, volume 62, pages 4063-4077, 2026
AI-generated Chinese abstract

跨领域自主系统强调了对漂移鲁棒的状态估计需求。尽管卫星定位和摄像头广泛使用,但它们在许多环境中存在可用性限制。因此,定位必须仅依赖惯性传感器,导致随时间推移精度迅速下降,由于传感器偏差和噪声。为对抗这一问题,替代更新源——称为信息辅助——作为确定性的锚点。其中,零速度更新(ZUPT)在静止期间提供准确的修正,但受限于地面平台。本工作引入了一种受控的ZUPT(C-ZUPT)方法用于空中导航与控制,不依赖地面接触。通过定义不确定性阈值,C-ZUPT识别准静态平衡状态,为估计滤波器提供精确的速度更新。大量验证确认这些机会性、高质量的更新显著减少惯性漂移和控制努力。因此,C-ZUPT缓解了滤波器发散并提升导航稳定性,使更节能的悬停成为可能,并大幅延长持续飞行时间——这对资源受限的空中系统具有关键优势。

English abstract

Autonomous systems across diverse domains have underscored the need for drift-resilient state estimation. Although satellite-based positioning and cameras are widely used, they often suffer from limited availability in many environments. As a result, positioning must rely solely on inertial sensors, leading to rapid accuracy degradation over time due to sensor biases and noise. To counteract this, alternative update sources, referred to as information aiding, serve as anchors of certainty. Among these, the zero-velocity update (ZUPT) is particularly effective in providing accurate corrections during stationary intervals, though it is restricted to surface-bound platforms. This work introduces a controlled ZUPT (C-ZUPT) approach for aerial navigation and control, independent of surface contact. By defining an uncertainty threshold, C-ZUPT identifies quasi-static equilibria to deliver precise velocity updates to the estimation filter. Extensive validation confirms that these opportunistic, high-quality updates significantly reduce inertial drift and control effort. As a result, C-ZUPT mitigates filter divergence and enhances navigation stability, enabling more energy-efficient hovering and substantially extending sustained flight, both key advantages for resource-constrained aerial systems.
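As a rough illustration of the zero-velocity update idea in this abstract, the sketch below applies a scalar Kalman measurement update with a pseudo-measurement of zero velocity whenever the estimated speed falls inside a stationarity gate. The function name, the speed-based gating rule, and all numeric values are assumptions for illustration; the abstract does not describe the actual C-ZUPT detector or filter.

```python
def zupt_update(v_est, p_var, r_var, speed_threshold):
    """Scalar Kalman update with a zero-velocity pseudo-measurement.

    v_est: current velocity estimate (m/s)
    p_var: variance of that estimate
    r_var: variance assigned to the zero-velocity pseudo-measurement
    speed_threshold: hypothetical gate; update only if |v_est| is below it
    """
    if abs(v_est) >= speed_threshold:
        # Not judged quasi-static: leave the estimate untouched.
        return v_est, p_var, False
    gain = p_var / (p_var + r_var)          # Kalman gain for H = 1
    v_new = v_est + gain * (0.0 - v_est)    # innovation is (0 - v_est)
    p_new = (1.0 - gain) * p_var            # posterior variance shrinks
    return v_new, p_new, True


# Hovering with a small drifted velocity estimate: the update pulls it toward zero.
v, p, applied = zupt_update(v_est=0.04, p_var=0.01, r_var=0.01, speed_threshold=0.05)
```

With equal prior and measurement variances the gain is one half, so the drifted estimate and its variance are both halved. A full inertial filter would apply the same update to a three-axis velocity block of a larger state vector.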

2506.13189 2026-05-19 cs.HC cs.RO Updated version

Gesture First, LLM-Assisted Voice Complement: Exploring Multimodal Robot 'Puppeteer' Teleoperation Via Virtual Counterpart in Augmented Reality

先手势,LLM辅助语音补充:通过增强现实中的虚拟对应物探索多模态机器人'提线人'遥控

Yuchong Zhang, Bastian Orthmann, Shichen Ji, Michael Welle, Jonne Van Haastregt, Danica Kragic

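Affiliations: KTH Royal Institute of Technology

AI summary: This paper explores multimodal robot teleoperation through a virtual counterpart in augmented reality, compares gesture-only and combined voice+gesture interaction in terms of performance and user experience, and proposes design guidelines that balance efficiency, robustness, and user expertise.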

Comments This work is under peer review

Details
AI-generated Chinese abstract

通过增强现实(AR)实现的机器人遥控提供了一条通向更直观人机交互(HRI)的有希望路径。我们提出了一种头戴式AR'提线人'系统,用户通过与机器人虚拟对应物的交互来控制物理机器人,使用大语言模型(LLM)辅助的语音命令和手部手势交互在Meta Quest 3上。在42名参与者进行的AR基于机器人抓取和模式匹配任务的内部分组用户研究中,我们经验性地比较了两种交互条件:仅手势(GO)和结合语音+手势(VG)在性能和用户体验(UX)上的差异。在VG中,语音和手势以顺序角色分配的方式操作,语音负责高层导航,手势负责精细操作。我们的结果表明,GO目前为这种时间敏感的任务提供了更可靠和高效的控制,而VG引入了额外的灵活性,但也带来了延迟和识别问题,可能增加工作负荷。我们还分析了先前机器人专业知识如何在不同条件下区分性能和用户体验。基于这些发现,我们总结了一套AR'提线人'隐喻机器人遥控的设计指南,将多模态性作为适应性策略,必须在效率、鲁棒性和用户专业知识之间取得平衡,而不是假设额外模态对所有人都有益。

English abstract

Robot teleoperation via augmented reality (AR) offers a promising path toward more intuitive human-robot interaction (HRI). We present a head-mounted AR 'puppeteer' system in which users control a physical robot by interacting with its virtual counterpart robot using large language model (LLM)-assisted voice commands and hand-gesture interaction on the Meta Quest 3. In a within-subject user study with 42 participants performing an AR-based robotic pick-and-place pattern-matching task, we empirically compare two interaction conditions: gesture-only (GO) and combined voice+gesture (VG) on performance and user experience (UX). In VG, voice and gesture operate in a sequential role-allocated manner, with voice handling high-level navigation and gesture handling fine manipulation. Our results show that GO currently provides more reliable and efficient control for this time-critical task, while VG introduces additional flexibility but also latency and recognition issues that can increase workload. We additionally analyze how prior robotics expertise differentiates performance and UX across conditions. Based on these findings, we distill a set of design guidelines for AR 'puppeteer' metaphoric robot teleoperation, framing multimodality as an adaptive strategy that must balance efficiency, robustness, and user expertise rather than assuming that additional modalities are universally beneficial.

2504.14820 2026-05-19 cs.RO Updated version

A Visual Reinforcement Learning-Based Separate Primitive Policy for Peg-in-Hole Tasks

基于视觉强化学习的分离基元策略:用于轴孔装配任务

Zichun Xu, Zhaomin Wang, Yuntao Li, Lei Zhuang, Zhiyuan Zhao, Guocai Yang, Jingdong Zhao

Affiliations: State Key Laboratory of Robotics and Systems, Harbin Institute of Technology; Ubtech Robotics; Meituan Academy of Robotics; School of Mechanical Engineering, Shandong University

AI summary: This paper proposes the Separate Primitive Policy (S2P), which uses visual reinforcement learning to learn location and insertion actions simultaneously for peg-in-hole tasks, improving sample efficiency and success rate.

Comments Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

Details
AI-generated Chinese abstract

对于轴孔装配任务,人类依靠双目视觉感知将轴定位到孔表面上方,然后进行插入。本文借鉴这一行为,使智能体通过视觉强化学习学习高效的装配策略。为此,我们提出分离基元策略(S2P),以学习如何同时推导定位和插入动作。S2P与无模型强化学习算法兼容。我们开发了十个具有不同多边形的插入任务作为评估基准。仿真实验表明,即使存在力约束,S2P也能提高样本效率和成功率。我们还进行了真实世界实验以验证S2P的可行性。最后通过消融实验讨论了S2P的泛化性以及影响其性能的一些因素。

English abstract

For peg-in-hole tasks, humans rely on binocular visual perception to locate the peg above the hole surface and then proceed with insertion. This paper draws insights from this behavior to enable agents to learn efficient assembly strategies through visual reinforcement learning. Hence, we propose a Separate Primitive Policy (S2P) to learn how to derive location and insertion actions simultaneously. S2P is compatible with model-free reinforcement learning algorithms. Ten insertion tasks featuring different polygons are developed as benchmarks for evaluations. Simulation experiments show that S2P can boost the sample efficiency and success rate even with force constraints. Real-world experiments are also performed to verify the feasibility of S2P. Ablations are finally given to discuss the generalizability of S2P and some factors that affect its performance.

2503.12181 2026-05-19 cs.AI cs.RO Updated version

Action-Gradient Monte Carlo Tree Search for Non-Parametric Continuous (PO)MDPs

动作-梯度蒙特卡洛树搜索用于非参数连续(PO)MDPs

Idan Lev-Yehudi, Michael Novitsky, Moran Barenboim, Ron Benchetrit, Vadim Indelman

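Affiliations: Technion – Israel Institute of Technology

AI summary: This paper proposes AGMCTS, a framework that combines global tree search with local gradient-based action refinement for planning in continuous state, action, and observation spaces; its theoretical contributions are an action score gradient theorem, the Multiple Importance Sampling (MIS) Tree, and tractable action score gradients.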

Details
AI-generated Chinese abstract

在连续状态、动作和观察空间中,自主系统在线规划仍具挑战性。尽管蒙特卡洛树搜索(MCTS)通过采样有效扩展,但大多数连续(PO)MDP求解器未利用基于梯度的动作优化。本文提出动作-梯度MCTS(AGMCTS),结合全局树搜索与局部梯度优化,保持一致的价值估计。我们提供了三个关键理论贡献:(1)粒子信念状态的动作评分梯度定理;(2)多重要性采样(MIS)树,通过重用先前样本支持频繁动作分支更新而不引入估计漂移;(3)使用区域公式为平滑生成模型提供可计算的动作评分梯度。实验结果表明,AGMCTS在多个具有挑战性的连续MDP和POMDP基准中优于最先进的基于样本的求解器。

English abstract

Online planning in continuous state, action, and observation spaces remains challenging for autonomous systems. While Monte Carlo Tree Search (MCTS) scales effectively via sampling, most continuous (PO)MDP solvers do not exploit gradient-based action optimization. We propose Action-Gradient MCTS (AGMCTS), a framework that combines global tree search with local gradient-based action refinement, while maintaining consistent value estimates. We provide three key theoretical contributions: (1) an action score gradient theorem for particle belief states; (2) the Multiple Importance Sampling (MIS) Tree that supports frequent action-branch updates by reusing prior samples without introducing estimator drift; and (3) tractable action score gradients for smooth generative models using the Area Formula. Empirical results demonstrate that AGMCTS outperforms state-of-the-art sample-based solvers in multiple challenging continuous MDP and POMDP benchmarks.

2411.17917 2026-05-19 cs.CV cs.RO Updated version

DECODE: Domain-aware Continual Domain Expansion for Motion Prediction

DECODE:面向领域的持续领域扩展用于运动预测

Boqi Li, Haojie Zhu, Henry X. Liu

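Affiliations: Department of Civil and Environmental Engineering, University of Michigan

AI summary: DECODE is a continual learning framework that starts from a pre-trained generalized model and incrementally develops domain-specialized models, combining a hypernetwork with a normalizing flow for efficient model selection and uncertainty estimation, which lowers the forgetting rate and improves prediction accuracy.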

Comments This work has been published in IEEE TPAMI Early Access

Details
AI-generated Chinese abstract

运动预测对于自动驾驶车辆在复杂环境中有效导航和准确预测其他交通参与者行为至关重要。随着自动驾驶不断发展,整合新多样驾驶场景的需求促使频繁重新训练模型。为此,我们引入DECODE,一种新的持续学习框架,从预训练的通用模型开始,逐步发展专用领域模型。不同于现有持续学习方法试图开发一个能跨多样场景泛化的统一模型,DECODE独特地平衡了专用性与泛化性,动态调整以满足实时需求。所提框架利用超网络生成模型参数,显著降低存储需求,并结合归一化流机制基于似然估计进行实时模型选择。此外,DECODE利用深度贝叶斯不确定性估计技术合并最相关专用和通用模型的输出。这种整合确保在熟悉条件下最优性能,同时在不熟悉场景中保持鲁棒性。广泛评估证实了框架的有效性,实现显著低的遗忘率0.044和平均minADE 0.584米,显著超越传统学习策略,并在广泛驾驶条件下表现出适应性。

English abstract

Motion prediction is critical for autonomous vehicles to effectively navigate complex environments and accurately anticipate the behaviors of other traffic participants. As autonomous driving continues to evolve, the need to assimilate new and varied driving scenarios necessitates frequent model updates through retraining. To address these demands, we introduce DECODE, a novel continual learning framework that begins with a pre-trained generalized model and incrementally develops specialized models for distinct domains. Unlike existing continual learning approaches that attempt to develop a unified model capable of generalizing across diverse scenarios, DECODE uniquely balances specialization with generalization, dynamically adjusting to real-time demands. The proposed framework leverages a hypernetwork to generate model parameters, significantly reducing storage requirements, and incorporates a normalizing flow mechanism for real-time model selection based on likelihood estimation. Furthermore, DECODE merges outputs from the most relevant specialized and generalized models using deep Bayesian uncertainty estimation techniques. This integration ensures optimal performance in familiar conditions while maintaining robustness in unfamiliar scenarios. Extensive evaluations confirm the effectiveness of the framework, achieving a notably low forgetting rate of 0.044 and an average minADE of 0.584 m, significantly surpassing traditional learning strategies and demonstrating adaptability across a wide range of driving conditions.
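The minADE figure quoted above is, in its common definition, the average displacement error of the best of K predicted trajectories. A minimal sketch of that metric follows; the function name and the toy trajectories are illustrative assumptions, and the abstract does not give the paper's exact evaluation protocol (horizon, number of modes, averaging over agents).

```python
import math


def min_ade(predictions, ground_truth):
    """Minimum average displacement error over K candidate trajectories.

    predictions: list of K trajectories, each a list of (x, y) points
    ground_truth: one trajectory of (x, y) points with the same length
    """
    best = math.inf
    for traj in predictions:
        # Mean Euclidean distance between matching time steps.
        ade = sum(math.dist(p, g) for p, g in zip(traj, ground_truth)) / len(ground_truth)
        best = min(best, ade)
    return best


gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
modes = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],   # constant 1 m lateral offset
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],   # drifts away over time
]
score = min_ade(modes, gt)  # the second mode has the lower average error
```

Because only the best mode counts, the metric rewards a predictor for covering the true future with at least one hypothesis rather than for the quality of every hypothesis.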

2410.07191 2026-05-19 cs.RO cs.LG stat.ME Updated version

Curb Your Attention: Causal Attention Gating for Robust Trajectory Prediction in Autonomous Driving

抑制注意力:因果注意力门控用于自动驾驶中的鲁棒轨迹预测

Ehsan Ahmadi, Ray Mercurius, Soheil Alizadeh, Kasra Rezaee, Amir Rasouli

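Affiliations: University of Alberta; Noah’s Ark Laboratory, Huawei Technologies Canada; Cornell University

AI summary: This paper proposes CRiTIC, which uses a causal discovery network to identify inter-agent causal relations and introduces a causal attention gating mechanism to improve the robustness and generalization of trajectory prediction; experiments show up to a 54% robustness improvement against non-causal perturbations.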

Comments Accepted ICRA 2025

Details
AI-generated Chinese abstract

自动驾驶中的轨迹预测模型易受非因果代理的扰动影响,此类扰动可能导致其他代理轨迹预测错误,进而影响自动驾驶决策的安全性和效率。本文提出CRiTIC模型,利用因果发现网络识别过去时间窗口内代理间的因果关系,并引入因果注意力门控机制,以选择性过滤Transformer架构中的信息。在两个自动驾驶基准数据集上进行了大量实验,评估了模型在对抗非因果扰动和泛化能力方面的鲁棒性。实验结果表明,预测鲁棒性可提升54%而对预测准确性影响不大。此外,本文展示了所提模型在跨域性能上的优越泛化能力,达到29%的改进。进一步细节请参见项目页面:https://ehsan-ami.github.io/critic。

English abstract

Trajectory prediction models in autonomous driving are vulnerable to perturbations from non-causal agents whose actions should not affect the ego-agent's behavior. Such perturbations can lead to incorrect predictions of other agents' trajectories, potentially compromising the safety and efficiency of the ego-vehicle's decision-making process. Motivated by this challenge, we propose Causal tRajecTory predICtion (CRiTIC), a novel model that utilizes a Causal Discovery Network to identify inter-agent causal relations over a window of past time steps. To incorporate discovered causal relationships, we propose a novel Causal Attention Gating mechanism to selectively filter information in the proposed Transformer-based architecture. We conduct extensive experiments on two autonomous driving benchmark datasets to evaluate the robustness of our model against non-causal perturbations and its generalization capacity. Our results indicate that the robustness of predictions can be improved by up to 54% without a significant detriment to prediction accuracy. Lastly, we demonstrate the superior domain generalizability of the proposed model, which achieves up to 29% improvement in cross-domain performance. These results underscore the potential of our model to enhance both robustness and generalization capacity for trajectory prediction in diverse autonomous driving domains. Further details can be found on our project page: https://ehsan-ami.github.io/critic.

2301.01114 2026-05-19 cs.RO cs.SY eess.SY Updated version

Information Aided Navigation: A Review

信息辅助导航:综述

Daniel Engelsman, Itzik Klein

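Affiliations: Hatter Department of Marine Technologies, Charney School of Marine Sciences, University of Haifa

AI summary: This paper surveys information-aided navigation, classifying it into direct, indirect, and model aiding, and describes how matching the appropriate constraint to a scenario improves navigation accuracy and compensates for lost information.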

Comments 8 figures, 3 tables

Details
Journal ref
IEEE Transactions on Instrumentation and Measurement, volume 72, pages 1-18, 2023
AI-generated Chinese abstract

惯性导航系统性能很大程度上依赖于外部测量和信息的稳定流,以保证连续滤波更新和绑定惯性解漂移。不同操作环境的平台可能在某些时候无法接收外部测量,从而暴露导航解漂移。多年来,各种工作被提出以克服这一不足,通过利用系统当前状态的知识,将其转化为可用的信息源来更新导航滤波器。本文旨在提供信息辅助导航的全面综述,广泛分为直接、间接和模型辅助三类。每种方法通过实现其概念的显著工作、使用案例、相关状态更新和对应的测量模型进行描述。通过将适当的约束匹配到给定场景,可以提高导航解的准确性,补偿丢失的信息,并揭示某些内部状态,这些状态否则将保持不可观测。

English abstract

The performance of inertial navigation systems is largely dependent on the stable flow of external measurements and information to guarantee continuous filter updates and bound the inertial solution drift. Platforms in different operational environments may be prevented at some point from receiving external measurements, thus exposing their navigation solution to drift. Over the years, a wide variety of works have been proposed to overcome this shortcoming by exploiting knowledge of the system's current conditions and turning it into an applicable source of information to update the navigation filter. This paper aims to provide an extensive survey of information aided navigation, broadly classified into direct, indirect, and model aiding. Each approach is described by the notable works that implemented its concept, use cases, relevant state updates, and their corresponding measurement models. By matching the appropriate constraint to a given scenario, one will be able to improve the navigation solution accuracy, compensate for the lost information, and uncover certain internal states that would otherwise remain unobservable.

2605.16588 2026-05-19 cs.RO cs.SY eess.SY Updated version

Policy Library CBF: Finite-Horizon Safety at Runtime via Parallel Rollouts

策略库CBF:通过并行滚动预测实现有限时间范围内的运行时安全

Taekyung Kim, Hideki Okamoto, Bardh Hoxha, Georgios Fainekos, Dimitra Panagou

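AI summary: This paper proposes PL-CBF, which evaluates a library of fallback policies through parallel finite-horizon rollouts, selects the least invasive safe mode, and minimally modifies the nominal policy to ensure safety; experiments show improved safety coverage while retaining millisecond-level runtime.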

Comments Project page: https://www.taekyung.me/plcbf

Details
AI-generated Chinese abstract

在无结构环境中实现安全关键自主性对在线安全认证提出了重大挑战。我们提出了策略库控制障碍函数(PL-CBF),一种运行时安全过滤器,通过并行有限时间滚动预测评估备用策略库,选择最安全模式,并通过求解二次规划问题最小修改名义策略以确保安全。我们基于闭环行为的有限时间语言度量提供了理论分析,表征了政策库覆盖要求以认证有限时间范围的安全性。在平面双积分器(4状态)、具有突发摩擦变化的高速公路驾驶(8状态)以及拥挤动态环境中的3D四旋翼导航(12状态)模拟中,展示了比单策略安全过滤器更高的安全覆盖率,同时保持毫秒级运行时间。

English abstract

Safety-critical autonomy in unstructured environments poses significant challenges for online safety certification under evolving constraints. We propose Policy Library Control Barrier Function (PL-CBF), a runtime safety filter that evaluates a library of fallback policies via parallel finite-horizon rollouts, selects the least invasive safe mode, and enforces safety by solving a quadratic program that minimally modifies a nominal policy. We provide a theoretical analysis based on a finite-horizon language metric over closed-loop behaviors, characterizing policy-library coverage requirements for certifying finite-horizon safety. Simulations on a planar double-integrator (4 states), highway driving with abrupt friction changes using a realistic nonlinear vehicle model (8 states), and 3D quadrotor navigation in crowded dynamic environments (12 states) demonstrate improved safety coverage over single-policy safety filters while retaining millisecond-level runtime.
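The quadratic-program step named in this abstract can be illustrated in its simplest form. For a single affine safety constraint a·u ≥ b, the QP that minimally modifies a nominal control has a closed-form solution: project the nominal control onto the constraint half-space. The sketch below shows only that projection; the function name and constraint values are made up, and the rollout-based policy-library selection that PL-CBF adds on top is not reproduced here.

```python
def cbf_qp_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a . u >= b for one constraint.

    u_nom: nominal control as a list of floats
    a, b: constraint coefficients (for a barrier h with dynamics x' = f + g u,
          typically a = grad_h . g and b = -grad_h . f - alpha * h)
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) - b
    if slack >= 0.0:
        return list(u_nom)                    # nominal control is already safe
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint cannot be satisfied by any control")
    scale = -slack / norm_sq                  # smallest step onto the boundary
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


# Nominal control violates a . u >= b, so it is shifted onto the boundary.
u_safe = cbf_qp_filter(u_nom=[1.0, 0.0], a=[-1.0, 0.0], b=-0.5)
```

With several simultaneous constraints or input bounds the projection no longer has this one-line form, and a general QP solver would be needed instead.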

2605.16552 2026-05-19 cs.AI cs.RO Updated version

From Prompts to Protocols: An AI Agent for Laboratory Automation

从提示到协议:一种用于实验室自动化的AI代理

Angelos Angelopoulos, James F. Cahoon, Ron Alterovitz

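AI summary: This paper presents an AI agent that integrates large language models with laboratory orchestration, letting scientists create and monitor automated lab protocols in natural language and improving experimental efficiency and accuracy.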

详情
AI中文摘要

自动化科学实验室能加快、安全、准确且可重复地执行协议,加速新材料和药物的发现与测试。然而,设置和运行自主实验室需要协调多种仪器和机器人,迫使科学家编写代码、管理配置文件和导航复杂软件架构。本文提出一种AI代理架构,整合大语言模型与实验室编排,使科学家能通过自然语言交互式创建和监控自动化实验协议。该代理集成到实验编排系统(EOS)中,通过代理循环实现自动验证和错误纠正,支持完整的实验生命周期:创建协议、运行和监控协议及闭环优化活动,以及分析结果。一个可视化图编辑器将协议渲染为同步于AI代理协议表示的交互式节点图,使在AI协助和手动协议构建之间无缝切换。在三个覆盖化学、生物学和材料科学的模拟自动化实验室上评估,该AI代理实现了97%的一次性协议生成成功率,并将所需界面操作减少了数量级。

英文摘要

Automating science laboratories enables faster, safer, more accurate, and more reproducible execution of protocols, accelerating the discovery and testing of new materials, drugs, and more. However, setting up and running autonomous labs requires coordinating numerous instruments and robots, forcing scientists to write code, manage configuration files, and navigate complex software infrastructure. We present an AI agent architecture that integrates large language models with laboratory orchestration, enabling scientists to interactively create and monitor automated lab protocols using natural language. Integrated into the Experiment Orchestration System (EOS), the AI agent operates under an agentic loop with automated validation and error correction, and supports the complete experimental lifecycle: creating protocols, running and monitoring both protocols and closed-loop optimization campaigns, and analyzing results. A visual graph editor renders protocols as interactive node-based diagrams synchronized with the AI agent's protocol representation, enabling seamless alternation between AI-assisted and manual protocol construction. Evaluated on three simulated automated labs spanning chemistry, biology, and materials science, the AI agent achieves a 97% first-attempt protocol generation success rate and an order of magnitude reduction in required interface actions.

2605.16537 2026-05-19 cs.RO 版本更新

Nori Bot: A Sub-$1,000 Floor-to-Counter Mobile Manipulator

Nori Bot:一款不到1000美元的 floor-to-counter 移动机械臂

Antonio Li, Sungjoon Park, Wen Ni Chew

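AI summary: Nori Bot addresses three limitations of low-cost mobile manipulators with a 600 mm Z-axis lift, a Raspberry Pi 4 paired with the OpenClaw agent runtime, and a software safety stack, at roughly 3% of the cost of comparable commercial platforms.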

Comments 7 pages, 3 figures, 2 tables. Columbia University Deep Learning Robot Manipulation course project, Spring 2026

详情

英文摘要

Open-source mobile manipulators have reached $660 (XLeRobot), but every sub-$1,000 platform shares three limitations: a fixed-height workspace, reactive-only control, and no protection against the stall-induced burn-out that destroys cheap Feetech servos. We present Nori Bot, a 17-DoF dual-arm mobile manipulator at $947 (roughly 3% of the cost of comparable commercial platforms) that addresses all three: (1) a 600 mm Z-axis lift on the existing servo bus for floor-to-counter reach; (2) a thin-client Raspberry Pi 4 paired with the OpenClaw proactive agent runtime so that cron jobs and hooks trigger physical tasks autonomously; and (3) a software safety stack with sensorless grip-force feedback via motor current on a soft TPU finger. Code, CAD, and the skill manifest will be released.

2605.16522 2026-05-19 cs.RO cs.MA 版本更新

A Mechanistic Model for Collective Motion from Sensorimotor Regularities

从传感器运动规律出发的集体运动机制模型

Vito Mengers, Bao Duc Cao, Oliver Brock

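AI summary: This paper proposes a mechanistic model of collective motion built on a modeling framework from robotics, in which group behavior emerges from perception and gradient-descent action selection, revealing the role of sensorimotor regularities in collective behavior.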

详情
AI中文摘要

动物群体行为长期以来通过自驱动粒子模型进行建模,但这些模型仅能描述现象而无法解释机制。本文提出基于机器人建模框架的集体运动机制模型,通过感知邻居的方位和大小信息,结合不确定的内部状态估计和梯度下降选择动作,不依赖预设的交互力,产生多样化的群体行为,如极化运动、 milling、环形结构和子群分裂。全局敏感性分析显示,行为转变由传感器运动参数决定,这些参数对应可测量的生物量。群体行为是交互传感器运动规律的涌现结果,物种间的差异源于机体和环境的差异。

英文摘要

Collective behavior in animals has long been modeled through self-propelled particle models, which reproduce striking group-level phenomena through abstract interaction forces. Yet these models are fundamentally descriptive: they leave open the question of how collective behavior is actually produced. Recent empirical work makes this gap concrete: locusts do not align with neighbors; instead, sensory and cognitive mechanisms mediate interaction. A mechanistic model must therefore operate at the sensorimotor level, grounded in what individual organisms can actually perceive, estimate, and physically execute. We present such a model based on a modeling framework from robotics, extended here to collective motion. Each agent perceives neighbors through bearing and apparent-size cues within a limited field of view, maintains uncertain internal state estimates, and selects actions through gradient descent on a desired social distance, without any prescribed interaction forces. This simple model produces diverse collective behaviors including polarized motion, milling, ring formations, and subgroup fragmentation. A global sensitivity analysis shows that behavioral transitions are governed by sensorimotor parameters corresponding to measurable biological quantities: field of view geometry, sensory noise, turning agility, and memory. Collective behavior can therefore be understood as the emergent outcome of interacting sensorimotor regularities, and differences across species as the emergent outcome of differences in embodiment and environment.
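
The idea of choosing actions by gradient descent on a desired social distance, using only bearing and apparent-size cues, can be sketched for a single agent and a single neighbor. This is a toy illustration, not the paper's model: it ignores the limited field of view and the uncertain state estimates, and the body radius, desired distance, gain, and function names are all assumptions.

```python
import math


def distance_from_apparent_size(theta: float, body_radius: float = 0.5) -> float:
    """Recover distance from the angular size theta (radians) of a neighbor
    modeled as a disc of known radius: theta = 2 * atan(radius / distance)."""
    return body_radius / math.tan(theta / 2.0)


def social_step(bearing: float, theta: float,
                desired: float = 3.0, gain: float = 0.2) -> tuple:
    """One gradient-descent step on the cost 0.5 * (d - desired)**2,
    expressed as a displacement (dx, dy) in the agent's own frame.

    Moving along the bearing shortens the distance d, so the negative
    gradient points toward the neighbor when d > desired and away from
    it when d < desired."""
    d = distance_from_apparent_size(theta)
    step = gain * (d - desired)
    return step * math.cos(bearing), step * math.sin(bearing)
```

No attraction or repulsion force is prescribed here: both behaviors fall out of the sign of `d - desired` in the same cost.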

2605.16514 2026-05-19 cs.RO cs.AI 版本更新

No Plan, Yet Human: A Reactive Robotics Model Predicts Human Planning Failures on a Clinical Task

无计划,却有人类:一种反应式机器人模型预测临床任务中人类计划失败

Michael Migacev, Vito Mengers, Antonia Köngeter, Oliver Brock

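AI summary: This study applies AICON, a reactive gradient-descent framework, to the Tower of London test; the model reproduces the human difficulty ordering across 24 problems more accurately than structural task parameters, generalizes in held-out evaluation, and suggests that human behavior shifts toward a reactive mode when planning capacity is reduced.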

详情
AI中文摘要

理解为何某些顺序规划问题比其他问题更难需要超越平均性能的模型。这些模型应捕捉问题难度的具体模式,并理想情况下以与人类计划能力下降时相同的方式失败。我们应用为机器人操作开发的AICON反应式梯度下降框架,应用于塔罗伦敦测试,该测试用于评估帕金森病、轻度认知障碍和中风患者的规划能力。在不进行任何前瞻规划或了解人类认知的情况下,AICON在24个问题上更准确地再现了人类的细粒度难度排序,优于结构任务参数,并在留出验证中泛化到新问题。关键的是,AICON在计划能力下降的群体中优于计划基线,而计划基线更好地捕捉健康对照组。这种分离由原始AICON论文预测,该论文指出模型的失败模式与帕金森患者在目标层次结构上挣扎但不移动计数的情况相似。这表明,随着计划能力的下降,人类行为会转向AICON所建模的反应模式。这一发现扩展了更广泛的模式:AICON最初为机器人开发,现在能捕捉生物行为在感知、眼动和顺序规划方面的特征,表明其核心抽象反映了生物系统组织方式的真实特性。

英文摘要

Understanding why some sequential planning problems are harder than others requires models that go beyond average performance. They should capture the specific pattern of which problems are hard, and ideally fail in the same way people do when planning capacity is reduced. We apply AICON, a reactive gradient-descent framework developed for robotic manipulation, to the Tower of London test, a cognitive test used to assess planning in Parkinson's disease, mild cognitive impairment, and stroke. Without any lookahead planning or knowledge of human cognition, AICON reproduces the fine-grained human difficulty ordering across 24 problems better than structural task parameters and generalizes to held-out problems in a leave-two-out evaluation. Crucially, AICON outperforms a planning baseline for groups with reduced planning capacity while the planning baseline better captures healthy controls. This dissociation was predicted by the original AICON paper, which noted that the model's failure modes resemble those of Parkinson's patients who struggle with goal hierarchies but not move counts. This suggests that as planning capacity is reduced, human behavior shifts toward the reactive mode AICON models. The finding extends a broader pattern: AICON, originally built for robotics, now captures aspects of biological behavior across perception, eye movements, and sequential planning, suggesting its core abstraction reflects something real about how biological systems are organized.

2605.16442 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Hierarchical Two-Stage Framework for Environment-Aware Long-Horizon Vessel Trajectory Prediction

面向环境的长航程船舶轨迹预测分层两阶段框架

Ganeshaaraj Gnanavel, Tharindu Fernando, Sridha Sridharan, Clinton Fookes

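AI summary: This paper proposes a hierarchical two-stage framework that combines a coarse long-term predictor with a grid-aware short-term predictor through hierarchical fusion to improve vessel trajectory prediction, outperforming existing methods on ADE and FDE.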

详情
AI中文摘要

长航程船舶轨迹预测在真实海洋条件下对碰撞避免、交通管理和路线规划至关重要。然而,由于长距离时间依赖性和动态环境因素如洋流、风和波浪,实现准确预测具有挑战性。为此,我们提出一种分层两阶段框架,通过分层融合机制结合粗略长时预测器与网格感知的短时预测器。短时分支利用离散化海事单元上的时空图变换器捕捉局部动态,而长时分支编码总体航行意图。集成的环境模块利用洋流参数、风向量和显著波高,通过跨模态注意和特征调制实现对不同海况的适应性响应。此外,可学习的Savitzky-Golay平滑层增强了融合轨迹的时间一致性。我们在澳大利亚船队跟踪系统(CTS)数据上进行了评估,数据来自西北地区,并与Copernicus海洋服务产品对齐,使用3小时输入和10小时预测时间范围。实验结果表明,我们的框架在平均位移误差(ADE)和最终位移误差(FDE)上比现有方法提高了25%和17%。消融研究进一步验证了每个组件的贡献。

英文摘要

Long-horizon vessel trajectory forecasting under real ocean conditions is critical for collision avoidance, traffic management, and route planning. However, achieving accurate predictions is challenging due to long-range temporal dependencies and dynamic environmental factors such as currents, wind, and waves. To address these issues, we propose a hierarchical two-stage framework that combines a coarse long-term predictor with a grid-aware short-term predictor through a hierarchical fusion mechanism. The short-term branch leverages a Spatio-Temporal Graph Transformer on discretized maritime cells to capture localized dynamics, while the long-term branch encodes overarching navigational intent. An integrated environmental module incorporates oceanographic parameters, including surface currents, wind vectors, and significant wave height, using cross-modal attention and feature-wise modulation for adaptive response to varying sea conditions. Additionally, a learnable Savitzky-Golay smoothing layer enhances temporal coherence in fused trajectories. We evaluate our approach on Australian Craft Tracking System (CTS) data from the North West region, aligned with Copernicus Marine Service products, using a 3-hour input and a 10-hour prediction horizon. Experimental results show that our framework outperforms the state-of-the-art by 25% in Average Displacement Error (ADE) and 17% in Final Displacement Error (FDE). Ablation studies further validate the contribution of each component.
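
For readers unfamiliar with the two reported metrics, the standard definitions can be sketched as follows. This is a generic illustration, not the authors' evaluation code: it assumes trajectories are already projected to planar coordinates in a common unit, whereas real vessel tracks in latitude and longitude would need a geodesic distance. The function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def displacement_errors(pred: List[Point], truth: List[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) for one predicted trajectory.

    ADE is the mean Euclidean error over all predicted steps;
    FDE is the Euclidean error at the final step only."""
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and equally long")
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(errors) / len(errors), errors[-1]
```

Because FDE looks only at the last step, a model can improve ADE more than FDE (or the reverse), which is why the two figures in the abstract differ.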

2605.16432 2026-05-19 cs.RO cs.AI cs.HC 版本更新

MR-SLAM: Immersive Spatial Supervision for Multi-Robot Mapping via Mixed Reality

MR-SLAM:通过混合现实实现多机器人地图的沉浸式空间监督

Prakash Aryan, Cem Erdogdu, Kavinaya Kumarchokkappan, Timo Kehrer, Sebastiano Panichella

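AI summary: This paper presents MR-SLAM, a mixed reality system for immersive spatial supervision of multi-robot SLAM, using a passthrough view and spatially anchored dashboard panels to support supervision of multi-robot localization and mapping.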

Comments Accepted to ICRA 2026 Workshop "MM-SpatialAI Workshop: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding"

详情
AI中文摘要

在建筑检查或仓库通道监控等应用中,操作多机器人队伍进行同时定位与建图(SLAM)需要操作员持续保持对每个机器人位置和建图状态的空间意识,这在传统2D界面中表现不佳。我们提出了MR-SLAM,一种混合现实(MR)系统,其中佩戴Meta Quest 3头显的operator通过带有真实世界遮挡的通透视图操控三个模拟TurtleBot3机器人,同时空间锚定的仪表板面板实时报告建图进度。每个机器人运行独立的SLAM Toolbox实例,其占用网格在ROS 2后端实时合并。在五次9分钟的评估会话中,系统以8.83±0.16Hz的速度生成扫描,合并了17.9±0.8平方米的占用网格,并在机器人对之间达到94.7±0.5%的跨实例占用一致性。额外的会话记录了6.3ms的中位转换抖动和41平方米网格的26.7平方米覆盖。我们将MR-SLAM定位为一种参考实现,用于在消费级硬件上结合通透混合现实监督与多机器人SLAM。

英文摘要

Operating a multi-robot fleet for simultaneous localization and mapping (SLAM) in applications such as building inspection or warehouse-aisle monitoring requires the operator to maintain spatial awareness of each robot's position and mapping state, a task that scales poorly on conventional 2D interfaces. We present MR-SLAM, a mixed reality (MR) system in which an operator wearing a Meta Quest 3 headset teleoperates three simulated TurtleBot3 robots through a passthrough view with real-world occlusion, while spatially anchored dashboard panels report mapping progress in situ. Each robot runs an independent SLAM Toolbox instance whose occupancy grid is merged in real time on a Robot Operating System 2 (ROS 2) back end. Across five 9-minute evaluation sessions, the system delivered scans at 8.83 ± 0.16 Hz, mapped 17.9 ± 0.8 m² of merged occupancy, and reached 94.7 ± 0.5% cross-instance occupancy consistency across robot pairs. An additional session recorded 6.3 ms median transform jitter and 26.7 m² coverage of a 41 m² grid. We position MR-SLAM as a reference implementation for combining passthrough mixed reality supervision with multi-robot SLAM on consumer hardware.

2605.16419 2026-05-19 cs.CV cs.AI cs.RO 版本更新

Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments

基于代理的自同步多视角关节角度监控管道:在无标定环境中

Juncheng Yu, Lusi A, Haoxuan Xie, Weiming Wang

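AI summary: This paper proposes an agentic pipeline for self-synchronized multi-view joint angle monitoring that uses two cameras in uncalibrated environments; multimodal large language models provide automatic video synchronization and self-verification, state-of-the-art monocular 2D pose estimators extract candidate poses, and an agent-based selection mechanism identifies and tracks the target subject to produce consistent 2D poses under multiple people and occlusions, from which joint angles are estimated.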

Comments Accepted by EMBC 2026. 7 pages, 3 figures

详情
AI中文摘要

运动监控在长期康复中对脊髓损伤患者至关重要,其中多视角无标记运动捕捉方法已显示出显著潜力。然而,由于依赖校准和多视角同步的困难,其在患者自行部署环境中部署仍然具有挑战性。在本工作中,我们提出了一种基于代理的自同步多视角关节角度监控管道,利用两台摄像头在无标定环境中实现自动视频同步和代理驱动的自验证。最先进的单目2D姿态估计模型用于提取候选姿态,其中应用了基于代理的选择机制,以自动识别和跟踪目标个体,从而在多人和遮挡情况下产生一致的2D姿态。此类2D姿态被优化以从无标定的多视角姿态序列中估计关节角度,通过显式的几何建模确保可解释性。与Vicon系统的验证显示了该方法的强性能,达到MAE为5.97°±2.36°和Pearson相关系数为0.962±0.014。所提出的方法预计能提供一个实用的、患者可自行部署的系统,以在无标定的家庭环境中进行日常运动监控。

英文摘要

Kinematic monitoring plays a critical role in long-term rehabilitation for patients with spinal cord injury (SCI), where multi-view markerless motion capture methods have shown significant potential. However, owing to the reliance on calibration and the difficulty of achieving multi-view synchronization, their deployment in patient self-deployed environments remains challenging. In this work, we propose an agentic pipeline for self-synchronized multi-view joint angle monitoring in uncalibrated environments using two cameras without hardware triggers. Multimodal large language models enable automatic video synchronization and agent-driven self-verification. State-of-the-art monocular 2D pose estimation models are employed to extract candidate poses, and an agent-based selection mechanism is then applied to automatically identify and track the target subject, thereby producing consistent 2D poses in the presence of multiple individuals and occlusions. These 2D poses are optimized to estimate joint angles from uncalibrated multi-view pose sequences, ensuring interpretability through explicit geometric modeling. Validation against a Vicon system demonstrated strong performance, achieving an MAE of $5.97^\circ \pm 2.36^\circ$ and a Pearson correlation coefficient of $0.962 \pm 0.014$. The proposed method is expected to provide a practical, patient self-deployable system for daily kinematic monitoring in uncalibrated home environments.
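
The two validation metrics quoted above, mean absolute error and the Pearson correlation coefficient, can be computed for a pair of joint-angle series as sketched below. This is a generic illustration with hypothetical function names, not the authors' evaluation script; it assumes the estimated and reference series are already time-aligned and expressed in degrees.

```python
import math
from typing import Sequence


def mae(est: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute error between estimated and reference angles."""
    return sum(abs(e - r) for e, r in zip(est, ref)) / len(ref)


def pearson(est: Sequence[float], ref: Sequence[float]) -> float:
    """Pearson correlation coefficient between the two series."""
    n = len(ref)
    mean_e, mean_r = sum(est) / n, sum(ref) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(est, ref))
    var_e = sum((e - mean_e) ** 2 for e in est)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_e * var_r)
```

The two numbers answer different questions: MAE measures how far the estimates are from the reference in degrees, while the correlation measures whether the estimated curve follows the shape of the reference even if it is offset.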

2605.16412 2026-05-19 cs.RO cs.CV 版本更新

SCAR: Self-Supervised Continuous Action Representation Learning

SCAR:自监督连续动作表示学习

Hongjia Liu, Fan Feng, Minghao Fu, Xinyue Wang, Haofei Lu, Biwei Huang

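AI summary: This paper proposes SCAR, a framework that learns unified action representations through self-supervised learning, improving generalization across embodiments and tasks.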

详情
AI中文摘要

尽管动作在具身智能中起核心作用,但从视觉转换中学习可迁移的动作表示仍是一个基本挑战,特别是在数据有限的情况下,世界模型需要在不同体素间泛化。我们提出SCAR,一个联合逆向-前向动力学框架,用于从视觉转换中学习跨体素的统一动作表示。基于预训练生成主干,SCAR使用逆向动力学模型(IDM)从潜在观察对中推断潜在动作,并使用前向动力学模型(FDM)根据这些动作预测未来动态。为了使潜在空间可迁移而非通用视觉瓶颈,我们正则化潜在动作后验向标准高斯先验,限制任意视觉编码,并引入对抗不变性以抑制体素和环境特定的噪声因素。在Procgen和Robotwin数据集上的实验表明,学习的统一潜在动作表示比体素特定的原始动作更强大,作为世界建模的条件接口,提高了跨体素低数据适应和跨任务迁移性能。

英文摘要

Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples the controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.
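
The abstract does not give the form of the regularizer toward the standard Gaussian prior. Under the common VAE-style assumption of a diagonal Gaussian posterior over the latent action, the penalty would take the closed form below. The symbols $\mu_j$, $\sigma_j$, $d$, and the notation $o_t$, $o_{t+1}$ for the observation pair are assumptions introduced here for illustration.

```latex
% Assumption: diagonal Gaussian posterior q(z | o_t, o_{t+1}) = N(mu, diag(sigma^2)),
% standard Gaussian prior p(z) = N(0, I), latent action dimension d.
D_{\mathrm{KL}}\big(q(z \mid o_t, o_{t+1}) \,\|\, p(z)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)
```

The term vanishes only when $\mu_j = 0$ and $\sigma_j = 1$ for every dimension, so it penalizes any latent dimension that carries information, which is the sense in which such a prior limits arbitrary visual encoding.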

2605.16398 2026-05-19 cs.RO cs.AI 版本更新

Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery

支持安全的变分混合滤波器用于接触模式和稀疏定律恢复

Marios Papamichalis, Regina Ruane

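AI summary: This paper proposes VHYDRO, a variational hybrid dynamics learner that mixes the learned proposal with a feasible transition law to prevent branch loss, jointly infers continuous latent states and discrete contact modes, and provides three connected guarantees leading to sparse port-Hamiltonian law recovery.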

详情
AI中文摘要

接触丰富的机器人动力学是混合的:单个观测可以匹配多个潜在状态和接触模式(自由、冲击、粘滑)。标准的退火滤波器不将概率分配给可行的接触转换将永久失去机器人实际遵循的分支。我们介绍了VHYDRO,一种变分混合动力学习器,防止这种分支丢失。在每一步中,VHYDRO混合学习的提案与可行转换律,然后进行采样和重要加权,确保模型可行的载体保留的每个转换都得到覆盖。VHYDRO联合推断连续的潜在状态和离散接触模式,并为每个恢复的模式拟合稀疏端-哈密顿定律。在此基础上,三种保证连接:支持覆盖稳定了滤波,稳定后的滤波将离散接触后验集中在一致的模式上,且模式纯段允许稀疏端-哈密顿恢复。恢复误差清晰地分为滤波、导数、模式不纯和物理残差部分。三种经验发现跟踪相同的机制。在重遮挡下,支持安全的滤波器保持可用,而非防御性的提案会崩溃。在ManiSkill演示和四个Sawyer/BridgeData任务家族上,离散状态形成时间一致的接触模式段,离散状态在ARI、变化点F1和段纯度上比事后和模式自由基线更强。在已知方程的混合系统中,模式条件的稀疏拟合恢复了活跃的物理项;纯预测基线则不能。

英文摘要

Contact-rich robot dynamics are hybrid: a single observation can match several latent states and contact regimes (free, impact, stick-slip). A standard amortized filter that places no probability on a feasible contact transition will permanently lose the branch the robot actually follows. We introduce VHYDRO, a variational hybrid dynamics learner that prevents this branch loss. At each step, VHYDRO mixes the learned proposal with a feasible transition law before sampling and importance weighting, ensuring that every transition retained by the model-feasible carrier remains covered. VHYDRO jointly infers a continuous latent state and a discrete contact mode, and fits a sparse port-Hamiltonian law to each recovered regime. On top of this, three guarantees connect: support coverage stabilizes filtering, the stabilized filter concentrates the discrete contact posterior on coherent regimes, and mode-pure segments admit sparse port-Hamiltonian recovery. The recovery error separates cleanly into filtering, derivative, mode-impurity, and physics-residual parts. Three empirical findings track the same mechanism. Under heavy occlusion, the support-safe filter stays usable while a non-defensive proposal collapses. On ManiSkill demonstrations and on four Sawyer/BridgeData task families, the discrete state forms temporally coherent contact-regime segments and yields a stronger joint profile across ARI, change-point F1, and segment purity than post-hoc and mode-free baselines. On hybrid systems with known equations, the mode-conditioned sparse fit recovers the active physical terms; purely predictive baselines do not.

2605.16395 2026-05-19 cs.RO cs.LG 版本更新

OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence

OrbiSim:作为具身智能的可微物理引擎的世界模型

Jiajian Li, Jingyuan Huang, Junru Gong, Qi Wang, Xiaokang Yang, Yunbo Wang

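AI summary: OrbiSim proposes a robotic simulation paradigm that recasts world models as fully differentiable physics engines, using a unified, physically grounded pathway to connect structured scene assets, neural dynamics, and downstream reinforcement learning, improving predictive fidelity and control performance.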

Comments Project page: https://jjleejj85.github.io/projects/orbisim

详情
AI中文摘要

我们提出了OrbiSim,一种新的机器人仿真范式,将世界模型重新定义为完全可微的物理引擎,用于具身智能。不同于以往专注于潜在域或视觉域中无约束想象的世界模型,OrbiSim建立了一个统一的、基于物理的路径,连接结构化场景资产、神经动力学和下游强化学习。通过在整个仿真循环中实现端到端的可微性——从显式状态转换到视觉观察生成——OrbiSim支持传统经典模拟器难以处理的任务,如可微接触建模、稀疏奖励下的基于梯度的策略优化和直观的物理推理。实证结果表明,OrbiSim在预测保真度和控制性能方面显著优于最先进的世界模型。此外,其对资产配置和物理参数的一致响应表明其作为增强机器人仿真和策略训练的可微工具的潜力。

英文摘要

We present OrbiSim, a novel robotic simulation paradigm that redefines world models as a fully differentiable physics engine for embodied intelligence. Unlike prior world models that focus on unconstrained imagination in latent or visual domains, OrbiSim establishes a unified, physically-grounded pathway that bridges structured scene assets, neural dynamics, and downstream reinforcement learning. By enabling end-to-end differentiability throughout the entire simulation loop, from explicit state transitions to visual observation generation, OrbiSim supports tasks traditionally intractable for classical simulators, such as differentiable contact modeling, gradient-based policy optimization under sparse rewards, and intuitive physical inference. Empirical results demonstrate that OrbiSim significantly outperforms state-of-the-art world models in both predictive fidelity and control performance. Furthermore, its consistent responsiveness to asset configurations and physical parameters suggests its potential as a differentiable tool for enhancing robot simulation and policy training.

2605.16391 2026-05-19 eess.SP cs.AI cs.LG cs.RO 版本更新

Overcoming the Intrinsic Performance Limitations of MEMS IMU via Diffusion-Based Generative Learning

通过扩散生成学习克服MEMS惯性测量单元的固有性能限制

Jiarui Lv, Feng Zhu, Xiaohong Zhang

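AI summary: This paper proposes a diffusion-based generative learning framework that synthesizes high-fidelity virtual IMU data from low-cost IMU measurements, improving positioning and attitude estimation, and validates it in airborne mapping experiments.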

详情
AI中文摘要

惯性测量单元(IMUs)是多源集成导航系统中的基本传感组件,其性能直接影响解决方案的精度和可靠性。然而,低成本IMUs的精度受硬件限制。最近,生成式人工智能在建模复杂数据分布和重建高保真信号方面表现出色。受此启发,我们提出了一种基于扩散的生成学习框架,用于从低成本IMU测量中合成高保真虚拟IMU数据。具体而言,基于U-Net架构构建了条件扩散模型,其中高质量IMU测量用作先验真实数据,低成本IMU测量作为条件输入。模型生成的虚拟IMU数据用于后续导航和定位任务。实验结果表明,生成的虚拟IMU数据在定位和姿态估计方面均显著优于原始低成本IMU测量。此外,我们将模型转移到空中测绘实验中,其中所提出的方法产生了更薄且一致的点云。总体而言,所提出的框架突破了低成本IMU的性能限制,并展示了扩散基于生成学习在虚拟高质量IMU数据方面的潜力。

英文摘要

Inertial measurement units (IMUs) are fundamental sensing components in multi-source integrated navigation systems, and their performance directly determines the accuracy and reliability of navigation solutions. However, the precision of low-cost IMUs is inherently constrained by hardware limitations. Recently, generative artificial intelligence has demonstrated remarkable capability in modeling complex data distributions and reconstructing high-fidelity signals. Motivated by this, we propose a diffusion-based generative learning framework for synthesizing high-fidelity virtual IMU data from low-cost IMU measurements. Specifically, a conditional diffusion model based on a U-Net architecture is constructed, where high-grade IMU measurements are utilized as ground-truth priors and low-cost IMU measurements are employed as conditional inputs. The virtual IMU data generated by the model are used for subsequent navigation and localization tasks. Experimental results demonstrate that the generated virtual IMU data significantly outperform the original low-cost IMU measurements in both positioning and attitude estimation. Furthermore, we transfer the model to airborne mapping experiments, where the proposed method produces thinner and more consistent point clouds. Overall, the proposed framework breaks the performance limits of low-cost IMUs and demonstrates the potential of diffusion-based generative learning for synthesizing virtual high-grade IMU data.

2605.16389 2026-05-19 cs.RO cs.AI cs.SY eess.SY 版本更新

Haptic Rendering of Fractional-Order Viscoelasticity: Passivity and Rendering Fidelity

触觉渲染中的分数阶粘弹性:被动性和渲染保真度

Gorkem Gemalmaz, Harun Tolasa, Volkan Patoglu

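AI summary: This paper analyzes the passivity and rendering performance of fractional-order viscoelastic models under finite-memory discretization, derives closed-form expressions that ensure passive haptic rendering, and validates the theoretical results and perceived realism through experiments and human-subject evaluations.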

Comments Under review for publication in IEEE Transactions on Robotics

详情
AI中文摘要

触觉渲染具有蠕变和应力松弛特性的粘弹性材料对于许多应用至关重要,如使用真实生物组织模型的医学培训。分数阶粘弹性模型提供了一种有效描述本质上时间依赖动态的方法,仅需少量参数,因为这些模型可以自然捕捉记忆效应。在本研究中,我们分析了分数阶粘弹性模型在有限记忆离散化下的被动性和渲染性能。我们推导出闭式表达式,以确保基于Grunwald-Letnikov导数的分数阶(FO)标准线性固体(SLS)模型的触觉渲染被动性。我们还提供了此类FO-SLS模型的有效刚度和阻尼的符号表达式。所得到的被动性条件构成了一个统一的框架,该框架推广了之前报告的整数阶凯尔文-沃伊特、麦克斯韦和SLS模型的结果,因为这些结果是新推导条件的特殊情况。此外,我们还提供了理论被动性界限的实验验证和对FO-SLS模型感知真实感的人类受试者评估。总体而言,本研究建立了在有限记忆离散化下的分数阶粘弹性渲染的统一理论框架和实验评估。

英文摘要

Haptic rendering of viscoelastic materials that exhibit creep and stress relaxation is crucial for many applications, such as medical training with realistic biological tissue models. Fractional-order viscoelastic models provide an effective means of describing intrinsically time-dependent dynamics with few parameters, as these models can naturally capture memory effects. In this study, we present analyses of passivity and rendering performance for fractional-order viscoelastic models under finite-memory discretization. We derive closed-form expressions to ensure the passivity of haptic rendering with a fractional-order (FO) standard linear solid (SLS) model based on the Grünwald-Letnikov derivative under short-memory discretization. We also provide symbolic expressions for the effective stiffness and damping of such FO-SLS models. The resulting passivity conditions constitute a unified framework that generalizes previously reported results for integer-order Kelvin-Voigt, Maxwell, and SLS models, since these results are special cases of the newly derived condition. Furthermore, we provide experimental validations of the theoretical passivity bounds and human-subject evaluations of the perceived realism of FO-SLS models. Overall, this study establishes a unified theoretical framework and experimental evaluations for FO viscoelastic rendering under short-memory discretization.
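
The discretization named in this abstract can be illustrated with the standard Grünwald-Letnikov approximation, D^α x(t) ≈ h^(-α) Σ w_k x(t - kh), with weights w_0 = 1 and w_k = w_(k-1) · (1 - (α + 1)/k), truncated to a fixed number of recent samples. The sketch below shows only this textbook approximation, not the authors' rendering code or their passivity conditions; the function names and the `memory` parameter are assumptions.

```python
from typing import List, Sequence


def gl_weights(alpha: float, n: int) -> List[float]:
    """First n Grünwald-Letnikov weights via the standard recursion
    w_0 = 1, w_k = w_{k-1} * (1 - (alpha + 1) / k)."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / k))
    return weights


def gl_derivative(samples: Sequence[float], alpha: float,
                  h: float, memory: int) -> float:
    """Approximate the order-alpha derivative at the newest sample.

    samples are uniformly spaced by h with the newest value last.
    Only the last `memory` samples are used (short-memory truncation)."""
    recent = list(samples[-memory:])
    weights = gl_weights(alpha, len(recent))
    # weights[0] multiplies the newest sample, weights[1] the one before, ...
    acc = sum(w * x for w, x in zip(weights, reversed(recent)))
    return acc / h ** alpha
```

For α = 1 the weights reduce to 1, -1, 0, 0, ..., which gives the ordinary backward difference, and for α = 0 they return the newest sample unchanged. For non-integer α every past sample has a nonzero weight, which is why a memory window has to be chosen at all.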

2605.16300 2026-05-19 cs.CY cs.AI cs.MA cs.RO 版本更新

Consent Chain Degradation in Embodied Multi-Agent Systems: Bridging the Gap Between AI Agent Governance and Robot Ethics

具身多智能体系统中的同意链退化:弥合人工智能代理治理与机器人伦理之间的鸿沟

Mehmet Haklidir

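AI summary: This paper introduces consent chain degradation (CCD), examining how the specificity, validity, and scope of human consent erode along multi-robot delegation chains, illustrates it with healthcare, domestic, and industrial robotics scenarios, and analyzes how existing regulations leave core CCD dimensions unaddressed.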

Comments Accepted for oral presentation at the 2nd Workshop on Robot Ethics (WoRoBet), ICRA 2026, Vienna, Austria, June 1, 2026. 6 pages, 3 tables, 1 figure

详情
AI中文摘要

机器人系统正从孤立平台转向在人类环境中运行的互联多智能体生态系统。这一转变引发了现有框架未解决的治理问题:如何在多机器人委托链中传播、退化和破裂同意?人工智能伦理社区已开始研究数字软件代理的同意,而人机交互社区则研究人机双方面对的同意。现有研究均未涵盖当物理机器人以影响人类的方式委托任务给其他机器人时的情况。本文引入同意链退化(CCD),一种分析多机器人委托链中人类同意具体性、有效性及范围如何退化的概念框架。我们提出一种三层治理架构,即具身代理的同意运行验证框架(CoRVE),整合了同意范围建模、委托链追踪和物理不可逆性评估。医疗、家庭和工业机器人三个场景展示了CCD的实际表现,包括一个数值示例。对欧盟人工智能法案、GDPR、机械指令和修订后产品责任指令的监管缺口分析显示,这四个工具均未涵盖CCD的核心维度。

英文摘要

Robotic systems are moving from isolated platforms to interconnected multi-agent ecosystems that operate in human environments. This shift raises a governance problem that existing frameworks do not address: how does consent propagate, degrade, and break down across chains of delegation between embodied autonomous agents? The AI ethics community has begun to study consent for digital software agents, and the HRI community has examined consent in dyadic human-robot encounters. Neither body of work covers what happens when physical robots delegate tasks to other robots in ways that affect humans. This paper introduces consent chain degradation (CCD), a conceptual framework for analyzing how the specificity, validity, and scope of human consent erode as authority passes through multi-robot delegation chains. We propose a three-layer governance architecture, the Consent Runtime Verification Framework for Embodied Agents (CoRVE), which integrates consent scope modeling, delegation chain tracking, and physical irreversibility assessment. Three scenarios in healthcare, domestic, and industrial robotics show how CCD arises in practice, including a worked numerical example. A regulatory gap analysis covering the EU AI Act, the GDPR, the Machinery Regulation, and the Revised Product Liability Directive shows that all four instruments leave core CCD dimensions unaddressed.

2602.16712 2026-05-19 cs.RO 版本更新

One Hand to Rule Them All: Canonical Representations for Unified Dexterous Manipulation

一手统治他们全部:统一灵活操作的规范表示

Zhenyu Wei, Yunchao Yao, Mingyu Ding

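AI summary: This paper proposes a canonical representation for unified dexterous manipulation, using a unified parameter space and a canonical URDF format to overcome the fixed-hand-design limitation and enable cross-hand dexterous manipulation learning.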

Comments Accepted at RSS 2026

详情
AI中文摘要

当今灵活操作策略大多假设固定手设计,严重限制其在新结构布局下的泛化能力。为克服这一限制,我们引入了一种参数化的规范表示,统一了广泛灵活手架构。它包含统一的参数空间和规范URDF格式,提供三个关键优势:1)参数空间捕捉本质形态和运动学变化,用于学习算法的有效条件;2)可以在我们的空间上学习结构化潜在流形,使不同结构之间的插值产生平滑且物理上意义的形态过渡;3)规范URDF标准化了动作空间,同时保持原始URDF的动力学和功能属性,使跨结构策略学习高效可靠。通过广泛的分析和实验验证这些优势,包括抓取策略回放、VAE潜在编码和跨结构零样本迁移。具体而言,我们训练了一个VAE在统一表示上,获得紧凑且语义丰富的潜在嵌入,并开发一个基于规范表示的抓取策略,跨灵活手泛化。通过模拟和未见过的形态任务(例如81.9%的零样本成功率在3指LEAP手),证明我们的框架统一了结构多样手的表示和动作空间,为跨手学习提供可扩展的基础,朝着通用灵活操作迈进。项目页面:https://zhenyuwei2003.github.io/OHRA/

英文摘要

Dexterous manipulation policies today largely assume fixed hand designs, severely restricting their generalization to new embodiments with varied kinematic and structural layouts. To overcome this limitation, we introduce a parameterized canonical representation that unifies a broad spectrum of dexterous hand architectures. It comprises a unified parameter space and a canonical URDF format, offering three key advantages. 1) The parameter space captures essential morphological and kinematic variations for effective conditioning in learning algorithms. 2) A structured latent manifold can be learned over our space, where interpolations between embodiments yield smooth and physically meaningful morphology transitions. 3) The canonical URDF standardizes the action space while preserving dynamic and functional properties of the original URDFs, enabling efficient and reliable cross-embodiment policy learning. We validate these advantages through extensive analysis and experiments, including grasp policy replay, VAE latent encoding, and cross-embodiment zero-shot transfer. Specifically, we train a VAE on the unified representation to obtain a compact, semantically rich latent embedding, and develop a grasping policy conditioned on the canonical representation that generalizes across dexterous hands. We demonstrate, through simulation and real-world tasks on unseen morphologies (e.g., 81.9% zero-shot success rate on 3-finger LEAP Hand), that our framework unifies both the representational and action spaces of structurally diverse hands, providing a scalable foundation for cross-hand learning toward universal dexterous manipulation. Project Page: https://zhenyuwei2003.github.io/OHRA/

2509.24143 2026-05-19 cs.RO math.OC 版本更新

A Novel Model for 3D Motion Planning for a Generalized Dubins Vehicle with Pitch and Yaw Rate Constraints

一种带有俯仰和偏航率约束的广义杜宾车辆三维路径规划新模型

Deepak Prakash Kumar, Swaroop Darbha, Satyanarayana Gupta Manyam, David W. Casbeer

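AI summary: This paper proposes a new model and a fast algorithm for 3D motion planning for fixed-wing unmanned aerial vehicles, representing full vehicle orientation with a body-attached frame and bounded pitch and yaw rates with two control inputs, and generating feasible paths that are shorter than those of existing methods in most cases.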

Comments The code for this paper is available at https://github.com/DeepakPrakashKumar/3D-Motion-Planning-for-Generalized-Dubins-with-Pitch-Yaw-constraints

详情
AI中文摘要

本文提出了一种新的建模方法和快速算法,适用于固定翼无人机的三维路径规划。目标是构造连接给定初始和最终配置的最短路径,受运动约束。我们的工作与现有文献不同之处在于:首先,我们使用体固定坐标系考虑完整的车辆姿态,包括俯仰、偏航和偏航角;而现有工作仅使用俯仰和/或航向角,不足以唯一确定姿态。其次,我们使用两个控制输入表示受限的俯仰和偏航率,反映由两个独立执行器控制的情况。相比之下,大多数先前方法依赖于单一输入,如路径曲率,无法准确建模车辆的三维运动学。我们使用旋转最小化框架来描述车辆的配置及其演变,并通过在球面、圆柱面或平面上拼接最优杜宾路径来构建路径。数值模拟显示,我们的方法在平均10秒内生成可行路径,并在大多数情况下生成比现有方法更短的路径。

英文摘要

In this paper, we propose a new modeling approach and a fast algorithm for 3D motion planning, applicable to fixed-wing unmanned aerial vehicles. The goal is to construct the shortest path connecting given initial and final configurations subject to motion constraints. Our work differs from existing literature in two ways. First, we consider full vehicle orientation using a body-attached frame, which includes roll, pitch, and yaw angles. In contrast, existing work uses only pitch and/or heading angle, which is insufficient to uniquely determine orientation. Second, we use two control inputs to represent bounded pitch and yaw rates, reflecting control by two separate actuators. Most previous methods instead rely on a single input, such as path curvature, which is insufficient for accurately modeling the vehicle's kinematics in 3D. We use a rotation minimizing frame to describe the vehicle's configuration and its evolution, and construct paths by concatenating optimal Dubins paths on spherical, cylindrical, or planar surfaces. Numerical simulations show that our approach generates feasible paths within 10 seconds on average and yields shorter paths than existing methods in most cases.

2506.14009 2026-05-19 cs.RO 版本更新

GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics

GRaD-Nav++: 基于视觉-语言模型的视觉无人机导航:高斯辐射场与可微动力学

Qianzhong Chen, Naixiang Gao, Suning Huang, JunEn Low, Timothy Chen, Jiankai Sun, Mac Schwager

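AI summary: GRaD-Nav++ is a lightweight Vision-Language-Action framework trained with differentiable reinforcement learning in a 3D Gaussian Splatting simulator, enabling real-time drone navigation from natural-language commands and demonstrating navigation across multiple tasks and environments.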

Comments Published in: IEEE Robotics and Automation Letters ( Volume: 11, Issue: 2, February 2026)

Details
Journal ref
Chen, Qianzhong, et al. "Grad-nav++: Vision-language model enabled visual drone navigation with gaussian radiance fields and differentiable dynamics." IEEE Robotics and Automation Letters 11.2 (2025): 1418-1425
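AI Chinese abstract (translated)

Autonomous drones that can interpret and execute high-level language instructions in unstructured environments remain a long-standing goal. Existing approaches, however, are limited by their reliance on hand-crafted skills, tedious parameter tuning, or computationally intensive models that cannot run onboard. We introduce GRaD-Nav++, a lightweight Vision-Language-Action (VLA) framework that runs onboard in real time and executes natural-language instructions. Our policy is trained with Differentiable Reinforcement Learning (DiffRL) in a photorealistic 3D Gaussian Splatting (3DGS) simulator, which enables efficient learning of low-level control from visual and language inputs. At its core is a Mixture-of-Experts (MoE) action head that adaptively routes computation to improve generalization and mitigate forgetting. In multi-task generalization experiments, GRaD-Nav++ achieves a success rate of 83% on trained tasks and 75% on unseen tasks in simulation. In real hardware deployment, it achieves 67% success on trained tasks and 50% on unseen tasks. In multi-environment adaptation experiments, GRaD-Nav++ achieves an average success rate of 81% across diverse simulated environments and 67% across diverse real-world settings. These results establish a new benchmark for fully onboard VLA flight and show that compact, efficient models can deliver reliable, language-guided navigation without relying on external infrastructure.

English abstract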

Autonomous drones capable of interpreting and executing high-level language instructions in unstructured environments remain a long-standing goal. Yet existing approaches are constrained by their dependence on hand-crafted skills, extensive parameter tuning, or computationally intensive models unsuitable for onboard use. We introduce GRaD-Nav++, a lightweight Vision-Language-Action (VLA) framework that runs fully onboard and follows natural-language commands in real time. Our policy is trained in a photorealistic 3D Gaussian Splatting (3DGS) simulator via Differentiable Reinforcement Learning (DiffRL), enabling efficient learning of low-level control from visual and linguistic inputs. At its core is a Mixture-of-Experts (MoE) action head, which adaptively routes computation to improve generalization while mitigating forgetting. In multi-task generalization experiments, GRaD-Nav++ achieves a success rate of 83% on trained tasks and 75% on unseen tasks in simulation. When deployed on real hardware, it attains 67% success on trained tasks and 50% on unseen ones. In multi-environment adaptation experiments, GRaD-Nav++ achieves an average success rate of 81% across diverse simulated environments and 67% across varied real-world settings. These results establish a new benchmark for fully onboard Vision-Language-Action (VLA) flight and demonstrate that compact, efficient models can enable reliable, language-guided navigation without relying on external infrastructure.
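
As a rough illustration of what a Mixture-of-Experts action head does, the sketch below routes each feature vector to its top-k experts and blends their outputs with softmax gate weights. Everything here is an assumption made for illustration: the abstract does not specify the routing rule, expert architecture, or dimensions, so the class name `MoEActionHead`, the linear experts, the top-k gating, and all sizes are hypothetical and are not GRaD-Nav++'s actual design.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


class MoEActionHead:
    """Generic top-k gated mixture of linear experts (illustrative only)."""

    def __init__(self, feature_dim, action_dim, num_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # Gate maps features to one logit per expert.
        self.gate = rng.normal(scale=0.1, size=(feature_dim, num_experts))
        # Each expert is a single linear map from features to actions.
        self.experts = rng.normal(
            scale=0.1, size=(num_experts, feature_dim, action_dim))
        self.top_k = top_k

    def __call__(self, features):
        logits = features @ self.gate  # (batch, num_experts)
        # Keep the top-k logits per sample and mask out the rest.
        kth_largest = np.sort(logits, axis=-1)[:, -self.top_k][:, None]
        masked = np.where(logits >= kth_largest, logits, -np.inf)
        weights = softmax(masked)  # zero weight for masked experts
        # Evaluate every expert, then blend with the gate weights.
        expert_out = np.einsum("bf,efa->bea", features, self.experts)
        actions = np.einsum("be,bea->ba", weights, expert_out)
        return actions, weights


# Example: 5 fused vision-language feature vectors, 4-dimensional actions.
head = MoEActionHead(feature_dim=8, action_dim=4, num_experts=4, top_k=2)
features = np.random.default_rng(1).normal(size=(5, 8))
actions, weights = head(features)
```

For clarity this sketch evaluates every expert and then zeroes the unused ones; a practical implementation would typically compute only the selected experts to save onboard compute.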