arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.20138 2026-05-20 cs.RO cs.SY eess.SY Updated version

Hamilton-Jacobi Reachability for Spacecraft Collision Avoidance

Larry Hui, Jordan Kam, William Su, Jianshu Zhou

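Affiliations: Department of Mechanical Engineering, University of California, Berkeley; Aerospace Engineering Program, University of California, Berkeley; Department of Mechanical Engineering, National University of Singapore

AI summary: This paper presents a Hamilton-Jacobi (HJ) reachability framework for collision avoidance between two satellites in the same orbit, modeling relative motion in the radial-tangential-normal (RTN) frame with planar Hill-Clohessy-Wiltshire (HCW) dynamics. The target set consists of unsafe relative configurations corresponding to minimum separation requirements consistent with Federal Communications Commission (FCC) orbital standards. The interaction between the spacecraft is modeled as a zero-sum differential game in which Player 1 is the controlled satellite and Player 2 is a bounded adversarial disturbance with unknown intent. The paper gives the HJ formulation and computes backward reachable sets that characterize the relative states from which collision cannot be avoided in the worst case; states outside these sets admit provably safe trajectories. The reachable sets are combined with supervisory hybrid control logic to decide when an evasive maneuver must be initiated, enabling mathematically grounded safety guarantees for scalability.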

Comments Accepted to the 20th IEEE International Conference on Control & Automation (IEEE ICCA 2026). 6 pages, 4 figures

Abstract

This article presents a Hamilton-Jacobi (HJ) reachability framework for a two-satellite collision avoidance problem operating in the same circular orbit, where relative motion is modeled in the radial-tangential-normal (RTN) frame using planar Hill-Clohessy-Wiltshire (HCW) dynamics. We define the target state space as unsafe relative configurations in the orbit plane corresponding to minimum separation requirements consistent with Federal Communications Commission (FCC) orbital standards. The interaction between spacecraft is formulated as a zero-sum differential game, where Player 1 is the controlled satellite and Player 2 is modeled as a bounded adversarial disturbance with unknown intent. We present the HJ formulation and compute backward reachable sets that characterize relative states from which collision cannot be avoided under worst-case disturbances, while states outside this set admit provably collision-free trajectories. These reachable sets are integrated with supervisory hybrid control logic to determine when evasive maneuvers must be initiated, enabling mathematically grounded safety guarantees for scalability.
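
As a rough illustration of the relative-motion model named in this abstract, the sketch below integrates the standard planar HCW equations (radial x, along-track y) with a fixed-step RK4 scheme and checks membership in a circular keep-out set. It is not the authors' implementation and it does not compute a backward reachable set. The function names, the circular shape of the unsafe set, and the `keep_out_radius` parameter are assumptions made for illustration; the abstract does not state the actual separation distance.

```python
import math


def hcw_planar_step(state, accel, n, dt):
    """Advance planar HCW relative dynamics by one RK4 step.

    state = (x, y, vx, vy): radial and along-track relative position [m]
            and velocity [m/s] in the RTN frame.
    accel = (ux, uy): net relative acceleration [m/s^2]; in a game setting
            this would be control minus disturbance (assumed, hypothetical).
    n:      mean motion of the circular reference orbit [rad/s].
    """
    ux, uy = accel

    def f(s):
        x, _, vx, vy = s
        # HCW in-plane equations:
        #   x'' = 3 n^2 x + 2 n y' + ux
        #   y'' = -2 n x' + uy
        return (vx, vy, 3.0 * n * n * x + 2.0 * n * vy + ux, -2.0 * n * vx + uy)

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = f(state)
    k2 = f(shifted(state, k1, dt / 2.0))
    k3 = f(shifted(state, k2, dt / 2.0))
    k4 = f(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def is_unsafe(state, keep_out_radius):
    """True if the in-plane separation is below the (hypothetical) keep-out radius."""
    x, y, _, _ = state
    return math.hypot(x, y) < keep_out_radius
```

A pure along-track offset with zero relative velocity is an equilibrium of these equations, which gives a quick sanity check on the signs: `hcw_planar_step((0.0, 100.0, 0.0, 0.0), (0.0, 0.0), 0.0011, 1.0)` leaves the state unchanged.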

2605.20101 2026-05-20 cs.RO Updated version

Topology-Optimized Pneumatic Soft Actuator: Design and Experimental Validation

Sumit Mehta, Konstantinos Poulios

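Affiliations: DTU (Technical University of Denmark)

AI summary: This paper designs soft elastomeric pneumatic actuators using nonlinear topology optimization and validates their performance experimentally.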

Comments 20 pages, 13 figures

Abstract

This paper demonstrates the computational design of soft elastomeric pneumatic actuators using nonlinear topology optimization. An existing density- and porohyperelasticity-based topology optimization framework was extended from 2D to 3D and used to generate two manufacturable actuator designs, which were then studied numerically and experimentally. For both designs, the objective was to maximize the bending response for a prescribed actuation pressure under two different allowable strain limits. A key advantage of the employed topology optimization framework is that it can consistently, during the optimization, account for the very large deformations induced upon pressurization. The two optimized 3D designs were fabricated using stereolithography and experimentally tested to validate their performance.

2605.20072 2026-05-20 cs.AI cs.RO Updated version

Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

Oussama Zenkri, Oliver Brock

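Affiliations: Robotics and Biology Laboratory, Technische Universität Berlin, Germany; Science of Intelligence, Research Cluster of Excellence, Berlin, Germany; Robotics Institute Germany (RIG)

AI summary: This paper studies how embodied LLM agents behave under different observation inputs and finds that higher-fidelity observations can reduce problem-solving performance. The core method is to vary the information available to the agent experimentally and measure the resulting changes in behavior; the main contribution is showing how perceptual errors and reasoning failures interact.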

Comments Submitted to From Animals to Animats: The 18th International Conference on the Simulation of Adaptive Behavior (SAB)

Abstract

Large Language Models are increasingly proposed as cognitive components for robotic systems, yet their opaque decision processes make it difficult to explain success or failure in closed-loop embodied tasks. Following an empirical AI methodology, we study embodied LLM agents behaviorally by varying the information available to the agent and measuring the resulting changes in behavior. Using the Lockbox, a sequential mechanical puzzle with hidden interdependencies, we evaluate LLMs across RGB, RGB-D, and ground-truth symbolic observations in a physical robotic setup and use controlled simulation to probe the resulting behavior. Counterintuitively, agents perform best under raw RGB input and worst under perfect ground-truth observations. In simulation, we probe this effect by randomly flipping perceived action outcomes and find that moderate noise improves performance, peaking at a 40% flip probability with a 2.85-fold success rate increase over the noise-free baseline. Further analysis links this gain to a reduction in repetitive action loops. These findings suggest that success rates alone are insufficient for evaluating LLMs, as measured performance may reflect the interaction between perceptual errors and reasoning failures rather than robust problem solving.
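
The noise probe described in this abstract (randomly flipping perceived action outcomes) can be sketched in a few lines. This is only an assumed reading of the mechanism, not the authors' code: outcomes are treated as booleans, and `perceive_outcome`, `observed_flip_rate`, and the seeded generator are hypothetical names introduced here.

```python
import random


def perceive_outcome(true_success, flip_prob, rng):
    """Return the outcome the agent is told; inverted with probability flip_prob."""
    if rng.random() < flip_prob:
        return not true_success
    return true_success


def observed_flip_rate(flip_prob, trials, seed=0):
    """Empirical fraction of flipped observations for an always-successful action."""
    rng = random.Random(seed)
    flips = sum(
        1 for _ in range(trials)
        if perceive_outcome(True, flip_prob, rng) is False
    )
    return flips / trials
```

With `flip_prob=0.0` the agent always sees the true outcome, and with `flip_prob=1.0` it always sees the opposite; the 40% setting reported in the abstract sits between these extremes.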

2605.20055 2026-05-20 cs.SE cs.AI cs.RO Updated version

Towards LLM-Assisted Architecture Recovery for Real-World ROS 2 Systems: An Agent-Based Multi-Level Approach to Hierarchical Structural Architecture Reconstruction

Dominique Briechle, Raj Chanchad, Tobias Geger, Ruidi He, Dhruv Jajadiya, Dhruv Kapadiya, Andreas Rausch, Meng Zhang

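Affiliations: Institute for Software and Systems Engineering, Clausthal University of Technology, Clausthal-Zellerfeld 38678, Germany

AI summary: This paper proposes an agent-based multi-level approach for recovering the hierarchical structural architecture of complex ROS 2 systems; refined prompting and multi-level intermediate architectural representations improve the consistency and scalability of architecture recovery.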

Abstract

Explicit software architecture models are essential artifacts for communicating, analyzing, and evolving complex software-intensive systems. In ROS 2-based robotic systems, however, structural (de-)composition and integration semantics are often only implicitly encoded across distributed artifacts such as source code and launch files, making recovery of hierarchical architecture particularly difficult. Existing approaches mainly focus on node-level entities and communication wiring, while providing limited support for recovering hierarchical structural (de-)composition across multiple abstraction levels. In this paper, we extend our previously proposed blueprint-guided LLM-assisted architecture recovery pipeline for ROS 2 systems through two major enhancements: (1) refined prompting to improve the consistency and controllability of architecture synthesis, and (2) a staged recovery strategy based on multi-level intermediate architectural representations that incorporate the atomic ROS node list and launch file dependencies, thereby enabling structurally constrained reconstruction across multiple abstraction levels. The approach is evaluated on a real-world automated product disassembly system based on cooperative robotic arms and heterogeneous ROS 2 artifacts. Compared to our previous work, the considered case study exhibits substantially higher integration complexity and richer functionality. The results demonstrate improved structural consistency, scalability, and robustness of architecture recovery, while also revealing remaining challenges related to dynamic integration semantics in large-scale ROS 2 systems.

2605.19990 2026-05-20 cs.RO cs.CV cs.LG Updated version

Minimalist Visual Inertial Odometry

Francesco Pasti, Jeremy Klotz, Nicola Bellotto, Shree K. Nayar

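Affiliations: Department of Information Engineering, University of Padua; Computer Science Department, Columbia University

AI summary: This paper proposes a minimalist approach to planar odometry that achieves robust motion estimation for differential-drive robots from four visual measurements and an IMU, showing that minimalist sensing can deliver efficient and accurate planar odometry.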

Comments This work has been submitted to the IEEE for possible publication

Abstract

Visual-Inertial Odometry (VIO), which is critical to mobile robot navigation, uses cameras with a large number of pixels. Capturing and processing camera images requires significant resources. This work presents a minimalist approach to planar odometry, demonstrating that just four visual measurements and an IMU can provide robust motion estimation for differential-drive robots. Our key insight is that four downward-facing photodiodes that sense the world through optical Gabor masks produce signals that encode speed. Based on this, we jointly optimize the mask parameters alongside a Temporal Convolutional Network (TCN) using a physically-grounded simulator. The resulting model decodes speed from just the four measurements produced by the photodiodes. Pairing these estimates with the angular speed from an IMU yields a continuous planar trajectory. We validate our approach with a prototype sensor mounted on a differential drive robot. Across diverse indoor and outdoor terrains, our system closely tracks the reference ground truth without any real-world fine-tuning. Our work shows that minimalist sensing enables efficient and accurate planar odometry.
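
The last step of this pipeline, pairing a decoded speed with the IMU's angular speed to obtain a planar trajectory, amounts to unicycle dead reckoning. The sketch below shows one plausible way to do it, assuming a signed forward speed, a fixed sample period, and midpoint-heading integration; none of these details are given in the abstract, and `integrate_planar` is a hypothetical name.

```python
import math


def integrate_planar(speeds, yaw_rates, dt, pose=(0.0, 0.0, 0.0)):
    """Dead-reckon a planar trajectory from speed and yaw-rate samples.

    speeds:    forward speeds [m/s], e.g. decoded from the photodiode signals.
    yaw_rates: angular speeds [rad/s], e.g. from the IMU gyroscope.
    dt:        sample period [s] (assumed constant).
    Returns a list of (x, y, theta) poses, starting with the initial pose.
    """
    x, y, theta = pose
    trajectory = [(x, y, theta)]
    for v, omega in zip(speeds, yaw_rates):
        # Use the heading at the middle of the interval to reduce drift on arcs.
        theta_mid = theta + 0.5 * omega * dt
        x += v * math.cos(theta_mid) * dt
        y += v * math.sin(theta_mid) * dt
        theta += omega * dt
        trajectory.append((x, y, theta))
    return trajectory
```

Driving at 1 m/s while turning at pi/2 rad/s for one second should trace a quarter circle of radius 2/pi, which is a convenient check that speed and yaw rate are combined consistently.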

2605.19986 2026-05-20 cs.RO cs.CV cs.LG Updated version

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

He-Yang Xu, Pengyuan Zhang, Zongyuan Ge, Xiaoshuai Hao, Serge Belongie, Xin Geng, Yuxin Peng, Xiu-Shen Wei

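Affiliations: Southeast University; Monash University; Xiaomi EV; University of Copenhagen; Peking University

AI summary: This paper proposes MetaFine, a framework that decomposes manipulation competency along three axes (understanding, perception, and controlled behavior) to diagnose capability bottlenecks in fine-grained manipulation. Through causal intervention it identifies the visual encoder's ability to preserve local spatial structure as a key limitation, and improving it raises manipulation precision.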

Comments Project page: https://metafine.github.io/

Abstract

Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through this lens exposes severe dimension-specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder's ability to preserve local spatial structure as a key bottleneck for fine-grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real-sim validation, using limited paired real-world rollouts to calibrate scalable simulation-based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: https://metafine.github.io/.

2605.19981 2026-05-20 cs.RO Updated version

CEER: Compliant End-Effector and Root Control as a Unified Interface for Hierarchical Humanoid Loco-Manipulation

Xinyuan Luo, Xingrui Chen, Xunjian Yin, Hongxuan Wu, Boxi Xia, Zhuoqun Chen, Jinzhou Li, Boyuan Chen, Xianyi Cheng

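Affiliations: Department of Mechanical Engineering and Materials Science

AI summary: This paper proposes CEER, a compliant end-effector-root control abstraction that serves as a unified interface for hierarchical humanoid loco-manipulation. Its modular interface supports stable interaction in contact-rich, long-horizon manipulation tasks, and experiments in simulation and on hardware show accurate end-effector tracking and stable manipulation.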

Comments Project page: https://robotproject8.github.io/ceer_page/. 9 pages, 7 figures

Abstract

Humanoid robots have achieved impressive locomotion performance, yet contact-rich and long-horizon manipulation remains a major bottleneck. Manipulation is inherently contact-rich and demands compliant whole-body control for stable interaction, while its diversity and long-horizon nature favor modular, planner-compatible interfaces over joint-space tracking. We propose CEER, a compliant end-effector-root (EE-root) control abstraction for modular humanoid loco-manipulation within a hierarchical planning framework. CEER enables compliance-aware whole-body control in an interpretable task space defined by root motion commands and end-effector pose targets, and supports plug-and-play integration with heterogeneous high-level planners. A teacher-student framework is adopted to distill a general motion-tracking controller into a low-level policy that consumes only EE-root commands. We further construct a hierarchical system that integrates heterogeneous planners and task modules through the EE-root interface, enabling diverse manipulation tasks without retraining the underlying whole-body policy. Experiments in simulation and on hardware demonstrate 3.3 cm end-effector tracking accuracy with substantially reduced jerk compared to baselines, stable contact-rich manipulation under teleoperation, and up to 70% success in simulated single-object loco-manipulation tasks within a room-scale environment. These results indicate that compliant EE-root control provides a practical abstraction for humanoid loco-manipulation, enabling modular and scalable integration of diverse skills.

2605.19958 2026-05-20 cs.RO Updated version

TravExplorer: Cross-Floor Embodied Exploration via Traversability-Aware 3-D Planning

Han Zheng, Zhe Chen, Yudong Huang, Haoran Liu, Jinghao Wang, Ming Yang, Tong Qin

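Affiliations: Shanghai Jiao Tong University

AI summary: This paper proposes TravExplorer, a framework that combines zero-shot semantic guidance with traversability-aware 3-D planning for cross-floor embodied exploration. A unified volumetric map distinguishes occupied structures from robot-reachable support surfaces and is used to extract traversable frontiers, and a FOV-aware active perception strategy resolves incomplete observations during cross-floor traversal. The method is evaluated over 4,195 simulated episodes on HM3D and MP3D, and real-world trials validate open-vocabulary target search without prior maps or human intervention.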

Abstract

Zero-shot Object Navigation (ZSON) has shown promise for open-vocabulary target search in unseen environments, yet most existing systems remain tied to planar representations and single-floor assumptions. These assumptions become inadequate in real buildings, where navigation involves floors, stairs, landings, and vertically overlapping spaces. This article presents TravExplorer, a cross-floor embodied exploration framework that couples zero-shot semantic guidance with traversability-aware 3-D planning. TravExplorer maintains a unified volumetric map that distinguishes occupied structures from robot-reachable support surfaces and extracts traversable frontiers from connected support surfaces, including floors, stairs, and landings. A FOV-aware active perception strategy further resolves incomplete observations during cross-floor traversal. To reduce semantic-reasoning latency, a lightweight guidance module aligns a probabilistic instance map from online open-vocabulary segmentation with a spatial value map from fast image-to-text matching. Based on these geometric and semantic memories, a hierarchical planner performs target-aware frontier touring over object hypotheses, traversable frontiers, and stair landmarks, and generates executable cross-floor motions through foothold-guided 3-D search and vertically constrained local trajectory optimization. Experiments over 4,195 simulated episodes on HM3D and MP3D demonstrate consistent advantages over representative ObjectNav baselines. Fifty real-world trials on a Unitree Go2 further validate open-vocabulary target search across single-floor and cross-floor indoor environments without prior maps or human intervention. The code will be released at https://github.com/wuyi2121/TravExplorer.

2605.19957 2026-05-20 cs.CV cs.AI cs.RO Updated version

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Zuyao Lin, Jianhui Zhang, Peidong Jia, Xiaoguang Zhao, Shanghang Zhang, Xingyu Chen

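Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peking University

AI summary: This paper proposes World-Ego Modeling, a new paradigm that decomposes future evolution into world and ego components to address the degradation of long-horizon embodied prediction in hybrid tasks, and validates it on the HTEWorld benchmark.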

Abstract

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

2605.19940 2026-05-20 cs.AI cs.RO Updated version

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

Rebecca Ramnauth, Drazen Brscic, Brian Scassellati

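Affiliations: Yale University; Kyoto University

AI summary: This paper proposes a robotics-inspired guardrail framework for runtime behavioral control of foundation models in socially sensitive domains, reducing the drift of interaction trajectories into undesirable regimes while adapting to diverse social contexts.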

Comments Under review at Journal of Artificial Intelligence Research (JAIR)

Abstract

Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches, ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation, primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.

2605.19924 2026-05-20 cs.RO Updated version

RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations

Shuoqin Zhang, Yixin Xiong, Xiru Gao, Kai Liu, Ke Wang, Xichuan Zhou, Zhe Hu

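Affiliations: Chongqing University; Chengdu Anu Intelligence

AI summary: This paper proposes RoHIL, an offline fine-tuning framework that addresses the performance drop caused by illumination changes when a robot moves between workstations, preserving source-workstation performance and avoiding data re-collection and retraining.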

Abstract

Human-in-the-loop reinforcement learning systems achieve near-perfect success on the workstation where they are trained, but collapse when the same robot is moved to a workstation a few meters away due to shifts in the visual input distribution caused by new lamp positions and window light. Re-collecting demonstrations and re-running HIL on every workstation is incompatible with deployment, and naively fine-tuning on shifted-light data triggers catastrophic forgetting of the source workstation. To close this cross-domain gap, we present RoHIL, an offline fine-tuning framework that uses no extra real-robot interaction. RoHIL combines (i) a world-model-based image relighter that re-synthesises the visual stream of source-workstation trajectories under multiple virtual HDRI environments, leaving actions and rewards real; (ii) Illumination-Retention Replay (IRR), a data-level anti-forgetting mechanism that interleaves relit adaptation transitions with original-light retention transitions to preserve source-workstation Bellman coverage; and (iii) an anchored Bellman-actor regulariser that constrains representation and policy drift from the original source-workstation policy. Across four real-robot manipulation tasks under significant cross-workstation illumination variations, RoHIL substantially improves shifted-light performance where standard HIL-RL collapses, while preserving source-workstation performance, eliminating the need to re-collect data and retrain for every new workstation and environment. Project page: https://anonymous4365.github.io/RoHIL/

2605.19919 2026-05-20 cs.RO Updated version

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

Dongjie Yu, Kun Lei, Zhennan Jiang, Jia Pan, Huazhe Xu

Affiliations: School of Computing and Data Science, The University of Hong Kong; Shanghai Qizhi Institute; Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences; Institute for Interdisciplinary Information Sciences, Tsinghua University

AI summary: This paper proposes Z-Perturbation Reinforcement Learning (ZPRL), which steers a pretrained policy through a compact bottleneck latent space, improving sample efficiency and final performance and substantially raising success rates on real-world tasks.

Abstract

Pretrained imitation policies have become a strong foundation for robot manipulation, but they often require online improvement to overcome execution errors, limited dataset coverage, and deployment mismatch. A central question is therefore how reinforcement learning (RL) should adapt policies after offline pretraining. Existing lightweight methods commonly apply residual corrections directly in action space, but this often leads to noisy and poorly structured exploration. In this work, we propose Z-Perturbation Reinforcement Learning (ZPRL), an approach that steers pretrained policies through a compact bottleneck latent rather than through policy weights or output actions. During offline training, we augment the policy with a plug-and-play variational information bottleneck (VIB) module to extract a task-relevant latent interface from observation embeddings. During online finetuning, the base policy is frozen and RL learns only a residual perturbation on this latent, whose decoded representation conditions the frozen action generator. We instantiate ZPRL on flow-matching policies and evaluate it on eight simulation tasks and four real-world tasks. Across diverse manipulation settings, ZPRL improves both sample efficiency and final performance over strong post-training baselines. In the real world, ZPRL improves the average success rate on four tasks by 33.7% over imitation base policies while producing smoother exploration behaviors than an action residual counterpart. These results suggest that a compact, task-aligned bottleneck latent provides an effective interface for online RL adaptation. More videos can be found at https://manutdmoon.github.io/ZPRL/.

2605.19887 2026-05-20 cs.DC cs.MA cs.RO cs.SY eess.SY Updated version

DAG-Based QoS-Aware Dynamic Task Placement for Networked Multi-Stage Control Pipelines

Thien Tran, Jonathan Kua, Thuong Hoang, Minh Tran, Yuemin Ding, Jiong Jin

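Affiliations: Deakin University, Australia; RMIT University, Vietnam; University of Navarra, Spain; Swinburne University of Technology, Australia

AI summary: This paper proposes a DAG-based, QoS-aware dynamic task placement framework for sensing-perception-planning-control pipelines in networked robotics. Placement decisions account for compute cost, communication delay, and feasible placement sets, addressing the limitations of static edge offloading and single-stage models.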

Comments 4 pages, 1 figure, 1 algorithm. Accepted as a Work-in-Progress (WiP) paper at the 24th IEEE International Conference on Industrial Informatics (INDIN), 26-29 July 2026, Melbourne, Australia

Abstract

Current Physical AI (PAI) relies heavily on closed-loop visual-servoing pipelines, whose perception and planning stages may become computationally intensive onboard due to complex models embedded on robots. In practice, offloading the perception task to on-site edges statically is inappropriate for latency-sensitive, precise industrial settings over a standardized industrial network. This emphasizes the importance of Control-Communication-Computing (3C) co-design in industrial automation: monolithic local execution saturates AI-accelerated machine and robot hardware, while static edge offloading exposes the control loop to network jitter. Existing adaptive task placement (ATP) controllers can partially address the gap by relocating a single pipeline stage on binary threshold rules, without a multi-stage model and an explicit cost on placement switching. In this Work-in-Progress (WiP) paper, we propose a directed acyclic graph (DAG) based quality-of-service (QoS)-aware dynamic task placement (DTP) framework for sensing-perception-planning-control pipelines in networked robotics. This pipeline is formalized as a DAG with task-level and node-level attributes for compute cost, communication delay, and feasible placement sets; over a small interpretable candidate set (fully local, static offload, hybrid), a window-based cost function combines tail end-to-end latency, deadline violation rate, hardware utilization, and a Hamming-distance switching penalty, and a DTP algorithm with hysteresis and a minimum dwell-time bounds placement chatter. Our WiP paper presents the theoretical framework, a structured qualitative analysis, and a two-phase simulation plus hardware-in-the-loop validation roadmap.
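
To make the window-based cost and the chatter-limiting logic in this abstract more concrete, the sketch below scores each candidate placement with a weighted sum that includes a Hamming-distance switching penalty, and only switches when a minimum dwell time has elapsed and the improvement exceeds a hysteresis margin. The weights, metric names, and the linear form of the cost are assumptions for illustration; the abstract does not specify them, and this is not the authors' DTP algorithm.

```python
def hamming(a, b):
    """Number of pipeline stages whose assigned node differs between two placements."""
    return sum(1 for p, q in zip(a, b) if p != q)


def window_cost(metrics, placement, current, weights):
    """Weighted cost of a candidate placement over one observation window.

    metrics: dict with 'tail_latency', 'violation_rate', 'utilization'
             (hypothetical keys, measured or predicted for this candidate).
    """
    return (
        weights["latency"] * metrics["tail_latency"]
        + weights["deadline"] * metrics["violation_rate"]
        + weights["util"] * metrics["utilization"]
        + weights["switch"] * hamming(placement, current)
    )


def choose_placement(current, candidates, metrics_by_candidate, weights,
                     windows_since_switch, min_dwell, hysteresis):
    """Pick the next placement; `current` must be one of `candidates`."""
    # Minimum dwell time: never switch again too soon after a switch.
    if windows_since_switch < min_dwell:
        return current
    costs = {
        c: window_cost(metrics_by_candidate[c], c, current, weights)
        for c in candidates
    }
    best = min(costs, key=costs.get)
    # Hysteresis: switch only if the gain clearly exceeds the margin.
    if costs[current] - costs[best] > hysteresis:
        return best
    return current
```

Placements are represented here as tuples with one node label per stage, for example `("robot", "edge", "edge", "robot")`, so the Hamming distance counts how many stages would have to migrate.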

2605.19881 2026-05-20 cs.RO Updated version

Trajectory Planning and Control near the Limits: an Open Experimental Benchmark on the RoboRacer Platform

Mattia Piccinini, Patrick Zambiasi, Aniello Mungiello, Mattia Piazza, Felix Jahncke, Johannes Betz

Affiliations * Professorship of Autonomous Vehicle Systems, Technical University of Munich; Munich Institute of Robotics and Machine Intelligence (MIRMI); Avilus GmbH; Department of Information Technology and Electrical Engineering (DIETI), University of Naples Federico II; Dept. of Industrial Engineering, University of Trento

AI summary This paper presents a modular framework for benchmarking new and existing trajectory planning and control methods in high-acceleration maneuvers. Tests on two circuits with the RoboRacer platform show that the MS-NN improves tracking accuracy and reduces steering oscillations, and that online velocity replanning improves lap times and the speeds that can be reached safely.

Comments Accepted - 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC)

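Details
AI Chinese abstract (translated)

We propose a modular framework for evaluating new and existing trajectory planning and control methods in high-acceleration maneuvers. The framework comprises time-optimal raceline generation, online time-optimal velocity replanning, geometric path-tracking controllers, and a new model-structured neural network (MS-NN) that learns the inverse dynamics for steering control. We deploy the framework on a 1:10-scale RoboRacer platform, using two circuits. Through several ablation studies, we examine the performance of individual modules and their combinations. We show that the MS-NN significantly improves tracking accuracy, reduces steering oscillations, and is physically interpretable. In addition, online velocity replanning improves lap times by compensating for execution errors and lets the vehicle safely reach higher speeds and accelerations. To support future research, our code, datasets, videos, and results are publicly available at https://roboracer-benchmark.github.io/planning_control_benchmark/.

English abstract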

We present a modular framework to benchmark new and existing methods for trajectory planning and control in high-acceleration maneuvers that push autonomous driving to the limits. Our framework includes time-optimal raceline generation, online time-optimal velocity replanning, geometric path tracking controllers, and a new model-structured neural network (MS-NN) to learn the inverse dynamics for steering control. We deploy our framework on a 1:10-scale RoboRacer platform, using two circuits. Through several ablations with cautious and aggressive racelines, we study the performance of single modules and their combinations. We show that our MS-NN significantly improves tracking accuracy, decreases steering oscillations, and is physically interpretable. Moreover, online velocity replanning improves lap times by compensating for execution errors, and enables the vehicle to safely reach higher speeds and accelerations. To support future research, our code, datasets, videos and results are publicly available at https://roboracer-benchmark.github.io/planning_control_benchmark/.

2605.19840 2026-05-20 cs.RO Updated version

Justifying bio-inspired robotics research: A taxonomy of strategies

Margaret J. Zhang, Justin Ting, Talia Y. Moore

Affiliations * Mechanical Engineering, University of Michigan, USA; Electrical and Computer Engineering, University of Michigan, USA; Robotics, Mechanical Engineering, Ecology and Evolutionary Biology, Museum of Zoology, University of Michigan, USA

AI summary This paper proposes a taxonomy of motivations for bio-inspired design, to help robotics researchers justify their specific bio-inspired approach and to help funding program managers assess the value of different bio-inspired approaches.

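Details
AI Chinese abstract (translated)

For most of human history, we have not thought systematically about why and how we incorporate aspects of the natural world into our designs. The lack of a systematic approach has led to inconsistencies in motivations and methods, which make it difficult to predict or evaluate the success of bio-inspired design. The mismatch between expectations and results can leave readers disappointed when they judge a bio-inspired design to be superficial, weak, or incomplete. This is especially evident in robotics, where similarity to a biological system may be the driving motivation for building a system. To help robotics researchers justify their specific bio-inspired approach, and to help funding program managers distinguish the value of different bio-inspired approaches, this paper proposes a taxonomy of motivations for bio-inspired design and describes the potentially significant contributions that different approaches may yield.

English abstract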

For most of human history, we have not thought systematically about how and why we incorporate aspects of the natural world into our designs. The lack of a systematic approach has resulted in inconsistencies in motivations and methods that make it difficult to predict or evaluate the success of bio-inspired design. This mismatch between expectations and results can lead to disappointment when a reader considers a bio-inspired design to be superficial, weak, or incomplete. This is especially true in the field of Robotics, in which similarity to a biological system might be the driving motivation for construction. In an effort to assist robotics researchers in justifying their specific bio-inspired approach and to assist funding program managers with discerning the value of different bio-inspired approaches, here we propose a taxonomy of motivations for bio-inspired design and describe the potential significant contributions that are likely to result from different approaches.

2605.19837 2026-05-20 cs.CV cs.AI cs.CL cs.RO Updated version

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

Sherif Khairy, Catherine M. Elias

Affiliations * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt

AI summary This paper presents CADENet, a training-free three-thread system that uses condition-adaptive enhancement and entropy-guided NMS fusion to detect objects in adverse weather for autonomous driving, without retraining or additional hardware.

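Details
AI Chinese abstract (translated)

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhance-then-detect approaches stall the safety-critical perception loop and violate hard real-time requirements. Progress on this problem is also limited by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot give credit to a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a nearly flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with no added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses the results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so a new weather category needs only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro) and F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the headline metric that is immune to the annotation gap. Thread S sustains approximately 44 FPS under enhancement load. No model retraining or additional sensor hardware is required.

English abstract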

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.
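
The annotation completeness bias mentioned in this abstract can be illustrated with a small worked example. The counts below are hypothetical and chosen only for illustration, and `prf` is a helper name introduced here rather than anything from the paper.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical scene: 100 real objects, of which annotators labeled only 80.
# The detector finds 60 labeled objects plus 20 real but unlabeled ones.
scored_on_labels = prf(tp=60, fp=20, fn=20)  # unlabeled hits are counted as false positives
scored_on_truth = prf(tp=80, fp=0, fn=20)    # the same detections against complete labels

print("F1 against incomplete labels:", round(scored_on_labels[2], 3))
print("F1 against complete labels:  ", round(scored_on_truth[2], 3))
```

In this toy case the F1 measured against incomplete labels is lower than the F1 against complete labels, because correct detections of unlabeled objects are charged to precision. Recall on the labeled objects is not reduced by those extra detections, which is consistent with the abstract's choice of recall as the headline metric.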

2605.19824 2026-05-20 cs.AI cs.CL cs.CV cs.RO Updated version

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

Affiliations * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; M.Eng. Robotics Candidate at Deggendorf Institute of Technology, Germany; IAV GmbH, Berlin, Germany

AI summary This study examines whether introducing temporal conditioning into inter-agent communication can preserve or enhance reasoning coherence without degrading semantic or logical consistency, and evaluates three planner architectures with progressively increasing temporal integration on curated subsets of the BDD-X dataset. The results show that temporal conditioning changes reasoning style but yields no statistically significant improvement on standard NLP correctness metrics, while qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence.

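Details
AI Chinese abstract (translated)

Recent attempts to support high-level scene interpretation and planning in autonomous vehicles (AVs) with ensembles of large language models (LLMs) and large multimodal models (LMMs) still treat time as a secondary property. This lack of temporal grounding causes inconsistencies in reasoning about continuous actions, which affects safety and interpretability. This paper explores whether temporal conditioning in inter-agent communication can preserve or enhance coherence without degrading semantic or logical consistency. To this end, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. The results show that although temporal conditioning changes reasoning style, it yields no statistically significant improvement on standard NLP-based correctness metrics. Qualitative analysis, however, reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

English abstract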

Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

2605.16692 2026-05-20 cs.LG cs.AI cs.RO Updated version

EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control

Thomas Evers, Cristian Meo, Wendelin Bohmer, Justin Dauwels, Yaniv Oren

Affiliations * TU Delft; LatentWorlds AI

AI summary This paper presents EfficientTDMPC, a model-based reinforcement learning method for continuous control that improves sample efficiency by reducing estimation error and increasing data freshness.

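Details
AI Chinese abstract (translated)

We introduce EfficientTDMPC, a sample-efficient model-based reinforcement learning method for continuous control that builds on the TD-MPC family of algorithms. At the core of this family is a planner that seeks the action sequence maximizing the estimated return. The return is estimated with a learned model and value networks, each of which can introduce error. EfficientTDMPC reduces this error in two ways. First, it introduces an ensemble of dynamics models and averages the return estimates across these models and across different rollout depths. Second, it adds the option of applying an uncertainty penalty to the planner objective, which yields a planner that avoids uncertain return estimates. It then adds practical improvements that increase the freshness of buffer data and reduce compute. Finally, we find that our contributions let EfficientTDMPC benefit more from a higher update-to-data (UTD) ratio, further improving sample efficiency. To the best of our knowledge, in the low-data regime of each benchmark, EfficientTDMPC achieves state-of-the-art sample efficiency on HumanoidBench-Hard and DMC hard, while matching the state of the art on DMC easy.

English abstract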

We introduce EfficientTDMPC, a sample-efficient model-based reinforcement learning method for continuous control built on the TD-MPC family of algorithms. Central to this family is a planner that aims to find an action sequence that maximizes the estimated return. The return is estimated using a learned model and value networks, each of which can introduce error. EfficientTDMPC proposes to reduce this error in two ways. First, it introduces an ensemble of dynamics models and averages the return estimates across those models and across different rollout depths. Second, it adds the option to apply an uncertainty penalty to the planner objective, yielding a planner that avoids actions with uncertain return estimates. It then adds practical improvements which increase buffer data freshness and reduce compute. Lastly, we find that our contributions enable EfficientTDMPC to benefit more from a higher update-to-data (UTD) ratio, further improving sample efficiency. To the best of our knowledge, in the low data regime of each benchmark, EfficientTDMPC achieves state-of-the-art (SOTA) in terms of sample efficiency on HumanoidBench-Hard and DMC hard, while matching SOTA on DMC easy.
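
The two error-reduction ideas in this abstract, averaging return estimates over an ensemble and over rollout depths and then penalizing uncertainty in the planner objective, can be sketched as below. This is a minimal sketch, not the paper's objective: using the standard deviation of the estimates as the uncertainty measure, the `penalty` weight, and the function names are assumptions made here for illustration.

```python
from statistics import mean, pstdev


def penalized_return(estimates, penalty=1.0):
    """Score one candidate action sequence.

    estimates[m][d] is the return estimate from dynamics model m at rollout depth d.
    The score is the average over models and depths minus a spread-based penalty.
    """
    flat = [r for per_model in estimates for r in per_model]
    return mean(flat) - penalty * pstdev(flat)


def pick_action_sequence(candidates, penalty=1.0):
    """Return the label of the candidate with the highest penalized score."""
    return max(candidates, key=lambda k: penalized_return(candidates[k], penalty))


# Hypothetical estimates: two models (rows) and two rollout depths (columns).
candidates = {
    "cautious": [[10.0, 10.0], [10.0, 10.0]],  # models agree
    "risky": [[20.0, 0.0], [22.0, 2.0]],       # higher mean, large disagreement
}
print(pick_action_sequence(candidates, penalty=1.0))
print(pick_action_sequence(candidates, penalty=0.0))
```

With the penalty switched off the planner prefers the candidate with the higher mean estimate; with the penalty on it prefers the candidate on which the estimates agree.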

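2605.16137 2026-05-20 cs.CV cs.RO Updated version

STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System

Zhen Luo, Yixuan Yang, Xudong Xu, Jinkun Hao, Zhaoyang Lyu, Feng Zheng, Jiangmiao Pang, Yanwei Fu

Affiliations * Shanghai AI Laboratory

AI summary This paper presents STABLE, a method that generates simulation-ready tabletop layouts with a semantics-physics dual system: a semantic reasoning module generates a coarse layout, and a physics correction module refines it to ensure physical plausibility, improving the physical validity of the scenes.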

Comments ICML 2026

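Details
AI Chinese abstract (translated)

Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in embodied AI. However, existing task-to-scene generation methods rely solely on large language models (LLMs) to predict scene layouts, which inevitably leads to colliding or floating objects because LLMs have inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system designed for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, an LLM fine-tuned on a structured tabletop scene dataset to generate coarse layouts from input task instructions; and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to correct the layout, ensuring the physical plausibility of the scene while preserving semantic alignment with the task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and the Physics Corrector, it gradually expands the scene from task-critical objects to background objects. Experiments show that STABLE generates simulation-ready tabletop scenes that strictly conform to the task instructions and significantly improves the physical validity of the scenes.

English abstract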

Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding colliding or floating objects due to LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, which ensures the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and the Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.

2604.07303 2026-05-20 cs.RO Updated version

Robots that learn to evaluate models of collective behavior

Mathis Hocke, Andreas Gerken, David Bierbach, Jens Krause, Tim Landgraf

Affiliations * Department of computer science, Freie Universität Berlin; SCIoI Excellence Cluster, Technische Universität Berlin; Faculty of Life Sciences, Humboldt-Universität zu Berlin; Department of Fish Biology, Fisheries, and Aquaculture, Leibniz Institute of Freshwater Ecology and Inland Fisheries

AI summary This paper presents a reinforcement-learning-based framework that uses a biomimetic robotic fish to evaluate computational models of live fish behavior. Closed-loop interaction quantifies how the behavior of real fish differs from that of simulated fish, showing how learning-based robotic experiments can uncover deficiencies in behavioral models.

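Details
AI Chinese abstract (translated)

Understanding and modeling animal behavior is essential for studying collective motion, decision-making, and bio-inspired robotics. Yet evaluating the accuracy of behavioral models still often relies on offline comparison of static trajectory statistics. Here we introduce a reinforcement-learning-based framework that uses a biomimetic robotic fish (RoboFish) to evaluate computational models of live fish behavior through closed-loop interaction. We train policies in simulation with four different fish models (a simple constant-follow baseline, two rule-based models, and a biologically grounded convolutional neural network model) and transfer these policies to the real RoboFish system, where they interact with live fish. The policies are trained to guide a simulated fish to goal locations, which lets us quantify how the response of real fish differs from that of the simulated fish. We evaluate the fish models by quantifying the sim-to-real gap, defined as the Wasserstein distance between the simulated and real distributions of behavioral metrics such as goal-reaching performance, inter-individual distance, wall interactions, and alignment. The neural-network-based fish model shows the smallest gap in goal-reaching performance and most other metrics, indicating higher behavioral fidelity than conventional rule-based models under this benchmark. More importantly, this separation shows that the proposed evaluation can quantitatively distinguish candidate models under matched closed-loop conditions. Our work demonstrates how learning-based robotic experiments can reveal deficiencies in behavioral models and provides a general framework for evaluating animal behavior models through embodied interaction.

English abstract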

Understanding and modeling animal behavior is essential for studying collective motion, decision-making, and bio-inspired robotics. Yet, evaluating the accuracy of behavioral models still often relies on offline comparisons to static trajectory statistics. Here we introduce a reinforcement-learning-based framework that uses a biomimetic robotic fish (RoboFish) to evaluate computational models of live fish behavior through closed-loop interaction. We trained policies in simulation using four distinct fish models (a simple constant-follow baseline, two rule-based models, and a biologically grounded convolutional neural network model) and transferred these policies to the real RoboFish setup, where they interacted with live fish. Policies were trained to guide a simulated fish to goal locations, enabling us to quantify how the response of real fish differs from the simulated fish's response. We evaluate the fish models by quantifying the sim-to-real gaps, defined as the Wasserstein distance between simulated and real distributions of behavioral metrics such as goal-reaching performance, inter-individual distances, wall interactions, and alignment. The neural network-based fish model exhibited the smallest gap across goal-reaching performance and most other metrics, indicating higher behavioral fidelity than conventional rule-based models under this benchmark. More importantly, this separation shows that the proposed evaluation can quantitatively distinguish candidate models under matched closed-loop conditions. Our work demonstrates how learning-based robotic experiments can uncover deficiencies in behavioral models and provides a general framework for evaluating animal behavior models through embodied interaction.
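
The sim-to-real gap in this abstract is defined as a Wasserstein distance between simulated and real distributions of a behavioral metric. For a one-dimensional metric, the first-order distance equals the area between the two empirical CDFs, which the sketch below computes. The sample values are hypothetical, and whether the paper uses this exact order or estimator is an assumption.

```python
def wasserstein_1d(xs, ys):
    """First-order Wasserstein distance between two 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance each pointer past all samples <= left to get the CDF value on [left, right).
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


# Hypothetical inter-individual distances (cm) from simulation and from live trials.
simulated = [4.0, 5.0, 6.0, 7.0]
real = [5.0, 6.0, 7.0, 8.0]
print(wasserstein_1d(simulated, real))
```

A smaller value means the simulated fish model reproduces the distribution of that metric more closely. SciPy's `scipy.stats.wasserstein_distance` computes the same 1-D quantity if a library routine is preferred.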

2603.18396 2026-05-20 cs.LG cs.RO Updated version

RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

Yifan Zhang, Liang Zheng

Affiliations * Central South University

AI summary This study proposes RE-SAC, which improves the stability and robustness of bus fleet control by disentangling aleatoric and epistemic risk, using Integral Probability Metric (IPM)-based weight regularization and a diversified Q-ensemble to handle the two types of uncertainty.

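Details
AI Chinese abstract (translated)

Bus holding control is challenging because of stochastic traffic and passenger demand. Although deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from unstable Q-values in volatile environments. A key source of this instability is the conflation of two different uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (insufficient data). Treating them as a single risk leads to value underestimation in noisy states and, in turn, to catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework that explicitly disentangles these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from mistaking noise for a data gap, a failure mode identified in our ablation study. In simulation experiments on a realistic bidirectional bus corridor, RE-SAC achieves a higher cumulative reward (about -0.4e6) than vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces the Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.

English abstract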

Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
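
As a quick consistency check on the figures quoted in this abstract, the relative reduction in mean absolute error follows directly from the two MAE values. The arithmetic below uses only the numbers stated above.

```latex
\[
\frac{\mathrm{MAE}_{\text{baseline}} - \mathrm{MAE}_{\text{RE-SAC}}}{\mathrm{MAE}_{\text{baseline}}}
  = \frac{4343 - 1647}{4343}
  = \frac{2696}{4343}
  \approx 0.621 ,
\]
```

which matches the reported reduction of up to 62%.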

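2603.09473 2026-05-20 cs.RO cond-mat.mtrl-sci Updated version

Receptogenesis in a Vascularized Robotic Embodiment

Kadri-Ann Pankratov, Leonid Zinatullin, Hans Priks, Adele Metsniit, Urmas Johanson, Tarmo Tamm, Alvo Aabloo, Edoardo Sinibaldi, Indrek Must

AI summary This paper presents an approach in which dynamic material restructuring lets the intrinsic functions of a robot body adapt and develop. It uses fluidics to restructure the material interface and generate sensors in response to environmental cues, demonstrating the feasibility of material-level functional restructuring through photopolymerization in a vascularized robotic composite.

Comments Supplementary Files currently unavailable online. Please contact the First Author to request any Supplementary Files. Version 2 - revision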

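Details
AI Chinese abstract (translated)

Equipping robotic systems with the ability to generate new hardware during operation extends control over physical adaptability. Unlike modular systems that rely on integrating discrete components before or after deployment, we envision that physical adaptation and development can arise from dynamic material restructuring that shapes the body's intrinsic functions. Inspired by the circulatory systems that redistribute mass and function in living organisms, we use fluidics to restructure the material interface, a capability that robotics does not yet have. Here, we realize this synthetic growth capability with a vascularized robotic composite designed for programmable material synthesis, and demonstrate it through receptogenesis, that is, the on-demand construction of sensors from internal fluid reserves in response to environmental cues. By coordinating the fluidic transport of precursors with external, localized UV irradiation, we drive an in situ photopolymerization that chemically reconstructs the vasculature. The reaction converts precursors containing a photolatent initiator into a solid dispersion of UV-sensitive polypyrrole in PETG, establishing a sensing modality that is validated by a characteristic decrease in electrical impedance. The newly synthesized sensor closes a local control loop that regulates wing flapping in a moth-inspired robotic demonstrator. This physical update increases the robot's capability in real time. Material-level functional restructuring of the vascularized robot body provides a proof-of-concept materials basis for generating new hardware in situated robotic systems, a step toward situated robots that autonomously produce hardware updates to match new environmental demands.

English abstract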

Equipping robotic systems with the capacity to generate $\textit{ex novo}$ hardware during operation extends control of physical adaptability. Unlike modular systems that rely on discrete component integration pre- or post-deployment, we envision the possibility that physical adaptation and development emerge from dynamic material restructuring to shape the body's intrinsic functions. Drawing inspiration from circulatory systems that redistribute mass and function in biological organisms, we utilize fluidics to restructure the material interface, a capability currently unpaired in robotics. Here, we realize this synthetic growth capability through a vascularized robotic composite designed for programmable material synthesis, demonstrated via receptogenesis - the on-demand construction of sensors from internal fluid reserves based on environmental cues. By coordinating the fluidic transport of precursors with external localized UV irradiation, we drive an $\textit{in situ}$ photopolymerization that chemically reconstructs the vasculature from the inside out. This reaction converts precursors with photolatent initiator into a solid dispersion of UV-sensitive polypyrrole in PETG, establishing a sensing modality validated by a characteristic decrease in electrical impedance. The newly synthesized sensor closed a local control loop to regulate wing flapping in a moth-inspired robotic demonstrator. This physical update increased the robot's capability in real time. Material-level functional restructuring of the vascularized robot body provides a proof-of-concept materials basis for $\textit{ex novo}$ hardware generation in situated robotic systems - a step toward situated robots in which a reaction to environmental stimuli autonomously produces hardware updates to match new environmental demands.

2601.20529 2026-05-20 cs.RO cs.MA Updated version

A Practical Framework of Key Performance Indicators for Multi-Robot Lunar and Planetary Field Tests

Julia Richter, David Oberacker, Gabriela Ligeza, Valentin T. Bickel, Philip Arm, William Talbot, Marvin Grosse Besselmann, Florian Kehl, Tristan Schnell, Hendrik Kolvenbach, Rüdiger Dillmann, Arne Roennau, Marco Hutter

Affiliations * Robotic Systems Lab (RSL), ETH Zürich; FZI Research Center for Information Technology; Machine Intelligence and Robotics Lab (MaiRo), Karlsruhe Institute for Technology (KIT); Department of Environmental Sciences, University of Basel; European Space Agency/ESTEC; Center for Space and Habitability, University of Bern; Space Instruments Group, University of Zürich; Space Science and Technology, ETH Zürich

AI summary This paper proposes a KPI framework derived from multi-robot lunar scenarios for evaluating and comparing the performance of different field tests. It emphasizes scenario-dependent priorities in efficiency, robustness, and precision, and its practical usefulness is validated in a field test.

Comments Presented at ICRA 2026 Workshop on Multi-Agent Robotic Systems: Real-World Collaboration and Interaction

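Details
AI Chinese abstract (translated)

Prospecting for critical resources on the Moon, such as ilmenite, rare earth elements, and water ice, requires robust exploration methods given the varied terrain and harsh environmental conditions. Although many analog field tests pursue these goals, comparing their results remains challenging because robot platforms and experimental setups differ. These missions typically assess performance with selected, scenario-specific engineering metrics, but they do not establish a clear link between field performance and science-driven objectives. In this paper, we fill this gap by deriving a structured KPI framework from three realistic multi-robot lunar scenarios that reflect scientific objectives and operational constraints. The framework emphasizes scenario-dependent priorities in efficiency, robustness, and precision, and is designed specifically for practical deployment. We validated the framework in a multi-robot field test and found it practical and easy to apply for efficiency- and robustness-related KPIs, whereas precision-oriented KPIs require reliable ground-truth data, which is not always feasible to obtain in outdoor analog environments. Overall, we propose this framework as a common evaluation standard that enables consistent, goal-oriented comparison of multi-robot field tests and supports the systematic development of robotic systems for future planetary exploration.

English abstract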

Robotic prospecting for critical resources on the Moon, such as ilmenite, rare earth elements, and water ice, requires robust exploration methods given the diverse terrain and harsh environmental conditions. Although numerous analog field trials address these goals, comparing their results remains challenging because of differences in robot platforms and experimental setups. These missions typically assess performance using selected, scenario-specific engineering metrics that fail to establish a clear link between field performance and science-driven objectives. In this paper, we address this gap by deriving a structured framework of key performance indicators (KPIs) from three realistic multi-robot lunar scenarios reflecting scientific objectives and operational constraints. Our framework emphasizes scenario-dependent priorities in efficiency, robustness, and precision, and is explicitly designed for practical applicability in field deployments. We validated the framework in a multi-robot field test and found it practical and easy to apply for efficiency- and robustness-related KPIs, whereas precision-oriented KPIs require reliable ground-truth data that is not always feasible to obtain in outdoor analog environments. Overall, we propose this framework as a common evaluation standard enabling consistent, goal-oriented comparison of multi-robot field trials and supporting systematic development of robotic systems for future planetary exploration.

2601.14848 2026-05-20 cs.LG cs.AI cs.NE cs.RO Updated version

From Observation to Prediction: LSTM for Vehicle Lane Change Forecasting on Highway On/Off-Ramps

Mohamed Abouras, Catherine M. Elias

Affiliations * C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems; Computer Science and Engineering Department - Faculty of Media Engineering and Technology - German University in Cairo

AI summary This paper studies how highway on- and off-ramp areas differ from straight highway sections. It trains a multi-layer LSTM model on the ExiD drone dataset and tests different prediction horizons and model workflows. For horizons of up to 4 seconds, prediction accuracy reaches about 76% in the ramp area and 94% in general highway scenarios.

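Details
AI Chinese abstract (translated)

On- and off-ramps are road sections that remain understudied even though they introduce a higher level of variation in highway interactions. Predicting vehicle behavior in these areas can reduce the impact of uncertainty and improve road safety. This paper studies the difference between this Area of Interest (AoI) and a straight highway section. A multi-layer LSTM architecture is trained on the ExiD drone dataset to obtain the AoI model. Different prediction horizons and different model workflows are tested in the process. The results show great promise for horizons of up to 4 seconds: at the maximum prediction horizon, prediction accuracy starts from about 76% for the ramp area and reaches 94% for general highway scenarios.

English abstract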

On- and off-ramps are understudied road sections even though they introduce a higher level of variation in highway interactions. Predicting vehicles' behavior in these areas can decrease the impact of uncertainty and increase road safety. In this paper, the difference between this Area of Interest (AoI) and a straight highway section is studied. A multi-layered LSTM architecture is used to train the AoI model on the ExiD drone dataset. In the process, different prediction horizons and different model workflows are tested. The results show great promise on horizons up to 4 seconds, with prediction accuracy starting from about 76% for the AoI and 94% for the general highway scenarios on the maximum horizon.

2601.12373 2026-05-20 cs.CV cs.HC cs.RO Updated version

CD-TWINSAFE: A ROS-enabled Digital Twin for Scene Understanding and Safety Emerging V2I Technology

Amro Khaled, Farah Khaled, Omar Riad, Catherine M. Elias

Affiliations * C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; Computer Science and Engineering Department - Faculty of Media Engineering and Technology; German University in Cairo, Egypt

AI summary This paper presents CD-TWINSAFE, a V2I-based digital twin system for scene understanding and safety monitoring in autonomous vehicles. Two stacks run simultaneously, a vehicle-side driving stack and a digital twin stack; the system uses a stereo camera and Unreal Engine 5 to replicate the scene, and a ROS architecture for V2I communication.

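Details
AI Chinese abstract (translated)

This paper introduces CD-TWINSAFE, a V2I-based digital twin system for autonomous vehicles. The proposed architecture consists of two stacks that run simultaneously: an on-board driving stack, which includes a stereo camera for scene understanding, and a digital twin stack, which runs an Unreal Engine 5 replica of the scene and returns safety alerts to the cockpit. The on-board stack is implemented on the vehicle side and comprises two main autonomy modules: localization and perception. The position and orientation of the vehicle are obtained from on-board sensors. The perception module processes 20-fps images from the stereo camera and understands the scene through two complementary pipelines, which cover object detection and feature extraction, including object velocity, yaw angle, and the safety metrics time-to-collision and time-headway. The collected data are sent to the infrastructure side through the ROS architecture as custom ROS2 messages, over UDP links on a 4G modem for V2I communication. The environment is monitored through the digital twin, where the shared messages update the information of the spawned ego vehicle and the detected objects based on real-time localization and perception data. Tests in different driving scenarios verify the validity and real-time response of the proposed architecture.

English abstract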

This paper introduces CD-TWINSAFE, a V2I-based digital twin for autonomous vehicles. The proposed architecture is composed of two stacks running simultaneously: an on-board driving stack that includes a stereo camera for scene understanding, and a digital twin stack that runs an Unreal Engine 5 replica of the scene viewed by the camera and returns safety alerts to the cockpit. The on-board stack is implemented on the vehicle side and includes two main autonomy modules: localization and perception. The position and orientation of the ego vehicle are obtained using on-board sensors. The perception module processes 20-fps images from the stereo camera and understands the scene through two complementary pipelines. The pipelines perform object detection and feature extraction, including object velocity, yaw, and the safety metrics time-to-collision and time-headway. The data collected from the driving stack are sent to the infrastructure side through the ROS-enabled architecture as custom ROS2 messages over UDP links that ride a 4G modem for V2I communication. The environment is monitored via the digital twin through the shared messages, which update the information of the spawned ego vehicle and detected objects based on the real-time localization and perception data. Several tests with different driving scenarios confirm the validity and real-time response of the proposed architecture.
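
The two safety metrics named in this abstract have common textbook definitions for a lead vehicle in the same lane, sketched below. The function names and the example values are hypothetical, and the paper may compute the metrics differently (for example, from stereo depth and tracked object velocity).

```python
import math


def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Gap divided by closing speed; infinite when the gap is not closing."""
    closing = ego_speed_mps - lead_speed_mps
    return gap_m / closing if closing > 0 else math.inf


def time_headway(gap_m, ego_speed_mps):
    """Time the ego vehicle needs to cover the current gap at its current speed."""
    return gap_m / ego_speed_mps if ego_speed_mps > 0 else math.inf


# Hypothetical situation: 30 m gap, ego at 20 m/s, lead vehicle at 15 m/s.
print(time_to_collision(30.0, 20.0, 15.0))
print(time_headway(30.0, 20.0))
```

Time-to-collision depends on the relative speed and so only becomes finite when the ego vehicle is closing in, whereas time-headway depends on the ego speed alone, which is why the two are usually reported together.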

2601.12367 2026-05-20 cs.HC cs.RO Updated version

User-to-Vehicle Interaction in Smart Mobility: The GO-DRiVeS Autonomous Ride-Sharing Application

Hana E. Elmalah, Catherine M. Elias

Affiliations * C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; Computer Science and Engineering Department - Faculty of Media Engineering and Technology; German University in Cairo, Egypt

AI summary This paper presents GO-DRiVeS, a ride-sharing application intended to spare university students and staff long walks in hot weather or when carrying heavy items. The application was developed with the Agile methodology and informed by an analysis and comparison of existing transportation application frameworks. It implements user registration, ride requests, and real-time tracking, and several experiments verify its stability and reliability.

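Details
AI Chinese abstract (translated)

This paper introduces the GO-DRiVeS application, an on-demand ride-sharing and ride-requesting mobile application designed to address long, time-consuming, and tiring walks, especially in hot weather or when carrying heavy items, which are a challenge for university students and staff. The application was developed following the Agile methodology to ensure flexibility. It uses a mobile application system architecture and a client-server architecture. GO-DRiVeS is implemented with React Native (Expo) for the frontend, Node.js and Express for the backend, and MongoDB as the database, based on a detailed analysis of existing transportation applications that compared their frameworks and identified their core functionality. GO-DRiVeS supports core features such as user registration, ride requests, and real-time tracking. It can also handle multiple simultaneous requests on a first-come, first-served basis. The application was developed around these features, and the results are presented as several experiments that demonstrate stable behavior when handling requests, as shown in the Methodology and Results chapters.

English abstract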

This paper introduces the GO-DRiVeS application, an on-demand ride-sharing and ride-requesting mobile application designed to spare university students and staff long walks, which are time-consuming and tiring, especially on hot days or when carrying heavy items. The GO-DRiVeS application was developed following the Agile methodology for its flexibility, using a mobile application system architecture and a client-server architecture. GO-DRiVeS was implemented using React Native (Expo) for the frontend, Node.js and Express for the backend, and MongoDB as the database, based on a detailed analysis of existing transportation applications that compared their frameworks and identified their essential functionalities. GO-DRiVeS supports core features such as user registration, ride requesting, and real-time tracking, and handles multiple simultaneous requests on a first-come, first-served basis. The application was developed around these features, and its evaluation consisted of multiple experiments that demonstrated stable behavior in handling requests, as presented in the Methodology and Results chapters.
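
The first-come, first-served handling of simultaneous ride requests mentioned in this abstract amounts to a FIFO queue. The sketch below is written in Python for brevity rather than in the application's Node.js backend, and the class name, method names, and request fields are hypothetical illustrations, not the application's actual API.

```python
from collections import deque


class RideQueue:
    """Pending ride requests served strictly in arrival order."""

    def __init__(self):
        self._pending = deque()

    def request(self, user_id, pickup, dropoff):
        """Enqueue a request and return its 1-based position in the queue."""
        self._pending.append({"user": user_id, "pickup": pickup, "dropoff": dropoff})
        return len(self._pending)

    def dispatch_next(self):
        """Hand the oldest pending request to the vehicle, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = RideQueue()
queue.request("student_1", "Gate 1", "Library")
queue.request("staff_7", "Parking B", "Admin Building")
first = queue.dispatch_next()
print(first["user"])
```

A `deque` gives constant-time appends and pops at both ends, so the queue stays cheap even when many requests arrive at once.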

2601.12358 2026-05-20 cs.CV cs.AI cs.RO Updated version

From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles

Omar Y. Goba, Ahmed Y. Gado, Catherine M. Elias, Ahmed Hussein

Affiliations * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; IAV GmbH, Berlin, Germany

AI summary This paper presents an agentic behavior-tree generation framework based on large language models and multimodal vision models that lets autonomous vehicles navigate adaptively in complex environments. The framework assesses scene criticality with chain-of-symbols prompting, builds high-level sub-goals through in-context learning, and synthesizes executable BT sub-trees with a generator; in a CARLA+Nav2 simulation it successfully navigates around unexpected obstacles such as a street blockage.

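Details
AI Chinese abstract (translated)

Autonomous vehicles (AVs) need adaptive behavior planners to navigate unpredictable real-world environments safely. Traditional behavior trees (BTs) provide structured decision logic but are inherently static and require extensive manual tuning, which limits their use for SAE Level 5 autonomy. This paper presents an agentic framework that uses large language models (LLMs) and multimodal vision models (LVMs) to generate and adapt BTs in real time. A dedicated Descriptor agent uses chain-of-symbols prompting to assess scene criticality, a Planner agent builds high-level sub-goals through in-context learning, and a Generator agent synthesizes executable BT sub-trees. The system is integrated into a CARLA+Nav2 simulation and is triggered only when the baseline BT fails; it successfully navigates around unexpected obstacles (for example, a street blockage) without human intervention. Compared with a static BT baseline, the approach is a proof of concept that can extend to diverse driving scenarios.

English abstract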

Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.

2512.20931 2026-05-20 cs.RO 版本更新

Certifiable Alignment of GNSS and Local Frames via Lagrangian Duality

通过拉格朗日对偶实现GNSS与局部框架的可验证对齐

Baoshan Song, Matthew Giamou, Penggao Yan, Chunxi Xia, Li-Ta Hsu

Affiliations: Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, China; Department of Computing and Software, McMaster University, Canada; School of Geodesy and Geomatics, Wuhan University, China

AI summary: This paper proposes a globally optimal solver that converts raw pseudo-range or Doppler measurements into a convex relaxation, enabling certifiable alignment of GNSS and local frames and overcoming the limitations of conventional methods in GNSS-degraded environments.

Comments Final version in RA-L

详情
AI中文摘要

估计局部系统相对于全球导航卫星系统(GNSS)参考的绝对对齐常常遭遇局部极小值和对卫星可用性高度依赖的问题。现有方法对于此对齐任务依赖于大量卫星,无法在GNSS退化环境中使用,或使用局部优化方法无法保证解的最优性。本文介绍了一种全局最优求解器,将原始伪距或多普勒测量转换为凸松弛问题。所提出的方法是可验证的,意味着可以数值验证结果的正确性,填补了现有局部优化器无法保证最优性的空白。我们首先将原始框架对齐问题公式化为一个非凸二次约束二次规划(QCQP)问题,并将QCQP问题松弛为一个凹的拉格朗日对偶问题,为原问题提供一个下界成本。然后我们进行松弛紧密性和可观测性分析,推导出解的可验证最优性的标准。最后进行仿真和实际世界实验来评估所提出的方法。实验表明,即使只有2颗卫星的多普勒测量和2D车辆运动,我们的方法也能提供可验证的最优解,而传统基于速度的VOBA方法和先进的GVINS对齐技术可能会失败或收敛到局部极小值。为了支持机器人中的GNSS导航技术发展,所有代码和数据均在https://github.com/Baoshan-Song/Certifiable-Doppler-alignment上开源。

英文摘要

Estimating the absolute orientation of a local system relative to a global navigation satellite system (GNSS) reference often suffers from local minima and high dependency on satellite availability. Existing methods for this alignment task rely on abundant satellites unavailable in GNSS-degraded environments, or use local optimization methods which cannot guarantee the optimality of a solution. This work introduces a globally optimal solver that transforms raw pseudo-range or Doppler measurements into a convexly relaxed problem. The proposed method is certifiable, meaning it can numerically verify the correctness of the result, filling a gap where existing local optimizers fail. We first formulate the original frame alignment problem as a nonconvex quadratically constrained quadratic program (QCQP) problem and relax the QCQP problem to a concave Lagrangian dual problem that provides a lower cost bound for the original problem. Then we perform relaxation tightness and observability analysis to derive criteria for certifiable optimality of the solution. Finally, simulation and real world experiments are conducted to evaluate the proposed method. The experiments show that our method provides certifiably optimal solutions even with only 2 satellites with Doppler measurements and 2D vehicle motion, while the traditional velocity-based VOBA method and the advanced GVINS alignment technique may fail or converge to local optima without notice. To support the development of GNSS-based navigation techniques in robotics, all code and data are open-sourced at https://github.com/Baoshan-Song/Certifiable-Doppler-alignment.
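
The relaxation step described in the abstract can be illustrated with the textbook Lagrangian dual of an equality-constrained QCQP. This is a generic sketch, not the paper's specific formulation: the matrices $Q_0, Q_i$, the constants $b_i$, and the variable $x$ are placeholder symbols assumed here for illustration.

```latex
% Generic nonconvex QCQP (placeholder data Q_0, Q_i, b_i)
p^\star = \min_{x \in \mathbb{R}^n} \; x^\top Q_0 x
\quad \text{s.t.} \quad x^\top Q_i x = b_i, \quad i = 1, \dots, m

% Lagrangian with multipliers \lambda \in \mathbb{R}^m
\mathcal{L}(x, \lambda) = x^\top H(\lambda)\, x - \sum_{i=1}^{m} \lambda_i b_i,
\qquad H(\lambda) = Q_0 + \sum_{i=1}^{m} \lambda_i Q_i

% Dual problem: concave maximization (a semidefinite program)
d^\star = \max_{\lambda} \; -\sum_{i=1}^{m} \lambda_i b_i
\quad \text{s.t.} \quad H(\lambda) \succeq 0,
\qquad d^\star \le p^\star

% Certificate: a feasible \hat{x} is globally optimal if some \hat{\lambda} satisfies
H(\hat{\lambda}) \succeq 0, \qquad H(\hat{\lambda})\, \hat{x} = 0
```

Under these two conditions the primal cost of $\hat{x}$ equals the dual value at $\hat{\lambda}$, so the duality gap is zero. This is the sense in which a candidate solution can be verified numerically.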

2512.00667 2026-05-20 eess.SY cs.RO cs.SY 版本更新

Active Learning of Fractional-Order Viscoelastic Model Parameters for Realistic Haptic Rendering

分数阶黏弹性模型参数的主动学习用于真实触觉渲染

Harun Tolasa, Gorkem Gemalmaz, Volkan Patoglu

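Affiliations: Faculty of Engineering and Natural Sciences

AI summary: This paper proposes a systematic method that uses active learning to optimize the parameters of fractional-order viscoelastic models and improve the perceived realism of haptic rendering; it combines human-in-the-loop optimization with an aggregate perceptual map to select parameters that are broadly perceived as realistic across the general population.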

Comments This work has been submitted to the IEEE Transactions on Haptics for possible publication. 14 pages, 8 figures

详情
AI中文摘要

有效的医疗模拟器需要真实地渲染具有黏弹性材料特性(如蠕变和应力松弛)的生物组织。分数阶模型提供了一种有效描述本质上时间依赖的黏弹性动力学的方法,仅需少量参数,因为它们自然地捕捉记忆效应。然而,由于分数元素的阶数与其他参数之间的非直观、频率依赖的耦合,确定产生高感知真实感的分数阶模型参数值仍是一个重大挑战。在本研究中,我们提出了一种系统的方法,通过主动学习优化分数阶黏弹性模型的参数,以优化触觉渲染在一般人群中的感知真实感。首先,我们证明通过基于定性反馈的人类在回路(HiL)优化可以有效优化分数阶模型的参数,以确保对每个人都能保持一致的高真实感评分。其次,我们提出了一种严格的方法,将HiL优化结果结合到一个在完整数据集上训练的聚合感知地图中,并展示如何从这种表示中选择群体层面的最佳参数,这些参数在一般人群中被广泛认为是真实的。最后,我们通过人类受试者实验验证了广义分数阶黏弹性模型参数在三种黏弹性材料中的有效性。总体而言,通过所提出的HiL优化和聚合方法建立的广义分数阶黏弹性模型有潜力显著提高医疗训练模拟器的sim-to-real过渡性能。

英文摘要

Effective medical simulators necessitate realistic haptic rendering of biological tissues that exhibit viscoelastic material properties, such as creep and stress relaxation. Fractional-order models provide an effective means of describing intrinsically time-dependent viscoelastic dynamics with few parameters, as they naturally capture memory effects. However, due to the unintuitive, frequency-dependent coupling among the order of the fractional element and other parameters, determining appropriate parameter values for fractional-order models that yield high perceived realism remains a significant challenge. In this study, we propose a systematic means of determining the parameters of fractional-order viscoelastic models that optimizes the perceived realism of haptic rendering across general populations. First, we demonstrate that the parameters of fractional-order models can be effectively optimized through active learning, using qualitative feedback-based human-in-the-loop (HiL) optimization, to ensure consistently high realism ratings for each individual. Second, we propose a rigorous method to combine HiL optimization results into an aggregate perceptual map trained on the entire dataset, and demonstrate how to select population-level optimal parameters from this representation that are broadly perceived as realistic across general populations. Finally, we provide evidence of the effectiveness of the generalized fractional-order viscoelastic model parameters for three viscoelastic materials by characterizing their perceived realism through human-subject experiments. Overall, generalized fractional-order viscoelastic models established through the proposed HiL optimization and aggregation approach possess the potential to significantly improve the sim-to-real transition performance of medical training simulators.
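
For readers unfamiliar with fractional-order elements, the standard building block is the "springpot". The abstract does not state which fractional-order model the authors use, so the symbols below ($c_\alpha$, $\alpha$, and the choice of the Caputo derivative) are illustrative assumptions rather than the paper's model.

```latex
% Springpot: stress is a fractional derivative of strain
\sigma(t) = c_\alpha \, D^{\alpha} \varepsilon(t), \qquad 0 \le \alpha \le 1

% Caputo fractional derivative for 0 < \alpha < 1
D^{\alpha} \varepsilon(t) = \frac{1}{\Gamma(1 - \alpha)}
\int_{0}^{t} \frac{\dot{\varepsilon}(\tau)}{(t - \tau)^{\alpha}} \, d\tau

% Limiting cases
\alpha = 0: \; \sigma = c_0 \, \varepsilon \;\; \text{(elastic spring)}, \qquad
\alpha = 1: \; \sigma = c_1 \, \dot{\varepsilon} \;\; \text{(viscous dashpot)}
```

The integral weights the whole strain-rate history, which is how a single order $\alpha$ captures memory effects. It also shows why $\alpha$ and $c_\alpha$ are coupled in how they shape the response.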

2511.18236 2026-05-20 cs.RO cs.SY eess.SY 版本更新

APULSE: A Scalable Hybrid Algorithm for the RCSPP on Large-Scale Dense Graphs

APULSE:一种用于大规模密集图上RCSPP的可扩展混合算法

Nuno Soares, António Grilo

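Affiliations: Academia Militar, Lisboa; INESC INOV, Instituto Superior Técnico (IST), Universidade de Lisboa

AI summary: This paper proposes APULSE, which combines A*-guided heuristic search, Pulse-style pruning, and a time-bucketing strategy to efficiently solve the resource-constrained shortest path problem on large-scale dense graphs, showing marked scalability and robustness.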

Comments This version corrects keywords and reference [9]. 9 pages

详情
Journal ref
in IEEE Access, vol. 14, pp. 40690-40706, 2026
AI中文摘要

资源受限最短路径问题(RCSPP)是一个基础的NP难优化挑战,广泛应用于网络路由和自主导航等领域。该问题涉及在受预算限制的二次资源下寻找最小主成本路径。尽管存在各种RCSPP求解器,但它们在应用于复杂现实场景中常见的大型密集图时往往面临严重的可扩展性限制,使其在时间敏感的规划中不切实际。在无人地面车辆(UGVs)的任务规划等领域,这种挑战尤为突出。本文介绍APULSE,一种混合标签设置算法,旨在高效解决此类挑战性图中的RCSPP。APULSE结合了由A*启发式引导的最佳优先搜索、激进的Pulse式剪枝机制以及时间桶策略,以有效减少状态空间。通过使用大规模UGV规划场景的计算研究,APULSE与最先进的算法进行了基准测试。结果表明,APULSE在大型问题实例上能够以数量级更快的速度和更高的鲁棒性找到近最优解,特别是在竞争方法失败的情况下。这种优越的可扩展性使APULSE成为复杂大规模环境中的RCSPP有效解决方案,使其能够实现交互式决策支持和动态重新规划能力。

英文摘要

The resource-constrained shortest path problem (RCSPP) is a fundamental NP-hard optimization challenge with broad applications, from network routing to autonomous navigation. This problem involves finding a path that minimizes a primary cost subject to a budget on a secondary resource. While various RCSPP solvers exist, they often face critical scalability limitations when applied to the large, dense graphs characteristic of complex, real-world scenarios, making them impractical for time-critical planning. This challenge is particularly acute in domains like mission planning for unmanned ground vehicles (UGVs), which demand solutions on large-scale terrain graphs. This paper introduces APULSE, a hybrid label-setting algorithm designed to efficiently solve the RCSPP on such challenging graphs. APULSE integrates a best-first search guided by an A* heuristic with aggressive, Pulse-style pruning mechanisms and a time-bucketing strategy for effective state-space reduction. A computational study, using a large-scale UGV planning scenario, benchmarks APULSE against state-of-the-art algorithms. The results demonstrate that APULSE consistently finds near-optimal solutions while being orders of magnitude faster and more robust, particularly on large problem instances where competing methods fail. This superior scalability establishes APULSE as an effective solution for RCSPP in complex, large-scale environments, enabling capabilities such as interactive decision support and dynamic replanning.
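
To make the problem setting concrete, the following is a minimal sketch of a best-first label search for the RCSPP with bound-based pruning. It is not the APULSE algorithm: it omits the time-bucketing strategy and other details from the paper. The graph encoding (`node -> [(next, cost, resource), ...]`) and all function names are assumptions made for illustration.

```python
import heapq


def reverse_dijkstra(graph, target, idx):
    """Minimum remaining weight to `target` per node (idx 0 = cost, 1 = resource)."""
    rev = {}
    for u, edges in graph.items():
        for v, cost, res in edges:
            rev.setdefault(v, []).append((u, (cost, res)[idx]))
    dist = {target: 0.0}
    heap = [(0.0, target)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, w in rev.get(v, []):
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist


def rcspp(graph, source, target, budget):
    """Return (cost, resource, path) of a min-cost path within budget, or None."""
    min_cost = reverse_dijkstra(graph, target, 0)  # admissible A*-style heuristic
    min_res = reverse_dijkstra(graph, target, 1)   # used for infeasibility pruning
    if source not in min_res or min_res[source] > budget:
        return None
    # Heap entries: (cost + heuristic, cost, resource, node, path)
    heap = [(min_cost[source], 0.0, 0.0, source, (source,))]
    labels = {}  # node -> list of expanded (cost, resource) labels
    while heap:
        _, cost, res, node, path = heapq.heappop(heap)
        if node == target:
            return cost, res, list(path)
        # Dominance pruning: skip if an expanded label is no worse in both criteria.
        if any(c <= cost and r <= res for c, r in labels.get(node, [])):
            continue
        labels.setdefault(node, []).append((cost, res))
        for nxt, ecost, eres in graph.get(node, []):
            ncost, nres = cost + ecost, res + eres
            # Infeasibility pruning: even the leanest completion exceeds the budget.
            if nxt not in min_res or nres + min_res[nxt] > budget:
                continue
            heapq.heappush(
                heap, (ncost + min_cost[nxt], ncost, nres, nxt, path + (nxt,))
            )
    return None


# Toy graph: the cheap route via "a" uses 10 resource units, the route via "b" uses 2.
toy = {
    "s": [("a", 1, 5), ("b", 2, 1)],
    "a": [("t", 1, 5)],
    "b": [("t", 2, 1)],
    "t": [],
}
```

With a budget of 4 the search must take the more expensive route through `b`. With a budget of 10 it returns the cheaper route through `a`.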

2408.06843 2026-05-20 cs.RO 版本更新

Learn2Decompose: Learning Problem Decomposition for Efficient Sequential Multi-object Manipulation Planning

Learn2Decompose: 为高效连续多物体操作规划学习问题分解

Yan Zhang, Teng Xue, Amirreza Razmjoo, Sylvain Calinon

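Affiliations: Idiap Research Institute; Ecole Polytechnique Fédérale de Lausanne

AI summary: This paper proposes an efficient task and motion replanning method for sequential multi-object manipulation in dynamic environments. It accelerates TAMP solvers by learning problem decompositions from demonstrations; its core components are goal decomposition learning, computational distance learning, and object reduction, which together improve replanning efficiency.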

Comments Extension of RAL version: added PR2 Whole-body kitchen task and detailed discussion on limitations in main text; added pseudocode and robustness analysis of our approach, and formal analysis on why and when task goals are decomposable in appendix

详情
AI中文摘要

我们提出了一种高效的任务和运动重计划方法,用于动态环境中连续多物体操作的规划。传统任务与运动规划(TAMP)求解器在规划时间上随着规划时间跨度和物体数量的增长而呈指数级增加,限制了其在现实场景中的应用。为了解决这一问题,我们提出通过示范学习问题分解来加速TAMP求解器。我们的方法包含三个关键组成部分:目标分解学习、计算距离学习和物体减少。目标分解识别系统在达到最终目标之前必须经过的必要状态序列,将其视为子目标序列。计算距离学习预测两个状态之间的计算复杂性,使系统能够从扰动状态中识别出时间上最近的子目标。物体减少最小化重计划过程中考虑的活跃物体集合,进一步提高效率。我们在三个基准上评估了我们的方法,证明了其在动态环境中提升连续多物体操作任务重计划效率的有效性。

英文摘要

We present an efficient task and motion replanning approach for sequential multi-object manipulation in dynamic environments. Conventional Task And Motion Planning (TAMP) solvers experience an exponential increase in planning time as the planning horizon and number of objects grow, limiting their applicability in real-world scenarios. To address this, we propose learning problem decompositions from demonstrations to accelerate TAMP solvers. Our approach consists of three key components: goal decomposition learning, computational distance learning, and object reduction. Goal decomposition identifies the necessary sequences of states that the system must pass through before reaching the final goal, treating them as subgoal sequences. Computational distance learning predicts the computational complexity between two states, enabling the system to identify the temporally closest subgoal from a disturbed state. Object reduction minimizes the set of active objects considered during replanning, further improving efficiency. We evaluate our approach on three benchmarks, demonstrating its effectiveness in improving replanning efficiency for sequential multi-object manipulation tasks in dynamic environments.

2209.12133 2026-05-20 cs.RO cs.SY eess.SY 版本更新

Development of a Deep Learning-Driven Control Framework for Exoskeleton Robots

一种基于深度学习的外骨骼机器人控制框架开发

Sk Hasan

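AI summary: This paper proposes a computationally efficient deep-learning control framework for high-degree-of-freedom exoskeleton robots that addresses the real-time computational limits of conventional model-based control. A parallel-structured deep neural network trained on physics-based data predicts the joint torques for trajectory tracking, a proportional-derivative controller compensates for prediction errors, and the scheme's stability and robustness are demonstrated.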

详情
Journal ref
Actuators 15, 274 (2026)
AI中文摘要

本研究旨在开发一种计算高效的基于深度学习的控制框架,用于高自由度外骨骼机器人,以解决传统模型控制在实时计算中的限制。为七自由度人下肢外骨骼机器人设计了一个并行结构的深度神经网络,该网络由四层组成,包含49个密集连接的神经元,并使用基于分析动力学模型的物理数据进行训练。在实时实现过程中,训练好的神经网络预测轨迹跟踪所需的关节扭矩命令,而比例导数控制器补偿残余预测误差。通过分析验证了所提控制方案的稳定性,并利用方差分析评估了参数变化的鲁棒性。在相同机器人动力学条件下,与计算扭矩、模型参考计算扭矩、滑模、自适应和线性二次控制器进行了对比仿真。结果表明,该方法在轨迹跟踪精度和扭矩特性上与传统非线性控制器相当,同时减少了计算负担。这些发现表明,所提出的基于深度学习的混合控制器为多自由度外骨骼机器人的控制提供了一种高效且稳健的替代方案。

英文摘要

The purpose of this study is to develop a computationally efficient deep-learning-based control framework for high-degree-of-freedom exoskeleton robots to address the real-time computational limitations associated with conventional model-based control. A parallel-structured deep neural network was designed for a seven-degree-of-freedom human lower-extremity exoskeleton robot. The network consists of four layers with 49 densely connected neurons and was trained using physics-based data generated from the analytical dynamic model. During real-time implementation, the trained neural network predicts joint torque commands required for trajectory tracking, while a proportional-derivative controller compensates for residual prediction errors. Stability of the proposed control scheme was analytically established, and robustness to parameter variations was evaluated using analysis of variance. Comparative simulations were conducted against computed torque, model reference computed torque, sliding mode, adaptive, and linear quadratic controllers under identical robot dynamics. Results demonstrate accurate trajectory tracking with torque profiles comparable to conventional nonlinear controllers while reducing computational burden. These findings suggest that the proposed deep-learning-based hybrid controller offers an efficient and robust alternative for controlling multi-degree-of-freedom exoskeleton robots.
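
The hybrid structure described above (a learned feedforward torque plus a proportional-derivative correction) can be sketched as follows. The network inputs, the gains, and the function names are assumptions for illustration; the abstract does not specify them, and the stub predictor stands in for the trained network.

```python
def hybrid_torque(predict_torque, q_des, qd_des, q, qd, kp, kd):
    """Per-joint command: learned feedforward torque plus PD feedback.

    predict_torque is a stand-in for the trained network (hypothetical interface:
    desired positions and velocities in, one torque per joint out).
    """
    tau_ff = predict_torque(q_des, qd_des)
    return [
        ff + kp_i * (qr - qa) + kd_i * (qdr - qda)
        for ff, kp_i, kd_i, qr, qa, qdr, qda in zip(
            tau_ff, kp, kd, q_des, q, qd_des, qd
        )
    ]


def stub_predictor(q_des, qd_des):
    """Placeholder for the neural network: returns fixed torques for two joints."""
    return [1.0, 2.0]


tau = hybrid_torque(
    stub_predictor,
    q_des=[1.0, 0.0], qd_des=[0.0, 0.0],   # reference trajectory sample
    q=[0.5, 0.5], qd=[0.0, 0.0],           # measured joint state
    kp=[10.0, 10.0], kd=[1.0, 1.0],        # illustrative PD gains
)
```

The PD term only has to correct the residual between the predicted and the required torque.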

2605.19703 2026-05-20 cs.RO 版本更新

KIO-planner: Attention-Guided Single-Stage Motion Planning with Dual Mapping for UAV Navigation

KIO-planner: 基于双映射的注意力引导单阶段运动规划用于无人机导航

Dexing Yao, Haochen Li, Junhao Wei, Yifu Zhao, Yanxiao Li, Jiahui Xu, Jinxuan Hu, Lele Tian, Baili Lu, Zikun Li, Xu Yang, Sio-Kei Im, Dingcheng Yang, Yapeng Wang

Affiliations: Faculty of Applied Sciences; Macao Polytechnic University; College of Animal Science and Technology; Zhongkai University of Agriculture and Engineering; School of Economics and Management; South China Normal University; Information Engineering School

AI summary: This paper proposes KIO-planner, an attention-guided single-stage trajectory planning framework that integrates a CBAM module and a Dual Mapping mechanism to achieve low-latency, reliable motion planning in obstacle-dense environments, improving navigation agility and safety.

Comments Accepted by an IEEE Vehicular Technology Conference. 6 pages, 4 figures, 1 table

详情
AI中文摘要

在受限、墙壁密集的环境中实现自主无人机飞行需要在严格安全约束下具有低延迟和可靠性的运动规划。传统基于优化的规划器在导航密集结构障碍时面临映射延迟和容易陷入局部极小值的问题。同时,现有的端到端学习方法难以从原始深度图像中提取细粒度的几何特征,并缺乏硬的运动动力学约束,导致靠近墙壁时出现不可预测的碰撞。为了解决这些问题,我们提出了KIO-planner,一种注意力引导的单阶段轨迹规划框架。首先,我们将卷积块注意力模块(CBAM)整合到感知骨干中,以自适应地聚焦于关键结构边缘和可通行空间。其次,我们引入了一种新的双映射机制——包括物理界限激活和确定性的几何安全护盾——以在深度像素空间中强制运动动力学可行性并实现无碰撞飞行,而无需全局地图融合。广泛的高保真模拟实验表明,KIO-planner能够在高达3.0 m/s的速度下实现高度敏捷的导航。与最先进的基线相比,KIO-planner实现了更低的推理延迟(约24 ms)并生成了显著更平滑的轨迹,减少了28.4%的控制成本。最值得注意的是,我们的双映射显著增加了最坏情况的安全裕度,通过最小距离到障碍物的测量,从0.48米增加到0.76米,确保了在高度受限环境中快速、平滑和安全的导航。

英文摘要

Autonomous UAV flight in confined, wall-dense environments requires low-latency and reliable motion planning under strict safety constraints. Traditional optimization-based planners suffer from mapping latency and easily fall into local minima when navigating through dense structural obstacles. Meanwhile, existing end-to-end learning methods struggle to extract fine-grained geometric features from raw depth images and lack hard kinodynamic constraints, leading to unpredictable collisions near walls. To address these issues, we propose KIO-planner, an attention-guided single-stage trajectory planning framework. First, we integrate a Convolutional Block Attention Module (CBAM) into the perception backbone to adaptively focus on critical structural edges and traversable space. Second, we introduce a novel Dual Mapping mechanism--comprising physical bounds activation and a deterministic Geometric Safety Shield in the depth-pixel space--to enforce kinodynamic feasibility and collision-free flight without global map fusion. Extensive high-fidelity simulated experiments demonstrate that KIO-planner enables highly agile navigation at speeds up to 3.0 m/s. Compared to the state-of-the-art baseline, KIO-planner achieves lower inference latency (approximately 24 ms) and generates significantly smoother trajectories, reducing control cost by 28.4%. Most notably, our Dual Mapping substantially increases the worst-case safety margin, measured by minimum distance to obstacles, from 0.48 m to 0.76 m, ensuring fast, smooth, and safer navigation in highly constrained environments.

2605.19701 2026-05-20 cs.RO 版本更新

Multi-Session Ground Texture SLAM in Low-Dynamic Environments

多会话低动态环境下的地面纹理SLAM

Kyle M. Hart, Brendan Englot

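Affiliations: Naval Air Warfare Center, Aircraft Division; Department of Mechanical Engineering, Stevens Institute of Technology

AI summary: This paper studies trajectory estimation accuracy for multi-session ground texture SLAM in low-dynamic environments. It examines three techniques, finds that Kullback-Leibler divergence works best as both a similarity score and a bias on loop closure confidence, and introduces a dataset with multi-session images and high-accuracy pose information.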

Comments 8 pages, 9 figures. To appear at the 23rd International Conference on Ubiquitous Robots, Osaka, Japan. Distribution Statement A: Approved for public release; distribution is unlimited, as submitted under NAVAIR Public Release Authorization 2025-0098

详情
AI中文摘要

同时定位与建图社区已经引入了大量适用于多会话操作的系统,这些系统适应于具有低动态变化特征的环境,如地面磨损、天气现象或季节变化,这些变化会影响建图。这些系统允许机器人在这些环境中进行终身操作。同时,对于那些唯一可用的地面纹理作为建图特征的环境,也存在越来越多的兴趣。然而,这些地面纹理系统尚未针对多会话低动态变化环境进行优化。本文探讨了三种不同技术对这些多会话低动态地面纹理环境轨迹估计精度的影响。其中,使用Kullback-Leibler散度作为相似度评分和偏置影响闭环置信度的方法效果最佳。我们分析了所有三种方法,并深入探讨了Kullback-Leibler散度的影响。我们还介绍了一个供机器人社区使用的数据集,其中包含多会话图像,地面在不同会话中发生变化,并包含高精度姿态信息用于评估。

英文摘要

The simultaneous localization and mapping community has introduced a growing number of systems adapted for multi-session operations where the operational environment features low-dynamic changes that impact mapping, such as surface wear, weather phenomena, or seasonal change. These systems allow for lifelong operations by a robot within these environments. There is also growing interest in operations in environments where the unique ground texture is the only mapping feature available for use. These ground texture systems are not yet targeted for multi-session low-dynamic-change environments though. This work explores the impact of three different techniques on trajectory estimation accuracy in these multi-session low-dynamic ground texture environments. Of the three, the use of Kullback-Leibler Divergence, as a similarity score and a bias influencing loop closure confidence, is found to have the most success. We show an analysis of all three methods and a deeper exploration of the impact of Kullback-Leibler Divergence. We also introduce a dataset for use by the robotics community that contains multi-session images where the ground changes between sessions and also high-accuracy pose information for use in evaluation.
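
As a reference for the similarity score mentioned above, here is a minimal sketch of the Kullback-Leibler divergence between two feature histograms. What the paper computes the divergence over, and how it biases loop closure confidence, is not stated in the abstract; the histogram inputs and the `loop_closure_weight` mapping are hypothetical illustrations.

```python
import math


def normalize(hist, eps=1e-9):
    """Turn non-negative counts into a probability distribution, smoothing zeros."""
    smoothed = [h + eps for h in hist]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p_hist, q_hist):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats."""
    if len(p_hist) != len(q_hist):
        raise ValueError("histograms must have the same number of bins")
    p = normalize(p_hist)
    q = normalize(q_hist)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def loop_closure_weight(kl, scale=1.0):
    """Hypothetical mapping from a divergence to a confidence weight in (0, 1]."""
    return math.exp(-scale * kl)
```

The divergence is zero for identical distributions and grows as they differ. It is also asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally disagree.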

2605.19690 2026-05-20 cs.RO 版本更新

D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models

D-CLING: 保留先验知识的深度条件细调方法用于导航基础模型

Shintaro Nakaoka, Takayuki Kanai, Kazuhito Tanaka

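Affiliations: Frontier Research Center, Toyota Motor Corporation

AI summary: This paper proposes a new fine-tuning method that leverages large-scale pre-training while efficiently learning new setups such as environments or camera configurations, improving the robustness and accuracy of navigation models while preserving pre-trained knowledge.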

Comments This paper has been accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), which will be held in Vienna, Austria, from June 1 to 5, 2026

详情
AI中文摘要

导航基础模型(NFMs)在大规模跨身体数据集上训练后,已在各种场景中展示了强大的泛化能力。采用领域内细调来校准NFMs的视觉-运动策略,有望在新场景中进一步提升性能。然而,细调后的模型仍然存在避障能力差或无法正确到达目标的问题。此外,使用小数据集进行模型更新通常会削弱预训练的先验知识,影响预训练的泛化能力。因此,细调会降低模型在稳健和准确导航方面的能力。在本文中,我们提出了一种新的细调方法,该方法利用大规模预训练同时高效学习新设置,如环境或相机配置。特别是,受ControlNet启发,我们通过将可训练的预训练骨干网络的可学习副本附加到NFMs上,利用零初始化残差路径进行细调,从而学习几何线索。这种设计使模型能够高效地获取领域内的几何信息,同时在各种行为中保留预训练的知识。尽管其简单性,我们对现实导航的全面评估表明,我们的方法能够有效实现稳健的长周期导航,同时最小化碰撞和人工干预。此外,我们的离线分析显示,所提出的方法在细调数据集之外仍能维持或进一步提升动作预测能力,为通用导航的持续学习提供了关键见解。项目页面:https://toyotafrc.github.io/DCLING-Proj/

英文摘要

Navigation Foundation Models (NFMs) trained on large cross-embodied datasets have demonstrated powerful generalizability in various scenarios. Adopting in-domain fine-tuning for an NFM efficiently calibrates the visuomotor policy, promising further improvement even in a novel scenario. However, the fine-tuned models still suffer from poor obstacle avoidance or fail to properly reach the provided goals. Furthermore, model updates using a small subset of data typically erode the pre-trained prior, compromising the pre-training generalization. Consequently, fine-tuning deteriorates the capability of the model for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pre-training while efficiently learning in novel setups, such as environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pre-trained backbone using zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pre-trained knowledge across various behaviors. Despite its simplicity, our comprehensive evaluation of real-world navigation suggests that our proposal effectively enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed method maintains or further improves action prediction capabilities beyond the fine-tuned dataset, providing a key insight into continual learning for general navigation. The project page: https://toyotafrc.github.io/DCLING-Proj/
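
The zero-initialized residual pathway borrowed from ControlNet can be illustrated with a toy sketch in plain Python. This is not the authors' architecture: the stand-in backbone functions, the way depth features are injected (element-wise addition), and the class name are all assumptions. The point shown is only that a zero-initialized projection leaves the pre-trained output unchanged at the start of fine-tuning.

```python
def linear(weights, bias, x):
    """Dense layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class ZeroInitResidualBranch:
    """Frozen backbone plus a trainable side branch joined by a zero-initialized layer."""

    def __init__(self, frozen_fn, trainable_fn, dim):
        self.frozen_fn = frozen_fn          # pre-trained backbone, never updated
        self.trainable_fn = trainable_fn    # trainable copy that also sees depth
        self.zero_w = [[0.0] * dim for _ in range(dim)]  # zero-initialized weights
        self.zero_b = [0.0] * dim                        # zero-initialized bias

    def __call__(self, rgb_feat, depth_feat):
        base = self.frozen_fn(rgb_feat)
        # Assumed conditioning: depth features are added to the RGB features.
        side = self.trainable_fn([r + d for r, d in zip(rgb_feat, depth_feat)])
        delta = linear(self.zero_w, self.zero_b, side)
        return [b + d for b, d in zip(base, delta)]


def toy_backbone(x):
    """Stand-in for a pre-trained feature extractor."""
    return [2.0 * v for v in x]


model = ZeroInitResidualBranch(toy_backbone, toy_backbone, dim=2)
initial_output = model([1.0, 3.0], [0.5, 0.5])
```

Because the residual starts at exactly zero, early gradient steps cannot disturb the pre-trained behavior. The side branch changes the output only as the projection weights move away from zero.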

2605.19678 2026-05-20 cs.RO 版本更新

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

RoVLA: 多一致性约束用于鲁棒的视觉-语言-动作模型

Jingzhou Luo, Yifan Wen, Yongjie Bai, Xinshuai Song, Yang Liu, Liang Lin

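Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing; X-Era AI Lab

AI summary: This paper proposes RoVLA, a framework that improves the robustness of vision-language-action models through multi-consistency constraints, enforcing consistency under three complementary transformations (instruction semantics, trajectory evolution, and observation perturbation) to strengthen stability and generalization.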

详情
AI中文摘要

视觉-语言-动作(VLA)模型在具身操控中表现出色,但在视觉观察变化、语言指令改写和复合扰动下仍显脆弱。这种限制表明现有方法仍依赖于训练分布中的浅层相关性,而非学习任务语义、环境状态和动作生成之间的稳定耦合。尽管近期研究通过大规模训练、训练后适应或增强预测建模提高了鲁棒性,但很少在端到端策略本身中强制执行不变性一致性。为了解决这个问题,我们提出了RoVLA,一个具有多一致性约束的鲁棒视觉-语言-动作框架。RoVLA在三个互补的变换下强制一致性:指令语义、轨迹演变和观察扰动。具体而言,指令一致性(IC)通过语义等价指令改写促进稳定的语义关联,演变一致性(EC)在整个生成过程中保持一致的动作意图,观察一致性(OC)通过强制在受扰动前后的一致预测来提高对视觉和体感扰动的鲁棒性。通过在训练过程中显式建模这些不变性,RoVLA减少了对表面相关性的依赖,提高了鲁棒性和泛化能力。在LIBERO-Plus、RoboTwin 2.0和现实世界操控任务上的实验表明,RoVLA在强基线方法上表现一致,并在多样化的任务和观察转移下表现出更优越的鲁棒性。这些结果证明了多一致性学习在鲁棒具身控制中的有效性。代码将在https://github.com/HCPLab-SYSU/RoVLA上提供。

英文摘要

Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. Specifically, Instructional Consistency (IC) promotes stable grounding under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) preserves coherent action intent throughout the generation process, and Observational Consistency (OC) improves robustness to visual and proprioceptive perturbations by enforcing consistent predictions before and after targeted disturbances. By explicitly modeling these invariances during training, RoVLA reduces reliance on superficial correlations and improves robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks show that RoVLA consistently outperforms strong baseline methods and exhibits superior robustness under diverse task and observation shifts. These results demonstrate the effectiveness of multi-consistency learning for robust embodied control. Codes will be available at https://github.com/HCPLab-SYSU/RoVLA.

2605.19631 2026-05-20 cs.RO cs.CV 版本更新

HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

HEAT: 基于轨迹引导的世界模型实现异构端到端自动驾驶

Hoonhee Cho, Giwon Lee, Jae-Young Kang, Hyemin Yang, Heejun Park, Kuk-Jin Yoon

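Affiliations: KAIST

AI summary: This paper proposes a trajectory-guided learning approach that organizes training around planning trajectories so that the model captures domain-invariant representations of driving intent. Combined with a world model that predicts future latent features, it improves feature consistency, mitigates domain bias, and achieves strong performance across multiple heterogeneous datasets.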

详情
AI中文摘要

端到端自动驾驶作为一种直接将原始传感器数据映射到驾驶动作的替代方案,已逐渐取代传统模块化管道。尽管近期方法在单域数据集上表现强劲,但当在多个异构领域联合训练时,性能显著下降。然而,实际自动驾驶系统必须在具有异构分布的不同环境中运行,包括不同城市、传感器配置和交通模式,而无需领域特定重新训练。这一差距突显了多领域学习中的关键挑战:异构领域中的领域特定变化引入了冲突的学习信号,使模型倾向于妥协解决方案,这些方案在各个领域中都是次优的。为此,我们提出了一种轨迹驱动的学习范式,围绕规划轨迹组织训练,使模型能够捕捉驾驶意图的领域不变表示。此外,我们还引入了一个世界模型,该模型根据自主动作预测未来的潜在特征,从而提高特征一致性和缓解领域引起的偏见。我们在三个基准上评估了我们的方法,即nuScenes、NAVSIM和Waymo端到端数据集,并在所有领域上展示了显著优于现有方法的改进。我们的结果表明,一个统一的模型可以在异构数据集上进行训练,同时在每个领域中保持强大的性能,这表明了向可扩展的现实世界部署迈出的一步。我们将公开我们的代码。

英文摘要

End-to-end autonomous driving has emerged as a compelling alternative to traditional modular pipelines by directly mapping raw sensor data to driving actions. While recent approaches achieve strong performance on single-domain datasets, their performance degrades significantly when trained jointly across multiple heterogeneous domains. In practice, however, autonomous systems must operate across diverse environments with heterogeneous distributions, including different cities, sensor configurations, and traffic patterns, without domain-specific retraining. This gap highlights a key challenge in multi-domain learning: domain-specific variations across heterogeneous domains introduce conflicting learning signals, driving models toward compromised solutions that are suboptimal across domains. To address this, we propose a trajectory-driven learning paradigm that organizes training around planning trajectories, enabling the model to capture domain-invariant representations of driving intent. Furthermore, we incorporate a world model that predicts future latent features conditioned on ego actions, improving feature consistency and mitigating domain-induced biases. We evaluate our approach on three benchmarks, nuScenes, NAVSIM, and the Waymo end-to-end dataset, and show substantial improvements over existing methods across all domains. Our results demonstrate that a single unified model can be trained on heterogeneous datasets while maintaining strong performance within each domain, highlighting a step toward scalable real-world deployment. We will make our code publicly available.

2605.19600 2026-05-20 cs.RO 版本更新

FlyMirage: A Fully Automated Generation Pipeline for Diverse and Scalable UAV Flight Data via Generative World Model

FlyMirage: 一种用于生成多样化和可扩展的无人机飞行数据的完全自动化生成流程

Jinhan Li, Xijie Huang, Zhaoqi Wang, Yijin Wang, Weiqi Ge, Qiyi He, Mo Zhu, Fei Gao, Yuze Wu, Xin Zhou

Affiliations: State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China; Differential Robotics, Hangzhou 311121, China

AI summary: This paper proposes FlyMirage, a fully automated pipeline that uses a generative world model to produce large-scale, diverse, and photorealistic UAV vision-language navigation data in support of next-generation embodied navigation models.

详情
AI中文摘要

在视觉-语言导航(VLN)领域,空中数据集在结合规模、多样性和现实感方面仍然有限,通常依赖于昂贵的真实世界场景或视觉受限的模拟。为了解决这些挑战,我们引入了FlyMirage,一种高度可扩展且完全自动化的空中VLN数据生成流程。我们的方法利用大型语言模型(LLM)作为环境设计师来促进场景多样性,配以生成世界模型,将这些设计转化为高保真的3D高斯点云(3DGS)场景。为了显著减少人工劳动并确保飞行数据的可行性,FlyMirage自动化了场景探索和语义信息获取,并进一步集成了动态可行的规划器用于无人机(UAV)轨迹生成。利用这一工具链,我们生成了一个大规模、多样化且逼真的空中VLN数据集,具有动态可行的飞行轨迹,旨在支持下一代具身导航模型的发展。

英文摘要

In the field of Vision-Language Navigation (VLN), aerial datasets remain limited in their ability to combine scale, diversity, and realism, often relying on either costly real-world scenes or visually limited simulations. To address these challenges, we introduce FlyMirage, a highly scalable and fully automated data generation pipeline for aerial VLN. Our approach leverages large language models (LLM) as an environment designer to promote scene diversity, paired with a generative world model that instantiates these designs into high-fidelity 3D Gaussian Splatting (3DGS) scenes. To substantially reduce human labor and ensure the feasibility of flight data, FlyMirage automates scene exploration and semantic information acquisition, and further integrates a dynamically feasible planner for uncrewed aerial vehicle (UAV) trajectory generation. Utilizing this toolchain, we generate a large-scale, diverse, and photorealistic aerial VLN dataset, with dynamically feasible flying trajectories, designed to support the development of next-generation embodied navigation models.

2605.19594 2026-05-20 cs.RO 版本更新

MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation

MCNav: 用于零样本目标导向导航的记忆感知动态认知图

Jingyu Li, Zhe Liu, Wenxiao Wu, Li Zhang

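Affiliations: Fudan University; Shanghai Innovation Institute; University of Hong Kong; Huazhong University of Science and Technology

AI summary: This paper proposes MCNav, a memory-aware navigation framework built on a dynamic cognitive map. By efficiently querying information about relevant objects in explored areas, it addresses missed or misidentified targets in zero-shot goal-oriented navigation; goal re-validation and missed-goal re-exploration strategies, combined with blacklist and double-check mechanisms, yield state-of-the-art performance.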

详情
AI中文摘要

在复杂环境中导航到实例级目标是一个具有挑战性的问题。许多现有的零样本方法通过建模整个环境并利用大语言模型进行场景理解来实现强性能。然而,这些策略主要集中在探索新区域,而缺乏对先前探索区域信息的深入利用。因此,当目标在先前访问的区域中丢失或误识别时,导航失败频繁发生。为了解决这些限制,我们提出了MCNav,一种具有动态认知图的记忆感知导航框架。该图存储有关已探索区域相关物体的高效查询信息。基于此记忆结构,MCNav引入了两种记忆感知探索策略:目标再验证,用于重新评估已见过的对象以纠正匹配失败;以及遗漏目标再探索,用于根据上下文线索估计目标在已探索区域中的存在概率。这些策略进一步通过黑名单机制防止重复错误,并通过双检机制进行高置信度确认。我们在HM3Dv1和HM3Dv2数据集上对MCNav进行了三种不同任务的评估,其中在实例级目标导航任务上实现了最先进的性能。

英文摘要

Navigating to instance-level targets in complex environments is a challenging problem. Many existing zero-shot methods achieve strong performance by modeling the entire environment and leveraging large language models for scene understanding. However, such strategies primarily focus on exploring new regions while lacking a deeper exploitation of information from previously explored areas. Consequently, when targets are missed or misidentified within previously visited regions, navigation failures occur frequently. To address these limitations, we propose MCNav, a memory-aware navigation framework with a dynamic cognitive map. This map stores efficiently queryable information about relevant objects in explored areas. Building on this memory structure, MCNav introduces two memory-aware exploration strategies: goal re-validation, which re-assesses previously seen objects to correct matching failures, and missed goal re-exploration, which estimates the likelihood that a target is present in an explored region from contextual cues. These strategies are further stabilized by a blacklist mechanism to prevent repeated errors and a double-check mechanism for high-confidence confirmation. We evaluate MCNav on the HM3Dv1 and HM3Dv2 datasets across three different tasks, where it achieves state-of-the-art performance, particularly on the instance-level goal navigation task.

2605.19592 2026-05-20 cs.RO cs.AI 版本更新

Implicit Action Chunking for Smooth Continuous Control

隐式动作分块用于平滑连续控制

Bosun Liang, Shuo Pei, Zirui Chen, Chuanzhi Fan, Chen Sun, Yuankai Wu, Huachun Tan, Yong Wang

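Affiliations: Department of Data and Systems Engineering, The University of Hong Kong, Hong Kong SAR, China; Beijing Institute of Technology, Zhuhai, China; College of Computer Science, Sichuan University, Chengdu, China

AI summary: This paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Its dual-window design ensures physical smoothness and consistent temporal-difference targets without expanding the action space, avoiding the optimization difficulties of explicit action chunking and its incompatibility with standard step-wise interaction.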

详情
AI中文摘要

强化学习常常产生高频振荡的控制信号,这会破坏物理部署所需的安全性和稳定性。显式动作分块通过预测固定时间跨度的轨迹来解决这个问题,但会按时间跨度长度成比例地扩展策略输出维度,导致优化困难和与标准逐步交互不兼容。为克服这些挑战,本文提出了Dual-Window Smoothing (DWS),一种隐式动作分块框架用于平滑连续控制。与显式方法不同,DWS通过确定性调制确保时间一致性,而不扩展动作空间。它采用双窗口设计:一个执行窗口通过确定性调制确保物理平滑,一个价值窗口在时间差分目标上对时间跨度进行对齐,以纠正由于开环执行导致的批评者偏差。DWS还包含一个轻量级的演员侧时间正则化器,基于一阶动作差异,以促进全局连续性。该设计有效地弥合了时间抽象与反应式逐步控制之间的差距。在包括DeepMind控制套件和工业能源管理任务在内的基准测试中,DWS优于最先进的(SOTA)基线。在复杂的基于视觉的自动驾驶任务中,DWS实现了更平滑的控制,更安全的行为,减少了抖动,并达到了100%的成功率。

英文摘要

Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks show that DWS outperforms state-of-the-art (SOTA) baselines. In complex vision-based autonomous driving tasks, DWS achieves smoother control, safer behavior with reduced jitter, and attains a 100% success rate.
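
The actor-side regularizer based on first-order action differences can be sketched as a simple penalty on consecutive actions. The exact form used in DWS is not given in the abstract; the mean-squared-difference form and the function name below are assumptions.

```python
def action_smoothness_penalty(actions):
    """Mean squared first-order difference between consecutive action vectors."""
    if len(actions) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(actions) - 1)


# A jittery two-dimensional action sequence and a constant one.
jittery = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
steady = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
```

In a training loop, a penalty of this kind would typically be added to the actor loss with a small coefficient, so that smoothness is encouraged without overriding the task reward.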

2605.19580 2026-05-20 cs.RO 版本更新

PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

PAPO-VLA: 为视觉-语言-动作模型进行规划感知的策略优化

Peizheng Guo, Jingyao Wang, Changwen Zheng, Wenwen Qiang

发表机构 * Institute of Software Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences(中国科学院大学)

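AI Summary: This paper proposes PAPO-VLA, a planning-aware policy optimization method for Vision-Language-Action (VLA) models that improves the reliability of VLA policies by identifying and optimizing planning actions.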

详情
AI中文摘要

视觉-语言-动作(VLA)模型在语言引导的机器人任务中展现出有前途的能力。然而,使VLA策略可靠仍然具有挑战性,因为一个操作任务是通过闭环交互完成的,其中每个动作都会影响后续的执行。为了分析这个问题,我们重新审视VLA策略在执行过程中的作用,并认为VLA策略同时扮演着规划者和执行者两个角色:规划者做出任务导向的决策以改变执行方向,而执行者通过密集的连续动作来实现这些决策。这种观点表明,提高VLA可靠性需要特别关注规划动作。现有的优化方法可以模仿动作或改进完整的轨迹,但通常不明确识别规划动作或衡量其对任务成功的重要性。为了解决这个问题,我们提出了PAPO-VLA,即针对VLA模型的规划感知策略优化方法。PAPO-VLA首先通过联合考虑动作变化和轨迹结果来识别规划动作,然后通过因果充分性和因果必要性估计其重要性,并最终将这种重要性纳入GRPO优势估计中。这样,更重要规划动作会受到更强的优化关注,同时整个轨迹仍然通过轨迹级反馈进行优化。在多个基准上的实验展示了PAPO-VLA的有效性。

英文摘要

Vision-Language-Action (VLA) models show promising ability in language-guided robotic tasks. However, making VLA policies reliable remains challenging, because a manipulation task is completed through closed-loop interaction, where each action affects subsequent execution. To analyze this problem, we revisit VLA policy during execution and argue that a VLA policy acts both as a planner, which makes task-oriented decisions that change the direction of execution, and as an executor, which realizes these decisions through dense continuous actions. This view suggests that improving VLA reliability requires particular attention to planning actions. Existing optimization methods can imitate actions or improve complete trajectories, but they usually do not explicitly identify planning actions or measure their importance for task success. To address this issue, we propose Planning-Aware Policy Optimization for VLA models (PAPO-VLA). PAPO-VLA first identifies planning actions by jointly considering action variation and trajectory outcome, then estimates their importance through causal sufficiency and causal necessity, and finally incorporates this importance into GRPO advantage estimation. In this way, more important planning actions receive stronger optimization emphasis, while the whole trajectory is still optimized by trajectory-level feedback. Experiments on multiple benchmarks demonstrate the effectiveness of PAPO-VLA.
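
The abstract says the estimated importance of planning actions is incorporated into GRPO advantage estimation, without stating the rule. The sketch below first computes the usual group-relative advantage (each reward standardized within its sampling group) and then applies a purely hypothetical weighting, scaling each step's advantage by `1 + w_t`; the function names and that scaling rule are assumptions for illustration only.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize trajectory rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def importance_weighted_step_advantages(
    trajectory_advantage: float, step_importance: Sequence[float]
) -> List[float]:
    """Spread one trajectory-level advantage over its steps.

    Hypothetical rule: a step with importance w_t receives (1 + w_t) times
    the trajectory advantage, so ordinary steps (w_t = 0) keep the plain
    trajectory-level signal and planning steps get extra emphasis.
    """
    return [trajectory_advantage * (1.0 + w) for w in step_importance]
```

Under this assumed rule every step still receives trajectory-level feedback, which matches the abstract's statement that the whole trajectory remains optimized while important planning actions are emphasized.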

2605.19562 2026-05-20 cs.RO cs.LG math.OC 版本更新

Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions

面向空地协同交接任务的学习加速优化轨迹规划

Jingshan Chen, Bochen Yu, Henrik Ebel, Peter Eberhard

发表机构 * Institute of Engineering and Computational Mechanics, University of Stuttgart, 70569 Stuttgart, Germany(德国斯图加特大学工程与计算力学研究所,70569 斯图加特) Mechanical Engineering, LUT University, 53850 Lappeenranta, Finland(芬兰LUT大学机械工程系,53850 拉彭兰塔)

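AI Summary: This paper proposes a learning-augmented trajectory planning framework for cooperative UAV and UGV handover missions. Decoupled encoder-decoder LSTM networks generate coordinated handover trajectory predictions that accelerate the optimization, yielding faster convergence and a higher optimization success rate.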

Comments Preprint of a contribution accepted for publication in the RoManSy 2026 Springer proceedings

详情
AI中文摘要

本文提出了一种学习增强的轨迹规划框架,用于无人机(UAV)与无人地面车辆(UGV)的协同交接任务。尽管集中式轨迹优化能够确保动力学可行性和任务最优性,但其高计算成本限制了实时应用。我们提出了一种神经代理规划器,利用解耦的编码器-解码器长短期记忆(LSTM)网络,根据任务规范生成协调的交接轨迹预测。这些预测作为下游集中式优化器的有信息热启动,从而加速收敛到动力学可行的解。基准评估显示,与冷启动优化相比,学习增强的规划框架实现了三倍以上的加速,并达到100%的优化成功率。结果表明,将数据驱动推断与基于模型的细化相结合,能够为异构多机器人系统提供快速且可靠的轨迹生成。

英文摘要

This paper presents a learning-augmented trajectory planning framework for cooperative unmanned aerial vehicle (UAV) and unmanned ground vehicle (UGV) handover missions. While centralized trajectory optimization ensures dynamic feasibility and task optimality, its high computational cost limits real-time applicability. We propose a neural surrogate planner utilizing decoupled encoder-decoder long short-term memory (LSTM) networks to generate coordinated handover trajectory predictions from the task specifications. These predictions serve as informed warm starts for the downstream centralized optimizer, thereby accelerating convergence to dynamically feasible solutions. Benchmark evaluations demonstrate that the learning-augmented planning framework achieves more than a threefold speedup and 100% optimization success rate compared to cold start optimization. The results indicate that combining data-driven inference with model-based refinement enables fast and reliable trajectory generation for heterogeneous multi-robot systems.
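
To illustrate why a good initial guess speeds up an iterative optimizer, here is a toy sketch under stated assumptions: a one-dimensional quadratic stands in for the trajectory optimization problem, plain gradient descent stands in for the centralized optimizer, and the "predicted" warm start is a hard-coded value rather than an LSTM output. None of these choices come from the paper.

```python
from typing import Callable, Tuple


def gradient_descent(
    grad: Callable[[float], float],
    x0: float,
    step: float = 0.1,
    tol: float = 1e-6,
    max_iters: int = 10_000,
) -> Tuple[float, int]:
    """Run gradient descent and return (solution, iterations used)."""
    x = x0
    for iteration in range(max_iters):
        g = grad(x)
        if abs(g) < tol:  # converged: gradient is numerically zero
            return x, iteration
        x -= step * g
    return x, max_iters


def toy_grad(x: float) -> float:
    """Gradient of the toy cost (x - 3)^2, whose minimizer is x = 3."""
    return 2.0 * (x - 3.0)


# Cold start: uninformed initial guess far from the optimum.
cold_solution, cold_iters = gradient_descent(toy_grad, x0=0.0)
# Warm start: initial guess close to the optimum (stand-in for a learned prediction).
warm_solution, warm_iters = gradient_descent(toy_grad, x0=2.9)
```

Both runs reach the same minimizer, but the warm-started run needs fewer iterations. The speedup and success rate reported in the abstract refer to the authors' own benchmark, not to this toy problem.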

2605.19524 2026-05-20 cs.RO cs.CV 版本更新

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

SafeAlign-VLA: 一种增强负样本的安全对齐框架用于风险感知的自动驾驶

Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

发表机构 * College of Transportation, Tongji University(同济大学交通运输学院) Department of Civil Engineering, Tsinghua University(清华大学土木工程系) School of Vehicle and Mobility, Tsinghua University(清华大学车辆与移动系统学院) Department of Civil and Environmental Engineering, National University of Singapore(新加坡国立大学土木与环境工程系)

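AI Summary: This paper proposes SafeAlign-VLA, a framework that incorporates negative data to improve an autonomous driving system's understanding of safety boundaries. It generates safety labels and counterfactual trajectories and combines a two-stage training strategy with anchor-based group relative policy optimization to improve driving safety and robustness.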

详情
AI中文摘要

端到端的自动驾驶系统在常见场景中表现优异,但在安全关键的长尾案例中表现不佳。视觉-语言-动作(VLA)模型因其强大的推理能力而具有前景。然而,大多数基于VLA的方法依赖于正专家演示,很少利用负样本,导致对危险行为和安全边界的理解不足。为了解决这一限制,我们提出了SafeAlign-VLA,一种统一的增强负样本的安全对齐框架,将负数据整合到监督学习和强化学习中。首先,我们开发了一种反事实安全配对范式,通过反事实推理从危险场景中生成结构化的安全标签和反事实正轨迹。然后采用两阶段训练策略:负样本增强的监督微调用于故障反馈和轨迹修正,接着是基于锚点的群体相对策略优化,利用正负轨迹作为对比锚点,引导采样并惩罚高风险行为。在NAVSIM和DeepAccident上的实验验证了所提框架。SafeAlign-VLA在NAVSIM v1测试集上达到89.1 PDMS,比无负样本基线提高了1.3%。在DeepAccident上,碰撞率降低到3.36%,同时达到84.2%的语言准确率和85.8%的风险预测准确率。这些结果证明了所提增强负样本的安全对齐框架在安全和鲁棒自动驾驶中的有效性。

英文摘要

End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations and rarely exploit negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 test set, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.

2605.19501 2026-05-20 cs.RO cs.AI 版本更新

CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

CANINE: 为视觉障碍者提供交互导航的机器人导盲犬教学系统

Cunjun Yu, Zishuo Wang, Anxing Xiao, Linfeng Li, David Hsu

发表机构 * School of Computing(计算机学院) Smart Systems Institute(智能系统研究所)

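AI Summary: This paper presents CANINE, a system that uses personalized, adaptive verbal feedback to help visually impaired users learn interactive navigation with a robot guide dog. It decomposes the complex coordination task into sub-skills and trains at two levels to improve learning efficiency and final navigation performance.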

Comments Accepted to RSS 2026

详情
AI中文摘要

机器人导盲犬提供的导航帮助显著扩展了视障者的独立出行能力,但其有效使用需要微妙的人机协调,用户难以从通用口头指令中学会。为解决这一挑战,我们提出了CANINE,一个自动化教学系统,通过个性化、自适应的语音反馈训练用户与机器人导盲犬进行交互导航。CANINE将复杂的协调任务分解为子技能,并在两个层次上运作。在高层,它通过知识追踪跟踪学习者在各子技能上的熟练度,并优先训练最薄弱的环节,从而决定训练什么。在底层,CANINE通过观察每一次人类练习片段,利用基础模型推断错误的根本原因,并自适应地生成有针对性的语音纠正,从而决定如何训练每个子技能。一项以蒙眼参与者作为代理人群进行定量评估的受控研究表明,CANINE在学习效率和最终导航性能上均显著优于通用口头指令。我们进一步通过保持性研究和探索性案例研究验证CANINE。保持性研究显示技能提升在两周后仍然保持。案例研究确认了CANINE在训练一名视障用户方面的有效性,同时揭示了实际部署中的额外设计考虑因素。两者均与受控研究的结果一致。项目页面:https://cunjunyu.github.io/project/canine/

英文摘要

Robot guide dogs offer navigation assistance that greatly expands the independent mobility of the visually impaired, but their effective use requires subtle human-robot coordination that is difficult for users to learn from generic verbal instructions. To tackle this challenge, we present CANINE, an automated coaching system that trains users for interactive navigation with a robot guide dog, through personalized, adaptive verbal feedback. CANINE decomposes a complex coordination task into sub-skills and operates at two levels. At the high level, it decides what to train by tracking the learner's proficiency across sub-skills using knowledge tracing and prioritizing training on the weakest areas. At the low level, CANINE decides how to train each sub-skill by observing each human practice episode, using foundation models to infer the underlying causes of errors, and generating targeted verbal corrections adaptively. A controlled study with blindfolded participants, treated as a proxy population for quantitative evaluation, demonstrates that CANINE significantly improves both learning efficiency and final navigation performance compared to generic verbal instructions. We further validate CANINE through a retention study and an exploratory case study. The retention study shows lasting skill improvement after two weeks. The case study confirms CANINE's effectiveness in training a visually impaired user, while revealing additional design considerations for real-world deployment. Both are well aligned with the findings of the controlled study. Project page: https://cunjunyu.github.io/project/canine/

2605.19490 2026-05-20 cs.RO cs.CV 版本更新

Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

闭环混合数字孪生平台用于联网和自动化车辆验证

Kanglong Quan, Zhebing Xia, Linfeng Jiang, Hao Yu, Ziheng Qiao, Dapeng Dong, Dongyao Jia

发表机构 * National Natural Science Foundation of China(中国国家自然科学基金委员会) Suzhou Science and Technology Development Planning Programme(苏州科技发展计划)

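AI Summary: This paper proposes a closed-loop hybrid digital twin platform that tightly couples a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle for efficient validation of connected and automated vehicles.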

详情
AI中文摘要

联网和自动化车辆(CAVs)的全面且高效的验证在实际部署前至关重要。虽然基于模拟的测试提供了可扩展性,但现有方法往往缺乏与真实车辆和现场数据的无缝集成,限制了其在捕捉动态真实世界交互方面的保真度。为弥合这一差距,本文提出了一种新的实时混合数字孪生平台。其核心创新在于高保真CARLA-SUMO协同模拟与物理测试现场和车辆通过低延迟的车辆到万物(V2X)通信链路的紧密耦合。定制开发的中间件作为关键桥梁,同步真实CAV的运动状态作为模拟中的影子车辆,并将虚拟控制命令转换为底盘执行的控制器局域网络(CAN)消息以实现闭环控制。详细的实现包括使用摄影测量法进行全尺寸资产重建以及云边协同架构以实现可扩展的多用户操作。实验结果表明同步稳定且闭环控制有效,延迟低,证实了该平台在多场景CAV验证中的实用性。

英文摘要

Comprehensive and efficient validation of connected and automated vehicles (CAVs) is critical prior to real-world deployment. While simulation-based testing offers scalability, existing approaches often lack seamless integration with real vehicles and field data, limiting their fidelity in capturing dynamic, real-world interactions. To bridge this gap, this paper proposes a novel real-time hybrid digital twin platform. Its core innovation lies in the tight coupling of a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle via a low-latency Vehicle-to-Everything (V2X) communication link. A custom-developed middleware serves as the critical bridge, synchronizing a real CAV's kinematic state as a shadow vehicle in the simulation and translating virtual control commands into chassis-actuating Controller Area Network (CAN) messages for closed-loop control. Detailed implementation includes using photogrammetry for full-scale asset reconstruction and a cloud-edge collaborative architecture for scalable, multi-user operation. Experimental results demonstrate stable synchronization and effective closed-loop control with low latency, confirming the platform's practicality for multi-scenario CAV verification.

2605.19469 2026-05-20 cs.LG cs.AI cs.RO 版本更新

Sampling-Based Safe Reinforcement Learning

基于采样的安全强化学习

Luca Vignola, Bruce D. Lee, Manish Prajapat, Manuel Wendl, Melanie Zeilinger, Andreas Krause, Yarden As

发表机构 * ETH Zurich(苏黎世联邦理工学院)

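AI Summary: This paper proposes a sampling-based safe reinforcement learning method that maintains safety during learning by enforcing constraints jointly across a finite set of dynamics samples, provides practical safety guarantees in continuous domains, and explores efficiently by constraining epistemic uncertainty.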

详情
AI中文摘要

安全探索仍然是强化学习(RL)中的基本挑战,限制了RL智能体在现实世界中的部署。我们提出了一种基于采样的安全强化学习(SBSRL),这是一种基于模型的RL算法,通过在有限的动力学样本集上联合施加约束,确保学习过程中的安全性。这种形式近似了在不确定动力学下的不可行最坏情况优化,并在连续域中实现了实用的安全保证。我们进一步引入了一种基于限制认知不确定性的探索策略,消除了显式探索奖励的需要。在常规条件下,我们推导了学习过程中安全性的高概率保证以及恢复近最优策略的有限时间样本复杂度界。实验证明,SBSRL在仿真和真实机器人硬件中均实现了安全且高效的探索,并可轻松扩展到实际的深度集合实现,以解决高维连续控制问题。

英文摘要

Safe exploration remains a fundamental challenge in reinforcement learning (RL), limiting the deployment of RL agents in the real world. We propose Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout the learning process by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. We further introduce an exploration strategy based on constraining epistemic uncertainty, eliminating the need for explicit exploration bonuses. Under regularity conditions, we derive high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy. Empirically, SBSRL achieves safe and efficient exploration both in simulation and in real robotic hardware, and readily extends to practical deep-ensemble implementations that scale to high-dimensional continuous control problems.
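
The core idea of enforcing a constraint jointly across a finite set of dynamics samples can be sketched in a few lines: an action counts as safe only if the constraint holds under every sampled model. This is a one-step illustration under assumed names (`is_safe_under_all_samples`, scalar linear models, a box constraint); the algorithm in the paper plans over trajectories and is not reproduced here.

```python
from typing import Callable, Sequence

Dynamics = Callable[[float, float], float]  # (state, action) -> next state


def is_safe_under_all_samples(
    state: float,
    action: float,
    dynamics_samples: Sequence[Dynamics],
    constraint: Callable[[float], bool],
) -> bool:
    """Accept an action only if every sampled model keeps the next state safe."""
    return all(constraint(f(state, action)) for f in dynamics_samples)


def make_linear_model(a: float) -> Dynamics:
    """Hypothetical scalar model x' = a * x + u with uncertain gain a."""
    return lambda x, u: a * x + u


# Three plausible dynamics models drawn from an (assumed) posterior over a.
samples = [make_linear_model(a) for a in (0.9, 1.0, 1.1)]


def within_unit_box(x: float) -> bool:
    """Safety constraint |x| <= 1."""
    return abs(x) <= 1.0
```

From state 0.8, the action 0.1 keeps all three predicted next states inside the box, whereas the action 0.2 is rejected because the most pessimistic sample (gain 1.1) leaves it.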

2605.19431 2026-05-20 cs.RO 版本更新

Self-assembling Modular Aerial Robot for Versatile Aerial Tasks

自组装模块化空中机器人用于多功能空中任务

Junichiro Sugihara, Masaki Kitagawa, Jinjie Li, Yunong Li, Takuzumi Nishio, Kei Okada, Moju Zhao

发表机构 * Department of Mechanical Engineering, The University of Tokyo(东京大学机械工程系) Department of Mechano Informatics, The University of Tokyo(东京大学机械信息学系)

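AI Summary: This paper presents LEGION, a self-assembling modular aerial robot that performs cooperative manipulation through in-flight self-assembly. By combining nimble maneuverability with reconfigurability, it lets aerial robots shift from passive observers to active participants, broadening the scope of aerial physical interaction.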

详情
AI中文摘要

多旋翼空中机器人在三维空间中具有出色的机动性,最近的进展使它们能够在复杂和狭窄的环境中进行灵活导航,尤其是对于小型机架。相比之下,用于高空工作的平台通常更大,以提供高推力以实现与环境的稳定物理交互。然而,这些矛盾的设计要求导致了灵活导航和稳健空中操作之间的长期权衡。本文提出了LEGION单元,这是一种可重新配置的模块化空中机器人,能够飞行中自组装以实现协同操作,灵感来自蚂蚁形成的自组织群体。每个单元保留了灵活的机动性,而两端的关节配备的对接接口使单元能够端到端自组装成飞行操作器。我们证明了多个单元可以自主飞行中对接;一旦锁定,它们通过控制接触力和扭矩保持零间隙锁定,即使在户外也能实现可靠的聚集和关节运动。我们进一步证明,自重构能力使单元能够在灵活的个体飞行和集体关节操作之间进行形态切换,同时实现核心飞行中操作原始操作,包括推、拉、旋转、抓取和携带。LEGION的自组织能力使空中机器人,特别是群组中的机器人,能够从被动观察者转变为环境中的主动参与者,拓展了空中物理交互的范围。

英文摘要

Multirotor aerial robots excel at maneuvering in three-dimensional space, and recent advances enable nimble navigation in cluttered and confined environments, especially for small airframes. By contrast, platforms built for high-altitude work tend to be larger to deliver high thrust for stable physical interaction with the environment. However, these conflicting design requirements create a long-standing trade-off between nimble navigation and robust aerial manipulation. Here, we present LEGION units, which are reconfigurable modular aerial robots capable of in-flight self-assembly for cooperative manipulation, drawing inspiration from the self-organized collectives formed by ants. Each unit retains nimble maneuverability while joint-equipped docking interfaces at both ends enable end-to-end self-assembly into a flying manipulator. We show that multiple units autonomously dock in flight; once latched, they maintain a zero-clearance interlock by controlling the contact force and torque, enabling reliable aggregation and articulated motion even outdoors. We further show that self-reconfigurability enables morphological switching between nimble individual flight and collective articulated manipulation, while realizing core in-flight manipulation primitives including pushing, pulling, rotating, grasping, and carrying. LEGION's self-organization enables aerial robots, especially in swarms, to shift from passive observers to active participants in their environment, broadening the scope of aerial physical interaction.

2605.19420 2026-05-20 cs.RO 版本更新

Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation

超越航点:双热图接地用于跨具身语义导航

Kaijie Yun, Yue Chen

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) JD AI Research(京东人工智能研究院)

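AI Summary: This paper proposes a unified vision-language framework that replaces single-point regression with a dual-heatmap representation to bridge the gap between semantic instructions and physical reachability, improving the robustness and performance of cross-embodiment semantic navigation.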

详情
AI中文摘要

将开放式的语义指令接地为可执行的局部目标是人机交互中的基本挑战。尽管现有导航框架通常回归确定性的航点,但这种刚性方法会压缩空间不确定性,并且经常针对不可通行的物体中心,导致严重的执行失败。在本文中,我们专注于在视场内(in-FOV)的语义导航实际场景,其中机器人接收到简短的、交织的多模态(文本和图像)提示。为了弥合抽象语义意图与物理可达性之间的差距,我们提出了一种统一的视觉-语言框架,该框架放弃单点回归,转而采用双热图表示。我们的框架预测一个导航可及性热图,以捕捉连续的可到达区域,并结合一个面向热图用于方向约束。这些密集输出本质上充当可微的语义势场,能够无缝整合到下游的局部规划器中。为了支持这一范式,我们构建了一个完全自动化的、基于基础模型的合成数据管道,并建立了全面的模拟基准。广泛的实验表明,我们的框架在可比的8B基线中实现了最先进的性能。关键的是,通过特征融合研究和在不同机器人具身(Jetbot、H1、Aliengo)上的模拟研究,揭示出显式热图预测显著提高了可及率(AR)。通过将目标可靠地放置在可执行的自由空间中,我们的框架有效缓解了点回归的脆弱性,提供了一种可转移的路径,朝着安全的跨具身语义导航迈进。

英文摘要

Grounding open-ended semantic instructions into physically executable local goals is a fundamental challenge in human-robot interaction. While existing navigation frameworks often regress deterministic waypoints, this rigid formulation collapses spatial uncertainty and frequently targets non-traversable object centers, leading to severe execution failures. In this work, we focus on the practical setting of in-FOV semantic navigation, where a robot receives concise, interleaved multimodal (text and image) prompts. To bridge the gap between abstract semantic intent and physical reachability, we propose a unified Vision-Language framework that abandons single-point regression in favor of a Dual-Heatmap representation. Our framework predicts a navigation affordance heatmap that captures continuous reachable regions, coupled with a facing heatmap for orientation constraints. These dense outputs inherently function as a differentiable semantic potential field, integrating seamlessly with downstream local planners. To support this paradigm, we build a fully automated, foundation-model-assisted synthetic data pipeline and establish a comprehensive simulation benchmark. Extensive experiments demonstrate that our framework achieves state-of-the-art performance among comparable 8B baselines. Crucially, a feature-fusion study and simulation studies across diverse robot embodiments (Jetbot, H1, Aliengo) reveal that explicit heatmap prediction drastically improves the Affordance Rate (AR). By placing targets reliably in executable free space, our framework effectively mitigates the brittleness of point regression, offering a transferable path toward safe cross-embodiment semantic navigation.

2605.19328 2026-05-20 cs.CR cs.RO 版本更新

RoboJailBench: Benchmarking Adversarial Attacks and Defenses in Embodied Robotic Agents

RoboJailBench: 具身机器人智能体对抗攻击与防御的基准测试

Doguhuan Yeke, Yanming Zhou, Leo Y. Lin, Hongyu Cai, Antonio Bianchi, Z. Berkay Celik

发表机构 * Purdue University(普渡大学)

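AI Summary: This paper presents RoboJailBench, a standardized evaluation framework for adversarial attacks and defenses in embodied AI. It establishes a security taxonomy, introduces an intent contrast dataset pipeline, and provides an evolving repository; the authors also construct a new taxonomy-balanced dataset and augment five existing datasets.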

详情
AI中文摘要

视觉-语言模型(VLMs)的最新进展催生了一类新的具身人工智能系统:这些模型被集成到机器人和自动驾驶车辆等物理平台中,在多样环境中解释视觉场景并执行自然语言命令。先前的研究已经提出了针对具身人工智能的越狱攻击和防御。然而,这些研究的评估依赖于临时构建的数据集和有限的指标,并且只强调攻击成功率,而忽略了安全性与执行良性命令能力之间的权衡。现有的基准和评估框架要么针对传统的聊天式模型,要么专注于具身人工智能的非对抗性安全评估;两者都没有涵盖具身人工智能系统中越狱攻击所需的对抗性风险、输入、后果和评估标准。在本文中,我们通过RoboJailBench填补这一空白,它包含三个核心组件。我们基于ISO标准、监管规则和有记录的事件建立了安全分类体系,由此得到18类具身人工智能的安全违规后果。我们引入了一个意图对比数据集管道,通过配对对抗性目标和良性目标来增强现有数据集,以同时衡量安全性和实用性。最后,我们提供了一个持续演进的存储库,包含标准化指标以及用于评估和整合新攻击与防御的统一流程。借助这个基准,我们构建了一个新的按分类体系平衡的数据集,并增强了五个现有数据集。我们整合了四种攻击和两种防御,在领先的具身VLMs上评估其性能。这个基准为具身人工智能中的越狱攻击提供了第一个标准化评估框架,并支持未来研究。我们发布了代码、数据集和相关成果,并在https://purseclab.github.io/benchmark-for-robotics-security维护了一个排行榜。

英文摘要

Recent advances in Vision-Language Models (VLMs) facilitate a new class of embodied AI systems, where these models are integrated into physical platforms, e.g., robots and autonomous vehicles, to interpret visual scenes and execute natural language commands in diverse environments. Previous research has introduced jailbreak attacks and defenses for embodied AI. Their evaluations, however, rely on ad-hoc datasets and limited metrics, and emphasize attack success while neglecting the trade-off between security and the ability to follow benign commands. Existing benchmarks and evaluation frameworks either target traditional chat-based models or focus on non-adversarial safety evaluation for embodied AI; neither captures the adversarial risks, inputs, consequences, and evaluation criteria necessary for jailbreak attacks in embodied AI systems. In this paper, we address this gap with RoboJailBench, which consists of three core components. We establish a security taxonomy derived from ISO standards, regulatory rules, and documented incidents. This effort yields 18 categories of security violation consequences for embodied AI. We introduce an intent contrast dataset pipeline that augments existing datasets with paired adversarial and benign goals to measure both security and utility. Lastly, we provide an evolving repository with standardized metrics and a unified process for assessing and integrating new attacks and defenses. With this benchmark, we construct a new taxonomy-balanced dataset and augment five existing datasets. We integrate four attacks and two defenses to evaluate their performance on leading embodied VLMs. This benchmark provides the first standardized evaluation framework for jailbreak attacks in embodied AI and supports future research. We release our code, datasets, and artifacts, and maintain a leaderboard at https://purseclab.github.io/benchmark-for-robotics-security.

2605.19314 2026-05-20 cs.RO cs.AI 版本更新

ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents

ContextFlow:长周期具身智能体的分层任务-状态对齐

Shuhan Guo, Kun Zhang, Haifei Liu, Xingyu Gao, Yongqi Zhang, Yaqing Wang, Quanming Yao

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Qiuzhen College, Tsinghua University(清华大学求真书院) Beijing Institute of Mathematical Sciences and Applications(北京雁栖湖应用数学研究院) Department of Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)数据科学与分析学域) Institute of Microelectronics, Chinese Academy of Sciences(中国科学院微电子研究所) University of Chinese Academy of Sciences(中国科学院大学)

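AI Summary: This paper studies task-state misalignment in long-horizon embodied agents and proposes ContextFlow, a framework that represents stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates to keep the task frontier aligned, making task execution more coherent and auditable.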

详情
AI中文摘要

长周期具身智能体越来越多地将导航、搜索、接近和操作任务委托给专门执行器。随着这些执行器变得更强,瓶颈从局部技能执行转移到在规划、监控、记忆和执行之间保持一致的任务前沿。我们研究了任务-状态不一致,即在任务层面一致性失败,其中规划器的活跃阶段、运行时证据、记忆上下文和委托执行器不再支持相同的下一步决策。这种失败可能导致不支持的手动交接、阶段锁定、执行器-上下文不匹配和不必要的重新规划。我们提出ContextFlow,一个可检查的对齐框架,将阶段表示为显式合同,将运行时观测转换为证据包,并应用包括继续、细化、转移、提升和修复在内的作用域更新。ContextFlow使专门执行器负责局部闭环控制,同时使任务前沿对齐显式且可审计。在长周期具身任务上的实验和演示轨迹展示了证据基础的作用域更新如何诊断和缓解反复出现的任务-状态失败。

英文摘要

Long-horizon embodied agents increasingly delegate navigation, search, approach, and manipulation to specialist executors. As these executors become stronger, the main bottleneck shifts from local skill execution to maintaining a coherent task frontier across planning, monitoring, memory, and execution. We study task-state misalignment, a task-level consistency failure in which the planner's active stage, runtime evidence, remembered context, and delegated executor no longer justify the same next-step decision. This failure can lead to unsupported handoffs, stage lock, executor-context mismatch, and unnecessary replanning. We propose ContextFlow, an inspectable alignment framework that represents stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates including continue, refine, transfer, promote, and repair. ContextFlow keeps specialist executors responsible for local closed-loop control while making task-frontier alignment explicit and auditable. Experiments and demonstration traces on long-horizon embodied tasks illustrate how evidence-grounded scoped updates diagnose and mitigate recurring task-state failures.

2605.19293 2026-05-20 cs.IT cs.LG cs.RO math.IT 版本更新

Domain-Adaptive Communication-Rate Optimization for Sim-to-Real Humanoid-Robot Wireless XR Teleoperation

领域自适应的通信速率优化用于仿真到现实的人形机器人无线XR远程操作

Caolu Xu, Zhiyong Chen, Meixia Tao, Li Song, Feng Yang, Wenjun Zhang

发表机构 * Cooperative Medianet Innovation Center(协同媒体网络创新中心) School of Information Science and Electronic Engineering(信息科学与电子工程学院) Shanghai Jiao Tong University(上海交通大学)

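AI Summary: This paper proposes a domain-adaptive communication-rate optimization method that balances reconstruction error against communication energy under sim-to-real distribution shift. It combines a PAC-Bayes generalization analysis, density-ratio-weighted PPO, and offline real-domain data correction to improve the communication efficiency and reconstruction accuracy of wireless XR teleoperation for humanoid robots.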

Comments submitted to IEEE journal

详情
AI中文摘要

无线扩展现实(XR)远程操作为收集人形机器人演示提供了具身交互能力,但大规模应用受到高频运动传输开销的限制。本文开发了一个系统框架,集成了采样、传输、插值和重建,并制定了通信速率优化,旨在通过维度采样率控制最小化通信能耗,同时保持机器人运动轨迹的重建精度。由于从物理机器人获取实时反馈受限于硬件成本,必须通过与离线真实域数据校正的仿真交互来解决问题。为了指导仿真到现实的适应,我们提供了一种PAC-Bayes泛化特性刻画,揭示了潜在密度比估计、有限样本偏差和编码器偏差的影响。基于此分析,我们提出了一种具有密度比加权和信任区域正则化的近端策略优化(PPO)方法。在公共人形远程操作数据集上的实验表明,所提出的方法在仿真到现实分布偏移中改善了重建误差和通信能耗之间的权衡。我们进一步分析了所提出算法在各种无线信道和动态运动轨迹中的有效性。

英文摘要

Wireless extended reality (XR) teleoperation provides embodied interaction capability for collecting humanoid robot demonstrations, but large-scale adoption is restricted by the overhead of high-frequency motion transmission. This paper develops a system framework that integrates sampling, transmission, interpolation, and reconstruction, and formulates a communication-rate optimization that aims to minimize the communication energy while maintaining the reconstruction accuracy of robot motion trajectories through dimension-wise sampling-rate control. Since acquiring real-time feedback from physical robots is limited by hardware costs, it is necessary to solve the problem through simulator interaction with offline real-domain data correction. To guide sim-to-real adaptation, we provide a PAC-Bayes generalization characterization that reveals the effects of latent density-ratio estimation, finite-sample deviation, and encoder bias. Building on this analysis, we propose a proximal policy optimization (PPO) method with density-ratio weighting and trust-region regularization. Experiments on a public humanoid teleoperation dataset show that the proposed method improves the tradeoff between reconstruction error and communication energy consumption under sim-to-real distribution shift. We further analyze the effectiveness of the proposed algorithm across various wireless channels and dynamic motion trajectories.

2605.19255 2026-05-20 cs.RO 版本更新

Bilateral Teleoperation with Compliant 6-DOF Pose-and-Force Sensing

基于柔顺六自由度位姿与力感知的双边遥操作

Yue Feng, Weicheng Huang, I-Ming Chen

发表机构 * Robotics Research Centre Nanyang Technological University Singapore(南洋理工大学机器人研究中心)

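AI Summary: This paper presents a Cartesian bilateral teleoperation framework built on the hardware-agnostic WinGs Operating Studio (WOS) middleware, using Delta6, a low-cost compliant 6-DOF pose-and-force sensing end-effector. The system tracks stably under delays up to 120±40 ms and 1% packet loss and matches the prescribed virtual stiffness in contact.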

Comments 8 pages, 16 figures, 2 tables. Preprint

详情
AI中文摘要

现有的双边遥操作平台仍然依赖于昂贵的刚性六轴力/力矩传感器、紧密耦合的主从硬件和千赫兹级控制回路。我们提出了一种基于硬件无关的WinGs Operating Studio(WOS)中间件的笛卡尔双边框架,其中低成本的柔顺6自由度位姿与力感知末端执行器Delta6被安装在两侧,使每个机械臂表现为一个末端6自由度串联弹性执行器(SEA)。主端运行一个仅含阻尼的导纳回路,并配有6维双二阶陷波滤波器;从端通过基于位置的外环和PID力旋量到位姿的映射实现刚度-阻尼阻抗。三个时间尺度(硬件I/O、中速阻抗/导纳、低速遥操作消息)被显式解耦,使同一应用能够驱动异构机械臂。在以150 Hz运行的Lite6/FR3测试平台上,系统在高达$120\pm40$ ms的延迟和1%丢包率下稳定跟踪,在接触时匹配规定的虚拟刚度,并在无源性测试中表现出良好的累积能量特征。

英文摘要

Existing bilateral teleoperation platforms still rely on costly rigid six-axis force/torque sensors, tightly coupled leader-follower hardware, and kilohertz control loops. We present a Cartesian bilateral framework built on the hardware-agnostic WinGs Operating Studio (WOS) middleware, in which a low-cost compliant 6-DOF pose-and-force sensing end-effector, Delta6, is mounted on both sides so that each manipulator behaves as an end-effector 6-DOF series elastic actuator (SEA). The leader runs a damping-only admittance loop with a 6-D biquad notch filter; the follower realizes a stiffness-damping impedance through a position-based outer loop with a PID wrench-to-pose mapping. Three time scales (hardware I/O, mid-rate impedance/admittance, low-rate teleoperation messages) are explicitly decoupled, enabling the same application to drive heterogeneous arms. On a Lite6/FR3 testbed at 150 Hz, the system tracks stably under delays up to $120\pm40$ ms and 1% packet loss, matches the prescribed virtual stiffness in contact, and shows a favorable cumulative energy signature in passivity-style tests.
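
The leader-side loop is described as using a biquad notch filter. As a hedged sketch, the standard second-order notch coefficients from the widely used Audio EQ Cookbook formulas can be computed as follows. The 150 Hz sample rate is taken from the abstract; the 20 Hz notch frequency and the quality factor are placeholder assumptions, since the abstract does not state the filter's tuning, and a 6-D filter would apply one such biquad per axis.

```python
import cmath
import math
from typing import Tuple

Coeffs = Tuple[float, float, float]


def notch_coefficients(f0_hz: float, fs_hz: float, q: float) -> Tuple[Coeffs, Coeffs]:
    """Return normalized (b, a) coefficients of a biquad notch filter."""
    w0 = 2.0 * math.pi * f0_hz / fs_hz      # notch frequency in rad/sample
    alpha = math.sin(w0) / (2.0 * q)        # bandwidth parameter
    a0 = 1.0 + alpha                        # normalization factor
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def magnitude_response(b: Coeffs, a: Coeffs, f_hz: float, fs_hz: float) -> float:
    """Evaluate |H(e^{jw})| of the biquad at frequency f_hz."""
    z1 = cmath.exp(-1j * 2.0 * math.pi * f_hz / fs_hz)  # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return abs(num / den)


# Assumed tuning: notch at 20 Hz with Q = 5, sampled at the 150 Hz loop rate.
b, a = notch_coefficients(f0_hz=20.0, fs_hz=150.0, q=5.0)
```

The filter passes slow motion unchanged (unit gain at DC) and rejects the component at the notch frequency, which is the usual motivation for placing a notch around a structural resonance.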

2605.19209 2026-05-20 cs.RO cs.MA 版本更新

Graph Neural Planning and Predictive Control for Multi-Robot Communication-Constrained Unlabeled Motion Planning

基于图神经网络的多机器人通信受限的无标签运动规划与预测控制

Manohari Goarin, Yang Zhou, Giuseppe Loianno

发表机构 * New York University(纽约大学) University of California Berkeley(加州大学伯克利分校)

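AI Summary: This paper proposes a hierarchical framework that combines a graph attention planner with a decentralized nonlinear model predictive controller to assign goals and generate safe trajectories for multiple robots concurrently under communication constraints, using graph neural networks for a scalable decentralized solution.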

Comments 8 pages, 6 figures, Accepted at the IEEE International Conference on Robotics and Automation (ICRA) 2026

详情
AI中文摘要

多机器人同时分配目标并生成安全轨迹的无标签运动规划问题在许多协作任务中至关重要。最近的图神经网络方法提供了可扩展的去中心化解决方案,但依赖于简化动力学和模拟环境,忽略了现实部署中的关键挑战,如动态可行性和通信限制。为了解决这些差距,我们提出了一种分层框架,结合图注意力规划器(GATP)和分布式非线性模型预测控制器(NMPC)。GATP通过多机器人协作提供中间子目标,而NMPC在非线性动力学和执行约束下强制安全。我们评估了该框架在仿真和真实世界四旋翼实验中的性能。得益于注意力机制和最小通信需求,我们展示了在更大团队中改进的泛化能力、对通信延迟高达200毫秒的鲁棒性以及实用可行性,具有去中心化的板载推理。

英文摘要

The multi-robot unlabeled motion planning problem of concurrently assigning robots to goals and generating safe trajectories is central in many collaborative tasks. Recent Graph Neural Network methods offer scalable decentralized solutions but rely on simplified dynamics and simulation environments, overlooking key challenges of real-world deployment such as dynamic feasibility and communication constraints. To address these gaps, we propose a hierarchical framework that combines a Graph ATtention Planner (GATP) with a decentralized Nonlinear Model Predictive Controller (NMPC). GATP provides intermediate subgoals through multi-robot cooperation, and the NMPC enforces safety under nonlinear dynamics and actuation constraints. We evaluate our framework in both simulation and real-world quadrotor experiments. Thanks to attention mechanisms and minimal communication requirements, we demonstrate improved generalization to larger teams, robustness to communication delays up to 200 ms, and practical feasibility with decentralized on-board inference.

2605.19206 2026-05-20 cs.RO 版本更新

CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation

CLUE: 利用统一语义地图自适应确定上下文线索优先级,实现高效的零样本物体目标导航

Taeyun Kim, Alvin Jinsung Choi, Dasol Hong, Hyun Myung

发表机构 * School of Electrical Engineering, KAIST(韩国科学技术院电子工程学院)

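AI Summary: CLUE addresses zero-shot object-goal navigation by adaptively prioritizing contextual cues over a unified semantic map, improving the robustness and efficiency of navigation.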

Comments 8 pages, 5 figures

详情
AI中文摘要

零样本物体-目标导航(ZSON)是机器人领域具有挑战性的问题,需要对语言和视觉观察有全面的理解。房间和物体的上下文线索至关重要,但它们的相对重要性取决于目标:一些物体与特定房间类型紧密相关,而另一些物体则更可能由附近共存的物体预测。现有方法忽略了这一区别,导致探索效率低下且不准确。我们提出了CLUE,一种新的导航框架,通过利用从离线大型语言模型(LLM)提取的常识知识,适应性地平衡使用上下文房间和物体。通过使用LLM估计目标与房间类型的关联性,代理优先使用房间线索预测强关联的目标,使用物体线索预测弱关联的目标。我们的框架构建了一个统一的语义价值地图,整合了两种类型的上下文信息,并根据目标的模糊性进行自适应加权,以指导探索。结合多视角验证和由上下文线索指导的探索策略,CLUE实现了稳健且高效的导航。在模拟和真实世界部署中的大量实验表明,我们的方法在成功率(SR)和按路径长度加权的成功率(SPL)上均优于最先进的基线方法,证明了其在实际导航任务中的有效性和实用性。

英文摘要

Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target's association with room types using the LLM, the agent prioritizes room cues for predictable objects and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target's ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.

2605.19202 2026-05-20 cs.RO cs.AI math.OC 版本更新

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

通过基于强化学习的四旋翼控制实现空中巡检行为:在树冠下森林环境中的应用

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Viswa Narayanan Sankaranarayanan, George Nikolakopoulos

发表机构 * Robotics and AI group, in the Department of Computer Science, Electrical and Space Engineering at Luleå University of Technology, Sweden(瑞典吕勒奥理工大学计算机科学、电气与空间工程系机器人与人工智能小组)

AI总结 本文提出了一种基于深度强化学习的四旋翼控制器,用于在树冠下森林环境中进行自主巡检任务,通过端到端控制策略实现巡检视角姿态跟踪,并结合旅行商问题规划器和快速随机树星规划器确保长距离任务的安全可靠部署。

Comments Submitted to 2026 IEEE 22nd International Conference on Automation Science and Engineering

详情
AI中文摘要

本文针对在树冠下森林环境中使用基于深度强化学习(RL)的低级四旋翼控制器进行空中巡检任务的问题进行了研究。具体而言,本文提出了一种端到端(将状态映射到RPMs)的四旋翼控制策略,实现了巡检视角姿态跟踪(同时位置和偏航参考跟踪),这对于各种目标巡检行为和森林中的点对点导航至关重要。为确保在长距离任务中端到端RL控制器的安全可靠部署,本文利用了一个包含旅行商问题规划器(TSP)和快速随机树星规划器(RRT*)的更高导航指导层。在已知的森林地图和一组用户指定的巡检区域上,TSP规划器找到最优访问序列。在两个目标区域之间,RRT*规划器生成符合下层端到端RL策略跟踪限制的碰撞自由路径。通过五个目标巡检场景,本文证明了基于强化学习的电机级稳定控制器,结合导航指导层,可以有效用作树冠下森林巡检任务的低级巡检执行模块。

英文摘要

This paper addresses the problem of using a deep reinforcement learning (RL)-based low-level quadrotor controller within an autonomous quadrotor navigation stack for aerial inspection missions in under-canopy forest environments. Specifically, the paper presents an end-to-end (mapping states to RPMs) quadrotor control policy that achieves inspection view-pose tracking (simultaneous position and yaw reference tracking), which is crucial for various target inspection behaviors and point-to-point navigation in forests. To ensure safe and reliable deployment of the end-to-end RL controller in long-range missions, the paper uses a higher-level navigation guidance layer comprising a Traveling Salesman Problem (TSP) planner and a Rapidly-exploring Random Tree Star (RRT*) planner. Given a known map of a forest and a set of user-specified inspection regions, the TSP planner finds the optimal visitation sequence. Between two target regions, an RRT* planner generates collision-free paths that respect the tracking limitations of the lower-level end-to-end RL policy. Through five target inspection scenarios, the paper demonstrates that an RL-based motor-level stabilizing controller, supported by a navigation guidance layer, can be used effectively as the low-level inspection execution module for under-canopy forest inspection missions.
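
The visitation-sequence step can be illustrated with a small sketch. The abstract does not say which TSP solver the authors use, so the brute-force search below, the region coordinates, and the function name `best_visit_order` are all assumptions made for illustration. Straight-line distance stands in for whatever path cost the actual planner uses.

```python
from itertools import permutations
from math import dist


def best_visit_order(start, regions):
    """Return the shortest open tour from `start` through every region.

    Brute force over all orderings; only practical for a handful of regions.
    """
    best_order, best_len = None, float("inf")
    for order in permutations(range(len(regions))):
        length, here = 0.0, start
        for i in order:
            length += dist(here, regions[i])
            here = regions[i]
        if length < best_len:
            best_order, best_len = order, length
    return list(best_order), best_len


# Hypothetical inspection-region centers on a 2D forest map (meters).
regions = [(10.0, 0.0), (0.0, 5.0), (10.0, 5.0)]
order, length = best_visit_order((0.0, 0.0), regions)
```

In a pipeline like the one described, each consecutive pair in `order` would then be handed to a path planner such as RRT* to obtain a collision-free path between the two regions.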

2605.19166 2026-05-20 cs.RO cs.LG math.OC 版本更新

A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

一种通过奖励设计和终止条件实现基于RL的四旋翼控制性能调优的启发式方法

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, George Nikolakopoulos

发表机构 * Robotics and AI group, in the Department of Computer Science, Electrical and Space Engineering at Luleå University of Technology(吕勒奥理工大学计算机科学、电气与空间工程系机器人与人工智能小组)

AI总结 本文提出了一种新的启发式方法,通过奖励设计和终止条件实现RL四旋翼控制的可调性能,该方法通过双带宽指数奖励结构实现了设定点跟踪的临界阻尼响应,并具有低稳态误差。在使用近端策略优化(PPO)算法训练时,结合episode截断条件,在600万次时间步内以高效的方式实现了所需性能。通过直观的启发式规则调整奖励权重和指数系数,可以实现更快(空翻式)和更慢(检查式)的稳定时间性能,同时保留基线临界阻尼响应和约2%的稳态误差。

Comments Accepted in the 34th Mediterranean Conference on Control and Automation

详情
AI中文摘要

基于强化学习(RL)的四旋翼控制策略在复杂环境中的快速导航和无人机竞速等任务中取得了显著性能,这类任务的重点在于速度和敏捷性。然而,在某些应用中,如基础设施检查,实现精确、可控的机动并具有可调性能至关重要。本文提出了一种新的启发式方法,通过奖励设计和终止条件实现基于RL的四旋翼控制的可调性能。我们提出了一种包含双带宽指数的新型奖励结构,实现了设定点跟踪的基线临界阻尼响应,并具有低稳态误差。当使用近端策略优化(PPO)算法进行训练时,结合episode截断条件,在600万次时间步内以样本高效的方式实现了所需性能。为了围绕基线行为调节性能,我们提出了直观的启发式规则来调整奖励权重和指数系数,以实现更快(空翻式)和更慢(检查式)的稳定时间性能,同时保留基线临界阻尼响应和大约2%的稳态误差。我们评估了三种RL策略(基线、空翻和检查)在100次试验中的表现,并展示了在随机初始条件下位置和偏航跟踪的准确且可调性能,从而证明了所提出启发式方法的有效性。

英文摘要

Reinforcement learning (RL)-based quadrotor control policies have achieved impressive performance in tasks such as fast navigation in cluttered environments and drone racing, where the focus is on speed and agility. However, in several applications, such as infrastructure inspection, it is critical to achieve precise, controlled maneuvers with tunable performance. In this article, we present a novel heuristic approach to achieve tunable performance in RL-based quadrotor control through reward design and termination conditions. We present a novel reward structure containing dual-bandwidth exponentials that achieves a baseline critically damped response in setpoint tracking, with low steady-state errors. When trained with a Proximal Policy Optimization (PPO) algorithm, in conjunction with episode truncation conditions, the desired performance is achieved in 6 million time steps in a sample-efficient manner. To tune the performance about the baseline behavior, we present intuitive heuristic rules to adjust the reward weights and exponential coefficients to achieve faster (acrobatic-like) and slower (inspection-like) settling time performance, while retaining the baseline critically damped response and approximately 2% steady-state error. We evaluate the three RL policies (baseline, acrobatic, and inspection) across 100 trials and show accurate and tunable performance in position and yaw tracking from random initial conditions, thereby demonstrating the effectiveness of the proposed heuristic approach.
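
One way to read "dual bandwidth exponentials" is as the sum of a wide and a narrow exponential of the tracking error: the wide term gives a usable learning signal far from the setpoint, and the narrow term rewards small steady-state error. The abstract gives no formula, so the functional form, the weights `w_wide` and `w_narrow`, and the coefficients `k_wide` and `k_narrow` below are assumptions for illustration, not the authors' reward.

```python
import math


def dual_bandwidth_reward(error, w_wide=0.5, k_wide=0.5, w_narrow=0.5, k_narrow=20.0):
    """Hypothetical reward: wide + narrow exponential of the squared error."""
    e2 = error * error
    wide = w_wide * math.exp(-k_wide * e2)        # broad term, still informative far away
    narrow = w_narrow * math.exp(-k_narrow * e2)  # sharp term, peaks near zero error
    return wide + narrow


# Reward at a few hypothetical tracking errors (meters).
rewards = [dual_bandwidth_reward(e) for e in (0.0, 0.1, 0.5, 2.0)]
```

Under this reading, the reward weights and exponential coefficients are the knobs the heuristic rules would adjust. Whether larger `k_narrow` or `w_narrow` corresponds to the paper's acrobatic-like or inspection-like tuning is not stated in the abstract.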

2605.19136 2026-05-20 cs.RO 版本更新

Automatically Improving Simulation Physics for Articulated Objects

自动提升仿真的物理特性用于关节物体

Anh-Quan Pham

发表机构 * Penn(宾夕法尼亚大学) PennPAL Lab(宾夕法尼亚大学PAL实验室)

AI总结 本文研究了如何通过量化评估框架和多模态仿真反馈方法,提升关节物体在仿真中的物理真实性和稳定性,从而提高机器人学习的效率和效果。

详情
AI中文摘要

仿真是可扩展机器人学习的核心工具,但其效果取决于物体资产的质量。尽管现代3D数据集提供了丰富的几何和运动学表示,但通常缺乏用于稳定和真实交互所需的物理属性,需要大量手动工作来构建仿真准备的关节物体。在本论文中,我们引入了交互准备性,它表征了物体在操作下是否可以可靠地仿真。我们提出了一种定量评估框架,将交互准备性分解为可测量的组成部分,从而系统分析物体质量并揭示传统评估未捕获的失败模式。我们进一步提出了一个多模态、仿真循环的方法,从不完整的3D资产中生成交互准备的关节物体。该方法整合了几何、视觉和语义信息来推断物理属性,并通过迭代仿真反馈来优化这些属性,以提高物理一致性。在多样化的关节物体和操作任务上的实验表明,物体质量直接影响仿真稳定性、交互行为和策略性能。经过我们方法优化的物体表现出更稳定和真实的动态,从而实现了更可靠的下游学习和评估。总体而言,本论文展示了关节物体在仿真中的物理真实性的重要性,并引入了一种由仿真反馈指导的实用多模态优化方法,用于大规模构建此类物体。

英文摘要

Simulation is a central tool for scalable robot learning, but its effectiveness depends on the quality of object assets. While modern 3D datasets provide rich geometric and kinematic representations, they typically lack the physical properties required for stable and realistic interaction, requiring significant manual effort to construct simulation-ready articulated objects. In this thesis, we introduce interaction-readiness, which characterizes whether an object can be reliably simulated under manipulation. We propose a quantitative evaluation framework that decomposes interaction-readiness into measurable components, enabling systematic analysis of object quality and revealing failure modes not captured by conventional evaluation. We further present a multi-modal, simulator-in-the-loop approach for generating interaction-ready articulated objects from incomplete 3D assets. The method integrates geometric, visual, and semantic information to infer physical properties and refines them through iterative simulator feedback to improve physical consistency. Experiments across diverse articulated objects and manipulation tasks show that object quality directly impacts simulation stability, interaction behavior, and policy performance. Objects refined by our method exhibit more stable and realistic dynamics, enabling more reliable downstream learning and evaluation. Overall, this thesis demonstrates the importance of physical realism for articulated objects in simulation and introduces a practical multi-modal refinement approach, guided by simulator feedback, for constructing such objects at scale.

2605.19120 2026-05-20 cs.RO 版本更新

CosFly: Plan in the Matrix, Fly in the World

CosFly:矩阵中的计划,世界中的飞行

Hanxuan Chen, Xiangyue Wang, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Binbo Li, Kangli Wang, Ji Pei

发表机构 * Autel Robotics(Autel机器人公司) Nanjing University(南京大学) Peking University(北京大学) Southern University of Science and Technology(南方科技大学) University of Hong Kong(香港大学)

AI总结 本文提出CosFly,一个用于空中跟踪的盒状结构规划和多模态模拟流程,以及CosFly-Track大规模无人机数据集,用于在多样环境中动态目标跟踪。CosFly通过将复杂的3D世界转换为结构化障碍表示进行规划,然后将轨迹投影到多模态传感器数据中,并支持可配置的固定视角缩放级别。

详情
AI中文摘要

我们介绍了CosFly,一个用于空中跟踪的盒状结构规划和多模态模拟流程,以及CosFly-Track,一个大规模的无人机数据集,用于在多样环境中进行动态目标跟踪。在我们的当前实现上,CosFly提供了一个模块化的7步构建流程,将复杂的3D世界转换为结构化的障碍表示用于规划,然后将结果轨迹投影到多模态传感器数据中,包括RGB图像、高精度深度图和语义分割掩码,并配以自然语言导航指令。一个关键特点是支持可配置的固定视角缩放级别(每个轨迹一个视角设置并保持恒定),通过相机内参数调整模拟各种焦距。该流程涵盖了从3D地图导出通过网格简化、行人和无人机轨迹规划、多模态渲染(6自由度姿态注释)、质量检查以及教师-学生描述生成的完整流程。我们分析了两种轨迹规划范式:传统的两阶段流程(前端候选生成和后端细化)以及直接基于梯度的公式,该公式在单一目标中优化多个跟踪约束。公开的CosFly-Track发布包含250条经过验证的轨迹和约10万张渲染图像,具有完整的6自由度无人机姿态注释(位置x、y、z和方向偏航、俯仰、滚动)。共同,该流程和数据集建立了一个可扩展的基础,支持在多样环境中进行空中-地面协同研究,支持动态目标跟踪、无人机导航和多模态感知。

英文摘要

We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular 7-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multi-modal sensor data -- including RGB images, high-precision depth maps, and semantic segmentation masks -- paired with natural language navigation instructions. A key feature is the support for configurable fixed-FOV zoom levels (one FOV setting drawn per trajectory and held constant throughout), enabling simulation of various focal lengths through camera-intrinsic adjustments. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multi-modal rendering with 6-DOF pose annotations, quality inspection, and teacher-student caption generation. We analyze two trajectory-planning paradigms for aerial target tracking: a conventional two-stage pipeline with front-end candidate generation and backend refinement, and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception across diverse environments.

2605.19104 2026-05-20 cs.RO cs.AI 版本更新

Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots

神经运算符用于腱驱动连续机器人设计空间的代理建模

Branden Frieden, James M. Ferguson, Alan Kuntz, Varun Shankar

发表机构 * The Robotics Center and the Kahlert School of Computing at the University of Utah(犹他大学机器人中心和Kahlert计算学院) The Departments of Computer Science and Electrical and Computer Engineering at Vanderbilt University(范德比大学计算机科学与电气与计算机工程系)

AI总结 本文提出了一种基于神经运算符的学习方法,用于腱驱动连续机器人的设计空间代理建模,通过映射机器人设计参数和腱驱动输入到最终配置,实现跨大量机器人设计的泛化能力。

Comments Accepted to ICRA 2026

详情
AI中文摘要

连续机器人能够在受限环境中实现灵活的操作,但需要准确且高效的模型用于实时操作和控制。传统物理模型可能计算成本高且因未建模效应导致不准确,而当前基于学习的方法在特定机器人上泛化能力差。本文提出将腱驱动连续机器人代理建模作为运算符学习问题,将机器人设计参数和腱驱动输入映射到最终配置。该方法使单个训练模型能够跨大量机器人设计泛化。我们开发了四种新型神经运算符架构--两种基于深度运算符网络(DeepONets)和两种基于傅里叶神经运算符(FNOs)--并训练它们在仿真数据上预测机器人配置。所有架构均实现良好的准确性,同时允许快速且准确地跨设计泛化。我们的结果表明,运算符学习为连续机器人力学在设计空间中的代理建模提供了有效且可泛化的解决方案,使在手术和工业应用中控制、规划和设计优化能够快速建模。

英文摘要

Continuum robots enable dexterous manipulation in constrained environments, but require accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and may suffer from inaccuracies due to unmodeled effects, while current learning-based methods often generalize poorly beyond the specific robot on which they are trained. We present a formulation of surrogate modeling for tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to resulting configurations. This formulation enables a single trained model to generalize across a large class of robot designs. We develop four novel neural operator architectures--two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs)--and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing for fast and accurate generalization across designs. Our results demonstrate that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.

2605.19038 2026-05-20 cs.RO cs.LG 版本更新

Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic

用时空逻辑引导神经符号场景生成

Lorenzo Bonin, Francesco Giacomarra, Luca Bortolussi, Jyotirmoy V. Deshmukh, Francesca Cairoli

发表机构 * University of Trieste(特里埃斯特大学) University of Southern California(南加州大学)

AI总结 本研究提出STRELGen框架,结合扩散模型和时空逻辑规范,高效生成安全关键的多智能体驾驶场景,提升自动驾驶系统的鲁棒性验证能力。

详情
AI中文摘要

自动驾驶技术的快速发展已远超安全评估方法的进展。传统测试依赖于暴露自动驾驶系统于大量真实交通场景,这是一种成本高昂且统计上无法有效捕捉罕见安全关键边缘情况的暴力方法。为解决这一根本限制,我们引入STRELGen,一个可扩展的框架,用于目标生成安全关键的驾驶场景。STRELGen协同结合多智能体轨迹生成扩散模型(DM)与通过高度可解释的形式化方法编码复杂安全和现实属性的时空逻辑(STREL)规范。关键在于监控这些规范的满足程度是可微的,从而允许基于梯度的搜索。在推理时间,我们直接优化DM的潜在空间以最大化STREL公式满足程度。结果是高效生成高度可信且安全关键的多智能体场景,这些场景位于学习的数据分布内。STRELGen因此提供了一种灵活、可解释且强大的工具,用于对自动驾驶系统进行压力测试,超越了暴力数据收集的限制。

英文摘要

The rapid advancement of autonomous driving (AD) technologies has outpaced the development of robust safety evaluation methods. Conventional testing relies on exposing AD systems to vast numbers of real-world traffic scenes -- a brute-force approach that is prohibitively expensive and statistically ineffective at capturing the rare, safety-critical edge cases essential for validating real-world robustness. To address this fundamental limitation, we introduce STRELGen, a scalable framework for the targeted generation of safety-critical driving scenarios. STRELGen synergistically combines a multi-agent trajectory-generation diffusion model (DM) with Spatio-Temporal Logic (STREL) specifications that encode complex safety and realism properties through a highly interpretable formalism. Crucially, monitoring satisfaction levels of these specifications is differentiable, enabling gradient-based search. At inference time, we optimize directly over the DM latent space to maximize STREL formula satisfaction. The result is efficient generation of highly plausible yet safety-critical multi-agent scenarios that lie within the learned data distribution. STRELGen thus provides a flexible, interpretable, and powerful tool for stress-testing autonomous driving systems, moving beyond the limitations of brute-force data collection.

2605.19033 2026-05-20 cs.RO cs.AI cs.CV cs.LG cs.MA 版本更新

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

RLFTSim: 通过强化学习微调实现逼真且可控的多智能体交通仿真

Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee

发表机构 * University of Alberta(阿尔伯塔大学) Huawei Technologies Canada(华为加拿大技术有限公司) York University(约克大学) Canada CIFAR AI Chair, Amii(加拿大 CIFAR 人工智能主席,Amii)

AI总结 本文提出RLFTSim框架,通过强化学习微调提升交通仿真场景的真实感,并通过目标条件化方法实现对交通仿真可控性的提炼,实验表明其在真实感和可控性方面均优于其他启发式搜索方法。

Comments CVPR 2026 Highlight; Project page at https://ehsan-ami.github.io/rlftsim

详情
AI中文摘要

监督式开环训练已被广泛用于训练交通仿真模型;然而,它无法捕捉复杂驾驶场景中固有的动态性和多智能体交互。我们引入RLFTSim,一种基于强化学习的微调框架,通过将模拟器运行与真实世界数据分布对齐来增强场景真实性,并提供一种方法用于在场景生成中提炼目标条件化的可控性。我们基于预训练的仿真模型实例化RLFTSim,设计一种平衡保真度和可控性的奖励函数,并在Waymo Open Motion Dataset上进行了全面实验。我们的结果表明在真实感方面取得了改进,实现了最先进的性能。与其它基于启发式搜索的微调方法相比,RLFTSim由于提出了一种低方差且密集的奖励信号,所需样本显著更少,并且通过设计直接解决了真实感对齐问题。我们还通过目标条件化展示了我们方法在提炼交通仿真可控性方面的有效性。项目页面可在https://ehsan-ami.github.io/rlftsim上访问。

英文摘要

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.

2605.19029 2026-05-20 cs.RO 版本更新

Distributionally Robust Control via Stein Variational Inference for Contact-Rich Manipulation

通过Stein变分推断进行分布鲁棒控制的接触丰富操作

Hrishikesh Sathyanarayan, Victor Vantilborgh, Harish Ravichandar, Tom Lefebvre, Ian Abraham

发表机构 * Department of Mechanical Engineering, Yale University(耶鲁大学机械工程系) Department of Electromechanical, Systems and Metal Engineering, Ghent University(根特大学机电系统与金属工程系) School of Interactive Computing, Georgia Institute of Technology(佐治亚理工学院交互计算学院) Department of Electrical Engineering, University of Sydney(悉尼大学电子工程系)

AI总结 本文提出了一种基于Stein变分推断的分布鲁棒控制方法,用于提升接触丰富操作中的不确定性建模能力,通过更灵活的不确定性建模在保持性能的同时精确适应不确定性,实验结果表明在广泛参数不确定性下,鲁棒性提高了3倍。

Comments In Proceedings of Robotics: Science and Systems, Sydney, Australia, July 2025

详情
AI中文摘要

可靠的机器人操作需要能够准确表示和适应来自接触丰富交互中不确定性的控制策略。现代数据驱动方法通过大规模训练和计算来缓解不确定性,但在训练样本有限时性能显著下降。相比之下,经典模型驱动控制器计算高效且可靠,但其对任务相关不确定性的有限表示能力会阻碍接触丰富交互的性能。在本文中,我们提出通过更灵活的不确定性建模来扩展模型驱动操作控制的能力,该方法将操作问题转化为分布鲁棒控制优化,并提出一种基于Stein变分推断的新确定性公式,该公式在保持性能的同时显式建模任务敏感的参数不确定性。结果表明,所得到的控制器更加关注任务对不确定性的敏感性,从而在不牺牲性能的情况下获得高可靠性。实验结果表明,在广泛参数不确定性下,接触丰富操作任务的鲁棒性提高了3倍,优于现有模型驱动控制方法。

英文摘要

Reliable robotic manipulation requires control policies that can accurately represent and adapt to uncertainty arising from contact-rich interactions. Modern data-driven methods mitigate uncertainty through large-scale training and computation, but their performance degrades significantly when training samples are limited. By contrast, classical model-based controllers are computationally efficient and reliable, but their limited ability to represent task-relevant uncertainty can hinder performance in contact-rich interactions. In this work, we propose to expand the capabilities of model-based manipulation control through more flexible uncertainty modeling that retains performance while exactly adapting to uncertainty. Our approach casts the manipulation problem as a distributionally robust control optimization and proposes a novel deterministic formulation based on Stein variational inference that preserves performance while explicitly modeling task-sensitive parameter uncertainty. As a result, the derived controllers are more aware of task sensitivities to uncertainty, yielding high reliability without compromising performance. Experimental results demonstrate up to 3× improved robustness across a range of contact-rich manipulation tasks under broad parametric uncertainty, outperforming existing model-based control methods.

2605.15975 2026-05-20 cs.AI cs.RO 版本更新

Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

在符号世界模型上学习双层策略以实现长周期规划

Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith

发表机构 * Vector Institute(向量研究所) University of Toronto(多伦多大学) LAAS-CNRS University of Toulouse(图卢兹大学) RWTH Aachen University(亚琛工业大学)

AI总结 本文提出了一种结合低层模仿学习和高层符号抽象的双层策略,用于解决长周期规划问题,通过BISON系统在扩展的MetaWorld基准上验证了其在处理大量物体和长周期任务上的优越性。

详情
AI中文摘要

我们解决了构建具有身体智能的AI代理以可靠解决长周期规划问题的挑战。模仿学习从演示中已显示出在训练机器人解决需要精细运动控制和操作的复杂任务方面的有效性。然而,仅通过模仿学习生成长周期计划仍然是一个艰巨的挑战。相比之下,高层(HL)符号抽象能够促进高效且可解释的长周期规划。我们提出结合低层(LL)模仿学习在操作和控制中的优势,以及高层符号抽象在长周期规划中的优势。我们通过双层策略(π^hl, π^ll)实现这一想法,其中包括从低层演示中学习的神经策略π^ll,以及由低层演示的符号抽象和归纳概括结合而成的高层符号策略π^hl。我们实现了这些想法的BISON系统。在扩展的MetaWorld基准上的实验表明,BISON能够泛化到长周期和更多物体数量的问题,比VLA和端到端方法更高效,并且在训练和推理中更节省时间和内存。值得注意的是,当忽略低层执行时,BISON的高层策略可以在一分钟内解决包含10,000个相关物体的高层问题。项目页面:https://dillonzchen.github.io/bison

英文摘要

We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via bilevel policies of the form $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$, consisting of a neural policy $\pi^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $\pi^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison

2605.15336 2026-05-20 cs.RO cs.AI 版本更新

HoloMotion-1 Technical Report

HoloMotion-1 技术报告

Maiyue Chen, Kaihui Wang, Bo Zhang, Xihan Ma, Zhiyuan Yang, Yi Ren, Qijun Huang, Zihao Zhu, Yucheng Wang, Zhizhong Su

发表机构 * Horizon Robotics

AI总结 本文提出HoloMotion-1,一种用于零样本全身运动追踪的人形运动基础模型,通过大规模混合运动语料库训练控制策略,提升了运动行为的多样性和准确性,实现了对多种运动类型和捕捉条件的鲁棒泛化。

Comments 20 pages, 4 figures, 6 tables. Technical report

详情
AI中文摘要

在本报告中,我们介绍了HoloMotion-1,一种用于零样本全身运动追踪的人形运动基础模型。HoloMotion-1的关键创新在于利用大规模混合运动语料库进行控制策略训练,其中来自真实视频重建的运动提供了运动多样性的主要来源,而经过精心挑选的运动捕捉数据和内部运动数据则提供了更高保真度的监督和面向部署的覆盖范围。这种数据模式使HoloMotion-1超越了传统仅依赖运动捕捉的训练,并使策略能够接触更广泛的行为、捕捉条件和运动风格。从这种异构数据中学习引入了新的挑战,包括重建噪声、源域不匹配、运动质量不均以及在大行为变化下的时间建模需求。为了解决这些挑战,HoloMotion-1集成了大容量时间建模、具有稀疏激活的专家混合变压器以及KV缓存推理用于实时控制,并采用序列级训练策略,提高了在扩展运动序列上的学习效率。在多个未见过的运动基准测试中,HoloMotion-1在多样化的运动类型和捕捉条件下表现出鲁棒的泛化能力,显著提高了跟踪精度,且能够直接转移到真实的人形机器人上,无需特定任务的微调。

英文摘要

In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.

2605.13646 2026-05-20 cs.RO cs.AI 版本更新

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

基于因果性的端到端自动驾驶:通过以自身为中心的联合场景建模

Seokha Moon, Minseung Lee, Joon Seo, Jinkyu Kim, Jungbeom Lee

发表机构 * Korea University(韩国大学) Kakao Mobility

AI总结 本文提出CaAD框架,通过共享潜在场景表示捕捉车辆与周围代理之间的因果依赖关系,以提高端到端自动驾驶的闭环规划性能。

详情
AI中文摘要

端到端自动驾驶通过直接从传感器输入预测未来轨迹,跳过了传统模块化流水线,近年来取得了显著进展。然而,现有方法往往忽视了车辆规划中的因果依赖关系,忽略了车辆与周围代理之间的相互关系。这种因果忽视导致轨迹预测不一致且不可靠,特别是在需要交互的关键场景中,车辆决策和邻近代理行为必须联合推理。为了解决这一限制,我们提出了CaAD,一个基于因果的端到端自动驾驶框架,该框架在共享的潜在场景表示中捕捉这些依赖关系。首先,我们提出一个以自身为中心的联合因果建模模块,基于边缘预测分支,并学习车辆与相关交互代理之间的因果依赖关系。其次,我们采用因果意识的策略对齐阶段,通过联合模式嵌入来对齐随机的车辆策略与从周围交通和地图上下文中计算出的规划导向闭环反馈。在Bench2Drive和NAVSIM基准上,CaAD展示了强大的闭环规划性能,分别在Bench2Drive上实现了87.53的驾驶分数和71.81的成功率,在NAVSIM上实现了91.1的PDMS。项目页面可在https://moonseokha.github.io/CaAD/上获取。

英文摘要

End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and neighboring agent behaviors must be reasoned about jointly. To address this limitation, we propose CaAD, a Causality-aware end-to-end Autonomous Driving framework that captures these dependencies within a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch, and learns causal dependencies between the ego vehicle and interaction-relevant agents. Second, we employ a causality-aware policy alignment stage implemented with joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD demonstrates strong closed-loop planning performance, achieving a Driving Score of 87.53 and Success Rate of 71.81 on Bench2Drive, and a PDMS of 91.1 on NAVSIM. The project page is available at https://moonseokha.github.io/CaAD/.

2605.12974 2026-05-20 cs.RO cs.SY eess.SY 版本更新

Distributionally Robust Safety Under Arbitrary Uncertainties: A Safety Filtering Approach

在任意不确定性下的分布鲁棒安全:一种安全过滤方法

Daniel M. Cherenson, Haejoon Lee, Taekyung Kim, Dimitra Panagou

AI总结 本文研究如何在分布模糊性下确保非线性系统的概率安全,提出了一种基于备份的安全过滤框架,通过在高性能名义策略和认证备份策略之间切换来保证安全,并采用分布鲁棒方法处理任意不确定性,通过采样方法验证了方法的有效性。

Comments 10 pages, 4 figures, submitted to IEEE Robotics and Automation Letters (RA-L); Project Page: https://dcherenson.github.io/drs-gk

详情
AI中文摘要

在本文中,我们研究如何在分布模糊性下确保非线性系统的概率安全。我们的方法建立在一种基于备份的安全过滤框架之上,该框架在高性能的名义策略和经过认证的备份策略之间切换以确保安全。为了处理来自模糊分布的任意不确定性,即分布不具有特定结构且真实分布未知的情况,我们采用分布鲁棒(DR)方法,使用Wasserstein模糊集。我们没有在线求解高维的DR轨迹优化问题,而是利用基于备份的安全过滤的结构,将安全认证简化为对名义策略与备份策略之间切换时间的一维搜索。然后,我们开发了一种具有有限样本保证的基于采样的认证程序,其中将经验失败概率与经Wasserstein膨胀的阈值进行比较。我们在三个系统上通过仿真验证了该方法,从Dubins车辆到高速赛车和战斗机,展示了该方法的广泛适用性和计算效率。

英文摘要

In this work, we study how to ensure probabilistic safety for nonlinear systems under distributional ambiguity. Our approach builds on a backup-based safety filtering framework that switches between a high-performance nominal policy and a certified backup policy to ensure safety. To handle arbitrary uncertainties from ambiguous distributions, i.e., where the distribution has no specific structure and the true distribution is unknown, we adopt a distributionally robust (DR) formulation using Wasserstein ambiguity sets. Rather than solving a high-dimensional DR trajectory optimization problem online, we exploit the structure of backup-based safety filtering to reduce safety certification to a one-dimensional search over the switching time between nominal and backup policies. We then develop a sampling-based certification procedure with finite-sample guarantees, where empirical failure probabilities are compared against a Wasserstein-inflated threshold. We validate our method through simulations across three systems, from a Dubins vehicle to a high-speed racing car and a fighter jet, demonstrating its broad applicability and computational efficiency.
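
The reduction to a one-dimensional search over the switching time can be pictured with a toy sketch. Everything concrete here is an assumption: the 1D "drive toward a wall, then brake" dynamics, the uniform speed disturbance, the names `failure_rate` and `latest_certified_switch`, and the scalar `inflation` term, which merely stands in for the Wasserstein-based adjustment whose actual form the abstract does not give. The sketch scans candidate switching times and keeps the latest one whose inflated empirical failure rate stays within the risk budget `delta`.

```python
import random


def failure_rate(switch_time, speeds, wall=10.0, brake_time=1.0):
    """Fraction of sampled rollouts that hit the wall (toy 1D model).

    Each rollout drives at speed v until `switch_time`, then the backup
    policy needs `brake_time` more seconds of travel at v to stop.
    """
    failures = sum(1 for v in speeds if v * (switch_time + brake_time) > wall)
    return failures / len(speeds)


def latest_certified_switch(candidates, speeds, delta=0.05, inflation=0.02):
    """Scan switching times; keep the latest whose inflated failure rate <= delta."""
    certified = None
    for t in sorted(candidates):
        if failure_rate(t, speeds) + inflation <= delta:
            certified = t
    return certified


rng = random.Random(0)
speeds = [rng.uniform(0.8, 1.2) for _ in range(2000)]  # sampled disturbance realizations
candidates = [0.5 * k for k in range(25)]               # switching times 0.0 .. 12.0 s
t_star = latest_certified_switch(candidates, speeds)
```

The point of the sketch is only the shape of the computation: the search variable is a single scalar, so certification costs one batch of sampled rollouts per candidate switching time instead of a trajectory optimization.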

2605.08879 2026-05-20 cs.RO 版本更新

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

通过保守监督微调在流匹配视觉-语言-动作中保持基础能力

Tianyi Zhang, Shaopeng Zhai, Haoran Zhang, Fuxian Huang, Qi Zhang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 本文提出保守监督微调(ConSFT)方法,旨在通过动态调整学习信号来减少流匹配视觉-语言-动作模型在微调过程中对预训练能力的损害,从而在不依赖先验数据或架构开销的情况下提升模型在目标分布上的适应性和能力保留。

Comments 20 pages, 9 figures

详情
AI中文摘要

无约束的流匹配视觉-语言-动作(VLA)模型微调会导致参数过度覆盖,从而降低预训练能力。我们提出了保守监督微调(ConSFT),一种优化目标,能够适应目标分布同时减轻灾难性遗忘,无需先验数据或架构开销。通过根据模型置信度动态调整学习信号,ConSFT抑制来自低置信度样本的过度梯度,从而防止不成比例的参数更新,从而限制内在参数扰动风险。受强化学习信任区域裁剪的启发,这种形式建立了一个渐进学习动态,以确保目标收敛和先前能力保留,实现稀疏参数更新,而无需依赖显式正则化所需的并行参考网络。我们在LIBERO和RoboTwin基准上评估了ConSFT,针对最先进的流匹配VLA(π₀,π₀.₅和GR00T-N1.6-3B)。该方法在能力保留方面优于常规SFT,平均绝对优势超过20%,在无先验数据的环境中与数据密集型经验回放的效能相当。现实世界的机器人部署证实,ConSFT在下游适应过程中防止了空间过拟合,保留了预训练的物理技能,同时获取了序列目标任务。

英文摘要

Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models drives dense parameter overwrites, degrading pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to target distributions while mitigating catastrophic forgetting, requiring zero prior data or architectural overhead. By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding the intrinsic parameter disruption risk. Inspired by reinforcement learning's trust-region clipping, this formulation establishes a progressive learning dynamic to secure target convergence and prior capability retention, maintaining sparse parameter updates without relying on the parallel reference networks required by explicit regularization. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks across state-of-the-art flow-matching VLAs ($\pi_0$, $\pi_{0.5}$, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of over 20%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robotic deployments confirm that ConSFT precludes spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.

2605.08830 2026-05-20 cs.CV cs.AI cs.RO 版本更新

VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

VECTOR-Drive: 紧密耦合的视觉-语言与轨迹专家路由用于端到端自动驾驶

Rui Zhao, Jianlin Yu, Zhenhai Gao, Jiaqiao Liu, Fei Gao

发表机构 * College of Automotive Engineering, Jilin University(吉林大学汽车工程学院) The National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University(吉林大学汽车底盘集成与生物力学国家级重点实验室) ReeFocus AI Technology(ReeFocus人工智能技术)

AI总结 本文提出VECTOR-DRIVE框架,通过紧密耦合的视觉-语言与轨迹专家路由,解决端到端自动驾驶中视觉语言理解和轨迹预测之间的耦合问题,实现更高的任务性能。

详情
AI中文摘要

端到端自动驾驶需要模型理解交通场景、推断驾驶意图并生成可执行的运动计划。最近的视觉-语言-动作(VLA)模型继承了大规模视觉-语言预训练的语义先验,但仍然面临耦合权衡:完全共享的骨干网络保留了多模态交互,但可能导致语言推理和轨迹预测的耦合问题;而解耦的推理-动作管道减少了任务冲突,但削弱了语义-运动耦合。我们提出VECTOR-DRIVE,一个基于Qwen2.5-VL-3B的紧密耦合VLA框架。VECTOR-DRIVE通过共享自注意力保持所有token的耦合,并根据token语义路由前馈计算。视觉和语言token由视觉-语言专家处理以保留语义先验,而目标点、主体状态和噪声动作token则路由到轨迹专家进行运动特定计算。在动作token路径上,一个流匹配规划器将噪声动作token细化为未来路径点和速度配置文件。这种设计在单一多模态Transformer中耦合了语义推理和运动规划,同时分离了任务特定的FFN计算。在Bench2Drive上,VECTOR-DRIVE实现了88.91的驾驶得分,并优于代表性的端到端和VLA基线。定性结果和消融进一步验证了共享注意力、语义感知专家路由、渐进式训练和基于流的动作解码的优势。

英文摘要

End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decoupled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves 88.91 Driving Score and outperforms representative end-to-end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.

2605.04525 2026-05-20 cs.RO 版本更新

HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks

HDFlow:用于长时间任务的分层扩散-流规划

Nandiraju Gireesh, Yuanliang Ju, Chaoyi Xu, Weiheng Liu, Yuxuan Wan, He Wang

发表机构 * Peking University(北京大学) University of Toronto(多伦多大学)

AI总结 本文提出HDFlow,一种新的分层规划框架,利用扩散和修正流模型的优势,克服了单一范式生成规划器的局限性,通过在模拟和现实中的家具组装任务验证其有效性。

Comments ICML 2026 (Spotlight)

详情
AI中文摘要

近年来,生成模型的进步在长时间、稀疏奖励任务中生成行为计划方面展现出潜力。尽管这些方法取得了有希望的结果,但它们通常缺乏分层分解的原理框架,并且由于其迭代去噪过程,难以应对实时执行的计算需求。在本文中,我们介绍了分层扩散-流(HDFlow),一种新颖的分层规划框架,能够最优地利用扩散和修正流模型的优势,以克服单一范式生成规划器的局限性。HDFlow采用高层扩散规划器,在学习的潜在空间中生成策略子目标序列,利用扩散强大的探索能力。这些子目标随后引导低层修正流规划器生成平滑且密集的轨迹,利用基于常微分方程(ODE)的轨迹生成的速度和效率。我们在四个具有挑战性的家具组装任务中评估了HDFlow,既在模拟中又在现实世界中,其表现显著优于最先进的方法。此外,我们还展示了该方法在两个包含多样化运动和操作任务的长时间基准测试中的泛化能力。项目网站:https://hdflow-page.github.io/

英文摘要

Recent advances in generative models have shown promise in generating behavior plans for long-horizon, sparse reward tasks. While these approaches have achieved promising results, they often lack a principled framework for hierarchical decomposition and struggle with the computational demands of real-time execution, due to their iterative denoising process. In this work, we introduce Hierarchical Diffusion-Flow (HDFlow), a novel hierarchical planning framework that optimally leverages the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners. HDFlow employs a high-level diffusion planner to generate sequences of strategic subgoals in a learned latent space, capitalizing on diffusion's powerful exploratory capabilities. These subgoals then guide a low-level rectified flow planner that generates smooth and dense trajectories, exploiting the speed and efficiency of ordinary differential equation (ODE)-based trajectory generation. We evaluate HDFlow on four challenging furniture assembly tasks in both simulation and real-world, where it significantly outperforms state-of-the-art methods. Furthermore, we also showcase our method's generalizability on two long-horizon benchmarks comprising diverse locomotion and manipulation tasks. Project website: https://hdflow-page.github.io/
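
To make the "ODE-based trajectory generation" in this abstract concrete, the sketch below integrates a velocity field with explicit Euler steps, which is how a rectified-flow sampler is typically rolled out. It is an illustration under stated assumptions, not the HDFlow planner: `toy_velocity` is a hypothetical closed-form stand-in for a learned network, and the `goal` argument stands in for a subgoal supplied by a high-level planner.

```python
import numpy as np


def toy_velocity(x, t, goal):
    """Hypothetical stand-in for a learned velocity field.

    For straight-line paths that end at `goal` at t = 1, the velocity
    at state x and time t is (goal - x) / (1 - t).
    """
    return (goal - x) / (1.0 - t)


def euler_rollout(x0, goal, num_steps=8):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with explicit Euler steps."""
    x = np.asarray(x0, dtype=float)
    goal = np.asarray(goal, dtype=float)
    dt = 1.0 / num_steps
    path = [x.copy()]
    for k in range(num_steps):
        t = k * dt  # largest t used is 1 - dt, so the division is safe
        x = x + dt * toy_velocity(x, t, goal)
        path.append(x.copy())
    return np.stack(path)


trajectory = euler_rollout(x0=[0.0, 0.0], goal=[1.0, 2.0], num_steps=8)
```

With straight paths, a handful of Euler steps already lands on the goal, which is the efficiency argument usually made for rectified flows over many-step iterative denoising.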

2604.25646 2026-05-20 cs.CV cs.RO 版本更新

SAMe: A Semantic Anatomy Mapping Engine for Robotic Ultrasound

SAMe:一种用于机器人超声的语义解剖映射引擎

Jing Zhang, Duojie Chen, Wentao Jiang, Zihan Lou, Jianxin Liu, Xinwu Cui, Qinghong Zhao, Bo Du, Christoph F. Dietrich, Dacheng Tao

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) Hubei Center for Applied Mathematics, Wuhan University(武汉大学湖北应用数学中心) Department of Ultrasound, The Central Hospital of Wuhan(武汉市中心医院超声科) Department of Medical Ultrasound, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology(华中科技大学同济医学院附属同济医院超声医学科) Department of Ultrasound in Medicine, Renmin Hospital of Wuhan University(武汉大学人民医院超声医学科) University Hospital, Johann-Wolfgang-Goethe University Frankfurt am Main(法兰克福歌德大学附属医院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 该研究提出SAMe,一种语义解剖映射引擎,通过提供显式的解剖先验层,解决机器人超声扫描初始化问题,实现了基于临床症状的解剖目标识别和控制指令生成,提高了自动扫描的准确性和效率。

Comments Supplementary information included. Code will be released at https://github.com/MiliLab/Echo-SAMe

详情
AI中文摘要

机器人超声已经实现了局部图像驱动控制、接触调节和视图优化,但当前系统缺乏必要的解剖学理解,无法确定应扫描什么、从哪里开始以及如何适应个体患者解剖结构。这些差距使得系统仍依赖专家干预来启动扫描。本文提出SAMe,一种语义解剖映射引擎,为机器人超声提供显式的解剖先验层。SAMe将扫描初始化视为目标到解剖到动作的过程:它将不明确的临床症状转化为结构化的目标器官,从单张外部身体图像中为这些目标生成患者特定的解剖表示,并将这种表示转换为面向控制的6自由度探头初始化状态,无需使用术前CT或MRI进行额外的配准。SAMe维护的解剖表示是显式的、轻量的(单器官推断在0.08秒内完成),并且设计上与下游控制兼容。在语义接地、解剖生成和真实机器人评估中,SAMe在完整的初始化流程中表现出色。在真实机器人实验中,基于质心的SAMe初始化在单目标设置下,对于肝脏(86.7% vs 46.7%)和肾脏(80.0% vs 73.3%)初始化均优于基于身体关键点的启发式基线。此外,当多个候选目标可用时,试验级别的器官命中率达到了肝脏97.3%和肾脏83.3%。这些结果建立了一个显式的解剖先验层,解决了扫描初始化问题,并为更广泛的下游自主扫描流程提供了解剖基础,为基于症状驱动和解剖信息的机器人超声提供了基础。

英文摘要

Robotic ultrasound has advanced local image-driven control, contact regulation, and view optimization, yet current systems lack the anatomical understanding needed to determine what to scan, where to begin, and how to adapt to individual patient anatomy. These gaps make systems still reliant on expert intervention to initiate scanning. Here we present SAMe, a semantic anatomy mapping engine that provides robotic ultrasound with an explicit anatomical prior layer. SAMe addresses scan initiation as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structured target organs, instantiates a patient-specific anatomical representation for the grounded targets from a single external body image, and translates this representation into control-facing 6-DoF probe initialization states without any additional registration using preoperative CT or MRI. The anatomical representation maintained by SAMe is explicit, lightweight (single-organ inference in 0.08 s), and compatible with downstream control by design. Across semantic grounding, anatomical instantiation, and real-robot evaluation, SAMe shows strong performance across the full initialization pipeline. In real-robot experiments, centroid-based SAMe initialization outperformed the body-keypoint-based heuristic baseline under a budget-matched single-target setting for both liver (86.7% versus 46.7%) and kidney (80.0% versus 73.3%) initialization. Furthermore, the trial-level organ-hit rate reached 97.3% for liver and 83.3% for kidney when multiple candidate targets were available. These results establish an explicit anatomical prior layer that addresses scan initialization and is designed to support broader downstream autonomous scanning pipelines, providing the anatomical foundation for complaint-driven, anatomically informed robotic ultrasonography.

2604.11417 2026-05-20 cs.RO cs.AI 版本更新

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

面向机器人伴随语音的高效情绪感知象似性手势预测

Edwin C. Montiel-Vazquez, Christian Arzate Cruz, Stefanos Gkikas, Thomas Kassiotis, Giorgos Giannakakis, Randy Gomez

发表机构 * School of Engineering(工程学院) Honda Research Institute Japan(本田日本研究院) Department of Electronic Engineering(电子工程系)

AI总结 本文提出一种轻量级的transformer模型,通过文本和情绪单独生成图标手势的位置和强度,无需音频输入,在BEAT2数据集上优于GPT-4o,在语义手势位置分类和强度回归方面表现更佳,且计算紧凑,适合实时部署。

详情
AI中文摘要

伴随语音的手势(co-speech gestures)可以提高参与度并改善语音理解。大多数数据驱动的机器人系统生成有节奏的节拍式运动,但很少融入语义强调。为此,我们提出了一种轻量级的transformer模型,仅根据文本和情绪推断象似性手势的位置和强度,推理时无需音频输入。该模型在BEAT2数据集上的语义手势位置分类和强度回归方面均优于GPT-4o,同时计算紧凑,适合在具身智能体上实时部署。

英文摘要

Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

2604.09323 2026-05-20 cs.RO 版本更新

Robust Adaptive Backstepping Impedance Control of Robots in Unknown Environments

在未知环境中具有鲁棒性的自适应反步阻抗控制

Reza Nazmara, Alap Kshirsagar, Jan Peters, A. Pedro Aguiar

发表机构 * Research Center for Systems and Technologies (SYSTEC), ARISE, Faculty of Engineering, University of Porto, 4200-465 Porto, Portugal(系统与技术研究中心(SYSTEC),ARISE,工程学院,波尔图大学,葡萄牙4200-465波尔图) Intelligent Autonomous Systems Lab, Department of Computer Science, TU Darmstadt, Germany(智能自主系统实验室,计算机科学系,达姆施塔特技术大学,德国)

AI总结 本文提出了一种针对在接触丰富且不确定环境中操作的机器人鲁棒自适应反步阻抗控制(RABIC)策略,该策略考虑了系统的完整耦合动力学,并明确考虑了外部扰动和未建模动力学等关键不确定性来源,而无需机器人动态参数。通过反步方法设计内环以跟踪参考阻抗模型,利用泰勒级数估计器估计系统动力学并采用自适应估计器确定外部力的上界。稳定性分析证明了整体系统的半全局有限时间稳定性。通过模拟移动机械臂场景和对实际Franka Emika Panda机器人的实验评估,证明了所提方法在安全性、轨迹跟踪和力监测方面优于PD控制。

Comments 8

详情
Journal ref
Mechatronics, Vol. 118, 103552 (2026)
AI中文摘要

本文提出了一种鲁棒自适应反步阻抗控制(RABIC)策略,用于在接触丰富和不确定环境中操作的机器人。所提出的控制策略考虑了系统的完整耦合动力学,并明确考虑了外部扰动和未建模动力学等关键不确定性来源,而无需机器人动态参数。我们提出了一种基于反步的自适应阻抗控制方案用于内环以跟踪参考阻抗模型。为了处理不确定性,我们采用基于泰勒级数的估计器来估计系统动力学,并采用自适应估计器来确定外部力的上界。稳定性分析证明了整体系统的半全局有限时间稳定性。为了证明所提方法的有效性,进行了模拟移动机械臂场景和对实际Franka Emika Panda机器人的真实实验评估。所提出的方法在安全性和轨迹跟踪及力监测方面优于PD控制。总体而言,RABIC框架为未来关于耦合移动和固定串联机械臂的自适应和学习阻抗控制的研究提供了坚实的基础。

英文摘要

This paper presents a Robust Adaptive Backstepping Impedance Control (RABIC) strategy for robots operating in contact-rich and uncertain environments. The proposed control strategy considers the complete coupled dynamics of the system and explicitly accounts for key sources of uncertainty, including external disturbances and unmodeled dynamics, while not requiring the robot's dynamic parameters in implementation. We propose a backstepping-based adaptive impedance control scheme for the inner loop to track the reference impedance model. To handle uncertainties, we employ a Taylor series-based estimator for system dynamics and an adaptive estimator for determining the upper bound of external forces. Stability analysis demonstrates the semi-global practical finite-time stability of the overall system. To demonstrate the effectiveness of the proposed method, a simulated mobile manipulator scenario and experimental evaluations on a real Franka Emika Panda robot were conducted. The proposed approach exhibits safer performance compared to PD control while ensuring trajectory tracking and force monitoring. Overall, the RABIC framework provides a solid basis for future research on adaptive and learning-based impedance control for coupled mobile and fixed serially linked manipulators.

2604.07993 2026-05-20 cs.RO 版本更新

HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

HEX: 人形对齐的专家用于跨躯体全身体操作

Shuanghao Bai, Meng Li, Xinyuan Lv, Jiawei Wang, Xinhua Wang, Fei Liao, Chengkai Hou, Langzhe Gu, Wanqi Zhou, Kun Wu, Ziluo Ding, Zhiyuan Xu, Lei Sun, Shanghang Zhang, Zhengping Che, Jian Tang, Badong Chen

发表机构 * Beijing Innovation Center of Humanoid Robotics(北京人形机器人创新中心) Xi’an Jiaotong University(西安交通大学) Nankai University(南开大学) Peking University(北京大学)

AI总结 HEX通过引入人形对齐的通用状态表示和混合专家统一本体预测器,实现了对全尺寸双足人形机器人全身体操作的协调控制,展示了在任务成功率和泛化能力上的最新成果。

Comments Project page: https://hex-humanoid.github.io/

详情
AI中文摘要

人类通过协调的全身控制实现复杂操作,而大多数视觉-语言-动作(VLA)模型将机器人身体部分独立处理,使得高自由度的人形控制具有挑战性和不稳定性。我们提出了HEX,一种面向全尺寸双足人形机器人的协调操作状态中心框架。HEX引入了人形对齐的通用状态表示,以实现跨异构躯体的可扩展学习,并结合混合专家统一本体预测器,从大规模多躯体轨迹数据中建模全身协调和时间运动动态。为了高效捕捉时间视觉上下文,HEX使用轻量级历史标记来总结过去的观察,避免在推理过程中重复编码历史图像。它进一步采用残差门控融合机制和流匹配动作头,以适应性地整合视觉-语言提示与本体动态以生成动作。在现实世界的人形操作任务中,HEX在任务成功率和泛化能力上实现了最先进的性能,特别是在快速反应和长时间范围场景中。

英文摘要

Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.

2602.09259 2026-05-20 cs.RO cs.HC 版本更新

Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

以数据为中心的基于学习的多任务手术注视感知模型设计

Yizhou Li, Shuyuan Yang, Jiaji Su, Zonghe Chua

发表机构 * Department of Electrical, Computer, and Systems Engineering, Case Western Reserve University(电气、计算机与系统工程系,凯斯西储大学)

AI总结 本研究探讨了在多任务模拟中,基于学习的手术注视感知模型的设计,通过主动-被动注视数据集分析,评估了不同注视来源对注意力模型学习的影响,并提出了可扩展的群众源注视监督方法。

Comments 8 pages, conference pre-print

详情
AI中文摘要

在机器人辅助微创手术(RMIS)中,减少的触觉反馈和深度线索增加了对专家视觉感知的依赖,推动了基于注视引导的训练和基于学习的手术感知模型。然而,操作专家的注视数据收集成本高,且不清楚注视监督来源(专家水平(中级 vs. 初学者)和感知模态(主动执行 vs. 被动观看))如何影响注意力模型的学习。我们引入了一个配对的主动-被动、多任务手术注视数据集,该数据集在达芬奇SimNow模拟器上进行了四次钻探任务。使用VR头盔和眼动追踪记录了任务执行期间的主动注视,相应的视频被重新利用作为刺激,以收集观察者的被动注视,从而实现受控的同视频比较。我们量化了技能和模态依赖的注视组织差异,并通过注视密度重叠分析和单帧显著性建模评估了被动注视在操作监督中的可替代性。在各种设置中,MSI-Net产生了稳定且可解释的预测,而SalGAN不稳定且经常与人类注视不一致。训练于被动注视的模型恢复了相当大的中级主动注意力,但存在可预测的退化,且主动和被动目标之间的迁移是不对称的。值得注意的是,初学者的被动标签在较高质量演示中对中级-被动目标的近似具有有限的损失,这表明了一条可行的路径,用于在手术指导和感知建模中实现可扩展的群众源注视监督。

英文摘要

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.

2602.09023 2026-05-20 cs.RO 版本更新

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

TwinRL: 基于数字孪生的强化学习用于真实世界机器人操作

Qinwen Xu, Jiaming Liu, Rui Zhou, Shaojun Shi, Nuowei Han, Zhuoyang Liu, Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia, Shanghang Zhang

发表机构 * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理国家重点实验室) Simplexity Robotics(Simplexity机器人) Tsinghua University(清华大学) Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文提出TwinRL框架,通过数字孪生与真实世界协同训练,提升视觉-语言-动作模型在真实世界中的探索效率和收敛速度,实现高成功率和快速收敛。

详情
AI中文摘要

尽管具有强大的泛化能力,视觉-语言-动作(VLA)模型仍然受到专家演示成本高和现实世界交互有限的限制。虽然在线强化学习(RL)显示出前景,但将其应用于真实世界VLA操作受到探索效率低和探索覆盖受限的阻碍。通过系统性的现实世界实验,我们发现在线RL的有效探索空间主要受监督微调(SFT)期间诱导的轨迹分布所限制。受此观察启发,我们提出TwinRL,一种数字孪生-真实世界协同的后训练框架,通过三个阶段扩展和引导RL探索:SFT预热、孪生RL预热和真实世界RL。TwinRL首先从手机捕捉的场景中重建高保真的数字孪生。在SFT阶段,我们引入一种探索空间扩展策略,将轨迹分布的支持扩展到现实演示之外,重塑探索空间以更有效地进行RL。与将孪生视为数据增强工具不同,我们提出一种孪生RL预热策略,使其能够作为真实世界RL的探索引导。具体而言,TwinRL在数字孪生中执行高效的并行RL,生成填充回放缓冲区的交互轨迹,稳定后续真实世界RL学习。这一过程还识别出易失败但信息丰富的配置,使针对人类在回路中的rollouts进一步提高机器人上的效率。在四个任务中,TwinRL在分布内和分布外区域均实现近100%的成功率,比先前的真实世界RL方法快30%以上,仅需20分钟的机器人交互时间。

英文摘要

Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and limited real-world interaction. While online reinforcement learning (RL) has shown promise, its application to real-world VLA manipulation is hindered by low exploration efficiency and restricted exploration coverage. Through systematic real-world experiments, we observe that the effective exploration space of online RL is largely constrained by the trajectory distribution induced during supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative post-training framework that expands and guides RL exploration for VLA models through three stages: SFT warm-up, twin RL warm-up, and real-world RL. TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes. During the SFT stage, we introduce an exploration space expansion strategy that expands the support of the trajectory distribution beyond real demonstrations, reshaping the exploration space for more effective RL. Rather than treating the twin as a data augmentation tool, we propose a twin RL warm-up strategy that enables it to act as an exploration guide for real-world RL. Specifically, TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer and stabilize subsequent real-world RL learning. This process also identifies failure-prone yet informative configurations, enabling targeted human-in-the-loop rollouts to further improve on-robot efficiency. Across four tasks, TwinRL achieves near-100% success in both in-distribution and out-of-distribution regions, delivering over 30% faster convergence than prior real-world RL methods with only 20 minutes of on-robot interaction.

2601.14234 2026-05-20 cs.LG cs.AI cs.RO stat.ML 版本更新

Q-learning with Adjoint Matching

具有伴随匹配的Q学习

Qiyang Li, Sergey Levine

发表机构 * UC Berkeley(加州大学伯克利分校)

AI总结 本文提出了一种基于时序差分的强化学习算法QAM,解决了连续动作强化学习中的长期挑战:高效优化表达性强的扩散或流匹配策略相对于参数化的Q函数。通过利用批评者的首阶信息进行有效优化,但直接通过反向传播其多步去噪过程进行梯度优化在数值上不稳定。现有方法通过仅使用价值和丢弃梯度信息或依赖近似方法牺牲策略的表达性或偏置学习策略。QAM通过利用生成建模中最近提出的技术伴随匹配,将批评者的动作梯度转换为逐步目标函数,避免了不稳定反向传播,同时在最优时提供无偏且表达性强的策略。结合时序差分备份进行批评者学习,QAM在离线和离线到在线强化学习的硬稀疏奖励任务中一致优于先前方法。

Comments 32 pages, 8 figures, 7 tables

详情
AI中文摘要

我们提出QAM,一种新颖的基于时序差分的强化学习(RL)算法,解决了连续动作RL中长期存在的挑战:高效优化表达性强的扩散或流匹配策略相对于参数化的Q函数。有效的优化需要利用批评者的首阶信息,但通过反向传播其多步去噪过程进行直接梯度优化在数值上不稳定。现有方法通过仅使用价值和丢弃梯度信息或依赖近似方法牺牲策略的表达性或偏置学习策略。QAM通过利用生成建模中最近提出的技术伴随匹配,将批评者的动作梯度转换为逐步目标函数,避免了不稳定反向传播,同时在最优时提供无偏且表达性强的策略。结合时序差分备份进行批评者学习,QAM在离线和离线到在线RL的硬稀疏奖励任务中一致优于先前方法。

英文摘要

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
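
The "temporal-difference backup for critic learning" mentioned in this abstract refers to the standard one-step TD target. The sketch below shows only that generic piece; it does not implement adjoint matching or the QAM policy objective, and the function names, the toy batch, and the discount value are assumptions chosen for illustration.

```python
import numpy as np


def td_targets(rewards, next_q_values, dones, discount=0.99):
    """One-step TD targets: r + discount * (1 - done) * Q(s', a')."""
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    dones = np.asarray(dones, dtype=float)
    return rewards + discount * (1.0 - dones) * next_q_values


def td_loss(q_values, targets):
    """Mean squared TD error between current estimates and fixed targets."""
    q_values = np.asarray(q_values, dtype=float)
    return float(np.mean((q_values - targets) ** 2))


# Toy batch of two transitions; the second one is terminal.
targets = td_targets(
    rewards=[0.0, 1.0],
    next_q_values=[2.0, 5.0],
    dones=[0, 1],
    discount=0.5,
)
loss = td_loss(q_values=[1.0, 0.0], targets=targets)
```

In a full actor-critic loop, `next_q_values` would come from a target critic evaluated at actions sampled from the current policy, and the targets would be treated as constants when differentiating the loss.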

2512.24470 2026-05-20 cs.RO cs.AI 版本更新

Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models

驾驶台上的基础模型:基于视觉-语言模型的海事自主语义危险检测与安全机动

Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert

发表机构 * Dept. of Mechanical and Industrial Engineering, NTNU(机械与工业工程系,挪威科技大学) Dept. of Aeronautics and Astronautics, Stanford University(航空航天工程系,斯坦福大学) Dept. of Computer Science, Stanford University(计算机科学系,斯坦福大学) NVIDIA Research(NVIDIA研究)

AI总结 本文提出了一种基于视觉-语言模型的语义危险检测与安全操作方法,用于满足IMO草案MASS代码对海上自主船舶的要求,通过快速-慢速异常管道和短时间范围的人类可覆盖回退操作来实现,在40个港口场景中验证了该方法的性能。

Comments 17 pages without bibliography or appendix. The main paper has 16 figures. Paper webpage can be found at https://kimachristensen.github.io/bridge_policy/

详情
Journal ref
Ocean Engineering 359, Part 3 (2026), Article 124646
AI中文摘要

草案IMO MASS代码要求自主和远程监督的海事船舶检测其操作设计领域偏离,进入预定义的回退模式以通知操作员,允许立即的人类接管,并避免在未经批准的情况下更改航行计划。在警报到接管的间隙中满足这些义务需要一个短时间范围、可人类接管的回退操作。传统的海事自主堆栈在正确行动依赖于意义(例如,潜水员旗表示水中的人员,附近有火表示危险)时会遇到困难。我们主张(i)视觉-语言模型(VLMs)为这些分布外情况提供语义意识,(ii)一个快速-慢速异常管道,带有短时间范围、可人类接管的回退操作,使在交接窗口内实现这一目标成为可能。我们引入了Semantic Lookout,一种仅使用摄像头、候选约束的VLM回退操作选择器,它在连续人类授权下,从水有效、世界锚定的轨迹中选择一个谨慎的操作(或站守)。在40个港口场景中,我们测量了每调用场景的理解和延迟,与人类共识(模型多数三票投票)的一致性,短时间范围在火险场景中的风险缓解,以及在水上的警报->回退操作->操作员交接。子10秒的模型保留了较慢的最新模型大部分的意识。回退操作选择器在火险场景中比仅基于几何的基线表现更好,并增加了 standoff 距离。一次现场运行验证了端到端的操作。这些结果支持VLMs作为符合草案IMO MASS代码的语义回退操作选择器,适用于实际延迟预算,并激励未来工作,研究适应领域、混合自主性,将基础模型语义与多传感器鸟瞰感知和短时间范围重新规划相结合。网站:kimachristensen.github.io/bridge_policy

英文摘要

The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained VLM fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning. Website: kimachristensen.github.io/bridge_policy

2512.10891 2026-05-20 cs.RO cs.LG 版本更新

Iterative Compositional Data Generation for Robot Control

迭代组合数据生成用于机器人控制

Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez, Eric Eaton

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Stony Brook University(石溪大学)

AI总结 本文提出了一种语义组合扩散变换器,通过注意力机制学习机器人、物体、障碍物和目标特定组件的交互,从而在有限任务集上训练后,能够零样本生成高质量过渡,进而学习未见任务组合的控制策略,并通过迭代自我改进过程提升零样本性能。

详情
AI中文摘要

收集机器人操作数据成本高昂,使得在多对象、多机器人和多环境设置中获取大量任务演示不切实际。尽管最近的生成模型可以为单个任务合成有用的数据,但它们未能利用机器人领域的组合结构,并且在泛化到未见任务组合时表现不佳。我们提出了一种语义组合扩散变换器,将过渡分解为机器人、物体、障碍物和目标特定的组件,并通过注意力机制学习它们的交互。一旦在有限的任务子集上训练,我们展示了模型能够零样本生成高质量的过渡,从而学习未见任务组合的控制策略。然后,我们引入了一个迭代自我改进过程,其中合成数据通过离线强化学习验证,并纳入后续的训练轮次中。我们的方法在单体和硬编码组合基线之上显著提高了零样本性能,最终解决了几乎所有未见任务,并展示了学习表示中出现有意义的组合结构。

英文摘要

Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.

2511.17166 2026-05-20 cs.RO 版本更新

Reflection-Based Relative Localization for Cooperative UAV Teams Using Active Markers

基于反射的协作无人机团队相对定位方法

Tim Lakemann, Daniel Bonilla Licea, Viktor Walter, Martin Saska

发表机构 * Multi-Robot Systems Group, Faculty of Electrical Engineering, Czech Technical University in Prague(布拉格捷克技术大学电气工程学院多机器人系统组) Mohammed VI Polytechnic University(穆罕默德六世理工大学)

AI总结 本文提出了一种利用环境中的主动标记反射进行无人机团队相对定位的新方法,无需预先知道机器人大小或标记配置,且能有效应对表面不规则性带来的不确定性,实验表明其在不同光照条件下具有更高的有效范围和精度。

详情
AI中文摘要

主动标记在环境中的反射通常是机载视觉相对定位中的常见模糊源。本文提出了一种新的方法,利用这些通常不受欢迎的反射来实现异构多无人机团队的机载相对定位。该方法无需事先了解机器人大小或预定义的标记配置,不依赖于表面属性,并明确考虑了由表面不规则性引起的不确定性,包括对海洋部署相关的动态水表面。我们在室内和户外实验中验证了该方法,证明了其在不同光照条件下的可靠运行,并实现了比现有最先进方法更高的有效范围(超过30米)和精度。视频可通过以下链接获取:https://youtu.be/y0zp8cIwkig。

英文摘要

Reflections of active markers in the environment are a common source of ambiguity in onboard visual relative localization. This work presents a novel approach that exploits these typically unwanted reflections for onboard relative localization in heterogeneous multi-UAV teams. The method operates without prior knowledge of robot size or predefined marker configurations, remains independent of surface properties, and explicitly accounts for uncertainties caused by surface irregularities, including dynamic water surfaces relevant for marine deployments. We validated the approach in both indoor and outdoor experiments, demonstrating reliable operation across varying lighting conditions and achieving greater effective range (above 30 m) and accuracy than state-of-the-art methods. The video is available under the following link: https://youtu.be/y0zp8cIwkig.

2510.00600 2026-05-20 cs.RO cs.AI cs.CV cs.LG 版本更新

Hybrid Training for Vision-Language-Action Models

视觉-语言-动作模型的混合训练

Pietro Mazzaglia, Cansu Sancaktar, Markus Peschl, Daniel Dijkman

发表机构 * Qualcomm AI Research(高通AI研究)

AI总结 本文提出混合训练框架,旨在使视觉-语言-动作模型在推理时能够根据需要生成思考过程或直接预测动作,从而在保持性能提升的同时提高推理效率。

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

使用大型语言模型生成中间思考过程(即链式思考,CoT)再提供答案,已成为解决复杂语言任务的有效方法。在机器人领域,类似的具身CoT策略,即在执行动作前生成思考,也已被证明在使用视觉-语言-动作模型(VLAs)时能够提高性能。然而,这些技术会增加模型生成输出的长度以包含思考过程,从而影响推理时间。在现实世界执行中,如机器人操作场景,延迟代理的动作会严重影响方法的实用性,因为任务需要长序列的动作。然而,生成长链式思考是否是实现性能提升的必要条件?在本文中,我们探索了混合训练(HyT)的概念,这是一种框架,使VLAs能够从思考中学习并受益于相关的性能提升,同时在推理时允许省略CoT生成。此外,通过学习有条件地预测多样化的输出,HyT在推理时提供了灵活性,使模型能够直接预测动作、生成思考或遵循指令。我们评估了所提出的方法在一系列模拟基准和真实世界实验中的表现。

英文摘要

Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, generating thoughts before actions, have also been shown to lead to improved performance when using Vision-Language-Action models (VLAs). As these techniques increase the length of the model's generated outputs to include the thoughts, the inference time is negatively affected. Delaying an agent's actions in real-world executions, as in robotic manipulation settings, strongly affects the usability of a method, as tasks require long sequences of actions. However, is the generation of long chains-of-thought a strong prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while enabling the possibility to leave out CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments.

2509.14787 2026-05-20 cs.RO 版本更新

COMPASS: Confined-space Manipulation Planning with Active Sensing Strategy

COMPASS:基于主动感知策略的受限空间操作规划

Qixuan Li, Chen Le, Dongyue Huang, Jincheng Yu, Xinlei Chen

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China(清华大学深圳国际研究生院,中国深圳) Department of Electronic Engineering, and the Institute for Embodied Intelligence and Robotics, Tsinghua University, Beijing, China(清华大学电子工程系及具身智能与机器人研究院,中国北京) The School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore(南洋理工大学电气与电子工程学院,新加坡)

AI总结 本文提出COMPASS框架,通过主动感知策略在受限和杂乱环境中实现安全操作,提高了操作成功率。

Comments Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)

详情
AI中文摘要

在受限和杂乱环境中进行操作仍然是一个重大挑战,由于部分可观测性和复杂的配置空间。有效在这些环境中进行操作需要一种智能探索策略来安全地理解和搜索目标。在本文中,我们提出COMPASS,一种多阶段探索和操作框架,其特征是具有操作意识的基于采样的规划器。首先,我们通过近场意识扫描减少碰撞风险,以构建局部碰撞图。此外,我们采用多目标效用函数来寻找同时具有信息性和有利于后续操作的视角。此外,我们执行一种受限操作优化策略,以生成遵守障碍物约束的操作姿态。为了系统评估方法在这些困难下的性能,我们提出了一个包含四个挑战性场景的受限空间探索和操作基准。与为其他机器人设计的探索方法和仅考虑信息增益的方法相比,我们的框架在模拟中将操作成功率提高了24.25%。现实世界实验展示了我们的方法在受限环境中进行主动感知和操作的能力。

英文摘要

Manipulation in confined and cluttered environments remains a significant challenge due to partial observability and complex configuration spaces. Effective manipulation in such environments requires an intelligent exploration strategy to safely understand the scene and search for the target. In this paper, we propose COMPASS, a multi-stage exploration and manipulation framework featuring a manipulation-aware sampling-based planner. First, we reduce collision risks with a near-field awareness scan to build a local collision map. Additionally, we employ a multi-objective utility function to find viewpoints that are both informative and conducive to subsequent manipulation. Moreover, we perform a constrained manipulation optimization strategy to generate manipulation poses that respect obstacle constraints. To systematically evaluate the method's performance under these difficulties, we propose a benchmark of confined-space exploration and manipulation containing four levels of challenging scenarios. Compared to exploration methods designed for other robots and only considering information gain, our framework increases manipulation success rate by 24.25% in simulations. Real-world experiments demonstrate our method's capability for active sensing and manipulation in confined environments.

2506.01418 2026-05-20 cs.RO cs.CV 版本更新

SEMNAV: Enhancing Visual Semantic Navigation in Robotics through Semantic Segmentation

SEMNAV: 通过语义分割增强机器人中的视觉语义导航

Rafael Flor-Rodríguez, Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Sergio Lafuente-Arroyo, Roberto J. López-Sastre

发表机构 * University of Alcalá(阿尔卡拉大学) CAM-UAH Ministry of Science and Innovation of Spain(西班牙科学与创新部)

AI总结 本文提出SEMNAV,一种利用语义分割作为环境主要视觉输入表示的方法,以增强机器人代理的感知和决策能力,通过引入高层面的语义信息,提升模型在未知环境中的泛化能力,并引入SEMNAV数据集进行训练。

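Details
Journal ref
Applied Intelligence, 2026
AI abstract (translated from Chinese)

Visual Semantic Navigation (VSN) is a fundamental problem in robotics in which an agent must navigate to a target object in an unknown environment, relying mainly on visual information. Most state-of-the-art VSN models are trained in simulation, where, at best, rendered scenes of the real world are used. These methods usually rely on raw RGB data from virtual scenes, which limits their ability to generalize to real-world environments because of domain adaptation issues. To address this, the paper proposes SEMNAV, a new method that uses semantic segmentation as the main visual input representation of the environment to strengthen the agent's perception and decision-making. By explicitly introducing this high-level semantic information, the model learns robust navigation policies and generalizes better to unseen environments, both simulated and real. We also introduce the SEMNAV dataset, a newly curated dataset for training segmentation-aware navigation models such as SEMNAV. The method is evaluated extensively in simulation and on real-world robotic platforms. Experiments show that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulator with the HM3D dataset. Our real-world experiments further highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making the model a promising solution for practical VSN-based robotic applications. Code and datasets are available at https://github.com/gramuah/semnav.

English abstract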

Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating this type of high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, in both simulated and real-world settings. We also introduce the SEMNAV dataset, a newly curated dataset designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively both in simulated environments and on real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment with the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. The code and datasets are accessible at https://github.com/gramuah/semnav

2505.09067 2026-05-20 math.OC cs.RO cs.SY eess.SY Version update

Solving Reach- and Stabilize-Avoid Problems Using Discounted Reachability

Boyang Li, Zheng Gong, Sylvia Herbert

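Affiliations * Mechanical and Aerospace Engineering at UC San Diego

AI summary For infinite-horizon reach-avoid (RA) and stabilize-avoid (SA) zero-sum games in general nonlinear continuous-time systems, this paper proposes a new Lipschitz continuous RA value function whose zero sublevel set exactly characterizes the RA set. It solves the RA problem by showing that the associated Bellman backup operator is contractive and that the RA value function is the unique viscosity solution of a Hamilton-Jacobi variational inequality. For the SA problem, it develops a two-step framework that combines the RA strategies with a recently proposed Robust Control Lyapunov-Value Function, ensuring target reachability and long-term stability. The approach is verified numerically on a 3D Dubins car system.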

Comments 16 pages, 6 figures, 1 table. Accepted to IEEE Transactions on Automatic Control

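Details
Journal ref
IEEE Transactions on Automatic Control (Early Access), 2026
AI abstract (translated from Chinese)

In this paper, we consider infinite-horizon reach-avoid (RA) and stabilize-avoid (SA) zero-sum game problems for general nonlinear continuous-time systems. The goal is to find the set of states that can be controlled to reach, or stabilize to, a target set without violating constraints even in the worst case. Building on the Hamilton-Jacobi reachability method, we solve the RA problem by designing a new Lipschitz continuous RA value function whose zero sublevel set exactly characterizes the RA set. We prove that the associated Bellman backup operator is contractive and that the RA value function is the unique viscosity solution of a Hamilton-Jacobi variational inequality. Finally, we develop a two-step framework for the SA problem by combining our RA strategies with a recently proposed Robust Control Lyapunov-Value Function, ensuring both target reachability and long-term stability. We numerically verify the effectiveness of the proposed RA and SA frameworks on a 3D Dubins car system.

English abstract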

In this article, we consider the infinite-horizon reach-avoid (RA) and stabilize-avoid (SA) zero-sum game problems for general nonlinear continuous-time systems, where the goal is to find the set of states that can be controlled to reach or stabilize to a target set, without violating constraints even under the worst-case disturbance. Based on the Hamilton-Jacobi reachability method, we address the RA problem by designing a new Lipschitz continuous RA value function, whose zero sublevel set exactly characterizes the RA set. We establish that the associated Bellman backup operator is contractive and that the RA value function is the unique viscosity solution of a Hamilton-Jacobi variational inequality. Finally, we develop a two-step framework for the SA problem by integrating our RA strategies with a recently proposed Robust Control Lyapunov-Value Function, thereby ensuring both target reachability and long-term stability. We numerically verify our RA and SA frameworks on a 3D Dubins car system to demonstrate the efficacy of the proposed approach.
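The contraction claim in the abstract above can be illustrated on a toy problem. The sketch below is not the paper's continuous-time value function: it assumes a small finite state set, no disturbance player, and one discrete-time discounted reach-avoid backup of the kind used in the reinforcement-learning literature. The names `backup`, `sup_dist`, `l`, `g`, and `succ` are all hypothetical. It checks numerically that one backup shrinks the sup-norm distance between two value functions by at least the discount factor.

```python
import random

def backup(V, l, g, succ, gamma):
    """One discounted reach-avoid backup over a finite state set.

    Assumed sign convention: l[x] <= 0 inside the target and g[x] > 0 where
    the constraint is violated, so V[x] <= 0 marks states that reach safely.
    """
    new_V = []
    for x in range(len(V)):
        best_next = min(V[y] for y in succ[x])  # controller picks the best successor
        undiscounted = max(g[x], min(l[x], best_next))
        new_V.append((1 - gamma) * max(l[x], g[x]) + gamma * undiscounted)
    return new_V

def sup_dist(a, b):
    """Sup-norm distance between two value tables."""
    return max(abs(p - q) for p, q in zip(a, b))

random.seed(0)
n, gamma = 6, 0.9
l = [random.uniform(-1, 1) for _ in range(n)]  # toy target margin
g = [random.uniform(-1, 1) for _ in range(n)]  # toy constraint margin
succ = [[(x - 1) % n, x, (x + 1) % n] for x in range(n)]  # ring: step left, stay, step right

V1 = [random.uniform(-5, 5) for _ in range(n)]
V2 = [random.uniform(-5, 5) for _ in range(n)]
before = sup_dist(V1, V2)
after = sup_dist(backup(V1, l, g, succ, gamma), backup(V2, l, g, succ, gamma))
print(after <= gamma * before + 1e-12)

# Because the backup contracts, repeated application settles on one fixed point.
V = V1
for _ in range(500):
    V = backup(V, l, g, succ, gamma)
residual = sup_dist(V, backup(V, l, g, succ, gamma))
```

The inequality holds here because `min` and `max` are non-expansive and the only dependence on `V` is scaled by `gamma`; an adversarial disturbance would add an inner `max` over its choices without changing that argument.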

2504.09188 2026-05-20 cs.RO cs.SY eess.SY Version update

Compliant Explicit Reference Governor for Contact Friendly Robotic Manipulators

Yaashia Gautam, Gilberto Briscoe-Martinez, Adhitya Mohan, Nataliya Nechyporenko, Alessandro Roncone, Marco M. Nicotra

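Affiliations * College of Engineering and Applied Sciences, University of Colorado Boulder, Boulder, CO 80301 USA

AI summary This paper proposes the Compliant Explicit Reference Governor (CERG), a modular reference management system that lets robots interact physically with their environment under provable guarantees. The CERG sits as an intermediate layer between a high-level planner and a low-level controller, enforcing operational constraints and enabling smooth transitions between free-motion and contact operations. It ensures safety by limiting the total energy available to the robotic arm at the time of contact. In the absence of contact, the CERG does not penalize system performance.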

Comments Updated paper with current contributions and author list, accepted at IFAC World Congress, Busan, 2026

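Details
AI abstract (translated from Chinese)

This paper introduces the Compliant Explicit Reference Governor (CERG), a modular reference management system that lets robots interact physically with their environment under provable guarantees. The CERG is an intermediate layer that can be placed between a high-level planner and a low-level controller: it enforces operational constraints and enables smooth transitions between free-motion and contact operations. The CERG ensures safety by limiting the total energy available to the robotic arm at the time of contact. In the absence of contact, however, the CERG does not penalize system performance. Simulation and hardware experiments validate the effectiveness of the CERG on increasingly complex systems.

English abstract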

This paper introduces the Compliant Explicit Reference Governor (CERG), a modular reference management system that enables robots to interact physically with their environment under provable guarantees. The CERG is an intermediate layer that can be placed between a high-level planner and a low-level controller: it enforces operational constraints and enables smooth transitions between free-motion and contact operations. The CERG ensures safety by limiting the total energy available to the robotic arm at the time of contact. In the absence of contact, however, the CERG does not penalize the system performance. Simulation and hardware experiments validate the CERG on increasingly complex systems.

2501.09203 2026-05-20 cs.CV cs.RO Version update

3D Modeling and Automated Measurement of Concrete Cracks via Segment Anything Refinement and Visual Inertial LiDAR Fusion

Pengru Deng, Jiapeng Yao, Chun Li, Su Wang, Xinrun Li, Varun Ojha, Xuhui He

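Affiliations * School of Civil Engineering; Central South University; Hunan Provincial Key Laboratory for Disaster Prevention and Mitigation of Rail Transit Engineering Structures; Nvidia School of Computing; Newcastle University

AI summary This paper proposes an innovative framework that combines computer vision techniques with multi-modal Simultaneous Localization and Mapping (SLAM) for 2D crack detection, 3D reconstruction, and automatic 3D crack measurement. It addresses the limited adaptability and robustness of existing methods, in particular their difficulty with curved or complex geometries.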

Comments Title and author list updated

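Details
Journal ref
Computer-Aided Civil and Infrastructure Engineering, Volume 45, 2026, 100019, ISSN 1093-9687
AI abstract (translated from Chinese)

Visual-spatial systems are becoming increasingly critical in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, show limited robustness in image-based approaches, and have difficulty with curved or complex geometries. To address these limitations, this paper proposes an innovative framework that integrates computer vision techniques and multi-modal Simultaneous Localization and Mapping (SLAM) for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and automatic 3D crack measurement. First, building on a base DeepLabv3+ segmentation model and adding specific refinements that use the foundation model Segment Anything Model (SAM), we develop a crack segmentation method with strong generalization that produces precise 2D crack masks in unfamiliar scenes. To improve the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds are used together with image data and segmentation masks. Using both image- and LiDAR-SLAM, we develop a multi-frame, multi-modal fusion framework that produces dense, colorized point clouds and captures crack semantics at real-world 3D scale. In addition, the geometric attributes of cracks are measured automatically and directly in the dense 3D point cloud space, overcoming the limitations of conventional 2D image-based measurement. This makes the method applicable to structural components with curved and complex 3D geometries. Experimental results on a variety of concrete structures highlight the significant improvements and distinctive advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.

English abstract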

Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement by integrating computer vision technologies and multi-modal Simultaneous Localization and Mapping (SLAM). First, building on a base DeepLabv3+ segmentation model and incorporating specific refinements that use the foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were utilized together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the geometric attributes of cracks were measured automatically and directly within 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.

2605.19015 2026-05-20 eess.SY cs.RO cs.SY Version update

Probabilistic Recursively Feasible Motion Planning Under Uncertain Environments

Hyeontae Sung, Hyeongchan Ham, Junyoung Park, Kai Ren, Heejin Ahn

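Affiliations * School of Electrical Engineering, Korea Advanced Institute of Science

AI summary This paper proposes a probabilistic recursively feasible model predictive control framework that addresses safe motion planning in uncertain environments by guaranteeing recursive feasibility with a specified probability. Its main contributions are closed-form expressions for the means and covariances of predicted trajectories, and safety constraints built on them to ensure recursive feasibility.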

Comments 7 pages, 4 figures

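Details
AI abstract (translated from Chinese)

Safe motion planning in uncertain, time-varying environments is challenging because the safe region can change unpredictably between planning steps, often causing a loss of recursive feasibility. In this work, we propose a Probabilistic Recursively Feasible Model Predictive Control (PRF-MPC) framework that guarantees recursive feasibility with a specified probability. We introduce properties that an ideal predictor should satisfy to ensure distributional consistency, and use them to derive closed-form expressions for the means and covariances of trajectories predicted at future time steps. Based on this analysis, we construct safety constraints which ensure that the current safe set is contained in the future safe sets, thereby guaranteeing recursive feasibility in a probabilistic sense. Simulation results in a lane-change scenario show that the proposed method significantly improves recursive feasibility.

English abstract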

Safe motion planning in uncertain, time-varying environments is challenging because the safe region can change unpredictably across planning steps, often causing a loss of recursive feasibility. In this work, we present a Probabilistic Recursively Feasible Model Predictive Control (PRF-MPC) framework that guarantees recursive feasibility with a specified probability. We introduce properties that an ideal predictor should satisfy to ensure distributional consistency, and use these properties to derive closed-form expressions for the means and covariances of trajectories predicted at future time steps. Building on this analysis, we construct safety constraints that ensure, with high probability, that the current safe set is contained within the safe sets at future time steps, thereby probabilistically guaranteeing recursive feasibility. Simulation results on a lane-change scenario demonstrate that the proposed method significantly improves recursive feasibility.

2605.19009 2026-05-20 cs.RO cs.SY eess.SY Version update

Adversarial Stress Testing of SPARK Humanoid Safety Filters

Saurav Ghosh, Abdou Sow, Luke Zhang

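Affiliations * Department of Computer Science and Engineering, Washington University in St. Louis, Missouri, United States

AI summary This paper studies the robustness of SPARK humanoid safety filters through replication and stress testing. It evaluates several methods across different environments, shows how safety behavior changes under obstacle crowding, noisy distance estimates, and delayed information, and stresses the need for evaluation metrics that expose failure modes before deployment.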

Comments 5 pages, 7 figures, 1 table. Code available at https://github.com/ghoshsaurav/spark-adversarial-safety

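Details
AI abstract (translated from Chinese)

Humanoid robots are hard to deploy safely because they have high-dimensional bodies, many collision constraints, and must operate near people and obstacles. Safety filters help by modifying the nominal control action when it might violate collision-avoidance constraints. However, nominal benchmark scores do not fully show how these filters behave in harder environments. In this work, we study the robustness of SPARK humanoid safety filters through replication and stress testing. We replicate the SPARK benchmark case G1SportMode_D1_WG_SO_v1 in MuJoCo and evaluate RSSA, RSSS, SSA, CBF, PFM, and SMA under controlled random seeds. We also build a post-processing pipeline that converts raw SPARK logs into goal-tracking, minimum-distance, and collision-step metrics. Our results show that some methods track the goal more closely, while others reduce collision steps more effectively. The stress tests further show that safety behavior can change under obstacle crowding, noisy distance estimates, and delayed obstacle information. These findings suggest that humanoid autonomy should be evaluated beyond nominal performance, using metrics that expose failure modes.

English abstract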

Humanoid robots are difficult to deploy safely because they have high-dimensional bodies, many collision constraints, and must operate near people and obstacles. Safety filters help by modifying a nominal control action when it may violate collision-avoidance constraints. Still, nominal benchmark scores do not fully show how these filters behave in harder environments. In this work, we study the robustness of SPARK humanoid safety filters through replication and stress testing. We replicate the SPARK benchmark case G1SportMode_D1_WG_SO_v1 in MuJoCo and evaluate RSSA, RSSS, SSA, CBF, PFM, and SMA under controlled random seeds. We also built a post-processing pipeline that converts raw SPARK logs into goal-tracking, minimum-distance, and collision-step metrics. Our results show that some methods track the goal more closely, while others reduce collision steps more effectively. The stress tests further indicate that safety behavior can change under obstacle crowding, noisy distance estimates, and delayed obstacle information. These findings suggest that humanoid autonomy should be evaluated beyond nominal performance, using metrics that expose failure modes before deployment.
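The three metrics named in the abstract above (goal tracking, minimum distance, collision steps) can be sketched as a small post-processing function. This is only an illustration under stated assumptions: the log layout (a list of per-step dicts with `robot`, `goal`, and `obstacles` points), the function name `summarize`, and the `collision_margin` parameter are hypothetical and do not reflect the actual SPARK log format, which tracks whole-body collision geometry rather than single points.

```python
import math

def summarize(log, collision_margin=0.0):
    """Reduce a per-step log to goal-tracking, minimum-distance and collision-step metrics.

    Hypothetical log format: each step is a dict with 'robot' and 'goal' points
    and a list of 'obstacles' points, all given as coordinate tuples.
    """
    goal_errors = []
    min_distance = math.inf
    collision_steps = 0
    for step in log:
        goal_errors.append(math.dist(step["robot"], step["goal"]))
        # Distance to the closest obstacle at this step (inf if there are none).
        nearest = min(
            (math.dist(step["robot"], obs) for obs in step["obstacles"]),
            default=math.inf,
        )
        min_distance = min(min_distance, nearest)
        if nearest <= collision_margin:
            collision_steps += 1
    return {
        "mean_goal_error": sum(goal_errors) / len(goal_errors),
        "min_distance": min_distance,
        "collision_steps": collision_steps,
    }

log = [
    {"robot": (0.0, 0.0), "goal": (3.0, 4.0), "obstacles": [(1.0, 0.0)]},
    {"robot": (3.0, 0.0), "goal": (3.0, 4.0), "obstacles": [(3.0, 0.05)]},
]
metrics = summarize(log, collision_margin=0.1)
print(metrics)
```

Reporting the three numbers side by side is what lets a comparison separate filters that track the goal closely from filters that keep collision steps low, which is the trade-off the abstract describes.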

2605.19004 2026-05-20 cs.CV cs.LG cs.RO Version update

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon, Kun Qian, Junfeng Jiao, Christian Claudel

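Affiliations * Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin; Meta Reality Labs; School of Architecture, The University of Texas at Austin

AI summary This paper presents the EgoTraj dataset for multimodal prediction. It contains 75 human navigation sequences recorded in real urban environments, with synchronized RGB video and ground-truth data including 6-degree-of-freedom head poses, 3D eye gaze vectors, and scene annotations, and it demonstrates the dataset's value for AR-based perception, navigation, and assistive systems.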

Comments 21 pages, 14 figures. Project page: https://github.com/yehiahmad/EgoTraj

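Details
AI abstract (translated from Chinese)

Accurately forecasting human trajectories from a first-person perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. Progress in this direction has been limited, however, by the lack of first-person trajectory datasets from real-world environments. To address this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded with the Meta Quest Pro (MQPro). EgoTraj contains 75 human navigation sequences collected from multiple MQPro wearers in real urban environments. Each recording provides synchronized RGB video together with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets in that it captures long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the dataset's potential, we benchmark several state-of-the-art egocentric trajectory prediction methods and run ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.

English abstract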

Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.

2605.18921 2026-05-20 cs.RO Version update

Geo-Data-Driven HD Map Generation Workflow with Integrated Reference-Free Constraint-Based Verification

Ruidi He, Vaibhav Tiwari, Mohanad Al-Ghobari, Meng Zhang, Andreas Rausch

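Affiliations * Institute for Software and Systems Engineering

AI summary This paper proposes a geo-data-driven workflow for HD map generation with integrated reference-free, constraint-based verification. It reduces the dependence on high-precision reference data and makes HD map engineering more feasible where specialised measurement data or independent reference maps are unavailable.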

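Details
AI abstract (translated from Chinese)

High-definition (HD) maps are a core artifact of automated driving systems, but generating them usually relies on sensor-intensive mobile mapping campaigns, and quality assessment often depends on high-precision reference data. These dependencies make HD map engineering costly and hard to apply where specialised measurement data or independently measured reference maps are unavailable. This paper proposes an engineering-oriented, geo-data-driven workflow for HD map generation with integrated representation-level verification. The workflow uses openly available geo-engineering datasets as its primary input and, through explicit intermediate representations and processing stages, transforms them into lane-level HD map representations of existing road environments. To assess the generated representations without an external reference map, the workflow integrates executable constraint-based verification into the engineering process. The selected constraints are derived from specifications relevant to automated driving and road-design guidelines. They are evaluated directly on the generated lanelet representation to detect geometric, topological, and elevation-related inconsistencies. The workflow is evaluated using real-world shapefile-based road-network data from four cities in Lower Saxony, Germany, together with controlled defect-injection scenarios. The real-world evaluation shows that the generated map representations satisfy the selected constraints in the evaluated scenarios, while the defect-injection study demonstrates complete detection of the considered defect types with no observed false positives. The results indicate that geo-data-driven HD map generation with integrated executable verification can provide a modular and inspectable complement to sensor-intensive mapping workflows when sensing and reference data are less available.

English abstract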

High-definition (HD) maps are core artifacts for automated driving systems, but their generation commonly relies on sensor-intensive mobile mapping campaigns, while quality assessment often depends on high-precision reference data. These dependencies make HD map engineering costly and difficult to apply in settings where specialised measurement data or independently measured reference maps are unavailable. This paper presents an engineering-oriented geo-data-driven workflow for HD map generation with integrated representation-level verification. The workflow uses openly available geo-engineering datasets as the primary input source and transforms them into lane-level HD map representations of existing road environments through explicit intermediate representations and processing stages. To assess the generated representations without external reference maps, the workflow integrates executable constraint-based verification into the engineering process. Selected constraints are derived from specifications relevant to automated driving and road-design guidelines. They are evaluated directly on the generated lanelet-based representation to detect geometric, topological, and elevation-related inconsistencies. The workflow is evaluated using real-world shapefile-based road-network data from four cities in Lower Saxony, Germany, and controlled defect-injection scenarios. The real-world evaluation shows that the generated map representations satisfy the selected constraints in the evaluated scenarios, while the defect-injection study demonstrates complete detection of the considered defect types without observed false positives. The results indicate that geo-data-driven HD map generation with integrated executable verification can provide a modular and inspectable complement to sensor-intensive mapping workflows under reduced sensing and reference-data availability.
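To make "executable constraint-based verification" concrete, here is a minimal sketch of what reference-free checks on a lanelet-like structure could look like. Everything in it is an assumption for illustration: the dictionary layout, the function name `verify`, and the thresholds `min_width` and `max_grade` are placeholders, not the constraints or values the paper derives from driving specifications and road-design guidelines. It covers one geometric, one elevation-related, and one topological check, and the example map contains one injected defect of each kind.

```python
import math

def verify(lanelets, min_width=2.5, max_grade=0.12):
    """Return a list of (lanelet_id, constraint, detail) violations.

    Hypothetical layout per lanelet:
      'left' / 'right': boundary points (x, y), paired index by index
      'elev': (arc_length, elevation) samples along the centerline
      'successors': ids of the lanelets that follow this one
    """
    violations = []
    for lid, ll in lanelets.items():
        # Geometric: lane width between paired boundary points.
        for i, (p, q) in enumerate(zip(ll["left"], ll["right"])):
            if math.dist(p, q) < min_width:
                violations.append((lid, "width", i))
        # Elevation: longitudinal grade between consecutive samples.
        for i, ((s0, z0), (s1, z1)) in enumerate(zip(ll["elev"], ll["elev"][1:])):
            if s1 > s0 and abs(z1 - z0) / (s1 - s0) > max_grade:
                violations.append((lid, "grade", i))
        # Topological: every successor must exist in the map.
        for succ in ll["successors"]:
            if succ not in lanelets:
                violations.append((lid, "dangling_successor", succ))
    return violations

lanelets = {
    "a": {"left": [(0, 3.5), (10, 3.5)], "right": [(0, 0), (10, 0)],
          "elev": [(0, 0.0), (10, 0.5)], "successors": ["b"]},
    # "b" carries three injected defects: narrow end, steep grade, missing successor.
    "b": {"left": [(10, 3.5), (20, 2.0)], "right": [(10, 0), (20, 0)],
          "elev": [(10, 0.5), (20, 3.0)], "successors": ["c"]},
}
violations = verify(lanelets)
print(violations)
```

Because each check reads only the generated representation, no external reference map is needed, which is the point of the reference-free design; a clean lanelet such as `"a"` producing no violations corresponds to the absence of false positives on that input.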

2605.18895 2026-05-20 cs.RO cs.AI Version update

KG-ASG: Collision-Knowledge-Guided Closed-Loop Adversarial Scenario Generation With Primary-Support Attribution

Cheng Wang, Chen Xiong, Ziwen Wang, Yuchen Zhou, Qiang Liu

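Affiliations * Guangdong Provincial Key Laboratory of Intelligent Transportation System, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University

AI summary This paper proposes the KG-ASG framework, which uses collision-knowledge guidance and primary-support attribution to improve adversarial effectiveness, interpretability, and executability in the safety validation of autonomous driving systems.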

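Details
AI abstract (translated from Chinese)

Safety validation of autonomous driving systems requires high-risk scenario coverage, clear collision semantics, executable trajectories, and traceable multi-vehicle interactions. Existing safety-critical scenario generation methods usually rely on low-level trajectory perturbations, collision-proxy optimization, or single-adversary search, and may produce adversarial samples with ambiguous collision causes or uncontrolled multi-vehicle collisions. This paper proposes KG-ASG, a collision-knowledge-guided closed-loop adversarial scenario generation framework with primary-support attribution. KG-ASG builds a structured collision knowledge base and trains a lightweight Collision Expert to infer the target collision mode, the unique primary adversary, the support vehicles, and their interaction roles. Guided by this semantic prior, multi-vehicle adversarial generation is formulated as a primary-support process in which the primary adversary induces the main conflict and the support vehicles shape the surrounding risk structure without becoming additional colliders. Rule, physical, interaction-safety, and single-collider constraints act as hard gates that filter out non-executable samples. To handle reactive ego behaviors, planner-controller feedback is further used for failure diagnosis, candidate re-ranking, and terminal refinement. Experiments on WOMD scenarios reconstructed in MetaDrive show that KG-ASG achieves strong adversarial effectiveness under IDM, Cruise, and Expert controllers while improving Valid Primary Attack, reducing multi-collision, and obtaining closed-loop recovery gains. These results show that collision-knowledge guidance and primary-support single-collider reasoning improve adversarial effectiveness, interpretability, and executability for autonomous driving safety validation.

English abstract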

Safety validation of autonomous driving systems requires high-risk scenario coverage, clear collision semantics, executable trajectories, and attributable multi-vehicle interactions. Existing safety-critical scenario generation methods often rely on low-level trajectory perturbations, collision-proxy optimization, or single-adversary search, which may produce adversarial samples with ambiguous collision causes or uncontrolled multi-vehicle collisions. This paper proposes KG-ASG, a collision-knowledge-guided closed-loop adversarial scenario generation framework with primary-support attribution. KG-ASG constructs a structured collision knowledge base and trains a lightweight Collision Expert to infer the target collision mode, the unique primary adversary, support vehicles, and their interaction roles. Guided by this semantic prior, multi-vehicle adversarial generation is formulated as a primary-support process, where the primary adversary induces the main conflict and support vehicles shape the surrounding risk structure without becoming additional colliders. Rule, physical, interaction-safety, and single-collider constraints are imposed as hard gates to filter non-executable samples. To handle reactive ego behaviors, planner-controller feedback is further used for failure diagnosis, candidate re-ranking, and terminal refinement. Experiments on WOMD scenarios reconstructed in MetaDrive show that KG-ASG achieves strong adversarial effectiveness while improving Valid Primary Attack, reducing multi-collision, and obtaining closed-loop recovery gains under IDM, Cruise, and Expert controllers. These results demonstrate that collision-knowledge guidance and primary-support single-collider reasoning improve adversarial effectiveness, interpretability, and executability for autonomous driving safety validation.

2605.18872 2026-05-20 cs.LG cs.AI cs.RO Version update

EUPHORIA: Efficient Universal Planning via Hybrid Optimization for Robust Industrial Robotic Assembly

Shih-Yu Lai, Chia-Ching Yen, Yang-Ting Shen, Peter Yichen Chen, Yu-Lun Liu, Bing-Yu Chen

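Affiliations * National Taiwan University; MoonShine Animation Studio; National Cheng Kung University; The University of British Columbia; National Yang Ming Chiao Tung University

AI summary This paper proposes the EUPHORIA framework, which achieves universal few-shot adaptability and dynamic efficiency through a hybrid optimization strategy, addressing the problem that planners for architectural robotic assembly are either highly specialized or operationally inefficient. It combines a Meta-Geometric Encoder, a Physics-Informed Graph Transformer, and Residual Stability Correction to achieve efficient and robust assembly planning.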

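Details
AI abstract (translated from Chinese)

Robotic assembly in architectural construction faces a persistent bottleneck: existing planners are either highly specialized, requiring costly retraining for every new geometric design, or operationally inefficient, treating structural sequencing and kinematic motion as separate processes. We propose EUPHORIA, a unified framework that achieves universal few-shot adaptability and dynamic efficiency through a hybrid optimization strategy. To overcome the retraining bottleneck, we propose a Meta-Geometric Encoder based on Graph Hypernetworks: unlike standard contrastive learning, which performs only feature-level recognition, our hypernetwork dynamically generates policy parameters from a minimal support set, enabling parameter-level adaptation to complex topologies (e.g., domes, arches) without gradient-based retraining. For structural reasoning, we introduce a Physics-Informed Graph Transformer trained with Soft Actor-Critic (SAC), whose Physics-Bias Attention mechanism modulates attention scores using contact forces from Discrete Element Model (DEM) simulations, guiding the planner toward structurally critical connections. We further ensure operational efficiency through Kinematics-Aware Sequencing, in which the SAC objective penalizes high-energy transitions. Finally, we bridge the sim-to-real gap with Residual Stability Correction, a differentiable optimization layer that fine-tunes coarse assembly actions by minimizing a joint energy-stability cost before execution. Experiments show that EUPHORIA significantly reduces energy consumption compared with decoupled baselines and achieves state-of-the-art success rates on unseen, non-standard geometries, fusing meta-learning, physics-informed attention, and residual optimization into a coherent, general planner.

English abstract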

Robotic assembly in architectural construction faces a persistent bottleneck: existing planners are either highly specialized, requiring prohibitive retraining for every new geometric design, or operationally inefficient, treating structural sequencing and kinematic motion as disjoint processes. We present EUPHORIA, a unified framework that achieves universal few-shot adaptability and dynamic efficiency through a hybrid optimization strategy. To overcome the retraining bottleneck, we propose a Meta-Geometric Encoder based on Graph Hypernetworks: unlike standard contrastive learning, which performs only feature-level recognition, our hypernetwork dynamically generates policy parameters from a minimal support set, enabling parameter-level adaptation to complex topologies (e.g., domes, arches) without gradient-based retraining. For structural reasoning, we introduce a Physics-Informed Graph Transformer trained via Soft Actor-Critic (SAC), with a Physics-Bias Attention mechanism that modulates attention scores using contact forces from Discrete Element Model (DEM) simulations, guiding the planner toward structurally critical connections. We further ensure operational efficiency through Kinematics-Aware Sequencing, where the SAC objective penalizes high-energy transitions. Finally, we bridge the Sim2Real gap via Residual Stability Correction, a differentiable optimization layer that fine-tunes coarse assembly actions by minimizing a joint energy-stability cost prior to execution. Experiments show that EUPHORIA significantly reduces energy consumption over decoupled baselines and achieves state-of-the-art success rates on unseen, non-standard geometries with minimal few-shot examples, fusing meta-learning, physics-informed attention, and residual optimization into a cohesive, generalized planner.

2412.02818 2026-05-20 cs.RO cs.LG Version update

RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields

Som Sagar, Jiafei Duan, Sreevishakh Vasudevan, Yifan Zhou, Heni Ben Amor, Dieter Fox, Ransalu Senanayake

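Affiliations * Arizona State University; University of Washington

AI summary This study proposes the RoboMD framework, which learns a deep reinforcement learning policy over a continuous vision-language embedding to uncover vulnerabilities that robots exhibit under external variations in the real world. Virtual rollouts make the vulnerability analysis efficient and safe. Experiments show that it finds up to 23% more vulnerabilities than existing baselines and improves robot manipulation performance.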

Comments 26 Pages, 20 figures

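Details
AI abstract (translated from Chinese)

Robot manipulation policies, although central to the promise of physical AI, are highly vulnerable to external variations in the real world. Diagnosing these vulnerabilities faces two major challenges: (i) the relevant variations to test against are usually unknown, and (ii) testing directly in the real world is costly and unsafe. We introduce a framework that learns a separate deep reinforcement learning (deep RL) policy for vulnerability prediction through virtual runs on a continuous vision-language embedding. By treating this embedding space, rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions and to be repelled from success regions. The vulnerability prediction policy is trained on virtual rollouts, so vulnerability analysis can be carried out scalably and safely without expensive physical trials. By querying the policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments on simulation benchmarks and a physical robot arm show that our framework uncovers up to 23% more vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities missed by heuristic testing. We also show that fine-tuning the manipulation policy with the vulnerabilities found by our framework improves manipulation performance with less fine-tuning data.

English abstract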

Robot manipulation policies, while central to the promise of physical AI, are highly vulnerable in the presence of external variations in the real world. Diagnosing these vulnerabilities is hindered by two key challenges: (i) the relevant variations to test against are often unknown, and (ii) direct testing in the real world is costly and unsafe. We introduce a framework that tackles both issues by learning a separate deep reinforcement learning (deep RL) policy for vulnerability prediction through virtual runs on a continuous vision-language embedding trained with limited success-failure data. By treating this embedding space, which is rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions while being repelled from success regions. This vulnerability prediction policy, trained on virtual rollouts, enables scalable and safe vulnerability analysis without expensive physical trials. By querying this policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments across simulation benchmarks and a physical robot arm show that our framework uncovers up to 23% more unique vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities overlooked by heuristic testing. Additionally, we show that fine-tuning the manipulation policy with the vulnerabilities discovered by our framework improves manipulation performance with much less fine-tuning data.