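arXivDaily daily academic digest: synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.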

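2605.20138 2026-05-20 cs.RO cs.SY eess.SY Updated version

Hamilton-Jacobi Reachability for Spacecraft Collision Avoidance

Larry Hui, Jordan Kam, William Su, Jianshu Zhou

Affiliations Department of Mechanical Engineering, University of California, Berkeley; Aerospace Engineering Program, University of California, Berkeley; Department of Mechanical Engineering, National University of Singapore

AI summary This paper presents a Hamilton-Jacobi (HJ) reachability framework for collision avoidance between two satellites in the same orbit, modeling relative motion in the radial-tangential-normal (RTN) frame with planar Hill-Clohessy-Wiltshire (HCW) dynamics. The target set is defined as the unsafe relative configurations corresponding to minimum separation requirements under Federal Communications Commission (FCC) orbital standards. The interaction between spacecraft is modeled as a zero-sum differential game in which Player 1 is the controlled satellite and Player 2 is a bounded adversarial disturbance with unknown intent. The paper gives the HJ formulation and computes backward reachable sets describing the relative states from which collision cannot be avoided in the worst case, while states outside these sets admit provably safe trajectories. The reachable sets are combined with supervisory hybrid control logic to decide when an evasive maneuver must be initiated, providing mathematically grounded safety guarantees for scalability.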

Comments Accepted to the 20th IEEE International Conference on Control & Automation (IEEE ICCA 2026). 6 pages, 4 figures

Abstract

This article presents a Hamilton-Jacobi (HJ) reachability framework for a two-satellite collision avoidance problem operating in the same circular orbit, where relative motion is modeled in the radial-tangential-normal (RTN) frame using planar Hill-Clohessy-Wiltshire (HCW) dynamics. We define the target state space as unsafe relative configurations in the orbit plane corresponding to minimum separation requirements consistent with Federal Communications Commission (FCC) orbital standards. The interaction between spacecraft is formulated as a zero-sum differential game, where Player 1 is the controlled satellite and Player 2 is modeled as a bounded adversarial disturbance with unknown intent. We present the HJ formulation and compute backward reachable sets that characterize relative states from which collision cannot be avoided under worst-case disturbances, while states outside this set admit provably collision-free trajectories. These reachable sets are integrated with supervisory hybrid control logic to determine when evasive maneuvers must be initiated, enabling mathematically grounded safety guarantees for scalability.
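
The planar HCW model named in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the state ordering `[x, y, vx, vy]` (x radial, y along-track), the function names, the explicit Euler step, and the 550 km example orbit are all choices made for this example. The `accel` argument stands for the net relative acceleration, however control and disturbance are combined.

```python
import math

def hcw_planar_derivative(state, accel, n):
    """Time derivative of the planar HCW state [x, y, vx, vy].

    x is the radial offset and y the along-track offset (RTN frame),
    accel = (ax, ay) is the net relative acceleration, and n is the
    mean motion of the reference circular orbit in rad/s.
    """
    x, y, vx, vy = state
    ax, ay = accel
    return [
        vx,
        vy,
        3.0 * n * n * x + 2.0 * n * vy + ax,  # radial equation
        -2.0 * n * vx + ay,                   # along-track equation
    ]

def euler_step(state, accel, n, dt):
    """One explicit Euler step (illustrative only, not for accuracy)."""
    deriv = hcw_planar_derivative(state, accel, n)
    return [s + dt * d for s, d in zip(state, deriv)]

# Mean motion for a hypothetical 550 km circular orbit (assumed values).
MU_EARTH = 3.986004418e14   # m^3/s^2
RADIUS = 6371e3 + 550e3     # m
N = math.sqrt(MU_EARTH / RADIUS**3)
```

A pure along-track offset with zero relative velocity is an equilibrium of these equations, which gives a quick sanity check on the signs: the derivative at `[0, 100, 0, 0]` with no applied acceleration is zero in every component.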


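2605.20101 2026-05-20 cs.RO Updated version

Topology-Optimized Pneumatic Soft Actuator: Design and Experimental Validation

Sumit Mehta, Konstantinos Poulios

Affiliations DTU (Technical University of Denmark)

AI summary This paper designs a soft elastic pneumatic actuator using nonlinear topology optimization and validates its performance experimentally.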

Comments 20 pages, 13 figures

2605.20072 2026-05-20 cs.AI cs.RO Updated version

Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

Oussama Zenkri, Oliver Brock

Affiliations Robotics and Biology Laboratory, Technische Universität Berlin, Germany; Science of Intelligence, Research Cluster of Excellence, Berlin, Germany; Robotics Institute Germany (RIG)

AI summary This paper studies how embodied LLM agents behave under different observation inputs and finds that higher-fidelity observations actually reduce problem-solving ability. The core method varies the information available to the agent and measures the resulting changes in behavior; the main contribution is to expose the interaction between perceptual errors and reasoning failures.

Comments Submitted to From Animals to Animats: The 18th International Conference on the Simulation of Adaptive Behavior (SAB)

Abstract

Large Language Models are increasingly proposed as cognitive components for robotic systems, yet their opaque decision processes make it difficult to explain success or failure in closed-loop embodied tasks. Following an empirical AI methodology, we study embodied LLM agents behaviorally by varying the information available to the agent and measuring the resulting changes in behavior. Using the Lockbox, a sequential mechanical puzzle with hidden interdependencies, we evaluate LLMs across RGB, RGB-D, and ground-truth symbolic observations in a physical robotic setup and use controlled simulation to probe the resulting behavior. Counterintuitively, agents perform best under raw RGB input and worst under perfect ground-truth observations. In simulation, we probe this effect by randomly flipping perceived action outcomes and find that moderate noise improves performance, peaking at a 40% flip probability with a 2.85-fold success rate increase over the noise-free baseline. Further analysis links this gain to a reduction in repetitive action loops. These findings suggest that success rates alone are insufficient for evaluating LLMs, as measured performance may reflect the interaction between perceptual errors and reasoning failures rather than robust problem solving.
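
The noise probe described above, randomly flipping perceived action outcomes, can be sketched as follows. This is a hedged illustration: the function name `perceive_outcome`, the boolean encoding of an outcome, and the seeded generator are assumptions for the example and do not come from the paper.

```python
import random

def perceive_outcome(true_success, flip_prob, rng):
    """Return the action outcome the agent observes.

    With probability flip_prob the true boolean outcome is inverted,
    mimicking a perception error; otherwise it is reported faithfully.
    """
    if rng.random() < flip_prob:
        return not true_success
    return true_success

rng = random.Random(0)  # seeded so runs are repeatable
observed = [perceive_outcome(True, 0.4, rng) for _ in range(1000)]
flipped_fraction = observed.count(False) / len(observed)
```

With `flip_prob=0.4`, the setting at which the abstract reports peak performance, roughly 40% of the outcomes reported to the agent are wrong; sweeping `flip_prob` from 0 to 1 would reproduce the shape of the probe, though not the paper's results.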


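2605.20055 2026-05-20 cs.SE cs.AI cs.RO Updated version

TravExplorer: Cross-Floor Embodied Exploration via Traversability-Aware 3-D Planning

Han Zheng, Zhe Chen, Yudong Huang, Haoran Liu, Jinghao Wang, Ming Yang, Tong Qin

Affiliations Shanghai Jiao Tong University

AI summary This paper proposes TravExplorer, a framework that couples zero-shot semantic guidance with traversability-aware 3-D planning for cross-floor embodied exploration. A unified volumetric map distinguishes occupied structures from robot-reachable support surfaces and yields traversable frontiers, while a FOV-aware active perception strategy resolves incomplete observations during cross-floor traversal. The approach is evaluated over 4,195 simulated episodes on HM3D and MP3D, and real-world trials validate open-vocabulary target search without prior maps or human intervention.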

Abstract

Zero-shot Object Navigation (ZSON) has shown promise for open-vocabulary target search in unseen environments, yet most existing systems remain tied to planar representations and single-floor assumptions. These assumptions become inadequate in real buildings, where navigation involves floors, stairs, landings, and vertically overlapping spaces. This article presents TravExplorer, a cross-floor embodied exploration framework that couples zero-shot semantic guidance with traversability-aware 3-D planning. TravExplorer maintains a unified volumetric map that distinguishes occupied structures from robot-reachable support surfaces and extracts traversable frontiers from connected support surfaces, including floors, stairs, and landings. A FOV-aware active perception strategy further resolves incomplete observations during cross-floor traversal. To reduce semantic-reasoning latency, a lightweight guidance module aligns a probabilistic instance map from online open-vocabulary segmentation with a spatial value map from fast image-to-text matching. Based on these geometric and semantic memories, a hierarchical planner performs target-aware frontier touring over object hypotheses, traversable frontiers, and stair landmarks, and generates executable cross-floor motions through foothold-guided 3-D search and vertically constrained local trajectory optimization. Experiments over 4,195 simulated episodes on HM3D and MP3D demonstrate consistent advantages over representative ObjectNav baselines. Fifty real-world trials on a Unitree Go2 further validate open-vocabulary target search across single-floor and cross-floor indoor environments without prior maps or human intervention. The code will be released at https://github.com/wuyi2121/TravExplorer.


2605.19957 2026-05-20 cs.CV cs.AI cs.RO Updated version

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Zuyao Lin, Jianhui Zhang, Peidong Jia, Xiaoguang Zhao, Shanghang Zhang, Xingyu Chen

Affiliations Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peking University

AI summary This paper proposes a new world-ego modeling paradigm that decomposes future evolution into world and ego components to address degradation in long-horizon embodied scenarios with hybrid tasks, and validates it on the HTEWorld benchmark.

Abstract

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.


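2605.19940 2026-05-20 cs.AI cs.RO Updated version

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

Rebecca Ramnauth, Drazen Brscic, Brian Scassellati

Affiliations Yale University; Kyoto University

AI summary This paper proposes a robotics-inspired guardrail framework for runtime behavioral control of foundation models in socially sensitive domains, reducing the drift of interaction trajectories into undesirable states while adapting to diverse social contexts.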

Comments Under review at Journal of Artificial Intelligence Research (JAIR)

Abstract

Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches, ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation, primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.


2605.19924 2026-05-20 cs.RO Updated version

RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations

Shuoqin Zhang, Yixin Xiong, Xiru Gao, Kai Liu, Ke Wang, Xichuan Zhou, Zhe Hu

Affiliations Chongqing University; Chengdu Anu Intelligence

AI summary This paper proposes RoHIL, an offline fine-tuning framework that addresses the performance drop robots suffer when illumination changes between workstations, preserving performance on the original workstation and avoiding re-collection of data and retraining.

Abstract

Human-in-the-loop reinforcement learning systems achieve near-perfect success on the workstation where they are trained, but collapse when the same robot is moved to a workstation a few meters away due to shifts in the visual input distribution caused by new lamp positions and window light. Re-collecting demonstrations and re-running HIL on every workstation is incompatible with deployment, and naively fine-tuning on shifted-light data triggers catastrophic forgetting of the source workstation. To close this cross-domain gap, we present RoHIL, an offline fine-tuning framework that uses no extra real-robot interaction. RoHIL combines (i) a world-model-based image relighter that re-synthesises the visual stream of source-workstation trajectories under multiple virtual HDRI environments, leaving actions and rewards real; (ii) Illumination-Retention Replay (IRR), a data-level anti-forgetting mechanism that interleaves relit adaptation transitions with original-light retention transitions to preserve source-workstation Bellman coverage; and (iii) an anchored Bellman-actor regulariser that constrains representation and policy drift from the original source-workstation policy. Across four real-robot manipulation tasks under significant cross-workstation illumination variations, RoHIL substantially improves shifted-light performance where standard HIL-RL collapses, while preserving source-workstation performance, eliminating the need to re-collect data and retrain for every new workstation and environment. Project page: https://anonymous4365.github.io/RoHIL/


2605.19919 2026-05-20 cs.RO Updated version

Justifying Bio-Inspired Robotics Research: A Taxonomy of Strategies

Margaret J. Zhang, Justin Ting, Talia Y. Moore

Affiliations Mechanical Engineering, University of Michigan, USA; Electrical and Computer Engineering, University of Michigan, USA; Robotics, Mechanical Engineering, Ecology and Evolutionary Biology, Museum of Zoology, University of Michigan, USA

AI summary This paper proposes a taxonomy of motivations for bio-inspired design to help robotics researchers justify their specific bio-inspired approach and to help funding program managers assess the value of different bio-inspired approaches.

Abstract

For most of human history, we have not thought systematically about how and why we incorporate aspects of the natural world into our designs. The lack of a systematic approach has resulted in inconsistencies in motivations and methods that make it difficult to predict or evaluate the success of bio-inspired design. This mismatch between expectations and results can lead to disappointment when a reader considers a bio-inspired design to be superficial, weak, or incomplete. This is especially true in the field of Robotics, in which similarity to a biological system might be the driving motivation for construction. In an effort to help robotics researchers justify their specific bio-inspired approach and to assist funding program managers in discerning the value of different bio-inspired approaches, here we propose a taxonomy of motivations for bio-inspired design and describe the potential significant contributions that are likely to result from different approaches.


2605.19837 2026-05-20 cs.CV cs.AI cs.CL cs.RO Updated version

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

Sherif Khairy, Catherine M. Elias

Affiliations Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt

AI summary This paper proposes CADENet, a training-free three-thread system that detects objects under adverse weather in autonomous driving through condition-adaptive enhancement and entropy-guided NMS fusion, with no retraining or additional hardware.

Abstract

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.


2605.19824 2026-05-20 cs.AI cs.CL cs.CV cs.RO Updated version

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

Affiliations Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; M.Eng. Robotics Candidate at Deggendorf Institute of Technology, Germany; IAV GmbH, Berlin, Germany

AI summary This study examines whether introducing temporal conditioning into inter-agent communication can preserve or enhance reasoning coherence without degrading semantic or logical consistency, evaluating three planner architectures with progressively increasing temporal integration on curated subsets of the BDD-X dataset. Results show that temporal conditioning reshapes reasoning style but yields no statistically significant improvement in standard NLP correctness metrics, although qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence.

Abstract

Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.


2605.16692 2026-05-20 cs.LG cs.AI cs.RO Updated version

EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control

Thomas Evers, Cristian Meo, Wendelin Bohmer, Justin Dauwels, Yaniv Oren

Affiliations TU Delft; LatentWorlds AI

AI summary This paper proposes EfficientTDMPC, a model-based reinforcement learning method for continuous control that improves sample efficiency by reducing estimation error and increasing data freshness.

Abstract

We introduce EfficientTDMPC, a sample-efficient model-based reinforcement learning method for continuous control built on the TD-MPC family of algorithms. Central to this family is a planner that aims to find an action sequence that maximizes the estimated return. The return is estimated using a learned model and value networks, each of which can introduce error. EfficientTDMPC proposes to reduce this error in two ways. First, it introduces an ensemble of dynamics models and averages the return estimates across those models and across different rollout depths. Second, it adds the option to apply an uncertainty penalty to the planner objective, yielding a planner that avoids actions with uncertain return estimates. It then adds practical improvements which increase buffer data freshness and reduce compute. Lastly, we find that our contributions enable EfficientTDMPC to benefit more from a higher update-to-data (UTD) ratio, further improving sample efficiency. To the best of our knowledge, in the low data regime of each benchmark, EfficientTDMPC achieves state-of-the-art (SOTA) in terms of sample efficiency on HumanoidBench-Hard and DMC hard, while matching SOTA on DMC easy.
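
The two error-reduction ideas in the abstract, averaging return estimates across ensemble members and rollout depths and optionally penalizing uncertainty, can be sketched as a single scoring function. This is only a sketch: the abstract does not say how uncertainty is measured, so the use of the population standard deviation, the name `planner_score`, and the example numbers are assumptions made here.

```python
from statistics import mean, pstdev

def planner_score(return_estimates, penalty_weight=0.0):
    """Score one candidate action sequence.

    return_estimates[m][d] is the return predicted by dynamics model m
    using rollout depth d. The score is the average over all models and
    depths, minus penalty_weight times their spread (assumed form).
    """
    flat = [r for per_model in return_estimates for r in per_model]
    return mean(flat) - penalty_weight * pstdev(flat)

# Two hypothetical candidates: 2 models x 2 rollout depths each.
confident = [[10.0, 10.0], [10.0, 10.0]]
uncertain = [[4.0, 16.0], [20.0, 8.0]]
```

Without a penalty the second candidate wins on its higher average estimate; with `penalty_weight=1.0` its disagreement across models and depths pushes it below the first, which is the behavior the abstract describes as avoiding actions with uncertain return estimates.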


2605.16137 2026-05-20 cs.CV cs.RO Updated version

STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System

Zhen Luo, Yixuan Yang, Xudong Xu, Jinkun Hao, Zhaoyang Lyu, Feng Zheng, Jiangmiao Pang, Yanwei Fu

Affiliations Shanghai AI Laboratory

AI summary This paper proposes STABLE, a semantics-physics dual system for generating simulation-ready tabletop layouts: a semantic reasoning module produces a coarse layout and a physics correction module refines it to ensure physical plausibility, improving the physical validity of the scenes.

Comments ICML 2026

Abstract

Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding object collisions or floating due to LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, which ensures the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.


2604.07303 2026-05-20 cs.RO Updated version

Robots that learn to evaluate models of collective behavior

Mathis Hocke, Andreas Gerken, David Bierbach, Jens Krause, Tim Landgraf

Affiliations Department of Computer Science, Freie Universität Berlin; SCIoI Excellence Cluster, Technische Universität Berlin; Faculty of Life Sciences, Humboldt-Universität zu Berlin; Department of Fish Biology, Fisheries, and Aquaculture, Leibniz Institute of Freshwater Ecology and Inland Fisheries

AI summary This paper proposes a reinforcement-learning-based framework that uses a biomimetic robotic fish to evaluate computational models of live fish behavior, quantifying through closed-loop interaction how real fish behavior differs from simulated fish behavior and showing how learning-based robotic experiments can uncover deficiencies in behavioral models.

Abstract

Understanding and modeling animal behavior is essential for studying collective motion, decision-making, and bio-inspired robotics. Yet, evaluating the accuracy of behavioral models still often relies on offline comparisons to static trajectory statistics. Here we introduce a reinforcement-learning-based framework that uses a biomimetic robotic fish (RoboFish) to evaluate computational models of live fish behavior through closed-loop interaction. We trained policies in simulation using four distinct fish models (a simple constant-follow baseline, two rule-based models, and a biologically grounded convolutional neural network model) and transferred these policies to the real RoboFish setup, where they interacted with live fish. Policies were trained to guide a simulated fish to goal locations, enabling us to quantify how the response of real fish differs from the simulated fish's response. We evaluate the fish models by quantifying the sim-to-real gaps, defined as the Wasserstein distance between simulated and real distributions of behavioral metrics such as goal-reaching performance, inter-individual distances, wall interactions, and alignment. The neural network-based fish model exhibited the smallest gap across goal-reaching performance and most other metrics, indicating higher behavioral fidelity than conventional rule-based models under this benchmark. More importantly, this separation shows that the proposed evaluation can quantitatively distinguish candidate models under matched closed-loop conditions. Our work demonstrates how learning-based robotic experiments can uncover deficiencies in behavioral models and provides a general framework for evaluating animal behavior models through embodied interaction.
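
The sim-to-real gap defined above as a Wasserstein distance between metric distributions can be illustrated for a single one-dimensional metric. The sketch below assumes the order-1 distance and equal-size samples, neither of which is stated in the abstract; the function name and the sample values are made up for the example.

```python
def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For empirical distributions with the same number of points, the
    optimal transport plan pairs the sorted values, so the distance is
    the mean absolute difference between the sorted samples.
    """
    if len(sample_a) != len(sample_b) or not sample_a:
        raise ValueError("samples must be non-empty and equal in size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)

# Hypothetical inter-individual distances in cm (made-up numbers).
simulated = [10.0, 12.0, 14.0, 16.0]
real = [11.0, 13.0, 15.0, 17.0]
gap = wasserstein_1d(simulated, real)
```

Here every real value sits 1 cm above its simulated counterpart, so the gap is 1.0; a fish model whose simulated distribution matched the real one exactly would score 0.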


2603.18396 2026-05-20 cs.LG cs.RO Updated version

RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

Yifan Zhang, Liang Zheng

Affiliations Central South University

AI summary This study proposes RE-SAC, which improves the stability and robustness of bus fleet control by disentangling aleatoric and epistemic risks, using Integral Probability Metric (IPM)-based weight regularization and a diversified Q-ensemble to handle the two types of uncertainty.

Abstract

Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
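
The "up to 62%" figure can be checked against the two MAE values quoted in the abstract. The short calculation below assumes that 1647 is the RE-SAC error and 4343 the comparison error, which is how the sentence reads but is not stated explicitly; the helper name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

# MAE values quoted in the abstract for rare out-of-distribution states.
reduction = relative_reduction(4343.0, 1647.0)
```

The result is about 0.62, consistent with the reported reduction.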


2603.09473 2026-05-20 cs.RO cond-mat.mtrl-sci 版本更新

Receptogenesis in a Vascularized Robotic Embodiment

血管化机器人躯体中的感受器生成（Receptogenesis）

Kadri-Ann Pankratov, Leonid Zinatullin, Hans Priks, Adele Metsniit, Urmas Johanson, Tarmo Tamm, Alvo Aabloo, Edoardo Sinibaldi, Indrek Must

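AI summary: This paper proposes adapting and developing a robot body's intrinsic functions through dynamic material restructuring. It uses fluidics to restructure the material interface and to build sensors in response to environmental cues, demonstrating the feasibility of material-level functional restructuring via photopolymerization in a vascularized robotic composite.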

Comments Supplementary Files currently unavailable online. Please contact the First Author to request any Supplementary Files Version 2 - revision

AI中文摘要

为机器人系统赋予在运行过程中全新（ex novo）生成硬件的能力，可以扩展对物理适应性的控制。不同于依赖部署前或部署后离散组件集成的模块化系统，我们设想物理适应和发育可以源自动态的材料重构，从而塑造身体的内在功能。受生物体中重新分配质量和功能的循环系统启发，我们利用流体技术（fluidics）重构材料界面，这是一种目前在机器人学中尚无对应的能力。在此，我们通过一种为可编程材料合成而设计的血管化机器人复合材料实现了这种合成生长能力，并通过感受器生成（receptogenesis）加以演示，即根据环境线索，利用内部液体储备按需构建传感器。通过协调前体的流体输运与外部局部紫外线照射，我们驱动原位光聚合，由内而外地对血管结构进行化学重构。该反应将含有光潜伏引发剂的前体转化为紫外敏感聚吡咯在PETG中的固态分散体，从而建立了一种传感模式，并通过特征性的电阻抗下降得到验证。新合成的传感器闭合了局部控制回路，用于调节仿飞蛾机器人演示器的扑翼。这一物理更新实时提升了机器人的能力。血管化机器人躯体的材料级功能重构，为情境化机器人系统中全新硬件的生成提供了概念验证性的材料基础，这是迈向如下情境化机器人的一步：对环境刺激的反应能够自主产生硬件更新，以匹配新的环境需求。

英文摘要

Equipping robotic systems with the capacity to generate $\textit{ex novo}$ hardware during operation extends control of physical adaptability. Unlike modular systems that rely on discrete component integration pre- or post-deployment, we envision the possibility that physical adaptation and development emerge from dynamic material restructuring to shape the body's intrinsic functions. Drawing inspiration from circulatory systems that redistribute mass and function in biological organisms, we utilize fluidics to restructure the material interface, a capability currently unpaired in robotics. Here, we realize this synthetic growth capability through a vascularized robotic composite designed for programmable material synthesis, demonstrated via receptogenesis - the on-demand construction of sensors from internal fluid reserves based on environmental cues. By coordinating the fluidic transport of precursors with external localized UV irradiation, we drive an $\textit{in situ}$ photopolymerization that chemically reconstructs the vasculature from the inside out. This reaction converts precursors with photolatent initiator into a solid dispersion of UV-sensitive polypyrrole in PETG, establishing a sensing modality validated by a characteristic decrease in electrical impedance. The newly synthesized sensor closed a local control loop to regulate wing flapping in a moth-inspired robotic demonstrator. This physical update increased the robot's capability in real time. Material-level functional restructuring of the vascularized robot body provides a proof-of-concept materials basis for $\textit{ex novo}$ hardware generation in situated robotic systems - a step toward situated robots in which a reaction to environmental stimuli autonomously produces hardware updates to match new environmental demands.


2601.20529 2026-05-20 cs.RO cs.MA 版本更新

A Practical Framework of Key Performance Indicators for Multi-Robot Lunar and Planetary Field Tests

多机器人月球和行星实地测试的关键绩效指标实用框架

Julia Richter, David Oberacker, Gabriela Ligeza, Valentin T. Bickel, Philip Arm, William Talbot, Marvin Grosse Besselmann, Florian Kehl, Tristan Schnell, Hendrik Kolvenbach, Rüdiger Dillmann, Arne Roennau, Marco Hutter

Affiliations: Robotic Systems Lab (RSL), ETH Zürich; FZI Research Center for Information Technology; Machine Intelligence and Robotics Lab (MaiRo), Karlsruhe Institute of Technology (KIT); Department of Environmental Sciences, University of Basel; European Space Agency/ESTEC; Center for Space and Habitability, University of Bern; Space Instruments Group, University of Zürich; Space Science and Technology, ETH Zürich

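AI summary: This paper proposes a KPI framework derived from multi-robot lunar scenarios for evaluating and comparing field tests. It emphasizes scenario-dependent priorities among efficiency, robustness, and precision, and validates the framework's practicality in a field test.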

Comments Presented at ICRA 2026 Workshop on Multi-Agent Robotic Systems: Real-World Collaboration and Interaction

AI中文摘要

在月球上用机器人勘探关键资源（如钛铁矿、稀土元素和水冰）需要稳健的探索方法，因为地形多样且环境条件恶劣。尽管已有大量模拟（analog）实地试验针对这些目标，但由于机器人平台和实验设置存在差异，比较其结果仍然具有挑战性。这些任务通常使用选定的、特定于场景的工程指标来评估性能，而这些指标无法在现场性能与科学驱动目标之间建立明确联系。在本文中，我们从三个反映科学目标和操作约束的现实多机器人月球场景中推导出一个结构化的KPI框架，以填补这一空白。我们的框架强调效率、鲁棒性和精度方面依赖于场景的优先级，并且明确面向实地部署中的实际应用而设计。我们在一次多机器人实地测试中验证了该框架，发现它对于效率和鲁棒性相关的KPI实用且易于应用，而面向精度的KPI则需要可靠的真值（ground-truth）数据，这在户外模拟环境中并不总是可以获得。总体而言，我们提出将该框架作为通用评估标准，以实现对多机器人实地试验的一致、目标导向的比较，并支持面向未来行星探索的机器人系统的系统性开发。

英文摘要

Robotic prospecting for critical resources on the Moon, such as ilmenite, rare earth elements, and water ice, requires robust exploration methods given the diverse terrain and harsh environmental conditions. Although numerous analog field trials address these goals, comparing their results remains challenging because of differences in robot platforms and experimental setups. These missions typically assess performance using selected, scenario-specific engineering metrics that fail to establish a clear link between field performance and science-driven objectives. In this paper, we address this gap by deriving a structured framework of KPI from three realistic multi-robot lunar scenarios reflecting scientific objectives and operational constraints. Our framework emphasizes scenario-dependent priorities in efficiency, robustness, and precision, and is explicitly designed for practical applicability in field deployments. We validated the framework in a multi-robot field test and found it practical and easy to apply for efficiency- and robustness-related KPI, whereas precision-oriented KPI require reliable ground-truth data that is not always feasible to obtain in outdoor analog environments. Overall, we propose this framework as a common evaluation standard enabling consistent, goal-oriented comparison of multi-robot field trials and supporting systematic development of robotic systems for future planetary exploration.


2601.14848 2026-05-20 cs.LG cs.AI cs.NE cs.RO 版本更新

From Observation to Prediction: LSTM for Vehicle Lane Change Forecasting on Highway On/Off-Ramps

从观测到预测：LSTM用于高速公路进出匝道的车辆车道变更预测

Mohamed Abouras, Catherine M. Elias

Affiliations: C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems; Computer Science and Engineering Department, Faculty of Media Engineering and Technology, German University in Cairo

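AI summary: This paper studies how highway on/off-ramp areas differ from straight highway segments. It trains a multi-layer LSTM architecture on the ExiD drone dataset and tests different prediction horizons and model workflows. For predictions within 4 seconds, the reported accuracy reaches 76% in ramp areas and 94% in general highway scenarios.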

2601.12373 2026-05-20 cs.CV cs.HC cs.RO 版本更新

CD-TWINSAFE: A ROS-enabled Digital Twin for Scene Understanding and Safety Emerging V2I Technology

CD-TWINSAFE：一种基于ROS的数字孪生用于场景理解和安全新兴V2I技术

Amro Khaled, Farah Khaled, Omar Riad, Catherine M. Elias

Affiliations: C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; Computer Science and Engineering Department, Faculty of Media Engineering and Technology; German University in Cairo, Egypt

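AI summary: This paper presents CD-TWINSAFE, a V2I-based digital twin system for scene understanding and safety monitoring of autonomous vehicles. Two stacks run simultaneously, an on-board driving stack and a digital twin stack. The scene is replicated using a stereo camera and Unreal Engine 5, and V2I communication is implemented over a ROS architecture.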

DOI: 10.1109/MELECON64486.2026.11418830

AI中文摘要

本文介绍了CD-TWINSAFE，一种基于V2I的自动驾驶车辆数字孪生系统。所提出的架构由两个同时运行的栈组成：一个是车载驾驶栈，包含用于场景理解的立体相机；另一个是数字孪生栈，运行相机所见场景的Unreal Engine 5复制品，并向驾驶舱返回安全警报。车载栈在车辆侧实现，包括两个主要的自主模块：定位和感知。自车的位置和姿态通过车载传感器获取。感知模块负责处理立体相机的20 fps图像，并通过两条互补的流水线理解场景；这两条流水线进行目标检测和特征提取，所提取的特征包括目标速度、偏航角，以及碰撞时间（time-to-collision）和车头时距（time-headway）这两个安全指标。驾驶栈收集的数据通过基于ROS的架构，以自定义ROS2消息的形式发送到基础设施侧，并经由4G调制解调器上的UDP链路进行V2I通信。环境通过数字孪生进行监控：共享消息基于实时定位和感知数据，更新所生成的自车和检测到的目标的信息。我们通过不同驾驶场景下的多项测试验证了所提出架构的有效性和实时响应能力。

英文摘要

In this paper, CD-TWINSAFE is introduced, a V2I-based digital twin for autonomous vehicles. The proposed architecture is composed of two stacks running simultaneously: an on-board driving stack that includes a stereo camera for scene understanding, and a digital twin stack that runs an Unreal Engine 5 replica of the scene viewed by the camera and returns safety alerts to the cockpit. The on-board stack is implemented on the vehicle side and includes two main autonomous modules: localization and perception. The position and orientation of the ego vehicle are obtained using on-board sensors. The perception module processes 20-fps images from the stereo camera and understands the scene through two complementary pipelines. The pipelines perform object detection and feature extraction, including object velocity, yaw, and the safety metrics time-to-collision and time-headway. The data collected from the driving stack are sent to the infrastructure side through the ROS-enabled architecture in the form of custom ROS2 messages, carried over UDP links that ride a 4G modem for V2I communication. The environment is monitored via the digital twin through the shared messages, which update the information of the spawned ego vehicle and detected objects based on the real-time localization and perception data. Several tests with different driving scenarios were conducted to confirm the validity and real-time response of the proposed architecture.
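The two safety metrics named in the abstract have widely used textbook definitions, sketched below. This assumes a one-dimensional car-following geometry with a known gap to the lead object; the abstract does not give the formulas CD-TWINSAFE actually uses, and the function and parameter names are hypothetical.

```python
import math


def time_to_collision(gap_m: float, ego_speed_mps: float, lead_speed_mps: float) -> float:
    """Seconds until contact if both speeds stay constant.

    Returns infinity when the ego vehicle is not closing in on the lead object.
    """
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed


def time_headway(gap_m: float, ego_speed_mps: float) -> float:
    """Seconds the ego vehicle needs to cover the current gap at its own speed."""
    if ego_speed_mps <= 0.0:
        return math.inf
    return gap_m / ego_speed_mps


# Ego at 15 m/s, lead object at 10 m/s, 20 m apart.
print(time_to_collision(20.0, 15.0, 10.0))
print(time_headway(20.0, 15.0))
```

Time-to-collision depends on relative speed, while time headway depends only on the ego speed, which is why the two are usually reported together.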


2601.12367 2026-05-20 cs.HC cs.RO 版本更新

User-to-Vehicle Interaction in Smart Mobility: The GO-DRiVeS Autonomous Ride-Sharing Application

用户与车辆交互在智能交通中的应用：GO-DRiVeS自动驾驶拼车应用

Hana E. Elmalah, Catherine M. Elias

Affiliations: C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; Computer Science and Engineering Department, Faculty of Media Engineering and Technology; German University in Cairo, Egypt

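AI summary: This paper presents GO-DRiVeS, a ride-sharing application intended to spare university students and staff long walks in hot weather or when carrying heavy items. The application was developed with the Agile methodology, informed by an analysis and comparison of existing transportation application frameworks. It implements user registration, ride requests, and real-time tracking, and several experiments were used to verify its stability and reliability.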

DOI: 10.1109/MELECON64486.2026.11418861

AI中文摘要

本文介绍了GO-DRiVeS应用，这是一种按需拼车和请求的移动应用，专门针对解决长时间步行、时间消耗和疲劳的问题，尤其是在炎热天气或携带重物时，这对大学学生和员工来说是一个挑战。GO-DRiVeS应用是按照敏捷方法开发的，以确保其灵活性。此外，使用移动应用程序系统架构和客户端-服务器架构。GO-DRiVeS是使用React Native（Expo）作为前端，Node.js和Express作为后端，MongoDB作为数据库实现的；基于对现有交通应用的详细分析，比较其框架并识别其核心功能。GO-DRiVeS支持用户注册、拼车请求和实时追踪等核心功能。此外，它能够以先到先得的方式同时处理多个请求。该应用基于这些功能进行开发，其结果以多种形式的实验形式呈现，展示了在处理请求时的稳定性，如在方法和结果章节中所展示的。

英文摘要

This paper introduces the GO-DRiVeS application, an on-demand ride-sharing and ride-requesting mobile application tailored to spare university students and staff long walks, which are time-consuming and tiring, especially on hot days or when carrying heavy items. The GO-DRiVeS application was developed following the Agile methodology for its flexibility, using a mobile application system architecture and a client-server architecture. GO-DRiVeS was implemented using React Native (Expo) for the frontend, Node.js and Express for the backend, and MongoDB as the database, based on a detailed analysis of existing transportation applications that compared their frameworks and identified their essential functionalities. GO-DRiVeS supports core features such as user registration, ride requesting, and real-time tracking, and it handles multiple simultaneous requests in a first-come, first-served manner. The application was developed around these features, and multiple experiments demonstrated stable behavior in handling requests, as presented in the Methodology and Results chapters.


2601.12358 2026-05-20 cs.CV cs.AI cs.RO 版本更新

From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles

从提示到道路：基于大语言模型的代理行为树生成框架用于自动驾驶车辆

Omar Y. Goba, Ahmed Y. Gado, Catherine M. Elias, Ahmed Hussein

Affiliations: Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; IAV GmbH, Berlin, Germany

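AI summary: This paper proposes an agentic behavior-tree generation framework based on large language models and multimodal vision models for adaptive navigation of autonomous vehicles in complex environments. It assesses scene criticality with chain-of-symbols prompting, constructs high-level sub-goals via in-context learning, and synthesizes executable BT sub-trees with a generator, successfully navigating around unexpected obstacles (e.g., street blockage) in a CARLA+Nav2 simulation.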

DOI: 10.1109/ITSC60802.2025.11423726

AI中文摘要

自动驾驶车辆（AVs）需要适应性行为规划器来安全地导航不可预测的现实环境。传统的行为树（BTs）提供结构化决策逻辑，但本质上是静态的，并且需要大量人工调优，限制了其在SAE Level 5自主性中的应用。本文提出了一种代理框架，利用大语言模型（LLMs）和多模态视觉模型（LVMs）来实时生成和适应BTs。一个专门的Descriptor代理使用链式符号提示来评估场景关键性，一个Planner代理通过上下文学习构建高层子目标，一个Generator代理合成可执行的BT子树。该系统集成到CARLA+Nav2模拟中，仅在基线BT失败时触发，展示了成功绕过突发障碍物（例如道路堵塞）的能力，无需人工干预。与静态BT基线相比，该方法是一种概念验证，能够扩展到多样的驾驶场景。

英文摘要

Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.


2512.20931 2026-05-20 cs.RO 版本更新

Certifiable Alignment of GNSS and Local Frames via Lagrangian Duality

通过拉格朗日对偶实现GNSS与局部框架的可验证对齐

Baoshan Song, Matthew Giamou, Penggao Yan, Chunxi Xia, Li-Ta Hsu

Affiliations: Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, China; Department of Computing and Software, McMaster University, Canada; School of Geodesy and Geomatics, Wuhan University, China

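AI summary: This paper proposes a globally optimal solver that turns raw pseudo-range or Doppler measurements into a convexly relaxed problem, achieving certifiable alignment of GNSS and local frames and addressing the limitations of conventional methods in GNSS-degraded environments.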

Comments Final version in RA-L

AI中文摘要

估计局部系统相对于全球导航卫星系统（GNSS）参考的绝对朝向，常常受到局部极小值和对卫星可用性高度依赖的困扰。现有的对齐方法要么依赖大量卫星，而这在GNSS退化环境中无法满足；要么使用局部优化方法，无法保证解的最优性。本文介绍了一种全局最优求解器，将原始伪距或多普勒测量转换为凸松弛问题。所提出的方法是可验证的，即可以数值验证结果的正确性，填补了现有局部优化器失效之处的空白。我们首先将原始坐标系对齐问题表述为一个非凸二次约束二次规划（QCQP）问题，并将其松弛为一个凹的拉格朗日对偶问题，为原问题提供代价下界。然后我们进行松弛紧致性和可观测性分析，推导出解的可验证最优性的判据。最后通过仿真和真实世界实验评估所提出的方法。实验表明，即使只有2颗卫星的多普勒测量和二维车辆运动，我们的方法也能提供可验证的最优解，而传统的基于速度的VOBA方法和先进的GVINS对齐技术可能会失败，或在毫无提示的情况下收敛到局部最优。为了支持机器人领域基于GNSS的导航技术的发展，所有代码和数据均在 https://github.com/Baoshan-Song/Certifiable-Doppler-alignment 开源。

英文摘要

Estimating the absolute orientation of a local system relative to a global navigation satellite system (GNSS) reference often suffers from local minima and high dependency on satellite availability. Existing methods for this alignment task rely on abundant satellites unavailable in GNSS-degraded environments, or use local optimization methods which cannot guarantee the optimality of a solution. This work introduces a globally optimal solver that transforms raw pseudo-range or Doppler measurements into a convexly relaxed problem. The proposed method is certifiable, meaning it can numerically verify the correctness of the result, filling a gap where existing local optimizers fail. We first formulate the original frame alignment problem as a nonconvex quadratically constrained quadratic program (QCQP) problem and relax the QCQP problem to a concave Lagrangian dual problem that provides a lower cost bound for the original problem. Then we perform relaxation tightness and observability analysis to derive criteria for certifiable optimality of the solution. Finally, simulation and real world experiments are conducted to evaluate the proposed method. The experiments show that our method provides certifiably optimal solutions even with only 2 satellites with Doppler measurements and 2D vehicle motion, while the traditional velocity-based VOBA method and the advanced GVINS alignment technique may fail or converge to local optima without notice. To support the development of GNSS-based navigation techniques in robotics, all code and data are open-sourced at https://github.com/Baoshan-Song/Certifiable-Doppler-alignment.
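The relaxation step described in the abstract follows a standard pattern, written out here for a generic homogeneous QCQP with equality constraints. The symbols $Q$, $A_i$, $b_i$, $\lambda$ and $H(\lambda)$ are assumptions of this sketch and are not taken from the paper, whose specific cost and constraint matrices are not given in the abstract.

```latex
\begin{align}
  p^\star &= \min_{x \in \mathbb{R}^n} \; x^\top Q x
     \quad \text{s.t.} \quad x^\top A_i x = b_i, \quad i = 1, \dots, m
     && \text{(nonconvex primal)} \\
  L(x, \lambda) &= x^\top H(\lambda)\, x - b^\top \lambda,
     \qquad H(\lambda) = Q + \sum_{i=1}^{m} \lambda_i A_i
     && \text{(Lagrangian)} \\
  d^\star &= \max_{\lambda \in \mathbb{R}^m} \; -b^\top \lambda
     \quad \text{s.t.} \quad H(\lambda) \succeq 0
     && \text{(concave dual, an SDP)} \\
  d^\star &\le p^\star
     && \text{(weak duality)}
\end{align}
```

In this generic setting, a candidate $\hat{x}$ returned by a local solver can be certified after the fact: if $\hat{x}$ is feasible and some $\hat{\lambda}$ satisfies $H(\hat{\lambda}) \succeq 0$ and $H(\hat{\lambda})\hat{x} = 0$, then $\hat{x}^\top Q \hat{x} = -b^\top \hat{\lambda} \le d^\star \le p^\star$, so the relaxation is tight and $\hat{x}$ is globally optimal.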


2512.00667 2026-05-20 eess.SY cs.RO cs.SY 版本更新

Active Learning of Fractional-Order Viscoelastic Model Parameters for Realistic Haptic Rendering

分数阶黏弹性模型参数的主动学习用于真实触觉渲染

Harun Tolasa, Gorkem Gemalmaz, Volkan Patoglu

Affiliations: Faculty of Engineering and Natural Sciences

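AI summary: This paper proposes a systematic method that uses active learning to optimize the parameters of fractional-order viscoelastic models and improve the perceived realism of haptic rendering. It combines human-in-the-loop optimization with an aggregate perceptual map to select parameters that are broadly perceived as realistic across general populations.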

Comments This work has been submitted to the IEEE Transactions on Haptics for possible publication. 14 pages, 8 figures

AI中文摘要

有效的医疗模拟器需要真实地渲染具有黏弹性材料特性（如蠕变和应力松弛）的生物组织。分数阶模型提供了一种有效描述本质上时间依赖的黏弹性动力学的方法，仅需少量参数，因为它们自然地捕捉记忆效应。然而，由于分数元素的阶数与其他参数之间的非直观、频率依赖的耦合，确定产生高感知真实感的分数阶模型参数值仍是一个重大挑战。在本研究中，我们提出了一种系统的方法，通过主动学习优化分数阶黏弹性模型的参数，以优化触觉渲染在一般人群中的感知真实感。首先，我们证明通过基于定性反馈的人类在回路（HiL）优化可以有效优化分数阶模型的参数，以确保对每个人都能保持一致的高真实感评分。其次，我们提出了一种严格的方法，将HiL优化结果结合到一个在完整数据集上训练的聚合感知地图中，并展示如何从这种表示中选择群体层面的最佳参数，这些参数在一般人群中被广泛认为是真实的。最后，我们通过人类受试者实验验证了广义分数阶黏弹性模型参数在三种黏弹性材料中的有效性。总体而言，通过所提出的HiL优化和聚合方法建立的广义分数阶黏弹性模型有潜力显著提高医疗训练模拟器的sim-to-real过渡性能。

英文摘要

Effective medical simulators necessitate realistic haptic rendering of biological tissues that exhibit viscoelastic material properties, such as creep and stress relaxation. Fractional-order models provide an effective means of describing intrinsically time-dependent viscoelastic dynamics with few parameters, as they naturally capture memory effects. However, due to the unintuitive, frequency-dependent coupling among the order of the fractional element and other parameters, determining appropriate parameter values for fractional-order models that yield high perceived realism remains a significant challenge. In this study, we propose a systematic means of determining the parameters of fractional-order viscoelastic models that optimizes the perceived realism of haptic rendering across general populations. First, we demonstrate that the parameters of fractional-order models can be effectively optimized through active learning, using qualitative feedback-based human-in-the-loop (HiL) optimization, to ensure consistently high realism ratings for each individual. Second, we propose a rigorous method to combine HiL optimization results into an aggregate perceptual map trained on the entire dataset, and demonstrate how to select population-level optimal parameters from this representation that are broadly perceived as realistic across general populations. Finally, we provide evidence of the effectiveness of the generalized fractional-order viscoelastic model parameters for three viscoelastic materials by characterizing their perceived realism through human-subject experiments. Overall, generalized fractional-order viscoelastic models established through the proposed HiL optimization and aggregation approach possess the potential to significantly improve the sim-to-real transition performance of medical training simulators.
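The memory effect that makes fractional-order models compact can be seen in a minimal sketch of a single fractional element (often called a springpot), with stress equal to `c_alpha` times the fractional derivative of order `alpha` of the strain. The Grünwald-Letnikov discretization below is a standard numerical scheme chosen for illustration; the abstract does not say which model structure or discretization the authors use, and all names are hypothetical.

```python
import numpy as np


def gl_weights(alpha: float, n: int) -> np.ndarray:
    """Grünwald-Letnikov weights w_0..w_n, i.e. (-1)^k * binom(alpha, k)."""
    w = np.empty(n + 1)
    w[0] = 1.0
    for k in range(1, n + 1):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return w


def springpot_stress(strain, dt: float, c_alpha: float, alpha: float) -> np.ndarray:
    """Stress history of a springpot for a uniformly sampled strain history."""
    strain = np.asarray(strain, dtype=float)
    n = len(strain)
    w = gl_weights(alpha, n - 1)
    stress = np.empty(n)
    for i in range(n):
        # Every past sample contributes: this sum is the "memory" of the element.
        history = strain[i::-1]  # strain[i], strain[i-1], ..., strain[0]
        stress[i] = c_alpha * dt ** (-alpha) * np.dot(w[: i + 1], history)
    return stress


t = np.arange(5) * 0.1
print(springpot_stress(t, dt=0.1, c_alpha=2.0, alpha=0.5))
```

With `alpha = 0` the element reduces to a spring (stress proportional to strain) and with `alpha = 1` to a dashpot (stress proportional to strain rate); intermediate orders interpolate between the two, which is the coupling between order and the other parameters that the abstract describes as unintuitive to tune.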


2511.18236 2026-05-20 cs.RO cs.SY eess.SY 版本更新

APULSE: A Scalable Hybrid Algorithm for the RCSPP on Large-Scale Dense Graphs

APULSE：一种用于大规模密集图上RCSPP的可扩展混合算法

Nuno Soares, António Grilo

Affiliations: Academia Militar Lisboa; INESC INOV Instituto Superior Técnico (IST) Universidade de Lisboa

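AI summary: This paper proposes APULSE, which combines A*-guided heuristic search, Pulse-style pruning, and a time-bucketing strategy to efficiently solve the resource-constrained shortest path problem on large-scale dense graphs, showing marked scalability and robustness.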

Comments This version corrects keywords and reference [9]. 9 pages

DOI: 10.1109/ACCESS.2026.3669807
Journal ref: in IEEE Access, vol. 14, pp. 40690-40706, 2026

AI中文摘要

资源受限最短路径问题（RCSPP）是一个基础的NP难优化挑战，广泛应用于网络路由和自主导航等领域。该问题涉及在受预算限制的二次资源下寻找最小主成本路径。尽管存在各种RCSPP求解器，但它们在应用于复杂现实场景中常见的大型密集图时往往面临严重的可扩展性限制，使其在时间敏感的规划中不切实际。在无人地面车辆（UGVs）的任务规划等领域，这种挑战尤为突出。本文介绍APULSE，一种混合标签设置算法，旨在高效解决此类挑战性图中的RCSPP。APULSE结合了由A*启发式引导的最佳优先搜索、激进的Pulse式剪枝机制以及时间桶策略，以有效减少状态空间。通过使用大规模UGV规划场景的计算研究，APULSE与最先进的算法进行了基准测试。结果表明，APULSE在大型问题实例上能够以数量级更快的速度和更高的鲁棒性找到近最优解，特别是在竞争方法失败的情况下。这种优越的可扩展性使APULSE成为复杂大规模环境中的RCSPP有效解决方案，使其能够实现交互式决策支持和动态重新规划能力。

英文摘要

The resource-constrained shortest path problem (RCSPP) is a fundamental NP-hard optimization challenge with broad applications, from network routing to autonomous navigation. This problem involves finding a path that minimizes a primary cost subject to a budget on a secondary resource. While various RCSPP solvers exist, they often face critical scalability limitations when applied to the large, dense graphs characteristic of complex, real-world scenarios, making them impractical for time-critical planning. This challenge is particularly acute in domains like mission planning for unmanned ground vehicles (UGVs), which demand solutions on large-scale terrain graphs. This paper introduces APULSE, a hybrid label-setting algorithm designed to efficiently solve the RCSPP on such challenging graphs. APULSE integrates a best-first search guided by an A* heuristic with aggressive, Pulse-style pruning mechanisms and a time-bucketing strategy for effective state-space reduction. A computational study, using a large-scale UGV planning scenario, benchmarks APULSE against state-of-the-art algorithms. The results demonstrate that APULSE consistently finds near-optimal solutions while being orders of magnitude faster and more robust, particularly on large problem instances where competing methods fail. This superior scalability establishes APULSE as an effective solution for RCSPP in complex, large-scale environments, enabling capabilities such as interactive decision support and dynamic replanning.
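Two of the ingredients named in the abstract, best-first search under an A* heuristic and pruning of labels that cannot meet the budget, can be shown on a toy graph. The sketch below is a generic label-based RCSPP search and is not the APULSE algorithm: it uses exact reverse-Dijkstra bounds as the heuristic, adds simple dominance pruning, and omits time bucketing entirely. The graph format and all function names are assumptions.

```python
import heapq


def reverse_dijkstra(graph, target, idx):
    """Lower bound on remaining cost (idx=0) or resource (idx=1) to reach target."""
    rev = {}
    for u, edges in graph.items():
        for v, cost, res in edges:
            rev.setdefault(v, []).append((u, (cost, res)[idx]))
    dist = {target: 0.0}
    heap = [(0.0, target)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue
        for u, w in rev.get(v, []):
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist


def rcspp(graph, source, target, budget):
    """Return (cost, resource, path) of a min-cost path within budget, or None."""
    h_cost = reverse_dijkstra(graph, target, 0)
    h_res = reverse_dijkstra(graph, target, 1)
    if source not in h_res or h_res[source] > budget:
        return None
    labels = {source: [(0.0, 0.0)]}  # accepted (cost, resource) labels per node
    heap = [(h_cost[source], 0.0, 0.0, source, (source,))]
    while heap:
        _, cost, res, node, path = heapq.heappop(heap)
        if node == target:
            return cost, res, list(path)
        for v, c, r in graph.get(node, []):
            if v not in h_res:
                continue  # target unreachable from v
            nc, nr = cost + c, res + r
            if nr + h_res[v] > budget:
                continue  # infeasibility pruning: budget cannot be met via v
            if any(lc <= nc and lr <= nr for lc, lr in labels.get(v, [])):
                continue  # dominance pruning: an existing label is at least as good
            labels.setdefault(v, []).append((nc, nr))
            heapq.heappush(heap, (nc + h_cost[v], nc, nr, v, path + (v,)))
    return None


# Edges are (neighbor, cost, resource); the cheap route uses more resource.
graph = {
    "s": [("a", 1, 5), ("b", 2, 1)],
    "a": [("t", 1, 5)],
    "b": [("t", 2, 1)],
    "t": [],
}
print(rcspp(graph, "s", "t", budget=10))
print(rcspp(graph, "s", "t", budget=5))
```

Tightening the budget from 10 to 5 forces the search off the cheaper route through `a`, whose first edge is pruned before it is ever expanded.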


2408.06843 2026-05-20 cs.RO 版本更新

Learn2Decompose: Learning Problem Decomposition for Efficient Sequential Multi-object Manipulation Planning

Learn2Decompose: 为高效连续多物体操作规划学习问题分解

Yan Zhang, Teng Xue, Amirreza Razmjoo, Sylvain Calinon

Affiliations: Idiap Research Institute; Ecole Polytechnique Fédérale de Lausanne

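AI summary: This paper proposes an efficient task and motion replanning approach for sequential multi-object manipulation in dynamic environments. It accelerates TAMP solvers by learning problem decompositions from demonstrations. Its core components are goal decomposition learning, computational distance learning, and object reduction, which together improve replanning efficiency.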

Comments Extension of RAL version: added PR2 Whole-body kitchen task and detailed discussion on limitations in main text; added pseudocode and robustness analysis of our approach, and formal analysis on why and when task goals are decomposable in appendix

AI中文摘要

我们提出了一种高效的任务和运动重计划方法，用于动态环境中连续多物体操作的规划。传统任务与运动规划（TAMP）求解器在规划时间上随着规划时间跨度和物体数量的增长而呈指数级增加，限制了其在现实场景中的应用。为了解决这一问题，我们提出通过示范学习问题分解来加速TAMP求解器。我们的方法包含三个关键组成部分：目标分解学习、计算距离学习和物体减少。目标分解识别系统在达到最终目标之前必须经过的必要状态序列，将其视为子目标序列。计算距离学习预测两个状态之间的计算复杂性，使系统能够从扰动状态中识别出时间上最近的子目标。物体减少最小化重计划过程中考虑的活跃物体集合，进一步提高效率。我们在三个基准上评估了我们的方法，证明了其在动态环境中提升连续多物体操作任务重计划效率的有效性。

英文摘要

We present an efficient task and motion replanning approach for sequential multi-object manipulation in dynamic environments. Conventional Task And Motion Planning (TAMP) solvers experience an exponential increase in planning time as the planning horizon and number of objects grow, limiting their applicability in real-world scenarios. To address this, we propose learning problem decompositions from demonstrations to accelerate TAMP solvers. Our approach consists of three key components: goal decomposition learning, computational distance learning, and object reduction. Goal decomposition identifies the necessary sequences of states that the system must pass through before reaching the final goal, treating them as subgoal sequences. Computational distance learning predicts the computational complexity between two states, enabling the system to identify the temporally closest subgoal from a disturbed state. Object reduction minimizes the set of active objects considered during replanning, further improving efficiency. We evaluate our approach on three benchmarks, demonstrating its effectiveness in improving replanning efficiency for sequential multi-object manipulation tasks in dynamic environments.


2209.12133 2026-05-20 cs.RO cs.SY eess.SY 版本更新

Development of a Deep Learning-Driven Control Framework for Exoskeleton Robots

一种基于深度学习的外骨骼机器人控制框架开发

Sk Hasan

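AI summary: This paper proposes a computationally efficient deep learning control framework for high-degree-of-freedom exoskeleton robots to address the real-time computational limits of conventional model-based control. A parallel-structured deep neural network trained on physics-based data predicts the joint torques for trajectory tracking, a proportional-derivative controller compensates for prediction errors, and the stability and robustness of the scheme are demonstrated.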

DOI: 10.3390/act15050274
Journal ref: Actuators 15, 274 (2026)

AI中文摘要

本研究旨在开发一种计算高效的基于深度学习的控制框架，用于高自由度外骨骼机器人，以解决传统模型控制在实时计算中的限制。为七自由度人下肢外骨骼机器人设计了一个并行结构的深度神经网络，该网络由四层组成，包含49个密集连接的神经元，并使用基于分析动力学模型的物理数据进行训练。在实时实现过程中，训练好的神经网络预测轨迹跟踪所需的关节扭矩命令，而比例导数控制器补偿残余预测误差。通过分析验证了所提控制方案的稳定性，并利用方差分析评估了参数变化的鲁棒性。在相同机器人动力学条件下，与计算扭矩、模型参考计算扭矩、滑模、自适应和线性二次控制器进行了对比仿真。结果表明，该方法在轨迹跟踪精度和扭矩特性上与传统非线性控制器相当，同时减少了计算负担。这些发现表明，所提出的基于深度学习的混合控制器为多自由度外骨骼机器人的控制提供了一种高效且稳健的替代方案。

英文摘要

The purpose of this study is to develop a computationally efficient deep learning based control framework for high degree of freedom exoskeleton robots to address the real time computational limitations associated with conventional model based control. A parallel structured deep neural network was designed for a seven degree of freedom human lower extremity exoskeleton robot. The network consists of four layers with 49 densely connected neurons and was trained using physics based data generated from the analytical dynamic model. During real time implementation, the trained neural network predicts joint torque commands required for trajectory tracking, while a proportional derivative controller compensates for residual prediction errors. Stability of the proposed control scheme was analytically established, and robustness to parameter variations was evaluated using analysis of variance. Comparative simulations were conducted against computed torque, model reference computed torque, sliding mode, adaptive, and linear quadratic controllers under identical robot dynamics. Results demonstrate accurate trajectory tracking with torque profiles comparable to conventional nonlinear controllers while reducing computational burden. These findings suggest that the proposed deep learning based hybrid controller offers an efficient and robust alternative for controlling multi degree of freedom exoskeleton robots.
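The hybrid structure described in the abstract, a learned torque prediction plus a proportional-derivative correction, can be sketched in a few lines. The network inputs (reference position, velocity, and acceleration), the scalar gains, and the name `predict_torque` are assumptions of this sketch; the abstract does not specify them, and a stand-in callable replaces the trained network.

```python
import numpy as np


def hybrid_torque(q, qd, q_ref, qd_ref, qdd_ref, predict_torque, kp, kd):
    """Joint torque command: learned feedforward term plus PD feedback.

    predict_torque stands in for the trained network; kp and kd are PD gains.
    """
    tau_ff = predict_torque(q_ref, qd_ref, qdd_ref)        # network prediction
    tau_fb = kp * (q_ref - q) + kd * (qd_ref - qd)         # residual error correction
    return tau_ff + tau_fb


n_joints = 7  # matches the seven degrees of freedom mentioned in the abstract


def dummy_network(q_ref, qd_ref, qdd_ref):
    # Placeholder for a trained model: returns a constant torque per joint.
    return np.full(n_joints, 2.0)


tau = hybrid_torque(
    q=np.zeros(n_joints), qd=np.zeros(n_joints),
    q_ref=np.ones(n_joints), qd_ref=np.zeros(n_joints), qdd_ref=np.zeros(n_joints),
    predict_torque=dummy_network, kp=10.0, kd=1.0,
)
print(tau)
```

When the prediction is accurate the tracking error stays small and the PD term contributes little, so the feedback path only has to absorb the residual prediction error.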


2605.19703 2026-05-20 cs.RO 版本更新

KIO-planner: Attention-Guided Single-Stage Motion Planning with Dual Mapping for UAV Navigation

KIO-planner: 基于双映射的注意力引导单阶段运动规划用于无人机导航

Dexing Yao, Haochen Li, Junhao Wei, Yifu Zhao, Yanxiao Li, Jiahui Xu, Jinxuan Hu, Lele Tian, Baili Lu, Zikun Li, Xu Yang, Sio-Kei Im, Dingcheng Yang, Yapeng Wang

Affiliations: Faculty of Applied Sciences; Macao Polytechnic University; College of Animal Science and Technology; Zhongkai University of Agriculture and Engineering; School of Economics and Management; South China Normal University; Information Engineering School

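AI summary: This paper proposes KIO-planner, an attention-guided single-stage trajectory planning framework. By integrating a CBAM module and a Dual Mapping mechanism, it achieves low-latency, reliable motion planning in obstacle-dense environments and improves navigation agility and safety.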

Comments Accepted by an IEEE Vehicular Technology Conference. 6 pages, 4 figures, 1 table

AI中文摘要

在受限、墙壁密集的环境中实现自主无人机飞行需要在严格安全约束下具有低延迟和可靠性的运动规划。传统基于优化的规划器在导航密集结构障碍时面临映射延迟和容易陷入局部极小值的问题。同时，现有的端到端学习方法难以从原始深度图像中提取细粒度的几何特征，并缺乏硬的运动动力学约束，导致靠近墙壁时出现不可预测的碰撞。为了解决这些问题，我们提出了KIO-planner，一种注意力引导的单阶段轨迹规划框架。首先，我们将卷积块注意力模块（CBAM）整合到感知骨干中，以自适应地聚焦于关键结构边缘和可通行空间。其次，我们引入了一种新的双映射机制——包括物理界限激活和确定性的几何安全护盾——以在深度像素空间中强制运动动力学可行性并实现无碰撞飞行，而无需全局地图融合。广泛的高保真模拟实验表明，KIO-planner能够在高达3.0 m/s的速度下实现高度敏捷的导航。与最先进的基线相比，KIO-planner实现了更低的推理延迟（约24 ms）并生成了显著更平滑的轨迹，减少了28.4%的控制成本。最值得注意的是，我们的双映射显著增加了最坏情况的安全裕度，通过最小距离到障碍物的测量，从0.48米增加到0.76米，确保了在高度受限环境中快速、平滑和安全的导航。

英文摘要

Autonomous UAV flight in confined, wall-dense environments requires low-latency and reliable motion planning under strict safety constraints. Traditional optimization-based planners suffer from mapping latency and easily fall into local minima when navigating through dense structural obstacles. Meanwhile, existing end-to-end learning methods struggle to extract fine-grained geometric features from raw depth images and lack hard kinodynamic constraints, leading to unpredictable collisions near walls. To address these issues, we propose KIO-planner, an attention-guided single-stage trajectory planning framework. First, we integrate a Convolutional Block Attention Module (CBAM) into the perception backbone to adaptively focus on critical structural edges and traversable space. Second, we introduce a novel Dual Mapping mechanism--comprising physical bounds activation and a deterministic Geometric Safety Shield in the depth-pixel space--to enforce kinodynamic feasibility and collision-free flight without global map fusion. Extensive high-fidelity simulated experiments demonstrate that KIO-planner enables highly agile navigation at speeds up to 3.0 m/s. Compared to the state-of-the-art baseline, KIO-planner achieves lower inference latency (approximately 24 ms) and generates significantly smoother trajectories, reducing control cost by 28.4%. Most notably, our Dual Mapping substantially increases the worst-case safety margin, measured by minimum distance to obstacles, from 0.48 m to 0.76 m, ensuring fast, smooth, and safer navigation in highly constrained environments.


2605.19701 2026-05-20 cs.RO 版本更新

Multi-Session Ground Texture SLAM in Low-Dynamic Environments

多会话低动态环境下的地面纹理SLAM

Kyle M. Hart, Brendan Englot

Affiliations: Naval Air Warfare Center, Aircraft Division; Department of Mechanical Engineering, Stevens Institute of Technology

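AI summary: This paper studies what affects trajectory estimation accuracy in multi-session ground texture SLAM in low-dynamic environments. It examines three techniques, finds that Kullback-Leibler divergence used as a similarity score and as a bias on loop closure confidence works best, and introduces a dataset with multi-session images and high-accuracy pose information.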

Comments 8 pages, 9 figures. To appear at the 23rd International Conference on Ubiquitous Robots, Osaka, Japan. Distribution Statement A: Approved for public release; distribution is unlimited, as submitted under NAVAIR Public Release Authorization 2025-0098

AI中文摘要

同时定位与建图社区已经引入了大量适用于多会话操作的系统，这些系统适应于具有低动态变化特征的环境，如地面磨损、天气现象或季节变化，这些变化会影响建图。这些系统允许机器人在这些环境中进行终身操作。同时，对于那些唯一可用的地面纹理作为建图特征的环境，也存在越来越多的兴趣。然而，这些地面纹理系统尚未针对多会话低动态变化环境进行优化。本文探讨了三种不同技术对这些多会话低动态地面纹理环境轨迹估计精度的影响。其中，使用Kullback-Leibler散度作为相似度评分和偏置影响闭环置信度的方法效果最佳。我们分析了所有三种方法，并深入探讨了Kullback-Leibler散度的影响。我们还介绍了一个供机器人社区使用的数据集，其中包含多会话图像，地面在不同会话中发生变化，并包含高精度姿态信息用于评估。

英文摘要

The simultaneous localization and mapping community has introduced a growing number of systems adapted for multi-session operations where the operational environment features low-dynamic changes that impact mapping, such as surface wear, weather phenomena, or seasonal change. These systems allow for lifelong operations by a robot within these environments. There is also growing interest in operations in environments where the unique ground texture is the only mapping feature available for use. These ground texture systems are not yet targeted for multi-session low-dynamic-change environments though. This work explores the impact of three different techniques on trajectory estimation accuracy in these multi-session low-dynamic ground texture environments. Of the three, the use of Kullback-Leibler Divergence, as a similarity score and a bias influencing loop closure confidence, is found to have the most success. We show an analysis of all three methods and a deeper exploration of the impact of Kullback-Leibler Divergence. We also introduce a dataset for use by the robotics community that contains multi-session images where the ground changes between sessions and also high-accuracy pose information for use in evaluation.
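Kullback-Leibler divergence as a similarity score can be illustrated on two discrete distributions, for example normalized feature histograms from two sessions. Which distributions the authors compare and how the score biases loop closure confidence are not stated in the abstract; the histogram inputs and the `closure_weight` mapping below are hypothetical choices made only for illustration.

```python
import numpy as np


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """D_KL(p || q) for two discrete distributions given as nonnegative vectors."""
    p = np.asarray(p, dtype=float) + eps  # eps avoids log(0) on empty bins
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))


def closure_weight(kl: float) -> float:
    """Hypothetical mapping from divergence to a confidence weight in (0, 1]."""
    return float(np.exp(-kl))


same = kl_divergence([0.5, 0.5], [0.5, 0.5])
changed = kl_divergence([0.5, 0.5], [0.9, 0.1])
print(same, closure_weight(same))
print(changed, closure_weight(changed))
```

Identical distributions give zero divergence and full weight, while a changed appearance lowers the weight. The divergence is asymmetric, so the order of the two arguments matters.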


2605.19690 2026-05-20 cs.RO 版本更新

D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models

D-CLING: 保留先验知识的深度条件细调方法用于导航基础模型

Shintaro Nakaoka, Takayuki Kanai, Kazuhito Tanaka

Affiliations: Frontier Research Center, Toyota Motor Corporation

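AI summary: This paper proposes a fine-tuning method that leverages large-scale pre-training while efficiently learning new setups such as environments or camera configurations, improving the robustness and accuracy of navigation models while preserving pre-trained knowledge.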

Comments This paper has been accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), which will be held in Vienna, Austria, from June 1 to 5, 2026

AI中文摘要

在大规模跨具身数据集上训练的导航基础模型（NFMs）已在各种场景中展现出强大的泛化能力。对NFM进行领域内微调可以高效地校准其视觉-运动策略，有望在新场景中进一步提升性能。然而，微调后的模型仍然存在避障能力差或无法正确到达给定目标的问题。此外，使用小规模数据子集进行模型更新通常会侵蚀预训练先验，损害预训练带来的泛化能力。因此，微调会降低模型稳健、准确导航的能力。在本文中，我们提出了一种新的微调方法，在利用大规模预训练的同时高效学习环境或相机配置等新设置。具体而言，受ControlNet启发，我们通过零初始化残差路径为NFM附加一个预训练骨干网络的可训练副本来进行微调，从而学习几何线索。这种设计使模型能够高效地获取领域内的几何信息，同时在各种行为中保留预训练的知识。尽管方法简单，我们对真实世界导航的全面评估表明，该方法能够有效实现稳健的长时程导航，并将碰撞和人工干预降至最低。此外，我们的离线分析显示，所提出的方法在微调数据集之外仍能保持或进一步提升动作预测能力，为通用导航的持续学习提供了关键见解。项目页面：https://toyotafrc.github.io/DCLING-Proj/

英文摘要

Navigation Foundation Models (NFMs) trained on large cross-embodied datasets have demonstrated powerful generalizability in various scenarios. Adopting in-domain fine-tuning for an NFM efficiently calibrates the visuomotor policy, promising further improvement even in a novel scenario. However, the fine-tuned models still suffer from poor obstacle avoidance or fail to properly reach the provided goals. Furthermore, model updates using a small subset of data typically erode the pre-trained prior, compromising the pre-training generalization. Consequently, fine-tuning deteriorates the capability of the model for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pre-training while efficiently learning in novel setups, such as environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pre-trained backbone using zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pre-trained knowledge across various behaviors. Despite its simplicity, our comprehensive evaluation of real-world navigation suggests that our proposal effectively enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed method maintains or further improves action prediction capabilities beyond the fine-tuned dataset, providing a key insight into continual learning for general navigation. The project page: https://toyotafrc.github.io/DCLING-Proj/
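
The zero-initialized residual pathway mentioned in the abstract can be illustrated with a toy layer. This is a minimal sketch, not the authors' architecture: the class name `ZeroInitResidualBlock`, the single linear layer standing in for the backbone, and the way depth features are added to the side-branch input are all assumptions made for illustration.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class ZeroInitResidualBlock:
    """Frozen layer plus a trainable copy joined through a zero-initialized projection."""

    def __init__(self, frozen_w, frozen_b):
        # Pre-trained parameters; these would stay frozen during fine-tuning.
        self.frozen_w, self.frozen_b = frozen_w, frozen_b
        # Trainable copy, initialized from the pre-trained parameters.
        self.copy_w = [row[:] for row in frozen_w]
        self.copy_b = frozen_b[:]
        n = len(frozen_b)
        # Zero-initialized projection: the side branch contributes nothing at first.
        self.zero_w = [[0.0] * n for _ in range(n)]
        self.zero_b = [0.0] * n

    def forward(self, rgb_feat, depth_feat):
        base = linear(self.frozen_w, self.frozen_b, rgb_feat)
        # Assumed conditioning: depth features are added to the side-branch input.
        side_in = [r + d for r, d in zip(rgb_feat, depth_feat)]
        side = linear(self.copy_w, self.copy_b, side_in)
        residual = linear(self.zero_w, self.zero_b, side)
        return [b + r for b, r in zip(base, residual)]
```

Because the projection starts at zero, the block reproduces the pre-trained output exactly before any update, which is the property that lets fine-tuning start from the prior instead of perturbing it.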

2605.19678 2026-05-20 cs.RO 版本更新

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

RoVLA: 多一致性约束用于鲁棒的视觉-语言-动作模型

Jingzhou Luo, Yifan Wen, Yongjie Bai, Xinshuai Song, Yang Liu, Liang Lin

发表机构 * Sun Yat-sen University（中山大学）； Peng Cheng Laboratory（鹏城实验室）； Guangdong Key Laboratory of Big Data Analysis and Processing（广东大数据分析与处理重点实验室）； X-Era AI Lab（X-Era AI实验室）

AI总结本文提出RoVLA框架，通过多一致性约束提升视觉-语言-动作模型的鲁棒性，通过指令语义、轨迹演变和观察扰动三种互补变换增强模型的稳定性和泛化能力。

AI中文摘要

视觉-语言-动作（VLA）模型在具身操控中表现出色，但在视觉观察变化、语言指令改写和复合扰动下仍显脆弱。这一局限表明，现有方法仍严重依赖训练分布中的浅层相关性，而非学习任务语义、环境状态和动作生成之间的稳定耦合。尽管近期研究通过更大规模的训练、训练后适应或增强的预测建模提高了鲁棒性，但很少在端到端策略本身中强制施加面向不变性的一致性。为了解决这个问题，我们提出了RoVLA，一个具有多一致性约束的鲁棒视觉-语言-动作框架。RoVLA在三种互补的变换下强制一致性：指令语义、轨迹演变和观察扰动。具体而言，指令一致性（IC）在语义等价的指令改写下促进稳定的语义落地，演变一致性（EC）在整个生成过程中保持连贯的动作意图，观察一致性（OC）通过强制定向扰动前后的预测保持一致来提高对视觉和本体感知扰动的鲁棒性。通过在训练过程中显式建模这些不变性，RoVLA减少了对表面相关性的依赖，提高了鲁棒性和泛化能力。在LIBERO-Plus、RoboTwin 2.0和真实世界操控任务上的实验表明，RoVLA始终优于强基线方法，并在多样化的任务和观察偏移下表现出更优越的鲁棒性。这些结果证明了多一致性学习在鲁棒具身控制中的有效性。代码将在https://github.com/HCPLab-SYSU/RoVLA上提供。

英文摘要

Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. Specifically, Instructional Consistency (IC) promotes stable grounding under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) preserves coherent action intent throughout the generation process, and Observational Consistency (OC) improves robustness to visual and proprioceptive perturbations by enforcing consistent predictions before and after targeted disturbances. By explicitly modeling these invariances during training, RoVLA reduces reliance on superficial correlations and improves robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks show that RoVLA consistently outperforms strong baseline methods and exhibits superior robustness under diverse task and observation shifts. These results demonstrate the effectiveness of multi-consistency learning for robust embodied control. Codes will be available at https://github.com/HCPLab-SYSU/RoVLA.

2605.19631 2026-05-20 cs.RO cs.CV 版本更新

HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

HEAT: 基于轨迹引导的世界模型实现异构端到端自动驾驶

Hoonhee Cho, Giwon Lee, Jae-Young Kang, Hyemin Yang, Heejun Park, Kuk-Jin Yoon

发表机构 * KAIST（韩国科学技术院）

AI总结本文提出一种基于轨迹引导的学习方法，通过规划轨迹组织训练，使模型能够捕捉驾驶意图的领域不变表示，并结合预测未来潜在特征的世界模型，提高特征一致性并缓解领域偏见，从而在多个异构数据集上实现强性能。

AI中文摘要

端到端自动驾驶直接将原始传感器数据映射到驾驶动作，已成为传统模块化流水线之外一种有吸引力的替代方案。尽管近期方法在单域数据集上表现强劲，但在多个异构领域上联合训练时，其性能会显著下降。然而在实践中，自动驾驶系统必须在具有异构分布的多样环境中运行，包括不同的城市、传感器配置和交通模式，且无需针对特定领域重新训练。这一差距凸显了多领域学习中的一个关键挑战：异构领域之间的领域特定差异会引入相互冲突的学习信号，使模型趋向于在各个领域都次优的折中解。为此，我们提出了一种轨迹驱动的学习范式，围绕规划轨迹组织训练，使模型能够捕捉驾驶意图的领域不变表示。此外，我们还引入了一个世界模型，该模型以自车动作为条件预测未来的潜在特征，从而提高特征一致性并缓解领域引起的偏差。我们在nuScenes、NAVSIM和Waymo端到端数据集这三个基准上评估了我们的方法，并在所有领域上都取得了相对现有方法的显著提升。我们的结果表明，单个统一模型可以在异构数据集上训练，同时在每个领域内保持强大的性能，朝着可扩展的真实世界部署迈出了一步。我们将公开我们的代码。

英文摘要

End-to-end autonomous driving has emerged as a compelling alternative to traditional modular pipelines by directly mapping raw sensor data to driving actions. While recent approaches achieve strong performance on single-domain datasets, their performance degrades significantly when trained jointly across multiple heterogeneous domains. In practice, however, autonomous systems must operate across diverse environments with heterogeneous distributions, including different cities, sensor configurations, and traffic patterns, without domain-specific retraining. This gap highlights a key challenge in multi-domain learning: domain-specific variations across heterogeneous domains introduce conflicting learning signals, driving models toward compromised solutions that are suboptimal across domains. To address this, we propose a trajectory-driven learning paradigm that organizes training around planning trajectories, enabling the model to capture domain-invariant representations of driving intent. Furthermore, we incorporate a world model that predicts future latent features conditioned on ego actions, improving feature consistency and mitigating domain-induced biases. We evaluate our approach on three benchmarks, nuScenes, NAVSIM, and the Waymo end-to-end dataset, and show substantial improvements over existing methods across all domains. Our results demonstrate that a single unified model can be trained on heterogeneous datasets while maintaining strong performance within each domain, highlighting a step toward scalable real-world deployment. We will make our code publicly available.

2605.19600 2026-05-20 cs.RO 版本更新

FlyMirage: A Fully Automated Generation Pipeline for Diverse and Scalable UAV Flight Data via Generative World Model

FlyMirage: 基于生成式世界模型、面向多样化且可扩展的无人机飞行数据的全自动生成流程

Jinhan Li, Xijie Huang, Zhaoqi Wang, Yijin Wang, Weiqi Ge, Qiyi He, Mo Zhu, Fei Gao, Yuze Wu, Xin Zhou

发表机构 * State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China（浙江大学工业控制技术国家重点实验室，杭州310027，中国）； Differential Robotics, Hangzhou 311121, China（Differential Robotics，杭州311121，中国）

AI总结本文提出FlyMirage，一种完全自动化的生成流程，通过生成世界模型生成大规模、多样化且逼真的无人机视觉-语言导航数据，支持下一代具身导航模型的发展。

AI中文摘要

在视觉-语言导航（VLN）领域，空中数据集在兼顾规模、多样性和真实感方面仍然有限，通常依赖于成本高昂的真实世界场景或视觉效果受限的仿真。为了解决这些挑战，我们提出了FlyMirage，一种高度可扩展且完全自动化的空中VLN数据生成流程。我们的方法利用大型语言模型（LLM）作为环境设计师来提升场景多样性，并配以生成式世界模型，将这些设计实例化为高保真的3D高斯泼溅（3DGS）场景。为了大幅减少人工劳动并确保飞行数据的可行性，FlyMirage自动化了场景探索和语义信息获取，并进一步集成了动力学可行的规划器用于无人机（UAV）轨迹生成。利用这一工具链，我们生成了一个大规模、多样化且逼真的空中VLN数据集，其中包含动力学可行的飞行轨迹，旨在支持下一代具身导航模型的发展。

英文摘要

In the field of Vision-Language Navigation (VLN), aerial datasets remain limited in their ability to combine scale, diversity, and realism, often relying on either costly real-world scenes or visually limited simulations. To address these challenges, we introduce FlyMirage, a highly scalable and fully automated data generation pipeline for aerial VLN. Our approach leverages large language models (LLM) as an environment designer to promote scene diversity, paired with a generative world model that instantiates these designs into high-fidelity 3D Gaussian Splatting (3DGS) scenes. To substantially reduce human labor and ensure the feasibility of flight data, FlyMirage automates scene exploration and semantic information acquisition, and further integrates a dynamically feasible planner for uncrewed aerial vehicle (UAV) trajectory generation. Utilizing this toolchain, we generate a large-scale, diverse, and photorealistic aerial VLN dataset, with dynamically feasible flying trajectories, designed to support the development of next-generation embodied navigation models.

2605.19594 2026-05-20 cs.RO 版本更新

MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation

MCNav: 用于零样本目标导向导航的记忆感知动态认知图

Jingyu Li, Zhe Liu, Wenxiao Wu, Li Zhang

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）； University of Hong Kong（香港大学）； Huazhong University of Science and Technology（华中科技大学）

AI总结本文提出MCNav，一种记忆感知的动态认知图导航框架，通过高效查询已探索区域的相关物体信息，解决零样本目标导向导航中目标丢失或误识别的问题，通过目标再验证和遗漏目标再探索策略，结合黑名单和双检机制，实现最先进的性能。

AI中文摘要

在复杂环境中导航到实例级目标是一个具有挑战性的问题。许多现有的零样本方法通过建模整个环境并利用大语言模型进行场景理解来实现强性能。然而，这些策略主要集中在探索新区域，而缺乏对先前探索区域信息的深入利用。因此，当目标在先前访问的区域中丢失或误识别时，导航失败频繁发生。为了解决这些限制，我们提出了MCNav，一种具有动态认知图的记忆感知导航框架。该图存储有关已探索区域相关物体的高效查询信息。基于此记忆结构，MCNav引入了两种记忆感知探索策略：目标再验证，用于重新评估已见过的对象以纠正匹配失败；以及遗漏目标再探索，用于根据上下文线索估计目标在已探索区域中的存在概率。这些策略进一步通过黑名单机制防止重复错误，并通过双检机制进行高置信度确认。我们在HM3Dv1和HM3Dv2数据集上对MCNav进行了三种不同任务的评估，其中在实例级目标导航任务上实现了最先进的性能。

英文摘要

Navigating to instance-level targets in complex environments is a challenging problem. Many existing zero-shot methods achieve strong performance by modeling the entire environment and leveraging large language models for scene understanding. However, such strategies primarily focus on exploring new regions while lacking a deeper exploitation of information from previously explored areas. Consequently, when targets are missed or misidentified within previously visited regions, navigation failures occur frequently. To address these limitations, we propose MCNav, a memory-aware navigation framework with a dynamic cognitive map. This map stores efficiently queryable information about relevant objects in explored areas. Building on this memory structure, MCNav introduces two memory-aware exploration strategies: goal re-validation, which re-assesses previously seen objects to correct matching failures, and missed goal re-exploration, which estimates the likelihood that a target is present in an explored region from contextual cues. These strategies are further stabilized by a blacklist mechanism to prevent repeated errors and a double-check mechanism for high-confidence confirmation. We evaluate MCNav on the HM3Dv1 and HM3Dv2 datasets across three different tasks, where it achieves state-of-the-art performance, particularly on the instance-level goal navigation task.

2605.19592 2026-05-20 cs.RO cs.AI 版本更新

Implicit Action Chunking for Smooth Continuous Control

隐式动作分块用于平滑连续控制

Bosun Liang, Shuo Pei, Zirui Chen, Chuanzhi Fan, Chen Sun, Yuankai Wu, Huachun Tan, Yong Wang

发表机构 * Department of Data and Systems Engineering, The University of Hong Kong, Hong Kong SAR, China（香港大学数据与系统工程系）； Beijing Institute of Technology, Zhuhai, China（北京理工大学珠海学院）； College of Computer Science, Sichuan University, Chengdu, China（四川大学计算机学院）

AI总结本文提出了一种隐式动作分块框架Dual-Window Smoothing (DWS)，用于实现平滑的连续控制。该方法通过双窗口设计，在不扩展动作空间的情况下，确保物理平滑性和时间差分目标的一致性，从而解决传统显式动作分块方法的优化困难和与标准逐步交互不兼容的问题。

AI中文摘要

强化学习常常产生高频振荡的控制信号，这会破坏物理部署所需的安全性和稳定性。显式动作分块通过预测固定时域的轨迹来解决这个问题，但策略输出维度会随时域长度成比例增长，导致优化困难，并且与标准的逐步交互不兼容。为克服这些挑战，本文提出了Dual-Window Smoothing (DWS)，一种用于平滑连续控制的隐式动作分块框架。与显式方法不同，DWS在不扩展动作空间的情况下保证时间一致性。它采用双窗口设计：执行窗口通过确定性调制确保物理平滑性，价值窗口在整个时域上对齐时间差分目标，以纠正开环执行导致的评论家（critic）偏差。DWS还包含一个基于一阶动作差分的轻量级actor侧时间正则化器，以促进全局连续性。该设计有效地弥合了时间抽象与反应式逐步控制之间的差距。在包括DeepMind Control Suite和工业能源管理任务在内的基准测试中，DWS优于最先进的（SOTA）基线。在复杂的基于视觉的自动驾驶任务中，DWS实现了更平滑的控制、抖动更少且更安全的行为，并达到了100%的成功率。

英文摘要

Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks show that DWS outperforms state-of-the-art (SOTA) baselines. In complex vision-based autonomous driving tasks, DWS achieves smoother control, safer behavior with reduced jitter, and attains a 100% success rate.
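
The actor-side regularizer is described only as being "based on first-order action differences". One plausible form, sketched here as an assumption, is the mean squared difference between consecutive action vectors; the function name and the choice of a squared L2 penalty are illustrative and not taken from the paper.

```python
def action_difference_penalty(actions):
    """Mean squared first-order difference between consecutive action vectors.

    `actions` is a list of equal-length lists, one action vector per time step.
    """
    if len(actions) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(actions) - 1)
```

Added to the actor loss with a small coefficient, a term like this penalizes jittery sequences such as `[[0.0], [1.0], [0.0]]` far more than smooth ramps such as `[[0.0], [0.1], [0.2]]`, without changing the dimensionality of the action space.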

2605.19580 2026-05-20 cs.RO 版本更新

PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

PAPO-VLA: 为视觉-语言-动作模型进行规划感知的策略优化

Peizheng Guo, Jingyao Wang, Changwen Zheng, Wenwen Qiang

发表机构 * Institute of Software Chinese Academy of Sciences（中国科学院软件研究所）； University of Chinese Academy of Sciences（中国科学院大学）

AI总结本文提出PAPO-VLA，一种针对视觉-语言-动作模型的规划感知策略优化方法，通过识别和优化规划动作以提高VLA策略的可靠性。

AI中文摘要

视觉-语言-动作（VLA）模型在语言引导的机器人任务中展现出良好的能力。然而，使VLA策略可靠仍然具有挑战性，因为操作任务是通过闭环交互完成的，其中每个动作都会影响后续的执行。为了分析这个问题，我们重新审视VLA策略在执行过程中的作用，并认为VLA策略同时扮演规划者和执行者两个角色：规划者做出改变执行方向的任务导向决策，执行者则通过密集的连续动作来实现这些决策。这种观点表明，提高VLA的可靠性需要特别关注规划动作。现有的优化方法可以模仿动作或改进完整的轨迹，但通常不会显式识别规划动作，也不衡量其对任务成功的重要性。为了解决这个问题，我们提出了PAPO-VLA，即面向VLA模型的规划感知策略优化方法。PAPO-VLA首先通过联合考虑动作变化和轨迹结果来识别规划动作，然后通过因果充分性和因果必要性估计其重要性，最后将这种重要性纳入GRPO优势估计中。这样，更重要的规划动作会得到更强的优化关注，同时整条轨迹仍然通过轨迹级反馈进行优化。在多个基准上的实验证明了PAPO-VLA的有效性。

英文摘要

Vision-Language-Action (VLA) models show promising ability in language-guided robotic tasks. However, making VLA policies reliable remains challenging, because a manipulation task is completed through closed-loop interaction, where each action affects subsequent execution. To analyze this problem, we revisit VLA policy during execution and argue that a VLA policy acts both as a planner, which makes task-oriented decisions that change the direction of execution, and as an executor, which realizes these decisions through dense continuous actions. This view suggests that improving VLA reliability requires particular attention to planning actions. Existing optimization methods can imitate actions or improve complete trajectories, but they usually do not explicitly identify planning actions or measure their importance for task success. To address this issue, we propose Planning-Aware Policy Optimization for VLA models (PAPO-VLA). PAPO-VLA first identifies planning actions by jointly considering action variation and trajectory outcome, then estimates their importance through causal sufficiency and causal necessity, and finally incorporates this importance into GRPO advantage estimation. In this way, more important planning actions receive stronger optimization emphasis, while the whole trajectory is still optimized by trajectory-level feedback. Experiments on multiple benchmarks demonstrate the effectiveness of PAPO-VLA.
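
To make the last step of that pipeline concrete, the sketch below combines the standard group-relative advantage used in GRPO (reward minus group mean, divided by group standard deviation) with a per-step importance weight. The multiplicative `1 + w` scaling and the function names are assumptions for illustration; the abstract does not specify how importance enters the advantage.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Trajectory-level advantages normalized within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def weight_by_importance(advantage, step_importance):
    """Scale one trajectory's advantage per step (assumed form).

    Steps with importance 0 keep the plain trajectory-level advantage;
    steps flagged as important planning actions receive a larger magnitude.
    """
    return [advantage * (1.0 + w) for w in step_importance]
```

With this form every step still receives the trajectory-level signal, and only the steps identified as planning actions are emphasized, which matches the qualitative behavior the abstract describes.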

2605.19562 2026-05-20 cs.RO cs.LG math.OC 版本更新

Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions

面向空地协同交接任务的学习加速优化轨迹规划

Jingshan Chen, Bochen Yu, Henrik Ebel, Peter Eberhard

发表机构 * Institute of Engineering and Computational Mechanics, University of Stuttgart, 70569 Stuttgart, Germany（斯图加特大学工程与计算力学研究所，德国斯图加特70569）； Mechanical Engineering, LUT University, 53850 Lappeenranta, Finland（LUT大学机械工程系，芬兰拉彭兰塔53850）

AI总结本文提出了一种学习增强的轨迹规划框架，用于无人机（UAV）与无人地面车辆（UGV）的协同交接任务，利用解耦的编码器-解码器LSTM网络生成协调的交接轨迹预测，作为下游集中式优化器的热启动，从而加速优化过程，实现更快的收敛和更高的优化成功率。

Comments Preprint of a contribution accepted for publication in the RoManSy 2026 Springer proceedings

AI中文摘要

本文提出了一种学习增强的轨迹规划框架，用于无人机（UAV）与无人地面车辆（UGV）的协同交接任务。尽管集中式轨迹优化能够确保动力学可行性和任务最优性，但其高计算成本限制了实时应用。我们提出了一种神经代理规划器，利用解耦的编码器-解码器长短期记忆（LSTM）网络，根据任务规范生成协调的交接轨迹预测。这些预测作为下游集中式优化器的有信息热启动，从而加速收敛到动力学可行的解。基准评估显示，与冷启动优化相比，学习增强的规划框架实现了三倍以上的加速和100%的优化成功率。结果表明，将数据驱动的推断与基于模型的精化相结合，能够为异构多机器人系统提供快速且可靠的轨迹生成。

英文摘要

This paper presents a learning-augmented trajectory planning framework for cooperative unmanned aerial vehicle (UAV) and unmanned ground vehicle (UGV) handover missions. While centralized trajectory optimization ensures dynamic feasibility and task optimality, its high computational cost limits real-time applicability. We propose a neural surrogate planner utilizing decoupled encoder-decoder long short-term memory (LSTM) networks to generate coordinated handover trajectory predictions from the task specifications. These predictions serve as informed warm starts for the downstream centralized optimizer, thereby accelerating convergence to dynamically feasible solutions. Benchmark evaluations demonstrate that the learning-augmented planning framework achieves more than a threefold speedup and 100% optimization success rate compared to cold start optimization. The results indicate that combining data-driven inference with model-based refinement enables fast and reliable trajectory generation for heterogeneous multi-robot systems.
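
The benefit of a warm start can be seen on a toy problem. The sketch below is not the paper's trajectory optimizer: it runs plain gradient descent on the one-dimensional objective (x - 3)^2, and the starting value 2.9 simply stands in for a prediction from a learned surrogate. All names and numbers are hypothetical.

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-6, max_iters=10_000):
    """Return (solution, iterations used) for 1-D gradient descent."""
    x = x0
    for iteration in range(max_iters):
        g = grad(x)
        if abs(g) < tol:
            return x, iteration
        x -= step * g
    return x, max_iters


def grad(x):
    """Gradient of the toy objective (x - 3)^2."""
    return 2.0 * (x - 3.0)


# Cold start: an uninformed initial guess.
cold_solution, cold_iters = gradient_descent(grad, x0=0.0)
# Warm start: 2.9 plays the role of a surrogate's prediction near the optimum.
warm_solution, warm_iters = gradient_descent(grad, x0=2.9)
```

Both runs reach the same optimum, but the warm-started run needs fewer iterations because it begins closer to the solution. In a nonconvex trajectory problem a good initial guess can also affect whether the solver succeeds at all, which a convex toy like this cannot show.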

2605.19524 2026-05-20 cs.RO cs.CV 版本更新

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

SafeAlign-VLA: 一种增强负样本的安全对齐框架用于风险感知的自动驾驶

Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

发表机构 * College of Transportation, Tongji University（同济大学交通运输学院）； Department of Civil Engineering, Tsinghua University（清华大学土木工程系）； School of Vehicle and Mobility, Tsinghua University（清华大学车辆与移动系统学院）； Department of Civil and Environmental Engineering, National University of Singapore（新加坡国立大学土木与环境工程系）

AI总结本文提出SafeAlign-VLA框架，通过整合负样本数据提升自动驾驶系统对安全边界的理解，通过生成安全标签和反事实轨迹，结合两阶段训练策略和基于锚点的群体相对策略优化，提高了自动驾驶的安全性和鲁棒性。

AI中文摘要

端到端的自动驾驶系统在常见场景中表现优异，但在安全关键的长尾案例中表现不佳。视觉-语言-动作（VLA）模型因其强大的推理能力而具有前景。然而，大多数基于VLA的方法依赖于正专家演示，很少利用负样本，导致对危险行为和安全边界的理解不足。为了解决这一限制，我们提出了SafeAlign-VLA，一种统一的增强负样本的安全对齐框架，将负数据整合到监督学习和强化学习中。首先，我们开发了一种反事实安全配对范式，通过反事实推理从危险场景中生成结构化的安全标签和反事实正轨迹。然后采用两阶段训练策略：负样本增强的监督微调用于故障反馈和轨迹修正，接着是基于锚点的群体相对策略优化，利用正负轨迹作为对比锚点，引导采样并惩罚高风险行为。在NAVSIM和DeepAccident上的实验验证了所提框架。SafeAlign-VLA在NAVSIM v1测试集上达到89.1 PDMS，比无负样本基线提高了1.3%。在DeepAccident上，碰撞率降低到3.36%，同时达到84.2%的语言准确率和85.8%的风险预测准确率。这些结果证明了所提增强负样本的安全对齐框架在安全和鲁棒自动驾驶中的有效性。

英文摘要

End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations, rarely exploiting negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 testset, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.

2605.19501 2026-05-20 cs.RO cs.AI 版本更新

CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

CANINE: 为视觉障碍者提供交互导航的机器人导盲犬教学系统

Cunjun Yu, Zishuo Wang, Anxing Xiao, Linfeng Li, David Hsu

发表机构 * School of Computing（计算机学院）； Smart Systems Institute（智能系统研究所）

AI总结本文提出CANINE系统，通过个性化适应性语音反馈帮助视觉障碍者学习与机器人导盲犬的交互导航，通过分解复杂协调任务并分层训练提升学习效率和最终导航性能。

Comments Accepted to RSS 2026

AI中文摘要

机器人导盲犬提供的导航辅助能够显著扩展视障者的独立出行能力，但其有效使用需要微妙的人机协调，用户很难从通用的口头指令中学会。为应对这一挑战，我们提出了CANINE，一个自动化教练系统，通过个性化、自适应的语音反馈训练用户与机器人导盲犬进行交互导航。CANINE将复杂的协调任务分解为子技能，并在两个层次上运作。在高层，它通过知识追踪跟踪学习者在各项子技能上的熟练度，并优先训练最薄弱的环节，从而决定训练什么。在底层，CANINE通过观察每一次人类练习片段，利用基础模型推断错误的根本原因，并自适应地生成有针对性的语音纠正，从而决定如何训练每项子技能。一项以蒙眼参与者作为定量评估替代人群的受控研究表明，与通用口头指令相比，CANINE显著提升了学习效率和最终导航性能。我们进一步通过一项保持性研究和一项探索性案例研究验证CANINE。保持性研究显示技能提升在两周后依然存在。案例研究确认了CANINE在训练一名视障用户方面的有效性，同时揭示了实际部署中需要额外考虑的设计因素。两者均与受控研究的结果高度一致。项目页面：https://cunjunyu.github.io/project/canine/

英文摘要

Robot guide dogs offer navigation assistance that greatly expands the independent mobility of the visually impaired, but their effective use requires subtle human-robot coordination that is difficult for users to learn from generic verbal instructions. To tackle this challenge, we present CANINE, an automated coaching system that trains users for interactive navigation with a robot guide dog, through personalized, adaptive verbal feedback. CANINE decomposes a complex coordination task into sub-skills and operates at two levels. At the high level, it decides what to train by tracking the learner's proficiency across sub-skills using knowledge tracing and prioritizing training on the weakest areas. At the low level, CANINE decides how to train each sub-skill by observing each human practice episode, using foundation models to infer the underlying causes of errors, and generating targeted verbal corrections adaptively. A controlled study with blindfolded participants, treated as a proxy population for quantitative evaluation, demonstrates that CANINE significantly improves both learning efficiency and final navigation performance compared to generic verbal instructions. We further validate CANINE through a retention study and an exploratory case study. The retention study shows lasting skill improvement after two weeks. The case study confirms CANINE's effectiveness in training a visually impaired user, while revealing additional design considerations for real-world deployment. Both are well aligned with the findings of the controlled study. Project page: https://cunjunyu.github.io/project/canine/

2605.19490 2026-05-20 cs.RO cs.CV 版本更新

Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

闭环混合数字孪生平台用于联网和自动化车辆验证

Kanglong Quan, Zhebing Xia, Linfeng Jiang, Hao Yu, Ziheng Qiao, Dapeng Dong, Dongyao Jia

发表机构 * National Natural Science Foundation of China（中国国家自然科学基金委员会）； Suzhou Science and Technology Development Planning Programme（苏州科技发展计划）

AI总结本文提出一种闭环混合数字孪生平台，通过高保真CARLA-SUMO协同模拟与物理测试现场和车辆的紧密耦合，实现联网和自动化车辆的高效验证。

AI中文摘要

联网和自动化车辆（CAVs）的全面且高效的验证在实际部署前至关重要。虽然基于模拟的测试提供了可扩展性，但现有方法往往缺乏与真实车辆和现场数据的无缝集成，限制了其在捕捉动态真实世界交互方面的保真度。为弥合这一差距，本文提出了一种新的实时混合数字孪生平台。其核心创新在于高保真CARLA-SUMO协同模拟与物理测试现场和车辆通过低延迟的车辆到万物（V2X）通信链路的紧密耦合。定制开发的中间件作为关键桥梁，同步真实CAV的运动状态作为模拟中的影子车辆，并将虚拟控制命令转换为底盘执行的控制器局域网络（CAN）消息以实现闭环控制。详细的实现包括使用摄影测量法进行全尺寸资产重建以及云边协同架构以实现可扩展的多用户操作。实验结果表明同步稳定且闭环控制有效，延迟低，证实了该平台在多场景CAV验证中的实用性。

英文摘要

Comprehensive and efficient validation of connected and automated vehicles (CAVs) is critical prior to real-world deployment. While simulation-based testing offers scalability, existing approaches often lack seamless integration with real vehicles and field data, limiting their fidelity in capturing dynamic, real-world interactions. To bridge this gap, this paper proposes a novel real-time hybrid digital twin platform. Its core innovation lies in the tight coupling of a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle via a low-latency Vehicle-to-Everything (V2X) communication link. A custom-developed middleware serves as the critical bridge, synchronizing a real CAV's kinematic state as a shadow vehicle in the simulation and translating virtual control commands into chassis-actuating Controller Area Network (CAN) messages for closed-loop control. Detailed implementation includes using photogrammetry for full-scale asset reconstruction and a cloud-edge collaborative architecture for scalable, multi-user operation. Experimental results demonstrate stable synchronization and effective closed-loop control with low latency, confirming the platform's practicality for multi-scenario CAV verification.

2605.19469 2026-05-20 cs.LG cs.AI cs.RO 版本更新

Sampling-Based Safe Reinforcement Learning

基于采样的安全强化学习

Luca Vignola, Bruce D. Lee, Manish Prajapat, Manuel Wendl, Melanie Zeilinger, Andreas Krause, Yarden As

发表机构 * ETH Zurich（苏黎世联邦理工学院）

AI总结本文提出了一种基于采样的安全强化学习方法，通过在有限的动力学样本集上联合施加约束，确保学习过程中的安全性，并在连续域中提供实用的安全保证，同时通过限制认知不确定性实现了高效的探索。

AI中文摘要

安全探索仍然是强化学习（RL）中的基本挑战，限制了RL智能体在现实世界中的部署。我们提出了基于采样的安全强化学习（SBSRL），这是一种基于模型的RL算法，通过在有限的动力学样本集上联合施加约束，在整个学习过程中保持安全。这种形式近似了不确定动力学下难以求解的最坏情况优化，并在连续域中实现了实用的安全保证。我们进一步引入了一种基于约束认知不确定性的探索策略，从而不再需要显式的探索奖励。在正则性条件下，我们推导了整个学习过程中安全性的高概率保证，以及恢复近最优策略的有限时间样本复杂度界。实验表明，SBSRL在仿真和真实机器人硬件中均实现了安全且高效的探索，并可以方便地扩展到实用的深度集成（deep ensemble）实现，以求解高维连续控制问题。

英文摘要

Safe exploration remains a fundamental challenge in reinforcement learning (RL), limiting the deployment of RL agents in the real world. We propose Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout the learning process by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. We further introduce an exploration strategy based on constraining epistemic uncertainty, eliminating the need for explicit exploration bonuses. Under regularity conditions, we derive high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy. Empirically, SBSRL achieves safe and efficient exploration both in simulation and in real robotic hardware, and readily extends to practical deep-ensemble implementations that scale to high-dimensional continuous control problems.
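
Enforcing a constraint jointly across a finite set of dynamics samples can be sketched as follows. This is a toy illustration under stated assumptions, not the SBSRL algorithm: the dynamics are scalar and linear, each sample is one hypothetical value of the unknown parameter `a`, and safety is an accumulated-cost budget over a short horizon.

```python
def make_linear_dynamics(a):
    """Scalar dynamics x' = a * x + u for one sampled parameter a."""
    return lambda x, u: a * x + u


def rollout_cost(dynamics, policy, x0, horizon, cost):
    """Accumulated cost of running `policy` under one dynamics sample."""
    x, total = x0, 0.0
    for _ in range(horizon):
        u = policy(x)
        total += cost(x, u)
        x = dynamics(x, u)
    return total


def satisfies_constraint_jointly(dynamics_samples, policy, x0, horizon, cost, budget):
    """True only if the cost budget holds under every sampled dynamics model."""
    return all(
        rollout_cost(f, policy, x0, horizon, cost) <= budget
        for f in dynamics_samples
    )
```

Checking every sample replaces a worst-case optimization over all plausible dynamics with a finite conjunction of constraints. For example, with samples `a` in `(0.9, 1.0, 1.1)`, cost `abs(x)`, horizon 3, and budget 2.0, the feedback policy `u = -0.5 * x` passes the check while the zero policy does not.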

2605.19431 2026-05-20 cs.RO 版本更新

Self-assembling Modular Aerial Robot for Versatile Aerial Tasks

自组装模块化空中机器人用于多功能空中任务

Junichiro Sugihara, Masaki Kitagawa, Jinjie Li, Yunong Li, Takuzumi Nishio, Kei Okada, Moju Zhao

发表机构 * Department of Mechanical Engineering, The University of Tokyo（东京大学机械工程系）； Department of Mechano Informatics, The University of Tokyo（东京大学机械信息学系）

AI总结本文提出了一种自组装模块化空中机器人LEGION，通过飞行中自组装实现协同操作，兼具灵活的机动性和可重构性，使空中机器人从被动观察者转变为主动参与者，拓展了空中物理交互的范围。

AI中文摘要

多旋翼空中机器人在三维空间中具有出色的机动性，最近的进展使它们能够在复杂和狭窄的环境中进行灵活导航，尤其是对于小型机架。相比之下，用于高空工作的平台通常更大，以提供高推力以实现与环境的稳定物理交互。然而，这些矛盾的设计要求导致了灵活导航和稳健空中操作之间的长期权衡。本文提出了LEGION单元，这是一种可重新配置的模块化空中机器人，能够飞行中自组装以实现协同操作，灵感来自蚂蚁形成的自组织群体。每个单元保留了灵活的机动性，而两端的关节配备的对接接口使单元能够端到端自组装成飞行操作器。我们证明了多个单元可以自主飞行中对接；一旦锁定，它们通过控制接触力和扭矩保持零间隙锁定，即使在户外也能实现可靠的聚集和关节运动。我们进一步证明，自重构能力使单元能够在灵活的个体飞行和集体关节操作之间进行形态切换，同时实现核心飞行中操作原始操作，包括推、拉、旋转、抓取和携带。LEGION的自组织能力使空中机器人，特别是群组中的机器人，能够从被动观察者转变为环境中的主动参与者，拓展了空中物理交互的范围。

英文摘要

Multirotor aerial robots excel at maneuvering in three-dimensional space, and recent advances enable nimble navigation in cluttered and confined environments, especially for small airframes. By contrast, platforms built for high-altitude work tend to be larger to deliver high thrust for stable physical interaction with the environment. However, these conflicting design requirements create a long-standing trade-off between nimble navigation and robust aerial manipulation. Here, we present LEGION units, which are reconfigurable modular aerial robots capable of in-flight self-assembly for cooperative manipulation, drawing inspiration from the self-organized collectives formed by ants. Each unit retains nimble maneuverability while joint-equipped docking interfaces at both ends enable end-to-end self-assembly into a flying manipulator. We show that multiple units autonomously dock in flight; once latched, they maintain a zero-clearance interlock by controlling the contact force and torque, enabling reliable aggregation and articulated motion even outdoors. We further show that self-reconfigurability enables morphological switching between nimble individual flight and collective articulated manipulation, while realizing core in-flight manipulation primitives including pushing, pulling, rotating, grasping, and carrying. LEGION's self-organization enables aerial robots, especially in swarms, to shift from passive observers to active participants in their environment, broadening the scope of aerial physical interaction.

2605.19420 2026-05-20 cs.RO 版本更新

Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation

超越航点：双热图接地用于跨具身语义导航

Kaijie Yun, Yue Chen

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； JD AI Research（京东人工智能研究院）

AI总结本文提出一种统一的视觉-语言框架，通过双热图表示替代单点回归，以解决语义指令与物理可达性之间的差距，从而提升跨具身语义导航的鲁棒性和性能。

AI中文摘要

将开放式的语义指令落地为物理上可执行的局部目标，是人机交互中的基本挑战。现有导航框架通常回归确定性的航点，这种刚性的形式会抹去空间不确定性，并且经常把目标定在不可通行的物体中心，导致严重的执行失败。在本文中，我们关注视场内（in-FOV）语义导航这一实际场景，其中机器人接收到简短的、文本与图像交织的多模态提示。为了弥合抽象语义意图与物理可达性之间的差距，我们提出了一种统一的视觉-语言框架，该框架放弃单点回归，转而采用双热图表示。我们的框架预测一个导航可供性热图，以捕捉连续的可到达区域，并配以一个用于朝向约束的朝向热图。这些密集输出本质上充当可微的语义势场，能够与下游的局部规划器无缝集成。为了支持这一范式，我们构建了一个完全自动化、由基础模型辅助的合成数据管道，并建立了全面的仿真基准。大量实验表明，我们的框架在可比的8B基线中实现了最先进的性能。关键的是，一项特征融合研究以及在不同机器人具身（Jetbot、H1、Aliengo）上的仿真研究表明，显式的热图预测大幅提高了可供性率（AR）。通过将目标可靠地放置在可执行的自由空间中，我们的框架有效缓解了点回归的脆弱性，为实现安全的跨具身语义导航提供了一条可迁移的路径。

英文摘要

Grounding open-ended semantic instructions into physically executable local goals is a fundamental challenge in human-robot interaction. While existing navigation frameworks often regress deterministic waypoints, this rigid formulation collapses spatial uncertainty and frequently targets non-traversable object centers, leading to severe execution failures. In this work, we focus on the practical setting of in-FOV semantic navigation, where a robot receives concise, interleaved multimodal (text and image) prompts. To bridge the gap between abstract semantic intent and physical reachability, we propose a unified Vision-Language framework that abandons single-point regression in favor of a Dual-Heatmap representation. Our framework predicts a navigation affordance heatmap that captures continuous reachable regions, coupled with a facing heatmap for orientation constraints. These dense outputs inherently function as a differentiable semantic potential field, integrating seamlessly with downstream local planners. To support this paradigm, we build a fully automated, foundation-model-assisted synthetic data pipeline and establish a comprehensive simulation benchmark. Extensive experiments demonstrate that our framework achieves state-of-the-art performance among comparable 8B baselines. Crucially, a feature-fusion study and simulation studies across diverse robot embodiments (Jetbot, H1, Aliengo) reveal that explicit heatmap prediction drastically improves the Affordance Rate (AR). By placing targets reliably in executable free space, our framework effectively mitigates the brittleness of point regression, offering a transferable path toward safe cross-embodiment semantic navigation.


2605.19328 2026-05-20 cs.CR cs.RO (updated version)

RoboJailBench: Benchmarking Adversarial Attacks and Defenses in Embodied Robotic Agents

Doguhuan Yeke, Yanming Zhou, Leo Y. Lin, Hongyu Cai, Antonio Bianchi, Z. Berkay Celik

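Affiliations: Purdue University

AI summary: This paper presents RoboJailBench, a standardized evaluation framework for adversarial attacks and defenses in embodied AI. It establishes a security taxonomy, introduces an intent contrast dataset pipeline, and provides an evolving repository; the authors also construct a new taxonomy-balanced dataset and augment five existing datasets.

AI-translated abstract

Recent advances in Vision-Language Models (VLMs) have enabled a new class of embodied AI systems in which these models are integrated into physical platforms, such as robots and autonomous vehicles, to interpret visual scenes and execute natural language commands in diverse environments. Prior work has introduced jailbreak attacks and defenses for embodied AI. Their evaluations, however, rely on ad-hoc datasets and limited metrics, and emphasize attack success rate while ignoring the trade-off between security and the ability to carry out benign commands. Existing benchmarks and evaluation frameworks either target traditional chat-based models or focus on non-adversarial safety evaluation; neither captures the inputs, consequences, and evaluation criteria required for jailbreak attacks on embodied AI systems. In this paper we fill this gap with RoboJailBench, which has three core components. We establish a security taxonomy based on ISO standards, regulatory rules, and documented incidents, which yields 18 categories of security violation consequences for embodied AI. We introduce an intent contrast dataset pipeline that augments existing datasets with paired adversarial and benign goals to measure both security and utility. Finally, we provide an evolving repository with standardized metrics and a unified process for evaluating and integrating new attacks and defenses. With this benchmark we construct a new taxonomy-balanced dataset and augment five existing datasets. We integrate four attacks and two defenses and evaluate them on leading embodied VLMs. The benchmark provides the first standardized evaluation framework for jailbreak attacks in embodied AI and supports future research. We release our code, datasets, and artifacts, and maintain a leaderboard at https://purseclab.github.io/benchmark-for-robotics-security.

English abstract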

Recent advances in Vision-Language Models (VLMs) facilitate a new class of embodied AI systems, where these models are integrated into physical platforms, e.g. robots and autonomous vehicles, to interpret visual scenes and execute natural language commands in diverse environments. Previous research has introduced jailbreak attacks and defenses for embodied AI. Their evaluations, however, rely on ad-hoc datasets, limited metrics, and emphasize attack success while neglecting the trade-off between security and the ability to follow benign commands. Existing benchmarks and evaluation frameworks either target traditional chat-based models or focus on non-adversarial safety evaluation for embodied AI; neither captures the adversarial risks, inputs, consequences, and evaluation criteria necessary for jailbreak attacks in embodied AI systems. In this paper, we address this gap with RoboJailBench, which consists of three core components. We establish a security taxonomy derived from ISO standards, regulatory rules, and documented incidents. This effort yields 18 categories of security violation consequences for embodied AI. We introduce an intent contrast dataset pipeline that augments existing datasets with paired adversarial and benign goals to measure both security and utility. Lastly, we provide an evolving repository with standardized metrics and a unified process for assessing and integrating new attacks and defenses. With this benchmark, we construct a new taxonomy-balanced dataset and augment five existing datasets. We integrate four attacks and two defenses to evaluate their performance on leading embodied VLMs. This benchmark provides the first standardized evaluation framework for jailbreak attacks in embodied AI and supports future research. We release our code, datasets, and artifacts, and maintain a leaderboard at https://purseclab.github.io/benchmark-for-robotics-security.
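
As a rough illustration of why paired adversarial and benign goals matter, the sketch below scores one set of trials on both axes. The `Trial` fields and the two rate names are assumptions made for this example; they are not RoboJailBench's actual schema or metric definitions.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    adversarial: bool  # True if this goal is the adversarial half of a pair (hypothetical field)
    complied: bool     # True if the agent carried out the requested action (hypothetical field)

def security_utility(trials):
    """Return (attack_success_rate, benign_completion_rate) over a list of trials."""
    adv = [t for t in trials if t.adversarial]
    ben = [t for t in trials if not t.adversarial]
    asr = sum(t.complied for t in adv) / len(adv) if adv else 0.0
    bcr = sum(t.complied for t in ben) / len(ben) if ben else 0.0
    return asr, bcr

trials = [
    Trial(adversarial=True, complied=False),
    Trial(adversarial=True, complied=True),
    Trial(adversarial=False, complied=True),
    Trial(adversarial=False, complied=False),
]
asr, bcr = security_utility(trials)
```

A defense that refuses every command would drive the first number to zero but also the second, which is the trade-off that attack-success-only evaluation hides.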


2605.19314 2026-05-20 cs.RO cs.AI (updated version)

Automatically Improving the Simulated Physical Properties of Articulated Objects

Anh-Quan Pham

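Affiliations: Penn (University of Pennsylvania); PennPAL Lab

AI summary: This work studies how a quantitative evaluation framework and a multi-modal, simulator-feedback method can improve the physical realism and stability of articulated objects in simulation, and thereby the efficiency and effectiveness of robot learning.

AI-translated abstract

Simulation is a central tool for scalable robot learning, but its effectiveness depends on the quality of object assets. Modern 3D datasets provide rich geometric and kinematic representations, yet they usually lack the physical properties needed for stable and realistic interaction, so building simulation-ready articulated objects takes substantial manual effort. In this thesis we introduce interaction-readiness, which characterizes whether an object can be reliably simulated under manipulation. We propose a quantitative evaluation framework that decomposes interaction-readiness into measurable components, enabling systematic analysis of object quality and revealing failure modes that conventional evaluation misses. We further present a multi-modal, simulator-in-the-loop method for generating interaction-ready articulated objects from incomplete 3D assets. The method integrates geometric, visual, and semantic information to infer physical properties and refines them through iterative simulator feedback to improve physical consistency. Experiments on diverse articulated objects and manipulation tasks show that object quality directly affects simulation stability, interaction behavior, and policy performance. Objects refined by our method exhibit more stable and realistic dynamics, enabling more reliable downstream learning and evaluation. Overall, this thesis demonstrates the importance of physical realism for articulated objects in simulation and introduces a practical multi-modal refinement method, guided by simulator feedback, for constructing such objects at scale.

English abstract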

Simulation is a central tool for scalable robot learning, but its effectiveness depends on the quality of object assets. While modern 3D datasets provide rich geometric and kinematic representations, they typically lack the physical properties required for stable and realistic interaction, requiring significant manual effort to construct simulation-ready articulated objects. In this thesis, we introduce interaction-readiness, which characterizes whether an object can be reliably simulated under manipulation. We propose a quantitative evaluation framework that decomposes interaction-readiness into measurable components, enabling systematic analysis of object quality and revealing failure modes not captured by conventional evaluation. We further present a multi-modal, simulator-in-the-loop approach for generating interaction-ready articulated objects from incomplete 3D assets. The method integrates geometric, visual, and semantic information to infer physical properties and refines them through iterative simulator feedback to improve physical consistency. Experiments across diverse articulated objects and manipulation tasks show that object quality directly impacts simulation stability, interaction behavior, and policy performance. Objects refined by our method exhibit more stable and realistic dynamics, enabling more reliable downstream learning and evaluation. Overall, this thesis demonstrates the importance of physical realism for articulated objects in simulation and introduces a practical multi-modal refinement approach, guided by simulator feedback, for constructing such objects at scale.


2605.19120 2026-05-20 cs.RO (updated version)

CosFly: Plan in the Matrix, Fly in the World

Hanxuan Chen, Xiangyue Wang, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Binbo Li, Kangli Wang, Ji Pei

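Affiliations: Autel Robotics; Nanjing University; Peking University; Southern University of Science and Technology; University of Hong Kong

AI summary: This paper presents CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, and CosFly-Track, a large-scale UAV dataset for dynamic target tracking in diverse environments. CosFly converts complex 3D worlds into structured obstacle representations for planning, projects the resulting trajectories into multimodal sensor data, and supports configurable fixed-FOV zoom levels.

AI-translated abstract

We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking in diverse environments. In our current implementation, CosFly provides a modular 7-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning and then projects the resulting trajectories into multimodal sensor data, including RGB images, high-precision depth maps, and semantic segmentation masks, paired with natural language navigation instructions. A key feature is support for configurable fixed-FOV zoom levels (one FOV setting per trajectory, held constant throughout), which simulates various focal lengths by adjusting the camera intrinsics. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multimodal rendering with 6-DOF pose annotations, and quality inspection to teacher-student caption generation. We analyze two trajectory-planning paradigms: a conventional two-stage pipeline (front-end candidate generation and back-end refinement) and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and about 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset provide a scalable foundation for aerial-ground collaborative research across diverse environments, supporting dynamic target tracking, UAV navigation, and multimodal perception.

English abstract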

We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular 7-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multi-modal sensor data -- including RGB images, high-precision depth maps, and semantic segmentation masks -- paired with natural language navigation instructions. A key feature is the support for configurable fixed-FOV zoom levels (one FOV setting drawn per trajectory and held constant throughout), enabling simulation of various focal lengths through camera-intrinsic adjustments. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multi-modal rendering with 6-DOF pose annotations, quality inspection, and teacher-student caption generation. We analyze two trajectory-planning paradigms for aerial target tracking: a conventional two-stage pipeline with front-end candidate generation and backend refinement, and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception across diverse environments.
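
The link between a fixed FOV setting and a simulated focal length follows from the standard pinhole camera model. The sketch below assumes square pixels and a principal point at the image center; the function names and the example resolution are illustrative and are not taken from the CosFly code.

```python
import math

def focal_length_px(image_width_px, hfov_deg):
    """Pinhole model: f = (W / 2) / tan(HFOV / 2), with f in pixels."""
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

def intrinsics(width, height, hfov_deg):
    """3x3 intrinsic matrix K as nested lists (square pixels, centered principal point)."""
    f = focal_length_px(width, hfov_deg)
    return [[f, 0.0, width / 2.0],
            [0.0, f, height / 2.0],
            [0.0, 0.0, 1.0]]

# One FOV per trajectory: a narrower FOV acts like a longer lens (more zoom).
wide = focal_length_px(1920, 90.0)
tele = focal_length_px(1920, 30.0)
K = intrinsics(1920, 1080, 90.0)
```

Holding the FOV constant for a whole trajectory keeps `K` fixed across its frames, which matches the "one FOV setting per trajectory" description.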


2605.19104 2026-05-20 cs.RO cs.AI (updated version)

Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots

Branden Frieden, James M. Ferguson, Alan Kuntz, Varun Shankar

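Affiliations: The Robotics Center and the Kahlert School of Computing at the University of Utah; The Departments of Computer Science and Electrical and Computer Engineering at Vanderbilt University

AI summary: This paper proposes a neural-operator learning approach to design-space surrogate modeling of tendon-actuated continuum robots. It maps robot design parameters and tendon actuation inputs to the resulting configuration, so that one model generalizes across a large class of robot designs.

Comments Accepted to ICRA 2026

AI-translated abstract

Continuum robots enable dexterous manipulation in constrained environments, but they need accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and inaccurate because of unmodeled effects, while current learning-based methods generalize poorly beyond a specific robot. We formulate surrogate modeling of tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to the resulting configuration. This lets a single trained model generalize across a large class of robot designs. We develop four novel neural operator architectures, two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs), and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing fast and accurate generalization across designs. Our results show that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.

English abstract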

Continuum robots enable dexterous manipulation in constrained environments, but require accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and may suffer from inaccuracies due to unmodeled effects, while current learning-based methods often generalize poorly beyond the specific robot on which they are trained. We present a formulation of surrogate modeling for tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to resulting configurations. This formulation enables a single trained model to generalize across a large class of robot designs. We develop four novel neural operator architectures--two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs)--and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing for fast and accurate generalization across designs. Our results demonstrate that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.
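
For readers unfamiliar with DeepONets, the sketch below shows the basic branch/trunk structure in plain Python with random, untrained weights. It is a generic single-layer DeepONet, not one of the four architectures from the paper; the input layout (three design parameters plus two tendon tensions) and the arc-length query are assumptions chosen only to mirror the abstract's description.

```python
import math
import random

def dense(x, W, b):
    """One tanh layer: y = tanh(W x + b)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def init(rng, n_out, n_in):
    """Random weights and biases in [-1, 1] (untrained, for illustration only)."""
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1, 1) for _ in range(n_out)]
    return W, b

def deeponet(u, s, branch, trunk):
    """G(u)(s) ~= sum_k branch_k(u) * trunk_k(s)."""
    bk = dense(u, *branch)    # encodes the input function / parameters
    tk = dense([s], *trunk)   # encodes the query location
    return sum(bi * ti for bi, ti in zip(bk, tk))

rng = random.Random(0)
p = 8                        # latent width shared by branch and trunk
branch = init(rng, p, 5)     # u: 3 design parameters + 2 tendon tensions (hypothetical)
trunk = init(rng, p, 1)      # s: normalized arc length along the backbone (hypothetical)
u = [0.1, 0.02, 1.5, 3.0, 0.0]
shape = [deeponet(u, i / 10, branch, trunk) for i in range(11)]
```

Because the design parameters enter through `u` rather than being baked into the weights, the same network can be queried for a different robot design by changing `u` alone, which is the property the abstract emphasizes.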


2605.19038 2026-05-20 cs.RO cs.LG (updated version)

Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith

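Affiliations: Vector Institute; University of Toronto; LAAS-CNRS; University of Toulouse; RWTH Aachen University

AI summary: This paper proposes bilevel policies that combine low-level imitation learning with high-level symbolic abstraction to solve long-horizon planning problems. The BISON system is evaluated on extended MetaWorld benchmarks, where it handles large numbers of objects and long-horizon tasks better than the baselines.

AI-translated abstract

We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has proven effective for training robots to solve complex tasks that require fine motor control and manipulation. Generating long-horizon plans through imitation learning alone, however, remains difficult. In contrast, high-level (HL) symbolic abstractions enable efficient and interpretable long-horizon planning. We propose to combine the strengths of low-level (LL) imitation learning for manipulation and control with those of HL symbolic abstractions for long-horizon planning. We realize this idea through bilevel policies (π^hl, π^ll), consisting of a neural policy π^ll learned from LL demonstrations and an HL symbolic policy π^hl constructed from symbolic abstractions of the LL demonstrations combined with inductive generalization. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks show that BISON generalizes to long horizons and to problems with more objects than VLA and end-to-end methods can solve, and that it is more time- and memory-efficient in training and inference. Notably, when LL execution is ignored, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison

English abstract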

We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via \emph{bilevel policies} of the form $(π^{\mathrm{hl}}, π^{\mathrm{ll}})$, consisting of a neural policy $π^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $π^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison


2605.15336 2026-05-20 cs.RO cs.AI (updated version)

HoloMotion-1 Technical Report

Maiyue Chen, Kaihui Wang, Bo Zhang, Xihan Ma, Zhiyuan Yang, Yi Ren, Qijun Huang, Zihao Zhu, Yucheng Wang, Zhizhong Su

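Affiliations: Horizon Robotics

AI summary: This report presents HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. Training the control policy on a large-scale hybrid motion corpus increases the diversity and accuracy of motion behavior and yields robust generalization across motion types and capture conditions.

Comments 20 pages, 4 figures, 6 tables. Technical report

AI-translated abstract

In this report we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. Its key innovation is to train the control policy on a large-scale hybrid motion corpus, in which motions reconstructed from in-the-wild videos provide the main source of motion diversity, while curated motion-capture data and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime takes HoloMotion-1 beyond conventional MoCap-only training and exposes the policy to a broader range of behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address them, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. On multiple unseen motion benchmarks, HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy, and transfers directly to a real humanoid robot without task-specific fine-tuning.

English abstract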

In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.
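
"Sparsely activated" Mixture-of-Experts usually means that a gate picks a few experts per input and the rest are skipped. The sketch below shows generic top-k gating in plain Python; the abstract gives no routing details for HoloMotion-1, so the choice of top-k softmax gating, the value of `k`, and the toy experts are all assumptions.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest gate logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                 # subtract max for numerical stability
    exp = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exp.values())
    return [exp[i] / z if i in exp else 0.0 for i in range(len(logits))]

def moe(x, experts, logits, k=2):
    """Weighted sum of the selected experts; unselected experts are never evaluated."""
    weights = top_k_gate(logits, k)
    return sum(w * f(x) for w, f in zip(weights, experts) if w > 0.0)

# Four toy scalar "experts" and fixed gate logits (illustrative values only).
experts = [lambda x: x, lambda x: 100.0, lambda x: 3 * x, lambda x: -x]
gate_logits = [2.0, 0.0, 2.0, -1.0]
weights = top_k_gate(gate_logits, 2)
y = moe(3.0, experts, gate_logits)
```

Only two of the four experts run for this input, which is why such layers can add capacity without a proportional increase in inference cost.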


2605.13646 2026-05-20 cs.RO cs.AI (updated version)

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

Seokha Moon, Minseung Lee, Joon Seo, Jinkyu Kim, Jungbeom Lee

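Affiliations: Korea University; Kakao Mobility

AI summary: This paper proposes CaAD, a framework that captures causal dependencies between the ego vehicle and surrounding agents in a shared latent scene representation to improve closed-loop planning in end-to-end autonomous driving.

AI-translated abstract

End-to-end autonomous driving, which bypasses traditional modular pipelines by predicting future trajectories directly from sensor inputs, has made substantial progress in recent years. However, existing methods often overlook the causal dependencies in ego-vehicle planning and ignore the reciprocal relations between the ego vehicle and surrounding agents. This oversight leads to inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and the behavior of neighboring agents must be reasoned about jointly. To address this limitation, we propose CaAD, a causality-aware end-to-end autonomous driving framework that captures these dependencies in a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch and learns causal dependencies between the ego vehicle and interaction-relevant agents. Second, we add a causality-aware policy alignment stage that uses joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD shows strong closed-loop planning performance, with a Driving Score of 87.53 and a Success Rate of 71.81 on Bench2Drive and a PDMS of 91.1 on NAVSIM. The project page is available at https://moonseokha.github.io/CaAD/.

English abstract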

End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and neighboring agent behaviors must be reasoned about jointly. To address this limitation, we propose CaAD, a Causality-aware end-to-end Autonomous Driving framework that captures these dependencies within a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch, and learns causal dependencies between the ego vehicle and interaction-relevant agents. Second, we employ a causality-aware policy alignment stage implemented with joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD demonstrates strong closed-loop planning performance, achieving a Driving Score of 87.53 and Success Rate of 71.81 on Bench2Drive, and a PDMS of 91.1 on NAVSIM. The project page is available at https://moonseokha.github.io/CaAD/.


2605.12974 2026-05-20 cs.RO cs.SY eess.SY (updated version)

Distributionally Robust Safety Under Arbitrary Uncertainties: A Safety Filtering Approach

Daniel M. Cherenson, Haejoon Lee, Taekyung Kim, Dimitra Panagou

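AI summary: This paper studies how to ensure probabilistic safety for nonlinear systems under distributional ambiguity. It proposes a backup-based safety filtering framework that switches between a high-performance nominal policy and a certified backup policy, handles arbitrary uncertainties with a distributionally robust formulation, and certifies safety with a sampling-based procedure.

Comments 10 pages, 4 figures, submitted to IEEE Robotics and Automation Letters (RA-L); Project Page: https://dcherenson.github.io/drs-gk

AI-translated abstract

In this paper we study how to ensure probabilistic safety for nonlinear systems under distributional ambiguity. Our approach builds on a backup-based safety filtering framework that switches between a high-performance nominal policy and a certified backup policy to ensure safety. To handle arbitrary uncertainties, that is, cases where the distribution has no specific structure and the true distribution is unknown, we adopt a distributionally robust (DR) formulation with Wasserstein ambiguity sets. Rather than solving a high-dimensional DR trajectory optimization problem online, we exploit the structure of backup-based safety filtering to reduce safety certification to a one-dimensional search over the switching time between the nominal and backup policies. We then develop a sampling-based certification procedure with finite-sample guarantees, in which empirical failure probabilities are compared against a Wasserstein-inflated threshold. We validate the method in simulations of three systems, from a Dubins vehicle to a high-speed racing car and a fighter jet, demonstrating its broad applicability and computational efficiency.

English abstract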

In this work, we study how to ensure probabilistic safety for nonlinear systems under distributional ambiguity. Our approach builds on a backup-based safety filtering framework that switches between a high-performance nominal policy and a certified backup policy to ensure safety. To handle arbitrary uncertainties from ambiguous distributions, i.e., where the distribution is not of specific structure and the true distribution is unknown, we adopt a distributionally robust (DR) formulation using Wasserstein ambiguity sets. Rather than solving a high-dimensional DR trajectory optimization problem online, we exploit the structure of backup-based safety filtering to reduce safety certification to a one-dimensional search over the switching time between nominal and backup policies. We then develop a sampling-based certification procedure with finite-sample guarantees, where empirical failure probabilities are compared against a Wasserstein-inflated threshold. We validate our method through simulations across three systems, from a Dubins vehicle to a high-speed racing car and a fighter jet, demonstrating the broad applicability and computational efficiency.
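
The reduction to a one-dimensional search over the switching time can be pictured with a toy example. The sketch below uses a made-up 1-D braking scenario and a plain grid scan; the dynamics, the disturbance distribution, and the way `threshold` is obtained are all assumptions, since the abstract does not give the Wasserstein-inflated threshold or the search procedure in closed form.

```python
import random

def fails(switch_time, w, wall=10.0, speed=1.0, brake=0.5):
    """Toy rollout: cruise (nominal) until switch_time, then brake (backup).

    w is a sampled disturbance that scales the achievable deceleration.
    Returns True if the vehicle passes the wall before stopping.
    """
    stop_dist = speed ** 2 / (2.0 * brake * (1.0 + w))
    return speed * switch_time + stop_dist > wall

def latest_certified_switch(candidates, samples, threshold):
    """Latest switching time whose empirical failure rate stays within threshold."""
    best = None
    for t in sorted(candidates):
        rate = sum(fails(t, w) for w in samples) / len(samples)
        if rate <= threshold:
            best = t
    return best

rng = random.Random(0)
samples = [rng.uniform(-0.5, 0.5) for _ in range(1000)]   # hypothetical disturbance samples
candidates = [i * 0.5 for i in range(21)]                  # switching times 0.0 .. 10.0
t_switch = latest_certified_switch(candidates, samples, threshold=0.01)
```

In this toy setup the worst-case stopping distance is 2, so the nominal policy can run until t = 8 before the backup must take over. In the paper's setting, `threshold` would be the Wasserstein-inflated value rather than a hand-picked constant.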


2605.08879 2026-05-20 cs.RO (updated version)

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

Tianyi Zhang, Shaopeng Zhai, Haoran Zhang, Fuxian Huang, Qi Zhang

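Affiliations: Shanghai Artificial Intelligence Laboratory

AI summary: This paper proposes Conservative Supervised Fine-Tuning (ConSFT), which dynamically scales the learning signal to reduce the damage that fine-tuning does to the pre-trained capabilities of flow-matching Vision-Language-Action models, improving adaptation to the target distribution and capability retention without prior data or architectural overhead.

Comments 20 pages, 9 figures

AI-translated abstract

Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models causes dense parameter overwrites that degrade pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to the target distribution while mitigating catastrophic forgetting, with no prior data or architectural overhead. By dynamically scaling the learning signal according to model confidence, ConSFT suppresses excessive gradients from low-confidence samples, which prevents disproportionate parameter updates and bounds the intrinsic risk of parameter disruption. Inspired by trust-region clipping in reinforcement learning, this formulation establishes a progressive learning dynamic that secures both convergence on the target and retention of prior capabilities, keeping parameter updates sparse without the parallel reference network that explicit regularization requires. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks with state-of-the-art flow-matching VLAs (π₀, π₀.₅, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of more than 20%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robot deployments confirm that ConSFT prevents spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.

English abstract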

Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models drives dense parameter overwrites, degrading pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to target distributions while mitigating catastrophic forgetting, requiring zero prior data or architectural overhead. By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding the intrinsic parameter disruption risk. Inspired by reinforcement learning's trust-region clipping, this formulation establishes a progressive learning dynamic to secure target convergence and prior capability retention, maintaining sparse parameter updates without relying on the parallel reference networks required by explicit regularization. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks across state-of-the-art flow-matching VLAs ($π_0$, $π_{0.5}$, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of over 20\%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robotic deployments confirm that ConSFT precludes spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.


2605.08830 2026-05-20 cs.CV cs.AI cs.RO (updated version)

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

Edwin C. Montiel-Vazquez, Christian Arzate Cruz, Stefanos Gkikas, Thomas Kassiotis, Giorgos Giannakakis, Randy Gomez

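Affiliations: School of Engineering; Honda Research Institute Japan; Department of Electronic Engineering

AI summary: This paper proposes a lightweight transformer that predicts the placement and intensity of iconic gestures from text and emotion alone, with no audio input. On the BEAT2 dataset it outperforms GPT-4o in semantic gesture placement classification and intensity regression, and it is computationally compact enough for real-time deployment.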

2604.09323 2026-05-20 cs.RO (updated version)

Robust Adaptive Backstepping Impedance Control of Robots in Unknown Environments

Reza Nazmara, Alap Kshirsagar, Jan Peters, A. Pedro Aguiar

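Affiliations: Research Center for Systems and Technologies (SYSTEC), ARISE, Faculty of Engineering, University of Porto, 4200-465 Porto, Portugal; Intelligent Autonomous Systems Lab, Department of Computer Science, TU Darmstadt, Germany

AI summary: This paper proposes a Robust Adaptive Backstepping Impedance Control (RABIC) strategy for robots operating in contact-rich and uncertain environments. The strategy considers the complete coupled dynamics of the system and explicitly accounts for key sources of uncertainty, such as external disturbances and unmodeled dynamics, without requiring the robot's dynamic parameters. A backstepping-based inner loop tracks the reference impedance model, a Taylor series-based estimator estimates the system dynamics, and an adaptive estimator determines the upper bound of the external forces. Stability analysis shows semi-global practical finite-time stability of the overall system. A simulated mobile manipulator scenario and experiments on a real Franka Emika Panda robot show that the method is safer than PD control while ensuring trajectory tracking and force monitoring.

Comments 8

DOI: 10.1016/j.mechatronics.2026.103552
Journal ref: Mechatronics, Vol. 118, 103552 (2026)

AI-translated abstract

This paper presents a Robust Adaptive Backstepping Impedance Control (RABIC) strategy for robots operating in contact-rich and uncertain environments. The proposed strategy considers the complete coupled dynamics of the system and explicitly accounts for key sources of uncertainty, such as external disturbances and unmodeled dynamics, without requiring the robot's dynamic parameters. We propose a backstepping-based adaptive impedance control scheme for the inner loop to track the reference impedance model. To handle uncertainties, we use a Taylor series-based estimator for the system dynamics and an adaptive estimator for the upper bound of the external forces. Stability analysis shows semi-global practical finite-time stability of the overall system. To demonstrate the method's effectiveness, we evaluate it in a simulated mobile manipulator scenario and in experiments on a real Franka Emika Panda robot. The proposed approach is safer than PD control while ensuring trajectory tracking and force monitoring. Overall, the RABIC framework provides a solid basis for future research on adaptive and learning-based impedance control for coupled mobile and fixed serially linked manipulators.

English abstract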

This paper presents a Robust Adaptive Backstepping Impedance Control (RABIC) strategy for robots operating in contact-rich and uncertain environments. The proposed control strategy considers the complete coupled dynamics of the system and explicitly accounts for key sources of uncertainty, including external disturbances and unmodeled dynamics, while not requiring the robot's dynamic parameters in implementation. We propose a backstepping-based adaptive impedance control scheme for the inner loop to track the reference impedance model. To handle uncertainties, we employ a Taylor series-based estimator for system dynamics and an adaptive estimator for determining the upper bound of external forces. Stability analysis demonstrates the semi-global practical finite-time stability of the overall system. To demonstrate the effectiveness of the proposed method, a simulated mobile manipulator scenario and experimental evaluations on a real Franka Emika Panda robot were conducted. The proposed approach exhibits safer performance compared to PD control while ensuring trajectory tracking and force monitoring. Overall, the RABIC framework provides a solid basis for future research on adaptive and learning-based impedance control for coupled mobile and fixed serially linked manipulators.
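
The "reference impedance model" that the inner loop tracks is commonly a second-order mass-spring-damper relation between external force and position error. The sketch below integrates that textbook form, m·ë + d·ė + k·e = f_ext, for a constant contact force. The model form and all parameter values are assumptions for illustration; this is not the RABIC controller or its estimators.

```python
def simulate_impedance(m, d, k, f_ext, dt=0.001, steps=20000):
    """Integrate m*e'' + d*e' + k*e = f_ext with semi-implicit Euler; return final e."""
    e, v = 0.0, 0.0
    for _ in range(steps):
        a = (f_ext - d * v - k * e) / m   # acceleration from the target impedance
        v += a * dt                        # update velocity first (semi-implicit)
        e += v * dt                        # then position error
    return e

# Critically damped target (d = 2*sqrt(k*m)) under a constant 5 N contact force.
e_final = simulate_impedance(m=1.0, d=20.0, k=100.0, f_ext=5.0)
```

At steady state the error settles at f_ext / k, so the stiffness `k` directly sets how far the end effector yields under a given contact force.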


2604.07993 2026-05-20 cs.RO (updated version)

HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

Shuanghao Bai, Meng Li, Xinyuan Lv, Jiawei Wang, Xinhua Wang, Fei Liao, Chengkai Hou, Langzhe Gu, Wanqi Zhou, Kun Wu, Ziluo Ding, Zhiyuan Xu, Lei Sun, Shanghang Zhang, Zhengping Che, Jian Tang, Badong Chen

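Affiliations: Beijing Innovation Center of Humanoid Robotics; Xi’an Jiaotong University; Nankai University; Peking University

AI summary: HEX introduces a humanoid-aligned universal state representation and a Mixture-of-Experts Unified Proprioceptive Predictor to achieve coordinated whole-body manipulation on full-sized bipedal humanoid robots, with state-of-the-art task success rate and generalization.

Comments Project page: https://hex-humanoid.github.io/

AI-translated abstract

Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat the parts of the robot body independently, which makes high-DoF humanoid control challenging and unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and combines it with a Mixture-of-Experts Unified Proprioceptive Predictor that models whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To capture temporal visual context efficiently, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further uses a residual-gated fusion mechanism and a flow-matching action head to adaptively integrate vision-language cues with proprioceptive dynamics for action generation. On real-world humanoid manipulation tasks, HEX achieves state-of-the-art task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.

English abstract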

Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.


2602.09259 2026-05-20 cs.RO cs.HC (updated version)

Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

Yizhou Li, Shuyuan Yang, Jiaji Su, Zonghe Chua

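Affiliations: Department of Electrical, Computer, and Systems Engineering, Case Western Reserve University

AI summary: This study examines the design of learning-based surgical gaze perception models in multi-task simulation. Using a paired active-passive gaze dataset, it evaluates how the source of gaze supervision affects what attention models learn and points to scalable, crowd-sourced gaze supervision.

Comments 8 pages, conference pre-print

AI-translated abstract

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, which motivates gaze-guided training and learning-based surgical perception models. However, gaze data from operating experts is costly to collect, and it is unclear how the source of gaze supervision, in terms of expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), affects what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution with a VR headset and eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization, and we evaluate how well passive gaze can substitute for operative supervision through fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, though with predictable degradation, and transfer between active and passive targets was asymmetric. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, which suggests a practical path to scalable, crowd-sourced gaze supervision for surgical coaching and perception modeling.

English abstract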

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.
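
One simple way to quantify overlap between two fixation density maps is histogram intersection: normalize each map to sum to 1 and add up the cell-wise minima. The abstract does not say which overlap measure the authors use, so the metric and the toy 3x3 maps below are assumptions for illustration.

```python
def normalize(density):
    """Scale a non-negative 2-D map so that its entries sum to 1."""
    total = sum(sum(row) for row in density)
    return [[v / total for v in row] for row in density]

def overlap(a, b):
    """Histogram intersection of two equal-shape maps; 1.0 means identical distributions."""
    pa, pb = normalize(a), normalize(b)
    return sum(min(x, y) for ra, rb in zip(pa, pb) for x, y in zip(ra, rb))

# Toy fixation counts on the same video frame (hypothetical values).
active = [[0, 2, 0],
          [0, 6, 0],
          [0, 0, 0]]
passive = [[0, 1, 1],
           [0, 2, 0],
           [0, 0, 0]]
score = overlap(active, passive)
```

Because both gaze streams are recorded on the same video, a score like this can be computed frame by frame without any spatial alignment step.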


2602.09023 2026-05-20 cs.RO (updated version)

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

Qinwen Xu, Jiaming Liu, Rui Zhou, Shaojun Shi, Nuowei Han, Zhuoyang Liu, Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia, Shanghang Zhang

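Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Simplexity Robotics; Tsinghua University; Hong Kong University of Science and Technology

AI summary: This paper proposes TwinRL, a framework that trains collaboratively in a digital twin and the real world to improve the exploration efficiency and convergence speed of Vision-Language-Action models in real-world manipulation, achieving high success rates and fast convergence.

AI-translated abstract

Despite their strong generalization, Vision-Language-Action (VLA) models remain limited by the high cost of expert demonstrations and by limited real-world interaction. Online reinforcement learning (RL) has shown promise, but applying it to real-world VLA manipulation is hindered by low exploration efficiency and restricted exploration coverage. Through systematic real-world experiments, we find that the effective exploration space of online RL is largely constrained by the trajectory distribution induced during supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative post-training framework that expands and guides RL exploration in three stages: SFT warm-up, twin RL warm-up, and real-world RL. TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes. In the SFT stage, we introduce an exploration space expansion strategy that extends the support of the trajectory distribution beyond the real demonstrations, reshaping the exploration space for more effective RL. Rather than treating the twin as a data augmentation tool, we propose a twin RL warm-up strategy that lets it guide exploration for real-world RL. Specifically, TwinRL runs efficient parallel RL in the digital twin to generate interaction trajectories that populate the replay buffer and stabilize subsequent real-world RL. This process also identifies failure-prone yet informative configurations, enabling targeted human-in-the-loop rollouts that further improve on-robot efficiency. Across four tasks, TwinRL achieves near-100% success in both in-distribution and out-of-distribution regions and converges more than 30% faster than prior real-world RL methods, with only 20 minutes of on-robot interaction.

English abstract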

Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and limited real-world interaction. While online reinforcement learning (RL) has shown promise, its application to real-world VLA manipulation is hindered by low exploration efficiency and restricted exploration coverage. Through systematic real-world experiments, we observe that the effective exploration space of online RL is largely constrained by the trajectory distribution induced during supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative post-training framework that expands and guides RL exploration for VLA models through three stages: SFT warm-up, twin RL warm-up, and real-world RL. TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes. During the SFT stage, we introduce an exploration space expansion strategy that expands the support of the trajectory distribution beyond real demonstrations, reshaping the exploration space for more effective RL. Rather than treating the twin as a data augmentation tool, we propose a twin RL warm-up strategy that enables it to act as an exploration guide for real-world RL. Specifically, TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer and stabilize subsequent real-world RL learning. This process also identifies failure-prone yet informative configurations, enabling targeted human-in-the-loop rollouts to further improve on-robot efficiency. Across four tasks, TwinRL achieves near-100% success in both in-distribution and out-of-distribution regions, delivering over 30% faster convergence than prior real-world RL methods with only 20 minutes of on-robot interaction.

2601.14234 2026-05-20 cs.LG cs.AI cs.RO stat.ML 版本更新

Q-learning with Adjoint Matching

具有伴随匹配的Q学习

Qiyang Li, Sergey Levine

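Affiliation: UC Berkeley

AI summary: This paper proposes QAM, a TD-based reinforcement learning algorithm for a long-standing challenge in continuous-action RL: efficiently optimizing an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires the critic's first-order information, but backpropagating directly through the policy's multi-step denoising process is numerically unstable. Existing methods either use only the value and discard gradient information, or rely on approximations that sacrifice policy expressivity or bias the learned policy. QAM uses adjoint matching, a recently proposed generative-modeling technique, to turn the critic's action gradient into a step-wise objective that avoids unstable backpropagation while yielding an unbiased, expressive policy at the optimum. Combined with temporal-difference backups for critic learning, QAM consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.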

Comments 32 pages, 8 figures, 7 tables

AI中文摘要

我们提出QAM，一种新颖的基于时序差分的强化学习（RL）算法，解决了连续动作RL中长期存在的挑战：高效优化表达性强的扩散或流匹配策略相对于参数化的Q函数。有效的优化需要利用批评者的首阶信息，但通过反向传播其多步去噪过程进行直接梯度优化在数值上不稳定。现有方法通过仅使用价值和丢弃梯度信息或依赖近似方法牺牲策略的表达性或偏置学习策略。QAM通过利用生成建模中最近提出的技术伴随匹配，将批评者的动作梯度转换为逐步目标函数，避免了不稳定反向传播，同时在最优时提供无偏且表达性强的策略。结合时序差分备份进行批评者学习，QAM在离线和离线到在线RL的硬稀疏奖励任务中一致优于先前方法。

英文摘要

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

2512.24470 2026-05-20 cs.RO cs.AI 版本更新

Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models

桥梁上的基础模型：基于视觉-语言模型的语义危险检测与安全操作用于海上自主性

Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert

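Affiliations: Dept. of Mechanical and Industrial Engineering, NTNU; Dept. of Aeronautics and Astronautics, Stanford University; Dept. of Computer Science, Stanford University; NVIDIA Research

AI summary: This paper presents a vision-language-model-based approach to semantic hazard detection and fallback maneuvers, aimed at the requirements that the draft IMO MASS Code places on autonomous vessels. It pairs a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver and is evaluated on 40 harbor scenes.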

Comments 17 pages without bibliography or appendix. The main paper has 16 figures. Paper webpage can be found at https://kimachristensen.github.io/bridge_policy/

DOI: 10.1016/j.oceaneng.2026.124646
Journal ref: Ocean Engineering 359, Part 3 (2026), Article 124646

AI中文摘要

草案IMO MASS代码要求自主和远程监督的海事船舶检测其操作设计领域偏离，进入预定义的回退模式以通知操作员，允许立即的人类接管，并避免在未经批准的情况下更改航行计划。在警报到接管的间隙中满足这些义务需要一个短时间范围、可人类接管的回退操作。传统的海事自主堆栈在正确行动依赖于意义（例如，潜水员旗表示水中的人员，附近有火表示危险）时会遇到困难。我们主张（i）视觉-语言模型（VLMs）为这些分布外情况提供语义意识，（ii）一个快速-慢速异常管道，带有短时间范围、可人类接管的回退操作，使在交接窗口内实现这一目标成为可能。我们引入了Semantic Lookout，一种仅使用摄像头、候选约束的VLM回退操作选择器，它在连续人类授权下，从水有效、世界锚定的轨迹中选择一个谨慎的操作（或站守）。在40个港口场景中，我们测量了每调用场景的理解和延迟，与人类共识（模型多数三票投票）的一致性，短时间范围在火险场景中的风险缓解，以及在水上的警报->回退操作->操作员交接。子10秒的模型保留了较慢的最新模型大部分的意识。回退操作选择器在火险场景中比仅基于几何的基线表现更好，并增加了 standoff 距离。一次现场运行验证了端到端的操作。这些结果支持VLMs作为符合草案IMO MASS代码的语义回退操作选择器，适用于实际延迟预算，并激励未来工作，研究适应领域、混合自主性，将基础模型语义与多传感器鸟瞰感知和短时间范围重新规划相结合。网站：kimachristensen.github.io/bridge_policy

英文摘要

The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained VLM fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning. Website: kimachristensen.github.io/bridge_policy
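
The "majority-of-three voting" mentioned in the abstract can be illustrated with a minimal sketch. The function name `majority_of_three` and the action labels are hypothetical, and falling back to station-keeping when all three calls disagree is an assumption made here for illustration; the abstract does not say how ties are handled.

```python
from collections import Counter

def majority_of_three(votes, fallback="station_keeping"):
    """Return the action picked by at least two of three model calls.

    `fallback` is returned when all three votes differ (an assumption
    of this sketch, not a rule taken from the paper).
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    action, count = Counter(votes).most_common(1)[0]
    return action if count >= 2 else fallback
```

Voting over repeated calls trades extra latency for stability, which matters when the decision has to fit inside a short handover window.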

2512.10891 2026-05-20 cs.RO cs.LG 版本更新

Iterative Compositional Data Generation for Robot Control

迭代组合数据生成用于机器人控制

Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez, Eric Eaton

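Affiliations: University of Pennsylvania; Stony Brook University

AI summary: This paper proposes a semantic compositional diffusion transformer that uses attention to learn the interactions among robot-, object-, obstacle-, and objective-specific components. After training on a limited set of tasks, it can generate high-quality transitions zero-shot, from which control policies for unseen task combinations are learned, and an iterative self-improvement procedure further raises zero-shot performance.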

SEMNAV: Enhancing Visual Semantic Navigation in Robotics through Semantic Segmentation

Rafael Flor-Rodríguez, Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Sergio Lafuente-Arroyo, Roberto J. López-Sastre

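Affiliations: University of Alcalá; CAM-UAH; Ministry of Science and Innovation of Spain

AI summary: This paper proposes SEMNAV, which uses semantic segmentation as the main visual input representation of the environment to strengthen the agent's perception and decision-making. Incorporating this high-level semantic information improves generalization to unseen environments, and the paper also introduces the SEMNAV dataset for training.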

DOI: 10.1007/s10489-026-07275-1
Journal ref: Applied Intelligence, 2026

AI中文摘要

视觉语义导航（VSN）是机器人学中的基本问题，其中智能体必须在未知环境中导航至目标对象，主要依靠视觉信息。大多数最先进的VSN模型是在模拟环境中训练的，其中使用的是现实世界的渲染场景，最理想的情况。这些方法通常依赖于虚拟场景的原始RGB数据，这限制了它们在真实世界环境中的泛化能力，由于域适应问题。为了解决这个问题，本文提出了SEMNAV，一种新的方法，利用语义分割作为环境的主要视觉输入表示，以增强代理的感知和决策能力。通过显式地引入这种高层语义信息，我们的模型学习到稳健的导航策略，提高了在未见过的环境中泛化的能力，无论是模拟还是真实世界。我们还引入了SEMNAV数据集，这是一个新编纂的数据集，用于训练如SEMNAV这样的语义分割感知导航模型。我们的方法在模拟环境和真实世界机器人平台上进行了广泛的评估。实验结果表明，SEMNAV优于现有的最先进VSN模型，在Habitat 2.0模拟环境使用HM3D数据集时实现了更高的成功率。此外，我们的实际实验突显了语义分割在缓解仿真到现实差距方面的有效性，使我们的模型成为实用VSN基于机器人应用的有希望的解决方案。代码和数据集可在https://github.com/gramuah/semnav访问。

英文摘要

Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating this type of high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, both in simulated and real world settings. We also introduce the SEMNAV dataset, a newly curated dataset designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. The code and datasets are accessible at https://github.com/gramuah/semnav

2505.09067 2026-05-20 math.OC cs.RO cs.SY eess.SY 版本更新

Solving Reach- and Stabilize-Avoid Problems Using Discounted Reachability

利用折扣可达性求解可达-稳定避问题

Boyang Li, Zheng Gong, Sylvia Herbert

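Affiliation: Mechanical and Aerospace Engineering, UC San Diego

AI summary: For infinite-horizon reach-avoid (RA) and stabilize-avoid (SA) zero-sum games in general nonlinear continuous-time systems, this paper designs a new Lipschitz continuous RA value function whose zero sublevel set exactly characterizes the RA set. It shows that the associated Bellman backup operator is contractive and that the RA value function is the unique viscosity solution of a Hamilton-Jacobi variational inequality. For the SA problem, it develops a two-step framework that combines the RA strategies with a recently proposed Robust Control Lyapunov-Value Function, ensuring both target reachability and long-term stability. The approach is verified numerically on a 3D Dubins car system.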

Comments 16 pages, 6 figures, 1 table. Accepted to IEEE Transactions on Automatic Control

DOI: 10.1109/TAC.2026.3693989
Journal ref: IEEE Transactions on Automatic Control (Early Access), 2026

AI中文摘要

在本文中，我们考虑一般非线性连续时间系统中的无限时间可达-避（RA）和稳定-避（SA）零和博弈问题，目标是找到能够被控制到达或稳定到目标集的的状态集，即使在最坏情况下也不违反约束。基于Hamilton-Jacobi可达性方法，我们通过设计新的Lipschitz连续RA价值函数来解决RA问题，该函数的零子水平集精确地刻画了RA集。我们证明了相关的Bellman备份算子是合同性的，并且RA价值函数是Hamilton-Jacobi变分不等式的唯一粘性解。最后，我们通过将我们的RA策略与最近提出的鲁棒控制Lyapunov-价值函数相结合，开发了一个两步框架来解决SA问题，从而确保目标可达性和长期稳定性。我们通过3D Dubins车系统数值验证了所提的RA和SA框架的有效性。

英文摘要

In this article, we consider the infinite-horizon reach-avoid (RA) and stabilize-avoid (SA) zero-sum game problems for general nonlinear continuous-time systems, where the goal is to find the set of states that can be controlled to reach or stabilize to a target set, without violating constraints even under the worst-case disturbance. Based on the Hamilton-Jacobi reachability method, we address the RA problem by designing a new Lipschitz continuous RA value function, whose zero sublevel set exactly characterizes the RA set. We establish that the associated Bellman backup operator is contractive and that the RA value function is the unique viscosity solution of a Hamilton-Jacobi variational inequality. Finally, we develop a two-step framework for the SA problem by integrating our RA strategies with a recently proposed Robust Control Lyapunov-Value Function, thereby ensuring both target reachability and long-term stability. We numerically verify our RA and SA frameworks on a 3D Dubins car system to demonstrate the efficacy of the proposed approach.
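
To make the contraction claim concrete, here is a minimal sketch of why discounting makes a reach-avoid backup contractive. It uses a generic discrete-time, finite-state discounted reach-avoid backup with no disturbance player, which is an illustrative assumption and not the continuous-time value function proposed in the paper. The names `ra_backup`, `successors`, `l` (target margin), and `g` (constraint margin) are hypothetical.

```python
import random

def ra_backup(V, l, g, successors, gamma):
    """One discounted reach-avoid backup on a finite, deterministic model.

    l[x] <= 0 means x is in the target; g[x] > 0 means x violates a
    constraint; successors[x] lists the states reachable from x in one step.
    """
    out = []
    for x in range(len(V)):
        best_next = min(V[y] for y in successors[x])  # controller minimizes
        inner = max(g[x], min(l[x], best_next))
        out.append((1.0 - gamma) * max(l[x], g[x]) + gamma * inner)
    return out

def sup_dist(a, b):
    """Sup-norm distance between two value tables."""
    return max(abs(p - q) for p, q in zip(a, b))

random.seed(0)
n, gamma = 20, 0.9
l = [random.gauss(0, 1) for _ in range(n)]
g = [random.gauss(0, 1) for _ in range(n)]
successors = [[random.randrange(n) for _ in range(3)] for _ in range(n)]
V1 = [random.gauss(0, 1) for _ in range(n)]
V2 = [random.gauss(0, 1) for _ in range(n)]

gap_before = sup_dist(V1, V2)
gap_after = sup_dist(ra_backup(V1, l, g, successors, gamma),
                     ra_backup(V2, l, g, successors, gamma))
```

Because `min` and `max` are non-expansive and the value enters only through the term scaled by `gamma`, `gap_after` is at most `gamma * gap_before`, so repeated backups converge to a unique fixed point in this toy setting.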

2504.09188 2026-05-20 cs.RO cs.SY eess.SY 版本更新

Compliant Explicit Reference Governor for Contact Friendly Robotic Manipulators

顺应性显式参考 governor 用于接触友好的机器人机械臂

Yaashia Gautam, Gilberto Briscoe-Martinez, Adhitya Mohan, Nataliya Nechyporenko, Alessandro Roncone, Marco M. Nicotra

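Affiliation: College of Engineering and Applied Sciences, University of Colorado Boulder, Boulder, CO 80301 USA

AI summary: This paper proposes the Compliant Explicit Reference Governor (CERG), a modular reference governor that lets robots interact physically with their environment with guarantees. Sitting between a high-level planner and a low-level controller, CERG enforces operational constraints and enables smooth transitions between free motion and contact. It ensures safety by limiting the total energy available to the manipulator at the time of contact, and it does not penalize system performance when there is no contact.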

Comments Updated paper with current contributions and author list, accepted at IFAC World Congress, Busan, 2026

2501.09203 2026-05-20 cs.CV cs.RO 版本更新

3D Modeling and Automated Measurement of Concrete Cracks via Segment Anything Refinement and Visual Inertial LiDAR Fusion

通过段落任何精修和视觉惯性LiDAR融合进行混凝土裂缝的3D建模与自动测量

Pengru Deng, Jiapeng Yao, Chun Li, Su Wang, Xinrun Li, Varun Ojha, Xuhui He

Affiliations: School of Civil Engineering; Central South University; Hunan Provincial Key Laboratory for Disaster Prevention and Mitigation of Rail Transit Engineering Structures; Nvidia; School of Computing; Newcastle University

AI summary: This paper proposes a framework that combines computer vision with multi-modal simultaneous localization and mapping (SLAM) for 2D crack detection, 3D reconstruction, and automatic 3D crack measurement. It targets the limited adaptability and robustness of existing methods, particularly on curved or complex geometries.

Comments Title and author list updated

DOI: 10.1016/j.cacaie.2026.100019
Journal ref: Computer-Aided Civil and Infrastructure Engineering, Volume 45, 2026, 100019, ISSN 1093-9687

AI中文摘要

视觉-空间系统在混凝土裂缝检测中变得越来越关键。然而，现有方法往往缺乏对多样化场景的适应性，在基于图像的方法中表现出有限的鲁棒性，并且在处理曲线或复杂几何形状时存在困难。为了解决这些限制，本文提出了一种创新的框架，通过整合计算机视觉技术和多模态同时定位与建图（SLAM），用于二维（2D）裂缝检测、三维（3D）重建和三维自动裂缝测量。首先，基于基础的DeepLabv3+分割模型，并结合特定的改进利用基础模型Segment Anything Model（SAM），我们开发了一种具有强泛化能力的裂缝分割方法，能够在不熟悉的场景中生成精确的2D裂缝掩码。为了提高三维重建的准确性和鲁棒性，利用Light Detection and Ranging（LiDAR）点云与图像数据和分割掩码。通过利用图像和LiDAR-SLAM，我们开发了多帧和多模态融合框架，产生密集、着色的点云，有效捕捉裂缝语义在三维现实尺度上。此外，裂缝几何属性在三维密集点云空间中自动且直接地进行测量，超越了传统二维图像测量方法的限制。这一进步使该方法适用于具有曲线和复杂三维几何结构的结构部件。在各种混凝土结构上的实验结果突显了所提出方法的显著改进和独特优势，展示了其在现实应用中的有效性、准确性和鲁棒性。

英文摘要

Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement that integrates computer vision technologies and multi-modal simultaneous localization and mapping (SLAM). First, building on a base DeepLabv3+ segmentation model and incorporating specific refinements that use the Segment Anything Model (SAM) foundation model, we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were used together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the geometric attributes of the cracks were measured automatically and directly within the dense 3D point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.

2605.19015 2026-05-20 eess.SY cs.RO cs.SY 版本更新

Probabilistic Recursively Feasible Motion Planning Under Uncertain Environments

概率递归可行性运动规划在不确定环境中

Hyeontae Sung, Hyeongchan Ham, Junyoung Park, Kai Ren, Heejin Ahn

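Affiliation: School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)

AI summary: This paper proposes a probabilistic recursively feasible model predictive control framework that addresses safe motion planning in uncertain environments by guaranteeing recursive feasibility with a specified probability. Its main contributions are closed-form expressions for the means and covariances of predicted trajectories, and safety constraints built on them to ensure recursive feasibility.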

Comments 7 pages, 4 figures

AI中文摘要

在不确定、时间变化的环境中进行安全运动规划具有挑战性，因为安全区域在规划步骤中可能不可预测地变化，通常导致递归可行性丧失。在本工作中，我们提出了一种概率递归可行模型预测控制（PRF-MPC）框架，该框架能够以指定概率保证递归可行性。我们引入了理想预测器应满足的性质以确保分布一致性，并利用这些性质推导出未来时间步骤预测轨迹的均值和协方差的闭式表达式。基于此分析，我们构建了安全约束，以确保当前安全集包含在未来的安全集中，从而以概率方式保证递归可行性。在车道变更场景的仿真结果表明，所提出的方法显著提高了递归可行性。

英文摘要

Safe motion planning in uncertain, time-varying environments is challenging because the safe region can change unpredictably across planning steps, often causing a loss of recursive feasibility. In this work, we present a Probabilistic Recursively Feasible Model Predictive Control (PRF-MPC) framework that guarantees recursive feasibility with a specified probability. We introduce properties that an ideal predictor should satisfy to ensure distributional consistency, and use these properties to derive closed-form expressions for the means and covariances of trajectories predicted at future time steps. Building on this analysis, we construct safety constraints that ensure, with high probability, that the current safe set is contained within the safe sets at future time steps, thereby probabilistically guaranteeing recursive feasibility. Simulation results on a lane-change scenario demonstrate that the proposed method significantly improves recursive feasibility.
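
As a rough illustration of what "closed-form expressions for the means and covariances" can look like, the sketch below uses a scalar linear-Gaussian predictor x_{k+1} = a x_k + w with w ~ N(0, q). This model, and the names `propagate` and `closed_form`, are assumptions chosen for simplicity; the abstract does not specify the paper's predictor.

```python
def propagate(mean0, var0, a, q, steps):
    """Step-by-step mean and variance of x_{k+1} = a * x_k + w, w ~ N(0, q)."""
    mean, var = mean0, var0
    for _ in range(steps):
        mean = a * mean
        var = a * a * var + q
    return mean, var

def closed_form(mean0, var0, a, q, steps):
    """Same quantities without iterating (requires a * a != 1)."""
    a2k = (a * a) ** steps
    mean = (a ** steps) * mean0
    # Geometric series: q * (1 + a^2 + ... + a^(2(steps-1)))
    var = a2k * var0 + q * (1.0 - a2k) / (1.0 - a * a)
    return mean, var
```

Having the k-step statistics in closed form means a planner can evaluate how a predicted region will look at any future step without rolling the predictor forward, which is the kind of property a containment-style safety constraint can build on.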

2605.19009 2026-05-20 cs.RO cs.SY eess.SY 版本更新

Adversarial Stress Testing of SPARK Humanoid Safety Filters

对SPARK人形机器人类安全过滤器的对抗性压力测试

Saurav Ghosh, Abdou Sow, Luke Zhang

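Affiliation: Department of Computer Science and Engineering, Washington University in St. Louis, Missouri, United States

AI summary: Through replication and stress testing, this paper studies the robustness of SPARK humanoid safety filters. It evaluates several methods under different conditions, shows that safety behavior can change under obstacle crowding, noisy distance estimates, and delayed obstacle information, and argues for evaluation metrics that expose failure modes before deployment.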

Comments 5 pages, 7 figures, 1 table. Code available at https://github.com/ghoshsaurav/spark-adversarial-safety

AI中文摘要

人形机器人由于具有高维身体、众多碰撞约束以及必须在人和障碍物附近操作，难以安全部署。安全过滤器通过在可能违反避障约束时修改名义控制动作来帮助。然而，名义基准分数并不能完全显示这些过滤器在更困难环境中的行为。在本工作中，我们通过复制和压力测试研究了SPARK人形安全过滤器的鲁棒性。我们复制了SPARK基准案例G1SportMode_D1_WG_SO_v1到MuJoCo，并在受控随机种子下评估RSSA、RSSS、SSA、CBF、PFM和SMA。我们还构建了一个后处理流程，将原始SPARK日志转换为目标跟踪、最小距离和碰撞步骤指标。我们的结果表明，某些方法更接近目标跟踪，而其他方法更有效减少碰撞步骤。压力测试进一步表明，在障碍物密集、距离估计噪声和延迟障碍信息下，安全行为可能发生改变。这些发现表明，人形自主性应在名义性能之外进行评估，使用能暴露故障模式的指标。

英文摘要

Humanoid robots are difficult to deploy safely because they have high-dimensional bodies, many collision constraints, and must operate near people and obstacles. Safety filters help by modifying a nominal control action when it may violate collision-avoidance constraints. Still, nominal benchmark scores do not fully show how these filters behave in harder environments. In this work, we study the robustness of SPARK humanoid safety filters through replication and stress testing. We replicate the SPARK benchmark case G1SportMode_D1_WG_SO_v1 in MuJoCo and evaluate RSSA, RSSS, SSA, CBF, PFM, and SMA under controlled random seeds. We also built a post-processing pipeline that converts raw SPARK logs into goal-tracking, minimum-distance, and collision-step metrics. Our results show that some methods track the goal more closely, while others reduce collision steps more effectively. The stress tests further indicate that safety behavior can change under obstacle crowding, noisy distance estimates, and delayed obstacle information. These findings suggest that humanoid autonomy should be evaluated beyond nominal performance, using metrics that expose failure modes before deployment.
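
The post-processing step described in the abstract (raw logs reduced to goal-tracking, minimum-distance, and collision-step metrics) might look roughly like the sketch below. The record schema (`goal_error`, `min_distance`), the function name `summarize_rollout`, and the collision threshold are all hypothetical; they are not the SPARK log format.

```python
def summarize_rollout(steps, collision_margin=0.0):
    """Reduce per-step records to three scalar metrics.

    Each record is a dict with 'goal_error' (distance to the goal, m)
    and 'min_distance' (closest robot-obstacle distance, m). A step
    counts as a collision step when min_distance <= collision_margin.
    """
    if not steps:
        raise ValueError("empty rollout")
    goal_errors = [s["goal_error"] for s in steps]
    distances = [s["min_distance"] for s in steps]
    return {
        "mean_goal_error": sum(goal_errors) / len(goal_errors),
        "min_distance": min(distances),
        "collision_steps": sum(d <= collision_margin for d in distances),
    }
```

Reporting the tracking and safety metrics side by side makes the trade-off noted in the abstract visible: a filter can score well on goal tracking while accumulating more collision steps, or the reverse.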

2605.19004 2026-05-20 cs.CV cs.LG cs.RO 版本更新

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

EgoTraj: 用于多模态预测的现实世界人轨迹数据集

Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon, Kun Qian, Junfeng Jiao, Christian Claudel

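Affiliations: Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin; Meta Reality Labs; School of Architecture, The University of Texas at Austin

AI summary: This paper introduces EgoTraj, a dataset for multimodal prediction containing 75 human navigation sequences recorded in real-world urban environments. Each recording provides synchronized RGB video and ground-truth data, including 6-degree-of-freedom head poses, 3D eye gaze vectors, and scene annotations, and the paper demonstrates the dataset's value for AR-based perception, navigation, and assistive systems.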

Comments 21 pages, 14 figures. Project page: https://github.com/yehiahmad/EgoTraj

AI中文摘要

准确地从第一人称视角预测人类轨迹在人形机器人、可穿戴传感系统和辅助导航等应用中起着核心作用。然而，由于现实世界环境中缺乏第一人称轨迹数据集，这一方向的进展受到限制。为了解决这一需求，我们介绍了EgoTraj，一个使用Meta Quest Pro (MQPro)录制的egocentric多模态开放数据集。EgoTraj包含75个由多个MQPro穿戴设备在真实城市环境中收集的人导航轨迹。每个记录都提供了同步的RGB视频以及地面真实数据，包括连续时间同步的6自由度头部姿态、每帧3D眼 gaze向量和场景注释。据我们所知，EgoTraj不同于典型的egocentric轨迹数据集，因为它捕捉了在多样化的城市路线中进行的长视距、自主导航，具有广泛的参与者多样性。为了展示该数据集的潜力，我们对几种最先进的egocentric轨迹预测方法进行了基准测试，并进行了消融研究以分析注视、场景和运动提示的贡献。结果突显了EgoTraj在AR感知、导航和辅助系统中的实用性。EgoTraj数据集、代码和EgoViz仪表板已公开在https://github.com/yehiahmad/EgoTraj。

英文摘要

Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.

2605.18921 2026-05-20 cs.RO 版本更新

Geo-Data-Driven HD Map Generation Workflow with Integrated Reference-Free Constraint-Based Verification

基于地理数据的高精地图生成工作流与集成的无参考约束验证

Ruidi He, Vaibhav Tiwari, Mohanad Al-Ghobari, Meng Zhang, Andreas Rausch

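Affiliation: Institute for Software and Systems Engineering

AI summary: This paper proposes a geo-data-driven workflow for HD map generation with integrated reference-free, constraint-based verification. It reduces the dependence on high-precision reference data and makes the approach applicable where specialised measurement data or independently measured reference maps are unavailable.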

AI中文摘要

高精地图是自动驾驶系统的核心构件，但其生成通常依赖于传感器密集的移动测绘任务，而质量评估往往依赖于高精度参考数据。这些依赖性使得高精地图工程成本高且难以在缺乏专门测量数据或独立测量参考地图的环境中应用。本文提出了一种面向工程的基于地理数据的工作流，用于高精地图生成，并集成了表示层面的验证。该工作流使用公开可用的地理工程数据集作为主要输入源，并通过显式的中间表示和处理阶段，将它们转换为现有道路环境的车道级高精地图表示。为了在没有外部参考地图的情况下评估生成的表示，该工作流在工程过程中集成了可执行的基于约束的验证。所选约束来自与自动驾驶和道路设计指南相关的规范。它们直接在生成的车道let表示上进行评估，以检测几何、拓扑和高程相关的一致性问题。该工作流使用来自德国下萨克森州四个城市的基于真实世界shapefile的道路网络数据，并结合受控缺陷注入场景进行评估。真实世界评估显示，生成的地图表示在评估场景中满足所选约束，而缺陷注入研究证明了对所考虑缺陷类型的完全检测，没有观察到假阳性。结果表明，集成可执行验证的基于地理数据的高精地图生成可以在减少传感和参考数据可用性的情况下，为传感器密集的测绘工作流提供模块化和可检查的补充。

英文摘要

High-definition (HD) maps are core artifacts for automated driving systems, but their generation commonly relies on sensor-intensive mobile mapping campaigns, while quality assessment often depends on high-precision reference data. These dependencies make HD map engineering costly and difficult to apply in settings where specialised measurement data or independently measured reference maps are unavailable. This paper presents an engineering-oriented geo-data-driven workflow for HD map generation with integrated representation-level verification. The workflow uses openly available geo-engineering datasets as the primary input source and transforms them into lane-level HD map representations of existing road environments through explicit intermediate representations and processing stages. To assess the generated representations without external reference maps, the workflow integrates executable constraint-based verification into the engineering process. Selected constraints are derived from specifications relevant to automated driving and road-design guidelines. They are evaluated directly on the generated lanelet-based representation to detect geometric, topological, and elevation-related inconsistencies. The workflow is evaluated using real-world shapefile-based road-network data from four cities in Lower Saxony, Germany, and controlled defect-injection scenarios. The real-world evaluation shows that the generated map representations satisfy the selected constraints in the evaluated scenarios, while the defect-injection study demonstrates complete detection of the considered defect types without observed false positives. The results indicate that geo-data-driven HD map generation with integrated executable verification can provide a modular and inspectable complement to sensor-intensive mapping workflows under reduced sensing and reference-data availability.
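
A minimal sketch of what an executable, reference-free constraint check on a lanelet could look like is given below. The `Lanelet` class, the function `check_lanelet`, and the threshold values are hypothetical placeholders; the paper derives its constraints from automated-driving specifications and road-design guidelines that the abstract does not enumerate.

```python
from dataclasses import dataclass
from math import dist

@dataclass
class Lanelet:
    left: list   # left boundary vertices as (x, y, z) tuples, in metres
    right: list  # right boundary vertices, paired index-by-index with left

def check_lanelet(lanelet, min_width=2.5, max_width=4.5, max_grade=0.12):
    """Return a list of human-readable constraint violations (empty if none)."""
    issues = []
    # Geometric constraint: lane width at each paired vertex.
    for i, (p, q) in enumerate(zip(lanelet.left, lanelet.right)):
        width = dist(p[:2], q[:2])
        if not min_width <= width <= max_width:
            issues.append(f"vertex {i}: width {width:.2f} m out of range")
    # Elevation constraint: grade along the left boundary.
    for i in range(1, len(lanelet.left)):
        a, b = lanelet.left[i - 1], lanelet.left[i]
        run = dist(a[:2], b[:2])
        if run > 0 and abs(b[2] - a[2]) / run > max_grade:
            issues.append(f"segment {i - 1}-{i}: grade exceeds {max_grade}")
    return issues
```

Checks of this kind need only the generated map itself, which is what allows verification without an external reference map; a defect-injection study then amounts to perturbing a valid lanelet and confirming the check reports it.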

2605.18895 2026-05-20 cs.RO cs.AI 版本更新

KG-ASG: Collision-Knowledge-Guided Closed-Loop Adversarial Scenario Generation With Primary-Support Attribution

KG-ASG: 基于碰撞知识的闭环对抗场景生成与主支持属性

Cheng Wang, Chen Xiong, Ziwen Wang, Yuchen Zhou, Qiang Liu

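Affiliation: Guangdong Provincial Key Laboratory of Intelligent Transportation System, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University

AI summary: This paper proposes KG-ASG, a framework that uses collision-knowledge guidance and primary-support attribution to improve the adversarial effectiveness, interpretability, and executability of safety validation for autonomous driving systems.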

AI中文摘要

自动驾驶系统安全验证需要高风险场景覆盖、清晰的碰撞语义、可执行轨迹和可追溯的多车辆交互。现有安全关键场景生成方法通常依赖低级轨迹扰动、碰撞代理优化或单对抗者搜索，可能产生具有模糊碰撞原因或不可控多车辆碰撞的对抗样本。本文提出KG-ASG，一种基于碰撞知识的闭环对抗场景生成框架，具有主支持属性。KG-ASG构建了结构化的碰撞知识库，并训练了一个轻量级的碰撞专家来推断目标碰撞模式、唯一的主对抗者、支持车辆及其交互角色。在该语义先验的引导下，多车辆对抗生成被公式化为主支持过程，其中主对抗者引发主要冲突，支持车辆塑造周围风险结构，而不会成为额外碰撞者。规则、物理、交互安全性和单碰撞器约束被作为硬门来过滤不可执行的样本。为处理反应性驾驶者行为，进一步使用规划器-控制器反馈进行故障诊断、候选重新排序和终端细化。在MetaDrive中重建的WOMD场景上的实验表明，KG-ASG在IDM、Cruise和Expert控制器下实现了强对抗有效性，同时提高了有效主攻击、减少了多碰撞，并获得了闭环恢复收益。这些结果表明，碰撞知识引导和主支持单碰撞器推理提高了自动驾驶安全验证的对抗有效性、可解释性和可执行性。

英文摘要

Safety validation of autonomous driving systems requires high-risk scenario coverage, clear collision semantics, executable trajectories, and attributable multi-vehicle interactions. Existing safety-critical scenario generation methods often rely on low-level trajectory perturbations, collision-proxy optimization, or single-adversary search, which may produce adversarial samples with ambiguous collision causes or uncontrolled multi-vehicle collisions. This paper proposes KG-ASG, a collision-knowledge-guided closed-loop adversarial scenario generation framework with primary-support attribution. KG-ASG constructs a structured collision knowledge base and trains a lightweight Collision Expert to infer the target collision mode, the unique primary adversary, support vehicles, and their interaction roles. Guided by this semantic prior, multi-vehicle adversarial generation is formulated as a primary-support process, where the primary adversary induces the main conflict and support vehicles shape the surrounding risk structure without becoming additional colliders. Rule, physical, interaction-safety, and single-collider constraints are imposed as hard gates to filter non-executable samples. To handle reactive ego behaviors, planner-controller feedback is further used for failure diagnosis, candidate re-ranking, and terminal refinement. Experiments on WOMD scenarios reconstructed in MetaDrive show that KG-ASG achieves strong adversarial effectiveness while improving Valid Primary Attack, reducing multi-collision, and obtaining closed-loop recovery gains under IDM, Cruise, and Expert controllers. These results demonstrate that collision-knowledge guidance and primary-support single-collider reasoning improve adversarial effectiveness, interpretability, and executability for autonomous driving safety validation.

2605.18872 2026-05-20 cs.LG cs.AI cs.RO 版本更新

EUPHORIA: Efficient Universal Planning via Hybrid Optimization for Robust Industrial Robotic Assembly

EUPHORIA: 通过混合优化实现高效通用规划以实现稳健的工业机器人装配

Shih-Yu Lai, Chia-Ching Yen, Yang-Ting Shen, Peter Yichen Chen, Yu-Lun Liu, Bing-Yu Chen

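Affiliations: National Taiwan University; MoonShine Animation Studio; National Cheng Kung University; The University of British Columbia; National Yang Ming Chiao Tung University

AI summary: This paper proposes EUPHORIA, a framework that uses a hybrid optimization strategy to achieve universal few-shot adaptability and dynamic efficiency, addressing planners for architectural robotic assembly that are either highly specialized or operationally inefficient. It combines a Meta-Geometric Encoder, a Physics-Informed Graph Transformer, and Residual Stability Correction to obtain efficient and robust assembly planning.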

AI中文摘要

建筑机器人装配面临持续瓶颈：现有规划器要么高度专业化，需要每次新几何设计都进行昂贵的再训练，要么操作低效，将结构序列和运动学运动视为独立过程。我们提出了EUPHORIA，一个统一框架，通过混合优化策略实现通用少样本适应和动态效率。为克服再训练瓶颈，我们提出了基于图超网络的元几何编码器：不同于标准对比学习仅在特征级识别，我们的超网络动态从最小支持集中生成策略参数，使参数级适应复杂拓扑（如穹顶、拱门）而无需基于梯度的再训练。对于结构推理，我们引入了通过软演员-评论家（SAC）训练的物理引导图变压器，其物理偏置注意力机制通过离散元模型（DEM）模拟的接触力调节注意力分数，引导规划器朝向结构关键连接。我们进一步通过运动学感知序列确保操作效率，其中SAC目标惩罚高能转换。最后，我们通过残差稳定性校正弥合仿真到现实的差距，这是一种可微优化层，通过最小化联合能量-稳定性成本优先级来微调粗略装配动作。实验表明，EUPHORIA显著减少了与解耦基线相比的能量消耗，并在未见的非标准几何上实现了最先进的成功率，通过融合元学习、物理引导注意力和残差优化，实现一个连贯的通用规划器。

英文摘要

Robotic assembly in architectural construction faces a persistent bottleneck: existing planners are either highly specialized, requiring prohibitive retraining for every new geometric design, or operationally inefficient, treating structural sequencing and kinematic motion as disjoint processes. We present EUPHORIA, a unified framework that achieves universal few-shot adaptability and dynamic efficiency through a hybrid optimization strategy. To overcome the retraining bottleneck, we propose a Meta-Geometric Encoder based on Graph Hypernetworks: unlike standard contrastive learning, which performs only feature-level recognition, our hypernetwork dynamically generates policy parameters from a minimal support set, enabling parameter-level adaptation to complex topologies (e.g., domes, arches) without gradient-based retraining. For structural reasoning, we introduce a Physics-Informed Graph Transformer trained via Soft Actor-Critic (SAC), with a Physics-Bias Attention mechanism that modulates attention scores using contact forces from Discrete Element Model (DEM) simulations, guiding the planner toward structurally critical connections. We further ensure operational efficiency through Kinematics-Aware Sequencing, where the SAC objective penalizes high-energy transitions. Finally, we bridge the Sim2Real gap via Residual Stability Correction, a differentiable optimization layer that fine-tunes coarse assembly actions by minimizing a joint energy-stability cost prior to execution. Experiments show that EUPHORIA significantly reduces energy consumption over decoupled baselines and achieves state-of-the-art success rates on unseen, non-standard geometries with minimal few-shot examples, fusing meta-learning, physics-informed attention, and residual optimization into a cohesive, generalized planner.
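
One simple way to picture an attention mechanism that is modulated by contact forces is an additive bias on the attention logits before the softmax. The sketch below assumes exactly that; the additive form, the scaling factor `beta`, and the function name `physics_biased_attention` are assumptions made for illustration, since the abstract does not state how the modulation is implemented.

```python
from math import exp

def physics_biased_attention(scores, contact_forces, beta=1.0):
    """Softmax over attention logits plus an additive contact-force bias.

    `scores` are raw attention logits for one query; `contact_forces`
    are per-connection force magnitudes, assumed already normalized.
    """
    logits = [s + beta * f for s, f in zip(scores, contact_forces)]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

With equal raw scores, connections carrying larger contact forces receive larger weights, which matches the stated intent of guiding the planner toward structurally critical connections; setting `beta` to zero recovers ordinary softmax attention.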

2412.02818 2026-05-20 cs.RO cs.LG 版本更新

RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields

RoboMD: 通过语义势场揭示机器人漏洞

Som Sagar, Jiafei Duan, Sreevishakh Vasudevan, Yifan Zhou, Heni Ben Amor, Dieter Fox, Ransalu Senanayake

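Affiliations: Arizona State University; University of Washington

AI summary: This study proposes the RoboMD framework, which learns a deep reinforcement learning policy over a continuous vision-language embedding to uncover vulnerabilities of robot manipulation policies caused by external variations in the real world. Virtual rollouts make the vulnerability analysis scalable and safe. Experiments show that it uncovers up to 23% more unique vulnerabilities than state-of-the-art vision-language baselines, and that fine-tuning on the discovered vulnerabilities improves manipulation performance.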

Comments 26 Pages, 20 figures

AI中文摘要

机器人操作策略虽然对物理AI的前景至关重要，但在现实世界中存在外部变化时却极易产生漏洞。诊断这些漏洞面临两大挑战：（i）需要测试的 relevant 变化通常未知，（ii）直接在现实世界中测试成本高且不安全。我们介绍了一个框架，通过在连续视觉-语言嵌入上进行虚拟运行，学习一个单独的深度强化学习（深度RL）策略来预测漏洞。通过将富含语义和视觉变化的嵌入空间视为势场，该策略学会向易损区域移动并被成功区域排斥。该漏洞预测策略在虚拟运行中训练，使漏洞分析能够扩展和安全地进行，而无需昂贵的物理试验。通过查询该策略，我们的框架构建了一个概率性漏洞可能性地图。在模拟基准和物理机器人手臂上的实验表明，我们的框架揭示的漏洞比最先进的视觉-语言基线多出23%，揭示了被启发式测试忽略的细微漏洞。此外，我们展示了通过我们的框架发现的漏洞微调操作策略，可以使用更少的微调数据提升操作性能。

英文摘要

Robot manipulation policies, while central to the promise of physical AI, are highly vulnerable in the presence of external variations in the real world. Diagnosing these vulnerabilities is hindered by two key challenges: (i) the relevant variations to test against are often unknown, and (ii) direct testing in the real world is costly and unsafe. We introduce a framework that tackles both issues by learning a separate deep reinforcement learning (deep RL) policy for vulnerability prediction through virtual runs on a continuous vision-language embedding trained with limited success-failure data. By treating this embedding space, which is rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions while being repelled from success regions. This vulnerability prediction policy, trained on virtual rollouts, enables scalable and safe vulnerability analysis without expensive physical trials. By querying this policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments across simulation benchmarks and a physical robot arm show that our framework uncovers up to 23% more unique vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities overlooked by heuristic testing. Additionally, we show that fine-tuning the manipulation policy with the vulnerabilities discovered by our framework improves manipulation performance with much less fine-tuning data.
