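arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.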

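2605.18729 2026-05-19 cs.RO cs.CV Updated version

Data-Driven Dynamic Modeling of Tendon-Driven Continuum Robots

Harald Minde Hansen, Bjørn Kåre Sæbø, Kristin Y. Pettersen, Jan Tommy Gravdahl, Mario Di Castro

Affiliations: Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU)

AI summary: This paper studies data-driven system identification for modeling tendon-driven continuum robots with rolling joints. It finds that a dynamic model with only two degrees of freedom captures the system dynamics accurately, which demonstrates its feasibility for real-time control.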

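2605.18617 2026-05-19 cs.RO cs.AI cs.CV Updated version

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

Ziyu Wei, Luting Wang, Chen Gao, Li Wen, Si Liu

Affiliations: Beihang University; National University of Singapore; Hangzhou Innovation Institute, Beihang University

AI summary: This paper proposes the ManiSoft benchmark for studying vision-language manipulation with soft continuum robots. A tailored simulator combines realistic soft-body dynamics with contact-rich interactions, four tasks highlight different aspects of deformable control, an automated pipeline generates 6,300 diverse scenes with expert trajectories, and three representative policy models are evaluated.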

Comments Accepted in ICML 2026

Abstract

Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.

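2605.18611 2026-05-19 cs.RO Updated version

A Geometry-Aware Surrogate for Real-Time Hydrodynamic Estimation of Autonomous Ground Vehicles in Underwater Environments

Ammar Waheed, Luke Gallantree, Zohaib Hasnain

Affiliations: J. Mike Walker ’66 Department of Mechanical Engineering, Texas A&M University; Defence Science and Technology Laboratory

AI summary: This paper proposes a geometry-aware neural network surrogate for real-time estimation of the hydrodynamic forces on autonomous ground vehicles operating in water. Trained on high-fidelity CFD data, it resolves how loading varies with vehicle geometry, depth, and flow direction, and it is evaluated in real-world wading trials.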

Abstract

Autonomous ground vehicles operating in shallow water or flood-prone terrains require dynamic models that account for hydrodynamic forces. However, the simulation and planning tools currently available either lack physical fidelity or are too computationally expensive to run in real time. This work presents a per-surface neural network surrogate that bridges this gap by predicting geometry-resolved hydrodynamic forces at real-time rates, trained entirely on high-fidelity CFD data from two geometrically distinct vehicles. A vehicle-specific Signed Distance Field (SDF) provides per-surface submergence inputs, allowing the model to resolve how loading varies with vehicle geometry, depth, and flow direction. On held-out CFD data, the surrogate achieves a longitudinal-force symmetric MAPE (sMAPE) of 13% and a vertical-force sMAPE of 3-12%, with inference running under 0.9 ms per sample. To evaluate the model under real-world conditions, water wading trials of a full-scale vehicle at different submersion depths are used. Motion-capture-derived kinematics serve as the surrogate inputs, and the resulting predictions are tested for whether they reproduce known physical relationships between force, speed, and depth. The predicted drag follows quadratic speed scaling (R² ≥ 0.97) and the buoyancy intercepts scale linearly with depth (R² = 0.973). Neither relationship is encoded in the model training loss; both emerge from the per-surface architecture summing individually predicted surface forces. The resulting framework provides a pathway for embedding physically grounded hydrodynamics into the simulation and planning loops that autonomous ground vehicles depend on in amphibious environments.
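
The two kinds of numbers quoted in this abstract (sMAPE and the R² of a quadratic speed fit) can be made concrete with a short sketch. The abstract does not say which sMAPE variant is used, so the definition below (mean of 2|p - t| / (|p| + |t|), in percent) is an assumption, and `smape` and `fit_quadratic_drag` are hypothetical helper names, not functions from the paper. The data in the usage lines are synthetic.

```python
from typing import Sequence, Tuple


def smape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Symmetric MAPE in percent (assumed variant): mean of 2|p - t| / (|p| + |t|)."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = abs(t) + abs(p)
        # Assumed convention: a pair that is zero on both sides contributes 0.
        terms.append(0.0 if denom == 0 else 2.0 * abs(p - t) / denom)
    return 100.0 * sum(terms) / len(terms)


def fit_quadratic_drag(speeds: Sequence[float],
                       forces: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of F = c * v**2 (no intercept); returns (c, R^2)."""
    x = [v * v for v in speeds]
    c = sum(xi * fi for xi, fi in zip(x, forces)) / sum(xi * xi for xi in x)
    mean_f = sum(forces) / len(forces)
    ss_res = sum((fi - c * xi) ** 2 for xi, fi in zip(x, forces))
    ss_tot = sum((fi - mean_f) ** 2 for fi in forces)
    return c, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    speeds = [0.5, 1.0, 1.5, 2.0]
    forces = [30.0 * v * v for v in speeds]  # synthetic, noise-free drag values
    print(smape([100.0, 200.0], [110.0, 180.0]))
    print(fit_quadratic_drag(speeds, forces))
```

An R² close to 1 for the F = c·v² fit is what "follows quadratic speed scaling" means in practice; with measured data the fit would of course be noisier than in this synthetic case.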

2605.18482 2026-05-19 cs.RO Updated version

Bidirectional Optical sensors for Actuation Tracking (BOAT) in soft lattice systems

Petr Trunin, Carolina Gay, Anderson Brazil Nardin, Trevor Exley, Diana Cafiso, Lucia Beccai

Affiliations: Soft BioRobotics and Perception Lab, Istituto Italiano di Tecnologia (IIT), Genoa, Italy

AI summary: This paper proposes BOAT, an optical sensor based on two patterned waveguides arranged in an ellipsoidal geometry, for monitoring the global deformation of soft lattice structures, in particular compression and extension. Experiments over repeated pressure cycles show a highly repeatable response.

Abstract

The growing adoption of lattice-based structures in soft robotics creates a need for advanced sensing solutions capable of monitoring their global deformation, particularly compression and extension. In this work, we address this challenge by introducing a novel optical sensor based on two patterned waveguides arranged in an ellipsoidal geometry. This Bidirectional Optical sensor for Actuation Tracking (BOAT) is seamlessly co-printed with a lattice structure actuated by an embedded pneumatic artificial muscle (PAM), and its performance is assessed. During PAM elongation or contraction, the bending of the embedded BOAT waveguides induces output signal variations that enable a clear discrimination between compression and extension states. The designs of both the individual waveguide structure (by surface patterning) and the sensorized lattice-based unit embedding two BOATs are supported by numerical simulations. Experimental calibration over 100 consecutive pressure cycles ranging from +50 kPa to -40 kPa demonstrates a highly repeatable response, allowing a reliable distinction between extension and compression. Finally, sensor feedback is used to implement a digital shadow, enabling continuous synchronization between the whole sensorized unit and its virtual counterpart. These results establish BOAT as a powerful and reliable approach for deformation monitoring in soft lattice-based robotic systems.

2605.18441 2026-05-19 cs.RO cs.SY eess.SY Updated version

REACT: Environment-Adaptive Architecture for Continuous Formation Navigation of Wheeled Mobile Robots

Jianghong Dong, Yifeng Zhang, Jiawei Wang, Mengchi Cai, Keqiang Li, Guillaume Sartoretti

Affiliations: School of Vehicle and Mobility, Tsinghua University; Department of Mechanical Engineering, National University of Singapore; Department of Civil and Environmental Engineering, University of Michigan

AI summary: This paper proposes REACT, an architecture that combines centralized formation generation with distributed formation maintenance to make formation navigation of wheeled mobile robots adaptable to complex environments, achieving continuous formation navigation without trajectory conflicts.

Abstract

Formation control of wheeled mobile robots (WMRs) has been extensively studied due to its broad applications in fields such as logistics transportation, environmental monitoring, and search and rescue. However, most existing works mainly focus on tracking predefined formations, which limits their adaptability to complex real-world environments. To address this, we propose REACT (Real-time Environment-Adaptive architecture for Continuous formation navigaTion), a hierarchical architecture integrating centralized formation generation and distributed formation maintenance. Specifically, our upper layer generates new environment-adaptive formations when necessary and uses our proposed TCF-R2T (Trajectory-Conflict-Free Robot-to-Target assignment) algorithm to compute conflict-free WMR-to-target assignments in polynomial time, enabling timely formation transitions without trajectory conflicts. At the lower layer, each WMR executes our developed JSTP (Joint Spatio-Temporal trajectory Planning) method to maintain the generated formation by simultaneously optimizing spatial positions and temporal durations, thereby enhancing coordination among WMRs and enabling continuous navigation in obstacle-rich environments and dynamic-obstacle scenarios. Both simulation and real-world experiments validate the effectiveness and practical applicability of REACT. Experimental videos are available on our project website: https://dongjh20.github.io/REACT-website.
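
To illustrate what a robot-to-target assignment problem looks like, here is a minimal sketch. It is not the TCF-R2T algorithm from the abstract, whose details are not given here: it simply searches all permutations for the assignment with the smallest total straight-line distance, which takes factorial time rather than the polynomial time the authors report. The function names and the example coordinates are hypothetical. One geometric fact makes this baseline relevant: an assignment that minimizes total Euclidean length cannot contain two properly crossing straight-line paths, because swapping the two targets would shorten the total by the triangle inequality.

```python
from itertools import permutations
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def total_length(robots: Sequence[Point], targets: Sequence[Point],
                 perm: Sequence[int]) -> float:
    """Sum of straight-line distances when robot i goes to targets[perm[i]]."""
    return sum(dist(robots[i], targets[j]) for i, j in enumerate(perm))


def best_assignment(robots: Sequence[Point],
                    targets: Sequence[Point]) -> Tuple[int, ...]:
    """Exhaustive O(n!) search; only suitable for small illustrative teams."""
    n = len(robots)
    return min(permutations(range(n)),
               key=lambda p: total_length(robots, targets, p))


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area test: > 0 if c lies to the left of the directed line a -> b."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segments p1-p2 and q1-q2 intersect at a single interior point."""
    return (_orient(q1, q2, p1) * _orient(q1, q2, p2) < 0
            and _orient(p1, p2, q1) * _orient(p1, p2, q2) < 0)


if __name__ == "__main__":
    robots: List[Point] = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
    targets: List[Point] = [(4.0, 3.0), (0.0, 3.0), (2.0, 3.0)]
    perm = best_assignment(robots, targets)
    print(perm, total_length(robots, targets, perm))
```

Note that a geometric crossing is only a proxy for a trajectory conflict: two paths that cross in space conflict only if the robots reach the crossing at overlapping times, which is why the abstract pairs assignment with joint spatio-temporal planning.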

2605.18423 2026-05-19 cs.RO cs.CY Updated version

REBAR: Reference Ethical Benchmark for Autonomy Readiness

Jonathan Diller, David Barnes, Rebekah Bogdanoff, Rhett Collier, Roddy Collins, Keith Fieldhouse, Yonatan Gefen, Cameron Johnson, Anuriha Kodali, Brad Kriel, Varun Murali, James Niehaus, Mish Sukharev, Joseph VanPelt, Anthony Hoogs, Vijay Kumar, Arslan Basharat

Affiliations: University of Pennsylvania; David Barnes, LLC; Kitware, Inc.; Duality Robotics, Inc.; Texas A&M University; Charles River Analytics

AI summary: This paper proposes the REBAR framework, which uses rigorous testing to produce a computable Autonomy Readiness Level that quantifies ethical performance and addresses the shortcomings of existing ethical AI frameworks.

Comments To be presented at the 2026 Workshop on Robot Ethics - Ethical, Legal and User Perspectives in Robotics and Automation (WOROBET)

Abstract

As autonomous systems grow more advanced, objective metrics to evaluate their ethical and legal compliance are critical for informing end users of their limitations and ensuring accountability of those who misuse them. Current ethical embodied AI frameworks remain mostly qualitative, focusing on system design (through safety guardrails or targeted red teaming), and the realized guardrails often directly disallow unsafe behavior without providing the user with an override or interpretable reason. Instead, there is a need for computable metrics through rigorous testing that allow a user to determine the applicability of the system to the task. To address this gap, we introduce the Reference Ethical Benchmark for Autonomy Readiness (REBAR), a quantitative test and evaluation framework for autonomous systems. REBAR maps operating metrics into a computable Autonomy Readiness Level (ARL) rubric that can quantify ethical performance. Key innovations of the framework include a neuro-symbolic Large Language Model (LLM) approach to calculate and explain the ethical difficulty of scenarios, LLM-driven at-scale generation of test instances, and a versatile, photorealistic simulation environment. By evaluating white-box autonomy solutions through this rigorous testing pipeline, REBAR delivers an objective and repeatable benchmark score, bridging the gap between abstract principles and verifiable, accountable autonomy.

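2605.18407 2026-05-19 cond-mat.mes-hall cond-mat.mtrl-sci cs.AI cs.RO Updated version

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Lihan Shi, Zhaoyi Joy Zheng, Xinzhe Juan, Yimin Wang, Ming Yin, Mayank Sengupta, Kristina Wolinski, Yanyu Jia, Jingzhi Shi, Derek Saucedo, Neill Saggi, Haosen Guan, Kenji Watanabe, Takashi Taniguchi, Ali Yazdani, Mengdi Wang, Sanfeng Wu

AI summary: This paper presents Qumus, described as the first embodied AI quantum materials experimentalist capable of real-world scientific discovery. Housed in a robotic mini-laboratory, it creates and nano-processes atomically thin 2D materials and van der Waals structures, and it reports the first AI-creation of graphene and the first AI-fabrication of atomically thin field-effect transistors.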

Comments 29 Pages in total. Supplementary Demo Videos are available at https://qumus.ai

Abstract

While modern Large Language Models (LLMs) and agentic artificial intelligence (AI) have demonstrated transformative capabilities in digital domains, the realization of embodied AI capable of real-world scientific discovery remains a difficult frontier. The advancements are hindered by the inherent complexity of integrating high-level reasoning, multimodal information processing and real-time physical execution. Here we introduce Qumus, the first AI quantum materials experimentalist. Physically embodied within a robotic mini-laboratory, Qumus is an intelligent, multimodal, and multi-agent system designed for the creation and nano-processing of atomically thin two-dimensional (2D) materials and stacked van der Waals (vdW) structures. Qumus autonomously navigates the full scientific cycle, from hypothesis generation and protocol planning to multi-step experimental execution, result analysis and reporting, acting as an experimentalist. Markedly, the system has achieved, for the first time, the AI-creation of graphene, as well as the first AI-fabrication of complex nanodevices including atomically thin field-effect transistors via vdW stacking. Qumus excels at these tasks by demonstrating autonomous error correction and closed-loop experimentation. Our results establish a generalizable framework for self-improving embodied AI systems that learn directly from the quantum world, opening a pathway toward accelerated discovery in quantum materials, electronics and beyond.

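2605.18385 2026-05-19 cs.RO cs.AI Updated version

Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments

Halim Djerroud, Nico Steyn, Olivier Rabreau, Patrick Bonnin, Abderraouf Benali

Affiliations: Tshwane University of Technology

AI summary: This paper proposes UbiSLAM, a solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras, it addresses the sensitivity of traditional SLAM systems to environmental changes and their reliance on sensors carried by mobile units, and it improves the localization accuracy and responsiveness of robots in the environment.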

DOI: 10.5220/0013245400003890
Journal ref: Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025), Volume 1, pages 537-548, SciTePress, 2025. ISBN: 978-989-758-737-5, ISSN: 2184-433X

Abstract

We present UbiSLAM, an innovative solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras strategically throughout the workspace, UbiSLAM addresses limitations commonly encountered in traditional SLAM systems, such as sensitivity to environmental changes and reliance on mobile unit sensors. This fixed-sensor approach enables real-time, comprehensive mapping, enhancing the localization accuracy and responsiveness of robots operating within the environment. The centralized map generated by UbiSLAM is continuously updated, providing robots with an accurate global view, which improves navigation, minimizes collisions, and facilitates smoother human-robot interactions in shared spaces. Beyond its advantages, UbiSLAM faces challenges, particularly in ensuring complete spatial coverage and managing blind spots, which necessitate data integration from the robots themselves. In this paper we discuss potential solutions, such as automatic calibration for optimal camera placement and orientation, along with enhanced communication protocols for real-time data sharing. The proposed model reduces the computational load on individual robotic units, allowing less complex robotic platforms to operate effectively while enhancing the robustness of the overall system.

2605.18373 2026-05-19 cs.RO cs.LG math.DS math.OC Updated version

Dynamic robotic cloth folding with efficient Koopman operator-based model predictive control

Edoardo Caldarelli, Franco Coltraro, Adrià Colomé, Lorenzo Rosasco, Carme Torras

Affiliations: Istituto Italiano di Tecnologia; Institut de Robòtica i Informàtica Industrial; MaLGa Center, DIBRIS, Università degli Studi di Genova

AI summary: This paper proposes a Koopman operator-based model predictive control method for quickly generating cloth-folding trajectories. It combines physics-based simulation with efficient kernel-based Koopman operator regression to improve the efficiency and accuracy of folding.

Comments Accepted for presentation at the 2026 IEEE International Conference on Robotics and Automation (ICRA)

Abstract

Robotic cloth folding is a challenging task, particularly when considering dynamic folding tasks, which aim at folding cloth by fast motions that leverage its dynamics. When subject to such fast motions, the complexity of cloth dynamics hinders both system identification and planning of folding trajectories, resulting in a difficult simulation-to-reality transfer when using physical models of cloth. Compared to the dexterity that humans exhibit when performing folding tasks, robotic approaches usually employ small garments with quite rigid dynamics, and are either too slow, or fast but imprecise, requiring several attempts to achieve a reasonably good fold. In this paper, we tackle these challenges by generating fast folding trajectories with a novel model predictive controller, integrating physics-based simulation of cloth dynamics and efficient, kernel-based Koopman operator regression. Koopman operator regression, an increasingly popular machine learning technique for nonlinear system identification, is used to obtain a linear model for the cloth being folded. Such a surrogate model, trained with data from a high-fidelity, physics-based cloth simulator, can then be employed within a suitable model predictive control algorithm, in place of the costly, nonlinear one, to efficiently generate folding trajectories to be executed by a robotic manipulator. Both in simulated and real-robot experiments, we show how the linearization supplied by the Koopman operator-based model can be employed to efficiently generate fast folding trajectories to unseen poses, without sacrificing folding accuracy.
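
The idea of obtaining a linear model of a nonlinear system by Koopman operator regression can be sketched on a toy problem. The example below is not the paper's kernel-based method and has nothing to do with cloth: it swaps in a small, hand-picked dictionary of observables (often called extended dynamic mode decomposition) and a two-state toy system chosen so that the lifted dynamics are exactly linear. The system, the dictionary `[x1, x2, x1**2]`, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def step(x1: float, x2: float, a: float = 0.9, b: float = 0.5,
         c: float = 0.3) -> Tuple[float, float]:
    """Toy nonlinear map: x1' = a*x1, x2' = b*x2 + c*x1**2."""
    return a * x1, b * x2 + c * x1 * x1


def lift(x1: float, x2: float) -> List[float]:
    """Assumed dictionary of observables: [x1, x2, x1**2]."""
    return [x1, x2, x1 * x1]


def solve(A: Matrix, B: Matrix) -> Matrix:
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def fit_koopman(states: Sequence[Tuple[float, float]]) -> Matrix:
    """Least-squares K such that lift(step(x)) is approximately K @ lift(x)."""
    Z = [lift(*s) for s in states]
    Zn = [lift(*step(*s)) for s in states]
    d = len(Z[0])
    # Normal equations: (Z^T Z) K^T = Z^T Zn.
    G = [[sum(z[i] * z[j] for z in Z) for j in range(d)] for i in range(d)]
    A = [[sum(z[i] * zn[j] for z, zn in zip(Z, Zn)) for j in range(d)]
         for i in range(d)]
    Kt = solve(G, A)
    return [[Kt[j][i] for j in range(d)] for i in range(d)]


if __name__ == "__main__":
    samples = [(x1, x2) for x1 in (-1.0, -0.5, 0.5, 1.0, 2.0)
               for x2 in (-1.0, 0.0, 1.0)]
    for row in fit_koopman(samples):
        print([round(v, 6) for v in row])
```

Once K is fitted, multi-step prediction is repeated matrix multiplication in the lifted space, which is what makes such a surrogate attractive inside a model predictive controller: the optimization sees linear dynamics even though the underlying system is nonlinear.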

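2605.18303 2026-05-19 cs.LG cs.AI cs.CV cs.RO Updated version

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

Xueyu Luan, Chenwei Shi

AI summary: This paper proposes PH-Dreamer, a physics-driven world model built on the Port-Hamiltonian framework. Three synergistic mechanisms improve world models based on recurrent state-space architectures, yielding a more compact and physically structured representation and higher internal simulator fidelity while reducing latent phase-space volume, energy consumption, and mean squared jerk.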

Comments 12 pages, 3 figures

Abstract

World models built on recurrent state-space architectures enable efficient latent imagination, yet remain physically unstructured, producing dynamics that violate conservation and dissipative principles. We introduce a unified Port-Hamiltonian framework that remedies this through three synergistic mechanisms. First, we embed implicit physical priors into recurrent transitions by modeling projected latent evolution as action-controlled energy routing governed by flow and dissipation, biasing the projected PH phase space toward a more compact and physically structured representation. Second, we develop a kinematics-aware energy world model that estimates the Hamiltonian and power balance from proprioceptive observations, providing an explicit physical signal for thermodynamic reasoning. Third, leveraging these energy gradients, we establish an energy-guided Actor-Critic that uses Lagrangian multipliers to regularize policy optimization toward lower energy and smoother control. Across visual control benchmarks, this paradigm not only attains superior asymptotic returns but also elevates internal simulator fidelity by establishing a tighter, lower-variance alignment between imagined and real rewards, all while reducing latent phase-space volume by 4.18-8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.
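
For readers unfamiliar with the term, the standard textbook form of an input-state-output port-Hamiltonian system is shown below, together with the power balance it implies. This is general background, not the latent-space formulation of the paper, which the abstract does not spell out; the symbols x (state), u (input), y (output), H (Hamiltonian), J, R and G follow common convention and are assumptions here.

```latex
\begin{aligned}
\dot{x} &= \bigl[J(x) - R(x)\bigr]\,\nabla H(x) + G(x)\,u,
  \qquad J = -J^{\top},\quad R = R^{\top} \succeq 0, \\
y &= G(x)^{\top}\,\nabla H(x), \\
\frac{dH}{dt} &= \nabla H^{\top}\dot{x}
  = \underbrace{\nabla H^{\top} J\,\nabla H}_{=\,0}
    \;-\; \nabla H^{\top} R\,\nabla H \;+\; y^{\top}u
  \;\le\; y^{\top}u .
\end{aligned}
```

The skew-symmetric J routes energy without creating or destroying it, the positive semidefinite R dissipates it, and the product y^T u is the power supplied through the port, so stored energy can never grow faster than the supplied power.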

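2605.18295 2026-05-19 cs.RO Updated version

Assessing Localization Technologies for Pedestrian Collision Avoidance

Joshua Varughese, Joseba Gorospe, Novel Certad, Cristina Olaverri-Monreal

Affiliations: Dept. Intelligent Transport Systems, Johannes Kepler University Linz

AI summary: This paper evaluates the localization accuracy of ultra-wideband technology and Bluetooth 6.0 for pedestrian collision warning and compares their performance with global navigation satellite systems. It finds that these technologies can serve as alternatives or complements in specific scenarios and improve environmental awareness.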

2605.18287 2026-05-19 cs.CV cs.RO Updated version

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Yiyang Fu, Chubin Zhang, Shukai Gong, Yufan Deng, Kaiwei Sun, Qiyang Min, Qibin Hou, Yansong Tang, Jianan Wang, Daquan Zhou

Affiliations: Peking University; Tsinghua University; Nanjing University; Nankai University

AI summary: This paper studies the robustness of vision-language-action (VLA) models under unseen real-world visual disturbances and proposes IB-Adapter, a lightweight adapter module grounded in information theory that improves robustness while remaining efficient.

Comments Accepted by ICML 2026. Code: https://github.com/DAGroup-PKU/HumanNet. Project website: https://dagroup-pku.github.io/StableVLA/

Abstract

It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.
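
As background for the term "information bottleneck", the classic objective is sketched below. Whether IB-Adapter optimizes exactly this form is not stated in the abstract, so the mapping of X to the visual input, Z to the adapter's filtered representation, and Y to the action target is an assumption made only for illustration.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta\, I(Z; Y), \qquad \beta > 0
```

The first term pushes Z to discard information about the raw input, including nuisance variation such as visual noise, while the second rewards keeping whatever is predictive of Y; the multiplier β sets the trade-off between compression and predictive power.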

2605.18262 2026-05-19 cs.RO Updated version

On Improving Multimodal Pedestrian Trajectory Prediction with CVAE: A Study on Benchmark and Robot Data

Yuzhou Liu, Cristina Olaverri-Monreal

Affiliations: Dept. Intelligent Transport Systems, Johannes Kepler University Linz

AI summary: This paper proposes a CVAE-based probabilistic model built on Social-STGCNN to improve multimodal pedestrian trajectory prediction. Evaluation on benchmark datasets and a real-world robot dataset shows improved endpoint accuracy and trajectory diversity across different crowd configurations.

Abstract

Accurate pedestrian trajectory prediction is crucial for autonomous systems operating in complex environments, such as modular buses and delivery robots in suburban or semi-structured areas. Social Spatio-Temporal Graph Convolutional Neural Networks (Social-STGCNN) have shown strong performance by modeling social interactions; however, producing diverse and well-calibrated future trajectories remains challenging. In this work, we build on a Social-STGCNN backbone and introduce a Conditional Variational Autoencoder (CVAE)-based probabilistic formulation to explicitly model multimodal future trajectories. We evaluate the method on the ETH and UCY pedestrian trajectory datasets as well as on a real-world pedestrian dataset collected by a mobile robot. Results show moderate gains on public benchmarks, but more consistent endpoint accuracy and improved trajectory diversity across different crowd configurations. Evaluation on robot-collected data further demonstrates the approach's effectiveness beyond curated benchmarks and supports its applicability in practical deployments.
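
Multimodal predictors of this kind are commonly scored with best-of-K displacement errors, which is one way to read "endpoint accuracy" here. The abstract does not name its metrics, so the use of average and final displacement error (ADE and FDE) below is an assumption, and the function names and sample trajectories are hypothetical.

```python
from math import dist
from typing import Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def ade(pred: Trajectory, truth: Trajectory) -> float:
    """Average displacement error: mean point-wise Euclidean distance."""
    return sum(dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def fde(pred: Trajectory, truth: Trajectory) -> float:
    """Final displacement error: distance between the two endpoints."""
    return dist(pred[-1], truth[-1])


def best_of_k(samples: Sequence[Trajectory],
              truth: Trajectory) -> Tuple[float, float]:
    """(minADE, minFDE) over K sampled futures, each minimized independently."""
    return (min(ade(s, truth) for s in samples),
            min(fde(s, truth) for s in samples))


if __name__ == "__main__":
    truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    samples = [
        [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # drifts away from the truth
        [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],  # closer sample
    ]
    print(best_of_k(samples, truth))
```

Because only the best of the K samples counts, these metrics reward a model that spreads its samples over plausible futures, which is why they are usually reported alongside some measure of trajectory diversity.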

2605.18197 2026-05-19 cs.RO cs.AI cs.CV Updated version

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

Affiliations: Mobile Robotics Group (MRG); Visual and Multimodal Applied Learning Lab (VANDAL)

AI summary: This paper proposes an active 3D scene graph generation method that uses only RGB input. A structured representation that unifies perception and planning removes the dependence of conventional methods on dedicated sensors, and the approach is validated on the Replica dataset.

Abstract

Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.

2605.18184 2026-05-19 cs.RO cs.AI cs.CV Updated version

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

Affiliations: Mobile Robotics Group (MRG); Visual and Multimodal Applied Learning Lab (VANDAL)

AI summary: This paper proposes using fixed external RGB cameras as Common Prior Maps for active, incremental 3D scene graph generation. Fusing data from onboard robot cameras and fixed external cameras improves the efficiency and accuracy of scene understanding.

Abstract

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras (onboard or external) identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.


2605.16015 2026-05-19 cs.RO cs.LG (updated version)

Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

Vishnu Saj, Sushil Vemuri, Dileep Kalathil, Moble Benedict

Affiliations: Texas A&M University

AI summary: This paper proposes a novel adaptive control architecture that combines reinforcement learning with a Residual Dynamics Predictor to improve quadrotor control under dynamic disturbances. Real-world experiments show more accurate trajectory tracking than the baselines.

Abstract

Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates online the external forces and moments acting on the aircraft in flight, using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads.


2605.11654 2026-05-19 cs.CV cs.AI cs.RO (updated version)

Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

Chi-Nguyen Tran, Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Long Tran-Thanh

Affiliations: Faculty of Information Technology, University of Science, Vietnam National University; Department of Computer Science, University of Warwick

AI summary: This paper proposes SkyPart, a lightweight swappable head for patch-based vision transformers that performs explicit part grouping over the patch grid. It combines learnable prototypes with single-pass cosine assignment, altitude-conditioned modulation applied only during training, a graph-attention readout, and a Kendall uncertainty-weighted multi-objective loss. At 26.95M parameters and 22.14 GFLOPs, it is the smallest among top-performing methods, sets a new state of the art on SUES-200, University-1652, and DenseUAV, and its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.

Comments 37 pages, 7 figures, 6 tables

Abstract

Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
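The uncertainty-weighted loss mentioned in component (iv) can be illustrated with a minimal sketch. It assumes the commonly used homoscedastic form with one learnable log-variance s_i = log(sigma_i^2) per task; the abstract does not state SkyPart's exact formulation, and the function name `uncertainty_weighted_loss` is hypothetical.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses using one log-variance s_i = log(sigma_i^2) per task.

    Assumed form (not taken from the paper): sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i.
    """
    if len(losses) != len(log_vars):
        raise ValueError("one log-variance per loss is required")
    total = 0.0
    for loss, s in zip(losses, log_vars):
        # exp(-s) down-weights a task; the +s term penalizes inflating its variance.
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total

# With all log-variances at zero, the result is half the plain sum of the losses.
example_total = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])
```

In a real training loop the log-variances would be trainable parameters optimized jointly with the network, so the relative weights replace hand-tuned scalars.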


2604.26450 2026-05-19 cs.RO (updated version)

Reactive Motion Generation via Phase-varying Neural Potential Functions

Ahmet Tekden, Dimitrios Kanoulas, Aude Billard, Yasemin Bekiroglu

Affiliations: Chalmers AI Research Center (CHAIR); Chalmers Gender Initiative for Excellence (Genie); Wallenberg AI, Autonomous Systems and Software Program (WASP); University College London; Ecole Polytechnique Federale de Lausanne (EPFL)

AI summary: This paper proposes a motion generation framework based on Phase-varying Neural Potential Functions (PNPF), which conditions the potential function on a phase variable estimated directly from state progression. The framework generalizes across point-to-point, periodic, and full 6D motion tasks and is more robust on trajectories with intersections and under external disturbances.

Comments Accepted by IEEE Robotics and Automation Letters (RAL)

DOI: 10.1109/LRA.2026.3692035

Abstract

Dynamical systems (DS) methods for Learning-from-Demonstration (LfD) provide stable, continuous policies from few demonstrations. First-order dynamical systems (DS) are effective for many point-to-point and periodic tasks, as long as a unique velocity is defined for each state. For tasks with intersections (e.g., drawing an "8"), extensions such as second-order dynamics or phase variables are often used. However, by incorporating velocity, second-order models become sensitive to disturbances near intersections, as velocity is used to disambiguate motion direction. Moreover, this disambiguation may fail when nearly identical position-velocity pairs correspond to different onward motions. In contrast, phase-based methods rely on open-loop time or phase variables, which limit their ability to recover after perturbations. We introduce Phase-varying Neural Potential Functions (PNPF), an LfD framework that conditions a potential function on a phase variable which is estimated directly from state progression, rather than on open-loop temporal inputs. This phase variable allows the system to handle state revisits, while the learned potential function generates local vector fields for reactive and stable control. PNPF generalizes effectively across point-to-point, periodic, and full 6D motion tasks, outperforms existing baselines on trajectories with intersections, and demonstrates robust performance in real-time robotic manipulation under external disturbances.


2604.10895 2026-05-19 cs.HC cs.RO (updated version)

Teaching Robots to Interpret Social Interactions through Lexically-guided Dynamic Graph Learning

Tongfei Bian, Mathieu Chollet, Tanaya Guha

Affiliations: University of Glasgow

AI summary: This paper proposes SocialLDG, a multi-task learning framework that uses dynamic graph learning to model the dynamic relationships among states. It achieves state-of-the-art performance on human-robot social interaction datasets and supports task extension and analysis of influence over time.

Comments submitted to ACM MM 26

Abstract

For a robot to be called socially intelligent, it must be able to infer users' internal states from their current behaviour, predict the users' future behaviour, and, if required, respond appropriately. In this work, we investigate how robots can be endowed with such social intelligence by modelling the dynamic relationship between users' internal states (latent) and actions (observable state). Our premise is that these states arise from the same underlying socio-cognitive process and influence each other dynamically. Drawing inspiration from theories in Cognitive Science, we propose a novel multi-task learning framework, termed SocialLDG, that explicitly models the dynamic relationship among the states, represented as six distinct tasks. Our framework uses a language model to introduce lexical priors for each task and employs dynamic graph learning to model task affinity evolving with time. SocialLDG has three advantages: First, it achieves state-of-the-art performance on two challenging, publicly available human-robot social interaction datasets. Second, it supports strong task scalability by learning new tasks seamlessly without catastrophic forgetting. Finally, benefiting from explicit modelling of task affinity, it offers insights on how different interactions unfold in time and how the internal states and observable actions influence each other in human decision making.


2604.02060 2026-05-19 cs.CV cs.RO (updated version)

CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

Jingliang Li, Jindou Jia, Tuo An, Chuhao Zhou, Xiangyu Chen, Shilin Shan, Boyu Ma, Bofan Lyu, Gen Li, Jianfei Yang

Affiliations: MARS Lab, Nanyang Technological University, Singapore

AI summary: This work introduces a new 3D affordance setting, Intent-Driven Confusable Affordance Grounding, which requires predicting a per-point affordance mask on the correct object in a multi-object point cloud, conditioned on implicit natural language intent. The authors build the CompassAD benchmark, report state-of-the-art results on multi-object compositions with implicit intent, and validate real-world grasping on a robotic manipulator.

Abstract

When told to "cut the cake," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Intent-Driven Confusable Affordance Grounding, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusing multi-object compositions. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 compositions, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object compositions.


2604.00634 2026-05-19 cs.RO cs.CV (updated version)

LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics

Calvin Galagain, Martyna Poreba, François Goulette, Cyrill Stachniss

Affiliations: Université Paris-Saclay, CEA LIST; U2IS, ENSTA, Institut Polytechnique de Paris; University of Bonn, Center for Robotics

AI summary: This paper proposes LiPS, a lightweight panoptic segmentation method that retains query-based decoding while streamlining feature extraction and fusion. It substantially lowers computational demands and achieves accuracy comparable to heavier models at higher throughput.

Comments Accepted to IEEE International Conference on Image Processing (ICIP) 2026, Paper #2070

Abstract

Panoptic segmentation is a key enabler for robotic perception, as it unifies semantic understanding with object-level reasoning. However, the increasing complexity of state-of-the-art models makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. We propose a novel approach called LiPS that addresses the challenge of efficient-to-compute panoptic segmentation with a lightweight design that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway. It aims to provide strong panoptic segmentation performance while substantially lowering the computational demands. Evaluations on standard benchmarks demonstrate that LiPS attains accuracy comparable to much heavier baselines, while providing up to 4.5 times higher throughput, measured in frames per second, and requiring nearly 6.8 times fewer computations. This efficiency makes LiPS a highly relevant bridge between modern panoptic models and real-world robotic applications.


2603.26720 2026-05-19 cs.RO cs.AI (updated version)

SutureFormer: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space

Huanrong Liu, Chunlin Tian, Tongyu Jia, Tailai Zhou, Qin Liu, Yu Gao, Yutong Ban, Yun Gu, Guy Rosman, Xin Ma, Qingbiao Li

Affiliations: University of Macau; The Chinese PLA General Hospital; Duke University; Shanghai Jiao Tong University

AI summary: This paper proposes SutureFormer, a goal-conditioned offline reinforcement learning framework that interpolates sparse annotations into dense reward signals to learn surgical needle trajectory prediction, reducing Average Displacement Error by 58.6%.

Abstract

Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide sufficient supervision, further increasing the difficulty of supervised or imitation learning methods. To address these challenges, we formulate image-based needle trajectory prediction as a sequential decision-making problem, in which the needle tip is treated as an agent that moves step by step in pixel space. This formulation naturally captures the continuity of needle motion and enables the explicit modeling of physically plausible pixel-wise state transitions over time. From this perspective, we propose SutureFormer, a goal-conditioned offline reinforcement learning framework that converts sparse annotations into dense reward signals via cubic spline interpolation, encouraging the policy to exploit limited expert guidance while exploring plausible future motion paths. SutureFormer encodes variable-length clips using an observation encoder to capture both local spatial cues and long-range temporal dynamics, and autoregressively predicts future waypoints through actions composed of discrete directions and continuous magnitudes. To enable stable offline policy optimization from expert demonstrations, we adopt Conservative Q-Learning with Behavioral Cloning regularization. Experiments on a new kidney wound suturing dataset containing 1,158 trajectories from 50 patients show that SutureFormer reduces Average Displacement Error by 58.6% compared with the strongest baseline, demonstrating the effectiveness of modeling needle trajectory prediction as pixel-level sequential action learning.
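The idea of turning sparse waypoint annotations into a dense reward via cubic spline interpolation can be sketched as follows. This is an illustrative assumption, not the authors' implementation: it uses a Catmull-Rom spline (one kind of interpolating cubic spline) and a negative-distance reward, and the names `catmull_rom`, `densify`, and `dense_reward` are hypothetical.

```python
import math

def catmull_rom(p0, p1, p2, p3, t):
    """Point at parameter t in [0, 1] on the cubic segment running from p1 to p2."""
    return tuple(
        0.5 * (2 * b + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t ** 2
               + (-a + 3 * b - 3 * c + d) * t ** 3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

def densify(waypoints, samples_per_segment=10):
    """Interpolate sparse pixel waypoints into a dense path through every waypoint."""
    pts = [waypoints[0]] + list(waypoints) + [waypoints[-1]]  # pad both endpoints
    dense = []
    for i in range(1, len(pts) - 2):
        for k in range(samples_per_segment):
            t = k / samples_per_segment
            dense.append(catmull_rom(pts[i - 1], pts[i], pts[i + 1], pts[i + 2], t))
    dense.append(tuple(float(v) for v in waypoints[-1]))  # close the final segment
    return dense

def dense_reward(position, dense_path):
    """Assumed reward: negative pixel distance from a predicted tip to the dense path."""
    return -min(math.dist(position, q) for q in dense_path)

sparse_waypoints = [(0, 0), (10, 5), (20, 0)]
path = densify(sparse_waypoints)
```

A predicted needle-tip position that lies on the interpolated path receives the maximum reward of zero, and the reward decreases as the prediction drifts away, which gives a learning signal at every step rather than only at annotated waypoints.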


2603.17751 2026-05-19 cs.RO cs.SY eess.SY (updated version)

Multi-Source Human-in-the-Loop Digital Twin Testbed for Connected and Autonomous Vehicles in Mixed Traffic Flow

Jianghong Dong, Chunying Yang, Mengchi Cai, Chaoyi Chen, Qing Xu, Jianqiang Wang, Jiawei Wang, Keqiang Li

Affiliations: School of Vehicle and Mobility, Tsinghua University; Department of Civil and Environmental Engineering, University of Michigan

AI summary: This paper proposes the Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed (MSH-MCCT) for testing complex interactions between Connected and Autonomous Vehicles (CAVs) and human-driven vehicles (HDVs) in mixed traffic. Its Mixed Digital Twin concept combines Mixed Reality with Digital Twin to improve experimental flexibility and scalability.

DOI: 10.26599/JICV.2026.9210084
Journal ref: 2026 in Journal of Intelligent and Connected Vehicles

Abstract

In the emerging mixed traffic environments, Connected and Autonomous Vehicles (CAVs) have to interact with surrounding human-driven vehicles (HDVs). This paper introduces MSH-MCCT (Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed), a novel CAV testbed that captures complex interactions between various CAVs and HDVs. Utilizing the Mixed Digital Twin concept, which combines Mixed Reality with Digital Twin, MSH-MCCT integrates physical, virtual, and mixed platforms, along with multi-source control inputs. Bridged by the mixed platform, MSH-MCCT allows human drivers and CAV algorithms to operate both physical and virtual vehicles within multiple fields of view. Particularly, this testbed facilitates the coexistence and real-time interaction of physical and virtual CAVs and HDVs, significantly enhancing the experimental flexibility and scalability. Experiments on vehicle platooning in mixed traffic showcase the potential of MSH-MCCT to conduct CAV testing with multi-source real human drivers in the loop through driving simulators of diverse fidelity. The videos for the experiments are available at our project website: https://dongjh20.github.io/MSH-MCCT.


2603.14371 2026-05-19 cs.RO cs.AI (updated version)

OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism

Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu

Affiliations: Institute for AI Industry Research (AIR); Department of Electronic Engineering; University of Science and Technology of China

AI summary: This paper proposes OxyGen, a unified KV cache management approach that improves VLA inference efficiency under multi-task parallelism. Cross-task KV sharing and cross-frame continuous batching reduce redundant computation and resource contention, yielding higher throughput and action frequency on device.

Comments Preprint

Abstract

Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment because of redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference design that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this design for $π_{0.5}$, a popular MoT VLA, and evaluate it on both NVIDIA GeForce RTX 4090 and Jetson AGX Thor, two representative platforms for on-device VLA inference. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without degrading action quality, and we further validate the gains on a real humanoid robot with on-board Jetson AGX Thor.
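The cross-task KV sharing idea (prefill a shared observation once, then reuse the result for every task) can be illustrated with a toy cache. This is only a conceptual sketch under stated assumptions: `SharedPrefillCache` and `fake_prefill` are hypothetical names, the "KV state" is a placeholder list, and none of this reflects OxyGen's actual interfaces.

```python
class SharedPrefillCache:
    """Toy cache: run the expensive prefill once per frame and reuse it across tasks."""

    def __init__(self, prefill_fn):
        self._prefill_fn = prefill_fn  # hypothetical: maps an observation to KV state
        self._store = {}
        self.prefill_calls = 0

    def get(self, frame_id, observation):
        # Only the first task asking for this frame pays the prefill cost.
        if frame_id not in self._store:
            self._store[frame_id] = self._prefill_fn(observation)
            self.prefill_calls += 1
        return self._store[frame_id]

    def evict(self, frame_id):
        # Drop state for frames that no task needs any more.
        self._store.pop(frame_id, None)

def fake_prefill(observation):
    # Stand-in for a transformer prefill; returns a placeholder "KV" list.
    return [len(token) for token in observation]

cache = SharedPrefillCache(fake_prefill)
obs = ("image_patch_0", "image_patch_1")
kv_for_action = cache.get(frame_id=0, observation=obs)
kv_for_language = cache.get(frame_id=0, observation=obs)
```

Here the action head and the language head both request frame 0, but the prefill function runs only once; isolated per-task caches would have run it twice.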


2602.05156 2026-05-19 cs.RO cs.SY eess.SY (updated version)

PLATO Hand: Shaping Contact Behavior with Fingernails for Precise Manipulation

Dong Ho Kang, Aaron Kim, Mingyo Seo, Kazuto Yokoyama, Tetsuya Narita, Luis Sentis

Affiliations: The University of Texas at Austin; Sony Group Corporation

AI summary: This paper presents the PLATO Hand, a dexterous robotic hand whose hybrid fingertip combines a rigid fingernail, an embedded distal phalanx, and a compliant pulp to shape contact behavior. The authors develop a strain-energy-based bending-indentation model that guides the fingertip design and explains how material stiffness and contact geometry govern deformation partitioning in the fingertip. Experiments show improved pinch stability, fingernail-mediated dorsal-contact force transmission, and proprioceptive observability, along with successful edge-sensitive manipulation tasks such as paper singulation, card picking, and orange peeling. The results indicate that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism offers a principled approach to precise manipulation.

Abstract

We present the PLATO Hand, a dexterous robotic hand with a hybrid fingertip that combines a rigid fingernail, embedded distal phalanx, and compliant pulp to shape contact behavior during manipulation. By mechanically organizing how contact is initiated, supported, and transmitted at the fingertip, this structure creates stable and task-relevant contact conditions across diverse object geometries and grasp orientations. We develop a strain-energy-based bending-indentation model to guide the fingertip design and to explain how material stiffness and contact geometry govern deformation partitioning within the fingertip. Experiments show improved pinch stability, improved fingernail-mediated dorsal-contact force transmission and proprioceptive observability, and successful execution of edge-sensitive manipulation tasks, including paper singulation, card picking, and orange peeling. These results show that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism provides a principled approach to precise manipulation. Our project page is at: https://platohand.github.io


2601.05653 2026-05-19 cs.RO cs.MA (updated version)

EvoQRE: Modeling Bounded Rationality in Safety-Critical Traffic Simulation via Evolutionary Quantal Response Equilibrium

Phu-Hoa Pham, Chi-Nguyen Tran, Duy-Minh Dao-Sy, Phu-Quy Nguyen-Lam, Trung-Kiet Huynh

AI summary: This paper proposes EvoQRE, a framework that models safety-critical traffic interactions through Quantal Response Equilibrium and evolutionary game dynamics. It claims convergence to Logit-QRE under weak monotonicity assumptions and reports improved realism and safety metrics on the Waymo and nuPlan datasets.

Comments This article is being withdrawn due to identified issues in the experimental evaluation and theoretical assumptions that may affect the validity of some reported conclusions. The authors plan to revise the methodology and provide a corrected version in future work.

Abstract

Existing traffic simulation frameworks for autonomous vehicles typically rely on imitation learning or game-theoretic approaches that solve for Nash or coarse correlated equilibria, implicitly assuming perfectly rational agents. However, human drivers exhibit bounded rationality, making approximately optimal decisions under cognitive and perceptual constraints. We propose EvoQRE, a principled framework for modeling safety-critical traffic interactions as general-sum Markov games solved via Quantal Response Equilibrium (QRE) and evolutionary game dynamics. EvoQRE integrates a pre-trained generative world model with entropy-regularized replicator dynamics, capturing stochastic human behavior while maintaining equilibrium structure. We provide rigorous theoretical results, proving that the proposed dynamics converge to Logit-QRE under a two-timescale stochastic approximation with an explicit convergence rate of O(log k / k^{1/3}) under weak monotonicity assumptions. We further extend QRE to continuous action spaces using mixture-based and energy-based policy representations. Experiments on the Waymo Open Motion Dataset and nuPlan benchmark demonstrate that EvoQRE achieves state-of-the-art realism, improved safety metrics, and controllable generation of diverse safety-critical scenarios through interpretable rationality parameters.
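The bounded-rationality idea behind a logit quantal response can be shown with a short sketch: an agent picks each action with probability proportional to exp(lambda * utility), where lambda is the rationality parameter. This covers only a single agent's response to fixed utilities, not the equilibrium solver or the evolutionary dynamics described in the abstract, and the function name `logit_response` is hypothetical.

```python
import math

def logit_response(utilities, rationality):
    """Logit quantal response: P(a) is proportional to exp(rationality * u(a))."""
    scaled = [rationality * u for u in utilities]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    norm = sum(weights)
    return [w / norm for w in weights]

# rationality = 0 gives uniformly random choices; large values approach a best response.
random_driver = logit_response([1.0, 2.0, 3.0], rationality=0.0)
near_rational_driver = logit_response([1.0, 2.0, 3.0], rationality=50.0)
```

Intermediate values of the rationality parameter produce drivers that usually, but not always, choose the highest-utility action, which is the kind of approximately optimal behavior the abstract attributes to human drivers.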


2512.07765 2026-05-19 cs.RO (updated version)

Toward Seamless Physical Human-Humanoid Interaction: Insights from Control, Intent, and Modeling with a Vision for What Comes Next

Gustavo A. Cardona, Shubham S. Kumbhar, Panagiotis Artemiadis

Affiliations: University of Delaware

AI summary: This review examines three core pillars of physical human-humanoid interaction: control, intent estimation, and computational human models. It summarizes the state of the art, open challenges, and limitations, and proposes pathways for cross-pillar integration toward more robust, safe, and intuitive physical interaction.

Comments 60 pages, 5 figures, 3 tables

DOI: 10.1007/s10846-026-02401-0

Abstract

Physical Human-Humanoid Interaction (pHHI) is a rapidly advancing field with significant implications for deploying robots in unstructured, human-centric environments. In this review, we examine the current state of the art in pHHI through three core pillars: (i) humanoid modeling and control, (ii) human intent estimation, and (iii) computational human models. For each pillar, we survey representative approaches, identify open challenges, and analyze current limitations that hinder robust, scalable, and adaptive interaction. These include the need for whole-body control strategies capable of handling uncertain human dynamics, real-time intent inference under limited sensing, and modeling techniques that account for variability in human physical states. Although significant progress has been made within each domain, integration across pillars remains limited. We propose pathways for unifying methods across these areas to enable cohesive interaction frameworks. This structure enables us not only to map the current landscape but also to propose concrete directions for future research that aim to bridge these domains. Additionally, we introduce a unified taxonomy of interaction types based on modality, distinguishing between direct interactions (e.g., physical contact) and indirect interactions (e.g., object-mediated), and on the level of robot engagement, ranging from assistance to cooperation and collaboration. For each category in this taxonomy, we provide the three core pillars that highlight opportunities for cross-pillar unification. Our goal is to suggest avenues to advance robust, safe, and intuitive physical interaction, providing a roadmap for future research that will allow humanoid systems to effectively understand, anticipate, and collaborate with human partners in diverse real-world settings.


2511.00392 2026-05-19 cs.RO cs.AI cs.CV (updated version)

SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction via Plane Sweeping

Lingpeng Chen, Jiakun Tang, Apple Pui-Yi Chui, Ziyang Hong, Junfeng Wu

Affiliations: Chinese University of Hong Kong, Shenzhen; Chinese University of Hong Kong, Hong Kong; Department of Automation, Harbin Institute of Technology

AI summary: This paper proposes SonarSweep, an end-to-end deep learning framework that adapts the plane sweep algorithm to cross-modal fusion of sonar and visual data. It overcomes the limitations of single-modality 3D reconstruction in underwater environments and produces more accurate and stable depth maps.

Comments 8 pages, 9 figures, conference

Abstract

Accurate 3D reconstruction in visually degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Consequently, prior fusion techniques rely on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel, end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods across challenging conditions, particularly in high turbidity. To foster further research, we will publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind.


2510.26018 2026-05-19 cs.RO cs.AI (updated version)

RADRON: Cooperative Localization of Ionizing Radiation Sources by MAVs with Compton Cameras

Petr Stibinger, Tomas Baca, Daniela Doubravova, Jan Rusnak, Jaroslav Solc, Jan Jakubek, Petr Stepan, Martin Saska

AI summary: This work proposes a new method for cooperative localization of radioactive material by Micro Aerial Vehicles, using Compton cameras to estimate the radiation source position in real time with high sensitivity, even from sparse measurements.

Comments 8 pages, 9 figures, submitted for review to IEEE RA-L

DOI: 10.1109/LRA.2026.3688053

Abstract

We present a novel approach to localizing radioactive material by cooperating Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art single-detector Compton camera as a highly sensitive, yet miniature detector of ionizing radiation. The detector's exceptionally low weight (40 g) opens up new possibilities of radiation detection by a team of cooperating agile MAVs. We propose a new fundamental concept of fusing the Compton camera measurements to estimate the position of the radiation source in real time even from extremely sparse measurements. The data readout and processing are performed directly onboard and the results are used in a dynamic feedback to drive the motion of the vehicles. The MAVs are stabilized in a tightly cooperating swarm to maximize the information gained by the Compton cameras, rapidly locate the radiation source, and even track a moving radiation source.
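
As background for the fusion step, each Compton camera event constrains the source direction to a cone whose opening angle follows from standard Compton kinematics. The sketch below computes that angle from the energy deposited in the scattering interaction and the energy of the scattered photon. It is textbook physics rather than the authors' estimator, and the function name and the keV units are assumptions made for illustration.

```python
import math

ELECTRON_REST_ENERGY_KEV = 511.0  # m_e c^2, rounded

def compton_cone_angle(e_deposited_kev, e_scattered_kev):
    """Scattering angle (rad) from Compton kinematics.

    e_deposited_kev: energy given to the electron in the scatter.
    e_scattered_kev: energy of the photon after scattering.
    """
    e_initial = e_deposited_kev + e_scattered_kev
    cos_theta = 1.0 - ELECTRON_REST_ENERGY_KEV * (
        1.0 / e_scattered_kev - 1.0 / e_initial
    )
    # Allow for rounding noise, reject physically impossible energy pairs.
    if cos_theta < -1.0 - 1e-9 or cos_theta > 1.0 + 1e-9:
        raise ValueError("energies are not consistent with a single Compton scatter")
    return math.acos(max(-1.0, min(1.0, cos_theta)))
```

A single event only gives a cone, so several events from different poses are needed before the cones intersect near the source.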

2510.24680 2026-05-19 cs.RO (updated version)

COLSON: Controllable Social Navigation via Diffusion-Based Reinforcement Learning

Kohei Matsumoto, Yuki Tomita, Yuki Hyodo, Ryo Kurazume

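AI summary: This paper proposes a diffusion-based reinforcement learning method for social navigation. Its flexible action distribution improves the adaptability and controllability of navigation and allows adaptation to unseen scenarios.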

Comments ICRA 2026

English abstract

Mobile robot navigation in dynamic environments with pedestrian traffic is a key challenge in the development of autonomous mobile service robots. Recently, deep reinforcement learning-based methods have been actively studied and have outperformed traditional rule-based approaches owing to their optimization capabilities. Among these methods, those that assume continuous action spaces typically rely on Gaussian distributions, which limit the flexibility of the generated actions. In contrast, the application of diffusion models to reinforcement learning has advanced, enabling more flexible action distributions than Gaussian policy-based approaches. In this study, we apply a diffusion-based reinforcement learning approach to social navigation and validate its effectiveness. Furthermore, by exploiting the characteristics of diffusion models, we propose extensions that enable adaptation to previously unseen scenarios without additional training. As concrete scenario examples, we demonstrate adaptability to scenarios in which static obstacles exist in the environment that were not present during training, as well as scenarios in which the objective differs from training, such as accompanying target pedestrians while avoiding others to reach the destination.
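
The limitation of Gaussian policies can be illustrated with a toy example in which a robot may pass a pedestrian on either side. The sketch below fits a single Gaussian to bimodal steering samples; the fitted mean lands between the two modes, which is the action that heads straight at the pedestrian. The data are synthetic and the setup is an illustration, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expert" steering commands: pass left (-1) or right (+1).
modes = rng.choice([-1.0, 1.0], size=10_000)
actions = modes + 0.1 * rng.standard_normal(10_000)

# A unimodal Gaussian policy can only match mean and spread.
mu, sigma = actions.mean(), actions.std()
gaussian_samples = mu + sigma * rng.standard_normal(10_000)

# Fraction of samples that steer roughly straight ahead (|a| < 0.3).
straight_expert = np.mean(np.abs(actions) < 0.3)
straight_gaussian = np.mean(np.abs(gaussian_samples) < 0.3)
```

The expert data almost never steer straight ahead, while the fitted Gaussian does so for a sizeable share of its samples. A more expressive action distribution, such as one produced by a diffusion model, can keep the two modes separate.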

2605.18109 2026-05-19 cs.AI cs.CV cs.RO (updated version)

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li, Keming Wu, Zhiying Du, Weijie Wang, Haoxiao Wang, Shuang Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

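Affiliations: Tsinghua University; Microsoft Research Asia; Peking University; Fudan University; Zhejiang University

AI summary: This paper proposes TaskGround, a framework that improves full-scene household reasoning through structured task inference. It also introduces the FullHome evaluation suite, confirms the importance of inferring executable task structure in household scenes, and shows that compact local models can be effective in practical household deployment.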

Comments Project page: https://aaronfengzy.github.io/TaskGround/

English abstract

In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.

2605.18074 2026-05-19 cs.RO (updated version)

4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

Kane Qian, Xin Zhao, Yining Shi, Rujun Yan, Zhengqing Pan, Kaojin Zhu, Mengmeng Yang, Kai Sun, Diange Yang, Kun Jiang

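Affiliations: Tsinghua University; Hesai Technology Co., Ltd.

AI summary: This paper presents 4DLidarOpen, an autonomous-driving dataset built around 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. It includes point-wise radial velocity measurements, several types of Lidar, surround-view cameras, and 6-DOF vehicle poses. A hybrid annotation strategy combines large-scale auto-labeled data for training with human-refined annotations, and the dataset supports benchmarks for 3D object detection, bird's-eye-view segmentation and flow prediction, and motion forecasting.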

Comments 15 pages, 9 figures

English abstract

We present 4DLidarOpen, a large-scale open multi-modal dataset for autonomous driving, centered on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. Unlike conventional time-of-flight Lidar datasets that mainly provide geometric measurements, 4DLidarOpen includes point-wise radial velocity measurements from a forward-facing 4D FMCW Lidar, together with multiple Lidars of different types, including rotating, solid-state, and blind-spot variants, surround-view cameras, and 6-DOF ego-vehicle poses. The dataset was collected in complex urban environments in Beijing and covers dense pedestrian interactions, congested traffic, high-speed driving, and unprotected maneuvers. 4DLidarOpen provides synchronized multi-sensor data and 3D bounding-box annotations with persistent track IDs across five object categories. A hybrid annotation strategy is adopted, where large-scale auto-labeled data support scalable training and human experts refine annotations for the human-annotated training and validation sets. Based on this dataset, we establish benchmarks for 3D object detection, bird's-eye-view (BEV) segmentation and flow prediction, and motion forecasting with planning. Extensive experiments show that direct velocity measurements from 4D FMCW Lidar provide complementary motion cues for dynamic-scene understanding. Compared with geometric-only sensing, the velocity-aware representation improves motion-related perception and downstream forecasting and planning, especially in scenarios involving vulnerable road users and fast-moving objects. These results indicate that 4D FMCW Lidar is a promising sensing modality for motion-aware autonomous driving. The dataset and evaluation toolkit are publicly released to support research on 4D scene understanding, multi-Lidar fusion, and velocity-aware perception and planning.
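
The per-point measurement that distinguishes an FMCW Lidar can be sketched as follows. The sensor observes only the component of a point's relative velocity along the line of sight, so purely tangential motion produces no Doppler signal. The function name and the sign convention (positive when moving away from the sensor) are assumptions for illustration and are not taken from the dataset toolkit.

```python
import numpy as np

def radial_velocity(points, velocities):
    """Line-of-sight speed of each point, as seen from the sensor origin.

    points, velocities: arrays of shape (N, 3) in the sensor frame.
    Positive values mean the point moves away from the sensor (assumed convention).
    """
    points = np.asarray(points, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    ranges = np.linalg.norm(points, axis=1, keepdims=True)
    line_of_sight = points / ranges
    return np.sum(line_of_sight * velocities, axis=1)

pts = np.array([[10.0, 0.0, 0.0], [10.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
vel = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 0.0, 0.0]])
v_r = radial_velocity(pts, vel)
```

The second point moves sideways at 2 m/s yet has zero radial velocity, which is why the abstract describes the velocity channel as a complementary cue rather than a full motion estimate.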

2605.18059 2026-05-19 cs.RO (updated version)

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

Zhiyuan Zhang, Zhenghao Jin, Yanlun Peng, Xianda Guo, Haoran Liu, Shaofeng Zhang, Xingjun Ma, Zuxuan Wu, Junchi Yan, Xiaosong Jia, Yu-Gang Jiang

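Affiliations: Institute of Trustworthy Embodied AI (TEAI); Great Wall Motor; School of Computer Science & School of Artificial Intelligence, Shanghai Jiao Tong University; School of Computer Science, Wuhan University; University of Science and Technology of China

AI summary: This paper presents Bench2Drive-Robust, described by its authors as the first device-centric robustness benchmark for closed-loop end-to-end autonomous driving under realistic deployment perturbations. It evaluates the effect of three main sources of deployment-related perturbations on driving systems and reveals robustness challenges that conventional image-level corruption evaluations do not fully capture.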

English abstract

Robustness is a critical requirement for deploying autonomous driving systems in the real world. Existing robustness benchmarks for autonomous driving have made important progress in studying the effects of image-level corruptions, such as adverse weather or camera degradation, on perception modules and open-loop planning outputs. However, deployment can also involve system-level imperfections, such as inference latency and ego-state estimation errors, which remain less studied in closed-loop E2E-AD evaluation. These imperfections can accumulate through the feedback loop and destabilize control. In this work, we present Bench2Drive-Robust, to our knowledge the first device-centric robustness benchmark for closed-loop end-to-end autonomous driving under realistic deployment perturbations. We systematically evaluate deployment-oriented perturbations arising from three major sources: camera-stream failures (frame drop, partial observation), ego-state estimation errors (GPS noise, and speed or odometry errors), and compute-induced control delay (model inference delay). We evaluate representative end-to-end driving methods and analyze their robustness under different perturbation severities. Our results show that these deployment-related perturbations can substantially degrade closed-loop driving performance, revealing robustness challenges that are not fully captured by conventional image-level corruption evaluations. By establishing a closed-loop evaluation protocol and demonstrating the substantial impact of these deployment-oriented perturbations, Bench2Drive-Robust defines practical robustness problems for end-to-end autonomous driving and encourages further research on deployment-aware robust driving systems.
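
Two of the perturbation types listed above are simple to emulate in a closed loop. The sketch below delays control commands by a fixed number of steps and drops camera frames at random, reusing the last frame that arrived. The class and function names, the fallback command, and the hold-last-frame policy are hypothetical choices for illustration, not the benchmark's implementation.

```python
import random
from collections import deque

class DelayedControl:
    """Releases each command `delay_steps` ticks after it was computed."""

    def __init__(self, delay_steps, fallback):
        # Pre-fill so the vehicle receives `fallback` until the first command arrives.
        self._queue = deque([fallback] * delay_steps)

    def step(self, new_command):
        self._queue.append(new_command)
        return self._queue.popleft()

def drop_frames(frames, drop_prob, seed=0):
    """Replace dropped frames with the most recent delivered frame."""
    rng = random.Random(seed)
    delivered, last = [], None
    for frame in frames:
        # The first frame is always delivered so there is something to hold.
        if last is None or rng.random() >= drop_prob:
            last = frame
        delivered.append(last)
    return delivered

delayed = DelayedControl(delay_steps=2, fallback="brake")
applied = [delayed.step(cmd) for cmd in ["a", "b", "c", "d"]]
```

With a two-step delay the vehicle executes the fallback twice and then always acts on a command computed from an observation that is two ticks old, which is how such errors can build up through the feedback loop.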

2605.18045 2026-05-19 cs.RO cs.AI (updated version)

Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help?

Johannes A. Gaus, Jhon P. F. Charaja, Daniel Haeufle

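Affiliations: Hertie Institute for Clinical Brain Research & Center for Integrative Neuroscience, University of Tübingen

AI summary: This paper studies the role of uncertainty in robot autonomy decisions and finds that, once the base model is competent, simple uncertainty proxies suffice for selective gating but not for semantic novelty detection.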

Comments ICRA 2026 workshop paper

English abstract

Robotic systems often use predictive uncertainty to decide whether to act autonomously or defer to a fallback policy. In threshold-gated autonomy, uncertainty matters mainly through its ability to rank likely errors. Standard metrics such as expected calibration error and AUROC do not directly test whether uncertainty changes act/defer decisions. We therefore evaluate uncertainty using Spearman rank correlation, paired bootstrap equivalence testing, and act/defer agreement. Across three temporal activity-recognition benchmarks, we find a dataset-dependent competence regime below which uncertainty provides a weak and unstable error ranking. Above this regime, softmax heuristics, MC Dropout, and ensembles produce similar gating behavior, while threshold choice has a much larger effect on execution outcomes. A multi-seed embodied simulation shows the same pattern for collision rate and cost once realized autonomy is matched. Under temporal covariate shift, ranking quality remains stable, but fine-grained semantic OOD detection remains near chance. These results suggest that simple uncertainty proxies can suffice for selective gating once the base model is competent, but not for semantic novelty detection.
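
The act/defer agreement idea can be sketched in a few lines. Two uncertainty scores are each thresholded so that the same fraction of inputs is handled autonomously, and agreement is the share of inputs on which both scores make the same decision. The function names and the quantile-based threshold are assumptions for illustration; the paper's exact protocol may differ.

```python
import numpy as np

def gate(uncertainty, autonomy_rate):
    """Act on the `autonomy_rate` fraction of inputs with the lowest uncertainty."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    threshold = np.quantile(uncertainty, autonomy_rate)
    return uncertainty <= threshold  # True = act, False = defer

def act_defer_agreement(unc_a, unc_b, autonomy_rate):
    """Fraction of inputs on which two uncertainty scores gate identically."""
    return float(np.mean(gate(unc_a, autonomy_rate) == gate(unc_b, autonomy_rate)))

# Toy scores: `ensemble_var` is a monotone transform of `softmax_unc`,
# so both rank the inputs the same way and gate identically.
softmax_unc = np.array([0.05, 0.40, 0.10, 0.70, 0.20, 0.90])
ensemble_var = softmax_unc ** 2
agreement = act_defer_agreement(softmax_unc, ensemble_var, autonomy_rate=0.5)
```

Because gating depends only on the ranking, two estimators with very different calibration can still produce identical act/defer decisions, which is the reason for measuring agreement directly rather than relying on calibration error.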

2605.18026 2026-05-19 cs.RO (updated version)

Scenario Generation in Roundabouts with Adjustable Interaction Intensity

Li Li, Till Temmen, Tobias Brinkmann, Björn Krautwig, Markus Eisenbarth, Jakob Andert

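Affiliations: Chair of Mechatronics in Mobile Propulsion

AI summary: This paper proposes a roundabout scenario generator with adjustable interaction intensity. It decouples geometric routes from temporal progress profiles, maps them to latent codes with pretrained autoencoders, and generates scenarios with a Wasserstein generative adversarial network. The approach improves timing-latent fidelity and the plausibility of interaction responses, and makes safety testing more controllable and scalable.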

English abstract

Roundabouts, characterized by frequent merging and yielding interactions, remain a safety-critical corner case for the development and testing of intelligent driving functions. However, extracting sufficient near-critical scenarios from naturalistic data is inefficient. Most existing scenario generation methods provide limited controllability over interaction intensity and criticality, making systematic safety testing and detailed analysis difficult. This paper presents an interaction-aware roundabout scenario generator with continuously adjustable interaction intensity. Geometric routes and temporal progress profiles are first decoupled and mapped to latent codes using pretrained autoencoders. Conditional latent generation is then performed with Wasserstein Generative Adversarial Networks (WGAN) to generate scenarios. Yielding is modeled as a controllable timing intervention via a compact yield code during the approach-to-entry segment, where interaction intensity is modulated by scaling the code with a factor $\lambda$. Results demonstrate enhanced timing-latent fidelity and plausible interaction responses compared to a baseline model. Under criticality-calibrated scaling, increasing $\lambda$ expands the safety margin, providing a scalable and controlled testing mechanism.

2605.17984 2026-05-19 eess.IV cs.CV cs.RO (updated version)

See Silhouettes in Motion with Neuromorphic Vision

Pei Zhang, Shijie Lin, Zhou Ge, Jinpeng Chen, Wei Pu

Affiliations: School of Electrical Engineering, Guangxi University; Department of Computer Science, The University of Hong Kong; School of Mechatronic Engineering and Automation, Shanghai University; SHU General Intelligent Robotics Research Institute; School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications; School of Information and Communication Engineering, University of Electronic Science and Technology of China

AI summary: This paper proposes a dual-modal method that exploits the synergy between frames and events to achieve real-time, high-frame-rate binarization on CPU-only devices. It reduces motion blur, improves performance under harsh lighting, and paves the way for lightweight perception and interaction on resource-constrained edge platforms.

Comments 12 pages, 12 figures, and 3 tables. This work is under review. Project page: https://github.com/pz-even/event_binarization

English abstract

Quasi-bimodal objects, such as text, road signs, and barcodes, play a basic yet vital role in daily visual communication. By boiling these down to clear silhouettes, binarization uses a minimal language to convey essential vision cues for maximum downstream efficiency. The catch is that frame-based imaging often struggles on mobile platforms like drones, self-driving cars, and underwater vehicles. In these dynamic scenes, rapid motion and harsh lighting can blind the sensor, causing severe motion blur and erasing crucial details. To overcome these limits, neuromorphic vision via event cameras, featuring microsecond-level temporal resolution and high dynamic range, steps in as a natural solution. Building upon this event-driven sensing paradigm, we introduce a simple yet effective dual-modal approach that harnesses the synergy between frames and events to achieve real-time, high-frame-rate binarization on CPU-only devices. Extensive evaluations show that it performs competitively with leading techniques in reducing motion blur, while delivering marked improvements under challenging illumination. In addition, our asynchronous workflow bypasses the event scarcity that breaks traditional time-binning reconstruction, maintaining clear target shapes even at extreme kilohertz frame rates. Its binary results further serve as reliable representations that facilitate a range of downstream tasks. This work paves the way towards lightweight perception and interaction in embodied intelligence on resource-constrained edge platforms.
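
A generic way to combine a frame with events before binarizing is sketched below. It uses the common event-camera model in which each event signals a fixed step in log intensity at one pixel, applies the events that arrived after the frame, and then thresholds the result. This is an illustration of frame-event fusion in general, not the authors' algorithm; the function name, the contrast step, and the fixed threshold are assumptions.

```python
import numpy as np

def binarize_with_events(frame, events, contrast=0.2, threshold=0.5):
    """Update a frame with later events, then threshold it.

    frame: 2D array of intensities in (0, 1].
    events: iterable of (row, col, polarity) with polarity +1 or -1.
    contrast: assumed log-intensity step per event (hypothetical value).
    """
    log_img = np.log(np.clip(np.asarray(frame, dtype=float), 1e-3, None))
    for row, col, polarity in events:
        log_img[row, col] += polarity * contrast
    return np.exp(log_img) >= threshold

frame = np.array([[0.9, 0.9], [0.2, 0.2]])
# Four negative events darken the top-left pixel after the frame was taken.
events = [(0, 0, -1)] * 4
binary = binarize_with_events(frame, events)
```

Each event is applied as soon as it arrives, so the binary image can be refreshed between frames without waiting to accumulate a fixed time window of events.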

2605.17928 2026-05-19 cs.RO cs.LG (updated version)

Transfer Learning for Customized Car Racing Environments

Benedict Florance Arockiaraj, Richard Chang, Wesley Yee

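Affiliations: seas

AI summary: This paper studies transfer learning in deep reinforcement learning. An agent is trained on a single circuit and then transferred zero-shot or fine-tuned to obtain faster lap times in other customized car racing environments, and model-based and model-free methods are compared.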

English abstract

Transfer learning, a technique in which a model or agent uses the knowledge it gained on one task to solve another closely related task, is often used to tackle problems in deep learning. In this project, we explore transfer learning in the context of deep reinforcement learning. Specifically, we use transfer learning to achieve fast lap times in OpenAI's car racing environment by training the agent on one circuit and racing it on other customized target environments, either by zero-shot transfer or with additional fine-tuning. In addition, we compare model-based and model-free approaches and observe that model-based approaches perform better and converge faster than model-free approaches in this environment. We observe that, in most setups, transfer learning not only boosts performance on the target domain but also yields high performance during learning.

2605.17927 2026-05-19 cs.RO (updated version)

Learning-Based Adaptive Control for Surgical Robotic Exposure Task on Deformable Tissues

Jiayi Liu, Kaiqi Wei, Yiwei Wang, Huan Zhao, Han Ding

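Affiliations: Huazhong University of Science and Technology

AI summary: This paper proposes a learning-based adaptive control framework for autonomous tissue retraction in surgery, a task made difficult by the irregular geometry and nonlinear biomechanical properties of overlying tissue and by limited visibility. The framework optimizes control inputs online, uses a deep deformation estimation model, and achieves zero-shot adaptation.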

Comments Accepted to Robotics: Science and Systems (RSS) 2026. 12 pages, 9 figures

English abstract

In various surgical procedures, regions of interest (ROIs) such as organs or lesions are often occluded by overlying tissues, requiring surgeons to achieve adequate exposure for precise intervention. However, the irregular geometry, nonlinear biomechanical properties of overlying tissues, and limited intraoperative visibility of the ROI pose significant challenges to the autonomous execution of tissue retraction. To address this, we formulate a realistic model of the tissue retraction task and propose a learning-based adaptive control framework for achieving ROI exposure. The method optimizes control inputs online by monitoring changes in the visual boundary of the tissue, while leveraging a deep deformation estimation model trained on simulation data to identify the optimal grasping point and ensure the convergence and safety of the adaptive controller. Through simulations and real-world experiments on different deformable materials, it has been demonstrated that this framework exhibits zero-shot adaptation to similar tasks and can complete the autonomous retraction process, from initial grasp selection to full ROI exposure. Therefore, it has the potential to be applied in actual surgical assistance scenarios.

2605.17912 2026-05-19 cs.RO cs.CV (updated version)

Visual Sculpting: A Visually Aligned Planning Representation for Long-Horizon Robotic Clay Sculpting

Peter Schaldenbrand, Jean Oh

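Affiliations: The Robotics Institute, Carnegie Mellon University

AI summary: This paper proposes a visually aligned planning representation for long-horizon robotic clay sculpting. By capturing lighting and texture features, it improves the modeling of deformable-material dynamics, and its performance is demonstrated with several deformable materials and end-effectors.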

Comments 8 pages, 14 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

DOI: 10.1109/LRA.2026.3673896

English abstract

Clay sculpting is a nuanced, artistic task involving dexterous manipulation with long-horizon planning to achieve high-level goals. As a robotics problem, we formulate clay sculpting as a shape-to-shape matching challenge. Prior work on deformable object manipulation either requires retraining a policy for each goal or relies on dynamics models that represent state as sparse point clouds, which do not capture important clay features such as texture well. We present a method for modeling the dynamics of deformable materials and planning for robotic sculpting in a visually aligned representation that captures lighting and texture features. With three different deformable materials and various end-effectors, we demonstrate that our dynamics model performs comparably to the state of the art, with the added benefit of being compatible with visual planning. Our actions are represented as parametrized pushes into the clay with a single end-effector, which proved suitable for long-horizon (>100 actions) clay relief sculptures. Lastly, we show the benefits of planning in a visually aligned representation and also analyze why this representation is harder to plan in than 3D representations.

2605.17533 2026-05-19 eess.SY cs.RO cs.SY (updated version)

Distributed 3D Leader-Follower Formation Control with Field-of-View Safety via Control Barrier Functions

Immanuel R. Santjoko, Richie R. Suganda, Miao Pan, Bin Hu

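Affiliations: Department of Electrical and Computer Engineering, University of Houston; Department of Engineering Technology, University of Houston

AI summary: This paper proposes a distributed 3D leader-follower formation (3D-LFF) control framework for multi-UAV systems that achieves formation tracking while enforcing perception safety constraints. It combines a relative kinematic model and a distributed controller with a safety filter based on a control barrier function quadratic program, which keeps the leader inside the follower's camera field of view and yields accurate formation tracking with effective field-of-view enforcement.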

Comments 9 pages

English abstract

This letter proposes a distributed 3D leader-follower formation (3D-LFF) control framework for multi-UAV systems that achieves formation tracking while enforcing perception safety constraints. Maintaining safe, vision-based 3D-LFF is challenging because onboard cameras impose strict Field-of-View (FOV) limitations, and demanding formation commands can drive the leader outside the follower's camera frustum, resulting in loss of visibility. To address this issue, we develop a perception-aware safe control architecture that guarantees visibility by construction. First, we derive a relative kinematic model in a line-of-sight coordinate representation and design a distributed 3D-LFF tracking controller using only locally available relative states. Next, we embed the nominal formation controller within a Control Barrier Function-based Quadratic Program (CBF-QP) safety filter that minimally modifies the commanded velocities to maintain the leader inside the follower's camera frustum while preserving formation tracking whenever feasible. Gazebo simulations and Crazyflie hardware experiments validate the proposed approach, demonstrating accurate formation tracking and effective FOV enforcement, including scenarios in which the nominal desired formation conflicts with visibility constraints.
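
For a single affine barrier constraint, a CBF-QP safety filter has a closed-form solution, which the sketch below implements. It solves min ||u - u_nom||^2 subject to a @ u >= b, where in a CBF setting a would come from the gradient term of the barrier derivative and b would collect the drift and class-K terms. How the authors build a and b from the field-of-view barrier is not reproduced here; the function name and the numbers are illustrative assumptions, and a is assumed to be nonzero.

```python
import numpy as np

def cbf_qp_filter(u_nom, a, b):
    """Closest command to u_nom that satisfies the constraint a @ u >= b.

    Closed-form solution of the single-constraint QP
        min ||u - u_nom||^2  s.t.  a @ u >= b.
    """
    u_nom = np.asarray(u_nom, dtype=float)
    a = np.asarray(a, dtype=float)
    violation = b - a @ u_nom
    if violation <= 0.0:
        return u_nom  # nominal command is already safe
    # Project onto the constraint boundary along the direction of a.
    return u_nom + (violation / (a @ a)) * a

# Nominal velocity command that would violate the (made-up) constraint.
u_safe = cbf_qp_filter(u_nom=[1.0, 0.0, 0.0], a=[-1.0, 0.0, 0.0], b=-0.5)
```

The filter leaves safe commands untouched and otherwise moves the command the minimum distance needed to satisfy the constraint, which matches the "minimally modifies the commanded velocities" behavior described in the abstract. With several simultaneous constraints a numerical QP solver would be needed instead.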

2605.17522 2026-05-19 cs.RO (updated version)

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

Sixu Lin, Junliang Chen, Huaiyuan Xu, Zhuohao Li, Guangming Wang, Yixiong Jing, Sheng Xu, Runyi Zhao, Brian Sheil, Lap-Pui Chau, Guiliang Liu

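Affiliations: School of Data Science, The Chinese University of Hong Kong (Shenzhen); The Hong Kong Polytechnic University; University of Cambridge; Shenzhen Loop Area Institute

AI summary: This paper presents RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. It enables efficient real-time flow-guided robotic manipulation and improves manipulation success rates and computational efficiency.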

English abstract

Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited real-time performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end-to-end framework, RoboFlow4D directly predicts multi-frame 3D flows from visual observations and textual instructions, providing explicit flow-based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation-planning-execution closed loop. Through slow-fast collaboration between flow prediction and action control, RoboFlow4D enables real-time and resource-efficient manipulation. Extensive experiments in both simulation and real-world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow-guided planning for embodied intelligence.

2605.17517 2026-05-19 cs.RO (updated version)

AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

Weijie Kong, Zhian Su, Wei Yu, Huixu Dong

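Affiliations: Grasp Lab, School of Mechanical Engineering of Zhejiang University

AI summary: This paper proposes AffordVLA, a framework that injects manipulation-centric affordance representations into vision-language-action models through implicit feature alignment in order to improve action accuracy. Experiments show that it outperforms existing methods in simulation and in the real world.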

Comments 13 pages, 10 figures

English abstract

Recent advances in Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. However, the visual representations of most VLA models are often dominated by global object appearance and struggle to focus on task-relevant functional interaction regions, which limits their robustness in unstructured environments. Existing affordance-based methods typically rely on explicit mask injection or external perception modules, requiring additional annotations while introducing cascading perception errors and inference overhead. To address these limitations, we propose AffordVLA, an affordance-enhanced VLA framework that internalizes manipulation-centric affordance perception into VLA visual representations through implicit representation alignment. Specifically, we construct a zero-shot affordance teacher to extract task-conditioned affordance visual representations from RGB observations and language instructions. AffordVLA aligns the intermediate visual representations of the VLA with the affordance visual representations extracted by the teacher, thereby implicitly injecting manipulation-centric affordance perception into VLA visual representations and improving action accuracy. Extensive simulation and real-world experiments demonstrate that AffordVLA and its affordance teacher achieve state-of-the-art performance and outperform strong baselines. Ablation analyses show that AffordVLA effectively reshapes VLA visual representations while preserving inference efficiency, leading to improved manipulation success rates and training efficiency.
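
An alignment objective of the kind described above can be sketched as a cosine-similarity loss between student and teacher features. The sketch assumes both feature sets have already been projected to the same shape (tokens x channels); the loss form and the function name are assumptions, since the abstract does not specify the exact objective.

```python
import numpy as np

def alignment_loss(student_feats, teacher_feats, eps=1e-8):
    """Mean (1 - cosine similarity) over tokens; 0 when features are parallel."""
    s = np.asarray(student_feats, dtype=float)
    t = np.asarray(teacher_feats, dtype=float)
    s = s / (np.linalg.norm(s, axis=-1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

teacher = np.array([[1.0, 0.0], [0.0, 2.0]])
loss_same = alignment_loss(2.0 * teacher, teacher)   # parallel features
loss_flip = alignment_loss(-teacher, teacher)        # opposite features
```

Because the teacher is only needed to compute this loss during training, an objective of this form adds no modules at inference time, which is consistent with the abstract's claim that inference efficiency is preserved.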

2605.17486 2026-05-19 cs.RO cs.LG 版本更新

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

DyGRO-VLA: 通过动态分组残差优化实现跨任务的视觉-语言-动作模型扩展

Sixu Lin, Yunpeng Qing, Litao Liu, Ming Zhou, Ruixing Jin, Xiaoyi Fan, Guiliang Liu

发表机构 * School of Data Science, The Chinese University of Hong Kong (Shenzhen)（香港中文大学（深圳）数据科学学院）； Shenzhen Loop Area Institute（深圳河套学院）； Zhejiang University（浙江大学）； Rutgers University-New Brunswick（罗格斯大学新布朗斯维克分校）； Shanghai AI Laboratory（上海人工智能实验室）； Jiangxing Intelligence Technology Inc.（江行智能科技有限公司）

AI总结本文提出DyGRO-VLA，一种通过动态分组残差优化实现跨任务视觉-语言-动作模型扩展的两阶段优化框架，旨在提升模型的泛化能力。

AI中文摘要

最近在强化学习（RL）方面的进展提供了一种系统的方法来优化视觉-语言-动作（VLA）模型，推动了从轨迹模仿到任务环境中的主动学习的转变。尽管在控制精度上有所改进，大多数RL优化器仍然任务特定，这使VLA模型从通用控制器退化为过度拟合狭窄任务集的策略。在本研究中，我们深入分析了这一现象，并强调了跨任务特征表示对提高VLA模型泛化能力的重要性。受这一发现的启发，我们引入了DyGRO-VLA，一种两阶段优化框架，1）基于信息论原理有效地捕捉跨任务潜在表示，2）通过混合的RL残差动态优化策略。DyGRO-VLA使RL优化器能够在优化过程中利用任务相关的潜在信息，同时战略性地减轻对学习表示的不利干扰。我们在LIBERO、RoboTwin2基准以及现实世界中评估了我们的方法，证明了在多任务训练和分布偏移下，与强基线相比，我们的方法具有持续的改进。

英文摘要

Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.

2605.17477 2026-05-19 cs.RO 版本更新

Yixu Feng, Zinan Zhao, Yanxiang Ma, Chenghao Xia, Chengbin Du, Yunke Wang, Chang Xu

发表机构 * The University of Sydney（悉尼大学）； City University of Hong Kong（香港城市大学）

AI总结本文提出了一种基于可微网格采样的视觉-语言-动作模型压缩方法，通过连续的token重采样保留关键空间信息，仅用不到10%的原始视觉token即可将FLOPs减少76%，且成功率不下降。

Journal ref: Proceedings of the Forty-third International Conference on Machine Learning, 2026

AI中文摘要

视觉-语言-动作（VLA）模型在机器人操作中展现出显著潜力，但其高昂的计算成本阻碍了实时部署。现有token剪枝方法面临根本性的权衡：通过剪枝进行激进压缩不可避免地会丢弃接触点等关键几何细节，导致性能严重下降。这迫使人们做出妥协，限制了可达到的压缩率，进而限制了潜在的加速效果。我们认为，要打破这种权衡，需要将压缩重新理解为视觉编码器中几何感知的连续token重采样。为此，我们提出了可微网格采样器（GridS），一个即插即用的模块，用于在VLA中对视觉token进行任务感知的连续重采样。通过自适应地预测最小的显著坐标集合，并利用可微插值提取特征，GridS在保留关键空间信息的同时实现了大幅压缩（所用视觉token不足原始数量的10%）。在LIBERO基准和真实机器人平台上的实验表明，GridS在成功率不下降的情况下将FLOPs减少了76%，验证了迄今报道的最低可行视觉token数量。代码可在https://github.com/Fediory/Grid-Sampler获取。

英文摘要

Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLAs. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (using fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in the success rate, validating the lowest feasible visual token count reported to date. The code is available at https://github.com/Fediory/Grid-Sampler.
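The phrase "extracting features via differentiable interpolation" can be made concrete with a small sketch of bilinear sampling at continuous coordinates. This is an assumption about the kind of interpolation involved; the abstract does not say which scheme GridS uses, and the names `bilinear_sample` and `resample_tokens` and the 2x2 toy grid are hypothetical. The output is piecewise linear in the sampling coordinates, which is what lets gradients flow back to a coordinate predictor.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Sample a 2D grid of scalars at continuous coordinates (x, y).

    x indexes columns and y indexes rows, both in grid-cell units.
    """
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so the four neighbours always exist.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two rows, then along y between them.
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom

def resample_tokens(feature_map, coords):
    """Replace the dense grid by one interpolated value per predicted coordinate."""
    return [bilinear_sample(feature_map, x, y) for x, y in coords]

grid = [[0.0, 1.0],
        [2.0, 3.0]]
tokens = resample_tokens(grid, [(0.5, 0.5), (1.0, 0.0)])
```

Deep learning frameworks expose the same operation on tensors, for example `torch.nn.functional.grid_sample` in PyTorch; whether GridS builds on that call is not stated in the abstract.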

2605.07308 2026-05-19 cs.RO 版本更新

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

AT-VLA: 用于增强视觉-语言-动作模型反馈反应的自适应触觉注入

Xiaoqi Li, Muhe Cai, Jiadong Xu, Juan Zhu, Hongwei Fan, Yan Shen, Guangrui Ren, Hao Dong

发表机构 * School of Computer Science, Peking University（北京大学计算机学院）； PrimeBot ； PKU Lab（北京大学实验室）

AI总结本文提出AT-VLA，一种自适应触觉注入机制，通过动态决定触觉注入的时间和位置，减少对预训练表示的干扰，同时引入触觉反应双流机制，实现快速准确的触觉响应，以提高视觉-语言-动作模型在接触丰富操作任务中的表现。

AI中文摘要

视觉-语言-动作（VLA）模型在增强机器人代理执行多样化任务的能力方面取得了显著进展；然而，它们仍然面临在需要精确物理交互的接触丰富操作场景中的挑战。为了解决这一限制，最近的研究尝试在下游任务中整合触觉信号，使预训练的VLA能够解释触觉反馈。然而，在微调过程中引入新的模态，这些模态在预训练阶段很少出现，可能会破坏VLA的预训练能力。此外，VLA固有的缓慢推理速度会阻碍实时响应，并限制触觉反馈在动作调整中的有效利用。为克服这些挑战，我们提出了自适应触觉视觉-语言-动作（AT-VLA），引入了新颖的自适应触觉注入机制。该机制动态确定触觉注入的合适时间和位置，在显著促进动作生成时才进行注入，从而最小化对预训练表示的干扰。此外，为了实现快速准确的触觉响应，我们提出了触觉反应双流机制，将感知处理分为一个慢的视觉-语言流用于低频感知推理和一个快的触觉控制流用于高频物理交互理解，从而在0.04秒内实现实时闭环响应。现实世界实验彻底验证了AT-VLA在接触丰富操作任务中的有效性。项目页面可在：https://sites.google.com/view/at-vla。

英文摘要

Vision-Language-Action (VLA) models have significantly advanced the capabilities of robotic agents in executing diverse tasks; however, they still face challenges in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning that are rarely present in the pretraining stage may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment. To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating tactile signals only when they significantly contribute to action generation, thereby minimizing interference with pretrained representations. Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow vision-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time closed-loop responses within 0.04 s. Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks. The project page is available at: https://sites.google.com/view/at-vla.

2604.09609 2026-05-19 cs.AI cs.RO 版本更新

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

通用大语言模型作为人类驾驶员行为模型：以简化汇入场景为例

Samir H. A. Mohammad, Wouter Mooi, Arkady Zgonnikov

发表机构 * Department of Transport and Planning, Delft University of Technology（代尔夫特理工大学交通与规划系）； Department of Cognitive Robotics（认知机器人学系）

AI总结本文研究了通用大语言模型在模拟人类驾驶员行为中的应用，将两个通用大语言模型嵌入简化的一维汇入场景，并与人类数据进行定量和定性比较，发现模型能再现人类的间歇性操作控制和对空间线索的战术依赖，但无法一致地捕捉人类对动态速度线索的反应，且安全表现在模型间差异显著，提示未来需进一步研究其失效模式，以确保其作为人类驾驶行为模型的有效性。

Comments To be published in proceedings of IEEE ITSC 2026

AI中文摘要

人类行为模型在自动驾驶车辆（AV）的虚拟安全评估中至关重要，既可作为行为参考，也可用于模拟人类交通参与者，但当前模型面临可解释性与灵活性之间的权衡。通用大语言模型（LLM）提供了一种有前景的替代方案：单个模型有望无需参数拟合即可部署于各种场景。然而，LLM能够以及不能捕捉人类驾驶行为的哪些方面，目前仍不清楚。为填补这一空白，我们将两个通用LLM（OpenAI o3和Google Gemini 2.5 Pro）作为独立的闭环驾驶员智能体嵌入简化的一维汇入场景，并通过定量和定性分析将其行为与人类数据进行比较。两个模型都能再现类人的间歇性操作层控制以及对空间线索的战术层依赖。然而，二者都无法一致地捕捉人类对动态速度线索的反应，且两个模型的安全表现差异显著。系统性的提示消融研究表明，提示的各组成部分起到模型特定归纳偏置的作用，无法在不同LLM之间迁移。这些发现表明，通用LLM有潜力在AV评估流程中充当独立、开箱即用的人类行为模型，但仍需进一步研究以更好地理解其失效模式，并确保其作为人类驾驶行为模型的有效性。

英文摘要

Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.

2603.23672 2026-05-19 cs.RO cs.CV 版本更新

CoLA-Flow Policy: 通过连续潜在动作流匹配实现机器人操作的时序一致模仿学习

Wu Songwei, Jiang Zhiduo, Sun Wandong, Xie Guanghu, Zhao Rui, Liu Hong, Liu Yang

AI总结本文提出CoLA-Flow Policy，一种基于连续潜在动作空间的轨迹级模仿学习框架，通过学习显式的潜在空间流，解耦全局运动结构与低层控制噪声，从而实现平滑可靠的长时程执行，并结合几何感知点云条件和执行时多模态调节，提升现实环境的鲁棒性。

Comments 9 pages, 9 figures

AI中文摘要

学习长时程的机器人操作需要同时实现表达能力强的行为建模、实时推理和稳定执行，这对现有的生成式策略而言仍具挑战性。基于扩散的方法具有强大的建模能力，但推理延迟较高；流匹配方法能够实现快速、近单步的生成，但直接在原始动作空间中操作时往往执行不稳定。我们提出了连续潜在动作流策略（CoLA-Flow Policy），一种在连续潜在动作空间中执行流匹配的轨迹级模仿学习框架。通过将动作序列编码为时间一致的潜在轨迹，并学习显式的潜在空间流，CoLA-Flow Policy 将全局运动结构与低层控制噪声解耦，从而实现平滑且可靠的长时程执行。该框架进一步集成了几何感知的点云条件和执行时的多模态调制，并以视觉线索作为代表性模态来增强真实环境中的鲁棒性。仿真和真实机器人实验表明，CoLA-Flow Policy 实现了近单步推理，相比原始动作空间的流匹配基线，轨迹平滑度最高提升93.7%，任务成功率最高提升25个百分点，同时仍明显快于基于扩散的策略。

英文摘要

Learning long-horizon robotic manipulation requires jointly achieving expressive behavior modeling, real-time inference, and stable execution, which remains challenging for existing generative policies. Diffusion-based approaches offer strong modeling capacity but incur high inference latency, while flow matching enables fast, near-single-step generation yet often suffers from unstable execution when operating directly in the raw action space. We propose Continuous Latent Action Flow Policy (CoLA-Flow Policy), a trajectory-level imitation learning framework that performs flow matching in a continuous latent action space. By encoding action sequences into temporally coherent latent trajectories and learning an explicit latent-space flow, CoLA-Flow Policy decouples global motion structure from low-level control noise, enabling smooth and reliable long-horizon execution. The framework further integrates geometry-aware point cloud conditioning and execution-time multimodal modulation, using visual cues as a representative modality to enhance real-world robustness. Experiments in simulation and on real robots show that CoLA-Flow Policy achieves near-single-step inference, improves trajectory smoothness by up to 93.7% and task success by up to 25 percentage points over raw action-space flow baselines, while remaining significantly faster than diffusion-based policies.

2601.18442 2026-05-19 cs.RO 版本更新

SG-CADVLM: A Context-Aware Decoding Powered Vision Language Model for Safety-Critical Scenario Generation

SG-CADVLM: 一种基于上下文感知解码的视觉语言模型，用于安全关键场景生成

Hongyi Zhao, Shuo Wang, Qijie He, Ziyuan Pu

发表机构 * School of Transportation, Southeast University（东南大学交通学院）

AI总结本文提出SG-CADVLM，一种结合上下文感知解码的多模态输入处理框架，用于从事故报告中生成高保真的安全关键场景，通过减少视觉语言模型的幻觉并同时生成道路几何和车辆轨迹，提升了生成场景的准确性和实用性。

AI中文摘要

自动驾驶车辆（AV）需要在安全关键场景中进行严格测试以完成安全验证，但这种验证受限于实地测试的高昂成本，以及现有仿真对罕见安全关键事件的保真度不足。碰撞报告提供了关于真实事故动态的丰富而可靠的描述，是大语言模型和视觉语言模型生成高保真场景的有前景的资源。然而，现有模型由于上下文抑制，常常偏离实际事故特征。为了解决这些局限，本文提出了SG-CADVLM，一个将上下文感知解码与多模态输入处理相结合的框架，用于从碰撞报告中生成安全关键场景。该框架在同时生成道路几何和车辆轨迹的过程中减轻了VLM的幻觉。实验结果表明，SG-CADVLM生成的场景中关键与高风险场景合计占比达88.1%，而基线方法为31.2%，相对提升182%，同时还能生成可用于自动驾驶测试的可执行仿真。

英文摘要

Autonomous vehicles (AVs) require rigorous testing in safety-critical scenarios for safety validation, yet this validation is hindered by the high cost of field testing and the lack of fidelity in current simulations of rare safety-critical events. Crash reports offer rich and authentic specifications of real-world accident dynamics, making them a promising resource for Large Language Models and Vision-Language Models (VLMs) to generate high-fidelity scenarios. However, existing models frequently deviate from actual accident characteristics due to context suppression. To address these limitations, this paper presents SG-CADVLM, a framework integrating Context-Aware Decoding with multimodal input processing to generate safety-critical scenarios from crash reports. The framework mitigates the hallucination of VLMs while generating road geometry and vehicle trajectories simultaneously. The experimental results demonstrate that SG-CADVLM generates combined critical and high-risk scenarios at a rate of 88.1%, compared to 31.2% for the baseline methods, representing a 182% improvement, while producing executable simulations for autonomous vehicle testing.
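The reported 182% figure is a relative improvement, not a difference in percentage points, which a short calculation makes explicit. The helper name `relative_improvement` is introduced here for illustration only; the two rates are the ones quoted in the abstract.

```python
def relative_improvement(new_rate, baseline_rate):
    """Relative improvement of new_rate over baseline_rate, in percent."""
    return (new_rate - baseline_rate) / baseline_rate * 100.0

# Rates of combined critical and high-risk scenarios quoted in the abstract.
improvement = relative_improvement(88.1, 31.2)   # about 182.4 percent
absolute_gain = 88.1 - 31.2                      # about 56.9 percentage points
```

Read this way, the method yields such scenarios a little under three times as often as the baselines, a gain of roughly 57 percentage points.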

2601.01155 2026-05-19 cs.RO 版本更新

ORION: Option-Regularized Deep Reinforcement Learning for Cooperative Multi-Agent Online Navigation

ORION：用于合作多智能体在线导航的选项正则化深度强化学习

Shizhe Zhang, Jingsong Liang, Zhitao Zhou, Shuhan Ye, Yizhuo Wang, Ming Siang Derek Tan, Jimmy Chiun, Yuhong Cao, Guillaume Sartoretti

发表机构 * Department of Mechanical Engineering, College of Design and Engineering, National University of Singapore（机械工程系，设计与工程学院，新加坡国立大学）

AI总结该研究提出ORION框架，通过选项批评方法和双阶段合作策略，解决部分已知环境中多智能体导航的路径最优与环境信息收集之间的平衡问题，实现高效实时协作。

AI中文摘要

现有的多智能体导航方法通常假设环境完全已知，对先验地图过时或不完善的部分已知场景（如仓库或工厂车间）支持有限。在这类场景中，智能体需要在路径最优性与收集并共享环境信息以帮助队友到达各自目标之间取得平衡。为此，我们提出了ORION，一种用于部分已知环境中合作多智能体在线导航的新型深度强化学习框架。从不完善的先验地图出发，ORION训练智能体进行去中心化决策、朝各自目标协调行动，并在感知-动作闭环中通过在线共享观测主动降低与任务相关的地图不确定性。我们首先设计了一个共享图编码器，将先验地图与在线感知融合为统一的表示，在环境存在差异时仍能提供鲁棒的状态嵌入。ORION的核心是一个option-critic框架，它学习高层合作模式并将其转化为低层动作序列，使智能体能够在个体导航与团队层面的探索之间自适应切换。我们进一步引入了双阶段合作策略，使智能体能够在地图不确定的情况下协助队友，从而缩短总体完工时间（makespan）。在大量迷宫式地图和大规模仓库环境中，ORION实现了高质量的实时去中心化协作，可扩展到多达10个机器人，并优于最先进的经典方法和基于学习的基线。最后，我们在实体机器人团队上验证了ORION，证明了其在真实世界协作导航中的鲁棒性和实用性。

英文摘要

Existing methods for multi-agent navigation typically assume fully known environments, offering limited support for partially known scenarios with outdated or imperfect prior maps, such as warehouses or factory floors. There, agents need to balance path optimality with collecting and sharing environmental information to help teammates reach their own targets. To this end, we propose ORION, a novel deep reinforcement learning framework for cooperative multi-agent online navigation in partially known environments. Starting from an imperfect prior map, ORION trains agents to make decentralized decisions, coordinate toward individual targets, and actively reduce task-relevant map uncertainty through online observation sharing in a closed perception-action loop. We first design a shared graph encoder that fuses the prior map with online perception into a unified representation, providing robust state embeddings under environmental discrepancies. At the core of ORION is an option-critic framework that learns high-level cooperative modes translated into sequences of low-level actions, enabling adaptive switching between individual navigation and team-level exploration. We further introduce a dual-stage cooperation strategy that allows agents to assist teammates under map uncertainty, thereby reducing the overall makespan. Across extensive maze-like maps and large-scale warehouse environments, ORION achieves high-quality real-time decentralized cooperation while scaling to up to 10 robots, outperforming state-of-the-art classical and learning-based baselines. Finally, we validate ORION on physical robot teams, demonstrating its robustness and practicality for real-world cooperative navigation.

2512.24497 2026-05-19 cs.AI cs.LG cs.RO stat.ML 版本更新

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

基于联合嵌入预测世界模型的物理规划，成功的关键因素是什么？

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

发表机构 * Meta FAIR ； Inria Paris（法国国家信息与自动化研究所巴黎中心）； Ecole normale supérieure / PSL（巴黎高等师范学院 / PSL）； New York University（纽约大学）

AI总结本文研究了在物理规划中使用联合嵌入预测世界模型（JEPA-WMs）的成功因素，通过分析模型架构、训练目标和规划算法对规划成功的影响，提出了一种在导航和操作任务中优于现有基线方法的模型。

Comments V2 of the article: - Added AdaLN-zero - Added table comparing JEPA-WMs with baselines with std translating per-seed variability only, no variability across epochs - Reordered figures in main body of the paper V3: added data scaling experiments, theoretical appendix section on autoregressive rollout, acceptance at TMLR

AI中文摘要

人工智能领域长期存在的一项挑战，是开发能够解决广泛物理任务并泛化到新的、未见过的任务和环境的智能体。近期一种流行的方法是从状态-动作轨迹中训练世界模型，然后将其与规划算法结合来解决新任务。规划通常在输入空间中进行，但最近出现的一类方法引入了在世界模型学习到的表示空间中进行优化的规划算法，其出发点是抽象掉无关细节可以带来更高效的规划。在本工作中，我们将这类模型称为JEPA-WMs，并研究使这类算法行之有效的技术选择。我们对若干关键组件进行了全面研究，目标是找到该类方法中的最优方案。我们使用模拟环境和真实世界机器人数据进行了实验，研究了模型架构、训练目标和规划算法如何影响规划成功率。综合这些发现，我们提出了一个在导航和操作任务中均优于两个已有基线（DINO-WM和V-JEPA-2-AC）的模型。代码、数据和检查点可在https://github.com/facebookresearch/jepa-wms获取。

英文摘要

A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conduct experiments using both simulated environments and real-world robotic data, and study how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.

2511.20353 2026-05-19 cs.RO 版本更新

Quality-guided UAV Surface Exploration for 3D Reconstruction

基于质量引导的无人机表面探索用于3D重建

Benjamin Sportich, Kenza Boubakri, Olivier Simonin, Alessandro Renzaglia

发表机构 * Inria（法国国家信息与自动化技术研究所）； INSA Lyon（里昂国立应用科学学院）； CITI（信息与通信技术研究所）

AI总结本文提出了一种新的模块化Next-Best-View规划框架，通过使用重建质量目标来指导探索规划，以提高3D重建的效率和质量。

AI中文摘要

使用自主机器人对未知环境进行建图的目的多种多样，但在实践中，制定规划策略时往往忽视了这些目的。快速信息采集与对建筑物的全面结构评估有着不同的要求，因此需要不同的方法。在本文中，我们为空中机器人提出了一种新的模块化Next-Best-View（NBV）规划框架，该框架显式地使用重建质量目标来指导探索规划。具体而言，我们的方法引入了新的高效视点生成与候选视点选择方法，这些方法能够适应用户定义的质量要求，并充分利用环境的截断符号距离场（TSDF）表示中编码的不确定性。由此得到的探索决策信息充分且高效，并与预定目标相契合。最后，我们在逼真环境中通过大量仿真验证了我们的方法。结果表明，该方法能够根据用户目标调整其行为，同时在覆盖率、最终3D地图质量和路径效率方面始终优于传统NBV策略。

英文摘要

Reasons for mapping an unknown environment with autonomous robots are wide-ranging, but in practice, they are often overlooked when developing planning strategies. Rapid information gathering and comprehensive structural assessment of buildings have different requirements and therefore necessitate distinct methodologies. In this paper, we propose a novel modular Next-Best-View (NBV) planning framework for aerial robots that explicitly uses a reconstruction quality objective to guide the exploration planning. In particular, our approach introduces new and efficient methods for view generation and selection of viewpoint candidates that are adaptive to the user-defined quality requirements, fully exploiting the uncertainty encoded in a Truncated Signed Distance field (TSDF) representation of the environment. This results in informed and efficient exploration decisions tailored towards the predetermined objective. Finally, we validate our method via extensive simulations in realistic environments. We demonstrate that it successfully adjusts its behavior to the user goal while consistently outperforming conventional NBV strategies in terms of coverage, quality of the final 3D map and path efficiency.

2510.17363 2026-05-19 cs.CV cs.LG cs.RO 版本更新

M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

M2H：基于高效窗口交叉任务注意力的多任务学习用于单目空间感知

U. V. B. L Udugama, George Vosselman, Francesco Nex

发表机构 * Department of Earth Observation Science（地球观测科学系）

AI总结本文提出M2H框架，通过高效的窗口交叉任务注意力模块，实现单目图像上的语义分割、深度估计、边缘检测和表面法线估计，同时在计算效率上优于现有方法。

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures

DOI: 10.1109/IROS60139.2025.11246974

AI中文摘要

在边缘设备上部署实时空间感知需要高效的多任务模型，这些模型能够在利用互补任务信息的同时最小化计算开销。本文介绍了Multi-Mono-Hydra（M2H），一种新的多任务学习框架，用于从单张单目图像中进行语义分割、深度、边缘和表面法线估计。与传统方法依赖独立单任务模型或共享编码器-解码器架构不同，M2H引入了基于窗口的跨任务注意力模块，实现了结构化的特征交换同时保留任务特定的细节，提高了任务间预测的一致性。M2H基于轻量级的ViT-based DINOv2主干网络，优化了实时部署，并作为支持动态环境中3D场景图构建的单目空间感知系统的基础。全面评估显示，M2H在NYUDv2上优于最先进的多任务模型，在Hypersim上超越了单任务深度和语义基线，在Cityscapes数据集上实现了更优的性能，同时在笔记本硬件上保持计算效率。除了基准测试外，M2H还在真实世界数据上得到了验证，证明了其在空间感知任务中的实用性。

英文摘要

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

2509.19102 2026-05-19 cs.RO cs.AI cs.CV 版本更新

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon: 通过功能对象规范化学习姿态感知的动作原语以实现通用的机器人操作

Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

发表机构 * TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg（汉堡大学信息学系TAMS（多模态系统技术方面）课题组）； Technical University of Munich（慕尼黑工业大学）； Agile Robots SE（思灵机器人）

AI总结本文提出FUNCanon框架，通过功能对象规范化学习姿态感知的动作原语，以实现通用的机器人操作，该方法将长周期操作任务分解为由主体、动词和对象定义的动作片段，从而提升策略的可组合性和可重用性。

Comments project website: https://sites.google.com/view/funcanon, 11 pages

AI中文摘要

从端到端演示中学习通用机器人技能，往往得到任务特定的策略，难以泛化到训练分布之外。为此，我们提出FunCanon框架，将长时程操作任务转换为一系列动作片段，每个片段由执行者（actor）、动词和对象定义。这些片段使策略学习聚焦于动作本身而非孤立的任务，从而实现可组合性和可复用性。为了使策略具备姿态感知能力并能跨类别泛化，我们对功能对象进行规范化，以实现功能对齐和操作轨迹的自动迁移：利用大型视觉语言模型提供的可供性（affordance）线索，将对象映射到共享的功能坐标系中。在这些对齐数据上训练的以对象和动作为中心的扩散策略FuncDiffuser自然地遵循对象的可供性和姿态，从而简化学习并提升泛化能力。在仿真和真实世界基准上的实验展示了类别级泛化、跨任务行为复用以及鲁棒的sim2real部署，表明功能规范化为复杂操作领域中可扩展的模仿学习提供了强有力的归纳偏置。演示细节和补充材料见项目网站：https://sites.google.com/view/funcanon。

英文摘要

Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision-language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.

2508.20836 2026-05-19 cs.RO math.OC 版本更新

First Experimental Demonstration of Natural Hovering Extremum Seeking: A New Paradigm in Flapping Flight Physics

首次实验性演示自然悬停极值搜索：飞行力学领域的新范式

Ahmed A. Elgohary, Rohan Palanikumar, Simone Martini, Sameh A. Eisa

发表机构 * Department of Aerospace Engineering and Engineering Mechanics（航空航天工程与工程力学系）； University of Cincinnati（辛辛那提大学）； Cincinnati, Ohio 45221, USA（美国俄亥俄州辛辛那提市，邮编45221）

AI总结本文首次实验验证了自然悬停极值搜索（NH-ES）这一新范式，展示了通过无模型的实时反馈机制，仅利用扑翼自身固有的自然振荡即可实现稳定悬停飞行。

AI中文摘要

在本文中，我们报告了自然悬停极值搜索（NH-ES）[doi.org/10.1103/4dm4-kc4g]这一近期出现的悬停与扑翼飞行物理新范式的首次实验演示。该范式在理论上提出：自然界中扑翼昆虫和蜂鸟所表现出的稳定悬停飞行，可以通过一种无模型、实时、计算简单、基于感知的反馈机制产生，这一机制只需要扑翼固有的自然振荡同时作为控制输入和推进输入。我们在完全无模型的设置下，对扑翼体进行了类似飞蛾趋光的光源搜寻实验，该设置不依赖形态学参数以及机体/空气动力学模型。我们展示了采用NH-ES的扑翼体能够自主爬升高度并使负责扑动的舵机保持稳定，即便存在俯仰动态（文献中认为这是开环悬停不稳定的主要原因之一）也是如此。扑翼体仅需局部光强测量的反馈，即可有效且稳定地悬停在光源附近。我们的结果还是在存在延迟和噪声的条件下取得的，支持了此前关于NH-ES对潜在处理延迟和含噪感知具有鲁棒性的观察。

英文摘要

In this letter, we report the first experimental demonstration of the recently emerged paradigm in hovering and flapping flight physics called Natural Hovering Extremum Seeking (NH-ES) [doi.org/10.1103/4dm4-kc4g], which theorized that the stable hovering flight observed in nature in flapping insects and hummingbirds can be generated via a model-free, real-time, computationally basic, sensory-based feedback mechanism that only needs the built-in natural oscillations of the flapping wing as both the control and the propulsive input. We run moth-like, light-source-seeking experiments on a flapping-wing body in a totally model-free setting that is agnostic to morphological parameters and body/aerodynamic models. We show that the flapping body using NH-ES gains altitude and autonomously stabilizes the servos responsible for flapping, including with pitching dynamics (believed in the literature to be a main reason for instability in open-loop hovering). The flapping body effectively and stably hovers about the light source, needing only feedback of local measurements of light intensity. Our results were also achieved under delay and noise effects, supporting earlier observations that NH-ES is robust against potential processing delays and noisy sensations.
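For readers unfamiliar with extremum seeking, the following is a minimal sketch of the classic perturbation-based scheme on a scalar problem. It is not the authors' NH-ES controller: in NH-ES the flapping oscillation itself is said to play the role of the perturbation, whereas here an explicit sinusoidal dither is added. The objective `light_intensity`, its peak at height 2.0, and all gains are hypothetical choices made for illustration.

```python
import math

def extremum_seeking(objective, theta0, a=0.2, omega=10.0, gain=1.0,
                     washout=1.0, dt=0.001, duration=40.0):
    """Perturbation-based extremum seeking that climbs toward a maximum.

    Only measurements of objective(theta) are used; no model or gradient.
    """
    theta_hat = theta0
    y_lowpass = objective(theta0)      # slow running average of the measurement
    steps = int(duration / dt)
    for k in range(steps):
        t = k * dt
        dither = math.sin(omega * t)
        y = objective(theta_hat + a * dither)          # probe with the oscillation
        y_lowpass += dt * washout * (y - y_lowpass)
        y_highpass = y - y_lowpass                     # remove the slow component
        theta_hat += dt * gain * dither * y_highpass   # demodulate and integrate
    return theta_hat

def light_intensity(height):
    """Hypothetical measured light intensity, brightest at height 2.0."""
    return -(height - 2.0) ** 2

estimate = extremum_seeking(light_intensity, theta0=0.0)
```

On average the product of the dither and the high-passed measurement is proportional to the local slope of the objective, so the estimate drifts uphill and settles near the peak using intensity feedback alone.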

2508.05415 2026-05-19 cs.RO 版本更新

Do Robots Really Need Anthropomorphic Hands? A Comparison of Human and Robotic Hands

机器人真的需要拟人化的手吗？人类手与机器人手的比较

Alexander Fabisch, Wadhah Zai El Amri, Chandandeep Singh, Nicolás Navarro-Guerrero

发表机构 * Robotics Innovation Center, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI)（德国人工智能研究中心（DFKI）机器人创新中心）； Leibniz Universität Hannover, L3S Research Center（汉诺威莱布尼茨大学L3S研究中心）

AI总结本文通过比较人类手与机器人手的生物力学、感知和控制机制，探讨机器人是否需要拟人化手，发现复杂的手部设计并非所有任务所必需，而手部机制的复杂性与执行任务的广度相关，同时指出传感器集成和智能操作策略仍需进一步研究。

AI中文摘要

人类的操作技能是人类自主运动功能的巅峰之一，需要协调众多自由度并处理高维传感输入，才能实现卓越的灵巧性。因此，我们试图回答：人手及其相关的生物力学特性、传感器和控制机制，是否是机器人领域应当追求的理想目标。机器人需要拟人手吗？我们首先从生物力学和感知的角度提取人手的特征，并与目前市售的机器人手进行比较。基于这一比较，我们提出了将操作系统复杂性与技能库规模和灵巧性联系起来的研究问题。我们尝试通过系统性文献综述来回答这些问题，分析了2019至2025年间125篇论文所展示的操作能力。尽管复杂的五指手常被视为机器人操作器的终极目标，但它并非所有任务所必需。我们发现，手内操作并不受益于拟人手设计，因为更简单的机构已经足够，但机构复杂性与一只手能够完成的操作任务的广度相关。传感器集成和智能操作策略仍未得到充分探索，这可能源于它们与手部设计之间的错位：与其复制手指数量和自由度，不如关注鲁棒性和柔软性，这样可以借助更智能的控制与学习来利用环境接触并集成更多传感器。最后，我们主张建立标准化的评估准则，以便对手部设计和操作系统进行系统性比较。

英文摘要

Human manipulation skills represent a pinnacle of human voluntary motor function, requiring the coordination of many degrees of freedom and processing of high-dimensional sensor input to achieve remarkable dexterity. Thus, we set out to answer whether the human hand, with its associated biomechanical properties, sensors, and control mechanisms, is an ideal that we should strive for in robotics. Do robots need anthropomorphic hands? We start by extracting characteristics of the human hand in terms of biomechanics and perception to compare them with currently commercially available robotic hands. From this comparison, we derive our research questions that connect manipulation system complexity to skill repertoire size and dexterity. We attempt to answer these with a systematic literature review, in which we analyze the manipulation capabilities demonstrated in 125 papers from 2019 to 2025. Although complex five-fingered hands are often considered the ultimate goal for robotic manipulators, they are not necessary for all tasks. We find that in-hand manipulation does not benefit from anthropomorphic hand design as simpler mechanisms are sufficient, but mechanism complexity correlates with the breadth of manipulation tasks a hand can perform. Sensor integration and intelligent manipulation strategies remain underexplored, which may be because of a misalignment with hand design: instead of replicating the number of fingers and degrees of freedom, focusing on robustness and softness would allow more intelligent control and learning to exploit environmental contacts and integrate more sensors. Finally, we argue for standardized evaluation criteria to enable systematic comparison of hand designs and manipulation systems.

2507.16481 2026-05-19 cs.RO cs.SY eess.SY 版本更新

Guided Reinforcement Learning for Omnidirectional 3D Jumping in Quadruped Robots

面向四足机器人全向三维跳跃的引导式强化学习

Riccardo Bussola, Michele Focchi, Giulio Turrisi, Claudio Semini, Luigi Palopoli

发表机构 * Dynamic Legged System (DLS), Istituto Italiano di Tecnologia (IIT)（意大利技术研究院（IIT）动态足式系统实验室（DLS））

AI总结本文提出一种引导强化学习方法，结合贝塞尔曲线与匀加速直线运动模型，提高四足机器人三维跳跃的效率和可解释性，通过仿真和实验验证了其优越性。

AI中文摘要

跳跃对四足机器人来说是一个重大挑战，尽管在许多操作场景中至关重要。虽然存在用于控制此类运动的优化方法，但它们往往耗时且需要大量的机器人和地形参数知识，使其在现实世界中不够稳健。强化学习（RL）正逐渐成为一种可行的替代方案，但传统端到端方法在样本复杂性方面效率低下，需要在模拟中进行大量训练，并且最终运动的可预测性差，这使得难以验证最终运动的安全性。为克服这些限制，本文介绍了一种新的引导强化学习方法，通过结合贝塞尔曲线与匀加速直线运动（UARM）模型，利用物理直觉实现高效且可解释的跳跃。广泛的仿真和实验结果清楚地证明了我们的方法相较于现有方法的优势。

英文摘要

Jumping poses a significant challenge for quadruped robots, despite being crucial for many operational scenarios. While optimisation methods exist for controlling such motions, they are often time-consuming and demand extensive knowledge of robot and terrain parameters, making them less robust in real-world scenarios. Reinforcement learning (RL) is emerging as a viable alternative, yet conventional end-to-end approaches lack efficiency in terms of sample complexity, requiring extensive training in simulations, and predictability of the final motion, which makes it difficult to certify the safety of the final motion. To overcome these limitations, this paper introduces a novel guided reinforcement learning approach that leverages physical intuition for efficient and explainable jumping, by combining Bézier curves with a Uniformly Accelerated Rectilinear Motion (UARM) model. Extensive simulation and experimental results clearly demonstrate the advantages of our approach over existing alternatives.
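The two building blocks named above, Bézier curves and a Uniformly Accelerated Rectilinear Motion (UARM) model, are standard and can be sketched independently of the paper. How the authors combine them with the learned policy is not described in the abstract, so the function names, control points, and lift-off values below are illustrative assumptions only.

```python
def cubic_bezier(p0, p1, p2, p3, s):
    """Evaluate a cubic Bezier curve at s in [0, 1] using the Bernstein form."""
    b0 = (1 - s) ** 3
    b1 = 3 * (1 - s) ** 2 * s
    b2 = 3 * (1 - s) * s ** 2
    b3 = s ** 3
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def uniformly_accelerated_position(p0, v0, accel, t):
    """UARM: p(t) = p0 + v0 * t + 0.5 * a * t^2, applied per axis."""
    return tuple(p + v * t + 0.5 * a * t * t
                 for p, v, a in zip(p0, v0, accel))

def apex_height(z0, vz0, g=9.81):
    """Peak height of a ballistic flight phase with upward lift-off speed vz0."""
    return z0 + vz0 ** 2 / (2 * g)

# Hypothetical (x, z) control points for a smooth reference curve.
midpoint = cubic_bezier((0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0), 0.5)

# Hypothetical lift-off state: 0.4 m height, 1 m/s forward, 2 m/s upward.
in_flight = uniformly_accelerated_position((0.0, 0.4), (1.0, 2.0), (0.0, -9.81), 0.2)
peak = apex_height(0.4, 2.0)
```

A low-dimensional parameterization of this kind is one way physical intuition can constrain what a policy has to learn, since the policy only needs to choose a few curve and lift-off parameters rather than raw joint commands at every step.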


2507.16059 2026-05-19 cs.RO Version update

Therapist-Exoskeleton-Patient Interaction for Gait Therapy

Emek Barış Küçüktabak, Matthew R. Short, Lorenzo Vianello, Daniel Ludvig, Levi Hargrove, Kevin Lynch, Jose Pons

Affiliations: Shirley Ryan AbilityLab; Center for Robotics and Biosystems; Department of Biomedical Engineering; Department of Mechanical Engineering; Department of Physical Medicine and Rehabilitation

AI summary: This paper proposes a new gait rehabilitation approach based on physical Human-Robot-Human Interaction (pHRHI), in which the therapist and the stroke patient both wear lower-limb exoskeletons that are virtually connected at the hips and knees via spring-damper elements, enabling bidirectional interaction and thereby improving rehabilitation outcomes.

Abstract

Following a stroke, individuals often experience mobility and balance impairments due to lower-limb weakness and loss of independent joint control. Gait recovery is a key goal of rehabilitation, traditionally achieved through high-intensity therapist-led training. However, manual assistance can be physically demanding and limits the therapist's ability to interact with multiple joints simultaneously. Robotic exoskeletons offer multi-joint support, reduce therapist strain, and provide objective feedback, but current control strategies often limit therapist involvement and adaptability. We present a novel gait rehabilitation paradigm based on physical Human-Robot-Human Interaction (pHRHI), where both the therapist and the post-stroke individual wear lower-limb exoskeletons virtually connected at the hips and knees via spring-damper elements. This enables bidirectional interaction, allowing the therapist to guide movement and receive haptic feedback. In a study with eight chronic stroke patients, pHRHI training outperformed conventional therapist-guided treadmill walking, leading to increased joint range of motion, step metrics, muscle activation, and motivation. These results highlight pHRHI's potential to combine robotic precision with therapist intuition for improved rehabilitation outcomes.
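
A virtual spring-damper between two corresponding joints can be sketched as a torque proportional to the angle and velocity differences, applied with opposite signs to the two wearers. The code below is a minimal illustration under that assumption. The class name, function name, and gain values are hypothetical, and the abstract does not describe the controller actually used.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class JointState:
    angle: float     # joint angle in rad
    velocity: float  # joint angular velocity in rad/s


def coupling_torques(
    therapist: JointState,
    patient: JointState,
    stiffness: float,  # assumed virtual spring gain, N*m/rad
    damping: float,    # assumed virtual damper gain, N*m*s/rad
) -> Tuple[float, float]:
    """Return (torque_on_therapist, torque_on_patient) in N*m.

    The patient's joint is pulled toward the therapist's joint, and the
    therapist feels the equal and opposite torque as haptic feedback.
    """
    torque_on_patient = (
        stiffness * (therapist.angle - patient.angle)
        + damping * (therapist.velocity - patient.velocity)
    )
    return -torque_on_patient, torque_on_patient


# Hypothetical knee states and gains.
tau_therapist, tau_patient = coupling_torques(
    JointState(angle=0.5, velocity=0.0),
    JointState(angle=0.3, velocity=0.1),
    stiffness=10.0,
    damping=2.0,
)
```

Because the two torques are equal and opposite, the coupling is bidirectional: the therapist guides the patient's joint and simultaneously feels how far the patient deviates.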


2507.04996 2026-05-19 cs.CY cs.CE cs.CL cs.HC cs.RO Version update

Agentic Vehicles for Human-Centered Mobility: Definition, Prospects, and Synergistic Co-Development with Vehicle Autonomy

Jiangbo Yu, Raphael Frank, Luis Miranda-Moreno, Sasan Jafarnejad, Jonatas Augusto Manzolli, Fuqiang Liu, Jiyao Wang, Ali Eslami

Affiliations: Interdisciplinary Centre for Security, Reliability and Trust; University of Luxembourg

AI summary: This paper examines human-centered mobility, introduces the concept of agentic vehicles (AgVs), argues that autonomy and agency are intertwined but conceptually distinct dimensions, and emphasizes the need for their co-development.

Abstract

Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Autonomous vehicles (AuVs) are therefore understood as vehicular systems that perceive their environment and execute tasks with minimal human intervention, consistent with the direction indicated by the SAE levels of automated driving. However, recent research and deployments increasingly showcase vehicular capabilities that, while not contradicting autonomy, are not entailed by it, including ambiguous goal handling, purposeful social engagement, external tool use, proactive problem solving, continuous learning, and context-sensitive reasoning in unseen and ethically salient situations, enabled in part by multimodal language models. These developments reveal a gap between technical autonomy and the broader social cognitive functions required for human-centered mobility, which are more precisely captured by the notion of agency. Therefore, rather than adding increasingly elaborate modifiers to "autonomous," we introduce agentic vehicles (AgVs) and suggest that autonomy and agency are intertwined but conceptually distinct: if autonomy concerns what to do and how to do it (task executions under internal rules), agency pertains to why to do it and what else can be done (goal-directed, adaptive actions). We present autonomy and agency as orthogonal yet synergistic dimensions with co-development implications. Vehicle agency marks a novel dimension of mobility service intelligence, heralding vehicles as purposeful actors in society.


2507.01099 2026-05-19 cs.CV cs.AI cs.LG cs.RO Version update

Geometry-aware 4D Video Generation for Robot Manipulation

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, Shuran Song

Affiliations: Stanford University; Toyota Research Institute

AI summary: This paper proposes a geometry-aware 4D video generation model trained with cross-view pointmap alignment to enforce multi-view 3D consistency of the generated videos. Given a single RGB-D image per view, it generates spatio-temporally aligned future video sequences and produces visually stable, spatially aligned predictions without relying on camera poses as input.

Comments ICLR 2026; Project website: https://robot4dgen.github.io

Abstract

Understanding and predicting dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.


2506.18024 2026-05-19 cs.DC cs.RO Version update

Leveraging Cloud-Fog Automation for Autonomous Collision Detection and Classification in Intelligent Unmanned Surface Vehicles

Thien Tran, Quang Nguyen, Jonathan Kua, Minh Tran, Toan Luu, Thuong Hoang, Jiong Jin

Affiliations: Deakin University; University of Birmingham; RMIT University; VinUniversity; Swinburne University of Technology; University of Tasmania

AI summary: This paper proposes a distributed Cloud-Edge-IoT architecture for intelligent unmanned surface vehicles that uses the Cloud-Fog Automation paradigm to address limits on real-time data processing and predictive modeling in maritime ICPS, improving computational efficiency, responsiveness, and scalability.

Comments 6 pages, 5 figures, accepted paper on the 23rd IEEE International Conference on Industrial Informatics (INDIN), July 12-15, 2025, Kunming, China

DOI: 10.1109/INDIN64977.2025.11279683
Journal ref: 2025 IEEE 23rd International Conference on Industrial Informatics (INDIN), Kunming, China, 2025

Abstract

Industrial Cyber-Physical Systems (ICPS) technologies are foundational in driving maritime autonomy, particularly for Unmanned Surface Vehicles (USVs). However, onboard computational constraints and communication latency significantly restrict real-time data processing, analysis, and predictive modeling, hence limiting the scalability and responsiveness of maritime ICPS. To overcome these challenges, we propose a distributed Cloud-Edge-IoT architecture tailored for maritime ICPS by leveraging design principles from the recently proposed Cloud-Fog Automation paradigm. Our proposed architecture comprises three hierarchical layers: a Cloud Layer for centralized and decentralized data aggregation, advanced analytics, and future model refinement; an Edge Layer that executes localized AI-driven processing and decision-making; and an IoT Layer responsible for low-latency sensor data acquisition. Our experimental results demonstrated improvements in computational efficiency, responsiveness, and scalability. When compared with our conventional approaches, we achieved a classification accuracy of 86% with improved latency performance. By adopting Cloud-Fog Automation, we address the low-latency processing constraints and scalability challenges in maritime ICPS applications. Our work offers a practical, modular, and scalable framework to advance robust autonomy and AI-driven decision-making for intelligent USVs in future maritime ICPS.


2506.17991 2026-05-19 cs.DC cs.RO Version update

CFTel: A Practical Architecture for Robust and Scalable Telerobotics with Cloud-Fog Automation

Thien Tran, Jonathan Kua, Minh Tran, Honghao Lyu, Thuong Hoang, Jiong Jin

Affiliations: Deakin University; RMIT University; Zhejiang University; Swinburne University of Technology; University of Tasmania

AI summary: This paper presents CFTel, a telerobotics architecture built on the Cloud-Fog Automation paradigm that targets the latency, reliability, scalability, and resilience problems of conventional cloud-based telerobotics. It uses a distributed Cloud-Edge-Robotics computing architecture to provide deterministic connectivity, connected intelligence, and networked computing, with the potential to improve real-time control, scalability, and autonomy.

Comments 6 pages, 1 figure, accepted paper on the 23rd IEEE International Conference on Industrial Informatics (INDIN), July 12-15, 2025, Kunming, China

DOI: 10.1109/INDIN64977.2025.11279161
Journal ref: 2025 IEEE 23rd International Conference on Industrial Informatics (INDIN), Kunming, China, 2025

Abstract

Telerobotics is a key foundation in autonomous Industrial Cyber-Physical Systems (ICPS), enabling remote operations across various domains. However, conventional cloud-based telerobotics suffers from latency, reliability, scalability, and resilience issues, hindering real-time performance in critical applications. Cloud-Fog Telerobotics (CFTel) builds on the Cloud-Fog Automation (CFA) paradigm to address these limitations by leveraging a distributed Cloud-Edge-Robotics computing architecture, enabling deterministic connectivity, deterministic connected intelligence, and deterministic networked computing. This paper synthesizes recent advancements in CFTel, aiming to highlight its role in facilitating scalable, low-latency, autonomous, and AI-driven telerobotics. We analyze architectural frameworks and technologies that enable them, including 5G Ultra-Reliable Low-Latency Communication, Edge Intelligence, Embodied AI, and Digital Twins. The study demonstrates that CFTel has the potential to enhance real-time control, scalability, and autonomy while supporting service-oriented solutions. We also discuss practical challenges, including latency constraints, cybersecurity risks, interoperability issues, and standardization efforts. This work serves as a foundational reference for researchers, stakeholders, and industry practitioners in future telerobotics research.


2505.07813 2026-05-19 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY Version update

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, Deepak Pathak

Affiliations: Carnegie Mellon University

AI summary: This paper proposes the DexWild framework, which co-trains on human and robot demonstration data to improve robot generalization across diverse environments. Experiments show a markedly higher success rate in unseen environments than policies trained on robot data only.

Comments In RSS 2025. Website at https://dexwild.github.io

Abstract

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments, nearly four times higher than policies trained with robot data only, and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io


2503.02087 2026-05-19 cs.RO cs.LG cs.SY eess.SY Version update

Uncertainty Representation in a SOTIF-Related Use Case with Dempster-Shafer Theory for LiDAR Sensor-Based Object Detection

Milin Patel, Rolf Jung

Affiliations: Institute for Driver Assistance and Connected Mobility; Kempten University of Applied Sciences

AI summary: This paper presents a systematic approach that uses Dempster-Shafer Theory to construct a Frame of Discernment representing uncertainty in LiDAR sensor-based object detection, and applies variance-based sensitivity analysis to quantify and prioritize these uncertainties in support of safety in automated driving scenarios.

Comments submitted as extended paper of Vehicle Technology and Intelligent Transport Systems (VEHITS) 2024 conference and will be published by Springer in a CCIS Series book later in 2025

DOI: 10.1007/978-3-032-23187-1_10

Abstract

Uncertainty in LiDAR sensor-based object detection arises from environmental variability and sensor performance limitations. Representing these uncertainties is essential for ensuring the Safety of the Intended Functionality (SOTIF), which focuses on preventing hazards in automated driving scenarios. This paper presents a systematic approach to identifying, classifying, and representing uncertainties in LiDAR-based object detection within a SOTIF-related scenario. Dempster-Shafer Theory (DST) is employed to construct a Frame of Discernment (FoD) to represent detection outcomes. Conditional Basic Probability Assignments (BPAs) are applied based on dependencies among identified uncertainty sources. Yager's Rule of Combination is used to resolve conflicting evidence from multiple sources, providing a structured framework to evaluate uncertainties' effects on detection accuracy. The study applies variance-based sensitivity analysis (VBSA) to quantify and prioritize uncertainties, detailing their specific impact on detection performance.
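
Yager's Rule of Combination multiplies the masses of every pair of focal elements, assigns each product to the intersection of the pair, and moves the mass of conflicting (empty-intersection) pairs onto the whole frame instead of renormalizing. The sketch below shows that rule on a toy example. The two-element frame, the source names, and the mass values are invented for illustration and are not the paper's Frame of Discernment or its BPAs.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def yager_combine(m1: Mass, m2: Mass, frame: FrozenSet[str]) -> Mass:
    """Combine two basic probability assignments with Yager's rule.

    Mass from conflicting pairs is added to the full frame (total
    ignorance) rather than being normalized away as in Dempster's rule.
    """
    combined: Mass = {}
    conflict = 0.0
    for (set_b, mass_b), (set_c, mass_c) in product(m1.items(), m2.items()):
        intersection = set_b & set_c
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c
    combined[frame] = combined.get(frame, 0.0) + conflict
    return combined


# Hypothetical frame and evidence from two uncertainty sources.
FRAME = frozenset({"detected", "missed"})
weather_evidence: Mass = {frozenset({"detected"}): 0.7, FRAME: 0.3}
occlusion_evidence: Mass = {frozenset({"missed"}): 0.4, FRAME: 0.6}

fused = yager_combine(weather_evidence, occlusion_evidence, FRAME)
```

With these toy inputs the conflicting product 0.7 x 0.4 = 0.28 ends up on the full frame, so strong disagreement between sources shows up as increased ignorance rather than as inflated confidence in either outcome.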


2502.05462 2026-05-19 cs.RO cs.MA cs.SY eess.SY math.OC Version update

Motion Planning of Cooperative Nonholonomic Mobile Manipulators

Keshab Patra, Arpita Sinha, Anirban Guha

Affiliations: Department of Mechanical Engineering, Indian Institute of Technology Bombay; Center for Systems and Control, Indian Institute of Technology Bombay

AI summary: This paper proposes a real-time implementable motion planning framework for cooperative object transportation by nonholonomic mobile manipulator robots in dynamic environments. The framework finds a start-to-goal path through static obstacle-free regions and uses a novel, fast, computationally lightweight ellipse-based technique to generate convex, static, obstacle-free regions around the path. A real-time implementable planner based on nonlinear Model Predictive Control (NMPC) jointly plans feasible motion for the mobile base and the manipulator arm and generates feasible, collision-free trajectories for cooperative object transportation. Simulation and hardware experiments validate the proposed framework.

Comments Published in ASME Letters in Translational Robotics. This includes supplementary materials

DOI: 10.1115/1.4071124
Journal ref: Patra, K., Sinha, A., and Guha, A. (May 2, 2026). "Motion Planning of Cooperative Nonholonomic Mobile Manipulators." ASME. Letters Trans. Robotics. December 2025; 1(4): 041003

Abstract

We propose a real-time implementable motion planning framework for cooperative object transportation by nonholonomic mobile manipulator robots (MMRs) in dynamic environments. Our global planner finds a path from start to goal through the static, obstacle-free regions in the environment and generates a set of convex, static, obstacle-free regions around the path using a novel, fast, and computationally lightweight ellipse-based technique. We introduce a real-time implementable planning technique based on nonlinear Model Predictive Control (NMPC) that jointly plans feasible motion for the mobile base and the manipulator's arm and generates a kinodynamically feasible, collision-free trajectory for cooperative object transportation. Simulation and hardware experiments validate the efficiency of our proposed planning framework.


2409.12190 2026-05-19 cs.RO cs.CV Version update

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using Feed-Forward 3D Models

Yuantai Zhang, Jiaqi Yang, Huajian Zeng, Changhao Chen, Haoang Li, Liang Li, Dezhen Song, Xingxing Zuo

Affiliations: MBZUAI; HKUST (GZ); Zhejiang University

AI summary: This paper proposes an initialization framework that needs no visual feature tracking and instead uses point clouds predicted by a feed-forward 3D model, improving the reliability and efficiency of initialization for monocular visual-inertial navigation systems. Experiments show a success rate above 90% with markedly less data required.

Abstract

Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai-Z/FF-VIO-Init.


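2605.17302 2026-05-19 cs.RO Version update

Beyond Geometry: Efficient Topologically-Grounded Navigation in Complex 3D Environments

Yifan Du, Chengwei Zhang, Siyu Liao, Zhongfeng Wang

Affiliations: School of Integrated Circuits

AI summary: This paper proposes a surface extraction framework that enforces ground support, overhead clearance, and seed-based connectivity constraints to build a reduced state space of physically reachable standing positions, enabling efficient topologically-grounded navigation in complex 3D environments.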

2605.17300 2026-05-19 cs.RO Version update

HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds

Qixuan Li, Chen Le, Jincheng Yu, Xinlei Chen

Affiliations: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Department of Electronic Engineering, and the Institute for Embodied Intelligence and Robotics, Tsinghua University, Beijing, China

AI summary: This paper proposes HCLM, a hierarchical framework for cooperative loco-manipulation with dual quadruped robots in complex environments. Its core components are a centralized Joint Diffusion Policy and a hybrid Whole-Body Controller, and its main contribution is highly robust multi-robot cooperative control.

Abstract

We introduce HCLM, a hierarchical framework for general-purpose cooperative loco-manipulation with dual quadrupedal systems. Coordinating multi-robot collaborative manipulation across floating bases is highly challenging due to the conflicting demands of spatial coordination, robust locomotion, and closed-chain physical interactions. To resolve this, our architecture systematically decouples high-level collaborative reasoning from low-level robust motion execution. At the high level, a centralized Joint Diffusion Policy leverages an SE(3)-invariant task-space representation to learn coordinate-agnostic spatial coordination patterns. To translate these frame-agnostic references into physical motion, a task-centric hybrid Whole-Body Controller synergizes a proactive kinematic Model Predictive Control for collision-free velocity distribution with a reactive execution layer. Crucially, this reactive layer guarantees rapid responsiveness for precise end-effector tracking, while concurrently integrating active force regulation via a cooperative admittance scheme to safely resolve kinematic conflicts and strictly regulate internal stresses during closed-chain interactions. We validate the framework across progressively challenging simulated scenarios, including cooperative carrying, packing and handovers, and successfully deploy the latter in the real world. The results demonstrate reliable task execution, strict configuration agnosticism, and exceptional resilience against severe physical perturbations, offering a highly robust pathway for multi-robot embodied coordination.


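2605.17293 2026-05-19 cs.RO cs.MA Version update

Task Capability Improvement Algorithm for Collaborative Manipulators

Keshab Patra, Arpita Sinha, Anirban Guha

Affiliations: Department of Mechanical Engineering, Indian Institute of Technology Bombay; Center for Systems and Control, Indian Institute of Technology Bombay

AI summary: This paper proposes using additional moments to improve the task capability of collaborative manipulators: applying forces away from the center of mass generates extra moments, which increases the capability of individual manipulators and of the collaborating group as a whole. Experimental results show a 5.86% improvement in task capability.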
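
The physical effect the summary relies on is that a force applied at an offset from the center of mass produces a moment equal to the cross product of the offset and the force. The sketch below computes only that quantity. The function name and the numbers are assumptions for illustration, and the paper's capability metric and allocation algorithm are not described in this entry.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def moment_about_com(lever_arm: Vec3, force: Vec3) -> Vec3:
    """Moment r x F (N*m) of a force applied at offset r (m) from the center of mass."""
    rx, ry, rz = lever_arm
    fx, fy, fz = force
    return (
        ry * fz - rz * fy,
        rz * fx - rx * fz,
        rx * fy - ry * fx,
    )


# Hypothetical grasp 0.2 m from the object's center of mass, pushing along +y.
extra_moment = moment_about_com((0.2, 0.0, 0.0), (0.0, 10.0, 0.0))
```

A force applied exactly at the center of mass has a zero lever arm and therefore contributes no moment, which is why the grasp location matters.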

2605.17284 2026-05-19 cs.CV cs.AI cs.LG cs.RO Version update

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

Ruiyang Zhu, Yuehan He, Boyuan Zheng, Zesen Zhao, Ahmad Chalhoub, Qingzhao Zhang, Z. Morley Mao

Affiliations: University of Michigan; University of Arizona

AI summary: This paper proposes CLAP, which applies contrastive latent-space prompt optimization to rare but safety-critical long-tail scenarios in autonomous driving. Per-roadblock soft prompts are optimized from crowdsourced data and retrieved via V2X communication to improve planning performance.

Comments 9 pages + appendix

Abstract

End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that addresses the long-tail challenging scenes beyond data scaling and model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations from VLAs' latent space: (i) at the VLA's hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this via a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces challenging scenario planning error by 24% with no regression on normal frames, significantly improving planning performance.


2605.17264 2026-05-19 cs.RO Version update

Stretch-ICP: A Continuous-Trajectory Registration and Deskewing Algorithm in Scenarios of Aggressive Motions

Simon-Pierre Deschênes, Veronica Vannini, Philippe Giguère, François Pomerleau

Affiliations: GitHub

AI summary: This paper proposes the Stretch-ICP algorithm, which makes SLAM more robust and improves the robustness and consistency of lidar-inertial state estimation under aggressive motions while reducing linear and angular velocity estimation errors.

Comments 29 pages, 16 figures, published in Sensors 2026, 26(8), 2567, special issue "New Challenges and Sensor Techniques in Robot Positioning"

DOI: 10.3390/s26082567
Journal ref: Sensors 2026, 26(8), 2567

Abstract

Robust robotic autonomy remains challenging in complex environments, where loss of stability on uneven or slippery terrain can induce extreme accelerations and angular velocities. Such motions corrupt sensor measurements and degrade state estimation, motivating the need for improved algorithmic robustness. To investigate this issue, we introduce the Tumbling-Induced Gyroscope Saturation (TIGS) dataset, which consists of recordings from a mechanical lidar and an Inertial Measurement Unit (IMU) tumbling down a hill. The dataset contains angular speeds up to four times higher than those in similar datasets and is publicly available. We then propose two complementary methods to improve Simultaneous Localization And Mapping (SLAM) robustness and evaluate them on TIGS. First, Saturation-Aware Angular Velocity Estimation (SAAVE) estimates angular velocities when gyroscope measurements become saturated during aggressive motions, reducing angular speed estimation error by 83.4%. Second, Stretch-ICP, a novel registration and deskewing algorithm, enables reconstruction of smoother 6-Degrees Of Freedom (DOF) trajectories under aggressive motions compared to classical Iterative Closest Point (ICP). Stretch-ICP reduces linear and angular velocity errors by 95.2% and 94.8%, respectively, at scan boundaries. Together, these contributions improve the robustness and consistency of lidar-inertial state estimation under aggressive motions.
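
For readers unfamiliar with deskewing: a spinning lidar records each point at a slightly different time, so points must be re-expressed in a common frame using the sensor motion during the scan. The sketch below is a planar, constant-velocity baseline written for illustration only. It is not Stretch-ICP, and the function name, the 2D simplification, and the constant-velocity assumption are all choices made here rather than details from the abstract.

```python
import math
from typing import List, Sequence, Tuple

TimedPoint = Tuple[float, float, float]  # (x, y, seconds since scan start)


def deskew_scan(
    points: Sequence[TimedPoint],
    linear_velocity: Tuple[float, float],  # m/s, expressed in the scan-start frame
    yaw_rate: float,                       # rad/s, assumed constant over the scan
) -> List[Tuple[float, float]]:
    """Express every point in the sensor frame at scan start.

    The sensor pose at time dt is modelled as a rotation by yaw_rate * dt
    and a translation by linear_velocity * dt relative to the start pose.
    """
    deskewed = []
    for x, y, dt in points:
        theta = yaw_rate * dt
        c, s = math.cos(theta), math.sin(theta)
        deskewed.append((
            c * x - s * y + linear_velocity[0] * dt,
            s * x + c * y + linear_velocity[1] * dt,
        ))
    return deskewed
```

Under tumbling motions the velocity changes substantially within a single scan, so a constant-velocity model like this one is expected to leave residual distortion; that limitation is the kind of situation a continuous-trajectory method is meant to handle.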


2605.17229 2026-05-19 cs.RO cs.SY eess.SY Version update

Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions

Qingwen Pu, Kun Xie, Yuan Zhu, Guocong Zhai

Affiliations: Transportation Informatics Lab, Department of Civil and Environmental Engineering, Old Dominion University; Inner Mongolia Center for Transportation Research, Inner Mongolia University; School of Transportation and Logistics, National Engineering Laboratory of Integrated Transportation Big Data Application Technology, National and Local Joint Engineering Research Center of Integrated Transportation Intelligence, Southwest Jiaotong University

AI summary: This paper proposes a three-stage framework that combines real-world data with adaptive simulation to generate behaviorally realistic safety-critical scenarios at scale. Using a multi-agent state-space Transformer-enhanced DDPG (MA-SST-DDPG) model, it reproduces evasive behaviors in vehicle-pedestrian interactions with low trajectory error and produces the VPSCI dataset.

Comments 49 pages, 13 figures, 11 tables

Abstract

Automated driving system (ADS) deployment requires rigorous validation across safety-critical vehicle-pedestrian interactions, yet real-world datasets rarely capture high-risk scenarios while simulation platforms lack realistic behavior. In response, this study proposes a three-stage framework that combines real-world grounding with adaptive simulation to generate behaviorally realistic safety-critical scenarios at scale. Stage 1 pre-trains multi-agent state-space Transformer-enhanced DDPG (MA-SST-DDPG) agents on real-world safety-critical data to learn human-like interactive evasive behaviors through data-driven learning. Stage 2 deploys pre-trained multi-agents in CARLA for online reinforcement learning to generalize across diverse scenarios, integrating real-world knowledge with simulation experience to produce a refined MA-SST-DDPG model. Stage 3 uses CARLA with the refined model to generate over 198,000 high-resolution interaction episodes from eight intersection scenarios, culminating in the Vehicle-Pedestrian Safety-Critical Interaction (VPSCI) dataset. The refined MA-SST-DDPG model outperformed baseline methods in reproducing realistic evasive behaviors, achieving the lowest trajectory errors (ADE = 0.072 m, FDE = 0.142 m). Statistical comparison confirmed distributional equivalence between the generated and real-world data in both conflict severity and behavioral response. A Turing test confirmed that the evasive behaviors generated by the three-stage framework were indistinguishable from real-world interactions. These results demonstrate the framework's effectiveness in producing high-fidelity safety-critical data, offering valuable sources for the development of ADS and simulation-based safety evaluations.
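
ADE and FDE are standard trajectory metrics: the average displacement error is the mean Euclidean distance between predicted and reference positions over all timesteps, and the final displacement error is that distance at the last timestep. The sketch below computes both for a single trajectory. The function name and sample trajectories are illustrative assumptions, and the abstract does not state the paper's exact evaluation protocol (prediction horizon, averaging over agents or episodes).

```python
import math
from typing import Sequence, Tuple

Point2D = Tuple[float, float]


def displacement_errors(
    predicted: Sequence[Point2D],
    reference: Sequence[Point2D],
) -> Tuple[float, float]:
    """Return (ADE, FDE) in the same length unit as the input positions."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    ade = sum(distances) / len(distances)  # mean error over all timesteps
    fde = distances[-1]                    # error at the final timestep
    return ade, fde


# Toy trajectories in metres: the prediction drifts away from the reference.
ade, fde = displacement_errors(
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
)
```

In this toy case the per-step errors are 0, 1, and 2 m, so FDE exceeds ADE, which is the usual pattern when errors accumulate along a trajectory.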


2605.17204 2026-05-19 cs.RO cs.AI 版本更新

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

基于事件的稀疏自编码器用于视觉-语言-动作策略

Xinchen Jin, Aditya Chatterjee, Pranav Kumar, Rohan Paleja

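Affiliation: Department of Computer Science, Purdue University, West Lafayette, IN 47907

AI summary: This paper proposes an event-grounded sparse autoencoder (SAE) analysis method for interpreting vision-language-action (VLA) policies. Anchoring SAE feature analysis to behavioral events makes the features more interpretable and strengthens their causal effect on closed-loop behavior.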

AI中文摘要

视觉-语言-动作（VLA）策略将语言和视觉输入转化为机器人动作，其隐藏表示直接塑造闭环行为。然而，语言和视觉-语言模型中的机制可解释性工具无法直接转移到VLA中：输出是机器人动作而非人类可读的标记，干预只能通过昂贵的闭环回放测试。我们提出了一种基于事件的可解释性流程，将SAE特征分析锚定在行为事件而非文本上下文中。通过在每个任务中使用视觉、状态和时间线索对末端执行器关键帧进行聚类，将SAE特征与行为显著事件联系起来，并通过可选的VLM注释与语义上下文联系起来。据我们所知，我们的流程是首个将基于SAE的VLA分析锚定在闭环行为事件上的方法之一。在两个仿真架构和一个真实机器人研究中，基于事件的排名在OpenVLA上产生了最强的因果效应，并转移到了π_{0.5}的连续动作块中。SAE是一种稀疏但不完美的干预基础：实用性因架构和干预位置而异，激进干预揭示了安全性和可解释性的限制。总体而言，基于事件的SAE分析成为行为锚定VLA可解释性的一种实用起点，推动了未来关于SAE特征的研究，包括超越动作对齐坐标的更细致分析、更精细的闭环评估以及高风险VLA部署中的安全干预。代码可在https://github.com/xc-j/Event-SAE上获得。

英文摘要

Vision-Language-Action (VLA) policies translate language and visual inputs into robot actions, where their hidden representations directly shape closed-loop behavior. However, mechanistic interpretability tools from language and vision-language models do not transfer cleanly to VLAs: outputs are robot actions rather than human-readable tokens, and interventions can only be tested via expensive closed-loop rollouts. We propose an event-grounded interpretability pipeline that anchors SAE feature analysis to behavioral events rather than text contexts. End-effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events and, via optional VLM annotations, to semantic context. To our knowledge, our pipeline is among the first to ground SAE-based VLA analysis in closed-loop behavioral events. Across two simulation architectures and a real-robot study, event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of $\pi_{0.5}$. SAE is a sparse but imperfect intervention basis: usability varies with architecture and intervention site, and aggressive intervention reveals safety and interpretability limits. Overall, event-grounded SAE analysis emerges as a practical starting point for behavior-anchored VLA interpretability, motivating future work on SAE features beyond action-aligned coordinates, finer-grained closed-loop evaluation, and safe interventions for high-stakes VLA deployments. Code is available at https://github.com/xc-j/Event-SAE.


2605.17144 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

对比性概念激活引导（COAST）：通过隐藏状态解锁视觉-语言-动作模型

Miranda Muqing Miao, Subin Kim, Brandon Yang, Lyle Ungar

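Affiliation: University of Pennsylvania

AI summary: This paper proposes COAST, which improves the performance of vision-language-action models on robotic tasks by identifying success subspaces. Its core idea is to use conceptor projections to steer the model toward its success distribution, raising task success rates.

Comments Submitted to NeurIPS 2026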

AI中文摘要

视觉-语言-动作（VLA）模型利用大规模网络视觉-语言模型（VLM）预训练的强感知先验，但实际应用中却表现出惊人的脆弱性，常常在简单的机器人任务中失败。为缓解这一问题，我们提出了对比性概念激活引导（COAST）。COAST基于“概念”这一线性操作符，该操作符能将数据软投影到目标分布的主成分中。COAST利用概念来从少量的成功和失败轨迹中识别出目标机器人任务的成功子空间。在推理过程中，它将VLA的潜在表示引导到这些识别出的成功子空间中，以提高任务结果。在三种架构不同的神经策略（流匹配VLA、自回归VLA和扩散策略）上，COAST将绝对均值仿真和真实机器人任务的成功率分别提高了超过20%和40%。激活子空间几何表明，失败模式在不同任务中共享大量结构，而成功表示则主要任务特定。当任务共享相似的失败模式时，这种结构使之前拟合的概念能提升新任务的性能而无需重新拟合。最终，我们的结果表明，当前VLA在潜在表示中保留了大量任务相关的知识，而动作专家的解码瓶颈可以通过将残差流引导至任务相关子空间来缓解。COAST提供了一条轻量、无训练的路径，通过引导模型朝其自身的“成功”分布发展，来解锁这些潜在能力。

英文摘要

Vision-Language-Action (VLA) models leverage powerful perceptual priors from web-scale Vision-Language Model (VLM) pre-training, yet they remain surprisingly brittle in practice, frequently failing at simple robotic tasks. To mitigate this, we propose Contrastive Conceptor Activation Steering (COAST). COAST builds on the notion of a "conceptor", a linear operator that soft-projects data into the principal components of a target distribution. COAST uses conceptors to identify success-critical subspaces for a target robotic task from a few examples of success and failure rollouts. At inference time, it steers VLA latents into these identified success subspaces to improve task outcomes. Across three architecturally distinct neural policies (flow-matching VLA, autoregressive VLA, and Diffusion Policy), COAST improves absolute mean simulation and real-robot task success rate by over 20% and 40%, respectively. The activation subspace geometry reveals that failure modes share substantial structure across tasks while success representations remain largely task-specific. When tasks share similar failure modes, this structure enables previously fitted conceptors to improve performance on new tasks without refitting. Ultimately, our results suggest that current VLAs retain substantial task-relevant knowledge in their latent representations, and that the action expert's decoding bottleneck could be mitigated by steering its residual stream toward task-relevant subspaces. COAST provides a lightweight, training-free path to unlocking these latent capabilities by steering the model towards its own "success" distributions.
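For readers unfamiliar with the term, the "soft projection" mentioned above can be written out using the standard conceptor definition from the conceptor literature. Whether COAST uses exactly this form, and how it contrasts success and failure conceptors, is not stated in the abstract, so treat the following as background under that assumption. Here $x$ is an activation vector, $R$ its correlation matrix, and $\alpha > 0$ the aperture parameter.

```latex
R = \mathbb{E}\left[ x x^{\top} \right], \qquad
C(R, \alpha) = R \left( R + \alpha^{-2} I \right)^{-1}
```

If $R$ has eigenvalues $\lambda_i$, then $C$ shares its eigenvectors and has eigenvalues $\lambda_i / (\lambda_i + \alpha^{-2}) \in [0, 1)$, so high-variance directions are passed almost unchanged while low-variance directions are damped rather than cut off.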


2605.17123 2026-05-19 cs.HC cs.RO 版本更新

ATRACT: A Trustworthy Robotic Autonomous system to support Casualty Triage

ATRACT: 一种可靠的机器人自主系统以支持伤员分诊

Tasweer Ahmad, Rafael Pina, Sandip Pradhan, Arindam Sikdar, Mindula Illeperuma, Khizer Saeed, Peter Lee, Varuna De Silva, Ardhendu Behera

Affiliations: Department of Computer Science, Edge Hill University; Institute for Digital Technologies, Loughborough University London; School of Architecture, Technology and Engineering, University of Brighton; School of Criminology and Criminal Justice, University of Portsmouth

AI summary: This paper proposes ATRACT, a human-in-the-loop decision support system that integrates drone video with wearable sensor data through multi-modal learning to improve the accuracy of battlefield casualty triage while reducing the risk to frontline medics.

AI中文摘要

在无人机日益与敌对行动相关联的背景下，我们重新利用它们用于人道主义和生命拯救应用。然而，将搜索和救援无人机适应于战场分诊仍极具挑战性；技术必须可靠以支持在极端不确定性、受限访问和显著个人风险下操作的前线医护人员。由于冲突地区伤员撤离的日益增长的脆弱性，本文提出了ATRACT（一种可靠的机器人自主系统以支持伤员分诊），一种新颖的人机协同决策支持系统，旨在在创伤后的关键时期实现早期战场分诊。ATRACT整合无人机捕获的视频与可穿戴传感器输入进行多模态学习，以支持伤员状态评估，从而解决现有系统的局限性。无人机视频捕获细粒度的行为线索，如姿势、体态，而可穿戴传感器提供互补的生理信号，包括心率、呼吸率和运动。通过结合两种模态，ATRACT为在直接接触伤员被延迟、风险或受限时提供证据支持医护人员的早期判断。为了缓解受伤动作数据真实性的差距，设计了一种条件变分自编码器用于数据增强。在我们的无人机捕获数据集上的实验结果表明，所提出的流程在动作分类上达到了85.7%的准确率；而我们的轻量级CNN视觉编码器在与更强的预训练视频骨干网络竞争时仍具有竞争力。总体而言，结果支持ATRACT作为向冲突环境中远程分诊迈出的实际有意义的一步，其中多模态传感、人类监督和可信的决策支持可以改善伤员优先级排序，并减少前线医护人员的暴露风险。

英文摘要

At a time when drones are increasingly associated with hostile operations, we re-purpose them for humanitarian and life-saving applications. However, adapting search and rescue drones for battlefield triage remains extremely challenging; the technology must perform reliably to support frontline medics who are forced to operate under extreme uncertainty, restricted access, and significant personal risk. Due to growing vulnerabilities of casualty evacuation in conflict zones, this paper presents ATRACT (A Trustworthy Robotic Autonomous system to support Casualty Triage), a novel human-in-the-loop decision support system to enable early battlefield triage during the critical post-trauma period. ATRACT integrates drone-captured video with wearable sensor input for multi-modal learning to support casualty-state assessment, thereby addressing the limitations of existing systems. Drone video captures fine-grained behavioural cues, such as pose and posture, while body-worn sensors provide complementary physiological signals, including heart rate, breathing rate, and movement. By combining the two modalities, ATRACT provides evidence to support the early judgement of medics when direct access to the casualty is delayed, risky, or restricted. To mitigate the data realism gap pertaining to injured actions, a conditional variational autoencoder is devised for data augmentation. Experimental results on our drone-captured dataset show that the proposed pipeline achieves 85.7% accuracy for action classification, while our lightweight CNN visual encoder remains competitive with stronger pre-trained video backbones. Overall, the results support ATRACT as a practically meaningful step towards remote triage in contested environments, where multi-modal sensing, human oversight and trustworthy decision support can improve casualty prioritisation and lessen the exposure of frontline medics.


2605.17077 2026-05-19 cs.RO cs.AI 版本更新

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

如何指导你的机器人：密集语言标注助力机器人策略学习

Bosung Kim, Ruiyi Wang, David Acuna, Jaehun Jung, Alexander Trevithick, Brandon Cui, Yejin Choi, Prithviraj Ammanabrolu

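Affiliations: University of California, San Diego; NVIDIA

AI summary: This study uses dense language annotations to make robot policy learning more efficient. It proposes DeMiAn, which uses a vision-language model to generate multi-aspect annotations and improves both policy and world-model performance without collecting new demonstrations.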

AI中文摘要

机器人策略学习受限于演示数据收集成本，而现有演示的语言标注相对廉价。我们研究语言密度作为提取固定机器人或第一人称视频数据集信号的杠杆。我们引入DeMiAn（密集多方面标注），一种两阶段方法，首先通过视觉语言模型生成四个互补方面的演示段落重标记：物理运动、场景组成、手臂姿态和推理。一个学习到的指导者将任务描述和初始场景快照映射到部署时的任务合适标注，异步运行以隐藏生成延迟。在超过100万机器人操作片段和5万EgoVerse人类第一人称视频上，DeMiAn在视觉语言-动作策略和基于视频的世界-动作模型上均未收集新演示的情况下提升了性能。在RoboCasa上，指导者在任务-only基线基础上提升了5个百分点，接近每任务oracle的3个百分点。没有固定标注方面在所有任务中占主导，表明选择正确的密集语言至关重要。DeMiAn还提高了复合任务和分布外性能，并在考虑标注生成FLOPs后，同时提升了中训练和后训练的计算-性能前沿。这些结果将密集重新标注定位为机器人策略学习的实用扩展杠杆。

英文摘要

Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.


2605.17033 2026-05-19 cs.RO 版本更新

Generalizable and Actionable Parts Pose Estimation with Symmetry Annotation-Free Learning Strategy

具有对称标注自由学习策略的通用且可操作的部件姿态估计

Wenxiao Chen, Xueyu Yuan, Liu Liu, Di Wu, Dan Guo

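Affiliations: Hefei University of Technology, Hefei, Anhui, China; University of Science and Technology of China, Hefei, Anhui, China

AI summary: This paper proposes SAFAG, a symmetry annotation-free framework for generalizable and actionable parts pose estimation. It handles symmetry prediction with a stepwise-refinement two-stage framework and a self-supervised learning strategy, improving pose estimation performance and robustness in data-scarce scenarios.

Comments Accepted as a poster at the Forty-third International Conference on Machine Learning (ICML 2026)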

AI中文摘要

迫切需要的通用机器人物体交互和操作要求高质量的跨类别物体感知。作为该领域的先驱，通用且可操作的部件（GAParts）理解吸引了越来越多相关研究人员的关注。然而，大多数最近的工作要么在对称问题的设计上不足，要么需要丰富的对称标注，这严重阻碍了在数据匮乏场景中精确的GAPart姿态估计。在本文中，我们提出SAFAG，一种新的无需对称标注的通用且可操作的部件姿态估计框架。具体而言，我们建议了一个分步细化的两阶段框架用于候选到最终的四元数回归，并将对称预测作为概率分布问题，通过自监督学习策略进行解决。实验结果证明了我们SAFAG的优越性能和鲁棒性。我们相信我们的工作在许多具身AI系统领域具有巨大的应用潜力。

英文摘要

Urgently needed generalizable robot object interaction and manipulation requires high-quality cross-category object perception. As a pioneer of this area, Generalizable and Actionable Parts (GAParts) understanding has attracted increasing attention from relevant researchers. However, most recent works either have insufficient design regarding the symmetry issue or require rich symmetry annotation, which severely impedes precise GAPart pose estimation in data-lacking scenarios. In this paper, we propose SAFAG, a novel Symmetry Annotation-Free framework for Generalizable and Actionable Parts Pose Estimation. Specifically, we suggest a stepwise refinement two-stage framework for candidate-to-final quaternion regression, and tackle symmetry prediction as a probability distribution problem with a self-supervised learning strategy. The experimental results demonstrate the superior performance and robustness of our SAFAG. We believe that our work has enormous potential to be applied in many areas of embodied AI systems.
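To make the symmetry issue concrete: a part with rotational symmetry has several equally valid orientation labels, so a plain quaternion error can penalize a correct prediction. The sketch below shows a generic symmetry-aware rotation error, assuming unit quaternions in (w, x, y, z) order and a known list of symmetry rotations. It is not SAFAG's method (which avoids needing such a list); all names are illustrative.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )

def quat_angle(q1, q2):
    """Rotation angle in radians between two unit quaternions."""
    # abs() because q and -q encode the same rotation
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))

def symmetry_aware_angle(q_pred, q_true, symmetries):
    """Smallest error over all symmetry-equivalent versions of the label."""
    return min(quat_angle(q_pred, quat_mul(q_true, s)) for s in symmetries)

identity = (1.0, 0.0, 0.0, 0.0)
half_turn_z = (0.0, 0.0, 0.0, 1.0)  # 180 degrees about the z axis
plain = quat_angle(half_turn_z, identity)
aware = symmetry_aware_angle(half_turn_z, identity, [identity, half_turn_z])
```

In this example the plain error is a half turn, while the symmetry-aware error is zero because the prediction matches a symmetry-equivalent label.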


2605.16979 2026-05-19 cs.RO 版本更新

NORM-Nav: Zero-Shot Mobile Robot Navigation with Natural Language Behavioral Constraints

NORM-Nav: 通过自然语言行为约束实现零样本移动机器人导航

Dongjie Huo, Junhui Wang, Chao Gao, Yan Qiao, Dong Zhang, Guyue Zhou

Affiliations: College of Information Science and Technology, Beijing University of Chemical Technology; Institute for AI Industry Research (AIR), Tsinghua University; Institute of Systems Engineering and Collaborative Laboratory for Intelligent Science and Systems, Macau University of Science and Technology; School of Vehicle and Mobility, Tsinghua University

AI summary: This paper proposes NORM-Nav, a framework that integrates natural language behavioral constraints into costmap-based planning to make mobile robot navigation in human environments more socially appropriate. Experiments show that it outperforms baselines in task success rate and in how closely its trajectories follow human references.

AI中文摘要

移动机器人在人类环境中运行时，不仅要生成无碰撞路径，还必须生成遵循本地行为规范的轨迹。传统基于成本图的导航强调几何可行性，往往忽视这些要求，可能导致不恰当的社会行为。本文提出了NORM-Nav，一种零样本框架，将自然语言行为约束整合到基于成本图的规划中。一个大语言模型将每个指令解析为结构化约束，并通过实时视觉-激光雷达感知进行 grounding。这些约束被编码为多层成本图，代表几何、语义、方向和速度提示，并直接与标准栅格规划器兼容。仿真和现实世界实验表明，NORM-Nav提高了任务成功率，并产生比代表基线更接近人类参考的轨迹。项目网站可用 https://ei-nav.github.io/NORM-Nav。

英文摘要

Mobile robots operating in human-centered environments must generate not only collision-free paths but also trajectories that follow local behavioral conventions. Conventional costmap-based navigation emphasizes geometric feasibility and often overlooks such requirements, which can result in socially inappropriate behaviors. This paper presents NORM-Nav, a zero-shot framework that integrates natural language behavioral constraints into costmap-based planning. An LLM parses each instruction into structured constraints and grounds them using real-time vision--LiDAR perception. These constraints are encoded as multi-layer costmaps that represent geometric, semantic, directional, and velocity cues and are directly compatible with standard grid-based planners. Simulation and real-world experiments indicate that NORM-Nav improves task success rates and produces trajectories closer to human references than representative baselines. The project website is available at https://ei-nav.github.io/NORM-Nav.
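The idea of encoding constraints as costmap layers that a standard grid planner can consume can be illustrated with a small fusion routine. This is a minimal sketch under assumed conventions (integer costs, 100 as the lethal value, a weighted sum for non-lethal cells); the abstract does not specify NORM-Nav's fusion rule, and `fuse_costmaps` is a hypothetical name.

```python
def fuse_costmaps(layers, weights, lethal=100):
    """Fuse equally sized 2D cost grids into one grid for a grid planner."""
    rows, cols = len(layers[0]), len(layers[0][0])
    fused = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cell = [layer[r][c] for layer in layers]
            if max(cell) >= lethal:
                fused[r][c] = lethal  # a lethal cell in any layer blocks the cell
            else:
                total = sum(w * v for w, v in zip(weights, cell))
                fused[r][c] = min(lethal - 1, round(total))  # stay below lethal
    return fused

geometric = [[0, 100], [0, 0]]  # obstacle in the top-right cell
semantic = [[0, 0], [40, 0]]    # soft penalty, e.g. "stay off the carpet"
fused = fuse_costmaps([geometric, semantic], weights=[1.0, 0.5])
```

Keeping hard obstacles lethal while treating language-derived constraints as soft penalties means an instruction can bias the route without making the goal unreachable.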


2605.16932 2026-05-19 cs.RO 版本更新

MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation

MORN: 为资源理性长周期导航的元认知目标-目标调节

Xi Lin, Jiayi Li, Kangyi Wu, Jiaqiao Tang, Qingrong He, Lin Zhao

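Affiliations: LCSR Lab, Johns Hopkins University; Xi’an Jiaotong University; JD Explore Academy, Beijing, China

AI summary: This paper proposes MORN, a metacognitive navigation architecture based on Dual-Process Theory. By introducing a resource-rational mechanism, it addresses the budget waste that conventional navigation systems incur on long-horizon missions because they lack global resource awareness, improving goal completion rate and mission efficiency.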

AI中文摘要

在无结构人类环境中部署的机器人必须频繁执行长周期任务，如找到杯子、然后椅子、然后打印机，这些任务受严格操作约束。尽管现代零样本物体导航（ObjectNav）代理利用视觉-语言模型（VLMs）有效定位语义目标，但它们本质上是纯粹的反应系统，缺乏全局资源意识。因此，这些代理由于部分可观测性而无意中耗尽关键预算，包括时间和电池，对不可行的子目标进行本地探索，未能在本地探索与全局任务可行性之间取得平衡。为了填补这一差距，通过在导航循环中注入资源理性，我们提出了MORN（元认知目标-目标调节导航），一种受认知科学双过程理论启发的执行架构。MORN在冻结的导航骨干上增加了一个System 2元控制器，持续监控System 1的移动。通过正式化三个神经认知状态，潜在指数、坚持门控和证据积累，MORN根据在线进度速度和感知不确定性的估计动态调节任务计划。这种机制有效消除了沉没成本谬误，使代理能够提前中止僵尸目标并果断承诺可行的目标。在HM3D数据集上的大量实验表明，MORN将目标完成率（CR）从0.23提高到0.30，并将浪费步分数（WSF）从0.90降低到0.70，证明在资源受限自主性中，元认知对全局资源的意识与反应能力导航同样关键。

英文摘要

Robots deployed in unstructured human environments must frequently execute long-horizon missions, such as find the mug, then the chair, then the printer, under strict operational constraints. While contemporary zero-shot Object Navigation (ObjectNav) agents leverage Vision-Language Models (VLMs) to effectively localize semantic targets, they operate as purely reactive systems that inherently lack global resource awareness. Consequently, these agents inadvertently exhaust critical budgets, including time and battery, on infeasible subgoals due to partial observability, failing to balance local exploration with global mission viability. To bridge this gap by injecting resource-rationality into the navigation loop, we present MORN (Metacognitive Object-goal Regulation Navigation), an executive architecture inspired by Dual-Process Theory in cognitive science. MORN augments frozen navigation backbones with a System 2 meta-controller that continuously monitors the System 1 locomotor. By formalizing three neuro-cognitive states, Potentiality Index, Persistence Gating, and Evidence Accumulation, MORN dynamically regulates the mission schedule based on online estimates of progress velocity and perceptual uncertainty. This mechanism effectively neutralizes the Sunk Cost Fallacy, enabling agents to abort zombie goals early and decisively commit to achievable ones. Extensive experiments on the HM3D dataset demonstrate that MORN improves Goal Completion Rate (CR) from 0.23 to 0.30 and reduces Wasted Step Fraction (WSF) from 0.90 to 0.70, establishing that in resource-constrained autonomy, the metacognitive awareness of global resources is as critical as the reactive ability to navigate.


2605.16894 2026-05-19 cs.RO cs.SY eess.SY 版本更新

Beyond Safety Filtering: Control Barrier Function-Informed Reinforcement Learning for Connected and Automated Vehicles

超越安全过滤：基于控制屏障函数的强化学习用于连接和自动化车辆

Jianye Xu, Bassam Alrifaee

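Affiliation: Department of Computer Science, RWTH Aachen University, Germany

AI summary: This paper proposes a control barrier function (CBF)-informed reward design for multi-agent reinforcement learning (MARL). It converts the CBF constraint value under the joint MARL action into a reward signal that explicitly guides safe learning. Experiments at a four-way multi-lane intersection show that it outperforms conventional heuristic approaches in task performance and in robustness to reward hyperparameters.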

Comments This paper has been accepted for publication in the Proceedings of the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)
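As a rough illustration of turning a CBF constraint value into a reward term, the sketch below uses a one-dimensional distance barrier. Everything here is an assumption made for illustration: the barrier h = distance - d_safe, the linear class-K function with gain alpha, and the choice to penalize only violations. The paper's actual barrier and reward shaping are not given in this listing.

```python
def cbf_constraint_value(distance, closing_speed, d_safe=2.0, alpha=1.0):
    """Value of h_dot + alpha * h; the CBF condition asks for this to be >= 0."""
    h = distance - d_safe   # barrier function: h >= 0 means the state is safe
    h_dot = -closing_speed  # the gap shrinks at the closing speed
    return h_dot + alpha * h

def cbf_reward(distance, closing_speed, scale=1.0):
    """Zero reward when the condition holds, a proportional penalty otherwise."""
    return scale * min(0.0, cbf_constraint_value(distance, closing_speed))

safe_case = cbf_reward(distance=10.0, closing_speed=1.0)   # condition satisfied
risky_case = cbf_reward(distance=3.0, closing_speed=4.0)   # condition violated
```

Unlike a collision penalty, this term becomes negative before the safety distance is breached, because it also accounts for how fast the gap is closing.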

2605.16871 2026-05-19 cs.RO 版本更新

SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

SADP：基于基础模型生成示范的子目标感知扩散策略用于可解释机器人

Site Hu, Takato Horii

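Affiliation: Department of Systems Innovation, Graduate School of Engineering Science, Osaka University

AI summary: This paper proposes SADP, a subgoal-aware diffusion policy for explainable robots learned from demonstrations generated by foundation models. Subgoal-annotated demonstrations are generated autonomously and used to train a diffusion policy, so the robot can explain its decisions to users through subgoal structure and execution progress, achieving higher task success rates and supporting failure diagnosis in long-horizon manipulation.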

AI中文摘要

可解释机器人不仅需要成功执行任务，还需要以用户友好的方式暴露内部决策过程。然而，大多数模仿学习方法仅在任务层面的示范上训练，没有显式建模子目标结构或执行进度。这种限制在标准机器人学习数据集中子目标级监督稀缺的情况下进一步加剧，限制了能够传达其执行子任务的机器人发展。为了解决这个问题，本文提出了Subgoal-Aware Diffusion Policy (SADP)，一种利用基础模型自主生成子目标标注的示范数据，并在这些数据集上训练扩散策略的框架。SADP通过将动作生成条件化在任务层面和子目标层面的描述上，围绕人类可解释的子目标结构构建策略执行。一个轻量级的辅助头进一步预测子目标完成状态，使机器人能够暴露其当前执行阶段并监控子目标进展。在RLBench模拟和实际UR5e机器人上的实验表明，SADP在任务成功率方面优于强大的任务条件扩散基线，同时提供子目标级执行信号用于监控进度和故障诊断。这些结果表明，内置而非事后解释性可以与高任务性能共存。

英文摘要

Explainable robots require not only successful task execution but also the ability to expose internal decision-making process in a user-friendly manner. However, most imitation learning methods are trained solely on task-level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal-level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long-horizon manipulation. To address this issue, this paper proposes Subgoal-Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal-annotated demonstrations and trains diffusion policies on these datasets. SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real-world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level execution signals for monitoring progress and diagnosing failures. These results highlight that built-in, rather than post-hoc, interpretability can coexist with high task performance.


2605.16870 2026-05-19 cs.RO 版本更新

SSTL: Self-Sensing Tendon Loop for Hysteresis Modeling and Compensation in Tendon-Sheath Mechanisms

SSTL：自感知腱环用于腱鞘机制的滞后模型与补偿

Myeongbo Park, Junhyun Park, Ihsan Ullah, Chunggil An, Minho Hwang

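Affiliations: Department of Robotics and Mechatronics Engineering, DGIST; AI Research Lab, DEEPNOID

AI summary: This paper proposes the Self-Sensing Tendon Loop (SSTL) to address hysteresis in tendon-sheath mechanisms caused by tendon-sheath friction and tendon elasticity. Measuring input and output tensions provides the data to model and compensate for the hysteresis, improving the control accuracy of flexible endoscopic robots.

Comments 8 pages, 7 figures, 4 tables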

AI中文摘要

柔性内窥镜机器人通过自然孔道实现微创接入，但其控制精度受限于腱鞘机制(TSMs)中配置依赖的滞后现象。腱鞘摩擦和腱弹性导致输入和输出之间存在系统性差异，且该差异随插入管配置变化。为解决这一挑战，本文提出自感知腱环(SSTL)，一种通过插入管双程路由并围绕远端滑轮缠绕的腱环结构，使输入和输出张力均可在近端测量，从而无需远端力或光纤传感器即可获得输入-输出张力剖面。由于SSTL与驱动TSM共享相同路由路径，两个TSM表现出高度相关的滞后行为。从SSTL张力剖面中，基于学习的映射估计驱动TSM的配置依赖滞后参数，这些参数随后被前馈控制器用于补偿驱动滞后。我们通过在三种不同插入管配置下跟踪驱动腱张力验证了所提方法。在正弦和随机轨迹上，所提方法将平均RMSE降低88.1%，达到直接识别方法的97.8%，后者需要直接测量驱动TSM的输入和输出张力剖面。

英文摘要

Flexible endoscopic robots enable minimally invasive access through natural orifices, but their control accuracy is limited by configuration-dependent hysteresis in the tendon-sheath mechanisms (TSMs). Tendon-sheath friction and tendon elasticity induce a systematic discrepancy between the proximal actuation input and distal output, and this discrepancy varies with the insertion tube configuration. To address this challenge, this paper proposes the Self-Sensing Tendon Loop (SSTL), a double-pass tendon loop that is routed through the insertion tube, wrapped around a distal pulley, and returned to the proximal end. The loop structure allows both the input and output tensions of the SSTL to be measured proximally, thereby providing an input-output tension profile without requiring distal force or fiber-optic sensors. Because the SSTL shares the same routing path as the actuation TSM, the two TSMs exhibit strongly correlated hysteresis behaviors. From the SSTL tension profile, a learning-based mapping estimates the configuration-dependent hysteresis parameters of the actuation TSM, which are then used by a feedforward controller to compensate for actuation hysteresis. We validate the proposed method by tracking actuation tendon tension under three different insertion tube configurations. Across sinusoidal and random trajectories, the proposed method reduces average RMSE by 88.1% compared with the uncompensated baseline, achieving 97.8% of the performance of direct identification, which requires direct measurement of the input and output tension profiles of the actuation TSM.
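To show why sheath friction produces a configuration-dependent gap between proximal and distal tension, and what feedforward compensation means here, the sketch below uses the classical capstan friction relation. This is a textbook simplification chosen for illustration, not the hysteresis model or the learned mapping from the paper; `mu` (friction coefficient) and `theta` (total bend angle of the sheath, in radians) are assumed parameters.

```python
import math

def distal_tension(t_in, mu, theta, pulling=True):
    """Capstan-style model: friction changes tension along a curved sheath."""
    sign = 1.0 if pulling else -1.0  # friction direction flips on release
    return t_in * math.exp(-sign * mu * theta)

def feedforward_input(t_desired, mu, theta, pulling=True):
    """Invert the model: proximal tension needed for a desired distal tension."""
    sign = 1.0 if pulling else -1.0
    return t_desired * math.exp(sign * mu * theta)

mu, theta = 0.2, math.pi  # illustrative values only
uncompensated = distal_tension(5.0, mu, theta)
compensated = distal_tension(feedforward_input(5.0, mu, theta), mu, theta)
```

Because the loss depends on `theta`, the same motor command yields different distal tensions as the insertion tube bends, which is why the parameters have to be re-estimated per configuration.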


2605.16863 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

先规划，后扩散：用于长视距扩散规划的外在图引导

Yaniv Hassidof, Adir Morgan, Yilun Du, Kiril Solovey

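Affiliations: Technion; Harvard

AI summary: This paper proposes XDiffuser, an extrinsic search-guided diffusion model that first plans over a state-space graph and then uses the plan to guide the diffusion process, improving the efficiency and quality of long-horizon planning, especially with low-quality data and on unseen tasks.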

AI中文摘要

组合扩散模型通过去噪多个重叠的子轨迹并确保它们构成全局解，为长视距规划提供了一条有前途的路线。然而，强制在长链上执行局部行为往往不足以产生一致的全局结构。最近的工作通过内在搜索在去噪过程中探索多条路径来解决这一限制。尽管内在搜索提高了全局一致性，但代价是重复评估已经计算密集的模型。在本文中，我们主张在去噪过程之外进行外在搜索，为长视距规划提供更有效的探索模式，同时自然地使经典算法能够解决测试时的未见组合任务。我们的eXtrinsic搜索引导的Diffuser（XDiffuser）首先在状态空间图上计算一个计划——作为扩散模型的轻量级局部连接Oracle。该计划随后用于引导单条轨迹的去噪，有效地将探索负担转移出去。XDiffuser在长视距任务上优于基于扩散的基线，特别是在低质量数据领域和超出目标到达的未见任务中，包括多智能体协调和TSP风格推理。项目网站：https://yanivhass.github.io/XDiffuser-site/

英文摘要

Compositional diffusion models offer a promising route to long-horizon planning by denoising multiple overlapping sub-trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute-heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long-horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search-guided Diffuser (XDiffuser) first computes a plan over a state-space graph -- serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion-based baselines on long-horizon tasks, with particularly large gains in the low-quality data regime and on unseen tasks beyond goal-reaching, including multi-agent coordination and TSP-style reasoning. Project website: https://yanivhass.github.io/XDiffuser-site/
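The "plan first" step can be pictured as an ordinary shortest-path search whose output is a waypoint sequence handed to the generative model. The sketch below uses Dijkstra's algorithm on a small hand-written graph. How XDiffuser builds its graph, scores edges, and conditions denoising on the plan is not described in the abstract, so the graph and the name `shortest_path` are illustrative assumptions.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over a dict graph: node -> list of (neighbor, cost)."""
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            # Walk parent links back to the start to recover the path
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(queue, (new_cost, nxt))
    return None  # goal not reachable

graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 1.0)],
    "C": [("D", 1.0)],
    "D": [],
}
waypoints = shortest_path(graph, "A", "D")
```

Because the search runs on a lightweight graph instead of inside the denoising loop, swapping in a different classical solver (for example, a TSP heuristic) does not require re-evaluating the diffusion model during exploration.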


2605.16858 2026-05-19 cs.RO cs.AI 版本更新

Pedestrian-Aware LLM-Driven Behavioral Planning for Autonomous Vehicles

面向行人的LLM驱动行为规划用于自动驾驶车辆

Aidana Baimbetova, Haruki Yonekura, Hamada Rizk, Hirozumi Yamaguchi

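Affiliations: The University of Osaka, Japan; RIKEN Center for Computational Science, Japan; Tanta University, Egypt

AI summary: This paper proposes a large language model-based decision-making framework that lets autonomous vehicles account for pedestrian behavior in complex urban environments. Structured scene observations are converted into natural-language reasoning prompts, from which the model generates safe driving decisions.

Comments This paper has been accepted for presentation at the 29th IEEE International Conference on Intelligent Transportation Systems (ITSC)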

AI中文摘要

自动驾驶车辆（AVs）必须在行人行为多变、有时异常且训练中常未见的密集城市环境中做出可靠决策。基于强化学习（RL）的AV控制系统在结构化交通中表现良好，但在面对不可预测的行人交互和分布外场景时泛化能力较差。其依赖手工制定的奖励和不透明决策进一步限制了其在行人密集、安全关键环境中的适用性。为了解决这些限制，我们引入了一种基于大型语言模型（LLM）的决策框架，用于行人感知的行为规划。该系统将结构化的场景观测转换为自然语言推理提示，使LLM能够推断行人意图、预测风险并生成谨慎的战术驾驶决策。这些决策由运动规划器执行，以确保平滑且动力学可行的控制。我们在SUMO上评估了该框架，涵盖多个行人交互场景，包括意外闯红灯、回退过马路、犹豫和双向过马路。在零样本评估中，基于LLM的智能体实现了68%的无碰撞成功率，显著优于深度RL基线（17.7%）。在单行人场景中使用少量样本的episodic记忆，性能增加到96.0%，超过定制DQN控制器（82.0%）。跨行为评估进一步表明，来自回退交互的记忆可以转移到未见的犹豫和双向过马路场景，分别达到82.0%和90.0%的成功率。该系统能够更早地发起响应，维持更宽的安全缓冲区，并产生可解释、与人类一致的决策。

英文摘要

Autonomous Vehicles (AVs) must make reliable decisions in dense urban environments where pedestrian behavior is variable, sometimes abnormal, and often unseen during training. Reinforcement learning (RL)-based AV control systems perform well in structured traffic but struggle to generalize to unpredictable pedestrian interactions and out-of-distribution scenarios. Their reliance on handcrafted rewards and opaque decisions further limits their suitability for safety-critical, pedestrian-rich environments. To address these limitations, we introduce a Large Language Model (LLM)-based decision-making framework for pedestrian-aware behavioral planning. The system converts structured scene observations into natural-language reasoning prompts, enabling the LLM to infer pedestrian intent, anticipate risk, and generate cautious tactical driving decisions. These decisions are executed by a motion planner that ensures smooth, kinematically feasible control. We evaluate the framework in SUMO across multiple pedestrian-interaction scenarios, including unexpected jaywalking, turn-back crossing, hesitation, and bidirectional crossing. In zero-shot evaluation, the LLM-based agent achieves a 68% collision-free success rate, substantially outperforming deep RL baselines (17.7%). With few-shot episodic memory in a single-pedestrian scenario, performance increases to 96.0%, exceeding a custom DQN controller (82.0%). Cross-behavior evaluation further shows that memory derived from turn-back interactions transfers to unseen hesitation and bidirectional crossing scenarios, achieving 82.0% and 90.0% success, respectively. The system consistently initiates earlier responses, maintains wider safety buffers, and produces interpretable, human-aligned decisions.
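The conversion from structured scene observations to a natural-language prompt can be sketched as a plain formatting function. The field names, the wording of the prompt, and the action vocabulary below are all hypothetical; the abstract does not give the paper's prompt template, and no LLM call is made here.

```python
def scene_to_prompt(ego_speed_mps, pedestrians):
    """Render a structured scene as text for a language model to reason over."""
    lines = [f"Ego vehicle speed: {ego_speed_mps:.1f} m/s."]
    for i, p in enumerate(pedestrians, start=1):
        lines.append(
            f"Pedestrian {i}: {p['distance_m']:.1f} m ahead, "
            f"{p['lateral_m']:.1f} m from lane center, moving {p['motion']}."
        )
    # Constrain the answer to a small action set so it can be parsed reliably
    lines.append("Choose one action: KEEP_SPEED, SLOW_DOWN, or STOP. Explain briefly.")
    return "\n".join(lines)

prompt = scene_to_prompt(
    8.3,
    [{"distance_m": 12.0, "lateral_m": 1.5, "motion": "toward the road"}],
)
```

Restricting the reply to a fixed set of tactical actions is one way to hand the decision to a downstream motion planner, which the abstract says is responsible for smooth, kinematically feasible control.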


2605.16797 2026-05-19 cs.CV cs.RO 版本更新

EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices

EgoKit: 向统一低成本第一人称视角数据采集迈进：异构设备

Liuchuan Yu, Erdem Murat, Beichen Wang, Yan Zeng, Tingting Luo, Huizhen Zhou, Shanghao Li, Huining Feng, Zhigen Zhao, Ning Yang, Ke Jing, Yunhao Liu, Ruoya Sheng

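Affiliations: George Mason University; ByteDance

AI summary: This paper proposes EgoKit, an egocentric data collection toolkit that unifies six heterogeneous devices. It addresses SDK differences and inconsistent capture across devices and provides a uniform log format, plus hand-tracking data on XR headsets.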

AI中文摘要

第一人称视角视频越来越多地被用作机器人学习、活动理解及具身AI研究的数据源，但大规模采集仍然碎片化：每个候选主机设备，如Android手机、iPhone、iPad、智能眼镜或扩展现实(XR)头戴设备，都暴露了不同的SDK，对原始摄像机访问有不同的政策，以及对外部USB摄像机和设备内跟踪有不同的限制。因此，同步第一人称视角和腕部视角的采集通常通过要么承诺单一专有平台或构建一次性装置来实现，这些装置无法跨设备转移。为了解决这一差距，我们提出了EgoKit，一种工具包，它在六个异构主机设备上暴露相同的第一人称视角录制流程。在所有支持的设备上，EgoKit提供相同的录制交互，并产生本地存储的视频，具有统一的日志格式；在XR头戴设备上，它还记录头部姿态和符合OpenXR标准的26关节手部追踪，与视频流对齐。配套的配件，包括两个带有支架的腕部摄像机、一个头带和一个USB-C集线器，使任何支持的主机都能添加腕部视角捕获，而无需定制硬件制造。EgoKit可在\url{https://egokit.chuange.org/}上获得。

英文摘要

Egocentric video is increasingly used as a data source for robot learning, activity understanding, and embodied AI research, but collecting it at scale remains fragmented in practice: each candidate host device, such as an Android phone, iPhone, iPad, smart glasses, or extended reality (XR) headset, exposes a different SDK, a different policy on raw camera access, and different limitations on external USB cameras and on-device tracking. Synchronized ego-view and wrist-view capture is therefore typically obtained by either committing to a single proprietary platform or building one-off rigs that do not transfer across devices. To address this gap, we present EgoKit, a toolkit that exposes the same egocentric recording workflow across six heterogeneous host devices. Across all supported devices, EgoKit presents the same recording interaction and produces locally stored video with a uniform log format; on XR headsets, it additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The companion accessories, including two wrist cameras with mounts, a head strap, and a USB-C hub, add wrist-view capture to any supported host without custom hardware fabrication. EgoKit is available at https://egokit.chuange.org/.


2605.16743 2026-05-19 cs.RO 版本更新

LACE: Latent Visual Representation for Cross-Embodiment Learning

LACE: 用于跨具身学习的潜在视觉表示

Yoo Sung Jang, Kanchana Ranasinghe, Cristina Mata, Yichi Zhang, Jorge Mendez-Mendez, Michael S. Ryoo

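Affiliations: Stony Brook University; Salesforce AI Research

AI summary: This paper proposes LACE, a framework that aligns human and robot visual representations in the latent space of self-supervised learning backbones by using correspondences between body parts shared across embodiments. This narrows the visual gap between human and robot embodiments and improves robot policies when robot demonstrations are scarce.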

AI中文摘要

从人类示范中进行跨具身学习受到人类与机器人具身之间视觉差距的阻碍。尽管自监督学习（SSL）backbone能够编码通用物体的丰富类间语义，但我们发现它们无法建立人类与机器人手之间的对应关系。我们提出了LACE，一个框架，通过利用跨具身共享身体部分的对应关系作为稀疏监督，在这些backbone的潜在空间中对齐人类和机器人视觉表示。这些注解可以通过正向运动学自动获得，单个机器人示范就足以训练模型。我们的语义对齐损失匹配由对应特征引起的影响分布，将片段级监督提升到语义级对齐，同时Gram损失保留预训练特征质量。这种对齐使机器人策略能够在机器人示范稀缺时利用丰富的数据：在零样本迁移中，使用LACE-DINO的策略比使用DINO的策略表现优异（65%），在低数据和分布外环境中有持续的提升。

英文摘要

Cross-embodiment learning from human demonstrations is hindered by the visual gap between human and robot embodiments. While self-supervised learning (SSL) backbones encode rich inter-class semantics of general objects, we show they fail to establish correspondence between human and robot hands. We propose LACE, a framework that aligns human and robot visual representations in the latent space of these backbones by leveraging correspondences between shared body parts across embodiments as sparse supervision. These annotations can be obtained automatically via forward kinematics, and a single robot demonstration is sufficient to train the model. Our semantic alignment loss matches distributions induced by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality. This alignment enables robot policies to leverage abundant human data when robot demonstrations are scarce: in zero-shot transfer, policies using LACE-DINO outperform those using DINO by a large margin (65%), with consistent gains in low-data regimes and out-of-distribution environments.
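The role of a Gram loss as a regularizer can be illustrated with a few lines of plain Python: it compares patch-to-patch similarity matrices, so features may move as long as their mutual similarities are preserved. This is a generic sketch, assuming dot-product similarity and a mean squared difference; the exact normalization and weighting used by LACE are not given in the abstract.

```python
def gram(features):
    """Gram matrix of pairwise dot products between patch feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]

def gram_loss(adapted, pretrained):
    """Mean squared difference between the two Gram matrices."""
    g_a, g_p = gram(adapted), gram(pretrained)
    n = len(g_a)
    return sum(
        (g_a[i][j] - g_p[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)

pretrained = [[1.0, 0.0], [0.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0]]  # rotated features, same mutual similarities
scaled = [[2.0, 0.0], [0.0, 1.0]]    # one patch feature stretched

loss_rotated = gram_loss(rotated, pretrained)
loss_scaled = gram_loss(scaled, pretrained)
```

The rotated features incur no penalty because a Gram matrix is invariant to a shared rotation, whereas stretching one patch changes its similarities and is penalized.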


2605.16737 2026-05-19 cs.RO cs.CV (updated version)

DriveSafer: End-to-End Autonomous Driving with Safety Guidance

Shounak Sural, Raj Rajkumar

Affiliations: Carnegie Mellon University

AI summary: This paper proposes DriveSafer, a framework that improves the safety of end-to-end autonomous driving by reducing catastrophic planning failures rather than simply raising average planning quality.

Abstract

End-to-End (E2E) autonomous driving models have shown growing capability in recent years, with performance improving on increasingly challenging benchmarks. However, modern generative E2E planners still suffer from a substantial number of catastrophic failures in safety-critical scenarios. We find that many such failures arise from violations of physical constraints and safety requirements, leading to unsafe behavior. Motivated by this finding, in this paper, we focus on improving safety outcomes in generative end-to-end driving with a targeted reduction of catastrophic planning failures, instead of enhancing average planning quality. Towards this end, we propose DriveSafer, a failure-aware safety framework for end-to-end planners. DriveSafer explicitly steers generative planners towards safe behaviors leveraging both training-time safety constraints and inference-time safety guidance. Compared to the state-of-the-art DiffusionDrive model, on the NAVSIM benchmark, DriveSafer reduces the number of catastrophic failures (PDMS=0) by 48%, with over 65% reduction in drivable-area compliance failures.


2605.16673 2026-05-19 cs.RO (updated version)

Bayesian Networks for Path-Based Sensors: Gathering Information and Path Planning in Communication Denied Environments

Alkesh K. Srivastava, George P. Kontoudis, Donald Sofge, Michael Otte

Affiliations: University of Maryland, College Park, MD, US; Temple University, Philadelphia, PA, US; Colorado School of Mines, Golden, CO, US; U.S. Naval Research Lab (Retired), DC, US

AI summary: This paper proposes a Bayesian-network-based update method that speeds up belief-map convergence for path-based sensors in communication-denied environments while accounting for false positives and false negatives.

Comments: This paper has been accepted for presentation at the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR 2026)

Abstract

A "path-based sensor" produces a single observation along a continuous path. For example, a boolean path-based sensor returns a single "1" if an event of interest is detected at any point along the path and a "0" otherwise. Notably, a "1" provides no direct information about where along the path the event(s) may have occurred. Previous work has demonstrated that observations from multiple path-based sensors can be fused to create a Bayesian belief map over the spatial locations of the underlying event or phenomenon. Moreover, path planning can employ Shannon information theory to accelerate the rate of convergence of the belief map. In this paper, we present a new method to update the belief map based on a path-based sensor observation, and then plan paths to increase information gain. In contrast to prior work that approximates the posterior by averaging over the alternative event histories, we introduce a Bayesian Network (BN) formulation that models the probabilistic relationships between the latent variables and path-based sensor measurements, enabling a more principled Bayesian belief update. We consider static hazard detection in a communication-denied environment as a representative problem setting. The event of a robot returning from its path corresponds to a path-based hazard sensor reading of "0" (hazard not detected), while a robot failing to return corresponds to a reading of "1" (hazard detected). We consider false positives and false negatives. We find that the new method leads to quicker convergence of the belief map than prior work in both single- and multi-robot cases.
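The boolean path-based sensor described in this abstract can be illustrated with a small Bayesian update. The following is a minimal sketch, not the paper's Bayesian Network formulation: it assumes independent per-cell hazard priors and computes the exact per-cell marginal posterior after one path reading. The function name `path_sensor_posterior` and the parameters `false_pos` and `false_neg` are hypothetical names chosen for illustration.

```python
from math import prod


def path_sensor_posterior(priors, reading, false_pos=0.05, false_neg=0.1):
    """Per-cell marginal posterior P(hazard in cell i | one path reading).

    Assumptions (illustrative, not taken from the paper):
      * cells on the path have independent hazard priors,
      * false_pos = P(reading 1 | no hazard anywhere on the path),
      * false_neg = P(reading 0 | at least one hazard on the path).
    """
    posteriors = []
    for i, p_i in enumerate(priors):
        # Probability that some other cell on the path holds a hazard.
        p_other = 1.0 - prod(1.0 - p for j, p in enumerate(priors) if j != i)
        # Likelihood of reading "1" with and without a hazard in cell i.
        like1_hazard = 1.0 - false_neg
        like1_clear = (1.0 - false_neg) * p_other + false_pos * (1.0 - p_other)
        if reading == 1:
            like_hazard, like_clear = like1_hazard, like1_clear
        else:
            like_hazard, like_clear = 1.0 - like1_hazard, 1.0 - like1_clear
        evidence = like_hazard * p_i + like_clear * (1.0 - p_i)
        posteriors.append(like_hazard * p_i / evidence)
    return posteriors
```

A reading of "1" raises the hazard belief of every cell on the path without saying which cell caused it, while a reading of "0" lowers all of them, which matches the qualitative behaviour described in the abstract.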


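2605.10408 2026-05-19 cs.SE cs.RO (updated version)

Efficient Trajectory Optimization for Autonomous Racing via F1 Data-Driven Initialization

Samir Shehadeh, Lukas Kutsch, Nils Dengler, Sicong Pan, Maren Bennewitz

Affiliations: University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence; Center for Robotics; German Federal Ministry of Research, Technology and Space

AI summary: This paper uses F1 telemetry data to build a multi-track trajectory dataset and proposes a learning-based initialization strategy that predicts an expert-like raceline from local track geometry, accelerating convergence of the optimal control solver and reducing runtime.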

Abstract

Trajectory optimization is a central component of fast and efficient autonomous racing. However, practical optimization pipelines remain highly sensitive to initialization and may converge slowly or to suboptimal local solutions when seeded with heuristic trajectories such as the centerline or minimum-curvature paths. To address this limitation, we leverage expert driving behavior as an initialization prior and propose a learning-informed initialization strategy based on real-world Formula 1 telemetry. To this end, we first construct a multi-track Formula 1 trajectory dataset by reconstructing and aligning noisy GPS telemetry to a standardized reference-line representation across 17 tracks. Building on this, we present a neural network that predicts an expert-like raceline offset directly from local track geometry, without explicitly modeling vehicle dynamics or forces. The predicted raceline is then used as an informed seed for a minimum-time optimal control solver. Experiments on all 17 tracks demonstrate that the learned initialization accelerates solver convergence and significantly reduces runtime compared to traditional geometric baselines, while preserving the final optimized lap time.
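The abstract describes the raceline as an offset from a reference line. The sketch below shows one plausible way to turn predicted lateral offsets into raceline points that could seed a solver. It is an illustration under stated assumptions, not the authors' pipeline: the reference line is an open 2D polyline with distinct consecutive points, positive offsets point to the left of the driving direction, and the name `raceline_from_offsets` is hypothetical.

```python
import math


def raceline_from_offsets(centerline, offsets):
    """Shift each reference-line point sideways by its signed lateral offset."""
    n = len(centerline)
    raceline = []
    for i, (x, y) in enumerate(centerline):
        # Tangent from the neighbouring points (one-sided at both ends).
        x0, y0 = centerline[max(i - 1, 0)]
        x1, y1 = centerline[min(i + 1, n - 1)]
        tx, ty = x1 - x0, y1 - y0
        length = math.hypot(tx, ty)
        # Left-pointing unit normal: the tangent rotated by +90 degrees.
        nx, ny = -ty / length, tx / length
        raceline.append((x + offsets[i] * nx, y + offsets[i] * ny))
    return raceline
```

In this sketch the offsets would come from the learned predictor; here they are simply passed in as a list with one value per reference-line point.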


2603.02642 2026-05-19 cs.RO cs.DC cs.SY eess.SY (updated version)

cuNRTO: GPU-Accelerated Nonlinear Robust Trajectory Optimization

Jiawei Wang, Arshiya Taj Abdul, Evangelos A. Theodorou

Affiliations: Georgia Institute of Technology, Atlanta; University of California, San Diego; Deemos Corporation

AI summary: This paper proposes cuNRTO, a CUDA implementation of nonlinear robust trajectory optimization that uses Douglas-Rachford (DR) splitting and ADMM to handle the SOCP-constrained subproblems and improve computational efficiency; experiments on several robot models show speedups of up to 139.6×.

Abstract

Robust trajectory optimization enables autonomous systems to operate safely under uncertainty by computing control policies that satisfy the constraints for all bounded disturbances. However, these problems often lead to large Second Order Conic Programming (SOCP) constraints, which are computationally expensive. In this work, we propose the CUDA Nonlinear Robust Trajectory Optimization (cuNRTO) framework by introducing two dynamic optimization architectures that have direct application to robust decision-making and are implemented on CUDA. The first architecture, NRTO-DR, leverages the Douglas-Rachford (DR) splitting method to solve the SOCP inner subproblems of NRTO, thereby significantly reducing the computational burden through parallel SOCP projections and sparse direct solves. The second architecture, NRTO-FullADMM, is a novel variant that further exploits the problem structure to improve scalability using the Alternating Direction Method of Multipliers (ADMM). Finally, we provide GPU implementations of the proposed methodologies using custom CUDA kernels for SOC projection steps and cuBLAS GEMM chains for feedback gain updates. We validate the performance of cuNRTO through simulated experiments on unicycle, quadcopter, and Franka manipulator models, demonstrating speedups of up to 139.6×. More details are available at https://cunrto.github.io.
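The SOC projection step mentioned in the abstract has a well-known closed form, which is what makes it easy to run in parallel across many cones. The following is a minimal CPU sketch of that standard projection for a single cone {(x, t) : ||x|| <= t}; it is not the authors' CUDA kernel, and the function name `project_soc` is hypothetical.

```python
import math


def project_soc(x, t):
    """Euclidean projection of the point (x, t) onto the cone ||x|| <= t."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= t:
        # Already inside the cone: nothing to do.
        return list(x), t
    if norm <= -t:
        # Inside the polar cone: the closest cone point is the origin.
        return [0.0] * len(x), 0.0
    # Otherwise project onto the cone boundary.
    scale = (norm + t) / (2.0 * norm)
    return [scale * v for v in x], scale * norm
```

Each cone is projected independently of the others, so in a splitting method such as Douglas-Rachford one such call per cone can be mapped to one GPU thread or thread block.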


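2602.23058 2026-05-19 cs.CV cs.RO (updated version)

GeoWorld: Geometric World Models

Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley

Affiliations: ANU; MBZUAI

AI summary: GeoWorld uses a Hyperbolic JEPA and Geometric Reinforcement Learning to address the weaknesses of conventional energy-based predictive models in geometric structure and long-horizon prediction; experiments show gains of about 3% in 3-step planning and about 2% in 4-step planning.

Comments: Accepted to CVPR 2026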

Abstract

Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
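To make the idea of a hyperbolic latent space concrete, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. This is an assumption for illustration only: the abstract does not say which hyperbolic model GeoWorld uses, and the function name `poincare_distance` is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    def sq_norm(w):
        return sum(c * c for c in w)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

Distances grow rapidly as points approach the boundary of the ball, which is why hyperbolic embeddings are often used for tree-like or hierarchical structure.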


2602.22801 2026-05-19 cs.RO cs.AI cs.LG (updated version)

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang, Jianwei Cui, Guang Chen, Kun Ma, Hangjun Ye, Long Chen, Ya-Qin Zhang, Xianyuan Zhan, Jingjing Liu

Affiliations: Institute for AI Industry Research (AIR), Tsinghua University

AI summary: Using large-scale real-vehicle data and road testing, this paper systematically studies the planning capability of diffusion models for end-to-end autonomous driving and proposes the Hyper Diffusion Planner framework, which achieves a 10x performance improvement over the base model.

Abstract

Diffusion models have become a popular choice for decision-making tasks in robotics, and more recently, are also being considered for solving autonomous driving tasks. However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings. The full strength of diffusion models for large-scale, complex real-world settings, such as End-to-End Autonomous Driving (E2E AD), remains underexplored. In this study, we conducted a systematic and large-scale investigation to unleash the potential of the diffusion models as planners for E2E AD, based on a tremendous amount of real-vehicle data and road testing. Through comprehensive and carefully controlled studies, we identify key insights into the diffusion loss space, trajectory representation, and data scaling that significantly impact E2E planning performance. Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety and robustness of the learned planner. The resulting diffusion-based learning framework, Hyper Diffusion Planner (HDP), is deployed on a real-vehicle platform and evaluated across 6 urban driving scenarios and 200 km of real-world testing, achieving a notable 10x performance improvement over the base model. Our work demonstrates that diffusion models, when properly designed and trained, can serve as effective and scalable E2E AD planners for complex, real-world autonomous driving tasks.


2602.19710 2026-05-19 cs.CV cs.LG cs.RO (updated version)

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu

Affiliations: Tencent Robotics X; Futian Laboratory; The Hong Kong University of Science and Technology; Fudan University; Shanghai Innovation Institute

AI summary: This paper proposes Pose-VLA, which separates pre-training from post-training to address feature collapse and low training efficiency in vision-language-action models, extracting universal 3D spatial priors and aligning them efficiently with robot-specific action spaces.

Comments: Accepted to Robotics: Science and Systems (RSS) 2026. Project website: https://hetolin.github.io/PoseVLA

Journal ref: Robotics: Science and Systems, 2026

Abstract

Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.


2602.12633 2026-05-19 cs.RO (updated version)

Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning

Tianyi Xiang, Jiahang Cao, Sikai Guo, Guoyang Zhao, Andrew F. Luo, Jun Ma

Affiliations: Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou); Institute of Data Science, The University of Hong Kong

AI summary: This paper proposes a physics-constrained Real-to-Sim pipeline that models spatial dependencies with a contact graph to refine object poses and physical properties in highly cluttered environments, producing simulation scenes with high physical fidelity.

Comments: Project page: https://physics-constrained-real2sim.github.io

Abstract

Reconstructing physically valid 3D scenes from single-view observations is a prerequisite for bridging the gap between visual perception and robotic control. However, in scenarios requiring precise contact reasoning, such as robotic manipulation in highly cluttered environments, geometric fidelity alone is insufficient. Standard perception pipelines often neglect physical constraints, resulting in invalid states, e.g., floating objects or severe inter-penetration, rendering downstream simulation unreliable. To address these limitations, we propose a novel physics-constrained Real-to-Sim pipeline that reconstructs physically consistent 3D scenes from single-view RGB-D data. Central to our approach is a differentiable optimization pipeline that explicitly models spatial dependencies via a contact graph, jointly refining object poses and physical properties through differentiable rigid-body simulation. Extensive evaluations in both simulation and real-world settings demonstrate that our reconstructed scenes achieve high physical fidelity and faithfully replicate real-world contact dynamics, enabling stable and reliable contact-rich manipulation.


2602.10503 2026-05-19 cs.RO (updated version)

C-ZUPT: Stationarity-Based Aerial Hovering

Daniel Engelsman, Itzik Klein

Affiliations: Hatter Department of Marine Technologies, Charney School of Marine Sciences, University of Haifa

AI summary: This paper proposes C-ZUPT, which defines an uncertainty threshold to identify quasi-static equilibria and supplies precise velocity updates to the estimation filter, reducing inertial drift and control effort and improving navigation stability and hovering energy efficiency.

Comments: 14 pages, 16 figures, 9 tables

DOI: 10.1109/TAES.2025.3650499
Journal ref: IEEE Transactions on Aerospace and Electronic Systems, volume 62, pages 4063-4077, 2026

Abstract

Autonomous systems across diverse domains have underscored the need for drift-resilient state estimation. Although satellite-based positioning and cameras are widely used, they often suffer from limited availability in many environments. As a result, positioning must rely solely on inertial sensors, leading to rapid accuracy degradation over time due to sensor biases and noise. To counteract this, alternative update sources, referred to as information aiding, serve as anchors of certainty. Among these, the zero-velocity update (ZUPT) is particularly effective in providing accurate corrections during stationary intervals, though it is restricted to surface-bound platforms. This work introduces a controlled ZUPT (C-ZUPT) approach for aerial navigation and control, independent of surface contact. By defining an uncertainty threshold, C-ZUPT identifies quasi-static equilibria to deliver precise velocity updates to the estimation filter. Extensive validation confirms that these opportunistic, high-quality updates significantly reduce inertial drift and control effort. As a result, C-ZUPT mitigates filter divergence and enhances navigation stability, enabling more energy-efficient hovering and substantially extending sustained flight, which are key advantages for resource-constrained aerial systems.
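The zero-velocity update described above amounts to a Kalman measurement update with a pseudo-measurement of zero velocity. The sketch below shows the scalar case together with a simple threshold gate. It is a one-dimensional illustration under stated assumptions, not the C-ZUPT algorithm: the names `zupt_update` and `is_quasi_static` and the gating rule are hypothetical, since the abstract does not specify how the uncertainty threshold is evaluated.

```python
def zupt_update(velocity, variance, meas_noise):
    """Scalar Kalman update with the pseudo-measurement 'velocity = 0'."""
    gain = variance / (variance + meas_noise)
    new_velocity = velocity + gain * (0.0 - velocity)
    new_variance = (1.0 - gain) * variance
    return new_velocity, new_variance


def is_quasi_static(speed_samples, threshold):
    """Hypothetical gate: every recent speed sample stays below a threshold."""
    return max(abs(s) for s in speed_samples) < threshold
```

In use, the filter would call `zupt_update` only while `is_quasi_static` holds, so drift in the velocity estimate is pulled back toward zero during hover and left untouched during maneuvers.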


2506.13189 2026-05-19 cs.HC cs.RO (updated version)

Gesture First, LLM-Assisted Voice Complement: Exploring Multimodal Robot 'Puppeteer' Teleoperation Via Virtual Counterpart in Augmented Reality

Yuchong Zhang, Bastian Orthmann, Shichen Ji, Michael Welle, Jonne Van Haastregt, Danica Kragic

Affiliations: KTH Royal Institute of Technology

AI summary: This paper explores multimodal robot teleoperation through a virtual counterpart in augmented reality, compares gesture-only and combined voice+gesture interaction in terms of performance and user experience, and proposes design guidelines that balance efficiency, robustness, and user expertise.

Comments: This work is under peer review

Abstract

Robot teleoperation via augmented reality (AR) offers a promising path toward more intuitive human-robot interaction (HRI). We present a head-mounted AR 'puppeteer' system in which users control a physical robot by interacting with its virtual counterpart robot using large language model (LLM)-assisted voice commands and hand-gesture interaction on the Meta Quest 3. In a within-subject user study with 42 participants performing an AR-based robotic pick-and-place pattern-matching task, we empirically compare two interaction conditions: gesture-only (GO) and combined voice+gesture (VG) on performance and user experience (UX). In VG, voice and gesture operate in a sequential role-allocated manner, with voice handling high-level navigation and gesture handling fine manipulation. Our results show that GO currently provides more reliable and efficient control for this time-critical task, while VG introduces additional flexibility but also latency and recognition issues that can increase workload. We additionally analyze how prior robotics expertise differentiates performance and UX across conditions. Based on these findings, we distill a set of design guidelines for AR 'puppeteer' metaphoric robot teleoperation, framing multimodality as an adaptive strategy that must balance efficiency, robustness, and user expertise rather than assuming that additional modalities are universally beneficial.


2504.14820 2026-05-19 cs.RO (updated version)

A Visual Reinforcement Learning-Based Separate Primitive Policy for Peg-in-Hole Tasks

Zichun Xu, Zhaomin Wang, Yuntao Li, Lei Zhuang, Zhiyuan Zhao, Guocai Yang, Jingdong Zhao

Affiliations: State Key Laboratory of Robotics and Systems, Harbin Institute of Technology; Ubtech Robotics; Meituan Academy of Robotics; School of Mechanical Engineering, Shandong University

AI summary: This paper proposes the S2P policy, which uses visual reinforcement learning to learn the positioning and insertion actions of peg-in-hole tasks simultaneously, improving sample efficiency and success rate.

Comments: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

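2503.12181 2026-05-19 cs.AI cs.RO (updated version)

Action-Gradient Monte Carlo Tree Search for Non-Parametric Continuous (PO)MDPs

Idan Lev-Yehudi, Michael Novitsky, Moran Barenboim, Ron Benchetrit, Vadim Indelman

Affiliations: Technion – Israel Institute of Technology

AI summary: This paper proposes AGMCTS, a framework that combines global tree search with local gradient-based optimization for planning in continuous state spaces; its theoretical contributions are an action score gradient theorem, the Multiple Importance Sampling tree, and tractable action score gradients.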

Abstract

Online planning in continuous state, action, and observation spaces remains challenging for autonomous systems. While Monte Carlo Tree Search (MCTS) scales effectively via sampling, most continuous (PO)MDP solvers do not exploit gradient-based action optimization. We propose Action-Gradient MCTS (AGMCTS), a framework that combines global tree search with local gradient-based action refinement, while maintaining consistent value estimates. We provide three key theoretical contributions: (1) an action score gradient theorem for particle belief states; (2) the Multiple Importance Sampling (MIS) Tree that supports frequent action-branch updates by reusing prior samples without introducing estimator drift; and (3) tractable action score gradients for smooth generative models using the Area Formula. Empirical results demonstrate that AGMCTS outperforms state-of-the-art sample-based solvers in multiple challenging continuous MDP and POMDP benchmarks.


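2411.17917 2026-05-19 cs.CV cs.RO (updated version)

DECODE: Domain-aware Continual Domain Expansion for Motion Prediction

Boqi Li, Haojie Zhu, Henry X. Liu

Affiliations: Department of Civil and Environmental Engineering, University of Michigan

AI summary: DECODE is a continual learning framework that starts from a pre-trained model and incrementally grows domain-specialized models, combining a hypernetwork with a normalizing-flow mechanism for efficient model selection and uncertainty estimation, which lowers the forgetting rate and improves prediction accuracy.

Comments: This work has been published in IEEE TPAMI Early Access

DOI: 10.1109/TPAMI.2026.3683469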

Abstract

Motion prediction is critical for autonomous vehicles to effectively navigate complex environments and accurately anticipate the behaviors of other traffic participants. As autonomous driving continues to evolve, the need to assimilate new and varied driving scenarios necessitates frequent model updates through retraining. To address these demands, we introduce DECODE, a novel continual learning framework that begins with a pre-trained generalized model and incrementally develops specialized models for distinct domains. Unlike existing continual learning approaches that attempt to develop a unified model capable of generalizing across diverse scenarios, DECODE uniquely balances specialization with generalization, dynamically adjusting to real-time demands. The proposed framework leverages a hypernetwork to generate model parameters, significantly reducing storage requirements, and incorporates a normalizing flow mechanism for real-time model selection based on likelihood estimation. Furthermore, DECODE merges outputs from the most relevant specialized and generalized models using deep Bayesian uncertainty estimation techniques. This integration ensures optimal performance in familiar conditions while maintaining robustness in unfamiliar scenarios. Extensive evaluations confirm the effectiveness of the framework, achieving a notably low forgetting rate of 0.044 and an average minADE of 0.584 m, significantly surpassing traditional learning strategies and demonstrating adaptability across a wide range of driving conditions.
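The minADE figure quoted above is a standard motion-prediction metric: the average displacement error of the best of K predicted trajectories. A minimal sketch of the usual definition follows, assuming 2D points and equal-length trajectories; the function name `min_ade` is hypothetical and the paper's exact evaluation protocol may differ.

```python
import math


def min_ade(predictions, ground_truth):
    """Smallest average displacement error over K candidate trajectories."""
    def ade(trajectory):
        # Mean Euclidean distance between matching time steps.
        errors = [math.dist(p, g) for p, g in zip(trajectory, ground_truth)]
        return sum(errors) / len(ground_truth)

    return min(ade(trajectory) for trajectory in predictions)
```

Taking the minimum over candidates rewards a multimodal predictor as long as one of its modes is close to what actually happened.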


2410.07191 2026-05-19 cs.RO cs.LG stat.ME (updated version)

Curb Your Attention: Causal Attention Gating for Robust Trajectory Prediction in Autonomous Driving

Ehsan Ahmadi, Ray Mercurius, Soheil Alizadeh, Kasra Rezaee, Amir Rasouli

Affiliations: University of Alberta; Noah’s Ark Laboratory, Huawei Technologies Canada; Cornell University

AI summary: This paper proposes CRiTIC, which uses a causal discovery network to identify causal relations between agents and introduces a Causal Attention Gating mechanism to improve the robustness and generalization of trajectory prediction; experiments show up to a 54% robustness gain against non-causal perturbations.

Comments: Accepted to ICRA 2025

DOI: 10.1109/ICRA55743.2025.11128367

Abstract

Trajectory prediction models in autonomous driving are vulnerable to perturbations from non-causal agents whose actions should not affect the ego-agent's behavior. Such perturbations can lead to incorrect predictions of other agents' trajectories, potentially compromising the safety and efficiency of the ego-vehicle's decision-making process. Motivated by this challenge, we propose Causal tRajecTory predICtion (CRiTIC), a novel model that utilizes a Causal Discovery Network to identify inter-agent causal relations over a window of past time steps. To incorporate discovered causal relationships, we propose a novel Causal Attention Gating mechanism to selectively filter information in the proposed Transformer-based architecture. We conduct extensive experiments on two autonomous driving benchmark datasets to evaluate the robustness of our model against non-causal perturbations and its generalization capacity. Our results indicate that the robustness of predictions can be improved by up to 54% without a significant detriment to prediction accuracy. Lastly, we demonstrate the superior domain generalizability of the proposed model, which achieves up to 29% improvement in cross-domain performance. These results underscore the potential of our model to enhance both robustness and generalization capacity for trajectory prediction in diverse autonomous driving domains. Further details can be found on our project page: https://ehsan-ami.github.io/critic.


2301.01114 2026-05-19 cs.RO cs.SY eess.SY (updated version)

Information Aided Navigation: A Review

Daniel Engelsman, Itzik Klein

Affiliations: Hatter Department of Marine Technologies, Charney School of Marine Sciences, University of Haifa

AI summary: This paper surveys information aided navigation, classifying it into direct, indirect, and model aiding, and shows how matching suitable constraints to a scenario improves navigation accuracy and compensates for lost information.

Comments: 8 figures, 3 tables

DOI: 10.1109/TIM.2023.3303496
Journal ref: IEEE Transactions on Instrumentation and Measurement, volume 72, pages 1-18, 2023

Abstract

The performance of inertial navigation systems is largely dependent on the stable flow of external measurements and information to guarantee continuous filter updates and bound the inertial solution drift. Platforms in different operational environments may be prevented at some point from receiving external measurements, thus exposing their navigation solution to drift. Over the years, a wide variety of works have been proposed to overcome this shortcoming, by exploiting knowledge of the system's current conditions and turning it into an applicable source of information to update the navigation filter. This paper aims to provide an extensive survey of information aided navigation, broadly classified into direct, indirect, and model aiding. Each approach is described by the notable works that implemented its concept, use cases, relevant state updates, and their corresponding measurement models. By matching the appropriate constraint to a given scenario, one will be able to improve the navigation solution accuracy, compensate for the lost information, and uncover certain internal states that would otherwise remain unobservable.


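2605.16588 2026-05-19 cs.RO cs.SY eess.SY (updated version)

Policy Library CBF: Finite-Horizon Safety at Runtime via Parallel Rollouts

Taekyung Kim, Hideki Okamoto, Bardh Hoxha, Georgios Fainekos, Dimitra Panagou

AI summary: This paper proposes PL-CBF, which evaluates a library of fallback policies through parallel finite-horizon rollouts, selects the least invasive safe mode, and minimally modifies the nominal policy to ensure safety; experiments show improved safety coverage at millisecond-level runtime.

Comments: Project page: https://www.taekyung.me/plcbf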

英文摘要

Safety-critical autonomy in unstructured environments poses significant challenges for online safety certification under evolving constraints. We propose Policy Library Control Barrier Function~(PL-CBF), a runtime safety filter that evaluates a library of fallback policies via parallel finite-horizon rollouts, selects the least invasive safe mode, and enforces safety by solving a quadratic program that minimally modifies a nominal policy. We provide a theoretical analysis based on a finite-horizon language metric over closed-loop behaviors, characterizing policy-library coverage requirements for certifying finite-horizon safety. Simulations on a planar double-integrator (4 states), highway driving with abrupt friction changes using a realistic nonlinear vehicle model (8 states), and 3D quadrotor navigation in crowded dynamic environments (12 states) demonstrate improved safety coverage over single-policy safety filters while retaining millisecond-level runtime.
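A minimal sketch of the rollout-and-select idea, assuming a 1D double integrator, a wall at a fixed position, and a three-policy fallback library (all hypothetical). It runs the rollouts sequentially rather than in parallel and omits the quadratic program that modifies the nominal policy, so it illustrates only the mode-selection step and is not the authors' implementation.

```python
import numpy as np

DT, HORIZON, WALL = 0.05, 40, 10.0   # hypothetical step size, horizon length, wall position

def step(state, u):
    """Euler step of a 1D double integrator; state = (position, velocity)."""
    p, v = state
    return np.array([p + DT * v, v + DT * u])

def barrier(state):
    """h(x) >= 0 means safe: the position stays on the near side of the wall."""
    return WALL - state[0]

def rollout_margin(state, policy):
    """Worst-case barrier value along a finite-horizon rollout of `policy`."""
    margin = barrier(state)
    for _ in range(HORIZON):
        state = step(state, policy(state))
        margin = min(margin, barrier(state))
    return margin

# Hypothetical fallback library, ordered from least to most invasive.
# Braking acts only while moving forward, so a rollout can come to rest.
library = {
    "coast":      lambda s: 0.0,
    "soft_brake": lambda s: -2.0 if s[1] > 0 else 0.0,
    "hard_brake": lambda s: -6.0 if s[1] > 0 else 0.0,
}

def select_mode(state):
    """Return the least invasive policy whose rollout stays safe, else the safest one."""
    margins = {name: rollout_margin(state, pi) for name, pi in library.items()}
    for name in library:                     # dicts preserve insertion order
        if margins[name] >= 0.0:
            return name, margins
    return max(margins, key=margins.get), margins

mode, margins = select_mode(np.array([0.0, 6.0]))
```

Starting at 6 m/s, coasting crosses the wall within the horizon, so the selector skips it and picks the mildest braking policy that keeps the barrier non-negative over the horizon. The guarantee is only as long as the rollout, which is the sense of "finite-horizon" here.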


2605.16552 2026-05-19 cs.AI cs.RO (updated version)

From Prompts to Protocols: An AI Agent for Laboratory Automation


Angelos Angelopoulos, James F. Cahoon, Ron Alterovitz

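AI summary: This paper proposes an AI agent that integrates large language models with laboratory orchestration, letting scientists create and monitor automated experimental protocols through natural language. On three simulated laboratories it reached a 97% one-shot protocol generation success rate and cut the required interface actions by an order of magnitude.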

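Abstract

Automated science laboratories can execute protocols faster, more safely, more accurately, and more reproducibly, accelerating the discovery and testing of new materials and drugs. However, setting up and running an autonomous laboratory requires coordinating many instruments and robots, forcing scientists to write code, manage configuration files, and navigate complex software architectures. This paper presents an AI agent architecture that integrates large language models with laboratory orchestration so that scientists can interactively create and monitor automated experimental protocols through natural language. The agent is integrated into the Experiment Orchestration System (EOS), uses an agentic loop for automatic validation and error correction, and supports the full experiment lifecycle: creating protocols, running and monitoring protocols and closed-loop optimization campaigns, and analyzing results. A visual graph editor renders each protocol as an interactive node graph kept in sync with the agent's protocol representation, allowing seamless switching between AI-assisted and manual protocol construction. Evaluated on three simulated automated laboratories spanning chemistry, biology, and materials science, the AI agent achieved a 97% one-shot protocol generation success rate and reduced the required interface actions by an order of magnitude.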

MR-SLAM: Immersive Spatial Supervision of Multi-Robot Mapping via Mixed Reality

Prakash Aryan, Cem Erdogdu, Kavinaya Kumarchokkappan, Timo Kehrer, Sebastiano Panichella

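AI summary: This paper presents MR-SLAM, a mixed reality system for immersive spatial supervision of multi-robot SLAM, in which real-time visualization and spatially anchored panels report each robot's localization and mapping state to the operator.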

Comments Accepted to ICRA 2026 Workshop "MM-SpatialAI Workshop: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding"

Abstract

Operating a multi-robot fleet for simultaneous localization and mapping (SLAM) in applications such as building inspection or warehouse-aisle monitoring requires the operator to maintain spatial awareness of each robot's position and mapping state, a task that scales poorly on conventional 2D interfaces. We present MR-SLAM, a mixed reality (MR) system in which an operator wearing a Meta Quest 3 headset teleoperates three simulated TurtleBot3 robots through a passthrough view with real-world occlusion, while spatially anchored dashboard panels report mapping progress in situ. Each robot runs an independent SLAM Toolbox instance whose occupancy grid is merged in real time on a Robot Operating System 2 (ROS 2) back end. Across five 9-minute evaluation sessions, the system delivered scans at 8.83 ± 0.16 Hz, mapped 17.9 ± 0.8 m² of merged occupancy, and reached 94.7 ± 0.5% cross-instance occupancy consistency across robot pairs. An additional session recorded 6.3 ms median transform jitter and 26.7 m² coverage of a 41 m² grid. We position MR-SLAM as a reference implementation for combining passthrough mixed reality supervision with multi-robot SLAM on consumer hardware.


2605.16419 2026-05-19 cs.CV cs.AI cs.RO (updated version)

Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments


Juncheng Yu, Lusi A, Haoxuan Xie, Weiming Wang

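AI summary: This paper proposes an agentic pipeline for self-synchronized multi-view joint angle monitoring that uses two cameras in uncalibrated environments. Multimodal large language models handle automatic video synchronization and self-verification, state-of-the-art monocular 2D pose estimators extract candidate poses, and an agent-based selection mechanism identifies and tracks the target subject so that the 2D poses stay consistent when several people or occlusions are present; joint angles are then estimated from these poses.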

Comments Accepted by EMBC 2026. 7 pages, 3 figures

Abstract

Kinematic monitoring plays a critical role in long-term rehabilitation for patients with spinal cord injury (SCI), where multi-view markerless motion capture methods have shown significant potential. However, owing to the reliance on calibration and the difficulty of achieving multi-view synchronization, their deployment in patient self-deployed environments remains challenging. In this work, we propose an agentic pipeline for self-synchronized multi-view joint angle monitoring in uncalibrated environments using two cameras without hardware triggers. Multimodal large language models enable automatic video synchronization and agent-driven self-verification. State-of-the-art monocular 2D pose estimation models are employed to extract candidate poses, and an agent-based selection mechanism is then applied to automatically identify and track the target subject, thereby producing consistent 2D poses in the presence of multiple individuals and occlusions. These 2D poses are optimized to estimate joint angles from uncalibrated multi-view pose sequences, ensuring interpretability through explicit geometric modeling. Validation against a Vicon system demonstrated strong performance, achieving an MAE of 5.97° ± 2.36° and a Pearson correlation coefficient of 0.962 ± 0.014. The proposed method is expected to provide a practical, patient self-deployable system for daily kinematic monitoring in uncalibrated home environments.
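For readers unfamiliar with the two reported metrics, the sketch below shows how a joint angle, the mean absolute error (MAE), and the Pearson correlation coefficient are commonly computed. It assumes 3D keypoints are already available and uses made-up angle values; it does not reproduce the paper's uncalibrated multi-view optimization.

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle at joint b, in degrees, formed by the segments b->a and b->c."""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def mae(est, ref):
    """Mean absolute error between estimated and reference angle series."""
    return float(np.mean(np.abs(np.asarray(est, float) - np.asarray(ref, float))))

def pearson_r(est, ref):
    """Pearson correlation coefficient between two angle series."""
    return float(np.corrcoef(est, ref)[0, 1])

# Hypothetical knee angle: hip, knee, and ankle keypoints in metres.
knee = joint_angle_deg([0.0, 1.0, 0.0], [0.0, 0.5, 0.0], [0.3, 0.1, 0.0])

# Hypothetical estimated vs. reference (e.g. marker-based) angle series in degrees.
estimated = [10.0, 20.0, 30.0, 40.0]
reference = [12.0, 22.0, 32.0, 42.0]
error, corr = mae(estimated, reference), pearson_r(estimated, reference)
```

The example series differ by a constant offset, which shows why both metrics are reported: correlation captures whether the estimate follows the shape of the motion, while MAE captures how far off it is in degrees.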


2605.16412 2026-05-19 cs.RO cs.CV (updated version)

SCAR: Self-Supervised Continuous Action Representation Learning


Hongjia Liu, Fan Feng, Minghao Fu, Xinyue Wang, Haofei Lu, Biwei Huang

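AI summary: This paper proposes SCAR, a self-supervised framework that learns unified action representations and improves generalization across embodiments and tasks.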

Abstract

Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples the controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.
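Regularizing a posterior toward a standard Gaussian prior is usually done with a KL-divergence penalty. The sketch below gives the closed form for a diagonal Gaussian posterior; the abstract does not state the exact loss, so treating it as this VAE-style term is an assumption, and the function name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, float)
    log_var = np.asarray(log_var, float)
    return float(0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var))

# Hypothetical latent action posterior produced by an inverse dynamics model.
penalty = kl_to_standard_normal(mu=[0.8, -0.2, 0.0], log_var=[-1.0, 0.0, 0.5])
```

The penalty is zero only when the posterior equals the prior, so adding it to the training loss limits how much arbitrary visual detail the latent action can carry.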


2605.16398 2026-05-19 cs.RO cs.AI (updated version)

Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery


Marios Papamichalis, Regina Ruane

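AI summary: This paper proposes VHYDRO, a variational hybrid dynamics learner that mixes the learned proposal with a feasible transition law to prevent branch loss, jointly infers a continuous latent state and a discrete contact mode, and provides three connected guarantees for sparse port-Hamiltonian law recovery.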

Abstract

Contact-rich robot dynamics are hybrid: a single observation can match several latent states and contact regimes (free, impact, stick-slip). A standard amortized filter that places no probability on a feasible contact transition will permanently lose the branch the robot actually follows. We introduce VHYDRO, a variational hybrid dynamics learner that prevents this branch loss. At each step, VHYDRO mixes the learned proposal with a feasible transition law before sampling and importance weighting, ensuring that every transition retained by the model-feasible carrier remains covered. VHYDRO jointly infers a continuous latent state and a discrete contact mode, and fits a sparse port-Hamiltonian law to each recovered regime. On top of this, three guarantees connect: support coverage stabilizes filtering, the stabilized filter concentrates the discrete contact posterior on coherent regimes, and mode-pure segments admit sparse port-Hamiltonian recovery. The recovery error separates cleanly into filtering, derivative, mode-impurity, and physics-residual parts. Three empirical findings track the same mechanism. Under heavy occlusion, the support-safe filter stays usable while a non-defensive proposal collapses. On ManiSkill demonstrations and on four Sawyer/BridgeData task families, the discrete state forms temporally coherent contact-regime segments and yields a stronger joint profile across ARI, change-point F1, and segment purity than post-hoc and mode-free baselines. On hybrid systems with known equations, the mode-conditioned sparse fit recovers the active physical terms; purely predictive baselines do not.
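The mixing step can be pictured as a defensive mixture proposal over discrete contact modes. In the sketch below, the probabilities, the mixing weight `eps`, and the convex-combination form are illustrative assumptions; the abstract says only that the learned proposal is mixed with a feasible transition law before sampling and importance weighting.

```python
import numpy as np

modes = ["free", "impact", "stick-slip"]
learned = np.array([1.0, 0.0, 0.0])    # hypothetical proposal that dropped two modes
feasible = np.array([0.6, 0.3, 0.1])   # hypothetical feasible transition law
eps = 0.2                              # hypothetical mixing weight

def defensive_mixture(q, p, eps):
    """Blend proposal q with transition law p so every p-feasible mode keeps mass."""
    return (1.0 - eps) * q + eps * p

mix = defensive_mixture(learned, feasible, eps)

# Importance weights of the transition law relative to the mixture.
# Because mix >= eps * feasible, each weight is at most 1 / eps.
weights = feasible / mix

rng = np.random.default_rng(0)
sampled = rng.choice(len(modes), size=1000, p=mix)
```

With the unmixed proposal, "impact" and "stick-slip" could never be sampled and their weights would be undefined. After mixing, every feasible mode has non-zero probability and the weights stay bounded, which is the sense in which the branch is not lost.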


2605.16395 2026-05-19 cs.RO cs.LG (updated version)

OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence


Jiajian Li, Jingyuan Huang, Junru Gong, Qi Wang, Xiaokang Yang, Yunbo Wang

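AI summary: OrbiSim is a new robotic simulation paradigm that redefines world models as fully differentiable physics engines. It builds a unified, physically grounded pathway connecting structured scene assets, neural dynamics, and downstream reinforcement learning, and improves predictive fidelity and control performance.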

Comments Project page: https://jjleejj85.github.io/projects/orbisim

Abstract

We present OrbiSim, a novel robotic simulation paradigm that redefines world models as a fully differentiable physics engine for embodied intelligence. Unlike prior world models that focus on unconstrained imagination in latent or visual domains, OrbiSim establishes a unified, physically grounded pathway that bridges structured scene assets, neural dynamics, and downstream reinforcement learning. By enabling end-to-end differentiability throughout the entire simulation loop, from explicit state transitions to visual observation generation, OrbiSim supports tasks traditionally intractable for classical simulators, such as differentiable contact modeling, gradient-based policy optimization under sparse rewards, and intuitive physical inference. Empirical results demonstrate that OrbiSim significantly outperforms state-of-the-art world models in both predictive fidelity and control performance. Furthermore, its consistent responsiveness to asset configurations and physical parameters suggests its potential as a differentiable tool for enhancing robot simulation and policy training.


2605.16391 2026-05-19 eess.SP cs.AI cs.LG cs.RO (updated version)

Overcoming the Intrinsic Performance Limitations of MEMS IMU via Diffusion-Based Generative Learning


Jiarui Lv, Feng Zhu, Xiaohong Zhang

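AI summary: This paper proposes a diffusion-based generative learning framework that synthesizes high-fidelity virtual IMU data from low-cost IMU measurements, improving positioning and attitude estimation, and validates it in airborne mapping experiments.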

Abstract

Inertial measurement units (IMUs) are fundamental sensing components in multi-source integrated navigation systems, and their performance directly determines the accuracy and reliability of solutions. However, the precision of low-cost IMUs is inherently constrained by hardware limitations. Recently, generative artificial intelligence has demonstrated remarkable capability in modeling complex data distributions and reconstructing high-fidelity signals. Motivated by this, we propose a diffusion-based generative learning framework for synthesizing high-fidelity virtual IMU data from low-cost IMU measurements. Specifically, a conditional diffusion model based on a U-Net architecture is constructed, where high-grade IMU measurements are utilized as ground-truth priors and low-cost IMU measurements are employed as conditional inputs. The virtual IMU data generated by the model is used for subsequent navigation and localization tasks. Experimental results demonstrate that the generated virtual IMU data significantly outperform the original low-cost IMU measurements in both positioning and attitude estimation. Furthermore, we transfer the model to airborne mapping experiments, where the proposed method produces thinner and more consistent point clouds. Overall, the proposed framework breaks the performance limits of low-cost IMUs and demonstrates the potential of diffusion-based generative learning for virtual high-grade IMU data.
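To make the diffusion setup more concrete, the sketch below shows the standard DDPM forward (noising) process applied to a window of high-grade IMU samples. The linear noise schedule, the window shape, and the function names are assumptions; the abstract does not describe the schedule or the U-Net, and the conditional denoiser itself is not shown.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances; the linear schedule is an assumed default."""
    return np.linspace(beta_start, beta_end, T)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

T = 1000
alpha_bar = np.cumprod(1.0 - linear_beta_schedule(T))

rng = np.random.default_rng(0)
# Hypothetical high-grade IMU window: 200 samples x (3 gyro + 3 accelerometer axes).
x0 = 0.1 * rng.standard_normal((200, 6))
t = 100
x_t, noise = forward_noise(x0, t, alpha_bar, rng)

# During training, a denoiser would receive (x_t, t, low-cost IMU window) and be
# asked to predict `noise`; the low-cost window is the conditional input.
```

At sampling time the process runs in reverse, starting from noise and conditioning on the low-cost measurements, so the model's output plays the role of the virtual high-grade IMU signal.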


2605.16389 2026-05-19 cs.RO cs.AI cs.SY eess.SY (updated version)

Haptic Rendering of Fractional-Order Viscoelasticity: Passivity and Rendering Fidelity


Gorkem Gemalmaz, Harun Tolasa, Volkan Patoglu

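AI summary: This paper studies the passivity and rendering performance of fractional-order viscoelastic models under finite-memory discretization, derives closed-form expressions that ensure passive haptic rendering, and validates the theoretical results and the perceived realism through experiments.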

Comments Under review for publication in IEEE Transactions on Robotics
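One common way to discretize a fractional derivative with finite memory is a truncated Grünwald-Letnikov sum. The sketch below assumes that scheme and a single springpot-style element F = c · D^α x; the summary above does not name the discretization or the model structure, so both are assumptions, as are the function names and parameter values.

```python
import numpy as np

def gl_weights(alpha, memory):
    """Grünwald-Letnikov weights w_k = (-1)^k * binom(alpha, k) for k = 0..memory."""
    w = np.empty(memory + 1)
    w[0] = 1.0
    for k in range(1, memory + 1):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return w

def fractional_derivative(x_hist, alpha, dt, memory):
    """Finite-memory estimate of D^alpha x at the newest sample.

    x_hist holds samples oldest to newest; only the last `memory + 1` are used.
    """
    recent = np.asarray(x_hist, float)[-(memory + 1):][::-1]   # newest first
    w = gl_weights(alpha, len(recent) - 1)
    return float(dt ** (-alpha) * np.dot(w, recent))

def fractional_force(x_hist, coeff, alpha, dt, memory=50):
    """Springpot-style force: alpha = 0 acts as a spring, alpha = 1 as a damper."""
    return coeff * fractional_derivative(x_hist, alpha, dt, memory)

# Hypothetical probe positions sampled at 1 kHz.
positions = 0.001 * np.arange(200)
force = fractional_force(positions, coeff=50.0, alpha=0.5, dt=1e-3)
```

Truncating the sum to `memory` samples bounds the per-step cost, but it also changes the rendered dynamics, which is why passivity has to be re-examined for the truncated model rather than assumed from the continuous one.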


AI中文摘要

GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics

Qianzhong Chen, Naixiang Gao, Suning Huang, JunEn Low, Timothy Chen, Jiankai Sun, Mac Schwager

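AI summary: GRaD-Nav++ is a lightweight vision-language-action framework trained with differentiable reinforcement learning in a 3D Gaussian Splatting simulator. It performs real-time drone navigation from natural-language instructions and is evaluated in multi-task and multi-environment settings.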

Comments Published in: IEEE Robotics and Automation Letters (Volume: 11, Issue: 2, February 2026)


DOI: 10.1109/LRA.2025.3643290
Journal ref: Chen, Qianzhong, et al. "Grad-nav++: Vision-language model enabled visual drone navigation with gaussian radiance fields and differentiable dynamics." IEEE Robotics and Automation Letters 11.2 (2025): 1418-1425

Abstract

Autonomous drones capable of interpreting and executing high-level language instructions in unstructured environments remain a long-standing goal. Yet existing approaches are constrained by their dependence on hand-crafted skills, extensive parameter tuning, or computationally intensive models unsuitable for onboard use. We introduce GRaD-Nav++, a lightweight Vision-Language-Action (VLA) framework that runs fully onboard and follows natural-language commands in real time. Our policy is trained in a photorealistic 3D Gaussian Splatting (3DGS) simulator via Differentiable Reinforcement Learning (DiffRL), enabling efficient learning of low-level control from visual and linguistic inputs. At its core is a Mixture-of-Experts (MoE) action head, which adaptively routes computation to improve generalization while mitigating forgetting. In multi-task generalization experiments, GRaD-Nav++ achieves a success rate of 83% on trained tasks and 75% on unseen tasks in simulation. When deployed on real hardware, it attains 67% success on trained tasks and 50% on unseen ones. In multi-environment adaptation experiments, GRaD-Nav++ achieves an average success rate of 81% across diverse simulated environments and 67% across varied real-world settings. These results establish a new benchmark for fully onboard Vision-Language-Action (VLA) flight and demonstrate that compact, efficient models can enable reliable, language-guided navigation without relying on external infrastructure.
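The routing idea behind a Mixture-of-Experts action head can be sketched with a softmax gate over a few linear experts. Everything concrete here (linear experts, top-k routing, the dimensions, the random weights) is an assumption for illustration; the abstract states only that the MoE head adaptively routes computation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_action(features, gate_w, expert_ws, top_k=2):
    """Blend the outputs of the top-k experts chosen by a softmax gate.

    features:  (D,) fused vision-language feature vector
    gate_w:    (E, D) gating weights, one row per expert
    expert_ws: (E, A, D) one linear action head per expert
    """
    gate = softmax(gate_w @ features)
    top = np.argsort(gate)[-top_k:]            # indices of the k largest gate values
    weights = gate[top] / gate[top].sum()      # renormalize over the selected experts
    action = sum(w * (expert_ws[i] @ features) for w, i in zip(weights, top))
    return action, gate

rng = np.random.default_rng(0)
features = rng.standard_normal(8)              # hypothetical 8-D feature
gate_w = rng.standard_normal((4, 8))           # 4 hypothetical experts
expert_ws = rng.standard_normal((4, 4, 8))     # each expert outputs a 4-D command
action, gate = moe_action(features, gate_w, expert_ws)
```

Only the selected experts are evaluated for a given input, so different instructions or scenes can activate different parameters, which is one plausible reading of how routing helps generalization while limiting interference between tasks.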
