arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21460 2026-05-21 cs.RO cs.AI cs.HC Version update

HITL-D: Human In The Loop Diffusion Assisted Shared Control

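Riley Zilka, Sergey Khlynovskiy, Allie Wang, Martin Jagersand

Affiliations Department of Computing Science, University of Alberta

AI summary This paper proposes the HITL-D framework, which combines diffusion policies with human control to improve user performance in multi-step, insertion, and fine manipulation tasks. It reduces the number of joystick control axes, lowers cognitive load, and in a multi-task user study significantly improves task completion speed and user satisfaction.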

Comments Accepted for presentation at ICRA 2026

English abstract

Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multi-step, insertion, and fine manipulation tasks. HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation.
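The division of labour described in the abstract, where the user commands translation and the policy supplies orientation, can be sketched as a single control step. This is a minimal illustration only: `Pose`, `shared_control_step`, and `toy_orientation_policy` are hypothetical names, the Euler-angle orientation is a simplification, and the toy policy is merely a stand-in for the diffusion policy, whose interface the abstract does not specify.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]
OrientationPolicy = Callable[[Sequence[Vec3], Vec3], Vec3]


@dataclass(frozen=True)
class Pose:
    position: Vec3      # Cartesian end-effector position
    orientation: Vec3   # roll, pitch, yaw (a simplification for this sketch)


def toy_orientation_policy(point_cloud: Sequence[Vec3], position: Vec3) -> Vec3:
    """Hypothetical stand-in for the learned policy: yaw toward the cloud centroid."""
    cx = sum(p[0] for p in point_cloud) / len(point_cloud)
    cy = sum(p[1] for p in point_cloud) / len(point_cloud)
    yaw = math.atan2(cy - position[1], cx - position[0])
    return (0.0, 0.0, yaw)


def shared_control_step(pose: Pose, joystick_velocity: Vec3,
                        point_cloud: Sequence[Vec3],
                        policy: OrientationPolicy, dt: float) -> Pose:
    """One step: the human moves the position, the policy sets the orientation."""
    # Human input: integrate the three translational joystick axes.
    new_position = tuple(p + v * dt for p, v in zip(pose.position, joystick_velocity))
    # Autonomy: orientation conditioned on the scene and the new position.
    new_orientation = policy(point_cloud, new_position)
    return Pose(new_position, new_orientation)
```

In this sketch the user never touches the rotational axes, which is the reduction in joystick control axes the abstract refers to.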

2605.21439 2026-05-21 eess.SY cs.RO cs.SY Version update

Fully Actuated Manifold Constraint Based Output Feedback Control for Input-Constrained Uncertain Nonlinear Systems

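Dianrui Mu, Changchun Hua, Yafeng Li, Jiannan Chen, Rao Wei

Affiliations Institute of Electrical Engineering, Yanshan University

AI summary This paper proposes a low-complexity, model-free output-feedback controller for unknown time-varying nonlinear systems with unknown input constraints. It achieves the preset control accuracy and maintains flexible control accuracy after actuator saturation. The method extends existing linear-manifold constraint control methods to cover the construction of nonlinear manifolds and various constraint types, achieving the preset control accuracy within finite or fixed time. Flexible control under unknown saturation is achieved by constructing an error-driven flexible constraint. Second-order and higher-order control examples and simulations are provided.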

Comments 22 pages, 12 figures, 2 tables

English abstract

This paper presents a low-complexity, model-free, output-feedback controller for a class of unknown time-varying nonlinear systems with unknown input constraints. The controller achieves the preset control accuracy when the actuator is not saturated and maintains flexible control accuracy after actuator saturation. This result extends existing constraint control methods for linear manifolds to a more general form, including the construction of nonlinear manifolds and various types of constraints, thereby achieving preset control accuracy within finite or fixed time. Additionally, flexible control under unknown saturation is achieved through the construction of an error-driven flexible constraint. Finally, second-order and higher-order control examples and simulations are provided.

2605.21429 2026-05-21 cs.RO cs.LG Version update

roto 2.0: The Robot Tactile Olympiad

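Elle Miller, Jayaram Reddy, Ayush Deshmukh, Trevor McInroe, David Abel, Oisin Mac Aodha, Sethu Vijayakumar

Affiliations University of Edinburgh

AI summary This paper presents roto 2.0, a tactile reinforcement learning benchmark that aims to standardise tactile RL across four robotic morphologies (16-DOF to 24-DOF). It focuses on end-to-end "blind" manipulation using only proprioception and tactile sensing, without state information or distillation. The blind agents achieve 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than the current state of the art. Open-sourced environments and robustly tuned baselines lower the barrier to entry, letting researchers prioritise fundamental algorithmic challenges over tedious RL tuning.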

Comments Accepted to 7th ViTac Workshop, ICRA 2026

English abstract

Tactile-based reinforcement learning (RL) is currently hindered by fragmented research and a focus on over-saturated orientation tasks. We introduce v2 of the Robot Tactile Olympiad (roto 2.0), a GPU-parallelised benchmark designed to standardise tactile-based RL across four distinct robotic morphologies (16-DOF to 24-DOF). Unlike prior benchmarks, roto focuses on end-to-end "blind" manipulation, utilising only proprioception and tactile sensing without state information or distillation. We demonstrate a significant performance leap, with our blind agents achieving 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than current state-of-the-art speeds. By open-sourcing our environments and robustly tuned baselines, we reduce the barrier to entry and enable researchers to prioritise fundamental algorithmic challenges over tedious RL tuning. Website: https://elle-miller.github.io/roto/

2605.21414 2026-05-21 cs.RO cs.CV Version update

PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

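Shizhe Chen, Paul Pacaud, Cordelia Schmid

Affiliations Inria; École normale supérieure; CNRS; PSL Research University

AI summary This paper proposes PointACT, a 3D-aware vision-language-action policy that integrates hierarchical 3D point cloud representations and uses a multi-scale point-action interaction mechanism to improve fine-grained geometric reasoning and spatial grounding for robots in 3D environments.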

Comments Accepted to RSS 2026; project webpage: https://cshizhe.github.io/projects/pointact.html

English abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding, capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

2605.21406 2026-05-21 cs.RO Version update

MC-Risk: Multi-Component Risk Fields for Risk Identification and Motion Planning

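Maximilian Link, Yingjie Xu, Yingbai Hu, Yinlong Liu

Affiliations Technical University of Munich; The Chinese University of Hong Kong; City University of Macau

AI summary This paper proposes MC-Risk, a planner-aligned multi-component risk field for early, calibrated, and class-aware risk localization. It linearly composes three interpretable modules (a motorized-agent field, a VRU risk field, and a road penalty field) and reports the first standardized quantitative evaluation on RiskBench's collision subset, where it attains the best risk localization and the earliest hazard indication.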

English abstract

We present MC-Risk, a planner-aligned, multi-component risk field on a bird's-eye-view grid that yields early, calibrated, and class-aware risk localization. MC-Risk linearly composes three interpretable modules: (i) a motorized-agent field that fuses a black-box multimodal trajectory predictor with an analytic Gaussian-torus construction whose lateral width grows with speed/curvature and whose height attenuates with look-ahead; (ii) a VRU risk field that replaces isotropic pedestrian blobs with a forward-biased anisotropic kernel aligned to heading and speed; and (iii) a road penalty field that exploits full HD-map topology, imposing an off-road penalty and lane-aware risk exposure for same/opposite directions. We conduct, to our knowledge, the first standardized quantitative evaluation of a risk-field formulation on RiskBench's collision subset. MC-Risk attains the best overall risk localization and the earliest hazard indication. Finally, we demonstrate a plug-and-play planning interface by using the field as an MPC cost density, enabling risk-aware trajectory generation without additional training.
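Two ideas from the abstract lend themselves to a short sketch: a forward-biased anisotropic kernel aligned to heading and speed, and a linear composition of component fields at a grid cell. The kernel shape, the parameter names, and the rule that the forward spread grows linearly with speed are all assumptions made for illustration; the abstract does not give the actual formulas.

```python
import math


def vru_kernel(dx: float, dy: float, heading: float, speed: float,
               sigma_lat: float = 0.5, sigma_long: float = 0.5,
               gain: float = 0.5) -> float:
    """Risk at offset (dx, dy) from a pedestrian or cyclist (hypothetical form)."""
    # Rotate the offset into the agent frame: lon is along the heading.
    c, s = math.cos(heading), math.sin(heading)
    lon = c * dx + s * dy
    lat = -s * dx + c * dy
    # Forward bias: widen the longitudinal spread only ahead of the agent.
    spread = sigma_long + gain * speed if lon > 0 else sigma_long
    return math.exp(-0.5 * ((lon / spread) ** 2 + (lat / sigma_lat) ** 2))


def total_risk(components, weights):
    """Linear composition of per-cell component risks (weights are assumed)."""
    return sum(w * c for w, c in zip(weights, components))
```

With a speed of zero the kernel reduces to an ordinary Gaussian that is symmetric front to back; at higher speeds a cell ahead of the agent scores higher than the mirrored cell behind it.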

2605.21398 2026-05-21 cs.RO Version update

From swept contact to pose: Probe-aware registration via complementary-shape docking

Chen Chen, Yunwen Li, Yifan Xu, Xiangjie Yan, Chang Shu, Jianxia Hou, Shiji Song, Xiang Li

Affiliations Department of Automation, Tsinghua University; Tsingscribe Medical Ltd.; D-MAVT, ETH Zürich; Peking University School and Hospital of Stomatology; Institute for Guo Qiang, Tsinghua University

AI summary This work proposes a calibration-free registration method that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and using both contact and non-contact evidence. It performs a global-to-local search via 3D FFT correlation, followed by continuous SE(3) refinement with Lie-algebra updates and analytic contact sensitivities, yielding efficient exploration and metric-grade convergence.

Comments 8 pages, 9 figures, accepted to ICRA 2026

English abstract

Accurate registration between a prior model and the real scene is essential for high-precision robotic manipulation, yet optical methods suffer from long calibration chains, line-of-sight constraints, and fabrication errors. We propose a calibration-free alternative that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and leveraging both contact and non-contact evidence. Our solver integrates a global-to-local search via 3D FFT correlation over low-discrepancy SO(3) samples, followed by continuous SE(3) refinement using Lie-algebra updates and analytic contact sensitivities. This pipeline yields efficient exploration and metric-grade convergence without fragile point correspondences. Simulations across free-form meshes achieved sub-0.04 mm and sub-0.4° accuracy and were robust to pose noise and contact loss. On a tooth-preparation robot, our method attained 0.42 mm and 3.75°, outperforming optical-tracker registration while requiring no external sensors. These results demonstrate a practical and precise registration strategy for surgical and industrial robots.
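The translational part of an FFT-based search rests on the correlation theorem: the score for every shift comes out of one forward and one inverse transform. The sketch below shows this in one dimension with a small pure-Python radix-2 FFT. It is an illustration under assumptions, not the paper's solver: the actual method works on 3D grids across sampled rotations, and the function names and toy profiles here are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_correlation(a, b):
    """score[s] = sum_i a[i] * b[(i + s) % n] for real sequences a and b."""
    n = len(a)
    fa = fft([complex(v) for v in a])
    fb = fft([complex(v) for v in b])
    # Correlation theorem: multiply conj(F(a)) by F(b), then invert.
    product = [x.conjugate() * y for x, y in zip(fa, fb)]
    return [v.real / n for v in fft(product, inverse=True)]


def best_shift(a, b):
    """Shift s that maximises the correlation score."""
    scores = circular_correlation(a, b)
    return max(range(len(scores)), key=scores.__getitem__)
```

Evaluating all shifts this way costs O(n log n) instead of O(n^2), which is what makes an exhaustive translation scan per sampled rotation affordable.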

2605.21372 2026-05-21 cs.CV cs.AI cs.LG cs.RO Version update

Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

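Hongzhi Ruan, Pei Liu, Weiliang Ma, Zhengning Li, Xueyang Zhang, Jun Ma, Dan Xu, Kun Zhan

Affiliations Li Auto; HKUST; HKUST (GZ)

AI summary This paper proposes a closed-loop dynamic data mixture method that adjusts the training data mixture through a dynamic optimization process to improve model performance, addressing the problem of optimizing the data mixture under a limited budget.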

English abstract

Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing the data mixture under practical training budgets remains a critical yet under-explored problem. We therefore argue that the training data mixture needs explicit guidance on scene types and quantities. In this work, we treat data mixing as an approximate dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, as well as model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.
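The idea of reweighting clusters by gradient ascent on evaluation feedback can be illustrated with a generic softmax parameterisation. This is not Cluster-GA itself, whose update rule the abstract does not state: the objective (a weighted sum of per-cluster gains), the function names, and the learning rate are assumptions chosen to keep the sketch small.

```python
import math


def softmax(logits):
    """Mixture weights that are positive and sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ascent_step(logits, gains, lr=1.0):
    """One gradient-ascent step on J = sum_c w_c * gain_c with w = softmax(logits).

    `gains` are hypothetical per-cluster performance gains measured by a
    closed-loop evaluation; dJ/dlogit_c = w_c * (gain_c - J).
    """
    weights = softmax(logits)
    mean_gain = sum(w * g for w, g in zip(weights, gains))
    return [v + lr * w * (g - mean_gain)
            for v, w, g in zip(logits, weights, gains)]
```

Repeating the step shifts sampling weight toward clusters whose data helped most in evaluation, while the softmax keeps the mixture a valid distribution.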

2605.21330 2026-05-21 cs.RO Version update

Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer

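Senlan Yao, Chenyu Yang, Jaehoon Kim, Aristotelis Sympetheros, Robert K. Katzschmann

Affiliations Soft Robotics Laboratory, ETH Zürich, Switzerland

AI summary This paper studies how far in-hand manipulation can go with joint sensing alone and proposes the Proprioceptive Transformer (PT), an approach that needs no external perception. A teacher policy trained with reinforcement learning is distilled into PT, achieving continuous cube rotation on a tendon-driven dexterous hand; experiments show it outperforms baselines in rotation speed and position-estimation accuracy.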

Comments 8 pages, 6 figures, 3 tables

English abstract

In-hand object manipulation is a fundamental yet challenging capability for dexterous robots. Despite significant progress in dexterous manipulation, existing approaches rely heavily on vision or tactile sensing to track object states, while joint sensing, the most readily available modality on any robotic hand, remains largely overlooked, particularly for tendon-driven hands. In this paper, we study how far joint sensing alone can go by asking: (i) whether motor encoders or direct joint sensing provides better proprioceptive feedback, (ii) how to extract environment information from joint measurements, and (iii) whether joint-only control can achieve competitive real-world performance without external perception. We present the Proprioceptive Transformer (PT), an exteroceptive-free approach for continuous cube rotation on a tendon-driven dexterous hand that uses only joint sensing feedback. A teacher policy is first trained via reinforcement learning with privileged object information, then distilled into PT, which operates solely on joint position and velocity histories. The Transformer architecture effectively extracts implicit object state information from temporal patterns in joint sensor readings. Experiments on the real ORCA hand show that our approach achieves 3.1x higher rotation speed than baselines. We also demonstrate that our PT achieves a 23.4% lower RMSE for cube position estimation than the MLP baseline, indicating superior extraction of exteroceptive information from proprioceptive sources.
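A policy that "operates solely on joint position and velocity histories" needs a fixed-length window of past readings as its input sequence. The sketch below shows one plausible way to maintain such a window. The class name, the horizon, and the choice to pad an unfilled window by repeating the oldest reading are assumptions; the abstract does not describe how the history is assembled.

```python
from collections import deque


class JointHistory:
    """Fixed-length window of (joint positions, joint velocities) readings."""

    def __init__(self, horizon: int):
        self._buffer = deque(maxlen=horizon)  # oldest entries fall off the left

    def push(self, positions, velocities):
        """Append one time step; positions and velocities are concatenated."""
        self._buffer.append(tuple(positions) + tuple(velocities))

    def as_sequence(self):
        """Return exactly `horizon` tokens, oldest first, for a sequence model."""
        if not self._buffer:
            raise ValueError("no readings yet")
        missing = self._buffer.maxlen - len(self._buffer)
        # Assumed padding rule: repeat the oldest reading until the window fills.
        return [self._buffer[0]] * missing + list(self._buffer)
```

Each returned token would then be embedded and fed to the sequence model, so that temporal patterns in the joint readings are visible to the network.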

2605.21309 2026-05-21 cs.CV cs.RO Version update

Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation

Abhishek Dinkar Jagtap, Sanath Tiptur Sadashivaiah, Andreas Festag

Affiliations CARISSMA Institute for Electric, COnnected, and Secure Mobility (C-ECOS), Technische Hochschule Ingolstadt; University of Applied Sciences Aschaffenburg

AI summary This paper proposes Hyper-V2X, a framework that uses hypernetworks to estimate epistemic and aleatoric uncertainty in cooperative V2X perception. A partial weight generation scheme and a V2X context embedding module condition a Bayesian hypernetwork to generate weight distributions for stochastic bird's-eye-view segmentation, improving perception reliability.

Comments Accepted for IEEE Intelligent Vehicle Symposium (IV) 2026

English abstract

Cooperative perception enabled by Vehicle-to-Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi-agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper-V2X, a hypernetwork-based framework for estimating both epistemic and aleatoric uncertainties in V2X-based perception. Specifically, we propose a partial weight generation scheme and V2X context embedding module that conditions a Bayesian hypernetwork on fused multi-agent features to generate weight distributions for stochastic Bird's-Eye-View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper-V2X enables efficient uncertainty estimation with little computational overhead. Our approach is architecture-agnostic and can be seamlessly integrated with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper-V2X provides accurate, well-calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open-source license: https://github.com/abhishekjagtap1/Hyper-V2X
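When a stochastic model produces several class-probability vectors for the same BEV cell (one per sampled set of weights), a common way to separate the two kinds of uncertainty is the entropy decomposition sketched below. Whether Hyper-V2X uses exactly this decomposition is not stated in the abstract, so treat the function names and the choice of measure as assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(samples):
    """Split predictive uncertainty for one cell.

    `samples` holds one probability vector per stochastic forward pass.
    Returns (total, aleatoric, epistemic), where
      total     = entropy of the averaged prediction,
      aleatoric = average entropy of the individual predictions,
      epistemic = total - aleatoric (disagreement between passes).
    """
    n = len(samples)
    classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / n for c in range(classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(s) for s in samples) / n
    return total, aleatoric, total - aleatoric
```

If every pass returns the same uncertain prediction, all of the uncertainty is aleatoric; if the passes are individually confident but contradict each other, it is all epistemic.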

2605.07560 2026-05-21 cs.RO Version update

How to Utilize Failure Demo Data?: Effective Data Selection for Imitation Learning Using Distribution Differences in Attention Mechanism

Kana Miyamoto, Kanata Suzuki, Tetsuya Ogata

Affiliations Faculty of Science and Engineering, Waseda University, Tokyo, Japan; Physical AI Laboratory, Fujitsu Limited, Kanagawa, Japan

AI summary This paper proposes a method that uses distribution differences in the attention mechanism to select data for imitation learning. It learns latent representations of success-failure discrepancies to improve action stability and introduces a post-training metric for selecting failure samples that benefit learning.

Comments 15 pages, 6 figures, 2 tables

English abstract

Imitation learning for robotic tasks has relied primarily on policies trained only on successful demonstrations, although failures are unavoidable during human data collection. Many existing approaches for exploiting failure data require additional data processing or iterative policy updates through autonomous rollouts, making it difficult to directly and stably utilize failure data accumulated during data collection. In this work, we propose a method that learns latent representations of success-failure discrepancies and incorporates them into the attention mechanism. During inference, an appropriate latent mode is selected from the initial observation to improve action stability. Furthermore, we introduce a post-training metric that quantifies the attention discrepancy between each failure sample and successful demonstrations to select failure data. Simulation results show that the proposed method improves task success rates when trained with failure data and that the proposed metric identifies failure samples that are beneficial for learning when combined with successful demonstrations. These results suggest that the proposed method can support more efficient use of collected demonstrations in robotic data collection pipelines.

2605.21258 2026-05-21 cs.RO cs.AI Version update

Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation

Yicheng Jiang, Jiaxu Wang, Junhao He, Zesen Gan, Junhao Li, Qiang Zhang, Jingkai Sun, Jiahang Cao, Mingyuan Sun, Xiangyu Yue, Qiming Shao

Affiliations The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou); MMLab, The Chinese University of Hong Kong; X-Humanoid Robots; The University of Hong Kong; Tsinghua University

AI summary This paper proposes a new pretraining framework that learns a hybrid representation, structural latent points, combining the expressiveness of implicit representations with the structural priors of explicit ones to improve the efficiency and robustness of visual representations for robotic manipulation.

Journal ref
International Conference on Robotics and Automation 2026

English abstract

Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation: structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.

2605.21257 2026-05-21 cs.RO Version update

Reinforcement Learning for Risk Adaptation via Differentiable CVaR Barrier Functions

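Xinyi Wang, Taekyung Kim, Bardh Hoxha, Georgios Fainekos, Dimitra Panagou

Affiliations Department of Robotics; Department of Aerospace Engineering; University of Michigan; Toyota Motor North America, Research & Development

AI summary This paper proposes an end-to-end risk adaptation framework for crowd navigation under obstacle-motion uncertainty. It combines reinforcement learning with a differentiable quadratic-program safety layer based on Conditional Value-at-Risk (CVaR) barrier functions, jointly learning the nominal control input, risk level, and safety margin while enforcing explicit probabilistic safety constraints.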

Comments Project page: https://anonymousrobotics9666.github.io/rlcvarbf/

English abstract

Planning through crowded environments under uncertain obstacle motions remains difficult, as stochastic interactions often induce overly conservative behavior or reduced efficiency. To address this challenge, we propose an end-to-end risk adaptation framework for crowd navigation under obstacle-motion uncertainty modeled by a Gaussian mixture model. The framework combines reinforcement learning (RL) with a differentiable quadratic-program safety layer based on Conditional Value-at-Risk (CVaR) barrier functions, jointly learning nominal control input, risk level, and safety margin and enforcing explicit probabilistic safety constraints. This design enables context-aware adaptation, promoting efficient behavior while invoking caution only when necessary. We conduct extensive evaluations in dynamic, uncertain, and crowded environments across varying obstacle densities and robot models, and further assess generalization under three out-of-distribution cases. Comparisons across optimization-based, RL-based, and integrated RL and optimization methods are provided, and the proposed method is shown to deliver the strongest overall performance in safety, efficiency, and generalization under uncertainty.
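Conditional Value-at-Risk at level alpha is the expected loss over the worst (1 - alpha) fraction of outcomes, so the risk level directly controls how much of the tail a constraint reacts to. The sketch below uses a simple empirical tail average over sampled losses. It only illustrates the risk measure; the paper's barrier functions and quadratic program are not reproduced, and the function name and sample values are hypothetical.

```python
import math


def empirical_cvar(losses, alpha):
    """Average of the worst (1 - alpha) fraction of sampled losses.

    alpha in [0, 1): alpha = 0 gives the plain mean, and values close to 1
    focus on the few worst samples.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(losses, reverse=True)  # largest (worst) losses first
    tail = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    return sum(ordered[:tail]) / tail
```

Raising alpha can only increase the estimate, which is why a learned risk level acts as a dial between efficient and cautious behaviour.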

2605.21242 2026-05-21 cs.RO Version update

To Select or not to Select, that is the Question: Distilling Robot Skill Prediction into a Small Ensemble

Haechan Mark Bong, Simon Roy, Euhid Aman, Giovanni Beltrame

Affiliations Department of Computer Engineering and Software Engineering, Polytechnique Montréal; MILA; National Taiwan University of Science and Technology (NTUST)

AI summary This paper studies robot skill prediction. Using a synthetic dataset and fine-tuned sentence encoders, it builds a small specialized model that outperforms large general-purpose LLMs prompted zero-shot and performs better at task routing for robot fleets.

Journal ref
ICRA 2026 Workshop on Synthetic Data for Robot Learning

English abstract

As robot fleets become more heterogeneous, including humanoids, rovers, quadrupeds, and drones, selecting the right robot for a task becomes a core systems problem. We study robot skill prediction: mapping a natural-language task description to the physical capabilities required to execute it, such as fly, wheels, legs, surface water, under water, and hands. Since labelled data that maps natural-language task descriptions to robots' physical capabilities does not exist, we construct a synthetic task-to-skill dataset using LLM-assisted generation and targeted label auditing. Trained on this data, a ~133M-parameter ensemble of two fine-tuned sentence encoders (mpnet + MiniLM) reaches 83.5% task-to-skill matching on a stratified 200-task dataset, outperforming Kimi K2 (1T MoE) at 72.0%, GPT-OSS-120B at 71.5%, and Llama-4-Scout-17B at 69.0% under the same zero-shot prompt. These results suggest that, for fixed robot skill taxonomies, small specialized models trained on synthetic data can outperform much larger general-purpose LLMs for fleet-level task routing.
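Skill prediction as described is a multi-label problem: one task may need several capabilities at once. A minimal way to combine two encoders is to average their per-skill probabilities and keep every skill above a threshold, as sketched below. The averaging rule, the 0.5 threshold, and the function name are assumptions; the abstract names the six skills but not how the ensemble is combined.

```python
SKILLS = ["fly", "wheels", "legs", "surface water", "under water", "hands"]


def ensemble_predict(probs_a, probs_b, threshold=0.5):
    """Average two models' per-skill probabilities and threshold the result.

    probs_a and probs_b are hypothetical outputs of the two fine-tuned
    encoders, one probability per entry in SKILLS.
    """
    if len(probs_a) != len(SKILLS) or len(probs_b) != len(SKILLS):
        raise ValueError("expected one probability per skill")
    averaged = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
    return [skill for skill, p in zip(SKILLS, averaged) if p >= threshold]
```

For a task such as inspecting a rooftop and opening a hatch, two encoders that both lean toward "fly" and "hands" would route the task to a robot with both capabilities.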

2605.21188 2026-05-21 cs.RO Version update

A Terrain-Adaptive epsilon-Constraint MPC for Uneven Terrain Kinodynamic Planning

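Otobong Jerome, Geesara Kalathunga, Tiago Nascimento

Affiliations Laboratório de Engenharia de Sistemas e Robótica, Universidade Federal da Paraíba; School of Computer Science, University of Lincoln

AI summary This paper proposes a terrain-adaptive epsilon-constraint MPC method for planning on uneven terrain, where a vehicle must optimize path efficiency and pose stability at the same time. The epsilon bounds are adjusted dynamically to explore the Pareto front in real time, and vehicle-terrain dynamics are captured by a semi-parametric model that combines analytical vehicle dynamics with a sparse Gaussian process.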

English abstract

Kinodynamic planning for car-like vehicles on uneven terrain requires simultaneously optimizing competing objectives such as path efficiency and pose stability. This work presents an adaptive epsilon-constraint method integrated into a Model Predictive Control (MPC) framework, where the epsilon bounds are dynamically adjusted based on terrain descriptors to explore the Pareto front in real time. To capture vehicle-terrain dynamics, we develop a semi-parametric model combining analytical vehicle dynamics with a Sparse Gaussian Process (SGP) trained on the same terrain descriptors. The proposed epsilon-MPC is evaluated against MPPI and GAKD baselines, achieving a 94% navigation success rate while reducing maximum orientation deviation by 24% and improving multi-objective trade-off quality by 23%.
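The epsilon-constraint method turns a two-objective problem into a single-objective one: minimise one cost while bounding the other by epsilon. Moving epsilon traces out different points on the Pareto front. The sketch below applies this to a discrete set of candidate trajectories, with epsilon tightened as the terrain gets rougher. The candidate set, the linear schedule for epsilon, and all names are assumptions; the paper solves a continuous MPC problem instead.

```python
def adaptive_epsilon(roughness, eps_max=1.0, eps_min=0.2):
    """Tighter instability bound on rougher terrain (assumed linear schedule)."""
    r = min(max(roughness, 0.0), 1.0)  # clamp the terrain descriptor to [0, 1]
    return eps_max - r * (eps_max - eps_min)


def epsilon_constraint_select(candidates, eps):
    """Pick the cheapest path whose instability does not exceed eps.

    Each candidate is a (path_cost, instability) pair. If none is feasible,
    fall back to the most stable candidate.
    """
    feasible = [c for c in candidates if c[1] <= eps]
    if not feasible:
        return min(candidates, key=lambda c: c[1])
    return min(feasible, key=lambda c: c[0])
```

On flat ground the loose bound admits the shortest path; on rough ground the same candidate set yields a longer but more stable choice, which is the trade-off the adaptive bound is meant to steer.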

2605.21157 2026-05-21 cs.CV cs.AI cs.LG cs.RO Version update

Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums

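Sourov Roy Shuvo, Prajwal Panth, Rajesh Chowdhury, Sorup Chakraborty, Sudip Chakrabarty, Prasant Kumar Pattnaik

Affiliations School of Computer Engineering, KIIT Deemed to be University

AI summary This paper studies military object detection in drone imagery under different spectral conditions. It builds four dataset variants (Gray Scale, Thermal Vision, Night Vision, and Obscura Vision) to evaluate model performance across environments, and trains a YOLOv11-small model with the aim of improving the performance and reliability of drone operations.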

Comments 6 pages, 7 figures. Accepted at the 16th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 6-11, 2025, IIT Indore. Proceedings pending publication

English abstract

In modern warfare, drones are becoming an essential part of intelligence gathering and of carrying out precise attacks in many kinds of hostile environments. Their ability to operate in real time in hostile environments from a safe distance makes them invaluable for surveillance and military operations. The KIIT-MiTA dataset comprises drone images of different military scenarios and provides a foundation for detecting military objects, but it does not cover the variety of real-world conditions. To evaluate how models perform under varying conditions, four dataset variants are created: Gray Scale, Thermal Vision, Night Vision, and Obscura Vision. These simulate real-world conditions such as low visibility, heat-based imagery, and nighttime operation. The YOLOv11-small model is trained and used to detect objects across these settings. This research aims to improve the performance and reliability of drone-based operations by contributing to the development of advanced detection systems for both defensive and offensive missions.
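Of the four variants, the grayscale one is the simplest to illustrate: each RGB pixel is replaced by a single luminance value. The sketch below uses the widely used ITU-R BT.601 luma weights. How the authors actually produced their Gray Scale variant is not stated in the abstract, so the weights and the function name are assumptions.

```python
def to_grayscale(image):
    """Convert a nested list of (r, g, b) pixels (0-255) to luminance values.

    Uses the BT.601 weights 0.299, 0.587, 0.114, rounded to the nearest integer.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]
```

Green contributes most to the result and blue least, which mirrors the eye's sensitivity and keeps object contrast closer to what a human annotator saw in the colour image.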

2605.21150 2026-05-21 cs.RO 版本更新

EllipseLIO: Adaptive LiDAR Inertial Odometry with an Ellipsoid Representation

EllipseLIO: 一种基于椭球表示的自适应激光雷达惯性里程计

Rowan Border, Margarita Chli

发表机构 * Vision for Robotics Lab (V4RL)(机器人视觉实验室)

AI总结 本文提出EllipseLIO,一种基于椭球表示的实时激光雷达惯性里程计,通过自适应的激光雷达扫描过滤和配准方法,在不同环境和传感器下实现鲁棒的里程计性能,实验表明其在多种复杂场景中表现最优。

Comments 8 pages, 6 figures, 2 tables

详情
AI中文摘要

激光雷达惯性里程计(LIO)是许多需要无外部定位(如GPS)导航的移动机器人中的关键组件。在不同环境中自主运行且配备异构激光雷达传感器的平台需要一种能够适应这些不同场景且无需人工干预的LIO方法。现有LIO方法在经过适当调参后,通常能在环境和传感器相似的场景中提供可靠且准确的里程计,但许多方法在使用同一套配置时,难以在异构环境和传感器之间保持鲁棒性。本文提出了EllipseLIO,一种实时LIO方法,通过使用能够适应传感器能力和环境的激光雷达扫描过滤和配准方法,在不同场景间进行泛化,且无需针对特定场景调参。在五个包含多样且具有挑战性场景的数据集上,对EllipseLIO与最先进的LIO方法进行的实验表明,EllipseLIO总体表现最佳。其里程计误差平均比第二好的方法低38%,并且是唯一一个在所有实验中均不发散的方法。EllipseLIO的开源版本将在github.com/v4rl-ucy/ellipselio上提供。

英文摘要

LiDAR Inertial Odometry (LIO) is a critical component for many mobile robots that need to navigate without relying on external positioning (e.g., GPS). Platforms that operate autonomously in different environments and with heterogeneous LiDAR sensors require a LIO approach that can adapt to these different scenarios without human intervention. Existing LIO approaches can typically provide reliable and accurate odometry in scenarios with similar environments and sensors when suitably tuned. However, many approaches struggle to retain robust odometry across heterogeneous environments and sensors while using a consistent configuration. This paper presents EllipseLIO, a real-time LIO approach that generalises between scenarios by using methods for LiDAR scan filtering and registration that adapt to the sensor capabilities and environment without requiring scenario-specific tuning. Experiments with EllipseLIO and state-of-the-art LIO approaches on five datasets with diverse and challenging scenarios demonstrate that EllipseLIO is the best-performing approach overall. It achieves a 38% lower odometry error on average than the second-best approach and is the only approach that does not diverge in any experiment. An open-source version of EllipseLIO will be available at github.com/v4rl-ucy/ellipselio.

2605.21138 2026-05-21 cs.RO 版本更新

Safety-Critical Control for Smoothed Implicit Contact Dynamics

面向平滑隐式接触动力学的安全关键控制

Haegu Lee, Yitaek Kim, Christoffer Sloth

发表机构 * The Maersk Mc-Kinney Moller Institute, University of Southern Denmark(南丹麦大学马士基·麦金尼·穆勒研究所)

AI总结 本文提出了一种方法,通过引入边界聚焦的rollout筛选策略和离散时间控制屏障函数(CBF)框架,在平滑隐式接触动力学下对实际接触力进行约束,以提高安全性能。

详情
AI中文摘要

平滑隐式接触动力学使得在接触丰富的任务中可以进行基于梯度的规划和控制,而无需预定义的模式序列。然而,安全关键控制仍然具有挑战性,因为隐式接触动力学使安全滤波器的设计变得不平凡。平滑参数κ放松了接触互补性约束,这使动力学变得平滑,但会影响接触力。本文提供了一种方法,在使用放松的互补性约束的情况下仍能对实际接触力给出界。我们证明,约束违反量关于κ可能是非单调的:较小的κ会减小力的近似误差,但并不一定改善安全性能。为了解决这个问题,我们引入边界聚焦的rollout,通过比较安全裕度与近似误差来筛选κ。然后,我们基于隐式定义的接触力的一阶泰勒近似,建立了离散时间控制屏障函数(CBF)框架。为了应对可能的力低估,我们在所得安全约束上增加了固定的鲁棒裕度。在四个接触丰富的系统上的仿真表明,所提出的方法消除了在标准CBF下观察到的力违反现象。

英文摘要

Smoothed implicit contact dynamics enables gradient-based planning and control for contact-rich tasks without predefined mode sequences. However, safety-critical control remains challenging because implicit contact dynamics makes safety-filter design nontrivial. The smoothing parameter $κ$ relaxes contact complementarity constraints, which makes the dynamics smooth but affects the contact force. This paper provides a method for bounding the actual contact force despite the use of relaxed complementarity constraints. We show that constraint violations can be non-monotonic in $κ$. Smaller $κ$ reduces force-approximation error, but it does not necessarily improve safety performance. To address this issue, we introduce boundary-focused rollouts to screen $κ$ by comparing the safety margin with the approximation error. We then develop a discrete-time control barrier function (CBF) framework based on a first-order Taylor approximation of the implicitly defined contact force. To account for possible force under-prediction, we augment the resulting safety constraint with a fixed robust margin. Simulations on four contact-rich systems show that the proposed method eliminates force violations observed under a standard CBF.
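The constraint described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' formulation: the barrier is assumed to be `h = f_max - force`, the decay rate `gamma`, the names `taylor_force` and `cbf_constraint_ok`, and the way the margin enters the inequality are all hypothetical choices.

```python
import numpy as np


def taylor_force(f0, grad_u, u, u0):
    """First-order Taylor estimate of the contact force around the input u0.

    f0 is the force at u0 and grad_u is the (assumed known) sensitivity
    of the force with respect to the control input.
    """
    return f0 + float(np.dot(grad_u, np.asarray(u, dtype=float) - np.asarray(u0, dtype=float)))


def cbf_constraint_ok(f_now, f_next_pred, f_max, gamma=0.5, margin=0.0):
    """Discrete-time CBF check for the barrier h = f_max - force.

    The condition is h_next - margin >= (1 - gamma) * h_now, where the
    fixed margin hedges against under-prediction of the next force.
    """
    h_now = f_max - f_now
    h_next = f_max - f_next_pred
    return h_next - margin >= (1.0 - gamma) * h_now
```

A larger `margin` makes the filter more conservative: an input that passes with `margin=0.0` can be rejected once a margin is added.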

2605.21133 2026-05-21 cs.RO 版本更新

Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

基于主动空间大脑和可泛化动作小脑的人形机器人全身操作

Zhizhao Liang, Yi-Lin Wei, Xuhang Chen, Mu Lin, Yi-Xiang He, Zhexi Luo, Jun-Hui Liu, Kun-Yu Lin, Wei-Shi Zheng

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院)

AI总结 本文提出了一种可泛化的人形机器人移动操作(loco-manipulation)框架,通过主动空间大脑和可泛化动作小脑来解决复杂3D环境中空间理解困难和动作生成难以泛化的问题,在多种任务和环境中展示了强劲的性能。

Comments Project page: https://leungchaos.github.io/Humanoid-Whole-Body-Manipulation-via-Active-Spatial-Brain-and-Generalizable-Action-Cerebellum/

详情
AI中文摘要

在本文中,我们探索了具备空间感知的人形机器人全身操作任务。与桌面场景相比,该任务提出了两个关键挑战:1)在具有多样空间关系的复杂3D环境中,空间理解十分困难。2)动作生成难以泛化,因为有限且昂贵的真实机器人数据限制了数据驱动模型的泛化能力。为了解决这些挑战,我们提出了一种可泛化的人形机器人移动操作(loco-manipulation)框架,该框架利用多智能体大模型的空间感知和动作生成能力。具体而言,我们的框架包括两个组件:Active Spatial Brain 用于主动空间感知和决策,Generalizable Action Cerebellum 用于生成可执行的机器人动作。第一个组件主动感知空间场景,并就任务规划和子任务分解做出决策。第二个组件根据第一个组件的决策生成可执行的机器人动作,而无需任务特定的真实机器人数据。为了评测我们的框架,我们从两个角度设计了一组空间操作任务:评估空间感知与理解,以及评估真实机器人上的任务性能。结果表明,该框架在各种任务和环境中,在这两个方面都表现出强劲的性能。

英文摘要

In this paper, we explore the spatial-aware humanoid whole-body manipulation task. Compared with tabletop settings, this task poses two key challenges: 1) Spatial understanding is challenging in complex 3D environments with diverse spatial relations. 2) Action generation is difficult to generalize, as limited and costly real-robot data restricts the generalization of data-driven models. To address these challenges, we propose a generalizable humanoid loco-manipulation framework that leverages the spatial perception and action generation capabilities of multi-agent large models. Specifically, our framework includes two components: Active Spatial Brain for active spatial perception and decision-making, and Generalizable Action Cerebellum for executable robot action generation. The first component actively perceives the spatial scene and makes decisions on task planning and subtask decomposition. The second component generates executable robot actions based on the decisions made by the first module without the need for task-specific real-robot data. To benchmark our framework, we design a set of spatial manipulation tasks from two perspectives: evaluating spatial perception and understanding, and assessing real-robot task performance. The results demonstrate strong performance on both aspects across diverse tasks and environments.

2605.21111 2026-05-21 cs.RO cs.SY eess.SY 版本更新

Benchmarking Empirical and Learning-Based Approaches for Feedforward Steering Control in Autonomous Racing

自动驾驶赛车前馈转向控制中经验方法与基于学习方法的基准评测

Georg Jank, Mattia Piccinini, Sebastian Wenk, Phillip Pitschi, Johannes Betz, Boris Lohmann

发表机构 * Chair of Automatic Control, Department of Engineering Physics and Computation, Technical University of Munich(慕尼黑工业大学工程物理与计算系自动控制教席) Professorship of Autonomous Vehicle Systems (AVS), Department of Mobility Systems Engineering, Technical University of Munich(慕尼黑工业大学移动系统工程系自动驾驶车辆系统教席)

AI总结 本文通过系统评估两种学习方法和两种经验方法的前馈转向控制器,发现学习方法在开环评估中预测误差最小,但在闭环测试中路径跟踪性能和圈速并不优于所提出的方法,表明在完整轨迹规划和控制软件栈中评估前馈策略的必要性。

Comments 8 pages, 12 figures, Accepted to be published as part of the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026), Naples, Italy, September 15-18, 2026

详情
AI中文摘要

前馈转向控制是自动驾驶赛车分层控制架构中的关键组成部分。其目标是通过预测车辆的逆横向动力学来减少反馈控制器的转向修正。本文对两种基于学习的和两种经验(解析)前馈转向控制器进行了系统的基准评测。我们提出了一种基于多项式曲面拟合的新EHD公式,能够以最少的参数捕捉与速度相关的非线性转向行为。我们在一个以现实中的阿布扎比自动驾驶赛车联盟(Abu Dhabi Autonomous Racing League)赛事为基础的高保真仿真框架中测试前馈控制器,并使用高保真双轨车辆动力学仿真器。开环评估显示,基于学习的控制器实现了最低的预测误差;然而闭环测试显示,这种精度提升并未转化为更好的路径跟踪性能或圈速,即使经过迭代微调后也是如此。相比之下,所提出的EHD方法在整体闭环鲁棒性和圈速方面表现最佳,突显了在完整的轨迹规划与控制软件栈中评估前馈策略的必要性。我们的代码可在 https://github.com/TUMRT/steering_ff_control 获得。

英文摘要

Feedforward steering control is a key component of hierarchical control architectures for autonomous racing. The goal is to reduce steering corrections from the feedback controllers by predicting the vehicle's inverse lateral dynamics. This paper presents a systematic benchmark of two learning-based and two empirical (analytical) feedforward steering controllers. We introduce a new EHD formulation based on a polynomial surface fit that captures velocity-dependent nonlinear steering behavior with minimal parametrization. We test the feedforward controllers in a high-fidelity simulation framework based on the real-world Abu Dhabi Autonomous Racing League competition, using a high-fidelity double-track vehicle dynamics simulator. Open-loop evaluation shows that the learning-based controllers achieve the lowest prediction errors; however, closed-loop testing reveals that this improved accuracy does not translate into superior path tracking performance or lap times, even after iterative fine-tuning. In contrast, the proposed EHD approach achieves the best overall closed-loop robustness and lap time, highlighting the necessity of evaluating feedforward strategies within the complete trajectory planning and control software stack. Our code is available at https://github.com/TUMRT/steering_ff_control.
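To make the idea of a polynomial surface fit concrete, here is a minimal Python sketch. It is not the paper's formulation: the choice of path curvature `kappa` and speed `v` as inputs, the quadratic basis, and the function names are assumptions made for illustration, and the fit uses ordinary least squares via `numpy.linalg.lstsq`.

```python
import numpy as np


def design_matrix(kappa, v):
    """Quadratic surface terms in curvature kappa and speed v (assumed basis)."""
    kappa = np.atleast_1d(np.asarray(kappa, dtype=float))
    v = np.atleast_1d(np.asarray(v, dtype=float))
    return np.column_stack([np.ones_like(kappa), kappa, v, kappa * v, kappa**2, v**2])


def fit_surface(kappa, v, delta):
    """Least-squares fit of steering angle delta over (kappa, v) samples."""
    target = np.asarray(delta, dtype=float)
    coeffs, *_ = np.linalg.lstsq(design_matrix(kappa, v), target, rcond=None)
    return coeffs


def predict_steering(coeffs, kappa, v):
    """Feedforward steering angle predicted by the fitted surface."""
    return design_matrix(kappa, v) @ coeffs
```

With only six coefficients the surface stays easy to inspect, which is in the spirit of the "minimal parametrization" the abstract mentions; the real basis and inputs may differ.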

2605.21109 2026-05-21 cs.RO 版本更新

Anomaly-Informed Confidence Calibration for Vision-Based Safety Prediction

基于异常的置信度校准用于基于视觉的安全预测

Zhenjiang Mao, Jiawen Wu, Gabriel Wagner, Zhongzheng Zhang, Ivan Ruchkin

发表机构 * Trustworthy Engineered Autonomy (TEA) Lab, Department of Electrical and Computer Engineering, University of Florida(可信工程自主性实验室,电气与计算机工程系,佛罗里达大学)

AI总结 本文提出了一种基于异常的在线校准方法,通过融合感知和动态异常分数来改进基于视觉的安全预测中的置信度估计,从而在面对分布偏移时减少过自信,提升预测性能。

详情
AI中文摘要

可靠的置信度估计对于在自动驾驶赛车中安全部署基于视觉的控制器非常重要:安全预测必须从摄像头图像中得出,但现代预测器在测试时分布偏移下会变得危险地过度自信。我们发现现有异常信号中存在一个关键的感知-动力学差距:广泛使用的分数(如自编码器重构误差)能够捕捉视觉损坏,却会漏掉动力学异常(例如执行偏差、延迟),此时图像看起来仍然合理,而轨迹却在恶化。为了解决这个问题,我们提出了一种基于异常的在线校准方法,该方法无需重新训练任何模型组件,融合了从世界模型中提取的两个互补的异常分数:一个来自重构误差的感知分数,以及一个来自认知(epistemic)不确定性和控制流统计的动力学分数。基于这些融合的分数,一个轻量级的温度缩放校准器利用测试时增强,有选择地降低偏移下的过度自信,同时保持正常条件下的性能。在物理DonkeyCar上,针对四种训练中未见过的真实世界异常协议(黑暗、模糊、执行偏差、处理延迟)进行的实验表明,平均期望校准误差从0.184降低到0.116,比最佳基线提高了37%,且无需修改基础安全预测器。

英文摘要

Reliable confidence estimates are important for safely deploying vision-based controllers in autonomous racing, where safety predictions must be derived from camera images, yet modern predictors become dangerously overconfident under test-time distribution shifts. We identify a critical perception-dynamics gap in existing anomaly signals: widely used scores, such as autoencoder reconstruction error, capture visual corruptions but miss dynamics anomalies (e.g., actuation bias, latency), where images remain plausible while the trajectory degrades. To address this, we propose an Anomaly-Informed Online Calibration approach that, without retraining any model component, fuses two complementary anomaly scores extracted from a world model: a perceptual score from reconstruction error and a dynamics score from epistemic uncertainty and control-stream statistics. Based on these fused scores, a lightweight temperature-scaling calibrator leverages test-time augmentation to selectively reduce overconfidence under shift while preserving nominal-condition performance. Experiments on a physical DonkeyCar under four real-world anomaly protocols unseen during training (darkness, blur, actuation bias, processing latency) reduce average expected calibration error from 0.184 to 0.116, a 37% improvement over the best baseline, without modifying the base safety predictor.
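The two ingredients named in the abstract, temperature scaling driven by an anomaly score and expected calibration error (ECE), can be sketched as follows. This is a simplified illustration, not the authors' calibrator: the linear rule `temperature = base_temp + gain * anomaly_score`, the parameter values, and the function names are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def calibrated_confidence(logit, anomaly_score, base_temp=1.0, gain=2.0):
    """Soften a safety logit with a temperature that grows with the anomaly score."""
    temperature = base_temp + gain * max(anomaly_score, 0.0)
    return sigmoid(logit / temperature)


def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin-weighted average gap between mean confidence and accuracy."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

When the anomaly score is zero the logit is left untouched, which mirrors the stated goal of preserving nominal-condition behaviour while damping confidence under shift.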

2605.21061 2026-05-21 cs.CV cs.AI cs.RO 版本更新

Grounding Driving VLA via Inverse Kinematics

通过逆运动学实现驾驶VLA的视觉落地

Junsung Park, Hyunjung Shim

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院) KimJaeChul AI Graduate School(Kim Jaechul人工智能研究生院)

AI总结 本文提出以逆运动学求解器的风格重新设计驾驶VLA,以解决轨迹预测中忽略视觉token的问题,通过引入下一视觉状态预测目标和逆运动学网络,恢复了视觉落地(visual grounding)能力并提升了轨迹规划性能。

详情
AI中文摘要

现有驾驶VLA在预测轨迹时大多忽略其视觉token,我们将这一现象归因于任务表述在结构上不适定,而非训练不足。我们证明,从逆运动学的视角来看,轨迹恢复需要当前和未来两个视觉状态作为边界条件;现有VLA仅提供前者,促使模型仅依赖自车状态和文本指令走捷径。为解决此问题,我们以逆运动学求解器的风格重新设计驾驶VLA。首先,下一视觉状态预测目标要求LLM预测未来的视觉场景,从而提供密集的视觉监督并抑制捷径路径。其次,一个单独的逆运动学网络(基于交叉注意力的条件扩散模型)仅以当前和未来视觉状态为输入,用于在轨迹解码过程中抑制对自车状态和文本捷径的依赖。仅凭这一简单方案,我们的0.5B规模模型恢复了视觉落地(visual grounding)能力,并在闭环NAVSIM-v2和nuScenes基准上达到了与规模大一个数量级以上的7B至8B VLA相当的轨迹规划性能。进一步的大量分析表明,这种改进源于恢复了利用视觉特征的能力,且在转弯等动态驾驶场景中效果最为明显。

英文摘要

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens -- a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B--8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

2605.21053 2026-05-21 cs.RO 版本更新

Perception of Social Robots as Communication Partners in Healthcare for Older Adults

老年人医疗照护中社交机器人作为交流伙伴的感知

Hana Yamamoto, Carlotta Julia Mayer, Charlotte Raithel, Theresa Buchner, Christian Werner, Yasuhisa Hirata, Monika Eckstein, Katja Mombaur

发表机构 * Institute for Anthropomatics and Robotics(人机学与机器人研究所) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Department of Robotics(机器人系) Tohoku University(东北大学) Institute of Medical Psychology(医学心理学研究所) Heidelberg University(海德堡大学) Institute of Psychology(心理学研究所)

AI总结 研究探讨了社交机器人在医疗领域中作为交流伙伴的有效性,以及积极提示对交互效果的影响,发现机器人与人类交互时压力水平无显著差异,且机器人能被接受为有效的交流伙伴,有助于减轻护理人员负担。

Comments 31 pages, 10 figures, Under review at International Journal of Social Robotics

详情
AI中文摘要

通过社交辅助机器人应对全球护理人员短缺问题,需要深入了解人机交互(HRI)过程中机器人对老年人的心理和生理影响。本研究探讨了与人类相比,社交机器人能否成为有效的交流伙伴,以及“积极提示”是否同样能增强这些交互。我们对35名参与者(年龄70岁以上)进行了比较研究。我们的多模态分析整合了面部表情数据、心率变异性和主观问卷,结果显示人类交互与机器人交互的整体压力水平无显著差异。面部表情分析证实机器人被接受为有效的交流伙伴,而生理数据显示机器人交互期间心率略低,表明与人类主导的环节相比状态更为放松。这些发现表明,社交机器人可以与老年人互动而不引起心理压力,并能够通过执行结构化任务(如健康感知问卷调查)来减轻护理人员的负担。未来的工作应解决机器人设计中发现的“外观-内容不匹配”问题,以促进更加自然和有效的交互。

英文摘要

Addressing the global caregiver shortage through socially assistive robots necessitates a deep understanding of their psychological and physiological impacts on older adults during human-robot interaction (HRI). This study addresses whether social robots can serve as effective interaction partners compared to humans, and if "positive prompts" can similarly enhance these interactions. We conducted a comparative study with 35 participants (aged 70+). Our multi-modal analysis, integrating facial expression data, heart rate variability, and subjective questionnaires, revealed no significant differences in overall stress levels between human and robot interactions. Facial expression analysis confirmed that the robot was accepted as a valid interaction partner, while physiological data showed slightly lower heart rates during robot interactions, suggesting a more relaxed state compared to human-led sessions. These findings indicate that social robots can engage older adults without inducing psychological strain and are capable of alleviating caregiver burden by performing structured tasks, such as health-sensing surveys. Future work should address the identified "appearance-content mismatch" in robot design to facilitate even more natural and effective interactions.

2605.21026 2026-05-21 cs.RO 版本更新

Component Influence-Driven Fastener Reduction for Robotic Disassemblability-Aware Design Simplification

基于组件影响的紧固件减少:面向机器人可拆解性的设计简化

Takuya Kiyokawa, Tomoki Ishikura, Shingo Hamada, Genichiro Matsuda, Kensuke Harada

发表机构 * Department of Systems Innovation, Graduate School of Engineering Science, The University of Osaka(大阪大学基础工学研究科系统创新专业) Manufacturing Innovation Division, Panasonic Holdings Corporation(松下控股株式会社制造创新本部)

AI总结 本文提出了一种分析框架,通过减少紧固件来实现面向机器人可拆解性的设计简化,该框架利用CAD模型和自动生成的接触-连接-约束(CCC)图,将机器人拆解序列规划结果转化为组件影响评分,以指导设计简化。

Comments 7 pages, 8 figures

详情
AI中文摘要

为了加速自动化再制造,必须在产品设计阶段就考虑机器人拆解。然而,设计师目前缺乏定量反馈来识别哪些结构元件阻碍机器人操作。为此,本研究提出了一种分析框架,提供以减少紧固件为重点的可操作的再设计指导,因为紧固件数量众多,几乎存在于所有制造产品中。该框架利用计算机辅助设计(CAD)模型及其自动生成的接触-连接-约束(CCC)图,将机器人拆解序列规划结果转化为组件影响评分。这些评分反映了某一组件在机器人拆解序列中导致结构约束违规或评估目标恶化的频率。为了直观地突出结构障碍,该框架将这些评分以3D热图的形式投影到CAD几何体上。随后,系统以解析方式模拟移除高影响紧固件的效果,报告结构约束、工具更换和机器人移动距离的预期减少量,同时通过评估几何稳定性指标来防止结构上不安全的修改。对七种家用电器的实验表明,该框架成功定位了冗余紧固件。移除推荐的紧固件可消除图上8到132个结构约束(取决于各产品的结构配置),从而简化了结构依赖关系。此外,在结构允许的情况下,通过消除不必要的工具更换操作并将移动距离缩短165到1675毫米,提高了机器人操作效率。

英文摘要

To accelerate automated remanufacturing, robotic disassembly must be considered during the product design phase. However, designers currently lack quantitative feedback to identify which structural elements hinder robotic operations. To address this, this study proposes an analytical framework that provides actionable redesign guidance focused on fastener reduction, as fasteners are numerous and ubiquitous components found in almost all manufactured products. Using a Computer-Aided Design (CAD) model and its automatically generated Contact-Connection-Constraint (CCC) graph, the framework translates robotic disassembly sequence planning outcomes into component influence scores. These scores reflect how often a component causes structural constraint violations or evaluation objective deteriorations in the robotic disassembly sequence. To visually highlight structural hindrances, the framework projects these scores onto the CAD geometry as 3D heatmaps. The system then analytically simulates the removal of highly influential fasteners. It reports the expected reductions in structural constraints, tool changes, and robot travel distances, while preventing structurally unsafe modifications by evaluating geometric stability metrics. Experiments on seven household appliances demonstrate that the framework successfully targets redundant fasteners. Removing the recommended fasteners simplified the structural dependencies by eliminating between 8 and 132 structural constraints on the graph depending on each product's structural configuration. Furthermore, it improved robotic operational efficiency by eliminating unnecessary tool change operations and shortening travel distances by 165 to 1675 millimeters wherever structurally permissible.

2605.20932 2026-05-21 cs.RO 版本更新

WiXus: A Wheeled-Legged Robot with Wire-Driven Environmental Utilizing to Integrate Mobility and Manipulation

WiXus: 一种配备线驱动环境利用的轮腿机器人,用于整合移动与操作

Shintaro Inoue, Kento Kawaharazuka, Temma Suzuki, Sota Yuzaki, Kei Okada

发表机构 * Department of Mechano-Informatics, Graduate School of Information Science and Technology, The University of Tokyo(机械信息学系,信息科学和技术研究生院,东京大学)

AI总结 本文提出了一种新型轮腿机器人WiXus,通过利用外部环境的线驱动机制,使机器人能够实现平面移动和三维移动,并将腿部重新用于物体操作和工具使用。

Comments Accepted at ICRA2026, website - https://shin0805.github.io/wixus/, YouTube - https://youtu.be/32qhUslR0gM

详情
AI中文摘要

轮腿机器人在足端装有轮子,通过协调轮驱动和腿驱动实现高机动性,但此前一直只是作为专用于移动的平台而开发。因此,它们无法将腿部用于移动以外的任务,如物体操作或工具使用。本文研究如何通过外部身体支撑将腿部从移动任务中解放出来,从而发挥腿部潜在的任务执行能力。为此,我们提出并开发了一种新的机器人WiXus,它将轮腿机构与利用外部环境的线驱动机构相融合。所开发的WiXus不仅能够通过轮腿驱动实现平面移动,还能通过协调线驱动和轮腿驱动实现攀爬悬崖等三维移动。此外,通过线驱动将身体悬吊起来,WiXus成功将腿部转用作手臂,完成物体操作(例如营救一只狗(毛绒玩具))和工具使用(例如用长柄修枝剪采摘苹果(模型))。本研究表明,利用线驱动来借助环境的方法是一种新的设计原则,能够扩展轮腿机器人的作业范围。

英文摘要

Wheeled-legged robots, which have wheels at their feet and achieve high mobility by coordinating wheel drive and leg drive, have been developed. These robots have been developed purely as platforms specialized for locomotion. Therefore, they do not have a means to repurpose their legs for roles other than locomotion, such as object manipulation or tool utilization. In this paper, we address the problem of how to draw out the potential task-execution capability of the legs by freeing them from the role of locomotion through external body support. To this end, we propose and develop a new robot, WiXus, which fuses a wheeled-legged mechanism with a wire-driven mechanism that utilizes the external environment. The developed WiXus demonstrates not only planar locomotion with wheeled-legged drive, but also three-dimensional mobility such as cliff climbing by coordinating wire-driven and wheeled-legged actuation. Furthermore, by suspending the body with wire-driven actuation, WiXus successfully repurposes its legs as arms to perform object manipulation (e.g., rescuing a dog (stuffed animal)) and tool utilization (e.g., harvesting an apple (mockup) with loppers). This study demonstrates that the approach of utilizing the environment with wire-driven actuation is a new design principle that extends the operational domain of wheeled-legged robots.

2605.20929 2026-05-21 cs.RO 版本更新

STEAM: A Training-Free Congestion-Aware Enhancement Framework for Decentralized Multi-Agent Path Finding

STEAM: 一种无需训练的拥堵感知增强框架用于去中心化多智能体路径寻找

Mingyang Feng, Mengnuo Zhang, Shaoyuan Li, Xiang Yin

发表机构 * School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(自动化与智能感知学院,上海交通大学)

AI总结 本文提出STEAM框架,一种无需训练的测试时增强框架,面向离散环境中基于学习的去中心化多智能体路径寻找(MAPF)方法,通过空间避让、时间logit修正和密度感知logit修正注入轻量级拥堵感知引导,从而提高成功率和效率。

详情
AI中文摘要

我们提出STEAM(Spatial, Temporal, and Emergent congestion Awareness for MAPF,面向MAPF的空间、时间与涌现拥堵感知),一种无需训练的测试时增强框架,用于离散环境中基于学习的去中心化多智能体路径寻找(MAPF)。给定一个预训练的去中心化策略,STEAM不需要重新训练、修改架构或用集中式规划器替代。相反,它将轻量级的拥堵感知引导注入原始策略的执行过程。STEAM首先沿当前成本到目标(cost-to-go)地图所诱导的最短路径进行前向推演,以识别潜在的未来拥堵热点。对于空间上可避免的拥堵,通过更新各智能体专属的成本到目标信息来缓解;对于空间上不可避免的瓶颈,则通过时间logit修正来处理。此外,基于邻近智能体修正后的成本到目标地图进行密度感知的logit修正,以减少涌现的局部拥堵。在代表性的基于学习的去中心化MAPF算法上的大量实验表明,STEAM一致地改善了成功率、完工时间(makespan)和解的代价,成功率提升高达60%,且仅带来轻微的计算开销。实现可在https://anonymous.4open.science/r/STEAM-MAPF-7A62获取。

英文摘要

We propose STEAM (Spatial, Temporal, and Emergent congestion Awareness for MAPF), a training-free test-time enhancement framework for learning-based decentralized Multi-Agent Path Finding (MAPF) in discrete environments. Given a pretrained decentralized policy, STEAM requires no retraining, architectural modification, or replacement by a centralized planner. Instead, it injects lightweight congestion-aware guidance into the original policy execution. STEAM first rolls out the shortest paths induced by the current cost-to-go maps to identify potential future congestion hotspots. Spatially avoidable congestion is mitigated by updating agent-specific cost-to-go information, while spatially unavoidable bottlenecks are handled through temporal logit correction. In addition, emergent local congestion is reduced by a density-aware logit correction based on neighboring agents' corrected cost-to-go maps. Extensive experiments on representative learning-based decentralized MAPF algorithms show that STEAM consistently improves success rate, makespan, and solution cost, with success-rate gains of up to 60% and only minor computational overhead. The implementation is available at https://anonymous.4open.science/r/STEAM-MAPF-7A62.
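The first step described above, rolling out shortest paths from cost-to-go maps to find future congestion hotspots, can be sketched in Python as follows. This is a toy illustration rather than the STEAM implementation: the 4-connected grid, the BFS cost-to-go map, the greedy descent rollout, and the rule "two or more agents in the same cell at the same time step" are all assumptions, and each start is assumed to be able to reach its goal.

```python
from collections import Counter, deque

MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def cost_to_go(grid, goal):
    """BFS distance to the goal on a 4-connected grid (1 marks an obstacle)."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] == 0 and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return dist


def rollout(start, dist, horizon):
    """Greedily descend the cost-to-go map; the agent waits once at its goal."""
    path, cur = [start], start
    for _ in range(horizon):
        if dist[cur] > 0:
            neighbours = [(cur[0] + dr, cur[1] + dc) for dr, dc in MOVES]
            cur = min((n for n in neighbours if n in dist), key=dist.get)
        path.append(cur)
    return path


def hotspots(starts, goals, grid, horizon):
    """Return the (time, cell) pairs predicted to hold two or more agents."""
    counts = Counter()
    for start, goal in zip(starts, goals):
        for t, cell in enumerate(rollout(start, cost_to_go(grid, goal), horizon)):
            counts[(t, cell)] += 1
    return {key for key, n in counts.items() if n >= 2}
```

In a one-row corridor with two agents swapping ends, the only predicted hotspot is the middle cell at the middle time step, which is exactly the kind of bottleneck the abstract says must be handled temporally rather than spatially.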

2605.20917 2026-05-21 cs.RO 版本更新

SubTGraph: Large-Scale Subterranean Environment Synthesis with Controllable Topological Variability for Robotic Autonomy Validation

SubTGraph: 大规模地下环境合成与可控拓扑变化用于机器人自主性验证

F. Labra Caso, A. Saradagi, S. Fredriksson, S. Nordström, A. Koval, G. Nikolakopoulos

发表机构 * Robotics & AI, Luleå University of Technology(吕勒奥理工大学机器人与人工智能团队)

AI总结 本文提出SubTGraph框架,用于快速合成具有高变异性的多层级地下环境,根据用户指定的拓扑、维度、纹理等参数生成不同类型的地下环境,以支持对机器人自主栈各层的严格验证。

Comments 16 pages, 18 figures

详情
AI中文摘要

地下(SubT)环境一直是自主机器人技术的前沿领域,其动力来自采矿作业自动化的推进和对行星探索(如火星熔岩管)的兴趣。由于进入真实SubT环境存在诸多困难,在逼真的仿真环境中对自主栈进行严格的强化测试至关重要。本文填补了一个众所周知的空白:由于缺乏可用于对机器人自主性进行严格统计评估的大规模仿真基准基础设施,SubT研究论文通常最多只能在少数几个环境中展示验证结果。本文提出了SubTGraph,一种新的框架,用于快速合成具有高变异性的多层级SubT环境,并结合用户指定的拓扑、维度、纹理等要求,生成运营中的矿山、天然洞穴和熔岩管等不同环境。SubTGraph根据用户指定的结构约束构建成本矩阵,引导经典Dijkstra算法,利用DARPA World Generator的拓扑度量(topometric)瓦片程序化地生成SubT世界。本文通过三个机器人案例研究,展示了SubTGraph在严格验证机器人自主栈不同层次方面的实用性:结构语义分割依据拓扑度量真值进行验证;多智能体路径规划经过大范围测试,以识别算法行为中的模式和趋势;LIO SLAM在具有挑战性的地下路段进行压力测试,以识别失败案例。SubTGraph世界创建代码库已开源(https://github.com/LTU-RAI/SubTGraph.git),并附带一个包含150个高度多样化地下世界的数据库。

英文摘要

Subterranean (SubT) environments have been a frontier for autonomous robotics, driven by the push for automation of mining operations and the interest in planetary exploration (Martian Lava Tubes). Due to the challenges involved in accessing real SubT environments, rigorous hardening of autonomy stacks in realistic simulation environments is critical. This article fills a well-known gap, which relates to the unavailability of a large-scale simulation-based benchmarking infrastructure for rigorous statistical evaluation of robotic autonomy, due to which it is common for SubT research articles to present validation results in a few environments at best. This article presents SubTGraph, a novel framework for rapid synthesis of multi-level SubT environments with high variability, incorporating user specifications related to topology, dimensionality, textures, etc., to generate distinct environments such as operational mines, natural caves and lava tubes. SubTGraph builds a cost matrix from user-specified structural constraints to guide the classical Dijkstra algorithm to procedurally generate SubT worlds utilizing topometric tiles from the DARPA World Generator. Three robotics case-studies are investigated to demonstrate the utility of SubTGraph for rigorous validation of different layers in the robotic autonomy stack. Structural semantic segmentation is validated against topometric ground truths, multi-agent path planning is widely tested for identification of patterns and trends in the algorithm behavior and LIO SLAM is stress-tested in challenging subterranean sections to identify failure cases. The SubTGraph world creation codebase is open-sourced (https://github.com/LTU-RAI/SubTGraph.git) along with a database consisting of 150 highly variable underground worlds.
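The abstract says a cost matrix guides the classical Dijkstra algorithm during world generation. The Python sketch below shows only that generic mechanism, under assumptions that are not taken from the paper: the cost matrix is a 2D grid, entering a cell costs that cell's value, costs are positive, and the resulting cell sequence is where tunnel tiles would be placed. The function name `dijkstra_path` is hypothetical.

```python
import heapq


def dijkstra_path(cost, start, goal):
    """Cheapest 4-connected path through a grid of positive cell costs."""
    rows, cols = len(cost), len(cost[0])
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > best[cell]:
            continue  # stale heap entry
        r, c = cell
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nd
                    parent[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk the parent links back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]
```

Raising the cost of a region steers the generated corridor away from it, which is one plausible way structural constraints could shape the layout.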

2605.20894 2026-05-21 cs.RO 版本更新

Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation

Mobile UMI: 面向移动操作的解耦运动学跨视角扩散策略

Haoran Huang, Haonan Dong, Huixu Dong

发表机构 * Zhejiang University(浙江大学)

AI总结 本文提出了一种无需机器人硬件的演示框架Mobile UMI,通过三个组件解决移动模仿学习中的两个瓶颈问题:受行走运动污染的动作标签和推理引起的执行延迟。核心方法是通过双摄像头捕捉全局和局部上下文,利用空间锚点统一视觉-惯性坐标系以提取解耦的操作轨迹和基座轨迹,并利用异步滚动时域执行器进行在线状态匹配。

详情
AI中文摘要

在便携式演示接口上进行移动模仿学习面临两个相互耦合的瓶颈:受行走运动污染的动作标签,以及在持续移动的基座上由推理引起的执行延迟。最近的腕部安装接口降低了桌面数据收集的成本,但单一的腕部视角无法捕捉基座导航所需的全局上下文。添加身体安装的摄像头又会使人的行走与手部运动纠缠在一起。同时,生成式策略会引入数百毫秒的推理延迟,在此期间基座会越过预测的路径点,迫使在动作拼接处进行后退修正。本文提出了Mobile UMI,一种无需机器人硬件的演示框架,通过三个组件解决上述两个问题。首先,双摄像头采集系统在没有任何机器人在场的情况下,记录以胸部为中心的全局上下文和以腕部为中心的局部交互。其次,基于ChArUco的一次性空间锚点统一了胸部和手部的视觉-惯性坐标系;随后将手部位姿相对于胸部重新表达,以提取解耦的SE(3)操作轨迹和SE(2)基座轨迹。第三,异步滚动时域执行器执行在线状态匹配:每个生成的动作块都与当前物理位姿重新对齐,使过期的路径点在执行前被丢弃。整个系统在四个长时程家庭任务上进行了评估,在每个任务100次试验中取得了83.8%的平均成功率。与ACT和Diffusion Policy的对照比较显示,仅使用相对于胸部的标签就弥合了大部分差距,在线状态匹配则弥合了剩余差距。这些结果表明,在所测试的条件下,对于移动模仿学习,显式的运动学分解与状态级延迟对齐相结合提供了一种有效的解决方案,而无需对底层策略类别进行架构更改。

英文摘要

Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: locomotion-contaminated action labels and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, yet a single wrist view does not capture the global context required for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base advances past predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps through three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second, a one-shot ChArUco-based spatial anchor unifies the chest and hand visual-inertial frames; the hand pose is then re-expressed relative to the chest to extract decoupled SE(3) manipulation and SE(2) base trajectories. Third, an asynchronous receding-horizon executor performs online state matching: each generated action chunk is realigned with the current physical pose so that expired waypoints are discarded before execution. The full system is evaluated on four long-horizon household tasks, achieving an average success rate of 83.8% over 100 trials per task. Controlled comparisons against ACT and Diffusion Policy show that the chest-relative label alone closes much of the gap; online state matching closes the remainder. These results indicate that, for mobile imitation learning under the tested conditions, explicit kinematic factorization combined with state-level latency alignment provides an effective solution without requiring architectural changes to the underlying policy class.
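Two of the steps described above lend themselves to a short Python sketch: re-expressing the hand pose relative to the chest, and discarding waypoints the base has already passed. Both functions are illustrative assumptions rather than the paper's code: poses are taken to be 4x4 homogeneous transforms in a shared world frame, and state matching is approximated by a nearest-waypoint rule on planar base positions.

```python
import numpy as np


def relative_pose(T_world_chest, T_world_hand):
    """Hand pose in the chest frame: inv(T_world_chest) @ T_world_hand."""
    return np.linalg.inv(T_world_chest) @ T_world_hand


def realign_chunk(waypoints_xy, current_xy):
    """Drop waypoints up to and including the one nearest the current base position.

    A stand-in for online state matching: everything before the matched
    waypoint is treated as expired and is not executed.
    """
    waypoints_xy = np.asarray(waypoints_xy, dtype=float)
    offsets = waypoints_xy - np.asarray(current_xy, dtype=float)
    nearest = int(np.argmin(np.linalg.norm(offsets, axis=1)))
    return waypoints_xy[nearest + 1:]
```

Because the hand pose is expressed relative to the chest, walking motion that moves both frames together cancels out of the manipulation label, which is the decoupling the abstract credits for most of the gain.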

2605.20856 2026-05-21 cs.RO cs.AI cs.LG 版本更新

DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

DISC: 通过策略生成解耦指令与状态条件控制

Hanxiang Ren, Pei Zhou, Xunzhe Zhou, Yanchao Yang

发表机构 * Zhejiang University(浙江大学) The University of Hong Kong(香港大学) TranscEngram

AI总结 DISC通过策略生成解耦指令与状态条件控制,解决了任务状态耦合导致的观察泄漏问题,并在多个基准测试中表现出色,证明了语言生成的策略参数驱动行为。

详情
AI中文摘要

语言条件的操作策略通常通过共享的网络参数同时处理指令和观测。这种任务-状态纠缠为观测泄漏提供了途径:网络会学到完全绕过语言落地(grounding)的场景到动作捷径。DISC从结构上消除了这一失效模式。DISC不是让一个通用策略以语言为条件,而是使用超网络仅根据指令生成任务特定视觉运动策略的全部参数。生成的策略从不直接访问语言;因此,其任务感知必须来自语言。这样一来,观测泄漏就没有产生的途径。另一方面,生成一致的高维策略权重本身就是一个具有挑战性的问题。我们用两阶段超网络来解决它,其细化阶段将基于梯度的优化结构作为前馈归纳偏置嵌入其中,无需实际的梯度计算即可产生全局一致的参数。DISC在标准数据预算下完全从头训练,在LIBERO-90和Meta-World上优于所有纠缠式基线,其优势在复杂的长时程任务上进一步扩大,并且在不使用任何外部预训练数据的情况下超越了大规模预训练的π₀。在一个所有任务共享相同视觉上下文的真实世界基准上,DISC显著优于纠缠式替代方案,直接证实了驱动行为的是由语言生成的策略参数,而非视觉捷径。超网络还学到了一个具有语义结构的参数流形,能够仅凭极少的演示实现少样本适应,并对改写后的指令实现稳健的泛化。我们的代码可在 https://github.com/ReNginx/DISC 获取。

英文摘要

Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage -- networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task-awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high-dimensional policy weights is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based optimization as a feed-forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO-90 and Meta-World, with advantages that widen on complex, long-horizon tasks -- and surpasses the large-scale pretrained $\pi_0$ despite using no external pretraining data. On a real-world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language-generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few-shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: https://github.com/ReNginx/DISC.
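The structural idea, a hypernetwork that turns an instruction embedding into the weights of a policy that itself only ever sees observations, can be sketched in Python with NumPy. This is a deliberately tiny, single-stage linear stand-in, not DISC's two-stage hypernetwork: the dimensions, the `tanh` policy head, and the names `init_hypernetwork` and `generate_policy` are assumptions for illustration.

```python
import numpy as np


def init_hypernetwork(rng, embed_dim, obs_dim, act_dim):
    """Linear hypernetwork mapping an instruction embedding to policy parameters."""
    n_policy_params = obs_dim * act_dim + act_dim
    return {
        "W": rng.normal(scale=0.1, size=(n_policy_params, embed_dim)),
        "b": np.zeros(n_policy_params),
    }


def generate_policy(hyper, instruction_embedding, obs_dim, act_dim):
    """Generate a task-specific policy; the policy takes observations only."""
    flat = hyper["W"] @ instruction_embedding + hyper["b"]
    weight = flat[: obs_dim * act_dim].reshape(act_dim, obs_dim)
    bias = flat[obs_dim * act_dim:]

    def policy(observation):
        # Language never enters here: it has already been baked into weight and bias.
        return np.tanh(weight @ observation + bias)

    return policy
```

Two different instruction embeddings yield two different policies for the same observation, so any task dependence in the action has to flow through the generated parameters.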

2605.20850 2026-05-21 cs.RO 版本更新

SmoCap: Unified Scale-Pose Canonicalization with Proxy-Mapped Trust-Region QP

SmoCap: 一种统一的尺度-姿态规范化方法,结合代理映射信任区域QP

Shihao Li, Naohiko Sugita

发表机构 * The Research into Artifacts, Center for Engineering, The University of Tokyo(东京大学人工物工学研究中心)

AI总结 SmoCap通过在稀疏控制子空间中联合估计形态和姿态,解决阶段式工作流导致的形态-姿态补偿问题,实现了统一的尺度-姿态规范化框架,提高了运动规范化的实用性。

Comments 11 pages, 6 figures, 4 tables

详情
AI中文摘要

目标:将模型缩放与逆运动学分开的阶段式工作流会引发形态-姿态补偿,产生解剖学上不一致但数值上可接受的解,在弱观测方向上尤为明显。我们提出了SmoCap,一种抗泄漏的规范化框架,它在稀疏控制子空间内的每个局部信任区域二次规划(QP)中联合估计形态和姿态。方法:SmoCap利用解析的代理映射姿态和尺度雅可比矩阵求解带约束的信任区域QP。低维代理映射稳定了弱观测方向,并驱动协调结构。可选的预求解可在困难构型下提供热启动。该框架使用队列X线透视膝关节运动数据、人体测量学真值和极端瑜伽序列进行评估。结果:与X线透视相比,SmoCap的膝关节屈曲RMSE为2.9度,汇总的人体测量学端点误差约为3%。在与逐节段缩放对比的泄漏审计中,SmoCap还降低了标记点RMSE、FE误差和人体测量学端点误差。在瑜伽消融实验中,与基线模型相比,代理耦合保持了富有表现力且协调的脊柱运动,拟合误差仅略有增加(+0.14 mm,+0.6%)。标记点RMSE中位数约为20 mm,运行时间中位数为0.204至0.332 ms/帧,且稳定地只需2至3次迭代。结论:SmoCap提供了一个经过外部验证的、统一的耦合感知尺度-姿态框架,使外部一致的运动规范化在数据集规模上切实可行。

英文摘要

Objective: Stage-wise workflows that separate model scaling and inverse kinematics can induce morphology-posture compensation, resulting in anatomically inconsistent yet numerically acceptable solutions, especially in weakly observed directions. We present SmoCap, a leakage-resistant canonicalization framework that estimates morphology and posture jointly in each local trust-region quadratic program (QP) within a sparse control subspace. Methods: SmoCap solves a constrained trust-region QP with analytical proxy-mapped pose and scale Jacobians. The low-dimensional proxy map stabilizes weakly observed directions and drives coordinated structures. An optional pre-solve provides warm starts in difficult configurations. The framework is evaluated using cohort fluoroscopy knee motion, anthropometric ground truth, and extreme yoga sequences. Results: SmoCap achieved a knee flexion RMSE of 2.9 degrees against fluoroscopy, and a pooled anthropometric endpoint error of around 3%. In the leakage audit against segment-wise scaling, SmoCap also reduced marker RMSE, FE error, and anthropometric endpoint error. In the yoga ablation, proxy coupling preserved expressive and coordinated spine motion with a marginal increase in fitting error (+0.14 mm, +0.6%) against baseline models. Median marker RMSE was around 20 mm, and median runtime was 0.204-0.332 ms/frame, consistently achieved in 2-3 iterations. Conclusion: SmoCap provides an externally validated unified coupling-aware scale-pose framework, making externally consistent motion canonicalization practical at dataset scale.

2605.20821 2026-05-21 cs.CV cs.RO 版本更新

VSCD: Video-based Scene Change Detection in Unaligned Scenes

VSCD: 基于视频的非对齐场景变化检测

Jiae Yoon, Ue-Hwan Kim

Affiliations * Department of AI Convergence, Gwangju Institute of Science; GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science

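AI summary: This work proposes VSCD, a video-based change detection method for unaligned scenes that predicts a pixel-wise change mask for each query frame. A query-centric multi-reference model aligns candidate reference features to the query via local patch correspondence and fuses per-candidate change features before decoding a high-resolution mask, outperforming strong image- and video-based baselines.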

Comments 18 pages, 7 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

检测环境中变化对于长期自主性至关重要,但大多数变化检测设置假设固定视角、轻微错位或仅少数变化对象。我们引入视频基场景变化检测(VSCD),该方法在给定参考和查询RGB视频的情况下,为每个查询帧预测像素级变化掩码。这两个视频记录于不同时间,且相机运动不受约束,视频之间没有时间同步,许多对象实例可能出现或消失。为研究此设置,我们构建了一个包含超过110万帧的大型基准,这些帧标注了像素级变化掩码,并附有现实世界测试集以评估迁移至现实的性能。我们提出了一种以查询为中心的多参考模型,该模型从变化掩码监督中隐式学习时间匹配,通过局部补丁对应对齐候选参考特征,并在解码高分辨率掩码前使用帧级和补丁级置信度融合每个候选的变化特征。我们的方法在强大的图像和视频基基线中实现了最先进的性能,并通过在移动机器人上部署验证其现实影响,用于两个下游应用——视觉监控和对象增量学习。

英文摘要

Detecting what has changed in an environment is essential for long-term autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection (VSCD), which predicts a pixel-wise change mask for each query frame, given a reference and a query RGB video of the same indoor space recorded at different times under unconstrained camera motion. The two videos are not temporally synchronized, and many object instances may appear or disappear. To study this setting, we build a large-scale benchmark with over 1.1 million frames annotated with pixel-accurate change masks, together with a real-world test set for evaluating transfer beyond simulation. We propose a query-centric multi-reference model that learns temporal matching implicitly from change-mask supervision, aligns candidate reference features to the query via local patch correspondence, and fuses per-candidate change features using frame-level and patch-level confidence before decoding a high-resolution mask once per frame. Our approach achieves state-of-the-art performance against strong image- and video-based baselines, and we validate its real-world impact by deploying it on a mobile robot for two downstream applications -- visual surveillance and object incremental learning.

2605.20811 2026-05-21 cs.RO 版本更新

Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

Demo-JEPA: 一种用于单次跨体态模仿的联合嵌入预测架构

Jingyang He, Guangrun Li, Jieyu Zhang, Chengkai Hou, Zhengping Che, Shanghang Zhang

发表机构 * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(信息处理国家重点实验室,计算机学院,北京大学) University of Washington(华盛顿大学) Beijing Innovation Center of Humanoid Robotics(北京人形机器人创新中心)

AI总结 本文提出Demo-JEPA,一种跨体态模仿框架,通过解耦示范意图与体态特定的执行,利用共享预测表示空间将源视觉示范转换为目标兼容的未来潜在轨迹,使目标代理通过规划实现这些子目标,从而在异构体态间实现灵活的模仿。

详情
AI中文摘要

机器人模仿学习通常被视为复制演示动作,但动作本质上是体态特定的。当演示来自具有不同形态、运动学或动作空间的人类或机器人时,这种以动作为中心的观点需要共享动作空间、启发式重定向或大规模多体态联合训练。我们相反地将演示视为未来目标的隐含规范:目标代理应推断演示者试图实现的状态,而非演示者如何执行它。我们提出Demo-JEPA,一种跨体态模仿框架,通过基于JEPA的世界模型构建,将源视觉示范转换为目标兼容的未来潜在轨迹,这些轨迹在共享的预测表示空间中。目标代理随后利用这些潜在轨迹作为子目标,并通过其自身学习的向前动力学进行规划以实现它们。由于Demo-JEPA避免了动作层面的对应关系,仅需视觉示范和目标代理自身的交互经验,它支持在异构体态间灵活的模仿。在RLBench和真实世界操作任务中的实验表明,Demo-JEPA在专门的领域规划器中表现优异,并能泛化到未见的任务和体态配置,而此前的方法在此类情况下失效。

英文摘要

Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment-specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action-centric view requires shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. Built on a JEPA-based world model, Demo-JEPA translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo-JEPA avoids action-level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real-world manipulation tasks show that Demo-JEPA matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.

2605.20801 2026-05-21 cs.RO quant-ph 版本更新

Q-SpiRL: Quantum Spiking Reinforcement Learning for Adaptive Robot Navigation

Q-SpiRL:量子脉冲强化学习用于自适应机器人导航

Mohamed Khair Altrabulsi, Nouhaila Innan, Alberto Marchisio, Muhammad Kashif, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD)(eBRAIN实验室,工程系,纽约大学阿布扎比分校) Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute(量子与拓扑系统中心(CQTS),NYUAD研究机构)

AI总结 本文提出Q-SpiRL框架,结合量子增强的脉冲神经网络,实现了在动态环境中高效稳定的机器人导航,通过实验验证了其在任务完成、轨迹效率和运动平滑度之间的最佳平衡。

Comments 11 pages, 6 figures

详情
AI中文摘要

在动态环境中实现自适应机器人导航需要能够可靠到达目标并产生高效稳定轨迹的策略。本文提出了Q-SpiRL,一种用于障碍感知机器人导航的量子脉冲强化学习框架。该框架开发并评估了五个智能体家族:表格Q学习、经典MLP、经典SNN、量子增强MLP(QMLP)和量子增强脉冲神经网络(QSNN)。尽管所有模型均在统一的训练和评估管道下实现,但QSNN是重点研究的中央架构,因为它结合了基于脉冲的时间处理与变分量子特征变换。实验在三个逐渐增大尺寸的网格世界环境中进行,即20x20、30x30和40x40,包含静态和动态障碍。性能评估使用成功率、成功率加权路径长度、路径长度和转弯率,在确定性推理下进行。结果表明,QSNN在最具有挑战性的设置中实现了最强的整体权衡,达到99%的成功率,同时保持高路径效率。在IBM量子硬件上的执行进一步证明了所提出混合策略在真实设备条件下的可行性。

英文摘要

Adaptive robot navigation in dynamic environments requires policies that can reach the target reliably while producing efficient and stable trajectories. This paper presents Q-SpiRL, a quantum spiking reinforcement learning framework for obstacle-aware robot navigation. The framework develops and evaluates five agent families: tabular Q-learning, classical MLP, classical SNN, quantum-enhanced MLP (QMLP), and quantum-enhanced spiking neural network (QSNN). While all models are implemented under a unified training and evaluation pipeline, the QSNN is the central architecture of interest, as it combines spike-based temporal processing with variational quantum feature transformation. Experiments are conducted across three grid-world environments of increasing size, namely 20x20, 30x30, and 40x40, with both static and dynamic obstacles. Performance is assessed using success rate, success-weighted path length, path length, and turn rate under deterministic inference. Results show that QSNN achieves the strongest overall trade-off between task completion, trajectory efficiency, and motion smoothness, reaching up to 99% success rate while maintaining high path efficiency in the most challenging setting. Execution on IBM quantum hardware further demonstrates the feasibility of deploying the proposed hybrid policy under real-device conditions.
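To make one of the evaluation metrics concrete, the following is a minimal sketch of success-weighted path length. The abstract gives no formula, so this assumes the commonly used definition: per-episode success multiplied by the ratio of shortest to actual path length, averaged over episodes. The function name and the episode tuple layout are illustrative assumptions, not taken from the paper.

```python
def success_weighted_path_length(episodes):
    """Average of success * shortest / max(actual, shortest) over episodes.

    `episodes` is an iterable of (success, shortest_len, actual_len) tuples.
    This tuple layout is an assumption made for illustration.
    """
    total = 0.0
    count = 0
    for success, shortest_len, actual_len in episodes:
        count += 1
        if success:
            # A failed episode contributes zero regardless of its path length.
            total += shortest_len / max(actual_len, shortest_len)
    if count == 0:
        raise ValueError("need at least one episode")
    return total / count
```

Under this definition, an agent that always succeeds but takes paths twice as long as necessary scores 0.5, which is why the metric is usually reported alongside raw success rate.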

2605.20796 2026-05-21 cs.RO 版本更新

CMC-Opt: Constraint Manifold with Corners for Inequality-Constrained Optimization

CMC-Opt: 带角落的约束流形用于不等式约束优化

Yetong Zhang, Frank Dellaert

发表机构 * College of Computing(计算学院) Georgia Institute of Technology(佐治亚理工学院) Atlanta, USA(美国亚特兰大)

AI总结 本文提出了一种基于流形的框架,用于解决机器人中存在等式和不等式约束的优化问题。通过引入带角落的约束流形,将原问题直接转换为无约束优化问题,从而在约束状态空间上进行优化,并在大规模动力学规划问题中验证了该框架的有效性和鲁棒性。

详情
AI中文摘要

我们介绍了一种基于流形的框架,用于解决机器人中出现的具有等式和不等式约束的优化问题。我们的方法将原始问题直接转换为在约束状态空间上的无约束优化问题。为此,我们引入了“带角落的约束流形”来表示满足混合非线性等式和不等式约束的状态空间。我们进一步扩展了流形优化算法,使其能够在这一新的拓扑结构上运行。我们在大规模动力学规划问题的背景下展示了该框架的威力和鲁棒性,成功地生成了动态可行的轨迹,而标准方法则失败。

英文摘要

We introduce a manifold-based framework for addressing optimization problems with equality and inequality constraints found in robotics. Our approach transforms the original problem into an unconstrained optimization problem directly on the constrained state space. To achieve this, we introduce "constraint manifolds with corners" to represent the state space satisfying mixed nonlinear equality and inequality constraints. We further extend manifold optimization algorithms to operate on this new topological structure. We demonstrate the power and robustness of our framework in the context of a large-scale kinodynamic planning problem, successfully generating dynamically feasible trajectories where standard methods fail.

2605.20774 2026-05-21 cs.RO 版本更新

VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models

VLA-REPLICA: 一种低成本、可重复的现实世界评估视觉-语言-动作模型的基准

Alex S. Huang, Jiahui Zhang, Shiqing Tang, Yu Xiang

发表机构 * Intelligent Robotics and Vision Lab, University of Texas at Dallas(德克萨斯大学达拉斯分校智能机器人与视觉实验室) Allen High School(艾伦高中)

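AI summary: This paper presents VLA-REPLICA, a low-cost, reproducible benchmark for real-world evaluation of vision-language-action models. Built from off-the-shelf components, it provides a consistent environment for policy evaluation and includes a diverse suite of manipulation tasks plus a small-scale demonstration dataset for target-domain adaptation.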

详情
AI中文摘要

视觉-语言-动作(VLA)模型在通用目的机器人操作中显示出强大的潜力,但其现实世界评估仍受到缺乏可访问、可重复和一致的基准的限制。模拟基准无法捕捉现实世界的复杂性,而现有的现实世界基准通常需要昂贵的硬件、集中评估或任务多样性有限。我们介绍了VLA-REPLICA,一种低成本、易于重复的现实世界评估VLA模型的基准。该系统由现成组件构建,可以快速组装并在不同实验室中复制,为全球各地的政策评估提供一致的环境。VLA-REPLICA包含多样化的操作任务和一个小规模的演示数据集用于目标域适应,并为在分布和出分布设置中的现实世界评估提供了协议。对模仿学习和最先进的VLA模型的实验揭示了模型的优势和局限性,而不同独立构建设置中的一致结果证明了我们基准的可重复性。

英文摘要

Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware, centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.

2605.20758 2026-05-21 cs.AI cs.CV cs.LG cs.RO 版本更新

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

面向组合奖励的冲突感知加法引导:流模型中的对抗性生成

Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh

发表机构 * Smart Systems Institute, National University of Singapore, Singapore(新加坡国立大学智能系统研究所) Faculty of Computing, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机学院) School of Computing, National University of Singapore, Singapore(新加坡国立大学计算机学院)

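AI summary: This paper proposes Conflict-Aware Additive Guidance, an inference-time guidance method for flow models under compositional rewards. It rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts, improving generation fidelity.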

Comments Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

在推理时间进行引导采样可以无需微调就通过解释生成过程为可控轨迹来驱动最先进的扩散和流模型。这提供了一种简单灵活的方式,将外部约束(如成本函数或预训练验证器)注入受控生成中。然而,现有方法在同时组合多个约束时往往失效,导致偏离真实数据曼福德。在本工作中,我们识别出这种离曼福德漂移的根本原因,并发现近似误差随着梯度不一致程度严重增加。基于这些发现,我们提出了一种轻量且可学习的方法,即冲突感知加法引导(g^car),该方法通过动态检测和解决梯度冲突来主动纠正离曼福德漂移。我们验证了g^car在多样化的领域中的有效性,从合成数据集和图像编辑到生成决策规划与控制。我们的结果表明,g^car有效纠正了离曼福德漂移,在生成保真度方面超越了基线方法,同时使用轻量计算。代码可在https://github.com/yuxuehui/CAR-guidance获取。

英文摘要

Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.
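The abstract does not say how $g^\text{car}$ detects or resolves gradient conflicts, and describes it as learnable. Purely as an illustration of what a gradient conflict is, the sketch below swaps in a different, well-known heuristic (PCGrad-style projection): two guidance gradients are treated as conflicting when their dot product is negative, and each is then projected onto the plane orthogonal to the other before they are summed. All function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def project_out(g, other):
    """Remove from g its component along `other` (which must be nonzero)."""
    scale = dot(g, other) / dot(other, other)
    return [x - scale * y for x, y in zip(g, other)]


def combine_guidance(g1, g2):
    """Sum two guidance gradients, deconflicting them first if they disagree.

    This is a PCGrad-style stand-in, not the method proposed in the paper.
    """
    if dot(g1, g2) < 0.0:
        # Both projections use the original vectors: the right-hand side
        # is evaluated fully before the names are rebound.
        g1, g2 = project_out(g1, g2), project_out(g2, g1)
    return [x + y for x, y in zip(g1, g2)]
```

A plain sum of the two gradients would let the opposing components partly cancel; the projection keeps only the parts of each gradient that do not work against the other.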

2605.20666 2026-05-21 cs.RO 版本更新

A Semantic and Occlusion-Aware GM-PHD Filter

一种语义和遮挡感知的GM-PHD滤波器

Jovan Menezes, Mark Campbell

发表机构 * Sibley School of Mechanical and Aerospace Engineering, Cornell University(康奈尔大学机械与航空航天工程系)

AI summary: This paper proposes a Semantic-Occlusion Aware (S-OA) birth model for the GM-PHD filter that accounts for occluded regions and semantic information about the environment. In Monte Carlo simulations and KITTI experiments it reduces track initialization delay in occlusion-heavy settings, matching or outperforming the strongest baseline in approximately 70% of cases.

Comments Accepted at ICRA 2026

详情
AI中文摘要

本文提出了一种新的出生模型,该模型包含从深度学习中提取的语义信息,以创建一种遮挡感知的高斯混合概率假说密度(GM-PHD)滤波器。与以往依赖简单或统一假设的方法不同,所提出的语义-遮挡感知(S-OA)出生模型通过显式考虑遮挡区域并利用环境的语义信息来定义初始化项。这使滤波器能够准确表示新物体更可能出现的位置,从而在复杂和高密度的驾驶场景中提高跟踪性能。该方法通过蒙特卡洛模拟和KITTI数据集的实验进行评估。性能通过测量首次检测与跟踪初始化之间的延迟、平均绝对数量误差以及最优子模式分配(OSPA)度量来评估。结果表明,S-OA出生模型在遮挡密集的环境中减少了初始化延迟,在约70%的情况下匹配或优于最强基线。还提供了出生模型权重的敏感性分析。总体而言,研究结果强调了在自动驾驶中将遮挡推理和语义先验整合到贝叶斯跟踪框架中的优势。

英文摘要

This paper proposes a new birth model including semantic information derived from deep learning to create an occlusion-aware Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter. Unlike prior approaches that rely on simplistic or uniform assumptions, the proposed Semantic-Occlusion Aware (S-OA) birth model defines initialization terms by explicitly considering regions of occlusion and by leveraging semantic information about the environment. This enables the filter to accurately represent where new objects are more likely to appear, thereby improving tracking performance in complex and high-density driving scenarios. The method is evaluated through Monte Carlo simulations and experiments on the KITTI dataset. Performance is assessed by measuring the latency between first detection and track initiation, along with the mean absolute cardinality error and the Optimal Subpattern Assignment (OSPA) metric. Results demonstrate that the S-OA birth model reduces initialization delay in occlusion-heavy settings, matching or outperforming the strongest baseline in approximately 70% of cases. A sensitivity analysis of birth model weights is also provided. Overall, the findings underscore the benefits of integrating occlusion reasoning and semantic priors into Bayesian tracking frameworks for autonomous driving.

2605.20648 2026-05-21 cs.RO cs.AI 版本更新

Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition

联合学习谓词和动作使零样本技能组合成为可能

Benedict Quartey, Sebastian Castro, Eric Rosen, Wil Thomason, George Konidaris, Stefanie Tellex

发表机构 * Brown University(布朗大学) Robotics & AI Institute(机器人与人工智能研究所)

AI总结 本文提出了一种联合学习谓词和动作的技能方法,通过闭合回路的视觉-运动策略,使机器人能够在不重新训练的情况下实现零样本技能组合。

详情
AI中文摘要

学习示范(LfD)使机器人能够从专家示例中学习复杂行为,但现有方法往往无法在不重新训练的情况下泛化到新组合的已知技能。现代生成性策略仅建模动作轨迹分布,因此无法推断出所需的符号结果。我们提出技能应联合建模动作轨迹和它们诱导的符号结果。为解决这一差距,我们引入了谓词动作技能(PACTS),一种闭合回路的视觉-运动策略,将技能建模为动作和谓词信念轨迹的联合生成过程,在单一模型中产生连贯的动作-结果滚动。联合生成动作和谓词使PACTS能够学习改进动作生成和谓词分类的内部表示。此外,我们通过利用PACTS的在线谓词预测作为符号接口来序列化和监控执行,展示了学习技能的零样本组合。项目网站:https://planpacts.github.io/

英文摘要

Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: https://planpacts.github.io/

2605.20644 2026-05-21 cs.LG cs.AI cs.RO 版本更新

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

制造设计:一种集成制造知识的强化学习框架用于航空发动机自由形管道路由

Caicheng Wang, Zili Wang, Shuyou Zhang, Yongzhe Xiang, Zheyi Li, Liangyou Li, Jianrong Tan

发表机构 * State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University(浙江大学流体动力与机电系统国家重点实验室) Engineering Research Center for Design Engineering and Digital Twin of Zhejiang Province, Zhejiang University(浙江省设计工程与数字孪生工程研究中心) Zhejiang Changxing Heliang Intelligent Equipment Co., Ltd.(浙江长兴鹤浪智能装备有限公司)

AI总结 本文提出了一种集成制造知识的强化学习框架,用于航空发动机中自由形管道路由优化,通过将制造知识作为约束条件,提高了管道路径的可制造性和几何平滑度。

详情
AI中文摘要

制造设计在先进航空发动机开发中起着关键作用,其中复杂组件需要仔细考虑可制造性。然而,当前的管道路由实践仍然很大程度上与下游制造脱节,导致需要大量劳动和试错迭代以获得可制造的设计。为了解决这个问题,本研究提出了一种基于弗伦塞尔的管道路由优化(FPRO)框架,这是一种用于航空发动机自由形管道设计的集成制造知识的强化学习方法。FPRO将路由问题表述为弗伦塞尔框架中的边界值问题。在此框架中,管道路径由曲率和扭率剖面表示,这些剖面通过三次赫尔迈特插值生成。为了将设计与制造相结合,领域特定的制造知识被嵌入到曲率和扭率的允许范围的约束中。路径优化使用了具有随机探索和阶段引导奖励机制的近端策略优化算法。统一的映射公式然后将优化的路径转换为弯曲模具的运动轨迹,使六轴自由弯曲机能够直接制造。实验结果表明,FPRO能够持续生成无碰撞、可制造的路径,其几何剖面比基于笛卡尔的方法更平滑。它还实现了更快的收敛速度和在终端对齐、路径长度、障碍物避让和可制造性方面的优越性能,优于最先进的强化学习基线。现实验证确认了制造管道与数字设计之间几何的紧密对应关系,验证了FPRO的实践可行性。

英文摘要

Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current practices in pipe routing remain largely decoupled from downstream manufacturing, leading to labor-intensive, trial-and-error iterations to achieve manufacturable designs. To address this problem, this study proposes the Frenet-based pipe routing optimization (FPRO) framework, a manufacturability knowledge-integrated reinforcement learning approach for free-form pipe design in aeroengines. FPRO formulates the routing problem as a boundary value problem in the Frenet frame. In this framework, the pipe path is represented by curvature and torsion profiles, which are generated using cubic Hermite interpolation. To integrate design and manufacturing, domain-specific manufacturing knowledge is embedded as constraints on the permissible ranges of curvature and torsion. The path optimization is performed using the proximal policy optimization algorithm with stochastic exploration and a stage-guided reward mechanism. A unified mapping formulation then translates the optimized path into motion trajectories for the bending die, enabling direct fabrication on a six-axis free-bending machine. Experimental results demonstrate that FPRO consistently generates collision-free, manufacturable paths with smoother geometric profiles compared to Cartesian-based methods. It also achieves faster convergence and superior performance in terminal alignment, path length, obstacle avoidance, and manufacturability compared to state-of-the-art reinforcement learning baselines. Real-world validation confirms the close geometric correspondence between the manufactured pipe and its digital design, validating the practical feasibility of FPRO.
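The curvature and torsion profiles mentioned above are built with cubic Hermite interpolation and bounded by manufacturability limits. A minimal sketch of that idea for a single segment follows, assuming a unit parameter interval and a symmetric curvature bound. The names `curvature_profile` and `kappa_max`, and the sampling scheme, are illustrative assumptions rather than FPRO's actual formulation.

```python
def cubic_hermite(t, p0, m0, p1, m1):
    """Cubic Hermite interpolation on t in [0, 1].

    p0, p1 are the endpoint values; m0, m1 are the endpoint derivatives.
    """
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1
    h10 = t3 - 2 * t2 + t
    h01 = -2 * t3 + 3 * t2
    h11 = t3 - t2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1


def curvature_profile(p0, m0, p1, m1, kappa_max, samples=11):
    """Sample one curvature segment and check it against +/- kappa_max.

    The symmetric bound is a hypothetical stand-in for the paper's
    manufacturing constraints on the permissible curvature range.
    """
    values = [cubic_hermite(i / (samples - 1), p0, m0, p1, m1)
              for i in range(samples)]
    feasible = all(abs(k) <= kappa_max for k in values)
    return values, feasible
```

With zero endpoint derivatives the segment reduces to a smooth ramp between the two endpoint curvatures, so the bound is violated only if an endpoint itself exceeds it.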

2605.20625 2026-05-21 eess.SY cs.MA cs.RO cs.SY 版本更新

Time-To-Reach Separation and Safety Filtering for Safe, Fair, and Efficient Multi-Agent Coordination

时间到达分离与安全过滤用于安全、公平和高效的多智能体协调

Matthew Low, Jasmine Jerry Aloor, Victoria Marie Tuck, Pierluigi Nuzzo, Jason J. Choi

发表机构 * Department of Electrical Engineering and Computer Sciences, University of California, Berkeley(加州大学伯克利分校电子工程与计算机科学系) Department of Aeronautics and Astronautics, Massachusetts Institute of Technology(麻省理工学院航空与航天系) GRASP Laboratory, University of Pennsylvania(宾夕法尼亚大学GRASP实验室) Department of Electrical and Computer Engineering, University of California, Los Angeles(加州大学洛杉矶分校电子与计算机工程系)

AI总结 本文提出了一种多智能体协调框架,利用最小时间到达(TTR)作为统一指标用于优先级分配、时间分离和安全过滤,以协调多个空中车辆进入空中走廊并保持车辆间安全分离。

Comments 9 pages, 3 figures. Extended version (including appendix) of a paper submitted to the 65th IEEE Conf. on Decision and Control (2026)

详情
AI中文摘要

先进空中交通(AAM)操作预计会显著增加城市空域的空中交通,需要自主交通管理系统确保在高度拥堵环境中实现无碰撞操作。本文提出了一种多智能体协调框架,利用最小时间到达(TTR)作为统一指标用于优先级分配、时间分离和安全过滤。我们专注于协调多个空中车辆进入空中走廊的问题,同时保持车辆间的安全分离。车辆根据TTR分配到达一致的优先级,目标TTR值用于强制时间间隔,从而诱导空间分离。基于Hamilton-Jacobi可达性值函数的优先级一致的安全过滤层确保碰撞避免,同时最小化对参考引导的修改。在高度拥堵的走廊合并场景中的仿真结果表明,所提出的方法在安全、公平和效率方面优于时间最优引导和无优先级安全过滤。

英文摘要

Advanced Air Mobility (AAM) operations are expected to significantly increase aerial traffic in urban airspace, requiring autonomous traffic management systems to ensure collision-free operations in highly congested environments. In this paper, we propose a multi-agent coordination framework that uses minimum time-to-reach (TTR) as a unifying metric for priority assignment, temporal separation, and safety filtering. We focus on the problem of coordinating multiple aerial vehicles merging into an air corridor while maintaining safe separation between vehicles. Vehicles are assigned arrival-consistent priority based on TTR, and target TTR values are used to enforce temporal spacing that induces spatial separation. A priority-consistent safety filtering layer based on Hamilton-Jacobi reachability value functions ensures collision avoidance while minimally modifying the reference guidance. Simulation results in a highly congested corridor merging scenario show that the proposed method improves safety, fairness, and efficiency compared to time-optimal guidance and priority-agnostic safety filtering.
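As a rough illustration of how TTR could drive both priority and temporal spacing, the sketch below orders vehicles by their minimum time-to-reach and pushes each target TTR back until it trails the previous vehicle by at least a fixed gap. This greedy rule, the `min_gap` parameter, and the function name are assumptions made for illustration; the abstract does not specify the paper's assignment procedure.

```python
def assign_target_ttr(ttr_by_vehicle, min_gap):
    """Return (priority order, target TTR per vehicle).

    Vehicles with smaller minimum TTR get higher priority; ties are
    broken by vehicle id so the ordering is deterministic.
    """
    order = sorted(ttr_by_vehicle, key=lambda v: (ttr_by_vehicle[v], v))
    targets = {}
    previous = None
    for vehicle in order:
        target = ttr_by_vehicle[vehicle]
        if previous is not None:
            # Delay this vehicle only as much as needed to keep the gap.
            target = max(target, previous + min_gap)
        targets[vehicle] = target
        previous = target
    return order, targets
```

A target is never earlier than the vehicle's own minimum TTR, so every assigned arrival time remains reachable in this simplified model.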

2605.20607 2026-05-21 cs.LG cs.CV cs.RO 版本更新

Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System

基于视觉着陆系统的学习保证机制解释

Romeo Valentin, Olivia Beyer Bruvik, Marc R. Schlichting, Mykel J. Kochenderfer

发表机构 * Stanford Intelligent Systems Laboratory, Stanford University, Stanford, CA, USA(斯坦福智能系统实验室,斯坦福大学,斯坦福,CA,美国)

AI总结 本文提出了一种基于视觉着陆系统的学习保证机制,通过分离内容与风格来构建可解释的模型,从而提供可靠的证据支持,同时引入了新的运行时保证方法来监控模型的情境表示。

Comments 10 pages, 4 figures

详情
AI中文摘要

EASA的学习保证指导要求数据驱动的航空系统构建并监控自身的情境表示,但对神经网络而言,提供此类证据的技术手段仍是一个开放问题。我们针对基于视觉的飞机着陆系统填补了这一空白:我们提出,一个可保证的模型至少必须展示其情境表示中能够分离内容与风格。展示模型的预测主要依赖于内容表示组件,从而得到一个具体的保证路径。为了在具体模型上展示这个保证路径,我们训练了一个用于跑道关键点回归的视觉Transformer模型,在LARDv2数据集上进行训练。该模型作为我们保证演示的主体,产生每块嵌入,我们通过K-SVD稀疏字典学习将其分解为可解释的原子。定性可视化确认了内容原子跟踪任务相关的跑道结构,风格原子跟踪领域特定的外观,且回归头几乎将所有线性权重放在内容原子上。我们进一步基于内容/风格分离并定义了模型外范围(OOMS)检测,一种新的运行时保证方法,直接监控模型的情境表示。OOMS监控与操作设计领域和输出空间的分布外监控互补,并满足最近EASA指导的明确要求。通过在测试时间和运行时直接分析模型的情境表示,本工作提供了EASA学习保证指导所要求的第一个具体的表示层面证据,并指出了机制解释作为未来航空安全案例的实用构建块。

英文摘要

EASA's learning-assurance guidance requires data-driven aviation systems to build and monitor their own situation representation, yet for neural networks the technical means to provide such evidence remain an open problem. We address this gap for a vision-based aircraft landing system: we propose that a minimally assurable model must at least be shown to separate content from style in its own situation representation. Showing that the model's predictions then rely largely on the contentful representation components leads to a concrete assurance path. To demonstrate this assurance path on a concrete model we train a vision transformer model for runway keypoint regression on the LARDv2 dataset. The model, which acts as the subject for our assurance demonstration, produces per-patch embeddings that we decompose into interpretable atoms via K-SVD sparse dictionary learning. A qualitative visualization confirms that contentful atoms track task-relevant runway structure and stylistic atoms track domain-specific appearance, and the regression head is shown to place almost all of its linear weight on contentful atoms. We further build on the content/style separation and define out-of-model-scope (OOMS) detection, a novel runtime assurance approach directly monitoring the model's situation representation. OOMS monitoring is complementary to operational design domain and output-space out-of-distribution monitoring and addresses concrete requirements of the recent EASA guidance. By directly analyzing a model's situation representation both at test time and runtime, this work delivers the first concrete piece of the representation-level evidence that EASA learning-assurance guidance demands, and points to mechanistic interpretability as a practical building block of future aviation safety cases.

2605.20595 2026-05-21 cs.RO cs.MA cs.NI 版本更新

Intent-First Aerial V2V for Tactical Coordination and Separation: Protocol and Performance Under Density and Disturbance

意图优先的空中车对车通信用于战术协调与分离:协议与性能在密度和干扰下的表现

Mehrnaz Sabet

发表机构 * Cornell University(康奈尔大学)

AI总结 本研究提出了一种意图优先的空中车对车(V2V)协议,用于密集的无人机交通管理(UTM)操作,通过部署的邻近通信机制提供新鲜可信的信息以实现局部协调,该协议结合刷新的状态和意图信标,用于局部感知、协同感知和降级模式评估,并通过事件触发的消息进行让行、排序、释放和应急协调。

Comments Submitted to IEEE Transactions on Intelligent Transportation Systems

详情
AI中文摘要

密集的低空航空操作需要的不仅仅是预先飞行路线协调和最后手段的碰撞避免。一旦飞机进入空中,扰动可以在战略重新授权能够吸收的时间尺度以下出现,而碰撞避免太晚且具有破坏性,无法作为常规交通管理。虽然战术分离被认可为中间层,但实现它需要一个可部署的邻近通信机制,该机制能够为本地协调提供新鲜、可信的信息。本文提出了迄今为止我们所知的第一个控制器耦合的特征化,即一个全空中、 sidelink 类型、意图优先的车辆对车辆(V2V)战术邻近交换堆栈,用于密集的无人机交通管理(UTM)操作。与仅意识广播不同,所提出的交换结合了刷新的状态和意图信标,用于局部感知、协同感知和降级模式评估,并通过事件触发的消息进行让行、排序、释放和应急协调。我们通过使用 sidelink 类型的 C-V2X 模块实现并评估了该模型,这些模块具有认证的 freshness 检查。评估使用了由实时、实地锚定的基础设施支持的场景驱动、高流量压力测试。结果表明,V2V 减少了过时信念的分歧,通过协同感知保持可观测性,拒绝无效的战术信息,抑制虚假的局部推断,并结构化共享资源协调。所实现的堆栈在较低到中等密度范围内提供了一个可行的通信层用于战术分离,但随着密度、干扰和复杂性的增加,会转向受保护的回退模式。这些发现将意图优先的空中 V2V 定位为在扰动驱动的都市空域中扩展战术协调的有界促进者。

英文摘要

Dense low-altitude aerial operations require more than pre-flight route coordination and last-resort collision avoidance. Once aircraft are airborne, disturbances can emerge on timescales shorter than strategic reauthorization can absorb, while collision avoidance is too late and disruptive to serve as routine traffic management. Although tactical separation is recognized as the intermediate layer, realizing it at scale requires a deployable neighborhood communication mechanism that provides fresh, trusted information for local coordination. This paper presents what is, to our knowledge, the first controller-coupled characterization of an all-airborne, sidelink-class, intent-first vehicle-to-vehicle (V2V) tactical neighborhood exchange stack for dense Unmanned Aircraft System Traffic Management (UTM) operations. Unlike awareness-only broadcast, the proposed exchange combines refreshed state and intent beacons for local awareness, cooperative perception, and degraded-mode assessment with event-triggered messages for yielding, sequencing, release, and contingency coordination. We implement and evaluate this model on an all-airborne V2V stack using sidelink-class C-V2X modules with authenticated freshness checks. Evaluation uses a scenario-driven, high-volume stress campaign supported by real-time, field-anchored infrastructure. Results show that V2V reduces stale-belief divergence, preserves observability through cooperative perception, rejects invalid tactical messages, suppresses false local inference, and structures shared-resource coordination. The implemented stack provides a viable communication layer for tactical separation in lower-to-moderate regimes, but transitions toward guarded fallback as density, impairment, and complexity increase. These findings position intent-first aerial V2V as a bounded enabler for scaling tactical coordination in disturbance-driven urban airspace.

2605.20566 2026-05-21 cs.RO 版本更新

Conflict-Aware Active Perception and Control in 3D Gaussian Splatting Fields via Control Barrier Functions

基于控制屏障函数的3D高斯点云场中冲突感知与控制

Amirhossein Mollaei Khass, Athanasios Cosse, Vivek Pandey, Nader Motee

Affiliations * Department of Mechanical Engineering and Mechanics, Lehigh University

AI总结 本文提出了一种基于控制屏障函数的冲突感知与控制框架,用于在3D高斯点云场环境中安全导航并获取信息以减少地图不确定性,通过统一的安全关键和感知感知二次规划程序解决安全与感知目标的冲突。

Comments Project website: https://sircesoc.github.io/Conflict_Aware_Active_Perception/

详情
AI中文摘要

在不确定环境中主动感知要求机器人在安全导航的同时获取信息以减少地图不确定性。这些目标本质上存在冲突,因为信息丰富的视角通常位于不确定区域,具有更高的碰撞风险。为了解决这一挑战,我们开发了一种冲突感知和控制框架,用于在由3D高斯点云(3DGS)表示的环境中运行的机器人系统。通过从平均条件风险AV@R碰撞风险度量中导出的控制屏障函数(CBF)来确保安全,该度量考虑了几何不确定性和保证了安全集的前向不变性。为了提高感知,我们提出了一种风险感知的预期信息增益(EIG)公式,用于选择下一个最佳视角,并引入了将摄像机方向对齐局部信息上升方向的感知屏障函数。为了获得这些冲突的安全和感知目标的可处理公式,我们提出了一种统一的安全关键和感知感知二次规划程序,通过松弛变量放松感知约束。仿真结果表明,所提出的方法在安全性和信息获取方面均优于现有基于3DGS的方法。

英文摘要

Active perception in uncertain environments requires robots to navigate safely while acquiring informative observations to reduce map uncertainty. These objectives inherently conflict, as informative viewpoints often lie near uncertain regions with higher collision risk. To address this challenge, we develop a conflict-aware active perception and control framework for robotic systems operating in environments represented by 3D Gaussian Splatting (3DGS). Safety is enforced using a Control Barrier Function (CBF) derived from an Average Value-at-Risk (AV@R) collision-risk metric that accounts for geometric uncertainty and guarantees forward invariance of a safe set. To improve perception, we propose a risk-aware Expected Information Gain (EIG) formulation for selecting the next-best-view and introduce perception barrier functions that align the camera orientation with the local information-ascent direction. To obtain a tractable formulation for these conflicting safety and perception objectives, we propose a unified safety-critical, perception-aware quadratic program that enforces safety as a hard constraint while relaxing perception constraints through slack variables. Simulation results demonstrate that the proposed method improves both safety and information acquisition compared to existing 3DGS-based approaches.
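The hard safety constraint in such a quadratic program can be illustrated in isolation. For a single affine constraint a . u >= b, the control closest to a reference input has a closed form, sketched below. In a standard continuous-time CBF setting, a and b would come from the Lie derivatives of the barrier function; here they are treated as given numbers. This covers only the hard-constraint part: the slack-relaxed perception constraints and the AV@R-based barrier described in the abstract are not modelled, and the function name is hypothetical.

```python
def safety_filter(u_ref, a, b):
    """Return the control closest to u_ref (Euclidean norm) with a . u >= b.

    Solves min ||u - u_ref||^2 subject to one affine constraint, which
    amounts to projecting u_ref onto a half-space.
    """
    a_dot_u = sum(x * y for x, y in zip(a, u_ref))
    a_norm_sq = sum(x * x for x in a)
    if a_norm_sq == 0.0:
        raise ValueError("constraint normal must be nonzero")
    violation = b - a_dot_u
    if violation <= 0.0:
        # The reference input is already safe, so it is left unchanged.
        return list(u_ref)
    step = violation / a_norm_sq
    return [u + step * x for u, x in zip(u_ref, a)]
```

The filter modifies the reference only along the constraint normal, and only when the reference would violate the constraint, which is the sense in which such filters are minimally invasive.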

2605.20561 2026-05-21 cs.RO 版本更新

Fault-Tolerant, Rigidity-Preserving Control of Inflatable Truss Robots

容错、保持刚性的可膨胀桁架机器人控制

James Wade, Isaac Weaver, Mihai Stanciu, Nathan Usevitch

发表机构 * Ira A. Fulton School of Engineering, Mechanical Engineering Department, Brigham Young University(杨百翰大学艾拉·A·富尔顿工程学院机械工程系)

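AI summary: This paper presents a fault-tolerant control framework for inflatable robotic trusses that keeps them functional despite motor failures. It makes three contributions: extending the kinematic optimization to handle arbitrary combinations of motor failures, introducing discrete-time control barrier function constraints that guarantee structural rigidity, and implementing closed-loop position control with onboard encoder feedback and a forward-kinematics-based state estimator.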

详情
AI中文摘要

等周机器人桁架可以适应不同的任务和环境,因为它们具有高强重比,能够大幅改变自身形状,并可以重新配置成多种不同形状。然而,操作环境中电机故障如果未得到妥善处理,会严重限制操作能力。本文提出了一种容错控制框架,用于可膨胀机器人桁架,能够在电机故障的情况下保持功能,通过三个关键贡献。首先,我们扩展运动学优化以处理任意组合的电机故障,通过施加等式约束确保故障执行器不被使用。其次,我们引入离散时间控制屏障函数(DTCBF)约束,数学上保证结构刚性的同时最大化工作空间利用率,这是在离散时间控制下可靠操作桁架机器人的重要要求。第三,我们利用 onboard 编码器反馈和基于正向运动学的状态估计器实现闭环位置控制,在存在干扰的情况下提高位置精度。我们通过模拟和硬件实验验证了我们的方法,针对一个具有6个执行器的2D等周桁架测试平台。对于具有6个执行器的2D配置,我们展示了在单个电机故障下工作空间保留超过69%,并利用闭环控制实现了跟踪精度的25%提升。这些结果为在退化驱动条件下更鲁棒和坚韧的等周桁架机器人奠定了基础。

英文摘要

Isoperimetric robotic trusses can adapt to different tasks and environments because they have a high strength-to-weight ratio, can change their own shape dramatically, and can be reconfigured into a variety of different shapes. However, motor failures in operational environments can severely limit operational capabilities if not properly addressed. This paper presents a fault-tolerant control framework for an inflatable robotic truss that maintains functionality despite motor failures, shown through three key contributions. First, we extend the kinematic optimization to handle arbitrary combinations of motor failures by imposing equality constraints to ensure failed actuators are not used. Second, we introduce discrete-time control barrier function (DTCBF) constraints that mathematically guarantee structural rigidity while maximizing workspace utilization, a critical requirement for reliable operation of truss robots under discrete-time control. Third, we implement closed-loop position control using onboard encoder feedback and a forward kinematics-based state estimator, improving positional accuracy in the presence of disturbances. We validate our approach through simulation and hardware experiments on a 2D isoperimetric truss testbed. For a 2D configuration with 6 actuators, we demonstrate >69% workspace preservation under single-motor failures and a >25% improvement in tracking accuracy with closed-loop control. These results establish a foundation for more robust and resilient isoperimetric truss robots operating under degraded actuation.
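
The discrete-time control barrier function idea in the second contribution can be sketched with the standard DTCBF condition h(x_{k+1}) - h(x_k) >= -gamma * h(x_k) for 0 < gamma <= 1. Everything below is a toy under stated assumptions: the scalar state, the barrier `h`, the `step` model, and the input-shrinking loop are hypothetical stand-ins, not the paper's rigidity measure or its optimization.

```python
def dtcbf_ok(h_now, h_next, gamma):
    """Discrete-time CBF condition: h(x_{k+1}) - h(x_k) >= -gamma * h(x_k)."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return h_next - h_now >= -gamma * h_now

def safe_step(x, u_desired, h, step, gamma=0.5, shrink=0.5, max_iter=20):
    """Scale the desired input back until the DTCBF condition holds.
    Falls back to a zero input, which keeps h unchanged in this toy model."""
    u = u_desired
    for _ in range(max_iter):
        if dtcbf_ok(h(x), h(step(x, u)), gamma):
            return u
        u *= shrink
    return 0.0

# Toy system: the state is safe while x <= 1, and inputs add to the state.
h = lambda x: 1.0 - x
step = lambda x, u: x + u
print(safe_step(0.0, 2.0, h, step))  # → 0.5
```

The condition lets the barrier value shrink by at most a fraction gamma per step, so a state that starts with h >= 0 stays there; a real controller would enforce it as a constraint in the optimization rather than by shrinking the input.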

2605.20551 2026-05-21 cs.CV cs.AI cs.RO 版本更新

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

更快或更强:通过加权聚合和标记剪枝实现灵活的视觉位置识别

Zichao Zeng, June Moh Goo, Junwei Zheng, Weijia Fan, Jiaming Zhang, Rainer Stiefelhagen, Jan Boehm

发表机构 * University College London(伦敦大学学院) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Hunan University(湖南大学) Shenzhen University(深圳大学)

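AI summary: This paper proposes the Weighted Aggregated Descriptor (WeiAD) and the token pruning framework WeiToP to improve the accuracy and efficiency of visual place recognition, allowing the accuracy-efficiency trade-off of feature extraction to be adjusted at inference time.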

详情
AI中文摘要

视觉位置识别(VPR)旨在将查询图像匹配到大规模数据库中相同地点的参考图像。最近最先进的方法采用视觉Transformer(ViTs)作为基础模型,提取对视角、光照和季节变化具有鲁棒性的补丁级特征,然后聚合为紧凑的全局描述符进行检索。大多数现有聚合方法将补丁标记均匀地池化到学习的簇中,尽管不同簇往往编码不同的空间或语义模式,并对VPR性能贡献不均。为了解决这一限制,我们提出了加权聚合描述符(WeiAD),在聚合过程中分配簇的权重,产生更具判别性的全局表示。除了准确性之外,检索延迟是大规模部署和资源受限边缘设备的关键关注点。先前的工作主要通过压缩全局描述符来减少延迟,而忽略了特征提取的成本,这在基于ViT的基础模型中变得更加严重。因此,我们引入了面向VPR的标记剪枝框架WeiToP,通过自蒸馏减少特征提取成本,其中聚合诱导的标记重要性监督一个轻量级剪枝模块,附加到早期Transformer层上,使推理时能够进行标记剪枝。在单次联合训练阶段后,WeiToP能够在推理时实现插拔式的标记剪枝,允许在不额外训练的情况下灵活地控制精度-效率权衡。此外,WeiToP在现有针对通用视觉任务的标记剪枝方法上表现更优。

英文摘要

Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.
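
Inference-time token pruning with an adjustable keep ratio can be sketched as a top-k selection over per-token importance scores. This is a minimal sketch under assumptions: the scores are taken as given (in the paper they come from a learned pruning module), and the function name `prune_tokens` and the example values are hypothetical.

```python
import math

def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in (0, 1]")
    n_keep = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore positional order
    return [tokens[i] for i in kept]

patches = ["a", "b", "c", "d"]        # stand-ins for patch tokens
importance = [0.1, 0.9, 0.4, 0.7]     # hypothetical importance scores
print(prune_tokens(patches, importance, 0.5))  # → ['b', 'd']
```

Because `keep_ratio` is only an argument at call time, the same trained model can be run at several operating points, which is the kind of on-demand accuracy-efficiency control the abstract describes.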

2605.20544 2026-05-21 cs.RO cs.CV 版本更新

The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

顺从综合征:具身机器人代理中的退避基准测试

Doguhan Yeke, Elif Su Temirel, Ananth Shreekumar, Brandon Lee, Dongyan Xu, Z Berkay Celik

发表机构 * Purdue University(普渡大学) Bilkent University(比尔肯特大学)

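AI summary: This paper presents RoboAbstention, an abstention benchmarking framework for embodied robotic agents. It generates abstention instructions grounded in images from five robotics datasets, evaluates several frontier VLMs on abstention, and explores methods for improving abstention performance.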

详情
AI中文摘要

视觉语言模型(VLMs)被用作具身代理的高层规划器,将自然语言指令和视觉观察转化为行动计划。尽管先前的工作研究了LLMs中的退避行为,但现有的基准测试大多仅限于文本,无法捕捉到具身机器人环境中的感知基础和物理约束。在这样的环境中,退避需要识别指令模糊、物理不可行、基于错误前提或在给定可用感觉模态和上下文下无法解决的情况。为了解决这一差距,我们引入了一个分类法来分类具身机器人中的退避行为,并提出了RoboAbstention,一个可扩展且可审计的框架,用于生成基于五个机器人数据集收集的图像的退避指令。RoboAbstention通过三个阶段的流程实现该分类法:(1)结构化的视觉基础,(2)确定性的约束推导,(3)通过类别特定模板进行受控的指令生成。这使能够构建一个具有可验证退避条件的多样化数据集。我们评估了几种前沿VLMs,并发现所有模型在退避任务中都表现出显著的弱点,包括那些具有高级推理能力的模型。表现最好的模型Gemini 2.5 Flash仅在6,069个基准指令中退避39.0%,而具身规划器Gemini Robotics ER 1.6 Preview仅在16.5%的指令中退避。我们进一步探讨了改进VLM规划器退避性能的方法,如防御性提示和上下文学习,并发现这些干预措施显著提高了性能,达到Gemini Robotics ER 1.6 Preview的93.6%退避率和GPT 5.4 Mini的88.6%退避率,但没有任何方法完全解决了该问题。我们开源了RoboAbstention在https://purseclab.github.io/RoboAbstention/。

英文摘要

Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.

2605.19503 2026-05-21 cs.RO cs.AI cs.LG 版本更新

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

ARC-RL: 一种受ARC Raiders启发的强化学习游乐场

Carlo Romeo, Andrew D. Bagdanov

发表机构 * Media Integration and Communication Center – University of Florence(媒体整合与通信中心——佛罗伦萨大学)

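AI summary: This paper presents ARC-RL, a reinforcement learning playground of four MuJoCo continuous-control environments whose robot morphologies are inspired by the bestiary of ARC Raiders. Using a unified observation template, action convention, and reward function, it studies how reinforcement learning algorithms perform under different morphologies and animation-style constraints.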

详情
AI中文摘要

腿式运动的强化学习已经发展为一套由多组件奖励函数和物理引擎基准构成的体系,其机器人形态无一例外地源自真实的商用硬件。然而,游戏NPC受到sim-to-real机器人学中不存在的风格约束,并且常常以没有真实机器人对应物的生物形态出现。我们介绍了ARC-RL,一个包含四种MuJoCo连续控制环境的套件,其机器人形态受ARC Raiders的生物目录启发:18自由度的高六足Queen、12自由度的装甲六足Bastion、18自由度的紧凑六足Tick以及12自由度的四足Leaper。这四个机器人共享统一的观察模板、动作约定、仿真节奏和一个单一的闭式多组件奖励函数,其唯一的形态相关差异体现在一小组权重和参数中。奖励融合了速度跟踪帐篷项、健康存活奖励、锁相的步态合规奖励/代价对、动作正则项、三个安全惩罚和一个姿态锚;运动捕捉数据在任何环节都不进入奖励。我们还为每种形态提供手工设计的中枢模式发生器(CPG)演示器,它们既作为固定的专家参考,也作为离线到在线训练的先验数据来源。在此游乐场上,我们进行了一项受控的实证研究,比较标准在线算法(SAC、SPEQ、SOPE-EO)和利用先验数据增强的方法(SACfD、SPEQ-O2O、SOPE),并刻画每种范式如何应对游乐场的形态多样性和动画风格约束。源代码可在 https://github.com/CarloRomeo427/ARC_RL.git 获取。

英文摘要

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.
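
To make the "velocity-tracking tent" and the shared weighted reward more concrete, here is a minimal sketch assuming the tent is a triangular kernel that peaks at the target velocity. The exact functional form, the component names, and all weights below are assumptions for illustration; the abstract does not specify them.

```python
def tent(value, target, half_width):
    """Triangular ('tent') kernel: 1 at the target, falling linearly to 0
    at a distance of half_width, and 0 beyond that."""
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return max(0.0, 1.0 - abs(value - target) / half_width)

def total_reward(terms, weights):
    """Weighted sum of named reward components; only the weights would
    change from one morphology to another in this sketch."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical components and weights for one robot.
terms = {"track": tent(0.75, 1.0, 0.5), "alive": 1.0, "action_cost": -0.2}
weights = {"track": 2.0, "alive": 0.5, "action_cost": 1.0}
print(total_reward(terms, weights))
```

Keeping the reward structure fixed and varying only a small weight table is what lets a single closed-form function serve all four morphologies.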

2605.19138 2026-05-21 cs.RO cs.AI cs.LG 版本更新

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

COBALT: 通过基于云的智能手机远程操作实现机器人学习的众包

Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan, Aryan Sarswat, Ranjani Koushik, Masoud Moghani, Ajay Mandlekar, Animesh Garg

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of California, Berkeley(加州大学伯克利分校) New York University Abu Dhabi (NYUAD)(纽约大学阿布扎比分校) University of Toronto(多伦多大学) NVIDIA(英伟达)

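AI summary: This paper presents COBALT, a cloud-based teleoperation platform that uses smartphones and other common devices to collect high-quality robot learning data at scale, in both simulation and the real world.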

详情
AI中文摘要

大规模、高质量演示数据的稀缺仍然是扩展机器人操作模仿学习的瓶颈。我们提出了COBALT,一个旨在在仿真和现实世界中大规模普及机器人学习的远程操作平台。通过利用向量化环境,我们可扩展、负载均衡的基础设施支持多个用户在单个GPU上同时进行远程操作,从而显著降低远程操作成本。操作员几乎可以从全球任何地方使用常见设备连接,包括单部或两部智能手机、VR头显、3D鼠标和键盘。内存数据缓存和高效的视频流使控制与渲染保持同步,可支持数十个并发用户以20 Hz运行,在每GPU至多8个并发用户时端到端延迟低于100毫秒。我们还展示了在8个GPU上支持256个模拟客户端的稳定运行,凸显了系统跨硬件以及在单个服务器内的扩展能力。我们进行了全面的用户研究,结果表明基于手机的远程操作性能与专用硬件相当或更优,能够更快、更符合人体工学地收集数据。为确保数据质量,COBALT记录一组实时指标以自动过滤次优演示。我们进一步证明,结构化的用户培训课程能显著提高数据收集质量。基于用户研究的洞察,我们通过众包收集了一个大规模、高质量的试点数据集,包含7500多个演示(50多个小时),由智能手机在九个国家历时五天采集。我们通过训练最先进的模仿学习算法验证了数据集的质量。更多详情请访问 https://cobalt-teleop.github.io/ 。

英文摘要

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.
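
One simple way to picture load-balanced session placement is a greedy least-loaded rule, sketched below. This is an assumption for illustration, not COBALT's actual scheduler; the function name and the per-GPU capacity argument are hypothetical. The only figures taken from the abstract are 256 simulated clients and 8 GPUs, which implies 32 sessions per GPU when spread evenly.

```python
def assign_clients(n_clients, n_gpus, capacity):
    """Greedy least-loaded placement: each new client goes to the GPU
    that currently hosts the fewest sessions."""
    loads = [0] * n_gpus
    placement = []
    for _ in range(n_clients):
        gpu = min(range(n_gpus), key=loads.__getitem__)
        if loads[gpu] >= capacity:
            raise RuntimeError("all GPUs are at capacity")
        loads[gpu] += 1
        placement.append(gpu)
    return loads, placement

loads, placement = assign_clients(256, 8, capacity=32)
print(loads)  # → [32, 32, 32, 32, 32, 32, 32, 32]
```

At the stated 20 Hz control rate each session has a 50 ms period, so the number of sessions a GPU can host is bounded by how many environment steps and video frames it can produce within that window.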

2605.17776 2026-05-21 cs.RO 版本更新

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

CosFly-Track: 一个大规模多模态数据集,用于通过多约束轨迹优化的无人机视觉跟踪

Xiangyue Wang, Hanxuan Chen, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Kangli Wang, Ji Pei

发表机构 * Autel Robotics(Autel机器人公司) Nanjing University(南京大学) Peking University(北京大学) Southern University of Science and Technology(南方科技大学) University of Hong Kong(香港大学)

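AI summary: This paper presents CosFly-Track, a dataset for UAV visual tracking. It generates large-scale multi-modal data through multi-constraint trajectory optimization and improves dynamic target tracking performance.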

详情
AI中文摘要

近年来,空中视觉-语言导航(VLN)数据集发展迅速,但它们主要解决面向静态目的地的目标导向导航问题,而无人机视觉跟踪(在保持可见性的同时持续跟随移动目标)基本缺乏专门的训练数据。我们介绍了CosFlyTrack,这是一个用于城市环境中无人机视觉跟踪的大规模多模态数据集和可扩展生成管道。该数据集提供了约12,000条专家和扰动的无人机轨迹,这些轨迹由6,000条行人路径生成,共240万个时间步(约334小时),包含七个对齐的数据通道:RGB、度量深度、语义分割、六自由度无人机位姿、带有可见性标志的目标状态、双语(中文-英文)指令以及轨迹对元数据。为了生成高质量的专家轨迹,我们开发了MuCO,一个多约束优化器,它借助BVH加速的碰撞和可见性查询直接在连续三维空间中规划,联合施加目标可见性、视角质量、碰撞避免、平滑度和运动学可行性约束,避免了基于网格的规划器的离散化伪影和事后平滑。在七个视觉-语言模型上的微调实验表明,CosFlyTrack将SR@1米跟踪性能提升至78.3%至95.6%,比零样本基线高出53至69个百分点,支持将该数据集作为动态目标跟随智能体的训练资源。该数据集在 https://huggingface.co/datasets/AutelRobotics/CosFly 上公开可用;评估脚本和预训练检查点托管在 https://huggingface.co/AutelRobotics/CosFly-Track 上。

英文摘要

Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations, leaving UAV visual tracking (continuously following a moving target while maintaining visibility) largely without dedicated training data. We introduce CosFlyTrack, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFlyTrack improves tracking performance to 78.3 to 95.6 percent SR@1 meter, a 53 to 69 percentage point gain over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.
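
The dataset-scale figures in the abstract can be cross-checked with a few lines of arithmetic. The snippet only uses the approximate numbers quoted above, so the derived sampling rate and trajectory length are back-of-the-envelope values, not figures reported by the authors.

```python
# Figures quoted in the abstract (all approximate).
trajectories = 12_000
timesteps = 2_400_000
hours = 334

steps_per_trajectory = timesteps / trajectories                  # 200.0
sample_rate_hz = timesteps / (hours * 3600)                      # roughly 2 Hz
seconds_per_trajectory = steps_per_trajectory / sample_rate_hz   # roughly 100 s

print(f"{steps_per_trajectory:.0f} steps per trajectory")
print(f"{sample_rate_hz:.2f} Hz implied sampling rate")
print(f"{seconds_per_trajectory:.0f} s per trajectory")
```

Under these assumptions each trajectory covers on the order of a hundred seconds of flight sampled about twice per second.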

2605.15944 2026-05-21 cs.RO cs.LG 版本更新

FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy

FocalPolicy: 频率优化的分块和局部锚定的流匹配用于连贯的视觉-运动策略

Qian He, Zhenshuo Yang, Wenqi Liang, Chunhui Hao, Nicu Sebe, Jiandong Tian

发表机构 * State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences(机器人与智能系统国家重点实验室,沈阳自动化研究所,中国科学院) University of the Chinese Academy of Sciences(中国科学院大学) University of Trento(特伦托大学)

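AI summary: This paper proposes FocalPolicy, a visuomotor policy that uses frequency-optimized chunking and locally anchored flow matching to balance precision and foresight in continuous visuomotor control.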

详情
AI中文摘要

视觉-运动策略旨在从专家示范中学习复杂的操作任务。然而,生成平滑且连贯的轨迹仍然具有挑战性,因为它需要在近端精度与远端远见之间进行平衡。现有方法通常专注于优化块内动作分布,往往忽略了块间连贯性。因此,块间不连续性显著阻碍了连贯长周期动作的学习。为克服这一限制并实现精度与远见之间的协同平衡,我们提出了FocalPolicy,一种具有远见的视觉-运动策略,结合了频率优化的分块与局部锚定的流匹配。我们引入了一个远见复合目标,监督时间域内近端动作的对齐,同时在多个未来动作块上正则化频率域结构以提高跨块连贯性。为了高效学习复杂动作分布,我们设计了局部锚定采样,以提高一致性流匹配训练期间的目标信号传播效率。广泛的实验表明,FocalPolicy优于现有方法,并验证了我们的模块对其他基线的通用性。项目网站:https://focalpolicy.github.io/

英文摘要

Visuomotor policies aim to learn complex manipulation tasks from expert demonstrations. However, generating smooth and coherent trajectories remains challenging, as it requires balancing proximal precision with distal foresight. Existing approaches typically focus on optimizing intra-chunk action distributions, often neglecting the inter-chunk coherence. Consequently, inter-chunk discontinuities significantly impede the learning of coherent long-horizon actions. To overcome this limitation and achieve a synergetic balance between precision and foresight, we propose FocalPolicy, a foresight-aware visuomotor policy that combines Frequency-Optimized Chunking with Locally Anchored flow matching. We introduce a foresight composite objective that supervises time-domain alignment within the proximal actions while regularizing frequency-domain structure over multiple future action chunks to improve cross-chunk coherence. To efficiently learn complex action distributions, we design locally anchored sampling to enhance target signal propagation efficiency during consistency flow matching training. Extensive experiments demonstrate that FocalPolicy outperforms existing approaches and confirm the generalizability of our modules to other baselines. Project website: https://focalpolicy.github.io/
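
The composite objective described above, time-domain supervision on near-term actions plus frequency-domain regularization over a longer horizon, can be sketched for a one-dimensional action sequence as follows. The specific loss terms, the use of DFT magnitudes, and the weight `freq_weight` are assumptions made for illustration; the abstract does not give the paper's exact formulation.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitudes of the normalized discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))) / n
        for k in range(n)
    ]

def composite_loss(pred, target, n_proximal, freq_weight=0.1):
    """Mean squared error on the first n_proximal actions, plus a weighted
    spectral-magnitude mismatch over the whole multi-chunk horizon."""
    if len(pred) != len(target) or not 0 < n_proximal <= len(pred):
        raise ValueError("sequences must match and n_proximal must be in range")
    time_term = sum((p - q) ** 2
                    for p, q in zip(pred[:n_proximal], target[:n_proximal])) / n_proximal
    pm, tm = dft_magnitudes(pred), dft_magnitudes(target)
    freq_term = sum((a - b) ** 2 for a, b in zip(pm, tm)) / len(pm)
    return time_term + freq_weight * freq_term

target = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]   # two hypothetical chunks of 4
jumpy = [0.0, 0.5, 1.0, 0.5, 0.8, -0.5, -1.0, -0.5]    # discontinuity at the chunk boundary
print(composite_loss(jumpy, target, n_proximal=4))
```

In this toy case the first chunk matches exactly, so the whole penalty comes from the spectral term, which is the part that reacts to the jump at the boundary between chunks.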

2605.15157 2026-05-21 cs.RO cs.LG 版本更新

Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention

手在环中:通过无缝的手-臂干预改进VLA策略以实现灵巧操作

Zhuohang Li, Liqun Huang, Wei Xu, Zhengming Zhu, Nie Lin, Xiao Ma, Xinjun Sheng, Ruoshi Wen

发表机构 * State Key Laboratory of Mechanical System and Vibration, School of Mechanical Engineering, Shanghai Jiao Tong University(机械系统与振动国家重点实验室,机械工程学院,上海交通大学) Shanghai Key Laboratory of Intelligent Robotics, Meta Robotics Institute, Shanghai Jiao Tong University, Shanghai 200240, China(智能机器人上海市重点实验室,元机器人研究院,上海交通大学,上海200240,中国) The University of Tokyo(东京大学)

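AI summary: This paper proposes Hand-in-the-Loop, a method that seamlessly integrates human intervention with autonomous policy execution, reducing abrupt hand configuration changes and improving the robustness and efficiency of bimanual dexterous manipulation.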

详情
AI中文摘要

视觉-语言-动作(VLA)模型在灵巧操作中容易出现误差累积,高维动作空间和接触丰富的动力学会在长时域内放大微小的策略偏差。虽然交互式模仿学习(IIL)可以通过人类修正数据改进策略,但将其应用于高自由度(DoF)机械手仍具有挑战性,因为在干预时刻人类遥操作与策略执行之间存在指令不匹配,会导致机器人手部构型的突变,即“手势跳变”。我们提出了Hand-in-the-Loop(HandITL),一种无缝的人在回路干预方法,它将人类的修正意图与自主策略执行相融合,以避免双手灵巧操作中的手势跳变。与通过直接遥操作接管控制相比,HandITL将干预抖动减少了99.8%,并保持了干预后的稳健操作,将抓取失败减少了87.5%,平均完成时间减少了19.1%。我们在需要双手协调、工具使用和精细长时域操作的任务上验证了HandITL。当用于收集策略改进所需的修正数据时,HandITL得到的策略在三个长时域灵巧任务上平均比使用标准遥操作数据训练的策略高出19%。

英文摘要

Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human correction data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the intervention moment, which causes abrupt robot-hand configuration changes, or "gesture jumps". We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with taking over control using direct teleoperation, HandITL reduces intervention jitter by 99.8% and preserves robust post-intervention manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect correction data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.
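
To see why a hard switch from policy to teleoperation produces a "gesture jump" and how blending avoids it, consider a generic linear cross-fade between the two command vectors. This is a swapped-in textbook technique, not HandITL's method, which the abstract does not detail; the function name, the ramp length, and the joint values are hypothetical.

```python
def blend_takeover(policy_cmd, human_cmd, step, ramp_steps):
    """Linear cross-fade from the policy command to the human command over
    ramp_steps control ticks, instead of switching abruptly at step 0."""
    if ramp_steps <= 0:
        raise ValueError("ramp_steps must be positive")
    alpha = min(1.0, max(0.0, step / ramp_steps))
    return [(1.0 - alpha) * p + alpha * h for p, h in zip(policy_cmd, human_cmd)]

policy = [0.0, 1.0]   # hypothetical finger joint targets from the policy
human = [1.0, 0.0]    # mismatched targets from the operator's hand pose
for step in (0, 5, 10):
    print(step, blend_takeover(policy, human, step, ramp_steps=10))
```

At step 0 the robot still follows the policy, so the hand does not move when the operator takes over; the command then drifts toward the operator's pose over the ramp rather than snapping to it.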

2605.14417 2026-05-21 cs.RO cs.CV 版本更新

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

在身体移动之前:为语言条件的人形控制学习预见性关节意图

Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) LimX Dynamics Technology Co., Ltd.(LimX动态技术有限公司) Shandong University(山东大学) Data61/CSIRO Griffith University(格里菲斯大学) Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)(深度感知技术研究院,江苏省工业技术研究院(JITRI))

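AI summary: This work proposes DAJI, a framework that learns an anticipatory joint-intent interface between language generation and closed-loop control, addressing the need for language-conditioned humanoid robots to anticipate future physical transitions. It reports strong results on HumanML3D-style generation and on BABEL.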

详情
AI中文摘要

自然语言是人形机器人的直观接口,但流式全身控制需要既能立即执行、又能预见未来物理转换的控制表示。现有的语言条件人形系统通常生成运动学参考,需要由低层跟踪器被动地修复,或者使用隐式/动作策略,其输出并不显式编码即将发生的接触变化、支撑转移和平衡准备。我们提出DAJI(Dynamics-Aligned Joint Intent),一个分层框架,用于学习语言生成与闭环控制之间的预见性关节意图接口。DAJI-Act通过学生驱动的推演(rollout)将具有未来感知能力的教师策略蒸馏为可部署的扩散动作策略,而DAJI-Flow则根据语言和意图历史自回归地生成未来意图块。实验表明,DAJI在预见性隐变量学习、单指令生成和流式指令跟随方面均取得了优异结果,在HumanML3D风格生成上达到94.42%的推演成功率,在BABEL上达到0.152的子序列FID。

英文摘要

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

2605.14201 2026-05-21 cs.RO cs.CV 版本更新

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

MAPLE:基于潜在空间的多智能体交互用于端到端自动驾驶

Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Meysam Sadeghigooghari, Hanno Ackermann, Litian Liu, Pranav Desai, Fatih Porikli, Mohammad Ghavamzadeh, Hong Cai

发表机构 * Qualcomm AI Research(高通人工智能研究院)

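AI summary: This paper proposes MAPLE, a framework that performs reactive multi-agent rollouts in the latent space of a vision-language-action model to address the brittleness of models trained with traditional imitation learning in closed-loop settings. It combines supervised fine-tuning with reinforcement learning and diversity rewards, enabling scalable closed-loop training without an external simulator and improving the robustness of end-to-end autonomous driving.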

Comments 19 pages, 9 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型在端到端运动规划中表现出色,但在闭环设置中由于训练基于传统模仿学习框架而显得脆弱。现有的闭环监督方法缺乏可扩展性且无法完全建模反应式环境。我们提出MAPLE,一种新的框架,用于在VLA模型的潜在空间中进行动态驾驶场景的反应式多智能体滚动。主体车辆和附近交通代理在多步时间范围内独立控制,同时对场景中的其他代理具有反应性,从而实现闭环训练。MAPLE包含两个训练阶段:(1)基于真实轨迹的潜在滚动监督微调,随后是(2)具有全局和代理特定奖励的强化学习,这些奖励鼓励安全、进展和交互真实感。我们进一步提出多样性奖励,鼓励模型生成可能不在记录驾驶数据中存在的规划行为。值得注意的是,我们的闭环训练框架具有可扩展性,且无需外部模拟器,这些模拟器计算成本高且视觉保真度有限。MAPLE在Bench2Drive上实现了最先进的驾驶性能,并展示了可扩展的闭环多智能体交互,为鲁棒的端到端自动驾驶系统提供了支持。

英文摘要

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

2605.11151 2026-05-21 cs.AI cs.RO 版本更新

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

RankQ: 通过自监督动作排名实现离线到在线强化学习

Andrew Choi, Wei Xu

发表机构 * Horizon Robotics(地平线机器人)

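AI summary: This work proposes RankQ, which augments temporal-difference learning with a self-supervised multi-term ranking loss to learn a more accurate critic in large state-action spaces, enabling more effective offline-to-online fine-tuning on sparse-reward D4RL benchmarks and in vision-based robot learning.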

详情
AI中文摘要

离线到在线强化学习(RL)通过利用预先收集的数据集来提高样本效率。然而,一个关键挑战是在有限的数据集覆盖下,在大规模状态-动作空间中学习准确的批评器。为了减轻价值过估计带来的有害更新,先前方法通过降低分布外(OOD)动作相对于数据集动作的权重来引入悲观主义。虽然有效,但这种方法本质上充当了一个行为克隆锚点,当数据集动作不优时会阻碍后续在线策略改进。我们提出RankQ,一种离线到在线的Q学习目标,通过在时序差分学习中加入自监督的多项排名损失来强制结构化动作排序。通过学习相对动作偏好而不是均匀惩罚未见过的动作,RankQ塑造Q函数,使动作梯度指向高质量的行为。在稀疏奖励D4RL基准中,RankQ的性能与或优于七种先前方法。在基于视觉的机器人学习中,RankQ能够在低数据环境下有效微调预训练的视觉-语言-动作(VLA)模型,平均在模拟成功率上比次优方法高42.7%。在高数据环境下,RankQ在模拟性能上比次优方法提高13.7%,并实现强大的仿真到现实转移,将现实世界立方体堆叠成功率从43.1%提升到88.9%,相对于VLA的初始性能。

英文摘要

Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse-reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 88.9% relative to the VLA's initial performance.
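
The general shape of "temporal-difference learning plus a ranking term" can be sketched on scalar Q-values as below. The pairwise hinge term is a stand-in chosen for illustration: RankQ's multi-term ranking loss and the way it selects preferred actions are not described in the abstract, so the function names, the margin, and `rank_weight` are all assumptions.

```python
def td_error_sq(q_sa, reward, gamma, q_next):
    """Squared one-step temporal-difference error for a single transition."""
    target = reward + gamma * q_next
    return (q_sa - target) ** 2

def pairwise_rank_loss(q_preferred, q_other, margin=1.0):
    """Hinge penalty that is zero once the preferred action's Q-value
    exceeds the other action's Q-value by at least the margin."""
    return max(0.0, margin - (q_preferred - q_other))

def combined_loss(q_sa, reward, gamma, q_next, q_preferred, q_other, rank_weight=0.5):
    """TD term plus a weighted ranking term on one pair of actions."""
    return (td_error_sq(q_sa, reward, gamma, q_next)
            + rank_weight * pairwise_rank_loss(q_preferred, q_other))

# TD target already matched, but the two actions are ranked the wrong way round.
print(combined_loss(1.0, 0.0, 0.5, 2.0, q_preferred=1.0, q_other=1.5))  # → 0.75
```

Unlike a uniform penalty on unseen actions, a ranking term only constrains the relative order of Q-values, which leaves room for an out-of-dataset action to score highly when it is in fact better.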

2605.09586 2026-05-21 cs.CV cs.RO 版本更新

DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos

DeformMaster: 一个从视频中学习的可变形物体交互式物理-神经世界模型

Can Li, Zhoujian Li, Ren Li, Jie Gu, Lei Lei, Jingmin Chen, Lei Sun

发表机构 * Nankai University(南开大学) Zhejiang University(浙江大学) Southern University of Science and Technology(南方科技大学) Rightly Robotics, A4X(Rightly Robotics,A4X) University of Science and Technology of China(中国科学技术大学)

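AI summary: This work presents DeformMaster, a video-derived interactive physics-neural world model that turns real interaction videos into a model of deformable objects within a unified dynamics-and-appearance framework. It preserves structured physical rollout, uses a neural residual to compensate for unmodeled effects, and drives high-fidelity 4D appearance; experiments show it outperforms existing methods in dynamics prediction and appearance rendering.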

Comments Project page: https://can-lee.github.io/deformmaster-web/

详情
AI中文摘要

世界模型用于变形物体应恢复不仅几何和外观,还应包含底层物理动态、交互基础和材料行为。从真实视频中学习此类模型具有挑战性,因为变形的线性、平面和体积物体在高维变形、噪声交互和复杂材料响应下演变。因此,模型必须从视觉观测中推断物理状态,通过新交互推进,并以高视觉保真度渲染结果。我们提出了DeformMaster,一种视频衍生的交互物理-神经世界模型,将真实交互视频转化为统一动态-外观框架中的变形物体在线交互模型。DeformMaster保留了结构化的物理推演,同时利用神经残差补偿未建模效应,将稀疏手部运动作为分布式合规执行器用于手-连续体交互,用空间变化的本构专家表示材料响应,并从预测的物理演变中驱动高保真4D外观。在真实世界变形物体序列上的实验表明,DeformMaster能够推演未来动态并渲染动态外观,优于现有最先进基线,同时支持新动作推演、材料参数变化和动态新视角合成。项目页面:https://can-lee.github.io/deformmaster-web/

英文摘要

World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics-neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as a distributed compliant actuator for hand-continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster's ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis. Project page: https://can-lee.github.io/deformmaster-web/

2603.14392 2026-05-21 cs.LG cs.RO 版本更新

WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems

WestWorld: 一种知识编码的可扩展轨迹世界模型用于多样化机器人系统

Yuchen Wang, Jiangtao Kong, Sizhe Wei, Xiaochang Li, Haohong Lin, Hongjue Zhao, Tianyi Zhou, Lu Gan, Huajie Shao

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Carnegie Mellon University(卡内基梅隆大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

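AI summary: This paper proposes WestWorld, a knowledge-encoded scalable trajectory world model for diverse robotic systems. It introduces a system-aware Mixture-of-Experts (Sys-MoE) and a structural embedding to improve scalability and zero-shot generalization, supporting trajectory prediction and control across a range of robotic environments.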

Comments ICML 2026 spotlight

详情
AI中文摘要

轨迹世界模型在机器人动力学学习、规划和控制中起着关键作用。尽管最近的研究已经探索了适用于多样化机器人系统的轨迹世界模型,但它们难以扩展到大量不同的系统动态,并忽略了物理结构的领域知识。为了解决这些限制,我们引入了WestWorld,一种针对多样化机器人系统的知识编码可扩展轨迹世界模型。为了解决可扩展性挑战,我们提出了一种新颖的系统感知混合专家(Sys-MoE),通过可学习的系统嵌入动态结合和路由针对不同机器人系统的专用专家。为进一步增强零样本泛化能力,我们通过引入结构嵌入来整合机器人物理结构的领域知识,使轨迹表示与形态学信息对齐。在预训练于89个复杂环境(涵盖多样化形态的仿真和现实世界设置)后,WestWorld在零样本和少样本轨迹预测上显著优于竞争基线。此外,它在广泛范围的机器人环境中的可扩展性表现出色,并在不同机器人上的下游基于模型的控制中显著提高了性能。最后,我们在现实世界中的Unitree Go1上部署了该模型,展示了稳定的移动性能。代码可在https://github.com/511205787/WestWorld上获取。

英文摘要

Trajectory world models play a crucial role in robotic dynamics learning, planning, and control. While recent works have explored trajectory world models for diverse robotic systems, they struggle to scale to a large number of distinct system dynamics and overlook domain knowledge of physical structures. To address these limitations, we introduce WestWorld, a knoWledge-Encoded Scalable Trajectory World model for diverse robotic systems. To tackle the scalability challenge, we propose a novel system-aware Mixture-of-Experts (Sys-MoE) that dynamically combines and routes specialized experts for different robotic systems via a learnable system embedding. To further enhance zero-shot generalization, we incorporate domain knowledge of robot physical structures by introducing a structural embedding that aligns trajectory representations with morphological information. After pretraining on 89 complex environments spanning diverse morphologies across both simulation and real-world settings, WestWorld achieves significant improvements over competitive baselines in zero- and few-shot trajectory prediction. Additionally, it shows strong scalability across a wide range of robotic environments and significantly improves performance on downstream model-based control for different robots. Finally, we deploy our model on a real-world Unitree Go1, where it demonstrates stable locomotion performance. The code is available at https://github.com/511205787/WestWorld.

2602.18532 2026-05-21 cs.CV cs.AI cs.RO 版本更新

VLANeXt: Recipes for Building Strong VLA Models

VLANeXt: 构建强大VLA模型的配方

Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy

Affiliations: S-Lab, Nanyang Technological University; SenseTime Research; Sun Yat-sen University; Shanghai Jiao Tong University

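AI summary: This paper revisits the VLA design space under a unified framework and evaluation setup, systematically analyzing foundational components, perception essentials, and action modelling perspectives. It distills 12 key findings, proposes VLANeXt, a simple yet effective VLA model that outperforms existing methods on the LIBERO and LIBERO-plus benchmarks, and releases an easy-to-use codebase.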

Comments Accepted in ICML 2026, Project Page: https://dravenalg.github.io/VLANeXt/

AI中文摘要

在大基础模型兴起之后,视觉-语言-动作模型(VLAs)应运而生,利用视觉语言模型的强大视觉和语言理解能力进行通用目的策略学习。然而,当前VLA领域仍处于碎片化和探索阶段。尽管许多团队提出了各自的VLA模型,但训练协议和评估设置的一致性不足,使得难以确定哪些设计选择真正重要。为了使这一发展领域更具结构化,我们重新审视VLA设计空间,基于类似RT-2的简单VLA基线,系统地分析了三个维度:基础组件、感知要素和动作建模视角。从这项研究中,我们提炼出12项关键发现,共同构成了构建强大VLA模型的实用配方。该探索的成果是一种简单而有效的模型VLANeXt,它在LIBERO和LIBERO-plus基准测试中优于现有方法,并在现实世界实验中表现出色。我们还发布了一个统一且易于使用的代码库,以重现我们的发现、探索设计空间并基于共享基础开发新的VLA变体。代码库可在https://github.com/DravenALG/VLANeXt上获得。

英文摘要

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.

2602.12283 2026-05-21 eess.SY cs.HC cs.RO cs.SY 版本更新

A Lightweight Cubature Kalman Filter for Attitude and Heading Reference Systems Using Simplified Prediction Equations

一种用于姿态和航向参考系统的轻量级立方根卡尔曼滤波器:使用简化预测方程

Shunsei Yamagishi, Lei Jing

Affiliations: Graduate School of Computer Science and Engineering, The University of Aizu

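AI summary: This paper presents an improved Cubature Kalman Filter (CKF), named the Kaisoku Cubature Kalman Filter (KCKF), that lowers computational cost while maintaining estimation accuracy. Lightweight prediction equations are derived by simplifying the CKF equations while preserving equivalent mathematical relations. Experiments show that the KCKF requires fewer floating-point operations (FLOPs) than the CKF and reduces computation time by approximately 19% on a high-performance computer and approximately 15% on a low-cost single-board computer, while maintaining attitude estimation accuracy.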

Journal ref IEEE Access, vol. 14, 2026, pp. 73686-73697

AI中文摘要

姿态和航向参考系统(AHRS)被广泛应用于需要可靠方向和运动传感的任何地方。本文提出了一种改进的立方根卡尔曼滤波器(CKF),在保持估计精度的同时降低了计算成本,称为“Kaisoku立方根卡尔曼滤波器(KCKF)”。通过简化CKF的方程,保留等价的数学关系,推导出KCKF的计算效率方程。通过扩展CKF中的求和项并简化结果,推导出KCKF的轻量级预测方程。本文证明KCKF所需的浮点运算(FLOPs)比CKF更少。受控实验结果表明,与CKF相比,KCKF在高性能计算机上将计算时间减少了约19%,而在低成本单板计算机上减少了约15%。此外,KCKF保持了CKF的姿态估计精度。

英文摘要

Attitude and Heading Reference Systems (AHRSs) are broadly applied wherever reliable orientation and motion sensing is required. In this paper, we present an improved Cubature Kalman Filter (CKF) with lower computational cost while maintaining estimation accuracy, which is named "Kaisoku Cubature Kalman Filter (KCKF)". The computationally efficient equations of the KCKF are derived by simplifying those of the CKF, while preserving equivalent mathematical relations. The lightweight prediction equations in the KCKF are derived by expanding the summation terms in the CKF and simplifying the result. This paper shows that the KCKF requires fewer floating-point operations (FLOPs) than the CKF. The controlled experimental results show that the KCKF reduces the computation time by approximately 19% compared to the CKF on a high-performance computer, whereas the KCKF reduces the computation time by approximately 15% compared to the CKF on a low-cost single-board computer. In addition, the KCKF maintains the attitude estimation accuracy of the CKF.
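
For readers unfamiliar with the baseline, the sketch below shows how a standard CKF generates its cubature points and forms a predicted mean, which is the kind of summation the abstract says the KCKF expands and simplifies. It is the textbook third-degree rule only, not the KCKF's simplified equations, which the abstract does not give. The function names are illustrative assumptions.

```python
import math

def cholesky(P):
    """Lower-triangular S with S * S^T == P, for a small symmetric positive-definite P."""
    n = len(P)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = P[i][j] - sum(S[i][k] * S[j][k] for k in range(j))
            S[i][j] = math.sqrt(acc) if i == j else acc / S[j][j]
    return S

def cubature_points(mean, P):
    """Return the 2n equally weighted points mean +/- sqrt(n) * (column i of S)."""
    n = len(mean)
    S = cholesky(P)
    scale = math.sqrt(n)
    points = []
    for sign in (1.0, -1.0):
        for i in range(n):
            points.append([mean[r] + sign * scale * S[r][i] for r in range(n)])
    return points

def predicted_mean(points, f):
    """Propagate every point through the process model f and average with weight 1/(2n)."""
    propagated = [f(p) for p in points]
    count = len(propagated)
    dim = len(propagated[0])
    return [sum(p[d] for p in propagated) / count for d in range(dim)]
```

Because the points come in symmetric pairs around the mean, a linear process model returns exactly the model applied to the mean. The cost that a lightweight variant would target is the per-point propagation and summation.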

2602.03209 2026-05-21 cs.RO 版本更新

Depth Completion in Unseen Field Robotics Environments Using Extremely Sparse Depth Measurements

在未见过的田间机器人环境中使用极稀疏深度测量进行深度补全

Marco Job, Thomas Stastny, Eleni Kelasidi, Roland Siegwart, Michael Pantic

Affiliations: Autonomous Systems Lab, ETH Zürich; Field Robotics Lab, NTNU

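AI summary: This work proposes a depth completion model that is trained on synthetic data and uses extremely sparse depth sensor measurements to predict dense metric depth in unseen field robotics environments, addressing the limited adoption of low-cost cameras in field robotics.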

Comments Accepted to ICRA 2026

AI中文摘要

在无结构环境中自主运行的田间机器人需要可靠的感知以确保安全和可靠的运行。最近的单目深度估计进展展示了低成本相机作为深度传感器的潜力;然而,由于缺乏可靠的尺度线索、模糊或低纹理条件以及大规模数据集的稀缺,其在田间机器人中的应用仍然有限。为了解决这些挑战,我们提出了一种深度补全模型,该模型在合成数据上训练,并利用深度传感器的极稀疏测量来预测未见过的田间机器人环境中的密集度量深度。一个针对田间机器人的合成数据集生成流程能够创建多个逼真的数据集用于训练。该数据集生成方法利用结构从运动的纹理3D网格和具有新视角合成的逼真渲染来模拟多样的田间机器人场景。我们的方法在Nvidia Jetson AGX Orin上实现了每帧53毫秒的端到端延迟,使嵌入式平台上的实时部署成为可能。广泛的评估表明,在多样化的现实世界田间机器人场景中具有竞争性的性能。

英文摘要

Autonomous field robots operating in unstructured environments require robust perception to ensure safe and reliable operations. Recent advances in monocular depth estimation have demonstrated the potential of low-cost cameras as depth sensors; however, their adoption in field robotics remains limited due to the absence of reliable scale cues, ambiguous or low-texture conditions, and the scarcity of large-scale datasets. To address these challenges, we propose a depth completion model that trains on synthetic data and uses extremely sparse measurements from depth sensors to predict dense metric depth in unseen field robotics environments. A synthetic dataset generation pipeline tailored to field robotics enables the creation of multiple realistic datasets for training purposes. This dataset generation approach utilizes textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis to simulate diverse field robotics scenarios. Our approach achieves an end-to-end latency of 53 ms per frame on an NVIDIA Jetson AGX Orin, enabling real-time deployment on embedded platforms. Extensive evaluation demonstrates competitive performance across diverse real-world field robotics scenarios.

2512.13788 2026-05-21 cs.LG cs.RO 版本更新

Constrained Policy Optimization via Sampling-Based Weight-Space Projection

通过基于采样的权重空间投影进行约束策略优化

Shengfan Cao, Francesco Borrelli, Eunhyek Joa

Affiliations: Department of Mechanical Engineering, Seoul National University, Seoul, Korea

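AI summary: This work proposes SCPO, a sampling-based weight-space projection method for improving policies without leaving the safe operating regime. It enforces safety constraints directly in parameter space, maintains safety and feasibility throughout training, and ensures closed-loop stability in constrained control settings with a stabilizing backup policy.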

Comments Accepted for publication at IFAC World Congress 2026; fixed minor notation inconsistencies

AI中文摘要

安全关键学习需要在不离开安全操作范围的情况下提高性能的策略。我们研究了约束策略学习,其中模型参数必须满足基于滚动的安全部署约束,这些约束可以评估但不能解析地微分。我们提出了SCPO,一种基于采样的权重空间投影方法,该方法在不需梯度访问约束函数的情况下直接在参数空间中强制安全。SCPO通过结合基于滚动的安全评估和参数扰动与安全度量变化之间的平滑性界,构建局部安全区域,并通过凸QCQP将每个梯度更新投影。我们建立了安全-by-induction保证:从任何安全初始化开始,给定可行的投影,所有中间策略保持安全。在具有稳定备份策略的约束控制设置中,SCPO进一步确保闭环稳定性,同时在保守备份之外实现安全适应。在具有有害监督的约束回归和双积分模仿与恶意专家的实验中,SCPO拒绝了不安全的更新,保持了训练过程中的可行性,并实现了有意义的目标改进。

英文摘要

Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy rollout-based safety constraints that can be evaluated but not differentiated analytically. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. SCPO constructs a local safe region by combining rollout-based safety evaluations with smoothness bounds relating parameter perturbations to changes in safety metrics, and projects each gradient update via a convex QCQP. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, SCPO further ensures closed-loop stability while enabling safe adaptation beyond the conservative backup. Experiments on constrained regression with harmful supervision and double-integrator imitation with a malicious expert show that SCPO rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful objective improvement.
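
As a rough illustration of the projection idea, the sketch below handles the simplest case: one safety metric, one smoothness (Lipschitz-style) bound, and therefore a safe region that is a single Euclidean ball around the current parameters. In that case the convex QCQP has a closed-form solution. This is an assumed simplification for intuition, not the SCPO algorithm. The function names, the single-constraint setup, and the way the radius is obtained are all hypothetical.

```python
import math

def safe_radius(margin, lipschitz):
    """Largest step norm that cannot use up the safety margin.

    Assumes the safety metric g must stay <= 0, the current margin is -g(theta) >= 0,
    and |g(a) - g(b)| <= lipschitz * ||a - b|| holds locally (an assumed bound).
    """
    return max(margin, 0.0) / lipschitz

def project_update(delta, radius):
    """Closest point to the proposed update delta inside the ball ||d|| <= radius."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return list(delta)  # already inside the safe ball, keep the update as is
    return [d * radius / norm for d in delta]  # shrink onto the boundary

# Usage: a gradient step of norm 5 is shrunk to the assumed safe radius of 0.25.
step = project_update([3.0, 4.0], safe_radius(margin=0.5, lipschitz=2.0))
```

With several sampled constraints the feasible set is an intersection of such regions, and a QCQP solver would replace the closed-form shrink.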

2512.09447 2026-05-21 cs.RO cs.CV 版本更新

Query-Calibrated Segmental Admission for Descriptor-Agnostic LiDAR Loop Closure in Repetitive Environments

基于查询校准的分段准入用于无描述符的激光雷达回环闭合在重复环境中

Jaehyun Kim, Seungwon Choi, Wonseok Kang, Tae-Wan Kim

Affiliations: Department of Naval Architecture and Ocean Engineering

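AI summary: This work proposes a descriptor-agnostic sparse loop-admission policy aimed at pose-graph stability in repetitive environments. It calibrates query-level segment hypotheses and validates representative pairs with G-ICP to reduce false loop-factor admissions, improving loop-factor precision and graph stability.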

Comments 8 pages, 3 figures

AI中文摘要

结构重复的环境会产生视觉上合理但存在混叠的LiDAR回环候选者,当这些候选者被作为回环因子加入图中时,可能会破坏位姿图优化。我们提出了一种名为查询校准分段准入(QCSA)的策略,这是一种面向图稳定性的稀疏回环准入政策。该策略通过与硬负样本对比对短描述符分段进行评分,校准哪些查询级的分段假设能达到几何关系,并通过广义迭代最近点(G-ICP)验证代表性配对。我们在SNU图书馆数据集(SNULib)和HeLiPR重叠路线上评估了该方法。在SNULib上对七种LiDAR描述符家族进行汇总分析,QCSA将插入的回环因子减少了3.8倍,将因子精度从0.542提高到0.717,并显著降低了每组查询的误入率。在更稀疏的图中,它保持了可比的平均绝对轨迹误差(ATE)并大幅降低了最坏序列ATE与密集Top1+G-ICP相比,从1.064降至0.778米。这些结果支持了所提出的回环准入层在重混叠的同时定位与建图(SLAM)中的应用。我们的实现和数据集将在:https://github.com/wanderingcar/snu_library_dataset上发布。

英文摘要

Structurally repetitive environments produce visually plausible but aliased LiDAR loop candidates that can destabilize pose-graph optimization when admitted as loop factors. We propose Query-Calibrated Segmental Admission (QCSA), a descriptor-agnostic sparse loop-admission policy for graph-stability-oriented insertion. The policy scores short descriptor segments against hard negatives, calibrates which query-level segment hypotheses reach geometry, and inserts representative pairs validated by Generalized Iterative Closest Point (G-ICP). We evaluate it on the SNU Library Dataset (SNULib) and HeLiPR overlap routes. Aggregated over seven LiDAR descriptor families on SNULib, QCSA reduces inserted loop factors by 3.8 times, raises factor precision from 0.542 to 0.717, and sharply lowers false admissions per query group. With this sparser graph, it maintains comparable mean absolute trajectory error (ATE) and substantially reduces worst-sequence ATE versus dense Top1+G-ICP, from 1.064 to 0.778 m. The aggregate mean and worst-sequence ATE remain lower than the odometry-only reference. Under a matched factor budget, QCSA also attains lower trajectory error than SeqSLAM and sparse Top1+G-ICP selections. Fixed-transfer validation on HeLiPR, with no route-specific tuning, likewise suppresses hard-negative admissions. These results support the proposed admission layer for aliasing-heavy simultaneous localization and mapping (SLAM). Our implementation and dataset will be released at: https://github.com/wanderingcar/snu_library_dataset.

2511.01219 2026-05-21 cs.RO 版本更新

Tackling the Kidnapped Robot Problem via Sparse Feasible Hypothesis Sampling and Reliable Batched Multi-Stage Inference

通过稀疏可行假设采样和可靠的分批多阶段推理解决被绑架的机器人问题

Muhua Zhang, Lei Ma, Ying Wu, Kai Shen, Deqing Huang, Henry Leung

Affiliations: School of Electrical Engineering, Southwest Jiaotong University

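AI summary: This paper proposes a passive 2-D global relocalization framework that efficiently and reliably estimates the global pose from a single LiDAR scan and an occupancy grid map while the robot remains stationary, enhancing the long-term autonomy of mobile robots. The framework casts global relocalization as a non-convex problem and solves it with a multi-hypothesis scheme using batched multi-stage inference and early termination, balancing completeness and efficiency.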

Comments 14 pages, 8 figures. Accepted for publication in IEEE Transactions on Instrumentation and Measurement. DOI: 10.1109/TIM.2026.3694741

AI中文摘要

本文针对被绑架的机器人问题(KRP),即在已知地图中重新定位机器人时,没有先验姿态估计或在SLAM初始化时的定位丢失问题。为此,提出了一种被动的2D全局重定位框架。该框架在机器人静止时,通过单个LiDAR扫描和占用网格地图高效可靠地估计全局姿态,从而提高移动机器人的长期自主性。所提出的框架将全局重定位问题转化为非凸问题,并通过多假设方案与分批多阶段推理和早期终止来解决,平衡完整性和效率。快速探索随机树(RRT)在可通行性约束下,渐近覆盖可达空间以生成稀疏、均匀分布的可行位置假设,从根本上减少采样空间。假设首先通过所提出的扫描均方差(SMAD)进行排序,这是一种粗略的光束误差水平度量,通过优先处理高可能性的候选者来实现早期终止。SMAD计算优化以适应有限的扫描测量。提出的翻译亲和度扫描到地图对齐度量(TAM)用于在假设位置可靠地选择方向,并准确评估最终的全局姿态,以减轻由于稀疏假设引起的翻译不确定性以及非全景LiDAR扫描和环境变化导致的传统似然场度量的退化。在资源受限的移动机器人上的真实世界实验表明,所提出的框架在成功率、在测量不确定性下的鲁棒性和计算效率方面均表现优异。

英文摘要

This paper addresses the Kidnapped Robot Problem (KRP), a core localization challenge of relocalizing a robot in a known map without a prior pose estimate upon localization loss or at SLAM initialization. For this purpose, a passive 2-D global relocalization framework is proposed. It estimates the global pose efficiently and reliably from a single LiDAR scan and an occupancy grid map while the robot remains stationary, thereby enhancing the long-term autonomy of mobile robots. The proposed framework casts global relocalization as a non-convex problem and solves it via a multi-hypothesis scheme with batched multi-stage inference and early termination, balancing completeness and efficiency. The Rapidly-exploring Random Tree (RRT), under traversability constraints, asymptotically covers the reachable space to generate sparse, uniformly distributed feasible positional hypotheses, fundamentally reducing the sampling space. The hypotheses are preliminarily ordered by the proposed Scan Mean Absolute Difference (SMAD), a coarse beam-error level metric that facilitates early termination by prioritizing high-likelihood candidates. The SMAD computation is optimized for limited scan measurements. The Translation-Affinity Scan-to-Map Alignment Metric (TAM) is proposed for reliable orientation selection at hypothesized positions and accurate final global pose evaluation, mitigating the degradation of conventional likelihood-field metrics under the translational uncertainty induced by sparse hypotheses, as well as under non-panoramic LiDAR scans and environmental changes. Real-world experiments on a resource-constrained mobile robot with non-panoramic LiDAR scans show that the proposed framework achieves competitive performance in success rate, robustness under measurement uncertainty, and computational efficiency.
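
The ordering step can be pictured with a small sketch. It assumes that SMAD is the mean absolute difference between measured beam ranges and the ranges expected at a hypothesized pose, which is one plausible reading of the name; the abstract does not give the exact definition or its optimization. The expected ranges would normally come from ray casting in the occupancy grid and are taken as precomputed inputs here. All names are hypothetical.

```python
def scan_mean_absolute_difference(measured, expected):
    """Mean of |measured - expected| over beams that are valid (not None) in both scans."""
    pairs = [(m, e) for m, e in zip(measured, expected) if m is not None and e is not None]
    if not pairs:
        return float("inf")  # nothing to compare, so treat the hypothesis as worst
    return sum(abs(m - e) for m, e in pairs) / len(pairs)

def rank_hypotheses(measured, expected_by_hypothesis):
    """Return hypothesis ids sorted so the lowest-error candidates come first."""
    scores = {
        hyp: scan_mean_absolute_difference(measured, expected)
        for hyp, expected in expected_by_hypothesis.items()
    }
    return sorted(scores, key=scores.get)

# Usage: beam 3 is missing (non-panoramic scan), so it is skipped for every hypothesis.
measured_scan = [1.0, 2.0, None, 4.0]
order = rank_hypotheses(measured_scan, {
    "A": [1.0, 2.5, 3.0, 4.0],
    "B": [2.0, 2.0, 3.0, 6.0],
})
```

Evaluating candidates in this order lets a later, more expensive stage stop early once a sufficiently good pose is found.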

2510.18034 2026-05-21 cs.CV cs.AI cs.RO 版本更新

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

VLMs能否解锁语义异常检测?一个结构化推理的框架

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

Affiliations: Professorship of Autonomous Vehicle Systems, TUM School of Engineering Design, Technical University of Munich, Munich, Germany

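AI summary: This paper proposes SAVANT, a framework that uses structured reasoning to improve VLM performance in semantic anomaly detection, enabling more accurate identification of rare anomalous situations in autonomous driving scenes.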

Comments 8 pages, 5 figures

AI中文摘要

自动驾驶系统仍然对长尾的稀有、分布外语义异常极度脆弱。尽管VLMs已显现为感知的有前途工具,但其在异常检测中的应用仍然主要局限于提示专有模型,限制了可靠性、可重复性和部署可行性。为解决这一差距,我们引入SAVANT(语义异常验证/分析工具包),一种新的模型无关推理框架,将异常检测重新表述为分层语义一致性验证。通过应用SAVANT的两阶段流程——结构化场景描述提取和多模态评估,现有VLMs在输入图像中检测异常驾驶场景的得分得到提升。我们的方法取代了随意提示,通过语义感知推理,将基于VLM的检测转化为四个语义领域之间的原则性分解。我们证明,在平衡的现实驾驶场景集上,应用SAVANT可将VLM的绝对召回率提高约18.5%,相比提示基线。此外,这一增益使大规模注释成为可能:利用我们框架内的最佳专有模型,我们自动标注了约10,000张现实世界图像,具有高置信度。我们使用由此产生的高质量数据集来微调一个7B开源模型(Qwen2.5-VL)以执行单次异常检测,达到90.8%的召回率和93.8%的准确率,超越所有评估模型,同时在接近零成本的情况下实现本地部署。通过将结构化语义推理与可扩展的数据整理相结合,我们为自动驾驶系统中的语义异常检测数据稀缺问题提供了实用的解决方案。补充材料:https://TUM-AVS.github.io/SAVANT/.

英文摘要

Autonomous driving systems remain critically vulnerable to the long tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models, which limits reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline (structured scene description extraction and multi-modal evaluation), existing VLMs improve their scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves the absolute recall of VLMs by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy, surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.

2509.26627 2026-05-21 cs.AI cs.LG cs.RO 版本更新

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder: 通过帧间时间距离从被动视频中学习密集奖励

Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao

Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; Shanghai Qi Zhi Institute; Shanghai Jiao Tong University; University of Pennsylvania

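AI summary: This paper proposes TimeRewarder, which learns dense rewards from passive videos by modeling frame-wise temporal distances, to improve reinforcement learning on sparse-reward tasks. Experiments show significantly higher success rates and sample efficiency across multiple tasks.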

Comments ICML 2026 spotlight paper

AI中文摘要

设计密集奖励对于强化学习(RL)至关重要,但在机器人学中往往需要大量的手动工作且缺乏可扩展性。一个有前景的解决方案是将任务进展视为密集奖励信号,因为它量化了动作在时间上推动系统向任务完成迈进的程度。我们提出了TimeRewarder,一种简单而有效的奖励学习方法,通过建模帧对之间的时间距离,从被动视频(包括机器人演示和人类视频)中推导出进展估计信号。然后展示如何通过TimeRewarder提供逐步的代理奖励以指导强化学习。在我们对十个具有挑战性的Meta-World任务的全面实验中,我们表明TimeRewarder显著提高了稀疏奖励任务的强化学习性能,仅在每个任务中进行200,000次环境交互时,就实现了9/10任务的几乎完美成功。该方法在最终成功率和样本效率上均优于先前方法和手动设计的环境密集奖励。此外,我们还展示了TimeRewarder预训练可以利用真实世界的人类视频,突显了其作为从多样化视频源中获取丰富奖励信号的可扩展方法的潜力。

英文摘要

Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 environment interactions per task. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable approach to rich reward signals from diverse video sources.

2509.07674 2026-05-21 cs.RO cs.HC 版本更新

Temporal Counterfactual Explanations of Behaviour Tree Decisions

行为树决策的时序反事实解释

Tamlin Love, Antonio Andriella, Guillem Alenyà

Affiliations: Institut de Robòtica i Informàtica Industrial, CSIC-UPC

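AI summary: This paper presents a method that generates counterfactual explanations of robot decisions by building a causal model of the behaviour tree, a step towards more transparent and safer robotic systems.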

Comments 33 pages, 7 figures + 4 figures in appendices

AI中文摘要

可解释性,特别是机器人能够解释其决策或行为原因的能力,是帮助用户理解其交互和共存的机器人的重要工具。行为树是控制机器人决策的流行框架,因此一个自然的问题是,由行为树驱动的系统是否能够回答'为什么'的问题。尽管行为树驱动的机器人可解释性已受到一些关注,但现有的方法无法生成详细说明机器人决策原因的因果反事实解释。因此,在本工作中,我们介绍了一种新颖的方法,该方法能够自动根据对比性'为什么'问题生成反事实解释。我们的方法通过首先自动构建从行为树结构以及状态和个体行为树节点的领域知识中的因果模型来实现这一点。然后对所得因果模型进行查询和搜索,以找到一组多样的反事实解释。我们证明我们的方法能够正确解释广泛的行为树结构和状态在实时中的行为,与之前的方法相比,这些方法要么无法用因果解释回答对比性问题,要么无法保证提供一致和准确的解释。通过能够回答广泛的因果查询,我们的方法代表了朝着更透明、更易理解和最终更安全和可信的机器人系统迈进的一步。

英文摘要

Explainability, in particular, the ability for robots to explain why they have made a decision or behaved in a certain way, is a critical tool in helping users understand the robots they interact and coexist with. Behaviour trees are a popular framework for controlling the decision-making of robots, and thus a natural question to ask is whether or not a system driven by a behaviour tree is capable of answering "why" questions. While explainability for behaviour tree-driven robots has seen some prior attention, no existing methods are capable of generating causal, counterfactual explanations which detail the reasons for robot decisions and behaviour. Therefore, in this work, we introduce a novel approach which automatically generates counterfactual explanations in response to contrastive "why" questions. Our method achieves this by first automatically building a causal model from the structure of the behaviour tree as well as domain knowledge about the state and individual behaviour tree nodes. The resultant causal model is then queried and searched to find a set of diverse counterfactual explanations. We demonstrate that our approach is able to correctly explain the behaviour of a wide range of behaviour tree structures and states in real time, unlike previous methods which are either unable to answer contrastive questions with causal explanations, or are not guaranteed to provide consistent and accurate explanations. By being able to answer a wide range of causal queries, our approach represents a step towards more transparent, understandable, and ultimately safe and trustworthy robotic systems.

2508.06206 2026-05-21 cs.RO cs.CV 版本更新

Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

Affordance-R1: 为多模态大语言模型中的通用化 affordance 推理设计的强化学习

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma

Affiliations: The Hong Kong University of Science and Technology (GZ); National University of Singapore; ShanghaiTech University; East China Normal University; Nanjing University of Information Science & Technology; Zhejiang University; Institute of Automation, Chinese Academy of Science; Shanghai AI Laboratory

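AI summary: This paper proposes Affordance-R1, a unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO). Trained with reinforcement learning, it achieves zero-shot generalization and test-time reasoning capabilities.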

AI中文摘要

Affordance grounding 旨在预测与机器人执行动作相关的物体特定区域。它在人机交互、人-物交互、具身操作和具身感知领域中起着至关重要的作用。现有模型由于缺乏链式思维(CoT)推理能力,往往忽视不同物体间的 affordance 共享,限制了其域外(OOD)泛化和显式推理能力。为了解决这些挑战,我们提出了 Affordance-R1,这是首个集成认知 CoT 引导的 Group Relative Policy Optimization(GRPO)的统一 affordance 地标框架。具体而言,我们设计了一个复杂的 affordance 函数,包含格式、感知和认知奖励,以有效引导优化方向。此外,我们构建了一个高质量的 affordance 中心推理数据集 ReasonAff,以支持训练。通过仅使用强化学习与 GRPO 进行训练,而不使用显式推理数据,Affordance-R1 实现了稳健的零样本泛化,并表现出涌现的测试时推理能力。全面的实验表明,我们的模型优于已建立的方法,并展示了开放世界泛化能力。据我们所知,Affordance-R1 是首个将基于 GRPO 的 RL 与推理结合到 affordance 推理中的方法。我们的方法和数据集已发布在 https://github.com/hq-King/Affordance-R1。

英文摘要

Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code for our method and our dataset are released at https://github.com/hq-King/Affordance-R1.

2507.09180 2026-05-21 cs.CV cs.RO 版本更新

Multimodal Fusion for Sim2real Transfer in Visual Reinforcement Learning

多模态融合用于视觉强化学习中的仿真到现实迁移

Zichun Xu, Jingdong Zhao, Chenyu Guo, Qianxue Zhang, Liao Zhang, Xiao Zhang, Yiming Ren, Lian Zhang, Zengren Zhao

Affiliations: Medical Artificial Intelligence Lab, The First Hospital of Hebei Medical University, Hebei Medical University; State Key Laboratory of Robotics and Systems, Harbin Institute of Technology

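AI summary: This paper proposes a vision-transformer-based multimodal fusion framework that fuses RGB and depth information to improve generalization, together with a contrastive learning scheme and a curriculum-based domain randomization scheme to improve sample efficiency and transfer performance. Simulation results show the fusion scheme outperforming the baselines, and zero-shot transfer to real-world manipulation tasks validates its feasibility.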

AI中文摘要

深度信息对场景外观变化具有鲁棒性,并固有地包含3D空间细节。因此,本文提出基于视觉变换器的视觉主干,用于融合RGB和深度模态以增强泛化能力。不同模态首先通过单独的CNN茎部进行处理,结合的卷积特征被送入可扩展的视觉变换器以获得视觉表示。此外,设计了一种对比学习方案,通过掩码和未掩码的token来提高样本效率和泛化性能。采用基于课程的域随机化方案以灵活稳定训练过程。最后,仿真结果表明,我们的融合方案优于其他基线。通过零样本迁移验证了模型的可行性,能够执行现实世界操作任务。

英文摘要

Depth information is robust to scene appearance variations and inherently carries 3D spatial details. This paper therefore proposes a visual backbone based on the vision transformer that fuses RGB and depth modalities to enhance generalization. The modalities are first processed by separate CNN stems, and the combined convolutional features are passed to the scalable vision transformer to obtain visual representations. Moreover, a contrastive learning scheme with masked and unmasked tokens is designed to improve sample efficiency and generalization performance. A curriculum-based domain randomization scheme is used to flexibly stabilize the training process. Finally, simulation results demonstrate that our fusion scheme outperforms the other baselines, and the feasibility of our model is validated on real-world manipulation tasks via zero-shot transfer.

2605.20484 2026-05-21 cs.RO 版本更新

Enhancing Graph-Based SLAM in GNSS-Denied environments by leveraging leg odometry

通过利用腿部里程计增强基于图的SLAM在GNSS受限环境中的性能

Léon Perruchot-Triboulet, Luc Jaulin, Kai Xiao

Affiliations: LinxAI Tech

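AI summary: This paper proposes a factor-graph architecture that combines proprioceptive leg odometry with LiDAR-inertial odometry, reducing elevation drift in GNSS-denied environments and improving SLAM robustness.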

Comments 4 pages, 3 figures, 2 tables, for ICRA workshop on Robot Meets GNSS and Ranging for Seamless Autonomy

详情
AI中文摘要

在GNSS受限环境中,自主导航仍然是四足机器人面临的核心挑战,其中如激光雷达等外周传感器在几何稀疏或重复场景中容易产生高度漂移。我们提出了一种因子图架构,该架构通过并行运动学车道驱动由本体感觉腿部里程计提供的数据,并通过身份相对姿态约束与主要激光雷达-惯性车道连接,该约束采用选择性噪声模型。在Linxai D50四足平台上,该方法在两个总计超过一公里的户外环路中应用,将高度漂移从超过30米减少到不足30厘米,并在基线流程完全失败的场景中实现了收敛。这些结果表明,已经在机载系统中计算的本体感觉数据构成了轻量且有效的垂直锚点,用于GNSS受限环境下的SLAM。

英文摘要

Autonomous navigation in GNSS-denied environments remains a core challenge for legged robots, where exteroceptive sensors such as LiDAR are prone to elevation drift in geometrically sparse or repetitive scenes. We present a factor graph architecture that augments the LIO-SAM framework with a parallel kinematic lane driven by proprioceptive leg odometry, coupled to the main LiDAR-inertial lane via an identity relative pose constraint with a selective noise model. Applied to a Linxai D50 quadruped platform across two outdoor loops totaling over one kilometer, our approach reduces elevation drift from over 30 m to under 30 cm and enables convergence in a scene where the baseline pipeline fails entirely. These results suggest that proprioceptive data, already computed onboard for gait control, constitutes a lightweight and effective vertical anchor for SLAM in GNSS-denied settings.

2605.20433 2026-05-21 cs.RO 版本更新

Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation

时空最优传输注意力用于视觉-触觉模仿学习中的富接触操作

Yue Feng, Weicheng Huang, I-Ming Chen

Affiliations: Robotics Research Centre, School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore; Wings Robotics

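AI summary: This paper proposes Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion framework that aligns force-pose-derived sub-queries with visual patches via entropy-regularized optimal transport to address multimodal fusion in contact-rich manipulation, and reports high success rates on real-robot tasks.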

Comments 8 pages, 16 figures, 3 tables. Preprint

详情
AI中文摘要

接触密集的操作任务,如紧密间隙插入、连接器配合、抛光和表面适应擦拭,仍然难以由数据驱动控制器处理,因为它们耦合了不连续接触动力学、部分可观测性和严格的安全约束。单一传感模态不足以满足需求:视觉在接触前提供全局上下文,力/扭矩(F/T)反馈在接触后控制交互,而本体感觉姿态提供一致的运动学骨架。大多数先前的接触密集任务模仿学习策略仅在单模态或双模态信号上操作,而少数融合三种模态的策略通常采用现成的注意力模块,没有明确的先验知识指导注意力质量如何分布在任务相关的区域。我们提出了Spacetime Optimal-Transport Attention (SO-TA),一种三模态融合骨干,用熵正则化的最优传输(OT)对齐代替softmax归一化的块注意力。显式的边缘约束作为结构化的归纳偏置,鼓励在接触密集任务中条件感知的空间选择,这种选择在光照、干扰和部分遮挡下保持稳定。SO-TA与基于扩散的序列策略相结合,将观察窗口映射到姿态-动作块。我们在三个真实机器人任务上评估了SO-TA:紧密圆柱体装配、BCM布线连接器插入和曲面标记擦除。在每个条件约200次滚出下,SO-TA在紧密圆柱体装配任务中达到100%的成功率,而在匹配容量下的交叉注意力为93%,在光照、干扰和部分遮挡扰动下保持82.5%的成功率,而连接基线降至43.5%。OT衍生的块热图和留一法模态影响比提供可解释的、相位依赖的诊断。

英文摘要

Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch attention by an entropy-regularized Optimal Transport (OT) alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias for contact-rich tasks, encouraging conditioning-aware spatial selection that is stable across illumination, distractors, and partial occlusion. SO-TA is paired with a diffusion-based sequence policy mapping observation windows to pose-action chunks. We evaluate SO-TA on three real-robot tasks: tight peg-in-hole assembly, BCM wiring-connector insertion, and curved-surface mark erasing. With ~200 rollouts per condition, SO-TA reaches 100% success on tight peg-in-hole versus 93% for cross-attention at matched capacity, and retains 82.5% success under illumination, distractor, and partial-occlusion perturbations where a concatenation baseline drops to 43.5%. OT-derived patch heatmaps and leave-one-out modality-influence ratios provide interpretable, phase-dependent diagnostics.
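
To make the contrast with softmax attention concrete, the sketch below computes an entropy-regularized optimal transport plan with Sinkhorn iterations, the usual solver for this kind of alignment. Unlike a row-wise softmax, the plan is constrained on both sides: each query row carries a fixed mass and each patch column receives a fixed mass. How SO-TA builds its cost matrix, sub-queries, and marginals is not stated in the abstract, so the inputs and names here are illustrative assumptions.

```python
import math

def sinkhorn_plan(cost, row_marginal, col_marginal, epsilon=0.1, iterations=200):
    """Entropy-regularized OT plan P = diag(u) K diag(v) with K = exp(-cost / epsilon)."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Rescale rows to match the query-side marginal, then columns for the patch side.
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Usage: two hypothetical sub-queries attending over three hypothetical visual patches.
toy_cost = [[0.0, 1.0, 1.0],
            [1.0, 0.0, 1.0]]
plan = sinkhorn_plan(toy_cost, [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3], epsilon=0.5)
```

The column marginal is what prevents all queries from collapsing onto the same few patches. A smaller `epsilon` makes the plan sharper at the cost of slower, less stable iterations.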

2605.20431 2026-05-21 cs.HC cs.RO version update

Multi-Week, In-Class Deployments of Telepresence Robots With Four Homebound K-12 Students: Benefits, Challenges, and Recommendations

Matthew Rueben, Rhianna Lee, Thomas R. Groechel, Hengzhi Chen, Haemi Lee, Gisele Ragusa, Maja J. Matarić

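Affiliations: Colby College

AI summary: This study examines the benefits, challenges, and recommended improvements for telepresence robots that help homebound K-12 students take part in class, analyzing student experiences and classroom-management needs across four multi-week deployments and 15 interviews.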

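Details
Journal ref
Rueben, M., Lee, R., Groechel, T.R. et al. Multi-week, in-class deployments of telepresence robots with four homebound K-12 students: Benefits, challenges, and recommendations. Educ Inf Technol 31, 2145-2175 (2026)
AI-generated abstract (translated from Chinese)

In K-12 education, missing large amounts of school is known to put students' cognitive and social development at risk. Alternatives such as home instruction and online learning are common but lack classroom interaction with peers and teachers. Mobile remote presence systems, or telepresence robots, are attractive for homebound students because they add embodiment and mobility to the real-time participation that video conferencing offers. Further research is still needed before telepresence robots can meet the complex needs of homebound students in K-12 classrooms. Across four multi-week deployments, we documented the experiences of four homebound K-12 students attending class via telepresence robots in 15 interviews, which we analyzed qualitatively as case studies. The students and their deployment contexts differed along several dimensions; all participants enjoyed some benefits of mobile remote attendance, and each also experienced unique benefits. Some challenges with hearing, seeing, and moving the robot call for improvements to the telepresence system's design. Others point to priorities for managing classroom deployments, such as making sure the remote student is included in classroom activities, is accountable to the teacher, and is treated with respect by classmates. Based on these insights, we offer recommendations for real-world deployment procedures in similar contexts.

English abstract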

Missing significant amounts of school during K-12 education is known to put students' cognitive and social development at risk. Alternatives such as home instruction and online learning are common, but lack sufficient interaction with peers and teachers in the classroom. Mobile remote presence systems, or telepresence robots, are promising for homebound students because they provide embodiment and mobility in addition to the real-time participation offered by video conferencing technologies. Research is needed, however, for telepresence robots to meet the complex needs of homebound students participating remotely in the K-12 classroom context. We present findings from four multi-week deployments with homebound K-12 students attending classes via telepresence robots. The homebound students' experiences were documented in a total of 15 interviews and analyzed qualitatively as case studies. The homebound student participants and their deployment contexts differed from one another along multiple dimensions, and while some benefits of mobile remote attendance were enjoyed by all participants, each participant also experienced unique benefits. Some challenges with hearing, seeing, and moving the robot around the classroom warranted improvements to the design of the telepresence system. Other challenges suggested priorities for managing a classroom deployment, such as ensuring that the remote student is included in classroom activities, accountable to the teacher, and treated with respect by classmates. Based on insights from the study, we make recommendations for real-world deployment procedures in similar contexts.

2605.20395 2026-05-21 cs.RO version update

Scalable Multi-robot Motion Planning via Hierarchical Subproblem Expansion and Workspace Decomposition Refinement

Isaac Ngui, Courtney McBeth, James D. Motes, Marco Morales, Nancy M. Amato

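Affiliations: University of Illinois Urbana-Champaign; Instituto Tecnológico Autónomo de México

AI summary: This paper presents a multi-robot motion planning method that improves planning efficiency through discrete search over a workspace decomposition. Its core techniques are hierarchical subproblem expansion and workspace decomposition refinement, and its main contribution is iteratively refining the workspace representation so that the planner searches smaller, decoupled configuration spaces.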

Comments Accepted to WAFR 2026

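Details
AI-generated abstract (translated from Chinese)

A fundamental challenge in multi-robot motion planning is achieving enough coordination to avoid inter-robot conflicts without incurring large computational overhead. In this paper, we propose a motion planning method for multiple mobile robots that uses discrete search over a workspace decomposition to coordinate the robots during planning. Prior work uses workspace topology to decide when robots need to coordinate and then composes those robots into their joint configuration space; we go a step further and iteratively refine the workspace representation so that the planner can search smaller, decoupled configuration spaces, improving planning time by up to an order of magnitude.

English abstract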

A fundamental challenge in multi-robot motion planning is achieving sufficient coordination to avoid inter-robot conflicts without incurring the large computational expense of searching the joint configuration space of the robot group. In this work, we present a method for multiple mobile robot motion planning that achieves an improvement in planning time up to an order of magnitude by leveraging the insight that we can use discrete search over a workspace decomposition to provide coordination between robots during planning. While prior work uses workspace topology to inform when coordination between robots is needed and then composes robots into their joint configuration space, we take a step further by iteratively refining our workspace representation to allow our planner to search smaller, decoupled configuration spaces.

2605.20392 2026-05-21 cs.RO version update

VBT-MPC: Vision-Based Tactile MPC for Contour Following

Edison Velasco-Sanchez, Luis F. Recalde, Guanrui Li, Pablo Gil

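Affiliations: AUROVA Lab, Computer Science Research Institute, University of Alicante; Worcester Polytechnic Institute

AI summary: This paper proposes a Vision-Based Tactile Model Predictive Control (VBT-MPC) framework for robotic contour following. Using a vision-based tactile sensor (VBTS) mounted eye-in-hand, the controller operates directly in contour-feature space, avoiding a separate pose-estimation module and complex force-control architectures, and it is evaluated in simulation and real-world experiments on objects with different geometries and materials.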

Comments This article has been accepted for publication in IEEE Robotics and Automation Letters. This is a preprint version. This work was supported by the Interreg-VI Sudoe and European Regional Development Funds through the REMAIN Project under Grant S1/1.1/E0111

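Details
AI-generated abstract (translated from Chinese)

Tactile sensing plays a key role in robotic manipulation, especially in tasks such as surface inspection. Successful execution requires maintaining contact while accurately tracking the object's contour. In this work, we propose a Vision-Based Tactile Model Predictive Control (VBT-MPC) framework for robotic contour following with a vision-based tactile sensor (VBTS) mounted in an eye-in-hand configuration. The proposed controller operates directly in contour-feature space, which removes the need for a separate pose-estimation module or a complex force-control architecture. We further compare VBT-MPC with visual-servoing strategies adapted to tactile features, and evaluate contour-tracking performance in simulation and real-world experiments on objects with different geometries and materials.

English abstract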

Tactile sensing plays a key role in robotic manipulation, particularly in tasks like surface inspection. Successful execution requires maintaining contact while accurately tracking object contours. In this work, we propose a Vision-Based Tactile Model Predictive Control (VBT-MPC) framework for robotic contour following using a Vision-Based Tactile Sensor (VBTS) mounted in an eye-in-hand configuration. The proposed controller operates directly in contour features space, thereby avoiding the need for separate pose-estimation modules or complex force-control architectures. We further compare our VBT-MPC with visual-servoing strategies adapted to tactile features, and evaluate contour tracking on objects with diverse geometries and materials in both simulation and real-world experiments.

2605.20390 2026-05-21 cs.CV cs.AI cs.LG cs.RO version update

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

Yingwei Li, Xin Huang, Yang Liu, Yang Fu, Alex Zihao Zhu, Chen Song, Junwen Yao, Anant Subramanian, Hao Xiang, Weijing Shi, Yuliang Zou, Tom Hoddes, Zhaoqi Leng, Govind Thattai, Dragomir Anguelov, Mingxing Tan

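Affiliations: Waymo; UCSD

AI summary: This paper studies large-scale training for autonomous-driving perception systems; by extending the input modalities and training large models, it achieves new state-of-the-art performance on the Waymo dataset.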

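Details
AI-generated abstract (translated from Chinese)

Model scaling through large-scale training on diverse datasets has shown remarkable success. It remains unclear, however, whether the same paradigm applies to autonomous-driving perception systems, which face unique challenges such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we carry out a systematic analysis of how scale affects these systems. We build the STELLAR model on the Sparse Window Transformer and extend its input modalities to include LiDAR, radar, camera, and map priors. We train the model on a large-scale dataset of 50 million driving examples, with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends relating model performance to model size, data, and compute. The resulting model sets a new state of the art on the Waymo Open Dataset challenge, surpassing prior results by a large margin. Our work shows that large-scale training is a highly promising path for advancing perception models for autonomous driving.

English abstract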

Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study that systematically analyzes the impact of scale on these systems. We develop our STELLAR model based on the Sparse Window Transformer, extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state of the art on the Waymo Open Dataset challenge, outperforming prior art by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

2605.20373 2026-05-21 cs.RO cs.AI cs.CV version update

SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

Tianshu Wu, Xiangqi Kong, Yue Chen, Qize Yu, Hang Ye, Jia Li, Yizhou Wang, Hao Dong

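Affiliations: CFCS, School of Computer Science, Peking University; School of Computer Science and Engineering, Beihang University

AI summary: This work proposes SUGAR, a framework that converts diverse videos into deployable humanoid loco-manipulation skills without task-specific reward engineering or reference-motion conditioning. It performs strongly on six representative tasks in simulation and on real hardware, demonstrating scalability and zero-shot real-world transfer.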

Comments Project Page: https://tianshuwu.github.io/sugar-humanoid/

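Details
AI-generated abstract (translated from Chinese)

Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods rely on laborious task-specific reward engineering, on rigid replay of reference motions that fails to generalize, or on costly teleoperation that limits scalability. Although human videos capture diverse behaviors, the motion priors inferred from them are inherently imperfect: occlusion, contact artifacts, and retargeting errors make them unsuitable for direct policy learning. To address this, we propose SUGAR, a scalable data-driven framework that converts diverse videos into deployable humanoid loco-manipulation skills without any task-specific reward engineering or reference-motion conditioning. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors, including human-object motion trajectories and contact labels, from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and a progressive state pool to turn the imperfect priors into physically feasible, high-fidelity skills. Third, the refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and on real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and its performance improves clearly as the amount of human video data grows. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project page: https://tianshuwu.github.io/sugar-humanoid/

English abstract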

Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/

2605.20355 2026-05-21 cs.RO cs.HC cs.LG version update

Proximal State Nudging: Reducing Skill Atrophy from AI Assistance

Megha Srivastava, Jonathan Ouyang, Eric Zhou, Andrew Silva, Emily Sumner, Dorsa Sadigh, Yuchen Cui, Deepak Gopinath, Guy Rosman

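Affiliations: Stanford University; University of California Los Angeles; Toyota Research Institute

AI summary: This paper proposes Proximal State Nudging (PSN), a shared autonomy algorithm that nudges users toward the states that are most learnable, jointly optimizing skill development and task performance to reduce skill atrophy under AI assistance.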

Comments 9 pages

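Details
AI-generated abstract (translated from Chinese)

Skill atrophy, the gradual decline of human capability under AI assistance, poses a safety risk in shared control of semi-autonomous systems, where operators may be unable to tell their own inputs apart from autonomous corrections. We propose Proximal State Nudging (PSN), a shared autonomy algorithm that jointly optimizes skill development and task performance by nudging users toward the states estimated to be most learnable. Using simulated students in the classic LunarLander environment, we first show that PSN outperforms existing shared autonomy baselines at balancing student improvement in unassisted reward against overall shared performance. We then present what are, to the best of our knowledge, the first human-subject studies of a planner that incorporates learning-compatible shared autonomy: across two driving tasks in the CARLA simulator (High Performance Racing and Parallel Parking, n = 60), PSN yields up to 7x larger gains in unassisted skill than standard blended shared autonomy, with 50% fewer collisions than unassisted self-practice.

English abstract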

Skill atrophy, the gradual decline of human capability under AI assistance, poses a safety risk in shared-control of semi-autonomous systems, where operators may be unable to distinguish their own inputs from autonomous corrections. We propose Proximal State Nudging (PSN), a shared autonomy algorithm that jointly optimizes for skill development and task performance by nudging users toward states estimated to be most learnable. We first show that PSN outperforms existing shared autonomy baselines in balancing student improvement in unassisted reward with overall shared performance, using simulated students in the classic LunarLander environment. We then present, to the best of our knowledge, the first human subject studies of a planner incorporating learning-compatible shared autonomy: across two driving tasks in the CARLA simulator (High Performance Racing and Parallel Parking, n = 60), PSN produces up to 7x larger gains in unassisted skill than standard blended shared autonomy, while incurring 50% fewer collisions than unassisted self-practice.
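For readers unfamiliar with the "standard blended shared autonomy" baseline mentioned above, linear blending of the human and autonomous commands is one common form. The snippet below is a minimal sketch of that baseline only, assuming a fixed blending weight `alpha`; the function name and the example actions are hypothetical, and PSN itself is not reproduced here.

```python
def blend_actions(human_action, autonomy_action, alpha):
    """Linearly blend two action vectors; alpha is the autonomy's share in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(human_action) != len(autonomy_action):
        raise ValueError("action vectors must have the same length")
    # alpha = 0 passes the human command through; alpha = 1 is fully autonomous.
    return [(1.0 - alpha) * h + alpha * a
            for h, a in zip(human_action, autonomy_action)]


# Hypothetical steering/throttle commands from the human and the assistant.
blended = blend_actions([1.0, 0.0], [0.0, 1.0], alpha=0.25)
```

Here `blended` is `[0.75, 0.25]`: the executed command stays close to the human input while the assistant contributes a quarter of the correction.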

2605.20304 2026-05-21 cs.RO version update

Terrestrial Soft Mobile Robots: A Review

Dimuthu D. K. Arachchige

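Affiliations: School of Computing, DePaul University; Department of Computer Science, Hampton University

AI summary: This article reviews the current state of soft mobile robot research, focusing on locomotion strategies, actuation methods, modeling approaches, and control systems for wheelless terrestrial locomotion systems, and identifies the key challenges to widespread adoption of soft mobile robots across application areas.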

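Details
AI-generated abstract (translated from Chinese)

Soft mobile robots have emerged as a promising research area with potential applications across many disciplines, including but not limited to search and rescue, service, surveillance, exploration, and manufacturing. In this article, we provide a comprehensive review of the current state of soft mobile robot research, focusing on wheelless terrestrial locomotion systems. We cover past and present developments in locomotion strategies, actuation methods, modeling approaches, and control systems. We also identify the key research challenges that must be overcome for soft mobile robots to be widely adopted across applications. Overall, this article is a resource for researchers and practitioners interested in soft mobile robots and soft robotics.

English abstract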

Soft mobile robots have emerged as a promising area of research with potential applications in various disciplines including but not limited to search-and-rescue, service, surveillance, explorations, and manufacturing. In this article, we provide a comprehensive review of the current state of soft mobile robot research, focusing on wheelless terrestrial locomotive systems. We include past and present developments in locomotion strategies, actuation methods, modeling approaches, and control systems. Further, we identify key research challenges that must be overcome to enable the widespread adoption of soft mobile robots in various applications. Overall, this article provides a valuable resource for researchers and practitioners interested in the field of soft mobile robots and soft robotics.

2605.20299 2026-05-21 cs.LG cs.AI cs.RO version update

Mechanisms of Misgeneralization in Physical Sequence Modeling

Kento Nishi, Raphael Tang, Karun Kumar, Core Francisco Park, Hidenori Tanaka

Affiliations: Harvard College; Harvard John A. Paulson School of Engineering and Applied Sciences; Comcast AI; CBS-NTT Program in Physics of Intelligence, Harvard University; Physics of Artificial Intelligence Group, NTT Research, Inc., Sunnyvale, CA, USA; Microsoft

AI summary: This paper studies physical misgeneralization in physical sequence modeling, which is caused by the propagation of local errors. It proposes a data deviation kernel to predict which physical quantities gain or lose mass, along with a kernel-informed intervention.

Comments Preprint. kentonishi.com/physical-misgeneralization

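Details
AI-generated abstract (translated from Chinese)

Generative sequence models are often used to plan motion in physical domains, from robotics to mechanical simulation. When building a dataset to train such a model, engineers may curate demonstrations to specify how trajectories should be distributed over a physical quantity such as travel distance or mechanical energy. For example, a roboticist building a maze-navigation agent might choose demonstrations whose travel distances cover a fixed range uniformly, hoping to constrain the agent's expected power usage. We find that standard deep learning can violate this intent: each generated trajectory looks plausible on its own, but the aggregate distribution over the physical quantity is wrong. We call this failure physical misgeneralization and develop an account of its mechanism. Using controlled synthetic tasks, we show that physical misgeneralization arises when local errors typical of the model class propagate through the physical measurement and shift the recovered distribution. We estimate these errors with a data deviation kernel and use it to predict which physical quantities gain or lose mass, both in our synthetic tasks and in more applied maze-navigation and double-pendulum motion tasks. Finally, our mechanistic account helps identify which mitigation strategies are structurally promising, and we use it to propose a kernel-informed intervention.

English abstract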

Generative sequence models are often trained to plan motion in physical domains, from robotics to mechanical simulations. When constructing a dataset to train such a model, engineers may curate demonstrations to specify how trajectories should be distributed over a physical quantity like travel distance or mechanical energy. For example, a roboticist building a maze navigation agent might choose demonstrations whose travel distances cover a fixed range uniformly, hoping to constrain the agent's expected power usage. We find that standard deep learning can violate this intent: each generated trajectory can seem plausible on its own, but the aggregate distribution over the physical quantity is wrong. We call this failure physical misgeneralization, and develop an account of its mechanism. Using controlled synthetic tasks, we show that physical misgeneralization arises when local errors typical of the model class propagate through the physical measurement to shift the recovered distribution. We estimate these errors with a data deviation kernel, and we use it to predict which physical quantities gain or lose mass in both our synthetic and more applied maze navigation and double-pendulum motion tasks. Finally, our mechanistic interpretation helps identify which mitigation strategies are structurally promising, and we use it to propose a kernel-informed intervention.
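The failure described above is only visible in aggregate, so a check has to compare distributions of the physical quantity instead of inspecting single trajectories. The following is a minimal sketch of such a check for travel distance, assuming 2D waypoint lists, fixed histogram bins, and total variation distance as the comparison; all names and the toy trajectories are hypothetical, and the paper's data deviation kernel is not implemented here.

```python
import math


def travel_distance(trajectory):
    """Sum of straight-line segment lengths along a list of waypoints."""
    return sum(math.dist(p, q) for p, q in zip(trajectory, trajectory[1:]))


def histogram(values, lo, hi, n_bins):
    """Normalized histogram over [lo, hi]; out-of-range values go to the edge bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for value in values:
        index = int((value - lo) / width)
        counts[min(max(index, 0), n_bins - 1)] += 1
    return [count / len(values) for count in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical 2D trajectories standing in for curated and generated data.
curated = [[(0, 0), (1, 0)], [(0, 0), (3, 0)]]
generated = [[(0, 0), (1, 0)], [(0, 0), (1, 0)]]
gap = total_variation(
    histogram([travel_distance(t) for t in curated], 0.0, 4.0, 4),
    histogram([travel_distance(t) for t in generated], 0.0, 4.0, 4),
)
```

In this toy case every generated path is a valid short path, yet `gap` is 0.5 because the long-distance half of the curated distribution is missing from the generated set.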

2605.20264 2026-05-21 cs.RO cs.HC version update

Adaptive Human-Robot Collaboration for Masonry Construction Under Material and Assembly Uncertainty

Jutang Gao, Arash Adel

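Affiliations: Princeton University

AI summary: This paper proposes an adaptive human-robot collaborative workflow that addresses the tolerance accumulation caused by material and assembly uncertainty in masonry construction, using projection guidance and laser-scanning feedback to achieve precise collaboration.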

Comments Accepted for publication in Proceedings of the 43rd International Symposium on Automation and Robotics in Construction (ISARC 2026)

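Details
AI-generated abstract (translated from Chinese)

Human-robot collaboration in construction is often challenged by limited robot-to-human communication and by tolerance accumulation arising from material and assembly uncertainties. This paper presents an adaptive human-robot collaborative workflow for masonry construction, which relies on an end-effector-mounted projector that provides spatially registered, just-in-time projection guidance for manual adhesive application, and on laser scanning for feedback-driven correction of grasping and placement poses. Together, these mechanisms allow human and robot actions to be adjusted in response to material variability and accumulated assembly tolerances. In full-scale experiments on conventional running-bond and nonstandard configurations, projection guidance improved the consistency of adhesive application and reduced application time, while laser-based correction kept courses level and avoided the collision-prone failures associated with open-loop execution. These results indicate that combining spatial projection with feedback-driven adaptation, enabled by material and as-built sensing, can mitigate tolerance accumulation and improve the precision and robustness of human-robot collaborative construction.

English abstract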

Human-robot collaboration in construction is often challenged by limited robot-to-human communication and the need to adapt to tolerance accumulation arising from material and assembly uncertainties. We present an adaptive human-robot collaborative workflow for masonry construction that addresses communication limitations and tolerance accumulation, demonstrated through a brickwork case study in which a robot places bricks while a human applies adhesive. This workflow is enabled by two complementary mechanisms: 1) an end-effector-mounted projector that provides spatially registered, just-in-time projection guidance for manual adhesive application, and 2) laser scanning for feedback-driven grasping and placement pose correction. Together, these mechanisms enable adjustment of human and robotic actions in response to material variability and accumulated assembly tolerances. Full-scale experiments across conventional running-bond and nonstandard configurations demonstrate that projection guidance improves adhesive application consistency and reduces application time, while laser-based correction maintains level courses and avoids collision-prone failures associated with open-loop execution. These results indicate that integrating spatial projection with feedback-driven adaptation, enabled by material and as-built sensing, can mitigate tolerance accumulation and improve precision and robustness in human-robot collaborative construction.

2605.20209 2026-05-21 cs.GR cs.LG cs.RO version update

NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

Chia-Wen Chen, Yan Wu, Korrawe Karunratanakul, Siyu Tang

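Affiliations: ETH Zurich

AI summary: This paper proposes NaP-Control, which uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, achieving fast, robust, high-fidelity character control. Interacting with the environment to optimize task rewards raises success rates and enables adaptation to challenging scenarios.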

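Details
AI-generated abstract (translated from Chinese)

Achieving precise, versatile whole-body character control in physics-based animation remains challenging. Recent diffusion-based policies generate rich, expressive motions but usually rely on gradient-based test-time guidance to meet task objectives, which is slow and can reduce robustness. We introduce NaP-Control (Navigating Diffusion Prior for Versatile and Fast Character Control), or NaP for short. Our method uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, steering it toward task-specific behaviors for fast, robust, high-fidelity control. Unlike methods that rely solely on offline training, NaP interacts with the environment during training to correct motions and optimize task rewards, which raises success rates and lets the system adapt to challenging scenarios. By directly predicting task-optimized diffusion noise, NaP eliminates iterative guidance during denoising and enables efficient inference. Experiments show that NaP achieves higher success rates and faster inference across diverse tasks while preserving natural motion.

English abstract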

Achieving precise, versatile whole-body character control in physics-based animation remains challenging. Recent diffusion-based policies generate rich and expressive motions but typically rely on gradient-based test-time guidance to satisfy task objectives, which is slow and can reduce robustness. We introduce NaP-Control (Navigating Diffusion Prior for Versatile and Fast Character Control), abbreviated as NaP. Our method uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, steering it toward task-specific behaviors for fast, robust control with high motion fidelity. In contrast to methods that rely solely on offline training, NaP interacts with the environment during training to correct motions and optimize task rewards, improving success rates and enabling adaptation to challenging scenarios. By directly predicting task-optimized diffusion noise, NaP eliminates iterative guidance during denoising and enables efficient inference. Experiments show that NaP attains higher success rates and faster inference while preserving natural motion across diverse tasks.

2603.14698 2026-05-21 cs.RO version update

Dual Quaternion Based Contact Modeling for Fast and Smooth Collision Recovery of Quadrotors

Valentin Gaucher, Wenlong Zhang

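Affiliations: School of Manufacturing Systems and Networks, Ira A. Fulton Schools of Engineering, Arizona State University

AI summary: This paper proposes a dual-quaternion contact model that lets quadrotors recover from collisions quickly and smoothly in cluttered environments. By operating on the unified spatial twist, it couples the normal and tangential impulse components, and it reduces execution latency and peak kinetic energy.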

Comments 8 pages, 5 figures, 2 tables

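Details
AI-generated abstract (translated from Chinese)

Unmanned aerial vehicles (UAVs) operating in cluttered environments need efficient and accurate impact modeling to remain stable after collisions, yet classical impulse contact models decouple the normal and tangential components. This letter presents a dual quaternion impulse reset map defined directly on the SE(3) manifold. By operating on the unified spatial twist (linear and angular velocity combined), the proposed formulation retains the cross-coupling between normal and tangential impulse components in a single closed-form expression and recovers the classical decoupled Newton impulse model as a special case. A recovery controller is designed that couples linear and angular momentum to enforce kinetic energy dissipation across impacts. Hardware-in-the-loop benchmarks show a 24% reduction in execution latency compared with an optimized matrix-based implementation and a 20% reduction relative to a position-plus-quaternion (PQ) formulation. MuJoCo simulations with Monte Carlo sweeps over impact angles and friction coefficients show a 50.8%-75.1% reduction in position root-mean-square error (RMSE) and a 68.7%-85% decrease in peak kinetic energy compared with published linear-admittance baselines.

English abstract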

Unmanned aerial vehicles (UAVs) operating in cluttered environments require efficient and accurate impact modeling to maintain stability after collisions; however, classical impulse contact models decouple the normal and tangential components. This letter presents a dual quaternion impulse reset map directly on the SE(3) manifold. By operating on the unified spatial twist (unified linear and angular velocities), the proposed formulation retains the cross-coupling between normal and tangential impulse components in a single closed-form expression, and recovers the classical decoupled Newton impulse model as a special case. A recovery controller is designed that couples linear and angular momentum to enforce kinetic energy dissipation across impacts. Hardware-in-the-loop benchmarks demonstrate a 24% reduction in execution latency compared to an optimized matrix-based implementation, and a 20% reduction relative to a position-plus-quaternion (PQ) formulation. MuJoCo simulations across Monte Carlo sweeps over impact angles and friction coefficients show a 50.8%-75.1% reduction in position root-mean-square error (RMSE) and a 68.7%-85% decrease in peak kinetic energy compared to published linear-admittance baselines.
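The classical decoupled Newton impulse model that the abstract names as a special case can be sketched for a single point-mass velocity. This is the textbook frictionless form, not the paper's dual quaternion reset map; the function name, the restitution value, and the unit-normal convention are assumptions made for illustration.

```python
def newton_impulse_reset(velocity, normal, restitution):
    """Return the post-impact velocity under frictionless Newton restitution.

    `normal` is assumed to be a unit vector pointing away from the surface.
    """
    normal_speed = sum(v * n for v, n in zip(velocity, normal))
    if normal_speed >= 0.0:
        # Already separating from the surface: no impulse is applied.
        return list(velocity)
    # Reverse and scale only the normal component; the tangential part is unchanged.
    return [v - (1.0 + restitution) * normal_speed * n
            for v, n in zip(velocity, normal)]


# Hypothetical impact: moving along +x while descending onto a floor with normal +z.
post_impact = newton_impulse_reset([1.0, 0.0, -2.0], [0.0, 0.0, 1.0],
                                   restitution=0.5)
```

The result is `[1.0, 0.0, 1.0]`: the normal speed flips sign and is halved, while the tangential speed is untouched. That independence of the two components is the decoupling the abstract contrasts with its coupled formulation.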