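arXivDaily: a daily academic digest that syncs the full arXiv dataset and provides AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.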

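2605.21460 2026-05-21 cs.RO cs.AI cs.HC (version update)

HITL-D: Human In The Loop Diffusion Assisted Shared Control

Riley Zilka, Sergey Khlynovskiy, Allie Wang, Martin Jagersand

Affiliations: Department of Computing Science, University of Alberta

AI summary: This paper proposes HITL-D, a framework that combines a diffusion policy with human control to improve user performance on multi-step, insertion, and fine-manipulation tasks. It reduces the number of joystick control axes the user has to operate, lowers cognitive load, and, in a multi-task user study, significantly improves task completion speed and user satisfaction.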

Comments Accepted for presentation at ICRA 2026

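MC-Risk: A Multi-Component Risk Field for Risk Identification and Motion Planning

Maximilian Link, Yingjie Xu, Yingbai Hu, Yinlong Liu

Affiliations: Technical University of Munich; The Chinese University of Hong Kong; City University of Macau

AI summary: This paper proposes MC-Risk, a planner-aligned multi-component risk field for early, calibrated, and class-aware risk localization. The method linearly combines three interpretable modules (a motorized-agent field, a VRU risk field, and a road penalty field). On the collision subset of RiskBench, in what the authors describe as the first standardized quantitative evaluation of a risk-field formulation, it achieves the best risk localization and the earliest hazard indication.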

Abstract

We present MC-Risk, a planner-aligned, multi-component risk field on a bird's-eye-view grid that yields early, calibrated, and class-aware risk localization. MC-Risk linearly composes three interpretable modules: (i) a motorized-agent field that fuses a black-box multimodal trajectory predictor with an analytic Gaussian-torus construction whose lateral width grows with speed/curvature and whose height attenuates with look-ahead; (ii) a VRU risk field that replaces isotropic pedestrian blobs with a forward-biased anisotropic kernel aligned to heading and speed; and (iii) a road penalty field that exploits full HD-map topology, imposing an off-road penalty and lane-aware risk exposure for same/opposite directions. We conduct, to our knowledge, the first standardized quantitative evaluation of a risk-field formulation on RiskBench's collision subset. MC-Risk attains the best overall risk localization and the earliest hazard indication. Finally, we demonstrate a plug-and-play planning interface by using the field as an MPC cost density, enabling risk-aware trajectory generation without additional training.
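The linear composition described in the abstract can be sketched on a small bird's-eye-view grid. This is an illustrative sketch only: the kernel shape, the parameters `sigma_lat` and `k_speed`, the weights, and the placeholder vehicle and road fields are assumptions made for the example, not the formulation used in the paper.

```python
import numpy as np

def anisotropic_kernel(xx, yy, pos, heading, speed, sigma_lat=1.0, k_speed=0.5):
    """Forward-biased Gaussian kernel aligned with heading (assumed form)."""
    dx, dy = xx - pos[0], yy - pos[1]
    # Rotate grid offsets into the agent frame: lon points along the heading.
    lon = np.cos(heading) * dx + np.sin(heading) * dy
    lat = -np.sin(heading) * dx + np.cos(heading) * dy
    # Wider spread ahead of the agent than behind it, growing with speed.
    sigma_lon = np.where(lon >= 0.0, sigma_lat + k_speed * speed, sigma_lat)
    return np.exp(-0.5 * ((lon / sigma_lon) ** 2 + (lat / sigma_lat) ** 2))

def compose_risk(fields, weights):
    """Weighted sum of per-component risk fields defined on one shared grid."""
    return sum(w * f for w, f in zip(weights, fields))

xs = np.linspace(-10.0, 10.0, 81)
xx, yy = np.meshgrid(xs, xs, indexing="xy")
vru = anisotropic_kernel(xx, yy, pos=(0.0, 0.0), heading=0.0, speed=2.0)
vehicle = np.zeros_like(vru)                 # placeholder motorized-agent field
off_road = (np.abs(yy) > 5.0).astype(float)  # placeholder road penalty field
risk = compose_risk([vehicle, vru, off_road], weights=[1.0, 1.0, 0.5])
```

Because the result is a plain array over the grid, a planner could sample it along candidate trajectories as a cost term, which is the role the abstract assigns to the field in the MPC interface.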

2605.21398 2026-05-21 cs.RO (version update)

From swept contact to pose: Probe-aware registration via complementary-shape docking

Chen Chen, Yunwen Li, Yifan Xu, Xiangjie Yan, Chang Shu, Jianxia Hou, Shiji Song, Xiang Li

Affiliations: Department of Automation, Tsinghua University; Tsingscribe Medical Ltd.; D-MAVT, ETH Zürich; Peking University School and Hospital of Stomatology; Institute for Guo Qiang, Tsinghua University

AI summary: This work proposes a calibration-free registration method that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and using both contact and non-contact evidence. A global-to-local search via 3D FFT correlation is followed by continuous SE(3) refinement with Lie-algebra updates and analytic contact sensitivities, giving efficient exploration and metric-grade convergence.

Comments 8 pages, 9 figures, accepted to ICRA 2026

Abstract

Accurate registration between a prior model and the real scene is essential for high-precision robotic manipulation, yet optical methods suffer from long calibration chains, line-of-sight constraints, and fabrication errors. We propose a calibration-free alternative that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and leveraging both contact and non-contact evidence. Our solver combines a global-to-local search via 3D FFT correlation over low-discrepancy SO(3) samples with a subsequent continuous SE(3) refinement that uses Lie-algebra updates and analytic contact sensitivities. This pipeline yields efficient exploration and metric-grade convergence without fragile point correspondences. In simulation across free-form meshes, the method achieved sub-0.04 mm and sub-0.4° accuracy and was robust to pose noise and contact loss. On a tooth-preparation robot, our method attained 0.42 mm and 3.75°, outperforming an optical tracker registration while requiring no external sensors. These results demonstrate a practical and precise registration strategy for surgical and industrial robots.
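The translational part of an FFT-based search can be illustrated with the correlation theorem. The sketch below is a simplification under stated assumptions: it scores a single fixed rotation, uses circular shifts on a toy voxel grid, and uses plain overlap as the score rather than the complementary-shape score of the paper. The function name `best_translation` is hypothetical.

```python
import numpy as np

def best_translation(reference, template):
    """Return the circular shift that maximizes correlation of two voxel grids."""
    # Correlation theorem: corr = IFFT(FFT(reference) * conj(FFT(template))).
    spectrum = np.fft.fftn(reference) * np.conj(np.fft.fftn(template))
    corr = np.fft.ifftn(spectrum).real
    shift = np.unravel_index(np.argmax(corr), corr.shape)
    return tuple(int(s) for s in shift), float(corr.max())

rng = np.random.default_rng(0)
template = (rng.random((16, 16, 16)) > 0.9).astype(float)    # toy occupancy grid
reference = np.roll(template, shift=(3, 5, 2), axis=(0, 1, 2))
shift, score = best_translation(reference, template)
```

One FFT pass scores every translation at once, so a global search would only need an outer loop over sampled rotations, with the best candidates handed to a continuous refinement step.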

2605.21372 2026-05-21 cs.CV cs.AI cs.LG cs.RO (version update)

Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation

Yicheng Jiang, Jiaxu Wang, Junhao He, Zesen Gan, Junhao Li, Qiang Zhang, Jingkai Sun, Jiahang Cao, Mingyuan Sun, Xiangyu Yue, Qiming Shao

Affiliations: The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou); MMLab, The Chinese University of Hong Kong; X-Humanoid Robots; The University of Hong Kong; Tsinghua University

AI summary: This paper proposes a new pretraining framework that learns a hybrid representation, structural latent points, which combines the expressiveness of implicit representations with the structural priors of explicit ones to make visual representations for robotic manipulation more efficient and robust.

Journal ref: International Conference on Robotics and Automation 2026

Abstract

Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation, structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.

2605.21257 2026-05-21 cs.RO (version update)

Reinforcement Learning for Risk Adaptation via Differentiable CVaR Barrier Functions

Xinyi Wang, Taekyung Kim, Bardh Hoxha, Georgios Fainekos, Dimitra Panagou

Affiliations: Department of Robotics; Department of Aerospace Engineering; University of Michigan; Toyota Motor North America, Research & Development

AI summary: This paper proposes an end-to-end risk adaptation framework for crowd navigation under uncertain obstacle motion. It combines reinforcement learning with a differentiable quadratic-program safety layer based on Conditional Value-at-Risk (CVaR) barrier functions, jointly learning the nominal control input, risk level, and safety margin while enforcing explicit probabilistic safety constraints.

Comments Project page: https://anonymousrobotics9666.github.io/rlcvarbf/

Abstract

Planning through crowded environments under uncertain obstacle motions remains difficult, as stochastic interactions often induce overly conservative behavior or reduced efficiency. To address this challenge, we propose an end-to-end risk adaptation framework for crowd navigation under obstacle-motion uncertainty modeled by a Gaussian mixture model. The framework combines reinforcement learning (RL) with a differentiable quadratic-program safety layer based on Conditional Value-at-Risk (CVaR) barrier functions, jointly learning nominal control input, risk level, and safety margin and enforcing explicit probabilistic safety constraints. This design enables context-aware adaptation, promoting efficient behavior while invoking caution only when necessary. We conduct extensive evaluations in dynamic, uncertain, and crowded environments across varying obstacle densities and robot models, and further assess generalization under three out-of-distribution cases. Comparisons across optimization-based, RL-based, and integrated RL and optimization methods are provided, and the proposed method is shown to deliver the strongest overall performance in safety, efficiency, and generalization under uncertainty.
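For readers unfamiliar with CVaR, a sample-based estimate makes the role of the risk level concrete. This is a generic sketch, not the barrier-function construction of the paper: `empirical_cvar` is a hypothetical helper, and the losses are made-up numbers standing in for sampled constraint violations.

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Mean of the sampled losses at or above the alpha-quantile (the VaR)."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)   # Value-at-Risk at level alpha
    tail = losses[losses >= var]       # worst-case tail beyond the VaR
    return float(tail.mean())

samples = np.arange(1.0, 11.0)         # toy losses 1, 2, ..., 10
cvar_80 = empirical_cvar(samples, alpha=0.8)
cvar_00 = empirical_cvar(samples, alpha=0.0)
```

Raising `alpha` moves the estimate from the plain mean toward the worst sampled outcome, which is why a learned risk level can trade efficiency against caution.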

2605.21242 2026-05-21 cs.RO (version update)

To Select or not to Select, that is the Question: Distilling Robot Skill Prediction into a Small Ensemble

Haechan Mark Bong, Simon Roy, Euhid Aman, Giovanni Beltrame

Affiliations: Department of Computer Engineering and Software Engineering, Polytechnique Montréal; MILA; National Taiwan University of Science and Technology (NTUST)

AI summary: This paper studies robot skill prediction. Using a synthetic dataset and fine-tuned sentence encoders, it proposes small specialized models that outperform large general-purpose LLMs under zero-shot prompting and perform better at task routing in robot fleets.

Journal ref: ICRA 2026 Workshop on Synthetic Data for Robot Learning

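Safety-Critical Control for Smoothed Implicit Contact Dynamics

Haegu Lee, Yitaek Kim, Christoffer Sloth

Affiliations: The Maersk Mc-Kinney Moller Institute, University of Southern Denmark

AI summary: This paper proposes a method, built on a boundary-focused rollout strategy and a discrete-time control barrier function framework, for constraining contact forces under smoothed implicit contact dynamics in order to improve safety performance.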

Abstract

Smoothed implicit contact dynamics enables gradient-based planning and control for contact-rich tasks without predefined mode sequences. However, safety-critical control remains challenging because implicit contact dynamics makes safety-filter design nontrivial. The smoothing parameter $κ$ relaxes contact complementarity constraints, which makes the dynamics smooth but affects the contact force. This paper provides a method for bounding the actual contact force despite the use of relaxed complementarity constraints. We show that constraint violations can be non-monotonic in $κ$. Smaller $κ$ reduces force-approximation error, but it does not necessarily improve safety performance. To address this issue, we introduce boundary-focused rollouts to screen $κ$ by comparing the safety margin with the approximation error. We then develop a discrete-time control barrier function (CBF) framework based on a first-order Taylor approximation of the implicitly defined contact force. To account for possible force under-prediction, we augment the resulting safety constraint with a fixed robust margin. Simulations on four contact-rich systems show that the proposed method eliminates force violations observed under a standard CBF.
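A one-dimensional toy problem shows how a force defined implicitly by a relaxed complementarity condition can be linearized. Everything here is an assumption made for illustration: the gap model `phi = u + c * lam` (a commanded gap `u` plus a compliance term), the relaxation `lam * phi = kappa`, and the function names are not taken from the paper.

```python
import math

def contact_force(u, kappa, c=0.01):
    """Positive root of c*lam**2 + u*lam - kappa = 0 (relaxed complementarity)."""
    return (-u + math.sqrt(u * u + 4.0 * c * kappa)) / (2.0 * c)

def force_sensitivity(u, kappa, c=0.01):
    """d(lam)/du from the implicit function theorem applied to the residual."""
    lam = contact_force(u, kappa, c)
    # Residual F = c*lam**2 + u*lam - kappa, so dF/dlam = 2*c*lam + u, dF/du = lam.
    return -lam / (2.0 * c * lam + u)

def taylor_force(u, du, kappa, c=0.01):
    """First-order prediction of the force after a small change du in u."""
    return contact_force(u, kappa, c) + force_sensitivity(u, kappa, c) * du

exact = contact_force(-0.0199, kappa=1e-4)
approx = taylor_force(-0.02, du=1e-4, kappa=1e-4)
hard_contact_limit = contact_force(-0.02, kappa=1e-9)   # approaches -u / c = 2.0
```

In this toy model the smoothed force overestimates the hard-contact value, and the linearization error grows with `du`; a margin added to a force constraint is one way to absorb such errors.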

2605.21133 2026-05-21 cs.RO (version update)

Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

Zhizhao Liang, Yi-Lin Wei, Xuhang Chen, Mu Lin, Yi-Xiang He, Zhexi Luo, Jun-Hui Liu, Kun-Yu Lin, Wei-Shi Zheng

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University

AI summary: This paper proposes a general humanoid loco-manipulation framework that uses an Active Spatial Brain and a Generalizable Action Cerebellum to address two problems, difficult spatial understanding in complex 3D environments and poor generalization of action generation, and it shows strong performance across diverse tasks and environments.

Comments Project page: https://leungchaos.github.io/Humanoid-Whole-Body-Manipulation-via-Active-Spatial-Brain-and-Generalizable-Action-Cerebellum/

Abstract

In this paper, we explore the task of spatially aware humanoid whole-body manipulation. Compared with tabletop settings, this task poses two key challenges: 1) Spatial understanding is challenging in complex 3D environments with diverse spatial relations. 2) Action generation is difficult to generalize, as limited and costly real-robot data restricts the generalization of data-driven models. To address these challenges, we propose a generalizable humanoid loco-manipulation framework that leverages the spatial perception and action generation capabilities of multi-agent large models. Specifically, our framework includes two components: Active Spatial Brain for active spatial perception and decision-making, and Generalizable Action Cerebellum for executable robot action generation. The first component actively perceives the spatial scene and makes decisions on task planning and subtask decomposition. The second component generates executable robot actions based on the decisions made by the first component, without needing task-specific real-robot data. To benchmark our framework, we design a set of spatial manipulation tasks from two perspectives: evaluating spatial perception and understanding, and assessing real-robot task performance. The results demonstrate strong performance on both aspects across diverse tasks and environments.

2605.21111 2026-05-21 cs.RO cs.SY eess.SY (version update)

Benchmarking Empirical and Learning-Based Approaches for Feedforward Steering Control in Autonomous Racing

Georg Jank, Mattia Piccinini, Sebastian Wenk, Phillip Pitschi, Johannes Betz, Boris Lohmann

Affiliations: Chair of Automatic Control, Department of Engineering Physics and Computation, Technical University of Munich; Professorship of Autonomous Vehicle Systems (AVS), Department of Mobility Systems Engineering, Technical University of Munich

AI summary: This paper systematically evaluates two learning-based and two empirical feedforward steering controllers. The learning-based controllers have the lowest prediction error in open-loop evaluation, but in closed-loop tests their path-tracking performance and lap times are no better than those of the proposed method, which shows the need to evaluate feedforward strategies within the complete trajectory planning and control software stack.

Comments 8 pages, 12 figures, Accepted to be published as part of the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026), Naples, Italy, September 15-18, 2026

Abstract

Feedforward steering control is a key component of hierarchical control architectures for autonomous racing. The goal is to reduce steering corrections from the feedback controllers by predicting the vehicle's inverse lateral dynamics. This paper presents a systematic benchmark of two learning-based and two empirical (analytical) feedforward steering controllers. We introduce a new EHD formulation based on a polynomial surface fit that captures velocity-dependent nonlinear steering behavior with minimal parametrization. We test the feedforward controllers in a high-fidelity simulation framework based on the real-world Abu Dhabi Autonomous Racing League competition, using a high-fidelity double-track vehicle dynamics simulator. Open-loop evaluation shows that the learning-based controllers achieve the lowest prediction errors; however, closed-loop testing reveals that this improved accuracy does not translate into superior path tracking performance or lap times, even after iterative fine-tuning. In contrast, the proposed EHD approach achieves the best overall closed-loop robustness and lap time, highlighting the necessity of evaluating feedforward strategies within the complete trajectory planning and control software stack. Our code is available at https://github.com/TUMRT/steering_ff_control.
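A polynomial surface fit of this kind reduces to linear least squares over monomial features. The sketch below assumes the surface maps normalized speed and normalized path curvature to a steering angle. That choice of inputs, the polynomial degree, and the synthetic data are assumptions for illustration, not the parametrization in the paper or its repository.

```python
import numpy as np

def design_matrix(speed, curv, degree=3):
    """Monomials speed**i * curv**j with i + j <= degree, one column each."""
    speed, curv = np.asarray(speed, float), np.asarray(curv, float)
    cols = [speed**i * curv**j
            for i in range(degree + 1) for j in range(degree + 1 - i)]
    return np.stack(cols, axis=-1)

def fit_surface(speed, curv, steering, degree=3):
    """Least-squares coefficients of the polynomial surface."""
    A = design_matrix(speed, curv, degree)
    coeffs, *_ = np.linalg.lstsq(A, steering, rcond=None)
    return coeffs

def predict(coeffs, speed, curv, degree=3):
    return design_matrix(speed, curv, degree) @ coeffs

# Synthetic data: a kinematic term plus a speed-dependent understeer-like term.
rng = np.random.default_rng(0)
speed = rng.uniform(0.2, 1.0, 200)    # speed normalized by an assumed maximum
curv = rng.uniform(-1.0, 1.0, 200)    # curvature normalized the same way
steering = 0.15 * curv + 0.2 * speed**2 * curv
coeffs = fit_surface(speed, curv, steering)
max_err = np.max(np.abs(predict(coeffs, speed, curv) - steering))
```

Normalizing both inputs keeps the monomial columns on similar scales, which matters for conditioning once cubic terms are included.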

2605.21109 2026-05-21 cs.RO (version update)

Anomaly-Informed Confidence Calibration for Vision-Based Safety Prediction

Zhenjiang Mao, Jiawen Wu, Gabriel Wagner, Zhongzheng Zhang, Ivan Ruchkin

Affiliations: Trustworthy Engineered Autonomy (TEA) Lab, Department of Electrical and Computer Engineering, University of Florida

AI summary: This paper proposes an anomaly-informed online calibration method that fuses perceptual and dynamics anomaly scores to improve confidence estimates in vision-based safety prediction, reducing overconfidence under distribution shift and improving prediction performance.

Abstract

Reliable confidence estimates are important for safely deploying vision-based controllers in autonomous racing, where safety predictions must be derived from camera images, yet modern predictors become dangerously overconfident under test-time distribution shifts. We identify a critical perception-dynamics gap in existing anomaly signals: widely used scores, such as autoencoder reconstruction error, capture visual corruptions but miss dynamics anomalies (e.g., actuation bias, latency), where images remain plausible while the trajectory degrades. To address this, we propose an Anomaly-Informed Online Calibration approach that, without retraining any model component, fuses two complementary anomaly scores extracted from a world model: a perceptual score from reconstruction error and a dynamics score from epistemic uncertainty and control-stream statistics. Based on these fused scores, a lightweight temperature-scaling calibrator leverages test-time augmentation to selectively reduce overconfidence under shift while preserving nominal-condition performance. Experiments on a physical DonkeyCar under four real-world anomaly protocols unseen during training (darkness, blur, actuation bias, processing latency) reduce average expected calibration error from 0.184 to 0.116, a 37% improvement over the best baseline, without modifying the base safety predictor.
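Two ingredients named in the abstract, temperature scaling and expected calibration error (ECE), can be sketched in a few lines. The rule that maps a fused anomaly score to a temperature (`t0 + gain * anomaly`) is an assumption made for this example; the paper's calibrator may differ.

```python
import numpy as np

def scaled_confidence(logit, anomaly, t0=1.0, gain=2.0):
    """Sigmoid confidence with temperature T = t0 + gain * anomaly (assumed rule)."""
    temperature = t0 + gain * np.asarray(anomaly, float)
    return 1.0 / (1.0 + np.exp(-np.asarray(logit, float) / temperature))

def expected_calibration_error(conf, correct, n_bins=10):
    """Bin-weighted mean gap between accuracy and confidence."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Weight each bin by the fraction of samples that fall into it.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
nominal = scaled_confidence(4.0, anomaly=0.0)
shifted = scaled_confidence(4.0, anomaly=1.0)
```

A higher anomaly score raises the temperature and pulls the confidence toward 0.5, while a score of zero leaves the nominal prediction unchanged.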

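2605.21061 2026-05-21 cs.CV cs.AI cs.RO (version update)

STEAM: A Training-Free Congestion-Aware Enhancement Framework for Decentralized Multi-Agent Path Finding

Mingyang Feng, Mengnuo Zhang, Shaoyuan Li, Xiang Yin

Affiliations: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University

AI summary: This paper proposes STEAM, a training-free framework for learning-based decentralized multi-agent path finding (MAPF) in discrete environments. It injects lightweight congestion-aware guidance into policy execution and uses spatial avoidance, temporal correction, and density correction to improve success rate and efficiency.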

Abstract

We propose STEAM (Spatial, Temporal, and Emergent congestion Awareness for MAPF), a training-free test-time enhancement framework for learning-based decentralized Multi-Agent Path Finding (MAPF) in discrete environments. Given a pretrained decentralized policy, STEAM requires no retraining, architectural modification, or replacement by a centralized planner. Instead, it injects lightweight congestion-aware guidance into the original policy execution. STEAM first rolls out the shortest paths induced by the current cost-to-go maps to identify potential future congestion hotspots. Spatially avoidable congestion is mitigated by updating agent-specific cost-to-go information, while spatially unavoidable bottlenecks are handled through temporal logit correction. In addition, emergent local congestion is reduced by a density-aware logit correction based on neighboring agents' corrected cost-to-go maps. Extensive experiments on representative learning-based decentralized MAPF algorithms show that STEAM consistently improves success rate, makespan, and solution cost, with success-rate gains of up to 60% and only minor computational overhead. The implementation is available at https://anonymous.4open.science/r/STEAM-MAPF-7A62.
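The hotspot-detection step, rolling out the shortest paths induced by cost-to-go maps, can be sketched on a small grid. This is a minimal illustration under assumptions: 4-connected moves, BFS cost-to-go, greedy descent, and a hotspot defined as a cell claimed by more than one agent at the same time step. It also assumes every start can reach its goal. The function names are hypothetical.

```python
from collections import Counter, deque

MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def cost_to_go(grid, goal):
    """BFS distances to goal on a 4-connected grid (0 = free, 1 = obstacle)."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            n = (r + dr, c + dc)
            if (0 <= n[0] < len(grid) and 0 <= n[1] < len(grid[0])
                    and grid[n[0]][n[1]] == 0 and n not in dist):
                dist[n] = dist[(r, c)] + 1
                queue.append(n)
    return dist

def rollout(start, dist):
    """Descend the cost-to-go map greedily; returns the cell visited at each step."""
    path = [start]
    while dist[path[-1]] > 0:
        r, c = path[-1]
        neighbors = [(r + dr, c + dc) for dr, dc in MOVES if (r + dr, c + dc) in dist]
        path.append(min(neighbors, key=dist.get))
    return path

def hotspots(grid, starts, goals):
    """(time, cell) pairs claimed by more than one rolled-out path."""
    visits = Counter()
    for start, goal in zip(starts, goals):
        for t, cell in enumerate(rollout(start, cost_to_go(grid, goal))):
            visits[(t, cell)] += 1
    return {key: n for key, n in visits.items() if n > 1}

grid = [[1, 1, 0, 1, 1],
        [0, 0, 0, 0, 0],
        [1, 1, 0, 1, 1]]
found = hotspots(grid, starts=[(1, 1), (0, 2)], goals=[(1, 3), (2, 2)])
```

Here two agents cross the same junction at step 1, so that cell is flagged. A guidance layer could then raise the cost-to-go around it or bias action logits in time, as the abstract describes.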

2605.20917 2026-05-21 cs.RO (version update)

SubTGraph: Large-Scale Subterranean Environment Synthesis with Controllable Topological Variability for Robotic Autonomy Validation

F. Labra Caso, A. Saradagi, S. Fredriksson, S. Nordström, A. Koval, G. Nikolakopoulos

Affiliations: Robotics & AI, Luleå University of Technology

AI summary: This paper proposes SubTGraph, a framework for rapidly synthesizing highly variable multi-level subterranean environments. It generates different types of underground environments from user-specified parameters such as topology, dimensionality, and textures, for rigorous validation of each layer of the robot autonomy stack.

Comments 16 pages, 18 figures

Abstract

Subterranean (SubT) environments have been a frontier for autonomous robotics, driven by the push for automation of mining operations and the interest in planetary exploration (Martian Lava Tubes). Due to the challenges involved in accessing real SubT environments, rigorous hardening of autonomy stacks in realistic simulation environments is critical. This article fills a well-known gap, which relates to the unavailability of a large-scale simulation-based benchmarking infrastructure for rigorous statistical evaluation of robotic autonomy, due to which it is common for SubT research articles to present validation results in a few environments at best. This article presents SubTGraph, a novel framework for rapid synthesis of multi-level SubT environments with high variability, incorporating user specifications related to topology, dimensionality, textures, etc., to generate distinct environments such as operational mines, natural caves and lava tubes. SubTGraph builds a cost matrix from user-specified structural constraints to guide the classical Dijkstra algorithm to procedurally generate SubT worlds utilizing topometric tiles from the DARPA World Generator. Three robotics case-studies are investigated to demonstrate the utility of SubTGraph for rigorous validation of different layers in the robotic autonomy stack. Structural semantic segmentation is validated against topometric ground truths, multi-agent path planning is widely tested for identification of patterns and trends in the algorithm behavior and LIO SLAM is stress-tested in challenging subterranean sections to identify failure cases. The SubTGraph world creation codebase is open-sourced (https://github.com/LTU-RAI/SubTGraph.git) along with a database consisting of 150 highly variable underground worlds.
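The idea of steering Dijkstra's algorithm with a cost matrix can be shown on a small grid. This sketch assumes non-negative per-cell entry costs and 4-connected moves; how SubTGraph derives its cost matrix from structural constraints and maps paths to tiles is not reproduced here.

```python
import heapq

def dijkstra_path(cost, start, goal):
    """Cheapest 4-connected path; cost[r][c] is the price of entering a cell."""
    rows, cols = len(cost), len(cost[0])
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nd
                    parent[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk the parent links back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], best[goal]

cost = [[1, 9, 1],
        [1, 9, 1],
        [1, 1, 1]]
path, total = dijkstra_path(cost, start=(0, 0), goal=(0, 2))
```

Raising the cost of the middle column pushes the tunnel around it, which is the sense in which a cost matrix can encode structural preferences for the generated layout.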

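2605.20894 2026-05-21 cs.RO (version update)

Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation

Haoran Huang, Haonan Dong, Huixu Dong

Affiliations: Zhejiang University

AI summary: This paper proposes Mobile UMI, a hardware-free demonstration framework whose three components address two bottlenecks in mobile imitation learning: locomotion-contaminated action labels and inference-induced execution latency. Dual cameras capture global and local context, a spatial anchor unifies the visual-inertial frames to yield decoupled manipulation and base trajectories, and an asynchronous receding-horizon executor performs online state matching.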

Abstract

Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: locomotion-contaminated action labels and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, yet a single wrist view does not capture the global context required for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base advances past predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps through three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second, a one-shot ChArUco-based spatial anchor unifies the chest and hand visual-inertial frames; the hand pose is then re-expressed relative to the chest to extract decoupled SE(3) manipulation and SE(2) base trajectories. Third, an asynchronous receding-horizon executor performs online state matching: each generated action chunk is realigned with the current physical pose so that expired waypoints are discarded before execution. The full system is evaluated on four long-horizon household tasks, achieving an average success rate of 83.8% over 100 trials per task. Controlled comparisons against ACT and Diffusion Policy show that the chest-relative label alone closes much of the gap; online state matching closes the remainder. These results indicate that, for mobile imitation learning under the tested conditions, explicit kinematic factorization combined with state-level latency alignment provides an effective solution without requiring architectural changes to the underlying policy class.
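Re-expressing the hand pose relative to the chest is a single transform composition, sketched below with homogeneous matrices. The frame names, the yaw-only rotations, and the use of the chest pose as a stand-in for the base pose are simplifying assumptions for this example.

```python
import numpy as np

def make_pose(yaw, translation):
    """4x4 homogeneous transform with a rotation about z (enough for this demo)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def hand_in_chest(T_world_chest, T_world_hand):
    """Hand pose expressed in the chest frame: inv(T_world_chest) @ T_world_hand."""
    return np.linalg.inv(T_world_chest) @ T_world_hand

def base_se2(T_world_chest):
    """Planar (x, y, yaw) of the chest, used here as a stand-in base pose."""
    yaw = np.arctan2(T_world_chest[1, 0], T_world_chest[0, 0])
    return T_world_chest[0, 3], T_world_chest[1, 3], yaw

# The hand holds a fixed offset from the chest while the demonstrator walks.
T_chest_hand = make_pose(0.2, [0.3, 0.0, -0.2])
chest_a = make_pose(0.0, [0.0, 0.0, 1.3])
chest_b = make_pose(0.7, [1.5, -0.4, 1.3])
label_a = hand_in_chest(chest_a, chest_a @ T_chest_hand)
label_b = hand_in_chest(chest_b, chest_b @ T_chest_hand)
```

Both labels come out identical even though the world-frame hand poses differ, which is the decoupling effect the abstract attributes to chest-relative labels.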

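2605.20856 2026-05-21 cs.RO cs.AI cs.LG (version update)

Demo-JEPA: A Joint-Embedding Predictive Architecture for One-Shot Cross-Embodiment Imitation

Jingyang He, Guangrun Li, Jieyu Zhang, Chengkai Hou, Zhengping Che, Shanghang Zhang

Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; University of Washington; Beijing Innovation Center of Humanoid Robotics

AI summary: This paper proposes Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. It translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space, and the target agent reaches these subgoals through planning, which enables flexible imitation across heterogeneous embodiments.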

Abstract

Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment-specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action-centric view requires shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. Built on a JEPA-based world model, Demo-JEPA translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo-JEPA avoids action-level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real-world manipulation tasks show that Demo-JEPA matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.

2605.20801 2026-05-21 cs.RO quant-ph (version update)

Q-SpiRL: Quantum Spiking Reinforcement Learning for Adaptive Robot Navigation

Mohamed Khair Altrabulsi, Nouhaila Innan, Alberto Marchisio, Muhammad Kashif, Muhammad Shafique

Affiliations: eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD); Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute

AI summary: This paper proposes Q-SpiRL, a framework built on quantum-enhanced spiking neural networks for efficient and stable robot navigation in dynamic environments. Experiments show that it achieves the best balance among task completion, trajectory efficiency, and motion smoothness.

Comments 11 pages, 6 figures

Abstract

Adaptive robot navigation in dynamic environments requires policies that can reach the target reliably while producing efficient and stable trajectories. This paper presents Q-SpiRL, a quantum spiking reinforcement learning framework for obstacle-aware robot navigation. The framework develops and evaluates five agent families: tabular Q-learning, classical MLP, classical SNN, quantum-enhanced MLP (QMLP), and quantum-enhanced spiking neural network (QSNN). While all models are implemented under a unified training and evaluation pipeline, the QSNN is the central architecture of interest, as it combines spike-based temporal processing with variational quantum feature transformation. Experiments are conducted across three grid-world environments of increasing size, namely 20x20, 30x30, and 40x40, with both static and dynamic obstacles. Performance is assessed using success rate, success-weighted path length, path length, and turn rate under deterministic inference. Results show that QSNN achieves the strongest overall trade-off between task completion, trajectory efficiency, and motion smoothness, reaching up to 99% success rate while maintaining high path efficiency in the most challenging setting. Execution on IBM quantum hardware further demonstrates the feasibility of deploying the proposed hybrid policy under real-device conditions.
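Success-weighted path length is commonly computed as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest path length, and p_i the length of the path actually taken. The sketch below uses that common definition; whether the paper uses exactly this form is an assumption.

```python
def success_weighted_path_length(successes, shortest, taken):
    """Mean of S_i * l_i / max(p_i, l_i); shortest lengths must be positive."""
    terms = [s * l / max(p, l) for s, l, p in zip(successes, shortest, taken)]
    return sum(terms) / len(terms)

# Three episodes: optimal success, success with a detour, and a failure.
spl = success_weighted_path_length(
    successes=[1, 1, 0], shortest=[10, 10, 10], taken=[10, 20, 5]
)
```

A failed episode contributes zero regardless of its path, and a successful detour is discounted by its extra length, so the metric rewards reaching the goal efficiently rather than merely reaching it.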

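2605.20796 2026-05-21 cs.RO Version update

CMC-Opt: Constraint Manifold with Corners for Inequality-Constrained Optimization

Yetong Zhang, Frank Dellaert

Affiliations: College of Computing, Georgia Institute of Technology, Atlanta, USA

AI summary: This paper presents a manifold-based framework for robotics optimization problems with both equality and inequality constraints. By introducing constraint manifolds with corners, it converts the original problem directly into an unconstrained optimization over the constrained state space, and it validates the framework's effectiveness and robustness on large-scale dynamics planning problems.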

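2605.20774 2026-05-21 cs.RO Version update

Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition

Benedict Quartey, Sebastian Castro, Eric Rosen, Wil Thomason, George Konidaris, Stefanie Tellex

Affiliations: Brown University; Robotics & AI Institute

AI summary: This paper proposes skills that jointly learn predicates and actions, using closed-loop visuomotor policies so that a robot can compose known skills zero-shot without retraining.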

Abstract

Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: https://planpacts.github.io/

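2605.20644 2026-05-21 cs.LG cs.AI cs.RO Version update

Fault-Tolerant, Rigidity-Preserving Control of Inflatable Truss Robots

James Wade, Isaac Weaver, Mihai Stanciu, Nathan Usevitch

Affiliations: Ira A. Fulton School of Engineering, Mechanical Engineering Department, Brigham Young University

AI summary: This paper presents a fault-tolerant control framework for an inflatable robotic truss that maintains functionality despite motor failures, through three contributions: extending the kinematic optimization to handle arbitrary combinations of motor failures, introducing discrete-time control barrier function constraints that guarantee structural rigidity, and implementing closed-loop position control using onboard encoder feedback and a forward-kinematics-based state estimator.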

Abstract

Isoperimetric robotic trusses can adapt to different tasks and environments because they have a high strength-to-weight ratio, can change their own shape dramatically, and can be reconfigured into a variety of different shapes. However, motor failures in operational environments can severely limit operational capabilities if not properly addressed. This paper presents a fault-tolerant control framework for an inflatable robotic truss that maintains functionality despite motor failures, shown through three key contributions. First, we extend the kinematic optimization to handle arbitrary combinations of motor failures by imposing equality constraints to ensure failed actuators are not used. Second, we introduce discrete-time control barrier function (DTCBF) constraints that mathematically guarantee structural rigidity while maximizing workspace utilization, a critical requirement for reliable operation of truss robots under discrete-time control. Third, we implement closed-loop position control using onboard encoder feedback and a forward kinematics-based state estimator, improving positional accuracy in the presence of disturbances. We validate our approach through simulation and hardware experiments on a 2D isoperimetric truss testbed. For a 2D configuration with 6 actuators, we demonstrate >69% workspace preservation under single-motor failures and a >25% improvement in tracking accuracy with closed-loop control. These results establish a foundation for more robust and resilient isoperimetric truss robots operating under degraded actuation.
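To make the discrete-time control barrier function idea concrete, here is a minimal sketch of the commonly used DTCBF condition h(x_{k+1}) - h(x_k) >= -gamma * h(x_k) with 0 < gamma <= 1. The scalar system, the barrier `h(x) = x - 1` (a stand-in for a rigidity margin), and all function names are assumptions for illustration, not the paper's truss model or optimizer.

```python
def dtcbf_holds(h_now, h_next, gamma=0.5):
    """True if the barrier value shrinks by at most a factor gamma in one step."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return h_next - h_now >= -gamma * h_now


def filter_safe_inputs(x, candidates, step, barrier, gamma=0.5):
    """Keep only the candidate inputs whose successor state satisfies the DTCBF condition."""
    h_now = barrier(x)
    return [u for u in candidates if dtcbf_holds(h_now, barrier(step(x, u)), gamma)]


# Toy system: x_{k+1} = x_k + u, safe set {x : x - 1 >= 0}.
safe_inputs = filter_safe_inputs(
    x=2.0,
    candidates=[-1.0, -0.5, 0.0, 0.5],
    step=lambda x, u: x + u,
    barrier=lambda x: x - 1.0,
)
```

With `h(2.0) = 1.0` and `gamma = 0.5`, the successor must keep `h >= 0.5`, so the input `-1.0` is rejected while the others pass. In an optimization-based controller the same inequality would appear as a constraint rather than a filter.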

2605.20551 2026-05-21 cs.CV cs.AI cs.RO Version update

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

Zichao Zeng, June Moh Goo, Junwei Zheng, Weijia Fan, Jiaming Zhang, Rainer Stiefelhagen, Jan Boehm

Affiliations: University College London; Karlsruhe Institute of Technology; Hunan University; Shenzhen University

AI summary: This paper proposes the Weighted Aggregated Descriptor (WeiAD) and the token pruning framework WeiToP to improve the accuracy and efficiency of visual place recognition, with an accuracy-efficiency trade-off that can be adjusted at inference time.

Abstract

Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.
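The inference-time pruning step described above can be pictured with a small sketch: given per-token importance scores, keep only the top fraction of tokens before the remaining transformer layers run. How WeiToP actually produces the scores (a distilled pruning module) is not reproduced here; the function name, the `keep_ratio` parameter, and the plain-list token representation are assumptions.

```python
def prune_tokens(tokens, importance, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(importance):
        raise ValueError("need one importance score per token")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in (0, 1]")
    keep = max(1, round(len(tokens) * keep_ratio))
    # Rank token indices by importance, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    # Restore positional order so downstream layers see a consistent sequence.
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]
```

Lowering `keep_ratio` trades accuracy for speed without retraining, which is the kind of on-demand control the abstract describes.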

2605.20544 2026-05-21 cs.RO cs.CV Version update

The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

Doguhan Yeke, Elif Su Temirel, Ananth Shreekumar, Brandon Lee, Dongyan Xu, Z Berkay Celik

Affiliations: Purdue University; Bilkent University

AI summary: This paper presents RoboAbstention, a benchmarking framework for abstention in embodied robotic agents that generates abstention instructions grounded in images from five robotics datasets, evaluates several frontier VLMs on abstention, and explores methods for improving it.

Abstract

Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.

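2605.19503 2026-05-21 cs.RO cs.AI cs.LG Version update

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

Carlo Romeo, Andrew D. Bagdanov

Affiliations: Media Integration and Communication Center – University of Florence

AI summary: This paper presents ARC-RL, a reinforcement learning playground of four MuJoCo continuous-control environments whose robot morphologies are inspired by the bestiary of ARC Raiders. Using a unified observation template, action convention, and reward function, it studies how reinforcement learning algorithms perform across morphologies and under animation-style constraints.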

Abstract

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.
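One plausible reading of the "velocity-tracking tent" term is a triangular function that peaks at the commanded speed and falls linearly to zero. The sketch below shows that shape and a weighted sum of named reward terms. The exact functional form, the `half_width` parameter, and the term names are assumptions, not taken from the ARC-RL source code.

```python
def velocity_tent(v, v_target, half_width):
    """Triangular reward: 1 at the target speed, 0 beyond half_width away from it."""
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return max(0.0, 1.0 - abs(v - v_target) / half_width)


def total_reward(terms, weights):
    """Weighted sum of reward terms; per-morphology variation would live in `weights`."""
    return sum(weights[name] * value for name, value in terms.items())
```

Keeping the terms fixed and varying only `weights` mirrors the abstract's statement that per-morphology differences are confined to a small set of weights and parameters.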

2605.19138 2026-05-21 cs.RO cs.AI cs.LG Version update

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan, Aryan Sarswat, Ranjani Koushik, Masoud Moghani, Ajay Mandlekar, Animesh Garg

Affiliations: Georgia Institute of Technology; University of California, Berkeley; New York University Abu Dhabi (NYUAD); University of Toronto; NVIDIA

AI summary: This paper presents COBALT, a cloud-based teleoperation platform for collecting high-quality robot learning data at scale with smartphones and other common devices, improving the efficiency of robot learning in both simulation and the real world.

Abstract

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.

2605.17776 2026-05-21 cs.RO Version update

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Xiangyue Wang, Hanxuan Chen, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Kangli Wang, Ji Pei

Affiliations: Autel Robotics; Nanjing University; Peking University; Southern University of Science and Technology; University of Hong Kong

AI summary: This paper presents the CosFly-Track dataset for UAV visual tracking, generating large-scale multi-modal data through multi-constraint trajectory optimization and improving dynamic target tracking performance.

Abstract

Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations, leaving UAV visual tracking -- continuously following a moving target while maintaining visibility -- largely without dedicated training data. We introduce CosFlyTrack, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFlyTrack improves tracking performance to 78.3 to 95.6 percent SR@1 meter, a 53 to 69 percentage point gain over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.
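The dataset figures quoted in the abstract imply a sampling rate and an average trajectory length, which the short calculation below derives. Because the abstract gives approximate counts (about 12,000 trajectories, 2.4 million timesteps, about 334 hours), the derived values are approximate as well; the function names are hypothetical.

```python
def mean_sample_rate_hz(timesteps, hours):
    """Average number of recorded timesteps per second of data."""
    return timesteps / (hours * 3600.0)


def mean_trajectory_seconds(hours, trajectories):
    """Average duration of one trajectory in seconds."""
    return hours * 3600.0 / trajectories


rate = mean_sample_rate_hz(2_400_000, 334)      # close to 2 Hz
duration = mean_trajectory_seconds(334, 12_000)  # close to 100 s
```

Under these figures each trajectory averages roughly 200 timesteps, which is a useful sanity check when loading the data.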

2605.15944 2026-05-21 cs.RO cs.LG Version update

FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy

Qian He, Zhenshuo Yang, Wenqi Liang, Chunhui Hao, Nicu Sebe, Jiandong Tian

Affiliations: State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; University of Trento

AI summary: This paper proposes FocalPolicy, a visuomotor policy that uses frequency-optimized chunking and locally anchored flow matching to balance precision and foresight in continuous visuomotor control.

Abstract

Visuomotor policies aim to learn complex manipulation tasks from expert demonstrations. However, generating smooth and coherent trajectories remains challenging, as it requires balancing proximal precision with distal foresight. Existing approaches typically focus on optimizing intra-chunk action distributions, often neglecting the inter-chunk coherence. Consequently, inter-chunk discontinuities significantly impede the learning of coherent long-horizon actions. To overcome this limitation and achieve a synergetic balance between precision and foresight, we propose FocalPolicy, a foresight-aware visuomotor policy that combines Frequency-Optimized Chunking with Locally Anchored flow matching. We introduce a foresight composite objective that supervises time-domain alignment within the proximal actions while regularizing frequency-domain structure over multiple future action chunks to improve cross-chunk coherence. To efficiently learn complex action distributions, we design locally anchored sampling to enhance target signal propagation efficiency during consistency flow matching training. Extensive experiments demonstrate that FocalPolicy outperforms existing approaches and confirm the generalizability of our modules to other baselines. Project website: https://focalpolicy.github.io/
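The composite objective described above pairs a time-domain term on near-term actions with a frequency-domain term over a longer horizon. The sketch below illustrates that structure for a single scalar action channel, using a plain discrete Fourier transform. The specific loss form, the `freq_weight` value, and the function names are assumptions and should not be read as FocalPolicy's actual objective.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier coefficient of a real sequence."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def foresight_loss(pred, target, proximal_len, freq_weight=0.1):
    """Time-domain MSE on the first proximal_len actions plus a spectral MSE on the full horizon."""
    if len(pred) != len(target) or not 0 < proximal_len <= len(pred):
        raise ValueError("sequences must match and proximal_len must fit inside them")
    time_term = sum(
        (p - q) ** 2 for p, q in zip(pred[:proximal_len], target[:proximal_len])
    ) / proximal_len
    pred_mag, target_mag = dft_magnitudes(pred), dft_magnitudes(target)
    freq_term = sum((a - b) ** 2 for a, b in zip(pred_mag, target_mag)) / len(pred_mag)
    return time_term + freq_weight * freq_term
```

The spectral term penalizes oscillation across the whole horizon even where exact per-step alignment is not enforced, which is one way to encourage coherence across chunk boundaries.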

2605.15157 2026-05-21 cs.RO cs.LG Version update

Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention

Zhuohang Li, Liqun Huang, Wei Xu, Zhengming Zhu, Nie Lin, Xiao Ma, Xinjun Sheng, Ruoshi Wen

Affiliations: State Key Laboratory of Mechanical System and Vibration, School of Mechanical Engineering, Shanghai Jiao Tong University; Shanghai Key Laboratory of Intelligent Robotics, Meta Robotics Institute, Shanghai Jiao Tong University, Shanghai 200240, China; The University of Tokyo

AI summary: This paper proposes Hand-in-the-Loop, which seamlessly blends human intervention with autonomous policy execution to reduce abrupt hand configuration changes and improve the robustness and efficiency of bimanual dexterous manipulation.

Abstract

Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human correction data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the intervention moment, which causes abrupt robot-hand configuration changes, or "gesture jumps". We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with taking over control using direct teleoperation, HandITL reduces intervention jitter by 99.8% and preserves robust post-intervention manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect correction data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.
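A simple way to picture why blending avoids a "gesture jump" is a linear ramp between the policy's joint command and the operator's command over a few control steps, instead of an instantaneous switch. This is only a generic interpolation sketch under that assumption; HandITL's actual blending rule is not given in the abstract, and the names `blend_commands` and `ramp` are hypothetical.

```python
def blend_commands(policy_cmd, human_cmd, alpha):
    """Convex combination of two joint commands; alpha=0 is pure policy, alpha=1 pure human."""
    if len(policy_cmd) != len(human_cmd):
        raise ValueError("commands must have the same number of joints")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * p + alpha * h for p, h in zip(policy_cmd, human_cmd)]


def ramp(step, ramp_steps):
    """Blend weight that rises linearly from 0 to 1 over ramp_steps control steps."""
    return min(1.0, max(0.0, step / ramp_steps))
```

With a direct takeover the commanded hand configuration changes in a single step; with the ramp the per-step change is bounded by the command gap divided by `ramp_steps`.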

2605.14417 2026-05-21 cs.RO cs.CV Version update

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); LimX Dynamics Technology Co., Ltd.; Shandong University; Data61/CSIRO; Griffith University; Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)

AI summary: This work proposes the DAJI framework, which learns an anticipatory joint-intent interface between language generation and closed-loop control so that language-conditioned humanoid control can anticipate upcoming physical transitions, and reports strong results on HumanML3D-style generation and BABEL.

Abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

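2605.14201 2026-05-21 cs.RO cs.CV Version update

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Meysam Sadeghigooghari, Hanno Ackermann, Litian Liu, Pranav Desai, Fatih Porikli, Mohammad Ghavamzadeh, Hong Cai

Affiliations: Qualcomm AI Research

AI summary: This paper proposes MAPLE, a framework for reactive multi-agent rollouts in the latent space of a vision-language-action model, addressing the brittleness in closed-loop settings of models trained with traditional imitation learning. Combining supervised fine-tuning with reinforcement learning and diversity rewards, it enables scalable closed-loop training without an external simulator and improves the robustness of end-to-end autonomous driving systems.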

Comments 19 pages, 9 figures

Abstract

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

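2605.11151 2026-05-21 cs.AI cs.RO Version update

A Lightweight Cubature Kalman Filter for Attitude and Heading Reference Systems Using Simplified Prediction Equations

Shunsei Yamagishi, Lei Jing

Affiliations: Graduate School of Computer Science and Engineering, The University of Aizu

AI summary: This paper presents an improved Cubature Kalman Filter (CKF), named the Kaisoku Cubature Kalman Filter (KCKF), that lowers computational cost while maintaining estimation accuracy. Lightweight prediction equations are derived by simplifying the CKF equations while preserving equivalent mathematical relations. Experiments show that the KCKF requires fewer floating-point operations (FLOPs) than the CKF and reduces computation time by approximately 19% on a high-performance computer and approximately 15% on a low-cost single-board computer, while maintaining attitude estimation accuracy.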

DOI: 10.1109/ACCESS.2026.3686031
Journal ref: IEEE Access, vol. 14, 2026, pp. 73686-73697

Abstract

Attitude and Heading Reference Systems (AHRSs) are broadly applied wherever reliable orientation and motion sensing is required. In this paper, we present an improved Cubature Kalman Filter (CKF) with lower computational cost while maintaining estimation accuracy, which is named "Kaisoku Cubature Kalman Filter (KCKF)". The computationally efficient equations of the KCKF are derived by simplifying those of the CKF, while preserving equivalent mathematical relations. The lightweight prediction equations in the KCKF are derived by expanding the summation terms in the CKF and simplifying the result. This paper shows that the KCKF requires fewer floating-point operations (FLOPs) than the CKF. The controlled experimental results show that the KCKF reduces the computation time by approximately 19% compared to the CKF on a high-performance computer, whereas the KCKF reduces the computation time by approximately 15% compared to the CKF on a low-cost single-board computer. In addition, the KCKF maintains the attitude estimation accuracy of the CKF.
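For context on what the KCKF simplifies, the sketch below shows the prediction step of the standard CKF: with an n-dimensional state, the third-degree cubature rule places 2n equally weighted points at the mean plus or minus sqrt(n) times each column of a square-root covariance factor, and the predicted mean is the average of the propagated points. This is the textbook baseline, not the paper's simplified equations; the function names and the list-of-lists matrix layout are assumptions.

```python
import math


def cubature_points(mean, sqrt_cov):
    """Return the 2n cubature points mean +/- sqrt(n) * (column j of sqrt_cov)."""
    n = len(mean)
    scale = math.sqrt(n)
    points = []
    for sign in (1.0, -1.0):
        for j in range(n):
            points.append([mean[i] + sign * scale * sqrt_cov[i][j] for i in range(n)])
    return points


def predicted_mean(points, transition):
    """Propagate each point through the process model and average with weights 1/(2n)."""
    propagated = [transition(p) for p in points]
    dim = len(propagated[0])
    return [sum(p[i] for p in propagated) / len(propagated) for i in range(dim)]
```

The explicit sum over all 2n propagated points is the kind of summation the abstract says is expanded and simplified to obtain the lighter prediction equations.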

2602.03209 2026-05-21 cs.RO Version update

Depth Completion in Unseen Field Robotics Environments Using Extremely Sparse Depth Measurements

Marco Job, Thomas Stastny, Eleni Kelasidi, Roland Siegwart, Michael Pantic

Affiliations: Autonomous Systems Lab, ETH Zürich; Field Robotics Lab, NTNU

AI summary: This work proposes a depth completion model that is trained on synthetic data and uses extremely sparse depth sensor measurements to predict dense metric depth in unseen field robotics environments, addressing the limited adoption of low-cost cameras in field robotics.

Comments Accepted to ICRA 2026

Abstract

Autonomous field robots operating in unstructured environments require robust perception to ensure safe and reliable operations. Recent advances in monocular depth estimation have demonstrated the potential of low-cost cameras as depth sensors; however, their adoption in field robotics remains limited due to the absence of reliable scale cues, ambiguous or low-texture conditions, and the scarcity of large-scale datasets. To address these challenges, we propose a depth completion model that trains on synthetic data and uses extremely sparse measurements from depth sensors to predict dense metric depth in unseen field robotics environments. A synthetic dataset generation pipeline tailored to field robotics enables the creation of multiple realistic datasets for training purposes. This dataset generation approach utilizes textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis to simulate diverse field robotics scenarios. Our approach achieves an end-to-end latency of 53 ms per frame on an NVIDIA Jetson AGX Orin, enabling real-time deployment on embedded platforms. Extensive evaluation demonstrates competitive performance across diverse real-world field robotics scenarios.


2512.13788 2026-05-21 cs.LG cs.RO (updated version)

Constrained Policy Optimization via Sampling-Based Weight-Space Projection

Shengfan Cao, Francesco Borrelli, Eunhyek Joa

Affiliations: Department of Mechanical Engineering, Seoul National University, Seoul, Korea

AI summary: This work proposes SCPO, a sampling-based weight-space projection method for improving a policy without leaving the safe operating regime. It enforces safety constraints directly in parameter space, keeps training safe and feasible, and ensures closed-loop stability in constrained control tasks.

Comments Accepted for publication at IFAC World Congress 2026; fixed minor notation inconsistencies

Abstract

Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy rollout-based safety constraints that can be evaluated but not differentiated analytically. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. SCPO constructs a local safe region by combining rollout-based safety evaluations with smoothness bounds relating parameter perturbations to changes in safety metrics, and projects each gradient update via a convex QCQP. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, SCPO further ensures closed-loop stability while enabling safe adaptation beyond the conservative backup. Experiments on constrained regression with harmful supervision and double-integrator imitation with a malicious expert show that SCPO rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful objective improvement.
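
As a rough illustration of the projection step described in the abstract, the sketch below assumes the local safe region is a single Euclidean ball around the current parameters, with radius `safety_margin / lipschitz` taken from a hypothetical Lipschitz-style smoothness bound. Under that assumption the QCQP "find the step closest to the proposed one inside the ball" has a closed-form solution, which is a rescaling. The names `project_update`, `safety_margin`, and `lipschitz` are illustrative and not taken from the paper, whose safe region is built from sampled rollouts.

```python
import numpy as np

def project_update(step, safety_margin, lipschitz):
    """Project a proposed parameter step onto a ball-shaped local safe region.

    Assumption: the safety metric changes by at most lipschitz * ||step||,
    so any step with norm <= safety_margin / lipschitz keeps the margin >= 0.
    """
    if safety_margin < 0:
        raise ValueError("current parameters must be safe (margin >= 0)")
    radius = safety_margin / lipschitz
    norm = np.linalg.norm(step)
    if norm <= radius:
        return step  # already inside the safe region
    return step * (radius / norm)  # closest point on the ball boundary

step = np.array([3.0, 4.0])  # proposed gradient step, norm 5
safe_step = project_update(step, safety_margin=1.0, lipschitz=0.5)  # radius 2
```

With several sampled constraints the feasible set would be an intersection of such regions, and a general convex solver would likely replace the closed form.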


2512.09447 2026-05-21 cs.RO cs.CV (updated version)

Query-Calibrated Segmental Admission for Descriptor-Agnostic LiDAR Loop Closure in Repetitive Environments

Jaehyun Kim, Seungwon Choi, Wonseok Kang, Tae-Wan Kim

Affiliations: Department of Naval Architecture and Ocean Engineering

AI summary: This work proposes a descriptor-agnostic sparse loop-admission policy that keeps the pose graph stable in repetitive environments. It calibrates query-level segment hypotheses and inserts only validated representative pairs, reducing false loop-factor admissions and improving the precision and stability of loop closure.

Comments 8 pages, 3 figures

Abstract

Structurally repetitive environments produce visually plausible but aliased LiDAR loop candidates that can destabilize pose-graph optimization when admitted as loop factors. We propose Query-Calibrated Segmental Admission (QCSA), a descriptor-agnostic sparse loop-admission policy for graph-stability-oriented insertion. The policy scores short descriptor segments against hard negatives, calibrates which query-level segment hypotheses reach geometry, and inserts representative pairs validated by Generalized Iterative Closest Point (G-ICP). We evaluate it on the SNU Library Dataset (SNULib) and HeLiPR overlap routes. Aggregated over seven LiDAR descriptor families on SNULib, QCSA reduces inserted loop factors by 3.8 times, raises factor precision from 0.542 to 0.717, and sharply lowers false admissions per query group. With this sparser graph, it maintains comparable mean absolute trajectory error (ATE) and substantially reduces worst-sequence ATE versus dense Top1+G-ICP, from 1.064 to 0.778 m. The aggregate mean and worst-sequence ATE remain lower than the odometry-only reference. Under a matched factor budget, QCSA also attains lower trajectory error than SeqSLAM and sparse Top1+G-ICP selections. Fixed-transfer validation on HeLiPR, with no route-specific tuning, likewise suppresses hard-negative admissions. These results support the proposed admission layer for aliasing-heavy simultaneous localization and mapping (SLAM). Our implementation and dataset will be released at: https://github.com/wanderingcar/snu_library_dataset.


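2511.01219 2026-05-21 cs.RO (updated version)

Tackling the Kidnapped Robot Problem via Sparse Feasible Hypothesis Sampling and Reliable Batched Multi-Stage Inference

Muhua Zhang, Lei Ma, Ying Wu, Kai Shen, Deqing Huang, Henry Leung

Affiliations: School of Electrical Engineering, Southwest Jiaotong University

AI summary: This paper proposes a passive 2-D global relocalization framework that estimates the global pose efficiently and reliably from a single LiDAR scan and an occupancy grid map while the robot remains stationary, improving the long-term autonomy of mobile robots. The framework casts global relocalization as a non-convex problem and solves it with a multi-hypothesis scheme that uses batched multi-stage inference and early termination to balance completeness and efficiency.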

Comments 14 pages, 8 figures. Accepted for publication in IEEE Transactions on Instrumentation and Measurement. DOI: 10.1109/TIM.2026.3694741

DOI: 10.1109/TIM.2026.3694741

Abstract

This paper addresses the Kidnapped Robot Problem (KRP), a core localization challenge of relocalizing a robot in a known map without prior pose estimate upon localization loss or at SLAM initialization. For this purpose, a passive 2-D global relocalization framework is proposed. It estimates the global pose efficiently and reliably from a single LiDAR scan and an occupancy grid map while the robot remains stationary, thereby enhancing the long-term autonomy of mobile robots. The proposed framework casts global relocalization as a non-convex problem and solves it via the multi-hypothesis scheme with batched multi-stage inference and early termination, balancing completeness and efficiency. The Rapidly-exploring Random Tree (RRT), under traversability constraints, asymptotically covers the reachable space to generate sparse, uniformly distributed feasible positional hypotheses, fundamentally reducing the sampling space. The hypotheses are preliminarily ordered by the proposed Scan Mean Absolute Difference (SMAD), a coarse beam-error level metric that facilitates the early termination by prioritizing high-likelihood candidates. The SMAD computation is optimized for limited scan measurements. The Translation-Affinity Scan-to-Map Alignment Metric (TAM) is proposed for reliable orientation selection at hypothesized positions and accurate final global pose evaluation to mitigate degradation in conventional likelihood-field metrics under translational uncertainty induced by sparse hypotheses, as well as non-panoramic LiDAR scan and environmental changes. Real-world experiments on a resource-constrained mobile robot with non-panoramic LiDAR scans show that the proposed framework achieves competitive performance in success rate, robustness under measurement uncertainty, and computational efficiency.
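
To make the hypothesis-ordering step more concrete, here is a minimal sketch that assumes SMAD is the mean absolute difference between the measured beam ranges and the ranges expected at a hypothesized pose (for example, from ray casting in the occupancy grid). The exact definition and the optimized computation in the paper may differ; `smad`, `rank_hypotheses`, and the toy ranges are hypothetical.

```python
import numpy as np

def smad(measured, expected):
    """Mean absolute difference between measured and expected beam ranges."""
    measured = np.asarray(measured, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.mean(np.abs(measured - expected)))

def rank_hypotheses(measured, expected_by_hypothesis):
    """Return hypothesis indices ordered from lowest to highest SMAD."""
    scores = [smad(measured, e) for e in expected_by_hypothesis]
    return sorted(range(len(scores)), key=scores.__getitem__)

scan = [1.0, 2.0, 3.0]
candidates = [
    [2.0, 3.0, 4.0],  # every beam off by 1.0
    [1.0, 2.0, 3.5],  # one beam off by 0.5
    [1.5, 2.5, 3.5],  # every beam off by 0.5
]
order = rank_hypotheses(scan, candidates)
```

Evaluating candidates in this order means a likely pose is examined first, which is what allows a later, more expensive stage to terminate early.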


2510.18034 2026-05-21 cs.CV cs.AI cs.RO (updated version)

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

Affiliations: Professorship of Autonomous Vehicle Systems, TUM School of Engineering and Design, Technical University of Munich, Munich, Germany

AI summary: This paper proposes SAVANT, a framework that uses structured reasoning to improve VLM performance on semantic anomaly detection, enabling more accurate identification of rare anomalous situations in autonomous driving scenes.

Comments 8 pages, 5 figures

Abstract

Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models - limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline - structured scene description extraction and multi-modal evaluation - existing VLMs improve their scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM's absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.


2509.26627 2026-05-21 cs.AI cs.LG cs.RO (updated version)

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao

Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; Shanghai Qi Zhi Institute; Shanghai Jiao Tong University; University of Pennsylvania

AI summary: This paper proposes TimeRewarder, which learns dense rewards from passive videos via frame-wise temporal distance to improve reinforcement learning on sparse-reward tasks. Experiments show that it substantially improves success rate and sample efficiency across multiple tasks.

Comments ICML 2026 spotlight paper

Abstract

Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 environment interactions per task. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable approach to rich reward signals from diverse video sources.
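
The idea of turning temporal distance into a step-wise reward can be sketched as follows. This is only an illustration under stated assumptions: training targets are taken to be frame-index differences normalized by video length, and the learned model is replaced by a toy stand-in. The helpers `sample_frame_pair`, `stepwise_rewards`, and `toy_predictor` are hypothetical and do not come from the paper.

```python
import random

def sample_frame_pair(num_frames, rng):
    """Sample a frame pair (i, j) and its normalized temporal distance."""
    i, j = rng.randrange(num_frames), rng.randrange(num_frames)
    target = (j - i) / (num_frames - 1)  # in [-1, 1]; positive means progress
    return i, j, target

def stepwise_rewards(frames, predict_distance):
    """Proxy reward per step: predicted distance between consecutive frames."""
    return [predict_distance(a, b) for a, b in zip(frames, frames[1:])]

# Stand-in for a learned model: frames are scalars that encode task progress.
def toy_predictor(frame_a, frame_b):
    return frame_b - frame_a

rng = random.Random(0)
i, j, target = sample_frame_pair(num_frames=11, rng=rng)
rewards = stepwise_rewards([0.0, 0.1, 0.1, 0.3], toy_predictor)
```

In this toy rollout the second step makes no progress and so earns zero reward, while the third step earns the most.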


2509.07674 2026-05-21 cs.RO cs.HC (updated version)

Temporal Counterfactual Explanations of Behaviour Tree Decisions

Tamlin Love, Antonio Andriella, Guillem Alenyà

Affiliations: Institut de Robòtica i Informàtica Industrial, CSIC-UPC

AI summary: This paper proposes a method for generating counterfactual explanations that builds a causal model of a behaviour tree to explain why a robot made a decision, improving the transparency and safety of robotic systems.

Comments 33 pages, 7 figures + 4 figures in appendices

Abstract

Explainability, in particular, the ability for robots to explain why they have made a decision or behaved in a certain way, is a critical tool in helping users understand the robots they interact and coexist with. Behaviour trees are a popular framework for controlling the decision-making of robots, and thus a natural question to ask is whether or not a system driven by a behaviour tree is capable of answering "why" questions. While explainability for behaviour tree-driven robots has seen some prior attention, no existing methods are capable of generating causal, counterfactual explanations which detail the reasons for robot decisions and behaviour. Therefore, in this work, we introduce a novel approach which automatically generates counterfactual explanations in response to contrastive "why" questions. Our method achieves this by first automatically building a causal model from the structure of the behaviour tree as well as domain knowledge about the state and individual behaviour tree nodes. The resultant causal model is then queried and searched to find a set of diverse counterfactual explanations. We demonstrate that our approach is able to correctly explain the behaviour of a wide range of behaviour tree structures and states in real time, unlike previous methods which are either unable to answer contrastive questions with causal explanations, or are not guaranteed to provide consistent and accurate explanations. By being able to answer a wide range of causal queries, our approach represents a step towards more transparent, understandable, and ultimately safe and trustworthy robotic systems.
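
A contrastive "why" question can be illustrated on a toy example. The sketch below is not the paper's causal-model approach: it hard-codes a tiny behaviour tree as a Python function and brute-forces the smallest sets of boolean state changes that would have produced the alternative (foil) decision. The tree, the state variables, and the function names are all hypothetical.

```python
from itertools import combinations

def choose_action(state):
    """Toy behaviour tree: a fallback over two guarded actions and a default."""
    if state["battery_low"]:
        return "recharge"
    if state["person_detected"] and not state["path_blocked"]:
        return "greet"
    return "patrol"

def counterfactuals(state, actual, foil):
    """Smallest sets of boolean flips that change the decision to the foil."""
    assert choose_action(state) == actual
    for size in range(1, len(state) + 1):
        found = []
        for keys in combinations(sorted(state), size):
            changed = {**state, **{k: not state[k] for k in keys}}
            if choose_action(changed) == foil:
                found.append(keys)
        if found:
            return found
    return []

state = {"battery_low": False, "person_detected": True, "path_blocked": True}
why_not_greet = counterfactuals(state, actual="patrol", foil="greet")
```

Here the answer to "why patrol rather than greet?" is that the path was blocked. Exhaustive search grows exponentially with the number of state variables, which is one reason a structured causal model is attractive.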


2508.06206 2026-05-21 cs.RO cs.CV (updated version)

Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma

Affiliations: The Hong Kong University of Science and Technology (GZ); National University of Singapore; ShanghaiTech University; East China Normal University; Nanjing University of Information Science & Technology; Zhejiang University; Institute of Automation, Chinese Academy of Science; Shanghai AI Laboratory

AI summary: This paper proposes Affordance-R1, a unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO), and uses reinforcement learning to achieve zero-shot generalization and test-time reasoning.

Abstract

Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code and dataset are released at https://github.com/hq-King/Affordance-R1.
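
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation in its commonly described form: rewards for several responses to the same prompt are standardized within the group. Combining the format, perception, and cognition terms with equal weights is an assumption made here for illustration; the abstract does not state the weighting, and the reward values are made up.

```python
import numpy as np

def total_reward(format_r, perception_r, cognition_r, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward terms; equal weights are an assumption."""
    return float(np.dot(weights, (format_r, perception_r, cognition_r)))

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four hypothetical responses to the same prompt.
rewards = [
    total_reward(1.0, 0.8, 1.0),
    total_reward(1.0, 0.2, 0.0),
    total_reward(0.0, 0.0, 0.0),
    total_reward(1.0, 0.5, 1.0),
]
advantages = group_relative_advantages(rewards)
```

Because advantages are relative to the group mean, no separate value network is needed to provide a baseline.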


2507.09180 2026-05-21 cs.CV cs.RO (updated version)

Multimodal Fusion for Sim2real Transfer in Visual Reinforcement Learning

Zichun Xu, Jingdong Zhao, Chenyu Guo, Qianxue Zhang, Liao Zhang, Xiao Zhang, Yiming Ren, Lian Zhang, Zengren Zhao

Affiliations: Medical Artificial Intelligence Lab, The First Hospital of Hebei Medical University, Hebei Medical University; State Key Laboratory of Robotics and Systems, Harbin Institute of Technology

AI summary: This paper proposes a vision-transformer-based multimodal fusion framework that fuses RGB and depth information to improve generalization, together with a contrastive learning scheme and a curriculum-based domain randomization scheme that improve sample efficiency and transfer performance. Experiments show that the method performs well on real-world tasks.

2605.20484 2026-05-21 cs.RO (updated version)

Enhancing Graph-Based SLAM in GNSS-Denied environments by leveraging leg odometry

Léon Perruchot-Triboulet, Luc Jaulin, Kai Xiao

Affiliations: LinxAI Tech

AI summary: This paper proposes a factor-graph-based architecture that combines proprioceptive leg odometry with LiDAR-inertial odometry, effectively reducing visual drift in GNSS-denied environments and improving the robustness of SLAM.

Comments 4 pages, 3 figures, 2 tables, for ICRA workshop on Robot Meets GNSS and Ranging for Seamless Autonomy

2605.20433 2026-05-21 cs.RO (updated version)

Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation

Yue Feng, Weicheng Huang, I-Ming Chen

Affiliations: Robotics Research Centre, School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore; Wings Robotics

AI summary: This paper proposes Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion framework that uses entropy-regularized optimal transport to align force-pose-derived sub-queries with visual patches, addressing multimodal fusion in contact-rich manipulation and achieving high success rates on real-robot tasks.

Comments 8 pages, 16 figures, 3 tables. Preprint

Abstract

Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch attention by an entropy-regularized Optimal Transport (OT) alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias for contact-rich tasks, encouraging conditioning-aware spatial selection that is stable across illumination, distractors, and partial occlusion. SO-TA is paired with a diffusion-based sequence policy mapping observation windows to pose-action chunks. We evaluate SO-TA on three real-robot tasks: tight peg-in-hole assembly, BCM wiring-connector insertion, and curved-surface mark erasing. With ~200 rollouts per condition, SO-TA reaches 100% success on tight peg-in-hole versus 93% for cross-attention at matched capacity, and retains 82.5% success under illumination, distractor, and partial-occlusion perturbations where a concatenation baseline drops to 43.5%. OT-derived patch heatmaps and leave-one-out modality-influence ratios provide interpretable, phase-dependent diagnostics.
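
The contrast between softmax attention and an entropy-regularized OT alignment can be sketched with standard Sinkhorn iterations. The example below is an illustration only: it assumes uniform marginals over queries and patches and a cosine-distance cost, neither of which is specified in the abstract, and the random features merely stand in for force-pose sub-queries and visual patch embeddings.

```python
import numpy as np

def sinkhorn_plan(cost, row_marginal, col_marginal, epsilon=0.5, iters=500):
    """Entropy-regularized OT plan computed with Sinkhorn iterations."""
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(iters):
        u = row_marginal / (kernel @ v)    # rescale to match row sums
        v = col_marginal / (kernel.T @ u)  # rescale to match column sums
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))  # stand-ins for force-pose sub-queries
patches = rng.normal(size=(5, 8))  # stand-ins for visual patch features

# Cost: cosine distance between each query and each patch, in [0, 2].
qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
pn = patches / np.linalg.norm(patches, axis=1, keepdims=True)
cost = 1.0 - qn @ pn.T

plan = sinkhorn_plan(cost, np.full(3, 1 / 3), np.full(5, 1 / 5))
# Row-normalize the plan to use it as attention weights over patches.
attended = (plan / plan.sum(axis=1, keepdims=True)) @ patches
```

Unlike row-wise softmax, the column marginal constrains how much total attention mass each patch can receive, so no single patch can absorb all queries.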


2605.20431 2026-05-21 cs.HC cs.RO (updated version)

Multi-Week, In-Class Deployments of Telepresence Robots With Four Homebound K-12 Students: Benefits, Challenges, and Recommendations

Matthew Rueben, Rhianna Lee, Thomas R. Groechel, Hengzhi Chen, Haemi Lee, Gisele Ragusa, Maja J. Matarić

Affiliations: Colby College

AI summary: This study examines the benefits and challenges of using telepresence robots to help homebound K-12 students attend class, along with recommendations for improvement, analyzing student experiences and classroom management needs through four multi-week deployments and 15 interviews.

DOI: 10.1007/s10639-025-13855-4
Journal ref: Rueben, M., Lee, R., Groechel, T.R. et al. Multi-week, in-class deployments of telepresence robots with four homebound K-12 students: Benefits, challenges, and recommendations. Educ Inf Technol 31, 2145-2175 (2026)

Abstract

Missing significant amounts of school during K-12 education is known to put students' cognitive and social development at risk. Alternatives such as home instruction and online learning are common, but lack sufficient interaction with peers and teachers in the classroom. Mobile remote presence systems, or telepresence robots, are promising for homebound students because they provide embodiment and mobility in addition to the real-time participation offered by video conferencing technologies. Research is needed, however, for telepresence robots to meet the complex needs of homebound students participating remotely in the K-12 classroom context. We present findings from four multi-week deployments with homebound K-12 students attending classes via telepresence robots. The homebound students' experiences were documented in a total of 15 interviews and analyzed qualitatively as case studies. The homebound student participants and their deployment contexts differed from one another along multiple dimensions, and while some benefits of mobile remote attendance were enjoyed by all participants, each participant also experienced unique benefits. Some challenges with hearing, seeing, and moving the robot around the classroom warranted improvements to the design of the telepresence system. Other challenges suggested priorities for managing a classroom deployment, such as ensuring that the remote student is included in classroom activities, accountable to the teacher, and treated with respect by classmates. Based on insights from the study, we make recommendations for real-world deployment procedures in similar contexts.


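2605.20395 2026-05-21 cs.RO (updated version)

Scalable Multi-robot Motion Planning via Hierarchical Subproblem Expansion and Workspace Decomposition Refinement

Isaac Ngui, Courtney McBeth, James D. Motes, Marco Morales, Nancy M. Amato

Affiliations: University of Illinois Urbana-Champaign; Instituto Tecnológico Autónomo de México

AI summary: This paper proposes a multi-robot motion planning method that improves planning efficiency through discrete search over a workspace decomposition. Its core techniques are hierarchical subproblem expansion and workspace decomposition refinement, and its main contribution is iteratively refining the workspace representation so that the search runs over smaller, decoupled configuration spaces.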

Comments Accepted to WAFR 2026

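2605.20392 2026-05-21 cs.RO (updated version)

VBT-MPC: Vision-Based Tactile MPC for Contour Following

Edison Velasco-Sanchez, Luis F. Recalde, Guanrui Li, Pablo Gil

Affiliations: AUROVA Lab, Computer Science Research Institute, University of Alicante; Worcester Polytechnic Institute

AI summary: This paper proposes a vision-based tactile model predictive control (VBT-MPC) framework for robotic contour following. Using a vision-based tactile sensor (VBTS) mounted in an eye-in-hand configuration, it operates directly in the contour feature space, avoiding a separate pose estimation module and complex force-control architectures, and it evaluates contour-following performance on objects of different geometries and materials in simulation and real-world experiments.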

Comments This article has been accepted for publication in IEEE Robotics and Automation Letters. This is a preprint version. This work was supported by the Interreg-VI Sudoe and European Regional Development Funds through the REMAIN Project under Grant S1/1.1/E0111

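2605.20390 2026-05-21 cs.CV cs.AI cs.LG cs.RO (updated version)

Terrestrial Soft Mobile Robots: A Survey

Dimuthu D. K. Arachchige

Affiliations: School of Computing, DePaul University; Department of Computer Science, Hampton University

AI summary: This paper surveys the current state of research on soft mobile robots, focusing on locomotion strategies, actuation methods, modeling approaches, and control systems for wheelless terrestrial mobile systems, and identifies the key challenges to widespread adoption of soft mobile robots across domains.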

2605.20299 2026-05-21 cs.LG cs.AI cs.RO (updated version)

Mechanisms of Misgeneralization in Physical Sequence Modeling

Kento Nishi, Raphael Tang, Karun Kumar, Core Francisco Park, Hidenori Tanaka

Affiliations: Harvard College; Harvard John A. Paulson School of Engineering and Applied Sciences; Comcast AI; CBS-NTT Program in Physics of Intelligence, Harvard University; Physics of Artificial Intelligence Group, NTT Research, Inc., Sunnyvale, CA, USA; Microsoft

AI summary: This paper studies physical misgeneralization in physical sequence modeling, which arises from the propagation of local errors. It introduces a data deviation kernel to predict which physical quantities gain or lose mass and proposes a kernel-informed intervention.

Comments Preprint. kentonishi.com/physical-misgeneralization

Abstract

Generative sequence models are often trained to plan motion in physical domains, from robotics to mechanical simulations. When constructing a dataset to train such a model, engineers may curate demonstrations to specify how trajectories should be distributed over a physical quantity like travel distance or mechanical energy. For example, a roboticist building a maze navigation agent might choose demonstrations whose travel distances cover a fixed range uniformly, hoping to constrain the agent's expected power usage. We find that standard deep learning can violate this intent: each generated trajectory can seem plausible on its own, but the aggregate distribution over the physical quantity is wrong. We call this failure physical misgeneralization, and develop an account of its mechanism. Using controlled synthetic tasks, we show that physical misgeneralization arises when local errors typical of the model class propagate through the physical measurement to shift the recovered distribution. We estimate these errors with a data deviation kernel, and we use it to predict which physical quantities gain or lose mass in both our synthetic and more applied maze navigation and double-pendulum motion tasks. Finally, our mechanistic interpretation helps identify which mitigation strategies are structurally promising, and we use it to propose a kernel-informed intervention.
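
The failure mode described here is about the aggregate distribution of a physical quantity, which suggests a simple diagnostic: measure the quantity on demonstrations and on generated trajectories, then compare the two histograms. The sketch below does this for travel distance using total variation distance. It is a generic check, not the paper's data deviation kernel, and the toy trajectories and helper names are hypothetical.

```python
import numpy as np

def path_length(trajectory):
    """Total travel distance of a trajectory given as a (T, 2) position array."""
    steps = np.diff(np.asarray(trajectory, dtype=float), axis=0)
    return float(np.linalg.norm(steps, axis=1).sum())

def total_variation(samples_a, samples_b, bins):
    """Total variation distance between two histograms over shared bins."""
    p, _ = np.histogram(samples_a, bins=bins)
    q, _ = np.histogram(samples_b, bins=bins)
    return 0.5 * float(np.abs(p / p.sum() - q / q.sum()).sum())

# Toy data: straight-line trajectories whose lengths stand in for rollouts.
demo_lengths = [path_length([[0, 0], [d, 0]]) for d in (1, 2, 3, 4)]
generated_lengths = [path_length([[0, 0], [d, 0]]) for d in (1, 1, 1, 4)]
shift = total_variation(
    demo_lengths, generated_lengths, bins=[0.5, 1.5, 2.5, 3.5, 4.5]
)
```

In this toy case each generated trajectory is individually plausible, yet half of the probability mass has moved away from the intended uniform spread of travel distances.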


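2605.20264 2026-05-21 cs.RO cs.HC (updated version)

Adaptive Human-Robot Collaboration for Masonry Construction Under Material and Assembly Uncertainty

Jutang Gao, Arash Adel

Affiliations: Princeton University

AI summary: This paper proposes an adaptive human-robot collaborative workflow that addresses the tolerance accumulation caused by material and assembly uncertainty in masonry construction, using projection guidance and laser-scanning feedback to achieve precise collaboration.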

Comments Accepted for publication in Proceedings of the 43rd International Symposium on Automation and Robotics in Construction (ISARC 2026)

Abstract

Human-robot collaboration in construction is often challenged by limited robot-to-human communication and the need to adapt to tolerance accumulation arising from material and assembly uncertainties. We present an adaptive human-robot collaborative workflow for masonry construction that addresses communication limitations and tolerance accumulation, demonstrated through a brickwork case study in which a robot places bricks while a human applies adhesive. This workflow is enabled by two complementary mechanisms: 1) an end-effector-mounted projector that provides spatially registered, just-in-time projection guidance for manual adhesive application, and 2) laser scanning for feedback-driven grasping and placement pose correction. Together, these mechanisms enable adjustment of human and robotic actions in response to material variability and accumulated assembly tolerances. Full-scale experiments across conventional running-bond and nonstandard configurations demonstrate that projection guidance improves adhesive application consistency and reduces application time, while laser-based correction maintains level courses and avoids collision-prone failures associated with open-loop execution. These results indicate that integrating spatial projection with feedback-driven adaptation, enabled by material and as-built sensing, can mitigate tolerance accumulation and improve precision and robustness in human-robot collaborative construction.


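2605.20209 2026-05-21 cs.GR cs.LG cs.RO (updated version)

NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

Chia-Wen Chen, Yan Wu, Korrawe Karunratanakul, Siyu Tang

Affiliations: ETH Zurich

AI summary: This paper proposes NaP-Control, which uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior for fast, robust, high-fidelity character control, and optimizes task rewards through interaction with the environment to improve success rates and adapt to challenging scenarios.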

详情

English abstract

Achieving precise, versatile whole-body character control in physics-based animation remains challenging. Recent diffusion-based policies generate rich and expressive motions but typically rely on gradient-based test-time guidance to satisfy task objectives, which is slow and can reduce robustness. We introduce NaP-Control (Navigating Diffusion Prior for Versatile and Fast Character Control), abbreviated as NaP. Our method uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, steering it toward task-specific behaviors for fast, robust control with high motion fidelity. In contrast to methods that rely solely on offline training, NaP interacts with the environment during training to correct motions and optimize task rewards, improving success rates and enabling adaptation to challenging scenarios. By directly predicting task-optimized diffusion noise, NaP eliminates iterative guidance during denoising and enables efficient inference. Experiments show that NaP attains higher success rates and faster inference while preserving natural motion across diverse tasks.


2603.14698 2026-05-21 cs.RO Updated version

Dual Quaternion Based Contact Modeling for Fast and Smooth Collision Recovery of Quadrotors

Valentin Gaucher, Wenlong Zhang

Affiliation: School of Manufacturing Systems and Networks, Ira A. Fulton Schools of Engineering, Arizona State University

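AI summary: This paper proposes a dual-quaternion contact model that lets quadrotors recover quickly and smoothly from collisions in complex environments. By operating on the unified spatial twist, the model couples the normal and tangential impulse components, reducing execution latency and peak kinetic energy.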

Comments 8 pages, 5 figures, 2 tables

English abstract

Unmanned aerial vehicles (UAVs) operating in cluttered environments require efficient and accurate impact modeling to maintain stability after collisions; however, classical impulse contact models decouple the normal and tangential components. This letter presents a dual quaternion impulse reset map directly on the SE(3) manifold. By operating on the unified spatial twist (unified linear and angular velocities), the proposed formulation retains the cross-coupling between normal and tangential impulse components in a single closed-form expression, and recovers the classical decoupled Newton impulse model as a special case. A recovery controller is designed that couples linear and angular momentum to enforce kinetic energy dissipation across impacts. Hardware-in-the-loop benchmarks demonstrate a 24% reduction in execution latency compared to an optimized matrix-based implementation, and a 20% reduction relative to a position-plus-quaternion (PQ) formulation. MuJoCo simulations across Monte Carlo sweeps over impact angles and friction coefficients show a 50.8%-75.1% reduction in position root-mean-square error (RMSE) and a 68.7%-85% decrease in peak kinetic energy compared to published linear-admittance baselines.
