2606.11891 2026-06-11 cs.RO cs.LG New submission

Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation

Mehmet Turan Yardımcı

AI summary: For multi-objective reinforcement learning on humanoid robots, the paper compares unified-critic and dual-critic architectures. Experiments show that dual-critic policies clearly outperform the unified critic in reaching speed, throughput, and success rate, and that the architecture choice matters more than reward engineering.

Comments: Accepted at the ICRA 2026 Workshop on Reinforcement Learning for Imitation Learning (RL4IL), Vienna, Austria. 4 pages, 2 figures

Abstract

Multi-objective reinforcement learning for humanoid robots must coordinate locomotion and manipulation within a single policy. A natural design choice is whether to use a single (unified) critic that estimates the combined value of all objectives, or separate (dual) critics with disjoint reward signals. We present a controlled comparison on the Unitree G1 humanoid (23 active DoF) in NVIDIA Isaac Lab, training loco-manipulation policies through a sequential curriculum spanning 13 levels from stationary reaching to walking with variable-orientation targets. In standardized evaluation, dual-critic policies reach targets 3.5$\times$ faster (6.5 vs. 22.6 simulation steps), achieve 2$\times$ higher throughput (14.3 vs. 7.0 validated reaches per 1,000 steps), and attain higher validated reach rates (65.2% vs. 53.8%) compared to the unified-critic policy. Notably, additional anti-gaming reward mechanisms provide no further improvement beyond the architectural change alone (60.9% vs. 65.2%). These results have direct implications for the emerging paradigm of RL fine-tuning of imitation-learned policies: when refining a pre-trained manipulation policy with RL, a unified critic risks suppressing the learned behavior through competing locomotion gradients. These findings demonstrate that critic architecture is a primary - and often overlooked - design choice in multi-objective humanoid RL, with greater impact than reward engineering on reaching efficiency.
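
To make the unified-versus-dual distinction concrete, here is a minimal sketch of how the advantage signal could be formed in each case. It assumes generalized advantage estimation (GAE) and a weighted sum of per-objective advantages; the abstract does not state how the authors combine the two critics, so the function names, weights, and the combination rule are illustrative assumptions.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one episode.

    `values` must hold one more entry than `rewards` (the bootstrap value).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def unified_advantage(r_loco, r_manip, v_total, **kw):
    """One critic regresses the value of the summed reward."""
    summed = [a + b for a, b in zip(r_loco, r_manip)]
    return gae(summed, v_total, **kw)


def dual_advantage(r_loco, r_manip, v_loco, v_manip,
                   w_loco=1.0, w_manip=1.0, **kw):
    """Two critics, each fed only its own reward stream (assumed combination rule)."""
    a_loco = gae(r_loco, v_loco, **kw)
    a_manip = gae(r_manip, v_manip, **kw)
    return [w_loco * x + w_manip * y for x, y in zip(a_loco, a_manip)]
```

Because GAE is linear in rewards and values, the two variants give the same number when the weights are 1 and the value estimates add up exactly. Any practical difference therefore comes from what each critic has to fit and from per-objective weighting or normalization.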

2606.12365 2026-06-11 cs.RO cs.AI New submission

Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

Adam Wei, Nicholas Pfaff, Thomas Cohn, Arif Kerem Dayı, Constantinos Daskalakis, Giannis Daras, Russ Tedrake

Affiliations: MIT

AI summary: Proposes Ambient Diffusion Policy, which extracts useful features from suboptimal data through noise-dependent data usage, outperforming existing methods across six tasks by up to 33%.

Comments: 14 pages (main body), 52 pages total. Project website: this https URL

Abstract

We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment - a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.
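
The idea of noise-dependent data usage can be sketched as a loss mask over (sample, diffusion time) pairs. This is only an illustration under stated assumptions: diffusion time is normalized to [0, 1], and the thresholds `t_low` and `t_high` are hypothetical values, not taken from the paper.

```python
def sample_weight(is_clean, t, t_low=0.2, t_high=0.8):
    """Return 1.0 if this (sample, diffusion time) pair contributes to the loss."""
    if is_clean:
        return 1.0  # high-quality data is used at every noise level
    if t <= t_low or t >= t_high:
        return 1.0  # suboptimal data only at low or high diffusion times
    return 0.0


def masked_loss(per_sample_losses, is_clean_flags, times, **thresholds):
    """Average the per-sample denoising losses over the pairs that pass the mask."""
    weights = [sample_weight(c, t, **thresholds)
               for c, t in zip(is_clean_flags, times)]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total
```

In this sketch a suboptimal sample drawn at an intermediate diffusion time is simply dropped from the batch average, while clean samples always count.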

2606.12372 2026-06-11 cs.RO cs.LG New submission

UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

Haoyuan Deng, Yitong Gao, Yudong Lin, Haichao Liu, Zhenyu Wu, Ziwei Wang

Affiliations: Nanyang Technological University; Beijing University of Posts and Telecommunications

AI summary: Proposes UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states; on real-robot manipulation tasks it raises the average success rate by 8.6% and reduces human interventions by 57%.

Comments: Project page: this https URL

Abstract

Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human corrections to redirect the policy out of unproductive exploration, which incurs high labor cost and limits real-world scalability. To address this, we propose UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states, taking over the bulk of interventions from human operators. Specifically, UniIntervene first performs future-conditioned action-value estimation, predicting the latent consequence of the current action and evaluating its induced value, which provides a more stable progress signal. Building on this, a temporal value-risk critic aggregates recent value dynamics and triggers intervention when the estimated value exhibits sustained stagnation or degradation. When intervention is required, UniIntervene retrieves a high-value recovery target from a memory of past intervention episodes and produces executable corrective actions through a goal-conditioned recovery policy. In this way, UniIntervene turns intervention from passive human correction into a value-aware recovery process for efficient real-world RL. Extensive experiments on diverse real-world manipulation tasks demonstrate that UniIntervene improves the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.
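
The trigger condition ("sustained stagnation or degradation" of the estimated value) can be illustrated with a hand-written rule. The paper describes a learned temporal value-risk critic; the class below is only a rule-based stand-in, and its name and parameters (`window`, `min_gain`, `patience`) are hypothetical.

```python
from collections import deque


class StagnationTrigger:
    """Rule-based stand-in for a learned value-risk critic (illustrative only)."""

    def __init__(self, window=5, min_gain=0.01, patience=3):
        self.values = deque(maxlen=window)
        self.min_gain = min_gain
        self.patience = patience
        self.bad_steps = 0

    def update(self, value):
        """Feed one value estimate; return True when intervention should start."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough history yet
        gain = self.values[-1] - self.values[0]
        # Count consecutive steps whose windowed gain is below the threshold.
        self.bad_steps = self.bad_steps + 1 if gain < self.min_gain else 0
        return self.bad_steps >= self.patience
```

Requiring several consecutive low-gain windows, rather than a single one, is what makes the stagnation "sustained" in this sketch.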

2606.12334 2026-06-11 cs.LG cs.RO Cross-list

Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

Balázs Gyenes, Emiliyan Gospodinov, Jan Frieling, Enrico Krohmer, Nicolas Schreiber, Xiaogang Jia, Niklas Freymuth, Gerhard Neumann

Affiliations: Karlsruhe Institute of Technology; FZI Research Center for Information Technology

AI summary: Proposes Fourier feature mapping in the point cloud encoder to counter the low-frequency bias of neural networks that hampers high-precision manipulation, with significant gains on multiple benchmarks and on a real robot.

Comments: Published as a conference paper at ICML 2026

Abstract

High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a stronger geometric prior over purely image-based ones, yet their performance remains highly task-dependent. We hypothesize that this discrepancy may be due to the spectral bias of neural networks towards learning low frequency functions, which especially affects architectures conditioned on slow-moving Cartesian features. We thus propose to map point clouds from Cartesian space into high-dimensional Fourier space, effectively equipping the point cloud encoder with direct access to high-frequency features. We experimentally validate the use of Fourier features on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks and on a real robot setup. Despite their simplicity, we find that Fourier features provide significant benefits across diverse encoder architectures and benchmarks and are robust across hyperparameters. Our results indicate that Fourier features let policies leverage geometric details more effectively than Cartesian features, showing their potential as a general-purpose tool for point cloud-based imitation learning. We provide source code and videos on our project page: this https URL
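
A Fourier feature mapping of a point cloud can be sketched in a few lines. The version below assumes deterministic, octave-spaced frequencies as in common positional encodings; the abstract does not specify the frequency scheme, so `num_bands` and `base_freq` are illustrative choices.

```python
import math


def fourier_features(point, num_bands=4, base_freq=1.0):
    """Map one 3D point (x, y, z) to sin/cos features at octave-spaced frequencies."""
    feats = []
    for k in range(num_bands):
        freq = base_freq * (2.0 ** k) * math.pi
        for coord in point:
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats


def encode_cloud(points, **kw):
    """Apply the mapping to every point; the result feeds the point cloud encoder."""
    return [fourier_features(p, **kw) for p in points]
```

Each point grows from 3 coordinates to `6 * num_bands` features. At the higher bands, two points a few millimetres apart produce clearly different features, which is the kind of access to high-frequency detail the abstract describes.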

2505.03296 2026-06-11 cs.RO cs.AI cs.LG Updated version

The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

Jan Ole von Hartz, Adrian Röfer, Joschka Boedecker, Abhinav Valada

AI summary: Proposes MiDiGap, which uses a few demonstrations and camera observations to represent and learn robot manipulation policies flexibly via mixtures of discrete-time Gaussian processes; it achieves state-of-the-art performance on long-horizon, highly constrained, dynamic, and multimodal tasks and supports inference-time steering.

Comments: Submitted for publication to IEEE Transactions on Robotics

Abstract

We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstrations using only camera observations and generalizes across a wide range of challenging tasks. It excels at long-horizon behaviors such as making coffee, highly constrained motions such as opening doors, dynamic actions such as scooping with a spatula, and multimodal tasks such as hanging a mug. MiDiGap learns these tasks on a CPU in less than a minute and scales linearly to large datasets. We also develop a rich suite of tools for inference-time steering using evidence such as collision signals and robot kinematic constraints. This steering enables novel generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer. MiDiGap achieves state-of-the-art performance on diverse few-shot manipulation benchmarks. On constrained RLBench tasks, it improves policy success by 76 percentage points and reduces trajectory cost by 67%. On multimodal tasks, it improves policy success by 48 percentage points and increases sample efficiency by a factor of 20. In cross-embodiment transfer, it more than doubles policy success. We make the code publicly available at this https URL.

2511.14427 2026-06-11 cs.RO cs.LG Updated version

Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki

AI summary: Proposes the MSDP framework, which learns multisensory representations through masked autoencoding and cross-modal prediction and uses an asymmetric architecture (the critic extracts dynamic features via cross-attention; the actor uses a stable pooled representation) to accelerate policy learning, showing robustness and efficiency on simulated and real-robot tasks.

Comments: 8 pages, 11 figures

Abstract

Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control. Website: this https URL

2606.10040 2026-06-11 cs.RO Updated version

Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

Jiajun Li, Tiecheng Guo, Yifan Ye, Rongyu Zhang, Xiaowei Chi, Qianpu Sun, Ying Li, Yunfan Lou, Yan Huang, Zhihe Lu, Meng Guo, Shanghang Zhang

AI summary: Proposes Efficient-WAM, which lowers the cost of future imagination through a compact video expert, sparse video latents, and asymmetric denoising, achieving a 30x inference speedup while preserving control performance.

Abstract

World-Action Models (WAMs) have emerged as a promising paradigm for embodied control by coupling future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that preserves the control benefits of future visual prediction while reducing its inference cost. We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model can reduce per-chunk latency to around 100 ms during physical deployment, achieving a 30x speedup over existing WAMs.

2606.11092 2026-06-11 cs.RO cs.AI Updated version

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li

Affiliations: The University of Hong Kong; The Chinese University of Hong Kong; Archon Robotics

AI summary: Proposes RoboNaldo, a three-stage motion-guided curriculum RL framework that starts from a single human-kick reference and progressively optimizes shooting performance. In simulation, shot error drops by 48.6% and shot velocity is 2.96x the baseline; on the real robot, average shooting error from 3 m away is 0.73 to 0.86 m and post-contact ball velocity reaches 13.10 m/s.

Abstract

Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold, and optimization progressively shifts towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves a free-kick shot error 48.6% lower and a shot velocity 2.96x that of prior-work baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: this https URL.

2604.13733 2026-06-11 cs.LG cs.AI cs.RO Updated version

Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda

AI summary: Proposes VLAJS, which guides PPO exploration with sparse high-level action suggestions from a VLA, combined with a directional action-consistency regularization, to improve the sample efficiency of RL on long-horizon manipulation tasks; validated in simulation and on a real robot.

Comments: ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning

Abstract

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
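
One way to read "directional action-consistency regularization" that is "applied sparsely and annealed over time" is a cosine-based penalty added to the PPO loss with a decaying weight. The sketch below follows that reading; the exact loss form, the linear schedule, and all function names are assumptions rather than the authors' formulation.

```python
import math


def directional_penalty(agent_action, vla_action, eps=1e-8):
    """1 - cosine similarity: zero when both actions point the same way."""
    dot = sum(a * b for a, b in zip(agent_action, vla_action))
    na = math.sqrt(sum(a * a for a in agent_action))
    nb = math.sqrt(sum(b * b for b in vla_action))
    if na < eps or nb < eps:
        return 0.0  # no direction to compare
    return 1.0 - dot / (na * nb)


def guidance_weight(step, anneal_steps, w0=1.0):
    """Linearly anneal the regularization weight to zero (assumed schedule)."""
    return w0 * max(0.0, 1.0 - step / anneal_steps)


def total_loss(ppo_loss, agent_action, vla_action, step, anneal_steps, has_guidance):
    """PPO loss plus the annealed penalty, only on steps where the VLA was queried."""
    if not has_guidance:
        return ppo_loss
    penalty = directional_penalty(agent_action, vla_action)
    return ppo_loss + guidance_weight(step, anneal_steps) * penalty
```

Penalizing only the direction, not the magnitude, leaves the agent free to choose how far to move, which matches the abstract's point about not enforcing strict imitation.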

2605.03065 2026-06-11 cs.LG cs.RO Updated version

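DynaRetarget: Dynamically Feasible Retargeting with Sampling-Based Trajectory Optimization

Victor Dhedin, Ilyass Taouil, Shafeef Omar, Dian Yu, Kun Tao, Angela Dai, Majid Khadiv

AI summary: Proposes the DynaRetarget framework, which retargets human motion into humanoid robot control policies via sampling-based trajectory optimization, producing long-horizon dynamically feasible motions and achieving higher success rates across hundreds of demonstrations.

2602.06868 2026-06-11 cs.RO Updated version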

Consensus-based optimization (CBO): Towards Global Optimality in Robotics

Xudong Sun, Armand Jordana, Massimo Fornasier, Jalal Etesami, Majid Khadiv

AI summary: Introduces consensus-based optimization (CBO) to robotics; it is guaranteed to converge to a global optimum under mild assumptions and outperforms existing methods in three challenging trajectory optimization scenarios.

Abstract

Zero-order optimization has recently received significant attention for designing optimal trajectories and policies for robotic systems. However, most existing methods (e.g., MPPI, CEM, and CMA-ES) are local in nature, as they rely on gradient estimation. In this paper, we introduce consensus-based optimization (CBO) to robotics, which is guaranteed to converge to a global optimum under mild assumptions. We provide theoretical analysis and illustrative examples that give intuition into the fundamental differences between CBO and existing methods. To demonstrate the scalability of CBO for robotics problems, we consider three challenging trajectory optimization scenarios: (1) a long-horizon problem for a simple system, (2) a dynamic balance problem for a highly underactuated system, and (3) a high-dimensional problem with only a terminal cost. Our results show that CBO is able to achieve lower costs with respect to existing methods on all three challenging settings. This opens a new framework to study global trajectory optimization in robotics.
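
For readers unfamiliar with CBO, the following is a minimal sketch of the standard particle scheme: particles drift toward a Gibbs-weighted consensus point and receive noise proportional to their distance from it. It uses the component-wise (anisotropic) noise variant with an Euler-Maruyama step; the hyperparameter values are illustrative and not taken from the paper.

```python
import math
import random


def consensus_point(particles, costs, alpha):
    """Gibbs-weighted average; low-cost particles dominate for large alpha."""
    c_min = min(costs)  # shift costs for numerical stability
    weights = [math.exp(-alpha * (c - c_min)) for c in costs]
    total = sum(weights)
    dim = len(particles[0])
    return [sum(w * p[d] for w, p in zip(weights, particles)) / total
            for d in range(dim)]


def cbo_minimize(f, dim, n_particles=50, steps=200, alpha=30.0, lam=1.0,
                 sigma=0.7, dt=0.05, init_range=3.0, seed=0):
    """Run CBO on f and return the final consensus point."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-init_range, init_range) for _ in range(dim)]
          for _ in range(n_particles)]
    for _ in range(steps):
        v = consensus_point(xs, [f(x) for x in xs], alpha)
        for x in xs:
            for d in range(dim):
                diff = x[d] - v[d]
                # Drift toward the consensus point plus distance-scaled noise.
                x[d] += (-lam * diff * dt
                         + sigma * abs(diff) * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return consensus_point(xs, [f(x) for x in xs], alpha)
```

A call such as `cbo_minimize(lambda x: sum(c * c for c in x), dim=2)` runs the scheme on a sphere function. Only cost evaluations are needed, no gradients, and the noise vanishes as particles reach consensus.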

2605.12053 2026-06-11 cs.RO Updated version

Closing the Motion Execution Gap: From Semantic Motion Task Constraints to Kinematic Control

Simon Stelter, Vanessa Hassouna, Malte Huerkamp, Michael Beetz

AI summary: Connects semantic constraints to executable robot motion through Motion Statecharts, uses a unified differentiable kinematic world model for world-centric motion specification and cross-platform generalization, and ensures smooth transitions during task switches with an lMPC-based task-function approach.

Comments: 9 pages, 8 figures, to be published in IJCAI 2026

Abstract

This paper addresses the Motion Execution Gap, the disconnect between high-level symbolic task descriptions using semantic constraints and executable robot motions. Motion Statecharts are introduced as an executable symbolic representation for complex motions. They allow the arbitrary arrangement of motion constraints, monitors, or nested statecharts in parallel and in sequence. World-centric motion specification and generalization across embodiments are enabled through the use of a unified differentiable kinematic world model of both robots and environments. Motion execution is realized through an lMPC-based implementation of the task-function approach, in which smooth transitions during task switches are ensured using jerk bounds. Cross-platform transferability was demonstrated by deploying the method on eight robot platforms operating in diverse environments. The proposed framework is called Giskard and is available open source: https://github.com/cram2/cognitive_robot_abstract_machine

2606.11396 2026-06-11 cs.RO New submission

PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation

Abhinav Kumar, Soshi Iba, Rana Soltani Zarrin, Dmitry Berenson

Affiliations: University of Michigan; Honda Research Institute USA

AI summary: Proposes PLUME, a world model that jointly learns the evolution of a belief over parameters and the conditional dynamics; online parameter inference enables zero-shot transfer, and it outperforms existing methods on tasks such as screwdriver turning.

Comments: 16 pages, 5 figures

Abstract

Dexterous manipulation with multi-finger hands can be sensitive to physical parameters such as object shape, pose, and friction coefficients. While simulation enables large-scale data collection with known parameter values, simulation-trained policies must still handle uncertainty at deployment, where the true parameters and therefore the true dynamics are unknown. Standard domain randomization strategies may be insufficient for precise tasks like screwdriver turning, as manipulation strategies may need to change depending on specific parameter values. To address this, we propose Probabilistic Latent Unified world Modeling and parameter Estimation (PLUME), a world model that jointly learns to evolve a belief over parameter values as well as the system dynamics conditioned on those parameters. We learn a latent space to jointly represent multiple qualitatively different physical parameters along with rewards, themselves functions of partially-observable variables, to inform planning. Our novel learning framework leads to efficient alignment of the world model to true dynamics through online parameter inference as opposed to re-training or fine-tuning. We evaluate our method on simulated screwdriver turning, valve turning, bucket lifting, and disk flicking tasks, as well as a hardware screwdriver turning task, where we achieve successful zero-shot transfer of our simulation-trained policy and outperform state-of-the-art offline reinforcement learning and world-model-augmented behavior cloning baselines. Please see our website at this https URL for videos.
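
The notion of evolving a belief over physical parameters online, instead of retraining, can be illustrated with a plain Bayes update over a discrete candidate set. PLUME learns this belief in a latent space; the function below is only a hand-written stand-in, and `predict`, `noise_std`, and the Gaussian likelihood are assumptions for illustration.

```python
import math


def update_belief(belief, candidates, predict, observation, noise_std=0.1):
    """One Bayes step over a discrete set of candidate parameter values."""
    posterior = []
    for prior, theta in zip(belief, candidates):
        err = observation - predict(theta)
        # Gaussian likelihood of the observation under this candidate.
        likelihood = math.exp(-0.5 * (err / noise_std) ** 2)
        posterior.append(prior * likelihood)
    total = sum(posterior)
    if total == 0.0:
        return list(belief)  # no candidate explains the observation; keep the prior
    return [p / total for p in posterior]
```

Repeating this step as observations arrive concentrates the belief on the parameter values whose predicted dynamics match what the robot actually measures, and a planner can then condition on that belief.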

2606.11743 2026-06-11 cs.RO cs.GR cs.LG New submission

TacCoRL: Integrating Tactile Feedback into VLA via Simulation

Siyu Ma, Yuqi Liang, Chang Yu, Yunuo Chen, Hao Su, Yixin Zhu, Yin Yang, Chenfanfu Jiang

Affiliations: University of California, Los Angeles; University of California, San Diego; University of Electronic Science and Technology of China; Peking University; University of Utah

AI summary: Proposes the TacCoRL framework, which injects tactile feedback into vision-language-action policies through sim-real co-training and reinforcement learning, raising the average success rate on contact-rich tasks by 22.5 percentage points (72.5% vs. 50.0%).

Abstract

Vision-language-action (VLA) models provide strong visual, language, and action priors for robot manipulation, but visual observations alone often miss the local contact state required for contact-rich tasks. We present TacCoRL, a scalable framework that injects Tactile feedback into VLA policies and improves them through sim-real Co-training and simulation-based reinforcement learning (RL), without requiring large-scale tactile pretraining or extensive real-world contact exploration. The key idea is not only to add touch as an input, but to learn how contact readings should modulate action responses in near-failure states that are rare in demonstrations and risky to collect on hardware. We use a real-aligned simulator as a closed-loop training environment for contact interaction. Mixed simulated and real trajectories first warm-start tactile-conditioned actions in the pretrained policy. Reinforcement learning with verifiable task rewards then optimizes the policy using simulated contact rollouts. It reinforces tactile-conditioned actions that lead to task completion, while a supervised objective on real trajectories keeps the refined policy anchored to the deployment visual, tactile, and action distributions. The resulting policy transfers directly to the real robot without privileged simulation state or online real-world RL. Across four bimanual contact-rich tasks, the final visuo-tactile policy achieves an average success rate of 72.5%, compared to a baseline of 50.0%. Result videos and more details are available at this https URL

2606.11767 2026-06-11 cs.RO cs.AI New submission

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

Affiliations: ShanghaiTech University; Beijing Institute for General Artificial Intelligence

AI summary: Proposes a framework combining Real2Sim tactile calibration, a layout-aware tactile encoder, and a tactile-conditioned diffusion policy for tactile-only blind grasping with a dexterous hand, reaching a 27% success rate over 20 objects on a real robot.

Comments: 23 pages, 6 figures

Abstract

Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: this http URL.

2606.11826 2026-06-11 cs.RO 新提交

Modular Anthropomorphic Hand Design via Multi-Parameter Finger Benchmarking and Selection

模块化拟人手设计：基于多参数手指基准测试与选择

Yu Zhang, Huijiang Wang, Josie Hughes

发表机构 * The CREATE Lab, Institute of Mechanical Engineering, Swiss Federal Institute of Technology in Lausanne (EPFL)（瑞士洛桑联邦理工学院机械工程研究所CREATE实验室）

AI总结提出一种模块化拟人手设计框架，通过多参数基准测试优化手指设计，实现整手性能提升，并在多物体抓取和灯泡旋拧任务中验证有效性。

Comments: 14 pages, 13 figures. Submitted to an IEEE journal for possible publication

AI中文摘要

设计拟人灵巧手仍然具有挑战性，因为设计空间跨越形态、驱动和传感特性，而性能指标既包括任务相关也包括任务无关。现有的优化方法通常是非结构化的，或者只考虑单一性能指标，限制了系统比较和针对性改进。虽然整手的设计考虑很重要，但单个手指的特性在灵巧性中起着关键作用。通过开发一个手指可以模块化集成到完整遥操作手中的机器人手平台，我们提出优化手指可以显著提高整手性能。该方法能够在手指集成到手部进行任务级验证之前，通过多个定量基准快速筛选不同的手指级原型。候选手指设计（包含关节、骨骼、皮肤和传感器位置的变化）使用面向机制和任务相关的指标进行评估，建立了组件设计与整手体现之间的定量联系。该框架通过开发具有优化手指的拟人机器人手得到验证，展示了这些手指如何在多物体抓取和灯泡旋拧等任务中实现性能提升。

英文摘要

Designing anthropomorphic dexterous robotic hands remains challenging as the design space straddles morphology, actuation, and sensing properties, and performance metrics span both task-dependent and task-agnostic measures. Existing optimization methods are often unstructured or consider only a single performance metric, limiting systematic comparison and targeted refinement. While the design considerations of the entire hand are significant, the individual finger properties play a key role in dexterity. By developing a robotic hand platform where fingers can be modularly integrated into a full teleoperated hand, we propose that optimizing the fingers can significantly improve overall hand performance. This approach enables rapid screening of different finger-level prototypes through a number of quantitative benchmarks before their integration into the hand for task-level validation. Candidate finger designs (incorporating variations in joint, bone, skin, and sensor placement) are assessed using both mechanism-oriented and task-relevant metrics, which establish a quantitative link between component design and full hand embodiment. The framework is validated through the development of an anthropomorphic robotic hand with optimized fingers, demonstrating how these fingers enable performance improvements across tasks, including multi-object grasping and light bulb screwing.

2606.12048 2026-06-11 cs.RO 新提交

Point Cloud Segmentation for Autonomous Clip Positioning in Laparoscopic Cholecystectomy on a Phantom

用于腹腔镜胆囊切除术中自动夹子定位的点云分割（在体模上）

Balázs Gyenes, Nikolai Franke, Paul Maria Scheikl, Pit Henrich, Rayan Younis, Gerhard Neumann, Martin Wagner, Franziska Mathis-Ullrich

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）； HIDSS4Health - Helmholtz Information and Data Science School for Health（亥姆霍兹信息与数据科学健康学校）； Friedrich-Alexander-University, Erlangen-Nuremberg（弗里德里希-亚历山大大学埃尔朗根-纽伦堡）； University Hospital Carl Gustav Carus and Centre for Tactile Internet with Human-in-the-loop (CeTI), Dresden University of Technology（卡尔·古斯塔夫·卡鲁斯大学医院及德累斯顿工业大学触觉互联网人机共融卓越中心）

AI总结提出首个在腹腔镜手术体模上实现自主夹子定位的机器人系统，通过点云分割和样条插值提取目标位置，利用合成数据预训练和两种数据增强克服数据稀缺，达到0.75mm精度和100%成功率。

Comments: 8 pages, 5 figures, accepted to IEEE Robotics and Automation Letters (RAL)

AI中文摘要

机器人技术中的高风险应用，如机器人辅助手术，提出了独特的挑战。这些系统必须高度精确且可解释，才能部署在对错误或不安全探索容忍度极低的环境中。我们提出了第一个在腹腔镜手术（普外科最常见的手术之一）中在物理体模上演示自主夹子定位的机器人系统。在从单个相机分割无色点云后，使用样条插值提取夹子的目标位置，然后可由操作员调整。分割模型仅使用60个手工标记的真实点云进行训练，反映了手术领域的数据稀缺性。我们通过结合在128,000个合成点云上的预训练和两种新颖的数据增强技术来克服这一问题。末端执行器到每个目标的运动可视化给操作员，满足微创手术的独特运动约束，同时确保机器人的动作可验证和可解释。在真实机器人实验中，我们的系统以95%的成功率定位目标，精度为0.75mm，并以100%的成功率执行自主夹子定位。我们提供的见解适用于许多其他需要识别并导航到精确目标的手术和非手术任务。源代码和项目页面：此 https URL

英文摘要

High-risk applications in robotics, such as robot-assisted surgery, present unique challenges. These systems must be both highly precise and interpretable in order to be deployed in environments with very low tolerance for error or unsafe exploration. We present the first robotic system to demonstrate autonomous clip positioning on a physical phantom in laparoscopic surgery, one of the most common interventions in general surgery. After segmentation of a colorless point cloud from a single camera, target positions for the clips are extracted using spline interpolation, and can then be adjusted by the human operator. The segmentation model is trained on only 60 hand-labeled real point clouds, reflecting data scarcity in the surgical domain. We overcome this with a combination of pre-training on 128,000 synthetic point clouds and two novel data augmentation techniques. The motion of the end-effector to each target is visualized for the operator, satisfying the unique motion constraints of minimally-invasive surgery while ensuring that the robot's actions are verifiable and interpretable. In real robot experiments, our system localizes targets with the required precision of 0.75mm at a 95% success rate and executes autonomous clip positioning with a 100% success rate. We provide insights that are applicable to many other surgical and non-surgical tasks that require identifying and navigating to a precise target. Source code and project page: this https URL

2606.12406 2026-06-11 cs.RO cs.AI cs.LG eess.SY 新提交

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

FACTR 2: 学习商用机器人手臂的外部力感知提升策略学习

Steven Oh, Jason Jingzhou Liu, Tony Tao, Philip Han, Kenneth Shaw, Satoshi Funabashi, Ruslan Salakhutdinov, Deepak Pathak

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Waseda University（早稻田大学）

AI总结提出无需专用力传感器的数据驱动方法NEXT，可在1分钟内从10分钟自由运动数据中训练，实现与专用关节力矩传感器相当的估计，并结合FIRST采样策略提升策略学习性能。

Comments: Website at this https URL

AI中文摘要

接触丰富的操作需要力敏感性，但由于成本高昂，许多机器人手臂缺乏专用的力传感器。我们提出了神经外部力矩估计（NEXT），一种无需任何专用力传感器即可估计外部关节力矩的数据驱动方法。NEXT 仅需 10 分钟的自由运动数据即可在 1 分钟内完成训练，却能实现与专用关节力矩传感器相当的估计。NEXT 能够在低成本手臂上实现力反馈遥操作，并通过力信息重采样训练（FIRST）改进策略学习，该训练在行为克隆过程中对预接触和接触段进行上采样。在五个长时域任务中，FIRST 在任务进展上比先前的力感知策略提高了超过 17%。NEXT 和 FIRST 共同将力感知遥操作和策略学习引入现成的机器人，无需额外的传感硬件。视频结果和代码见：此 https URL。

英文摘要

Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at this https URL
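
The abstract describes FIRST only as up-sampling pre-contact and contact segments during behavior cloning. A minimal sketch of that idea follows, assuming contact is detected by thresholding the norm of the estimated external torque; the function name, threshold, window length, and boost factor are all hypothetical and are not taken from the paper.

```python
def contact_aware_sampling_probs(ext_torque_norms, contact_threshold=1.0,
                                 pre_contact_window=3, boost=4.0):
    """Return per-timestep sampling probabilities for behavior cloning.

    Timesteps in contact, or within `pre_contact_window` steps before a
    contact, get `boost` times the weight of free-motion timesteps.
    All parameters are illustrative assumptions.
    """
    n = len(ext_torque_norms)
    # Assumed contact detector: external torque norm above a fixed threshold.
    in_contact = [t > contact_threshold for t in ext_torque_norms]
    weights = [1.0] * n
    for i in range(n):
        # Pre-contact: a contact occurs within the next few timesteps.
        upcoming = any(in_contact[i + 1:i + 1 + pre_contact_window])
        if in_contact[i] or upcoming:
            weights[i] = boost
    total = sum(weights)
    return [w / total for w in weights]


# Toy trajectory: free motion, then two contact steps, then release.
probs = contact_aware_sampling_probs(
    [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0],
    contact_threshold=1.0, pre_contact_window=2, boost=3.0)
```

The resulting probabilities could be passed to a weighted sampler when drawing training batches, so contact-related frames appear more often without duplicating data on disk.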

2503.08445 2026-06-11 cs.RO 版本更新

iPack: Intuitive Bin Packing with Large Language Models

iPack: 基于大型语言模型的直观装箱

Yannik Blei, Michael Krawez, Adrian Göß, Devadas Vijayan Sheela, Tobias Jülg, Pierre Krack, Florian Walter, Wolfram Burgard

AI总结提出LLM-Pack方法，利用语言和视觉基础模型生成模仿人类策略的杂货装箱顺序，无需专门训练即可处理新物品，模块化设计便于升级。

Comments: 7 Pages, 9 Figures

AI中文摘要

机器人和自动化在物流中越来越有影响力，但仍主要局限于传统仓库。在杂货零售中，虽然存在无收银员超市等进步，但顾客仍然手动挑选和包装杂货。尽管机器人领域对箱子拾取问题有大量关注，但包装物品和杂货的任务基本上未被触及。然而，以正确顺序包装杂货物品对于防止产品损坏至关重要，例如，重物不应放在易碎物品之上。然而，正确包装顺序的确切标准很难定义，特别是考虑到商店中通常有大量各种物品。在本文中，我们介绍了LLM-Pack，一种新颖的杂货包装方法。LLM-Pack利用语言和视觉基础模型来识别杂货并生成模仿人类包装策略的包装序列。LLM-Pack不需要专门训练来处理新的杂货物品，其模块化设计允许轻松升级底层基础模型。我们广泛评估了我们的方法以展示其性能。我们将在本文发表后公开LLM-Pack的源代码。

英文摘要

Robotics and automation are increasingly influential in logistics but remain largely confined to traditional warehouses. In grocery retail, advancements such as cashier-less supermarkets exist, yet customers still manually pick and pack groceries. While there has been a substantial focus in robotics on the bin picking problem, the task of packing objects and groceries has remained largely untouched. However, packing grocery items in the right order is crucial for preventing product damage, e.g., heavy objects should not be placed on top of fragile ones. The exact criteria for the right packing order are nevertheless hard to define, in particular given the huge variety of objects typically found in stores. In this paper, we introduce LLM-Pack, a novel approach for grocery packing. LLM-Pack leverages language and vision foundation models for identifying groceries and generating a packing sequence that mimics human packing strategy. LLM-Pack does not require dedicated training to handle new grocery items and its modularity allows easy upgrades of the underlying foundation models. We extensively evaluate our approach to demonstrate its performance. We will make the source code of LLM-Pack publicly available upon the publication of this manuscript.

2604.20348 2026-06-11 cs.RO cs.AI cs.MA 版本更新

KinematicRL: 一种面向社交导航的具有运动学可行性的仿真到现实强化学习框架

Zhiming Xu, Haodong Yang, Chengju Liu, Qijun Chen, Chenpeng Yao

发表机构 * School of Computer Science and Technology, Tongji University（同济大学计算机科学与技术学院）； Department of Electronics and Information Engineering, Tongji University（同济大学电子与信息工程学院）； Shanghai Institute of Intelligent Science and Technology, Tongji University（同济大学上海智能科学与技术研究院）

AI总结提出KinematicRL框架，通过二阶控制动作空间、基于2D LiDAR的聚类人体追踪和无偏残差门控模块，解决社交导航中仿真到现实的动态可行性问题。

Comments: Accepted by IEEE Transactions on Automation Science and Engineering (T-ASE)

AI中文摘要

深度强化学习（DRL）在社交导航中展现出潜力，但其实际部署仍受到由简化一阶动力学和特定上下文的人体状态估计管道导致的持续仿真到现实差距的阻碍。本文提出一个统一框架，解决这些限制，以生成适用于实际部署的动态可行导航策略。首先，理论分析表明，模拟与实际机器人位置之间的跟踪误差随控制阶数增加呈指数衰减，这促使使用高阶控制输入作为DRL动作空间。针对差动驱动机器人开发了二阶控制公式，并辅以随机迭代线性二次型调节器（iLQR），通过散度最小化目标预训练策略。其次，为避免相机-激光雷达融合带来的额外系统复杂性，引入仅使用2D激光雷达的基于聚类的人体追踪管道。根据空间邻近性和速度相似性关联人体检测，实现对附近行人的可靠区分，并通过时间聚合获得稳定的速度估计。第三，我们引入一个无偏残差门控模块，以平衡基于反应和基于记忆的行为，同时处理时变的人群规模，这两者对于社交导航至关重要。由此产生的策略KinematicRL持续改善运动学性能，并适应检测到的人类数量变化。在真实环境中的实验表明，当与所提出的追踪管道结合时，KinematicRL可以在实际差动驱动机器人上以最小修改部署。

英文摘要

Deep Reinforcement Learning (DRL) has shown promise for social navigation, yet its real-world deployment remains hindered by a persistent sim-to-real gap arising from simplified first-order dynamics and context-specific human state estimation pipelines. This work presents a unified framework that addresses these limitations to produce dynamically feasible navigation policies suitable for real-world deployment. First, theoretical analysis reveals that the tracking error between simulated and actual robot position decays exponentially with increased control order, motivating the use of higher-order control inputs as the DRL action space. A second-order control formulation tailored to differential drive robots is developed, complemented by a stochastic iterative Linear Quadratic Regulator (iLQR) that pretrains the policy via a divergence minimization objective. Second, to avoid the added system complexity of camera-LiDAR fusion, a cluster-based human tracking pipeline using only 2D LiDAR is introduced. Human detections are associated according to both spatial proximity and velocity similarity, enabling reliable differentiation of nearby pedestrians and yielding stable velocity estimates through temporal aggregation. Third, we introduce an unbiased residual gating block to balance reaction- and memory-based behaviors while handling time-varying crowd sizes, both critical for social navigation. The resulting policy, KinematicRL, consistently improves kinematic performance and adapts to a varying number of detected humans. Experiments in real-world environments demonstrate that, when combined with the proposed tracking pipeline, KinematicRL can be deployed on a real differential drive robot with minimal modifications.
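
To make the contrast between first-order and second-order action spaces concrete, here is a minimal sketch of a differential-drive (unicycle) model in which the action is a pair of accelerations rather than velocities. The abstract does not give the paper's exact formulation, so the state layout, the velocity limits, the time step, and the integration scheme below are assumptions for illustration only.

```python
import math


def step_second_order(state, action, dt=0.1, v_max=1.0, w_max=1.5):
    """Advance a unicycle model one step under acceleration commands.

    state  = (x, y, theta, v, w): pose plus linear and angular velocity.
    action = (a_v, a_w): linear and angular acceleration (assumed action space).
    Velocities are integrated first and clamped, so a single action can
    change them by at most |a| * dt, which keeps the motion smooth.
    """
    x, y, theta, v, w = state
    a_v, a_w = action
    # Integrate accelerations into velocities, then clamp to assumed limits.
    v = max(-v_max, min(v_max, v + a_v * dt))
    w = max(-w_max, min(w_max, w + a_w * dt))
    # Integrate velocities into the pose (semi-implicit Euler).
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += w * dt
    return (x, y, theta, v, w)


# Starting from rest, one step of forward acceleration.
next_state = step_second_order((0.0, 0.0, 0.0, 0.0, 0.0), (1.0, 0.0))
```

With a first-order action space the policy could request an arbitrary velocity jump between steps; here the velocity is part of the state, so commanded motion stays continuous by construction.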

2503.22926 2026-06-11 cs.RO 版本更新

SR-LIO++: LiDAR-Inertial Odometry and Quantized Mapping with Caching-Aware Sweep Reconstruction

SR-LIO++: 基于缓存感知扫描重建的LiDAR-惯性里程计与量化建图

Zikang Yuan, Ruiye Ming, Chengwei Zhao, Yonghao Tan, Pingcheng Dong, Yuan Ren, Yuzhong Jiao, Xin Yang, Kwang-Ting Cheng

AI总结提出SR-LIO++系统，通过扫描重建提高输出频率，并采用缓存机制和量化地图点管理，在资源受限平台上实现高精度、高效率的LiDAR-惯性里程计。

Comments: 18 pages, 10 figures

AI中文摘要

解决3D LiDAR固有低采集频率限制以实现高频输出已成为LiDAR-惯性里程计（LIO）领域的关键研究焦点。为确保实时性能，频率增强的LIO系统必须在显著缩短的时间窗口内处理每次扫描，这对在资源受限平台上的部署提出了巨大挑战。为解决这些限制，我们引入了SR-LIO++，一种创新的LIO系统，能够在资源受限的硬件平台（包括Raspberry Pi 4B）上实现相对于输入频率两倍的输出频率。我们的系统采用先前提出的扫描重建方法来提高LiDAR扫描频率，生成高频重建扫描。在此基础之上，我们提出了一种针对最近段中间结果（即表面参数）的缓存机制，有效减少了相邻重建扫描中公共段的冗余处理。该方法将处理时间从传统的对重建扫描频率的线性依赖中解耦出来。此外，我们提出了一种基于索引表映射的量化地图点管理，通过将全局3D点存储从64位双精度转换为8位字符表示，显著减少了内存使用。该方法还将最近邻搜索中计算密集的欧几里得距离计算从64位双精度转换为16位短整型和32位整型格式，降低了计算成本。在三个不同计算平台和四个公开数据集上的广泛实验评估表明，SR-LIO++在保持最先进精度的同时，显著提高了效率。值得注意的是，我们的系统在Raspberry Pi 4B硬件上成功实现了20 Hz的状态输出。

英文摘要

Addressing the inherent low acquisition frequency limitation of 3D LiDAR to achieve high-frequency output has become a critical research focus in the LiDAR-Inertial Odometry (LIO) domain. To ensure real-time performance, frequency-enhanced LIO systems must process each sweep within a significantly reduced timeframe, which presents substantial challenges for deployment on resource-constrained platforms. To address these limitations, we introduce SR-LIO++, an innovative LIO system capable of achieving doubled output frequency relative to input frequency on resource-constrained hardware platforms, including the Raspberry Pi 4B. Our system employs the previously proposed sweep reconstruction methodology to enhance LiDAR sweep frequency, generating high-frequency reconstructed sweeps. Building upon this foundation, we propose a caching mechanism for intermediate results (i.e., surface parameters) of the most recent segments, effectively minimizing redundant processing of common segments in adjacent reconstructed sweeps. This method decouples processing time from the traditionally linear dependence on reconstructed sweep frequency. Furthermore, we present a quantized map point management based on index table mapping, significantly reducing memory usage by converting global 3D point storage from 64-bit double precision to 8-bit char representation. This method also converts the computationally intensive Euclidean distance calculations in nearest neighbor searches from 64-bit double precision to 16-bit short and 32-bit integer formats, reducing computational cost. Extensive experimental evaluations across three distinct computing platforms and four public datasets demonstrate that SR-LIO++ maintains state-of-the-art accuracy while substantially enhancing efficiency. Notably, our system successfully achieves 20 Hz state output on Raspberry Pi 4B hardware.
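
The general idea of replacing 64-bit coordinates with 8-bit codes can be sketched as follows, assuming each point is stored as a voxel index plus a one-byte offset per axis inside that voxel. This is an illustrative scheme, not the index-table layout of SR-LIO++; the voxel size, the number of levels, and all function names are hypothetical.

```python
import math

VOXEL_SIZE = 0.5  # metres per voxel edge (assumed value)
LEVELS = 256      # one byte per axis inside a voxel


def quantize_point(p):
    """Split a 3D point into an integer voxel key and three 8-bit offsets."""
    key = tuple(math.floor(c / VOXEL_SIZE) for c in p)
    codes = bytes(
        max(0, min(LEVELS - 1, int((c - k * VOXEL_SIZE) / VOXEL_SIZE * LEVELS)))
        for c, k in zip(p, key)
    )
    return key, codes


def dequantize_point(key, codes):
    """Reconstruct the centre of the quantization cell."""
    return tuple((k + (q + 0.5) / LEVELS) * VOXEL_SIZE for k, q in zip(key, codes))


def squared_distance_int(codes_a, codes_b):
    """Integer squared distance between two points of the same voxel.

    Each difference lies in [-255, 255], so it fits a 16-bit integer, and
    the sum of three squares fits a 32-bit integer.
    """
    return sum((a - b) ** 2 for a, b in zip(codes_a, codes_b))


point = (1.23, -0.3, 4.0)
key, codes = quantize_point(point)
restored = dequantize_point(key, codes)
```

Under these assumptions the per-axis reconstruction error is at most half a quantization cell, `VOXEL_SIZE / (2 * LEVELS)`, which is just under 1 mm for the values chosen here.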

2505.10018 2026-06-11 cs.RO 版本更新

LEMON-Mapping: Loop-Enhanced Large-Scale Multi-Session Point Cloud Merging and Optimization for Globally Consistent Mapping

LEMON-Mapping: 面向全局一致建图的环路增强大规模多会话点云融合与优化

Lijie Wang, Xiaoyi Zhong, Ziyi Xu, Kaixin Chai, Anke Zhao, Tianyu Zhao, Changjian Jiang, Qianhao Wang, Xieyuanli Chen, Fei Gao

AI总结提出LEMON-Mapping框架，通过鲁棒环路处理、空间光束法平差和基于PGO的优化，解决多机器人建图中重叠区域发散和模糊问题，实现高精度全局一致点云融合。

AI中文摘要

多机器人协作在现代机器人领域日益重要且面临重大挑战，尤其是在构建全局一致、精确的地图方面。传统的多机器人位姿图优化（PGO）方法确保基本的全局一致性，但忽略了地图的几何结构，仅将闭环作为位姿节点之间的约束，导致重叠区域出现发散和模糊。为解决此问题，我们提出LEMON-Mapping，一种用于大规模多会话点云融合与优化的环路增强框架。我们重新审视环路在多机器人建图中的作用，并引入三项关键创新。首先，我们开发了一种鲁棒的环路处理机制来剔除异常值，以及一种环路召回策略来恢复被错误移除但有效的环路。其次，我们引入了针对多机器人地图的空间光束法平差，以减少发散并消除重叠中的模糊。第三，我们设计了一种基于PGO的方法，利用精化的光束法平差约束将局部精度传播到整个地图。我们在多个公开数据集和一个自采集数据集上验证了LEMON-Mapping。实验结果表明，与传统融合方法相比，我们的框架具有更优越的建图精度和全局一致性。可扩展性实验也证明了其处理涉及大量机器人场景的强大能力。

英文摘要

Multi-robot collaboration is becoming increasingly critical and presents significant challenges in modern robotics, especially for building a globally consistent, accurate map. Traditional multi-robot pose graph optimization (PGO) methods ensure basic global consistency but ignore the geometric structure of the map, and only use loop closures as constraints between pose nodes, leading to divergence and blurring in overlapping regions. To address this issue, we propose LEMON-Mapping, a loop-enhanced framework for large-scale, multi-session point cloud fusion and optimization. We re-examine the role of loops for multi-robot mapping and introduce three key innovations. First, we develop a robust loop processing mechanism that rejects outliers and a loop recall strategy to recover mistakenly removed but valid loops. Second, we introduce spatial bundle adjustment for multi-robot maps, reducing divergence and eliminating blurring in overlaps. Third, we design a PGO-based approach that leverages refined bundle adjustment constraints to propagate local accuracy to the entire map. We validate LEMON-Mapping on several public datasets and a self-collected dataset. The experimental results show superior mapping accuracy and global consistency of our framework compared to traditional merging methods. Scalability experiments also demonstrate its strong capability to handle scenarios involving numerous robots.

2511.13207 2026-06-11 cs.RO cs.CV 版本更新

PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

PIGEON: 通过兴趣点选择的VLM驱动物体导航

Cheng Peng, Zhenzhe Zhang, Xiaobao Wei, Yanhao Zhang, Heng Wang, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Shanghang Zhang, Jing Liu

AI总结提出PIGEON框架，将物体导航建模为基于原始观测的稀疏决策问题，通过兴趣点（PoI）作为视觉决策单元，结合VLM选择关键点，实现零样本SOTA性能并迁移至主动具身问答。

AI中文摘要

在未见过的室内环境中进行物体导航要求智能体在部分可观测条件下执行语义搜索。视觉-语言模型（VLM）为此任务提供了强大的语义-空间先验，但如何将其与机器人导航接口仍然具有挑战性：密集的VLM推理成本高昂，而将环境抽象为符号记忆通常将高层推理与支持它的原始视觉证据分离。我们提出PIGEON（基于兴趣点引导的物体导航探索），一种VLM驱动的框架，将物体导航建模为基于原始观测的稀疏决策问题。PIGEON引入兴趣点（PoI）作为稀疏视觉决策单元，将几何可执行的路点与原始自我中心观测耦合。PIGEON不是将VLM用作密集控制器或限制其进行前沿排序，而是使VLM能够选择任务关键的PoI，包括探索前沿、疑似目标物体、可穿越楼梯和楼层级摘要，而低级规划器在它们之间执行连续运动。这种PoI接口进一步使高层导航决策可验证，使我们能够开发一个RLVR流水线，无需手动思维链注释即可改进局部VLM。在Habitat ObjectNav基准上的大量实验表明，PIGEON实现了零样本最先进性能，与基础模型能力一致扩展，并且仅通过提示修改即可迁移到主动具身问答。在物理机器人上的实际部署进一步证明了其鲁棒性和效率。

英文摘要

Object navigation in unseen indoor environments requires agents to perform semantic search under partial observability. Vision-language models (VLMs) provide strong semantic-spatial priors for this task, but how to interface them with robot navigation remains challenging: dense VLM inference is expensive, while abstracting environments into symbolic memories often separates high-level reasoning from the raw visual evidence that supports it. We propose PIGEON (Point of Interest Guided Exploration for Object Navigation), a VLM-driven framework that formulates object navigation as a raw-observation-grounded sparse decision problem. PIGEON introduces Points of Interest (PoIs) as sparse visual decision units that couple geometrically executable waypoints with raw egocentric observations. Rather than using VLMs as dense controllers or restricting them to frontier ranking, PIGEON enables VLMs to select among task-critical PoIs, including exploration frontiers, suspected target objects, traversable stairs, and floor-level summaries, while low-level planners execute continuous motion between them. This PoI interface further makes high-level navigation decisions verifiable, allowing us to develop an RLVR pipeline that improves local VLMs without manual Chain-of-Thought annotations. Extensive experiments on Habitat ObjectNav benchmarks show that PIGEON achieves state-of-the-art zero-shot performance, scales consistently with foundation model capacity, and transfers to Active Embodied Question Answering with only prompt modifications. Real-world deployments on physical robots further demonstrate its robustness and efficiency.

2512.19245 2026-06-11 eess.SY cs.RO 版本更新

Vision-Aided Relative State Estimation for Approach and Landing on a Moving Platform with Inertial Measurements

基于惯性测量的视觉辅助相对状态估计：面向移动平台的进近与着陆

Tarek Bouazza, Alessandro Melis, Soulaimane Berkane, Robert Mahony, Tarek Hamel

AI总结提出一种级联观测器，结合SO(3)互补滤波和线性Riccati观测器，利用IMU和单目相机估计无人机与移动平台的相对位姿和速度，在持续激励条件下实现几乎全局渐近稳定。

Comments: 13 pages, 4 figures. To appear in proceedings of IFAC World Congress 2026

AI中文摘要

本文解决了在进近和着陆过程中，无人机与经历任意三维运动的平面平台之间的相对位置、姿态和速度的估计问题。该估计依赖于安装在两个系统上的惯性测量单元（IMU）的测量值，假设存在合适的通信信道来交换数据，以及由机载单目相机提供的视觉信息，从中提取平台中心的方位（视线方向）和其平面表面的法向量。我们提出了一种级联观测器，在$\mathbf{SO}(3)$上采用互补滤波器来重构相对姿态，随后使用线性Riccati观测器进行相对位置和速度估计。在持续激励条件下建立了两个观测器的收敛性，并证明了级联是几乎全局渐近和局部指数稳定的。我们进一步将设计扩展到平台旋转限制在其法向轴的情况，并表明可以利用其测量的线性加速度来恢复剩余不可观测的旋转角。提供了该情况下局部指数收敛的充分条件。通过大量仿真验证了所提出的观测器。

英文摘要

This paper tackles the problem of estimating the relative position, orientation, and velocity between a UAV and a planar platform undergoing arbitrary 3D motion during approach and landing. The estimation relies on measurements from Inertial Measurement Units (IMUs) mounted on both systems, assuming there is a suitable communication channel to exchange data, together with visual information provided by an onboard monocular camera, from which the bearing (line-of-sight direction) to the platform's center and the normal vector of its planar surface are extracted. We propose a cascade observer with a complementary filter on $\mathbf{SO}(3)$ to reconstruct the relative attitude, followed by a linear Riccati observer for relative position and velocity estimation. Convergence of both observers is established under persistently exciting conditions, and the cascade is shown to be almost globally asymptotically and locally exponentially stable. We further extend the design to the case where the platform's rotation is restricted to its normal axis and show that its measured linear acceleration can be exploited to recover the remaining unobservable rotation angle. A sufficient condition for local exponential convergence in this setting is provided. The proposed observers are validated through extensive simulations.

2605.06100 2026-06-11 eess.SP cs.AI cs.LG cs.RO 版本更新

CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision

可信DFGO：具有可信度监督的可微因子图优化

Liang Qian, Penggao Yan, Penghui Xu, Li-Ta Hsu

AI总结针对GNSS协方差不可靠问题，提出CredibleDFGO框架，通过可微高斯-牛顿求解器与加权生成网络，利用适当评分规则监督预测分布，提升协方差可信度与定位精度。

Comments: Submitted to NAVIGATION: Journal of the Institute of Navigation

AI中文摘要

全球导航卫星系统（GNSS）定位广泛用于城市导航，但GNSS求解器报告的协方差在城市峡谷中通常不可靠。现有的可微因子图优化（DFGO）方法通过求解器学习测量加权，但仍仅使用位置目标。因此，位置估计可能改善，而报告的协方差仍然过小、过大或方向错误。我们提出CredibleDFGO（CDFGO），一种可微GNSS因子图框架，将协方差可信度作为显式训练目标。加权生成网络（WGN）预测每颗卫星的可靠性权重，可微高斯-牛顿求解器将这些权重映射到位置估计和基于Hessian的后验协方差。我们使用适当评分规则端到端监督东-北预测分布。我们研究了负对数似然（NLL）、能量分数（ES）及其组合。在三个UrbanNav测试场景上的结果表明，协方差可信度持续提升。定位精度在中度城市和严峻城市场景中也有所提高；在深度城市场景中，平均水平误差和第95百分位误差均有所改善。在严峻城市的旺角（MK）场景中，与DFGO（MAE）相比，CDFGO-Combined将平均水平误差从13.77米降至11.68米，将NLL从40.63降至6.59，将ES从12.31降至9.05。案例研究将MK改进归因于更好的轴向一致性、更可信的局部协方差椭圆以及卫星级重新加权。

英文摘要

Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods learn measurement weighting through the solver, but they still use position-only objectives. As a result, the position estimate may improve while the reported covariance remains too small, too large, or incorrectly oriented. We propose CredibleDFGO (CDFGO), a differentiable GNSS factor graph framework that makes covariance credibility an explicit training target. A Weighting Generation Network (WGN) predicts per-satellite reliability weights, and a differentiable Gauss-Newton solver maps these weights to a position estimate and a Hessian-derived posterior covariance. We use proper scoring rules to supervise the East-North predictive distribution end to end. We study negative log-likelihood (NLL), the energy score (ES), and their combination. Results on three UrbanNav test scenes show consistent gains in covariance credibility. Positioning accuracy also improves on the medium-urban and harsh-urban scenes; on the deep-urban scene, both the mean horizontal error and the 95th-percentile error improve. On the harsh-urban Mong Kok (MK) scene, CDFGO-Combined reduces the mean horizontal error from 13.77 m to 11.68 m, reduces NLL from 40.63 to 6.59, and reduces ES from 12.31 to 9.05 relative to DFGO (MAE). Case studies link the MK improvement to better axis-wise consistency, more credible local covariance ellipses, and satellite-level reweighting.
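
To illustrate why a proper scoring rule such as the negative log-likelihood penalizes an unreliable covariance even when the position error is unchanged, the sketch below evaluates the standard bivariate Gaussian NLL for an East-North error. It is a generic textbook formula written in plain Python, not the CDFGO training loss; the function name and the example numbers are assumptions.

```python
import math


def gaussian_nll_2d(error_en, cov):
    """Negative log-likelihood of a 2D error under a zero-mean Gaussian.

    error_en: (east, north) position error in metres.
    cov:      2x2 covariance as ((a, b), (c, d)); must be positive definite.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must have a positive determinant")
    e, n = error_en
    # Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (d * e * e - (b + c) * e * n + a * n * n) / det
    return math.log(2.0 * math.pi) + 0.5 * math.log(det) + 0.5 * maha


# Same 10 m East error scored under two reported covariances.
overconfident = gaussian_nll_2d((10.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
calibrated = gaussian_nll_2d((10.0, 0.0), ((100.0, 0.0), (0.0, 100.0)))
```

The overconfident covariance yields a much larger NLL than the calibrated one, which is the kind of signal a position-only objective cannot provide.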

2606.11818 2026-06-11 cs.RO 新提交

Human-Guided Co-Manipulation of Carbon Fiber Plies

碳纤维铺层的人机协同操作

Rami Ojanen, James Fant-Male, Roel Pieters

发表机构 * Automation Technology and Mechanical Engineering, Tampere University（坦佩雷大学自动化技术与机械工程系）

AI总结针对柔性材料自动化处理困难的问题，本文提出结合语音指令、视觉腕部跟踪和力/柔顺控制的多模态方法，实现碳纤维铺层的高效人机协同操作。

Comments: Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

AI中文摘要

由于柔性材料的可变形性带来的挑战，完全自动化处理这类物体是一项困难的任务。同时，全手动过程在人体工程学上可能具有挑战性、繁琐且低效。因此，人机协作（HRC）和协同操作（co-manipulation）在该领域受到越来越多的关注，因为它们能够在需要时让人类参与，同时提高生产力。为了实现操作员与机器人之间的高效协同操作和交互，需要不同的模态和控制方法。在本文中，我们提出并研究了用于碳纤维铺层协同操作的不同控制方法，在受控环境中评估每种方法的优缺点。我们提出，语音指令、通过视觉进行腕部跟踪以及力与柔顺控制的多模态组合将为任务的完整和直观控制提供最佳解决方案。

英文摘要

The handling of flexible materials is a difficult task to fully automate due to the challenges caused by the deformability of these types of objects. Meanwhile, a fully manual process can be ergonomically challenging, tedious and inefficient. Thus, human-robot collaboration (HRC) and cooperative manipulation (co-manipulation) have received increasing interest in this field as they enable human involvement when needed while also improving productivity. To enable efficient co-manipulation and interaction between the human operator and the robot, different modalities and control methods are required. In this paper, we present and examine different control methods for co-manipulation of carbon fiber plies, evaluating the pros and cons of each method in a controlled setting. We propose that a multimodal combination of speech commands, wrist-tracking through vision, and force with compliant control would provide the best solution for complete and intuitive control of the task.

2606.12374 2026-06-11 cs.RO cs.CV 新提交

Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

语义感知的潜水员活动识别框架用于有效的水下多人类-机器人协作

Sadman Sakib Enan, Junaed Sattar

发表机构 * University of Minnesota（明尼苏达大学）

AI总结提出DAR-Net框架，结合Transformer时间推理与像素级场景监督，通过多损失训练对齐全局活动识别与局部人机交互语义，解决低可见度水下环境中的潜水员活动识别问题，并发布首个水下潜水员活动数据集UDA。

AI中文摘要

有效的人机多体协作对于在具有挑战性和高风险的水下环境中扩展人类主导的操作至关重要。为了使自主水下航行器（AUV）成为真正的队友，它们必须能够理解周围环境并识别潜水员的活动，以提供帮助并确保安全。为此，我们引入了DAR-Net，一种新颖的基于Transformer的框架，用于分析复杂的水下场景并对潜水员活动进行分类。我们的贡献在于一种语义引导的学习公式，它将基于Transformer的时间推理与像素级场景监督相结合。这种多损失训练策略明确地将全局活动识别与局部人机交互语义对齐，这在低可见度水下条件下尤为关键。为了解决该领域数据稀缺的重大挑战，我们首次提出了水下潜水员活动（UDA）数据集，这是一个基础资源，包含超过2600张带有像素级掩码的注释图像。通过在受控环境中进行严格的实验评估，我们证明DAR-Net在识别六种不同潜水员活动方面达到了有希望的准确性，优于现有最先进的模型。虽然该数据集提供了关键的基线，但我们的工作作为开创性的一步，为未来研究奠定了基础，并促进了更智能、协作的水下机器人系统的发展。

英文摘要

Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.

2606.12339 2026-06-11 cs.SD cs.RO 交叉投稿

Fast-SDE: Efficient Single-Microphone Sound Source Distance Estimation in Reverberant Environments

Fast-SDE：混响环境中高效单麦克风声源距离估计

Jiang Wang, Runwu Shi, Yaozhong Kang, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai

发表机构 * Institute of Science Tokyo（东京科学大学）

AI总结提出Fast-SDE，一种基于子带处理的轻量级单麦克风框架，用于在资源受限的机器人平台上实现高效声源距离估计。

详情

Comments: To appear in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

AI中文摘要

声源距离估计（SDE）是人机交互中的关键能力。不适当的交互距离不仅会降低语音采集和理解的可靠性，还会损害交互的自然性和舒适性。现有大多数SDE方法依赖麦克风阵列，然而多麦克风系统通常需要精心的硬件同步、几何校准以及额外的空间和计算资源，这限制了其在尺寸受限和计算能力有限的具身平台上的适用性。为了解决这些问题，我们提出了Fast-SDE，一种轻量级的单麦克风SDE框架，适用于计算资源有限且尺寸严格受限的机器人平台。具体来说，Fast-SDE采用基于子带的骨干网络，将频率轴分解为多个子带，而不是使用宽的全频带骨干处理整个频谱。然后，一个共享的子带编码器将每个子带映射为紧凑的潜在表示，并学习声学结构与时频模式之间的关系。最后，一个轻量级的回归头将融合后的子带表示转换为估计的距离。大量的仿真和真实世界实验证明了所提方法的优点。为了惠及更广泛的研究社区，我们在以下网址开源了代码：this https URL。

英文摘要

Sound source distance estimation (SDE) is a critical capability in human-robot interaction. An inappropriate interaction distance not only reduces the reliability of speech acquisition and understanding, but also compromises the naturalness and comfort of the interaction. Most existing SDE methods rely on microphone arrays; however, multi-microphone systems typically require careful hardware synchronization, geometric calibration, and additional space and computational resources, which limits applicability to size-constrained and computation-limited embodied platforms. To alleviate these issues, we propose Fast-SDE, a lightweight single-microphone SDE framework that is suited for deployment on robot platforms with limited computational resources and strict size constraints. Specifically, Fast-SDE employs a subband-based backbone that decomposes the frequency axis into multiple subbands, rather than processing the entire spectrum with a wide full-band backbone. A shared subband encoder then maps each subband to a compact latent representation and learns the relationship between acoustic structure and time-frequency patterns. Finally, a lightweight regression head converts the fused subband representations into the estimated distance. Extensive simulation and real-world experiments demonstrate the merits of the proposed method. To benefit the broader research community, we have open-sourced our code at this https URL.

2511.05203 2026-06-11 cs.RO 版本更新

SIL: Symbiotic Interactive Learning for Language-Conditioned Human-Agent Co-Adaptation

SIL: 语言条件的人机协同适应的共生交互学习

Linus Nwankwo, Bjoern Ellensohn, Christian Rauch, Elmar Rueckert

AI总结提出共生交互学习（SIL）框架，实现人类与智能体在共享潜在任务空间中的双向协同适应，通过联合信念状态、FM空间推理和记忆架构，在指令跟随等任务中达到90.4%完成率和0.83信念对齐分数。

AI中文摘要

当今的自主智能体主要由基础模型（FMs）驱动，能够理解自然语言指令并以类似人类的推理解决长期任务。然而，当前的人机交互框架大多遵循单向的主从技术，其中具身智能体被动执行命令而没有互惠学习。这忽视了日常人际交互中协同适应、多轮交互的本质。我们引入了共生交互学习（SIL），一个在共享潜在任务空间中的双向协同适应框架，其中人类和智能体都维护着随交互历史演变的联合信念状态。这使得主动澄清、适应性建议和共享计划细化成为可能。SIL利用FMs进行空间感知和推理，并结合一个三元组损失训练的神经编码器，将FMs的输出嵌入到任务特定的潜在表示中。为了支持任务演变时的长期稳定性，SIL利用情景记忆和语义记忆架构，并通过弹性权重巩固进行正则化以减轻灾难性遗忘。我们在模拟和真实世界的具身任务上评估SIL，包括指令跟随、信息检索、查询导向推理和交互式对话，实现了90.4%的任务完成率和ρ≈0.83的信念对齐分数，比最佳消融实验绝对提高了约20个百分点。演示和资源：此https URL。

英文摘要

Today's autonomous agents, largely driven by foundation models (FMs), can understand natural language instructions and solve long-horizon tasks with human-like reasoning. However, current human-robot interaction frameworks largely follow a one-way master-apprentice technique where the embodied agent passively executes commands without reciprocal learning. This neglects the co-adaptive, multi-turn nature of everyday human-to-human interactions. We introduce symbiotic interactive learning (SIL), a bidirectional co-adaptation framework in a shared latent task space, where both the human and the agent maintain joint belief states that evolve with the interaction history. This enables proactive clarification, adaptive suggestions, and shared plan refinement. SIL leverages FMs for spatial perception and reasoning, together with a triplet-loss-trained neural encoder that grounds the FMs' outputs into task-specific latent representations. To support long-term stability as tasks evolve, SIL utilises episodic and semantic memory architectures, regularised via elastic weight consolidation to mitigate catastrophic forgetting. We evaluate SIL on simulated and real-world embodied tasks, including instruction following, information retrieval, query-oriented reasoning, and interactive dialogue, achieving a $90.4\%$ task completion rate and a belief alignment score of $\rho \approx 0.83$, an absolute improvement of about $20$ percentage points over the best ablations. Demos and resources: this https URL.
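
The abstract mentions a triplet-loss-trained encoder without further detail. For readers unfamiliar with the objective, a minimal sketch of the standard triplet margin loss on Euclidean distances is given below; the margin value and the toy embeddings are assumptions, and SIL's actual encoder and distance metric may differ.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull the positive closer than the negative.

    The loss is zero once the anchor-negative distance exceeds the
    anchor-positive distance by at least `margin`.
    """
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


# A well-separated triplet incurs no loss; a swapped one is penalized.
satisfied = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
violated = triplet_margin_loss((0.0, 0.0), (3.0, 4.0), (0.0, 1.0))
```

In a setting like the one described, the anchor and positive would be embeddings of inputs that refer to the same task, and the negative an embedding from a different task.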

2606.11324 2026-06-11 cs.RO cs.AI cs.LG new submission

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Yifu Yuan, Yaoting Huang, Xianze Yao, Yutong Li, Shuoheng Zhang, Linqi Han, Pengyi Li, Jiangeng Sun, Wenting Jia, Zhao Zhang, Yuhao Liu, Ruihao Liao, Yucheng Hu, Qiyu Wu, Yuxiao Li, Zibin Dong, Fei Ni, Yan Zheng, Shuyang Gu, Yi Ma, Hongyao Tang, Han Hu, Jianye Hao

Affiliations: Tianjin University; Tencent Hunyuan

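AI summary: Proposes Embodied-R1.5, a unified embodied foundation model that, through automated data pipelines and multi-task balanced reinforcement learning, achieves state of the art on 16 of 24 benchmarks with 8B parameters and can be fine-tuned into a VLA model.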

Comments: Embodied R1.5 technical report. Project page: this https URL

Abstract

We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $\pi_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.

2606.11628 2026-06-11 cs.RO cs.AI new submission

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

Harsh Gupta, Guanya Shi, Wenzhen Yuan

Affiliations: University of Illinois Urbana-Champaign; Carnegie Mellon University

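AI summary: Proposes LUCID, a two-stage framework that learns task intent from internet-scale unstructured human videos and learns robot control in massively parallel simulation, enabling zero-shot transfer to different embodiments and scenes.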

Abstract

The most widely-adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot action. We propose LUCID, a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively-parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. The intent interface is shared across controllers, so the same intent model can be applied to different embodiments, from our primary dexterous hand to a parallel-jaw gripper. We evaluate LUCID on five real-world manipulation tasks: stirring, wiping, and binning supervised by only internet video, with zero-shot transfer to novel scenes and object instances; and push-T and cable routing supervised by 1 hr each of self-collected smartphone video. Project page: this https URL.

2606.12028 2026-06-11 cs.RO new submission

VICX: Generalizable Robot Manipulation via Video Generation and In-Context Operator Network

Song Chen, Linyan Xiang, Ying Zhou, Liu Yang

Affiliations: National University of Singapore

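AI summary: Proposes the VICX framework, which uses a frozen video generation model to produce visual plans and maps them to executable robot trajectories through a Video-to-Trajectory In-Context Operator Network (V2T-ICON), achieving cross-task and cross-embodiment generalization.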

Comments: The first two authors contributed equally to this work

Abstract

Generalizable robot manipulation requires not only task-level reasoning over unseen scenes, but also reliable grounding of visual plans into embodiment-specific execution. To bridge this gap, we propose VICX (Video generation and In-Context eXecution), a decoupled closed-loop manipulation framework. In VICX, a frozen video generation model produces vision-language-conditioned high-level visual plans, while a Video-to-Trajectory In-Context Operator Network (V2T-ICON) serves as the task-agnostic interface that grounds these plans into executable robot-state trajectories. To improve execution generalization, V2T-ICON operates on segmentation-extracted arm-only frame observations and uses retrieved image-state pairs as in-context prompts, allowing a robust and generalizable visual-to-state mapping at inference time without parameter updates. Experiments on Meta-World show that VICX supports cross-task generalization, closed-loop self-correction, and cross-embodiment transfer, demonstrating dual generalization across both task semantics and robot execution. The project webpage can be found here: this https URL.

2606.12105 2026-06-11 cs.RO cs.CV cs.LG new submission

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov

Affiliations: Intuitive Robots Lab, Karlsruhe Institute of Technology (KIT); NVIDIA; Robotics Institute of Germany

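AI summary: To address the mismatch between the synchronous clock of VLA models and the differing modality frequencies of physical interaction, proposes DAM-VLA, which decouples temporal processing per modality, maintains latent buffers updated at sensor rates, and integrates high-frequency modalities through gated cross-attention, raising the average success rate to 95.2% across 7 real-world manipulation tasks.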

Comments: 17 pages, 8 figures

Abstract

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: this https URL
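
The idea of per-modality buffers that refresh at their own sensor rates while the action head reads them on every control tick can be sketched as a small simulation. This is only an illustration under stated assumptions: the class `ModalityBuffer`, the specific rates (force at 100 Hz, vision at 10 Hz, language once per episode), and the stand-in "latent" tuples are hypothetical, and the gated cross-attention described in the abstract is not modelled here.

```python
class ModalityBuffer:
    """Holds the latest latent for one modality (hypothetical helper)."""

    def __init__(self, every_n_ticks):
        self.every_n_ticks = every_n_ticks  # refresh period in control ticks
        self.latent = None
        self.updates = 0

    def maybe_refresh(self, tick, read_sensor):
        # Refresh only on ticks that coincide with this modality's own rate.
        if tick % self.every_n_ticks == 0:
            self.latent = read_sensor(tick)
            self.updates += 1


def run_episode(num_ticks=100):
    """Simulate one second of an assumed 100 Hz control loop."""
    buffers = {
        "force": ModalityBuffer(every_n_ticks=1),             # assumed 100 Hz
        "vision": ModalityBuffer(every_n_ticks=10),           # assumed 10 Hz
        "language": ModalityBuffer(every_n_ticks=num_ticks),  # once per episode
    }
    actions = []
    for tick in range(num_ticks):
        for name, buf in buffers.items():
            # Stand-in encoder: the latent is just (modality, tick of capture).
            buf.maybe_refresh(tick, lambda t, n=name: (n, t))
        # The action head reads whatever each buffer currently holds.
        actions.append({name: buf.latent for name, buf in buffers.items()})
    return buffers, actions


buffers, actions = run_episode()
```

In this toy loop an action is produced on every tick even though vision is refreshed only every tenth tick, so the control rate is no longer capped by the slowest modality.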

2606.12109 2026-06-11 cs.RO cs.AI new submission

Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning

Chuanke Pang, Junyi Huang, Zhijun Zhao, Yaobing Wang, Kun Xu, Xilun Ding

Affiliations: Beihang University; China Academy of Space Technology

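AI summary: Proposes the InDex framework, which repurposes the pre-trained 1-DoF parallel grasp output as a macroscopic virtual grasp intent proxy and combines it with a two-stage decoupled learning architecture to adapt VLA models from low-DoF grippers to high-DoF dexterous hands, effectively mitigating catastrophic forgetting and action manifold collapse.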

Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of pre-trained pipelines remain strictly confined to low-DoF parallel grippers. Adapting these rich semantic priors to high-DoF dexterous hands introduces a severe morphology gap: because data are scarce, direct end-to-end joint fine-tuning inherently causes catastrophic forgetting of spatial reasoning and acute action manifold collapse. In this paper, we present InDex, a novel, data-efficient adaptation framework rooted in cross-morphology semantic inheritance. Rather than discarding the pre-trained 1-DoF parallel grasp output, we repurpose it as a continuous, macroscopic virtual grasp intent proxy to sequentialize the control topology. We implement a two-stage decoupled learning architecture: the first stage parameter-efficiently aligns the VLA backbone to predict continuous arm trajectories and the scalar grasp intent; the second stage freezes this spatial backbone and leverages an intent-conditioned denoising diffusion head to decode fine-grained joint articulations for multi-fingered end-effectors. Extensive simulation benchmarks across a suite of multi-stage, contact-rich dexterous manipulation tasks demonstrate that InDex effectively masters intricate skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the robust spatial generalizability of the original VLA prior.

2606.12299 2026-06-11 cs.RO cs.LG new submission

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy

Affiliations: Robotics Institute, Carnegie Mellon University

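AI summary: Proposes a framework that interactively searches for language sequences that improve closed-loop VLA task performance and learns an improvement head that predicts when language steering will help, with conformalization to prevent harmful interventions.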

Comments: 22 pages, 14 tables, 14 figures

Abstract

Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.

2606.12366 2026-06-11 cs.RO new submission

APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

Kechun Xu, Zhenjie Zhu, Anzhe Chen, Rong Xiong, Yue Wang

Affiliations: Zhejiang University; Zhejiang Humanoid Robot Innovation Center

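AI summary: To address the poor generalization of continuous action expert models to out-of-distribution language instructions, proposes APT, a two-stage training method that first pretrains the action expert as a vision-action prior and then injects language through gated fusion, significantly improving generalization.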

Abstract

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including the $\pi$ and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: this https URL
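
One common way to inject a new signal while preserving a pretrained one is a residual gate that starts closed. The sketch below illustrates that general pattern only; the abstract does not specify APT's gate, so the tanh form, the zero initialisation, and the function name `gated_fusion` are assumptions rather than the authors' design.

```python
import math


def gated_fusion(prior_features, language_features, gate_logit):
    """Residual gate: output = prior + tanh(gate_logit) * language.

    With gate_logit = 0.0 the output equals the prior exactly, so the
    pretrained vision-action behaviour is untouched at the start of the
    language-injection stage (an assumed initialisation).
    """
    gate = math.tanh(gate_logit)
    return [p + gate * l for p, l in zip(prior_features, language_features)]


prior = [1.0, -2.0]      # hypothetical vision-action features
language = [5.0, 5.0]    # hypothetical language-conditioned features

closed = gated_fusion(prior, language, gate_logit=0.0)   # gate closed
opened = gated_fusion(prior, language, gate_logit=20.0)  # gate almost fully open
```

As the gate parameter is trained away from zero, language features are blended in gradually instead of overwriting the learned visuomotor prior in one step.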

2606.12402 2026-06-11 cs.RO cs.AI cs.CV new submission

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone

Affiliations: Stanford University; University of Waterloo; NVIDIA

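AI summary: Proposes DIRECT, a routing framework that allocates compute per prompt based on multimodal scene context to improve the success-cost Pareto frontier; experiments show that different scaling axes yield different capability gains, and on a physical robot the router matches or exceeds a stronger model at lower latency.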

Abstract

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success--cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at this http URL.
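
Per-prompt compute allocation can be pictured as a cost-aware selection rule. The sketch below is a simplified stand-in, not DIRECT's router: it assumes some predictor has already estimated a success probability for each planner configuration, and the configuration names, costs, and the 0.8 target are hypothetical.

```python
def route(predicted_success, cost, target=0.8):
    """Pick the cheapest configuration predicted to reach `target`.

    Falls back to the configuration with the highest predicted success
    when none is expected to reach the target.
    """
    adequate = [name for name, p in predicted_success.items() if p >= target]
    if adequate:
        return min(adequate, key=lambda name: cost[name])
    return max(predicted_success, key=predicted_success.get)


# Hypothetical relative costs along two scaling axes (model size, CoT depth).
cost = {"small_no_cot": 1.0, "small_cot": 3.0, "large_cot": 10.0}

easy = route({"small_no_cot": 0.90, "small_cot": 0.92, "large_cot": 0.95}, cost)
hard = route({"small_no_cot": 0.30, "small_cot": 0.60, "large_cot": 0.85}, cost)
hopeless = route({"small_no_cot": 0.10, "small_cot": 0.20, "large_cot": 0.50}, cost)
```

Easy prompts are sent to the cheapest configuration and only hard ones pay for the expensive one, which is the intuition behind improving the success-cost trade-off over a fixed model choice.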

2606.12403 2026-06-11 cs.RO new submission

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Zefu Lin, Rongxu Cui, Junjia Xu, Xiaojuan Jin, Wenling Li, Lue Fan, Zhaoxiang Zhang

Affiliations: Institute of Automation, Chinese Academy of Sciences (CASIA); Nanjing University; Beihang University

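AI summary: Proposes the World Pilot framework, which supplies VLA models with scene-evolution priors and trajectory-level motion hints through two pathways from a World-Action Model (WAM), Latent Steering and Action Steering, reaching an 84.7% Total success rate on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate across multiple real-robot manipulation tasks.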

Comments: Project Website: this https URL

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: this https URL

2606.12217 2026-06-11 cs.CV cs.AI cs.RO cross-listing

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge, Xihui Liu

Affiliations: The University of Hong Kong; XPENG Robotics

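AI summary: To address the mismatch between visual prediction and action extraction in world action models, proposes AGRA, which aligns video diffusion features with semantic representations so that the action decoder attends to task-relevant regions, improving performance and generalization on manipulation tasks.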

Abstract

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.

2605.00321 2026-06-11 cs.RO updated version

Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

Hanxin Zhang, Mingshuo Xu, Abdulqader Dhafer, Shigang Yue, Hongbiao Dong, Zhou Daniel Hao

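AI summary: Proposes the Interventional Significance Score (ISS) and the Nuisance Mass Ratio (NMR), which estimate the causal influence of visual regions on action predictions through interventional masking and quantify attribution to task-irrelevant features; experiments indicate that NMR predicts generalization behavior and that ISS gives more faithful explanations than existing methods.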

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Vision-Language-Action (VLA) policies often fail under distribution shift, suggesting that decisions may depend on spurious visual correlations rather than task-relevant causes. We formulate visual-action attribution as an interventional estimation problem. Accordingly, we introduce the Interventional Significance Score (ISS), an interventional masking procedure for estimating the causal influence of visual regions on action predictions, and the Nuisance Mass Ratio (NMR), a scalar measure of attribution to task-irrelevant features. We analyze the statistical properties of ISS and show that it admits unbiased estimation, and we characterize conditions under which action prediction error provides a valid proxy for causal influence. Experiments across diverse manipulation tasks indicate that NMR predicts generalization behavior and that ISS yields more faithful explanations than existing interpretability methods. These results suggest that interventional attribution provides a simple diagnostic approach for identifying causal misalignment in embodied policies.
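
The general shape of masking-based attribution and a nuisance ratio can be shown on a toy policy. The abstract does not give the exact ISS estimator or the NMR formula, so everything here is an assumption: scores are the L1 change in the predicted action when a region is replaced by a baseline value, and the ratio is the share of total score that falls on regions labelled task-irrelevant.

```python
def interventional_scores(policy, observation, regions, baseline=0.0):
    """Score each region by how much masking it changes the predicted action."""
    reference = policy(observation)
    scores = {}
    for name, indices in regions.items():
        masked = list(observation)
        for i in indices:
            masked[i] = baseline  # intervention: overwrite the region
        perturbed = policy(masked)
        scores[name] = sum(abs(a - b) for a, b in zip(reference, perturbed))
    return scores


def nuisance_mass_ratio(scores, task_relevant):
    """Fraction of attribution mass on regions outside `task_relevant`."""
    total = sum(scores.values())
    if total == 0:
        return 0.0
    nuisance = sum(s for name, s in scores.items() if name not in task_relevant)
    return nuisance / total


def toy_policy(obs):
    # Hypothetical policy that leaks some dependence on the background.
    return [2.0 * obs[0] + obs[1], 0.25 * obs[2]]


observation = [1.0, 2.0, 4.0, 8.0]
regions = {"object": [0, 1], "background": [2, 3]}

scores = interventional_scores(toy_policy, observation, regions)
nmr = nuisance_mass_ratio(scores, task_relevant={"object"})
```

A ratio near zero would mean the policy's actions depend almost entirely on the labelled task-relevant region, while a larger value flags reliance on background features of the kind that tends to break under distribution shift.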

2606.06904 2026-06-11 cs.RO cs.CV updated version

ActionMap: Robot Policy Learning via Voxel Action Heatmap

Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai, Mike Zheng Shou

Affiliations: National University of Singapore; NVIDIA

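AI summary: Proposes ActionMap, an action decoder that models the action space as a voxel heatmap and replaces the single-point predictor in existing VLA models, improving performance and data efficiency in LIBERO simulation and real-world Franka manipulation.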

Abstract

Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: this https URL.
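
Treating the action space as a heatmap over voxels can be illustrated with a small grid. This is a sketch under assumptions and not ActionMap's head: the 3-D grid over [-1, 1], the argmax decoding, and the Gaussian-smoothed target (one plausible way to exploit the proximity of neighbouring actions) are all illustrative choices.

```python
import math
from itertools import product


def voxel_centers(low, high, bins):
    """Centers of a regular grid with `bins` cells per axis over [low, high]^3."""
    step = (high - low) / bins
    axis = [low + (i + 0.5) * step for i in range(bins)]
    return list(product(axis, repeat=3))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_action(logits, centers):
    """Return the center of the most probable voxel and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return centers[best], probs[best]


def soft_target(action, centers, sigma=0.25):
    """Gaussian-smoothed target: voxels near the true action share mass."""
    weights = [
        math.exp(-sum((a - c) ** 2 for a, c in zip(action, ctr)) / (2 * sigma ** 2))
        for ctr in centers
    ]
    total = sum(weights)
    return [w / total for w in weights]


centers = voxel_centers(-1.0, 1.0, bins=4)   # 4 x 4 x 4 = 64 voxels
logits = [0.0] * len(centers)
logits[37] = 5.0                             # pretend the head favours voxel 37
decoded, confidence = decode_action(logits, centers)

target = soft_target((0.3, -0.2, 0.7), centers)
peak_voxel = centers[max(range(len(target)), key=target.__getitem__)]
```

Unlike a single-point regression target, the smoothed target gives partial credit to voxels adjacent to the demonstrated action, which is the kind of geometric structure the abstract says single-point decoders leave unused.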

2606.08530 2026-06-11 cs.RO cs.AI updated version

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Yuan Zhang, Shiqi Zhang, Yedong Shen, Shuai Dong, Jiajun Deng, Xin Zhang, Yuxuan Gao, Jiajia Wu, Xin Nie, Zhiyuan Cheng, Jianmin Ji, Yanyong Zhang, Xingyi Zhang, Jia Pan

Affiliations: Anhui University; University of Science and Technology of China; iFLYTEK

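AI summary: Proposes the GEAR-VLA framework, which learns a unified geometry-aware action representation through coarse-to-fine action learning, semantic-aligned 3D integration, and embodiment canonicalization, enabling manipulation that generalizes across objects, backgrounds, and robots.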

Abstract

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.

2510.14828 2026-06-11 cs.AI cs.RO updated version

RoboGPT-R1: Enhancing Robot Task Planning with Reinforcement Learning

Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li

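AI summary: Proposes RoboGPT-R1, a two-stage fine-tuning framework that first acquires foundational knowledge through supervised learning and then improves visual-spatial understanding and reasoning through reinforcement learning, outperforming GPT-4o-mini by 21.33% on EmbodiedBench.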

DOI: 10.65109/NOXT1107
Journal ref: Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), pp. 2827-2837, IFAAMAS, 2026

Abstract

Improving the reasoning capabilities of embodied agents is crucial for robots to successfully complete complex human instructions in long-horizon manipulation tasks. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue to face challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraints in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.

2606.11249 2026-06-11 cs.RO cs.LG cs.MA New submission

MASK: Multi-Agent Semantic K-Scheduling for Risk-Sensitive 6G Robotics

Ahmet Gunhan Aydin, Elif Tugce Ceran

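Affiliations: Middle East Technical University; Aselsan Inc.

AI summary: To address limited spectrum resources in collaborative sensing for 6G robotics, proposes the Multi-Agent Semantic K-Scheduling (MASK) architecture, which uses an Arbiter-Assisted Semantic Information Gating (A-SIG) mechanism to schedule only the K agents with the highest semantic importance. Combined with a self-supervised global encoder and a distributional policy, it achieves robust, risk-aware coordination under strict bandwidth caps, with performance close to communication-unconstrained baselines.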

Abstract

Realizing the vision of 6G connected robotics requires reconciling high-performance collaborative control with the rigid spectral limitations of physical wireless channels. In realistic collaborative sensing scenarios, spectral resources are quantized into finite physical resource blocks or orthogonal subcarriers, rendering simultaneous transmission by all agents infeasible. To address this, we propose Multi-Agent Semantic K-Scheduling (MASK), a control architecture designed to sustain robust, risk-aware coordination under strict instantaneous bandwidth caps. We introduce Arbiter-Assisted Semantic Information Gating (A-SIG), a lightweight coordination mechanism that enforces hard access constraints by scheduling only the top-K agents based on locally computed semantic importance scores. By aggregating these prioritized observations into a compact latent state, a self-supervised global encoder enables a distributional policy to mitigate tail risks despite data sparsity. We evaluate MASK across diverse benchmarks, demonstrating that it matches the performance of communication-unconstrained baselines even when channel access is restricted to a small fraction of the swarm size. Furthermore, the framework exhibits inherent resilience to packet erasures, validating semantic scheduling as a critical enabler for resource-constrained 6G systems.
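
As a rough illustration of the top-K gating idea in this abstract (not the authors' A-SIG implementation), the sketch below grants channel access to the K agents with the highest importance scores. The function name `schedule_top_k`, the tie-breaking rule, and the example scores are assumptions introduced here; the abstract does not say how scores are computed or how ties are resolved.

```python
import heapq


def schedule_top_k(scores, k):
    """Return the indices of the k agents with the highest importance scores.

    `scores` holds one locally computed importance value per agent.
    Ties are broken in favour of the lower agent index (an assumption).
    """
    if k <= 0:
        return []
    # Rank (score, -index) pairs so that a lower index wins when scores tie.
    ranked = heapq.nlargest(k, ((s, -i) for i, s in enumerate(scores)))
    return sorted(-neg_i for _, neg_i in ranked)


# Hypothetical scores for a five-agent team with two channel slots.
scores = [0.12, 0.87, 0.45, 0.91, 0.30]
granted = schedule_top_k(scores, k=2)
```

With these example scores, agents 1 and 3 are granted access and the other three stay silent for the round, which is the hard access constraint the abstract describes.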

2606.11489 2026-06-11 cs.RO New submission

Steering Multirobot Behavior via Closed-Loop Affine Activation Editing

Satyajeet Das, Darren Chiu, Shashank Hegde, Gaurav S. Sukhatme

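Affiliations: University of Southern California

AI summary: Proposes the CLAE framework, which steers multi-robot behavior at inference time by editing the intermediate activations of a frozen policy, with no fine-tuning or retraining, and validates new behaviors on a multi-quadrotor navigation task: velocity control, formation keeping, and surveillance avoidance.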

Abstract

Real-world robots need to adapt their behavior beyond the envelope of their pre-trained policy. Policy finetuning or retraining are options, but they risk catastrophic forgetting, degrading the pretrained policy's base performance. To combat this, we introduce CLAE: Closed-Loop Affine Activation Editing, an inference-time framework for steering the behavior of a frozen policy by editing intermediate activations while keeping the base policy weights and downstream action head untouched. CLAE approaches behavior steering as a closed-loop problem whose outputs edit policy activations that adapt online to the robot state, environment, target behavior, and multi-robot context. It trains a sparse autoencoder over frozen-policy activations, selects behavior-relevant latent features via post-hoc probing, and learns a lightweight RL-based steering policy that applies state-dependent affine edits to selected latents during inference. We validate CLAE on a frozen multi-quadrotor navigation policy trained to perform a single task: navigating robots to a set of goal locations while avoiding obstacles. Through extensive simulations and physical tests, we show that while navigating to their goal positions, CLAE can 1. steer individual robot behavior by controlling each robot's velocity profile; 2. coordinate multirobot behavior by preserving a desired formation; and 3. produce entirely new behavior wherein robots are required to reduce their exposure to surveillance cameras in the environment.
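
To make the phrase "affine edits to selected latents" concrete, here is a minimal sketch under stated assumptions: each selected latent feature z_i is replaced by a * z_i + b, and all other features pass through unchanged. The function `affine_edit` and the numbers are hypothetical. In the paper the scale and shift would come from the learned steering policy and depend on the robot state, and the latents would come from a sparse autoencoder; neither component is modeled here.

```python
def affine_edit(latents, selected, scales, shifts):
    """Apply z[i] <- scales[j] * z[i] + shifts[j] for the j-th selected index i."""
    edited = list(latents)  # copy, so the original activation vector is untouched
    for j, i in enumerate(selected):
        edited[i] = scales[j] * edited[i] + shifts[j]
    return edited


# Toy latent vector; features 1 and 3 are the ones chosen for steering.
latents = [0.0, 2.0, -1.0, 0.5]
edited = affine_edit(latents, selected=[1, 3], scales=[0.5, 1.0], shifts=[0.1, -0.5])
```

Only the selected entries change, which mirrors the abstract's point that the base policy weights and the downstream action head stay untouched.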

2606.12070 2026-06-11 cs.RO New submission

Fibration Trees: A Unified Approach to Multi-Robot Motion Planning

Andreas Orthey, Florian T. Pokorny, Lydia E. Kavraki

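Affiliations: Technical University of Berlin; KTH Royal Institute of Technology; Rice University and the Ken Kennedy Institute

AI summary: Proposes fibration trees, a unified framework that models projections as fibrations and combines prioritization, parallel decomposition, and task-space projection, and develops the Fibration-RRT planner, which is probabilistically complete for high-dimensional multi-robot motion planning.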

Comments: 23 pages, 12 figures

Abstract

State space projections and decompositions have emerged as powerful tools to tackle the curse of dimensionality in high-dimensional, multi-robot motion planning problems. However, existing methods lack a unified framework which seamlessly handles combinations of projections (prioritization or task-space) and decompositions (parallel or decoupled subspaces). To fill this gap, we introduce fibration trees, which are trees consisting of state spaces as nodes and fibrations as edges, whereby a fibration models a projection from a higher-dimensional space to a lower-dimensional (or simplified) space. By modeling projections as fibrations, we unify sequential prioritization, parallel decomposition, and task-space projections under a single, coherent formalism. Building on this, we develop the rapidly-exploring random fibration trees (Fibration-RRT) planner, a sampling-based motion planner that generalizes strategies from quotient-space RRT (for sequential prioritizations) and discrete RRT (for parallel decompositions), while allowing the inclusion of task-space projections. Fibration-RRT operates on user-defined fibration trees and is proven to be probabilistically complete. To test the generality and efficiency of Fibration-RRT, we provide an open-source implementation and conduct experiments on 32 scenarios using multi robot teams with up to 96 degrees of freedom. Our results indicate that Fibration-RRT efficiently solves high-dimensional problems by exploiting user-defined fibration trees, thereby establishing fibration trees as a powerful, unified framework for multi-robot motion planning.

2606.12306 2026-06-11 cs.RO New submission

UGV-Conditioned Multi-UAV Informative Planning on a Shared Exposure Belief

Lars Oerlemans, Moji Shi, Marija Popovic

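AI summary: Proposes a method for coordinating a UAV team to reduce the risk a ground vehicle faces when navigating unknown threat zones. A shared exposure belief guides sensing and reduces redundant coverage; in simulation, cumulative exposure drops by 38% and redundant coverage falls from 38.8% to 3.7%.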

Comments: 8 pages, 6 figures

Abstract

Safe ground navigation in large, threat-augmented environments requires aerial support that actively reduces the risks that a ground vehicle faces along its route. Existing aerial reconnaissance systems focus on mapping or covering the environment, but do not direct sensing toward regions that are most relevant for ground vehicle safety. In this paper, we address the problem of coordinating a team of unmanned aerial vehicles (UAVs) to improve the safety of an unmanned ground vehicle (UGV) navigating through unknown threat zones. A key aspect of our approach is a shared exposure belief that is updated online from aerial observations and used jointly by the UAV team and the ground vehicle. This enables us to direct aerial sensing towards route-relevant regions while allowing the UGV to replan around newly revealed threats. We coordinate the UAV team through spatial region assignment to avoid redundant sensing. Simulation experiments show that our approach reduces cumulative UGV exposure by 38% compared to a system that does not account for hazard levels, and reduces redundant aerial coverage from 38.8% to 3.7% under our multi-UAV coordination scheme.

2606.12352 2026-06-11 cs.RO cs.AI New submission

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

Ria Doshi, Tian Gao, Annie Chen, Chelsea Finn, Jeannette Bohg

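Affiliations: Stanford University

AI summary: Proposes the CHORUS framework, which uses the visuomotor priors of a pretrained vision-language-action model to achieve decentralized multi-robot collaboration without inference-time communication, and significantly outperforms baselines in real-world experiments.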

Comments: Project Website: this https URL

Abstract

Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference time to overcome partial observability. Our key insight is that the visuomotor priors of pretrained vision-language-action (VLA) models should enable reactive, decentralized collaboration from each robot's local observations alone, without these inference-time assumptions. We propose CHORUS, a framework that adapts a single VLA backbone to control diverse, multi-robot teams. At inference time, each robot runs an independent copy of CHORUS, conditioned only on its own observations and a robot-identifying prompt. In real-world experiments including mobile tape measurement, library book handovers, and laundry basket lifting, CHORUS achieves a 64% point improvement over decentralized, from-scratch models, improves reactivity to teammate behavior by 40% points, and outperforms centralized baselines. Together, these results show that a shared VLA backbone is capable of achieving decentralized multi-robot collaboration, without per-robot policies or inter-robot communication at inference.

2601.10724 2026-06-11 cs.RO Updated version

Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty

Rishabh Dev Yadav

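AI summary: For unknown, state-dependent friction forces in vehicle platoons, proposes an adaptive sliding mode controller that handles friction uncertainty without prior knowledge and achieves speed regulation and spacing maintenance.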

Abstract

Multi-robot formation control has various applications in domains such as vehicle troops, platoons, payload transportation, and surveillance. Maintaining formation in a vehicle platoon requires designing a suitable control scheme that can tackle external disturbances and uncertain system parameters while maintaining a predefined safe distance between the robots. A crucial challenge in this context is dealing with the unknown/uncertain friction forces between wheels and the ground, which vary with changes in road surface, wear in tires, and speed of the vehicle. Although state-of-the-art adaptive controllers can handle a priori bounded uncertainties, they struggle with accurately modeling and identifying frictional forces, which are often state-dependent and cannot be a priori bounded. This thesis proposes a new adaptive sliding mode controller for wheeled mobile robot-based vehicle platoons that can handle the unknown and complex behavior of frictional forces without prior knowledge of their parameters and structures. The controller uses the adaptive sliding mode control techniques to regulate the platoon's speed and maintain a predefined inter-robot distance, even in the presence of external disturbances and uncertain system parameters. This approach involves a two-stage process: first, the kinematic controller calculates the desired velocities based on the desired trajectory; and second, the dynamics model generates the commands to achieve the desired motion. By separating the kinematics and dynamics of the robot, this approach can simplify the control problem and allow for more efficient and robust control of the wheeled mobile robot.

2606.08102 2026-06-11 cs.RO cs.AI cs.MA Updated version

Continual Quadruped Robots Coordination via Semantic Skill Discovery

Daoqing Wang, Yuchen Xiao, Weixuan Huang, Zhilong Zhang, Shenghua Wan, Meng Li, Lei Yuan, Yang Yu

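AI summary: Proposes the Conquer framework, which uses a semantic skill library to coordinate multiple quadruped robots across continually arriving tasks while avoiding catastrophic forgetting, reaching a final average success rate of 95.6%.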

Comments: 22 pages, 8 figures, 11 tables. Project page: this https URL

Abstract

Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/

2606.11419 2026-06-11 cs.RO New submission

A Modular Dual-Camera Pipeline for Micro-Inspection Using Aerial Robots

S.H. Mirtajadini, N. Rublein, R.M. Ramakrishnan, G. ter Maat, M. Aldibaja, A.Y. Mersha

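Affiliations: Netherlands Organization for Scientific Research (NWO); Saxion University of Applied Sciences

AI summary: Proposes a modular dual-camera pipeline for aerial micro-inspection in which a zoomed, gimbal-mounted camera and a wide-angle stereo navigation camera work together with a visual feedback loop, enabling robust micro-inspection of non-structural targets such as trees and greenhouse sticky traps.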

Abstract

Most existing drone-based inspection systems require the drone to fly dangerously close to the target or follow complex flight paths to capture small details. In addition, drone flight is affected by disturbances and localization inaccuracies, which can cause the drone to lose sight of its supposed target when it has a narrow view. Furthermore, trajectory planning often requires prior information about the target's geometry, position, and orientation, which is not always available for non-structural targets such as trees, vehicles, or people. To address these challenges, this paper presents aerial_micro_inspection, a generic pipeline for aerial micro-inspection across different use cases. The pipeline assumes a PX4-powered drone equipped with two cameras: (i) a zoomed, gimbal-mounted inspection camera that captures fine details without requiring the drone to fly very close to the target, and (ii) a wide-field-of-view stereo navigation camera that acquires the target surface on site, estimates its range, and partitions it into smaller inspection regions. In addition, a vision-based feedback loop compensates for drone motion while the inspection camera visits small partitions of a larger surface. We evaluate the pipeline in simulation and real-world experiments, mainly in two use-case scenarios: tree inspection for detecting oak processionary caterpillars and their eggs, and greenhouse inspection of sticky traps for detecting whiteflies. The results show improved coverage robustness under drone disturbances in simulation, as well as effective detection of caterpillars and eggs and high-detail imaging of insects in real-world experiments. The pipeline is open-source, developed in ROS 2, and can be adapted to new applications by replacing the surface-segmentation and micro-target detection checkpoints. The code is available at: this https URL

2606.11708 2026-06-11 cs.RO New submission

Explore From Sketch: Accelerating UAV Exploration in Large-scale Environments with Prior Maps

Tiancheng Lai, Yuman Gao, Xiangyu Li, Ruitian Pang, Xingpeng Wang, Siqi Shen, Mengke Zhang, Yin He, Fei Gao, Chao Xu, Yanjun Cao

Affiliations: Institute of Cyber-Systems and Control, College of Control Science and Engineering, Zhejiang University; Huzhou Institute, Zhejiang University, and Huzhou Key Laboratory of Autonomous System; Zhejiang Zhongyan Industry Co., Ltd; Differential Robotics Technology Company

AI summary: Proposes a framework that uses sparse, unaligned, and even discrepant 2D prior maps to accelerate UAV exploration of large-scale environments, combining robust 2D-3D point cloud registration with hierarchical viewpoint planning to improve exploration efficiency by up to 34.2%.

Comments: 25 pages, 22 figures

Abstract

Autonomous exploration with UAVs in large-scale, topologically complex environments often suffers from low efficiency due to suboptimal scheduling and detours. Prior maps (e.g., construction drawings), although usually imprecise and flawed, are readily available in many scenarios and have the potential to provide global structural guidance. This paper presents a novel exploration framework that leverages sparse, unaligned, and even discrepant 2D prior maps for LiDAR-based UAV exploration. First, a robust 2D-3D point cloud registration pipeline is proposed to align LiDAR observations with prior maps. The registration pipeline combines a GeoContext descriptor for single-frame candidate retrieval, a multi-frame verification mechanism for coarse transformation estimation with outlier rejection, and a Scale-ICP algorithm for refinement. The registration module can handle map discrepancies and provide multiple hypotheses when geometric ambiguities arise. To effectively utilize the registration results for exploration planning, we further develop a hierarchical viewpoint planning strategy under localization uncertainties. The hierarchical strategy first spatially attaches local viewpoints to prior guidepoints and adopts a Monte Carlo Tree Search solver to determine their traversal sequence under each registration hypothesis. To mitigate registration uncertainty, a risk-aware selector evaluates prior sequences using confidence-weighted travel risk, and a fixed-endpoint traveling salesman problem is formulated to generate an efficient local coverage path under the selected prior guidance. Benchmark evaluations reveal up to 34.2% improvement in exploration efficiency and 37.9% reduction in flight distance compared to state-of-the-art methods, while extensive simulations and field experiments further demonstrate robustness to prior map incompleteness and deformations.

2606.12019 2026-06-11 cs.RO New submission

MPPI-based Informative Trajectory Planning for Search and Capture of Drifting Targets with ASVs

Sanjeev Ramkumar Sudha, Marija Popović, Erlend M. Coates

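Affiliations: Norwegian University of Science and Technology (NTNU); TU Delft

AI summary: For autonomous surface vehicles searching for and capturing multiple drifting targets in dynamic environments, proposes a hybrid planning framework based on model predictive path integral (MPPI) control that balances search and tracking by optimizing continuous long-horizon trajectories, switches to pure pursuit guidance in the interception stage, and is validated experimentally.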

Abstract

Autonomous surface vehicles offer an efficient solution for environmental cleanup as well as search and rescue operations in open waters. Targets in these settings drift continuously, so efficient search must balance exploration of unobserved regions with tracking of known targets. However, most target tracking and pursuit scenarios consider simple guidance behaviours and short-term predictions for decision-making. In this letter, we address the problem of search and capture of multiple drifting targets, such as litter, in dynamic environments, using a hybrid planning framework. A key aspect of our strategy is a spatiotemporal informative planning method based on model predictive path integral (MPPI) control, a sampling-based model predictive control approach. The planner directly generates kinematic-level commands by optimising continuous trajectories over long horizons. A multi-objective cost balances search and tracking objectives while ensuring safe, feasible trajectories. In the interception stage, we switch to a pure pursuit guidance controller for the physical capture of moving targets. Experiments show that our planner outperforms the chosen planning baselines. Finally, we validate our approach in field trials with an ASV.
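
For readers unfamiliar with MPPI, the sketch below shows the generic sampling-based update at its core: sampled control perturbations are weighted by the exponentiated negative cost of their rollouts and averaged into the nominal control sequence. It is a textbook-style illustration with scalar controls, not the planner from this paper. The function name `mppi_update`, the `temperature` parameter, and the toy costs are assumptions, and the search and tracking cost terms mentioned in the abstract are not modeled.

```python
import math


def mppi_update(nominal, perturbations, costs, temperature):
    """One MPPI step: softmin-weight sampled perturbations by rollout cost.

    nominal        -- list of scalar controls over the horizon
    perturbations  -- one list of per-step perturbations for each sample
    costs          -- total rollout cost of each sample
    temperature    -- positive scalar; smaller values favour the best samples
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    beta = min(costs)  # subtracting the minimum cost keeps exp() well-behaved
    weights = [math.exp(-(c - beta) / temperature) for c in costs]
    total = sum(weights)
    weights = [w / total for w in weights]
    updated = []
    for t in range(len(nominal)):
        delta = sum(w * eps[t] for w, eps in zip(weights, perturbations))
        updated.append(nominal[t] + delta)
    return updated, weights


# Two toy samples over a two-step horizon; the first rollout is far cheaper.
nominal = [0.0, 0.0]
perturbations = [[1.0, 1.0], [-1.0, -1.0]]
updated, weights = mppi_update(nominal, perturbations, costs=[0.0, 1000.0], temperature=1.0)
```

Because the first sample has a much lower cost, it receives almost all of the weight and the updated controls move toward its perturbation. With equal costs the two samples would cancel and the nominal sequence would stay where it is.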

2606.12142 2026-06-11 cs.RO cs.CV New submission

AerialClaw: An Open-Source Framework for LLM-Driven Autonomous Aerial Agents

Ke Li, Jianfei Yang, Luyao Zhang, Guo Yu, Chengwei Yan, Yuan Ding, Di Wang, Nan Luo, Gang Liu, Xiao Gao, Quan Wang

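Affiliations: Xidian University; Xi'an University of Architecture and Technology

AI summary: Proposes AerialClaw, an open-source framework with a modular brain-skill-runtime architecture that lets an LLM-based agent understand natural-language missions, invoke aerial skills, and make closed-loop decisions, improving the flexibility, reproducibility, and extensibility of UAV systems.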

Abstract

Unmanned aerial vehicles (UAVs) are increasingly used in inspection, search and rescue, environmental monitoring, and emergency response. However, most UAV applications still rely on pre-defined command sequences or task-specific pipelines, where developers manually connect perception, planning, flight control, simulation, logging, and safety modules. This limits the flexibility, reproducibility, and extensibility of autonomous aerial systems. This paper presents AerialClaw, an open-source software framework that enables UAVs to operate as decision-making aerial agents rather than merely command-following platforms. Given a natural-language mission, AerialClaw allows an LLM-based agent to understand the task, maintain context, invoke executable aerial skills, observe perception and runtime feedback, and iteratively update its decisions in a closed loop. The framework adopts a modular brain-skill-runtime architecture, combining hard skills for atomic UAV operations, Markdown-based soft skills for reusable task strategies, document-driven agent state and capability boundaries, memory-driven reflection, safety-oriented runtime validation, and platform-agnostic execution adapters. AerialClaw supports lightweight mock execution, PX4 SITL with Gazebo, and AirSim-based simulation, together with a web console, pluggable model backends, example missions, simulation assets, and staged deployment scripts. By combining standardized aerial skills, document-driven agent state, memory, and closed-loop LLM decision-making, AerialClaw provides a reproducible and extensible open-source framework for building UAV systems that can interpret missions, make decisions, execute skills, and adapt their behavior from feedback.

2606.12236 2026-06-11 cs.RO cs.CV New submission

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

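Affiliations: Wangxuan Institute of Computer Technology, Peking University; University of California, Merced

AI summary: Proposes the DrivingAgent framework, which addresses the challenges of integrating new models into autonomous driving systems and meeting real-time constraints through automated module development (design phase) and real-time scheduling by a lightweight LLM trained with reinforcement learning (scheduling phase), achieving a better speed-accuracy trade-off on nuScenes and Bench2Drive.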

Abstract

Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed--accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

2606.11687 2026-06-11 cs.CV cs.LG cs.RO Cross-list

DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection, Behavioral Intent Classification, and Swarm Intelligence in Contested Airspace

Marius Bayizere

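AI summary: Proposes the DroneShield-AI framework, which integrates six processing layers including RF signal classification, acoustic detection, and YOLOv8 visual detection. A Behavioral Intent Classification Engine (BICE) provides six-class threat classification with 30 seconds of advance warning, and a Graph Neural Network Swarm Intelligence Module (GNN-SIM) analyzes multi-drone formations; the system reaches 96.1% detection accuracy and 142 ms latency on low-cost hardware.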

Comments: 23 pages, 6 figures, 11 tables. Code available at this https URL

Abstract

Unmanned Aerial Vehicle (UAV) threats have emerged as a defining security challenge of the 21st century. This paper presents DroneShield-AI, a unified open framework integrating six processing layers: RF signal classification, acoustic motor-signature detection, YOLOv8-based visual detection, evidence-weighted sensor fusion, a Behavioral Intent Classification Engine (BICE), and a Graph Neural Network Swarm Intelligence Module (GNN-SIM). BICE introduces the first systematic six-class threat taxonomy for drone flight patterns, enabling predictive operator alerts with a 30-second advance-warning horizon. GNN-SIM is the first open framework for adversarial multi-drone formation analysis using Graph Attention Networks. Evaluated on three publicly available real-world datasets, the fused pipeline achieves 96.1% detection accuracy, 3.2% false alarm rate, AUC-ROC: 0.981, and 142ms end-to-end latency on commodity CPU-class hardware at approximately $500-$780 USD total system cost. All code, model weights, and simulation datasets are publicly released at submission.

2606.11889 2026-06-11 cs.CV cs.AI cs.RO Cross-list

Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

Everett Richards

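AI summary: Studies how embedding drift relates to changes in a task-aligned hazard score for vision-language models in autonomous driving hazard detection, finds that different corruption types produce different failure modes, and recommends that benchmarks include task-aligned stability metrics.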

详情

Comments: 8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

Abstract

Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.
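The distinction between embedding drift and margin drift can be illustrated with a minimal sketch. The abstract only says the hazard score is "derived from CLIP image-text similarities", so the margin definition below (hazard-prompt similarity minus benign-prompt similarity), the function names, and the toy 3-D vectors are all assumptions made for illustration, not the author's implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hazard_margin(image_emb, hazard_text_emb, safe_text_emb):
    # Assumed hazard score: similarity to a hazard prompt minus
    # similarity to a benign prompt. Positive means "hazard".
    return cosine(image_emb, hazard_text_emb) - cosine(image_emb, safe_text_emb)

def embedding_drift(clean_emb, corrupted_emb):
    # Task-agnostic drift: cosine distance between the two image embeddings.
    return 1.0 - cosine(clean_emb, corrupted_emb)

def margin_drift(clean_emb, corrupted_emb, hazard_text_emb, safe_text_emb):
    # Task-aligned drift: change in hazard score under the corruption.
    return (hazard_margin(corrupted_emb, hazard_text_emb, safe_text_emb)
            - hazard_margin(clean_emb, hazard_text_emb, safe_text_emb))

# Toy 3-D stand-ins for CLIP embeddings (not real model outputs).
hazard_text = np.array([1.0, 0.0, 0.0])
safe_text = np.array([0.0, 1.0, 0.0])
clean = np.array([0.6, 0.5, 0.6])
corrupted = np.array([0.5, 0.6, 0.6])

e_drift = embedding_drift(clean, corrupted)
m_drift = margin_drift(clean, corrupted, hazard_text, safe_text)
print(f"embedding drift = {e_drift:.3f}, margin drift = {m_drift:.3f}")
```

In this toy case the image embedding barely moves, yet the hazard margin changes sign, which is the kind of false negative that an embedding-only stability metric would miss.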

2606.12396 2026-06-11 cs.CV cs.RO cross-list

VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman

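Affiliations: Uber AV Labs; University of Virginia

AI summary: Proposes VLGA, which introduces geometry as a fourth modality supervised by a per-pixel pointmap regression loss, enabling dense 3D world reconstruction and reaching state-of-the-art results on nuScenes and Bench2Drive.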

Comments: Project page: this https URL

Abstract

Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments conducted on challenging nuScenes and Bench2Drive datasets for open-loop and closed-loop evaluations, respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
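A per-pixel pointmap regression loss against sparse LiDAR can be sketched as follows. The abstract does not state the norm, the masking rule, or the tensor layout, so the L1 penalty, the validity mask, and the `(H, W, 3)` shape are assumptions for illustration only.

```python
import numpy as np

def pointmap_l1_loss(pred, target, valid):
    """Mean absolute error over pixels that have a LiDAR return.

    pred, target: (H, W, 3) arrays holding one 3D point per pixel.
    valid: (H, W) boolean mask, True where LiDAR supervision exists.
    """
    n_valid = int(valid.sum())
    if n_valid == 0:
        return 0.0  # no supervised pixels in this frame
    abs_err = np.abs(pred - target)[valid]  # shape (n_valid, 3)
    return float(abs_err.sum() / (3 * n_valid))

# Toy example: a 2x2 image where only the top row has LiDAR returns.
pred = np.zeros((2, 2, 3))
target = np.ones((2, 2, 3))
valid = np.array([[True, True], [False, False]])
loss = pointmap_l1_loss(pred, target, valid)
```

Masking matters because LiDAR projects to only a subset of image pixels; unsupervised pixels should not contribute to the gradient.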

2503.06578 2026-06-11 cs.RO eess.SY replacement

Non-Equilibrium MAV-Capture-MAV via Time-Optimal Planning and Reinforcement Learning

Canlun Zheng, Zhanyu Guo, Zikang Yin, Chunyu Wang, Zhikun Wang, Shiyu Zhao

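AI summary: To address the difficulty of capturing highly maneuverable targets, the paper designs a compact capture MAV and combines time-optimal planning with reinforcement learning to capture targets in unstable states.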

Abstract

The capture of flying MAVs (micro aerial vehicles) has garnered increasing research attention due to its intriguing challenges and promising applications. Despite recent advancements, a key limitation of existing work is that capture strategies are often relatively simple and constrained by platform performance. This paper addresses control strategies capable of capturing high-maneuverability targets. The unique challenge of achieving target capture under unstable conditions distinguishes this task from traditional pursuit-evasion and guidance problems. In this study, we transition from larger MAV platforms to a specially designed, compact capture MAV equipped with a custom launching device while maintaining high maneuverability. We explore both time-optimal planning (TOP) and reinforcement learning (RL) methods. Simulations demonstrate that TOP offers highly maneuverable and shorter trajectories, while RL excels in real-time adaptability and stability. Moreover, the RL method has been tested in real-world scenarios, successfully achieving target capture even in unstable states.

2602.20958 2026-06-11 cs.RO cs.AI replacement

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

Luka Šiktar, Branimir Ćaran, Bojan Šekoranja, Marko Švaco

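AI summary: Proposes an EKF method that fuses depth-camera measurements with monocular camera-to-body distance estimates, using YOLO-pose for real-time fusion, to improve the accuracy and robustness of distance estimation during UAV following. Errors drop by up to 15.3% across three test scenarios.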

Comments: This work has been submitted to the IEEE for possible publication

Abstract

Vision-based Unmanned Aerial Vehicle (UAV) frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between the camera and the target object under real-world conditions, achieved by fusing multiple image modalities. As part of a system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep-learning-based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints to maintain a safe distance between the drone and the human target. Our system provides an accurate estimated distance, which has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the average errors, RMSE, and standard deviations of distance estimation by up to 15.3% in three tested scenarios. Based on the test results, the EKF fusion-based approach increases the depth detection range by reducing the errors outside the optimal depth camera working range. It also shows improved robustness and precision in challenging conditions, such as reflections and poor visibility, making it suitable for search and rescue (SAR).
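The idea of fusing a depth-camera reading with a monocular distance estimate can be sketched with a scalar Kalman filter. This is a deliberate simplification: the paper uses an Extended Kalman Filter, whose state and measurement models are not given in the abstract. The constant-distance motion model, the class name `DistanceFilter`, and every noise variance below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class DistanceFilter:
    x: float          # estimated camera-to-person distance (m)
    p: float          # variance of the estimate (m^2)
    q: float = 0.01   # assumed process noise added per step (m^2)

    def predict(self) -> None:
        # Constant-distance model: the mean stays, uncertainty grows.
        self.p += self.q

    def update(self, z: float, r: float) -> None:
        # Standard scalar Kalman update with measurement z and variance r.
        k = self.p / (self.p + r)      # Kalman gain
        self.x += k * (z - self.x)
        self.p *= 1.0 - k

# One filter step fusing two sensors with different (assumed) noise levels.
kf = DistanceFilter(x=5.0, p=1.0)
kf.predict()
kf.update(z=4.0, r=0.04)   # depth camera: assumed low noise in its working range
kf.update(z=4.4, r=0.16)   # monocular YOLO-pose estimate: assumed noisier
fused_distance = kf.x
```

Because each update weights a measurement by its variance, the fused estimate stays closer to the more trusted sensor. Raising the depth camera's `r` outside its optimal working range would shift weight toward the monocular estimate, which is the behavior the abstract attributes to the fusion.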

2606.10639 2026-06-11 cs.RO replacement

Planar-Sector LOS Guidance for Interception of Agile Targets with Lifting-Wing Quadcopters

Linkai Liu, Kun Yang, Han Zou, Chen Min, Shuli Lv, Shuai Wang, Quan Quan

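Affiliations: School of Automation Science and Electrical Engineering, Beihang University; Research and Development Department, China Academy of Launch Vehicle Technology

AI summary: Proposes the Planar-Sector Line-of-Sight (PS-LOS) guidance framework, whose asymmetric constraints free up maneuverability so that a lifting-wing quadcopter equipped with only a monocular camera can autonomously intercept agile targets at long range. Experiments demonstrate successful interceptions at ranges up to 138 m.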

Comments: Accepted to the IEEE International Conference on Robotics and Automation (ICRA 2026). Recipient of the ICRA 2026 Best Paper Award in Field and Service Robotics

Abstract

Autonomous visual interception of agile aerial targets is challenging due to unpredictable target motion, limited sensing, and the strong coupling between target visibility and interceptor maneuverability. Most existing strapdown-camera interception methods preserve visibility using conic line-of-sight (LOS) constraints that keep the target near the image center. While safe, such symmetric constraints unnecessarily restrict maneuverability and can significantly reduce the usable thrust for pursuit. Motivated by the observation that aggressive FPV pilots do not maintain equal visibility margins in all image directions, this paper proposes a Planar-Sector Line-of-Sight (PS-LOS) guidance framework for autonomous interception using a lifting-wing quadcopter equipped with only a strapdown monocular camera. PS-LOS tightly constrains lateral image error while relaxing longitudinal image error within a safe field-of-view margin, preserving visibility while releasing maneuverability for acceleration-intensive pursuit. Under the lifting-wing quadcopter model, PS-LOS provides nearly 50% more available thrust near the LOS direction than conventional conic LOS constraints. To realize LOS-only interception without direct depth measurements, a delay-compensated state-estimation framework and a nonlinear guidance-and-control architecture are developed for lifting-wing quadcopters. Extensive outdoor flight experiments demonstrate autonomous interception of agile targets exhibiting large-amplitude, high-frequency, and unpredictable motion under real wind disturbances. The proposed system achieves successful interceptions at ranges up to 138 m while maintaining continuous visual tracking throughout the engagement. The results validate PS-LOS as a visibility-preserving, maneuverability-aware guidance framework for long-range visual interception of agile aerial targets.

2606.11278 2026-06-11 cs.RO new submission

Model-based Optimization of Anguilliform Swimming Gaits for Soft Robotic Applications

Brian Van Stratum, James Gallentine, Caleb Rucker, Eric Barth, Jonathan E. Clark, Kourosh Shoele

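Affiliations: FAMU/FSU College of Engineering; Vanderbilt University; The University of Tennessee, Knoxville

AI summary: Presents the Soft Lamprey-Inspired Dual Environment Robot (SLIDER). By combining Lighthill's theory, a nonlinear structural model, and a genetic algorithm, the authors co-optimize the swimming control pattern and caudal fin design, reach a tethered swimming speed of 21.7±0.4 cm/s, and explore optimization for multimodal robots.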

Abstract

In this paper, we introduce the Soft Lamprey-Inspired Dual Environment Robot (SLIDER) and a proper modeling and optimization procedure employed to design the robot. We represent the primary fluid environment actions - inertial effects, vortex forces, and viscous dissipation - using Lighthill's theory for large-amplitude elongated bodies. For structural design parameters such as internal pressure, tail size, and body stiffness, a fast, geometrically and materially nonlinear model is developed and validated. The fluid-structure interaction equations are solved implicitly with an efficient second-order box method. A pneumatic manifold robotic system is employed to actuate SLIDER in a quiescent water tank environment, allowing cross-comparison of computational and experimental results. We find that low-frequency swimming is dominated by resistant environmental forces, whereas higher-frequency swimming is primarily affected by inertial fluid forces. Using our efficient model alongside a genetic algorithm, we co-optimize a swimming control pattern and caudal fin design (subject to SLIDER's climbing morphology) to achieve a tethered swimming speed of 21.7 +/- 0.4 cm/s (0.59 Bl/s). Furthermore, we investigate the optimization procedure for a multimodal robot performing both swimming and climbing tasks.

2606.11704 2026-06-11 cs.RO new submission

Improving Human Diving Endurance with a Field-Deployable, Untethered Exoskeleton

Zhihao Zhou, Zhenmeng Ju, Rui Yang, Chenxi Zhang, Zhihao Zhou, Ming Xu, Enhao Zheng, Dongjie Jiang, Lecheng Ruan, Jingeng Mai, Qining Wang

Affiliations: Institute for Artificial Intelligence, Peking University; Beijing Engineering Research Center of Intelligent Rehabilitation Engineering; School of Advanced Manufacturing and Robotics, Peking University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; Department of Sports Medicine, Peking University Third Hospital; School of Rehabilitation Sciences and Engineering, University of Health and Rehabilitation Sciences

AI summary: Presents the DiveMate exoskeleton, which uses adaptive kick assistance in real underwater environments to increase dive distance by 42.9%, extend dive duration by 54.9%, and lower the net gas consumption rate by 47.0%, substantially improving human diving endurance.

Abstract

Human endurance in underwater locomotion is fundamentally restricted by high energetic demands to overcome drag and the finite supply of self-contained breathing gas. While exoskeleton technology can reduce the metabolic cost of humans in terrestrial locomotion, its potential to enhance human endurance during underwater diving remains entirely unexplored. Here, we present DiveMate, a field-deployable, untethered exoskeleton designed to improve human diving endurance via adaptive kick assistance in real-world underwater environments. During naturalistic diving, DiveMate increases the travel distance using a given energy (breathing gas) by 42.9% and extends dive duration by 54.9% through reducing gas consumption rate. Marked reductions in muscle activation indicate a decrease in physiological exertion, with the net gas consumption rate decreasing by 47.0%. Kinematic characteristics and regularity improvements further underpin efficient energy economy. These results suggest that applying exoskeleton assistance is beneficial for improving human diving endurance and augmenting their ability to explore the aquatic world. This study extends the application frontier of exoskeletons and provides a potential reference for the design and assessment of future underwater assistive devices.

2606.11952 2026-06-11 cs.RO new submission

Deformable In-Hand Slip-Aware Tactile Sensor with Integrated Velocity, Force/Torque, and Pressure Map Sensing

Gabriel Arslan Waltersson, Yiannis Karayiannidis

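Affiliations: Chalmers University of Technology; Lund University

AI summary: Proposes a new tactile sensor that integrates velocity, force/torque, and pressure-map sensing in a deformable contact pad, enabling slip-aware control for in-hand manipulation and supporting fast, low-cost fabrication.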

2606.12112 2026-06-11 cs.RO new submission

PEBRE: An Open-Hardware Compute and Perception Add-On for the Pepper Robot

Malte Kuhlmann, Ignacio Bugueno-Cordova, Emil Alms, Javier Ruiz-del-Solar, Nicolás Navarro-Guerrero

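Affiliations: Leibniz Universität Hannover; University of Chile

AI summary: Presents PEBRE, an open-hardware add-on for the Pepper robot that integrates components such as a Jetson Orin Nano to substantially increase the robot's compute and perception capabilities and extend the platform's service life.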

2511.08299 2026-06-11 cs.RO replacement

Cross-Modal Benchmarking of Robot Perception in Natural Environments

David Hall, Joshua Knights, Mark Cox, Peyman Moghadam

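AI summary: To address the challenges of robot perception in natural environments, the paper uses WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural scenes, and extends the metric depth estimation experiments.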

Comments: Accepted to the IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

Abstract

Natural environments present a complex challenge to robotics perception systems. Current models, particularly vision foundation models, are largely trained on structured, urban environments leading to weaknesses in their perception for field robotics tasks. We showcase the limitations of current models using our recently released WildCross benchmark, a new cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF pose and synchronized dense lidar submaps. In this work, we provide an expanded analysis of the benchmark results from the recent WildCross benchmark, with particular emphasis on expanded metric depth estimation experiments. Access to the code repository and dataset for this work can be found at https://csiro-robotics.github.io/WildCross.

2509.19463 2026-06-11 cs.RO replacement

CU-Multi: A Dataset for Multi-Robot Collaborative Perception

Doncey Albin, Daniel McGann, Miles Mena, Annika Thomas, Harel Biggie, Xuefei Sun, Steve McGuire, Jonathan P. How, Christoffer Heckman

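AI summary: To address the lack of dedicated datasets for benchmarking multi-robot collaborative perception, the paper introduces CU-Multi, a dataset collected over multiple days that contains synchronized multi-robot trajectories, RGB-D, RTK GPS, semantic LiDAR, and refined odometry, supporting reproducible evaluation.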

Comments: 8 pages, 11 figures. arXiv admin note: text overlap with arXiv:2505.17576

Abstract

A central challenge for multi-robot systems is fusing independently gathered perception data into a unified representation. Despite progress in Collaborative SLAM (C-SLAM), benchmarking remains hindered by the scarcity of dedicated multi-robot datasets. Many evaluations instead partition single-robot trajectories, a practice that may only partially reflect true multi-robot operations and, more critically, lacks standardization, leading to results that are difficult to interpret or compare across studies. While several multi-robot datasets have recently been introduced, they mostly contain short trajectories with limited inter-robot overlap and sparse intra-robot loop closures. To overcome these limitations, we introduce CU-Multi, a dataset collected over multiple days at two large outdoor sites on the University of Colorado Boulder campus. CU-Multi comprises four synchronized runs with aligned start times and controlled trajectory overlap, replicating the distinct perspectives of a robot team. It includes RGB-D sensing, RTK GPS, semantic LiDAR, and refined ground-truth odometry. By combining overlap variation with dense semantic annotations, CU-Multi provides a strong foundation for reproducible evaluation in multi-robot collaborative perception tasks.

2606.11535 2026-06-11 cs.RO new submission

Adversarial Attacks on Learned Policies for Surgical Robotic Tasks

Shutong Jin, Ziyang Chen, Preethi Satish, Paavan Gupta, Florian T. Pokorny, Ken Goldberg

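Affiliations: University of California, Berkeley; KTH Royal Institute of Technology

AI summary: Studies the vulnerability of learned policies in robot-assisted surgery to adversarial attacks and proposes disruptive and steering attack methods. Experiments show the attacks reduce surgical subtask success rates by 61% on average.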

Abstract

Learning-based policies are being considered to augment the dexterity of human surgeons in robot-assisted surgery. Can the end-to-end mapping from visual observations to robot actions be vulnerable to adversarial attacks, potentially leading to patient injury? In this paper, we present the first study of adversarial threats to learning-based policies in surgical robotics. We investigate two threat modes: (a) disruptive attacks, where imperceptible visual perturbations interrupt policy execution, and (b) steering attacks, where such perturbations steer policy actions toward attacker-specified directions. We formulate three adversarial attack methods, each with increasing access to policy information, and evaluate their impact on two surgical subtasks: debridement and suturing. Our evaluation covers three end-to-end policy architectures: ACT, Diffusion Policy, and Pi0. In addition, we introduce a new class of photometric adversarial attacks that mimic natural visual changes, such as lighting variations, to generate effective yet visually plausible perturbations. Results from 560 physical experiments using phantoms for debridement and suturing suggest that state-of-the-art policies can be significantly disrupted, resulting in an average 61% reduction in surgical subtask success rates. Project page: this https URL

2604.07833 2026-06-11 cs.RO replacement

Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution

Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

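AI summary: Proposes a policy-constrained execution framework that separates agent cognition from execution oversight, strengthening runtime governance of embodied agents. Validated over 1000 randomized simulation trials, it substantially improves the interception rate of unauthorized actions and the recovery success rate.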

Comments: 36 pages, 3 figures, 10 tables

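AI abstract (translated from Chinese)

Embodied agents are evolving from passive reasoning systems into active executors that interact with tools, robots, and physical environments. Once an agent gains execution authority, the core challenge is keeping its actions controllable at runtime. Existing approaches embed safety and recovery logic inside the agent loop, making execution control difficult to standardize, audit, and adapt. This paper argues that embodied intelligence requires not only stronger agents but also stronger runtime governance. We propose a policy-constrained execution framework that separates agent cognition from execution oversight. Governance is externalized into a dedicated runtime layer responsible for policy checking, capability admission, execution monitoring, rollback handling, and human override. We formalize the control boundary among the embodied agent, Embodied Capability Modules (ECMs), and the runtime governance layer, and validate the framework along three governance dimensions through 1000 randomized simulation trials. Results show that 96.2% of unauthorized actions are intercepted, the unsafe continuation rate under runtime drift falls from 100% to 22.2%, and 91.4% of recoveries succeed with full policy compliance, significantly outperforming all baselines (p<0.001). By reframing runtime governance as a systems problem, this paper positions policy-constrained execution as a key design principle for embodied agent systems.

Abstract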

Embodied Agents are evolving from passive reasoning systems into active executors that interact with tools, robots, and physical environments. Once an agent gains execution authority, the central challenge shifts from how to make it act to how to keep its actions governable at runtime. Existing approaches embed safety, recovery, and decision constraints inside the agent loop, making execution control difficult to standardize, audit, and adapt across environments. We propose a runtime governance framework for policy-constrained execution that separates agent cognition from execution oversight. Governance is externalized into a dedicated runtime layer performing policy checking, capability admission, execution monitoring, rollback, and human override. We formalize the control boundary among a persistent Embodied Agent, modular Capability Packages, and the governance layer, and define a policy-constrained execution pipeline evaluated under controlled simulation. Over 1000 randomized trials, the framework achieves 96.2%+/-2.7% interception of unauthorized actions, reduces unsafe continuation from 100% to 22.2%+/-3.1% under runtime drift, and attains 90.7%+/-3.0% recovery success with full policy compliance. Comparison with five baselines, including AutoRT-style constitution filtering and RoboGuard-style two-stage guardrails, shows that pre-execution filtering is equally effective across governance-aware methods, while only the proposed framework provides continuous runtime detection (RVDR = 61.3% vs. 0%) and structured recovery (all p<0.001). A sensitivity sweep across the full detection range confirms a genuine detection-continuation trade-off. This work argues future embodied systems should be designed for governable execution.

2605.12386 2026-06-11 cs.RO replacement

SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

Chengyue Huang, Khang Vo Huynh, Sebastian Elbaum, Zsolt Kira, Lu Feng

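AI summary: SafeManip defines reusable safety templates to evaluate temporal safety properties in robotic manipulation across eight safety categories, including collision safety and grasp stability, and reveals the shortcomings of existing methods with respect to safety.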

Abstract

Robotic manipulation is typically evaluated by task success, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark to explicitly evaluate temporal safety properties in robotic manipulation, moving beyond prior evaluations that largely focus on task completion or per-state constraint violations. SafeManip defines reusable safety templates over finite executions using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We evaluate SafeManip on six vision-language-action policies, including $\pi_0$, $\pi_{0.5}$, GR00T, and their training variants, across 50 RoboCasa365 household tasks. Results show that even strong models often behave unsafely. Task-success gains do not reliably translate into safer execution: many successful rollouts remain unsafe, while longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success beyond task completion.
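The two temporal failures named in the abstract can be written as LTLf-style properties and checked over a finite predicate trace. The sketch below hand-codes two monitors rather than using an LTLf library; the predicate names (`release`, `inside_enclosure`, `contaminated`, `touch_clean`) and the exact formulas are hypothetical and are not taken from the SafeManip property suite.

```python
def release_only_when_contained(trace):
    """G(release -> inside_enclosure): every release happens inside the enclosure."""
    return all("release" not in step or "inside_enclosure" in step
               for step in trace)

def no_clean_touch_after_contamination(trace):
    """G(contaminated -> G !touch_clean): no clean-surface contact once contaminated."""
    contaminated = False
    for step in trace:
        contaminated = contaminated or "contaminated" in step
        if contaminated and "touch_clean" in step:
            return False
    return True

# Each trace is a list of sets; a set holds the predicates true at that step.
safe_run = [{"grasp"}, {"grasp", "inside_enclosure"},
            {"release", "inside_enclosure"}]
# The object still ends up inside, so a task-success metric would pass this run.
early_release = [{"grasp"}, {"release"}, {"inside_enclosure"}]
contaminated_run = [{"contaminated"}, set(), {"touch_clean"}]

results = {
    "safe_run": release_only_when_contained(safe_run),
    "early_release": release_only_when_contained(early_release),
    "contaminated_run": no_clean_touch_after_contamination(contaminated_run),
}
```

The `early_release` trace shows why a final-state success check is not enough: the violation is visible only when the order of events is inspected.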

2606.11341 2026-06-11 cs.LG cs.RO cross-list

Energy-Conserved Neural Pipelines: Attenuating Error Propagation in Modular Neural Networks via Physical Conservation Constraints

David Young, Swan Yi Htet

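Affiliations: ORION Robotics

AI summary: Proposes enforcing energy conservation between modules (keeping the L2 norm of feature vectors unchanged) as a hard constraint. Experiments show the method substantially outperforms baselines under several noise types, is depth-invariant, and comes with a theoretical guarantee.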

Comments: 22 pages, 2 figures, 7 tables, 25 references

Abstract

Modular neural network pipelines suffer from error compounding: noise at any module boundary propagates and potentially amplifies through subsequent modules. We introduce energy conservation as a hard physical constraint on inter-module information flow. Activation energy (the squared L2 norm of feature vectors) is enforced to be exactly preserved at every module boundary. Unlike soft energy penalties, conservation is an inviolable law: the network may redistribute energy across neurons but cannot create or destroy it. Four experiments on CIFAR-10 demonstrate: (1) conservation retains 77.4% of clean accuracy at noise sigma=0.2, versus 35.1% for baselines and 30.9% for energy-penalized models (p<0.001, 5 seeds); (2) pipelines become depth-invariant, retaining 93.3% at depths 2 through 5 with noise at every boundary; (3) the advantage generalizes to systematic bias (+45.1%), Gaussian (+40.4%), and adversarial noise (+4.8%), with a principled non-effect on dropout (-0.3%); (4) on ResNet-18, the conservation advantage scales inversely with intrinsic normalization: +0.3 pp with BatchNorm, +26.2 pp without at sigma=0.2, reaching +58.0 pp at sigma=0.5. Experiment 5 validates the operator on a real modular robotic pipeline (MuJoCo physics, Franka Panda). Across three independent runs on separate machines (90 trials per cell), conservation provides +18.9 pp average advantage on monocular-depth-style noise. A formal bound proves conserved noise energy is strictly less than input noise energy.
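One simple way to preserve the squared L2 norm exactly at a module boundary is to rescale the module's output to the norm of its input. The abstract does not specify how the authors enforce the constraint, so the rescaling operator below, the function name `conserve_energy`, and the toy tanh module are assumptions made for illustration.

```python
import numpy as np

def conserve_energy(x_in, x_out, eps=1e-12):
    """Rescale x_out so each row has the same L2 norm as the matching row of x_in."""
    norm_in = np.linalg.norm(x_in, axis=-1, keepdims=True)
    norm_out = np.linalg.norm(x_out, axis=-1, keepdims=True)
    # The direction of x_out is kept; only its magnitude is changed.
    return x_out * norm_in / (norm_out + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # batch of 4 feature vectors
w = rng.normal(size=(8, 8))          # stand-in for one module's weights
y = np.tanh(x @ w)                   # unconstrained module output
y_conserved = conserve_energy(x, y)  # output with input energy restored

energy_in = np.sum(x ** 2, axis=-1)
energy_out = np.sum(y_conserved ** 2, axis=-1)
```

Under this sketch the module can redistribute energy across features, since the direction of `y` is untouched, but the total per-sample energy leaving the boundary matches what entered it.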

2510.24515 2026-06-11 cs.RO replacement

Learning Ordinal Response Policies in Rank-Based Stochastic Prize-Collecting Games

Malintha Fernando, Petter Ögren, Silun Zhang

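AI summary: Proposes Stochastic Prize-Collecting Orienteering Games (SPCOG), which extend the Team Orienteering Problem to self-interested agents, uses Ordinal Rank (OR) as a strong inductive bias, and designs the Fictitious Ordinal Response Learning (FORL) algorithm to obtain convergent policies.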

Abstract

The Team Orienteering Problem (TOP) generalizes many real-world multi-agent scheduling and routing tasks that occur in autonomous mobility, aerial logistics, and surveillance applications. While many flavors of the TOP exist for planning in multi-agent systems, they assume that all the agents cooperate toward a single objective; therefore, they do not extend to settings where agents compete in reward-scarce environments. We propose Stochastic Prize-Collecting Orienteering Games (SPCOG) as an extension of the TOP to plan in the presence of self-interested agents operating on a graph, under energy constraints and stochastic transitions. A theoretical discussion on complete and star graphs establishes that there is a unique pure Nash equilibrium in SPCOGs that coincides with the optimal routing solution of an equivalent TOP under rank-based conflict resolution. We propose the concept of Ordinal Rank (OR) as a concise representation of an agent's global rank and its location within a topological, well-defined neighborhood. Empirical evaluations conducted on real-world road-network graphs under both dynamic and stationary prize distributions show that, in parameter-sharing settings, policies that leverage local information can outperform policies that leverage global information when the former are conditioned on the OR rather than the global rank, indicating that the OR acts as a strong inductive bias in multi-agent games on graphs. The OR-conditioned policies also generalize much better to games with a large number of agents than global-rank-conditioned policies. Finally, we propose Fictitious Ordinal Response Learning (FORL) as an entropy-regulated algorithm to obtain convergent policies in independent-learning settings in prize-collecting games on graphs.



2511.20216 2026-06-11 cs.AI cs.CE cs.CV cs.LG cs.RO

CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Samwoo Seong, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Youngjae Yu, Yunsung Lee

2602.03147 2026-06-11 cs.RO

Multi-function Robotized Surgical Dissector for Endoscopic Pulmonary Thromboendarterectomy: Preclinical Study and Evaluation

Runfeng Zhu, Xin Zhong, Qingxiang Zhao, Jing Lin, Zhong Wu, Kang Li

2412.12231 2026-06-11 cs.RO cs.LG

Demonstrating Data-to-Knowledge Pipelines for Connecting Production Sites in the World Wide Lab

Leon Gorißen, Jan-Niklas Schneider, Mohamed Behery, Philipp Brauner, Moritz Lennartz, David Kötter, Thomas Kaster, Oliver Petrovic, Christian Hinke, Thomas Gries, Gerhard Lakemeyer, Martina Ziefle, Christian Brecher, Constantin Häfner

1. Robot Learning, Imitation Learning, and Reinforcement Learning (13 papers)

Dynamic Execution Horizon Prediction for Chunk-based Robot Policies

Learning Object Manipulation from Scratch via Contrastive Interaction

Distortion-Resilient Robotic Imitation Learning for Autonomous Cable Routing

Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation

Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

2. Motion Planning, Control, and Dynamics (5 papers)

ConsistencyPlanner: Real-time Planning with Fast-Sampling Consistency Models

Learning Unions of Convex Sets via Invertible Latent Decomposition for Path Planning

DynaRetarget: Dynamically-Feasible Retargeting using Sampling-Based Trajectory Optimization

Consensus-based optimization (CBO): Towards Global Optimality in Robotics

Closing the Motion Execution Gap: From Semantic Motion Task Constraints to Kinematic Control

3. Manipulation, Grasping, and Dexterous Hands (9 papers)

PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation

TacCoRL: Integrating Tactile Feedback into VLA via Simulation

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

Modular Anthropomorphic Hand Design via Multi-Parameter Finger Benchmarking and Selection

Point Cloud Segmentation for Autonomous Clip Positioning in Laparoscopic Cholecystectomy on a Phantom

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

iPack: Intuitive Bin Packing with Large Language Models

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

4. Navigation, Localization, and SLAM (7 papers)

SAFER-Nav: Enhancing Safety for Visual Robot Navigation via Segmentation-Aware Fine-Tuning

KinematicRL: A Sim-to-Real Reinforcement Learning Framework For Social Navigation With Kinodynamic Feasibility

SR-LIO++: LiDAR-Inertial Odometry and Quantized Mapping with Caching-Aware Sweep Reconstruction

LEMON-Mapping: Loop-Enhanced Large-Scale Multi-Session Point Cloud Merging and Optimization for Globally Consistent Mapping

PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

Vision-Aided Relative State Estimation for Approach and Landing on a Moving Platform with Inertial Measurements

CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision

5. Human-Robot Interaction and Collaborative Robots (4 papers)

Human-Guided Co-Manipulation of Carbon Fiber Plies

Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

Fast-SDE: Efficient Single-Microphone Sound Source Distance Estimation in Reverberant Environments

SIL: Symbiotic Interactive Learning for Language-Conditioned Human-Agent Co-Adaptation

6. Embodied Intelligence and Vision-Language-Action Models (14 papers)

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

VICX: Generalizable Robot Manipulation via Video Generation and In-Context Operator Network

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

ActionMap: Robot Policy Learning via Voxel Action Heatmap

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

RoboGPT-R1: Enhancing Robot Task Planning with Reinforcement Learning

7. Multi-Robot and Swarm Systems (7 papers)

MASK: Multi-Agent Semantic K-Scheduling for Risk-Sensitive 6G Robotics

Steering Multirobot Behavior via Closed-Loop Affine Activation Editing

Fibration Trees: A Unified Approach to Multi-Robot Motion Planning

UGV-Conditioned Multi-UAV Informative Planning on a Shared Exposure Belief

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty

Continual Quadruped Robots Coordination via Semantic Skill Discovery

8. Unmanned Ground Vehicles, UAVs, and Mobile Robots (11 papers)

A Modular Dual-Camera Pipeline for Micro-Inspection Using Aerial Robots

Explore From Sketch: Accelerating UAV Exploration in Large-scale Environments with Prior Maps

MPPI-based Informative Trajectory Planning for Search and Capture of Drifting Targets with ASVs

AerialClaw: An Open-Source Framework for LLM-Driven Autonomous Aerial Agents

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection, Behavioral Intent Classification, and Swarm Intelligence in Contested Airspace

Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

Non-Equilibrium MAV-Capture-MAV via Time-Optimal Planning and Reinforcement Learning

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

Planar-Sector LOS Guidance for Interception of Agile Targets with Lifting-Wing Quadcopters

9. Soft Robotics and Hardware Design (5 papers)

Model-based Optimization of Anguilliform Swimming Gaits for Soft Robotic Applications