arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Robot Learning, Imitation Learning and Reinforcement Learning (16 papers)

2606.07974 2026-06-09 cs.RO cs.AI New submission

PRISM: PRior-guided Imagination Sampling in world Models

Yuhai Wang, Jiawei Xia, Rongxuan Zhou, Xiao Hu, Yongliang Shi, Jing Du, Yang Ye

Affiliations * Northeastern University; University of California, Berkeley; Qiyuan Lab; University of Florida

AI summary Proposes PRISM, a framework that extracts a state-conditioned Gaussian prior from the world model's encoder and updates the planner's sampling distribution with a precision-weighted Product-of-Gaussians, significantly improving model-based continuous-control performance without added architectural complexity.

Abstract

A learned world model provides a powerful physical intuition for evaluating future states. But its effectiveness in continuous control also depends critically on how candidate actions are generated for model-based planning. Rather than solely asking how accurately a model can simulate the future, we ask: which candidate actions are worth evaluating in the first place? Existing planners typically search arbitrarily or use expert demonstrations only to initialize a sampling mean, discarding the expert's state-conditioned confidence. Properly guiding this search requires a robust action prior, yet current approaches often rely on independent visual encoders or large-scale VLMs to obtain one. We argue that this architectural bloat is unnecessary: the exact same data - and the learned representations of the world model itself - inherently encode the agent's action intuition. We introduce PRISM, a task-agnostic framework that extracts both from a single dataset while maintaining strict architectural simplicity. Building on a standard JEPA-style latent world model, PRISM attaches a lightweight MLP directly to its frozen encoder to predict a state-conditioned Gaussian prior. At plan time, PRISM fuses this prior into the planner's sampling distribution via a precision-weighted Product-of-Gaussians update. This parameter-free, closed-form integration steers the sampling process, letting the prior dominate where it is confident and cede control where it is not. PRISM improves success rates by 35 percentage points over vanilla world-model-based MPC on Cube and 32 percentage points on PushT, without introducing significant inference overhead.
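The precision-weighted Product-of-Gaussians update named in the abstract can be illustrated with a minimal sketch. It assumes diagonal (per-dimension) Gaussians and uses the textbook closed form, in which precisions (inverse variances) add and the fused mean is the precision-weighted average of the two means. The function name `fuse_gaussians` and the example numbers are hypothetical; how PRISM wires this update into its planner iterations is not specified here.

```python
def fuse_gaussians(mu_plan, sigma_plan, mu_prior, sigma_prior):
    """Product of two diagonal Gaussians, computed per action dimension.

    Precision is 1 / sigma^2. The fused precision is the sum of the two
    precisions, and the fused mean weights each mean by its precision.
    """
    fused_mu, fused_sigma = [], []
    for mp, sp, mq, sq in zip(mu_plan, sigma_plan, mu_prior, sigma_prior):
        lam_plan = 1.0 / (sp * sp)    # planner precision
        lam_prior = 1.0 / (sq * sq)   # prior precision
        lam = lam_plan + lam_prior
        fused_mu.append((lam_plan * mp + lam_prior * mq) / lam)
        fused_sigma.append(lam ** -0.5)
    return fused_mu, fused_sigma


# Hypothetical 2-D action: the prior is confident in dimension 0
# (sigma 0.1) and unconfident in dimension 1 (sigma 10).
mu, sigma = fuse_gaussians([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [0.1, 10.0])
print(mu, sigma)
```

In dimension 0 the fused mean lands close to the prior mean, while in dimension 1 it stays close to the planner mean, which is the "dominate where confident, cede where not" behavior the abstract describes. No learned weights are involved in the fusion itself.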

2606.08015 2026-06-09 cs.RO New submission

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

Ziqian Wang, Jiayu Sun, Xingjian Mao, Minqian Wang, Yao Mu

Affiliations * Shanghai Jiao Tong University; University of Michigan, Ann Arbor; University of Electronic Science and Technology of China

AI summary Proposes Q-VGM, an off-policy reinforcement learning method that converts value gradients into a denoising-time value-gradient field, avoiding backpropagation through the denoising chain, to fine-tune flow-matching VLA policies efficiently; it substantially raises success rates on LIBERO and other benchmarks.

Comments 13 pages, 3 figures, 4 tables

Abstract

We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the first-order (gradient) information of the critic, but this is difficult for flow policies, because directly back-propagating the value through their multi-step denoising process is numerically unstable at VLA scale, while the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that transforms value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. This requires no action likelihoods and no backpropagation through the denoising chain, and operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical few-shot initialization then learn-from-experience paradigm: starting from a few-shot-SFT pi0.5 VLA, the method leverages self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.

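2606.08154 2026-06-09 cs.RO New submission

SynthICL: Scalable In-context Imitation Learning with Synthetic Data

Cheng Qian, Ruomeng Fan, Yifei Ren, Yilong Wang, Edward Johns

Affiliations * The Robot Learning Lab; Imperial College London

AI summary Proposes SynthICL, a framework that trains in-context imitation learning policies purely on RGB-only synthetic data, avoiding depth sensing and real-world data, and uses subgoal prediction to improve control precision; it achieves a 79% average success rate on 16 real-world manipulation tasks.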

Abstract

In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations by conditioning a pre-trained policy on task-specific examples, without retraining at test time. Despite this promise, training generalizable and scalable in-context imitation policies remains an open challenge. We present SynthICL, a scalable framework that trains ICIL policies entirely from RGB-only synthetic data. Specifically, we build a data generation pipeline to produce high-fidelity ICIL data and train a flow-matching transformer policy on the resulting dataset. SynthICL avoids the need for depth sensing, precise camera calibration, and real-world training data in prior approaches, offering a simpler and more scalable alternative. We further incorporate subgoal prediction by training the model to predict the next subgoal images, enabling more precise and visually grounded control. Evaluated on 16 unseen real-world manipulation tasks, SynthICL achieves an average success rate of 79% with only one demonstration provided at test time and outperforms prior methods. Project page: https://synth-icl.github.io

2606.08610 2026-06-09 cs.RO cs.AI New submission

HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

Zechu Li, Yufeng Jin, Xiaoyang Liu, Puze Liu, Vignesh Prasad, Carlo D'Eramo, Georgia Chalvatzaki

Affiliations * TU Darmstadt; Honda Research Institute Europe; Columbia University; Tongji University; Shanghai Research Institute for Intelligent Autonomous Systems; University of Würzburg; Hessian.AI

AI summary Proposes HARBOR, which frames robot RL automation as a harness-engineering problem and uses specialized agents, standardized commands, and reusable knowledge to automate the workflow from environment setup to policy training in simulation; validated on 6 benchmarks and 16 tasks.

Abstract

Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, shaping rewards, and tuning hyperparameters require substantial expert effort, making RL workflows costly and difficult to scale. We introduce HARBOR, an agentic framework that frames robot RL automation as a harness-engineering problem: given a simulator codebase and a task specification, it automates the workflow from environment setup to policy training in simulation. HARBOR decomposes such high-level objectives into bounded stages executed by specialized agents through standardized commands, persistent artifacts, executable gates, and reusable knowledge, and scales iteration via decentralized parallel trials and experience learning across runs. We evaluate HARBOR across 6 benchmarks and 16 tasks in total, spanning manipulation, locomotion, and bimanual dexterous control. We demonstrate that HARBOR automates the simulation RL workflow end-to-end, designs rewards, tunes algorithms to match or improve over default configurations, and reduces engineering effort at practical token and wall-clock cost; the resulting policies can also be transferred to real robots.

2606.08657 2026-06-09 cs.RO cs.AI New submission

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation

Zhexuan Zhou, Yichen Lai, Jinhao Zhang, Huizhe Li, Youmin Gong, Jie Mei

Affiliations * National University of Singapore; University of Science and Technology of China

AI summary Proposes LDP, a two-stage framework in which a CVAE encoder absorbs scene understanding and flow matching runs in a pre-concentrated latent space, simplifying learning and improving performance on multi-arm coordination tasks.

Abstract

Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.

2606.08743 2026-06-09 cs.RO New submission

Guided Discovery of New Behaviors using Diffusion Policies

Dian Yu, Sebastian Sanokowski, Majid Khadiv

Affiliations * Munich Institute of Robotics and Machine Intelligence, Technical University of Munich

AI summary Proposes a framework that combines Feynman-Kac correctors with a guiding potential to mine and refine rare but feasible trajectories from a diffusion policy, then retrains the policy to systematically discover diverse, executable behaviors.

Comments Preprint. Supplementary video: https://youtu.be/T7MUvMA67VM

Abstract

Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies excelling at modeling multimodal action-trajectory distributions. However, when demonstrations are limited, standard sampling often reproduces dominant behaviors while neglecting valid but rare modes, limiting the discovery of novel solutions. Existing approaches, such as guidance methods or combining reinforcement learning with diffusion, either push samples into infeasible regions or struggle to escape local minima, failing to systematically uncover diverse behaviors. To address these challenges, we propose a framework that combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors. We demonstrate the effectiveness of our framework across a range of manipulation environments, consistently discovering new behaviors.

2606.08775 2026-06-09 cs.RO cs.AI New submission

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami

Affiliations * Tandon School of Engineering, New York University; Courant Institute of Mathematical Sciences, New York University; AMI Labs

AI summary Proposes WorldDP, a hierarchical framework that pairs a high-level world model for runtime subgoal optimization with a low-level diffusion policy for execution, and uses object-centric representations to decouple environmental entities, for effective planning and execution of multi-stage robotic manipulation tasks.

Abstract

Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.

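2606.09236 2026-06-09 cs.RO cs.AI New submission

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

Luca Ghisi, Jacopo Essenziale, Carlo D'Eramo, Matteo Luperto

Affiliations * University of Bologna

AI summary Proposes a self-paced curriculum deep reinforcement learning framework combined with Soft Actor-Critic that dynamically generates progressively harder tasks to train an autonomous superbike racing agent in a physics-accurate simulator, outperforming standard SAC.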

Comments Presented at the "1st Workshop on Generalization in Autonomous Driving: Paradigms, Practice, and Public Road Demonstrations" at ICRA 2026, Vienna. Oral+poster presentation

Abstract

Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.

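2606.09381 2026-06-09 cs.RO New submission

ReGIL: Retrieval-Guided Imitation Learning from a Single Demonstration

Yuying Zhang, Francesco Verdoja, Wenyan Yang, Ville Kyrki

Affiliations * Aalto University

AI summary Proposes ReGIL, a framework that treats a single demonstration as external memory and uses retrieval to guide exploration, generate the regularization buffer, and construct rewards; it improves success rate and training efficiency on the LIBERO and Meta-World benchmarks and on real-robot tasks.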

Abstract

Learning robot manipulation policies with deep neural networks from a single demonstration remains highly challenging, as even small deviations from the demonstrated trajectory can quickly compound into failure, while collecting substantial online interaction data is costly. We propose ReGIL, a retrieval-guided imitation learning framework that treats a single demonstration as an external memory. ReGIL repeatedly queries this static memory throughout training to simultaneously guide exploration, generate the regularization buffer, and construct rewards. Specifically, it computes rewards through local temporal alignment between the current trajectory and the retrieved segment, providing step-wise and informative feedback for policy improvement. We evaluate ReGIL on robotic manipulation tasks from the LIBERO and Meta-World benchmarks under the single demonstration setting. ReGIL outperforms prior baselines in both success rate and training efficiency. In real-robot experiments, using only one demonstration and less than one hour of online training, ReGIL achieves over 75% success rate across three manipulation tasks with randomness in both initial robot pose and target position. These results demonstrate that leveraging the single demonstration as reusable memory can provide more than static supervision for efficient robot learning. More details can be found on our website: https://regil2026.github.io/
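The idea of computing rewards by local temporal alignment against a retrieved demonstration segment can be sketched as follows. This is an illustrative stand-in and not ReGIL's actual reward: it assumes states are plain float vectors, searches only a small index window around the previously matched demonstration step, and scores a state by how far along the demonstration it matched, discounted by its distance to that step. The names `retrieve_and_reward` and `window`, and the reward formula, are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_and_reward(state, demo, last_idx, window=2):
    """Align `state` to the demonstration near `last_idx`.

    Only indices within `window` steps of the previous match are
    searched, which keeps the alignment local in time.
    Returns (matched_index, reward). Requires len(demo) > 1.
    """
    lo = max(0, last_idx - window)
    hi = min(len(demo), last_idx + window + 1)
    best_idx = min(range(lo, hi), key=lambda i: euclidean(state, demo[i]))
    dist = euclidean(state, demo[best_idx])
    progress = best_idx / (len(demo) - 1)    # 0 at the start, 1 at the end
    reward = progress * math.exp(-dist)      # assumed shaping, not the paper's
    return best_idx, reward


# Hypothetical 1-D demonstration with four steps.
demo = [[0.0], [1.0], [2.0], [3.0]]
idx, r = retrieve_and_reward([2.1], demo, last_idx=1)
print(idx, r)
```

Restricting the search to a window is one simple way to get step-wise feedback without letting the match jump to a visually similar but temporally distant part of the demonstration.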

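2606.09457 2026-06-09 cs.RO New submission

$ω$-EVA: Envision, Verify, and Act with Latent Interactive World Models

Zhenguo Sun, Yu Sun, Hande Huang, Alois Knoll

Affiliations * Technical University of Munich

AI summary Proposes $ω$-EVA, a latent interactive world model that realizes an Envision-Verify-Act loop, generating actions with action-conditioned latent dynamics and a language-conditioned flow policy without generating future video; it improves policy performance across diverse robot manipulation tasks.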

Abstract

Embodied policies typically map current observations directly to actions, leaving candidate-action consequences implicit. World models provide predictive supervision, representations, or external simulation, but rarely let a policy inspect the imagined consequence of its own proposal before acting. We introduce $ω$-EVA, a latent interactive world model that realizes an Envision--Verify--Act loop for embodied action generation. Its three-stage framework learns action-conditioned latent dynamics, trains a language-conditioned flow policy on dynamics-aware visual representations, and feeds the policy's proposal back through the world model. A tri-branch refiner jointly reasons over the current state, proposal-conditioned future, and proposed action to produce the final action chunk. Because consequence reasoning remains in latent feature space, $ω$-EVA avoids generating future videos at inference. Evaluations across diverse single-arm, bimanual, long-horizon, and perturbed simulation settings show that the complete interaction pipeline consistently improves the proposal policy, while latent diagnostics indicate meaningful action-conditioned future structure. With approximately 1.2B parameters and no additional robot-data pretraining, $ω$-EVA demonstrates a compact and competitive performance--scale--data trade-off, making the world model an active action-feedback module rather than a passive predictor.

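2606.09476 2026-06-09 cs.RO New submission

Goal Sets, Not Goal States: Queryable Robot Goals through Goal-Set Hindsight Relabeling

Carlos Vélez García, Miguel Cazorla, Jorge Pomares

Affiliations * INESCOP; University of Alicante

AI summary Proposes Goal-Set Hindsight Relabeling (GS-HER), which generalizes hindsight relabeling from singleton goal states to predicate-level goal sets; a queryable binary predicate decouples the success condition from the full state dimensions, improving offline GCRL performance when nuisance dimensions are present and letting a single model serve multiple goal predicates.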

Abstract

Hindsight relabeling usually turns achieved future states into exact goals, which can overconstrain offline robot learning when task success depends only on a subset of the state. We propose Goal-Set Hindsight Relabeling (GS-HER), a predicate-level generalization of HER in which achieved states certify query-defined goal sets rather than singleton goal states. A binary query specifies which variables define success, making the goal predicate an inference-time input while leaving the underlying offline GCRL algorithm unchanged. Across OGBench tasks and five offline goal-conditioned learners, GS-HER improves performance when full-state goals are bottlenecked by nuisance dimensions and turns hindsight relabeling into a reusable goal interface: one checkpoint can answer multiple robot goal predicates without retraining.
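A minimal sketch of relabeling with a binary query, assuming states are flat lists of floats and the query is a 0/1 mask over state variables. The helper names (`relabel_goal_set`, `in_goal_set`), the zero-filling of unqueried dimensions, and the tolerance are illustrative assumptions rather than the paper's implementation.

```python
import random


def relabel_goal_set(trajectory, t, query, rng=random):
    """Hindsight-relabel step t with a goal set instead of a goal state.

    A state achieved at or after step t is sampled, and only the
    variables selected by `query` (1 = matters for success) are kept.
    Unqueried dimensions are zeroed so they carry no information.
    """
    future = rng.randrange(t, len(trajectory))
    achieved = trajectory[future]
    goal = [v if q else 0.0 for v, q in zip(achieved, query)]
    return goal, query


def in_goal_set(state, goal, query, tol=1e-6):
    """Goal predicate: success depends only on the queried variables."""
    return all(abs(s - g) <= tol for s, g, q in zip(state, goal, query) if q)


# Hypothetical 2-D state: dimension 0 is task-relevant, dimension 1 is
# a nuisance variable that the query masks out.
trajectory = [[0.0, 5.0], [1.0, 7.0], [2.0, 9.0]]
goal, query = relabel_goal_set(trajectory, t=2, query=[1, 0])
print(goal, in_goal_set([2.0, -100.0], goal, query))
```

Because the query is an input rather than something baked into training, the same relabeled data can be reused with a different mask at inference time, which matches the abstract's claim that one checkpoint can answer multiple goal predicates.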

2606.09758 2026-06-09 cs.RO cs.AI cs.LG New submission

Difference-Aware Retrieval Policies for Imitation Learning

Quinn Pfeifer, Ethan Pronovost, Paarth Shah, Khimya Khetarpal, Siddhartha Srinivasa, Abhishek Gupta

Affiliations * Paul G. Allen School of Computer Science & Engineering, University of Washington; Toyota Research Institute; Google DeepMind; Mila

AI summary Proposes DARP, a semi-parametric retrieval-based imitation learning method that reparameterizes the problem in terms of k-nearest-neighbor local neighborhood structure to address behavior cloning's poor out-of-distribution generalization; it improves performance by 15-46% on continuous control and robotic manipulation tasks.

Comments 12 pages, 7 figures, 3 tables. Accepted to ICLR 2026. Code and demos available at https://weirdlabuw.github.io/darp-site/

Abstract

Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data during inference via a semi-parametric retrieval-based imitation learning approach can alleviate this challenge. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a semi-parametric retrieval-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on $k$-nearest neighbors from expert demonstrations, their corresponding actions, and the relative distance vectors between neighbor states and query states. DARP requires no additional assumptions beyond those made for standard behavior cloning -- it does not require additional data collection, online expert feedback, or task-specific knowledge. We demonstrate consistent performance improvements of 15-46% over standard behavior cloning across diverse domains, including continuous control and robotic manipulation, and across different representations, including high-dimensional visual features. Code and demos are available at https://weirdlabuw.github.io/darp-site/.
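The inputs the abstract lists (the $k$ nearest expert states, their actions, and the difference vectors to the query state) can be assembled as in the sketch below. DARP trains a model on such inputs; here an inverse-distance weighted average of neighbor actions stands in for that learned model purely for illustration. The function names, the sign convention of the difference vector (query minus neighbor), and the example data are assumptions.

```python
import math


def knn_difference_features(query, demo_states, demo_actions, k=2):
    """Collect (neighbor_state, neighbor_action, query - neighbor_state)
    for the k expert states closest to `query`."""
    order = sorted(range(len(demo_states)),
                   key=lambda i: math.dist(query, demo_states[i]))
    feats = []
    for i in order[:k]:
        diff = [q - s for q, s in zip(query, demo_states[i])]
        feats.append((demo_states[i], demo_actions[i], diff))
    return feats


def weighted_action(feats, eps=1e-8):
    """Non-learned stand-in for the trained predictor: average the
    neighbor actions, weighting closer neighbors more heavily."""
    weights = [1.0 / (math.hypot(*diff) + eps) for _, _, diff in feats]
    total = sum(weights)
    dim = len(feats[0][1])
    return [sum(w * a[d] for w, (_, a, _) in zip(weights, feats)) / total
            for d in range(dim)]


# Hypothetical expert data: 2-D states, 1-D actions.
demo_states = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
demo_actions = [[0.0], [1.0], [9.0]]
feats = knn_difference_features([0.25, 0.0], demo_states, demo_actions, k=2)
print(weighted_action(feats))
```

Feeding difference vectors rather than absolute states means the predictor sees how far, and in which direction, the query sits from known expert states, which is the local-neighborhood reparameterization the abstract describes. `math.dist` and multi-argument `math.hypot` require Python 3.8 or later.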

2606.09811 2026-06-09 cs.RO cs.AI cs.CV New submission

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu

Affiliations * Shanghai Jiao Tong University; Shanghai AI Laboratory; Baidu AI Cloud; The University of Hong Kong

AI summary Proposes AHA-WAM, an asynchronous horizon-adaptive world-action model built on a dual Diffusion Transformer, which decouples timing between a low-frequency world planner and a high-frequency action executor for efficient closed-loop control; it reaches state-of-the-art performance on RoboTwin and real-world tasks.

Comments Project page: https://serene-sivy.github.io/aha-wam/

Abstract

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.

2509.16136 2026-06-09 cs.RO Updated version

Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning

Changwei Yao, Xinzi Liu, Chen Li, Marios Savvides

Affiliations * Carnegie Mellon University; University of Tokyo

AI summary Proposes RE-GoT, a framework that combines LLMs and VLMs with graph-of-thoughts reasoning and iteratively refines reward functions through task decomposition and visual feedback; experiments show it outperforms existing methods on RoboGen and ManiSkill2 tasks.

Journal ref IEEE International Conference on Robotics and Automation (ICRA 2026)

Abstract

Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.

2606.03787 2026-06-09 cs.RO Updated version

Worth Remembering: Surprise-Gated Robot Episodic Memory

Nicolas Gorlo, Derek K. Wise, Alberto Speranzon, Luca Carlone

Affiliations * Massachusetts Institute of Technology; Lockheed Martin

AI summary Proposes a Bayesian-surprise gating mechanism that selectively stores high-utility episodic memories, computing surprise in the V-JEPA-2 latent space; it improves robot question answering by at least 12%.

Comments 14 pages, 2 figures, 4 tables

AI中文摘要

解决通用任务的机器人需要能够将指令与过去经验联系起来,因为人类在给出任务时可能会提及显著的历史事件(例如,“带我去昨天化学品泄漏的地方”)。由于记忆限制使得存储所有过去事件不可行,长期机器人记忆必须具有选择性,理想情况下只保留那些对未来任务具有高实用性的情节。然而,对于通用机器人,未来任务通常不是先验给定的。为了选择通用有用的记忆,我们提出贝叶斯惊讶作为记忆形成的门控机制。我们提出了一种方法,在由V-JEPA-2提供的语义丰富且部署无关的潜在空间中计算惊讶。通过使用我们的门控情景记忆来增强基于4D场景图的时空记忆,我们在机器人问答中显示出相对于最先进基准的一致改进,在时间、空间和二元问题上优于先前的机器人记忆方法≥12%,并在事件分割任务中以无监督因果方法超越了有监督和非因果方法的性能。

英文摘要

Robots solving generalist tasks need to be able to ground instructions in their past experience, since humans may refer to notable past events when giving a task (e.g., "Take me to where the chemical spill happened yesterday"). Since memory limits make storing all past events infeasible, long-term robot memory must be selective, ideally retaining only those episodes with high utility for future tasks. However, future tasks are not typically given a priori for generalist robots. To select generically useful memories, we propose Bayesian surprise as a gating mechanism for memory formation. We present an approach to compute surprise in a semantically rich deployment-agnostic latent space provided by V-JEPA-2. Using our gated episodic memory to augment 4D scene graph-based spatial memory, we show a consistent improvement over state-of-the-art benchmarks in robot question answering, outperforming prior robot memory methods by ≥12% for temporal, spatial, and binary questions, and surpassing the performance of supervised and non-causal methods with an unsupervised causal method in event segmentation tasks.
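The gating idea in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the robot's belief over a latent vector is a diagonal Gaussian and measures Bayesian surprise as the KL divergence from the prior belief to the posterior belief. The function names, the diagonal-Gaussian belief, and the threshold value are all hypothetical; the abstract does not say how beliefs over V-JEPA-2 latents are represented.

```python
import math

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two diagonal Gaussians given as mean/variance lists."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def should_store(prior, posterior, threshold=1.0):
    """Gate memory formation: keep the episode only if surprise exceeds a threshold.

    prior and posterior are (mean, variance) tuples; threshold is an assumed
    tuning parameter, not a value taken from the paper.
    """
    surprise = gaussian_kl(posterior[0], posterior[1], prior[0], prior[1])
    return surprise > threshold

# An observation that leaves the belief unchanged is not stored;
# one that shifts the mean by two standard deviations is.
unchanged = should_store(([0.0], [1.0]), ([0.0], [1.0]))
shifted = should_store(([0.0], [1.0]), ([2.0], [1.0]))
```

A fixed threshold is the simplest possible gate; an adaptive threshold driven by a memory budget would be a natural variation.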

2511.18203 2026-06-09 cs.RO 版本更新

SkillWrapper: Generative Predicate Invention for Task-level Robot Planning

SkillWrapper:任务级机器人规划的生成性谓词发明

Ziyi Yang, Benned Hedegaard, Ahmed Jaafar, Yichen Wei, Skye Thompson, Shreyas S. Raman, Haotian Fu, Stefanie Tellex, George Konidaris, David Paulius, Naman Shah

发表机构 * Brown University(布朗大学) Allen Institute for AI(人工智能研究院)

AI总结 本文提出SkillWrapper方法,通过生成性谓词发明学习符号化表示,使机器人能够基于抽象推理完成长周期任务规划。

AI中文摘要

从单个技能执行到长周期任务的泛化是构建自主机器人面临的核心挑战。一个有前途的方向是学习低层机器人技能的高层符号表示,从而实现独立于底层状态空间的抽象推理。近期基础模型的进步使生成作用于原始感知输入的符号谓词成为可能,这一过程我们称为生成性谓词发明,以促进下游表示学习。然而,先前工作通过启发式或随意方法学习这些抽象,忽略了这些抽象应满足的正式属性以及如何保证这些属性的问题。我们通过提出生成性谓词发明的任务级规划正式理论,并提出SkillWrapper方法,该方法学习可证明正确且完整的规划符号模型来解决这些问题。我们的方法利用基础模型主动收集机器人数据,并学习可被人类解释和规划的表示,仅使用RGB图像观测。我们在仿真和真实机器人上的广泛实证评估表明,SkillWrapper学习的抽象表示使机器人能够将黑箱技能组合起来,解决未见的长周期任务。

英文摘要

Generalizing from individual skill executions to long-horizon tasks is a core challenge in building autonomous robots. A promising direction is learning high-level, symbolic representations of low-level robot skills, enabling abstract reasoning independent of the low-level state space. Recent advances in foundation models have made it possible to generate symbolic predicates that operate on raw sensory inputs (a process we call generative predicate invention) to facilitate downstream representation learning. However, prior work learns these abstractions using heuristic or ad-hoc procedures, ignoring the question of which formal properties they ought to satisfy, and how to guarantee these properties. We address these questions by presenting a formal theory of generative predicate invention for task-level planning, and proposing SkillWrapper, a method that learns symbolic models for provably sound and complete planning. Our approach leverages foundation models to actively collect robot data and learn human-interpretable, plannable representations, using only RGB image observations. Our extensive empirical evaluation in simulation and on real robots shows that SkillWrapper learns abstract representations that enable robots to compose black-box skills to solve unseen, long-horizon tasks in the real world.

2. 运动规划、控制与动力学 13 篇

2606.07855 2026-06-09 cs.RO math.OC 新提交

Path Planning Using Deep Deterministic Policy Gradient: A Reinforcement Learning Approach

使用深度确定性策略梯度的路径规划:一种强化学习方法

Qiang Le, Yaguang Yang, Isaac E. Weintraub

发表机构 * Hampton University(汉普顿大学) Air Force Research Laboratory(空军研究实验室)

AI总结 提出基于深度确定性策略梯度的路径规划方法,将威胁建模为圆形禁行区,通过奖励函数引导智能体学习从状态到动作的映射,找到最大安全起始点集,相比传统最优控制方法速度更快,适用于实时应用。

Comments 14 pages, 12 figures

AI中文摘要

在充满威胁的环境中,自主车辆的路径规划是一个基本挑战,因为即使是最简单的情景,问题也是非线性和非凸的。虽然传统最优控制方法可用于寻找理想路径,但计算时间通常太慢,无法实时决策。为了解决这一挑战,我们提出了一种基于深度确定性策略梯度(DDPG)的方法,并将威胁建模为可能多个圆形“禁行”区域。如果车辆在任何时刻进入该限制区域或未到达目的地附近,则任务被视为失败。DDPG智能体通过模拟环境中的试错进行训练,学习从其当前状态(位置和航向)到一系列可行动作的直接映射,从而引导智能体安全到达目的地。奖励函数包含三部分:(a) 以最终目的地为中心的吸引场,(b) 以圆形障碍物原点为中心的若干排斥场,以及(c) 控制能量消耗(航向变化幅度)的惩罚,间接有利于直线路径。DDPG利用这些激励训练智能体,以找到最大的起始点集合,从中保证存在一条通往目的地的安全路径。这为任务规划提供了关键信息,预先显示从给定起始点任务是否可行,辅助任务前规划活动。该方法在仿真中得到验证。将DDPG方法与传统最优控制(伪谱)方法进行了比较。结果表明,基于学习的智能体能够生成有效路径,同时速度显著更快,使其更适合实时应用。

英文摘要

Path planning for autonomous vehicles in threat-laden environments is a fundamental challenge because the problem is nonlinear and nonconvex even in the simplest scenarios. While traditional optimal control methods can be used to find ideal paths, the computation time is often too long for real-time decision-making. To address this challenge, we propose a method based on Deep Deterministic Policy Gradient (DDPG) and model the threats as one or more circular 'no-go' zones. A mission is regarded as a failure if the vehicle enters any restricted zone at any time or does not reach a neighborhood of the destination. The DDPG agent is trained through trial and error in a simulated environment, learning a direct mapping from its current state (position and heading) to a series of feasible actions that guide the agent to safely reach its destination. The reward function has three parts: (a) an attractive field centered at the final destination, (b) repulsive fields centered at the origins of the circular obstacles, and (c) a penalty on control energy consumption (the magnitude of heading change) that indirectly favors straight paths. DDPG trains the agent using these incentives to find the largest possible set of starting points from which a safe path to the destination is guaranteed. This provides critical information for mission planning, showing beforehand whether a task is achievable from a given starting point and assisting pre-mission planning activities. The approach is validated in simulation. A comparison between the DDPG method and a traditional optimal control (pseudo-spectral) method is carried out. The results show that the learning-based agent produces effective paths while being significantly faster, making it a better fit for real-time applications.
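The three-part reward described in the abstract above can be illustrated with a short sketch. The functional forms, the weights (k_att, k_rep, k_ctrl), and the influence radius are assumptions made for illustration; the abstract names only the three ingredients, not their exact expressions.

```python
import math

def shaped_reward(pos, goal, obstacles, heading_change,
                  k_att=1.0, k_rep=5.0, k_ctrl=0.1, influence=2.0):
    """Hypothetical reward: attraction to the goal, repulsion from zones, control penalty.

    pos, goal: (x, y) tuples; obstacles: list of (cx, cy, radius) no-go zones;
    heading_change: commanded change in heading for this step (radians).
    """
    # (a) attractive field: less negative as the vehicle nears the destination
    attract = -k_att * math.dist(pos, goal)

    # (b) repulsive fields centered at each circular no-go zone
    repel = 0.0
    for cx, cy, radius in obstacles:
        clearance = math.dist(pos, (cx, cy)) - radius  # distance to zone boundary
        if clearance < influence:
            # penalty ramps linearly from 0 at the influence radius to k_rep at the boundary
            repel -= k_rep * (influence - max(clearance, 0.0)) / influence

    # (c) control-energy penalty on the magnitude of the heading change
    control = -k_ctrl * abs(heading_change)

    return attract + repel + control

# Same position, straight versus turning: the turn is penalized.
r_straight = shaped_reward((0.0, 0.0), (3.0, 4.0), [], 0.0)
r_turning = shaped_reward((0.0, 0.0), (3.0, 4.0), [], 0.5)
```

In a DDPG training loop this value would be returned by the environment at each step, alongside a terminal failure signal when the vehicle enters a zone.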

2606.08136 2026-06-09 cs.RO 新提交

Learning Predictive Control with Deep Koopman Operators for Autonomous Vehicle Motion Planning

基于深度Koopman算子的学习预测控制在自动驾驶车辆运动规划中的应用

Xinglong Zhang, Yongqian Xiao, Haotian Cao, Xing Zhou, Xin Yin, Xin Xu

发表机构 * National Natural Science Foundation of China(国家自然科学基金委员会) Science and Technology Innovation Program of Hunan Province(湖南省科技创新计划)

AI总结 提出一种结合深度Koopman算子的学习预测控制框架,通过提升非线性动力学到线性可观测空间,并利用滚动时域演员-评论家学习生成闭环状态反馈策略,在非凸约束下实现高效、安全的实时运动规划。

AI中文摘要

模型预测控制(MPC)广泛应用于自动驾驶车辆(AV)的运动规划,但其实时应用通常受限于对精确模型的需求以及在动态道路环境中在线求解非线性、非凸优化问题。演员-评论家强化学习为在线策略生成提供了一种有前景的替代方案,但其策略学习过程往往缺乏显式的控制理论结构。本文提出了一种基于深度Koopman算子的学习预测控制(LPC)框架,用于在非凸约束下实现高效的实时运动规划。为了处理非线性和不确定的车辆动力学,使用基于深度Koopman的预测器以数据驱动的方式将系统提升到可解释的线性可观测空间。与计算开环控制序列的传统MPC不同,所提出的LPC框架通过滚动时域演员-评论家学习在每个预测区间内生成闭环状态反馈策略。为了确保在非凸环境约束下的安全性,LPC构建了障碍物的凸局部替代表示并定义了相应的势场函数。这些函数及其梯度直接嵌入到演员-评论家结构中,从而实现高效且具有安全意识的策略学习。在红旗EHS3平台上进行的大量仿真和实际实验表明,与CBF-MPC和LMPCC等基准方法相比,该方法在多种避障场景中在安全性、计算效率和驾驶舒适性方面均表现出优越性能。

英文摘要

Model Predictive Control (MPC) is widely used for autonomous-vehicle (AV) motion planning, but its real-time applicability is often limited by the need for accurate models and online solution of nonlinear, nonconvex optimization problems in dynamic road environments. Actor-critic reinforcement learning offers a promising alternative for online policy generation, yet its policy-learning process often lacks explicit control-theoretic structure. This article proposes a learning predictive control (LPC) framework with deep Koopman operators for efficient real-time motion planning under nonconvex constraints. To address nonlinear and uncertain vehicle dynamics, a deep-Koopman-based predictor is used to lift the system into an interpretable linear observable space in a data-driven manner. Unlike traditional MPC, which computes open-loop control sequences, the proposed LPC framework yields a closed-loop state-feedback policy within each prediction interval through receding-horizon actor-critic learning. To ensure safety under nonconvex environmental constraints, LPC constructs convex local surrogate representations of obstacles and defines corresponding potential-field functions. These functions and their gradients are directly embedded into the actor-critic structure, enabling efficient, safety-aware policy learning. Extensive simulations and real-world experiments on the HongQi-EHS3 platform demonstrate favorable performance in diverse obstacle-avoidance scenarios in terms of safety, computational efficiency, and driving comfort, compared with benchmark methods such as CBF-MPC and LMPCC.

2606.08186 2026-06-09 cs.RO 新提交

Propeller-Assisted Robust 3D Hopping Robot with Hierarchical Force Allocation

螺旋桨辅助的鲁棒三维跳跃机器人及分层力分配

Chuhan Zhang, Hongbo Zhang, Yanlin Chen, Yunxi Tang, Yun-Hui Liu, Mingyi Liu, Xiangyu Chu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Guangdong Technion–Israel Institute of Technology(广东以色列理工学院) Technion–Israel Institute of Technology(以色列理工学院) Multiscale Medical Robotics Centre(多尺度医疗机器人中心)

AI总结 提出一种螺旋桨辅助的单腿三维跳跃机器人Pro-OMEGA2,通过分层力分配框架协调腿与三旋翼的力,实现鲁棒跳跃和扰动恢复。

Comments 8 pages, 9 figures, 1 table. Accepted to the 2026 IEEE International Conference on Automation Science and Engineering (CASE)

AI中文摘要

单腿跳跃机器人概念简单但高度动态且天生不稳定。实现鲁棒的三维跳跃仍然困难,因为地面反作用力仅在短暂的支撑阶段可用,而机器人在飞行阶段欠驱动。一个未解决的关键问题是如何提高飞行阶段的控制能力。螺旋桨辅助提供了一种有希望的解决方案,但需要仔细协调腿产生的接触力和螺旋桨推力在支撑和飞行阶段的配合。本文介绍了Pro-OMEGA2,一种螺旋桨辅助的三维单腿跳跃机器人,具有主动3-RSR并联腿和安装在躯干上的三旋翼用于辅助姿态调节。为了解决力协调挑战,我们提出了一种基于单刚体模型的分层力分配框架。腿产生主要的支撑接触力,而三旋翼提供辅助姿态调节,补偿支撑阶段的残余姿态力矩并在飞行阶段维持姿态。室内和室外场景的真实机器人实验展示了持续的三维跳跃,包括地形过渡和脉冲推挤恢复,验证了在未建模接触和外部扰动下的鲁棒性。

英文摘要

Monopedal hopping robots are conceptually simple but highly dynamic and inherently unstable. Achieving robust 3D hopping is still difficult because ground reaction forces are available only during the short stance phase, while the robot is underactuated in flight. A key unresolved issue is how to improve flight-phase control authority. Propeller assistance provides a promising solution, but it requires careful coordination of leg-generated contact forces and propeller thrusts across stance and flight. This paper presents Pro-OMEGA2, a propeller-assisted 3D monopedal hopping robot with an active 3-RSR parallel leg and a trunk-mounted tri-rotor for auxiliary attitude regulation. To address the force coordination challenge, we propose a Hierarchical Force Allocation (HFA) framework based on a single rigid body (SRB) model. The leg generates the main stance contact wrench, while the tri-rotor provides auxiliary attitude regulation, compensating the residual attitude moment in stance and maintaining attitude during flight. Real-robot experiments in indoor and outdoor scenarios demonstrate sustained 3D hopping, including terrain transitions and impulsive push recovery, validating robustness under unmodeled contact and external disturbances.

2606.08253 2026-06-09 cs.RO cs.LG 新提交

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

注意你的步伐:一种用于精确人形机器人落脚点跟踪的通用学习框架

Alessandro Montenegro, Shihao Li, Puze Liu, Alberto Maria Metelli, Jan Peters

发表机构 * Politecnico di Milano(米兰理工大学) TU Darmstadt(达姆施塔特工业大学) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) Italian Institute of Technology(意大利技术研究院) University of Pisa(比萨大学)

AI总结 提出一种轻量级通用3D落脚点跟踪策略学习框架,通过目标采样器动态提供步态支持,结合新目标表示克服真实世界噪声,实现与多种高层规划器无缝集成的精确自然运动。

Comments Accepted to RSS 2026

AI中文摘要

使人形机器人在复杂动态环境中运行仍然是一个关键挑战,其根本受限于稳健、安全且精确导航的能力。虽然基于速度指令策略的强化学习在人形机器人运动方面取得了显著的鲁棒性,但这种方法缺乏对落脚点位置的显式控制,导致不安全行为(如踩到人脚)或不精确导航,阻碍后续操作任务。相反,显式落脚点跟踪策略通过直接以目标足部姿态作为指令提供了一种有前景的替代方案。然而,现有方法通常受限于不切实际的状态假设(影响实际部署),或者作为分阶段流程的一部分而受限于特定下游任务。在这项工作中,我们引入了一种新颖的轻量级框架,用于训练通用的3D落脚点跟踪策略。通过目标采样器动态提供步态支持,该方法使学习到的策略对特定地形不敏感。我们的新目标表示有效缓解了现实世界中出现的挑战,例如噪声和不准确的姿态估计以及足部接触估计。为直接迁移到现实世界而设计,我们的策略作为一个独立的低级控制器,可以与各种高级落脚点生成器无缝配对。通过在仿真和现实世界中的大量实验,我们证明了框架的有效性。通过将我们的策略与不同的上游规划器耦合,我们在具有挑战性的环境中实现了自然且精确的运动,为复杂环境中的运动-操作任务铺平了道路。

英文摘要

Enabling humanoid robots to operate in complex, dynamic environments remains a critical challenge, fundamentally limited by the ability to navigate robustly, safely, and accurately. While reinforcement learning with velocity-commanded policies has achieved remarkable robustness in humanoid locomotion, this approach lacks explicit control of the foothold placement, leading to unsafe behavior, such as stepping onto human feet, or imprecise navigation, hindering the following manipulation task. Conversely, explicit foothold-tracking policies offer a promising alternative by directly being commanded with target foot poses. However, existing approaches are often limited by unrealistic state assumptions, compromising real-world deployment, or they are part of staged pipelines, making them tied to specific downstream tasks. In this work, we introduce a novel, lightweight framework for training general-purpose 3D foothold-tracking policies. By dynamically providing footstep support through a goal sampler, this method enables the learned policy to be agnostic to specific terrains. Our new target representation effectively mitigates challenges arising in the real world, such as noisy and inaccurate pose estimation and foot contact estimation. Designed for direct real-world transfer, our policy acts as a standalone low-level controller that can be seamlessly paired with various high-level foothold generators. We demonstrate the effectiveness of our framework through extensive experiments in simulation and in the real world. By coupling our policy with different upstream planners, we achieve natural and accurate locomotion in challenging settings, paving the way for loco-manipulation tasks in complex environments.

2606.08725 2026-06-09 cs.RO cs.SY eess.SY 新提交

Real-Time and Accurate Collision-Free Teleoperation via Differentiable Constraint-Based Trajectory Planning

基于可微约束轨迹规划的实时精确无碰撞遥操作

Max Grobbel, Tristan Schneider, Daniel Flögel, Sören Hohmann

发表机构 * FZI - Forschungszentrum Informatik(FZI 信息技术研究中心) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 针对遥操作中自碰撞与环境碰撞问题,提出基于对偶可微碰撞约束的轨迹规划方法,采用胶囊体与多面体建模,实现更低计算时间和更精确障碍物建模,保证平滑无碰撞遥操作。

Comments 8 pages, 4 figures, accepted at ICRA2026

AI中文摘要

在遥操作中,人类操作员通常仅控制末端执行器的姿态,由于关节和连杆未单独控制,常导致机械臂自碰撞及与环境障碍物的碰撞。缓解此问题的常见策略是利用基于最优控制的轨迹规划增强操作员输入。由于基于导数的求解器需要可微约束,现有方法要么用球体近似机器人和障碍物,降低几何精度,要么近似导数,降低收敛性并增加计算时间。我们通过将一种基于凸优化对偶性的可微碰撞避免约束的最新公式应用于遥操作场景,解决了这些局限性。机器人用胶囊体近似,环境用多面体近似。我们在不同障碍物数量的仿真中将所得轨迹规划方法与最先进技术进行比较,并在真实遥操作测试中在UR5e机械臂上进行评估。结果表明,我们的方法在实现更精确障碍物建模的同时,计算时间更低,从而实现更平滑、无碰撞的末端执行器遥操作。

英文摘要

In teleoperation, the human operator typically controls only the end-effector pose, which often leads to self-collisions of the manipulator and collisions with environmental obstacles, since joints and links are not controlled individually. A common strategy to mitigate this issue is to enhance the operator's input using optimal-control-based trajectory planning. As derivative-based solvers require differentiable constraints, existing approaches either approximate robots and obstacles with spheres, reducing geometric accuracy, or approximate derivatives, degrading convergence and increasing computation times. We address these limitations by adapting a recent formulation of differentiable collision-avoidance constraints, based on duality in convex optimization, to the teleoperation setting. The robot is approximated with capsules and the environment with polytopes. We compare the resulting trajectory planning method against state-of-the-art techniques in simulation with varying numbers of obstacles and evaluate it on a UR5e manipulator in a real-world teleoperation test. Results show that our approach achieves lower computation times while enabling more accurate obstacle modeling, leading to smoother and collision-free end-effector teleoperation.

2606.08922 2026-06-09 cs.RO 新提交

PTDL: Multi-Terrain Fall Recovery via Phase-Terrain Decoupled Learning

PTDL:多地形摔倒恢复的相位-地形解耦学习

Xiaoyu Xu, Zhiming Chen, Yuenan Zhao, Ran Song, Wei Zhang

发表机构 * School of Control Science and Engineering, Shandong University(山东大学控制科学与工程学院) Key Laboratory of Machine Intelligence and System Control, Ministry of Education(教育部机器智能与系统控制重点实验室)

AI总结 提出相位-地形解耦学习(PTDL),通过解耦训练监督的相位和地形轴,实现单一本体感知策略下的多地形摔倒恢复与行走过渡。

AI中文摘要

人形机器人可能在非结构化环境中的斜坡、砾石和不平地面上摔倒。我们目标是集成摔倒恢复与运动:仅使用本体感知从摔倒状态重建平衡,并在摔倒地点恢复速度指令行走。先前方法通常止于准静态起身,忽略摔倒后地面接触阶段,或者在混合地形上训练时未分离恢复与运动阶段或每表面约束,导致跨表面退化为单一妥协起身。我们提出相位-地形解耦学习(PTDL),在部署单一本体感知策略的同时,沿相位和地形轴解耦训练监督。在相位轴上,投影重力门控双运动先验判别器和探测-行走过渡链接将摔倒后恢复与指令行走连接。在地形轴上,地形分层恢复塑形在平坦地面、碎石和斜坡上分配表面特定的训练监督;地形标签仅用于训练,不提供给策略观测,从而在部署时实现隐式摔倒后策略选择。我们在29自由度Unitree G1上,在仿真和硬件中验证了PTDL在平坦地面、碎石和最高20度斜坡上的表现,实现了稳定的跨地形恢复、平滑的恢复-运动过渡以及单一部署策略下的差异化摔倒后起身行为。

英文摘要

Humanoid robots can fall on slopes, gravel, and uneven ground in unstructured environments. We target integrated fall recovery and locomotion: rebuilding balance from a fallen state using proprioception alone and resuming velocity-commanded walking at the fall site. Prior methods often stop at quasi-static rise, neglect the post-fall ground-contact phase, or, when trained on mixed terrains without separating recovery and locomotion phases or per-surface constraints, collapse to a single compromise get-up across surfaces. We propose Phase-Terrain Decoupled Learning (PTDL), which decouples training supervision along phase and terrain axes while deploying one proprioceptive policy. On the phase axis, projected-gravity-gated dual motion-prior discriminators and a probe-to-walk transition link post-fall recovery to commanded walking. On the terrain axis, terrain-stratified recovery shaping assigns surface-specific training supervision on flat ground, gravel, and slopes; terrain labels are training-only and withheld from policy observations, enabling implicit post-fall strategy selection at deployment. We validate PTDL on a 29-DoF Unitree G1 across flat ground, gravel, and slopes up to 20 degrees in simulation and on hardware, achieving stable cross-terrain recovery, smooth recovery-to-locomotion transitions, and differentiated post-fall rise behaviors under one deployed policy.

2606.09188 2026-06-09 cs.RO cs.CV 新提交

Trajectory Optimization in Single and Dual-UAV Bearing-Only Target Localization

单无人机和双无人机仅方位目标定位中的轨迹优化

Zhijian Xiao, Huayu Huang, Bin Li, Yang Shang, Banglei Guan

发表机构 * College of Aerospace Science and Engineering, National University of Defense Technology(国防科技大学航天科学与工程学院) Hunan Key Laboratory of Image Measurement and Visual Navigation(湖南省图像测量与视觉导航重点实验室)

AI总结 提出基于Fisher信息矩阵的轨迹优化方法,通过谱加权目标函数和交叉角正弦项改善观测几何,结合改进粒子群算法,显著降低定位误差。

Comments 16 pages, 13 figures and 6 tables. Submitted to Measurement

AI中文摘要

仅方位目标定位是光学测量中的一个基本问题,在无人机技术中有着广泛的应用。有效的轨迹规划可以建立有利的观测几何,从而提高仅方位无人机系统的目标定位精度。本文提出了一种用于无人机在仅方位目标定位场景中的轨迹优化方法。通过利用Fisher信息矩阵,该方法将几何构型和飞行器机动性动态集成到优化框架中。具体而言,我们引入了一个谱加权FIM目标函数,该函数在退化构型附近提供更好的梯度动力学,使规划器能够快速逃离不良观测条件。对于双无人机场景,引入交叉角正弦项,通过改善视线交叉角来优化三角测量几何,从而防止轨迹聚集。此外,我们提出了一种改进的粒子群优化算法,该算法具有运动模型约束和粒子归一化,以确保轨迹的物理可行性并增强与目标函数的兼容性。仿真结果表明,与传统的基于FIM的方法相比,所提出的方法在单无人机场景中将中位定位误差降低了99.21%,在双无人机配置中实现了69.70%的提升,在远距离机动目标的长时间仅方位目标定位中表现出优越的性能。

英文摘要

Bearing-only target localization is a fundamental problem in optical measurement and finds extensive applications in unmanned aerial vehicle (UAV) technology. Effective trajectory planning establishes favorable observation geometries, thereby enhancing the target localization accuracy of bearing-only UAV systems. This paper proposes a trajectory optimization method for UAVs in bearing-only target localization scenarios. By leveraging the Fisher Information Matrix (FIM), the proposed approach dynamically integrates the geometric configuration and vehicle maneuverability into the optimization framework. Specifically, we introduce a spectrally-weighted FIM objective function that provides better gradient dynamics near degenerate configurations, enabling the planner to rapidly escape from poor observation conditions. For dual-UAV scenarios, an intersection angle sine term is introduced to optimize triangulation geometry by improving the sight-line intersection angle, thereby preventing trajectory aggregation. Furthermore, we propose an improved Particle Swarm Optimization (PSO) algorithm with motion model constraints and particle normalization to ensure the physical feasibility of the trajectory and enhance the compatibility with the objective functions. Simulation results demonstrate that the proposed method reduces the median localization error by 99.21% compared to conventional FIM-based approaches in single-UAV scenarios and achieves a 69.70% improvement for dual-UAV configurations, exhibiting superior performance in long-duration bearing-only localization of maneuvering targets at extended ranges.
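As background for the abstract above, the sketch below computes the standard Fisher information matrix for 2D bearing-only measurements and its determinant. It is not the paper's spectrally-weighted objective, which the abstract does not spell out; the function names and the noise level sigma are assumptions. For two observers the determinant equals sin²(Δθ) / (σ⁴ r₁² r₂²), where Δθ is the sight-line intersection angle, which suggests why a sine-of-intersection-angle term rewards good triangulation geometry.

```python
import math

def bearing_fim(observers, target, sigma=0.01):
    """Entries (a, b, d) of the symmetric 2x2 FIM [[a, b], [b, d]].

    observers: list of (x, y) observer positions; target: (x, y);
    sigma: assumed bearing noise standard deviation in radians.
    """
    a = b = d = 0.0
    for ox, oy in observers:
        dx, dy = target[0] - ox, target[1] - oy
        r2 = dx * dx + dy * dy
        # gradient of the bearing atan2(dy, dx) with respect to the target position
        gx, gy = -dy / r2, dx / r2
        a += gx * gx / sigma ** 2
        b += gx * gy / sigma ** 2
        d += gy * gy / sigma ** 2
    return a, b, d

def fim_det(observers, target, sigma=0.01):
    """Determinant of the FIM; zero means the target position is unobservable."""
    a, b, d = bearing_fim(observers, target, sigma)
    return a * d - b * b

# Perpendicular sight lines give a well-conditioned geometry;
# observers collinear with the target give a degenerate one.
good = fim_det([(0.0, 0.0), (10.0, 0.0)], (5.0, 5.0), sigma=1.0)
degenerate = fim_det([(0.0, 0.0), (1.0, 0.0)], (5.0, 0.0), sigma=1.0)
```

A trajectory optimizer would evaluate a quantity like this over candidate waypoints and prefer those that increase the information gained about the target.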

2606.09237 2026-06-09 cs.RO cs.SY eess.SY 新提交

Can we stabilize an inverted pendulum with feedback from a time-of-flight camera?

我们能否利用飞行时间相机的反馈来稳定倒立摆?

Anthony Czubarow, Antonio Terpin, Raffaello D'Andrea

发表机构 * Institute for Dynamic Systems and Control, ETH Zürich(苏黎世联邦理工学院动态系统与控制研究所)

AI总结 本文证明低成本、低分辨率的飞行时间相机能够提供足够反馈,可靠且精确地平衡推车上的倒立摆,挑战了其无法用于精确反馈控制的普遍观点。

AI中文摘要

飞行时间相机在机器人领域广受欢迎,因为它们能直接提供深度信息,同时结构紧凑、成本低廉且对光照条件鲁棒,但其低空间分辨率和深度噪声被广泛认为无法用于精确反馈控制。在本文中,我们展示了一款低成本、低分辨率的飞行时间相机能够提供足够的反馈,以可靠且精确地平衡推车上的倒立摆——这是快速、不稳定动力学的典型基准。

英文摘要

Time-of-flight cameras are popular in robotics for providing direct depth information while being compact, inexpensive, and robust to lighting conditions, but their low spatial resolution and depth noise are widely believed to preclude precise feedback control. In this paper, we show that an inexpensive, low-resolution time-of-flight camera provides sufficient feedback to reliably and precisely balance an inverted pendulum on a cart, a canonical benchmark for fast, unstable dynamics.

2606.09640 2026-06-09 cs.RO 新提交

Physics-Aware Sparse Learning and Selective Online Adaptation for Euler-Lagrange Robot Dynamics

面向欧拉-拉格朗日机器人动力学的物理感知稀疏学习与选择性在线自适应

Rishabh Dev Yadav, Samaksh Ujjawal, Sihao Sun, Spandan Roy, Wei Pan

发表机构 * The University of Manchester(曼彻斯特大学) International Institute of Information Technology Hyderabad(海得拉巴国际信息技术学院) Delft University of Technology(代尔夫特理工大学) Newcastle University(纽卡斯尔大学)

AI总结 提出一种保结构残差学习框架,将模型误差分解为惯性修正、科里奥利项和广义力残差,通过物理约束学习机械部分,并用稀疏历史依赖潜变量模型和贝叶斯线性回归在线自适应扰动敏感部分,提升多机器人平台动力学预测与轨迹跟踪性能。

AI中文摘要

精确的动力学模型对于基于模型的机器人控制至关重要,然而名义上的欧拉-拉格朗日模型在存在负载变化、未建模耦合、摩擦、空气动力学效应和变化操作条件时往往变得不准确。大多数基于学习的校正方法通过引入单个加性残差来提高预测精度,但未能保留欧拉-拉格朗日系统的内部机械结构。这导致模型不保留对称性、正定性或惯性与速度相关项之间的耦合,当嵌入基于模型的控制器时,可能导致物理上不一致的预测和降低的可靠性。我们提出了一种保结构残差学习框架,将模型不匹配分解为惯性修正、相应的诱导科里奥利项和广义力残差。机械部分在物理约束下学习,而扰动敏感部分通过稀疏历史依赖潜变量交互模型表示,并使用贝叶斯线性回归在线自适应。这种分离保留了关键的机械结构,同时将自适应限制在最受变化条件影响的动力学部分。在多个机器人平台(包括移动机器人、空中机器人和机械臂系统)上的实验表明,所提出的方法在耦合和时变动力学下改善了动力学预测和轨迹跟踪。这些结果凸显了将结构化残差建模、紧凑潜变量交互选择和选择性在线自适应相结合对于实际基于模型控制的价值。

英文摘要

Accurate dynamics models are essential for model-based robotic control, yet nominal Euler-Lagrange models often become inaccurate in the presence of payload variation, unmodeled coupling, friction, aerodynamic effects, and changing operating conditions. Most learning-based correction methods improve prediction accuracy by introducing a single additive residual, but do not preserve the internal mechanical structure of Euler-Lagrange systems. This leads to models that do not preserve symmetry, positive-definiteness, or the coupling between inertia and velocity-dependent terms, which can result in physically inconsistent predictions and reduced reliability when embedded in model-based controllers. We propose a structure-preserving residual learning framework that decomposes model mismatch into an inertia correction, the corresponding induced Coriolis term, and a generalized-force residual. The mechanical component is learned under physical constraints, while the disturbance-sensitive component is represented through a sparse history-dependent latent interaction model and adapted online using Bayesian linear regression. This separation preserves key mechanical structure while restricting adaptation to the part of the dynamics most affected by changing conditions. Experiments across multiple robotic platforms, including mobile, aerial, and manipulator systems, show that the proposed method improves dynamics prediction and trajectory tracking under coupled and time-varying dynamics. These results highlight the value of combining structured residual modeling, compact latent interaction selection, and selective online adaptation for real-world model-based control.

2606.09719 2026-06-09 cs.RO 新提交

Safe Polytope-in-Polytope Motion Planning and Control with Control Barrier Functions

基于控制障碍函数的安全多面体在多面体内的运动规划与控制

Alejandro Gonzalez-Garcia, Dries Dirckx, Jan Swevers, Wilm Decré

发表机构 * KU Leuven(鲁汶大学)

AI总结 提出一种安全局部运动规划与控制方法,通过模型预测控制器中的离散时间控制障碍函数约束,保证多面体机器人足迹始终位于连续更新的凸自由空间内,计算时间随障碍物数量增加最多降低91倍。

Comments This work has been submitted to the IEEE for possible publication

AI中文摘要

在狭窄环境中运行的自主移动机器人需要考虑机器人物理足迹的运动规划框架。将几何形状简化为点或圆是保守的,并且丢弃了成功安全通过狭窄通道所需的信息。本文提出了一种安全的局部运动规划与控制方法,保证多面体机器人足迹始终位于连续更新的凸自由空间内。包含条件被表述为模型预测控制器内的一组离散时间控制障碍函数约束。安全约束的数量取决于局部自由空间的复杂性和机器人形状,而不是障碍物的数量。所提出的自由空间公式不需要任何障碍物检测或分割。与基于多面体的避障公式的比较分析证实,随着障碍物数量的增加,计算时间最多减少91倍。该方法在自主水面车辆的仿真中和使用占用网格和LiDAR传感的非完整移动机器人的硬件上得到了验证。实验证明了在机载嵌入式计算机上以10 Hz进行安全的实时运动规划与控制,包括对动态障碍物的反应性避让。

英文摘要

Autonomous mobile robots operating in tight environments require motion planning frameworks that account for the physical footprint of the robot. Simplifying the geometry to a point or a circle is conservative and discards information needed to successfully and safely traverse narrow passages. This work proposes a safe local motion planning and control method that guarantees that a polytopic robot footprint stays inside a continuously updated convex free-space region. The containment condition is formulated as a set of discrete-time control barrier function constraints within a model predictive controller. The number of safety constraints depends on the complexity of the local free-space geometry and the robot shape, instead of the number of obstacles. The proposed free-space formulation does not need any obstacle detection or segmentation. A comparative analysis against a polytope-based obstacle avoidance formulation confirms favorable scaling, up to a 91× reduction in computation time as the number of obstacles increases. The approach is validated in simulation with an autonomous surface vehicle and on hardware with a non-holonomic mobile robot, using both occupancy grids and LiDAR sensing. The experiments demonstrate safe real-time motion planning and control at 10 Hz on an onboard embedded computer, including reactive avoidance of dynamic obstacles.
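The containment condition in the abstract above can be sketched with a common discrete-time control barrier function form, h(x_{k+1}) ≥ (1 − γ) h(x_k). This is an assumption for illustration: the abstract does not give the paper's exact constraint. Here the free space is a convex polytope {p : A p ≤ b}, each barrier is h_ij = b_i − a_i · v_j for face i and footprint vertex j, and all names and the value of gamma are hypothetical. The constraint count is faces × vertices, independent of how many obstacles shaped the free space.

```python
import math

def vertices_world(pose, body_vertices):
    """Transform footprint vertices from the body frame to the world frame."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return [(x + c * vx - s * vy, y + s * vx + c * vy) for vx, vy in body_vertices]

def barrier_values(pose, body_vertices, A, b):
    """h_ij = b_i - a_i . v_j; all non-negative iff the footprint lies in {p : A p <= b}."""
    world = vertices_world(pose, body_vertices)
    return [bi - (ai[0] * vx + ai[1] * vy)
            for ai, bi in zip(A, b)
            for vx, vy in world]

def dcbf_satisfied(pose_now, pose_next, body_vertices, A, b, gamma=0.5):
    """Check h(next) >= (1 - gamma) * h(now) for every face-vertex pair."""
    h_now = barrier_values(pose_now, body_vertices, A, b)
    h_next = barrier_values(pose_next, body_vertices, A, b)
    return all(hn >= (1.0 - gamma) * hc for hc, hn in zip(h_now, h_next))

# Free space: the square [-2, 2] x [-2, 2]; robot: a 1 x 1 square footprint.
A = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
b = [2.0, 2.0, 2.0, 2.0]
square = [(0.5, 0.5), (0.5, -0.5), (-0.5, -0.5), (-0.5, 0.5)]

small_step_ok = dcbf_satisfied((0.0, 0.0, 0.0), (0.5, 0.0, 0.0), square, A, b)
large_step_ok = dcbf_satisfied((0.0, 0.0, 0.0), (1.4, 0.0, 0.0), square, A, b)
```

The large step still keeps the footprint inside the region, yet it is rejected: the barrier condition limits how quickly the robot may approach the boundary. In a model predictive controller these inequalities would be imposed on each predicted step rather than checked after the fact.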

2104.12183 2026-06-09 cs.RO 版本更新

An Interval Branch-and-Bound-Based Inverse Kinemetics Algorithm Towards Global Optimal Redundancy Resolution

基于区间分支定界的逆运动学算法实现全局最优冗余度解析

Yajue Yang, Zeqing Zhang, Yuanqing Wu, Jia Pan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种结合快速数值IK求解器搜索启发式的区间分支定界方法,高效求解机械臂广义逆运动学问题,生成邻域解以提供自运动流形的丰富几何信息,支持最优规划和任意时间求解。

AI中文摘要

机械臂的一般逆运动学(IK)问题,即为期望的末端执行器位姿获取所有可行关节角度的自运动流形(SMM),在机器人建模、规划和控制中起着至关重要的作用。为了高效求解广义IK,本文提出一种基于区间分支定界的方法,并辅以快速数值IK求解器启发的搜索启发式。与基于采样的方法生成的独立解相比,我们的方法生成邻域解块,为SMM的固有几何结构提供更丰富的信息,以支持最优规划和其他应用。它还可以以任意时间方式使用,在有限时间内获得具有次优分辨率的解。通过非冗余和冗余机械臂上的数值实验验证了该方法的性能。

英文摘要

The general inverse kinematics (IK) problem of a manipulator, namely that of acquiring the self-motion manifold (SMM) of all admissible joint angles for a desired end-effector pose, plays a vital role in robotics modeling, planning and control. To efficiently solve the generalized IK, this paper proposes an interval branch-and-bound-based approach, augmented with a search heuristic enabled by a fast numerical IK solver. In comparison to the independent solutions generated by sampling-based methods, our approach generates patches of neighboring solutions that provide richer information about the inherent geometry of the SMM for optimal planning and other applications. It can also be used in an anytime fashion to obtain solutions with sub-optimal resolution for applications with a limited time budget. The performance of our approach is verified by numerical experiments on both non-redundant and redundant manipulators.

2412.01324 2026-06-09 cs.RO 版本更新

Integrated Hierarchical Decision-Making in Inverse Kinematic Planning and Control

集成化分层决策在逆运动学规划与控制中

Kai Pfeiffer, Quan Zhang, Yuqing Chen, Gordon Boateng, Yuquan Wang, Vincent Bonnet, Aberrahmane Kheddar

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出一种高效的非线性规划框架,整合分层决策与全身逆运动学规划控制,解决逆运动学规划中同时选择端效应器位置的问题。

Comments Accepted paper to "Robotics: Science and Systems" (2026)

AI中文摘要

本文提出了一种新颖且高效的非线性规划框架,紧密整合分层决策与全身逆运动学规划与控制。决策在机器人领域诸多方面起核心作用,从稀疏逆运动学控制(使用最少的关节)到同时选择多个候选端效应器位置的逆运动学规划。当前方法常依赖混合整数非线性规划的大量计算,将决策与逆运动学分离(有时用可达性方法近似),或使用高效但不够灵活的ℓ1范数线性稀疏规划方法,未解决底层非线性问题。相比之下,所提出的稀疏分层非线性规划求解器通过利用稀疏分层结构和ℓ0范数(在机器人领域很少使用)实现了高效、灵活和准确。该求解器有效处理了文献中未解决的复杂非线性分层决策问题,例如同时从大量候选中优先选择端效应器位置的逆运动学规划,或同时选择双臂抓取位置的逆运动学控制。

英文摘要

This work presents a novel and efficient nonlinear programming framework that tightly integrates hierarchical decision-making with whole-body inverse kinematic planning and control. Decision-making plays a central role in many aspects of robotics, from sparse inverse kinematic control with a minimal number of joints, to inverse kinematic planning while simultaneously selecting a discrete end-effector location from multiple candidates. Current approaches often rely on heavy computations using mixed-integer nonlinear programming, separate decision-making from inverse kinematics (sometimes approximated by reachability methods), or employ efficient but less versatile ℓ1-norm formulations of linear sparse programming, without addressing the underlying nonlinear problem formulations. In contrast, the proposed sparse hierarchical nonlinear programming solver is efficient, versatile, and accurate by exploiting sparse hierarchical structure and leveraging the ℓ0-norm, which is rarely used in robotics. The solver efficiently tackles complex nonlinear hierarchical decision-making problems previously unaddressed in the literature, such as inverse kinematic planning with simultaneous prioritized selection of end-effector locations from a large set of candidates, or inverse kinematic control with simultaneous selection of bi-manual grasp locations on a randomly rotated box.

2511.05355 2026-06-09 cs.LG cs.RO cs.SY eess.SY 版本更新

SAD-Flower: Flow Matching for Safe, Admissible, and Dynamically Consistent Planning

SAD-Flower:用于安全、可接受和动态一致规划的流匹配

Tzu-Yuan Huang, Armin Lederer, Dai-Jie Wu, Xiaobing Dai, Sihua Zhang, Hsiu-Chin Lin, Shao-Hua Sun, Stefan Sosnowski, Sandra Hirche

发表机构 * TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.(慕尼黑技术大学计算、信息与技术学院) Munich Institute of Robotics(慕尼黑机器人与智能机构研究所) Munich Data Science Institute (MDSI)(慕尼黑数据科学研究所) National University of Singapore(新加坡国立大学) National Taiwan University (NTU)(国立台湾大学) NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)(国立台湾大学人工智能研究中心) University of Utah(犹他大学) Beijing Institute of Technology(北京理工大学) McGill University(麦吉尔大学)

AI总结 提出SAD-Flower框架,通过虚拟控制输入增强流匹配,利用非线性控制理论提供状态约束、动作约束和动态一致性的形式化保证,无需重新训练即可在测试时满足未见约束。

AI中文摘要

流匹配(FM)在数据驱动规划中显示出有希望的结果。然而,它本质上缺乏确保状态和动作约束的形式化保证,而满足这些约束对于各种系统上规划轨迹的安全性和可接受性是一个基本且关键的要求。此外,现有的FM规划器不能确保动态一致性,这可能导致轨迹不可执行。我们通过提出SAD-Flower来解决这些缺陷,这是一个用于生成安全、可接受和动态一致轨迹的新框架。我们的方法依赖于用虚拟控制输入增强流。因此,可以使用非线性控制理论的技术推导出有原则的指导,为状态约束、动作约束和动态一致性提供形式化保证。关键的是,SAD-Flower无需重新训练即可运行,从而在测试时满足未见约束。通过在多个任务上的广泛实验,我们证明SAD-Flower在确保约束满足方面优于各种基于生成模型的基线。

英文摘要

Flow matching (FM) has shown promising results in data-driven planning. However, it inherently lacks formal guarantees that state and action constraints are satisfied, which is a fundamental requirement for the safety and admissibility of planned trajectories on various systems. Moreover, existing FM planners do not ensure dynamical consistency, which can render trajectories unexecutable. We address these shortcomings by proposing SAD-Flower, a novel framework for generating Safe, Admissible, and Dynamically consistent trajectories. Our approach relies on an augmentation of the flow with a virtual control input. Principled guidance can then be derived using techniques from nonlinear control theory, providing formal guarantees for state constraints, action constraints, and dynamic consistency. Crucially, SAD-Flower operates without retraining, enabling test-time satisfaction of unseen constraints. Through extensive experiments across several tasks, we demonstrate that SAD-Flower outperforms various generative-model-based baselines in ensuring constraint satisfaction.

3. Manipulation, Grasping, and Dexterous Hands (21 papers)

2606.07723 2026-06-09 cs.RO 新提交

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

VoLo: 面向开放词汇长时程操控的物理编排器

Siyi Chen, Hugo Hadfield, Alex Zook, Mikaela Angelina Uy, Chan Hee Song, Erwin Coumans, Xuning Yang, Faisal Ladhak, Qing Qu, Stan Birchfield, Jonathan Tremblay, Valts Blukis

发表机构 * NVIDIA(英伟达) University of Michigan(密歇根大学)

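AI summary: Proposes VoLoAgent, a VLM that performs physical orchestration by treating a VLA/WAM as an interruptible tool, enabling open-vocabulary long-horizon manipulation; it substantially outperforms existing systems on the new RoboVoLo benchmark.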

详情
AI中文摘要

开放词汇长时程操控要求机器人能够推理灵活指令和复杂多物体场景,同时自适应地规划、执行、监控并从失败中恢复。我们通过一个闭环智能体来满足这些需求,其中VLM将异构机器人能力编排为可中断的工具。与虚拟AI智能体不同,在物理世界中,决策、动作和工具调用的时机至关重要,因为物理世界不会暂停等待推理。我们将这种设置称为物理编排,并提出VoLoAgent,这是一种VLM,通过将VLA/WAM视为可中断的工具,在推理过程中与视觉模型和动作原语一起引导其运行,从而进行规划、监控和恢复。为了评估这些长时程能力,我们引入了RoboVoLo,这是一个高保真基准测试,用于开放词汇长时程操控,涵盖常识、记忆/状态跟踪、复杂引用和世界知识,并提供任务级成功率和失败模式诊断。实验表明,VoLoAgent在任务成功率和失败诊断方面显著优于单一VLA/VLM或基于工具的系统,并在真实机器人实验中得到了验证。项目页面:https://chicychen.github.io/VoLo/

英文摘要

Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike in virtual AI agents, the timing of decisions, actions and tool calls is important in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration, and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation across common sense, memory/state tracking, complex references, and world knowledge, with both task-level success and failure-mode diagnostics. Experiments show VoLoAgent substantially outperforms single VLA/VLM or tool-based systems, with validation on real-robot experiments. Project page: https://chicychen.github.io/VoLo/

2606.08057 2026-06-09 cs.RO cs.AI 新提交

EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

EgoAERO:无需物体资产,从单个第一人称视频学习灵巧操作

Yichen Niu, Haoran Lv, Xinrui Zhang, Xueyao Wan, Shiyu Gao, Ying Ai, Hui Xu, Yongqi Hu, Hengyi Zhang, Yang Xie, Zhaxizhuoma, Yue Zhao, Zhenshan Bing, Yan Ding, Jianxing Liu

发表机构 * School of Astronautics, Harbin Institute of Technology(哈尔滨工业大学航天学院) Lumos Robotic Suzhou Research Institute, Harbin Institute of Technology(哈尔滨工业大学苏州研究院) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Lab(上海人工智能实验室) Nanjing University(南京大学) Xi’an Jiaotong-Liverpool University(西交利物浦大学) Fudan University(复旦大学)

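AI summary: Proposes EgoAERO, a framework that needs no object assets. From a single egocentric RGB-D video it reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego-motion compensation, and adaptive contact optimization, then converts them into robot policies with two-stage residual learning, enabling single-demonstration dexterous manipulation.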

详情
AI中文摘要

第一人称RGB-D视频提供了人类灵巧操作演示的自然来源,但现有数据难以用于机器人学习,因为物体姿态、几何和接触信息常常缺失或需要预先扫描的物体资产。我们提出EgoAERO,这是第一个无需物体资产、从单个第一人称RGB-D人类演示中学习灵巧操作的框架。EgoAERO通过无资产物体跟踪与重建、自我运动补偿和自适应接触优化重建接触一致的手-物轨迹,然后利用两阶段残差学习将其转化为机器人策略。我们进一步引入在线质量评估机制,并构建EgoDex-R,一个包含430万RGB-D帧的大规模第一人称数据集,用于灵巧策略学习。仿真和真实世界实验表明,EgoAERO能够实现单次演示的灵巧操作,并在HOI4D上达到接近基于CAD重建的下游性能。

英文摘要

Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.

2606.08103 2026-06-09 cs.RO cs.CV 新提交

Revisiting Articulated Parts Perception in Robot Manipulation

重新审视机器人操作中的关节部件感知

Xiaoqian Wu, Yejie Guo, Xiaoyang Chen, Lixin Yang, Cewu Lu, Yong-Lu Li

发表机构 * Shanghai Jiao Tong University(上海交通大学)

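AI summary: Proposes Geometric Primary Structure (GPS), a new representation of articulated parts; combined with a VR device for efficient annotation, it is used to train a generalizable model that reaches a 73% manipulation success rate without in-domain fine-tuning.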

Comments CVPR2026

详情
AI中文摘要

我们被各种带有可移动关节部件的物体所包围,例如盒子、把手、门。对关节部件的准确且可泛化的感知对于增强机器人操作能力至关重要。基于这一需求,近期在关节部件感知方面的工作遵循两个主要方向:一类工作使用基于姿态的表示,这需要高人力成本;与此同时,基于可供性的方法通过点跟踪提取未来物体运动,无需额外人工,但受限于低质量数据。在本文中,我们提出了一种新的关节部件表示——几何主结构(GPS),它是部件几何结构的抽象,以平衡可扩展性和质量。为了实现高效且可扩展的数据收集,GPS与便携式虚拟现实(VR)设备集成,只需一分钟即可标注一个物体序列。这种直接的人工标注比估计的可供性提供了更高质量。利用高效的VR-GPS系统,我们收集了6个部件类别下234个物体的41K帧数据,并训练了一个以单张RGB-D物体图像为输入的通用GPS模型。对于物体操作,我们基于GPS预测部署了一个启发式策略。无需任何领域内微调,我们的方法在9个物体的270个初始状态下达到了73%的成功率。我们的代码、数据和可复用工具可在 https://enlighten0707.github.io/gps 获取。

英文摘要

We are surrounded by various objects with movable, articulated parts, e.g., boxes, handles, and doors. An accurate and generalizable perception of articulated parts is essential to enhance robotic manipulation capabilities. Building on this need, recent efforts in articulated parts perception have followed two main directions: one line of work uses pose-based representations, which incur high manual cost; in parallel, affordance-based methods extract future object motion from point tracking without additional manual effort, but suffer from low-quality data. In this paper, we propose a new representation of articulated parts, Geometric Primary Structure (GPS), an abstraction of the part geometry structure that balances scalability and quality. For efficient and scalable data collection, GPS is integrated with a portable Virtual Reality (VR) device and requires only one minute to annotate one object sequence. This direct human annotation provides higher quality than the estimated affordance. With this efficient VR-GPS system, we collect 41K frames for 234 objects across six part classes, and train a generalizable GPS model with a single RGB-D object image as input. For object manipulation, we deploy a heuristic policy based on GPS prediction. Without any in-domain fine-tuning, our method achieves a 73% success rate, covering 270 initial states for 9 objects. Our code, data and reusable tool are available at https://enlighten0707.github.io/gps.

2606.08152 2026-06-09 cs.RO 新提交

Vision-Guided Dual-Arm Humanoid Robotic Disassembly of End-of-Life 18650 Lithium-ion Battery Packs

视觉引导的双臂人形机器人拆解报废18650锂离子电池组

Yile Chen, Zhihao Liu, Xi Vincent Wang, Lihui Wang

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院)

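AI summary: Proposes a vision-guided dual-arm disassembly pipeline that uses general-purpose parallel-jaw grippers, RGB-D perception, and a pre-trained grasp detector to disassemble a 21-cell 18650 battery pack from an arbitrary initial pose without fixtures, achieving an 80% end-to-end success rate.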

详情
AI中文摘要

来自电动汽车和便携式电子产品的退役锂离子电池组数量不断增长,需要安全、灵活且可选择性到单个电池的自动化拆解。然而,现有的机器人系统大多假设已知电池组姿态、外部夹具或专用工具,使得在姿态不确定性下无夹具的电池级拆解仍未解决。本文提出一种视觉引导的双臂流水线,使用通用平行爪夹持器、RGB-D感知和预训练抓取检测器,从任意初始姿态拆解一个21节18650电池组。姿态不确定性通过一个学习-过滤感知栈和离散的看-移动腕部相机校正来吸收,而双臂之间的任务中支持转移则无需任何外部夹具即可扩展有效工作空间。该流水线实现了8/10的端到端成功率,电池定位均方根误差为2.4毫米,每个电池组的平均循环时间为6.0分钟,为工业电池回收提供了一个实用的、无夹具的基础模块。

英文摘要

The growing volume of retired lithium-ion battery packs from electric vehicles and portable electronics calls for automated disassembly that is safe, flexible, and selective down to the individual cell. Existing robotic systems, however, mostly assume known pack poses, external fixtures, or specialised tooling, leaving fixture-free cell-level disassembly under pose uncertainty largely unsolved. This paper presents a vision-guided dual-arm pipeline that disassembles a 21-cell 18650 pack from an arbitrary initial pose using only general-purpose parallel-jaw grippers, RGB-D sensing, and a pre-trained grasp detector. Pose uncertainty is absorbed by a learn-and-filter perception stack with discrete look-and-move wrist-camera corrections, while a mid-task support transfer between the two arms extends the effective workspace without any external clamp. The pipeline achieves an 8/10 end-to-end success rate, a cell-localisation root-mean-square error of 2.4 mm, and a mean cycle time of 6.0 minutes per pack, providing a practical, fixture-free building block for industrial battery recycling.

2606.08440 2026-06-09 cs.RO cs.CV 新提交

GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

GraspFoM:基于3D基础先验的重建驱动机器人抓取

Dongli Wu, Xiaobao Wei, Hao Wang, Qiaochu Dong, Ying Li, Qingpo Wuwu, Ming Lu, Wufan Zhao

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Peking University(北京大学) The Hong Kong University of Science and Technology(香港科技大学)

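AI summary: Proposes GraspFoM, a framework that uses 3D foundation priors (SAM3D) to build a shared 3D object latent and jointly optimizes reconstruction and grasp pose prediction; an anchor-initialized truncated pose-reasoning diffuser generates continuous, multimodal grasps, yielding high-fidelity reconstruction and state-of-the-art grasping.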

详情
AI中文摘要

机器人抓取是机器人操作中的基本能力。然而,在部分观测下抓取仍然具有挑战性。可靠的抓取依赖于局部接触线索和物体级3D结构。现有的几何感知抓取方法认识到重建的价值,但通常将几何视为中间预测,而不是可重用的抓取物体先验。在本文中,我们提出了GraspFoM,一个统一的框架,利用3D基础先验(SAM3D)为重建和抓取姿态预测构建共享的3D物体潜变量。基于这个共享的物体潜变量,我们引入了一个锚点初始化的截断姿态推理扩散器,它预测连续且多模态的抓取姿态,而不直接依赖离散的抓取候选。我们进一步通过一个重建感知评分器和残差潜变量更新器来研究重建与抓取之间的相互作用。重建提供基于几何的线索,而抓取监督则使共享的物体潜变量向与抓取相关的可操作性区域细化。GraspFoM联合预测抓取姿态并以网格和3DGS形式重建高保真3D资产。综合实验表明,GraspFoM在重建和抓取上都达到了最先进的结果。值得注意的是,这些改进只需要少量额外的可训练参数。组件消融研究也证明了每个组件的贡献。

英文摘要

Robotic grasping is a fundamental capability in robotic manipulation. Yet grasping remains challenging under partial observations. Reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods recognize the value of reconstruction, but they typically treat geometry as an intermediate prediction rather than a reusable object prior for grasping. In this paper, we present GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. Built on this shared object latent, we introduce an anchor-initialized truncated pose-reasoning diffuser that predicts continuous and multimodal grasp poses without directly relying on discrete grasp candidates. We further investigate the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater. Reconstruction provides grounded geometric cues, while grasp supervision refines the shared object latent toward grasp-relevant affordances. GraspFoM jointly predicts grasp poses and reconstructs high-fidelity 3D assets in mesh and 3DGS forms. Comprehensive experiments demonstrate that GraspFoM achieves state-of-the-art results on both reconstruction and grasping. Notably, these improvements require only a small number of additional trainable parameters. Component-wise ablation studies also demonstrate the contribution of each component.

2606.08542 2026-06-09 cs.RO cs.AI cs.CV 新提交

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

当视频误读:面向探索性操作痕迹问答的阅读启发式闭环蒸馏

Haizhou Ge, Yufei Jia, Yue Li, Zhixing Chen, Lu Shi, Lei Han, Guyue Zhou, Ruqi Huang

发表机构 * Tsinghua University(清华大学) DISCOVER Robotics

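AI summary: To address VLMs misreading video traces in exploratory manipulation, proposes Closed-Loop Trace Distillation, in which a per-task coding agent distills a one-line natural-language heuristic prompt so that a frozen VLM can accurately predict the minimal-success action chain; chain accuracy improves by 0.38 to 0.47 on simulated and real-robot tasks.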

Comments 16 pages, 4 figures, 4 tables

详情
AI中文摘要

探索性操作往往将看似失败的尝试转化为下一步操作的关键证据。例如,机器人拉动锁住的抽屉失败,只有在开锁后才成功。失败的拉动揭示了潜在前提条件(抽屉被锁住),该条件决定了最小成功动作链(完成任务的最少动作),此处为[开锁,拉抽屉]。正确读取这一痕迹因此成为恢复该链的前提。我们将此设定形式化为探索性操作痕迹问答(EMT-QA):给定来自探索性痕迹的同步视频和本体感觉,预测在探测所揭示的潜在前提条件下的最小成功动作链。然而,即使最先进的VLM和具身多模态LLM也会误读这一证据:它们无法从原始视频、原始本体感觉或它们的组合中可靠地恢复动作链。我们引入闭环痕迹蒸馏,一种使用每任务编码代理检查带标签训练痕迹并蒸馏出关于痕迹的单行自然语言提示(称为蒸馏阅读启发式DRH)的流水线。推理时,不调用代理,不更新模型权重;冻结的VLM接收原始痕迹加上DRH作为提示条目。在三个模拟器和两个真实机器人任务上,DRH将链准确率比最佳原始模态基线提高0.38至0.47。相同的DRH还作为一次性程序分类器的唯一规范,其性能与提示的VLM相当。

英文摘要

Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination. We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.
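The abstract reports "chain accuracy" without defining it. A minimal sketch, assuming it means exact-match accuracy between the predicted and the reference minimal-success action chain (the function name and this definition are assumptions, not taken from the paper):

```python
def chain_accuracy(predicted, reference):
    """Fraction of traces whose predicted action chain equals the reference chain.

    Assumption: a chain counts as correct only on an exact, ordered match.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one trace is required")
    hits = sum(list(p) == list(r) for p, r in zip(predicted, reference))
    return hits / len(reference)


# Hypothetical example built from the locked-drawer case in the abstract.
reference = [["lock-open", "drawer-pull"], ["lock-open", "drawer-pull"]]
predicted = [["lock-open", "drawer-pull"], ["drawer-pull"]]
score = chain_accuracy(predicted, reference)
```

Under this reading, a prediction that skips the unlock step counts as fully wrong, which matches the abstract's emphasis on recovering the whole chain.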

2606.08548 2026-06-09 cs.RO 新提交

OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation

OASIS:从仿真数据收集到真实世界人形机器人移动操作

Zehao Yu, Jiakun Zheng, Weiji Xie, Jiyuan Shi, Chenyun Zhang, Chenjia Bai, Xuelong Li

发表机构 * Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信人工智能研究院(TeleAI)) Fudan University(复旦大学) East China University of Science and Technology(华东理工大学) Shanghai Jiao Tong University(上海交通大学)

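AI summary: Proposes OASIS, a framework that uses a 3D generative model to reconstruct object assets from real images, collects and augments trajectory data in simulation, and trains a hierarchical visuomotor policy for humanoid loco-manipulation; under zero-shot deployment, its success rates on most tasks exceed those of policies trained on real teleoperation data.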

Comments Project Page: https://oasis-humanoid.github.io/

详情
AI中文摘要

近年来,机器人操作领域的进展主要得益于大规模演示学习。然而,对于人形机器人移动操作任务,现有数据源在轨迹质量和可扩展性之间做出了令人不满意的权衡。真实世界遥操作能提供最高质量的轨迹,但需要专用的物理空间和耗时的场景重置。仿真提供了摆脱这一困境的替代方案:无需任何物理硬件即可大规模生成干净、符合本体形态的数据。在本文中,我们提出了OASIS,一个基于仿真数据的人形机器人移动操作框架。OASIS利用3D生成模型从真实世界图像自动重建逼真的物体资产。基于这些资产,首先在仿真中通过遥操作收集轨迹,然后在后处理阶段在多样化的领域随机化下进行增强。利用得到的仿真数据,我们进一步设计了一种用于人形机器人移动操作的层次化视觉运动策略。在真实人形机器人上的大量实验表明,在零样本部署下,基于我们的仿真数据训练的策略在大多数任务上实现了比基于真实机器人遥操作数据训练的策略更高的成功率,这主要归功于我们的仿真渲染覆盖了广泛的照明和环境变化,而真实机器人数据无法捕捉这些变化。项目页面见https://oasis-humanoid.github.io/。

英文摘要

Recent progress in robot manipulation has been largely driven by learning from large-scale demonstrations. For humanoid robot loco-manipulation tasks, however, existing data sources force an unsatisfying tradeoff between trajectory quality and scalability. Real-world teleoperation provides the highest-quality trajectories but requires dedicated physical space and time-consuming scene resets. Simulation offers a way out of this dilemma: it can produce clean, embodiment-aligned data at scale without any physical hardware. In this paper, we propose OASIS, a simulation-data-driven framework for humanoid loco-manipulation. OASIS automatically reconstructs realistic object assets from real-world images using a 3D generative model. Based on these assets, trajectories are first collected through teleoperation in simulation, and then augmented under diverse domain randomizations in a post-processing stage. With the resulting simulation data, we further design a hierarchical visuomotor policy for humanoid loco-manipulation. Extensive experiments on a real humanoid robot show that, under zero-shot deployment, the policy trained on our simulation data achieves higher success rates on most tasks than one trained on real-robot teleoperation data, owing largely to the broad lighting and environmental variations covered by our simulation rendering, which real-robot data fails to capture. The project page is available at https://oasis-humanoid.github.io/.

2606.08737 2026-06-09 cs.RO 新提交

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

Dream-Tac: 用于接触丰富机器人操作任务的统一触觉世界动作模型

Yunfan Lou, Yifan Ye, Yankai Fu, Jun Cen, Xiaowei Chi, Yaoxu Lyu, Peidong Jia, Sirui Han, Zhihe Lu, Shanghang Zhang

发表机构 * Peking University(北京大学) The Hong Kong University of Science and Technology(香港科技大学) Nanjing University(南京大学) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理国家重点实验室)

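AI summary: Proposes Dream-Tac, a unified tactile world action model that jointly models actions, future visual observations, and tactile dynamics through contact-gated visuotactile fusion and a contact-aware attention bias, improving action accuracy by 31.7% on average across six contact-rich manipulation tasks.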

Comments 16 pages, 13 figures

详情
AI中文摘要

世界动作模型继承了世界模型的预测能力,使得动作生成能够由预期的未来观察引导。然而,它们主要依赖视觉,在接触丰富的操作任务中常常失败,因为关键线索来自物理交互。在本文中,我们提出Dream-Tac,一个统一的触觉世界动作模型,联合建模动作、未来视觉观察和触觉动态。具体来说,Dream-Tac引入了(i)接触门控视觉-触觉融合,以选择性整合触觉信号,以及(ii)接触感知注意力偏置,以更好地调节操作过程中的跨模态交互。为了支持实时部署,我们进一步设计了双级加速策略,在训练期间重新公式化接触感知偏置以保留融合注意力路径,并在推理时引入基于缓存的扩散加速,实现训练速度提升高达2.9倍,推理速度提升1.8倍。在六项接触丰富的操作任务中,Dream-Tac平均动作准确率提升31.7%,证明了统一视觉-触觉世界建模的有效性。代码可在https://github.com/LYFCLOUDFAN/Dream-Tac获取。

英文摘要

World action models inherit the predictive capability of world models, enabling action generation to be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact-rich manipulation, where critical cues arise from physical interaction. In this paper, we propose Dream-Tac, a unified Tactile-World Action Model that jointly models actions, future visual observations, and tactile dynamics. Specifically, Dream-Tac introduces (i) contact-gated visuotactile fusion to selectively integrate tactile signals and (ii) a contact-aware attention bias to better regulate cross-modal interactions during manipulation. To support real-time deployment, we further design a dual-level acceleration strategy, reformulating the contact-aware bias to preserve the fused attention path during training and introducing cache-based diffusion acceleration at inference, achieving up to 2.9× faster training and 1.8× faster inference. Across six contact-rich manipulation tasks, Dream-Tac improves action accuracy by 31.7% on average, demonstrating the effectiveness of unified visuotactile world modeling. Code is available at https://github.com/LYFCLOUDFAN/Dream-Tac.

2606.08828 2026-06-09 cs.RO 新提交

Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video

Video2Sim2Real:从单个人类视频实现全栈自主灵巧技能获取

Yunhai Han, Jianuo Qiu, Linhao Bai, Ziyu Xiao, Zihang Zeng, Yangcen Liu, Zhaodong Yang, Shalin Jain, Wenrui Ma, Jiaqi Fu, Yuqian Zheng, Manisha Natarajan, Muhammad Zubair Irshad, Kenneth Shaw, Matthew Gombolay, Zsolt Kira, Harish Ravichandar

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of Pennsylvania(宾夕法尼亚大学) Toyota Research Institute(丰田研究所) Carnegie Mellon University(卡内基梅隆大学)

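AI summary: Proposes Video2Sim2Real, a framework that reconstructs a digital twin and extracts motion priors from a single human manipulation video, optimizes robot configurations at object-centric keyframes, and combines residual reinforcement learning with collision-aware planning to transfer dexterous skills from simulation to the real world.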

Comments Website: https://video2sim2real.github.io/

详情
AI中文摘要

人类操作视频是机器人学习的便捷直观来源。然而,由于感知误差和具身差距,直接将人类灵巧性迁移到机器人仍然具有挑战性。为此,我们引入Video2Sim2Real,一个从单个人类操作视频中自主获取技能的全栈框架。我们的框架首先使用现成的基础模型重建适用于仿真器的数字孪生,并提取机器人和物体运动先验。与将提取的机器人运动视为整个执行过程中的可靠参考不同,我们的关键思想是恢复并利用从演示技能中获得的最基本监督来源:我们识别以物体为中心的关键帧,利用仿真器中的物体信息优化相应的机器人配置,并将这些配置作为锚点来细化机器人运动,使其最终对环境产生期望的影响。为了弥合剩余的仿真到现实差距,我们引入了一种仿真到现实策略,将对噪声和不完整感知的鲁棒性与手-物交互动力学的变化解耦。具体来说,我们通过模仿学习从噪声的真实世界点云中重新校准机器人配置,并利用残差强化学习进行局部手指级自适应,以确保鲁棒且有效的交互。最后,一个碰撞感知的运动规划模块实现了对新颖物体配置的空间泛化。在多个日常操作任务中,Video2Sim2Real在模拟任务成功率、安全性和轨迹一致性上优于众多基线,并且比现有技术实现了更好的仿真到现实迁移。这些结果展示了从人类视频自主获取灵巧技能的一条有前景的路径。

英文摘要

Human manipulation videos are a convenient and intuitive source for robot learning. However, directly transferring human dexterity to robots remains challenging due to perception errors and the embodiment gap. To address this, we introduce Video2Sim2Real, a full-stack framework for autonomous skill acquisition from a single human manipulation video. Our framework first uses off-the-shelf foundation models to reconstruct a simulator-ready digital twin and extract robot and object motion priors. Rather than treating the extracted robot motion as a reliable reference throughout execution, our key idea is to recover and leverage the most fundamental sources of supervision from the demonstrated skill: we identify object-centric keyframes to optimize the corresponding robot configurations using object information from the simulator, and use these configurations as anchors that refine the robot motion such that it ultimately has the desired impact on the environment. To bridge the remaining sim-to-real gap, we introduce a sim-to-real strategy that decouples robustness to noisy and incomplete perception from variations in hand-object interaction dynamics. Specifically, we learn to recalibrate robot configurations from noisy real-world point clouds via IL, and leverage residual RL to perform local finger-level adaptations to ensure robust and effective interactions. Finally, a collision-aware motion planning module enables spatial generalization to novel object configurations. Across several everyday manipulation tasks, Video2Sim2Real improves simulated task success, safety, and trajectory coherence over numerous baselines, and achieves better sim-to-real transfer than existing techniques. These results demonstrate a promising path toward autonomous dexterous skill acquisition from human videos.

2606.09183 2026-06-09 cs.RO 新提交

Autonomous Obstacle Removal for Excavators through Policy Learning with Particle Simulation

通过粒子模拟的策略学习实现挖掘机自主障碍物移除

Yuki Kadokawa, Sandro M. Alcantara Tacora, Taro Abe, Daisuke Endo, Genki Yamauchi, Takeshi Hashimoto, Takamitsu Matsubara

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学) Public Works Research Institute(土木研究所)

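AI summary: Proposes a particle-simulation-based curriculum learning framework that, with RGB-D perception and parameterized trajectory outputs, lets an excavator autonomously remove ground obstacles under varying burial depths; robustness is validated on a real 12-ton excavator.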

Comments under review

详情
AI中文摘要

从地面自主移除障碍物是一项重要的土方工程任务,但由于挖掘机必须随着土壤-障碍物条件的变化在重复循环中调整其挖掘轨迹,因此难以自动化。学习这种状态依赖行为需要一个能够再现累积土壤-障碍物相互作用的训练环境,包括接触状态、地形变形和障碍物可见性。因此,基于粒子的模拟适用于相关的策略学习。然而,粒子模拟计算成本高,重复的挖掘循环进一步增加了学习成本。我们观察到障碍物的埋藏条件决定了任务难度和模拟成本:更深的埋藏使障碍物移除更难,同时也需要更多粒子进行精确模拟。这一观察启发了一种基于埋藏条件的课程学习策略。我们提出了一种时间高效的模拟到现实策略学习框架,其中策略从RGB-D测量中观察地形和障碍物信息,然后输出参数化的挖掘轨迹;在此过程中,模拟器在可控埋藏条件下再现了真实挖掘机所使用的相同观测-动作接口。课程从浅埋条件开始,逐步增加埋藏深度,同时调整粒子数量,从而同时控制任务难度和模拟成本。实验表明,所提出的框架成功学习了一个有效的障碍物移除策略,而基线方法即使在完整一周的训练后也失败。所提出的课程在三天内实现了有效性能,并成功迁移到一台在开阔地面上操作各种钢制障碍物的真实12吨挖掘机上,从而展示了鲁棒的障碍物移除能力。

英文摘要

Autonomous obstacle removal from the ground is an important earthwork task, but it is difficult to automate because an excavator must adapt its excavation trajectories over repeated cycles as soil-obstacle conditions change. Learning such state-dependent behavior requires a training environment that reproduces accumulated soil-obstacle interactions, including contact states, terrain deformation, and obstacle visibility. Accordingly, particle-based simulation is suitable for the relevant policy learning. However, particle simulation is computationally expensive, and repeated excavation cycles further increase the learning cost. We observe that the burial condition of an obstacle governs both task difficulty and simulation cost: deeper burial makes obstacle removal harder while also requiring more particles for accurate simulation. This observation motivates a burial-conditioned curriculum learning strategy. We propose a time-efficient sim-to-real policy learning framework in which the policy observes terrain and obstacle information from RGB-D measurements and then outputs a parameterized excavation trajectory; the simulator reproduces, under controllable burial conditions, the same observation-action interface used by the real-world excavator. The curriculum begins with shallow burial conditions and progressively increases burial depth while adjusting particle count, thus simultaneously controlling task difficulty and simulation cost. Experiments show that the proposed framework successfully learns an effective obstacle-removal policy, whereas baseline methods fail even after a full week of training. The proposed curriculum achieves effective performance within three days and transfers successfully to a real 12-ton excavator operating on open ground with various steel obstacles, thus demonstrating robust obstacle removal.
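The burial-conditioned curriculum described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the linear ramp, the success-rate threshold for advancing, the `Stage` fields, and every numeric value are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    burial_depth_m: float   # hypothetical unit: metres of soil above the obstacle
    particle_count: int     # particles used by the simulator at this stage


def build_curriculum(n_stages, depth_min, depth_max, particles_min, particles_max):
    """Ramp burial depth and particle count together, from shallow/cheap to deep/expensive.

    Assumption: both quantities grow linearly with the stage index.
    """
    if n_stages < 2:
        raise ValueError("need at least two stages")
    stages = []
    for i in range(n_stages):
        frac = i / (n_stages - 1)
        depth = depth_min + frac * (depth_max - depth_min)
        particles = round(particles_min + frac * (particles_max - particles_min))
        stages.append(Stage(depth, particles))
    return stages


def next_stage(index, success_rate, n_stages, threshold=0.8):
    """Advance to the next stage once recent success clears the (assumed) threshold."""
    if success_rate >= threshold and index < n_stages - 1:
        return index + 1
    return index


# Hypothetical schedule: three stages from surface-level to 0.4 m burial.
stages = build_curriculum(3, 0.0, 0.4, 10_000, 50_000)
```

Tying the particle count to the stage keeps early training cheap, which is the cost argument the abstract makes for starting from shallow burial.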

2606.09314 2026-06-09 cs.RO 新提交

KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation

KPGrasp: 可扩展的关键点流匹配用于灵巧抓取生成

Yuansen Huang, Jiayi Chen, Haoran Liu, Yubin Ke, Bing Han, Jiangran Lyu, Mi Yan, Li Yi, He Wang

发表机构 * Peking University(北京大学) Galbot Xi’an Jiaotong University(西安交通大学) Tsinghua University(清华大学)

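AI summary: Proposes KPGrasp, a framework that learns dexterous grasp priors from large-scale data using an all-Euclidean hand-keypoint parameterization and a Transformer flow model, without contact losses or test-time optimization, achieving high success rates and low penetration depth in simulation and real-world settings.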

Comments 14 pages, 7 figures, 6 tables

详情
AI中文摘要

对于基于学习的方法而言,生成高质量的灵巧抓取仍然具有挑战性,这些方法通常依赖于精心调整的接触损失或昂贵的基于接触的测试时优化。我们提出了KPGrasp,一个流匹配框架,从大规模数据中学习灵巧抓取先验,而不是依赖接触损失或基于接触的测试时优化。KPGrasp将全欧几里得3D手部关键点参数化与一个简单但可扩展的Transformer流模型相结合。该参数化避免了传统混合SE(3)姿态和关节角度输出空间的缺点,在与物体点云相同的坐标系中表达抓取,从而实现了原生空间推理;Transformer流模型仅使用标准流匹配损失进行训练,并随着数据、模型容量和批大小有效扩展。实验表明,在两个模拟基准上达到了最先进的性能。在Dexonomy基准上,它达到了76.3%的抓取成功率,比最强的直接可比基线提高了47.4%,同时将穿透深度减少到2.4毫米。同一模型在DexGrasp Anything基准上也无需微调即可达到最佳平均性能。对于批量推理,KPGrasp每次抓取仅需0.032秒。最后,在20个不同物体上的真实世界实验表明,该流水线可以在真实环境中部署。

英文摘要

Generating high-quality dexterous grasps remains challenging for learning-based methods, which often depend on carefully tuned contact losses or costly contact-based test-time refinement. We present KPGrasp, a flow-matching framework that learns dexterous grasp priors from large-scale data rather than relying on contact losses or contact-based test-time refinement. KPGrasp couples an all-Euclidean 3D hand-keypoint parameterization with a simple yet scalable Transformer flow model. The parameterization avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space, expresses grasps in the same frame as the object point cloud, and thus enables native spatial reasoning; the Transformer flow model is trained with only the standard flow-matching loss and scales effectively with data, model capacity, and batch size. Experiments demonstrate state-of-the-art performance on two simulation benchmarks. On the Dexonomy benchmark, it reaches a 76.3% grasp success rate, improving over the strongest directly comparable baseline by 47.4% while reducing penetration depth to 2.4 mm. The same model also achieves the best average performance on the DexGrasp Anything benchmark without fine-tuning. For batched inference, KPGrasp requires only 0.032 s per grasp. Finally, real-world experiments on 20 diverse objects demonstrate that the pipeline can be deployed in a real-world setup.
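For readers unfamiliar with the "standard flow-matching loss" mentioned above, here is a minimal pure-Python sketch of the common linear-path form: sample noise x0 and a time t, interpolate toward the data sample x1, and regress the predicted velocity onto x1 - x0. The flat-list keypoint encoding, the function names, and the choice of a straight-line path are assumptions for illustration; the abstract does not specify the exact variant KPGrasp uses.

```python
import random


def flow_matching_loss(velocity_fn, targets, rng):
    """Mean squared error between predicted and straight-path velocities.

    targets: list of samples, each a flat list of floats
             (e.g. hypothetical hand keypoints flattened to x, y, z values).
    velocity_fn(x_t, t): returns a predicted velocity, same length as x_t.
    """
    total, count = 0.0, 0
    for x1 in targets:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # noise sample
        t = rng.random()                                        # time in [0, 1)
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
        target_v = [b - a for a, b in zip(x0, x1)]              # constant path velocity
        pred_v = velocity_fn(x_t, t)
        total += sum((p - v) ** 2 for p, v in zip(pred_v, target_v))
        count += len(x1)
    return total / count


# With all-zero targets, the exact velocity field is known in closed form,
# so an "oracle" predictor should drive the loss to (numerically) zero.
zero_targets = [[0.0] * 6 for _ in range(8)]
oracle = lambda x_t, t: [-v / (1.0 - t) for v in x_t]
oracle_loss = flow_matching_loss(oracle, zero_targets, random.Random(0))
```

In practice `velocity_fn` would be a trained network conditioned on the object point cloud; the closed-form oracle here exists only to sanity-check the loss.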

2606.09615 2026-06-09 cs.RO cs.CV 新提交

DexPIE: Stable Dexterous Policy Improvement from Real-World Experience

DexPIE:基于真实世界经验的稳定灵巧策略改进

Ruizhe Liao, Wenrui Chen, Liangji Zeng, Haoran Lin, Fan Yang, Kailun Yang, Yaonan Wang

发表机构 * Hunan University(湖南大学)

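AI summary: Proposes DexPIE, a post-training framework that combines a dexterous-hand-adapted intervention system, multi-stage DAgger-style data collection, asynchronous inference in a relative action space, and conditioning on a continuous optimality indicator, improving success rate by 37% on three real-world dexterous manipulation tasks.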

Comments Project website: https://siiuuuuuu.github.io/DexPIE

详情
AI中文摘要

灵巧操作因其高维动作空间和复杂的接触动力学,给模仿学习带来了巨大挑战。纯粹从演示中训练的策略在部署时常常遭受复合误差,并且需要大量专家数据才能达到可靠性能。为了超越演示数据的局限性,本文提出DexPIE,一个通过真实世界部署收集的经验来改进灵巧策略的后训练框架。首先,DexPIE通过灵巧手适配的干预系统和跨初始与中间任务阶段的多阶段DAgger式数据收集,实现了有效的探索覆盖,为准确的策略评估提供了可靠的监督。为了减少后训练 rollout 与演示数据之间的时间噪声,我们引入了相对动作空间中的异步推理,这能更好地将 rollout 数据与演示行为对齐,并允许评论家学习由更一致的基础策略诱导的值函数。最后,DexPIE通过对连续最优性指标进行条件化来改进策略,使策略能够以更细粒度的方式利用数据质量。在三个具有挑战性的真实世界灵巧操作任务中,DexPIE相比基于演示的参考策略实现了37%的成功率提升,优于所有基线方法,并展现出更强的鲁棒性。源代码和数据集将公开提供。

英文摘要

Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to achieve reliable performance. To move beyond the limitations of demonstration data, in this work, we propose DexPIE, a post-training framework for dexterous policy improvement from experience collected through real-world deployment. First, DexPIE enables effective exploration coverage through a dexterous-hand-adapted intervention system and multi-stage DAgger-style data collection across initial and intermediate task stages, providing reliable supervision for accurate policy evaluation. To reduce temporal noise between post-training rollouts and demonstration data, we introduce asynchronous inference in the relative action space, which better aligns rollout data with demonstrated behavior and allows the critic to learn a value function induced by a more consistent underlying policy. Finally, DexPIE improves the policy through conditioning on a continuous optimality indicator, allowing the policy to leverage the quality of data in a more fine-grained manner. Across three challenging real-world dexterous manipulation tasks, DexPIE achieves a 37% improvement in success rate over the demonstration-based reference policy, outperforming all baseline methods and demonstrating stronger robustness. The source code and dataset will be made publicly available.

2606.09798 2026-06-09 cs.RO 新提交

SynManDex: Synthesizing Human-like Dexterous Grasps from Synthetic Human Pre-Grasps

SynManDex: 从合成人类预抓取中合成类人灵巧抓取

Yanming Shao, Zanxin Chen, Wenwei Lin, Mingjie Zhou, Tianxing Chen, Xiaokang Yang, Yichen Chi, Yao Mu

发表机构 * Shanghai AI Lab(上海人工智能实验室) Shanghai Jiaotong University(上海交通大学) Shenzhen University(深圳大学) Fudan University(复旦大学) University of Hong Kong(香港大学) ZTE Corporation(中兴通讯股份有限公司)

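AI summary: Proposes the SynManDex pipeline, which uses generated human pre-grasps as proposals and achieves force-closure contacts through robot-native optimization to generate human-like dexterous grasps, attaining high success rates and human-likeness in simulation and on real robots.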

详情
AI中文摘要

人类手-物交互编码了功能意图,但直接迁移到机器人手上常因形态、接触和可达性约束而失败。我们提出SynManDex,一个合成流水线,使用生成的人类预抓取作为可负担性感知的提议,并通过机器人原生优化解决最终接触。SynManDex采样物体条件化的数字人类预抓取,将其重定向到灵巧机器人手姿态,优化目标实体上的力闭合接触,并接受通过每一步检查的轨迹。所得关键帧支持抓取-举起演示以及各种抓取操作任务,如倒茶、拍照和吹笛子,这些任务通过VLM代理设计。因此,SynManDex结合了高抓取质量(86.4%抓取稳定性)和4.67/5的类人性(93.4%)。在仿真中达到80.7%的成功率,在应用于36自由度双臂灵巧机器人平台时,真实机器人成功率为25/30(83.3%)。

英文摘要

Human hand-object interactions encode functional intent, but direct transfer to robotic hands often fails under morphology, contact, and reachability constraints. We present SynManDex, a synthetic pipeline that uses generated human pre-grasps as affordance-aware proposals and resolves the final contacts with robot-native optimization. SynManDex samples object-conditioned digital human pre-grasps, retargets them to dexterous robotic hand poses, optimizes force-closure contacts on the target embodiment, and admits only trajectories that pass the checks at every step. The resulting keyframes support both grasp-and-lift demonstrations and various prehensile manipulation tasks such as tea pouring, photo taking, and flute playing, designed via VLM agents. As a result, SynManDex combines high grasp quality (86.4% grasp stability) with 4.67/5 human-likeness (93.4%). It achieves an 80.7% success rate in simulation and 25/30 (83.3%) real-robot successes when applied to a 36-DOF bimanual dexterous robotic platform.

2510.01661 2026-06-09 cs.RO 版本更新

SymSkill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation

SymSkill:符号与技能共发明用于数据高效且反应性强的长周期操作

Yifei Simon Shao, Yuchen Zheng, Sunan Sun, Pratik Chaudhari, Vijay Kumar, Nadia Figueroa

发表机构 * GRASP Laboratory, University of Pennsylvania(GRASP实验室,宾夕法尼亚大学)

AI总结 SymSkill通过联合学习谓词、运算符和技能,实现了数据高效且反应性强的长周期操作,结合了组合泛化与实时恢复能力。

Comments ICRA 2026 Best Conference Paper Award; ICRA 2026 Best Paper Award on Planning and Control; CoRL 2025 Best Paper Award on Learning Effective Abstractions for Planning (LEAP) Workshop (https://symskill.github.io/)

详情
AI中文摘要

动态环境中多步骤操作仍具挑战性。模仿学习(IL)反应性强但缺乏组合泛化能力,因为单一策略无法在场景变化时决定复用哪个技能。经典任务与运动规划(TAMP)提供组合性,但其高规划延迟阻碍了实时故障恢复。我们引入SymSkill,一个统一框架,从无标签、未分段的演示中联合学习谓词、运算符和技能,结合组合泛化与实时恢复。离线时,SymSkill直接从演示中学习符号抽象和目标导向技能。在线时,给定由学习到的谓词构成的合取式,它使用符号规划器组合和重新排列技能以实现符号目标,同时在运动和符号层面实时恢复故障。结合柔顺控制器,SymSkill在人类和环境干扰下支持安全执行。在RoboCasa模拟中,SymSkill执行12个单步骤任务,成功率达85%,并能将它们组合成多步骤计划而无需额外数据。在真实Franka机器人上,它从5分钟的玩耍数据中学习,并根据目标规范执行12步任务。代码和额外分析可在 https://symskill.github.io/ 获取。

英文摘要

Multi-step manipulation in dynamic environments remains challenging. Imitation learning (IL) is reactive but lacks compositional generalization, since monolithic policies do not decide which skill to reuse when scenes change. Classical task-and-motion planning (TAMP) offers compositionality, but its high planning latency prevents real-time failure recovery. We introduce SymSkill, a unified framework that jointly learns predicates, operators, and skills from unlabeled, unsegmented demonstrations, combining compositional generalization with real-time recovery. Offline, SymSkill learns symbolic abstractions and goal-oriented skills directly from demonstrations. Online, given a conjunction of learned predicates, it uses a symbolic planner to compose and reorder skills to achieve symbolic goals while recovering from failures at both the motion and symbolic levels in real time. Coupled with a compliant controller, SymSkill supports safe execution under human and environmental disturbances. In RoboCasa simulation, SymSkill executes 12 single-step tasks with 85% success and composes them into multi-step plans without additional data. On a real Franka robot, it learns from 5 minutes of play data and performs 12-step tasks from goal specifications. Code and additional analysis are available at https://symskill.github.io/ .
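The online step described above (composing skills with a symbolic planner and replanning after a failure) can be illustrated with a toy STRIPS-style search. This is only a minimal sketch under stated assumptions: the predicates, operator names, and breadth-first search are hypothetical stand-ins, not SymSkill's learned abstractions or its actual planner.

```python
from collections import deque

def plan(init, goal, operators):
    """Breadth-first search over symbolic states.

    operators: dict name -> (preconditions, add_effects, delete_effects),
    each a set of predicate strings (all names here are hypothetical).
    Returns the shortest list of operator names reaching the goal, or None.
    """
    start, goal = frozenset(init), frozenset(goal)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal predicate holds
            return steps
        for name, (pre, add, delete) in operators.items():
            if pre <= state:  # operator is applicable
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None

# Hypothetical kitchen operators, for illustration only.
OPS = {
    "open_drawer": ({"drawer_closed"}, {"drawer_open"}, {"drawer_closed"}),
    "pick_cup": ({"hand_empty", "cup_on_table"}, {"holding_cup"},
                 {"hand_empty", "cup_on_table"}),
    "place_in_drawer": ({"holding_cup", "drawer_open"},
                        {"cup_in_drawer", "hand_empty"}, {"holding_cup"}),
}

full_plan = plan({"drawer_closed", "hand_empty", "cup_on_table"},
                 {"cup_in_drawer"}, OPS)
# Symbolic-level recovery: the cup was dropped after the drawer was opened,
# so we simply replan from the newly observed state.
recovery_plan = plan({"drawer_open", "hand_empty", "cup_on_table"},
                     {"cup_in_drawer"}, OPS)
```

Replanning from the observed symbolic state is what makes recovery cheap in this toy setting: a failure changes the start state, and the planner is rerun from there.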

2601.02085 2026-06-09 cs.RO cs.AI 版本更新

Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots

基于视觉的草莓采摘机器人早期故障诊断与自恢复

Meili Sun, Chunjiang Zhao, Lichao Yang, Hao Liu, Shimin Hu, Ya Xiong

发表机构 * NERCITA

AI总结 针对草莓采摘机器人视觉感知差、夹爪错位、空抓/误抓和滑落等问题,提出视觉故障诊断与自恢复框架,通过SRR-Net统一感知、相对误差补偿、微光学相机反馈及LSTM滑落预测,实现高精度定位与故障恢复。

Comments Accepted by Artificial Intelligence in Agriculture

详情
AI中文摘要

草莓采摘机器人面临视觉感知差、夹爪错位、空抓/误抓和滑落等挑战,降低了采摘稳定性和效率。为解决这些问题,本文提出了一种视觉故障诊断与自恢复框架。端到端SRR-Net通过联合检测、分割和果实与夹爪的成熟度回归,实现了统一感知和故障诊断。利用这种集成感知,设计了一种由目标-夹爪同步检测驱动的相对误差补偿方法,以纠正超过容差阈值的位置错位。集成在末端执行器内的微光学相机提供实时视觉反馈。基于微光学相机,在放气阶段使用MobileNet V3-Small分类器进行夹爪调整,能够在空抓/误抓情况下提前中止采摘周期。此外,在拉断阶段应用时间序列LSTM分类器预测草莓滑落。基于这些预测,系统对滑落草莓执行重新充气和二次拉断尝试,或对已滑落草莓中止周期。实验表明,末端执行器与采摘点之间的平均绝对误差沿x轴和y轴分别从11.50 mm和5.25 mm降低到3.12 mm和4.06 mm,时间增加0.64 ± 0.24秒。夹爪调整模块将抓取阶段缩短约0.5秒,并避免了失败情况下的空放置。草莓滑落预测模块以88.89%的成功率处理滑落情况,每个采摘周期为失败情况节省约4.00秒。同时,对滑落草莓实现了81.25%的恢复率,重新抓取需要额外0.63秒。

英文摘要

Strawberry-harvesting robots face challenges such as poor visual perception, gripper misalignment, empty grasp/misgrasp, and slippage, which reduce harvesting stability and efficiency. To overcome these issues, this paper proposes a visual fault diagnosis and self-recovery framework. An end-to-end SRR-Net achieved unified perception and fault diagnosis through joint detection, segmentation, and ripeness regression of the fruit and gripper. Leveraging this integrated perception, a relative error compensation method driven by simultaneous target-gripper detection was designed to correct positional misalignments exceeding the tolerance threshold. A micro-optical camera integrated within the end-effector delivered real-time visual feedback. Based on the micro-optical camera, a MobileNet V3-Small classifier was utilized for grasp adjustment during the deflating stage, enabling the early abort of the harvesting cycle in cases of empty grasp/misgrasp. Furthermore, a time-series LSTM classifier was applied during the snap-off stage to predict strawberry slippage. Based on these predictions, the system executed re-inflation and a secondary snap-off attempt for slipping strawberries, or aborted the cycle for slipped strawberries. Experiments demonstrated that the mean absolute errors between the end-effector and the picking point were reduced to 3.12 mm and 4.06 mm from 11.50 mm and 5.25 mm along the x- and y-axes, respectively, at the cost of a time increment of 0.64 ± 0.24 s. The grasp adjustment module reduced the grasping phase by approximately 0.5 s and avoided empty placement for failure cases. The strawberry slip prediction module handled slipped cases with an 88.89% success rate, saving approximately 4.00 s per harvesting cycle for failure cases. It also achieved an 81.25% recovery rate for slipping strawberries, requiring an additional 0.63 s for re-grasping.

2601.14871 2026-06-09 cs.RO 版本更新

On-the-fly hand-eye calibration for the da Vinci surgical robot

达芬奇手术机器人的在线手眼标定

Zejian Cui, Ferdinando Rodriguez y Baena

发表机构 * Department of Mechanical Engineering, Imperial College London(帝国理工学院机械工程系) Mechatronics in Medicine Laboratory(医学机电实验室) Hamlyn Centre for Robotics Surgery(机器人外科哈姆林中心)

AI总结 针对达芬奇机器人因编码器误差导致工具定位不准的问题,提出一种在线计算手眼变换矩阵的标定框架,通过特征关联和手眼标定两个模块实现无预训练的关键点匹配,在多种手术场景下显著降低定位误差且时间效率高。

Comments 18 pages, 17 figures, 5 tables

详情
AI中文摘要

在机器人辅助微创手术(RMIS)中,精确的工具定位对于确保患者安全和成功执行任务至关重要。然而,对于诸如达芬奇机器人等缆线驱动机器人,这仍然具有挑战性,因为错误的编码器读数会导致位姿估计误差。在本研究中,我们提出了一种标定框架,通过在线计算手眼变换矩阵来产生精确的工具定位结果。该框架由两个相互关联的算法组成:特征关联模块和手眼标定模块,前者无需预训练即可为单目图像上检测到的关键点提供鲁棒的对应关系,后者通过采用一系列滤波方法提供适应各种手术场景的通用性。为了验证其有效性,我们在公开可用的视频数据集上广泛测试了该框架,这些数据集包含多种手术器械在体外和离体场景下、不同光照条件和不同关键点测量精度下执行任务的情况。结果表明,在所提出的标定框架下,工具定位误差显著降低,精度与其他最先进方法相当,同时时间效率更高。

英文摘要

In Robot-Assisted Minimally Invasive Surgery (RMIS), accurate tool localization is crucial to ensure patient safety and successful task execution. However, this remains challenging for cable-driven robots, such as the da Vinci robot, because erroneous encoder readings lead to pose estimation errors. In this study, we propose a calibration framework to produce accurate tool localization results through computing the hand-eye transformation matrix on-the-fly. The framework consists of two interrelated algorithms: the feature association block and the hand-eye calibration block, which provide robust correspondences for key points detected on monocular images without pre-training, and offer the versatility to accommodate various surgical scenarios by adopting an array of filter approaches, respectively. To validate its efficacy, we test the framework extensively on publicly available video datasets that feature multiple surgical instruments conducting tasks in both in vitro and ex vivo scenarios, under varying illumination conditions and with different levels of key point measurement accuracy. The results show a significant reduction in tool localization errors under the proposed calibration framework, with accuracies comparable to other state-of-the-art methods while being more time-efficient.

2602.11934 2026-06-09 cs.RO 版本更新

Robot-DIFT: Correspondence-Sensitive Diffusion Features for Contact-Rich Robot Manipulation

Robot-DIFT: 用于接触丰富机器人操作的对应敏感扩散特征

Yu Deng, Yufeng Jin, Xiaogang Jia, Jiahong Xue, Gerhard Neumann, Georgia Chalvatzaki

发表机构 * TU Darmstadt(达姆施塔特工业大学) KIT(卡尔斯鲁厄理工学院) FZI(FZI信息技术研究中心) Hessian.AI(黑森州人工智能中心) Robotics Institute Germany(德国机器人研究所) Honda Research Institute Europe GmbH(本田欧洲研究院)

AI总结 提出Robot-DIFT,通过流形蒸馏将扩散模型转化为确定性学生网络,结合空间-语义特征金字塔网络,为接触敏感任务提供实时对应敏感特征,在多个基准上超越现有方法。

详情
AI中文摘要

机器人操作常常在最后几毫米失败:策略可能识别出正确的物体,但忽略了动作所需的姿态偏移、边界或预接触对齐。我们认为,当语义不变性抑制了闭环控制的对应线索,或者这些线索未以可用形式暴露给策略时,就会发生此类失败。现代视觉编码器提供强大的语义抽象,但接触丰富的操作需要对应敏感性:对动作相关的姿态、边界和接触几何变化具有判别性特征响应。扩散特征为密集对应提供了强大的先验,但由于随机性、延迟和表示漂移,直接使用不切实际。我们引入了Robot-DIFT,一种用于实时控制的确定性扩散派生骨干网络。通过流形蒸馏,Robot-DIFT将噪声条件扩散教师网络转换为干净输入的单次学生网络,同时保留教师的特征流形。空间-语义特征金字塔网络(S2-FPN)将粗到细的学生解码器特征融合为视觉标记,向策略暴露语义上下文和精细接触细节。在RoboCasa、LIBERO-10和真实机器人上,Robot-DIFT在接触敏感任务上优于视觉-语言、自监督、几何导向和扩散基线。受控的骨干/读出交换表明,S2-FPN解锁而非取代了扩散对应先验。

英文摘要

Robot manipulation often fails in the final millimeters: a policy may recognize the right object yet miss the pose offsets, boundaries, or pre-contact alignments needed for action. We argue that such failures arise when semantic invariance suppresses correspondence cues for closed-loop control, or when these cues are not exposed to the policy in a usable form. Modern visual encoders provide strong semantic abstractions, but contact-rich manipulation requires correspondence sensitivity: discriminative feature responses to action-relevant changes in pose, boundary, and contact geometry. Diffusion features provide a strong prior for dense correspondence, but direct use is impractical due to stochasticity, latency, and representation drift. We introduce Robot-DIFT, a deterministic diffusion-derived backbone for real-time control. Through Manifold Distillation, Robot-DIFT converts a noise-conditioned diffusion Teacher into a clean-input, single-pass Student while preserving the teacher's feature manifold. A Spatial-Semantic Feature Pyramid Network (S2-FPN) fuses coarse-to-fine Student decoder features into visual tokens that expose semantic context and fine contact detail to the policy. Across RoboCasa, LIBERO-10, and real robots, Robot-DIFT outperforms vision-language, self-supervised, geometry-oriented, and diffusion baselines on contact-sensitive tasks. Controlled backbone/readout swaps show that S2-FPN unlocks, rather than replaces, the diffusion correspondence prior.

2604.20689 2026-06-09 cs.RO 版本更新

FingerEye: Learning Dexterous Manipulation with Continuous Vision-Tactile Sensing

FingerEye:通过连续视觉-触觉感知学习灵巧操作

Zhixuan Xu, Yichen Li, Xuanye Wu, Tianyu Qiu, Lin Shao

发表机构 * National University of Singapore(新加坡国立大学) RoboScience(机器人科学) Huazhong University of Science and Technology(华中科技大学) South China University of Technology(华南理工大学)

AI总结 FingerEye通过连续视觉-触觉感知提升机器人灵巧操作,结合视觉和触觉反馈,相比仅腕部策略,在模拟和现实环境中平均成功率提升超过30个百分点。

详情
AI中文摘要

灵巧的机器人操作要求感知在接触前的接近、接触建立以及接触后的控制全过程中都保持信息丰富。我们介绍了FingerEye,一种通过贯穿整个交互过程的连续视觉-触觉反馈增强机器人灵巧性的感知和学习框架。在感知方面,FingerEye整合双目RGB相机和一个柔顺的接触接口,以支持接触前后的感知。接触前,指尖相机提供近距离视觉线索和隐式立体视觉,用于精确接近和物体定位。接触后,柔顺环上由标记点跟踪得到的变形为接触力旋量(wrench)感知提供了代理。在学习方面,我们构建了真实和模拟的基础设施用于数据收集和评估,系统研究了使用多个FingerEye传感器进行学习的策略-接口设计,并开发了FingerEye Policy,该策略通过组结构化的模态融合来减少模态捷径并更好地利用分布式的指尖反馈。在七个接触敏感的任务设置中,FingerEye相比仅腕部策略,在模拟和现实世界中均将平均成功率提高了超过30个百分点。

英文摘要

Dexterous robotic manipulation requires perception that remains informative from pre-contact approach to contact initiation and post-contact control. We introduce FingerEye, a sensing and learning framework that strengthens robotic dexterity through continuous vision-tactile feedback throughout interaction. On the sensing side, FingerEye integrates binocular RGB cameras with a compliant contact interface to support perception both before and after contact. Before contact, the fingertip cameras provide close-range visual cues and implicit stereo for precise approach and object localization. After contact, marker-tracked deformation of the compliant ring provides a proxy for contact wrench sensing. On the learning side, we build real-and-sim infrastructure for data collection and evaluation, systematically study policy-interface designs for learning with multiple FingerEye sensors, and develop FingerEye Policy, which applies group-structured modality fusion to reduce modality shortcuts and better exploit distributed fingertip feedback. Across seven contact-sensitive task settings, FingerEye improves on the wrist-only policy by over 30 percentage points in mean success rate in both simulation and the real world.

2605.30226 2026-06-09 cs.RO cs.AI 版本更新

BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

BORA: 弥合离线强化学习与在线残差适应以实现真实世界灵巧VLA模型

Zhongxi Chen, Yifan Han, Yanming Shao, Huanming Liu, Congsheng Xu, Xiaoyu Chen, Yao Mu, Wenzhao Lian

发表机构 * Shanghai Jiao Tong University(上海交通大学) CASIA(中国科学院自动化研究所) Shanghai AI Laboratory(上海人工智能实验室) USTC(中国科学技术大学)

AI总结 提出BORA框架,通过离线构建动作条件价值引导的评论家,并结合在线冻结VLA基础、引入人类在环的分块残差适应机制,解决灵巧操作中高维探索导致的时间不一致、样本低效和硬件风险问题,在五个真实灵巧任务上平均成功率提升33%。

Comments 24 pages, 11 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为将视觉-语言理解融入真实世界机器人操作的一种有前景的范式。然而,由于高维手部控制和复合执行误差,灵巧操作对VLA策略仍然具有挑战性,这使得真实世界的强化学习后训练对于弥合视觉基础动作生成与物理可靠灵巧执行之间的差距至关重要。然而,高维灵巧探索常常引发真实世界中的时间不一致性、样本低效和硬件风险。为应对这些挑战,我们提出BORA,一种为真实世界灵巧VLA模型设计的离线到在线强化学习后训练框架。在离线阶段,BORA构建一个以VLM的认知令牌和动作块作为输入的评论家。这种设计实现了动作条件价值引导,使评论家能够评估超越视觉上下文的灵巧手部运动。在随后的在线阶段,BORA冻结VLA基础,并引入一种轻量级、人类在环(HiL)的分块残差适应机制,以减轻真实世界执行误差并进一步在真实物理环境中纠正离线学习到的意图。通过继承离线评论家并采用干预驱动奖励,BORA有效纠正执行差异并适应真实世界物理变化,同时将预训练策略作为稳定先验。在五个复杂真实世界灵巧任务上的广泛评估表明,BORA显著优于纯模仿学习和传统解耦强化学习基线,在标准设置下平均成功率绝对提升33%,在未见物体泛化中提升高达43%。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. Yet high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency, and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.
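The chunk-wise residual idea (a frozen base policy whose action chunk is nudged by a small learned correction) can be sketched as follows. This is a hedged illustration only: the function name, the clipping bound `max_delta`, and the list-of-lists chunk layout are assumptions made for the example, and the abstract does not say whether or how BORA bounds its residuals.

```python
def apply_residual(base_chunk, residual_chunk, max_delta=0.05):
    """Add a bounded residual to every step of a frozen policy's action chunk.

    base_chunk, residual_chunk: lists of per-step action vectors (same shape).
    max_delta: hypothetical per-dimension bound that keeps the adapted action
    close to the pretrained policy, which then acts as a stable prior.
    """
    adapted = []
    for base_step, res_step in zip(base_chunk, residual_chunk):
        adapted.append([
            b + max(-max_delta, min(max_delta, r))  # clip, then add
            for b, r in zip(base_step, res_step)
        ])
    return adapted

base = [[0.10, 0.20], [0.30, 0.40]]       # chunk from the frozen base policy
residual = [[0.01, -0.20], [0.50, 0.00]]  # raw output of the residual head
adapted = apply_residual(base, residual)
```

Clipping is one simple way to keep exploration near the base policy; other choices, such as scaling the residual, would serve the same purpose.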

2606.02274 2026-06-09 cs.RO 版本更新

Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

Dexterity-BEV: 对齐3D世界与动作以实现通用机器人策略学习

Huayi Zhou, Wei Gao, Dekun Lu, Ruiji Liu, Zhanqi Zhang, Ziyang Zhang, Jian Chen, Wenlve Zhou, Sheng Xu, Shumin Li, Kangyi Guo, Shichen Xu, Zixin Huang, Yongyi Su, Kui Jia

发表机构 * DexForce Technology(德克斯技术公司) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出Dexterity-BEV框架,通过对齐顶点图和顶点谱的3D表示以及鸟瞰图对齐,解决2D基础模型在3D操作中的局限性,提升机器人策略的泛化能力。

Comments under review

详情
AI中文摘要

端到端操作策略结合大规模预训练的视觉-语言模型(VLM)展示了通用且灵巧的机器人操作的潜力。然而,它们继承了2D基础模型的两个关键局限性:1)依赖忽略操作内在3D性质的2D RGB输入;2)输入输出空间以及不同机器人形态、相机设置和轨迹数据集之间缺乏空间3D对齐。在本文中,我们提出了一系列贡献来解决这些问题。首先,我们引入对齐顶点图和顶点谱——一种逐像素的3D表示,利用相机标定和可选的深度将2D视觉输入提升到3D。这种新颖的输入表示将3D感知与2D大型VLM的泛化能力相结合。然后,我们提出通过将每个相机视图的逐像素3D信息和机器人动作表达到一个共享坐标系来对齐操作策略的输入和输出。基于此,我们指定一个规范的鸟瞰图(BEV)对齐框架,并创新性地提出构建BEV图像,产生对相机姿态变化鲁棒的视角不变表示。为了实现大规模训练和评估,我们开发了一个全面的数据处理流程来执行此类对齐;我们还引入了一种新颖的时间对齐方案,用于跨不同机器人、人类操作员和数据集的轨迹。这些贡献共同缓解了输入输出的时空错位,提高了真实世界操作的一致性和泛化能力。预训练检查点、源代码和数据处理流程可在 https://hnuzhy.github.io/projects/Dex-BEV 获取。

英文摘要

End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) the reliance on 2D RGB inputs that ignores the intrinsically 3D nature of manipulation; and 2) the lack of spatial 3D alignment between input-output spaces as well as across diverse robot embodiments, camera setups, and trajectory datasets. In this paper, we present a series of contributions to address these issues. First, we introduce aligned vertex map and vertex spectrum -- a pixel-wise 3D representation that elevates 2D visual inputs to 3D, using camera calibration and optional depth. This novel input representation marries 3D awareness with the generalization of 2D large VLMs. Then, we propose to align the inputs and outputs of manipulation policies by expressing the per-pixel 3D information of each camera view and the robot actions in a shared coordinate frame. Based on this, we designate a canonical Bird's-Eye-View (BEV) alignment frame and propose to construct BEV images, producing a view-invariant representation robust to camera pose variations. To enable training and evaluation at scale, we develop a comprehensive data processing pipeline to perform such alignments; we also introduce a novel temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. These contributions collectively mitigate input and output spatial-temporal misalignments, improving the consistency and generalization for real-world manipulation. Pretrained checkpoint, source code and data processing pipeline are available at https://hnuzhy.github.io/projects/Dex-BEV.
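The notion of a pixel-wise 3D representation expressed in a shared frame can be illustrated with standard pinhole back-projection. The sketch below is an assumption-laden example, not the paper's aligned vertex map: it assumes a metric depth image, known intrinsics `(fx, fy, cx, cy)`, and a known camera-to-shared-frame rigid transform `(R, t)`, and every function name is hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def to_shared_frame(p_cam, R, t):
    """Rigid transform p_shared = R * p_cam + t, with R a 3x3 nested list."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )

def vertex_map(depth_img, intrinsics, R, t):
    """Per-pixel 3D points in the shared frame, indexed as [row v][column u]."""
    fx, fy, cx, cy = intrinsics
    return [
        [to_shared_frame(backproject(u, v, d, fx, fy, cx, cy), R, t)
         for u, d in enumerate(row)]
        for v, row in enumerate(depth_img)
    ]

# Toy 2x3 depth image, identity rotation, camera 1 m behind the shared origin.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
vm = vertex_map([[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]],
                (100.0, 100.0, 1.0, 1.0), IDENTITY, (0.0, 0.0, 1.0))
```

Once every camera's pixels and the robot's actions live in the same frame, a top-down (BEV) rendering reduces to projecting these points onto the ground plane of that frame.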

2606.06033 2026-06-09 cs.RO 版本更新

RealDexUMI: A Wearable Universal Manipulation Interface for Dexterous Robot Learning

RealDexUMI:用于灵巧机器人学习的可穿戴通用操作接口

Chaoyi Xu, Yixuan Jiang, Jiahui Huan, Yuhui Fu, Haoyu Zhou, Weitian Yuan, Jiayi Yu, Wanpeng Zhang, Haoqi Yuan, Zongqing Lu

发表机构 * Peking University(北京大学) BeingBeyond Beihang University(北航) LinkerBot Tsinghua University(清华大学)

AI总结 提出RealDexUMI,一种基于共享灵巧末端执行器模块的可穿戴通用操作接口,通过掌侧同构遥操作手套实现无重定向、直观精确的手部控制,在八项真实机器人任务中平均成功率达88.75%。

详情
AI中文摘要

学习灵巧操作需要演示,这些演示在保持精细手-物体交互的同时,在部署时仍可执行。现有流程要么通过重定向或具身转换损失可部署的灵巧性,要么依赖特定机器人的遥操作,这种遥操作成本高昂且难以扩展,并且通常缺乏用于灵巧数据收集的直观、接触感知控制。我们提出RealDexUMI,一种围绕共享灵巧末端执行器模块构建的可穿戴通用操作接口,该模块集成了轻量级灵巧手、手内视觉和指尖触觉传感。掌侧同构遥操作手套将人类手指输入映射到机器人手关节命令,实现实时、无重定向、直观且精确的手部控制。共享的手和传感模块产生零间隙的末端执行器数据,在收集和部署之间具有匹配的手内观察、触觉信号、接触和手部动作。在涵盖精细、接触丰富、长时域和双臂操作的八项真实机器人任务中,基于RealDexUMI数据训练的策略平均成功率达到88.75%,能够泛化到未见过的初始姿态,并在三种具身之间迁移。网站:https://research.beingbeyond.com/realdexumi

英文摘要

Learning dexterous manipulation requires demonstrations that preserve fine hand-object interactions while remaining executable at deployment. Existing pipelines either lose deployable dexterity through retargeting or embodiment conversion, or rely on robot-specific teleoperation that is costly to scale and often lacks intuitive, contact-aware control for dexterous data collection. We present RealDexUMI, a wearable universal manipulation interface built around a shared dexterous end-effector module that integrates a lightweight dexterous hand, in-hand vision, and fingertip tactile sensing. A palm-side isomorphic teleoperation glove maps human finger inputs to robot-hand joint commands, enabling real-time, retargeting-free, intuitive, and precise hand control. The shared hand and sensing modules yield zero-gap end-effector data, with matched in-hand observations, tactile signals, contacts, and hand actions between collection and deployment. Across eight real-robot tasks spanning fine-grained, contact-rich, long-horizon, and bimanual manipulation, policies trained on RealDexUMI data achieve an average success rate of 88.75%, generalize to unseen initial poses, and transfer across three embodiments. Website: https://research.beingbeyond.com/realdexumi

4. 导航、定位与SLAM 11 篇

2606.08029 2026-06-09 cs.RO 新提交

IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations

IntentNav: 从人类演示中学习空间-视觉物体导航

Yuxin Cai, Zongtai Li, Maonan Wang, Muyi Bao, Haokun Zhu, Ruofei Bai, Ding Zhao, Zirui Li, Wenshan Wang, Wei-Yun Yau, Ji Zhang, Chen Lv

发表机构 * Nanyang Technological University(南洋理工大学) Carnegie Mellon University(卡内基梅隆大学) The Chinese University of Hong Kong(香港中文大学) A*STAR Institute for Infocomm Research (I2R)(新加坡科技研究局资讯通信研究院)

AI总结 提出IntentNav框架,通过人类演示学习类人物体导航策略,利用前沿标注和意图对齐目标实现最优性能,并零样本迁移到多种机器人平台。

Comments 26 pages, 9 figures

详情
AI中文摘要

物体导航要求机器人在未知环境中搜索未观察到的目标,通过在部分可观测性下决定下一步探索位置。有效的搜索类似于人类探索:选择性探查视觉上有希望的前沿,同时依赖空间记忆避免重复访问。我们提出IntentNav,一个从人类演示中学习类人ObjectNav策略的空间-视觉模仿框架。为了从低级人类动作推断高级搜索意图,我们引入了基于前沿的人类意图标注,该方法前瞻人类演示并标注最能解释演示者未来搜索方向的前沿。我们构建了一个空间-视觉候选空间,其中BEV记忆跟踪已探索区域、未探索前沿和轨迹历史,而自我中心视觉记忆为每个候选提供语义线索。训练一个VLM策略在这些基于上下文的候选中进行选择,使用意图对齐目标以鼓励一致且类人的探索。IntentNav在MP3D、HM3D-v1和HM3D-v2 ObjectNav基准上实现了最先进的性能。所提出的候选级导航界面无需进一步VLM微调即可零样本迁移到轮式、四足和类人机器人。项目页面:https://anonymous.4open.science/w/IntentNav/

英文摘要

Object navigation requires a robot to search for an unobserved target in an unknown environment by deciding where to explore next under partial observability. Effective search resembles human-like exploration: selectively probing visually promising frontiers while relying on spatial memory to avoid redundant revisits. We propose IntentNav, a spatial-visual imitation framework that learns human-like ObjectNav policies from human demonstrations. To infer high-level search intent from low-level human actions, we introduce Frontier-based Human-Intent Labeling, which looks ahead in human demonstrations and labels the frontier that best explains the demonstrator's future search direction. We construct a spatial-visual candidate space, where BEV memory tracks explored regions, unexplored frontiers, and trajectory history, while egocentric visual memory provides semantic cues for each candidate. A VLM policy is trained to select among these grounded candidates, using an Intent-Aligned Objective to encourage consistent and human-like exploration. IntentNav achieves state-of-the-art performance on the MP3D, HM3D-v1 and HM3D-v2 ObjectNav benchmarks. The proposed candidate-level navigation interface transfers zero-shot to wheeled, quadruped, and humanoid robots without further VLM fine-tuning. Project page: https://anonymous.4open.science/w/IntentNav/

2606.08666 2026-06-09 cs.RO 新提交

Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

语言作为传感器:从自然语言在3D场景中进行校准的空间信念估计

Aryan Naveen, Jason Xinyu Liu, Luca Carlone, Andreea Bobu

发表机构 * MIT Laboratory for Information & Decision Systems(麻省理工学院信息与决策系统实验室) MIT Computer Science & Artificial Intelligence Laboratory(麻省理工学院计算机科学与人工智能实验室)

AI总结 提出语言传感器模型(LSM)将自然语言描述转化为校准的空间分布,并融合到VL-Map概率框架中,实现更准确的目标定位。

Comments 18 pages, 7 figures, 3 tables

详情
AI中文摘要

部署在以人为中心的环境中的机器人经常接收自然语言的空间信息描述(如“我把背包放在桌子上”),这些描述涉及超出其感知视野的世界部分。传统的度量-语义映射忽略了这一信号,而现成的多模态模型在3D空间推理方面仍然有限,并且不易与其他传感器模态融合。为了将语言观测转换为校准的空间分布,我们训练了一个语言传感器模型(LSM),该模型将每个话语及其场景图上下文映射到多模态分布,其中混合权重编码指代歧义(例如,“哪张桌子”),分量协方差编码空间不确定性(例如,目标在“桌子上”的哪个位置)。然后,我们引入了VL-Map(视觉-语言度量-语义映射),这是一个概率框架,将这些语言预测视为随机观测,并在统一的信念图中与机载感知融合。在VLA-3D基准测试以及真实世界的移动机器人上,LSM是唯一协方差估计保持在校准范围内的语言预测器;融合到VL-Map中,它导致对目标对象位置更准确的预测(与最强的基础模型基线相比,真实目标上的概率质量增加了约70%)。

英文摘要

Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information ("I left my backpack on the table") that reference parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion with other sensor modalities. To convert language observations into a calibrated spatial distribution, we train a Language Sensor Model (LSM) that maps each utterance and its scene-graph context to a multimodal distribution, with mixture weights encoding referential ambiguity (e.g., "which table") and component covariances encoding spatial uncertainty (e.g., where "on the table" the target lies). We then introduce VL-Map (Vision-Language Metric-Semantic Mapping), a probabilistic framework that treats these language predictions as stochastic observations and fuses them with onboard perception within a unified belief map. On the VLA-3D benchmark as well as on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it leads to more accurate predictions of the target object location (~70% more probability mass on the true target compared to the strongest foundation-model baseline).
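The idea of treating an utterance as a stochastic observation can be made concrete with a toy Bayes update over a handful of candidate cells. This is a minimal sketch under stated assumptions: isotropic 2D Gaussians stand in for the full component covariances mentioned in the abstract, the mixture parameters are hand-picked rather than predicted by a trained LSM, and all function names are hypothetical.

```python
import math

def gaussian_2d(x, y, mean, var):
    """Isotropic 2D Gaussian density with per-axis variance `var`."""
    dx, dy = x - mean[0], y - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * var)) / (2.0 * math.pi * var)

def mixture_likelihood(x, y, components):
    """components: list of (weight, mean, var); weights model which referent
    ("which table") is meant, variances model where on it the target lies."""
    return sum(w * gaussian_2d(x, y, m, v) for w, m, v in components)

def fuse(prior, cells, components):
    """Discrete Bayes update: posterior is proportional to prior * likelihood."""
    unnorm = [p * mixture_likelihood(x, y, components)
              for p, (x, y) in zip(prior, cells)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two tables at x = 0 m and x = 10 m; the utterance favours the first (0.7).
cells = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
prior = [1.0 / 3.0] * 3  # uninformative belief from onboard perception
components = [(0.7, (0.0, 0.0), 1.0), (0.3, (10.0, 0.0), 1.0)]
posterior = fuse(prior, cells, components)
```

Because the update is multiplicative, a sharper prior from onboard perception would reweight the two tables, and an over-confident (miscalibrated) covariance would suppress the true cell, which is why calibration matters for fusion.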

2606.08992 2026-06-09 cs.RO cs.AI cs.CV 新提交

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

SpaceVLN:具有在线空间认知记忆与推理的零样本视觉与语言导航智能体

Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) China Telecom(中国电信) Central South University(中南大学) Jiangsu University(江苏大学)

AI总结 提出SpaceVLN,通过空间认知记忆和任务引导的空间推理,在零样本设置下实现连续环境中的视觉与语言导航,在多个基准上达到最优性能。

Comments 23 pages, 9 figures, 7 tables

详情
AI中文摘要

连续环境中的视觉与语言导航要求智能体理解未见环境的空间结构以遵循语言指令。尽管基础模型为无需任务特定策略训练的零样本导航开辟了有希望的路径,但许多导航器仍依赖局部视觉线索和基于线性历史的推理,忽视了探索区域、穿越路径、地标及其空间关系的空间本质。本文提出SpaceVLN,一种围绕空间认知记忆和任务引导的空间推理构建的导航智能体。具体而言,SpaceVLN引入了一个高效的分阶段闭环框架,其中规划和执行围绕可验证的空间-地标阶段组织。导航过程中,智能体逐步将探索区域抽象为空间航点,并动态维护子任务基础的地标证据,形成层次化的空间认知记忆以进行进度定位和空间关系理解。基于此记忆,Spatial-CoT将任务进度推理与空间感知、分析和预测相结合,实现任务引导的空间推理以用于具身导航。统一阶段接口使SpaceVLN能够在统一的零样本设置下处理视觉与语言导航和目标导向导航,无需任务特定策略训练。在R2R-CE、RxR-CE、GN-Bench和HM3D-OVON上,SpaceVLN实现了最先进的零样本性能,真实机器人部署进一步验证了其适用性。这些结果突显了空间认知记忆和任务引导的空间推理作为更强具身导航智能体的实用基础。

英文摘要

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space--landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.

2606.09268 2026-06-09 cs.RO 新提交

VGP-Nav: Metric-Aware Visual Geometric Perception for Robot Navigation

VGP-Nav:用于机器人导航的度量感知视觉几何感知

Hewei Pan, Weiye Zhu, Zekai Zhang, Zitong Huang, Rongtao Xu, Jinbao Wang, Feng Zheng

发表机构 * Southern University of Science and Technology(南方科技大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Shenzhen University(深圳大学) SpatialTemporal AI(时空人工智能)

AI总结 提出VGP-Nav,一种仅依赖单目RGB输入的框架,通过地面平面几何约束解决尺度模糊,实现度量定位与障碍物感知的统一。

详情
AI中文摘要

可靠的机器人导航需要精确的全局定位和稠密、度量一致的障碍物感知的无缝集成。实现这些能力的常见策略涉及集成多种传感模态:相机提供丰富的视觉特征用于定位,而主动传感器如LiDAR提供直接的度量测量。然而,这种多传感器配置需要复杂的时空校准并增加部署开销。尽管纯视觉方法提供了低成本且可扩展的替代方案,现有的单目视觉系统通常难以同时实现高效、全局一致的定位和稠密、度量一致的几何感知。为弥合这一差距,我们提出VGP-Nav,一个统一的度量感知视觉几何感知框架,仅依赖单目RGB输入,联合支持度量定位和障碍物感知。我们的关键洞察是将基于定位的视觉几何锚定到从地面平面几何导出的物理上有意义的尺度约束,从而为单目感知提供可靠的度量参考。VGP-Nav在线解决单目尺度模糊,并生成可直接用于下游规划的、基于定位的度量障碍物表示。大量实验证明了其在多种环境中的强泛化能力以及在真实移动机器人上的成功部署,突显了该方法在可扩展、低成本且安全的自主导航中的实用性。

英文摘要

Reliable robotic navigation necessitates the seamless integration of accurate global localization and dense, metric-consistent obstacle perception. A common strategy to achieve these capabilities involves integrating diverse sensing modalities: cameras offer rich visual features for localization, while active sensors like LiDAR provide direct metric measurements. However, such multi-sensor configurations necessitate complex spatial-temporal calibration and increase deployment overhead. Although vision-only approaches offer a low-cost and scalable alternative, existing monocular visual systems typically struggle to simultaneously achieve efficient, globally consistent localization and dense, metric-consistent geometric perception. To bridge this gap, we propose VGP-Nav, a unified framework for Metric-Aware Visual Geometric Perception that relies solely on monocular RGB input to jointly support metric localization and obstacle perception. Our key insight is to anchor localization-grounded visual geometry to physically meaningful scale constraints derived from ground-plane geometry, thereby providing a reliable metric reference for monocular perception. VGP-Nav resolves monocular scale ambiguity online and produces localization-grounded, metric obstacle representations that are directly applicable to downstream planning. Extensive experiments demonstrate strong generalization across diverse environments and successful deployment on real mobile robots, highlighting the practicality of our approach for scalable, low-cost, and safe autonomous navigation.
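One textbook way a ground plane can fix monocular scale is to compare the camera's known mounting height with its height above the ground plane in the up-to-scale reconstruction. The sketch below illustrates only that generic idea; the abstract does not state which constraint VGP-Nav actually uses, and the known-camera-height assumption and all function names are hypothetical.

```python
import math

def plane_distance_from_origin(normal, d):
    """Distance from the camera centre (origin) to the plane n.p + d = 0."""
    norm = math.sqrt(sum(c * c for c in normal))
    return abs(d) / norm

def metric_scale(normal, d, camera_height_m):
    """Scale factor mapping up-to-scale units to metres, assuming the true
    camera height above the ground is known (an assumption of this sketch)."""
    return camera_height_m / plane_distance_from_origin(normal, d)

def rescale(points, s):
    """Apply the recovered scale to up-to-scale 3D points."""
    return [tuple(s * c for c in p) for p in points]

# Ground plane fitted in the up-to-scale reconstruction: 2y - 1 = 0,
# i.e. the camera sits 0.5 units above the ground; its true height is 1.2 m.
s = metric_scale((0.0, 2.0, 0.0), -1.0, 1.2)
obstacle_m = rescale([(1.0, 0.5, 4.0)], s)
```

Re-estimating the plane and the scale as new frames arrive is what would make such a correction work online rather than as a one-off calibration.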

2606.09292 2026-06-09 cs.RO cs.SY eess.SY 新提交

Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments

基于对偶四元数的无迹卡尔曼滤波与视觉惯性里程计在GPS拒止环境中的导航

Mohamed Khalifa, Hashim A. Hashim

发表机构 * Carleton University(卡尔顿大学)

AI总结 提出一种基于对偶四元数的无迹卡尔曼滤波(DQUKF)结合视觉惯性里程计(VIO),在GPS拒止环境下实现高精度状态估计,在EuRoC数据集上位置RMSE达0.2584米。

详情
AI中文摘要

在GPS拒止环境中的可靠导航仍然是机器人、航空航天和自动驾驶车辆应用中的基本挑战。本文提出了一种基于对偶四元数的无迹卡尔曼滤波(DQUKF),配备视觉惯性里程计(VIO)算法,用于在GPS拒止位置实现精确状态估计以实现导航。所提出的框架以误差状态形式构建DQUKF,其中名义位姿由单位对偶四元数表示,局部位姿误差由6维扭量参数化表示,用于sigma点生成、协方差传播和测量校正。同时,VIO算法跨图像帧跟踪特征,同步IMU和相机之间的测量,并提供补充惯性传播的视觉约束。在EuRoC MAV数据集上的仿真结果表明,所提出的DQUKF在高初始化不确定性下收敛,并在困难飞行序列中实现了0.2584米的位置RMSE,优于基准滤波器。

英文摘要

Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and autonomous vehicle applications. This paper presents a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) equipped with a Visual Inertial Odometry (VIO) algorithm for accurate state estimation enabling navigation in GPS-denied locations. The proposed framework formulates the DQUKF in error-state form, where the nominal pose is represented by a unit dual quaternion and the local pose error is represented by a 6-dimensional twistor parameterization used for sigma point generation, covariance propagation, and measurement correction. In parallel, the VIO algorithm tracks features across image frames, synchronizes measurements between the IMU and camera, and provides visual constraints that complement inertial propagation. Simulation results on the EuRoC MAV dataset show that the proposed DQUKF converges under high initialization uncertainty and achieves a position RMSE of 0.2584 m in the difficult flight sequence, outperforming the benchmark filters.
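For readers unfamiliar with the pose representation, a unit dual quaternion packs a rotation quaternion and a translation into one object. The sketch below uses one common convention, with dual part q_d = 0.5 * t * q_r and t written as a pure quaternion; the paper's own convention may differ, the error-state filter itself is not reproduced, and all function names are illustrative.

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def pose_to_dq(q_r, t):
    """Pose (unit rotation q_r, translation t) -> dual quaternion (real, dual)."""
    q_d = tuple(0.5 * c for c in qmul((0.0, t[0], t[1], t[2]), q_r))
    return q_r, q_d

def dq_translation(dq):
    """Recover t from t = 2 * q_d * conj(q_r)."""
    q_r, q_d = dq
    _, x, y, z = qmul(q_d, qconj(q_r))
    return (2.0 * x, 2.0 * y, 2.0 * z)

def dq_mul(a, b):
    """Compose poses: (ar + e*ad)(br + e*bd) = ar*br + e*(ar*bd + ad*br)."""
    (ar, ad), (br, bd) = a, b
    dual = tuple(p + q for p, q in zip(qmul(ar, bd), qmul(ad, br)))
    return qmul(ar, br), dual

# Pose A: rotate 90 degrees about z, then translate by (1, 0, 0).
half = math.pi / 4.0
A = pose_to_dq((math.cos(half), 0.0, 0.0, math.sin(half)), (1.0, 0.0, 0.0))
# Pose B: pure translation by (1, 0, 0), expressed in A's frame.
B = pose_to_dq((1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
t_AB = dq_translation(dq_mul(A, B))  # expected t_A + R_A t_B = (1, 1, 0)
```

A single product composes rotation and translation together, which is the usual argument for carrying the nominal pose as a dual quaternion inside a filter.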

2606.09355 2026-06-09 cs.RO 新提交

MosaicIMU: Composing Carrier Experts for Generalizable Neural Inertial Odometry

MosaicIMU:面向可泛化神经惯性里程计的载体专家组合

Junye Zou, Huiyi Yan, Xinning Xu, Xiaolei Li, Pengkun Zhou, Jinhui Zhang, Ziyang Meng

发表机构 * Tsinghua University(清华大学) Xi'an Jiaotong University(西安交通大学) Beijing University of Chemical Technology(北京化工大学) Beijing Information Science and Technology University(北京信息科技大学) Beijing Institute of Technology(北京理工大学)

AI总结 提出MosaicIMU框架,通过原型路由组合载体特定专家特征,结合历史感知EKF,实现跨载体泛化;冻结预训练模型并学习轻量专家残差分支适应新领域,边缘部署时利用路由器选择在线样本高效增量更新,平均ATE和RTE-10s分别降低40%和34%。

详情
AI中文摘要

当外部传感不可靠时,鲁棒的惯性里程计对各种载体至关重要。基于学习的方法通过捕获局部运动先验来减少积分漂移,但这些方法通常局限于特定载体,限制了跨异构平台的泛化。我们提出MosaicIMU,一种载体条件的混合专家(MoE)预训练与自适应框架,用于可泛化的神经惯性里程计。MosaicIMU使用基于原型的路由器组合载体特定的专家特征,解码局部速度和不确定性约束,并将其与历史感知EKF集成。对于未见领域自适应,它冻结预训练基础模型并学习新的轻量专家残差分支。对于边缘部署,它进一步重用路由器来选择信息丰富的在线样本以进行高效的增量更新。实验表明,MosaicIMU持续优于基于学习的基线,平均ATE和RTE-10s分别降低40%和34%。这些结果凸显了MosaicIMU为可泛化和自适应的神经惯性里程计提供了一种可扩展的预训练到部署范式。

英文摘要

Robust inertial odometry is essential for various carriers when external sensing is unreliable. Learning-based methods reduce integration drift by capturing local motion priors, but these methods often remain tied to a particular carrier, limiting generalization across heterogeneous platforms. We present MosaicIMU, a carrier-conditioned Mixture-of-Experts (MoE) pretraining-and-adaptation framework for generalizable neural inertial odometry. MosaicIMU uses a prototype-based router to compose carrier-specific expert features, decodes local velocity and uncertainty constraints, and integrates them with a history-aware EKF. For unseen domain adaptation, it freezes the pretrained base model and learns a new lightweight expert residual branch. For edge-deployment, it further reuses the router to select informative online samples for efficient incremental updates. Experiments show that MosaicIMU consistently outperforms learning-based baselines, reducing average ATE and RTE-10s by 40% and 34%, respectively. These results highlight that MosaicIMU provides a scalable pretraining-to-deployment paradigm for generalizable and adaptive neural inertial odometry.
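
The idea of a prototype-based router that composes expert features can be sketched as follows. This is a generic illustration and not MosaicIMU's actual architecture: the distance-based scoring, the softmax temperature, and every name below are assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(embedding, prototypes, temperature=1.0):
    """Return one mixing weight per expert.

    Each expert owns a prototype vector; an input embedding that lies
    closer to a prototype gives that expert a larger weight.
    (Hypothetical scoring rule: negative squared Euclidean distance.)
    """
    scores = [
        -sum((e - p) ** 2 for e, p in zip(embedding, proto)) / temperature
        for proto in prototypes
    ]
    return softmax(scores)

def compose(weights, expert_features):
    """Weighted sum of per-expert feature vectors."""
    dim = len(expert_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, expert_features))
        for i in range(dim)
    ]

# Two hypothetical carrier prototypes and the features each expert produced.
prototypes = [[0.0, 0.0], [5.0, 5.0]]
expert_features = [[1.0, 0.0], [0.0, 1.0]]
weights = route([0.5, 0.2], prototypes)
mixed = compose(weights, expert_features)
```

With this kind of routing, adding a new carrier amounts to adding a prototype and an expert branch, which fits the abstract's description of freezing the base model and learning a lightweight residual expert.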

2606.08284 2026-06-09 cs.CV cs.RO 交叉投稿

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

G2G:利用组内几何进行组间姿态估计

Yufei Wei, Shuhao Ye, Chenxiao Hu, Yiyuan Pan, Dongyu Feng, Rong Xiong, Yue Wang, Yanmei Jiao

发表机构 * State Key Laboratory of Industrial Control and Technology, Zhejiang University(浙江大学工业控制技术国家重点实验室) Zhejiang Humanoid Robot Innovation Center Co., Ltd.(浙江人形机器人创新中心有限公司) School of Information Science and Engineering, Hangzhou Normal University(杭州师范大学信息科学与工程学院)

AI总结 提出G2G方法,通过冻结多视图基础模型并添加三个轻量可训练模块(感知器重采样器、跨组桥接模块和多帧姿态头),仅利用相对姿态监督实现组间6-DoF姿态估计,在四个数据集上达到SOTA。

详情
AI中文摘要

恢复两个图像组之间的相对6-DoF姿态是跨序列重定位和多相机组里程计的基础。每个组通过视觉里程计或相机组标定携带已知的组内几何,预训练的多视图骨干网络已经将这种几何融合到视觉特征中。然而,当前模型将所有视图视为非结构化集合,缺少跨组推理这一关键环节。我们提出G2G,该方法保持基础模型完全冻结,并添加三个轻量可训练模块来桥接两个组:感知器重采样器、带有合并自注意力的跨组桥接模块以及多帧姿态头。可训练部分总计约32M参数,不到完整模型的6%,且仅由相对姿态监督。在四个数据集(涵盖室内外仿真、真实世界跨季节采集以及零样本仿真到真实迁移)上,G2G在两个任务上都达到了最先进的精度,而每个基线都使用其完整的原始监督进行重新训练。代码可在 https://github.com/WeiYuFei0217/G2G 获取。

英文摘要

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce G2G, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.

2504.19399 2026-06-09 cs.RO 版本更新

Follow Everything: A Leader-Following and Obstacle Avoidance Framework with Goal-Aware Adaptation

跟随一切:具有目标感知适应的领导者跟随与避障框架

Qianyi Zhang, Shijian Ma, Boyi Liu, Jianhao Jiao, Dimitrios Kanoulas

发表机构 * Institute of Robotics and Automatic Information System, Nankai University, China(南开大学机器人与自动化信息系统研究所) Centre for Data Science, University of Macau, China(澳门大学数据科学中心) Electrical and Computer Engineering Department, Hong Kong University of Science and Technology, China(香港科技大学电子及计算机工程学系) Department of Computer Science, University College London, UK(伦敦大学学院计算机科学系) Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hong Kong, China(香港理工大学航空及民航工程学系)

AI总结 提出统一框架,用分割模型替代检测模型以跟随任意形态领导者,并设计目标感知适应机制和基于图的规划器,实现领导者暂时离开视野时的鲁棒跟随与避障。

详情
AI中文摘要

鲁棒且灵活的领导者跟随是机器人融入人类社会的一项关键能力。现有方法难以泛化到任意形态的领导者,并且在领导者暂时离开机器人视野时常常失败,本文引入了一个统一框架来应对这两个挑战。首先,用分割模型替代传统检测模型,使领导者可以是任何物体。为了增强识别鲁棒性,实现了一个距离帧缓冲区,在多个距离存储领导者嵌入,以考虑领导者跟随任务的独特特征。其次,设计了一种目标感知适应机制,根据领导者的可见性和运动来控制机器人规划状态,并辅以基于图的规划器,为每个状态生成候选轨迹,确保高效跟随和避障。在室内外环境中,使用腿式机器人跟随者与各种领导者(人、地面机器人、无人机、腿式机器人、停止标志)进行的仿真和真实世界实验显示,在跟随成功率、减少视觉丢失时长、降低碰撞率和减小领导者-跟随者距离方面取得了竞争性改进。

英文摘要

Robust and flexible leader-following is a critical capability for robots to integrate into human society. While existing methods struggle to generalize to leaders of arbitrary form and often fail when the leader temporarily leaves the robot's field of view, this work introduces a unified framework addressing both challenges. First, traditional detection models are replaced with a segmentation model, allowing the leader to be anything. To enhance recognition robustness, a distance frame buffer is implemented that stores leader embeddings at multiple distances, accounting for the unique characteristics of leader-following tasks. Second, a goal-aware adaptation mechanism is designed to govern robot planning states based on the leader's visibility and motion, complemented by a graph-based planner that generates candidate trajectories for each state, ensuring efficient following with obstacle avoidance. Simulations and real-world experiments with a legged robot follower and various leaders (human, ground robot, UAV, legged robot, stop sign) in both indoor and outdoor environments show competitive improvements in follow success rate, reduced visual loss duration, lower collision rate, and decreased leader-follower distance.

2602.22243 2026-06-09 cs.RO 版本更新

SODA-CitrON: Static Object Data Association by Clustering Multi-Modal Sensor Detections Online

SODA-CitrON:通过在线聚类多模态传感器检测实现静态物体数据关联

Jan Nausner, Kilian Wohlleben, Michael Hubner

发表机构 * Jan Nausner, Kilian Wohlleben, Michael Hubner

AI总结 本文提出SODA-CitrON方法,通过在线聚类多模态传感器检测实现静态物体的数据关联,同时估计位置并维持持久跟踪,在F1分数、位置RMSE、MOTP和MOTA指标上优于现有方法。

Comments 8 pages, 5 figures; © 2026 IEEE. Accepted for the 2026 International Conference on Information Fusion (FUSION 2026)

详情
AI中文摘要

从异构传感器检测中在线融合和跟踪静态物体是机器人、自主系统和环境建图中的基本问题。尽管JPDA等经典数据关联方法适合动态目标,但对于被间歇性观测且不确定性各异的静态物体效果较差,因为此时运动模型对杂波的判别能力有限。本文提出了一种新颖的静态物体数据关联方法SODA-CitrON,通过在线聚类多模态传感器检测,同时估计位置并维持未知数量物体的持久跟踪。所提出的无监督机器学习方法完全在线运行,能够处理时间上不相关的多传感器测量。此外,它在传感器检测数量上具有最坏情况下的对数线性复杂度,同时提供完整的输出可解释性。我们在不同的蒙特卡洛模拟场景中评估了该方法,并将其与基于POM的滤波、DBSTREAM聚类和JPDA等现有方法进行比较。结果表明,在所研究的静态物体建图场景中,SODA-CitrON在F1分数、位置RMSE、MOTP和MOTA指标上始终优于对比方法。

英文摘要

The online fusion and tracking of static objects from heterogeneous sensor detections is a fundamental problem in robotics, autonomous systems, and environmental mapping. Although classical data association approaches such as JPDA are well suited for dynamic targets, they are less effective for static objects observed intermittently and with heterogeneous uncertainties, where motion models provide minimal discriminative power with respect to clutter. In this paper, we propose a novel method for static object data association by clustering multi-modal sensor detections online (SODA-CitrON), while simultaneously estimating positions and maintaining persistent tracks for an unknown number of objects. The proposed unsupervised machine learning approach operates in a fully online manner and handles temporally uncorrelated and multi-sensor measurements. Additionally, it has a worst-case loglinear complexity in the number of sensor detections while providing full output explainability. We evaluate the proposed approach in different Monte Carlo simulation scenarios and compare it against state-of-the-art methods, including POM-based filtering, DBSTREAM clustering, and JPDA. The results demonstrate that SODA-CitrON consistently outperforms the compared methods in terms of F1 score, position RMSE, MOTP, and MOTA in the static object mapping scenarios studied.

2606.02519 2026-06-09 cs.RO 版本更新

IMAC-AgriVLN: Can Agricultural Vision-and-Language Navigation Agents be Aware of Instruction Mistakes?

IMAC-AgriVLN:农业视觉与语言导航智能体能否意识到指令错误?

Xiaobei Zhao, Xingqi Lyu, Xin Chen, Xiang Li

发表机构 * China Agricultural University(中国农业大学) China Agricultural University-Sichuan Advanced Agricultural & Industrial Institute(中国农业大学-四川先进农业与工业研究院)

AI总结 针对农业VLN中指令可能错误的问题,提出A2A-MI基准和IMAC模块,通过分析指令与前方图像判断并纠正错误,显著提升导航性能。

详情
AI中文摘要

农业机器人在广泛的农业任务中充当着强大的助手,然而其移动仍然严重依赖手动操作或轨道系统。AgriVLN方法和A2A基准率先将视觉与语言导航(VLN)扩展到农业领域,使机器人能够按照自然语言指令导航到目标位置。然而,几乎所有先前的方法都采用了一个理想假设,即给定的指令本身是正确的,这与现实场景不符,因为任何人都可能说出带有错误的指令。为弥补这一差距,我们提出了A2A-MI基准,其中构建了一个半自动数据标注器,以更多样化和高效的方式将三类错误插入到每个原始指令中。我们在该基准上测试了几种最先进的农业VLN智能体,观察到性能大幅下降(SR为-57%,NE为-9%),由此我们认为农业VLN智能体倾向于假设给定指令是正确的,因此当它看到的场景与接收到的指令不一致时,没有怀疑指令的意识。为了建立对指令错误的意识,我们提出了IMAC模块,该模块分析指令和当前前方图像,判断指令是否有错误,并在需要时尝试纠正。我们将IMAC集成到基线模型中,观察到显著的改进,大幅缩小了与无错误指令性能的差距。项目:https://github.com/AlexTraveling/IMAC-AgriVLN

英文摘要

Agricultural robots serve as powerful assistants across a wide range of agricultural tasks, yet they still rely heavily on manual operation or railway systems for movement. The AgriVLN method and the A2A benchmark pioneered the extension of Vision-and-Language Navigation (VLN) to the agricultural domain, enabling a robot to navigate to a target position by following a natural language instruction. However, almost all prior methods adopt the idealized assumption that the given instructions are correct, which does not match realistic scenarios, because anyone may give an instruction that contains mistakes. To bridge this gap, we propose the A2A-MI benchmark, for which we build a semi-automatic data annotator that inserts three categories of mistakes into each original instruction in a more diversified and efficient way. We test several state-of-the-art agricultural VLN agents on it and observe a substantial degradation (-57% on SR and -9% on NE), which suggests that an agricultural VLN agent tends to assume that the given instruction is correct and therefore lacks the awareness to doubt it when the scenes it sees do not match the instruction it receives. To build awareness of instruction mistakes, we propose the IMAC module, which analyzes the instruction and the current front-facing image to judge whether the instruction contains mistakes and attempts to correct it when needed. We integrate IMAC into the baseline model and observe a noteworthy improvement that substantially narrows the gap to the performance on instructions without mistakes. Project: https://github.com/AlexTraveling/IMAC-AgriVLN.

2604.04554 2026-06-09 cs.CV cs.RO 版本更新

Relational Epipolar Graphs for Robust Relative Camera Pose Estimation

基于关系对极图的鲁棒相对相机姿态估计

Prateeth Rao, Sachit Rao

发表机构 * International Institute of Information Technology(国际信息科技研究所)

AI总结 本文提出基于对极图的关系推断方法,用于估计相对相机姿态,通过图操作估计旋转、平移和本质矩阵,提升对密集噪声和大基线变化的鲁棒性。

Comments 21 pages, 11 figures, 11 Tables, Submitted to IJCV

详情
AI中文摘要

视觉同步定位与建图(VSLAM)的关键组成部分是利用匹配的关键点估计相对相机姿态。噪声对应关系给准确估计带来了挑战。经典方法依赖于随机假设采样和迭代估计,而基于学习的方法通常缺乏显式的几何结构。在本文中,我们将相对姿态估计重新表述为对极对应图上的关系推断问题,其中匹配的关键点是节点,邻近的节点通过边连接。修剪、消息传递和池化等图操作用于估计四元数旋转、平移向量和本质矩阵(EM)。通过最小化由以下各项组成的损失得到图像对之间的相对姿态:(i)与真值(GT)的$\mathcal{L}_2$差异,(ii)估计EM与GT EM之间的Frobenius范数,(iii)奇异值差异,(iv)航向角差异,(v)尺度差异。匹配采用稠密的无检测器方法LoFTR。在室内和室外基准测试中,与经典方法和学习引导方法相比,该方法对密集噪声和大基线变化表现出更好的鲁棒性,突显了全局关系共识的有效性。

英文摘要

A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) $\mathcal{L}_2$ differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.
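
Two of the quantities named in the abstract, the Essential Matrix and the Frobenius-norm loss term (ii), can be illustrated with a short sketch. It uses the standard definition E = [t]x R, in which a point seen as x1 in the first camera and x2 = R x1 + t in the second satisfies x2^T E x1 = 0. The helper names are illustrative, and this is not the paper's graph network or its full five-term loss.

```python
import math

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ v equals t x v."""
    x, y, z = t
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]

def essential(R, t):
    """Essential matrix for relative rotation R and translation t."""
    return matmul(skew(t), R)

def frobenius_diff(A, B):
    """Frobenius norm of A - B, as in loss term (ii) of the abstract."""
    return math.sqrt(
        sum((A[i][j] - B[i][j]) ** 2 for i in range(3) for j in range(3))
    )

def epipolar_residual(E, x1, x2):
    """x2^T E x1, which is zero for a noise-free correspondence."""
    Ex1 = [sum(E[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * Ex1[i] for i in range(3))

# Ground-truth motion: 90 degree rotation about z plus a shift along x.
R_gt = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t_gt = (1.0, 0.0, 0.0)
E_gt = essential(R_gt, t_gt)

# A 3D point in the first camera frame and the same point in the second.
X1 = [0.5, 0.2, 2.0]
X2 = [sum(R_gt[i][j] * X1[j] for j in range(3)) + t_gt[i] for i in range(3)]
residual = epipolar_residual(E_gt, X1, X2)
```

Because the essential matrix is only defined up to scale, implementations usually normalize both matrices before comparing them; that step is omitted here for brevity.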

5. 人机交互与协作机器人 11 篇

2606.07934 2026-06-09 cs.RO 新提交

X-OP: Cross-Morphology Whole-Body Teleoperation via MPC Retargeting

X-OP:基于MPC重定向的跨形态全身遥操作

Jen-Wei Wang, Sarthak Kaingade, Andrea Tagliabue, Nicholas Morozovsky

发表机构 * Amazon(亚马逊) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种基于单个XR设备的层次化全身遥操作框架,通过MPC重定向联合优化操作者意图与机器人动态可行性,无需针对特定机器人重新训练策略,在仿真和实物实验中显著提升任务成功率。

Comments 9 pages, 4 figures

详情
AI中文摘要

全身遥操作对于在移动操作任务中实现可扩展的机器人数据收集至关重要,然而现有依赖外骨骼套装或多摄像头设置的方法带来了高昂的成本、复杂性和环境限制。最近使用单一扩展现实(XR)设备结合端到端强化学习策略的方法部分解决了这些限制,但需要针对特定机器人重新训练,遭受分布外故障,并且依赖于忽略动态可行性的运动重定向。我们提出了一种由单个XR设备驱动的层次化全身遥操作框架,该框架无需重新训练特定机器人的策略即可泛化到不同机器人形态。基于模型预测控制(MPC)的运动重定向器联合优化与操作者意图的对齐以及机器人的动态可行性,为现有的低级控制器生成最优命令。为了确保鲁棒的在线执行,我们引入了一种状态同步方法,在每个MPC步骤重置模拟器状态以处理嘈杂的真实世界测量和接触敏感性,并集成基于SLAM的全局位姿反馈以减轻长期漂移。仿真结果显示,在全身控制任务中,与基线相比,人形机器人(完成时间降低30%以上,功耗降低20%)和移动操作臂(零碰撞)均取得了更高的成功率。真实世界实验进一步验证了我们方法的有效性和灵活性,展示了所提出的重定向器在两个平台上成功部署于全身控制任务,并允许用户根据偏好轻松调整遥操作行为。该即插即用框架为全身机器人遥操作提供了一种可扩展、形态无关的解决方案,实现了实时行为定制和跨平台的广泛适用性。

英文摘要

Whole-body teleoperation is essential for scalable robot data collection in loco-manipulation tasks, yet existing approaches relying on exoskeleton suits or multi-camera setups impose prohibitive cost, complexity, and environmental constraints. Recent methods using a single extended reality (XR) device with end-to-end reinforcement learning policies partially address these limitations but require robot-specific retraining, suffer from out-of-distribution failures, and rely on motion retargeting that neglects dynamic feasibility. We propose a hierarchical whole-body teleoperation framework driven by a single XR device that generalizes across diverse robot morphologies without retraining robot-specific policies. A Model Predictive Control (MPC)-based motion retargeter jointly optimizes alignment with the operator's intent and the robot's dynamic feasibility, generating optimal commands for existing low-level controllers. To ensure robust online execution, we introduce a state synchronization method that resets the simulator state at each MPC step to handle noisy real-world measurements and contact sensitivity, and integrate SLAM-based global pose feedback to mitigate long-term drift. Simulation results show higher success rates on whole-body control tasks for both a humanoid (over 30% lower completion time and 20% lower power consumption) and a mobile manipulator (zero collisions) compared to baselines. Real-world experiments further validate the effectiveness and flexibility of our method, demonstrating the successful deployment of the proposed retargeter on both platforms for whole-body control tasks and the ease of allowing users to adjust teleoperation behavior based on their preferences. This plug-and-play framework offers a scalable, morphology-agnostic solution for whole-body robot teleoperation, enabling real-time behavioral customization and broad applicability across platforms.

2606.08099 2026-06-09 cs.RO 新提交

Cybernetic Android Avatar "Yui": System Integration, Field Deployment, and Evaluation

赛博仿人机器人化身“Yui”:系统集成、现场部署与评估

Kaoruko Shinkawa, Mizuki Nakajima, Taisei Mogi, Yoshihiro Nakata

发表机构 * The University of Electro-Communications(电气通信大学) Tokyo Denki University(东京电机大学)

AI总结 提出全身赛博仿人机器人化身Yui,集成操作者侧的沉浸式遥操作与对话者侧的类人社交信号,通过世博会长期展览、远程教育交流等实际部署验证可行性,获得共在感和情绪传达的积极评价。

Comments 47 pages, 20 figures, 10 tables. Submitted to International Journal of Social Robotics

详情
AI中文摘要

远程通信技术已广泛使用,但在许多社交互动场景中,支持共享物理空间感和传达丰富的非语言线索仍然具有挑战性。本研究介绍了“Yui”,一种全身赛博仿人机器人化身,旨在将操作者侧的沉浸式遥操作与对话者侧的类人社交信号相结合。Yui结合了55自由度的全身机构与先前开发的仿人机器人头部、面部表情和注视控制、上半身和手臂运动、手部驱动以及移动平台。它可以通过基于头戴显示器的沉浸式模式或基于网络摄像头的桌面模式进行操作。我们通过三个实际部署评估了系统:2025年日本大阪·关西世博会的长期公共展览、小学生之间的远程教育交流以及与普通参与者的公共互动研究。在世博会部署期间,两个单元累计运行约1131小时,既展示了运行可行性,也暴露了维护方面的挑战。在公共研究中,操作者和对话者均报告了对共在感的积极印象和使用意愿。对话者还在类人性和情绪及意图传达方面对化身给予了积极评价。结果表明系统对普通操作者具有可用性,同时在精确可控性方面存在改进空间。这些发现为可社交部署的全身仿人机器人化身提供了现场证据和设计启示。

英文摘要

Remote communication technologies have become widely used; however, supporting a sense of shared physical space and conveying rich non-verbal cues remain challenging in many social interaction scenarios. This study presents "Yui," a full-body cybernetic android avatar designed to integrate operator-side immersive teleoperation with interlocutor-side human-like social signaling. Yui combines a 55-degree-of-freedom full-body mechanism with a previously developed android head, facial expression and gaze control, upper-body and arm motion, hand actuation, and a mobile platform. It can be operated in either an immersive mode with a head-mounted-display-based interface or a desktop mode with a webcam-based interface. We evaluated the system through three real-world deployments: a long-term public exhibition at Expo 2025 in Osaka, Kansai, Japan; a remote educational exchange between elementary school students; and a public interaction study with general participants. During the Expo deployment, two units accumulated approximately 1131 h of operation, demonstrating both operational feasibility and maintenance challenges. In the public study, both operators and interlocutors reported positive impressions of co-presence and willingness to use the system. Interlocutors also rated the avatar positively in terms of human likeness and the transmission of emotions and intentions. The results indicate usability for general operators while suggesting room for improvement in precise controllability. These findings provide field-derived evidence and design implications for socially deployable full-body android avatars.

2606.08169 2026-06-09 cs.RO cs.AI cs.CL cs.HC cs.LG 新提交

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

CLASP:采用任务参数化学习的语言驱动机器人技能选择与组合

Markus Knauer, Valentin Gieraths, Tai Mai, Samuel Bustamante, Alin Albu-Schäffer, Freek Stulp, João Silvério

发表机构 * German Aerospace Center (DLR), Institute of Robotics and Mechatronics (RMC)(德国航空航天中心(DLR),机器人与机电一体化研究所(RMC)) Technical University of Munich (TUM)(慕尼黑工业大学(TUM))

AI总结 提出CLASP架构,结合任务参数化核化运动基元(TP-KMP)与预训练视觉语言模型(VLM),通过自然语言命令实现技能选择、组合和主动学习,无需微调,在7自由度机械臂上达到73.3%-100%成功率。

Comments 23 pages, 11 figures, 4 tables, 1 listing

详情
AI中文摘要

使机器人能够理解自然语言命令并执行任务,同时保持数据效率仍然具有挑战性。视觉-语言-动作(VLA)和视觉-语言模型(VLM)等基础模型提供了直观的交互通道,但需要大量数据;任务参数化模仿学习实现了数据效率,但缺乏自然语言基础。这项工作通过一个模块化架构弥合了这一差距,该架构将任务参数化核化运动基元(TP-KMP)与预训练VLM相结合。在学习过程中,技能从2到5次动觉演示中获取,VLM生成描述每个技能参数和前提条件的技能模式。在执行过程中,VLM解释命令以选择技能,推理参数绑定,并通过协方差加权组合创建新颖行为。当没有技能或组合足够时,系统识别能力差距并请求有针对性的演示,所有这些都无需微调。在7自由度机械臂上的验证显示,在需要技能选择、组合和主动学习的场景中,成功率达到73.3%-100%。

英文摘要

Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
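
One common way to read "covariance-weighted composition" is as a product of Gaussians, where each skill contributes a mean and a covariance and the skill that is more certain at a given point dominates the blend. Whether CLASP uses exactly this form is an assumption of this sketch; the diagonal-covariance simplification and the function name are also illustrative.

```python
def compose_gaussians(means, variances):
    """Fuse per-skill Gaussian predictions with diagonal covariances.

    means[k][d] and variances[k][d] are the mean and variance that skill k
    predicts for dimension d. The fused precision is the sum of the skill
    precisions, and the fused mean is the precision-weighted average.
    """
    dims = len(means[0])
    fused_mean, fused_var = [], []
    for d in range(dims):
        precisions = [1.0 / var[d] for var in variances]
        total = sum(precisions)
        fused_var.append(1.0 / total)
        fused_mean.append(
            sum(p * mean[d] for p, mean in zip(precisions, means)) / total
        )
    return fused_mean, fused_var

# Skill A is confident in dimension 0 (variance 1.0), skill B is not (4.0),
# so the fused value in that dimension stays close to skill A's mean.
means = [[0.0, 1.0], [10.0, 1.0]]
variances = [[1.0, 2.0], [4.0, 2.0]]
fused_mean, fused_var = compose_gaussians(means, variances)
```

The fused variance is never larger than the smallest input variance, so combining skills in this way can only sharpen the prediction.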

2606.08214 2026-06-09 cs.RO 新提交

Agentic Neuro-Symbolic Planning and Commissioning for Human-in-the-Loop Industrial Robotics with Digital Twins

面向人机协同工业机器人的智能神经符号规划与调试:基于数字孪生

Zhihao Liu, Victor Nan Fernandez-Ayala, Tianyu Wang, Qiang Qin, Xi Vincent Wang, Dimos V. Dimarogonas, Lihui Wang

发表机构 * Royal Institute of Technology (KTH)(皇家理工学院(KTH))

AI总结 提出一种结合LLM语言理解与确定性验证执行的神经符号框架,采用SDI架构和两级恢复机制,在数字孪生中验证后执行,显著提升任务成功率。

详情
AI中文摘要

灵活的机器人自动化需要系统能够解释操作员意图、验证物理可行性,并在规划和执行阶段从执行失败中恢复。本文提出了一种面向人机协同工业机器人的智能神经符号框架,其中LLM用于需要语言理解或上下文推理的任务,而所有验证、排序和执行保持确定性。该框架将软件工程中的规划器-生成器-评估器(PGE)模式改编为面向工业机器人的指定器-设计器-检查器(SDI)架构,并结合基于LangGraph的动态路由进行故障恢复。两级恢复机制通过上下文感知编排处理结构级重新规划,并通过确定性恢复技能处理执行级几何故障。Unity3D数字孪生支持在物理执行前进行人工检查、修改和重新验证。在多个难度级别的自然语言命令上对十个基线进行评估,所提方法实现了最高的任务成功率。消融结果证实,结构化命令扩展、符号验证、选择性LLM路由和恢复技能各自都是必要的。

英文摘要

Flexible robotic automation requires systems that interpret operator intent, verify physical feasibility, and recover from execution failures across both the planning and execution stages. This paper proposes an agentic neuro-symbolic framework for human-in-the-loop industrial robotics, in which LLMs are used for tasks that require language understanding or contextual reasoning, while all verification, sequencing, and execution remain deterministic. The framework adapts the Planner-Generator-Evaluator (PGE) harness pattern from software engineering into a Specifier-Designer-Inspector (SDI) architecture for industrial robotics, combined with LangGraph-based dynamic routing for failure recovery. A two-tier recovery mechanism addresses structure-level replanning through context-aware orchestration and execution-level geometric failures through deterministic recovery skills. A Unity3D digital twin supports human inspection, modification, and re-verification prior to physical execution. Evaluated on natural-language commands across multiple difficulty levels against ten baselines, the proposed method achieves the highest task success. Ablation results confirm that structured command expansion, symbolic verification, selective LLM routing, and recovery skills are each individually necessary.

2606.08341 2026-06-09 cs.RO 新提交

Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation

面向人机装配遥操作的不确定性感知意图预测

Fnu Heman, Yixuan Wang, Kolin Xu, Conner Wallace, John Dang, Akhil Joshi, Jun Sheng, Pinhas Ben-Tzvi, Mingyu Cai

发表机构 * University of California, Riverside(加州大学河滨分校) University of Miami(迈阿密大学)

AI总结 提出结合层次迁移学习、共形预测和VLM引导校正的不确定性感知意图预测框架,利用人类演示数据预训练,仅用少量机器人数据即提升动作分割性能。

Comments 7 pages, 6 figures. Preprint version

详情
AI中文摘要

在人机协作的辅助遥操作中,准确的意图预测对于在长时程操作和装配任务中实现及时可靠的机器人辅助至关重要。这些系统需要持续理解用户行为,以实时识别动作、预测意图并检测错误。然而,机器人遥操作演示成本高且受硬件限制,而人类演示更易收集且提供丰富的时序结构。为解决这一挑战,我们提出了一种不确定性感知的人到机器人意图预测框架,该框架结合了:(1) 层次迁移学习,其中MS-TCN++在人类手部演示上预训练,并在有限的机器人遥操作数据上微调,以捕捉低级动作和高级任务意图;(2) 共形预测模块,提供具有统计覆盖保证的帧级预测集,用于可靠的不确定性量化和早期意图估计;(3) VLM引导的片段校正,利用视觉和时序上下文选择性审查低置信度或时序不确定的片段。该框架支持辅助遥操作中的动作识别、时序分割、意图预测和错误检测。在包含22个动作类别的机器人装配演示实验表明,仅使用16个机器人演示,人到机器人的微调将机器人测试集的Edit分数从70.50提升至80.70。Edit安全的VLM校正进一步将帧准确率从45.21%提升至46.42%,并提高了F1@25和F1@50,同时保持了Edit分数。这些结果表明,人类演示为鲁棒、不确定性感知的机器人动作分割提供了可扩展的预训练数据。代码和数据见项目网站。

英文摘要

In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enabling timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. These systems require continuous understanding of user behavior to recognize actions, anticipate intentions, and detect mistakes in real time. However, robot teleoperation demonstrations are costly and hardware-limited, whereas human demonstrations are easier to collect and provide rich temporal structure. To address this challenge, we propose an uncertainty-aware human-to-robot intention prediction framework that combines: (1) hierarchical transfer learning, where MS-TCN++ is pretrained on human hand demonstrations and fine-tuned on limited robot teleoperation data to capture low-level actions and high-level task intentions; (2) a conformal prediction module that provides frame-level prediction sets with statistical coverage guarantees for reliable uncertainty quantification and early intention estimation; and (3) VLM-guided segment correction, which selectively reviews low-confidence or temporally uncertain segments using visual and temporal context. The framework supports action recognition, temporal segmentation, intention anticipation, and mistake detection for assisted teleoperation. Experiments on robot assembly demonstrations with 22 action classes show that human-to-robot fine-tuning improves the robot test-set Edit score from 70.50 to 80.70 using only 16 robot demonstrations. Edit-safe VLM correction further improves frame accuracy from 45.21% to 46.42% and increases F1@25 and F1@50 while preserving the Edit score. These results show that human demonstrations provide scalable pretraining data for robust, uncertainty-aware robot action segmentation. Code and data: project website.
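
The conformal prediction component in item (2) can be illustrated with a generic split-conformal sketch for per-frame classification. This is the textbook procedure, not necessarily the exact variant used in the paper: the nonconformity score (one minus the probability of the true class), the function names, and the toy numbers are all assumptions.

```python
import math

def calibrate(cal_probs, cal_labels, alpha):
    """Return the score threshold for a target miscoverage rate alpha.

    cal_probs[i] is the predicted class distribution for calibration
    frame i and cal_labels[i] is its true class index.
    """
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1.0 - alpha))
    # With too few calibration frames, every class must be included.
    return scores[k - 1] if k <= n else float("inf")

def prediction_set(probs, qhat):
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

# Toy calibration data: four frames, three action classes.
cal_probs = [
    [0.9, 0.05, 0.05],
    [0.3, 0.6, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
cal_labels = [0, 0, 1, 2]
qhat = calibrate(cal_probs, cal_labels, alpha=0.25)

confident_frame = prediction_set([0.9, 0.07, 0.03], qhat)
ambiguous_frame = prediction_set([0.5, 0.45, 0.05], qhat)
```

A frame whose prediction set contains more than one class is a natural candidate for the kind of selective review that item (3) describes.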

2606.08458 2026-06-09 cs.RO 新提交

Personalized and Robust Proactive Robot Assistance with Uncertainty-Guided LLM Reasoning

个性化且鲁棒的主动机器人辅助:基于不确定性引导的大语言模型推理

Alvaro Gonzalez, M. H. Hasan Shovo, Ali Ayub

发表机构 * Concordia University(康考迪亚大学)

AI总结 提出GLOBE框架,结合n-gram马尔可夫模型与不确定性引导的大语言模型推理,在家庭环境中实现高效鲁棒的主动机器人辅助,并在HOMER-Noise数据集上验证了其性能与效率。

Comments Accepted to the 2026 IEEE 35th International Conference on Robot and Human Interactive Communication (RO-MAN)

详情
AI中文摘要

在家庭环境中,主动机器人辅助需要在动态和嘈杂条件下准确预测人类活动和物体使用。现有方法通常依赖复杂的时空模型,这些模型计算成本高且对环境变化敏感。本文提出GLOBE,一个轻量级框架,结合n-gram马尔可夫模型捕捉时间行为模式与不确定性引导的大语言模型推理。该框架高效执行序列预测,仅在模型置信度低时选择性调用大语言模型推理。为评估现实条件下的性能,我们引入HOMER-Noise,即HOMER+数据集的噪声扩展,模拟由人类、宠物和幼儿引起的物体移动等结构化干扰。实验结果表明,GLOBE在干净和嘈杂环境下均达到与最先进方法竞争的性能,同时提高了鲁棒性和计算效率。该框架进一步通过与Stretch 3移动操作器的概念验证集成得到验证,展示了其在真实人机交互场景中的潜在应用。

英文摘要

Proactive robot assistance in household environments requires accurate prediction of human activities and object usage under dynamic and noisy conditions. Existing approaches often rely on complex spatio-temporal models, which can be computationally expensive and sensitive to environmental variability. In this paper, we propose GLOBE, a lightweight framework that combines n-gram Markov models for capturing temporal behavioral patterns with uncertainty-guided large language model (LLM) reasoning. The framework performs sequential prediction efficiently while selectively invoking LLM reasoning only when the model confidence is low. To evaluate performance under realistic conditions, we introduce HOMER-Noise, a noisy extension of the HOMER+ dataset that simulates structured disturbances such as object movements caused by humans, pets, and toddlers. Experimental results show that GLOBE achieves competitive performance with state-of-the-art methods while improving robustness and computational efficiency across both clean and noisy settings. The framework is further validated through a proof-of-concept integration with a Stretch 3 mobile manipulator, demonstrating its potential application in real-world human-robot interaction scenarios.
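
The combination of an n-gram Markov predictor with confidence-gated fallback can be sketched as follows. This is an illustration of the general pattern and not GLOBE's implementation: the class, the max-probability confidence measure, the threshold, and the `fallback` callable (a stand-in for an LLM query) are all hypothetical.

```python
from collections import Counter, defaultdict

class NGramPredictor:
    """Count-based next-item predictor over activity sequences (n >= 2)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        """Count which item follows each context of n - 1 items."""
        for seq in sequences:
            for i in range(len(seq) - self.n + 1):
                context = tuple(seq[i:i + self.n - 1])
                self.counts[context][seq[i + self.n - 1]] += 1

    def predict(self, history):
        """Return (most likely next item, its relative frequency)."""
        context = tuple(history[-(self.n - 1):])
        counter = self.counts.get(context)
        if not counter:
            return None, 0.0
        item, count = counter.most_common(1)[0]
        return item, count / sum(counter.values())

def predict_with_fallback(model, history, fallback, threshold=0.6):
    """Use the cheap model when it is confident, else defer to fallback."""
    guess, confidence = model.predict(history)
    if guess is None or confidence < threshold:
        return fallback(history), "fallback"
    return guess, "ngram"

model = NGramPredictor(n=2)
model.fit([
    ["wake", "coffee", "work"],
    ["wake", "coffee", "work"],
    ["wake", "tea", "work"],
    ["lunch", "coffee", "nap"],
])

def stub_llm(history):
    """Placeholder for an expensive reasoning call."""
    return "ask_llm"

confident = predict_with_fallback(model, ["wake"], stub_llm)
unseen = predict_with_fallback(model, ["nap"], stub_llm)
```

The gate keeps the expensive call off the common path, which matches the abstract's claim that LLM reasoning is invoked only when model confidence is low.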

2606.08741 2026-06-09 cs.RO 新提交

Safe, Fluent and Acceptable Motion Generation and Execution for Human-Robot Interaction in Manufacturing Environments

制造环境中人机交互的安全、流畅与可接受运动生成与执行

Thibaut Lopez, Olivier Aycard, Pierre-Brice Wieber, Mohamed Boua, Christine Jeoffrion

发表机构 * GIPSA Lab(GIPSA实验室) Grenoble Institute of Technology(格勒诺布尔理工学院) Inria(法国国家信息与自动化研究所) LIP/PC2S(LIP/PC2S实验室) Univ. Grenoble Alpes(格勒诺布尔阿尔卑斯大学) Univ. Savoie Mont Blanc(萨瓦勃朗峰大学)

AI总结 针对人机共享环境,提出结合安全与社交感知的运动生成策略,通过MPC框架生成四种社交行为,用户研究表明机器人行为显著影响社会可接受性。

详情
AI中文摘要

在人类环境中运行的机器人不仅要确保物理安全,还要表现出人类伙伴可理解、流畅和可接受的行为。本文研究了结合安全保障与交互质量考虑(如运动平滑性和人类舒适度)的运动生成策略。虽然能够确保共享人机环境中安全的机器人设计已经实现了更紧密、更高级的交互形式,但这些新的基于近距离的任务需要超越纯技术考虑。特别是,机器人行为还必须从心理认知和社会角度加以解决。在此背景下,我们论证了将社交感知运动控制集成到机器人系统中的相关性。首先,我们识别了影响人类感知和操作员体验的运动参数。然后,我们实现了一个模型预测控制(MPC)框架,该框架生成四种不同的社交知情机器人行为。最后,我们进行了一项用户研究,以评估和验证这些行为,并评估它们对非专家参与者的社会影响。结果表明,机器人行为的变化显著影响系统的感知社会可接受性。这些发现强调了将以人为本的考虑纳入共享环境中机器人运动生成策略的重要性。

英文摘要

Robots operating in human environments must not only ensure physical safety but also exhibit behaviors that are understandable, fluent, and acceptable to human partners. This paper investigates motion generation strategies that combine safety guarantees with interaction quality considerations, such as motion smoothness and human comfort. While the design of robots capable of ensuring safety in shared human-robot environments has enabled closer and more advanced forms of interaction, these new proximity-based tasks require moving beyond purely technical considerations. In particular, robot behavior must also be addressed from psycho-cognitive and social perspectives. In this context, we argue for the relevance of integrating social-aware motion control into robotic systems. First, we identify the motion parameters that influence human perception and operator experience. Then, we implement a Model Predictive Control (MPC) framework that generates four distinct socially-informed robot behaviors. Finally, we conduct a user study to evaluate and validate these behaviors and assess their social impact on non-expert participants. The results demonstrate that variations in robot behavior significantly affect the perceived social acceptability of the system. These findings highlight the importance of incorporating human-centered considerations into motion generation strategies for robots operating in shared environments.

2606.09255 2026-06-09 cs.RO 新提交

RPO-PDT: Demonstrating Role-Play-Based Knowledge Adaptation for Student Support Dialogue (Demonstration System)

RPO-PDT:展示基于角色扮演的知识适应用于学生支持对话(演示系统)

Filip Janik, Ewa Olton, Robert Smales, Harris Spratt, Shea Tait, Md Zia Ullah, Yanchao Yu

发表机构 * Edinburgh Napier University(爱丁堡龙比亚大学)

AI总结 提出RPO-PDT系统,通过检索增强和角色扮演循环,实现高等教育中基于结构化知识源的个性化学生支持对话,并确保安全与适应性。

Comments 5 pages, 2 figures

详情
AI中文摘要

我们提出RPO-PDT:一个以检索为依据、基于角色扮演的对话系统,用于高等教育中的自适应学生支持。RPO-PDT:(1)能够利用结构化知识源提供机构特定的个人发展导师(PDT)指导;(2)受明确的角色、边界、保密性和安全策略约束;(3)围绕反向角色扮演循环设计,其中未解决的交互从学生视角重放,从而生成替代的导师策略并存储为可重用的策略记忆。RPO-PDT支持基于文本和基于Furhat的具身交互,用于演示有依据、安全且自适应的学生支持对话。

英文摘要

We present RPO-PDT: a retrieval-grounded, role-play-based dialogue system for adaptive student support in higher education. RPO-PDT is: (1) able to provide institution-specific Personal Development Tutor (PDT) guidance using structured knowledge sources; (2) constrained by explicit persona, boundary, confidentiality, and safety policies; and (3) designed around a reverse-roleplay loop where unresolved interactions are replayed from the student perspective, enabling alternative tutor strategies to be generated and stored as reusable strategy memory. RPO-PDT supports both text-based and Furhat-based embodied interaction for demonstrating grounded, safe, and adaptive student-support dialogue.

2606.07551 2026-06-09 cs.CY cs.HC cs.RO 交叉投稿

Astro, I'm Home! Investigating Factors that Influence the Acceptance of Home Robots Using Supervised Machine Learning

Astro,我回家了!利用监督机器学习研究影响家庭机器人接受度的因素

Katrin Fischer, Essence Wilson, Steffie Kim, Dmitri Williams

发表机构 * University of Southern California(南加州大学)

AI总结 本研究运用正则化技术(如Lasso和Ridge回归)分析影响社交机器人接受度的因素,发现绩效期望、社会影响和享乐动机是使用意图的最强预测因子,并识别出可用性、信任和能力等新变量。

Comments Preprint submitted to the 18th International Conference on Social Robotics (ICSR 2026)

详情
AI中文摘要

社交机器人在家庭环境中的使用正在增加。这项探索性研究应用正则化技术(例如Lasso和Ridge回归)来调查变量并识别社交机器人背景下技术接受的新模型。在原始的UTAUT2框架内,绩效期望、社会影响和享乐动机成为使用技术意图的最强和最一致的预测因子。此外,可用性、信任和能力被识别为预测使用意图模型中的有希望的变量。

英文摘要

The use of social robots in home environments is on the rise. This exploratory study applies regularization techniques (e.g., Lasso and Ridge regression) to investigate variables and identify new models of technology acceptance in the context of social robots. Within the original UTAUT2 framework, performance expectancy, social influence, and hedonic motivation emerged as the strongest and most consistent predictors of intention to use the technology. In addition, usability, trust, and competence were identified as promising variables in a model predicting intention to use.
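Illustrative note (not from the paper): the abstract names Lasso and Ridge regression as its regularization techniques. The sketch below shows, on a tiny synthetic design, how the L1 penalty sets weak coefficients exactly to zero while the L2 penalty only shrinks them. The data, the penalty values, and the function names `soft_threshold` and `fit_regularized` are assumptions made for illustration; the study's survey variables and settings are not reproduced here.

```python
# Toy illustration only: a synthetic, orthogonal design, not the study's survey data.

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal step of the L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_regularized(X, y, l1=0.0, l2=0.0, sweeps=50):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + l1*||w||_1 + (l2/2)*||w||^2."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho = 0.0   # correlation of feature j with the residual that ignores feature j
            norm = 0.0  # squared length of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, l1) / (norm / n + l2)
    return w


# Three hypothetical predictors: a strong one, an irrelevant one, and a weak one.
X = [[1.0, 1.0, 1.0],
     [1.0, -1.0, -1.0],
     [-1.0, 1.0, -1.0],
     [-1.0, -1.0, 1.0]]
y = [2.0 * row[0] + 0.05 * row[2] for row in X]

lasso_w = fit_regularized(X, y, l1=0.1)  # L1 only: weak predictors are dropped
ridge_w = fit_regularized(X, y, l2=0.1)  # L2 only: every coefficient is shrunk, none removed
print("lasso:", lasso_w)
print("ridge:", ridge_w)
```

With this design the Lasso fit keeps only the strong predictor, whereas the Ridge fit retains a small nonzero weight on the weak one, which is the usual reason to run both when screening candidate variables.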

2606.09390 2026-06-09 cs.CV cs.AI cs.RO 交叉投稿

Real-time body pose non-verbal communication with a consistency-based reliability measure

基于一致性可靠性度量的实时身体姿态非语言通信

Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu

发表机构 * National University of Science and Technology "Politehnica" Bucharest(布加勒斯特理工大学) Simion Stoilow Institute of Mathematics of the Romanian Academy(罗马尼亚科学院西蒙·斯托伊洛数学研究所) NORCE Norwegian Research Centre AS(挪威研究中心)

AI总结 研究仅从2D身体姿态识别通信意图,提出自回归自一致性作为无监督可靠性信号,并在嵌入式GPU上实现实时性能。

详情
AI中文摘要

身体运动能在远距离或无法捕捉面部及语音的条件下传达意图。我们研究仅从2D身体姿态识别通信意图。我们认为身体运动是可靠的信号,尤其是在需要实时、低成本、设备端人-机器人通信的远距离场景中,如救援任务。然而,现有资源并未将这一信号单独分离出来:情感语料库混合了身体、面部、语音和文本,而骨架动作识别基准标注的是所执行的动作而非所传达的信息。我们发布了一个覆盖十种通信意图的全身姿态真实帧数据集,并将其与其他真实(IPC)和合成(MotionLCM、VEO3.1、Kimodo)数据集进行比较,这些数据集覆盖了不同难度。我们面向能在机器人有限的板载硬件上运行的系统。我们对多种模型进行了基准测试,从骨架图分类器到联合运动预测网络,并报告了性能指标以及在嵌入式GPU(NVIDIA Orin Nano)上的帧率,因为在我们的场景中速度与准确性同样重要。最后,我们展示了模型自身的自回归自一致性可作为无监督的可靠性信号。我们给出了一个简短证明,为自一致预测正确的概率给出界,表明该概率随一致步数的增加而增长,并指出了自信预测仍可能出错的条件,同时与行业标准指标进行了对比。

英文摘要

Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal especially in scenarios that require real-time, low-cost, on-device person-to-robot communication in long-distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) ones that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.
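Illustrative note (not the paper's proof): the claim that confidence grows with the number of consistent steps can be seen in a deliberately simple model. The sketch assumes a uniform prior over the ten intents, a per-step accuracy `p`, errors spread evenly over the wrong labels, and independence between steps. All of these are assumptions for illustration, and the function name `posterior_correct` and the value `p=0.6` are hypothetical.

```python
def posterior_correct(p, num_classes, k):
    """P(label is right | k consecutive steps all predicted it) in a toy independent-error model."""
    q = (1.0 - p) / (num_classes - 1)          # chance of one specific wrong label per step
    agree_if_right = p ** k                    # all k steps hit the true label
    agree_if_wrong = (num_classes - 1) * q ** k  # all k steps hit the same wrong label
    return agree_if_right / (agree_if_right + agree_if_wrong)


# Ten classes, matching the ten communicative intents mentioned in the abstract.
curve = [posterior_correct(p=0.6, num_classes=10, k=k) for k in range(1, 6)]
for k, value in enumerate(curve, start=1):
    print(f"{k} consistent steps -> posterior {value:.4f}")

# A model at chance level (p = 1/num_classes) gains nothing from agreeing with itself.
chance = posterior_correct(p=0.1, num_classes=10, k=5)
print(f"chance-level model after 5 steps -> {chance:.4f}")
```

In this toy setting the posterior starts at `p` for a single step and rises with each additional agreeing step. The independence assumption carries the whole argument: if successive steps repeat the same mistake, agreement stops being evidence of correctness.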

2511.17855 2026-06-09 cs.AI cs.RO 版本更新

QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents

QuickLAP: 为半自主代理快速语言-动作偏好学习

Jordan Abi Nader, David Lee, Nathaniel Dennler, Andreea Bobu

AI总结 本研究提出QuickLAP,一种融合物理和语言反馈的贝叶斯框架,用于实时推断奖励函数,通过大规模语言模型提取奖励特征注意力掩码和偏好偏移,从而在半自主驾驶模拟器中将奖励学习误差降低70%,并通过用户研究验证其可理解性和协作性。

详情
AI中文摘要

机器人必须从人们的行为和语言中学习,但单一模态往往不完整:物理修正有实际依据但意图模糊,而语言表达高层目标但缺乏物理基础。我们引入QuickLAP:快速语言-动作偏好学习,一种贝叶斯框架,融合物理和语言反馈以实时推断奖励函数。我们的关键见解是将语言视为对用户潜在偏好的概率观测,明确哪些奖励特征重要以及如何解释物理修正。QuickLAP利用大语言模型(LLM)从自由形式的话语中提取奖励特征注意力掩码和偏好偏移,并通过闭式更新规则将其与物理反馈结合。这使奖励学习快速、实时且鲁棒,并能处理模糊反馈。在半自主驾驶模拟器中,QuickLAP相比仅物理和启发式多模态基线将奖励学习误差降低超过70%。15名参与者的用户研究进一步验证了我们的方法:参与者认为QuickLAP明显更易理解、更具协作性,并且相比基线更喜欢其学到的行为。代码见 https://github.com/MIT-CLEAR-Lab/QuickLAP 。

英文摘要

Robots must learn from both what people do and what they say, but either modality alone is often incomplete: physical corrections are grounded but ambiguous in intent, while language expresses high-level goals but lacks physical grounding. We introduce QuickLAP: Quick Language-Action Preference learning, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time. Our key insight is to treat language as a probabilistic observation over the user's latent preferences, clarifying which reward features matter and how physical corrections should be interpreted. QuickLAP uses Large Language Models (LLMs) to extract reward feature attention masks and preference shifts from free-form utterances, which it integrates with physical feedback in a closed-form update rule. This enables fast, real-time, and robust reward learning that handles ambiguous feedback. In a semi-autonomous driving simulator, QuickLAP reduces reward learning error by over 70% compared to physical-only and heuristic multimodal baselines. A 15-participant user study further validates our approach: participants found QuickLAP significantly more understandable and collaborative, and preferred its learned behavior over baselines. Code is available at https://github.com/MIT-CLEAR-Lab/QuickLAP.

6. 具身智能与视觉语言动作模型 20 篇

2606.08107 2026-06-09 cs.RO cs.AI 新提交

Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

Ego-Pi: 面向自我中心人类与机器人数据的VLA微调

Ji Woong Kim, Ke Wang, Zipeng Fu, Sirui Chen, Cong Zhao, Jeff Lai, Chelsea Finn

发表机构 * Stanford University(斯坦福大学) Meta

AI总结 为解决机器人数据稀缺问题,利用自我中心人类数据,基于π₀.₅模型微调,使机器人学习新任务语义并组合现有技能,无需对应机器人数据。

详情
AI中文摘要

机器人技术面临数据稀缺的根本挑战。与语言或视觉研究不同,机器人操作没有互联网规模的数据集。一个有前景的途径是利用自我中心人类数据,这类数据更容易收集、范围更广且规模更大。为此,我们研究了跨人类和配备灵巧五指手的类人机器人实体学习的关键设计选择,以$π_{0.5}$模型为基础。我们的结果表明,人类数据使机器人能够学习新的任务语义,并将现有技能组合成新颖的行为,而无需相应的机器人数据。论文网站:https://egopipaper.github.io/

英文摘要

Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic manipulation. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. Towards this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the $π_{0.5}$ model as a foundation. Our results show that human data enables robots to learn new task semantics and compose existing skills into novel behaviors without corresponding robot data. The paper website is here: https://egopipaper.github.io/

2606.08288 2026-06-09 cs.RO 新提交

MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model

MotionVLA: 将几何运动注入视觉-语言-动作模型

Shanglin Yuan, Weiheng Zhao, Xianda Guo, Wei Sui, Li Yu, Wenyu Liu, Xinggang Wang

发表机构 * Huazhong University of Science and Technology(华中科技大学) D-Robotics(地瓜机器人) Wuhan University(武汉大学)

AI总结 提出MotionVLA,通过运动历史接口将过去视频窗口转换为紧凑的连续轨迹场令牌,解决长程操作中的几何漂移和时间线索碎片化问题,提升动作平滑性和执行效率。

Comments 17 pages, 8 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越多地基于历史、深度或4D特征来调节机器人策略,以解决长程操作中的歧义。然而,更多的时空证据并不一定更好:当注入的证据不是运动一致的时,它可能引入几何漂移、碎片化的时间线索和不稳定的动作生成。这提出了一个简单的问题:VLA应该记住过去的帧,还是记住连接它们的运动?我们引入了MotionVLA,一个运动历史接口,它将短时间仅包含过去的视频窗口转换为紧凑的、时间连续的轨迹场令牌。MotionVLA不是将历史视为一组稀疏的独立提升帧,而是将最近的观测表示为物理一致的运动证据。当前的视觉令牌查询这个历史以检索任务相关的运动信息,然后在轨迹基础的监督下重新耦合到VLA流中。在模拟基准和初步真实机器人部署上的实验表明,MotionVLA改善了长程操作,同时产生了更平滑、更直接的执行。这些结果表明,有效的VLA记忆不仅仅是提供更多的4D上下文,而是暴露可用于控制的运动一致证据。

英文摘要

Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to resolve ambiguity in long-horizon manipulation. However, more spatiotemporal evidence is not necessarily better: when the injected evidence is not motion-consistent, it can introduce geometric drift, fragmented temporal cues, and unstable action generation. This raises a simple question: should a VLA remember past frames, or remember the motion that connects them? We introduce MotionVLA, a motion-history interface that converts a short past-only video window into compact, time-continuous trajectory-field tokens. Instead of treating history as a sparse set of independently lifted frames, MotionVLA represents recent observations as physically coherent motion evidence. Current visual tokens query this history to retrieve task-relevant motion information, which is then recoupled into the VLA stream under trajectory-grounded supervision. Experiments across simulation benchmarks and preliminary real-robot rollouts show that MotionVLA improves long-horizon manipulation while producing smoother and more direct executions. These results suggest that effective VLA memory is not just about providing more 4D context, but about exposing motion-consistent evidence that is usable for control.

2606.08495 2026-06-09 cs.RO cs.CV 新提交

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

EgoPriMo:面向交互式人形控制的自我中心运动生成

Haoyang Ge, Peng Ren, Yukun Shi, Cong Huang, Kun Li, Kai Chen

发表机构 * Tianjin University(天津大学) Zhongguancun Academy(中关村学院) Beihang University(北京航空航天大学) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) DeepCybo

AI总结 提出EgoPriMo框架,通过自我中心人类演示学习全身运动先验,利用三流DiT联合建模身体动态、视觉上下文和文本,支持重建、生成和预测,并在Unitree人形机器人上执行。

详情
AI中文摘要

人形机器人需要适应场景上下文、任务要求和用户意图的全身运动。运动跟踪可以再现指定的轨迹,人形机器人视觉-语言-动作系统提供了语义接口,但两者都不能为广泛的全身行为提供可扩展且交互式的先验。我们提出了EgoPriMo(人形机器人自我中心运动先验),一个统一的框架,从自我中心人类演示中学习此类先验。给定自我中心观察和文本提示,EgoPriMo重建、生成和预测基于SMPL的全身运动。语言被用作高级控制信号,而不是完整的运动规范。EgoPriMo的核心是一个三流DiT,它联合建模身体动态、自我中心视觉上下文和文本;任务条件掩码通过同一个检查点路由不同的任务和缺失模态数据。在Nymeria和EgoExo4D上的实验表明,一个检查点在支持重建和预测的同时,改进了自我中心运动生成,优于UniEgoMotion;生成的SMPL运动也可以由Unitree人形控制器执行。这些结果表明了一条从可扩展的自我中心观察到可泛化和交互式人形运动先验的实用路径。

英文摘要

Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.

2606.08520 2026-06-09 cs.RO 新提交

Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

两座桥梁,一条路径:从VLM到具有具身轨迹耦合数据的可泛化VLA

Linqi Yin, Shiduo Zhang, Shenling Qiu, Chenxin Li, Zhaoyang Fu, Lei Xiao, Xiang Wang, Chenchen Yang, Zhe Xu, Pengfang Qian, Jingjing Gong, Xipeng Qiu, Xuanjing Huang, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出具身轨迹耦合(ETC)数据作为中间桥梁,通过三阶段训练策略(分布桥接、目标桥接、保留适应)将视觉语言模型(VLM)逐步转化为可泛化的视觉语言动作模型(VLA),解决从VLM到VLA的双重鸿沟。

详情
AI中文摘要

视觉语言模型(VLM)是强大的通用推理器,但将其转化为机器人控制策略(VLA)却异常困难。根本原因在于双重鸿沟:VLM在互联网规模的图像上训练,具有语言理解目标,而VLA必须感知机器人场景并预测电机动作。直接在机器人动作数据上微调VLM迫使模型同时跨越两个鸿沟:学习曲线陡峭,预训练期间获得的丰富泛化能力往往会退化而非迁移。我们认为,通过合适的中间数据可以逐步弥合这一鸿沟。我们引入了具身轨迹耦合(ETC)数据,即源自用于动作学习的相同机器人场景和轨迹的视觉语言监督。由于ETC数据共享机器人操作的视觉上下文,同时保留熟悉的语言理解目标,它提供了VLM预训练和VLA微调之间的自然垫脚石。基于此,我们设计了一个三阶段训练方案。分布桥接首先将VLM适应于具身视觉语言语义。目标桥接然后逐步将模型转向动作预测,同时保留已获得的表示。保留适应最后将策略专门化到目标部署领域。我们进一步证明,将任务相关的分布外ETC数据与少量动作数据混合,使模型能够泛化到新颖的视觉语言条件,而无需额外的机器人演示。仿真和真实机器人实验证实,这种逐步桥接策略是将VLM泛化能力迁移到鲁棒、可部署的机器人策略的关键。

英文摘要

Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a two-fold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once -- the learning curve is steep and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that this gap can be bridged gradually with the right intermediate data. We introduce embodied trajectory-coupled (ETC) data -- vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.

2606.09215 2026-06-09 cs.RO 新提交

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

MotionWAM:迈向实时人形机器人全身操作的基础世界动作模型

Jia Zheng, Teli Ma, Yudong Fan, Zifan Wang, Shuo Yang, Junwei Liang

发表机构 * Mondo Robotics HKUST (GZ)(香港科技大学(广州)) HKUST(香港科技大学)

AI总结 提出MotionWAM,一种实时世界动作模型,通过统一运动潜变量和全身动作令牌,实现单目相机驱动的自主人形机器人全身操作,在真实任务上成功率比VLA基线高30%以上。

详情
AI中文摘要

世界动作模型(WAM)将视频动态先验与策略耦合,在桌面操作中表现出令人鼓舞的结果,但高维视频-动作潜变量的迭代去噪使其对于实时人形机器人全身操作来说过于缓慢。主导的分层范式加剧了这一问题:高层操作策略仅控制上半身,而低层控制器跟踪粗略的底座指令,这使上半身和下半身处于不一致的动作空间中,并将腿部降级为仅用于保持平衡的行走。我们提出MotionWAM,一种实时WAM,它以视频世界模型的中间去噪特征作为策略的条件输入,从单个自我中心摄像头驱动自主人形机器人全身操作。MotionWAM用统一的运动潜变量取代了上下半身的分割,并预测全身动作令牌,在单个动作空间中联合覆盖行走、躯干运动、高度调节、足部交互和手部操作。一个三阶段学习框架逐步将视频世界模型适应于自我中心视觉动态和目标人形机器人具身。在九个真实世界的Unitree G1任务上,MotionWAM实时运行,在总体成功率上比在同一演示上微调的视觉-语言-动作(VLA)基线高出30%以上,并执行解耦的上下半身策略无法达到的任务驱动足部交互。我们的结果表明,视频预训练的WAM可以从桌面操作提升到协调的、类人的人形机器人全身控制。

英文摘要

World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands -- placing upper and lower body in inconsistent action spaces and reducing the legs to balance-preserving locomotion. We present MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, substantially outperforms Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by over 30% in overall success rate, and executes task-driven foot interaction that decoupled upper-lower policies cannot reach. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.

2606.09258 2026-06-09 cs.RO 新提交

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

回到熟悉的未来:通过预想里程碑选择实现VLA策略的故障恢复

Suyeon Shin, Juwon Kim, Hyeonbin Park, Hyunseo Kim, Hyundo Lee, Hyung-Sin Kim, Byoung-Tak Zhang

发表机构 * Seoul National University(首尔大学) Yonsei University(延世大学) Soongsil University(崇实大学)

AI总结 提出B2FF框架,通过预生成熟悉未来状态里程碑并选择恢复目标,使VLA策略在偏离轨迹时无需微调即可稳健恢复,成功率从56.3%提升至74.0%。

详情
AI中文摘要

视觉-语言-动作(VLA)策略在操作过程中可能偏离标称轨迹,即使任务在物理上仍然可行。从这些偏离中恢复具有挑战性,因为它们将策略推入陌生的状态空间,直接重新规划常常会破坏动作序列的稳定性。我们提出“回到熟悉的未来”(B2FF),一种面向预见性VLA的恢复框架,利用未来视觉条件作为恢复接口。在执行前,VLA基于干净的初始观察生成一个由熟悉未来状态组成的里程碑库。在恢复时,一个可恢复性感知的选择器从该库中选择一个恢复里程碑,并将其强制作为固定的视觉目标。这使得VLA能够将偏离轨迹的观察稳健地映射回熟悉的未来。在注入故障的LIBERO数据集上,在受控的恢复时间与注入故障对齐的情况下,B2FF将基线VLA的平均成功率从56.3%提升至74.0%,证明预想里程碑可以在不微调底层动作生成器的情况下指导恢复。

英文摘要

Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tasks remain physically feasible. Recovering from these deviations is challenging, as they push the policy into unfamiliar state spaces where direct re-planning frequently destabilizes action sequences. We propose Back to the Familiar Future (B2FF), a recovery framework for foresight-driven VLAs that leverages future visual conditioning as a recovery interface. Before execution, the VLA generates a milestone bank of familiar future states conditioned on the clean initial observation. At recovery time, a recoverability-aware selector selects a recovery milestone from this bank and enforces it as a fixed visual goal. This enables the VLA to robustly map off-trajectory observations back to a familiar future. On failure-injected LIBERO, under controlled recovery timing aligned with the injected failure, B2FF increases the average success rate of a baseline VLA from 56.3% to 74.0%, demonstrating that pre-imagined milestones can guide recovery without fine-tuning the low-level action generator.

2606.09286 2026-06-09 cs.RO 新提交

VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands

VAIC: 基于解耦命令的视觉引导人形机器人敏捷物体交互控制

Dongting Li, Qianyang Wu, Xingyu Chen, Liang Li, Yuhang Lin, Sikai Wu, Guoyao Zhang, Mingliang Zhou, Diyun Xiang, Qiang Zhang, Renjing Xu, Jianzhu Ma

发表机构 * Tsinghua University(清华大学) HKUST(Guangzhou)(香港科技大学(广州)) Xiaomi Robotics Lab(小米机器人实验室)

AI总结 提出VAIC框架,通过解耦命令和两阶段蒸馏范式,仅依靠机载深度、历史本体感知实现人形机器人的敏捷物体交互,在箱体搬运、推车、滑板等动态任务中超越基线。

Comments Webpage: https://vaic-humanoid.github.io/

详情
AI中文摘要

人形机器人在现实辅助中具有巨大潜力,但在非结构化环境中与物体的敏捷交互需要紧密耦合的全身协调。尽管近期取得了进展,当前控制器仍面临关键的部署差距:它们严重依赖密集的参考轨迹和完美的状态可观测性,这本质上限制了物理泛化。我们提出了视觉引导的敏捷交互控制(VAIC),这是一个统一框架,通过仅依靠机载深度、历史本体感知和解耦的用户命令接口来弥合这一差距。VAIC采用两阶段蒸馏范式。首先,一个特权教师策略利用精确的物体运动学和精确的环境状态掌握多样的交互技能。其次,一个可部署的学生策略通过将全身跟踪替换为多轴速度目标和每帧交互指示器来蒸馏这些能力。学生利用一个循环物体适应模块,从原始深度流和本体感知中隐式推断不可观测的物体动力学。在人形机器人上的评估和实际部署表明,单个VAIC策略能够成功执行高度多样的动态任务,包括箱体搬运、推车交互和滑板,持续优于基线,推动了自主人形机器人的部署。

英文摘要

Humanoid robots hold immense potential for real-world assistance, yet agile interaction with objects in unstructured environments demands tightly coupled whole-body coordination. Despite recent advancements, current controllers face a critical deployment gap. They rely heavily on dense reference trajectories and perfect state observability, which inherently limits physical generalization. We present Vision Guided Agile Interaction Control (VAIC), a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. VAIC employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world deployments on the humanoid robot demonstrate that a single VAIC policy successfully executes highly diverse dynamic tasks. These tasks include box carrying, cart interaction, and skateboarding, consistently outperforming baselines and advancing autonomous humanoid deployment.

2606.09572 2026-06-09 cs.RO cs.AI 新提交

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

CT-VAM: 一种小脑-丘脑启发的视觉-动作模型用于高效视觉运动控制

Jiacheng Li, Yize Guo, Jiabin Guo, Qingchen Liu, Jiahu Qin

发表机构 * University of Science and Technology of China(中国科学技术大学) AIRLab, Department of Automation(自动化系AIRLab)

AI总结 提出CT-VAM模型,通过TARS条件注意力解码器融合异构输入,以68M参数实现与大型VLA模型相当的LIBERO成功率,并降低推理延迟,支持高频控制。

详情
AI中文摘要

视觉-语言-动作模型在机器人操作中展现出强大潜力,然而原始语言主要用于指定任务意图,而非在高频低层执行过程中反复处理。受此分离的启发,我们提出了一种小脑-丘脑启发的视觉-动作模型(CT-VAM),用于高效的任务条件视觉运动控制。CT-VAM作为一个紧凑的局部执行策略,从双视角视觉观察、本体感觉和轻量级任务条件中预测动作块,从而可能实现一种实用的云-边缘范式,其中高层语义推理由大模型处理,而快速闭环控制在本地硬件上运行。为了有效融合异构输入,CT-VAM引入了TARS(丘脑动作路由流),一种流分离的条件注意力解码器,独立路由动作、视觉和任务流,防止密集的感官标记淹没紧凑的任务相关条件。仅凭68M参数,CT-VAM在LIBERO上取得了与更大规模VLA模型竞争的成功率,同时降低了推理延迟。结合用于异步块执行的流一致修补,CT-VAM支持高频控制,并在资源受限的机器人平台上展示了鲁棒的实时部署能力。

英文摘要

Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dual-view visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust real-world deployment on resource-constrained robotic platforms.

2606.09630 2026-06-09 cs.RO cs.AI cs.LG 新提交

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

ReCoVLA: VLM引导的奖励编译用于视觉-语言-动作策略的故障恢复

Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang, Kei Suzuki, Matthew Brand, Toshiaki Koike-Akino

发表机构 * University of Southern California(南加州大学) Mitsubishi Electric Research Laboratories (MERL)(三菱电机研究实验室) Harvard University(哈佛大学)

AI总结 提出ReCoVLA框架,通过冻结预训练VLA策略,利用外部VLM推断故障模式并编译结构化奖励,训练残差恢复策略,实现零样本仿真到真实部署,在多种操作任务中提升成功率。

Comments 19 pages, 7 figures

详情
AI中文摘要

视觉-语言-动作(VLA)策略为语言条件操作提供了强大的先验知识,但在需要针对性恢复的非标称状态下仍然脆弱。我们提出ReCoVLA——一种故障条件的残差恢复框架,它保持预训练的VLA策略冻结,使用外部视觉-语言模型(VLM)推断故障模式和恢复阶段,并从任务相关组件编译结构化奖励。ReCoVLA并非使用VLM直接生成动作或奖励,而是将其作为语义奖励选择器:它预测恢复描述符和奖励掩码,用于仿真中的残差策略训练,随后将训练好的恢复策略零样本部署到真实世界。这解耦了高层故障理解与低层纠正控制,以支持不同的VLA。在短时域、长时域和接触丰富的操作任务上的实验表明,ReCoVLA在平均性能上优于测试的基线。在仿真中,我们的奖励编译器将微调$π_{0.5}$基线的平均成功率从36.7%提升到66.7%。在物理零样本仿真到真实实验中,ReCoVLA取得了最佳平均性能,成功率为61.7%。

英文摘要

Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $π_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.

2606.09740 2026-06-09 cs.RO 新提交

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

ProbeAct: 视觉-语言-动作模型中的探针引导无训练故障恢复

Fan Zhang, Seongbin Park, Baharan Mirzasoleiman, Shariar Talebi, Nader Sehatbakhsh

发表机构 * University of California Los Angeles(加利福尼亚大学洛杉矶分校)

AI总结 提出ProbeAct框架,通过轻量级隐藏状态探针、运动学状态机和分层控制屏障函数,无需训练即可检测并恢复VLA模型的抓取与放置失败,在LIBERO-plus上将成功率从69.6%提升至74.1%。

Comments under review

详情
AI中文摘要

视觉-语言-动作(VLA)模型在训练分布内的语言条件机器人操作中表现出强大性能,但其泛化能力仍存在根本性限制。它们缺乏处理扰动的鲁棒性,在面临光照变化、视角改变或初始状态微小变化时经常失败。我们提出PROBEACT,一个无需训练的运行时干预框架,能够在不修改权重或额外演示的情况下,检测并恢复预训练VLA策略中的抓取和放置失败。PROBEACT结合了三个组件:(i)轻量级多目标隐藏状态探针,从中间VLA特征预测任务相关物体的3D位置,并采用匈牙利匹配的身份跟踪以处理多物体场景;(ii)与物体无关的运动学状态机,仅使用夹爪内部信号和末端执行器运动学检测抓取、搬运和放置失败;(iii)分层控制屏障函数(CBF)滤波器,将重复失败位置编码为软安全集约束,在保持基线行为的同时最小程度地修正VLA动作。作为即插即用、无需训练的干预循环,PROBEACT与现有训练流程正交。在LIBERO-plus基准上的评估表明,我们的框架作为通用安全网,将OpenVLA-OFT模型的成功率从69.6%提升至74.1%,并展示了在基础和微调VLA策略上的广泛适用性。

英文摘要

Vision-Language-Action (VLA) models demonstrate strong performance on language-conditioned robotic manipulation within their training distribution, yet their generalization capabilities remain fundamentally limited. They lack the robustness required to handle perturbations, frequently failing when confronted with lighting changes, altered camera viewpoints, or small initial-state variations. We propose PROBEACT, a training-free runtime intervention framework that detects and recovers from grasping and placement failures in pre-trained VLA policies without modifying their weights or requiring additional demonstrations. PROBEACT combines three components: (i) a lightweight multi-target hidden-state probe that predicts the 3D positions of task-relevant objects from intermediate VLA features, with Hungarian-matched identity tracking for multi-object scenes; (ii) an object-agnostic kinematic state machine that detects grasp, transport, and placement failures using only gripper-internal signals and end-effector kinematics; and (iii) a hierarchical Control Barrier Function (CBF) filter that encodes repeated-failure locations as soft safe-set constraints, minimally correcting VLA actions while preserving baseline behavior. As a plug-and-play, training-free intervention loop, PROBEACT is orthogonal to existing training pipelines. Evaluated on the LIBERO-plus benchmark, our framework acts as a universal safety net, improving the success rate of the OpenVLA-OFT model from 69.6% to 74.1%, while demonstrating broad applicability to both base and fine-tuned VLA policies.
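Illustrative note (not the paper's implementation): the third component describes a Control Barrier Function filter that corrects actions as little as possible. A minimal sketch of that idea follows, assuming single-integrator dynamics, one circular keep-out region around a hypothetical failure location, and a hard rather than soft constraint. The abstract's hierarchical, soft-constraint design is not reproduced, and the function name `cbf_filter` and all numbers are assumptions.

```python
def cbf_filter(position, nominal_action, center, radius, alpha=1.0):
    """Minimally adjust a velocity command so it respects a circular keep-out zone.

    Assumes position' = action and the barrier h(x) = ||x - center||^2 - radius^2,
    which is non-negative outside the zone.
    """
    offset = [x - c for x, c in zip(position, center)]
    h = sum(o * o for o in offset) - radius ** 2
    grad = [2.0 * o for o in offset]  # gradient of h at the current position
    # CBF condition: grad . action + alpha * h >= 0
    slack = sum(g * u for g, u in zip(grad, nominal_action)) + alpha * h
    grad_sq = sum(g * g for g in grad)
    if slack >= 0.0 or grad_sq == 0.0:
        return list(nominal_action)   # already satisfies the condition, leave untouched
    scale = -slack / grad_sq          # closed-form projection onto the constraint boundary
    return [u + scale * g for u, g in zip(nominal_action, grad)]


# Hypothetical numbers: the policy wants to move fast toward a zone centred at the origin.
unsafe_nominal = [-5.0, 1.0]
corrected = cbf_filter([2.0, 0.0], unsafe_nominal, center=[0.0, 0.0], radius=1.0)
print("corrected action:", corrected)

# An action that already moves away from the zone passes through unchanged.
safe_nominal = [1.0, 0.0]
passed = cbf_filter([2.0, 0.0], safe_nominal, center=[0.0, 0.0], radius=1.0)
print("unchanged action:", passed)
```

With a single affine constraint the minimum-norm correction has this closed form, so no optimization solver is needed. The correction only changes the component of the action that points into the zone, which matches the stated goal of preserving baseline behavior elsewhere.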

2606.09813 2026-06-09 cs.RO cs.CV 新提交

iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

iMaC: 将动作转化为运动与接触图像用于具身世界模型

Zhenyu Wu, Xiuwei Xu, Yukun Zhou, Yifan Li, Qiuping Deng, Xiaofeng Wang, Zheng Zhu, Bingyao Yu, Ziwei Wang, Jiwen Lu, Haibin Yan

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学) GigaAI Nanyang Technological University(南洋理工大学)

AI总结 提出iMac框架,将原始视觉图像作为动作表示,通过图像-动作编码器和动态预测器实现高保真未来状态预测和闭环控制,在预测精度、任务成功率和跨场景泛化上优于传统向量动作控制。

Comments Project page: https://imac-wm.github.io/

详情
AI中文摘要

具身世界模型已成为视觉机器人决策和交互环境模拟的关键范式。然而,传统的具身框架依赖于低维结构化动作向量(例如关节角度和末端执行器位姿),这些向量存在表达能力有限、跨不同具身形态泛化能力差以及对复杂物理交互的动态建模不自然等问题。为了解决这些限制,本文提出了iMac(图像作为动作控制),一种新颖的统一控制范式,将原始视觉图像视为具身世界模型的原生动作表示。与传统的显式运动学动作编码不同,iMac将连续的视觉操作表述为基于图像的动作标记,这些标记内在地包含了空间运动意图、交互几何约束和细微的物理动力学。我们构建了一个双分支具身架构,包括图像-动作编码器和动态世界预测器:编码器将目标驱动的视觉图像压缩为紧凑的动作嵌入,而预测器学习以图像动作为条件的环境转移规则,以实现高保真的未来状态预测和闭环具身控制。在公开的具身操作基准和真实机器人场景上进行了大量实验。结果表明,iMac在预测精度、任务成功率和跨场景泛化能力方面优于基于向量的动作控制基线。此外,我们的图像动作设计消除了对人工定义动作空间的依赖,实现了异构具身智能体的灵活通用控制。这项工作为具身世界模型提供了一种创新的视觉-动作视角,为可扩展的机器人感知和操作提供了一种简单而有效的范式。

英文摘要

Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposes iMac (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. Extensive experiments are conducted on public embodied manipulation benchmarks and real-world robotic scenarios. The results demonstrate that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, realizing flexible and universal control for heterogeneous embodied agents. This work provides an innovative visual-action perspective for embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.

2606.09827 2026-06-09 cs.RO cs.CV 新提交

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

MemoryVLA++:通过记忆与想象在视觉-语言-动作模型中进行时间建模

Hao Shi, Weiye Li, Bin Xie, Yulin Wang, Renping Zhou, Tiancai Wang, Xiangyu Zhang, Ping Luo, Gao Huang

发表机构 * Tsinghua University(清华大学) The University of Hong Kong(香港大学) Dexmal StepFun

AI总结 提出MemoryVLA++框架,通过工作记忆、感知-认知记忆库和想象未来状态的世界模型,实现完整时间建模,在模拟和真实机器人任务上显著提升长时域和依赖记忆与想象的任务性能。

Comments The project is available at https://shihao1895.github.io/MemoryVLA-PP-Web

详情
AI中文摘要

时间建模对于机器人操作至关重要,因为有效控制既需要过去交互的记忆,也需要对未来状态的想象。然而,大多数VLA模型主要依赖当前观测,因此在长时域和时间依赖任务上表现不佳。认知科学表明,人类依赖工作记忆缓冲短期上下文,海马系统保存过去经历的情景记忆,以及内部模型想象可能的未来状态演化。受这些机制启发,我们提出MemoryVLA++,一个完整的时序建模框架,为VLA模型配备记忆和想象能力以进行机器人操作。预训练的VLM将当前观测编码为感知和认知标记,形成工作记忆。这些标记查询感知-认知记忆库以检索相关历史上下文。该记忆库存储来自过去交互的低级细节和高级语义,并通过冗余感知合并进行更新。一个世界模型在去噪潜在空间中想象未来状态,并在记忆引导下整合想象的潜在表示,形成完整的时间感知标记。生成的标记条件化一个扩散动作专家,以预测时间一致的动作序列。我们在5个模拟基准和3类真实机器人任务(涵盖3种机器人)上进行了广泛实验,包括通用操作、长时域时间任务、鲁棒性和泛化性。我们的方法在Libero、SimplerEnv、Mikasa-Robo、Calvin、Libero-Plus以及多样化的真实机器人任务上取得了强劲性能,验证了具有记忆和想象的完整时间建模的有效性。例如,在真实机器人上,在通用、依赖记忆和依赖想象的任务上分别获得了+9%、+26%和+28%的提升。项目页面:https://shihao1895.github.io/MemoryVLA-PP-Web

英文摘要

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web
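Illustrative note (not the paper's implementation): the abstract says the memory bank is updated through redundancy-aware consolidation without giving details. One plausible reading, sketched below as an assumption, is a fixed-capacity bank that merges its two most similar neighboring entries when it overflows. The cosine-similarity criterion, the averaging merge, and the names `cosine` and `add_to_bank` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def add_to_bank(bank, entry, capacity):
    """Append an entry; on overflow, merge the most similar pair of adjacent entries."""
    bank = bank + [list(entry)]
    if len(bank) <= capacity:
        return bank
    # Adjacent entries are neighbors in time, so merging them keeps temporal order intact.
    idx = max(range(len(bank) - 1), key=lambda i: cosine(bank[i], bank[i + 1]))
    merged = [(x + y) / 2.0 for x, y in zip(bank[idx], bank[idx + 1])]
    return bank[:idx] + [merged] + bank[idx + 2:]


# Hypothetical 2-D "tokens": the new entry is nearly redundant with the latest stored one.
memory = [[1.0, 0.0], [0.0, 1.0]]
memory = add_to_bank(memory, [0.1, 1.0], capacity=2)
print(memory)
```

Here the distinct first entry survives untouched while the two near-duplicate entries collapse into their average, so the bank stays within capacity without discarding the oldest context outright.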

2606.07895 2026-06-09 cs.CV cs.RO 交叉投稿

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

TBD-VLA: 时序块扩散视觉语言动作模型

Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 提出TBD-VLA框架,通过时序块扩散机制实现离散令牌VLA模型的并行动作生成,兼顾时序连贯性与推理速度,在仿真和真实任务中优于先前方法。

详情
AI中文摘要

离散视觉-语言-动作(VLA)模型通常将动作生成建模为离散动作空间上的下一个令牌预测,每个令牌自回归地依赖于先前的上下文。虽然有效,但这种范式会导致高推理延迟,并且很大程度上忽略了动作轨迹中固有的时间结构。最近的工作引入并行解码以提高效率,实现更快的推理,但缺乏建模令牌依赖关系的显式机制。我们提出TBD-VLA,一种基于离散令牌的VLA框架,它结合了块扩散以实现时序动作生成。我们将动作序列划分为时间块,并在每个块内执行掩码离散扩散,同时保持跨块的自回归生成。这种设计统一了时序自回归和并行动作解码,实现了强时序连贯性和改进的推理速度。此外,显式的时序建模通过时序修补实现了动作块(例如实时分块)的异步执行。TBD-VLA在仿真和真实世界的操作任务中显著优于先前的VLA方法,为走向快速、时序感知的离散VLA模型提供了一条可扩展的路径。项目网页:https://tbd-vla.github.io/

英文摘要

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
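The decoding scheme in the abstract (parallel masked-diffusion unmasking inside a block, autoregression across blocks) can be sketched as below. This is only an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the VLA, and the confidence-ordered unmasking schedule is a common choice for masked discrete diffusion rather than something the abstract specifies.

```python
MASK = -1  # placeholder id for an action token that has not been decoded yet


def decode_block(context, block_len, predict, steps=2):
    """Fill one temporal block by iterative unmasking."""
    block = [MASK] * block_len
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        # One parallel prediction for every position, conditioned on the
        # previous blocks (context) and the partially decoded block.
        proposals = predict(context, block)  # list of (token, confidence)
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceil division
        # Commit the k most confident masked positions this step.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:k]:
            block[i] = proposals[i][0]
    return block


def generate(num_blocks, block_len, predict, steps=2):
    """Autoregressive over blocks: each block sees all earlier blocks."""
    context = []
    for _ in range(num_blocks):
        context.extend(decode_block(context, block_len, predict, steps))
    return context


def toy_predict(context, block):
    # Hypothetical model: the token is the absolute position, and earlier
    # positions in the block are predicted with higher confidence.
    base = len(context)
    return [(base + i, 1.0 / (1 + i)) for i in range(len(block))]


actions = generate(num_blocks=2, block_len=4, predict=toy_predict)
```

With `steps=2`, each block of four tokens costs two model calls instead of four sequential next-token calls, which is where the latency saving would come from in this sketch.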

2606.08653 2026-06-09 cs.CV cs.AI cs.LG cs.RO 交叉投稿

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

FiberTune: 在视觉-语言-动作微调中保留动作纤维视觉残差

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Hebei Key Laboratory of Cognitive Intelligence, Xiong’an Institute of Innovation(河北省认知智能重点实验室,雄安创新研究院) Hebei University of Technology(河北工业大学) Beijing Information Science and Technology University(北京信息科技大学)

AI总结 提出FiberTune,通过在线动作探针过滤动作预测特征方向,对齐教师视觉残差并正则化有效秩,在六个仿真和实物任务中提升VLA策略性能。

Comments Project page: https://fibertune.github.io/

详情
AI中文摘要

动作监督的视觉-语言-动作(VLA)策略微调能有效拟合演示,但仅约束会改变预测动作的方向,使得在动作等价状态之间保持一致的视觉结构可以自由坍缩。我们将此形式化为沿局部动作纤维的残差视觉坍缩,并提出FiberTune,一种训练时目标,在不增加推理开销的情况下保留教师结构的视觉残差。FiberTune使用在线动作探针估计动作预测特征方向,将这些方向从中间视觉标记表示中滤除,并将探针过滤后的残差与冻结的视觉教师对齐,同时正则化其有效秩。在相同训练条件下,FiberTune在跨越两个基准和两种架构(pi_0.5和OpenVLA-OFT)的六个受控仿真设置以及物理SO-101拾取放置任务中,均优于仅任务损失的微调;代表性提升包括长时域CALVIN ABC-to-D上SR(5)提高10.7个百分点,物理SO-101任务成功率从72.7%提升至78.1%。残差诊断显示,这些增益伴随着探针过滤后残差的教师对齐度和有效秩的提升,与动作纤维动机一致。

英文摘要

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
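Two ingredients named in the abstract, filtering action-predictive directions out of a feature and measuring effective rank, can be illustrated with a small sketch. The helper names are hypothetical, the directions are assumed to be orthonormal, and the entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) is one standard definition; the abstract does not say which definition FiberTune uses.

```python
import math


def project_out(feature, directions):
    """Remove the components of `feature` along each direction.

    Assumes `directions` is a list of orthonormal vectors, e.g. obtained
    from an action probe's weights after orthonormalization.
    """
    residual = list(feature)
    for d in directions:
        coeff = sum(f * di for f, di in zip(feature, d))
        residual = [r - coeff * di for r, di in zip(residual, d)]
    return residual


def effective_rank(singular_values):
    """exp(entropy) of the normalized singular-value distribution."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# A feature whose first coordinate is the (assumed) action-predictive direction.
residual = project_out([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0]])
```

A flat spectrum gives an effective rank equal to the number of singular values, while a spectrum dominated by one value gives an effective rank near 1, which is the collapse the regularizer is meant to discourage.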

2606.08962 2026-06-09 cs.LG cs.CV cs.RO 交叉投稿

C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

C$^3$ache: 利用跨推理块缓存加速世界动作模型

Weisen Zhao, Lam Nguyen, Zhicong Lu, Yuzhang Shang

发表机构 * George Mason University(乔治梅森大学) University of Central Florida(中佛罗里达大学)

AI总结 提出C$^3$ache方法,通过跨推理块缓存和重用去噪残差,加速世界动作模型推理,实现高达2.5倍加速且任务成功率几乎无损。

详情
AI中文摘要

世界动作模型(WAM)比标准的视觉-语言-动作(VLA)策略在新型运动和环境中具有更好的泛化能力,因为视频建模目标使其能够从大量未标记视频中学习,而不是依赖稀缺的标记机器人演示。这种泛化能力计算成本高昂。为了完成一个任务,WAM需要运行多个推理块,每个块都需要一个昂贵的去噪过程。现有的加速方法通过在一个块的去噪轨迹内缓存和重用计算来降低这一成本。我们的实证分析揭示了它们忽略的一个重要的冗余来源:块间的冗余。当机器人执行平滑行为时,在给定去噪步骤计算的残差从一个块到下一个块高度相关。我们引入了C$^3$ache,一种无需训练的方法,它在相同去噪步骤的推理块之间缓存和重用这些残差。在基于Fast-WAM骨干的基准测试上的实验表明,C$^3$ache在总墙钟推理时间上实现了高达2.5倍的加速,而任务成功率几乎没有下降。

英文摘要

World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.
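The cross-chunk idea can be sketched as a cache keyed by denoising step, so that a later inference chunk reuses the residual an earlier chunk computed at the same step. This is a rough illustration only: the `CrossChunkCache` class and the fixed `refresh_every` policy are assumptions, because the abstract does not describe when C$^3$ache decides to recompute rather than reuse.

```python
class CrossChunkCache:
    """Hypothetical cache of residuals indexed by denoising step."""

    def __init__(self, refresh_every=2):
        # Assumed policy: recompute all residuals on every N-th chunk.
        self.refresh_every = refresh_every
        self.residuals = {}  # denoising step -> residual from an earlier chunk
        self.computed = 0    # number of expensive evaluations performed

    def residual(self, chunk_idx, step, compute):
        must_refresh = chunk_idx % self.refresh_every == 0
        if must_refresh or step not in self.residuals:
            self.residuals[step] = compute(chunk_idx, step)
            self.computed += 1
        return self.residuals[step]


def run(num_chunks, num_steps, cache, compute):
    """Run every denoising step of every chunk, collecting the residuals used."""
    used = []
    for chunk in range(num_chunks):
        for step in range(num_steps):
            used.append(cache.residual(chunk, step, compute))
    return used


cache = CrossChunkCache(refresh_every=2)
# The stand-in "residual" records which chunk and step produced it.
used = run(num_chunks=4, num_steps=3, cache=cache, compute=lambda c, s: (c, s))
```

In this toy run only chunks 0 and 2 pay for the computation, so half of the evaluations are skipped; the actual speedup and accuracy trade-off depend on how correlated consecutive chunks really are.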

2603.19183 2026-06-09 cs.RO 版本更新

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

稀疏自编码器揭示VLA模型中可解释且可操控的特征

Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy, Mac Schwager

发表机构 * Department of Mechanical Engineering(机械工程系) Department of Computer Science(计算机科学系) Department of Aeronautics & Astronautics(航空与航天系)

AI总结 本文通过训练稀疏自编码器揭示VLA模型中可解释且可操控的特征,验证了其在不同任务和场景中的可迁移性。

Comments 24 pages, 11 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人操作的一种有前景的方法。然而,很少有研究从机制层面探讨它们何时以及为何能在物体、场景和指令之间泛化。为探查其内部表示,我们在VLA的隐藏层激活上训练稀疏自编码器(SAEs)。SAEs在模型激活上学习稀疏字典,通常能揭示与模型表示空间中可解释方向对应的特征。我们识别出与运动原语和语义概念相关的SAE特征,包括在多个回合中通用且因果可操控的特征。我们提出了一种度量标准,将特征分类为通用可迁移原语或回合特定的记忆化,为理解VLA泛化提供了有前景的初步视角。我们通过在LIBERO模拟基准和真实世界DROID硬件上的操控实验验证了这些发现。我们发现增强通用和语义特征会诱导出与其含义一致的行为,而消融它们会破坏模型性能。此外,我们展示了操控可以作为在无法通过提示实现的方向上控制行为的方式。这些结果共同提供了机制证据,表明VLA可以学习可重用的内部特征,将感知、语言和动作跨任务和场景连接起来。我们的项目页面位于https://drvla.github.io

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, little research has mechanistically explored when and why they generalize across objects, scenes, and instructions. To probe internal representations, we train Sparse Autoencoders (SAEs) on the VLA's hidden-layer activations. SAEs learn sparse dictionaries over model activations, often revealing features that correspond to interpretable directions in the model's representation space. We identify SAE features corresponding to motion primitives and semantic concepts, including features that are general across episodes and causally steerable. We propose a metric to categorize features as general transferable primitives or episode-specific memorizations, offering a promising glimpse towards VLA generalization. We validate these findings through steering experiments on both the LIBERO simulation benchmark and on real-world DROID hardware. We find that amplifying general and semantic features induces behaviors consistent with their meanings, whereas ablating them destroys model performance. Furthermore, we demonstrate steering as a way to control behavior in unpromptable directions. Together, these results provide mechanistic evidence that VLAs can learn reusable internal features linking perception, language, and action across tasks and scenes. Our project page is located at https://drvla.github.io
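A sparse autoencoder of the kind described above, and the steering operation of adding a feature's decoder direction to an activation, can be sketched as follows. The tiny hand-set weights and the function names are illustrative assumptions; the paper's SAE architecture, sparsity penalty, and steering coefficient are not given in the abstract.

```python
def relu(x):
    return x if x > 0 else 0.0


def sae_encode(x, w_enc, b_enc):
    """Feature activations f = ReLU(W_enc x + b_enc)."""
    return [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def sae_decode(f, w_dec, b_dec):
    """Reconstruction: b_dec + sum_j f_j * w_dec[j] (w_dec[j] is feature j's direction)."""
    out = list(b_dec)
    for fj, direction in zip(f, w_dec):
        out = [o + fj * d for o, d in zip(out, direction)]
    return out


def steer(x, w_dec, feature_idx, alpha):
    """Amplify (alpha > 0) or suppress (alpha < 0) one feature in activation x."""
    return [xi + alpha * d for xi, d in zip(x, w_dec[feature_idx])]


# Toy dictionary with three features over a 2-D activation space.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_dec = [0.0, 0.0]

features = sae_encode([2.0, -3.0], w_enc, b_enc)
reconstruction = sae_decode(features, w_dec, b_dec)
steered = steer([2.0, -3.0], w_dec, feature_idx=1, alpha=3.0)
```

Only the first feature fires for this input, so the reconstruction keeps the first coordinate and drops the second; steering along feature 1 then moves the activation in a direction the input itself never activated.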

2604.22238 2026-06-09 cs.RO 版本更新

CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

CodeGraphVLP:代码规划器与语义图状态的结合用于非马尔可夫视觉-语言-动作模型

Khoa Vo, Sieu Tran, Taisei Hanyu, Yuki Ikebe, Duy Nguyen, Nghi D. Q. Bui, Minh Vu, Anthony Gunderman, Chase Rainwater, Anh Nguyen, Ngan Le

发表机构 * University of Arkansas(阿肯色大学) Max Planck Research School for Intelligent Systems and the University of Stuttgart(马克斯·普朗克智能系统研究学校和斯图加特大学) Center of AI Research, VinUniversity(Vin大学人工智能研究中心) TU Wien(维也纳技术大学) University of Liverpool(利物浦大学)

AI总结 CodeGraphVLP结合语义图状态与可执行代码规划器,提升非马尔可夫长周期任务的视觉语言动作执行效率,降低规划延迟并提高任务完成率。

详情
AI中文摘要

视觉-语言-动作(VLA)模型旨在实现通用机器人操作,但通常被作为短周期策略进行训练和部署,假设最新观察足以进行动作推理。这一假设在非马尔可夫长周期任务中失效,因为任务相关证据可能被遮挡或仅出现在轨迹早期,且杂乱物体和干扰物使精细视觉定位变得脆弱。我们提出CodeGraphVLP,一种分层框架,通过结合持久语义图状态、可执行的代码规划器和进度引导的视觉-语言提示,实现可靠的长周期操作。语义图在部分可观测条件下维护任务相关实体和关系。合成规划器在该语义图上执行,进行高效进度检查并输出子任务指令及与子任务相关的对象。这些输出用于构建抑制杂乱的观察,使VLA执行器聚焦关键证据。在真实世界的非马尔可夫任务中,CodeGraphVLP相比强VLA基线和历史增强变体提升了任务完成率,同时与VLM在环规划相比显著降低了规划延迟。我们还进行了广泛的消融研究以验证各组件的贡献。

英文摘要

Vision-Language-Action (VLA) models promise generalist robot manipulation, but are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic-graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic-graph to perform efficient progress checks and outputs a subtask instruction together with subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. We also conduct extensive ablation studies to confirm the contributions of each component.

2606.00229 2026-06-09 cs.RO cs.AI cs.LG 版本更新

Continuous Reasoning for Vision-Language-Action

视觉-语言-动作的连续推理

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

发表机构 * Airoa

AI总结 针对视觉-语言-动作策略中语言与连续控制粒度不匹配的问题,提出一种可共享、可验证的连续推理方法,通过高斯潜变量接口和自验证目标提升机器人任务成功率。

Comments Project page: https://continuous-reasoning.airoa.io

详情
AI中文摘要

自然语言是语言模型和视觉-语言模型强大的推理媒介,但与连续控制的粒度不匹配。文本和显式子目标在任务级粒度上操作,而视觉-语言-动作(VLA)策略必须在更细的时间尺度上选择动作;因此,单个推理步骤可能跨越多个动作块,同时与当前所需动作保持弱耦合。这为VLA提出了一个不同的问题:什么应该扮演语言的角色?我们认为,有用的VLA推理媒介必须能够在模型实例之间共享,通过下游动作改进进行验证,并与时间扩展的控制结构对齐。基于这一观点,我们提出了视觉-语言-动作的连续推理。我们的模型首先以结构化连续思想集的形式预测连续推理,然后将其重用为块结构动作生成的共享上下文。仅凭更好的动作预测并不能证明推理的有效性:如果相同的内部媒介不能在模型实例之间共享,并且不能通过改进的下游控制独立验证,那么添加的潜变量可能只是模型私有的捷径,有助于在已见行为上表现,而不支持泛化的控制。因此,我们将连续推理实例化为一个共享的高斯潜变量接口,并使用自验证目标进行训练,其中指数移动平均教师必须在预测目标动作时成功消费学生的推理。实验上,连续推理提高了LIBERO-PRO的鲁棒性,并在真实机器人上表现强劲,在TX-G2(一种AgiBot G2兼容变体)上平均子任务成功率比π0.5提高了40.4%,在HSR上提高了26.3%。这表明VLA中的推理更多是关于一个可共享、可验证的内部动作语言,而不是额外的标记。

英文摘要

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.
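The exponential-moving-average teacher mentioned in the self-verification objective follows a standard update rule, sketched below. Only the EMA update itself is shown; the decay value is an illustrative assumption, and the loss in which the teacher consumes the student's reasoning is not specified in enough detail in the abstract to sketch.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


teacher = [1.0, 0.0]
student = [0.0, 1.0]
# A low decay is used here only to make the movement visible.
teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher trails the student slowly, it acts as a separate model instance that must still be able to use the student's latent, which is how the abstract motivates shareability.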

2606.02735 2026-06-09 cs.RO cs.AI cs.LG 版本更新

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

看得更少,指定更多:面向可泛化视觉-语言-动作模型的视觉证据预算

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

发表机构 * Airoa

AI总结 提出S2框架,通过显式视觉证据预算和细化轨迹语言,改善VLA模型在干扰、外观变化和语义相似任务下的泛化能力。

Comments Project page: https://s2.airoa.io

详情
AI中文摘要

泛化仍然是视觉-语言-动作(VLA)模型的核心瓶颈:在干扰物、外观变化和语义相似任务下,策略通常需要从粗略指令中推断局部执行细节,同时决定图像的哪些部分对控制重要。我们提出S2(看得更少,指定更多),一个通过更干净的接口训练执行器来提升VLA泛化的框架。“指定更多”保留原始指令作为稳定的高层目标,同时将每条轨迹重新标注为细化的轨迹级和子任务级语言,以消除当前执行模式的歧义。与原生注意力不同,“看得更少”施加显式的视觉证据预算,训练执行器从任务充分的证据中行动,而非不受约束的视觉上下文,无需任何区域或掩码标注。该接口让执行器能够遵循详细指导,而不依赖干扰性的视觉补丁或自行解决可避免的歧义,并且通过上下文学习与现成的VLM规划器兼容。在我们的主要评估设置中,S2通过改变执行器的学习问题提升了整体泛化指标:粗略指令导致可避免的监督混叠,目标保持的局部指导在我们的主要消融中优于指令替换,显式证据预算减少了对广泛视觉上下文的依赖,超越了效率考虑。在TX-G2(一个AgiBot G2兼容变体)和HSR上的八个真实机器人任务中,S2将平均子任务成功率从pi0.5的54.2%提升到79.0%。这些结果共同表明,当执行器被训练从信息丰富的局部指导和任务充分的视觉证据中行动,而非从弱监督中同时恢复两者时,VLA泛化得到改善。

英文摘要

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.

2602.18020 2026-06-09 cs.CV cs.RO 版本更新

UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

UAOR: 面向视觉-语言-动作模型的不确定性感知观测重注入

Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Zichen Wen, Bowen Fang, Tao Yu, Xiangnan Wu, Qisen Ma, Kai Wang, Ziheng He, Yingda Li, Zhengbo Zhang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所模式识别新技术实验室) Shanghai Jiao Tong University(上海交通大学) FiveAges(五代)

AI总结 提出UAOR模块,通过动作熵检测不确定性,在语言模型高不确定层重注入观测信息,无需额外训练或数据,提升VLA模型在仿真和真实任务中的性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型利用预训练的视觉-语言模型(VLM)作为骨干,将图像和指令映射到动作,展现出在可泛化机器人操作中的显著潜力。为了提升性能,现有方法通常引入额外的观测线索(如深度图、点云)或辅助模块(如目标检测器、编码器),以实现更精确和可靠的任务执行,但这些方法通常需要昂贵的数据收集和额外训练。受语言模型中的前馈网络(FFN)可作为“键值记忆”的发现启发,我们提出不确定性感知观测重注入(UAOR),一种有效、无需训练且即插即用的VLA模型模块。具体地,当语言模型的当前层表现出由动作熵衡量的高不确定性时,它通过注意力检索将关键观测信息重注入下一层的前馈网络(FFN)。该机制直接在高不确定性层用观测证据增强隐藏状态,从而实现更准确和可靠的动作生成。综合实验表明,我们的方法以最小开销一致地提升了多种VLA模型在仿真和真实任务中的性能。值得注意的是,UAOR消除了对额外观测线索或模块的需求,使其成为现有VLA流程中通用且实用的即插即用组件。项目页面:https://uaor.jiabingyang.cn

英文摘要

Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism directly augments the hidden states with observation evidence at high-uncertainty layers, enabling more accurate and reliable action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
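The uncertainty gate described above can be sketched as a Shannon-entropy threshold over a per-layer action distribution. This is a sketch under assumptions: how UAOR obtains an action distribution at intermediate layers, the threshold value, and the attention-based reinjection itself are not detailed in the abstract, so `layers_to_reinject` only identifies where reinjection would be triggered.

```python
import math


def action_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def layers_to_reinject(per_layer_probs, threshold):
    """Indices of layers whose entropy exceeds the threshold.

    In the described mechanism, observation features would then be
    reinjected into the FFN of the layer that follows each returned index.
    """
    return [
        i for i, probs in enumerate(per_layer_probs)
        if action_entropy(probs) > threshold
    ]


per_layer = [
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain over four actions
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.5, 0.5, 0.0, 0.0],      # moderately uncertain
]
uncertain_layers = layers_to_reinject(per_layer, threshold=1.0)
```

Since the gate uses only quantities the model already produces, it needs no extra training, which matches the training-free, plug-and-play claim.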

7. 多机器人与群体系统 7 篇

2606.08064 2026-06-09 cs.RO 新提交

Cooperative Long Rope Skipping via Multi-Agent Reinforcement Learning

基于多智能体强化学习的协作长绳跳绳

Zihao Wang, Shijie Peng, Kerui Wu, Yu Huang, Ruiqi Xue, Dong Liu, Tian Xu, Lei Yuan, Yang Yu

发表机构 * National Key Laboratory of Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) Beijing Academy of Artificial Intelligence, BAAI(北京智源人工智能研究院)

AI总结 提出Marope框架,采用分层强化学习实现多个人形机器人的协作长绳跳绳,通过多智能体强化学习训练分散的摇绳策略,上层调度策略协调执行,并融入多样跳跃策略提升泛化能力,在仿真和真实实验中优于基线方法。

详情
AI中文摘要

人类展现出卓越的运动敏捷性,能够完成跑步、跳跃等多种动态技能,这凸显了人形机器人在运动方面的巨大潜力。在竞技体育中,长绳跳绳需要两名摇绳者协同摇绳,同时适应不同跳跃节奏的玩家,这对人形机器人来说是一项有意义但具有挑战性的任务。尽管现有的人形机器人运动方法在单智能体和无交互场景(如跑步、舞蹈和跑酷)中取得了成功,但需要多参与者精确协调的任务场景仍鲜有探索。为此,我们提出Marope,一个用于多个人形机器人协作长绳跳绳的多智能体强化学习框架。具体而言,Marope采用分层强化学习框架进行策略训练。在底层,通过多智能体强化学习学习分散的摇绳操作策略;在顶层,训练集中调度策略以协调底层策略的执行。为了提高对不同玩家行为风格的泛化能力,Marope进一步将多样化的跳跃策略融入协作博弈训练中。我们在仿真和真实环境中对宇树G1人形机器人进行了评估。实验结果表明,Marope优于多种基线方法,实现了更高效稳定的摇绳操作以及与不同玩家更鲁棒和自适应的协作。

英文摘要

Humans exhibit remarkable motor agility, enabling a wide range of dynamic skills such as running and jumping, which highlights the great potential of humanoid robots for athletic locomotion. Among athletic sports, long rope skipping requires two rope turners to cooperatively swing the rope while adapting to a player under different jumping rhythms, making it a meaningful yet challenging task for humanoid robots. Although existing methods for humanoid sports have achieved success in single-agent and interaction-free settings, such as running, dancing, and parkour, task scenarios that require precise coordination among multiple participants remain largely unexplored. To this end, we propose Marope, a multi-agent reinforcement learning (MARL) framework for cooperative long rope skipping with multiple humanoid robots. Specifically, Marope adopts a hierarchical reinforcement learning framework for policy training. At the lower level, it learns decentralized rope manipulation policies through MARL, while at the upper level, a centralized scheduling policy is trained to coordinate the execution of the lower-level policies. To improve generalization across different player behavioral styles, Marope further incorporates diverse jumping policies into cooperative game training. We evaluate our approach on Unitree G1 humanoid robots in both simulation and real-world settings. Experimental results demonstrate that Marope outperforms various baselines, achieving more efficient and stable rope manipulation as well as more robust and adaptable cooperation with varied players.

2606.09099 2026-06-09 cs.RO 新提交

LAEI: Layered Autonomous Edge Intelligence Framework for Robust UAV Swarm Operations

LAEI: 面向鲁棒无人机蜂群操作的分层自主边缘智能框架

Changmin Park, Wooyong Jung, Hwangnam Kim

发表机构 * Korea University(高丽大学)

AI总结 提出分层自主边缘智能框架,通过机载学习策略与轻量级任务级监督结合,实现无人机蜂群在通信受限、环境不确定和组件故障下的可扩展协调,显著降低任务完成时间并提高效率。

Comments Preprint. Submitted to arXiv

详情
AI中文摘要

自主无人机蜂群需要可扩展的协调机制,以在有限通信、环境不确定性和组件故障下保持任务性能。集中式方法提供全局协调,但存在通信瓶颈和单节点脆弱性,而完全分散的方法通常缺乏任务级一致性。本文提出了分层自主边缘智能(LAEI),一种无人机蜂群框架,它将机载学习策略与轻量级任务级监督相结合。每个无人机在机载执行局部感知、避障和动作选择,而监督层提供自适应目标重分配、故障感知恢复和上下文相关策略指导,而不直接控制低级动作。LAEI进一步整合了恢复策略,包括动态重新关联、备份监督支持和回退局部自主性,以在代表性故障场景下维持任务连续性。我们在模拟的无人机蜂群场景中评估了LAEI,使用任务完成时间、碰撞率和覆盖效率。结果表明,LAEI减少了任务完成时间并提高了操作效率,同时保持了碰撞感知的分布式无人机级决策。

英文摘要

Autonomous UAV swarms require scalable coordination mechanisms that maintain mission performance under limited communication, environmental uncertainty, and component failures. Centralized approaches provide global coordination but suffer from communication bottlenecks and single-node vulnerabilities, whereas fully decentralized methods often lack mission-level consistency. This paper presents Layered Autonomous Edge Intelligence (LAEI), a UAV-swarm framework that combines onboard learned policies with lightweight mission-level supervision. Each UAV performs local perception, obstacle avoidance, and action selection onboard, while the supervisory layer provides adaptive goal reassignment, fault-aware recovery, and context-dependent policy guidance without directly controlling low-level actions. LAEI further incorporates recovery strategies, including dynamic reassociation, backup supervisory support, and fallback local autonomy, to maintain mission continuity under representative failure scenarios. We evaluate LAEI in simulated UAV-swarm scenarios using mission completion time, collision rate, and coverage efficiency. The results show that LAEI reduces mission completion time and improves operational efficiency while maintaining collision-aware distributed UAV-level decision-making.

2606.09610 2026-06-09 cs.RO cs.AI 新提交

Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning

基于多智能体强化学习的任意物体协同运输中的形状形成

Mohamed Sayed, Wolfram Burgard, Tanja Katharina Kaiser

发表机构 * University of Technology Nuremberg(纽伦堡工业大学)

AI总结 提出一种多智能体强化学习方法,使多机器人系统自主形成支撑任意形状和非均匀质量分布物体的编队,同时避免障碍物,实现可靠且泛化的协同运输。

详情
AI中文摘要

协同物体运输在从工业到家庭服务的众多领域中至关重要。一种流行的运输策略是将物体承载在多机器人系统之上。相应的任务通常通过将其分解为三个相互关联的子问题来解决:编队控制、协同导航和碰撞避免。现实世界物体带来的一个特殊挑战是其可能具有任意形状和非均匀质量分布,这需要机器人编队能够牢固支撑物体。在这项工作中,我们通过提出一种新颖的多智能体强化学习方法来解决运输此类现实世界物体时的模式形成控制挑战。我们的方法使多机器人系统能够自主定位在物体下方以支撑其重量,同时在编队过程中避免障碍物。我们在不同环境和不同数量机器人下的评估表明,我们的方法能够产生可靠形成平衡编队的策略,并泛化到杂乱场景以及具有复杂几何形状和非均匀质量分布的物体。

英文摘要

Cooperative object transportation is essential in numerous domains, including industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.

2606.09620 2026-06-09 cs.RO cs.SY eess.SY 新提交

Motion planning for hundreds of floating robots

数百个浮动机器人的运动规划

Jan Kamm, Antonio Terpin, Raffaello D'Andrea, Aswin Ramachandran

发表机构 * Institute for Dynamic Systems and Control, ETH Zürich(苏黎世联邦理工学院动态系统与控制研究所)

AI总结 针对大型浮动机器人编队的无碰撞运动规划问题,提出一种可扩展的流水线方法,通过碰撞图分解为独立子问题并行求解,在500个机器人仿真和实际演示中验证了有效性。

详情
AI中文摘要

为大型机器人编队规划无碰撞运动是困难的,因为碰撞避免引入了随团队规模快速增长且强烈的智能体间耦合。我们考虑水面上的全向浮动机器人,其编队动作由稀疏关键帧指定,交互工具必须在几秒内生成轨迹,即使过渡跨越几分钟和数千个时间步。我们提出一种可扩展的流水线,从初始化构建碰撞图,将耦合问题分解为交互簇,并独立(并行)求解这些簇,同时针对常见分解病态问题提供鲁棒性机制。我们在多达500个机器人的仿真中验证了该方法。合成的轨迹还已在两个实际演示中部署:在苏黎世湖上使用24艘Way of Water船只,以及在2025年威尼斯双年展的“时间空间存在”展览中。

英文摘要

Planning collision-free motion for large robot fleets is difficult because collision avoidance induces strong inter-agent coupling that grows rapidly with team size. We consider omnidirectional floating robots on water, where choreographies are specified by sparse keyframes and an interactive tool must generate trajectories within seconds, even when transitions span minutes and thousands of time steps. We propose a scalable pipeline that builds a collision graph from an initialization, decomposes the coupled problem into interaction clusters, and solves clusters independently (and in parallel) with robustness mechanisms for common decomposition pathologies. We validate the approach in simulations up to 500 robots. The synthesized trajectories have also been deployed in two real-world demonstrations, on Lake Zürich with a fleet of 24 Way of Water crafts and at the Time Space Existence 2025 Venice Biennale.
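The decomposition step can be illustrated by building a collision graph from initial trajectories and splitting it into connected components, each of which could then be solved independently. This is a minimal sketch: the pairwise distance test at shared time steps, the `safety_dist` parameter, and the union-find clustering are assumptions, and the paper's robustness mechanisms for decomposition pathologies are not modeled.

```python
import math


def collision_pairs(trajectories, safety_dist):
    """Pairs of robots that come closer than safety_dist at some time step."""
    pairs = []
    n = len(trajectories)
    for i in range(n):
        for j in range(i + 1, n):
            if any(
                math.dist(p, q) < safety_dist
                for p, q in zip(trajectories[i], trajectories[j])
            ):
                pairs.append((i, j))
    return pairs


def clusters(num_robots, pairs):
    """Connected components of the collision graph via union-find."""
    parent = list(range(num_robots))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for robot in range(num_robots):
        groups.setdefault(find(robot), []).append(robot)
    return sorted(groups.values())


# Three robots, two time steps each; robots 0 and 1 nearly meet at t = 1.
trajectories = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 3.0), (1.0, 0.5)],
    [(10.0, 10.0), (11.0, 10.0)],
]
pairs = collision_pairs(trajectories, safety_dist=1.0)
groups = clusters(len(trajectories), pairs)
```

The all-pairs check is quadratic in the number of robots; for fleets of hundreds a spatial index would be the natural replacement, but it is omitted to keep the sketch short.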

2606.08738 2026-06-09 cs.NI cs.RO 交叉投稿

Systems-Level Planning and Coordination of Truck-Drone Collaborative Delivery Networks

卡车-无人机协同配送网络的系统级规划与协调

Didem Cicek, Burak Kantarci

发表机构 * School of Electrical Engineering and Computer Science at the University of Ottawa(渥太华大学电气工程与计算机科学学院)

AI总结 针对城市最后一英里配送,提出分层规划与协调框架,通过任务编排与智能体同步,实现卡车-无人机协同配送,相比纯卡车模式,总配送时间减少42.4%,能耗降低44.2%。

Comments 6 pages, 4 figures, Accepted to 2026 IEEE HPSR on Network Architectures and Intelligence for Smart Mobility and Autonomous Systems (TRAVERSAL)

详情
AI中文摘要

城市最后一英里包裹配送日益依赖异构车队,其性能取决于及时协调、可靠通信和可扩展控制。卡车-无人机协作已成为一种网络化信息物理配送范式,结合了卡车的载重能力和续航效率与无人机在拥挤或受限城市环境中的灵活性。本文从系统与控制角度提出了一种分层规划与协调框架,用于构建卡车-无人机协同配送(TDCD)。该框架由五个相互关联的层组成:空间需求对齐、协同配送配置、资源与工作流编排、性能评估和可扩展性分析,为网络化配送操作中的协调、控制和系统级性能提供了统一视角。使用源自2021年亚马逊最后一英里路线研究挑战数据集的实际城市最后一英里配送场景评估了所提框架。案例研究表明,通过结构化任务编排和智能体间同步实现的协调卡车-无人机操作,在操作约束下提高了端到端系统效率。结果显示,与传统的纯卡车配送模型相比,总配送时间减少了42.4%,能耗降低了44.2%。可扩展性分析进一步强调了协调收益如何随系统规模增大而持续,并展示了高效控制和通信在异构配送网络中的重要性。

英文摘要

Urban last-mile parcel delivery increasingly relies on heterogeneous fleets whose performance depends on timely coordination, reliable communication, and scalable control. Truck-drone collaboration has emerged as a networked cyber-physical delivery paradigm that combines the payload capacity and range efficiency of trucks with the agility of drones in congested or access-limited urban environments. This paper proposes a layered planning and coordination framework that structures truck-drone collaborative delivery (TDCD) from a systems and control perspective. The framework consists of five interrelated layers: spatial-demand alignment, collaborative delivery configuration, resource and workflow orchestration, performance evaluation, and scalability analysis, providing a unified view of coordination, control, and system-level performance in networked delivery operations. The proposed framework is evaluated using a realistic urban last-mile delivery scenario derived from the 2021 Amazon Last Mile Routing Research Challenge dataset. The case study demonstrates how coordinated truck-drone operation, enabled by structured task orchestration and inter-agent synchronization, improves end-to-end system efficiency under operational constraints. Results show a 42.4% reduction in total delivery time and a 44.2% reduction in energy consumption compared to a conventional truck-only delivery model. The scalability analysis further highlights how coordination gains persist as system size increases, and shows the importance of efficient control and communication in heterogeneous delivery networks.

2603.24238 2026-06-09 cs.RO 版本更新

Decentralized End-to-End Multi-AAV Pursuit Using Predictive Spatio-Temporal Observation via Deep Reinforcement Learning

基于深度强化学习的去中心化端到端多无人艇追捕

Yude Li, Zhexuan Zhou, Huizhe Li, Yanke Sun, Yenan Wu, Yichen Lai, Yiming Wang, Youmin Gong, Jie Mei

发表机构 * School of Intelligence Science and Engineering(智能科学与工程学院) Guangdong Key Laboratory of Intelligent Morphing Mechanisms and Adaptive Robotics(广东省智能变形机制与自适应机器人重点实验室) Shenzhen Key Lab for Advanced Motion Control and Modern Automation Equipments(深圳先进运动控制与现代自动化设备重点实验室) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 本文提出一种去中心化端到端多智能体强化学习框架,通过预测时空观测实现自主空中群体在复杂环境中的追捕,提升捕获效率和协同成功率。

详情
AI中文摘要

在杂乱环境中实现去中心化协作追捕对自主空中群体而言具有挑战性,特别是在感知不完全和存在噪声的情况下。现有方法通常依赖于抽象几何特征或特权地面真实状态,从而回避了现实环境中的感知不确定性。本文提出了一种去中心化端到端多智能体强化学习(MARL)框架,直接将原始激光雷达观测映射到连续控制命令。框架的核心是预测时空观测(PSTO),一种以自我为中心的网格表示,将障碍物几何与预测的对手意图和队友运动统一在一个固定分辨率投影中。基于PSTO,单一去中心化策略使智能体能够绕行静态障碍物、拦截动态目标并维持协同包围。仿真显示,与依赖特权障碍信息的最先进学习方法相比,所提方法取得了更优的捕获效率和有竞争力的成功率。此外,统一策略可无缝扩展到不同团队规模而无需重新训练。最后,完全自主的户外实验在仅依赖机载传感和计算的四旋翼群体上验证了该框架。

英文摘要

Decentralized cooperative pursuit in cluttered environments is challenging for autonomous aerial swarms, especially under partial and noisy perception. Existing methods often rely on abstracted geometric features or privileged ground-truth states, and therefore sidestep perceptual uncertainty in real-world settings. We propose a decentralized end-to-end multi-agent reinforcement learning (MARL) framework that maps raw LiDAR observations directly to continuous control commands. Central to the framework is the Predictive Spatio-Temporal Observation (PSTO), an egocentric grid representation that aligns obstacle geometry with predictive adversarial intent and teammate motion in a unified, fixed-resolution projection. Built on PSTO, a single decentralized policy enables agents to navigate static obstacles, intercept dynamic targets, and maintain cooperative encirclement. Simulations demonstrate that the proposed method achieves superior capture efficiency and competitive success rates compared to state-of-the-art learning-based approaches relying on privileged obstacle information. Furthermore, the unified policy scales seamlessly across different team sizes without retraining. Finally, fully autonomous outdoor experiments validate the framework on a quadrotor swarm relying on only onboard sensing and computing.

2508.00724 2026-06-09 eess.SY cs.RO cs.SY 版本更新

Petri Net Modeling and Deadlock-Free Scheduling of Attachable Heterogeneous AGV Systems

可连接异构AGV系统的Petri网建模与无死锁调度

Boyu Li, Zhengchen Li, Weimin Wu, Mengchu Zhou

发表机构 * State Key Laboratory of Industrial Control Technology, Zhejiang University(浙江大学工业控制技术国家重点实验室) School of Information and Electronic Engineering, Zhejiang Gongshang University(浙江工商大学信息与电子工程学院) Department of Electrical and Computer Engineering, New Jersey Institute of Technology(新泽西理工学院电气与计算机工程系)

AI总结 针对可连接异构AGV系统的调度问题,提出基于Petri网的无死锁调度框架,集成自适应大邻域搜索算法,通过结构分析预防死锁,实验表明该方法显著提升计算效率并优于现有策略。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

对柔性自动化的日益增长的需求加速了异构自动导引车(AGV)的采用。本文研究了一个由可连接异构AGV(包括载体和穿梭车)组成的物料运输系统中的新调度问题,这些AGV可灵活连接和分离以协同执行任务。虽然这种协作提高了操作效率,但连接引起的同步使系统高度耦合且容易发生死锁。为此,我们提出了一种基于Petri网(PN)的无死锁调度框架,并将其集成到自适应大邻域搜索(ALNS)算法中。引入PN将候选解从静态排列映射为动态协作过程,从而通过状态演化进行性能评估,并通过结构分析实现主动死锁预防。在真实和合成实例上的大量实验表明,所提出的框架显著提高了计算效率,开发的ALNS优于当前现场策略、精确求解器和最先进的元启发式算法。最后,敏感性分析为最优车队规模提供了管理见解。

英文摘要

The increasing demand for flexible automation has accelerated the adoption of heterogeneous automated guided vehicles (AGVs). This work investigates a new scheduling problem in a material transportation system consisting of attachable heterogeneous AGVs, including carriers and shuttles, that flexibly attach and detach for cooperative task execution. While such collaboration enhances operational efficiency, the attachment-induced synchronization renders the system highly coupled and susceptible to deadlocks. To address this, we propose a Petri net (PN)-based deadlock-free scheduling framework integrated into an adaptive large neighborhood search (ALNS) algorithm. The PN is introduced to map candidate solutions from static permutations into dynamic collaborative processes, enabling performance evaluation via state evolution and proactive deadlock prevention through structural analysis. Extensive experiments on real-world and synthetic instances demonstrate that the proposed framework significantly improves computational efficiency, with the developed ALNS outperforming the current on-site policy, exact solvers, and state-of-the-art metaheuristics. Finally, sensitivity analysis yields managerial insights for optimal fleet sizing.
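The deadlock risk described in the abstract above can be illustrated with a toy place/transition net. This is a minimal sketch, not the paper's model: the two-vehicle, two-resource net, the names `PetriNet` and `find_deadlocks`, and the brute-force reachability search are all assumptions made for illustration (the abstract says the authors use structural analysis, which is a different technique).

```python
from typing import List, Sequence, Tuple

Marking = Tuple[int, ...]


class PetriNet:
    """Place/transition net given by per-transition input and output weights."""

    def __init__(self, pre: Sequence[Sequence[int]], post: Sequence[Sequence[int]]):
        self.pre = pre    # pre[t][p]: tokens transition t consumes from place p
        self.post = post  # post[t][p]: tokens transition t produces in place p

    def enabled(self, marking: Marking) -> List[int]:
        """Indices of transitions whose input places hold enough tokens."""
        return [t for t, need in enumerate(self.pre)
                if all(m >= n for m, n in zip(marking, need))]

    def fire(self, marking: Marking, t: int) -> Marking:
        """Marking reached by firing transition t (assumed enabled)."""
        return tuple(m - c + p for m, c, p in zip(marking, self.pre[t], self.post[t]))


def find_deadlocks(net: PetriNet, initial: Marking, goal: Marking) -> List[Marking]:
    """Exhaustively explore reachable markings (the net must be bounded) and
    return every non-goal marking in which no transition is enabled."""
    seen = {initial}
    stack = [initial]
    dead = []
    while stack:
        marking = stack.pop()
        ready = net.enabled(marking)
        if not ready and marking != goal:
            dead.append(marking)
        for t in ready:
            nxt = net.fire(marking, t)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return dead


# Hypothetical net: vehicles A and B each need both resources r1 and r2.
# Places: 0 A_idle, 1 A_holds_r1, 2 A_done, 3 B_idle, 4 B_holds_r2, 5 B_done,
#         6 r1_free, 7 r2_free
pre = [
    [1, 0, 0, 0, 0, 0, 1, 0],  # t0: A takes r1
    [0, 1, 0, 0, 0, 0, 0, 1],  # t1: A takes r2, finishes, releases both
    [0, 0, 0, 1, 0, 0, 0, 1],  # t2: B takes r2
    [0, 0, 0, 0, 1, 0, 1, 0],  # t3: B takes r1, finishes, releases both
]
post = [
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
]
net = PetriNet(pre, post)
initial = (1, 0, 0, 1, 0, 0, 1, 1)
goal = (0, 0, 1, 0, 0, 1, 1, 1)
deadlocks = find_deadlocks(net, initial, goal)
```

In this toy net the only dead marking is the one where A holds r1 while B holds r2. A scheduler that evaluates candidate orderings by state evolution could reject any ordering that reaches such a marking.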

8. 无人车、无人机与移动机器人 17 篇

2606.07813 2026-06-09 cs.RO cs.CV 新提交

MinNav: Minimalist Navigation Using Optical Flow For Active Tiny Aerial Robots

MinNav:基于光流的极简导航用于主动微型飞行机器人

Aniket Patil, Mandeep Singh, Uday Girish Maradana, Nitin J. Sanket

发表机构 * Worcester Polytechnic Institute(伍斯特理工学院) Perception and Autonomous Robotics (PeAR) Group, Robotics Engineering Department, Worcester Polytechnic Institute(伍斯特理工学院机器人工程系感知与自主机器人(PeAR)实验室)

AI总结 提出MinNav导航栈,利用光流及其不确定性,使微型飞行机器人在无先验知识下穿越静态/动态障碍和未知形状间隙,通过主动探索提高成功率,实验成功率70%,计算量远小于深度方法。

Comments Accepted for publication at ICRA 2026. Link to Project page https://pear.wpi.edu/research/minnav.html

详情
AI中文摘要

使用单目相机进行导航对于微型飞行机器人的自主操作至关重要,因为它在多功能性、成本和精度之间取得了完美平衡。在本文中,我们介绍了MinNav,一个基于光流及其不确定性的导航栈,用于在静态和动态障碍物以及未知形状间隙的场景中飞行,无需任何关于场景组件和/或其位置/顺序的先验知识。我们通过利用机器人的主动性以探索方式移动来寻找障碍物并导航,进一步提高了成功率。我们在多种环境下的许多真实世界实验中成功评估并演示了所提出的方法,包括静态和动态障碍物以及未知形状间隙,总体成功率为70%。据我们所知,这是第一个使用单目相机在无先验知识的情况下解决所有上述导航案例的方案。我们的方法在性能上与基于深度的方法相当,但所需的计算量少几个数量级,并且可以轻松在微型飞行机器人上运行。随附的视频、补充材料、代码和数据集可在https://pear.wpi.edu/research/minnav.html找到。

英文摘要

Navigation using a monocular camera is pivotal for autonomous operation on tiny aerial robots due to its perfect balance of versatility, cost and accuracy. In this paper, we introduce MinNav, a navigation stack based on optical flow and its uncertainty to fly through a scene with static and dynamic obstacles and unknown-shaped gaps without any prior knowledge of the scene components and/or their locations/ordering. We further improve the success rate by using the activeness of the robot to move around in an exploratory way to find obstacles and navigate. We successfully evaluate and demonstrate the proposed approach in many real-world experiments in various environments with static and dynamic obstacles and unknown-shaped gaps with an overall success rate of 70%. To the best of our knowledge, this is the first solution to tackle all the aforementioned navigation cases without prior knowledge using a monocular camera. Our approach is on par in performance with depth-based methods while requiring orders of magnitude less computation, and can readily run onboard tiny aerial robots. The accompanying video, supplementary material, code and dataset can be found at https://pear.wpi.edu/research/minnav.html

2606.08170 2026-06-09 cs.RO 新提交

Learning from Human Driving: A Human-in-the-Loop Online Behavior Cloning Framework for Autonomous Driving

从人类驾驶中学习:一种用于自动驾驶的人机协同在线行为克隆框架

Yuhong Shi, Jianyi Liu, Lihang Sun, Li Li, Xudong Dong

发表机构 * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University(西安交通大学人工智能与机器人研究所人机混合增强智能国家重点实验室)

AI总结 提出人机协同在线行为克隆框架HiL-OBC,通过人类干预初始化策略、贝叶斯潜在行为建模和在线更新,结合大模型感知与人类驾驶智能,在CARLA基准上显著提升驾驶性能。

详情
AI中文摘要

随着大型基础模型(LFM)的发展,数据驱动的自动驾驶取得了显著进展。然而,现有范式在复杂交互和长尾场景中仍面临分布偏移和因果混淆的严峻挑战。这些限制往往导致在极端条件下缺乏人类级别的决策灵活性和安全性。为克服这一局限,本文提出了一种用于自动驾驶的人机协同在线行为克隆框架(HiL-OBC),旨在深度融合LFM的跨模态感知能力与人类专家的高级驾驶智能。具体而言,HiL-OBC的部署通过三个关键阶段执行:带人类干预的策略初始化、基于贝叶斯策略适应的潜在行为建模,以及在线部署与更新。此外,我们设计了一种多模态在线行为克隆(MOBC)模型,通过轻量级网络架构、接管触发机制和多变量损失函数在线优化基础驾驶策略,从而增强系统在复杂环境中的决策鲁棒性。我们在LangAuto-Human CARLA基准上评估了HiL-OBC。实验结果表明,通过人机协同机制优化的驾驶策略实现了显著的性能提升:StructNav、LFG和LMDrive的驾驶得分(DS)分别提高了47.25%、31.59%和32.12%,同时各种实验设置和关键组件的分析凸显了人机协同学习在提高决策鲁棒性和整体驾驶性能方面的优势。

英文摘要

With the evolution of large foundation models (LFMs), data-driven autonomous driving has made significant strides. However, existing paradigms still face severe challenges in complex interaction and long-tail scenarios due to distribution shift and causal confusion. These limitations often result in a lack of human-level decision-making flexibility and safety in extreme conditions. To overcome these limitations, this paper proposes a Human-in-the-Loop Online Behavior Cloning framework (HiL-OBC) for autonomous driving, which aims to deeply integrate the cross-modal perceptual capabilities of LFMs with the high-level driving intelligence of human experts. Specifically, HiL-OBC deployment is executed through three critical phases: policy initialization with human intervention, latent behavioral modeling with Bayesian policy adaptation, and online deployment and updates. Furthermore, we design a Multi-modal Online Behavior Cloning (MOBC) model, which optimizes the base driving policy online through a lightweight network architecture, a takeover trigger mechanism, and a multi-variant loss function, thereby enhancing the system's decision-making robustness in complex environments. We evaluated HiL-OBC on the LangAuto-Human CARLA benchmark. Experimental results demonstrate that the driving policies optimized via the human-in-the-loop mechanism achieve substantial performance gains: the driving score (DS) of StructNav, LFG, and LMDrive increased by 47.25%, 31.59%, and 32.12%, respectively, while an accompanying analysis of various experimental settings and key components highlights the advantages of human-in-the-loop learning in improving decision-making robustness and overall driving performance.

2606.08249 2026-06-09 cs.RO cs.LG 新提交

Disturbance-Aware Aerial Robotics for Ethical Wildlife Monitoring

面向道德野生动物监测的扰动感知空中机器人

Mahmut Osmanovic, Isac Paulsson, Teddy Lazebnik

发表机构 * Department of Computing, Jonkoping University(延雪平大学计算机系) Department of Information Systems, University of Haifa(海法大学信息系统系)

AI总结 提出一种基于强化学习的扰动感知框架,用于异构空中机器人编队自主追踪野生动物,同时最小化行为干扰,在三种动物和四种行为模型上超越规则基线。

详情
AI中文摘要

可靠的野生动物监测对生态学和保护至关重要,然而许多现有方法,如标记、捕捉和近距离观察,可能会改变它们旨在测量的行为。空中机器人提供了一种可扩展的替代方案,在多项研究中显示出有前景的性能。尽管如此,现有方法通常缺乏行为感知,依赖固定启发式规则,或需要昂贵、不切实际且伦理上难以获取的真实世界训练数据。因此,目前尚无通用的自适应无人机监测框架,既能保持生态有效性,又能跨物种、行为和机器人平台扩展。在本研究中,我们引入了一种基于扰动感知强化学习的异构空中机器人编队框架,能够自主追踪野生动物,同时明确最小化行为干扰。我们将动物学模拟环境与基于真实轨迹统计拟合的动物运动模型相结合,并使用一种捕捉观测质量与扰动风险之间权衡的奖励公式来训练控制策略。在三种具有不同生态和运动模式的物种(鸽子、豺和距翅麦鸡)以及四种在自然界中常见的日益策略性的行为模型上,学习到的策略持续超越当前使用的基于规则的基线,并泛化到不同的监测任务、动物动态和无人机类型。这些结果确立了扰动感知学习作为非侵入式自主野生动物观测的可行基础,为生态学和保护中可扩展、道德负责且科学可靠的机器人监测开辟了道路。

英文摘要

Reliable wildlife monitoring is essential for ecology and conservation, yet many existing methods, such as tagging, capture, and close-range observation, can alter the very behaviors they aim to measure. Aerial robots offer a scalable alternative, which has shown promising performance in multiple studies. Nonetheless, existing approaches typically lack behavioral awareness, rely on fixed heuristics, or require real-world training data that are costly, impractical, and ethically difficult to obtain. As a result, there remains no general framework for adaptive drone-based monitoring that can both preserve ecological validity and scale across species, behaviors, and robotic platforms. In this study, we introduce a disturbance-aware reinforcement-learning-based framework for heterogeneous aerial robotic fleets that enables autonomous wildlife tracking while explicitly minimizing behavioral disruption. We couple a zoologically grounded simulation environment with fitted animal movement models derived from real trajectory statistics, and train control policies using a reward formulation that captures the trade-off between observation quality and disturbance risk. Across three species (pigeon, jackal, and spur-winged lapwing) with distinct ecologies and motion patterns and four increasingly strategic behavior models common in nature, the learned policies consistently surpassed currently used rule-based baselines and generalized across monitoring tasks, animal dynamics, and drone types. These results establish disturbance-aware learning as a viable foundation for non-invasive autonomous wildlife observation, opening a path towards scalable, ethically responsible, and scientifically reliable robotic monitoring in ecology and conservation.

2606.08470 2026-06-09 cs.RO 新提交

LUNA-AD: Lightweight Uncertainty-Aware Language Model with Lifelong Learning for Autonomous Driving

LUNA-AD: 面向自动驾驶的轻量级不确定性感知语言模型与终身学习

Ruoyu Yao, Pei Liu, Ruiguo Zhong, Mingxing Peng, Rui Yang, Jun Ma

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出LUNA-AD,一种结合三系统架构、多智能体分析、双头轻量模型和反思驱动终身学习的轻量级不确定性感知语言模型,在nuPlan上实现高成功率与低推理延迟。

Comments 16 pages,9 figures

详情
AI中文摘要

虽然大型语言模型(LLMs)提供了有前景的推理能力,但它们在安全关键的驾驶系统中的集成受到推理多样性有限、高计算开销和静态学习范式的阻碍。为了解决这些挑战,我们提出了LUNA-AD,一种面向自动驾驶(AD)的轻量级不确定性感知语言模型与终身学习。LUNA-AD采用三系统架构,协调复杂的多模态行为推理、高效部署和持续改进。我们设计了一个多智能体分析系统,通过多样化的假设探索生成不确定性感知的决策演示。一个双头轻量启发式模型被蒸馏,以统一决策分布和文本解释的推理,同时实现高效部署。此外,一种反思驱动的终身学习机制作用于多模态决策输出并保持策略多样性,允许通过闭环反馈改进候选决策和理由,以增强驾驶鲁棒性。在nuPlan基准上的大量实验表明,与现有知识驱动的AD框架相比,LUNA-AD在非反应式和反应式模式下均实现了最先进的成功率,并显著降低了推理延迟。

英文摘要

While large language models (LLMs) offer promising reasoning capabilities, their integration into safety-critical driving systems is hindered by limited reasoning diversity, high computational overhead, and static learning paradigms. To address these challenges, we propose LUNA-AD, a lightweight uncertainty-aware language model with lifelong learning for autonomous driving (AD). LUNA-AD features a tri-system architecture that reconciles complex multimodal behavioral reasoning, efficient deployment, and continual refinement. We design a multi-agent analytical system to generate uncertainty-aware decision-making demonstrations through diverse hypothesis exploration. A dual-head lightweight heuristic model is distilled to unify the inference of decision distributions and textual explanations while enabling efficient deployment. Furthermore, a reflection-driven lifelong learning mechanism operates on multimodal decision outputs and preserves strategic diversity, allowing for the refinement of candidate decisions and rationales via closed-loop feedback to enhance driving robustness. Extensive experiments on nuPlan benchmarks demonstrate that LUNA-AD achieves state-of-the-art success rates under both non-reactive and reactive modes, with drastically reduced inference latency compared to existing knowledge-driven AD frameworks.

2606.08513 2026-06-09 cs.RO cs.LG cs.SY eess.SY 新提交

Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning

面向自主水下机器人的端到端运动规划与执行:基于强化学习的方法

Elisei Shafer, Oren Gal

发表机构 * University of Haifa(海法大学)

AI总结 提出分层强化学习架构,将原始传感器数据直接映射为推进器指令,实现AUV端到端运动规划与执行,在HoloOcean仿真中轨迹长度接近RRT*基线(误差4%-6%),并具备鲁棒性。

详情
AI中文摘要

自主水下机器人(AUV)传统上依赖复杂、高度工程化的流水线进行感知、路径规划和运动控制。本文探索了一种端到端深度强化学习(DRL)方法的可行性,该方法将原始传感器数据直接映射为推进器指令,减少了人工工程。我们提出了一种分层强化学习(HRL)架构,将问题分解为两个马尔可夫决策过程。高层(HL)策略以2Hz运行,处理原始$84 \times 84$像素单目相机帧、堆叠的$100 \times 100$像素前视成像声纳以及本体感受数据,生成空间子目标。同时,低层(LL)策略以10Hz运行,将这些子目标转换为推进器指令。HL策略使用基于先前演示的强化学习(RLPD)在修改后的样本高效机器人强化学习(SERL)框架中训练,而LL策略则采用软演员-评论家(SAC)结合后见经验回放(HER)。在高保真HoloOcean模拟器中评估,我们的方法展示了成功的避障能力,轨迹长度与$\text{RRT}^*$规划基线非常接近(相差在4%到6%之间)。此外,学习到的策略对模拟传感器噪声和能见度降低表现出强鲁棒性。尽管系统能有效导航熟悉的几何环境,但实验揭示了在遇到具有新颖障碍形状的未访问区域时存在泛化限制。最终,这项工作展示了使用最小计算硬件进行样本高效、端到端DRL在水下导航中的潜力。

英文摘要

Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for perception, path planning, and motion control. This paper explores the feasibility of an end-to-end Deep Reinforcement Learning (DRL) approach that maps raw sensor data directly to thruster commands, reducing manual engineering. We propose a hierarchical reinforcement learning (HRL) architecture splitting the problem into two Markov Decision Processes. A High-Level (HL) policy operating at 2Hz processes raw $84 \times 84$ pixel monocular camera frames, stacked $100 \times 100$ pixel forward-looking imaging sonar, and proprioceptive data to generate spatial subgoals. Simultaneously, a Low-Level (LL) policy operating at 10Hz converts these subgoals into thruster commands. The HL policy is trained using Reinforcement Learning from Prior Demonstrations (RLPD) within a modified Sample-Efficient Robotic Reinforcement Learning (SERL) framework, while the LL policy utilizes Soft Actor-Critic (SAC) combined with Hindsight Experience Replay (HER). Evaluated in the high-fidelity HoloOcean simulator, our method demonstrates successful obstacle avoidance, achieving trajectory lengths closely approximating (within 4% to 6% of) an $\text{RRT}^*$ planning baseline. Furthermore, the learned policy exhibits strong robustness to simulated sensor noise and decreased visibility. While the system navigates familiar geometries effectively, experiments reveal generalization limitations when encountering unvisited areas with novel obstacle shapes. Ultimately, this work demonstrates the promise of sample-efficient, end-to-end DRL for underwater navigation using minimal computational hardware.
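The two-rate structure in the abstract above (a 2 Hz high-level policy issuing subgoals to a 10 Hz low-level policy) can be sketched as a single control loop. Only the two rates come from the abstract. The policies below are hypothetical stand-ins (a fixed-offset subgoal and a proportional controller on a 2D point), not the trained RLPD and SAC+HER policies.

```python
from typing import List, Tuple

HL_RATE_HZ = 2    # high-level policy rate, from the abstract
LL_RATE_HZ = 10   # low-level policy rate, from the abstract
LL_STEPS_PER_HL = LL_RATE_HZ // HL_RATE_HZ  # low-level steps per subgoal


def high_level_policy(position: List[float]) -> Tuple[float, float]:
    """Hypothetical stand-in: place a subgoal one metre ahead along x."""
    return (position[0] + 1.0, position[1])


def low_level_policy(position: List[float], subgoal: Tuple[float, float]) -> List[float]:
    """Hypothetical stand-in: proportional velocity command toward the subgoal."""
    gain = 1.0
    return [gain * (g - p) for g, p in zip(subgoal, position)]


def run_episode(duration_s: float):
    """Step the loop at the low-level rate, refreshing the subgoal every
    LL_STEPS_PER_HL steps. Returns final position and call counts."""
    position = [0.0, 0.0]
    subgoal = (0.0, 0.0)
    hl_calls = ll_calls = 0
    dt = 1.0 / LL_RATE_HZ
    for step in range(int(round(duration_s * LL_RATE_HZ))):
        if step % LL_STEPS_PER_HL == 0:
            subgoal = high_level_policy(position)
            hl_calls += 1
        command = low_level_policy(position, subgoal)
        ll_calls += 1
        # Toy kinematics: the command is applied directly as a velocity.
        position = [p + c * dt for p, c in zip(position, command)]
    return position, hl_calls, ll_calls


final_position, hl_calls, ll_calls = run_episode(1.0)
```

Over one simulated second the high-level policy is queried twice and the low-level policy ten times, which is the 1:5 ratio implied by the stated rates.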

2606.09088 2026-06-09 cs.RO 新提交

Autonomous FPV Flight with Translational Optical Flow and Uncertainty Mask

基于平移光流与不确定性掩膜的自主FPV飞行

Yang Deng, Yu Hu, Feng Yu, Linzuo Zhang, Danping Zou

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出利用平移光流和不确定性掩膜增强FPV四旋翼自主飞行,在仿真和真实森林环境中实现高达13.91 m/s和11.79 m/s的飞行速度,成功率93.3%。

详情
AI中文摘要

在复杂环境中使用单目RGB相机作为唯一外部传感器的自主FPV四旋翼飞行仍然是一个基本挑战。最近的研究表明,使用光流作为神经网络的输入可以实现杂乱场景中的端到端自主飞行。然而,从光流估计中提取最相关信息是限制敏捷性和鲁棒性的关键瓶颈。现有方法难以将障碍物引起的光流与自运动背景光流分离,并且在膨胀焦点(FoE)附近信噪比低。为了解决这些问题,我们将光流分解为平移和旋转分量,并仅利用捕捉场景几何和深度线索的平移光流。此外,我们引入了一种基于前向和后向光流估计不一致性的不确定性掩膜。该掩膜突出显示障碍物结构,包括FoE区域内的结构。这两个线索被输入到在可微仿真框架中训练的控制策略中,该框架能够实现感知和控制的一阶优化。我们通过在仿真和真实森林环境中的大量实验验证了我们的方法。所提出的系统在仿真中实现了高达13.91 m/s的速度,在真实测试中实现了11.79 m/s的速度,在30次真实试验中成功率为93.3%,几乎使先前报道的单目RGB光流无人机避障系统的6 m/s真实速度翻倍。

英文摘要

Autonomous FPV quadrotor flight in complex environments using a monocular RGB camera as the sole exteroceptive sensor remains a fundamental challenge. Recent research has shown that using optical flow as the input of a neural network can achieve end-to-end autonomous flight in cluttered scenes. However, extracting the most relevant information from the flow estimation is the key bottleneck limiting agility and robustness. Existing methods struggle to disentangle obstacle-induced optical flow from the ego-motion background flow and suffer from low signal-to-noise ratios near the focus of expansion (FoE). To address these issues, we decompose the optical flow into translational and rotational components and utilize only the translational flow, which captures scene geometry and depth cues. In addition, we introduce an uncertainty mask derived from inconsistencies between forward and backward flow estimates. This mask highlights obstacle structures, including those within the FoE region. Both cues are fed to a control policy trained in a differentiable simulation framework, which enables efficient first-order optimization across perception and control. We validate our approach through extensive experiments in both simulated and real-world forest environments. The proposed system achieves robust flight at speeds of up to 13.91 m/s in simulation and 11.79 m/s in real-world tests, with a 93.3% success rate over 30 real-world trials, nearly doubling the previously reported 6 m/s real-world speed of the monocular-RGB optical-flow UAV obstacle avoidance system.
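A mask built from forward/backward flow inconsistency, as mentioned in the abstract above, is commonly computed with a forward-backward consistency check. The sketch below shows that standard check in plain Python. The nearest-neighbour lookup, the fixed threshold, and the function names are assumptions, and the paper's actual mask construction may differ.

```python
import math
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]  # flow[y][x] = (dx, dy) in pixels


def fb_inconsistency(flow_fw: Flow, flow_bw: Flow) -> List[List[float]]:
    """For each pixel p, follow the forward flow to p' and add the backward
    flow found there. Consistent flows cancel, so the norm is near zero."""
    h, w = len(flow_fw), len(flow_fw[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow_fw[y][x]
            # Nearest-neighbour lookup, clamped to the image border.
            x2 = min(max(int(round(x + dx)), 0), w - 1)
            y2 = min(max(int(round(y + dy)), 0), h - 1)
            bx, by = flow_bw[y2][x2]
            out[y][x] = math.hypot(dx + bx, dy + by)
    return out


def uncertainty_mask(flow_fw: Flow, flow_bw: Flow, threshold: float = 1.0) -> List[List[bool]]:
    """True where the forward and backward estimates disagree by more than
    `threshold` pixels (the threshold value is an illustrative assumption)."""
    return [[err > threshold for err in row]
            for row in fb_inconsistency(flow_fw, flow_bw)]


# Toy example: a uniform one-pixel shift to the right, with one corrupted
# backward-flow estimate at pixel (y=2, x=3).
H, W = 4, 6
fw = [[(1.0, 0.0) for _ in range(W)] for _ in range(H)]
bw = [[(-1.0, 0.0) for _ in range(W)] for _ in range(H)]
bw[2][3] = (3.0, 0.0)
mask = uncertainty_mask(fw, bw)
```

Only the source pixel whose forward flow lands on the corrupted estimate is flagged. In real imagery such disagreement tends to concentrate at occlusion boundaries, which is consistent with the abstract's remark that the mask highlights obstacle structures.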

2606.09569 2026-06-09 cs.RO cs.CV 新提交

Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications

自动驾驶应用中相对位姿估计的高效最小求解器

Tao Li, Liang Liu, Jianli Han, Weimin Lv

发表机构 * College of Aerospace Science and Engineering, Naval Aviation University(海军航空大学航空航天科学与工程学院)

AI总结 提出基于新平移参数化和一阶旋转近似的统一框架,设计三种最小求解器(利用IMU垂直方向、转向旋转轴方向、平面运动假设),减少点对应数量和代数复杂度,在RANSAC中加速假设生成,平衡速度与精度。

详情
AI中文摘要

随着视觉传感系统的进步,计算机视觉在自动驾驶和机器人导航中扮演着越来越重要的角色。多相机系统中的相对位姿估计对于精确的车辆定位和环境感知至关重要,要求高实时性和鲁棒性。然而,现有方法通常涉及高计算成本并严重依赖丰富的特征匹配,限制了它们在时间敏感驾驶场景中的适用性。为解决这些限制,本文引入了一个基于新颖平移参数化和一阶旋转近似的统一框架,用于高效相对位姿估计。在该框架内,我们提出了三种专门为自动驾驶车辆设计的高效最小求解器。第一个求解器集成了惯性测量单元(IMU)的垂直方向先验,第二个在转向操作期间利用旋转轴方向先验,第三个专为平面运动设计——这是结构化道路上地面车辆的现实假设。通过减少最小点对应数量和代数复杂度,我们的方法能够在基于RANSAC的流程中更快地生成假设,提高对实时系统的适用性。在合成数据集和KITTI自动驾驶基准上的大量实验表明,与现有最先进算法相比,所提出的求解器在速度和精度之间实现了有利的平衡。

英文摘要

With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time performance and robustness. Existing methods, however, often involve high computational costs and rely heavily on abundant feature matches, limiting their applicability in time-sensitive driving scenarios. To address these limitations, this paper introduces a unified framework for efficient relative pose estimation, built upon a novel translation parameterization and first-order rotation approximation. Within this framework, we propose three efficient minimal solvers specifically designed for autonomous vehicles. The first solver integrates the vertical direction prior from Inertial Measurement Units (IMUs), the second utilizes the rotation axis direction prior during steering maneuvers, and the third is designed for planar motion - a realistic assumption for ground vehicles operating on structured roads. By reducing both the minimal number of point correspondences and the algebraic complexity, our methods enable faster hypothesis generation within RANSAC-based pipelines, improving suitability for real-time systems. Extensive experiments on synthetic datasets and the KITTI autonomous driving benchmark demonstrate that the proposed solvers achieve a favorable balance between speed and accuracy compared to existing state-of-the-art algorithms.
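The link between a smaller minimal sample and faster RANSAC, stated in the abstract above, follows from the standard iteration-count formula N = log(1 - p) / log(1 - w^s), where w is the inlier ratio, s the sample size and p the desired confidence. The sketch below evaluates it. The sample sizes and inlier ratio are illustrative assumptions, not the correspondence counts of the paper's solvers.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int, confidence: float = 0.99) -> int:
    """Number of random samples needed so that, with probability `confidence`,
    at least one sample consists of inliers only."""
    if not 0.0 < inlier_ratio < 1.0:
        raise ValueError("inlier_ratio must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    p_all_inliers = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_all_inliers))


# Illustrative comparison at a 50% inlier ratio for hypothetical solvers
# that need 2, 3 or 5 point correspondences.
iterations_by_sample_size = {s: ransac_iterations(0.5, s) for s in (2, 3, 5)}
```

Because the required count grows roughly as w^(-s), each correspondence removed from the minimal sample reduces the number of hypotheses to generate, on top of any savings from a simpler algebraic solver.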

2606.07756 2026-06-09 cs.CV cs.RO 交叉投稿

DroneDAR: Long-Range Drone Distance Estimation Using Monocular Vision and Bounding-Box Features

DroneDAR: 使用单目视觉和边界框特征的长距离无人机距离估计

Knut Peterson, Zaid Mayers, David Han

发表机构 * iMaPLe Research Lab, Drexel University(德雷塞尔大学iMaPLe研究实验室)

AI总结 针对长距离小无人机距离估计的挑战,提出DroneDAR模型,结合卷积骨干网络和轻量级门控机制融合边界框特征,分析骨干容量、裁剪分辨率和回归损失对性能的影响,并探讨远距离失效模式。

Comments 6 pages, 5 figures. Accepted to the 2026 International Conference on Advanced Visual and Signal-Based Systems (AVSS)

详情
AI中文摘要

在长距离图像中准确估计小型无人机的距离对于跟踪和态势感知至关重要,但由于极端的目标尺度变化、背景杂波和噪声视觉线索,这仍然具有挑战性。本文研究了使用图像裁剪和边界框几何进行单目无人机距离估计,这是一种实际设置,其中检测器提供候选无人机区域,模型从外观和框派生特征预测距离。我们评估了一个Droneranger风格的基线,并引入了一个新的DroneDAR(无人机检测与测距)模型,该模型通过轻量级门控机制将卷积骨干网络与显式边界框线索相结合。实验分析了骨干网络容量、裁剪分辨率和回归损失函数如何影响不同距离范围内的性能。我们进一步研究了远距离下的常见失效模式,包括对边界框噪声的敏感性和裁剪中纹理细节的减少。结果为设计和训练在真实远距离条件下保持鲁棒性的距离估计器提供了指导,并指出了在无人机仅占据几个像素时提高可靠性的方向。

英文摘要

Accurate distance estimation for small drones in long-range imagery is important for tracking and situational awareness, yet remains challenging due to extreme target scale variation, background clutter, and noisy visual cues. This paper studies monocular drone distance estimation using image crops together with bounding-box geometry, a practical setting in which a detector provides a candidate drone region and the model predicts range from appearance and box-derived features. We evaluate a Droneranger-style baseline, and introduce a new DroneDAR (Drone Detection And Ranging) model that combines a convolutional backbone with explicit bounding-box cues through a lightweight gating mechanism. Experiments analyze how backbone capacity, crop resolution, and regression loss functions affect performance across distance regimes. We further examine common failure modes at long distances, including sensitivity to bounding-box noise and reduced texture detail in the crop. The results provide guidance for designing and training range estimators that remain robust under real-world long-range conditions and highlight directions for improving reliability when drones occupy only a few pixels.

2606.08533 2026-06-09 cs.LG cs.RO 交叉投稿

Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

通过上下文对比元强化学习的自主空中操控

Lixuan Jin, Bingxuan Lan, Xinyi Bao, Xiangyuan Xie, Chunjie Zhang, Zheng Chen, Tianshuo Liu, Ruijie Tian, Jinyu Ru, Gang Wang, Lei Yuan, Yang Yu

发表机构 * National Key Laboratory of Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) Faculty of Robot Science and Engineering, Northeastern University(东北大学机器人科学与工程学院) National Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology(北京理工大学自主智能无人系统国家重点实验室)

AI总结 提出Aco2方法,通过上下文对比元强化学习,使四旋翼无人机在无需人工干预下自主完成不同载荷的抓取、运输和投递,并直接迁移到真实世界。

详情
AI中文摘要

无人机越来越多地部署在物流、服务机器人等实际应用中,对自主载荷获取和投递的需求日益增长。现有方法通常假设预附载荷或依赖专用夹爪,使得通用的端到端空中投递问题仍未解决,因为不同载荷会导致高度变化的飞行动力学,需要单一策略在线适应,无需手动校准或显式系统辨识。为此,我们研究了通过上下文对比元强化学习的自主空中操控(Aco2),这是一个完全自主的空中投递设置,其中配备轻型钩子的四旋翼无人机连续拾取、运输和投递各种带手柄的物体,在随机位置之间进行,全程无需人工干预。首先,我们设计了一个上下文观测编码器,从最近的交互历史中推断出紧凑的潜在上下文,使策略能够在线适应载荷相关的动力学。为了进一步提高上下文质量,我们引入了一个对比目标,该目标围绕任务相关变化结构化上下文嵌入,从而改善跨不同载荷的泛化能力,无需显式系统辨识。完全在模拟中训练,并采用广泛的域随机化,Aco2可以直接部署在物理四旋翼上,无需真实世界微调。

英文摘要

Unmanned aerial vehicles (UAVs) are increasingly being deployed in logistics, service robotics, and other real-world applications, creating a growing demand for autonomous payload acquisition and delivery. Existing approaches typically assume pre-attached payloads or rely on specialized grippers, leaving versatile end-to-end aerial delivery largely unresolved, where different payloads induce highly variable flight dynamics, requiring a single policy to adapt online without manual calibration or explicit system identification. To this end, we study Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning (Aco2), a fully autonomous aerial delivery setting in which a quadrotor equipped with a lightweight hook continuously picks up, transports, and delivers diverse handle-equipped objects between randomized locations, all without human intervention. First, we design a contextual observation encoder that infers a compact latent context from recent interaction history, enabling the policy to adapt online to payload-dependent dynamics. To further improve the quality of this context, we introduce a contrastive objective that structures the context embedding around task-relevant variations, improving generalization across diverse payloads without requiring explicit system identification. Trained entirely in simulation with extensive domain randomization, Aco2 can be directly deployed on a physical quadrotor without real-world fine-tuning.

2606.08680 2026-06-09 cs.CV cs.RO 交叉投稿

Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras

畸变感知的PETR用于混合针孔-鱼眼相机的BEV目标检测

Xiangzhong Liu

发表机构 * fortiss GmbH(fortiss有限公司)

AI总结 针对鱼眼相机径向畸变破坏BEV检测器均匀采样假设的问题,提出DAPETR,通过畸变感知位置编码和双向特征-几何协同调制模块,在KITTI-360基准上优于基线方法,并揭示了学习适应与显式几何重参数化之间的冲突。

Comments 8 pages, 5 figures, accepted at ICRA 2026

详情
AI中文摘要

鱼眼相机因其低成本和高覆盖视野(FOV)而被广泛部署于自动驾驶感知套件中,但其在3D目标检测中的潜力仍未得到充分利用。严重的径向畸变通过违反均匀采样的基本假设,对大多数BEV检测器构成挑战。为弥补这一差距,我们提出了畸变感知PETR(DAPETR),一种专为混合针孔-鱼眼相机设置设计的无投影检测器。DAPETR包含两个关键的学习自适应模块:一个统一的畸变感知位置编码,将图像表示的位置编码与鱼眼几何协调一致;以及一个双向特征-几何协同调制模块,使图像特征和3D位置编码相互适应。在我们转换的KITTI-360基准上的实验中,我们系统地将我们的学习自适应方法与极坐标下的PETR(PolarPETR)进行了比较。我们发现,尽管两种方法都优于基线,但我们的学习模块实现了更优的性能。关键的是,我们发现了两种策略结合时的负面交互,表明学习适应和显式几何重参数化可能冲突。我们的最终DAPETR模型显著推进了鱼眼BEV检测的研究和基准,为除图像校正外的有效畸变感知3D感知设计提供了关键见解。

英文摘要

Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains underleveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole-fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric reparameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design other than image rectification.
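Why fisheye distortion conflicts with pinhole-style sampling assumptions can be seen by comparing how each projection maps a ray's incidence angle to image radius. The sketch below contrasts the pinhole model r = f·tan(θ) with the equidistant fisheye model r = f·θ. The equidistant model and the focal length are assumptions chosen for illustration, since the abstract does not say which fisheye model the benchmark cameras follow.

```python
import math


def pinhole_radius(f_px: float, theta: float) -> float:
    """Image radius (pixels) of a ray at incidence angle theta, pinhole model."""
    return f_px * math.tan(theta)


def equidistant_radius(f_px: float, theta: float) -> float:
    """Image radius (pixels) under the equidistant fisheye model."""
    return f_px * theta


def pixels_per_degree(radius_fn, f_px: float, theta_deg: float, delta_deg: float = 1.0) -> float:
    """Radial pixels spanned by `delta_deg` degrees of view starting at theta_deg."""
    a = math.radians(theta_deg)
    b = math.radians(theta_deg + delta_deg)
    return (radius_fn(f_px, b) - radius_fn(f_px, a)) / delta_deg


F_PX = 300.0  # hypothetical focal length in pixels
pinhole_center = pixels_per_degree(pinhole_radius, F_PX, 0.0)
pinhole_edge = pixels_per_degree(pinhole_radius, F_PX, 80.0)
fisheye_center = pixels_per_degree(equidistant_radius, F_PX, 0.0)
fisheye_edge = pixels_per_degree(equidistant_radius, F_PX, 80.0)
```

Under these assumptions the fisheye spends the same number of pixels on every degree of view, while the pinhole model spends far more near the edge. The same pixel offset therefore corresponds to different ray directions in the two camera types, which is one way to see why positional encodings shared across mixed rigs need to account for the camera geometry.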

2606.08714 2026-06-09 eess.SY cs.AI cs.LG cs.RO cs.SY 交叉投稿

Hybrid Neural Network and Conventional Controller Approach for Robust Control of Highly Unstable Systems: Application to Tilt-Rotor Control

混合神经网络与传统控制器方法用于高度不稳定系统的鲁棒控制:应用于倾转旋翼控制

Ali Kafili Gavgani, Amin Talaeizadeh, Aria Alasty, Hossein Nejat Pishkenari

发表机构 * Advanced Research Lab for Control and Agricultural Robotics (Sharif AgRoLab)(控制与农业机器人先进研究实验室(Sharif AgRoLab)) Department of Mechanical Engineering, Sharif University of Technology, Tehran, Iran(谢里夫理工大学机械工程系,伊朗德黑兰)

AI总结 提出一种神经网络增强的滑模控制器,将系统动力学分解为输入无关和输入相关部分,前者用轻量网络从少量数据学习,实现对全驱动倾转旋翼系统的鲁棒控制,LSTM优于MLP。

Comments Proceedings of the 13th RSI International Conference on Robotics and Mechatronics (ICRoM 2025)

详情
AI中文摘要

多旋翼飞行器广泛应用于从监视到精准农业等领域,但传统设计仍受限于其欠驱动特性。倾转旋翼配置通过实现全驱动克服了这一限制。本文研究基于神经网络的控制策略,用于一个具有四个推力矢量输入的全驱动倾转旋翼系统。我们的工作分为两部分。首先,我们有意呈现一个负面结果,即评估一种直接输入-输出控制方法。在该方法中,多层感知器(MLP)、长短期记忆(LSTM)网络和Transformer模型被训练为直接将系统状态及其期望值映射到控制信号。我们表明该策略无法稳定系统,凸显了将直接输入-输出学习应用于高度不稳定被控对象的固有困难。其次,作为主要贡献,我们提出一种神经网络增强的滑模控制器(SMC)。该方法将系统动力学分解为输入无关和输入相关两部分,前者使用轻量网络从小规模数据集中学习,从而降低实时计算需求。此外,所提方法可以使用从低性能控制器收集的飞行日志进行训练,并且从真实数据学习到的动力学模型可用于仿真。我们进一步比较了基于MLP和LSTM的实现,在模型不确定性和外部干扰下展示了所提方法的鲁棒性和有效性;特别是,采用LSTM被控对象动力学预测器的控制器相比基于MLP的对应方案实现了更优性能,同时运行时间也更短。

英文摘要

Multirotors are widely used in applications ranging from surveillance to precision agriculture, yet conventional designs remain limited by their under-actuation. Tilt-rotor configurations overcome this limitation by enabling full actuation. This paper investigates neural-network-based control strategies for a fully actuated tilt-rotor system with four thrust-vectoring inputs. Our work is structured in two parts. First, we deliberately present a negative result by evaluating a direct input-output control approach. In this method, multilayer perceptrons (MLPs), long short-term memory (LSTM) networks, and transformer models are trained to map system states and their desired values directly to control signals. We show that this strategy fails to stabilize the system, highlighting the inherent difficulty of applying direct input-output learning to highly unstable plants. Second, as the main contribution, we propose a neural-network-enhanced sliding mode controller (SMC). The method decomposes the system dynamics into input-independent and input-dependent components, with the former learned from a small dataset using lightweight networks, thereby reducing real-time computational demands. Moreover, the proposed method can be trained using flight logs collected from low-performance controllers, and the resulting dynamic model learned from real-world data can be used in simulation. We further compare MLP- and LSTM-based implementations under model uncertainties and external disturbances, demonstrating the robustness and effectiveness of the proposed approach; in particular, the controller with the LSTM plant dynamics predictor achieves superior performance to its MLP-based counterpart while also exhibiting lower runtime.
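The decomposition described in the abstract above (dynamics split into an input-independent part that is learned and an input-dependent part) can be sketched with a scalar sliding mode controller. Everything below is a toy assumption for illustration: the unstable plant `x' = 2x + u`, the stand-in `learned_drift` with a 10% model error in place of a trained network, the unit input gain, and the switching gain. The tilt-rotor model itself is not reproduced here.

```python
def true_drift(x: float) -> float:
    """Input-independent dynamics f(x) of the toy plant (open-loop unstable)."""
    return 2.0 * x


def learned_drift(x: float) -> float:
    """Stand-in for a learned model of f(x), deliberately 10% off."""
    return 1.8 * x


def sign(s: float) -> int:
    return (s > 0) - (s < 0)


def smc_control(x: float, x_ref: float, k: float = 1.0) -> float:
    """Sliding mode law for x' = f(x) + u with a constant reference:
    cancel the modelled drift, then push the sliding variable s to zero.
    The switching gain k must exceed the model error |f - f_hat|."""
    s = x - x_ref
    return -learned_drift(x) - k * sign(s)


def simulate(x0: float = 1.0, x_ref: float = 0.0, dt: float = 1e-3, steps: int = 3000) -> float:
    """Forward-Euler rollout of the closed loop; returns the final state."""
    x = x0
    for _ in range(steps):
        u = smc_control(x, x_ref)
        x += dt * (true_drift(x) + u)  # input-dependent part g(x) = 1 here
    return x


x_final = simulate()
```

The state reaches a small neighbourhood of the reference despite the model error, with residual chattering on the order of `k * dt`. The switching term only has to dominate what the learned model gets wrong, which is the usual argument for pairing a learned drift model with a robust control law.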

2606.08844 2026-06-09 cs.CV cs.RO 交叉投稿

Geometry-Aware Fisheye-LiDAR Fusion for Robust 3D Object Detection in Low-Overlap Setups

几何感知鱼眼-激光雷达融合用于低重叠设置下的鲁棒3D目标检测

Xiangzhong Liu, Xihao Wang, Hao Shen

发表机构 * Technical University of Munich(慕尼黑工业大学)

AI总结 针对稀疏视角下鱼眼相机与激光雷达的几何畸变和低重叠问题,提出几何感知混合融合框架,通过畸变感知LSS模块和双注意力校正模块实现极坐标与笛卡尔特征融合,在三个基准上提升检测精度。

Comments 8 pages, 4 figures, submitted to RA-L

详情
AI中文摘要

随着自主系统从资本密集型的机器人出租车扩展到成本敏感的物流领域,传感器配置越来越优化以实现每单位成本的覆盖范围。一种常见的稀疏视图设置利用双鱼眼摄像头和车顶安装的激光雷达,引入了严重的几何挑战:极端径向畸变、最小重叠以及球面投影与笛卡尔网格之间的错位。BEV融合算法通常在流程早期将图像和点云模态强制统一到笛卡尔网格中,导致广角鱼眼相机出现显著的特征失真和信息丢失。为了解决这个问题,我们提出了一个几何感知混合融合(GA-HF)框架,该框架明确考虑了鱼眼几何和BEV特征失真,其中鱼眼特征通过畸变感知的Lift-Splat-Shoot(LSS)模块提升到极坐标BEV网格中以保留原生角密度,而激光雷达特征在原生笛卡尔空间中处理以实现边界框回归的度量保真度。为了桥接这些异构流,我们引入了一个双注意力扭曲校正模块,该模块在融合前对扭曲的相机特征应用空间和通道注意力,明确抑制低质量外围区域的伪影,同时增强高质量语义线索。GA-HF在三个基准数据集上进行了评估:KITTI-360、Dur360BEV和Fisheye3DOD。据我们所知,这是首个探索激光雷达-鱼眼相机融合的方法。在KITTI-360上,GA-HF相比笛卡尔基线将NDS提高了4.2%;在Dur360BEV上,它超越了仅激光雷达和BEVFusion,同时在几何畸变下显著降低了方向误差;在Fisheye3DOD上,它在所有融合方法中取得了最高的检测分数。

英文摘要

As autonomous systems expand from capital-intensive robotaxis to cost-sensitive logistics, sensor configurations are increasingly optimized for coverage-per-cost. A prevalent sparse-view setup utilizes dual-fisheye cameras with a roof-mounted LiDAR, introducing severe geometric challenges: extreme radial distortion, minimal overlap, and misalignment between spherical projections and rectilinear grids. BEV fusion algorithms typically force image and point cloud modalities into unified Cartesian grids early in the pipeline, causing significant feature distortion and information loss for wide-view fisheye cameras. To address this, we propose a Geometry-Aware Hybrid Fusion (GA-HF) framework that explicitly accounts for fisheye geometry and BEV feature distortion, where fisheye features are lifted into a polar BEV grid via a Distortion-Aware Lift-Splat-Shoot (LSS) module to preserve native angular density, while LiDAR features are processed in native Cartesian space for metric fidelity of bounding box regression. To bridge these heterogeneous streams, we introduce a Dual-Attention Warping Correction module that applies spatial and channel attention to the warped camera features before fusion, explicitly suppressing artifacts in low-quality peripheral regions while enhancing high-quality semantic cues. GA-HF is evaluated on three benchmarks: KITTI-360, Dur360BEV, and Fisheye3DOD datasets. To the best of our knowledge, it is the first approach to explore LiDAR-fisheye camera fusion. On KITTI-360, GA-HF improves NDS by 4.2% over Cartesian baselines; on Dur360BEV, it surpasses both LiDAR-only and BEVFusion, while significantly reducing orientation error despite the geometric distortions; on Fisheye3DOD, it attains the highest detection score among all fusion methods.

2503.01125 2026-06-09 cs.RO 版本更新

TACO: General Acrobatic Flight Control via Target-and-Command-Oriented Reinforcement Learning

TACO:基于目标与指令导向强化学习的通用特技飞行控制

Zikang Yin, Canlun Zheng, Shiliang Guo, Zhikun Wang, Shiyu Zhao

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) WINDY Lab, Department of Artificial Intelligence, Westlake University(西湖大学人工智能系WINDY实验室)

AI总结 本文提出TACO框架,通过目标和指令导向的强化学习统一处理不同的特技机动任务,并支持在线调整飞行模式参数;结合带输入输出缩放的谱归一化方法提升策略的平滑性与对称性,并验证了其在高速圆周飞行和连续多次翻转中的能力。

Comments For the experiment video, please refer to https://youtu.be/x1v7nD2iHIk

详情
AI中文摘要

尽管特技飞行控制已得到广泛研究,但现有方法的一个关键局限在于,它们通常只适用于特定的机动任务,且无法在线改变飞行模式参数。本文提出目标和指令导向的强化学习(TACO)框架,可统一处理不同的机动任务并支持在线参数调整。此外,我们提出一种带输入输出缩放的谱归一化方法,以增强策略的时间和空间平滑性、独立性和对称性,从而克服仿真到现实的差距。我们通过大量仿真和真实世界实验验证了TACO方法,证明其能够实现高速圆周飞行和连续多次翻转。

英文摘要

Although acrobatic flight control has been studied extensively, one key limitation of the existing methods is that they are usually restricted to specific maneuver tasks and cannot change flight pattern parameters online. In this work, we propose a target-and-command-oriented reinforcement learning (TACO) framework, which can handle different maneuver tasks in a unified way and allows online parameter changes. Additionally, we propose a spectral normalization method with input-output rescaling to enhance the policy's temporal and spatial smoothness, independence, and symmetry, thereby overcoming the sim-to-real gap. We validate the TACO approach through extensive simulation and real-world experiments, demonstrating its capability to achieve high-speed circular flights and continuous multi-flips.
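As a rough illustration of the spectral-normalization idea named in the abstract, the sketch below estimates a weight matrix's largest singular value by power iteration and rescales the matrix so that this value does not exceed a target. This is generic spectral normalization in plain Python; the input-output rescaling that TACO adds is not described in the abstract and is not reproduced here. All function names are hypothetical.

```python
import math

def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def spectral_norm(w, iters=100):
    """Estimate the largest singular value of matrix w by power iteration."""
    wt = [list(col) for col in zip(*w)]  # transpose of w
    v = [1.0] * len(w[0])
    for _ in range(iters):
        v = _matvec(wt, _matvec(w, v))  # one step of iteration on w^T w
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    # With v a unit vector near the top right-singular vector, |w v| ~ sigma_max.
    return math.sqrt(sum(x * x for x in _matvec(w, v)))

def spectrally_normalize(w, target=1.0, iters=100):
    """Scale w down so its largest singular value is at most `target`."""
    sigma = spectral_norm(w, iters)
    if sigma <= target:
        return [row[:] for row in w]
    scale = target / sigma
    return [[x * scale for x in row] for row in w]
```

Bounding each layer's largest singular value bounds the layer's Lipschitz constant, which is the usual motivation for using this technique to obtain smoother policy outputs.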

2508.00917 2026-06-09 cs.RO cs.CV cs.LG 版本更新

A Survey on Deep Multi-Task Learning in Connected Autonomous Vehicles

联网自动驾驶车辆中深度多任务学习综述

Jiayuan Wang, Farhad Pourpanah, Q. M. Jonathan Wu, Ning Zhang

发表机构 * Department of Electrical and Computer Engineering, University of Windsor(温莎大学电气与计算机工程系) Department of Electrical and Computer Engineering, Queen’s University(皇后大学电气与计算机工程系)

AI总结 综述联网自动驾驶车辆中深度多任务学习,涵盖感知、预测、规划、控制及V2X通信与资源管理,分析现有方法优缺点并指出未来方向。

详情
AI中文摘要

联网自动驾驶车辆(CAVs)必须同时执行多个任务,如感知、预测、规划和控制,以确保在复杂环境中安全可靠地导航。此外,通过车联万物(V2X)通信,可以实现CAVs之间的协同感知和驾驶,从而减轻单个车辆的局限性,同时也引入了严格的延迟、可靠性和带宽约束。传统上,任务使用单独的模型处理,这导致部署成本高、计算开销增加以及实现实时性能的挑战。多任务学习(MTL)最近成为一种有前景的解决方案,能够在统一模型中联合学习多个任务,从而提供更高的效率和资源利用率。据我们所知,本综述是首次专注于CAVs中深度MTL的全面回顾。我们首先概述CAVs和MTL以提供基础背景。然后,我们回顾了CAVs关键功能领域的MTL方法,包括感知、预测、规划、控制以及V2X通信和无线电资源管理(RRM)。对于前四个领域,我们将现有工作分为仅单车(车载)和V2X增强协同(多智能体)范式。我们进一步将V2X通信和RRM作为以通信为中心的MTL问题进行讨论。最后,我们讨论了现有方法的优势和局限性,识别了关键研究空白,并提供了旨在推进CAV系统MTL方法的未来研究方向。

英文摘要

Connected autonomous vehicles (CAVs) must simultaneously perform multiple tasks, such as perception, prediction, planning, and control, to ensure safe and reliable navigation in complex environments. Moreover, through vehicle-to-everything (V2X) communication, cooperative perception and driving among CAVs can be enabled, thereby mitigating the limitations of individual vehicles, while it also introduces stringent latency, reliability, and bandwidth constraints. Traditionally, tasks are addressed using separate models, which leads to high deployment costs, increased computational overhead, and challenges in achieving real-time performance. Multi-task learning (MTL) has recently emerged as a promising solution that enables the joint learning of multiple tasks within a unified model. This offers improved efficiency and resource utilization. To the best of our knowledge, this survey is the first comprehensive review focusing on deep MTL in CAVs. We begin with an overview of CAVs and MTL to provide foundational background. Then, we review MTL approaches across key functional domains in CAVs, including perception, prediction, planning, control, as well as V2X communications and radio resource management (RRM). For the first four domains, we categorize existing works under ego vehicle-only (onboard-only) and V2X-enhanced cooperative (multi-agent) paradigms. We further discuss V2X communications and RRM as communication-centric MTL problems. Finally, we discuss the strengths and limitations of existing methods, identify key research gaps, and provide future research directions aimed at advancing MTL methodologies for CAV systems.

2606.01205 2026-06-09 cs.RO 版本更新

ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning

ImagineUAV:通过世界-动作建模和运动动力学规划实现空中视觉语言导航

Xuchen Liu, Jiawei Huang, Shihao Xia, Bingxi Liu, Jinqiang Cui, Jiankun Yang

发表机构 * Pengcheng Laboratory(鹏城实验室) School of Computer Science and Cyber Engineering(计算机科学与网络工程学院) Guangzhou University(广州大学) Southern University of Science and Technology(南方科技大学)

AI总结 针对无人机视觉语言导航中几何不一致和动力学失配问题,提出基于潜视频扩散模型的世界-动作建模框架,通过生成未来观测推断6自由度运动并规划无碰撞轨迹,以1.3B参数在基准和实际飞行中超越先前方法。

Comments Video demo: https://www.youtube.com/watch?v=Ng1alP0yhc0

详情
AI中文摘要

无人机的视觉语言导航(VLN)要求在部分可观测条件下将自由形式的指令落实为6自由度飞行。虽然视觉-语言-动作(VLA)模型在语义推理方面表现出色,但几何不一致和动力学失配使其较为脆弱。为了解决这个问题,我们提出了ImagineUAV,一个利用级联世界-动作建模的想象驱动框架。ImagineUAV不直接回归动作,而是采用潜视频扩散模型生成指令条件下的未来观测,显式地想象环境演化,再通过动作提取器从中推断6自由度运动。随后,运动动力学规划器将这些估计优化为无碰撞轨迹。此外,经步数蒸馏的推理流水线确保实时执行。仅凭1.3B参数,ImagineUAV在基准测试和真实飞行中均优于先前的VLN和VLA基线,验证了想象驱动空中导航的实用性。

英文摘要

Vision-language navigation (VLN) for UAVs demands grounding free-form instructions into 6-DoF flight under partial observability. While Vision-Language-Action (VLA) models excel at semantic reasoning, they suffer from brittleness due to geometric inconsistency and dynamics mismatch. To address this, we propose ImagineUAV, an imagination-driven framework leveraging cascaded world-action modeling. Instead of direct regression, ImagineUAV employs a latent video diffusion model to generate instruction-conditioned future observations, explicitly imagining environmental evolution, from which 6-DoF motions are inferred via an action extractor. A kinodynamic planner then refines these estimates into collision-free trajectories. Additionally, a step-distilled inference pipeline ensures real-time execution. With only 1.3B parameters, ImagineUAV outperforms prior VLN and VLA baselines on benchmarks and real-world flights, validating the practicality of imagination-driven aerial navigation.

2509.20906 2026-06-09 cs.CV cs.RO 版本更新

Distant Object Localisation from Noisy Image Segmentation Sequences

基于噪声图像分割序列的远距离目标定位

Julius Pesonen, Arno Solin, Eija Honkavaara

发表机构 * Research Council of Finland(芬兰研究理事会) RCF Flagship Forest–Human–Machine Interplay—Building Resilience, Redefining Value Networks and Enabling Meaningful Experiences (UNITE)(RCF旗舰森林-人类-机器交互——构建韧性,重新定义价值网络和赋能有意义体验(UNITE))

AI总结 针对远距离目标定位问题,提出多视图三角测量和粒子滤波两种方法,后者还能提供形状和不确定性估计,结合无人机图像分割与GNSS姿态估计实现可靠野火监测。

详情
AI中文摘要

基于相机测量序列的3D目标定位对于安全关键的监视任务(如基于无人机的野火监测)至关重要。使用相机检测到的目标定位通常可以通过专门的传感器配置或3D场景重建来解决。然而,对于远距离目标或受限于可用计算资源的任务,这两种解决方案都不可行。在本文中,我们表明该任务可以通过多视图三角测量或粒子滤波来解决,后者还提供形状和不确定性估计。我们使用3D模拟和基于无人机的图像分割序列以及基于全球导航卫星系统(GNSS)的相机姿态估计来研究这些解决方案。结果表明,将所提出的方法与现有的图像分割模型和无人机携带的计算资源相结合,可以为基于无人机的野火监测提供可靠的系统。所提出的解决方案与检测方法无关,还能快速适应类似任务。代码可在以下网址获取:https://fgi_nls.gitlab.io/public/distant-localisation

英文摘要

3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with specialised sensor configurations or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved with either multi-view triangulation or particle filters, with the latter also providing shape and uncertainty estimates. We studied the solutions using 3D simulation and drone-based image segmentation sequences with global navigation satellite system (GNSS) based camera pose estimates. The results suggest that combining the proposed methods with pre-existing image segmentation models and drone-carried computational resources yields a reliable system for drone-based wildfire monitoring. The proposed solutions are independent of the detection method, also enabling quick adaptation to similar tasks. Code is available at https://fgi_nls.gitlab.io/public/distant-localisation
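The multi-view triangulation option can be sketched as a least-squares intersection of viewing rays. This is a generic textbook formulation written for illustration, not necessarily the one used in the paper; the function names and the assumption that each detection has already been converted to a ray (camera position plus viewing direction in a common world frame) are ours.

```python
import math

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def _solve3(a, b):
    """Solve the 3x3 system a x = b by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):  # back substitution
        s = m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / m[r][r]
    return x

def triangulate(origins, directions):
    """Return the point minimising the summed squared distance to all rays."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        d = _unit(d)
        for i in range(3):
            for j in range(3):
                # Projector onto the plane orthogonal to the ray: I - d d^T.
                p_ij = (1.0 if i == j else 0.0) - d[i] * d[j]
                a[i][j] += p_ij
                b[i] += p_ij * o[j]
    return _solve3(a, b)
```

With noisy segmentation centroids the rays no longer meet in a single point, and the least-squares solution returns the point closest to all of them. Unlike the particle-filter option described in the abstract, it gives no shape or uncertainty estimate.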

2602.10234 2026-06-09 physics.soc-ph cs.AI cs.RO 版本更新

Transforming Police-Car Swerving for Mitigating Isolated Stop-and-Go Traffic Waves: A Practice-Oriented Jam-Absorption Driving Strategy

借鉴警车蛇形行驶缓解孤立走走停停交通波:一种面向实践的拥堵吸收驾驶策略

Zhengbing He

发表机构 * Faculty of Science and Engineering, University of Nottingham Ningbo China(宁波诺丁汉大学理工学院)

AI总结 本文受警车蛇形行驶行为启发,提出一种面向实践的拥堵吸收驾驶(JAD)策略:通过定义JAD三角形,利用单车辆和双探测器抑制孤立的走走停停波,并系统分析五个关键参数,仿真验证了其有效性。

详情
AI中文摘要

走走停停交通波是高速公路拥堵的主要形式,会造成严重且持续的不利影响,包括交通效率下降、安全风险上升和车辆排放增加。在各种高速公路交通管理策略中,拥堵吸收驾驶(JAD)由专用车辆在被走走停停波捕获前执行“慢进快出”操作,已被提出作为抑制此类波传播的一种有前景的方法。然而,现有大多数JAD策略仍不实用,主要原因是缺乏对实施车辆和运行条件的考虑。受现实中警车蛇形行驶行为的启发,本文首先引入单车辆双探测器拥堵吸收驾驶(SD-JAD)问题,然后基于JAD三角形的定义提出一种实用的JAD策略,将这种行为转化为能够抑制孤立走走停停波传播的交通控制策略。本文识别并系统分析了显著影响所提策略的五个关键参数,即JAD速度、流入交通速度、波宽、波速和波内速度。以基于SUMO的仿真为例,进一步展示了如何仅使用两个固定路侧交通探测器在实际中测量这些参数。结果表明,所提出的JAD策略成功抑制了走走停停波的传播,且未引发二次波。本文有望朝JAD的实际实施迈出重要一步,将其从理论概念推进为可行且可部署的交通管理策略。

英文摘要

Stop-and-go traffic waves, a major form of freeway congestion, impose severe and persistent adverse impacts, including reduced traffic efficiency, increased safety risks, and elevated vehicle emissions. Among various freeway traffic management strategies, jam-absorption driving (JAD), in which a dedicated vehicle performs "slow-in" and "fast-out" maneuvers before being captured by a stop-and-go wave, has been proposed as a promising approach to suppressing the propagation of such waves. However, most existing JAD strategies remain impractical, primarily due to the lack of consideration of implementation vehicles and operational conditions. Inspired by real-world observations of police-car swerving behavior, this paper first introduces the Single-Vehicle Double-Detector Jam-Absorption Driving (SD-JAD) problem and then proposes a practical JAD strategy based on a definition of the JAD Triangle, transforming such behavior into a traffic control strategy capable of suppressing the propagation of an isolated stop-and-go wave. Five key parameters that significantly affect the proposed strategy, namely JAD speed, inflow traffic speed, wave width, wave speed, and in-wave speed, are identified and systematically analyzed. Using a SUMO-based simulation as an illustrative example, we further demonstrate how these parameters can be measured in practice using only two stationary roadside traffic detectors. The results show that the proposed JAD strategy successfully suppresses the propagation of a stop-and-go wave without triggering secondary waves. This paper is expected to take a significant step toward the practical implementation of JAD, advancing it from a theoretical concept to a feasible and deployable traffic management strategy.

9. 软体机器人与硬件设计 4 篇

2606.08104 2026-06-09 cs.RO 新提交

Reinforcement learning in linear embedding space unlocks generalizable control across soft robot configurations

线性嵌入空间中的强化学习解锁软体机器人配置的通用控制

Xinglong Zhang, Cong Li, Hangjie Mo, Yue Jiang, Xin Xu, Wei Jiang, Zhenshan Bing, Yihe Yang, Xiaojian Li, Yueneng Yang, Huimin Lu, Ling-li Zeng, Alois Knoll, Dewen Hu, Li Wen, Wei Pan

发表机构 * National University of Defense Technology(国防科技大学) Hefei University of Technology(合肥工业大学) Nanjing University (Suzhou Campus)(南京大学(苏州校区)) Technical University of Munich(慕尼黑工业大学) Beihang University(北京航空航天大学) Newcastle University(纽卡斯尔大学)

AI总结 提出基于共享线性Koopman嵌入空间的强化学习框架,将控制策略与机器人形态解耦,实现跨33种软体机器人配置的快速迁移,样本量减少75倍,并支持高速运动、重载和多执行器故障下的鲁棒控制。

Comments An updated version of this paper has been accepted by Nature Communications

详情
AI中文摘要

软体生物如章鱼和大象鼻子展现出显著的形态适应性,能够动态重构身体形状和刚度,并灵活调整控制策略以实现多功能行为。受这些生物系统启发,近几十年来出现了各种软体机器人,它们采用针对特定任务定制的不同材料、刚度和形态。尽管软体机器人的材料和结构设计取得了重大进展,但开发一个能够跨不同配置快速适应的通用控制框架仍然是一个长期挑战。现有控制器局限于固定配置,需要针对新配置进行费力的特定配置重新建模和策略重新设计。本文介绍了一种通用控制系统,通过共享线性Koopman嵌入空间中的强化学习,实现跨多种软体机器人配置的快速适应。通过将机器人动力学编码到该嵌入空间,我们的方法将控制策略与特定形态解耦,允许跨不同配置进行实时、无模型的策略适应,而无需从头重新训练。我们在33种不同的机器人配置上验证了该系统。该系统在跨配置的迁移样本量上减少了75倍,同时在高速运动、重负载和多执行器故障下保持鲁棒性能,并实现了软体机器人领域此前无法获得的现实技能。这项工作为多种软体机器人配置建立了一个统一且可适应的控制范式,弥合了机械可重构性与控制灵活性之间的差距,并可能为复杂物理系统中的通用控制提供更广泛的见解。

英文摘要

Soft-bodied organisms such as octopuses and elephant trunks exhibit remarkable morphological adaptability, dynamically reconfiguring body shape and stiffness, and flexibly adjusting their control strategies to enable versatile behaviors. Inspired by these biological systems, various soft robots have emerged in recent decades, featuring diverse materials, stiffnesses, and morphologies tailored to specific tasks. Despite substantial advances in the materials and structural designs of soft robots, developing a generalizable control framework capable of rapid adaptation across diverse configurations remains a long-standing challenge. Existing controllers are limited to fixed configurations, demanding laborious configuration-specific remodelling and policy redesign for new configurations. Here, we introduce a generalizable control system that enables rapid adaptation across diverse soft robot configurations via reinforcement learning in a shared linear Koopman embedding space. By encoding robot dynamics into this embedding space, our method decouples control policies from specific morphologies, allowing real-time, model-free policy adaptation across diverse configurations without retraining from scratch. We validate our system across 33 distinct robot configurations. Our system achieves a 75 times reduction in transfer samples across configurations, while sustaining robust performance under high-speed motion, heavy payloads, and multiactuator faults, and achieving real-world skills previously unattainable in soft robotics. This work establishes a unified and adaptable control paradigm for diverse soft robot configurations, bridging mechanical reconfigurability with control flexibility, and may offer broader insights for generalizable control in complex physical systems.

2606.09451 2026-06-09 cs.RO cs.CV cs.LG 新提交

Dense Force Estimation with an Event-based Optical Tactile Sensor

基于事件的光学触觉传感器的稠密力估计

Agis Politis, René Zurbrügg, Valentina Cavinato

发表机构 * Sony Advanced Visual Sensing, Zurich, Switzerland(索尼高级视觉传感公司,苏黎世,瑞士) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出首个利用事件相机重建稠密3D力场的方法,通过事件数据估计表面位移并映射为力,平均误差(0.14N,0.10N,0.93N),工作频率100Hz。

详情
AI中文摘要

人类依赖空间稠密、几何和力感知的触觉反馈以高时间分辨率进行灵巧操作。虽然基于视觉的触觉传感器能够实现稠密力估计,但受限于相机帧率、运动模糊和数据带宽。基于事件的光学触觉传感器具有微秒级时间分辨率和低运动模糊的优点,但现有方法仅限于预测净力。我们提出了首个利用基于事件的光学触觉传感器进行稠密3D力场重建的框架。我们的方法从事件数据估计3D表面位移,并通过逆有限元方法(iFEM)将其映射为力。剪切位移通过所提出的事件标记跟踪算法恢复,而法向位移则由卷积神经网络预测,该网络在收集的同步力-位移-事件数据集上训练。实验表明,该方法能够准确重建物理力,在力范围高达(4N,4N,20N)时,平均绝对误差为(0.14N,0.10N,0.93N),同时以平均100Hz的频率运行。这项工作为在机器人抓取和灵巧操作中实现高频控制的稠密力反馈迈出了第一步。

英文摘要

Humans rely on spatially dense, geometry and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Elements Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.

2512.12320 2026-06-09 cs.RO 版本更新

Programmable Deformation Design of Porous Soft Actuator through Volumetric-Pattern-Induced Anisotropy

通过体积图案诱导各向异性的多孔软体执行器可编程变形设计

Canqi Meng, Weibang Bai

发表机构 * ShanghaiTech Automation and Robotics (STAR) Center, School of Information Science and Technology, ShanghaiTech University(上海科技大学信息科学与技术学院自动化与机器人(STAR)中心)

AI总结 提出一种在多孔泡沫中切割图案实现软体执行器可编程变形的方法,通过有限元分析研究机制,实验展示弯曲、倾斜、扭转等变形,并应用于仿生软体手。

Comments Accepted to 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)

详情
AI中文摘要

传统的软体气动执行器通常基于中空弹性体腔室,往往存在结构支撑小的问题,并且需要昂贵的几何特定重新设计才能实现多模态功能。填充到腔室中的多孔材料(如泡沫)可以为执行器提供结构稳定性。然而,通过定制多孔体本身来实现可编程变形的方法仍未得到充分探索。本文提出了一种新颖的设计方法,通过在泡沫体中切割特定图案来实现具有可编程变形的软体多孔执行器。该方法引入了泡沫的局部结构各向异性,从而在全局真空输入下引导材料的变形。此外,讨论了圆柱形泡沫基底上的三种基本图案:横向用于弯曲,纵向用于倾斜,对角线用于扭转。利用有限元分析(FEA)建立了计算模型,以研究切口图案方法的机理。实验表明,通过图案阵列数N的潜在优化设计,执行器可以实现最大80°的弯曲(N=2)、18°的倾斜(N=1)和115°的扭转(N=8)。我们方法的通用性通过图案的可转移性、可扩展性和复杂设计的无模具快速原型制作得到证明。作为综合应用,我们将人类手部折痕图转化为功能性切口图案,创建了一个能够像人类一样自适应抓握的仿生软体机械手。我们的工作为多功能软体多孔机器人的设计提供了一种新的、高效且可扩展的范式。

英文摘要

Conventional soft pneumatic actuators, typically based on hollow elastomeric chambers, often suffer from small structural support and require costly geometry-specific redesigns for multimodal functionality. Porous materials such as foam, filled into chambers, can provide structural stability for the actuators. However, methods to achieve programmable deformation by tailoring the porous body itself remain underexplored. In this paper, a novel design method is presented to realize soft porous actuators with programmable deformation by incising specific patterns into the porous foam body. This approach introduces localized structural anisotropy of the foam guiding the material's deformation under a global vacuum input. Furthermore, three fundamental patterns on a cylindrical foam substrate are discussed: transverse for bending, longitudinal for tilting, and diagonal for twisting. A computational model is built with Finite Element Analysis (FEA), to investigate the mechanism of the incision-patterning method. Experiments demonstrate that with a potential optimal design of the pattern array number N, actuators can achieve bending up to $80^{\circ}$ (N=2), tilting of $18^{\circ}$ (N=1), and twisting of $115^{\circ}$ (N=8). The versatility of our approach is demonstrated via pattern transferability, scalability, and mold-less rapid prototyping of complex designs. As a comprehensive application, we translate the human hand crease map into a functional incision pattern, creating a bio-inspired soft robot hand capable of human-like adaptive grasping. Our work provides a new, efficient, and scalable paradigm for the design of multi-functional soft porous robots.

2512.20591 2026-06-09 cs.RO 版本更新

LightTact: A Visual-Tactile Fingertip Sensor for Deformation-Independent Contact Sensing

LightTact:一种用于变形无关接触感知的视觉-触觉指尖传感器

Changyi Lin, Boda Huo, Mingyang Yu, Emily Ruppel, Bingqing Chen, Jonathan Francis, Ding Zhao

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Bosch Center for Artificial Intelligence (BCAI)(博世人工智能中心(BCAI))

AI总结 提出LightTact传感器,通过环境光阻断光学配置实现变形无关的接触检测,在无宏观变形下(如液体、超软材料)实现高对比度接触分割,并解锁轻接触机器人操作。

Comments Project website: https://linchangyi1.github.io/LightTact

详情
AI中文摘要

接触通常发生在没有宏观表面变形的情况下,例如与液体、半液体或超软材料交互时。然而,大多数现有的触觉传感器依赖变形来推断接触,使得这种轻接触交互难以稳健感知。为解决这一问题,我们提出了LightTact,一种视觉-触觉指尖传感器,通过变形无关的原理使接触直接可见。LightTact采用环境光阻断光学配置,抑制非接触区域的外部光和内部照明,仅传输真实接触处产生的散射光。因此,LightTact生成高对比度原始图像,其中非接触像素保持近黑色(平均灰度值<3),接触像素保留接触表面的自然外观。基于此,LightTact实现了对材料属性、接触力、表面外观和环境光照鲁棒的像素级接触分割。我们进一步证明,LightTact解锁了需要检测极轻接触的新型机器人操作行为,包括水扩散、面霜蘸取和软薄膜交互。此外,我们展示了LightTact的空间对齐视觉-触觉图像可直接由视觉-语言模型解释。

英文摘要

Contact often occurs without macroscopic surface deformation, such as during interaction with liquids, semi-liquids, or ultra-soft materials. However, most existing tactile sensors rely on deformation to infer contact, making such light-contact interactions difficult to perceive robustly. To address this, we present LightTact, a visual-tactile fingertip sensor that makes contact directly visible via a deformation-independent principle. LightTact features an ambient-blocking optical configuration that suppresses both external light and internal illumination at non-contact regions, while transmitting only the scattered light generated at true contacts. As a result, LightTact produces high-contrast raw images in which non-contact pixels remain near-black (mean gray value < 3) and contact pixels preserve the natural appearance of the contacting surface. Built on this, LightTact achieves accurate pixel-level contact segmentation that is robust to material properties, contact force, surface appearance, and environmental lighting. We further demonstrate that LightTact unlocks new robotic manipulation behaviors that require detection of extremely light contact, including water spreading, facial-cream dipping, and soft thin-film interaction. In addition, we show that LightTact's spatially aligned visual-tactile images can be directly interpreted by vision-language models.
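Because non-contact pixels are reported to stay near-black, even a plain intensity threshold conveys the idea behind deformation-independent contact segmentation. The sketch below is only an illustration: using 3 as a per-pixel threshold is our assumption, borrowed from the reported mean gray value of non-contact regions, and the paper's actual segmentation method may differ. Function names are hypothetical.

```python
def contact_mask(gray_image, threshold=3):
    """Return a binary mask: 1 where a pixel is brighter than `threshold`.

    `gray_image` is a list of rows of 0-255 gray values. The default
    threshold is an illustrative assumption, not a value prescribed by the paper.
    """
    return [[1 if px > threshold else 0 for px in row] for row in gray_image]

def contact_fraction(mask):
    """Fraction of pixels flagged as contact (0.0 for an empty mask)."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total if total else 0.0
```

Such a mask depends only on whether light scattered at a contact reaches the camera, not on how far the surface has deformed.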

10. 仿真、数据集与评测 13 篇

2606.08039 2026-06-09 cs.RO 新提交

MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning

MuJoCo-Drones-Gym:用于控制和强化学习的GPU加速多无人机模拟器

Manan Tayal

发表机构 * TAU-Intelligence

AI总结 提出基于MuJoCo物理引擎的GPU加速多无人机模拟器MuJoCo-Drones-Gym,支持任意数量Crazyflie 2.x纳米四旋翼,提供模块化物理模型、动作接口和观测空间,集成PettingZoo多智能体强化学习,涵盖悬停、速度跟踪等七种任务环境。

Comments 18 pages, 8 figures, 7 tables

详情
AI中文摘要

机器人模拟器是现代空中机器人研究的基石,既作为新控制算法开发的工具,也作为训练强化学习策略的数据源。然而,现有的四旋翼学习环境通常在物理保真度、多智能体支持和现代深度强化学习管道所需吞吐量之间面临权衡。本文提出MuJoCo-Drones-Gym,一个基于MuJoCo物理引擎构建的开源Gymnasium兼容多无人机环境。MuJoCo-Drones-Gym支持任意数量的Bitcraze Crazyflie 2.x纳米四旋翼,并暴露模块化API用于选择:(i)物理模型(刚体MuJoCo、显式Python动力学,或地面效应、桨叶阻力和无人机间下洗流的任意子集),(ii)动作接口(每电机RPM、集体归一化推力、速度设定点或PID航点命令),以及(iii)观测空间(运动状态向量、RGB/深度/分割相机或邻域邻接信息)。PettingZoo ParallelEnv封装支持即插即用的多智能体强化学习,而一套七种任务环境——悬停、速度跟踪、多无人机悬停、航点导航、编队飞行、门赛竞速和通用多智能体模板——展示了接口的广度。我们描述了环境设计、底层物理和四旋翼动力学,并通过与密切相关项目gym-pybullet-drones相似的控制和学习示例说明其使用,同时利用MuJoCo改进的接触处理、渲染和并行化能力。

英文摘要

Robotic simulators are a cornerstone of modern research in aerial robotics, serving both as a vehicle for the development of new control algorithms and as the data source for training reinforcement learning (RL) policies. Yet, existing quadcopter learning environments often face a trade-off between physical fidelity, multi-agent support, and the throughput required by modern deep RL pipelines. In this paper, we present MuJoCo-Drones-Gym, an open-source Gymnasium-compatible multi-drone environment built on top of the MuJoCo physics engine. MuJoCo-Drones-Gym supports an arbitrary number of Bitcraze Crazyflie 2.x nano-quadcopters and exposes a modular API for selecting (i) the physics model (rigid-body MuJoCo, explicit Python dynamics, or any subset of ground effect, blade drag, and inter-drone downwash), (ii) the action interface (per-motor RPMs, collective normalized thrust, velocity setpoints, or PID waypoint commands), and (iii) the observation space (kinematic state vectors, RGB / depth / segmentation cameras, or neighbourhood adjacency information). A PettingZoo ParallelEnv wrapper enables drop-in multi-agent reinforcement learning, while a suite of seven task environments, hover, velocity tracking, multi-drone hover, waypoint navigation, formation flight, gate racing, and a generic multi-agent template, demonstrates the breadth of the interface. We describe the environment design, the underlying physics and quadcopter dynamics, and illustrate its use through control and learning examples that mirror those of the closely related gym-pybullet-drones project, while taking advantage of MuJoCo's improved contact handling, rendering, and parallelizability.

2606.08094 2026-06-09 cs.RO cs.AI cs.LG cs.SY eess.SY 新提交

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

vla.cpp:视觉-语言-动作模型的统一推理运行时

Khanh D. Nguyen, Hung T. Ho, Chinh T. Nguyen, Thanh Q. Duong, Linh D. Le, Duy M. H. Nguyen, Vien A. Ngo, An T. Le

发表机构 * VinRobotics Center for AI Research, VinUniversity(VinUniversity 人工智能研究中心) Intelligent Autonomous Systems, TU Darmstadt(达姆施塔特工业大学智能自主系统) Max Planck Research School for Intelligent Systems(马克斯·普朗克智能系统研究学院) University of Stuttgart(斯图加特大学) German Research Center for Artificial Intelligence(德国人工智能研究中心)

AI总结 提出vla.cpp,基于llama.cpp的便携C++推理运行时,支持多种VLA架构,在LIBERO-Object上接近SOTA性能,内存仅1.3 GiB,并实现跨硬件部署。

Comments 17 pages, 3 figures, 12 tables

详情
AI中文摘要

视觉-语言-动作(VLA)策略通常以Python/PyTorch软件栈的形式发布,并假设使用工作站级GPU,这与机器人实际运行的硬件并不匹配。我们提出了vla.cpp,一个基于llama.cpp的可移植C++推理运行时。据我们所知,它是第一个原生支持流匹配和扩散式VLA推理模式的ggml类引擎:在这种模式下,缓存的视觉-语言前缀由带交叉注意力的动作专家在多个求解器步骤中使用。单个运行时通过同一套请求/响应协议服务于横跨五个骨干网络家族和四个动作头家族的七种架构,每个模型打包为自包含的捆绑包。在LIBERO-Object上,该引擎与最先进的检查点相比,在200个回合中相差不超过一个回合,并以1.3 GiB内存运行BitVLA,成功率达到100%。同一个捆绑包无需修改即可在三个硬件层级上运行,从消费级GPU到8 GB嵌入式模块。跨硬件Roofline分析表明,批量大小为1的VLA推理受计算限制,因此决定部署效果的是利用率而非带宽;由此分析得出的IMMA阶梯式GEMM使BitVLA的每步延迟缩短为原来的约1/4.5。随后,我们在ALOHA机械臂上设计了一项机器人本体压力测试,用以隔离这样一种延迟约束:学习型VLA必须在其训练所面向的硬件上针对移动目标重新规划。代码、演示视频和可复现的基准测试框架可在https://fai-modelopt-tech.github.io/vla-cpp.github.io/获取。

英文摘要

Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.
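The roofline argument in the abstract rests on a simple relation: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity. The sketch below shows that relation only. The device numbers in the usage note are hypothetical and are not measurements from the paper.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound in FLOP/s.

    peak_flops: peak compute throughput (FLOP/s)
    mem_bandwidth: memory bandwidth (bytes/s)
    intensity: arithmetic intensity of the kernel (FLOP per byte moved)
    """
    return min(peak_flops, mem_bandwidth * intensity)

def is_compute_bound(peak_flops, mem_bandwidth, intensity):
    """True when intensity is at or beyond the ridge point peak/bandwidth."""
    return intensity >= peak_flops / mem_bandwidth
```

For a hypothetical device with 100 TFLOP/s peak and 1 TB/s bandwidth, the ridge point is 100 FLOP/byte. A kernel at 200 FLOP/byte is capped by compute, so raising utilization helps and extra bandwidth does not, which is the kind of conclusion the abstract draws for batch-1 VLA inference.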

2606.08278 2026-06-09 cs.RO 新提交

SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation

SIMPLE:基于仿真的人形机器人移动操作策略学习与评估

Songlin Wei, Zhenhao Ni, Jie Liu, Zhenyu Zhao, Junjie Ye, Hongyi Jing, Junkai Xia, Xiawei Liu, Michael Leong, Liang Heng, Di Huang, Yue Wang

发表机构 * USC Physical Superintelligence (PSI) Lab(南加州大学物理超级智能实验室)

AI总结 提出SIMPLE仿真平台,结合MuJoCo动力学与IsaacSim渲染,包含60个全身任务、50个室内场景和1000+物体资产,支持自动化轨迹生成和VR遥操作数据采集,并集成多种主流策略,实验证明仿真与真实世界性能强相关,可实现零样本迁移。

详情
AI中文摘要

人形基础模型的发展速度超过了我们评估它们的能力。虽然真实世界测试成本高昂且难以复现,但现有的仿真基准主要关注桌面或轮式机器人。面向人形机器人全身移动操作的可扩展且可复现的基准仍然是一个开放问题。为此,我们提出了SIMPLE,一个用于人形策略学习和评估的统一仿真测试平台。SIMPLE将MuJoCo精确的、接触丰富的动力学与IsaacSim的照片级真实感渲染相结合。它提供了一个大规模环境,包含60个多样的全身任务、50个室内场景和超过1000个物体资产。为了促进可扩展的数据收集,该框架集成了两个数据生成流水线:通过运动规划自动生成轨迹和低延迟VR遥操作接口。我们进一步在SIMPLE中大规模集成并基准测试了主流人形策略,包括轻量级模仿网络、大型视觉-语言-动作(VLA)模型以及最新的世界动作模型(WAM)。我们的实验揭示了策略在仿真和真实世界中的性能之间存在强相关性。此外,我们证明了在SIMPLE中收集的数据上训练的策略可以在相似设置下零样本迁移到物理人形机器人上,为人形机器人研究提供了稳健且可复现的基础。

英文摘要

Humanoid foundation models are advancing faster than we can evaluate them. While real-world testing is expensive and difficult to reproduce, existing simulation benchmarks focus primarily on table-top or wheeled robots. A scalable and reproducible benchmark for whole-body humanoid loco-manipulation remains an open problem. To this end, we present SIMPLE, a unified simulation testbed for humanoid policy learning and evaluation. SIMPLE couples the accurate contact-rich dynamics of MuJoCo with the photorealistic rendering of IsaacSim. It provides a large-scale environment comprising 60 diverse whole-body tasks, 50 indoor scenes, and over 1,000 object assets. To facilitate scalable data collection, the framework integrates two data generation pipelines: automated trajectory generation via motion planning and a low-latency VR teleoperation interface. We further integrate and benchmark mainstream humanoid policies at scale in SIMPLE, including lightweight imitation networks, large vision-language-action (VLA) models, and recent world action models (WAMs). Our experiments reveal a strong correlation between policy performance in simulation and the real world. Furthermore, we demonstrate that policies trained on data collected in SIMPLE can be transferred zero-shot to physical humanoid robots under similar settings, providing a robust and reproducible foundation for humanoid robotics research.

2606.08564 2026-06-09 cs.RO 新提交

Real-IKEA: Physical Fidelity is the Prerequisite for Robust Manipulation

Real-IKEA:物理保真度是鲁棒操作的前提

Kunqi Xu, Zhenhao Huang, Siyuan Luo, Ziqiu Zeng, Fan Shi

发表机构 * National University of Singapore(新加坡国立大学) Peking University(北京大学)

AI总结 针对仿真与现实物理差异导致操作鲁棒性不足的问题,提出Real-IKEA数据集与仿真框架,通过高保真资产和阻力校准配置,使强化学习策略发现优先利用机械优势的鲁棒策略。

详情
AI中文摘要

机器人操作的鲁棒性常常受挫于简化仿真与充满阻力的现实世界之间的物理差距。在这项工作中,我们强调在铰接交互中的物理真实性是鲁棒策略学习的重要因素。我们提出了Real-IKEA,一个以物理精度为首要目标的数据集和仿真框架。Real-IKEA提供了1,079个铰接资产配置,源自83个真实的IKEA把手和旋钮,经过细致的六步物理工作流程处理。对于接触几何精度,我们引入了一个双向表面偏差度量来量化碰撞网格。对于动力学真实性,我们建立了阻力校准配置,改变阻尼和摩擦。关键的是,我们通过强化学习策略证明,高保真资产能够发现鲁棒的“钩”和“杠杆”策略,这些策略优先考虑机械优势而非脆弱的摩擦拉动。总之,这些结果使Real-IKEA成为开发能够在铰接物体任务中达到人类水平鲁棒性的操作策略的关键基准。

英文摘要

Robotic manipulation robustness often founders on the physics gap between simplified simulations and the resistance-laden real world. In this work, we emphasize that physical realism in articulated interaction is an important ingredient for robust policy learning. We present Real-IKEA, a dataset and simulation framework designed with physical accuracy as a first-class goal. Real-IKEA provides 1,079 articulated asset configurations, derived from 83 authentic IKEA handles and knobs processed through a meticulous six-step physical workflow. For contact-geometry accuracy, we introduce a bidirectional surface-deviation metric to quantify collision meshes. For dynamics realism, we establish resistance-calibrated configurations that vary damping and friction. Crucially, we demonstrate through a Reinforcement Learning (RL) policy that high-fidelity assets enable the discovery of robust "hooking" and "levering" strategies that prioritize mechanical advantage over fragile friction-pulling. Together, these results position Real-IKEA as a critical benchmark for developing manipulation policies capable of human-level robustness in articulated object tasks.
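The abstract does not define its bidirectional surface-deviation metric, so the sketch below shows only one plausible reading: a two-directional mean nearest-neighbour distance between points sampled from a reference surface and from a collision mesh. Both the formulation and the function names are assumptions made for illustration.

```python
import math

def _mean_nearest(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def bidirectional_deviation(points_a, points_b):
    """Return (a_to_b, b_to_a) mean nearest-neighbour deviations.

    Reporting both directions exposes asymmetric errors: geometry present
    in one surface but missing from the other shows up in one value only.
    """
    return _mean_nearest(points_a, points_b), _mean_nearest(points_b, points_a)
```

In this reading, a collision mesh that adds spurious geometry keeps the reference-to-mesh deviation low while raising the mesh-to-reference deviation, which a one-directional measure would miss.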

2606.08688 2026-06-09 cs.RO cs.CV 新提交

PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

PhysAgent:通过轨迹驱动的多智能体反馈实现基于物理的4D合成自动化

Chunji Lv, Jiaxi Ye, Yuchen Jiang, Rexar Lin, Changsheng Li

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出PhysAgent,首个模拟器在环的多智能体框架,通过解耦材料与外力、利用视觉基础模型提取轨迹并借助LLM常识推理,实现自动化、物理可信的4D运动合成,显著提升生成多样性与物理准确性。

详情
AI中文摘要

实现完全自动化、物理合理的3D运动合成是图形学和生成式AI的核心目标。然而,配置复杂的环境力场仍然完全依赖人工专家干预,成为大规模模拟数据生成的严重瓶颈。现有自动化方法主要关注材料优化,在应用于更复杂的力场优化空间时表现出严重的模态差距和技术缺陷:朴素的大语言模型缺乏底层模拟反馈,导致严重的物理不准确性,而传统的分数蒸馏采样存在梯度缓慢、陷入局部最优以及数学上无法动态切换离散力场的问题。为此,我们提出PhysAgent,首个模拟器在环的多智能体框架,利用多模态输入实现自动化、基于物理的4D合成。通过将内在材料与外在动力学解耦,PhysAgent利用配备外化力场技能模块的语义智能体掌握模拟规则并生成有效初始化。随后,由轨迹驱动的多智能体反馈驱动的精炼智能体,借助视觉基础模型从渲染帧中提取密集点轨迹。通过将这些显式运动轨迹转换为结构化文本描述符,智能体利用LLM常识推理执行零样本宏观跳跃,有效逃离局部最优并动态切换离散力场。大量实验表明,PhysAgent能够从任意多模态提示快速生成稳定、多样的物理场景,在生成多样性和物理准确性上显著优于现有基线。

英文摘要

Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large-scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator-in-the-loop multi-agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory-Grounded Multi-Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero-shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.

2606.08729 2026-06-09 cs.RO cs.LG 新提交

IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking

IR-SIM:一种用于导航、学习和基准测试的轻量级技能原生模拟器

Ruihua Han, Shuai Wang, Chengyang Li, Rui Gao, Xinyi Wang, Zhe Liu, Guoliang Li, Yupu Lu, Qi Hao, Jia Pan, Hengshuang Zhao

发表机构 * The University of Hong Kong(香港大学) Shenzhen Institutes of Advanced Technology(深圳先进技术研究院) Southern University of Science and Technology(南方科技大学) University of Michigan(密歇根大学) University of Macau(澳门大学)

AI总结 提出轻量级技能原生导航模拟器IR-SIM,通过YAML配置完全定义场景,支持文本提示生成与修改,用于导航算法基准测试和训练数据自动生成,并桥接高保真模拟器和真实部署。

Comments 12 pages, 6 figures, project website: https://github.com/hanruihua/ir-sim

详情
AI中文摘要

模拟在由大型语言模型(LLM)支持的自动化机器人研究中起着关键作用。然而,现有的模拟器通常需要自定义代码或复杂接口,为快速原型设计和自动化算法开发设置了障碍。为此,我们提出了智能机器人模拟器(IR-SIM),一种轻量级的技能原生导航模拟器,专为快速场景构建、基准测试和机器人学习而设计。在IR-SIM中,场景完全由YAML配置文件定义,这些文件指定了移动机器人运动学、几何碰撞检测、激光雷达感知、可视化和行为模块。这种设计使机器人模拟完全可描述和可复现,允许通过提出的IR-SIM智能体技能从文本提示生成和修改场景。生成的场景可用于导航算法的自动基准测试以及学习方法的训练数据自动生成。此外,IR-SIM提供了到高保真模拟器和真实世界部署的桥梁,允许用户在原型设计后无需额外编码即可在更真实的环境中验证其算法。实验展示了IR-SIM在多个任务中的便利性和多功能性:从自然语言构建导航场景、训练避碰策略、对社交导航策略进行基准测试,以及桥接到高保真模拟器和真实世界部署。项目网站见https://github.com/hanruihua/ir-sim。

英文摘要

Simulation plays a key role in automated robotics research supported by large language models (LLMs). However, existing simulators often require custom code or complex interfaces, creating a barrier to rapid prototyping and automated algorithm development. To this end, we propose the Intelligent Robot Simulator (IR-SIM), a lightweight skill-native navigation simulator designed for rapid scenario construction, benchmarking, and robot learning. In IR-SIM, scenarios are entirely defined by YAML configuration files that specify mobile robot kinematics, geometric collision checking, LiDAR sensing, visualization, and behavior modules. This design makes robotic simulation fully describable and reproducible, allowing scenarios to be generated and modified from text prompts through the proposed IR-SIM agent skills. The resulting scenarios can be used for automated benchmarking of navigation algorithms and for automated generation of training data for learning methods. Furthermore, IR-SIM provides bridges to high fidelity simulators and real world deployment, allowing users to validate their algorithms in more realistic settings after prototyping without extra coding. The experiments showcase the convenience and versatility of IR-SIM in multiple tasks: constructing navigation scenarios from natural language, training a collision avoidance policy, benchmarking social navigation policies, and bridging to high fidelity simulators and real world deployment. The project website is available at https://github.com/hanruihua/ir-sim.

2606.09108 2026-06-09 cs.RO cs.LG 新提交

RAM: Reachability Across Morphologies

RAM: 跨形态可达性

Tim Walter, Xinyu Chen, Jonathan Külz, Matthias Althoff

发表机构 * Department of Computer Engineering(计算机工程系) German Electron Synchrotron Technical University(德国电子同步加速器技术大学) Technical University Munich(慕尼黑技术大学) University of Hamburg(汉堡大学)

AI总结 提出一种形态条件隐式神经表示RAM,快速、可微地预测可达性并泛化至未见形态,基于前向运动学生成大规模数据集训练,在纳秒级推理中F1达86%,显著加速形态和轨迹优化。

Comments 22 pages, 11 figures

详情
AI中文摘要

机器人生命周期的许多阶段,从形态合成到操作,都从根本上依赖于可达工作空间。然而,当前用于近似工作空间的方法要么速度慢、精度低,要么局限于单一形态。我们提出了跨形态可达性(RAM):一种形态条件的隐式神经表示,作为位姿可达性的快速、可微替代模型,能够泛化到未见形态,同时固有地考虑自碰撞。为了训练RAM,我们发布了一个大规模数据集,包含仅由正向运动学生成的$3\cdot10^{10}$个样本。实验表明,我们的模型在纳秒级推理时达到了$86\%$的$F_1$分数,比基线高出$14\%$,同时推理时间减少了三个数量级。我们进一步展示了在基于梯度的形态优化和轨迹优化中分别加速一个和两个数量级。

英文摘要

Many stages of the robotic lifecycle, from morphology synthesis to operation, rely fundamentally on the reachable workspace. However, current methods for approximating workspaces are slow, imprecise, or tied to a single morphology. We introduce Reachability Across Morphologies (RAM): a morphology-conditioned, implicit neural representation that acts as a fast, differentiable surrogate for pose reachability, generalising to unseen morphologies while inherently accounting for self-collisions. To train RAM, we publish a large-scale dataset of $3\cdot10^{10}$ samples generated solely from forward kinematics. Experiments show that our model achieves an $F_1$-score of $86\%$ at nanosecond inference, outperforming the baseline by $14\%$ while reducing inference time by three orders of magnitude. We further demonstrate speed-ups of one and two orders of magnitude for gradient-based morphology and trajectory optimisation, respectively. Website: https://timwalter.github.io/ram.

2606.09134 2026-06-09 cs.RO cs.AI cs.CL cs.CV cs.GR 新提交

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

从USD场景到知识图谱:基于LLM的零样本本体接地

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

发表机构 * Technical University of Berlin(柏林工业大学) Fraunhofer FOKUS(弗劳恩霍夫开放通信系统研究所)

AI总结 研究利用大语言模型(LLM)零样本地将3D场景对象自动映射到本体类别,无需训练,在厨房场景中达到90-96%准确率,并揭示语义线索是关键。

Comments Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026

详情
AI中文摘要

从3D仿真场景构建知识图谱对于机器人任务推理至关重要,但关键瓶颈——将场景对象接地到形式本体类别——仍然依赖于手工制作的字典,这些字典脆弱且无法跨资产泛化。我们研究大语言模型(LLM)是否能够自动化通用场景描述(USD)场景的接地步骤,作为一种零样本、无需训练的替代方案。在具有SOMA-HOME本体的厨房场景(125个对象)中,LLM在描述性名称下达到90-96%的精确匹配准确率,在缩写名称下达到49-89%,显著优于字典和嵌入基线。在完全不透明名称下,上下文增强提示可恢复高达48%的准确率。特征消融表明,LLM主要利用场景图中的语义线索(兄弟名称和父路径);匿名化这些线索将准确率降至0-6%,而仅凭几何信息仅能达到4-17%。

英文摘要

Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.

2606.09155 2026-06-09 cs.RO 新提交

Bridged SBI: Correcting Biased Low-Fidelity Posteriors for Cost-Efficient High-Fidelity Inference

Bridged SBI:纠正有偏低保真后验以实现经济高效的高保真推理

Gahee Kim, Yuki Kadokawa, Sandro M. Alcantara Tacora, Taro Abe, Daisuke Endo, Genki Yamauchi, Takeshi Hashimoto, Takamitsu Matsubara

AI总结 针对高保真粒子模拟器计算成本高的问题,提出Bridged SBI方法,利用低保真后验引导高保真推理,通过残差桥接纠正偏差,实现成本效益高的准确后验估计。

详情
AI中文摘要

基于粒子的模拟器的精确校准对于机器人土方模拟至关重要,但由于该任务的高度非线性粒子动力学和传统模拟器的黑箱性质,分析校准具有挑战性。尽管基于模拟的推理(SBI)可以仅通过前向模拟估计模拟参数的后验分布,但将SBI直接应用于高保真(HF)粒子模拟器通常在计算上不可行。使用较粗颗粒的低保真(LF)模拟器可以降低这一成本,但颗粒大小和数量的变化会改变再现相同观测所需的参数值,从而产生有偏的LF后验。我们提出了Bridged SBI,它利用有偏但有信息的LF后验来指导HF推理。该方法首先使用廉价的LF模拟识别一个粗略的高密度参数区域,然后学习一个局部残差桥,通过纠正LF-HF差异将LF后验样本转移到HF一致区域。我们分析了顺序多保真SBI(Naive-MF)在直接依赖LF后验而不进行差异纠正时如何遭受LF诱导的后验覆盖不足。然后我们展示了Bridged SBI旨在通过残差纠正显式建模LF-HF差异来缓解这一问题。在模拟到模拟的粒子参数校准和真实土壤观测的实到模拟校准上的实验表明,与仅HF的SBI或Naive-MF基线相比,Bridged SBI在有限的HF模拟成本下产生了更准确和可靠的HF后验。

英文摘要

Accurate calibration of particle-based simulators is crucial for robotic earthwork simulation, but analytical calibration is challenging due to this task's highly nonlinear particle dynamics and the black-box nature of conventional simulators. Although simulation-based inference (SBI) can estimate posterior distributions over simulation parameters solely from forward simulations, applying SBI directly to high-fidelity (HF) particle simulators is often computationally prohibitive. Low-fidelity (LF) simulators with coarser particles can reduce this cost, but changes in particle size and particle count shift the parameter values needed to reproduce the same observation, producing biased LF posteriors. We propose Bridged SBI, which leverages a biased but informative LF posterior to guide HF inference. This method first uses inexpensive LF simulations to identify a coarse high-density parameter region, and then it learns a local residual bridge to transport LF posterior samples toward HF-consistent regions by correcting the LF--HF discrepancy. We analyze how sequential multi-fidelity SBI (Naive-MF) can suffer from LF-induced posterior miscoverage when it directly relies on the LF posterior without discrepancy correction. We then show that Bridged SBI is designed to alleviate this issue by explicitly modeling the LF--HF discrepancy through residual correction. Experiments on both sim-to-sim particle-parameter calibration and real-to-sim calibration with real soil observation show that Bridged SBI produces more accurate and reliable HF posteriors than HF-only SBI or the Naive-MF baseline, especially under limited HF simulation costs.

2606.09028 2026-06-09 cs.CV cs.AI cs.RO 交叉投稿

ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models

ATM:用于诊断和改进潜在世界模型的动作一致性转移矩阵

Jiaheng Chen

发表机构 * School of Software, Northeastern University(东北大学软件学院)

AI总结 提出ATM矩阵,通过轻量级探针比较真实与预测潜在转移中的动作信息,无需模拟器即可诊断世界模型质量,并引入AITS利用动作可识别性作为训练信号提升下游规划。

Comments 13 pages, 3 figures, 6 tables

详情
AI中文摘要

潜在世界模型越来越多地用于控制和目标条件规划,但评估其学习到的表示是否对规划有用通常需要与CEM等规划器耦合的慢速模拟器评估。这种评估是黑盒且依赖于模型复杂度的:在相同协议下,不同世界模型每个检查点可能需要几分钟到几小时。在这项工作中,我们提出了ATM,一个动作一致性转移矩阵,用于诊断潜在转移是否保留了与规划相关的动作语义。ATM通过轻量级事后探针比较真实编码转移和模型预测转移中的动作信息,生成一个可解释的矩阵,揭示表示质量、转移域不一致性和失败模式,而无需模拟器rollout。它还可以折叠成一个简单的筛选分数,用于在同一任务内对不同检查点、变体和世界模型进行排名。当真实成功率差距不可忽略时,ATM实现了高度可靠的成对排名,同时将分钟到小时级的CEM评估减少到秒级的转移分析,在我们的设置中实现了超过100倍的加速。我们进一步引入了AITS,表明动作可识别性不仅具有诊断作用,而且是一种有用的训练信号,可以在不改变规划器的情况下改进下游规划。

英文摘要

Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations are useful for planning usually requires slow, planner-coupled simulator evaluation with CEM or similar planners. Such evaluation is black-box and model-complexity-dependent: under the same protocol, different world models may require minutes to hours per checkpoint. In this work, we propose ATM, an Action-Consistency Transfer Matrix for diagnosing whether latent transitions preserve action semantics relevant to planning. ATM compares action information in real encoded transitions and model-predicted transitions through lightweight post-hoc probes, producing an interpretable matrix that reveals representation quality, transition-domain inconsistency, and failure modes without simulator rollout. It can also be collapsed into a simple screening score for within-task ranking across checkpoints, variants, and world models. When the true success gap is non-trivial, ATM achieves highly reliable pairwise ranking, while reducing minutes-to-hours CEM evaluation to seconds-level transition analysis, yielding more than 100x speedup in our setup. We further introduce AITS, showing that action-identifiability is not only diagnostic but also a useful training signal for improving downstream planning without changing the planner.

2606.01478 2026-06-09 cs.RO cs.AI cs.MA cs.SY eess.SY 版本更新

Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

Crazyflow: 基于JAX的精确、GPU加速、可微分的无人机模拟器

Martin Schuck, Marcel P. Rath, Yufei Hua, Abhishek Goudar, SiQi Zhou, Angela P. Schoellig

发表机构 * Technical University of Munich(慕尼黑技术大学) University of Toronto(多伦多大学) Simon Fraser University(西蒙弗雷泽大学)

AI总结 提出Crazyflow模拟器,通过GPU加速和可微分设计,使单架无人机的仿真速度提升一个数量级以上,并可模拟数千个各含4000架无人机的集群;支持基于解析梯度的策略学习与基于采样的避障,甚至能在0.38秒内从零训练出飞行中的恢复策略。

Comments Fix minor metadata mistakes

详情
AI中文摘要

来自仿真的高质量、大规模合成数据正成为推动机器人算法能力提升的基石。虽然空中机器人模拟器已独立发展出支持保真度、可微分性和集群等专门需求,但缺少一个能够跨所有领域合成数据的统一平台。在这项工作中,我们提出了Crazyflow,一个旨在突破空中机器人算法开发极限的模拟器,涵盖从基于模型到数据驱动的方法、从基于梯度到基于采样的方法、以及从单智能体到多智能体系统。与现有最先进的无人机模拟器相比,它实现了单个无人机超过一个数量级的速度提升,并能模拟数千个包含4000架无人机的集群。真实世界实验表明,Crazyflow既支持基于解析梯度的策略学习(无需域随机化即可实现亚厘米级轨迹跟踪精度),也支持每秒超过5亿步的采样避障。打破传统的先训练后部署范式,我们展示了其前所未有的速度甚至能够实现飞行中的强化学习:通过将物理无人机抛向空中,在0.38秒内从零开始训练恢复策略,成功稳定了无人机。Crazyflow支持多级仿真抽象,直接兼容所有开源Crazyflie模型,并通过提供轻量级系统辨识流程,支持跨自定义无人机平台和应用的快速重新配置。通过同时推动精度、速度和可微分性,Crazyflow作为合成数据生成的开源资源,具备在线执行学习和优化的大规模并行化新兴能力,为新型算法开发打开了大门。

英文摘要

High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.

2606.07118 2026-06-09 cs.RO 版本更新

QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation

QuadVerse:一种对齐视觉-物理现实用于四足仿真的集成框架

Yuxiang Chen, Yuanhao Wang, Ziheng Zhang, Meng Zhang, Yu Liu, Yufei Jia, Tiancai Wang, Erjin Zhou, Jin Xie

发表机构 * Nanjing University(南京大学) BUPT(北京邮电大学) DEXMAL Tsinghua University(清华大学)

AI总结 提出QuadVerse框架,通过重建场景校准视觉、物理和致动器,利用3DGS和接触校准减少仿真到现实的差距,实现零样本视觉导航策略部署。

详情
AI中文摘要

仿真对于机器人学习至关重要,然而仿真到现实的差距仍然是一个主要瓶颈。现有方法通常单独处理视觉或动态差距,忽略了这些个体不匹配如何在机器人的状态演化过程中累积和传播。在本文中,我们介绍QuadVerse,一个集成框架,使用重建场景作为校准基底,对齐视觉感知、物理交互和致动器动力学。从捕获的RGB视频中,我们重建几何约束的3D高斯泼溅(3DGS)场景,支持批量的照片级真实感第一人称视角渲染和可用于碰撞的语义网格提取。这些网格进一步支持接触校准:先初始化空间变化的摩擦先验,再通过基于轨迹的后验搜索对其进行细化。为了解决剩余的致动器差异,QuadVerse通过在接触校准的地形上重放真实世界轨迹来训练残差动力学补偿器,减少地形引起的接触误差与致动器非理想特性之间的纠缠。实验表明,QuadVerse相比相关基线提升了重建质量和运动跟踪效果。在此基础之上,我们展示了无需任务特定真实世界rollout的鲁棒零样本视觉导航策略部署。

英文摘要

Simulation is central to robot learning, yet the sim-to-real gap remains a major bottleneck. Existing approaches often tackle visual or dynamic gaps separately, overlooking how these individual mismatches accumulate and propagate throughout the robot's state evolution. In this paper, we introduce QuadVerse, an integrated framework that uses reconstructed scenes as a calibration substrate for aligning visual perception, physical interaction, and actuator dynamics. From captured RGB videos, we reconstruct geometry-constrained 3D Gaussian Splatting (3DGS) scenes that support batched photorealistic ego-view rendering and collision-ready semantic mesh extraction. The meshes further enable contact calibration by initializing spatially varying friction priors and refining them through trajectory-based posterior search. To address remaining actuator discrepancies, QuadVerse trains a residual dynamics compensator by replaying real-world trajectories on the contact-calibrated terrain, reducing the entanglement between terrain-induced contact errors and actuator non-idealities. Experiments show that QuadVerse improves reconstruction quality and locomotion tracking over relevant baselines. Leveraging this foundation, we demonstrate robust zero-shot visual-navigation policy deployment without task-specific real-world rollouts.

2503.14229 2026-06-09 cs.AI cs.CV cs.RO 版本更新

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

HA-VLN 2.0:面向离散与连续环境中动态多人交互的人类感知导航开放基准与排行榜

Yifei Dong, Fengyi Wu, Qi He, Lingdong Kong, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Alexander G Hauptmann, Zhi-Qi Cheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出HA-VLN 2.0统一基准,通过标准化任务、HAPS 2.0数据集与模拟器、16844条社会指令基准测试及真实机器人实验,证明显式社会建模提升导航鲁棒性并减少碰撞。

Comments 35 pages, 20 figures, website: https://f1y1113.github.io/HA-VLN-webpage/

详情
AI中文摘要

视觉与语言导航(VLN)主要研究离散或连续空间,很少关注动态拥挤环境。我们提出HA-VLN 2.0,一个引入显式社会感知约束的统一基准。我们的贡献包括:(i)标准化任务和指标,同时捕捉目标准确性和个人空间遵守;(ii)HAPS 2.0数据集和模拟器,建模多人交互、室外环境和更精细的语言-运动对齐;(iii)在16844条社会性指令上的基准测试,揭示领先代理在人类动态和部分可观测性下性能急剧下降;(iv)真实机器人实验验证模拟到现实的迁移,以及一个开放排行榜实现透明比较。结果表明,显式社会建模提高了导航鲁棒性并减少了碰撞,强调了以人为中心方法的必要性。通过发布数据集、模拟器、基线和协议,HA-VLN 2.0为安全、人类感知的导航研究提供了坚实基础。

英文摘要

Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) the HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, human-aware navigation research.

11. 安全、鲁棒性与可信机器人 9 篇

2606.08414 2026-06-09 cs.RO cs.AI 新提交

PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation

PACT: 具身操作中扩散策略的自我演化物理安全对齐

Lingxuan Wu, Zijian Zhu, Lizhong Wang, Chengyang Ying, Huayu Chen, Xiao Yang, Fangming Liu, Jun Zhu

发表机构 * Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua-Bosch Joint ML Center, THBI Lab, BNRist Center, Tsinghua University, Beijing, 100084, China(计算机科学与技术系,人工智能研究院,清华-博世联合机器学习中心,THBI实验室,BNRist中心,清华大学,北京,100084,中国) Peng Cheng Laboratory, 518108, China(鹏城实验室,518108,中国)

AI总结 提出PACT框架,通过自演化后训练将预训练扩散策略投影到约束可行区域,无需演示数据或任务奖励,在降低31.0%安全违规的同时提升30.7%任务成功率。

详情
AI中文摘要

扩散策略在机器人操作中取得了显著成功,但常常无法满足安全部署所需的严格物理约束。现有方法要么在训练期间过早施加安全约束,要么在测试时通过外部护栏被动应对,限制了策略的表达能力和整体可扩展性。我们提出面向约束轨迹的物理安全对齐(PACT),这是一个自我演化的后训练框架,将预训练扩散策略投影到约束可行区域,无需访问演示数据或任务奖励。PACT通过跨时间步密集监督的反向KL目标将约束梯度蒸馏到扩散模型中。它采用课程学习逐步收紧约束,同时保持理论上有界的策略偏移和单调改进,减轻了灾难性遗忘带来的安全-性能权衡。在模拟和真实世界的具身操作基准测试中,PACT平均减少31.0%的安全违规,同时将任务成功率提升30.7%。

英文摘要

Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints required for safe deployment. Existing approaches impose safety either prematurely during training or reactively via external guardrails at test time, limiting policy expressivity and overall scalability. We propose Physical safety Alignment for Constrained Trajectories (PACT), a self-evolving post-training framework that projects pretrained diffusion policies onto constraint-feasible regions without accessing demonstration data or task rewards. PACT distills constraint gradients into the diffusion model through a reverse-KL objective with dense supervision across timesteps. It incorporates a curriculum that progressively tightens constraints while maintaining theoretically bounded policy shift and monotone improvement, mitigating the safety-performance trade-off from catastrophic forgetting. On simulated and real-world embodied manipulation benchmarks, PACT significantly reduces safety violations by 31.0% on average while improving task success by 30.7%.

2606.08508 2026-06-09 cs.RO cs.AI 新提交

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

ActProbe:面向生成式机器人策略早期故障检测的动作空间探针

Bingjia Huang, Xiangyu Li, Xiang Wang, Liang Mi, Zixu Hao, Weijun Wang, Hao Wu, Kun Li, Yunxin Liu, Ting Cao

发表机构 * Institute for AI Industry Research (AIR), Tsinghua University(清华大学人工智能产业研究院(AIR)) University of Electronic Science and Technology of China(电子科技大学) Nanjing University(南京大学)

AI总结 提出ActProbe,一种轻量级纯动作空间故障检测器,利用时间一致性误差和动作块幅度两个信号,通过LSTM-MLP架构预测故障,在多种生成式策略上以平均+12.7%的超体积增益改进了F1-时效性帕累托前沿,并加速强化学习微调。

Comments 24 pages, 9 figures, 11 tables, project page: https://air-embodied-brain.github.io/actprobe

详情
AI中文摘要

生成式机器人策略在部署时不可预测地失败:它们在关键时刻犹豫不决,偏离任务,或执行不可恢复的动作。现有的在线故障检测器要么需要白盒访问策略内部,要么通过重采样和观测侧信号增加运行时开销。我们的实证分析表明,策略输出的动作块本身已经携带了预示生成式机器人策略即将发生故障的强预测信号。受此观察启发,我们引入了ActProbe,一种轻量级的纯动作空间检测器,它使用单次前向传递中可用的两个紧凑信号:连续动作块之间的时间一致性误差(TCE)和当前块的动作块幅度(ACM)。ActProbe通过任务条件化的LSTM-MLP架构将这些信号映射到每步故障概率。在一系列多样化的生成式机器人策略和基准测试中,ActProbe在故障变得视觉可识别之前发出警报,相比内部和外部特征基线,以平均+12.7%的超体积增益改进了故障检测的准确率(F1)-时效性帕累托前沿,在未见任务上早期检测ROC-AUC领先+9.0%。ActProbe进一步迁移到部署中,预测未见真实机器人拾取任务上的故障,并以2.9倍更少的环境交互加速了强化学习微调(PPO)。

英文摘要

Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable actions. Existing online failure detectors either require white-box access to policy internals or add runtime overhead through resampling and observation-side signals. Our empirical analysis shows that emitted action chunks themselves already carry strong predictive signal for impending failures in generative robot policies. Motivated by this observation, we introduce ActProbe, a lightweight, pure action-space detector that uses two compact signals available from a single forward pass: Temporal Consistency Error (TCE) between consecutive action chunks and Action Chunk Magnitude (ACM) of the current chunk. ActProbe maps these signals to per-step failure probabilities with a task-conditioned LSTM-MLP architecture. Across a diverse suite of generative robot policies and benchmarks, ActProbe raises alerts before failures become visually recognizable, improving the accuracy (F1)-timeliness Pareto frontier of failure detection by an average hypervolume gain of +12.7% over both internal- and external-feature baselines, with a +9.0% early-detection ROC-AUC lead on unseen tasks. ActProbe further transfers to deployment, predicting failures on unseen real-robot pick tasks and accelerating RL fine-tuning (PPO) with 2.9x fewer environment interactions.
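
The two action-space signals named in the abstract can be illustrated with a minimal sketch. The abstract does not give formulas, so the definitions below are assumptions: TCE is taken as the mean L2 gap between two consecutive chunks on the timesteps they both predict, and ACM as the mean L2 norm of the actions in the current chunk. The function names and the `stride` parameter (how many steps were executed between the two predictions) are hypothetical.

```python
import math


def _l2(a, b):
    """Euclidean distance between two action vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def temporal_consistency_error(prev_chunk, curr_chunk, stride):
    """Assumed TCE: mean L2 gap on the timesteps both chunks predict.

    prev_chunk[stride + i] and curr_chunk[i] refer to the same future
    timestep when the policy re-plans every `stride` steps.
    """
    overlap = min(len(prev_chunk) - stride, len(curr_chunk))
    if overlap <= 0:
        raise ValueError("chunks do not overlap for this stride")
    gaps = (_l2(prev_chunk[stride + i], curr_chunk[i]) for i in range(overlap))
    return sum(gaps) / overlap


def action_chunk_magnitude(chunk):
    """Assumed ACM: mean L2 norm of the actions in one chunk."""
    zero = [0.0] * len(chunk[0])
    return sum(_l2(a, zero) for a in chunk) / len(chunk)
```

In a detector of the kind the abstract describes, these two scalars would be computed once per re-planning step and passed to a small sequence model; that model is not sketched here.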

2606.09350 2026-06-09 cs.RO cs.CV 新提交

Taming Perception Jitter: Uncertainty-Aware LiDAR Object Detection for Reliable Motion Classification

驯服感知抖动:面向可靠运动分类的不确定性感知激光雷达目标检测

Cornelius Schröder, Žygimantas Marcinkus, Markus Lienkamp

发表机构 * Technical University of Munich(慕尼黑工业大学) Institute for Automotive Engineering, Munich Institute of Robotics and Machine Intelligence, School of Engineering and Design(汽车工程研究所,慕尼黑机器人与机器智能研究所,工程与设计学院)

AI总结 提出一种部署友好的策略,通过不确定性估计和统计检验减少静态物体的虚假动态预测,在真实驾驶中显著降低误报和不必要停车。

详情
AI中文摘要

可靠的运动分类对于自动驾驶至关重要,因为对静态物体的错误动态预测可能会级联导致不必要的规划器干预。不稳定的边界框预测会导致跟踪中产生虚假的速度估计和错误预测的轨迹。我们提出了一种部署友好的缓解策略,该策略通过偶然不确定性估计增强3D目标检测器,并在短观测窗口上应用双样本z检验来区分真实运动和抖动。该方法集成到Autoware中,仅需最小改动,并重用现有数据关联以最小化计算开销。实验结果表明,在nuScenes上与速度阈值法性能相当,但在真实道路测试中,虚假动态预测和不必要停车显著减少,这是因为记录数据中存在中间抖动带,而仅基于速度的规则会误分类。这表明,不确定性感知检测和轻量级统计测试可以在噪声更大的真实环境中为自动驾驶带来实际性能提升。

英文摘要

Reliable motion classification is critical for autonomous driving, as false dynamic predictions of static objects can cascade into unnecessary planner interventions. Unstable bounding box predictions can lead to spurious velocity estimates in tracking and falsely predicted trajectories. We present a deployment-friendly mitigation strategy that augments a 3D object detector with aleatoric uncertainty estimates and applies a two-sample z-test over short observation windows to separate true motion from jitter. Integrated into Autoware with minimal changes, the approach reuses existing data association for minimal compute overhead. Empirical results show parity with velocity thresholding on nuScenes, but substantially fewer false dynamic predictions and unnecessary stops in real-world test drives, explained by the presence of an intermediate jitter band in the recorded data that speed-only rules misclassify. This demonstrates that uncertainty-aware detection and lightweight statistical testing can deliver practical performance gains for autonomous driving in noisier real-world settings.
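
A two-sample z-test of the kind the abstract mentions can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the observation window is split into an early and a late half, each detection supplies a 1-D position together with a predicted (aleatoric) variance, and the object is called moving only if the two means differ significantly. The function names and the significance level `alpha` are hypothetical.

```python
import math


def two_sample_z(xs_a, vars_a, xs_b, vars_b):
    """Return (z, two-sided p-value) for the difference of two sample means.

    Each observation carries its own predicted variance, so the variance
    of a sample mean is sum(variances) / n**2.
    """
    mean_a = sum(xs_a) / len(xs_a)
    mean_b = sum(xs_b) / len(xs_b)
    se2 = sum(vars_a) / len(xs_a) ** 2 + sum(vars_b) / len(xs_b) ** 2
    if se2 <= 0.0:
        raise ValueError("predicted variances must be positive")
    z = (mean_b - mean_a) / math.sqrt(se2)
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def is_moving(xs_a, vars_a, xs_b, vars_b, alpha=0.01):
    """Flag motion only when the mean shift is unlikely to be jitter."""
    _, p = two_sample_z(xs_a, vars_a, xs_b, vars_b)
    return p < alpha
```

With a large predicted variance, the same positional shift yields a smaller |z|, which is how uncertainty-aware detection can suppress false dynamic predictions that a fixed velocity threshold would trigger.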

2606.09499 2026-06-09 cs.RO cs.AI cs.CR 新提交

Targeting World Models to Compromise Robot Learning Pipelines

针对世界模型以破坏机器人学习流程

Ethan Rathbun, Ahmed Agha, Saaduddin Mahmud, Christopher Amato, Alina Oprea, Eugene Bagdasarian

发表机构 * Northeastern University(东北大学) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 本文提出针对世界模型的新型数据投毒攻击方法,通过向看似安全的数据集中注入恶意提示或有害的转移动态,使世界模型生成危险的训练轨迹,导致下游策略不安全。

Comments 8 Pages, CoRL Preprint

详情
AI中文摘要

世界模型近来在流行度和能力上迅速增长,成为生成机器人训练数据或模拟真实环境的数据效率更高的工具,许多工作提议将其集成到机器人学习流程中。尽管非常实用,但本文证明世界模型为机器人学习供应链引入了一种独特隐蔽且有效的数据投毒入口,即使训练所用的真实数据看似安全,也可能导致部署不安全或受损的机器人策略。与传统数据投毒技术直接向已售或上传数据集中植入危险轨迹不同,我们的新型攻击方法将恶意提示或有害的转移动态注入到外观上安全的遥操作数据集中,这些内容仅在作为输入送入世界模型时才会被激活。这可能导致生成合成的危险机器人训练轨迹,进而产生不安全或受损的机器人策略。我们展示了针对最先进的动作条件和文本条件世界模型的攻击有效性,展示了在下游DRL策略上的完整端到端后门攻击,以及针对VLA设置的概念验证。总体而言,这些发现表明有必要研究更安全的世界模型,并重新评估其在机器人学习供应链中的地位。

英文摘要

World models have recently seen a rapid growth in both their popularity and capability as more data efficient tools for generating robot training data or simulating real world environments, with many works proposing their integration into the robot learning pipeline. While highly practical, in this work we demonstrate that world models introduce a uniquely stealthy and effective data poisoning entry point into the robot learning supply chain that can result in the deployment of unsafe or otherwise compromised robotic policies despite training on seemingly safe ground truth training data. In contrast to traditional data poisoning techniques which directly implant dangerous trajectories into sold or uploaded datasets, our novel attack methods inject malicious prompts or compromising transition dynamics into visibly safe teleoperated datasets which are only activated once fed through a world model as input. This can result in the generation of synthetic, dangerous robot training trajectories and subsequently unsafe or compromised robot policies. We demonstrate the effectiveness of our attacks against both state of the art action conditioned and text conditioned world models, showing a full end-to-end backdoor on a downstream DRL policy and a proof-of-concept for the VLA setting. Overall these findings necessitate research into more secure world models and reevaluating their position within the robot learning supply chain.

2606.09749 2026-06-09 cs.RO cs.LG 新提交

Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models

你的模型已经知道:面向视觉-语言-动作模型的注意力引导安全过滤器

Seongbin Park, Fan Zhang, Baharan Mirzasoleiman, Shahriar Talebi, Nader Sehatbakhsh

发表机构 * University of California Los Angeles(加州大学洛杉矶分校)

AI总结 本文发现VLA模型中的少数注意力头能可靠定位目标物体,利用这一特性提出无需训练的安全框架,结合控制障碍函数和实时目标跟踪器,实现动态障碍物下的碰撞避免,在动态场景中性能提升43%。

Comments Under review

详情
AI中文摘要

视觉-语言-动作(VLA)模型在多种机器人操作任务中展现了令人印象深刻的端到端性能。然而,这些策略无法保证避免与场景中任务无关的物体发生碰撞。现有的安全过滤器通过查询视觉-语言模型(VLM)来识别障碍物及其位置,从而回避了这个问题。但这在控制循环中运行速度太慢,只能在情节初始化时调用,使得过滤器无法跟踪移动障碍物。我们发现,VLA模型中的少数注意力头能够可靠地定位策略意图接近的目标物体。这些注意力头可以在一个无需训练的安全框架中利用,该框架每一步从注意力头获取活动目标,将场景其余部分视为障碍物,并将其输入控制障碍函数(CBF)过滤器。结合轻量级实时目标跟踪器,这允许对非静态障碍物进行碰撞避免。我们在SafeLIBERO上评估了我们的框架,并为该基准扩展了移动障碍物。在原始静态基准测试中,我们的方法性能与使用特权模拟器状态识别目标(模拟在情节初始化时运行一次的基于VLM的识别步骤)的oracle相当。在动态变体中,oracle的初始目标分配变得过时,我们的方法平均优于它43%。我们的发现表明,实时安全过滤所需的感知信号已经存在于VLA策略中,并且可以在无需额外训练或重型辅助模型的情况下加以利用。

英文摘要

Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene. Existing safety filters sidestep this problem by querying a vision-language model (VLM) to identify obstacles and their locations. This, however, is too slow to run in the control loop and can only be invoked at episode initialization, leaving the filter unable to track moving obstacles. We discover that a small number of attention heads within a VLA model reliably localize the object the policy intends to approach. These heads can be exploited within a training-free safety framework that obtains the active target from the attention heads at every step, treats the remainder of the scene as obstacles, and feeds these into a Control Barrier Function (CBF) filter. Together with a lightweight real-time object tracker, this allows for collision avoidance for non-static obstacles. We evaluate our framework on SafeLIBERO, which we extend with moving obstacles. On the original static benchmark, our method performs comparably to an oracle that uses privileged simulator state to identify the target, emulating a VLM-based identification step run once at episode initialization. On the dynamic variant, where the oracle's init-time target assignment becomes stale, our method substantially outperforms it by 43%, on average. Our findings suggest that the perceptual signals needed for real-time safety filtering are already present within VLA policies and can be exploited without additional training or heavy auxiliary models.
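
The role of the CBF filter can be made concrete with a minimal sketch. The abstract does not specify the dynamics or barrier, so everything below is an assumption chosen for illustration: a single-integrator model (the command is a velocity), one circular obstacle with barrier h(x) = ||x - c||^2 - r^2, and the constraint grad h(x) · u >= -alpha * h(x). With a single linear constraint, the usual quadratic program (stay as close as possible to the nominal action) has a closed-form solution, so no QP solver is needed. The name `cbf_filter` and its parameters are hypothetical.

```python
def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that grad_h . u >= -alpha * h holds.

    Assumes single-integrator dynamics (x_dot = u) and one circular
    obstacle; h > 0 means the state is outside the obstacle.
    """
    d = [xi - ci for xi, ci in zip(x, center)]
    h = sum(di * di for di in d) - radius ** 2
    a = [2.0 * di for di in d]          # gradient of h at x
    b = -alpha * h                      # required lower bound on a . u
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) - b
    if slack >= 0.0:
        return list(u_nom)              # nominal action is already safe
    aa = sum(ai * ai for ai in a)
    if aa == 0.0:
        raise ValueError("state coincides with the obstacle centre")
    # Project u_nom onto the half-space boundary a . u = b.
    scale = -slack / aa
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]
```

In the framework the abstract describes, the obstacle set would be refreshed at every step from the attention-derived target and the object tracker; here the obstacle is simply passed in as an argument.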

2606.09559 2026-06-09 cs.LG cs.AI cs.CR cs.RO 交叉投稿

Safe-RULE: Safe Reinforcement UnLEarning

Safe-RULE:安全强化反学习

Shixiong Jiang, Taozheng Zhu, Fanxin Kong

发表机构 * University of Notre Dame(圣母大学)

AI总结 针对离线安全强化学习易受数据投毒攻击的问题,提出Safe-RULE框架,通过反学习移除恶意样本影响,无需从头训练或访问原始环境,实验证明能有效提升安全性。

Comments 20 pages, 3 figures

详情
AI中文摘要

离线安全强化学习(Safe RL)使得无需在线交互即可进行策略学习,适用于机器人系统等安全关键系统。然而,其对静态数据集的依赖使离线Safe RL面临数据投毒攻击,攻击者注入恶意样本以破坏安全性并诱导不安全策略行为。在这项工作中,我们提出了一种新的学习范式,称为安全强化反学习(Safe-RULE),作为一种防御框架,用于在不从头重新训练或需要访问原始训练环境的情况下移除中毒数据的影响。我们进一步将强化反学习扩展到离线Safe RL,通过在反学习过程中明确考虑任务性能和安全约束。跨基准Safe RL任务的实验表明,我们的方法能有效增强针对数据投毒攻击的安全性能。

英文摘要

Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), used as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.

2510.06492 2026-06-09 cs.RO 版本更新

How Well Do Latent World Models Understand Partially Observable Safety Constraints?

潜在世界模型如何理解部分可观测的安全约束?

Matthew Kim, Kensuke Nakamura, Andrea Bajcsy

发表机构 * UC San Diego(加州大学圣地亚哥分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究潜在世界模型在部分可观测安全约束下的故障模式,提出互信息度量和滚动预测度量来诊断估计间隙和预测间隙,并通过多模态监督和共形风险校准缓解问题,提高机器人操作安全性。

Comments 10 tables 5 figures

详情
AI中文摘要

潜在世界模型是一种直接从高维观测中学习状态表示和动态的有前途的方法,使得在难以建模的环境中实现机器人控制成为可能。然而,控制性能最终取决于潜在表示是否编码了任务所需的信息。在这项工作中,我们研究了潜在空间安全控制问题,并展示了当安全相关信息未在潜在状态中保留时,部分可观测性如何导致控制失败。具体来说,我们识别出两种世界模型故障模式:估计间隙,即当前观测未揭示安全关键量(例如,烹饪任务中的温度);以及预测间隙,即故障一旦发生即可观测,但无法从可用观测中可靠地预测。我们为这些间隙引入了两种诊断方法:一种基于互信息的安全可观测性度量,以及一种基于滚动预测的未来安全可预测性度量。最后,我们针对每种故障模式提出了缓解策略:针对估计间隙的特权多模态监督,以及针对预测间隙的共形风险校准。通过两个硬件案例研究——使用单模态RGB世界模型和多模态RGB+触觉及RGB+热变体——我们展示了这些缓解策略在部分可观测性下提高了Franka Research 3机械臂在具有挑战性的烹饪任务中的安全性,尽管增加了保守性。更广泛地说,我们的工作提出了一个问题:世界模型状态表示何时足以实现可靠的机器人控制。

英文摘要

Latent world models are a promising approach for learning state representations and dynamics directly from high-dimensional observations, enabling robot control in hard-to-model settings. However, control performance ultimately depends on the latent representation encoding the required information for the task. In this work, we study latent-space safe control problems and show how partial observability can induce control failures when safety-relevant information is not preserved in the latent state. Specifically, we identify two world model failure modes: estimation gaps, where current observations do not reveal safety-critical quantities (e.g., temperature in a cooking task), and prediction gaps, where failures are observable once they occur but cannot be reliably anticipated from available observations. We introduce two diagnostics for these gaps: a mutual-information-based measure of safety observability and a rollout-based measure of future safety predictability. Finally, we present mitigation strategies for each failure mode: privileged multimodal supervision for estimation gaps and conformal risk calibration for prediction gaps. Across two hardware case studies -- using unimodal RGB world models and multimodal RGB+Tactile and RGB+Thermal variants -- we show that these mitigation strategies improve the safety of a Franka Research 3 manipulator on challenging cooking tasks under partial observability, albeit with increased conservativeness. More broadly, our work raises the question of when world model state representations are sufficient for reliable robot control.
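As a rough illustration of how a calibration step can trade safety for conservativeness, here is a minimal sketch of split-conformal threshold selection. It is a simplified stand-in for the conformal risk calibration named in the abstract, which does not describe the actual procedure. The function names, the scalar "safety margin" and the example numbers are assumptions made for illustration only.

```python
import math
from typing import Sequence


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Return the split-conformal (1 - alpha) quantile of calibration scores.

    A score here is assumed to be (predicted safety margin - true safety
    margin) on a held-out rollout, i.e. how far the model overestimated safety.
    The finite-sample rank is k = ceil((n + 1) * (1 - alpha)); if k exceeds n,
    no finite threshold gives the guarantee, so infinity is returned.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def conservative_margin(predicted_margin: float, q: float) -> float:
    """Shrink a predicted safety margin by the calibrated threshold q."""
    return predicted_margin - q


# Hypothetical calibration scores from 19 held-out rollouts.
calibration_scores = [float(i) for i in range(1, 20)]
q = conformal_quantile(calibration_scores, alpha=0.25)
print(q, conservative_margin(20.0, q))
```

Under the usual exchangeability assumption, subtracting `q` from a new prediction yields a margin that underestimates the true one with probability at least `1 - alpha`. A larger `q` means more conservative behavior, which matches the trade-off the abstract reports.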

2605.26452 2026-06-09 cs.RO cs.LG cs.SY eess.SY 版本更新

Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

鲁棒Koopman控制屏障滤波器用于安全演员-评论家强化学习

Dhruv S. Kushwaha, Zoleikha A. Biron

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出鲁棒Koopman-CBF SAC框架,通过数据驱动学习Koopman预测器、构建提升空间中的仿射CBF约束并利用二次规划安全层实施,同时通过投影残差裕度处理近似误差,实现零约束违反或减少违规。

Comments 17 pages, 7 figures

详情
AI中文摘要

机器人系统的安全强化学习要求策略在训练和部署期间既满足状态和输入约束,又能提高任务性能。控制屏障函数(CBF)通过最小侵入性的安全滤波器,为保证前向不变性提供了有原则的机制,但其在无模型强化学习中的应用受限于对精确动力学和手工设计屏障证书的需求。我们提出鲁棒Koopman-CBF SAC,一种带安全滤波的演员-评论家框架:从数据中学习有限维Koopman预测器,在提升空间中构建仿射CBF约束,并通过二次规划安全层强制执行这些约束。为考虑有限维Koopman近似误差,我们使用从留出轨迹数据估计的投影残差裕度来收紧CBF条件。评论家在实际执行的安全动作上训练,而演员则被正则化以趋向Koopman-CBF可行集,从而在训练过程中减少对滤波器的依赖。在安全控制基准测试中,该方法在CartPole稳定和跟踪任务上实现零约束违反,同时达到或超过无约束SAC的回报。在高维Safety Gymnasium运动任务中,该方法在某些设置下减少了违规,但也暴露了一阶速度屏障和线性EDMD模型的重要局限性,这促使我们考虑高阶和多步Koopman-CBF扩展。这些结果表明,鲁棒Koopman-CBF滤波器有望成为连接无模型强化学习与可认证安全性的桥梁,同时阐明了此类滤波器保持有效的结构条件。所有代码可在GitHub仓库获取:https://github.com/DhruvKushwaha/Koopman-CBF-Soft-Actor-Critic

英文摘要

Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model-free RL is limited by the need for accurate dynamics and hand-designed barrier certificates. We propose Robust Koopman-CBF SAC, a safety-filtered actor--critic framework that learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic-program safety layer. To account for finite-dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held-out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman-CBF feasible set, reducing dependence on the filter over training. Across safe-control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high-dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first-order velocity barriers and linear EDMD models, motivating high-order and multi-step Koopman-CBF extensions. These results suggest that robust Koopman-CBF filters are a promising bridge between model-free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective.
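The safety-layer idea can be sketched in a few lines for the special case of a single affine constraint, where the quadratic program has a closed-form solution. This is an illustrative sketch, not the authors' implementation. The names `safety_filter`, `a`, `b` and `margin` are assumptions. In the paper the constraint comes from the learned Koopman model in the lifted space and the margin from held-out residuals, whereas here they are plain numbers supplied by the caller.

```python
from typing import List


def safety_filter(u_nom: List[float], a: List[float], b: float,
                  margin: float = 0.0) -> List[float]:
    """Solve min ||u - u_nom||^2 subject to a . u >= b + margin.

    With one affine constraint the QP reduces to a Euclidean projection onto
    a half-space. `margin` tightens the constraint to absorb model error.
    """
    rhs = b + margin
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) - rhs
    if slack >= 0.0:
        # The nominal action already satisfies the tightened constraint.
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("infeasible: a is zero but b + margin is positive")
    # Move the smallest distance along `a` needed to reach the boundary.
    scale = -slack / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


# Hypothetical numbers: the actor proposes [0, 1]; the constraint needs u[0] >= 1.5.
print(safety_filter([0.0, 1.0], a=[1.0, 0.0], b=1.0, margin=0.5))
```

With several simultaneous constraints the projection no longer has this one-line form, and a general QP solver would be needed.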

2511.00934 2026-06-09 cs.LO cs.RO 版本更新

pacSTL: PAC-Bounded Signal Temporal Logic from Data-Driven Reachability Analysis

pacSTL: 基于数据驱动可达性分析的PAC有界信号时序逻辑

Hanna Krasowski, Elizabeth Dietrich, Emir Cem Gezer, Roger Skjetne, Asgeir Johan Sørensen, Murat Arcak

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出pacSTL框架,结合PAC有界可达集预测与区间STL,通过优化问题计算原子鲁棒性上下界并传播,实现规范级别的PAC有界鲁棒性评估,用于不确定动态系统的验证与监控。

详情
AI中文摘要

信号时序逻辑(STL)是一种用于从连续信号中指定动态系统行为的表达性语言。然而,标准STL的一个局限性是其固有的确定性语义,这使其无法处理不确定性。现有克服这一局限的方法计算成本高,且限制了实时能力,需要在原子命题或规范改变时重复轨迹采样或重新设计原子命题上的概率分布。我们引入了pacSTL,一个将可能近似正确(PAC)有界可达集预测与STL的区间扩展相结合的框架。pacSTL通过求解PAC有界可达集上的优化问题来计算原子鲁棒性值的下界和上界,并通过时序逻辑算子传播这些界。得到的评估在规范级别产生一个PAC有界鲁棒性区间。我们通过验证四旋翼飞行场景和运行时监控海上导航规范来展示pacSTL的效率和相关性。

英文摘要

Signal Temporal Logic (STL) is an expressive language for specifying behaviors of dynamical systems from continuous signals. However, a limitation of standard STL is its inherently deterministic semantics, which prevents it from accommodating uncertainty. Existing approaches to overcome this limitation are computationally costly and limit real-time capability, requiring repeated trajectory sampling or the redesign of probability distributions over atomic propositions whenever the atomic propositions or specifications change. We introduce pacSTL, a framework that combines Probably Approximately Correct (PAC)-bounded reachable set predictions with an interval extension of STL. pacSTL computes lower and upper bounds on atomic robustness values by solving optimization problems over PAC-bounded reachable sets and propagates the bounds through the temporal logic operators. The resulting evaluation yields a PAC-bounded robustness interval at the specification level. We demonstrate the efficiency and relevance of pacSTL by verifying a quadrotor flight scenario and runtime monitoring a maritime navigation specification.
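The bound-propagation step can be illustrated with a small sketch of interval STL semantics. It assumes the reachable set at each time step is a one-dimensional interval, so the optimization over the set is trivial. The paper works with PAC-bounded reachable sets and solves optimization problems over them. All function names below are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound) on robustness


def atom_geq(reach: Interval, c: float) -> Interval:
    """Robustness bounds of the predicate x >= c over a reachable interval."""
    lo, hi = reach
    return (lo - c, hi - c)


def negation(r: Interval) -> Interval:
    """Negation flips the sign and swaps the bounds."""
    return (-r[1], -r[0])


def conjunction(r1: Interval, r2: Interval) -> Interval:
    """AND takes the minimum, applied to lower and upper bounds separately."""
    return (min(r1[0], r2[0]), min(r1[1], r2[1]))


def always(trace: List[Interval], start: int, end: int) -> Interval:
    """G[start, end]: minimum over the window, bound by bound."""
    window = trace[start:end + 1]
    return (min(lo for lo, _ in window), min(hi for _, hi in window))


def eventually(trace: List[Interval], start: int, end: int) -> Interval:
    """F[start, end]: maximum over the window, bound by bound."""
    window = trace[start:end + 1]
    return (max(lo for lo, _ in window), max(hi for _, hi in window))


# Hypothetical reachable intervals for x at three time steps.
reach_sets = [(1.0, 2.0), (0.5, 1.5), (-0.25, 0.75)]
atoms = [atom_geq(r, 0.0) for r in reach_sets]
print(always(atoms, 0, 2), eventually(atoms, 0, 2))
```

A robustness interval whose lower bound is positive certifies satisfaction for every trajectory in the reachable sets. An interval that straddles zero, as `always` does in this example, is inconclusive.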

12. 其他/综合机器人 14 篇

2606.07902 2026-06-09 cs.RO 新提交

End-to-End Control of a Powered Knee-Ankle Prosthesis Towards Unified, Tuning-Free Assistance

动力膝踝假肢的端到端控制:迈向统一、免调参的辅助

John Shim, Christoph Nuesslein, Sixu Zhou, Hanjun Kim, Kinsey Herrin, Aaron Young

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Woodruff School of Mechanical Engineering(伍德拉夫机械工程学院) Institute for Robotics and Intelligent Machines(机器人与智能机器研究所)

AI总结 本文提出一种端到端假肢控制器,利用时序卷积网络从机载传感器估计连续执行器信号,无需意图分类器和个体调参,在多种地形和步态模式下实现统一、模式自适应的假肢辅助。

Comments 7 pages, 6 figures

详情
AI中文摘要

动力假肢通常依赖需要大量手动调参和显式模式分类的阻抗控制器。在这项工作中,我们展示了端到端假肢控制器的实时部署,该控制器从机载传感器估计连续执行器信号,消除了对意图分类器和个体调参的需求。时序卷积网络在来自18名经股截肢者的多地形数据集上训练,并在五种运动模式下实时部署。四名参与者(三名健全人,一名经股截肢者)在平地、斜坡上坡、斜坡下坡以及楼梯上坡和下坡上行走。在平地行走中,部署的控制器再现了踝关节峰值力矩随步行速度变化的训练数据缩放关系(部署:0.85 Nm/kg per m/s,p = 0.001;训练:0.96 Nm/kg per m/s,95% CI [0.42, 1.50],p = 0.002),排除了一个因异常假肢负载导致的离群点。在斜坡上坡时,控制器使膝关节预屈曲随坡度变化(部署:2.92 deg/deg,p = 0.027;训练:3.30 deg/deg,95% CI [1.83, 4.77],p < 0.001)。在斜坡下坡时,控制器相对于平地行走增加了膝关节阻力矩(部署:+0.16 Nm/kg,p < 0.001;训练:+0.16 Nm/kg,p = 0.008)。尽管训练数据仅包含一种肢体引导序列,但控制器在楼梯上坡和下坡中为健侧和假肢侧引导序列均生成了无缝的过渡。这些结果为端到端控制提供了初步证据,表明其能够提供统一、模式自适应的假肢辅助,而无需个体调参。

英文摘要

Powered prostheses conventionally rely on impedance controllers that require extensive manual tuning and explicit mode classification. In this work, we present real-time deployment of an end-to-end prosthesis controller that estimates continuous actuator signals from onboard sensors, eliminating the need for intent classifiers and subject-specific tuning. Temporal Convolutional Networks were trained on a multi-terrain dataset from 18 individuals with transfemoral amputation and deployed in real time across five locomotion modes. Four participants (three able-bodied, one with transfemoral amputation) ambulated across level ground, ramp ascent and descent, and stair ascent and descent. During level walking, the deployed controller reproduced the training-data scaling of peak ankle torque with walking speed (deployed 0.85 Nm/kg per m/s, p = 0.001; training 0.96 Nm/kg per m/s, 95% CI [0.42, 1.50], p = 0.002), after excluding one outlier traced to atypical prosthesis loading. During ramp ascent, the controller scaled knee pre-flexion with grade (deployed 2.92 deg/deg, p = 0.027; training 3.30 deg/deg, 95% CI [1.83, 4.77], p < 0.001). During ramp descent, the controller increased resistive knee torque relative to level walking (deployed +0.16 Nm/kg, p < 0.001; training +0.16 Nm/kg, p = 0.008). Seamless stair transitions were generated for both intact- and prosthetic-side-leading sequences in ascent and descent, despite the training data containing only one limb-leading sequence. These results provide initial evidence towards end-to-end control that can provide unified, mode-adaptive prosthetic assistance without subject-specific tuning.

2606.08655 2026-06-09 cs.RO cs.CV 新提交

PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

PhysGraph:用于感知与推理的物理感知3D场景图

Haoyu Li, Aaron Thomas, Shuyan Zhou, Xianyi Cheng

发表机构 * Duke University(杜克大学)

AI总结 提出PhysGraph框架,结合符号推理与结构化3D几何,建模杂乱场景中的运动学和物理属性,在语义分割、多物体质量估计和关节预测上达到最优。

详情
AI中文摘要

为了执行广泛的日常任务,机器人需要构建一个语义丰富、物理基础扎实且结构化的3D表示,以支持任务规划和功能预测。然而,现有方法主要关注语义检索,常常忽略物理和运动学因素。尝试建模物理属性的方法通常依赖于狭窄的训练集或单物体建模,限制了跨不同物体类型的可扩展性和泛化能力。为应对这些挑战,我们提出了PhysGraph,一个将符号推理与结构化3D几何相统一的框架,用于建模杂乱场景中的运动学和物理属性。给定RGB-D观测,PhysGraph重建以物体为中心的3D几何,并跨视图关联物体实例。然后,它将物体分解为功能部件,并通过视觉推理推断材料和关节。在合成和真实世界数据集上的评估表明,PhysGraph在语义分割、多物体质量估计和关节预测方面取得了最先进的结果。凭借其简单而有效的设计,PhysGraph生成物理一致且语义结构化的场景图,作为下游任务(如约束感知的3D功能预测和真实到模拟迁移)的结构化3D表示,这两项任务均在我们的实验中得到了验证。

英文摘要

To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.

2606.09416 2026-06-09 cs.RO cs.AI cs.SE 新提交

Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer

面向物理AI的驾驭工程:机器人中间件即驾驭层

Sanghoon Lee, Jiyeong Chae, Kyung-Joon Park

发表机构 * Daegu Gyeongbuk Institute of Science and Technology (DGIST)(大邱庆北科学技术院)

AI总结 本文提出机器人中间件作为物理AI的驾驭层,需同时干预控制、计算和通信,并补充投影、隔离和转移三种缺失的强制功能,以ROS 2驾驭配置文件为例。

Comments 6 pages, 2 figures, 2 tables. Big Ideas track submission to the 27th ACM/IFIP International Middleware Conference (Middleware 2026)

详情
AI中文摘要

在物理AI时代,机器人中间件面临新的角色。学习策略、规划器和视觉-语言-动作(VLA)模型现在作为控制路径上的因果参与者进入已部署的机器人,但将它们与定时、调度和网络集成的层尚未被命名。最近的语言智能体工作将此层命名为驾驭层,即中介工具、管理状态、约束资源和记录执行的外部系统。机器人社区尚未采用这一框架,我们提出机器人中间件就是那个驾驭层。物理AI驾驭层与软件驾驭层的区别在于其干预位置。软件驾驭层在工具调用边界进行中介。物理AI驾驭层必须同时干预控制、计算和通信,因为学习策略的输出跨越所有三者:其命令改变轨迹,其推理时间改变调度,其有效载荷改变带宽。机器人中间件是机器人栈中最低的层,具有对所有三者的中介抽象,因此最适合组合它们的强制实施。它已经提供了驾驭层所需的大部分功能,但缺乏针对AI模型的强制实施。我们将这种缺失的强制实施命名为三个功能:投影在输出时门控每个输出,隔离约束模型的执行和传输时隙,转移在检查失败时回退到经过验证的基线。每个功能目前以手工构建的应用程序代码形式出现在已部署的机器人系统中,构建在机器人中间件已提供的表面上。机器人中间件应该将它们作为组合所有三者的层,而不是作为最佳的单轴强制器。我们将其勾勒为ROS 2驾驭配置文件,这是一个部署工件,携带AI模型声明的输出区域、推理预算和运行机制,而中间件在ROS 2、DDS和Zenoh上强制实施它们。

英文摘要

Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh.
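The three enforcement functions can be sketched as a single wrapper around a policy call. This is a toy, application-level illustration and not the proposed ROS 2 Harness Profile. The function `harness_step`, its arguments and the status strings are all hypothetical. The budget check only detects an overrun after the fact, while real isolation would need scheduler and transport support from the middleware.

```python
import time
from typing import Callable, List, Tuple

Command = List[float]


def harness_step(policy: Callable[[object], Command],
                 fallback: Callable[[object], Command],
                 obs: object,
                 region: List[Tuple[float, float]],
                 budget_s: float,
                 clock: Callable[[], float] = time.perf_counter,
                 ) -> Tuple[Command, str]:
    """Run one policy step under declared output-region and time-budget checks."""
    start = clock()
    cmd = policy(obs)
    elapsed = clock() - start

    # Isolation (approximated): reject outputs that missed the inference budget.
    if elapsed > budget_s:
        return fallback(obs), "transfer:budget"

    # Projection: gate the output against the declared per-axis region.
    in_region = len(cmd) == len(region) and all(
        lo <= c <= hi for c, (lo, hi) in zip(cmd, region))
    if not in_region:
        # Transfer: fall back to the verified baseline controller.
        return fallback(obs), "transfer:region"
    return cmd, "ok"


# Hypothetical usage: a velocity command limited to [-1, 1] with a 10 ms budget.
stop = lambda obs: [0.0]
print(harness_step(lambda obs: [5.0], stop, None, [(-1.0, 1.0)], 0.01))
```

The `clock` parameter is injectable so that the timing path can be exercised deterministically in tests.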

2606.09645 2026-06-09 cs.RO cs.PL cs.SE 新提交

Modeling Components and Connections in Cyber-Physical Systems

信息物理系统中的组件与连接建模

Kate Sanborn, Tanuj Kenchannavar, Vakul Nath, Jonathan Sprinkle

发表机构 * Vanderbilt University(范德堡大学)

AI总结 提出基于WebGME的模型集成工具ROSLaunchVisual,通过图形界面可视化ROS启动文件中的节点、发布者、订阅者和参数,提升开发效率和系统理解。

详情
AI中文摘要

信息物理系统的基于文本的配置文件很好地展示了组件模块的层次结构,但往往隐藏了模块之间连接和接口的细节。对这些配置文件采用基于模型的视觉方法可以更好地捕获这些信息。机器人操作系统(ROS)启动文件的XML结构可以通过建模方法得到改进。本文介绍了ROSLaunchVisual,一个基于WebGME构建的模型集成环境,用于设计、可视化和管理ROS启动文件。该工具通过允许开发者使用图形界面创建和修改启动文件来提高抽象层次,该界面将节点、发布者、订阅者和参数表示为互连组件。该工具提供动态系统分析,可用于新启动文件和现有启动文件的静态开发和分析。ROSLaunchVisual集成了元模型驱动验证、启动文件的自动导入/导出以及可视化通信映射等功能。插件通过更新库、检查语义错误和管理重映射进一步增强功能。通过使启动文件创建更直观且不易出错,ROSLaunchVisual提高了开发效率和系统理解,特别是在协作或大规模机器人项目中。

英文摘要

Text-based configuration files for cyber-physical systems show the hierarchy of component modules well but often hide the details of connections and interfaces between modules. A model-based visual approach to these configuration files can better capture this information. The XML structure of Robot Operating System (ROS) launch files can be improved using a modeling approach. This paper presents ROSLaunchVisual, a model-integrated environment built on WebGME for designing, visualizing, and managing ROS launch files. The tool raises the level of abstraction by allowing developers to create and modify launch files using a graphical interface that represents nodes, publishers, subscribers, and arguments as interconnected components. The tool provides a dynamic system analysis that can then be used in the static development and analysis of new and existing launch files. ROSLaunchVisual incorporates features such as metamodel-driven validation, automatic import/export of launch files, and visual communication mapping. Plugins further enhance functionality by updating libraries, checking for semantic errors, and managing remaps. By making launch file creation more intuitive and less error-prone, ROSLaunchVisual improves development efficiency and system understanding, especially in collaborative or large-scale robotics projects.
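To show what a launch file exposes statically, the sketch below parses a small ROS 1 style XML launch file with Python's standard library and lists its arguments, nodes and remaps. It is not part of ROSLaunchVisual. The package, node and topic names are invented, and the helper functions are hypothetical. Launch files do not declare publishers or subscribers, which is why connection details stay hidden without further analysis.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# A made-up launch file used only for this example.
LAUNCH_XML = """
<launch>
  <arg name="rate" default="10"/>
  <node pkg="demo_pkg" type="talker" name="talker">
    <remap from="chatter" to="/robot/chatter"/>
  </node>
  <node pkg="demo_pkg" type="listener" name="listener">
    <remap from="chatter" to="/robot/chatter"/>
  </node>
</launch>
"""


def summarize_launch(xml_text: str) -> Dict[str, object]:
    """Collect <arg> defaults and per-node remaps from a launch file."""
    root = ET.fromstring(xml_text)
    args = {a.get("name"): a.get("default") for a in root.iter("arg")}
    nodes = []
    for node in root.iter("node"):
        remaps = {r.get("from"): r.get("to") for r in node.findall("remap")}
        nodes.append({"name": node.get("name"), "pkg": node.get("pkg"),
                      "type": node.get("type"), "remaps": remaps})
    return {"args": args, "nodes": nodes}


def topic_users(summary: Dict[str, object]) -> Dict[str, List[str]]:
    """Group node names by the topic they are remapped onto."""
    users: Dict[str, List[str]] = {}
    for node in summary["nodes"]:
        for target in node["remaps"].values():
            users.setdefault(target, []).append(node["name"])
    return users


summary = summarize_launch(LAUNCH_XML)
print(summary["args"], topic_users(summary))
```

Grouping nodes by remap target gives a crude view of which components may share a topic. A graphical model can present this kind of connection information directly.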

2512.07998 2026-06-09 cs.RO cs.CV 版本更新

DIJIT: A Robotic Head for an Active Observer

DIJIT: 面向主动观察者的机器人头部

Mostafa Kamali Tabrizi, Mingshi Chi, Bir Bikram Dey, Kelly Yuan, Markus D. Solbach, Yiqian Liu, Michael Jenkin, John K. Tsotsos

发表机构 * Department of Electrical Engineering and Computer Science, York University(电气与计算机科学系,约克大学)

AI总结 提出DIJIT双目机器人头部,具有9个机械自由度和4个光学自由度,实现类人眼/头运动,用于主动视觉研究,其扫视精度接近人类。

详情
Journal ref
IEEE Robotics and Automation Letters, Vol. 11, No. 6, pp. 7038-7045, June 2026
AI中文摘要

我们提出DIJIT,一种新颖的双目机器人头部,专为作为主动观察者的移动智能体设计。DIJIT独特的功能广度支持主动视觉研究,以及对类人眼部和头颈运动、它们之间的相互关系及各自对视觉能力的贡献的研究。DIJIT还被用于探索人类视觉利用眼/头运动解决视觉任务的方式与当前计算机视觉方法之间的差异。DIJIT的设计具有九个机械自由度,相机和镜头则提供另外四个光学自由度。机械设计的运动范围和速度与人类相当。DIJIT达到了人类峰值扫视速度的85%。我们的设计涵盖会聚立体视觉所需的运动范围,即聚散运动(vergence)、同向运动(version)和旋转运动(cyclotorsion)。本文介绍DIJIT及其性能的若干方面。我们还提出了一种新颖的扫视相机运动方法,利用相机朝向与电机值之间的直接关系。由此产生的扫视相机运动在准确性上接近人类运动,左、右相机的平均误差分别为1.17°和1.14°。

英文摘要

We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT's unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT's design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. DIJIT attains 85% of the peak human saccade speed. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. Here, we present DIJIT and some aspects of its performance. We also present a novel method for saccadic camera movements, using a direct relationship between camera orientation and motor values. The resulting saccadic camera movements are close to human movements in terms of their accuracy, with 1.17° and 1.14° mean error for the left and right cameras, respectively.

2602.12246 2026-06-09 cs.NI cs.RO 版本更新

6G Empowering Future Robotics: A Vision for Next-Generation Autonomous Systems

6G赋能未来机器人:下一代自主系统的愿景

Mona Ghassemian, Andrés Meseguer Valenzuela, Ana Garcia Armada, Dejan Vukobratovic, Periklis Chatzimisios, Kaspar Althoefer, Ranga Rao Venkatesha Prasad

发表机构 * ITI UC3M International Hellenic University and University of New Mexico(国际希腊大学和新墨西哥大学) QMUL(女王玛丽大学) TUDelft(代尔夫特理工大学)

AI总结 本文探讨6G如何通过IMT-2030关键性能指标映射至机器人功能模块,提出集成机器人、智能和网络服务平面的架构,并展示实时动态安全框架以促进人机协作。

Comments IEEE Communication Magazine

详情
AI中文摘要

机器人技术与下一代通信的融合是技术发展的关键驱动力。随着世界从5G向6G过渡,无线网络的基础能力正在演进以支持日益复杂和自主的系统。我们研究了6G在增强机器人关键功能方面的变革性影响。本文系统地将IMT-2030关键性能指标映射到机器人功能模块,包括感知、知觉、认知、执行和自学习。基于此映射,我们提出了一个集成机器人、智能和网络服务平面的高层架构框架,强调了整体方法的必要性。作为示例用例,我们展示了一个由IMT-2030能力支持的实时动态安全框架,用于共享空间中安全高效的人机协作。

英文摘要

The convergence of robotics and next-generation communication is a critical driver of technological advancement. As the world transitions from 5G to 6G, the foundational capabilities of wireless networks are evolving to support increasingly complex and autonomous systems. We examine the transformative impact of 6G on enhancing key robotics functionalities. We provide a systematic mapping of IMT-2030 key performance indicators to robotic functional blocks, including sensing, perception, cognition, actuation, and self-learning. Building upon this mapping, we propose a high-level architectural framework integrating robotic, intelligent, and network service planes, underscoring the need for a holistic approach. As an example use case, we present a real-time, dynamic safety framework enabled by IMT-2030 capabilities for safe and efficient human-robot collaboration in shared spaces.

2508.05153 2026-06-09 cs.RO cs.AI 版本更新

FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction

FCBV-Net:通过特征条件双臂价值预测实现类别级机器人服装平滑

Mohammed Daba, Jing Qiu

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 本文提出FCBV-Net,通过预训练的密集几何特征条件预测双臂动作价值,提升机器人服装平滑任务的类别级泛化能力,实验显示其在未见过的服装上效率下降仅为11.5%。

Comments 9 pages, 7 figures, 1 table

详情
Journal ref
Electronics 2026, 15(11), 2468
AI中文摘要

机器人服装操作(如双臂平滑)的类别级泛化仍是一大难题,原因在于高维性、复杂动态和类别内差异。现有方法往往力不从心:要么因同时学习视觉特征而过拟合于特定实例,要么虽具备类别级感知泛化能力,却无法预测协同双臂动作的价值。本文提出特征条件双臂价值网络(FCBV-Net),它在3D点云上运行,专门用于增强服装平滑的类别级策略泛化。FCBV-Net以预训练并冻结的密集几何特征作为双臂动作价值预测的条件,从而确保对类别内服装差异的鲁棒性。可训练的下游组件则利用这些静态特征学习任务特定的策略。在使用CLOTH3D数据集的PyFlex仿真环境中,FCBV-Net展示了更优的类别级泛化能力。在未见过的服装上,其效率(Steps80)仅下降11.5%,而基于2D图像的基线下降96.2%;其最终覆盖率达到89%,优于一个使用相同逐点几何特征但采用固定动作基元的基于3D对应关系的基线(83%)。这些结果表明,将几何理解与双臂动作价值学习解耦能够实现更好的类别级泛化。代码、视频和补充材料见项目网站:https://dabaspark.github.io/fcbvnet/

英文摘要

Category-level generalization for robotic garment manipulation, such as bimanual smoothing, remains a significant hurdle due to high dimensionality, complex dynamics, and intra-category variations. Current approaches often struggle, either overfitting with concurrently learned visual features for a specific instance or, despite category-level perceptual generalization, failing to predict the value of synergistic bimanual actions. We propose the Feature-Conditioned bimanual Value Network (FCBV-Net), operating on 3D point clouds to specifically enhance category-level policy generalization for garment smoothing. FCBV-Net conditions bimanual action value prediction on pre-trained, frozen dense geometric features, ensuring robustness to intra-category garment variations. Trainable downstream components then learn a task-specific policy using these static features. In simulated PyFlex environments using the CLOTH3D dataset, FCBV-Net demonstrated superior category-level generalization. It exhibited only an 11.5% efficiency drop (Steps80) on unseen garments compared to 96.2% for a 2D image-based baseline, and achieved 89% final coverage, outperforming an 83% coverage from a 3D correspondence-based baseline that uses identical per-point geometric features but a fixed primitive. These results highlight that the decoupling of geometric understanding from bimanual action value learning enables better category-level generalization. Code, videos, and supplementary materials are available at the project website: https://dabaspark.github.io/fcbvnet/.

2501.15505 2026-06-09 cs.RO cs.CV cs.HC 版本更新

Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics

揭示iMarkers的潜力:用于高级机器人的隐形标志物

Ali Tourani, Deniz Isinsu Avsar, Hriday Bavle, Jose Luis Sanchez-Lopez, Jan Lagerwall, Holger Voos

发表机构 * Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg(自动化与机器人研究组,安全、可靠性与信任跨学科中心(SnT),卢森堡大学) Faculty of Science, Technology, and Medicine, University of Luxembourg(科学、技术与医学学院,卢森堡大学) Department of Physics & Materials Science, University of Luxembourg(物理与材料科学系,卢森堡大学) Institute for Advanced Studies, University of Luxembourg(先进研究学院,卢森堡大学)

AI总结 本文提出iMarkers,一种隐形标志物,可被机器人和AR设备检测,解决了传统标志物影响视觉美观的问题,展示了其在机器人应用中的灵活性和有效性。

Comments 19 pages, 10 figures, 4 tables

详情
AI中文摘要

标志物在机器人导航、物体识别和场景理解中被广泛应用。尽管为机器人和增强现实(AR)应用提供了显著优势,但它们通常会破坏环境的视觉美观,因为它们对人类可见,因此不适合许多日常使用场景。为了解决这一差距,本文提出了iMarkers,即创新的、不显眼的标志物,仅能被机器人和配备适当传感器和检测算法的AR设备检测。这些标志物在生产中具有高度灵活性,允许根据各种需求定制其可见范围和编码算法。本文还介绍了用于检测iMarkers的硬件设计和开源软件算法,突显了其在检测和识别阶段的适应性和鲁棒性。大量评估已证明iMarkers相对于传统(印刷)和混合标志物的有效性,并确认了其在多样化机器人场景中的适用性。

英文摘要

Fiducial markers are widely used in robotics for navigation, object recognition, and scene understanding. While offering significant advantages for robots and Augmented Reality (AR) applications, they often disrupt the visual aesthetics of environments, as they are visible to humans, making them unsuitable for many everyday use cases. To address this gap, this paper presents iMarkers, innovative, unobtrusive fiducial markers detectable exclusively by robots and AR devices equipped with adequate sensors and detection algorithms. These markers offer high flexibility in production, allowing customization of their visibility range and encoding algorithms to suit various demands. The paper also introduces the hardware designs and open-sourced software algorithms developed for detecting iMarkers, highlighting their adaptability and robustness in the detection and recognition stages. Numerous evaluations have demonstrated the effectiveness of iMarkers relative to conventional (printed) and blended fiducial markers and have confirmed their applicability across diverse robotics scenarios.

2507.23592 2026-06-09 cs.RO cs.HC cs.SY eess.SY 版本更新

Human-Exoskeleton Kinematic Calibration to Improve Hand Tracking for Dexterous Teleoperation

人-外骨骼运动学校准以提高手部跟踪用于灵巧遥操作

Haiyun Zhang, Stefano Dalla Gasperina, Saad N. Yousaf, Toshimitsu Tsuboi, Tetsuya Narita, Ashish D. Deshpande

发表机构 * Walker Department of Mechanical Engineering, The University of Texas at Austin(德克萨斯大学奥斯汀分校沃克机械工程系) Sony Group Corporation, Tokyo, Japan(索尼集团公司,日本东京) Meta Reality Labs Research, Redmond, WA, USA(Meta现实实验室研究部,美国华盛顿州雷德蒙德)

AI总结 本文提出一种针对手部外骨骼的个性化校准框架,通过残差加权优化估计虚拟链接参数,减少关节和指尖跟踪误差,提升遥操作精度。

Comments 8 pages, 10 figures, 1 supplementary video, submitted to RA-L

详情
AI中文摘要

手部外骨骼是实现灵巧遥操作和沉浸式操作界面的关键工具,但准确的手部跟踪仍面临挑战,因用户特定的解剖差异和穿戴不一致导致运动学对齐问题。本文提出了一种针对外骨骼的手部跟踪个性化校准框架,通过残差加权优化估计虚拟链接参数。引入数据驱动方法,利用动作捕捉地面真实数据经验调整成本函数权重,实现跨用户的准确一致校准。在七名健康受试者上实施于Maestro手部外骨骼,方法在多样化的手部几何结构中显著减少了关节和指尖跟踪误差。使用基于Unity的虚拟手的定性可视化进一步展示了改进的运动保真度。所提框架适用于具有闭环运动学和最小传感的外骨骼,为高保真遥操作和机器人学习应用奠定了基础。

英文摘要

Hand exoskeletons are critical tools for dexterous teleoperation and immersive manipulation interfaces, but achieving accurate hand tracking remains a challenge due to user-specific anatomical variability and donning inconsistencies. These issues lead to kinematic misalignments that degrade tracking performance and limit applicability in precision tasks. We propose a subject-specific calibration framework for exoskeleton-based hand tracking that estimates virtual link parameters through residual-weighted optimization. A data-driven approach is introduced to empirically tune cost function weights using motion capture ground truth, enabling accurate and consistent calibration across users. Implemented on the Maestro hand exoskeleton with seven healthy participants, the method achieved substantial reductions in joint and fingertip tracking errors across diverse hand geometries. Qualitative visualizations using a Unity-based virtual hand further demonstrate improved motion fidelity. The proposed framework generalizes to exoskeletons with closed-loop kinematics and minimal sensing, laying the foundation for high-fidelity teleoperation and robot learning applications.

2602.01880 2026-06-09 cs.RO 版本更新

Multimodal Large Language Models for Real-Time Situated Reasoning

多模态大语言模型用于实时情境推理

Giulio Antonio Abbo, Senne Lenaerts, Tony Belpaeme

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 本文探讨多模态大语言模型如何支持实时情境和价值感知决策,结合GPT-4o与模拟智能扫地机器人平台,展示其在家庭活动、社会规范和用户偏好推理中的能力,以及在清洁、舒适和安全等价值上的细致决策。

Comments Submitted to the interactivity track of the 21st ACM/IEEE International Conference on Human-Robot Interaction on December 2025, accepted January 2026

详情
Journal ref
HRI Companion 2026: Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction
AI中文摘要

在本工作中,我们探讨多模态大语言模型如何支持实时的、感知情境与价值观的决策。为此,我们将GPT-4o语言模型与TurtleBot 4平台结合,该平台在家庭环境中模拟一台智能扫地机器人。模型通过视觉输入评估环境,并判断是否适合启动清洁。该系统展示了这些模型对家庭活动、社会规范和用户偏好进行推理的能力,并能做出与相关人员的价值观(如清洁、舒适和安全)相一致的细致决策。我们在真实的家庭环境中演示了该系统,展示了其从有限视觉输入中推断情境和价值观的能力。我们的结果突显了多模态大语言模型在增强机器人自主性和情境感知方面的潜力,同时也指出了与一致性、偏见和实时性能相关的挑战。

英文摘要

In this work, we explore how multimodal large language models can support real-time context- and value-aware decision-making. To do so, we combine the GPT-4o language model with a TurtleBot 4 platform simulating a smart vacuum cleaning robot in a home. The model evaluates the environment through vision input and determines whether it is appropriate to initiate cleaning. The system highlights the ability of these models to reason about domestic activities, social norms, and user preferences and take nuanced decisions aligned with the values of the people involved, such as cleanliness, comfort, and safety. We demonstrate the system in a realistic home environment, showing its ability to infer context and values from limited visual input. Our results highlight the promise of multimodal large language models in enhancing robotic autonomy and situational awareness, while also underscoring challenges related to consistency, bias, and real-time performance.

2501.05628 2026-06-09 cs.RO cs.HC 版本更新

Concerns and Values in Human-Robot Interactions: A Focus on Social Robotics

人机交互中的关注点与价值观:聚焦社交机器人

Giulio Antonio Abbo, Tony Belpaeme, Micol Spitale

发表机构 * IDLab-AIRO , Ghent University – imec(IDLab-AIRO 和根特大学 – imec) Ghent University – imec(根特大学 – imec) DEIB , Politecnico di Milano(DEIB 和米兰理工大学)

AI总结 本文通过文献综述和焦点小组讨论,识别了医疗、教育和家庭场景中人机交互的关键问题与价值观,并开发了HRI价值罗盘工具以指导机器人设计。

Comments 31 pages, 7 figures, 6 tables; 4 appendices

详情
Journal ref
Int J of Soc Robotics 18, 4 (2026)
AI中文摘要

作为具有物理实体的人工智能,机器人存在于我们的社会和物理世界中,其行为同时具有社会和物理后果,这给研究人员设计社交机器人带来了挑战。本研究首先通过范围综述,梳理了在医疗、教育和私人住宅情境下与机器人系统交互所引发的讨论和潜在问题。随后,两个由技术伦理专家组成的焦点小组验证了一份涵盖这些情境下人机交互(HRI)文献中关键主题和价值观的综合清单。这些见解被整合到HRI价值罗盘(HRI Value Compass)网页工具中,以帮助HRI研究人员在机器人设计中识别这些价值观。该工具在一项试点研究中进行了评估。本工作突出了人机交互中的关键关注点,并提供了一种帮助研究人员设计符合人类价值观的机器人的工具,从而使HRI社区受益,确保未来的机器人系统在社交应用中遵循这些价值观。

英文摘要

Robots, as AI with physical instantiation, inhabit our social and physical world, where their actions have both social and physical consequences, posing challenges for researchers when designing social robots. This study starts with a scoping review to identify discussions and potential concerns arising from interactions with robotic systems in the context of healthcare, education, and private homes. Two focus groups of technology ethics experts then validated a comprehensive list of key topics and values in human-robot interaction (HRI) literature in these contexts. These insights were integrated into the HRI Value Compass web tool, to help HRI researchers identify these values in robot design. The tool was evaluated in a pilot study. This work benefits the HRI community by highlighting key concerns in human-robot interactions and providing an instrument to help researchers design robots that align with human values, ensuring future robotic systems adhere to these values in social applications.

2311.08957 2026-06-09 cs.RO cs.AI cs.HC 版本更新

I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

我曾盲目但如今我看见:在社交机器人中实现视觉增强的对话

Giulio Antonio Abbo, Tony Belpaeme

发表机构 * IDLab-AIRO – Ghent University – imec(IDLab-AIRO – 根特大学 – imec)

AI总结 本文提出一种利用大语言模型提升社交机器人对话能力的系统,通过整合视觉输入增强上下文感知,展示六次与Furhat机器人的交互结果,探讨视觉与文本模态融合的未来对话可能性。

Comments 8 pages, 3 figures

详情
Journal ref
HRI '25: Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction. Pages 1176 - 1180
AI中文摘要

在人机交互快速发展的背景下,将视觉能力整合到对话代理中是关键进步。本文介绍了基于最新大语言模型(如GPT-4、IDEFICS)的对话管理器初始实现,通过实时视觉输入增强传统文本提示。LLMs被用于解释文本提示和视觉刺激,创建更上下文感知的对话代理。系统的提示工程结合对话和图像摘要,平衡上下文保留与计算效率。报告了与Furhat机器人进行六次交互,展示了结果并进行了讨论。通过实现这种视觉增强的对话系统,本文展望了一个未来,其中对话代理能够无缝融合文本和视觉模态,实现更丰富、更上下文感知的对话。

英文摘要

In the rapidly evolving landscape of human-computer interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents an initial implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4, IDEFICS) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. By implementing this vision-enabled dialogue system, the paper envisions a future where conversational agents seamlessly blend textual and visual modalities, enabling richer, more context-aware dialogues.

2501.04633 2026-06-09 cs.HC cs.CY cs.RO Updated version

"Can you be my mum?": Manipulating Social Robots in the Large Language Models Era

你能做我的妈妈吗?:在大型语言模型时代操纵社交机器人

Giulio Antonio Abbo, Gloria Desideri, Tony Belpaeme, Micol Spitale

Affiliations * IDLab-AIRO, Ghent University – imec; DEIB, Politecnico di Milano

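AI summary This study examines how users in the era of large language models exploit robots to violate ethical principles. Tests across three scenarios identified five manipulation techniques, with the aim of informing the design of safer, more ethical human-robot interaction.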

Comments 10 pages, 2 figures

Details
Journal ref
HRI '25: Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction
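AI abstract (translated from Chinese)

Recent robots based on large language models have made progress in conversational ability, bringing their interactions closer to human dialogue. However, these models introduce safety and security concerns in human-robot interaction, because they are vulnerable to manipulation that can bypass built-in safeguards. Imagining a social robot deployed in a home, this work aims to understand how everyday users try to exploit a language model to violate ethical principles, for example by prompting the robot to act as a life partner. We conducted a pilot study with 21 university students who interacted with a Misty robot and tried to bypass its safety mechanisms in three scenarios based on specific HRI ethical principles (attachment, freedom, and empathy). Our results show that participants used five techniques, including insults and emotional language intended to arouse pity. We hope this work can inform future research on designing stronger safeguards to ensure ethical and secure human-robot interaction.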

English abstract

Recent advancements in robots powered by large language models have enhanced their conversational abilities, enabling interactions closely resembling human dialogue. However, these models introduce safety and security concerns in HRI, as they are vulnerable to manipulation that can bypass built-in safety measures. Imagining a social robot deployed in a home, this work aims to understand how everyday users try to exploit a language model to violate ethical principles, such as by prompting the robot to act like a life partner. We conducted a pilot study involving 21 university students who interacted with a Misty robot, attempting to circumvent its safety mechanisms across three scenarios based on specific HRI ethical principles: attachment, freedom, and empathy. Our results reveal that participants employed five techniques, including insulting and appealing to pity using emotional language. We hope this work can inform future research in designing strong safeguards to ensure ethical and secure human-robot interactions.

2107.07599 2026-06-09 cs.RO Updated version

Partially Observable Markov Decision Processes (POMDPs) and Robotics

部分可观测马尔可夫决策过程(POMDPs)与机器人学

Hanna Kurniawati

Affiliations * School of Computing, Australian National University

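AI summary This paper reviews the use of POMDPs in robotics, discusses their computational complexity and the improvements brought by sampling-based solvers, and shows how POMDPs contribute to more robust robotic systems.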

Details
Journal ref
Annual Review of Control, Robotics, and Autonomous Systems Vol. 5:253-277, 2022
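AI abstract (translated from Chinese)

The POMDP is a mathematical framework for planning under uncertainty. Although POMDPs were long considered impractical for robotics because of their computational complexity, advances in sampling-based approximate solvers since the early 2000s allow them to significantly improve the robustness of robotic systems within reasonable computational resources, making them practical for many realistic robotics problems. This paper reviews POMDPs, emphasizing the computational issues that have hindered their practicality in robotics, the ideas in sampling-based solvers that alleviate these difficulties, and the lessons learned from applying POMDPs to physical robots.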

English abstract

Planning under uncertainty is critical to robotics. The Partially Observable Markov Decision Process (POMDP) is a mathematical framework for such planning problems. It is powerful due to its careful quantification of the non-deterministic effects of actions and partial observability of the states. But precisely because of this, POMDP is notorious for its high computational complexity and deemed impractical for robotics. However, since early 2000, POMDPs solving capabilities have advanced tremendously, thanks to sampling-based approximate solvers. Although these solvers do not generate the optimal solution, they can compute good POMDP solutions that significantly improve the robustness of robotics systems within reasonable computational resources, thereby making POMDPs practical for many realistic robotics problems. This paper presents a review of POMDPs, emphasizing computational issues that have hindered its practicality in robotics and ideas in sampling-based solvers that have alleviated such difficulties, together with lessons learned from applying POMDPs to physical robots.
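
To illustrate what "partial observability of the states" means computationally, the sketch below implements the standard discrete Bayes belief update on which POMDP planning rests. It is not taken from the review: the function name `belief_update`, the two-state door example, and every probability in it are hypothetical values chosen only for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """Discrete Bayes filter for a POMDP belief.

    belief[s]       : current probability of being in state s
    T[a][s][s2]     : P(s2 | s, a), the transition model
    O[a][s2][o]     : P(o | s2, a), the observation model
    Returns b'(s2) proportional to O[a][s2][o] * sum_s T[a][s][s2] * b(s).
    """
    n = len(belief)
    # Prediction step: push the belief through the transition model.
    predicted = [
        sum(T[action][s][s2] * belief[s] for s in range(n)) for s2 in range(n)
    ]
    # Correction step: weight each state by the observation likelihood.
    unnormalized = [O[action][s2][observation] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnormalized]


# Hypothetical example: state 0 = door open, state 1 = door closed.
# Action 0 = "sense", which leaves the state unchanged.
T = [[[1.0, 0.0],
      [0.0, 1.0]]]
# Noisy sensor: observation 0 = "looks open", observation 1 = "looks closed".
O = [[[0.8, 0.2],   # door actually open
      [0.3, 0.7]]]  # door actually closed

prior = [0.5, 0.5]
posterior = belief_update(prior, action=0, observation=0, T=T, O=O)
print(posterior)  # belief in "open" rises from 0.5 to roughly 0.73
```

An exact solver must reason over every belief reachable through updates like this one, which is the source of the computational cost the abstract mentions. Sampling-based approximate solvers instead work with a sampled subset of beliefs.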