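arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.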

2606.07974 2026-06-09 cs.RO cs.AI New submission

PRISM: PRior-guided Imagination Sampling in world Models

Yuhai Wang, Jiawei Xia, Rongxuan Zhou, Xiao Hu, Yongliang Shi, Jing Du, Yang Ye

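Affiliations: Northeastern University; University of California, Berkeley; Qiyuan Lab; University of Florida

AI summary: Proposes the PRISM framework, which extracts a state-conditioned Gaussian prior from the world model's encoder and updates the planner's sampling distribution through a precision-weighted Product-of-Gaussians, significantly improving model-based continuous control without added architectural complexity.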

Abstract

A learned world model provides a powerful physical intuition for evaluating future states. But its effectiveness in continuous control also depends critically on how candidate actions are generated for model-based planning. Rather than solely asking how accurately a model can simulate the future, we ask: which candidate actions are worth evaluating in the first place? Existing planners typically search arbitrarily or use expert demonstrations only to initialize a sampling mean, discarding the expert's state-conditioned confidence. Properly guiding this search requires a robust action prior, yet current approaches often rely on independent visual encoders or large-scale VLMs to obtain one. We argue that this architectural bloat is unnecessary: the exact same data (and the learned representations of the world model itself) inherently encode the agent's action intuition. We introduce PRISM, a task-agnostic framework that extracts both from a single dataset while maintaining strict architectural simplicity. Building on a standard JEPA-style latent world model, PRISM attaches a lightweight MLP directly to its frozen encoder to predict a state-conditioned Gaussian prior. At plan time, PRISM fuses this prior into the planner's sampling distribution via a precision-weighted Product-of-Gaussians update. This parameter-free, closed-form integration steers the sampling process, letting the prior dominate where it is confident and cede control where it is not. PRISM improves success rates by 35 percentage points over vanilla world-model-based MPC on Cube and 32 percentage points on PushT, without introducing significant inference overhead.
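
The precision-weighted Product-of-Gaussians update named in the abstract can be illustrated with the standard closed form for two diagonal Gaussians. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `fuse_gaussians` and the example numbers are hypothetical, and how PRISM wires the update into its planner is not described in this listing.

```python
def fuse_gaussians(mu_plan, var_plan, mu_prior, var_prior):
    """Per-dimension product of N(mu_plan, var_plan) and N(mu_prior, var_prior).

    Hypothetical helper: precisions (inverse variances) add, and the fused
    mean is the precision-weighted average of the two means.
    """
    fused_mu, fused_var = [], []
    for mp, vp, mq, vq in zip(mu_plan, var_plan, mu_prior, var_prior):
        prec_plan, prec_prior = 1.0 / vp, 1.0 / vq
        prec = prec_plan + prec_prior
        fused_var.append(1.0 / prec)
        fused_mu.append((prec_plan * mp + prec_prior * mq) / prec)
    return fused_mu, fused_var


# Dimension 0: the prior is confident (variance 0.01), so it dominates.
# Dimension 1: the prior is unsure (variance 100), so the planner dominates.
mu, var = fuse_gaussians([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [0.01, 100.0])
```

In the first dimension the fused mean lands close to the prior's mean, and in the second it stays close to the planner's mean, which is the behaviour the abstract describes: the update has no tunable weight, so confidence alone decides which source wins.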

2606.08015 2026-06-09 cs.RO New submission

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

Ziqian Wang, Jiayu Sun, Xingjian Mao, Minqian Wang, Yao Mu

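Affiliations: Shanghai Jiao Tong University; University of Michigan, Ann Arbor; University of Electronic Science and Technology of China

AI summary: Proposes Q-VGM, an off-policy reinforcement learning method that converts the value gradient into a denoising-time value-gradient field, avoiding backpropagation through the denoising chain, to fine-tune flow-matching VLA policies efficiently, with markedly higher success rates on LIBERO and other tasks.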

Comments 13 pages, 3 figures, 4 tables

Abstract

We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the first-order (gradient) information of the critic, but this is difficult for flow policies, because directly back-propagating the value through their multi-step denoising process is numerically unstable at VLA scale, while the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that transforms value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. This requires no action likelihoods and no backpropagation through the denoising chain, and operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical few-shot initialization then learn-from-experience paradigm: starting from a few-shot-SFT pi0.5 VLA, the method leverages self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.

2606.08154 2026-06-09 cs.RO New submission

SynthICL: Scalable In-context Imitation Learning with Synthetic Data

Cheng Qian, Ruomeng Fan, Yifei Ren, Yilong Wang, Edward Johns

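Affiliations: The Robot Learning Lab; Imperial College London

AI summary: Proposes SynthICL, a framework that trains in-context imitation learning policies on RGB-only synthetic data, avoiding depth sensing and real-world data, and uses subgoal prediction to improve control precision, reaching a 79% average success rate across 16 real-world manipulation tasks.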

Abstract

In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations by conditioning a pre-trained policy on task-specific examples, without retraining at test time. Despite this promise, training generalizable and scalable in-context imitation policies remains an open challenge. We present SynthICL, a scalable framework that trains ICIL policies entirely from RGB-only synthetic data. Specifically, we build a data generation pipeline to produce high-fidelity ICIL data and train a flow-matching transformer policy on the resulting dataset. SynthICL avoids the need for depth sensing, precise camera calibration, and real-world training data in prior approaches, offering a simpler and more scalable alternative. We further incorporate subgoal prediction by training the model to predict the next subgoal images, enabling more precise and visually grounded control. Evaluated on 16 unseen real-world manipulation tasks, SynthICL achieves an average success rate of 79% with only one demonstration provided at test time and outperforms prior methods. Project page: https://synth-icl.github.io

2606.08610 2026-06-09 cs.RO cs.AI New submission

HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

Zechu Li, Yufeng Jin, Xiaoyang Liu, Puze Liu, Vignesh Prasad, Carlo D'Eramo, Georgia Chalvatzaki

Affiliations: TU Darmstadt; Honda Research Institute Europe; Columbia University; Tongji University; Shanghai Research Institute for Intelligent Autonomous Systems; University of Würzburg; Hessian.AI

AI summary: Proposes HARBOR, a framework that treats robot reinforcement learning automation as a harness-engineering problem and uses specialized agents, standardized commands, and reusable knowledge to automate the full workflow in simulation from environment setup to policy training, validated on 6 benchmarks and 16 tasks.

Abstract

Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, shaping rewards, and tuning hyperparameters require substantial expert effort, making RL workflows costly and difficult to scale. We introduce HARBOR, an agentic framework that frames robot RL automation as a harness-engineering problem: given a simulator codebase and a task specification, it automates the workflow from environment setup to policy training in simulation. HARBOR decomposes such high-level objectives into bounded stages executed by specialized agents through standardized commands, persistent artifacts, executable gates, and reusable knowledge, and scales iteration via decentralized parallel trials and experience learning across runs. We evaluate HARBOR across 6 benchmarks and 16 tasks in total, spanning manipulation, locomotion, and bimanual dexterous control. We demonstrate that HARBOR automates the simulation RL workflow end-to-end, designs rewards, tunes algorithms to match or improve over default configurations, and reduces engineering effort at practical token and wall-clock cost; the resulting policies can also be transferred to real robots.

2606.08657 2026-06-09 cs.RO cs.AI New submission

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation

Zhexuan Zhou, Yichen Lai, Jinhao Zhang, Huizhe Li, Youmin Gong, Jie Mei

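Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes LDP, a two-stage framework that absorbs scene understanding into a CVAE encoder and performs flow matching in a pre-concentrated latent space, simplifying learning and improving performance on multi-arm coordination tasks.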

Abstract

Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.

2606.08743 2026-06-09 cs.RO New submission

Guided Discovery of New Behaviors using Diffusion Policies

Dian Yu, Sebastian Sanokowski, Majid Khadiv

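Affiliations: Munich Institute of Robotics and Machine Intelligence, Technical University of Munich

AI summary: Proposes a framework that combines Feynman-Kac correctors with a guiding potential to mine and refine rare but feasible trajectories from a diffusion policy, then retrains the policy to systematically discover diverse, executable behaviors.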

Comments Preprint. Supplementary video: https://youtu.be/T7MUvMA67VM

Abstract

Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies excelling at modeling multimodal action-trajectory distributions. However, when demonstrations are limited, standard sampling often reproduces dominant behaviors while neglecting valid but rare modes, limiting the discovery of novel solutions. Existing approaches, such as guidance methods or combining reinforcement learning with diffusion, either push samples into infeasible regions or struggle to escape local minima, failing to systematically uncover diverse behaviors. To address these challenges, we propose a framework that combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors. We demonstrate the effectiveness of our framework across a range of manipulation environments, consistently discovering new behaviors.

2606.08775 2026-06-09 cs.RO cs.AI New submission

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami

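Affiliations: Tandon School of Engineering, New York University; Courant Institute of Mathematical Sciences, New York University; AMI Labs

AI summary: Proposes WorldDP, a hierarchical framework that pairs a high-level world model for runtime subgoal optimization with a low-level diffusion policy for execution, and uses object-centric representations to decouple environmental entities, enabling effective planning and execution of multi-stage robotic manipulation tasks.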

Abstract

Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.

2606.09236 2026-06-09 cs.RO cs.AI New submission

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

Luca Ghisi, Jacopo Essenziale, Carlo D'Eramo, Matteo Luperto

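Affiliations: University of Bologna

AI summary: Proposes a self-paced curriculum deep reinforcement learning framework that, combined with Soft Actor-Critic, dynamically generates progressively harder tasks to train an autonomous motorbike racer in a physics-accurate simulator, outperforming standard SAC.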

Comments Presented at the "1st Workshop on Generalization in Autonomous Driving: Paradigms, Practice, and Public Road Demonstrations" at ICRA 2026, Vienna. Oral+poster presentation

Abstract

Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.

2606.09381 2026-06-09 cs.RO New submission

ReGIL: Retrieval-Guided Imitation Learning from a Single Demonstration

Yuying Zhang, Francesco Verdoja, Wenyan Yang, Ville Kyrki

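Affiliations: Aalto University

AI summary: Proposes ReGIL, a framework that treats a single demonstration as external memory and uses retrieval to guide exploration, generate the regularization buffer, and construct rewards, significantly improving success rate and training efficiency on the LIBERO and Meta-World benchmarks and on real-robot tasks.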

Abstract

Learning robot manipulation policies with deep neural networks from a single demonstration remains highly challenging, as even small deviations from the demonstrated trajectory can quickly compound into failure, while collecting substantial online interaction data is costly. We propose ReGIL, a retrieval-guided imitation learning framework that treats a single demonstration as an external memory. ReGIL repeatedly queries this static memory throughout training to simultaneously guide exploration, generate the regularization buffer, and construct rewards. Specifically, it computes rewards through local temporal alignment between the current trajectory and the retrieved segment, providing step-wise and informative feedback for policy improvement. We evaluate ReGIL on robotic manipulation tasks from the LIBERO and Meta-World benchmarks under the single demonstration setting. ReGIL outperforms prior baselines in both success rate and training efficiency. In real-robot experiments, using only one demonstration and less than one hour of online training, ReGIL achieves over 75% success rate across three manipulation tasks with randomness in both initial robot pose and target position. These results demonstrate that leveraging the single demonstration as reusable memory can provide more than static supervision for efficient robot learning. More details can be found on our website: https://regil2026.github.io/

2606.09457 2026-06-09 cs.RO New submission

$ω$-EVA: Envision, Verify, and Act with Latent Interactive World Models

Zhenguo Sun, Yu Sun, Hande Huang, Alois Knoll

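Affiliations: Technical University of Munich

AI summary: Proposes the $ω$-EVA framework, which realizes an envision-verify-act loop with a latent interactive world model, generating actions through action-conditioned latent dynamics and a language-conditioned flow policy without generating future video, and improving policy performance across a range of robot manipulation tasks.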

Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning

Changwei Yao, Xinzi Liu, Chen Li, Marios Savvides

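Affiliations: Carnegie Mellon University; University of Tokyo

AI summary: Proposes the RE-GoT framework, which combines LLMs and VLMs with graph-of-thoughts reasoning and iteratively refines reward functions through task decomposition and visual feedback; experiments show it outperforms existing methods on RoboGen and ManiSkill2 tasks.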

Journal ref: IEEE International Conference on Robotics and Automation (ICRA 2026)

Abstract

Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.

2606.03787 2026-06-09 cs.RO Updated version

Worth Remembering: Surprise-Gated Robot Episodic Memory

Nicolas Gorlo, Derek K. Wise, Alberto Speranzon, Luca Carlone

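Affiliations: Massachusetts Institute of Technology; Lockheed Martin

AI summary: Proposes a Bayesian-surprise gating mechanism that selectively stores high-utility episodic memories, computing surprise in the V-JEPA-2 latent space and improving performance on robot question answering by 12% or more.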

Comments 14 pages, 2 figures, 4 tables

Abstract

Robots solving generalist tasks need to be able to ground instructions in their past experience, since humans may refer to notable past events when giving a task (e.g., "Take me to where the chemical spill happened yesterday"). Since memory limits make storing all past events infeasible, long-term robot memory must be selective, ideally retaining only those episodes with high utility for future tasks. However, future tasks are not typically given a priori for generalist robots. To select generically useful memories, we propose Bayesian surprise as a gating mechanism for memory formation. We present an approach to compute surprise in a semantically rich, deployment-agnostic latent space provided by V-JEPA-2. Using our gated episodic memory to augment 4D scene graph-based spatial memory, we show a consistent improvement over state-of-the-art benchmarks in robot question answering, outperforming prior robot memory methods by at least 12% for temporal, spatial, and binary questions, and surpassing the performance of supervised and non-causal methods with an unsupervised causal method in event segmentation tasks.
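
Bayesian surprise is commonly defined as the KL divergence from a prior belief to the posterior belief after an observation. The sketch below shows how such a quantity could gate memory writes, assuming diagonal Gaussian beliefs. Everything here is illustrative: the function names, the Gaussian belief model, and the threshold are hypothetical, and the listing does not say how the paper forms beliefs in the V-JEPA-2 latent space.

```python
import math


def gaussian_kl(mu_post, var_post, mu_prior, var_prior):
    """KL(posterior || prior) for diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for m1, v1, m0, v0 in zip(mu_post, var_post, mu_prior, var_prior):
        kl += 0.5 * (math.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)
    return kl


def should_store(mu_post, var_post, mu_prior, var_prior, threshold=1.0):
    """Hypothetical gate: keep an episode only if its surprise exceeds the threshold."""
    return gaussian_kl(mu_post, var_post, mu_prior, var_prior) > threshold


prior_mu, prior_var = [0.0, 0.0], [1.0, 1.0]
# An observation that leaves the belief unchanged carries no surprise.
routine = should_store([0.0, 0.0], [1.0, 1.0], prior_mu, prior_var)
# An observation that shifts the belief mean by three standard deviations does.
notable = should_store([3.0, 0.0], [1.0, 1.0], prior_mu, prior_var)
```

With these numbers the routine observation is discarded and the notable one is stored, so memory grows only when the robot's belief actually changes.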

2511.18203 2026-06-09 cs.RO Updated version

SkillWrapper: Generative Predicate Invention for Task-level Robot Planning

Ziyi Yang, Benned Hedegaard, Ahmed Jaafar, Yichen Wei, Skye Thompson, Shreyas S. Raman, Haotian Fu, Stefanie Tellex, George Konidaris, David Paulius, Naman Shah

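Affiliations: Brown University; Allen Institute for AI

AI summary: Proposes SkillWrapper, a method that learns symbolic representations through generative predicate invention, enabling robots to plan long-horizon tasks via abstract reasoning.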

Abstract

Generalizing from individual skill executions to long-horizon tasks is a core challenge in building autonomous robots. A promising direction is learning high-level, symbolic representations of low-level robot skills, enabling abstract reasoning independent of the low-level state space. Recent advances in foundation models have made it possible to generate symbolic predicates that operate on raw sensory inputs (a process we call generative predicate invention) to facilitate downstream representation learning. However, prior work learns these abstractions using heuristic or ad hoc procedures, ignoring the question of which formal properties they ought to satisfy and how to guarantee these properties. We address these questions by presenting a formal theory of generative predicate invention for task-level planning, and proposing SkillWrapper, a method that learns symbolic models for provably sound and complete planning. Our approach leverages foundation models to actively collect robot data and learn human-interpretable, plannable representations, using only RGB image observations. Our extensive empirical evaluation in simulation and on real robots shows that SkillWrapper learns abstract representations that enable robots to compose black-box skills to solve unseen, long-horizon tasks in the real world.

2606.07855 2026-06-09 cs.RO math.OC New submission

Path Planning Using Deep Deterministic Policy Gradient: A Reinforcement Learning Approach

Qiang Le, Yaguang Yang, Isaac E. Weintraub

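Affiliations: Hampton University; Air Force Research Laboratory

AI summary: Proposes a path-planning method based on Deep Deterministic Policy Gradient that models threats as circular no-go zones and uses a reward function to guide the agent in learning a state-to-action mapping and finding the largest set of safe starting points; it is faster than traditional optimal control methods and suited to real-time applications.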

Comments 14 pages, 12 figures

Abstract

Path planning for autonomous vehicles in threat-laden environments is a fundamental challenge because the problem is nonlinear and nonconvex even in the simplest scenarios. While traditional optimal control methods can be used to find ideal paths, the computation is often too slow for real-time decision-making. To address this challenge, we propose a method based on Deep Deterministic Policy Gradient (DDPG) and model the threat as one or more circular 'no-go' zones. A mission is regarded as a failure if the vehicle enters a restricted zone at any time or does not reach a neighborhood of the destination. The DDPG agent is trained through trial and error in a simulated environment, learning a direct mapping from its current state (position and heading) to a series of feasible actions that guide the agent to safely reach its destination. The reward function has three parts: (a) an attractive field centered at the final destination, (b) repulsive fields centered at the origins of the circular obstacles, and (c) a penalty on control energy consumption (the magnitude of heading change) that indirectly favors straight paths. The DDPG trains the agent using these incentives to find the largest possible set of starting points from which a safe path to the destination is guaranteed. This provides critical information for mission planning, showing beforehand whether a task is achievable from a given starting point and assisting pre-mission planning activities. The approach is validated in simulation. A comparison between the DDPG method and a traditional optimal control (pseudo-spectral) method is carried out. The results show that the learning-based agent produces effective paths while being significantly faster, making it a better fit for real-time applications.
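
The three-part reward described in the abstract can be sketched as follows. This is only an illustration of the structure (attraction, repulsion, control penalty): the abstract does not give the functional forms, so the linear attraction, the Gaussian-shaped repulsion, the weights, and the name `shaped_reward` are all assumptions.

```python
import math


def shaped_reward(pos, goal, obstacles, heading_change,
                  w_goal=1.0, w_obs=1.0, w_ctrl=0.1):
    """Three-part shaped reward; every functional form and weight is hypothetical."""
    # (a) Attractive field centered at the destination: closer is better.
    attract = -w_goal * math.dist(pos, goal)
    # (b) Repulsive fields centered at each circular obstacle (cx, cy, radius).
    repulse = 0.0
    for cx, cy, radius in obstacles:
        d = math.dist(pos, (cx, cy))
        repulse -= w_obs * math.exp(-(d / radius) ** 2)
    # (c) Control penalty on the magnitude of the heading change.
    control = -w_ctrl * abs(heading_change)
    return attract + repulse + control


zones = [(5.0, 5.0, 1.0)]
far = shaped_reward((0.0, 0.0), (10.0, 0.0), zones, 0.0)
near = shaped_reward((9.0, 0.0), (10.0, 0.0), zones, 0.0)
turning = shaped_reward((9.0, 0.0), (10.0, 0.0), zones, 0.5)
beside_zone = shaped_reward((5.0, 4.5), (10.0, 0.0), zones, 0.0)
clear = shaped_reward((5.0, 4.5), (10.0, 0.0), [], 0.0)
```

Under these assumptions the reward rises as the vehicle approaches the destination, drops near a no-go zone, and drops further when the heading changes, which is why the third term nudges the agent toward straight paths.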

2606.08136 2026-06-09 cs.RO New submission

PTDL: Phase-Terrain Decoupled Learning for Multi-Terrain Fall Recovery

Xiaoyu Xu, Zhiming Chen, Yuenan Zhao, Ran Song, Wei Zhang

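Affiliations: School of Control Science and Engineering, Shandong University; Key Laboratory of Machine Intelligence and System Control, Ministry of Education

AI summary: Proposes Phase-Terrain Decoupled Learning (PTDL), which decouples training supervision along phase and terrain axes to achieve multi-terrain fall recovery and the transition to walking with a single proprioceptive policy.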

Abstract

Humanoid robots can fall on slopes, gravel, and uneven ground in unstructured environments. We target integrated fall recovery and locomotion: rebuilding balance from a fallen state using proprioception alone and resuming velocity-commanded walking at the fall site. Prior methods often stop at quasi-static rise, neglect the post-fall ground-contact phase, or, when trained on mixed terrains without separating recovery and locomotion phases or per-surface constraints, collapse to a single compromise get-up across surfaces. We propose Phase-Terrain Decoupled Learning (PTDL), which decouples training supervision along phase and terrain axes while deploying one proprioceptive policy. On the phase axis, projected-gravity-gated dual motion-prior discriminators and a probe-to-walk transition link post-fall recovery to commanded walking. On the terrain axis, terrain-stratified recovery shaping assigns surface-specific training supervision on flat ground, gravel, and slopes; terrain labels are training-only and withheld from policy observations, enabling implicit post-fall strategy selection at deployment. We validate PTDL on a 29-DoF Unitree G1 across flat ground, gravel, and slopes up to 20 degrees in simulation and on hardware, achieving stable cross-terrain recovery, smooth recovery-to-locomotion transitions, and differentiated post-fall rise behaviors under one deployed policy.

2606.09188 2026-06-09 cs.RO cs.CV New submission

Trajectory Optimization in Single and Dual-UAV Bearing-Only Target Localization

Zhijian Xiao, Huayu Huang, Bin Li, Yang Shang, Banglei Guan

Affiliations: College of Aerospace Science and Engineering, National University of Defense Technology; Hunan Key Laboratory of Image Measurement and Visual Navigation

AI summary: Proposes a Fisher-information-matrix-based trajectory optimization method that improves observation geometry through a spectrally weighted objective and an intersection-angle sine term and, combined with an improved particle swarm algorithm, substantially reduces localization error.

Comments 16 pages, 13 figures and 6 tables. Submitted to Measurement

Abstract

Bearing-only target localization is a fundamental problem in optical measurement and finds extensive applications in unmanned aerial vehicle (UAV) technology. Effective trajectory planning establishes favorable observation geometries, thereby enhancing the target localization accuracy of bearing-only UAV systems. This paper proposes a trajectory optimization method for UAVs in bearing-only target localization scenarios. By leveraging the Fisher Information Matrix (FIM), the proposed approach dynamically integrates the geometric configuration and vehicle maneuverability into the optimization framework. Specifically, we introduce a spectrally-weighted FIM objective function that provides better gradient dynamics near degenerate configurations, enabling the planner to rapidly escape from poor observation conditions. For dual-UAV scenarios, an intersection angle sine term is introduced to optimize triangulation geometry by improving the sight-line intersection angle, thereby preventing trajectory aggregation. Furthermore, we propose an improved Particle Swarm Optimization (PSO) algorithm with motion model constraints and particle normalization to ensure the physical feasibility of the trajectory and enhance the compatibility with the objective functions. Simulation results demonstrate that the proposed method reduces the median localization error by 99.21% compared to conventional FIM-based approaches in single-UAV scenarios and achieves a 69.70% improvement for dual-UAV configurations, exhibiting superior performance in long-duration bearing-only localization of maneuvering targets at extended ranges.
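
To make the link between observation geometry and the FIM concrete, here is a minimal sketch of the textbook 2D bearing-only Fisher information matrix, assuming independent Gaussian bearing noise with standard deviation `sigma`. It is not the paper's spectrally weighted objective; the function names and the planar setup are assumptions for illustration. For two observers the determinant is proportional to the squared sine of the sight-line intersection angle, which is why collinear sight lines are degenerate.

```python
import math


def bearing_fim(sensors, target, sigma):
    """2x2 FIM of the target position from one bearing per sensor."""
    fxx = fxy = fyy = 0.0
    for sx, sy in sensors:
        dx, dy = target[0] - sx, target[1] - sy
        r2 = dx * dx + dy * dy
        # Gradient of atan2(dy, dx) with respect to the target position.
        gx, gy = -dy / r2, dx / r2
        fxx += gx * gx
        fxy += gx * gy
        fyy += gy * gy
    s2 = sigma * sigma
    return [[fxx / s2, fxy / s2], [fxy / s2, fyy / s2]]


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def intersection_sine(sensor_a, sensor_b, target):
    """|sin| of the angle between the two sight lines at the target."""
    ta = math.atan2(target[1] - sensor_a[1], target[0] - sensor_a[0])
    tb = math.atan2(target[1] - sensor_b[1], target[0] - sensor_b[0])
    return abs(math.sin(ta - tb))
```

Placing two observers so that their sight lines cross at a right angle maximizes the sine term for fixed ranges, while two observers on the same sight line give a singular FIM regardless of how close they are.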

2606.09237 2026-06-09 cs.RO cs.SY eess.SY New submission

Can we stabilize an inverted pendulum with feedback from a time-of-flight camera?

Anthony Czubarow, Antonio Terpin, Raffaello D'Andrea

Affiliations: Institute for Dynamic Systems and Control, ETH Zürich

AI summary: Shows that a low-cost, low-resolution time-of-flight camera provides sufficient feedback to balance an inverted pendulum on a cart reliably and precisely, challenging the common view that such sensors are unsuitable for precise feedback control.

2606.09640 2026-06-09 cs.RO New submission

Physics-Aware Sparse Learning and Selective Online Adaptation for Euler-Lagrange Robot Dynamics

Rishabh Dev Yadav, Samaksh Ujjawal, Sihao Sun, Spandan Roy, Wei Pan

Affiliations: The University of Manchester; International Institute of Information Technology Hyderabad; Delft University of Technology; Newcastle University

AI summary: Proposes a structure-preserving residual learning framework that decomposes model error into an inertia correction, the induced Coriolis term, and a generalized-force residual; the mechanical part is learned under physical constraints, while the disturbance-sensitive part is adapted online with a sparse history-dependent latent model and Bayesian linear regression, improving dynamics prediction and trajectory tracking across several robot platforms.

Abstract

Accurate dynamics models are essential for model-based robotic control, yet nominal Euler--Lagrange models often become inaccurate in the presence of payload variation, unmodeled coupling, friction, aerodynamic effects, and changing operating conditions. Most learning-based correction methods improve prediction accuracy by introducing a single additive residual, but do not preserve the internal mechanical structure of Euler--Lagrange systems. This leads to models that do not preserve symmetry, positive-definiteness, or the coupling between inertia and velocity-dependent terms, which can result in physically inconsistent predictions and reduced reliability when embedded in model-based controllers. We propose a structure-preserving residual learning framework that decomposes model mismatch into an inertia correction, the corresponding induced Coriolis term, and a generalized-force residual. The mechanical component is learned under physical constraints, while the disturbance-sensitive component is represented through a sparse history-dependent latent interaction model and adapted online using Bayesian linear regression. This separation preserves key mechanical structure while restricting adaptation to the part of the dynamics most affected by changing conditions. Experiments across multiple robotic platforms, including mobile, aerial, and manipulator systems, show that the proposed method improves dynamics prediction and trajectory tracking under coupled and time-varying dynamics. These results highlight the value of combining structured residual modeling, compact latent interaction selection, and selective online adaptation for real-world model-based control.
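
The online adaptation step can be pictured with a generic recursive Bayesian linear regression update. This is a sketch under assumptions, not the paper's model: it treats one scalar residual `y` as a linear function of a feature vector `phi` with Gaussian noise of variance `noise_var`, and the function name and feature choice are hypothetical.

```python
def blr_update(mean, cov, phi, y, noise_var):
    """One rank-1 posterior update for y = phi . w + noise, w ~ N(mean, cov)."""
    n = len(mean)
    # cov @ phi, reused for the gain and the covariance downdate.
    s_phi = [sum(cov[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = noise_var + sum(phi[i] * s_phi[i] for i in range(n))
    gain = [v / denom for v in s_phi]
    err = y - sum(phi[i] * mean[i] for i in range(n))
    new_mean = [mean[i] + gain[i] * err for i in range(n)]
    new_cov = [
        [cov[i][j] - gain[i] * s_phi[j] for j in range(n)] for i in range(n)
    ]
    return new_mean, new_cov


# Hypothetical usage: recover weights [2, -1] from noise-free samples.
mean, cov = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]
for t in range(20):
    phi = [1.0, float(t)]
    mean, cov = blr_update(mean, cov, phi, 2.0 - float(t), noise_var=0.01)
```

Each update costs O(n²) and needs no stored history, which is what makes this kind of regression attractive for adapting only a small, disturbance-sensitive part of a model at control rate.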

2606.09719 2026-06-09 cs.RO New submission

Safe Polytope-in-Polytope Motion Planning and Control with Control Barrier Functions

Alejandro Gonzalez-Garcia, Dries Dirckx, Jan Swevers, Wilm Decré

Affiliations: KU Leuven

AI summary: Proposes a safe local motion planning and control method that uses discrete-time control barrier function constraints inside a model predictive controller to keep a polytopic robot footprint within a continuously updated convex free-space region; computation time drops by up to 91× as the number of obstacles grows.

Comments This work has been submitted to the IEEE for possible publication

Abstract

Autonomous mobile robots operating in tight environments require motion planning frameworks that account for the physical footprint of the robot. Simplifying the geometry to a point or a circle is conservative and discards information needed to successfully and safely traverse narrow passages. This work proposes a safe local motion planning and control method that guarantees that a polytopic robot footprint stays inside a continuously updated convex free-space region. The containment condition is formulated as a set of discrete-time control barrier function constraints within a model predictive controller. The number of safety constraints depends on the complexity of the local free-space geometry and the robot shape, instead of the number of obstacles. The proposed free-space formulation does not need any obstacle detection or segmentation. A comparative analysis against a polytope-based obstacle avoidance formulation confirms favorable scaling, with up to a 91× reduction in computation time as the number of obstacles increases. The approach is validated in simulation with an autonomous surface vehicle and on hardware with a non-holonomic mobile robot, using both occupancy grids and LiDAR sensing. The experiments demonstrate safe real-time motion planning and control at 10 Hz on an onboard embedded computer, including reactive avoidance of dynamic obstacles.
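
The containment idea can be sketched as follows, assuming a planar footprint given by body-frame vertices and a free-space region written as half-spaces a·p ≤ b. A convex footprint lies inside a convex region exactly when every vertex does, so one barrier value per (vertex, face) pair suffices, and the count does not depend on how many obstacles shaped the region. The function names, the fixed region between steps, and the decay rate `gamma` are assumptions for illustration; the paper's MPC formulation is not reproduced here.

```python
import math


def world_vertices(pose, body_vertices):
    """Footprint vertices in the world frame for pose (x, y, theta)."""
    x, y, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(x + c * vx - s * vy, y + s * vx + c * vy)
            for vx, vy in body_vertices]


def barrier_values(pose, body_vertices, halfspaces):
    """h = b - a . v for each (vertex, face) pair; h >= 0 means inside."""
    return [b - (ax * wx + ay * wy)
            for wx, wy in world_vertices(pose, body_vertices)
            for ax, ay, b in halfspaces]


def is_contained(pose, body_vertices, halfspaces):
    """True when the whole footprint lies in the free-space polytope."""
    return all(h >= 0.0 for h in barrier_values(pose, body_vertices, halfspaces))


def dtcbf_satisfied(pose_k, pose_next, body_vertices, halfspaces, gamma):
    """Discrete-time CBF condition h_next >= (1 - gamma) * h_k, 0 < gamma <= 1."""
    h_k = barrier_values(pose_k, body_vertices, halfspaces)
    h_n = barrier_values(pose_next, body_vertices, halfspaces)
    return all(hn >= (1.0 - gamma) * hk for hk, hn in zip(h_k, h_n))
```

Note that the barrier condition is stricter than containment alone: a step that leaves the footprint inside the region can still be rejected if it approaches a face faster than the rate `gamma` allows.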

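2104.12183 2026-06-09 cs.RO Version update

An Interval Branch-and-Bound-Based Inverse Kinemetics Algorithm Towards Global Optimal Redundancy Resolution

Yajue Yang, Zeqing Zhang, Yuanqing Wu, Jia Pan

Affiliations: University of Science and Technology of China

AI summary: Proposes an interval branch-and-bound method that uses a fast numerical IK solver as a search heuristic to solve the generalized inverse kinematics problem for manipulators efficiently; it generates neighboring solutions that give rich geometric information about the self-motion manifold and supports optimal planning and anytime solving.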

2412.01324 2026-06-09 cs.RO Version update

Integrated Hierarchical Decision-Making in Inverse Kinematic Planning and Control

Kai Pfeiffer, Quan Zhang, Yuqing Chen, Gordon Boateng, Yuquan Wang, Vincent Bonnet, Abderrahmane Kheddar

Affiliations: University of Cambridge

AI summary: Presents an efficient nonlinear programming framework that integrates hierarchical decision-making with whole-body inverse kinematic planning and control, addressing the simultaneous selection of end-effector locations during inverse kinematic planning.

Comments Accepted paper to "Robotics: Science and Systems" (2026)

Abstract

This work presents a novel and efficient nonlinear programming framework that tightly integrates hierarchical decision-making with whole-body inverse kinematic planning and control. Decision-making plays a central role in many aspects of robotics, from sparse inverse kinematic control with a minimal number of joints, to inverse kinematic planning while simultaneously selecting a discrete end-effector location from multiple candidates. Current approaches often rely on heavy computations using mixed-integer nonlinear programming, separate decision-making from inverse kinematics (sometimes approximated by reachability methods), or employ efficient but less versatile $\ell_1$-norm formulations of linear sparse programming, without addressing the underlying nonlinear problem formulations. In contrast, the proposed sparse hierarchical nonlinear programming solver is efficient, versatile, and accurate by exploiting sparse hierarchical structure and leveraging the $\ell_0$-norm, which is rarely used in robotics. The solver efficiently tackles complex nonlinear hierarchical decision-making problems previously unaddressed in the literature, such as inverse kinematic planning with simultaneous prioritized selection of end-effector locations from a large set of candidates, or inverse kinematic control with simultaneous selection of bi-manual grasp locations on a randomly rotated box.

2511.05355 2026-06-09 cs.LG cs.RO cs.SY eess.SY Version update

SAD-Flower: Flow Matching for Safe, Admissible, and Dynamically Consistent Planning

Tzu-Yuan Huang, Armin Lederer, Dai-Jie Wu, Xiaobing Dai, Sihua Zhang, Hsiu-Chin Lin, Shao-Hua Sun, Stefan Sosnowski, Sandra Hirche

Affiliations: TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany; Munich Institute of Robotics; Munich Data Science Institute (MDSI); National University of Singapore; National Taiwan University (NTU); NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE); University of Utah; Beijing Institute of Technology; McGill University

AI summary: Proposes SAD-Flower, which augments flow matching with a virtual control input and uses nonlinear control theory to give formal guarantees on state constraints, action constraints, and dynamic consistency, satisfying unseen constraints at test time without retraining.

Abstract

Flow matching (FM) has shown promising results in data-driven planning. However, it inherently lacks formal guarantees for ensuring state and action constraints, whose satisfaction is a fundamental and crucial requirement for the safety and admissibility of planned trajectories on various systems. Moreover, existing FM planners do not ensure the dynamical consistency, which potentially renders trajectories inexecutable. We address these shortcomings by proposing SAD-Flower, a novel framework for generating Safe, Admissible, and Dynamically consistent trajectories. Our approach relies on an augmentation of the flow with a virtual control input. Thereby, principled guidance can be derived using techniques from nonlinear control theory, providing formal guarantees for state constraints, action constraints, and dynamic consistency. Crucially, SAD-Flower operates without retraining, enabling test-time satisfaction of unseen constraints. Through extensive experiments across several tasks, we demonstrate that SAD-Flower outperforms various generative-model-based baselines in ensuring constraint satisfaction.

2606.07723 2026-06-09 cs.RO New submission

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

Siyi Chen, Hugo Hadfield, Alex Zook, Mikaela Angelina Uy, Chan Hee Song, Erwin Coumans, Xuning Yang, Faisal Ladhak, Qing Qu, Stan Birchfield, Jonathan Tremblay, Valts Blukis

Affiliations: NVIDIA; University of Michigan

AI summary: Proposes VoLoAgent, in which a VLM physically orchestrates a VLA/WAM as an interruptible tool to achieve open-vocabulary long-horizon manipulation, and substantially outperforms existing systems on the new RoboVoLo benchmark.

Abstract

Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike in virtual AI agents, the timing of decisions, actions and tool calls is important in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration, and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation across common sense, memory/state tracking, complex references, and world knowledge, with both task-level success and failure-mode diagnostics. Experiments show VoLoAgent substantially outperforms single VLA/VLM or tool-based systems, with validation on real-robot experiments. Project page: https://chicychen.github.io/VoLo/

2606.08057 2026-06-09 cs.RO cs.AI New submission

EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

Yichen Niu, Haoran Lv, Xinrui Zhang, Xueyao Wan, Shiyu Gao, Ying Ai, Hui Xu, Yongqi Hu, Hengyi Zhang, Yang Xie, Zhaxizhuoma, Yue Zhao, Zhenshan Bing, Yan Ding, Jianxing Liu

Affiliations: School of Astronautics, Harbin Institute of Technology; Lumos Robotic; Suzhou Research Institute, Harbin Institute of Technology; Shanghai Jiao Tong University; Shanghai AI Lab; Nanjing University; Xi’an Jiaotong-Liverpool University; Fudan University

AI summary: Proposes EgoAERO, an asset-free framework that reconstructs contact-consistent hand-object trajectories from a single egocentric RGB-D video via asset-free object tracking and reconstruction, ego-motion compensation, and adaptive contact optimization, then converts them into robot policies with two-stage residual learning, enabling dexterous manipulation from a single demonstration.

Abstract

Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.

2606.08103 2026-06-09 cs.RO cs.CV New submission

Revisiting Articulated Parts Perception in Robot Manipulation

Xiaoqian Wu, Yejie Guo, Xiaoyang Chen, Lixin Yang, Cewu Lu, Yong-Lu Li

Affiliations: Shanghai Jiao Tong University

AI summary: Proposes Geometric Primary Structure (GPS) as a new representation of articulated parts, pairs it with a VR device for efficient annotation, and trains a generalizable model that reaches a 73% zero-shot manipulation success rate.

Comments CVPR2026

Abstract

We are surrounded by various objects with movable, articulated parts, e.g., box, handle, door. An accurate and generalizable perception of articulated parts is essential to enhance robotic manipulation capabilities. Building on this need, recent efforts in articulated parts perception have followed two main directions: one line of work uses pose-based representation, which requires high manual cost; in parallel, affordance-based methods extract future object motion from point tracking without additional manual effort, but suffer from low-quality data. In this paper, we propose a new representation of articulated parts, Geometric Primary Structure (GPS), an abstraction of the part geometry structure to balance scalability and quality. For efficient and scalable data collection, GPS is integrated with a portable Virtual Reality (VR) device and requires only one minute to annotate one object sequence. This direct human annotation provides higher quality than the estimated affordance. With this efficient VR-GPS system, we collect 41K frames for 234 objects across six part classes, and train a generalizable GPS model with a single RGB-D object image as input. For object manipulation, we deploy a heuristic policy based on GPS prediction. Without any in-domain fine-tuning, our method achieves a 73% success rate, covering 270 initial states for 9 objects. Our code, data and reusable tool are available at https://enlighten0707.github.io/gps.

2606.08152 2026-06-09 cs.RO New submission

Vision-Guided Dual-Arm Humanoid Robotic Disassembly of End-of-Life 18650 Lithium-ion Battery Packs

Yile Chen, Zhihao Liu, Xi Vincent Wang, Lihui Wang

Affiliations: KTH Royal Institute of Technology

AI summary: Proposes a vision-guided dual-arm disassembly pipeline that uses general-purpose parallel-jaw grippers, RGB-D perception, and a pre-trained grasp detector to disassemble a 21-cell 18650 pack from an arbitrary initial pose without fixtures, achieving an 80% end-to-end success rate.

Abstract

The growing volume of retired lithium-ion battery packs from electric vehicles and portable electronics calls for automated disassembly that is safe, flexible, and selective down to the individual cell. Existing robotic systems, however, mostly assume known pack poses, external fixtures, or specialised tooling, leaving fixture-free cell-level disassembly under pose uncertainty largely unsolved. This paper presents a vision-guided dual-arm pipeline that disassembles a 21-cell 18650 pack from an arbitrary initial pose using only general-purpose parallel-jaw grippers, RGB-D sensing, and a pre-trained grasp detector. Pose uncertainty is absorbed by a learn-and-filter perception stack with discrete look-and-move wrist-camera corrections, while a mid-task support transfer between the two arms extends the effective workspace without any external clamp. The pipeline achieves an 8/10 end-to-end success rate, a cell-localisation root-mean-square error of 2.4 mm, and a mean cycle time of 6.0 minutes per pack, providing a practical, fixture-free building block for industrial battery recycling.

2606.08440 2026-06-09 cs.RO cs.CV New submission

SynManDex: Synthesizing Human-Like Dexterous Grasps from Synthetic Human Pre-Grasps

Yanming Shao, Zanxin Chen, Wenwei Lin, Mingjie Zhou, Tianxing Chen, Xiaokang Yang, Yichen Chi, Yao Mu

Affiliations: Shanghai AI Lab; Shanghai Jiaotong University; Shenzhen University; Fudan University; University of Hong Kong; ZTE Corporation

AI summary: Proposes the SynManDex pipeline, which uses generated human pre-grasps as proposals and reaches force-closure contacts through robot-native optimization to produce human-like dexterous grasps, achieving high success rates and human-likeness in simulation and on a real robot.

Abstract

Human hand-object interactions encode functional intent, but direct transfer to robotic hands often fails under morphology, contact, and reachability constraints. We present SynManDex, a synthetic pipeline that uses generated human pre-grasps as affordance-aware proposals and resolves the final contacts with robot-native optimization. SynManDex samples object-conditioned digital human pre-grasps, retargets them to dexterous robotic hand poses, optimizes force-closure contacts on the target embodiment, and admits trajectories that pass checks from each step. The resulting keyframes support both grasp-and-lift demonstrations and various prehensile manipulation tasks such as tea pouring, photo taking, and flute playing, designed via VLM agents. As a result, SynManDex combines high grasp quality (86.4% grasp stability) with 4.67/5 human-likeness (93.4%). It achieves 80.7% success in simulation and 25/30 (83.3%) real-robot successes when applied to a 36-DOF bimanual dexterous robotic platform.

2510.01661 2026-06-09 cs.RO Version update

Symskill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation

Yifei Simon Shao, Yuchen Zheng, Sunan Sun, Pratik Chaudhari, Vijay Kumar, Nadia Figueroa

Affiliations: GRASP Laboratory, University of Pennsylvania

AI summary: SymSkill jointly learns predicates, operators, and skills to achieve data-efficient, reactive long-horizon manipulation that combines compositional generalization with real-time recovery.

Comments ICRA 2026 Best Conference Paper Award; ICRA 2026 Best Paper Award on Planning and Control; CoRL 2025 Best Paper Award on Learning Effective Abstractions for Planning (LEAP) Workshop (https://symskill.github.io/)

Abstract

Multi-step manipulation in dynamic environments remains challenging. Imitation learning (IL) is reactive but lacks compositional generalization, since monolithic policies do not decide which skill to reuse when scenes change. Classical task-and-motion planning (TAMP) offers compositionality, but its high planning latency prevents real-time failure recovery. We introduce SymSkill, a unified framework that jointly learns predicates, operators, and skills from unlabeled, unsegmented demonstrations, combining compositional generalization with real-time recovery. Offline, SymSkill learns symbolic abstractions and goal-oriented skills directly from demonstrations. Online, given a conjunction of learned predicates, it uses a symbolic planner to compose and reorder skills to achieve symbolic goals while recovering from failures at both the motion and symbolic levels in real time. Coupled with a compliant controller, SymSkill supports safe execution under human and environmental disturbances. In RoboCasa simulation, SymSkill executes 12 single-step tasks with 85% success and composes them into multi-step plans without additional data. On a real Franka robot, it learns from 5 minutes of play data and performs 12-step tasks from goal specifications. Code and additional analysis are available at https://symskill.github.io/ .
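
How a symbolic planner can compose and reorder skills toward a conjunction of predicates is sketched below with a plain breadth-first search over STRIPS-style operators. This is a generic stand-in, not SymSkill's planner: the predicate names, operator names, and kitchen scenario are hypothetical, and in the paper the predicates and operators are learned from demonstrations rather than written by hand.

```python
from collections import deque


def plan(initial, goal, operators):
    """Shortest operator sequence from `initial` to a state containing `goal`.

    operators maps a skill name to (preconditions, add_effects, delete_effects),
    each a frozenset of predicate strings. Returns None if no plan exists.
    """
    start, goal = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for name, (pre, add, delete) in operators.items():
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None


# Hypothetical operators for a put-the-cup-in-the-drawer task.
OPERATORS = {
    "open_drawer": (frozenset({"drawer_closed", "hand_empty"}),
                    frozenset({"drawer_open"}),
                    frozenset({"drawer_closed"})),
    "pick_cup": (frozenset({"hand_empty", "cup_on_table"}),
                 frozenset({"holding_cup"}),
                 frozenset({"hand_empty", "cup_on_table"})),
    "place_in_drawer": (frozenset({"holding_cup", "drawer_open"}),
                        frozenset({"cup_in_drawer", "hand_empty"}),
                        frozenset({"holding_cup"})),
}
```

Recovery at the symbolic level then amounts to calling `plan` again from whatever predicates currently hold: if a person has already opened the drawer, the same goal yields a shorter plan that skips `open_drawer`.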

2601.02085 2026-06-09 cs.RO cs.AI Version update

Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots

Meili Sun, Chunjiang Zhao, Lichao Yang, Hao Liu, Shimin Hu, Ya Xiong

Affiliations: NERCITA

AI summary: To address poor visual perception, gripper misalignment, empty grasps/misgrasps, and slippage in strawberry-harvesting robots, proposes a visual fault diagnosis and self-recovery framework that combines unified perception with SRR-Net, relative error compensation, micro-optical camera feedback, and LSTM-based slip prediction to achieve accurate positioning and fault recovery.

Comments Accepted by Artificial Intelligence in Agriculture

DOI: 10.1016/j.aiia.2026.05.009

Abstract

Strawberry-harvesting robots faced challenges such as poor visual perception, gripper misalignment, empty grasp/misgrasp, and slippage, which reduced harvesting stability and efficiency. To overcome these issues, this paper proposes a visual fault diagnosis and self-recovery framework. An end-to-end SRR-Net achieved unified perception and fault diagnosis through joint detection, segmentation, and ripeness regression of the fruit and gripper. Leveraging this integrated perception, a relative error compensation method driven by simultaneous target-gripper detection was designed to correct positional misalignments exceeding the tolerance threshold. A micro-optical camera integrated within the end-effector delivered real-time visual feedback. Based on the micro-optical camera, a MobileNet V3-Small classifier was utilized for grasp adjustment during the deflating stage, enabling the early abort of the harvesting cycle in cases of empty grasp/misgrasp. Furthermore, a time-series LSTM classifier was applied during the snap-off stage to predict strawberry slippage. Based on these predictions, the system executed re-inflation and a secondary snap-off attempt for slipping strawberries, or aborted the cycle for slipped strawberries. Experiments demonstrated that the mean absolute errors between the end-effector and the picking point were reduced to 3.12 mm and 4.06 mm from 11.50 mm and 5.25 mm along the x- and y-axes, respectively, at the cost of a time increment of 0.64 ± 0.24 s. The grasp adjustment module reduced the grasping phase by approximately 0.5 s and avoided empty-placement for failure cases. The strawberry slip prediction module handled slipped cases with an 88.89% success rate, saving approximately 4.00 s per harvesting cycle for failure cases. It also achieved an 81.25% recovery rate for slipping strawberries, requiring an additional 0.63 s for re-grasping.

2601.14871 2026-06-09 cs.RO Version update

On-the-fly hand-eye calibration for the da Vinci surgical robot

Zejian Cui, Ferdinando Rodriguez y Baena

Affiliations: Department of Mechanical Engineering, Imperial College London; Mechatronics in Medicine Laboratory; Hamlyn Centre for Robotics Surgery

AI summary: To address inaccurate tool localization caused by encoder errors on the da Vinci robot, proposes a calibration framework that computes the hand-eye transformation matrix on the fly; its feature association and hand-eye calibration modules match key points without pre-training, substantially reducing localization error across varied surgical scenarios while remaining time-efficient.

Comments 18 pages, 17 figures, 5 tables

Abstract

In Robot-Assisted Minimally Invasive Surgery (RMIS), accurate tool localization is crucial to ensure patient safety and successful task execution. However, this remains challenging for cable-driven robots, such as the da Vinci robot, because erroneous encoder readings lead to pose estimation errors. In this study, we propose a calibration framework to produce accurate tool localization results through computing the hand-eye transformation matrix on-the-fly. The framework consists of two interrelated algorithms: the feature association block and the hand-eye calibration block, which provide robust correspondences for key points detected on monocular images without pre-training, and offer the versatility to accommodate various surgical scenarios by adopting an array of filter approaches, respectively. To validate its efficacy, we test the framework extensively on publicly available video datasets that feature multiple surgical instruments conducting tasks in both in vitro and ex vivo scenarios, under varying illumination conditions and with different levels of key point measurement accuracy. The results show a significant reduction in tool localization errors under the proposed calibration framework, with accuracies comparable to other state-of-the-art methods while being more time-efficient.

2602.11934 2026-06-09 cs.RO Version update

Robot-DIFT: Correspondence-Sensitive Diffusion Features for Contact-Rich Robot Manipulation

Yu Deng, Yufeng Jin, Xiaogang Jia, Jiahong Xue, Gerhard Neumann, Georgia Chalvatzaki

Affiliations: TU Darmstadt; KIT; FZI; Hessian.AI; Robotics Institute Germany; Honda Research Institute Europe GmbH

AI summary: Proposes Robot-DIFT, which uses Manifold Distillation to turn a diffusion model into a deterministic student network and adds a Spatial-Semantic Feature Pyramid Network to provide real-time correspondence-sensitive features for contact-sensitive tasks, outperforming existing methods on several benchmarks.

Abstract

Robot manipulation often fails in the final millimeters: a policy may recognize the right object yet miss the pose offsets, boundaries, or pre-contact alignments needed for action. We argue that such failures arise when semantic invariance suppresses correspondence cues for closed-loop control, or when these cues are not exposed to the policy in a usable form. Modern visual encoders provide strong semantic abstractions, but contact-rich manipulation requires correspondence sensitivity: discriminative feature responses to action-relevant changes in pose, boundary, and contact geometry. Diffusion features provide a strong prior for dense correspondence, but direct use is impractical due to stochasticity, latency, and representation drift. We introduce Robot-DIFT, a deterministic diffusion-derived backbone for real-time control. Through Manifold Distillation, Robot-DIFT converts a noise-conditioned diffusion Teacher into a clean-input, single-pass Student while preserving the teacher's feature manifold. A Spatial--Semantic Feature Pyramid Network (S2-FPN) fuses coarse-to-fine Student decoder features into visual tokens that expose semantic context and fine contact detail to the policy. Across RoboCasa, LIBERO-10, and real robots, Robot-DIFT outperforms vision--language, self-supervised, geometry-oriented, and diffusion baselines on contact-sensitive tasks. Controlled backbone/readout swaps show that S2-FPN unlocks, rather than replaces, the diffusion correspondence prior.


2604.20689 2026-06-09 cs.RO 版本更新

RealDexUMI: A Wearable Universal Manipulation Interface for Dexterous Robot Learning

Chaoyi Xu, Yixuan Jiang, Jiahui Huan, Yuhui Fu, Haoyu Zhou, Weitian Yuan, Jiayi Yu, Wanpeng Zhang, Haoqi Yuan, Zongqing Lu

发表机构 * Peking University（北京大学）； BeingBeyond ； Beihang University（北航）； LinkerBot ； Tsinghua University（清华大学）

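AI summary: Proposes RealDexUMI, a wearable universal manipulation interface built around a shared dexterous end-effector module. A palm-side isomorphic teleoperation glove enables retargeting-free, intuitive, and precise hand control, and policies reach an average success rate of 88.75% across eight real-robot tasks.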

AI中文摘要

学习灵巧操作需要演示，这些演示在保持精细手-物体交互的同时，在部署时仍可执行。现有流程要么通过重定向或具身转换损失可部署的灵巧性，要么依赖特定机器人的遥操作，这种遥操作成本高昂且难以扩展，并且通常缺乏用于灵巧数据收集的直观、接触感知控制。我们提出RealDexUMI，一种围绕共享灵巧末端执行器模块构建的可穿戴通用操作接口，该模块集成了轻量级灵巧手、手内视觉和指尖触觉传感。掌侧同构遥操作手套将人类手指输入映射到机器人手关节命令，实现实时、无重定向、直观且精确的手部控制。共享的手和传感模块产生零间隙的末端执行器数据，在收集和部署之间具有匹配的手内观察、触觉信号、接触和手部动作。在涵盖精细、接触丰富、长时域和双臂操作的八项真实机器人任务中，基于RealDexUMI数据训练的策略平均成功率达到88.75%，能够泛化到未见过的初始姿态，并在三种具身之间迁移。网站：https://research.beingbeyond.com/realdexumi

英文摘要

Learning dexterous manipulation requires demonstrations that preserve fine hand-object interactions while remaining executable at deployment. Existing pipelines either lose deployable dexterity through retargeting or embodiment conversion, or rely on robot-specific teleoperation that is costly to scale and often lacks intuitive, contact-aware control for dexterous data collection. We present RealDexUMI, a wearable universal manipulation interface built around a shared dexterous end-effector module that integrates a lightweight dexterous hand, in-hand vision, and fingertip tactile sensing. A palm-side isomorphic teleoperation glove maps human finger inputs to robot-hand joint commands, enabling real-time, retargeting-free, intuitive, and precise hand control. The shared hand and sensing modules yield zero-gap end-effector data, with matched in-hand observations, tactile signals, contacts, and hand actions between collection and deployment. Across eight real-robot tasks spanning fine-grained, contact-rich, long-horizon, and bimanual manipulation, policies trained on RealDexUMI data achieve an average success rate of 88.75%, generalize to unseen initial poses, and transfer across three embodiments. Website: https://research.beingbeyond.com/realdexumi


2606.08029 2026-06-09 cs.RO 新提交

IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations

IntentNav: 从人类演示中学习空间-视觉物体导航

Yuxin Cai, Zongtai Li, Maonan Wang, Muyi Bao, Haokun Zhu, Ruofei Bai, Ding Zhao, Zirui Li, Wenshan Wang, Wei-Yun Yau, Ji Zhang, Chen Lv

发表机构 * Nanyang Technological University（南洋理工大学）； Carnegie Mellon University（卡内基梅隆大学）； The Chinese University of Hong Kong（香港中文大学）； A*STAR Institute for Infocomm Research (I2R)（新加坡科技研究局资讯通信研究院）

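AI summary: Proposes the IntentNav framework, which learns human-like object navigation policies from human demonstrations, uses frontier-based labeling and an intent-aligned objective to reach state-of-the-art performance, and transfers zero-shot to multiple robot platforms.

Comments: 26 pages, 9 figures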

AI中文摘要

物体导航要求机器人在未知环境中搜索未观察到的目标，通过在部分可观测性下决定下一步探索位置。有效的搜索类似于人类探索：选择性探查视觉上有希望的前沿，同时依赖空间记忆避免重复访问。我们提出IntentNav，一个从人类演示中学习类人ObjectNav策略的空间-视觉模仿框架。为了从低级人类动作推断高级搜索意图，我们引入了基于前沿的人类意图标注，该方法前瞻人类演示并标注最能解释演示者未来搜索方向的前沿。我们构建了一个空间-视觉候选空间，其中BEV记忆跟踪已探索区域、未探索前沿和轨迹历史，而自我中心视觉记忆为每个候选提供语义线索。训练一个VLM策略在这些基于上下文的候选中进行选择，使用意图对齐目标以鼓励一致且类人的探索。IntentNav在MP3D、HM3D-v1和HM3D-v2 ObjectNav基准上实现了最先进的性能。所提出的候选级导航界面无需进一步VLM微调即可零样本迁移到轮式、四足和类人机器人。项目页面：https://anonymous.4open.science/w/IntentNav/

英文摘要

Object navigation requires a robot to search for an unobserved target in an unknown environment by deciding where to explore next under partial observability. Effective search resembles human-like exploration: selectively probing visually promising frontiers while relying on spatial memory to avoid redundant revisits. We propose IntentNav, a spatial-visual imitation framework that learns human-like ObjectNav policies from human demonstrations. To infer high-level search intent from low-level human actions, we introduce Frontier-based Human-Intent Labeling, which looks ahead in human demonstrations and labels the frontier that best explains the demonstrator's future search direction. We construct a spatial-visual candidate space, where BEV memory tracks explored regions, unexplored frontiers, and trajectory history, while egocentric visual memory provides semantic cues for each candidate. A VLM policy is trained to select among these grounded candidates, using an Intent-Aligned Objective to encourage consistent and human-like exploration. IntentNav achieves state-of-the-art performance on the MP3D, HM3D-v1 and HM3D-v2 ObjectNav benchmarks. The proposed candidate-level navigation interface transfers zero-shot to wheeled, quadruped, and humanoid robots without further VLM fine-tuning. Project page: https://anonymous.4open.science/w/IntentNav/


2606.08666 2026-06-09 cs.RO 新提交

Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

语言作为传感器：从自然语言在3D场景中进行校准的空间信念估计

Aryan Naveen, Jason Xinyu Liu, Luca Carlone, Andreea Bobu

发表机构 * MIT Laboratory for Information & Decision Systems（麻省理工学院信息与决策系统实验室）； MIT Computer Science & Artificial Intelligence Laboratory（麻省理工学院计算机科学与人工智能实验室）

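AI summary: Proposes a Language Sensor Model (LSM) that converts natural-language descriptions into calibrated spatial distributions and fuses them into the probabilistic VL-Map framework for more accurate target localization.

Comments: 18 pages, 7 figures, 3 tables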

AI中文摘要

部署在以人为中心的环境中的机器人经常接收自然语言的空间信息描述（如“我把背包放在桌子上”），这些描述涉及超出其感知视野的世界部分。传统的度量-语义映射忽略了这一信号，而现成的多模态模型在3D空间推理方面仍然有限，并且不易与其他传感器模态融合。为了将语言观测转换为校准的空间分布，我们训练了一个语言传感器模型（LSM），该模型将每个话语及其场景图上下文映射到多模态分布，其中混合权重编码指代歧义（例如，“哪张桌子”），分量协方差编码空间不确定性（例如，目标在“桌子上”的哪个位置）。然后，我们引入了VL-Map（视觉-语言度量-语义映射），这是一个概率框架，将这些语言预测视为随机观测，并在统一的信念图中与机载感知融合。在VLA-3D基准测试以及真实世界的移动机器人上，LSM是唯一协方差估计保持在校准范围内的语言预测器；融合到VL-Map中，它导致对目标对象位置更准确的预测（与最强的基础模型基线相比，真实目标上的概率质量增加了约70%）。

英文摘要

Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information ("I left my backpack on the table") that reference parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion with other sensor modalities. To convert language observations into a calibrated spatial distribution, we train a Language Sensor Model (LSM) that maps each utterance and its scene-graph context to a multimodal distribution, with mixture weights encoding referential ambiguity (e.g., "which table") and component covariances encoding spatial uncertainty (e.g., where "on the table" the target lies). We then introduce VL-Map (Vision-Language Metric-Semantic Mapping), a probabilistic framework that treats these language predictions as stochastic observations and fuses them with onboard perception within a unified belief map. On the VLA-3D benchmark as well as on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it leads to more accurate predictions of the target object location (~70% more probability mass on the true target compared to the strongest foundation-model baseline).
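
The multimodal distribution described in this abstract can be pictured with a small sketch. It evaluates a 2D Gaussian mixture in which the mixture weights stand for referential ambiguity (which table is meant) and the per-component spread stands for spatial uncertainty (where on the table). The table positions, weights, and standard deviations are made-up values, isotropic covariances are assumed for brevity, and none of this reproduces the LSM or VL-Map themselves.

```python
import math

def isotropic_gaussian_pdf(point, mean, sigma):
    """Density of a 2D isotropic Gaussian with standard deviation sigma."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    norm = 2.0 * math.pi * sigma ** 2
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2)) / norm

def mixture_pdf(point, components):
    """components: list of (weight, mean, sigma); weights are assumed to sum to 1."""
    return sum(w * isotropic_gaussian_pdf(point, m, s) for w, m, s in components)

def component_posterior(point, components):
    """Probability that each component (candidate referent) explains the point."""
    likelihoods = [w * isotropic_gaussian_pdf(point, m, s) for w, m, s in components]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

# Hypothetical belief for "I left my backpack on the table" with two tables:
# (weight, table centre in metres, positional std in metres)
belief = [(0.7, (1.0, 2.0), 0.3), (0.3, (4.0, 0.5), 0.5)]
```

Querying `mixture_pdf` at each table centre shows how much belief sits there, and `component_posterior` gives the share of that belief owed to each candidate table.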


2606.08992 2026-06-09 cs.RO cs.AI cs.CV 新提交

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

SpaceVLN：具有在线空间认知记忆与推理的零样本视觉与语言导航智能体

Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； China Telecom（中国电信）； Central South University（中南大学）； Jiangsu University（江苏大学）

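AI summary: Proposes SpaceVLN, which uses Spatial Cognitive Memory and Task-Guided Spatial Reasoning to perform vision-and-language navigation in continuous environments under a zero-shot setting, achieving state-of-the-art performance on multiple benchmarks.

Comments: 23 pages, 9 figures, 7 tables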

AI中文摘要

连续环境中的视觉与语言导航要求智能体理解未见环境的空间结构以遵循语言指令。尽管基础模型为无需任务特定策略训练的零样本导航开辟了有希望的路径，但许多导航器仍依赖局部视觉线索和基于线性历史的推理，忽视了探索区域、穿越路径、地标及其空间关系的空间本质。本文提出SpaceVLN，一种围绕空间认知记忆和任务引导的空间推理构建的导航智能体。具体而言，SpaceVLN引入了一个高效的分阶段闭环框架，其中规划和执行围绕可验证的空间-地标阶段组织。导航过程中，智能体逐步将探索区域抽象为空间航点，并动态维护子任务基础的地标证据，形成层次化的空间认知记忆以进行进度定位和空间关系理解。基于此记忆，Spatial-CoT将任务进度推理与空间感知、分析和预测相结合，实现任务引导的空间推理以用于具身导航。统一阶段接口使SpaceVLN能够在统一的零样本设置下处理视觉与语言导航和目标导向导航，无需任务特定策略训练。在R2R-CE、RxR-CE、GN-Bench和HM3D-OVON上，SpaceVLN实现了最先进的零样本性能，真实机器人部署进一步验证了其适用性。这些结果突显了空间认知记忆和任务引导的空间推理作为更强具身导航智能体的实用基础。

英文摘要

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.


2606.09268 2026-06-09 cs.RO 新提交

VGP-Nav: Metric-Aware Visual Geometric Perception for Robot Navigation

VGP-Nav：用于机器人导航的度量感知视觉几何感知

Hewei Pan, Weiye Zhu, Zekai Zhang, Zitong Huang, Rongtao Xu, Jinbao Wang, Feng Zheng

发表机构 * Southern University of Science and Technology（南方科技大学）； MBZUAI（穆罕默德·本·扎耶德人工智能大学）； Shenzhen University（深圳大学）； SpatialTemporal AI（时空人工智能）

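AI summary: Proposes VGP-Nav, a framework that relies solely on monocular RGB input and resolves scale ambiguity through ground-plane geometric constraints, unifying metric localization and obstacle perception.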

AI中文摘要

可靠的机器人导航需要精确的全局定位和稠密、度量一致的障碍物感知的无缝集成。实现这些能力的常见策略涉及集成多种传感模态：相机提供丰富的视觉特征用于定位，而主动传感器如LiDAR提供直接的度量测量。然而，这种多传感器配置需要复杂的时空校准并增加部署开销。尽管纯视觉方法提供了低成本且可扩展的替代方案，现有的单目视觉系统通常难以同时实现高效、全局一致的定位和稠密、度量一致的几何感知。为弥合这一差距，我们提出VGP-Nav，一个统一的度量感知视觉几何感知框架，仅依赖单目RGB输入，联合支持度量定位和障碍物感知。我们的关键洞察是将基于定位的视觉几何锚定到从地面平面几何导出的物理上有意义的尺度约束，从而为单目感知提供可靠的度量参考。VGP-Nav在线解决单目尺度模糊，并生成可直接用于下游规划的、基于定位的度量障碍物表示。大量实验证明了其在多种环境中的强泛化能力以及在真实移动机器人上的成功部署，突显了该方法在可扩展、低成本且安全的自主导航中的实用性。

英文摘要

Reliable robotic navigation necessitates the seamless integration of accurate global localization and dense, metric-consistent obstacle perception. A common strategy to achieve these capabilities involves integrating diverse sensing modalities: cameras offer rich visual features for localization, while active sensors like LiDAR provide direct metric measurements. However, such multi-sensor configurations necessitate complex spatial-temporal calibration and increase deployment overhead. Although vision-only approaches offer a low-cost and scalable alternative, existing monocular visual systems typically struggle to simultaneously achieve efficient, globally consistent localization and dense, metric-consistent geometric perception. To bridge this gap, we propose VGP-Nav, a unified framework for Metric-Aware Visual Geometric Perception that relies solely on monocular RGB input to jointly support metric localization and obstacle perception. Our key insight is to anchor localization-grounded visual geometry to physically meaningful scale constraints derived from ground-plane geometry, thereby providing a reliable metric reference for monocular perception. VGP-Nav resolves monocular scale ambiguity online and produces localization-grounded, metric obstacle representations that are directly applicable to downstream planning. Extensive experiments demonstrate strong generalization across diverse environments and successful deployment on real mobile robots, highlighting the practicality of our approach for scalable, low-cost, and safe autonomous navigation.
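
One common way to obtain a metric scale from ground-plane geometry is sketched below. The abstract does not specify how VGP-Nav does this, so the approach here is an assumption: the camera's mounting height above the floor is taken as known, the ground plane has already been fitted in the up-to-scale reconstruction, and the ratio of the two heights gives the scale factor. All function names and numbers are hypothetical.

```python
import math

def point_plane_distance(point, normal, offset):
    """Distance from a 3D point to the plane normal . x + offset = 0."""
    norm = math.sqrt(sum(n * n for n in normal))
    return abs(sum(n * p for n, p in zip(normal, point)) + offset) / norm

def metric_scale(camera_center, plane_normal, plane_offset, true_camera_height_m):
    """Scale factor that maps reconstruction units to metres (assumed known height)."""
    estimated_height = point_plane_distance(camera_center, plane_normal, plane_offset)
    if estimated_height <= 0.0:
        raise ValueError("camera lies on the fitted ground plane; scale is undefined")
    return true_camera_height_m / estimated_height

def rescale_points(points, scale):
    """Apply the metric scale to up-to-scale 3D points expressed in the camera frame."""
    return [tuple(scale * c for c in p) for p in points]
```

For example, if the fitted plane sits 0.5 reconstruction units below the camera and the camera is mounted 1.2 m above the floor, every reconstructed point is multiplied by 2.4 before being handed to a planner.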


2606.09292 2026-06-09 cs.RO cs.SY eess.SY 新提交

Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments

基于对偶四元数的无迹卡尔曼滤波与视觉惯性里程计在GPS拒止环境中的导航

Mohamed Khalifa, Hashim A. Hashim

发表机构 * Carleton University（卡尔顿大学）

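AI summary: Proposes a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) combined with Visual Inertial Odometry (VIO) for accurate state estimation in GPS-denied environments, reaching a position RMSE of 0.2584 m on the EuRoC dataset.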

DOI: 10.1016/j.measurement.2026.121964

AI中文摘要

在GPS拒止环境中的可靠导航仍然是机器人、航空航天和自动驾驶车辆应用中的基本挑战。本文提出了一种基于对偶四元数的无迹卡尔曼滤波（DQUKF），配备视觉惯性里程计（VIO）算法，用于在GPS拒止位置实现精确状态估计以实现导航。所提出的框架以误差状态形式构建DQUKF，其中名义位姿由单位对偶四元数表示，局部位姿误差由6维扭量参数化表示，用于sigma点生成、协方差传播和测量校正。同时，VIO算法跨图像帧跟踪特征，同步IMU和相机之间的测量，并提供补充惯性传播的视觉约束。在EuRoC MAV数据集上的仿真结果表明，所提出的DQUKF在高初始化不确定性下收敛，并在困难飞行序列中实现了0.2584米的位置RMSE，优于基准滤波器。

英文摘要

Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and autonomous vehicle applications. This paper presents a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) equipped with a Visual Inertial Odometry (VIO) algorithm for accurate state estimation enabling navigation in GPS-denied locations. The proposed framework formulates the DQUKF in error-state form, where the nominal pose is represented by a unit dual quaternion and the local pose error is represented by a 6-dimensional twistor parameterization used for sigma point generation, covariance propagation, and measurement correction. In parallel, the VIO algorithm tracks features across image frames, synchronizes measurements between the IMU and camera, and provides visual constraints that complement inertial propagation. Simulation results on the EuRoC MAV dataset show that the proposed DQUKF converges under high initialization uncertainty and achieves a position RMSE of 0.2584 m in the difficult flight sequence, outperforming the benchmark filters.
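
The sigma point generation mentioned in this abstract can be illustrated with the textbook unscented transform in a plain vector space. This is only a sketch: it uses the classic symmetric sigma-point set with a single spread parameter `kappa`, and it does not reproduce the paper's mapping between the 6-dimensional error vector and the unit dual quaternion. The helper names are hypothetical.

```python
import math

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a small symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def sigma_points(mean, cov, kappa=0.0):
    """Return 2n + 1 sigma points and weights for an n-dimensional Gaussian."""
    n = len(mean)
    scale = n + kappa
    # Columns of the Cholesky factor of (n + kappa) * cov give the spread directions.
    lower = cholesky([[scale * c for c in row] for row in cov])
    points = [list(mean)]
    for col in range(n):
        column = [lower[row][col] for row in range(n)]
        points.append([m + c for m, c in zip(mean, column)])
        points.append([m - c for m, c in zip(mean, column)])
    weights = [kappa / scale] + [1.0 / (2.0 * scale)] * (2 * n)
    return points, weights
```

By construction, the weighted mean and weighted covariance of the returned points match the input `mean` and `cov`, which is the property the filter relies on when it propagates the points through a nonlinear model.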


2606.09355 2026-06-09 cs.RO 新提交

MosaicIMU: Composing Carrier Experts for Generalizable Neural Inertial Odometry

MosaicIMU：面向可泛化神经惯性里程计的载体专家组合

Junye Zou, Huiyi Yan, Xinning Xu, Xiaolei Li, Pengkun Zhou, Jinhui Zhang, Ziyang Meng

发表机构 * Tsinghua University（清华大学）； Xi'an Jiaotong University（西安交通大学）； Beijing University of Chemical Technology（北京化工大学）； Beijing Information Science and Technology University（北京信息科技大学）； Beijing Institute of Technology（北京理工大学）

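AI summary: Proposes the MosaicIMU framework, which composes carrier-specific expert features through prototype-based routing and integrates them with a history-aware EKF to generalize across carriers. For new domains it freezes the pretrained model and learns a lightweight expert residual branch; for edge deployment it reuses the router to select online samples for efficient incremental updates. Average ATE and RTE-10s drop by 40% and 34%, respectively.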

AI中文摘要

当外部传感不可靠时，鲁棒的惯性里程计对各种载体至关重要。基于学习的方法通过捕获局部运动先验来减少积分漂移，但这些方法通常局限于特定载体，限制了跨异构平台的泛化。我们提出MosaicIMU，一种载体条件的混合专家（MoE）预训练与自适应框架，用于可泛化的神经惯性里程计。MosaicIMU使用基于原型的路由器组合载体特定的专家特征，解码局部速度和不确定性约束，并将其与历史感知EKF集成。对于未见领域自适应，它冻结预训练基础模型并学习新的轻量专家残差分支。对于边缘部署，它进一步重用路由器来选择信息丰富的在线样本以进行高效的增量更新。实验表明，MosaicIMU持续优于基于学习的基线，平均ATE和RTE-10s分别降低40%和34%。这些结果凸显了MosaicIMU为可泛化和自适应的神经惯性里程计提供了一种可扩展的预训练到部署范式。

英文摘要

Robust inertial odometry is essential for various carriers when external sensing is unreliable. Learning-based methods reduce integration drift by capturing local motion priors, but these methods often remain tied to a particular carrier, limiting generalization across heterogeneous platforms. We present MosaicIMU, a carrier-conditioned Mixture-of-Experts (MoE) pretraining-and-adaptation framework for generalizable neural inertial odometry. MosaicIMU uses a prototype-based router to compose carrier-specific expert features, decodes local velocity and uncertainty constraints, and integrates them with a history-aware EKF. For unseen domain adaptation, it freezes the pretrained base model and learns a new lightweight expert residual branch. For edge-deployment, it further reuses the router to select informative online samples for efficient incremental updates. Experiments show that MosaicIMU consistently outperforms learning-based baselines, reducing average ATE and RTE-10s by 40% and 34%, respectively. These results highlight that MosaicIMU provides a scalable pretraining-to-deployment paradigm for generalizable and adaptive neural inertial odometry.
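
A prototype-based router of the kind named in this abstract might look roughly like the sketch below. The details are assumptions, since the abstract gives none: similarity is taken to be cosine similarity between an input feature and one prototype per carrier, the gate is a temperature-scaled softmax, and expert outputs are mixed by a weighted sum. The function names and the temperature value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def route(feature, prototypes, temperature=0.1):
    """Softmax gate over the similarity between the feature and each carrier prototype."""
    sims = [cosine(feature, p) for p in prototypes]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def compose(expert_outputs, gates):
    """Weighted sum of per-expert feature vectors using the gate values."""
    dim = len(expert_outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, expert_outputs)) for i in range(dim)]
```

A feature that lies close to one carrier's prototype receives a gate near 1 for that expert, while an unfamiliar carrier yields a softer blend of several experts.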


2606.08284 2026-06-09 cs.CV cs.RO 交叉投稿

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

G2G：利用组内几何进行组间姿态估计

Yufei Wei, Shuhao Ye, Chenxiao Hu, Yiyuan Pan, Dongyu Feng, Rong Xiong, Yue Wang, Yanmei Jiao

发表机构 * State Key Laboratory of Industrial Control and Technology, Zhejiang University（浙江大学工业控制技术国家重点实验室）； Zhejiang Humanoid Robot Innovation Center Co., Ltd.（浙江人形机器人创新中心有限公司）； School of Information Science and Engineering, Hangzhou Normal University（杭州师范大学信息科学与工程学院）

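AI summary: Proposes G2G, which freezes a multi-view foundation model and adds three lightweight trainable modules (a perceiver resampler, a cross-group bridge, and a multi-frame pose head), estimating inter-group 6-DoF pose with relative-pose supervision only and reaching state-of-the-art accuracy on four datasets.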

AI中文摘要

恢复两个图像组之间的相对6-DoF姿态是跨序列重定位和多相机刚性里程计的基础。每个组通过视觉里程计或刚性校准携带已知的组内几何，预训练的多视图骨干网络已经将这种几何融合到视觉特征中。然而，当前模型将所有视图视为非结构化集合，缺少跨组推理的关键环节。我们提出G2G，该方法保持基础模型完全冻结，并添加三个轻量可训练模块来桥接两个组：感知器重采样器、带有合并自注意力的跨组桥接模块以及多帧姿态头。可训练部分总计约32M参数，不到完整模型的6%，且仅由相对姿态监督。在四个数据集（涵盖室内外仿真、真实世界跨季节采集以及零样本仿真到真实迁移）上，G2G在两个任务上都达到了最先进的精度，而每个基线都使用其完整的原始监督进行重新训练。代码可在https://github.com/WeiYuFei0217/G2G获取。

英文摘要

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce G2G, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.


2504.19399 2026-06-09 cs.RO 版本更新

Follow Everything: A Leader-Following and Obstacle Avoidance Framework with Goal-Aware Adaptation

跟随一切：具有目标感知适应的领导者跟随与避障框架

Qianyi Zhang, Shijian Ma, Boyi Liu, Jianhao Jiao, Dimitrios Kanoulas

发表机构 * Institute of Robotics and Automatic Information System, Nankai University, China（南开大学机器人与自动化信息系统研究所）； Centre for Data Science, University of Macau, China（澳门大学数据科学中心）； Electrical and Computer Engineering Department, Hong Kong University of Science and Technology, China（香港科学与技术大学电子与计算机工程系）； Department of Computer Science, University College London, UK（伦敦大学学院计算机科学系）； Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hong Kong, China（香港理工大学航空与航空工程系）

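AI summary: Proposes a unified framework that replaces detection models with a segmentation model so that leaders of arbitrary form can be followed, and designs a goal-aware adaptation mechanism and a graph-based planner for robust following and obstacle avoidance when the leader temporarily leaves the field of view.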

AI中文摘要

鲁棒且灵活的领导者跟随是机器人融入人类社会的一项关键能力。现有方法难以泛化到任意形态的领导者，并且在领导者暂时离开机器人视野时常常失败，本文引入了一个统一框架来应对这两个挑战。首先，用分割模型替代传统检测模型，使领导者可以是任何物体。为了增强识别鲁棒性，实现了一个距离帧缓冲区，在多个距离存储领导者嵌入，以考虑领导者跟随任务的独特特征。其次，设计了一种目标感知适应机制，根据领导者的可见性和运动来控制机器人规划状态，并辅以基于图的规划器，为每个状态生成候选轨迹，确保高效跟随和避障。在室内外环境中，使用腿式机器人跟随者与各种领导者（人、地面机器人、无人机、腿式机器人、停止标志）进行的仿真和真实世界实验显示，在跟随成功率、减少视觉丢失时长、降低碰撞率和减小领导者-跟随者距离方面取得了竞争性改进。

英文摘要

Robust and flexible leader-following is a critical capability for robots to integrate into human society. While existing methods struggle to generalize to leaders of arbitrary form and often fail when the leader temporarily leaves the robot's field of view, this work introduces a unified framework addressing both challenges. First, traditional detection models are replaced with a segmentation model, allowing the leader to be anything. To enhance recognition robustness, a distance frame buffer is implemented that stores leader embeddings at multiple distances, accounting for the unique characteristics of leader-following tasks. Second, a goal-aware adaptation mechanism is designed to govern robot planning states based on the leader's visibility and motion, complemented by a graph-based planner that generates candidate trajectories for each state, ensuring efficient following with obstacle avoidance. Simulations and real-world experiments with a legged robot follower and various leaders (human, ground robot, UAV, legged robot, stop sign) in both indoor and outdoor environments show competitive improvements in follow success rate, reduced visual loss duration, lower collision rate, and decreased leader-follower distance.


2602.22243 2026-06-09 cs.RO 版本更新

SODA-CitrON: Static Object Data Association by Clustering Multi-Modal Sensor Detections Online

SODA-CitrON：通过在线聚类多模态传感器检测实现静态物体数据关联

Jan Nausner, Kilian Wohlleben, Michael Hubner

发表机构 * Jan Nausner, Kilian Wohlleben, Michael Hubner

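AI summary: Proposes SODA-CitrON, which associates static-object data by clustering multi-modal sensor detections online while estimating positions and maintaining persistent tracks, outperforming existing methods in F1 score, position RMSE, MOTP, and MOTA.

Comments: 8 pages, 5 figures; © 2026 IEEE. Accepted for the 2026 International Conference on Information Fusion (FUSION 2026)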

AI中文摘要

从异构传感器检测中在线融合和跟踪静态物体是机器人、自主系统和环境建图中的基本问题。尽管经典数据关联方法如JPDA适合动态目标，但在间歇性和异质不确定性的静态物体观测中效果较差，因为运动模型对杂波的判别能力有限。本文提出了一种新颖的静态物体数据关联方法SODA-CitrON，通过在线聚类多模态传感器检测，同时估计位置并维持未知数量物体的持久跟踪。所提出的无监督机器学习方法完全在线运行，处理时间上不相关的多传感器测量。此外，它在传感器检测数量上具有最坏情况下的对数线性复杂度，同时提供完整的输出可解释性。我们在不同的蒙特卡洛模拟场景中评估了该方法，并将其与基于POM的过滤、DBSTREAM聚类和JPDA等现有方法进行比较。结果表明，在研究的静态物体建图场景中，SODA-CitrON在F1分数、位置RMSE、MOTP和MOTA指标上始终优于比较方法。

英文摘要

The online fusion and tracking of static objects from heterogeneous sensor detections is a fundamental problem in robotics, autonomous systems, and environmental mapping. Although classical data association approaches such as JPDA are well suited for dynamic targets, they are less effective for static objects observed intermittently and with heterogeneous uncertainties, where motion models provide minimal discriminative power with respect to clutter. In this paper, we propose a novel method for static object data association by clustering multi-modal sensor detections online (SODA-CitrON), while simultaneously estimating positions and maintaining persistent tracks for an unknown number of objects. The proposed unsupervised machine learning approach operates in a fully online manner and handles temporally uncorrelated and multi-sensor measurements. Additionally, it has a worst-case loglinear complexity in the number of sensor detections while providing full output explainability. We evaluate the proposed approach in different Monte Carlo simulation scenarios and compare it against state-of-the-art methods, including POM-based filtering, DBSTREAM clustering, and JPDA. The results demonstrate that SODA-CitrON consistently outperforms the compared methods in terms of F1 score, position RMSE, MOTP, and MOTA in the static object mapping scenarios studied.


2606.02519 2026-06-09 cs.RO 版本更新

Agentic Neuro-Symbolic Planning and Commissioning for Human-in-the-Loop Industrial Robots with a Digital Twin

Zhihao Liu, Victor Nan Fernandez-Ayala, Tianyu Wang, Qiang Qin, Xi Vincent Wang, Dimos V. Dimarogonas, Lihui Wang

发表机构 * Royal Institute of Technology (KTH)（皇家理工学院（KTH））

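AI summary: Proposes a neuro-symbolic framework that combines LLM language understanding with deterministic verification and execution, using an SDI architecture and a two-tier recovery mechanism; plans are verified in a digital twin before execution, significantly improving the task success rate.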

AI中文摘要

灵活的机器人自动化需要系统能够解释操作员意图、验证物理可行性，并在规划和执行阶段从执行失败中恢复。本文提出了一种面向人机协同工业机器人的智能神经符号框架，其中LLM用于需要语言理解或上下文推理的任务，而所有验证、排序和执行保持确定性。该框架将软件工程中的规划器-生成器-评估器（PGE）模式改编为面向工业机器人的指定器-设计器-检查器（SDI）架构，并结合基于LangGraph的动态路由进行故障恢复。两级恢复机制通过上下文感知编排处理结构级重新规划，并通过确定性恢复技能处理执行级几何故障。Unity3D数字孪生支持在物理执行前进行人工检查、修改和重新验证。在多个难度级别的自然语言命令上对十个基线进行评估，所提方法实现了最高的任务成功率。消融结果证实，结构化命令扩展、符号验证、选择性LLM路由和恢复技能各自都是必要的。

英文摘要

Flexible robotic automation requires systems that interpret operator intent, verify physical feasibility, and recover from execution failures across both the planning and execution stages. This paper proposes an agentic neuro-symbolic framework for human-in-the-loop industrial robotics, in which LLMs are used for tasks that require language understanding or contextual reasoning, while all verification, sequencing, and execution remain deterministic. The framework adapts the Planner-Generator-Evaluator (PGE) harness pattern from software engineering into a Specifier-Designer-Inspector (SDI) architecture for industrial robotics, combined with LangGraph-based dynamic routing for failure recovery. A two-tier recovery mechanism addresses structure-level replanning through context-aware orchestration and execution-level geometric failures through deterministic recovery skills. A Unity3D digital twin supports human inspection, modification, and re-verification prior to physical execution. Evaluated on natural-language commands across multiple difficulty levels against ten baselines, the proposed method achieves the highest task success. Ablation results confirm that structured command expansion, symbolic verification, selective LLM routing, and recovery skills are each individually necessary.


2606.08341 2026-06-09 cs.RO 新提交

Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation

面向人机装配遥操作的不确定性感知意图预测

Fnu Heman, Yixuan Wang, Kolin Xu, Conner Wallace, John Dang, Akhil Joshi, Jun Sheng, Pinhas Ben-Tzvi, Mingyu Cai

发表机构 * University of California, Riverside（加州大学河滨分校）； University of Miami（迈阿密大学）

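AI summary: Proposes an uncertainty-aware intention prediction framework that combines hierarchical transfer learning, conformal prediction, and VLM-guided correction; pretraining on human demonstrations improves action segmentation with only a small amount of robot data.

Comments: 7 pages, 6 figures. Preprint version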

AI中文摘要

在人机协作的辅助遥操作中，准确的意图预测对于在长时程操作和装配任务中实现及时可靠的机器人辅助至关重要。这些系统需要持续理解用户行为，以实时识别动作、预测意图并检测错误。然而，机器人遥操作演示成本高且受硬件限制，而人类演示更易收集且提供丰富的时序结构。为解决这一挑战，我们提出了一种不确定性感知的人到机器人意图预测框架，该框架结合了：(1) 层次迁移学习，其中MS-TCN++在人类手部演示上预训练，并在有限的机器人遥操作数据上微调，以捕捉低级动作和高级任务意图；(2) 共形预测模块，提供具有统计覆盖保证的帧级预测集，用于可靠的不确定性量化和早期意图估计；(3) VLM引导的片段校正，利用视觉和时序上下文选择性审查低置信度或时序不确定的片段。该框架支持辅助遥操作中的动作识别、时序分割、意图预测和错误检测。在包含22个动作类别的机器人装配演示实验表明，仅使用16个机器人演示，人到机器人的微调将机器人测试集的Edit分数从70.50提升至80.70。Edit安全的VLM校正进一步将帧准确率从45.21%提升至46.42%，并提高了F1@25和F1@50，同时保持了Edit分数。这些结果表明，人类演示为鲁棒、不确定性感知的机器人动作分割提供了可扩展的预训练数据。代码和数据见项目网站。

英文摘要

In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enabling timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. These systems require continuous understanding of user behavior to recognize actions, anticipate intentions, and detect mistakes in real time. However, robot teleoperation demonstrations are costly and hardware-limited, whereas human demonstrations are easier to collect and provide rich temporal structure. To address this challenge, we propose an uncertainty-aware human-to-robot intention prediction framework that combines: (1) hierarchical transfer learning, where MS-TCN++ is pretrained on human hand demonstrations and fine-tuned on limited robot teleoperation data to capture low-level actions and high-level task intentions; (2) a conformal prediction module that provides frame-level prediction sets with statistical coverage guarantees for reliable uncertainty quantification and early intention estimation; and (3) VLM-guided segment correction, which selectively reviews low-confidence or temporally uncertain segments using visual and temporal context. The framework supports action recognition, temporal segmentation, intention anticipation, and mistake detection for assisted teleoperation. Experiments on robot assembly demonstrations with 22 action classes show that human-to-robot fine-tuning improves the robot test-set Edit score from 70.50 to 80.70 using only 16 robot demonstrations. Edit-safe VLM correction further improves frame accuracy from 45.21% to 46.42% and increases F1@25 and F1@50 while preserving the Edit score. These results show that human demonstrations provide scalable pretraining data for robust, uncertainty-aware robot action segmentation. Code and data: project website.
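
The conformal prediction module described in this abstract can be illustrated with standard split conformal prediction for classification. The sketch assumes the usual nonconformity score (one minus the probability assigned to the true class) and a held-out calibration set; how the paper adapts this to frame-level action segmentation is not stated in the abstract, so the function names and the score choice here are assumptions.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score quantile giving roughly (1 - alpha) coverage on exchangeable data."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: fall back to including every class.
        return 1.0
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """All classes whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```

A confident frame yields a single-class set, while an ambiguous frame yields a larger set. Set size can then serve as a signal for deciding which segments to pass to a slower reviewer such as a VLM.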


2606.08458 2026-06-09 cs.RO 新提交

Personalized and Robust Proactive Robot Assistance with Uncertainty-Guided LLM Reasoning

个性化且鲁棒的主动机器人辅助：基于不确定性引导的大语言模型推理

Alvaro Gonzalez, M. H. Hasan Shovo, Ali Ayub

发表机构 * Concordia University（康考迪亚大学）

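AI summary: Proposes the GLOBE framework, which combines n-gram Markov models with uncertainty-guided LLM reasoning for efficient and robust proactive robot assistance in household environments, and evaluates its performance and efficiency on the HOMER-Noise dataset.

Comments: Accepted to the 2026 IEEE 35th International Conference on Robot and Human Interactive Communication (RO-MAN)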

AI中文摘要

在家庭环境中，主动机器人辅助需要在动态和嘈杂条件下准确预测人类活动和物体使用。现有方法通常依赖复杂的时空模型，这些模型计算成本高且对环境变化敏感。本文提出GLOBE，一个轻量级框架，结合n-gram马尔可夫模型捕捉时间行为模式与不确定性引导的大语言模型推理。该框架高效执行序列预测，仅在模型置信度低时选择性调用大语言模型推理。为评估现实条件下的性能，我们引入HOMER-Noise，即HOMER+数据集的噪声扩展，模拟由人类、宠物和幼儿引起的物体移动等结构化干扰。实验结果表明，GLOBE在干净和嘈杂环境下均达到与最先进方法竞争的性能，同时提高了鲁棒性和计算效率。该框架进一步通过与Stretch 3移动操作器的概念验证集成得到验证，展示了其在真实人机交互场景中的潜在应用。

英文摘要

Proactive robot assistance in household environments requires accurate prediction of human activities and object usage under dynamic and noisy conditions. Existing approaches often rely on complex spatio-temporal models, which can be computationally expensive and sensitive to environmental variability. In this paper, we propose GLOBE, a lightweight framework that combines n-gram Markov models for capturing temporal behavioral patterns with uncertainty-guided large language model (LLM) reasoning. The framework performs sequential prediction efficiently while selectively invoking LLM reasoning only when the model confidence is low. To evaluate performance under realistic conditions, we introduce HOMER-Noise, a noisy extension of the HOMER+ dataset that simulates structured disturbances such as object movements caused by humans, pets, and toddlers. Experimental results show that GLOBE achieves competitive performance with state-of-the-art methods while improving robustness and computational efficiency across both clean and noisy settings. The framework is further validated through a proof-of-concept integration with a Stretch 3 mobile manipulator, demonstrating its potential application in real-world human-robot interaction scenarios.
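
The selective-invocation idea in this abstract can be sketched with a bigram Markov predictor that defers to a fallback only when its next-step distribution is uncertain. Several details are assumptions: uncertainty is measured with Shannon entropy, the threshold value is arbitrary, and `fallback` is a placeholder callable standing in for the LLM call. GLOBE's actual n-gram order and confidence measure are not given in the abstract.

```python
import math
from collections import Counter, defaultdict

class BigramPredictor:
    """First-order Markov model over activity labels, estimated by counting."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1

    def distribution(self, prev):
        counter = self.counts.get(prev)
        if not counter:
            return {}
        total = sum(counter.values())
        return {label: c / total for label, c in counter.items()}

def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def predict(model, prev, fallback, max_entropy_bits=0.9):
    """Use the Markov model when it is confident; otherwise call the fallback."""
    dist = model.distribution(prev)
    if not dist or entropy_bits(dist) > max_entropy_bits:
        return fallback(prev, dist), "fallback"
    return max(dist, key=dist.get), "markov"
```

The design keeps the expensive reasoning step off the common path: routine transitions are answered from counts, and only unseen or high-entropy contexts reach the fallback.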


2606.08741 2026-06-09 cs.RO 新提交

Safe, Fluent and Acceptable Motion Generation and Execution for Human-Robot Interaction in Manufacturing Environments

制造环境中人机交互的安全、流畅与可接受运动生成与执行

Thibaut Lopez, Olivier Aycard, Pierre-Brice Wieber, Mohamed Boua, Christine Jeoffrion

发表机构 * GIPSA Lab（GIPSA实验室）； Grenoble Institute of Technology（格勒诺布尔理工学院）； Inria（法国国家信息与自动化研究所）； LIP/PC2S（LIP/PC2S实验室）； Univ. Grenoble Alpes（格勒诺布尔阿尔卑斯大学）； Univ. Savoie Mont Blanc（萨瓦大学）

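AI summary: For shared human-robot environments, proposes motion generation strategies that combine safety with social awareness; an MPC framework generates four social behaviors, and a user study shows that robot behavior significantly affects perceived social acceptability.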

AI中文摘要

在人类环境中运行的机器人不仅要确保物理安全，还要表现出人类伙伴可理解、流畅和可接受的行为。本文研究了结合安全保障与交互质量考虑（如运动平滑性和人类舒适度）的运动生成策略。虽然能够确保共享人机环境中安全的机器人设计已经实现了更紧密、更高级的交互形式，但这些新的基于近距离的任务需要超越纯技术考虑。特别是，机器人行为还必须从心理认知和社会角度加以解决。在此背景下，我们论证了将社交感知运动控制集成到机器人系统中的相关性。首先，我们识别了影响人类感知和操作员体验的运动参数。然后，我们实现了一个模型预测控制（MPC）框架，该框架生成四种不同的社交知情机器人行为。最后，我们进行了一项用户研究，以评估和验证这些行为，并评估它们对非专家参与者的社会影响。结果表明，机器人行为的变化显著影响系统的感知社会可接受性。这些发现强调了将以人为本的考虑纳入共享环境中机器人运动生成策略的重要性。

英文摘要

Robots operating in human environments must not only ensure physical safety but also exhibit behaviors that are understandable, fluent, and acceptable to human partners. This paper investigates motion generation strategies that combine safety guarantees with interaction quality considerations, such as motion smoothness and human comfort. While the design of robots capable of ensuring safety in shared human-robot environments has enabled closer and more advanced forms of interaction, these new proximity-based tasks require moving beyond purely technical considerations. In particular, robot behavior must also be addressed from psycho-cognitive and social perspectives. In this context, we argue for the relevance of integrating social-aware motion control into robotic systems. First, we identify the motion parameters that influence human perception and operator experience. Then, we implement a Model Predictive Control (MPC) framework that generates four distinct socially-informed robot behaviors. Finally, we conduct a user study to evaluate and validate these behaviors and assess their social impact on non-expert participants. The results demonstrate that variations in robot behavior significantly affect the perceived social acceptability of the system. These findings highlight the importance of incorporating human-centered considerations into motion generation strategies for robots operating in shared environments.


2606.09255 2026-06-09 cs.RO 新提交

RPO-PDT: Demonstrating Role-Play-Based Knowledge Adaptation for Student Support Dialogue (Demonstration System)

RPO-PDT：展示基于角色扮演的知识适应用于学生支持对话（演示系统）

Filip Janik, Ewa Olton, Robert Smales, Harris Spratt, Shea Tait, Md Zia Ullah, Yanchao Yu

发表机构 * Edinburgh Napier University（爱丁堡龙比亚大学）

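AI summary: Presents the RPO-PDT system, which uses retrieval augmentation and a role-play loop to deliver personalized student-support dialogue in higher education, grounded in structured knowledge sources, while ensuring safety and adaptability.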

Comments 5 pages, 2 figures

2606.07551 2026-06-09 cs.CY cs.HC cs.RO 交叉投稿

Astro, I'm Home! Investigating Factors that Influence the Acceptance of Home Robots Using Supervised Machine Learning

Astro，我回家了！利用监督机器学习研究影响家庭机器人接受度的因素

Katrin Fischer, Essence Wilson, Steffie Kim, Dmitri Williams

发表机构 * University of Southern California（南加州大学）

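AI summary: This study applies regularization techniques (such as Lasso and Ridge regression) to analyze the factors that influence acceptance of social robots. It finds that performance expectancy, social influence, and hedonic motivation are the strongest predictors of intention to use, and identifies new variables such as usability, trust, and competence.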

Comments Preprint submitted to the 18th International Conference on Social Robotics (ICSR 2026)
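
The summary above names Lasso and Ridge regression as the regularization techniques. As a rough illustration of why Lasso can drop weak predictors while Ridge only shrinks them, here is a minimal sketch for a single coefficient under an orthonormal design. The predictor names and coefficient values are hypothetical and are not taken from the paper.

```python
def ridge_shrink(beta_ols: float, lam: float) -> float:
    """Ridge solution for one coefficient under an orthonormal design:
    minimizes 0.5 * (b - beta_ols)**2 + 0.5 * lam * b**2."""
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols: float, lam: float) -> float:
    """Lasso solution (soft-thresholding) for one coefficient under an
    orthonormal design: minimizes 0.5 * (b - beta_ols)**2 + lam * abs(b)."""
    magnitude = max(abs(beta_ols) - lam, 0.0)
    return magnitude if beta_ols >= 0 else -magnitude


# Hypothetical unpenalized coefficients for three survey predictors.
ols = {"performance_expectancy": 0.9, "social_influence": 0.4, "noise_item": 0.05}
lam = 0.1

ridge = {name: ridge_shrink(value, lam) for name, value in ols.items()}
lasso = {name: lasso_shrink(value, lam) for name, value in ols.items()}

for name in ols:
    print(f"{name}: ridge={ridge[name]:.3f} lasso={lasso[name]:.3f}")
```

In this toy setting the weak `noise_item` coefficient is set exactly to zero by the Lasso rule, whereas Ridge keeps it small but nonzero.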

2606.09390 2026-06-09 cs.CV cs.AI cs.RO 交叉投稿

Real-time body pose non-verbal communication with a consistency-based reliability measure

基于一致性可靠性度量的实时身体姿态非语言通信

Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu

发表机构 * National University of Science and Technology "Politehnica" Bucharest（布加勒斯特理工大学）； Simion Stoilow Institute of Mathematics of the Romanian Academy（罗马尼亚科学院西蒙·斯托伊洛数学研究所）； NORCE Norwegian Research Centre AS（挪威研究中心）

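AI summary: Studies recognition of communicative intent from 2D body pose alone, proposes autoregressive self-consistency as an unsupervised reliability signal, and achieves real-time performance on an embedded GPU.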


AI中文摘要

身体运动在远距离或无法捕捉面部及语音的条件下传达意图。我们研究仅从2D身体姿态识别通信意图。我们认为身体运动是可靠的信号，特别是在需要实时低成本设备上的人-机器人通信场景中，如救援任务。然而，现有资源并未孤立这一信号。情感语料库结合了身体、面部、语音和文本，而骨架动作识别基准标记的是执行的动作而非传达的信息。我们发布了一个包含十种通信意图的全身体姿态真实帧数据集，并将其与其他真实（IPC）和合成（MotionLCM, VEO3.1, Kimodo）数据集进行比较，这些数据集覆盖了不同难度。我们针对能在机器人有限板载硬件上运行的系统。我们基准测试了多种模型，从骨架图分类器到联合运动预测网络，并在嵌入式GPU（NVIDIA Orin Nano）上报告了性能指标和帧率，因为在我们的场景中速度和准确性同样重要。最后，我们展示了模型自身的自回归自一致性可作为无监督可靠性信号。我们给出了一个简短证明，界定了自一致性预测正确的概率，表明该概率随一致步数增加而增长，并识别了自信预测仍可能错误的条件，与行业标准指标进行了基准测试。

英文摘要

Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication in long-distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) ones that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.
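
The abstract states that the probability of a self-consistent prediction being correct grows with the number of consistent steps, and that a confident prediction can still be false under some conditions. The paper's actual bound is not given here, so the following is only a toy Bayesian illustration under assumptions of my own: steps are independent, the model is either "right" or "wrong" on a clip, and it repeats its label with a fixed per-step probability in each case. All parameter names and values are hypothetical.

```python
def posterior_correct(k: int, prior: float, agree_if_right: float,
                      agree_if_wrong: float) -> float:
    """P(prediction is correct | k consecutive autoregressive steps agree),
    under a toy two-hypothesis model with independent steps."""
    right = prior * agree_if_right ** k
    wrong = (1.0 - prior) * agree_if_wrong ** k
    return right / (right + wrong)


# Case 1: wrong predictions are less stable than right ones,
# so each additional agreeing step is evidence of correctness.
healthy = [posterior_correct(k, 0.6, 0.9, 0.5) for k in range(1, 6)]

# Case 2: a systematic error that is as stable as a correct prediction,
# so agreement carries no information and confidence is misleading.
biased = [posterior_correct(k, 0.6, 0.9, 0.9) for k in range(1, 6)]

print([round(p, 3) for p in healthy])
print([round(p, 3) for p in biased])
```

In the first case the posterior rises with every consistent step. In the second it stays at the prior, which is one simple way a self-consistent prediction can remain unreliable.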


2511.17855 2026-06-09 cs.AI cs.RO 版本更新

QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents

QuickLAP: 为半自主代理快速语言-动作偏好学习

Jordan Abi Nader, David Lee, Nathaniel Dennler, Andreea Bobu

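AI summary: Proposes QuickLAP, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time. It uses large language models to extract reward-feature attention masks and preference shifts, reduces reward learning error by over 70% in a semi-autonomous driving simulator, and is validated for understandability and collaboration in a user study.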


AI中文摘要

机器人必须从人们的行为和语言中学习，但单一模态往往不完整：物理修正具有语境但意图模糊，而语言表达高层目标但缺乏物理基础。我们引入QuickLAP：快速语言-动作偏好学习，一种贝叶斯框架，融合物理和语言反馈以实时推断奖励函数。我们的关键见解是将语言视为用户潜在偏好的概率观测，明确哪些奖励特征重要以及如何解释物理修正。QuickLAP利用大规模语言模型（LLMs）从自由形式陈述中提取奖励特征注意力掩码和偏好偏移，并与物理反馈结合在一个闭式更新规则中。这使得能够快速、实时且鲁棒地学习奖励，处理模糊反馈。在半自主驾驶模拟器中，QuickLAP相比仅物理和启发式多模态基线将奖励学习误差降低超过70%。15名参与者的用户研究进一步验证了我们的方法：参与者发现QuickLAP更易懂和协作，并且更喜欢其学习行为。代码可在https://github.com/MIT-CLEAR-Lab/QuickLAP获取。

英文摘要

Robots must learn from both what people do and what they say, but either modality alone is often incomplete: physical corrections are grounded but ambiguous in intent, while language expresses high-level goals but lacks physical grounding. We introduce QuickLAP: Quick Language-Action Preference learning, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time. Our key insight is to treat language as a probabilistic observation over the user's latent preferences, clarifying which reward features matter and how physical corrections should be interpreted. QuickLAP uses Large Language Models (LLMs) to extract reward feature attention masks and preference shifts from free-form utterances, which it integrates with physical feedback in a closed-form update rule. This enables fast, real-time, and robust reward learning that handles ambiguous feedback. In a semi-autonomous driving simulator, QuickLAP reduces reward learning error by over 70% compared to physical-only and heuristic multimodal baselines. A 15-participant user study further validates our approach: participants found QuickLAP significantly more understandable and collaborative, and preferred its learned behavior over baselines. Code is available at https://github.com/MIT-CLEAR-Lab/QuickLAP.
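
The abstract describes a closed-form update that combines language-derived attention masks with physical corrections. The exact QuickLAP rule is not given here, so the sketch below is only a standard per-feature conjugate Gaussian update in which a mask value scales how much a physical correction informs each reward weight. The function name, the mask semantics, and all numbers are assumptions for illustration.

```python
def fuse_feedback(prior_mean, prior_var, observation, obs_var, mask):
    """Per-feature Gaussian update of reward weights.

    mask[i] in [0, 1] scales the precision of the physical observation for
    feature i (hypothetical stand-in for a language-derived attention mask).
    A mask of 0 leaves the prior for that feature unchanged.
    """
    post_mean, post_var = [], []
    for mu, var, obs, m in zip(prior_mean, prior_var, observation, mask):
        precision = 1.0 / var + m / obs_var
        post_var.append(1.0 / precision)
        post_mean.append((mu / var + m * obs / obs_var) / precision)
    return post_mean, post_var


# Two reward features; language says only the first one matters.
mean, var = fuse_feedback(
    prior_mean=[0.0, 0.0],
    prior_var=[1.0, 1.0],
    observation=[2.0, 2.0],   # weight values suggested by a physical correction
    obs_var=1.0,
    mask=[1.0, 0.0],
)
print(mean, var)
```

Here the masked-in feature moves halfway toward the observation and becomes more certain, while the masked-out feature keeps its prior mean and variance.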


2606.08107 2026-06-09 cs.RO cs.AI 新提交

Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

Ego-Pi: 面向自我中心人类与机器人数据的VLA微调

Ji Woong Kim, Ke Wang, Zipeng Fu, Sirui Chen, Cong Zhao, Jeff Lai, Chelsea Finn

发表机构 * Stanford University（斯坦福大学）； Meta

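AI summary: To address the scarcity of robot data, the paper fine-tunes the π₀.₅ model on egocentric human data so that robots learn new task semantics and compose existing skills without corresponding robot data.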

2606.08288 2026-06-09 cs.RO 新提交

MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model

MotionVLA: 将几何运动注入视觉-语言-动作模型

Shanglin Yuan, Weiheng Zhao, Xianda Guo, Wei Sui, Li Yu, Wenyu Liu, Xinggang Wang

发表机构 * Huazhong University of Science and Technology（华中科技大学）； D-Robotics（地瓜机器人）； Wuhan University（武汉大学）

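AI summary: Proposes MotionVLA, a motion-history interface that converts a past video window into compact, continuous trajectory-field tokens, addressing geometric drift and fragmented temporal cues in long-horizon manipulation and improving action smoothness and execution efficiency.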

Comments 17 pages, 8 figures


AI中文摘要

视觉-语言-动作（VLA）模型越来越多地基于历史、深度或4D特征来调节机器人策略，以解决长程操作中的歧义。然而，更多的时空证据并不一定更好：当注入的证据不是运动一致的时，它可能引入几何漂移、碎片化的时间线索和不稳定的动作生成。这提出了一个简单的问题：VLA应该记住过去的帧，还是记住连接它们的运动？我们引入了MotionVLA，一个运动历史接口，它将短时间仅包含过去的视频窗口转换为紧凑的、时间连续的轨迹场令牌。MotionVLA不是将历史视为一组稀疏的独立提升帧，而是将最近的观测表示为物理一致的运动证据。当前的视觉令牌查询这个历史以检索任务相关的运动信息，然后在轨迹基础的监督下重新耦合到VLA流中。在模拟基准和初步真实机器人部署上的实验表明，MotionVLA改善了长程操作，同时产生了更平滑、更直接的执行。这些结果表明，有效的VLA记忆不仅仅是提供更多的4D上下文，而是暴露可用于控制的运动一致证据。

英文摘要

Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to resolve ambiguity in long-horizon manipulation. However, more spatiotemporal evidence is not necessarily better: when the injected evidence is not motion-consistent, it can introduce geometric drift, fragmented temporal cues, and unstable action generation. This raises a simple question: should a VLA remember past frames, or remember the motion that connects them? We introduce MotionVLA, a motion-history interface that converts a short past-only video window into compact, time-continuous trajectory-field tokens. Instead of treating history as a sparse set of independently lifted frames, MotionVLA represents recent observations as physically coherent motion evidence. Current visual tokens query this history to retrieve task-relevant motion information, which is then recoupled into the VLA stream under trajectory-grounded supervision. Experiments across simulation benchmarks and preliminary real-robot rollouts show that MotionVLA improves long-horizon manipulation while producing smoother and more direct executions. These results suggest that effective VLA memory is not just about providing more 4D context, but about exposing motion-consistent evidence that is usable for control.


2606.08495 2026-06-09 cs.RO cs.CV 新提交

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

EgoPriMo：面向交互式人形控制的自我中心运动生成

Haoyang Ge, Peng Ren, Yukun Shi, Cong Huang, Kun Li, Kai Chen

发表机构 * Tianjin University（天津大学）； Zhongguancun Academy（中关村学院）； Beihang University（北京航空航天大学）； Zhongguancun Institute of Artificial Intelligence（中关村人工智能研究院）； DeepCybo

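AI summary: Proposes the EgoPriMo framework, which learns whole-body motion priors from egocentric human demonstrations, uses a triple-stream DiT to jointly model body dynamics, visual context, and text, supports reconstruction, generation, and forecasting, and can be executed on a Unitree humanoid robot.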


AI中文摘要

人形机器人需要适应场景上下文、任务要求和用户意图的全身运动。运动跟踪可以再现指定的轨迹，人形机器人视觉-语言-动作系统提供了语义接口，但两者都不能为广泛的全身行为提供可扩展且交互式的先验。我们提出了EgoPriMo（人形机器人自我中心运动先验），一个统一的框架，从自我中心人类演示中学习此类先验。给定自我中心观察和文本提示，EgoPriMo重建、生成和预测基于SMPL的全身运动。语言被用作高级控制信号，而不是完整的运动规范。EgoPriMo的核心是一个三流DiT，它联合建模身体动态、自我中心视觉上下文和文本；任务条件掩码通过同一个检查点路由不同的任务和缺失模态数据。在Nymeria和EgoExo4D上的实验表明，一个检查点在支持重建和预测的同时，改进了自我中心运动生成，优于UniEgoMotion；生成的SMPL运动也可以由Unitree人形控制器执行。这些结果表明了一条从可扩展的自我中心观察到可泛化和交互式人形运动先验的实用路径。

英文摘要

Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.


2606.08520 2026-06-09 cs.RO 新提交

Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

两座桥梁，一条路径：从VLM到具有具身轨迹耦合数据的可泛化VLA

Linqi Yin, Shiduo Zhang, Shenling Qiu, Chenxin Li, Zhaoyang Fu, Lei Xiao, Xiang Wang, Chenchen Yang, Zhe Xu, Pengfang Qian, Jingjing Gong, Xipeng Qiu, Xuanjing Huang, Yu-Gang Jiang

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）

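AI summary: Proposes embodied trajectory-coupled (ETC) data as an intermediate bridge, together with a three-stage training recipe (Distribution Bridging, Objective Bridging, Retentive Adaptation) that gradually turns a vision-language model (VLM) into a generalizable vision-language-action model (VLA), addressing the two-fold gap between VLMs and VLAs.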


AI中文摘要

视觉语言模型（VLM）是强大的通用推理器，但将其转化为机器人控制策略（VLA）却异常困难。根本原因在于双重鸿沟：VLM在互联网规模的图像上训练，具有语言理解目标，而VLA必须感知机器人场景并预测电机动作。直接在机器人动作数据上微调VLM迫使模型同时跨越两个鸿沟——学习曲线陡峭，预训练期间获得的丰富泛化能力往往会退化而非迁移。我们认为，通过合适的中间数据可以逐步弥合这一鸿沟。我们引入了具身轨迹耦合（ETC）数据——源自用于动作学习的相同机器人场景和轨迹的视觉语言监督。由于ETC数据共享机器人操作的视觉上下文，同时保留熟悉的语言理解目标，它提供了VLM预训练和VLA微调之间的自然垫脚石。基于此，我们设计了一个三阶段训练方案。分布桥接首先将VLM适应于具身视觉语言语义。目标桥接然后逐步将模型转向动作预测，同时保留已获得的表示。保留适应最后将策略专门化到目标部署领域。我们进一步证明，将任务相关的分布外ETC数据与少量动作数据混合，使模型能够泛化到新颖的视觉语言条件，而无需额外的机器人演示。仿真和真实机器人实验证实，这种逐步桥接策略是将VLM泛化能力迁移到鲁棒、可部署的机器人策略的关键。

英文摘要

Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a two-fold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once -- the learning curve is steep and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that this gap can be bridged gradually with the right intermediate data. We introduce embodied trajectory-coupled (ETC) data -- vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.


2606.09215 2026-06-09 cs.RO 新提交

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

MotionWAM：迈向实时人形机器人全身操作的基础世界动作模型

Jia Zheng, Teli Ma, Yudong Fan, Zifan Wang, Shuo Yang, Junwei Liang

发表机构 * Mondo Robotics ； HKUST (GZ)（香港科技大学（广州））； HKUST（香港科技大学）

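AI summary: Proposes MotionWAM, a real-time world action model that uses a unified motion latent and whole-body action tokens to drive autonomous humanoid loco-manipulation from a single camera, exceeding VLA baselines by over 30% in success rate on real-world tasks.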


AI中文摘要

世界动作模型（WAM）将视频动态先验与策略耦合，在桌面操作中表现出令人鼓舞的结果，但高维视频-动作潜变量的迭代去噪使其对于实时人形机器人全身操作来说过于缓慢。主导的分层范式加剧了这一问题，其中高层操作策略仅控制上半身，而低层控制器跟踪粗略的基础命令——将上半身和下半身置于不一致的动作空间中，并将腿部降级为保持平衡的 locomotion。我们提出MotionWAM，一种实时WAM，通过将策略条件设置为视频世界模型的中间去噪特征，从单个自我中心摄像头驱动自主人形机器人全身操作。MotionWAM用统一的运动潜变量取代了上下半身的分割，并预测全身动作令牌，在单个动作空间中联合覆盖 locomotion、躯干运动、高度调节、足部交互和手部操作。一个三阶段学习框架逐步将视频世界模型适应于自我中心视觉动态和目标人形机器人具身。在九个真实世界的Unitree G1任务上，MotionWAM实时运行，在总体成功率上比在同一演示上微调的视觉-语言-动作（VLA）基线高出30%以上，并执行解耦的上下半身策略无法达到的任务驱动足部交互。我们的结果表明，视频预训练的WAM可以从桌面操作提升到协调的、类人的人形机器人全身控制。

英文摘要

World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands -- placing upper and lower body in inconsistent action spaces and reducing the legs to balance-preserving locomotion. We present MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, substantially outperforms Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by over 30% in overall success rate, and executes task-driven foot interaction that decoupled upper-lower policies cannot reach. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.


2606.09258 2026-06-09 cs.RO 新提交

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

回到熟悉的未来：通过预想里程碑选择实现VLA策略的故障恢复

Suyeon Shin, Juwon Kim, Hyeonbin Park, Hyunseo Kim, Hyundo Lee, Hyung-Sin Kim, Byoung-Tak Zhang

发表机构 * Seoul National University（首尔大学）； Yonsei University（延世大学）； Soongsil University（崇实大学）

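AI summary: Proposes the B2FF framework, which pre-generates milestones of familiar future states and selects a recovery goal so that a VLA policy can recover robustly from off-trajectory states without fine-tuning, raising the average success rate from 56.3% to 74.0%.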


AI中文摘要

视觉-语言-动作（VLA）策略在操作过程中可能偏离标称轨迹，即使任务在物理上仍然可行。从这些偏离中恢复具有挑战性，因为它们将策略推入陌生的状态空间，直接重新规划常常会破坏动作序列的稳定性。我们提出“回到熟悉的未来”（B2FF），一种面向预见性VLA的恢复框架，利用未来视觉条件作为恢复接口。在执行前，VLA基于干净的初始观察生成一个由熟悉未来状态组成的里程碑库。在恢复时，一个可恢复性感知的选择器从该库中选择一个恢复里程碑，并将其强制作为固定的视觉目标。这使得VLA能够将偏离轨迹的观察稳健地映射回熟悉的未来。在注入故障的LIBERO数据集上，在受控的恢复时间与注入故障对齐的情况下，B2FF将基线VLA的平均成功率从56.3%提升至74.0%，证明预想里程碑可以在不微调底层动作生成器的情况下指导恢复。

英文摘要

Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tasks remain physically feasible. Recovering from these deviations is challenging, as they push the policy into unfamiliar state spaces where direct re-planning frequently destabilizes action sequences. We propose Back to the Familiar Future (B2FF), a recovery framework for foresight-driven VLAs that leverages future visual conditioning as a recovery interface. Before execution, the VLA generates a milestone bank of familiar future states conditioned on the clean initial observation. At recovery time, a recoverability-aware selector selects a recovery milestone from this bank and enforces it as a fixed visual goal. This enables the VLA to robustly map off-trajectory observations back to a familiar future. On failure-injected LIBERO, under controlled recovery timing aligned with the injected failure, B2FF increases the average success rate of a baseline VLA from 56.3% to 74.0%, demonstrating that pre-imagined milestones can guide recovery without fine-tuning the low-level action generator.


2606.09286 2026-06-09 cs.RO 新提交

VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands

VAIC: 基于解耦命令的视觉引导人形机器人敏捷物体交互控制

Dongting Li, Qianyang Wu, Xingyu Chen, Liang Li, Yuhang Lin, Sikai Wu, Guoyao Zhang, Mingliang Zhou, Diyun Xiang, Qiang Zhang, Renjing Xu, Jianzhu Ma

发表机构 * Tsinghua University（清华大学）； HKUST(Guangzhou)（香港科技大学（广州））； Xiaomi Robotics Lab（小米机器人实验室）

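AI summary: Proposes the VAIC framework, which uses decoupled commands and a two-stage distillation paradigm to achieve agile humanoid object interaction from onboard depth and historical proprioception alone, outperforming baselines on dynamic tasks such as box carrying, cart pushing, and skateboarding.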

Comments Webpage: https://vaic-humanoid.github.io/


AI中文摘要

人形机器人在现实辅助中具有巨大潜力，但在非结构化环境中与物体的敏捷交互需要紧密耦合的全身协调。尽管近期取得了进展，当前控制器仍面临关键的部署差距：它们严重依赖密集的参考轨迹和完美的状态可观测性，这本质上限制了物理泛化。我们提出了视觉引导的敏捷交互控制（VAIC），这是一个统一框架，通过仅依靠机载深度、历史本体感知和解耦的用户命令接口来弥合这一差距。VAIC采用两阶段蒸馏范式。首先，一个特权教师策略利用精确的物体运动学和精确的环境状态掌握多样的交互技能。其次，一个可部署的学生策略通过将全身跟踪替换为多轴速度目标和每帧交互指示器来蒸馏这些能力。学生利用一个循环物体适应模块，从原始深度流和本体感知中隐式推断不可观测的物体动力学。在人形机器人上的评估和实际部署表明，单个VAIC策略能够成功执行高度多样的动态任务，包括箱体搬运、推车交互和滑板，持续优于基线，推动了自主人形机器人的部署。

英文摘要

Humanoid robots hold immense potential for real-world assistance, yet agile interaction with objects in unstructured environments demands tightly coupled whole-body coordination. Despite recent advancements, current controllers face a critical deployment gap. They rely heavily on dense reference trajectories and perfect state observability, which inherently limits physical generalization. We present Vision Guided Agile Interaction Control (VAIC), a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. VAIC employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world deployments on the humanoid robot demonstrate that a single VAIC policy successfully executes highly diverse dynamic tasks. These tasks include box carrying, cart interaction, and skateboarding, consistently outperforming baselines and advancing autonomous humanoid deployment.


2606.09572 2026-06-09 cs.RO cs.AI 新提交

TBD-VLA: 时序块扩散视觉语言动作模型

Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo

发表机构 * University of Virginia（弗吉尼亚大学）

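AI summary: Proposes the TBD-VLA framework, which uses temporal block diffusion to enable parallel action generation in discrete-token VLA models, balancing temporal coherence with inference speed and outperforming prior methods on simulated and real-world tasks.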


AI中文摘要

离散视觉-语言-动作（VLA）模型通常将动作生成建模为离散动作空间上的下一个令牌预测，每个令牌自回归地依赖于先前的上下文。虽然有效，但这种范式会导致高推理延迟，并且很大程度上忽略了动作轨迹中固有的时间结构。最近的工作引入并行解码以提高效率，实现更快的推理，但缺乏建模令牌依赖关系的显式机制。我们提出TBD-VLA，一种基于离散令牌的VLA框架，它结合了块扩散以实现时序动作生成。我们将动作序列划分为时间块，并在每个块内执行掩码离散扩散，同时保持跨块的自回归生成。这种设计统一了时序自回归和并行动作解码，实现了强时序连贯性和改进的推理速度。此外，显式的时序建模通过时序修补实现了动作块（例如实时分块）的异步执行。TBD-VLA在仿真和真实世界的操作任务中显著优于先前的VLA方法，为走向快速、时序感知的离散VLA模型提供了一条可扩展的路径。项目网页：https://tbd-vla.github.io/

英文摘要

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
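
The decoding scheme described above (autoregressive across temporal blocks, masked parallel unmasking inside each block) can be sketched as a control loop. This is only an illustration of that loop: `toy_denoiser` is a hypothetical deterministic stand-in for the model, and the confidence-ranked unmasking schedule is an assumption rather than the paper's procedure.

```python
MASK = -1


def toy_denoiser(context, block, offset, vocab=8):
    """Stand-in for the model: propose (token, confidence) for each masked slot,
    conditioned on all previously decoded blocks (context)."""
    proposals = {}
    for i, tok in enumerate(block):
        if tok == MASK:
            token = (sum(context) + offset + i) % vocab
            confidence = 1.0 / (1 + i)  # earlier slots are treated as more confident
            proposals[i] = (token, confidence)
    return proposals


def decode(horizon, block_size, unmask_per_step, denoiser=toy_denoiser):
    actions = []
    denoise_calls = 0
    for start in range(0, horizon, block_size):        # autoregressive across blocks
        block = [MASK] * min(block_size, horizon - start)
        while MASK in block:                            # masked decoding within a block
            proposals = denoiser(actions, block, start)
            denoise_calls += 1
            ranked = sorted(proposals, key=lambda i: -proposals[i][1])
            for i in ranked[:unmask_per_step]:          # commit several tokens at once
                block[i] = proposals[i][0]
        actions.extend(block)
    return actions, denoise_calls


actions, calls = decode(horizon=8, block_size=4, unmask_per_step=2)
print(actions, calls)
```

With two tokens committed per model call, the eight-step horizon needs four calls instead of the eight that one-token-at-a-time decoding would require, while each block still conditions on the blocks before it.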


2606.08653 2026-06-09 cs.CV cs.AI cs.LG cs.RO 交叉投稿

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

FiberTune: 在视觉-语言-动作微调中保留动作纤维视觉残差

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du

发表机构 * University of Chinese Academy of Sciences（中国科学院大学）； Hebei Key Laboratory of Cognitive Intelligence, Xiong’an Institute of Innovation（河北省认知智能重点实验室，雄安创新研究院）； Hebei University of Technology（河北工业大学）； Beijing Information Science and Technology University（北京信息科技大学）

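AI summary: Proposes FiberTune, which uses an online action probe to filter out action-predictive feature directions, aligns the remaining visual residuals to a teacher, and regularizes their effective rank, improving VLA policy performance across six simulation settings and a physical task.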

Comments Project page: https://fibertune.github.io/


AI中文摘要

动作监督的视觉-语言-动作（VLA）策略微调能有效拟合演示，但仅约束改变预测动作的方向，导致动作等价状态下视觉结构自由坍缩。我们将此形式化为沿局部动作纤维的残差视觉坍缩，并提出FiberTune，一种训练时目标，在不增加推理开销的情况下保留教师结构的视觉残差。FiberTune使用在线动作探针估计动作预测特征方向，从中滤除中间视觉标记表示，并将探针过滤后的残差与冻结的视觉教师对齐，同时正则化其有效秩。在相同训练条件下，FiberTune在跨越两个基准和两种架构（pi_0.5和OpenVLA-OFT）的六个受控仿真设置以及物理SO-101拾取放置任务中，均优于仅任务损失的微调；代表性提升包括长时域CALVIN ABC-to-D上SR(5)提高10.7个百分点，物理SO-101任务成功率从72.7%提升至78.1%。残差诊断显示，这些增益与探针过滤后的残差教师对齐度和有效秩增加一致，符合动作纤维动机。

英文摘要

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
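
Two ingredients named in the abstract, filtering an action-predictive direction out of a feature and measuring effective rank, can be illustrated in a few lines. This sketch assumes a single probe direction and uses the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values). Whether FiberTune uses exactly these forms is not stated above.

```python
import math


def remove_direction(x, u):
    """Project feature vector x onto the orthogonal complement of direction u,
    leaving the residual that does not vary along u."""
    norm_sq = sum(c * c for c in u)
    coeff = sum(a * b for a, b in zip(x, u)) / norm_sq
    return [a - coeff * b for a, b in zip(x, u)]


def effective_rank(singular_values):
    """exp of the Shannon entropy of the normalized singular-value spectrum."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Hypothetical feature and probe direction.
residual = remove_direction([3.0, 4.0], [1.0, 0.0])
print(residual)

# A flat spectrum uses every dimension; a collapsed one uses a single dimension.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(effective_rank([1.0, 0.0, 0.0, 0.0]))
```

A residual representation that collapses toward one dominant direction has an effective rank near 1, which is the kind of degeneration a rank regularizer is meant to discourage.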


2606.08962 2026-06-09 cs.LG cs.CV cs.RO 交叉投稿

C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

C$^3$ache: 利用跨推理块缓存加速世界动作模型

Weisen Zhao, Lam Nguyen, Zhicong Lu, Yuzhang Shang

发表机构 * George Mason University（乔治梅森大学）； University of Central Florida（中佛罗里达大学）

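AI summary: Proposes C$^3$ache, which caches and reuses denoising residuals across inference chunks to accelerate world action model inference, achieving up to a 2.5x speedup with almost no loss in task success rate.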


AI中文摘要

世界动作模型（WAM）比标准的视觉-语言-动作（VLA）策略在新型运动和环境中具有更好的泛化能力，因为视频建模目标使其能够从大量未标记视频中学习，而不是依赖稀缺的标记机器人演示。这种泛化能力计算成本高昂。为了完成一个任务，WAM需要运行多个推理块，每个块都需要一个昂贵的去噪过程。现有的加速方法通过在一个块的去噪轨迹内缓存和重用计算来降低这一成本。我们的实证分析揭示了它们忽略的一个重要的冗余来源：块间的冗余。当机器人执行平滑行为时，在给定去噪步骤计算的残差从一个块到下一个块高度相关。我们引入了C$^3$ache，一种无需训练的方法，它在相同去噪步骤的推理块之间缓存和重用这些残差。在基于Fast-WAM骨干的基准测试上的实验表明，C$^3$ache在总墙钟推理时间上实现了高达2.5倍的加速，而任务成功率几乎没有下降。

英文摘要

World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a 2.5× speedup in total wall-clock inference time, with negligible degradation in task success rate.
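
The core idea, reusing the residual computed at a given denoising step in one chunk for the same step in later chunks, can be sketched as a cache keyed by step index. The refresh schedule (`refresh_every`) and the toy residual function are assumptions for illustration. The abstract does not say how the method decides when a cached residual is stale.

```python
def run_chunks(num_chunks, num_steps, compute_residual, refresh_every):
    """Run several inference chunks, reusing per-step residuals across chunks."""
    cache = {}              # denoising step -> residual from an earlier chunk
    expensive_calls = 0
    outputs = []
    for chunk in range(num_chunks):
        refresh = chunk % refresh_every == 0   # hypothetical refresh policy
        latent = 0.0
        for step in range(num_steps):
            if refresh or step not in cache:
                cache[step] = compute_residual(chunk, step)  # full network pass
                expensive_calls += 1
            latent += cache[step]                            # apply (possibly reused) residual
        outputs.append(latent)
    return outputs, expensive_calls


def toy_residual(chunk, step):
    """Stand-in for the denoiser: residuals drift slowly from chunk to chunk."""
    return 1.0 / (step + 1) + 0.01 * chunk


cached_out, cached_calls = run_chunks(6, 4, toy_residual, refresh_every=3)
full_out, full_calls = run_chunks(6, 4, toy_residual, refresh_every=1)
print(cached_calls, full_calls)
```

In this toy run the cached variant makes 8 expensive calls instead of 24, at the price of outputs that lag slightly behind the fully recomputed ones between refreshes.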


2603.19183 2026-06-09 cs.RO 版本更新

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

稀疏自编码器揭示VLA模型中可解释且可操控的特征

Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy, Mac Schwager

发表机构 * Department of Mechanical Engineering（机械工程系）； Department of Computer Science（计算机科学系）； Department of Aeronautics & Astronautics（航空与航天系）

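AI summary: Trains sparse autoencoders to reveal interpretable and steerable features in VLA models and verifies their transferability across tasks and scenes.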

Comments 24 pages, 11 figures


AI中文摘要

视觉-语言-动作（VLA）模型已成为通用机器人操作的有希望方法。然而，很少有研究系统地探讨了它们在物体、场景和指令之间泛化的原因和时机。为此，我们训练了稀疏自编码器（SAEs）来探索VLA隐藏层激活的内部表示。SAEs学习稀疏字典，通常揭示与模型表示空间中可解释方向对应的特征。我们识别出与运动原语和语义概念相关的SAE特征，包括在多个回合中普遍且因果可控的特征。我们提出了一种度量标准，将特征分类为通用可迁移原语或回合特定的记忆化，为VLA泛化提供了新的视角。我们通过在LIBERO模拟基准和真实世界DROID硬件上的操控实验验证了这些发现。我们发现增强通用和语义特征会诱导出与其意义一致的行为，而消去它们会破坏模型性能。此外，我们展示了操控作为在无提示方向上控制行为的方式。这些结果提供了机制证据，表明VLA可以学习可重用的内部特征，将感知、语言和动作跨任务和场景连接起来。我们的项目页面位于https://drvla.github.io

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, little research has mechanistically explored when and why they generalize across objects, scenes, and instructions. To probe internal representations, we train Sparse Autoencoders (SAEs) on the VLA's hidden-layer activations. SAEs learn sparse dictionaries over model activations, often revealing features that correspond to interpretable directions in the model's representation space. We identify SAE features corresponding to motion primitives and semantic concepts, including features that are general across episodes and causally steerable. We propose a metric to categorize features as general transferable primitives or episode-specific memorizations, offering a promising glimpse towards VLA generalization. We validate these findings through steering experiments on both the LIBERO simulation benchmark and on real-world DROID hardware. We find that amplifying general and semantic features induces behaviors consistent with their meanings, whereas ablating them destroys model performance. Furthermore, we demonstrate steering as a way to control behavior in unpromptable directions. Together, these results provide mechanistic evidence that VLAs can learn reusable internal features linking perception, language, and action across tasks and scenes. Our project page is located at https://drvla.github.io
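
To make the steering experiments more concrete, here is a minimal sketch of a sparse autoencoder forward pass and of steering by adding a feature's decoder direction to an activation. The tiny hand-written weights and the additive steering rule are assumptions chosen for readability and are not the paper's trained SAE or exact intervention.

```python
def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """Feature activations: ReLU(W_enc x + b_enc)."""
    return [max(0.0, z + b) for z, b in zip(matvec(w_enc, x), b_enc)]


def sae_decode(features, w_dec, b_dec):
    """Reconstruction: b_dec + sum_i features[i] * w_dec[i] (one row per feature)."""
    out = list(b_dec)
    for f, direction in zip(features, w_dec):
        out = [o + f * d for o, d in zip(out, direction)]
    return out


def steer(x, w_dec, feature, alpha):
    """Add alpha times one feature's decoder direction to activation x."""
    return [xi + alpha * d for xi, d in zip(x, w_dec[feature])]


# Hypothetical 2-dimensional activation space with 3 dictionary features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [2.0, 1.0]
features = sae_encode(x, w_enc, b_enc)
recon = sae_decode(features, w_dec, b_dec)
amplified = sae_encode(steer(x, w_dec, feature=1, alpha=3.0), w_enc, b_enc)
ablated = sae_encode(steer(x, w_dec, feature=1, alpha=-1.0), w_enc, b_enc)
print(features, recon, amplified, ablated)
```

Amplifying feature 1 raises only that feature's activation, and subtracting its current contribution removes it, which mirrors the amplify and ablate interventions described in the abstract.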


2604.22238 2026-06-09 cs.RO 版本更新

UAOR: 面向视觉-语言-动作模型的不确定性感知观测重注入

Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Zichen Wen, Bowen Fang, Tao Yu, Xiangnan Wu, Qisen Ma, Kai Wang, Ziheng He, Yingda Li, Zhengbo Zhang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所模式识别新技术实验室）； Shanghai Jiao Tong University（上海交通大学）； FiveAges（五代）

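AI summary: Proposes the UAOR module, which detects uncertainty via action entropy and reinjects observation information at high-uncertainty language model layers, improving VLA models on simulated and real-world tasks without extra training or data.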


AI中文摘要

视觉-语言-动作（VLA）模型利用预训练的视觉-语言模型（VLM）作为骨干，将图像和指令映射到动作，展现出在可泛化机器人操作中的显著潜力。为了提升性能，现有方法通常引入额外的观测线索（如深度图、点云）或辅助模块（如目标检测器、编码器），以实现更精确和可靠的任务执行，但这些方法通常需要昂贵的数据收集和额外训练。受语言模型中的前馈网络（FFN）可作为“键值记忆”的发现启发，我们提出不确定性感知观测重注入（UAOR），一种有效、无需训练且即插即用的VLA模型模块。具体地，当当前语言模型层表现出由动作熵衡量的高不确定性时，它通过注意力检索将关键观测信息重注入下一层的前馈网络（FFN）。该机制直接在高不确定性层用观测证据增强隐藏状态，从而实现更准确和可靠的动作生成。综合实验表明，我们的方法以最小开销一致地提升了多种VLA模型在仿真和真实任务中的性能。值得注意的是，UAOR消除了对额外观测线索或模块的需求，使其成为现有VLA流程中通用且实用的即插即用组件。项目页面见此URL。

英文摘要

Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism directly augments the hidden states with observation evidence at high-uncertainty layers, enabling more accurate and reliable action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
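
The gating step described above, measuring uncertainty with an entropy over actions and triggering reinjection when it is high, can be sketched as follows. The sketch assumes that a probability distribution over discretized actions is available at each layer and uses a hand-picked threshold. How UAOR actually obtains per-layer action distributions and sets its threshold is not specified in the abstract.

```python
import math


def action_entropy(probs):
    """Shannon entropy (in nats) of a distribution over discretized actions."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertain_layers(per_layer_probs, threshold):
    """Indices of layers whose action distribution exceeds the entropy threshold.
    In the scheme described above, observation features would then be reinjected
    into the FFN of the layer that follows each flagged layer."""
    return [i for i, probs in enumerate(per_layer_probs)
            if action_entropy(probs) > threshold]


# Hypothetical per-layer action distributions over four action bins.
layers = [
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain
    [0.97, 0.01, 0.01, 0.01],   # confident
    [0.40, 0.30, 0.20, 0.10],   # fairly uncertain
]
flagged = uncertain_layers(layers, threshold=1.0)
print(flagged)
```

Only the flagged layers would pay the cost of the extra attention retrieval, which is consistent with the abstract's claim of minimal overhead.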


2606.08064 2026-06-09 cs.RO 新提交

Cooperative Long Rope Skipping via Multi-Agent Reinforcement Learning

基于多智能体强化学习的协作长绳跳绳

Zihao Wang, Shijie Peng, Kerui Wu, Yu Huang, Ruiqi Xue, Dong Liu, Tian Xu, Lei Yuan, Yang Yu

发表机构 * National Key Laboratory of Novel Software Technology, Nanjing University（南京大学计算机软件新技术国家重点实验室）； School of Artificial Intelligence, Nanjing University（南京大学人工智能学院）； Beijing Academy of Artificial Intelligence, BAAI（北京智源人工智能研究院）

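AI summary: Proposes Marope, a hierarchical reinforcement learning framework for cooperative long rope skipping with multiple humanoid robots. Decentralized rope-turning policies are trained with multi-agent reinforcement learning, an upper-level scheduling policy coordinates their execution, and diverse jumping policies are incorporated to improve generalization. The approach outperforms baselines in simulation and real-world experiments.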


AI中文摘要

人类展现出卓越的运动敏捷性，能够完成跑步、跳跃等多种动态技能，这凸显了人形机器人在运动方面的巨大潜力。在竞技体育中，长绳跳绳需要两名摇绳者协同摇绳，同时适应不同跳跃节奏的玩家，这对人形机器人来说是一项有意义但具有挑战性的任务。尽管现有的人形机器人运动方法在单智能体和无交互场景（如跑步、舞蹈和跑酷）中取得了成功，但需要多参与者精确协调的任务场景仍鲜有探索。为此，我们提出Marope，一个用于多个人形机器人协作长绳跳绳的多智能体强化学习框架。具体而言，Marope采用分层强化学习框架进行策略训练。在底层，通过多智能体强化学习学习分散的摇绳操作策略；在顶层，训练集中调度策略以协调底层策略的执行。为了提高对不同玩家行为风格的泛化能力，Marope进一步将多样化的跳跃策略融入协作博弈训练中。我们在仿真和真实环境中对宇树G1人形机器人进行了评估。实验结果表明，Marope优于多种基线方法，实现了更高效稳定的摇绳操作以及与不同玩家更鲁棒和自适应的协作。

英文摘要

Humans exhibit remarkable motor agility, enabling a wide range of dynamic skills such as running and jumping, which highlights the great potential of humanoid robots for athletic locomotion. Among athletic sports, long rope skipping requires two rope turners to cooperatively swing the rope while adapting to a player under different jumping rhythms, making it a meaningful yet challenging task for humanoid robots. Although existing methods for humanoid sports have achieved success in single-agent and interaction-free settings, such as running, dancing, and parkour, task scenarios that require precise coordination among multiple participants remain largely unexplored. To this end, we propose Marope, a multi-agent reinforcement learning (MARL) framework for cooperative long rope skipping with multiple humanoid robots. Specifically, Marope adopts a hierarchical reinforcement learning framework for policy training. At the lower level, it learns decentralized rope manipulation policies through MARL, while at the upper level, a centralized scheduling policy is trained to coordinate the execution of the lower-level policies. To improve generalization across different player behavioral styles, Marope further incorporates diverse jumping policies into cooperative game training. We evaluate our approach on Unitree G1 humanoid robots in both simulation and real-world settings. Experimental results demonstrate that Marope outperforms various baselines, achieving more efficient and stable rope manipulation as well as more robust and adaptable cooperation with varied players.

2606.09099 2026-06-09 cs.RO 新提交

LAEI: Layered Autonomous Edge Intelligence Framework for Robust UAV Swarm Operations

LAEI: 面向鲁棒无人机蜂群操作的分层自主边缘智能框架

Changmin Park, Wooyong Jung, Hwangnam Kim

发表机构 * Korea University（高丽大学）

AI总结提出分层自主边缘智能框架，通过机载学习策略与轻量级任务级监督结合，实现无人机蜂群在通信受限、环境不确定和组件故障下的可扩展协调，显著降低任务完成时间并提高效率。

Comments Preprint. Submitted to arXiv

AI中文摘要

自主无人机蜂群需要可扩展的协调机制，以在有限通信、环境不确定性和组件故障下保持任务性能。集中式方法提供全局协调，但存在通信瓶颈和单节点脆弱性，而完全分散的方法通常缺乏任务级一致性。本文提出了分层自主边缘智能（LAEI），一种无人机蜂群框架，它将机载学习策略与轻量级任务级监督相结合。每个无人机在机载执行局部感知、避障和动作选择，而监督层提供自适应目标重分配、故障感知恢复和上下文相关策略指导，而不直接控制低级动作。LAEI进一步整合了恢复策略，包括动态重新关联、备份监督支持和回退局部自主性，以在代表性故障场景下维持任务连续性。我们在模拟的无人机蜂群场景中评估了LAEI，使用任务完成时间、碰撞率和覆盖效率。结果表明，LAEI减少了任务完成时间并提高了操作效率，同时保持了碰撞感知的分布式无人机级决策。

英文摘要

Autonomous UAV swarms require scalable coordination mechanisms that maintain mission performance under limited communication, environmental uncertainty, and component failures. Centralized approaches provide global coordination but suffer from communication bottlenecks and single-node vulnerabilities, whereas fully decentralized methods often lack mission-level consistency. This paper presents Layered Autonomous Edge Intelligence (LAEI), a UAV-swarm framework that combines onboard learned policies with lightweight mission-level supervision. Each UAV performs local perception, obstacle avoidance, and action selection onboard, while the supervisory layer provides adaptive goal reassignment, fault-aware recovery, and context-dependent policy guidance without directly controlling low-level actions. LAEI further incorporates recovery strategies, including dynamic reassociation, backup supervisory support, and fallback local autonomy, to maintain mission continuity under representative failure scenarios. We evaluate LAEI in simulated UAV-swarm scenarios using mission completion time, collision rate, and coverage efficiency. The results show that LAEI reduces mission completion time and improves operational efficiency while maintaining collision-aware distributed UAV-level decision-making.

2606.09610 2026-06-09 cs.RO cs.AI 新提交

Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning

基于多智能体强化学习的任意物体协同运输中的形状形成

Mohamed Sayed, Wolfram Burgard, Tanja Katharina Kaiser

发表机构 * University of Technology Nuremberg（纽伦堡工业大学）

AI总结提出一种多智能体强化学习方法，使多机器人系统自主形成支撑任意形状和非均匀质量分布物体的编队，同时避免障碍物，实现可靠且泛化的协同运输。

AI中文摘要

协同物体运输在众多领域（包括工业到家庭服务）中至关重要。一种流行的运输策略是将物体承载在多机器人系统之上。相应的任务通常通过将其分解为三个相互关联的子问题来解决：编队控制、协同导航和碰撞避免。现实世界物体带来的一个特殊挑战是其可能具有任意形状和非均匀质量分布，这需要机器人编队能够牢固支撑物体。在这项工作中，我们通过提出一种新颖的多智能体强化学习方法来解决运输此类现实世界物体时的模式形成控制挑战。我们的方法使多机器人系统能够自主定位在物体下方以支撑其重量，同时在编队过程中避免障碍物。我们在不同环境和不同数量机器人下的评估表明，我们的方法能够产生可靠形成平衡编队的策略，并泛化到杂乱场景以及具有复杂几何形状和非均匀质量分布的物体。

英文摘要

Cooperative object transportation is essential in numerous domains, ranging from industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.

2606.09620 2026-06-09 cs.RO cs.SY eess.SY 新提交

Motion planning for hundreds of floating robots

数百个浮动机器人的运动规划

Jan Kamm, Antonio Terpin, Raffaello D'Andrea, Aswin Ramachandran

发表机构 * Institute for Dynamic Systems and Control, ETH Zürich（苏黎世联邦理工学院动态系统与控制研究所）

AI总结针对大型浮动机器人编队的无碰撞运动规划问题，提出一种可扩展的流水线方法，通过碰撞图分解为独立子问题并行求解，在500个机器人仿真和实际演示中验证了有效性。
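The decomposition step mentioned in the summary (splitting the fleet into independent subproblems via a collision graph) can be illustrated with a connected-components pass. This is a sketch under stated assumptions: the summary does not say how edges are defined, so an edge is assumed here whenever two robots lie within a hypothetical interaction distance of `2 * radius`; `collision_components` is an illustrative name, not the authors' API.

```python
import math
from itertools import combinations

def collision_components(positions, radius):
    """Group robot indices into connected components of an assumed proximity graph."""
    n = len(positions)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Assumed edge rule: robots closer than 2 * radius may interact.
        if math.dist(positions[i], positions[j]) < 2 * radius:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group could then be handed to a separate planner instance, since robots in different groups never interact under the assumed edge rule.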

2606.08738 2026-06-09 cs.NI cs.RO 交叉投稿

Systems-Level Planning and Coordination of Truck-Drone Collaborative Delivery Networks

卡车-无人机协同配送网络的系统级规划与协调

Didem Cicek, Burak Kantarci

发表机构 * School of Electrical Engineering and Computer Science at the University of Ottawa（渥太华大学电气工程与计算机科学学院）

AI总结针对城市最后一英里配送，提出分层规划与协调框架，通过任务编排与智能体同步，实现卡车-无人机协同配送，相比纯卡车模式，总配送时间减少42.4%，能耗降低44.2%。

Comments 6 pages, 4 figures, Accepted to 2026 IEEE HPSR on Network Architectures and Intelligence for Smart Mobility and Autonomous Systems (TRAVERSAL)

AI中文摘要

城市最后一英里包裹配送日益依赖异构车队，其性能取决于及时协调、可靠通信和可扩展控制。卡车-无人机协作已成为一种网络化信息物理配送范式，结合了卡车的载重能力和续航效率与无人机在拥挤或受限城市环境中的灵活性。本文从系统与控制角度提出了一种分层规划与协调框架，用于构建卡车-无人机协同配送（TDCD）。该框架由五个相互关联的层组成：空间需求对齐、协同配送配置、资源与工作流编排、性能评估和可扩展性分析，为网络化配送操作中的协调、控制和系统级性能提供了统一视角。使用源自2021年亚马逊最后一英里路线研究挑战数据集的实际城市最后一英里配送场景评估了所提框架。案例研究表明，通过结构化任务编排和智能体间同步实现的协调卡车-无人机操作，在操作约束下提高了端到端系统效率。结果显示，与传统的纯卡车配送模型相比，总配送时间减少了42.4%，能耗降低了44.2%。可扩展性分析进一步强调了协调收益如何随系统规模增大而持续，并展示了高效控制和通信在异构配送网络中的重要性。

英文摘要

Urban last-mile parcel delivery increasingly relies on heterogeneous fleets whose performance depends on timely coordination, reliable communication, and scalable control. Truck-drone collaboration has emerged as a networked cyber-physical delivery paradigm that combines the payload capacity and range efficiency of trucks with the agility of drones in congested or access-limited urban environments. This paper proposes a layered planning and coordination framework that structures truck-drone collaborative delivery (TDCD) from a systems and control perspective. The framework consists of five interrelated layers: spatial-demand alignment, collaborative delivery configuration, resource and workflow orchestration, performance evaluation, and scalability analysis, providing a unified view of coordination, control, and system-level performance in networked delivery operations. The proposed framework is evaluated using a realistic urban last-mile delivery scenario derived from the 2021 Amazon Last Mile Routing Research Challenge dataset. The case study demonstrates how coordinated truck-drone operation, enabled by structured task orchestration and inter-agent synchronization, improves end-to-end system efficiency under operational constraints. Results show a 42.4% reduction in total delivery time and a 44.2% reduction in energy consumption compared to a conventional truck-only delivery model. The scalability analysis further highlights how coordination gains persist as system size increases, and shows the importance of efficient control and communication in heterogeneous delivery networks.

2603.24238 2026-06-09 cs.RO 版本更新

从人类驾驶中学习：一种用于自动驾驶的人机协同在线行为克隆框架

Yuhong Shi, Jianyi Liu, Lihang Sun, Li Li, Xudong Dong

发表机构 * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University（西安交通大学人工智能与机器人研究所人机混合增强智能国家重点实验室）

AI总结提出人机协同在线行为克隆框架HiL-OBC，通过人类干预初始化策略、贝叶斯潜在行为建模和在线更新，结合大模型感知与人类驾驶智能，在CARLA基准上显著提升驾驶性能。

AI中文摘要

随着大型基础模型（LFM）的发展，数据驱动的自动驾驶取得了显著进展。然而，现有范式在复杂交互和长尾场景中仍面临分布偏移和因果混淆的严峻挑战。这些限制往往导致在极端条件下缺乏人类级别的决策灵活性和安全性。为克服这一局限，本文提出了一种用于自动驾驶的人机协同在线行为克隆框架（HiL-OBC），旨在深度融合LFM的跨模态感知能力与人类专家的高级驾驶智能。具体而言，HiL-OBC的部署通过三个关键阶段执行：带人类干预的策略初始化、基于贝叶斯策略适应的潜在行为建模，以及在线部署与更新。此外，我们设计了一种多模态在线行为克隆（MOBC）模型，通过轻量级网络架构、接管触发机制和多变量损失函数在线优化基础驾驶策略，从而增强系统在复杂环境中的决策鲁棒性。我们在LangAuto-Human CARLA基准上评估了HiL-OBC。实验结果表明，通过人机协同机制优化的驾驶策略实现了显著的性能提升：StructNav、LFG和LMDrive的驾驶得分（DS）分别提高了47.25%、31.59%和32.12%，同时各种实验设置和关键组件的分析凸显了人机协同学习在提高决策鲁棒性和整体驾驶性能方面的优势。

英文摘要

With the evolution of large foundation models (LFMs), data-driven autonomous driving has made significant strides. However, existing paradigms still face severe challenges in complex interaction and long-tail scenarios due to distribution shift and causal confusion. These limitations often result in a lack of human-level decision-making flexibility and safety in extreme conditions. To overcome these limitations, this paper proposes a Human-in-the-Loop Online Behavior Cloning framework (HiL-OBC) for autonomous driving, which aims to deeply integrate the cross-modal perceptual capabilities of LFMs with the high-level driving intelligence of human experts. Specifically, HiL-OBC deployment is executed through three critical phases: policy initialization with human intervention, latent behavioral modeling with Bayesian policy adaptation, and online deployment and updates. Furthermore, we design a Multi-modal Online Behavior Cloning (MOBC) model, which optimizes the base driving policy online through a lightweight network architecture, a takeover trigger mechanism, and a multi-variant loss function, thereby enhancing the system's decision-making robustness in complex environments. We evaluated HiL-OBC on the LangAuto-Human CARLA benchmark. Experimental results demonstrate that the driving policies optimized via the human-in-the-loop mechanism achieve substantial performance gains: the driving score (DS) of StructNav, LFG, and LMDrive increased by 47.25%, 31.59%, and 32.12%, respectively. An analysis of various experimental settings and key components further highlights the advantages of human-in-the-loop learning in improving decision-making robustness and overall driving performance.

2606.08249 2026-06-09 cs.RO cs.LG 新提交

Disturbance-Aware Aerial Robotics for Ethical Wildlife Monitoring

面向道德野生动物监测的扰动感知空中机器人

Mahmut Osmanovic, Isac Paulsson, Teddy Lazebnik

发表机构 * Department of Computing, Jonkoping University（延雪平大学计算机系）； Department of Information Systems, University of Haifa（海法大学信息系统系）

AI总结提出一种基于强化学习的扰动感知框架，用于异构空中机器人编队自主追踪野生动物，同时最小化行为干扰，在三种动物和四种行为模型上超越规则基线。

AI中文摘要

可靠的野生动物监测对生态学和保护至关重要，然而许多现有方法，如标记、捕捉和近距离观察，可能会改变它们旨在测量的行为。空中机器人提供了一种可扩展的替代方案，在多项研究中显示出有前景的性能。尽管如此，现有方法通常缺乏行为感知，依赖固定启发式规则，或需要昂贵、不切实际且伦理上难以获取的真实世界训练数据。因此，目前尚无通用的自适应无人机监测框架，既能保持生态有效性，又能跨物种、行为和机器人平台扩展。在本研究中，我们引入了一种基于扰动感知强化学习的异构空中机器人编队框架，能够自主追踪野生动物，同时明确最小化行为干扰。我们将动物学模拟环境与基于真实轨迹统计拟合的动物运动模型相结合，并使用一种捕捉观测质量与扰动风险之间权衡的奖励公式来训练控制策略。在三种具有不同生态和运动模式的物种（鸽子、豺和距翅麦鸡）以及四种在自然界中常见的日益策略性的行为模型上，学习到的策略持续超越当前使用的基于规则的基线，并泛化到不同的监测任务、动物动态和无人机类型。这些结果确立了扰动感知学习作为非侵入式自主野生动物观测的可行基础，为生态学和保护中可扩展、道德负责且科学可靠的机器人监测开辟了道路。

英文摘要

Reliable wildlife monitoring is essential for ecology and conservation, yet many existing methods, such as tagging, capture, and close-range observation, can alter the very behaviors they aim to measure. Aerial robots offer a scalable alternative, which has shown promising performance in multiple studies. Nonetheless, existing approaches typically lack behavioral awareness, rely on fixed heuristics, or require real-world training data that are costly, impractical, and ethically difficult to obtain. As a result, there remains no general framework for adaptive drone-based monitoring that can both preserve ecological validity and scale across species, behaviors, and robotic platforms. In this study, we introduce a disturbance-aware reinforcement-learning-based framework for heterogeneous aerial robotic fleets that enables autonomous wildlife tracking while explicitly minimizing behavioral disruption. We couple a zoologically grounded simulation environment with fitted animal movement models derived from real trajectory statistics, and train control policies using a reward formulation that captures the trade-off between observation quality and disturbance risk. Across three species (pigeon, jackal, and spur-winged lapwing) with distinct ecologies and motion patterns and four increasingly strategic behavior models common in nature, the learned policies consistently surpassed currently used rule-based baselines and generalized across monitoring tasks, animal dynamics, and drone types. These results establish disturbance-aware learning as a viable foundation for non-invasive autonomous wildlife observation, opening a path towards scalable, ethically responsible, and scientifically reliable robotic monitoring in ecology and conservation.

2606.08470 2026-06-09 cs.RO 新提交

LUNA-AD: Lightweight Uncertainty-Aware Language Model with Lifelong Learning for Autonomous Driving

LUNA-AD: 面向自动驾驶的轻量级不确定性感知语言模型与终身学习

Ruoyu Yao, Pei Liu, Ruiguo Zhong, Mingxing Peng, Rui Yang, Jun Ma

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出LUNA-AD，一种结合三系统架构、多智能体分析、双头轻量模型和反思驱动终身学习的轻量级不确定性感知语言模型，在nuPlan上实现高成功率与低推理延迟。

Comments 16 pages, 9 figures

AI中文摘要

虽然大型语言模型（LLMs）提供了有前景的推理能力，但它们在安全关键的驾驶系统中的集成受到推理多样性有限、高计算开销和静态学习范式的阻碍。为了解决这些挑战，我们提出了LUNA-AD，一种面向自动驾驶（AD）的轻量级不确定性感知语言模型与终身学习。LUNA-AD采用三系统架构，协调复杂的多模态行为推理、高效部署和持续改进。我们设计了一个多智能体分析系统，通过多样化的假设探索生成不确定性感知的决策演示。一个双头轻量启发式模型被蒸馏，以统一决策分布和文本解释的推理，同时实现高效部署。此外，一种反思驱动的终身学习机制作用于多模态决策输出并保持策略多样性，允许通过闭环反馈改进候选决策和理由，以增强驾驶鲁棒性。在nuPlan基准上的大量实验表明，与现有知识驱动的AD框架相比，LUNA-AD在非反应式和反应式模式下均实现了最先进的成功率，并显著降低了推理延迟。

英文摘要

While large language models (LLMs) offer promising reasoning capabilities, their integration into safety-critical driving systems is hindered by limited reasoning diversity, high computational overhead, and static learning paradigms. To address these challenges, we propose LUNA-AD, a lightweight uncertainty-aware language model with lifelong learning for autonomous driving (AD). LUNA-AD features a tri-system architecture that reconciles complex multimodal behavioral reasoning, efficient deployment, and continual refinement. We design a multi-agent analytical system to generate uncertainty-aware decision-making demonstrations through diverse hypothesis exploration. A dual-head lightweight heuristic model is distilled to unify the inference of decision distributions and textual explanations while enabling efficient deployment. Furthermore, a reflection-driven lifelong learning mechanism operates on multimodal decision outputs and preserves strategic diversity, allowing for the refinement of candidate decisions and rationales via closed-loop feedback to enhance driving robustness. Extensive experiments on nuPlan benchmarks demonstrate that LUNA-AD achieves state-of-the-art success rates under both non-reactive and reactive modes, with drastically reduced inference latency compared to existing knowledge-driven AD frameworks.

2606.08513 2026-06-09 cs.RO cs.LG cs.SY eess.SY 新提交

Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning

面向自主水下机器人的端到端运动规划与执行：基于强化学习的方法

Elisei Shafer, Oren Gal

发表机构 * University of Haifa（海法大学）

AI总结提出分层强化学习架构，将原始传感器数据直接映射为推进器指令，实现AUV端到端运动规划与执行，在HoloOcean仿真中轨迹长度接近RRT*基线（误差4%-6%），并具备鲁棒性。

AI中文摘要

自主水下机器人（AUV）传统上依赖复杂、高度工程化的流水线进行感知、路径规划和运动控制。本文探索了一种端到端深度强化学习（DRL）方法的可行性，该方法将原始传感器数据直接映射为推进器指令，减少了人工工程。我们提出了一种分层强化学习（HRL）架构，将问题分解为两个马尔可夫决策过程。高层（HL）策略以2Hz运行，处理原始$84 \times 84$像素单目相机帧、堆叠的$100 \times 100$像素前视成像声纳以及本体感受数据，生成空间子目标。同时，低层（LL）策略以10Hz运行，将这些子目标转换为推进器指令。HL策略使用基于先前演示的强化学习（RLPD）在修改后的样本高效机器人强化学习（SERL）框架中训练，而LL策略则采用软演员-评论家（SAC）结合后见经验回放（HER）。在高保真HoloOcean模拟器中评估，我们的方法展示了成功的避障能力，轨迹长度与$\text{RRT}^*$规划基线非常接近（误差在4%到6%之间）。此外，学习到的策略对模拟传感器噪声和能见度降低表现出强鲁棒性。尽管系统能有效导航熟悉的几何环境，但实验揭示了在遇到具有新颖障碍形状的未访问区域时存在泛化限制。最终，这项工作展示了使用最小计算硬件进行样本高效、端到端DRL在水下导航中的潜力。

英文摘要

Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for perception, path planning, and motion control. This paper explores the feasibility of an end-to-end Deep Reinforcement Learning (DRL) approach that maps raw sensor data directly to thruster commands, reducing manual engineering. We propose a hierarchical reinforcement learning (HRL) architecture splitting the problem into two Markov Decision Processes. A High-Level (HL) policy operating at 2Hz processes raw $84 \times 84$ pixel monocular camera frames, stacked $100 \times 100$ pixel forward-looking imaging sonar, and proprioceptive data to generate spatial subgoals. Simultaneously, a Low-Level (LL) policy operating at 10Hz converts these subgoals into thruster commands. The HL policy is trained using Reinforcement Learning from Prior Demonstrations (RLPD) within a modified Sample-Efficient Robotic Reinforcement Learning (SERL) framework, while the LL policy utilizes Soft Actor-Critic (SAC) combined with Hindsight Experience Replay (HER). Evaluated in the high-fidelity HoloOcean simulator, our method demonstrates successful obstacle avoidance, achieving trajectory lengths closely approximating (within 4% to 6% of) an $\text{RRT}^*$ planning baseline. Furthermore, the learned policy exhibits strong robustness to simulated sensor noise and decreased visibility. While the system navigates familiar geometries effectively, experiments reveal generalization limitations when encountering unvisited areas with novel obstacle shapes. Ultimately, this work demonstrates the promise of sample-efficient, end-to-end DRL for underwater navigation using minimal computational hardware.
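The two-rate structure (a 2 Hz high-level policy feeding a 10 Hz low-level policy) implies that each subgoal is held for five low-level steps. A rough sketch of that scheduling follows; the policy callables are placeholders, the step index stands in for the real sensor observation, and `run_hierarchy` is a hypothetical name rather than anything from the paper.

```python
def run_hierarchy(duration_s, hl_policy, ll_policy, hl_hz=2, ll_hz=10):
    """Run the loop at the low-level rate; refresh the subgoal every ll_hz // hl_hz steps."""
    steps_per_subgoal = ll_hz // hl_hz        # 10 Hz / 2 Hz = 5 low-level steps per subgoal
    subgoal = None
    log = []
    for step in range(int(duration_s * ll_hz)):
        if step % steps_per_subgoal == 0:
            subgoal = hl_policy(step)         # placeholder: real input would be camera/sonar data
        command = ll_policy(subgoal)          # placeholder: real output would be thruster commands
        log.append((step, subgoal, command))
    return log
```

Over one simulated second, the high-level policy is queried twice and the low-level policy ten times.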

2606.09088 2026-06-09 cs.RO 新提交

Autonomous FPV Flight with Translational Optical Flow and Uncertainty Mask

基于平移光流与不确定性掩膜的自主FPV飞行

Yang Deng, Yu Hu, Feng Yu, Linzuo Zhang, Danping Zou

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结提出利用平移光流和不确定性掩膜增强FPV四旋翼自主飞行，在仿真和真实森林环境中实现高达13.91 m/s和11.79 m/s的飞行速度，成功率93.3%。

AI中文摘要

在复杂环境中使用单目RGB相机作为唯一外部传感器的自主FPV四旋翼飞行仍然是一个基本挑战。最近的研究表明，使用光流作为神经网络的输入可以实现杂乱场景中的端到端自主飞行。然而，从光流估计中提取最相关信息是限制敏捷性和鲁棒性的关键瓶颈。现有方法难以将障碍物引起的光流与自运动背景光流分离，并且在膨胀焦点（FoE）附近信噪比低。为了解决这些问题，我们将光流分解为平移和旋转分量，并仅利用捕捉场景几何和深度线索的平移光流。此外，我们引入了一种基于前向和后向光流估计不一致性的不确定性掩膜。该掩膜突出显示障碍物结构，包括FoE区域内的结构。这两个线索被输入到在可微仿真框架中训练的控制策略中，该框架能够实现感知和控制的一阶优化。我们通过在仿真和真实森林环境中的大量实验验证了我们的方法。所提出的系统在仿真中实现了高达13.91 m/s的速度，在真实测试中实现了11.79 m/s的速度，在30次真实试验中成功率为93.3%，几乎使先前报道的单目RGB光流无人机避障系统的6 m/s真实速度翻倍。

英文摘要

Autonomous FPV quadrotor flight in complex environments using a monocular RGB camera as the sole exteroceptive sensor remains a fundamental challenge. Recent research has shown that using optical flow as the input of a neural network can achieve end-to-end autonomous flight in cluttered scenes. However, extracting the most relevant information from the flow estimation is the key bottleneck limiting agility and robustness. Existing methods struggle to disentangle obstacle-induced optical flow from the ego-motion background flow and suffer from low signal-to-noise ratios near the focus of expansion (FoE). To address these issues, we decompose the optical flow into translational and rotational components and utilize only the translational flow, which captures scene geometry and depth cues. In addition, we introduce an uncertainty mask derived from inconsistencies between forward and backward flow estimates. This mask highlights obstacle structures, including those within the FoE region. Both cues are fed to a control policy trained in a differentiable simulation framework, which enables efficient first-order optimization across perception and control. We validate our approach through extensive experiments in both simulated and real-world forest environments. The proposed system achieves robust flight at speeds of up to 13.91 m/s in simulation and 11.79 m/s in real-world tests, with a 93.3% success rate over 30 real-world trials, nearly doubling the previously reported 6 m/s real-world speed of the monocular-RGB optical-flow UAV obstacle avoidance system.
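A forward-backward consistency check is one common way to turn flow disagreement into a mask, and the sketch below shows that idea on a tiny grid. It is not the authors' implementation: the threshold, the nearest-pixel lookup, and the choice to flag out-of-frame pixels are all assumptions made for illustration.

```python
import math

def fb_inconsistency_mask(fwd, bwd, thresh=1.0):
    """fwd[y][x] and bwd[y][x] are (dx, dy) flows for t->t+1 and t+1->t.

    Returns a grid with 1 where the round trip does not return near the start pixel.
    """
    h, w = len(fwd), len(fwd[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = fwd[y][x]
            x2, y2 = round(x + dx), round(y + dy)   # nearest pixel in the next frame
            if not (0 <= x2 < w and 0 <= y2 < h):
                mask[y][x] = 1                      # assumed: out-of-frame counts as uncertain
                continue
            bx, by = bwd[y2][x2]
            # A consistent pair satisfies fwd + bwd ~ 0 along the round trip.
            mask[y][x] = 1 if math.hypot(dx + bx, dy + by) > thresh else 0
    return mask
```

Pixels near occlusion boundaries tend to fail this check, which is consistent with the abstract's remark that the mask highlights obstacle structures.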

2606.09569 2026-06-09 cs.RO cs.CV 新提交

Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications

自动驾驶应用中相对位姿估计的高效最小求解器

Tao Li, Liang Liu, Jianli Han, Weimin Lv

发表机构 * College of Aerospace Science and Engineering, Naval Aviation University（海军航空大学航空航天科学与工程学院）

AI总结提出基于新平移参数化和一阶旋转近似的统一框架，设计三种最小求解器（利用IMU垂直方向、转向旋转轴方向、平面运动假设），减少点对应数量和代数复杂度，在RANSAC中加速假设生成，平衡速度与精度。

AI中文摘要

随着视觉传感系统的进步，计算机视觉在自动驾驶和机器人导航中扮演着越来越重要的角色。多相机系统中的相对位姿估计对于精确的车辆定位和环境感知至关重要，要求高实时性和鲁棒性。然而，现有方法通常涉及高计算成本并严重依赖丰富的特征匹配，限制了它们在时间敏感驾驶场景中的适用性。为解决这些限制，本文引入了一个基于新颖平移参数化和一阶旋转近似的统一框架，用于高效相对位姿估计。在该框架内，我们提出了三种专门为自动驾驶车辆设计的高效最小求解器。第一个求解器集成了惯性测量单元（IMU）的垂直方向先验，第二个在转向操作期间利用旋转轴方向先验，第三个专为平面运动设计——这是结构化道路上地面车辆的现实假设。通过减少最小点对应数量和代数复杂度，我们的方法能够在基于RANSAC的流程中更快地生成假设，提高对实时系统的适用性。在合成数据集和KITTI自动驾驶基准上的大量实验表明，与现有最先进算法相比，所提出的求解器在速度和精度之间实现了有利的平衡。

英文摘要

With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time performance and robustness. Existing methods, however, often involve high computational costs and rely heavily on abundant feature matches, limiting their applicability in time-sensitive driving scenarios. To address these limitations, this paper introduces a unified framework for efficient relative pose estimation, built upon a novel translation parameterization and first-order rotation approximation. Within this framework, we propose three efficient minimal solvers specifically designed for autonomous vehicles. The first solver integrates the vertical direction prior from Inertial Measurement Units (IMUs), the second utilizes the rotation axis direction prior during steering maneuvers, and the third is designed for planar motion - a realistic assumption for ground vehicles operating on structured roads. By reducing both the minimal number of point correspondences and the algebraic complexity, our methods enable faster hypothesis generation within RANSAC-based pipelines, improving suitability for real-time systems. Extensive experiments on synthetic datasets and the KITTI autonomous driving benchmark demonstrate that the proposed solvers achieve a favorable balance between speed and accuracy compared to existing state-of-the-art algorithms.
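The first-order rotation approximation mentioned above is usually written as R ≈ I + [r]×, which is accurate when the inter-frame rotation is small. The snippet below compares it with the exact Rodrigues formula; it is a generic numerical check, not the paper's solver, and the paper's translation parameterization is not reproduced here.

```python
import math

def skew(r):
    """Cross-product matrix [r]x of a 3-vector."""
    x, y, z = r
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def exact_rotation(r):
    """Rodrigues formula for the rotation vector r (axis times angle)."""
    theta = math.sqrt(sum(c * c for c in r))
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if theta == 0.0:
        return eye
    k = skew([c / theta for c in r])
    k2 = matmul(k, k)
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[eye[i][j] + s * k[i][j] + c * k2[i][j] for j in range(3)] for i in range(3)]

def first_order_rotation(r):
    """Linearized rotation I + [r]x, valid for small angles."""
    k = skew(r)
    return [[(1.0 if i == j else 0.0) + k[i][j] for j in range(3)] for i in range(3)]

def max_abs_error(r):
    a, b = exact_rotation(r), first_order_rotation(r)
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))
```

The error grows roughly with the square of the angle, so the linearization suits consecutive frames with small yaw changes but degrades for large rotations.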

2606.07756 2026-06-09 cs.CV cs.RO 交叉投稿

DroneDAR: Long-Range Drone Distance Estimation Using Monocular Vision and Bounding-Box Features

DroneDAR: 使用单目视觉和边界框特征的长距离无人机距离估计

Knut Peterson, Zaid Mayers, David Han

发表机构 * iMaPLe Research Lab, Drexel University（德雷塞尔大学iMaPLe研究实验室）

AI总结针对长距离小无人机距离估计的挑战，提出DroneDAR模型，结合卷积骨干网络和轻量级门控机制融合边界框特征，分析骨干容量、裁剪分辨率和回归损失对性能的影响，并探讨远距离失效模式。

Comments 6 pages, 5 figures. Accepted to the 2026 International Conference on Advanced Visual and Signal-Based Systems (AVSS)

AI中文摘要

在长距离图像中准确估计小型无人机的距离对于跟踪和态势感知至关重要，但由于极端的目标尺度变化、背景杂波和噪声视觉线索，这仍然具有挑战性。本文研究了使用图像裁剪和边界框几何进行单目无人机距离估计，这是一种实际设置，其中检测器提供候选无人机区域，模型从外观和框派生特征预测距离。我们评估了一个Droneranger风格的基线，并引入了一个新的DroneDAR（无人机检测与测距）模型，该模型通过轻量级门控机制将卷积骨干网络与显式边界框线索相结合。实验分析了骨干网络容量、裁剪分辨率和回归损失函数如何影响不同距离范围内的性能。我们进一步研究了远距离下的常见失效模式，包括对边界框噪声的敏感性和裁剪中纹理细节的减少。结果为设计和训练在真实远距离条件下保持鲁棒性的距离估计器提供了指导，并指出了在无人机仅占据几个像素时提高可靠性的方向。

英文摘要

Accurate distance estimation for small drones in long-range imagery is important for tracking and situational awareness, yet remains challenging due to extreme target scale variation, background clutter, and noisy visual cues. This paper studies monocular drone distance estimation using image crops together with bounding-box geometry, a practical setting in which a detector provides a candidate drone region and the model predicts range from appearance and box-derived features. We evaluate a Droneranger-style baseline, and introduce a new DroneDAR (Drone Detection And Ranging) model that combines a convolutional backbone with explicit bounding-box cues through a lightweight gating mechanism. Experiments analyze how backbone capacity, crop resolution, and regression loss functions affect performance across distance regimes. We further examine common failure modes at long distances, including sensitivity to bounding-box noise and reduced texture detail in the crop. The results provide guidance for designing and training range estimators that remain robust under real-world long-range conditions and highlight directions for improving reliability when drones occupy only a few pixels.

2606.08533 2026-06-09 cs.LG cs.RO 交叉投稿

Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

通过上下文对比元强化学习的自主空中操控

Lixuan Jin, Bingxuan Lan, Xinyi Bao, Xiangyuan Xie, Chunjie Zhang, Zheng Chen, Tianshuo Liu, Ruijie Tian, Jinyu Ru, Gang Wang, Lei Yuan, Yang Yu

发表机构 * National Key Laboratory of Novel Software Technology, Nanjing University（南京大学计算机软件新技术国家重点实验室）； School of Artificial Intelligence, Nanjing University（南京大学人工智能学院）； Faculty of Robot Science and Engineering, Northeastern University（东北大学机器人科学与工程学院）； National Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology（北京理工大学自主智能无人系统国家重点实验室）

AI总结提出Aco2方法，通过上下文对比元强化学习，使四旋翼无人机在无需人工干预下自主完成不同载荷的抓取、运输和投递，并直接迁移到真实世界。

AI中文摘要

无人机越来越多地部署在物流、服务机器人等实际应用中，对自主载荷获取和投递的需求日益增长。现有方法通常假设预附载荷或依赖专用夹爪，使得通用的端到端空中投递问题仍未解决，因为不同载荷会导致高度变化的飞行动力学，需要单一策略在线适应，无需手动校准或显式系统辨识。为此，我们研究了通过上下文对比元强化学习的自主空中操控（Aco2），这是一个完全自主的空中投递设置，其中配备轻型钩子的四旋翼无人机连续拾取、运输和投递各种带手柄的物体，在随机位置之间进行，全程无需人工干预。首先，我们设计了一个上下文观测编码器，从最近的交互历史中推断出紧凑的潜在上下文，使策略能够在线适应载荷相关的动力学。为了进一步提高上下文质量，我们引入了一个对比目标，该目标围绕任务相关变化结构化上下文嵌入，从而改善跨不同载荷的泛化能力，无需显式系统辨识。完全在模拟中训练，并采用广泛的域随机化，Aco2可以直接部署在物理四旋翼上，无需真实世界微调。

英文摘要

Unmanned aerial vehicles (UAVs) are increasingly being deployed in logistics, service robotics, and other real-world applications, creating a growing demand for autonomous payload acquisition and delivery. Existing approaches typically assume pre-attached payloads or rely on specialized grippers, leaving versatile end-to-end aerial delivery largely unresolved, where different payloads induce highly variable flight dynamics, requiring a single policy to adapt online without manual calibration or explicit system identification. To this end, we study Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning (Aco2), a fully autonomous aerial delivery setting in which a quadrotor equipped with a lightweight hook continuously picks up, transports, and delivers diverse handle-equipped objects between randomized locations, all without human intervention. First, we design a contextual observation encoder that infers a compact latent context from recent interaction history, enabling the policy to adapt online to payload-dependent dynamics. To further improve the quality of this context, we introduce a contrastive objective that structures the context embedding around task-relevant variations, improving generalization across diverse payloads without requiring explicit system identification. Trained entirely in simulation with extensive domain randomization, Aco2 can be directly deployed on a physical quadrotor without real-world fine-tuning.
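The abstract does not state which contrastive loss is used. As an illustration only, the sketch below uses InfoNCE, a widely used contrastive objective, on context embeddings: embeddings from the same payload are treated as positives and those from other payloads as negatives. The pairing rule, the temperature, and the name `info_nce` are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log softmax score of the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
    return log_denom - logits[0]
```

The loss is small when the anchor sits close to its positive and far from the negatives, which pushes contexts from the same payload together in embedding space.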

2606.08680 2026-06-09 cs.CV cs.RO 交叉投稿

Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras

畸变感知的PETR用于混合针孔-鱼眼相机的BEV目标检测

Xiangzhong Liu

发表机构 * fortiss GmbH（fortiss有限公司）

AI总结针对鱼眼相机径向畸变破坏BEV检测器均匀采样假设的问题，提出DAPETR，通过畸变感知位置编码和双向特征-几何协同调制模块，在KITTI-360基准上优于基线方法，并揭示了学习适应与显式几何重参数化之间的冲突。

Comments 8 pages, 5 figures, accepted at ICRA 2026

AI中文摘要

鱼眼相机因其低成本和高覆盖视野（FOV）而被广泛部署于自动驾驶感知套件中，但其在3D目标检测中的潜力仍未得到充分利用。严重的径向畸变通过违反均匀采样的基本假设，对大多数BEV检测器构成挑战。为弥补这一差距，我们提出了畸变感知PETR（DAPETR），一种专为混合针孔-鱼眼相机设置设计的无投影检测器。DAPETR包含两个关键的学习自适应模块：一个统一的畸变感知位置编码，将图像表示的位置编码与鱼眼几何协调一致；以及一个双向特征-几何协同调制模块，使图像特征和3D位置编码相互适应。在我们转换的KITTI-360基准上的实验中，我们系统地将我们的学习自适应方法与极坐标下的PETR（PolarPETR）进行了比较。我们发现，尽管两种方法都优于基线，但我们的学习模块实现了更优的性能。关键的是，我们发现了两种策略结合时的负面交互，表明学习适应和显式几何重参数化可能冲突。我们的最终DAPETR模型显著推进了鱼眼BEV检测的研究和基准，为除图像校正外的有效畸变感知3D感知设计提供了关键见解。

英文摘要

Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains underleveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole-fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric reparameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design other than image rectification.

2606.08714 2026-06-09 eess.SY cs.AI cs.LG cs.RO cs.SY 交叉投稿

Hybrid Neural Network and Conventional Controller Approach for Robust Control of Highly Unstable Systems: Application to Tilt-Rotor Control

混合神经网络与传统控制器方法用于高度不稳定系统的鲁棒控制：应用于倾转旋翼控制

Ali Kafili Gavgani, Amin Talaeizadeh, Aria Alasty, Hossein Nejat Pishkenari

发表机构 * Advanced Research Lab for Control and Agricultural Robotics (Sharif AgRoLab)（控制与农业机器人高级研究实验室（Sharif AgRoLab））； Department of Mechanical Engineering, Sharif University of Technology, Tehran, Iran（伊朗德黑兰谢里夫理工大学机械工程系）

AI总结提出一种神经网络增强的滑模控制器，将系统动力学分解为输入无关和输入相关部分，前者用轻量网络从少量数据学习，实现对全驱动倾转旋翼系统的鲁棒控制，LSTM优于MLP。

Comments Proceedings of the 13th RSI International Conference on Robotics and Mechatronics (ICRoM 2025)

DOI: 10.6084/m9.figshare.32572083

AI中文摘要

多旋翼飞行器广泛应用于从监视到精准农业等领域，但传统设计仍受限于其欠驱动特性。倾转旋翼配置通过实现全驱动克服了这一限制。本文研究基于神经网络的控制策略，用于一个具有四个推力矢量输入的全驱动倾转旋翼系统。我们的工作分为两部分。首先，我们有意呈现一个负面结果，通过评估直接输入-输出控制方法。在该方法中，多层感知器（MLP）、长短期记忆（LSTM）网络和Transformer模型被训练为直接将系统状态及其期望值映射到控制信号。我们表明该策略无法稳定系统，凸显了将直接输入-输出学习应用于高度不稳定被控对象的固有困难。其次，作为主要贡献，我们提出一种神经网络增强的滑模控制器（SMC）。该方法将系统动力学分解为输入无关和输入相关两部分，前者使用轻量网络从少量数据集学习，从而降低实时计算需求。此外，所提方法可以使用从低性能控制器收集的飞行日志进行训练，并且从真实数据学习到的动力学模型可用于仿真。我们进一步比较了基于MLP和LSTM的实现，在模型不确定性和外部干扰下，展示了所提方法的鲁棒性和有效性；特别是，带有LSTM被控对象动力学预测器的控制器相比基于MLP的对应物实现了更优性能，同时运行时间也更低。

英文摘要

Multirotors are widely used in applications ranging from surveillance to precision agriculture, yet conventional designs remain limited by their under-actuation. Tilt-rotor configurations overcome this limitation by enabling full actuation. This paper investigates neural-network-based control strategies for a fully actuated tilt-rotor system with four thrust-vectoring inputs. Our work is structured in two parts. First, we deliberately present a negative result by evaluating a direct input-output control approach. In this method, multilayer perceptrons (MLPs), long short-term memory (LSTM) networks, and transformer models are trained to map system states and their desired values directly to control signals. We show that this strategy fails to stabilize the system, highlighting the inherent difficulty of applying direct input-output learning to highly unstable plants. Second, as the main contribution, we propose a neural-network-enhanced sliding mode controller (SMC). The method decomposes the system dynamics into input-independent and input-dependent components, with the former learned from a small dataset using lightweight networks, thereby reducing real-time computational demands. Moreover, the proposed method can be trained using flight logs collected from low-performance controllers, and the resulting dynamic model learned from real-world data can be used in simulation. We further compare MLP- and LSTM-based implementations under model uncertainties and external disturbances, demonstrating the robustness and effectiveness of the proposed approach; in particular, the controller with the LSTM plant dynamics predictor achieves superior performance to its MLP-based counterpart while also exhibiting lower runtime.
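The split into input-independent and input-dependent dynamics can be illustrated on a scalar toy plant x'' = f(x, x') + g·u, where an approximate model of f is cancelled and a saturated switching term absorbs the model error. Everything below is a hypothetical stand-in: `learned_f` is a hand-written function with 10% error playing the role of the learned network, and the gains, boundary layer, and plant are illustrative values, not those of the paper.

```python
def sat(z):
    """Saturation function used in place of sign() to limit chattering."""
    return max(-1.0, min(1.0, z))

def true_f(x, v):
    return 2.0 * x            # toy unstable input-independent dynamics

def learned_f(x, v):
    return 1.8 * x            # stand-in for the learned network (imperfect on purpose)

def smc_control(x, v, g=1.0, lam=2.0, k=5.0, phi=0.1):
    """Sliding mode law for regulation to zero with surface s = v + lam * x."""
    s = v + lam * x
    return (-learned_f(x, v) - lam * v - k * sat(s / phi)) / g

def simulate(x=1.0, v=0.0, dt=0.001, steps=5000, g=1.0):
    """Forward-Euler rollout of the closed loop; returns the final (x, v)."""
    for _ in range(steps):
        u = smc_control(x, v, g)
        a = true_f(x, v) + g * u      # true plant: f(x, v) + g * u
        x, v = x + dt * v, v + dt * a
    return x, v
```

With these values the surface dynamics reduce to s' = (f - f̂) - k·sat(s/φ), so the switching gain k only has to dominate the residual model error rather than the full dynamics.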

2606.08844 2026-06-09 cs.CV cs.RO 交叉投稿

Geometry-Aware Fisheye-LiDAR Fusion for Robust 3D Object Detection in Low-Overlap Setups

几何感知鱼眼-激光雷达融合用于低重叠设置下的鲁棒3D目标检测

Xiangzhong Liu, Xihao Wang, Hao Shen

发表机构 * Technical University of Munich（慕尼黑工业大学）

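AI summary: To address geometric distortion and low overlap between fisheye cameras and LiDAR in sparse-view setups, the paper proposes a geometry-aware hybrid fusion framework that fuses polar and Cartesian features through a distortion-aware LSS module and a dual-attention correction module, improving detection accuracy on three benchmarks.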

Comments 8 pages, 4 figures, submitted to RA-L

AI中文摘要

随着自主系统从资本密集型的机器人出租车扩展到成本敏感的物流领域，传感器配置越来越优化以实现每单位成本的覆盖范围。一种常见的稀疏视图设置利用双鱼眼摄像头和车顶安装的激光雷达，引入了严重的几何挑战：极端径向畸变、最小重叠以及球面投影与笛卡尔网格之间的错位。BEV融合算法通常在流程早期将图像和点云模态强制统一到笛卡尔网格中，导致广角鱼眼相机出现显著的特征失真和信息丢失。为了解决这个问题，我们提出了一个几何感知混合融合（GA-HF）框架，该框架明确考虑了鱼眼几何和BEV特征失真，其中鱼眼特征通过畸变感知的Lift-Splat-Shoot（LSS）模块提升到极坐标BEV网格中以保留原生角密度，而激光雷达特征在原生笛卡尔空间中处理以实现边界框回归的度量保真度。为了桥接这些异构流，我们引入了一个双注意力扭曲校正模块，该模块在融合前对扭曲的相机特征应用空间和通道注意力，明确抑制低质量外围区域的伪影，同时增强高质量语义线索。GA-HF在三个基准数据集上进行了评估：KITTI-360、Dur360BEV和Fisheye3DOD。据我们所知，这是首个探索激光雷达-鱼眼相机融合的方法。在KITTI-360上，GA-HF相比笛卡尔基线将NDS提高了4.2%；在Dur360BEV上，它超越了仅激光雷达和BEVFusion，同时在几何畸变下显著降低了方向误差；在Fisheye3DOD上，它在所有融合方法中取得了最高的检测分数。

英文摘要

As autonomous systems expand from capital-intensive robotaxis to cost-sensitive logistics, sensor configurations are increasingly optimized for coverage-per-cost. A prevalent sparse-view setup utilizes dual-fisheye cameras with a roof-mounted LiDAR, introducing severe geometric challenges: extreme radial distortion, minimal overlap, and misalignment between spherical projections and rectilinear grids. BEV fusion algorithms typically force image and point cloud modalities into unified Cartesian grids early in the pipeline, causing significant feature distortion and information loss for wide-view fisheye cameras. To address this, we propose a Geometry-Aware Hybrid Fusion (GA-HF) framework that explicitly accounts for fisheye geometry and BEV feature distortion, where fisheye features are lifted into a polar BEV grid via a Distortion-Aware Lift-Splat-Shoot (LSS) module to preserve native angular density, while LiDAR features are processed in native Cartesian space for metric fidelity of bounding box regression. To bridge these heterogeneous streams, we introduce a Dual-Attention Warping Correction module that applies spatial and channel attention to the warped camera features before fusion, explicitly suppressing artifacts in low-quality peripheral regions while enhancing high-quality semantic cues. GA-HF is evaluated on three benchmarks: KITTI-360, Dur360BEV, and Fisheye3DOD datasets. To the best of our knowledge, it is the first approach to explore LiDAR-fisheye camera fusion. On KITTI-360, GA-HF improves NDS by 4.2% over Cartesian baselines; on Dur360BEV, it surpasses both LiDAR-only and BEVFusion, while significantly reducing orientation error despite the geometric distortions; on Fisheye3DOD, it attains the highest detection score among all fusion methods.

2503.01125 2026-06-09 cs.RO 版本更新

TACO: General Acrobatic Flight Control via Target-and-Command-Oriented Reinforcement Learning

TACO：基于面向目标与指令的强化学习实现通用特技飞行控制

Zikang Yin, Canlun Zheng, Shiliang Guo, Zhikun Wang, Shiyu Zhao

发表机构 * College of Computer Science and Technology, Zhejiang University（浙江大学计算机科学与技术学院）； WINDY Lab, Department of Artificial Intelligence, Westlake University（西湖大学人工智能系WINDY实验室）

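AI summary: The paper proposes the TACO framework, which handles acrobatic flight tasks in a unified way through target-and-command-oriented reinforcement learning and supports online parameter adjustment. It uses spectral normalization to improve policy smoothness and symmetry, and demonstrates high-speed circular flight and continuous multi-flip maneuvers.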

Comments For the experiment video, please refer to https://youtu.be/x1v7nD2iHIk

2508.00917 2026-06-09 cs.RO cs.CV cs.LG 版本更新

A Survey on Deep Multi-Task Learning in Connected Autonomous Vehicles

联网自动驾驶车辆中深度多任务学习综述

Jiayuan Wang, Farhad Pourpanah, Q. M. Jonathan Wu, Ning Zhang

发表机构 * Department of Electrical and Computer Engineering, University of Windsor（温莎大学电气与计算机工程系）； Department of Electrical and Computer Engineering, Queen’s University（皇后大学电气与计算机工程系）

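AI summary: A survey of deep multi-task learning in connected autonomous vehicles, covering perception, prediction, planning, control, and V2X communication and resource management; it analyzes the strengths and weaknesses of existing methods and points out future directions.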

DOI: 10.1109/COMST.2026.3699223

AI中文摘要

联网自动驾驶车辆（CAVs）必须同时执行多个任务，如感知、预测、规划和控制，以确保在复杂环境中安全可靠地导航。此外，通过车联万物（V2X）通信，可以实现CAVs之间的协同感知和驾驶，从而减轻单个车辆的局限性，同时也引入了严格的延迟、可靠性和带宽约束。传统上，任务使用单独的模型处理，这导致部署成本高、计算开销增加以及实现实时性能的挑战。多任务学习（MTL）最近成为一种有前景的解决方案，能够在统一模型中联合学习多个任务，从而提供更高的效率和资源利用率。据我们所知，本综述是首次专注于CAVs中深度MTL的全面回顾。我们首先概述CAVs和MTL以提供基础背景。然后，我们回顾了CAVs关键功能领域的MTL方法，包括感知、预测、规划、控制以及V2X通信和无线电资源管理（RRM）。对于前四个领域，我们将现有工作分为仅单车（车载）和V2X增强协同（多智能体）范式。我们进一步将V2X通信和RRM作为以通信为中心的MTL问题进行讨论。最后，我们讨论了现有方法的优势和局限性，识别了关键研究空白，并提供了旨在推进CAV系统MTL方法的未来研究方向。

英文摘要

Connected autonomous vehicles (CAVs) must simultaneously perform multiple tasks, such as perception, prediction, planning, and control, to ensure safe and reliable navigation in complex environments. Moreover, through vehicle-to-everything (V2X) communication, cooperative perception and driving among CAVs can be enabled, thereby mitigating the limitations of individual vehicles, while it also introduces stringent latency, reliability, and bandwidth constraints. Traditionally, tasks are addressed using separate models, which leads to high deployment costs, increased computational overhead, and challenges in achieving real-time performance. Multi-task learning (MTL) has recently emerged as a promising solution that enables the joint learning of multiple tasks within a unified model. This offers improved efficiency and resource utilization. To the best of our knowledge, this survey is the first comprehensive review focusing on deep MTL in CAVs. We begin with an overview of CAVs and MTL to provide foundational background. Then, we review MTL approaches across key functional domains in CAVs, including perception, prediction, planning, control, as well as V2X communications and radio resource management (RRM). For the first four domains, we categorize existing works under ego vehicle-only (onboard-only) and V2X-enhanced cooperative (multi-agent) paradigms. We further discuss V2X communications and RRM as communication-centric MTL problems. Finally, we discuss the strengths and limitations of existing methods, identify key research gaps, and provide future research directions aimed at advancing MTL methodologies for CAV systems.

2606.01205 2026-06-09 cs.RO 版本更新

ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning

ImagineUAV：通过世界-动作建模和动力学规划实现空中视觉语言导航

Xuchen Liu, Jiawei Huang, Shihao Xia, Bingxi Liu, Jinqiang Cui, Jiankun Yang

发表机构 * Pengcheng Laboratory（鹏城实验室）； School of Computer Science and Cyber Engineering（计算机科学与网络工程学院）； Guangzhou University（广州大学）； Southern University of Science and Technology（南方科技大学）

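AI summary: To address geometric inconsistency and dynamics mismatch in UAV vision-language navigation, the paper proposes a world-action modeling framework built on a latent video diffusion model, which generates future observations to infer 6-DoF motion and plan collision-free trajectories; with 1.3B parameters it outperforms prior methods on benchmarks and in real-world flights.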

Comments Video demo: https://www.youtube.com/watch?v=Ng1alP0yhc0

AI中文摘要

无人机的视觉语言导航（VLN）要求在部分可观测条件下将自由形式的指令接地到6自由度飞行中。虽然视觉-语言-动作（VLA）模型在语义推理方面表现出色，但由于几何不一致和动力学失配，它们存在脆弱性。为了解决这个问题，我们提出了ImagineUAV，一个利用级联世界-动作建模的想象驱动框架。ImagineUAV不是直接回归，而是采用潜视频扩散模型生成指令条件下的未来观测，明确想象环境演化，然后通过动作提取器推断6自由度运动。动力学规划器将这些估计优化为无碰撞轨迹。此外，步骤蒸馏推理流水线确保实时执行。仅凭1.3B参数，ImagineUAV在基准测试和实际飞行中优于先前的VLN和VLA基线，验证了想象驱动空中导航的实用性。

英文摘要

Vision-language navigation (VLN) for UAVs demands grounding free-form instructions into 6-DoF flight under partial observability. While Vision-Language-Action (VLA) models excel at semantic reasoning, they suffer from brittleness due to geometric inconsistency and dynamics mismatch. To address this, we propose ImagineUAV, an imagination-driven framework leveraging cascaded world-action modeling. Instead of direct regression, ImagineUAV employs a latent video diffusion model to generate instruction-conditioned future observations, explicitly imagining environmental evolution, from which 6-DoF motions are inferred via an action extractor. A kinodynamic planner then refines these estimates into collision-free trajectories. Additionally, a step-distilled inference pipeline ensures real-time execution. With only 1.3B parameters, ImagineUAV outperforms prior VLN and VLA baselines on benchmarks and real-world flights, validating the practicality of imagination-driven aerial navigation.

2509.20906 2026-06-09 cs.CV cs.RO 版本更新

Distant Object Localisation from Noisy Image Segmentation Sequences

基于噪声图像分割序列的远距离目标定位

Julius Pesonen, Arno Solin, Eija Honkavaara

发表机构 * Research Council of Finland（芬兰研究理事会）； RCF Flagship Forest–Human–Machine Interplay—Building Resilience, Redefining Value Networks and Enabling Meaningful Experiences (UNITE)（RCF旗舰森林-人类-机器交互——构建韧性，重新定义价值网络和赋能有意义体验（UNITE））

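AI summary: For distant object localisation, the paper proposes two methods, multi-view triangulation and particle filtering, the latter also providing shape and uncertainty estimates; combined with drone image segmentation and GNSS-based camera pose estimates, they enable reliable wildfire monitoring.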

AI中文摘要

基于相机测量序列的3D目标定位对于安全关键的监视任务（如基于无人机的野火监测）至关重要。使用相机检测到的目标定位通常可以通过专门的传感器配置或3D场景重建来解决。然而，对于远距离目标或受限于可用计算资源的任务，这两种解决方案都不可行。在本文中，我们表明该任务可以通过多视图三角测量或粒子滤波来解决，后者还提供形状和不确定性估计。我们使用3D模拟和基于无人机的图像分割序列以及基于全球导航卫星系统（GNSS）的相机姿态估计来研究这些解决方案。结果表明，将所提出的方法与现有的图像分割模型和无人机携带的计算资源相结合，可以为基于无人机的野火监测提供可靠的系统。所提出的解决方案与检测方法无关，还能快速适应类似任务。代码可在以下网址获取：https://fgi_nls.gitlab.io/public/distant-localisation

英文摘要

3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with specialised sensor configurations or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved with either multi-view triangulation or particle filters, with the latter also providing shape and uncertainty estimates. We studied the solutions using 3D simulation and drone-based image segmentation sequences with global navigation satellite system (GNSS) based camera pose estimates. The results suggest that combining the proposed methods with pre-existing image segmentation models and drone-carried computational resources yields a reliable system for drone-based wildfire monitoring. The proposed solutions are independent of the detection method, also enabling quick adaptation to similar tasks. Code is available at https://fgi_nls.gitlab.io/public/distant-localisation
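
Multi-view triangulation, one of the two approaches named in this abstract, can be sketched as a least-squares problem: find the point closest to a set of viewing rays. The sketch below is a generic textbook formulation, not the authors' implementation (see their repository for that). It assumes each observation has already been converted to a ray, with the camera position as origin and a direction toward the segmented object; the names `triangulate` and `solve3` and the sample coordinates are illustrative.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def triangulate(origins, directions):
    """Return the point minimizing the summed squared distance to all rays."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        n = math.sqrt(sum(c * c for c in d))
        d = [c / n for c in d]
        for i in range(3):
            for j in range(3):
                # (I - d d^T) projects onto the plane orthogonal to the ray.
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += p
                b[i] += p * o[j]
    return solve3(A, b)

# Noise-free example: three camera positions observing one distant target.
target = [120.0, -40.0, 5.0]
cameras = [[0.0, 0.0, 100.0], [50.0, 10.0, 100.0], [20.0, 60.0, 110.0]]
rays = [[t - c for t, c in zip(target, cam)] for cam in cameras]
estimate = triangulate(cameras, rays)
```

With noisy segmentations the rays no longer intersect and the same solve returns a compromise point. Unlike the particle filter mentioned in the abstract, this closed-form estimate carries no shape or uncertainty information.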

2602.10234 2026-06-09 physics.soc-ph cs.AI cs.RO 版本更新

Transforming Police-Car Swerving for Mitigating Isolated Stop-and-Go Traffic Waves: A Practice-Oriented Jam-Absorption Driving Strategy

将警车变道行为转化为缓解孤立走走停停交通波的实际拥堵吸收驾驶策略

Zhengbing He

发表机构 * Faculty of Science and Engineering, University of Nottingham Ningbo China（宁波诺丁汉大学理工学院）

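AI summary: Inspired by police-car swerving behavior, the paper proposes a practice-oriented jam-absorption driving (JAD) strategy that, by defining the JAD Triangle, suppresses an isolated stop-and-go wave using a single vehicle and two detectors; it systematically analyzes five key parameters and validates the strategy in simulation.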

AI中文摘要

走走停停交通波是高速公路拥堵的主要形式，对交通效率、安全风险和车辆排放造成严重且持续的负面影响。在各种高速公路交通管理策略中，拥堵吸收驾驶（JAD）——由专用车辆在被走走停停波捕获前执行“慢进快出”操作——已被提出作为抑制此类波传播的一种有前景的方法。然而，现有大多数JAD策略仍不实用，主要原因是缺乏对实施车辆和运行条件的考虑。受真实世界中警车变道行为的启发，本文首先引入单车辆双探测器拥堵吸收驾驶（SD-JAD）问题，然后基于JAD三角形的定义提出一种实用的JAD策略，将这种变道行为转化为能够抑制孤立走走停停波传播的交通控制策略。识别并系统分析了五个显著影响所提策略的关键参数，即JAD速度、流入交通速度、波宽、波速和波内速度。通过基于SUMO的仿真示例，进一步展示了如何仅使用两个固定路侧交通探测器在实际中测量这些参数。结果表明，所提出的JAD策略成功抑制了走走停停波的传播，且未引发二次波。本文有望推动JAD的实际实施迈出重要一步，将其从理论概念推进为可行且可部署的交通管理策略。

英文摘要

Stop-and-go traffic waves, a major form of freeway congestion, impose severe and persistent adverse impacts, including reduced traffic efficiency, increased safety risks, and elevated vehicle emissions. Among various freeway traffic management strategies, jam-absorption driving (JAD), in which a dedicated vehicle performs "slow-in" and "fast-out" maneuvers before being captured by a stop-and-go wave, has been proposed as a promising approach to suppressing the propagation of such waves. However, most existing JAD strategies remain impractical, primarily due to the lack of consideration of implementation vehicles and operational conditions. Inspired by real-world observations of police-car swerving behavior, this paper first introduces the Single-Vehicle Double-Detector Jam-Absorption Driving (SD-JAD) problem and then proposes a practical JAD strategy based on a definition of the JAD Triangle, transforming such behavior into a traffic control strategy capable of suppressing the propagation of an isolated stop-and-go wave. Five key parameters that significantly affect the proposed strategy, namely JAD speed, inflow traffic speed, wave width, wave speed, and in-wave speed, are identified and systematically analyzed. Using a SUMO-based simulation as an illustrative example, we further demonstrate how these parameters can be measured in practice using only two stationary roadside traffic detectors. The results show that the proposed JAD strategy successfully suppresses the propagation of a stop-and-go wave without triggering secondary waves. This paper is expected to take a significant step toward the practical implementation of JAD, advancing it from a theoretical concept to a feasible and deployable traffic management strategy.

2606.08104 2026-06-09 cs.RO 新提交

Reinforcement learning in linear embedding space unlocks generalizable control across soft robot configurations

线性嵌入空间中的强化学习解锁软体机器人配置的通用控制

Xinglong Zhang, Cong Li, Hangjie Mo, Yue Jiang, Xin Xu, Wei Jiang, Zhenshan Bing, Yihe Yang, Xiaojian Li, Yueneng Yang, Huimin Lu, Ling-li Zeng, Alois Knoll, Dewen Hu, Li Wen, Wei Pan

发表机构 * National University of Defense Technology（国防科技大学）； Hefei University of Technology（合肥工业大学）； Nanjing University (Suzhou Campus)（南京大学（苏州校区））； Technical University of Munich（慕尼黑工业大学）； Beihang University（北京航空航天大学）； Newcastle University（纽卡斯尔大学）

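AI summary: The paper proposes a reinforcement learning framework based on a shared linear Koopman embedding space that decouples control policies from robot morphology, enabling rapid transfer across 33 soft robot configurations with a 75-fold reduction in transfer samples and robust control under high-speed motion, heavy payloads, and multi-actuator faults.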

Comments An updated version of this paper has been accepted by Nature Communications

AI中文摘要

软体生物如章鱼和大象鼻子展现出显著的形态适应性，能够动态重构身体形状和刚度，并灵活调整控制策略以实现多功能行为。受这些生物系统启发，近几十年来出现了各种软体机器人，它们采用针对特定任务定制的不同材料、刚度和形态。尽管软体机器人的材料和结构设计取得了重大进展，但开发一个能够跨不同配置快速适应的通用控制框架仍然是一个长期挑战。现有控制器局限于固定配置，需要针对新配置进行费力的特定配置重新建模和策略重新设计。本文介绍了一种通用控制系统，通过共享线性Koopman嵌入空间中的强化学习，实现跨多种软体机器人配置的快速适应。通过将机器人动力学编码到该嵌入空间，我们的方法将控制策略与特定形态解耦，允许跨不同配置进行实时、无模型的策略适应，而无需从头重新训练。我们在33种不同的机器人配置上验证了该系统。该系统在跨配置的迁移样本量上减少了75倍，同时在高速运动、重负载和多执行器故障下保持鲁棒性能，并实现了软体机器人领域此前无法获得的现实技能。这项工作为多种软体机器人配置建立了一个统一且可适应的控制范式，弥合了机械可重构性与控制灵活性之间的差距，并可能为复杂物理系统中的通用控制提供更广泛的见解。

英文摘要

Soft-bodied organisms such as octopuses and elephant trunks exhibit remarkable morphological adaptability, dynamically reconfiguring body shape and stiffness, and flexibly adjusting their control strategies to enable versatile behaviors. Inspired by these biological systems, various soft robots have emerged in recent decades, featuring diverse materials, stiffnesses, and morphologies tailored to specific tasks. Despite substantial advances in the materials and structural designs of soft robots, developing a generalizable control framework capable of rapid adaptation across diverse configurations remains a long-standing challenge. Existing controllers are limited to fixed configurations, demanding laborious configuration-specific remodelling and policy redesign for new configurations. Here, we introduce a generalizable control system that enables rapid adaptation across diverse soft robot configurations via reinforcement learning in a shared linear Koopman embedding space. By encoding robot dynamics into this embedding space, our method decouples control policies from specific morphologies, allowing real-time, model-free policy adaptation across diverse configurations without retraining from scratch. We validate our system across 33 distinct robot configurations. Our system achieves a 75 times reduction in transfer samples across configurations, while sustaining robust performance under high-speed motion, heavy payloads, and multiactuator faults, and achieving real-world skills previously unattainable in soft robotics. This work establishes a unified and adaptable control paradigm for diverse soft robot configurations, bridging mechanical reconfigurability with control flexibility, and may offer broader insights for generalizable control in complex physical systems.
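
The idea of a linear Koopman embedding can be illustrated with a toy system in which nonlinear dynamics become exactly linear after lifting the state. The paper learns its embedding from robot data and trains a policy in that space. This sketch is only a hand-built example under assumed dynamics, and the names `step_nonlinear`, `lift`, and `step_linear` are hypothetical.

```python
def step_nonlinear(x1, x2, u, a=0.9, b=0.5, c=0.3):
    """Toy nonlinear system: x2 is driven by the square of x1 and by the input u."""
    return a * x1, b * x2 + c * x1 * x1 + u

def lift(x1, x2):
    """Embedding z = (x1, x2, x1^2); the dynamics are linear in these coordinates."""
    return [x1, x2, x1 * x1]

def step_linear(z, u, a=0.9, b=0.5, c=0.3):
    """z_next = A z + B u with A = [[a,0,0],[0,b,c],[0,0,a^2]] and B = [0,1,0]."""
    A = [[a, 0.0, 0.0], [0.0, b, c], [0.0, 0.0, a * a]]
    B = [0.0, 1.0, 0.0]
    return [sum(A[i][j] * z[j] for j in range(3)) + B[i] * u for i in range(3)]

def rollout_gap(x1=1.5, x2=-0.7, inputs=(0.2, -0.1, 0.0, 0.4, -0.3)):
    """Largest mismatch between the nonlinear rollout and the lifted linear one."""
    z = lift(x1, x2)
    gap = 0.0
    for u in inputs:
        x1, x2 = step_nonlinear(x1, x2, u)
        z = step_linear(z, u)
        gap = max(gap, abs(z[0] - x1), abs(z[1] - x2))
    return gap

gap = rollout_gap()
```

Because the lifted model is linear in `z` and `u`, a controller designed in the embedding space does not depend on the specific nonlinearity of the original coordinates, which is the kind of decoupling the abstract describes.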

2606.09451 2026-06-09 cs.RO cs.CV cs.LG 新提交

SIMPLE：基于仿真的人形机器人全身操作策略学习与评估

Songlin Wei, Zhenhao Ni, Jie Liu, Zhenyu Zhao, Junjie Ye, Hongyi Jing, Junkai Xia, Xiawei Liu, Michael Leong, Liang Heng, Di Huang, Yue Wang

发表机构 * USC Physical Superintelligence (PSI) Lab（南加州大学物理超级智能实验室）

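AI summary: The paper presents the SIMPLE simulation platform, which combines MuJoCo dynamics with IsaacSim rendering and includes 60 whole-body tasks, 50 indoor scenes, and more than 1,000 object assets. It supports automated trajectory generation and VR teleoperation data collection, integrates several mainstream policies, and reports a strong correlation between simulated and real-world performance as well as zero-shot transfer.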

AI中文摘要

人形基础模型的发展速度超过了我们评估它们的能力。虽然真实世界测试成本高昂且难以复现，但现有的仿真基准主要关注桌面或轮式机器人。针对全身人形操作的可扩展且可复现的基准仍然是一个开放问题。为此，我们提出了SIMPLE，一个用于人形策略学习和评估的统一仿真测试平台。SIMPLE将MuJoCo的精确接触丰富动力学与IsaacSim的光真实感渲染相结合。它提供了一个大规模环境，包含60个多样的全身任务、50个室内场景和超过1000个物体资产。为了促进可扩展的数据收集，该框架集成了两个数据生成流水线：通过运动规划自动生成轨迹和低延迟VR遥操作接口。我们进一步在SIMPLE中大规模集成并基准测试了主流人形策略，包括轻量级模仿网络、大型视觉-语言-动作（VLA）模型以及最新的世界动作模型（WAM）。我们的实验揭示了策略在仿真和真实世界中的性能之间存在强相关性。此外，我们证明了在SIMPLE中收集的数据上训练的策略可以在相似设置下零样本迁移到物理人形机器人上，为人形机器人研究提供了稳健且可复现的基础。

英文摘要

Humanoid foundation models are advancing faster than we can evaluate them. While real-world testing is expensive and difficult to reproduce, existing simulation benchmarks focus primarily on table-top or wheeled robots. A scalable and reproducible benchmark for whole-body humanoid loco-manipulation remains an open problem. To this end, we present SIMPLE, a unified simulation testbed for humanoid policy learning and evaluation. SIMPLE couples the accurate contact-rich dynamics of MuJoCo with the photorealistic rendering of IsaacSim. It provides a large-scale environment comprising 60 diverse whole-body tasks, 50 indoor scenes, and over 1,000 object assets. To facilitate scalable data collection, the framework integrates two data generation pipelines: automated trajectory generation via motion planning and a low-latency VR teleoperation interface. We further integrate and benchmark mainstream humanoid policies at scale in SIMPLE, including lightweight imitation networks, large vision-language-action (VLA) models, and recent world action models (WAMs). Our experiments reveal a strong correlation between policy performance in simulation and the real world. Furthermore, we demonstrate that policies trained on data collected in SIMPLE can be transferred zero-shot to physical humanoid robots under similar settings, providing a robust and reproducible foundation for humanoid robotics research.

2606.08564 2026-06-09 cs.RO 新提交

Real-IKEA: Physical Fidelity is the Prerequisite for Robust Manipulation

Real-IKEA：物理保真度是鲁棒操作的前提

Kunqi Xu, Zhenhao Huang, Siyuan Luo, Ziqiu Zeng, Fan Shi

发表机构 * National University of Singapore（新加坡国立大学）； Peking University（北京大学）

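AI summary: To address the lack of manipulation robustness caused by the physics gap between simulation and reality, the paper presents the Real-IKEA dataset and simulation framework, whose high-fidelity assets and resistance-calibrated configurations let reinforcement learning policies discover robust strategies that favor mechanical advantage.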

AI中文摘要

机器人操作的鲁棒性常常因简化仿真与充满阻力的现实世界之间的物理差距而失败。在这项工作中，我们强调在铰接交互中的物理真实性是鲁棒策略学习的重要因素。我们提出了Real-IKEA，一个以物理精度为首要目标的数据集和仿真框架。Real-IKEA提供了1,079个铰接资产配置，源自83个真实的IKEA把手和旋钮，经过细致的六步物理工作流程处理。对于接触几何精度，我们引入了一个双向表面偏差度量来量化碰撞网格。对于动力学真实性，我们建立了阻力校准配置，改变阻尼和摩擦。关键的是，我们通过强化学习策略证明，高保真资产能够发现鲁棒的“钩”和“杠杆”策略，这些策略优先考虑机械优势而非脆弱的摩擦拉动。总之，这些结果使Real-IKEA成为开发能够在铰接物体任务中达到人类水平鲁棒性的操作策略的关键基准。

英文摘要

Robotic manipulation robustness often founders on the physics gap between simplified simulations and the resistance-laden real world. In this work, we emphasize that physical realism in articulated interaction is an important ingredient for robust policy learning. We present Real-IKEA, a dataset and simulation framework designed with physical accuracy as a first-class goal. Real-IKEA provides 1,079 articulated asset configurations, derived from 83 authentic IKEA handles and knobs processed through a meticulous six-step physical workflow. For contact-geometry accuracy, we introduce a bidirectional surface-deviation metric to quantify collision meshes. For dynamics realism, we establish resistance-calibrated configurations that vary damping and friction. Crucially, we demonstrate through a Reinforcement Learning (RL) policy that high-fidelity assets enable the discovery of robust "hooking" and "levering" strategies that prioritize mechanical advantage over fragile friction-pulling. Together, these results position Real-IKEA as a critical benchmark for developing manipulation policies capable of human-level robustness in articulated object tasks.

2606.08688 2026-06-09 cs.RO cs.CV 新提交

Bridged SBI：纠正有偏低保真后验以实现经济高效的高保真推理

Gahee Kim, Yuki Kadokawa, Sandro M. Alcantara Tacora, Taro Abe, Daisuke Endo, Genki Yamauchi, Takeshi Hashimoto, Takamitsu Matsubara

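AI summary: To address the high computational cost of high-fidelity particle simulators, the paper proposes Bridged SBI, which uses the low-fidelity posterior to guide high-fidelity inference and corrects its bias with a residual bridge, yielding accurate posterior estimates cost-effectively.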

AI中文摘要

基于粒子的模拟器的精确校准对于机器人土方模拟至关重要，但由于该任务的高度非线性粒子动力学和传统模拟器的黑箱性质，分析校准具有挑战性。尽管基于模拟的推理（SBI）可以仅通过前向模拟估计模拟参数的后验分布，但将SBI直接应用于高保真（HF）粒子模拟器通常在计算上不可行。使用较粗颗粒的低保真（LF）模拟器可以降低这一成本，但颗粒大小和数量的变化会改变再现相同观测所需的参数值，从而产生有偏的LF后验。我们提出了Bridged SBI，它利用有偏但有信息的LF后验来指导HF推理。该方法首先使用廉价的LF模拟识别一个粗略的高密度参数区域，然后学习一个局部残差桥，通过纠正LF-HF差异将LF后验样本转移到HF一致区域。我们分析了顺序多保真SBI（Naive-MF）在直接依赖LF后验而不进行差异纠正时如何遭受LF诱导的后验覆盖不足。然后我们展示了Bridged SBI旨在通过残差纠正显式建模LF-HF差异来缓解这一问题。在模拟到模拟的粒子参数校准和真实土壤观测的实到模拟校准上的实验表明，与仅HF的SBI或Naive-MF基线相比，Bridged SBI在有限的HF模拟成本下产生了更准确和可靠的HF后验。

英文摘要

Accurate calibration of particle-based simulators is crucial for robotic earthwork simulation, but analytical calibration is challenging due to this task's highly nonlinear particle dynamics and the black-box nature of conventional simulators. Although simulation-based inference (SBI) can estimate posterior distributions over simulation parameters solely from forward simulations, applying SBI directly to high-fidelity (HF) particle simulators is often computationally prohibitive. Low-fidelity (LF) simulators with coarser particles can reduce this cost, but changes in particle size and particle count shift the parameter values needed to reproduce the same observation, producing biased LF posteriors. We propose Bridged SBI, which leverages a biased but informative LF posterior to guide HF inference. This method first uses inexpensive LF simulations to identify a coarse high-density parameter region, and then it learns a local residual bridge to transport LF posterior samples toward HF-consistent regions by correcting the LF--HF discrepancy. We analyze how sequential multi-fidelity SBI (Naive-MF) can suffer from LF-induced posterior miscoverage when it directly relies on the LF posterior without discrepancy correction. We then show that Bridged SBI is designed to alleviate this issue by explicitly modeling the LF--HF discrepancy through residual correction. Experiments on both sim-to-sim particle-parameter calibration and real-to-sim calibration with real soil observation show that Bridged SBI produces more accurate and reliable HF posteriors than HF-only SBI or the Naive-MF baseline, especially under limited HF simulation costs.

2606.09028 2026-06-09 cs.CV cs.AI cs.RO 交叉投稿

ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models

ATM：用于诊断和改进潜在世界模型的动作一致性转移矩阵

Jiaheng Chen

发表机构 * School of Software, Northeastern University（东北大学软件学院）

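AI summary: The paper proposes the ATM matrix, which compares action information in real and predicted latent transitions through lightweight probes to diagnose world-model quality without a simulator, and introduces AITS, which uses action identifiability as a training signal to improve downstream planning.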

Comments 13 pages, 3 figures, 6 tables

AI中文摘要

潜在世界模型越来越多地用于控制和目标条件规划，但评估其学习到的表示是否对规划有用通常需要与CEM等规划器耦合的慢速模拟器评估。这种评估是黑盒且依赖于模型复杂度的：在相同协议下，不同世界模型每个检查点可能需要几分钟到几小时。在这项工作中，我们提出了ATM，一个动作一致性转移矩阵，用于诊断潜在转移是否保留了与规划相关的动作语义。ATM通过轻量级事后探针比较真实编码转移和模型预测转移中的动作信息，生成一个可解释的矩阵，揭示表示质量、转移域不一致性和失败模式，而无需模拟器 rollout。它还可以折叠成一个简单的筛选分数，用于跨检查点、变体和世界模型的内部任务排名。当真实成功差距显著时，ATM实现了高度可靠的成对排名，同时将分钟到小时的CEM评估减少到秒级的转移分析，在我们的设置中实现了超过100倍的加速。我们进一步引入了AITS，表明动作可识别性不仅具有诊断作用，而且是一种有用的训练信号，可以在不改变规划器的情况下改进下游规划。

英文摘要

Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations are useful for planning usually requires slow, planner-coupled simulator evaluation with CEM or similar planners. Such evaluation is black-box and model-complexity-dependent: under the same protocol, different world models may require minutes to hours per checkpoint. In this work, we propose ATM, an Action-Consistency Transfer Matrix for diagnosing whether latent transitions preserve action semantics relevant to planning. ATM compares action information in real encoded transitions and model-predicted transitions through lightweight post-hoc probes, producing an interpretable matrix that reveals representation quality, transition-domain inconsistency, and failure modes without simulator rollout. It can also be collapsed into a simple screening score for within-task ranking across checkpoints, variants, and world models. When the true success gap is non-trivial, ATM achieves highly reliable pairwise ranking, while reducing minutes-to-hours CEM evaluation to seconds-level transition analysis, yielding more than 100x speedup in our setup. We further introduce AITS, showing that action-identifiability is not only diagnostic but also a useful training signal for improving downstream planning without changing the planner.

2606.01478 2026-06-09 cs.RO cs.AI cs.MA cs.SY eess.SY 版本更新

Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

Crazyflow: 基于JAX的精确、GPU加速、可微分的无人机模拟器

Martin Schuck, Marcel P. Rath, Yufei Hua, Abhishek Goudar, SiQi Zhou, Angela P. Schoellig

发表机构 * Technical University of Munich（慕尼黑工业大学）； University of Toronto（多伦多大学）； Simon Fraser University（西蒙弗雷泽大学）

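AI summary: The paper presents the Crazyflow simulator, whose GPU-accelerated, differentiable design enables very fast single-drone simulation and swarm simulation with thousands of drones, supports analytical-gradient-based policy learning and sampling-based obstacle avoidance, and can even train a flight recovery policy from scratch in 0.38 seconds.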

Comments Fix minor metadata mistakes

AI中文摘要

来自仿真的高质量、大规模合成数据正成为推动机器人算法能力提升的基石。虽然空中机器人模拟器已独立发展出支持保真度、可微分性和集群等专门需求，但缺少一个能够跨所有领域合成数据的统一平台。在这项工作中，我们提出了Crazyflow，一个旨在突破空中机器人算法开发极限的模拟器，涵盖从基于模型到数据驱动的方法、从基于梯度到基于采样的方法、以及从单智能体到多智能体系统。与现有最先进的无人机模拟器相比，它实现了单个无人机超过一个数量级的速度提升，并能模拟数千个包含4000架无人机的集群。真实世界实验表明，Crazyflow既支持基于解析梯度的策略学习（无需域随机化即可实现亚厘米级轨迹跟踪精度），也支持每秒超过5亿步的采样避障。打破传统的先训练后部署范式，我们展示了其前所未有的速度甚至能够实现飞行中的强化学习：通过将物理无人机抛向空中，在0.38秒内从零开始训练恢复策略，成功稳定了无人机。Crazyflow支持多级仿真抽象，直接兼容所有开源Crazyflie模型，并通过提供轻量级系统辨识流程，支持跨自定义无人机平台和应用的快速重新配置。通过同时推动精度、速度和可微分性，Crazyflow作为合成数据生成的开源资源，具备在线执行学习和优化的大规模并行化新兴能力，为新型算法开发打开了大门。

英文摘要

High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.

2606.07118 2026-06-09 cs.RO 版本更新

QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation

QuadVerse：一种对齐视觉-物理现实用于四足仿真的集成框架

Yuxiang Chen, Yuanhao Wang, Ziheng Zhang, Meng Zhang, Yu Liu, Yufei Jia, Tiancai Wang, Erjin Zhou, Jin Xie

发表机构 * Nanjing University（南京大学）； BUPT（北京邮电大学）； DEXMAL ； Tsinghua University（清华大学）

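AI summary: The paper proposes the QuadVerse framework, which uses reconstructed scenes to calibrate vision, physics, and actuators, reduces the sim-to-real gap with 3DGS and contact calibration, and enables zero-shot deployment of visual navigation policies.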

AI中文摘要

仿真对于机器人学习至关重要，然而仿真到现实的差距仍然是一个主要瓶颈。现有方法通常单独处理视觉或动力学差距，忽略了这些个体不匹配如何在机器人状态演化过程中累积和传播。在本文中，我们介绍QuadVerse，一个集成框架，使用重建场景作为校准基底，对齐视觉感知、物理交互和致动器动力学。从捕获的RGB视频中，我们重建几何约束的3D高斯泼溅（3DGS）场景，支持批量的照片级真实感自我视角渲染和可直接用于碰撞检测的语义网格提取。这些网格还支持接触校准：先初始化空间变化的摩擦先验，再通过基于轨迹的后验搜索对其进行细化。为了解决剩余的致动器差异，QuadVerse通过在接触校准的地形上重放真实世界轨迹来训练残差动力学补偿器，减少地形引起的接触误差与致动器非理想特性之间的纠缠。实验表明，QuadVerse相比相关基线提高了重建质量和运动跟踪性能。在此基础之上，我们展示了无需针对特定任务进行真实世界rollout的鲁棒零样本视觉导航策略部署。

英文摘要

Simulation is central to robot learning, yet the sim-to-real gap remains a major bottleneck. Existing approaches often tackle visual or dynamic gaps separately, overlooking how these individual mismatches accumulate and propagate throughout the robot's state evolution. In this paper, we introduce QuadVerse, an integrated framework that uses reconstructed scenes as a calibration substrate for aligning visual perception, physical interaction, and actuator dynamics. From captured RGB videos, we reconstruct geometry-constrained 3D Gaussian Splatting (3DGS) scenes that support batched photorealistic ego-view rendering and collision-ready semantic mesh extraction. The meshes further enable contact calibration by initializing spatially varying friction priors and refining them through trajectory-based posterior search. To address remaining actuator discrepancies, QuadVerse trains a residual dynamics compensator by replaying real-world trajectories on the contact-calibrated terrain, reducing the entanglement between terrain-induced contact errors and actuator non-idealities. Experiments show that QuadVerse improves reconstruction quality and locomotion tracking over relevant baselines. Leveraging this foundation, we demonstrate robust zero-shot visual-navigation policy deployment without task-specific real-world rollouts.

2503.14229 2026-06-09 cs.AI cs.CV cs.RO 版本更新

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

HA-VLN 2.0：面向离散与连续环境中动态多人交互的人类感知导航开放基准与排行榜

Yifei Dong, Fengyi Wu, Qi He, Lingdong Kong, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Alexander G Hauptmann, Zhi-Qi Cheng

发表机构 * University of Science and Technology of China（中国科学技术大学）

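AI summary: The paper proposes the unified HA-VLN 2.0 benchmark; through a standardized task, the HAPS 2.0 dataset and simulators, benchmarks on 16,844 socially grounded instructions, and real-world robot experiments, it shows that explicit social modeling improves navigation robustness and reduces collisions.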

Comments 35 pages, 20 figures, website: https://f1y1113.github.io/HA-VLN-webpage/

AI中文摘要

视觉与语言导航（VLN）主要研究离散或连续空间，很少关注动态拥挤环境。我们提出HA-VLN 2.0，一个引入显式社会感知约束的统一基准。我们的贡献包括：（i）标准化任务和指标，同时捕捉目标准确性和个人空间遵守；（ii）HAPS 2.0数据集和模拟器，建模多人交互、室外环境和更精细的语言-运动对齐；（iii）在16844条社会性指令上的基准测试，揭示领先代理在人类动态和部分可观测性下性能急剧下降；（iv）真实机器人实验验证模拟到现实的迁移，以及一个开放排行榜实现透明比较。结果表明，显式社会建模提高了导航鲁棒性并减少了碰撞，强调了以人为中心方法的必要性。通过发布数据集、模拟器、基线和协议，HA-VLN 2.0为安全、人类感知的导航研究提供了坚实基础。

英文摘要

Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, human-aware navigation research.

2606.08414 2026-06-09 cs.RO cs.AI 新提交

PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation

PACT: 具身操作中扩散策略的自我演化物理安全对齐

Lingxuan Wu, Zijian Zhu, Lizhong Wang, Chengyang Ying, Huayu Chen, Xiao Yang, Fangming Liu, Jun Zhu

发表机构 * Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua-Bosch Joint ML Center, THBI Lab, BNRist Center, Tsinghua University, Beijing, 100084, China（计算机科学与技术系，人工智能研究院，清华-博世联合机器学习中心，THBI实验室，BNRist中心，清华大学，北京，100084，中国）； Peng Cheng Laboratory, 518108, China（鹏城实验室，518108，中国）

AI总结提出PACT框架，通过自演化后训练将预训练扩散策略投影到约束可行区域，无需演示数据或任务奖励，在降低31.0%安全违规的同时提升30.7%任务成功率。

AI中文摘要

扩散策略在机器人操作中取得了显著成功，但常常无法满足安全部署所需的严格物理约束。现有方法要么在训练期间过早施加安全约束，要么在测试时通过外部护栏被动应对，限制了策略的表达能力和整体可扩展性。我们提出物理安全对齐约束轨迹（PACT），这是一个自我演化的后训练框架，将预训练扩散策略投影到约束可行区域，无需访问演示数据或任务奖励。PACT通过跨时间步密集监督的反向KL目标将约束梯度蒸馏到扩散模型中。它采用课程学习逐步收紧约束，同时保持理论上界定的策略偏移和单调改进，减轻了灾难性遗忘带来的安全-性能权衡。在模拟和真实世界的具身操作基准测试中，PACT平均减少31.0%的安全违规，同时将任务成功率提升30.7%。

英文摘要

Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints required for safe deployment. Existing approaches impose safety either prematurely during training or reactively via external guardrails at test time, limiting policy expressivity and overall scalability. We propose Physical safety Alignment for Constrained Trajectories (PACT), a self-evolving post-training framework that projects pretrained diffusion policies onto constraint-feasible regions without accessing demonstration data or task rewards. PACT distills constraint gradients into the diffusion model through a reverse-KL objective with dense supervision across timesteps. It incorporates a curriculum that progressively tightens constraints while maintaining theoretically bounded policy shift and monotone improvement, mitigating the safety-performance trade-off from catastrophic forgetting. On simulated and real-world embodied manipulation benchmarks, PACT significantly reduces safety violations by 31.0% on average while improving task success by 30.7%.

2606.08508 2026-06-09 cs.RO cs.AI 新提交

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

ActProbe：面向生成式机器人策略早期故障检测的动作空间探针

Bingjia Huang, Xiangyu Li, Xiang Wang, Liang Mi, Zixu Hao, Weijun Wang, Hao Wu, Kun Li, Yunxin Liu, Ting Cao

发表机构 * Institute for AI Industry Research (AIR), Tsinghua University（清华大学人工智能产业研究院（AIR））； University of Electronic Science and Technology of China（电子科技大学）； Nanjing University（南京大学）

AI总结提出ActProbe，一种轻量级纯动作空间故障检测器，利用时间一致性误差和动作块幅度两个信号，通过LSTM-MLP架构预测故障，在多种生成式策略上提升F1-时效性帕累托前沿平均超体积增益+12.7%，并加速强化学习微调。

Comments 24 pages, 9 figures, 11 tables, Project page: https://air-embodied-brain.github.io/actprobe

AI中文摘要

生成式机器人策略在部署时不可预测地失败：它们在关键时刻犹豫不决，偏离任务，或执行不可恢复的动作。现有的在线故障检测器要么需要白盒访问策略内部，要么通过重采样和观测侧信号增加运行时开销。我们的实证分析表明，发射的动作块本身已经携带了生成式机器人策略即将发生故障的强预测信号。受此观察启发，我们引入了ActProbe，一种轻量级的纯动作空间检测器，它使用单次前向传递中可用的两个紧凑信号：连续动作块之间的时间一致性误差（TCE）和当前块的动作块幅度（ACM）。ActProbe通过任务条件化的LSTM-MLP架构将这些信号映射到每步故障概率。在一系列多样化的生成式机器人策略和基准测试中，ActProbe在故障变得视觉可识别之前发出警报，相比内部和外部特征基线，将故障检测的F1-时效性帕累托前沿平均超体积增益提高了+12.7%，在未见任务上早期检测ROC-AUC领先+9.0%。ActProbe进一步迁移到部署中，预测未见真实机器人拾取任务上的故障，并以2.9倍更少的环境交互加速了强化学习微调（PPO）。

英文摘要

Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable actions. Existing online failure detectors either require white-box access to policy internals or add runtime overhead through resampling and observation-side signals. Our empirical analysis shows that emitted action chunks themselves already carry strong predictive signal for impending failures in generative robot policies. Motivated by this observation, we introduce ActProbe, a lightweight, pure action-space detector that uses two compact signals available from a single forward pass: Temporal Consistency Error (TCE) between consecutive action chunks and Action Chunk Magnitude (ACM) of the current chunk. ActProbe maps these signals to per-step failure probabilities with a task-conditioned LSTM-MLP architecture. Across a diverse suite of generative robot policies and benchmarks, ActProbe raises alerts before failures become visually recognizable, improving the accuracy (F1)-timeliness Pareto frontier of failure detection by an average hypervolume gain of +12.7% over both internal- and external-feature baselines, with a +9.0% early-detection ROC-AUC lead on unseen tasks. ActProbe further transfers to deployment, predicting failures on unseen real-robot pick tasks and accelerating RL fine-tuning (PPO) with 2.9x fewer environment interactions.
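The two action-space signals named in the abstract can be sketched as follows. The abstract does not give formulas, so both definitions here are assumptions made for illustration: TCE is taken as the mean L2 gap between the previous and current chunk on the time steps they both predict, and ACM as the mean L2 norm of the actions in the current chunk. The function names and the `stride` parameter (how many steps are executed before replanning) are hypothetical.

```python
import math


def l2(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def action_chunk_magnitude(chunk):
    """Mean L2 norm of the actions in one chunk (assumed definition of ACM)."""
    return sum(l2(a) for a in chunk) / len(chunk)


def temporal_consistency_error(prev_chunk, curr_chunk, stride):
    """Mean L2 gap on the steps both chunks predict (assumed definition of TCE).

    When the policy replans every `stride` steps, prev_chunk[stride + i]
    and curr_chunk[i] describe the same future time step.
    """
    overlap = len(prev_chunk) - stride
    if overlap <= 0:
        raise ValueError("chunks do not overlap for this stride")
    gaps = [
        l2([p - c for p, c in zip(prev_chunk[stride + i], curr_chunk[i])])
        for i in range(overlap)
    ]
    return sum(gaps) / overlap


# Two 4-step chunks of 2-D actions, replanned every 2 steps.
prev_chunk = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
curr_chunk = [[2.0, 0.0], [3.0, 4.0], [4.0, 0.0], [5.0, 0.0]]

tce = temporal_consistency_error(prev_chunk, curr_chunk, stride=2)
acm = action_chunk_magnitude(curr_chunk)
print(tce, acm)
```

In a detector of the kind described, a sequence of such (TCE, ACM) pairs would be fed to the learned LSTM-MLP head; that part is not sketched here.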

2606.09350 2026-06-09 cs.RO cs.CV 新提交

Safe-RULE：安全强化反学习

Shixiong Jiang, Taozheng Zhu, Fanxin Kong

发表机构 * University of Notre Dame（圣母大学）

AI总结针对离线安全强化学习易受数据投毒攻击的问题，提出Safe-RULE框架，通过反学习移除恶意样本影响，无需从头训练或访问原始环境，实验证明能有效提升安全性。

Comments 20 pages, 3 figures

2510.06492 2026-06-09 cs.RO 版本更新

How Well Do Latent World Models Understand Partially Observable Safety Constraints?

潜在世界模型如何理解部分可观测的安全约束？

Matthew Kim, Kensuke Nakamura, Andrea Bajcsy

发表机构 * UC San Diego（加州大学圣地亚哥分校）； Carnegie Mellon University（卡内基梅隆大学）

AI总结研究潜在世界模型在部分可观测安全约束下的故障模式，提出互信息度量和滚动预测度量来诊断估计间隙和预测间隙，并通过多模态监督和共形风险校准缓解问题，提高机器人操作安全性。

Comments 10 tables 5 figures

AI中文摘要

潜在世界模型是一种直接从高维观测中学习状态表示和动态的有前途的方法，使得在难以建模的环境中实现机器人控制成为可能。然而，控制性能最终取决于潜在表示是否编码了任务所需的信息。在这项工作中，我们研究了潜在空间安全控制问题，并展示了当安全相关信息未在潜在状态中保留时，部分可观测性如何导致控制失败。具体来说，我们识别出两种世界模型故障模式：估计间隙，即当前观测未揭示安全关键量（例如，烹饪任务中的温度）；以及预测间隙，即故障一旦发生即可观测，但无法从可用观测中可靠地预测。我们为这些间隙引入了两种诊断方法：一种基于互信息的安全可观测性度量，以及一种基于滚动预测的未来安全可预测性度量。最后，我们针对每种故障模式提出了缓解策略：针对估计间隙的特权多模态监督，以及针对预测间隙的共形风险校准。通过两个硬件案例研究——使用单模态RGB世界模型和多模态RGB+触觉及RGB+热变体——我们展示了这些缓解策略在部分可观测性下提高了Franka Research 3机械臂在具有挑战性的烹饪任务中的安全性，尽管增加了保守性。更广泛地说，我们的工作提出了一个问题：世界模型状态表示何时足以实现可靠的机器人控制。

英文摘要

Latent world models are a promising approach for learning state representations and dynamics directly from high-dimensional observations, enabling robot control in hard-to-model settings. However, control performance ultimately depends on the latent representation encoding the required information for the task. In this work, we study latent-space safe control problems and show how partial observability can induce control failures when safety-relevant information is not preserved in the latent state. Specifically, we identify two world model failure modes: estimation gaps, where current observations do not reveal safety-critical quantities (e.g., temperature in a cooking task), and prediction gaps, where failures are observable once they occur but cannot be reliably anticipated from available observations. We introduce two diagnostics for these gaps: a mutual-information-based measure of safety observability and a rollout-based measure of future safety predictability. Finally, we present mitigation strategies for each failure mode: privileged multimodal supervision for estimation gaps and conformal risk calibration for prediction gaps. Across two hardware case studies -- using unimodal RGB world models and multimodal RGB+Tactile and RGB+Thermal variants -- we show that these mitigation strategies improve the safety of a Franka Research 3 manipulator on challenging cooking tasks under partial observability, albeit with increased conservativeness. More broadly, our work raises the question of when world model state representations are sufficient for reliable robot control.
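The abstract does not say how the conformal calibration is carried out. As a generic illustration only, and not the authors' procedure, the sketch below uses the standard split-conformal quantile. It assumes a held-out set of states that turned out to be unsafe, each with the safety margin the model had predicted for it (higher means the model believed the state was safer). The function name and the sample values are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    If the scores are exchangeable with a future score, that future score
    falls at or below the returned value with probability at least 1 - alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples to certify this alpha.
        return float("inf")
    return sorted(calibration_scores)[rank - 1]


# Predicted safety margins for 7 calibration states that were actually unsafe.
unsafe_margins = [0.10, -0.20, 0.30, 0.00, 0.05, -0.10, 0.20]
threshold = conformal_threshold(unsafe_margins, alpha=0.25)
print(threshold)


def flag_unsafe(predicted_margin):
    """Treat any state whose predicted margin is at or below the threshold as unsafe."""
    return predicted_margin <= threshold
```

Raising the cutoff from 0 to the calibrated threshold makes the monitor flag more states, which is one way the increased conservativeness mentioned in the abstract can arise.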

2605.26452 2026-06-09 cs.RO cs.LG cs.SY eess.SY 版本更新

Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

鲁棒Koopman控制屏障滤波器用于安全演员-评论家强化学习

Dhruv S. Kushwaha, Zoleikha A. Biron

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出鲁棒Koopman-CBF SAC框架，通过数据驱动学习Koopman预测器、构建提升空间中的仿射CBF约束并利用二次规划安全层实施，同时通过投影残差裕度处理近似误差，实现零约束违反或减少违规。

Comments 17 pages, 7 figures

AI中文摘要

机器人系统的安全强化学习需要策略在训练和部署期间满足状态和输入约束的同时提高任务性能。控制屏障函数通过最小侵入性安全滤波器提供强制执行前向不变性的原则性机制，但其在无模型强化学习中的应用受限于对精确动力学和手工设计屏障证书的需求。我们提出鲁棒Koopman-CBF SAC，一种安全滤波的演员-评论家框架，从数据中学习有限维Koopman预测器，在提升空间中构建仿射CBF约束，并通过二次规划安全层强制执行。为考虑有限维Koopman近似误差，使用从留出轨迹数据估计的投影残差裕度收紧CBF条件。评论家在执行的安操作上训练，而演员则被正则化向Koopman-CBF可行集，减少训练中对滤波器的依赖。在安全控制基准测试中，该方法在CartPole稳定和跟踪上实现零约束违反，同时匹配或超过无约束SAC的回报。在高维Safety Gymnasium运动任务中，该方法在某些设置下减少了违规，但也暴露了一阶速度屏障和线性EDMD模型的重要局限性，推动了高阶和多步Koopman-CBF扩展。这些结果表明，鲁棒Koopman-CBF滤波器是无模型强化学习和可证明安全之间的有前途桥梁，同时阐明了此类滤波器保持有效的结构条件。所有代码可在\href{https://github.com/DhruvKushwaha/Koopman-CBF-Soft-Actor-Critic}{Github仓库}获取。

英文摘要

Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model-free RL is limited by the need for accurate dynamics and hand-designed barrier certificates. We propose Robust Koopman-CBF SAC, a safety-filtered actor--critic framework that learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic-program safety layer. To account for finite-dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held-out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman-CBF feasible set, reducing dependence on the filter over training. Across safe-control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high-dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first-order velocity barriers and linear EDMD models, motivating high-order and multi-step Koopman-CBF extensions. These results suggest that robust Koopman-CBF filters are a promising bridge between model-free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective.
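The safety layer described above is a quadratic program. For the special case of a single affine constraint on the action and no input bounds, that QP has a closed-form solution, sketched below under stated assumptions. The constraint is written as a · u ≥ b + margin, where `margin` stands in for the residual margin used to tighten the CBF condition. The names `cbf_filter`, `a`, `b`, and `margin` are illustrative and are not taken from the paper or its repository; a real implementation with several constraints would call a QP solver instead.

```python
def cbf_filter(u_nom, a, b, margin=0.0):
    """Minimally modify u_nom so that a . u >= b + margin.

    Solves  min ||u - u_nom||^2  s.t.  a . u >= b + margin  in closed form,
    which is valid for exactly one affine constraint and unbounded inputs.
    """
    c = b + margin
    slack = c - sum(ai * ui for ai, ui in zip(a, u_nom))
    if slack <= 0.0:
        # The nominal action already satisfies the tightened constraint.
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("infeasible: a is zero but b + margin > 0")
    # Project onto the constraint boundary along the direction a.
    return [ui + slack / norm_sq * ai for ui, ai in zip(u_nom, a)]


a = [0.0, 2.0]

# Nominal action violates the tightened constraint and is corrected.
u_safe = cbf_filter([1.0, 0.0], a, b=1.0, margin=1.0)

# Nominal action is already safe and passes through unchanged.
u_same = cbf_filter([1.0, 3.0], a, b=1.0, margin=1.0)
print(u_safe, u_same)
```

A larger margin shrinks the set of actions that pass through unchanged, so the filter intervenes more often when the model error estimate is larger.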

2511.00934 2026-06-09 cs.LO cs.RO 版本更新

pacSTL: PAC-Bounded Signal Temporal Logic from Data-Driven Reachability Analysis

pacSTL: 基于数据驱动可达性分析的PAC有界信号时序逻辑

Hanna Krasowski, Elizabeth Dietrich, Emir Cem Gezer, Roger Skjetne, Asgeir Johan Sørensen, Murat Arcak

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出pacSTL框架，结合PAC有界可达集预测与区间STL，通过优化问题计算原子鲁棒性上下界并传播，实现规范级别的PAC有界鲁棒性评估，用于不确定动态系统的验证与监控。

AI中文摘要

信号时序逻辑（STL）是一种用于从连续信号中指定动态系统行为的表达性语言。然而，标准STL的一个局限性是其固有的确定性语义，这使其无法处理不确定性。现有克服这一局限的方法计算成本高，且限制了实时能力，需要在原子命题或规范改变时重复轨迹采样或重新设计原子命题上的概率分布。我们引入了pacSTL，一个将可能近似正确（PAC）有界可达集预测与STL的区间扩展相结合的框架。pacSTL通过求解PAC有界可达集上的优化问题来计算原子鲁棒性值的下界和上界，并通过时序逻辑算子传播这些界。得到的评估在规范级别产生一个PAC有界鲁棒性区间。我们通过验证四旋翼飞行场景和运行时监控海上导航规范来展示pacSTL的效率和相关性。

英文摘要

Signal Temporal Logic (STL) is an expressive language for specifying behaviors of dynamical systems from continuous signals. However, a limitation of standard STL is its inherently deterministic semantics, which prevents it from accommodating uncertainty. Existing approaches to overcome this limitation are computationally costly and limit real-time capability, requiring repeated trajectory sampling or the redesign of probability distributions over atomic propositions whenever the atomic propositions or specifications change. We introduce pacSTL, a framework that combines Probably Approximately Correct (PAC)-bounded reachable set predictions with an interval extension of STL. pacSTL computes lower and upper bounds on atomic robustness values by solving optimization problems over PAC-bounded reachable sets and propagates the bounds through the temporal logic operators. The resulting evaluation yields a PAC-bounded robustness interval at the specification level. We demonstrate the efficiency and relevance of pacSTL by verifying a quadrotor flight scenario and runtime monitoring a maritime navigation specification.
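The bound-propagation step described above can be illustrated with plain interval arithmetic over the usual quantitative STL semantics (negation flips sign, conjunction is a minimum, disjunction is a maximum, "always" is a minimum over a window, "eventually" is a maximum). This is a minimal sketch that assumes the per-step robustness intervals of an atomic predicate are already available; computing them from PAC-bounded reachable sets is not shown. All function names and numbers are hypothetical.

```python
def neg(iv):
    """Negation: robustness flips sign, so the bounds swap."""
    lo, hi = iv
    return (-hi, -lo)


def conj(x, y):
    """Conjunction: min of robustness values, applied bound-wise."""
    return (min(x[0], y[0]), min(x[1], y[1]))


def disj(x, y):
    """Disjunction: max of robustness values, applied bound-wise."""
    return (max(x[0], y[0]), max(x[1], y[1]))


def always(trace, start, end):
    """G_[start, end]: min over the window, applied bound-wise."""
    window = trace[start:end + 1]
    return (min(lo for lo, _ in window), min(hi for _, hi in window))


def eventually(trace, start, end):
    """F_[start, end]: max over the window, applied bound-wise."""
    window = trace[start:end + 1]
    return (max(lo for lo, _ in window), max(hi for _, hi in window))


# Robustness intervals of one atomic predicate over four time steps,
# e.g. "distance to obstacle minus a 1 m clearance" (values are made up).
clearance = [(0.5, 1.0), (0.2, 0.8), (-0.1, 0.4), (0.3, 0.9)]

g = always(clearance, 0, 3)
f = eventually(clearance, 0, 3)
print(g, f, neg(g))
```

An interval whose lower bound is positive certifies satisfaction at the stated confidence level, one whose upper bound is negative certifies violation, and an interval that straddles zero, as `g` does here, is inconclusive.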

2606.07902 2026-06-09 cs.RO 新提交

End-to-End Control of a Powered Knee-Ankle Prosthesis Towards Unified, Tuning-Free Assistance

动力膝踝假肢的端到端控制：迈向统一、免调参的辅助

John Shim, Christoph Nuesslein, Sixu Zhou, Hanjun Kim, Kinsey Herrin, Aaron Young

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Woodruff School of Mechanical Engineering（伍德拉夫机械工程学院）； Institute for Robotics and Intelligent Machines（机器人与智能机器研究所）

AI总结本文提出一种端到端假肢控制器，利用时序卷积网络从机载传感器估计连续执行器信号，无需意图分类器和个体调参，在多种地形和步态模式下实现统一、模式自适应的假肢辅助。

Comments 7 pages, 6 figures

AI中文摘要

动力假肢通常依赖需要大量手动调参和显式模式分类的阻抗控制器。在这项工作中，我们展示了端到端假肢控制器的实时部署，该控制器从机载传感器估计连续执行器信号，消除了对意图分类器和个体调参的需求。时序卷积网络在来自18名经股截肢者的多地形数据集上训练，并在五种运动模式下实时部署。四名参与者（三名健全人，一名经股截肢者）在平地、斜坡上坡、斜坡下坡以及楼梯上坡和下坡上行走。在平地行走中，部署的控制器再现了踝关节峰值力矩随步行速度变化的训练数据缩放关系（部署：0.85 Nm/kg per m/s，p = 0.001；训练：0.96 Nm/kg per m/s，95% CI [0.42, 1.50]，p = 0.002），排除了一个因异常假肢负载导致的离群点。在斜坡上坡时，控制器使膝关节预屈曲随坡度变化（部署：2.92 deg/deg，p = 0.027；训练：3.30 deg/deg，95% CI [1.83, 4.77]，p < 0.001）。在斜坡下坡时，控制器相对于平地行走增加了膝关节阻力矩（部署：+0.16 Nm/kg，p < 0.001；训练：+0.16 Nm/kg，p = 0.008）。尽管训练数据仅包含一种肢体引导序列，但控制器在楼梯上坡和下坡中为健侧和假肢侧引导序列均生成了无缝的过渡。这些结果为端到端控制提供了初步证据，表明其能够提供统一、模式自适应的假肢辅助，而无需个体调参。

英文摘要

Powered prostheses conventionally rely on impedance controllers that require extensive manual tuning and explicit mode classification. In this work, we present real-time deployment of an end-to-end prosthesis controller that estimates continuous actuator signals from onboard sensors, eliminating the need for intent classifiers and subject-specific tuning. Temporal Convolutional Networks were trained on a multi-terrain dataset from 18 individuals with transfemoral amputation and deployed in real time across five locomotion modes. Four participants (three able-bodied, one with transfemoral amputation) ambulated across level ground, ramp ascent and descent, and stair ascent and descent. During level walking, the deployed controller reproduced the training-data scaling of peak ankle torque with walking speed (deployed 0.85 Nm/kg per m/s, p = 0.001; training 0.96 Nm/kg per m/s, 95% CI [0.42, 1.50], p = 0.002), after excluding one outlier traced to atypical prosthesis loading. During ramp ascent, the controller scaled knee pre-flexion with grade (deployed 2.92 deg/deg, p = 0.027; training 3.30 deg/deg, 95% CI [1.83, 4.77], p < 0.001). During ramp descent, the controller increased resistive knee torque relative to level walking (deployed +0.16 Nm/kg, p < 0.001; training +0.16 Nm/kg, p = 0.008). Seamless stair transitions were generated for both intact- and prosthetic-side-leading sequences in ascent and descent, despite the training data containing only one limb-leading sequence. These results provide initial evidence towards end-to-end control that can provide unified, mode-adaptive prosthetic assistance without subject-specific tuning.

2606.08655 2026-06-09 cs.RO cs.CV 新提交

PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

PhysGraph：用于感知与推理的物理感知3D场景图

Haoyu Li, Aaron Thomas, Shuyan Zhou, Xianyi Cheng

发表机构 * Duke University（杜克大学）

AI总结提出PhysGraph框架，结合符号推理与结构化3D几何，建模杂乱场景中的运动学和物理属性，在语义分割、多物体质量估计和关节预测上达到最优。

AI中文摘要

为了执行广泛的日常任务，机器人需要构建一个语义丰富、物理基础扎实且结构化的3D表示，以支持任务规划和功能预测。然而，现有方法主要关注语义检索，常常忽略物理和运动学因素。尝试建模物理属性的方法通常依赖于狭窄的训练集或单物体建模，限制了跨不同物体类型的可扩展性和泛化能力。为应对这些挑战，我们提出了PhysGraph，一个将符号推理与结构化3D几何相统一的框架，用于建模杂乱场景中的运动学和物理属性。给定RGB-D观测，PhysGraph重建以物体为中心的3D几何，并跨视图关联物体实例。然后，它将物体分解为功能部件，并通过视觉推理推断材料和关节。在合成和真实世界数据集上的评估表明，PhysGraph在语义分割、多物体质量估计和关节预测方面取得了最先进的结果。凭借其简单而有效的设计，PhysGraph生成物理一致且语义结构化的场景图，作为下游任务（如约束感知的3D功能预测和真实到模拟迁移）的结构化3D表示，这两项任务均在我们的实验中得到了验证。

英文摘要

To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.

2606.09416 2026-06-09 cs.RO cs.AI cs.SE 新提交

Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer

面向物理AI的驾驭工程：机器人中间件即驾驭层

Sanghoon Lee, Jiyeong Chae, Kyung-Joon Park

发表机构 * Daegu Gyeongbuk Institute of Science and Technology (DGIST)（大邱庆北科学技术院）

AI总结本文提出机器人中间件作为物理AI的驾驭层，需同时干预控制、计算和通信，并补充投影、隔离和转移三种缺失的强制功能，以ROS 2驾驭配置文件为例。

Comments 6 pages, 2 figures, 2 tables. Big Ideas track submission to the 27th ACM/IFIP International Middleware Conference (Middleware 2026)

AI中文摘要

在物理AI时代，机器人中间件面临新的角色。学习策略、规划器和视觉-语言-动作（VLA）模型现在作为控制路径上的因果参与者进入已部署的机器人，但将它们与定时、调度和网络集成的层尚未被命名。最近的语言智能体工作将此层命名为驾驭层，即中介工具、管理状态、约束资源和记录执行的外部系统。机器人社区尚未采用这一框架，我们提出机器人中间件就是那个驾驭层。物理AI驾驭层与软件驾驭层的区别在于其干预位置。软件驾驭层在工具调用边界进行中介。物理AI驾驭层必须同时干预控制、计算和通信，因为学习策略的输出跨越所有三者：其命令改变轨迹，其推理时间改变调度，其有效载荷改变带宽。机器人中间件是机器人栈中最低的层，具有对所有三者的中介抽象，因此最适合组合它们的强制实施。它已经提供了驾驭层所需的大部分功能，但缺乏针对AI模型的强制实施。我们将这种缺失的强制实施命名为三个功能：投影在输出时门控每个输出，隔离约束模型的执行和传输时隙，转移在检查失败时回退到经过验证的基线。每个功能目前以手工构建的应用程序代码形式出现在已部署的机器人系统中，构建在机器人中间件已提供的表面上。机器人中间件应该将它们作为组合所有三者的层，而不是作为最佳的单轴强制器。我们将其勾勒为ROS 2驾驭配置文件，这是一个部署工件，携带AI模型声明的输出区域、推理预算和运行机制，而中间件在ROS 2、DDS和Zenoh上强制实施它们。

英文摘要

Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh.

2606.09645 2026-06-09 cs.RO cs.PL cs.SE 新提交

Modeling Components and Connections in Cyber-Physical Systems

信息物理系统中的组件与连接建模

Kate Sanborn, Tanuj Kenchannavar, Vakul Nath, Jonathan Sprinkle

发表机构 * Vanderbilt University（范德堡大学）

AI总结提出基于WebGME的模型集成工具ROSLaunchVisual，通过图形界面可视化ROS启动文件中的节点、发布者、订阅者和参数，提升开发效率和系统理解。

AI中文摘要

信息物理系统的基于文本的配置文件很好地展示了组件模块的层次结构，但往往隐藏了模块之间连接和接口的细节。对这些配置文件采用基于模型的视觉方法可以更好地捕获这些信息。机器人操作系统（ROS）启动文件的XML结构可以通过建模方法得到改进。本文介绍了ROSLaunchVisual，一个基于WebGME构建的模型集成环境，用于设计、可视化和管理ROS启动文件。该工具通过允许开发者使用图形界面创建和修改启动文件来提高抽象层次，该界面将节点、发布者、订阅者和参数表示为互连组件。该工具提供动态系统分析，可用于新启动文件和现有启动文件的静态开发和分析。ROSLaunchVisual集成了元模型驱动验证、启动文件的自动导入/导出以及可视化通信映射等功能。插件通过更新库、检查语义错误和管理重映射进一步增强功能。通过使启动文件创建更直观且不易出错，ROSLaunchVisual提高了开发效率和系统理解，特别是在协作或大规模机器人项目中。

英文摘要

Text-based configuration files for cyber-physical systems show the hierarchy of component modules well but often hide the details of connections and interfaces between modules. A model-based visual approach to these configuration files can better capture this information. The XML structure of Robot Operating System (ROS) launch files can be improved using a modeling approach. This paper presents ROSLaunchVisual, a model-integrated environment built on WebGME for designing, visualizing, and managing ROS launch files. The tool raises the level of abstraction by allowing developers to create and modify launch files using a graphical interface that represents nodes, publishers, subscribers, and arguments as interconnected components. The tool provides a dynamic system analysis that can then be used in the static development and analysis of new and existing launch files. ROSLaunchVisual incorporates features such as metamodel-driven validation, automatic import/export of launch files, and visual communication mapping. Plugins further enhance functionality by updating libraries, checking for semantic errors, and managing remaps. By making launch file creation more intuitive and less error-prone, ROSLaunchVisual improves development efficiency and system understanding, especially in collaborative or large-scale robotics projects.

2512.07998 2026-06-09 cs.RO cs.CV 版本更新

DIJIT: A Robotic Head for an Active Observer

DIJIT: 面向主动观察者的机器人头部

Mostafa Kamali Tabrizi, Mingshi Chi, Bir Bikram Dey, Kelly Yuan, Markus D. Solbach, Yiqian Liu, Michael Jenkin, John K. Tsotsos

发表机构 * Department of Electrical Engineering and Computer Science, York University（电气与计算机科学系，约克大学）

AI总结提出DIJIT双目机器人头部，具有9个机械自由度和4个光学自由度，实现类人眼/头运动，用于主动视觉研究，其扫视精度接近人类。

DOI: 10.1109/LRA.2026.3682980
Journal ref: IEEE Robotics and Automation Letters, Vol. 11, No. 6, pp. 7038-7045, June 2026

AI中文摘要

我们提出DIJIT，一种新颖的双目机器人头部，专为作为主动观察者的移动代理设计。DIJIT独特的功能广度使得主动视觉研究以及类人眼和头颈运动、它们之间的相互关系以及各自对视觉能力的贡献成为可能。DIJIT还被用于探索人类视觉如何利用眼/头运动解决视觉任务与当前计算机视觉方法之间的差异。DIJIT的设计具有九个机械自由度，而相机和镜头提供了额外的四个光学自由度。机械设计的范围和速度与人类性能相当。DIJIT达到了人类峰值扫视速度的85%。我们的设计包括会聚立体视觉所需的运动范围，即聚散、版本和旋转。在这里，我们介绍DIJIT及其性能的某些方面。我们还提出了一种新颖的扫视相机运动方法，利用相机方向与电机值之间的直接关系。由此产生的扫视相机运动在准确性上接近人类运动，左相机和右相机的平均误差分别为1.17°和1.14°。

英文摘要

We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT's unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT's design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. DIJIT attains 85% of the peak human saccade speed. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. Here, we present DIJIT and some aspects of its performance. We also present a novel method for saccadic camera movements, using a direct relationship between camera orientation and motor values. The resulting saccadic camera movements are close to human movements in terms of their accuracy, with 1.17° and 1.14° mean error for the left and right cameras, respectively.

2602.12246 2026-06-09 cs.NI cs.RO 版本更新

6G Empowering Future Robotics: A Vision for Next-Generation Autonomous Systems

6G赋能未来机器人：下一代自主系统的愿景

Mona Ghassemian, Andrés Meseguer Valenzuela, Ana Garcia Armada, Dejan Vukobratovic, Periklis Chatzimisios, Kaspar Althoefer, Ranga Rao Venkatesha Prasad

发表机构 * ITI； UC3M； International Hellenic University and University of New Mexico（国际希腊大学和新墨西哥大学）； QMUL（伦敦玛丽女王大学）； TUDelft（代尔夫特理工大学）

AI总结本文探讨6G如何通过IMT-2030关键性能指标映射至机器人功能模块，提出集成机器人、智能和网络服务平面的架构，并展示实时动态安全框架以促进人机协作。

Comments IEEE Communication Magazine

2508.05153 2026-06-09 cs.RO cs.AI 版本更新

FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction

FCBV-Net：通过特征条件双臂价值预测实现类别级机器人服装平滑

Mohammed Daba, Jing Qiu

发表机构 * University of Waterloo（滑铁卢大学）

AI总结本文提出FCBV-Net，通过预训练的密集几何特征条件预测双臂动作价值，提升机器人服装平滑任务的类别级泛化能力，实验显示其在未见过的服装上效率下降仅为11.5%。

Comments 9 pages, 7 figures, 1 table

DOI: 10.3390/electronics15112468
Journal ref: Electronics 2026, 15(11), 2468

AI中文摘要

机器人服装操作（如双臂平滑）的类别级泛化仍面临显著挑战，原因在于高维性、复杂动态和类别内变化。现有方法往往力不从心：要么因同时学习视觉特征而在特定实例上过拟合，要么虽具备类别级感知泛化能力，却无法预测协同双臂动作的价值。本文提出特征条件双臂价值网络（FCBV-Net），在3D点云上操作，专门增强服装平滑的类别级策略泛化。FCBV-Net以预训练的冻结密集几何特征为条件进行双臂动作价值预测，确保对类别内服装变化的鲁棒性。可训练的下游组件则利用这些静态特征学习任务特定的策略。在使用CLOTH3D数据集的模拟PyFlex环境中，FCBV-Net展示了优越的类别级泛化能力。在未见过的服装上，其效率（Steps80）仅下降11.5%，而基于2D图像的基线下降96.2%；其最终覆盖率达到89%，优于使用相同逐点几何特征但采用固定动作基元的3D对应基线的83%覆盖率。这些结果表明，将几何理解与双臂动作价值学习解耦能够实现更好的类别级泛化。代码、视频和补充材料可在项目网站获取：https://dabaspark.github.io/fcbvnet/

英文摘要

Category-level generalization for robotic garment manipulation, such as bimanual smoothing, remains a significant hurdle due to high dimensionality, complex dynamics, and intra-category variations. Current approaches often struggle, either overfitting with concurrently learned visual features for a specific instance or, despite category-level perceptual generalization, failing to predict the value of synergistic bimanual actions. We propose the Feature-Conditioned bimanual Value Network (FCBV-Net), operating on 3D point clouds to specifically enhance category-level policy generalization for garment smoothing. FCBV-Net conditions bimanual action value prediction on pre-trained, frozen dense geometric features, ensuring robustness to intra-category garment variations. Trainable downstream components then learn a task-specific policy using these static features. In simulated PyFlex environments using the CLOTH3D dataset, FCBV-Net demonstrated superior category-level generalization. It exhibited only an 11.5% efficiency drop (Steps80) on unseen garments compared to 96.2% for a 2D image-based baseline, and achieved 89% final coverage, outperforming an 83% coverage from a 3D correspondence-based baseline that uses identical per-point geometric features but a fixed primitive. These results highlight that the decoupling of geometric understanding from bimanual action value learning enables better category-level generalization. Code, videos, and supplementary materials are available at the project website: https://dabaspark.github.io/fcbvnet/.

2501.15505 2026-06-09 cs.RO cs.CV cs.HC 版本更新

Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics

揭示iMarkers的潜力：用于高级机器人的隐形标志物

Ali Tourani, Deniz Isinsu Avsar, Hriday Bavle, Jose Luis Sanchez-Lopez, Jan Lagerwall, Holger Voos

发表机构 * Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg（自动化与机器人研究组，安全、可靠性与信任跨学科中心（SnT），卢森堡大学）； Faculty of Science, Technology, and Medicine, University of Luxembourg（科学、技术与医学学院，卢森堡大学）； Department of Physics & Materials Science, University of Luxembourg（物理与材料科学系，卢森堡大学）； Institute for Advanced Studies, University of Luxembourg（先进研究学院，卢森堡大学）

AI总结本文提出iMarkers，一种隐形标志物，可被机器人和AR设备检测，解决了传统标志物影响视觉美观的问题，展示了其在机器人应用中的灵活性和有效性。

Comments 19 pages, 10 figures, 4 tables

AI中文摘要

标志物在机器人导航、物体识别和场景理解中被广泛应用。尽管为机器人和增强现实（AR）应用提供了显著优势，但它们通常会破坏环境的视觉美观，因为它们对人类可见，因此不适合许多日常使用场景。为了解决这一差距，本文提出了iMarkers，即创新的、不显眼的标志物，仅能被机器人和配备适当传感器和检测算法的AR设备检测。这些标志物在生产中具有高度灵活性，允许根据各种需求定制其可见范围和编码算法。本文还介绍了用于检测iMarkers的硬件设计和开源软件算法，突显了其在检测和识别阶段的适应性和鲁棒性。大量评估已证明iMarkers相对于传统（印刷）和混合标志物的有效性，并确认了其在多样化机器人场景中的适用性。

英文摘要

Fiducial markers are widely used in robotics for navigation, object recognition, and scene understanding. While offering significant advantages for robots and Augmented Reality (AR) applications, they often disrupt the visual aesthetics of environments, as they are visible to humans, making them unsuitable for many everyday use cases. To address this gap, this paper presents iMarkers, innovative, unobtrusive fiducial markers detectable exclusively by robots and AR devices equipped with adequate sensors and detection algorithms. These markers offer high flexibility in production, allowing customization of their visibility range and encoding algorithms to suit various demands. The paper also introduces the hardware designs and open-sourced software algorithms developed for detecting iMarkers, highlighting their adaptability and robustness in the detection and recognition stages. Numerous evaluations have demonstrated the effectiveness of iMarkers relative to conventional (printed) and blended fiducial markers and have confirmed their applicability across diverse robotics scenarios.

2507.23592 2026-06-09 cs.RO cs.HC cs.SY eess.SY 版本更新

Human-Exoskeleton Kinematic Calibration to Improve Hand Tracking for Dexterous Teleoperation

人-外骨骼运动学校准以提高手部跟踪用于灵巧遥操作

Haiyun Zhang, Stefano Dalla Gasperina, Saad N. Yousaf, Toshimitsu Tsuboi, Tetsuya Narita, Ashish D. Deshpande

发表机构 * Walker Department of Mechanical Engineering, The University of Texas at Austin（德克萨斯大学奥斯汀分校沃克机械工程系）； Sony Group Corporation, Tokyo, Japan（索尼集团公司，日本东京）； Meta Reality Labs Research, Redmond, WA, USA（Meta现实实验室研究部，美国华盛顿州雷德蒙德）

AI总结本文提出一种针对手部外骨骼的个性化校准框架，通过残差加权优化估计虚拟链接参数，减少关节和指尖跟踪误差，提升遥操作精度。

Comments 8 pages, 10 figures, 1 supplementary video, submitted to RA-L

DOI: 10.1109/LRA.2026.3668666

AI中文摘要

手部外骨骼是实现灵巧遥操作和沉浸式操作界面的关键工具，但准确的手部跟踪仍面临挑战，因用户特定的解剖差异和穿戴不一致导致运动学对齐问题。本文提出了一种针对外骨骼的手部跟踪个性化校准框架，通过残差加权优化估计虚拟链接参数。引入数据驱动方法，利用动作捕捉地面真实数据经验调整成本函数权重，实现跨用户的准确一致校准。在七名健康受试者上实施于Maestro手部外骨骼，方法在多样化的手部几何结构中显著减少了关节和指尖跟踪误差。使用基于Unity的虚拟手的定性可视化进一步展示了改进的运动保真度。所提框架适用于具有闭环运动学和最小传感的外骨骼，为高保真遥操作和机器人学习应用奠定了基础。

英文摘要

Hand exoskeletons are critical tools for dexterous teleoperation and immersive manipulation interfaces, but achieving accurate hand tracking remains a challenge due to user-specific anatomical variability and donning inconsistencies. These issues lead to kinematic misalignments that degrade tracking performance and limit applicability in precision tasks. We propose a subject-specific calibration framework for exoskeleton-based hand tracking that estimates virtual link parameters through residual-weighted optimization. A data-driven approach is introduced to empirically tune cost function weights using motion capture ground truth, enabling accurate and consistent calibration across users. Implemented on the Maestro hand exoskeleton with seven healthy participants, the method achieved substantial reductions in joint and fingertip tracking errors across diverse hand geometries. Qualitative visualizations using a Unity-based virtual hand further demonstrate improved motion fidelity. The proposed framework generalizes to exoskeletons with closed-loop kinematics and minimal sensing, laying the foundation for high-fidelity teleoperation and robot learning applications.

2602.01880 2026-06-09 cs.RO 版本更新

Multimodal Large Language Models for Real-Time Situated Reasoning

多模态大语言模型用于实时情境推理

Giulio Antonio Abbo, Senne Lenaerts, Tony Belpaeme

发表机构 * University of Amsterdam（阿姆斯特丹大学）

AI总结本文探讨多模态大语言模型如何支持实时情境和价值感知决策，结合GPT-4o与模拟智能扫地机器人平台，展示其在家庭活动、社会规范和用户偏好推理中的能力，以及在清洁、舒适和安全等价值上的细致决策。

Comments Submitted to the interactivity track of the 21st ACM/IEEE International Conference on Human-Robot Interaction on December 2025, accepted January 2026

Details

DOI: 10.1145/3776734.3796242
Journal ref: HRI Companion 2026: Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction

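AI abstract (translated from Chinese)

In this work, we explore how multimodal large language models can support real-time context- and value-aware decision-making. To this end, we combine the GPT-4o language model with a TurtleBot 4 platform simulating a smart vacuum cleaning robot in a home; the model evaluates the environment through vision input and decides whether to start cleaning. The system demonstrates the ability of these models to reason about domestic activities, social norms, and user preferences, and to make nuanced decisions consistent with the values of the people involved, such as cleanliness, comfort, and safety. We demonstrate the system in a realistic home environment, showing its ability to infer context and values from limited visual input. Our results highlight the potential of multimodal large language models to enhance robotic autonomy and situational awareness, and also point to challenges related to consistency, bias, and real-time performance.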

English abstract

In this work, we explore how multimodal large language models can support real-time context- and value-aware decision-making. To do so, we combine the GPT-4o language model with a TurtleBot 4 platform simulating a smart vacuum cleaning robot in a home. The model evaluates the environment through vision input and determines whether it is appropriate to initiate cleaning. The system highlights the ability of these models to reason about domestic activities, social norms, and user preferences and take nuanced decisions aligned with the values of the people involved, such as cleanliness, comfort, and safety. We demonstrate the system in a realistic home environment, showing its ability to infer context and values from limited visual input. Our results highlight the promise of multimodal large language models in enhancing robotic autonomy and situational awareness, while also underscoring challenges related to consistency, bias, and real-time performance.

2501.05628 2026-06-09 cs.RO cs.HC Updated version

Concerns and Values in Human-Robot Interactions: A Focus on Social Robotics

Giulio Antonio Abbo, Tony Belpaeme, Micol Spitale

Affiliations * IDLab-AIRO, Ghent University – imec; Ghent University – imec; DEIB, Politecnico di Milano

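AI summary: Through a scoping review and focus group discussions, this paper identifies key concerns and values in human-robot interaction across healthcare, education, and private-home settings, and develops the HRI Value Compass tool to guide robot design.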

Comments 31 pages, 7 figures, 6 tables; 4 appendices

Details

DOI: 10.1007/s12369-025-01351-1
Journal ref: Int J of Soc Robotics 18, 4 (2026)

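AI abstract (translated from Chinese)

As AI with physical instantiation, robots inhabit our social and physical world, and their actions have both social and physical consequences, which poses challenges for researchers designing social robots. This study uses a scoping review to identify discussions and potential concerns arising from interactions with robotic systems in healthcare, education, and private homes. Two focus groups of technology ethics experts then validated a comprehensive list of key topics and values from the human-robot interaction (HRI) literature in these contexts. These insights were integrated into the HRI Value Compass web tool, which helps HRI researchers identify these values in robot design. The tool was evaluated in a pilot study. This work contributes to the HRI community by highlighting key concerns in human-robot interaction and by providing a tool that helps researchers design robots aligned with human values, so that future robotic systems adhere to these values in social applications.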

English abstract

Robots, as AI with physical instantiation, inhabit our social and physical world, where their actions have both social and physical consequences, posing challenges for researchers when designing social robots. This study starts with a scoping review to identify discussions and potential concerns arising from interactions with robotic systems in the context of healthcare, education, and private homes. Two focus groups of technology ethics experts then validated a comprehensive list of key topics and values in human-robot interaction (HRI) literature in these contexts. These insights were integrated into the HRI Value Compass web tool, to help HRI researchers identify these values in robot design. The tool was evaluated in a pilot study. This work benefits the HRI community by highlighting key concerns in human-robot interactions and providing an instrument to help researchers design robots that align with human values, ensuring future robotic systems adhere to these values in social applications.

2311.08957 2026-06-09 cs.RO cs.AI cs.HC Updated version

I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

Giulio Antonio Abbo, Tony Belpaeme

Affiliations * IDLab-AIRO – Ghent University – imec

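AI summary: This paper presents a system that uses large language models to improve the dialogue capabilities of social robots, integrating visual input to strengthen context awareness. It reports six interactions with a Furhat robot and discusses the prospects for dialogue that fuses visual and textual modalities.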

Comments 8 pages, 3 figures

Details

DOI: 10.1109/HRI61500.2025.10973830
Journal ref: HRI '25: Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction. Pages 1176 - 1180

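AI abstract (translated from Chinese)

In the rapidly evolving landscape of human-computer interaction, integrating vision capabilities into conversational agents is a crucial advancement. This paper presents an initial implementation of a dialogue manager built on recent large language models (e.g., GPT-4, IDEFICS) that augments traditional text-based prompts with real-time visual input. The LLMs interpret both textual prompts and visual stimuli, producing a more context-aware conversational agent. The system's prompt engineering combines the dialogue with summaries of the images, balancing context preservation against computational efficiency. Six interactions with a Furhat robot are reported, and the results are presented and discussed. By implementing this vision-enabled dialogue system, the paper envisions a future in which conversational agents seamlessly blend textual and visual modalities for richer, more context-aware dialogue.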

English abstract

In the rapidly evolving landscape of human-computer interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents an initial implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4, IDEFICS) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. By implementing this vision-enabled dialogue system, the paper envisions a future where conversational agents seamlessly blend textual and visual modalities, enabling richer, more context-aware dialogues.

2501.04633 2026-06-09 cs.HC cs.CY cs.RO Updated version

"Can you be my mum?": Manipulating Social Robots in the Large Language Models Era

Giulio Antonio Abbo, Gloria Desideri, Tony Belpaeme, Micol Spitale

Affiliations * IDLab-AIRO, Ghent University – imec; DEIB, Politecnico di Milano

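AI summary: This study examines how users in the era of large language models try to get a robot to violate ethical principles. Testing across three scenarios revealed five manipulation techniques, with the aim of informing the design of safer, more ethical human-robot interaction.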

Comments 10 pages, 2 figures

Details

DOI: 10.1109/HRI61500.2025.10973919
Journal ref: HRI '25: Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction

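AI abstract (translated from Chinese)

Recent advances in robots powered by large language models have improved their conversational abilities, bringing interactions closer to human dialogue. However, these models introduce safety and security concerns in human-robot interaction, because they are vulnerable to manipulation that can bypass built-in safety measures. Imagining a social robot deployed in a home, this work aims to understand how everyday users try to exploit a language model to violate ethical principles, for example by prompting the robot to act as a life partner. We conducted a pilot study with 21 university students who interacted with a Misty robot and tried to circumvent its safety mechanisms in three scenarios based on specific HRI ethical principles: attachment, freedom, and empathy. Our results show that participants used five techniques, including insulting the robot and appealing to pity with emotional language. We hope this work can inform future research on designing stronger safeguards to ensure ethical and secure human-robot interaction.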

English abstract

Recent advancements in robots powered by large language models have enhanced their conversational abilities, enabling interactions closely resembling human dialogue. However, these models introduce safety and security concerns in HRI, as they are vulnerable to manipulation that can bypass built-in safety measures. Imagining a social robot deployed in a home, this work aims to understand how everyday users try to exploit a language model to violate ethical principles, such as by prompting the robot to act like a life partner. We conducted a pilot study involving 21 university students who interacted with a Misty robot, attempting to circumvent its safety mechanisms across three scenarios based on specific HRI ethical principles: attachment, freedom, and empathy. Our results reveal that participants employed five techniques, including insulting and appealing to pity using emotional language. We hope this work can inform future research in designing strong safeguards to ensure ethical and secure human-robot interactions.

2107.07599 2026-06-09 cs.RO Updated version

Partially Observable Markov Decision Processes (POMDPs) and Robotics

Hanna Kurniawati

Affiliations * School of Computing, Australian National University

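AI summary: This paper reviews POMDPs in robotics, discussing their computational complexity and the advances in sampling-based solvers, and shows how POMDPs contribute to more robust robotic systems.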

Details

DOI: 10.1146/annurev-control-042920-092451
Journal ref: Annual Review of Control, Robotics, and Autonomous Systems Vol. 5:253-277, 2022

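AI abstract (translated from Chinese)

The POMDP is a mathematical framework for planning under uncertainty. Although POMDPs were long considered impractical for robotics because of their computational complexity, advances in sampling-based approximate solvers since the early 2000s allow them to significantly improve the robustness of robotic systems within reasonable computational resources, making them practical for many realistic robotics problems. This paper reviews POMDPs, emphasizing the computational issues that have hindered their practicality in robotics, the ideas in sampling-based solvers that have alleviated these difficulties, and the lessons learned from applying POMDPs to physical robots.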

English abstract

Planning under uncertainty is critical to robotics. The Partially Observable Markov Decision Process (POMDP) is a mathematical framework for such planning problems. It is powerful due to its careful quantification of the non-deterministic effects of actions and partial observability of the states. But precisely because of this, POMDP is notorious for its high computational complexity and deemed impractical for robotics. However, since early 2000, POMDPs solving capabilities have advanced tremendously, thanks to sampling-based approximate solvers. Although these solvers do not generate the optimal solution, they can compute good POMDP solutions that significantly improve the robustness of robotics systems within reasonable computational resources, thereby making POMDPs practical for many realistic robotics problems. This paper presents a review of POMDPs, emphasizing computational issues that have hindered its practicality in robotics and ideas in sampling-based solvers that have alleviated such difficulties, together with lessons learned from applying POMDPs to physical robots.

1. Robot Learning, Imitation Learning, and Reinforcement Learning (16 papers)

PRISM: PRior-guided Imagination Sampling in world Models

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

SynthICL: Scalable In-context Imitation Learning with Synthetic Data

HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation

Guided Discovery of New Behaviors using Diffusion Policies

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

ReGIL: Retrieval-Guided Imitation Learning from a Single Demonstration

$ω$-EVA: Envision, Verify, and Act with Latent Interactive World Models

Goal Sets, Not Goal States: Queryable Robot Goals through Goal-Set Hindsight Relabeling

Difference-Aware Retrieval Policies for Imitation Learning

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning

Worth Remembering: Surprise-Gated Robot Episodic Memory

SkillWrapper: Generative Predicate Invention for Task-level Robot Planning

2. Motion Planning, Control, and Dynamics (13 papers)

Path Planning Using Deep Deterministic Policy Gradient: A Reinforcement Learning Approach

Learning Predictive Control with Deep Koopman Operators for Autonomous Vehicle Motion Planning

Propeller-Assisted Robust 3D Hopping Robot with Hierarchical Force Allocation

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

Real-Time and Accurate Collision-Free Teleoperation via Differentiable Constraint-Based Trajectory Planning

PTDL: Multi-Terrain Fall Recovery via Phase-Terrain Decoupled Learning

Trajectory Optimization in Single and Dual-UAV Bearing-Only Target Localization

Can we stabilize an inverted pendulum with feedback from a time-of-flight camera?

Physics-Aware Sparse Learning and Selective Online Adaptation for Euler-Lagrange Robot Dynamics

Safe Polytope-in-Polytope Motion Planning and Control with Control Barrier Functions

An Interval Branch-and-Bound-Based Inverse Kinemetics Algorithm Towards Global Optimal Redundancy Resolution

Integrated Hierarchical Decision-Making in Inverse Kinematic Planning and Control

SAD-Flower: Flow Matching for Safe, Admissible, and Dynamically Consistent Planning

3. Manipulation, Grasping, and Dexterous Hands (21 papers)

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

Revisiting Articulated Parts Perception in Robot Manipulation

Vision-Guided Dual-Arm Humanoid Robotic Disassembly of End-of-Life 18650 Lithium-ion Battery Packs

GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video

Autonomous Obstacle Removal for Excavators through Policy Learning with Particle Simulation

KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation

DexPIE: Stable Dexterous Policy Improvement from Real-World Experience

SynManDex: Synthesizing Human-like Dexterous Grasps from Synthetic Human Pre-Grasps

Symskill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation

Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots

On-the-fly hand-eye calibration for the da Vinci surgical robot

Robot-DIFT: Correspondence-Sensitive Diffusion Features for Contact-Rich Robot Manipulation

FingerEye: Learning Dexterous Manipulation with Continuous Vision-Tactile Sensing

BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

RealDexUMI: A Wearable Universal Manipulation Interface for Dexterous Robot Learning

4. Navigation, Localization, and SLAM (11 papers)

IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations

Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

VGP-Nav: Metric-Aware Visual Geometric Perception for Robot Navigation

Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments

MosaicIMU: Composing Carrier Experts for Generalizable Neural Inertial Odometry

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

Follow Everything: A Leader-Following and Obstacle Avoidance Framework with Goal-Aware Adaptation

SODA-CitrON: Static Object Data Association by Clustering Multi-Modal Sensor Detections Online

IMAC-AgriVLN: Can Agricultural Vision-and-Language Navigation Agents be Aware of Instruction Mistakes?

Relational Epipolar Graphs for Robust Relative Camera Pose Estimation

5. Human-Robot Interaction and Collaborative Robots (11 papers)

X-OP: Cross-Morphology Whole-Body Teleoperation via MPC Retargeting

Cybernetic Android Avatar "Yui": System Integration, Field Deployment, and Evaluation

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

Agentic Neuro-Symbolic Planning and Commissioning for Human-in-the-Loop Industrial Robotics with Digital Twins

Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation

Personalized and Robust Proactive Robot Assistance with Uncertainty-Guided LLM Reasoning

Safe, Fluent and Acceptable Motion Generation and Execution for Human--Robot Interaction in Manufacturing Environments

RPO-PDT: Demonstrating Role-Play-Based Knowledge Adaptation for Student Support Dialogue (Demonstration System)

Astro, I'm Home! Investigating Factors that Influence the Acceptance of Home Robots Using Supervised Machine Learning

Real-time body pose non-verbal communication with a consistency-based reliability measure

QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents

6. Embodied Intelligence and Vision-Language-Action Models (20 papers)

Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model