arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Robot Learning, Imitation Learning, and Reinforcement Learning: 5 papers

2606.18328 2026-06-18 cs.RO new submission

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

Bowen Li, Mayank Mishra, Y. Isabel Liu, Stone Tao, Nishanth Kumar, Alexander G. Gray, Ruwan Wickramarachchi, Jonathan Francis, Sebastian Scherer, Tom Silver

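Institutions * CMU (Carnegie Mellon University); Princeton (Princeton University); AI2 (Allen Institute for AI); MIT (Massachusetts Institute of Technology); Centaur AI; Bosch Center for AI

AI summary Proposes ReSYNC, which alternates skill learning and concept discovery to build abstract predicates progressively from failure-recovery experience, enabling global failure avoidance and long-horizon planning; it outperforms strong baselines by over 50%.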

Comments 9 pages, 6 figures. Website: https://jaraxxus-me.github.io/ReSYNC/

Abstract

Intelligent robots should not only recover from failures, but also acquire the abstract knowledge needed to avoid them in the future. While reinforcement learning (RL) can learn reactive recovery behaviors, training a separate policy for every distinct failure mode is highly inefficient. We introduce Recovery-Driven Synthesis of Relational Concepts (ReSYNC), the first approach that progressively discovers and refines state abstractions (relational predicates) from failure-recovery experience to support abstract planning. Unlike purely reactive methods, ReSYNC jointly learns skills and concepts through an incremental dual-learning process. In the skill-learning phase, the robot uses RL to learn to recover from failures seen in training tasks. In the concept-learning phase, the robot discovers new relational predicates and refines its abstract planning model to explain and generalize the learned recovery behaviors. This interaction enables ReSYNC to convert local recoveries seen during training into global failure avoidance at test time. Across four simulated domains, we show that ReSYNC's ability to continually expand and refine its abstraction library allows it to solve long-horizon, previously unseen problems, outperforming strong baselines by over 50%. Additionally, we demonstrate sim-to-real transfer of ReSYNC, where it performs real-world non-prehensile manipulation skills and generalizes to unseen scenarios through abstract planning. Overall, ReSYNC represents a significant step toward robots that autonomously acquire abstractions for scalable, failure-aware planning in the physical world.

2606.18589 2026-06-18 cs.RO new submission

DREAM-Chunk: Reactive Action Chunking with Latent World Model

Wenxi Chen, Kaidi Zhang, Chi Lin, Zhiyuan Zhang, Yu She, Yuejiang Liu, Raymond A. Yeh, Shaoshuai Mou, Yan Gu

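Institutions * Purdue University; Stanford University

AI summary Proposes DREAM-Chunk, which uses a lightweight latent world model at test time to sample multiple candidate action chunks and act from the one whose predicted latent state best matches what is observed, improving the robustness of action-chunking policies under stochastic dynamics.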

Abstract

Action chunking has become a common interface for vision-language-action (VLA) models, enabling low-frequency policy inference to drive high-frequency robot execution. However, once an action chunk is committed, its open-loop execution can be brittle under stochastic dynamics, hardware execution errors, and partial observability. We propose DREAM-Chunk, a test-time scaling method that augments chunking-based policies with a lightweight latent world model, without requiring additional policy fine-tuning. At test time, DREAM-Chunk samples multiple candidate action chunks, rolls out their predicted latent futures, and selects actions from the chunk whose predicted state best matches the observed rollout. In this way, DREAM-Chunk uses additional test-time computation to cover multiple plausible stochastic futures and improve reactivity during long-horizon chunk execution. On the Kinetix benchmark, DREAM-Chunk improves robustness under increasing action noise and benefits from larger candidate sample sizes, especially when demonstrations contain corrective behaviors. We further validate DREAM-Chunk on four manipulation tasks across two robot platforms and two VLA policies under various sources of stochasticity. Across simulation and hardware experiments, DREAM-Chunk improves the robustness of action-chunking policies in stochastic dynamics.
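
The selection step in this abstract (sample candidates, roll out predicted latents, keep the candidate that agrees with what is observed) can be sketched roughly as below. This is a toy illustration under stated assumptions, not the authors' implementation: the scalar latent state, the additive `rollout` dynamics, and the squared-error matching rule in `select_chunk` are all hypothetical stand-ins for the learned latent world model.

```python
def rollout(z0, chunk):
    """Toy latent dynamics (hypothetical): each action shifts a scalar latent."""
    states, z = [], z0
    for action in chunk:
        z = z + action
        states.append(z)
    return states


def select_chunk(predicted, observed_z, t):
    """Index of the candidate whose predicted latent at step t is closest
    to the latent actually observed at step t."""
    return min(
        range(len(predicted)),
        key=lambda i: (predicted[i][t] - observed_z) ** 2,
    )


# Three candidate chunks sampled from a (hypothetical) chunking policy.
candidates = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [0.0, 0.0, 0.0]]
predicted = [rollout(0.0, chunk) for chunk in candidates]

# Suppose execution noise left the system at latent 0.6 after the first step.
best = select_chunk(predicted, observed_z=0.6, t=0)
next_action = candidates[best][1]  # continue with the best-matching chunk
```

Here the second candidate predicted a latent of 0.5 after one step, which is closest to the observed 0.6, so the next action is taken from that chunk rather than from the one originally committed to.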

2606.18772 2026-06-18 cs.RO new submission

HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations

Zehui Zhao, Yuxuan Zhao, Gaojing Zhang, Chenxi Liu, Maolin Zheng, Wenzhao Lian

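Institutions * Shanghai Jiao Tong University; University of Sussex; East China University of Science and Technology

AI summary Proposes HALOMI, a framework that extends the Universal Manipulation Interface (UMI) with egocentric sensing for active perception and combines a manifold-constrained controller with observation-action alignment; on a Unitree G1 humanoid it is validated on five real-world tasks and averages an 85% success rate across the three quantitatively evaluated ones.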

Abstract

Human demonstrations, which can be collected at scale and naturally capture active hand-eye coordination, are a promising data source for learning humanoid loco-manipulation. However, directly transferring human demonstrations to humanoids requires a precise world-frame tracking controller, which is often brittle under out-of-distribution (OOD) targets, while human-to-humanoid gaps persist in both egocentric observation and action execution. To address these challenges, we present HALOMI, a scalable framework for learning humanoid loco-manipulation with active perception from human demonstrations. HALOMI extends the Universal Manipulation Interface (UMI) with egocentric sensing to collect ego-view and wrist-view observations along with head-hand trajectories at scale. We further propose a manifold-constrained controller that plans in a learned latent behavior manifold to enable precise and robust head-hand tracking in the world frame. To bridge the human-to-humanoid gap, we perform ego-view alignment and introduce a controller-aware reference trajectory adaptation to reduce mismatch in both observation and action execution. We validate HALOMI on a Unitree G1 humanoid robot with an actuated neck across five real-world tasks involving navigation, grasping, bimanual manipulation, whole-body coordination, and dynamic behaviors. Across the three quantitatively evaluated tasks, HALOMI achieves an average success rate of 85%, while additional qualitative demonstrations show its ability to support dynamic tossing and deep-squat grasping.

2606.18953 2026-06-18 cs.RO new submission

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

Kinam Kim, Namiko Saito, Heecheol Kim, Katsushi Ikeuchi, Jaegul Choo, Yasuyuki Matsushita

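Institutions * KAIST (Korea Advanced Institute of Science and Technology); Microsoft Research Asia - Tokyo; The University of Tokyo

AI summary Proposes an object-centric residual RL framework whose policy is trained in simulation and transferred zero-shot to a real robot, raising the VLA's success rate from 42% to 76%.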

Comments 8 pages, 7 figures, 2 tables; 8-page appendix

Abstract

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/
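
The two ingredients named in this abstract, a residual correction on top of a frozen base action and pose observations corrupted by noise and dropout during training, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: the function names, the residual `scale`, the clipping range, and the choice to reuse the previous pose when a reading is dropped are all hypothetical.

```python
import random


def perturb_pose(pose, last_pose, noise_std, dropout_prob, rng):
    """Simulate an imperfect pose estimate (assumed scheme): with probability
    dropout_prob the reading is lost and the previous pose is reused;
    otherwise Gaussian noise is added to every coordinate."""
    if rng.random() < dropout_prob:
        return list(last_pose)
    return [p + rng.gauss(0.0, noise_std) for p in pose]


def refine_action(base_action, residual, scale=0.1):
    """Add a bounded correction to the frozen base policy's action.
    Each residual component is clipped to [-1, 1] before scaling."""
    return [
        a + scale * max(-1.0, min(1.0, r))
        for a, r in zip(base_action, residual)
    ]


rng = random.Random(0)
noisy_pose = perturb_pose([0.40, 0.10, 0.05], [0.39, 0.10, 0.05], 0.005, 0.1, rng)
action = refine_action([0.5, 0.0], [2.0, -0.5])
```

Bounding the residual keeps the corrective policy from overriding the base VLA outright, which is one common reason for choosing a residual formulation.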

2606.19328 2026-06-18 cs.LG cs.AI cs.RO cross-listed

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

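Institutions * Learning, Embodied Autonomy, and Forecasting (LEAF) Lab, University of Toronto

AI summary Proposes UBP2, which actively directs exploration by jointly reasoning over uncertainty in the reward, dynamics, and value functions, and achieves substantially higher sample efficiency on the Meta-World benchmark.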

Abstract

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
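
A unified score that adds an uncertainty bonus to expected reward and terminal value might look like the sketch below. It is only an illustration of the idea: the abstract does not give the formula, so the use of ensemble standard deviation as the epistemic-uncertainty proxy, the weight `beta`, and the omission of the dynamics ensemble are all assumptions.

```python
from statistics import mean, pstdev


def trajectory_score(reward_preds, value_preds, beta=1.0):
    """Score one candidate trajectory.

    reward_preds: predicted return of the trajectory, one per reward model.
    value_preds:  predicted terminal value, one per value model.
    beta:         hypothetical weight on the uncertainty bonus.
    """
    expected = mean(reward_preds) + mean(value_preds)
    # Ensemble disagreement used as a stand-in for epistemic uncertainty.
    uncertainty = pstdev(reward_preds) + pstdev(value_preds)
    return expected + beta * uncertainty


# Two candidates with the same expected return but different disagreement.
score_known = trajectory_score([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
score_uncertain = trajectory_score([0.0, 1.0, 2.0], [0.5, 0.5, 0.5])
```

With `beta > 0` the planner prefers the trajectory the ensemble disagrees on, which is where a preference query is most informative; with `beta = 0` the two candidates tie.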

2. Motion Planning, Control, and Dynamics: 7 papers

2606.18514 2026-06-18 cs.RO cs.LG new submission

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin

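Institutions * Department of Computer Science and Engineering, University of California, Merced

AI summary Proposes N(CO)$^2$, a reinforcement-learning-based framework that solves the Stochastic Orienteering Problem without hand-crafted heuristics, optimizing path selection under uncertainty with performance competitive with a state-of-the-art MILP.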

Journal ref
In Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), 2025

Abstract

Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.
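
A chance constraint of the kind the title refers to can be checked by sampling. The sketch below assumes a common formulation in which edge travel times are random and a path is acceptable only if the probability of exceeding the budget stays below a threshold `alpha`; the abstract does not spell this out, and the exponential travel-time model and all function names here are hypothetical.

```python
import random


def failure_probability(edge_means, budget, rng, n_samples=2000):
    """Monte Carlo estimate of P(total travel time > budget) for one path.
    Each edge's travel time is drawn from an exponential distribution with
    the given mean (an assumed noise model)."""
    over_budget = 0
    for _ in range(n_samples):
        total = sum(rng.expovariate(1.0 / m) for m in edge_means)
        if total > budget:
            over_budget += 1
    return over_budget / n_samples


def satisfies_chance_constraint(edge_means, budget, alpha, rng):
    """True if the estimated failure probability is at most alpha."""
    return failure_probability(edge_means, budget, rng) <= alpha


rng = random.Random(0)
p_fail = failure_probability([1.0, 1.0], budget=2.0, rng=rng)
tight_ok = satisfies_chance_constraint([1.0, 1.0], 2.0, 0.1, rng)
loose_ok = satisfies_chance_constraint([1.0, 1.0], 10.0, 0.1, rng)
```

A learned policy would apply a check of this kind (or a learned surrogate for it) when deciding whether visiting one more node is worth the risk of running over budget.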

2606.18625 2026-06-18 cs.RO new submission

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

Xiaowen Hu, Linqi Ye, Yudi Zhu, Chenyue Shao, Rankun Li, Qingdu Li, Yan Peng

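Institutions * Institute of Artificial Intelligence, Shanghai University; Institute of Machine Intelligence, University of Shanghai for Science and Technology

AI summary Proposes SRL, which combines the physically grounded baseline of the SLIP model with the adaptability of RL: SLIP-based feedforward control signals plus RL-driven real-time feedback optimize robotic jumping, cutting training time substantially while keeping tracking accurate.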

Comments 17 pages, 12 figures

Abstract

Robotic jumping is pivotal in applications such as search and rescue and logistics, where crossing obstacles and enhancing mobility efficiency are critical. The Spring-Loaded Inverted Pendulum (SLIP) model leverages simplified spring-mass dynamics that naturally encode biologically plausible hopping motions, yet its performance degrades on irregular terrain due to idealized assumptions regarding contact and joint dynamics. Meanwhile, Reinforcement Learning (RL) can adapt to diverse and complex environments but often requires extensive data from unguided exploration. The complementary strengths of SLIP's physically grounded baseline and RL's adaptive capabilities motivate a hybrid framework that overcomes these individual limitations. We therefore propose Spring-loaded Reinforcement Learning (SRL), which integrates SLIP-based feedforward control signals with RL-driven real-time feedback, enabling continuous optimization of robotic jumping. Experimental results demonstrate that SRL can achieve more stable jumps with much less training time than the baseline method, maintaining an average position tracking error below 0.1 m and velocity tracking errors within ±3% of the target values. Through bipedal and quadrupedal simulations of ground and stair jumping, as well as sim-to-sim and sim-to-real validations, SRL exhibits robust adaptability to various task requirements and environmental complexities, underscoring its potential for real-world deployment.

2606.18730 2026-06-18 cs.RO cs.AI math.CO math.OC new submission

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

Allen George Philip, Anoop Bhat, Sivakumar Rathinam, Howie Choset

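Institutions * Texas A&M University; Carnegie Mellon University

AI summary For the Moving-Target Traveling Salesman Problem with Moving Obstacles, proposes a mixed-integer conic programming formulation and a two-phase bilevel search algorithm, both of which significantly outperform the baseline.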

Abstract

The Moving-Target Traveling Salesman Problem (MT-TSP) seeks a minimum cost trajectory for an agent that departs from a static depot, visits a set of moving targets, each within one of their assigned time windows, and returns to the depot. In this article, we study the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), a generalization of the MT-TSP where the agent trajectory must avoid moving obstacles. We present a Mixed-Integer Conic Programming (MICP) formulation that can be solved using off-the-shelf solvers, as well as a fast and scalable Two-Phase Bilevel Search (TPBS) algorithm that computes high-quality feasible solutions for the problem. We evaluate our approaches against an existing baseline algorithm on a broad range of problem instances with up to 40 targets and 40 obstacles. The results demonstrate that both the proposed methods significantly outperform the baseline with respect to success rates, solution costs, and computation time.

2606.18828 2026-06-18 cs.RO cs.AI new submission

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

Chenghao Xu

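Institutions * National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University

AI summary Places intelligence in the space itself: a neural semigroup-superposition mechanism generates a Riemannian metric so that action reduces to following geodesics; trained on a single two-obstacle scene, the model generalizes zero-shot to unseen obstacle configurations.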

Abstract

Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups -- frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.

2606.18883 2026-06-18 cs.RO new submission

ZiMPedance: Impedance-Aware ZMP Modeling and Control for Payload Carrying with Quadruped Robots

Giovanni B. Dessy, Lorenzo Amatucci, Victor Barasuol, Claudio Semini

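Institutions * Dynamic Legged Systems Lab, Istituto Italiano di Tecnologia (IIT)

AI summary Extends the Zero Moment Point (ZMP) formulation to include passive payload-interface dynamics and integrates it with model predictive control, reducing stability violations by up to 10× and improving locomotion efficiency.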

Abstract

Load transportation with quadruped robots is strongly affected by the dynamics of the physical interface between the robot and the load. Passive spring-based arms reduce weight and complexity compared to active manipulators, but their spring-damper dynamics can introduce oscillatory forces that degrade locomotion stability. This paper derives an extended Zero Moment Point (ZMP) formulation that includes passive payload-interface dynamics, relating stiffness, damping, and payload mass to the stability margin. The analysis shows that underdamped configurations can resonate with locomotion harmonics. Based on this insight, we augment a Single Rigid Body Dynamics model with passive subsystem dynamics and integrate it into a Model Predictive Control framework. In simulation, the proposed controller reduces stability violations by up to 10×, from 7.0% to 0.7%, and increases locomotion efficiency by lowering horizontal ground reaction force effort by up to 15% compared to a nominal baseline. Hardware experiments with a 2 kg payload show stable locomotion under pull-release disturbances where the nominal controller fails. The same model also enables end-effector tracking through passive arm dynamics without direct arm actuation.
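
The remark that underdamped configurations can resonate with locomotion harmonics can be made concrete with the textbook single-degree-of-freedom mass-spring-damper relations. The sketch below is not the paper's extended ZMP model: it only computes the natural frequency and damping ratio of a payload on a spring-damper and checks whether that frequency sits near a multiple of the gait frequency. The stiffness, damping, gait frequency, and tolerance values are hypothetical; only the 2 kg payload mass comes from the abstract.

```python
import math


def payload_interface(k, c, m):
    """Natural frequency (Hz) and damping ratio of a mass-spring-damper.
    k: stiffness [N/m], c: damping [N*s/m], m: payload mass [kg]."""
    f_n = math.sqrt(k / m) / (2.0 * math.pi)
    zeta = c / (2.0 * math.sqrt(k * m))
    return f_n, zeta


def near_gait_harmonic(f_n, gait_hz, n_harmonics=3, tol=0.1):
    """True if f_n lies within a relative tolerance of any of the first
    n_harmonics multiples of the gait frequency."""
    return any(
        abs(f_n - h * gait_hz) <= tol * h * gait_hz
        for h in range(1, n_harmonics + 1)
    )


# 2 kg payload; stiffness and damping are made-up illustrative values.
f_n, zeta = payload_interface(k=200.0, c=4.0, m=2.0)
underdamped = zeta < 1.0
risky = near_gait_harmonic(f_n, gait_hz=1.6)
```

With these illustrative numbers the interface has a natural frequency of about 1.59 Hz and a damping ratio of 0.1, so a 1.6 Hz gait would excite it close to resonance.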

2606.19031 2026-06-18 cs.RO new submission

Congestion-Aware Robot Tour Planning in Crowded Environments

Stefano Bernagozzi, Charlie Street, Masoumeh Mansouri, Lorenzo Natale

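Institutions * Istituto Italiano di Tecnologia; Università di Genova; University of Birmingham

AI summary Proposes a probabilistic tour planner that learns a model predicting human flow and builds a Markov decision process online to route the robot efficiently through crowded environments while reducing the impact of congestion.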

Comments Accepted to IEEE IROS 2026

Abstract

Autonomous mobile service robots are often required to complete tours that require navigating through a set of locations in an environment. Example domains include guiding people through a shopping mall, delivering packages in a fulfilment centre, or giving guided tours in a museum. However, in crowded environments, the presence of people may negatively impact robot performance. For example, humans will activate robot collision avoidance manoeuvres that slow the robot down. Crowds move stochastically and vary throughout the day. In this paper we present a probabilistic tour planner for crowded environments which explicitly reasons over human congestion. We learn circular linear flow field (CLiFF) maps which predict human trajectories given an initial observation. We then use these predictions to build and solve a Markov decision process online which efficiently routes the robot through the environment. Our approach is scalable enough to re-plan as new people are observed. We evaluate our approach on a real-world crowd dataset in a shopping mall.

2512.21109 2026-06-18 cs.RO updated version

Robust and Efficient MuJoCo-based Model Predictive Control via Web of Affine Spaces Derivatives

Chen Liang, Daniel Rakita

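Institutions * Department of Computer Science, Yale University

AI summary To address the finite-difference derivative bottleneck in MJPC, introduces Web of Affine Spaces (WASP) derivatives as a replacement, giving efficient and stable derivative computation, up to a 2× speedup across a range of robot tasks, and better performance than stochastic sampling-based planners.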

Comments Accepted to 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract

MuJoCo is a powerful and efficient physics simulator widely used in robotics. One common way it is applied in practice is through Model Predictive Control (MPC), which uses repeated rollouts of the simulator to optimize future actions and generate responsive control policies in real time. To make this process more accessible, the open source library MuJoCo MPC (MJPC) provides ready-to-use MPC algorithms and implementations built directly on top of the MuJoCo simulator. However, MJPC relies on finite differencing (FD) to compute derivatives through the underlying MuJoCo simulator, which is often a key bottleneck that can make it prohibitively costly for time-sensitive tasks, especially in high-DOF systems or complex scenes. In this paper, we introduce the use of Web of Affine Spaces (WASP) derivatives within MJPC as a drop-in replacement for FD. WASP is a recently developed approach for efficiently computing sequences of accurate derivative approximations. By reusing information from prior, related derivative calculations, WASP accelerates and stabilizes the computation of new derivatives, making it especially well suited for MPC's iterative, fine-grained updates over time. We evaluate WASP across a diverse suite of MJPC tasks spanning multiple robot embodiments. Our results suggest that WASP derivatives are particularly effective in MJPC: they integrate seamlessly across tasks, deliver consistently robust performance, and achieve up to a 2× speedup compared to an FD backend when used with derivative-based planners such as iLQG. In addition, WASP-based MPC outperforms MJPC's stochastic sampling-based planners on our evaluation tasks, offering both greater efficiency and reliability. To support adoption and future research, we release an open-source implementation of MJPC with WASP derivatives fully integrated.
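
To see why finite differencing becomes a bottleneck in high-DOF systems, it helps to count function evaluations. The sketch below is a plain forward-difference Jacobian, not MJPC's or WASP's code; `CountingStep` is a hypothetical two-dimensional stand-in for one simulator step. Each Jacobian costs one baseline call plus one call per input dimension.

```python
def fd_jacobian(f, x, eps=1e-6):
    """Forward-difference Jacobian of f at x.
    Costs len(x) + 1 evaluations of f. Returns rows indexed by output."""
    f0 = f(x)  # one baseline evaluation
    columns = []
    for i in range(len(x)):
        x_step = list(x)
        x_step[i] += eps  # perturb one input at a time
        f_i = f(x_step)
        columns.append([(a - b) / eps for a, b in zip(f_i, f0)])
    # columns[i][j] holds d f_j / d x_i; transpose so rows are outputs.
    return [list(row) for row in zip(*columns)]


class CountingStep:
    """Toy 'simulator step' that records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return [x[0] * x[1], x[0] + 2.0 * x[1]]


step = CountingStep()
J = fd_jacobian(step, [3.0, 4.0])  # analytic Jacobian: [[4, 3], [1, 2]]
```

For two inputs the toy step is called three times; for a state and control vector with dozens of dimensions the same scheme needs dozens of simulator steps per derivative, at every planning iteration, which is the cost that reusing information across related derivative computations aims to reduce.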

3. Manipulation, Grasping, and Dexterous Hands: 12 papers

2606.18628 2026-06-18 cs.RO new submission

Self-Supervised Mask-Aware Transformers for Fault-Tolerant FBG Force Sensing in Minimally Invasive Surgical Robotics

Peibo Sun, Shiyuan Dong, Shucheng Ye, Jianrong Cai, Yushan Liu, Hongen Liao, Tianqi Huang, Fang Chen

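Institutions * Shanghai Jiao Tong University; Shenzhen International Graduate School, Tsinghua University

AI summary To counter force-estimation degradation caused by channel coupling and fiber fractures in FBG sensors for minimally invasive surgical robots, proposes a unified self-supervised mask-aware Transformer; masked-channel reconstruction pretraining and fine-tuning with a dynamic corruption curriculum give graceful degradation under multi-channel failures, with a nominal RMSE of 0.0066 N on an 8-channel dataset.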

Abstract

In minimally invasive surgical robotics, catheter-scale Fiber Bragg Grating (FBG) sensors are promising due to their ability to estimate multi-dimensional forces by multiplexing several optical channels. However, deploying these compact multi-channel sensors introduces two critical engineering challenges: inherent nonlinear cross-axis coupling during complex deformations, and intermittent channel dropouts caused by fiber fractures in constrained workspaces. These compounding issues severely degrade force estimation. Existing fault-tolerant approaches rely on combinatorial model banks, which scale exponentially with the channel count and demand prohibitively expensive per-pattern calibration. In this paper, we propose a unified, self-supervised mask-aware Transformer that explicitly models channel availability to enable graceful degradation under diverse and dynamic sensor failures. The encoder is pretrained via masked-channel reconstruction on unlabeled data streams and fine-tuned for force regression using a balanced clean-and-corrupted-view objective alongside a dynamic corruption curriculum. Furthermore, a parallel uncertainty head, trained via heteroscedastic Gaussian negative log-likelihood, predicts per-axis confidence in a single forward pass, circumventing the overhead of multi-pass ensembles. Evaluated on a catheter-scale 8-channel FBG dataset, our single unified model achieves a nominal Root Mean Square Error (RMSE) of 0.0066 N and degrades gracefully to 0.0126 N under severe 4-channel failures. This significantly outperforms a comprehensive model bank of 255 per-pattern neural networks (0.0154 N at 4-channel loss) while eliminating pattern-specific calibration.
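
Two quantities in this abstract lend themselves to a short worked example. First, a model bank that "scales exponentially with the channel count": the figure of 255 networks for 8 channels is consistent with one network per non-empty subset of working channels, 2^8 - 1, although the abstract does not state that this is how the bank was built. Second, the heteroscedastic Gaussian negative log-likelihood used to train an uncertainty head. The sketch below uses the standard per-sample form with the constant term dropped; the function names and the log-variance parameterization are assumptions, not the paper's code.

```python
import math


def pattern_count(n_channels):
    """Number of channel-availability patterns with at least one working
    channel (assumed interpretation of 'per-pattern' models)."""
    return 2 ** n_channels - 1


def gaussian_nll(y, mu, log_var):
    """Heteroscedastic Gaussian NLL for one sample and one force axis,
    omitting the constant 0.5 * log(2 * pi).
    y: measured force, mu: predicted force, log_var: predicted log-variance."""
    return 0.5 * (log_var + (y - mu) ** 2 / math.exp(log_var))


n_models = pattern_count(8)
loss_exact = gaussian_nll(y=1.0, mu=1.0, log_var=0.0)
loss_off = gaussian_nll(y=2.0, mu=1.0, log_var=0.0)
```

Because the variance is predicted per sample, the model can lower its loss on hard inputs by reporting a larger variance, which is what lets a single forward pass yield a confidence estimate.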

2606.19089 2026-06-18 cs.RO new submission

ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing

Alessandro Scherl, Bernhard Neuberger, Simon Schwaiger, David Mulero-Pérez, Lucas Muster, Jose Garcia-Rodriguez

Institutions * Department of Computer Technology, University of Alicante; Department of Industrial Engineering, UAS Technikum Vienna; Automation and Control Institute, TU Wien; Institute of Software Engineering and Artificial Intelligence, Graz University of Technology; Institute for Integrative Nature Conservation Research, University of Natural Resources and Life Sciences Vienna

AI summary Proposes ART-VS, a coarse-to-fine two-phase method that adapts feature granularity and improves visual-servoing robustness and precision without task-specific training, substantially lowering positioning error and increasing speed.

Comments Accepted at IROS2026

Abstract

Visual servoing with self-supervised Vision Transformer (ViT) features enables training-free robotic positioning with strong generalization, but faces a fundamental trade-off between robustness and precision. Coarse patch-level descriptors provide stable correspondences yet limit positioning accuracy. Increasing image resolution improves precision but yields only marginal robustness gains: under perturbation, high-resolution processing improves convergence success rate from 76.6% to just 81.0% despite 12x more ViT patches. Therefore, we propose Adaptive Resolution Tiling Visual Servoing (ART-VS), a two-phase method that adapts feature granularity to servoing progress: a coarse phase at native ViT resolution for stable alignment, then a tiled high-resolution phase that restricts matching to local neighborhoods, improving positioning accuracy. Without any task-specific training, ART-VS achieves 95.4% convergence under perturbation, outperforming standard and full-resolution ViT-based servoing by 18.8 and 14.4 percentage points, respectively. Over the former it reduces positioning error by 53%, while running at over 10x higher speed and 27% lower VRAM than the latter. We validate ART-VS across three ViT backbones and demonstrate real-world category-level grasping of unseen object instances, achieving 95/100 on transparent bottles and 98/100 on shoes. Code is available at https://art-vs.github.io/.

2606.19091 2026-06-18 cs.RO new submission

GCNGrasp-VP: Affordance-Guided View Planning for Efficient Task-Oriented Grasping

GCNGrasp-VP: 基于功能引导的视角规划用于高效任务导向抓取

Zanjia Tong, Wenlong Dong, Chengjie Zhang, Hong Zhang

发表机构 * Shenzhen Key Laboratory of Robotics and Computer Vision(机器人与计算机视觉深圳重点实验室)

AI总结 提出GCNGrasp-VP框架,通过功能场预测引导主动视角规划,无需场景重建,单次视角调整即可显著提升遮挡下的任务导向抓取成功率。

Comments Accepted to IROS 2026

详情
AI中文摘要

当物体视角存在遮挡时,任务导向抓取性能会显著下降。现有的任务导向抓取方法通常假设任务相关区域在初始帧中可见,而视角规划方法虽然能够实现主动感知,但往往忽略任务语义并依赖耗时的场景重建。为了解决这些局限性,我们提出了GCNGrasp-VP,一个将功能场预测与主动视角规划相结合的高效框架。该框架的核心是GCNGrasp-v2,一个同时支持抓取评估和功能场预测的任务导向抓取模型,实现了常数时间推理复杂度。利用这一能力,我们的功能引导视角规划器(Affordance-VP)将功能场作为信息增益度量,无需场景重建即可引导相机观察任务相关区域。视角规划结果表明,我们的方法仅需一次视角调整就显著优于基于场景不确定性的基线方法。真实世界验证进一步证实了在单物体场景中抓取成功率的显著提升,同时保持毫秒级计算延迟。代码和模型可在以下网址获取:https://github.com/Instinct323/GCNGrasp-VP。

英文摘要

Task-oriented grasping performance degrades significantly when object views suffer from occlusions. Existing task-oriented grasping methods typically assume task-relevant regions are visible in the initial frame, while view planning approaches enable active perception but often ignore task semantics and rely on time-consuming scene reconstruction. To address these limitations, we present GCNGrasp-VP, an efficient framework integrating affordance field prediction with active view planning. Central to this framework is GCNGrasp-v2, a task-oriented grasp model that simultaneously supports grasp evaluation and affordance field prediction, achieving constant-time inference complexity. Leveraging this capability, our Affordance-guided View Planner (Affordance-VP) utilizes the affordance field as an information gain metric to guide camera observation of task-relevant regions without requiring scene reconstruction. View planning results show that our method significantly outperforms scene-uncertainty-driven baselines with only one view adjustment. Real-world validation further confirms substantial improvements in grasp success rates for single-object scenarios while maintaining millisecond-level computational latency. Code and models are available at https://github.com/Instinct323/GCNGrasp-VP.
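Using an affordance field as an information-gain metric can be sketched as a scoring rule over candidate views. The sketch below is an assumption-laden illustration, not the Affordance-VP planner: `select_next_view`, the per-point affordance vector, and the boolean `visibility` matrix are all hypothetical, and how visibility would be obtained without scene reconstruction is outside its scope.

```python
import numpy as np

def select_next_view(affordance, observed, visibility):
    """Pick the candidate view with the largest affordance-weighted gain.

    affordance: (N,) task relevance of each point, e.g. in [0, 1].
    observed:   (N,) bool, True for points already seen.
    visibility: (V, N) bool, True where candidate view v would see point n
                (hypothetical input for this sketch).
    Returns (best_view_index, per_view_gain).
    """
    newly_seen = visibility & ~observed      # points a view would add
    gain = newly_seen @ affordance           # affordance mass per view
    return int(np.argmax(gain)), gain
```

A view that only re-observes known points, or that reveals points with low task relevance, scores low, so a single adjustment tends to go toward the occluded task-relevant region.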

2606.19194 2026-06-18 cs.RO 新提交

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

用于机器人操作中一步流匹配的可逆神经网络适配器

Yu Zhang, Kangyi Ji, Yongxiang Zou, Rongtao Xu, Feng Zheng, Long Cheng

AI总结 提出可逆神经网络适配器,通过一步去噪过程生成高维动作,降低推理复杂度并保持精度,在仿真和真实实验中提升效率。

详情
AI中文摘要

本文提出了一种用于通用机器人操作的可逆神经网络适配器,旨在通过一步去噪过程,基于多模态观测(包括视觉、语言和本体感受输入)生成精确的高维动作。基于流匹配公式,所提出的适配器有效地将动作生成轨迹约束在可逆潜空间内,从而仅需单次推理步骤即可实现高效、高质量的灵巧动作合成。与传统的迭代流匹配策略相比,所提出的框架显著降低了推理复杂度,同时保持了强大的动作预测精度和稳定性。在多种仿真基准和真实机器人平台上进行了大量实验,以评估所提出方法的有效性。在仿真基准测试中,所提出的适配器在广泛的操作任务上持续表现出优于或接近最先进的性能。此外,真实世界实验显示,视觉-语言-动作(VLA)模型的推理效率显著提升,平均推理延迟从110毫秒降低到61毫秒,同时保持了强大的任务性能。

英文摘要

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.
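To make the contrast between iterative and one-step flow-matching inference concrete, here is a toy sketch. It assumes an oracle velocity field along a straight path toward a fixed target action, which is not the paper's invertible adapter; `euler_sample`, `straight_velocity`, and `target` are illustrative names only.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x

# Hypothetical target action; a learned policy would condition on observations.
target = np.array([0.3, -0.2, 0.5])

def straight_velocity(x, t):
    """Oracle velocity of a straight-line flow that reaches `target` at t=1."""
    return (target - x) / (1.0 - t)

noise = np.zeros(3)
one_step = euler_sample(straight_velocity, noise, num_steps=1)
ten_step = euler_sample(straight_velocity, noise, num_steps=10)
```

When the flow is perfectly straight, one Euler step lands on the same point as ten, which is why straightening or constraining the generation trajectory can remove the need for iteration. For scale, the reported drop from 110 ms to 61 ms is a reduction of roughly 45% in average latency.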

2606.19233 2026-06-18 cs.RO 新提交

Mobile Pedipulation for Object Sliding via Hierarchical Control on a Wheeled Bipedal Robot

基于轮式双足机器人分层控制的移动式腿部操作物体滑动

Yue Qin, Yulun Zhuang, Zelin Shen, Yanran Ding

发表机构 * University of Michigan(密歇根大学)

AI总结 提出一种分层控制框架,使轮式双足机器人能用腿部滑动平面物体,通过简化三刚体动力学模型和轨迹优化运动规划器,在实验中成功实现1kg物体取回和4kg物体滑动。

Comments 8 pages, 7 figures

详情
AI中文摘要

在本文中,我们提出了一种分层控制框架,使轮式双足机器人能够利用其轮式腿执行平面物体滑动任务。该方法基于一个简化三刚体动力学模型构建了非线性模型预测控制器,该模型明确考虑了髋关节滚动自由度和多种轮-环境接触模式,这对于横向步态和腿部操作任务至关重要。在该框架内,非线性模型预测控制器同时调节机器人运动和交互力,使机器人能够稳定地执行滚动和物体操作行为。我们开发了一个基于轨迹优化的机器人-物体运动规划器,以生成包含地面-物体接触中粘滑转换的参考运动。通过实际硬件实验验证了两种代表性的腿部操作运动,即滑行和横向滑动,其中机器人成功地从桌子下取回一个1kg的物体,并通过滑行将一个4kg的物体滑动0.228米的距离。

英文摘要

In this letter, we present a hierarchical control framework that enables wheeled bipedal robots to perform planar object sliding tasks with their wheeled legs. The proposed approach formulates a nonlinear model predictive controller (NMPC) based on a reduced-order three rigid bodies (TRB) dynamical model that explicitly accounts for the hip roll degree of freedom and multiple wheel-environment contact modes, which is essential for lateral stepping and pedipulation tasks. Within this framework, the NMPC simultaneously regulates robot locomotion and interaction forces, allowing the robot to stably execute both rolling and object manipulation behaviors. A trajectory-optimization-based robot-object motion planner is developed to generate reference motions that incorporate stick-slip transitions in ground-object contact. Two representative pedipulation motions, namely scooting and lateral sliding, are validated through real-world hardware experiments, in which the robot successfully retrieves a 1 kg object from under a desk and slides a 4 kg object over a distance of 0.228 m via scooting.

2606.19314 2026-06-18 cs.RO 新提交

Modeling Branches for Active Manipulation using Iterative Parameter Estimation

基于迭代参数估计的主动操作分支建模

Madhav Rijal, Rashik Shrestha, Trevor Smith, Yu Gu

发表机构 * Department of Mechanical and Aerospace Engineering, West Virginia University(西弗吉尼亚大学机械与航空航天工程系)

AI总结 提出一种通过迭代估计材料参数来建模植物分支的方法,利用有限元模拟和变形感知运动规划器,实现精确分支操作,平均变形能量降低35.69%。

Comments Accepted to IROS 2026

详情
AI中文摘要

本研究提出了一种通过迭代估计材料参数来建模多样化植物分支的方法,以支持精细的分支操作。在农业机器人中,分支操作对于植物重新定位、稳定以及清除密集叶片中的视觉障碍是必要的。该方法从点云数据构建四面体分支模型,并使用有限元方法模拟其行为。利用真实观测的变形数据,迭代估计分支参数,然后通过变形感知运动规划器计算最优路径,以在另一个机器人的视野内移动和稳定分支。在30次对具有不同几何形状和材料特性的分支进行的试验中,该方法平均降低了35.69%的变形能量,同时路径长度平均增加了8.10%。

英文摘要

This study presents a method for modeling diverse plant branches by iteratively estimating material parameters to support delicate branch manipulation. Branch manipulation is necessary in agricultural robotics for plant repositioning, stabilizing, and clearing visual obstructions in dense foliage. The proposed method builds a tetrahedral branch model from point-cloud data and simulates its behavior using the finite element method. Using real observed deformation data, it iteratively estimates branch parameters and then computes an optimal path with a deformation-aware motion planner to move and stabilize branches within another robot's field of view. Across 30 trials on branches with varying geometries and material properties, the proposed method reduced the deformation energy by 35.69% while increasing the path length by 8.10% on average.
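The loop of simulating, comparing against observed deformation, and updating a material parameter can be sketched on a much simpler model. The example below swaps the paper's tetrahedral finite element model for the textbook cantilever-beam formula (tip deflection = F L^3 / (3 E I)) and estimates only Young's modulus; the function names and the multiplicative update are assumptions for illustration.

```python
def tip_deflection(force, length, youngs_modulus, second_moment):
    """Tip deflection of an end-loaded cantilever (Euler-Bernoulli beam)."""
    return force * length**3 / (3.0 * youngs_modulus * second_moment)

def estimate_modulus(observations, length, second_moment,
                     e_init=1e9, iters=20, tol=1e-9):
    """Iteratively fit Young's modulus to (force, observed_deflection) pairs.

    Each iteration simulates the deflections with the current estimate and
    rescales the modulus by the mean simulated/observed ratio: if the model
    bends more than the real branch, it is too soft and E is increased.
    """
    e = e_init
    for _ in range(iters):
        ratios = [tip_deflection(f, length, e, second_moment) / d
                  for f, d in observations]
        e_new = e * sum(ratios) / len(ratios)
        if abs(e_new - e) <= tol * e:
            return e_new
        e = e_new
    return e
```

Because deflection is inversely proportional to the modulus in this toy model, the update converges almost immediately on noise-free data; a finite element model with several parameters would need a more general optimizer.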

2606.19333 2026-06-18 cs.RO cs.CV 新提交

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Do as I Do: 从日常人类视频中获取灵巧操作数据

Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

发表机构 * UC Berkeley(加州大学伯克利分校)

AI总结 提出DO AS I DO算法,从单目RGB人类视频中重建手-物交互并重定向到多指灵巧机器人手,生成可执行的操作数据,优于现有方法。

Comments Project website: https://do-as-i-do.com/

详情
AI中文摘要

我们如何可扩展地生成机器人操作数据,特别是在像多指灵巧手这样的人形平台上?从人类视频中学习最近成为这个问题的可能答案。然而,估计手-物交互和跨越人-机器人具身差距的困难阻碍了将丰富的单目RGB人类视频作为机器人操作数据的主要来源。在这项工作中,我们提出了DO AS I DO,一种将单目RGB人类视频重建并重定向到多指灵巧机器人手的算法。DO AS I DO从各种自我中心和外部中心的野外视频源中重建手-物交互。然后,该算法将这些手-物交互估计重定向为一系列可在现实世界中执行的动作,从不同的人类视频中生成机器人完整的操作数据。总体而言,DO AS I DO在从RGB视频中估计手-物交互和提取灵巧操作轨迹方面优于先前的最先进技术,正如我们在具有真实标签的数据集和在线收集的视频片段数据集上的实验所示。我们的实验使我们能够为从业者收集人类操作数据提出一个有效性指南。

英文摘要

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

2606.18960 2026-06-18 cs.CV cs.RO 交叉投稿

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World:用于持久机器人操作的内存增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

发表机构 * Dalian University of Technology(大连理工大学) Samsung R&D Institute China-Beijing (SRCB)(三星中国北京研究院)

AI总结 提出Mem-World,通过4D腕部视角曲面元索引内存W-VMem,解决操作中因遮挡和运动导致的场景遗忘问题,实现持久世界建模,提升策略评估与改进效果。

详情
AI中文摘要

动作条件世界模型已成为机器人学习的一种有前景的范式,通过生成动作一致的视频推演,为昂贵的真实世界实验提供了可扩展的替代方案。然而,在操作中持久世界建模仍然具有挑战性:频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图,导致模型遗忘或幻觉先前帧中看到的场景细节。现有的内存检索策略在动态操作场景中往往无法识别信息丰富的历史。为解决这一限制,我们提出了Mem-World,一种内存增强的多视图动作条件世界模型。其核心是W-VMem,一种4D腕部视图为中心的曲面元索引内存,将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素被观测的时间和位置,W-VMem能够根据未来动作实现几何感知的相关历史帧检索。在生成过程中,通过基于曲面元的渲染和评分选择相关历史帧,为预测提供信息丰富且非冗余的上下文。大量实验表明,Mem-World在复杂操作场景中生成持久推演,比Ctrl-World实现更可靠的策略评估,将皮尔逊相关系数提高14.5%,并通过合成数据生成支持有效的策略改进,在长时域任务中将成功率从58%提升到72%。

英文摘要

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
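Selecting history frames that are informative and non-redundant can be sketched as a greedy coverage problem. This is an illustrative simplification, not W-VMem: it assumes each stored frame is summarized by the set of surfel IDs it observed and that the surfels expected in the future view are already known; `select_history_frames` is a hypothetical name.

```python
def select_history_frames(frame_surfels, query_surfels, k):
    """Greedily choose up to k history frames covering the queried surfels.

    frame_surfels: dict mapping frame id -> set of surfel ids seen in it.
    query_surfels: set of surfel ids expected in the view to be predicted.
    A frame is only chosen if it covers surfels not yet covered, which
    discourages redundant context.
    """
    remaining = set(query_surfels)
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for fid, surfels in frame_surfels.items():
            if fid in chosen:
                continue
            gain = len(surfels & remaining)
            if gain > best_gain:
                best, best_gain = fid, gain
        if best is None:  # nothing left adds new coverage
            break
        chosen.append(best)
        remaining -= frame_surfels[best]
    return chosen
```

With two frames that see the same surfels, only one of them is picked, and the remaining budget goes to a frame that reveals something different.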

2501.02874 2026-06-18 cs.RO 版本更新

Steering Flexible Linear Objects in Planar Environments by Two Robot Hands Using Euler's Elastica Solutions

利用欧拉弹性线解通过两只机器人手在平面环境中操控柔性线性物体

Aharon Levin, Elon Rimon, Amir Shapiro

发表机构 * Dept. of ME, Technion, Israel(以色列理工学院机械工程系) Dept. of ME, Ben-Gurion University, Israel(以色列本-古里安大学机械工程系)

AI总结 本文利用欧拉弹性线解,通过控制两机器人手的抓取端点位置和切线,实现平面环境中柔性线性物体的无自交、稳定和避障操控。

详情
AI中文摘要

机器人手对柔性物体(如电缆、电线和生鲜食品)的操控构成了机器人抓取力学中的一个特殊挑战。本文考虑了两机器人手在平面环境中操控柔性线性物体的问题。柔性线性物体被建模为弹性不可拉伸杆,通过改变抓取端点位置同时保持端点切线相等来进行操控。柔性线性物体的形状具有基于抓取端点位置和切线的闭式解,称为欧拉弹性线。本文在最优控制框架下获得了弹性线解,然后利用弹性线解得到了柔性线性物体无自交、稳定性和避障的闭式判据。这些新工具被整合到一个规划方案中,用于在稀疏障碍物分布的平面环境中操控柔性线性物体。该方案已完全实现并通过详细示例进行了演示。

英文摘要

The manipulation of flexible objects such as cables, wires and fresh food items by robot hands forms a special challenge in robot grasp mechanics. This paper considers the steering of flexible linear objects in planar environments by two robot hands. The flexible linear object, modeled as an elastic non-stretchable rod, is manipulated by varying the gripping endpoint positions while keeping equal endpoint tangents. The shape of the flexible linear object has a closed-form solution in terms of the grasp endpoint positions and tangents, called Euler's elastica. This paper obtains the elastica solutions under the optimal control framework, then uses the elastica solutions to obtain closed-form criteria for non-self-intersection, stability and obstacle avoidance of the flexible linear object. The new tools are incorporated into a planning scheme for steering flexible linear objects in planar environments populated by sparsely spaced obstacles. The scheme is fully implemented and demonstrated with detailed examples.

2601.20381 2026-06-18 cs.RO 版本更新

STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation

STORM:基于槽的任务感知面向对象的机器人操作表示

Alexandre Chapin, Emmanuel Dellandréa, Liming Chen

发表机构 * Ecole Centrale de Lyon, LIRIS(里昂中央理工学院,LIRIS实验室)

AI总结 提出STORM模块,通过多阶段训练策略将冻结的视觉基础模型与语义感知槽结合,生成面向对象的任务感知表示,提升机器人操作在视觉干扰下的泛化性和控制性能。

详情
AI中文摘要

视觉基础模型为机器人提供了强大的感知特征,但其密集表示缺乏显式的对象级结构,限制了操作任务的鲁棒性和可收缩性。我们提出STORM(基于槽的任务感知面向对象的机器人操作表示),一个轻量级的面向对象适应模块,通过一组语义感知槽增强冻结的视觉基础模型,用于机器人操作。STORM不重新训练大型骨干网络,而是采用多阶段训练策略:首先通过使用语言嵌入的视觉-语义预训练稳定面向对象的槽,然后与下游操作策略联合适应。这种分阶段学习防止了退化槽的形成,并在保持语义一致性的同时将感知与任务目标对齐。在对象发现基准和模拟操作任务上的实验表明,与直接使用冻结的基础模型特征或端到端训练面向对象的表示相比,STORM改善了对视觉干扰物的泛化能力和控制性能。我们的结果强调了多阶段适应作为将通用基础模型特征转化为用于机器人控制的任务感知面向对象表示的有效机制。

英文摘要

Visual foundation models provide strong perceptual features for robotics, but their dense representations lack explicit object-level structure, limiting robustness and contractility in manipulation tasks. We propose STORM (Slot-based Task-aware Object-centric Representation for robotic Manipulation), a lightweight object-centric adaptation module that augments frozen visual foundation models with a small set of semantic-aware slots for robotic manipulation. Rather than retraining large backbones, STORM employs a multi-phase training strategy: object-centric slots are first stabilized through visual-semantic pretraining using language embeddings, then jointly adapted with a downstream manipulation policy. This staged learning prevents degenerate slot formation and preserves semantic consistency while aligning perception with task objectives. Experiments on object discovery benchmarks and simulated manipulation tasks show that STORM improves generalization to visual distractors and control performance compared to directly using frozen foundation model features or training object-centric representations end-to-end. Our results highlight multi-phase adaptation as an efficient mechanism for transforming generic foundation model features into task-aware object-centric representations for robotic control.

2605.05925 2026-06-18 cs.RO 版本更新

DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

DexSynRefine:合成与精炼人-物交互运动以实现物理可行的灵巧机器人动作

Hyesung Lee, Hyunwoo Jung, Si-Hwan Heo, Sungwook Yang

发表机构 * Korea Institute of Science and Technology(韩国科学技术研究院) KAIST(韩国科学技术院) Hanyang University(汉阳大学)

AI总结 提出DexSynRefine框架,通过HOI-MMFP运动先验合成手-物轨迹,结合任务空间残差强化学习和接触动力学适应,将人-物交互数据转化为物理可行的灵巧操作,在五个任务上成功率提升50-70个百分点。

Comments Project page: https://dexsynrefine.github.io/

详情
AI中文摘要

从人-物交互(HOI)数据中学习灵巧操作为机器人遥操作提供了一种可扩展的替代方案,但HOI演示通常稀疏且纯运动学,在实体不匹配和接触丰富的动力学下直接重定向不可靠。我们提出DexSynRefine,一个耦合框架,将HOI数据视为结构化运动先验而非可执行的机器人动作。DexSynRefine首先使用HOI运动流形流基元(HOI-MMFP)——一种耦合手-物运动的运动先验,根据任务和初始物体状态合成手-物轨迹。然后通过任务空间残差强化学习对其进行物理接地,并通过从本体感受历史推断缺失的接触动力学上下文来适应执行。在五个灵巧操作任务中,每个阶段解决一个互补的瓶颈:HOI-MMFP提高了轨迹一致性和平滑性,任务空间残差在测试的替代方案中提供了最强的接地表示,接触动力学适应实现了鲁棒的真实世界执行。综合来看,DexSynRefine在真实世界中的成功率比运动学重定向提高了50-70个百分点。

英文摘要

Learning dexterous manipulation from human-object interaction (HOI) data offers a scalable alternative to robot teleoperation, but HOI demonstrations are typically sparse and purely kinematic, making direct retargeting unreliable under embodiment mismatch and contact-rich dynamics. We present DexSynRefine, a coupled framework that treats HOI data as structured motion priors rather than executable robot actions. DexSynRefine first synthesizes hand-object trajectories conditioned on the task and initial object state using HOI Motion Manifold Flow Primitives (HOI-MMFP), a motion prior for coupled hand-object motion. It then physically grounds them with task-space residual reinforcement learning and adapts execution by inferring missing contact-dynamics context from proprioceptive history. Across five dexterous manipulation tasks, each stage addresses a complementary bottleneck: HOI-MMFP improves trajectory consistency and smoothness, task-space residuals provide the strongest grounding representation among the tested alternatives, and contact-dynamics adaptation enables robust real-world execution. Together, DexSynRefine improves real-world success rates over kinematic retargeting by 50-70 percentage points.

2606.13672 2026-06-18 cs.RO 版本更新

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

WEAVER,更好、更快、更长:一种有效的机器人操作世界模型

Arnav Kumar Jain, Yilin Wu, Jesse Farebrother, Gokul Swamy, Andrea Bajcsy

发表机构 * Mila - Québec AI Institute(Mila - 魁北克人工智能研究所) Université de Montréal(蒙特利尔大学) Carnegie Mellon University(卡内基梅隆大学) McGill University(麦吉尔大学)

AI总结 提出WEAVER世界模型架构,通过流匹配损失训练多视图潜在预测,同时实现高保真度、长程一致性和高效推理,在机器人操作任务中显著提升策略评估、改进和测试时规划性能。

详情
AI中文摘要

世界模型(即学习型模拟器)对机器人技术的潜在影响深远,包括策略评估、策略改进和测试时规划,所有这些都只需有限的真实世界交互。为了解锁这些下游能力,世界模型需要同时满足三个期望:(i)保真度(即产生与现实相关的模拟轨迹),(ii)一致性(即产生在长时域上连贯的模拟轨迹),以及(iii)效率(即快速产生模拟轨迹)。我们提出WEAVER(面向具身推理的多视图世界估计):一种同时实现所有三个期望的世界模型架构,在机器人操作任务上提供了最先进的结果。WEAVER是一个多视图世界模型,通过流匹配损失训练以预测未来潜在状态和奖励值。我们提炼了模型架构、记忆和预测目标方面的关键设计决策,以解锁那些困扰先前世界建模方法的长时间动态操作任务。我们将WEAVER应用于机器人硬件,展示了其在策略评估(与真实世界成功率的相关系数$\rho=0.870$)、策略改进(在$\pi_{0.5}$机器人基础模型上真实世界成功率提升38%)和测试时规划(真实世界成功率提升14%,且比先前世界模型快5-10倍)方面的有效性。WEAVER在分布外场景评估中也表现出优于先前世界模型的性能。代码、模型和视频见:https://arnavkj1995.github.io/WEAVER/。

英文摘要

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching, spanning policy evaluation, policy improvement, and test-time planning, all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: (i) fidelity (i.e., producing simulated trajectories that correlate with reality), (ii) consistency (i.e., producing simulated trajectories that are coherent over long horizons), and (iii) efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. WEAVER is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER in robotic hardware, demonstrating its effectiveness at policy evaluation ($\rho = 0.870$ correlation with real-world success rate), policy improvement (real-world success rate improvement of 38% on top of the $\pi_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of 14% with a 5-10x speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/.
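The fidelity criterion, simulated outcomes that correlate with reality, is typically quantified with the Pearson correlation coefficient between per-policy success rates predicted in the world model and those measured on hardware. A minimal sketch follows; the function name is an assumption, and any numbers passed to it in practice would come from the evaluator's own experiments, not from this digest.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both series
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

A value near 1 means the world model ranks and scales policies the way real-world trials do, which is what makes it usable as a stand-in for hardware evaluation.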

4. 导航、定位与SLAM 11 篇

2606.18426 2026-06-18 cs.RO 新提交

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

VEGA: 从野外自我中心视频中通过几何轨迹监督学习导航VLA

Gershom Seneviratne, Yohan Abeysinghe, Jianyu An, Vaibhav Shende, Dinesh Manocha

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 提出VEGA方法,利用未标注的自我中心视频通过重建场景几何生成障碍感知轨迹,训练流匹配VLA导航策略,在VEGA-Bench上碰撞减少33.0%,真实世界成功率提升至少150.0%。

详情
AI中文摘要

我们提出了VEGA,一种从未标注的自我中心导航视频中训练导航视觉-语言-动作(VLA)模型的方法。互联网规模的自我中心视频提供了可扩展的导航相关视觉观察来源,捕捉了杂乱场景、近距离障碍物以及通过真实世界空间的自然人体运动。然而,这些视频不能直接用于策略学习,因为它们没有提供在机器人坐标系中基于显式导航目标的障碍感知轨迹。VEGA通过从单目视频重建局部场景几何、采样导航目标(表示为文本、图像或空间路径点)并利用构建的几何生成障碍感知轨迹来解决这一差距。生成的轨迹分布随后用于训练流匹配VLA导航策略。通过仅在训练期间使用几何,VEGA将障碍感知规划直接蒸馏到基于视觉的策略中。此外,我们引入了VEGA-Bench,一个包含25万场景和约500万个导航目标(与场景几何配对)的基准,旨在评估VLA的目标进展、碰撞避免和障碍物间隙。我们的评估表明,VEGA在VEGA-Bench上实现了有竞争力的目标进展,同时相比最强基线碰撞减少33.0%,障碍物间隙提高17.9%,在真实世界试验中成功率至少提高150.0%,碰撞至少减少66.7%,障碍物间隙至少提高60.0%。最终,我们证明了视频衍生的几何监督为训练障碍感知导航VLA提供了可扩展且有效的信号。代码和基准将在发表时发布。

英文摘要

We introduce VEGA, an approach for training navigation Vision-Language-Action (VLA) models from unlabeled egocentric navigation videos. Internet-scale egocentric videos provide a scalable source of navigation-relevant visual observations, capturing cluttered scenes, close-range obstacles, and natural human motion through real-world spaces. However, these videos are not directly usable for policy learning because they do not provide obstacle-aware trajectories conditioned on explicit navigation goals in the robot's coordinate frame. VEGA addresses this gap by reconstructing local scene geometry from monocular video, sampling navigation goals (represented as text, image, or spatial waypoints), and generating obstacle-aware trajectories using the constructed geometry. The resulting trajectory distribution is then used to train a flow-matching VLA navigation policy. By using geometry exclusively during training, VEGA distills obstacle-aware planning directly into a vision-based policy. Furthermore, we introduce VEGA-Bench, a benchmark containing 250k scenes and approximately 5 million navigation goals paired with scene geometry, designed to evaluate goal progress, collision avoidance, and obstacle clearance of VLAs. Our evaluation shows that VEGA achieves competitive goal progress while reducing collisions by 33.0% and improving obstacle clearance by 17.9% over the strongest baseline on VEGA-Bench, while improving success by at least 150.0%, reducing collisions by at least 66.7%, and improving obstacle clearance by at least 60.0% in real-world trials. Ultimately, we demonstrate that video-derived geometric supervision provides a scalable and effective signal for training obstacle-aware navigation VLAs. The code and benchmark will be released at the time of publication.

2606.18634 2026-06-18 cs.RO cs.AI 新提交

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

EffiNav: 融合深度与视觉语言实现高效物体目标导航

Zecheng Yin, Benedict Jun Ma

发表机构 * Systems Hub of Intelligence Transportation HKUST(GZ)(香港科技大学(广州)智能交通系统中心)

AI总结 提出EffiNav框架,融合深度信息与视觉语言模型,通过预测探索边界和语义先验指导导航,在HM3D和OVON数据集上匹配或超越基线,提升路径效率与泛化性。

详情
AI中文摘要

在未知环境中定位目标物体是自主智能体的基本能力,应用范围从搜索救援到野外机器人。该任务的简化版本是物体目标导航(ObjNav)。在ObjNav中,成功到达目标物体提供了基本的性能度量;然而,导航轨迹的效率同样重要,因为它指示了智能体探索的智能程度以及后续任务剩余的时间。在未知环境中,高效导航的关键在于决定下一步探索的位置。尽管许多先前工作旨在解决这一核心挑战并在某些场景中取得了有希望的性能,但最近的基于训练的模型和非训练框架分别仍存在泛化性和效率问题,在最坏情况下可能导致对已访问区域的过度探索或冗余的来回运动。我们在两个广泛使用的仿真基准Habitat Matterport 3D(HM3D)和开放词汇物体目标导航(OVON)上评估EffiNav,并在真实世界的物理机器人上进一步验证其有效性。我们对大量仿真回合进行了失败分析。通过最小修改,我们还将EffiNav扩展到GOAT-BENCH数据集上的记忆增强ObjNav任务,展示了其在标准ObjNav设置之外的适应性。在两个标准指标——成功率(SR)和路径长度加权成功率(SPL)上,EffiNav匹配或超越了最近的基线,反映了其效率、鲁棒性和实际适用性。认识到两个数据集的不同侧重点,性能表明该框架在高效ObjNav中更加平衡和可泛化。

英文摘要

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robotics. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works have aimed to address this core challenge and have achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues, respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct a failure analysis over a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, these results indicate that the framework is more balanced and generalizable for efficient ObjNav.
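The two metrics named in the abstract have standard definitions: SR is the fraction of successful episodes, and SPL averages, over episodes, the success indicator times the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. A small sketch, with function and argument names chosen here for illustration and shortest-path lengths assumed to be positive:

```python
def success_rate(successes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length.

    successes:        1 for a successful episode, 0 otherwise.
    shortest_lengths: shortest-path distance from start to goal (assumed > 0).
    path_lengths:     length of the path the agent actually travelled.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)  # an optimal-length success scores 1
    return total / len(successes)
```

An agent that succeeds but wanders twice the optimal distance earns only 0.5 for that episode, so SPL penalizes exactly the redundant back-and-forth exploration that SR alone does not capture.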

2606.18951 2026-06-18 cs.RO 新提交

A High-accuracy Event-based Underwater SLAM System

高精度事件相机水下SLAM系统

Yifan Peng, Qihang Liu, Haoying Li, Yuzhe Li, Junfeng Wu, Ziyang Hong

AI总结 针对事件相机水下SLAM中时间曲面成像质量差和匹配失败问题,提出基于结构感知度量和贝叶斯优化的高精度立体SLAM系统,并贡献首个高质量水下事件数据集UWE。

详情
AI中文摘要

虽然事件相机为水下SLAM提供了巨大潜力,但现有的基于时间曲面(TS)的方法在水下部署时被证明非常不可靠。波动的相机速度严重降低了TS成像质量,而宽立体基线和重复的水下纹理导致关键匹配失败,频繁引发系统崩溃。为克服这些挑战,我们开发了首个高精度事件相机水下立体SLAM系统。基于结构张量相干性和梯度,设计了一种结构感知度量来定量评估TS结构信息密度。通过将最优TS生成解耦为基于系统初始化的两个不同阶段,贝叶斯优化(BO)在初始化前首先预测最优先验TS,同时我们设置异步在线局部搜索方法,在跟踪阶段实时获取合适的TS。我们使用先验视差保证精确的数据关联,并采用“最新观测优先”三角测量机制实现稳定三角测量。作为这些解决方案的基准和社区资源,我们还贡献了UWE,这是首个高质量真实世界水下事件数据集,包含变化的相机运动、复杂纹理和不同轨迹特征。在公共数据集和UWE上的广泛评估表明,所提出的SLAM系统与最先进的事件相机方法相比具有竞争力的精度性能。代码和数据将开源。

英文摘要

While event cameras offer immense potential for underwater SLAM, existing Time Surface (TS)-based methods prove highly unreliable when deployed underwater. Fluctuating camera velocities severely degrade TS imaging quality, while wide stereo baselines and repetitive underwater textures induce critical matching failures, frequently triggering system failure. To overcome these challenges, we develop the first high-accuracy event-based underwater stereo SLAM system. A structure-aware metric for TS is designed based on structure tensor coherence and gradients to quantitatively evaluate TS structural information density. Optimal TS generation is decoupled into two stages around system initialization: Bayesian Optimization (BO) first predicts an optimal prior TS sequentially before initialization, and an asynchronous online local search then runs periodically to obtain a suitable TS in real time during the tracking stage. We use the prior disparity to guarantee precise data association and a "latest-observation-first" triangulation mechanism to achieve stable triangulation. As a benchmark for these solutions and a resource for the community, we also contribute UWE, the first high-quality real-world underwater event dataset containing variable camera motions, complex textures and different trajectory features. Extensive evaluations on public datasets and UWE show that the proposed SLAM system achieves competitive accuracy compared to the state-of-the-art event-based method. The code and data will be open-sourced.
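Structure tensor coherence, one ingredient of the metric described above, has a textbook form: with eigenvalues λ1 ≥ λ2 of the averaged gradient outer-product matrix, coherence is ((λ1 − λ2) / (λ1 + λ2))². The sketch below computes that generic quantity over a whole image; it is not the paper's structure-aware metric, and `structure_coherence` is a name introduced here.

```python
import numpy as np

def structure_coherence(image):
    """Global structure-tensor coherence of a 2D image, in [0, 1].

    Near 1: gradients share one dominant orientation (clear edges).
    Near 0: no gradients, or gradients with no preferred orientation.
    """
    gy, gx = np.gradient(np.asarray(image, dtype=float))  # axis 0, axis 1
    jxx = (gx * gx).mean()
    jyy = (gy * gy).mean()
    jxy = (gx * gy).mean()
    trace = jxx + jyy                       # lambda1 + lambda2
    if trace == 0.0:
        return 0.0                          # flat image: no structure
    diff = np.sqrt((jxx - jyy) ** 2 + 4.0 * jxy ** 2)  # lambda1 - lambda2
    return float((diff / trace) ** 2)
```

A time surface blurred by an ill-suited decay would be expected to score lower than one with crisp edges, which is the kind of signal an optimizer over TS parameters could maximize.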

2606.19122 2026-06-18 cs.RO 新提交

Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

基于混合2D-3D学习的人行道机器人单目3D占用感知

Yukai Ma, Joe Lin, Liu Liu, Honglin He, Lulu Ricketts, Brad Squicciarini, Yong Liu, Bolei Zhou

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Zhejiang University(浙江大学) Coco Robotics Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出WalkOCC框架,通过混合射线行进单目3D占用感知,结合LiDAR-RGB配对数据与大规模无配对单目图像学习,提升人行道机器人导航的预测精度和泛化能力。

详情
AI中文摘要

现实世界中的人行道拥挤、杂乱且结构化程度低于道路,使得3D占用预测成为配送机器人和电动轮椅等移动机器人安全导航的关键。现有的占用学习流程主要针对道路自动驾驶设计,通常在大规模配对的LiDAR-RGB数据集上训练,需要密集的3D监督和多个摄像头输入,这些数据收集成本高且未能充分捕捉人行道特定特征。我们提出WalkOCC,一种用于人行道机器人的混合射线行进单目3D占用感知框架。WalkOCC显式地将来自LiDAR-RGB配对数据的几何基础与来自大规模无配对单目图像的可扩展学习相结合。它从配对序列中引导出伪占用监督,并在额外的仅2D数据上联合学习图像级表示。它在不需要昂贵的3D占用标注的情况下实现了稳定的优化和改进的泛化能力。大量实验表明,与基于自监督图像的基线相比,在预测精度、对路缘和排水沟等细微城市结构的细粒度分割以及对环境和跨本体变化的鲁棒性方面,WalkOCC均取得了一致的提升。为了便于评估和基准测试,我们还引入了Sidewalk3D,这是一个大规模的人行道感知数据集,包含在多个地点和时间段收集的LiDAR-相机配对序列,以及用于评估的3D语义占用标注。代码和数据将公开提供。

英文摘要

Sidewalks in the real world are crowded, cluttered, and less structured than roads, making 3D occupancy prediction a key ingredient for the safe navigation of mobile robots such as delivery bots and electric wheelchairs. Existing occupancy learning pipelines are largely designed for on-road autonomous driving and often train on large-scale paired LiDAR-RGB datasets with dense 3D supervision and multiple camera inputs, which are costly to collect and do not adequately capture sidewalk-specific characteristics. We propose WalkOCC, a hybrid Ray-marching monocular 3D occupancy perception framework for robots operating on sidewalks. WalkOCC explicitly couples geometric grounding from LiDAR-RGB paired data with scalable learning from large-scale unpaired monocular images. It bootstraps pseudo occupancy supervision from paired sequences and jointly learns image-level representations on additional 2D-only data. It yields stable optimization and improved generalization without requiring costly 3D occupancy annotations. Extensive experiments demonstrate consistent gains in prediction accuracy, fine-grained segmentation of subtle urban structures such as curbs and gutters, and robustness to environmental and cross-embodiment shifts compared with self-supervised image-based baselines. To facilitate evaluation and benchmarking, we also introduce Sidewalk3D, a large-scale sidewalk perception dataset with LiDAR-camera paired sequences collected across multiple locations and time periods, along with 3D semantic occupancy annotations for evaluation. Code and data will be made available.

2606.19190 2026-06-18 cs.RO 新提交

FAST-LIVGO: A Degeneracy-Robust LiDAR-Inertial-Visual-GNSS Fusion Odometry

FAST-LIVGO:一种退化鲁棒的LiDAR-惯性-视觉-GNSS融合里程计

Zhiyu Chen, Chunran Zheng, Jiayu Wen, XiaoLei Zhang, Jiaming Xu, Feng Pan, Yukang Cui

发表机构 * College of Mechatronics and Control Engineering, Shenzhen University(深圳大学机电与控制工程学院) Department of Mechanical Engineering, The University of Hong Kong(香港大学机械工程系) College of Automation, Harbin Engineering University(哈尔滨工程大学自动化学院)

AI总结 提出一种基于误差状态迭代卡尔曼滤波的紧耦合LiDAR-惯性-视觉-GNSS融合框架,通过动态时间规整的时空对齐模块、多普勒和时差载波相位观测模型以及退化感知的双模式异常值拒绝策略,在长期大尺度动态环境中实现高精度鲁棒的状态估计。

Comments Accepted for presentation at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

详情
AI中文摘要

在长期、大规模和高度动态环境中的鲁棒状态估计与建图仍然是机器人领域的关键挑战。现有的LiDAR-惯性-视觉里程计(LIVO)系统在局部精度上表现良好,但在长距离下会累积漂移,并在几何退化或无纹理场景中可能失效。同时,GNSS辅助融合框架通常依赖LiDAR或视觉里程计进行状态预测和异常值拒绝,使其在里程计退化时变得脆弱。为解决这些局限,我们提出一种基于误差状态迭代卡尔曼滤波的紧耦合LiDAR-惯性-视觉-GNSS融合框架。引入基于动态时间规整的在线时空对齐模块以应对高度动态条件。为更好利用GNSS精度,我们开发了基于多普勒频移和固定锚点时间差载波相位的观测模型,在不增加历史锚点状态的情况下提供毫米级相对约束。我们进一步设计了一种退化感知的双模式异常值拒绝策略,根据LIVO退化程度在LIVO先验引导拒绝和GNSS辅助恢复之间切换。在公开M3DGR数据集和自建20 m/s固定翼无人机数据集上的实验表明,我们的系统减少了累积漂移和地图重影,在精度和鲁棒性上优于现有方法。

英文摘要

Robust state estimation and mapping in long-term, large-scale, and highly dynamic environments remains a key challenge in robotics. Existing LiDAR-Inertial-Visual Odometry (LIVO) systems achieve strong local accuracy but suffer from accumulated drift over long distances and may fail in geometrically degraded or textureless scenes. Meanwhile, GNSS-aided fusion frameworks often rely on LiDAR or visual odometry for state prediction and outlier rejection, making them vulnerable when odometry degenerates. To address these limitations, we propose a tightly coupled LiDAR-Inertial-Visual-GNSS fusion framework based on an Error-State Iterated Kalman Filter. An online spatiotemporal alignment module using Dynamic Time Warping is introduced for highly dynamic conditions. To better exploit GNSS precision, we develop observation models based on Doppler shifts and fixed-anchor Time-Differenced Carrier Phase, providing millimeter-level relative constraints without augmenting historical anchor states. We further design a degeneracy-aware dual-mode outlier rejection strategy that switches between LIVO-prior-guided rejection and GNSS-aided recovery according to the LIVO degeneracy level. Experiments on the public M3DGR dataset and a custom 20 m/s fixed-wing UAV dataset demonstrate that our system reduces accumulated drift and map ghosting, outperforming state-of-the-art methods in accuracy and robustness.
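Dynamic Time Warping, which the alignment module builds on, finds the cheapest monotonic pairing between two sequences sampled at different or drifting rates. Below is the classic dynamic-programming form on scalar sequences; how the paper applies it to LiDAR-inertial-visual and GNSS streams is not specified in the abstract, so this is only the textbook algorithm with an illustrative function name.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local mismatch
            # extend the cheapest of: insertion, deletion, or match
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])
```

Two signals that trace the same profile at different rates have zero DTW distance, which is why the technique suits sensors whose timestamps are offset or stretched relative to each other.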

2606.19307 2026-06-18 cs.RO 新提交

Observability and Consistency Analysis for Visual-Inertial Navigation with Anchored Feature Parameterizations

基于锚定特征参数化的视觉惯性导航的可观性与一致性分析

Mitchell Cohen, Vassili Korotkine, James Richard Forbes

发表机构 * McGill University(麦吉尔大学)

AI总结 分析基于滤波的视觉惯性导航系统(VINS)使用锚定特征表示时的可观性与一致性,证明其不可观子空间独立于估计的地标状态,从而改善一致性,但仍依赖导航状态,需额外一致性增强技术。

Comments Accepted to IEEE/RSJ IROS. 8 pages, 3 figures, 4 tables

AI中文摘要

本文分析了使用锚定特征表示的基于滤波的视觉惯性导航系统(VINS)的可观性和一致性特性。结果表明,采用锚定地标参数化的VINS的不可观子空间独立于估计的地标状态,从而无需任何额外修改即可改善估计器的一致性。然而,不可观子空间仍然依赖于估计的导航状态,因此需要额外的一致性增强技术。本文提出了两种方法来改善采用锚定特征表示的VINS的一致性。仿真结果表明,与使用全局参考系解析特征的算法相比,所有采用锚定特征参数化的估计器都表现出更好的一致性,特别是在特征初始化可能较差的情况下。在TUM-VI数据集上的真实世界实验表明,仅使用锚定特征表示即可获得与采用全局特征表示的一致性改进估计器相当的性能,证明了在VINS中使用锚定特征参数化的优势。

英文摘要

This paper presents an analysis of the observability and consistency properties of filtering-based visual-inertial navigation systems (VINS) that utilize anchored feature representations. The unobservable subspace of VINS with anchored landmark parameterizations is shown to be independent of the estimated landmark state, which leads to improved estimator consistency properties without any additional modifications. However, the unobservable subspace is still found to depend on the estimated navigation state, necessitating additional consistency-enforcing techniques. Two methods to improve the consistency of VINS with anchored feature representations are presented. Simulation results show that all estimators employing anchored feature parameterizations exhibit improved consistency properties compared to algorithms that estimate features resolved in a global reference frame, especially in scenarios where feature initialization may be poor. Real-world experiments on the TUM-VI dataset show that the use of anchored feature representations alone can yield comparable performance to consistency-improved estimators employing a global feature representation, demonstrating the benefit of using anchored feature parameterizations for VINS.

2606.18583 2026-06-18 cs.CV cs.RO 交叉投稿

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

空地激光雷达地点识别:基于块级自监督学习和扩展互逆重排序

Yandi Yang, Xianghong Zou, Jianping Li, Haofeng Xie, Saurav Uprety, Hongzhou Yang, Naser El-Sheimy

发表机构 * University of Calgary(卡尔加里大学) Nanchang University(南昌大学) Nanyang Technological University(南洋理工大学) Wuhan University(武汉大学)

AI总结 提出一种空地激光雷达地点识别框架,通过多尺度块级自监督学习缩小域差距,并利用扩展互逆重排序算法减少误检,在多个数据集上显著提升检索精度。

AI中文摘要

激光雷达地点识别用于确定在预先采集的点云地图上的位置。最常研究的基于地面激光雷达的地点识别存在预访问要求、覆盖不完整和视角有限等缺点。使用预先采集的全覆盖机载激光扫描(ALS)数据作为空中先验地图可以克服这些缺点,使得跨视角地点识别变得必要且有利。然而,空地激光雷达地点识别面临重大挑战,包括空中和地面点云之间的域差距以及初始检索中的误检。为了解决这些问题,我们提出了一种用于空地激光雷达地点识别的新型检索和重排序框架。基于相邻点云块与锚点块共享相似语义的先验知识,我们的检索网络在多个尺度上引入了块级自监督学习模块,并与场景级学习相结合,以提高空中和地面点云之间全局特征的判别性。此外,利用ALS点云的结构化空间分布,我们引入了一种扩展互逆(ER)重排序算法,以最大化利用邻域信息,并根据邻域特征优化每个特征,然后用于更新相似度矩阵以进行最终排序。大量实验表明,我们的检索网络优于现有最先进(SOTA)方法,在CS-Urban-Scenes数据集上平均Recall@1提高了9.8%,平均Recall@1%提高了3.2%,同时在CS-Campus3D数据集上也展示了最佳性能。此外,我们的ER重排序算法在无需额外训练的情况下,进一步将CS-Campus3D上的平均Recall@1提高了4.9%,CS-Urban-Scenes上提高了10.2%。

英文摘要

LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the prior that neighboring point cloud patches share similar semantics with their anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates them with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm to exploit neighborhood information maximally and refine each feature based on neighbor features, which are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8% improvement in average Recall@1 and a 3.2% improvement in average Recall@1% on the CS-Urban-Scenes dataset, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9% on CS-Campus3D and 10.2% on CS-Urban-Scenes without additional training.

2606.18687 2026-06-18 cs.CV cs.RO 交叉投稿

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

空间分层蒸馏用于异构雷达位置识别

Sagun Singh Shrestha, Samuel Harding, Abdelwahed Khamis, Saimunur Rahman, Peyman Moghadam

发表机构 * CSIRO Robotics(澳大利亚联邦科学与工业研究组织机器人实验室) University of Queensland(昆士兰大学)

AI总结 针对4D汽车雷达与密集旋转雷达之间的异构位置识别,提出空间分层蒸馏(SSD)方法,通过基于雷达回波的物理空间非对称对齐,在重叠区域强制特征对齐,在稀疏区域降低蒸馏权重,在HeRCULES数据集上达到最先进性能。

Comments IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

AI中文摘要

可扩展的全天候位置识别越来越依赖于异构雷达位置识别来桥接不同的硬件平台。一个显著的应用是将来自经济高效的4D汽车雷达的查询与由密集旋转雷达构建的高保真参考地图进行匹配。这一过程从根本上受到4D传感器极端稀疏性(和窄视场)的限制,该传感器仅捕获旋转雷达数据库中存在的结构密度的一小部分。先前的工作通过统一不同的雷达信号来解决这个问题,即将两种信号投影到共同的表示空间。然而,它们在多会话环境中性能下降。在本文中,我们提出了空间分层蒸馏(SSD),这是一种用直接从物理雷达回波导出的非对称空间对齐取代标准均匀蒸馏的策略。在两个雷达都有重叠回波的区域,SSD强制进行强特征对齐。关键的是,在4D学生雷达缺乏回波但教师雷达在共享视场内包含有效结构的稀疏区域,SSD应用大幅折扣的蒸馏权重。在最近的HeRCULES数据集上的广泛评估表明,SSD显著优于先前的位置识别方法,在其具有挑战性的动态序列上取得了最先进的结果。

英文摘要

Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals, that is, by projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD), a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations on the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.
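
The region-dependent weighting described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stratified_distill_loss`, the scalar per-cell features, the squared-error distance, the `sparse_weight` value of 0.1, and the choice to give zero weight to cells without teacher returns are all assumptions made here for illustration.

```python
def stratified_distill_loss(student, teacher, student_hit, teacher_hit, sparse_weight=0.1):
    """Weighted squared-error distillation over spatial cells (hypothetical sketch).

    student, teacher: per-cell scalar features.
    student_hit, teacher_hit: per-cell booleans, True where that radar has a return.
    """
    total, norm = 0.0, 0.0
    for s, t, s_hit, t_hit in zip(student, teacher, student_hit, teacher_hit):
        if s_hit and t_hit:
            weight = 1.0            # overlapping returns: strong alignment
        elif t_hit:
            weight = sparse_weight  # teacher-only structure: heavily discounted
        else:
            weight = 0.0            # assumption: no teacher structure, no supervision
        total += weight * (s - t) ** 2
        norm += weight
    # Normalise by the total weight so the loss scale does not depend on cell count.
    return total / norm if norm else 0.0
```

With `sparse_weight` set to 1.0 the sketch reduces to uniform distillation over all cells where the teacher has returns, which is the baseline the abstract contrasts against.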

2511.02036 2026-06-18 cs.RO 版本更新

TurboMap: GPU-Accelerated Local Mapping for Visual SLAM

TurboMap: 面向视觉SLAM的GPU加速局部建图

Parsa Hosseininejad, Kimia Khabiri, Shishir Gopinath, Soudabeh Mohammadhashemi, Karthik Dantu, Steven Y. Ko

发表机构 * Simon Fraser University(西蒙弗雷泽大学) University at Buffalo(布法罗大学)

AI总结 针对视觉SLAM中局部建图延迟问题,提出GPU并行化与CPU优化结合的TurboMap后端,通过重构地图点创建、融合及关键帧管理,实现1.3-1.6倍加速且保持精度。

Comments Accepted for presentation at IROS 2026, preprint

AI中文摘要

在实时视觉SLAM系统中,局部建图必须在严格的延迟约束下运行,因为延迟会降低地图质量并增加跟踪失败的风险。GPU并行化是降低延迟的有效途径。然而,由于同步共享状态更新以及将大型地图数据结构传输到GPU的开销,并行化局部建图具有挑战性。本文提出TurboMap,一个GPU并行化且CPU优化的局部建图后端,全面解决了这些挑战。我们重构了地图点创建,以在GPU上实现并行关键点对应搜索,重新设计并并行化了地图点融合,在CPU上优化了冗余关键帧剔除,并集成了基于GPU的快速局部光束法平差求解器。为最小化数据传输和同步成本,我们引入了持久化的GPU驻留关键帧存储。在EuRoC和TUM-VI数据集上的实验表明,局部建图的平均加速比分别为1.3倍和1.6倍,同时保持精度不变。

英文摘要

In real-time Visual SLAM systems, local mapping must operate under strict latency constraints, as delays degrade map quality and increase the risk of tracking failure. GPU parallelization offers a promising way to reduce latency. However, parallelizing local mapping is challenging due to synchronized shared-state updates and the overhead of transferring large map data structures to the GPU. This paper presents TurboMap, a GPU-parallelized and CPU-optimized local mapping backend that holistically addresses these challenges. We restructure Map Point Creation to enable parallel Keypoint Correspondence Search on the GPU, redesign and parallelize Map Point Fusion, optimize Redundant Keyframe Culling on the CPU, and integrate a fast GPU-based Local Bundle Adjustment solver. To minimize data transfer and synchronization costs, we introduce persistent GPU-resident keyframe storage. Experiments on the EuRoC and TUM-VI datasets show average local mapping speedups of 1.3x and 1.6x, respectively, while preserving accuracy.

2602.04401 2026-06-18 cs.RO cs.CV 版本更新

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

视觉地点识别中可靠操作点选择的分位数迁移

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

发表机构 * QUT Centre for Robotics(昆士兰理工大学机器人中心) School of Electrical Engineering and Robotics(电气工程与机器人学院) Queensland University of Technology(昆士兰理工大学)

AI总结 提出一种通过分位数归一化迁移阈值的方法,自动选择视觉地点识别系统的操作点,在100%精度下最大化召回率,无需手动调参。

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

AI中文摘要

视觉地点识别(VPR)是全球导航卫星系统(GNSS)受限环境中定位的关键组成部分,但其性能严重依赖于选择平衡精度和召回率的图像匹配阈值(操作点)。阈值通常针对特定环境离线手动调整,并在部署期间固定,导致在环境变化下性能下降。我们提出一种方法,自动选择VPR系统的操作点,以在100%精度下最大化召回率。该方法使用已知对应关系的小型校准遍历,并通过相似度得分分布的分位数归一化将阈值迁移到部署中。这种分位数迁移确保阈值在校准大小和查询子集上保持稳定。在五个基准数据集上使用七种最先进的VPR技术进行的实验表明,我们提出的方法始终优于现有基线,使底层VPR技术在大约两倍的部署场景中(中位数改进)以100%精度运行,同时在该精度下多检索到多达29%的正确匹配。该方法通过适应新环境并跨操作条件泛化,消除了手动调整。我们的代码可在 https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR 获取。

英文摘要

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
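
One way to read the quantile-transfer idea is sketched below. This is an assumption-laden illustration rather than the released code: the helper names, the linear-interpolation quantile, and the rule "pick the highest-scoring wrong match in calibration as the 100%-precision threshold" are choices made here, and the repository linked above should be treated as the reference.

```python
def empirical_quantile(scores, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(scores)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def quantile_of(scores, value):
    """Fraction of scores that are less than or equal to value."""
    return sum(1 for s in scores if s <= value) / len(scores)


def transfer_threshold(calib_scores, calib_is_correct, deploy_scores):
    """Map a calibration threshold onto the deployment score distribution."""
    wrong = [s for s, ok in zip(calib_scores, calib_is_correct) if not ok]
    # Accepting only scores strictly above the best wrong match gives
    # 100% precision on the calibration traversal (assumed selection rule).
    q = quantile_of(calib_scores, max(wrong)) if wrong else 0.0
    # The same quantile is then read off the unlabeled deployment scores.
    return empirical_quantile(deploy_scores, q)
```

At deployment, matches whose similarity score is strictly above the returned value would be accepted. Because only the rank position is transferred, a shift or rescaling of scores between environments does not change which fraction of matches is accepted.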

2606.01605 2026-06-18 cs.RO 版本更新

Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control

将语义风险嵌入距离场和CBF用于在线单目安全控制

Dawei Zhang, Nuo Chen, Shuo Liu, Roberto Tron, Zhiwen Fan

发表机构 * Division of Systems Engineering, Boston University(系统工程系,波士顿大学) Department of Mechanical Engineering, Boston University(机械工程系,波士顿大学) Department of Electrical and Computer Engineering, Texas A&M University(电气与计算机工程系,德克萨斯农工大学)

AI总结 提出一种在线单目感知到控制框架,通过将语义风险直接嵌入欧几里得符号距离场(ESDF),在控制优化前编码风险,实现基于控制障碍函数(CBF)的语义感知安全导航与遥操作。

AI中文摘要

我们提出了一种在线单目感知到控制框架,将语义风险嵌入到用于基于控制障碍函数(CBF)的安全导航和遥操作的距离场中。许多基于感知的安全过滤器对所有映射的障碍物分配相同的基于距离的安全裕度,或者仅将语义用作下游控制器调整,而不是在空间表示中编码语义风险。我们的框架通过将语义信息直接嵌入欧几里得符号距离场(ESDF),在线推理障碍物几何和类别相关风险。这种设计在控制优化前编码语义风险,因此高风险对象在安全场中施加更大的空间影响,同时保留运行时高效的ESDF查询。具体来说,基于基础模型的SLAM前端从单目RGB视频重建密集3D几何,而每帧语义分割提供像素级类别标签,这些标签被融合到重建的几何中。得到的几何-语义表示随后被转换为ESDF,其中语义标签识别安全相关区域并在场计算前施加类别相关的膨胀。语义感知的ESDF提供CBF控制器所需的局部距离值和空间导数,而类别相关的增益进一步调节控制器响应。广泛的仿真和硬件实验证明了在线操作在10-20 Hz的频率以及遥操作和自主导航中的语义感知安全行为。

英文摘要

We propose an online monocular perception-to-control framework that embeds semantic risk into the distance field used by Control Barrier Function (CBF)-based safe navigation and teleoperation. Many perception-based safety filters assign the same distance-based safety margin to all mapped obstacles or use semantics only as a downstream controller adjustment, rather than encoding semantic risk in the spatial representation. Our framework instead reasons online about obstacle geometry and class-dependent risk by embedding semantic information directly into the Euclidean Signed Distance Field (ESDF). This design encodes semantic risk before control optimization, so high-risk objects exert a larger spatial influence in the safety field while retaining efficient ESDF queries at runtime. Specifically, a foundation-model-based SLAM front end reconstructs dense 3-D geometry from monocular RGB video, while per-frame semantic segmentation provides pixel-level class labels that are fused into the reconstructed geometry. The resulting geometric-semantic representation is then converted into an ESDF, where semantic labels identify safety-relevant regions and impose class-dependent inflation before field computation. The semantic-aware ESDF provides the local distance values and spatial derivatives required by the CBF controller, while class-dependent gains further regulate the controller response. Extensive simulation and hardware experiments demonstrate online operation at 10-20 Hz and semantic-aware safe behavior in both teleoperation and autonomous navigation.
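
The effect of class-dependent inflation on a distance query can be shown with a small sketch. Everything concrete here is an assumption for illustration: obstacles are reduced to labeled 2-D points, the nearest obstacle is found by brute force instead of a precomputed ESDF, and the class names, inflation radii, and margin are hypothetical values, not those of the paper.

```python
import math

# Hypothetical class-dependent inflation radii in metres.
INFLATION = {"chair": 0.0, "person": 0.5}


def semantic_distance(query, obstacles):
    """Distance from query to the nearest inflated obstacle.

    obstacles: list of ((x, y), class_label) pairs. Inflating a point obstacle
    by radius r is equivalent to subtracting r from its Euclidean distance.
    """
    return min(math.dist(query, point) - INFLATION.get(label, 0.0)
               for point, label in obstacles)


def barrier_value(query, obstacles, margin=0.2):
    """Barrier-style value h(x) = d(x) - margin; h >= 0 is treated as safe."""
    return semantic_distance(query, obstacles) - margin
```

In this sketch a person and a chair at the same geometric distance yield different barrier values, so the higher-risk class starts constraining the controller earlier.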

5. 人机交互与协作机器人 6 篇

2606.18519 2026-06-18 cs.RO cs.AI 新提交

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

如您所愿:利用LLM在精准农业中进行形式化验证的任务规划

Marcos Abel Zuzuárregui, Stefano Carpin

发表机构 * University of California, Merced(加州大学默塞德分校)

AI总结 针对自然语言歧义性,提出基于线性时序逻辑(LTL)反馈循环的LLM任务规划系统,通过双LLM分工实现规范生成与验证,提升精准农业任务规划的可靠性。

Journal ref
Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)
AI中文摘要

尽管机器人系统现已商业化并部署于各行各业,但许多系统高度专业化,通常需要高级技能才能操作并确保其按指令执行。为缓解这一问题,我们近期引入了一个任务规划器,利用大语言模型(LLM)根据自然语言描述的任务描述合成精准农业中的任务计划。虽然该系统表现出色,但也存在自然语言固有的歧义性。本文通过引入多个基于线性时序逻辑(LTL)的反馈循环来扩展我们的系统,以确保任务规划系统满足用户制定的规范,同时仍使用自然语言。为减轻潜在偏差,我们使用两个不同的商业LLM分别负责规范生成和验证子任务。通过大量实验,我们强调了将任务验证集成到全自主流水线中的优势与局限,特别是关于LLM生成有效LTL公式的能力,并展示了我们的实现如何应对和解决这些挑战。

英文摘要

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.

2606.18601 2026-06-18 cs.RO 新提交

Admittance-Based Surface Alignment for Human-in-the-Loop Robotic Visual Inspection

基于导纳的表面对齐用于人在环机器人视觉检测

Antara Banerjee, Colin Acton, Xu Chen

发表机构 * University of Washington(华盛顿大学)

AI总结 提出一种基于导纳的实时闭环控制框架,融合操作员输入与感知驱动,实现机器人末端执行器与局部表面的精确对齐,在6自由度机械臂上验证了稳定法向跟踪和0.4°的平均定向误差。

AI中文摘要

精密视觉检测是航空航天、半导体和医疗制造中质量保证的基础,这些领域中高价值零件上未被检测到的表面缺陷直接导致报废、返工和现场故障。机器人视觉检测需要在存在感知噪声和表面不规则的情况下,实现末端执行器与局部表面几何的精确对齐。在工业环境中,通常通过遥操作或共享自主性将人类操作员保持在回路中,引入实时调整,使得纯离线运动规划不足。这激发了能够在人类和感知不确定性下做出反应性、顺从行为的控制架构。本文提出了一种新颖的实时闭环机器人定向控制流程,用于精密视觉检测,该流程采用基于导纳的框架,统一了操作员输入和感知驱动的表面对齐。我们将末端执行器设计为在粘性介质中运动的虚拟球体,使得由此产生的物理可解释的质量-阻尼系统根据定向误差和操作员命令生成同步、顺从的运动。我们在6自由度机械臂上验证了该框架,展示了稳定的法向跟踪和0.4°的最终平均定向误差。

英文摘要

Precision visual inspection underpins quality assurance across aerospace, semiconductor, and medical manufacturing, where undetected surface anomalies on high-value parts translate directly into scrap, rework, and field failures. Robotic visual inspection requires precise alignment between the end-effector and local surface geometry in the presence of perception noise and surface irregularities. In industrial settings, a human operator is often kept in the loop via teleoperation or shared autonomy, introducing real-time adjustments that render purely offline motion planning inadequate. This motivates control architectures capable of reactive, compliant behavior under combined human and perceptual uncertainty. This paper presents a novel real-time, closed-loop robotic orientation control pipeline for precision visual inspection, with an admittance-based framework that unifies operator input and perception-driven surface alignment. We design the end-effector as a virtual sphere moving through a viscous medium, such that the resulting physically interpretable mass-damper system generates synchronized, compliant motion from orientation error and operator commands. We validate the framework on a 6-DOF manipulator, demonstrating stable normal-tracking and a final mean orientation error of 0.4°.
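
A single-axis sketch of a virtual mass-damper admittance law may help make the idea concrete. It is a simplified illustration under stated assumptions, not the controller from the paper: the dynamics `m * domega/dt + d * omega = gain * angle_error + operator_cmd`, the explicit Euler integration, and all parameter values are hypothetical.

```python
def admittance_step(omega, angle_error, operator_cmd, dt,
                    mass=1.0, damping=8.0, gain=20.0):
    """Advance the virtual angular velocity by one Euler step.

    The drive sums a perception term (gain * angle_error) and the operator
    command, so both inputs pass through the same mass-damper dynamics.
    """
    drive = gain * angle_error + operator_cmd
    omega_dot = (drive - damping * omega) / mass
    return omega + omega_dot * dt


def simulate(initial_error, steps=2000, dt=0.005):
    """Run the loop with no operator input and return the remaining error."""
    omega, error = 0.0, initial_error
    for _ in range(steps):
        omega = admittance_step(omega, error, 0.0, dt)
        error -= omega * dt  # rotating toward the surface normal reduces the error
    return error
```

With these values the closed loop behaves like a damped second-order system, so the orientation error decays smoothly instead of jumping to the target, and an operator command simply adds to the drive term.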

2606.18747 2026-06-18 cs.RO cs.AI 新提交

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

通过基于人类反馈的迭代强化学习利用大语言模型生成自然且富有表现力的机器人手势

Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

发表机构 * University of New South Wales(新南威尔士大学) Universidad Central de Chile(智利中央大学)

AI总结 针对社交机器人手势生成僵硬问题,提出将ChatGPT集成到Pepper机器人中生成共语手势,并引入基于人类反馈的迭代强化学习(RLHF)优化手势,实验表明RLHF提升了手势的表现力、相关性和流畅性。

Comments 8 Pages, 6 Figures

AI中文摘要

富有表现力的手势对于自然有效的沟通至关重要,当仅靠语言线索不足时(例如,指向),手势可以补充言语。对于像Pepper这样的人形社交机器人,产生自然且富有表现力的动作对于改善人机交互(HRI)和长期接受度至关重要。然而,由于依赖专家编写的动画,生成手势仍然具有挑战性,导致行为僵硬,难以适应动态和多样化的环境。或者,机器学习方法通常难以捕捉感知的自然性,随着自由度的增加而变得更加困难。因此,产生富有表现力的机器人手势需要一个能够适应环境同时遵守社会规范和物理约束的系统。大语言模型(LLMs)的最新进展使得动态代码生成成为可能,为从自然语言实时合成手势提供了新的机会。在本文中,我们将ChatGPT集成到人形机器人Pepper中,以生成与对话输出一致的共语手势。虽然这一基线实现了灵活的手势生成,但生成的动作通常被认为僵硬且不自然。为了解决这一限制,我们引入了一种基于人类反馈的迭代强化学习(RLHF)系统,该系统根据用户评估微调手势生成,并利用迭代用户研究比较Pepper生成的手势。我们的结果表明,RLHF改进了LLM的共语生成能力,产生了更富有表现力、相关且流畅的动作。

英文摘要

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

2606.19240 2026-06-18 cs.RO cs.CV cs.HC cs.SY eess.SY 新提交

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

透过遮挡:机器人遥操作的确定性手臂运动学校正

Thomas M. Kwok, Nicholas Koenig, Yue Hu

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出手臂运动学校正方法,利用恒定臂长几何约束和勾股定理确定性地重建遮挡关节深度,无需复杂建模,经Vicon验证有效,并成功应用于遥操作。

AI中文摘要

无标记、单RGB-D相机动作捕捉为机器人遥操作提供了一种低成本、非侵入性的替代传统标记系统的方法;然而,在自遮挡存在时,特别是上肢运动期间,深度估计常常退化。本文提出了一种手臂运动学校正(AKC)方法,通过基于恒定臂长施加几何约束来改进深度估计。所提出的方法利用手腕位置和预定义臂长,基于勾股定理的确定性公式重建遮挡关节深度,从而避免了对复杂概率建模或参数调整的需求。针对Vicon参考系统的实验验证表明,该方法在静态和动态关节运动下均表现出可靠的性能,通过均方根误差(RMSE)和皮尔逊相关性进行评估。此外,在模拟和物理机器人环境中成功演示了运动映射遥操作。结果表明,AKC在长时间、严重自遮挡下增强了鲁棒性并保持了解剖一致性,即使与不太可靠的时间滤波器配对时也是如此,突显了其在机器人遥操作和人机交互等实时应用中的实用性。

英文摘要

Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
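
The constant-arm-length constraint can be written as a short worked example. This is a sketch of the general geometric idea only, with several assumptions that the abstract does not specify: the lateral (x, y) elbow position is taken as reliable and already expressed in metres, the function name and arguments are hypothetical, and the sign of the depth offset is chosen by a flag.

```python
import math


def correct_elbow_depth(wrist, elbow_xy, forearm_len, elbow_behind=True):
    """Recover an occluded elbow depth from the wrist and a fixed forearm length.

    wrist: (x, y, z) wrist position in metres, camera frame.
    elbow_xy: (x, y) lateral elbow position in metres.
    forearm_len: constant wrist-to-elbow length in metres.
    """
    dx = elbow_xy[0] - wrist[0]
    dy = elbow_xy[1] - wrist[1]
    lateral_sq = dx * dx + dy * dy
    # Pythagorean theorem: forearm_len^2 = lateral^2 + dz^2.
    # Clamp at zero so noisy lateral estimates cannot give a negative radicand.
    dz = math.sqrt(max(forearm_len ** 2 - lateral_sq, 0.0))
    # Two mirror solutions exist; which one applies is an assumption here.
    return wrist[2] + dz if elbow_behind else wrist[2] - dz
```

For a wrist at depth 1.0 m, a lateral offset of 0.3 m, and a 0.5 m forearm, the depth offset is sqrt(0.25 - 0.09) = 0.4 m, placing the elbow at 1.4 m or 0.6 m depending on the chosen sign.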

2503.08895 2026-06-18 cs.RO 版本更新

Mutual Adaptation in Human-Robot Co-Transportation with Human Preference Uncertainty

人机协同运输中考虑人类偏好不确定性的相互适应

Al Jaber Mahmud, Weizi Li, Xuan Wang

发表机构 * George Mason University(乔治梅森大学) University of California, Riverside(加州大学河滨分校)

AI总结 针对人机协同运输中人类偏好参数不确定及适应策略平衡问题,提出统一框架,通过建模偏好概率分布、时变固执度及协调规划模型,结合位姿优化策略,实现相互适应以提升任务性能。

Comments 9 pages, 6 figures

AI中文摘要

相互适应可以通过整合机器人和人类对环境的理解来增强人机协同运输的整体任务性能。虽然人类建模有助于捕捉人类的主观偏好,但存在两个挑战:(i)人类偏好参数的不确定性,以及(ii)需要平衡对人和机器都有利的适应策略。在本文中,我们提出了一个统一的框架来应对这些挑战,并通过相互适应提高任务性能。首先,我们不依赖固定参数,而是通过纳入一系列不确定的人类偏好参数来建模人类选择的概率分布。在此基础上,我们引入时变固执度量和协调规划模型,该模型允许机器人领导团队的轨迹,或者如果人类偏好的路径与机器人的计划冲突且其固执度超过阈值,则机器人转为跟随人类。最后,我们引入一种用于低级控制的位姿优化策略,以减轻人类领导时的不确定行为。为了验证该框架,我们设计并进行了包含二十名人类参与者反馈的研究。然后,通过仿真,我们展示了我们的模型在通过相互适应和位姿优化增强任务性能方面的有效性。

英文摘要

Mutual adaptation can enhance overall task performance in human-robot co-transportation by integrating both the robot's and the human's understanding of the environment. While human modeling helps capture humans' subjective preferences, two challenges persist: (i) the uncertainty of human preference parameters and (ii) the need to balance adaptation strategies that benefit both humans and robots. In this paper, we propose a unified framework to address these challenges and improve task performance through mutual adaptation. First, instead of relying on fixed parameters, we model a probability distribution of human choices by incorporating a range of uncertain human preference parameters. Building on this, we introduce a time-varying stubbornness measure and a coordinated planning model, which allows either the robot to lead the team's trajectory or, if a human's preferred path conflicts with the robot's plan and their stubbornness exceeds a threshold, the robot to transition to following the human. Finally, we introduce a pose optimization strategy for low-level control to mitigate the uncertain human behaviors when they are leading. To validate the framework, we design and perform a study with human feedback from twenty human participants. We then demonstrate, through simulations, the effectiveness of our models in enhancing task performance with mutual adaptation and pose optimization.

2501.06348 2026-06-18 cs.HC cs.RO 版本更新

Why Automate This? Exploring Correlations Between Desire for Robotic Automation, Invested Time and Well-Being

为什么自动化这个?探索机器人自动化愿望、投入时间与幸福感之间的相关性

Ruchira Ray, Leona Pang, Sanjana Srivastava, Li Fei-Fei, Samantha Shorey, Roberto Martín-Martín

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校) Stanford University(斯坦福大学) University of Pittsburgh(匹兹堡大学)

AI总结 本研究利用BEHAVIOR-1K等数据集,发现活动时间并非自动化偏好的强预测因子,而幸福感和痛苦感是最强指标,并揭示了性别和收入水平的差异。

Comments 26 pages, 14 figures

AI中文摘要

理解人类倾向于自动化任务的动机对于开发无缝融入日常生活的机器人至关重要。因此,我们提出疑问:个体是否更倾向于根据活动消耗的时间或执行活动时的感受来自动化活动?本研究探讨了这些偏好以及它们是否在不同社会群体(特别是性别类别和收入水平)之间存在差异。利用BEHAVIOR-1K数据集、美国时间使用调查以及美国时间使用调查幸福感模块的数据,我们研究了机器人自动化愿望、花费时间以及相关感受(幸福感、意义感、悲伤感、痛苦感、压力感或疲惫感)之间的关系。我们的主要发现表明,尽管存在常见假设,但活动花费的时间并不能强烈预测自动化偏好;相反,幸福感和痛苦感是最强的指标。我们还识别出性别和经济水平的差异:女性倾向于自动化压力大的活动,而男性倾向于自动化让他们不快乐的活动;中等收入个体优先自动化不太愉快和有意义的活动,而低收入和高收入群体则没有显著相关性。我们希望我们的研究有助于推动机器人设计符合用户优先事项,使家用机器人朝着更具社会相关性的解决方案发展。所有数据和交互式工具均可在 https://robin-lab.cs.utexas.edu/why-automate-this/ 公开获取。

英文摘要

Understanding the motivations underlying the human inclination to automate tasks is vital for developing robots that fit seamlessly into daily life. Accordingly, we ask: are individuals more inclined to automate activities based on the time they consume or the feelings experienced while performing them? This study explores these preferences and whether they vary across social groups, specifically gender category and income level. Leveraging data from the BEHAVIOR-1K dataset, the American Time-Use Survey, and the American Time-Use Survey Well-Being Module, we investigate the relationship between the desire for robot automation, time spent, and associated feelings: Happiness, Meaningfulness, Sadness, Painfulness, Stressfulness, or Tiredness. Our key findings show that, despite common assumptions, time spent on activities does not strongly predict automation preferences; instead, happiness and pain are the strongest indicators. We also identify differences by gender and economic level: Women prefer to automate stressful activities, whereas men prefer to automate those that make them unhappy; mid-income individuals prioritize automating less enjoyable and meaningful activities, while low and high-income show no significant correlations. We hope our research helps motivate the design of robots that align with user priorities, moving domestic robotics toward more socially relevant solutions. All data and an interactive tool are publicly available at https://robin-lab.cs.utexas.edu/why-automate-this/.

6. 具身智能与视觉语言动作模型 7 篇

2606.18363 2026-06-18 cs.RO cs.AI 新提交

Guava: An Effective and Universal Harness for Embodied Manipulation

Guava: 一种有效且通用的具身操作工具框架

Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao

发表机构 * University of Maryland College Park(马里兰大学帕克分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Waterloo(滑铁卢大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) University of Pennsylvania(宾夕法尼亚大学) Amazon FAR(亚马逊 FAR)

AI总结 提出Guava框架,通过迭代感知-推理-行动循环、语义动作抽象和多模态观测三大关键设计,将具身操作能力蒸馏到4B开源模型中,在仿真和真实环境中性能媲美前沿专有模型。

AI中文摘要

在大规模视觉-语言数据上训练的语言模型已展现出作为具身智能体的强大潜力。通过具身工具使用来驾驭模型,为端到端的视觉-语言-行动系统提供了一种有前景的替代方案,它将高层推理与外部模块(用于感知、规划和控制)相结合。然而,对于具身操作而言,什么构成了有效的工具框架,以及这种框架能在多大程度上解锁广泛推理模型的具身能力,仍不清楚。在这项工作中,我们提出了Guava,一个通过系统探索智能体工作流、动作空间和观测空间的设计空间而开发的具身工具使用框架。我们的研究确定了有效具身智能体的三个关键要素:迭代感知-推理-行动循环、语义动作抽象和多模态观测。为了理解这些设计原则是否对小型模型也具有普适性,我们开发了一个端到端的训练流程,利用完全在仿真中收集的不到2000条轨迹,将具身操作能力蒸馏到一个4B开源模型中。在仿真和真实环境中的实验结果表明,其性能与前沿专有模型相当,同时展现出对未见物体、新指令和长时域任务的强大泛化能力。结果表明,一个精心设计的框架可以作为具身操作的可扩展、模型无关的接口,使紧凑的开源模型在极少的训练数据下展现出强大的涌现具身能力。

英文摘要

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

2606.19088 2026-06-18 cs.RO 新提交

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

ReSiReg:面向语言条件机器人任务的空间一致语义

Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, Gerald Steinbauer-Wagner

发表机构 * Graz University of Technology, Institute of Software Engineering and Artificial Intelligence(格拉茨技术大学,软件工程与人工智能研究所) University of Applied Sciences Technikum Wien, Department of Industrial Engineering(维也纳应用科技大学,工业工程系) University of Alicante, Department of Computer Technology(阿利坎特大学,计算机技术系) University of Natural Resources and Life Sciences, Institute for Integrative Nature Conservation Research(自然资源与生命科学大学,整合自然保护研究所)

AI总结 提出ReSiReg方法,通过重构空间一致的VLM中间特征,改善密集语言接地检索,在OVSS和3D映射中提升空间一致性,并发布紧凑的25M参数VLM模型。

AI中文摘要

视觉-语言模型(VLM)使机器人能够遵循开放语言指令。然而,密集的VLM嵌入已被证明存在噪声且缺乏空间一致性。这对于需要同时推理语义和3D空间的机器人应用来说是有问题的。我们研究了近期VLM的空间结构,并提出了ReSiReg,一种特征重构方法,利用空间一致的VLM中间特征来改善密集语言接地检索。ReSiReg将中间特征聚类为视觉原型,推导其语言描述符,并将每个补丁重构为原型级语言嵌入的软混合。我们在OVSS和3D映射上跨骨干网络进行定量评估,并在真实世界操作场景中进行定性评估。定量结果显示密集检索得到改善;操作场景显示出更空间一致的目标激活。我们进一步为机器人应用提供了一个紧凑的25M密集VLM,远小于ViT-B基线且具有竞争力。可从 https://resireg.github.io 获取。

英文摘要

Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have been shown to be noisy and to lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a soft mixture of prototype-level language embeddings. We evaluate quantitatively on OVSS and 3D mapping across backbones, and qualitatively in real-world manipulation scenes. Quantitative results show improved dense retrieval; manipulation scenes show more spatially consistent target activations. We further provide a compact 25M dense VLM for robotic applications, substantially smaller than and competitive with ViT-B baselines. Available at https://resireg.github.io

2606.19340 2026-06-18 cs.RO 新提交

Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning

零样本长时程灵巧操作:基于多视图3D接地VLM推理

Jisoo Kim, Sangwon Baik, Taeksoo Kim, Sungjoo Kim, Junyoung Lee, Mingi Choi, Hanbyul Joo

发表机构 * Seoul National University(首尔国立大学) RLWRLD

AI总结 提出零样本框架,利用多视图RGB图像通过VLM生成3D任务规划,结合三角测量和射线投票实现精确3D接地,支持抓取和工具使用,在真实实验中优于基线方法。

详情
AI中文摘要

我们提出了一个零样本框架,用于长时程灵巧操作,该框架将语言指令从校准的多视图RGB图像接地到可执行的3D任务规划。我们的系统不是训练端到端策略,而是使用视觉语言模型(VLM)生成参考帧任务接地和原始级2D关键点,然后通过多视图融合将其提升到3D。这种提升结合了视图级VLM接地的三角测量与参考视图射线投票,后者沿语义相机射线搜索跨相邻视图的几何一致候选点。生成的3D关键点支持抓取和放置以及工具使用:对于工具使用,我们检索与推断技能类别对应的以对象为中心的原子动作,并将其存储的6D工具轨迹对齐到场景;对于灵巧执行,我们将提升的抓取关键点扩展为任务条件抓取可行区域,并使用臂手运动生成器生成可行的抓取-运动对。真实世界实验表明,与单视图RGB-D接地和微调VLA基线相比,3D接地精度和执行可靠性有所提高。我们进一步通过闭环状态验证和重新规划展示了长时程操作,实现了在新场景中对未见物体和工具使用任务的零样本执行。

英文摘要

We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D task plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This lifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for geometrically consistent candidates across neighboring views. The resulting 3D keypoints support both pick-and-place and tool-use: for tool-use, we retrieve an object-centric atomic action corresponding to the inferred skill category and align its stored 6D tool trajectory to the scene; for dexterous execution, we expand the lifted grasp keypoint into a task-conditioned grasp affordance region and generate feasible grasp-motion pairs with an arm-hand motion generator. Real-world experiments show improved 3D grounding accuracy and execution reliability over single-view RGB-D grounding and fine-tuned VLA baselines. We further demonstrate long-horizon manipulation through closed-loop status verification and replanning, enabling zero-shot execution on unseen objects and tool-use tasks in novel scenes.
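
As a rough illustration of lifting a 2D keypoint into 3D from calibrated views, the sketch below triangulates the midpoint of the shortest segment between two camera rays. This is a generic two-view midpoint method, not the paper's fusion of triangulation with ray voting. The function name `triangulate_midpoint` and its arguments are hypothetical, and the rays are assumed to be already back-projected from the 2D keypoints into a shared world frame.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Midpoint of the closest points between rays c1 + s*d1 and c2 + t*d2.

    c1, c2 : camera centers (3-vectors, world frame)
    d1, d2 : ray directions through the 2D keypoints (need not be unit length)
    """
    w0 = [p - q for p, q in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w0), _dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    # Least-squares ray parameters that minimize the distance between the rays.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]
```

With noisy 2D groundings the two rays generally do not intersect, so the distance between the two closest points gives a simple consistency check between views.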

2606.18955 2026-06-18 cs.CV cs.RO 交叉投稿

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

运动聚焦的潜在动作使跨实体VLA训练能从人类自我中心视频中学习

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Tianfu Jiangxi Laboratory(天府绛溪实验室)

AI总结 提出基于潜在动作的框架,利用混合解耦VQ-VAE从无标签人类视频中提取通用动作先验,通过意图-感知解耦策略减少动作幻觉,仅需50条轨迹即可适配下游任务。

Comments Accepted to IROS 2026

详情
AI中文摘要

训练通用视觉-语言-动作(VLA)模型通常需要大量、多样化的机器人数据集,并带有高保真动作标注。尽管自我中心的人类操作视频丰富且捕捉了显著的环境多样性,但缺乏动作标签使其难以在传统训练范式下使用。为解决这一问题,我们提出了一种基于潜在动作的框架,旨在从无标签人类视频中提取通用动作先验。该架构采用混合解耦VQ-VAE,通过物理掩码将运动动态与环境背景解耦,从而构建跨实体动作码本。通过在人类视频上使用码本进行预训练,VLM骨干网络学习到动作意图的深层表示。为了适应特定实体,我们引入了一种意图-感知解耦策略,其中VLM预测动作意图,而一个独立的冻结视觉编码器为动作专家提供状态特定特征,从而减少动作幻觉。在仿真和真实环境中的结果表明,我们的方法仅在无标签人类视频上预训练,与在大量标注数据集上训练的最先进VLA模型相比具有竞争力,且仅需50条轨迹进行下游适配。

英文摘要

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.
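
The action codebook mentioned in this abstract relies on the standard vector-quantization step of a VQ-VAE: a continuous latent is replaced by its nearest codebook entry. The sketch below shows only that generic lookup, under the assumption of squared Euclidean distance. It does not model the paper's hybrid disentanglement or physical masks, and the names `quantize`, `latent`, and `codebook` are hypothetical.

```python
def quantize(latent, codebook):
    """Return (index, code) of the codebook entry nearest to `latent`.

    latent   : continuous motion latent from an encoder (assumed given)
    codebook : list of code vectors with the same dimension as `latent`
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Nearest neighbour under squared Euclidean distance.
    idx = min(range(len(codebook)), key=lambda i: sqdist(latent, codebook[i]))
    return idx, codebook[idx]
```

The returned index is the discrete "latent action" token; a model that predicts such indices can be pre-trained on videos that carry no explicit action labels.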

2606.19297 2026-06-18 cs.LG cs.RO 交叉投稿

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

VLA 甚至知道基础知识吗?衡量视觉-语言-动作模型中的常识和世界知识保留

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro

发表机构 * CogAI Lab(CogAI实验室) FusionBrain Lab(FusionBrain实验室) IAI MSU(莫斯科大学人工智能研究所) Lomonosov MSU(莫斯科国立罗蒙诺索夫大学) NUST MISIS(国立研究型技术大学MISIS) Applied AI Institute(应用人工智能研究所) HSE University(高等经济大学) Generalizable AI Systems(通用人工智能系统实验室) ISP RAS(俄罗斯科学院系统编程研究所) MIRAI Domain-specific NLP Group(领域特定自然语言处理组)

AI总结 提出 Act2Answer 协议,通过动作回答评估 VLA 模型的知识保留,发现模型在简单概念上表现良好,但在丰富语义类别上存在差距,且 VQA 联合训练有助于知识保留。

Comments Project page: https://tttonyalpha.github.io/act2answer/

详情
AI中文摘要

具身视觉-语言-动作(VLA)模型通常通过在机器人数据上微调强大的预训练 VLM 获得,但目前尚不清楚它们在适应后保留了多少常识和事实知识。在知识敏感任务上的失败是模糊的,混淆了知识缺失与低级控制泛化能力差。我们引入 Act2Answer,一种轻量级协议,通过要求智能体通过动作来回答,将 VLM 知识基准适配到 VLA 评估。每个问题变成一个简短的桌面场景,其中智能体执行单个物体放置动作以选择候选答案,从而产生动作基础的、减少控制混淆的成功率。我们在不同的常识和世界知识类别中策划了这样的环境测试套件,并引入逐层意图探测以定位 VLM 骨干和动作头中与答案相关的信息。在对 7 个 VLA 模型和 9 个 VLM 基线的大规模研究中,我们系统地跨类别对模型进行排名,发现 VLA 在简单概念上表现稳健,但在更丰富的语义类别上相对于其源 VLM 显示出更大的差距,VQA 联合训练与更好的知识保留相关,并且答案相关信号在 VLA 中间层达到峰值,但在上层减弱。Act2Answer 可在以下网址获取:https://tttonyalpha.github.io/act2answer/。

英文摘要

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.

2606.17846 2026-06-18 cs.RO cs.CV cs.LG 版本更新

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Qwen-RobotManip 技术报告:对齐解锁机器人操作基础模型的规模

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen

发表机构 * Qwen Team(Qwen团队)

AI总结 提出 Qwen-RobotManip,通过统一的对齐框架(表示、运动和行为维度)实现多源异构操作数据的大规模协同训练,构建约38,100小时预训练语料,在零样本指令跟随、跨本体迁移等泛化能力上超越先前模型。

Comments 44 pages

详情
AI中文摘要

语言和多模态基础模型通过统一公式对齐异构数据并大规模训练,实现了强大的泛化能力。在本报告中,我们研究这种扩展方法是否可以应用于机器人操作以实现真正的泛化。这具有挑战性,因为与文本不同,操作数据本质上是异构的、收集成本高且多样性狭窄,使得对齐和规模同时变得困难。我们提出了 Qwen-RobotManip,一个基于 Qwen-VL 构建的可泛化视觉-语言-动作基础模型。Qwen-RobotManip 引入了一个跨操作表示、运动和行为维度的统一对齐框架,使大规模多源训练变得一致而非冲突。这种对齐能力进而使 Qwen-RobotManip 能够吸收以前训练方案无法维持规模的操作数据。一个人到机器人合成流水线将第一人称手部演示转换为跨15个平台的机器人轨迹,一个严格的策展流水线协调异构数据集。仅使用开源数据集和人类视频,无需专有数据收集,Qwen-RobotManip 构建了约38,100小时的预训练语料,并展现出涌现的泛化能力,包括零样本指令跟随、对扰动的鲁棒性、反应性错误恢复和跨本体迁移。我们发现标准基准无法捕捉预训练质量,因此采用了包括 RoboCasa365、LIBERO-Plus、EBench、RoboTwin-Clean2Rand、RoboTwin-IF 和 RoboTwin-XE 在内的 OOD 设置。Qwen-RobotManip 在所有 OOD 设置中显著优于先前最先进的模型(包括 π0.5),在 RoboChallenge 中排名第一,相对改进20%,并在包括 AgileX ALOHA、Franka、UR 和 ARX 在内的真实机器人平台上得到验证。

英文摘要

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO 版本更新

Cosmos 3: Omnimodal World Models for Physical AI

Cosmos 3:面向物理AI的全模态世界模型

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

发表机构 * NVIDIA

AI总结 提出基于统一混合Transformer架构的全模态世界模型Cosmos 3,联合处理语言、图像、视频、音频和动作序列,在理解和生成任务上达到新最优,为具身智能体提供可扩展的通用骨干。

详情
AI中文摘要

我们介绍了Cosmos 3,一个全模态世界模型家族,设计用于在统一的混合Transformer架构中联合处理和生成语言、图像、视频、音频和动作序列。通过支持高度灵活的输入输出配置,Cosmos 3无缝统一了物理AI的关键模态——有效地将视觉语言模型、视频生成器、世界模拟器和世界动作模型整合到一个框架中。我们的评估表明,Cosmos 3在一系列多样化的理解和生成任务中确立了新的最优水平,展示了全模态世界模型作为具身智能体可扩展、通用骨干的能力。我们的后训练Cosmos 3模型在技术报告撰写时被Artificial Analysis评为最佳开源文本到图像和图像到视频模型,并被RoboArena评为最佳策略模型。为了加速物理AI领域的开放研究和部署,我们在Linux基金会的OpenMDW-1.1许可证下提供我们的代码、模型检查点、策划的合成数据集和评估基准,网址为 https://github.com/nvidia/cosmos 和 https://huggingface.co/collections/nvidia/cosmos3。项目网站位于 https://research.nvidia.com/labs/cosmos-lab/cosmos3。

英文摘要

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

7. 多机器人与群体系统 2 篇

2606.18516 2026-06-18 cs.RO 新提交

Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets

动态杂乱环境下的任务分配与运动规划:基于CBBA与凸集图

Matthew D. Osburn, Cameron K. Peterson, John L. Salmon

发表机构 * Electrical and Computer Engineering(电气与计算机工程系) Mechanical Engineering(机械工程系)

AI总结 针对动态杂乱环境中的多智能体任务规划,提出结合凸集图(GCS)进行轨迹优化与共识捆绑算法(CBBA)进行分布式任务分配的方法,实现安全高效的轨迹规划和任务协调。

Comments 15 pages single column, 10 figures, AIAA-Scitech 2027 Submission

详情
AI中文摘要

在杂乱、动态环境中的多智能体任务规划需要在分配任务给智能体的同时,确定通过环境的安全、时间高效的轨迹。当任务是动态的(例如会合目标)时,分配决策不仅取决于哪个智能体最适合某项任务,还取决于该任务何时何地可以到达。本文提出了一个解决该问题的方法,该方法将凸集图(GCS)用于轨迹优化,与共识捆绑算法(CBBA)用于分布式任务分配相结合。在我们的方法中,GCS通过使用时间扩展(3D+时间)配置空间找到通过动态环境的最优轨迹。同时,CBBA协调跨智能体的任务分配,使得在移动环境中能够做出明智的决策。然后,我们连接分配和规划,使智能体能够在3D+时间配置空间中避免碰撞,并提供准确的任务完成时间估计。我们在具有静态和动态任务的模拟杂乱环境中展示了我们方法的有效性。

英文摘要

Multi-agent task planning in cluttered, dynamic environments requires assigning tasks to agents while simultaneously determining safe, time-efficient trajectories through the environment. When tasks are dynamic, such as rendezvous objectives, allocation decisions depend not only on which agent is best suited for a task, but also on when and where that task can be reached. This paper presents a solution to this problem, which combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation. In our approach, GCS finds optimal trajectories through dynamic environments using a time-extended (3D+time) configuration space. At the same time, CBBA coordinates task assignments across agents, enabling informed decision-making in a moving environment. We then connect allocation and planning to allow the agents to avoid collisions in the 3D+time configuration space and provide accurate time estimates for task completion. We demonstrate the effectiveness of our approach in simulated cluttered environments with static and dynamic tasks.

2510.18085 2026-06-18 cs.RO cs.AI cs.MA 版本更新

R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

R2BC: 从单智能体演示进行多智能体模仿学习

Connor Mattson, Varun Raveendra, Ellen Novoseller, Nicholas Waytowich, Vernon J. Lawhern, Daniel S. Brown

发表机构 * Kahlert School of Computing, University of Utah(犹他大学凯勒尔计算学院) DEVCOM Army Research Laboratory(陆军研究实验室)

AI总结 提出R2BC方法,通过轮换单智能体演示训练多机器人系统,无需联合动作空间演示,在模拟和实物任务中性能媲美或超越基于特权同步演示的基线方法。

Comments 8 pages, 6 figures. In Proceedings: IEEE International Conference on Robotics & Automation (ICRA 2026)

详情
AI中文摘要

模仿学习(IL)是人类教授机器人的自然方式,尤其是在高质量演示易于获取的情况下。虽然IL已广泛应用于单机器人场景,但将其扩展到多智能体系统的研究相对较少,尤其是在单个人类必须为协作机器人团队提供演示的场景中。本文介绍并研究了轮换行为克隆(R2BC),该方法使单个人类操作员能够通过顺序的单智能体演示有效训练多机器人系统。我们的方法允许人类一次远程操作一个智能体,并逐步向整个系统教授多智能体行为,无需联合多智能体动作空间的演示。我们表明,在四个多智能体模拟任务中,R2BC方法的性能与基于特权同步演示的Oracle行为克隆方法相当,甚至在某些情况下超越后者。最后,我们在两个使用真实人类演示训练的物理机器人任务上部署了R2BC。

英文摘要

Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.

8. 无人车、无人机与移动机器人 4 篇

2606.18630 2026-06-18 cs.RO 新提交

DNN Koopman-Based Deviation Compensation for UGV Path Tracking Control on Coupled Slope and Potholed Road

基于DNN Koopman的偏差补偿用于耦合坡度和坑洼道路上的UGV路径跟踪控制

Jian Zhao, Wenbo Zhou, Zhicheng Chen, Bing Zhu, Jiayi Han, Dongjian Song, Yinju Lin, Peixing Zhang

发表机构 * Xiamen King Long United Automotive Industry Co., Ltd.(厦门金龙联合汽车工业有限公司)

AI总结 提出基于DNN Koopman的偏差补偿策略,结合自适应遗忘递推最小二乘估计轮胎刚度、Laguerre模型预测控制与事件触发协同补偿,在耦合坡度和坑洼道路上提升UGV路径跟踪精度超11.5%

Comments 22 pages, 13 figures

详情
AI中文摘要

在越野场景中运行的无人地面车辆面临复杂地形扰动,这些扰动会显著降低路径跟踪性能。针对这一挑战,本文提出了一种基于深度神经网络Koopman的偏差补偿策略,用于无人地面车辆路径跟踪控制。首先,基于耦合坡度上的车辆动力学函数,设计了一种带有解耦误差项的自适应遗忘递推最小二乘法来估计轮胎侧偏刚度。在此基础上,通过引入Laguerre函数,设计了一种Laguerre模型预测控制路径跟踪控制策略,该策略可在不同耦合坡度场景下降低计算资源消耗的同时保持可靠的跟踪性能。然后,通过将Koopman算子理论与深度神经网络相结合,提出了一种深度神经网络Koopman路径偏差补偿方法,该方法显著提高了无人地面车辆在坑洼道路扰动下的路径跟踪精度。此外,基于补偿激活准则和可信度验证,建立了一种将Laguerre模型预测控制与深度神经网络Koopman耦合的事件触发并行协同补偿机制。该机制提高了坑洼道路上的路径跟踪精度,同时确保了整体转向指令的可行性和深度神经网络Koopman补偿后车辆的稳定性。最后,构建了硬件在环实验平台进行验证。实验结果表明,所提出的无人地面车辆路径跟踪策略在多种工况下跟踪性能提升超过11.5%。

英文摘要

Unmanned ground vehicles (UGVs) operating in off-road scenarios are confronted with complex terrain disturbances that can substantially degrade path tracking performance. To address this challenge, this paper proposes a deep neural network (DNN) Koopman-based deviation compensation strategy for UGV path tracking control. Firstly, based on the vehicle dynamic function on coupled slope, an adaptive forgetting recursive least squares method with decoupled error terms is designed to estimate tire cornering stiffness. On this basis, a Laguerre model predictive control (LMPC) path tracking control strategy is designed by incorporating Laguerre functions, which can reduce computational resource usage while maintaining reliable tracking performance across different coupled slope scenarios. Then, by integrating Koopman operator theory with DNN, a DNN Koopman (DK) path deviation compensation method is proposed, which significantly improves the path tracking accuracy of UGV under potholed road disturbances. Furthermore, an event-triggered parallel cooperative (EPC) compensation mechanism that couples LMPC with DK is established based on compensation activation criteria and credibility verification. This mechanism improves path tracking accuracy on potholed road while ensuring the feasibility of overall steering command and stability of vehicle after DK compensation. Finally, a hardware-in-the-loop (HiL) experimental platform is constructed for validation. Experimental results demonstrate that the proposed UGV path tracking strategy improves tracking performance by more than 11.5% across multiple operating conditions.
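
The stiffness-estimation step in this abstract builds on recursive least squares (RLS) with a forgetting factor. The sketch below shows only the textbook scalar form with a fixed forgetting factor `lam`; the paper's adaptive forgetting and decoupled error terms are not reproduced. The linear model `y = theta * x` (for example, lateral force against slip angle, with `theta` as cornering stiffness) is an assumption made for illustration, and all names are hypothetical.

```python
def rls_forgetting(samples, lam=0.98, theta0=0.0, p0=1000.0):
    """Scalar recursive least squares with a fixed forgetting factor.

    samples : iterable of (x, y) pairs, assumed to follow y ~ theta * x
    lam     : forgetting factor in (0, 1]; smaller values discount old data faster
    theta0  : initial parameter estimate
    p0      : initial covariance (large value = low confidence in theta0)
    """
    theta, p = theta0, p0
    for x, y in samples:
        k = p * x / (lam + x * p * x)   # gain
        theta += k * (y - x * theta)    # correct estimate with the prediction error
        p = (p - k * x * p) / lam       # covariance update, inflated by 1/lam
    return theta
```

A forgetting factor below 1 keeps the covariance from collapsing, so the estimate can keep tracking a parameter that drifts with terrain, at the cost of more sensitivity to measurement noise.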

2606.19227 2026-06-18 cs.RO 新提交

Constant Time-Delay Leader Following with Neural Networks and Invariant Extended Kalman Filters for Arbitrary Trajectories

基于神经网络与不变扩展卡尔曼滤波的任意轨迹恒定时间延迟领航跟随

Luka Antonyshyn, Paulo Ricardo Marques de Araujo, Sidney Givigi

发表机构 * University of Toronto Institute for Aerospace Studies(多伦多大学航空航天研究所) School of Computing, Queen’s University(女王大学计算机学院)

AI总结 提出一种结合概率Seq2Seq神经网络与不变扩展卡尔曼滤波的恒定时间延迟轨迹跟踪方法,用于无通信、无全局坐标的车队,在SE(2)流形上准确估计领车轨迹,并利用几何模型预测控制提升性能。

Comments 9 pages, 6 figures

详情
AI中文摘要

本文提出了一种用于车辆队列的恒定时间延迟轨迹跟踪方法,该方法无需车辆间通信、公共坐标系或全球定位。该方法将概率序列到序列(Seq2Seq)神经网络与不变扩展卡尔曼滤波(IEKF)相结合,以热启动预测过程,从而在SE(2)流形上准确估计领车相对轨迹。进一步引入几何模型预测控制器,以充分利用基于流形的轨迹预测来改善控制性能。该系统能够处理具有不同速度和运动轮廓的任意非线性轨迹,同时减少了对基于专家领域知识的轨迹跟踪系统设计的需求,即使在长轨迹延迟下也是如此。通过运动学仿真中与纯IEKF基线、基于学习的方法以及真实轨迹的对比,以及使用真实机器人车辆的实验,验证了该方法的有效性。

英文摘要

This paper proposes a constant time-delay trajectory tracking method for vehicle convoys operating without inter-vehicle communication, a common coordinate system, or global positioning. The method integrates a probabilistic sequence-to-sequence (Seq2Seq) neural network with an invariant extended Kalman filter (IEKF) to warm-start the prediction process, allowing accurate estimation of a leader vehicle's relative trajectory on the SE(2) manifold. A geometric model predictive controller is further incorporated to fully exploit the manifold-based trajectory predictions for improved control performance. The system can handle arbitrary nonlinear trajectories with varying speeds and motion profiles while reducing the need for expert-based domain knowledge for the design of trajectory following systems, even under long trajectory delays. The effectiveness of the method is validated through comparisons with a pure IEKF baseline, learning-based methods, and the ground-truth trajectory in kinematic simulations, as well as in experiments using real robotic vehicles.
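
The quantity estimated in this abstract, the leader's pose relative to the follower on SE(2), can be made concrete with a short sketch. World-frame poses are used here only to define that quantity. In the paper's setting no common frame is available and the relative trajectory is estimated from onboard sensing, which this sketch does not attempt. The function name `relative_pose` and the `(x, y, theta)` tuple layout are hypothetical.

```python
import math


def relative_pose(follower, leader):
    """Pose of `leader` expressed in the frame of `follower` (both on SE(2)).

    Each pose is an (x, y, theta) tuple; the result is T_follower^{-1} * T_leader.
    """
    xf, yf, thf = follower
    xl, yl, thl = leader
    dx, dy = xl - xf, yl - yf
    c, s = math.cos(thf), math.sin(thf)
    # Rotate the world-frame offset into the follower's body frame.
    x_rel = c * dx + s * dy
    y_rel = -s * dx + c * dy
    # Wrap the heading difference to (-pi, pi].
    dth = math.atan2(math.sin(thl - thf), math.cos(thl - thf))
    return x_rel, y_rel, dth
```

Composing poses through the group operation, rather than subtracting angles and positions component-wise, keeps heading wrap-around and frame rotation consistent, which is the practical motivation for filtering on the manifold.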

2606.19258 2026-06-18 cs.CV cs.RO 交叉投稿

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

CABLE: 面向V2X系统的云辅助带宽高效LMM编码框架

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

发表机构 * University of Georgia(佐治亚大学)

AI总结 提出CABLE框架,通过边缘端利用自我运动补偿和残差运动线索传播云分割掩码,生成感兴趣区域(ROI)并仅上传ROI掩码图像,形成掩码-ROI-LMM反馈循环,在五个数据集上实现73-87%的ROI像素覆盖减少和5-8倍LMM预填充加速。

详情
AI中文摘要

云托管的大型多模态模型(LMM)可以为车联网系统提供强大的开放词汇感知能力,但简单地将全分辨率帧从边缘传输到云会导致严重的通信开销和云侧预填充延迟。我们提出了CABLE,一种用于边缘-云感知的云辅助带宽高效LMM编码框架。CABLE在边缘端利用自我运动补偿传播先前的云分割掩码,通过残差运动线索进行细化,并通过走廊包络整合断开区域,形成鲁棒的感兴趣区域(ROI)。仅上传ROI掩码图像,而云分割输出作为下一帧的先验反馈,形成掩码-ROI-LMM反馈循环。在五个数据集(nuScenes、WOD-ZB、Waymo、KITTI和CADC)上的实验表明,该方法在保持感知能力的同时实现了显著的通信节省,相对于全帧推理,ROI像素覆盖减少73-87%,估计LMM预填充加速5-8倍,检测质量略有折衷。

英文摘要

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving 73-87% ROI pixel-coverage reduction with 5-8× estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

2602.01700 2026-06-18 cs.RO 版本更新

Tilt-Ropter: A Fully Actuated Hybrid Aerial-Terrestrial Vehicle with Tilt Rotors and Passive Wheels

Tilt-Ropter: 一种带有倾转旋翼和被动轮的全驱动混合空中-地面车辆

Ruoyu Wang, Xuchen Liu, Zongzhou Wu, Zixuan Guo, Wendi Ding, Ben M. Chen

发表机构 * Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong(机械与自动化工程系,香港中文大学) Faculty of Engineering, The University of Hong Kong(工程学院,香港大学) Peng Cheng Laboratory(鹏城实验室)

AI总结 提出全驱动混合空中-地面车辆Tilt-Ropter,通过倾转旋翼和被动轮实现高效多模态运动,并设计统一非线性模型预测控制器实现低跟踪误差和地面运动功耗降低92.8%。

Comments 8 pages, 10 figures. Accepted by the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

详情
AI中文摘要

在这项工作中,我们提出了Tilt-Ropter,一种全驱动的混合空中-地面车辆(HATV),它集成了倾转旋翼和被动轮,以实现高效的多模态运动。与传统的欠驱动HATV不同,Tilt-Ropter的全驱动设计允许力和扭矩解耦控制,提高了机动性和地面运动效率。开发了一个统一的非线性模型预测控制器(NMPC)来跟踪参考轨迹,强制执行非完整约束,并适应运动模式间的接触效应,同时通过专门的控制分配确保执行器可行性。为了解决复杂的轮地动力学问题,集成了一个外部力旋量估计器来提供实时交互力旋量估计。该系统通过仿真和实际实验进行了验证,包括无缝的空地过渡和轨迹跟踪任务。实验结果表明,两种模式下的跟踪误差都很低,并且地面运动期间的功耗相比飞行降低了92.8%,突显了该平台在能源受限环境中执行长时间任务的适用性。

英文摘要

In this work, we present Tilt-Ropter, a fully actuated hybrid aerial-terrestrial vehicle (HATV) that integrates tilt rotors with passive wheels to enable efficient multi-modal locomotion. Unlike conventional underactuated HATVs, the fully actuated design of Tilt-Ropter allows decoupled force and torque control, improving maneuverability and ground locomotion efficiency. A unified nonlinear model predictive controller (NMPC) is developed to track reference trajectories, enforce non-holonomic constraints, and accommodate contact effects across locomotion modes, while ensuring actuator feasibility through dedicated control allocation. To address complex wheel-ground dynamics, an external wrench estimator is incorporated to provide real-time interaction wrench estimates. The system is validated through simulation and real-world experiments, including seamless air-ground transitions and trajectory tracking tasks. Experimental results demonstrate low tracking errors in both modes and reveal a 92.8% reduction in power consumption during ground locomotion compared to flight, highlighting the platform's suitability for long-duration missions in energy-constrained environments.

9. 软体机器人与硬件设计 3 篇

2606.18680 2026-06-18 cs.RO 新提交

High-Degree-of-Freedom Lightweight Bioinspired Leg for Enhanced Mobility in Small Robots

高自由度轻量化仿生腿:提升小型机器人机动性

Haoqi Han, Yifei Yu, Jiaming Zhang, Xinru Cui, Linxi Feng, Hesheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai University of Electric Power(上海电力大学)

AI总结 针对微型机器人腿部自由度受限问题,提出一种四自由度并联腿机构,通过同心设计简化运动学,实现轻量化(18.9g)和大工作空间(>22255 mm³),显著提升运动灵活性。

详情
Journal ref
2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
AI中文摘要

在微型机器人领域,如何在严格的空间限制下通过增加腿部机构的自由度来增强运动能力仍然是一个重大挑战。受昆虫运动启发,本文提出了一种新型的微型四自由度并联腿机构,并系统分析了其机械设计、电气系统和运动学。该设计采用两个球形五杆连杆机构,在并联四杆配置中实现空间运动。此外,采用同心设计策略简化了腿部运动学的解析解。由于采用并联系统架构,所有执行器均位于主体上,与传统高自由度腿部结构相比,大大降低了运动部件的等效惯性。系统总质量仅为18.9 g,末端执行器输出力约为0.5 N,工作空间超过22255 mm³。实验结果表明,所提出的单腿机构具有优异的运动灵活性,凸显了其在微型仿生机器人领域的潜力。

英文摘要

In microrobotics, enhancing locomotion capabilities by increasing the degrees of freedom (DoF) of leg mechanisms under severe spatial constraints remains a significant challenge. Inspired by insect locomotion, this paper presents a novel micro-scale parallel leg mechanism with four degrees of freedom, and systematically analyzes its mechanical design, electrical system, and kinematics. The design incorporates two spherical five-bar linkages to achieve spatial motion within a parallel four-bar configuration. Furthermore, a concentric design strategy is employed to simplify the analytical solution of the leg kinematics. Due to the parallel system architecture, all actuators are located on the main body, substantially reducing the equivalent inertia of moving parts compared to traditional high-DoF leg structures. The total mass of the system is only 18.9 g, with an end-effector output force of approximately 0.5 N and a workspace exceeding 22255 mm³. Experimental results demonstrate that the proposed single-leg mechanism achieves excellent motion flexibility, highlighting its potential for micro bio-inspired robotics.

2606.18704 2026-06-18 cs.RO 新提交

Selective Unit-Cell Actuation in Lattice Structures for Distributed Morphology in Soft Robots

晶格结构中的选择性单元胞驱动用于软体机器人的分布式形态变化

Trevor Exley, Altair Coutinho, Lucia Beccai

发表机构 * Istituto Italiano di Tecnologia (IIT)(意大利技术研究院)

AI总结 提出嵌入式气动单元胞,将弯曲支柱晶格与双向波纹管致动器集成,通过空间驱动模式实现全局形态控制,实验验证了可扩展位移、力生成及弯曲、抓取和爬行运动。

Comments Accepted to IROS 2026, 8 pages, 5 figures

详情
AI中文摘要

软晶格结构越来越多地用于机器人中以定制柔顺性和引导变形;然而,驱动通常是在设备或模块级别引入,致动器插入到原本被动的架构中。在这项工作中,我们将致动器-晶格协同设计推进到单元胞尺度。我们提出了一种嵌入式气动单元胞,它将弯曲支柱晶格几何形状与双向波纹管致动器集成在一个单一的整体元件中。当镶嵌时,晶格作为一个分布式驱动场,其中全局形态由空间驱动模式而非均匀加压控制。对1x1、2x2和3x3镶嵌的实验表征展示了可扩展的位移和力生成,具有可重复的循环性能。在3x3x3阵列中,单元胞的选择性驱动产生了不同的全局变形模式,包括弯曲和定向抓取,而无需改变硬件配置。此外,耦合主动和被动单元胞实现了弯曲驱动的爬行运动,证明了异质镶嵌可以通过不对称变形进行平移。这些结果确立了单元胞级驱动作为晶格基软体机器人分布式变形的策略,并为可扩展的整体机器人架构提供了基础。

英文摘要

Soft lattice structures are increasingly used in robotics to tailor compliance and guide deformation; however, actuation is typically introduced at the device or module level, with actuators inserted into otherwise passive architectures. In this work, we move actuator-lattice co-design to the unit-cell scale. We present an embedded pneumatic unit cell that integrates curved-strut lattice geometry with a bidirectional bellow actuator within a single monolithic element. When tessellated, the lattice functions as a distributed actuation field in which global morphology is governed by spatial actuation patterns rather than uniform pressurization. Experimental characterization of 1x1, 2x2, and 3x3 tessellations demonstrates scalable displacement and force generation with repeatable cyclic performance. Selective actuation of unit cells in a 3x3x3 array produces distinct global deformation modes, including bending and directional grasping, without altering hardware configuration. Additionally, coupling active and passive unit cells enables bending-driven crawling locomotion, demonstrating that heterogeneous tessellations can translate through asymmetric deformation. These results establish unit-cell-level actuation as a strategy for distributed morphing in lattice-based soft robots and provide a foundation for scalable, monolithic robotic architectures.

2606.19265 2026-06-18 cs.RO 新提交

Shape Sensing of Continuum Robots using Direct Laser Writing

使用直接激光写入的连续体机器人形状感知

Amber K. Rothe, Nidhi Malhotra, Jaydev P. Desai

发表机构 * Winship Cancer Institute of Emory University(埃默里大学温希普癌症研究所) Medical Robotics and Automation (RoboMed) Laboratory, Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology(佐治亚理工学院华莱士·H·库尔特生物医学工程系医疗机器人与自动化实验室)

AI总结 本文利用直接激光写入技术制造应变传感器,集成于连续体机器人关节中,通过线性和非线性模型预测关节角度,误差低至1.76度,并实现闭环控制,跟踪误差小于3度。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

连续体机器人因其固有的柔顺性和灵巧性,为微创和自然腔道手术提供了一种有前景的方法。然而,这种灵活性也使得估计机器人当前形状变得具有挑战性。已有多种方法用于重建这些机器人的形状,包括成像、光学传感、磁传感和电阻传感。使用直接激光写入(DLW)制造的应变传感器可以提供一种替代传感方法。该技术涉及使用激光诱导某些聚合物碳化,以创建石墨烯图案,例如应变传感器。在本文中,我们展示了如何使用同一激光和同一设置将柔性连续体关节和DLW传感器加工成一个整体结构。使用线性和非线性模型对制造的传感器进行表征,这些模型用于预测关节角度,误差低至1.76度。此外,我们展示了如何使用DLW传感器在机器人关节中实现闭环控制,跟踪误差低于3度。

英文摘要

Continuum robots offer a promising approach for minimally invasive and natural-orifice surgical procedures due to their inherent compliance and dexterity. However, this flexibility also makes estimating the current shape of the robot challenging. Several approaches have been used to reconstruct the shape of these robots, including imaging, optical sensing, magnetic sensing, and resistive sensing. Strain sensors fabricated using direct laser writing (DLW) could provide an alternative sensing method. This technique involves using a laser to induce carbonization of certain polymers to create graphene patterns, such as strain sensors. In this paper, we demonstrate how a flexible continuum joint and a DLW sensor can be machined as one monolithic structure using the same laser and the same setup. The fabricated sensors are characterized using linear and nonlinear models, which are used to predict the joint angle with error as low as 1.76 degrees. Furthermore, we demonstrate how a DLW sensor can be used to implement closed-loop control in a robotic joint, achieving tracking error under 3 degrees.
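
The linear-versus-nonlinear sensor characterization mentioned above can be illustrated with a small calibration sketch. Everything here is an assumption for illustration: the abstract does not specify the model forms, so a straight line and a quadratic are used as stand-ins, and the resistance and angle values are made-up numbers, not data from the paper.

```python
import math

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*x + ..."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i)
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= f * A[col][k]
            b[row] -= f * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs

def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def rmse(coeffs, xs, ys):
    """Root-mean-square angle error of a fitted model on (xs, ys)."""
    return math.sqrt(sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical calibration data: relative resistance change vs. joint angle (degrees)
delta_r = [0.00, 0.02, 0.04, 0.06, 0.08, 0.10]
angle_deg = [0.0, 2.8, 7.2, 13.2, 20.8, 30.0]

linear = polyfit(delta_r, angle_deg, 1)      # stand-in for the "linear model"
quadratic = polyfit(delta_r, angle_deg, 2)   # stand-in for the "nonlinear model"
print("linear RMSE (deg):", rmse(linear, delta_r, angle_deg))
print("quadratic RMSE (deg):", rmse(quadratic, delta_r, angle_deg))
```

Comparing the two RMSE values on held-out angles is one simple way to decide whether the extra nonlinear term is worth keeping for a given sensor.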

10. 仿真、数据集与评测 20 篇

2606.18375 2026-06-18 cs.RO 新提交

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

PAIWorld: 用于机器人操作的三维一致世界基础模型

Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu

发表机构 * Institute of AI for Industries, Chinese Academy of Sciences(中国科学院人工智能产业研究院)

AI总结 提出PAIWorld框架,通过几何感知交叉注意力、几何旋转位置编码和潜在3D-REPA蒸馏,解决多视图世界模型的3D不一致问题,在机器人操作基准上取得领先性能。

详情
AI中文摘要

世界基础模型(WFMs)是强大的模拟器,但它们主要运行在单视图设置中,缺乏机器人操作所需的多视图3D一致性。虽然机器人系统依赖多个摄像头(自我中心、眼到手和腕装)进行策略学习,但当前的多视图世界模型只是简单地拼接视图标记,没有显式的几何推理。这导致跨视图物体漂移、深度不一致和纹理错位。我们将这些失败归因于两个缺陷:缺乏显式的视图间通信机制和缺乏3D几何先验。我们认为同时解决这两个问题是必要且充分的。为此,我们提出PAIWorld,一个通过三个核心组件增强扩散变换器世界模型的框架:(1)几何感知交叉注意力块,建立跨视图的显式通路;(2)几何旋转位置编码,将相机射线方向和外部姿态编码到注意力机制中;(3)潜在3D-REPA,从冻结的3D基础模型中蒸馏3D感知特征以确保3D一致性。基于DiT世界基础模型,PAIWorld在机器人操作基准上实现了最先进的多视图3D一致性,在WorldArena排行榜上排名第一,在AgiBot-Challenge2026排行榜上排名第二,同时支持基于模型的规划、世界动作模型和多视图策略后训练等下游应用。

英文摘要

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.

2606.18594 2026-06-18 cs.RO cs.AI 新提交

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

基于视觉的机器人操作中强化学习动作空间的基准测试

Seyed Alireza Azimi, Homayoon Farrahi, Abhishek Naik, Colin Bellinger, A. Rupam Mahmood

发表机构 * Department of Computing Science, University of Alberta(阿尔伯塔大学计算机科学系) National Research Council Canada(加拿大国家研究委员会) School of Electrical Engineering and Computer Science, University of Ottawa(渥太华大学电气工程与计算机科学学院) Vector Institute(向量研究所) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔机器智能研究所)

AI总结 本研究通过模拟到现实的迁移,在物体抓取和推动任务中评估了四种动作空间,发现关节速度动作空间在平滑性和任务性能上最优,并为RL实践者提供了动作空间选择指导。

Comments 9 pages with references

详情
AI中文摘要

在现实世界的强化学习(RL)中,动作空间的选择在塑造运动平滑性、安全性和整体任务性能方面起着关键作用。在本研究中,我们评估了位姿增量、位姿速度、关节位置增量和关节速度在两项基于视觉的操作任务(物体抓取和推动)中的表现。我们在模拟中训练策略,并通过模拟到现实的迁移将其部署到现实世界。我们发现,动作空间表示确实显著影响模拟到现实的性能。特别是,我们发现关节速度动作空间在平滑性和最终任务性能方面最适合基于视觉的抓取和推动任务。我们还为RL实践者在模拟和现实实验中选择动作空间提供了实用指导。

英文摘要

In real-world reinforcement learning (RL), the choice of action space can play a key role in shaping motion smoothness, safety, and overall task performance. In this study, we evaluate pose increment, pose velocity, joint position increment, and joint velocity across two vision-based manipulation tasks: object picking and pushing. We train policies in simulation and deploy them to the real world using sim-to-real transfer. We find that action-space representation indeed significantly affects sim-to-real performance. In particular, we find that the joint velocity action space is best for the vision-based picking and pushing tasks in terms of smoothness and final task performance. We also provide practical guidance for RL practitioners in choosing action spaces for both simulation and real-world experiments.

2606.18610 2026-06-18 cs.RO cs.CV 新提交

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

SC3-Eval: 通过自洽视频生成评估机器人基础模型

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) NVIDIA(英伟达) Physical Intelligence Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校) Allen Institute for AI(艾伦人工智能研究所)

AI总结 提出SC3-Eval方法,利用前向-反向动力学一致性、跨视角一致性和测试时一致性,将预训练视频基础模型转化为准确的策略评估器,在7个真实世界策略上达到0.929的皮尔逊相关系数。

详情
AI中文摘要

在真实世界中评估通用机器人操作策略成本高、速度慢且难以扩展。动作条件视频世界模型通过模拟策略 rollout 提供了一种可扩展的替代方案。然而,这带来三项挑战:自回归 rollout 会累积复合误差,多视角观测必须保持相互一致,且评估器必须泛化到行为超出训练分布的策略。我们通过 SC3-Eval 解决这些挑战,这是一种自洽视频生成方案,通过强制三种互补的一致性,将预训练视频基础模型转化为准确的策略评估器。首先,前向-反向动力学一致性联合训练模型从动作预测帧以及从帧恢复动作,将生成的 rollout 锚定在物理上合理的动作流形上,并抵消仅前向模型无法惩罚的漂移。其次,跨视角一致性训练模型从每个相机视角修补其他视角,使多相机观测在长 rollout 中保持连贯,无需任何显式记忆机制。第三,测试时一致性在推理时重用反向动力学模式,作为每个动作块的不确定性信号,当生成的帧偏离请求的动作时终止 rollout。我们还展示了 SC3-Eval rollout 复现了策略在真实世界 rollout 中表现出的失败模式,支持细粒度的诊断比较而不仅仅是聚合排名。在七个真实世界的视觉-语言-动作策略上,SC3-Eval 达到了闭环皮尔逊相关系数 0.929 和 MMRV 0.119,优于三个已有的基于视频模型的强基线,并泛化到新任务。

英文摘要

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts, but three challenges remain: autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and an MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.

2606.18646 2026-06-18 cs.RO 新提交

A Scalable Embodied Intelligence Platform for Seamless Real-to-Sim-to-Real Transfer of Household Mobile Manipulation Tasks

一种可扩展的具身智能平台,用于家庭移动操作任务的无缝真实-仿真-真实迁移

Kui Yang, Xianlei Long, Haoxuan Li, Yan Ding, Chao Chen

发表机构 * School of Computer Science, Chongqing University(重庆大学计算机学院) R&D Department, Lumos Robotics Technology (Suzhou) Co., Ltd(Lumos 机器人技术(苏州)有限公司研发部)

AI总结 提出BestMan平台,通过自动化场景生成、仿真引导任务形式化和硬件无关中间件,解决真实-仿真-真实迁移中的场景重建、策略评估和部署兼容性挑战,实现家庭移动操作的无缝迁移。

Comments CCF Transactions on Pervasive Computing and Interaction

详情
AI中文摘要

移动操作是具身智能机器人的基本能力。对非结构化家庭环境中鲁棒且可泛化操作的需求日益增长,推动了具身智能平台的快速发展。然而,实现真实-仿真-真实循环的无缝迁移面临三个关键挑战:昂贵的高保真仿真场景重建、仿真中系统策略评估的复杂性以及不兼容的真实世界部署。为了解决这些挑战,我们开发了BestMan,一个可扩展且无缝的真实-仿真-真实平台,弥合仿真与真实世界之间的差距,实现家庭移动操作的有效策略开发、集成和部署。具体来说,我们设计了一个新颖的自动化场景生成(ASG)模块,从真实观测中重建逼真的仿真。然后,我们提出了一种仿真引导的任务形式化和技能学习架构,支持在仿真中灵活集成和大规模评估混合技能策略。最后,为了增强真实世界的可扩展性,我们开发了一个硬件无关的统一中间件(HUM),确保跨异构移动操作器的无缝且兼容的仿真到真实迁移,用于真实部署。实验结果表明,我们提出的平台在建立标准化基准和促进移动操作领域有前景的研究方面表现出优越的性能。

英文摘要

Mobile manipulation is a fundamental capability in embodied intelligence robotics. The growing demand for robust and generalizable manipulation in unstructured household environments has driven rapid progress in embodied intelligence platforms. However, seamless transfer across the real-to-sim-to-real cycle faces three key challenges: costly reconstruction of high-fidelity simulation scenes, the complexity of systematic strategy evaluation in simulation, and incompatible real-world deployments. To address these challenges, we develop BestMan, a scalable and seamless real-to-sim-to-real platform that bridges the gap between simulation and the real world, enabling effective strategy development, integration, and deployment for household mobile manipulation. Specifically, we design a novel Automated Scene Generation (ASG) module to reconstruct realistic simulations from real observations. Then, we propose a simulation-guided task formalization and skill learning architecture that supports the flexible integration and large-scale evaluation of hybrid skill strategies in simulation. Finally, to enhance real-world scalability, we develop a Hardware-agnostic and Unified Middleware (HUM) to ensure seamless and compatible sim-to-real transfer across heterogeneous mobile manipulators for real deployments. Experimental results demonstrate the superior performance of our proposed platform in establishing standardized benchmarks and facilitating promising research in the field of mobile manipulation.

2606.18698 2026-06-18 cs.RO cs.AI cs.LG 新提交

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

利用能量特征进行基于深度学习的表面分类:三个独立数据集的比较分析

Alexander Belyaev, Oleg Kushnarev

AI总结 研究评估能量特征作为表面分类的独立或辅助模态的可行性,在三个数据集上比较多种深度学习架构,发现CNN性能最优,纯能量特征准确率85-90%,与惯性特征结合可达96-99%,且能量特征可稳定提升1-2%准确率。

详情
AI中文摘要

基于能量的方法在移动机器人表面分类中仍是一个相对未被充分研究的途径,尽管在受限环境中取得了有希望的结果。本研究评估了使用能量衍生特征作为独立分类模态或作为惯性数据补充输入的可行性。在三个公开数据集上进行了全面评估,比较了现代深度学习架构(包括循环神经网络、卷积神经网络、仅编码器Transformer和Mamba状态空间模型)在自动超参数调整和输入序列长度优化下的性能。模型在所有评估数据集上均实现了比先前报道值更高的准确率,其中卷积神经网络取得了最高的整体性能。当仅依赖基于能量的特征时,模型分类准确率在85-90%范围内,比与惯性特征结合时(96-99%)低约5-10%。用能量特征增强惯性数据使平均准确率稳定提高1-2%。这些发现表明,仅依赖能量特征的分类器为独立部署提供了足够的准确性,同时在与其他感知模态结合使用时也提供了一致的增益。

英文摘要

The energy-based method remains a comparatively underexamined approach for surface classification in mobile robotics, despite promising results in constrained environments. This study evaluated the viability of using energy-derived features as either a standalone classification modality or as supplementary input to inertial data. A comprehensive evaluation was conducted across three publicly available datasets, comparing the performance of modern deep learning architectures including recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models, under automated hyperparameter tuning and input sequence length optimization. The models achieved higher accuracy than previously reported values on all evaluated datasets, with the convolutional neural network yielding the highest overall performance. When relying exclusively on energy-based features, the models attained classification accuracies in the range of 85-90%, approximately 5-10% lower than those achieved when combined with inertial features (96-99%). Augmenting inertial data with energy features resulted in a consistent mean accuracy improvement of 1-2%. These findings indicate that classifiers relying solely on energy features offer sufficient accuracy for standalone deployment, while also providing a consistent gain when used in combination with other sensing modalities.

2606.18948 2026-06-18 cs.RO 新提交

C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors

C-ARC: 面向非重复式LiDAR传感器的连续自适应范围聚类

Nick B. Schroeder, Jonathan Lichtenfeld, Oskar von Stryk

发表机构 * Technical University of Darmstadt(达姆施塔特工业大学) Simulation, Systems Optimization and Robotics Group(仿真、系统优化与机器人组)

AI总结 提出C-ARC框架,通过滑动窗口上的持久双图结构解耦高频点插入与按需聚类检索,并利用指数控制环自适应校准网格分辨率,实现非重复式LiDAR点云的实时聚类。

Comments Submitted to IEEE Robotics and Automation Letters. This work has been submitted to the IEEE for possible publication. 8 pages, 7 figures

详情
AI中文摘要

实时LiDAR聚类识别点云中的结构,是许多移动机器人算法的重要前提。当前方法主要针对重复式机械LiDAR传感器开发。近年来,由于成本和外形尺寸小,非重复式LiDAR传感器的使用显著增加。这类基于Risley棱镜的非重复传感器违反了重复式机械传感器的两个关键假设:结构化的扫描线和明确的帧边界。其Rhodonea曲线轨迹产生非均匀点分布,且缺乏旋转周期使得传统扫描线索引无法适用。为满足这些新需求,我们开发了C-ARC,一个连续自适应范围聚类框架,它在滑动窗口上维护一个持久双图,将高频点插入与按需聚类检索解耦。这对于SLAM或跟踪等关键功能至关重要。自适应范围网格分辨率机制在初始化时使用指数控制环校准网格尺寸,无需预先了解扫描模式即可平衡稀疏-碰撞权衡。作为开源的单线程C++17库实现,C-ARC在商用硬件上对Livox Mid-360以20 Hz产生实时聚类输出。在Livox Avia上的评估表明,对于扫描模式高度集中的传感器,无界单元占用是主要限制。自适应分辨率机制还提高了现有基于网格的方法在非重复数据上的聚类质量。

英文摘要

Real-time LiDAR clustering identifies structures in point clouds and is an essential prerequisite for many mobile robotics algorithms. Current methods are mostly developed for repetitive mechanical LiDAR sensors. Recently, the use of non-repetitive LiDAR sensors has increased sharply owing to their low cost and compact form factor. Such non-repetitive Risley prism-based sensors violate two key assumptions that hold for repetitive mechanical sensors: structured scan lines and well-defined frame boundaries. Their Rhodonea-curve trajectories produce non-uniform point distributions, and the absence of a rotation cycle renders conventional scan line indexing inapplicable. To meet these new requirements, we developed C-ARC, a Continuous-Adaptive Range Clustering framework that maintains a persistent dual-graph over a sliding window, decoupling high-frequency point insertion from on-demand cluster retrieval. This is crucial for key functionalities like SLAM or tracking. An adaptive range grid resolution mechanism calibrates grid dimensions at initialization using an exponential control loop, balancing the sparsity-collision trade-off without prior knowledge of the scanning pattern. Implemented as an open-source single-threaded C++17 library, C-ARC produces real-time cluster output at 20 Hz on commodity hardware for the Livox Mid-360. Evaluation on the Livox Avia identifies unbounded cell occupancy as the primary limitation for sensors with strongly concentrated scan patterns. The adaptive resolution mechanism additionally improves clustering quality for existing grid-based methods on non-repetitive data.
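
The sparsity-collision trade-off behind the adaptive grid resolution can be illustrated with a toy calibration loop. This is only one plausible reading of "exponential control loop", not the C-ARC implementation: the cell size is scaled multiplicatively until the fraction of points that collide in an occupied cell falls inside a target band. The function names, the azimuth/elevation grid, the target values, and the rose-curve test pattern are all assumptions made for this sketch.

```python
import math

def collision_rate(angles, cell_deg):
    """Fraction of points that land in a grid cell already holding a point."""
    cells = set()
    collisions = 0
    for az, el in angles:
        key = (math.floor(az / cell_deg), math.floor(el / cell_deg))
        if key in cells:
            collisions += 1
        else:
            cells.add(key)
    return collisions / len(angles)

def calibrate_cell_size(angles, target=0.1, tol=0.05, cell_deg=1.0,
                        factor=2.0, max_iter=40):
    """Scale the cell size multiplicatively until the collision rate is near target.

    Too many collisions -> shrink cells; too few (grid too sparse) -> grow cells.
    The step factor is damped whenever the loop overshoots the target band.
    """
    direction = 0
    for _ in range(max_iter):
        rate = collision_rate(angles, cell_deg)
        if abs(rate - target) <= tol:
            break
        new_dir = -1 if rate > target else 1
        if direction and new_dir != direction:
            factor = math.sqrt(factor)  # overshoot: take smaller steps
        direction = new_dir
        cell_deg *= factor ** new_dir
    return cell_deg

# Hypothetical non-repetitive scan: points along a rose (Rhodonea) curve, in degrees
angles = []
for i in range(2000):
    t = 0.01 * i
    r = 35.0 * math.cos(3.0 * t)
    angles.append((r * math.cos(t), r * math.sin(t)))

cell = calibrate_cell_size(angles)
print("calibrated cell size (deg):", cell)
print("collision rate:", collision_rate(angles, cell))
```

The loop needs no prior knowledge of the scan pattern: it only observes how densely the incoming points fill the grid, which is the property the abstract highlights.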

2606.18959 2026-06-18 cs.RO 新提交

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer

TactSpace: 学习富含物理信息的共享潜在空间以实现触觉模拟到现实的迁移

Arunim Joarder, Arjun Bhardwaj, René Zurbrügg, Mayank Mittal, Florin Püntener, Sira Bielefeldt, Cosmin Roman, Vaishakh Patil, Marco Hutter

发表机构 * Robotic Systems Lab, ETH Zürich(瑞士苏黎世联邦理工学院机器人系统实验室) Micro- and Nanosystems Lab, ETH Zürich(瑞士苏黎世联邦理工学院微纳系统实验室) ETH AI Center(苏黎世联邦理工学院人工智能中心) NVIDIA(NVIDIA公司)

AI总结 提出多模态表示学习框架TactSpace,通过共享潜在空间对齐异构触觉模态,实现零样本模拟到现实迁移,在力预测和形状重建任务中分别降低误差16.7%和45.8%。

Comments 9 pages, 6 figures, 4 tables, accepted into IROS 2026

详情
AI中文摘要

触觉传感提供了对机器人操作至关重要的接触相互作用的直接测量。然而,当前的模拟器缺乏足够保真度来忠实模拟触觉传感器的复杂变形和换能机制,严重阻碍了机器人学习流程中的模拟到现实迁移。为了解决这一挑战,我们提出了一种多模态表示学习框架,该框架在共享潜在空间内对齐异构触觉模态,消除了对精确原始信号模拟的需求,同时保留了相关的接触信息。我们的方法采用模态特定编码器将不同的触觉观测(例如模拟穿透深度和真实电容)投影到公共嵌入空间中。该模型使用自重建和交叉重建目标以及对比对齐进行训练,鼓励模态不变且信息丰富的表示。我们在压头形状识别、力预测和几何重建任务上评估学习到的嵌入,仅在模拟中训练并直接在真实传感器测量上测试。我们的结果展示了跨物理不同表示的零样本模拟到现实迁移。此外,结合多物理模拟模态产生了更信息丰富的嵌入,这些嵌入可跨不同下游任务迁移,力预测误差降低16.7%,形状重建误差降低45.8%。最后,我们为Isaac Lab发布了一个基于Warp的高效罚函数触觉模拟模型实现,支持可扩展的触觉数据生成。

英文摘要

Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach employs modality-specific encoders to project diverse tactile observations, such as simulated penetration depth and real-world capacitance, into a common embedding space. The model is trained using self- and cross-reconstruction objectives alongside contrastive alignment, encouraging modality-invariant yet information-rich representations. We evaluate the learned embeddings on indenter shape identification, force prediction, and geometric reconstruction tasks, training exclusively in simulation and testing directly on real sensor measurements. Our results demonstrate zero-shot sim-to-real transfer across physically dissimilar representations. Furthermore, incorporating multi-physics simulation modalities yields more informative embeddings that transfer across diverse downstream tasks, demonstrating a 16.7% reduction in force prediction error and a 45.8% reduction in shape reconstruction error. Finally, we release an efficient Warp-based implementation of a penalty-based tactile simulation model for Isaac Lab, enabling scalable tactile data generation.

2606.19067 2026-06-18 cs.RO cs.CV 新提交

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

传感器配置至关重要:四足机器人多模态SLAM的系统评估

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

发表机构 * Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院机器智能与机器人实验室) Institute for Intelligent Systems, Esslingen University of Applied Sciences(埃斯林根应用科学大学智能系统研究所) Department of Computer Science, University of Freiburg(弗赖堡大学计算机科学系)

AI总结 针对四足机器人运动中的传感器配置问题,系统评估了视觉、视觉-惯性和LiDAR-视觉-惯性SLAM方法,发现立体相机、全局快门和适当惯性集成能显著提升定位鲁棒性。

详情
AI中文摘要

四足机器人在不同环境中的自主导航从根本上依赖于鲁棒的同步定位与地图构建(SLAM)。虽然视觉-惯性SLAM在轮式、手持和空中平台上已经成熟,但在腿部运动的剧烈动态下,硬件级传感器配置如何影响性能仍存在关键的评估空白。四足机器人引入了独特的具身感知挑战,包括足部冲击、高频机械振动和快速角旋转,这些都会降低标准感知管道的性能。为了填补这一空白,我们使用在ANYmal D四足机器人上记录的GrandTour数据集,对最先进的视觉、视觉-惯性和LiDAR-视觉-惯性SLAM方法进行了系统评估。我们分离并量化了相机模态、快门技术和惯性传感器层级的影响,分析了它们在定位精度、算法鲁棒性和计算资源利用方面的权衡。我们的实证结果表明,硬件选择对系统鲁棒性有显著影响:立体配置始终优于单目和RGB-D模态,全局快门相机相比卷帘快门相机显著减少了运动引起的跟踪失败,并且关键的是,在剧烈的腿部运动下,标准惯性集成可能降低主要基于视觉的框架的性能。这些见解还为定制传感器负载提供了具体的设计指南,以实现敏捷腿部系统的可靠感知。

英文摘要

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

2606.19154 2026-06-18 cs.RO 新提交

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Viking Hill数据集:用于森林场景检测与分割的激光雷达-雷达-相机数据集

Vladimír Kubelka, Oleksandr Kotlyar, Unal Artan, Martin Magnusson

发表机构 * Örebro University, AASS research centre, Robot Navigation and Perception Lab(厄勒布鲁大学,AASS研究中心,机器人导航与感知实验室)

AI总结 提出首个包含4D成像雷达的森林多传感器数据集,通过MinkowskiUNet实现雷达与激光雷达点云的语义分割,并评估树干分割质量与树木尺寸的关系。

Comments 33 pages, 11 figures

详情
AI中文摘要

在森林冠层下运行的自主机器人需要对树木及周围植被在不同季节条件下进行稳健感知。现有的林业数据集提供带有单棵树标注的激光雷达或相机数据,但均未包含共配准的4D成像雷达——这一模态因其对视觉退化、表面污染和植被遮挡的鲁棒性而日益受到关注。我们介绍了一个由移动机器人收集的多传感器森林数据集,该机器人配备了高分辨率FMCW成像雷达、激光雷达、RGB相机、IMU和RTK-GNSS。该场地在两个不同植被状态的会话中记录,3D立方体标注(包括每棵树的直径估计)为所有三种感知模态提供了共享语义标签。此外,我们提供了使用MinkowskiUNet对雷达和激光雷达点云进行语义分割的基线结果。雷达在主要类别(地面91%,冠层86%)上取得了与激光雷达竞争性的IoU分数,但在几何精细结构(如树干)上落后(56%对74%)。跨模态分析进一步比较了激光雷达和雷达的树干分割与RGB检测模型,而按直径分层的评估揭示了树干分割质量如何随树木尺寸变化。除了分割,共配准的多模态数据和RTK-GNSS辅助参考定位支持冠层下地图构建、定位和传感器融合的研究。数据集和标注工具已公开。

英文摘要

Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per-tree annotations, but none include co-registered 4D imaging radar -- a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi-sensor forest dataset collected by a mobile robot equipped with a high-resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK-GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations -- including per-tree diameter estimates -- provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross-modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter-stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co-registered multi-modal data and RTK-GNSS-aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.
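
The per-class IoU scores reported above follow the usual intersection-over-union definition for semantic segmentation. A minimal sketch on per-point labels is shown below; the class names mirror the abstract, but the label arrays are made-up toy data, not values from the dataset.

```python
def per_class_iou(pred, truth, classes):
    """Intersection-over-union per class from per-point predicted and true labels."""
    ious = {}
    for c in classes:
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        # A class absent from both prediction and ground truth has no defined IoU
        ious[c] = inter / union if union else float("nan")
    return ious

# Hypothetical labels for six points
truth = ["ground", "ground", "trunk", "trunk", "canopy", "canopy"]
pred = ["ground", "ground", "trunk", "canopy", "canopy", "canopy"]

scores = per_class_iou(pred, truth, ["ground", "trunk", "canopy"])
print(scores)
```

In this toy case one trunk point is mislabeled as canopy, which lowers the IoU of both classes; thin structures such as trunks contain few points, so each such error costs proportionally more.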

2606.19161 2026-06-18 cs.RO 新提交

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

HT-Bench:基于自我中心视觉的灵巧全手触觉表示基准与学习

Yuzhe Huang, Jiaping Wu, Jiaming Jiang, Hezhe Lin, Aikebaier Aierken, Yunlong Wang, Kun Cheng, Ziyuan Jiao, Yuanxin Zhong

发表机构 * Beihang University(北京航空航天大学) Rimbot BUPT(北京邮电大学) ShanghaiTech University(上海科技大学) Tsinghua University(清华大学) CAS(中国科学院)

AI总结 提出HT-Bench多任务基准和HandTouch编码器,通过大规模自我中心视觉与全手触觉数据,在触觉相似性检索、掩码修复、视觉到触觉合成等任务上验证了触觉表示的有效性。

Comments 9 pages, 4 figures

详情
AI中文摘要

由于触觉传感器设计、数据格式和机器人形态的多样性,为机器人操作中的触觉表示学习建立通用基准仍然具有挑战性。我们并未试图建立这样的基准,而是探索了一个可扩展且有前景的未来发展方向:将自我中心视觉与全手触觉数据配对。为此,我们引入了HT-Bench,一个用于灵巧全手触觉感知的大规模多任务基准,包含在226个任务中收集的1000万RGB帧和780万触觉帧。HT-Bench从三个关键角度评估触觉表示:它们是否编码有意义的接触几何、是否能够将触觉观测与视觉信息对齐、以及是否能够泛化到未见任务。为评估这些能力,HT-Bench包含四个任务:细粒度触觉相似性检索、掩码触觉修复、视觉到触觉合成以及多模态触觉帧预测。我们进一步提出了HandTouch,一个矢量量化视觉-触觉编码器,通过渐进的空间、跨模态和时间训练学习触觉表示。在HT-Bench上,HandTouch始终优于代表性的触觉编码器基线,将细粒度触觉相似性检索的Recall@5从74.65%提高到85.23%,将掩码触觉修复的RMSE从0.022降低到0.010,并将视觉到触觉合成的OOD cIoU从0.628提高到0.705。这些结果证明了HandTouch的有效性,并表明大规模自我中心全手触觉数据为评估和推进灵巧操作中的触觉表示学习提供了可扩展的基础。

英文摘要

Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than seeking to establish such a benchmark, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce HT-Bench, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representations from three key perspectives: whether they encode meaningful contact geometry, whether they can align tactile observations with visual information, and whether they generalize to unseen tasks. To assess these capabilities, HT-Bench includes four tasks: fine-grained tactile similarity retrieval, masked tactile inpainting, vision-to-tactile synthesis, and multimodal tactile frame prediction. We further propose HandTouch, a vector-quantized vision-tactile encoder that learns tactile representations through progressive spatial, cross-modal, and temporal training. Across HT-Bench, HandTouch consistently outperforms representative tactile encoder baselines, improving Recall@5 on fine-grained tactile similarity retrieval from 74.65% to 85.23%, reducing RMSE on masked tactile inpainting from 0.022 to 0.010, and increasing OOD cIoU on vision-to-tactile synthesis from 0.628 to 0.705. These results demonstrate the effectiveness of HandTouch and suggest that large-scale egocentric full-hand tactile data provides a scalable basis for evaluating and advancing tactile representation learning in dexterous manipulation.
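
The Recall@5 figure for similarity retrieval can be read as a hit rate over ranked neighbours. The sketch below assumes the common definition (a query counts as a hit if any of its top-k retrieved items shares its label) and cosine similarity between embeddings; the abstract states neither, and the embeddings and labels are made-up toy values. The toy gallery has only four items, so k = 1 and k = 3 are used instead of 5.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(queries, gallery, k):
    """Fraction of queries with at least one same-label item in the top-k results.

    queries and gallery are lists of (embedding, label) pairs.
    """
    hits = 0
    for q_emb, q_label in queries:
        ranked = sorted(gallery, key=lambda g: cosine(q_emb, g[0]), reverse=True)
        if any(label == q_label for _, label in ranked[:k]):
            hits += 1
    return hits / len(queries)

# Hypothetical 2-D tactile embeddings with contact-type labels
gallery = [
    ([1.0, 0.0], "pinch"),
    ([0.9, 0.1], "pinch"),
    ([0.0, 1.0], "grasp"),
    ([0.1, 0.9], "grasp"),
]
queries = [
    ([1.0, 0.05], "pinch"),
    ([0.05, 1.0], "grasp"),
    ([1.0, 0.0], "grasp"),   # embedding disagrees with its label
]

print(recall_at_k(queries, gallery, 1))
print(recall_at_k(queries, gallery, 3))
```

Recall@k can only grow with k, so the value is meaningful only relative to the gallery size and the number of relevant items per query.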

2606.19176 2026-06-18 cs.RO cs.AI cs.SY eess.SY 新提交

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

用于自主海上无人机飞行的深度单目位姿估计的硬件与视觉在环验证

Maneesha Wickramasuriya, Beomyeol Yu, Jaden Shin, Mason Huslig, Taeyoung Lee, Murray Snyder

发表机构 * George Washington University(乔治华盛顿大学)

AI总结 提出硬件验证的视觉在环框架,结合深度变换器单目位姿估计器和延迟卡尔曼滤波器,在模拟逼真海上环境中实现自主室内飞行,验证了感知延迟等嵌入式效应。

Comments 6 pages, 9 figures

详情
AI中文摘要

船舶上的自主无人机操作需要可靠的基于视觉的相对位姿估计,然而海上验证成本高、依赖天气且风险大。本文提出一个硬件验证的视觉在环框架,能够在模拟逼真海上环境的同时实现完全自主的室内飞行。渲染的海上视图由板载的基于深度变换器的单目位姿估计器处理。延迟的视觉测量与高频率IMU数据通过延迟卡尔曼滤波器融合,为几何控制提供一致的状态估计。该系统捕捉了纯仿真中缺失的关键嵌入式效应,包括感知延迟、异步更新和计算约束。自主起飞、轨迹跟踪和着陆实验证明了稳定的闭环飞行。结果建立了一个安全且硬件真实的中间阶段,用于在船上部署之前开发海上无人机自主性。

英文摘要

Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.
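
Fusing a delayed vision measurement with high-rate inertial propagation can be sketched with a scalar filter. The abstract does not describe the paper's delayed Kalman filter in detail, so the version below uses one common scheme as an assumption: store the state before every prediction, roll back to the time the image was taken, apply the update there, and replay the stored inputs. The class name, the 1-D model, and the noise values are hypothetical.

```python
class DelayedKF1D:
    """Scalar Kalman filter for x[k+1] = x[k] + u[k]*dt with measurement z = x + noise."""

    def __init__(self, x0, p0, q, r, dt):
        self.x, self.p = x0, p0          # state estimate and its variance
        self.q, self.r, self.dt = q, r, dt
        self.step = 0
        # step -> (x, p, u): state just before the prediction made at that step.
        # A real implementation would prune entries older than the maximum delay.
        self.history = {}

    def predict(self, u):
        """High-rate propagation with an IMU-like rate input u."""
        self.history[self.step] = (self.x, self.p, u)
        self.x += u * self.dt
        self.p += self.q
        self.step += 1

    def update_delayed(self, z, meas_step):
        """Fuse a measurement that was taken at meas_step <= current step."""
        if meas_step < self.step:
            x, p, _ = self.history[meas_step]   # roll back to measurement time
        else:
            x, p = self.x, self.p               # no delay
        gain = p / (p + self.r)
        x, p = x + gain * (z - x), (1.0 - gain) * p
        # Replay the stored inputs to return to the current time
        for s in range(meas_step, self.step):
            u = self.history[s][2]
            self.history[s] = (x, p, u)         # keep corrected snapshots
            x, p = x + u * self.dt, p + self.q
        self.x, self.p = x, p

kf = DelayedKF1D(x0=0.0, p0=1.0, q=0.01, r=0.04, dt=0.1)
for _ in range(3):
    kf.predict(u=1.0)                 # three fast propagation steps
kf.update_delayed(z=0.5, meas_step=1) # vision result arrives two steps late
print(kf.x, kf.p)
```

Applying the correction at the time the image was captured, rather than at arrival time, keeps the estimate consistent even when perception latency spans several propagation steps.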

2606.19186 2026-06-18 cs.RO cs.LG 新提交

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise

学习标注延迟和误报AEB事件:针对极端类别不平衡和非对称标签噪声的实用系统

Mengxiang Hao, Xin Jiang, Xinghao Huang, Wenliang Su, Zhiteng Wang, Junjie Rao, Xiaotian Yang, Wei Liao, Chengyu Han, Gen Liang, Yulun Song, Zhitao Xu, Xianpeng Lang

发表机构 * Li Auto(理想汽车)

AI总结 提出首个自动化AEB标注框架,通过特定数据增强和噪声抑制技术,解决极端类别不平衡和非对称标签噪声问题,将延迟/误报触发召回率提升80%,人工工作量减少50%。

Comments 8 pages, 5 figures, accepted by IEEE International Conference on Robotics and Automation (ICRA)

详情
Journal ref
2026 IEEE International Conference on Robotics and Automation (ICRA)
AI中文摘要

自主紧急制动(AEB)优化依赖于准确标注的真实世界触发事件,特别是揭示系统缺陷的罕见但关键的延迟和误报AEB触发事件。然而,这些少数样本在每天数千次触发事件中占比不到5%,使得大规模人工标注成本过高。我们提出了首个自动化AEB标注框架来解决这一问题。在开发过程中,我们识别出两个严重损害延迟/误报触发标注准确性的基本挑战:(1)极端类别不平衡,其中延迟/误报触发被真实触发淹没;(2)非对称标签噪声,其中误标注的多数样本(真实触发)抑制了少数样本(延迟/误报触发)的学习。为克服这些挑战,我们提出两项关键创新:(1)特定数据增强,通过操纵焦点目标属性、移植自车动态和掩蔽非焦点代理来合成逼真样本;(2)噪声抑制,使用稳定硬度估计和探针引导的自适应阈值来清理误标注的真实触发样本。关键的是,我们将模型部署为具有全栈架构的实用标注系统,从每天数千个AEB事件中高效识别关键的延迟/误报触发。生产结果表明,延迟/误报触发的召回率提高了80%,人工工作量减少了50%。除了直接收益,该系统通过积累高质量标注实现持续自我改进,为车载AEB系统优化奠定了必要的数据基础。

英文摘要

Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) extreme class imbalance, where delayed/false triggers are overwhelmed by true triggers; (2) asymmetric label noise, where mislabeled majority samples (true triggers) suppress the learning of minority samples (delayed/false triggers). To overcome these challenges, we propose two key innovations: (1) specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; (2) noise suppression using stable hardness estimation and a probe-guided adaptive threshold to clean mislabeled true trigger samples. Crucially, we deploy our model as a practical annotation system with a full-stack architecture, efficiently identifying critical delayed/false triggers from thousands of daily AEB events. Production results demonstrate an 80% improvement in recall of delayed/false triggers and a 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization.

2606.19267 2026-06-18 cs.RO cs.SY eess.SY 新提交

A Mixed-Reality Testbed for Autonomous Vehicles

自动驾驶汽车的混合现实测试平台

H. M. Sabbir Ahmad, Ehsan Sabouni, Emrullah Celik, Zean Wan, Damola Ajeyemi, Christos G. Cassandras, Wenchao Li

发表机构 * Boston University(波士顿大学)

AI总结 提出一种混合现实硬件在环测试平台,集成物理移动机器人与高保真仿真环境,用于验证感知、规划和控制算法,并支持多智能体系统研究。

Comments 9 pages, 7 figures, 1 table

详情
AI中文摘要

我们提出了一种用于自动驾驶汽车的混合现实、硬件在环(HIL)测试平台,该平台将物理移动机器人测试平台与高保真仿真环境无缝集成。虚拟仿真能够创建多样化的、安全关键的驾驶场景,以验证最先进的感知、规划和控制算法,同时通过配备多模态传感器的物理机器人在逼真的虚拟环境中增强仿真,进一步促进严格的验证。我们的测试平台还利用无线通信实现车辆连接,并通过物理机器人和虚拟仿真代理的组合容纳大量代理,支持包括网联自动驾驶汽车(CAV)在内的多智能体系统研究。最后,我们提出了一种结合感知、规划和一种新颖的基于控制障碍函数(CBF)的在线学习控制器的安全保证框架,用于CAV。使用所提出框架的实验用于验证和展示测试平台的关键功能以及其在弥合仿真与真实世界硬件部署之间差距方面的整体效用。

英文摘要

We propose a mixed-reality, hardware-in-the-loop (HIL) testbed for autonomous vehicles that seamlessly integrates a physical testbed of mobile robots with a high-fidelity simulation environment. The virtual simulation enables the creation of diverse, safety-critical driving scenarios to validate state-of-the-art perception, planning, and control algorithms, while augmenting simulations with physical robots equipped with multimodal sensors in photorealistic virtual environments further facilitates rigorous validation. Our testbed also features vehicular connectivity using wireless communication and can accommodate a large number of agents through the combination of physical robots and virtual simulated agents, supporting research on multi-agent systems including Connected and Autonomous Vehicles (CAVs). Finally, we present a safety-guaranteed framework combining perception, planning, and a novel online learning-based controller using Control Barrier Functions (CBFs) for CAVs. Experiments using the proposed framework validate and demonstrate the key functionalities and the overall utility of the testbed in bridging the gap between simulation and real-world hardware deployment.

2606.18439 2026-06-18 cs.CV cs.RO 交叉投稿

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

RegimeVGGT:面向视觉几何基础Transformer的逐层空间保持冗余去除

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of California, Irvine(加利福尼亚大学尔湾分校) Nanyang Technological University(南洋理工大学)

AI总结 提出RegimeVGGT,通过逐层U形压缩(显著性引导带状合并与选择性保护K/V下采样)去除冗余,在保持重建质量的同时实现6.7倍加速。

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

详情
AI中文摘要

视觉几何基础Transformer(VGGT)通过一次前向传播从多视图图像恢复密集3D场景结构,但二次交叉帧注意力限制了其可扩展性。现有的免训练加速器沿单一轴均匀减少计算,忽略了层间异质性。我们的频谱、探测和因果分析揭示了三个区域:浅层缺乏跨视图结构,中层驱动跨视图对齐,深层对密集几何是冗余的,但其跨帧注意力对姿态仍然至关重要。RegimeVGGT沿两个轴应用逐层U形压缩:显著性引导带状合并保护几何和边缘显著性令牌,而选择性保护K/V下采样通过相移空间网格、参考帧锚点以及未压缩的相机/注册令牌来保持跨帧空间覆盖和姿态关键路径。免训练,RegimeVGGT在匹配重建质量下相比VGGT*实现了6.7倍加速。

英文摘要

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.18582 2026-06-18 cs.CV cs.RO eess.IV 交叉投稿

Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

ICRA 2026 GOOSE 2D细粒度语义分割挑战赛技术报告:利用DINOv3实现野外机器人中的鲁棒户外场景理解

Jaeil Park, Hyobin Choi, Sangjin Lee, Hyungtae Lim, Sung-Hoon Yoon

发表机构 * Daegu Gyeongbuk Institute of Science and Technology (DGIST)(大邱庆北科学技术院) Massachusetts Institute of Technology (MIT)(麻省理工学院)

AI总结 提出一种结合DINOv3自监督骨干、ViT-Adapter和Mask2Former解码器的网络设计,以及多尺度测试增强和模型集成的推理策略,在64类细粒度越野语义分割挑战中取得第一名,复合得分76.57%。

Comments 5 pages, 4 figures

详情
AI中文摘要

ICRA 2026野外机器人研讨会举办的GOOSE 2D细粒度语义分割挑战赛评估了越野图像在64个细粒度类别和11个评估的非空洞粗类别上的密集语义分割。我们提出了该挑战的第一名解决方案。我们的解决方案包含两个互补的改进:(a) 网络级设计,结合了自监督DINOv3 ViT-L/16骨干、ViT-Adapter和Mask2Former掩码分类解码器,以及基于全局[CLS]令牌的粗类别辅助损失;(b) 推理时聚合策略,基于多尺度和水平翻转测试时增强,以及使用Codabench分数选择的前三个检查点的集成。我们的方法达到了官方复合得分76.57%,包括69.32%的细类mIoU和83.81%的类别级mIoU,并在最终阶段排行榜上排名第一: www.codabench.org/competitions/14257/#/results-tab 。

英文摘要

The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: www.codabench.org/competitions/14257/#/results-tab.
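
The three reported figures can be cross-checked with a little arithmetic. The sketch below assumes the composite score is the unweighted mean of the fine-class mIoU and the category-level mIoU; the abstract does not state the formula, so this is only a reading that happens to be consistent with the reported numbers. The function name `composite_score` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def composite_score(fine_miou: str, category_miou: str) -> Decimal:
    """Unweighted mean of two percentages, rounded half-up to two decimals.

    Assumption: the composite is a plain average; the abstract does not
    give the official formula.
    """
    mean = (Decimal(fine_miou) + Decimal(category_miou)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Values taken from the abstract above.
print(composite_score("69.32", "83.81"))  # prints 76.57
```

Decimal arithmetic is used instead of floats so that the half-way case 76.565 rounds up deterministically.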

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO 交叉投稿

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas: 通过全景重投影实现3D场景理解

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑工业大学) Huawei(华为)

AI总结 提出OneCanvas方法,将多视图补丁特征聚合到全景画布上,利用深度和相机位姿进行重投影,无需复杂几何编码器或大量训练,在SQA3D等基准上达到最先进精度。

Comments Project page: https://baranowskibrt.github.io/onecanvas/

详情
AI中文摘要

现有的视觉语言模型(VLM)中的3D场景理解方法要么依赖复杂的、模型特定的几何编码器,要么为了追求空间推理而需要大量的训练预算。相反,OneCanvas将所有视图的补丁特征聚合到一个单一的等距柱状全景画布上。具体来说,每个补丁利用其深度和相机位姿被反投影到3D世界坐标,然后根据从画布原点看到的该点的连续经度和纬度放置在画布上,无需对重叠视图进行光栅化或聚合。补丁的度量坐标的3D位置嵌入被添加到其特征中,从而恢复了将世界位置压缩到角度画布坐标时丢失的深度。因此,来自所有帧的补丁共享一个空间坐标系,无需融合或对主干网络进行重大架构修改。预训练的VLM将此表示视为普通图像。由于画布可以以任何感兴趣的姿态为中心,相同的表示直接支持从特定视角进行情境推理,这是机器人和具身AI中的常见需求。得益于这种表示,我们还可以引入空间预训练课程:通过程序化地将从真实图像中提取的对象的补丁特征放置在原本空白的画布上的选定3D世界位置,我们生成了涵盖广泛空间推理任务的即时监督,并控制答案分布以减少空间推理捷径。OneCanvas在SQA3D和VSI-Bench上达到了最先进的准确率,并在SPBench上泛化到分布外数据,其训练计算量比最强竞争方法少一个数量级。

英文摘要

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
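
The placement step described above (unproject a patch with its depth and camera pose, then index the canvas by longitude and latitude) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera, a camera-to-world pose given as a 3x3 rotation plus translation, longitude measured around the y axis with 0 along +z, and latitude measured toward +y. All function names are hypothetical.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at metric depth into camera coordinates."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def camera_to_world(p_cam, rotation, translation):
    """Apply an assumed camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def canvas_coords(p_world, origin, width, height):
    """Continuous equirectangular (col, row) of a world point seen from origin.

    Also returns the distance, which the angular coordinates alone discard.
    """
    dx, dy, dz = (p_world[i] - origin[i] for i in range(3))
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    if dist == 0.0:
        raise ValueError("point coincides with the canvas origin")
    lon = math.atan2(dx, dz)    # in (-pi, pi], 0 means straight ahead (+z)
    lat = math.asin(dy / dist)  # in [-pi/2, pi/2]
    col = (lon / (2.0 * math.pi) + 0.5) * width
    row = (lat / math.pi + 0.5) * height
    return col, row, dist


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = unproject(u=320.0, v=240.0, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = camera_to_world(p_cam, identity, (0.0, 0.0, 0.0))
col, row, dist = canvas_coords(p_world, (0.0, 0.0, 0.0), width=1024, height=512)
```

Returning the distance alongside the canvas coordinates mirrors the abstract's point that depth is lost in the angular mapping and has to be restored separately, there via a 3D position embedding.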

2512.11736 2026-06-18 cs.RO 版本更新

Bench-Push: Benchmarking Pushing-based Navigation and Manipulation Tasks for Mobile Robots

Bench-Push:基于推动的移动机器人导航与操作任务基准测试

Ninghan Zhong, Steven Caro, Megnath Ramesh, Rishi Bhatnagar, Avraiem Iskandar, Stephen L. Smith

发表机构 * Institute for Robotics and Intelligent Machines, Georgia Institute of Technology(机器人与智能机器研究所,佐治亚理工学院) Department of Electrical and Computer Engineering, University of Waterloo(电气与计算机工程系,滑铁卢大学) Department of Mechanical Engineering, University of Alberta(机械工程系,阿尔伯塔大学)

AI总结 提出首个统一的推动式移动机器人导航与操作基准Bench-Push,包含多种模拟环境、新评估指标和基线实现,用于解决可移动障碍物环境中的机器人推动任务评估问题。

Comments Published in CRV 2026

详情
AI中文摘要

移动机器人越来越多地部署在具有可移动物体的杂乱环境中,这对禁止交互的传统方法提出了挑战。在这种环境中,移动机器人必须超越传统的避障策略,利用推动或轻推策略来实现其目标。尽管基于推动的机器人研究正在增长,但评估依赖于临时设置,限制了可重复性和交叉比较。为了解决这个问题,我们提出了Bench-Push,这是首个用于基于推动的移动机器人导航和操作任务的统一基准。Bench-Push包括多个组件:1)一系列全面的模拟环境,捕捉推动任务中的基本挑战,包括在具有可移动障碍物的迷宫中导航、自主船舶在冰覆盖水域中导航、箱子递送和区域清理,每个任务都有不同复杂程度;2)新的评估指标,用于捕捉效率、交互努力和部分任务完成;3)使用Bench-Push评估跨环境的已建立基线的示例实现。Bench-Push作为Python库开源,采用模块化设计。代码、文档和训练模型可在 https://github.com/IvanIZ/BenchNPIN 找到。

英文摘要

Mobile robots are increasingly deployed in cluttered environments with movable objects, posing challenges for traditional methods that prohibit interaction. In such settings, the mobile robot must go beyond traditional obstacle avoidance, leveraging pushing or nudging strategies to accomplish its goals. While research in pushing-based robotics is growing, evaluations rely on ad hoc setups, limiting reproducibility and cross-comparison. To address this, we present Bench-Push, the first unified benchmark for pushing-based mobile robot navigation and manipulation tasks. Bench-Push includes multiple components: 1) a comprehensive range of simulated environments that capture the fundamental challenges in pushing-based tasks, including navigating a maze with movable obstacles, autonomous ship navigation in ice-covered waters, box delivery, and area clearing, each with varying levels of complexity; 2) novel evaluation metrics to capture efficiency, interaction effort, and partial task completion; and 3) demonstrations using Bench-Push to evaluate example implementations of established baselines across environments. Bench-Push is open-sourced as a Python library with a modular design. The code, documentation, and trained models can be found at https://github.com/IvanIZ/BenchNPIN.

2512.14428 2026-06-18 cs.RO 版本更新

Odyssey: An Automotive Lidar-Inertial Odometry Dataset with GNSS-denied situations

Odyssey:一种面向GNSS拒止场景的汽车激光雷达-惯性里程计数据集

Aaron Kurda, Simon Steuernagel, Lukas Jung, Marcus Baum

发表机构 * University of Göttingen(哥廷根大学) iMAR Navigation(iMAR导航)

AI总结 提出Odyssey数据集,采用导航级环形激光陀螺仪RTK/INS提供高精度真值,包含36个序列和长时间GNSS拒止环境(隧道、室内停车场),用于评估LIO/SLAM系统。

Comments 10 pages, 4 figures, 3 tables, submitted to International Journal of Robotics Research (IJRR)

详情
AI中文摘要

激光雷达-惯性里程计(LIO)及同时定位与建图(SLAM)系统的开发与评估需要精确的真值。全球导航卫星系统(GNSS)常作为其基础,但在遮挡环境中,由于多径效应或信号丢失,其信号可能不可靠。现有数据集通过引入惯性测量单元(IMU)测量来补偿偶发的GNSS丢失,但由于累积漂移,常用系统不允许对GNSS拒止环境进行长时间研究。因此,此类数据集的多样性有限。为弥补这一空白,我们提出了Odyssey,一个汽车LIO数据集,其特点包括:(1)基于导航级环形激光陀螺仪(RLG)的RTK/INS导出的真值,其偏置稳定性比现有汽车数据集好1到4个数量级;(2)跨不同环境的36个序列的全面收集,支持稳健且全面的评估;(3)长时间的GNSS拒止环境,包括隧道以及汽车基准测试中此前未见过的室内停车场。在此,我们的RLG系统能够在常用系统会过度漂移的场景中实现准确评估。除了为LIO提供数据外,Odyssey还通过三次轨迹重复和通过精确大地坐标集成外部地图数据来支持地点识别任务。所有数据、数据加载器和补充材料均可在线获取,网址为: https://odyssey.uni-goettingen.de/ 。

英文摘要

The development and evaluation of Lidar-Inertial Odometry (LIO) and Simultaneous Localization and Mapping (SLAM) systems requires a precise ground truth. The Global Navigation Satellite System (GNSS) is often used as a foundation for this, but its signals can be unreliable in obstructed environments due to multi-path effects or loss-of-signal. While existing datasets compensate for sporadic GNSS loss by incorporating Inertial Measurement Unit (IMU) measurements, the commonly used systems do not permit prolonged study of GNSS-denied environments due to accumulated drift. Therefore, the diversity of such datasets is limited. To close this gap, we present Odyssey, an automotive LIO dataset featuring: (1) a ground truth derived from a navigation-grade Ring Laser Gyroscope (RLG)-based RTK/INS, offering bias stability one to four orders of magnitude better than existing automotive datasets; (2) a comprehensive collection of 36 sequences across diverse environments, enabling robust and comprehensive evaluation; and (3) prolonged GNSS-denied environments, including tunnels and, previously unseen in the context of automotive benchmarks, indoor parking garages. Here, our RLG-based system enables accurate evaluation in scenarios where commonly employed systems would drift excessively. Besides providing data for LIO, Odyssey also supports place recognition tasks through threefold trajectory repetition and integration of external mapping data via precise geodetic coordinates. All data, a dataloader, and supplementary material are available online at https://odyssey.uni-goettingen.de/.

2601.07052 2026-06-18 cs.RO 版本更新

RSLCPP -- Deterministic Simulations Using ROS 2

RSLCPP——使用ROS 2进行确定性仿真

Simon Sagmeister, Marcel Weinmann, Phillip Pitschi, Markus Lienkamp

发表机构 * Technical University of Munich, Germany(慕尼黑工业大学) School of Engineering & Design, Department of Mobility Systems Engineering, Institute of Automotive Technology(工程与设计学院,移动系统工程系,汽车技术研究所) School of Engineering & Design, Department of Engineering Physics and Computation, Institute of Automatic Control(工程与设计学院,工程物理与计算系,自动控制研究所)

AI总结 针对ROS异步多进程设计导致仿真结果不可复现的问题,提出RSLCPP库,通过确定性回调执行实现跨平台可复现仿真,无需修改现有节点代码。

Comments Accepted for publication at the 'IEEE Robotics and Automation Practice'

详情
AI中文摘要

仿真在现实机器人技术中至关重要,为开发各种机器人应用提供了安全、可扩展且高效的环境。虽然机器人操作系统(ROS)在学术界和工业界已被广泛采用作为这些机器人应用的基础,但其异步、多进程的设计使得复现变得复杂,尤其是在不同的硬件平台上。当计算时间和通信延迟变化时,无法保证确定性回调执行。这种缺乏复现性的问题给科学基准测试和持续集成带来了困难,因为在这些场景中一致的结果至关重要。为了解决这个问题,我们提出了一种使用ROS 2节点创建确定性仿真的方法。我们的ROS仿真库(RSLCPP)实现了这种方法,使得现有节点可以组合成一个产生可复现结果的仿真例程,通常无需更改任何源代码。我们证明,在测试合成基准测试和真实机器人系统时,我们的方法在各种CPU和架构上产生相同的结果。RSLCPP已开源,网址为: https://github.com/TUMFTM/rslcpp 。

英文摘要

Simulation is crucial in real-world robotics, offering safe, scalable, and efficient environments for developing a variety of robotic applications. While the Robot Operating System (ROS) has been widely adopted as the backbone of these robotic applications in both academia and industry, its asynchronous, multi-process design complicates reproducibility, especially across varying hardware platforms. Deterministic callback execution cannot be guaranteed when computation times and communication delays vary. This lack of reproducibility complicates scientific benchmarking and continuous integration, where consistent results are essential. To address this, we present a methodology to create deterministic simulations using ROS 2 nodes. Our ROS Simulation Library for C++ (RSLCPP) implements this approach, enabling existing nodes to be combined into a simulation routine that yields reproducible results, usually without requiring any source code changes. We demonstrate that our approach produces identical results across various CPUs and architectures when testing both a synthetic benchmark and a real-world robotics system. RSLCPP is open-sourced at https://github.com/TUMFTM/rslcpp.
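
The core idea of deterministic callback execution can be illustrated with a toy single-threaded scheduler. This is a minimal sketch written in Python for illustration only: it is not RSLCPP's API (RSLCPP is a C++ library for ROS 2), and the class `DeterministicExecutor` and its methods are hypothetical. It assumes callbacks are ordered by simulated time, with ties broken by insertion order, so the execution order never depends on wall-clock computation time or communication delay.

```python
import heapq


class DeterministicExecutor:
    """Runs callbacks in order of simulated time; ties break by insertion order."""

    def __init__(self):
        self._queue = []  # heap of (sim_time, sequence_number, callback)
        self._seq = 0     # unique counter, so callbacks are never compared
        self.now = 0.0

    def schedule(self, sim_time, callback):
        heapq.heappush(self._queue, (sim_time, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._queue:
            sim_time, _, callback = heapq.heappop(self._queue)
            self.now = sim_time  # advance the simulated clock, not wall time
            callback(self)


log = []
executor = DeterministicExecutor()
executor.schedule(0.2, lambda ex: log.append(("control", ex.now)))
executor.schedule(0.1, lambda ex: log.append(("sensor", ex.now)))
executor.schedule(0.1, lambda ex: log.append(("planner", ex.now)))
executor.run()
print(log)  # [('sensor', 0.1), ('planner', 0.1), ('control', 0.2)]
```

Because the order is a pure function of the scheduled times and insertion order, repeated runs give the same log on any machine, which is the property the abstract asks of a reproducible simulation.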

2606.17639 2026-06-18 cs.RO cs.CV 版本更新

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

ERQA-Plus:具身AI推理的诊断基准

Hong Yang, Basura Fernando

发表机构 * Centre for Frontier AI Research, Agency for Science, Technology and Research(新加坡科技研究局前沿人工智能研究中心) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 提出ERQA-Plus基准,包含1766个基于机器人中心图像的问答实例,覆盖感知、动作、社交、导航和常识推理,用于诊断具身AI的推理能力。

详情
AI中文摘要

通用具身智能体需要的不仅仅是物体识别:它们必须从情境视觉观察中推理空间关系、动作、程序、人类意图、环境约束和常识后果。然而,现有的视觉和具身问答基准通常对测试的推理依赖关系控制有限,使得难以将基于具身的推理与基于捷径的视觉或语言模式匹配区分开来。我们提出了ERQA-Plus,一个用于具身AI推理的诊断基准。ERQA-Plus包含1766个问答实例,这些实例基于711张以机器人为中心的图像,并根据一个结构化的分类法组织,涵盖感知、动作中心、社交交互、导航环境和上下文常识推理。该数据集使用多阶段生成和验证流程构建,结合了分类法引导的问题生成、自动质量判断、迭代修订和人工评估,以改进视觉基础、答案有效性和推理质量。我们对代表性的通用视觉语言模型和具身模型进行了基准测试,包括LLaVA-NeXT-8B、Prismatic-7B、MiniCPM-V-4.5-8B、Qwen3-VL、RoboRefer-8B和RoboBrain2.5-8B。尽管最强的模型Qwen3-VL-32B达到了83.4%的整体准确率和61.4的SBERT分数,但类别级别的结果揭示了空间推理、程序推理、事件预测和意图推理方面的持续弱点。因此,ERQA-Plus提供了一个细粒度的评估框架,不仅衡量具身智能体是否回答正确,还衡量它们能够可靠地执行哪些形式的具身推理。数据集可在 https://huggingface.co/datasets/huggingdas/erqa-plus 获取,项目页面在 https://github.com/LUNAProject22/erqa-plus 。

英文摘要

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page at https://github.com/LUNAProject22/erqa-plus.

11. 安全、鲁棒性与可信机器人 3 篇

2606.18632 2026-06-18 cs.RO 新提交

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

ROBOSHACKLES: 面向具身基础模型中人体伤害预防的安全数据集

Zhuowen Yin, Chongyang Liu, Wenzhang Yang, Renjue Li, Yinxing Xue

发表机构 * Institute of AI for Industries, Chinese Academy of Sciences(工业人工智能研究所,中国科学院) University of Science and Technology of China(中国科学技术大学)

AI总结 为解决机器人伤害人类数据难以安全收集的问题,提出基于真实观测的安全数据构建流水线,生成包含1万条视频的ROBOSHACKLES数据集,涵盖直接和间接伤害类别,评估发现现有模型在安全关键场景下100%产生不安全动作。

详情
AI中文摘要

具身基础模型(EFMs)整合了多模态理解、未来状态推理和可执行的机器人动作。然而,它们在预防人体伤害方面的安全对齐仍未得到充分探索,主要是因为机器人伤害人类或造成危险家庭情境的真实世界数据无法安全或合乎道德地收集。为应对这一挑战,我们提出了一种针对人体伤害预防的安全关键数据构建流水线。该流水线从真实的DROID观测出发,经过场景理解、危险感知图像编辑、时间提示生成和单次rollout合成等步骤。时间提示指定了预期的场景演变,而Wan2.7则从编辑后的危险状态中单次合成逼真的机器人rollout。利用该流水线,我们构建了ROBOSHACKLES,一个包含10,000条机器人视频片段的数据集,源自真实的DROID观测,涵盖两个直接伤害和四个间接伤害类别。为确保数据集质量,我们使用自动指标评估任务完成度和视觉质量,并在基于拒绝的安全准则下评估了六个代表性EFM。结果表明,所有评估模型在测试的安全关键场景中都产生了不安全动作,不安全动作生成率为100%。ROBOSHACKLES可作为拒绝学习和机器人动作执行前危险预测的可扩展基准和训练资源。该数据集公开于 https://huggingface.co/datasets/YZW00/RoboShackles 。

英文摘要

Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. Yet their safety alignment for human-injury prevention remains underexplored, primarily because real-world data of robots harming humans or creating hazardous household situations cannot be safely or ethically collected. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention in EFMs. Starting from real DROID observations, our construction pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robotic rollouts from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a 10,000-clip robotic video dataset derived from real DROID observations, spanning two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. Results show that all evaluated models produce unsafe actions in the tested safety-critical scenarios, yielding a 100% unsafe action generation rate. ROBOSHACKLES serves as a scalable benchmark and training resource for refusal learning and hazard anticipation before robot action execution. The dataset is publicly available at https://huggingface.co/datasets/YZW00/RoboShackles.

2606.18532 2026-06-18 cs.CR cs.AI cs.RO cs.SE 交叉投稿

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

AI沙箱:威胁模型、分类法与测量框架

Inderjeet Singh, Haitham Mahmoud, Andrés Murillo

发表机构 * Fujitsu Research of Europe(富士通欧洲研究)

AI总结 提出AI沙箱的威胁模型、分类法和测量框架,形式化沙箱边界与最弱链规则,定义网络物理威胁模型,并通过三个案例验证。

Comments 50 pages, 8 figures, 10 tables

详情
AI中文摘要

AI系统越来越多地在结合隔离、仿真、仪器化、监督和证据捕获的有界环境中进行评估。对于物理AI、AIoT和网络物理系统,这种转变不仅仅是术语问题:被测系统可能通过物理过程、网络设备和人类操作员进行感知、决策、执行、通信和故障。本文开发了一种面向保证的AI沙箱描述,将其作为数字AI、具身自主和网络物理部署中测试、评估、验证和确认的受控环境。我们形式化了沙箱边界和用于将每个维度的证据组合成有界部署声明的“最弱链”规则;分离了主要的沙箱原型;定义了一个包括对保证装置本身攻击的网络物理威胁模型;并引入了一个跨越保真度、可控性、可观测性、包含性、可重复性和治理工件的测量框架,在三个实际沙箱的工作案例研究中实例化。由此产生的威胁模型、分类法和测量框架阐明了沙箱可以有效测试什么、它可以包含哪些风险,以及它可以为安全、安保和监管保证支持哪些形式的证据。

英文摘要

AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance.
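
One plausible reading of a weakest-link rule is that the overall deployment claim is bounded by the least-supported measurement dimension. The sketch below assumes each dimension carries an evidence score in [0, 1] and that composition is a plain minimum; the abstract does not give the paper's formal definition, so both the scoring scale and the function `weakest_link` are hypothetical. The dimension names are the six listed in the abstract, and the scores are made-up illustration values.

```python
def weakest_link(evidence: dict[str, float]) -> tuple[str, float]:
    """Return the limiting dimension and the bound it places on the overall claim.

    Assumption: composition is a plain minimum over per-dimension scores.
    """
    if not evidence:
        raise ValueError("no evidence dimensions supplied")
    limiting = min(evidence, key=evidence.get)
    return limiting, evidence[limiting]


# Illustrative scores only; not results from the paper.
scores = {
    "fidelity": 0.9,
    "controllability": 0.8,
    "observability": 0.7,
    "containment": 0.95,
    "reproducibility": 0.6,
    "governance": 0.85,
}
dimension, bound = weakest_link(scores)
print(dimension, bound)  # reproducibility 0.6
```

Under this reading, improving any dimension other than the limiting one leaves the overall bound unchanged, which is why the rule points directly at where further evidence is needed.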

2606.18697 2026-06-18 cs.LG cs.CR cs.RO 交叉投稿

Stealthy World Model Manipulation via Data Poisoning

通过数据投毒进行隐蔽的世界模型操纵

Yibin Hu, Xiaolin Sun, Zizhan Zheng

发表机构 * Department of Computer Science(计算机科学系)

AI总结 提出SWAAP框架,通过两阶段数据投毒(双层级优化寻找有害目标模型+梯度匹配隐蔽实现)操纵学习到的世界模型,导致规划性能显著下降,且能规避多种防御检测。

Comments 41 pages, 8 figures, 11 tables. Submitted to NeurIPS 2026

详情
AI中文摘要

基于模型的学习智能体使用学习到的世界模型来预测未来状态、规划行动并适应新环境。然而,从收集的经验中更新世界模型的过程创造了一个训练时攻击面:对抗性投毒的微调轨迹可以操纵学习到的动力学,从而破坏下游规划。在本文中,我们提出了SWAAP,这是第一个针对学习到的世界模型的两阶段数据投毒框架。在第一阶段,SWAAP利用过渡梯度定理实现的一阶双层优化,识别出一个有害的目标世界模型,该模型在规划下诱导低回报行为,同时保持接近干净动力学。在第二阶段,SWAAP通过隐蔽约束的梯度匹配实现该目标,仅修改有限比例的微调过渡目标,使得诱导的训练梯度将受害者模型引向对抗目标,同时预测误差正则化器鼓励投毒目标保持接近世界模型的自然近似误差。为了评估攻击的隐蔽性,我们在投毒管道的三个阶段评估了防御和可检测性:投毒过渡的预训练检测、微调期间的鲁棒训练以及测试时对结果世界模型的监控。在多种连续控制任务中,SWAAP导致显著的性能下降,同时保持投毒过渡接近干净数据,并规避了评估的非自适应残差/CUSUM/TRIM风格防御。这些结果揭示了世界模型适应管道中的实际漏洞,并强调了需要保护世界模型训练数据和所学动力学的鲁棒性方法。

英文摘要

Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.
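
To make the stealth requirement concrete, here is a generic one-sided CUSUM detector over prediction residuals, the family of test-time monitors the abstract says the attack evades. This is a textbook sketch, not the specific defense configuration evaluated in the paper; the function name `cusum_alarm`, the `drift` and `threshold` values, and the residual sequences are all hypothetical.

```python
def cusum_alarm(residuals, drift, threshold):
    """Return the first index where the one-sided CUSUM statistic exceeds threshold.

    The statistic accumulates residuals above `drift` and resets at zero;
    None is returned if no alarm is raised.
    """
    statistic = 0.0
    for index, residual in enumerate(residuals):
        statistic = max(0.0, statistic + residual - drift)
        if statistic > threshold:
            return index
    return None


# A blatant shift in residuals trips the alarm a few steps after it starts.
blatant = [0.1] * 5 + [1.0] * 5
print(cusum_alarm(blatant, drift=0.5, threshold=1.0))  # 7

# Residuals kept below the drift allowance never accumulate.
stealthy = [0.4] * 10
print(cusum_alarm(stealthy, drift=0.5, threshold=1.0))  # None
```

The second case shows why keeping poisoned targets close to the model's natural approximation error matters: a residual that stays under the detector's drift allowance contributes nothing to the accumulated statistic.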

12. 其他/综合机器人 2 篇

2507.16859 2026-06-18 cs.RO cs.AI 版本更新

Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation

通过异构多源数据集成与跨域模态插补增强疲劳检测

Luobin Cui, Yanlai Wu, Tang Ying, Weikai Li

AI总结 针对实际部署环境中高质量传感器不可用的问题,提出异构多源疲劳检测框架,利用共享模态进行跨域模态插补,融合源域知识提升目标域疲劳检测性能。

Comments 4figures,14pages

详情
AI中文摘要

疲劳检测对于安全相关应用(如航空、采矿和长途运输)中的人类操作员至关重要。可靠的操作员疲劳估计可以支持人机系统中的及时警告、自适应任务调度、接管提醒和其他安全管理决策。然而,这些功能的有效性取决于疲劳相关信号是否能在部署环境中可靠捕获。虽然许多研究已显示高保真传感器在受控实验室环境中的价值,但在实际环境中,由于噪声、光照条件和视野限制,其性能往往会下降,从而限制了实际应用。本文形式化了一种面向实际部署的疲劳检测设置,其中高质量传感器在实际应用中通常不可用。为解决这一问题,我们利用来自异构源域的知识,包括难以在现场部署但常用于受控环境的高保真传感器,来辅助真实目标域中的疲劳检测。基于这一思想,我们设计了一个异构多源疲劳检测框架,该框架利用目标域中的可用模态,同时通过基于共享模态的跨域模态插补来利用源域中的多样化配置。

英文摘要

Fatigue detection for human operators is important in safety-related applications such as aviation, mining, and long-haul transport. Reliable estimation of operator fatigue can support timely warnings, adaptive task scheduling, takeover reminders, and other safety-management decisions in human-machine systems. However, the effectiveness of these functions depends on whether fatigue-related signals can be reliably captured in the deployment environment. While many studies have shown the value of high-fidelity sensors in controlled laboratory environments, their performance often degrades when used in real-world settings because of noise, lighting conditions, and field-of-view constraints, thereby limiting their practical use. This paper formalizes a deployment-oriented setting for real-world fatigue detection, where high-quality sensors are often unavailable in practical applications. To address this issue, we use knowledge from heterogeneous source domains, including high-fidelity sensors that are difficult to deploy in the field but commonly used in controlled environments, to assist fatigue detection in the real-world target domain. Based on this idea, we design a heterogeneous and multi-source fatigue-detection framework that uses the available modalities in the target domain while leveraging diverse configurations in the source domains through cross-domain modality imputation based on shared modalities.

2602.15513 2026-06-18 cs.RO cs.AI 版本更新

HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering

Ji Li, Bo Wang, Jing Xia, Mingyi Li, Shiyan Hu

发表机构 * The University of Hong Kong(香港大学) Beijing Institute of Technology(北京理工大学)

详情
Journal ref
IROS 2026
英文摘要

Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match×SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.