arXivDaily arXiv Daily Academic Digest, updated Monday through Friday

1. Robot Learning, Imitation and Reinforcement Learning (4 papers)

2606.18328 2026-06-18 cs.RO New submission

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

Bowen Li, Mayank Mishra, Y. Isabel Liu, Stone Tao, Nishanth Kumar, Alexander G. Gray, Ruwan Wickramarachchi, Jonathan Francis, Sebastian Scherer, Tom Silver

Institutions * CMU; Princeton; AI2; MIT; Centaur AI; Bosch Center for AI

AI summary Proposes ReSYNC, which alternates skill learning with concept discovery to build abstract predicates incrementally from failure-recovery experience, enabling global failure avoidance and long-horizon planning and outperforming strong baselines by over 50%.

Comments 9 pages, 6 figures. Website: https://jaraxxus-me.github.io/ReSYNC/

Abstract

Intelligent robots should not only recover from failures, but also acquire the abstract knowledge needed to avoid them in the future. While reinforcement learning (RL) can learn reactive recovery behaviors, training a separate policy for every distinct failure mode is highly inefficient. We introduce Recovery-Driven Synthesis of Relational Concepts (ReSYNC), the first approach that progressively discovers and refines state abstractions (relational predicates) from failure-recovery experience to support abstract planning. Unlike purely reactive methods, ReSYNC jointly learns skills and concepts through an incremental dual-learning process. In the skill-learning phase, the robot uses RL to learn to recover from failures seen in training tasks. In the concept-learning phase, the robot discovers new relational predicates and refines its abstract planning model to explain and generalize the learned recovery behaviors. This interaction enables ReSYNC to convert local recoveries seen during training into global failure avoidance at test time. Across four simulated domains, we show that ReSYNC's ability to continually expand and refine its abstraction library allows it to solve long-horizon, previously unseen problems, outperforming strong baselines by over 50%. Additionally, we demonstrate sim-to-real transfer of ReSYNC, where it performs real-world non-prehensile manipulation skills and generalizes to unseen scenarios through abstract planning. Overall, ReSYNC represents a significant step toward robots that autonomously acquire abstractions for scalable, failure-aware planning in the physical world.

2606.18589 2026-06-18 cs.RO New submission

DREAM-Chunk: Reactive Action Chunking with Latent World Model

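Wenxi Chen, Kaidi Zhang, Chi Lin, Zhiyuan Zhang, Yu She, Yuejiang Liu, Raymond A. Yeh, Shaoshuai Mou, Yan Gu

Institutions * Purdue University; Stanford University

AI summary Proposes DREAM-Chunk, which uses a lightweight latent world model to sample multiple candidate action chunks at test time and act from the one whose predicted state best matches the observation, improving the robustness of action-chunking policies under stochastic dynamics.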

Abstract

Action chunking has become a common interface for vision-language-action (VLA) models, enabling low-frequency policy inference to drive high-frequency robot execution. However, once an action chunk is committed, its open-loop execution can be brittle under stochastic dynamics, hardware execution errors, and partial observability. We propose DREAM-Chunk, a test-time scaling method that augments chunking-based policies with a lightweight latent world model, without requiring additional policy fine-tuning. At test time, DREAM-Chunk samples multiple candidate action chunks, rolls out their predicted latent futures, and selects actions from the chunk whose predicted state best matches the observed rollout. In this way, DREAM-Chunk uses additional test-time computation to cover multiple plausible stochastic futures and improve reactivity during long-horizon chunk execution. On the Kinetix benchmark, DREAM-Chunk improves robustness under increasing action noise and benefits from larger candidate sample sizes, especially when demonstrations contain corrective behaviors. We further validate DREAM-Chunk on four manipulation tasks across two robot platforms and two VLA policies under various sources of stochasticity. Across simulation and hardware experiments, DREAM-Chunk improves the robustness of action-chunking policies in stochastic dynamics.
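The selection step described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the linear latent dynamics (`A`, `B`), the Euclidean matching score, and the function names `rollout_latents` and `select_chunk` are all assumptions standing in for the learned latent world model and whatever matching criterion the paper uses.

```python
import numpy as np

def rollout_latents(z0, chunk, A, B):
    """Predict the latent state after each action in a chunk (toy linear model)."""
    latents, z = [], z0
    for a in chunk:
        z = A @ z + B @ a
        latents.append(z)
    return np.stack(latents)

def select_chunk(predicted, observed_latent, step):
    """Index of the candidate whose predicted latent at `step` is closest to the observation."""
    errors = np.linalg.norm(predicted[:, step] - observed_latent, axis=-1)
    return int(np.argmin(errors))

rng = np.random.default_rng(0)
latent_dim, action_dim, horizon, num_candidates = 4, 2, 8, 16
A = 0.9 * np.eye(latent_dim)                      # hypothetical latent dynamics
B = rng.normal(size=(latent_dim, action_dim))
z0 = rng.normal(size=latent_dim)

# Sample candidate chunks and roll out their predicted latent futures.
candidates = rng.normal(size=(num_candidates, horizon, action_dim))
predicted = np.stack([rollout_latents(z0, c, A, B) for c in candidates])

# Pretend the environment evolved like candidate 5 for three steps, plus small noise.
step = 2
observed = predicted[5, step] + 0.01 * rng.normal(size=latent_dim)
best = select_chunk(predicted, observed, step)
next_action = candidates[best, step + 1]          # keep executing from the matched chunk
```

In this sketch the extra test-time compute is the `num_candidates` rollouts; a larger candidate set covers more possible futures at a proportionally higher cost.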

2606.18772 2026-06-18 cs.RO New submission

HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations

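Zehui Zhao, Yuxuan Zhao, Gaojing Zhang, Chenxi Liu, Maolin Zheng, Wenzhao Lian

Institutions * Shanghai Jiao Tong University; University of Sussex; East China University of Science and Technology

AI summary Proposes the HALOMI framework, which extends the Universal Manipulation Interface (UMI) with egocentric sensing for active perception and combines a manifold-constrained controller with observation-action alignment; on a Unitree G1 humanoid it reaches an 85% average success rate on the three quantitatively evaluated tasks among five real-world tasks.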

Abstract

Human demonstrations, which can be collected at scale and naturally capture active hand-eye coordination, are a promising data source for learning humanoid loco-manipulation. However, directly transferring human demonstrations to humanoids requires a precise world-frame tracking controller, which is often brittle under out-of-distribution (OOD) targets, while human-to-humanoid gaps persist in both egocentric observation and action execution. To address these challenges, we present HALOMI, a scalable framework for learning humanoid loco-manipulation with active perception from human demonstrations. HALOMI extends Universal Manipulation Interface (UMI) with egocentric sensing to collect ego-view and wrist-view observations along with head-hand trajectories at scale. We further propose a manifold-constrained controller that plans in a learned latent behavior manifold to enable precise and robust head-hand tracking in the world frame. To bridge the human-to-humanoid gap, we perform ego-view alignment and introduce a controller-aware reference trajectory adaptation to reduce mismatch in both observation and action execution. We validate HALOMI on a Unitree G1 humanoid robot with an actuated neck across five real-world tasks involving navigation, grasping, bimanual manipulation, whole-body coordination, and dynamic behaviors. Across the three quantitatively evaluated tasks, HALOMI achieves an average success rate of 85%, while additional qualitative demonstrations show its ability to support dynamic tossing and deep-squat grasping.

2606.18953 2026-06-18 cs.RO New submission

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

Kinam Kim, Namiko Saito, Heecheol Kim, Katsushi Ikeuchi, Jaegul Choo, Yasuyuki Matsushita

Institutions * KAIST; Microsoft Research Asia - Tokyo; The University of Tokyo

AI summary Proposes an object-centric residual reinforcement learning framework whose policy is trained in simulation and transferred zero-shot to a real robot, raising the success rate of a VLA model from 42% to 76%.

Comments 8 pages, 7 figures, 2 tables; 8-page appendix

Abstract

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/
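A minimal sketch of the residual structure the abstract describes: a frozen base action plus a bounded correction computed from a (noisy, occasionally dropped) object pose. Everything concrete here is an assumption for illustration: the 7-element pose layout, the noise and dropout settings, the choice to repeat the last pose on dropout, and the random linear map standing in for a trained residual policy.

```python
import numpy as np

def corrupt_pose(pose, last_pose, rng, noise_std=0.01, dropout_prob=0.1):
    """Imitate imperfect pose tracking: Gaussian noise, or a dropped frame that repeats the last pose."""
    if rng.random() < dropout_prob:
        return last_pose.copy()
    # Noise is added elementwise for simplicity, including on the quaternion part.
    return pose + rng.normal(scale=noise_std, size=pose.shape)

def refine_action(base_action, pose_obs, weights, max_residual=0.05):
    """Add a bounded correction, computed from the object pose, to the frozen base action."""
    residual = max_residual * np.tanh(weights @ pose_obs)
    return base_action + residual

rng = np.random.default_rng(0)
true_pose = np.array([0.4, 0.1, 0.05, 0.0, 0.0, 0.0, 1.0])  # xyz + quaternion (hypothetical layout)
pose_obs = corrupt_pose(true_pose, last_pose=true_pose, rng=rng)
base_action = np.zeros(7)                                     # stand-in for the frozen VLA output
weights = rng.normal(scale=0.1, size=(7, 7))                  # stand-in for a trained residual policy
action = refine_action(base_action, pose_obs, weights)
```

Bounding the residual with `tanh` keeps the corrected action close to the base policy's output, which is one common way to keep a residual policy from overriding the base policy entirely.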

2. Motion Planning, Control and Dynamics (6 papers)

2606.18514 2026-06-18 cs.RO cs.LG New submission

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

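Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin

Institutions * Department of Computer Science and Engineering, University of California, Merced

AI summary Proposes the N(CO)$^2$ framework, which uses reinforcement learning to solve the Stochastic Orienteering Problem without hand-crafted heuristics, optimizing path selection under uncertainty with performance competitive with MILP.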

Journal ref
In Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), 2025

Abstract

Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.
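To make the chance constraint concrete: in a stochastic orienteering setting, a path is acceptable only if the probability of exceeding the travel budget stays below some threshold. The sketch below estimates that probability by Monte Carlo sampling. It is an illustration only: the exponential edge-cost distribution, the names `failure_probability` and `satisfies_chance_constraint`, and the small cost matrix are assumptions, and the learned policy from the paper is not reproduced.

```python
import numpy as np

def failure_probability(path, mean_cost, budget, rng, num_samples=10_000):
    """Monte Carlo estimate of P(total travel cost > budget) along a path."""
    edges = list(zip(path[:-1], path[1:]))
    means = np.array([mean_cost[u, v] for u, v in edges])
    # Assumption: each edge cost is exponential with the given mean.
    samples = rng.exponential(means, size=(num_samples, len(edges)))
    return float(np.mean(samples.sum(axis=1) > budget))

def satisfies_chance_constraint(path, mean_cost, budget, alpha, rng):
    """True if the estimated failure probability does not exceed alpha."""
    return failure_probability(path, mean_cost, budget, rng) <= alpha

rng = np.random.default_rng(0)
mean_cost = np.array([[0.0, 1.0, 2.0],
                      [1.0, 0.0, 1.5],
                      [2.0, 1.5, 0.0]])   # hypothetical mean travel costs
path = [0, 1, 2, 0]
p_fail = failure_probability(path, mean_cost, budget=8.0, rng=rng)
```

A check like this can serve as a feasibility test on candidate tours, whether they come from a learned policy or from a classical solver.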

2606.18625 2026-06-18 cs.RO New submission

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

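Xiaowen Hu, Linqi Ye, Yudi Zhu, Chenyue Shao, Rankun Li, Qingdu Li, Yan Peng

Institutions * Institute of Artificial Intelligence, Shanghai University; Institute of Machine Intelligence, University of Shanghai for Science and Technology

AI summary Proposes the SRL framework, which combines the physically grounded baseline of the SLIP model with the adaptability of reinforcement learning, using SLIP-based feedforward control signals plus RL-driven real-time feedback to optimize robotic jumping; it substantially cuts training time while keeping tracking accurate.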

Comments 17 pages, 12 figures

Abstract

Robotic jumping is pivotal in applications such as search and rescue and logistics, where crossing obstacles and enhancing mobility efficiency are critical. The Spring-Loaded Inverted Pendulum (SLIP) model leverages simplified spring-mass dynamics that naturally encode biologically plausible hopping motions, yet its performance degrades on irregular terrain due to idealized assumptions regarding contact and joint dynamics. Meanwhile, Reinforcement Learning (RL) can adapt to diverse and complex environments but often requires extensive data from unguided exploration. The complementary strengths of SLIP's physically grounded baseline and RL's adaptive capabilities motivate a hybrid framework that overcomes these individual limitations. We therefore propose Spring-loaded Reinforcement Learning (SRL), which integrates SLIP-based feedforward control signals with RL-driven real-time feedback, enabling continuous optimization of robotic jumping. Experimental results demonstrate that SRL can achieve more stable jumps with much less training time than the baseline method, maintaining an average position tracking error below 0.1 m and velocity tracking errors within +/-3% of the target values. Through bipedal and quadrupedal simulations of ground and stair jumping, as well as sim-to-sim and sim-to-real validations, SRL exhibits robust adaptability to various task requirements and environmental complexities, underscoring its potential for real-world deployment.

2606.18730 2026-06-18 cs.RO cs.AI math.CO math.OC New submission

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

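Allen George Philip, Anoop Bhat, Sivakumar Rathinam, Howie Choset

Institutions * Texas A&M University; Carnegie Mellon University

AI summary For the Moving-Target Traveling Salesman Problem with Moving Obstacles, proposes a mixed-integer conic programming formulation and a Two-Phase Bilevel Search algorithm, both of which significantly outperform the baseline.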

Abstract

The Moving-Target Traveling Salesman Problem (MT-TSP) seeks a minimum cost trajectory for an agent that departs from a static depot, visits a set of moving targets, each within one of their assigned time windows, and returns to the depot. In this article, we study the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), a generalization of the MT-TSP where the agent trajectory must avoid moving obstacles. We present a Mixed-Integer Conic Programming (MICP) formulation that can be solved using off-the-shelf solvers, as well as a fast and scalable Two-Phase Bilevel Search (TPBS) algorithm that computes high-quality feasible solutions for the problem. We evaluate our approaches against an existing baseline algorithm on a broad range of problem instances with up to 40 targets and 40 obstacles. The results demonstrate that both the proposed methods significantly outperform the baseline with respect to success rates, solution costs, and computation time.
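One elementary building block of moving-target routing is the earliest time at which an agent can meet a single target. The sketch below solves that sub-problem in closed form under assumptions the abstract does not state: the target moves with constant velocity, the agent has a maximum speed, and time windows and obstacles are ignored. It is not the MICP formulation or the TPBS algorithm, and `earliest_intercept_time` is a hypothetical helper name.

```python
import numpy as np

def earliest_intercept_time(agent_pos, agent_speed, target_pos, target_vel):
    """Smallest t >= 0 with ||target_pos + t*target_vel - agent_pos|| = agent_speed * t, or None."""
    d = target_pos - agent_pos
    # Squaring both sides gives a*t^2 + b*t + c = 0.
    a = target_vel @ target_vel - agent_speed ** 2
    b = 2.0 * (d @ target_vel)
    c = d @ d
    if abs(a) < 1e-12:                      # equal speeds: the equation is linear
        if abs(b) < 1e-12:
            return 0.0 if c < 1e-12 else None
        t = -c / b
        return t if t >= 0 else None
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None                          # the target can never be reached
    roots = [(-b - np.sqrt(disc)) / (2 * a), (-b + np.sqrt(disc)) / (2 * a)]
    valid = [t for t in roots if t >= 0]
    return min(valid) if valid else None

origin = np.array([0.0, 0.0])
start = np.array([10.0, 0.0])
t_closing = earliest_intercept_time(origin, 2.0, start, np.array([-1.0, 0.0]))
t_fleeing = earliest_intercept_time(origin, 2.0, start, np.array([1.0, 0.0]))
t_too_fast = earliest_intercept_time(origin, 2.0, start, np.array([3.0, 0.0]))
```

Here the approaching target is met after 10/3 time units, the slower fleeing target after 10, and the faster fleeing target is never reached, so the function returns `None`.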

2606.18828 2026-06-18 cs.RO cs.AI New submission

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

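Chenghao Xu

Institutions * National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University

AI summary Proposes placing intelligence in the space itself: a neural semigroup-superposition mechanism generates a Riemannian metric so that action reduces to following geodesics; trained on a single two-obstacle scene, the model generalizes zero-shot to unseen obstacle configurations.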

Abstract

Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups: frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.
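The idea that a metric field can make obstacle-penetrating paths expensive can be illustrated with a hand-written metric. The sketch below is only a toy: the Gaussian bump metric, its `radius` and `weight`, and the two polyline paths are assumptions, and nothing here reproduces the learned Encoder-Router network or its semigroup superposition.

```python
import numpy as np

def metric(x, center, radius=0.1, weight=1e4):
    """Toy conformal metric: near-identity far from the obstacle, heavily inflated near it."""
    bump = np.exp(-np.sum((x - center) ** 2) / (2 * radius ** 2))
    return (1.0 + weight * bump) * np.eye(len(x))

def path_cost(points, center):
    """Riemannian length of a polyline: sum of sqrt(dx^T G(midpoint) dx) over segments."""
    cost = 0.0
    for p, q in zip(points[:-1], points[1:]):
        dx = q - p
        G = metric(0.5 * (p + q), center)
        cost += np.sqrt(dx @ G @ dx)
    return cost

center = np.array([0.5, 0.0])                                  # obstacle location
t = np.linspace(0.0, 1.0, 201)
straight = np.stack([t, np.zeros_like(t)], axis=1)             # passes through the obstacle
detour = np.stack([t, np.sin(np.pi * t)], axis=1)              # arcs around it

straight_cost = path_cost(straight, center)
detour_cost = path_cost(detour, center)
```

With these settings the straight path is more than ten times as costly as the detour even though it is shorter in Euclidean terms, so a geodesic under this metric would bend around the obstacle. The toy shows the qualitative effect only, not the paper's reported separation.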

2606.18883 2026-06-18 cs.RO New submission

ZiMPedance: Impedance-Aware ZMP Modeling and Control for Payload Carrying with Quadruped Robots

Giovanni B. Dessy, Lorenzo Amatucci, Victor Barasuol, Claudio Semini

Institutions * Dynamic Legged Systems Lab, Istituto Italiano di Tecnologia (IIT)

AI summary Proposes an extended Zero Moment Point (ZMP) formulation that includes passive payload-interface dynamics and, combined with model predictive control, reduces stability violations by up to 10x and improves locomotion efficiency.

Abstract

Load transportation with quadruped robots is strongly affected by the dynamics of the physical interface between the robot and the load. Passive spring-based arms reduce weight and complexity compared to active manipulators, but their spring-damper dynamics can introduce oscillatory forces that degrade locomotion stability. This paper derives an extended Zero Moment Point (ZMP) formulation that includes passive payload-interface dynamics, relating stiffness, damping, and payload mass to the stability margin. The analysis shows that underdamped configurations can resonate with locomotion harmonics. Based on this insight, we augment a Single Rigid Body Dynamics model with passive subsystem dynamics and integrate it into a Model Predictive Control framework. In simulation, the proposed controller reduces stability violations by up to 10x, from 7.0% to 0.7%, and increases locomotion efficiency by lowering horizontal ground reaction force effort by up to 15% compared to a nominal baseline. Hardware experiments with a 2 kg payload show stable locomotion under pull-release disturbances where the nominal controller fails. The same model also enables end-effector tracking through passive arm dynamics without direct arm actuation.
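The remark that underdamped configurations can resonate with locomotion harmonics can be checked with the standard second-order formulas for a mass on a spring-damper. The sketch below uses those textbook relations only; the stiffness, damping, and gait frequency are hypothetical values (the 2 kg payload mass is the one figure taken from the abstract), and the paper's extended ZMP formulation is not reproduced.

```python
import math

def natural_frequency_hz(stiffness, mass):
    """Undamped natural frequency of a mass-spring system, in Hz."""
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)

def damping_ratio(stiffness, damping, mass):
    """Damping ratio zeta = c / (2 * sqrt(k * m)); zeta < 1 means underdamped."""
    return damping / (2.0 * math.sqrt(stiffness * mass))

def near_gait_harmonic(freq_hz, gait_hz, max_harmonic=4, tolerance=0.1):
    """True if freq_hz lies within a relative tolerance of any of the first gait harmonics."""
    return any(abs(freq_hz - n * gait_hz) <= tolerance * n * gait_hz
               for n in range(1, max_harmonic + 1))

payload_mass = 2.0      # kg, as in the hardware experiments
stiffness = 315.8       # N/m, hypothetical
damping = 5.0           # N*s/m, hypothetical
gait_hz = 2.0           # hypothetical stepping frequency

f_n = natural_frequency_hz(stiffness, payload_mass)
zeta = damping_ratio(stiffness, damping, payload_mass)
risky = zeta < 1.0 and near_gait_harmonic(f_n, gait_hz)
```

For these values the payload interface has a natural frequency of about 2 Hz and a damping ratio of about 0.1, so it is flagged as a lightly damped configuration sitting on the gait frequency.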

2606.19031 2026-06-18 cs.RO New submission

Congestion-Aware Robot Tour Planning in Crowded Environments

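Stefano Bernagozzi, Charlie Street, Masoumeh Mansouri, Lorenzo Natale

Institutions * Istituto Italiano di Tecnologia; Università di Genova; University of Birmingham

AI summary Proposes a probabilistic tour planner that learns a model for predicting human flow and builds a Markov decision process online to route the robot efficiently through crowded environments and reduce the impact of congestion.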

Comments Accepted to IEEE IROS 2026

Abstract

Autonomous mobile service robots are often required to complete tours that require navigating through a set of locations in an environment. Example domains include guiding people through a shopping mall, delivering packages in a fulfilment centre, or giving guided tours in a museum. However, in crowded environments, the presence of people may negatively impact robot performance. For example, humans will activate robot collision avoidance manoeuvres that slow the robot down. Crowds move stochastically and vary throughout the day. In this paper we present a probabilistic tour planner for crowded environments which explicitly reasons over human congestion. We learn circular linear flow field (CLiFF) maps which predict human trajectories given an initial observation. We then use these predictions to build and solve a Markov decision process online which efficiently routes the robot through the environment. Our approach is scalable enough to re-plan as new people are observed. We evaluate our approach on a real-world crowd dataset in a shopping mall.

3. Manipulation, Grasping and Dexterous Hands (7 papers)

2606.18628 2026-06-18 cs.RO New submission

Self-Supervised Mask-Aware Transformers for Fault-Tolerant FBG Force Sensing in Minimally Invasive Surgical Robotics

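Peibo Sun, Shiyuan Dong, Shucheng Ye, Jianrong Cai, Yushan Liu, Hongen Liao, Tianqi Huang, Fang Chen

Institutions * Shanghai Jiao Tong University; Shenzhen International Graduate School, Tsinghua University

AI summary To address degraded force estimation in minimally invasive surgical robots caused by FBG channel coupling and fiber fractures, proposes a unified self-supervised mask-aware Transformer that uses masked-channel reconstruction pretraining and fine-tuning with a dynamic corruption curriculum to degrade gracefully under multi-channel failures, reaching an RMSE of 0.0066 N on an 8-channel dataset.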

Abstract

In minimally invasive surgical robotics, catheter-scale Fiber Bragg Grating (FBG) sensors are promising due to their ability to estimate multi-dimensional forces by multiplexing several optical channels. However, deploying these compact multi-channel sensors introduces two critical engineering challenges: inherent nonlinear cross-axis coupling during complex deformations, and intermittent channel dropouts caused by fiber fractures in constrained workspaces. These compounding issues severely degrade force estimation. Existing fault-tolerant approaches rely on combinatorial model banks, which scale exponentially with the channel count and demand prohibitively expensive per-pattern calibration. In this paper, we propose a unified, self-supervised mask-aware Transformer that explicitly models channel availability to enable graceful degradation under diverse and dynamic sensor failures. The encoder is pretrained via masked-channel reconstruction on unlabeled data streams and fine-tuned for force regression using a balanced clean-and-corrupted-view objective alongside a dynamic corruption curriculum. Furthermore, a parallel uncertainty head, trained via heteroscedastic Gaussian negative log-likelihood, predicts per-axis confidence in a single forward pass, circumventing the overhead of multi-pass ensembles. Evaluated on a catheter-scale 8-channel FBG dataset, our single unified model achieves a nominal Root Mean Square Error (RMSE) of 0.0066 N and degrades gracefully to 0.0126 N under severe 4-channel failures. This significantly outperforms a comprehensive model bank of 255 per-pattern neural networks (0.0154 N at 4-channel loss) while eliminating pattern-specific calibration.
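Two quantities in the abstract lend themselves to a short worked sketch. First, the count of 255 per-pattern networks is consistent with one model per non-empty subset of 8 working channels (2^8 - 1); reading the model bank that way is an assumption here. Second, the heteroscedastic Gaussian negative log-likelihood is a standard loss, written below in a common form with the constant term dropped; the array shapes and function names are illustrative, not taken from the paper.

```python
import numpy as np

def num_channel_patterns(num_channels):
    """Number of non-empty subsets of working channels."""
    return 2 ** num_channels - 1

def gaussian_nll(target, mean, log_var):
    """Heteroscedastic Gaussian NLL per sample, summed over force axes (constant dropped)."""
    return 0.5 * np.sum(log_var + (target - mean) ** 2 / np.exp(log_var), axis=-1)

patterns = num_channel_patterns(8)

# One sample with three force axes: a fixed prediction error of 0.1 N on each axis.
target = np.array([[0.1, 0.1, 0.1]])
mean = np.zeros((1, 3))
overconfident = gaussian_nll(target, mean, log_var=np.full((1, 3), np.log(1e-4)))
calibrated = gaussian_nll(target, mean, log_var=np.full((1, 3), np.log(0.1 ** 2)))
```

The loss is lowest when the predicted variance matches the squared error, which is why a head trained this way can report per-axis confidence from a single forward pass.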

2606.19089 2026-06-18 cs.RO New submission

ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing

Alessandro Scherl, Bernhard Neuberger, Simon Schwaiger, David Mulero-Pérez, Lucas Muster, Jose Garcia-Rodriguez

Institutions * Department of Computer Technology, University of Alicante; Department of Industrial Engineering, UAS Technikum Vienna; Automation and Control Institute, TU Wien; Institute of Software Engineering and Artificial Intelligence, Graz University of Technology; Institute for Integrative Nature Conservation Research, University of Natural Resources and Life Sciences Vienna

AI summary Proposes ART-VS, a coarse-to-fine two-phase method that adapts feature granularity to improve the robustness and precision of visual servoing without task-specific training, markedly reducing positioning error and increasing speed.

Comments Accepted at IROS2026

Abstract

Visual servoing with self-supervised Vision Transformer (ViT) features enables training-free robotic positioning with strong generalization, but faces a fundamental trade-off between robustness and precision. Coarse patch-level descriptors provide stable correspondences yet limit positioning accuracy. Increasing image resolution improves precision but yields only marginal robustness gains: under perturbation, high-resolution processing improves convergence success rate from 76.6% to just 81.0% despite 12x more ViT patches. Therefore, we propose Adaptive Resolution Tiling Visual Servoing (ART-VS), a two-phase method that adapts feature granularity to servoing progress: a coarse phase at native ViT resolution for stable alignment, then a tiled high-resolution phase that restricts matching to local neighborhoods, improving positioning accuracy. Without any task-specific training, ART-VS achieves 95.4% convergence under perturbation, outperforming standard and full-resolution ViT-based servoing by 18.8 and 14.4 percentage points. Over the former it reduces positioning error by 53%, while running at over 10x higher speed and 27% lower VRAM than the latter. We validate ART-VS across three ViT backbones and demonstrate real-world category-level grasping of unseen object instances, achieving 95/100 on transparent bottles and 98/100 on shoes. Code is available at https://art-vs.github.io/.

2606.19091 2026-06-18 cs.RO New submission

GCNGrasp-VP: Affordance-Guided View Planning for Efficient Task-Oriented Grasping

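Zanjia Tong, Wenlong Dong, Chengjie Zhang, Hong Zhang

Institutions * Shenzhen Key Laboratory of Robotics and Computer Vision

AI summary Proposes the GCNGrasp-VP framework, which uses affordance field prediction to guide active view planning without scene reconstruction; a single view adjustment significantly improves task-oriented grasp success under occlusion.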

Comments Accepted to IROS 2026

Abstract

Task-oriented grasping performance degrades significantly when object views suffer from occlusions. Existing task-oriented grasping methods typically assume task-relevant regions are visible in the initial frame, while view planning approaches enable active perception but often ignore task semantics and rely on time-consuming scene reconstruction. To address these limitations, we present GCNGrasp-VP, an efficient framework integrating affordance field prediction with active view planning. Central to this framework is GCNGrasp-v2, a task-oriented grasp model that simultaneously supports grasp evaluation and affordance field prediction, achieving constant-time inference complexity. Leveraging this capability, our Affordance-guided View Planner (Affordance-VP) utilizes the affordance field as an information gain metric to guide camera observation of task-relevant regions without requiring scene reconstruction. View planning results show that our method significantly outperforms scene-uncertainty-driven baselines with only one view adjustment. Real-world validation further confirms substantial improvements in grasp success rates for single-object scenarios while maintaining millisecond-level computational latency. Code and models are available at https://github.com/Instinct323/GCNGrasp-VP.

2606.19194 2026-06-18 cs.RO New submission

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

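Yu Zhang, Kangyi Ji, Yongxiang Zou, Rongtao Xu, Feng Zheng, Long Cheng

AI summary Proposes an invertible neural network adapter that generates high-dimensional actions in a one-step denoising process, reducing inference complexity while preserving accuracy and improving efficiency in simulation and real-world experiments.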

Abstract

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.
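To see why the number of denoising steps drives inference cost, consider sampling from a flow-matching model by Euler integration of its velocity field: each step costs one network evaluation. The sketch below uses a hand-written straight-line velocity field in place of a learned network, so both the field and the `target` action are assumptions. It shows only what "one step" means when the path is straight; the invertible adapter that the abstract credits with making this work for real policies is not reproduced.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler; returns (x1, calls)."""
    x, dt, calls = x0.copy(), 1.0 / num_steps, 0
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
        calls += 1                       # one "network" evaluation per step
    return x, calls

target = np.array([0.2, -0.1, 0.5])      # hypothetical action to be generated

def straight_velocity(x, t):
    """Velocity of a straight-line path that reaches `target` at t=1."""
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.normal(size=3)
action_10, calls_10 = euler_sample(straight_velocity, noise, num_steps=10)
action_1, calls_1 = euler_sample(straight_velocity, noise, num_steps=1)
```

On a perfectly straight path the single-step and ten-step samplers land on the same action, but the former uses one evaluation instead of ten. Learned velocity fields are generally curved, which is why naive one-step sampling tends to lose accuracy without some additional mechanism.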

2606.19233 2026-06-18 cs.RO New submission

Mobile Pedipulation for Object Sliding via Hierarchical Control on a Wheeled Bipedal Robot

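Yue Qin, Yulun Zhuang, Zelin Shen, Yanran Ding

Institutions * University of Michigan

AI summary Proposes a hierarchical control framework that lets a wheeled bipedal robot slide planar objects with its legs, using a reduced-order three-rigid-body dynamics model and a trajectory-optimization-based motion planner; in experiments the robot retrieves a 1 kg object and slides a 4 kg object.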

Comments 8 pages, 7 figures

Abstract

In this letter, we present a hierarchical control framework that enables wheeled bipedal robots to perform planar object sliding tasks with their wheeled legs. The proposed approach formulates a nonlinear model predictive controller (NMPC) based on a reduced-order three rigid bodies (TRB) dynamical model that explicitly accounts for the hip roll degree of freedom and multiple wheel-environment contact modes, which is essential for lateral stepping and pedipulation tasks. Within this framework, the NMPC simultaneously regulates robot locomotion and interaction forces, allowing the robot to stably execute both rolling and object manipulation behaviors. A trajectory-optimization-based robot-object motion planner is developed to generate reference motions that incorporate stick-slip transitions in ground-object contact. Two representative pedipulation motions, namely scooting and lateral sliding, are validated through real-world hardware experiments, in which the robot successfully retrieves a 1 kg object from under a desk and slides a 4 kg object over a distance of 0.228 m via scooting.

2606.19314 2026-06-18 cs.RO 新提交

Modeling Branches for Active Manipulation using Iterative Parameter Estimation

基于迭代参数估计的主动操作分支建模

Madhav Rijal, Rashik Shrestha, Trevor Smith, Yu Gu

发表机构 * Department of Mechanical and Aerospace Engineering, West Virginia University(西弗吉尼亚大学机械与航空航天工程系)

AI总结 提出一种通过迭代估计材料参数来建模植物分支的方法,利用有限元模拟和变形感知运动规划器,实现精确分支操作,平均变形能量降低35.69%。

Comments Accepted to IROS 2026

详情
AI中文摘要

本研究提出了一种通过迭代估计材料参数来建模多样化植物分支的方法,以支持精细的分支操作。在农业机器人中,分支操作对于植物重新定位、稳定以及清除密集叶片中的视觉障碍是必要的。该方法从点云数据构建四面体分支模型,并使用有限元方法模拟其行为。利用真实观测的变形数据,迭代估计分支参数,然后通过变形感知运动规划器计算最优路径,以在另一个机器人的视野内移动和稳定分支。在30次对具有不同几何形状和材料特性的分支进行的试验中,该方法平均降低了35.69%的变形能量,同时路径长度平均增加了8.10%。

英文摘要

This study presents a method for modeling diverse plant branches by iteratively estimating material parameters to support delicate branch manipulation. Branch manipulation is necessary in agricultural robotics for plant repositioning, stabilizing, and clearing visual obstructions in dense foliage. The proposed method builds a tetrahedral branch model from point-cloud data and simulates its behavior using the finite element method. Using real observed deformation data, it iteratively estimates branch parameters and then computes an optimal path with a deformation-aware motion planner to move and stabilize branches within another robot's field of view. Across 30 trials on branches with varying geometries and material properties, the proposed method reduced the deformation energy by 35.69% while increasing the path length by 8.10% on average.

2606.19333 2026-06-18 cs.RO cs.CV 新提交

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Do as I Do: 从日常人类视频中获取灵巧操作数据

Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

发表机构 * UC Berkeley(加州大学伯克利分校)

AI总结 提出DO AS I DO算法,从单目RGB人类视频中重建手-物交互并重定向到多指灵巧机器人手,生成可执行的操作数据,优于现有方法。

Comments Project website: https://do-as-i-do.com/

详情
AI中文摘要

我们如何可扩展地生成机器人操作数据,特别是在像多指灵巧手这样的人形平台上?从人类视频中学习最近成为这个问题的可能答案。然而,估计手-物交互和跨越人-机器人具身差距的困难阻碍了将丰富的单目RGB人类视频作为机器人操作数据的主要来源。在这项工作中,我们提出了DO AS I DO,一种将单目RGB人类视频重建并重定向到多指灵巧机器人手的算法。DO AS I DO从各种自我中心和外部中心的野外视频源中重建手-物交互。然后,该算法将这些手-物交互估计重定向为一系列可在现实世界中执行的动作,从不同的人类视频中生成机器人完整的操作数据。总体而言,DO AS I DO在从RGB视频中估计手-物交互和提取灵巧操作轨迹方面优于先前的最先进技术,正如我们在具有真实标签的数据集和在线收集的视频片段数据集上的实验所示。我们的实验使我们能够为从业者收集人类操作数据提出一个有效性指南。

英文摘要

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

4. 导航、定位与SLAM 6 篇

2606.18426 2026-06-18 cs.RO 新提交

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

VEGA: 从野外自我中心视频中通过几何轨迹监督学习导航VLA

Gershom Seneviratne, Yohan Abeysinghe, Jianyu An, Vaibhav Shende, Dinesh Manocha

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 提出VEGA方法,利用未标注的自我中心视频通过重建场景几何生成障碍感知轨迹,训练流匹配VLA导航策略,在VEGA-Bench上碰撞减少33.0%,真实世界成功率提升至少150.0%。

详情
AI中文摘要

我们提出了VEGA,一种从未标注的自我中心导航视频中训练导航视觉-语言-动作(VLA)模型的方法。互联网规模的自我中心视频提供了可扩展的导航相关视觉观察来源,捕捉了杂乱场景、近距离障碍物以及通过真实世界空间的自然人体运动。然而,这些视频不能直接用于策略学习,因为它们没有提供在机器人坐标系中基于显式导航目标的障碍感知轨迹。VEGA通过从单目视频重建局部场景几何、采样导航目标(表示为文本、图像或空间路径点)并利用构建的几何生成障碍感知轨迹来解决这一差距。生成的轨迹分布随后用于训练流匹配VLA导航策略。通过仅在训练期间使用几何,VEGA将障碍感知规划直接蒸馏到基于视觉的策略中。此外,我们引入了VEGA-Bench,一个包含25万场景和约500万个导航目标(与场景几何配对)的基准,旨在评估VLA的目标进展、碰撞避免和障碍物间隙。我们的评估表明,VEGA在VEGA-Bench上实现了有竞争力的目标进展,同时相比最强基线碰撞减少33.0%,障碍物间隙提高17.9%,在真实世界试验中成功率至少提高150.0%,碰撞至少减少66.7%,障碍物间隙至少提高60.0%。最终,我们证明了视频衍生的几何监督为训练障碍感知导航VLA提供了可扩展且有效的信号。代码和基准将在发表时发布。

英文摘要

We introduce VEGA, an approach for training navigation Vision-Language-Action (VLA) models from unlabeled egocentric navigation videos. Internet-scale egocentric videos provide a scalable source of navigation-relevant visual observations, capturing cluttered scenes, close-range obstacles, and natural human motion through real-world spaces. However, these videos are not directly usable for policy learning because they do not provide obstacle-aware trajectories conditioned on explicit navigation goals in the robot's coordinate frame. VEGA addresses this gap by reconstructing local scene geometry from monocular video, sampling navigation goals (represented as text, image, or spatial waypoints), and generating obstacle-aware trajectories using the constructed geometry. The resulting trajectory distribution is then used to train a flow-matching VLA navigation policy. By using geometry exclusively during training, VEGA distills obstacle-aware planning directly into a vision-based policy. Furthermore, we introduce VEGA-Bench, a benchmark containing 250k scenes and approximately 5 million navigation goals paired with scene geometry, designed to evaluate goal progress, collision avoidance, and obstacle clearance of VLAs. Our evaluation shows that VEGA achieves competitive goal progress while reducing collisions by 33.0% and improving obstacle clearance by 17.9% over the strongest baseline on VEGA-Bench, while improving success by at least 150.0%, reducing collisions by at least 66.7%, and improving obstacle clearance by at least 60.0% in real-world trials. Ultimately, we demonstrate that video-derived geometric supervision provides a scalable and effective signal for training obstacle-aware navigation VLAs. The code and benchmark will be released at the time of publication.

2606.18634 2026-06-18 cs.RO cs.AI 新提交

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

EffiNav: 融合深度与视觉语言实现高效物体目标导航

Zecheng Yin, Benedict Jun Ma

发表机构 * Systems Hub of Intelligence Transportation HKUST(GZ)(香港科技大学(广州)智能交通系统中心)

AI总结 提出EffiNav框架,融合深度信息与视觉语言模型,通过预测探索边界和语义先验指导导航,在HM3D和OVON数据集上匹配或超越基线,提升路径效率与泛化性。

详情
AI中文摘要

在未知环境中定位目标物体是自主智能体的基本能力,应用范围从搜索救援到野外机器人。该任务的简化版本是物体目标导航(ObjNav)。在ObjNav中,成功到达目标物体提供了基本的性能度量;然而,导航轨迹的效率同样重要,因为它指示了智能体探索的智能程度以及后续任务剩余的时间。在未知环境中,高效导航的关键在于决定下一步探索的位置。尽管许多先前工作旨在解决这一核心挑战并在某些场景中取得了有希望的性能,但最近的基于训练的模型和非训练框架分别仍存在泛化性和效率问题,在最坏情况下可能导致对已访问区域的过度探索或冗余的来回运动。我们在两个广泛使用的仿真基准Habitat Matterport 3D(HM3D)和开放词汇物体目标导航(OVON)上评估EffiNav,并在真实世界的物理机器人上进一步验证其有效性。我们对大量仿真回合进行了失败分析。通过最小修改,我们还将EffiNav扩展到GOAT-BENCH数据集上的记忆增强ObjNav任务,展示了其在标准ObjNav设置之外的适应性。在两个标准指标——成功率(SR)和路径长度加权成功率(SPL)上,EffiNav匹配或超越了最近的基线,反映了其效率、鲁棒性和实际适用性。认识到两个数据集的不同侧重点,性能表明该框架在高效ObjNav中更加平衡和可泛化。

英文摘要

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works have aimed to address this core challenge and have achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues, respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. On two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, these results indicate that the framework is more balanced and generalizable for efficient ObjNav.
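For readers unfamiliar with the two metrics named above, the sketch below computes Success Rate and SPL using the commonly cited definition, in which each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The `Episode` container and the sample numbers are hypothetical and are not taken from the paper's evaluation.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool         # did the agent stop at the target object?
    shortest_path: float  # shortest-path length from start to goal (m)
    agent_path: float     # length of the path the agent actually travelled (m)


def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Hypothetical episodes: one optimal success, one detour success, one failure.
episodes = [
    Episode(True, 5.0, 5.0),
    Episode(True, 4.0, 8.0),
    Episode(False, 6.0, 12.0),
]
```

Because a failed episode contributes zero and a detour is penalised proportionally, SPL can never exceed SR, which is why the two are usually reported together.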

2606.18951 2026-06-18 cs.RO 新提交

A High-accuracy Event-based Underwater SLAM System

高精度事件相机水下SLAM系统

Yifan Peng, Qihang Liu, Haoying Li, Yuzhe Li, Junfeng Wu, Ziyang Hong

AI总结 针对事件相机水下SLAM中时间曲面成像质量差和匹配失败问题,提出基于结构感知度量和贝叶斯优化的高精度立体SLAM系统,并贡献首个高质量水下事件数据集UWE。

详情
AI中文摘要

虽然事件相机为水下SLAM提供了巨大潜力,但现有的基于时间曲面(TS)的方法在水下部署时被证明非常不可靠。波动的相机速度严重降低了TS成像质量,而宽立体基线和重复的水下纹理导致关键匹配失败,频繁引发系统崩溃。为克服这些挑战,我们开发了首个高精度事件相机水下立体SLAM系统。基于结构张量相干性和梯度,设计了一种结构感知度量来定量评估TS结构信息密度。通过将最优TS生成解耦为基于系统初始化的两个不同阶段,贝叶斯优化(BO)在初始化前首先预测最优先验TS,同时我们设置异步在线局部搜索方法,在跟踪阶段实时获取合适的TS。我们使用先验视差保证精确的数据关联,并采用“最新观测优先”三角测量机制实现稳定三角测量。作为这些解决方案的基准和社区资源,我们还贡献了UWE,这是首个高质量真实世界水下事件数据集,包含变化的相机运动、复杂纹理和不同轨迹特征。在公共数据集和UWE上的广泛评估表明,所提出的SLAM系统与最先进的事件相机方法相比具有竞争力的精度性能。代码和数据将开源。

英文摘要

While event cameras offer immense potential for underwater SLAM, existing Time Surface (TS)-based methods prove highly unreliable when deployed underwater. Fluctuating camera velocities severely degrade TS imaging quality, while wide stereo baselines and repetitive underwater textures induce critical matching failures, frequently triggering system failure. To overcome these challenges, we develop the first high-accuracy event-based underwater stereo SLAM system. A structure-aware metric for TS is designed based on structure tensor coherence and gradients to quantitatively evaluate TS structural information density. By decoupling optimal TS generation into two distinct stages based on system initialization, Bayesian Optimization (BO) first predicts an optimal prior TS sequentially before initialization, while an asynchronous online local search periodically obtains an appropriate TS in real time during the tracking stage. We use the prior disparity to guarantee precise data association and a "latest-observation-first" triangulation mechanism to realize stable triangulation. As a benchmark for these solutions and a resource for the community, we also contribute UWE, the first high-quality real-world underwater event dataset containing variable camera motions, complex textures and different trajectory features. Extensive evaluations on public datasets and UWE show the competitive accuracy of the proposed SLAM system compared to the state-of-the-art event-based method. The code and data will be open-sourced.
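The abstract does not spell out its structure-aware metric, so the following is only a sketch of the standard structure tensor coherence that such a metric could build on. It assumes image gradients are already available as `(gx, gy)` pairs for one patch; the function name is hypothetical.

```python
def structure_tensor_coherence(gradients):
    """Coherence of the 2x2 structure tensor built from (gx, gy) gradient pairs.

    Returns ((l1 - l2) / (l1 + l2)) ** 2 for eigenvalues l1 >= l2, computed in
    closed form: 1.0 means one dominant edge direction, 0.0 means no preferred
    direction (or a flat patch).
    """
    jxx = sum(gx * gx for gx, _ in gradients)
    jyy = sum(gy * gy for _, gy in gradients)
    jxy = sum(gx * gy for gx, gy in gradients)
    trace = jxx + jyy
    if trace == 0.0:
        return 0.0  # flat patch: no gradient energy at all
    # (l1 - l2)^2 = (jxx - jyy)^2 + 4 * jxy^2 and (l1 + l2)^2 = trace^2
    return ((jxx - jyy) ** 2 + 4.0 * jxy ** 2) / trace ** 2


# A patch whose gradients all point along x has a single dominant direction.
edge_like = structure_tensor_coherence([(1.0, 0.0), (2.0, 0.0), (-1.0, 0.0)])
```

A time surface with crisp edges would score close to 1 on patches like this, while a smeared or noisy one would drift toward 0.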

2606.19122 2026-06-18 cs.RO 新提交

Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

基于混合2D-3D学习的人行道机器人单目3D占用感知

Yukai Ma, Joe Lin, Liu Liu, Honglin He, Lulu Ricketts, Brad Squicciarini, Yong Liu, Bolei Zhou

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Zhejiang University(浙江大学) Coco Robotics Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出WalkOCC框架,通过混合射线行进单目3D占用感知,结合LiDAR-RGB配对数据与大规模无配对单目图像学习,提升人行道机器人导航的预测精度和泛化能力。

详情
AI中文摘要

现实世界中的人行道拥挤、杂乱且结构化程度低于道路,使得3D占用预测成为配送机器人和电动轮椅等移动机器人安全导航的关键。现有的占用学习流程主要针对道路自动驾驶设计,通常在大规模配对的LiDAR-RGB数据集上训练,需要密集的3D监督和多个摄像头输入,这些数据收集成本高且未能充分捕捉人行道特定特征。我们提出WalkOCC,一种用于人行道机器人的混合射线行进单目3D占用感知框架。WalkOCC显式地将来自LiDAR-RGB配对数据的几何基础与来自大规模无配对单目图像的可扩展学习相结合。它从配对序列中引导出伪占用监督,并在额外的仅2D数据上联合学习图像级表示。它在不需要昂贵的3D占用标注的情况下实现了稳定的优化和改进的泛化能力。大量实验表明,与基于自监督图像的基线相比,在预测精度、对路缘和排水沟等细微城市结构的细粒度分割以及对环境和跨本体变化的鲁棒性方面,WalkOCC均取得了一致的提升。为了便于评估和基准测试,我们还引入了Sidewalk3D,这是一个大规模的人行道感知数据集,包含在多个地点和时间段收集的LiDAR-相机配对序列,以及用于评估的3D语义占用标注。代码和数据将公开提供。

英文摘要

Sidewalks in the real world are crowded, cluttered, and less structured than roads, making 3D occupancy prediction a key ingredient for the safe navigation of mobile robots such as delivery bots and electric wheelchairs. Existing occupancy learning pipelines are largely designed for on-road autonomous driving and often train on large-scale paired LiDAR-RGB datasets with dense 3D supervision and multiple camera inputs, which are costly to collect and do not adequately capture sidewalk-specific characteristics. We propose WalkOCC, a hybrid Ray-marching monocular 3D occupancy perception framework for robots operating on sidewalks. WalkOCC explicitly couples geometric grounding from LiDAR-RGB paired data with scalable learning from large-scale unpaired monocular images. It bootstraps pseudo occupancy supervision from paired sequences and jointly learns image-level representations on additional 2D-only data. It yields stable optimization and improved generalization without requiring costly 3D occupancy annotations. Extensive experiments demonstrate consistent gains in prediction accuracy, fine-grained segmentation of subtle urban structures such as curbs and gutters, and robustness to environmental and cross-embodiment shifts compared with self-supervised image-based baselines. To facilitate evaluation and benchmarking, we also introduce Sidewalk3D, a large-scale sidewalk perception dataset with LiDAR-camera paired sequences collected across multiple locations and time periods, along with 3D semantic occupancy annotations for evaluation. Code and data will be made available.

2606.19190 2026-06-18 cs.RO 新提交

FAST-LIVGO: A Degeneracy-Robust LiDAR-Inertial-Visual-GNSS Fusion Odometry

FAST-LIVGO:一种退化鲁棒的LiDAR-惯性-视觉-GNSS融合里程计

Zhiyu Chen, Chunran Zheng, Jiayu Wen, XiaoLei Zhang, Jiaming Xu, Feng Pan, Yukang Cui

发表机构 * College of Mechatronics and Control Engineering, Shenzhen University(深圳大学机电与控制工程学院) Department of Mechanical Engineering, The University of Hong Kong(香港大学机械工程系) College of Automation, Harbin Engineering University(哈尔滨工程大学自动化学院)

AI总结 提出一种基于误差状态迭代卡尔曼滤波的紧耦合LiDAR-惯性-视觉-GNSS融合框架,通过动态时间规整的时空对齐模块、多普勒和时差载波相位观测模型以及退化感知的双模式异常值拒绝策略,在长期大尺度动态环境中实现高精度鲁棒的状态估计。

Comments Accepted for presentation at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

详情
AI中文摘要

在长期、大规模和高度动态环境中的鲁棒状态估计与建图仍然是机器人领域的关键挑战。现有的LiDAR-惯性-视觉里程计(LIVO)系统在局部精度上表现良好,但在长距离下会累积漂移,并在几何退化或无纹理场景中可能失效。同时,GNSS辅助融合框架通常依赖LiDAR或视觉里程计进行状态预测和异常值拒绝,使其在里程计退化时变得脆弱。为解决这些局限,我们提出一种基于误差状态迭代卡尔曼滤波的紧耦合LiDAR-惯性-视觉-GNSS融合框架。引入基于动态时间规整的在线时空对齐模块以应对高度动态条件。为更好利用GNSS精度,我们开发了基于多普勒频移和固定锚点时间差载波相位的观测模型,在不增加历史锚点状态的情况下提供毫米级相对约束。我们进一步设计了一种退化感知的双模式异常值拒绝策略,根据LIVO退化程度在LIVO先验引导拒绝和GNSS辅助恢复之间切换。在公开M3DGR数据集和自建20 m/s固定翼无人机数据集上的实验表明,我们的系统减少了累积漂移和地图重影,在精度和鲁棒性上优于现有方法。

英文摘要

Robust state estimation and mapping in long-term, large-scale, and highly dynamic environments remains a key challenge in robotics. Existing LiDAR-Inertial-Visual Odometry (LIVO) systems achieve strong local accuracy but suffer from accumulated drift over long distances and may fail in geometrically degraded or textureless scenes. Meanwhile, GNSS-aided fusion frameworks often rely on LiDAR or visual odometry for state prediction and outlier rejection, making them vulnerable when odometry degenerates. To address these limitations, we propose a tightly coupled LiDAR-Inertial-Visual-GNSS fusion framework based on an Error-State Iterated Kalman Filter. An online spatiotemporal alignment module using Dynamic Time Warping is introduced for highly dynamic conditions. To better exploit GNSS precision, we develop observation models based on Doppler shifts and fixed-anchor Time-Differenced Carrier Phase, providing millimeter-level relative constraints without augmenting historical anchor states. We further design a degeneracy-aware dual-mode outlier rejection strategy that switches between LIVO-prior-guided rejection and GNSS-aided recovery according to the LIVO degeneracy level. Experiments on the public M3DGR dataset and a custom 20 m/s fixed-wing UAV dataset demonstrate that our system reduces accumulated drift and map ghosting, outperforming state-of-the-art methods in accuracy and robustness.
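Dynamic Time Warping, which the abstract names for its alignment module, can be shown in its textbook form. The sketch below aligns two 1-D sequences (for example, speed profiles from two sensors); it is not the paper's spatiotemporal alignment module, and the function name and sample data are hypothetical.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping cost between two sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# The second profile is the first one delayed by one sample.
reference = [0.0, 1.0, 2.0, 3.0]
delayed = [0.0, 0.0, 1.0, 2.0, 3.0]
alignment_cost = dtw_distance(reference, delayed)
```

Because the warping path may repeat samples, a pure time delay costs nothing here, which is the property that makes DTW attractive for sensors with unknown or varying offsets.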

2606.19307 2026-06-18 cs.RO 新提交

Observability and Consistency Analysis for Visual-Inertial Navigation with Anchored Feature Parameterizations

基于锚定特征参数化的视觉惯性导航的可观性与一致性分析

Mitchell Cohen, Vassili Korotkine, James Richard Forbes

发表机构 * McGill University(麦吉尔大学)

AI总结 分析基于滤波的视觉惯性导航系统(VINS)使用锚定特征表示时的可观性与一致性,证明其不可观子空间独立于估计的地标状态,从而改善一致性,但仍依赖导航状态,需额外一致性增强技术。

Comments Accepted to IEEE/RSJ IROS. 8 pages, 3 figures, 4 tables

详情
AI中文摘要

本文分析了使用锚定特征表示的基于滤波的视觉惯性导航系统(VINS)的可观性和一致性特性。结果表明,采用锚定地标参数化的VINS的不可观子空间独立于估计的地标状态,从而无需任何额外修改即可改善估计器的一致性。然而,不可观子空间仍然依赖于估计的导航状态,因此需要额外的一致性增强技术。本文提出了两种方法来改善采用锚定特征表示的VINS的一致性。仿真结果表明,与使用全局参考系解析特征的算法相比,所有采用锚定特征参数化的估计器都表现出更好的一致性,特别是在特征初始化可能较差的情况下。在TUM-VI数据集上的真实世界实验表明,仅使用锚定特征表示即可获得与采用全局特征表示的一致性改进估计器相当的性能,证明了在VINS中使用锚定特征参数化的优势。

英文摘要

This paper presents an analysis of the observability and consistency properties of filtering-based visual-inertial navigation systems (VINS) that utilize anchored feature representations. The unobservable subspace of VINS with anchored landmark parameterizations is shown to be independent of the estimated landmark state, which leads to improved estimator consistency properties without any additional modifications. However, the unobservable subspace is still found to depend on the estimated navigation state, necessitating additional consistency-enforcing techniques. Two methods to improve the consistency of VINS with anchored feature representations are presented. Simulation results showcase that all estimators employing anchored feature parameterizations exhibit improved consistency properties compared to algorithms that estimate features resolved in a global reference frame, especially in scenarios where feature initialization may be poor. Real-world experiments on the TUM-VI dataset showcase that the use of anchored feature representations alone can yield comparable performance to consistency-improved estimators employing a global feature representation, demonstrating the benefit of using anchored feature parameterizations for VINS.

5. 人机交互与协作机器人 4 篇

2606.18519 2026-06-18 cs.RO cs.AI 新提交

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

如您所愿:利用LLM在精准农业中进行形式化验证的任务规划

Marcos Abel Zuzuárregui, Stefano Carpin

发表机构 * University of California, Merced(加州大学默塞德分校)

AI总结 针对自然语言歧义性,提出基于线性时序逻辑(LTL)反馈循环的LLM任务规划系统,通过双LLM分工实现规范生成与验证,提升精准农业任务规划的可靠性。

详情
Journal ref
Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)
AI中文摘要

尽管机器人系统现已商业化并部署于各行各业,但许多系统高度专业化,通常需要高级技能才能操作并确保其按指令执行。为缓解这一问题,我们近期引入了一个任务规划器,利用大语言模型(LLM)根据自然语言描述的任务描述合成精准农业中的任务计划。虽然该系统表现出色,但也存在自然语言固有的歧义性。本文通过引入多个基于线性时序逻辑(LTL)的反馈循环来扩展我们的系统,以确保任务规划系统满足用户制定的规范,同时仍使用自然语言。为减轻潜在偏差,我们使用两个不同的商业LLM分别负责规范生成和验证子任务。通过大量实验,我们强调了将任务验证集成到全自主流水线中的优势与局限,特别是关于LLM生成有效LTL公式的能力,并展示了我们的实现如何应对和解决这些挑战。

英文摘要

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.

2606.18601 2026-06-18 cs.RO 新提交

Admittance-Based Surface Alignment for Human-in-the-Loop Robotic Visual Inspection

基于导纳的表面对齐用于人在环机器人视觉检测

Antara Banerjee, Colin Acton, Xu Chen

发表机构 * University of Washington(华盛顿大学)

AI总结 提出一种基于导纳的实时闭环控制框架,融合操作员输入与感知驱动,实现机器人末端执行器与局部表面的精确对齐,在6自由度机械臂上验证了稳定法向跟踪和0.4°的平均定向误差。

详情
AI中文摘要

精密视觉检测是航空航天、半导体和医疗制造中质量保证的基础,这些领域中高价值零件上未被检测到的表面缺陷直接导致报废、返工和现场故障。机器人视觉检测需要在存在感知噪声和表面不规则的情况下,实现末端执行器与局部表面几何的精确对齐。在工业环境中,通常通过遥操作或共享自主性将人类操作员保持在回路中,引入实时调整,使得纯离线运动规划不足。这激发了能够在人类和感知不确定性下做出反应性、顺从行为的控制架构。本文提出了一种新颖的实时闭环机器人定向控制流程,用于精密视觉检测,该流程采用基于导纳的框架,统一了操作员输入和感知驱动的表面对齐。我们将末端执行器设计为在粘性介质中运动的虚拟球体,使得由此产生的物理可解释的质量-阻尼系统根据定向误差和操作员命令生成同步、顺从的运动。我们在6自由度机械臂上验证了该框架,展示了稳定的法向跟踪和0.4°的最终平均定向误差。

英文摘要

Precision visual inspection underpins quality assurance across aerospace, semiconductor, and medical manufacturing, where undetected surface anomalies on high-value parts translate directly into scrap, rework, and field failures. Robotic visual inspection requires precise alignment between the end-effector and local surface geometry in the presence of perception noise and surface irregularities. In industrial settings, a human operator is often kept in the loop via teleoperation or shared autonomy, introducing real-time adjustments that render purely offline motion planning inadequate. This motivates control architectures capable of reactive, compliant behavior under combined human and perceptual uncertainty. This paper presents a novel real-time, closed-loop robotic orientation control pipeline for precision visual inspection, with an admittance-based framework that unifies operator input and perception-driven surface alignment. We design the end-effector as a virtual sphere moving through a viscous medium, such that the resulting physically interpretable mass-damper system generates synchronized, compliant motion from orientation error and operator commands. We validate the framework on a 6-DOF manipulator, demonstrating stable normal-tracking and a final mean orientation error of 0.4°.
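The mass-damper idea can be sketched for a single rotation axis. In this simplified example the virtual body obeys `inertia * domega/dt + damping * omega = torque`, and the torque combines a gain on the orientation error with an operator command. The gain, inertia, damping, and time step are assumed values for illustration and do not come from the paper, which works with full 3-D orientation.

```python
def admittance_step(omega, torque, inertia, damping, dt):
    """One explicit Euler step of inertia * domega/dt + damping * omega = torque."""
    alpha = (torque - damping * omega) / inertia
    return omega + alpha * dt


def simulate(theta0, theta_target, operator_torque=0.0, inertia=0.05,
             damping=1.0, gain=4.0, dt=0.001, steps=5000):
    """Drive a single-axis angle toward theta_target; return (theta, omega)."""
    theta, omega = theta0, 0.0
    for _ in range(steps):
        error = theta_target - theta                 # perception-driven alignment error
        torque = gain * error + operator_torque      # blended with the operator input
        omega = admittance_step(omega, torque, inertia, damping, dt)
        theta += omega * dt
    return theta, omega


# Align from 0 rad to an assumed surface-normal angle of 0.3 rad.
final_theta, final_omega = simulate(0.0, 0.3)
```

With these assumed parameters the closed loop is overdamped, so the angle settles without overshoot; a constant operator torque simply shifts the resting angle by `operator_torque / gain`, which is how the operator and the perception loop share control in this toy model.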

2606.18747 2026-06-18 cs.RO cs.AI 新提交

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

通过基于人类反馈的迭代强化学习利用大语言模型生成自然且富有表现力的机器人手势

Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

发表机构 * University of New South Wales(新南威尔士大学) Universidad Central de Chile(智利中央大学)

AI总结 针对社交机器人手势生成僵硬问题,提出将ChatGPT集成到Pepper机器人中生成共语手势,并引入基于人类反馈的迭代强化学习(RLHF)优化手势,实验表明RLHF提升了手势的表现力、相关性和流畅性。

Comments 8 Pages, 6 Figures

详情
AI中文摘要

富有表现力的手势对于自然有效的沟通至关重要,当仅靠语言线索不足时(例如,指向),手势可以补充言语。对于像Pepper这样的人形社交机器人,产生自然且富有表现力的动作对于改善人机交互(HRI)和长期接受度至关重要。然而,由于依赖专家编写的动画,生成手势仍然具有挑战性,导致行为僵硬,难以适应动态和多样化的环境。或者,机器学习方法通常难以捕捉感知的自然性,随着自由度的增加而变得更加困难。因此,产生富有表现力的机器人手势需要一个能够适应环境同时遵守社会规范和物理约束的系统。大语言模型(LLMs)的最新进展使得动态代码生成成为可能,为从自然语言实时合成手势提供了新的机会。在本文中,我们将ChatGPT集成到人形机器人Pepper中,以生成与对话输出一致的共语手势。虽然这一基线实现了灵活的手势生成,但生成的动作通常被认为僵硬且不自然。为了解决这一限制,我们引入了一种基于人类反馈的迭代强化学习(RLHF)系统,该系统根据用户评估微调手势生成,并利用迭代用户研究比较Pepper生成的手势。我们的结果表明,RLHF改进了LLM的共语生成能力,产生了更富有表现力、相关且流畅的动作。

英文摘要

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

2606.19240 2026-06-18 cs.RO cs.CV cs.HC cs.SY eess.SY 新提交

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

透过遮挡:机器人遥操作的确定性手臂运动学校正

Thomas M. Kwok, Nicholas Koenig, Yue Hu

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出手臂运动学校正方法,利用恒定臂长几何约束和勾股定理确定性地重建遮挡关节深度,无需复杂建模,经Vicon验证有效,并成功应用于遥操作。

详情
AI中文摘要

无标记、单RGB-D相机动作捕捉为机器人遥操作提供了一种低成本、非侵入性的替代传统标记系统的方法;然而,在自遮挡存在时,特别是上肢运动期间,深度估计常常退化。本文提出了一种手臂运动学校正(AKC)方法,通过基于恒定臂长施加几何约束来改进深度估计。所提出的方法利用手腕位置和预定义臂长,基于勾股定理的确定性公式重建遮挡关节深度,从而避免了对复杂概率建模或参数调整的需求。针对Vicon参考系统的实验验证表明,该方法在静态和动态关节运动下均表现出可靠的性能,通过均方根误差(RMSE)和皮尔逊相关性进行评估。此外,在模拟和物理机器人环境中成功演示了运动映射遥操作。结果表明,AKC在长时间、严重自遮挡下增强了鲁棒性并保持了解剖一致性,即使与不太可靠的时间滤波器配对时也是如此,突显了其在机器人遥操作和人机交互等实时应用中的实用性。

英文摘要

Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
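One way to read the Pythagorean constraint is sketched below. The sketch assumes that the wrist's 3-D position and the occluded joint's lateral (x, y) position are known in the camera frame, and that the segment length between them is fixed; the depth offset then follows from the right triangle formed by the lateral offset and the segment. The function name, the sign convention (`behind_wrist`), and the sample numbers are hypothetical, and the paper's exact formulation may differ.

```python
import math


def occluded_depth(wrist, joint_xy, segment_length, behind_wrist=True):
    """Estimate the depth (z) of an occluded joint from a fixed segment length.

    wrist: (x, y, z) of the visible wrist in the camera frame, in metres.
    joint_xy: (x, y) of the occluded joint in the same frame.
    segment_length: assumed constant distance between the joint and the wrist.
    """
    xw, yw, zw = wrist
    xj, yj = joint_xy
    lateral_sq = (xj - xw) ** 2 + (yj - yw) ** 2
    # Clamp at zero so measurement noise cannot push the radicand negative.
    depth_offset = math.sqrt(max(segment_length ** 2 - lateral_sq, 0.0))
    return zw + depth_offset if behind_wrist else zw - depth_offset


# Hypothetical case: 0.5 m segment, wrist 0.3 m to the side of the joint.
joint_z = occluded_depth((0.3, 0.0, 1.0), (0.0, 0.0), 0.5)
```

The formula is deterministic but leaves a front/back ambiguity, which the sketch resolves with an explicit flag; a real system would need some rule or prior to pick the sign.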

6. 具身智能与视觉语言动作模型 3 篇

2606.18363 2026-06-18 cs.RO cs.AI 新提交

Guava: An Effective and Universal Harness for Embodied Manipulation

Guava: 一种有效且通用的具身操作工具框架

Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao

发表机构 * University of Maryland College Park(马里兰大学帕克分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Waterloo(滑铁卢大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) University of Pennsylvania(宾夕法尼亚大学) Amazon FAR(亚马逊 FAR)

AI总结 提出Guava框架,通过迭代感知-推理-行动循环、语义动作抽象和多模态观测三大关键设计,将具身操作能力蒸馏到4B开源模型中,在仿真和真实环境中性能媲美前沿专有模型。

详情
AI中文摘要

在大规模视觉-语言数据上训练的语言模型已展现出作为具身智能体的强大潜力。通过具身工具使用来驾驭模型,为端到端的视觉-语言-行动系统提供了一种有前景的替代方案,它将高层推理与外部模块(用于感知、规划和控制)相结合。然而,对于具身操作而言,什么构成了有效的工具框架,以及这种框架能在多大程度上解锁广泛推理模型的具身能力,仍不清楚。在这项工作中,我们提出了Guava,一个通过系统探索智能体工作流、动作空间和观测空间的设计空间而开发的具身工具使用框架。我们的研究确定了有效具身智能体的三个关键要素:迭代感知-推理-行动循环、语义动作抽象和多模态观测。为了理解这些设计原则是否对小型模型也具有普适性,我们开发了一个端到端的训练流程,利用完全在仿真中收集的不到2000条轨迹,将具身操作能力蒸馏到一个4B开源模型中。在仿真和真实环境中的实验结果表明,其性能与前沿专有模型相当,同时展现出对未见物体、新指令和长时域任务的强大泛化能力。结果表明,一个精心设计的框架可以作为具身操作的可扩展、模型无关的接口,使紧凑的开源模型在极少的训练数据下展现出强大的涌现具身能力。

英文摘要

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

2606.19088 2026-06-18 cs.RO 新提交

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

ReSiReg:面向语言条件机器人任务的空间一致语义

Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, Gerald Steinbauer-Wagner

发表机构 * Graz University of Technology, Institute of Software Engineering and Artificial Intelligence(格拉茨技术大学,软件工程与人工智能研究所) University of Applied Sciences Technikum Wien, Department of Industrial Engineering(维也纳应用科技大学,工业工程系) University of Alicante, Department of Computer Technology(阿利坎特大学,计算机技术系) University of Natural Resources and Life Sciences, Institute for Integrative Nature Conservation Research(自然资源与生命科学大学,整合自然保护研究所)

AI总结 提出ReSiReg方法,通过重构空间一致的VLM中间特征,改善密集语言接地检索,在OVSS和3D映射中提升空间一致性,并发布紧凑的25M参数VLM模型。

详情
AI中文摘要

视觉-语言模型(VLM)使机器人能够遵循开放语言指令。然而,密集的VLM嵌入已被证明存在噪声且缺乏空间一致性。这对于需要同时推理语义和3D空间的机器人应用来说是有问题的。我们研究了近期VLM的空间结构,并提出了ReSiReg,一种特征重构方法,利用空间一致的VLM中间特征来改善密集语言接地检索。ReSiReg将中间特征聚类为视觉原型,推导其语言描述符,并将每个补丁重构为原型级语言嵌入的软混合。我们在OVSS和3D映射上跨骨干网络进行定量评估,并在真实世界操作场景中进行定性评估。定量结果显示密集检索得到改善;操作场景显示出更空间一致的目标激活。我们进一步为机器人应用提供了一个紧凑的25M密集VLM,远小于ViT-B基线且具有竞争力。可从 https://resireg.github.io 获取。

英文摘要

Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have been shown to be noisy and to lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a soft mixture of prototype-level language embeddings. We evaluate quantitatively on OVSS and 3D mapping across backbones, and qualitatively in real-world manipulation scenes. Quantitative results show improved dense retrieval; manipulation scenes show more spatially consistent target activations. We further provide a compact 25M dense VLM for robotic applications, substantially smaller than and competitive with ViT-B baselines. Available at https://resireg.github.io
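The reconstruction step, in which each patch becomes a soft mixture of prototype-level language embeddings, can be sketched as follows. The abstract does not say how the mixture weights are computed, so the cosine similarity, the softmax, and the temperature value here are assumptions, and all names and vectors are hypothetical toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(scores, temperature):
    """Numerically stable softmax over scores scaled by 1 / temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct_patch(patch_feature, prototype_features, prototype_text_embeddings,
                      temperature=0.1):
    """Return (language-space embedding, mixture weights) for one patch."""
    sims = [cosine(patch_feature, p) for p in prototype_features]
    weights = softmax(sims, temperature)
    dim = len(prototype_text_embeddings[0])
    mixed = [sum(w * emb[k] for w, emb in zip(weights, prototype_text_embeddings))
             for k in range(dim)]
    return mixed, weights


# Two toy visual prototypes, each paired with a toy language descriptor embedding.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
embedding, weights = reconstruct_patch([1.0, 0.0], prototypes, text_embeddings)
```

Because neighbouring patches that fall into the same prototype receive nearly the same mixture, the reconstructed embeddings vary more smoothly across the image than raw per-patch embeddings would.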

2606.19340 2026-06-18 cs.RO 新提交

Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning

零样本长时程灵巧操作:基于多视图3D接地VLM推理

Jisoo Kim, Sangwon Baik, Taeksoo Kim, Sungjoo Kim, Junyoung Lee, Mingi Choi, Hanbyul Joo

发表机构 * Seoul National University(首尔国立大学) RLWRLD

AI总结 提出零样本框架,利用多视图RGB图像通过VLM生成3D任务规划,结合三角测量和射线投票实现精确3D接地,支持抓取和工具使用,在真实实验中优于基线方法。

AI中文摘要

我们提出了一个零样本框架,用于长时程灵巧操作,该框架将语言指令从校准的多视图RGB图像接地到可执行的3D任务规划。我们的系统不是训练端到端策略,而是使用视觉语言模型(VLM)生成参考帧任务接地和原始级2D关键点,然后通过多视图融合将其提升到3D。这种提升结合了视图级VLM接地的三角测量与参考视图射线投票,后者沿语义相机射线搜索跨相邻视图的几何一致候选点。生成的3D关键点支持抓取和放置以及工具使用:对于工具使用,我们检索与推断技能类别对应的以对象为中心的原子动作,并将其存储的6D工具轨迹对齐到场景;对于灵巧执行,我们将提升的抓取关键点扩展为任务条件抓取可行区域,并使用臂手运动生成器生成可行的抓取-运动对。真实世界实验表明,与单视图RGB-D接地和微调VLA基线相比,3D接地精度和执行可靠性有所提高。我们进一步通过闭环状态验证和重新规划展示了长时程操作,实现了在新场景中对未见物体和工具使用任务的零样本执行。

英文摘要

We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D task plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This lifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for geometrically consistent candidates across neighboring views. The resulting 3D keypoints support both pick-and-place and tool use: for tool use, we retrieve an object-centric atomic action corresponding to the inferred skill category and align its stored 6D tool trajectory to the scene; for dexterous execution, we expand the lifted grasp keypoint into a task-conditioned grasp affordance region and generate feasible grasp-motion pairs with an arm-hand motion generator. Real-world experiments show improved 3D grounding accuracy and execution reliability over single-view RGB-D grounding and fine-tuned VLA baselines. We further demonstrate long-horizon manipulation through closed-loop status verification and replanning, enabling zero-shot execution on unseen objects and tool-use tasks in novel scenes.
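
To make the "lifting 2D keypoints into 3D" idea concrete, here is a generic least-squares ray triangulation sketch. It assumes each view's 2D keypoint has already been back-projected into a world-frame ray (origin plus direction) using the camera calibration. It shows only a textbook multi-ray intersection, not the paper's triangulation or its ray-voting procedure, and the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve the 3x3 linear system a x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; the point is not observable")
    solution = []
    for k in range(3):
        ak = [[b[r] if c == k else a[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(ak) / d)
    return solution


def triangulate_rays(origins, directions):
    """Point minimizing the summed squared distance to all camera rays.

    For a ray with origin o and unit direction u, the squared distance from x
    is ||(I - u u^T)(x - o)||^2, which leads to the normal equations
    sum(I - u u^T) x = sum((I - u u^T) o).
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = sum(c * c for c in d) ** 0.5
        u = [c / norm for c in d]
        for r in range(3):
            for c in range(3):
                m = (1.0 if r == c else 0.0) - u[r] * u[c]
                a[r][c] += m
                b[r] += m * o[c]
    return solve3(a, b)
```

With noisy VLM keypoints the rays do not meet exactly, which is why a least-squares point (or a consistency check such as the voting described in the abstract) is needed rather than an exact intersection.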

7. 多机器人与群体系统 1 篇

2606.18516 2026-06-18 cs.RO 新提交

Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets

动态杂乱环境下的任务分配与运动规划:基于CBBA与凸集图

Matthew D. Osburn, Cameron K. Peterson, John L. Salmon

发表机构 * Electrical and Computer Engineering(电气与计算机工程系) Mechanical Engineering(机械工程系)

AI总结 针对动态杂乱环境中的多智能体任务规划,提出结合凸集图(GCS)进行轨迹优化与共识捆绑算法(CBBA)进行分布式任务分配的方法,实现安全高效的轨迹规划和任务协调。

Comments 15 pages single column, 10 figures, AIAA-Scitech 2027 Submission

AI中文摘要

在杂乱、动态环境中的多智能体任务规划需要在分配任务给智能体的同时,确定通过环境的安全、时间高效的轨迹。当任务是动态的(例如会合目标)时,分配决策不仅取决于哪个智能体最适合某项任务,还取决于该任务何时何地可以到达。本文提出了一个解决该问题的方法,该方法将凸集图(GCS)用于轨迹优化,与共识捆绑算法(CBBA)用于分布式任务分配相结合。在我们的方法中,GCS通过使用时间扩展(3D+时间)配置空间找到通过动态环境的最优轨迹。同时,CBBA协调跨智能体的任务分配,使得在移动环境中能够做出明智的决策。然后,我们连接分配和规划,使智能体能够在3D+时间配置空间中避免碰撞,并提供准确的任务完成时间估计。我们在具有静态和动态任务的模拟杂乱环境中展示了我们方法的有效性。

英文摘要

Multi-agent task planning in cluttered, dynamic environments requires assigning tasks to agents while simultaneously determining safe, time-efficient trajectories through the environment. When tasks are dynamic, such as rendezvous objectives, allocation decisions depend not only on which agent is best suited for a task, but also on when and where that task can be reached. This paper presents a solution to this problem, which combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation. In our approach, GCS finds optimal trajectories through dynamic environments using a time-extended (3D+time) configuration space. At the same time, CBBA coordinates task assignments across agents, enabling informed decision-making in a moving environment. We then connect allocation and planning to allow the agents to avoid collisions in the 3D+time configuration space and provide accurate time estimates for task completion. We demonstrate the effectiveness of our approach in simulated cluttered environments with static and dynamic tasks.

8. 无人车、无人机与移动机器人 2 篇

2606.18630 2026-06-18 cs.RO 新提交

DNN Koopman-Based Deviation Compensation for UGV Path Tracking Control on Coupled Slope and Potholed Road

基于DNN Koopman的偏差补偿用于耦合坡度和坑洼道路上的UGV路径跟踪控制

Jian Zhao, Wenbo Zhou, Zhicheng Chen, Bing Zhu, Jiayi Han, Dongjian Song, Yinju Lin, Peixing Zhang

发表机构 * Xiamen King Long United Automotive Industry Co., Ltd.(厦门金龙联合汽车工业有限公司)

AI总结 提出基于DNN Koopman的偏差补偿策略,结合自适应遗忘递推最小二乘估计轮胎刚度、Laguerre模型预测控制与事件触发协同补偿,在耦合坡度和坑洼道路上提升UGV路径跟踪精度超11.5%。

Comments 22 pages, 13 figures

AI中文摘要

在越野场景中运行的无人地面车辆面临复杂地形扰动,这些扰动会显著降低路径跟踪性能。针对这一挑战,本文提出了一种基于深度神经网络Koopman的偏差补偿策略,用于无人地面车辆路径跟踪控制。首先,基于耦合坡度上的车辆动力学函数,设计了一种带有解耦误差项的自适应遗忘递推最小二乘法来估计轮胎侧偏刚度。在此基础上,通过引入Laguerre函数,设计了一种Laguerre模型预测控制路径跟踪控制策略,该策略可在不同耦合坡度场景下降低计算资源消耗的同时保持可靠的跟踪性能。然后,通过将Koopman算子理论与深度神经网络相结合,提出了一种深度神经网络Koopman路径偏差补偿方法,该方法显著提高了无人地面车辆在坑洼道路扰动下的路径跟踪精度。此外,基于补偿激活准则和可信度验证,建立了一种将Laguerre模型预测控制与深度神经网络Koopman耦合的事件触发并行协同补偿机制。该机制提高了坑洼道路上的路径跟踪精度,同时确保了整体转向指令的可行性和深度神经网络Koopman补偿后车辆的稳定性。最后,构建了硬件在环实验平台进行验证。实验结果表明,所提出的无人地面车辆路径跟踪策略在多种工况下跟踪性能提升超过11.5%。

英文摘要

Unmanned ground vehicles (UGVs) operating in off-road scenarios are confronted with complex terrain disturbances that can substantially degrade path tracking performance. To address this challenge, this paper proposes a deep neural network (DNN) Koopman-based deviation compensation strategy for UGV path tracking control. First, based on the vehicle dynamics function on coupled slopes, an adaptive forgetting recursive least squares method with decoupled error terms is designed to estimate tire cornering stiffness. On this basis, a Laguerre model predictive control (LMPC) path tracking strategy is designed by incorporating Laguerre functions, which reduces computational resource usage while maintaining reliable tracking performance across different coupled slope scenarios. Then, by integrating Koopman operator theory with a DNN, a DNN Koopman (DK) path deviation compensation method is proposed, which significantly improves the path tracking accuracy of the UGV under potholed road disturbances. Furthermore, an event-triggered parallel cooperative (EPC) compensation mechanism that couples LMPC with DK is established based on compensation activation criteria and credibility verification. This mechanism improves path tracking accuracy on potholed roads while ensuring the feasibility of the overall steering command and the stability of the vehicle after DK compensation. Finally, a hardware-in-the-loop (HiL) experimental platform is constructed for validation. Experimental results demonstrate that the proposed UGV path tracking strategy improves tracking performance by more than 11.5% across multiple operating conditions.
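
As background for the recursive least squares step mentioned above, the following is a minimal scalar sketch with a fixed forgetting factor. The paper describes an adaptive forgetting scheme with decoupled error terms, which is not reproduced here. The measurement model y = phi * theta, the function name `rls_step`, and the default `lam` are illustrative assumptions.

```python
def rls_step(theta, p, phi, y, lam=0.98):
    """One recursive least squares update with a fixed forgetting factor.

    theta: current parameter estimate (e.g. a stiffness-like coefficient)
    p:     current estimate covariance (scalar)
    phi:   regressor for this sample
    y:     measurement, modeled as y = phi * theta + noise (assumed model)
    lam:   forgetting factor in (0, 1]; smaller values discount old data faster
    """
    gain = p * phi / (lam + phi * p * phi)
    theta = theta + gain * (y - phi * theta)  # correct by the prediction error
    p = (p - gain * phi * p) / lam            # inflate covariance to keep adapting
    return theta, p


def estimate(samples, theta0=0.0, p0=1000.0, lam=0.98):
    """Run RLS over (phi, y) pairs and return the final estimate."""
    theta, p = theta0, p0
    for phi, y in samples:
        theta, p = rls_step(theta, p, phi, y, lam)
    return theta
```

A forgetting factor below 1 lets the estimate follow slowly varying parameters, at the price of higher sensitivity to noise; adapting that factor online is one way to balance the two.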

2606.19227 2026-06-18 cs.RO 新提交

Constant Time-Delay Leader Following with Neural Networks and Invariant Extended Kalman Filters for Arbitrary Trajectories

基于神经网络与不变扩展卡尔曼滤波的任意轨迹恒定时间延迟领航跟随

Luka Antonyshyn, Paulo Ricardo Marques de Araujo, Sidney Givigi

发表机构 * University of Toronto Institute for Aerospace Studies(多伦多大学航空航天研究所) School of Computing, Queen’s University(女王大学计算机学院)

AI总结 提出一种结合概率Seq2Seq神经网络与不变扩展卡尔曼滤波的恒定时间延迟轨迹跟踪方法,用于无通信、无全局坐标的车队,在SE(2)流形上准确估计领车轨迹,并利用几何模型预测控制提升性能。

Comments 9 pages, 6 figures

AI中文摘要

本文提出了一种用于车辆队列的恒定时间延迟轨迹跟踪方法,该方法无需车辆间通信、公共坐标系或全球定位。该方法将概率序列到序列(Seq2Seq)神经网络与不变扩展卡尔曼滤波(IEKF)相结合,以热启动预测过程,从而在SE(2)流形上准确估计领车相对轨迹。进一步引入几何模型预测控制器,以充分利用基于流形的轨迹预测来改善控制性能。该系统能够处理具有不同速度和运动轮廓的任意非线性轨迹,同时减少了对基于专家领域知识的轨迹跟踪系统设计的需求,即使在长轨迹延迟下也是如此。通过运动学仿真中与纯IEKF基线、基于学习的方法以及真实轨迹的对比,以及使用真实机器人车辆的实验,验证了该方法的有效性。

英文摘要

This paper proposes a constant time-delay trajectory tracking method for vehicle convoys operating without inter-vehicle communication, a common coordinate system, or global positioning. The method integrates a probabilistic sequence-to-sequence (Seq2Seq) neural network with an invariant extended Kalman filter (IEKF) to warm-start the prediction process, allowing accurate estimation of a leader vehicle's relative trajectory on the SE(2) manifold. A geometric model predictive controller is further incorporated to fully exploit the manifold-based trajectory predictions for improved control performance. The system can handle arbitrary nonlinear trajectories with varying speeds and motion profiles while reducing the need for expert-based domain knowledge for the design of trajectory following systems, even under long trajectory delays. The effectiveness of the method is validated through comparisons with a pure IEKF baseline, learning-based methods, and the ground-truth trajectory in kinematic simulations, as well as in experiments using real robotic vehicles.
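
Since the abstract above works with relative leader poses on SE(2) rather than in a global frame, a small sketch of SE(2) pose algebra may help. It only shows composition, inversion, and expressing a leader pose in the follower's frame; it is not the paper's IEKF or controller, and the `(x, y, theta)` tuple representation and function names are assumptions.

```python
import math


def wrap(angle):
    """Wrap an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def se2_compose(a, b):
    """Compose two SE(2) poses given as (x, y, theta): result is a * b."""
    ax, ay, ath = a
    bx, by, bth = b
    c, s = math.cos(ath), math.sin(ath)
    return (ax + c * bx - s * by, ay + s * bx + c * by, wrap(ath + bth))


def se2_inverse(a):
    """Inverse of an SE(2) pose: (R, t) -> (R^T, -R^T t)."""
    x, y, th = a
    c, s = math.cos(th), math.sin(th)
    return (-(c * x + s * y), s * x - c * y, wrap(-th))


def relative_pose(follower, leader):
    """Leader pose expressed in the follower's body frame."""
    return se2_compose(se2_inverse(follower), leader)
```

For example, a follower at (1, 0) facing +y and a leader at (1, 2) with the same heading give a relative pose of roughly (2, 0, 0): the leader is two units straight ahead, which is the kind of quantity a convoy follower can reason about without a shared coordinate system.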

9. 软体机器人与硬件设计 3 篇

2606.18680 2026-06-18 cs.RO 新提交

High-Degree-of-Freedom Lightweight Bioinspired Leg for Enhanced Mobility in Small Robots

高自由度轻量化仿生腿:提升小型机器人机动性

Haoqi Han, Yifei Yu, Jiaming Zhang, Xinru Cui, Linxi Feng, Hesheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai University of Electric Power(上海电力大学)

AI总结 针对微型机器人腿部自由度受限问题,提出一种四自由度并联腿机构,通过同心设计简化运动学,实现轻量化(18.9g)和大工作空间(>22255 mm³),显著提升运动灵活性。

Journal ref
2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
AI中文摘要

在微型机器人领域,如何在严格的空间限制下通过增加腿部机构的自由度来增强运动能力仍然是一个重大挑战。受昆虫运动启发,本文提出了一种新型的微型四自由度并联腿机构,并系统分析了其机械设计、电气系统和运动学。该设计采用两个球形五杆连杆机构,在并联四杆配置中实现空间运动。此外,采用同心设计策略简化了腿部运动学的解析解。由于采用并联系统架构,所有执行器均位于主体上,与传统高自由度腿部结构相比,大大降低了运动部件的等效惯性。系统总质量仅为18.9 g,末端执行器输出力约为0.5 N,工作空间超过22255 mm³。实验结果表明,所提出的单腿机构具有优异的运动灵活性,凸显了其在微型仿生机器人领域的潜力。

英文摘要

In microrobotics, enhancing locomotion capabilities by increasing the degrees of freedom (DoF) of leg mechanisms under severe spatial constraints remains a significant challenge. Inspired by insect locomotion, this paper presents a novel micro-scale parallel leg mechanism with four degrees of freedom, and systematically analyzes its mechanical design, electrical system, and kinematics. The design incorporates two spherical five-bar linkages to achieve spatial motion within a parallel four-bar configuration. Furthermore, a concentric design strategy is employed to simplify the analytical solution of the leg kinematics. Due to the parallel system architecture, all actuators are located on the main body, substantially reducing the equivalent inertia of moving parts compared to traditional high-DoF leg structures. The total mass of the system is only 18.9 g, with an end-effector output force of approximately 0.5 N and a workspace exceeding 22255 mm³. Experimental results demonstrate that the proposed single-leg mechanism achieves excellent motion flexibility, highlighting its potential for micro bio-inspired robotics.

2606.18704 2026-06-18 cs.RO 新提交

Selective Unit-Cell Actuation in Lattice Structures for Distributed Morphology in Soft Robots

晶格结构中的选择性单元胞驱动用于软体机器人的分布式形态变化

Trevor Exley, Altair Coutinho, Lucia Beccai

发表机构 * Istituto Italiano di Tecnologia (IIT)(意大利技术研究院)

AI总结 提出嵌入式气动单元胞,将弯曲支柱晶格与双向波纹管致动器集成,通过空间驱动模式实现全局形态控制,实验验证了可扩展位移、力生成及弯曲、抓取和爬行运动。

Comments Accepted to IROS 2026, 8 pages, 5 figures

AI中文摘要

软晶格结构越来越多地用于机器人中以定制柔顺性和引导变形;然而,驱动通常是在设备或模块级别引入,致动器插入到原本被动的架构中。在这项工作中,我们将致动器-晶格协同设计推进到单元胞尺度。我们提出了一种嵌入式气动单元胞,它将弯曲支柱晶格几何形状与双向波纹管致动器集成在一个单一的整体元件中。当镶嵌时,晶格作为一个分布式驱动场,其中全局形态由空间驱动模式而非均匀加压控制。对1x1、2x2和3x3镶嵌的实验表征展示了可扩展的位移和力生成,具有可重复的循环性能。在3x3x3阵列中,单元胞的选择性驱动产生了不同的全局变形模式,包括弯曲和定向抓取,而无需改变硬件配置。此外,耦合主动和被动单元胞实现了弯曲驱动的爬行运动,证明了异质镶嵌可以通过不对称变形进行平移。这些结果确立了单元胞级驱动作为晶格基软体机器人分布式变形的策略,并为可扩展的整体机器人架构提供了基础。

英文摘要

Soft lattice structures are increasingly used in robotics to tailor compliance and guide deformation; however, actuation is typically introduced at the device or module level, with actuators inserted into otherwise passive architectures. In this work, we move actuator-lattice co-design to the unit-cell scale. We present an embedded pneumatic unit cell that integrates curved-strut lattice geometry with a bidirectional bellow actuator within a single monolithic element. When tessellated, the lattice functions as a distributed actuation field in which global morphology is governed by spatial actuation patterns rather than uniform pressurization. Experimental characterization of 1x1, 2x2, and 3x3 tessellations demonstrates scalable displacement and force generation with repeatable cyclic performance. Selective actuation of unit cells in a 3x3x3 array produces distinct global deformation modes, including bending and directional grasping, without altering hardware configuration. Additionally, coupling active and passive unit cells enables bending-driven crawling locomotion, demonstrating that heterogeneous tessellations can translate through asymmetric deformation. These results establish unit-cell-level actuation as a strategy for distributed morphing in lattice-based soft robots and provide a foundation for scalable, monolithic robotic architectures.

2606.19265 2026-06-18 cs.RO 新提交

Shape Sensing of Continuum Robots using Direct Laser Writing

使用直接激光写入的连续体机器人形状感知

Amber K. Rothe, Nidhi Malhotra, Jaydev P. Desai

发表机构 * Winship Cancer Institute of Emory University(埃默里大学温希普癌症研究所) Medical Robotics and Automation (RoboMed) Laboratory, Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology(佐治亚理工学院华莱士·H·库尔特生物医学工程系医疗机器人与自动化实验室)

AI总结 本文利用直接激光写入技术制造应变传感器,集成于连续体机器人关节中,通过线性和非线性模型预测关节角度,误差低至1.76度,并实现闭环控制,跟踪误差小于3度。

Comments This work has been submitted to the IEEE for possible publication

AI中文摘要

连续体机器人因其固有的柔顺性和灵巧性,为微创和自然腔道手术提供了一种有前景的方法。然而,这种灵活性也使得估计机器人当前形状变得具有挑战性。已有多种方法用于重建这些机器人的形状,包括成像、光学传感、磁传感和电阻传感。使用直接激光写入(DLW)制造的应变传感器可以提供一种替代传感方法。该技术涉及使用激光诱导某些聚合物碳化,以创建石墨烯图案,例如应变传感器。在本文中,我们展示了如何使用同一激光和同一设置将柔性连续体关节和DLW传感器加工成一个整体结构。使用线性和非线性模型对制造的传感器进行表征,这些模型用于预测关节角度,误差低至1.76度。此外,我们展示了如何使用DLW传感器在机器人关节中实现闭环控制,跟踪误差低于3度。

英文摘要

Continuum robots offer a promising approach for minimally invasive and natural-orifice surgical procedures due to their inherent compliance and dexterity. However, this flexibility also makes estimating the current shape of the robot challenging. Several approaches have been used to reconstruct the shape of these robots, including imaging, optical sensing, magnetic sensing, and resistive sensing. Strain sensors fabricated using direct laser writing (DLW) could provide an alternative sensing method. This technique involves using a laser to induce carbonization of certain polymers to create graphene patterns, such as strain sensors. In this paper, we demonstrate how a flexible continuum joint and a DLW sensor can be machined as one monolithic structure using the same laser and the same setup. The fabricated sensors are characterized using linear and nonlinear models, which are used to predict the joint angle with error as low as 1.76 degrees. Furthermore, we demonstrate how a DLW sensor can be used to implement closed-loop control in a robotic joint, achieving tracking error under 3 degrees.
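
The linear sensor characterization mentioned above can be pictured as an ordinary least-squares line fit from a sensor reading to a joint angle. The sketch below is a generic closed-form fit, not the authors' calibration procedure; the choice of reading (for instance a relative resistance change) and the names `fit_line` and `predict_angle` are hypothetical.

```python
def fit_line(readings, angles):
    """Closed-form least-squares fit: angle ~ slope * reading + intercept."""
    n = len(readings)
    mean_x = sum(readings) / n
    mean_y = sum(angles) / n
    sxx = sum((x - mean_x) ** 2 for x in readings)
    if sxx == 0.0:
        raise ValueError("readings must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(readings, angles))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_angle(reading, slope, intercept):
    """Map a new sensor reading to an estimated joint angle."""
    return slope * reading + intercept
```

A nonlinear model would replace the line with, for example, a polynomial or another parametric curve fitted the same way; which form suits a given sensor has to be judged from its measured response.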

10. 仿真、数据集与评测 13 篇

2606.18375 2026-06-18 cs.RO 新提交

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

PAIWorld: 用于机器人操作的三维一致世界基础模型

Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu

发表机构 * Institute of AI for Industries, Chinese Academy of Sciences(中国科学院人工智能产业研究院)

AI总结 提出PAIWorld框架,通过几何感知交叉注意力、几何旋转位置编码和潜在3D-REPA蒸馏,解决多视图世界模型的3D不一致问题,在机器人操作基准上取得领先性能。

AI中文摘要

世界基础模型(WFMs)是强大的模拟器,但它们主要运行在单视图设置中,缺乏机器人操作所需的多视图3D一致性。虽然机器人系统依赖多个摄像头(自我中心、眼到手和腕装)进行策略学习,但当前的多视图世界模型只是简单地拼接视图标记,没有显式的几何推理。这导致跨视图物体漂移、深度不一致和纹理错位。我们将这些失败归因于两个缺陷:缺乏显式的视图间通信机制和缺乏3D几何先验。我们认为同时解决这两个问题是必要且充分的。为此,我们提出PAIWorld,一个通过三个核心组件增强扩散变换器世界模型的框架:(1)几何感知交叉注意力块,建立跨视图的显式通路;(2)几何旋转位置编码,将相机射线方向和外部姿态编码到注意力机制中;(3)潜在3D-REPA,从冻结的3D基础模型中蒸馏3D感知特征以确保3D一致性。基于DiT世界基础模型,PAIWorld在机器人操作基准上实现了最先进的多视图3D一致性,在WorldArena排行榜上排名第一,在AgiBot-Challenge2026排行榜上排名第二,同时支持基于模型的规划、世界动作模型和多视图策略后训练等下游应用。

英文摘要

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.

2606.18594 2026-06-18 cs.RO cs.AI 新提交

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

基于视觉的机器人操作中强化学习动作空间的基准测试

Seyed Alireza Azimi, Homayoon Farrahi, Abhishek Naik, Colin Bellinger, A. Rupam Mahmood

发表机构 * Department of Computing Science, University of Alberta(阿尔伯塔大学计算机科学系) National Research Council Canada(加拿大国家研究委员会) School of Electrical Engineering and Computer Science, University of Ottawa(渥太华大学电气工程与计算机科学学院) Vector Institute(向量研究所) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔机器智能研究所)

AI总结 本研究通过模拟到现实的迁移,在物体抓取和推动任务中评估了四种动作空间,发现关节速度动作空间在平滑性和任务性能上最优,并为RL实践者提供了动作空间选择指导。

Comments 9 pages with references

AI中文摘要

在现实世界的强化学习(RL)中,动作空间的选择在塑造运动平滑性、安全性和整体任务性能方面起着关键作用。在本研究中,我们评估了位姿增量、位姿速度、关节位置增量和关节速度在两项基于视觉的操作任务(物体抓取和推动)中的表现。我们在模拟中训练策略,并通过模拟到现实的迁移将其部署到现实世界。我们发现,动作空间表示确实显著影响模拟到现实的性能。特别是,我们发现关节速度动作空间在平滑性和最终任务性能方面最适合基于视觉的抓取和推动任务。我们还为RL实践者在模拟和现实实验中选择动作空间提供了实用指导。

英文摘要

In real-world reinforcement learning (RL), the choice of action space can play a key role in shaping motion smoothness, safety, and overall task performance. In this study, we evaluate pose increment, pose velocity, joint position increment, and joint velocity across two vision-based manipulation tasks: object picking and pushing. We train policies in simulation and deploy them to the real world using sim-to-real transfer. We find that action-space representation indeed significantly affects sim-to-real performance. In particular, we find that the joint velocity action space is best for the vision-based picking and pushing tasks in terms of smoothness and final task performance. We also provide practical guidance for RL practitioners in choosing action spaces for both simulation and real-world experiments.

2606.18610 2026-06-18 cs.RO cs.CV 新提交

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

SC3-Eval: 通过自洽视频生成评估机器人基础模型

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) NVIDIA(英伟达) Physical Intelligence Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校) Allen Institute for AI(艾伦人工智能研究所)

AI总结 提出SC3-Eval方法,利用前向-反向动力学一致性、跨视角一致性和测试时一致性,将预训练视频基础模型转化为准确的策略评估器,在7个真实世界策略上达到0.929的皮尔逊相关系数。

AI中文摘要

在真实世界中评估通用机器人操作策略成本高、速度慢且难以扩展。动作条件视频世界模型通过模拟策略 rollout 提供了一种可扩展的替代方案。然而,自回归 rollout 会累积复合误差,多视角观测必须保持相互一致,且评估器必须泛化到行为超出训练分布的策略。我们通过 SC3-Eval 解决这些挑战,这是一种自洽视频生成方案,通过强制三种互补的一致性,将预训练视频基础模型转化为准确的策略评估器。首先,前向-反向动力学一致性联合训练模型从动作预测帧以及从帧恢复动作,将生成的 rollout 锚定在物理上合理的动作流形上,并抵消仅前向模型无法惩罚的漂移。其次,跨视角一致性训练模型从每个相机视角修补其他视角,使多相机观测在长 rollout 中保持连贯,无需任何显式记忆机制。第三,测试时一致性在推理时重用反向动力学模式作为每个动作块的不确定性信号,当生成的帧偏离请求的动作时终止 rollout。我们还展示了 SC3-Eval rollout 复现了策略在真实世界 rollout 中表现出的失败模式,支持细粒度的诊断比较而不仅仅是聚合排名。在七个真实世界的视觉-语言-动作策略上,SC3-Eval 达到了闭环皮尔逊相关系数 0.929 和 MMRV 0.119,优于三个强大的现有视频模型基线,并泛化到新任务。

英文摘要

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. However, autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
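
The two reported metrics compare per-policy success rates measured in the real world with those predicted by the evaluator. The sketch below computes a Pearson correlation and a Mean Maximum Rank Violation (MMRV). The MMRV formula follows the definition commonly used in prior simulated-evaluation work as best understood here, and the abstract does not state the exact formula, so treat it as an assumption. The function names and example numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mmrv(real, predicted):
    """Mean Maximum Rank Violation (assumed definition).

    For each policy i, take the largest real-world performance gap
    |real[i] - real[j]| over all policies j that the evaluator ranks in the
    opposite order relative to i, then average over policies.
    Lower is better; 0 means no pair is ranked inconsistently.
    """
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            if (predicted[i] < predicted[j]) != (real[i] < real[j]):
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n
```

Pearson correlation rewards a linear relationship between predicted and real success rates, while MMRV penalizes ranking mistakes in proportion to how much real performance they misjudge, so the two capture different failure modes of an evaluator.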

2606.18646 2026-06-18 cs.RO 新提交

A Scalable Embodied Intelligence Platform for Seamless Real-to-Sim-to-Real Transfer of Household Mobile Manipulation Tasks

一种可扩展的具身智能平台,用于家庭移动操作任务的无缝真实-仿真-真实迁移

Kui Yang, Xianlei Long, Haoxuan Li, Yan Ding, Chao Chen

发表机构 * School of Computer Science, Chongqing University(重庆大学计算机学院) R&D Department, Lumos Robotics Technology (Suzhou) Co., Ltd(Lumos 机器人技术(苏州)有限公司研发部)

AI总结 提出BestMan平台,通过自动化场景生成、仿真引导任务形式化和硬件无关中间件,解决真实-仿真-真实迁移中的场景重建、策略评估和部署兼容性挑战,实现家庭移动操作的无缝迁移。

Comments CCF Transactions on Pervasive Computing and Interaction

AI中文摘要

移动操作是具身智能机器人的基本能力。对非结构化家庭环境中鲁棒且可泛化操作的需求日益增长,推动了具身智能平台的快速发展。然而,实现真实-仿真-真实循环的无缝迁移面临三个关键挑战:昂贵的高保真仿真场景重建、仿真中系统策略评估的复杂性以及不兼容的真实世界部署。为了解决这些挑战,我们开发了BestMan,一个可扩展且无缝的真实-仿真-真实平台,弥合仿真与真实世界之间的差距,实现家庭移动操作的有效策略开发、集成和部署。具体来说,我们设计了一个新颖的自动化场景生成(ASG)模块,从真实观测中重建逼真的仿真。然后,我们提出了一种仿真引导的任务形式化和技能学习架构,支持在仿真中灵活集成和大规模评估混合技能策略。最后,为了增强真实世界的可扩展性,我们开发了一个硬件无关的统一中间件(HUM),确保跨异构移动操作器的无缝且兼容的仿真到真实迁移,用于真实部署。实验结果表明,我们提出的平台在建立标准化基准和促进移动操作领域有前景的研究方面表现出优越的性能。

英文摘要

Mobile manipulation is a fundamental capability in embodied intelligence robotics. The growing demand for robust and generalizable manipulation in unstructured household environments has driven rapid progress in embodied intelligence platforms. However, achieving seamless transfer across the real-to-sim-to-real cycle faces three key challenges: the costly reconstruction of high-fidelity simulation scenes, the complexity of systematic strategy evaluation in simulation, and incompatible real-world deployments. To address these challenges, we develop BestMan, a scalable and seamless real-to-sim-to-real platform that bridges the gap between simulation and the real world, enabling effective strategy development, integration, and deployment for household mobile manipulation. Specifically, we design a novel Automated Scene Generation (ASG) module to reconstruct realistic simulations from real observations. Then, we propose a simulation-guided task formalization and skill learning architecture that supports the flexible integration and large-scale evaluation of hybrid skill strategies in simulation. Finally, to enhance real-world scalability, we develop a Hardware-agnostic and Unified Middleware (HUM) to ensure seamless and compatible sim-to-real transfer across heterogeneous mobile manipulators for real deployments. Experimental results demonstrate the superior performance of our proposed platform in establishing standardized benchmarks and facilitating promising research in the field of mobile manipulation.

2606.18698 2026-06-18 cs.RO cs.AI cs.LG 新提交

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

利用能量特征进行基于深度学习的表面分类:三个独立数据集的比较分析

Alexander Belyaev, Oleg Kushnarev

AI总结 研究评估能量特征作为表面分类的独立或辅助模态的可行性,在三个数据集上比较多种深度学习架构,发现CNN性能最优,纯能量特征准确率85-90%,与惯性特征结合可达96-99%,且能量特征可稳定提升1-2%准确率。

AI中文摘要

基于能量的方法在移动机器人表面分类中仍是一个相对未被充分研究的途径,尽管在受限环境中取得了有希望的结果。本研究评估了使用能量衍生特征作为独立分类模态或作为惯性数据补充输入的可行性。在三个公开数据集上进行了全面评估,比较了现代深度学习架构(包括循环神经网络、卷积神经网络、仅编码器Transformer和Mamba状态空间模型)在自动超参数调整和输入序列长度优化下的性能。模型在所有评估数据集上均实现了比先前报道值更高的准确率,其中卷积神经网络取得了最高的整体性能。当仅依赖基于能量的特征时,模型分类准确率在85-90%范围内,比与惯性特征结合时(96-99%)低约5-10%。用能量特征增强惯性数据导致平均准确率持续提高1-2%。这些发现表明,仅依赖能量特征的分类器为独立部署提供了足够的准确性,同时在与其它感知模态结合使用时也提供了一致的增益。

英文摘要

The energy-based method remains a comparatively underexamined approach for surface classification in mobile robotics, despite promising results in constrained environments. This study evaluated the viability of using energy-derived features as either a standalone classification modality or as supplementary input to inertial data. A comprehensive evaluation was conducted across three publicly available datasets, comparing the performance of modern deep learning architectures including recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models, under automated hyperparameter tuning and input sequence length optimization. The models achieved higher accuracy than previously reported values on all evaluated datasets, with the convolutional neural network yielding the highest overall performance. When relying exclusively on energy-based features, the models attained classification accuracies in the range of 85-90%, approximately 5-10% lower than those achieved when combined with inertial features (96-99%). Augmenting inertial data with energy features resulted in a consistent mean accuracy improvement of 1-2%. These findings indicate that classifiers relying solely on energy features offer sufficient accuracy for standalone deployment, while also providing a consistent gain when used in combination with other sensing modalities.

2606.18948 2026-06-18 cs.RO 新提交

C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors

C-ARC: 面向非重复式LiDAR传感器的连续自适应范围聚类

Nick B. Schroeder, Jonathan Lichtenfeld, Oskar von Stryk

发表机构 * Technical University of Darmstadt(达姆施塔特工业大学) Simulation, Systems Optimization and Robotics Group(仿真、系统优化与机器人组)

AI总结 提出C-ARC框架,通过滑动窗口上的持久双图结构解耦高频点插入与按需聚类检索,并利用指数控制环自适应校准网格分辨率,实现非重复式LiDAR点云的实时聚类。

Comments Submitted to IEEE Robotics and Automation Letters. This work has been submitted to the IEEE for possible publication. 8 pages, 7 figures

AI中文摘要

实时LiDAR聚类识别点云中的结构,是许多移动机器人算法的重要前提。当前方法主要针对重复式机械LiDAR传感器开发。近年来,由于成本和外形尺寸小,非重复式LiDAR传感器的使用显著增加。这类基于Risley棱镜的非重复传感器违反了重复式机械传感器的两个关键假设:结构化的扫描线和明确的帧边界。其Rhodonea曲线轨迹产生非均匀点分布,且缺乏旋转周期使得传统扫描线索引无法适用。为满足这些新需求,我们开发了C-ARC,一个连续自适应范围聚类框架,它在滑动窗口上维护一个持久双图,将高频点插入与按需聚类检索解耦。这对于SLAM或跟踪等关键功能至关重要。自适应范围网格分辨率机制在初始化时使用指数控制环校准网格尺寸,无需预先了解扫描模式即可平衡稀疏-碰撞权衡。作为开源的单线程C++17库实现,C-ARC在商用硬件上对Livox Mid-360以20 Hz产生实时聚类输出。在Livox Avia上的评估表明,对于扫描模式高度集中的传感器,无界单元占用是主要限制。自适应分辨率机制还提高了现有基于网格的方法在非重复数据上的聚类质量。

英文摘要

Real-time LiDAR clustering identifies structures in point clouds, which is an essential prerequisite for many mobile robotics algorithms. Current methods are mostly developed for repetitive mechanical LiDAR sensors. Recently, the use of non-repetitive LiDAR sensors has increased strongly due to their low cost and small form factor. Such non-repetitive Risley prism-based sensors violate two key assumptions of repetitive mechanical sensors: structured scan lines and well-defined frame boundaries. Their Rhodonea-curve trajectories produce non-uniform point distributions, and the absence of a rotation cycle renders conventional scan line indexing inapplicable. To meet these new requirements, we developed C-ARC, a Continuous-Adaptive Range Clustering framework that maintains a persistent dual-graph over a sliding window, decoupling high-frequency point insertion from on-demand cluster retrieval. This is crucial for key functionalities like SLAM or tracking. An adaptive range grid resolution mechanism calibrates grid dimensions at initialization using an exponential control loop, balancing the sparsity-collision trade-off without prior knowledge of the scanning pattern. Implemented as an open-source single-threaded C++17 library, C-ARC produces real-time cluster output at 20 Hz on commodity hardware for the Livox Mid-360. Evaluation on the Livox Avia identifies unbounded cell occupancy as the primary limitation for sensors with strongly concentrated scan patterns. The adaptive resolution mechanism additionally improves clustering quality for existing grid-based methods on non-repetitive data.

2606.18959 2026-06-18 cs.RO 新提交

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer

TactSpace: 学习富含物理信息的共享潜在空间以实现触觉模拟到现实的迁移

Arunim Joarder, Arjun Bhardwaj, René Zurbrügg, Mayank Mittal, Florin Püntener, Sira Bielefeldt, Cosmin Roman, Vaishakh Patil, Marco Hutter

发表机构 * Robotic Systems Lab, ETH Zürich(瑞士苏黎世联邦理工学院机器人系统实验室) Micro- and Nanosystems Lab, ETH Zürich(瑞士苏黎世联邦理工学院微纳系统实验室) ETH AI Center(苏黎世联邦理工学院人工智能中心) NVIDIA(英伟达)

AI总结 提出多模态表示学习框架TactSpace,通过共享潜在空间对齐异构触觉模态,实现零样本模拟到现实迁移,在力预测和形状重建任务中分别降低误差16.7%和45.8%。

Comments 9 pages, 6 figures, 4 tables, accepted into IROS 2026

AI中文摘要

触觉传感提供了对机器人操作至关重要的接触相互作用的直接测量。然而,当前的模拟器缺乏足够保真度来忠实模拟触觉传感器的复杂变形和换能机制,严重阻碍了机器人学习流程中的模拟到现实迁移。为了解决这一挑战,我们提出了一种多模态表示学习框架,该框架在共享潜在空间内对齐异构触觉模态,消除了对精确原始信号模拟的需求,同时保留了相关的接触信息。我们的方法采用模态特定编码器将不同的触觉观测(例如模拟穿透深度和真实电容)投影到公共嵌入空间中。该模型使用自重建和交叉重建目标以及对比对齐进行训练,鼓励模态不变且信息丰富的表示。我们在压头形状识别、力预测和几何重建任务上评估学习到的嵌入,仅在模拟中训练并直接在真实传感器测量上测试。我们的结果展示了跨物理不同表示的零样本模拟到现实迁移。此外,结合多物理模拟模态产生了更信息丰富的嵌入,这些嵌入可跨不同下游任务迁移,力预测误差降低16.7%,形状重建误差降低45.8%。最后,我们为Isaac Lab发布了一个基于Warp的高效罚函数触觉模拟模型实现,支持可扩展的触觉数据生成。

英文摘要

Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach employs modality-specific encoders to project diverse tactile observations, such as simulated penetration depth and real-world capacitance, into a common embedding space. The model is trained using self- and cross-reconstruction objectives alongside contrastive alignment, encouraging modality-invariant yet information-rich representations. We evaluate the learned embeddings on indenter shape identification, force prediction, and geometric reconstruction tasks, training exclusively in simulation and testing directly on real sensor measurements. Our results demonstrate zero-shot sim-to-real transfer across physically dissimilar representations. Furthermore, incorporating multi-physics simulation modalities yields more informative embeddings that transfer across diverse downstream tasks, demonstrating a 16.7% reduction in force prediction error and a 45.8% reduction in shape reconstruction error. Finally, we release an efficient Warp-based implementation of a penalty-based tactile simulation model for Isaac Lab, enabling scalable tactile data generation.
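
The "contrastive alignment" between simulated and real tactile embeddings can be illustrated with an InfoNCE-style loss. The abstract does not specify the exact objective, so the one-directional InfoNCE form, the cosine similarity, the temperature, and the function names below are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def info_nce(sim_embs, real_embs, temperature=0.1):
    """InfoNCE-style loss over paired embeddings (assumed objective).

    Row i of sim_embs and row i of real_embs are assumed to describe the same
    contact event seen through two modalities; all other rows act as negatives.
    """
    total = 0.0
    for i, z in enumerate(sim_embs):
        logits = [cosine(z, r) / temperature for r in real_embs]
        peak = max(logits)
        # Stable log-sum-exp over all candidates in the batch.
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(sim_embs)
```

The loss is small when each simulated embedding is closest to its own real counterpart and large when pairs are mismatched, which is the pressure that pulls the two modalities toward a shared latent space.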

2606.19067 2026-06-18 cs.RO cs.CV 新提交

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

传感器配置至关重要:四足机器人多模态SLAM的系统评估

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

发表机构 * Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院机器智能与机器人实验室) Institute for Intelligent Systems, Esslingen University of Applied Sciences(埃斯林根应用科学大学智能系统研究所) Department of Computer Science, University of Freiburg(弗赖堡大学计算机科学系)

AI总结 针对四足机器人运动中的传感器配置问题,系统评估了视觉、视觉-惯性和LiDAR-视觉-惯性SLAM方法,发现立体相机和全局快门能显著提升定位鲁棒性,而标准惯性集成在剧烈腿部运动下可能降低以视觉为主的框架的性能。

详情
AI中文摘要

四足机器人在不同环境中的自主导航从根本上依赖于鲁棒的同步定位与地图构建(SLAM)。虽然视觉-惯性SLAM在轮式、手持和空中平台上已经成熟,但在腿部运动的剧烈动态下,硬件级传感器配置如何影响性能仍存在关键的评估空白。四足机器人引入了独特的具身感知挑战,包括足部冲击、高频机械振动和快速角旋转,这些都会降低标准感知管道的性能。为了填补这一空白,我们使用在ANYmal D四足机器人上记录的GrandTour数据集,对最先进的视觉、视觉-惯性和LiDAR-视觉-惯性SLAM方法进行了系统评估。我们分离并量化了相机模态、快门技术和惯性传感器层级的影响,分析了它们在定位精度、算法鲁棒性和计算资源利用方面的权衡。我们的实证结果表明,硬件选择对系统鲁棒性有显著影响:立体配置始终优于单目和RGB-D模态,全局快门相机相比卷帘快门相机显著减少了运动引起的跟踪失败,并且关键的是,在剧烈的腿部运动下,标准惯性集成可能降低主要基于视觉的框架的性能。这些见解还为定制传感器负载提供了具体的设计指南,以实现敏捷腿部系统的可靠感知。

英文摘要

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

2606.19154 2026-06-18 cs.RO 新提交

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Viking Hill数据集:用于森林场景检测与分割的激光雷达-雷达-相机数据集

Vladimír Kubelka, Oleksandr Kotlyar, Unal Artan, Martin Magnusson

发表机构 * Örebro University, AASS research centre, Robot Navigation and Perception Lab(厄勒布鲁大学,AASS研究中心,机器人导航与感知实验室)

AI总结 提出首个包含4D成像雷达的森林多传感器数据集,通过MinkowskiUNet实现雷达与激光雷达点云的语义分割,并评估树干分割质量与树木尺寸的关系。

Comments 33 pages, 11 figures

详情
AI中文摘要

在森林冠层下运行的自主机器人需要对树木及周围植被在不同季节条件下进行稳健感知。现有的林业数据集提供带有单棵树标注的激光雷达或相机数据,但均未包含共配准的4D成像雷达,而这一模态因其对视觉退化、表面污染和植被遮挡的鲁棒性而日益受到关注。我们介绍了一个由移动机器人收集的多传感器森林数据集,该机器人配备了高分辨率FMCW成像雷达、激光雷达、RGB相机、IMU和RTK-GNSS。该场地在植被状态截然不同的两次采集中记录,3D长方体标注(包括每棵树的直径估计)为所有三种感知模态提供了共享语义标签。此外,我们提供了使用MinkowskiUNet对雷达和激光雷达点云进行语义分割的基线结果。雷达在主要类别(地面91%,冠层86%)上取得了可与激光雷达相媲美的IoU分数,但在几何精细结构(如树干)上落后(56%对74%)。跨模态分析进一步将激光雷达和雷达的树干分割与RGB检测模型进行比较,而按直径分层的评估揭示了树干分割质量如何随树木尺寸变化。除了分割,共配准的多模态数据和RTK-GNSS辅助参考定位还支持冠层下地图构建、定位和传感器融合的研究。数据集和标注工具已公开。

英文摘要

Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per-tree annotations, but none include co-registered 4D imaging radar -- a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi-sensor forest dataset collected by a mobile robot equipped with a high-resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK-GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations -- including per-tree diameter estimates -- provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross-modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter-stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co-registered multi-modal data and RTK-GNSS-aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.
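As a reading aid for the per-class IoU figures quoted above, the metric can be sketched as follows. This is a minimal illustration on made-up point labels, not the authors' evaluation code; the function name `per_class_iou` and the toy labels are assumptions introduced here.

```python
from collections import Counter


def per_class_iou(pred, target, classes):
    """Return {class: IoU} for two equal-length sequences of point labels.

    IoU for class c = |pred == c AND target == c| / |pred == c OR target == c|.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    pred_count = Counter(pred)      # points predicted as each class
    target_count = Counter(target)  # points labeled as each class
    inter = Counter(p for p, t in zip(pred, target) if p == t)
    ious = {}
    for c in classes:
        union = pred_count[c] + target_count[c] - inter[c]
        # A class absent from both prediction and ground truth has no defined IoU.
        ious[c] = inter[c] / union if union else float("nan")
    return ious


# Toy labels, made up for illustration (not taken from the dataset).
target = ["ground", "ground", "trunk", "trunk", "canopy", "canopy"]
pred = ["ground", "ground", "trunk", "canopy", "canopy", "canopy"]
ious = per_class_iou(pred, target, ["ground", "trunk", "canopy"])
```

In this toy case one trunk point is mislabeled as canopy, which lowers the IoU of both classes while ground stays perfect.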

2606.19161 2026-06-18 cs.RO 新提交

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

HT-Bench:基于自我中心视觉的灵巧全手触觉表示基准与学习

Yuzhe Huang, Jiaping Wu, Jiaming Jiang, Hezhe Lin, Aikebaier Aierken, Yunlong Wang, Kun Cheng, Ziyuan Jiao, Yuanxin Zhong

发表机构 * Beihang University(北京航空航天大学) Rimbot BUPT(北京邮电大学) ShanghaiTech University(上海科技大学) Tsinghua University(清华大学) CAS(中国科学院)

AI总结 提出HT-Bench多任务基准和HandTouch编码器,通过大规模自我中心视觉与全手触觉数据,在触觉相似性检索、掩码修复、视觉到触觉合成等任务上验证了触觉表示的有效性。

Comments 9 pages, 4 figures

详情
AI中文摘要

由于触觉传感器设计、数据格式和机器人形态的多样性,为机器人操作中的触觉表示学习建立通用基准仍然具有挑战性。我们并未试图建立这样的基准,而是探索了一个可扩展且有前景的未来发展方向:将自我中心视觉与全手触觉数据配对。为此,我们引入了HT-Bench,一个用于灵巧全手触觉感知的大规模多任务基准,包含在226个任务中收集的1000万RGB帧和780万触觉帧。HT-Bench从三个关键角度评估触觉表示:它们是否编码有意义的接触几何、是否能够将触觉观测与视觉信息对齐、以及是否能够泛化到未见任务。为评估这些能力,HT-Bench包含四个任务:细粒度触觉相似性检索、掩码触觉修复、视觉到触觉合成以及多模态触觉帧预测。我们进一步提出了HandTouch,一个矢量量化视觉-触觉编码器,通过渐进的空间、跨模态和时间训练学习触觉表示。在HT-Bench上,HandTouch始终优于代表性的触觉编码器基线,将细粒度触觉相似性检索的Recall@5从74.65%提高到85.23%,将掩码触觉修复的RMSE从0.022降低到0.010,并将视觉到触觉合成的OOD cIoU从0.628提高到0.705。这些结果证明了HandTouch的有效性,并表明大规模自我中心全手触觉数据为评估和推进灵巧操作中的触觉表示学习提供了可扩展的基础。

英文摘要

Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than seeking to establish such a benchmark, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce HT-Bench, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representations from three key perspectives: whether they encode meaningful contact geometry, whether they can align tactile observations with visual information, and whether they generalize to unseen tasks. To assess these capabilities, HT-Bench includes four tasks: fine-grained tactile similarity retrieval, masked tactile inpainting, vision-to-tactile synthesis, and multimodal tactile frame prediction. We further propose HandTouch, a vector-quantized vision-tactile encoder that learns tactile representations through progressive spatial, cross-modal, and temporal training. Across HT-Bench, HandTouch consistently outperforms representative tactile encoder baselines, improving Recall@5 on fine-grained tactile similarity retrieval from 74.65% to 85.23%, reducing RMSE on masked tactile inpainting from 0.022 to 0.010, and increasing OOD cIoU on vision-to-tactile synthesis from 0.628 to 0.705. These results demonstrate the effectiveness of HandTouch and suggest that large-scale egocentric full-hand tactile data provides a scalable basis for evaluating and advancing tactile representation learning in dexterous manipulation.
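For readers unfamiliar with the Recall@5 retrieval metric mentioned above, a common definition is sketched below: the fraction of queries whose top-K retrieved items contain at least one relevant item. The abstract does not spell out the exact protocol used in HT-Bench, so this definition, the function name `recall_at_k`, and the toy rankings are all assumptions for illustration.

```python
def recall_at_k(rankings, relevant_sets, k):
    """Fraction of queries with at least one relevant item in the top k.

    rankings: list of ranked item-id lists, one per query (best match first).
    relevant_sets: list of sets of relevant item ids, one per query.
    """
    if len(rankings) != len(relevant_sets):
        raise ValueError("need one relevant set per query")
    if not rankings:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranking, relevant in zip(rankings, relevant_sets)
        if any(item in relevant for item in ranking[:k])
    )
    return hits / len(rankings)


# Toy retrieval results, made up for illustration.
rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant_sets = [{"b"}, {"f"}, {"z"}]
r_at_2 = recall_at_k(rankings, relevant_sets, k=2)
r_at_3 = recall_at_k(rankings, relevant_sets, k=3)
```

Here only the first query has a relevant item in its top 2, while the second query is recovered once the cutoff is widened to 3.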

2606.19176 2026-06-18 cs.RO cs.AI cs.SY eess.SY 新提交

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

用于自主海上无人机飞行的深度单目位姿估计的硬件与视觉在环验证

Maneesha Wickramasuriya, Beomyeol Yu, Jaden Shin, Mason Huslig, Taeyoung Lee, Murray Snyder

发表机构 * George Washington University(乔治华盛顿大学)

AI总结 提出硬件验证的视觉在环框架,结合基于深度Transformer的单目位姿估计器和延迟卡尔曼滤波器,在模拟逼真海上环境的同时实现完全自主的室内飞行,并捕捉了纯仿真中缺失的感知延迟等嵌入式效应。

Comments 6 pages, 9 figures

详情
AI中文摘要

船舶上的自主无人机操作需要可靠的基于视觉的相对位姿估计,然而海上验证成本高、依赖天气且风险大。本文提出一个硬件验证的视觉在环框架,能够在模拟逼真海上环境的同时实现完全自主的室内飞行。渲染的海上视图由板载的基于深度Transformer的单目位姿估计器处理。延迟的视觉测量与高频IMU数据通过延迟卡尔曼滤波器融合,为几何控制提供一致的状态估计。该系统捕捉了纯仿真中缺失的关键嵌入式效应,包括感知延迟、异步更新和计算约束。自主起飞、轨迹跟踪和着陆实验证明了稳定的闭环飞行。结果建立了一个安全且硬件真实的中间阶段,用于在船上部署之前开发海上无人机自主性。

英文摘要

Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.

2606.19186 2026-06-18 cs.RO cs.LG 新提交

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise

学习标注延迟和误报AEB事件:针对极端类别不平衡和非对称标签噪声的实用系统

Mengxiang Hao, Xin Jiang, Xinghao Huang, Wenliang Su, Zhiteng Wang, Junjie Rao, Xiaotian Yang, Wei Liao, Chengyu Han, Gen Liang, Yulun Song, Zhitao Xu, Xianpeng Lang

发表机构 * Li Auto(理想汽车)

AI总结 提出首个自动化AEB标注框架,通过特定数据增强和噪声抑制技术,解决极端类别不平衡和非对称标签噪声问题,将延迟/误报触发召回率提升80%,人工工作量减少50%。

Comments 8 pages, 5 figures, accepted by IEEE International Conference on Robotics and Automation (ICRA)

详情
Journal ref
2026 IEEE International Conference on Robotics and Automation (ICRA)
AI中文摘要

自动紧急制动(AEB)优化依赖于准确标注的真实世界触发事件,特别是揭示系统缺陷的罕见但关键的延迟和误报AEB触发事件。然而,这些少数样本在每天数千次触发事件中占比不到5%,使得大规模人工标注成本过高。我们提出了首个自动化AEB标注框架来解决这一问题。在开发过程中,我们识别出两个严重损害延迟/误报触发标注准确性的基本挑战:(1)极端类别不平衡,其中延迟/误报触发被真实触发淹没;(2)非对称标签噪声,其中误标注的多数样本(真实触发)抑制了少数样本(延迟/误报触发)的学习。为克服这些挑战,我们提出两项关键创新:(1)特定数据增强,通过操纵焦点目标属性、移植自车动态和掩蔽非焦点交通参与者来合成逼真样本;(2)噪声抑制,使用稳定硬度估计和探针引导的自适应阈值来清理误标注的真实触发样本。关键的是,我们将模型部署为具有全栈架构的实用标注系统,从每天数千个AEB事件中高效识别关键的延迟/误报触发。生产结果表明,延迟/误报触发的召回率提高了80%,人工工作量减少了50%。除了直接收益,该系统通过积累高质量标注实现持续自我改进,为车载AEB系统优化奠定了必要的数据基础。

英文摘要

Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) extreme class imbalance, where delayed/false triggers are overwhelmed by true triggers; and (2) asymmetric label noise, where mislabeled majority samples (true triggers) suppress the learning of minority samples (delayed/false triggers). To overcome these challenges, we propose two key innovations: (1) specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; and (2) noise suppression that uses stable hardness estimation and a probe-guided adaptive threshold to clean mislabeled true trigger samples. Crucially, we deploy our model as a practical annotation system with a full-stack architecture, efficiently identifying critical delayed/false triggers from thousands of daily AEB events. Production results demonstrate an 80% improvement in recall of delayed/false triggers and a 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization.

2606.19267 2026-06-18 cs.RO cs.SY eess.SY 新提交

A Mixed-Reality Testbed for Autonomous Vehicles

自动驾驶汽车的混合现实测试平台

H. M. Sabbir Ahmad, Ehsan Sabouni, Emrullah Celik, Zean Wan, Damola Ajeyemi, Christos G. Cassandras, Wenchao Li

发表机构 * Boston University(波士顿大学)

AI总结 提出一种混合现实硬件在环测试平台,集成物理移动机器人与高保真仿真环境,用于验证感知、规划和控制算法,并支持多智能体系统研究。

Comments 9 pages, 7 figures, 1 table

详情
AI中文摘要

我们提出了一种用于自动驾驶汽车的混合现实、硬件在环(HIL)测试平台,该平台将物理移动机器人测试平台与高保真仿真环境无缝集成。虚拟仿真能够创建多样化的、安全关键的驾驶场景,以验证最先进的感知、规划和控制算法,同时通过配备多模态传感器的物理机器人在逼真的虚拟环境中增强仿真,进一步促进严格的验证。我们的测试平台还利用无线通信实现车辆连接,并通过物理机器人和虚拟仿真智能体的组合容纳大量智能体,支持包括网联自动驾驶汽车(CAV)在内的多智能体系统研究。最后,我们提出了一种结合感知、规划和一种新颖的基于控制障碍函数(CBF)的在线学习控制器的安全保证框架,用于CAV。使用所提出框架的实验验证并展示了测试平台的关键功能以及其在弥合仿真与真实世界硬件部署之间差距方面的整体效用。

英文摘要

We propose a mixed-reality, hardware-in-the-loop (HIL) testbed for autonomous vehicles that seamlessly integrates a physical testbed of mobile robots with a high-fidelity simulation environment. The virtual simulation enables the creation of diverse, safety-critical driving scenarios to validate state-of-the-art perception, planning, and control algorithms, while augmenting simulations with physical robots equipped with multimodal sensors in photorealistic virtual environments further facilitates rigorous validation. Our testbed also features vehicular connectivity using wireless communication and can accommodate a large number of agents through the combination of physical robots and virtual simulated agents, supporting research on multi-agent systems including Connected and Autonomous Vehicles (CAVs). Finally, we present a safety-guaranteed framework combining perception, planning, and a novel online learning-based controller using Control Barrier Functions (CBFs) for CAVs. Experiments with the proposed framework validate and demonstrate the key functionalities of the testbed and its overall utility in bridging the gap between simulation and real-world hardware deployment.
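To give a feel for what a Control Barrier Function (CBF) constraint does, here is a textbook-style safety filter for a one-dimensional single integrator. It is a generic sketch and not the authors' online learning-based controller; the dynamics x' = u, the barrier h(x) = x_max - x, and all names and parameter values below are assumptions chosen for illustration.

```python
def cbf_filter(u_nom, x, x_max, alpha):
    """Closest control to u_nom that satisfies the CBF condition for x' = u.

    Barrier: h(x) = x_max - x (safe when h >= 0).
    Condition: dh/dt + alpha * h >= 0, i.e. -u + alpha * (x_max - x) >= 0,
    which gives the upper bound u <= alpha * (x_max - x).
    For a scalar control, the minimally invasive solution is a simple clip.
    """
    return min(u_nom, alpha * (x_max - x))


def simulate(x0=0.0, x_max=1.0, u_nom=2.0, alpha=1.0, dt=0.01, steps=1000):
    """Forward-Euler rollout of x' = u with the CBF filter in the loop."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        u = cbf_filter(u_nom, x, x_max, alpha)
        x += dt * u  # Euler step; keeps h >= 0 as long as alpha * dt <= 1
        trajectory.append(x)
    return trajectory


trajectory = simulate()
```

The nominal command alone would push the state past `x_max`; with the filter in the loop the state approaches the boundary but stays inside the safe set.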

11. 安全、鲁棒性与可信机器人 1 篇

2606.18632 2026-06-18 cs.RO 新提交

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

ROBOSHACKLES: 面向具身基础模型中人体伤害预防的安全数据集

Zhuowen Yin, Chongyang Liu, Wenzhang Yang, Renjue Li, Yinxing Xue

发表机构 * Institute of AI for Industries, Chinese Academy of Sciences(工业人工智能研究所,中国科学院) University of Science and Technology of China(中国科学技术大学)

AI总结 为解决机器人伤害人类数据难以安全收集的问题,提出基于真实观测的安全数据构建流水线,生成包含1万条视频的ROBOSHACKLES数据集,涵盖直接和间接伤害类别,评估发现现有模型在安全关键场景下100%产生不安全动作。

详情
AI中文摘要

具身基础模型(EFMs)整合了多模态理解、未来状态推理和可执行的机器人动作。然而,它们在预防人体伤害方面的安全对齐仍未得到充分探索,主要是因为机器人伤害人类或造成危险家庭情境的真实世界数据无法安全或合乎道德地收集。为应对这一挑战,我们提出了一种针对人体伤害预防的安全关键数据构建流水线。该流水线从真实的DROID观测出发,经过场景理解、危险感知图像编辑、时间提示生成和单次rollout合成等步骤。时间提示指定了预期的场景演变,而Wan2.7则从编辑后的危险状态中单次合成逼真的机器人rollout视频。利用该流水线,我们构建了ROBOSHACKLES,一个包含10,000条机器人视频片段的数据集,源自真实的DROID观测,涵盖两个直接伤害和四个间接伤害类别。为确保数据集质量,我们使用自动指标评估任务完成度和视觉质量,并在基于拒绝的安全准则下评估了六个代表性EFM。结果表明,所有评估模型在测试的安全关键场景中都产生了不安全动作,不安全动作生成率为100%。ROBOSHACKLES可作为拒绝学习和机器人动作执行前危险预测的可扩展基准和训练资源。该数据集公开于https://huggingface.co/datasets/YZW00/RoboShackles。

英文摘要

Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. Yet their safety alignment for human-injury prevention remains underexplored, primarily because real-world data of robots harming humans or creating hazardous household situations cannot be safely or ethically collected. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention in EFMs. Starting from real DROID observations, our construction pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robotic rollouts from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a 10,000-clip robotic video dataset derived from real DROID observations, spanning two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. Results show that all evaluated models produce unsafe actions in the tested safety-critical scenarios, yielding a 100% unsafe action generation rate. ROBOSHACKLES serves as a scalable benchmark and training resource for refusal learning and hazard anticipation before robot action execution. The dataset is publicly available at https://huggingface.co/datasets/YZW00/RoboShackles.