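arXivDaily: a daily academic digest that syncs the full arXiv dataset and provides AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.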

2606.18328 2026-06-18 cs.RO New submission

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

Bowen Li, Mayank Mishra, Y. Isabel Liu, Stone Tao, Nishanth Kumar, Alexander G. Gray, Ruwan Wickramarachchi, Jonathan Francis, Sebastian Scherer, Tom Silver

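Affiliations: CMU (Carnegie Mellon University); Princeton (Princeton University); AI2 (Allen Institute for Artificial Intelligence); MIT (Massachusetts Institute of Technology); Centaur AI; Bosch Center for AI

AI summary: Proposes ReSYNC, which alternates skill learning with concept discovery to progressively build abstract predicates from failure-recovery experience, enabling global failure avoidance and long-horizon planning and outperforming strong baselines by over 50%.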

Comments 9 pages, 6 figures. Website: https://jaraxxus-me.github.io/ReSYNC/

Abstract

Intelligent robots should not only recover from failures, but also acquire the abstract knowledge needed to avoid them in the future. While reinforcement learning (RL) can learn reactive recovery behaviors, training a separate policy for every distinct failure mode is highly inefficient. We introduce Recovery-Driven Synthesis of Relational Concepts (ReSYNC), the first approach that progressively discovers and refines state abstractions (relational predicates) from failure-recovery experience to support abstract planning. Unlike purely reactive methods, ReSYNC jointly learns skills and concepts through an incremental dual-learning process. In the skill-learning phase, the robot uses RL to learn to recover from failures seen in training tasks. In the concept-learning phase, the robot discovers new relational predicates and refines its abstract planning model to explain and generalize the learned recovery behaviors. This interaction enables ReSYNC to convert local recoveries seen during training into global failure avoidance at test time. Across four simulated domains, we show that ReSYNC's ability to continually expand and refine its abstraction library allows it to solve long-horizon, previously unseen problems, outperforming strong baselines by over 50%. Additionally, we demonstrate sim-to-real transfer of ReSYNC, where it performs real-world non-prehensile manipulation skills and generalizes to unseen scenarios through abstract planning. Overall, ReSYNC represents a significant step toward robots that autonomously acquire abstractions for scalable, failure-aware planning in the physical world.

2606.18589 2026-06-18 cs.RO New submission

DREAM-Chunk: Reactive Action Chunking with Latent World Model

Wenxi Chen, Kaidi Zhang, Chi Lin, Zhiyuan Zhang, Yu She, Yuejiang Liu, Raymond A. Yeh, Shaoshuai Mou, Yan Gu

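Affiliations: Purdue University; Stanford University

AI summary: Proposes DREAM-Chunk, which uses a lightweight latent world model at test time to sample multiple candidate action chunks and execute actions from the chunk whose predicted state best matches the observation, improving the robustness of action-chunking policies under stochastic dynamics.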

Abstract

Action chunking has become a common interface for vision-language-action (VLA) models, enabling low-frequency policy inference to drive high-frequency robot execution. However, once an action chunk is committed, its open-loop execution can be brittle under stochastic dynamics, hardware execution errors, and partial observability. We propose DREAM-Chunk, a test-time scaling method that augments chunking-based policies with a lightweight latent world model, without requiring additional policy fine-tuning. At test time, DREAM-Chunk samples multiple candidate action chunks, rolls out their predicted latent futures, and selects actions from the chunk whose predicted state best matches the observed rollout. In this way, DREAM-Chunk uses additional test-time computation to cover multiple plausible stochastic futures and improve reactivity during long-horizon chunk execution. On the Kinetix benchmark, DREAM-Chunk improves robustness under increasing action noise and benefits from larger candidate sample sizes, especially when demonstrations contain corrective behaviors. We further validate DREAM-Chunk on four manipulation tasks across two robot platforms and two VLA policies under various sources of stochasticity. Across simulation and hardware experiments, DREAM-Chunk improves the robustness of action-chunking policies in stochastic dynamics.
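
The selection step described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Euclidean distance, the list-of-lists latent format, and the names `select_chunk`, `predicted_latents`, and `observed_latent` are all assumptions, because the abstract does not say how the match between predicted and observed states is scored.

```python
import math

def select_chunk(predicted_latents, observed_latent, step):
    """Return the index of the candidate chunk whose predicted latent state
    at `step` is closest (Euclidean distance, an assumption) to the latent
    state actually observed at that step."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(
        range(len(predicted_latents)),
        key=lambda i: dist(predicted_latents[i][step], observed_latent),
    )

# Three hypothetical candidate chunks, each with a 2-step latent rollout.
predicted = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.5]],
]
best = select_chunk(predicted, observed_latent=[0.1, 0.9], step=1)
```

In this toy case the second candidate is selected, because its predicted latent at step 1 lies nearest to the observation. Re-running the selection at every control step is what would let execution switch between sampled chunks as the observed rollout drifts.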

2606.18772 2026-06-18 cs.RO New submission

HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations

Zehui Zhao, Yuxuan Zhao, Gaojing Zhang, Chenxi Liu, Maolin Zheng, Wenzhao Lian

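Affiliations: Shanghai Jiao Tong University; University of Sussex; East China University of Science and Technology

AI summary: Proposes the HALOMI framework, which extends the Universal Manipulation Interface (UMI) with active perception and uses a manifold-constrained controller and observation-action alignment. On a Unitree G1 humanoid it is validated on five real-world tasks and reaches an 85% average success rate on the three that are evaluated quantitatively.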

Abstract

Human demonstrations, which can be collected at scale and naturally capture active hand-eye coordination, are a promising data source for learning humanoid loco-manipulation. However, directly transferring human demonstrations to humanoids requires a precise world-frame tracking controller, which is often brittle under out-of-distribution (OOD) targets, while human-to-humanoid gaps persist in both egocentric observation and action execution. To address these challenges, we present HALOMI, a scalable framework for learning humanoid loco-manipulation with active perception from human demonstrations. HALOMI extends the Universal Manipulation Interface (UMI) with egocentric sensing to collect ego-view and wrist-view observations along with head-hand trajectories at scale. We further propose a manifold-constrained controller that plans in a learned latent behavior manifold to enable precise and robust head-hand tracking in the world frame. To bridge the human-to-humanoid gap, we perform ego-view alignment and introduce a controller-aware reference trajectory adaptation to reduce mismatch in both observation and action execution. We validate HALOMI on a Unitree G1 humanoid robot with an actuated neck across five real-world tasks involving navigation, grasping, bimanual manipulation, whole-body coordination, and dynamic behaviors. Across the three quantitatively evaluated tasks, HALOMI achieves an average success rate of 85%, while additional qualitative demonstrations show its ability to support dynamic tossing and deep-squat grasping.

2606.18953 2026-06-18 cs.RO New submission

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

Kinam Kim, Namiko Saito, Heecheol Kim, Katsushi Ikeuchi, Jaegul Choo, Yasuyuki Matsushita

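Affiliations: KAIST (Korea Advanced Institute of Science and Technology); Microsoft Research Asia - Tokyo; The University of Tokyo

AI summary: Proposes an object-centric residual RL framework whose policy is trained only in simulation and transferred zero-shot to a real robot, raising the VLA success rate from 42% to 76%.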

Comments 8 pages, 7 figures, 2 tables; 8-page appendix

Abstract

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/
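
The two mechanisms named in the abstract, a corrective residual on top of a frozen base policy and pose corruption during training, can be sketched roughly as below. Everything specific here is an assumption made for illustration: the additive combination, the clipping range, the `scale` factor, and the names `corrupt_pose` and `refine_action` do not come from the paper.

```python
import random

def corrupt_pose(pose, noise_std, dropout_prob, rng):
    """Training-time corruption of an object pose estimate: either drop it
    entirely (returns None) or add zero-mean Gaussian noise per component."""
    if rng.random() < dropout_prob:
        return None  # pose estimate missing for this step
    return [p + rng.gauss(0.0, noise_std) for p in pose]

def refine_action(base_action, residual, scale=0.1):
    """Add a clipped, scaled correction to the frozen base policy's action.
    Clipping to [-1, 1] and the scale value are illustrative assumptions."""
    clipped = [max(-1.0, min(1.0, r)) for r in residual]
    return [a + scale * c for a, c in zip(base_action, clipped)]

rng = random.Random(0)
noisy_pose = corrupt_pose([0.4, 0.0, 0.2], noise_std=0.01, dropout_prob=0.1, rng=rng)
refined = refine_action(base_action=[0.5, -0.2], residual=[2.0, -0.5])
```

Bounding the residual keeps the corrected action close to what the base policy proposed, which is one common motivation for residual formulations; whether the paper bounds its residual this way is not stated in the abstract.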

2606.19328 2026-06-18 cs.LG cs.AI cs.RO Cross-list

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

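Affiliations: Learning, Embodied Autonomy, and Forecasting (LEAF) Lab, University of Toronto

AI summary: Proposes UBP2, which actively directs exploration by jointly reasoning over uncertainty in the reward, dynamics, and value functions, achieving substantially higher sample efficiency on the Meta-World benchmark.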

Abstract

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
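
A unified score of the kind the abstract describes, combining expected reward, terminal value, and epistemic uncertainty, might look like the sketch below. The exact formula is not given in the abstract, so this is an assumption: it uses the ensemble mean of (reward + terminal value) as the expected return and the ensemble standard deviation as the uncertainty term, weighted by a hypothetical coefficient `beta`. It also collapses reward, dynamics, and value uncertainty into a single spread for brevity.

```python
from statistics import mean, pstdev

def trajectory_score(rewards, terminal_values, beta=1.0):
    """Score one candidate trajectory from per-ensemble-member estimates.

    rewards[i] and terminal_values[i] are the summed reward and the terminal
    value predicted by ensemble member i (a hypothetical data format).
    """
    totals = [r + v for r, v in zip(rewards, terminal_values)]
    # Ensemble mean approximates the expected return; ensemble spread is a
    # stand-in for epistemic uncertainty (an assumption, see lead-in).
    return mean(totals) + beta * pstdev(totals)

def plan(candidates, beta=1.0):
    """Return the index of the highest-scoring candidate trajectory."""
    scores = [trajectory_score(r, v, beta) for r, v in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

candidates = [
    ([2.0, 2.0], [0.0, 0.0]),  # ensemble agrees: every member predicts 2.0
    ([1.0, 3.0], [0.0, 0.0]),  # ensemble disagrees: mean 2.0, spread 1.0
]
choice = plan(candidates, beta=1.0)
```

With `beta=1.0` the planner prefers the second trajectory, whose expected return is the same but whose outcome the ensemble is less sure about. That is the exploitation versus information-acquisition tradeoff in miniature.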

2606.18514 2026-06-18 cs.RO cs.LG New submission

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin

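Affiliations: Department of Computer Science and Engineering, University of California, Merced

AI summary: Proposes the N(CO)$^2$ framework, which uses reinforcement learning to solve the stochastic orienteering problem without hand-crafted heuristics, optimizing path selection under uncertainty with performance competitive with a state-of-the-art MILP.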

Journal ref: In Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), 2025

Abstract

Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.
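
The abstract does not spell out the chance constraint, but a common form in stochastic orienteering bounds the probability that a route's random travel cost exceeds the budget. The sketch below checks such a constraint by Monte Carlo sampling. The constraint form, the sampler interface, and the names `failure_probability` and `satisfies_chance_constraint` are assumptions for illustration, not the paper's method.

```python
import random

def failure_probability(route_samplers, budget, n_samples, rng):
    """Monte Carlo estimate of P(total travel cost > budget) for one route.
    Each sampler is a callable taking the RNG and returning one edge cost."""
    failures = 0
    for _ in range(n_samples):
        total = sum(sample(rng) for sample in route_samplers)
        if total > budget:
            failures += 1
    return failures / n_samples

def satisfies_chance_constraint(route_samplers, budget, alpha, n_samples=1000, rng=None):
    """True if the estimated failure probability is at most alpha."""
    rng = rng or random.Random(0)
    return failure_probability(route_samplers, budget, n_samples, rng) <= alpha

# Hypothetical two-edge route whose edge costs are uniform in [1, 2].
route = [lambda rng: rng.uniform(1.0, 2.0)] * 2
ok = satisfies_chance_constraint(route, budget=4.0, alpha=0.05)
```

Here the route can never cost more than 4.0, so the constraint holds for any `alpha`. A learned policy would need to trade collected reward against this kind of feasibility check when deciding which vertex to visit next.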

2606.18625 2026-06-18 cs.RO New submission

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

Xiaowen Hu, Linqi Ye, Yudi Zhu, Chenyue Shao, Rankun Li, Qingdu Li, Yan Peng

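Affiliations: Institute of Artificial Intelligence, Shanghai University; Institute of Machine Intelligence, University of Shanghai for Science and Technology

AI summary: Proposes the SRL framework, which combines the physically grounded SLIP baseline with the adaptability of reinforcement learning by pairing SLIP-based feedforward control signals with RL-driven real-time feedback to optimize robotic jumping. It needs much less training time than the baseline while keeping tracking accurate.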

Comments 17 pages, 12 figures

Abstract

Robotic jumping is pivotal in applications such as search and rescue and logistics, where crossing obstacles and enhancing mobility efficiency are critical. The Spring-Loaded Inverted Pendulum (SLIP) model leverages simplified spring-mass dynamics that naturally encode biologically plausible hopping motions, yet its performance degrades on irregular terrain due to idealized assumptions regarding contact and joint dynamics. Meanwhile, Reinforcement Learning (RL) can adapt to diverse and complex environments but often requires extensive data from unguided exploration. The complementary strengths of SLIP's physically grounded baseline and RL's adaptive capabilities motivate a hybrid framework that overcomes these individual limitations. We therefore propose Spring-loaded Reinforcement Learning (SRL), which integrates SLIP-based feedforward control signals with RL-driven real-time feedback, enabling continuous optimization of robotic jumping. Experimental results demonstrate that SRL can achieve more stable jumps with much less training time than the baseline method, maintaining an average position tracking error below 0.1 m and velocity tracking errors within ±3% of the target values. Through bipedal and quadrupedal simulations of ground and stair jumping, as well as sim-to-sim and sim-to-real validations, SRL exhibits robust adaptability to various task requirements and environmental complexities, underscoring its potential for real-world deployment.
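
The feedforward-plus-feedback structure can be illustrated with a toy sketch. The linear spring law is the textbook SLIP leg model; the additive combination, the bound on the learned term, and all function and parameter names are assumptions, since the abstract does not describe how the two signals are merged.

```python
def slip_spring_force(stiffness, rest_length, leg_length):
    """Axial force of the massless SLIP leg spring (linear spring law).
    The force is zero once the leg is at or beyond its rest length."""
    return max(0.0, stiffness * (rest_length - leg_length))

def combined_command(feedforward, feedback, feedback_limit):
    """Model-based feedforward term plus a bounded learned correction.
    The bound is an illustrative assumption, not taken from the paper."""
    bounded = max(-feedback_limit, min(feedback_limit, feedback))
    return feedforward + bounded

def within_relative_tolerance(measured, target, tolerance=0.03):
    """Check a tracking result against a relative band such as +/-3%."""
    return abs(measured - target) <= tolerance * abs(target)

u_ff = slip_spring_force(stiffness=2000.0, rest_length=0.5, leg_length=0.45)
u = combined_command(u_ff, feedback=30.0, feedback_limit=20.0)
```

In this example the spring contributes roughly 100 N and the learned correction is capped at 20 N, so the model supplies most of the command and the policy only has to learn the remainder. That division of labor is one plausible reason a hybrid scheme could train faster than unguided RL.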

2606.18730 2026-06-18 cs.RO cs.AI math.CO math.OC New submission

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

Allen George Philip, Anoop Bhat, Sivakumar Rathinam, Howie Choset

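Affiliations: Texas A&M University; Carnegie Mellon University

AI summary: For the moving-target traveling salesman problem with moving obstacles, proposes a mixed-integer conic programming formulation and a two-phase bilevel search algorithm, both of which significantly outperform the baseline.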

Abstract

The Moving-Target Traveling Salesman Problem (MT-TSP) seeks a minimum cost trajectory for an agent that departs from a static depot, visits a set of moving targets, each within one of their assigned time windows, and returns to the depot. In this article, we study the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), a generalization of the MT-TSP where the agent trajectory must avoid moving obstacles. We present a Mixed-Integer Conic Programming (MICP) formulation that can be solved using off-the-shelf solvers, as well as a fast and scalable Two-Phase Bilevel Search (TPBS) algorithm that computes high-quality feasible solutions for the problem. We evaluate our approaches against an existing baseline algorithm on a broad range of problem instances with up to 40 targets and 40 obstacles. The results demonstrate that both the proposed methods significantly outperform the baseline with respect to success rates, solution costs, and computation time.

2606.18828 2026-06-18 cs.RO cs.AI New submission

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

Chenghao Xu

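Affiliations: National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University

AI summary: Proposes placing intelligence in the space itself: a neural semigroup-superposition mechanism generates a Riemannian metric so that action reduces to geodesic following. Trained on a single two-obstacle scene, the model generalizes zero-shot to unseen obstacle configurations.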

Abstract

Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups -- frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.
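
To make the idea of a scene-induced metric concrete, the sketch below computes path cost under a hand-written isotropic metric that inflates distances near an obstacle. This is only an illustration of how a Riemannian metric can separate colliding from collision-free paths. The Gaussian bump, its `amplitude` and `width`, and the function names are assumptions and are not the learned semigroup-superposition field from the paper.

```python
import math

def conformal_factor(point, obstacle, amplitude=1000.0, width=0.1):
    """Scalar g(x) >= 1 of an isotropic metric G(x) = g(x) * I.
    g grows sharply near the obstacle center (hand-written assumption)."""
    d2 = sum((p - o) ** 2 for p, o in zip(point, obstacle))
    return 1.0 + amplitude * math.exp(-d2 / (2.0 * width ** 2))

def path_cost(waypoints, obstacle, steps_per_segment=200):
    """Riemannian length of a piecewise-linear path, approximated with the
    midpoint rule: sum of sqrt(g(midpoint)) times Euclidean step length."""
    cost = 0.0
    for a, b in zip(waypoints, waypoints[1:]):
        step_len = math.dist(a, b) / steps_per_segment
        for i in range(steps_per_segment):
            t = (i + 0.5) / steps_per_segment
            mid = [x + t * (y - x) for x, y in zip(a, b)]
            cost += math.sqrt(conformal_factor(mid, obstacle)) * step_len
    return cost

obstacle = (0.0, 0.0)
straight = path_cost([(-1.0, 0.0), (1.0, 0.0)], obstacle)            # cuts through
detour = path_cost([(-1.0, 0.0), (0.0, 1.0), (1.0, 0.0)], obstacle)  # goes around
```

Even though the detour is longer in Euclidean terms, its cost under this metric stays close to its Euclidean length, while the straight path that crosses the obstacle costs several times more. A geodesic under such a metric therefore bends around the obstacle without a separate collision checker.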

2606.18883 2026-06-18 cs.RO New submission

Self-Supervised Mask-Aware Transformer for Fault-Tolerant FBG Force Sensing in Minimally Invasive Surgical Robotics

Peibo Sun, Shiyuan Dong, Shucheng Ye, Jianrong Cai, Yushan Liu, Hongen Liao, Tianqi Huang, Fang Chen

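Affiliations: Shanghai Jiao Tong University; Shenzhen International Graduate School, Tsinghua University

AI summary: To address degraded force estimation in minimally invasive surgical robots caused by channel coupling and fiber fractures in FBG sensors, proposes a unified self-supervised mask-aware Transformer. Through masked-channel reconstruction pretraining and fine-tuning with a dynamic corruption curriculum, it degrades gracefully under multi-channel failures and achieves a nominal RMSE of 0.0066 N on an 8-channel dataset.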

Abstract

In minimally invasive surgical robotics, catheter-scale Fiber Bragg Grating (FBG) sensors are promising due to their ability to estimate multi-dimensional forces by multiplexing several optical channels. However, deploying these compact multi-channel sensors introduces two critical engineering challenges: inherent nonlinear cross-axis coupling during complex deformations, and intermittent channel dropouts caused by fiber fractures in constrained workspaces. These compounding issues severely degrade force estimation. Existing fault-tolerant approaches rely on combinatorial model banks, which scale exponentially with the channel count and demand prohibitively expensive per-pattern calibration. In this paper, we propose a unified, self-supervised mask-aware Transformer that explicitly models channel availability to enable graceful degradation under diverse and dynamic sensor failures. The encoder is pretrained via masked-channel reconstruction on unlabeled data streams and fine-tuned for force regression using a balanced clean-and-corrupted-view objective alongside a dynamic corruption curriculum. Furthermore, a parallel uncertainty head, trained via heteroscedastic Gaussian negative log-likelihood, predicts per-axis confidence in a single forward pass, circumventing the overhead of multi-pass ensembles. Evaluated on a catheter-scale 8-channel FBG dataset, our single unified model achieves a nominal Root Mean Square Error (RMSE) of 0.0066 N and degrades gracefully to 0.0126 N under severe 4-channel failures. This significantly outperforms a comprehensive model bank of 255 per-pattern neural networks (0.0154 N at 4-channel loss) while eliminating pattern-specific calibration.
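
Three details in the abstract lend themselves to a small sketch: why a per-pattern bank for 8 channels has 255 members, what a mask-aware input could look like, and the form of a heteroscedastic Gaussian negative log-likelihood. The count is plain arithmetic (2^8 - 1 non-empty channel subsets, which is presumably how the bank is indexed). The input encoding and the function names are assumptions, and the loss is the standard textbook form, not necessarily the paper's exact objective.

```python
import math
from itertools import combinations

def count_nonempty_channel_subsets(n_channels):
    """Number of distinct non-empty subsets of n channels (equals 2**n - 1)."""
    return sum(
        1
        for k in range(1, n_channels + 1)
        for _ in combinations(range(n_channels), k)
    )

def mask_aware_input(readings, available):
    """Zero out unavailable channels and append the availability mask so a
    single model can tell 'reading is zero' apart from 'channel is missing'.
    This encoding is a hypothetical illustration."""
    masked = [r if ok else 0.0 for r, ok in zip(readings, available)]
    return masked + [1.0 if ok else 0.0 for ok in available]

def gaussian_nll(target, mean, log_var):
    """Per-axis Gaussian negative log-likelihood with a predicted variance,
    up to an additive constant: 0.5 * (log_var + (y - mu)^2 / exp(log_var))."""
    return 0.5 * (log_var + (target - mean) ** 2 / math.exp(log_var))

bank_size = count_nonempty_channel_subsets(8)
features = mask_aware_input([0.3, 0.7], [True, False])
```

The exponential growth of `bank_size` with channel count is the scaling problem the abstract attributes to model banks, whereas a single mask-aware model takes the availability pattern as part of its input.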

2606.19089 2026-06-18 cs.RO New submission

ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing

Alessandro Scherl, Bernhard Neuberger, Simon Schwaiger, David Mulero-Pérez, Lucas Muster, Jose Garcia-Rodriguez

Affiliations: Department of Computer Technology, University of Alicante; Department of Industrial Engineering, UAS Technikum Vienna; Automation and Control Institute, TU Wien; Institute of Software Engineering and Artificial Intelligence, Graz University of Technology; Institute for Integrative Nature Conservation Research, University of Natural Resources and Life Sciences Vienna

AI summary: Proposes ART-VS, a coarse-to-fine two-phase method that adapts feature granularity to improve the robustness and precision of visual servoing without task-specific training, reducing positioning error by 53% relative to standard-resolution servoing and running over 10x faster than full-resolution servoing.

Comments Accepted at IROS 2026

Abstract

Visual servoing with self-supervised Vision Transformer (ViT) features enables training-free robotic positioning with strong generalization, but faces a fundamental trade-off between robustness and precision. Coarse patch-level descriptors provide stable correspondences yet limit positioning accuracy. Increasing image resolution improves precision but yields only marginal robustness gains: under perturbation, high-resolution processing improves convergence success rate from 76.6% to just 81.0% despite 12x more ViT patches. Therefore, we propose Adaptive Resolution Tiling Visual Servoing (ART-VS), a two-phase method that adapts feature granularity to servoing progress: a coarse phase at native ViT resolution for stable alignment, then a tiled high-resolution phase that restricts matching to local neighborhoods, improving positioning accuracy. Without any task-specific training, ART-VS achieves 95.4% convergence under perturbation, outperforming standard and full-resolution ViT-based servoing by 18.8 and 14.4 percentage points. Over the former it reduces positioning error by 53%, while running at over 10x higher speed and 27% lower VRAM than the latter. We validate ART-VS across three ViT backbones and demonstrate real-world category-level grasping of unseen object instances, achieving 95/100 on transparent bottles and 98/100 on shoes. Code is available at https://art-vs.github.io/.
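
The reported margins follow directly from the convergence rates quoted in the abstract, as the short check below shows. The `choose_phase` function is a purely hypothetical switching rule added for illustration: the abstract says the method adapts granularity to servoing progress but does not say what triggers the switch, so the error threshold is an assumption.

```python
def percentage_point_gain(new_rate, old_rate):
    """Difference between two rates given in percent, in percentage points."""
    return round(new_rate - old_rate, 1)

def choose_phase(feature_error, switch_threshold):
    """Hypothetical switching rule: stay in the coarse phase while the
    feature error is large, then move to the tiled high-resolution phase."""
    return "coarse" if feature_error > switch_threshold else "tiled"

gain_vs_standard = percentage_point_gain(95.4, 76.6)  # vs standard resolution
gain_vs_full = percentage_point_gain(95.4, 81.0)      # vs full resolution
```

The two gains come out to 18.8 and 14.4 percentage points, matching the figures in the abstract.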

2606.19091 2026-06-18 cs.RO New submission

GCNGrasp-VP: Affordance-Guided View Planning for Efficient Task-Oriented Grasping

Zanjia Tong, Wenlong Dong, Chengjie Zhang, Hong Zhang

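Affiliations: Shenzhen Key Laboratory of Robotics and Computer Vision

AI summary: Proposes the GCNGrasp-VP framework, which uses affordance field prediction to guide active view planning without scene reconstruction. With only one view adjustment it significantly outperforms scene-uncertainty-driven baselines, and real-world tests show higher task-oriented grasp success under occlusion.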

Comments Accepted to IROS 2026

Abstract

Task-oriented grasping performance degrades significantly when object views suffer from occlusions. Existing task-oriented grasping methods typically assume task-relevant regions are visible in the initial frame, while view planning approaches enable active perception but often ignore task semantics and rely on time-consuming scene reconstruction. To address these limitations, we present GCNGrasp-VP, an efficient framework integrating affordance field prediction with active view planning. Central to this framework is GCNGrasp-v2, a task-oriented grasp model that simultaneously supports grasp evaluation and affordance field prediction, achieving constant-time inference complexity. Leveraging this capability, our Affordance-guided View Planner (Affordance-VP) utilizes the affordance field as an information gain metric to guide camera observation of task-relevant regions without requiring scene reconstruction. View planning results show that our method significantly outperforms scene-uncertainty-driven baselines with only one view adjustment. Real-world validation further confirms substantial improvements in grasp success rates for single-object scenarios while maintaining millisecond-level computational latency. Code and models are available at https://github.com/Instinct323/GCNGrasp-VP.

2606.19194 2026-06-18 cs.RO New submission

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

Yu Zhang, Kangyi Ji, Yongxiang Zou, Rongtao Xu, Feng Zheng, Long Cheng

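AI summary: Proposes an invertible neural network adapter that generates high-dimensional actions through a one-step denoising process, reducing inference complexity while maintaining accuracy; in real-world experiments it cuts average VLA inference latency from 110 ms to 61 ms.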

Abstract

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.

2606.19233 2026-06-18 cs.RO New submission

Mobile Pedipulation for Object Sliding via Hierarchical Control on a Wheeled Bipedal Robot

Yue Qin, Yulun Zhuang, Zelin Shen, Yanran Ding

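Affiliations: University of Michigan

AI summary: Proposes a hierarchical control framework that lets a wheeled bipedal robot slide planar objects with its wheeled legs, using a reduced-order three-rigid-body dynamics model and a trajectory-optimization-based motion planner. In hardware experiments the robot retrieves a 1 kg object from under a desk and slides a 4 kg object.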

Comments 8 pages, 7 figures

Abstract

In this letter, we present a hierarchical control framework that enables wheeled bipedal robots to perform planar object sliding tasks with their wheeled legs. The proposed approach formulates a nonlinear model predictive controller (NMPC) based on a reduced-order three rigid bodies (TRB) dynamical model that explicitly accounts for the hip roll degree of freedom and multiple wheel-environment contact modes, which is essential for lateral stepping and pedipulation tasks. Within this framework, the NMPC simultaneously regulates robot locomotion and interaction forces, allowing the robot to stably execute both rolling and object manipulation behaviors. A trajectory-optimization-based robot-object motion planner is developed to generate reference motions that incorporate stick-slip transitions in ground-object contact. Two representative pedipulation motions, namely scooting and lateral sliding, are validated through real-world hardware experiments, in which the robot successfully retrieves a 1 kg object from under a desk and slides a 4 kg object over a distance of 0.228 m via scooting.

2606.19314 2026-06-18 cs.RO New submission

Modeling Branches for Active Manipulation using Iterative Parameter Estimation

Madhav Rijal, Rashik Shrestha, Trevor Smith, Yu Gu

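Affiliations: Department of Mechanical and Aerospace Engineering, West Virginia University

AI summary: Proposes modeling plant branches by iteratively estimating material parameters, using finite-element simulation and a deformation-aware motion planner for precise branch manipulation, with average deformation energy reduced by 35.69%.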

Comments Accepted to IROS 2026

2606.19333 2026-06-18 cs.RO cs.CV New submission

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

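Affiliations: UC Berkeley (University of California, Berkeley)

AI summary: Proposes the DO AS I DO algorithm, which reconstructs hand-object interactions from monocular RGB human videos and retargets them to multi-fingered dexterous robotic hands, producing executable manipulation data and outperforming prior methods.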

Comments Project website: https://do-as-i-do.com/

AI中文摘要

我们如何可扩展地生成机器人操作数据，特别是在像多指灵巧手这样的人形平台上？从人类视频中学习最近成为这个问题的可能答案。然而，估计手-物交互和跨越人-机器人具身差距的困难阻碍了将丰富的单目RGB人类视频作为机器人操作数据的主要来源。在这项工作中，我们提出了DO AS I DO，一种将单目RGB人类视频重建并重定向到多指灵巧机器人手的算法。DO AS I DO从各种自我中心和外部中心的野外视频源中重建手-物交互。然后，该算法将这些手-物交互估计重定向为一系列可在现实世界中执行的动作，从不同的人类视频中生成机器人完整的操作数据。总体而言，DO AS I DO在从RGB视频中估计手-物交互和提取灵巧操作轨迹方面优于先前的最先进技术，正如我们在具有真实标签的数据集和在线收集的视频片段数据集上的实验所示。我们的实验使我们能够为从业者收集人类操作数据提出一个有效性指南。

英文摘要

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

2606.18960 2026-06-18 cs.CV cs.RO 交叉投稿

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World：用于持久机器人操作的内存增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, jun shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

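Affiliations * Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

AI summary: Proposes Mem-World, which uses W-VMem, a 4D wrist-view-centered surfel-indexed memory, to counter scene forgetting caused by occlusion and camera motion during manipulation, enabling persistent world modeling and improving policy evaluation and policy improvement.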

AI中文摘要

动作条件世界模型已成为机器人学习的一种有前景的范式，通过生成动作一致的视频推演，为昂贵的真实世界实验提供了可扩展的替代方案。然而，在操作中持久世界建模仍然具有挑战性：频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图，导致模型遗忘或幻觉先前帧中看到的场景细节。现有的内存检索策略在动态操作场景中往往无法识别信息丰富的历史。为解决这一限制，我们提出了Mem-World，一种内存增强的多视图动作条件世界模型。其核心是W-VMem，一种4D腕部视图为中心的曲面元索引内存，将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素被观测的时间和位置，W-VMem能够根据未来动作实现几何感知的相关历史帧检索。在生成过程中，通过基于曲面元的渲染和评分选择相关历史帧，为预测提供信息丰富且非冗余的上下文。大量实验表明，Mem-World在复杂操作场景中生成持久推演，比Ctrl-World实现更可靠的策略评估，将皮尔逊相关系数提高14.5%，并通过合成数据生成支持有效的策略改进，在长时域任务中将成功率从58%提升到72%。

英文摘要

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
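
The policy-evaluation claim above is stated in terms of the Pearson correlation between success rates measured in world-model rollouts and success rates measured on the real robot. As a minimal sketch of that metric (the per-policy numbers below are invented for illustration and are not results from the paper), the coefficient can be computed as follows:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical success rates of four policies (illustrative values only).
world_model_success = [0.1, 0.2, 0.3, 0.4]
real_world_success = [0.2, 0.1, 0.4, 0.3]

r = pearson(world_model_success, real_world_success)
print(f"Pearson r = {r:.3f}")
```

A value near 1 would mean that ranking policies inside the world model agrees closely with ranking them on hardware, which is what makes the model useful as an evaluation proxy.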

2501.02874 2026-06-18 cs.RO 版本更新

Steering Flexible Linear Objects in Planar Environments by Two Robot Hands Using Euler's Elastica Solutions

使用欧拉弹性线解在两机器人手在平面环境中操控柔性线性物体

Aharon Levin, Elon Rimon, Amir Shapiro

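Affiliations * Dept. of ME, Technion, Israel; Dept. of ME, Ben-Gurion University, Israel

AI summary: Uses Euler's elastica solutions to steer flexible linear objects in planar environments by controlling the positions and tangents of the endpoints gripped by two robot hands, while avoiding self-intersection, maintaining stability, and avoiding obstacles.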

AI中文摘要

机器人手对柔性物体（如电缆、电线和生鲜食品）的操控构成了机器人抓取力学中的一个特殊挑战。本文考虑了两机器人手在平面环境中操控柔性线性物体的问题。柔性线性物体被建模为弹性不可拉伸杆，通过改变抓取端点位置同时保持端点切线相等来进行操控。柔性线性物体的形状具有基于抓取端点位置和切线的闭式解，称为欧拉弹性线。本文在最优控制框架下获得了弹性线解，然后利用弹性线解得到了柔性线性物体无自交、稳定性和避障的闭式判据。这些新工具被整合到一个规划方案中，用于在稀疏障碍物分布的平面环境中操控柔性线性物体。该方案已完全实现并通过详细示例进行了演示。

英文摘要

The manipulation of flexible objects such as cables, wires and fresh food items by robot hands forms a special challenge in robot grasp mechanics. This paper considers the steering of flexible linear objects in planar environments by two robot hands. The flexible linear object, modeled as an elastic non-stretchable rod, is manipulated by varying the gripping endpoint positions while keeping equal endpoint tangents. The flexible linear object shape has a closed-form solution in terms of the grasp endpoint positions and tangents, called Euler's elastica. This paper obtains the elastica solutions under the optimal control framework, then uses the elastica solutions to obtain closed-form criteria for non-self-intersection, stability and obstacle avoidance of the flexible linear object. The new tools are incorporated into a planning scheme for steering flexible linear objects in planar environments populated by sparsely spaced obstacles. The scheme is fully implemented and demonstrated with detailed examples.

2601.20381 2026-06-18 cs.RO 版本更新

STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation

STORM：基于槽的任务感知面向对象的机器人操作表示

Alexandre Chapin, Emmanuel Dellandréa, Liming Chen

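Affiliations * Ecole Centrale de Lyon, LIRIS

AI summary: Proposes STORM, a module that uses a multi-phase training strategy to combine frozen visual foundation models with semantic-aware slots, producing task-aware object-centric representations that improve generalization under visual distractors and control performance in robotic manipulation.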

AI中文摘要

视觉基础模型为机器人提供了强大的感知特征，但其密集表示缺乏显式的对象级结构，限制了操作任务的鲁棒性和可收缩性。我们提出STORM（基于槽的任务感知面向对象的机器人操作表示），一个轻量级的面向对象适应模块，通过一组语义感知槽增强冻结的视觉基础模型，用于机器人操作。STORM不重新训练大型骨干网络，而是采用多阶段训练策略：首先通过使用语言嵌入的视觉-语义预训练稳定面向对象的槽，然后与下游操作策略联合适应。这种分阶段学习防止了退化槽的形成，并在保持语义一致性的同时将感知与任务目标对齐。在对象发现基准和模拟操作任务上的实验表明，与直接使用冻结的基础模型特征或端到端训练面向对象的表示相比，STORM改善了对视觉干扰物的泛化能力和控制性能。我们的结果强调了多阶段适应作为将通用基础模型特征转化为用于机器人控制的任务感知面向对象表示的有效机制。

英文摘要

Visual foundation models provide strong perceptual features for robotics, but their dense representations lack explicit object-level structure, limiting robustness and contractility in manipulation tasks. We propose STORM (Slot-based Task-aware Object-centric Representation for robotic Manipulation), a lightweight object-centric adaptation module that augments frozen visual foundation models with a small set of semantic-aware slots for robotic manipulation. Rather than retraining large backbones, STORM employs a multi-phase training strategy: object-centric slots are first stabilized through visual--semantic pretraining using language embeddings, then jointly adapted with a downstream manipulation policy. This staged learning prevents degenerate slot formation and preserves semantic consistency while aligning perception with task objectives. Experiments on object discovery benchmarks and simulated manipulation tasks show that STORM improves generalization to visual distractors, and control performance compared to directly using frozen foundation model features or training object-centric representations end-to-end. Our results highlight multi-phase adaptation as an efficient mechanism for transforming generic foundation model features into task-aware object-centric representations for robotic control.

2605.05925 2026-06-18 cs.RO 版本更新

DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

DexSynRefine：合成与精炼人-物交互运动以实现物理可行的灵巧机器人动作

Hyesung Lee, Hyunwoo Jung, Si-Hwan Heo, Sungwook Yang

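Affiliations * Korea Institute of Science and Technology; KAIST; Hanyang University

AI summary: Proposes the DexSynRefine framework, which synthesizes hand-object trajectories with the HOI-MMFP motion prior and combines task-space residual reinforcement learning with contact-dynamics adaptation to turn human-object interaction data into physically feasible dexterous manipulation, raising success rates by 50-70 percentage points across five tasks.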

Comments Project page: https://dexsynrefine.github.io/

AI中文摘要

从人-物交互（HOI）数据中学习灵巧操作为机器人遥操作提供了一种可扩展的替代方案，但HOI演示通常稀疏且纯运动学，在实体不匹配和接触丰富的动力学下直接重定向不可靠。我们提出DexSynRefine，一个耦合框架，将HOI数据视为结构化运动先验而非可执行的机器人动作。DexSynRefine首先使用HOI运动流形流基元（HOI-MMFP）——一种耦合手-物运动的运动先验，根据任务和初始物体状态合成手-物轨迹。然后通过任务空间残差强化学习对其进行物理接地，并通过从本体感受历史推断缺失的接触动力学上下文来适应执行。在五个灵巧操作任务中，每个阶段解决一个互补的瓶颈：HOI-MMFP提高了轨迹一致性和平滑性，任务空间残差在测试的替代方案中提供了最强的接地表示，接触动力学适应实现了鲁棒的真实世界执行。综合来看，DexSynRefine在真实世界中的成功率比运动学重定向提高了50-70个百分点。

英文摘要

Learning dexterous manipulation from human-object interaction (HOI) data offers a scalable alternative to robot teleoperation, but HOI demonstrations are typically sparse and purely kinematic, making direct retargeting unreliable under embodiment mismatch and contact-rich dynamics. We present DexSynRefine, a coupled framework that treats HOI data as structured motion priors rather than executable robot actions. DexSynRefine first synthesizes hand-object trajectories conditioned on the task and initial object state using HOI Motion Manifold Flow Primitives (HOI-MMFP), a motion prior for coupled hand-object motion. It then physically grounds them with task-space residual reinforcement learning and adapts execution by inferring missing contact-dynamics context from proprioceptive history. Across five dexterous manipulation tasks, each stage addresses a complementary bottleneck: HOI-MMFP improves trajectory consistency and smoothness, task-space residuals provide the strongest grounding representation among the tested alternatives, and contact-dynamics adaptation enables robust real-world execution. Together, DexSynRefine improves real-world success rates over kinematic retargeting by 50-70 percentage points.

2606.13672 2026-06-18 cs.RO 版本更新

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

$\texttt{WEAVER}$：更好、更快、更长——一种有效的机器人操作世界模型

Arnav Kumar Jain, Yilin Wu, Jesse Farebrother, Gokul Swamy, Andrea Bajcsy

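Affiliations * Mila - Québec AI Institute; Université de Montréal; Carnegie Mellon University; McGill University

AI summary: Proposes the WEAVER world model architecture, which trains multi-view latent prediction with a flow-matching loss to achieve high fidelity, long-horizon consistency, and efficient inference together, markedly improving policy evaluation, policy improvement, and test-time planning on robotic manipulation tasks.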

AI中文摘要

世界模型（即学习型模拟器）对机器人技术的潜在影响深远——包括策略评估、策略改进和测试时规划——所有这些都只需有限的真实世界交互。为了解锁这些下游能力，世界模型需要同时满足三个期望：（i）保真度（即产生与现实相关的模拟轨迹），（ii）一致性（即产生在长时域上连贯的模拟轨迹），以及（iii）效率（即快速产生模拟轨迹）。我们提出$\texttt{WEAVER}$（面向具身推理的多视图世界估计）：一种同时实现所有三个期望的世界模型架构，在机器人操作任务上提供了最先进的结果。$\texttt{WEAVER}$是一个多视图世界模型，通过流匹配损失训练以预测未来潜在状态和奖励值。我们提炼了模型架构、记忆和预测目标方面的关键设计决策，以解锁那些困扰先前世界建模方法的长时间动态操作任务。我们将$\texttt{WEAVER}$应用于机器人硬件，展示了其在策略评估（与真实世界成功率的相关系数$\rho=0.870$）、策略改进（在$\pi_{0.5}$机器人基础模型上真实世界成功率提升$38\%$）和测试时规划（真实世界成功率提升$14\%$，且比先前世界模型快$5-10$倍）方面的有效性。$\texttt{WEAVER}$在分布外场景评估中也表现出优于先前世界模型的性能。代码、模型和视频见：this https URL。

英文摘要

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: (i) fidelity (i.e., producing simulated trajectories that correlate with reality), (ii) consistency (i.e., producing simulated trajectories that are coherent over long horizons), and (iii) efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. WEAVER is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER in robotic hardware, demonstrating its effectiveness at policy evaluation (ρ = 0.870 correlation with real-world success rate), policy improvement (real-world success rate improvement of 38% on top of the π0.5 robot foundation model), and test-time planning (real-world success rate improvement of 14% with a 5-10x speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/ .

2606.18426 2026-06-18 cs.RO 新提交

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

VEGA: 从野外自我中心视频中通过几何轨迹监督学习导航VLA

Gershom Seneviratne, Yohan Abeysinghe, Jianyu An, Vaibhav Shende, Dinesh Manocha

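Affiliations * University of Maryland, College Park

AI summary: Proposes VEGA, which reconstructs scene geometry from unlabeled egocentric videos to generate obstacle-aware trajectories and trains a flow-matching VLA navigation policy on them; collisions drop by 33.0% on VEGA-Bench and real-world success improves by at least 150.0%.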

AI中文摘要

我们提出了VEGA，一种从未标注的自我中心导航视频中训练导航视觉-语言-动作（VLA）模型的方法。互联网规模的自我中心视频提供了可扩展的导航相关视觉观察来源，捕捉了杂乱场景、近距离障碍物以及通过真实世界空间的自然人体运动。然而，这些视频不能直接用于策略学习，因为它们没有提供在机器人坐标系中基于显式导航目标的障碍感知轨迹。VEGA通过从单目视频重建局部场景几何、采样导航目标（表示为文本、图像或空间路径点）并利用构建的几何生成障碍感知轨迹来解决这一差距。生成的轨迹分布随后用于训练流匹配VLA导航策略。通过仅在训练期间使用几何，VEGA将障碍感知规划直接蒸馏到基于视觉的策略中。此外，我们引入了VEGA-Bench，一个包含25万场景和约500万个导航目标（与场景几何配对）的基准，旨在评估VLA的目标进展、碰撞避免和障碍物间隙。我们的评估表明，VEGA在VEGA-Bench上实现了有竞争力的目标进展，同时相比最强基线碰撞减少33.0%，障碍物间隙提高17.9%，在真实世界试验中成功率至少提高150.0%，碰撞至少减少66.7%，障碍物间隙至少提高60.0%。最终，我们证明了视频衍生的几何监督为训练障碍感知导航VLA提供了可扩展且有效的信号。代码和基准将在发表时发布。

英文摘要

We introduce VEGA, an approach for training navigation Vision-Language-Action (VLA) models from unlabeled egocentric navigation videos. Internet-scale egocentric videos provide a scalable source of navigation-relevant visual observations, capturing cluttered scenes, close-range obstacles, and natural human motion through real-world spaces. However, these videos are not directly usable for policy learning because they do not provide obstacle-aware trajectories conditioned on explicit navigation goals in the robot's coordinate frame. VEGA addresses this gap by reconstructing local scene geometry from monocular video, sampling navigation goals (represented as text, image, or spatial waypoints), and generating obstacle-aware trajectories using the constructed geometry. The resulting trajectory distribution is then used to train a flow-matching VLA navigation policy. By using geometry exclusively during training, VEGA distills obstacle-aware planning directly into a vision-based policy. Furthermore, we introduce VEGA-Bench, a benchmark containing 250k scenes and approximately 5 million navigation goals paired with scene geometry, designed to evaluate goal progress, collision avoidance, and obstacle clearance of VLAs. Our evaluation shows that VEGA achieves competitive goal progress while reducing collisions by 33.0% and improving obstacle clearance by 17.9% over the strongest baseline on VEGA-Bench, while improving success by at least 150.0%, reducing collisions by at least 66.7%, and improving obstacle clearance by at least 60.0% in real-world trials. Ultimately, we demonstrate that video-derived geometric supervision provides a scalable and effective signal for training obstacle-aware navigation VLAs. The code and benchmark will be released at the time of publication.

2606.18634 2026-06-18 cs.RO cs.AI 新提交

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

EffiNav: 融合深度与视觉语言实现高效物体目标导航

Zecheng Yin, Benedict Jun Ma

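Affiliations * Systems Hub of Intelligence Transportation HKUST(GZ)

AI summary: Proposes the EffiNav framework, which fuses depth information with a vision-language model and guides navigation using predicted exploration frontiers and semantic priors; it matches or outperforms baselines on the HM3D and OVON datasets, improving path efficiency and generalization.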

AI中文摘要

在未知环境中定位目标物体是自主智能体的基本能力，应用范围从搜索救援到野外机器人。该任务的简化版本是物体目标导航（ObjNav）。在ObjNav中，成功到达目标物体提供了基本的性能度量；然而，导航轨迹的效率同样重要，因为它指示了智能体探索的智能程度以及后续任务剩余的时间。在未知环境中，高效导航的关键在于决定下一步探索的位置。尽管许多先前工作旨在解决这一核心挑战并在某些场景中取得了有希望的性能，但最近的基于训练的模型和非训练框架分别仍存在泛化性和效率问题，在最坏情况下可能导致对已访问区域的过度探索或冗余的来回运动。我们在两个广泛使用的仿真基准Habitat Matterport 3D（HM3D）和开放词汇物体目标导航（OVON）上评估EffiNav，并在真实世界的物理机器人上进一步验证其有效性。我们对大量仿真回合进行了失败分析。通过最小修改，我们还将EffiNav扩展到GOAT-BENCH数据集上的记忆增强ObjNav任务，展示了其在标准ObjNav设置之外的适应性。在两个标准指标——成功率（SR）和路径长度加权成功率（SPL）上，EffiNav匹配或超越了最近的基线，反映了其效率、鲁棒性和实际适用性。认识到两个数据集的不同侧重点，性能表明该框架在高效ObjNav中更加平衡和可泛化。

英文摘要

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works aim to address this core challenge and have achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues, respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. On two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, these results indicate that the framework is more balanced and generalizable for efficient ObjNav.
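
The two metrics named in the abstract have standard definitions: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator multiplied by the ratio of the shortest-path length to the larger of the path actually taken and the shortest path. The sketch below implements those definitions; the episode records and field names are hypothetical and are not taken from the EffiNav evaluation code.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length.

    Each successful episode contributes shortest / max(taken, shortest);
    failed episodes contribute 0. The sum is divided by the episode count.
    """
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Hypothetical episodes: path lengths in metres (illustrative values only).
episodes = [
    {"success": True, "shortest": 5.0, "taken": 5.0},    # optimal path
    {"success": True, "shortest": 4.0, "taken": 8.0},    # twice as long
    {"success": False, "shortest": 6.0, "taken": 12.0},  # never arrived
    {"success": True, "shortest": 3.0, "taken": 4.0},
]

print("SR  =", success_rate(episodes))
print("SPL =", spl(episodes))
```

Because SPL can never exceed SR, the gap between the two numbers is a direct reading of how much path length the agent wastes on the episodes it does solve.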

2606.18951 2026-06-18 cs.RO 新提交

A High-accuracy Event-based Underwater SLAM System

高精度事件相机水下SLAM系统

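Yifan Peng, Qihang Liu, Haoying Li, Yuzhe Li, Junfeng Wu, Ziyang Hong

AI summary: To address poor time-surface imaging quality and matching failures in event-based underwater SLAM, proposes a high-accuracy stereo SLAM system built on a structure-aware metric and Bayesian optimization, and contributes UWE, the first high-quality underwater event dataset.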

AI中文摘要

虽然事件相机为水下SLAM提供了巨大潜力，但现有的基于时间曲面（TS）的方法在水下部署时被证明非常不可靠。波动的相机速度严重降低了TS成像质量，而宽立体基线和重复的水下纹理导致关键匹配失败，频繁引发系统崩溃。为克服这些挑战，我们开发了首个高精度事件相机水下立体SLAM系统。基于结构张量相干性和梯度，设计了一种结构感知度量来定量评估TS结构信息密度。通过将最优TS生成解耦为基于系统初始化的两个不同阶段，贝叶斯优化（BO）在初始化前首先预测最优先验TS，同时我们设置异步在线局部搜索方法，在跟踪阶段实时获取合适的TS。我们使用先验视差保证精确的数据关联，并采用“最新观测优先”三角测量机制实现稳定三角测量。作为这些解决方案的基准和社区资源，我们还贡献了UWE，这是首个高质量真实世界水下事件数据集，包含变化的相机运动、复杂纹理和不同轨迹特征。在公共数据集和UWE上的广泛评估表明，所提出的SLAM系统与最先进的事件相机方法相比具有竞争力的精度性能。代码和数据将开源。

英文摘要

While event cameras offer immense potential for underwater SLAM, existing Time Surface (TS)-based methods prove highly unreliable when deployed underwater. Fluctuating camera velocities severely degrade TS imaging quality, while wide stereo baselines and repetitive underwater textures induce critical matching failures, frequently triggering system failure. To overcome these challenges, we develop the first high-accuracy event-based underwater stereo SLAM system. A structure-aware metric for TS is designed based on structure tensor coherence and gradients to quantitatively evaluate TS structural information density. By decoupling optimal TS generation into two distinct stages around system initialization, Bayesian Optimization (BO) first predicts an optimal prior TS sequentially before initialization, while an asynchronous online local search periodically obtains an appropriate TS in real time during the tracking stage. We use the prior disparity to guarantee precise data association and a "latest-observation-first" triangulation mechanism to achieve stable triangulation. As a benchmark for these solutions and a resource for the community, we also contribute UWE, the first high-quality real-world underwater event dataset containing variable camera motions, complex textures and different trajectory features. Extensive evaluations on public datasets and UWE show that the proposed SLAM system achieves competitive accuracy compared to the state-of-the-art event-based method. The code and data will be open-sourced.

2606.19122 2026-06-18 cs.RO 新提交

Aerial-Ground LiDAR Place Recognition: Patch-Level Self-Supervised Learning and Expanded Reciprocal Re-ranking

Yandi Yang, Xianghong Zou, Jianping Li, Haofeng Xie, Saurav Uprety, Hongzhou Yang, Naser El-Sheimy

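Affiliations * University of Calgary; Nanchang University; Nanyang Technological University; Wuhan University

AI summary: Proposes an aerial-ground LiDAR place recognition framework that narrows the domain gap with multi-scale patch-level self-supervised learning and reduces false positives with an Expanded Reciprocal re-ranking algorithm, significantly improving retrieval accuracy on multiple datasets.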

AI中文摘要

激光雷达地点识别用于确定在预先采集的点云地图上的位置。最常研究的基于地面激光雷达的地点识别存在预访问要求、覆盖不完整和视角有限等缺点。使用预先采集的全覆盖机载激光扫描（ALS）数据作为空中先验地图可以克服这些缺点，使得跨视角地点识别变得必要且有利。然而，空地激光雷达地点识别面临重大挑战，包括空中和地面点云之间的域差距以及初始检索中的误检。为了解决这些问题，我们提出了一种用于空地激光雷达地点识别的新型检索和重排序框架。基于相邻点云块与锚点块共享相似语义的先验知识，我们的检索网络在多个尺度上引入了块级自监督学习模块，并与场景级学习相结合，以提高空中和地面点云之间全局特征的判别性。此外，利用ALS点云的结构化空间分布，我们引入了一种扩展互逆（ER）重排序算法，以最大化利用邻域信息，并根据邻域特征优化每个特征，然后用于更新相似度矩阵以进行最终排序。大量实验表明，我们的检索网络优于现有最先进（SOTA）方法，在CS-Urban-Scenes数据集上平均Recall@1提高了9.8%，平均Recall@1%提高了3.2%，同时在CS-Campus3D数据集上也展示了最佳性能。此外，我们的ER重排序算法在无需额外训练的情况下，进一步将CS-Campus3D上的平均Recall@1提高了4.9%，CS-Urban-Scenes上提高了10.2%。

英文摘要

LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the prior that neighboring point cloud patches share similar semantics with the anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates them with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm to exploit neighborhood information maximally and refine each feature based on neighbor features, which are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8% improvement in average Recall@1 and a 3.2% improvement in average Recall@1% on the CS-Urban-Scenes dataset, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9% on CS-Campus3D and 10.2% on CS-Urban-Scenes without additional training.
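
The reported numbers are Recall@1 and Recall@1%, that is, the fraction of queries for which a true match appears in the top-1 result or in the top 1% of the database. The following is a minimal sketch of both metrics on invented retrieval results. Rounding 1% of the database size to the nearest integer with a floor of 1 is a common convention in point-cloud place recognition benchmarks and is an assumption here, not something the abstract states.

```python
def recall_at_k(rankings, ground_truth, k):
    """Fraction of queries with at least one true match among the top-k results.

    rankings: per query, database ids sorted from most to least similar.
    ground_truth: per query, the set of database ids that count as correct.
    """
    hits = sum(
        1 for ranked, truth in zip(rankings, ground_truth)
        if truth & set(ranked[:k])
    )
    return hits / len(rankings)


def recall_at_one_percent(rankings, ground_truth, database_size):
    """Recall@1%: top-k with k set to 1% of the database (assumed convention)."""
    k = max(1, round(database_size / 100))
    return recall_at_k(rankings, ground_truth, k)


# Hypothetical retrieval output for four queries against a 200-item database.
rankings = [[3, 7, 1], [5, 2, 9], [4, 6, 8], [0, 1, 2]]
ground_truth = [{7}, {5}, {8}, {1, 2}]

print("Recall@1  =", recall_at_k(rankings, ground_truth, 1))
print("Recall@1% =", recall_at_one_percent(rankings, ground_truth, 200))
```

A re-ranking stage such as the one described in the abstract only reorders each list in `rankings`; the metrics are then recomputed on the reordered lists, which is why it can raise Recall@1 without retraining the retrieval network.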

2606.18687 2026-06-18 cs.CV cs.RO 交叉投稿

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

空间分层蒸馏用于异构雷达位置识别

Sagun Singh Shrestha, Samuel Harding, Abdelwahed Khamis, Saimunur Rahman, Peyman Moghadam

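Affiliations * CSIRO Robotics; University of Queensland

AI summary: For heterogeneous place recognition between 4D automotive radars and dense spinning radars, proposes spatially stratified distillation (SSD), which applies an asymmetric spatial alignment derived from physical radar returns: feature alignment is enforced in overlapping regions and the distillation weight is reduced in sparse regions, reaching state-of-the-art performance on the HeRCULES dataset.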

Comments IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

AI中文摘要

可扩展的全天候位置识别越来越依赖于异构雷达位置识别来桥接不同的硬件平台。一个显著的应用是将来自经济高效的4D汽车雷达的查询与由密集旋转雷达构建的高保真参考地图进行匹配。这一过程从根本上受到4D传感器极端稀疏性（和窄视场）的限制，该传感器仅捕获旋转雷达数据库中存在的结构密度的一小部分。先前的工作通过统一不同的雷达信号来解决这个问题，即将两种信号投影到共同的表示空间。然而，它们在多会话环境中性能下降。在本文中，我们提出了空间分层蒸馏（SSD）；一种策略，用直接从物理雷达回波导出的非对称空间对齐取代标准的均匀蒸馏。在两个雷达都有重叠回波的区域，SSD强制进行强特征对齐。关键的是，在4D学生雷达缺乏回波但教师雷达在共享视场内包含有效结构的稀疏区域，SSD应用大幅折扣的蒸馏权重。对最近的HeRCULES数据集的广泛评估表明，SSD显著优于先前的位置识别方法，在其具有挑战性的动态序列上取得了最先进的结果。

英文摘要

Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals, that is, by projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD), a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations on the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.

2511.02036 2026-06-18 cs.RO 版本更新

TurboMap: GPU-Accelerated Local Mapping for Visual SLAM

TurboMap: 面向视觉SLAM的GPU加速局部建图

Parsa Hosseininejad, Kimia Khabiri, Shishir Gopinath, Soudabeh Mohammadhashemi, Karthik Dantu, Steven Y. Ko

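Affiliations * Simon Fraser University; University at Buffalo

AI summary: To reduce local mapping latency in visual SLAM, proposes TurboMap, a backend that combines GPU parallelization with CPU optimization; by restructuring map point creation, map point fusion, and keyframe management, it achieves 1.3-1.6x speedups while preserving accuracy.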

Comments Accepted for presentation at IROS 2026, preprint

AI中文摘要

在实时视觉SLAM系统中，局部建图必须在严格的延迟约束下运行，因为延迟会降低地图质量并增加跟踪失败的风险。GPU并行化是降低延迟的有效途径。然而，由于同步共享状态更新以及将大型地图数据结构传输到GPU的开销，并行化局部建图具有挑战性。本文提出TurboMap，一个GPU并行化且CPU优化的局部建图后端，全面解决了这些挑战。我们重构了地图点创建，以在GPU上实现并行关键点对应搜索，重新设计并并行化了地图点融合，在CPU上优化了冗余关键帧剔除，并集成了基于GPU的快速局部光束法平差求解器。为最小化数据传输和同步成本，我们引入了持久化的GPU驻留关键帧存储。在EuRoC和TUM-VI数据集上的实验表明，平均局部建图速度分别提升1.3倍和1.6倍，同时保持精度不变。

英文摘要

In real-time Visual SLAM systems, local mapping must operate under strict latency constraints, as delays degrade map quality and increase the risk of tracking failure. GPU parallelization offers a promising way to reduce latency. However, parallelizing local mapping is challenging due to synchronized shared-state updates and the overhead of transferring large map data structures to the GPU. This paper presents TurboMap, a GPU-parallelized and CPU-optimized local mapping backend that holistically addresses these challenges. We restructure Map Point Creation to enable parallel Keypoint Correspondence Search on the GPU, redesign and parallelize Map Point Fusion, optimize Redundant Keyframe Culling on the CPU, and integrate a fast GPU-based Local Bundle Adjustment solver. To minimize data transfer and synchronization costs, we introduce persistent GPU-resident keyframe storage. Experiments on the EuRoC and TUM-VI datasets show average local mapping speedups of 1.3x and 1.6x, respectively, while preserving accuracy.

2602.04401 2026-06-18 cs.RO cs.CV 版本更新

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

视觉地点识别中可靠操作点选择的分位数迁移

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

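Affiliations * QUT Centre for Robotics; School of Electrical Engineering and Robotics; Queensland University of Technology

AI summary: Proposes transferring thresholds via quantile normalization to automatically select the operating point of a visual place recognition system, maximizing recall at 100% precision without manual tuning.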

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

AI中文摘要

视觉地点识别（VPR）是全球导航卫星系统（GNSS）受限环境中定位的关键组成部分，但其性能严重依赖于选择平衡精度和召回率的图像匹配阈值（操作点）。阈值通常针对特定环境离线手动调整，并在部署期间固定，导致在环境变化下性能下降。我们提出一种方法，自动选择VPR系统的操作点，以在100%精度下最大化召回率。该方法使用已知对应关系的小型校准遍历，并通过相似度得分分布的分位数归一化将阈值迁移到部署中。这种分位数迁移确保阈值在校准大小和查询子集上保持稳定。在五个基准数据集上使用七种最先进的VPR技术进行的实验表明，我们提出的方法始终优于现有基线，使底层VPR技术在大约两倍的部署场景中（中位数改进）以100%精度运行，同时在该精度下检索到多达29%的正确匹配。该方法通过适应新环境并在操作条件下泛化，消除了手动调整。我们的代码可在该https URL获取。

英文摘要

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
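
The core idea in the abstract, choosing a threshold on labelled calibration data and carrying it to deployment through the quantiles of the score distributions, can be illustrated with a small sketch. This is one plausible reading of "quantile transfer", not the authors' implementation (their repository linked above is the authoritative source); all function names and score values below are hypothetical.

```python
def full_precision_threshold(scores, is_correct):
    """Highest score of any wrong match; accepting only scores strictly
    above it gives 100% precision on the calibration data."""
    wrong = [s for s, ok in zip(scores, is_correct) if not ok]
    return max(wrong) if wrong else float("-inf")


def empirical_cdf(scores, value):
    """Fraction of scores that are less than or equal to value."""
    return sum(1 for s in scores if s <= value) / len(scores)


def quantile(scores, q):
    """q-quantile of scores with linear interpolation, 0 <= q <= 1."""
    xs = sorted(scores)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


def transfer_threshold(calib_scores, calib_correct, deploy_scores):
    """Map the calibration threshold to the deployment score distribution
    by matching its quantile (assumed form of the transfer step)."""
    t_cal = full_precision_threshold(calib_scores, calib_correct)
    q = empirical_cdf(calib_scores, t_cal)
    return quantile(deploy_scores, q)


# Hypothetical similarity scores; deployment scores are systematically lower.
calib_scores = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
calib_correct = [False, False, True, False, True, True, True, True]
deploy_scores = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]

t_deploy = transfer_threshold(calib_scores, calib_correct, deploy_scores)
print("calibration threshold:", full_precision_threshold(calib_scores, calib_correct))
print("deployment threshold: ", t_deploy)
```

In this toy case a fixed threshold of 0.5 would reject every deployment match, while the transferred threshold keeps the same top half of the score distribution, which is the kind of shift under environmental change that a fixed, hand-tuned operating point cannot absorb.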

2606.01605 2026-06-18 cs.RO 版本更新

Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control

将语义风险嵌入距离场和CBF用于在线单目安全控制

Dawei Zhang, Nuo Chen, Shuo Liu, Roberto Tron, Zhiwen Fan

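Affiliations * Division of Systems Engineering, Boston University; Department of Mechanical Engineering, Boston University; Department of Electrical and Computer Engineering, Texas A&M University

AI summary: Proposes an online monocular perception-to-control framework that embeds semantic risk directly into the Euclidean Signed Distance Field (ESDF), encoding risk before control optimization to enable semantic-aware safe navigation and teleoperation based on Control Barrier Functions (CBFs).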

AI中文摘要

我们提出了一种在线单目感知到控制框架，将语义风险嵌入到用于基于控制障碍函数（CBF）的安全导航和遥操作的距离场中。许多基于感知的安全过滤器对所有映射的障碍物分配相同的基于距离的安全裕度，或者仅将语义用作下游控制器调整，而不是在空间表示中编码语义风险。我们的框架通过将语义信息直接嵌入欧几里得符号距离场（ESDF），在线推理障碍物几何和类别相关风险。这种设计在控制优化前编码语义风险，因此高风险对象在安全场中施加更大的空间影响，同时保留运行时高效的ESDF查询。具体来说，基于基础模型的SLAM前端从单目RGB视频重建密集3D几何，而每帧语义分割提供像素级类别标签，这些标签被融合到重建的几何中。得到的几何-语义表示随后被转换为ESDF，其中语义标签识别安全相关区域并在场计算前施加类别相关的膨胀。语义感知的ESDF提供CBF控制器所需的局部距离值和空间导数，而类别相关的增益进一步调节控制器响应。广泛的仿真和硬件实验证明了在线操作在10-20 Hz的频率以及遥操作和自主导航中的语义感知安全行为。

英文摘要

We propose an online monocular perception-to-control framework that embeds semantic risk into the distance field used by Control Barrier Function (CBF)-based safe navigation and teleoperation. Many perception-based safety filters assign the same distance-based safety margin to all mapped obstacles or use semantics only as a downstream controller adjustment, rather than encoding semantic risk in the spatial representation. Our framework instead reasons online about obstacle geometry and class-dependent risk by embedding semantic information directly into the Euclidean Signed Distance Field (ESDF). This design encodes semantic risk before control optimization, so high-risk objects exert a larger spatial influence in the safety field while retaining efficient ESDF queries at runtime. Specifically, a foundation-model-based SLAM front end reconstructs dense 3-D geometry from monocular RGB video, while per-frame semantic segmentation provides pixel-level class labels that are fused into the reconstructed geometry. The resulting geometric-semantic representation is then converted into an ESDF, where semantic labels identify safety-relevant regions and impose class-dependent inflation before field computation. The semantic-aware ESDF provides the local distance values and spatial derivatives required by the CBF controller, while class-dependent gains further regulate the controller response. Extensive simulation and hardware experiments demonstrate online operation at 10--20 Hz and semantic-aware safe behavior in both teleoperation and autonomous navigation.
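
To make the mechanism concrete, the sketch below shows how class-dependent inflation of a distance field changes the output of a CBF safety filter. It is a toy 2-D illustration under strong assumptions that are not from the paper: obstacles are discs, the robot is a single integrator (velocity is the control input), the inflation radii are invented, and the one-constraint quadratic program is solved in closed form instead of with a QP solver.

```python
import math

# Hypothetical class-dependent inflation radii in metres (illustrative only).
INFLATION = {"chair": 0.1, "person": 0.6}


def semantic_distance(p, obstacles):
    """Signed distance to the nearest obstacle after class-dependent inflation."""
    return min(
        math.dist(p, o["center"]) - o["radius"] - INFLATION[o["label"]]
        for o in obstacles
    )


def gradient(p, obstacles, eps=1e-4):
    """Central finite-difference gradient of the inflated distance field."""
    gx = (semantic_distance((p[0] + eps, p[1]), obstacles)
          - semantic_distance((p[0] - eps, p[1]), obstacles)) / (2 * eps)
    gy = (semantic_distance((p[0], p[1] + eps), obstacles)
          - semantic_distance((p[0], p[1] - eps), obstacles)) / (2 * eps)
    return gx, gy


def cbf_filter(p, u_nom, obstacles, alpha=1.0):
    """Smallest change to u_nom such that grad(h) . u + alpha * h >= 0,
    where h is the inflated distance and the dynamics are p_dot = u."""
    h = semantic_distance(p, obstacles)
    gx, gy = gradient(p, obstacles)
    margin = gx * u_nom[0] + gy * u_nom[1] + alpha * h
    if margin >= 0.0:
        return u_nom  # nominal command already satisfies the constraint
    scale = -margin / (gx * gx + gy * gy)
    return (u_nom[0] + scale * gx, u_nom[1] + scale * gy)


p = (1.0, 0.0)       # robot position
u_nom = (1.0, 0.0)   # operator command: drive straight at the obstacle

person = [{"center": (2.0, 0.0), "radius": 0.2, "label": "person"}]
chair = [{"center": (2.0, 0.0), "radius": 0.2, "label": "chair"}]

u_person = cbf_filter(p, u_nom, person)
u_chair = cbf_filter(p, u_nom, chair)
print("filtered velocity near a person:", u_person)
print("filtered velocity near a chair: ", u_chair)
```

The same geometry yields a slower approach when the obstacle is labelled as a person, because the semantic label has already changed the distance field before the controller queries it; no class-specific logic is needed inside the filter itself.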

2606.18519 2026-06-18 cs.RO cs.AI new submission

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

Marcos Abel Zuzuárregui, Stefano Carpin

Affiliations * University of California, Merced

AI summary: To address the ambiguity of natural language, the paper proposes an LLM-based mission planning system with linear temporal logic (LTL) feedback loops, in which two separate LLMs handle specification generation and verification, improving the reliability of mission planning in precision agriculture.

Journal ref: Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)

English abstract

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.

2606.18601 2026-06-18 cs.RO new submission

Admittance-Based Surface Alignment for Human-in-the-Loop Robotic Visual Inspection

Antara Banerjee, Colin Acton, Xu Chen

Affiliations * University of Washington

AI summary: Proposes an admittance-based, real-time closed-loop control framework that unifies operator input with perception-driven surface alignment to align the robot end-effector precisely with the local surface; validation on a 6-DOF manipulator shows stable normal tracking and a final mean orientation error of 0.4°.

English abstract

Precision visual inspection underpins quality assurance across aerospace, semiconductor, and medical manufacturing, where undetected surface anomalies on high-value parts translate directly into scrap, rework, and field failures. Robotic visual inspection requires precise alignment between the end-effector and local surface geometry in the presence of perception noise and surface irregularities. In industrial settings, a human operator is often kept in the loop via teleoperation or shared autonomy, introducing real-time adjustments that render purely offline motion planning inadequate. This motivates control architectures capable of reactive, compliant behavior under combined human and perceptual uncertainty. This paper presents a novel real-time, closed-loop robotic orientation control pipeline for precision visual inspection, with an admittance-based framework that unifies operator input and perception-driven surface alignment. We design the end-effector as a virtual sphere moving through a viscous medium, such that the resulting physically interpretable mass-damper system generates synchronized, compliant motion from orientation error and operator commands. We validate the framework on a 6-DOF manipulator, demonstrating stable normal tracking and a final mean orientation error of 0.4°.
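The mass-damper behavior described in the abstract can be sketched for a single rotational axis. This is an illustrative reduction, not the paper's controller. It assumes the orientation error enters as a virtual torque proportional to the error, the operator contributes an additive torque, and the dynamics are integrated with a semi-implicit Euler step. The function names and all parameter values are hypothetical.

```python
def admittance_step(theta, omega, theta_ref, tau_operator, dt,
                    mass=1.0, damping=4.0, gain=4.0):
    """Advance a 1-D virtual mass-damper by one time step.

    The virtual torque combines the perception-driven orientation error and
    the operator command; the damper term models the viscous medium.
    """
    tau = gain * (theta_ref - theta) + tau_operator
    omega_dot = (tau - damping * omega) / mass
    omega = omega + omega_dot * dt   # update velocity first (semi-implicit Euler)
    theta = theta + omega * dt       # then integrate the angle with the new velocity
    return theta, omega


def simulate(theta_ref, tau_operator, steps=2000, dt=0.01):
    """Run the admittance model from rest and return the final angle."""
    theta, omega = 0.0, 0.0
    for _ in range(steps):
        theta, omega = admittance_step(theta, omega, theta_ref, tau_operator, dt)
    return theta
```

With these illustrative values the system is critically damped, so the angle settles on the reference without overshoot. A constant operator torque shifts the equilibrium by `tau_operator / gain`, which is one simple way for operator input and surface alignment to share the same compliant dynamics.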

2606.18747 2026-06-18 cs.RO cs.AI new submission

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

Affiliations * University of New South Wales; Universidad Central de Chile

AI summary: To address stiff gesture generation in social robots, the paper integrates ChatGPT into the Pepper robot to generate co-speech gestures and introduces an iterative reinforcement learning with human feedback (RLHF) system to refine them; experiments indicate that RLHF improves the expressiveness, relevance, and fluidity of the gestures.

Comments 8 Pages, 6 Figures

English abstract

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

2606.19240 2026-06-18 cs.RO cs.CV cs.HC cs.SY eess.SY new submission

Guava: An Effective and General Harness Framework for Embodied Manipulation

Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao

Affiliations * University of Maryland College Park; University of Illinois Urbana-Champaign; University of Waterloo; Mohamed bin Zayed University of Artificial Intelligence; University of Pennsylvania; Amazon FAR

AI summary: Proposes Guava, a harness framework built on three key design choices (iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations), and distills embodied manipulation capabilities into a 4B open-source model whose performance in simulation and real-world environments is comparable to frontier proprietary models.

English abstract

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

2606.19088 2026-06-18 cs.RO new submission

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, Gerald Steinbauer-Wagner

Affiliations * Graz University of Technology, Institute of Software Engineering and Artificial Intelligence; University of Applied Sciences Technikum Wien, Department of Industrial Engineering; University of Alicante, Department of Computer Technology; University of Natural Resources and Life Sciences, Institute for Integrative Nature Conservation Research

AI summary: Proposes ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediate features to improve dense language-grounded retrieval, yielding better spatial consistency in OVSS and 3D mapping, and releases a compact 25M dense VLM.

English abstract

Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have been shown to be noisy and to lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a soft mixture of prototype-level language embeddings. We evaluate quantitatively on OVSS and 3D mapping across backbones, and qualitatively in real-world manipulation scenes. Quantitative results show improved dense retrieval; manipulation scenes show more spatially consistent target activations. We further provide a compact 25M dense VLM for robotic applications, substantially smaller than and competitive with ViT-B baselines. Available at https://resireg.github.io
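The reconstruction step ("each patch as a soft mixture of prototype-level language embeddings") can be sketched as a softmax-weighted sum. The abstract does not specify the similarity measure or the weighting, so this sketch assumes dot-product similarity and a temperature-scaled softmax. The function names, the temperature value, and the toy vectors are hypothetical, and the clustering and descriptor-derivation steps are taken as given.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reconstruct_patch(patch_feat, prototypes, proto_text_embs, temperature=0.1):
    """Rebuild one patch embedding as a soft mixture of prototype language embeddings.

    patch_feat      : intermediate visual feature of the patch
    prototypes      : visual prototype vectors (e.g. cluster centres), same space as patch_feat
    proto_text_embs : one language embedding per prototype
    """
    weights = softmax([dot(patch_feat, p) / temperature for p in prototypes])
    dim = len(proto_text_embs[0])
    return [sum(w * e[i] for w, e in zip(weights, proto_text_embs)) for i in range(dim)]
```

A patch that matches one prototype closely inherits almost exactly that prototype's language embedding, while an ambiguous patch receives a blend. This is one plausible reason such a reconstruction would give neighbouring patches of the same object similar embeddings.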

2606.19340 2026-06-18 cs.RO new submission

CABLE: A Cloud-Assisted Bandwidth-Efficient LMM-Based Encoding Framework for V2X Systems

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

Affiliations * University of Georgia

AI summary: Proposes CABLE, which propagates the cloud segmentation mask on the edge using ego-motion compensation and residual-motion cues to form a region of interest (ROI) and uploads only ROI-masked images, creating a mask-to-ROI-to-LMM feedback loop; on five datasets it achieves a 73-87% reduction in ROI pixel coverage and an estimated 5-8x LMM prefill speedup.

English abstract

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving 73-87% ROI pixel-coverage reduction with 5-8x estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

2602.01700 2026-06-18 cs.RO updated version

Tilt-Ropter: A Fully Actuated Hybrid Aerial-Terrestrial Vehicle with Tilt Rotors and Passive Wheels

Ruoyu Wang, Xuchen Liu, Zongzhou Wu, Zixuan Guo, Wendi Ding, Ben M. Chen

Affiliations * Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong; Faculty of Engineering, The University of Hong Kong; Peng Cheng Laboratory

AI summary: Proposes Tilt-Ropter, a fully actuated hybrid aerial-terrestrial vehicle that combines tilt rotors and passive wheels for efficient multi-modal locomotion, with a unified nonlinear model predictive controller that achieves low tracking error; power consumption during ground locomotion is 92.8% lower than in flight.

Comments 8 pages, 10 figures. Accepted by the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

English abstract

In this work, we present Tilt-Ropter, a fully actuated hybrid aerial-terrestrial vehicle (HATV) that integrates tilt rotors with passive wheels to enable efficient multi-modal locomotion. Unlike conventional underactuated HATVs, the fully actuated design of Tilt-Ropter allows decoupled force and torque control, improving maneuverability and ground locomotion efficiency. A unified nonlinear model predictive controller (NMPC) is developed to track reference trajectories, enforce non-holonomic constraints, and accommodate contact effects across locomotion modes, while ensuring actuator feasibility through dedicated control allocation. To address complex wheel-ground dynamics, an external wrench estimator is incorporated to provide real-time interaction wrench estimates. The system is validated through simulation and real-world experiments, including seamless air-ground transitions and trajectory tracking tasks. Experimental results demonstrate low tracking errors in both modes and reveal a 92.8% reduction in power consumption during ground locomotion compared to flight, highlighting the platform's suitability for long-duration missions in energy-constrained environments.

2606.18680 2026-06-18 cs.RO new submission

High-Degree-of-Freedom Lightweight Bioinspired Leg for Enhanced Mobility in Small Robots

Haoqi Han, Yifei Yu, Jiaming Zhang, Xinru Cui, Linxi Feng, Hesheng Wang

Affiliations * Shanghai Jiao Tong University; Shanghai University of Electric Power

AI summary: To address the limited leg degrees of freedom in microrobots, the paper proposes a four-degree-of-freedom parallel leg mechanism whose concentric design simplifies the kinematics, achieving low mass (18.9 g) and a large workspace (over 22255 mm³) with high motion flexibility.

Journal ref: 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

English abstract

In microrobotics, enhancing locomotion capabilities by increasing the degrees of freedom (DoF) of leg mechanisms under severe spatial constraints remains a significant challenge. Inspired by insect locomotion, this paper presents a novel micro-scale parallel leg mechanism with four degrees of freedom, and systematically analyzes its mechanical design, electrical system, and kinematics. The design incorporates two spherical five-bar linkages to achieve spatial motion within a parallel four-bar configuration. Furthermore, a concentric design strategy is employed to simplify the analytical solution of the leg kinematics. Due to the parallel system architecture, all actuators are located on the main body, substantially reducing the equivalent inertia of moving parts compared to traditional high-DoF leg structures. The total mass of the system is only 18.9 g, with an end-effector output force of approximately 0.5 N and a workspace exceeding 22255 mm³. Experimental results demonstrate that the proposed single-leg mechanism achieves excellent motion flexibility, highlighting its potential for micro bio-inspired robotics.

2606.18704 2026-06-18 cs.RO new submission

Selective Unit-Cell Actuation in Lattice Structures for Distributed Morphology in Soft Robots

Trevor Exley, Altair Coutinho, Lucia Beccai

Affiliations * Istituto Italiano di Tecnologia (IIT)

AI summary: Proposes an embedded pneumatic unit cell that integrates a curved-strut lattice with a bidirectional bellow actuator, so that spatial actuation patterns control the global morphology; experiments demonstrate scalable displacement and force generation as well as bending, grasping, and crawling.

Comments Accepted to IROS 2026, 8 pages, 5 figures

English abstract

Soft lattice structures are increasingly used in robotics to tailor compliance and guide deformation; however, actuation is typically introduced at the device or module level, with actuators inserted into otherwise passive architectures. In this work, we move actuator-lattice co-design to the unit-cell scale. We present an embedded pneumatic unit cell that integrates curved-strut lattice geometry with a bidirectional bellow actuator within a single monolithic element. When tessellated, the lattice functions as a distributed actuation field in which global morphology is governed by spatial actuation patterns rather than uniform pressurization. Experimental characterization of 1x1, 2x2, and 3x3 tessellations demonstrates scalable displacement and force generation with repeatable cyclic performance. Selective actuation of unit cells in a 3x3x3 array produces distinct global deformation modes, including bending and directional grasping, without altering hardware configuration. Additionally, coupling active and passive unit cells enables bending-driven crawling locomotion, demonstrating that heterogeneous tessellations can translate through asymmetric deformation. These results establish unit-cell-level actuation as a strategy for distributed morphing in lattice-based soft robots and provide a foundation for scalable, monolithic robotic architectures.

2606.19265 2026-06-18 cs.RO new submission

Shape Sensing of Continuum Robots using Direct Laser Writing

Amber K. Rothe, Nidhi Malhotra, Jaydev P. Desai

Affiliations * Winship Cancer Institute of Emory University; Medical Robotics and Automation (RoboMed) Laboratory, Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology

AI summary: Uses direct laser writing to fabricate strain sensors integrated into continuum robot joints; linear and nonlinear models predict the joint angle with error as low as 1.76 degrees, and closed-loop control achieves tracking error under 3 degrees.

Comments This work has been submitted to the IEEE for possible publication

English abstract

Continuum robots offer a promising approach for minimally invasive and natural-orifice surgical procedures due to their inherent compliance and dexterity. However, this flexibility also makes estimating the current shape of the robot challenging. Several approaches have been used to reconstruct the shape of these robots, including imaging, optical sensing, magnetic sensing, and resistive sensing. Strain sensors fabricated using direct laser writing (DLW) could provide an alternative sensing method. This technique involves using a laser to induce carbonization of certain polymers to create graphene patterns, such as strain sensors. In this paper, we demonstrate how a flexible continuum joint and a DLW sensor can be machined as one monolithic structure using the same laser and the same setup. The fabricated sensors are characterized using linear and nonlinear models, which are used to predict the joint angle with error as low as 1.76 degrees. Furthermore, we demonstrate how a DLW sensor can be used to implement closed-loop control in a robotic joint, achieving tracking error under 3 degrees.

2606.18375 2026-06-18 cs.RO new submission

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu

Affiliations * Institute of AI for Industries, Chinese Academy of Sciences

AI summary: Proposes PAIWorld, a framework that addresses 3D inconsistency in multi-view world models through geometry-aware cross-view attention, geometric rotary position embedding, and latent 3D-REPA distillation, achieving leading performance on robotic manipulation benchmarks.

English abstract

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.

2606.18594 2026-06-18 cs.RO cs.AI new submission

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

Seyed Alireza Azimi, Homayoon Farrahi, Abhishek Naik, Colin Bellinger, A. Rupam Mahmood

Affiliations * Department of Computing Science, University of Alberta; National Research Council Canada; School of Electrical Engineering and Computer Science, University of Ottawa; Vector Institute; Alberta Machine Intelligence Institute (Amii)

AI summary: Using sim-to-real transfer, the study evaluates four action spaces on object grasping and pushing tasks, finds that the joint-velocity action space performs best in smoothness and task performance, and offers RL practitioners guidance on choosing an action space.

Comments 9 pages with references

2606.18610 2026-06-18 cs.RO cs.CV new submission

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

Affiliations * University of Toronto; Vector Institute; NVIDIA; Physical Intelligence; Stanford University; UC Berkeley; Allen Institute for AI

AI summary: Proposes SC3-Eval, which adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing forward-inverse dynamics consistency, cross-view consistency, and test-time consistency, reaching a Pearson correlation of 0.929 across seven real-world policies.

English abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
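The headline metric, a Pearson correlation between evaluator scores and real-world results, is straightforward to compute. The sketch below shows the standard formula applied to per-policy success rates. It is generic and not taken from the paper, the helper name `pearson` is hypothetical, and the abstract's second metric (MMRV) is not covered here.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Centred cross-product and the two centred sums of squares.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In use, `xs` would hold the success rate each policy obtains in generated rollouts and `ys` the rate measured on the real robot. A value near 1 means the evaluator's scores track the real results linearly, although it does not by itself guarantee that the ranking of policies is preserved.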

2606.18646 2026-06-18 cs.RO new submission

A Scalable Embodied Intelligence Platform for Seamless Real-to-Sim-to-Real Transfer of Household Mobile Manipulation Tasks

Kui Yang, Xianlei Long, Haoxuan Li, Yan Ding, Chao Chen

Affiliations * School of Computer Science, Chongqing University; R&D Department, Lumos Robotics Technology (Suzhou) Co., Ltd

AI summary: Proposes the BestMan platform, which uses automated scene generation, simulation-guided task formalization, and hardware-agnostic middleware to address scene reconstruction, strategy evaluation, and deployment compatibility in real-to-sim-to-real transfer for household mobile manipulation.

Comments CCF Transactions on Pervasive Computing and Interaction

English abstract

Mobile manipulation is a fundamental capability in embodied intelligence robotics. The growing demand for robust and generalizable manipulation in unstructured household environments has driven rapid progress in embodied intelligence platforms. However, achieving a seamless transfer across the real-to-sim-to-real cycle faces three key challenges, including costly high-fidelity simulation scenes reconstruction, the complexity of systematic strategy evaluation in simulation, and incompatible real-world deployments. To address these challenges, we develop BestMan, a scalable and seamless real-to-sim-to-real platform that bridges the gap between the simulation and the real world, enabling effective strategy development, integration, and deployment for household mobile manipulation. Specifically, we design a novel Automated Scene Generation (ASG) module to reconstruct realistic simulations from real observations. Then, we propose a simulation-guided task formalization and skill learning architecture that supports the flexible integration and large-scale evaluations of hybrid skill strategies in simulation. Finally, to enhance the real-world scalability, we develop a Hardware-agnostic and Unified Middleware (HUM) to ensure seamless and compatible sim-to-real transfer across heterogeneous mobile manipulators for real deployments. Experimental results demonstrate the superior performance of our proposed platform in establishing standardized benchmarks and facilitating promising research in the field of mobile manipulation.

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM for Quadrupedal Robots

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

Affiliations * Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT); Institute for Intelligent Systems, Esslingen University of Applied Sciences; Department of Computer Science, University of Freiburg

AI summary: To study sensor configuration under quadrupedal locomotion, the paper systematically evaluates visual, visual-inertial, and LiDAR-visual-inertial SLAM methods, finding that stereo cameras and global shutters markedly improve localization robustness, while standard inertial integration can degrade primarily vision-based frameworks under harsh legged locomotion.

English abstract

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

2606.19154 2026-06-18 cs.RO New submission

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Vladimír Kubelka, Oleksandr Kotlyar, Unal Artan, Martin Magnusson

Affiliations: Örebro University, AASS research centre, Robot Navigation and Perception Lab

AI summary: Presents the first multi-sensor forest dataset to include 4D imaging radar, provides MinkowskiUNet baselines for semantic segmentation of the radar and lidar point clouds, and evaluates how trunk segmentation quality relates to tree size.

Comments 33 pages, 11 figures

Abstract

Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per-tree annotations, but none include co-registered 4D imaging radar -- a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi-sensor forest dataset collected by a mobile robot equipped with a high-resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK-GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations -- including per-tree diameter estimates -- provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross-modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter-stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co-registered multi-modal data and RTK-GNSS-aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.
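
The IoU figures quoted above follow the usual per-class intersection-over-union definition. The sketch below shows that metric under the assumption of one integer class label per point; the label ids (0 ground, 1 canopy, 2 trunk) and the toy arrays are hypothetical and unrelated to the dataset's actual label map.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """Return the IoU of each class; NaN if a class is absent from both arrays."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious[c] = np.logical_and(p, t).sum() / union
    return ious

# Hypothetical toy labels: 0 = ground, 1 = canopy, 2 = trunk
target = np.array([0, 0, 0, 1, 1, 2, 2, 2])
pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])
ious = per_class_iou(pred, target, num_classes=3)
mean_iou = float(np.nanmean(ious))
```

Reporting the per-class values rather than only the mean is what makes a gap such as the trunk class (56% vs. 74%) visible.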

2606.19161 2026-06-18 cs.RO New submission

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

Yuzhe Huang, Jiaping Wu, Jiaming Jiang, Hezhe Lin, Aikebaier Aierken, Yunlong Wang, Kun Cheng, Ziyuan Jiao, Yuanxin Zhong

Affiliations: Beihang University; Rimbot; BUPT (Beijing University of Posts and Telecommunications); ShanghaiTech University; Tsinghua University; CAS (Chinese Academy of Sciences)

AI summary: Proposes the HT-Bench multi-task benchmark and the HandTouch encoder, and uses large-scale egocentric vision and full-hand tactile data to validate tactile representations on tasks such as tactile similarity retrieval, masked inpainting, and vision-to-tactile synthesis.

Comments 9 pages, 4 figures

Abstract

Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than seeking to establish one, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce HT-Bench, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representations from three key perspectives: whether they encode meaningful contact geometry, whether they can align tactile observations with visual information, and whether they generalize to unseen tasks. To assess these capabilities, HT-Bench includes four tasks: fine-grained tactile similarity retrieval, masked tactile inpainting, vision-to-tactile synthesis, and multimodal tactile frame prediction. We further propose HandTouch, a vector-quantized vision-tactile encoder that learns tactile representations through progressive spatial, cross-modal, and temporal training. Across HT-Bench, HandTouch consistently outperforms representative tactile encoder baselines, improving Recall@5 on fine-grained tactile similarity retrieval from 74.65% to 85.23%, reducing RMSE on masked tactile inpainting from 0.022 to 0.010, and increasing OOD cIoU on vision-to-tactile synthesis from 0.628 to 0.705. These results demonstrate the effectiveness of HandTouch and suggest that large-scale egocentric full-hand tactile data provides a scalable basis for evaluating and advancing tactile representation learning in dexterous manipulation.
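
Recall@K, the retrieval metric quoted above, counts a query as a hit when at least one of its K nearest gallery items shares its label. The sketch below assumes cosine similarity over embedding vectors and integer labels; the function name, the similarity choice, and the toy data are illustrative assumptions, not the benchmark's actual protocol.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=5):
    """Fraction of queries with a same-label item among their k nearest neighbours."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                               # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of the k most similar items
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())

# Hypothetical 2-D embeddings: three gallery items, two queries
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
gallery_labels = np.array([0, 1, 2])
queries = np.array([[0.9, 0.1], [0.1, 0.9]])
query_labels = np.array([0, 2])

r1 = recall_at_k(queries, gallery, query_labels, gallery_labels, k=1)
r3 = recall_at_k(queries, gallery, query_labels, gallery_labels, k=3)
```

The second query only finds its match once K covers the whole gallery, which is why Recall@K is non-decreasing in K.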

2606.19176 2026-06-18 cs.RO cs.AI cs.SY eess.SY New submission

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

Maneesha Wickramasuriya, Beomyeol Yu, Jaden Shin, Mason Huslig, Taeyoung Lee, Murray Snyder

Affiliations: George Washington University

AI summary: Proposes a hardware-validated vision-in-the-loop framework that combines a deep transformer-based monocular pose estimator with a delayed Kalman filter, achieves autonomous indoor flight while emulating photorealistic maritime environments, and captures embedded effects such as perception latency.

Comments 6 pages, 9 figures

Abstract

Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.
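
One common way to fuse a measurement that arrives late is to roll the filter back to the state stored at the measurement's timestamp, apply the update there, and replay the buffered IMU inputs. The sketch below does this for a 1-D position/velocity model driven by acceleration. It is an illustrative assumption and may differ from the filter used in the paper; the class name, noise values, and toy trajectory are hypothetical.

```python
import numpy as np

class DelayedKF:
    """1-D position/velocity Kalman filter with rollback for delayed measurements."""

    def __init__(self, dt, accel_var, meas_var):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])
        self.B = np.array([0.5 * dt**2, dt])
        self.Q = accel_var * np.outer(self.B, self.B)
        self.H = np.array([[1.0, 0.0]])          # vision measures position only
        self.R = np.array([[meas_var]])
        self.x = np.zeros(2)
        self.P = np.eye(2)
        self.history = []                        # (x, P, accel) before each step

    def _propagate(self, x, P, accel):
        return self.F @ x + self.B * accel, self.F @ P @ self.F.T + self.Q

    def predict(self, accel):
        """High-rate IMU step; the prior state is buffered for later rollback."""
        self.history.append((self.x.copy(), self.P.copy(), accel))
        self.x, self.P = self._propagate(self.x, self.P, accel)

    def update_delayed(self, z, step):
        """Fuse position z that was measured at the start of IMU step `step`."""
        x, P, _ = self.history[step]
        S = self.H @ P @ self.H.T + self.R
        K = P @ self.H.T @ np.linalg.inv(S)
        x = x + K @ (np.atleast_1d(z) - self.H @ x)
        P = (np.eye(2) - K @ self.H) @ P
        for i in range(step, len(self.history)):  # replay buffered IMU inputs
            accel = self.history[i][2]
            self.history[i] = (x.copy(), P.copy(), accel)
            x, P = self._propagate(x, P, accel)
        self.x, self.P = x, P

dt = 0.01
kf = DelayedKF(dt, accel_var=0.1, meas_var=0.01)
true_p, true_v = 1.0, 0.0                        # filter starts at p = 0, off by 1 m
truth = []
for k in range(60):
    truth.append(true_p)                         # true position at start of step k
    kf.predict(1.0)
    true_p += true_v * dt + 0.5 * dt**2 * 1.0
    true_v += dt * 1.0
err_before = abs(kf.x[0] - true_p)
kf.update_delayed(truth[50], step=50)            # vision result from 10 steps ago
err_after = abs(kf.x[0] - true_p)
```

Buffering costs memory proportional to the longest expected latency, but it keeps the estimate consistent with the time at which the image was actually taken.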

2606.19186 2026-06-18 cs.RO cs.LG New submission

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise

Mengxiang Hao, Xin Jiang, Xinghao Huang, Wenliang Su, Zhiteng Wang, Junjie Rao, Xiaotian Yang, Wei Liao, Chengyu Han, Gen Liang, Yulun Song, Zhitao Xu, Xianpeng Lang

Affiliations: Li Auto

AI summary: Proposes the first automated AEB annotation framework, which uses targeted data augmentation and noise suppression to handle extreme class imbalance and asymmetric label noise, improving recall of delayed/false triggers by 80% and cutting manual workload by 50%.

Comments 8 pages, 5 figures, accepted by IEEE International Conference on Robotics and Automation (ICRA)

Journal ref: 2026 IEEE International Conference on Robotics and Automation (ICRA)

Abstract

Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) extreme class imbalance, where delayed/false triggers are overwhelmed by true triggers; and (2) asymmetric label noise, where mislabeled majority samples (true triggers) suppress the learning of minority samples (delayed/false triggers). To overcome these challenges, we propose two key innovations: (1) specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; and (2) noise suppression using stable hardness estimation and a probe-guided adaptive threshold to clean mislabeled true trigger samples. Crucially, we deploy our model as a practical annotation system with a full-stack architecture, efficiently identifying critical delayed/false triggers from thousands of daily AEB events. Production results demonstrate an 80% improvement in recall of delayed/false triggers and a 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization.

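2606.19267 2026-06-18 cs.RO cs.SY eess.SY New submission

A Mixed-Reality Testbed for Autonomous Vehicles

H. M. Sabbir Ahmad, Ehsan Sabouni, Emrullah Celik, Zean Wan, Damola Ajeyemi, Christos G. Cassandras, Wenchao Li

Affiliations: Boston University

AI summary: Proposes a mixed-reality hardware-in-the-loop testbed that integrates physical mobile robots with a high-fidelity simulation environment to validate perception, planning, and control algorithms and to support research on multi-agent systems.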

Comments 9 pages, 7 figures, 1 table

Abstract

We propose a mixed-reality, hardware-in-the-loop (HIL) testbed for autonomous vehicles that seamlessly integrates a physical testbed of mobile robots with a high-fidelity simulation environment. The virtual simulation enables the creation of diverse, safety-critical driving scenarios to validate state-of-the-art perception, planning, and control algorithms, while augmenting simulations with physical robots equipped with multimodal sensors in photorealistic virtual environments further facilitates rigorous validation. Our testbed also features vehicular connectivity using wireless communication and can accommodate a large number of agents through the combination of physical robots and virtual simulated agents, supporting research on multi-agent systems including Connected and Autonomous Vehicles (CAVs). Finally, we present a safety-guaranteed framework for CAVs that combines perception, planning, and a novel online learning-based controller using Control Barrier Functions (CBFs). Experiments with the proposed framework validate and demonstrate the key functionalities of the testbed and its overall utility in bridging the gap between simulation and real-world hardware deployment.
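
A Control Barrier Function acts as a safety filter: it keeps the nominal command whenever that command satisfies the barrier condition and minimally modifies it otherwise. The sketch below shows the idea in its simplest form, a single integrator with one position limit, where the usual quadratic program has a closed-form solution. It is a textbook illustration under stated assumptions, not the online learning-based controller from the paper; the function names and all numeric values are hypothetical.

```python
def cbf_filter(x, u_nom, x_max, alpha=1.0):
    """Safety filter for the single integrator x' = u with h(x) = x_max - x.

    The CBF condition h' + alpha * h >= 0 reduces to u <= alpha * (x_max - x).
    With a single constraint, the minimally invasive QP solution is a clamp.
    """
    return min(u_nom, alpha * (x_max - x))

def simulate(x0, u_nom, x_max, dt=0.01, steps=1000):
    """Forward-Euler rollout of the filtered system; returns the visited states."""
    x = x0
    xs = [x]
    for _ in range(steps):
        x += dt * cbf_filter(x, u_nom, x_max)
        xs.append(x)
    return xs

# The nominal command keeps pushing toward the limit; the filter slows it down.
xs = simulate(x0=0.0, u_nom=2.0, x_max=1.0)
```

The state approaches the limit asymptotically and never crosses it, even though the nominal command alone would overshoot after half a second.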

2606.18439 2026-06-18 cs.CV cs.RO Cross-list

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

Affiliations: University of Pennsylvania; University of California, Irvine; Nanyang Technological University

AI summary: Proposes RegimeVGGT, which removes redundancy through layer-wise U-shaped compression (saliency-guided banded merging and selectively protected K/V downsampling), achieving a 6.7x speedup while preserving reconstruction quality.

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

Abstract

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.18582 2026-06-18 cs.CV cs.RO eess.IV Cross-list

Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

Jaeil Park, Hyobin Choi, Sangjin Lee, Hyungtae Lim, Sung-Hoon Yoon

Affiliations: Daegu Gyeongbuk Institute of Science and Technology (DGIST); Massachusetts Institute of Technology (MIT)

AI summary: Proposes a network design that combines a self-supervised DINOv3 backbone, a ViT-Adapter, and a Mask2Former decoder with an inference strategy of multi-scale test-time augmentation and model ensembling, taking first place in the 64-class fine-grained off-road semantic segmentation challenge with a composite score of 76.57%.

Comments 5 pages, 4 figures

Abstract

The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: www.codabench.org/competitions/14257/#/results-tab.
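
Horizontal-flip test-time augmentation, one part of the inference strategy described above, can be sketched as follows: run the model on the image and on its mirror, flip the second output back, and average the per-pixel scores. The `toy_model` below is a hypothetical stand-in for a segmentation network, and the sketch omits the multi-scale and checkpoint-ensemble parts of the authors' pipeline.

```python
import numpy as np

def flip_tta(model, image):
    """Average per-pixel class scores over an image and its horizontal mirror."""
    scores = model(image)                        # (H, W, C)
    mirrored_scores = model(image[:, ::-1])      # predict on the mirrored input
    return 0.5 * (scores + mirrored_scores[:, ::-1])  # un-mirror, then average

def toy_model(image):
    """Hypothetical stand-in network: (H, W, 3) image -> (H, W, 2) scores.

    The second channel depends on the column index, so the model is
    deliberately not symmetric under horizontal flips.
    """
    h, w, _ = image.shape
    ramp = np.tile(np.linspace(0.0, 1.0, w), (h, 1))
    return np.stack([image.mean(axis=-1), ramp], axis=-1)

rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))
fused = flip_tta(toy_model, image)
fused_of_mirror = flip_tta(toy_model, image[:, ::-1])
labels = fused.argmax(axis=-1)                   # final per-pixel prediction
```

A useful property of this averaging is that the fused prediction is consistent under mirroring even when the underlying model is not.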

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO Cross-list

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

Affiliations: Technical University of Munich; Huawei

AI summary: Proposes OneCanvas, which aggregates multi-view patch features onto a panoramic canvas by reprojecting them with depth and camera pose, and reaches state-of-the-art accuracy on benchmarks such as SQA3D without complex geometry encoders or large training budgets.

Comments Project page: https://baranowskibrt.github.io/onecanvas/

Abstract

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
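
The placement step described above (unproject a patch with its depth and camera pose, then read off the longitude and latitude as seen from the canvas origin) can be sketched as below. The pinhole intrinsics, the axis convention (+z forward, +x right, +y down), and the function names are assumptions made for illustration; the paper's exact conventions are not given in the abstract.

```python
import numpy as np

def unproject(u, v, depth, K, cam_to_world):
    """Pixel (u, v) with metric depth -> 3-D world point, assuming a pinhole camera."""
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    p_cam = np.array([x, y, depth, 1.0])
    return (cam_to_world @ p_cam)[:3]

def to_canvas(p_world, origin, width, height):
    """World point -> continuous (col, row) on an equirectangular canvas.

    Assumed convention: +z forward, +x right, +y down; row 0 is straight up.
    """
    d = p_world - origin
    lon = np.arctan2(d[0], d[2])                 # longitude in [-pi, pi]
    lat = np.arcsin(-d[1] / np.linalg.norm(d))   # latitude in [-pi/2, pi/2]
    col = (lon / (2 * np.pi) + 0.5) * width
    row = (0.5 - lat / np.pi) * height
    return col, row

# Hypothetical camera: 640x480 image, focal length 500 px, identity pose
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
origin = np.zeros(3)

center_point = unproject(320, 240, 2.0, K, pose)
center_cell = to_canvas(center_point, origin, width=1024, height=512)
right_cell = to_canvas(np.array([2.0, 0.0, 0.0]), origin, 1024, 512)
up_cell = to_canvas(np.array([0.0, -1.0, 0.0]), origin, 1024, 512)
```

Because the canvas coordinate keeps only direction, range is lost in this mapping, which is the gap the 3D position embedding mentioned in the abstract is meant to fill.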

2512.11736 2026-06-18 cs.RO Updated version

Bench-Push: Benchmarking Pushing-based Navigation and Manipulation Tasks for Mobile Robots

Ninghan Zhong, Steven Caro, Megnath Ramesh, Rishi Bhatnagar, Avraiem Iskandar, Stephen L. Smith

Affiliations: Institute for Robotics and Intelligent Machines, Georgia Institute of Technology; Department of Electrical and Computer Engineering, University of Waterloo; Department of Mechanical Engineering, University of Alberta

AI summary: Proposes Bench-Push, the first unified benchmark for pushing-based mobile robot navigation and manipulation, comprising multiple simulated environments, new evaluation metrics, and baseline implementations for evaluating robot pushing tasks in environments with movable obstacles.

Comments Published in CRV 2026

Abstract

Mobile robots are increasingly deployed in cluttered environments with movable objects, posing challenges for traditional methods that prohibit interaction. In such settings, the mobile robot must go beyond traditional obstacle avoidance, leveraging pushing or nudging strategies to accomplish its goals. While research in pushing-based robotics is growing, evaluations rely on ad hoc setups, limiting reproducibility and cross-comparison. To address this, we present Bench-Push, the first unified benchmark for pushing-based mobile robot navigation and manipulation tasks. Bench-Push includes multiple components: 1) a comprehensive range of simulated environments that capture the fundamental challenges in pushing-based tasks, including navigating a maze with movable obstacles, autonomous ship navigation in ice-covered waters, box delivery, and area clearing, each with varying levels of complexity; 2) novel evaluation metrics to capture efficiency, interaction effort, and partial task completion; and 3) demonstrations using Bench-Push to evaluate example implementations of established baselines across environments. Bench-Push is open-sourced as a Python library with a modular design. The code, documentation, and trained models can be found at https://github.com/IvanIZ/BenchNPIN.

2512.14428 2026-06-18 cs.RO Updated version

Odyssey: An Automotive Lidar-Inertial Odometry Dataset with GNSS-denied situations

Aaron Kurda, Simon Steuernagel, Lukas Jung, Marcus Baum

Affiliations: University of Göttingen; iMAR Navigation

AI summary: Presents the Odyssey dataset, whose high-precision ground truth comes from a navigation-grade ring laser gyroscope RTK/INS; it contains 36 sequences and prolonged GNSS-denied environments (tunnels, indoor parking garages) for evaluating LIO/SLAM systems.

Comments 10 pages, 4 figures, 3 tables, submitted to International Journal of Robotics Research (IJRR)

Abstract

The development and evaluation of Lidar-Inertial Odometry (LIO) and Simultaneous Localization and Mapping (SLAM) systems require precise ground truth. The Global Navigation Satellite System (GNSS) is often used as a foundation for this, but its signals can be unreliable in obstructed environments due to multi-path effects or loss of signal. While existing datasets compensate for sporadic GNSS loss by incorporating Inertial Measurement Unit (IMU) measurements, the commonly used systems do not permit prolonged study of GNSS-denied environments due to accumulated drift. Therefore, the diversity of such datasets is limited. To close this gap, we present Odyssey, an automotive LIO dataset featuring: (1) a ground truth derived from a navigation-grade Ring Laser Gyroscope (RLG)-based RTK/INS, offering bias stability one to four orders of magnitude better than existing automotive datasets; (2) a comprehensive collection of 36 sequences across diverse environments, enabling robust and comprehensive evaluation; and (3) prolonged GNSS-denied environments, including tunnels and, previously unseen in the context of automotive benchmarks, indoor parking garages. Here, our RLG-based system enables accurate evaluation in scenarios where commonly employed systems would drift excessively. Besides providing data for LIO, Odyssey also supports place recognition tasks through threefold trajectory repetition and integration of external mapping data via precise geodetic coordinates. All data, the dataloader, and supplementary material are available online at https://odyssey.uni-goettingen.de/.

2601.07052 2026-06-18 cs.RO Updated version

RSLCPP -- Deterministic Simulations Using ROS 2

Simon Sagmeister, Marcel Weinmann, Phillip Pitschi, Markus Lienkamp

Affiliations: Technical University of Munich, Germany; School of Engineering & Design, Department of Mobility Systems Engineering, Institute of Automotive Technology; School of Engineering & Design, Department of Engineering Physics and Computation, Institute of Automatic Control

AI summary: Addresses the irreproducible simulation results caused by the asynchronous, multi-process design of ROS by proposing the RSLCPP library, which achieves reproducible cross-platform simulation through deterministic callback execution, typically without modifying existing node code.

Comments Accepted for publication at the 'IEEE Robotics and Automation Practice'

DOI: 10.1109/RAP.2026.3704080

Abstract

Simulation is crucial in real-world robotics, offering safe, scalable, and efficient environments for developing a variety of robotic applications. While the Robot Operating System (ROS) has been widely adopted as the backbone of these robotic applications in both academia and industry, its asynchronous, multi-process design complicates reproducibility, especially across varying hardware platforms. Deterministic callback execution cannot be guaranteed when computation times and communication delays vary. This lack of reproducibility complicates scientific benchmarking and continuous integration, where consistent results are essential. To address this, we present a methodology to create deterministic simulations using ROS 2 nodes. Our ROS Simulation Library for C++ (RSLCPP) implements this approach, enabling existing nodes to be combined into a simulation routine that yields reproducible results, usually without requiring any source code changes. We demonstrate that our approach produces identical results across various CPUs and architectures when testing both a synthetic benchmark and a real-world robotics system. RSLCPP is open-sourced at https://github.com/TUMFTM/rslcpp.
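
The general idea behind deterministic callback execution can be illustrated with a single-threaded discrete-event loop: callbacks are ordered by simulated time, then by a fixed priority, then by insertion order, so wall-clock computation time and transport delays cannot change the result. The sketch below is a Python illustration of that principle only. It is not the RSLCPP API (RSLCPP is a C++ library for ROS 2), and the class, the callback names, and the integer millisecond clock are hypothetical.

```python
import heapq
import itertools

class DeterministicExecutor:
    """Runs callbacks in an order fixed by (sim time, priority, insertion order)."""

    def __init__(self):
        self._queue = []
        self._counter = itertools.count()        # tie-breaker, keeps ordering total
        self.now = 0                             # simulated time in milliseconds

    def schedule(self, time_ms, priority, callback):
        heapq.heappush(self._queue, (time_ms, priority, next(self._counter), callback))

    def run(self):
        while self._queue:
            self.now, _, _, callback = heapq.heappop(self._queue)
            callback(self)

def run_once():
    """Simulate a 100 Hz sensor whose messages trigger a controller callback."""
    log = []
    ex = DeterministicExecutor()

    def controller(ex):
        log.append(("controller", ex.now))

    def sensor(ex):
        log.append(("sensor", ex.now))
        ex.schedule(ex.now, 1, controller)       # "publish": deliver at same sim time
        if ex.now < 20:
            ex.schedule(ex.now + 10, 0, sensor)  # next timer tick, 10 ms later

    ex.schedule(0, 0, sensor)
    ex.run()
    return log

first_run = run_once()
second_run = run_once()
```

Because simulated time advances only when the queue says so, a slow callback delays the run in wall-clock terms but never reorders the events.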

2606.17639 2026-06-18 cs.RO cs.CV Updated version

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

Hong Yang, Basura Fernando

Affiliations: Centre for Frontier AI Research, Agency for Science, Technology and Research; College of Computing and Data Science, Nanyang Technological University

AI summary: Proposes the ERQA-Plus benchmark, comprising 1,766 question-answer instances grounded in robot-centric images and covering perceptual, action, social, navigation, and commonsense reasoning, for diagnosing the reasoning capabilities of embodied AI.

Abstract

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and an SBERT score of 61.4, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page at https://github.com/LUNAProject22/erqa-plus.

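2606.18632 2026-06-18 cs.RO New submission

Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation

Luobin Cui, Yanlai Wu, Tang Ying, Weikai Li

AI summary: For deployment settings where high-quality sensors are unavailable, proposes a heterogeneous multi-source fatigue-detection framework that uses shared modalities for cross-domain modality imputation and draws on source-domain knowledge to improve fatigue detection in the target domain.

Comments 4 figures, 14 pages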

Abstract

Fatigue detection for human operators is important in safety-related applications such as aviation, mining, and long-haul transport. Reliable estimation of operator fatigue can support timely warnings, adaptive task scheduling, takeover reminders, and other safety-management decisions in human-machine systems. However, the effectiveness of these functions depends on whether fatigue-related signals can be reliably captured in the deployment environment. While many studies have shown the value of high-fidelity sensors in controlled laboratory environments, their performance often degrades when used in real-world settings because of noise, lighting conditions, and field-of-view constraints, thereby limiting their practical use. This paper formalizes a deployment-oriented setting for real-world fatigue detection, where high-quality sensors are often unavailable in practical applications. To address this issue, we use knowledge from heterogeneous source domains, including high-fidelity sensors that are difficult to deploy in the field but commonly used in controlled environments, to assist fatigue detection in the real-world target domain. Based on this idea, we design a heterogeneous and multi-source fatigue-detection framework that uses the available modalities in the target domain while leveraging diverse configurations in the source domains through cross-domain modality imputation based on shared modalities.

2602.15513 2026-06-18 cs.RO cs.AI Updated version

HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering

Ji Li, Bo Wang, Jing Xia, Mingyi Li, Shiyan Hu

Affiliations: The University of Hong Kong; Beijing Institute of Technology

1. Robot Learning, Imitation, and Reinforcement Learning (5 papers)

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

DREAM-Chunk: Reactive Action Chunking with Latent World Model

HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

2. Motion Planning, Control, and Dynamics (7 papers)

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

ZiMPedance: Impedance-Aware ZMP Modeling and Control for Payload Carrying with Quadruped Robots

Congestion-Aware Robot Tour Planning in Crowded Environments

Robust and Efficient MuJoCo-based Model Predictive Control via Web of Affine Spaces Derivatives

3. Manipulation, Grasping, and Dexterous Hands (12 papers)

Self-Supervised Mask-Aware Transformers for Fault-Tolerant FBG Force Sensing in Minimally Invasive Surgical Robotics

ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing

GCNGrasp-VP: Affordance-Guided View Planning for Efficient Task-Oriented Grasping

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

Mobile Pedipulation for Object Sliding via Hierarchical Control on a Wheeled Bipedal Robot

Modeling Branches for Active Manipulation using Iterative Parameter Estimation

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Steering Flexible Linear Objects in Planar Environments by Two Robot Hands Using Euler's Elastica Solutions

STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation

DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

4. Navigation, Localization, and SLAM (11 papers)

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

A High-accuracy Event-based Underwater SLAM System

Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

FAST-LIVGO: A Degeneracy-Robust LiDAR-Inertial-Visual-GNSS Fusion Odometry

Observability and Consistency Analysis for Visual-Inertial Navigation with Anchored Feature Parameterizations

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

TurboMap: GPU-Accelerated Local Mapping for Visual SLAM

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control

5. Human-Robot Interaction and Collaborative Robots (6 papers)

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

Admittance-Based Surface Alignment for Human-in-the-Loop Robotic Visual Inspection

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

Mutual Adaptation in Human-Robot Co-Transportation with Human Preference Uncertainty

Why Automate This? Exploring Correlations Between Desire for Robotic Automation, Invested Time and Well-Being

6. Embodied Intelligence and Vision-Language-Action Models (7 papers)

Guava: An Effective and Universal Harness for Embodied Manipulation

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Cosmos 3: Omnimodal World Models for Physical AI

7. Multi-Robot and Swarm Systems (2 papers)

Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets

R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

8. Unmanned Ground Vehicles, UAVs, and Mobile Robots (4 papers)

DNN Koopman-Based Deviation Compensation for UGV Path Tracking Control on Coupled Slope and Potholed Road

Constant Time-Delay Leader Following with Neural Networks and Invariant Extended Kalman Filters for Arbitrary Trajectories

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Tilt-Ropter: A Fully Actuated Hybrid Aerial-Terrestrial Vehicle with Tilt Rotors and Passive Wheels

9. Soft Robotics and Hardware Design (3 papers)

High-Degree-of-Freedom Lightweight Bioinspired Leg for Enhanced Mobility in Small Robots

Selective Unit-Cell Actuation in Lattice Structures for Distributed Morphology in Soft Robots

Shape Sensing of Continuum Robots using Direct Laser Writing

10. Simulation, Datasets, and Evaluation (20 papers)

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

A Scalable Embodied Intelligence Platform for Seamless Real-to-Sim-to-Real Transfer of Household Mobile Manipulation Tasks

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise

A Mixed-Reality Testbed for Autonomous Vehicles