arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31486 2026-06-01 cs.RO Updated version

Learning Controlled Separation of Small Objects Between Two Fingers with a Tactile Skin

Ulf Kasolowsky, Berthold Bäuml

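Affiliations: Learning AI for Dextrous Robots Lab; Technical University of Munich; DLR Institute of Robotics and Mechatronics

AI summary: This paper introduces and solves the task of controlled separation of small objects between two fingers of a multi-purpose robotic hand, trains a purely tactile policy with reinforcement learning, and analyzes the benefits of spatially resolved tactile feedback.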

Abstract

We introduce and solve the novel task of controlled separation of small objects with two fingers of a multi-purpose robotic hand: after grasping into a box of small objects, the task is to drop objects until only a desired number remains between the fingers. The objects are small compared to the width of the fingers but also in absolute terms. In our case, small pellets with a diameter of only 6 mm are handled. We show that the task can be performed purely by touch (no vision) using a spatially resolved tactile skin on a fingertip. The separation policy is trained in simulation via reinforcement learning using a straightforward sparse reward, which essentially checks whether the desired number of objects is reached. In simulation experiments, we provide an exhaustive analysis of the benefits of using spatially resolved tactile feedback: while an ideal (high-resolution) tactile sensor allows solving the task almost perfectly, a sensor with lower spatial resolution (here 4x4 taxels) still leads to an improvement of up to 20% compared to using only the fingers' joint sensors. For this analysis, we further train an estimator alongside the policy that predicts the ground-truth contact positions. Finally, we demonstrate successful sim-to-real transfer for the DLR-Hand II equipped with a tactile skin.
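
The sparse reward mentioned in the abstract can be pictured with a minimal Python sketch. The function name, the end-of-episode flag, and the 0/1 reward values are assumptions made for illustration; the abstract only says that the reward checks whether the desired number of objects is reached.

```python
def sparse_separation_reward(num_objects_held: int,
                             target_count: int,
                             episode_done: bool) -> float:
    """Hypothetical sparse reward: 1.0 only if the episode ends with the target count.

    No intermediate shaping is given, so the policy must discover on its own
    which finger motions drop surplus objects.
    """
    if not episode_done:
        return 0.0  # no signal while the episode is still running
    return 1.0 if num_objects_held == target_count else 0.0
```

For example, `sparse_separation_reward(2, 2, True)` yields the full reward, while ending with one or three objects yields none.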

2605.31481 2026-06-01 cs.RO Updated version

Batched Differentiable Rigid Body Dynamics in PyTorch for GPU-Accelerated Robot Learning

Yue Wang, Yanran Xu, Wenbo Wu, Chuanhang Qiu, Zhaoxing Li

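Affiliations: University of Southampton

AI summary: Proposes BARD, a PyTorch-based batched differentiable rigid-body dynamics library that uses a tiered cache, matmul-free joint transforms, and level-parallel propagation to reach up to 64x faster forward kinematics on GPU while supporting gradient computation.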

Abstract

As robot control shifts toward large-scale reinforcement learning with in-loop dynamics computation, the community's reliance on CPU-bound libraries such as Pinocchio creates a throughput bottleneck in GPU-based training pipelines. We present BARD (Batched Articulated Rigid-body Dynamics), a self-contained PyTorch implementation of Featherstone's rigid-body dynamics algorithms, optimized for batched GPU evaluation and automatic differentiation. Three design choices make this efficient: a tiered lazy-evaluation cache that avoids redundant tree traversals, matmul-free joint transforms via pre-computed Rodrigues constants, and level-parallel propagation that reduces sequential operations to tree-depth batched steps. On five robot models (7-23 DOFs), BARD matches Pinocchio numerically while reaching up to 64x higher throughput for Forward Kinematics and 63x for Jacobians at batch size 4096 on an NVIDIA H200. We validate differentiability through gradient-based system identification on a 7-DOF manipulator, recovering link masses to 1.24% mean error under 5% torque noise, and integrate BARD into an Isaac Lab AMP training pipeline for an 11-DOF spined quadruped with 4096 parallel environments, where it is 8.5x faster than Pinocchio and 2.0x faster than ADAM for in-loop dynamics. BARD is open-sourced at: https://github.com/YueWang996/bard-pytorch-dynamics.
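
One plausible reading of "matmul-free joint transforms via pre-computed Rodrigues constants" is the Rodrigues formula R = I + sin(θ)·K + (1 − cos(θ))·K², where K is the skew-symmetric matrix of the joint axis. Because K and K² depend only on the fixed axis, they can be computed once, after which each rotation needs only a sine, a cosine, and a scale-and-add. The sketch below uses plain Python lists rather than PyTorch tensors, and all function names are hypothetical; it is not taken from the BARD code base.

```python
import math

def skew(axis):
    """Skew-symmetric (cross-product) matrix of a 3-vector."""
    x, y, z = axis
    return [[0.0, -z, y],
            [z, 0.0, -x],
            [-y, x, 0.0]]

def matmul3(a, b):
    """3x3 matrix product, used only during the one-time setup."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def precompute_rodrigues_constants(axis):
    """One-time setup per joint: K and K @ K for the normalized (non-zero) axis."""
    norm = math.sqrt(sum(c * c for c in axis))
    unit = [c / norm for c in axis]
    k = skew(unit)
    return k, matmul3(k, k)

def joint_rotation(constants, theta):
    """R = I + sin(theta) * K + (1 - cos(theta)) * K^2, with no matrix product."""
    k, k2 = constants
    s, c1 = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * k[i][j] + c1 * k2[i][j]
             for j in range(3)] for i in range(3)]
```

In a batched setting the same expression would be evaluated elementwise over a whole batch of joint angles, which is what makes the per-step cost independent of any matrix multiplication.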

2605.31476 2026-06-01 cs.RO Updated version

IDOL: Inverse-Dynamics-Guided Future Prediction for End-to-End Autonomous Driving

Chenghao Zhang, Timin Li, Dongmei Li

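Affiliations: Department of Electronic Engineering, Tsinghua University

AI summary: Proposes the IDOL framework, which uses an inverse dynamics model to convert future latent scene states predicted by a BEV world model into planning-relevant trajectory deltas, tightly coupling future prediction with trajectory optimization and reaching state-of-the-art performance among comparable methods on the NAVSIM benchmarks.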

Comments 20 pages, 5 figures

Abstract

End-to-end autonomous driving has emerged as a compelling paradigm for learning planning directly from sensor observations, while recent world-model-based approaches further enrich this paradigm by enabling explicit reasoning about how the scene may evolve in the future. Yet future prediction alone does not guarantee better planning unless the predicted evolution can be converted into planning-relevant trajectory updates. Many current methods still forecast future scene states without explicitly decoding the motion implications hidden in state transitions. As a result, future reasoning often remains descriptively useful but only weakly coupled to executable motion generation. To address this limitation, we propose IDOL, an inverse-dynamics-guided future prediction framework for world-model-based end-to-end planning in latent BEV space, where inverse dynamics serves as the key bridge between future prediction and trajectory optimization. IDOL first predicts multiple future latent scene states with a BEV world model, then applies an inverse dynamics model to adjacent latent futures to decode transition-aware trajectory features and recover planning-relevant motion deltas that explain how the latent world evolves over time. These inverse-dynamics-derived signals are used to optimize the planned trajectory, turning future forecasting from passive scene anticipation into actionable planning guidance. A lightweight closed-loop refinement module further improves long-horizon consistency by reusing the optimized trajectory for another round of future-aware reasoning. By introducing inverse dynamics into latent future reasoning, IDOL tightens the coupling between world modeling and planning. Extensive experiments on the NAVSIM v1 and NAVSIM v2 benchmarks show that IDOL achieves state-of-the-art performance among comparable methods.

2605.31460 2026-06-01 cs.RO cs.SY eess.SY Updated version

On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making

Joonhee Lee, Hyunseung Shin, Hyunmi Kim, Pei Zhang, Jeonggil Ko

Affiliations: School of Integrated Technology, College of Computing, Yonsei University; Department of Hyperscale AI SoC Research Section, ETRI; EECS - Electrical and Computer Engineering, University of Michigan

AI summary: Proposes the REIS framework, which reduces inference redundancy through scene gating, KV-steered affordance routing, and deliberative reasoning, accelerating robot control while preserving semantic adaptability.

Comments 19 pages

Abstract

Reasoning-based robotic policies using large language and vision-language models achieve strong semantic planning capabilities but mostly suffer from high inference latency that limits practical real-time deployment. In this work, we observe that robotic reasoning workloads contain substantial temporal redundancy, where consecutive observations frequently produce identical actions and subgoals. Based on this insight, we present REIS, a human-cognition-inspired robotic decision-making framework that minimizes unnecessary reasoning while preserving semantic adaptability. REIS combines lightweight scene gating, KV-steered affordance routing, and deliberative reasoning to accelerate robotic control under embodied constraints. Experiments on ALFRED and real-world robotic tasks demonstrate that REIS significantly suppresses reasoning overhead while maintaining competitive task performance.
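
The idea of exploiting temporal redundancy can be illustrated with a toy scene gate: the expensive planner is called only when the observation has changed noticeably since the last time it was consulted, and the cached action is reused otherwise. This is a simplified sketch, not the REIS implementation; the class name, the feature-vector input, and the max-absolute-difference test are all assumptions.

```python
from typing import Callable, List, Optional, Sequence

class GatedPlanner:
    """Hypothetical gate that reuses the last action while the scene is unchanged."""

    def __init__(self, slow_planner: Callable[[Sequence[float]], str],
                 threshold: float) -> None:
        self.slow_planner = slow_planner      # stands in for an LLM/VLM call
        self.threshold = threshold            # change needed to trigger reasoning
        self._last_features: Optional[List[float]] = None
        self._last_action: Optional[str] = None
        self.planner_calls = 0

    def act(self, features: Sequence[float]) -> Optional[str]:
        # Compare against the features last *reasoned over*, so that slow
        # drift accumulates and eventually triggers a fresh planner call.
        changed = (
            self._last_features is None
            or len(features) != len(self._last_features)
            or max((abs(a - b) for a, b in zip(features, self._last_features)),
                   default=0.0) > self.threshold
        )
        if changed:
            self._last_action = self.slow_planner(features)
            self._last_features = list(features)
            self.planner_calls += 1
        return self._last_action
```

With a threshold of 0.1, observations `[0.0]` and `[0.05]` would share one planner call, while `[0.9]` would trigger a second.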

2605.31436 2026-06-01 cs.RO Updated version

Actuator-Aware Inverse Kinematics with Joint-Limit Admissibility for Torque-Controlled Redundant Robots

Mohammad Dastranj, Mahdi Hejrati, Jouni Mattila

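Affiliations: Unit of Automation Technology and Mechanical Engineering, Faculty of Engineering and Natural Sciences

AI summary: Proposes an inverse-kinematics method based on convex quadratic programming that enforces joint limits through control-barrier-function-style constraints and resolves redundancy with a controller-compatibility objective, improving task behavior without modifying the downstream controller.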

Abstract

This paper proposes actuator-aware inverse kinematics for torque-controlled redundant robots under joint-limit constraints. In the considered architecture, the inverse-kinematic output is not merely a purely kinematic joint-velocity command; it is the required joint velocity supplied to a downstream torque-level controller. Therefore, a small commanded task residual may not necessarily improve realized motion. The proposed method formulates a convex quadratic programming problem whose decision variable is the joint-level required velocity. Control barrier function style bounds impose reference-level joint-limit admissibility, while the task equation is handled through a penalized slack variable. Redundancy is resolved using a controller-compatibility objective that accounts for previous-command consistency and actuator torque-capacity weighting. The method is independent of the particular torque-level controller and can serve as an intermediate IK layer between an endpoint trajectory and a redundant robot controller. Experiments on a virtual-decomposition-controlled seven-degree-of-freedom upper-limb exoskeleton compare the method with standard inverse-kinematic baselines and a constrained task-preserving quadratic programming baseline. The results indicate lower limit-pushing commands, bounded admissible required velocities, and improved realized task behavior in the tested trajectory, without modifying the downstream controller.
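
A common way to obtain control-barrier-function-style joint-limit bounds is to define barriers h_upper = q_max − q and h_lower = q − q_min and require dh/dt ≥ −α·h, which gives the velocity box −α(q − q_min) ≤ q̇ ≤ α(q_max − q). The sketch below computes such a box per joint and clips it with a velocity limit. The gain `alpha`, the `v_max` clipping, and the function name are assumptions; the paper's exact bounds may differ.

```python
from typing import List, Sequence, Tuple

def cbf_velocity_bounds(q: Sequence[float],
                        q_min: Sequence[float],
                        q_max: Sequence[float],
                        alpha: float,
                        v_max: Sequence[float]) -> List[Tuple[float, float]]:
    """Per-joint (lower, upper) bounds on the required joint velocity.

    The allowed velocity toward a limit shrinks linearly as the joint
    approaches it and reaches zero exactly at the limit.
    """
    bounds = []
    for qi, lo, hi, vm in zip(q, q_min, q_max, v_max):
        lower = max(-vm, -alpha * (qi - lo))  # from h_lower = q - q_min
        upper = min(vm, alpha * (hi - qi))    # from h_upper = q_max - q
        bounds.append((lower, upper))
    return bounds
```

In a quadratic program whose decision variable is the joint-level required velocity, these pairs would enter as simple box constraints, which keeps the problem convex.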

2605.31387 2026-06-01 cs.CL cs.RO Updated version

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

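Affiliations: Computational Linguistics, Department of Linguistics, University of Potsdam; German Research Center for Artificial Intelligence (DFKI)

AI summary: Evaluates vision-language models on a collaborative spatial reasoning task using a multi-turn multi-agent dialogue framework, finding that visual spatial understanding remains the main bottleneck and that text representations and decomposed image representations partially improve performance.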

Comments Preprint

Abstract

Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs) support robotic tasks involving visual interpretation, question answering, and instruction following, but their capabilities in collaborative dialogue tasks requiring spatial reasoning remain underexplored. We study this gap through a collaborative structure-building task that combines visual interpretation, grounding, language-guided interaction, and action generation. We develop a framework in which VLMs use dialogue to reconstruct a target structure from visual and textual inputs. We evaluate open-weight and closed VLMs across interaction settings, input modalities, and image representations. Results show that spatial reasoning over visual representations remains difficult for the evaluated VLMs. Detailed text representations of the target yield higher reconstruction success across modality conditions, while decomposed image representations improve performance. These findings reveal limits in visual spatial grounding and grounded instruction generation for collaborative VLM agents.

2605.31376 2026-06-01 cs.RO cs.CV cs.GR Updated version

LiftNav: Path Planning via Semantic Lifting in TSDF-Guided Gaussian Splatting

Hannah Schieber, Dominik Frischmann, Victor Schaack, Angela P. Schoellig, Daniel Roth

Affiliations: Technical University of Munich; Human-Centered Computing and Extended Reality Lab; TUM University Hospital; Clinic for Orthopedics and Sports Orthopedics; Munich Institute of Robotics and Machine Intelligence (MIRMI); Learning Systems and Robotics Lab

AI summary: Proposes LiftNav, a hybrid navigation framework that combines a TSDF+GS dual map, YOLO detection, TSDF-based 3D lifting, and B-spline trajectory optimization to enable flexible semantic navigation without dense 3D embeddings. A hinge-loss collision penalty improves trajectory smoothness and safety, and simulations on the Replica dataset show a 100% feasibility rate and shorter trajectories.

Abstract

Autonomous robots in unknown indoor environments require both reliable collision avoidance and object-level understanding. Classical representations such as TSDF support safe planning but lack semantics, while photorealistic methods like Gaussian Splatting (GS) provide rich appearance yet suffer from soft geometry, limiting precise obstacle avoidance. We present LiftNav, a hybrid navigation framework built on GSFusion's TSDF+GS dual map, augmented with a real-time pipeline of YOLO-based detection, TSDF-based 3D lifting, and B-spline trajectory optimization. This design enables flexible semantic navigation without dense 3D embeddings. We further introduce a hinge-loss-based collision penalty that improves trajectory smoothness and safety. We evaluate our approach in simulation using the Replica dataset. Compared against a state-of-the-art radiance field baseline, we show a 100% feasibility rate and shorter trajectories.
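
A hinge-loss collision penalty is typically zero whenever a trajectory sample is farther from obstacles than a safety margin and grows linearly once it gets closer. The sketch below assumes that signed distances (for example, looked up in a TSDF) are already available for each trajectory sample; the margin, the weight, and the function name are hypothetical and not taken from LiftNav.

```python
from typing import Sequence

def hinge_collision_penalty(distances: Sequence[float],
                            safety_margin: float,
                            weight: float = 1.0) -> float:
    """Sum of max(0, safety_margin - d) over sampled trajectory points.

    Samples outside the margin contribute nothing, so the optimizer is not
    pushed away from obstacles that are already at a safe distance.
    """
    return weight * sum(max(0.0, safety_margin - d) for d in distances)
```

Because the penalty vanishes outside the margin, it only affects the parts of the trajectory that actually approach an obstacle, which is one reason hinge penalties tend to leave the rest of the path smooth.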

2605.31352 2026-06-01 cs.RO Updated version

Haptic Sorter: A Unified Planning Framework for Online Shape Estimation and Real-Time Pose Inference

Zhuoyi Lu, Lin Yang, Sri Harsha Turlapati, Domenico Campolo

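Affiliations: School of Mechanical and Aerospace Engineering, Nanyang Technological University (NTU), Singapore

AI summary: Proposes a unified model-based geometric framework for shape estimation and pose tracking in robotic manipulation, combining Bayesian-optimization-guided haptic exploration, superellipse shape approximation, an adaptive manipulation-potential formulation, and real-time pose inference with an online ordinary differential equation.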

Abstract

Robotic manipulation usually assumes that the shape and pose of the object are known to the robot prior to motion planning. However, precise geometric information is not always available in practice, and pose inference suffers from sensor uncertainties and view occlusion. In this work, we propose a unified model-based geometric framework integrating robotic haptic perception, modeling, and manipulation planning. Our novelties are: (i) introducing Bayesian Optimization (BO) to guide the haptic exploration for object shape inference, where superellipses are used to approximate the geometric boundary; (ii) an adaptive formulation of the manipulation potential that encodes object geometry for quasi-static robot-object interaction; (iii) an online Ordinary Differential Equation (ODE) for real-time pose inference based on model prediction and tactile feedback. We deploy our system on a 2D robotic sorting task and vary object geometries to validate the robustness and generalizability of our framework in both simulation and a real-world multi-arm setup.
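
For readers unfamiliar with superellipses, the standard axis-aligned form is |x/a|^n + |y/b|^n = 1, which reduces to an ellipse for n = 2 and approaches a rectangle as n grows. The sketch below evaluates this implicit function and generates boundary points from an angle parameter. It illustrates the shape family only; the function names and the axis-aligned, origin-centered parameterization are assumptions, not the paper's fitting procedure.

```python
import math
from typing import Tuple

def superellipse_value(x: float, y: float, a: float, b: float, n: float) -> float:
    """Implicit function: < 1 inside, == 1 on the boundary, > 1 outside."""
    return abs(x / a) ** n + abs(y / b) ** n

def superellipse_boundary_point(t: float, a: float, b: float,
                                n: float) -> Tuple[float, float]:
    """Boundary point for parameter t, using signed powers of cos and sin."""
    c, s = math.cos(t), math.sin(t)
    x = a * math.copysign(abs(c) ** (2.0 / n), c)
    y = b * math.copysign(abs(s) ** (2.0 / n), s)
    return x, y
```

A touch point measured on the object can be scored with `superellipse_value`: values near 1 indicate that the current parameters (a, b, n) explain the contact well, which is the kind of residual a shape-fitting objective could minimize.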

2605.31343 2026-06-01 cs.RO Updated version

Learning Terrain-Aware Whole-Body Control for Perceptive Legged Loco-Manipulation

Sikai Guo, Yudong Zhong, Guoyang Zhao, Botao Dang, Zhihai Bi, Jun Ma

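Affiliations: Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the TA-WBC framework, which achieves whole-body loco-manipulation control for legged manipulators on complex terrain through a hybrid exteroception encoder that extracts terrain features, an end-effector sampling method based on the foot contact plane, and a dual-policy distillation module.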

Abstract

Legged manipulators integrate exceptional terrain adaptability along with mobile manipulation capabilities, which make them highly promising for deployment in human-centric environments. By coordinating the control of both legs and arms, a whole-body controller can significantly expand the operational workspace of legged manipulators. However, many existing whole-body controllers primarily depend on proprioception and do not incorporate the critical exteroception required for effective terrain topology perception. This limitation can hinder their ability to adapt to varying environmental conditions and navigate complex terrains effectively. In this paper, we introduce TA-WBC, a terrain-aware whole-body control framework for legged manipulators, which features a novel RL-based unified policy tailored to whole-body loco-manipulation tasks in various terrains. Specifically, we employ a hybrid exteroception encoder to extract terrain features, providing an essential basis for the robot to proactively adapt posture and footholds. Furthermore, to facilitate stable cross-terrain loco-manipulation, we propose a novel end-effector sampling method based on the foot contact plane, decoupling manipulation target from base fluctuations. Moreover, a dual-policy distillation module is introduced to integrate expansive whole-body motion with terrain adaptability without catastrophic forgetting. The simulation and real-world experiments validate the robustness of our proposed controller, which leads to a larger reachable space, less tracking error, and reduced unexpected stumbles. This unified policy highlights the promising capabilities of legged manipulators in performing loco-manipulation tasks across complex terrains.

2605.31321 2026-06-01 cs.RO Updated version

Surface Constraint Policy for Learning Surface-Constrained and Dynamically Feasible Robot Skills

Shuai Ke, Jiexin Zhang, Huan Zhao, Zhiao Wei, Yikun Guo, Jie Pan, Han Ding

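Affiliations: State Key Laboratory of Intelligent Manufacturing Equipment and Technology, Huazhong University of Science and Technology

AI summary: Proposes the surface constraint policy (SCP), which encodes surface geometry constraints with a two-dimensional weighted Gaussian kernel and combines a diffusion policy with similarity-based action mapping to generate dynamically feasible surface-constrained motions, addressing action stochasticity and unstable contact under free-form surface constraints.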

Abstract

Diffusion-based imitation learning methods have driven rapid progress in robot dexterous manipulation tasks. However, they have limitations when applied to tasks that involve complex free-form surface constraints because of their lack of explicit surface geometry constraint modeling and the dynamic feasibility issue, resulting in stochastic action generation that fails to achieve reliable surface alignment and maintain stable contact. To address these limitations, we propose a novel surface constraint policy (SCP) for generating robot actions that satisfy free-form surface constraints on the basis of human demonstrations and real-time visual observations. First, the surface geometry constraint is encoded using a two-dimensional weighted Gaussian kernel function that is derived from demonstrations. Building on the encoded surface geometry constraints, the diffusion-based policy is used to infer task-level action intentions from multimodal sensory inputs, including visual observations and robot state feedback. These intentions are further transformed into surface-constrained dynamic movement primitives (DMPs) through a similarity-based action mapping method, thereby enabling smooth and compliant motion execution. The SCP achieves generation of structured surface geometric intent and dynamically admissible actions. The proposed method is validated on multiple surface manipulation tasks and compared with existing techniques. The experimental results demonstrate superior task success rates and contact stability under surface constraints.

2605.31314 2026-06-01 cs.RO Updated version

AR Forcing: Towards Long-Horizon Robot Navigation World Model

Yifei Yang, Zehua Fan, Huan Li, Aoqi Wang, Lida Huang, Haibao Yu, Haiyan Liu, Xuanyao Mao, Jason Bao, Liang Xu, Bingchuan Sun, Yan Wang

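Affiliations: Institute for AI Industry Research, Tsinghua University; Shanghai Jiao Tong University; School of Safety Science, Tsinghua University; The University of Hong Kong; Lenovo, Beijing, China

AI summary: Proposes AR Forcing, an autoregressive training strategy that integrates the diffusion loss into an autoregressive training loop to address the distribution shift between training and inference, improving image consistency and trajectory prediction accuracy in long-horizon navigation.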

Abstract

Diffusion-based world models for robot navigation are typically trained using parallel supervision, while autoregressive inference is employed during path planning. This results in a distribution shift between training and inference, which destabilizes performance over long-horizon prediction. We propose AR Forcing, an autoregressive training strategy that integrates the standard diffusion loss into the autoregressive training loop. At each step, the model uses its own predictions to update the context and optimizes the single-step noise prediction objective, thereby explicitly exposing the model to the inference state distribution during training. Our method does not require additional discriminators or distribution-matching losses, retains the original diffusion framework and sampler, and is easy to integrate. Experiments on multi-domain navigation datasets (RECON, SCAND, HuRoN, TartanDrive) show that, compared with strong baselines, AR Forcing improves the consistency of generated images during long-horizon navigation and the accuracy of predicted trajectories, enhancing the robustness of the model in complex known and unknown environments. We will release the code soon.
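
The training loop described in the abstract can be caricatured with scalar "frames" to show where it differs from parallel (teacher-forced) supervision: the context is rolled forward with the model's own prediction rather than the ground-truth frame. Everything below is a toy under stated assumptions: a single noise level `sigma`, a one-step denoising estimate in place of the full sampler, scalar frames, and hypothetical names. It does not reproduce the authors' implementation or their gradient handling.

```python
from typing import Callable, List, Sequence, Tuple

def ar_forcing_rollout(
    predict_noise: Callable[[float, Sequence[float]], float],
    context: Sequence[float],
    targets: Sequence[float],
    noises: Sequence[float],
    sigma: float = 0.5,
) -> Tuple[List[float], List[float]]:
    """Return per-step noise-prediction losses and the final self-generated context."""
    ctx = list(context)
    losses = []
    for target, eps in zip(targets, noises):
        noisy = target + sigma * eps          # corrupt the ground-truth next frame
        eps_hat = predict_noise(noisy, ctx)   # single-step noise prediction
        losses.append((eps_hat - eps) ** 2)   # standard diffusion (MSE) loss term
        predicted = noisy - sigma * eps_hat   # model's own estimate of the frame
        ctx = ctx[1:] + [predicted]           # feed the prediction back, not the target
    return losses, ctx
```

Under parallel supervision the last line would append `target` instead of `predicted`; swapping in the prediction is what exposes the model to the states it will actually encounter at inference time.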

2605.31256 2026-06-01 cs.RO Updated version

Before Parc Fermé: RL-Time Pruning for Efficient Embodied LLMs in Autonomous Driving

Luca Benfenati, Ali Azimi, Matteo Risso, Fabio Carapellese, Daniele Jahier Pagliari, Alessio Burrello

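Affiliations: Department of Control and Computer Engineering; Department of Mechanical and Aerospace Engineering

AI summary: Proposes BPF, a strategy that prunes during reinforcement learning, compressing embodied LLM controllers under task-specific supervision and closed-loop feedback and achieving a better trade-off between performance, memory, and throughput in an autonomous-driving control pipeline.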

Abstract

Embodied Large Language Models (LLMs) are increasingly used as reasoning modules in robotic control pipelines to improve human-robot interaction, but their memory and generation latency make real-time deployment difficult. Pruning can reduce these costs, but for controllers that undergo multiple pre- and post-training phases, the crucial question is not only how much to prune, but when pruning should occur. In this work, we propose Before Parc Fermé (BPF), a pruning strategy performed during RL that compresses embodied LLM controllers while they are still being optimized for closed-loop behavior. This allows pruning decisions to account for the task-specific supervision and closed-loop feedback that shape the final controller. We propose two variants: BPF-RL, which performs iterative pruning during RL by removing part of the model at predefined training intervals, and BPF-SFT/RL, which first prunes part of the model structure during SFT and then further compresses it during RL using the same iterative strategy as BPF-RL until the target pruning ratio is reached. We evaluate BPF on RobotxR1, an LLM-based autonomous-driving control pipeline, using an established LLM pruning framework (LLM-Pruner), and compare it against post-training pruning, post-training pruning with RL recovery, SFT-stage pruning, and smaller dense models from the same family. Our results show that BPF provides the best task-performance vs. memory and throughput trade-off among the considered pruning strategies. When compressing the larger RobotxR1 models, BPF-SFT/RL achieves a 1.69× better size-end-to-end performance trade-off than directly selecting a smaller dense model from the same family, measured as removed parameters per lost percentage point of control adaptability. On the Jetson AGX Orin mounted on the target robotic platform, the compact models improve decode throughput by up to 27%.
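
The iterative schedule behind BPF-RL (prune part of the model at predefined training intervals until a target ratio is reached) can be sketched as a simple planning function. Equal-sized pruning increments are an assumption, as are the function and parameter names; a nonzero `initial_ratio` stands in for the portion already removed during SFT in the BPF-SFT/RL variant.

```python
from typing import List, Tuple

def iterative_pruning_schedule(total_steps: int,
                               interval: int,
                               target_ratio: float,
                               initial_ratio: float = 0.0) -> List[Tuple[int, float]]:
    """Return (training_step, cumulative_pruning_ratio) for each pruning event."""
    prune_steps = list(range(interval, total_steps + 1, interval))
    if not prune_steps:
        return []  # interval longer than training: no pruning event fits
    increment = (target_ratio - initial_ratio) / len(prune_steps)
    return [(step, initial_ratio + increment * (i + 1))
            for i, step in enumerate(prune_steps)]
```

For instance, 1000 RL steps with an interval of 250 and a 40% target would give four pruning events of roughly 10 percentage points each, leaving the RL updates between events to recover closed-loop performance.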

2605.31234 2026-06-01 cs.RO Updated version

HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model

Xiang Zhu, Puzhen Yuan, Yichen Liu, Jianyu Chen

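Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University, China; Shanghai Qi Zhi Institute, China

AI summary: Proposes the HARP framework, which learns aligned human-robot visual and latent-action representations from limited paired human-robot demonstrations and unpaired videos, improving VLA pretraining and yielding gains on CALVIN and real-world tasks.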

Abstract

Learning generalizable vision-language-action (VLA) models from large-scale human videos is promising but challenging due to cross-embodiment discrepancies in both visual observations and executable actions. While latent action models reduce the action execution gap by learning action abstractions, they still rely on visual features. Thus, misaligned human and robot visual representations can lead to inconsistencies in policy inputs and induce domain-dependent latent actions, hindering effective co-training with human videos. To address this, we propose HARP, a human-robot aligned representation learning framework for more effective VLA pretraining from human videos. Specifically, HARP uses limited paired human-robot demonstrations as cross-embodiment bridges and abundant unpaired human and robot videos as a scalable dynamics supervision data source. It trains a robot-adapted visual encoder and a latent action model with manipulation-centric auxiliary cues and a source-relative pair-discriminative alignment loss, which adapts robot representations toward human semantics while preserving pair-level discrimination. The learned aligned vision encoder and latent action model provide a unified vision and action representation for VLA-style policy learning, where human and robot videos provide vision-language-to-latent-action supervision and a lightweight robot action head grounds latent actions into executable commands. Experiments on feature visualization, simulation, and real-world manipulation show improved human-robot alignment and downstream policy performance, achieving 4.481 average length on CALVIN ABC→D and a 7.1% real-world success rate gain over the strongest baseline.

2605.31210 2026-06-01 cs.RO cs.AI Updated version

Simulation of collision avoidance behavior in crowd movement by data-driven approach

Xuanwen Liang, Eric Wai Ming Lee

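Affiliations: Department of Architecture and Civil Engineering; University of Hong Kong

AI summary: To address the high collision rates of data-driven crowd simulation, proposes a collision-penalized generative adversarial network (CPGAN) that uses a lateral-acceleration-based collision loss and a Voronoi-based feature extraction method to reduce opposite-direction collision rates in bidirectional flow.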

Abstract

Crowd movement simulation is essential for pedestrian safety management and facility layout optimization. Data-driven models enhance trajectory prediction accuracy under Euclidean metrics, yet they suffer from excessively high collision rates, especially in bidirectional and multidirectional flows. In this paper, we establish a novel data-driven crowd simulation model that incorporates the pedestrian collision mechanism into the loss function to reduce collisions. A new lateral-acceleration-based collision loss function and a Voronoi-based motion feature extraction approach are proposed. The model is based on a Generative Adversarial Network (GAN) architecture and is termed CPGAN (Collision-Penalized GAN). We evaluate CPGAN in bidirectional flow scenarios, which involve frequent collision avoidance behaviors. Results show that the proposed lateral-acceleration-based collision loss significantly reduces opposite-direction pedestrian collision rates to levels comparable with controlled experiments. CPGAN effectively simulates bidirectional flow, reproducing lane formation and N-t curves. The research outcomes can provide inspiration for integrating pedestrian dynamics mechanisms into loss functions in data-driven crowd simulation.

2605.31196 2026-06-01 cs.CV cs.AI cs.CL cs.RO 版本更新

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

探索视觉-语言模型中的碰撞接地以实现安全的人机协作

Jun Wang, Xiaohao Xu, Xiaonan Huang

发表机构 * University of Michigan, Ann Arbor(密歇根大学,安娜堡)

AI总结 针对安全人机协作,提出碰撞接地概念及物理基准TouchSafeBench,评估视觉-语言模型在分类当前安全状态和预警即将碰撞任务中的表现,发现现有模型不可靠,视觉流畅性不等于物理责任性。

Comments 31 pages, 9 figures

AI中文摘要

安全的人机协作需要的不仅仅是视觉描述:监控器必须确定机器人身体是否安全分离、已经与场景或人发生碰撞,或即将碰撞。我们将这种能力称为碰撞接地:将视觉观察与机器人身体几何、相机视角、场景布局、人体接近度和时间运动相结合,以推断当前和即将发生的接触。我们引入了TouchSafeBench,一个基于物理的基准,用于评估视觉-语言模型(VLM)中的碰撞接地能力。TouchSafeBench基于Habitat 3.0构建,包含2,940个模拟室内共现场景,涵盖社交导航和社交重排,具有同步的多视角RGB-D观测、自上而下的轨迹地图、校准的相机元数据和模拟器导出的接触标签。我们研究了两个面向部署的任务:分类当前安全状态和在接触前预警即将发生的碰撞。在三个前沿或面向机器人的VLM和九种视觉表示中,当前模型远未达到可靠:最佳平均Macro-F1仍低于50%,显式深度不会自动转化为机器人身体碰撞证据,且机器人与场景的接触始终比人体接触风险更难判断。TouchSafeBench揭示了具身VLM的一个核心限制:视觉流畅性并不意味着物理责任性。可靠的机器人安全监控器需要能够显式绑定视角、机器人形态、度量几何和未来碰撞的表示。我们将在论文被接收后发布该基准。

英文摘要

Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.

2605.31121 2026-06-01 cs.RO cs.AI 版本更新

TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues

TARIC: 语义线索中断下基于记忆增强的可通行性感知户外视觉语言导航

Tianle Zeng, Hanjing Ye, Jianwei Peng, Jingwen Yu, Hanxuan Chen, Hong Zhang

发表机构 * Shenzhen Key Laboratory of Robotics and Computer Vision(深圳机器人与计算机视觉重点实验室) Southern University of Science and Technology(南方科技大学) CKS Robotics Institute(CKS机器人研究所) Hong Kong University of Science and Technology(香港科技大学) College of Electrical and Information Engineering(电气与信息工程学院)

AI总结 针对户外视觉语言导航中语义线索中断导致导航退化的问题,提出统一框架,通过可通行性一致的执行引导和不确定性感知的3D线索记忆,在长时间无线索阶段维持稳定导航,在四足和轮式平台上成功率提升显著。

AI中文摘要

户外视觉语言导航(VLN)在长距离、开放世界环境中经常受到语义线索中断的干扰,此时信息性目标线索变得稀疏、被遮挡或离开视野。一旦此类线索消失,智能体进入无线索阶段,并常退化为回溯、振荡航向或盲目探索。虽然基于记忆的方法试图弥合这些间隙,但在可通行性驱动的绕行中常常失败:记忆中的线索方向可能不可行,迫使绕行延长无线索阶段,并逐渐使机器人中心的线索过时、隐式历史模糊。这使得可通行性成为维持目标导向引导的稳定性条件,而不仅仅是局部安全问题。我们提出一个统一的户外VLN框架,通过在长时间无线索阶段维持可通行性一致的可执行引导来应对语义线索中断。具体来说,我们的方法从可见性门控的目标或探索线索中提取语义方位,并利用实时近场可通行性轮廓将其接地为可执行航向,提供超越仅拒绝安全过滤的目标一致可行引导。为防止绕行期间引导退化,我们将间歇性2D证据提升为世界对齐的3D线索记忆,并配备不确定性感知读出机制,确保引导在机器人移动时持续可达且稳定。我们在四足和轮式平台上评估该框架,路线长度为600-1000米。我们的方法在模拟中成功率比最强基线提高超过10个百分点,真实世界成功率达到40%,而最强基线为17.5%,且在长时间无线索间隔中具有显著更高的鲁棒性。

英文摘要

Outdoor vision-language navigation (VLN) in long-range, open-world environments is frequently disrupted by semantic-cue interruptions, where informative goal cues become sparse, occluded, or leave the field of view. Once such cues disappear, agents enter a cue-free phase and often degrade into backtracking, oscillatory headings, or aimless exploration. While memory-based methods attempt to bridge these gaps, they often fail under traversability-driven detours: the remembered cue direction may be infeasible, forcing detours that prolong cue-free phases and gradually render robot-centric cues stale and implicit histories blurred. This makes traversability a stability condition for maintaining goal-directed guidance, rather than merely a local safety concern. We propose a unified outdoor VLN framework that survives semantic-cue interruptions by maintaining traversability-consistent executable guidance throughout prolonged cue-free phases. Specifically, our method extracts semantic bearings from visibility-gated goal or exploration cues and grounds them into executable headings using a real-time near-field traversability profile, providing goal-consistent feasible guidance beyond reject-only safety filtering. To prevent guidance degradation during detours, we lift intermittent 2D evidence into a world-aligned 3D cue memory with an uncertainty-aware readout mechanism, ensuring guidance remains continuously reachable and stable as the robot moves. We evaluate the framework on quadrupedal and wheeled platforms over 600-1000 m routes. Our method improves simulation success rate by over 10 percentage points over the strongest baseline and achieves a real-world success rate of 40%, compared to 17.5% for the strongest baseline, with substantially higher robustness during prolonged cue-free intervals.

2605.31119 2026-06-01 cs.RO cs.LG 版本更新

Don't Fool Me Twice: Adapting to Adversity in the Wild with Experience-Driven Reasoning

不要愚弄我两次:通过经验驱动推理在野外适应逆境

Navin Sriram Ravie, Andrew Jong, Krrish Jain, John Liu, Omar Alama, Bijo Sebastian, Sebastian Scherer

发表机构 * Department of Engineering Design, Indian Institute of Technology, Madras(印度理工学院工程设计系,马德拉斯) Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所)

AI总结 提出一种持续学习框架,使移动机器人能够在线从干扰中学习,通过语义将异常行为归因于原因,从而更好地预测和规划未来。

AI中文摘要

在机器人学中,危险和逆境模式通常与具身形态相关,并因智能体而异。自主移动机器人的一个前沿是使智能体能够在野外未见的非结构化环境中有效运行。在未见的非结构化环境中的一个重大挑战是可能无法预测特定机器人的所有危险。尽管最近的工作使用大型基础视觉语言模型(VLM)来预先预测一个详尽的常识性危险列表,但仍然难以捕捉可能的交互和依赖于具身形态的逆境。我们提出了一个持续学习框架,使移动具身智能体能够在线从干扰中学习,并通过语义将异常行为归因于原因,从而更好地预测和规划未来世界。我们的框架“Don't Fool Me Twice”(不要愚弄我两次)首先观察干扰并描述其对机器人的影响;该描述通过视觉上下文增强,以查询VLM预测可能的原因;使用核回归对局部干扰进行特征化,从而实现对瞬态异常的高效、少样本建模。我们利用语义体素中心建模来估计认知不确定性,通过将交互驱动的干扰视为可学习的空间行为,实现更丰富的下游恢复。我们提出了四个假设,并在仿真和硬件上跨具身形态和逆境模式进行了验证。

英文摘要

In robotics, dangers and adversity modes are often embodiment-specific and relative to each agent. A frontier of autonomous mobile robotics is to enable agents to operate effectively in the wild in unseen unstructured environments. A significant challenge in unseen unstructured environments is that it may not be possible to predict all the dangers to the specific robot. Although recent work has used large foundation vision-language models (VLMs) to preemptively predict an exhaustive list of common-sense dangers, it remains difficult to capture possible interaction and embodiment-dependent adversities. We propose a continual learning framework for a mobile embodied agent to learn online from disturbances and attribute anomalous behaviours to causes through semantics, enabling better prediction and planning of the world in the future. Our framework, "Don't Fool Me Twice", first observes disturbances and describes their effects on the robot; this description is augmented with visual context to query a VLM to predict possible causes; the local disturbance is characterized using kernel regression, which allows for efficient, few-shot modeling of transient anomalies. We leverage semantic voxel-centric modeling to estimate epistemic uncertainty, enabling richer downstream recovery by treating interaction-driven disturbances as learnable spatial behaviors. We present four hypotheses and validate them in simulation and on hardware across embodiments and adversity modes.
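The abstract above states that local disturbances are characterized with kernel regression. As a rough illustration only, the sketch below uses a Nadaraya-Watson estimator with a Gaussian kernel over 2-D positions. The kernel choice, the bandwidth, the `samples` layout, and the slip values are assumptions made here, not details taken from the paper.

```python
import math


def gaussian_kernel(dist_sq, bandwidth):
    """Unnormalized Gaussian (RBF) kernel evaluated on a squared distance."""
    return math.exp(-dist_sq / (2.0 * bandwidth ** 2))


def kernel_regress(query, samples, bandwidth=0.5):
    """Nadaraya-Watson estimate of a disturbance magnitude at `query`.

    samples: list of ((x, y), value) pairs observed by the robot (hypothetical layout).
    Returns None when every kernel weight underflows to zero.
    """
    num = 0.0
    den = 0.0
    for (x, y), value in samples:
        d2 = (query[0] - x) ** 2 + (query[1] - y) ** 2
        w = gaussian_kernel(d2, bandwidth)
        num += w * value
        den += w
    return num / den if den > 0.0 else None


# Three made-up slip measurements: two near a muddy patch, one on firm ground.
observations = [((1.0, 1.0), 0.8), ((1.2, 0.9), 0.6), ((4.0, 4.0), 0.0)]
near = kernel_regress((1.1, 1.0), observations)  # dominated by the two nearby samples
far = kernel_regress((4.0, 3.9), observations)   # dominated by the firm-ground sample
```

Because the estimate is a weighted average of stored samples, a handful of observations is enough to get a local prediction, which fits the few-shot setting the abstract describes.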

2605.31116 2026-06-01 cs.CV cs.RO 版本更新

NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

NTR:端到端驾驶中场景令牌瓶颈的神经令牌重建

Jiahui Li, Jiawei Sun, Zixiang Ren, Ming Liu, Jiamin Shi, Ruiteng Zhao, Zhiyang Liu, Liying Liu, Zuoguan Wang, Kaidi Yang

发表机构 * National University of Singapore(新加坡国立大学) Black Sesame Technologies(黑芝麻智能)

AI总结 针对端到端驾驶中场景令牌瓶颈缺乏视觉监督的问题,提出神经令牌重建(NTR)框架,通过自蒸馏掩码潜在重建约束场景令牌保留更丰富的视觉表示,实现最先进的驾驶性能。

AI中文摘要

最近的无感知端到端自动驾驶方法通过将密集的图像块令牌压缩为紧凑的场景令牌,用于下游轨迹生成和评分,从而绕过了显式的感知输出。虽然这些场景令牌为规划器形成了紧凑的视觉瓶颈,但它们仅从规划目标接收监督,对编码的视觉信息提供了有限的约束。为了解决这一限制,我们引入了神经令牌重建(NTR),一种表示学习框架,直接约束无感知驾驶中的紧凑场景令牌瓶颈。NTR引入了一种自蒸馏掩码潜在重建目标,该目标仅使用紧凑的场景令牌作为重建记忆来重建被掩码的块级潜在特征。这迫使重建梯度仅通过场景令牌瓶颈传递,鼓励场景令牌为规划保留更丰富且更少冗余的视觉表示。我们进一步引入了来自基础模型注释的语义先验,作为弱语义接口,将重建目标偏向于驾驶相关结构,而不引入显式的感知头。所有辅助重建组件在推理时被移除,部署的规划器保持不变。NTR在三个公共自动驾驶基准测试中实现了最先进的性能,包括Waymo E2E上的8.0461 RFS以及NavSim1&2上的94.1 PDMS / 90.9 EPDMS。学习到的场景令牌表现出更低的成对冗余和更高的有效秩,表明有效的瓶颈监督同时改善了紧凑视觉表示学习和规划性能。

英文摘要

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.

2605.31110 2026-06-01 cs.RO 版本更新

Building Generalization Into Behavior Generation Via Adaptive Compositions of Regularities

通过规律的自适应组合构建行为生成中的泛化能力

Aravind Battaje, Malte Bernhard, Vito Mengers, Oliver Brock

发表机构 * Science of Intelligence, Research Cluster of Excellence, Berlin, Germany(柏林智能科学卓越研究中心) Robotics Institute Germany(德国机器人研究所)

AI总结 本文通过AICON框架研究自适应组合规律(机器人-环境系统中的可预测关系)作为行为生成中泛化能力的关键机制,并在模拟实验中验证其有效性。

Comments 10 pages, 6 figures

AI中文摘要

机器人领域的泛化需要关于世界如何结构化的先验知识,然而这种结构会随情境变化。本文研究一个命题:泛化源于将规律(机器人-环境系统中的可预测关系)自适应组合成适合情境的行为生成结构。我们通过分析AICON(主动互连)框架中的机制来检验这一命题,该框架将规律表示为可微分网络中的交互过程,其中感觉反馈实现组合,梯度下降生成行为。为了隔离自适应组合作为关键机制,我们研究了一个简单的模拟问题,其中所有相关规律都可以被识别。我们将所得模型暴露于设计时未考虑的各种新条件下,发现除了一个编码规律被证明不足的情况外,它在所有情况下都能生成情境适当的行为。消融实验表明,网络会根据规律的信息量自动调节哪些规律影响行为。这些结果表明,规律的自适应组合构成了将泛化能力构建到行为生成中的强大归纳偏置。

英文摘要

Generalization in robotics requires prior knowledge about how the world is structured, yet this structure changes from one situation to the next. This paper investigates the proposition that generalization arises from adaptively composing regularities -- predictable relationships within the robot-environment system -- into situation-appropriate structures for behavior generation. We examine this proposition by analyzing the mechanism in AICON (Active InterCONnect), a framework representing regularities as interacting processes in a differentiable network, where sensory feedback realizes composition and gradient descent generates behavior. To isolate adaptive composition as the key mechanism, we study a simple simulated problem in which all relevant regularities can be identified. We expose the resulting model to a wide range of novel conditions not considered during design, and we find that it generates context-appropriate behavior in all but one case, where encoded regularities are provably insufficient. Ablations reveal that the network automatically modulates which regularities influence behavior based on their informativeness. These results suggest that adaptive composition of regularities constitutes a powerful inductive bias for building generalization into behavior generation.

2605.31066 2026-06-01 cs.RO 版本更新

Can Aerial VLA Models Cooperate? Evaluating Closed-Loop Air-Ground Coordination with CARLA-Air

空中VLA模型能协作吗?基于CARLA-Air的闭环空地协调评估

Tianle Zeng, Yanci Wen, Xueang Yu, Hong Zhang

发表机构 * Southern University of Science and Technology(南方科技大学) Fudan University(复旦大学)

AI总结 本文通过构建CARLA-Air仿真环境,评估空中视觉-语言-动作模型在空地协作任务中的表现,发现当前模型难以将单智能体能力转化为稳定协作行为,并指出零样本协作需要伙伴状态显式感知、低延迟动作协调和团队目标对齐三个关键组件。

Comments Code at https://github.com/louiszengCN/CarlaAir

AI中文摘要

最近的空中视觉-语言-动作(VLA)模型展示了有前景的单无人机能力,例如跟踪移动物体和导航到语言指定的地标。然而,这些能力能否转移到空地协作中尚不清楚,其中无人机和无人地面车辆必须在共享的闭环物理世界中联合行动。我们通过CARLA-Air研究这个问题,这是一个单进程空地评估环境,在同一个虚幻引擎运行时内统一了CARLA和AirSim。通过共享相同的世界状态、物理时钟和感知流水线,CARLA-Air实现了物理一致的无人机-无人地面车辆交互,并精确测量仿真时间戳对齐和有效协调延迟。利用CARLA-Air,我们在两个互补的诊断任务上评估了代表性的空中VLA和规划基线:移动平台降落和遮挡恢复护航。结果表明,当前的空中VLA模型通常能够跟踪或跟随地面伙伴,但难以将这种单智能体能力转化为稳定的协作行为。状态提示提供的益处有限,而朴素的双向交互未能持续提高性能,并且可能放大大多数基线的错误。这些发现表明,在测试的基于文本的提示接口下,零样本协作空地VLA需要当前范式之外的三个组件:显式的伙伴状态感知、低延迟动作协调和团队目标对齐。我们的代码可在https://github.com/louiszengCN/CarlaAir获取。

英文摘要

Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities can transfer to air-ground cooperation, where a UAV and a UGV must act jointly in a shared, closed-loop physical world. We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. By sharing the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV-UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency. Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.

2605.30989 2026-06-01 cs.RO 版本更新

A study on a Real-Time VR-Based Teleoperation Framework for Manipulator in Dynamic Environment

动态环境下基于实时VR的机械臂遥操作框架研究

InGyu Choi, GeonYeong Go, SunWoo Ahn, HyoJae Kang, Min-Sung Kang

发表机构 * Department of Robotics, Hanyang University(汉阳大学机器人系) Department of Smart Construction Engineering, Hanyang University(汉阳大学智能建造工程系) Department of Interdisciplinary Robot Engineering Systems, Hanyang University(汉阳大学跨学科机器人工程系统系) School of Smart Convergence Engineering, Hanyang University, Ansan(汉阳大学智能融合工程学院,安山)

AI总结 提出一种集成GPU加速逆运动学和轨迹优化的VR遥操作框架,在静态和动态障碍物环境中实现低延迟、碰撞感知的实时机械臂控制。

Comments This manuscript has been submitted for possible publication

AI中文摘要

机器人遥操作能够在人类难以直接进入的危险环境中安全、非接触地执行任务,并且随着最近VR技术的发展,其应用范围已经扩大。然而,许多VR遥操作研究主要作为机器人模仿学习的数据收集工具,因此它们通常没有明确处理操作过程中的动态障碍物、工作空间变化或碰撞风险。为了实际部署以保障操作员安全,遥操作必须能够以低延迟响应动态情况,并对经验不足的操作员的错误保持鲁棒性。本文提出了一种VR遥操作框架,支持实时操作,同时处理与静态和移动障碍物的碰撞。该框架在VR界面中集成了GPU加速的逆运动学和轨迹优化,以在机器人约束下在每个控制周期生成可行的关节命令。使用7自由度机械臂进行的实验展示了在无障碍物、静态障碍物和移动障碍物三种场景下的稳定在线行为和碰撞感知运动生成。结果表明,所提出的方法生成的运动与操作员的命令一致,并在障碍物干扰命令路径时产生安全的绕行。

英文摘要

Robot teleoperation enables safe, non-contact task execution in hazardous environments where direct human access is difficult, and its application has expanded with recent VR technologies. Many VR teleoperation studies, however, have primarily served as data-collection tools for robot imitation learning, so they often do not explicitly address dynamic obstacles, workspace changes, or collision risks during operation. For real deployment aimed at operator safety, teleoperation must react to dynamic situations with low latency and remain robust to mistakes made by inexperienced operators. This paper presents a VR teleoperation framework that supports real-time manipulation while handling collisions with both static and moving obstacles. The framework integrates GPU-accelerated inverse kinematics and trajectory optimization within a VR interface to generate feasible joint commands at each control cycle under robot constraints. Experiments with a 7-DoF manipulator demonstrate stable online behavior and collision-aware motion generation across three scenarios: obstacle-free, static-obstacle, and moving-obstacle environments. The results indicate that the proposed approach generates motion consistent with the operator's command while producing safe detours when obstacles interfere with the commanded path.

2605.30957 2026-06-01 cs.RO 版本更新

RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning

RDGen: 通过强化学习生成高质量机器人学习的演示

Zijian Zhu, Menglin Zou, Zhuang Li, Yaojie Tu, Xinhai Sun

AI总结 提出RDGen框架,利用从仿真到真实的强化学习策略生成高质量机器人演示轨迹,用于训练视觉-语言-动作模型,相比人工遥操作产生更平滑轨迹并提升下游性能。

Comments 13 pages, 4 figures, 3 tables

AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人控制的一种有前景的范式。然而,其性能仍然从根本上受限于高质量机器人轨迹数据的可用性。在当前的机器人学习实践中,这些数据主要通过人类遥操作收集,这需要大量人力、成本高昂且难以扩展。在本文中,我们提出了RDGen,一种用于生成高质量机器人演示的仿真到真实强化学习框架。RDGen并非仅将强化学习用作最终控制策略,而是利用训练好的RL策略作为结构化的轨迹生成器。该系统由一个基于VLM的任务解析器(用于识别任务相关物体)、一个基于Grounding DINO的物体定位器以及一个从仿真迁移到真实机器人的RL策略组成。然后,成功的 rollout 被收集为干净、高质量的演示,用于下游VLA训练,而仿真阶段进一步以极低的边际成本提供可扩展的额外轨迹来源。在拾取和放置任务上的实验表明,迁移后的RL策略实现了高任务成功率。与人类遥操作相比,RDGen生成的轨迹显著更平滑,并产生更优的下游VLA性能。这些结果表明,RL生成的演示可以作为机器人策略学习更可靠和一致的监督信号。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robot control. However, their performance remains fundamentally constrained by the availability of high-quality robot trajectory data. In current robot learning practice, such data are primarily collected through human teleoperation, which is labor-intensive, costly, and difficult to scale. In this paper, we propose RDGen, a sim-to-real reinforcement learning framework for generating high-quality robot demonstrations. Rather than employing reinforcement learning solely as the final control policy, RDGen leverages trained RL policies as a structured trajectory generator. The system consists of a VLM-based task parser that identifies task-relevant objects, a Grounding DINO-based object localizer, and an RL policy transferred from simulation to the real robot. Successful rollouts are then harvested as clean, high-quality demonstrations for downstream VLA training, while the simulation stage further provides a scalable source of additional trajectories at little marginal cost. Experiments on a pick-and-place task demonstrate that the transferred RL policy achieves a high task success rate. Compared with human teleoperation, RDGen produces significantly smoother trajectories and yields superior downstream VLA performance. These results indicate that RL-generated demonstrations can serve as more reliable and consistent supervisory signals for robot policy learning.

2605.30928 2026-06-01 cs.RO 版本更新

Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization

通过分层宏动作量化增强强化学习智能体的人类相似性

Usman Nizamani, M. Shaheer Luqman, Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, M. Zeeshan Zia, Quoc-Huy Tran

发表机构 * Retrocausal, Inc.(Retrocausal公司)

AI总结 提出一种分层宏动作量化框架(HiMAQ),通过两级向量量化将人类演示编码为宏动作,使强化学习智能体在保持高回报的同时生成更接近人类的行为序列,在D4RL基准上优于非分层基线并兼容多种RL算法。

AI中文摘要

人类化智能体是人工智能的长期目标。尽管性能强劲,大多数强化学习(RL)智能体仍以奖励驱动,且常表现出与人类不同的行为,限制了可解释性和可靠性。在这项工作中,我们引入了一种新颖的人类化RL框架,该框架在最大化奖励的同时预测与人类行为紧密对齐的动作序列。具体来说,我们使用一种分层宏动作量化方法(称为HiMAQ)将人类演示编码为宏动作,该方法包含两个连续的向量量化层级。低层量化将输入动作映射到细粒度的子动作簇,而高层量化将这些子动作簇聚合成动作簇。在D4RL基准上的广泛评估表明,我们的分层方法优于非分层基线(MAQ),在保持与先前RL智能体相当或更高成功率的同时,获得了更好的人类相似性分数。这些改进泛化到与各种RL算法(即IQL、SAC和RLPD)的集成中。

英文摘要

Human-like agents are a long-standing goal of artificial intelligence. Despite strong performance, most reinforcement learning (RL) agents remain reward-driven and often exhibit behaviors that differ from humans, limiting interpretability and reliability. In this work, we introduce a novel human-like RL framework that predicts action sequences closely aligned with human behaviors while maximizing rewards. Specifically, we encode human demonstrations into macro actions using a hierarchical macro action quantization approach (termed HiMAQ) consisting of two successive levels of vector quantization. The lower quantization level maps input actions to fine-grained subaction clusters, while the higher quantization level aggregates these subaction clusters into action clusters. Extensive evaluations on the D4RL benchmarks show that our hierarchical approach outperforms the non-hierarchical baseline (MAQ), achieving better human-likeness scores while maintaining comparable or better success rates than previous RL agents. The improvements generalize across integrations with various RL algorithms, namely IQL, SAC, and RLPD.
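The two successive quantization levels described above can be pictured with a small nearest-neighbour sketch. It assumes that the lower level assigns each action to its closest subaction centroid and that the higher level assigns that centroid to its closest action-cluster centroid. Both codebooks, the squared-L2 distance, and the function names are hypothetical; the abstract does not specify how HiMAQ learns or applies its codebooks.

```python
def nearest(vector, codebook):
    """Index of the codebook entry closest to `vector` under squared L2 distance."""
    def dist_sq(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist_sq(codebook[i]))


def quantize_two_level(actions, low_codebook, high_codebook):
    """Return (subaction_index, action_cluster_index) for every input action."""
    codes = []
    for action in actions:
        low = nearest(action, low_codebook)               # fine-grained cluster
        high = nearest(low_codebook[low], high_codebook)  # coarse cluster of that centroid
        codes.append((low, high))
    return codes


# Hypothetical codebooks: four fine-grained centroids grouped into two coarse ones.
low_codebook = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
high_codebook = [(0.05, 0.0), (1.05, 1.0)]
actions = [(0.02, 0.01), (0.09, 0.0), (1.2, 0.9)]
codes = quantize_two_level(actions, low_codebook, high_codebook)
```

In this toy setup the first two actions fall into different subaction clusters but share one action cluster, which is the kind of aggregation the higher level is described as performing.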

2605.30906 2026-06-01 cs.RO cs.SY eess.SY 版本更新

Trajectory Planning for Non-Communicating Mobile Robots using Inverse Optimal Control

非通信移动机器人的逆最优控制轨迹规划

Nina Majer, Yannick Epple, Xin Ye, Stefan Schwab, Sören Hohmann

发表机构 * FZI Research Center for Information Technology(FZI信息技术研究中心) Institute of Control Systems, Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院控制系统研究所)

AI总结 针对非通信移动机器人在避碰场景中的高效交互,提出一种结合逆最优控制的轨迹规划与预测算法,通过估计未知目标状态并联合预测,实现更快的规划求解。

AI中文摘要

为了实现非通信移动机器人在避碰场景中的高效交互,我们提出了一种新颖的轨迹规划与预测组合算法。逆最优控制用于基于观测到的过去轨迹估计所有机器人的未知目标状态。每个机器人还从其他机器人的角度考虑自我预测,并使用估计的目标状态解决联合预测问题。然后将得到的预测用于规划。在2-8个机器人场景中的仿真结果表明,与基于恒定加速度估计目标状态的规划相比,所有车辆到达目标的中位时间加快了9.8%。此外,所提出的方法从未导致求解器无法找到规划或预测问题的解。

英文摘要

To enable efficient interaction among non-communicating mobile robots in collision avoidance scenarios, we present a novel combined trajectory planning and prediction algorithm. Inverse optimal control is used to estimate the unknown goal states of all robots from their observed past trajectories. Each robot also considers its own predicted motion from the perspective of the other robots and solves a joint prediction problem using the estimated goal states. The resulting predictions are then used for planning. In simulated scenarios with 2-8 robots, the median time until all vehicles reach their goals is 9.8% shorter than when planning with goal states estimated under a constant-acceleration assumption. Moreover, with the proposed approach the solver never failed to find a solution to the planning or prediction problem.
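The comparison baseline estimates goal states from a constant-acceleration model. One plausible reading, sketched below, extrapolates the last three observed positions with finite-difference velocity and acceleration. The sampling scheme, the function name, and the horizon are assumptions for illustration; the abstract does not say how the baseline is implemented.

```python
def extrapolate_constant_acceleration(positions, dt, horizon):
    """Predict the position `horizon` seconds ahead from the last three samples.

    positions: sequence of coordinate tuples sampled every `dt` seconds.
    """
    if len(positions) < 3:
        raise ValueError("need at least three position samples")
    p0, p1, p2 = positions[-3], positions[-2], positions[-1]
    predicted = []
    for a, b, c in zip(p0, p1, p2):
        velocity = (c - b) / dt                      # backward difference
        acceleration = (c - 2.0 * b + a) / dt ** 2   # second difference
        predicted.append(c + velocity * horizon + 0.5 * acceleration * horizon ** 2)
    return tuple(predicted)


# A robot speeding up along x: positions 0, 1, 3 at one-second intervals.
goal_guess = extrapolate_constant_acceleration([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)], dt=1.0, horizon=2.0)
```

Such an estimate depends only on recent kinematics, whereas the inverse optimal control approach in the abstract infers goals from the observed trajectory as a whole.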

2605.30849 2026-06-01 cs.RO 版本更新

High-Load-Density Electro-Permanent Magnetic Foot with Controllable Adhesion for Quadruped Wall-Climbing Robots

用于四足爬壁机器人的高负载密度可控吸附电永磁足

An Li, Bo Tao, I-Ming Chen, Han Ding

AI总结 提出一种高负载密度可控吸附电永磁足,采用圆形Halbach网络电永磁(CHN-EPM)吸附单元和力反馈系统,实现四足机器人在铁磁表面的可靠攀爬,最大吸附力超过1000 N,负载重量比超过200:1。

Comments 10 pages, 6 figures, 2 tables; project page and videos available in the repository

AI中文摘要

为了实现四足机器人在铁磁表面上的可靠攀爬运动,本文提出了一种具有可控吸附力的高负载密度电永磁足,其特点是采用力反馈圆形Halbach网络电永磁(CHN-EPM)吸附单元和磁化控制系统。由于其三维磁路结构和磁通集中效应,CHN-EPM实现了分布式并联磁通路径,提高了磁通利用率,从而降低了对气隙变化的敏感性,即使在部分接触条件下也能保持有效吸附。所提出的CHN-EPM最大吸附力超过1000 N,负载重量比超过200:1。开发了磁化驱动器和两级脉冲电流控制策略,以调节励磁电流幅值和持续时间,实现精确可靠的磁化。通过集成柔性压力传感器进行接触力反馈,系统可以有效监测附着和脱离状态,确保在不确定接触条件下实现可靠的吸附切换。所提出的系统被集成到商用四足机器人(Unitree GO2)中,展示了在天花板和垂直壁面上的高负载吸附,以及在涂漆、穿孔和弯曲铁磁表面上的稳定运动。

英文摘要

To enable reliable climbing locomotion of quadruped robots on ferromagnetic surfaces, this paper presents a high-load-density electro-permanent magnetic foot with controllable adhesion, featuring force-feedback circular Halbach-net electro-permanent magnet (CHN-EPM) adhesion units and a magnetization control system. Due to its three-dimensional magnetic circuit structure and flux-concentration effect, the CHN-EPM enables a distributed parallel magnetic flux path with enhanced flux utilization, resulting in reduced sensitivity to air-gap variations and allowing effective adhesion to be maintained even under partial contact conditions. The proposed CHN-EPM generates a maximum adhesion force exceeding 1000 N with a load-to-weight ratio over 200:1. A magnetization driver and a two-stage pulse current control strategy are developed to regulate the excitation current amplitude and duration, enabling accurate and reliable magnetization. By incorporating a flexible pressure sensor for contact force feedback, the system can effectively monitor attachment and detachment states, ensuring robust adhesion switching under uncertain contact conditions. The proposed system is integrated into a commercial quadruped robot (Unitree GO2), demonstrating high-load adhesion on ceiling and vertical-wall surfaces and stable locomotion on painted, perforated, and curved ferromagnetic surfaces.

2605.30834 2026-06-01 cs.RO cs.AI 版本更新

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

轨迹中的捉迷藏:发现VLA运行时监控的失败信号

Seongheon Park, Wendi Li, Changdae Oh, Samuel Yeh, Zsolt Kira, Michael Hagenow, Sharon Li

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出Hide-and-Seek框架,通过轨迹间和轨迹内对比学习,从轨迹级监督中定位失败指示动作,实现无需步骤标注的VLA模型运行时失败检测。

AI中文摘要

视觉-语言-动作(VLA)模型使机器人能够遵循自然语言指令并在不同任务中泛化,但在实际部署中仍易受执行失败影响,损害可靠性。因此,在执行过程中检测此类失败对于具身系统的稳健部署至关重要。现有的失败检测方法要么依赖昂贵的动作重采样或外部模型,要么将轨迹级标签均匀传播到每个时间步,掩盖了局部失败信号。在本文中,我们提出Hide-and-Seek框架,将VLA失败检测形式化为粗监督学习问题。通过结合轨迹间和轨迹内对比目标,Hide-and-Seek能够定位指示失败的动作,并仅从轨迹级监督中诱导出具有时间结构的失败信号,无需任何步骤级标注。我们在LIBERO、VLABench和真实机器人平台上,针对三种代表性VLA策略(OpenVLA、$π_0$和$π_{0.5}$)评估了Hide-and-Seek。我们的方法在共形预测下实现了最先进的多任务失败检测性能,具有实用的准确度-及时性权衡,并且对已见和未见任务均具有良好的泛化能力。

英文摘要

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, or propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, $π_0$, and $π_{0.5}$. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy-timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.
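The abstract reports results under conformal prediction without giving the calibration procedure. A common choice, shown here purely as an assumed illustration, is split conformal calibration: collect one score per successful calibration rollout (for example its maximum per-step failure score), take the ceil((n + 1)(1 - alpha))-th smallest as the threshold, and raise an alarm at the first step that exceeds it. The score definition, the numbers, and the function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold from scores of successful calibration rollouts.

    Under exchangeability, a new successful rollout exceeds the returned
    threshold with probability at most `alpha`.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")  # too few calibration rollouts for this alpha
    return sorted(calibration_scores)[rank - 1]


def first_alarm(step_scores, threshold):
    """First timestep whose failure score exceeds the threshold, or None."""
    for t, score in enumerate(step_scores):
        if score > threshold:
            return t
    return None


# Made-up maximum failure scores from seven successful rollouts.
calibration = [0.10, 0.30, 0.15, 0.25, 0.20, 0.12, 0.28]
threshold = conformal_threshold(calibration, alpha=0.25)
alarm_step = first_alarm([0.05, 0.20, 0.27, 0.40, 0.90], threshold)
```

A smaller `alpha` raises the threshold, giving fewer false alarms but later detection, which is one way to read the accuracy-timeliness trade-off mentioned above.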

2605.30795 2026-06-01 cs.RO 版本更新

Feat2Go: Visual Feature-Grounded Value Estimation for Embodied Reinforcement Learning

Feat2Go: 面向具身强化学习的视觉特征基础价值估计

Junyang Shu, Zhiwei Lin, Bingqing Wei, Yongtao Wang

发表机构 * Wangxuan Institute of Computer Technology, Peking University, China(北京大学计算机科学技术研究院)

AI总结 提出Feat2Go框架,通过预训练视觉世界模型提取补丁级子目标相似度并聚类语义阶段,训练具身价值模型预测结构进度以重塑终端奖励,显著提升VLA模型在单臂和双臂操作任务中的强化学习性能。

AI中文摘要

强化学习是提升视觉-语言-动作(VLA)模型能力的一种有前景的方法,同时避免了模仿学习对大量数据的需求。然而,其对VLA模型的有效性常受限于稀疏监督以及为长程操作设计信息丰富的奖励信号的困难。在这项工作中,我们提出了Feat2Go,一种用于具身强化学习的细粒度价值估计框架。具体来说,Feat2Go首先通过测量与子目标状态的补丁级相似性,并利用基于趋势的聚类将回合划分为语义阶段,从预训练的视觉世界模型中导出一个连续的进度目标。然后,我们训练一个具身价值模型,根据当前观测和任务指令预测这一结构进度,并在策略优化过程中使用预测值重塑终端奖励。所提出的框架与现有的VLA策略强化学习流程(包括PPO和GRPO)兼容,且不依赖手动奖励工程。在ManiSkill3和RoboTwin 2.0上的大量实验表明,Feat2Go在单臂和双臂操作设置下均能持续提升现有VLA模型的性能。更具体地说,在ManiSkill3上,Feat2Go将OpenVLAOFT的平均分布外成功率从17.5%提升至82.9%,同时保留了96.9%的分布内性能。在RoboTwin 2.0上,Feat2Go在域随机化任务设置中实现了88.8%的平均成功率,优于先前的强化学习方法。

英文摘要

Reinforcement learning is a promising approach for improving the capabilities of vision-language-action (VLA) models while avoiding the heavy data requirements of imitation learning. However, its effectiveness for VLA models is often constrained by sparse supervision and the difficulty of designing informative reward signals for long-horizon manipulation. In this work, we present Feat2Go, a fine-grained value estimation framework for embodied reinforcement learning. Specifically, Feat2Go first derives a continuous progress target from a pretrained visual world model by measuring patch-level similarity to subgoal states and partitioning episodes into semantic stages with trend-based clustering. We then train an embodied value model to predict this structural progress from the current observation and task instruction, and use the predicted value to reshape terminal rewards during policy optimization. The proposed framework is compatible with existing VLA policy reinforcement learning pipelines, including PPO and GRPO, and does not rely on manual reward engineering. Extensive experiments on ManiSkill3 and RoboTwin 2.0 demonstrate that Feat2Go consistently improves the performance of existing VLA models under both single-arm and bimanual manipulation settings. More specifically, on ManiSkill3, Feat2Go improves OpenVLA-OFT from 17.5% to 82.9% average out-of-distribution success while retaining 96.9% in-distribution performance. On RoboTwin 2.0, Feat2Go achieves an average success rate of 88.8% in domain-randomized task settings, outperforming prior reinforcement learning methods.

2605.30780 2026-06-01 cs.RO 版本更新

Two Degree-of-Freedom Vibratory Transport in a Grasp

抓取中的两自由度振动运输

C. L. Yako, Shenli Yuan, Kenneth Salisbury

发表机构 * Department of Mechanical Engineering, Stanford University(斯坦福大学机械工程系) Department of Computer Science, Stanford University(斯坦福大学计算机科学系)

AI总结 利用非对称振动实现抓取零件的两自由度(DoF)手内操作,通过闭环位置控制产生周期性粘滑波形,分析波形参数对平均速度的影响,并用实验验证。

AI中文摘要

在本文中,我们利用非对称振动演示了抓取零件的两自由度(DoF)手内操作。非对称振动通过移动表面的闭环位置控制实现,该表面向待操作零件施加周期性粘滑波形。我们从理论上分析了两个振动波形参数——粘附加速度和滑动加速度——如何影响零件在对抗重力运动时的平均速度。然后使用实验装置验证理论趋势,其中挤压力受控,零件运动由高分辨率编码器记录。我们还开发了一个2-DoF振动表面,能够在一个方向平移并绕表面法线旋转。在平行爪夹持器配置中使用两个这样的2-DoF表面,我们双向平移和旋转各种抓取零件,并证明相同的平移波形趋势也适用于面内旋转。

英文摘要

In this paper, we use asymmetric vibrations to demonstrate two degree-of-freedom (DoF) in-hand manipulation of grasped parts. The asymmetric vibrations are achieved through closed-loop position control of a moving surface, which applies a periodic stick-slip waveform to the part to be manipulated. We show analytically how two vibratory waveform parameters, the sticking acceleration and the slipping acceleration, affect average part velocity when moving against gravity. The theoretical trends are then validated using an experimental setup where the squeeze force is controlled and part motion is recorded by a high-resolution encoder. We also develop a 2-DoF vibratory surface capable of translation in one direction and rotation about the surface normal. Using two of these 2-DoF surfaces in a parallel jaw gripper configuration, we bidirectionally translate and rotate a variety of grasped parts, and demonstrate that the waveform trends observed for translation also hold for in-plane rotation.

2605.30778 2026-06-01 cs.RO 版本更新

Object-Informed Model Predictive Path Integral Control for Non-Prehensile Robot Manipulation

面向非抓取式机器人操作的对象感知模型预测路径积分控制

Nikola Raicevic, Bharath Raam Radhakrishnan, Chenbin Yu, Ki Myung Brian Lee, Nikolay Atanasov

发表机构 * Department of Electrical and Computer Engineering, University of California San Diego(加州大学圣地亚哥分校电气与计算机工程系)

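AI summary Proposes a hierarchical model predictive path integral (MPPI) control framework in which an object-level plan guides robot-level planning, enabling efficient long-horizon planning for non-prehensile manipulation.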

详情
AI中文摘要

由于欠驱动和不连续交互,非抓取式机器人操作的长时域规划具有挑战性。我们提出一种分层模型预测路径积分(MPPI)控制框架,通过单独计算的对象级规划引导机器人级规划,实现高效的长时域预测。我们首先求解一个简化的仅对象问题,假设对象可以直接被驱动,并将规划的对象轨迹作为参考来求解联合机器人-对象规划问题。我们在仿真和硬件上使用6自由度xArm6机械臂执行对象推动任务来评估我们的方法,其中目标对象必须到达目标点同时避开静态障碍物,这需要非短视的推理。我们的对象感知MPPI在仿真中将任务成功率提高了40%,控制频率提高了26%,在实际实验中提高了20%,且计算量与常规MPPI相当。

英文摘要

Long-horizon planning for non-prehensile robot manipulation is challenging due to underactuated and discontinuous interactions. We propose a hierarchical formulation of model predictive path integral (MPPI) control that guides robot-level planning with a separately computed object-level plan to achieve efficient long-horizon prediction. We first solve a simplified object-only problem, assuming the object can be actuated directly, and use the planned object trajectory as a reference in solving the joint robot-object planning problem. We evaluate our method in both simulation and hardware using a 6-DoF xArm6 manipulator to perform object pushing tasks in which the target object must reach a goal while avoiding static obstacles, necessitating non-myopic reasoning. Our object-informed MPPI increases task success by 40% with a 26% faster control frequency in simulation, and by 20% in real experiments with similar computation as regular MPPI.
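
The basic MPPI update that this abstract builds on can be sketched in a few lines. This is a minimal single-level illustration on a 1D single integrator, not the hierarchical object-informed method; the dynamics, the cost, and all function names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def mppi_weights(costs, lam):
    """Softmin weights exp(-(S - S_min) / lam), normalized to sum to 1."""
    s_min = min(costs)
    unnorm = [math.exp(-(s - s_min) / lam) for s in costs]
    total = sum(unnorm)
    return [w / total for w in unnorm]


def rollout_cost(x0, controls, goal, dt=0.1):
    """Roll out x <- x + u * dt and sum the squared distance to the goal."""
    x, cost = x0, 0.0
    for u in controls:
        x += u * dt
        cost += (x - goal) ** 2
    return cost


def mppi_step(x0, nominal, goal, num_samples=256, sigma=1.0, lam=1.0, rng=None):
    """One MPPI iteration: perturb the nominal controls, weight by cost, average."""
    rng = rng or random.Random(0)
    # Sample Gaussian perturbations for every time step of every rollout.
    noises = [[rng.gauss(0.0, sigma) for _ in nominal] for _ in range(num_samples)]
    costs = [
        rollout_cost(x0, [u + e for u, e in zip(nominal, eps)], goal)
        for eps in noises
    ]
    weights = mppi_weights(costs, lam)
    # Cost-weighted average of the perturbations updates the nominal sequence.
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noises))
        for t, u in enumerate(nominal)
    ]


if __name__ == "__main__":
    nominal = [0.0] * 10
    updated = mppi_step(0.0, nominal, goal=1.0)
    print(rollout_cost(0.0, nominal, 1.0), rollout_cost(0.0, updated, 1.0))
```

In a hierarchical variant along the lines the abstract describes, one would presumably add a term to the rollout cost that penalizes deviation from a precomputed object trajectory; that term is not modeled here.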

2605.30770 2026-06-01 cs.RO 版本更新

SSR: Scaling Surefooted and Symmetric Humanoid Traversal to the Open World

SSR:将稳健且对称的人形穿越扩展到开放世界

Ruiqi Yu, Yiwen Wang, Yuan Hao, Jun WU, Qiuguo Zhu

发表机构 * Zhejiang University(浙江大学)

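AI summary Proposes the SSR framework, which combines imagined foothold guidance, equivariant latent-space symmetry augmentation, and terrain-specific multi-discriminator motion priors to achieve safe and stable vision-based humanoid traversal in the open world.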

详情
AI中文摘要

将人形穿越扩展到开放世界是在人类环境中实际部署的关键,但仍然具有挑战性。机器人必须利用视觉在高度动态运动下确保在异质地面上安全可靠的落脚点,同时产生协调、自然的全身行为。我们提出SSR,一种高效的端到端框架,用于基于自我中心视觉的人形穿越,联合学习这些能力。SSR引入了想象落脚点引导,学习建模即将到来的摆动脚接触并评估其支撑,以指导触地前的摆动朝向稳定区域,减少边缘滑动。它进一步采用等变潜在空间对称增强,在高维视觉观察下有效诱导双边协调,并使用地形特定多判别器运动先验,鼓励跨场景的类人行为。大量实验表明,SSR在多种真实世界地形上实现了安全、稳定和高质量的运动,包括不同结构的楼梯以及宽间隙和高平台等极端挑战,同时在开放户外环境中实现了可靠的长距离穿越。

英文摘要

Extending humanoid traversal to the open world is key to practical deployment in human environments, but remains challenging. The robot must use vision to ensure safe and reliable foot placement on heterogeneous terrain under highly dynamic motion, while producing coordinated, natural whole-body behaviors. We propose SSR, an efficient end-to-end framework for egocentric vision-based humanoid traversal that jointly learns these capabilities. SSR introduces imagined foothold guidance, which learns to model forthcoming swing-foot contacts and evaluates their support to guide pre-touchdown swings toward stable regions, reducing edge slips. It further employs equivariant latent-space symmetry augmentation to efficiently induce bilateral coordination under high-dimensional visual observations, and uses terrain-specific multi-discriminator motion priors to encourage human-like behavior across scenes. Extensive experiments show that SSR achieves safe, stable, and high-quality locomotion on diverse real-world terrains, including stairs with varied structures and extreme challenges such as wide gaps and high platforms, while enabling reliable long-horizon traversal in open outdoor environments.

2605.30769 2026-06-01 cs.CV cs.RO 版本更新

DisPlace: Discriminative Place Projections for Multi-Reference Visual Place Recognition

DisPlace: 面向多参考视觉地点识别的判别性地点投影

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

发表机构 * QUT Centre for Robotics, School of Electrical Engineering and Robotics at the Queensland University of Technology(昆士兰理工大学机器人中心,电气工程与机器人学学院)

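AI summary Proposes the DisPlace framework, which fuses multi-reference descriptors via a generalized eigenvalue problem that maximizes between-place separability and suppresses within-place variation, improving the robustness of visual place recognition under varying conditions.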

Comments Under review

详情
AI中文摘要

视觉地点识别(VPR)的一个关键挑战是在不同环境条件和视角下,将查询图像与参考地图进行匹配。虽然多次参考遍历提高了鲁棒性,但现有的融合策略要么统一聚合参考,要么依赖启发式选择,无法区分保持稳定地点身份的描述符变化与由变化条件或视角引起的变化。在本文中,我们提出DisPlace,一种多参考VPR框架,将多个参考描述符融合为单个紧凑且具有判别性的地点表示。DisPlace将描述符融合表述为一个广义特征值问题,该问题最大化地点间可分性,同时抑制跨参考的地点内变化,而不是保留整体描述符方差。与现有的多参考融合方法不同,DisPlace利用跨参考遍历的变化来识别哪些描述符维度的线性组合保留了地点身份,哪些捕捉了条件或视角特定的变化。我们在Oxford RobotCar、Nordland、Pittsburgh30k和Google Landmarks v2上,使用六种最先进的VPR描述符评估了DisPlace。在54种外观变化条件下,DisPlace在49种中优于七种多参考基线,在视角和非结构化设置下持续改进描述符级融合性能,并且在推理期间比所有比较的融合方法需要更少的存储空间。

英文摘要

A key challenge in Visual Place Recognition (VPR) is matching query images against reference maps captured under diverse environmental conditions and viewpoints. While multiple reference traversals improve robustness, existing fusion strategies either aggregate references uniformly or rely on heuristic selection, without distinguishing descriptor variations that preserve stable place identity from those caused by changing conditions or viewpoints. In this paper, we propose DisPlace, a multi-reference VPR framework that fuses multiple reference descriptors into a single compact and discriminative place representation. DisPlace formulates descriptor fusion as a generalized eigenvalue problem that maximizes between-place separability while suppressing within-place variation across references, rather than preserving overall descriptor variance. Unlike existing multi-reference fusion methods, DisPlace exploits variation across reference traversals to identify which linear combinations of descriptor dimensions preserve place identity and which capture condition- or viewpoint-specific variation. We evaluate DisPlace on Oxford RobotCar, Nordland, Pittsburgh30k, and Google Landmarks v2 across six state-of-the-art VPR descriptors. DisPlace outperforms seven multi-reference baselines in 49 out of 54 appearance-varying conditions, consistently improves descriptor-level fusion performance under viewpoint and unstructured settings, and requires less storage during inference than all compared fusion methods.

2605.30749 2026-06-01 cs.LG cs.RO 版本更新

FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance

FLAG: 通过潜在增强引导的流策略最大熵强化学习

Sungha Kim, Gawon Lee, Jusuk Lee, Jonghae Park, H. Jin Kim, Daesol Cho

发表机构 * Seoul National University(首尔国立大学) Georgia Institute of Technology(佐治亚理工学院)

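AI summary Proposes FLAG, which augments the state space with a latent variable and optimizes a proxy maximum-entropy objective to address importance weight collapse, enabling expressive policy optimization in high-dimensional control tasks.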

详情
AI中文摘要

最大熵强化学习(MaxEnt-RL)能够实现鲁棒的探索,然而实际实现通常将策略限制为简单的高斯分布。最近的方法通过重要性加权监督学习引入表达性生成策略,但容易受到重要性权重崩溃的影响,这限制了它们在高维动作空间中的可扩展性。我们的关键见解是通过局部化采样区域来缓解这一限制,避免在整个动作空间上进行重要性采样导致的权重退化。为了实例化这一见解,我们引入了FLAG(具有潜在增强引导的流策略)。FLAG通过流潜在变量增强状态空间,并优化一个可证明一致的代理MaxEnt-RL目标。我们经验证明,FLAG能够在有限的重要性样本下实现表达性策略优化,并扩展到高维控制任务。此外,FLAG在具有挑战性的基准测试中达到了最先进的性能。我们的项目网页:https://flag-rl.github.io/

英文摘要

Maximum entropy reinforcement learning (MaxEnt-RL) enables robust exploration, yet practical implementations often restrict policies to simple Gaussians. While recent approaches incorporate expressive generative policies via importance-weighted supervised learning, they are prone to importance weight collapse, which limits their scalability in high-dimensional action spaces. Our key insight is to mitigate this limitation by localizing the sampling region, avoiding the weight degeneracy induced by importance sampling over the entire action space. To instantiate this insight, we introduce FLAG (Flow policy with Latent-Augmented Guidance). FLAG augments the state space with a flow latent variable and optimizes a provably consistent proxy MaxEnt-RL objective. We empirically demonstrate that FLAG enables expressive policy optimization with limited importance samples and scales to high-dimensional control tasks. Furthermore, FLAG achieves state-of-the-art performance across challenging benchmarks. Our project webpage: https://flag-rl.github.io/

2605.30740 2026-06-01 cs.RO cs.AI 版本更新

GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation

GSAM: 一种通用且安全的铰接物体操作机器人框架

Beichen Shao, Mengying Xie, Heng Su, Wanyi Zhang, Mingyan Li, Yan Ding, Fausto Giunchiglia, Chao Chen

发表机构 * College of Computer Science, Chongqing University, Chongqing, China(重庆大学计算机学院) Lumos Robotics, China(Lumos机器人中国) Xi'an Jiaotong-Liverpool University, China(西安交通大学利物浦大学) Fudan University, China(复旦大学) Department of Information Engineering and Computer Science, University of Trento, Trento, Italy(特伦托大学信息工程与计算机科学系)

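AI summary Proposes the GSAM framework: a vision-based perceiver generates kinematic parameters, a VLM-based refiner corrects them with commonsense reasoning, an interaction constraint function generator integrates obstacle avoidance knowledge, and a kinematic-aware planner verifies trajectory reachability. On 50 hinge tasks, GSAM reduces standard deviation by 3.1% and improves manipulation success rate by 36.0% compared to the best baseline.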

Comments Accepted by the 19th International Conference on Parallel Problem Solving from Nature (PPSN 2026)

详情
AI中文摘要

铰接物体操作对服务机器人是一个独特的挑战。现有方法采用端到端策略学习、视觉运动规划以及大语言/视觉语言模型(LLM/VLM),但往往忽视了铰接物体的多样性和末端执行器与手柄之间交互的复杂性,导致泛化能力有限和破坏性碰撞。为了解决这一问题,我们提出了GSAM,一个通用且安全的铰接物体操作机器人框架。具体来说,一个基于视觉的感知器生成运动学参数。考虑到感知器中预训练标记产生的原始估计可能偏离常识,我们提出了一个基于VLM的细调器,利用链式思维(COT)常识推理来细化感知。为了防止破坏性碰撞,我们设计了一个交互约束函数生成器,将铰接物体、交互姿态和障碍物避免知识集成到一个基中。然后LLM将这些约束函数化,并将其应用于轨迹和姿态规划。一个运动学感知的操作规划器验证轨迹和姿态的可达性。在5个物体类别的50个铰链任务和50个随机初始化的末端执行器-手柄配置上的实验表明,与最佳基线相比,GSAM将标准差降低了3.1%,操作成功率提高了36.0%,分别展示了GSAM在实际场景中优越的物体泛化能力和交互安全性。

英文摘要

Articulated object manipulation is a unique challenge for service robots. Existing methods employ end-to-end policy learning, vision-motion planning, and large-language/visual-language models (LLMs/VLMs), but often overlook the diversity of articulated objects and the complexity of interactions between end-effector and handle, leading to limited generalization and destructive collisions. To address this, we propose GSAM, a generalizable and safe robotic framework for articulated object manipulation. Specifically, a vision-based perceiver generates the kinematic parameters. Considering that pre-trained markers in the perceiver yield raw estimations that may deviate from commonsense, we present a fine-tuned VLM-based refiner that uses chain-of-thought (COT) commonsense reasoning to refine perception. To prevent destructive collisions, we design an interaction constraint function generator, integrating articulated object, interaction pose, and obstacle avoidance knowledge into a base. An LLM then functionalizes these constraints and applies them to trajectory and posture planning. A kinematic-aware manipulation planner verifies reachability for trajectory and posture. Experiments on 50 hinge tasks across 5 object categories and 50 randomly initialized end-effector-handle configurations show that GSAM reduces standard deviation by 3.1% and improves manipulation success rate by 36.0% compared to the best baseline, demonstrating, respectively, the superior object generalization and interaction safety of GSAM in practical scenarios.

2605.30696 2026-06-01 cs.RO cs.SY eess.SY 版本更新

Geometry-Aware Control Barrier Functions for Collision Avoidance via Bernstein Polynomial Approximations

基于Bernstein多项式近似的几何感知控制障碍函数用于碰撞避免

Siwon Jo, Yanze Zhang, Yupeng Yang, Wenhao Luo

发表机构 * GRASP Laboratory, University of Pennsylvania(宾夕法尼亚大学GRASP实验室) Department of Computer Science, University of Illinois Chicago(伊利诺伊大学芝加哥分校计算机科学系) Department of Computer Science, University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校计算机科学系)

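AI summary Proposes a geometry-aware control barrier function based on Bernstein-polynomial signed distance fields that represents obstacles and robots in a unified way and uses the differentiability of the polynomials to enforce control constraints in closed loop, targeting safe and efficient collision avoidance for single robots and heterogeneous multi-robot teams.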

Comments 8 pages; Accepted by 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)

详情
AI中文摘要

安全导航通常依赖于基于机器人和障碍物形状的明确定义条件,当它们具有不规则几何形状时可能具有挑战性。虽然控制障碍函数(CBF)提供了一种有效机制来强制执行安全集前向不变性,但常见的形状替代(例如球体或超椭球体)要么在非结构化场景中过于保守,要么需要许多局部基元,这会增加约束数量并降低实时性能。在本文中,我们介绍了一种基于Bernstein多项式符号距离场(BP-SDF)的新型几何感知控制障碍函数(CBF)。它提供了一种统一的方式来表示障碍物和机器人,从而用统一的最小距离来表示障碍函数。得益于Bernstein多项式的可微性,可以轻松地在闭环中强制执行控制约束。我们通过不同环境下的仿真验证了该方法在单机器人导航和异构多机器人碰撞避免中保证安全性的效率和性能。

英文摘要

Safe navigation often relies on well-defined conditions based on the shape of robots and obstacles, and can be challenging when they have irregular geometries. While Control Barrier Functions (CBFs) offer an efficient mechanism to enforce safe set forward invariance, common shape surrogates (e.g., spheres or super-ellipsoids) either are overly conservative in unstructured scenes or require many local primitives, which inflates constraint counts and degrades real-time performance. In this paper, we introduce a novel geometry-aware Control Barrier Function (CBF) based on Bernstein-Polynomial Signed Distance Fields (BP-SDFs). It provides a unified way to represent the obstacles and robots, so as to represent the barrier function with a unified minimum distance. Benefiting from the differentiability of the Bernstein polynomials, one can easily enforce the control constraints in a closed loop. We validate the method's efficiency and performance to guarantee safety in single-robot navigation and heterogeneous multi-robot collision avoidance via simulations under different environments.
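
The differentiability the abstract relies on is easy to see in one dimension. The sketch below evaluates a univariate Bernstein polynomial and builds the coefficients of its derivative, which is again a Bernstein polynomial of one degree lower. The paper's BP-SDF is a field over space, so this 1D version and its function names are assumptions made purely for illustration.

```python
from math import comb


def bernstein_eval(coeffs, t):
    """Evaluate sum_i c_i * C(n, i) * t^i * (1 - t)^(n - i) for t in [0, 1]."""
    n = len(coeffs) - 1
    return sum(
        c * comb(n, i) * t**i * (1 - t) ** (n - i) for i, c in enumerate(coeffs)
    )


def bernstein_derivative_coeffs(coeffs):
    """Coefficients of the derivative: d_i = n * (c_{i+1} - c_i), degree n - 1."""
    n = len(coeffs) - 1
    return [n * (coeffs[i + 1] - coeffs[i]) for i in range(n)]


if __name__ == "__main__":
    c = [1.0, 0.0, 1.0]  # a degree-2 polynomial on [0, 1]
    dc = bernstein_derivative_coeffs(c)
    for t in (0.0, 0.25, 0.5, 1.0):
        print(t, bernstein_eval(c, t), bernstein_eval(dc, t))
```

A useful side property is that the polynomial's values on [0, 1] always lie between the smallest and largest coefficient, so the coefficients themselves give a cheap conservative bound on the represented quantity.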

2605.30695 2026-06-01 cs.RO 版本更新

Primitive Subspaces Mediate Few-Shot Transfer in VLAs

原始子空间介导VLA中的少样本迁移

Anya Singh, Cabrel Happi, Jai Relan, Varun Nair, Vidyut Baradwaj

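AI summary Studies whether primitive-aware training builds a transferable library of sub-skills in vision-language-action (VLA) policies, enabling few-shot transfer from a small number of demonstrations with a 3x sample efficiency gain over flat training.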

详情
AI中文摘要

在工业环境中部署视觉-语言-动作(VLA)策略需要能够以低成本教授新任务,而当前VLA缺乏这一特性,因为每个新任务都需要微调。我们研究原始感知训练是否会产生一种可迁移的产物:一个学习到的子技能库,可以在推理时根据少量演示进行组合,以执行策略从未训练过的任务。我们在REASSEMBLE接触式装配数据集上,使用匹配的LoRA微调配方和固定超参数,训练了两种具有不同归纳偏置的VLA架构——OpenVLA和$π_{0.5}$,并在平坦轨迹和原始分割的回合(带有原始特定语言提示)之间改变训练方式。我们从训练中保留6个对象-任务组合,并评估少样本迁移:模型接收$m \in \{0, 1, 3, 5, 10\}$个保留任务的演示,并在不更新权重的情况下尝试执行。我们在三个训练种子上重复实验,并在第二个数据集(LIBERO-Long)上进行验证。原始训练模型仅需m=3个演示即可达到微调上限性能的78%,而平坦训练模型需要m=10个演示才能达到相同水平——这是一个3倍的样本效率差距,在种子、架构和数据集上均得到复现。为了建立因果关系,我们消融了隐藏状态的原始可解码子空间,结果显示少样本迁移性能下降32个百分点,而消融相同维度的随机子空间则没有影响,这表明原始表示是因果必要的,而非与迁移偶然相关。我们识别并纠正了评估分块策略时的一个方法论陷阱:单步动作范围门的族系膨胀会导致与真实人类演示相比的假失败率高出数量级。

英文摘要

Deploying vision-language-action (VLA) policies in industrial environments requires the ability to teach new tasks at low cost, a property current VLAs lack, since each new task requires fine-tuning. We investigate whether primitive-aware training produces a transferable artifact: a learned library of sub-skills that can be composed at inference time, conditioned on a small number of demonstrations, to perform tasks the policy was never trained on. We train two VLA architectures with different inductive biases, OpenVLA and $\pi_{0.5}$, on the REASSEMBLE contact-rich assembly dataset under matched LoRA fine-tuning recipes and locked hyperparameters, varying training between flat trajectories and primitive-segmented episodes with primitive-specific language prompts. We hold out 6 object-task combinations from training and evaluate few-shot transfer: models receive $m \in \{0, 1, 3, 5, 10\}$ demonstrations of a held-out task and attempt execution without weight updates. We replicate across three training seeds and validate on a second dataset (LIBERO-Long). Primitive-trained models reach 78% of fine-tuned upper-bound performance with only m=3 demonstrations, while flat-trained models require m=10 demonstrations to reach the same level -- a $3\times$ sample efficiency gap that replicates across seeds, architectures, and datasets. To establish causation, we ablate the primitive-decodable subspace of hidden states and show few-shot transfer degrades by 32 percentage points while ablating a random subspace of equal dimensionality has no effect, indicating primitive representations are causally necessary rather than incidentally correlated with transfer. We identify and correct a methodological pitfall in evaluating chunked policies: family-wise inflation of single-step action-range gates produces order-of-magnitude higher false-failure rates against ground-truth human demonstrations.
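
The subspace ablation mentioned in the abstract can be illustrated with a plain orthogonal projection. This sketch assumes the subspace is already given as an orthonormal basis; how the primitive-decodable subspace is actually found (for example with a probe) is not stated in the abstract, and the function name is hypothetical.

```python
def ablate_subspace(hidden, basis):
    """Remove from `hidden` its component inside span(basis).

    `basis` is assumed to be a list of orthonormal vectors with the same
    length as `hidden`; the result is hidden - sum_k (u_k . hidden) * u_k.
    """
    out = list(hidden)
    for u in basis:
        coeff = sum(ui * hi for ui, hi in zip(u, hidden))
        out = [oi - coeff * ui for oi, ui in zip(out, u)]
    return out


if __name__ == "__main__":
    h = [1.0, 2.0, 3.0]
    print(ablate_subspace(h, [[1.0, 0.0, 0.0]]))
```

The random-subspace control described in the abstract would apply the same projection with a random orthonormal basis of equal dimensionality.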

2605.30671 2026-06-01 cs.CV cs.RO 版本更新

WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation

WristCompass: 运动耦合作为可学习的视觉概念用于自我相机朝向估计

Varun Nair, Vidyut Baradwaj, Jiahang He, Anya Singh, Jai Relan, Cabrel Happi

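AI summary Proposes WristCompass, which treats the kinematic coupling between wrist motion and camera orientation as a visual concept and recovers ego-camera orientation from manipulation video using compact 4D features and GRU temporal modeling; it transfers zero-shot to kitchen video and approaches the performance of a 1B-parameter scene model.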

详情
AI中文摘要

从操作视频中恢复自我相机朝向是从自我中心演示中分离手部运动与相机运动的前提,这是模仿学习的关键步骤。从场景几何推断朝向的常规方法在手部遮挡框架时失效:VGGT,一个1B参数的场景重建模型,在TACO基准测试上的表现甚至不如常数预测器。我们识别出一个替代的视觉概念,它恰好出现在场景几何缺失时:运动耦合动力学,即由手臂-肩-头链施加的手腕运动与相机朝向之间的结构化物理关系。我们发现这个概念是紧凑的(4D手腕间特征优于126D全手关键点)、时序的(需要短窗口上的GRU而非逐帧检索)和物理基础的(由于根植于解剖学而非场景外观,因此可零样本跨数据集迁移)。仅在桌面操作上训练的WristCompass,零样本迁移至Epic Kitchens烹饪视频,实现了14.3°的中位测地误差,并以200K GRU参数接近1B参数场景模型的性能。

英文摘要

Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark. We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, WristCompass transfers zero-shot to Epic Kitchens cooking video, achieving 14.3$^\circ$ median geodesic error and approaching the performance of a 1B-parameter scene model at 200K GRU parameters.
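
For readers unfamiliar with the metric, a geodesic error between two orientations is commonly computed as the rotation angle of the relative rotation. The sketch below uses the standard trace formula on 3x3 rotation matrices; whether the paper computes its 14.3° figure exactly this way is an assumption, and the helper names are illustrative.

```python
import math


def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_deg(r_a, r_b):
    """Angle in degrees of r_a^T r_b, via trace(r_a^T r_b) = sum_ij a_ij * b_ij."""
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


if __name__ == "__main__":
    print(geodesic_deg(rot_z(0.0), rot_z(30.0)))
```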

2605.30660 2026-06-01 cs.LG cs.RO 版本更新

BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

BOKBO (Best of K Bad Options): VLA策略的校准式弃权

Anya Singh, Cabrel Happi, Jai Relan, Varun Nair, Vidyut Baradwaj

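AI summary Proposes BOKBO, the first conformal abstention layer for test-time scaling of vision-language-action (VLA) policies, with global and per-task variants that provide finite-sample, distribution-free control of the executed-violation rate, and exposes a structural failure of policy-internal nonconformity scores under perturbation-based K-sampling.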

详情
AI中文摘要

针对视觉-语言-动作(VLA)策略的测试时扩展方法,如RoboMonkey、SEAL、MG-Select和V-GPS,在推理时采样K个候选动作块并执行验证器最优结果。当所有K个候选都不安全时,系统会执行违规动作且无警告。我们提出BOKBO,这是首个用于K样本VLA推理的共形弃权层,提供执行违规率的有限样本无分布保证。我们提供全局和逐任务(Mondrian)变体,其中逐任务变体缩小了最困难任务上的条件差距。我们的分析揭示了基于扰动的K采样下策略内部非一致性分数的结构性失败:基础策略置信度代理与K样本不一致性之间的相关性为0.98(与动作噪声超参数σ相关),而与实际安全违规的相关性处于噪声基底。我们通过复现令牌级温度采样下的分析来测试失败范围,发现该失败是机制特定的,并在基于策略随机性的采样下得到部分缓解。一个基于语义视觉特征和任务标识学习的违规预测器支持紧密校准:在libero_object_temp_x0.1上使用OpenVLA-OFT,ε=0.05时,条件CRC边界在86%的bootstrap分割上成立,覆盖率为78%,净任务成功率为70%。Mondrian-BOKBO将最小逐任务条件保持比例从0.71提高到0.93。结果在5个训练种子上稳定,在π_0-FAST上的bootstrap噪声内可复现,在libero_spatial_temp_x0.1作为同等基准上成立,并经受住了四个套件内分布偏移。我们还识别并纠正了一个方法论陷阱:全局设置的力阈值远低于专家典型的操作力,将不安全行为与正常操作混淆,导致违规率膨胀5倍。

英文摘要

Test-time scaling methods for vision-language-action (VLA) policies, such as RoboMonkey, SEAL, MG-Select, and V-GPS, sample K candidate action chunks at inference and execute the verifier-best. When all K candidates are unsafe, the system executes a violating action with no warning. We propose BOKBO, the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate. We provide both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks. Our analysis exposes a structural failure of policy-internal nonconformity scores under perturbation-based K-sampling: the base-policy confidence proxy and K-sample disagreement correlate at 0.98 with the action-noise hyperparameter $\sigma$, while correlating at the noise floor with actual safety violations. We test the failure's scope by replicating the analysis under token-level temperature sampling and find the failure is mechanism-specific and partially mitigated under policy-stochasticity-based sampling. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at $\epsilon$ = 0.05 on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86% of bootstrap splits with 78% coverage and 70% net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93. Results are stable across 5 training seeds, replicate within bootstrap noise on $\pi_0$-FAST, hold on libero_spatial_temp_x0.1 as a co-equal benchmark, and survive four within-suite distribution shifts. We additionally identify and correct a methodological pitfall: globally-set force thresholds well below expert-typical manipulation forces conflate unsafe behavior with normal manipulation, inflating violation rates by $5\times$.
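
The idea of calibrating an abstention threshold can be sketched with a generic conformal risk control recipe. This is not BOKBO's actual procedure: it assumes a scalar predicted-violation score for the selected candidate, a held-out calibration set with binary violation labels, and a loss of 1 when an episode is executed and violates. All names are hypothetical.

```python
def crc_execute_threshold(scores, violated, epsilon):
    """Largest threshold lam with (n * R_hat(lam) + 1) / (n + 1) <= epsilon.

    R_hat(lam) is the fraction of calibration episodes that would be executed
    (score <= lam) and ended in a violation. Returns None if no threshold
    satisfies the bound, meaning the layer should always abstain.
    """
    n = len(scores)
    best = None
    for lam in sorted(set(scores)):
        executed_violations = sum(
            1 for s, v in zip(scores, violated) if s <= lam and v
        )
        # Integer form of the bound avoids floating-point division issues.
        if executed_violations + 1 <= epsilon * (n + 1):
            best = lam
        else:
            break  # the loss only grows with lam, so larger thresholds fail too
    return best


def should_execute(score, threshold):
    """Execute only when a threshold exists and the score does not exceed it."""
    return threshold is not None and score <= threshold


if __name__ == "__main__":
    cal_scores = [i / 20 for i in range(1, 20)]
    cal_violated = [i >= 15 for i in range(1, 20)]
    lam = crc_execute_threshold(cal_scores, cal_violated, epsilon=0.12)
    print(lam, should_execute(0.5, lam), should_execute(0.9, lam))
```

A per-task (Mondrian-style) variant would, under the same assumptions, run this calibration separately on each task's calibration episodes.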

2605.30647 2026-06-01 cs.RO 版本更新

Bidirectional Incremental Generalized Hybrid A*

双向增量广义混合A*

Sidharth Talia, Oren Salzman, Siddhartha Srinivasa

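AI summary Addresses efficient anytime kinodynamic planning for systems with complex dynamics in unstructured environments by proposing Bidirectional Incremental Generalized Hybrid A*, which uses bidirectional search to mitigate frozen vertices hiding solutions, preserves guarantees on monotonic cost improvement and termination, and substantially reduces expansions.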

详情
AI中文摘要

我们关注在非结构化环境中具有复杂动力学的系统的有效任意时刻运动规划问题,这些环境使得预计算运动基元不可行。由于维度灾难,直接应用A*到此类问题在计算上不可行。诸如Hybrid A*等方法通过离散化状态空间来解决这一负担,但反过来在树发现和离散化分辨率之间产生了耦合。增量广义混合A*(IGHA*)通过以任意方式在分辨率层次上进行搜索来打破这种耦合,它冻结顶点以供后续搜索迭代使用,而不是修剪它们。然而,冻结的顶点可能会在特定迭代中隐藏支持解的顶点。虽然经典双向搜索的动机是减少搜索深度,但将IGHA*扩展到双向设置(称为Bi-IGHA*)通过从根本上缓解冻结顶点隐藏解的行为而获得额外好处。我们证明了Bi-IGHA*保留了IGHA*在单调成本改进和终止方面的保证。我们通过实验表明,Bi-IGHA*在R3、R4和R6规划问题上显著减少了扩展次数,并在高速越野自主性的运动规划中实现了等效的闭环性能,同时所需的扩展次数显著减少。网站:https://personalrobotics.github.io/IGHAStar/biighastar.html

英文摘要

We focus on the problem of efficient anytime kinodynamic planning for systems with complex dynamics in unstructured environments that make precomputing motion primitives infeasible. Directly applying A* to such problems is computationally infeasible due to the curse of dimensionality. Methods such as Hybrid A* address this burden by discretizing the state space, but in turn create a coupling between tree discovery and the discretization resolution. The Incremental Generalized Hybrid A* (IGHA*) breaks this coupling by searching over a hierarchy of resolutions in an anytime fashion, freezing vertices for use in later search iterations rather than pruning them. However, the frozen vertices can hide solution-supporting vertices from the search at a particular iteration. While classical bidirectional search is motivated by the reduction of search depth, extending IGHA* into the bidirectional setting (termed Bi-IGHA*) obtains additional benefit by fundamentally mitigating the behaviour induced by frozen vertices hiding solutions. We show that Bi-IGHA* preserves IGHA*'s guarantees on monotonic cost improvement and termination. We empirically show that Bi-IGHA* substantially reduces expansions on R3, R4, and R6 planning problems, and achieves equivalent closed-loop performance with kinodynamic planning for high-speed off-road autonomy while requiring significantly fewer expansions. Website: https://personalrobotics.github.io/IGHAStar/biighastar.html

2605.30639 2026-06-01 cs.CV cs.AI cs.RO 版本更新

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

PInVerify:面向主动实例验证的离线具身基准

Yuhang Jiang

发表机构 * University of Trento(特伦托大学)

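AI summary Introduces the Active Instance Verification task and the offline embodied benchmark PInVerify, which evaluates embodied agents through multi-view navigation and fine-grained attribute matching, and establishes baselines built on multimodal large language models.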

Comments Accepted as a poster at the Foundation Models Meet Embodied Agents (FMEA) Workshop, CVPR 2026. 44 pages including appendix. Code: https://github.com/Avalon-S/PInVerify

详情
AI中文摘要

具身智能体在导航到目标物体方面取得了显著进展,但到达目标附近并不能保证智能体找到了正确的实例:微妙的属性差异(例如“白色花卉”与“白色条纹”)通常需要近距离、多视角检查。我们通过主动实例验证(AIV)来解决这一差距,该任务要求智能体主动围绕候选对象选择视角,以判断其是否匹配细粒度的自然语言描述。我们将AIV形式化为一个有限视野决策过程,并引入PInVerify,一个用于AIV的离线具身基准:包含18个物体类别的3000个评估场景,以多视角捕获形式提供,并采用6扇区导航拓扑,暴露陷阱视角(可导航但无信息)和不可达扇区。作为参考基线,我们构建了一个无需训练的流水线和一个基于开源多模态大语言模型(MLLMs)的LoRA微调端到端智能体(参数规模≤8B),包括属性分解、可见性加权多视角跟踪器和三种次优视角选择(NBV)策略。在Qwen3-VL(4B/8B)、SenseNova-SI-1.2-InternVL3-8B、CLIP和SigLIP2上的评估中,最佳MLLM基线超过最佳嵌入基线4.9个百分点;GT框消融实验显示检测差距为+3.1个百分点;在测试的NBV策略中,我们未观察到主动视角选择带来的可靠增益。LoRA微调智能体(SFT+GSPO)达到85.6%。PInVerify旨在支持具身AI中主动、细粒度语义验证的进一步研究。代码:https://github.com/Avalon-S/PInVerify。

英文摘要

Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.

2605.30617 2026-06-01 cs.RO math.OC 版本更新

Exploiting Chordal Sparsity for Globally Optimal Estimation with Factor Graphs

利用弦稀疏性实现因子图的全局最优估计

Avinash Subramanian, Connor Holmes, Timothy D. Barfoot, Frank Dellaert, Frederike Dümbgen

发表机构 * College of Computing, Georgia Institute of Technology(佐治亚理工学院计算机学院) Robotics Institute, University of Toronto(多伦多大学机器人研究所) Department of Mechanical Engineering, Carnegie Mellon University(卡内基梅隆大学机械工程系)

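AI summary Proposes automatically constructing convex semidefinite programming relaxations within the GTSAM framework and accelerating their solution with a Bayes tree decomposition, enabling globally optimal estimation with factor graphs.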

详情
Journal ref ICRA 2026 Workshop on Frontiers of Optimization for Robotics
AI中文摘要

鲁棒且高效的状态估计对于机器人感知、导航和控制至关重要。状态估计问题可以方便地使用因子图框架建模,如现代软件包GTSAM或g2o所支持的那样。然而,这些框架中包含的标准求解器是局部的,可能收敛到较差的局部最小值,带来显著的安全隐患。相反,基于凸松弛的技术已被证明能够全局求解或认证许多状态估计问题。但是,这些松弛方法1)通常需要大量精力来构建,并且2)与高效的局部求解器相比,可能产生显著更高的成本,因为它们需要求解一个大规模半定规划(SDP)。在这项工作中,我们通过以下方式解决了这两个缺点:1)在GTSAM框架内创建了一个新过程,用于自动为任何具有常见因子和变量类型的因子图构建凸SDP松弛,以及2)利用GTSAM原生的贝叶斯树结构来分解SDP问题,从而在弦稀疏问题上显著加速求解器时间。我们通过两个案例研究展示了这种利用结构的全局估计器与标准局部求解器相比的有利扩展性:一个带有环因子图的三维位姿图SLAM问题和一个带有链因子图的二维定位问题。软件框架可在https://github.com/borglab/gtsam获取。

英文摘要

Robust and efficient state estimation is crucial for perception, navigation, and control in robotics. State estimation problems are conveniently modeled using the factor-graph framework as enabled by modern software packages such as GTSAM or g2o. However, the standard solvers included in such frameworks are local and may converge to poor local minima, posing significant safety concerns. Conversely, techniques based on convex relaxations have been shown to provide a means of globally solving or certifying many state estimation problems. However, these relaxations 1) often require substantial effort to formulate, and 2) may incur significantly higher cost compared to efficient local solvers, as they require solving a large semidefinite program (SDP). In this work, we address both shortcomings by 1) creating a new procedure within the GTSAM framework for automatically constructing convex SDP relaxations for any factor graph with common factor and variable types, and by 2) exploiting the Bayes tree constructions native to GTSAM to decompose the SDP problem, leading to significant speedup in solver time for chordally sparse problems. We demonstrate the favorable scaling of this structure-exploiting global estimator compared to standard local solvers for two case studies: a 3D pose-graph SLAM problem with a ring factor graph and a 2D localization problem with a chain factor graph. The software framework is available at https://github.com/borglab/gtsam.

2605.30612 2026-06-01 cs.RO cs.LG cs.SY eess.SY 版本更新

ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning

ZAPS-DA:基于解耦演员的零相位动作策略平滑用于强化学习中的连续控制

Faiq Shamass

发表机构 * Independent Researcher(独立研究者)

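AI summary Proposes the ZAPS-DA framework, in which a decoupled actor network imitates zero-phase filtered targets to reduce action jitter in continuous control policies with negligible phase lag and no post-processing, and validates it in driving simulation.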

Comments 7 pages, 5 figures, 5 tables. Submitted to IEEE RA-L

详情
AI中文摘要

基于离策略强化学习训练的连续控制策略经常表现出高频动作抖动,使得直接部署在物理执行器上不可行。事后滤波可以减弱抖动但引入相位延迟;在演员损失中嵌入平滑惩罚会将其与RL梯度耦合,并将奖励回归与过度激进的平滑混为一谈。我们提出ZAPS-DA,一个在部署时减少动作抖动且具有可忽略相位延迟和无后处理的框架。ZAPS-DA将一个未修改的主演员(由基础RL损失训练)与一个单独的解耦演员配对,该解耦演员通过监督学习模仿存储在回放缓冲区中的零相位滤波目标。部署的策略是解耦演员:一个从当前观测到平滑动作的前馈映射,没有推理时滤波和动作历史输入——我们称之为非因果滤波器的因果蒸馏机制。幅度匹配的MSE损失提供了跨优化器类别的零超参数可移植性。使用Soft Actor-Critic和Savitzky-Golay滤波器在两个驾驶模拟器中通过配对n=150评估协议进行验证:在MetaDrive上,ZAPS-DA将转向抖动减少14-21倍,油门抖动减少3-5倍(所有p < 10^{-4},Bonferroni校正),同时以6.3%的奖励成本匹配任务完成率(成功率p=0.28,碰撞率p=0.31);在自定义Webots自适应巡航控制环境中,相同的SG配置产生了帕累托改进——奖励持平(p=0.121),转向抖动减少8-45倍,总任务失败率从2.0%降至0.7%。

英文摘要

Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but introduces phase lag; embedding smoothness penalties in the actor's loss couples them with the RL gradient and conflates reward regression with over-aggressive smoothing. We present ZAPS-DA, a framework that reduces action jitter at deployment with negligible phase lag and no post-processing. ZAPS-DA pairs an unmodified main actor (trained by the base RL loss) with a separate decoupled actor trained via supervised imitation of zero-phase filtered targets stored in the replay buffer. The deployed policy is the decoupled actor: a feed-forward map from the current observation to a smooth action, with no inference-time filter and no action-history input -- a mechanism we term causal distillation of a non-causal filter. A magnitude-matched MSE loss provides zero-hyperparameter portability across optimizer classes. We validate ZAPS-DA with Soft Actor-Critic and a Savitzky--Golay filter in two driving simulators using paired n=150 evaluation protocols: on MetaDrive, ZAPS-DA reduces steering jitter by 14--21x and throttle jitter by 3--5x (all $p < 10^{-4}$, Bonferroni-corrected) while matching task-completion (p=0.28 success, p=0.31 crash) at a 6.3% reward cost; on a custom Webots adaptive cruise control environment, the same SG configuration produces a Pareto improvement -- reward parity (p=0.121), 8--45x steering jitter reduction, and total task-failure rate reduced from 2.0% to 0.7%.
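
The difference between a causal filter (which lags) and a zero-phase filter (which needs future samples, so it only works offline on stored trajectories) can be seen on a simple ramp. This sketch uses plain moving averages as a stand-in for the Savitzky-Golay filter named in the abstract; the function names and window sizes are assumptions for illustration.

```python
def causal_moving_average(x, window):
    """Average of the current and previous samples only; usable online, but lags."""
    out = []
    for t in range(len(x)):
        lo = max(0, t - window + 1)
        out.append(sum(x[lo : t + 1]) / (t + 1 - lo))
    return out


def centered_moving_average(x, half_width):
    """Symmetric window around t; zero phase, but needs future samples."""
    out = []
    for t in range(len(x)):
        # Shrink the window symmetrically near the ends so it stays centered.
        h = min(half_width, t, len(x) - 1 - t)
        out.append(sum(x[t - h : t + h + 1]) / (2 * h + 1))
    return out


if __name__ == "__main__":
    ramp = [float(t) for t in range(20)]
    causal = causal_moving_average(ramp, window=5)
    centered = centered_moving_average(ramp, half_width=2)
    # On a ramp the causal output trails the input; the centered output does not.
    print(ramp[10], causal[10], centered[10])
```

In the scheme the abstract describes, the offline zero-phase output would serve as a supervised target for the decoupled actor, so no filter runs at inference time.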

2605.30583 2026-06-01 cs.RO cs.PF 版本更新

Caspar: CUDA Accelerator for Symbolic Programming with Adaptive Reordering

Caspar: 基于自适应重排序的符号编程CUDA加速器

Emil Martens, Aaron Miller, Matias Varnum, Annette Stahl

发表机构 * Norwegian University of Science and Technology(挪威科学与技术大学) Skydio

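AI summary Presents the Caspar library, which automatically generates optimized CUDA kernels to bridge Python symbolic expressions and a high-performance GPU runtime, and reports a 5 to 20x speedup on the large-scale BAL bundle adjustment dataset.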

Comments Accepted at ICRA 2026

详情
AI中文摘要

我们提出Caspar,一个使现代GPU在机器人领域更易用的库,并提供可应用于多种优化问题的最先进非线性GPU求解器。Caspar通过从符号表达式自动生成优化的CUDA内核,弥合了Python中表达性符号编程与C++中高性能GPU运行时之间的差距。基于SymForce库,用户可以轻松定义和组合符号表达式(包括李群运算),以生成自定义CUDA内核。要将Caspar用作求解器,用户只需定义符号残差函数;Caspar随后使用符号微分生成必要的GPU内核和接口以执行非线性优化。本文介绍了Caspar的核心组件,并通过在Bundle Adjustment in the Large (BAL)数据集上执行光束法平差展示了其性能。我们将Caspar与其他最先进的光束法平差器进行基准测试,结果表明它比最佳替代方案快5到20倍,所需内存更少,且达到相似的精度。这说明了我们的符号GPU编程方法的优势。Caspar作为SymForce的一部分发布,可在https://github.com/symforce-org/symforce免费获取。

英文摘要

We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library, users can easily define and combine symbolic expressions, including Lie group operations, to generate custom CUDA kernels. To use Caspar as a solver, users need only define the symbolic residual functions; Caspar then uses symbolic differentiation to generate the necessary GPU kernels and interfaces to perform nonlinear optimization. In this paper, we present the core components of Caspar and showcase its performance by performing bundle adjustment on the Bundle Adjustment in the Large (BAL) dataset. We benchmark Caspar against other state-of-the-art bundle adjusters and show that it is 5 to 20 times faster than the best alternative, requires less memory, and achieves similar accuracy. This illustrates the benefit of our symbolic GPU programming approach. Caspar is released as part of SymForce and is freely available at https://github.com/symforce-org/symforce

2605.30571 2026-06-01 cs.AR cs.AI cs.DC cs.PF cs.RO 版本更新

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

受限于内存但不受限于带宽:批量1的LLM解码中的物理AI推理差距

Josef Chen

发表机构 * KAIKAKU

AI总结 本文通过测量不同GPU上批量1的自回归解码性能,发现物理AI推理并非仅受内存带宽限制,还受启动开销影响,并指出量化路径的实际收益取决于运行时实现。

详情
AI中文摘要

物理AI系统,包括机器人、自动驾驶车辆、具身智能体和边缘副驾驶,通常运行与云端LLM服务不同的推理工作负载:单流、批量1的自回归解码,其中一个机器人、摄像头流或用户会话等待下一个token。这种工作负载通常被描述为受内存带宽限制。每个解码步骤都会流式传输模型权重和活跃的KV缓存,因此延迟理应随峰值HBM带宽成比例变化。我们表明这种说法是正确的但不完整。我们测量了三个7至8B级GQA Transformer模型在四个NVIDIA GPU(H100 SXM5、A100-80GB SXM4、L40S和L4)上的批量1解码。我们评估了从2048到16384的上下文长度,在受控的bf16 SDPA设置下产生了44个有效单元。实际达到的带宽占峰值HBM带宽的比例随着峰值带宽的增加而下降。在作为主要示例的Qwen-2.5-7B ctx=2048单元中,L4达到了其分析内存下限的大约81%,而H100仅达到27%。物理AI解码是内存主导的,但更快的内存并不能转化为成比例的延迟增益。我们通过CUDA Graphs A/B实验测试了缺失项。在H100上,ctx=2048时,CUDA Graphs在N=10个新会话中使解码加速1.259倍,95%自助法置信区间为1.253至1.267。在L4上,相同的干预仅带来1.028倍的加速。这分离出了在快速GPU上可见但在较慢、带宽受限的GPU上基本隐藏的启动侧开销。对部署的启示是,只有当运行时真正兑现内存节省时,这些节省才有意义。在L4上,bf16解码接近内存下限,但常见的量化路径并未带来预期的4倍权重流量减少:相对于62.32 ms/step的bf16基线,bnb-nf4达到59.36 ms/step,AutoAWQ+Marlin达到45.24 ms/step。使用Ada调优的int4内核的GPTQ+ExLlamaV2达到17.36 ms/step。

英文摘要

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
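The "analytic memory floor" idea in the abstract can be illustrated with a small calculation. The following is a minimal sketch, not the author's method: it assumes the floor is simply the bytes streamed per step divided by peak bandwidth, and it ignores KV-cache traffic. The parameter count (7.6e9) and the L4 bandwidth (300 GB/s) are nominal values assumed here for illustration; only the 62.32 ms bf16 baseline comes from the abstract.

```python
def memory_floor_ms(bytes_per_step: float, peak_bandwidth_gb_s: float) -> float:
    """Lower bound on step latency if every byte is streamed once at peak bandwidth."""
    return bytes_per_step / (peak_bandwidth_gb_s * 1e9) * 1e3


def fraction_of_floor(floor_ms: float, measured_ms: float) -> float:
    """How close a measured step latency is to the analytic floor (1.0 = at the floor)."""
    return floor_ms / measured_ms


if __name__ == "__main__":
    assumed_params = 7.6e9                 # assumption: approximate parameter count
    weight_bytes = assumed_params * 2      # bf16 stores 2 bytes per parameter
    assumed_l4_bandwidth_gb_s = 300.0      # assumption: nominal L4 memory bandwidth
    measured_ms = 62.32                    # bf16 baseline on L4, from the abstract

    floor = memory_floor_ms(weight_bytes, assumed_l4_bandwidth_gb_s)
    print(f"floor = {floor:.2f} ms, fraction = {fraction_of_floor(floor, measured_ms):.2f}")
```

Under these assumed inputs the ratio lands near the roughly 81 percent quoted for the L4, but the paper's own floor may include terms (such as KV-cache reads) that this sketch leaves out.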

2605.30569 2026-06-01 cs.RO 版本更新

Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity

Any-ttach: 快速末端执行器更换实现简洁的灵巧操作

Weizhe Ni, Jinzhou Li, Haoyu Li, Cody Andres Alessio-Bunnell, Wenjing Pan, Xianyi Cheng

发表机构 * Department of Mechanical Engineering and Materials Science, Duke University(杜克大学机械工程与材料科学系)

AI总结 提出Any-ttach框架,通过低成本快速末端执行器更换机制,结合任务规划,实现多种工具和末端模块的灵巧操作,在长时任务中验证了可靠性和效率提升。

详情
AI中文摘要

机器人操作灵巧性通常通过构建越来越复杂的高自由度多指手来实现。虽然许多机器人手被设计为复制人类形态,但人手的功能角色暗示了不同的视角:其复杂性可能很大程度上是为了支持工具使用和工具制造。这一观察启发了Any-ttach,一个以工具为中心的操作框架,将快速末端执行器更换视为实现简单灵巧性的机制。Any-ttach结合了用于开合机器人接口的低成本自动更换机制、用于收集人类演示的手持设备,以及一个组合了学习、参数化和规划的工具使用技能的任务规划框架。该系统通过相同的共享接口支持多种工具和末端执行器模块,包括日常工具、铰接工具(如剪刀)、Fin Ray手指和低成本拟人手。我们的实验表明,Any-ttach提高了工具更换的可靠性,增加了演示效率,减少了工具位姿变异性,并支持多样化的工具使用技能。在两个长时任务(制作三明治和准备黄瓜)中,Any-ttach通过末端执行器切换和执行监控执行了六个工具使用子技能。这些结果表明,机器人不仅可以通过更复杂的末端执行器,还可以通过快速可更换的工具和末端执行器模块来扩展操作能力。更多详情和视频请访问https://any-ttach.github.io/。

英文摘要

Robotic manipulation dexterity is often pursued by building increasingly complex high-DoF multifingered hands. While many robotic hands are designed to replicate human morphology, the functional role of human hands suggests a different perspective: much of their complexity may exist to enable tool use and tool making. This observation motivates Any-ttach, a tool-centric manipulation framework that treats quick end-effector swapping as a mechanism for dexterity with simplicity. Any-ttach combines a low-cost automatic swapping mechanism for an open-close robot interface, a handheld device for collecting human demonstrations, and a task planning framework that composes learned, parameterized, and planned tool-use skills. The system supports diverse tools and end-effector modules, including daily tools, articulated tools such as scissors, Fin Ray fingers, and a low-cost anthropomorphic hand, through the same shared interface. Our experiments show that Any-ttach improves tool-swapping reliability, increases demonstration efficiency, reduces tool-pose variability, and supports diverse tool-use skills. In two long-horizon tasks, making a sandwich and preparing a cucumber, Any-ttach executes six tool-use subskills through end-effector switching and execution monitoring. These results suggest that robots can expand manipulation capability not only through more complex end-effectors, but also through rapidly exchangeable tools and end-effector modules. More details and videos are available at https://any-ttach.github.io/.

2605.30508 2026-06-01 cs.RO 版本更新

ARISTO Hand: Sensing-Driven Distal Hyperextension for Fine-Grained Manipulation

ARISTO Hand:基于感知驱动的远端过伸实现精细操作

Aaron Kim, Dong Ho Kang, Mark Helwig, Mingyo Seo, Kazuto Yokoyama, Tetsuya Narita, Luis Sentis

发表机构 * Human Centered Robotics Lab at The University of Texas at Austin(德克萨斯大学奥斯汀分校人本机器人实验室) Sony Group Corporation(索尼集团)

AI总结 提出一种肌腱驱动机械手ARISTO Hand,通过主动远端过伸和混合指尖传感架构(刚性指甲安装力-扭矩传感器与软电容触觉阵列),增强对薄物体的操作能力,在1-20 mm厚度范围内将拔出力提升2.76倍,并实现SD卡插拔等精细任务。

详情
AI中文摘要

操作薄物体需要精确的接触几何和可靠的力感知,然而许多拟人化机械手缺乏此类交互所需的机械和传感能力。我们提出ARISTO Hand,一种肌腱驱动机器人手,它将主动远端过伸与混合指尖传感架构相结合,该架构结合了刚性指甲安装的力-扭矩传感器和软电容触觉阵列。主动过伸使得指尖能够在标准屈曲的运动学极限之外进行受控接合,对于1-20 mm的物体厚度,拔出力提高了2.76倍,同时保留了标称抓取能力。刚性指甲安装传感器在边缘接触期间提供可靠的力测量,此时本体感觉力估计的灵敏度随着接触几何接近运动学奇点而下降。我们通过定量力表征和多阶段SD卡提取与插入任务验证了所提出的架构。视频和补充材料可在 https://aristohand.github.io 获取。

英文摘要

Manipulating thin objects requires precise contact geometry and reliable force perception, yet many anthropomorphic robotic hands lack the mechanical and sensing capabilities needed for such interactions. We present the ARISTO Hand, a tendon-driven robotic hand that integrates active distal hyperextension with a hybrid fingertip-sensing architecture that combines a rigid, nail-mounted force-torque sensor and a soft capacitive tactile array. Active hyperextension enables controlled fingertip engagement beyond the kinematic limits of standard flexion, increasing pull-out force by 2.76x for object thicknesses of 1-20 mm while preserving the nominal grasp capability. The rigid nail-mounted sensor provides reliable force measurements during edge contacts, where the sensitivity of proprioceptive force estimation degrades as the contact geometry approaches kinematic singularities. We validate the proposed architecture through quantitative force characterization and a multi-stage SD card extraction and insertion task. Video and supplementary materials are available at: https://aristohand.github.io

2605.30506 2026-06-01 cs.RO cs.CV 版本更新

VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments

VLM-GLoc:视觉语言模型增强的蒙特卡洛定位,用于杂乱准静态环境中的鲁棒语义全局定位

Shivendra Agrawal, Bradley Hayes

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出VLM-GLoc方法,利用开放词汇视觉语言模型作为统一语义观测前端,通过逆语义提议机制和文本到地图检索,在几何模糊和语义歧义的准静态环境中实现鲁棒全局定位。

详情
AI中文摘要

在几何模糊的准静态环境(如杂货店、办公室、学校和医院)中,全局定位对移动机器人构成重大挑战。具有平行过道和长尾产品分布的杂货店,以及具有重复家具(如椅子、桌子、显示器和门)的办公室和实验室,是常见的室内环境,存在几何甚至语义歧义。传统方法要么依赖独特的几何特征,要么依赖特定领域的视觉管道,这些方法难以处理长尾语义分布和瞬态视觉杂乱。我们提出VLM-GLoc,一种分层语义蒙特卡洛定位(MCL)方法,利用开放词汇视觉语言模型(VLM)作为统一语义观测前端。我们假设VLM具有三重优势:(1)提取高度判别性的丰富文本特征,(2)对模糊或动态对象进行隐式质量过滤,(3)通过持久性推理实现有针对性的数据增强。我们引入一种逆语义提议机制,通过文本到地图检索播种粒子。在两个具有不同特征的真实世界环境和两个不同平台上进行评估:一个3500平方英尺的杂货店(使用手机)和一个3700平方英尺的实验室空间(使用四足机器人),VLM-GLoc分别实现了70%和74%的全局定位成功率,显著优于传统的纯几何和特定领域基线方法。

英文摘要

Global localization in geometrically aliased, quasi-static environments such as grocery stores, offices, schools, and hospitals poses a significant challenge for mobile robots. Grocery stores with parallel aisles and a long-tailed distribution of products, as well as offices and labs with repetitive furniture such as chairs, desks, monitors, and doors, exemplify common indoor environments that present geometric and even semantic ambiguity. Traditional approaches rely either on distinct geometric features or on domain-specific vision pipelines that struggle with long-tail semantic distributions and transient visual clutter. We present VLM-GLoc, a method for hierarchical semantic Monte Carlo Localization (MCL) that leverages open-vocabulary Vision-Language Models (VLMs) as a unified semantic observation front-end. We hypothesize a three-fold benefit from VLMs: (1) extracting highly discriminative rich text features, (2) implicit quality filtering of blurry or dynamic objects, and (3) permanence reasoning for targeted data augmentation. We introduce an inverse semantic proposal mechanism that seeds particles via text-to-map retrieval. Evaluated across two real-world environments with different characteristics and two different platforms (a 3,500 sq. ft. grocery store with a cellphone and a 3,700 sq. ft. lab space with a quadruped), VLM-GLoc achieves 70% and 74% global localization success respectively, substantially outperforming traditional geometry-only and domain-specific baselines.

2605.30503 2026-06-01 cs.RO cs.SY eess.SY stat.ML 版本更新

Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics

混合接触动力学下的物理信息目标条件强化学习

Vittorio Giammarino, Anastasios Manganaris, Ahmed H. Qureshi

发表机构 * Department of Computer Science(计算机科学系)

AI总结 针对接触丰富任务中混合动力学导致现有物理信息目标条件强化学习方法性能下降的问题,提出接触感知和分层公式,选择性应用物理信息归纳偏置,向接触丰富操作扩展。

详情
AI中文摘要

从稀疏反馈中学习达到任意目标需要智能体推断状态-目标对之间的丰富可达性概念。目标条件强化学习(GCRL)通过学习跨目标泛化的策略来应对这一挑战,但随着底层动力学变得高维、混合或接触依赖,这种泛化变得越来越困难。为了解决这个问题,物理信息GCRL(Pi-GCRL)将最优控制启发的归纳偏置引入目标条件价值学习。虽然Pi-GCRL方法在导航和不涉及物体的目标到达领域已被证明有效,但它们在接触丰富任务中的可靠性仍不清楚,其中接触交互导致混合动力学、模式依赖的可控性和非光滑价值景观。在这项工作中,我们表明这些结构特性可能导致现有Pi-GCRL方法在朴素应用于接触丰富操作时性能下降。受此分析启发,我们引入了接触感知和分层公式,选择性地将物理信息归纳偏置应用于操作问题。我们的结果为将Pi-GCRL扩展到接触丰富操作提供了原则性的一步。

英文摘要

Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state--goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies that generalize across goals, but this generalization becomes increasingly difficult as the underlying dynamics become high-dimensional, hybrid, or contact-dependent. To address this issue, physics-informed GCRL (Pi-GCRL) introduces optimal-control-inspired inductive biases into goal-conditioned value learning. While Pi-GCRL methods have proven effective in navigation and object-free goal-reaching domains, their reliability in contact-rich tasks remains unclear, where contact interactions induce hybrid dynamics, mode-dependent controllability, and nonsmooth value landscapes. In this work, we show that these structural properties can cause existing Pi-GCRL methods to degrade when applied naively to contact-rich manipulation. Motivated by this analysis, we introduce contact-aware and hierarchical formulations that apply physics-informed inductive biases selectively across the manipulation problem. Our results provide a principled step toward extending Pi-GCRL to contact-rich manipulation.

2605.30488 2026-06-01 cs.RO 版本更新

CoMo3R-SLAM: Collaborative Monocular Dense SLAM with Learned 3D Reconstruction Priors for Outdoor Multi-Agent Systems

CoMo3R-SLAM: 面向室外多智能体系统的协作式单目稠密SLAM与学习型3D重建先验

Zhihao Cao, Qi Shao, Shuhao Zhai, Feng Tian, Anh Nguyen, Hesheng Wang, Baoru Huang

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Liverpool(利物浦大学) Harbin Engineering University(哈尔滨工程大学) University of Ottawa(渥太华大学) Shanghai Jiao Tong University(上海交通大学) Imperial College London(伦敦帝国理工学院)

AI总结 提出首个协作式单目稠密RGB SLAM系统CoMo3R-SLAM,利用学习的前馈3D重建先验实现室外多智能体地图构建,无需深度传感器即可生成全局一致的度量地图。

详情
AI中文摘要

协作式稠密SLAM对于多机器人团队在大规模室外环境中实现可扩展且一致的3D感知至关重要。现有系统通常依赖深度传感器,导致显著的载荷、功耗和标定成本。单目RGB相机是一种轻量级替代方案,但协作式单目稠密SLAM仍面临尺度模糊、智能体间数据关联不可靠等困难,尤其是在室外场景中,低重叠和重复结构使得传统特征匹配不可靠,从而需要鲁棒的几何信息。我们提出CoMo3R-SLAM,这是首个利用鲁棒的学习前馈3D重建先验进行室外多智能体地图构建的协作式单目稠密RGB SLAM系统。每个智能体运行一个先验引导的前端,用于实时跟踪和局部稠密融合,而协调器执行稠密点图匹配以进行跨智能体验证、闭式Sim(3)规范同步以及GPU加速的全局光束法平差与分段深度优化。我们的系统既不需要深度传感器也不需要参数化内参,仅凭单目RGB即可产生鲁棒的跨智能体约束和全局一致的度量地图。在Tanks and Temples和Waymo序列上,CoMo3R-SLAM在四个Tanks and Temples场景中的三个上实现了最佳ATE,并在Waymo上达到竞争性精度,匹配或超越最先进的RGB-D方法,同时以8 FPS在线运行。

英文摘要

Collaborative dense SLAM is essential for multi-robot teams to achieve scalable and consistent 3D perception across large-scale outdoor environments. Existing systems typically depend on depth sensors, incurring significant payload, power, and calibration costs. Monocular RGB cameras are a lightweight alternative, but collaborative monocular dense SLAM remains difficult due to scale ambiguity and unreliable inter-agent data association, especially in outdoor scenes where low overlap and repetitive structures make traditional feature matching unreliable; this motivates the use of robust geometric information. We propose CoMo3R-SLAM, the first collaborative monocular dense RGB SLAM system that leverages robust learned feed-forward 3D reconstruction priors for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion, while a coordinator performs dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment with segment-level depth optimization. Requiring neither depth sensors nor parametric intrinsics, our system produces robust cross-agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences, CoMo3R-SLAM achieves the best ATE on three of four Tanks and Temples scenes and competitive Waymo accuracy, matching or exceeding state-of-the-art RGB-D methods while running online at 8 FPS.

2605.30484 2026-06-01 cs.RO 版本更新

ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation

ELAN4D:以具身为中心的4D监督用于视觉-语言-动作模型的即插即用适配

Zeyuan He, Bowen Yang, Zhirui Fang, Keru Zhou, Lei Jiang, Jingjing Qian, Fan Mo, Junchi Yan, Philip Torr, Xiu Li, Li Jiang, Jialin Yu

发表机构 * Torr Vision Group, University of Oxford(托尔视觉组,牛津大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学) University College London(伦敦大学学院) University of Cambridge(剑桥大学)

AI总结 提出ELAN4D框架,通过未来机器人关键点轨迹作为预测性时空监督,以即插即用方式增强VLA策略的鲁棒性和泛化能力。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现出潜力,但现有大多数策略通过直接从当前观测回归动作来反应式运行,没有显式建模未来动态。这限制了它们在分布外扰动下的泛化能力。为解决此问题,我们提出ELAN4D,一个以具身为中心的4D感知训练框架,通过未来机器人关键点轨迹作为预测性时空监督来增强VLA策略。仅利用本体感觉状态的前向运动学,我们推导出机器人关键点(如关节和末端执行器)的3D位移轨迹,预处理成本可忽略。这些轨迹提供度量且紧凑的监督,无需外部跟踪器或重建。一个即插即用的辅助分支,配备轻量级轨迹解码器,在通过梯度隔离保护预训练视觉-语言主干的同时,将4D信号注入动作专家。推理时丢弃轨迹解码器,保持基础策略接口不变。在LIBERO、LIBERO-Plus、RoboTwin2.0和真实世界操作任务上的大量实验表明,ELAN4D持续优于强VLA基线,在分布外扰动(包括相机、背景和布局变化)下取得最佳整体性能和显著提升。这些结果凸显了以具身为中心的4D监督对于构建更鲁棒和可泛化的操作策略的有效性。

英文摘要

Vision-Language-Action (VLA) models have shown promise for robotic manipulation, yet most existing policies operate reactively by directly regressing actions from current observations, without explicitly modeling future dynamics. This limits their ability to generalize under out-of-distribution perturbations. To address this issue, we propose ELAN4D, an embodiment-centric, 4D-aware training framework that enhances VLA policies with future robot keypoint tracks as predictive spatio-temporal supervision. Using only forward kinematics from proprioceptive states, we derive 3D displacement tracks of robot keypoints, such as joints and the end-effector, with negligible preprocess cost. These tracks provide metric and compact supervision without requiring external trackers or reconstruction. A plug-and-play auxiliary branch with a lightweight track decoder injects this 4D signal into the action expert while preserving the pretrained vision-language backbone through gradient isolation. The track decoder is discarded during inference, leaving the base policy interface unchanged. Extensive experiments on LIBERO, LIBERO-Plus, RoboTwin2.0 and real-world manipulation tasks demonstrate that ELAN4D consistently improves over strong VLA baselines, achieving the best overall performance and substantial gains under out-of-distribution perturbations, including camera, background, and layout shifts. These results highlight the effectiveness of embodiment-centric 4D supervision for building more robust and generalizable manipulation policies.
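To make the "keypoint tracks from forward kinematics" step concrete, here is a minimal sketch. It is an illustration under simplifying assumptions, not the ELAN4D implementation: it uses a planar serial arm (2D rather than the 3D tracks described in the abstract), and the function names `planar_fk` and `displacement_tracks` are hypothetical.

```python
import math


def planar_fk(joint_angles, link_lengths):
    """Return the (x, y) position of each link tip of a planar serial arm."""
    x = y = theta = 0.0
    points = []
    for q, length in zip(joint_angles, link_lengths):
        theta += q                      # joint angles accumulate along the chain
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        points.append((x, y))
    return points


def displacement_tracks(joint_trajectory, link_lengths):
    """Keypoint displacements of each future state relative to the first state."""
    reference = planar_fk(joint_trajectory[0], link_lengths)
    tracks = []
    for joint_angles in joint_trajectory[1:]:
        points = planar_fk(joint_angles, link_lengths)
        tracks.append(
            [(px - rx, py - ry) for (px, py), (rx, ry) in zip(points, reference)]
        )
    return tracks


if __name__ == "__main__":
    # Hypothetical two-link arm rotating its first joint by 90 degrees.
    trajectory = [[0.0, 0.0], [math.pi / 4, 0.0], [math.pi / 2, 0.0]]
    for step in displacement_tracks(trajectory, [1.0, 1.0]):
        print(step)
```

Because the tracks come only from logged joint states, no external tracker is needed, which matches the low preprocessing cost the abstract describes.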

2605.30468 2026-06-01 cs.RO 版本更新

Learning-Based Navigation for Indoor Mobile Robots

基于学习的室内移动机器人导航

Tri-Tin Nguyen, Tien-Dat Nguyen, Gia-Uy Le, Vinh Nguyen, Vinh-Hao Nguyen

发表机构 * Faculty of Electrical and Electronics Engineering, Ho Chi Minh City University of Technology, VNU-HCM, Ho Chi Minh City, Vietnam

AI总结 提出一种结合监督学习全局规划器与基于学习的DWA局部规划器的导航框架,通过行为克隆和PPO优化实现安全避障导航。

详情
AI中文摘要

本文提出了一种基于学习的室内移动机器人导航框架。该方法将基于代价感知A*专家轨迹训练的监督神经全局规划器与提出的基于学习的DWA局部规划器相结合,后者被表述为动态窗口法(DWA)动作格上的离散候选选择。对于局部规划,策略首先通过行为克隆进行训练,然后在可行性感知掩码下通过近端策略优化(PPO)进行精炼。该框架在模拟和真实室内环境中进行了实现和评估。实验结果表明,所提方法能够在存在障碍物的情况下生成可行的全局路径和可靠的局部运动指令,以实现安全的目标导向导航。这些结果证明了将基于学习的全局规划与强化学习精炼的局部控制相结合用于室内移动机器人导航的有效性。源代码将在 https://ntdathp.github.io/rl_robot_web/ 发布。

英文摘要

This paper presents a learning-based navigation framework for indoor mobile robots. The proposed method combines a supervised neural global planner, trained from cost-aware A* expert trajectories, with the proposed Learning-Based DWA local planner, which is formulated as discrete candidate selection over the Dynamic Window Approach (DWA) action lattice. For local planning, the policy is first trained by behavior cloning and then refined by Proximal Policy Optimization (PPO) under feasibility-aware masking. The framework is implemented and evaluated in both simulated and real-world indoor environments. Experimental results show that the proposed method generates feasible global routes and reliable local motion commands for safe goal-directed navigation in the presence of obstacles. These results demonstrate the effectiveness of integrating learning-based global planning with reinforcement-learning-refined local control for indoor mobile robot navigation. The source code will be released at https://ntdathp.github.io/rl_robot_web/.
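The local planner is described as discrete candidate selection over a DWA action lattice with feasibility-aware masking. A minimal sketch of that selection step follows, assuming the policy outputs one score per candidate (v, w) pair and a separate check supplies a feasibility flag per candidate. The names `masked_softmax` and `select_action` are hypothetical, and the scores in the usage example are made up.

```python
import math


def masked_softmax(scores, feasible):
    """Softmax over candidate scores; infeasible candidates get probability 0."""
    if not any(feasible):
        raise ValueError("no feasible candidate in the action lattice")
    peak = max(s for s, ok in zip(scores, feasible) if ok)  # for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, feasible)]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(candidates, scores, feasible):
    """Pick the feasible (v, w) candidate with the highest masked probability."""
    probs = masked_softmax(scores, feasible)
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best]


if __name__ == "__main__":
    # Hypothetical lattice of (linear velocity, angular velocity) pairs.
    lattice = [(0.0, 0.0), (0.5, 0.0), (0.5, 0.3)]
    policy_scores = [0.1, 2.0, 1.0]
    collision_free = [True, False, True]   # the best-scoring candidate is masked out
    print(select_action(lattice, policy_scores, collision_free))
```

Masking before normalisation means an infeasible candidate can never be sampled or chosen, regardless of how highly the policy scores it.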

2605.30383 2026-06-01 cs.RO cs.AI 版本更新

Structured interactions improve distributed coordination beyond model scaling in a real-world multi-robot system

结构化交互在真实世界多机器人系统中超越模型规模提升分布式协调能力

Junping Wang, Zhizhong Zhang, Yongqiang Tang, Geng Zheng, Jiaming Zhang, Shiji Song, Yanmei Li, Yushan Ma

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统国家重点实验室,自动化研究所,中国科学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) Department of Automation, Tsinghua University(清华大学自动化系) Liupanshan Laboratory, Ningxia University(宁夏大学六盘山实验室)

AI总结 通过真实多机器人实验,发现模块化层次化交互拓扑相比增加模型规模能更显著提升协调性能。

详情
AI中文摘要

提升单个机器人能力是常见但昂贵的做法。本文研究真实多机器人协调中的系统级设计问题:在硬件预算匹配的情况下,重构机器人间的通信是否比增加机载模型规模带来更大收益?使用10个物理机器人执行代表性的运输与建图任务(每种条件5次运行,共60次运行),我们发现从全连接切换到模块化层次化交互可将归一化性能提升47分(0-100分),而将神经网络隐藏层大小加倍最多提升9分。嵌套混合效应模型比较显示,拓扑对模型拟合的改善远大于规模。该模式在独立的SMAC复制实验中得到确认;异构基准重分析提供次要支持性一致性检查而非主要证据。在仿真校准的外推中观察到超过1024个隐藏单元的性能饱和,但未直接在硬件上验证。这些结果表明,在测试系统和任务设置中,交互结构可发挥主导作用,但更广泛的定量泛化仍有待建立。

英文摘要

Scaling individual robot capabilities is common but costly. Here we investigate a system-level design question in real-world multi-robot coordination: given matched hardware budgets, does restructuring communication among robots yield larger gains than increasing onboard model size? Using a representative transport-and-mapping task with 10 physical robots (5 runs per condition, 60 runs total), we find that switching from fully connected to modular hierarchical interactions improves normalised performance by 47 points (0--100), whereas doubling neural network hidden size yields at most 9 points. Nested mixed-effects model comparisons show a substantially larger improvement in model fit for topology than for scale. The pattern is confirmed in independent SMAC replications; heterogeneous benchmark reanalyses provide secondary supporting consistency checks rather than primary evidence. Performance saturation beyond 1024 hidden units is observed in simulation-calibrated extrapolation, not directly on hardware. These results indicate that interaction structure can play a dominant role within the tested system and task setting, while broader quantitative generalisation remains to be established.

2605.30368 2026-06-01 cs.NE cs.AI cs.RO q-bio.NC 版本更新

Reinterpreting Safety Thresholds as Neuron Spiking Thresholds

将安全阈值重新解释为神经元放电阈值

Enrico Del Re, Mohamed Sabry, Cristina Olaverri-Monreal

发表机构 * Johannes Kepler University Linz(林茨约翰内斯·开普勒大学) Department Intelligent Transport Systems(智能交通系统系)

AI总结 提出将替代安全指标(SSM)的固定阈值重新解释为泄漏积分发放(LIF)神经元的放电阈值,构建脉冲神经网络(SNN)学习人类刹车起始点,实现客观SSM与主观安全感知的融合。

Comments 6 pages

详情
AI中文摘要

替代安全指标(SSM)在自动驾驶领域的交通风险评估中被广泛使用。然而,大多数基于SSM的评估采用固定阈值,无法捕捉人类对持续临界状态的响应或对短暂高风险峰值的反应。本文提出了一种受生物学启发的SSM阈值重新解释,将其建模为泄漏积分发放(LIF)神经元的放电阈值,并将多个SSM输入组合成脉冲神经网络(SNN)。该SNN经过训练,使其发放的脉冲与人类刹车起始点对齐。训练数据是在使用3D-CoAutoSim平台(基于CARLA/Unreal和六自由度运动平台)的受控跟车实验中记录的,实验中人为诱发了关键事件。结果表明,学习到的脉冲活动在定性上与跨场景的刹车行为一致,并捕捉了仅靠阈值交叉无法一致解释的反应。跨参与者的分析进一步表明,学习到的输入阈值保持相对一致,而学习到的衰减因子编码了SSM的不同时间敏感性。本研究的发现表明,脉冲动力学可能作为一种机制,促进客观SSM与主观人类安全感知的融合。

英文摘要

Surrogate Safety Measures (SSMs) are extensively utilised in the evaluation of traffic risk in automated driving contexts. However, the majority of SSM-based evaluations employ fixed thresholds that fail to capture the human response to sustained borderline conditions or the reaction to brief, high-risk peaks. The present work proposes a biologically inspired reinterpretation of SSM thresholds, modelling them as spiking thresholds of leaky integrate-and-fire (LIF) neurons, with multiple SSM inputs combined into a spiking neural network (SNN). The SNN is trained to emit spikes that are aligned with human braking onsets. The training data was recorded in a controlled car-following experiment using the 3D-CoAutoSim platform with CARLA/Unreal and a 6-DOF motion platform, where induced critical events were generated. The results demonstrate that the learned spiking activity qualitatively aligns with braking behaviour across scenarios and captures reactions that are not consistently explained by threshold crossings alone. Analysis across participants further indicates that learned input thresholds remain relatively consistent, while learned decay factors encode different temporal sensitivities for the SSMs. The findings of this study indicate that spiking dynamics may serve as a mechanism to facilitate the convergence of objective SSMs with subjective human safety perception.
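The difference between a fixed threshold and a leaky integrate-and-fire threshold can be shown with a few lines of code. This is a minimal sketch of a generic discrete-time LIF neuron, not the trained network from the paper: the update rule (membrane potential multiplied by a decay factor, plus the input, reset to zero after a spike) is a common textbook form assumed here, and the input values are made up.

```python
def lif_spike_times(inputs, threshold, decay):
    """Discrete-time leaky integrate-and-fire neuron.

    The membrane potential leaks by `decay` each step, accumulates the input,
    and resets to zero whenever it reaches `threshold`.
    Returns the list of time steps at which the neuron spiked.
    """
    potential = 0.0
    spikes = []
    for t, x in enumerate(inputs):
        potential = decay * potential + x
        if potential >= threshold:
            spikes.append(t)
            potential = 0.0
    return spikes


if __name__ == "__main__":
    sustained_borderline = [0.6, 0.6, 0.6, 0.6]   # never exceeds 1.0 on its own
    brief_peak = [0.0, 0.0, 1.5, 0.0]

    # With memory (decay 0.9) the sustained input accumulates and triggers spikes.
    print(lif_spike_times(sustained_borderline, threshold=1.0, decay=0.9))
    # With no memory (decay 0.0) the neuron behaves like a fixed threshold.
    print(lif_spike_times(sustained_borderline, threshold=1.0, decay=0.0))
    print(lif_spike_times(brief_peak, threshold=1.0, decay=0.9))
```

With a decay of zero the neuron reduces to a plain threshold crossing, which is why the decay factor can be read as a temporal sensitivity parameter.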

2605.28442 2026-06-01 cs.RO cs.CV 版本更新

Self-Supervised Online Robot-Agnostic Traversability Estimation for Open-World Environments

面向开放世界的自监督在线机器人无关可通行性估计

Julia Hindel, Simon Bultmann, Houman Masnavi, Daniele Cattaneo, Abhinav Valada

发表机构 * Department of Computer Science, University of Freiburg(弗赖堡大学计算机科学系)

AI总结 提出COTRATE框架,通过自监督在线学习从多模态未标记机器人经验中估计可通行性,采用机器人无关的地形评估模块和多样性感知特征选择策略,实现跨平台知识迁移并降低遗忘。

Comments 14 pages, 16 Figures

详情
AI中文摘要

自监督在线可通行性估计使机器人能够从未标记的开放世界经验中持续学习,并调整其导航行为以实现安全高效的轨迹。现有方法要么依赖手工设计的本体感受可通行性分数,限制了机器人无关性,要么对先验数据进行聚类,阻碍了在线学习。此外,许多持续学习方法会带来大量的内存和计算成本,阻碍了机载部署。我们提出了COTRATE,一个用于从多模态、未标记的机器人经验中持续估计可通行性的在线学习框架。我们的方法首先使用一个基于学习的机器人无关在线地形评估模块,该模块处理本体感受和惯性信号,推断出鲁棒的可通行性分数。然后,这些分数通过一种新颖的对齐损失来监督视觉可通行性网络,该损失将视觉嵌入与在线地形评估相关联。为了在持续学习过程中以最小开销减轻遗忘,我们提出了一种多样性感知的特征选择策略,该策略使用紧凑的回放记忆来保持性能。我们进一步表明,学习到的可通行性表示支持具有不同运动学特性的不同机器人平台之间的知识迁移。我们在一个包含约50,000张图像的数据集上评估了COTRATE,该数据集由两个机器人平台在11种户外地形上收集,并在三个代表性户外环境中的导航任务上进行了基准测试。我们将数据集、代码和训练模型公开。

英文摘要

Self-supervised online traversability estimation enables robots to continuously learn from unlabeled open-world experiences and adapt their navigation behavior toward safe and efficient trajectories. Existing approaches either rely on handcrafted proprioceptive traversability scores, limiting robot-agnosticism, or cluster prior data, preventing online learning. Moreover, many continual learning methods incur substantial memory and computational costs, hindering onboard deployment. We introduce COTRATE, an online learning framework for continuous traversability estimation from multimodal, unlabeled robot experience. Our method first infers robust traversability scores using a robot-agnostic, learning-based online terrain assessment module operating on proprioceptive and inertial signals. These scores then supervise a visual traversability network through a novel alignment loss that associates visual embeddings with online terrain assessments. To mitigate forgetting during continual learning with minimal overhead, we propose a diversity-aware feature selection strategy that preserves performance using a compact replay memory. We further show that the learned traversability representation supports knowledge transfer across different robot platforms with different locomotion kinematics. We evaluate COTRATE on a dataset of approximately 50,000 images collected with two robotic platforms across 11 outdoor terrains, and benchmark it on navigation tasks in three representative outdoor environments. We make the dataset, code, and trained models publicly available.

2605.27114 2026-06-01 cs.RO 版本更新

VR-DAgger: Immersive VR for Dexterous Data Collection and Uncertainty-Guided On-Policy Correction

VR-DAgger: 用于灵巧数据收集和不确定性引导的在线策略校正的沉浸式VR

René Zurbrügg, Tifanny Portela, Arjun Bhardwaj, Aravind Elanjimattathil Vijayan, Maximum Wilder-Smith, Marco Hutter

发表机构 * Robotics Systems Lab(机器人系统实验室) ETH Zürich(苏黎世联邦理工学院) ETH AI Center(ETH人工智能中心) ETH Augmented Reality Research Lab(ETH增强现实研究实验室) ETH Mobility Initiative(ETH移动性倡议) ANYbotics AG(ANYbotics公司) Swiss Federal Railways(瑞士联邦铁路)

AI总结 提出VR-DAgger框架,通过VR应用进行灵巧遥操作和数据收集,利用MC Dropout不确定性评分选择关键失败片段进行在线校正,在灵巧操作任务上相比行为克隆提升高达23个百分点,并减少约40%的样本收集时间。

详情
AI中文摘要

从示范中学习对于机器人操作是有效的,但收集足够的任务特定数据仍然是一个主要瓶颈。在分布偏移下,小误差会累积,性能下降,专家时间往往花费在冗余、低价值的修正上,而不是少数关键失败案例。我们提出了VR-DAgger,一个以沉浸式VR应用为中心的人机协作框架,用于灵巧遥操作、示范收集和选择性策略校正。VR客户端提供直观的手部控制和同步场景可视化,而后台工作站运行仿真和学习,实现无需操作员持续监督的自主部署。我们使用蒙特卡洛(MC)Dropout在Isaac Lab部署扩散策略时对不确定性进行评分,并选择信息量大的失败片段进行校正。这些片段在VR中作为剪辑重放,操作员选择性地标记和校正策略的行为,将监督集中在不确定性最高的地方,无需全程监控或单独的中断分类器。我们在三个灵巧操作任务(平底锅抓取放置、抽屉打开、阀门旋转)上使用10自由度XHand在标准和具有挑战性的初始配置下进行评估。主动标记在所有任务上持续优于行为克隆,提升高达23个百分点。与无指导的人机协作检查相比,VR-DAgger通过将审查集中在选定的片段而非完整部署上,将每个样本的收集时间减少了约40%。

英文摘要

Learning from demonstrations is effective for robotic manipulation, but collecting sufficient task-specific data remains a major bottleneck. Under distribution shift, small errors compound, performance degrades, and expert time is often spent on redundant, low-value corrections instead of the few critical failure cases. We present VR-DAgger, a human-in-the-loop framework centered on an immersive VR application for dexterous teleoperation, demonstration collection, and selective policy correction. The VR client provides intuitive hand control with synchronized scene visualization, while a backend workstation runs simulation and learning, enabling autonomous rollouts without continuous operator oversight. We use Monte Carlo (MC) dropout to score uncertainty during Isaac Lab rollouts of a diffusion policy and select informative failure segments for correction. These segments are replayed in VR as clips, where the operator selectively labels and corrects the policy's behavior, concentrating supervision where uncertainty is highest without full-rollout monitoring or a separate intervention classifier. We evaluate on three dexterous manipulation tasks (Pan pick-and-place, Drawer opening, Valve turning) with a 10-DoF XHand under standard and challenging initial configurations. Active labeling consistently improves over behavioral cloning across all tasks, with gains of up to 23 percentage points. Compared to unguided human-in-the-loop inspection, VR-DAgger reduces per-sample collection time by approximately 40% by focusing review on selected segments rather than full rollouts.
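The uncertainty-guided selection described above can be sketched in two parts: scoring uncertainty as the variance across stochastic dropout passes, and picking the rollout segments with the highest scores. The code below is a toy illustration, not the VR-DAgger pipeline: it replaces the diffusion policy with a hypothetical linear model, and the fixed-length segmentation and the function names are assumptions made for this example.

```python
import random
from statistics import pvariance


def mc_dropout_uncertainty(weights, features, p_drop, n_passes, rng):
    """Variance of a toy linear model's output across stochastic dropout passes."""
    keep = 1.0 - p_drop
    outputs = []
    for _ in range(n_passes):
        # Each weight is dropped independently; survivors are rescaled by 1/keep.
        y = sum(
            w * x / keep
            for w, x in zip(weights, features)
            if rng.random() >= p_drop
        )
        outputs.append(y)
    return pvariance(outputs)


def top_uncertain_segments(step_scores, segment_len, k):
    """Start indices of the k fixed-length segments with the highest mean score."""
    segments = []
    for start in range(0, len(step_scores) - segment_len + 1, segment_len):
        window = step_scores[start:start + segment_len]
        segments.append((sum(window) / segment_len, start))
    segments.sort(reverse=True)
    return [start for _, start in segments[:k]]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-step features from one rollout of six steps.
    rollout = [[1.0, 0.1], [1.0, 0.2], [3.0, 2.0], [2.5, 2.2], [1.0, 0.1], [0.9, 0.1]]
    scores = [mc_dropout_uncertainty([1.0, 2.0], f, 0.5, 50, rng) for f in rollout]
    print(top_uncertain_segments(scores, segment_len=2, k=1))
```

Only the selected segments would then be replayed to the operator, which is the mechanism the abstract credits for the reduced review time.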

2605.26430 2026-06-01 cs.RO 版本更新

Multi-Robot Box Transport over Different Surfaces with Decentralized Role-based Proportional Control

多机器人在不同表面上的基于去中心化角色比例控制的箱子运输

Aditya Bhatt, Himavarshini Yarragangu, Urvish Shah, Venkata Sai Yaswanth Mohan Thota, Souma Chowdhury

发表机构 * Mechanical & Aerospace Eng., University at Buffalo, Buffalo, NY(机械与航空航天工程系,布法罗大学,布法罗,纽约)

AI总结 提出一种异步去中心化任务与运动规划方法R2P2,通过角色分配和比例控制实现多机器人在不同倾斜和摩擦表面上的协作箱子运输,在仿真和物理实验中验证了其泛化性和成功率优于标准虚拟领导者-跟随者方法。

Comments Accepted for presentation at the 2026 ASME IDETC-CIE

详情
AI中文摘要

通过推动实现多机器人协作运输物体在建筑、仓库环境以及灾后废墟清理等许多应用中具有广泛前景。然而,在不同倾斜和摩擦特性的表面上实现协作运输带来了独特的挑战。为应对这些挑战,本文提出了一种异步去中心化任务与运动规划方法,用于在平坦、上坡和下坡地形上运输不同质量的矩形箱子。这种去中心化方法减轻了通信、同步和共识需求,并缓解了单点故障问题。我们的方法称为R2P2(基于规则和比例控制原语的角色分配),根据考虑所需操作模式(箱子旋转或平移)的规则为机器人分配角色(例如,推、支撑和阻止);随后根据角色执行基于规则的控制或机器人速度的比例控制。每个机器人在执行角色和控制时假设能观察到自身和箱子的位置与朝向。R2P2在使用NVIDIA IsaacSim构建的模拟器中通过六机器人团队进行了评估,展示了在不同表面摩擦/倾斜和箱子质量场景下的泛化能力,并且与标准虚拟领导者-跟随者方法相比具有更高的成功率。R2P2还通过物理实验成功验证,在四台负责移动1.2 kg箱子的TurtleBot机器人上执行。

英文摘要

Collaborative transport of objects via pushing by multiple robots has many applications, ranging from construction and warehouse environments to post-disaster debris clean-up. Achieving collaborative transport over surfaces with different inclination and friction properties, however, poses unique challenges. To address these challenges, this paper presents an asynchronous decentralized task and motion planning approach for transporting rectangular boxes of varying mass over flat, uphill and downhill terrain. Such a decentralized approach alleviates communication, synchronization and consensus needs and mitigates single-point-of-failure issues. Our approach, called R2P2 or Roles with Rules and Proportional-control Primitive, assigns roles (e.g., push, support and prevent) to robots based on rules cognizant of the mode of manipulation needed (box rotation vs translation); this is followed by either rule-based control or proportional control of robot velocity based on the roles. Each robot is assumed to observe the location and heading of itself and the box in executing the role and controls. R2P2 is evaluated with a six-robot team deployed in a simulator built using NVIDIA IsaacSim, demonstrating generalizability across different surface friction/inclination and box mass scenarios, and a better success rate compared to a standard virtual-leader-follower method. R2P2 is also successfully validated with a physical experiment, where it is executed onboard four TurtleBots tasked with moving a 1.2 kg box.
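The proportional control of robot velocity mentioned in the abstract can be illustrated with a one-dimensional sketch. This is not the R2P2 controller: the gain, the velocity limit, and the idea of tracking a target position along the push direction are assumptions chosen for illustration, and `proportional_velocity` is a hypothetical helper.

```python
def proportional_velocity(target, current, kp, v_max):
    """Velocity command proportional to the position error, saturated at +/- v_max."""
    error = target - current
    return max(-v_max, min(v_max, kp * error))


if __name__ == "__main__":
    # Hypothetical pushing robot closing a 1 m gap along the push direction.
    position, goal, dt = 0.0, 1.0, 0.1
    for step in range(5):
        v = proportional_velocity(goal, position, kp=0.5, v_max=0.2)
        position += v * dt
        print(f"step {step}: v = {v:.3f} m/s, position = {position:.3f} m")
```

Saturating the command keeps the robot within its velocity limits when the error is large, while the proportional term slows it smoothly as it approaches the target.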

2605.26304 2026-06-01 cs.RO 版本更新

Collaborative Navigation and Exploration with $β$-Sparse Gaussian Processes

基于$β$-稀疏高斯过程的协作导航与探索

Evangelos Psomiadis, Dipankar Maity, Panagiotis Tsiotras

发表机构 * D. Guggenheim School of Aerospace Engineering, Georgia Tech, Atlanta, GA, USA(佐治亚理工学院D.Guggenheim航空航天工程学院) Department of Electrical and Computer Engineering, UNC Charlotte, Charlotte, NC, USA(北卡罗来纳大学夏洛特分校电气与计算机工程系)

AI总结 针对异构机器人在未知环境中的协作导航问题,提出一种利用$β$-稀疏高斯过程进行带宽受限下地图点选择和导航动作联合优化的框架,显著降低路径代价和传输信息量。

Comments 16 pages, 6 figures

详情
AI中文摘要

异构机器人在未知环境中的协作导航由于传感、通信和计算限制而面临重大挑战。在这项工作中,一个领航机器人向目标导航,同时一个移动传感器机器人(例如无人机)通过传输其局部观测地图的信息来辅助,但受带宽限制。我们提出一个框架,使传感器能够在线联合选择其传输的地图点和导航动作,同时预测环境的未探索区域。为此,我们提出了$β$-稀疏高斯过程,一种鲁棒的变分稀疏高斯过程模型,用于在基数约束下进行任务感知的诱导点选择。此外,我们开发了一种平衡任务相关性与探索的动作选择策略。在火星和地球地图上的仿真表明,与无通信相比,该框架可将路径代价降低18%,与原始数据传输基线相比,传输信息量减少76%。

英文摘要

Collaborative navigation of heterogeneous robots in unknown environments poses significant challenges due to sensing, communication, and computational limitations. In this work, a lead robot navigates toward a target while a mobile sensor robot (e.g., a drone) assists by transmitting information about its locally observed map under bandwidth constraints. We propose a framework that enables the sensor to jointly select its transmitted map points and navigation actions online, while also predicting unexplored regions of the environment. To this end, we present $β$-Sparse Gaussian Processes, a robust variational sparse Gaussian Process model for task-aware inducing point selection under cardinality constraints. Furthermore, we develop an action-selection strategy that balances task relevance with exploration. Simulations on Mars and Earth maps show that the framework can reduce path cost by 18% relative to no communication and decrease transmitted information by 76% compared to raw-data transmission baselines.

2605.29879 2026-06-01 cs.CV cs.RO Updated version

DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding

DGSG-Mind:用于长期场景理解与定位的动态3D高斯场景图

Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li

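Affiliations: School of Computer Science, Beijing Institute of Technology, China

AI summary: Proposes DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system that couples a probabilistic voxel grid with explicit 3D Gaussians for robust cross-modal instance fusion and incremental semantic mapping, and builds a hierarchical scene graph and a 3D Gaussian Mind for multimodal reasoning. It reports the best zero-shot 3D visual grounding among methods operating on self-reconstructed maps, with strong open-vocabulary semantic segmentation and scene reconstruction.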

Comments 9 pages, 6 figures

Details
AI Chinese abstract

将开放词汇语义信息集成到动态3D场景表示中对于长期具身场景理解至关重要。然而,现有方法常因跨视角线索不完整而导致脆弱的实例关联,同时处理对象级拓扑变化的能力有限,限制了长期机器人任务执行。此外,当前的3D场景理解方法要么依赖简单的特征匹配而缺乏显式空间推理,要么假设离线真实3D几何。为应对这些挑战,我们提出DGSG-Mind,一种混合实例感知的3D高斯动态场景图系统,配备具身推理智能体。我们的系统将概率体素网格与显式3D高斯耦合,实现鲁棒的跨模态实例融合和增量语义映射。它通过基于高斯的视觉重定位和由几何-语义一致性引导的局部掩码细化来处理动态变化。基于实例高斯图,DGSG-Mind进一步构建层次化场景图,并开发3D高斯思维,集成结构关系、空间-语义信息和视觉标注的RoI高斯渲染以进行多模态推理。大量实验表明,DGSG-Mind在基于自重建地图的方法中实现了最佳的零样本3D视觉定位性能,同时在3D开放词汇语义分割和场景重建中也表现出强劲性能。我们进一步将DGSG-Mind部署到真实世界机器人上,展示其目标导向推理和动态更新能力。DGSG-Mind的项目页面位于https://icr-lab.github.io/DGSG-Mind。

English abstract

Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind

2605.22639 2026-06-01 cs.RO Updated version

Symmetries Here and There, Combined Everywhere: Cross-space Symmetry Compositions in Robotics

此处与彼处的对称性,无处不在的组合:机器人学中的跨空间对称性组合

Loizos Hadjiloizou, Rodrigo Pérez-Dattari, Noémie Jaquier

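Affiliations: Department of Robotics, Perception and Learning, KTH Royal Institute of Technology

AI summary: Proposes cross-space symmetry compositions, a framework that uses the differential-geometric structure of the forward kinematics map to learn policies jointly equivariant to configuration-space and task-space symmetries; dual-arm experiments show that jointly leveraging multiple symmetries improves generalization.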

Comments 8 pages, 8 figures, 1 table

Details
AI Chinese abstract

机器人由于其机械结构和任务属性展现出丰富的对称性。尽管许多机器人问题同时表现出多种对称性,现有方法通常孤立地处理它们,未能利用其组合潜力。本文介绍了跨空间对称性组合,一个学习在配置空间和任务空间中对多种对称性联合等变的机器人策略的框架。利用前向运动学映射的微分几何结构,我们将对称性从配置空间下降到任务空间,并从任务空间提升到配置空间,使得它们能够在统一的表示空间内组合。我们在双机械臂的仿真和真实世界实验中验证了该框架,证明联合利用多种对称性能够改善泛化能力。

English abstract

Robots exhibit a rich variety of symmetries arising from their mechanical structure and the properties of their tasks. Although many robotics problems exhibit several symmetries simultaneously, existing approaches typically treat them in isolation, failing to exploit their combined potential. This paper introduces cross-space symmetry compositions, a framework for learning robot policies that are jointly equivariant to multiple symmetries across configuration and task spaces. Leveraging the differential-geometric structure of the forward kinematics map, we both descend symmetries from configuration to task space and lift symmetries from task to configuration space, enabling their composition within a unified representation space. We validate our framework on simulated and real-world experiments on a dual-arm robot, demonstrating that jointly leveraging multiple symmetries yields improved generalization.

2605.21007 2026-06-01 cs.CV cs.RO Updated version

LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

LiteViLNet: 轻量级视觉-激光雷达融合网络用于高效道路分割

Daojie Peng, Bingtao Wang, Fulong Ma, Liang Zhang, Jun Ma

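Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Shandong University

AI summary: Proposes LiteViLNet, a lightweight multi-modal network built from a dual-stream encoder, depth-wise separable convolutions, and a multi-scale feature fusion module; with 14.04M parameters it reaches a 96.36% MaxF score on the KITTI Road dataset, balancing accuracy and efficiency.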

Details
AI Chinese abstract

道路分割是自动驾驶和智能机器人系统的基本感知任务,需要高精度和实时推理,特别是在资源受限的边缘设备上部署时。现有的多模态道路分割方法通常依赖重型基于Transformer的编码器以达到最先进的性能,但其巨大的计算成本阻碍了在嵌入式平台上的实时部署。为解决这一困境,我们提出了LiteViLNet,一种轻量级多模态网络,融合RGB纹理信息和LiDAR几何信息用于高效道路分割。具体来说,我们设计了双流轻量级编码器和深度可分离卷积,以最小的参数从两种模态中提取层次特征。我们进一步提出了多尺度特征融合模块(MSFM)以促进不同层次的跨模态交互,以及一个大核桥模块以线性复杂度捕获长距离依赖。在KITTI道路数据集和实际应用上的大量实验表明,LiteViLNet在准确性和效率之间取得了有希望的平衡。值得注意的是,仅用14.04M参数,我们的模型达到了96.36%的MaxF分数,在所有基于CNN的方法中排名最佳,并与更大的基于Transformer的模型相当,在RTX 4060 Ti上模型推理速度为163.79 FPS(在Jetson Orin NX上为22.18 FPS)。它在推理速度上优于许多重型方法,同时保持高度竞争的准确性,充分验证了LiteViLNet在自动驾驶和智能机器人中实时嵌入式部署的潜力。

English abstract

Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.

2604.15215 2026-06-01 cs.RO Updated version

A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

用于机器人上下文模仿学习的层次化时空动作分词器

Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran

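Affiliations: Retrocausal, Inc.

AI summary: Proposes HiST-AT, a hierarchical spatiotemporal action tokenizer that clusters actions through two levels of vector quantization and reconstructs both the input actions and their timestamps, reaching state-of-the-art performance on multiple simulated and real robotic manipulation benchmarks.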

Details
AI Chinese abstract

我们提出了一种新颖的层次化时空动作分词器,用于上下文模仿学习。我们首先提出一种层次化方法,包括两个连续级别的向量量化。具体来说,低级别将输入动作分配到细粒度子簇,而高级别进一步将细粒度子簇映射到簇。我们的层次化方法优于非层次化方法,同时主要通过重建输入动作来利用空间信息。此外,我们通过利用空间和时间线索扩展了我们的方法,形成了层次化时空动作分词器,即HiST-AT。具体来说,我们的层次化时空方法进行多级聚类,同时重建输入动作及其相关时间戳。最后,在多个模拟和真实机器人操作基准上的广泛评估表明,我们的方法在上下文模仿学习中建立了新的最先进性能。

English abstract

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.
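
The two successive levels of vector quantization described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the codebooks below are hand-picked rather than learned, the function names are hypothetical, and the timestamp reconstruction used by HiST-AT is not shown.

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])),
    )

def hierarchical_tokenize(action, fine_codebook, coarse_codebook):
    """Assign an action to a fine subcluster, then map that subcluster to a cluster."""
    sub = nearest(action, fine_codebook)                    # lower level
    cluster = nearest(fine_codebook[sub], coarse_codebook)  # higher level
    return sub, cluster

# Hypothetical 2-D codebooks, chosen by hand for the example.
FINE = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
COARSE = [(0.0, 0.5), (5.0, 5.5)]
```

With these codebooks, `hierarchical_tokenize((4.8, 5.9), FINE, COARSE)` assigns the action to subcluster 3 and cluster 1, so nearby actions share a coarse token while keeping distinct fine tokens.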

2601.15197 2026-06-01 cs.AI cs.CL cs.CV cs.RO Updated version

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

LangForce: 通过潜在动作查询对视觉语言动作模型进行贝叶斯分解

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

Affiliations: Huazhong University of Science and Technology; Beijing Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Harbin Institute of Technology; The Hong Kong University of Science and Technology (Guangzhou); Zhengzhou University; Beihang University; East China Normal University; DeepCybot Co., Ltd.

AI summary: To address VLA models ignoring language because of dataset bias in training, proposes LangForce, which uses Bayesian decomposition and latent action queries to build a dual-branch architecture and maximizes the pointwise mutual information between actions and instructions, improving generalization without new data (an 11.3% gain on the OOD SimplerEnv benchmark).

Comments ICML 2026

Details
AI Chinese abstract

视觉-语言-动作(VLA)模型在机器人操作中显示出潜力,但往往难以泛化到新指令或复杂的多任务场景。我们识别出当前训练范式中的一个关键病理:目标驱动的数据收集造成了数据集偏差。在此类数据集中,仅凭视觉观察就能高度预测语言指令,导致指令与动作之间的条件互信息消失,我们将此现象称为信息崩溃。因此,模型退化为忽略语言约束的纯视觉策略,并在分布外(OOD)设置中失败。为解决此问题,我们提出LangForce,一种通过贝叶斯分解强制执行指令跟随的新框架。通过引入可学习的潜在动作查询,我们构建了一个双分支架构,用于估计纯视觉先验 $p(a \mid v)$ 和语言条件后验 $π(a \mid v, \ell)$。然后我们优化策略以最大化动作与指令之间的条件点互信息(PMI)。该目标有效惩罚了视觉捷径,并奖励明确解释语言命令的动作。无需新数据,LangForce显著提升了泛化能力。在SimplerEnv和RoboCasa上的大量实验证明了显著改进,包括在具有挑战性的OOD SimplerEnv基准上提升11.3%,验证了我们的方法在动作中稳健地锚定语言的能力。

English abstract

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $π(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments across SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
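
For reference, the conditional PMI named in this abstract can be written in the abstract's own notation. This is the standard textbook definition, stated as an assumption about what the objective looks like; the exact training loss used by LangForce is not given here.

```latex
\[
\mathrm{PMI}(a;\ell \mid v) \;=\; \log \frac{\pi(a \mid v, \ell)}{p(a \mid v)},
\qquad
I(A;L \mid V) \;=\; \mathbb{E}_{(a,\ell,v)}\bigl[\mathrm{PMI}(a;\ell \mid v)\bigr].
\]
```

The second identity holds when both terms are exact conditionals of the same joint distribution. If the instruction is predictable from the image alone, then $π(a \mid v, \ell) \approx p(a \mid v)$, the ratio approaches 1, and the PMI approaches 0, which is the collapse the abstract describes.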

2605.01581 2026-06-01 cs.RO Updated version

Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

Hyper-DP3: 面向视觉运动控制的3D扩散策略的频率感知规模调整

Jinhao Zhang, Zhexuan Zhou, Huizhe Li, Yichen Lai, Wenlong Xia, Haoming Song, Youmin Gong, Jie Mei

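Affiliations: Harbin Institute of Technology, Shenzhen; Shanghai Jiao Tong University

AI summary: To address the high computational cost of diffusion policies in robotic manipulation, analyzes the smoothness of action trajectories in the frequency domain and proposes Hyper-DP3, a lightweight 3D diffusion policy with a Diffusion Mixer decoder and two-step DDIM inference that reaches state-of-the-art performance with fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially lower inference latency.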

Details
AI Chinese abstract

基于扩散的视觉运动策略在机器人操作中表现良好,但当前方法仍继承了图像生成风格的解码器和多步采样。我们从频域角度重新审视这一设计。机器人动作轨迹高度平滑,大部分能量集中在少数低频离散余弦变换模式上。在此结构下,我们证明最优去噪器的误差受低频子空间维度和残余高频能量限制,意味着去噪误差在很少的反向步骤后即饱和。这也表明动作去噪需要比图像生成简单得多的去噪模型。受此启发,我们提出Hyper-DP3(HDP3),一种口袋大小的3D扩散策略,具有轻量级扩散混合器解码器,支持两步DDIM推理。我们的合成实验验证了理论,并支持两步去噪的充分性。此外,在RoboTwin2.0、Adroit、MetaWorld和真实世界任务中,HDP3以不到先前基于3D扩散策略1%的参数和显著更低的推理延迟实现了最先进的性能。

English abstract

Diffusion-based visuomotor policies perform well in robotic manipulation, yet current methods still inherit image-generation-style decoders and multi-step sampling. We revisit this design from a frequency-domain perspective. Robot action trajectories are highly smooth, with most energy concentrated in a few low-frequency discrete cosine transform modes. Under this structure, we show that the error of the optimal denoiser is bounded by the low-frequency subspace dimension and residual high-frequency energy, implying that denoising error saturates after very few reverse steps. This also suggests that action denoising requires a much simpler denoising model than image generation. Motivated by this insight, we propose Hyper-DP3 (HDP3), a pocket-scale 3D diffusion policy with a lightweight Diffusion Mixer decoder that supports two-step DDIM inference. Our synthetic experiments validate the theory and support the sufficiency of two-step denoising. Furthermore, across RoboTwin2.0, Adroit, MetaWorld, and real-world tasks, HDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially lower inference latency.
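
The claim that smooth trajectories concentrate their energy in a few low-frequency DCT modes can be checked on a toy example. The sketch below is not from the paper: it uses a hypothetical minimum-jerk profile as a stand-in for a robot action trajectory and a plain-Python orthonormal DCT-II, and it measures the fraction of energy captured by the first few modes.

```python
import math

def dct2_orthonormal(x):
    """Orthonormal DCT-II of a sequence (energy-preserving)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def low_freq_energy_fraction(x, num_modes):
    """Share of total signal energy held by the first num_modes DCT coefficients."""
    coeffs = dct2_orthonormal(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:num_modes]) / total

def min_jerk(t):
    """Minimum-jerk position profile from 0 to 1 for t in [0, 1] (illustrative trajectory)."""
    return 10 * t**3 - 15 * t**4 + 6 * t**5

trajectory = [min_jerk(i / 31) for i in range(32)]
fraction = low_freq_energy_fraction(trajectory, 4)  # energy share of modes 0..3
```

For this 32-sample trajectory, the first four modes hold well over 99% of the energy, which is the kind of structure the abstract's argument relies on.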

2604.27994 2026-06-01 cs.RO Updated version

Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA

跨城镇驾驶:面向CARLA零样本未见城镇固定路线驾驶的语义展开与城镇对抗正则化

Feeza Khan Khanzada, Jaerock Kwon

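Affiliations: Department of Electrical and Computer Engineering, University of Michigan–Dearborn

AI summary: Proposes a training method that combines future semantic prediction with town-adversarial regularization; trained only on Town05 and Town06, it improves zero-shot transfer of a CARLA driving agent to the held-out Town03 and Town04.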

Details
AI Chinese abstract

在一个模拟城镇中训练的驾驶代理往往在新城镇中表现不佳,因为道路形状、交叉口和车道布局可能不同。本文研究如何在CARLA驾驶模拟器中改进这种迁移,而不向代理提供来自测试城镇的任何训练数据。代理仅在Town05和Town06中训练,然后直接在Town03和Town04中评估。为了聚焦于道路布局差异,所有实验使用相同的天气和交通设置。我们提出一种训练方法,鼓励代理学习跨城镇有用的特征,而不是与单个训练城镇绑定的特征。在训练过程中,代理被要求预测未来相机视图的高层视觉含义,并且被阻止依赖那些揭示数据来自哪个源城镇的线索。这些额外的学习信号仅在训练期间使用;在测试时,驾驶策略使用与基线代理相同的观测和控制接口。在与匹配的DreamerV3风格世界模型驾驶代理的受控比较中,所提出的方法在未见城镇上取得了最高的平均成功率:在Town03上为36.6%,95%置信区间[30.5, 42.7];在Town04上为85.6%,95%置信区间[84.0, 87.2](基于五个训练种子计算)。针对最强基线的种子配对测试显示,在两个未见城镇上成功率差异均为正。额外实验表明,单独预测未来视觉含义或单独去除城镇特定线索不足以匹配组合方法。这些结果表明,将未来场景理解与减少对源城镇特定特征的依赖相结合,可以改善该CARLA设置下的跨城镇驾驶性能。

English abstract

Driving agents trained in one simulated town often perform poorly in a new town because the road shapes, intersections, and lane layouts can be different. This paper studies how to improve this kind of transfer in the CARLA driving simulator without giving the agent any training data from the test towns. The agent is trained only in Town05 and Town06, then evaluated directly in Town03 and Town04. To focus on road-layout differences, all experiments use the same weather and traffic settings. We propose a training method that encourages the agent to learn features that are useful across towns rather than features tied to one training town. During training, the agent is asked to predict the high-level visual meaning of future camera views and is also discouraged from relying on cues that reveal which source town the data came from. These extra learning signals are used only during training; at test time, the driving policy uses the same observation and control interface as the baseline agent. In controlled comparisons with matched DreamerV3-style world-model driving agents, the proposed method achieves the highest mean held-out success: 36.6% on Town03 with a 95% confidence interval of [30.5, 42.7] and 85.6% on Town04 with a 95% confidence interval of [84.0, 87.2], computed across five training seeds. Seed-paired tests against the strongest primary baselines show positive success-rate differences in both held-out towns. Additional experiments show that predicting future visual meaning alone or removing town-specific cues alone is not enough to match the combined method. These results suggest that combining future-scene understanding with reduced reliance on source-town-specific features can improve cross-town driving performance in this CARLA setting.

2604.20395 2026-06-01 cs.CV cs.RO Updated version

SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

SpaCeFormer: 快速无提议开放词汇3D实例分割

Chris Choy, Junha Lee, Chunghyun Park, Minsu Cho, Jan Kautz

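Affiliations: NVIDIA

AI summary: Proposes SpaCeFormer, a proposal-free space-curve transformer that segments a scene in 0.12-0.30 seconds, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines, and builds SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset; it achieves 11.1 zero-shot mAP on ScanNet200, a 2.8× improvement over the prior best proposal-free method.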

Comments Project page: https://nvlabs.github.io/SpaCeFormer/

Details
AI Chinese abstract

开放词汇3D实例分割是机器人和AR/VR的核心能力,但先前方法存在瓶颈:多阶段2D+3D流水线聚合基础模型输出需数百秒每场景,而伪标签端到端方法依赖碎片化掩码和外部区域提议。我们提出SpaCeFormer,一种无提议的空间曲线变换器,在标准基准上每场景运行0.12-0.30秒,比多阶段2D+3D流水线快2-3个数量级。我们将其与SpaCeFormer-3M配对,这是最大的开放词汇3D实例分割数据集(通过多视图掩码聚类和多视图VLM标注构建,包含来自7.4K场景的604K实例的3.0M多视图一致描述);其掩码召回率比先前单视图流水线高21倍(IoU>0.5时54.3% vs 2.5%)。SpaCeFormer结合空间窗口注意力与Morton曲线序列化以获得空间连贯特征,并使用RoPE增强解码器直接从学习到的查询预测实例掩码,无需外部提议。在ScanNet200上,我们实现11.1零样本mAP,比先前最佳无提议方法提升2.8倍;在ScanNet++和Replica上,我们达到22.9和24.1 mAP,超越包括使用多视图2D输入在内的所有先前方法。

English abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs in 0.12-0.30 seconds per scene across standard benchmarks, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21× higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8× improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
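
Morton-curve serialization, which this abstract names as one ingredient, can be illustrated with a short Python sketch. It shows only the generic technique (interleaving the bits of integer voxel coordinates and sorting by the resulting code); the bit width and function names are assumptions, and how SpaCeFormer actually voxelizes and orders points is not specified here.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)       # x occupies bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)   # y occupies bit positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)   # z occupies bit positions 2, 5, 8, ...
    return code

def serialize(voxels):
    """Order integer voxel coordinates along the Morton curve."""
    return sorted(voxels, key=lambda v: morton3d(*v))
```

Sorting by this code tends to place spatially close voxels near each other in the 1-D sequence, which is why such orderings are commonly paired with windowed attention.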

2604.10432 2026-06-01 cs.RO Updated version

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

AnySlot: 用于零样本槽级放置的目标条件视觉-语言-动作策略

Zhaofeng Hu, Sifan Zhou, Qinbo Zhang, Rongtao Xu, Qi Su, Jorge Mendez-Mendez, Ci-Jyun Liang

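Affiliations: Stony Brook University; Carnegie Mellon University; MBZUAI; Peking University

AI summary: Proposes AnySlot, a framework that converts language instructions into spatial visual goals, decoupling high-level slot selection from low-level execution to achieve precise zero-shot slot-level placement.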

Details
AI Chinese abstract

视觉-语言-动作(VLA)策略已成为通用机器人操作的多功能范式。然而,在组合语言下的精确物体放置对端到端VLA策略仍然具有挑战性。槽级放置需要可靠的槽接地和厘米级几何精度。为此,我们提出AnySlot,一个通过引入语言接地与控制之间的显式空间视觉目标来降低组合复杂性的框架。AnySlot通过在目标槽处渲染空间标记将语言转化为视觉目标,然后使用目标条件VLA策略执行该目标。这种层次化设计将高层槽选择与低层执行解耦,提高了语义准确性和空间鲁棒性。此外,认识到此类精度要求高的任务缺乏基准,我们引入了SlotBench,一个包含九个任务类别的结构化模拟基准,用于评估槽级放置中的空间推理。大量实验表明,AnySlot在零样本槽级放置中显著优于平面VLA基线和模块化接地方法。

English abstract

Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language remains challenging for end-to-end VLA policies. Slot-level placement requires reliable slot grounding and centimeter-level geometric precision. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal between language grounding and control. AnySlot converts language into a visual goal by rendering a spatial marker at the intended slot, then executes this goal with a goal-conditioned VLA policy. This hierarchical design decouples high-level slot selection from low-level execution, improving semantic accuracy and spatial robustness. Furthermore, recognizing the lack of benchmarks for such precision-demanding tasks, we introduce SlotBench, a structured simulation benchmark with nine task categories for evaluating spatial reasoning in slot-level placement. Extensive experiments show that AnySlot significantly outperforms flat VLA baselines and modular grounding methods in zero-shot slot-level placement.

2604.01985 2026-06-01 cs.LG cs.AI cs.RO Updated version

World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

World Action Verifier: 通过前向-反向不对称性自我改进世界模型

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, Yilun Du

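Affiliations: Stanford University; UC San Diego; Carnegie Mellon University; Google DeepMind; Harvard University

AI summary: Proposes the World Action Verifier (WAV) framework, which verifies state plausibility and action reachability independently and exploits forward-inverse asymmetry. It enforces cycle consistency using a diverse subgoal generator obtained from video corpora and a sparse inverse model, so that world models self-improve in under-explored regimes; across nine tasks it reports 2x higher sample efficiency and over 22% better downstream policy performance.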

Comments Project Website: https://world-action-verifier.github.io

Details
AI Chinese abstract

通用世界模型有望实现可扩展的策略评估、优化和规划,但达到所需的鲁棒性仍然具有挑战性。与主要关注最优动作的策略学习不同,世界模型需要在大量次优动作的空间中保持可靠,而这些动作在带有动作标签的机器人交互中往往代表性不足。为了解决这一挑战,我们提出了World Action Verifier (WAV)框架,该框架使世界模型能够识别自身的预测错误并进行自我改进。关键思想是将动作条件的状态预测分解为两个独立可验证的因素:状态合理性和动作可达性。我们证明,由于两个潜在的不对称性——更广泛的无动作数据的可用性和动作相关特征的更低维度——验证这些因素比直接前向预测更容易处理。利用这些不对称性,我们通过(i)从视频语料库中获得的多样子目标生成器和(ii)从状态特征子集推断动作的稀疏逆模型来增强世界模型。通过强制提议的子目标、推断的动作和前向展开之间的循环一致性,WAV在现有方法常常失败的欠探索区域提供了一种有效的验证机制。在涵盖MiniGrid、RoboMimic和ManiSkill的九个任务中,我们的方法实现了2倍的样本效率提升,同时将下游策略性能提高了22%以上。

English abstract

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning which primarily focuses on optimal actions, a world model needs to be reliable over a vast space of suboptimal actions, which are often underrepresented in action-labeled robot interactions. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two independently verifiable factors: state plausibility and action reachability. We show that verifying these factors is significantly more tractable than direct forward prediction due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among proposed subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods often fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by over 22%.

2603.28579 2026-06-01 cs.RO Updated version

EBuddy: a workflow orchestrator for industrial human-machine collaboration

EBuddy:面向工业人机协作的工作流编排器

Michele Banfi, Rocco Felici, Stefano Baraldo, Oliver Avram, Anna Valente

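Affiliations: Laboratory of Automation, Robotics and Machines (ARM)

AI summary: Proposes EBuddy, a voice-guided workflow orchestrator that formalizes expert practice as a finite state machine driven application for natural human-machine collaboration in industrial environments, substantially shortening end-to-end process duration while preserving repeatability.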

Details
AI Chinese abstract

本文介绍了EBuddy,一种用于工业环境中自然人机协作的语音引导工作流编排器。EBuddy针对工具密集型工作流中一个反复出现的瓶颈:专家知识有效但难以规模化,当操作员和会话之间临时重建程序时,执行质量会下降。EBuddy将专家实践操作化为有限状态机(FSM)驱动的应用程序,在运行时提供可解释的决策框架(当前状态和允许的动作),使得口头请求在状态约束下被解释,同时系统执行并监控相应的工具交互。通过模块化工作流工件,EBuddy协调异构资源,包括GUI驱动的软件和协作机器人,利用自动语音识别和意图理解实现完全基于语音的交互。在定向能量沉积(DED)的叶轮叶片检查和修复准备中,通过人机协作实现的工业试点显示,在入职、3D扫描和处理以及修复程序生成过程中,端到端流程时间显著减少,同时保持了可重复性和低操作员负担。

English abstract

This paper presents EBuddy, a voice-guided workflow orchestrator for natural human-machine collaboration in industrial environments. EBuddy targets a recurrent bottleneck in tool-intensive workflows: expert know-how is effective but difficult to scale, and execution quality degrades when procedures are reconstructed ad hoc across operators and sessions. EBuddy operationalizes expert practice as a finite state machine (FSM) driven application that provides an interpretable decision frame at runtime (current state and admissible actions), so that spoken requests are interpreted within state-grounded constraints, while the system executes and monitors the corresponding tool interactions. Through modular workflow artifacts, EBuddy coordinates heterogeneous resources, including GUI-driven software and a collaborative robot, leveraging fully voice-based interaction through automatic speech recognition and intent understanding. An industrial pilot on impeller blade inspection and repair preparation for directed energy deposition (DED), realized by human-robot collaboration, shows substantial reductions in end-to-end process duration across onboarding, 3D scanning and processing, and repair program generation, while preserving repeatability and low operator burden.
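
The idea of interpreting spoken requests within a state-dependent set of admissible actions can be sketched as a small finite state machine. Everything named below is hypothetical: the class, the state names, and the intent strings are loosely based on the workflow stages listed in the abstract and are not EBuddy's actual interface.

```python
class WorkflowFSM:
    """Finite state machine that only accepts intents admissible in the current state."""

    def __init__(self, transitions, start):
        self.transitions = transitions  # {state: {intent: next_state}}
        self.state = start

    def admissible_actions(self):
        """Decision frame exposed at runtime: intents allowed in the current state."""
        return sorted(self.transitions.get(self.state, {}))

    def request(self, intent):
        """Apply a recognized intent; return False if it is not admissible here."""
        next_state = self.transitions.get(self.state, {}).get(intent)
        if next_state is None:
            return False  # rejected, state unchanged
        self.state = next_state
        return True

# Hypothetical workflow, for illustration only.
TRANSITIONS = {
    "onboarding": {"start scan": "scanning"},
    "scanning": {"process scan": "processing", "rescan": "scanning"},
    "processing": {"generate repair program": "repair_program"},
    "repair_program": {},
}
```

In this sketch an out-of-order request such as "generate repair program" during onboarding is simply rejected, which is one plausible way a state-grounded constraint could keep execution repeatable across operators.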

2603.26612 2026-06-01 cs.RO Updated version

Meta-Adaptive Beam Search Planning for Transformer-Based Reinforcement Learning Control of UAVs with Overhead Manipulators under Flight Disturbances

基于Transformer强化学习的无人机搭载顶置机械臂在飞行扰动下的元自适应波束搜索规划

Hazim Alzorgan, Sayed Pedram Haeri Boroujeni, Abolfazl Razi

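AI summary: To address end-effector tracking error caused by coupling between a UAV and its overhead manipulator, proposes a reinforcement learning framework built on a Transformer-based double deep Q-network (DDQN), in which an adaptive beam-search planner uses the learned critic as a forward estimator for short-horizon, software-in-the-loop beam search, reducing tracking error and increasing reward.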

Comments The paper will be reworked significantly

Details
AI Chinese abstract

配备顶置机械臂的无人机为检查、维护和基于接触的交互提供了独特的能力。然而,无人机及其机械臂的运动紧密耦合,由风或控制不完善引起的微小姿态变化会使末端执行器偏离预定路径。这种耦合使得可靠跟踪变得困难,也限制了最初为固定基座机器人设计的学习型臂控制器的直接使用。在我们的测试中,每当无人机机体经历漂移或快速姿态修正时,这些效应都会一致出现。为了解决这一问题,我们开发了一个基于Transformer双深度Q网络(DDQN)的强化学习框架,其核心思想是使用自适应波束搜索规划器,该规划器利用学习到的评论家作为前向估计器,对候选控制序列进行短视域波束搜索。这使得控制器能够通过模拟推演来预测末端执行器的运动,而不是直接在实际模型上执行这些动作,实现了软件在环(SITL)方法。前瞻依赖于处理短状态序列的Transformer评论家提供的价值估计,而DDQN骨干网络则提供保持学习过程稳定所需的单步目标。在相同训练条件下对3自由度空中机械臂进行评估,所提出的元自适应规划器表现出最强的整体性能,奖励增加10.2%,平均跟踪误差大幅降低(从约6%降至3%),并且相对于DDQN基线,组合奖励-误差指标改善29.6%。当无人机基座因外部扰动出现漂移时,与固定波束和仅Transformer变体相比,我们的方法在跟踪目标尖端轨迹方面表现出更高的稳定性(保持5厘米跟踪误差)。

English abstract

Drones equipped with overhead manipulators offer unique capabilities for inspection, maintenance, and contact-based interaction. However, the motion of the drone and its manipulator is tightly coupled, and even small attitude changes caused by wind or control imperfections shift the end-effector away from its intended path. This coupling makes reliable tracking difficult and also limits the direct use of learning-based arm controllers that were originally designed for fixed-base robots. These effects appear consistently in our tests whenever the UAV body experiences drift or rapid attitude corrections. To address this behavior, we develop a reinforcement-learning (RL) framework built on a Transformer-based double deep Q-network (DDQN). Its core idea is an adaptive beam-search planner that runs a short-horizon beam search over candidate control sequences, using the learned critic as the forward estimator. This allows the controller to anticipate the end-effector's motion through simulated rollouts rather than executing those actions directly on the actual model, realizing a software-in-the-loop (SITL) approach. The lookahead relies on value estimates from a Transformer critic that processes short sequences of states, while a DDQN backbone provides the one-step targets needed to keep the learning process stable. Evaluated on a 3-DoF aerial manipulator under identical training conditions, the proposed meta-adaptive planner shows the strongest overall performance, with a 10.2% reward increase, a substantial reduction in mean tracking error (from about 6% to 3%), and a 29.6% improvement in the combined reward-error metric relative to the DDQN baseline. When the drone base drifts due to external disturbances, our method tracks the target tip trajectory more stably (maintaining a 5 cm tracking error) than the fixed-beam and Transformer-only variants.
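
A short-horizon beam search over candidate control sequences, scored by a critic, can be sketched as follows. This is a toy under stated assumptions: the 1-D dynamics and the hand-written critic are stand-ins for the paper's simulated rollouts and Transformer critic, the beam width is fixed rather than adaptive, and all names are hypothetical.

```python
def beam_search_first_action(state, actions, step, critic, horizon=3, beam_width=2):
    """Return the first action of the best-scoring sequence found by beam search.

    step(state, action) simulates one transition; critic(state) scores a state
    (higher is better). Nothing is executed on the real system.
    """
    beams = [(state, [])]
    for _ in range(horizon):
        candidates = []
        for s, seq in beams:
            for a in actions:
                nxt = step(s, a)
                candidates.append((critic(nxt), nxt, seq + [a]))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [(s, seq) for _, s, seq in candidates[:beam_width]]
    return beams[0][1][0]

# Toy stand-ins: move a 1-D end-effector position toward a target at 5.0.
def toy_step(position, action):
    return position + action

def toy_critic(position):
    return -abs(position - 5.0)
```

Calling `beam_search_first_action(0.0, (-1.0, 0.0, 1.0), toy_step, toy_critic)` selects the action that moves toward the target; only that first action would be applied before the search is repeated from the next state.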

2509.22550 2026-06-01 cs.RO Updated version

An Intention-driven Lane Change Framework Considering Heterogeneous Dynamic Cooperation in Mixed-traffic Environment

考虑混合交通中异构动态协作的意图驱动换道框架

Xiaoyun Qiu, Haichao Liu, Yue Pan, Jun Ma, Xinhu Zheng

Affiliations: Intelligent Transportation Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou); Guangdong Provincial Key Lab of Integrated Communication, Sensing and Computation for Ubiquitous Internet of Things; Robotics and Autonomous Systems Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes an intention-driven lane change framework that combines driving-style recognition with cooperation-aware decision-making and motion planning, using deep learning and inverse reinforcement learning for safe and efficient lane changes in mixed traffic.

Details
Journal ref
IEEE Transactions on Intelligent Transportation Systems, May, 2026
AI Chinese abstract

在混合交通环境中,自动驾驶车辆(AV)必须与异构的人类驾驶车辆(HV)交互,这些车辆的意图和驾驶风格因个体和场景而异。这种变异性给换道交互带来了不确定性,其中安全性和效率关键取决于准确预测周围驾驶员的协作反应。现有方法通常通过假设统一或固定的行为模式来过度简化这些交互。为了解决这一限制,我们提出了一种意图驱动的换道框架,该框架将驾驶风格识别与协作感知决策和运动规划相结合。一个基于深度学习的分类器实时识别不同的人类驾驶风格。然后,我们引入了一个双视角协作分数,由内在的基于风格的倾向和交互动态组件组成,从而实现可解释和自适应的意图预测及定量推断。一个决策模块结合了行为克隆(BC)和逆强化学习(IRL)来确定换道的可行性。随后,建立了一个协调的运动规划架构,将基于IRL的意图推断与模型预测控制(MPC)相结合,以生成无碰撞且符合社会规范的轨迹。在NGSIM数据集上的实验表明,所提出的决策模型优于代表性的基于规则和基于学习的基线,在换道分类中达到了96.98%的准确率。运动规划评估进一步证明了在混合交通环境中机动成功率和执行稳定性的提高。这些结果验证了结构化协作建模对于意图驱动的自主换道的有效性。

English abstract

In mixed-traffic environments, autonomous vehicles (AVs) must interact with heterogeneous human-driven vehicles (HVs) whose intentions and driving styles vary across individuals and scenarios. Such variability introduces uncertainty into lane change interactions, where safety and efficiency critically depend on accurately anticipating surrounding drivers' cooperative responses. Existing methods often oversimplify these interactions by assuming uniform or fixed behavioral patterns. To address this limitation, we propose an intention-driven lane change framework that integrates driving-style recognition with cooperation-aware decision-making and motion-planning. A deep learning-based classifier identifies distinct human driving styles in real time. We then introduce a dual-perspective cooperation score composed of intrinsic style-dependent tendencies and interactive dynamic components, enabling interpretable and adaptive intention prediction and quantitative inference. A decision-making module combines behavior cloning (BC) and inverse reinforcement learning (IRL) to determine lane change feasibility. Later, a coordinated motion-planning architecture integrating IRL-based intention inference with model predictive control (MPC) is established to generate collision-free and socially compliant trajectories. Experiments on the NGSIM dataset show that the proposed decision-making model outperforms representative rule-based and learning-based baselines, achieving 96.98% accuracy in lane change classification. Motion-planning evaluations further demonstrate improved maneuver success and execution stability in mixed-traffic environments. These results validate the effectiveness of structured cooperation modeling for intention-driven autonomous lane changes.

2603.11586 2026-06-01 cs.RO Updated version

Unsupervised LiDAR-Based Multi-UAV Detection and Tracking Under Extreme Sparsity

基于激光雷达的极端稀疏条件下多无人机无监督检测与跟踪

Nivand Khosravi, Rodrigo Ventura, Meysam Basiri

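Affiliations: Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal

AI summary: For the extremely sparse point clouds produced by non-repetitive solid-state LiDAR scanning, proposes an unsupervised detection and tracking pipeline that combines range-adaptive DBSCAN clustering with a temporal consistency check (0.891 precision and 0.804 recall in the best setup) and compares deterministic assignment with probabilistic data association for tracking.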

Comments Presented at the International Conference on Mechatronics and Robotics Engineering (ICMRE2026). To appear in IEEE conference proceedings

Details
Journal ref
Proc. 2026 12th International Conference on Mechatronics and Robotics Engineering (ICMRE), Oldenburg, Germany, 2026
AI Chinese abstract

非重复固态激光雷达扫描导致对空中无人机检测的极端稀疏测量:一个10-25米的小型四旋翼通常每次扫描仅产生1-2个回波,远低于大多数现有检测方法假设的点密度,且不足以进行稳健的多目标数据关联。我们提出了一种无监督、仅依赖激光雷达的流水线,无需标注训练数据即可处理检测和跟踪。检测器将距离自适应DBSCAN聚类与三阶段时序一致性检验相结合,并在真实空对空飞行数据上以八种不同参数配置进行基准测试。最佳设置达到0.891精度、0.804召回率和0.63米均方根误差,系统性的minPts扫描验证了大多数扫描最多包含1-2个目标点,直接量化了稀疏程度。对于多目标跟踪,我们在四种具有递增模糊程度的模拟场景中,比较了确定性匈牙利分配与联合概率数据关联(JPDA),每种均与交互多模型滤波耦合。JPDA将身份切换减少了64%,而对MOTA影响可忽略,表明当无人机轨迹彼此接近时概率关联具有优势。结合真实世界检测与RTK-GPS真值以及基于模拟的跟踪与身份标注真值的双环境评估策略,克服了在无人机间距低于2米时仅依赖GNSS评估的局限性。

英文摘要

Non-repetitive solid-state LiDAR scanning leads to an extremely sparse measurement regime for detecting airborne UAVs: a small quadrotor at 10-25 m typically produces only 1-2 returns per scan, which is far below the point densities assumed by most existing detection approaches and inadequate for robust multi-target data association. We introduce an unsupervised, LiDAR-only pipeline that addresses both detection and tracking without the need for labeled training data. The detector integrates range-adaptive DBSCAN clustering with a three-stage temporal consistency check and is benchmarked on real-world air-to-air flight data under eight different parameter configurations. The best setup attains 0.891 precision, 0.804 recall, and 0.63 m RMSE, and a systematic minPts sweep verifies that most scans contain at most 1-2 target points, directly quantifying the sparsity regime. For multi-target tracking, we compare deterministic Hungarian assignment with joint probabilistic data association (JPDA), each coupled with Interacting Multiple Model filtering, in four simulated scenarios with increasing levels of ambiguity. JPDA cuts identity switches by 64% with negligible impact on MOTA, demonstrating that probabilistic association is advantageous when UAV trajectories approach one another closely. A two-environment evaluation strategy, combining real-world detection with RTK-GPS ground truth and simulation-based tracking with identity-annotated ground truth, overcomes the limitations of GNSS-only evaluation at inter-UAV distances below 2 m.

2602.23280 2026-06-01 cs.LG cs.RO 版本更新

Mollified Value Learning

Hrishikesh Viswanath, Juanwu Lu, S. Talha Bukhari, Mihir Chauhan, Damon Conover, Ziran Wang, Aniket Bera

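Affiliations: Department of Computer Science, Purdue University, USA; College of Engineering, Purdue University, USA; DEVCOM Army Research Laboratory, USA

AI summary: To address the difficulty of value-function estimation in offline goal-conditioned reinforcement learning, the paper proposes Mollified Value Learning (MVL), which induces distance-like value geometry by aggregating constraints over a spatial measure rather than imposing pointwise differential constraints, and improves goal-reaching performance on navigation and high-dimensional robotic manipulation tasks.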

详情
AI中文摘要

离线目标条件强化学习(GCRL)从静态数据集中学习达到目标的行为,但在有限的状态-动作覆盖下,准确的值估计仍然具有挑战性。现有的物理信息方法通过施加由Hamilton-Jacobi-Bellman(HJB)最优性原理导出的逐点距离类几何约束(通常通过一阶偏微分方程如Eikonal方程)来解决这一问题。然而,通过显式微分结构强制局部一致性在复杂高维环境中可能变得不稳定。我们的关键洞察是,将距离类约束重新解释为局部空间测度上的期望。通过在该测度上聚合约束而非逐点评估,目标函数充当空间平滑器(mollifier),在无需昂贵微分算子的情况下诱导出距离类值几何。我们称之为Mollified Value Learning(MVL)。在导航和高维机器人操作任务上的实验表明,当与隐式值表示学习方法结合使用时,MVL学习到结构化的值表示,提高了目标达成性能。开源代码可在https://github.com/HrishikeshVish/MVL获取。

英文摘要

Offline goal-conditioned reinforcement learning (GCRL) learns goal-reaching behaviors from static datasets, but accurate value estimation remains challenging under limited state-action coverage. Existing physics-informed approaches address this by imposing pointwise distance-like geometric constraints derived from Hamilton-Jacobi-Bellman (HJB) optimality principles, often through first-order partial differential equations such as the Eikonal equation. However, enforcing local consistency through explicit differential structure can become unstable in complex, high-dimensional environments. Our key insight is to instead reinterpret distance-like constraints as an expectation over a local spatial measure. By aggregating constraints over this measure rather than evaluating them pointwise, the objective acts as a spatial mollifier, inducing distance-like value geometry without requiring expensive differential operators. We refer to this as Mollified Value Learning (MVL). Experiments across navigation and high-dimensional robotic manipulation tasks show that, when used with implicit value representation learning methods, MVL learns structured value representations and improves goal-reaching performance. Open-source code is available at https://github.com/HrishikeshVish/MVL.

2602.21013 2026-06-01 cs.RO 版本更新

Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

笔记到自我:带草稿本的增强型VLA用于依赖记忆的操作任务

Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic

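Affiliations: Qualcomm AI Research

AI summary: The paper equips vision-language-action models with spatial and temporal memory by adding a language scratchpad, improving their generalization on memory-dependent long-horizon manipulation tasks.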

Comments To appear at ICRA 2026

详情
AI中文摘要

许多灵巧操作任务本质上是非马尔可夫的,但在最近视觉-语言-动作(VLA)范式的热潮中,这一点很少受到关注。尽管VLA成功地将互联网规模的语义理解引入机器人领域,但现有的VLA主要是“无状态的”,并且在依赖记忆的长时域任务中表现不佳。在这项工作中,我们探索了一种通过引入语言草稿本来赋予VLA空间和时间记忆的方法。草稿本使得记忆任务特定信息(如物体位置)成为可能,并且允许模型跟踪计划以及在该计划中朝着子目标的进展。我们在ClevrSkills环境中的一组依赖记忆的任务、MemoryBench以及一个具有挑战性的真实世界拾取和放置任务上评估了这种方法。我们表明,对于非递归和递归模型,引入语言草稿本显著提高了这些任务的泛化能力。

英文摘要

Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily "stateless" and struggle with memory-dependent long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.

2602.15018 2026-06-01 cs.RO cs.CV 版本更新

Neurosim: A Fast Simulator for Neuromorphic Robot Perception

Neurosim: 一种用于神经形态机器人感知的快速模拟器

Richeek Das, Pratik Chaudhari

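Affiliations: GRASP Laboratory, University of Pennsylvania

AI summary: The paper presents the Neurosim and Cortex libraries, which support training and closed-loop testing of neuromorphic perception and control algorithms through high-speed sensor simulation and low-latency communication.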

Comments 11 pages, 6 figures

详情
AI中文摘要

Neurosim是一个快速、实时、高性能的库,用于模拟动态视觉传感器、RGB相机、深度传感器和惯性传感器等传感器。它还可以模拟复杂动态环境中多旋翼飞行器的敏捷动力学。Neurosim在桌面GPU上可实现高达约2700 FPS的帧率。Neurosim与一个基于ZeroMQ的通信库Cortex集成,以促进与机器学习和机器人工作流的无缝集成。Cortex为Python和C++应用程序提供了一个高吞吐量、低延迟的消息传递系统,原生支持NumPy数组和PyTorch张量。本文讨论了Neurosim和Cortex的设计理念。它展示了如何利用它们来(i)训练神经形态感知和控制算法,例如,在时间同步的多模态数据上使用自监督学习,以及(ii)在闭环中测试这些算法的实时实现。Neurosim和Cortex可在https://github.com/grasp-lyrl/neurosim获取。

英文摘要

Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed loop. Neurosim and Cortex are available at https://github.com/grasp-lyrl/neurosim.

2602.12686 2026-06-01 cs.RO 版本更新

SignScene: Visual Sign Grounding for Mapless Navigation

SignScene: 用于无地图导航的视觉标志接地

Nicky Zimmerman, Joel Loo, Benjamin Koh, Zishuo Wang, David Hsu

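Affiliations: Smart Systems Institute, National University of Singapore, 3 Research Link, 117602, Singapore; School of Computing, National University of Singapore, 13 Computing Drive, 117417, Singapore

AI summary: The paper proposes SignScene, a sign-centric spatial-semantic representation that uses vision-language models to map the semantic instructions on signs to scene elements and navigational actions, enabling mapless navigation and reaching 88% grounding accuracy on 114 queries.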

Comments Under review for a conference

详情
AI中文摘要

导航标志使人类能够在没有地图的情况下导航陌生环境。本文研究机器人如何类似地利用标志在开放世界中进行无地图导航。一个核心挑战在于解读标志:现实世界的标志多样且复杂,其抽象语义内容需要与局部3D场景对应。我们将此形式化为标志接地问题,即将标志上的语义指令映射到相应的场景元素和导航动作。最近的视觉语言模型(VLM)具备完成此任务所需的语义常识和推理能力,但对空间信息的表示方式敏感。我们提出SignScene,一种以标志为中心的空间语义表示,捕获与导航相关的场景元素和标志信息,并以有利于有效推理的形式呈现给VLM。我们在涵盖九种不同环境类型的114个查询数据集上评估了我们的接地方法,实现了88%的接地准确率,显著优于基线。最后,我们证明该方法使Spot机器人仅使用标志即可在现实世界中进行无地图导航。

英文摘要

Navigational signs enable humans to navigate unfamiliar environments without maps. This work studies how robots can similarly exploit signs for mapless navigation in the open world. A central challenge lies in interpreting signs: real-world signs are diverse and complex, and their abstract semantic contents need to be grounded in the local 3D scene. We formalize this as sign grounding, the problem of mapping semantic instructions on signs to corresponding scene elements and navigational actions. Recent Vision-Language Models (VLMs) offer the semantic common-sense and reasoning capabilities required for this task, but are sensitive to how spatial information is represented. We propose SignScene, a sign-centric spatial-semantic representation that captures navigation-relevant scene elements and sign information, and presents them to VLMs in a form conducive to effective reasoning. We evaluate our grounding approach on a dataset of 114 queries collected across nine diverse environment types, achieving 88% grounding accuracy and significantly outperforming baselines. Finally, we demonstrate that it enables real-world mapless navigation on a Spot robot using only signs.

2602.03639 2026-06-01 cs.RO 版本更新

Variance-Reduced Model Predictive Path Integral via Quadratic Model Approximation

基于二次模型近似的方差缩减模型预测路径积分

Fabian Schramm, Franki Nguimatsia Tiofack, Nicolas Perrin-Gilbert, Marc Toussaint, Justin Carpentier

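Affiliations: Inria and DI-ENS, PSL Research University; Sorbonne University; TU Berlin

AI summary: The paper proposes a hybrid variance-reduced MPPI framework that decomposes the objective function into a known approximate model and a residual term and uses a quadratic approximation to derive a closed-form prior, reducing variance and improving sample efficiency, with faster convergence and better performance across several tasks.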

Comments Accepted to Robotics: Science and Systems (RSS) 2026, Sydney, Australia

详情
AI中文摘要

基于采样的控制器,如模型预测路径积分(MPPI)方法,提供了很大的灵活性,但常常遭受高方差和低样本效率的问题。为了解决这些挑战,我们引入了一种混合方差缩减MPPI框架,该框架将先验模型整合到采样过程中。我们的关键见解是将目标函数分解为已知的近似模型和一个残差项。由于残差仅捕捉模型与目标之间的差异,它通常比原始目标具有更小的幅度和更低的方差。尽管这一原理适用于一般的建模选择,但我们证明采用二次近似能够推导出一个闭式的、模型引导的先验,该先验有效地将样本集中在信息丰富的区域。关键的是,该框架对几何信息的来源是不可知的,允许二次模型从精确导数、结构近似(例如高斯或拟牛顿)或无梯度的随机平滑中构建。我们在标准优化基准、一个非线性欠驱动小车-杆控制任务以及一个具有非光滑动力学的接触丰富操作问题上验证了该方法。在这些领域中,与标准MPPI相比,我们在低样本情况下实现了更快的收敛和更优的性能。这些结果表明,该方法可以在获取样本昂贵或受限的情况下,使基于样本的控制策略更加实用。

英文摘要

Sampling-based controllers, such as Model Predictive Path Integral (MPPI) methods, offer substantial flexibility but often suffer from high variance and low sample efficiency. To address these challenges, we introduce a hybrid variance-reduced MPPI framework that integrates a prior model into the sampling process. Our key insight is to decompose the objective function into a known approximate model and a residual term. Since the residual captures only the discrepancy between the model and the objective, it typically exhibits a smaller magnitude and lower variance than the original objective. Although this principle applies to general modeling choices, we demonstrate that adopting a quadratic approximation enables the derivation of a closed-form, model-guided prior that effectively concentrates samples in informative regions. Crucially, the framework is agnostic to the source of geometric information, allowing the quadratic model to be constructed from exact derivatives, structural approximations (e.g., Gauss- or Quasi-Newton), or gradient-free randomized smoothing. We validate the approach on standard optimization benchmarks, a nonlinear, underactuated cart-pole control task, and a contact-rich manipulation problem with non-smooth dynamics. Across these domains, we achieve faster convergence and superior performance in low-sample regimes compared to standard MPPI. These results suggest that the method can make sample-based control strategies more practical in scenarios where obtaining samples is expensive or limited.
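The decomposition described in this abstract can be illustrated with a one-dimensional sketch. It assumes a Gaussian sampling distribution and the usual MPPI softmin weights; the function names, the scalar control, and the toy objective are hypothetical choices for illustration and are not taken from the paper. Because the product of a Gaussian and an exponentiated quadratic is again Gaussian, the quadratic part of the objective can be folded into the sampler in closed form, leaving only the residual to be weighted.

```python
import math
import random


def quadratic_guided_prior(mu, sigma, h, c, lam):
    """Fold exp(-q(u)/lam), q(u) = 0.5*h*(u - c)**2, into the N(mu, sigma^2) sampler.

    Requires 1/sigma^2 + h/lam > 0 so that the result is a valid Gaussian.
    Returns the mean and standard deviation of the guided sampling distribution.
    """
    precision = 1.0 / sigma**2 + h / lam
    mean = (mu / sigma**2 + h * c / lam) / precision
    return mean, math.sqrt(1.0 / precision)


def softmin_average(samples, costs, lam):
    """MPPI-style weighted mean with weights proportional to exp(-cost/lam)."""
    base = min(costs)  # subtracting the minimum keeps exp() numerically stable
    weights = [math.exp(-(cost - base) / lam) for cost in costs]
    return sum(w * u for w, u in zip(weights, samples)) / sum(weights)


def standard_update(objective, mu, sigma, lam, n, rng):
    """Plain update: sample from the nominal Gaussian, weight by the full objective."""
    samples = [rng.gauss(mu, sigma) for _ in range(n)]
    return softmin_average(samples, [objective(u) for u in samples], lam)


def guided_update(residual, mu, sigma, h, c, lam, n, rng):
    """Guided update: sample from the closed-form prior, weight by the residual only."""
    mean, std = quadratic_guided_prior(mu, sigma, h, c, lam)
    samples = [rng.gauss(mean, std) for _ in range(n)]
    return softmin_average(samples, [residual(u) for u in samples], lam)


# Toy objective (an assumption for this sketch): a quadratic model plus a small residual.
H, C = 1.0, 2.0


def residual(u):
    return 0.1 * math.sin(3.0 * u)


def objective(u):
    return 0.5 * H * (u - C) ** 2 + residual(u)


rng = random.Random(0)
plain = standard_update(objective, 0.0, 1.0, 1.0, 5000, rng)
guided = guided_update(residual, 0.0, 1.0, H, C, 1.0, 5000, rng)
```

Both updates estimate the same weighted mean. In the guided version the weights depend only on the small residual, so they vary far less across samples, which is the variance-reduction effect the abstract refers to.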

2602.02459 2026-06-01 cs.RO 版本更新

TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments

TIC-VLA:一种用于动态环境中机器人导航的思考控制视觉-语言-动作模型

Zhiyu Huang, Yun Zhang, Johnson Liu, Rui Song, Chen Tang, Jiaqi Ma

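Affiliations: University of California, Los Angeles

AI summary: The paper proposes the TIC-VLA model, which explicitly models reasoning latency, introduces a delayed semantic-control interface, and pairs it with a latency-consistent training pipeline to handle the asynchrony between reasoning and real-time control in vision-language-action models for dynamic environments; it outperforms prior models in simulation and on a real robot.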

Comments International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

在动态、以人为中心的环境中,机器人必须遵循语言指令同时保持实时反应控制。视觉-语言-动作(VLA)模型提供了一个有前景的框架,但它们假设时间对齐的推理和控制,尽管语义推理相对于实时动作固有地存在延迟。我们提出了Think-in-Control(TIC)-VLA,一种延迟感知框架,在动作生成过程中显式建模延迟的语义推理。TIC-VLA定义了一个延迟语义-控制接口,该接口除了当前观测外,还基于延迟的视觉-语言语义状态和显式延迟元数据来条件化动作生成,使策略能够补偿异步推理。我们进一步提出了一种延迟一致的训练流程,在模仿学习和在线强化学习期间注入推理推理延迟,使训练与异步部署对齐。为了支持现实评估,我们提出了DynaNav,一个物理精确、照片级真实的仿真套件,用于动态环境中的语言引导导航。在仿真和真实机器人上的大量实验表明,TIC-VLA在多秒推理延迟下始终优于先前的VLA模型,同时保持鲁棒的实时控制。项目网站:https://ucla-mobility.github.io/TIC-VLA/

英文摘要

Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Project website: https://ucla-mobility.github.io/TIC-VLA/

2602.02220 2026-06-01 cs.CV cs.RO 版本更新

LangMap: A Human-Verified Benchmark for Hierarchical Open-Vocabulary Goal Navigation

LangMap:一个用于分层开放词汇目标导航的人工验证基准

Bo Miao, Weijia Liu, Jun Luo, Lachlan Shinnick, Jian Liu, Thomas Hamilton-Smith, Yuhe Yang, Zijie Wu, Vanja Videnovic, Feras Dayoub, Anton van den Hengel

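Affiliations: AIML, Adelaide University; East China Normal University; NERC-RVC, Hunan University; The University of Western Australia; Breaker Industries

AI summary: To address the shortcomings of existing benchmarks for hierarchical semantic goal navigation, the paper proposes the LangMap benchmark, which uses human-verified semantic annotations and a contrastive annotation protocol to support goal navigation tasks at four levels (scene, room, region, and instance), and introduces the PlaNaVid baseline.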

详情
AI中文摘要

语言条件目标导航(LGN)要求智能体在没有逐步指导的情况下定位用户指定的目标。然而,现有基准主要关注类别级目标或依赖视觉语言模型(VLM)生成的实例描述,这些描述通常包含歧义和语义错误,限制了系统性和可靠的评估。我们提出了HieraNav,一个开放词汇的LGN任务,目标在四个分层语义层级上指定:场景、房间、区域和实例。为此,我们提出了Language as a Map(LangMap),据我们所知,这是第一个具有人工验证语义标注的真实世界3D室内导航基准,支持所有四个目标层级的任务。LangMap提供了区域标签以及覆盖414个对象类别的区分性区域和实例描述,通过比较同一场景区域和实例的严格对比注释协议生成,包含超过18K个任务。每个目标都配有简洁和详细的描述,支持跨指令风格的评估。定量和定性分析验证了我们的注释质量;值得注意的是,我们的实例描述在文本到视图匹配上比GOAT-Bench注释高出23个百分点。我们进一步引入了PlaNaVid,一个强大的仅RGB基线,它将有界多样记忆(BDM)与高级规划相结合,以激发用于多目标导航的反应策略。PlaNaVid在没有深度、3D场景表示或对象掩码的情况下实现了顶级成功率。进一步分析表明,记忆和更丰富的上下文提升了性能,而长尾类别、小物体、远距离目标和多目标完成仍然是开放的挑战。该基准可在https://bo-miao.github.io/LangMap获取。

英文摘要

Language-conditioned goal navigation (LGN) requires agents to locate user-specified targets without step-by-step guidance. However, existing benchmarks largely focus on category-level goals or rely on instance descriptions generated by vision-language models (VLMs), which often contain ambiguities and semantic errors, limiting systematic and reliable evaluation. We introduce HieraNav, an open-vocabulary LGN task with goals specified at four hierarchical semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), to our knowledge the first real-world 3D indoor navigation benchmark with human-verified semantic annotations to support tasks across all four goal levels. LangMap provides region labels and discriminative region and instance descriptions covering 414 object categories, produced through a rigorous contrastive annotation protocol comparing same-scene regions and instances, and contains over 18K tasks. Each target is paired with concise and detailed descriptions, enabling evaluation across instruction styles. Quantitative and qualitative analyses validate our annotation quality; notably, our instance descriptions outperform GOAT-Bench annotations by 23 percentage points in text-to-view matching. We further introduce PlaNaVid, a strong RGB-only baseline that combines Bounded Diverse Memory (BDM) with high-level planning to prime a reactive policy for multi-goal navigation. PlaNaVid achieves top-tier success rates without depth, 3D scene representations, or object masks. Further analysis shows that memory and richer context boost performance, while long-tailed categories, small objects, distant targets, and multi-goal completion remain open challenges. The benchmark is available at https://bo-miao.github.io/LangMap

2601.18537 2026-06-01 cs.RO cs.AI 版本更新

SKETCH: Semantic Key-Point Conditioning for Long-Horizon Vessel Trajectory Prediction

SKETCH: 面向长时域船舶轨迹预测的语义关键点条件建模

Linyong Gan, Zimo Li, Wenxin Xu, Xingjian Li, Jianhua Z. Huang, Enmei Tu, Shuhang Chen

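Affiliations: School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China; School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China; COSCO SHIPPING Advanced Technology Institute, Shanghai, China

AI summary: To address directional drift in long-horizon trajectory prediction, the paper proposes a trajectory modeling framework conditioned on a semantic Next Key Point (NKP) that decomposes prediction into global semantic decision-making and local motion modeling and estimates the NKP prior with a pretrain-finetune strategy; on real-world AIS data it markedly improves performance for long horizons, directional accuracy, and fine-grained prediction.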

详情
AI中文摘要

由于复杂导航行为和环境因素导致的复合不确定性,准确的长时域船舶轨迹预测仍然具有挑战性。现有方法在长时间外推时往往难以保持全局方向一致性,导致轨迹漂移或不合理。为解决这一问题,我们提出了一种语义关键点条件轨迹建模框架,通过以捕获导航意图的高级下一关键点(NKP)为条件来预测未来轨迹。该公式将长时域预测分解为全局语义决策和局部运动建模,有效将未来轨迹的支持集限制在语义可行的子集内。为了从历史观测中高效估计NKP先验,我们采用了预训练-微调策略。在真实AIS数据上的大量实验表明,所提方法在长旅行时长、方向精度和细粒度轨迹预测方面持续优于现有最先进方法。

英文摘要

Accurate long-horizon vessel trajectory prediction remains challenging due to compounded uncertainty from complex navigation behaviors and environmental factors. Existing methods often struggle to maintain global directional consistency, leading to drifting or implausible trajectories when extrapolated over long time horizons. To address this issue, we propose a semantic-key-point-conditioned trajectory modeling framework, in which future trajectories are predicted by conditioning on a high-level Next Key Point (NKP) that captures navigational intent. This formulation decomposes long-horizon prediction into global semantic decision-making and local motion modeling, effectively restricting the support of future trajectories to semantically feasible subsets. To efficiently estimate the NKP prior from historical observations, we adopt a pretrain-finetune strategy. Extensive experiments on real-world AIS data demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly for long travel durations, directional accuracy, and fine-grained trajectory prediction.

2411.04073 2026-06-01 cs.RO cs.CC cs.MA 版本更新

A Two-Stage Reactive Auction Framework for the Multi-Depot Rural Postman Problem with Dynamic Vehicle Failures

面向动态车辆故障的多仓库农村邮差问题的两阶段反应式拍卖框架

Eashwar Sathyamurthy, Jeffrey W. Herrmann, Shapour Azarm

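Affiliations: Department of Mechanical Engineering, University of Maryland; Department of Mechanical Engineering, The Catholic University of America

AI summary: For task interruptions caused by vehicle failures in the multi-depot rural postman problem, the paper proposes a two-stage real-time rescheduling framework that combines a centralized auction with a peer auction, cutting rescheduling time from hours to seconds while maintaining solution quality.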

详情
AI中文摘要

尽管无人车车队在运输、物流和巡检中提供了效率,但它们对故障的敏感性对任务连续性构成了重大挑战。我们研究了带有可充电和可重复使用车辆的多仓库农村邮差问题(MD-RPP-RRV),其中放置在多个仓库、具有容量约束的无人充电车辆在为基于弧的需求服务时可能发生故障。为了解决运行中意外的车辆故障,我们提出了一种两阶段实时重调度框架。首先,集中式拍卖快速生成可行的重调度方案;对于此阶段,我们推导了一个理论加性界,为最坏情况下的重调度惩罚提供了分析保证。其次,对等拍卖通过一个针对问题的磁场路由器对局部调度进行修复,以细化基线方案,该路由器利用通过敏感性分析校准的参数,确保计算增长可控。我们将该方法与模拟退火元启发式算法进行基准比较,以评估解质量和执行速度。在257个不同故障场景上的实验结果表明,与元启发式基线相比,该框架实现了平均运行时间减少超过95%,将重调度时间从小时级缩短到秒级,同时保持高质量的解。两阶段框架在大规模实例上表现出色,在近80%的场景中优于集中式拍卖,平均解改进超过12%。此外,它在59%和28%的场景中分别优于模拟退火的平均结果和最佳结果,为实时任务连续性提供了所需的鲁棒速度-质量权衡。

英文摘要

Although unmanned vehicle fleets offer efficiency in transportation, logistics and inspection, their susceptibility to failures poses a significant challenge to mission continuity. We study the Multi-Depot Rural Postman Problem with Rechargeable and Reusable Vehicles (MD-RPP-RRV) with vehicle failures, where unmanned rechargeable vehicles placed at multiple depots with capacity constraints may fail while serving arc-based demands. To address unexpected vehicle breakdowns during operation, we propose a two-stage real-time rescheduling framework. First, a centralized auction quickly generates a feasible rescheduling solution; for this stage, we derive a theoretical additive bound that establishes an analytical guarantee on the worst-case rescheduling penalty. Second, a peer auction refines this baseline through a problem-specific magnetic field router for local schedule repair, utilizing parameters calibrated via sensitivity analysis to ensure controlled computational growth. We benchmark this approach against a simulated annealing metaheuristic to evaluate solution quality and execution speed. Experimental results on 257 diverse failure scenarios demonstrate that the framework achieves an average runtime reduction of over 95% relative to the metaheuristic baseline, cutting rescheduling times from hours to seconds while maintaining high solution quality. The two-stage framework excels on large-scale instances, surpassing the centralized auction in nearly 80% of scenarios with an average solution improvement exceeding 12%. Moreover, it outperforms the simulated annealing mean and best results in 59% and 28% of scenarios, respectively, offering the robust speed-quality trade-off required for real-time mission continuity.

2512.11571 2026-06-01 cs.RO 版本更新

Cross-Entropy Optimization of Physically Grounded Task and Motion Plans

物理基础的任务与运动规划的交叉熵优化

Andreu Matoses Gimenez, Nils Wilde, Chris Pek, Javier Alonso-Mora

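Affiliations: Department of Cognitive Robotics, Delft University of Technology; Faculty of Computer Science, Dalhousie University

AI summary: The paper proposes using a GPU-parallelized physics simulator and cross-entropy optimization to sample controller parameters and obtain low-cost solutions, addressing the tendency of traditional TAMP algorithms to ignore dynamics and contacts.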

Comments Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

详情
Journal ref
IEEE Robotics and Automation Letters, 2026
AI中文摘要

自主执行任务通常需要机器人规划高级离散动作和连续低级运动来实现它们。先前的TAMP算法主要关注计算性能、完备性或最优性,通过简化和抽象使问题易于处理。然而,这可能导致生成的计划在需要操作物体时,未能考虑可靠执行任务所必需的动力学或复杂接触。此外,忽略低级控制器影响的方法可能无法为真实系统获得最优或可行的计划实现。我们研究使用GPU并行物理模拟器来计算带有运动控制器的计划实现,明确考虑动力学,并考虑与环境的接触。通过交叉熵优化,我们对控制器或动作的参数进行采样,以获得低成本解决方案。由于我们的方法使用与真实系统相同的控制器,机器人可以直接执行计算出的计划。我们在一组任务中展示了我们的方法,其中机器人能够利用环境的几何形状来移动物体。网站和代码:https://andreumatoses.github.io/research/parallel-realization

英文摘要

Autonomously performing tasks often requires robots to plan high-level discrete actions and continuous low-level motions to realize them. Previous TAMP algorithms have focused mainly on computational performance, completeness, or optimality by making the problem tractable through simplifications and abstractions. However, this comes at the cost of the resulting plans potentially failing to account for the dynamics or complex contacts necessary to reliably perform the task when object manipulation is required. Additionally, approaches that ignore effects of the low-level controllers may not obtain optimal or feasible plan realizations for the real system. We investigate the use of a GPU-parallelized physics simulator to compute realizations of plans with motion controllers, explicitly accounting for dynamics, and considering contacts with the environment. Using cross-entropy optimization, we sample the parameters of the controllers, or actions, to obtain low-cost solutions. Since our approach uses the same controllers as the real system, the robot can directly execute the computed plans. We demonstrate our approach for a set of tasks where the robot is able to exploit the environment's geometry to move an object. Website and code: https://andreumatoses.github.io/research/parallel-realization
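As a rough illustration of the cross-entropy optimization step mentioned in this abstract, the sketch below runs a generic diagonal-Gaussian cross-entropy method over a small parameter vector. It is a minimal textbook version, not the authors' implementation: `toy_cost` is a hypothetical stand-in for the cost of a simulated plan rollout, and the sample counts and parameter names are assumptions.

```python
import random
import statistics


def cross_entropy_minimize(cost, mean, std, n_samples=64, n_elite=8, n_iters=30, seed=0):
    """Minimize cost(params) with a diagonal-Gaussian cross-entropy method."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    for _ in range(n_iters):
        # Sample candidate parameter vectors around the current distribution.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(n_samples)]
        # Keep the lowest-cost candidates (the elite set).
        elites = sorted(samples, key=cost)[:n_elite]
        # Refit the Gaussian to the elites; the floor stops std from collapsing to zero.
        columns = list(zip(*elites))
        mean = [statistics.fmean(col) for col in columns]
        std = [max(statistics.pstdev(col), 1e-6) for col in columns]
    return mean


def toy_cost(params):
    # Hypothetical stand-in for a rollout cost returned by a physics simulator.
    gain, offset = params
    return (gain - 2.0) ** 2 + (offset + 1.0) ** 2


best = cross_entropy_minimize(toy_cost, mean=[0.0, 0.0], std=[2.0, 2.0])
```

In the setting the abstract describes, each cost evaluation would be a simulated rollout, so the samples within one iteration can be evaluated in parallel; the loop above evaluates them sequentially for simplicity.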

2511.19433 2026-06-01 cs.RO cs.AI cs.CV 版本更新

Mixture of Horizons in Action Chunking

动作分块中的视野混合

Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, Mingyu Ding

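Affiliations: Renmin University of China; University of North Carolina at Chapel Hill; The Chinese University of Hong Kong

AI summary: To address the trade-off in action chunk length (horizon) in vision-language-action models, the paper proposes a mixture-of-horizons strategy that processes action segments with different horizons in parallel and fuses their outputs, improving both long-term foresight and short-term precision as well as performance and generalization.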

Comments Accepted at ICML 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现出显著能力,但其性能对训练中使用的$\textbf{动作分块长度}$(称为$\textbf{视野}$)敏感。我们的实证研究揭示了一个内在权衡:较长的视野提供更强的全局预见但降低细粒度精度,而较短的视野增强局部控制但在长期任务上表现不佳,这意味着固定选择单一视野是次优的。为缓解这一权衡,我们提出$\textbf{混合视野(MoH)}$策略。MoH将动作分块重新排列为多个不同视野的片段,通过共享动作变换器并行处理,并使用轻量线性门控融合输出。它具有三个吸引人的优点:1) MoH在单个模型中联合利用长期预见和短期精度,提高了复杂任务的性能和泛化能力。2) MoH对全注意力动作模块即插即用,训练或推理开销极小。3) MoH支持自适应视野的动态推理,通过跨视野共识选择稳定动作,实现比基线高2.5倍的吞吐量,同时保持优越性能。在基于流的策略$π_0$、$π_{0.5}$和单步回归策略$π_{\text{reg}}$上的大量实验表明,MoH在仿真和真实世界任务上均取得一致且显著的提升。值得注意的是,在混合任务设置下,带有MoH的$π_{0.5}$在LIBERO上仅经过$30k$次训练迭代即达到99$\%$的平均成功率,创下新纪录。项目页面:https://timsty1.github.io/moh/

英文摘要

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed the $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that any fixed choice of a single horizon is suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulated and real-world tasks. Notably, under the mixed-task setting, $π_{0.5}$ with MoH reaches a new state of the art with a 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://timsty1.github.io/moh/

2510.17111 2026-06-01 cs.RO cs.LG 版本更新

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

面向具身操作的高效视觉-语言-动作模型:系统综述

Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng

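Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; AiRiA; Nanjing University of Information Science and Technology

AI summary: This survey systematically reviews methods for reducing the latency, memory footprint, and computational cost of vision-language-action models along four dimensions: model architecture, perception features, action generation, and training/inference strategies.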

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过将自然语言指令和视觉观察映射到机器人动作,将视觉-语言模型扩展到具身控制。尽管功能强大,但VLA系统因其巨大的计算和内存需求而面临重大挑战,这与需要实时性能的边缘平台(如机载移动操作器)的约束相冲突。解决这一矛盾已成为近期研究的核心焦点。鉴于对更高效、可扩展的VLA系统的日益关注,本综述系统回顾了提高VLA效率的方法,重点在于减少延迟、内存占用以及训练和推理成本。我们将现有解决方案分为四个维度:模型架构、感知特征、动作生成和训练/推理策略,并总结了每个类别中的代表性技术。最后,我们讨论了未来趋势和开放挑战,指出了推进高效具身智能的方向。

英文摘要

Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.

2509.19452 2026-06-01 cs.RO cs.CV cs.LG 版本更新

HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

HUNT:通过瞬时相对帧在非结构化环境中进行高速无人机导航与跟踪

Alessandro Saviolo, Jeffrey Mao, Giuseppe Loianno

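Affiliations: New York University; University of California, Berkeley

AI summary: The paper proposes the HUNT framework, which uses instantaneous relative frames to unify search and tracking, enabling high-speed flight and robust autonomy.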

详情
AI中文摘要

搜索与救援任务要求无人机既能高速穿越未知的非结构化环境,又能在检测到目标后跟踪目标。在感知退化且无全局定位的情况下实现这两种能力仍是一个开放挑战。最近的相对导航工作通过将规划和控制锚定到可见的检测目标上展示了鲁棒跟踪,但在视野中没有目标时无法进行导航。我们提出了HUNT(高速无人机导航与跟踪),一个实时框架,在单一相对公式中统一了穿越、获取和跟踪。HUNT直接从机载瞬时观测量(如姿态、高度和速度)定义导航目标,从而在搜索过程中实现反应式高速飞行。一旦检测到目标,相同的感知-控制管道无缝过渡到跟踪。在茂密森林、集装箱场地以及使用车辆和人体模型的搜索与救援任务中的户外实验表明,在全局方法失败的情况下,该框架实现了鲁棒自主性。

英文摘要

Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

2503.21168 2026-06-01 cs.RO cs.SY eess.SY 版本更新

TAGA: A Tangent-Based Reactive Approach for Socially Compliant Robot Navigation Around Human Groups

TAGA:一种基于切线的反应式方法,用于在人群体周围实现社交合规的机器人导航

Utsha Kumar Roy, Sejuti Rahman

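Affiliations: Department of Computer Science and Engineering, BRAC University; New Uzbekistan University

AI summary: The paper proposes TAGA, which steers around detected group boundaries via tangent paths and coordinates group-level and individual obstacle avoidance, introduces the Group Crossing Rate (GCR) metric, and shows an asymmetry in its effect between reactive and learned methods under several crowd dynamics models.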

Comments 8 pages, 3 figures, 3 tables. Submitted to IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

机器人在有人群的环境中导航时,必须避免碰撞并尊重人群的社会结构,特别是社会群体的隐含边界。大多数导航方法将人类建模为独立个体,即使无碰撞也会导致社交干扰行为。本文提出TAGA(群体避障的切线动作),通过切线路径机动检测群体边界,无需修改底层导航策略。一个分层安全控制器协调群体级避障与个体碰撞预防。我们提出群体穿越率(GCR),一个连续度量,衡量机器人在任何群体凸包内停留的时间步比例,提供比终端度量更细粒度的社交合规评估。我们引入了一个现实的人群模拟基准,包含五个基于经验的阶段:个体速度异质性、群体速度耦合、F-formation静态群体、领导者-跟随者动力学和凸包边界,并在ORCA和Social Force行人动力学下进行评估。在ORCA、Social Force、DS-RNN和Intention-RL上的实验揭示了反应式-学习型非对称性:TAGA对经典反应式基线提升最大(成功率最高+8pp,GCR减半),而对学习型策略成本近乎为零。这些发现为模块化群体感知何时增加价值以及端到端群体感知训练何时更优提供了可操作的指导。

英文摘要

Robots navigating human-populated environments must avoid collisions while respecting the social structure of crowds, particularly the implicit boundaries of social groups. Most navigation approaches model humans as independent individuals, causing socially disruptive behavior even when collision-free. This paper presents TAGA (Tangent Action for Group Avoidance), which steers around detected group boundaries via tangent-path maneuvers without modifying the underlying navigation policy. A hierarchical safety controller coordinates group-level avoidance with individual collision prevention. We propose the Group Crossing Rate (GCR), a continuous metric measuring the fraction of timesteps the robot spends inside any group convex hull, providing finer-grained social compliance assessment than terminal metrics alone. We introduce a realistic crowd simulation benchmark with five empirically grounded phases: individual speed heterogeneity, group speed coupling, F-formation static groups, leader-follower dynamics, and convex-hull boundaries, evaluated under both ORCA and Social Force pedestrian dynamics. Experiments across ORCA, Social Force, DS-RNN, and Intention-RL reveal a reactive-learning asymmetry: TAGA provides the largest gains for classical reactive baselines (up to +8pp success rate, GCR halved) with near-zero cost for learned policies. These findings offer actionable guidance for when modular group-awareness adds value versus when end-to-end group-aware training is preferable.
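The GCR definition above (fraction of timesteps the robot spends inside any group convex hull) can be sketched in a few lines. This is only an illustrative reading of the abstract, not the authors' implementation: the function names, the 2D point representation, the choice to count boundary points as inside, and the choice to ignore groups whose hull has fewer than three vertices are all assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(p: Point, hull: Sequence[Point]) -> bool:
    """True if p lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:  # assumption: degenerate groups (pairs, collinear) have no interior
        return False
    return all(_cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def group_crossing_rate(
    robot_traj: Sequence[Point],
    groups_per_step: Sequence[Sequence[Sequence[Point]]],
) -> float:
    """Fraction of timesteps at which the robot is inside any group's convex hull."""
    if not robot_traj:
        return 0.0
    inside = sum(
        1
        for p, groups in zip(robot_traj, groups_per_step)
        if any(inside_hull(p, convex_hull(g)) for g in groups)
    )
    return inside / len(robot_traj)
```

For example, with a robot trajectory of four timesteps and one static square group, a single timestep inside the hull gives a GCR of 0.25. Because the metric is accumulated over the whole trajectory, it can distinguish a brief clip through a group from a long traversal, which a terminal success/collision flag cannot.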

2506.23768 2026-06-01 cs.RO 版本更新

Motion Tracking with Muscles: Predictive Control of a Parametric Musculoskeletal Canine Model

基于肌肉的运动追踪:参数化肌肉骨骼犬模型的预测控制

Vittorio La Barbera, Steven Bohez, Leonard Hasenclever, Yuval Tassa, John R. Hutchinson

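Affiliations * DeepMind; Royal Veterinary College

AI summary: Presents a canine musculoskeletal model procedurally generated from accurate 3D muscle meshes, together with an improved muscle dynamics model and a motion capture-based locomotion task. The approach is validated by comparing simulated muscle activation patterns with experimental EMG data, with the aim of bridging biomechanics, robotics, and computational neuroscience.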

详情
AI中文摘要

我们引入了一种新颖的犬类肌肉骨骼模型,该模型由精确的3D肌肉网格程序化生成。伴随该模型的是一个基于运动捕捉的步态任务,兼容多种控制算法,以及一个改进的肌肉动力学模型,旨在增强可微控制框架中的收敛性。我们通过将模拟的肌肉激活模式与先前犬类步态研究中实验获得的肌电图(EMG)数据进行比较来验证我们的方法。这项工作旨在弥合生物力学、机器人和计算神经科学之间的差距,为研究肌肉驱动和神经肌肉控制的研究人员提供一个稳健的平台。我们计划发布完整模型以及重定向的运动捕捉片段,以促进进一步的研究和开发。

英文摘要

We introduce a novel musculoskeletal model of a dog, procedurally generated from accurate 3D muscle meshes. Accompanying this model is a motion capture-based locomotion task compatible with a variety of control algorithms, as well as an improved muscle dynamics model designed to enhance convergence in differentiable control frameworks. We validate our approach by comparing simulated muscle activation patterns with experimentally obtained electromyography (EMG) data from previous canine locomotion studies. This work aims to bridge gaps between biomechanics, robotics, and computational neuroscience, offering a robust platform for researchers investigating muscle actuation and neuromuscular control. We plan to release the full model along with the retargeted motion capture clips to facilitate further research and development.

2505.20795 2026-06-01 cs.RO 版本更新

Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt

以人类演示视频为提示学习可泛化的机器人策略

Xiang Zhu, Yichen Liu, Hezhong Li, Jianyu Chen

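Affiliations * Tsinghua University, China; Shanghai Qi Zhi Institute, China

AI summary: Proposes a two-stage framework that learns a generalizable robot policy from human demonstration videos and can perform new tasks without teleoperation data or finetuning.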

Comments Accepted to the IEEE International Conference on Robotics and Automation (ICRA), 2026

详情
AI中文摘要

最近的机器人学习方法通常依赖于通过遥操作收集的大规模机器人数据集的模仿学习。面对新任务时,这些方法通常需要收集一组新的遥操作数据并微调策略。此外,遥操作数据收集流程也繁琐且昂贵。相反,人类能够通过观察他人操作高效学习新任务。在本文中,我们介绍了一种新颖的两阶段框架,利用人类演示学习可泛化的机器人策略。该策略可以直接以人类演示视频为提示,执行新任务,无需任何新的遥操作数据和模型微调。在第一阶段,我们训练视频生成模型,通过交叉预测捕获人类和机器人演示视频数据的联合表示。在第二阶段,我们使用新颖的原型对比损失将学习到的表示与人类和机器人之间的共享动作空间融合。在真实世界灵巧操作任务上的实证评估显示了所提出方法的有效性和泛化能力。

英文摘要

Recent robot learning methods commonly rely on imitation learning from massive robotic datasets collected with teleoperation. When facing a new task, such methods generally require collecting a new set of teleoperation data and finetuning the policy. Furthermore, the teleoperation data collection pipeline is tedious and expensive. In contrast, humans are able to efficiently learn new tasks just by watching others perform them. In this paper, we introduce a novel two-stage framework that utilizes human demonstrations to learn a generalizable robot policy. Such a policy can directly take a human demonstration video as a prompt and perform new tasks without any new teleoperation data or model finetuning. In the first stage, we train a video generation model that captures a joint representation for both the human and robot demonstration video data using cross-prediction. In the second stage, we fuse the learned representation with a shared action space between human and robot using a novel prototypical contrastive loss. Empirical evaluations on real-world dexterous manipulation tasks show the effectiveness and generalization capabilities of our proposed method.

2407.16167 2026-06-01 cs.RO cs.SY eess.SY 版本更新

Consideration of Vehicle Characteristics on the Motion Planner Algorithm

运动规划算法中车辆特性的考虑

Syed Adil Ahmed, Taehyun Shim

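Affiliations * University of Michigan Dearborn

AI summary: Existing trajectory planners ignore center-of-gravity (CG) height, so their trajectories are sub-optimal across vehicle types, especially for high-CG vehicles. The paper proposes a planner with a simplified double-track model that estimates lateral and roll-based load transfer from steady-state equations and uses a simplified tire model to reduce solver workload, and compares it with particle and kinematic model planners under high and low acceleration conditions and different vehicle heights.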

Comments This paper has been accepted for conference proceedings in MECC 2024, Chicago under a Creative Commons License CC-BY-NC-ND

详情
Journal ref
IFAC-PapersOnLine, Vol. 58, No. 28, 2024, pp. 444-449
AI中文摘要

自主车辆控制通常分为两个主要领域:轨迹规划和轨迹跟踪。目前,轨迹规划大多通过粒子或基于运动学模型的优化控制器完成。由于这些规划器不考虑质心高度及其影响,其输出对于不同车辆类型(尤其是高质心车辆)并非唯一。因此,跟踪控制器在尝试实现这些次优轨迹时,可能需要付出较大努力以避免车辆操纵性和舒适性约束。本文尝试通过考虑一种采用简化双轨模型的规划器来解决该问题,该模型利用稳态方程估计侧向和侧倾载荷转移,并采用简化轮胎模型以降低求解器负担。将所开发的规划器与广泛使用的粒子模型和运动学模型规划器在碰撞避免场景下进行对比,涵盖高/低加速度条件和不同车辆高度。

英文摘要

Autonomous vehicle control is generally divided into two main areas: trajectory planning and trajectory tracking. Currently, trajectory planning is mostly done by particle or kinematic model-based optimization controllers. Because these planners do not consider CG height and its effects, their output is not unique to each vehicle type, which matters especially for high-CG vehicles. As a result, the tracking controller may have to work hard to avoid violating vehicle handling and comfort constraints while trying to realize these sub-optimal trajectories. This paper tries to address this problem with a planner that uses a simplified double-track model, estimates lateral and roll-based load transfer using steady-state equations, and uses a simplified tire model to reduce solver workload. The developed planner is compared with the widely used particle and kinematic model planners in collision avoidance scenarios, in both high and low acceleration conditions and with different vehicle heights.
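To illustrate why CG height matters for such a planner, the textbook steady-state relation for total lateral load transfer on a rigid vehicle is shown here. It is not necessarily the exact set of equations used in the paper, and the symbols are assumptions for illustration: vehicle mass $m$, lateral acceleration $a_y$, CG height $h_{CG}$, and track width $t_w$.

```latex
% Total lateral load transfer in steady-state cornering (rigid-vehicle approximation)
\Delta F_z = \frac{m \, a_y \, h_{CG}}{t_w}

% Worked example with assumed values: m = 1500 kg, a_y = 5 m/s^2, t_w = 1.5 m
% h_{CG} = 0.5 m:  \Delta F_z = (1500 \cdot 5 \cdot 0.5) / 1.5 = 2500 N
% h_{CG} = 1.0 m:  \Delta F_z = (1500 \cdot 5 \cdot 1.0) / 1.5 = 5000 N
```

In this approximation, doubling the CG height doubles the load shifted from the inner to the outer wheels at the same lateral acceleration. A particle or kinematic planner has no term for this effect, which is consistent with the abstract's point that it returns the same trajectory for low and high vehicles.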