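arXivDaily daily academic digest: synchronized with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.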

2606.06556 2026-06-08 cs.RO New submission

What Matters When Co-Training Robot Manipulation Policies on Everyday Human Videos?

Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta, Yilun Du, Pulkit Agrawal

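Affiliations: Massachusetts Institute of Technology; University of Illinois Urbana-Champaign; Harvard University

AI summary: Studies how hand-pose quality and the motion gap affect transfer when robot manipulation policies are co-trained on everyday internet videos, and proposes a co-training method that improves absolute success rate by 29.7% across six manipulation tasks in the low-robot-data regime.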

Comments The project website is here: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html

2606.06686 2026-06-08 cs.RO cs.DS New submission

Multi-Robot Planning and Control from a CCTV Camera Network in a Real Warehouse

Luke Robinson, Benjamin Ramtoula, Anas Izaaryene, Paul Newman, Daniele De Martini

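Affiliations: Oxford Robotics Institute, University of Oxford, UK; Robot Systems Group, Technical University of Munich, Germany

AI summary: Proposes coordinated multi-robot planning and control using only a distributed CCTV network and edge compute, validated in a real warehouse with four robots and 30 cameras. The authors describe it as the first field demonstration of multi-robot coordination that relies solely on an external camera network.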

Abstract

Off-board control of mobile robots from cameras embedded in the environment offers a practical path to scalable autonomy, moving sensing and compute off the robots. We extend this idea from the single-robot case to coordinated fleets in a real warehouse, driving multiple robots with only a distributed CCTV network and edge compute. The system operates entirely in image space over an uncalibrated, pixel-wise topological camera graph, enabling wide-area operation with flexible camera placement. A hierarchical planner selects a camera sequence per robot and plans its image-space motion through each view, coordinating robots with a prioritised-then-joint strategy and treating overlapping camera regions as shared resources held by one robot at a time to prevent collisions and deadlocks. We validate the approach in a real warehouse with four robots and 30 cameras across six 27 m aisles, reporting mission times and coordination statistics. To our knowledge, this is the first field demonstration of multi-robot planning and coordination using only an external camera network and off-board compute, with robots carrying no task-specific navigation hardware.
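
The idea of treating an overlapping camera region as a shared resource held by one robot at a time can be illustrated with a small mutual-exclusion table. This is a minimal sketch under stated assumptions, not the authors' planner: the `RegionLocks` class, the region name, and the robot names are all hypothetical, and the sketch shows only one-holder-at-a-time access, not how the paper avoids deadlocks.

```python
from dataclasses import dataclass, field


@dataclass
class RegionLocks:
    """Tracks which robot currently holds each overlapping camera region."""

    # Maps a (hypothetical) region id to the id of the robot holding it.
    holder: dict = field(default_factory=dict)

    def try_acquire(self, region: str, robot: str) -> bool:
        """Grant the region if it is free or already held by this robot."""
        current = self.holder.setdefault(region, robot)
        return current == robot

    def release(self, region: str, robot: str) -> None:
        """Free the region, but only if this robot is the current holder."""
        if self.holder.get(region) == robot:
            del self.holder[region]


locks = RegionLocks()
granted_a = locks.try_acquire("cam3_cam4_overlap", "robot_a")  # region is free
granted_b = locks.try_acquire("cam3_cam4_overlap", "robot_b")  # robot_b must wait
locks.release("cam3_cam4_overlap", "robot_a")
granted_b_later = locks.try_acquire("cam3_cam4_overlap", "robot_b")
```

A robot that is refused the region would hold position in its current camera view until the holder releases it; how waiting robots are ordered is left open here.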

2606.06790 2026-06-08 cs.RO cs.LG cs.SY eess.SY New submission

Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension

Arthur Bouton, Tristan D. Hasseler, Michael Paton, Travis Brown, Jacob Levy, William Reid, Joshua Martin, Hari Nayar

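Affiliations: Jet Propulsion Laboratory, California Institute of Technology; Center for Autonomy, University of Texas at Austin; Space Systems Laboratory, University of Maryland

AI summary: Presents a four-wheeled planetary rover concept with an Active Gimbal Suspension and trains a single neural network controller with reinforcement learning for autonomous obstacle negotiation and all-terrain locomotion, validated on the physical rover through policy consolidation and zero-shot transfer.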

Comments 21 pages, 26 figures

Abstract

This paper presents ERNEST, a four-wheeled planetary rover concept equipped with a two-degree-of-freedom Active Gimbal Suspension that combines yaw and roll actuation to enable wheel reconfiguration, steering, and active load redistribution. A single neural network controller, trained to track a desired path across challenging terrain, fully unlocks the capabilities of this actuated suspension system for autonomous obstacle negotiation. A reinforcement learning framework is developed using the high-fidelity DARTS simulation engine, which combines rigid-contact dynamics and Bekker-Wong terramechanics, enabling the emergence of locomotion strategies adapted to loose-soil conditions. To obtain a single unified controller across heterogeneous terrains, a policy consolidation strategy merges the experience of terrain-specialized agents into one neural network, eliminating the need for explicit terrain classification and controller switching. The resulting controller operates on a combination of proprioceptive and exteroceptive feedback, including sparse stereo-derived terrain elevation, chassis attitude, joint states, and force-torque measurements. Zero-shot transfer to the physical rover is achieved through domain randomization, sensor noise injection, and model-to-real system identification. Experimental results demonstrate autonomous traversal of rock fields, a bump trap, a wheel-high step, sand ripples, and sandy slopes. On a 20° sandy slope, the learned controller reduces the cost of transport by 37% on dry sand despite the additional actuation, and achieves superior performance on wet sand where the passive suspension becomes completely immobilized.

2606.06805 2026-06-08 cs.RO cs.AI cs.SY eess.SY New submission

Lane Change Trajectory Planning for Personalized Driving Comfort and Mobility Efficiency

Haoxuan Dong, Dongjun Li, Ziyou Song

Affiliations: Department of Mechanical Engineering; Department of Electrical Engineering; National University of Singapore; Computer Science; University of Michigan

AI summary: Proposes a neural network-driven trajectory planner that combines a third-order polynomial trajectory generator with a learning module, using a shared backbone with dual heads and a statistical gate based on error-winner logistic regression to balance personalized comfort and mobility efficiency.

Comments Accepted by the IEEE Intelligent Vehicles Symposium (IEEE IV 2026), Detroit, MI, United States, June 22-25, 2026

Abstract

Lane changing entails simultaneous longitudinal and lateral motions that affect driving comfort and mobility efficiency. Because these motions are tightly coupled and subject to substantial inter-vehicle variability, trajectory planning for lane-change maneuvers is characterized by a highly personalized nature. This study proposes a neural network-driven planner that integrates a third-order polynomial trajectory generator with a learning module that infers optimal trajectory parameters across diverse driving conditions. Using a shared backbone with dual heads, one head ensures all-condition operational guarantees, while the other captures driver-specific preferences for comfort or mobility efficiency. A head-gated switching mechanism, realized through a statistical gate based on error-winner logistic regression, adaptively selects the appropriate head under varying driving conditions, which enables context-aware lane-change trajectory planning. Representative cases and Monte Carlo simulations show that the proposed planner achieves personalized comfort and mobility during lane changes, while the baseline ensures feasible trajectories under driving conditions where personalized data are insufficient or inaccessible.
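
To make the comfort versus mobility trade-off of a third-order polynomial concrete, the sketch below builds a cubic lateral profile for a lane change. It is an illustration only: the boundary conditions (zero lateral offset and velocity at the start, full lane width and zero lateral velocity at the end), the function name `cubic_lane_change`, and the numeric values are assumptions, and the abstract does not state which trajectory parameters the learning module infers.

```python
def cubic_lane_change(width_m: float, duration_s: float):
    """Return (y, peak_lateral_accel) for a cubic lateral profile.

    Assumed boundary conditions: y(0) = 0, y'(0) = 0,
    y(T) = width_m, y'(T) = 0, which give
    y(t) = 3*W*(t/T)**2 - 2*W*(t/T)**3.
    """
    a2 = 3.0 * width_m / duration_s**2
    a3 = -2.0 * width_m / duration_s**3

    def y(t: float) -> float:
        """Lateral offset in metres at time t (0 <= t <= duration_s)."""
        return a2 * t**2 + a3 * t**3

    # y''(t) = 2*a2 + 6*a3*t is linear in t, so |y''| peaks at t = 0 and t = T.
    peak_accel = 6.0 * width_m / duration_s**2
    return y, peak_accel


# A shorter maneuver finishes sooner but demands more lateral acceleration.
y_fast, accel_fast = cubic_lane_change(width_m=3.5, duration_s=3.0)
y_slow, accel_slow = cubic_lane_change(width_m=3.5, duration_s=6.0)
```

Under these assumptions the maneuver duration is the single knob: halving it quadruples the peak lateral acceleration, which is one way a driver's preference for comfort or for mobility could show up in the trajectory parameters.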

2606.06829 2026-06-08 cs.RO New submission

Three-dimensional hydro-cluttered locomotion by an undulatory robot

Tianyu Wang, Matthew Fernandez, Galen Tunnicliffe, Nikolas Cornell, Justin Duong, Donoven Dortilus, Zhaochen J. Xu, Patricia Meza, Sean Lublinsky, Darsh Parikh, Jianfeng Lin, Emily Grace, Daniel I. Goldman

Affiliations: Institute for Robotics and Intelligent Machines, Georgia Institute of Technology; School of Physics, Georgia Institute of Technology; George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology; School of Electrical and Computer Engineering, Georgia Institute of Technology; Department of Mechanical and Industrial Engineering, Northeastern University; Ransom Everglades School

AI summary: Presents the AquaMILR robot, which uses programmable body compliance and depth regulation to achieve fast, robust forward locomotion in three-dimensional hydro-cluttered environments, with inertia-induced rolling acting as a spontaneous recovery mechanism.

Abstract

Aquatic robots have expanded human access to underwater environments, yet many underwater spaces contain obstacles that can disrupt open-water locomotion. In "hydro-cluttered" environments, water is interspersed with rigid and flexible clutter, making body-obstacle contact unavoidable. Operating in these spaces requires robots that can regulate and exploit contact, but this regime remains difficult to model or simulate. Building on recent advances in mechanical intelligence in terradynamically capable limbless robotics, we develop principles for 3D aquatic locomotion using AquaMILR, an elongate limbless robot that combines bilateral cable-driven actuation, programmable body compliance, distributed depth regulation, corrosion-resistant enclosures, and onboard power and electronics for untethered field operation. Systematic robophysical experiments reveal that programmable body compliance regulates body deformation and converts body-environment interactions into fast, robust, forward progression across increasing hydro-clutter constraint strength. Depth regulation provides three-dimensional access, allowing the robot to bypass clutter, recover from obstruction, and continue through otherwise inaccessible routes. In potential jamming scenarios, emergent inertia-induced rolling acts as a spontaneous recovery mechanism, freeing the robot from clutter that would otherwise lead to failure and allowing locomotion to continue without additional control. Tests of the robot in an aquatic mangrove field demonstrate that these principles transfer to practical operation, enabling navigation and onboard visual inspection of inaccessible root zones. These results establish principles for hydro-cluttered locomotion and a design paradigm in which aquatic robots exploit environmental complexity as a locomotor resource.

2606.06832 2026-06-08 cs.RO New submission

STRIPS-WM: Learning Grounded Propositional STRIPS-style World Models from Images

Abhiroop Ajith, Constantinos Chamzas

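Affiliations: Worcester Polytechnic Institute

AI summary: Proposes STRIPS-WM, a framework that learns symbolic world models from image transitions for robot visual task planning and improves planning success rate.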

Abstract

Robots performing long-horizon visual manipulation observe high-dimensional images, but successful plans depend on action-relevant facts: what can be done now and what changes afterward. A useful planning representation should discard irrelevant visual details while preserving action applicability and effects. Classical task planners exploit this structure through symbolic operators with preconditions and effects, but obtaining such representations from raw visual experience remains challenging. We study a visual task-planning setting in which a robot receives only image transitions: the current image, executed high-level action, and the resulting image. At test time, given a start image and a goal image, the robot must produce a sequence of high-level actions that reaches the goal. To address this problem, we introduce STRIPS-WM, a framework for learning image-grounded STRIPS-style world models directly from visual transitions. STRIPS-WM first induces a finite abstract transition graph from images, then learns latent binary predicates and one grounded propositional operator per action label. The learned operators form a symbolic action model with sparse preconditions and add/delete effects. Finally, the learned predicates are distilled into a visual encoder, enabling classical planning directly from novel start and goal images. Experiments on visual rearrangement tasks show that STRIPS-WM improves image-to-plan success over the tested visual rollout, latent graph-search and latent-symbolic baselines.
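
For readers unfamiliar with STRIPS-style operators, the sketch below shows what a grounded propositional operator with preconditions and add/delete effects looks like, and how a classical planner can search over such operators. It is a toy illustration, not the STRIPS-WM implementation: the proposition and action names are hand-written and hypothetical, whereas the paper learns latent binary predicates from images, and breadth-first search stands in for whatever classical planner is used.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A grounded propositional STRIPS operator."""

    name: str
    pre: frozenset     # propositions that must hold before the action
    add: frozenset     # propositions made true by the action
    delete: frozenset  # propositions made false by the action

    def applicable(self, state: frozenset) -> bool:
        return self.pre <= state

    def apply(self, state: frozenset) -> frozenset:
        return (state - self.delete) | self.add


def plan(start: frozenset, goal: frozenset, operators):
    """Breadth-first search over symbolic states; returns action names or None."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for op in operators:
            if op.applicable(state):
                nxt = op.apply(state)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [op.name]))
    return None


# Hypothetical two-action rearrangement domain.
pick_a = Operator("pick_a",
                  pre=frozenset({"a_on_table", "hand_empty"}),
                  add=frozenset({"holding_a"}),
                  delete=frozenset({"a_on_table", "hand_empty"}))
stack_a_on_b = Operator("stack_a_on_b",
                        pre=frozenset({"holding_a"}),
                        add=frozenset({"a_on_b", "hand_empty"}),
                        delete=frozenset({"holding_a"}))

result = plan(frozenset({"a_on_table", "hand_empty"}),
              frozenset({"a_on_b"}),
              [pick_a, stack_a_on_b])
# result == ["pick_a", "stack_a_on_b"]
```

In the setting the abstract describes, the start and goal sets would come from a visual encoder applied to the start and goal images rather than being written by hand.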

2606.06836 2026-06-08 cs.RO cs.AI cs.CV New submission

Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu

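Affiliations: Colab; Beihang University; Meituan; National University of Singapore

AI summary: Proposes the FLIGHT benchmark and FLIGHT VLA, an asynchronous architecture that decouples a low-frequency pilot-reasoning VLM from a high-frequency diffusion action model to produce smooth, continuous UAV flight control under long-horizon semantic instructions.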

Abstract

Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.

2606.06870 2026-06-08 cs.RO New submission

What Is My Robot Thinking? Design Considerations for Transparent and Trustworthy Shared Autonomy

Atharv Belsare, Zohre Karimi, Connor Mattson, Rushiil Nakka, Daniel S. Brown

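Affiliations: Kahlert School of Computing, University of Utah; Robotics Center, University of Utah

AI summary: A user study of how interface transparency (feedback modality and information richness) affects coordination and trust in shared autonomy systems. Feedback improves intent alignment and reduces corrective intervention, visual feedback is preferred over auditory feedback, the preferred information richness depends on task complexity, and revealing the full belief distribution does not consistently improve alignment or trust.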

Comments 9 pages, 5 Figures, Code and videos are available at https://sites.google.com/view/design-t2-sa/home. Under review at IROS 2026

Abstract

Assistive robots operating under shared autonomy must balance user control with autonomous assistance. Because robot actions depend on internal intent inference that is not directly observable, mismatches between inferred and intended goals can undermine coordination and trust. We investigate how interface-level transparency, including feedback modality (visual vs. auditory) and information richness (sparse vs. rich), shapes interaction in a vision-based shared autonomy system. In a user study with N=25 participants across two assistive manipulation tasks, we evaluate how these designs influence coordination and trust. Providing feedback significantly improves intent alignment and reduces corrective intervention, indicating that making the inferred goal legible accelerates convergence in shared control. Participants preferred visual over auditory feedback, while preferences for sparse versus rich information depended on task complexity. We also found that revealing the full belief distribution did not consistently improve alignment or trust. Together, these findings indicate that effective transparency enhances coordination primarily through goal legibility, while trust depends on task-appropriate information exposure rather than maximal disclosure. Based on these results, we outline guidelines for designing transparent shared autonomy systems.

2606.06877 2026-06-08 cs.RO cs.AI New submission

Neuro-Symbolic Learning for Long-Horizon Task Planning Under Complex Logical Constraints

Qiwei Du, Zitong Zhan, Shaoshu Su, Bowen Li, Yi Du, Zhipeng Zhao, Taimeng Fu, Sebastian Scherer, Jiaoyang Li, Chen Wang

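Affiliations: Spatial AI & Robotics (SAIR) Lab, University at Buffalo, NY 14260; Robotics Institute, Carnegie Mellon University, PA 15213

AI summary: Proposes an imperative learning-based bilevel optimization framework in which a neural scorer prunes irrelevant objects and a 3R strategy (Repair, Restart, Rollback) stabilizes lower-level planning, achieving an 80.04% reduction in failure rate and a 57.14% reduction in planning time on three benchmarks.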

Abstract

Task planning often suffers from severe efficiency bottlenecks when robots must reason over long-horizon action sequences under complex logical constraints, including object affordances, spatial relationships, and sequential action dependencies. Recent neuro-symbolic methods improve planning efficiency by learning object-importance scores to prune task-irrelevant objects, but they typically rely on fixed offline supervision generated from full search spaces. This creates a train-test mismatch: at deployment, the planner operates in pruned search spaces induced by the model's own imperfect predictions, leading to exposure bias and degraded planning performance. To address this challenge, we formulate object-importance learning for task planning as an imperative learning-based bilevel optimization problem. The upper level optimizes a neural scorer, while the lower level solves a symbolic planning problem in the score-pruned search space. To stabilize this learning process, we introduce a 3R strategy into the lower-level planning, using parallel Repair, Restart, and Rollback recovery to provide reliable and adaptive feedback for upper-level learning. Experiments on three challenging benchmarks demonstrate state-of-the-art performance, including an 80.04% reduction in failure rate and a 57.14% reduction in planning time. We further validate the framework on a quadruped-based mobile manipulator in simulation and the real world, demonstrating its potential for efficient and deployable neuro-symbolic task planning.

2606.06878 2026-06-08 cs.RO cs.CV New submission

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

Kangjian Zhu, Haobo Jiang, Jianjun Qian, Jin Xie

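Affiliations: Nanjing University of Science and Technology; Nanyang Technological University; Nanjing University

AI summary: Proposes a cross-view fusion framework that alleviates occlusion with an auxiliary view, uses self-supervised contrastive learning to strengthen the spatial consistency and direction distinctiveness of point cloud features, and introduces a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry, improving the robustness of 6-DoF grasp pose estimation in corner views.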

Comments Corresponding author: Jin Xie

Abstract

In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at https://github.com/KJZhuAutomatic/Cross-view-Grasp.
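
The remark that a cylindrical coordinate frame emphasizes rotation-symmetric geometry can be illustrated with a plain coordinate conversion. The sketch below is only a generic example, not the paper's integration module: it assumes the cylinder axis is the local z-axis through a chosen origin, and the function names are hypothetical. The actual repository linked in the abstract should be consulted for how the authors define the cylinder.

```python
import math


def to_cylindrical(point, origin=(0.0, 0.0, 0.0)):
    """Convert a 3D point to (radius, angle, height) about the local z-axis."""
    x, y, z = (p - o for p, o in zip(point, origin))
    radius = math.hypot(x, y)
    angle = math.atan2(y, x)  # radians, as returned by math.atan2
    return radius, angle, z


def rotate_about_z(point, phi):
    """Rotate a point by phi radians about the z-axis."""
    x, y, z = point
    c, s = math.cos(phi), math.sin(phi)
    return (c * x - s * y, s * x + c * y, z)


p = (1.0, 1.0, 2.0)
r0, a0, h0 = to_cylindrical(p)
r1, a1, h1 = to_cylindrical(rotate_about_z(p, 0.5))
# Rotating about the axis changes only the angle; radius and height are unchanged.
```

In this representation a rotation about the axis shifts a single coordinate, which is why geometry that is symmetric about that axis takes a particularly simple form.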

2606.06944 2026-06-08 cs.RO New submission

T-GMP: Terrain-conditioned Generative Motion Priors for Versatile and Natural Humanoid Locomotion

Junhong Guo, Hao Hu, Chen Chen, Haoxuan Han, Linao Gong, Xin Yang, Zhicheng He, Yao Su, Fenghua He

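Affiliations: Harbin Institute of Technology; Leju Robotics

AI summary: Proposes the T-GMP module, which uses a conditional variational autoencoder to learn a terrain-conditioned latent motion manifold from a small number of expert demonstrations and combines adversarial learning with a foothold penalty, so that a unified policy produces versatile, natural motion that adapts to terrain changes.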

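Abstract (translated from the AI-generated Chinese version)

Achieving both human-like naturalness and robust terrain traversal remains a fundamental challenge in humanoid locomotion. Existing reinforcement learning methods typically rely on fixed motion priors, which limits their adaptability to changing environments. We propose Terrain-conditioned Generative Motion Priors (T-GMP), a module that uses a conditional variational autoencoder to capture a terrain-conditioned latent motion manifold from a small number of expert state-terrain demonstrations. The learned prior enables smooth style transitions and helps a unified policy adapt to terrain changes. We integrate T-GMP into an adversarial learning pipeline and introduce a foothold penalty, in which the discriminator dynamically modulates the naturalness constraint according to local terrain features, guiding the generation of versatile, human-like motion. Experimental results show that our method outperforms existing baselines in traversal success rate and motion smoothness while maintaining biomimetically natural and physically coordinated motion.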

Task Editing for Generalizable 3D Visuomotor Policy Learning

Jian-Jian Jiang, YiHan Yang, Lan Wei, Yuming Luo, Xiao-Ming Wu, Xuhang Chen, Bin Fan, Dandan Zhang, Wei-Shi Zheng

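Affiliations: Sun Yat-sen University; Imperial College London; Nanyang Technological University; South China University of Technology

AI summary: Proposes the Task-Edit framework, which decomposes tasks into scene, skill, and object components and flexibly recombines them to generate diverse trajectories, improving the generalization of 3D visuomotor policies on long-horizon manipulation tasks.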

Comments 8 pages, 4 figures

Abstract

3D visuomotor policies offer a promising direction for complex robotic manipulation, as depth maps and point clouds provide rich geometric information for spatial reasoning. However, their success often depends on large-scale real-world demonstrations, which are costly and time-consuming to collect. To this end, existing methods commonly use demonstration generation strategies to improve data efficiency by applying object-centric transformations to human-collected demonstrations, such as varying object poses or scales. While effective for local variation, these transformations largely preserve the original scene structure and skill sequence, limiting their ability to synthesize diverse scene-skill-object combinations for complex tasks. In this paper, we propose Task-Edit, a novel demonstration generation framework that generates diverse trajectories from a task-centric editing perspective. The key insight of Task-Edit is to decompose a task into scene, skill and object components, and flexibly recombine them. In this way, Task-Edit enables scalable demonstration generation and significantly improves generalization for long-horizon manipulation tasks. We evaluate Task-Edit through extensive real-world experiments and demonstrate three advantages: (1) Effectiveness: Task-Edit significantly improves 3D visuomotor policies across various real-world tasks and robot embodiments. (2) Generalizability: Task-Edit improves model generalization across different scenario setups. (3) Applicability: Task-Edit enables models to handle scenarios that are difficult to collect in the real world, including disturbance resistance, obstacle avoidance and unseen cluttered scenes.

2606.07013 2026-06-08 cs.RO cs.HC New submission

A Multi-Operator Mixed-Reality Interface for Multi-Robot Control and Coordination: Co-Located and Private Workspace Collaboration

Omotoye Shamsudeen Adekoya, Antonio Sgorbissa, Carmine Tommaso Recchiuto

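Affiliations: DIBRIS Department, RICE Laboratory, University of Genoa

AI summary: Proposes a mixed-reality interface extended to multi-operator collaboration, supporting a co-located shared workspace mode and a private-workspace mode, with registration-driven scene construction, lightweight session synchronization, and per-robot control leases to prevent conflicting commands. Experiments show comparable task performance across the two modes, while the co-located mode significantly improves perceived collaboration and is preferred by operators.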

Comments Submitted to RO-MAN 2026

Abstract

Multi-operator control of robot teams requires not only access to the same mission information, but also mechanisms for maintaining shared awareness and preventing conflicting interventions. Building on our previous HORUS interface (Holistic Operational Reality for Unified Systems) we present a mixed-reality interface that extends single-operator multi-robot supervision to collaborative multi-operator use. The system supports two complementary modes: a co-located shared workspace, in which operators observe and manipulate the same mini-map in the same physical location, and a private-workspace mode, in which operators work on the same mission through independently placed local workspaces. The architecture combines registration-driven scene construction, lightweight shared-session synchronization, and per-robot control leases to support collaborative monitoring, tasking, and teleoperation while preventing conflicting commands. We evaluated the approach in a human-subject study with 36 participants (18 pairs) controlling three Nova Carter mobile robots in two search environments. The performance of the objective task was comparable across the two modes, indicating that both modes supported effective mission execution. However, the co-located shared workspace significantly improved perceived collaboration, shared understanding, and handoff clarity, and was the preferred collaborative mode. These results indicate that physically co-locating the MR workspace improves how operators coordinate even when the underlying robot-control tools remain unchanged.
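
A per-robot control lease, as mentioned in the abstract, can be pictured as a small table that records which operator may currently command each robot. The sketch below is one possible reading and not the HORUS implementation: the `LeaseManager` class, the time-to-live expiry, and the operator and robot names are all assumptions made for illustration. Time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseManager:
    """Grants each robot's control to at most one operator at a time."""

    ttl_s: float = 5.0  # assumed lease lifetime; a holder must renew before it lapses
    leases: dict = field(default_factory=dict)  # robot id -> (operator id, expiry time)

    def acquire(self, robot: str, operator: str, now: float) -> bool:
        """Take or renew the lease unless another operator holds a live one."""
        holder = self.leases.get(robot)
        if holder is not None and holder[0] != operator and holder[1] > now:
            return False
        self.leases[robot] = (operator, now + self.ttl_s)
        return True

    def may_command(self, robot: str, operator: str, now: float) -> bool:
        """Commands are accepted only from the holder of an unexpired lease."""
        holder = self.leases.get(robot)
        return holder is not None and holder[0] == operator and holder[1] > now


manager = LeaseManager(ttl_s=5.0)
alice_ok = manager.acquire("carter_1", "alice", now=0.0)   # lease granted
bob_ok = manager.acquire("carter_1", "bob", now=1.0)       # refused, alice holds it
bob_later = manager.acquire("carter_1", "bob", now=6.0)    # alice's lease has lapsed
```

Rejecting commands from anyone but the lease holder is what prevents two operators from teleoperating the same robot at once, while other robots in the team remain available to other operators.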

2606.07067 2026-06-08 cs.RO New submission

Extending Responsibility-Sensitive Safety for the Assessment of Offloaded Autonomous Driving Services

Robin Dehler, Aryan Thakur, Michael Buchholz

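AI summary: Addresses the safety challenge of variable response times caused by V2X communication when autonomous driving functions are offloaded, by extending the Responsibility-Sensitive Safety definitions, proposing offloading decision-making and fallback mechanisms based on the safety constraints, and introducing a warm-standby phase that makes fallback safer.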

Comments 8 pages; accepted for 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC), Naples, Italy, September 15-18, 2026 - DOI will be added after publication

Abstract

Safety is a fundamental requirement in the development of autonomous driving (AD) systems. While function offloading has demonstrated significant benefits in terms of computational efficiency and energy consumption, its application to safety-critical AD functionality introduces new challenges. In particular, offloaded service compositions incur increased and variable response times due to wireless vehicle-to-everything (V2X) communication, which directly affects the vehicle's reaction time and thus its safety guarantees. In this paper, we address this challenge by extending the definitions of Responsibility-Sensitive Safety (RSS) to explicitly account for different response times of local and offloaded AD service compositions. Based on this extension, we propose an integration into function offloading, using the RSS safety constraints for offloading decision-making and fallback mechanisms. Offloaded service compositions are only permitted if the current traffic situation remains safe under the corresponding end-to-end response time. If this condition is violated, the system performs a controlled fallback to local execution. Furthermore, we introduce an enhanced fallback strategy that includes a warm-standby phase for offloaded services, enabling faster and safer transitions from offloaded to local services. The proposed approach is integrated into our AD stack and evaluated in both simulation and the real world. Experimental results demonstrate that the proposed method improves safety compared to state-of-the-art function offloading and safety frameworks, while preserving the benefits of distributed computation when safety conditions allow.
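
The effect of response time on the safety check can be sketched with the widely used RSS minimum longitudinal gap for a following vehicle. This is a rough illustration, not the paper's extended definitions: it simply evaluates the standard formula with two different response times, and the function names, acceleration limits, speeds, and response times are all assumed values.

```python
def rss_min_gap(v_rear, v_front, rho, a_accel_max, b_brake_min, b_brake_max):
    """Standard RSS minimum longitudinal gap in metres.

    v_rear, v_front: speeds of the following and leading vehicle (m/s)
    rho: response time of the following vehicle (s)
    a_accel_max: worst-case acceleration of the follower during rho (m/s^2)
    b_brake_min: braking the follower is guaranteed to apply afterwards (m/s^2)
    b_brake_max: hardest braking the leader might apply (m/s^2)
    """
    v_after = v_rear + rho * a_accel_max
    gap = (v_rear * rho
           + 0.5 * a_accel_max * rho**2
           + v_after**2 / (2.0 * b_brake_min)
           - v_front**2 / (2.0 * b_brake_max))
    return max(gap, 0.0)


def may_offload(current_gap, v_rear, v_front, rho_offloaded, **limits):
    """Permit offloading only if the gap is safe under the offloaded response time."""
    return current_gap >= rss_min_gap(v_rear, v_front, rho_offloaded, **limits)


limits = dict(a_accel_max=2.0, b_brake_min=4.0, b_brake_max=8.0)  # assumed values
gap_local = rss_min_gap(20.0, 20.0, 0.3, **limits)      # about 34.1 m
gap_offloaded = rss_min_gap(20.0, 20.0, 0.8, **limits)  # about 50.0 m
offload_at_40m = may_offload(40.0, 20.0, 20.0, 0.8, **limits)
```

With these assumed numbers, a 40 m gap is safe for local execution but not for the slower offloaded pipeline, which is the kind of situation in which a fallback to local execution would be triggered.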

2606.07083 2026-06-08 cs.RO 新提交

Predictive Style Matching: Natural and Robust Humanoid Locomotion

预测性风格匹配：自然且鲁棒的类人机器人行走

Simeon Nedelchev, Ekaterina Chaikovskaia, Egor Davydenko, Eduard Zaliaev, Roman Gorbachev

发表机构 * Moscow Institute of Physics and Technology (MIPT)（莫斯科物理技术学院）； Innopolis University（因诺波利斯大学）； Sber Robotics Center（Sber机器人中心）

AI总结提出预测性风格匹配（PSM）方法，通过离线预测器将机器人下半身状态映射到上半身关节和步态目标，在保持任务奖励鲁棒性的同时显著降低风格误差。

详情

AI中文摘要

强化学习已成为类人机器人行走控制的主流方法：策略能够可靠地从仿真迁移到硬件，并从干扰中优雅恢复。然而，运动质量仍然落后：仅任务奖励往往收敛到僵硬、不对称的步态，而运动模仿方法改善了外观，但由于参考信号可能对抗恢复平衡所需的瞬态姿态，因此对外部干扰更加敏感。我们提出预测性风格匹配，其中离线预测器将机器人下半身状态历史和速度命令映射到可解释的上半身关节和步态目标，以在训练期间塑造奖励。由于目标是状态条件而非时间索引，且预测器仅在训练时使用，部署的控制器继承了仅任务奖励强化学习基线（RL baseline）的本体感觉接口和推理成本。在Unitree G1上，无论是在仿真还是硬件中，PSM将上半身风格误差比仅任务RL降低大约一个数量级，同时保持其跌倒恢复率，而运动模仿基线实现了最低的风格误差，但无法从干扰中恢复的频率大约高出五倍。

英文摘要

Reinforcement learning has become the prevailing approach to humanoid locomotion control: policies transfer reliably from simulation to hardware and recover gracefully from disturbances. Motion quality, however, still lags behind: task-only rewards often converge to stiff, asymmetric gaits, while motion imitation methods improve appearance but become more sensitive to external disturbances because reference signals can oppose the transient poses needed to regain balance. We propose Predictive Style Matching, in which an offline predictor maps the robot's lower-body state history and velocity commands to interpretable upper-body joint and gait targets that shape the rewards during training. Because the targets are state-conditioned rather than time-indexed and the predictor is used only at training time, the deployed controller inherits the proprioceptive interface and inference cost of a task-only RL baseline. On the Unitree G1, in both simulation and hardware, PSM reduces upper-body style error by roughly an order of magnitude over task-only RL while preserving its fall-recovery rate, whereas the motion-imitation baseline attains the lowest style error but fails to recover from disturbances about five times as often.

2606.07089 2026-06-08 cs.RO 新提交

Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning

必要时做梦：通过自适应多模态推理推进世界行动模型

Yinzhou Tang, Jingbo Xu, Yu Shang, Zihao Song, Chen Gao, Wei Wu, Yong Li

发表机构 * Tsinghua University（清华大学）； Manifold AI

AI总结提出AdaWAM，通过轻量动态路由器自适应触发文本或视觉推理，提升长时复杂任务中的推理效率和性能。

详情

AI中文摘要

世界行动模型（WAMs）为具身智能提供了一种有前景的方法，但现有方法严重依赖视频预测作为行动先验，缺乏自适应多模态推理，限制了其在长时、复杂任务中的有效性。我们观察到，WAMs在不同执行上下文中需要不同的多模态推理模式：在任务转换期间，文本推理对于指导高层行动预测至关重要，而在细粒度操作期间，视觉推理对于精确控制至关重要。基于这一观察，我们提出了AdaWAM，一种具有自适应多模态推理能力的世界行动模型。AdaWAM集成了一个轻量动态路由器，可在任务执行过程中根据需要自主触发文本或视觉推理。在模拟和真实世界具身任务上的实验表明，AdaWAM在显著提升推理效率的同时，超越了最先进的具身策略。代码和演示可在以下网址获取：https://adawam.github.io/。

英文摘要

World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heavily on video prediction as action priors and lack adaptive multimodal reasoning, limiting their effectiveness on long-horizon, complex tasks. We observe that WAMs require different multimodal reasoning modes under different execution contexts: textual reasoning is essential during task transitions to guide high-level action prediction, while visual reasoning is critical during fine-grained manipulation for precise control. Motivated by this observation, we propose AdaWAM, a world action model with adaptive multimodal reasoning abilities. AdaWAM integrates a lightweight dynamic router that autonomously triggers textual or visual reasoning as needed during task execution. Experiments on both simulated and real-world embodied tasks show that AdaWAM substantially improves inference efficiency while outperforming state-of-the-art embodied policies. Code and demos are available at: https://adawam.github.io/.

2606.07107 2026-06-08 cs.RO 新提交

Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models

从粗粒度到控制：面向视觉-语言-动作模型的动作令牌规划

Jinhao Wu, Shiduo Zhang, Yicheng Liu, Xiaopeng Yu, Sixian Li, Siyin Wang, Hang Zhao, Jing Huo, Yang Gao, Jingjing Gong, Xipeng Qiu, Yu-Gang Jiang

发表机构 * Nanjing University（南京大学）； Shanghai Innovation Institute（上海创新研究院）； Fudan University（复旦大学）； Tsinghua University（清华大学）

AI总结提出Coarse-to-Control框架，在动作令牌空间中引入原生规划，通过先预测粗粒度动作令牌序列再生成可执行动作，提升长程任务性能。

详情

AI中文摘要

危险环境中可解释自主性的抽象架构

Matt Luckcuck, Hazel M Taylor, Marie Farrell

发表机构 * Maynooth University（梅诺斯大学）； University of Manchester（曼彻斯特大学）

AI总结提出一种支持自主系统解释其行为的抽象架构，旨在通过设计可解释性增强用户信任，并以民用核工业为例展示应用。

Comments Originally published 20th of October 2022 at the Second International Workshop on Requirements Engineering for Explainable Systems (RE4ES), which was hosted by the International Requirements Engineering Conference 2022

详情

DOI: 10.1109/REW56159.2022.00027

AI中文摘要

自主机器人系统被提议用于危险环境，通常是为了减少人类工人的风险。在不久的将来，人类工人可能会继续使用和指挥这些自主机器人，就像其他计算机化工具一样，但具有更复杂的决策能力。因此，工程努力的一个重要方向是确保这些用户信任系统。最近的文献表明，可解释性与系统的可信度密切相关。与安全性和保密性属性一样，可解释性应该被设计到系统中，而不是事后添加。本文提出了一种抽象架构，支持自主系统解释其行为（可解释自主性），为实施可解释自主系统提供了设计模板。我们给出了一个工作示例，说明我们的架构如何应用于民用核工业，其中工人和监管机构都需要信任系统的决策能力。

英文摘要

Autonomous robotic systems are being proposed for use in hazardous environments, often to reduce the risks to human workers. In the immediate future, it is likely that human workers will continue to use and direct these autonomous robots, much like other computerised tools but with more sophisticated decision-making. Therefore, one important area on which to focus engineering effort is ensuring that these users trust the system. Recent literature suggests that explainability is closely related to how trustworthy a system is. Like safety and security properties, explainability should be designed into a system, instead of being added afterwards. This paper presents an abstract architecture that supports an autonomous system explaining its behaviour (explainable autonomy), providing a design template for implementing explainable autonomous systems. We present a worked example of how our architecture could be applied in the civil nuclear industry, where both workers and regulators need to trust the system's decision-making capabilities.

2606.07217 2026-06-08 cs.RO cs.CV cs.LG 新提交

Robotic Policy Adaptation via Weight-Space Meta-Learning

通过权重空间元学习实现机器人策略自适应

Christian Bianchi, Siamak Yousefi, Alessio Sampieri, Andrea Roberti, Luca Rigazio, Fabio Galasso, Luca Franco

发表机构 * ItalAI ； University of Verona（维罗纳大学）； Sapienza University of Rome（罗马第一大学）

AI总结提出WIZARD框架，通过权重空间元学习从语言指令和演示视频生成任务特定LoRA参数，无需微调即可适应新任务，在LIBERO上性能提升高达14倍。

详情

AI中文摘要

视觉-语言-动作（VLA）模型正成为机器人操作的一种有前景的范式，能够从大规模演示和动作标签语料库中训练通用策略。然而，将这些模型适应新任务通常仍需要任务特定的演示、动作注释和额外的微调，使得部署成本高昂且难以扩展。我们提出WIZARD，一种权重空间元学习框架，通过为冻结的VLA策略生成任务特定的LoRA参数来避免任务特定的微调。仅凭语言指令和简短的演示视频，WIZARD即可在单次前向传播中预测相应的自适应权重，无需目标任务动作标签或测试时优化。在元训练期间，WIZARD学习将任务证据直接映射到专家LoRA更新，在权重空间中捕获任务之间的关系。在LIBERO上的实验表明，WIZARD在未见过的数据集集合上性能提升高达约2倍，在未见过的任务上提升高达约14倍。在Franka Emika Panda机器人上，WIZARD持续优于真实域自适应基线，表明生成的适配器提供了超越仿真的任务级特化。

英文摘要

Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.

2606.07244 2026-06-08 cs.RO cs.AI cs.CV 新提交

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

超越航点：面向视觉语言导航的轨迹中心航点范式

Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））； Pengcheng Laboratory（鹏城实验室）

AI总结提出轨迹航点范式，通过TSDF引导的扩散策略预测可执行轨迹，解决VLN-CE中航点不可达与规划控制不一致问题，在基准上取得最优性能。

详情

AI中文摘要

连续环境中的视觉语言导航（VLN-CE）要求智能体在类似真实世界的环境中遵循自然语言指令进行导航。大多数VLN-CE方法采用三阶段框架：航点预测器提出可导航航点，导航器选择最佳航点，低层控制器执行移动。然而，这种解耦范式常导致航点不可达或规划与控制不一致。本文提出一种称为轨迹航点的新范式，将每个候选航点锚定到可执行轨迹上。为此，我们设计了TSDF引导的扩散策略作为轨迹航点预测器，引导轨迹生成避开障碍物，从本质上保证预测航点的可达性。进一步提出轨迹增强导航器，将关联轨迹作为额外信息注入规划，实现高层语义决策与低层执行的严格一致性。在VLN-CE基准上的大量实验表明，我们的轨迹航点范式优于基线方法。

英文摘要

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, a navigator selects the best one, and a low-level controller executes the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.

2606.07304 2026-06-08 cs.RO 新提交

CAPE: Contrastive Action-conditioned Parallel Encoding for Embodied Planning

CAPE: 用于具身规划的对比动作条件并行编码

Cong Chen, Haowen Wang, Zhixiang Zhang, Pei Ren, Zhengping Che

AI总结提出CAPE框架，通过对比学习区分不同动作序列的未来结果，实现高效视觉动力学建模，在真实世界和零样本迁移任务中显著提升规划性能并降低推理成本。

Comments 19 pages, 7 figures

详情

AI中文摘要

具身智能体需要在执行前预测候选动作的未来后果，以便有效规划。现有的视觉动力学模型通过重建未来视觉状态或展开密集潜在表示来学习，这会将学习能力分散到视觉显著但与规划无关的内容上，而不是驱动操作结果的动作条件变化。我们提出CAPE，一种对比动作条件并行编码框架，通过区分不同动作序列诱导的未来结果来学习视觉动力学。给定初始观察和候选动作序列，CAPE在单次前向传播中解码完整的未来潜在轨迹，并使用目标收敛对比目标进行训练，该目标对齐对应相同未来结果的预测，同时分离对应不同结果的预测。在真实世界DROID和零样本迁移到RoboCasa上，CAPE在状态检索、离线动作匹配和闭环规划方面显著优于先前基线，同时在长预测范围内显著降低了规划时的推理成本。

英文摘要

Embodied agents need to predict the future consequences of candidate actions in order to plan effectively before execution. Existing visual dynamics models learn by reconstructing future visual states or rolling out dense latent representations, which spreads learning capacity across visually salient but planning-irrelevant content rather than the action-conditioned changes that drive manipulation outcomes. We propose CAPE, a Contrastive Action-conditioned Parallel Encoding framework that learns visual dynamics by distinguishing the future outcomes induced by different action sequences. Given an initial observation and a candidate action sequence, CAPE decodes the full future latent trajectory in a single forward pass and is trained with a Goal-Convergent Contrastive Objective that aligns predictions corresponding to the same future outcome while separating those corresponding to different outcomes. On real-world DROID and zero-shot transfer to RoboCasa, CAPE substantially outperforms prior baselines on future-state retrieval, offline action matching, and closed-loop planning, while notably reducing planning-time inference cost at long prediction horizons.

2606.07383 2026-06-08 cs.RO cs.LG 新提交

RhinoVLA Technical Report

RhinoVLA 技术报告

Huixi Intelligence: Chen Zhang, Chenyang Zhou, Guanglei Ding, Guanghui He, Haibin Gao, Jiajia Chen, Jianyong Zhang, Lianyi Yu, Ningyi Xu, Ping Xu, Qingchen Li, Yingjun Hu, Yijia Zhang, Yuxi Liu

发表机构 * Huixi Intelligence（慧溪智能）

AI总结针对边缘硬件上VLA模型部署延迟问题，提出RhinoVLA，通过令牌高效骨干、连续动作专家和统一接口实现实时闭环控制，在Huixi R1上达到11.69 Hz推理速度。

详情

AI中文摘要

视觉-语言-动作（VLA）模型在机器人操作中展现出强大潜力，但在边缘硬件上的实时部署仍具挑战。本文中，我们识别出VLM视觉和上下文令牌是部署延迟的主要来源：对于以GEMM为主的投影算子，当模型维度固定时，计算量随输入令牌数量线性增长。基于此观察，我们提出RhinoVLA，一种与Huixi R1边缘SoC协同设计的面向部署的VLA模型。RhinoVLA采用令牌高效的Qwen3-VL骨干和连续动作专家，在保留预训练多模态能力的同时减少VLM侧的令牌和计算负担。为支持跨机器人学习，RhinoVLA进一步引入统一接口，结合视图注册表、72维物理状态-动作槽空间和机器人实例LoRA，使异构机器人观测和动作模式能在共享策略下对齐。在部署方面，RhinoVLA通过硬件感知编译、混合精度执行和并行视觉编码进行优化。实验表明，RhinoVLA在相似参数量下实现了与π0.5相当的下游性能，同时在Huixi R1上达到11.69 Hz的端到端推理，满足10 Hz实时闭环控制目标。该项目将在以下网址开源：https://github.com/HuixiAI/RhinoVLA。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment latency: for GEMM-dominated projection operators, computation grows linearly with the number of input tokens when model dimensions are fixed. Motivated by this observation, we propose RhinoVLA, a deployment-oriented VLA model co-designed with the Huixi R1 edge SoC. RhinoVLA adopts a token-efficient Qwen3-VL backbone and a continuous Action Expert, reducing the VLM-side token and computation burden while preserving pretrained multimodal capability. To support cross-robot learning, RhinoVLA further introduces a unified interface that combines a View Registry, a 72D physical state-action slot space, and robot-instance LoRA, allowing heterogeneous robot observations and action schemas to be aligned under a shared policy. On the deployment side, RhinoVLA is optimized through hardware-aware compilation, mixed-precision execution, and parallel visual encoding. Experiments show that RhinoVLA achieves downstream performance comparable to π0.5 at a similar parameter scale, while reaching 11.69 Hz end-to-end inference on Huixi R1, meeting the 10 Hz real-time closed-loop control target. The project will be open-sourced at https://github.com/HuixiAI/RhinoVLA.
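The linear-scaling claim for GEMM-dominated projections, and the headroom implied by the reported inference rate, can be checked with a few lines of arithmetic. This is a rough sketch: the token counts and layer dimensions are hypothetical, and only the 11.69 Hz and 10 Hz figures come from the abstract.

```python
def projection_flops(num_tokens, d_in, d_out):
    """Approximate FLOPs of one dense projection (a GEMM): one multiply and one add per weight per token."""
    return 2 * num_tokens * d_in * d_out


def control_period_ms(rate_hz):
    """Time budget per control step, in milliseconds, for a given loop rate."""
    return 1000.0 / rate_hz


# Hypothetical layer size; halving the token count halves the projection cost.
full = projection_flops(num_tokens=512, d_in=2048, d_out=2048)
reduced = projection_flops(num_tokens=256, d_in=2048, d_out=2048)
print(full / reduced)

# Reported 11.69 Hz inference versus the 10 Hz closed-loop target.
print(control_period_ms(11.69), control_period_ms(10.0))
```

With the model dimensions fixed, the token count is the only factor left to reduce in these operators, which is why the abstract targets visual and context tokens.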

2606.07386 2026-06-08 cs.RO 新提交

Spline Policy: A Structured Representation for Robot Policies

样条策略：机器人策略的结构化表示

Mengze Tian, Yiming Li, Sichao Liu, Auke Ijspeert, Sylvain Calinon

发表机构 * École Polytechnique Fédérale de Lausanne (EPFL)（洛桑联邦理工学院）； Idiap Research Institute（Idiap研究所）

AI总结提出样条策略（SP），用样条参数替代动作块，保留策略主干，支持连续轨迹解码、时域重采样、参数空间编辑及下游控制，并具有局部修正机制。

Comments This work has been submitted to the IEEE for possible publication

详情

AI中文摘要

现代机器人操作的模仿学习策略通常将动作表示为固定分辨率的动作块，这种方法简单有效，但在执行前暴露的几何和时间结构有限。本文研究了样条策略（SP），一种结构化表示，它用样条参数替换动作块，同时保持策略主干不变。预测的样条可以解码为紧凑的连续轨迹，在不同时间分辨率下查询，在参数空间中进行约束或编辑，并传递给下游控制器。对于二次样条输出，相同的表示还可以通过解析距离场构造转换为状态依赖的向量场。在该构造的正则性和投影假设下，诱导的动力学不会增加与生成样条的距离，从而在预测运动周围产生有原则的局部修正机制。样条输出进一步支持从观测到样条参数、轨迹和流场的不确定性传播，并且可以与经典控制机制（如零空间碰撞避免）结合，而无需重新训练策略主干。我们使用扩散、流匹配、基于Transformer和视觉-语言-动作主干实例化了SP。在低维运动学习、匹配主干下的模拟操作、灵巧操作以及真实机器人案例研究中的实验表明，SP与现代策略学习器兼容，同时暴露了有用的运动结构特性，包括紧凑解码、时间重采样、预测运动周围的局部修正、不确定性评估和控制器兼容性。

英文摘要

Modern imitation-learning policies for robot manipulation often represent actions as fixed-resolution action chunks, which are simple and effective but expose limited geometric and temporal structure before execution. This paper studies Spline Policy (SP), a structured representation that replaces action chunks with spline parameters while keeping the policy backbone unchanged. The predicted spline can be decoded as a compact continuous trajectory, queried at different temporal resolutions, constrained or edited in parameter space, and passed to downstream controllers. For quadratic spline outputs, the same representation can also be converted into a state-dependent vector field through an analytical distance-field construction. Under the regularity and projection assumptions of this construction, the induced dynamics do not increase the distance to the generated spline, yielding a principled local corrective mechanism around the predicted motion. The spline output further supports uncertainty propagation from observations to spline parameters, trajectories, and flow fields, and can be combined with classical control mechanisms such as null-space collision avoidance without retraining the policy backbone. We instantiate SP with diffusion, flow-matching, transformer-based, and vision-language-action backbones. Experiments in low-dimensional motion learning, simulated manipulation under matched backbones, dexterous manipulation, and real-robot case studies show that SP remains compatible with modern policy learners while exposing useful motion-structure properties, including compact decoding, temporal resampling, local correction around predicted motions, uncertainty evaluation, and controller compatibility.
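The idea of decoding one set of spline parameters at several temporal resolutions can be sketched as follows. The abstract does not specify the spline parameterization, so this example assumes a uniform quadratic B-spline over 2D control points; `decode_spline` and the control-point values are hypothetical.

```python
def decode_spline(control_points, num_samples):
    """Sample a uniform quadratic B-spline at num_samples evenly spaced parameter values.

    control_points: list of (x, y) tuples, at least three.
    num_samples: number of trajectory points to emit, at least two.
    """
    num_segments = len(control_points) - 2
    if num_segments < 1 or num_samples < 2:
        raise ValueError("need at least 3 control points and 2 samples")
    trajectory = []
    for k in range(num_samples):
        s = (k / (num_samples - 1)) * num_segments  # global parameter in [0, num_segments]
        i = min(int(s), num_segments - 1)           # segment index
        u = s - i                                   # local parameter in [0, 1]
        # Uniform quadratic B-spline basis functions; they sum to 1 for every u.
        b0 = 0.5 * (1.0 - u) ** 2
        b1 = 0.5 * (-2.0 * u ** 2 + 2.0 * u + 1.0)
        b2 = 0.5 * u ** 2
        p0, p1, p2 = control_points[i], control_points[i + 1], control_points[i + 2]
        trajectory.append(tuple(b0 * a + b1 * b + b2 * c for a, b, c in zip(p0, p1, p2)))
    return trajectory


# The same parameters decoded as a coarse and a fine trajectory.
params = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0), (3.0, 2.0)]
coarse = decode_spline(params, 5)
fine = decode_spline(params, 50)
print(len(coarse), len(fine), coarse[0] == fine[0], coarse[-1] == fine[-1])
```

The policy output stays a handful of control points, while the controller can request as many trajectory samples as its update rate requires.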

2606.07389 2026-06-08 cs.RO 新提交

Simulation-Driven Imitation Learning for Biosignals-Free Shared-Autonomy Prosthetic Grasping

模拟驱动的无生物信号共享自主假肢抓取模仿学习

Kaijie Shi, Wanglong Lu, Huiling Chen, Vinicius Prado da Fonseca, Ting Zou, Hanli Zhao, Xianta Jiang

发表机构 * Memorial University of Newfoundland（纽芬兰纪念大学）； Wenzhou University（温州大学）

AI总结提出一个自动生成多样化抓取演示的模拟框架，结合物理可行抓取合成、自然到达轨迹重定向和程序化环境执行，通过模仿学习实现高成功率和强泛化能力的假肢控制。

详情

AI中文摘要

无生物信号的上肢假肢共享自主控制旨在不依赖EMG或其他生理信号的情况下实现自然且低努力的操作。最近的基于模仿学习的方法显示出有希望的结果，但其可扩展性受到收集大量真实世界人类演示数据的成本和变异性的限制。在这项工作中，我们提出了一个可扩展的模拟框架，该框架从腕部安装的虚拟摄像头自动生成多样化的到达-抓取演示。该框架结合了物理可行的抓取合成、自然到达轨迹重定向以及在程序化生成的室内环境中的到达-抓取-提升执行。它记录腕部视角观察、本体感觉和动作，以构建用于模仿学习的大规模演示数据集。通过广泛的模拟基准测试，我们评估了物体和场景的泛化能力，并比较了几种代表性的最先进模仿学习方法。结果表明，模拟演示足够丰富和一致，可用于有效的策略学习。在三个现实场景中，学习到的模拟到现实策略实现了超过90%的抓取成功率，超越了基线方法，并表现出更强的泛化能力，突显了模拟驱动训练在无生物信号共享自主假肢抓取中的前景。演示可在 https://sites.google.com/view/sim-prosthetic-grasp/home 获取。

英文摘要

Biosignals-free shared-autonomy control of upper-limb prosthetic hands aims to enable natural and low-effort manipulation without relying on EMG or other physiological signals. Recent imitation-learning-based approaches have shown promising results, but their scalability is limited by the cost and variability of collecting large amounts of real-world human demonstration data. In this work, we present a scalable simulation framework that automatically generates diverse reach-to-grasp demonstrations from a wrist-mounted virtual camera. The framework combines physically feasible grasp synthesis, retargeting of natural reaching trajectories, and reach-grasp-lift execution in procedurally generated indoor environments. It records wrist-view observations, proprioception, and actions to build a large-scale demonstration dataset for imitation learning. Through extensive simulation benchmarks, we evaluate object and scene generalization and compare several representative state-of-the-art imitation learning methods. Results show that the simulated demonstrations are sufficiently rich and consistent for effective policy learning. In three realistic settings, the learned sim-to-real policy achieves over 90% grasp success, surpasses baseline methods, and exhibits stronger generalization, highlighting the promise of simulation-driven training for biosignals-free shared-autonomy prosthetic grasping. The demonstrations are available at https://sites.google.com/view/sim-prosthetic-grasp/home.

2606.07424 2026-06-08 cs.RO 新提交

Rapid co-design of Buoyancy-assisted robots for Challenging Locomotion using Gaussian Evolutionary Specialists

基于高斯进化专家的浮力辅助机器人快速协同设计以应对挑战性运动

Ankit Sinha, Nitish Sontakke, Dennis Hong, Yusuke Tanaka, Sehoon Ha

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； University of California, Los Angeles（加州大学洛杉矶分校）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出高斯进化专家（GES）框架，通过解耦设计空间划分与策略学习，在浮力辅助轻量腿单元（BALLU）上实现5-25%性能提升，并缩短37%设计优化时间。

Comments Submitted to RA-L

详情

AI中文摘要

设计高性能腿式机器人需要联合优化形态和控制。无模型强化学习（RL）为开发鲁棒控制器提供了模型预测控制的替代方案，无需明确指定机器人动力学。因此，我们看到了使用RL训练控制器和评估设计以优化机器人形态。虽然RL在运动方面取得了成功，但由于重复的策略训练，将其用于协同设计内循环成本高昂。基于形态的条件通用策略提供了一种有前景的替代方案，但遭受行为多样性崩溃，收敛到单一策略，在不同设计上表现次优。另一方面，端到端混合专家（MoE）架构因其表示崩溃而失败。我们提出高斯进化专家（GES），一个将设计空间划分与策略学习解耦以显式捕获多样行为的框架。GES将专家策略分配给演化的高斯区域，并通过训练、探测和领土扩展迭代优化它们。生成的专家被集成到设计采样循环中，用直接评估替代昂贵的重新训练。在浮力辅助轻量腿单元（BALLU）上测试时，GES发现的设计比朴素通用策略性能高5-25%。在硬件上，GES优化设计克服了24厘米高的障碍——比基线BALLU设计提升3倍。此外，GES将设计优化时间缩短了37%。

英文摘要

Designing high-performance legged robots requires jointly optimizing morphology and control. Model-free Reinforcement Learning (RL) offers an alternative to model-predictive control for developing robust controllers without explicitly specifying robot dynamics. Thus, we have seen the use of RL to train controllers and evaluate designs for robot morphology optimization. While RL has shown success in locomotion, using it in the co-design inner loop is expensive due to repeated policy training. Universal policies conditioned on morphology offer a promising alternative, but suffer from behavioral diversity collapse, converging to a single strategy that performs sub-optimally across designs. On the other hand, end-to-end Mixture-of-Experts (MoE) architectures fail due to a collapse in their representation. We propose Gaussian Evolutionary Specialists (GES), a framework that decouples design-space partitioning from policy learning to capture diverse behaviors explicitly. GES assigns specialist policies to evolving Gaussian regions and iteratively refines them via training, probing, and territory expansion. The resulting specialists are integrated into a design sampling loop, replacing costly re-training with direct evaluation. When tested on the Buoyancy-Assisted Light Legged Unit (BALLU), GES discovers designs with 5-25% higher performance than naive universal policies. On hardware, a GES-optimized design overcomes a 24 cm tall obstacle, a 3x improvement over the baseline BALLU design. Moreover, GES curtails design optimization time by 37%.

2606.07437 2026-06-08 cs.RO cs.AI cs.HC cs.SE cs.SY eess.SY 新提交

AEGIS：物理AI的备份反射

Josef Chen

发表机构 * KAIKAKU

AI总结提出AEGIS方法，通过在弱策略的冻结激活上使用轻量级探针检测高风险步骤，仅在必要时切换到强策略，在LIBERO-Spatial上恢复了弱策略损失的10.1%轨迹。

详情

AI中文摘要

长时域机器人操作往往逐渐失败：一个坏步骤会降低状态，策略会陷入无法恢复的盆地。失败在发生之前通常是可见的。我们引入了AEGIS（激活探针早期预警、门控推理切换），一种选择性升级方法，通过在弱策略的冻结激活上使用轻量级探针，在仍有时间采取行动时检测高风险步骤。当探针标记一个步骤时，控制权切换到更强的独立策略，但仅限于需要它的步骤。在LIBERO-Spatial上，AEGIS恢复了弱策略单独损失的10.1%的轨迹，而预算匹配的盲目升级为4.6%，随机触发安慰剂为5.1%。这些增益在单侧精确配对McNemar检验中显著，经Holm-Bonferroni调整，三个预注册对比：比盲目升级高5.4个百分点，p=8.5e-6；比随机触发高5.0个百分点，p=1.0e-4；配对轨迹自举置信区间排除零。AEGIS仅在38%的步骤上激活强策略，因此杠杆是时机而非计算。探针在早期窗口AUROC为0.764，95% CI [0.70, 0.84]，在首次切换前从弱策略路径的前30%轨迹步骤中读取。我们预注册了完整的分析计划，包括条件恢复任务率估计量和明确的终止标准，并在每臂700个公共随机数情节上确认了结果，nA-fail=646。

英文摘要

Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it cannot recover. The failure is often visible before it happens. We introduce AEGIS (Activation-probe Early-warning, Gated Inference Switching), a selective escalation method that uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps while there is still time to act. When the probe flags a step, control switches to a stronger separate policy, but only for the steps that need it. On LIBERO-Spatial, AEGIS recovers 10.1% of the trajectories the weak policy alone loses, versus 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo. These gains are significant under one-sided exact paired McNemar tests with Holm-Bonferroni adjustment over three pre-registered contrasts: +5.4pp over blind escalation, p=8.5e-6; +5.0pp over random triggering, p=1.0e-4; paired-trajectory bootstrap CIs exclude zero. AEGIS activates the stronger policy on only 38% of steps, so the lever is timing rather than compute. The probe clears its precondition with an early-window AUROC of 0.764, 95% CI [0.70, 0.84], read from the weak-policy path over the first 30% of trajectory steps before any handoff. We pre-register the full analysis plan, including a conditional recovered-task-rate estimand and explicit kill criteria, and confirm the result on 700 common-random-number episodes per arm, with nA-fail=646.
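The statistical procedure named in this abstract (one-sided exact paired McNemar tests with Holm-Bonferroni adjustment) can be sketched in plain Python. The discordant-pair counts below are hypothetical placeholders, not the paper's data, and the function names are illustrative.

```python
from math import comb


def mcnemar_exact_one_sided(b, c):
    """One-sided exact McNemar p-value.

    b: paired episodes where only the method succeeds.
    c: paired episodes where only the comparator succeeds.
    Under the null, b ~ Binomial(b + c, 0.5); returns P(X >= b).
    """
    n = b + c
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n


def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Hypothetical discordant counts for three contrasts (illustrative only).
raw = [mcnemar_exact_one_sided(b, c) for b, c in [(40, 10), (38, 12), (20, 18)]]
print(holm_adjust(raw))
```

Only discordant pairs enter the test, which is why common-random-number episodes per arm matter: pairing removes the variance shared by both policies on the same episode.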

2606.07100 2026-06-08 cs.CV cs.RO 交叉投稿

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

LARA: 视觉-语言-动作模型的潜在动作表示对齐

Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出LARA框架，通过表示对齐联合优化潜在动作模型和视觉-语言-动作模型，利用人类视频数据提升机器人操作性能，在模拟和真实基准上平均提升约10%、5%和15%。

详情

AI中文摘要

视觉-语言动作（VLA）模型使机器人能够直接从观测和语言指令预测动作，但其性能依赖于大规模、高质量数据，并受到真实机器人动作数据集稀缺的限制。为了利用丰富的未标记人类视频促进VLA模型学习，潜在动作模型（LAM）从视觉动态中学习潜在动作表示，为VLA学习提供额外监督。然而，LAM和VLA通常分开训练，导致LAM在VLA训练期间未接地，且VLA模型受冻结的LAM表示约束。为解决这些问题，我们提出潜在动作表示对齐（LARA），一种即插即用框架，通过表示对齐联合优化LAM和VLA。这使得LAM能够利用动作轨迹学习以避免虚假视觉变化，同时VLA通过LAM中学习的前向动力学进行正则化，减少功能无效轨迹的幻觉。我们展示了LARA在预训练、预训练VLA模型的后训练增强以及LAM细化中的多功能性和有效性，在3个模拟和1个精心设计的真实机器人操作基准上平均提升约10%、约5%和约15%。

英文摘要

Vision-language-action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAMs) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAMs and VLAs are typically trained separately, leaving the LAM ungrounded during VLA training and the VLA model constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes LAM and VLA via representation alignment. This enables reciprocal benefits where LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving average improvements of ~10%, ~5%, and ~15% across 3 simulation benchmarks and 1 meticulously designed real-world robotic manipulation benchmark.

2606.07233 2026-06-08 cs.CV cs.LG cs.RO 交叉投稿

Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking

外观有帮助吗？在线3D多行人追踪中基于图像的重识别系统研究

Eduardo Borges, Luís Garrote, Urbano J. Nunes

发表机构 * Institute of Systems and Robotics, Department of Electrical and Computer Engineering, University of Coimbra（系统与机器人研究所，电气与计算机工程系，科英布拉大学）

AI总结系统研究轻量级投影框架下图像重识别在在线3D多目标追踪中的作用，提出级联匹配策略以在低延迟下恢复遮挡轨迹并防止身份切换。

Comments Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

详情

AI中文摘要

基于LiDAR的3D多目标追踪通常仅依赖几何信息，这在长时间遮挡或拥挤人群环境中往往不足以区分目标。虽然集成基于RGB的重识别提供了保持身份上下文的理论解决方案，但现有方法通常依赖计算昂贵的并行检测器，阻碍了机器人的实时响应。本文通过利用轻量级投影框架解耦移动机器人的几何和外观建模，对在线3D多目标追踪中的基于图像的重识别进行了系统研究。对特征提取架构进行了全面分析，采用轻量级CNN和视觉Transformer，并评估了多种多模态数据关联策略以平衡计算延迟和鲁棒追踪。在KITTI数据集的行人类别上的实验表明，外观和运动成本的朴素线性融合由于视觉噪声而降低了性能。相反，级联匹配策略成功恢复了被遮挡的轨迹而不损害整体精度，有效防止了身份切换以维持人机交互的连续性。我们表明，轻量级架构可以在安全导航所需的低延迟和社交意识所需的判别能力之间提供最优权衡。

英文摘要

LiDAR-based 3D Multi-Object Tracking (MOT) typically relies solely on geometric information, which is often insufficient to distinguish between targets during prolonged occlusions or in crowded human-populated environments. While integrating RGB-based Re-Identification (ReID) offers a theoretical solution for preserving identity context, existing approaches often rely on computationally expensive parallel detectors that hinder real-time robot responsiveness. This work presents a systematic study of image-based ReID in online 3D MOT, utilizing a lightweight projection-based framework to decouple geometric and appearance modeling for mobile robots. A comprehensive analysis of feature extraction architectures is conducted, employing lightweight CNNs and Vision Transformers, and evaluating various multi-modal data association strategies to balance computational latency with robust tracking. Experiments on the Pedestrian class of the KITTI dataset reveal that naive linear fusion of appearance and motion costs degrades performance due to visual noise. Conversely, a cascaded matching strategy successfully recovers occluded tracks without compromising overall precision, effectively preventing identity switches to maintain human-robot interaction continuity. We show that lightweight architectures can offer an optimal trade-off between the low latency required for safe navigation and the discriminative power needed for social awareness.
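A cascaded association, in contrast to fusing both costs into one matrix, might look like the sketch below. The abstract does not describe the stage order, the gates, or the assignment solver, so everything here is an assumption: stage one matches on motion cost, stage two uses appearance cost only for leftover tracks and detections, and a greedy matcher stands in for whatever solver the authors use.

```python
def greedy_match(cost, rows, cols, gate):
    """Greedily pair rows and columns in order of increasing cost, ignoring pairs above the gate."""
    candidates = sorted((cost[r][c], r, c) for r in rows for c in cols if cost[r][c] <= gate)
    used_rows, used_cols, matches = set(), set(), []
    for _, r, c in candidates:
        if r not in used_rows and c not in used_cols:
            matches.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return matches


def cascaded_match(motion_cost, appearance_cost, motion_gate, appearance_gate):
    """Two-stage association over track-by-detection cost matrices (hypothetical stage order)."""
    num_tracks = len(motion_cost)
    num_dets = len(motion_cost[0]) if num_tracks else 0
    # Stage 1: geometry only.
    stage1 = greedy_match(motion_cost, range(num_tracks), range(num_dets), motion_gate)
    matched_tracks = {t for t, _ in stage1}
    matched_dets = {d for _, d in stage1}
    left_tracks = [t for t in range(num_tracks) if t not in matched_tracks]
    left_dets = [d for d in range(num_dets) if d not in matched_dets]
    # Stage 2: appearance only, restricted to what geometry could not resolve.
    stage2 = greedy_match(appearance_cost, left_tracks, left_dets, appearance_gate)
    return stage1 + stage2


# Track 1 was occluded, so its motion cost is high, but its appearance still matches detection 1.
motion = [[0.1, 9.0], [9.0, 9.0]]
appearance = [[0.5, 0.9], [0.9, 0.2]]
print(cascaded_match(motion, appearance, motion_gate=1.0, appearance_gate=0.3))
```

Keeping appearance out of the first stage means noisy visual features cannot disturb associations that geometry already resolves, which is one plausible reading of why the cascade avoids the degradation seen with linear fusion.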

2606.07366 2026-06-08 cs.CV cs.LG cs.RO 交叉投稿

Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Dash2Sim: 来自野外行车记录仪视频的闭环驾驶仿真

Anurag Ghosh, Francesco Pittaluga, Khiem Vuong, Angela Chen, Juan Alvarez-Padilla, Manmohan Chandraker, Srinivasa Narasimhan

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； NEC Labs America（NEC美国实验室）； MIT（麻省理工学院）； UC San Diego（加州大学圣地亚哥分校）

AI总结提出Dash2Sim框架，将单目行车记录仪视频转化为度量级、地理参考的4D驾驶日志，用于闭环仿真，并构建ROADWork4D基准数据集，验证了施工区场景对规划器的挑战。

详情

AI中文摘要

自动驾驶仿真通常依赖于在少数城市收集的数据或手工编写的合成场景。行车记录仪视频覆盖了更广泛的位置和情况，包括罕见或长尾场景。由于难以从单目野外视频中恢复准确的4D场景，它们被认为不太适用于仿真。施工区是行车记录仪捕捉到的一类长尾情况。我们提出Dash2Sim，一个将野外单目行车记录仪视频转化为度量级、地理参考的4D驾驶日志并与现有仿真器兼容的框架，并针对独立维护的地图验证每个日志，无需标注。我们将Dash2Sim应用于大型视频语料库，创建了ROADWork4D基准数据集，涵盖17个城市的4,244个场景和270万个3D对象。在验证子集ROADWork4D-CL（2,201个场景）上，我们研究了特权闭环规划器，发现施工区场景具有挑战性：尽管基于规则和混合规划器的泛化能力优于基于学习的规划器，但所有规划器均表现不足，无法完成临时施工区通道所需的变道。在规划之外，Dash2Sim恢复的密集深度在新视角合成质量上提高了高达19%（基于感知指标），表明其具有为单目视频的闭环传感器仿真提供丰富条件的潜力。

英文摘要

Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.

2606.07449 2026-06-08 eess.SY cs.RO cs.SY 交叉投稿

On orbital stabilization of a circular motion primitive for a dynamic extension of the Dubins car model

关于Dubins汽车模型动态扩展的圆形运动原语的轨道镇定

Artem Angelchev-Shiryaev, Pavel E. Aleshin, Anton S. Shiriaev, Pavel A. Shamanaev, Leonid B. Freidovich

发表机构 * Department of Industrial and Mechanical Sciences, Lund University（隆德大学工业与机械科学系）； Department of Information Technologies, Sirius University（西里乌斯大学信息技术系）； Department of Engineering Cybernetics, NTNU（挪威科技大学工程控制论系）； Department of Applied Physics and Electronics, Umeå University（于默奥大学应用物理与电子系）

AI总结针对Dubins汽车模型动态扩展的圆形运动原语，在横向线性化框架下研究轨道镇定，发现标准方法因横向线性化不稳定而失效，提出一组显式可验证条件使控制器设计仍适用。

Comments 34 pages

2606.07476 2026-06-08 eess.SY cs.RO cs.SY eess.SP cross-list

Physiologically Constrained Musculoskeletal Neural Network for Multi-DoF Joint Kinematics Estimation from Partially Observed sEMG

生理约束下的肌肉骨骼神经网络用于部分观测sEMG的多自由度关节运动学估计

Wending Heng, Mingming Zhang, Glen Cooper, Zhenhong Li

Affiliations * University of Manchester; Southern University of Science and Technology

AI summary: Proposes a musculoskeletal neural network (MSK-NN) that combines a CNN with a musculoskeletal forward dynamics module to estimate multi-DoF joint angles from partially observed surface electromyography (sEMG), using a composite loss to infer physiologically plausible muscle activations.

AI中文摘要

本文研究了在部分观测表面肌电信号(sEMG)下的多自由度(DoF)关节运动学估计问题，其中由于解剖不可及性或传感器限制，只能测量任务相关肌肉的子集。提出了一种新颖的肌肉骨骼神经网络(MSK-NN)，用于估计多自由度关节角度，同时推断已测量和未测量肌肉的激活。MSK-NN由一个基于CNN的肌肉激活估计器和一个嵌入的MSK前向动力学模块组成，形成完全可微的架构。与需要额外生物力学标签(如肌肉-肌腱力、关节力矩)的现有混合神经框架不同，MSK-NN在没有内部生物力学变量直接监督的情况下进行训练。通过结合关节运动学损失、数据驱动的肌肉协同损失和解剖引导的趋势损失，设计了复合物理-生理损失。该方法在不受约束的速度和幅度下的三种节律运动和一种随机运动上评估了二自由度腕关节运动学估计。与CNN、Bi-LSTM、CNN-LSTM和PET基线相比，MSK-NN实现了更低的归一化均方根误差(NRMSE)和更高的决定系数(R2)，尤其是在随机运动中。更重要的是，优化的MSK参数保持在生理极限内，并且输入排除肌肉的估计激活与其记录的sEMG包络表现出强烈的时间一致性，证明了MSK-NN恢复生理合理激活的能力。

英文摘要

This paper investigates multi-degree-of-freedom (DoF) joint kinematics estimation under partially observed surface electromyography (sEMG), where only a subset of task-relevant muscles can be measured due to anatomical inaccessibility or sensor constraints. A novel musculoskeletal neural network (MSK-NN) is proposed to estimate multi-DoF joint angles while simultaneously inferring activations for both measured and unmeasured muscles. MSK-NN consists of a CNN-based muscle activation estimator and an embedded MSK forward dynamics module, forming a fully differentiable architecture. Unlike existing hybrid neural frameworks that require additional biomechanical labels (e.g., muscle-tendon forces, joint torques), MSK-NN is trained without direct supervision of internal biomechanical variables. A composite physics-physiology loss is designed by incorporating a joint kinematics loss, a data-driven muscle synergy loss, and an anatomy-guided trend loss. The proposed method is evaluated on two-DoF wrist kinematics estimation across three rhythmic motions with unconstrained speed and amplitude, and one random motion. Compared with CNN, Bi-LSTM, CNN-LSTM, and PET baselines, MSK-NN achieves lower normalized root mean square error (NRMSE) and higher coefficient of determination (R²), especially for the random motion. More importantly, the optimized MSK parameters remain within physiological limits, and the estimated activation of an input-excluded muscle exhibits strong temporal agreement with its recorded sEMG envelope, demonstrating the capability of MSK-NN to recover physiologically plausible activations.
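
The two reported metrics can be computed as in the following minimal sketch. It assumes NRMSE is the RMSE divided by the range of the reference signal, which is one common convention; the abstract does not say which normalization the authors use. The function names and the sample joint angles are illustrative only.

```python
import math
from typing import Sequence


def nrmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """RMSE divided by the range of the reference signal (assumed convention)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    span = max(y_true) - min(y_true)
    if span == 0:
        raise ValueError("reference signal has zero range")
    return math.sqrt(mse) / span


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("reference signal has zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical wrist angles in degrees: reference vs. estimate.
angles_true = [0.0, 10.0, 20.0, 30.0]
angles_pred = [1.0, 9.0, 21.0, 29.0]
print(nrmse(angles_true, angles_pred), r2_score(angles_true, angles_pred))
```

Lower NRMSE and higher R² both indicate a closer match, so the two numbers move in opposite directions as an estimator improves.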

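2501.15768 2026-06-08 cs.RO cs.SY eess.SY updated version

Error-State LQR Formulation for Quadrotor UAV Trajectory Tracking

四旋翼无人机轨迹跟踪的误差状态LQR公式

Micah Reich

Affiliations * Carnegie Mellon University

AI summary: Proposes an error-state linear quadratic regulator (LQR) approach to quadrotor UAV trajectory tracking that represents attitude error in exponential coordinates and combines full-state feedback with a cascaded body-rate controller for robust control.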

2502.16531 2026-06-08 cs.RO cs.SY eess.SY updated version

Efficient Coordination and Synchronization of Multi-Robot Systems Under Recurring Linear Temporal Logic

基于循环线性时序逻辑的多机器人系统高效协调与同步

Davide Peron, Victor Nan Fernandez-Ayala, Eleftherios E. Vlahakis, Dimos V. Dimarogonas

Affiliations * Department of Information Engineering, University of Padova; Division of Decision and Control Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology

AI summary: Proposes a bottom-up approach that combines offline plan synthesis with online coordination, adjusts plans dynamically via real-time communication, and adds a synchronization mechanism to handle action delays, yielding a scalable coordination and synchronization framework for multi-robot systems.

DOI: 10.1109/ICRA55743.2025.11127554
Journal ref: Proc. IEEE ICRA, 2025, pp. 10194-10200

AI中文摘要

我们考虑在循环任务下形式化为线性时序逻辑（LTL）规范的多机器人系统。为了高效解决规划问题，我们提出了一种自底向上的方法，将离线计划综合与在线协调相结合，通过实时通信动态调整计划。为了解决动作延迟，我们引入了一种同步机制，确保协调的任务执行，从而得到一个适用于广泛多机器人应用的多智能体协调与同步框架。该软件包使用Python和ROS2开发，便于广泛部署。我们通过涉及九个机器人的实验室实验验证了我们的发现，显示出与先前方法相比增强的适应性。此外，我们进行了多达九十个智能体的仿真，以展示我们工作降低的计算复杂性和可扩展性特征。

英文摘要

We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom-up approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real-time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi-agent coordination and synchronization framework that is adaptable to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability features of our work.
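
As a rough illustration of what a recurring task means for a plan, the sketch below checks recurrence properties of the form G F p ("p holds infinitely often") on a plan stored in prefix-suffix (lasso) form, where the suffix repeats forever. This representation and all names (`LassoPlan`, `holds_infinitely_often`) are assumptions made for illustration; the abstract does not describe the authors' data structures or the API of their Python/ROS2 package.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

Labels = FrozenSet[str]  # atomic propositions that hold at one plan step


@dataclass(frozen=True)
class LassoPlan:
    """Hypothetical plan: a finite prefix followed by a suffix repeated forever."""
    prefix: List[Labels]
    suffix: List[Labels]

    def __post_init__(self) -> None:
        if not self.suffix:
            raise ValueError("the repeating suffix must be non-empty")

    def unroll(self, steps: int) -> List[Labels]:
        """Return the first `steps` label sets of the infinite execution."""
        out = list(self.prefix[:steps])
        i = 0
        while len(out) < steps:
            out.append(self.suffix[i % len(self.suffix)])
            i += 1
        return out


def holds_infinitely_often(plan: LassoPlan, proposition: str) -> bool:
    """G F p holds on prefix.(suffix)^omega iff p occurs somewhere in the suffix."""
    return any(proposition in labels for labels in plan.suffix)


plan = LassoPlan(
    prefix=[frozenset({"dock"})],
    suffix=[frozenset({"pickup"}), frozenset(), frozenset({"dropoff"})],
)
print(holds_infinitely_often(plan, "pickup"))  # visited on every loop
print(holds_infinitely_often(plan, "dock"))    # only visited once, in the prefix
```

Only the suffix matters for recurrence, because anything that happens in the prefix happens finitely many times.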

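2504.10102 2026-06-08 cs.RO cs.SY eess.SY updated version

CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks

CRAFT：利用基础模型自主教练强化学习以完成多机器人协调任务

Seoyeon Choi, Kanghyun Ryu, Jonghoon Ock, Negar Mehr

Affiliations * Department of Mechanical Engineering, University of California Berkeley

AI summary: Proposes CRAFT, a framework that uses large language models to decompose tasks and generate reward functions, refines the rewards with a vision language model, and learns multi-robot coordination. It is validated on multi-quadruped navigation and bimanual manipulation tasks.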

AI中文摘要

多智能体强化学习（MARL）为多智能体系统中的协调学习提供了强大的框架。然而，由于机器人具有高维连续联合动作空间、复杂的奖励设计以及并发学习智能体带来的非平稳性，将MARL应用于机器人领域仍然具有挑战性。另一方面，人类通常在教练的帮助下学习复杂的协调任务，教练通过精心设计的课程和详细反馈来指导学习。基于基础模型的推理能力，我们认为这些模型可以类似地教练机器人学习协调。受此启发，我们提出了CRAFT：利用基础模型自主教练强化学习以完成协调任务，这是一个利用基础模型作为多机器人协调“教练”的框架。CRAFT利用大语言模型（LLMs）的规划能力，自动将长时域协调任务分解为子任务序列。然后，CRAFT使用LLM生成的奖励函数训练每个子任务，并通过视觉语言模型（VLM）引导的奖励细化循环来改进它们。我们在多四足导航和双臂操作任务上评估了CRAFT，并展示了其学习复杂协调行为的能力。此外，在多四足导航设置中，我们展示了学到的策略可以迁移到现实世界。项目网站：https://iconlab.negarmehr.com/CRAFT/

英文摘要

Multi-Agent Reinforcement Learning (MARL) provides a powerful framework for learning coordination in multi-agent systems. However, applying MARL to robotics remains challenging due to their high-dimensional continuous joint action spaces, complex reward design, and non-stationarity from concurrently learning agents. On the other hand, humans often learn complex coordination with the help of coaches, who guide learning through carefully designed curricula and detailed feedback. Building on the reasoning capabilities of foundation models, we argue that these models can similarly coach robots to learn coordination. Motivated by this, we propose CRAFT: Coaching Reinforcement learning Autonomously using Foundation models for learning coordination Tasks, a framework that leverages foundation models to act as a "coach" for multi-robot coordination. CRAFT automatically decomposes long-horizon coordination tasks into sequences of subtasks using the planning capability of Large Language Models (LLMs). Then, CRAFT trains each subtask using LLM-generated reward functions, and refines them through a Vision Language Model (VLM)-guided reward-refinement loop. We evaluate CRAFT on multi-quadruped navigation and bimanual manipulation tasks, and demonstrate its capability to learn complex coordination behaviors. In addition, in a multi-quadruped navigation setting, we show that our learned policies transfer to the real world. Project website is https://iconlab.negarmehr.com/CRAFT/

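2510.11014 2026-06-08 cs.RO cs.AI cs.CV updated version

MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models

MatterDoor: 使用生成模型采样零样本空间语义先验

Subhransu S. Bhattacharjee, Hao Lu, Dylan Campbell, Rahul Shome

Affiliations * School of Computing, Australian National University

AI summary: For robots that see a room only partially through a doorway, proposes a pipeline built on pretrained generative models (VLM-guided outpainting, monocular depth estimation, semantic segmentation) that samples semantically labeled 3D point cloud priors of the hidden room, and evaluates these zero-shot spatio-semantic priors on MatterDoor, a Matterport3D-derived benchmark.

Comments Under Review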

AI中文摘要

自主机器人通常只能通过门缝部分观察房间，墙壁和场景结构隐藏了安全导航和目标导向行动所需的几何和任务相关语义。我们询问现成的预训练生成视觉模型能否将这些缺失结构作为零样本离线先验用于机器人推理。此类先验应支持对未观察结构的空间语义查询，估计隐藏区域中的目标物体似然以及这些区域被占用的概率。给定一个以自我为中心的RGB观测和目标查询，我们的流程使用VLM引导的外推、单目深度估计和语义分割来采样隐藏房间的语义标注3D点云假设。我们引入了MatterDoor，一个源自Matterport3D的门遮挡室内场景基准，并使用生成指标和模拟Stretch机器人目标到达任务评估所得先验。我们的结果表明，无需针对特定问题微调即可推导出对规划有用的空间语义先验。

英文摘要

Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can derive this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.

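2511.12795 2026-06-08 cs.RO updated version

ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model

ActiveGrasp: 基于校准能量模型的信息引导主动抓取

Boshu Lei, Wen Jiang, Kostas Daniilidis

Affiliations * University of Pennsylvania; Archimedes, Athena RC

AI summary: For grasping in densely cluttered environments, proposes a calibrated energy-based model for grasp pose generation and actively selects viewpoints by the information gain of the grasp distribution, so the target object can be grasped efficiently under a limited view budget.

Comments CVPR 2026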

AI中文摘要

在密集杂乱环境中抓取对机器人是一项具有挑战性的任务。以往的方法试图通过在抓取姿态生成前主动收集多个视角来解决这个问题。然而，它们要么忽略了抓取分布对信息增益估计的重要性，要么依赖于抓取分布的投影，这忽略了SE(3)流形上抓取姿态的结构。为了应对这些挑战，我们提出了一种用于抓取姿态生成的校准能量模型，以及一种从抓取分布估计信息增益的主动视角选择方法。我们的能量模型捕捉了SE(3)流形上抓取分布的多模态特性。能量水平被校准到抓取的成功率，使得预测分布与真实分布一致。通过从基于重建环境的校准分布中估计抓取的信息增益，选择下一个最佳视角，这可以高效地驱动机器人探索目标物体的可抓取部分。在模拟环境和真实机器人设置上的实验表明，与先前最先进的模型相比，我们的模型能够在有限视角预算下成功抓取杂乱环境中的物体。我们的模拟环境可以作为未来主动抓取研究的可复现平台。当论文公开发布时，我们的源代码将公开。

英文摘要

Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on the projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from grasp distribution. Our energy-based model captures the multi-modality nature of grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution. The next best view is selected by estimating the information gain for grasp from the calibrated distribution conditioned on the reconstructed environment, which could efficiently drive the robot to explore affordable parts of the target object. Experiments on simulated environments and real robot setups demonstrate that our model could successfully grasp objects in a cluttered environment with limited view budgets compared to previous state-of-the-art models. Our simulated environment can serve as a reproducible platform for future research on active grasping. The source code of our paper will be made public when the paper is released to the public.

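2601.10930 2026-06-08 cs.RO updated version

Where to Touch, How to Contact: Hierarchical RL-MPC Framework for Geometry-Aware Long-Horizon Dexterous Manipulation

何处触碰，如何接触：面向几何感知的长时间灵巧操作的分层RL-MPC框架

Zhixian Xie, Yu Xiang, Michael Posa, Wanxin Jin

Affiliations * Arizona State University; University of Texas at Dallas; University of Pennsylvania

AI summary: Proposes a hierarchical RL-MPC framework in which a high-level RL policy predicts a contact intention (contact location and subgoal pose) and a low-level contact-implicit MPC optimizes local contact modes and replans in real time. It achieves geometry-generalized non-prehensile manipulation with 10 times less data than end-to-end baselines and zero-shot sim-to-real transfer.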

AI中文摘要

接触丰富的灵巧操作中的一个关键挑战是需要共同推理全局几何和非光滑接触动力学。端到端策略绕过了这一复杂性，但通常需要大量数据，并且从仿真到现实的迁移效果差。我们通过一个简单的见解来解决这些局限性：灵巧操作本质上是分层的——在高层次上，机器人决定在哪里触碰（几何）；在低层次上，它确定如何通过接触动力学移动物体。基于这一见解，我们提出了一个分层RL-MPC框架，其中高层强化学习（RL）策略预测接触意图，这是一种新颖的以物体为中心的接口，指定了（i）物体表面接触位置和（ii）接触后的物体子目标位姿。在接触意图的条件下，低层接触隐式模型预测控制（MPC）优化局部接触模式，并通过接触动力学进行实时（重新）规划，以生成稳健地将物体移向每个子目标的机器人动作。我们在非抓取任务上评估该框架，包括跨不同物体形状的几何泛化推、基于翻转/旋转的物体重新定向以及环境辅助的物体重新定位。它实现了高成功率，数据量大幅减少（比端到端基线少10倍），高度稳健的性能，以及零样本从仿真到现实的迁移。

英文摘要

A key challenge in contact-rich dexterous manipulation is the need to jointly reason over global geometry and nonsmooth contact dynamics. End-to-end policies bypass this complexity, but often require large amounts of data and transfer poorly from simulation to reality. We address the limitations with a simple insight: dexterous manipulation is inherently hierarchical--at a high level, a robot decides where to touch (geometry); at a low level it determines how to move the object through contact dynamics. Building on this insight, we propose a hierarchical RL--MPC framework in which a high-level reinforcement learning (RL) policy predicts a contact intention, a novel object-centric interface that specifies (i) an object-surface contact location and (ii) a post-contact object subgoal pose. Conditioned on the contact intention, a low-level contact-implicit model predictive control (MPC) optimizes local contact modes and real-time (re)plans through contact dynamics to generate robot actions that robustly move the object toward each subgoal. We evaluate the framework on non-prehensile tasks, including geometry-generalized pushing across diverse object shapes, pivoting/flipping-based object reorientation, and environment-assisted object repositioning. It achieves high success rate with substantially reduced data (10 times less than end-to-end baselines), highly robust performance, and zero-shot sim-to-real transfer.

2602.09580 2026-06-08 cs.RO cs.LG updated version

SERNF: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

SERNF: 通过动作块评论家和归一化流实现样本高效的真实世界灵巧策略微调

Chenyu Yang, Denis Tarasov, Davide Liconti, Romain Guntz, Hehui Zheng, Robert K. Katzschmann

Affiliations * Soft Robotics Lab, D-MAVT; ETH Zurich

AI summary: Proposes SERNF, a framework that combines a normalizing-flow policy with action-chunked critics for sample-efficient fine-tuning of real-world dexterous manipulation policies, addressing multimodal action distributions and credit assignment.

Comments https://srl-ethz.github.io/SERNF/

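ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

ViVa：用于机器人强化学习的视频生成价值模型

Jindi Lv, Hao Li, Jie Li, Fankun Kong, Yang Wang, Pengfei Yi, Yifei Nie, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang

Affiliations * GigaAI; Sichuan University; Tsinghua University

AI summary: Proposes ViVa, which repurposes a pretrained video generator to jointly predict future proprioception and a scalar value, using spatiotemporal priors for reliable value estimation. It achieves state-of-the-art results in metric-based evaluation across three tasks and, integrated into RECAP, an average success rate of 80%.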

AI中文摘要

视觉-语言-动作（VLA）模型通过大规模预训练推进了机器人操作，但由于部分可观测性和延迟反馈，实际部署仍然具有挑战性。强化学习通过价值函数解决这一问题，该函数评估任务进展并指导策略改进。然而，基于视觉-语言模型（VLM）的现有价值模型难以捕捉时间动态和物理交互，削弱了长期任务中价值估计的可靠性。本文提出ViVa，一种视频生成价值模型，该模型重新利用预训练的视频生成器，联合预测未来本体感受和标量价值。通过将价值估计基于预期的具身动态，ViVa利用时空先验，将价值与超越静态快照的前瞻性内在耦合。ViVa在三个任务的基于度量的评估中取得了最先进的结果，产生可靠的价值信号，准确跟踪任务进展并检测执行错误。集成到RECAP中，它实现了80%的平均成功率，突显了视频生成模型在价值估计中的前景。

英文摘要

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics and physical interactions, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator to jointly predict future proprioception and a scalar value. By grounding value estimation in anticipated embodiment dynamics, ViVa leverages spatiotemporal priors to intrinsically couple value with foresight beyond static snapshots. ViVa achieves state-of-the-art results in metric-based evaluation across three tasks, producing reliable value signals that accurately track task progress and detect execution errors. Integrated into RECAP, it achieves an average success rate of 80%, highlighting the promise of video-generative models for value estimation.

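2605.07496 2026-06-08 cs.RO updated version

PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation

PathPainter：将图像生成模型的泛化能力迁移至具身导航

Yijin Wang, Yuru Tian, Xijie Huang, Weiqi Gai, Mo Zhu, Xin Zhou, Yuze Wu, Fei Gao

Affiliations * Tsinghua University

AI summary: Proposes a navigation system that uses bird's-eye-view images as global priors: an image generation model interprets natural-language intent and generates traversability masks, and cross-view localization mitigates odometry drift. On a UAV platform it completes a 160-meter outdoor long-range navigation task.

Comments Work in progress. 16 pages, 13 figures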

AI中文摘要

鸟瞰图已被广泛证明能为导航提供有价值的先验信息。鉴于这种视图提供的全局信息，仍存在两个关键挑战：如何充分利用这些信息以及如何在执行过程中可靠地使用它们。在本文中，我们提出了一种导航系统，该系统使用鸟瞰图作为全局先验，并专为地面和近地面机器人平台设计。该系统采用图像生成模型从自然语言中解读人类意图，识别目标目的地，并生成可通行掩码。在执行过程中，我们引入跨视图定位以将机器人的里程计与鸟瞰图对齐，并减轻传统里程计中的长期漂移。我们进行了广泛的基准实验来评估所提出的方法，并在无人机平台上进一步验证。仅使用传统的局部运动规划器，无人机成功完成了160米的室外长距离导航任务。这项工作展示了基础模型的世界理解能力如何迁移到具身导航，使机器人能够受益于现有图像生成模型的强大泛化能力。

英文摘要

Bird's-eye-view (BEV) images have been widely demonstrated to provide valuable prior information for navigation. Given the global information provided by such views, two key challenges remain: how to fully exploit this information and how to reliably use it during execution. In this paper, we propose a navigation system that uses BEV images as global priors and is designed for ground and near-ground robotic platforms. The system employs an image generation model to interpret human intent from natural language, identify the target destination, and generate traversability masks. During execution, we introduce cross-view localization to align the robot's odometry with the BEV map and mitigate long-term drift in conventional odometry. We conduct extensive benchmark experiments to evaluate the proposed method and further validate it on a UAV platform. Using only a conventional local motion planner, the UAV successfully completes a 160-meter outdoor long-range navigation task. This work demonstrates how the world-understanding capabilities of foundation models can be transferred to embodied navigation, enabling robots to benefit from the strong generalization ability of existing image generation models.

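2605.08732 2026-06-08 cs.RO cs.LG updated version

Latent Geometry Beyond Search: Amortizing Planning in World Models

超越搜索的潜在几何：在世界模型中摊销规划

Hoang Nguyen, Xiaohao Xu, Xiaonan Huang

Affiliations * Department of Robotics, University of Michigan, Ann Arbor

AI summary: Shows that, under a regularized latent geometry, planning can be amortized into a latent inverse-dynamics mapping, replacing online search with a lightweight GC-IDM. The controller matches or exceeds CEM in seven of eight environment-protocol settings while reducing per-decision cost by 100-130x.

Comments 31 pages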

AI中文摘要

现代基于视觉的世界模型可以将观测表示为紧凑而富有表现力的潜在流形，但在这些空间中进行快速的目标导向规划仍然具有挑战性。这引发了一个核心问题：学习到的表示何时简化控制，而不仅仅是实现预测？我们在预训练的LeWorldModel中研究这个问题，其潜在几何通过正则化实现平滑性和均匀性。我们的关键见解是，在这种几何下，规划可以摊销为潜在逆动力学映射，而无需在线搜索。因此，我们用一个轻量级的目标条件逆动力学模型（GC-IDM）替代迭代规划，该模型将当前潜在状态、目标潜在状态和剩余时间步直接映射到下一个动作。实验上，在涵盖导航、接触丰富的操作和连续控制的四个基准环境中，我们的控制器在八个环境-协议设置中的七个上匹配或超过了CEM，同时将每次决策成本降低了100-130倍。对测试时规划器（CEM、MPPI、iCEM和基于梯度的方法）的更广泛扫描表明，这一结果并非特定于某个优化器。这些发现表明，测试时规划恢复的大部分结构已经局部编码在潜在表示中。更广泛地说，我们的结果表明，足够结构化的潜在空间可以将部分规划负担从在线优化转移到学习推理。我们的代码公开在 https://github.com/hdnndh/Latent-Geometry-Beyond-Search-Amortizing-Planning-in-World-Models 。

英文摘要

Modern vision-based world models can represent observations as compact yet expressive latent manifolds, but fast goal-oriented planning in these spaces remains challenging. This raises a central question: when does a learned representation simplify control, rather than merely enabling prediction? We study this question in a pretrained LeWorldModel, whose latent geometry is regularized for smoothness and uniformity. Our key insight is that, under such geometry, planning can be amortized into a latent inverse-dynamics mapping instead of requiring online search. We therefore replace iterative planning with a lightweight Goal-Conditioned Inverse Dynamics Model (GC-IDM) that maps the current latent state, goal latent state, and remaining horizon directly to the next action. Empirically, across four benchmark environments spanning navigation, contact-rich manipulation, and continuous control, our controller matches or exceeds CEM in seven of eight environment-protocol settings while reducing per-decision cost by 100-130x. A broader sweep over test-time planners (CEM, MPPI, iCEM, and gradient-based methods) shows that this result is not specific to a particular optimizer. These findings suggest that much of the structure recovered by test-time planning is already locally encoded in the latent representation. More broadly, our results indicate that sufficiently structured latent spaces can shift part of the planning burden from online optimization to learned inference. Our code is publicly available at https://github.com/hdnndh/Latent-Geometry-Beyond-Search-Amortizing-Planning-in-World-Models .
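
The interface described above, mapping (current latent, goal latent, remaining horizon) to the next action with one call per step and no search, can be sketched as follows. This is a toy under strong assumptions and not the authors' model: the latent dynamics are assumed to be `z' = z + a`, and `toy_gc_idm` is a hand-written closed-form stand-in for what would be a learned network. All names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# A goal-conditioned policy: (current latent, goal latent, steps remaining) -> action
Policy = Callable[[Sequence[float], Sequence[float], int], Vector]


def toy_gc_idm(z: Sequence[float], z_goal: Sequence[float],
               remaining: int, a_max: float = 1.0) -> Vector:
    """Stand-in for a learned GC-IDM, valid only for the toy dynamics z' = z + a.

    It spreads the remaining displacement evenly over the remaining steps and
    clips each action component to [-a_max, a_max].
    """
    if remaining < 1:
        raise ValueError("remaining horizon must be at least 1")
    return [max(-a_max, min(a_max, (g - c) / remaining))
            for c, g in zip(z, z_goal)]


def toy_step(z: Sequence[float], action: Sequence[float]) -> Vector:
    """Assumed toy latent dynamics: additive actions."""
    return [c + a for c, a in zip(z, action)]


def rollout(policy: Policy, z0: Sequence[float],
            z_goal: Sequence[float], horizon: int) -> Vector:
    """Closed-loop control with a single policy call per decision (no search)."""
    z = list(z0)
    for t in range(horizon):
        action = policy(z, z_goal, horizon - t)
        z = toy_step(z, action)
    return z


print(rollout(toy_gc_idm, [0.0, 0.0], [2.0, -1.0], horizon=4))
```

A sampling-based planner such as CEM would instead evaluate many candidate action sequences through the world model at every step, which is where a per-decision cost gap of the kind reported in the abstract would come from.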

2605.26974 2026-06-08 cs.RO updated version

Trust, Geometry, and Rules: A Credibility-Aware Reinforcement Learning Framework for Safe USV Navigation under Uncertainty

信任、几何与规则：不确定性下安全USV导航的可信感知强化学习框架

Yuhang Zhang, Shuqi Chai, Yukang Zhang, Liusha Yang, Mingchuan Zhang, Wei Wang, Qingjiang Shi, Quanbo Ge

Affiliations * School of Information Engineering, Henan University of Science and Technology; Shenzhen Research Institute of Big Data; School of Logistics Engineering, Shanghai Maritime University; Shenzhen Technology University; School of Computer Science, Wuhan University; School of Software Engineering, Tongji University

AI summary: Proposes a reinforcement learning framework that integrates credibility-aware learning, geometric safety shielding, and continuous rule-aware embedding to address safety and COLREGs compliance for USV navigation in dynamic maritime environments.

AI中文摘要

在动态海洋环境中，无人水面艇（USV）的安全自主导航并遵守《国际海上避碰规则》（COLREGs）仍然是一项艰巨的挑战，特别是当感知系统表现出校准不当的不确定性时。现有的基于强化学习（RL）的方法常常因为状态估计误差导致不可靠的信念状态误导价值函数，而离散的交通规则则引入了学习目标的不连续性而失败。为了解决这些挑战，我们提出了一个集成可信感知学习、几何安全屏蔽和连续规则感知嵌入的框架。首先，可信加权价值学习（CW-VL）引入了一个动态信任因子，该因子源自滤波器估计协方差与经验误差统计之间的差异，以调节评论家的异方差损失，防止策略对噪声样本过拟合。其次，协方差膨胀速度障碍（CI-VO）将位置估计不确定性映射为集合角裕度，形成一个保守的几何屏蔽，覆盖危险的探索行为。第三，风险感知COLREGs职责嵌入将二元相遇职责放松为连续的规则感知信号，提供平滑的扇区过渡信息，并抑制稀疏规则奖励引起的振荡。模拟相遇研究表明，该方法在感知不一致性下具有更好的训练鲁棒性，并且在避碰和COLREGs合规性方面优于基线方法。

英文摘要

Autonomous navigation of Unmanned Surface Vehicles (USVs) that is safe and compliant with the International Regulations for Preventing Collisions at Sea (COLREGs) remains a formidable challenge in dynamic maritime environments, particularly when perception systems exhibit miscalibrated uncertainty. Existing Reinforcement Learning (RL)-based methods often falter because state-estimation errors induce unreliable belief states that mislead the value function, while discrete traffic rules introduce discontinuity in the learning objective. To address these challenges, we propose a framework integrating credibility-aware learning, geometric safety shielding, and continuous rule-aware embedding. First, Credibility-Weighted Value Learning (CW-VL) introduces a dynamic trust factor derived from the discrepancy between filter-estimated covariance and empirical error statistics to modulate the critic's heteroscedastic loss, preventing policy overfitting to noisy samples. Second, the Covariance-Inflated Velocity Obstacle (CI-VO) maps position-estimation uncertainty into set-wise angular margins, forming a conservative geometric shield that overrides hazardous exploratory actions. Third, Risk-Aware COLREGs Duty Embedding relaxes binary encounter duties into continuous rule-aware signals, providing smooth sector-transition information and suppressing oscillation from sparse rule rewards. Simulated encounter studies demonstrate improved training robustness against perceptual inconsistency and superior collision avoidance and COLREGs compliance over baselines.
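
One plausible reading of the covariance-inflated velocity obstacle idea is sketched below: the safety radius of a classic (infinite-horizon) velocity obstacle is enlarged by a multiple of the largest standard deviation of the obstacle's 2D position estimate, which widens the collision cone's half-angle. The inflation rule, the parameter `k_sigma`, and all function names are assumptions for illustration; the abstract does not give the paper's exact mapping from covariance to angular margin.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]


def largest_eigenvalue_2x2(cov: Sequence[Sequence[float]]) -> float:
    """Largest eigenvalue of a symmetric 2x2 covariance matrix (closed form)."""
    a, b, c = cov[0][0], 0.5 * (cov[0][1] + cov[1][0]), cov[1][1]
    return 0.5 * (a + c) + math.sqrt((0.5 * (a - c)) ** 2 + b ** 2)


def in_velocity_obstacle(rel_pos: Vec2, rel_vel: Vec2, radius: float,
                         cov: Sequence[Sequence[float]] = ((0.0, 0.0), (0.0, 0.0)),
                         k_sigma: float = 0.0) -> bool:
    """True if the relative velocity points into the (inflated) collision cone.

    rel_pos: obstacle position minus own position.
    rel_vel: own velocity minus obstacle velocity.
    radius:  combined safety radius before inflation.
    """
    inflated = radius + k_sigma * math.sqrt(largest_eigenvalue_2x2(cov))
    dist = math.hypot(rel_pos[0], rel_pos[1])
    if dist <= inflated:
        return True  # already inside the inflated safety disc
    if math.hypot(rel_vel[0], rel_vel[1]) == 0.0:
        return False  # no relative motion, so the separation never shrinks
    half_angle = math.asin(inflated / dist)
    dot = rel_pos[0] * rel_vel[0] + rel_pos[1] * rel_vel[1]
    cross = rel_pos[0] * rel_vel[1] - rel_pos[1] * rel_vel[0]
    bearing_offset = abs(math.atan2(cross, dot))  # angle between rel_vel and rel_pos
    return bearing_offset < half_angle


# Obstacle 10 m ahead; own velocity slightly off the line of sight.
print(in_velocity_obstacle((10.0, 0.0), (1.0, 0.15), radius=1.0))
print(in_velocity_obstacle((10.0, 0.0), (1.0, 0.15), radius=1.0,
                           cov=((1.0, 0.0), (0.0, 1.0)), k_sigma=1.0))
```

In a shielding setup, an exploratory action whose resulting velocity falls inside the inflated cone would be overridden, so larger estimation uncertainty makes the shield more conservative.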

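2606.01072 2026-06-08 cs.RO cs.CV updated version

A 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

面向机器人任务规划的3D grounded视觉-语言框架：自动化提示合成与监督推理

Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

Affiliations * Tsinghua University

AI summary: Proposes a framework that combines a 2D prompt synthesis module with a small language model to improve robots' 3D scene understanding and task execution; experiments report a task success rate of 96.0%.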

DOI: 10.1016/j.engappai.2025.113268
Journal ref: Engineering Applications of Artificial Intelligence, vol. 164, p. 113268, 2026

AI中文摘要

视觉-语言模型（VLMs）在场景理解和感知任务中取得了显著成功，使机器人能够在动态环境中自适应地规划和执行动作。然而，大多数多模态大语言模型缺乏稳健的3D场景定位能力，限制了其在精细机器人操作中的有效性。此外，低识别精度、低效、差的迁移性和可靠性等挑战阻碍了其在精密任务中的应用。为解决这些限制，我们提出了一种新的框架，该框架整合了一个2D提示合成模块，通过将2D图像映射到点云，以及一个小型语言模型（SLM）来监督VLM的输出。2D提示合成模块使训练于2D图像和文本的VLM能够自主提取精确的3D空间信息，无需人工干预，显著增强了3D场景理解。同时，SLM监督VLM的输出，缓解幻觉并确保可靠的可执行机器人控制代码生成。我们的框架消除了在新环境中重新训练的需要，从而提高了成本效率和操作鲁棒性。实验结果表明，所提出的框架实现了96.0%的任务成功率（TSR），优于其他方法。消融研究证明了2D提示合成模块和输出监督模块的关键作用（当移除时，TSR下降67%）。这些发现验证了框架在提升3D识别、任务规划和机器人任务执行方面的有效性。

英文摘要

Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, poor transferability, and poor reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module, which maps 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results show that the proposed framework achieved a 96.0% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67% TSR drop). These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.
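
The step of lifting 2D image locations into a point cloud can be illustrated with standard pinhole back-projection. The sketch assumes that a metric depth map and camera intrinsics (`fx`, `fy`, `cx`, `cy`) are available; the abstract does not say how the authors' 2D prompt synthesis module obtains or uses them, so the function names and inputs here are hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Pinhole model: pixel (u, v) at metric depth z maps to camera-frame (x, y, z)."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def pixels_to_points(pixels: Sequence[Tuple[int, int]],
                     depth_map: Sequence[Sequence[float]],
                     fx: float, fy: float, cx: float, cy: float) -> List[Point3]:
    """Lift selected pixels (u = column, v = row) to 3D, skipping invalid depth."""
    points = []
    for u, v in pixels:
        z = depth_map[v][u]  # row index is v, column index is u
        if z > 0:
            points.append(backproject(u, v, z, fx, fy, cx, cy))
    return points


# A pixel 100 px right of the principal point, 2 m away, with fx = 500 px.
print(backproject(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

With this mapping, a 2D mask or keypoint proposed on the image can be turned into a 3D target in the camera frame, which is the kind of spatial grounding the framework needs before generating control code.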
