arXivDaily: daily arXiv digest, updated Monday through Friday
2606.06556 2026-06-08 cs.RO New submission

Robots Need More than VLA and World Models

Elis Karcini, Faisal Mehrban, Quang Nguyen, Mac Schwager, Arash Ajoudani, Cesar Cadena, Jan Peters, Marco Hutter, Haitham Bou-Ammar

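Affiliations: Motoniq.ai; Stanford University; Istituto Italiano di Tecnologia; ETH Zurich; Technical University of Darmstadt; UCL Centre for AI

AI summary: This position paper argues that the key bottleneck for generalist robot intelligence is not only policy learning but also the lack of mechanisms for converting unstructured behavioural data into robot-usable supervision, and it identifies four missing interface components.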

Abstract

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.

2606.06569 2026-06-08 cs.RO New submission

PhyRoGen: Synthetic Generation of Physical Robot Manipulation Puzzles Using Procedural Content Generation

Lennart Julian Droß, Andreas Orthey, Marc Toussaint

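Affiliations: Technical University of Berlin; Robotics Institute Germany

AI summary: Proposes the PhyRoGen framework, which uses procedural content generation to automatically create synthetic datasets of robot manipulation puzzles; the 24 generated puzzles are solved in 1 to 300 seconds and shown to be manipulatable in physics simulation.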

Comments 8 pages, accepted at CASE 2026

Abstract

Robot manipulation of physical puzzles is important for automatic assembly and disassembly tasks. However, to enable robots to solve physical puzzles, manipulation skills need to be learned, which requires large training datasets, the generation of which is often time consuming and tedious. To overcome this problem, we propose the Physical Robot Manipulation Puzzle Generation framework (PhyRoGen), which leverages procedural content generation (PCG) for automated generation of synthetic datasets of manipulation puzzles. PhyRoGen is a general-purpose puzzle generator, which can generate physical puzzles with interlocking object dependencies, where one articulated object must be manipulated before another can be moved. Based upon PhyRoGen, we define six concrete generators which we use to generate 24 physical puzzles. By using a benchmarking framework, we are able to solve all puzzles in 1 to 300 seconds using sampling-based planning algorithms. Finally, we demonstrate that every generated puzzle is manipulatable by using a KUKA LBR iiwa robot in a physical simulation. This shows that our framework is able to procedurally generate unique, solvable robot manipulation puzzles, which is a crucial ingredient to benchmark manipulation algorithms and to develop robust foundation models.

2606.06618 2026-06-08 cs.RO cs.AI cs.LG New submission

ChronoForest: Closed-Loop Multi-Tree Diffusion Planning for Efficient Bridge Search and Route Composition

Jungmin Seo, Jaesik Park

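Affiliations: Seoul National University

AI summary: For long-horizon route planning from short-horizon offline trajectories alone, proposes ChronoForest, which couples local bridge search with global route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator, improving success rates and efficiency on OGBench and Hamiltonian route-composition benchmarks.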

Comments 40 pages, 4 figures, 7 tables, 3 algorithms

Abstract

How can we plan long-horizon routes that reach designated goals, visit required waypoints, and remain short when only short-horizon offline trajectories are available? This problem matters in offline navigation because collecting sufficiently rich long-horizon data is difficult, yet real agents must still solve long-range tasks with route-level efficiency rather than mere feasibility. The difficulty is twofold: at the microscopic level, composing many short-horizon segments creates a trade-off between search cost and path quality, while at the macroscopic level, waypoint ordering requires comparing pairwise travel costs among start, goal, and waypoint anchors that are unknown before planning and increasingly unreliable when estimated only from long-range temporal distance. In this paper, we propose ChronoForest, a closed-loop planning system that couples local bridge search and online route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator. ChronoForest uses temporal distance for short-range guidance and node evaluation, while using search-time bridge evidence to validate long-range anchor connectivity and repeatedly re-solve the route. On OGBench AntMaze-Stitch, ChronoForest achieves 99.8%, 99.3%, and 99.5% success on the medium, large, and giant splits and improves giant-stitch success by up to 34.5 points over prior reported diffusion-based results. On Hamiltonian route-composition benchmarks, online re-solving corrects poor temporal orderings and improves route quality while remaining substantially cheaper than exhaustive planning.
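
To make the waypoint-ordering subproblem concrete, here is a minimal sketch of the exhaustive baseline the abstract contrasts against: enumerate every visiting order over known pairwise travel costs and keep the cheapest route. This is not ChronoForest's method (which re-solves the route online as bridge evidence arrives); the cost table and the names `best_route`, `S`, `A`, `B`, `G` are hypothetical.

```python
from itertools import permutations


def best_route(cost, start, goal, waypoints):
    """Exhaustively search waypoint orders; return (total_cost, route)."""
    best = None
    for order in permutations(waypoints):
        route = (start, *order, goal)
        # Sum the pairwise travel cost along consecutive anchors.
        total = sum(cost[a][b] for a, b in zip(route, route[1:]))
        if best is None or total < best[0]:
            best = (total, route)
    return best


# Hypothetical pairwise travel costs among start S, waypoints A and B, goal G.
cost = {
    "S": {"A": 2, "B": 9, "G": 10},
    "A": {"S": 2, "B": 3, "G": 8},
    "B": {"S": 9, "A": 3, "G": 1},
}
total, route = best_route(cost, "S", "G", ["A", "B"])
print(total, route)
```

The number of orders grows factorially with the number of waypoints, which is one reason a cheaper online alternative is attractive.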

2606.06627 2026-06-08 cs.RO cs.AI cs.CV cs.LG New submission

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta, Yilun Du, Pulkit Agrawal

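Affiliations: Massachusetts Institute of Technology; University of Illinois Urbana-Champaign; Harvard University

AI summary: Studies how hand pose quality and the motion gap affect transfer when cotraining robot manipulation policies on everyday Internet videos, and proposes a cotraining recipe that yields an absolute success rate gain of 29.7% across six manipulation tasks in the low-robot-data regime.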

Comments The project website is here: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html

Abstract

Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of 29.7% in the low-robot-data regime across six manipulation tasks.

2606.06686 2026-06-08 cs.RO cs.DS New submission

On the Hardness of Optimal Motion on Trees

Tzvika Geft

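Affiliations: Rutgers University

AI summary: Proves that labeled and 2-colored Multi-Agent Path Finding (MAPF) on trees is NP-hard under the distance, makespan, and flowtime objectives, resolving the long-open classical Pebble Motion problem.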

Abstract

This paper presents a simple framework that settles the complexity of Multi-Agent Path Finding (MAPF) on trees across standard objectives--distance, makespan, and flowtime--for both labeled and colored variants. In MAPF, agents occupy the vertices of a graph and must move to target vertices without collisions while optimizing a given objective. In the labeled case, the agents are distinct and have respective targets; in the colored case, agents of the same color are interchangeable. While many MAPF variants are known to be intractable, several basic cases on trees have remained open. We prove NP-hardness on trees for both labeled and 2-colored MAPF under all three objectives. In particular, we resolve the classical Pebble Motion problem, where one pebble moves at a time to an adjacent empty vertex and the goal is to minimize the total number of moves. Despite being one of the most basic discrete motion models, its complexity on trees had remained open for several decades. Moreover, for colored Pebble Motion, we give the first hardness result on any graph class, already with two colors, which is tight. All of these results are established through the hardness of Stack Rearrangement, itself posed as an open problem, which asks to optimally rearrange items stored in stacks, and which we also prove to be NP-hard. Notably, the connection to stacks yields hardness already on very simple trees--subdivided stars--across all problems. Together, these results reveal a common tractability barrier that permeates several fundamental motion models, thereby unifying and strengthening prior hardness results.
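
The Pebble Motion model described above (one pebble moves at a time to an adjacent empty vertex, minimizing total moves) can be illustrated with a brute-force breadth-first search over configurations. This is only a sketch for tiny instances; its exponential state space is consistent with, though of course no proof of, the hardness result. The function name `min_moves` and the example star are illustrative choices, not taken from the paper.

```python
from collections import deque


def min_moves(adj, start, target):
    """Minimum number of single-pebble moves from start to target.

    adj maps each vertex to its neighbours; start and target are tuples
    giving the vertex of each labeled pebble. Returns None if unreachable.
    """
    start, target = tuple(start), tuple(target)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, moves = queue.popleft()
        if state == target:
            return moves
        occupied = set(state)
        for i, v in enumerate(state):
            for w in adj[v]:
                if w in occupied:
                    continue  # a pebble may only move to an empty vertex
                nxt = state[:i] + (w,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, moves + 1))
    return None


# A star with centre 0 and leaves 1, 2, 3; swap the pebbles on leaves 1 and 2.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(min_moves(star, (1, 2), (2, 1)))  # 6: one pebble must detour via leaf 3
```

Without the spare leaf 3 the swap is impossible and the search returns `None`.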

2606.06704 2026-06-08 cs.RO New submission

Optimal Control Approach for Non-prehensile Ball Juggling Using a 7-DoF Manipulator

Joel Ramadani, Vasilije Rakčević, Riddhiman Laha, Arne Sachtler, Valentin Le Mesle, Achim J. Lilienthal, Sami Haddadin

Affiliations: Technical University of Munich; Mohamed Bin Zayed University of Artificial Intelligence; German Aerospace Center (DLR), Institute of Robotics and Mechatronics

AI summary: Proposes a model-based, two-stage optimal control framework for non-prehensile ball juggling with a tool-equipped 7-DoF manipulator; it generates periodic juggling trajectories offline and organises them to enable real-time error correction.

Comments 8 pages, accepted at ICRA 2026

Abstract

Non-prehensile object manipulation skills are important for real-world robot interactions, enabling highly dynamic tasks such as balancing a glass on a tray or the controlled sliding of items on a table. Among such tasks, those characterised by high-speed manipulation requirements and general sensitivity of the resulting hybrid dynamics are particularly hard to accomplish. Within these, juggling can be seen as a highly challenging maneuver to be solved. The key to robotic juggling is achieving dynamic stabilisation of an underactuated object. Since the object does not possess the ability of self-correction, its stability is entirely dependent on the forces applied to it. This creates a system that is sensitive to control inputs, where timing is critical to continuously counteract deviations and maintain the desired behavior. We develop a systematic method to control a 7-degree-of-freedom manipulator performing non-prehensile ball juggling with a tool. Our primary contribution is a model-based framework for generating juggling trajectories and stabilizing a periodic juggling motion for this hybrid system. The framework incorporates a two-stage optimal control approach to compute the underlying feasible motion patterns required for stable juggling. Offline-computed trajectories are then organised to enable real-time error correction without solving optimal control problems online. We demonstrate the effectiveness of the resulting controller by first evaluating its performance in a simulation environment and performing an experiment using a Franka Emika Panda robot.

2606.06721 2026-06-08 cs.RO cs.AI New submission

SCOUT: Semantic scene COverage via Uncertainty-guided Traversal

Junyu Mao, Sara Ayoubi, Vishnu D. Sharma, Ilija Hadžić, Matthew Andrews

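Affiliations: Nokia Bell Labs, France; Nokia Bell Labs, Murray Hill, NJ, USA; Imperial College London; Locus Robotics

AI summary: Proposes SCOUT, a framework that closes the loop between uncertainty-guided traversal planning and probabilistic scene graph construction, so that a robot actively explores and progressively understands its environment, treating semantic scene completeness as an operational objective.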

Comments 2026 ICRA Workshop on Uncertainty in Open World Robotics

Abstract

Robots that operate over extended periods should not merely visit space; they should progressively understand it. Yet most 3D scene graph pipelines treat perception as a post-processing stage over a fixed dataset, decoupling scene representation from the decisions that determine what is observed in the first place. We present SCOUT, an online semantic exploration framework that closes this loop by coupling active traversal with probabilistic scene graph construction. Given a prior 2D occupancy map and posed RGB-D observations, SCOUT incrementally builds an uncertainty-aware 3D scene graph whose nodes maintain fused geometry and posterior beliefs over open-vocabulary object labels, while edges encode structural relations such as on, inside, belong, and next to. These beliefs are fed back to an uncertainty-guided traversal planner, which selects viewpoints by balancing expected semantic certainty gain, geometric coverage gain, and travel cost. In this way, the robot revisits ambiguous objects when additional evidence matters and expands into unseen free space when the scene remains incomplete. The resulting system treats semantic scene completeness as an operational objective rather than a passive by-product of semantic mapping, moving toward autonomous agents that can patrol, update, and reason about evolving indoor environments with minimal human intervention.
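
As a rough illustration of how a planner might balance the three terms named above, the sketch below scores candidate viewpoints with a weighted sum of semantic certainty gain and coverage gain minus travel cost. The linear form, the weights, and all names (`Viewpoint`, `select_viewpoint`) are assumptions made for illustration; the abstract does not specify SCOUT's actual objective.

```python
from dataclasses import dataclass


@dataclass
class Viewpoint:
    name: str
    semantic_gain: float   # expected reduction in label uncertainty
    coverage_gain: float   # expected newly observed free space
    travel_cost: float     # cost of reaching the viewpoint


def select_viewpoint(candidates, w_sem=1.0, w_cov=1.0, w_cost=0.5):
    """Pick the candidate with the highest utility (assumed linear trade-off)."""
    def utility(v):
        return (w_sem * v.semantic_gain
                + w_cov * v.coverage_gain
                - w_cost * v.travel_cost)
    return max(candidates, key=utility)


candidates = [
    Viewpoint("revisit_mug", semantic_gain=0.8, coverage_gain=0.1, travel_cost=1.0),
    Viewpoint("frontier", semantic_gain=0.1, coverage_gain=0.9, travel_cost=2.0),
    Viewpoint("far_room", semantic_gain=0.9, coverage_gain=0.9, travel_cost=6.0),
]
print(select_viewpoint(candidates).name)
```

With these hypothetical numbers the nearby ambiguous object wins; setting `w_cost=0` would instead favour the distant, information-rich viewpoint.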

2606.06727 2026-06-08 cs.RO cs.SY eess.SY New submission

IDDMBSE: Integrating Data-Driven and Model-Based Systems Engineering for Trusted Autonomous Cyber-Physical Systems

John S. Baras, Sai Sandeep Damera, Ryan Matheu, Clinton Enwerem, Praveen M. S. Kumar

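Affiliations: Institute for Systems Research, University of Maryland, College Park

AI summary: Proposes the IDDMBSE methodology, which combines the MBSE V-process with data-driven loops, instantiates it as the open-source tool chain PERFECT, TRADES-X, and VERITAS, and demonstrates it across the development lifecycle of an autonomous ground robot.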

Comments 9 pages, 11 figures. This work has been submitted to the IEEE for possible publication

Abstract

Autonomous cyber-physical systems (CPS) sit at the intersection of Model-Based Systems Engineering (MBSE) and data-driven Machine Learning and Artificial Intelligence (ML/AI), yet no integrated Systems Engineering (SE) methodology natively spans both. We address this gap with IDDMBSE, an Integrated Data-Driven and Model-Based Systems Engineering methodology that extends the rigorous MBSE V-process with a data-driven loop at every step, anchored in SysML, the autonomy stack, and a hybrid model-based plus data-driven trade-off architecture. We instantiate IDDMBSE as an interoperable, open-source tool chain: PERFECT, which maps SysML system architectures to executable ROS autonomy stacks for scalable performance evaluation; TRADES-X, which decomposes design-space exploration into a model-based optimization stage followed by a data-driven evaluation stage; and VERITAS, which combines formal, data-driven, and runtime verification into a single assurance workflow. We demonstrate IDDMBSE on a Trusted Autonomous Ground Robot across its development lifecycle, spanning sensor-suite selection, risk-sensitive path planning, behavior-tree task verification, conformal-prediction-based robust perception, and assured multi-robot coordination, all exercised in a contested-terrain Isaac Sim test range that we release with the tool chain. We close by sketching how IDDMBSE is being re-formulated on SysML v2 / KerML foundations to enable language-native composability and tighter ML/AI integration.

2606.06761 2026-06-08 cs.RO cs.AI New submission

AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

Jiyun Jang, Yujin Sung, Woosung Joung, Daewon Chae, Sangwon Lee, Sohwi Kim, Jinkyu Kim, Jungbeom Lee

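Affiliations: Korea University; University of Michigan; KT R&D Center; Kakao Mobility

AI summary: To address action-execution failures of visuomotor policies under distribution shift, proposes AxisGuide, which renders the robot base-frame axes into each camera view and adds cue channels to the RGB observations, improving action-coordinate understanding and generalization.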

Comments Accepted to Robotics: Science and Systems (RSS) 2026

Abstract

Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail to reliably execute correct low-level actions under distribution shifts. For example, even in a simple pickup task with identical scene layouts, camera viewpoints, and illumination, performance can degrade substantially when the object is placed at unseen locations. We argue that this gap arises from insufficient action understanding, namely the inability to interpret the robot's base-frame action coordinate system in image space. To address this issue, we introduce AxisGuide, a lightweight guidance method that bridges semantic scene understanding and action-coordinate interpretation. Using camera parameters and end-effector poses, AxisGuide renders the robot base-frame axes in each camera view and augments RGB observations with a small set of cue channels that explicitly visualize the meaning of the +x, +y, and +z motions in image space. Extensive evaluations in both the LIBERO simulation and real-world environments demonstrate that AxisGuide yields substantial performance gains and improved generalization, highlighting the effectiveness of explicit action-coordinate cues for learning reliable and transferable generalist visuomotor policies.
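
The geometric step behind rendering base-frame axes in a camera view is a standard pinhole projection. The sketch below projects short +x, +y, and +z segments, anchored at a point given in the robot base frame, into pixel coordinates. It assumes an extrinsic transform `R`, `t` from base frame to camera frame and intrinsics `fx`, `fy`, `cx`, `cy`; how AxisGuide encodes the result into cue channels is not stated in the abstract and is not modelled here.

```python
def project(point_base, R, t, fx, fy, cx, cy):
    """Project a 3D point in the base frame to pixel coordinates."""
    # Base frame -> camera frame: p_cam = R * p_base + t
    x, y, z = (
        sum(R[i][j] * point_base[j] for j in range(3)) + t[i]
        for i in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Pinhole model: perspective divide, then apply intrinsics.
    return (fx * x / z + cx, fy * y / z + cy)


def axis_segments(anchor, R, t, fx, fy, cx, cy, length=0.1):
    """Return image-space segments for the base-frame +x, +y, +z directions."""
    origin_px = project(anchor, R, t, fx, fy, cx, cy)
    segments = {}
    for name, axis in (("+x", (1, 0, 0)), ("+y", (0, 1, 0)), ("+z", (0, 0, 1))):
        tip = tuple(a + length * d for a, d in zip(anchor, axis))
        segments[name] = (origin_px, project(tip, R, t, fx, fy, cx, cy))
    return segments


# Hypothetical camera: axes aligned with the base frame, anchor 1 m ahead.
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0.0, 0.0, 1.0]
segs = axis_segments((0.0, 0.0, 0.0), R, t, fx=100, fy=100, cx=50, cy=50)
print(segs["+x"])
```

In this configuration the +z segment collapses to a single pixel because it points along the optical axis, which hints at why several camera views can help disambiguate motion directions.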

2606.06762 2026-06-08 cs.RO New submission

Multi-Robot Planning and Control from CCTV Camera Networks in a Real Warehouse

Luke Robinson, Benjamin Ramtoula, Anas Izaaryene, Paul Newman, Daniele De Martini

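Affiliations: Oxford Robotics Institute, University of Oxford, UK; Robot Systems Group, Technical University of Munich, Germany

AI summary: Proposes multi-robot coordinated planning and control using only a distributed CCTV network and edge compute, validated in a real warehouse with four robots and 30 cameras; the authors describe it as, to their knowledge, the first field demonstration of multi-robot coordination that relies solely on an external camera network.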

Abstract

Off-board control of mobile robots from cameras embedded in the environment offers a practical path to scalable autonomy, moving sensing and compute off the robots. We extend this idea from the single-robot case to coordinated fleets in a real warehouse, driving multiple robots with only a distributed CCTV network and edge compute. The system operates entirely in image space over an uncalibrated, pixel-wise topological camera graph, enabling wide-area operation with flexible camera placement. A hierarchical planner selects a camera sequence per robot and plans its image-space motion through each view, coordinating robots with a prioritised-then-joint strategy and treating overlapping camera regions as shared resources held by one robot at a time to prevent collisions and deadlocks. We validate the approach in a real warehouse with four robots and 30 cameras across six 27 m aisles, reporting mission times and coordination statistics. To our knowledge, this is the first field demonstration of multi-robot planning and coordination using only an external camera network and off-board compute, with robots carrying no task-specific navigation hardware.
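
The idea of treating an overlapping camera region as a resource held by one robot at a time can be sketched as a simple reservation table. This shows only the mutual-exclusion part; deadlock avoidance depends on the order in which robots request regions, which the paper handles with its prioritised-then-joint strategy and which is not reproduced here. The class name `RegionTable` and the region and robot identifiers are hypothetical.

```python
class RegionTable:
    """Tracks which robot currently holds each shared camera-overlap region."""

    def __init__(self):
        self._holder = {}

    def try_acquire(self, region, robot):
        """Grant the region if it is free or already held by this robot."""
        holder = self._holder.setdefault(region, robot)
        return holder == robot

    def release(self, region, robot):
        """Free the region, but only if this robot is the current holder."""
        if self._holder.get(region) == robot:
            del self._holder[region]


table = RegionTable()
print(table.try_acquire("cam3-cam4", "robot1"))  # True: region was free
print(table.try_acquire("cam3-cam4", "robot2"))  # False: robot1 holds it
table.release("cam3-cam4", "robot1")
print(table.try_acquire("cam3-cam4", "robot2"))  # True: region was released
```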

2606.06790 2026-06-08 cs.RO cs.LG cs.SY eess.SY New submission

Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension

Arthur Bouton, Tristan D. Hasseler, Michael Paton, Travis Brown, Jacob Levy, William Reid, Joshua Martin, Hari Nayar

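Affiliations: Jet Propulsion Laboratory, California Institute of Technology; Center for Autonomy, University of Texas at Austin; Space Systems Laboratory, University of Maryland

AI summary: Presents a four-wheeled planetary rover concept with an Active Gimbal Suspension and trains a single neural network controller with reinforcement learning for autonomous obstacle negotiation and all-terrain locomotion, using policy consolidation and zero-shot transfer to validate it on the physical rover.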

Comments 21 pages, 26 figures

Abstract

This paper presents ERNEST, a four-wheeled planetary rover concept equipped with a two-degree-of-freedom Active Gimbal Suspension that combines yaw and roll actuation to enable wheel reconfiguration, steering, and active load redistribution. A single neural network controller, trained to track a desired path across challenging terrain, fully unlocks the capabilities of this actuated suspension system for autonomous obstacle negotiation. A reinforcement learning framework is developed using the high-fidelity DARTS simulation engine, which combines rigid-contact dynamics and Bekker-Wong terramechanics, enabling the emergence of locomotion strategies adapted to loose-soil conditions. To obtain a single unified controller across heterogeneous terrains, a policy consolidation strategy merges the experience of terrain-specialized agents into one neural network, eliminating the need for explicit terrain classification and controller switching. The resulting controller operates on a combination of proprioceptive and exteroceptive feedback, including sparse stereo-derived terrain elevation, chassis attitude, joint states, and force-torque measurements. Zero-shot transfer to the physical rover is achieved through domain randomization, sensor noise injection, and model-to-real system identification. Experimental results demonstrate autonomous traversal of rock fields, a bump trap, a wheel-high step, sand ripples, and sandy slopes. On a 20° sandy slope, the learned controller reduces the cost of transport by 37% on dry sand despite the additional actuation, and achieves superior performance on wet sand where the passive suspension becomes completely immobilized.

2606.06805 2026-06-08 cs.RO cs.AI cs.SY eess.SY New submission

Lane Change Trajectory Planning for Personalized Driving Comfort and Mobility Efficiency

Haoxuan Dong, Dongjun Li, Ziyou Song

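Affiliations: Department of Mechanical Engineering; Department of Electrical Engineering; National University of Singapore; Computer Science; University of Michigan

AI summary: Proposes a neural network-driven trajectory planner that combines a third-order polynomial trajectory generator with a learning module, using a shared backbone with dual heads and a statistical gate based on error-winner logistic regression to balance personalized comfort and mobility efficiency.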

Comments Accepted by the IEEE Intelligent Vehicles Symposium (IEEE IV 2026), Detroit, MI, United States, June 22-25, 2026

Abstract

Lane changing entails simultaneous longitudinal and lateral motions that affect driving comfort and mobility efficiency. Because these motions are tightly coupled and subject to substantial inter-vehicle variability, trajectory planning for lane-change maneuvers is characterized by a highly personalized nature. This study proposes a neural network-driven planner that integrates a third-order polynomial trajectory generator with a learning module that infers optimal trajectory parameters across diverse driving conditions. Using a shared backbone with dual heads, one head ensures all-condition operational guarantees, while the other captures driver-specific preferences for comfort or mobility efficiency. A head-gated switching mechanism, realized through a statistical gate based on error-winner logistic regression, adaptively selects the appropriate head under varying driving conditions, which enables context-aware lane-change trajectory planning. Representative cases and Monte Carlo simulations show that the proposed planner achieves personalized comfort and mobility during lane changes, while the baseline ensures feasible trajectories under driving conditions where personalized data are insufficient or inaccessible.
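
For readers unfamiliar with polynomial lane-change trajectories, the following sketch shows one common third-order form for the lateral offset: y(t) = a2 t^2 + a3 t^3, with zero offset and zero lateral velocity at the start and the full lane width with zero lateral velocity at the end. These boundary conditions, and the parameters `width` and `duration`, are assumptions for illustration; the paper's own parameterization (and what its learning module outputs) may differ.

```python
def cubic_lane_change(width, duration):
    """Return lateral offset and lateral velocity functions of time.

    Boundary conditions (assumed): y(0) = 0, y'(0) = 0, y(T) = width, y'(T) = 0,
    which give a2 = 3*width/T**2 and a3 = -2*width/T**3.
    """
    a2 = 3.0 * width / duration**2
    a3 = -2.0 * width / duration**3

    def lateral(t):
        return a2 * t**2 + a3 * t**3

    def lateral_velocity(t):
        return 2.0 * a2 * t + 3.0 * a3 * t**2

    return lateral, lateral_velocity


# Hypothetical manoeuvre: 3.5 m lane width completed in 4 s.
y, vy = cubic_lane_change(width=3.5, duration=4.0)
for t in (0.0, 2.0, 4.0):
    print(t, round(y(t), 3), round(vy(t), 3))
```

Under these assumptions the lateral acceleration has magnitude 6*width/T**2 at both ends of the manoeuvre, so a shorter duration improves mobility efficiency at the price of comfort, which is the trade-off the planner personalizes.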

2606.06829 2026-06-08 cs.RO New submission

Three-dimensional hydro-cluttered locomotion by an undulatory robot

Tianyu Wang, Matthew Fernandez, Galen Tunnicliffe, Nikolas Cornell, Justin Duong, Donoven Dortilus, Zhaochen J. Xu, Patricia Meza, Sean Lublinsky, Darsh Parikh, Jianfeng Lin, Emily Grace, Daniel I. Goldman

Affiliations: Institute for Robotics and Intelligent Machines, Georgia Institute of Technology; School of Physics, Georgia Institute of Technology; George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology; School of Electrical and Computer Engineering, Georgia Institute of Technology; Department of Mechanical and Industrial Engineering, Northeastern University; Ransom Everglades School

AI summary: Presents the AquaMILR robot, which uses programmable body compliance and depth regulation to achieve fast, robust forward locomotion in three-dimensional hydro-cluttered environments, with inertia-induced rolling acting as a spontaneous recovery mechanism.

Abstract

Aquatic robots have expanded human access to underwater environments, yet many underwater spaces contain obstacles that can disrupt open-water locomotion. In "hydro-cluttered" environments, water is interspersed with rigid and flexible clutter, making body-obstacle contact unavoidable. Operating in these spaces requires robots that can regulate and exploit contact, but this regime remains difficult to model or simulate. Building on recent advances in mechanical intelligence in terradynamically capable limbless robotics, we develop principles for 3D aquatic locomotion using AquaMILR, an elongate limbless robot that combines bilateral cable-driven actuation, programmable body compliance, distributed depth regulation, corrosion-resistant enclosures, and onboard power and electronics for untethered field operation. Systematic robophysical experiments reveal that programmable body compliance regulates body deformation and converts body-environment interactions into fast, robust, forward progression across increasing hydro-clutter constraint strength. Depth regulation provides three-dimensional access, allowing the robot to bypass clutter, recover from obstruction, and continue through otherwise inaccessible routes. In potential jamming scenarios, emergent inertia-induced rolling acts as a spontaneous recovery mechanism, freeing the robot from clutter that would otherwise lead to failure and allowing locomotion to continue without additional control. Tests of the robot in an aquatic mangrove field demonstrate that these principles transfer to practical operation, enabling navigation and onboard visual inspection of inaccessible root zones. These results establish principles for hydro-cluttered locomotion and a design paradigm in which aquatic robots exploit environmental complexity as a locomotor resource.

2606.06832 2026-06-08 cs.RO New submission

STRIPS-WM: Learning Grounded Propositional STRIPS-style World Models from Images

Abhiroop Ajith, Constantinos Chamzas

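Affiliations: Worcester Polytechnic Institute

AI summary: Proposes STRIPS-WM, a framework that learns symbolic world models from image transitions for robot visual task planning and improves image-to-plan success.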

Abstract

Robots performing long-horizon visual manipulation observe high-dimensional images, but successful plans depend on action-relevant facts: what can be done now and what changes afterward. A useful planning representation should discard irrelevant visual details while preserving action applicability and effects. Classical task planners exploit this structure through symbolic operators with preconditions and effects, but obtaining such representations from raw visual experience remains challenging. We study a visual task-planning setting in which a robot receives only image transitions: the current image, executed high-level action, and the resulting image. At test time, given a start image and a goal image, the robot must produce a sequence of high-level actions that reaches the goal. To address this problem, we introduce STRIPS-WM, a framework for learning image-grounded STRIPS-style world models directly from visual transitions. STRIPS-WM first induces a finite abstract transition graph from images, then learns latent binary predicates and one grounded propositional operator per action label. The learned operators form a symbolic action model with sparse preconditions and add/delete effects. Finally, the learned predicates are distilled into a visual encoder, enabling classical planning directly from novel start and goal images. Experiments on visual rearrangement tasks show that STRIPS-WM improves image-to-plan success over the tested visual rollout, latent graph-search and latent-symbolic baselines.
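The symbolic action model described above (sparse preconditions plus add/delete effects) can be illustrated with a minimal sketch. This is not the STRIPS-WM implementation: the predicates and operators below are hand-written and hypothetical, whereas the paper learns them from image transitions. The sketch only shows how a classical planner could consume such operators once a visual encoder has mapped the start and goal images to sets of true propositions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A grounded propositional STRIPS operator (all names are illustrative)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

    def applicable(self, state: frozenset) -> bool:
        # An operator applies when every precondition holds in the state.
        return self.preconditions <= state

    def apply(self, state: frozenset) -> frozenset:
        # STRIPS semantics: remove the delete effects, then add the add effects.
        return (state - self.delete_effects) | self.add_effects


def plan(start: frozenset, goal: frozenset, operators):
    """Breadth-first search over symbolic states; returns action names or None."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, actions = frontier.popleft()
        if goal <= state:
            return actions
        for op in operators:
            if op.applicable(state):
                nxt = op.apply(state)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, actions + [op.name]))
    return None


# Hypothetical two-block rearrangement domain with hand-written predicates.
OPS = [
    Operator("pick_a", frozenset({"a_on_table", "hand_empty"}),
             frozenset({"holding_a"}), frozenset({"a_on_table", "hand_empty"})),
    Operator("stack_a_on_b", frozenset({"holding_a", "b_clear"}),
             frozenset({"a_on_b", "hand_empty"}), frozenset({"holding_a", "b_clear"})),
]
START = frozenset({"a_on_table", "b_clear", "hand_empty"})
GOAL = frozenset({"a_on_b"})
```

Calling `plan(START, GOAL, OPS)` searches the abstract state graph and returns the action labels in execution order, or `None` when the goal propositions are unreachable.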

2606.06836 2026-06-08 cs.RO cs.AI cs.CV 新提交

Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

像飞行员一样思考:细粒度长时程无人机导航

Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu

发表机构 * Colab Beihang University(北航) Meituan(美团) National University of Singapore(新加坡国立大学)

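AI summary: Proposes the FLIGHT benchmark and FLIGHT VLA, an asynchronous architecture that decouples a low-frequency pilot-reasoning VLM from a high-frequency diffusion action model, enabling smooth continuous UAV flight control under long-horizon semantic instructions.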

详情
AI中文摘要

语言引导的无人机代理必须执行长时程语义指令,同时产生平滑、物理可行的连续飞行命令,然而现有的视觉语言导航(VLN)基准通常使用离散或粗粒度的动作,而现有的无人机视觉-语言-动作(VLA)任务则专注于短时、原子化的机动。为了解决无人机任务设置中的这一空白,我们引入了FLIGHT,一个用于混合无人机导航与推理任务的细粒度长时程指令引导基准,该基准结合了多阶段指令与密集的6-DoF轨迹注释,分为两个数据集:细粒度VLN和长时程流。为了使无人机代理具备对任务执行状态和任务规划进行实时飞行推理的能力,同时适应高频、实时的精确控制,我们进一步提出了FLIGHT VLA,一种异步架构,将用于任务状态推理的低频流式飞行员视觉语言模型(VLM)与用于连续控制的高频扩散动作模型解耦,并由显式的飞行员推理文本进行监督,该文本总结了当前飞行状态并预测下一个子目标。在闭环评估中,FLIGHT VLA在我们的FLIGHT基准上持续优于代表性的VLN和VLA基线,实现了更强的多阶段完成、子目标遵循和终端控制。其训练的流式飞行员推理VLM进一步提升了无人机视频推理,验证了我们设计的有效性。

英文摘要

Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.

2606.06870 2026-06-08 cs.RO 新提交

What Is My Robot Thinking? Design Considerations for Transparent and Trustworthy Shared Autonomy

我的机器人在想什么?透明且可信的共享自主性的设计考量

Atharv Belsare, Zohre Karimi, Connor Mattson, Rushiil Nakka, Daniel S. Brown

发表机构 * Kahlert School of Computing, University of Utah(犹他大学计算学院) Robotics Center, University of Utah(犹他大学机器人中心)

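AI summary: A user study of how interface transparency (feedback modality and information richness) affects coordination and trust in shared autonomy. Feedback improves intent alignment and reduces corrective intervention, visual feedback is preferred over auditory feedback, the preferred information richness depends on task complexity, and revealing the full belief distribution does not consistently improve alignment or trust.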

Comments 9 pages, 5 Figures, Code and videos are available at https://sites.google.com/view/design-t2-sa/home. Under review at IROS 2026

详情
AI中文摘要

在共享自主性下运行的辅助机器人必须平衡用户控制与自主辅助。由于机器人动作依赖于不可直接观察的内部意图推理,推断目标与预期目标之间的不匹配会破坏协调与信任。我们研究了界面级透明度,包括反馈模态(视觉与听觉)和信息丰富度(稀疏与丰富),如何影响基于视觉的共享自主系统中的交互。在一项包含N=25名参与者的用户研究中,涉及两项辅助操作任务,我们评估了这些设计如何影响协调与信任。提供反馈显著提高了意图对齐并减少了纠正干预,表明使推断目标可理解加速了共享控制中的收敛。参与者偏好视觉反馈而非听觉反馈,而对稀疏与丰富信息的偏好取决于任务复杂度。我们还发现,揭示完整的信念分布并不一致地提高对齐或信任。这些发现共同表明,有效的透明度主要通过目标可理解性增强协调,而信任取决于任务适当的信息暴露,而非最大程度的信息披露。基于这些结果,我们概述了设计透明共享自主系统的指导方针。

英文摘要

Assistive robots operating under shared autonomy must balance user control with autonomous assistance. Because robot actions depend on internal intent inference that is not directly observable, mismatches between inferred and intended goals can undermine coordination and trust. We investigate how interface-level transparency, including feedback modality (visual vs. auditory) and information richness (sparse vs. rich), shapes interaction in a vision-based shared autonomy system. In a user study with N=25 participants across two assistive manipulation tasks, we evaluate how these designs influence coordination and trust. Providing feedback significantly improves intent alignment and reduces corrective intervention, indicating that making the inferred goal legible accelerates convergence in shared control. Participants preferred visual over auditory feedback, while preferences for sparse versus rich information depended on task complexity. We also found that revealing the full belief distribution did not consistently improve alignment or trust. Together, these findings indicate that effective transparency enhances coordination primarily through goal legibility, while trust depends on task-appropriate information exposure rather than maximal disclosure. Based on these results, we outline guidelines for designing transparent shared autonomy systems.

2606.06877 2026-06-08 cs.RO cs.AI 新提交

Neuro-Symbolic Learning for Long-Horizon Task Planning Under Complex Logical Constraints

复杂逻辑约束下长时域任务规划的神经符号学习

Qiwei Du, Zitong Zhan, Shaoshu Su, Bowen Li, Yi Du, Zhipeng Zhao, Taimeng Fu, Sebastian Scherer, Jiaoyang Li, Chen Wang

发表机构 * Spatial AI & Robotics (SAIR) Lab, University at Buffalo, NY 14260(空间人工智能与机器人实验室,布法罗大学,纽约州,14260) Robotics Institute, Carnegie Mellon University, PA 15213(机器人研究所,卡内基梅隆大学,宾夕法尼亚州,15213)

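AI summary: Proposes an imperative-learning-based bilevel optimization framework that prunes task-irrelevant objects with a neural scorer and stabilizes lower-level planning with a 3R strategy (Repair, Restart, Rollback). Results on three benchmarks include an 80.04% reduction in failure rate and a 57.14% reduction in planning time.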

详情
AI中文摘要

当机器人必须在复杂逻辑约束(包括对象可供性、空间关系和顺序动作依赖)下推理长时域动作序列时,任务规划常常面临严重的效率瓶颈。最近的神经符号方法通过学习对象重要性分数来剪枝任务无关对象,从而提高规划效率,但它们通常依赖于从完整搜索空间生成的固定离线监督。这造成了训练-测试不匹配:在部署时,规划器在由模型自身不完美预测诱导的剪枝搜索空间中运行,导致暴露偏差和规划性能下降。为了解决这一挑战,我们将任务规划的对象重要性学习形式化为一个基于命令学习的双层优化问题。上层优化一个神经评分器,而下层在评分剪枝的搜索空间中求解符号规划问题。为了稳定这一学习过程,我们在下层规划中引入3R策略,使用并行的修复、重启和回滚恢复来为上层学习提供可靠且自适应的反馈。在三个具有挑战性的基准上的实验展示了最先进的性能,包括失败率降低80.04%和规划时间减少57.14%。我们进一步在仿真和现实世界中的四足移动机械臂上验证了该框架,展示了其在高效且可部署的神经符号任务规划方面的潜力。

英文摘要

Task planning often suffers from severe efficiency bottlenecks when robots must reason over long-horizon action sequences under complex logical constraints, including object affordances, spatial relationships, and sequential action dependencies. Recent neuro-symbolic methods improve planning efficiency by learning object-importance scores to prune task-irrelevant objects, but they typically rely on fixed offline supervision generated from full search spaces. This creates a train-test mismatch: at deployment, the planner operates in pruned search spaces induced by the model's own imperfect predictions, leading to exposure bias and degraded planning performance. To address this challenge, we formulate object-importance learning for task planning as an imperative learning-based bilevel optimization problem. The upper level optimizes a neural scorer, while the lower level solves a symbolic planning problem in the score-pruned search space. To stabilize this learning process, we introduce a 3R strategy into the lower-level planning, using parallel Repair, Restart, and Rollback recovery to provide reliable and adaptive feedback for upper-level learning. Experiments on three challenging benchmarks demonstrate state-of-the-art performance, including an 80.04% reduction in failure rate and a 57.14% reduction in planning time. We further validate the framework on a quadruped-based mobile manipulator in simulation and the real world, demonstrating its potential for efficient and deployable neuro-symbolic task planning.

2606.06878 2026-06-08 cs.RO cs.CV 新提交

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

一种用于鲁棒6-DoF抓取姿态估计的跨视图融合框架

Kangjian Zhu, Haobo Jiang, Jianjun Qian, Jin Xie

发表机构 * Nanjing University of Science and Technology(南京理工大学) Nanyang Technological University(南洋理工大学) Nanjing University(南京大学)

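AI summary: Proposes a cross-view fusion framework that uses an auxiliary view to alleviate occlusion, applies self-supervised contrastive learning to strengthen the spatial consistency and direction distinctiveness of point cloud features, and introduces a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry, improving the robustness of 6-DoF grasp pose estimation in corner views.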

Comments Corresponding author: Jin Xie

详情
AI中文摘要

本文提出一种跨视图融合框架,增强了角落视图中6-DoF抓取姿态估计的鲁棒性。我们的框架通过引入辅助视图缓解遮挡,并通过后融合策略避免了耗时的、任务无关的多视图重建。为了增强跨视图融合,我们提出一种自监督对比学习策略,利用跨视图关联来正则化点云特征。简而言之,如果两个点对应相同的3D位置,则跨视图点对被视作匹配;如果它们代表不同的抓取方向,则视为不匹配。该学习策略显著增强了点特征的空间一致性和方向区分性,从而促进了跨视图融合并提高了估计鲁棒性。此外,我们提出一种跨视图对齐圆柱体集成模块,将抓取相关几何融合为综合表示。具体地,该模块首先根据相似性对齐跨视图点和特征,以增强对噪声的鲁棒性。随后,将这些点注册到圆柱坐标系中,强调对抓取重要的旋转对称几何。最后,交替使用局部自注意力和种子交叉注意力层,分别实现单视图内和跨视图间的交互,支持抓取相关几何的细粒度表示。我们的框架在GraspNet-1Billion基准测试和实际应用中均取得了强劲性能。代码可在以下网址获取:https://github.com/KJZhuAutomatic/Cross-view-Grasp。

英文摘要

In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at https://github.com/KJZhuAutomatic/Cross-view-Grasp.

2606.06944 2026-06-08 cs.RO 新提交

T-GMP: Terrain-conditioned Generative Motion Priors for Versatile and Natural Humanoid Locomotion

T-GMP:用于多功能且自然的人形机器人运动的地形条件生成式运动先验

Junhong Guo, Hao Hu, Chen Chen, Haoxuan Han, Linao Gong, Xin Yang, Zhicheng He, Yao Su, Fenghua He

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Leju Robotics(乐聚机器人)

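AI summary: Proposes the T-GMP module, which uses a conditional variational autoencoder to learn a terrain-conditioned latent motion manifold from a few expert demonstrations and, combined with adversarial learning and a foothold penalty, yields versatile, natural locomotion that adapts to terrain variations under a unified policy.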

详情
AI中文摘要

实现拟人自然性和鲁棒地形穿越仍然是人形机器人运动的基本挑战。现有的强化学习方法通常依赖固定的运动先验,限制了其对变化环境的适应性。我们提出基于地形条件的生成式运动先验(T-GMP),该模块使用条件变分自编码器从少量专家状态-地形演示中捕获地形条件潜在运动流形。学习到的先验能够实现平滑的风格转换,促进统一策略适应地形变化。我们将 T-GMP 集成到对抗学习流程中,并结合我们提出的立足点惩罚,其中判别器根据局部地形特征动态调节自然性约束,指导生成多功能且类人的运动。实验结果表明,我们的方法在穿越成功率和运动平滑度上优于现有基线,同时保持了仿生自然和物理协调的运动。

英文摘要

Achieving both anthropomorphic naturalness and robust terrain traversal remains a fundamental challenge in humanoid locomotion. Existing Reinforcement Learning (RL) approaches typically rely on fixed motion priors, limiting their adaptability to varying environments. We propose Terrain-conditioned Generative Motion Priors (T-GMP), a module that captures a terrain-conditioned latent motion manifold from a few expert state-terrain demonstrations using a Conditional Variational Autoencoder (CVAE). The learned priors enable smooth style transitions, facilitating a unified policy that adapts to terrain variations. We integrate T-GMP into an adversarial learning pipeline with our proposed Foothold Penalty, where a discriminator dynamically modulates naturalness constraints conditioned on local terrain features, guiding the generation of versatile and human-like motions. Experimental results demonstrate that our method outperforms existing baselines in traversal success rate and motion smoothness, while preserving biomimetically natural and physically coordinated motions.

2606.06953 2026-06-08 cs.RO 新提交

LIMMT: Less is More for Motion Tracking

LIMMT:少即是多的运动追踪

Yu Guan, Zekun Qi, Chenghuai Lin, Xuchuan Chen, Dairu Liu, Wenyao Zhang, Jilong Wang, Xinqiang Yu, He Wang, Li Yi

发表机构 * Tsinghua University(清华大学) GalBot Shanghai Jiao Tong University(上海交通大学) Peking University(北京大学) Shanghai Qi Zhi Institute(上海启智研究院)

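AI summary: Proposes LIMMT, a data-centric framework for motion tracking that selects high-quality motion data along three dimensions (physics feasibility, diversity, and complexity). Training on under 3% of AMASS outperforms training on the full dataset.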

Comments Accepted at ICML 2026

详情
AI中文摘要

我们认为高质量的运动数据可以在训练早期引导追踪策略走向更优的优化轨迹。在这项工作中,我们引入了LIMMT(少即是多的运动追踪)。据我们所知,这是首个针对基于物理的人形运动追踪的数据中心研究。我们不仅简单地移除低质量和错误片段,而是通过三个维度定义运动数据质量:物理可行性、多样性和复杂性。我们表明,即使仅使用AMASS的不到3%的数据进行训练,也能获得比使用完整数据集更好的追踪性能。我们进一步对估计的网络来源动捕数据进行了数据清洗。大量实验和分析验证了我们框架的有效性。

英文摘要

We argue that high-quality motion data can steer tracking policies toward better optimization trajectories early in training. In this work, we introduce LIMMT (Less Is More for Motion Tracking). To our knowledge, this is the first data-centric study of physics-based humanoid motion tracking. Rather than simply removing low-quality and erroneous clips, we define motion data quality along three dimensions: physics feasibility, diversity, and complexity. We show that training on under 3% of AMASS yields better tracking performance than training on the full dataset. We further apply data cleaning to estimated web-sourced mocap data. Extensive experiments and analyses validate the effectiveness of our framework.

2606.06977 2026-06-08 cs.RO 新提交

Compliance-Based Sensor Placement for Force Sensing on a Sensorized Prostate Phantom

基于柔顺性的传感器布局方法用于传感化前列腺模体的力感知

Sizhe Tian, Yinoussa Adagolodjo, Jeremie Dequidt

发表机构 * CRIStAL DEFROST Polytech Lille

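AI summary: Proposes a compliance-based weighted greedy sensor placement method for force sensing on a Digital Rectal Examination training phantom, raising the mean force reconstructability score in the target region by 22.5% over a global QR-based method.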

详情
AI中文摘要

本文提出一种基于柔顺性的传感器布局方法,用于为直肠指检训练设计的传感化前列腺模体的力感知。该模体结合了三个内部气动腔室(用作内置压力传感器)和十个表面位移标记。通过在外表面采样位置施加外力生成有限元仿真数据集,并构建将力输入与压力和位移响应关联的柔顺矩阵。基于该矩阵,我们提出一种加权贪心选择策略,最大化局部力可重构性,同时优先考虑临床相关的后部接触区域,并避免将标记直接放置在感兴趣区域内。与全局基于QR的布局策略相比,所提方法将目标区域的平均可重构性得分提高了22.5%。这些结果表明,区域感知的稀疏传感器布局可以在保持有限且实用的传感配置的同时,提高软体机器人医疗模体的力可观测性。

英文摘要

This work presents a compliance-based sensor placement method for force sensing on a sensorized prostate phantom designed for Digital Rectal Examination training. The phantom combines three internal pneumatic chambers, used as intrinsic pressure sensors, with ten surface displacement markers. A finite-element simulation dataset is generated by applying external forces at sampled surface locations, from which a compliance matrix relating force inputs to pressure and displacement responses is constructed. Based on this matrix, we propose a weighted greedy selection strategy that maximizes local force reconstructability while prioritizing the clinically relevant posterior contact region and avoiding marker placement directly within the Region of Interest. Compared with a global QR-based placement strategy, the proposed method increases the mean reconstructability score in the target region by 22.5%. These results suggest that region-aware sparse sensor placement can improve force observability in soft robotic medical phantoms while maintaining a limited and practical sensing configuration.
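A weighted greedy selection over a compliance matrix can be sketched as follows. The abstract does not define the reconstructability score, so the objective here is an assumed surrogate: a weighted sum of `log(1 + accumulated squared sensitivity)` per force location. The `forbidden` argument is likewise a hypothetical stand-in for excluding candidates inside the Region of Interest. Treat this as an illustration of the selection loop, not the authors' method.

```python
import math


def weighted_greedy_placement(compliance, weights, k, forbidden=()):
    """Greedily pick up to k candidate sensors.

    compliance[i][j]: response of candidate sensor i to a unit force at location j.
    weights[j]: importance of force location j (larger inside the target region).
    forbidden: candidate indices that must not be selected.
    """
    n_forces = len(weights)
    energy = [0.0] * n_forces  # accumulated squared sensitivity per force location
    chosen = []

    def score(extra_row=None):
        # Assumed surrogate objective with diminishing returns per location.
        total = 0.0
        for j in range(n_forces):
            extra = extra_row[j] ** 2 if extra_row is not None else 0.0
            total += weights[j] * math.log1p(energy[j] + extra)
        return total

    for _ in range(k):
        base = score()
        best, best_gain = None, 0.0
        for i, row in enumerate(compliance):
            if i in chosen or i in forbidden:
                continue
            gain = score(row) - base
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining candidate improves the objective
            break
        chosen.append(best)
        for j in range(n_forces):
            energy[j] += compliance[best][j] ** 2
    return chosen


# Toy example: four candidate sensors, three force locations;
# location 2 plays the role of the clinically relevant region.
C = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.9, 0.9, 0.0]]
W = [1.0, 1.0, 5.0]
```

With these toy values, the heavily weighted location pulls the first pick toward the sensor that observes it. Forbidding that sensor shifts the selection to the candidates covering the remaining locations.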

2606.06996 2026-06-08 cs.RO cs.DC 新提交

Mission-Level Runtime Assurance Framework for Autonomous Driving

自动驾驶任务级运行时保证框架

Chieh Tsai, Salim Hariri

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

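AI summary: Proposes a runtime assurance framework that evaluates both driving safety and mission completability, using a runtime monitor to reject unsafe or mission-infeasible commands. Experiments show it outperforms an approach that considers only platform-level safety.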

详情
AI中文摘要

本文研究当高级驾驶命令出现故障或不可靠时自动驾驶的运行时安全性。与主要关注即时车辆安全的传统运行时安全方法不同,所提出的框架在执行命令前评估驾驶安全以及车辆是否仍能成功完成任务。该框架通过引入任务级故障场景(如跳过必需检查点、进入受限区域、生成无法成功完成任务的未来路线)扩展了highway-env。引入运行时监控系统,在执行前检测并拒绝不安全或任务不可行的命令。作为对比,使用公开的Simplex-Drive框架实现了一个基于学习的驾驶控制、安全回退控制和运行时控制器切换的自适应Simplex-Drive运行时安全基线。实验结果表明,仅平台级运行时安全无法检测任务级规划故障,而所提出的框架成功拒绝任务不可行命令,并在随机故障条件下提高了任务成功率。

英文摘要

This paper studies runtime safety for autonomous driving when high-level driving commands become faulty or unreliable. Unlike conventional runtime-safety approaches that mainly focus on immediate vehicle safety, the proposed framework evaluates both driving safety and whether the vehicle can still successfully complete its mission before a command is executed. The framework extends highway-env with mission-level fault scenarios such as skipping required checkpoints, entering restricted areas, and generating future routes that can no longer complete the mission successfully. A runtime monitoring system is introduced to detect and reject unsafe or mission-infeasible commands before execution. For comparison, an adapted Simplex-Drive runtime-safety baseline with learning-based driving control, safety fallback control, and runtime controller switching is implemented using the public Simplex-Drive framework. Experimental results show that platform-level runtime safety alone cannot detect mission-level planning faults, while the proposed framework successfully rejects mission-infeasible commands and improves mission success under randomized fault conditions.
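The three mission-level fault types named above (skipped checkpoints, restricted areas, routes that can no longer finish the mission) suggest a pre-execution check along these lines. The sketch is a simplified assumption: routes are lists of waypoint names, and `Mission` and `check_command` are hypothetical names that belong neither to highway-env nor to the paper's framework.

```python
from dataclasses import dataclass, field


@dataclass
class Mission:
    required_checkpoints: list  # must be visited in this order
    restricted: set             # waypoint names the vehicle may not enter
    goal: str
    visited: list = field(default_factory=list)


def check_command(mission, proposed_route):
    """Return (accepted, reason) for a proposed list of waypoint names."""
    for waypoint in proposed_route:
        if waypoint in mission.restricted:
            return False, f"enters restricted area {waypoint}"
    remaining = [c for c in mission.required_checkpoints if c not in mission.visited]
    # The remaining checkpoints must appear in the route as an ordered subsequence.
    route_iter = iter(proposed_route)
    for checkpoint in remaining:
        if checkpoint not in route_iter:  # consumes the iterator up to the match
            return False, f"skips required checkpoint {checkpoint}"
    if not proposed_route or proposed_route[-1] != mission.goal:
        return False, "route does not end at the mission goal"
    return True, "ok"


MISSION = Mission(required_checkpoints=["A", "B"], restricted={"X"}, goal="G")
```

A monitor of this kind would run before each high-level command is executed and would pass only accepted routes to the driving controller.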

2606.07012 2026-06-08 cs.RO 新提交

Task Editing for Generalizable 3D Visuomotor Policy Learning

面向可泛化3D视觉运动策略学习的任务编辑

Jian-Jian Jiang, YiHan Yang, Lan Wei, Yuming Luo, Xiao-Ming Wu, Xuhang Chen, Bin Fan, Dandan Zhang, Wei-Shi Zheng

发表机构 * Sun Yat-sen University(中山大学) Imperial College London(帝国理工学院) Nanyang Technological University(南洋理工大学) South China University of Technology(华南理工大学)

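AI summary: Proposes Task-Edit, a framework that decomposes tasks into scene, skill, and object components and flexibly recombines them to generate diverse trajectories, improving the generalization of 3D visuomotor policies on long-horizon manipulation tasks.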

Comments 8 pages, 4 figures

详情
AI中文摘要

3D视觉运动策略为复杂机器人操作提供了有前景的方向,因为深度图和点云为空间推理提供了丰富的几何信息。然而,它们的成功通常依赖于大规模的真实世界演示,这些演示的收集成本高昂且耗时。为此,现有方法通常使用演示生成策略,通过对人类收集的演示应用以对象为中心的变换(如改变对象姿态或尺度)来提高数据效率。虽然这些变换在局部变化上有效,但它们很大程度上保留了原始场景结构和技能序列,限制了合成复杂任务中多样化的场景-技能-对象组合的能力。在本文中,我们提出Task-Edit,一种新颖的演示生成框架,从任务中心编辑的角度生成多样化轨迹。Task-Edit的关键见解是将任务分解为场景、技能和对象组件,并灵活地重新组合它们。通过这种方式,Task-Edit实现了可扩展的演示生成,并显著提高了长程操作任务的泛化能力。我们通过大量真实世界实验评估了Task-Edit,并展示了三个优势:(1)有效性:Task-Edit在各种真实世界任务和机器人形态上显著提升了3D视觉运动策略。(2)泛化性:Task-Edit提高了模型在不同场景设置下的泛化能力。(3)适用性:Task-Edit使模型能够处理真实世界中难以收集的场景,包括抗干扰、避障和未见过的杂乱场景。

英文摘要

3D visuomotor policies offer a promising direction for complex robotic manipulation, as depth maps and point clouds provide rich geometric information for spatial reasoning. However, their success often depends on large-scale real-world demonstrations, which are costly and time-consuming to collect. To this end, existing methods commonly use demonstration generation strategies to improve data efficiency by applying object-centric transformations to human-collected demonstrations, such as varying object poses or scales. While effective for local variation, these transformations largely preserve the original scene structure and skill sequence, limiting their ability to synthesize diverse scene-skill-object combinations for complex tasks. In this paper, we propose Task-Edit, a novel demonstration generation framework that generates diverse trajectories from a task-centric editing perspective. The key insight of Task-Edit is to decompose a task into scene, skill and object components, and flexibly recombine them. In this way, Task-Edit enables scalable demonstration generation and significantly improves generalization for long-horizon manipulation tasks. We evaluate Task-Edit through extensive real-world experiments and demonstrate three advantages: (1) Effectiveness: Task-Edit significantly improves 3D visuomotor policies across various real-world tasks and robot embodiments. (2) Generalizability: Task-Edit improves model generalization across different scenario setups. (3) Applicability: Task-Edit enables models to handle scenarios that are difficult to collect in the real world, including disturbance resistance, obstacle avoidance and unseen cluttered scenes.

2606.07013 2026-06-08 cs.RO cs.HC 新提交

A Multi-Operator Mixed-Reality Interface for Multi-Robot Control and Coordination: Co-Located and Private Workspace Collaboration

面向多机器人控制与协调的多操作员混合现实界面:共位与私有工作空间协作

Omotoye Shamsudeen Adekoya, Antonio Sgorbissa, Carmine Tommaso Recchiuto

发表机构 * DIBRIS Department, RICE Laboratory, University of Genoa(DIBRIS部门,RICE实验室,热那亚大学)

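AI summary: Proposes a mixed-reality interface extended to multi-operator collaboration, supporting a co-located shared-workspace mode and a private-workspace mode, with registration-driven scene construction, lightweight session synchronization, and per-robot control leases that prevent conflicting commands. Experiments show comparable task performance across the two modes, while the co-located mode significantly improves perceived collaboration and is preferred by operators.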

Comments Submitted to RO-MAN 2026

详情
AI中文摘要

多操作员控制机器人团队不仅需要访问相同的任务信息,还需要维护共享态势感知并防止冲突干预的机制。基于我们之前的HORUS界面(统一系统的整体操作现实),我们提出了一种混合现实界面,将单操作员多机器人监督扩展到协作式多操作员使用。该系统支持两种互补模式:共位共享工作空间,操作员在同一物理位置观察和操作同一张迷你地图;以及私有工作空间模式,操作员通过独立放置的本地工作空间执行相同任务。该架构结合了注册驱动的场景构建、轻量级共享会话同步以及每机器人控制租约,以支持协作监控、任务分配和远程操作,同时防止冲突命令。我们在一项人类受试者研究中评估了该方法,共有36名参与者(18对)在两个搜索环境中控制三台Nova Carter移动机器人。两种模式下的客观任务性能相当,表明两种模式都支持有效的任务执行。然而,共位共享工作空间显著改善了感知协作、共享理解和交接清晰度,并且是首选的协作模式。这些结果表明,即使底层机器人控制工具保持不变,物理上共置MR工作空间也能改善操作员的协调方式。

英文摘要

Multi-operator control of robot teams requires not only access to the same mission information, but also mechanisms for maintaining shared awareness and preventing conflicting interventions. Building on our previous HORUS interface (Holistic Operational Reality for Unified Systems), we present a mixed-reality interface that extends single-operator multi-robot supervision to collaborative multi-operator use. The system supports two complementary modes: a co-located shared workspace, in which operators observe and manipulate the same mini-map in the same physical location, and a private-workspace mode, in which operators work on the same mission through independently placed local workspaces. The architecture combines registration-driven scene construction, lightweight shared-session synchronization, and per-robot control leases to support collaborative monitoring, tasking, and teleoperation while preventing conflicting commands. We evaluated the approach in a human-subject study with 36 participants (18 pairs) controlling three Nova Carter mobile robots in two search environments. Objective task performance was comparable across the two modes, indicating that both supported effective mission execution. However, the co-located shared workspace significantly improved perceived collaboration, shared understanding, and handoff clarity, and was the preferred collaborative mode. These results indicate that physically co-locating the MR workspace improves how operators coordinate even when the underlying robot-control tools remain unchanged.
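A per-robot control lease can be pictured as a small table that grants one operator exclusive command rights over a robot for a limited time. The sketch below is an assumption about how such a mechanism might look. The class name, the expiry policy, and the five-second default are hypothetical and are not taken from the HORUS implementation.

```python
import time


class LeaseTable:
    """Tracks which operator currently holds the control lease for each robot."""

    def __init__(self, duration_s=5.0, clock=time.monotonic):
        self.duration_s = duration_s
        self.clock = clock          # injectable clock, which simplifies testing
        self._leases = {}           # robot_id -> (operator_id, expiry_time)

    def acquire(self, robot_id, operator_id):
        """Grant or renew a lease unless another operator holds a live one."""
        now = self.clock()
        holder = self._leases.get(robot_id)
        if holder is not None and holder[0] != operator_id and holder[1] > now:
            return False
        self._leases[robot_id] = (operator_id, now + self.duration_s)
        return True

    def may_command(self, robot_id, operator_id):
        """Only the holder of an unexpired lease may send commands."""
        holder = self._leases.get(robot_id)
        return holder is not None and holder[0] == operator_id and holder[1] > self.clock()

    def release(self, robot_id, operator_id):
        """Hand the robot back explicitly, e.g. during an operator handoff."""
        holder = self._leases.get(robot_id)
        if holder is not None and holder[0] == operator_id:
            del self._leases[robot_id]
```

Expiry means an operator who disconnects cannot block a robot indefinitely, and explicit release supports a deliberate handoff between operators.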

2606.07067 2026-06-08 cs.RO 新提交

Extending Responsibility-Sensitive Safety for the Assessment of Offloaded Autonomous Driving Services

扩展责任敏感安全以评估卸载的自动驾驶服务

Robin Dehler, Aryan Thakur, Michael Buchholz

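AI summary: Addresses the safety challenge that V2X communication makes response times variable when autonomous driving functions are offloaded. The paper extends the Responsibility-Sensitive Safety definitions, bases offloading decisions and fallback mechanisms on the resulting safety constraints, and introduces a warm-standby phase to make fallback safer.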

Comments 8 pages; accepted for 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC), Naples, Italy, September 15-18, 2026 - DOI will be added after publication

详情
AI中文摘要

安全是自动驾驶系统开发的基本要求。虽然功能卸载在计算效率和能耗方面显示出显著优势,但其在安全关键的AD功能中的应用带来了新的挑战。特别是,由于无线车联网通信,卸载的服务组合会导致响应时间增加且可变,这直接影响车辆的反应时间,从而影响其安全保证。在本文中,我们通过扩展责任敏感安全(RSS)的定义,明确考虑本地和卸载的AD服务组合的不同响应时间,来应对这一挑战。基于这一扩展,我们提出将其集成到功能卸载中,使用RSS安全约束进行卸载决策和回退机制。仅当当前交通状况在相应的端到端响应时间下保持安全时,才允许卸载的服务组合。如果违反此条件,系统将执行受控回退到本地执行。此外,我们引入了一种增强的回退策略,其中包括卸载服务的热备阶段,从而实现从卸载服务到本地服务的更快、更安全的过渡。所提出的方法已集成到我们的AD堆栈中,并在仿真和真实世界中进行了评估。实验结果表明,与最先进的功能卸载和安全框架相比,所提出的方法提高了安全性,同时在安全条件允许时保留了分布式计算的优势。

英文摘要

Safety is a fundamental requirement in the development of autonomous driving (AD) systems. While function offloading has demonstrated significant benefits in terms of computational efficiency and energy consumption, its application to safety-critical AD functionality introduces new challenges. In particular, offloaded service compositions incur increased and variable response times due to wireless vehicle-to-everything (V2X) communication, which directly affects the vehicle's reaction time and thus its safety guarantees. In this paper, we address this challenge by extending the definitions of Responsibility-Sensitive Safety (RSS) to explicitly account for different response times of local and offloaded AD service compositions. Based on this extension, we propose an integration into function offloading, using the RSS safety constraints for offloading decision-making and fallback mechanisms. Offloaded service compositions are only permitted if the current traffic situation remains safe under the corresponding end-to-end response time. If this condition is violated, the system performs a controlled fallback to local execution. Furthermore, we introduce an enhanced fallback strategy that includes a warm-standby phase for offloaded services, enabling faster and safer transitions from offloaded to local services. The proposed approach is integrated into our AD stack and evaluated in both simulation and the real world. Experimental results demonstrate that the proposed method improves safety compared to state-of-the-art function offloading and safety frameworks, while preserving the benefits of distributed computation when safety conditions allow.
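To make the role of response time concrete, the sketch below evaluates the standard RSS minimum longitudinal gap for two vehicles driving in the same direction and uses it as an offloading gate. The paper's extended definitions are not given in the abstract, so this uses only the original RSS formula with the response time as a parameter. The function names and the acceleration limits are illustrative assumptions.

```python
def rss_min_gap(v_rear, v_front, response_time, a_max_accel, b_min_brake, b_max_brake):
    """Standard RSS minimum safe longitudinal gap in metres (same direction).

    The rear vehicle may accelerate at a_max_accel during response_time and then
    brakes at b_min_brake, while the front vehicle brakes at b_max_brake.
    """
    v_after_response = v_rear + response_time * a_max_accel
    gap = (
        v_rear * response_time
        + 0.5 * a_max_accel * response_time ** 2
        + v_after_response ** 2 / (2.0 * b_min_brake)
        - v_front ** 2 / (2.0 * b_max_brake)
    )
    return max(gap, 0.0)


def choose_execution(current_gap, v_rear, v_front, rho_offloaded, **limits):
    """Permit offloading only if the gap is safe under the offloaded response time."""
    if current_gap >= rss_min_gap(v_rear, v_front, rho_offloaded, **limits):
        return "offloaded"
    return "local"  # controlled fallback to local execution


# Illustrative limits in m/s^2; these values are assumptions, not from the paper.
LIMITS = dict(a_max_accel=2.0, b_min_brake=4.0, b_max_brake=8.0)
```

With both vehicles at 20 m/s, a 0.5 s local response time requires a gap of 40.375 m, while a 1.0 s offloaded response time requires 56.5 m. A 50 m gap is therefore safe for local execution but not for offloading.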

2606.07083 2026-06-08 cs.RO 新提交

Predictive Style Matching: Natural and Robust Humanoid Locomotion

预测性风格匹配:自然且鲁棒的类人机器人行走

Simeon Nedelchev, Ekaterina Chaikovskaia, Egor Davydenko, Eduard Zaliaev, Roman Gorbachev

发表机构 * Moscow Institute of Physics and Technology (MIPT)(莫斯科物理技术学院) Innopolis University(因诺波利斯大学) Sber Robotics Center(Sber机器人中心)

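AI summary: Proposes Predictive Style Matching (PSM), in which an offline predictor maps the robot's lower-body state to upper-body joint and gait targets, substantially reducing style error while retaining the robustness of task-only rewards.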

详情
AI中文摘要

强化学习已成为类人机器人行走控制的主流方法:策略能够可靠地从仿真迁移到硬件,并从干扰中优雅恢复。然而,运动质量仍然落后:仅任务奖励往往收敛到僵硬、不对称的步态,而运动模仿方法改善了外观,但由于参考信号可能对抗恢复平衡所需的瞬态姿态,因此对外部干扰更加敏感。我们提出预测性风格匹配,其中离线预测器将机器人下半身状态历史和速度命令映射到可解释的上半身关节和步态目标,以在训练期间塑造奖励。由于目标是状态条件而非时间索引,且预测器仅在训练时使用,部署的控制器继承了仅任务奖励强化学习基线(RL baseline)的本体感觉接口和推理成本。在Unitree G1上,无论是在仿真还是硬件中,PSM将上半身风格误差比仅任务RL降低大约一个数量级,同时保持其跌倒恢复率,而运动模仿基线实现了最低的风格误差,但无法从干扰中恢复的频率大约高出五倍。

英文摘要

Reinforcement learning has become the prevailing approach to humanoid locomotion control: policies transfer reliably from simulation to hardware and recover gracefully from disturbances. Motion quality, however, still lags behind: task-only rewards often converge to stiff, asymmetric gaits, while motion imitation methods improve appearance but become more sensitive to external disturbances because reference signals can oppose the transient poses needed to regain balance. We propose Predictive Style Matching, in which an offline predictor maps the robot's lower-body state history and velocity commands to interpretable upper-body joint and gait targets that shape the rewards during training. Because the targets are state-conditioned rather than time-indexed and the predictor is used only at training time, the deployed controller inherits the proprioceptive interface and inference cost of a task-only RL baseline. On the Unitree G1, in both simulation and hardware, PSM reduces upper-body style error by roughly an order of magnitude over task-only RL while preserving its fall-recovery rate, whereas the motion-imitation baseline attains the lowest style error but fails to recover from disturbances about five times as often.

2606.07089 2026-06-08 cs.RO 新提交

Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning

必要时做梦:通过自适应多模态推理推进世界行动模型

Yinzhou Tang, Jingbo Xu, Yu Shang, Zihao Song, Chen Gao, Wei Wu, Yong Li

发表机构 * Tsinghua University(清华大学) Manifold AI

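AI summary: Proposes AdaWAM, which uses a lightweight dynamic router to adaptively trigger textual or visual reasoning, improving inference efficiency and performance on long-horizon, complex tasks.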

详情
AI中文摘要

世界行动模型(WAMs)为具身智能提供了一种有前景的方法,但现有方法严重依赖视频预测作为行动先验,缺乏自适应多模态推理,限制了其在长时、复杂任务中的有效性。我们观察到,WAMs在不同执行上下文中需要不同的多模态推理模式:在任务转换期间,文本推理对于指导高层行动预测至关重要,而在细粒度操作期间,视觉推理对于精确控制至关重要。基于这一观察,我们提出了AdaWAM,一种具有自适应多模态推理能力的世界行动模型。AdaWAM集成了一个轻量动态路由器,可在任务执行过程中根据需要自主触发文本或视觉推理。在模拟和真实世界具身任务上的实验表明,AdaWAM在显著提升推理效率的同时,超越了最先进的具身策略。代码和演示可在以下网址获取:https://adawam.github.io/。

英文摘要

World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heavily on video prediction as action priors and lack adaptive multimodal reasoning, limiting their effectiveness on long-horizon, complex tasks. We observe that WAMs require different multimodal reasoning modes under different execution contexts: textual reasoning is essential during task transitions to guide high-level action prediction, while visual reasoning is critical during fine-grained manipulation for precise control. Motivated by this observation, we propose AdaWAM, a world action model with adaptive multimodal reasoning abilities. AdaWAM integrates a lightweight dynamic router that autonomously triggers textual or visual reasoning as needed during task execution. Experiments on both simulated and real-world embodied tasks show that AdaWAM substantially improves inference efficiency while outperforming state-of-the-art embodied policies. Code and demos are available at: https://adawam.github.io/.

2606.07107 2026-06-08 cs.RO 新提交

Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models

Coarse-to-Control:面向视觉-语言-动作模型的动作令牌规划

Jinhao Wu, Shiduo Zhang, Yicheng Liu, Xiaopeng Yu, Sixian Li, Siyin Wang, Hang Zhao, Jing Huo, Yang Gao, Jingjing Gong, Xipeng Qiu, Yu-Gang Jiang

发表机构 * Nanjing University(南京大学) Shanghai Innovation Institute(上海创新研究院) Fudan University(复旦大学) Tsinghua University(清华大学)

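AI summary: Proposes Coarse-to-Control, a framework that plans natively in the action-token space by first predicting a sequence of coarse action tokens and then generating executable actions, improving performance on long-horizon tasks.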

详情
AI中文摘要

大多数视觉-语言-动作(VLA)模型直接将观测映射到动作,缺乏显式的中间规划,这限制了在早期错误累积的长程任务上的性能。我们提出Coarse-to-Control,一种规划-执行VLA模型,在动作令牌空间中原生引入规划。关键思想是让策略首先预测一个紧凑的粗粒度动作令牌序列,该序列总结了预期的未来轨迹,然后基于此规划生成可执行的动作令牌。由于规划和执行共享统一的离散动作词汇,规划保持接近控制流形,并提供直接可操作的指导,而不是必须被转换回运动命令的抽象提示。在LIBERO、SimplerEnv-WidowX和真实世界操作任务上的实验表明,动作令牌规划一致地优于直接动作生成,在长程多阶段任务上提升最大。

英文摘要

Most vision-language-action (VLA) models map observations directly to actions without explicit intermediate planning, which limits performance on long-horizon tasks where early mistakes compound. We propose Coarse-to-Control, a plan-execute VLA that introduces planning natively in the action-token space. The key idea is to let the policy first predict a compact sequence of coarse action tokens that summarize the intended future trajectory, and then generate executable action tokens conditioned on this plan. Because both planning and execution share a unified discrete action vocabulary, the plan stays close to the control manifold and provides directly actionable guidance rather than an abstract hint that must be translated back to motor commands. Experiments on LIBERO, SimplerEnv-WidowX, and real-world manipulation tasks show that action-token planning consistently improves over direct action generation, with the largest gains on long-horizon multi-stage tasks.

2606.07170 2026-06-08 cs.RO 新提交

Test-Time Trajectory Optimization for Autonomous Driving

自动驾驶的测试时轨迹优化

Yihong Xu, Eloi Zablocki, Yuan Yin, Elias Ramzi, Ellington Kirby, Alexandre Boulch, Matthieu Cord

发表机构 * valeo.ai Sorbonne Université(索邦大学) CNRS(法国国家科学研究中心) ISIR(智能系统与机器人研究所)

AI总结 提出TOAD方法,在测试时使用交叉熵方法优化轨迹,无需重新训练即可提升多种规划器的性能。

详情
AI中文摘要

端到端的自动驾驶规划器通常生成一组候选轨迹,对每个轨迹评分,并返回得分最高的候选轨迹。然而,评分器仅在生成候选轨迹后应用,无法影响轨迹集合:无论评分器质量如何,候选轨迹集较弱会限制规划性能。我们转而将评分器视为学习到的轨迹级奖励函数,并搜索最大化该奖励的轨迹。我们的方法TOAD在测试时运行交叉熵方法,从规划器的候选轨迹进行热启动。它无需重新训练,可即插即用于现有规划器。在六个基础规划器上,TOAD在NAVSIM-v1(94.7 PDMS)、NAVSIM-v2(56.3 EPDMS)和闭环HUGSIM基准测试中提升了结果。代码将通过项目页面公开:https://valeoai.github.io/TOAD/。

英文摘要

End-to-end planners for autonomous driving typically generate a set of candidate trajectories, score each one, and return the highest-scoring candidate. However, the scorer is applied only after the proposals are generated and cannot influence the set of trajectories: a weak set of candidates limits planning performance regardless of the scorer's quality. We instead treat the scorer as a learned trajectory-level reward function and search for trajectories that maximize it. Our method, TOAD, runs the Cross-Entropy Method at test time, warm-started from the planner's proposals. It requires no retraining and is plug-and-play for existing planners. Across six base planners, TOAD improves results on NAVSIM-v1 (94.7 PDMS), NAVSIM-v2 (56.3 EPDMS), and the closed-loop HUGSIM benchmark. The code will be made publicly available via the project page: https://valeoai.github.io/TOAD/.
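The test-time search described in this abstract can be illustrated with a minimal Cross-Entropy Method sketch. Everything here is an assumption made for illustration: trajectories are flat lists of floats, `score` is a hypothetical stand-in for the learned scorer, and the hyperparameters are arbitrary. It is not TOAD's implementation.

```python
import random
import statistics

def cem_refine(proposals, score, iters=20, samples=64, elite_frac=0.125,
               min_std=1e-3, seed=0):
    """Cross-Entropy Method warm-started from planner proposals.

    proposals: list of toy trajectories, each a list of floats.
    score: callable mapping a trajectory to a scalar (higher is better).
    Returns the best trajectory seen, never worse than the best proposal.
    """
    rng = random.Random(seed)
    dim = len(proposals[0])
    # Warm start: fit the initial Gaussian to the planner's proposals.
    mean = [statistics.fmean(p[d] for p in proposals) for d in range(dim)]
    std = [max(statistics.pstdev([p[d] for p in proposals]), min_std)
           for d in range(dim)]
    best = max(proposals, key=score)
    n_elite = max(2, int(samples * elite_frac))
    for _ in range(iters):
        # Sample candidates, keep the top-scoring elites, refit the Gaussian.
        batch = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
                 for _ in range(samples)]
        batch.sort(key=score, reverse=True)
        elites = batch[:n_elite]
        if score(elites[0]) > score(best):
            best = elites[0]
        mean = [statistics.fmean(e[d] for e in elites) for d in range(dim)]
        std = [max(statistics.pstdev([e[d] for e in elites]), min_std)
               for d in range(dim)]
    return best
```

Because the search only ever replaces the incumbent with a higher-scoring sample, the returned trajectory scores at least as well as the best original proposal under the same scorer, which is the sense in which such a search is plug-and-play.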

2606.07186 2026-06-08 cs.RO cs.SE 新提交

A Causal Probabilistic Framework for Perception-Informed Closed-Loop Simulation of Autonomous Driving

面向感知信息闭环仿真的自动驾驶因果概率框架

Zhennan Fei, Rickard Johansson, Mikael Andersson, Matthias Eng, Mattias Eriksson, Kaveh Kianfar, Sadegh Rahrovani, Chris van der Ploeg, Michael Borth, Maren Buermann, Michiel Braat, Henk Goossens, Zijian Han, Majid Khorsand Vakilzadeh, Gabriel Rodrigues de Campos

发表机构 * ETH Zürich(苏黎世联邦理工学院) KTH Royal Institute of Technology(皇家理工学院) University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种因果概率模型框架,将感知误差注入标准仿真环境,揭示理想SIL无法捕获的潜在风险,为SOTIF验证提供可扩展路径。

详情
AI中文摘要

软件在环(SIL)仿真是现代汽车安全功能验证的基石。然而,许多当前框架采用理想感知,绕过了感知算法的功能不足,导致过于乐观的安全评估。本文提出一种感知信息SIL测试方法,弥合了地面实况仿真与真实世界感知行为之间的差距。我们提出了一个将因果概率模型纳入标准化、基于场景的仿真工具链的框架,适用于高级驾驶辅助系统(ADAS)和自动驾驶系统(ADS)。我们的方法能够系统性地注入由物理触发条件(如雾、雨和物体合并场景)导出的真实感知误差,例如检测丢失、尺寸不准确和定位偏移。通过在标准化仿真环境中评估这些“故障”,我们证明了感知信息测试揭示了理想SIL环境无法捕获的潜在操作风险,为SOTIF(ISO 21448)验证提供了可扩展的途径。

英文摘要

Software-in-the-loop (SIL) simulation is a cornerstone for the validation of modern automotive safety functions. However, many current frameworks utilize ideal sensing, which bypasses the functional insufficiencies of perception algorithms, leading to over-optimistic safety assessments. This paper proposes a perception-informed SIL testing methodology that bridges the gap between ground-truth simulation and real-world perception behavior. We present a framework for incorporating causal probabilistic models into standardized, scenario-based simulation toolchains, applicable to both Advanced Driver Assistance Systems (ADAS) and Autonomous Driving Systems (ADS). Our approach enables the systematic injection of realistic perception errors, such as loss of detection, sizing inaccuracies, and positioning offsets, derived from physical triggering conditions like fog, rain, and object-merging scenarios. By evaluating these "faults" within a standardized simulation environment, we demonstrate that perception-informed testing reveals latent operational risks that ideal SIL environments fail to capture, providing a scalable pathway for SOTIF (ISO 21448) validation.

2606.07193 2026-06-08 cs.RO 新提交

Shield-Loco: Shielding Locomotion Policies with Predictive Safety Filtering

Shield-Loco:基于预测性安全过滤的防护运动策略

Aditya Shirwatkar, Sebastian Sanokowski, Shishir Kolathaya, Aaron Johnson, Majid Khadiv

发表机构 * Robert Bosch Center for Cyber Physical Systems(罗伯特·博世网络物理系统中心) Indian Institute of Science(印度科学研究院) Munich Institute of Robotics and Machine Intelligence (MIRMI)(慕尼黑机器人与机器智能研究所(MIRMI)) Technical University of Munich(慕尼黑工业大学) Department of Computer Science & Automation(计算机科学与自动化系) Department of Mechanical Engineering(机械工程系) Carnegie Mellon University(卡内基梅隆大学) Institute for Advanced Study(高等研究院)

AI总结 提出一种预测性安全过滤器,通过全物理模型优化接触序列,减少四足机器人在密集杂乱环境中的安全违规,同时保持任务性能。

详情
AI中文摘要

强化学习(RL)策略能够实现动态腿部运动,但缺乏避免训练中未出现的约束违反的机制。大规模离线安全学习对于覆盖所有边缘情况是不切实际的。现有的安全框架要么依赖无法推理全身行为的降阶模型,要么需要保守的恢复控制器,这会降低任务性能。我们提出一种预测性安全过滤器,它对输入到RL策略的名义接触位置进行事后过滤。当预测到碰撞时,基于采样的优化器使用全物理模型异步搜索更安全的接触序列,而学习的价值函数则引导长期回报。我们的三个算法组件(采样接触的几何投影、动量增强更新和副本交换)使得在不连续的接触景观中优化变得可行。我们在密集杂乱环境中的四足机器人上验证了该过滤器,无论是在仿真还是真实世界中,都显示出在最小偏离名义输入的情况下大幅减少安全违规。

英文摘要

Reinforcement learning (RL) policies enable dynamic legged locomotion but lack mechanisms to avoid violations of safety constraints that are absent during training. Large-scale offline safe learning is impractical for covering all edge cases. Existing safety frameworks either rely on reduced-order models that cannot reason about whole-body behaviors or require conservative recovery controllers that degrade task performance. We propose a predictive safety filter that post-hoc filters the nominal contact locations fed to the RL policy. When a collision is predicted, a sampling-based optimizer asynchronously searches for safer contact sequences using a full-physics model, while a learned value function bootstraps long-horizon returns. Our three algorithmic components (geometric projection of sampled contacts, momentum-augmented updates, and replica-exchange) make the optimization tractable in a discontinuous contact landscape. We validate the filter on a quadruped robot in dense, cluttered environments, both in simulation and in the real world, showing substantial reductions in safety violations with minimal deviation from the nominal input.

2606.07211 2026-06-08 cs.RO cs.AI 新提交

An Abstract Architecture for Explainable Autonomy in Hazardous Environments

危险环境中可解释自主性的抽象架构

Matt Luckcuck, Hazel M Taylor, Marie Farrell

发表机构 * Maynooth University(梅努斯大学) University of Manchester(曼彻斯特大学)

AI总结 提出一种支持自主系统解释其行为的抽象架构,旨在通过设计可解释性增强用户信任,并以民用核工业为例展示应用。

Comments Originally published 20th of October 2022 at the Second International Workshop on Requirements Engineering for Explainable Systems (RE4ES), which was hosted by the International Requirements Engineering Conference 2022

详情
AI中文摘要

自主机器人系统被提议用于危险环境,通常是为了减少人类工人的风险。在不久的将来,人类工人可能会继续使用和指挥这些自主机器人,就像其他计算机化工具一样,但具有更复杂的决策能力。因此,工程努力的一个重要方向是确保这些用户信任系统。最近的文献表明,可解释性与系统的可信度密切相关。与安全性和保密性属性一样,可解释性应该被设计到系统中,而不是事后添加。本文提出了一种抽象架构,支持自主系统解释其行为(可解释自主性),为实施可解释自主系统提供了设计模板。我们给出了一个工作示例,说明我们的架构如何应用于民用核工业,其中工人和监管机构都需要信任系统的决策能力。

英文摘要

Autonomous robotic systems are being proposed for use in hazardous environments, often to reduce the risks to human workers. In the immediate future, it is likely that human workers will continue to use and direct these autonomous robots, much like other computerised tools but with more sophisticated decision-making. Therefore, one important area on which to focus engineering effort is ensuring that these users trust the system. Recent literature suggests that explainability is closely related to how trustworthy a system is. Like safety and security properties, explainability should be designed into a system, instead of being added afterwards. This paper presents an abstract architecture that supports an autonomous system explaining its behaviour (explainable autonomy), providing a design template for implementing explainable autonomous systems. We present a worked example of how our architecture could be applied in the civil nuclear industry, where both workers and regulators need to trust the system's decision-making capabilities.

2606.07217 2026-06-08 cs.RO cs.CV cs.LG 新提交

Robotic Policy Adaptation via Weight-Space Meta-Learning

通过权重空间元学习实现机器人策略自适应

Christian Bianchi, Siamak Yousefi, Alessio Sampieri, Andrea Roberti, Luca Rigazio, Fabio Galasso, Luca Franco

发表机构 * ItalAI University of Verona(维罗纳大学) Sapienza University of Rome(罗马萨皮恩扎大学)

AI总结 提出WIZARD框架,通过权重空间元学习从语言指令和演示视频生成任务特定LoRA参数,无需微调即可适应新任务,在LIBERO上性能提升高达14倍。

详情
AI中文摘要

视觉-语言-动作(VLA)模型正成为机器人操作的一种有前景的范式,能够从大规模演示和动作标签语料库中训练通用策略。然而,将这些模型适应新任务通常仍需要任务特定的演示、动作注释和额外的微调,使得部署成本高昂且难以扩展。我们提出WIZARD,一种权重空间元学习框架,通过为冻结的VLA策略生成任务特定的LoRA参数来避免任务特定的微调。仅凭语言指令和简短的演示视频,WIZARD即可在单次前向传播中预测相应的自适应权重,无需目标任务动作标签或测试时优化。在元训练期间,WIZARD学习将任务证据直接映射到专家LoRA更新,在权重空间中捕获任务之间的关系。在LIBERO上的实验表明,WIZARD在未见过的数据集集合上性能提升高达约2倍,在未见过的任务上提升高达约14倍。在Franka Emika Panda机器人上,WIZARD持续优于真实域自适应基线,表明生成的适配器提供了超越仿真的任务级特化。

英文摘要

Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.

2606.07244 2026-06-08 cs.RO cs.AI cs.CV 新提交

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

超越航点:面向视觉语言导航的轨迹中心航点范式

Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室)

AI总结 提出轨迹航点范式,通过TSDF引导的扩散策略预测可执行轨迹,解决VLN-CE中航点不可达与规划控制不一致问题,在基准上取得最优性能。

详情
AI中文摘要

连续环境中的视觉语言导航(VLN-CE)要求智能体在类似真实世界的环境中遵循自然语言指令进行导航。大多数VLN-CE方法采用三阶段框架:航点预测器提出可导航航点,导航器选择最佳航点,低层控制器执行移动。然而,这种解耦范式常导致航点不可达或规划与控制不一致。本文提出一种称为轨迹航点的新范式,将每个候选航点锚定到可执行轨迹上。为此,我们设计了TSDF引导的扩散策略作为轨迹航点预测器,引导轨迹生成避开障碍物,从本质上保证预测航点的可达性。进一步提出轨迹增强导航器,将关联轨迹作为额外信息注入规划,实现高层语义决策与低层执行的严格一致性。在VLN-CE基准上的大量实验表明,我们的轨迹航点范式优于基线方法。

英文摘要

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, and a navigator selects the best waypoint, with a low-level controller executing the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.

2606.07304 2026-06-08 cs.RO 新提交

CAPE: Contrastive Action-conditioned Parallel Encoding for Embodied Planning

CAPE: 用于具身规划的对比动作条件并行编码

Cong Chen, Haowen Wang, Zhixiang Zhang, Pei Ren, Zhengping Che

AI总结 提出CAPE框架,通过对比学习区分不同动作序列的未来结果,实现高效视觉动力学建模,在真实世界和零样本迁移任务中显著提升规划性能并降低推理成本。

Comments 19 pages, 7 figures

详情
AI中文摘要

具身智能体需要在执行前预测候选动作的未来后果,以便有效规划。现有的视觉动力学模型通过重建未来视觉状态或展开密集潜在表示来学习,这会将学习能力分散到视觉显著但与规划无关的内容上,而不是驱动操作结果的动作条件变化。我们提出CAPE,一种对比动作条件并行编码框架,通过区分不同动作序列诱导的未来结果来学习视觉动力学。给定初始观察和候选动作序列,CAPE在单次前向传播中解码完整的未来潜在轨迹,并使用目标收敛对比目标进行训练,该目标对齐对应相同未来结果的预测,同时分离对应不同结果的预测。在真实世界DROID和零样本迁移到RoboCasa上,CAPE在状态检索、离线动作匹配和闭环规划方面显著优于先前基线,同时在长预测范围内显著降低了规划时的推理成本。

英文摘要

Embodied agents need to predict the future consequences of candidate actions in order to plan effectively before execution. Existing visual dynamics models learn by reconstructing future visual states or rolling out dense latent representations, which spreads learning capacity across visually salient but planning-irrelevant content rather than the action-conditioned changes that drive manipulation outcomes. We propose CAPE, a Contrastive Action-conditioned Parallel Encoding framework that learns visual dynamics by distinguishing the future outcomes induced by different action sequences. Given an initial observation and a candidate action sequence, CAPE decodes the full future latent trajectory in a single forward pass and is trained with a Goal-Convergent Contrastive Objective that aligns predictions corresponding to the same future outcome while separating those corresponding to different outcomes. On real-world DROID and zero-shot transfer to RoboCasa, CAPE substantially outperforms prior baselines on future-state retrieval, offline action matching, and closed-loop planning, while notably reducing planning-time inference cost at long prediction horizons.

2606.07383 2026-06-08 cs.RO cs.LG 新提交

RhinoVLA Technical Report

RhinoVLA 技术报告

Huixi Intelligence: Chen Zhang, Chenyang Zhou, Guanglei Ding, Guanghui He, Haibin Gao, Jiajia Chen, Jianyong Zhang, Lianyi Yu, Ningyi Xu, Ping Xu, Qingchen Li, Yingjun Hu, Yijia Zhang, Yuxi Liu

发表机构 * Huixi Intelligence(慧溪智能)

AI总结 针对边缘硬件上VLA模型部署延迟问题,提出RhinoVLA,通过令牌高效骨干、连续动作专家和统一接口实现实时闭环控制,在Huixi R1上达到11.69 Hz推理速度。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现出强大潜力,但在边缘硬件上的实时部署仍具挑战。本文中,我们识别出VLM视觉和上下文令牌是部署延迟的主要来源:对于以GEMM为主的投影算子,当模型维度固定时,计算量随输入令牌数量线性增长。基于此观察,我们提出RhinoVLA,一种与Huixi R1边缘SoC协同设计的面向部署的VLA模型。RhinoVLA采用令牌高效的Qwen3-VL骨干和连续动作专家,在保留预训练多模态能力的同时减少VLM侧的令牌和计算负担。为支持跨机器人学习,RhinoVLA进一步引入统一接口,结合视图注册表、72维物理状态-动作槽空间和机器人实例LoRA,使异构机器人观测和动作模式能在共享策略下对齐。在部署方面,RhinoVLA通过硬件感知编译、混合精度执行和并行视觉编码进行优化。实验表明,RhinoVLA在相似参数量下实现了与π0.5相当的下游性能,同时在Huixi R1上达到11.69 Hz的端到端推理,满足10 Hz实时闭环控制目标。该项目将在以下网址开源:https://github.com/HuixiAI/RhinoVLA。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment latency: for GEMM-dominated projection operators, computation grows linearly with the number of input tokens when model dimensions are fixed. Motivated by this observation, we propose RhinoVLA, a deployment-oriented VLA model co-designed with the Huixi R1 edge SoC. RhinoVLA adopts a token-efficient Qwen3-VL backbone and a continuous Action Expert, reducing the VLM-side token and computation burden while preserving pretrained multimodal capability. To support cross-robot learning, RhinoVLA further introduces a unified interface that combines View Registry, 72D physical state-action slot space, and robot-instance LoRA, allowing heterogeneous robot observations and action schemas to be aligned under a shared policy. On the deployment side, RhinoVLA is optimized through hardware-aware compilation, mixed-precision execution, and parallel visual encoding. Experiments show that RhinoVLA achieves downstream performance comparable to π0.5 at a similar parameter scale, while reaching 11.69 Hz end-to-end inference on Huixi R1, meeting the 10 Hz real-time closed-loop control target. The project will be open-sourced at https://github.com/HuixiAI/RhinoVLA.
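The linear-in-tokens claim and the control-rate figures in this abstract can be checked with a few lines of arithmetic. The sketch below uses the standard estimate of 2 * n * d_in * d_out floating-point operations for a dense projection over n tokens; the layer dimensions and token counts are made-up values for illustration, not RhinoVLA's.

```python
def projection_flops(num_tokens, d_in, d_out):
    """Approximate FLOPs of one dense projection (a single GEMM).

    Each of the num_tokens rows needs d_in multiply-adds per output
    feature, giving 2 * num_tokens * d_in * d_out operations.
    """
    return 2 * num_tokens * d_in * d_out

def period_ms(rate_hz):
    """Time budget per inference step, in milliseconds, at a given rate."""
    return 1000.0 / rate_hz

# Hypothetical sizes: with fixed model dimensions, halving the token
# count halves the projection cost.
full = projection_flops(num_tokens=512, d_in=2048, d_out=2048)
half = projection_flops(num_tokens=256, d_in=2048, d_out=2048)
```

Under the same arithmetic, the reported 11.69 Hz corresponds to roughly 85.5 ms per step, inside the 100 ms budget implied by a 10 Hz control target.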

2606.07386 2026-06-08 cs.RO 新提交

Spline Policy: A Structured Representation for Robot Policies

样条策略:机器人策略的结构化表示

Mengze Tian, Yiming Li, Sichao Liu, Auke Ijspeert, Sylvain Calinon

发表机构 * École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院(EPFL)) Idiap Research Institute(Idiap研究所)

AI总结 提出样条策略(SP),用样条参数替代动作块,保留策略主干,支持连续轨迹解码、时域重采样、参数空间编辑及下游控制,并具有局部修正机制。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

现代机器人操作的模仿学习策略通常将动作表示为固定分辨率的动作块,这种方法简单有效,但在执行前暴露的几何和时间结构有限。本文研究了样条策略(SP),一种结构化表示,它用样条参数替换动作块,同时保持策略主干不变。预测的样条可以解码为紧凑的连续轨迹,在不同时间分辨率下查询,在参数空间中进行约束或编辑,并传递给下游控制器。对于二次样条输出,相同的表示还可以通过解析距离场构造转换为状态依赖的向量场。在该构造的正则性和投影假设下,诱导的动力学不会增加与生成样条的距离,从而在预测运动周围产生有原则的局部修正机制。样条输出进一步支持从观测到样条参数、轨迹和流场的不确定性传播,并且可以与经典控制机制(如零空间碰撞避免)结合,而无需重新训练策略主干。我们使用扩散、流匹配、基于Transformer和视觉-语言-动作主干实例化了SP。在低维运动学习、匹配主干下的模拟操作、灵巧操作以及真实机器人案例研究中的实验表明,SP与现代策略学习器兼容,同时暴露了有用的运动结构特性,包括紧凑解码、时间重采样、预测运动周围的局部修正、不确定性评估和控制器兼容性。

英文摘要

Modern imitation-learning policies for robot manipulation often represent actions as fixed-resolution action chunks, which are simple and effective but expose limited geometric and temporal structure before execution. This paper studies Spline Policy (SP), a structured representation that replaces action chunks with spline parameters while keeping the policy backbone unchanged. The predicted spline can be decoded as a compact continuous trajectory, queried at different temporal resolutions, constrained or edited in parameter space, and passed to downstream controllers. For quadratic spline outputs, the same representation can also be converted into a state-dependent vector field through an analytical distance-field construction. Under the regularity and projection assumptions of this construction, the induced dynamics do not increase the distance to the generated spline, yielding a principled local corrective mechanism around the predicted motion. The spline output further supports uncertainty propagation from observations to spline parameters, trajectories, and flow fields, and can be combined with classical control mechanisms such as null-space collision avoidance without retraining the policy backbone. We instantiate SP with diffusion, flow-matching, transformer-based, and vision-language-action backbones. Experiments in low-dimensional motion learning, simulated manipulation under matched backbones, dexterous manipulation, and real-robot case studies show that SP remains compatible with modern policy learners while exposing useful motion-structure properties, including compact decoding, temporal resampling, local correction around predicted motions, uncertainty evaluation, and controller compatibility.
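The idea of querying one predicted spline at different temporal resolutions can be sketched as follows. The abstract does not specify the spline parameterization, so this example assumes a single quadratic Bézier segment defined by three hypothetical control points; the function names are illustrative only.

```python
def quadratic_bezier(p0, p1, p2, t):
    """Evaluate a quadratic Bezier segment at t in [0, 1]."""
    a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
    return tuple(a * x0 + b * x1 + c * x2 for x0, x1, x2 in zip(p0, p1, p2))

def resample(p0, p1, p2, num_points):
    """Decode the same spline parameters into num_points evenly spaced samples."""
    return [quadratic_bezier(p0, p1, p2, i / (num_points - 1))
            for i in range(num_points)]

# Three 2-D control points stand in for the policy's predicted parameters.
ctrl = ((0.0, 0.0), (1.0, 2.0), (2.0, 0.0))
coarse = resample(*ctrl, num_points=5)   # e.g. for a slow controller
fine = resample(*ctrl, num_points=9)     # same motion, denser sampling
```

Both sample sets lie on the same curve, so a downstream controller can pick its own rate without the policy being re-queried, in contrast to a fixed-resolution action chunk.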

2606.07389 2026-06-08 cs.RO 新提交

Simulation-Driven Imitation Learning for Biosignals-Free Shared-Autonomy Prosthetic Grasping

模拟驱动的无生物信号共享自主假肢抓取模仿学习

Kaijie Shi, Wanglong Lu, Huiling Chen, Vinicius Prado da Fonseca, Ting Zou, Hanli Zhao, Xianta Jiang

发表机构 * Memorial University of Newfoundland(纽芬兰纪念大学) Wenzhou University(温州大学)

AI总结 提出一个自动生成多样化抓取演示的模拟框架,结合物理可行抓取合成、自然到达轨迹重定向和程序化环境执行,通过模仿学习实现高成功率和强泛化能力的假肢控制。

详情
AI中文摘要

无生物信号的上肢假肢共享自主控制旨在不依赖EMG或其他生理信号的情况下实现自然且低努力的操作。最近的基于模仿学习的方法显示出有希望的结果,但其可扩展性受到收集大量真实世界人类演示数据的成本和变异性的限制。在这项工作中,我们提出了一个可扩展的模拟框架,该框架从腕部安装的虚拟摄像头自动生成多样化的到达-抓取演示。该框架结合了物理可行的抓取合成、自然到达轨迹重定向以及在程序化生成的室内环境中的到达-抓取-提升执行。它记录腕部视角观察、本体感觉和动作,以构建用于模仿学习的大规模演示数据集。通过广泛的模拟基准测试,我们评估了物体和场景的泛化能力,并比较了几种代表性的最先进模仿学习方法。结果表明,模拟演示足够丰富和一致,可用于有效的策略学习。在三个现实场景中,学习到的模拟到现实策略实现了超过90%的抓取成功率,超越了基线方法,并表现出更强的泛化能力,突显了模拟驱动训练在无生物信号共享自主假肢抓取中的前景。演示可在 https://sites.google.com/view/sim-prosthetic-grasp/home 获取。

英文摘要

Biosignals-free shared-autonomy control of upper-limb prosthetic hands aims to enable natural and low-effort manipulation without relying on EMG or other physiological signals. Recent imitation-learning-based approaches have shown promising results, but their scalability is limited by the cost and variability of collecting large amounts of real-world human demonstration data. In this work, we present a scalable simulation framework that automatically generates diverse reach-to-grasp demonstrations from a wrist-mounted virtual camera. The framework combines physically feasible grasp synthesis, retargeting of natural reaching trajectories, and reach-grasp-lift execution in procedurally generated indoor environments. It records wrist-view observations, proprioception, and actions to build a large-scale demonstration dataset for imitation learning. Through extensive simulation benchmarks, we evaluate object and scene generalization and compare several representative state-of-the-art imitation learning methods. Results show that the simulated demonstrations are sufficiently rich and consistent for effective policy learning. In three realistic settings, the learned sim-to-real policy achieves over 90% grasp success, surpasses baseline methods, and exhibits stronger generalization, highlighting the promise of simulation-driven training for biosignals-free shared-autonomy prosthetic grasping. The demonstrations are available at https://sites.google.com/view/sim-prosthetic-grasp/home.

2606.07424 2026-06-08 cs.RO 新提交

Rapid co-design of Buoyancy-assisted robots for Challenging Locomotion using Gaussian Evolutionary Specialists

基于高斯进化专家的浮力辅助机器人快速协同设计以应对挑战性运动

Ankit Sinha, Nitish Sontakke, Dennis Hong, Yusuke Tanaka, Sehoon Ha

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of California, Los Angeles(加州大学洛杉矶分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出高斯进化专家(GES)框架,通过解耦设计空间划分与策略学习,在浮力辅助轻量腿单元(BALLU)上实现5-25%性能提升,并缩短37%设计优化时间。

Comments Submitted to RA-L

详情
AI中文摘要

设计高性能腿式机器人需要联合优化形态和控制。无模型强化学习(RL)为开发鲁棒控制器提供了模型预测控制的替代方案,无需明确指定机器人动力学。因此,我们看到了使用RL训练控制器和评估设计以优化机器人形态。虽然RL在运动方面取得了成功,但由于重复的策略训练,将其用于协同设计内循环成本高昂。基于形态的条件通用策略提供了一种有前景的替代方案,但遭受行为多样性崩溃,收敛到单一策略,在不同设计上表现次优。另一方面,端到端混合专家(MoE)架构因其表示崩溃而失败。我们提出高斯进化专家(GES),一个将设计空间划分与策略学习解耦以显式捕获多样行为的框架。GES将专家策略分配给演化的高斯区域,并通过训练、探测和领土扩展迭代优化它们。生成的专家被集成到设计采样循环中,用直接评估替代昂贵的重新训练。在浮力辅助轻量腿单元(BALLU)上测试时,GES发现的设计比朴素通用策略性能高5-25%。在硬件上,GES优化设计克服了24厘米高的障碍——比基线BALLU设计提升3倍。此外,GES将设计优化时间缩短了37%。

英文摘要

Designing high-performance legged robots requires jointly optimizing morphology and control. Model-free Reinforcement Learning (RL) offers an alternative to model-predictive control for developing robust controllers without explicitly specifying robot dynamics. Thus, we have seen the use of RL to train controllers and evaluate designs for robot morphology optimization. While RL has shown success in locomotion, using it in the co-design inner loop is expensive due to repeated policy training. Universal policies conditioned on morphology offer a promising alternative, but suffer from behavioral diversity collapse, converging to a single strategy that performs sub-optimally across designs. On the other hand, end-to-end Mixture-of-Experts (MoE) architectures fail due to a collapse in their representation. We propose Gaussian Evolutionary Specialists (GES), a framework that decouples design-space partitioning from policy learning to capture diverse behaviors explicitly. GES assigns specialist policies to evolving Gaussian regions and iteratively refines them via training, probing, and territory expansion. The resulting specialists are integrated into a design sampling loop, replacing costly re-training with direct evaluation. When tested on the Buoyancy-Assisted Light Legged Unit (BALLU), GES discovers designs with 5-25% higher performance than naive universal policies. On hardware, a GES-optimized design overcomes a 24 cm tall obstacle, a 3x improvement over the baseline BALLU design. Moreover, GES curtails design optimization time by 37%.

2606.07437 2026-06-08 cs.RO cs.AI cs.HC cs.SE cs.SY eess.SY 新提交

Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability

重新构想自动驾驶时代的ISO 26262:通过可迁移性和可预测性增强可控性

Chaitanya Shinde, Hadi Hajieghrary, Paul Schmitt, Adam Shoemaker, Bodo Seifert, Steve Kenner

发表机构 * Torc Robotics, Inc.(Torc机器人公司) Reynolds & Moore(雷纳德与摩尔公司) Critical Systems Analysis, LLC(Critical Systems Analysis LLC)

AI总结 针对自动驾驶汽车缺乏人类驾驶员的问题,将ISO 26262中的可控性分解为可迁移性和可预测性两个可审计维度,并给出量化框架,以支持SAE L4/L5系统的功能安全论证。

详情
AI中文摘要

ISO 26262标准通过基于严重性、暴露度和可控性的风险评估来定义道路车辆的功能安全,其基础是人类驾驶车辆范式。在自动驾驶汽车(AV)的背景下,缺乏人类驾驶员需要重新审视这些原则。本文将可控性占位符分解为ISO 26262的两个可审计证据维度,引入了两个可测量的子概念:可迁移性和可预测性。可迁移性扩展了可控性,以捕捉AV系统将控制权移交给专用后备安全机制的能力,而可预测性则捕捉外部主体预测AV行为的难易程度。可预测性基于人机交互启发原则进行形式化定义,并提供了量化它的数学框架。引入了设计能力与可实现能力之间的差距,以区分架构后备声明与场景条件相关的可实现后备能力。所提出的度量与ISO 26262和ISO/PAS 21448(SOTIF)保持一致,使后备和交互声明在ODD切片上可证伪和可追溯。这些维度补充而非替代现有标准,这些增强保留了ISO 26262的结构,同时将其适用性扩展到在SAE L4和L5级别运行的无驾驶员自动化系统。

英文摘要

The ISO 26262 standard defines functional safety for road vehicles through risk assessments based on Severity, Exposure, and Controllability, grounded in a human-driven vehicle paradigm. In the context of autonomous vehicles (AVs), the absence of a human driver necessitates revisiting these principles. This paper decomposes the Controllability placeholder into two auditable evidence dimensions of ISO 26262 by introducing two measurable sub-concepts: Transferability and Predictability. Transferability extends Controllability to capture AV systems' ability to hand off control to dedicated fallback safety mechanisms, while Predictability captures how easily external agents can anticipate AV behavior. Predictability is formally defined from human-robot interaction-inspired principles, and a mathematical framework is provided to quantify it. A designed-versus-achievable gap is introduced to distinguish architectural fallback claims from scene-conditioned achievable fallback capability. The proposed metrics align with ISO 26262 and ISO/PAS 21448 (SOTIF), rendering fallback and interaction claims falsifiable and traceable across ODD slices. These dimensions complement rather than replace existing standards, and the enhancements preserve the structure of ISO 26262 while extending its applicability to driverless automated systems operating at SAE Levels 4 and 5.

2606.07464 2026-06-08 cs.RO cs.AI cs.CV 新提交

Planning-aligned Token Compression for Long-Context Autonomous Driving

面向长上下文自动驾驶的规划对齐令牌压缩

Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding, Boyi Li, Alexander Popov, Yan Wang, Maximilian Igl, Yiming Li, Danfei Xu, Nikolai Smolyanskiy, Boris Ivanovic, Ping Luo, Marco Pavone

发表机构 * NVIDIA Research(NVIDIA研究) School of Computing and Data Science, The University of Hong Kong(计算与数据科学学院,香港大学)

AI总结 提出COMPACT-VA框架,基于条件VQ-VAE将长上下文压缩为有界表示,通过规划对齐实现决策关键信息保留,在动态场景中成功率提升超6%,速度提升3.3倍。

Comments 9 pages

详情
AI中文摘要

整体视觉-动作模型代表了自动驾驶中的一种新兴范式。然而,这种架构在编码用于复杂交互的扩展时间上下文时,会产生迅速超过实时计算预算的令牌序列。虽然线性变换器和外部记忆等方法试图使上下文轻量化,但令牌压缩与架构最为兼容,因为它不需要修改主干网络。然而,现有的压缩采用基于规则的启发式方法(如时间衰减),与规划解耦,存在丢失决策关键信息的风险。我们提出COMPACT-VA,一种基于条件VQ-VAE的规划对齐工作记忆框架,将扩展上下文压缩为有界表示。压缩条件同时基于历史轨迹和学习的规划意图,其中后验编码器在训练期间从未来轨迹中提炼规划意图,而先验编码器学习从压缩观测中预测它。压缩记忆与预测的潜在变量拼接,输入策略进行端到端优化,从而在保留决策关键信息的情况下进行规划。我们在历史上下文对行为正确性(如停车、让行或前行)最关键的高信号动态场景中进行评估,并相应地设计了行为指标。在可比的令牌预算下,我们在成功率上实现了超过6%的提升(68.3%),且各项指标一致提升。消融实验验证了规划对齐耦合的有效性。闭环评估证实,与未压缩处理相比,COMPACT-VA在保持一般驾驶性能的同时实现了3.3倍的速度提升和2.7倍的内存减少。

英文摘要

Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve >6% improvement (68.3%) on success rates with consistent gains across metrics. Ablations validate planning-aligned coupling effectiveness. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with a 3.3x speedup and 2.7x memory reduction over uncompressed processing.

2606.07506 2026-06-08 cs.RO 新提交

Affordance-Based Hierarchical Reinforcement Learning for Quadruped Pedipulation

基于可供性的四足机器人足式操控层级强化学习

Tuba Girgin, Jose Castelblanco, Gabriel Rodriguez, Emre Girgin, Cagri Kilic

发表机构 * Embry-Riddle Aeronautical University(埃姆布里-瑞德航空航天大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出三级层级强化学习框架,利用姿态和交互点可供性引导导航与足式操控策略,在仿真和真实环境中实现自主物体操控。

Comments This paper is submitted to Wiley Journal of Field Robotics

详情
AI中文摘要

四足机器人的物体操控能力是一个开放的研究挑战。虽然先前的研究侧重于低级策略学习,任务执行仍依赖于专家设计的高级轨迹。自主选择目标物体上具有可供性的交互点和具有可供性的机器人基座姿态消除了对预设计轨迹的需求。本研究提出了一个三级层级强化学习(RL)框架,利用姿态可供性来引导导航策略,而导航策略驱动运动策略。此外,足式操控策略由交互点可供性引导,实现四足机器人的物体中心姿态对齐和有效的末端执行器操控规划。我们在IsaacSim生态系统中训练所提出的框架,并在仿真和真实环境中进行评估。我们在仿真中研究了姿态可供性在多个场景下的有效性,同时在真实环境中验证了各种物体交互任务,形成了物体交互数据集。结果表明,所提出的框架能够基于可供性自主识别候选姿态,并在无需人类引导的情况下成功执行真实世界中的物体操控任务。

英文摘要

The object manipulation capabilities of quadruped robots remain an open research challenge. While previous studies have focused on low-level policy learning, task execution still relies on expert-designed high-level trajectories. Autonomous selection of both an affordable interaction point on the target object and an affordable robot base pose removes the need for pre-designed trajectories. This study proposes a three-level hierarchical reinforcement learning (RL) framework that utilizes pose affordances to guide the navigation policy, while the navigation policy drives the locomotion policy. In addition, the pedipulation policy is guided by interaction-point affordances, enabling object-centric pose alignment of the quadruped robot and effective end-effector manipulation planning. We train the proposed framework in the IsaacSim ecosystem and evaluate it in both simulation and real-world settings. We investigate the effectiveness of pose affordance across multiple scenarios in simulation, while various object interaction tasks are validated in real-world settings, forming an object-interaction dataset. The results show that the proposed framework can autonomously identify candidate poses based on their affordance and successfully execute object manipulation tasks in the real world without human guidance.

2606.06660 2026-06-08 cs.AI cs.PF cs.RO 交叉投稿

AEGIS: A Backup Reflex for Physical AI

AEGIS:物理AI的备份反射

Josef Chen

发表机构 * KAIKAKU

AI总结 提出AEGIS方法,通过在弱策略的冻结激活上使用轻量级探针检测高风险步骤,仅在必要时切换到强策略,在LIBERO-Spatial上恢复了弱策略损失的10.1%轨迹。

详情
AI中文摘要

长时域机器人操作往往逐渐失败:一个坏步骤会降低状态,策略会陷入无法恢复的盆地。失败在发生之前通常是可见的。我们引入了AEGIS(激活探针早期预警、门控推理切换),一种选择性升级方法,通过在弱策略的冻结激活上使用轻量级探针,在仍有时间采取行动时检测高风险步骤。当探针标记一个步骤时,控制权切换到更强的独立策略,但仅限于需要它的步骤。在LIBERO-Spatial上,AEGIS恢复了弱策略单独损失的10.1%的轨迹,而预算匹配的盲目升级为4.6%,随机触发安慰剂为5.1%。这些增益在单侧精确配对McNemar检验中显著,经Holm-Bonferroni调整,三个预注册对比:比盲目升级高5.4个百分点,p=8.5e-6;比随机触发高5.0个百分点,p=1.0e-4;配对轨迹自举置信区间排除零。AEGIS仅在38%的步骤上激活强策略,因此杠杆是时机而非计算。探针在早期窗口AUROC为0.764,95% CI [0.70, 0.84],在首次切换前从弱策略路径的前30%轨迹步骤中读取。我们预注册了完整的分析计划,包括条件恢复任务率估计量和明确的终止标准,并在每臂700个公共随机数情节上确认了结果,nA-fail=646。

英文摘要

Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it cannot recover. The failure is often visible before it happens. We introduce AEGIS (Activation-probe Early-warning, Gated Inference Switching), a selective escalation method that uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps while there is still time to act. When the probe flags a step, control switches to a stronger separate policy, but only for the steps that need it. On LIBERO-Spatial, AEGIS recovers 10.1% of the trajectories the weak policy alone loses, versus 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo. These gains are significant under one-sided exact paired McNemar tests with Holm-Bonferroni adjustment over three pre-registered contrasts: +5.4pp over blind escalation, p=8.5e-6; +5.0pp over random triggering, p=1.0e-4; paired-trajectory bootstrap CIs exclude zero. AEGIS activates the stronger policy on only 38% of steps, so the lever is timing rather than compute. The probe clears its precondition with an early-window AUROC of 0.764, 95% CI [0.70, 0.84], read from the weak-policy path over the first 30% of trajectory steps before any handoff. We pre-register the full analysis plan, including a conditional recovered-task-rate estimand and explicit kill criteria, and confirm the result on 700 common-random-number episodes per arm, with nA-fail=646.
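For readers unfamiliar with the statistics named in this abstract, the sketch below implements a one-sided exact McNemar test on paired outcomes and a Holm-Bonferroni adjustment over several contrasts. The function names and the example counts in the usage note are illustrative assumptions; the code does not reproduce the paper's data or its reported p-values.

```python
from math import comb

def mcnemar_exact_one_sided(b, c):
    """One-sided exact McNemar p-value for paired binary outcomes.

    b: pairs where method A succeeds and method B fails.
    c: pairs where method A fails and method B succeeds.
    Under the null hypothesis, b follows Binomial(b + c, 0.5);
    the function returns P(X >= b).
    """
    n = b + c
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is scaled by (m - k + 1); the running
        # maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

As a usage example with made-up counts, five discordant pairs that all favor method A give `mcnemar_exact_one_sided(5, 0)`, which equals 1/32. Only discordant pairs enter the test, which is why a paired design with common random numbers can detect small differences in success rate.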

2606.07100 2026-06-08 cs.CV cs.RO 交叉投稿

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

LARA: 视觉-语言-动作模型的潜在动作表示对齐

Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LARA框架,通过表示对齐联合优化潜在动作模型和视觉-语言-动作模型,利用人类视频数据提升机器人操作性能,在模拟和真实基准上平均提升约10%、5%和15%。

详情
AI中文摘要

视觉-语言-动作(VLA)模型使机器人能够直接从观测和语言指令预测动作,但其性能依赖于大规模、高质量数据,并受到真实机器人动作数据集稀缺的限制。为了利用丰富的未标记人类视频促进VLA模型学习,潜在动作模型(LAM)从视觉动态中学习潜在动作表示,为VLA学习提供额外监督。然而,LAM和VLA通常分开训练,导致LAM在VLA训练期间未接地,且VLA模型受冻结的LAM表示约束。为解决这些问题,我们提出潜在动作表示对齐(LARA),一种即插即用框架,通过表示对齐联合优化LAM和VLA。这使得LAM能够利用动作轨迹学习以避免虚假视觉变化,同时VLA通过LAM中学习的前向动力学进行正则化,减少功能无效轨迹的幻觉。我们展示了LARA在预训练、预训练VLA模型的后训练增强以及LAM细化中的多功能性和有效性,在3个模拟和1个精心设计的真实机器人操作基准上分别平均提升约10%、约5%和约15%。

英文摘要

Vision-Language-Action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAMs) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAMs and VLAs are typically trained separately, leaving the LAM ungrounded during VLA training and the VLA model constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes the LAM and the VLA via representation alignment. This enables reciprocal benefits: LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving average improvements of about 10%, 5%, and 15%, respectively, across three simulation benchmarks and one meticulously designed real-world robotic manipulation benchmark.

2606.07233 2026-06-08 cs.CV cs.LG cs.RO 交叉投稿

Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking

外观有帮助吗?在线3D多行人追踪中基于图像的重识别系统研究

Eduardo Borges, Luís Garrote, Urbano J. Nunes

发表机构 * Institute of Systems and Robotics, Department of Electrical and Computer Engineering, University of Coimbra(系统与机器人研究所,电气与计算机工程系,科英布拉大学)

AI总结 系统研究轻量级投影框架下图像重识别在在线3D多目标追踪中的作用,提出级联匹配策略以在低延迟下恢复遮挡轨迹并防止身份切换。

Comments Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

详情
AI中文摘要

基于LiDAR的3D多目标追踪通常仅依赖几何信息,这在长时间遮挡或拥挤人群环境中往往不足以区分目标。虽然集成基于RGB的重识别提供了保持身份上下文的理论解决方案,但现有方法通常依赖计算昂贵的并行检测器,阻碍了机器人的实时响应。本文通过利用轻量级投影框架解耦移动机器人的几何和外观建模,对在线3D多目标追踪中的基于图像的重识别进行了系统研究。对特征提取架构进行了全面分析,采用轻量级CNN和视觉Transformer,并评估了多种多模态数据关联策略以平衡计算延迟和鲁棒追踪。在KITTI数据集的行人类别上的实验表明,外观和运动成本的朴素线性融合由于视觉噪声而降低了性能。相反,级联匹配策略成功恢复了被遮挡的轨迹而不损害整体精度,有效防止了身份切换以维持人机交互的连续性。我们表明,轻量级架构可以在安全导航所需的低延迟和社交意识所需的判别能力之间提供最优权衡。

英文摘要

LiDAR-based 3D Multi-Object Tracking (MOT) typically relies solely on geometric information, which is often insufficient to distinguish between targets during prolonged occlusions or in crowded human-populated environments. While integrating RGB-based Re-Identification (ReID) offers a theoretical solution for preserving identity context, existing approaches often rely on computationally expensive parallel detectors that hinder real-time robot responsiveness. This work presents a systematic study of image-based ReID in online 3D MOT, using a lightweight projection-based framework to decouple geometric and appearance modeling for mobile robots. A comprehensive analysis of feature extraction architectures is conducted with lightweight CNNs and Vision Transformers, and various multi-modal data association strategies are evaluated to balance computational latency with robust tracking. Experiments on the Pedestrian class of the KITTI dataset reveal that naive linear fusion of appearance and motion costs degrades performance due to visual noise. Conversely, a cascaded matching strategy successfully recovers occluded tracks without compromising overall precision, effectively preventing identity switches to maintain human-robot interaction continuity. We show that lightweight architectures can offer an optimal trade-off between the low latency required for safe navigation and the discriminative power needed for social awareness.
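The contrast between linear fusion and cascaded matching can be illustrated with a generic two-stage association sketch. It assumes that stage one associates tracks and detections by a gated motion cost and that stage two uses appearance (cosine distance between ReID features) only for what remains unmatched. The abstract does not specify the paper's cascade order, gates, or solver, so all names, thresholds, and the greedy assignment here are hypothetical.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def greedy_match(cost, rows, cols, gate):
    """Pair rows with cols in order of increasing cost, keeping pairs within the gate."""
    candidates = sorted(
        (cost[r][c], r, c) for r in rows for c in cols if cost[r][c] <= gate
    )
    used_rows, used_cols, matches = set(), set(), []
    for _, r, c in candidates:
        if r not in used_rows and c not in used_cols:
            matches.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return matches


def cascaded_match(motion_cost, track_feats, det_feats, motion_gate, appearance_gate):
    """Stage 1: motion-only matching. Stage 2: appearance matching on the leftovers."""
    tracks = range(len(track_feats))
    dets = range(len(det_feats))
    stage1 = greedy_match(motion_cost, tracks, dets, motion_gate)
    matched_t = {t for t, _ in stage1}
    matched_d = {d for _, d in stage1}
    left_t = [t for t in tracks if t not in matched_t]
    left_d = [d for d in dets if d not in matched_d]
    appearance_cost = [
        [cosine_distance(tf, df) for df in det_feats] for tf in track_feats
    ]
    stage2 = greedy_match(appearance_cost, left_t, left_d, appearance_gate)
    return stage1, stage2
```

Because appearance is consulted only after the motion stage, noisy visual features cannot override a confident geometric match, which is one plausible reading of why a cascade avoids the degradation seen with linear fusion. A production tracker would more likely use an optimal assignment solver than the greedy pairing shown here.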

2606.07366 2026-06-08 cs.CV cs.LG cs.RO 交叉投稿

Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Dash2Sim: 来自野外行车记录仪视频的闭环驾驶仿真

Anurag Ghosh, Francesco Pittaluga, Khiem Vuong, Angela Chen, Juan Alvarez-Padilla, Manmohan Chandraker, Srinivasa Narasimhan

发表机构 * Carnegie Mellon University(卡内基梅隆大学) NEC Labs America(NEC美国实验室) MIT(麻省理工学院) UC San Diego(加州大学圣地亚哥分校)

AI总结 提出Dash2Sim框架,将单目行车记录仪视频转化为度量级、地理参考的4D驾驶日志,用于闭环仿真,并构建ROADWork4D基准数据集,验证了施工区场景对规划器的挑战。

详情
AI中文摘要

自动驾驶仿真通常依赖于在少数城市收集的数据或手工编写的合成场景。行车记录仪视频覆盖了更广泛的位置和情况,包括罕见或长尾场景。由于难以从单目野外视频中恢复准确的4D场景,它们被认为不太适用于仿真。施工区是行车记录仪捕捉到的一类长尾情况。我们提出Dash2Sim,一个将野外单目行车记录仪视频转化为度量级、地理参考的4D驾驶日志并与现有仿真器兼容的框架,并针对独立维护的地图验证每个日志,无需标注。我们将Dash2Sim应用于大型视频语料库,创建了ROADWork4D基准数据集,涵盖17个城市的4,244个场景和270万个3D对象。在验证子集ROADWork4D-CL(2,201个场景)上,我们研究了特权闭环规划器,发现施工区场景具有挑战性:尽管基于规则和混合规划器的泛化能力优于基于学习的规划器,但所有规划器均表现不足,无法完成临时施工区通道所需的变道。在规划之外,Dash2Sim恢复的密集深度在新视角合成质量上提高了高达19%(基于感知指标),表明其具有为单目视频的闭环传感器仿真提供丰富条件的潜力。

英文摘要

Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.

2606.07449 2026-06-08 eess.SY cs.RO cs.SY 交叉投稿

On orbital stabilization of a circular motion primitive for a dynamic extension of the Dubins car model

关于Dubins汽车模型动态扩展的圆形运动原语的轨道镇定

Artem Angelchev-Shiryaev, Pavel E. Aleshin, Anton S. Shiriaev, Pavel A. Shamanaev, Leonid B. Freidovich

发表机构 * Department of Industrial and Mechanical Sciences, Lund University(隆德大学工业与机械科学系) Department of Information Technologies, Sirius University(西里乌斯大学信息科技系) Department of Engineering Cybernetics, NTNU(挪威科技大学工程控制论系) Department of Applied Physics and Electronics, Umeå University(于默奥大学应用物理与电子系)

AI总结 针对Dubins汽车模型动态扩展的圆形运动原语,在横向线性化框架下研究轨道镇定,发现标准方法因横向线性化不稳定而失效,提出一组显式可验证条件使控制器设计仍适用。

Comments 34 pages

详情
AI中文摘要

本文在横向线性化框架下,针对Dubins汽车模型动态扩展的圆形运动原语的轨道镇定问题进行了研究。我们表明相应的横向线性化是不稳定的,且无法通过线性状态反馈进行镇定。因此,基于标准线性化的轨道镇定方法不能直接应用。主要贡献是一组显式且可验证的条件,这些条件刻画了何时基于横向线性化的控制器设计仍然适用。这些条件依赖于运动邻域内动力学的特定结构,以及使用非标准横向坐标进行控制器设计和分析。数值模拟说明了所提出的设计过程。

英文摘要

This paper addresses orbital stabilization of a circular motion primitive for a dynamic extension of the Dubins car model within a transverse-linearization framework. We show that the corresponding transverse linearization is unstable and not stabilizable by linear state feedback. Therefore, the standard linearization-based approach to orbital stabilization cannot be applied directly. The main contribution is a set of explicit and verifiable conditions that characterize when a controller design based on transverse linearization remains applicable. These conditions rely on the specific structure of the dynamics in a neighborhood of the motion and on the use of non-standard transverse coordinates for controller design and analysis. Numerical simulations illustrate the proposed design procedure.

2606.07476 2026-06-08 eess.SY cs.RO cs.SY eess.SP 交叉投稿

Physiologically Constrained Musculoskeletal Neural Network for Multi-DoF Joint Kinematics Estimation from Partially Observed sEMG

生理约束下的肌肉骨骼神经网络用于部分观测sEMG的多自由度关节运动学估计

Wending Heng, Mingming Zhang, Glen Cooper, Zhenhong Li

发表机构 * University of Manchester(曼彻斯特大学) Southern University of Science and Technology(南方科技大学)

AI总结 提出一种肌肉骨骼神经网络(MSK-NN),结合CNN和肌肉骨骼前向动力学模块,在部分观测表面肌电信号下估计多自由度关节角度,并通过复合损失函数实现生理合理激活推断。

详情
AI中文摘要

本文研究了在部分观测表面肌电信号(sEMG)下的多自由度(DoF)关节运动学估计问题,其中由于解剖不可及性或传感器限制,只能测量任务相关肌肉的子集。提出了一种新颖的肌肉骨骼神经网络(MSK-NN),用于估计多自由度关节角度,同时推断已测量和未测量肌肉的激活。MSK-NN由一个基于CNN的肌肉激活估计器和一个嵌入的MSK前向动力学模块组成,形成完全可微的架构。与需要额外生物力学标签(如肌肉-肌腱力、关节力矩)的现有混合神经框架不同,MSK-NN在没有内部生物力学变量直接监督的情况下进行训练。通过结合关节运动学损失、数据驱动的肌肉协同损失和解剖引导的趋势损失,设计了复合物理-生理损失。该方法在不受约束的速度和幅度下的三种节律运动和一种随机运动上评估了二自由度腕关节运动学估计。与CNN、Bi-LSTM、CNN-LSTM和PET基线相比,MSK-NN实现了更低的归一化均方根误差(NRMSE)和更高的决定系数(R2),尤其是在随机运动中。更重要的是,优化的MSK参数保持在生理极限内,并且输入排除肌肉的估计激活与其记录的sEMG包络表现出强烈的时间一致性,证明了MSK-NN恢复生理合理激活的能力。

英文摘要

This paper investigates multi-degree-of-freedom (DoF) joint kinematics estimation under partially observed surface electromyography (sEMG), where only a subset of task-relevant muscles can be measured due to anatomical inaccessibility or sensor constraints. A novel musculoskeletal (MSK) neural network (MSK-NN) is proposed to estimate multi-DoF joint angles while simultaneously inferring activations for both measured and unmeasured muscles. MSK-NN consists of a CNN-based muscle activation estimator and an embedded MSK forward dynamics module, forming a fully differentiable architecture. Unlike existing hybrid neural frameworks that require additional biomechanical labels (e.g., muscle-tendon forces, joint torques), MSK-NN is trained without direct supervision of internal biomechanical variables. A composite physics-physiology loss is designed by incorporating a joint kinematics loss, a data-driven muscle synergy loss, and an anatomy-guided trend loss. The proposed method is evaluated on two-DoF wrist kinematics estimation across three rhythmic motions with unconstrained speed and amplitude, and one random motion. Compared with CNN, Bi-LSTM, CNN-LSTM, and PET baselines, MSK-NN achieves lower normalized root mean square error (NRMSE) and a higher coefficient of determination (R²), especially for the random motion. More importantly, the optimized MSK parameters remain within physiological limits, and the estimated activation of an input-excluded muscle exhibits strong temporal agreement with its recorded sEMG envelope, demonstrating the capability of MSK-NN to recover physiologically plausible activations.

2501.15768 2026-06-08 cs.RO cs.SY eess.SY 版本更新

Error-State LQR Formulation for Quadrotor UAV Trajectory Tracking

四旋翼无人机轨迹跟踪的误差状态LQR公式

Micah Reich

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出一种基于误差状态线性二次型调节器的四旋翼无人机轨迹跟踪方法,利用指数坐标表示姿态误差,结合全状态反馈与级联体速率控制器实现鲁棒控制。

详情
AI中文摘要

本文提出了一种用于四旋翼无人机(UAV)鲁棒轨迹跟踪的误差状态线性二次型调节器(LQR)公式。该方法利用误差状态动力学,并采用指数坐标表示姿态误差,从而实现用于实时控制的线性化系统表示。控制策略集成了基于LQR的全状态反馈控制器用于轨迹跟踪,并结合级联体速率控制器来处理执行器动力学。提供了误差状态动力学、线性化过程以及控制器设计的详细推导,突出了该方法在动态环境中实现精确稳定四旋翼控制的适用性。

英文摘要

This article presents an error-state Linear Quadratic Regulator (LQR) formulation for robust trajectory tracking in quadrotor Unmanned Aerial Vehicles (UAVs). The proposed approach leverages error-state dynamics and employs exponential coordinates to represent orientation errors, enabling a linearized system representation for real-time control. The control strategy integrates an LQR-based full-state feedback controller for trajectory tracking, combined with a cascaded body-rate controller to handle actuator dynamics. Detailed derivations of the error-state dynamics, the linearization process, and the controller design are provided, highlighting the applicability of the method for precise and stable quadrotor control in dynamic environments.
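Representing an orientation error in exponential coordinates means taking the matrix logarithm of a relative rotation, which yields a three-vector that can be stacked into an LQR error state. The sketch below assumes the error is defined as the log of `R_desired^T R_actual`; the abstract does not state the paper's convention, and the helper names are hypothetical. The formula is singular for rotation errors near 180 degrees, which this sketch does not handle.

```python
from math import acos, cos, sin


def transpose_times(a, b):
    """Return a^T b for 3x3 matrices stored as nested lists."""
    return [
        [sum(a[k][i] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def so3_log(rot):
    """Map a rotation matrix to its rotation vector (axis times angle)."""
    cos_theta = (rot[0][0] + rot[1][1] + rot[2][2] - 1.0) / 2.0
    theta = acos(max(-1.0, min(1.0, cos_theta)))  # clamp against round-off
    vee = [rot[2][1] - rot[1][2], rot[0][2] - rot[2][0], rot[1][0] - rot[0][1]]
    # theta / (2 sin theta) tends to 1/2 as theta tends to 0
    scale = 0.5 if theta < 1e-8 else theta / (2.0 * sin(theta))
    return [scale * x for x in vee]


def attitude_error(rot_desired, rot_actual):
    """Exponential-coordinate attitude error (assumed convention)."""
    return so3_log(transpose_times(rot_desired, rot_actual))


def rot_z(angle):
    """Rotation about the z-axis, used here only to build test inputs."""
    c, s = cos(angle), sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

For example, `attitude_error(rot_z(0.2), rot_z(0.5))` gives a vector along the z-axis whose magnitude is the 0.3 rad yaw difference. A full-state feedback law would then apply a gain matrix to this vector stacked with the position, velocity, and other error terms.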

2502.16531 2026-06-08 cs.RO cs.SY eess.SY 版本更新

Efficient Coordination and Synchronization of Multi-Robot Systems Under Recurring Linear Temporal Logic

基于循环线性时序逻辑的多机器人系统高效协调与同步

Davide Peron, Victor Nan Fernandez-Ayala, Eleftherios E. Vlahakis, Dimos V. Dimarogonas

发表机构 * Department of Information Engineering, University of Padova(帕多瓦大学信息工程系) Division of Decision and Control Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology(皇家理工学院电气工程与计算机科学学院决策与控制系统系)

AI总结 提出一种结合离线计划综合与在线协调的自底向上方法,通过实时通信动态调整计划,并引入同步机制处理动作延迟,实现多机器人系统的可扩展协调与同步框架。

详情
Journal ref Proc. IEEE ICRA, 2025, pp. 10194-10200
AI中文摘要

我们考虑在循环任务下形式化为线性时序逻辑(LTL)规范的多机器人系统。为了高效解决规划问题,我们提出了一种自底向上的方法,将离线计划综合与在线协调相结合,通过实时通信动态调整计划。为了解决动作延迟,我们引入了一种同步机制,确保协调的任务执行,从而得到一个适用于广泛多机器人应用的多智能体协调与同步框架。该软件包使用Python和ROS2开发,便于广泛部署。我们通过涉及九个机器人的实验室实验验证了我们的发现,显示出与先前方法相比增强的适应性。此外,我们进行了多达九十个智能体的仿真,以展示我们工作降低的计算复杂性和可扩展性特征。

英文摘要

We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom-up approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real-time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi-agent coordination and synchronization framework that is adaptable to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots, showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability of our approach.

2504.10102 2026-06-08 cs.RO cs.SY eess.SY 版本更新

A Human-Sensitive Controller: Adapting to Human Musculoskeletal Disorder-Related Constraints via Reinforcement Learning

一种人类敏感控制器:通过强化学习适应人类肌肉骨骼疾病相关约束

Vitor Martins, Sara M. Cerqueira, Mercedes Balcells, Elazer R Edelman, Cristina P. Santos

发表机构 * Fundação para a Ciência e Tecnologia(葡萄牙科学与技术基金会) Centro de Microssistemas Eletromecânicos da Universidade do Minho(米尼奥大学微机电系统中心) Massachusetts Institute of Technology(麻省理工学院) Brigham and Women’s Hospital, Harvard Medical School(哈佛医学院布莱根妇女医院) GEVAB, IQS School of Engineering(GEVAB,IQS工程学院) LABBELS-Associate Laboratory, University of Minho(米尼奥大学LABBELS联合实验室)

AI总结 提出基于强化学习的人类敏感机器人控制策略,使用Q学习和深度Q网络优化协作机器人的人机工效,在保持零疼痛风险下平均缩短38%任务完成时间。

详情
AI中文摘要

工作相关肌肉骨骼疾病仍然是工业环境中的主要挑战,导致劳动力参与减少、医疗成本增加和长期残疾。本研究引入了一种人类敏感机器人系统,旨在将有肌肉骨骼疾病史的个体重新融入标准工作岗位,同时优化更广泛劳动力的工效条件。本研究利用强化学习(RL)开发协作机器人的人类感知控制策略,重点优化工效条件并防止任务执行过程中的疼痛。实现并测试了两种RL方法,即Q学习和深度Q网络(DQN),以根据个体用户特征个性化控制策略。尽管实验结果显示存在模拟到现实的差距,但微调阶段成功地将策略适应了现实条件。DQN优于Q学习,在保持零疼痛风险和安全的工效水平的同时,更快地完成任务,在所有测试的人体测量中平均任务完成时间缩短了38%。结构化的测试协议证实了系统对不同人体测量的适应性,突显了RL驱动的协作机器人实现更安全、更包容工作场所的潜力。

英文摘要

Work-Related Musculoskeletal Disorders continue to be a major challenge in industrial environments, leading to reduced workforce participation, increased healthcare costs, and long-term disability. This study introduces a human-sensitive robotic system aimed at reintegrating individuals with a history of musculoskeletal disorders into standard job roles, while simultaneously optimizing ergonomic conditions for the broader workforce. This research leverages reinforcement learning (RL) to develop a human-aware control strategy for collaborative robots, focusing on optimizing ergonomic conditions and preventing pain during task execution. Two RL approaches, Q-Learning and Deep Q-Network (DQN), were implemented and tested to personalize control strategies based on individual user characteristics. Although experimental results revealed a simulation-to-real gap, a fine-tuning phase successfully adapted the policies to real-world conditions. DQN outperformed Q-Learning by completing tasks faster while maintaining zero pain risk and safe ergonomic levels, achieving on average 38% shorter task completion times across all tested anthropometries. The structured testing protocol confirmed the system's adaptability to diverse human anthropometries, underscoring the potential of RL-driven cobots to enable safer, more inclusive workplaces.
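Of the two RL approaches mentioned, tabular Q-Learning is compact enough to sketch directly. The update rule below is the standard one-step Q-learning rule; the state and action names in the usage note are hypothetical placeholders, since the abstract does not describe the paper's state, action, or reward design.

```python
def q_learning_update(q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the new value.

    q is a dict of dicts: q[state][action] -> estimated return.
    """
    best_next = 0.0 if terminal else max(q[next_state].values())
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]
```

As a toy usage example, with `q = {"s0": {"slow": 0.0, "fast": 0.0}, "s1": {"slow": 1.0, "fast": 2.0}}`, calling `q_learning_update(q, "s0", "fast", 1.0, "s1", alpha=0.5)` moves `q["s0"]["fast"]` halfway toward the target `1.0 + 0.9 * 2.0`. A DQN replaces the table with a neural network trained on the same temporal-difference target.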

2506.02622 2026-06-08 cs.RO cs.HC 版本更新

HORUS: A Mixed Reality Interface for Managing Teams of Mobile Robots

HORUS:用于管理移动机器人团队的混合现实界面

Omotoye Shamsudeen Adekoya, Antonio Sgorbissa, Carmine Tommaso Recchiuto

发表机构 * DIBRIS Department, RICE Laboratory, University of Genoa(DIBRIS部门、RICE实验室、热那亚大学)

AI总结 提出混合现实界面HORUS,通过Mini-Map和多种遥操作模式实现多机器人团队监控与任务分配,用户研究验证其在搜索救援任务中的协调有效性。

Comments 7 pages, 7 figures, conference paper submitted to UR 2026

详情
AI中文摘要

混合现实(MR)界面已被广泛探索用于控制移动机器人,但关于其在管理机器人团队方面的应用研究有限。本文提出HORUS:统一系统的整体操作现实,这是一个混合现实界面,提供了一套全面的工具,用于同时管理多个移动机器人。HORUS使操作员能够监控单个机器人状态、实时投影传感器数据,并将任务分配给单个机器人、团队子集或整个团队,所有这些都通过Mini-Map(地面站)完成。该界面还提供不同的遥操作模式:迷你地图模式,允许在观察机器人模型及其在迷你地图上的变换的同时进行遥操作;以及半沉浸式模式,提供平坦的屏幕状视图,可以是单视图或立体视图(3D)。我们进行了一项用户研究,参与者使用HORUS管理一个移动机器人团队,任务是在环境中寻找线索,模拟搜索和救援任务。该研究将HORUS的完整团队管理能力与单个机器人遥操作进行了比较。实验验证了HORUS在多机器人协调中的多功能性和有效性,展示了其在动态、基于团队的环境中推进人机协作的潜力。

英文摘要

Mixed Reality (MR) interfaces have been extensively explored for controlling mobile robots, but there is limited research on their application to managing teams of robots. This paper presents HORUS: Holistic Operational Reality for Unified Systems, a Mixed Reality interface offering a comprehensive set of tools for managing multiple mobile robots simultaneously. HORUS enables operators to monitor individual robot statuses, visualize sensor data projected in real time, and assign tasks to single robots, subsets of the team, or the entire group, all from a Mini-Map (Ground Station). The interface also provides different teleoperation modes: a mini-map mode that allows teleoperation while observing the robot model and its transform on the mini-map, and a semi-immersive mode that offers a flat, screen-like view in either single or stereo view (3D). We conducted a user study in which participants used HORUS to manage a team of mobile robots tasked with finding clues in an environment, simulating search and rescue tasks. This study compared HORUS's full-team management capabilities with individual robot teleoperation. The experiments validated the versatility and effectiveness of HORUS in multi-robot coordination, demonstrating its potential to advance human-robot collaboration in dynamic, team-based environments.

2509.11740 2026-06-08 cs.RO 版本更新

From Pixels to Shelf: An Integrated Robotic System for Autonomous Supermarket Stocking with a Mobile Manipulator

从像素到货架:基于移动操作臂的自主超市补货集成机器人系统

Davide Peron, Victor Nan Fernandez-Ayala, Lukas Segelmark

发表机构 * Department of Information Engineering, University of Padova(帕多瓦大学信息工程系) Division of Decision and Control Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology(皇家理工学院电气工程与计算机科学学院决策与控制系统系) Animum AB(Animum公司)

AI总结 提出一种集成商用硬件与ROS2的模块化机器人系统,利用行为树规划、视觉检测和两步MPC控制,在700多次补货实验中实现98%以上的抓取成功率,但性能仍逊于人类工人。

Comments Preprint for CASE 2026

详情
AI中文摘要

零售环境(尤其是超市)中的自主补货由于动态的人机交互、受限空间和多样化的产品几何形状而面临挑战。本文介绍了一种高效的模块化机器人系统,用于自主货架补货,该系统将商用硬件与可扩展的算法架构相结合。本工作的一个主要贡献是将现成硬件与基于ROS2的感知、规划和控制集成到一个适用于零售环境的可部署平台中。我们的解决方案利用行为树(BT)进行任务规划,使用微调视觉模型进行目标检测,并采用两步模型预测控制(MPC)框架,通过ArUco标记实现精确的货架导航。在模拟真实超市条件的实验室实验中,该系统在总共700多次补货事件中实现了超过98%的抓取放置成功率。然而,我们的比较基准表明,当前自主系统的性能和成本效益仍然低于人类工人,我们以此突出关键改进领域,并量化在广泛商业部署真正实现之前仍需取得的进展。

英文摘要

Autonomous stocking in retail environments, particularly supermarkets, presents challenges due to dynamic human interactions, constrained spaces, and diverse product geometries. This paper introduces an efficient modular robotic system for autonomous shelf stocking, integrating commercially available hardware with a scalable algorithmic architecture. A major contribution of this work is the system integration of off-the-shelf hardware and ROS2-based perception, planning, and control into a single deployable platform for retail environments. Our solution leverages Behavior Trees (BTs) for task planning, fine-tuned vision models for object detection, and a two-step Model Predictive Control (MPC) framework for precise shelf navigation using ArUco markers. Laboratory experiments replicating realistic supermarket conditions demonstrate reliable performance, achieving over 98% success in pick-and-place operations across a total of more than 700 stocking events. However, our comparative benchmarks indicate that the performance and cost-effectiveness of current autonomous systems remain inferior to those of human workers, which we use to highlight key improvement areas and quantify the progress still required before widespread commercial deployment can realistically be achieved.

2509.14380 2026-06-08 cs.RO 版本更新

CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks

CRAFT:利用基础模型自主教练强化学习以完成多机器人协调任务

Seoyeon Choi, Kanghyun Ryu, Jonghoon Ock, Negar Mehr

发表机构 * Department of Mechanical Engineering, University of California Berkeley(机械工程系,加州大学伯克利分校)

AI总结 提出CRAFT框架,利用大语言模型分解任务、生成奖励函数,并通过视觉语言模型优化,实现多机器人协调学习,在四足导航和双臂操作任务中验证有效性。

详情
AI中文摘要

多智能体强化学习(MARL)为多智能体系统中的协调学习提供了强大的框架。然而,由于机器人具有高维连续联合动作空间、复杂的奖励设计以及并发学习智能体带来的非平稳性,将MARL应用于机器人领域仍然具有挑战性。另一方面,人类通常在教练的帮助下学习复杂的协调任务,教练通过精心设计的课程和详细反馈来指导学习。基于基础模型的推理能力,我们认为这些模型可以类似地教练机器人学习协调。受此启发,我们提出了CRAFT:利用基础模型自主教练强化学习以完成协调任务,这是一个利用基础模型作为多机器人协调“教练”的框架。CRAFT利用大语言模型(LLMs)的规划能力,自动将长时域协调任务分解为子任务序列。然后,CRAFT使用LLM生成的奖励函数训练每个子任务,并通过视觉语言模型(VLM)引导的奖励细化循环来改进它们。我们在多四足导航和双臂操作任务上评估了CRAFT,并展示了其学习复杂协调行为的能力。此外,在多四足导航设置中,我们展示了学到的策略可以迁移到现实世界。项目网站:https://iconlab.negarmehr.com/CRAFT/

英文摘要

Multi-Agent Reinforcement Learning (MARL) provides a powerful framework for learning coordination in multi-agent systems. However, applying MARL to robotics remains challenging due to high-dimensional continuous joint action spaces, complex reward design, and non-stationarity from concurrently learning agents. On the other hand, humans often learn complex coordination with the help of coaches, who guide learning through carefully designed curricula and detailed feedback. Building on the reasoning capabilities of foundation models, we argue that these models can similarly coach robots to learn coordination. Motivated by this, we propose CRAFT: Coaching Reinforcement learning Autonomously using Foundation models for learning coordination Tasks, a framework that leverages foundation models to act as a "coach" for multi-robot coordination. CRAFT automatically decomposes long-horizon coordination tasks into sequences of subtasks using the planning capability of Large Language Models (LLMs). Then, CRAFT trains each subtask using LLM-generated reward functions, and refines them through a Vision Language Model (VLM)-guided reward-refinement loop. We evaluate CRAFT on multi-quadruped navigation and bimanual manipulation tasks, and demonstrate its capability to learn complex coordination behaviors. In addition, in a multi-quadruped navigation setting, we show that our learned policies transfer to the real world. Project website: https://iconlab.negarmehr.com/CRAFT/

2510.11014 2026-06-08 cs.RO cs.AI cs.CV 版本更新

MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models

MatterDoor: 使用生成模型采样零样本空间语义先验

Subhransu S. Bhattacharjee, Hao Lu, Dylan Campbell, Rahul Shome

发表机构 * School of Computing, Australian National University(澳大利亚国立大学计算机学院)

AI总结 针对机器人通过门缝观察时场景结构缺失的问题,提出MatterDoor方法,利用预训练生成模型(VLM引导外推、单目深度估计、语义分割)采样隐藏房间的语义3D点云先验,在Matterport3D基准上验证了零样本空间语义先验的有效性。

Comments Under Review

详情
AI中文摘要

自主机器人通常只能通过门缝部分观察房间,墙壁和场景结构隐藏了安全导航和目标导向行动所需的几何和任务相关语义。我们询问现成的预训练生成视觉模型能否将这些缺失结构作为零样本离线先验用于机器人推理。此类先验应支持对未观察结构的空间语义查询,估计隐藏区域中的目标物体似然以及这些区域被占用的概率。给定一个以自我为中心的RGB观测和目标查询,我们的流程使用VLM引导的外推、单目深度估计和语义分割来采样隐藏房间的语义标注3D点云假设。我们引入了MatterDoor,一个源自Matterport3D的门遮挡室内场景基准,并使用生成指标和模拟Stretch机器人目标到达任务评估所得先验。我们的结果表明,无需针对特定问题微调即可推导出对规划有用的空间语义先验。

英文摘要

Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can derive this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.

2511.12795 2026-06-08 cs.RO 版本更新

ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model

ActiveGrasp: 基于校准能量模型的信息引导主动抓取

Boshu Lei, Wen Jiang, Kostas Daniilidis

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Archimedes, Athena RC(Athena研究中心Archimedes研究单元)

AI总结 针对密集杂乱环境中的抓取问题,提出一种校准能量模型生成抓取姿态,并基于抓取分布的信息增益主动选择视角,在有限视角下高效抓取目标物体。

Comments CVPR 2026

详情
AI中文摘要

在密集杂乱环境中抓取对机器人是一项具有挑战性的任务。以往的方法试图通过在抓取姿态生成前主动收集多个视角来解决这个问题。然而,它们要么忽略了抓取分布对信息增益估计的重要性,要么依赖于抓取分布的投影,这忽略了SE(3)流形上抓取姿态的结构。为了应对这些挑战,我们提出了一种用于抓取姿态生成的校准能量模型,以及一种从抓取分布估计信息增益的主动视角选择方法。我们的能量模型捕捉了SE(3)流形上抓取分布的多模态特性。能量水平被校准到抓取的成功率,使得预测分布与真实分布一致。通过从基于重建环境的校准分布中估计抓取的信息增益,选择下一个最佳视角,这可以高效地驱动机器人探索目标物体的可抓取部分。在模拟环境和真实机器人设置上的实验表明,与先前最先进的模型相比,我们的模型能够在有限视角预算下成功抓取杂乱环境中的物体。我们的模拟环境可以作为未来主动抓取研究的可复现平台。当论文公开发布时,我们的源代码将公开。

英文摘要

Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on a projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from the grasp distribution. Our energy-based model captures the multimodal nature of the grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution. The next best view is selected by estimating the information gain for grasping from the calibrated distribution conditioned on the reconstructed environment, which efficiently drives the robot to explore graspable parts of the target object. Experiments in simulated environments and on real robot setups demonstrate that, compared to previous state-of-the-art models, our model can successfully grasp objects in a cluttered environment under a limited view budget. Our simulated environment can serve as a reproducible platform for future research on active grasping. The source code will be made public when the paper is released.

2601.10930 2026-06-08 cs.RO 版本更新

Where to Touch, How to Contact: Hierarchical RL-MPC Framework for Geometry-Aware Long-Horizon Dexterous Manipulation

何处触碰,如何接触:面向几何感知的长时间灵巧操作的分层RL-MPC框架

Zhixian Xie, Yu Xiang, Michael Posa, Wanxin Jin

发表机构 * Arizona State University(亚利桑那州立大学) University of Texas at Dallas(德克萨斯大学达拉斯分校) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出分层RL-MPC框架,高层RL策略预测接触意图(接触位置和子目标位姿),低层接触隐式MPC优化局部接触模式并实时重规划,实现几何泛化的非抓取操作,数据效率提升10倍且零样本迁移到真实环境。

详情
AI中文摘要

接触丰富的灵巧操作中的一个关键挑战是需要共同推理全局几何和非光滑接触动力学。端到端策略绕过了这一复杂性,但通常需要大量数据,并且从仿真到现实的迁移效果差。我们通过一个简单的见解来解决这些局限性:灵巧操作本质上是分层的——在高层次上,机器人决定在哪里触碰(几何);在低层次上,它确定如何通过接触动力学移动物体。基于这一见解,我们提出了一个分层RL-MPC框架,其中高层强化学习(RL)策略预测接触意图,这是一种新颖的以物体为中心的接口,指定了(i)物体表面接触位置和(ii)接触后的物体子目标位姿。在接触意图的条件下,低层接触隐式模型预测控制(MPC)优化局部接触模式,并通过接触动力学进行实时(重新)规划,以生成稳健地将物体移向每个子目标的机器人动作。我们在非抓取任务上评估该框架,包括跨不同物体形状的几何泛化推、基于翻转/旋转的物体重新定向以及环境辅助的物体重新定位。它实现了高成功率,数据量大幅减少(比端到端基线少10倍),高度稳健的性能,以及零样本从仿真到现实的迁移。

英文摘要

A key challenge in contact-rich dexterous manipulation is the need to jointly reason over global geometry and nonsmooth contact dynamics. End-to-end policies bypass this complexity, but often require large amounts of data and transfer poorly from simulation to reality. We address these limitations with a simple insight: dexterous manipulation is inherently hierarchical. At a high level, a robot decides where to touch (geometry); at a low level, it determines how to move the object through contact dynamics. Building on this insight, we propose a hierarchical RL-MPC framework in which a high-level reinforcement learning (RL) policy predicts a contact intention, a novel object-centric interface that specifies (i) an object-surface contact location and (ii) a post-contact object subgoal pose. Conditioned on the contact intention, a low-level contact-implicit model predictive control (MPC) optimizes local contact modes and (re)plans in real time through contact dynamics to generate robot actions that robustly move the object toward each subgoal. We evaluate the framework on non-prehensile tasks, including geometry-generalized pushing across diverse object shapes, pivoting/flipping-based object reorientation, and environment-assisted object repositioning. It achieves a high success rate with substantially less data (10 times less than end-to-end baselines), highly robust performance, and zero-shot sim-to-real transfer.

2602.09580 2026-06-08 cs.RO cs.LG 版本更新

SERNF: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

SERNF: 通过动作块评论家和归一化流实现样本高效的真实世界灵巧策略微调

Chenyu Yang, Denis Tarasov, Davide Liconti, Romain Guntz, Hehui Zheng, Robert K. Katzschmann

发表机构 * Soft Robotics Lab, D-MAVT(软机器人实验室,D-MAVT) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出SERNF框架,结合归一化流策略和动作块评论家,实现真实世界灵巧操作策略的样本高效微调,解决多模态动作分布和信用分配问题。

Comments https://srl-ethz.github.io/SERNF/

详情
AI中文摘要

由于有限的真实世界交互预算和高度多模态的动作分布,真实世界中灵巧操作策略的微调仍然具有挑战性。基于扩散的策略虽然表达能力强,但在微调过程中不允许进行保守的基于似然的更新,因为动作概率难以处理。相比之下,传统的高斯策略在多模态下会崩溃,特别是当动作以块形式执行时,而标准的逐步骤评论家无法与块执行对齐,导致信用分配不佳。我们提出了SERNF,一个具有归一化流(NF)的样本高效离策略微调框架,以应对这些挑战。归一化流策略为多模态动作块提供精确的似然,通过似然正则化实现保守、稳定的策略更新,从而提高样本效率。动作块评论家评估整个动作序列,使价值估计与策略的时间结构对齐,并改善长时域信用分配。据我们所知,这是首次在真实机器人硬件上展示基于似然的多模态生成策略与块级价值学习相结合。我们在真实世界的两个具有挑战性的灵巧操作任务上评估了SERNF:从盒子中取出剪刀并剪断胶带,以及手掌朝下抓握时进行手中立方体旋转,两者都需要在长时域内进行精确、灵巧的控制。在这些任务上,SERNF实现了稳定、样本高效的适应,而标准方法则难以应对。

英文摘要

Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SERNF, a sample-efficient off-policy fine-tuning framework with normalizing flows (NFs) to address these challenges. The normalizing flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SERNF on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp, both of which require precise, dexterous control over long horizons. On these tasks, SERNF achieves stable, sample-efficient adaptation where standard methods struggle.
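The claim that a normalizing flow yields exact likelihoods rests on the change-of-variables formula. The sketch below shows it for the simplest possible case, a single elementwise affine flow over a flattened action chunk with a standard normal base distribution. This is an illustration only: a one-layer affine flow is still unimodal, and the multimodal policies described here would stack nonlinear invertible layers whose architecture the abstract does not specify. The function and parameter names are hypothetical.

```python
from math import exp, log, pi


def affine_flow_log_prob(action_chunk, shift, log_scale):
    """Exact log-density of an action chunk under a = shift + exp(log_scale) * z.

    z is standard normal per dimension. By change of variables,
    log p(a) = log N(z; 0, 1) - log_scale, summed over dimensions.
    """
    total = 0.0
    for a, mu, ls in zip(action_chunk, shift, log_scale):
        z = (a - mu) * exp(-ls)  # invert the flow
        base_log_prob = -0.5 * (z * z + log(2.0 * pi))
        total += base_log_prob - ls  # subtract log|det Jacobian|
    return total
```

Because this log-density is available in closed form, a fine-tuning objective can penalize the policy for assigning low likelihood to previously demonstrated action chunks, which is the kind of likelihood regularization the abstract refers to.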

2602.12360 2026-06-08 cs.RO 版本更新

Predicting Dynamic Map States from Limited Field-of-View Sensor Data

从有限视场传感器数据预测动态地图状态

Knut Peterson, David Han

发表机构 * iMaPLe Research Lab, Drexel University(iMaPLe研究实验室,德雷塞尔大学)

AI总结 针对传感器有限视场问题,提出将时空信息编码为单图像格式,利用现有图像到图像学习模型高精度预测动态地图状态。

Comments 6 pages, 4 figures. Accepted to the 2026 International Conference on Advanced Visual and Signal-Based Systems (AVSS)

详情
AI中文摘要

当自主系统部署在真实场景中时,传感器通常受到有限视场(FOV)约束,这可能是由于系统设计自然导致的,也可能是由于意外遮挡或传感器故障。在无法获得大视场的情况下,能够基于可用数据推断环境信息并预测附近周围环境的状态对于维持安全准确的运行至关重要。在这项工作中,我们探讨了基于有限视场时间序列数据进行动态地图状态预测的深度学习有效性。我们表明,通过将动态传感器数据表示为捕获空间和时间信息的简单单图像格式,我们可以有效地利用各种现有的图像到图像学习模型,在多种传感场景中高精度地预测地图状态。

英文摘要

When autonomous systems are deployed in real-world scenarios, sensors are often subject to limited field-of-view (FOV) constraints, either naturally through system design, or through unexpected occlusions or sensor failures. In conditions where a large FOV is unavailable, it is important to be able to infer information about the environment and predict the state of nearby surroundings based on available data to maintain safe and accurate operation. In this work, we explore the effectiveness of deep learning for dynamic map state prediction based on limited FOV time series data. We show that by representing dynamic sensor data in a simple single-image format that captures both spatial and temporal information, we can effectively use a wide variety of existing image-to-image learning models to predict map states with high accuracy in a diverse set of sensing scenarios.
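
The abstract does not specify the single-image format, so the following is only a minimal sketch of one plausible encoding: a recency-coded image in which each pixel stores how recently a cell was last observed as occupied. The function name `encode_single_image`, the cell values (1 occupied, 0 free, -1 outside the field of view) and the intensity scheme are all assumptions made for illustration, not the authors' representation.

```python
def encode_single_image(frames, unknown=-1):
    """Collapse a time series of occupancy grids into one recency-coded image.

    frames: list of H x W grids, oldest first. Each cell is 1 (occupied),
    0 (free), or `unknown` (outside the sensor's field of view).
    Returns an H x W grid of floats in [0, 1]: 0.0 means "never seen
    occupied", and larger values mean "seen occupied more recently".
    NOTE: this encoding is a hypothetical illustration, not the paper's format.
    """
    if not frames:
        raise ValueError("need at least one frame")
    steps = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    image = [[0.0] * width for _ in range(height)]
    for t, grid in enumerate(frames):
        intensity = (t + 1) / steps  # the newest frame maps to 1.0
        for r in range(height):
            for c in range(width):
                if grid[r][c] == 1:
                    # Free and unknown cells leave the older value in place,
                    # so a moving object leaves a fading trail in the image.
                    image[r][c] = intensity
    return image


# An object moving left to right across a 1 x 3 map; the right cell is
# outside the field of view for the first two frames.
frames = [[[1, 0, -1]], [[0, 1, -1]], [[0, 0, 1]]]
image = encode_single_image(frames)
```

An image built this way could then be passed to any off-the-shelf image-to-image model, which is the reuse the abstract describes.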

2602.16073 2026-06-08 cs.RO cs.AI cs.LO cs.SY eess.SY 版本更新

ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios

ScenicRules:具有多目标规范和抽象场景的自动驾驶基准测试

Kevin Kai-Chun Chang, Ekin Beyazit, Alberto Sangiovanni-Vincentelli, Tichakorn Wongpiromsarn, Sanjit A. Seshia

发表机构 * University of California, Berkeley(加州大学伯克利分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出ScenicRules基准,通过层次化规则框架和形式化场景模型,在随机环境下评估自动驾驶系统对优先级多目标规范的满足程度。

Comments v2: Minor numerical corrections for Table V. 16 pages, 14 figures, 7 tables. Extended version of paper accepted to 2026 IEEE Intelligent Vehicles Symposium (IV 2026). ScenicRules benchmark available at https://github.com/BerkeleyLearnVerify/ScenicRules

详情
AI中文摘要

开发复杂交通环境下的自动驾驶系统需要平衡多个目标,例如避免碰撞、遵守交通规则和高效行驶。在许多情况下,这些目标无法同时满足,因此自然会出现明确的优先级关系。此外,驾驶规则需要上下文,因此正式建模这些规则适用的环境场景非常重要。现有的自动驾驶车辆评估基准缺乏这种多目标优先级规则和形式化环境模型的组合。在这项工作中,我们引入了ScenicRules,一个在随机环境下根据优先级多目标规范评估自动驾驶系统的基准。我们首先形式化了一组多样化的目标作为定量评估指标。接下来,我们设计了一个层次化规则书框架,以可解释和可适应的方式编码多个目标及其优先级关系。然后,我们构建了一个紧凑但具有代表性的场景集合,涵盖各种驾驶情境和近事故情况,并使用Scenic语言进行形式化建模。实验结果表明,我们的形式化目标和层次化规则书与人类驾驶判断高度一致,并且我们的基准有效地暴露了代理在优先级目标方面的失败。我们的基准可在https://github.com/BerkeleyLearnVerify/ScenicRules/获取。

英文摘要

Developing autonomous driving systems for complex traffic environments requires balancing multiple objectives, such as avoiding collisions, obeying traffic rules, and making efficient progress. In many situations, these objectives cannot be satisfied simultaneously, and explicit priority relations naturally arise. Also, driving rules require context, so it is important to formally model the environment scenarios within which such rules apply. Existing benchmarks for evaluating autonomous vehicles lack such combinations of multi-objective prioritized rules and formal environment models. In this work, we introduce ScenicRules, a benchmark for evaluating autonomous driving systems in stochastic environments under prioritized multi-objective specifications. We first formalize a diverse set of objectives to serve as quantitative evaluation metrics. Next, we design a Hierarchical Rulebook framework that encodes multiple objectives and their priority relations in an interpretable and adaptable manner. We then construct a compact yet representative collection of scenarios spanning diverse driving contexts and near-accident situations, formally modeled in the Scenic language. Experimental results show that our formalized objectives and Hierarchical Rulebooks align well with human driving judgments and that our benchmark effectively exposes agent failures with respect to the prioritized objectives. Our benchmark can be accessed at https://github.com/BerkeleyLearnVerify/ScenicRules/.
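
To make the idea of prioritized multi-objective specifications concrete, here is a minimal sketch that ranks two trajectories under a strictly ordered rule hierarchy by comparing their violation scores lexicographically. It assumes a total order over priority levels and additive violation scores within a level. The rule names and the helpers `violation_key` and `is_better` are hypothetical and are not taken from the ScenicRules benchmark, whose Hierarchical Rulebooks may use a richer ordering.

```python
def violation_key(violations, hierarchy):
    """Summarize a trajectory as one violation total per priority level.

    violations: dict mapping rule name -> non-negative violation score.
    hierarchy: list of levels, highest priority first; each level is a
    list of rule names (hypothetical names, for illustration only).
    """
    return tuple(
        sum(violations.get(rule, 0.0) for rule in level) for level in hierarchy
    )


def is_better(a, b, hierarchy):
    """True if trajectory `a` is strictly preferred to trajectory `b`.

    Tuples compare lexicographically, so a lower-priority level only
    matters when all higher-priority levels are tied.
    """
    return violation_key(a, hierarchy) < violation_key(b, hierarchy)


hierarchy = [
    ["no_collision"],                      # highest priority
    ["stay_in_lane", "obey_speed_limit"],  # traffic rules
    ["make_progress"],                     # efficiency
]
leaves_lane = {"no_collision": 0.0, "stay_in_lane": 0.5, "make_progress": 0.1}
near_miss = {"no_collision": 0.2}
prefer_leaving_lane = is_better(leaves_lane, near_miss, hierarchy)
```

Under this assumed ordering, leaving the lane is preferred to a collision-rule violation, however well the second trajectory scores on the lower-priority rules.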

2603.24576 2026-06-08 cs.RO cs.AI cs.CV 版本更新

Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation

Chameleon:用于视觉运动操作的控制索引前瞻记忆

Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Yuhang Han, Ying Sun, Yang Xiao, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(南洋理工大学MARS实验室) Institute for Infocomm Research, A*STAR, Singapore(新加坡科技研究局(A*STAR)资讯通信研究院) National University of Singapore(新加坡国立大学)

AI总结 提出Chameleon策略,通过控制索引前瞻记忆解决观察-动作延迟问题,在Camo-Dataset上决策成功率从22.5%提升至80.8%,并在多个公开基准上达到同规模模型的最优水平。

Comments Code is available at https://github.com/gxyes/MARS_Chameleon

详情
AI中文摘要

机器人常常在观察到某个信息后很久才执行相应的动作。例如,在藏球游戏中,机器人首先看到哪个杯子藏有球,观察杯子移动,然后才需要选择正确的杯子。仅凭最后的观察不足以做出决策:正确的动作依赖于更早的事件。我们将这种时间间隔称为观察-动作延迟。这使记忆成为一个面向策略的问题:策略必须保持相似历史记录的可区分性,检索与当前决策相关的过去事件,并将该回忆转换为动作就绪状态。我们将这些需求称为可分离性、可寻址性和前瞻性。我们引入了Chameleon,一个约60M参数的视觉运动策略,用于实现控制索引的前瞻记忆。Chameleon写入具身事件记忆,保留可分离的历史记录,检索与控制相关的痕迹,并训练由此得到的工作状态使其具有前瞻性。我们还引入了Camo-Dataset,这是一个真实机器人基准,通过使决策场景在视觉上模糊来隔离观察-动作延迟,因此必须从早期观察中推断正确动作。Chameleon在Camo-Dataset上将决策/端到端成功率从22.5%/21.3%提高到80.8%/71.3%。在公开的长时域记忆基准上,它在LIBERO-10上达到87.1% ± 0.8%,在MemoryBench上达到97.3% ± 4.5%,在MIKASA-Robo上达到75.1% ± 1.4%,在同规模模型中达到最先进水平,并在所报告的协议下超过多个更大的VLA基线。探针和消融实验表明,Chameleon学习到了可分离、可寻址和前瞻的记忆,并且这些特性驱动了其性能提升。

英文摘要

Robots often observe information that determines a future action long before that action is executed. In a shell game, for example, a robot first sees which cup hides the ball, watches the cups move, and only later needs to choose the correct cup. The final observation alone is not enough for a decision: the correct action depends on an earlier event. We refer to this temporal gap as observation-action delay. It makes memory a policy-facing problem: a policy must keep similar histories distinct, retrieve the past event relevant to the current decision, and convert that recall into an action-ready state. We call these requirements separability, addressability, and prospectiveness. We introduce Chameleon, a ~60M visuomotor policy for control-indexed prospective memory. Chameleon writes embodied event memory, preserves separable histories, retrieves control-relevant traces, and trains the resulting working state to be prospective. We also introduce Camo-Dataset, a real-robot benchmark that isolates observation-action delay by making the decision scene visually ambiguous, so the correct action must be inferred from earlier observations. Chameleon improves decision/end-to-end success on Camo-Dataset from 22.5%/21.3% to 80.8%/71.3%. On public long-horizon memory benchmarks, it achieves 87.1% +/- 0.8% on LIBERO-10, 97.3% +/- 4.5% on MemoryBench, and 75.1% +/- 1.4% on MIKASA-Robo, setting the state of the art for same-size models and exceeding multiple larger VLA baselines under the reported protocols. Probes and ablations show that Chameleon learns separable, addressable, and prospective memory, and that these properties drive its performance gains.

2604.08168 2026-06-08 cs.RO cs.AI 版本更新

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

ViVa:用于机器人强化学习的视频生成价值模型

Jindi Lv, Hao Li, Jie Li, Fankun Kong, Yang Wang, Pengfei Yi, Yifei Nie, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang

发表机构 * GigaAI Sichuan University(四川大学) Tsinghua University(清华大学)

AI总结 提出ViVa,利用预训练视频生成器联合预测未来本体感受和标量价值,通过时空先验实现可靠价值估计,在三个任务中取得最优结果,与RECAP结合平均成功率达80%。

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过大规模预训练推进了机器人操作,但由于部分可观测性和延迟反馈,实际部署仍然具有挑战性。强化学习通过价值函数解决这一问题,该函数评估任务进展并指导策略改进。然而,基于视觉-语言模型(VLM)的现有价值模型难以捕捉时间动态和物理交互,削弱了长期任务中价值估计的可靠性。本文提出ViVa,一种视频生成价值模型,该模型重新利用预训练的视频生成器,联合预测未来本体感受和标量价值。通过将价值估计基于预期的具身动态,ViVa利用时空先验,将价值与超越静态快照的前瞻性内在耦合。ViVa在三个任务的基于度量的评估中取得了最先进的结果,产生可靠的价值信号,准确跟踪任务进展并检测执行错误。集成到RECAP中,它实现了80%的平均成功率,突显了视频生成模型在价值估计中的前景。

英文摘要

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics and physical interactions, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator to jointly predict future proprioception and a scalar value. By grounding value estimation in anticipated embodiment dynamics, ViVa leverages spatiotemporal priors to intrinsically couple value with foresight beyond static snapshots. ViVa achieves state-of-the-art results in metric-based evaluation across three tasks, producing reliable value signals that accurately track task progress and detect execution errors. Integrated into RECAP, it achieves an average success rate of 80%, highlighting the promise of video-generative models for value estimation.

2605.07496 2026-06-08 cs.RO 版本更新

PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation

PathPainter:将图像生成模型的泛化能力迁移至具身导航

Yijin Wang, Yuru Tian, Xijie Huang, Weiqi Gai, Mo Zhu, Xin Zhou, Yuze Wu, Fei Gao

发表机构 * Tsinghua University(清华大学)

AI总结 提出利用鸟瞰图作为全局先验的导航系统,通过图像生成模型理解自然语言意图并生成可通行掩码,结合跨视图定位消除里程计漂移,在无人机平台上完成160米室外长距离导航。

Comments Work in progress. 16 pages, 13 figures

详情
AI中文摘要

鸟瞰图已被广泛证明能为导航提供有价值的先验信息。鉴于这种视图提供的全局信息,仍存在两个关键挑战:如何充分利用这些信息以及如何在执行过程中可靠地使用它们。在本文中,我们提出了一种导航系统,该系统使用鸟瞰图作为全局先验,并专为地面和近地面机器人平台设计。该系统采用图像生成模型从自然语言中解读人类意图,识别目标目的地,并生成可通行掩码。在执行过程中,我们引入跨视图定位以将机器人的里程计与鸟瞰图对齐,并减轻传统里程计中的长期漂移。我们进行了广泛的基准实验来评估所提出的方法,并在无人机平台上进一步验证。仅使用传统的局部运动规划器,无人机成功完成了160米的室外长距离导航任务。这项工作展示了基础模型的世界理解能力如何迁移到具身导航,使机器人能够受益于现有图像生成模型的强大泛化能力。

英文摘要

Bird's-eye-view (BEV) images have been widely demonstrated to provide valuable prior information for navigation. Given the global information provided by such views, two key challenges remain: how to fully exploit this information and how to reliably use it during execution. In this paper, we propose a navigation system that uses BEV images as global priors and is designed for ground and near-ground robotic platforms. The system employs an image generation model to interpret human intent from natural language, identify the target destination, and generate traversability masks. During execution, we introduce cross-view localization to align the robot's odometry with the BEV map and mitigate long-term drift in conventional odometry. We conduct extensive benchmark experiments to evaluate the proposed method and further validate it on a UAV platform. Using only a conventional local motion planner, the UAV successfully completes a 160-meter outdoor long-range navigation task. This work demonstrates how the world-understanding capabilities of foundation models can be transferred to embodied navigation, enabling robots to benefit from the strong generalization ability of existing image generation models.

2605.08732 2026-06-08 cs.RO cs.LG 版本更新

Latent Geometry Beyond Search: Amortizing Planning in World Models

超越搜索的潜在几何:在世界模型中摊销规划

Hoang Nguyen, Xiaohao Xu, Xiaonan Huang

发表机构 * Department of Robotics, University of Michigan, Ann Arbor(密歇根大学机器人系,安阿伯)

AI总结 提出在正则化潜在几何下,将规划摊销为潜在逆动力学映射,以轻量级GC-IDM替代在线搜索,在八个环境-协议设置中的七个上匹配或超越CEM,单次决策成本降低100-130倍。

Comments 31 pages

详情
AI中文摘要

现代基于视觉的世界模型可以将观测表示为紧凑而富有表现力的潜在流形,但在这些空间中进行快速的目标导向规划仍然具有挑战性。这引发了一个核心问题:学习到的表示何时简化控制,而不仅仅是实现预测?我们在预训练的LeWorldModel中研究这个问题,其潜在几何通过正则化实现平滑性和均匀性。我们的关键见解是,在这种几何下,规划可以摊销为潜在逆动力学映射,而无需在线搜索。因此,我们用一个轻量级的目标条件逆动力学模型(GC-IDM)替代迭代规划,该模型将当前潜在状态、目标潜在状态和剩余时间步直接映射到下一个动作。实验上,在涵盖导航、接触丰富的操作和连续控制的四个基准环境中,我们的控制器在八个环境-协议设置中的七个上匹配或超过了CEM,同时将每次决策成本降低了100-130倍。对测试时规划器(CEM、MPPI、iCEM和基于梯度的方法)的更广泛扫描表明,这一结果并非特定于某个优化器。这些发现表明,测试时规划恢复的大部分结构已经局部编码在潜在表示中。更广泛地说,我们的结果表明,足够结构化的潜在空间可以将部分规划负担从在线优化转移到学习推理。我们的代码公开在 https://github.com/hdnndh/Latent-Geometry-Beyond-Search-Amortizing-Planning-in-World-Models 。

英文摘要

Modern vision-based world models can represent observations as compact yet expressive latent manifolds, but fast goal-oriented planning in these spaces remains challenging. This raises a central question: when does a learned representation simplify control, rather than merely enabling prediction? We study this question in a pretrained LeWorldModel, whose latent geometry is regularized for smoothness and uniformity. Our key insight is that, under such geometry, planning can be amortized into a latent inverse-dynamics mapping instead of requiring online search. We therefore replace iterative planning with a lightweight Goal-Conditioned Inverse Dynamics Model (GC-IDM) that maps the current latent state, goal latent state, and remaining horizon directly to the next action. Empirically, across four benchmark environments spanning navigation, contact-rich manipulation, and continuous control, our controller matches or exceeds CEM in seven of eight environment-protocol settings while reducing per-decision cost by 100-130x. A broader sweep over test-time planners (CEM, MPPI, iCEM, and gradient-based methods) shows that this result is not specific to a particular optimizer. These findings suggest that much of the structure recovered by test-time planning is already locally encoded in the latent representation. More broadly, our results indicate that sufficiently structured latent spaces can shift part of the planning burden from online optimization to learned inference. Our code is publicly available at https://github.com/hdnndh/Latent-Geometry-Beyond-Search-Amortizing-Planning-in-World-Models .
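
As a toy illustration of amortizing planning into a goal-conditioned inverse-dynamics mapping, the sketch below assumes an idealized latent space in which dynamics are purely additive (next latent = latent + action). Under that assumption the mapping from current latent, goal latent and remaining horizon to the next action has a closed form and needs no online search. The paper's GC-IDM is a learned network over a pretrained world model. The functions `gc_idm_action` and `rollout` and the additive dynamics are stand-ins introduced here only for illustration.

```python
import math


def gc_idm_action(z, z_goal, remaining, max_step=1.0):
    """Map (current latent, goal latent, remaining horizon) to one action.

    Assumes toy additive latent dynamics z' = z + a, so spreading the
    remaining displacement evenly over the remaining steps reaches the goal.
    """
    if remaining <= 0:
        raise ValueError("remaining horizon must be positive")
    action = [(g - c) / remaining for c, g in zip(z, z_goal)]
    norm = math.sqrt(sum(a * a for a in action))
    if norm > max_step:  # respect a per-step action bound
        action = [a * max_step / norm for a in action]
    return action


def rollout(z, z_goal, horizon, max_step=1.0):
    """Apply the mapping once per step; no candidate sampling or search."""
    for remaining in range(horizon, 0, -1):
        action = gc_idm_action(z, z_goal, remaining, max_step)
        z = [c + a for c, a in zip(z, action)]
    return z


final_latent = rollout([0.0, 0.0], [3.0, 4.0], horizon=5, max_step=2.0)
```

Each decision costs a single function evaluation, whereas a sampling-based planner such as CEM would score many candidate action sequences at every step. That difference is the source of the per-decision savings the abstract reports, though the numbers there come from the learned model, not from this toy.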

2605.26974 2026-06-08 cs.RO 版本更新

Trust, Geometry, and Rules: A Credibility-Aware Reinforcement Learning Framework for Safe USV Navigation under Uncertainty

信任、几何与规则:不确定性下安全USV导航的可信感知强化学习框架

Yuhang Zhang, Shuqi Chai, Yukang Zhang, Liusha Yang, Mingchuan Zhang, Wei Wang, Qingjiang Shi, Quanbo Ge

发表机构 * School of Information Engineering, Henan University of Science and Technology(河南科技大学信息工程学院) Shenzhen Research Institute of Big Data(深圳市大数据研究院) School of Logistics Engineering, Shanghai Maritime University(上海海事大学物流工程学院) Shenzhen Technology University(深圳技术大学) School of Computer Science, Wuhan University(武汉大学计算机学院) School of Software Engineering, Tongji University(同济大学软件学院)

AI总结 提出一种集成可信感知学习、几何安全屏蔽和连续规则感知嵌入的强化学习框架,以解决动态海洋环境中USV导航的安全性和COLREGs合规性问题。

详情
AI中文摘要

在动态海洋环境中,实现无人水面艇(USV)既安全又符合《国际海上避碰规则》(COLREGs)的自主导航仍然是一项艰巨的挑战,特别是当感知系统的不确定性校准不当时。现有的基于强化学习(RL)的方法常常失效,因为状态估计误差会产生不可靠的信念状态并误导价值函数,而离散的交通规则又给学习目标带来不连续性。为了解决这些挑战,我们提出了一个集成可信感知学习、几何安全屏蔽和连续规则感知嵌入的框架。首先,可信加权价值学习(CW-VL)引入了一个动态信任因子,该因子源自滤波器估计协方差与经验误差统计之间的差异,用于调节评论家的异方差损失,防止策略对噪声样本过拟合。其次,协方差膨胀速度障碍(CI-VO)将位置估计不确定性映射为集合式角度裕度,形成一个保守的几何屏蔽,用以覆盖危险的探索动作。第三,风险感知COLREGs职责嵌入将二元会遇职责放松为连续的规则感知信号,提供平滑的扇区过渡信息,并抑制稀疏规则奖励引起的振荡。仿真会遇实验表明,该方法在感知不一致的情况下具有更好的训练鲁棒性,并且在避碰和COLREGs合规性方面优于基线方法。

英文摘要

Autonomous navigation of Unmanned Surface Vehicles (USVs) that is safe and compliant with the International Regulations for Preventing Collisions at Sea (COLREGs) remains a formidable challenge in dynamic maritime environments, particularly when perception systems exhibit miscalibrated uncertainty. Existing Reinforcement Learning (RL)-based methods often falter because state-estimation errors induce unreliable belief states that mislead the value function, while discrete traffic rules introduce discontinuity in the learning objective. To address these challenges, we propose a framework integrating credibility-aware learning, geometric safety shielding, and continuous rule-aware embedding. First, Credibility-Weighted Value Learning (CW-VL) introduces a dynamic trust factor derived from the discrepancy between filter-estimated covariance and empirical error statistics to modulate the critic's heteroscedastic loss, preventing policy overfitting to noisy samples. Second, the Covariance-Inflated Velocity Obstacle (CI-VO) maps position-estimation uncertainty into set-wise angular margins, forming a conservative geometric shield that overrides hazardous exploratory actions. Third, Risk-Aware COLREGs Duty Embedding relaxes binary encounter duties into continuous rule-aware signals, providing smooth sector-transition information and suppressing oscillation from sparse rule rewards. Simulated encounter studies demonstrate improved training robustness against perceptual inconsistency and superior collision avoidance and COLREGs compliance over baselines.
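
The abstract gives no formula for CI-VO, so the sketch below shows only one simple way position uncertainty could widen a velocity-obstacle cone: the obstacle's safety radius is inflated by `k` times the largest standard deviation of its 2x2 position covariance, and the collision-cone half-angle is recomputed. The inflation rule, the factor `k` and all function names are assumptions for illustration and are not the authors' construction.

```python
import math


def largest_std(cov):
    """Square root of the largest eigenvalue of a symmetric 2x2 covariance."""
    (a, b), (_, d) = cov
    mean = 0.5 * (a + d)
    spread = math.hypot(0.5 * (a - d), b)
    return math.sqrt(max(mean + spread, 0.0))


def inflated_half_angle(rel_pos, radius, cov, k=2.0):
    """Half-angle of the collision cone after inflating the safety radius.

    rel_pos: obstacle position relative to own ship.
    Returns pi when own ship is already inside the inflated safety disc.
    """
    dist = math.hypot(rel_pos[0], rel_pos[1])
    inflated = radius + k * largest_std(cov)  # assumed inflation rule
    if dist <= inflated:
        return math.pi
    return math.asin(inflated / dist)


def is_hazardous(rel_pos, rel_vel, radius, cov, k=2.0):
    """True if the relative velocity points inside the inflated cone.

    rel_vel: own-ship velocity minus obstacle velocity.
    """
    half = inflated_half_angle(rel_pos, radius, cov, k)
    if half >= math.pi:
        return True
    if rel_vel[0] == 0 and rel_vel[1] == 0:
        return False  # no relative motion, so no closing trajectory
    bearing = math.atan2(rel_pos[1], rel_pos[0])
    heading = math.atan2(rel_vel[1], rel_vel[0])
    diff = math.atan2(math.sin(heading - bearing), math.cos(heading - bearing))
    return abs(diff) < half


certain = is_hazardous((10.0, 0.0), (1.0, 0.2), 1.0, [[0.0, 0.0], [0.0, 0.0]])
uncertain = is_hazardous((10.0, 0.0), (1.0, 0.2), 1.0, [[1.0, 0.0], [0.0, 1.0]])
```

With a perfectly known obstacle position the sample heading passes outside the cone, while the same heading is flagged once the covariance widens the margin. A shield built on this check would then override such an action with a safe alternative.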

2606.01072 2026-06-08 cs.RO cs.CV 版本更新

Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs

利用场景图扩展机器人模仿学习的时空上下文

Jianing Qian, Qinhe Peng, Emmanuel Panov, Leonor Fermoselle, Dinesh Jayaraman, Bernadette Bucher, Tarik Kelestemur

发表机构 * University of Pennsylvania(宾夕法尼亚大学) RAI Institute(RAI研究院) University of Michigan(密歇根大学)

AI总结 提出使用场景图作为显式结构化记忆机制,通过动态维护对象中心关系及其时间演化,解决机器人模仿学习中的部分可观测性和长时推理问题。

详情
AI中文摘要

模仿学习使机器人能够通过观察学习如何执行任务。然而,像家庭和办公室这样的真实环境通常由于空间尺度大而严重部分可观测。此外,许多任务涉及执行一系列子任务,要求自主机器人在扩展的时间范围内进行推理。为了解决这些挑战,我们提出在模仿学习中使用场景图作为显式且结构化的记忆机制。通过维护一个动态场景图,捕捉以对象为中心的关系及其随时间的变化,我们的方法允许智能体在任务执行期间保留相关历史上下文,从而有效推理逐步累积的场景信息。我们在模拟移动操作和真实桌面操作上的实验表明,我们的方法显著提高了策略性能,特别是在需要长期推理和在部分可观测性下鲁棒泛化的场景中。

英文摘要

Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.

2505.17739 2026-06-08 cs.MA cs.CY cs.HC cs.RO 版本更新

Feasible Action Space Reduction for Quantifying Causal Responsibility in Continuous Spatial Interactions

可行动作空间缩减用于量化连续空间交互中的因果责任

Ashwin George, Luciano Cavalcante Siebert, David A. Abbink, Arkady Zgonnikov

发表机构 * Delft University of Technology(代尔夫特理工大学)

AI总结 针对连续动作空间,提出FeAR度量的连续空间公式,用于量化空间交互中智能体的因果责任,并展示其在分析回溯责任和估计前瞻责任中的应用。

Comments In review

详情
AI中文摘要

理解一个智能体对另一个智能体的因果影响对于将自动化车辆和移动机器人等人工智能系统安全部署到人类居住环境中至关重要。现有的因果责任模型处理具有离散动作的场景的简化抽象,从而限制了在理解空间交互中的责任时的实际应用。基于空间交互的智能体嵌入场景中且必须在每个时刻执行一个动作的假设,提出了可行动作空间缩减(FeAR)作为离散动作的网格世界环境中因果责任的度量。由于现实世界的交互涉及连续动作空间,本文提出了用于测量空间连续交互中因果责任的FeAR度量的公式。我们展示了该度量在典型空间共享冲突中的效用,并展示了其在分析回溯责任和估计前瞻责任以指导智能体决策中的应用。我们的结果突显了FeAR度量在设计和工程化人工智能体以及评估人类周围智能体责任方面的潜力。

英文摘要

Understanding the causal influence of one agent on another agent is crucial for safely deploying artificially intelligent systems such as automated vehicles and mobile robots into human-inhabited environments. Existing models of causal responsibility deal with simplified abstractions of scenarios with discrete actions, thus limiting real-world use when understanding responsibility in spatial interactions. Based on the assumption that spatially interacting agents are embedded in a scene and must follow an action at each instant, Feasible Action-Space Reduction (FeAR) was proposed as a metric for causal responsibility in a grid-world setting with discrete actions. Since real-world interactions involve continuous action spaces, this paper proposes a formulation of the FeAR metric for measuring causal responsibility in space-continuous interactions. We illustrate the utility of the metric in prototypical space-sharing conflicts, and showcase its applications for analysing backward-looking responsibility and in estimating forward-looking responsibility to guide agent decision making. Our results highlight the potential of the FeAR metric for designing and engineering artificial agents, as well as for assessing the responsibility of agents around humans.

2605.04222 2026-06-08 eess.SY cs.RO cs.SY 版本更新

Safety by Invariance, Liveness through Refinement: Heterogeneous Contract Framework for Co-Design of Layered Control

通过不变性保证安全,通过精化实现活性:分层控制协同设计的异构契约框架

Yoshinari Takayama, Alessio Iovine, Bart Besselink, Guillaume Sandou, Adnane Saoud

发表机构 * Laboratory of Signals and Systems (L2S), CNRS, CentraleSupelec, Paris-Saclay University, France(信号与系统实验室(L2S),CNRS,CentraleSupelec,巴黎-萨克雷大学,法国) College of Computing, University Mohammed VI Polytechnic, Benguerir, Morocco(穆罕默德六世理工大学计算学院,本盖里尔,摩洛哥)

AI总结 针对分层控制架构缺乏统一规范语言、跨时间尺度互联保证及层间组合分离的问题,提出将安全-活性分解引入异构假设-保证契约框架,通过连续时间层的不变性保证安全,离散时间层的精化实现活性,并形式化层间协调条件。

Comments 21 pages

详情
AI中文摘要

现实世界的控制系统必须在满足连续时间安全约束的同时实现长期目标(活性),这一组合推动了分层控制架构(LCA)的研究。然而,现有的LCA研究缺乏(i)跨离散规划和连续执行的统一规范语言,(ii)在异构时间尺度下互连子系统时规范得以保持的形式化保证,以及(iii)层与层之间的组合式分离,其原因在于依赖简单的输入滤波法则。本文通过将安全-活性分解引入异构假设-保证框架来填补这三个空白:安全通过连续时间层的不变性来保证,而活性通过离散时间层的精化来实现,层间协调则通过垂直精化和时间兼容性条件加以形式化。我们通过一个结合MPC规划器、输入到状态稳定(ISS)底层控制器和参考调节器桥接的新型LCA实例化该契约,并在包含电池和超级电容器的混合储能系统(HESS)上进行了验证。

英文摘要

Real-world control systems must achieve long-horizon objectives (liveness) while respecting continuous-time safety constraints, a combination that motivates hierarchical layered control architectures (LCAs). Existing LCA research, however, lacks (i) a uniform specification language across discrete planning and continuous execution, (ii) formal guarantees that specifications are preserved when interconnecting subsystems at heterogeneous time scales, and (iii) compositional separation between layers, owing to reliance on naive input-filtering laws. This paper addresses all three gaps by importing the safety-liveness decomposition into a heterogeneous assume-guarantee framework: safety is enforced by invariance at the continuous-time layer, while liveness is achieved through refinement at the discrete-time layer, with inter-layer coordination formalized via vertical refinement and timing-compatibility conditions. We instantiate this contract with a novel LCA combining an MPC planner, an input-to-state stabilizing (ISS) low-level controller, and a reference-governor bridge, and validate it on a Hybrid Energy Storage System (HESS) comprising a battery and a supercapacitor.

2605.22882 2026-06-08 cs.CV cs.RO 版本更新

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

GEM-4D:用于机器人操作的几何增强视频世界模型

Kaichen Zhou, Yuzhen Chen, Fangneng Zhan, Hang Hua, Grace Chen, Xinhai Chang, Ao Qu, Yilun Du, Zhuang Liu, Paul Pu Liang, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab(哈佛人工智能与机器人实验室) Harvard University(哈佛大学) Media Lab and EECS(媒体实验室和电子工程与计算机科学系) MIT(麻省理工学院) Princeton University(普林斯顿大学) MIT-IBM Watson AI Lab(麻省理工-IBM沃森人工智能实验室)

AI总结 提出GEM-4D,通过注入从预训练几何基础模型蒸馏的密集4D对应监督,增强视频世界模型的几何一致性,并引入逆动力学模块将视频推演(rollout)转换为可执行机器人轨迹,将真实世界操作成功率从61%提升至81%。

Comments Robotic World Model, Video Generative Model

详情
AI中文摘要

视频世界模型可以从单个指令生成逼真的未来帧,但它们通常无法在时间上一致地跟踪相同的物理点。因此,生成的视频看似合理,却缺乏可靠动作执行(如机器人操作)所需的物理基础。我们提出GEM-4D,一种几何接地的视频世界模型,通过在训练期间将从预训练几何基础模型蒸馏的密集4D对应监督注入视频生成骨干网络来解决这一限制。这种监督使模型能够联合捕捉外观和几何结构,同时保持单流架构且无额外推理成本。我们进一步引入逆动力学模块,将对应一致的视频推演(rollout)转换为可执行的机器人轨迹,从而能够在真实世界和仿真操作中直接部署。GEM-4D在仿真和真实场景中的视频预测和几何一致性方面均达到最先进性能,并将真实世界操作成功率从61%提升至81%。更多结果见https://gem-4d.github.io/。

英文摘要

Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across both simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at https://gem-4d.github.io/.

2502.08903 2026-06-08 cs.RO cs.AI 版本更新

3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

面向机器人任务规划的3D接地视觉-语言框架:自动化提示合成与监督推理

Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出融合2D提示合成模块和小语言模型的框架,提升机器人3D场景理解与任务执行能力,实验显示任务成功率高达96.0%。

详情
Journal ref
Engineering Applications of Artificial Intelligence, vol. 164, p. 113268, 2026
AI中文摘要

视觉-语言模型(VLMs)在场景理解和感知任务中取得了显著成功,使机器人能够在动态环境中自适应地规划和执行动作。然而,大多数多模态大语言模型缺乏稳健的3D场景定位能力,限制了其在精细机器人操作中的有效性。此外,低识别精度、低效、差的迁移性和可靠性等挑战阻碍了其在精密任务中的应用。为解决这些限制,我们提出了一种新的框架,该框架整合了一个2D提示合成模块,通过将2D图像映射到点云,以及一个小型语言模型(SLM)来监督VLM的输出。2D提示合成模块使训练于2D图像和文本的VLM能够自主提取精确的3D空间信息,无需人工干预,显著增强了3D场景理解。同时,SLM监督VLM的输出,缓解幻觉并确保可靠的可执行机器人控制代码生成。我们的框架消除了在新环境中重新训练的需要,从而提高了成本效率和操作鲁棒性。实验结果表明,所提出的框架实现了96.0%的任务成功率(TSR),优于其他方法。消融研究证明了2D提示合成模块和输出监督模块的关键作用(当移除时,TSR下降67%)。这些发现验证了框架在提升3D识别、任务规划和机器人任务执行方面的有效性。

英文摘要

Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, poor transferability, and limited reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module, which maps 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results show that the proposed framework achieved a 96.0% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67% TSR drop). These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.