arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26113 2026-05-26 cs.RO cs.CV (updated version)

AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond

Haiming Zhang, Junfei Zhou, Feng Jiang, Jingzhong Li, Zhenglong Guo, Penglin Dai, Jifeng Dai, Yan Xie, Benjin Zhu

Affiliations: Li Auto; Southwest Jiaotong University; Tsinghua University

AI summary: Proposes AnyScene, a framework that uses a Spatial-Temporal Occupancy Diffusion Transformer and a Geometry-Grounded View Expansion module to generate semantic occupancy sequences from BEV layouts and reference-free multi-view driving videos, supporting precise controllability and long-horizon generation.

Comments: Work in progress. Project page: https://mind-omni.github.io/

Abstract

Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.

2605.25942 2026-05-26 cs.CV cs.RO (updated version)

LRDDv3: High-Resolution Long-Range Drone Detection Dataset with Range Information and Thermal Data

Knut Peterson, Zaid Mayers, Azmain Yousuf, Priontu Chowdhury, Asher Zaczepinski, Solmaz Arezoomandan, Reihaneh Maarefdoust, David Han

Affiliations: iMaPLe Research Lab, Drexel University; University of Maine

AI summary: Introduces the LRDDv3 dataset, comprising 102,532 high-resolution long-range RGB images and 29,630 paired IR images, to support long-range drone detection with range information.

Comments: 8 pages, 5 figures. Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA)

Abstract

Unmanned Aerial Vehicles (UAVs) have quickly become common in various airspaces, representing a wide range of applications from recreational flying to commercial photography and package delivery. With the increasing prevalence of UAVs, it becomes critical that both manned and unmanned aircraft can detect UAVs and other flying objects from long range to effectively track movement and ensure safe operation in shared spaces. While several datasets have been introduced for drone detection, the need for expanded high-quality data persists, especially in the area of high-resolution long-range drone data. To address this, we introduce a high-resolution dataset of 102,532 long-range RGB images of drones, sampled at 5 FPS from 128 distinct video clips taken mid-flight during 17 different data collection days spread over 8 months to ensure a wide variety of lighting scenarios, flight locations, and background elements. The dataset provides comprehensive drone range information, as well as 29,630 IR images, all paired with RGB counterparts from the base dataset. As one of the first drone detection datasets to leverage 4K image resolution and paired 640x512 IR images, our work represents a significant advancement to enable the detection of drones at long range. For access to the complete dataset, please visit https://research.coe.drexel.edu/ece/imaple/lrddv3/
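The sampling figures quoted in the LRDDv3 abstract imply a rough footage budget. The Python sketch below derives it as a back-of-envelope estimate. It assumes every RGB image comes from the stated 5 FPS sampling and that each IR image pairs with exactly one RGB image; the abstract gives no per-clip durations, so the per-clip numbers are averages only, and all variable names are illustrative.

```python
# Back-of-envelope footage estimate from the figures quoted in the abstract.
RGB_IMAGES = 102_532   # long-range RGB images
CLIPS = 128            # distinct video clips
SAMPLE_FPS = 5         # sampling rate stated in the abstract
IR_IMAGES = 29_630     # IR images paired with RGB counterparts

total_seconds = RGB_IMAGES / SAMPLE_FPS        # sampled footage, in seconds
total_hours = total_seconds / 3600
mean_frames_per_clip = RGB_IMAGES / CLIPS      # average only; real clips vary
mean_clip_seconds = mean_frames_per_clip / SAMPLE_FPS
ir_pair_fraction = IR_IMAGES / RGB_IMAGES      # assumes one RGB frame per IR image

print(f"sampled footage: {total_hours:.1f} h")
print(f"mean clip: {mean_frames_per_clip:.0f} frames, {mean_clip_seconds:.0f} s")
print(f"RGB frames with an IR pair: {ir_pair_fraction:.1%}")
```

Under these assumptions the dataset corresponds to several hours of sampled footage, with well under half of the RGB frames having a thermal counterpart.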

2605.25901 2026-05-26 cs.CV cs.RO (updated version)

AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin

Affiliations: Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University

AI summary: Proposes AgentGrounder, a framework that achieves zero-shot 3D visual grounding through a two-stage design (offline construction of an object lookup table and an online tool-driven agent), improving over SeeGround by +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D.

Comments: Code: https://github.com/be2rlab/AgentGrounder

Abstract

3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
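The retrieve-then-score idea in the AgentGrounder abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the `OLTEntry` fields mirror the three attributes the abstract lists, while `retrieve`, `ground_nearest`, exact label matching, and the centre-distance score are hypothetical stand-ins chosen for brevity.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class OLTEntry:
    """One row of a hypothetical Object Lookup Table."""
    instance_id: int
    label: str
    bbox_min: tuple  # (x, y, z) lower corner of the 3D bounding box
    bbox_max: tuple  # (x, y, z) upper corner of the 3D bounding box

    @property
    def center(self):
        return tuple((lo + hi) / 2 for lo, hi in zip(self.bbox_min, self.bbox_max))


def retrieve(olt, label):
    """Keep only entries whose semantic label matches the query term."""
    return [entry for entry in olt if entry.label == label]


def ground_nearest(olt, target_label, anchor_label):
    """Resolve 'the <target> nearest the <anchor>' by bounding-box centre distance."""
    targets = retrieve(olt, target_label)
    anchors = retrieve(olt, anchor_label)
    if not targets or not anchors:
        return None  # nothing to ground; a real agent might fall back to rendering
    return min(targets, key=lambda t: min(dist(t.center, a.center) for a in anchors))


# Toy scene: two chairs and one table (all coordinates are made up).
olt = [
    OLTEntry(0, "chair", (0.0, 0.0, 0.0), (0.5, 0.5, 1.0)),
    OLTEntry(1, "chair", (3.0, 0.0, 0.0), (3.5, 0.5, 1.0)),
    OLTEntry(2, "table", (2.5, 1.0, 0.0), (4.0, 2.0, 0.8)),
]
best = ground_nearest(olt, "chair", "table")
print(best.instance_id)
```

Only the two chair rows and the table row are ever touched by the scoring step, which is the context-saving property the abstract attributes to selective retrieval.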

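2605.25851 2026-05-26 cs.RO (updated version)

RePlan-Bot: Multi-Level Replanning for Embodied Instruction Following

Xicheng Gong, Guozheng Sun, Peiran Xu, Yadong Mu

Affiliations: Peking University; Tsinghua University

AI summary: Proposes RePlan-Bot, which addresses long-horizon planning and irreversible state changes in embodied instruction following through multi-level continuous replanning (a high-level LLM auditor, commonsense-guided search, and a lightweight ViT corrector), achieving state-of-the-art performance on the ALFRED benchmark.

Comments: 10 pages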

Abstract

Embodied instruction following (EIF) requires agents to understand and execute complex natural language commands within interactive 3D environments. Despite recent advances, existing methods often fail in long-horizon planning and handling irreversible state changes, resulting in low task success rates. To address these challenges, we introduce RePlan-Bot, a novel EIF agent that performs multi-level, continuous replanning throughout task execution. RePlan-Bot integrates a high-level LLM-based auditor for dynamic sub-goal adjustments guided by environmental feedback, a commonsense-guided search mechanism based on a multi-layered instance map for precise and structured object localization, and a lightweight ViT-based corrector to preemptively fix risky low-level actions. Evaluated on the ALFRED benchmark, RePlan-Bot achieves state-of-the-art performance in both seen and unseen environments, demonstrating superior adaptability and reliability.

2605.25832 2026-05-26 cs.RO cs.AI cs.CL cs.CV (updated version)

When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

Yunfei Wang, Xiaohao Xu, Yang Li, Xiaonan Huang

Affiliations: University of Michigan

AI summary: Proposes Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into a natural-language skill library to make robot design knowledge transferable, improving cold-start search on EvoGym tasks and transferring skills across design spaces.

Comments: 20 pages, 8 figures

Abstract

Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto-Robotist improves cold-start 5x5 search and transfers learned skills to 10x10 design spaces, where reference-conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.

2605.25829 2026-05-26 cs.RO cs.AI (updated version)

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

Xinzhe Chen, Sihua Ren, Liqi Huang, Haowen Sun, Mingyang Li, Xingyu Chen, Zeyang Liu, Xuguang Lan

Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence; Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

AI summary: Proposes OASIS, a visuomotor policy that aligns the intermediate representation with the action space through SE(3) end-effector trajectory prediction, outperforming VLA and WAM baselines in simulation and real-world experiments.

Abstract

Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via SE(3) end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an SE(3) trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at https://npuhandsome.github.io/OASIS_web.

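2605.25813 2026-05-26 cs.RO (updated version)

Extending Embodied Question Answering from Perception to Decision

Xicheng Gong, Qiwei Li, Peiran Xu, Yadong Mu

Affiliations: Peking University; XYZ Embodied AI

AI summary: Presents EQA-Decision, a large-scale embodied question answering dataset, and RoboDecision, a baseline model. The dataset systematically covers four dimensions (static scene construction, spatial understanding, task dynamics reasoning, and instant decision), and the pair offers a unified framework for evaluating perception, reasoning, and action-level decision-making in embodied environments.

Comments: 11 pages, 4 figures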

Abstract

Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.

2605.25790 2026-05-26 cs.RO (updated version)

HoLoArm: Deformable Arms for Collision-Tolerant Quadrotor Flight

Quang Ngoc Pham, Jonas Eschmann, Yang Zhou, Alejandro Ojeda Olarte, Giuseppe Loianno, Van Anh Ho

Affiliations: Japan Advanced Institute of Science and Technology; University of California Berkeley; New York University

AI summary: Inspired by the nodus structure of dragonfly wings, proposes HoLoArm, a quadrotor with compliant arms that, combined with a reinforcement learning control policy, achieves passive deformation and rapid recovery and maintains stable flight at collision speeds of up to 7.6 m/s.

Comments: 8 pages, 15 figures, 1 table, Accepted at the IEEE Robotics and Automation Letters (RA-L) and the IEEE International Conference on Robotics and Automation (ICRA), 2026

Journal ref: IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 3582-3589, March 2026

Abstract

The increasing use of drones in human-centric applications highlights the need for designs that can survive collisions and recover rapidly, minimizing risks to both humans and the environment. We present HoLoArm, a quadrotor with compliant arms inspired by the nodus structure of dragonfly wings. This design provides natural flexibility and resilience while preserving flight stability, which is further reinforced by the integration of a Reinforcement Learning (RL) control policy that enhances both recovery and hovering performance. Experimental results demonstrate that HoLoArm can passively deform in any direction, including the axial one, and recover within 0.3-0.6 s depending on the direction and level of the impact. The drone can survive collisions at speeds up to 7.6 m/s and carry a 540 g payload while maintaining stable flight. This work contributes to the morphological design of soft aerial robots with high agility and reliable safety, enabling operation in cluttered and human-shared environments, and lays the groundwork for future fully soft drones that integrate compliant structures with intelligent control.

2605.25685 2026-05-26 cs.RO (updated version)

HumanFlow -- Diffusion-Driven MAV Navigation Among Humans via Tightly-Coupled Motion Tracking, Forecasting, and Control

Simon Schaefer, Joshua Näf, Stefan Leutenegger

Affiliations: Technical University of Munich; MCML; MIRMI; ETH Zurich

AI summary: Proposes HumanFlow, a latent diffusion model that unifies human motion tracking and forecasting and exploits 3D scene context to deliver accurate, efficient motion estimation under heavy occlusion, and couples tightly with control to achieve collision-free MAV navigation among humans.

Comments: Accepted to Robotics Science and Systems (RSS), 2026

Abstract

Robust and accurate perception of humans in their 3D scene context is essential for integrating robots into everyday environments. Existing approaches, however, often fail to predict plausible and accurate human motion estimates that are consistent with the surrounding scene, especially in the presence of heavy occlusions or partial visibility. This can limit both safety and efficiency for robotic operations. We introduce HumanFlow, a latent diffusion model that unifies human motion tracking and forecasting, conditioned on the 3D scene context. We show that our human motion model produces smooth and accurate predictions under challenging conditions, including heavy occlusions, and outperforms state-of-the-art methods in tracking accuracy while being significantly more efficient. Furthermore, we show how HumanFlow's latent space can be tightly coupled with control by conditioning a flow-matching-based, approximate MPC policy on these representations. We validate our policy in simulation with real human trajectories for MAV social navigation, demonstrating superior navigation performance and remaining collision-free, even under partial observability of the human.

2605.25672 2026-05-26 cs.RO (updated version)

Compliant Non-Prehensile Pushing Manipulation

Francesco Cufino, Mario Selvaggio, Fabio Amadio, Fabio Ruggiero

Affiliations: PRISMA Lab, Department of Electrical Engineering and Information Technology, University of Naples Federico II; Inria, CNRS, Université de Lorraine; ABB Corporate Research Center

AI summary: For non-prehensile pushing with compliant robotic systems, proposes a framework based on impedance control and model predictive control that achieves compliant pushing by optimizing position/velocity set-points and integrates an energy tank passivity filter to guarantee safe interaction.

Abstract

In this paper, we address the challenge of performing non-prehensile pushing operations with a compliant robotic manipulation system. To ensure safe operations in human-populated environments, robots must comply with external physical interactions and exhibit passive behavior. To achieve this, we extend a state-of-the-art pushing model to integrate it with impedance-controlled robots. We develop a model predictive control framework built upon this model that enables compliant pushing through optimal modulation of the robot's position/velocity set-point, jointly realizing the required pushing force and contact point adaptation to obtain desired object motion. However, external interactions may induce tracking errors, which can in turn cause a potentially unbounded increase in the pushing force. To prevent this, we integrate an energy tank passivity filter that further modulates the robot velocity set-point to guarantee passivity and avoid uncontrolled energy buildup. The proposed method has been rigorously tested in simulation and validated through experiments on two different robotic systems, demonstrating passive compliance during human-robot interactions and assessing trajectory tracking performance and robustness to variations in the object's physical parameters.

2605.25646 2026-05-26 cs.RO (updated version)

G-DRAGON: Geospatial Reasoning and Dynamic Planning for Retrieval-Augmented Outdoor Navigation

Dongzhihan Wang, Yi Du, Jianan Sun, Yuan Xue, Yingchen Zhang, Bing Xiao, Chen Wang, Liang Xu

Affiliations: Spatial AI & Robotics Lab; University at Buffalo; School of Future Technology; Shanghai University; University of Chinese Academy of Sciences

AI summary: Proposes G-DRAGON, a framework that maps natural-language commands to local OSM entities via generative retrieval with a lightweight LLM, bridges global route planning with a SLAM system, and uses frontier-based exploration and open-set semantic voxel mapping for last-mile target localization. It outperforms state-of-the-art baselines in simulation and is validated on a UGV in unseen real-world urban environments.

Comments: Accepted by IEEE Robotics and Automation Letters (RA-L)

Abstract

Autonomous ground robots operating in large-scale outdoor environments require both robust long-range navigation and fine-grained "last-mile" exploration. Current advances in visual-language navigation (VLN) work well on short-range tasks but lack geospatial grounding for long-distance missions. Some OpenStreetMap (OSM)-based methods relying on cloud-based Large Language Models (LLMs) are prone to factual hallucination and cannot conduct "last-mile" exploration based on human instruction. To address these challenges, we present G-DRAGON, a retrieval-augmented framework for outdoor, open-world navigation. This framework maps natural-language commands to versioned, local OSM entities via generative retrieval based on a lightweight LLM, yielding accurate coordinates for global route planning. A high-level planning module bridges global topological routes with the SLAM system, projecting geospatial waypoints into the robot's navigable frame. For the "last mile," the framework transitions to frontier-based exploration and open-set semantic voxel mapping to localize open-vocabulary targets. Experimental results in simulation demonstrate our framework outperforms state-of-the-art baselines. Furthermore, we validate the system in unseen real-world urban environments on an Unmanned Ground Vehicle (UGV), successfully completing person-search missions with trajectories of up to 500 m.

2605.25584 2026-05-26 cs.RO cs.AI (updated version)

Acting on the Unseen: Communication-Free Collaborative Filtering for Decentralized Multi-Robot Task Allocation

Alexander Apartsin, Yigal Meshulam, Yehudit Aperstein

Affiliations: Holon Institute of Technology; Afeka Tel Aviv Academic College of Engineering

AI summary: For zero-knowledge multi-robot task allocation, proposes SwarmCF, a method based on online low-rank collaborative filtering that requires no communication, prior knowledge, or coordinator, lets each robot act effectively on tasks it has never attempted, and comes with a proven sample-complexity advantage.

Comments: 27 pages, 12 figures

Abstract

Multi-robot task allocation usually assumes some combination of communication, known task models, or a coordinator. We study the opposite extreme, a regime common in practice but overlooked in theory, which we name Zero-Knowledge MRTA (ZK-MRTA): a robot team with no prior knowledge (no task models, not even the latent rank), no communication (no messages, no parameter sharing, no coordinator), and only a partial and privately-noisy view of a public stream of teammates' outcomes. A hidden low-rank structure governs which robot suits which task, and there are far more tasks than rounds, so most (robot, task) pairs are never attempted. Yet each robot can act well on tasks it never attempted, and onboard new tasks, by running online low-rank collaborative filtering over the broadcast (SwarmCF). The advantage over any structure-free learner is categorical, not a constant factor: a structure-free learner is provably at the prior-mean error floor on unseen pairs. We prove a matching per-robot sample complexity (Θ(d) versus Θ(n), in the rank d and the task count n), an anytime (cumulative-reward) separation under task scarcity, and a deterministic condition under which decentralized recovery from the masked broadcast is exact (validated empirically). Experiments quantify the value of the broadcast, a positive scaling law (per-robot unseen-pair skill rises with team size), and the strongest masking-robustness and anytime profile among low-rank methods, recovering most (about 80% on earned skill) of a centralized full-communication ceiling, and holding under capacity-1 contention and in a robotics-grounded sensing instance.
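To make the low-rank argument in the SwarmCF abstract concrete, the Python sketch below fills in never-attempted (robot, task) pairs from the observed ones. It is a deliberately simplified stand-in, not the paper's SwarmCF algorithm: it runs batch alternating least squares rather than an online update, fixes the rank at 1 (whereas the paper assumes the rank is unknown), and uses noise-free outcomes. The `skill` and `task` vectors and the set of unseen pairs are made-up illustration data.

```python
def als_rank1(observed, n_robots, n_tasks, iters=500):
    """Fit outcome[i][j] ~ u[i] * v[j] on observed pairs by alternating least squares."""
    u = [1.0] * n_robots
    v = [1.0] * n_tasks
    for _ in range(iters):
        # Update each robot factor with the task factors held fixed.
        for i in range(n_robots):
            num = sum(r * v[j] for (a, j), r in observed.items() if a == i)
            den = sum(v[j] ** 2 for (a, j) in observed if a == i)
            if den > 0:
                u[i] = num / den
        # Update each task factor with the robot factors held fixed.
        for j in range(n_tasks):
            num = sum(r * u[i] for (i, b), r in observed.items() if b == j)
            den = sum(u[i] ** 2 for (i, b) in observed if b == j)
            if den > 0:
                v[j] = num / den
    return u, v


# Hypothetical rank-1 ground truth: outcome = robot skill * task factor.
skill = [1.0, 2.0, 3.0]
task = [1.0, 2.0, 4.0, 0.5]
unseen = {(0, 3), (1, 0), (2, 1)}  # pairs no robot ever attempted
observed = {
    (i, j): skill[i] * task[j]
    for i in range(3) for j in range(4) if (i, j) not in unseen
}

u, v = als_rank1(observed, n_robots=3, n_tasks=4)
for i, j in sorted(unseen):
    print((i, j), "predicted", round(u[i] * v[j], 3), "true", skill[i] * task[j])
```

A structure-free learner has no information at all about the three unseen pairs, whereas the factor model predicts them from the shared structure, which is the qualitative gap the abstract describes.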

2605.25553 2026-05-26 cs.CV cs.RO (updated version)

ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

Huan Ren, Yihan Chen, Chuxin Wang, Nailong Liu, Wenfei Yang, Tianzhu Zhang

Affiliations: University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory; Beijing Institute of Control Engineering

AI summary: Proposes ComPose, a framework that tightly integrates shape completion with pose estimation through a keypoint-based progressive completion module and a geometric relation consistency loss, improving pose estimation accuracy and efficiency on incomplete point clouds without relying on category-level shape priors.

Comments: Accepted by CVPR 2026 (Oral, Best Paper Award Candidate). Project page is available at renhuan1999.github.io/ComPose

Abstract

Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors.

2605.25547 2026-05-26 cs.RO cs.CV (updated version)

TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation

Sizhe Zhao, Shengping Zhang, Shuo Yang, Weiyu Zhao, Shuigen Wang, Xiangyang Ji

Affiliations: Harbin Institute of Technology, China; Harbin Institute of Technology (Weihai) Qingdao Research Institute, China; Iray Technology Co., Ltd., Shandong, China; Tsinghua University, Beijing, China

AI summary: Proposes TapSampling, a framework that samples candidate actions in a low-dimensional latent space via an Action-VAE and selects the best action with a task-progress-prediction verifier, improving multiple generalist policies without finetuning.

Comments: ICML 2026. Project Page: https://aipixel.github.io/TapSampling/

Abstract

Existing embodied control research demonstrates remarkable performance improvements by scaling training data and model size. We instead explore inference-time strategy as an alternative axis. Non-deterministic generative models, such as diffusion and autoregressive models, have been widely adopted in the field of embodied control. However, the single-shot inference paradigm limits their performance. In this paper, we propose TapSampling, a plug-and-play framework for inference-time sampling. First, we introduce an Action-VAE that represents actions in a low-dimensional latent space by mapping policy-generated initial actions into a compressed posterior distribution, from which any number of latent samples can be drawn and decoded into candidate actions that approximate the true action distribution. Second, we formulate action verification as task-progress outcome prediction, using the intrinsic sequential structure of robotic datasets to train a semantically grounded verifier for interpretable action selection. Furthermore, TapSampling is a policy-agnostic framework. Extensive experiments in both simulated and real-world environments demonstrate that our method substantially improves multiple generalist policies without further policy finetuning. Code and models are available at the project page.
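The sample-then-verify loop described in the TapSampling abstract follows a generic best-of-N pattern, sketched below in Python. Both learned components are replaced by toy stand-ins that are assumptions of this example: Gaussian perturbation substitutes for Action-VAE posterior sampling and decoding, and a distance-to-goal score substitutes for the trained task-progress verifier. The function names, `sigma`, and the goal vector are all hypothetical.

```python
import random


def sample_candidates(initial_action, n, sigma, rng):
    """Stand-in for Action-VAE sampling: Gaussian perturbations of the initial action."""
    return [[a + rng.gauss(0.0, sigma) for a in initial_action] for _ in range(n)]


def toy_verifier(action, goal):
    """Stand-in for the task-progress verifier: higher means closer to a known goal."""
    return -sum((a - g) ** 2 for a, g in zip(action, goal))


def select_action(initial_action, goal, n=16, sigma=0.1, seed=0):
    """Score the policy's own action plus n sampled candidates and keep the best."""
    rng = random.Random(seed)
    candidates = [initial_action] + sample_candidates(initial_action, n, sigma, rng)
    return max(candidates, key=lambda a: toy_verifier(a, goal))


initial = [0.0, 0.0, 0.0]   # action proposed by the base policy (made up)
goal = [0.1, -0.05, 0.2]    # target used only by the toy verifier (made up)
chosen = select_action(initial, goal)
print(toy_verifier(chosen, goal) >= toy_verifier(initial, goal))
```

Because the policy's original action stays in the candidate pool, the selected action can never score worse than it under the verifier, and the base policy itself is never modified, which matches the plug-and-play framing in the abstract.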

2605.25546 2026-05-26 cs.RO 版本更新

Safety-Critical Whole-Body Control for Humanoid Robots via Input-to-State Safe Control Barrier Functions

基于输入到状态安全控制屏障函数的人形机器人安全关键全身控制

Kwanwoo Lee, Sanghyuk Park, Gyeongjae Park, Myeong-Ju Kim, Jaeheung Park

发表机构 * Department of Intelligence and Information, Seoul National University(首尔国立大学智能信息系) Robotics Lab, Hyundai Motor Group(现代汽车集团机器人实验室) Advanced Institute of Convergence Technology(融合技术高级研究院)

AI总结 提出一种基于输入到状态安全控制屏障函数(ISSf-CBF)的分层安全关键全身控制框架,通过运动学级全身控制器、ISSf-CBF安全滤波器和动力学级全身控制器,在存在未知扰动时保证人形机器人的运动学安全约束。

Comments 14 pages, 6 figures

详情
AI中文摘要

安全关键控制对于在复杂人类中心环境中运行的人形机器人至关重要,这些环境中的物理安全约束(如关节限位、自碰撞避免、障碍物避免和工作空间边界)必须在实际机器人操作中得到满足。然而,现有方法仍然有限,因为在存在未知扰动(如模型不确定性、轨迹跟踪误差和外部扰动)时,运动学安全保证可能会降低。本文提出了一种基于输入到状态安全控制屏障函数(ISSf-CBF)的人形机器人分层安全关键全身控制框架。所提出的架构集成了运动学级全身控制器(KinWBC)、ISSf-CBF安全滤波器和动力学级全身控制器(DynWBC)。KinWBC根据优先级任务生成标称关节运动参考;ISSf-CBF滤波器最小程度地修改这些参考,以在有界扰动下满足运动学安全约束;DynWBC跟踪滤波后的参考,同时确保全身动力学可行性和接触稳定性。安全约束施加于全身运动学模型,并保守地调整ISSf-CBF参数,使得所得的运动学安全保证能够在未知扰动下传递到全阶人形机器人动力学。仿真和实际机器人实验表明,所提出的框架在模型失配下提高了安全裕度,并在行走、遥操作和带手部控制的单腿平衡过程中实时可靠地强制执行多个安全约束。项目网站:https://kwlee365.github.io/SafeWBC-Website/

英文摘要

Safety-critical control is essential for humanoid robots operating in complex human-centered environments, where physical safety constraints such as joint limits, self-collision avoidance, obstacle avoidance, and workspace boundaries must be satisfied during real-robot operation. However, existing approaches remain limited because kinematic safety guarantees can be degraded in the presence of unknown disturbances, such as model uncertainties, trajectory-tracking errors, and external perturbations. This paper presents a hierarchical safety-critical whole-body control framework for humanoid robots based on input-to-state safe control barrier functions (ISSf-CBFs). The proposed architecture integrates a kinematic-level whole-body controller (KinWBC), an ISSf-CBF safety filter, and a dynamic-level whole-body controller (DynWBC). KinWBC generates nominal joint-motion references from prioritized tasks; the ISSf-CBF filter minimally modifies these references to satisfy kinematic safety constraints under bounded disturbances; and DynWBC tracks the filtered references while enforcing full-body dynamic feasibility and contact stability. Safety constraints are imposed on a whole-body kinematic model, and the ISSf-CBF parameters are conservatively tuned so that the resulting kinematic safety guarantees can be transferred to full-order humanoid dynamics under unknown disturbances. Simulation and real-robot experiments demonstrate that the proposed framework improves safety margins under model mismatch and reliably enforces multiple safety constraints in real time during locomotion, teleoperation, and single-leg balancing with hand control. Project website: https://kwlee365.github.io/SafeWBC-Website/
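To make the "minimally modifies these references" step concrete, here is a minimal sketch of a single-constraint safety filter on a kinematic model q̇ = u. It uses one commonly cited ISSf-CBF condition, ∇h·u ≥ −αh + ‖∇h‖²/ε. The paper's exact formulation, its handling of multiple constraints, and the parameter values `alpha` and `eps` are assumptions here.

```python
def safety_filter(u_nom, grad_h, h, alpha=5.0, eps=10.0):
    """Closed-form solution of: min |u - u_nom|^2
    subject to grad_h . u >= -alpha * h + |grad_h|^2 / eps.

    The |grad_h|^2 / eps term is the robustness margin that distinguishes
    this condition from a plain CBF condition.
    """
    g2 = sum(g * g for g in grad_h)
    if g2 == 0.0:
        return list(u_nom)  # the constraint does not depend on u here
    rhs = -alpha * h + g2 / eps
    slack = sum(g * u for g, u in zip(grad_h, u_nom)) - rhs
    if slack >= 0.0:
        return list(u_nom)  # nominal reference is already safe
    scale = -slack / g2
    # Project onto the constraint boundary along grad_h.
    return [u + scale * g for u, g in zip(u_nom, grad_h)]
```

For an upper joint limit, h(q) = q_max − q and ∇h = −1. With q_max = 1.0, q = 0.95, and a nominal velocity of 2.0, the filter slows the joint down instead of stopping it outright.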

2605.25537 2026-05-26 cs.RO 版本更新

Action-Prior Denoising for Smooth Real-Time Chunking

基于动作先验去噪的平滑实时分块

Dongyang Liu, Zhaowen Zheng, Yu Sun, Longxu Zhang, Yixuan Liu, Hao Wan

发表机构 * ROKAE (Shandong) Robot Group Co., Ltd.(ROKAE(山东)机器人集团有限公司) School of Mathematical Sciences, University of Chinese Academy of Sciences(中国科学院大学数学科学学院) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出Soft RTC方法,通过动作先验去噪训练时模拟执行延迟,在保持近朴素运行时间的同时,降低高延迟动作变化并提升平滑性。

Comments 7 pages, 5 figures, 1 table

详情
AI中文摘要

实时分块(RTC)通过将新生成的动作块条件于前一块已提交的动作,使得分块动作策略能够在推理延迟下运行。训练时RTC在学习过程中模拟这种延迟,避免了部署时昂贵的指导,但其二元前缀掩码将所有非前缀令牌视为完全无约束。这低估了异步执行:早期重叠动作是固定的,而后期重叠动作虽然可编辑,但仍应保持接近先前的计划。我们提出Soft RTC,一种基于动作先验去噪的训练时RTC泛化方法。Soft RTC从部分去噪状态而非纯噪声中构建损坏的重叠令牌,并通过轻量级的令牌级混合规则在推理时将对齐的前一块作为相同先验注入。在12个已发布的大型Kinetix关卡上,短软窗口在整体解决率上几乎与硬训练时RTC相当(0.809 vs. 0.815),而中等窗口相对于硬RTC将高延迟动作变化和急动度分别降低了9.1%和9.6%。与推理时RTC基线不同,两种变体都保持近朴素运行时间。一项小型初步真实机器人分拣研究提供了额外证据,表明训练时RTC可以提高完成率,并且Soft RTC在测试策略中给出了最低的命令动作有限差分指标。

英文摘要

Real-time chunking (RTC) lets chunked action policies operate under inference delay by conditioning a newly generated action chunk on actions already committed by the previous chunk. Training-time RTC simulates this delay during learning and avoids expensive guidance at deployment, but its binary prefix mask treats all non-prefix tokens as fully unconstrained. This under-models asynchronous execution: early overlap actions are fixed, while later overlap actions remain editable but should still stay close to the previous plan. We propose Soft RTC, a training-time RTC generalization based on action-prior denoising. Soft RTC constructs corrupted overlap tokens from partially denoised states instead of pure noise and injects the aligned previous chunk as the same prior during inference through a lightweight token-wise blending rule. On the 12 released large Kinetix levels, a short soft window nearly matches hard training-time RTC in overall solve rate (0.809 vs. 0.815), while a medium window reduces high-delay action delta and jerk by 9.1% and 9.6% relative to hard RTC. Both variants keep near-naive runtime, unlike inference-time RTC baselines. A small preliminary real-robot sorting study provides additional evidence that training-time RTC can improve completion and that Soft RTC gives the lowest commanded-action finite-difference metrics among the tested policies.

2605.25495 2026-05-26 cs.RO cs.CV 版本更新

RepSAM: Bridging Foundation Models to Robotic Vision via Representation-Guided Adaptation

RepSAM: 通过表示引导的适应连接基础模型与机器人视觉

Wenhui Chu

发表机构 * Department of Computer Science and Engineering, Texas A&M University(德克萨斯农工大学计算机科学与工程系)

AI总结 针对基础模型在非结构化机器人视觉场景中性能下降的问题,提出RepSAM框架,通过CKA引导的秩分配策略和多模态融合模块实现参数高效微调,在减少158倍可训练参数的同时达到全微调97.9%的性能。

Comments Accepted to IJCAI-ECAI 2026 (Special Track on AI and Robotics). 8 pages, 4 figures, 12 tables

详情
AI中文摘要

尽管SAM等基础模型具有零样本能力,但在非结构化环境中的机器人感知仍然具有挑战性。本文将性能下降归因于Transformer层间非均匀的表示偏移:浅层表现出显著的领域差距(CKA < 0.5),而深层则有效迁移(CKA > 0.7)。基于这一观察,我们提出RepSAM,一种表示引导的参数高效微调(PEFT)框架,用于将基础模型适应到机器人视觉。RepSAM采用理论基础的CKA引导秩分配策略,结合多模态融合模块,以稳健处理具有挑战性的机器人场景,包括透明物体和杂乱场景。在六个基准和机器人操作任务上的实验评估表明,RepSAM达到了全微调性能的97.9%(89.0% vs. 90.9% mIoU),同时将可训练参数减少了158倍(从632M降至4.0M)。RepSAM在单个A100 GPU上仅需4小时训练(比全微调减少96倍,全微调需要384 GPU小时),即可比DoRA提高7.9%的mIoU。这些改进具有统计显著性(p < 0.01),并在机器人操作成功率上比LoRA(RGB)基线绝对提高了12.0%。

英文摘要

Robotic perception in unstructured environments remains challenging despite the zero-shot capabilities of foundation models such as SAM. This work attributes performance degradation to non-uniform representation shifts across transformer layers: shallow layers exhibit substantial domain gaps (CKA < 0.5), whereas deep layers transfer effectively (CKA > 0.7). Based on this observation, we propose RepSAM, a representation-guided parameter-efficient fine-tuning (PEFT) framework for adapting foundation models to robotic vision. RepSAM employs a theoretically grounded CKA-guided rank allocation strategy combined with a multi-modal fusion module for robust handling of challenging robotic scenarios, including transparent objects and cluttered scenes. Experimental evaluation across six benchmarks and robotic manipulation tasks demonstrates that RepSAM achieves 97.9% of full fine-tuning performance (89.0% vs. 90.9% mIoU) while reducing trainable parameters by 158x (from 632M to 4.0M). RepSAM outperforms DoRA by 7.9% mIoU with just 4 hours of training on a single A100 GPU (a 96x reduction from full fine-tuning, which takes 384 GPU-hours). These improvements are statistically significant (p < 0.01) and translate to a 12.0% absolute improvement in robotic manipulation success rates over the LoRA (RGB) baseline.
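The CKA scores quoted above can be computed with the standard linear CKA formula, sketched below. The `allocate_rank` rule that follows it (more adapter rank for layers with lower CKA) is only an illustrative guess at what "CKA-guided rank allocation" could look like. The abstract does not give the actual rule, and `r_min` and `r_max` are hypothetical parameters.

```python
import numpy as np


def linear_cka(x, y):
    """Linear centered kernel alignment between two feature matrices.

    x and y have shape (num_samples, num_features); the feature
    dimensions may differ between the two.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def allocate_rank(cka, r_min=2, r_max=16):
    """Hypothetical rule: a lower CKA (larger domain gap) gets a higher rank."""
    cka = min(max(cka, 0.0), 1.0)
    return int(round(r_min + (1.0 - cka) * (r_max - r_min)))
```

Under this toy rule, a shallow layer with CKA near 0.4 would receive a noticeably larger rank than a deep layer with CKA near 0.8, in line with the layer-wise observation in the abstract.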

2605.25477 2026-05-26 cs.RO cs.AI 版本更新

EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

EXPO-FT:面向视觉-语言-动作模型的样本高效强化学习微调

Perry Dong, Kuo-Han Hung, Tian Gao, Dorsa Sadigh, Chelsea Finn

发表机构 * Stanford University(斯坦福大学)

AI总结 提出EXPO-FT系统,通过样本高效的强化学习微调预训练的VLA策略,在多种高精度操作任务中实现完美性能(30/30成功率),平均仅需19.1分钟在线机器人数据。

详情
AI中文摘要

高效且可靠地学习新任务的能力一直是机器人学的基础挑战。视觉-语言-动作(VLA)模型在多种操作任务中展现出强大的泛化能力,但预训练策略始终无法达到实际部署所需的可靠性。强化学习(RL)微调为弥合这一差距提供了有前景的路径,但现有方法要么从头开始训练而未充分利用预训练先验,要么微调VLA而未达到实际部署所需的样本效率和成功率。我们提出了EXPO-FT,一个用于对预训练VLA策略进行稳定、样本高效的RL微调的系统,填补了这一空白。我们的系统解决了一系列具有挑战性的操作任务,包括串灯并插入插头点亮、将台球击入袋中、将花插入酒瓶,每个任务都需要高精度、动态动作以及对不同初始状态的鲁棒性。我们的系统在所有评估任务中均实现了完美的任务性能(30/30成功),平均仅需19.1分钟的在线机器人数据,优于先前的从头RL训练和VLA微调方法。我们发布了一个开源代码库,旨在促进机器人领域中VLA模型RL微调的更广泛采用。

英文摘要

The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability required for real-world deployment. Reinforcement learning (RL) fine-tuning offers a promising path to bridge this gap, but existing approaches either train from scratch without fully leveraging pretrained priors, or fine-tune VLAs without achieving the sample efficiency and success rates that practical deployment demands. We present EXPO-FT, a system for stable, sample-efficient RL finetuning of pretrained VLA policies that closes this gap. Our system solves a suite of challenging manipulation tasks, including routing string lights and inserting the plug to light it up, striking a pool ball into a pocket, and inserting a flower into a wine bottle, each requiring combinations of high precision, dynamic actions, and robustness to varied initial states. Our system achieves perfect task performance (30/30 successes) across all evaluated tasks within an average of 19.1 minutes of online robot data, outperforming both prior RL-from-scratch and VLA finetuning approaches. We release an open-source codebase with the aim of facilitating broader adoption of RL finetuning of VLA models in robotics.

2605.25423 2026-05-26 cs.RO 版本更新

OPAL: Omnidirectional Path-efficient Aerial 3D expLoration

OPAL: 全方位路径高效空中三维探索

Yoga Satwik Chappidi, Avideh Zakhor

发表机构 * Department of Electrical Engineering and Computer Sciences, University of California, Berkeley(加州大学伯克利分校电子工程与计算机科学系)

AI总结 提出OPAL框架,通过在歧义分支点进行360度偏航旋转替代计算密集的全局路径规划,实现计算简单、路径短且覆盖率高的自主探索。

Comments Submitted to IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

自主探索对于机器人绘制未知环境地图至关重要。探索算法的理想特性包括计算效率高和探索过程中行进距离小。受此启发,我们提出了全方位路径高效空中三维探索(OPAL),这是一个探索框架,其核心是在歧义分支点进行有意的360度偏航旋转,而不是进行计算密集的全局路径规划。我们设计了OPAL的多个变体,以确定在偏航旋转完成后如何选择前沿。其中一个变体是无模型的,而其他变体则使用大语言模型(LLM)或视觉语言模型(VLM)。我们通过改变邻近搜索半径以将前沿纳入选择过程,来表征这些变体的性能。通过仿真,我们发现尽管与计算更复杂的基线(如EDEN和FALCON)相比,耗时的原地偏航旋转增加了总探索时间,但OPAL计算更简单,实现了更短的行进距离和更高的覆盖率-距离曲线下面积。我们还表明,调整前沿选择搜索半径可以在行进距离和总探索时间之间进行权衡。我们在两个室内环境中使用Modal AI无人机将OPAL与FALCON进行比较,验证了我们的结果,发现OPAL的一个变体的行进距离比FALCON低25%。

英文摘要

Autonomous exploration is critical for robots mapping unknown environments. Desirable characteristics of exploration algorithms include compute efficiency and small traversed distance during the exploration process. Motivated by these, we present Omnidirectional Path-efficient Aerial 3D expLoration (OPAL), an exploration framework centered on deliberate 360-degree yaw rotation at ambiguous branch points rather than compute-heavy global tour planning. We devise multiple variants of OPAL to determine the frontier-selection strategy once the yaw pan is completed. One variant is model-free, while others use large language models (LLMs) or vision-language models (VLMs). We characterize the performance of these variants while varying the vicinity search radius to include frontiers in the selection process. Through simulations we find that although the time-consuming in-place yaw rotation increases total exploration time relative to more computationally complex baselines such as EDEN and FALCON, OPAL is computationally simpler and achieves shorter travel distances and higher coverage-versus-distance area under the curve. We also show that adjusting the frontier-selection search radius enables a tradeoff between travel distance and total exploration time. We verify our results on a Modal AI drone in two indoor environments by comparing OPAL against FALCON, and find that the distance traveled by one OPAL variant is as much as 25% lower than that of FALCON.
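The coverage-versus-distance area under the curve mentioned above can be illustrated with a simple trapezoidal integration. This is a sketch only: the abstract does not say how the authors sample or normalize the curve, so the function name and the unnormalized output are assumptions.

```python
def coverage_distance_auc(distances, coverages):
    """Trapezoidal area under a coverage-vs-distance curve.

    distances: cumulative traveled distance, non-decreasing.
    coverages: explored fraction of the environment at each distance.
    """
    if len(distances) != len(coverages) or len(distances) < 2:
        raise ValueError("need two equally long sequences with >= 2 samples")
    area = 0.0
    for i in range(1, len(distances)):
        step = distances[i] - distances[i - 1]
        if step < 0:
            raise ValueError("distances must be non-decreasing")
        area += 0.5 * step * (coverages[i] + coverages[i - 1])
    return area
```

Over the same total distance, a run that reaches a given coverage earlier scores a larger area, so the metric rewards path-efficient exploration rather than just final coverage.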

2605.25414 2026-05-26 cs.RO 版本更新

How to Mitigate the Distribution Shift Problem in Robotics Control: A Robust and Adaptive Approach Based on Offline to Online Imitation Learning

如何缓解机器人控制中的分布偏移问题:一种基于离线到在线模仿学习的鲁棒自适应方法

Hyung-Suk Yoon, Seung-Woo Seo

发表机构 * Department of Electronic and Computer Engineering, Seoul National University, Seoul, South Korea(电子与计算机工程系,首尔国立大学,首尔,韩国)

AI总结 提出一种鲁棒离线到自适应在线模仿学习框架,通过离线阶段利用判别器扩展状态-动作覆盖和在线阶段自监督模仿学习,缓解分布偏移问题。

Comments 8 pages, 2 figures

详情
AI中文摘要

模仿学习中的分布偏移是指智能体无法为训练期间未访问的状态规划适当动作的问题。该问题很大程度上归因于专家演示在整个环境中提供的固有狭窄状态-动作覆盖。在本文中,我们提出了一种鲁棒离线到自适应在线模仿学习框架,以终身、多阶段方案处理分布偏移问题。在离线学习阶段,我们利用补充演示通过判别器有效训练策略,从而拓宽策略的状态-动作覆盖,增强策略对分布偏移的鲁棒性。在后续的在线推理阶段,我们的框架检测分布偏移的发生,并从在线经验中进行自监督模仿学习,使策略适应在线环境。通过在MuJoCo环境中的广泛评估,我们证明我们的方法在分布偏移的鲁棒性和对在线环境的适应性能方面优于基线算法,这表明我们的框架在对抗分布偏移方面具有优越性能。

英文摘要

Distribution shift in imitation learning refers to the problem that the agent cannot plan proper actions for a state that has not been visited during the training. This problem can be largely attributed to the inherently narrow state-action coverage provided by expert demonstrations over the full environment. In this paper, we propose a robust offline to adaptive online imitation learning framework that handles the distribution shift problem in a lifelong, multi-phase scheme. In the offline learning phase, we leverage supplementary demonstrations to broaden the state-action coverage of the policy by utilizing a discriminator to effectively train the policy with supplementary demonstrations, thereby enhancing the robustness of the policy to distribution shift. In the subsequent online inference phase, our framework detects the occurrence of distribution shift and conducts self-supervised imitation learning from online experiences to adapt the policy to the online environments. Through extensive evaluations in MuJoCo environments, we demonstrate that our method exhibits better robustness to distribution shift and better adaptation performance to online environments than the baseline algorithms, which indicates superior performance of our framework against the distribution shift.

2605.25401 2026-05-26 cs.RO 版本更新

Path Following Control System of Line-of-Sight Guidance for Robotic Dolphin with Multi-Link Mechanism in Underwater Simulator

水下模拟器中多连杆机构仿生海豚的视线导引路径跟踪控制系统

Takumi Asada, Takao Oki, Hideo Furuhashi, Kenta Tabata, Renato Miyagusuku, Koichi Ozaki

发表机构 * Utsunomiya University(宇都宫大学) Aichi Institute of Technology(爱知工业大学)

AI总结 针对多连杆仿生自主水下航行器(BAUV),提出了一种基于视线导引的路径跟踪控制系统,并在水下模拟器中进行了参数确定和控制方法评估。

详情
Journal ref
2026 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2026. p. 844-849
AI中文摘要

具有多连杆机构的仿生自主水下航行器(BAUV)因其低功耗和高机动性被广泛用于水生生物观测和环境调查。环境调查需要能够自动跟踪特定点的路径跟踪系统。然而,BAUV的路径跟踪系统有限,且其与多连杆机构机器人的评估尚未明确。由于BAUV的模型因仿生类型而异,其路径跟踪系统需要预先进行仿真。在本研究中,我们提出了一种适用于多连杆机构BAUV的路径跟踪系统,并在水下模拟中进行了评估。结果表明,可以设计出适合BAUV的路径跟踪系统,使用模拟器确定参数,并评估控制方法。

英文摘要

Biomimetic autonomous underwater vehicles (BAUVs) with multi-link mechanisms are widely used in aquatic life observation and environmental surveys because of their low power consumption and high maneuverability. An environmental survey requires a path following system that automatically follows specified points. However, path following systems for BAUVs are limited, and their evaluation on robots with multi-link mechanisms has not yet been clarified. A path following system for a BAUV requires prior simulation because the model differs depending on the type of biomimetics. In this study, we propose a path following system for BAUVs with a multi-link mechanism and evaluate it in an underwater simulation. The results show that it was possible to design a path following system suitable for BAUVs, determine its parameters using a simulator, and evaluate control methods.
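For readers unfamiliar with line-of-sight (LOS) guidance, the textbook lookahead-based law for a straight path segment is sketched below. It is the generic form, not necessarily the variant used for the robotic dolphin, and the function and parameter names are illustrative.

```python
import math


def los_heading(pos, wp_prev, wp_next, lookahead):
    """Lookahead-based LOS guidance for a straight segment.

    Returns (desired_heading, cross_track_error). The heading is not
    wrapped to [-pi, pi]; wrap it before feeding a heading controller.
    """
    x, y = pos
    # Path-tangential angle of the segment wp_prev -> wp_next.
    alpha = math.atan2(wp_next[1] - wp_prev[1], wp_next[0] - wp_prev[0])
    # Signed lateral offset of the vehicle from the path.
    e = -(x - wp_prev[0]) * math.sin(alpha) + (y - wp_prev[1]) * math.cos(alpha)
    # Steer toward a point `lookahead` meters ahead on the path.
    desired = alpha + math.atan2(-e, lookahead)
    return desired, e
```

The lookahead distance is the main tuning parameter: a small value converges to the path aggressively and can oscillate, while a large value gives smoother but slower convergence. Parameters of this kind are what a simulator lets one determine before real deployment.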

2605.25393 2026-05-26 cs.RO 版本更新

Decision-Making with Lightweight Confidence-Aware Language Model for Autonomous Driving

基于轻量级置信感知语言模型的自动驾驶决策

Ruoyu Yao, Ruiguo Zhong, Pei Liu, Mingxing Peng, Rui Yang, Jun Ma

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出一种利用轻量级置信感知语言模型的决策框架,通过多智能体协作生成置信注释的决策演示并蒸馏到双头轻量模型,在nuPlan上实现SOTA成功率和低延迟。

Comments 8 Pages, 3 figures, ITSC 2026

详情
AI中文摘要

大型语言模型和多模态大语言模型在自动驾驶中展现出巨大潜力,提供类人推理和开放世界泛化能力。然而,这些庞大模型过高的计算开销和推理延迟严重阻碍了它们在资源受限的自动驾驶系统中的部署。为解决这一挑战,我们提出了一种新颖的决策框架,利用轻量级置信感知语言模型,弥合了复杂多模态意图推理与高效推理之间的差距。具体而言,我们设计了一个多智能体协作工作流,包括动作投票、置信评估和总结智能体,通过显式的思维链推理生成高质量、带置信注释的决策演示。然后,这些演示被蒸馏到一个具有双头架构的轻量级语言模型中,实现决策概率的联合预测和文本理由的生成。蒸馏通过置信感知微调策略结合检索增强生成来实现,以增强模型的适应性和数据效率。在nuPlan基准上的全面闭环实验表明,我们的方法在常规和长尾场景下均实现了最先进的成功率,同时保持了低推理延迟。

英文摘要

Large Language Models (LLMs) and Multimodal LLMs (MLLMs) have demonstrated immense potential in autonomous driving (AD) by offering human-like reasoning and open-world generalization. However, the excessive computational overhead and high inference latency of these massive models severely hinder their deployment in resource-constrained AD systems. To address this challenge, we propose a novel decision-making framework utilizing a lightweight confidence-aware language model, which bridges the gap between complex multimodal intention reasoning and efficient inference. Specifically, we design a multi-agent collaborative workflow, comprising action voting, confidence assessment, and summarization agents, to generate high-quality, confidence-annotated decision demonstrations via explicit Chain-of-Thought (CoT) reasoning. These demonstrations are then distilled into a lightweight language model featuring a dual-head architecture, enabling the joint prediction of decision probabilities and the generation of textual rationales. The distillation is realized via a confidence-aware fine-tuning strategy coupled with Retrieval Augmented Generation (RAG) to enhance the model's adaptability and data efficiency. Comprehensive closed-loop experiments on the nuPlan benchmark demonstrate that our approach achieves state-of-the-art (SOTA) success rates in both regular and long-tail scenarios while maintaining low inference latency.

2605.25362 2026-05-26 cs.RO 版本更新

Prior Policy Guided Dual-Agent Coordinated Manipulation Planning of Spacecraft-Manipulator System

先验策略引导的航天器-机械臂系统双智能体协同操控规划

Yuhui Hu, Dong Zhou, Kaihong Ouyang, Zhongliang Yu, Jianfeng Lv, Xiangyu Shao

发表机构 * School of Astronautics(航天学院) School of Automation(自动化学院) School of Information Science and Engineering(信息科学与工程学院)

AI总结 针对空间机械臂与基座强耦合导致的姿态稳定问题,提出先验策略引导的双智能体协同操控规划框架,通过时间步级专家切换机制提升深度强化学习效率,实现末端执行器高精度到达与基座姿态稳定。

Comments 36 pages, 13 figures, 6 tables. Under review

详情
AI中文摘要

机械臂与基座之间的强动态耦合对维持航天器姿态稳定性构成了重大挑战,可能危及任务安全。本文提出了一种双智能体协同操控规划(DACMP)框架,该框架同时实现了六自由度空间机械臂末端执行器的高精度位姿到达和基座航天器的姿态稳定。为了提高学习效率,我们提出了一种结合时间步级专家切换引导(TESG)机制的先验策略引导深度强化学习算法,从而促进全局收敛并提高任务成功率。大量实验表明,DACMP在任务成功率和控制精度方面显著优于基线深度强化学习算法。此外,在包括系统约束、环境干扰和感知不确定性在内的各种挑战性场景下,验证了DACMP的鲁棒性。代码和仿真配置可在GitHub上获取:https://github.com/HIT-YuhuiHu/DACMP。

英文摘要

The strong dynamic coupling between the manipulator and the base poses a significant challenge to maintaining spacecraft attitude stability, potentially compromising mission safety. In this paper, we propose a Dual-Agent Coordinated Manipulation Planning (DACMP) framework that simultaneously achieves high-precision end-effector pose reaching for a 6-DoF space manipulator and attitude stabilization of the base spacecraft. To enhance learning efficiency, we present a prior policy-guided Deep Reinforcement Learning algorithm incorporating the Timestep-level Expert Switching Guidance (TESG) mechanism, thereby promoting global convergence and improving task success rates. Extensive experiments demonstrate that DACMP significantly outperforms baseline DRL algorithms in terms of task success rate and control precision. Furthermore, the robustness of DACMP is validated under various challenging scenarios, including system constraints, environmental disturbances, and perception uncertainties. The code and simulation configurations are available on GitHub: https://github.com/HIT-YuhuiHu/DACMP.

2605.25346 2026-05-26 cs.RO cs.AI cs.LG cs.SY eess.SY math.OC 版本更新

Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers

用于学习和规划的并行可微可达性:带认证的神经动力学与控制器

Keyi Shen, Glen Chou

发表机构 * MIT(麻省理工学院)

AI总结 提出一种基于JAX的并行可微可达性框架,结合泰勒模型流形构建与CROWN线性界传播,支持GPU批处理和自动微分,并用于认证训练和可达性感知的MPC,在非抓取操作和四旋翼任务中实现在线规划与有界不确定性下的认证可达集过近似。

Comments Robotics: Science and Systems XXII (RSS 2026)

详情
AI中文摘要

神经网络动力学模型和控制策略在机器人领域取得了强大性能,但在不确定性下提供可靠保证仍然困难,尤其是对于闭环神经网络系统。现有的可达性工具提供了形式化的过近似,但通常不可微、过于保守或对于现代学习和在线规划流程来说太慢。为了解决这个问题,我们提出了一个在JAX中可并行化、可微的可达性框架,适用于连续和离散时间系统,具有解析和基于神经网络的动力学和控制器。我们的框架通过统一表示结合了泰勒模型流形构建和CROWN风格的线性界传播,该表示在支持GPU批处理计算和自动微分的同时保留了仿射依赖。基于这个可达性基元,我们开发了(i)一种认证训练方法,鼓励生成对可达性友好的动力学模型和控制器,以及(ii)一种具有基于梯度细化的可达性感知采样MPC方案。在非抓取操作和四旋翼任务上的实验,包括硬件和更高维度的评估(高达72维),展示了在实际在线规划中保持有界不确定性下认证可达集过近似的可行性。

英文摘要

Neural network (NN) dynamics models and control policies achieve strong performance in robotics, but providing sound guarantees under uncertainty remains difficult, especially for closed-loop NN systems. Existing reachability tools provide formal over-approximations, yet are often non-differentiable, overly conservative, or too slow for modern learning and online planning pipelines. To address this, we present a parallelizable, differentiable reachability framework in JAX for continuous- and discrete-time systems with analytical and NN-based dynamics and controllers. Our framework combines Taylor-model flowpipe construction with CROWN-style linear bound propagation through a unified representation that preserves affine dependencies while supporting GPU-batched computation and automatic differentiation. Building on this reachability primitive, we develop (i) a certified training method that encourages reachability-friendly dynamics models and controllers, and (ii) a reachability-aware sampling-based MPC scheme with gradient-based refinement. Experiments on non-prehensile manipulation and quadrotor tasks, including hardware and higher-dimensional evaluations (up to 72D), demonstrate practical online planning while maintaining certified reachable-set over-approximations under bounded uncertainty.
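As background for the bound-propagation idea, the sketch below pushes an axis-aligned box through one affine layer and a ReLU using plain interval arithmetic. This is a deliberately simpler technique than the Taylor-model and CROWN-style linear bounds the paper combines, and it is written in plain Python rather than JAX. It shows only what a sound over-approximation means.

```python
def affine_bounds(lower, upper, weight, bias):
    """Sound interval bounds of W x + b for x in the box [lower, upper]."""
    out_lo, out_hi = [], []
    for row, b in zip(weight, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                # A negative weight flips which end of the interval matters.
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def relu_bounds(lower, upper):
    """ReLU is monotone, so bounds map through elementwise."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]
```

Interval bounds discard dependencies between neurons, so they grow loose quickly over many layers or time steps. That is the motivation for representations that preserve affine dependencies, as described in the abstract.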

2605.25313 2026-05-26 cs.LG cs.AI cs.RO stat.ML 版本更新

UWM-JEPA: Predictive World Models That Imagine in Belief Space

UWM-JEPA:在信念空间中进行想象的世界预测模型

Santosh Kumar Radha, Oktay Goktas

发表机构 * AgentField AI

AI总结 针对部分可观测环境,提出UWM-JEPA模型,通过密度矩阵潜变量和酉预测器在信念空间中保持联合状态谱,实现长时域盲推演下的不确定性保持,显著优于向量潜变量基线。

Comments 14 pages, 6 figures, 7 tables. Code and data: https://github.com/santoshkumarradha/uwm-jepa

详情
AI中文摘要

部分可观测环境下的世界模型必须想象多个兼容的隐藏未来,并在反事实动作下引导它们。联合嵌入预测架构(JEPAs)在潜在空间中实现这一点,但向量值潜变量没有内部结构来在盲推演过程中承载对隐藏后续演化的信念。我们引入了酉世界模型JEPA(UWM-JEPA),这是一种JEPA世界模型,具有在联合系统-环境空间上的密度矩阵潜变量和学习的酉预测器。该结构在推演过程中精确保持联合状态谱,因此预测器本身不会耗散表示的不确定性。在一个需要根据给定动作序列进行五步前向模拟且目标观测被掩蔽的隐藏速度指示任务中,UWM-JEPA达到0.77的准确率,并且随着动作被扰动而单调下降;而参数匹配的LSTM-JEPA在相同的反事实目标和动作头训练下,在所有动作条件下都崩溃为多数类准确率(0.53)。在盲推演下,UWM-JEPA在短时域上损失不到十个点的探针R^2,而向量潜变量基线损失四十一个和六十八个点;两者在保留的上下文探针上表现相当,表明差异在于预测器而非编码器。动作敏感性本身需要针对反事实而非教师强制目标进行训练,这一发现适用于酉参数化之外。对于JEPA世界模型在部分可观测性下进行想象,潜变量几何和预测器动力学至关重要,而不仅仅是冻结的上下文编码能力。

英文摘要

World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector-valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM-JEPA), a JEPA world model with a density-matrix latent on a joint system-environment space and a learned unitary predictor. The construction preserves the joint-state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden-velocity indicator task requiring five-step forward simulation under a given action sequence with the target observation masked, UWM-JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter-matched LSTM-JEPA trained under the same counterfactual-target objective and action head collapses to majority-class accuracy (0.53) under every action condition. Under blind rollout, UWM-JEPA loses fewer than ten points of probe R^2 at short horizons while vector-latent baselines lose forty-one and sixty-eight; both nevertheless tie on a held-out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher-forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context-encoding capacity alone.
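The claim that a unitary predictor cannot dissipate the represented uncertainty follows from the fact that ρ' = UρU† has the same eigenvalues as ρ. The short numerical check below illustrates this with a random density matrix and a random unitary, which stand in for the learned latent and predictor (both are assumptions, not the paper's model).

```python
import numpy as np


def random_density_matrix(dim, rng):
    """Random positive semidefinite Hermitian matrix with unit trace."""
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real


def random_unitary(dim, rng):
    """Unitary matrix from the QR decomposition of a random complex matrix."""
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(a)
    return q


def rollout(rho, unitary, steps):
    """Blind rollout: apply rho -> U rho U^dagger repeatedly."""
    for _ in range(steps):
        rho = unitary @ rho @ unitary.conj().T
    return rho


def von_neumann_entropy(rho):
    """Entropy of the spectrum, a scalar summary of the belief's uncertainty."""
    eig = np.linalg.eigvalsh(rho)
    eig = eig[eig > 1e-12]
    return float(-(eig * np.log(eig)).sum())
```

After any number of rollout steps, the eigenvalues and therefore the entropy match those of the initial state up to floating-point error. A generic learned vector predictor offers no such guarantee.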

2605.25293 2026-05-26 cs.CV cs.AI cs.RO 版本更新

Neuromorphic LiDAR-based Bird's Eye View Object Detection using Energy-efficient Spiking Neural Networks

基于神经形态激光雷达的鸟瞰图目标检测:使用节能脉冲神经网络

Sambit Mohapatra, Senthil Yogamani, Heinrich Gotzig, Patrick Mader

发表机构 * Valeo, Germany(德国法雷奥公司) Valeo, Ireland(爱尔兰法雷奥公司) TU Ilmenau, Germany(德国伊尔默瑙工业大学)

AI总结 提出一种端到端脉冲编码器-解码器网络,用于激光雷达点云鸟瞰图表示中的目标检测,通过代理梯度反向传播训练,在KITTI基准上达到高精度,并实现3.33倍突触操作能耗降低。

详情
AI中文摘要

自动驾驶感知需要在严格的功耗约束下对三维传感器数据进行准确高效的处理。传统卷积神经网络实现了强大的检测精度,但计算密集,限制了其在资源受限的神经形态平台上的部署。脉冲神经网络通过事件驱动的稀疏计算提供了一种引人注目的替代方案,但其在复杂真实世界感知任务(如三维目标检测)中的应用仍然有限。在这项工作中,我们提出了一种端到端脉冲编码器-解码器网络,用于激光雷达点云鸟瞰图表示中的目标检测,并使用代理梯度反向传播进行训练。我们训练了两个变体:一个膜电位变体,在输出阶段读取连续神经元状态以获得最大精度,在$\mathrm{IoU}\!=\!0.5$(简单/中等/困难)下达到$92.05$/$87.04$/$86.51$ AP;以及一个全二进制脉冲变体,每一层仅操作脉冲序列,用于直接神经形态部署。我们评估了四种输入脉冲编码策略,并证明在KITTI基准上(该基准无法获得连续帧,因此将BEV输入在各时间步重复呈现以作为时间流的代理),允许网络直接从数据学习脉冲表示优于手工设计的泊松、延迟和z轴编码方案。分块能量分析表明,在保守的基于循环的操作下,与等效CNN相比,突触操作能量降低了$3.33\times$。这些结果共同证明了脉冲神经网络在自动驾驶中实现准确且节能的神经形态感知的可行性。

英文摘要

Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Traditional convolutional neural networks achieve strong detection accuracy but are computationally intensive, limiting their suitability for deployment on resource-constrained neuromorphic platforms. Spiking neural networks offer a compelling alternative through event-driven sparse computation, yet their application to complex real-world perception tasks such as three-dimensional object detection remains limited. In this work, we propose an end-to-end spiking encoder-decoder network for object detection in bird's eye view representations of LiDAR point clouds, trained using surrogate gradient backpropagation. We train two variants: a membrane potential variant that reads continuous neuron state at the output stage for maximum accuracy, achieving $92.05$/$87.04$/$86.51$ AP at $\mathrm{IoU}\!=\!0.5$ (Easy/Moderate/Hard), and a fully binary spiking variant that operates exclusively on spike trains at every layer for direct neuromorphic deployment. We evaluate four input spike encoding strategies and demonstrate that allowing the network to learn spike representations directly from data outperforms hand-crafted Poisson, latency, and z-axis encoding schemes on the KITTI benchmark, where sequential frames are unavailable and the BEV input is presented repeatedly across timesteps as a proxy for temporal streaming. A block-wise energy analysis demonstrates a $3.33\times$ reduction in synaptic operation energy over an equivalent CNN under conservative loop-based operation. Together, these results demonstrate the viability of spiking neural networks for accurate and energy-efficient neuromorphic perception in autonomous driving.

2605.25279 2026-05-26 cs.RO 版本更新

GreenSeg: Ground Segmentation Algorithm for Agricultural Robots in Mediterranean Greenhouses using RGB-D Point Clouds

GreenSeg: 基于RGB-D点云的地中海温室农业机器人地面分割算法

Fernando Cañadas-Aránega, José C. Moreno, José L. Blanco-Claraco

发表机构 * Department of Informatics, CIESOL, ceiA3, Universidad de Almería(信息学院,CIESOL,ceiA3,阿尔梅里亚大学)

AI总结 针对地中海温室狭窄通道、异构地形和光学干扰等挑战,提出一种基于RGB-D感知的双层验证地面分割框架GreenSeg,通过全局平面拟合、曲率滤波和种子点区域生长实现稳定导航,在AGRICOBIOT I平台上验证了其在动态光照下优于基准方法。

详情
AI中文摘要

地中海地区的温室农业因其独特的结构和环境限制面临显著的自动化挑战。这些环境的特点是极其狭窄的通道、从混凝土到耕地的异构地形,以及由聚乙烯覆盖物引起的光学干扰,导致深度传感器中出现镜面反射和“鬼点”。虽然自主导航对于农业任务的数字化至关重要,但传统解决方案通常依赖于昂贵的3D LiDAR系统,这些系统对于大多数设施来说在经济上不可扩展。为了解决这个问题,本文提出了GreenSeg,一个使用RGB-D感知的鲁棒感知框架,用于自主导航。所提出的方法引入了一种双层验证策略:一种鲁棒的全局平面拟合结合表面曲率滤波器以实现地形适应性,以及一种基于种子点的区域生长约束以确保可导航平面的空间连续性。使用AGRICOBIOT I平台在四个不同太阳高度角的日间场景下进行了实验验证。结果表明,GreenSeg始终优于基准分割方法,在走廊末端的关键旋转操作中,平均召回率提高了11.58%,mIoU提高了19.24%。这些发现证实了所提出的算法能够在受预算限制且对光照条件敏感的非结构化动态农业环境中实现稳定安全的自主导航。

英文摘要

Greenhouse agriculture in the Mediterranean region faces significant automation challenges due to its unique structural and environmental constraints. These environments are characterized by extremely narrow aisles, heterogeneous terrains ranging from concrete to tilled soil and severe optical interference caused by polyethylene covers, which induce specular reflections and "ghost points" in depth sensors. While autonomous navigation is essential for digitizing agricultural tasks, traditional solutions often rely on expensive 3D LiDAR systems that are economically unscalable for most facilities. To address this, this paper presents GreenSeg, a robust perception framework for autonomous navigation using RGB-D sensing. The proposed method introduces a dual-layer validation strategy: a robust global plane fitting combined with a surface curvature filter for terrain adaptability, and a seed-point-based Region Growing constraint to ensure the spatial continuity of the navigable plane. Experimental validation was conducted using the AGRICOBIOT I platform across four diurnal scenarios with varying solar elevations. The results show that GreenSeg consistently outperforms benchmark segmentation methods, achieving peak improvements of 11.58% in mean Recall and 19.24% in mIoU during critical rotational maneuvers at the end of corridors. These findings confirm that the proposed algorithm enables stable and safe autonomous navigation in unstructured, dynamic agricultural environments that are subject to budget constraints and sensitive to lighting conditions.
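The "robust global plane fitting" step can be pictured with a generic RANSAC plane fit over a point cloud, sketched below. The abstract does not name the estimator, so RANSAC is an assumption, as are the threshold and iteration values. The curvature filter and the seed-based region growing are not shown.

```python
import numpy as np


def fit_plane_ransac(points, threshold=0.02, iterations=200, seed=0):
    """Fit a plane n . p + d = 0 to an (N, 3) array, tolerating outliers.

    Returns ((normal, offset), inlier_mask). Units follow the input cloud.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(iterations):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # nearly collinear sample, no unique plane
        normal = normal / norm
        offset = -float(normal @ sample[0])
        # Points within `threshold` of the candidate plane count as inliers.
        inliers = np.abs(points @ normal + offset) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, offset)
    return best_plane, best_inliers
```

A fit like this is robust to isolated outliers such as reflection-induced "ghost points", but it cannot by itself guarantee that the inliers form one connected navigable surface. The abstract assigns that role to the region-growing constraint.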

2605.22894 2026-05-26 cs.GR cs.LG cs.RO 版本更新

SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control

SCRIPT: 面向语言驱动的物理仿真人体控制的可扩展扩散策略与多阶段训练

Jingyan Zhang, Han Liang, Ruichi Zhang, Bin Li, Juze Zhang, Xin Chen, Jingya Wang, Lan Xu, Jingyi Yu

发表机构 * ShanghaiTech University(上海科技大学) University of Pennsylvania(宾夕法尼亚大学) Stanford University(斯坦福大学)

AI总结 提出SCRIPT框架,通过联合动作-状态-文本扩散Transformer和多阶段训练(监督模仿预训练+混合奖励强化学习后训练),实现语言指令驱动的物理仿真人体控制,在文本对齐、运动质量和物理真实性上超越现有方法。

Comments Project page: https://zhanglele12138.github.io/SCRIPT/

详情
AI中文摘要

从自然语言指令控制物理仿真人体是迈向通用具身智能体的关键一步。然而,现有方法仍受限于语义表达能力和物理可行性之间的张力,往往难以同时实现忠实的指令跟随、高质量的运动和稳定的长时程控制。我们提出SCRIPT,一种具有多阶段训练框架的可扩展扩散策略,用于语言驱动的物理仿真人体控制。SCRIPT的核心是联合动作-状态-文本扩散Transformer(JAST-DiT),它将动作、物理状态和文本表示为专门的令牌流,并通过联合注意力将它们耦合,使语言语义和控制动态之间能够直接交互。为了稳定自回归控制,我们引入了一种非线性历史条件机制,该机制保留密集的近期上下文,并从长期历史中采样越来越稀疏的线索。除了监督模仿预训练外,我们提出了一个后训练阶段,使用混合奖励强化学习(RLHR)进一步提高性能。通过将可学习噪声注入流采样过程,RLHR利用混合物理反馈和文本奖励在闭环模拟中有效改善运动质量和指令跟随。定量评估表明,SCRIPT在文本对齐、运动质量和物理真实性指标上均优于先前的最先进方法。此外,在1200小时的MotionMillion数据集上的扩展研究显示,随着模型规模的扩大,性能持续提升,突显了SCRIPT在大规模预训练中的稳健可扩展性。我们的代码将公开供未来研究使用。

英文摘要

Controlling physics-based humanoids from natural-language instructions is a critical step toward general-purpose embodied agents. However, existing methods remain constrained by a tension between semantic expressiveness and physical feasibility, often failing to jointly achieve faithful instruction following, high-quality motion, and stable long-horizon control. We propose SCRIPT, a scalable diffusion policy with a multi-stage training framework for language-driven physics-based humanoid control. The core of SCRIPT is a Joint Action-State-Text Diffusion Transformer (JAST-DiT), which represents actions, physical states, and text as dedicated token streams and couples them through joint attention, enabling direct interaction between language semantics and control dynamics. To stabilize autoregressive control, we introduce a nonlinear history conditioning mechanism, which preserves the dense recent context and samples increasingly sparse cues from long-term history. Beyond supervised imitation pre-training, we propose a post-training stage, further improving the performance using Reinforcement Learning with Hybrid Rewards (RLHR). By injecting learnable noise into the flow-sampling process, RLHR effectively improves motion quality and instruction following within closed-loop simulations using hybrid physical feedback and text rewards. Quantitative evaluations demonstrate that SCRIPT outperforms prior state-of-the-art methods, with gains across text alignment, motion quality, and physical realism metrics. Furthermore, scaling studies on the 1200-hour MotionMillion dataset demonstrate consistent performance gains with model scaling, highlighting SCRIPT's robust scalability for large-scale pre-training. Our code will be publicly available for future research.

2605.18746 2026-05-26 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

ESI-Bench: 迈向闭环感知-动作的具身空间智能

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, Yejin Choi

发表机构 * Stanford University(斯坦福大学) UCLA(加州大学洛杉矶分校) Northwestern University(西北大学)

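AI summary: Proposes the ESI-BENCH benchmark, which evaluates embodied spatial intelligence in OmniGibson environments through active exploration (perception, locomotion, manipulation). It finds that active exploration substantially outperforms passive methods, that failures stem mainly from action blindness rather than weak perception, and that models exhibit a metacognitive gap.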

Comments https://esi-bench.github.io/

详情
AI中文摘要

空间智能通过感知-动作循环展开:智能体通过行动获取观察,并推理观察如何随动作变化。它们不是被动处理所见,而是主动揭示未见——遮挡结构、动态、包含关系和功能,这些无法仅通过被动感知解决。我们超越先前假设神谕观察的空间智能表述,将观察者重新定义为行动者。我们引入ESI-BENCH,一个基于OmniGibson、扎根于Spelke核心知识系统的全面具身空间智能基准,涵盖10个任务类别和29个子类别。智能体必须决定部署哪些能力——感知、移动和操作——以及如何排序以主动积累任务相关证据。我们对最先进的MLLM进行大量实验,发现主动探索显著优于被动对应物,智能体自发发现涌现的空间策略而无需明确指令,而随机多视角往往增加噪声而非信号,尽管消耗更多图像。大多数失败并非源于感知弱,而是动作盲视:糟糕的动作选择导致糟糕的观察,进而引发级联错误。虽然显式3D基础稳定了深度敏感任务的推理,但不完美的3D表示通过扭曲空间关系证明比2D基线更有害。人类研究进一步揭示,与寻求证伪视角并在矛盾下修正信念的人类不同,模型无论证据质量如何都过早且高置信度地承诺,暴露了一个既不能通过更好感知也不能通过更多具身互动单独闭合的元认知差距。

英文摘要

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

2604.07039 2026-05-26 cs.RO cs.AI 版本更新

AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

AEROS:一种具有具身能力模块的单智能体操作架构

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

发表机构 * School of Software, Harbin Institute of Technology(哈尔滨工业大学软件学院) School of Computer Science and Technology, Harbin Institute of Technology(哈尔滨工业大学计算机科学与技术学院) School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus(赫瑞-瓦特大学马来西亚分校数学与计算机科学学院) School of Future Science and Engineering, Soochow University(苏州大学未来科学与工程学院)

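AI summary: Proposes the AEROS architecture, which models a robot as a single persistent intelligent subject whose capabilities are extended through installable Embodied Capability Modules, enabling modular extensibility, composable capability execution, and consistent system-level safety.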

Comments Submitted to Engineering Applications of Artificial Intelligence (EAAI). 48 pages, 5 figures, 9 tables

详情
AI中文摘要

机器人系统缺乏一种原则性的抽象来统一组织智能、能力和执行。现有方法要么在单体架构中耦合技能,要么将功能分解为松散协调的模块或多个智能体,通常缺乏一致的标识和控制权限模型。我们认为,机器人应被建模为一个单一的持久智能主体,其能力通过可安装的包来扩展。我们将这一观点形式化为AEROS(智能体执行运行时操作系统),其中每个机器人对应一个持久智能体,能力通过具身能力模块(ECM)提供。每个ECM封装了可执行技能、模型和工具,而执行约束和安全保证由策略分离的运行时强制执行。这种分离实现了模块化可扩展性、可组合能力执行和一致的系统级安全。我们在PyBullet仿真中使用Franka Panda 7自由度机械臂评估了一个参考实现,进行了八项实验,涵盖重新规划、故障恢复、策略执行、基线比较、跨任务通用性、ECM热插拔、消融和故障边界分析。每个条件下超过100次随机试验,AEROS在三个任务上实现了100%的任务成功率,而基线(BehaviorTree.CPP风格和ProgPrompt风格为92-93%,扁平流水线为67-73%);策略层阻止了所有无效动作,零误接受;运行时优势跨任务泛化,无需特定任务调整;ECM在运行时加载,交换后成功率为100%。

英文摘要

Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionality into loosely coordinated modules or multiple agents, often without a coherent model of identity and control authority. We argue that a robot should be modeled as a single persistent intelligent subject whose capabilities are extended through installable packages. We formalize this view as AEROS (Agent Execution Runtime Operating System), in which each robot corresponds to one persistent agent and capabilities are provided through Embodied Capability Modules (ECMs). Each ECM encapsulates executable skills, models, and tools, while execution constraints and safety guarantees are enforced by a policy-separated runtime. This separation enables modular extensibility, composable capability execution, and consistent system-level safety. We evaluate a reference implementation in PyBullet simulation with a Franka Panda 7-DOF manipulator across eight experiments covering re-planning, failure recovery, policy enforcement, baseline comparison, cross-task generality, ECM hot-swapping, ablation, and failure boundary analysis. Over 100 randomized trials per condition, AEROS achieves 100% task success across three tasks, versus 92-93% for the BehaviorTree.CPP-style and ProgPrompt-style baselines and 67-73% for a flat pipeline; the policy layer blocks all invalid actions with zero false acceptances; runtime benefits generalize across tasks without task-specific tuning; and ECMs load at runtime with 100% post-swap success.
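The separation between capability modules and a policy-enforcing runtime can be sketched in a few lines. This is a toy illustration and not the AEROS implementation: the class names `ECM` and `Runtime`, the `speed_limit_policy` rule, and the skill signatures are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ECM:
    """Hypothetical capability module: a named bundle of executable skills."""
    name: str
    skills: Dict[str, Callable[..., str]]


@dataclass
class Runtime:
    """Hypothetical runtime: one agent, installable modules, one policy gate."""
    policy: Callable[[str, dict], bool]
    modules: Dict[str, ECM] = field(default_factory=dict)

    def install(self, ecm: ECM) -> None:
        # Installing under an existing name replaces the module (hot-swap).
        self.modules[ecm.name] = ecm

    def execute(self, module: str, skill: str, **args) -> str:
        # The policy check lives in the runtime, not inside any module.
        if not self.policy(skill, args):
            return "blocked"
        return self.modules[module].skills[skill](**args)


def speed_limit_policy(skill: str, args: dict) -> bool:
    # Assumed example constraint: reject any action faster than 1.0.
    return args.get("speed", 0.0) <= 1.0


runtime = Runtime(policy=speed_limit_policy)
runtime.install(ECM("gripper", {"grasp": lambda speed: f"grasp@{speed}"}))
ok = runtime.execute("gripper", "grasp", speed=0.5)
blocked = runtime.execute("gripper", "grasp", speed=2.0)

# Swap the module at runtime; the policy gate is unchanged.
runtime.install(ECM("gripper", {"grasp": lambda speed: f"soft-grasp@{speed}"}))
swapped = runtime.execute("gripper", "grasp", speed=0.5)
```

Because the policy is checked in `execute` rather than in each skill, a newly installed module cannot bypass it, which is the property the abstract attributes to the policy-separated design.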

2603.18454 2026-05-26 math.OC cs.RO cs.SY eess.SY 版本更新

Fundamental Limits for Sensor-Based Control via the Gibbs Variational Principle

基于吉布斯变分原理的传感器控制基本极限

Vincent Pacelli, Evangelos A. Theodorou

发表机构 * Daniel Guggenheim School of Aerospace Engineering, Georgia Institute of Technology(佐治亚理工学院丹尼尔·古根海姆航空航天工程学院)

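AI summary: Derives a lower bound on the minimum expected cost of causal feedback controllers under partial observations via the Gibbs variational principle, proposes a self-consistent refinement, and obtains a computable numerical bound from a fixed-point equation.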

Comments First revision. Added LQG numerical example. Improved exposition throughout. 6 pages, 1 figure

详情
AI中文摘要

反馈控制器性能的基本极限对于算法基准测试、指导传感器选择和认证任务可行性至关重要,然而目前缺乏通用的计算工具。现有的信息论方法通过将传感器信息与未受控系统进行比较,高估了传感器必须提供的信息,从而在反馈最有价值时产生退化的界。我们通过将吉布斯变分原理应用于状态和观测的联合路径测度,推导了部分观测下任何因果反馈控制器的最小期望成本的下界。该界适用于非线性、非完整和混合动力学,且成本无界,并允许自洽修正:任何好的控制器都会集中状态,这限制了传感器可提取的信息,从而收紧界。由此产生的固定点方程具有唯一解,可通过二分法计算,并且我们给出了自由能最小化可证明凸性的条件,从而得到可证明正确的数值界。在标量LQG问题上,自洽界在中等传感器噪声下捕获了超过80%的已知最优成本;在非线性Dubins汽车跟踪问题上,该界在所有噪声水平下都保持信息量,而使用未受控状态分布的界则无效。

英文摘要

Fundamental limits on the performance of feedback controllers are essential for benchmarking algorithms, guiding sensor selection, and certifying task feasibility, yet few general-purpose tools exist for computing them. Existing information-theoretic approaches overestimate the information a sensor must provide by evaluating it against the uncontrolled system, producing bounds that degrade precisely when feedback is most valuable. We derive a lower bound on the minimum expected cost of any causal feedback controller under partial observations by applying the Gibbs variational principle to the joint path measure over states and observations. The bound applies to nonlinear, nonholonomic, and hybrid dynamics with unbounded costs and admits a self-consistent refinement: any good controller concentrates the state, which limits the information the sensor can extract, which tightens the bound. The resulting fixed-point equation has a unique solution computable by bisection, and we provide conditions under which the free energy minimization is provably convex, yielding a certifiably correct numerical bound. On a scalar LQG problem the self-consistent bound captures over 80% of the known optimal cost at moderate sensor noise, and on a nonlinear Dubins car tracking problem it remains informative across all noise levels where a bound using the uncontrolled state distribution is vacuous.
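For readers unfamiliar with it, the Gibbs variational principle in its standard textbook form is stated below. The symbols P, Q, and J are chosen here for illustration; how the paper instantiates them on the joint path measure over states and observations is not given in the abstract, so this is background only and not the paper's derivation.

```latex
% Gibbs variational principle (standard form).
% P: reference probability measure; J: measurable cost, bounded below;
% Q: any probability measure absolutely continuous with respect to P.
-\log \mathbb{E}_{P}\!\left[ e^{-J} \right]
  = \inf_{Q \ll P} \left\{ \mathbb{E}_{Q}[J] + D_{\mathrm{KL}}(Q \,\|\, P) \right\}

% Rearranged, every such Q satisfies the lower bound
\mathbb{E}_{Q}[J] \;\ge\; -\log \mathbb{E}_{P}\!\left[ e^{-J} \right]
  - D_{\mathrm{KL}}(Q \,\|\, P)
```

Read this way, any upper bound on the divergence term turns the free energy on the right into a lower bound on expected cost. That is one plausible reading of why limiting the information a sensor can extract tightens the bound.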

2603.06687 2026-05-26 cs.CV cs.CL cs.ET cs.MM cs.RO 版本更新

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

TimeSpot: 在真实世界场景中评估视觉语言模型的地理时间理解能力

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez

发表机构 * Computational Intelligence and Operations Laboratory (CIOL), Bangladesh(计算智能与运筹实验室(CIOL),孟加拉国) Shahjalal University of Science and Technology (SUST), Sylhet, Bangladesh(沙阿贾拉勒科技大学(SUST),锡尔赫特,孟加拉国) North South University (NSU), Dhaka, Bangladesh(北南大学(NSU),达卡,孟加拉国) Qatar Computing Research Institute (QCRI), Doha, Qatar(卡塔尔计算研究所(QCRI),多哈,卡塔尔)

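AI summary: Proposes the TimeSpot benchmark, which uses 1,455 images from around the world to evaluate how well vision-language models reason about temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude). Existing models perform poorly, especially on temporal reasoning.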

Comments Accepted to ICML 2026

详情
AI中文摘要

地理时间理解,即仅从视觉输入推断位置、时间和上下文属性的能力,支撑着灾害管理、交通规划、具身导航、世界建模和地理教育等应用。尽管最近的视觉语言模型(VLM)利用地标和路标等线索在图像地理定位方面取得了进展,但它们推理时间信号和物理基础空间线索的能力仍然有限。为弥补这一差距,我们引入了TimeSpot,一个用于评估VLM在真实世界中进行地理时间推理的基准。TimeSpot包含来自80个国家的1,455张地面图像,要求直接从视觉证据中结构化预测时间属性(季节、月份、时段、日光相位)和地理属性(大洲、国家、气候带、环境类型、经纬度)。它还包括时空推理任务,测试在真实世界不确定性下的物理合理性。对最先进的开源和闭源VLM的评估显示性能低下,尤其是时间推理。虽然监督微调带来了改进,但结果仍不充分,凸显了需要新方法来实现稳健的、基于物理的地理时间理解。TimeSpot可在 https://TimeSpot-GT.github.io 获取。

英文摘要

Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at: https://TimeSpot-GT.github.io.

2603.06218 2026-05-26 cs.RO 版本更新

Few-Shot Neural Differentiable Simulator: Real-to-Sim Rigid-Contact Modeling

少样本神经可微模拟器:真实到模拟的刚体接触建模

Zhenhao Huang, Siyuan Luo, Bingyang Zhou, Ziqiu Zeng, Jason Pho, Fan Shi

发表机构 * National University of Singapore(新加坡国立大学) Prana Lab(Prana实验室)

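AI summary: Proposes a few-shot real-to-sim method that combines the physical consistency of analytical formulations with the representational capacity of graph neural networks. A small amount of real-world data calibrates an analytical simulator, which then generates large-scale synthetic datasets. A mesh-based GNN implicitly models rigid-body forward dynamics, and surrogate gradients for collision detection make the simulator fully differentiable, improving simulation fidelity and policy-learning efficiency.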

Comments Accepted at ICRA 2026

详情
AI中文摘要

精确的物理模拟对于机器人学习和控制至关重要,然而解析模拟器通常难以捕捉复杂的接触动力学,而基于学习的模拟器通常需要大量昂贵的真实世界数据。为弥合这一差距,我们提出了一种少样本真实到模拟方法,该方法结合了解析公式的物理一致性与基于图神经网络(GNN)模型的表示能力。仅使用少量真实世界数据,我们的方法校准解析模拟器以生成大规模合成数据集,捕捉多样的接触交互。在此基础上,我们引入了一种基于网格的GNN,隐式建模刚体前向动力学,并推导出碰撞检测的代理梯度,实现完全可微性。实验结果表明,我们的方法使基于学习的模拟器在复现真实世界轨迹方面优于可微基线。此外,可微设计支持基于梯度的优化,我们通过多物体交互场景中的基于模拟的策略学习验证了这一点。大量实验表明,我们的框架不仅以最小监督提高了模拟保真度,还提高了策略学习的效率。综上所述,这些发现表明,具有少样本真实世界基础的可微模拟为推进未来机器人操作和控制提供了有力方向。

英文摘要

Accurate physics simulation is essential for robotic learning and control, yet analytical simulators often fail to capture complex contact dynamics, while learning-based simulators typically require large amounts of costly real-world data. To bridge this gap, we propose a few-shot real-to-sim approach that combines the physical consistency of analytical formulations with the representational capacity of graph neural network (GNN)-based models. Using only a small amount of real-world data, our method calibrates analytical simulators to generate large-scale synthetic datasets that capture diverse contact interactions. On this foundation, we introduce a mesh-based GNN that implicitly models rigid-body forward dynamics and derive surrogate gradients for collision detection, achieving full differentiability. Experimental results demonstrate that our approach enables learning-based simulators to outperform differentiable baselines in replicating real-world trajectories. In addition, the differentiable design supports gradient-based optimization, which we validate through simulation-based policy learning in multi-object interaction scenarios. Extensive experiments show that our framework not only improves simulation fidelity with minimal supervision but also increases the efficiency of policy learning. Taken together, these findings suggest that differentiable simulation with few-shot real-world grounding provides a powerful direction for advancing future robotic manipulation and control.

2602.21198 2026-05-26 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

从试错中学习:具身大语言模型的反思式测试时规划

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Leonidas Guibas, Jiajun Wu, Yejin Choi

发表机构 * Stanford University(斯坦福大学) Northwestern University(西北大学)

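AI summary: Proposes Reflective Test-Time Planning, which combines reflection-in-action and reflection-on-action with retrospective reflection so that embodied agents self-correct and accumulate experience at test time, yielding significant gains on long-horizon tasks.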

详情
AI中文摘要

具身大语言模型赋予机器人高级任务推理能力,但它们无法反思错误原因,导致部署成为一系列独立尝试,错误重复而非积累经验。借鉴人类反思实践,我们引入反思式测试时规划,整合两种反思模式:“行动中反思”,代理在行动前利用测试时扩展生成并评分多个候选行动,基于内部反思;以及“行动后反思”,利用测试时训练,根据执行后的外部反思更新内部反思模型和行动策略。我们还包含回溯性反思,允许代理重新评估早期决策,并利用后见之明进行模型更新,实现适当的长程信用分配。在我们新设计的Long-Horizon Household基准和MuJoCo Cupboard Fitting基准上的实验表明,与基线模型相比有显著提升,并能零样本泛化到逼真的HM3D环境以及在Franka Panda机械臂上的真实机器人实验。消融实验证实,行动中反思和行动后反思相互依赖,且回溯性反思在较低计算开销下比逐步外部反馈实现更好的信用分配。定性分析进一步突出了通过反思进行的行为纠正。

英文摘要

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with zero-shot generalization to photorealistic HM3D environments and real-robot experiments on a Franka Panda arm. Ablations confirm that reflection-in-action and reflection-on-action are mutually dependent, and that retrospective reflection achieves better credit assignment than step-wise external feedback at lower computational overhead. Qualitative analyses further highlight behavioral correction through reflection.

2602.06508 2026-05-26 cs.RO 版本更新

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

World-VLA-Loop: 视频世界模型与VLA策略的闭环学习

Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore(新加坡国立大学Show实验室)

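AI summary: Proposes the World-VLA-Loop framework, in which a state-aware video world model jointly predicts future frames and binary rewards, and a co-evolving paradigm iteratively refines the VLA policy, reducing reliance on real-environment interaction.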

Comments 16 pages, 9 figures

详情
AI中文摘要

强化学习(RL)可以超越行为克隆,优化视觉-语言-动作(VLA)策略,但由于需要大量 rollout、重置、监督和安全风险,真实世界的RL仍然昂贵。基于动作条件的视频世界模型提供了在虚拟环境中训练的选项,但它们在精确的动作跟随方面表现不佳,尤其是在细微的接近成功失败情况下。此外,它们缺乏用于RL的原生奖励信号。基于不准确的视觉预测计算奖励仍然不可靠。我们引入了World-VLA-Loop,它围绕两个基础设计和一个更高级别的协同进化范式构建。我们首先策划了SANS,专门混合成功和接近成功的轨迹,以改善动作-结果对齐。然后,我们训练了一个状态感知视频世界模型,该模型从扩散潜变量中联合预测未来帧和二元奖励。它将奖励估计与生成器耦合,而不是单独模块,从而反过来有利于视觉预测。由于RL过程中VLA行为会发生变化,固定的模拟器可能与更新后的策略不对齐,因此World-VLA-Loop通过使用精炼的世界模型进行迭代VLA后训练,同时将每个改进策略的rollout反馈回来增强和微调世界模型,从而形成闭环。在仿真和真实机器人实验中,World-VLA-Loop显著提高了VLA性能,同时减少了对昂贵的物理交互的依赖。

英文摘要

Reinforcement learning (RL) can refine Vision-Language-Action (VLA) policies beyond behavior cloning, but real-world RL remains expensive due to extensive rollouts, resets, supervision, and safety risks. Action-conditioned video world models offer an option to train in virtual environments, yet they exhibit imprecise action following, particularly on subtle near-success failures. They also lack native reward signals for RL, and computing rewards from inaccurate visual predictions remains unreliable. We introduce World-VLA-Loop, structured around two foundational designs and a higher-level co-evolving paradigm. We first curate SANS, which deliberately mixes successful and near-success trajectories to improve action-outcome alignment. Then, we train a state-aware video world model that jointly predicts future frames and binary rewards from diffusion latents. It couples reward estimation to the generator rather than to a separate module, which in turn benefits visual prediction. Since VLA behavior shifts during RL, a fixed simulator can misalign with the updated policy. World-VLA-Loop therefore closes the loop by using the refined world model for iterative VLA post-training while feeding rollouts from each improved policy back to augment and fine-tune the world model. Across simulation and real-robot experiments, World-VLA-Loop substantially improves VLA performance while reducing reliance on costly physical interaction.

2510.20390 2026-05-26 cs.RO 版本更新

NeuralTouch: Neural Descriptors for Precise Sim-to-Real Tactile Robot Control

NeuralTouch: 用于精确的仿真到现实触觉机器人控制的神经描述符

Yijiong Lin, Bowen Deng, Keju Pu, Chenghua Lu, Max Yang, Efi Psomopoulou, Nathan F. Lepora

发表机构 * Department of Engineering Mathematics and Bristol Robotics Laboratory(工程数学系和布里斯托尔机器人实验室)

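AI summary: Proposes NeuralTouch, a multimodal framework that combines Neural Descriptor Fields (NDF) with tactile sensing; a deep reinforcement learning policy uses tactile feedback to refine grasp poses, enabling precise and generalizable robotic manipulation.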

详情
Journal ref
IEEE/ASME Transactions on Mechatronics, 2026
AI中文摘要

抓取精度是精确物体操作的关键前提,通常需要机器人手与物体之间的仔细对齐。神经描述符场(NDF)提供了一种有前景的基于视觉的方法,能够生成跨物体类别泛化的抓取姿态。然而,由于相机标定不完美、点云不完整以及物体变异性,仅靠NDF可能产生不准确的姿态。同时,触觉感知能够实现更精确的接触,但现有方法通常学习仅限于简单、预定义接触几何的策略。在这项工作中,我们引入了NeuralTouch,一个集成NDF和触觉感知的多模态框架,通过轻柔的物理交互实现精确、可泛化的抓取。我们的方法利用NDF隐式表示目标接触几何,从中训练深度强化学习(RL)策略,利用触觉反馈来优化抓取。该策略以神经描述符为条件,不需要显式指定接触类型。我们通过仿真中的消融研究以及零样本迁移到真实世界的操作任务(如销钉出孔和瓶盖打开)来验证NeuralTouch,无需额外微调。结果表明,NeuralTouch在抓取精度和鲁棒性上显著优于基线方法,为精确、富接触的机器人操作提供了一个通用框架。

英文摘要

Grasping accuracy is a critical prerequisite for precise object manipulation, often requiring careful alignment between the robot hand and object. Neural Descriptor Fields (NDF) offer a promising vision-based method to generate grasping poses that generalize across object categories. However, NDF alone can produce inaccurate poses due to imperfect camera calibration, incomplete point clouds, and object variability. Meanwhile, tactile sensing enables more precise contact, but existing approaches typically learn policies limited to simple, predefined contact geometries. In this work, we introduce NeuralTouch, a multimodal framework that integrates NDF and tactile sensing to enable accurate, generalizable grasping through gentle physical interaction. Our approach leverages NDF to implicitly represent the target contact geometry, from which a deep reinforcement learning (RL) policy is trained to refine the grasp using tactile feedback. This policy is conditioned on the neural descriptors and does not require explicit specification of contact types. We validate NeuralTouch through ablation studies in simulation and zero-shot transfer to real-world manipulation tasks (such as peg-out-in-hole and bottle lid opening) without additional fine-tuning. Results show that NeuralTouch significantly improves grasping accuracy and robustness over baseline methods, offering a general framework for precise, contact-rich robotic manipulation.

2510.06351 2026-05-26 cs.RO 版本更新

A Formal gatekeeper Framework for Safe Dual Control with Active Exploration

具有主动探索的安全双重控制的正式门控框架

Kaleb Ben Naveed, Devansh R. Agrawal, Dimitra Panagou

发表机构 * Department of Robotics, University of Michigan(密歇根大学机器人系) Department of Aerospace Engineering, University of Michigan(密歇根大学航空航天工程系)

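AI summary: Proposes a framework that integrates robust planning with active exploration; a gatekeeper mechanism permits exploration only when it yields a verifiable improvement without sacrificing safety, balancing safety against uncertainty reduction.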

Comments Accepted at American Control Conference (ACC) 2026

详情
AI中文摘要

在模型不确定性下规划安全轨迹是一个基本挑战。鲁棒规划通过考虑最坏情况来确保安全,但忽略了不确定性降低,导致过于保守的行为。在名义任务期间主动实时降低不确定性定义了双重控制问题。大多数方法通过在成本中添加加权探索项来解决这一问题,调整以平衡名义目标和不确定性降低,但没有正式考虑何时探索是有益的。此外,某些方法强制安全性,而其他方法则没有。我们提出了一个框架,将鲁棒规划与正式保证下的主动探索集成如下:关键创新和贡献在于,仅在探索提供可验证改进且不牺牲安全时才进行探索。为实现这一点,我们利用我们早期关于门控器作为安全验证架构的工作,并将其扩展,使其生成既安全又信息丰富的轨迹,从而降低不确定性和任务成本,或将其保持在用户定义的预算内。通过参数不确定性下四旋翼飞行器在线双重控制的仿真案例研究评估了该方法。

英文摘要

Planning safe trajectories under model uncertainty is a fundamental challenge. Robust planning ensures safety by considering worst-case realizations, yet ignores uncertainty reduction and leads to overly conservative behavior. Actively reducing uncertainty on-the-fly during a nominal mission defines the dual control problem. Most approaches address this by adding a weighted exploration term to the cost, tuned to trade off the nominal objective and uncertainty reduction, but without formal consideration of when exploration is beneficial. Moreover, safety is enforced in some methods but not in others. We propose a framework that integrates robust planning with active exploration under formal guarantees as follows: The key innovation and contribution is that exploration is pursued only when it provides a verifiable improvement without compromising safety. To achieve this, we utilize our earlier work on gatekeeper as an architecture for safety verification, and extend it so that it generates both safe and informative trajectories that reduce uncertainty and the cost of the mission, or keep it within a user-defined budget. The methodology is evaluated via simulation case studies on the online dual control of a quadrotor under parametric uncertainty.
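The acceptance rule stated in the abstract (explore only when the trajectory is verified safe, reduces uncertainty, and either lowers mission cost or stays within a user-defined budget) can be written as a small decision function. This is a sketch of that rule only and not the gatekeeper architecture itself: the function `gatekeep`, the callbacks `is_safe`, `cost`, and `uncertainty`, and the dictionary-based trajectories are hypothetical, and the nominal robust trajectory is assumed to be safe already.

```python
def gatekeep(nominal, candidate, is_safe, cost, uncertainty, budget):
    """Return the exploratory candidate only if it passes every check.

    Assumes `nominal` is a robustly planned trajectory that is already safe.
    """
    # Safety is a hard requirement: an unverified candidate is never used.
    if not is_safe(candidate):
        return nominal

    # Exploration must actually reduce uncertainty to be worth pursuing.
    informative = uncertainty(candidate) < uncertainty(nominal)

    # Either the mission gets cheaper, or the cost stays within the budget.
    affordable = cost(candidate) <= cost(nominal) or cost(candidate) <= budget

    return candidate if informative and affordable else nominal


nominal = {"name": "nominal", "safe": True, "cost": 10.0, "unc": 1.0}
explore = {"name": "explore", "safe": True, "cost": 11.0, "unc": 0.4}
risky = {"name": "risky", "safe": False, "cost": 8.0, "unc": 0.1}

checks = dict(
    is_safe=lambda traj: traj["safe"],
    cost=lambda traj: traj["cost"],
    uncertainty=lambda traj: traj["unc"],
    budget=12.0,
)
chosen = gatekeep(nominal, explore, **checks)
rejected = gatekeep(nominal, risky, **checks)
```

Unlike a weighted exploration term in the cost, this rule makes the decision to explore an explicit, checkable condition, and falling back to the nominal plan keeps the system safe whenever a check fails.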

2510.03827 2026-05-26 cs.CV cs.RO 版本更新

LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

LIBERO-PRO:超越记忆的视觉-语言-动作模型鲁棒与公平评估

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun

发表机构 * Huazhong University of Science and Technology(华中科技大学) College of AI, Tsinghua University(清华大学人工智能学院) Wuhan University of Technology(武汉理工大学) Lehigh University(里海大学)

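AI summary: To address memorization bias in LIBERO benchmark evaluation, proposes the extended LIBERO-PRO benchmark, which applies reasonable perturbations along four dimensions (manipulated objects, initial states, task instructions, and environments). It shows that existing VLA models collapse from over 90% to 0.0% and calls for robust evaluation practices.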

Comments 10 pages, 7 figures, 0 tables

详情
AI中文摘要

LIBERO已成为评估视觉-语言-动作(VLA)模型的广泛采用的基准;然而,其当前的训练和评估设置存在问题,常常导致性能估计膨胀,并阻碍公平的模型比较。为了解决这些问题,我们引入了LIBERO-PRO,一个扩展的LIBERO基准,系统性地评估模型在四个维度(操作对象、初始状态、任务指令和环境)的合理扰动下的性能。实验结果表明,尽管现有模型在标准LIBERO评估下达到90%以上的准确率,但在我们的泛化设置下,其性能骤降至0.0%。关键的是,这种差异暴露了模型依赖于对训练集中动作序列和环境布局的死记硬背,而非真正的任务理解或环境感知。例如,当目标对象被替换为无关物品时,模型仍持续执行抓取动作;即使给出被破坏的指令甚至混乱的令牌,其输出也保持不变。这些发现揭示了当前评估实践中的严重缺陷,我们呼吁社区放弃误导性方法,转而采用对模型泛化和理解能力的鲁棒评估。我们的代码可在 https://github.com/Zxy-MLlab/LIBERO-PRO 获取。

英文摘要

LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or even messy tokens. These findings expose the severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.

2510.01348 2026-05-26 cs.RO 版本更新

Kilometer-Scale GNSS-Denied UAV Navigation via Heightmap Gradients: A Winning System from the SPRIN-D Challenge

基于高程图梯度的千米级GNSS拒止无人机导航:SPRIN-D挑战优胜系统

Michal Werner, David Čapek, Tomáš Musil, Ondřej Franěk, Tomáš Báča, Martin Saska

发表机构 * Faculty of Electrical Engineering, Czech Technical University in Prague(布拉格捷克理工大学电气工程学院)

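AI summary: To counter drift during long-range UAV flight in GNSS-denied environments, proposes a lightweight localization method that corrects drift through heightmap gradient-template matching, and demonstrates 9 km waypoint navigation in the SPRIN-D challenge.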

Comments 8 pages

详情
AI中文摘要

在GNSS拒止环境中实现可靠的长距离无人机飞行具有挑战性:集成里程计会导致漂移,在未探索区域无法进行闭环检测,且嵌入式平台计算能力有限。我们提出了一套完全机载的无人机系统,专为SPRIN-D Funke Fully Autonomous Flight Challenge开发,该挑战要求在没有GNSS或先验密集地图的情况下,在低于25米AGL(离地高度)的高度完成9公里长距离航点导航。该系统集成了感知、建图、规划和控制,并采用一种轻量级漂移校正方法,通过梯度模板匹配将激光雷达导出的局部高程图与先验地理数据高程图进行匹配,并在聚类粒子滤波器中融合里程计证据。在竞赛部署中,该系统在城区、森林和开阔地形中执行了千米级飞行,相对于原始里程计显著减少了漂移,同时在仅CPU硬件上实时运行。我们描述了系统架构、定位流程和竞赛评估,并报告了现场部署中的实际经验,为GNSS拒止无人机自主性的设计提供了参考。

英文摘要

Reliable long-range flight of unmanned aerial vehicles (UAVs) in GNSS-denied environments is challenging: integrating odometry leads to drift, loop closures are unavailable in previously unseen areas and embedded platforms provide limited computational power. We present a fully onboard UAV system developed for the SPRIN-D Funke Fully Autonomous Flight Challenge, which required 9 km long-range waypoint navigation below 25 m AGL (Above Ground Level) without GNSS or prior dense mapping. The system integrates perception, mapping, planning, and control with a lightweight drift-correction method that matches LiDAR-derived local heightmaps to a prior geo-data heightmap via gradient-template matching and fuses the evidence with odometry in a clustered particle filter. Deployed during the competition, the system executed kilometer-scale flights across urban, forest, and open-field terrain and reduced drift substantially relative to raw odometry, while running in real time on CPU-only hardware. We describe the system architecture, the localization pipeline, and the competition evaluation, and we report practical insights from field deployment that inform the design of GNSS-denied UAV autonomy.
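The core idea of matching a local heightmap to a prior heightmap through gradients can be sketched with a brute-force search. This is an illustration and not the competition system: the forward-difference gradients, the sum-of-squared-differences score, and the names `gradients` and `match_offset` are assumptions, and the particle-filter fusion with odometry is omitted.

```python
import random


def gradients(grid):
    """Forward-difference gradient (gx, gy) for every cell that has a
    right and a lower neighbour."""
    rows, cols = len(grid), len(grid[0])
    return [
        [(grid[r][c + 1] - grid[r][c], grid[r + 1][c] - grid[r][c])
         for c in range(cols - 1)]
        for r in range(rows - 1)
    ]


def match_offset(local, prior):
    """Return the (row, col) offset in `prior` whose gradients best match
    the gradients of `local`, scored by sum of squared differences."""
    g_local, g_prior = gradients(local), gradients(prior)
    height, width = len(g_local), len(g_local[0])
    best_score, best_offset = float("inf"), (0, 0)
    for r in range(len(g_prior) - height + 1):
        for c in range(len(g_prior[0]) - width + 1):
            score = 0.0
            for i in range(height):
                for j in range(width):
                    dx = g_prior[r + i][c + j][0] - g_local[i][j][0]
                    dy = g_prior[r + i][c + j][1] - g_local[i][j][1]
                    score += dx * dx + dy * dy
            if score < best_score:
                best_score, best_offset = score, (r, c)
    return best_offset


# Synthetic check: crop a patch from a random prior map and add a constant
# altitude bias, which the gradients cancel out.
rng = random.Random(0)
prior = [[rng.randint(0, 100) for _ in range(12)] for _ in range(12)]
local = [[prior[4 + i][6 + j] + 5 for j in range(5)] for i in range(5)]
estimated = match_offset(local, prior)
```

Matching gradients instead of raw heights makes the score insensitive to a constant altitude offset between the onboard map and the prior geo-data, which is one likely motivation for a gradient-based template.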

2509.23651 2026-05-26 cs.RO 版本更新

HeLoM: Hierarchical Learning for Whole-Body Loco-Manipulation by a Hexapod Robot

HeLoM: 六足机器人全身移动操作的分层学习

Xinrong Yang, Peizhuo Li, Hongyi Li, Yifeng Peng, Arhaan Jain, Junkai Lu, Linnan Chang, Yuhong Cao, Yifeng Zhang, Ge Sun, Guillaume Sartoretti

发表机构 * MARMoT Lab, Department of Mechanical Engineering, National University of Singapore(机械工程系,新加坡国立大学MARMoT实验室) Center for X-mechanics, Zhejiang University(浙江大学X力学中心)

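AI summary: Proposes HeLoM, a hierarchical framework that coordinates multi-limb control so that a hexapod robot can stably push heavy or irregular objects; its effectiveness is validated in simulation and real-world experiments.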

详情
AI中文摘要

在自然界中,动物经常需要移动/操纵与自身重量/大小相当的物体。与抓取和搬运相比,推动提供了一种更直接、高效的非抓取操纵策略,避免了复杂的抓取设计,同时利用直接接触在交互过程中调节物体的姿态。然而,实现有效的推动既需要足够的操纵能力,也需要稳定的全身协调,这在处理重型或不规则物体时尤其具有挑战性。为了解决这些挑战,我们提出了HeLoM,一种基于学习的六足机器人分层全身操纵框架,该框架利用协调的多肢控制,并适用于多足机器人系统。受多足昆虫合作策略的启发,我们的框架利用多个接触点和高度自由度,在物体交互过程中实现高效、动态的全身协调。HeLoM的高层规划器规划推动行为,而其低层控制器保持运动稳定性并生成动态一致的关节动作。这种设计使机器人能够通过协调的前肢交互和支撑性的后肢推进,在执行连续可控的推动行为的同时保持平衡。我们通过仿真和实物实验验证了HeLoM的有效性。结果表明,我们的框架能够在现实世界中稳定地将不同尺寸和未知物理属性的物体推动到指定的目标姿态。

英文摘要

In nature, animals often need to move/manipulate objects comparable in weight/size to their own bodies. Compared to grasping and carrying, pushing provides a more straightforward and efficient non-prehensile manipulation strategy, avoiding complex grasp design while leveraging direct contact to regulate an object's pose during interaction. Achieving effective pushing, however, requires both sufficient manipulation capability and stable whole-body coordination, which is particularly challenging when dealing with heavy or irregular objects. To address these challenges, we propose HeLoM, a learning-based hierarchical whole-body manipulation framework for hexapod robots that exploits coordinated multi-limb control and is applicable to multi-legged robotic systems. Inspired by the cooperative strategies of multi-legged insects, our framework leverages multiple contact points and high degrees of freedom to enable efficient and dynamic whole-body coordination during object interaction. HeLoM's high-level planner plans pushing behaviors, while its low-level controller maintains locomotion stability and generates dynamically consistent joint actions. This design enables the robot to maintain balance while executing continuous and controllable pushing behaviors through coordinated foreleg interaction and supportive hind-leg propulsion. We validate the effectiveness of HeLoM through both simulation and real-world experiments. Results show that our framework can stably push objects of varying sizes and unknown physical properties to designated goal poses in the real world.

2509.17057 2026-05-26 cs.RO 版本更新

RoboManipBaselines: A Unified Framework for Imitation Learning in Robotic Manipulation across Real and Simulation Environments

RoboManipBaselines:面向真实与仿真环境的机器人操作模仿学习统一框架

Masaki Murooka, Tomohiro Motoda, Ryoichi Nakajo, Hanbit Oh, Koshi Makihara, Keisuke Shirai, Tetsuya Ogata, Yukiyasu Domae

发表机构 * Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST)(日本国家先进工业科学技术研究院人工智能研究中心) CNRS-AIST JRL (Joint Robotics Laboratory), IRL(法国国家科学研究中心与日本AIST联合机器人实验室) Institute for AI and Robotics, Future Robotics Organization, Waseda University(早稻田大学未来机器人组织人工智能与机器人研究所) AI Robot Association (AIRoA)(人工智能机器人协会)

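AI summary: Proposes RoboManipBaselines, an open-source framework that supports the full imitation learning pipeline for robotic manipulation (data collection, policy training, and deployment) in both simulation and real-world environments, and validates it through benchmarks and research applications.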

Comments Added a Limitations section in response to comments from reviewers

详情
Journal ref
IEEE Access 2026
AI中文摘要

我们提出RoboManipBaselines,一个用于机器人操作模仿学习研究的开源软件框架。该框架支持完整的模仿学习流程,包括数据收集、策略训练和部署,覆盖仿真和真实环境。其设计强调通过一致的工作流程实现集成,跨不同环境和机器人平台的通用性,通过易于添加新机器人、任务和策略的可扩展性,以及通过使用公开数据集进行评估的可重复性。RoboManipBaselines系统地实现了模仿学习的核心组件:环境、数据集和策略。通过统一接口,该框架支持多种仿真器和真实机器人环境,以及多模态传感器和多种策略模型。我们进一步在仿真和真实环境中进行了基准评估,并介绍了多项研究应用,包括数据增强、与触觉模型的集成、交互式机器人系统、3D感知评估和硬件扩展。这些结果表明,RoboManipBaselines为利用模仿学习推进机器人操作的研究和实验验证提供了有用的基础。https://isri-aist.github.io/RoboManipBaselines-ProjectPage

英文摘要

We present RoboManipBaselines, an open-source software framework for imitation learning research in robotic manipulation. The framework supports the entire imitation learning pipeline, including data collection, policy training, and rollout, across both simulation and real-world environments. Its design emphasizes integration through a consistent workflow, generality across diverse environments and robot platforms, extensibility for easily adding new robots, tasks, and policies, and reproducibility through evaluations using publicly available datasets. RoboManipBaselines systematically implements the core components of imitation learning: environment, dataset, and policy. Through a unified interface, the framework supports multiple simulators and real robot environments, as well as multimodal sensors and a wide variety of policy models. We further present benchmark evaluations in both simulation and real-world environments and introduce several research applications, including data augmentation, integration with tactile models, interactive robotic systems, 3D sensing evaluation, and hardware extensions. These results demonstrate that RoboManipBaselines provides a useful foundation for advancing research and experimental validation in robotic manipulation using imitation learning. Project page: https://isri-aist.github.io/RoboManipBaselines-ProjectPage

2505.11758 2026-05-26 cs.CV cs.AI cs.GR cs.RO 版本更新

Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

具有预测性提示和负学习的可泛化视觉语言少样本适应

Sriram Mandalika

发表机构 * Hasso Plattner Institute, University of Potsdam(哈索·普拉特纳研究所,波茨坦大学)

AI总结 提出SCAN框架,通过查询自适应负路由、LLM引导对比提示和自适应融合权重,解决视觉语言模型少样本适应中负类信号处理问题,在11个基准上平均提升4.61%。

详情
AI中文摘要

视觉语言模型的少样本适应在推理时如何处理负类信号方面仍然存在根本性限制。现有方法对所有查询应用统一的负抑制,忽略了最具破坏性的混淆是查询特定的,并且随支持集几何形状而变化。我们提出SCAN(选择性混淆感知负样本),一个通过三个针对性贡献解决这一问题的框架。在推理中,查询自适应负路由将抑制限制在每个查询最易混淆的前K个类别,无需额外参数。通用负文本模板被替换为LLM引导的对比提示,描述易混淆类别对之间的区分属性,在关键处锐化文本决策边界。基于支持集Fisher可判别性估计的无参数自适应融合权重消除了手动调整视觉语言权衡的需要。在11个标准基准上评估,SCAN在16-shot设置下平均优于先前的基于提示和基于适配器的方法4.61%,在类间混淆最严重的细粒度数据集上提升高达7.70%。SCAN在分布偏移下也表现出强泛化性,在四个ImageNet OOD变体上平均提升2.95%,并在显著标签噪声下保持稳健性能,在50%标签损坏下的准确率仍超过最强竞争方法的干净基线。

英文摘要

Few-shot adaptation of vision-language models remains fundamentally limited by how negative class signals are handled at inference. Existing methods apply uniform negative suppression across all queries, ignoring that the most damaging confusions are query-specific and shift with support-set geometry. We introduce SCAN (Selective Confusion-Aware Negatives), a framework that addresses this gap through three targeted contributions. In inference, query-adaptive negative routing restricts suppression to the top-K most confusable classes per query, requiring zero additional parameters. Generic negative text templates are replaced with LLM-bootstrapped contrastive prompts that describe discriminative attributes between confusable class pairs, sharpening the textual decision boundary where it matters most. A parameter-free adaptive fusion weight estimated from support-set Fisher discriminability removes the need for manual tuning of the vision-language trade-off. Evaluated across 11 standard benchmarks, SCAN consistently outperforms prior prompt-based and adapter-based methods by an average of 4.61% at 16-shot, with gains of up to 7.70% on fine-grained datasets where inter-class confusion is most severe. SCAN also generalizes strongly under distribution shift, improving by 2.95% on average across four ImageNet OOD variants, and maintains robust performance under significant label noise, with accuracy under 50% label corruption still exceeding the clean baseline of the strongest competing method.
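The query-adaptive routing idea can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how confusability is measured or how negative evidence is combined, so the sketch assumes that the K classes with the highest positive scores for a query are the "most confusable" ones and that negative-prompt scores are subtracted with a hypothetical `weight`. The function name `route_negatives` is invented for this example.

```python
def route_negatives(pos_scores, neg_scores, k, weight=1.0):
    """Suppress with negative evidence only for the k top-scoring classes.

    pos_scores[c]: similarity of the query to the positive prompt of class c.
    neg_scores[c]: similarity of the query to the negative prompt of class c.
    Assumption: the k classes with the highest positive score are treated
    as the most confusable ones for this query.
    """
    ranked = sorted(range(len(pos_scores)), key=lambda c: pos_scores[c], reverse=True)
    top_k = set(ranked[:k])
    # Classes outside the top-k keep their positive score unchanged.
    return [
        p - weight * n if c in top_k else p
        for c, (p, n) in enumerate(zip(pos_scores, neg_scores))
    ]


scores = route_negatives([4.0, 3.0, 1.0], [1.0, 2.5, 5.0], k=2)
print(scores)
```

Because the routing is a selection over existing scores, it adds no learnable parameters, which is consistent with the "zero additional parameters" claim in the abstract.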

2502.16205 2026-05-26 cs.RO 版本更新

A neural signed configuration distance function for path planning of picking manipulators

一种用于拾取机械臂路径规划的神经有符号构型距离函数

Bernhard Wullt, Mikael Norrlöf, Per Mattsson, Thomas B. Schön

发表机构 * Department of Information Technology, Uppsala University(信息技术系,乌普萨拉大学)

AI总结 针对拾取机械臂路径规划问题,提出一种神经有符号构型距离函数(nSCDF)作为隐式障碍物表示,通过构建构型空间中的无碰撞球体,将多查询路径规划器中的点替换为球体,从而快速生成无碰撞走廊并利用凸规划优化路径,实验表明该方法在显著减少时间的同时生成接近渐近最优的路径。

详情
AI中文摘要

拾取机械臂是特定任务机器人,与通用机械臂相比自由度较少,在工业中广泛使用。拾取机器人的效率高度依赖于路径规划解决方案,该方案通常基于采样的多查询方法。规划器能够稳健地解决问题,但其对碰撞检测的大量使用限制了在线使用的规划能力。为解决这一问题,我们提出一种新颖的用于路径规划的隐式障碍物表示,即神经有符号构型距离函数(nSCDF),它使我们能够在构型空间中形成无碰撞球体。我们使用球体表示重新表述了一种先进的多查询路径规划器,即在图中使用球体而不是点。我们的规划器返回一个无碰撞走廊,这使我们能够使用凸规划生成优化路径。从数值实验中,我们观察到我们的规划器在显著更短的时间内生成接近渐近最优路径规划器的路径。

英文摘要

Picking manipulators are task-specific robots, with fewer degrees of freedom than general-purpose manipulators, and are heavily used in industry. The efficiency of picking robots is highly dependent on the path planning solution, which is commonly based on sampling-based multi-query methods. Such a planner solves the problem robustly, but its heavy use of collision detection limits its planning capabilities for online use. We approach this problem by presenting a novel implicit obstacle representation for path planning, a neural signed configuration distance function (nSCDF), which allows us to form collision-free balls in the configuration space. We use the ball representation to reformulate a state-of-the-art multi-query path planner, i.e., instead of points, we use balls in the graph. Our planner returns a collision-free corridor, which allows us to use convex programming to produce optimized paths. From our numerical experiments, we observe that our planner produces paths that are close to those from an asymptotically optimal path planner, in significantly less time.
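A minimal sketch of the ball construction follows, under loud assumptions: the learned nSCDF is replaced by a toy analytic signed distance to a single disc-shaped obstacle in a 2D configuration space, and the helper names (`toy_scdf`, `free_ball`, `balls_overlap`) are hypothetical. The idea it illustrates is that a positive signed distance d at configuration q certifies the ball of radius d around q as collision-free, and that overlapping balls can be chained into a corridor.

```python
import math


def toy_scdf(q, center=(0.0, 0.0), radius=1.0):
    """Toy stand-in for a learned nSCDF: signed distance from configuration q
    to a disc-shaped configuration-space obstacle (negative inside it)."""
    return math.dist(q, center) - radius


def free_ball(q, scdf=toy_scdf):
    """Return (center, radius) of a collision-free ball at q, or None if q collides."""
    d = scdf(q)
    return (q, d) if d > 0 else None


def balls_overlap(a, b):
    """Two balls share free space if their centers are closer than the sum of radii."""
    (qa, ra), (qb, rb) = a, b
    return math.dist(qa, qb) < ra + rb


ball_a = free_ball((3.0, 0.0))
ball_b = free_ball((3.0, 3.0))
print(ball_a, ball_b, balls_overlap(ball_a, ball_b))
```

In a graph of such balls, an edge between two overlapping balls needs no further collision checks, which is why a ball-based graph can reduce the collision-detection load that the abstract identifies as the bottleneck.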

2409.20473 2026-05-26 cs.RO 版本更新

Data-Driven Optimization of Tactile Sensor Configurations for Efficient Dexterous Manipulation

数据驱动的触觉传感器配置优化以实现高效灵巧操作

Haoran Guo, Haoyang Wang, Zhengxiong Li, He Bai, Lingfeng Tao

发表机构 * ShanghaiTech University, School of Information Science and Technology(上海科技大学信息科学与技术学院) University of Alberta(阿尔伯塔大学) Oklahoma State University(俄克拉荷马州立大学) University of Colorado Denver, Department of Computer Science and Engineering(科罗拉多大学丹佛分校计算机科学与工程系) Department of Robotics and Mechatronics Engineering, Kennesaw State University(肯尼索州立大学机器人与机电工程系)

AI总结 提出两阶段框架量化触觉传感器对深度强化学习策略的贡献,将Shadow Hand传感器从92个减少至14个仍保持90%以上性能,并发现中指传感器具有负贡献。

Comments This work has been submitted to the ICRA for possible publication

详情
AI中文摘要

触觉感知对于基于学习的灵巧操作至关重要,但传感器放置的原则性指导仍然缺乏。虽然密集传感器阵列提供丰富的接触反馈,但它们带来显著的硬件成本,甚至可能通过引入冗余或冲突输入而降低策略性能。本文提出了第一个系统框架,用于量化单个触觉传感器对深度强化学习(DRL)策略性能的贡献。我们提出了一种两阶段方法:粗粒度经验剪枝阶段将Shadow Hand上的传感器数量从92个减少到21个,同时保留93%的任务性能;随后是细粒度主动学习阶段,结合高斯过程回归(GPR)与Lasso回归对每个剩余传感器的功能重要性进行排序。我们的分析揭示,拇指、无名指和小指上的传感器主导操作性能,而中指传感器表现出负贡献——主动降低策略学习。跨三个操作任务(方块、鸡蛋和笔)的消融研究证实,14个传感器的配置保留了全阵列90%以上的性能。在两个新物体上的零样本迁移实验以及在Allegro和Leap Hand上的跨平台验证进一步表明,识别出的重要性排序在任务和机器人形态之间具有泛化性。这些发现建立了量化部署指南,使从业者能够选择具有可预测性能权衡的成本效益传感器配置。

英文摘要

Tactile sensing is critical for learning-based dexterous manipulation, yet principled guidelines for sensor placement remain largely absent. While dense sensor arrays provide rich contact feedback, they impose significant hardware costs and can even degrade policy performance by introducing redundant or conflicting inputs. This paper presents the first systematic framework for quantifying the contribution of individual tactile sensors to deep reinforcement learning (DRL) policy performance. We propose a two-stage approach: a coarse empirical pruning phase that reduces the sensor count on the Shadow Hand from 92 to 21 while retaining 93% task performance, followed by a fine-grained active learning phase that combines Gaussian Process Regression (GPR) with Lasso regression to rank the functional importance of each remaining sensor. Our analysis reveals that sensors on the thumb, ring finger, and little finger dominate manipulation performance, while middle-finger sensors exhibit negative contributions -- actively degrading policy learning. Ablation studies across three manipulation tasks (block, egg, and pen) confirm that a 14-sensor configuration preserves over 90% of the full-array performance. Zero-shot transfer experiments on two novel objects and cross-platform validation on the Allegro and Leap Hand further demonstrate that the identified importance rankings generalize across tasks and robot morphologies. These findings establish quantitative deployment guidelines that enable practitioners to select cost-effective sensor configurations with predictable performance trade-offs.

2605.25239 2026-05-26 cs.RO eess.SP 版本更新

FusionCore: A 23-State Unscented Kalman Filter for IMU, Wheel Encoder, GPS, and Visual SLAM Fusion in ROS 2

FusionCore: 用于IMU、轮式编码器、GPS和视觉SLAM融合的23状态无迹卡尔曼滤波器(ROS 2)

Manan Kharwar

发表机构 * Independent Researcher(独立研究者)

AI总结 提出FusionCore,一个基于23状态无迹卡尔曼滤波器的开源ROS 2传感器融合包,通过在线估计轮式编码器偏航率偏差、GPS ECEF原生处理、自适应噪声协方差和VSLAM位姿融合,在12个NCLT序列上比robot_localization取得更低的绝对轨迹误差。

Comments 8 pages, 4 figures, 2 tables. Source code: https://github.com/manankharwar/fusioncore (Apache 2.0)

详情
AI中文摘要

我们提出了FusionCore,一个开源的ROS 2传感器融合包,它使用23状态无迹卡尔曼滤波器(UKF)将IMU、轮式编码器里程计、GPS和视觉SLAM位姿融合成单个100 Hz的里程计流。第23个状态是轮式编码器系统性偏航率偏差的在线估计,该偏差通过GPS航向互协方差识别,并在GPS中断期间减去,以减少滑行模式下的航向漂移。FusionCore还将陀螺仪和加速度计偏差估计为显式滤波器状态,在ECEF坐标系中原生处理GPS而无需单独的坐标投影节点,应用基于测量自由度校准的每传感器马氏卡方异常值门控,并根据新息序列自动调整传感器噪声协方差。VSLAM位姿融合使得系统能够配合任何视觉里程计或SLAM系统在GPS缺失环境下运行,包括从地图重新初始化中自动恢复。我们在NCLT公开数据集的12个全长序列(每个55-92分钟)上与robot_localization进行了对比评估。FusionCore在12个序列中的10个上实现了更低的绝对轨迹误差(ATE),在获胜序列上改进范围从1.2倍到22.2倍。robot_localization的UKF在所有12个序列上数值发散。FusionCore可在https://github.com/manankharwar/fusioncore上获取,采用Apache 2.0许可证。

英文摘要

We present FusionCore, an open-source ROS 2 sensor fusion package that fuses IMU, wheel encoder odometry, GPS, and Visual SLAM pose into a single 100 Hz odometry stream using a 23-state Unscented Kalman Filter (UKF). The 23rd state is an online estimate of the wheel encoder's systematic yaw rate bias, identified through GPS heading cross-covariance and subtracted during GPS blackouts to reduce heading drift in coast mode. FusionCore also estimates gyroscope and accelerometer biases as explicit filter states, handles GPS natively in ECEF without a separate coordinate projection node, applies per-sensor Mahalanobis chi-squared outlier gating calibrated to measurement degrees of freedom, and adapts sensor noise covariance automatically from the innovation sequence. VSLAM pose fusion enables GPS-denied operation with any visual odometry or SLAM system, including automatic recovery from map reinitialization. We evaluate against robot_localization on twelve full-length sequences (55-92 min each) from the NCLT public dataset. FusionCore achieves lower Absolute Trajectory Error (ATE) on ten of twelve sequences, with improvements ranging from 1.2x to 22.2x on winning sequences. The robot_localization UKF diverges numerically on all twelve sequences. FusionCore is available at https://github.com/manankharwar/fusioncore under the Apache 2.0 license.
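The outlier gate described above can be illustrated with a short sketch. It is not FusionCore's code: it assumes a diagonal innovation covariance to stay dependency-free, and it assumes a 99% confidence level, which the abstract does not state. The thresholds are the standard chi-squared 0.99 quantiles for 1 to 3 degrees of freedom, and the names `CHI2_99` and `passes_gate` are hypothetical.

```python
# Chi-squared 0.99 quantiles, indexed by measurement degrees of freedom.
# Assumption: a 99% gate; the actual confidence level is not given in the abstract.
CHI2_99 = {1: 6.635, 2: 9.210, 3: 11.345}


def passes_gate(innovation, innovation_var):
    """Accept a measurement if its squared Mahalanobis distance is within the gate.

    innovation:     measurement minus predicted measurement, one entry per axis.
    innovation_var: diagonal of the innovation covariance (simplifying assumption).
    """
    d2 = sum(v * v / s for v, s in zip(innovation, innovation_var))
    return d2 <= CHI2_99[len(innovation)]


print(passes_gate([1.0, 1.0], [1.0, 1.0]))  # small 2D residual
print(passes_gate([4.0, 0.0], [1.0, 1.0]))  # large 2D residual
```

Indexing the threshold by the measurement dimension is what "calibrated to measurement degrees of freedom" suggests: a 3D position fix and a 1D heading update are compared against different quantiles rather than one fixed constant.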

2605.25220 2026-05-26 cs.CV cs.GR cs.RO 版本更新

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

无需多视图生成的多视图一致3D高斯头部头像

Aviral Chharia, Fernando De la Torre

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出MVCHead,一种直接从随机采样的2D图像学习3D高斯头部模型的方法,通过层次状态空间块和SE(3)多视图评判器实现多视图一致性,无需多视图数据或3D监督。

Comments CVPR 2026; Project Website: https://humansensinglab.github.io/MVCHead/

详情
Journal ref
CVPR, Denver, CO, USA, 2026, pp. 40163-40174
AI中文摘要

高保真3D高斯头部头像生成对于AR/VR、远程呈现和数字人类等应用至关重要。现有方法依赖于多视图数据集、3D捕获或中间2D视图合成。相比之下,我们仅从随机采样的2D图像中学习条件和非条件3D头部模型,而不使用多视图数据、3D监督或中间视图生成。我们引入MVCHead,一种单次状态空间模型,直接在3D表示中强制执行多视图一致性(MVC),同时在这些约束下回归3D高斯。其核心是,我们提出层次状态空间(HiSS)块,从粗到细逐步细化高斯,同时捕获长距离依赖。在每个HiSS块中,我们修改Mamba的标准单向扫描,提出层次双向状态扫描(HiBiSS),将递归与多视图不一致性最强的轴对齐。最后,我们设计了一个SE(3)多视图评判器,判断一组自渲染是否来自单个底层3D配置,奖励跨视图像素对齐而不观察真实的多视图对。MVCHead实现了最先进的感知质量,在纹理和几何一致性上超越了先前方法,并保持了可比的形状一致性。为了展示可扩展性,我们发布了FaceGS-10K,这是第一个用于训练和评估3D头部模型的大规模即用型3D高斯头部资产数据集。项目页面和代码:https://humansensinglab.github.io/MVCHead/

英文摘要

High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models. Project Page and code: https://humansensinglab.github.io/MVCHead/

2605.25216 2026-05-26 cs.RO 版本更新

InvariantCloud: A Globally Invariant, Uniquely Indexed Point Cloud Framework for Robust 6-DoF Tactile Pose Tracking

InvariantCloud:一种全局不变、唯一索引的点云框架,用于鲁棒的6自由度触觉姿态跟踪

Pengfei Ye, Yuxiang Ma, Yi Zhou, Wei Chen, Wenzhen Dong, Molong Duan

发表机构 * Department of Mechanical and Aerospace Engineering, The Hong Kong University of Science and Technology(香港科技大学机械与航空航天工程系) Department of Mechanical Engineering, Massachusetts Institute of Technology(麻省理工学院机械工程系) Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong(香港中文大学机械与自动化工程系)

AI总结 提出InvariantCloud框架,利用视觉触觉传感器上表面标记星座的全局不变性,通过一次性全局不变点云配准实现6自由度物体姿态估计,抑制累积漂移并准确估计偏航旋转,在长序列操作任务中展现出高精度和鲁棒性。

详情
AI中文摘要

最近在模仿学习和视觉语言模型方面的进展突显了对高保真触觉感知的需求,其中6自由度触觉物体姿态估计为精确的机器人操作提供了关键基础。我们提出了InvariantCloud,一种6自由度姿态估计框架,该框架利用基于视觉的触觉传感器上表面标记星座的全局不变性。与最近的方法相比,我们的一次性全局不变点云配准抑制了累积漂移,并克服了准确估计偏航(Z轴)旋转的长期限制。实验验证表明,与现有基准相比,InvariantCloud在偏航跟踪精度和重定位重复性方面表现出色,证明了其在长序列操作任务中的精度和鲁棒性。

英文摘要

Recent advances in imitation learning and vision-language models highlight the need for high-fidelity tactile perception, with 6-DoF tactile object pose estimation providing a crucial foundation for precise robotic manipulation. We introduce InvariantCloud, a 6-DoF pose estimation framework that leverages the global invariance of surface marker constellations on vision-based tactile sensors. In contrast to recent approaches, our one-shot globally invariant point cloud registration suppresses cumulative drift and overcomes long-standing limitations in accurately estimating yaw (Z-axis) rotation. Experimental verifications show that InvariantCloud achieves superior yaw tracking accuracy and re-localization repeatability compared to existing benchmarks, demonstrating its precision and robustness in long-sequence manipulation tasks.

2605.25170 2026-05-26 cs.LG cs.AI cs.ET cs.RO 版本更新

Grow-Prune-Freeze Networks: Adaptive & Continual Learning Technique for Olfactory Navigation

生长-剪枝-冻结网络:用于嗅觉导航的自适应与持续学习技术

Kordel K. France, Ovidiu Daescu

AI总结 提出生长-剪枝-冻结(GPF)网络框架,通过动态调整策略网络层数实现持续学习,在湍流羽流导航任务中达到94%成功率,并推广到其他机器学习任务。

详情
AI中文摘要

嗅觉训练数据分散在非标准化的数据集中,限制了构建代表性世界模型的能力。嗅觉导航是一项高度动态和非平稳的任务,受益于实时持续学习。我们引入了一种名为生长-剪枝-冻结(GPF)网络的自适应框架,使智能体能够通过生长、剪枝和冻结其策略的早期层来持续学习,以应对世界复杂性。以非线性随机矩阵理论为基础,我们展示了Pennington & Worah(2017)的工作可以从单隐藏层扩展到n层持续学习模型,并且网络权重的特征值组成在添加连续层时得以保持。我们展示了基于期望SARSA的GPF在湍流羽流导航上实现了94%的成功率——这是一个部分可观测、非平稳的任务,代表了激发机器人自适应学习的“大世界”挑战——并提供了将GPF应用于其他世界模型的支撑方法。进一步的实验表明,GPF可能很好地推广到其他机器学习任务,如Atari中的强化学习、图像分类和自回归语言模型。我们开源所有代码和数据,以鼓励对嗅觉机器人技术的改进和更多研究。

英文摘要

Training data for olfaction is scattered through disparate, non-standardized datasets that limit the ability to build representative world models. Olfactory navigation is a highly dynamic and non-stationary task that benefits from real-time continual learning. We introduce an adaptive framework called Grow-Prune-Freeze (GPF) networks that enable an agent to continually learn through growing, pruning, and freezing early layers of its policy in response to world complexity. Grounding GPFs in non-linear random matrix theory, we show that the work of Pennington & Worah (2017) can be extended from single hidden layers to n-layer continual-learning models, and that eigenvalue composition of network weights is preserved as successive layers are added. We show that GPFs based on Expected SARSA achieve a 94% success rate on turbulent plume navigation - a partially observable, non-stationary task representative of the "big world" challenges that motivate adaptive learning in robotics - and provide supporting methodology for applying GPFs in other world models. Further experiments provide evidence that GPFs may generalize well to other machine learning tasks such as reinforcement learning in Atari, image classification, and autoregressive language models. We open source all code and data to encourage improvements on and more research in olfactory robotics.

2605.25109 2026-05-26 cs.RO 版本更新

Soft Pneumatic Actuators for Soft Robotics: A Motion-Based Review of Actuation Mechanisms and Performance Trade-offs

软体机器人中的软体气动执行器:基于运动的驱动机制与性能权衡综述

Mohammed Abboodi

AI总结 本文基于运动类型(直线、弯曲、扭转、全向)分类综述软体气动执行器的设计策略,分析结构特征对运动输出、力产生、空气需求等性能的影响,并讨论选择与比较时的关键条件。

详情
AI中文摘要

软体气动执行器在软体机器人中广泛应用,因为它们能够产生大运动,同时保持足够的柔顺性,以安全地与物体、环境和人体交互。然而,它们的性能并不仅仅由压力决定。相反,响应取决于执行器的构建方式,包括腔室的形状、增强材料的放置、褶皱的使用、材料刚度以及引导其变形的约束。随着文献的扩展,确定哪种机制最适合特定应用以及哪些报告的结果可以在研究之间进行比较变得更加困难。本文根据用于生成四类运动(直线、弯曲、扭转和全向驱动)的设计策略来审视软体气动执行器。对于每一类,它分析了定义变形路径的结构特征,包括编织角、褶皱几何形状、纤维取向、腔室排列、结构不对称性和内部约束层。然后讨论了设计选择如何影响运动输出、力产生、空气需求、可重复性、耐久性、制造难度和机器人集成。本文进一步确定了在选择或比较执行器时必须考虑的关键条件,包括压力、负载条件、执行器尺寸、气动供应和滞后。这种方法有助于解释为什么具有相似运动输出的执行器在设计要求、气动需求和实际适用性上可能存在显著差异。它还突出了在可穿戴、生物医学和移动机器人应用中实现紧凑、高效、可重复和可部署的软体气动系统所需的设计优先级。

英文摘要

Soft pneumatic actuators are widely used in soft robotics because they can produce large motions while remaining compliant enough to interact safely with objects, environments, and the human body. However, their performance is not solely determined by pressure. Instead, the response depends on the way the actuator is built, including the shape of its chambers, the placement of reinforcements, the use of folds, material stiffness, and the constraints that guide its deformation. As the literature has expanded, it has become more difficult to determine which mechanism is most suitable for a given application and which reported results can be compared across studies. This review examines soft pneumatic actuators according to the design strategies used to generate four motion classes: linear, bending, twisting, and omnidirectional actuation. For each class, it analyzes the structural features that define the deformation path, including braid angle, fold geometry, fiber orientation, chamber arrangement, structural asymmetry, and internal constraint layers. It then discusses how design choices affect motion output, force generation, air demand, repeatability, durability, fabrication difficulty, and robotic integration. The review further identifies key conditions that must be considered when selecting or comparing actuators, including pressure, loading condition, actuator size, pneumatic supply, and hysteresis. This approach helps explain why actuators with similar motion outputs may differ substantially in design requirements, pneumatic demand, and practical suitability. It also highlights the design priorities needed for compact, efficient, repeatable, and deployable soft pneumatic systems in wearable, biomedical, and mobile robotic applications.

2605.25044 2026-05-26 cs.RO 版本更新

X-DiffVLA: X-Embodied Diffusion Action Heads for Vision-Language-Action Models

X-DiffVLA:面向视觉-语言-动作模型的跨具身扩散动作头

Boyu Li, Chaoyi Xu, Haoqi Yuan, Xinrun Xu, Börje F. Karlsson, Dongbin Zhao, Haoran Li, Zongqing Lu

发表机构 * SKL-MAIS, Institute of Automation, Chinese Academy of Sciences(SKL-MAIS,自动化研究所,中国科学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(人工智能学院,中国科学院大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院) BeingBeyond School of Computer Science, Peking University(北京大学计算机学院)

AI总结 针对跨具身数据学习通用策略的挑战,提出X-DiffVLA模型,通过扩散模型和具身强制技术实现异构末端执行器间的知识迁移,在RoboCasa和Isaac Gym上分别提升15.3%和12.5%。

详情
AI中文摘要

从跨具身数据中学习通用策略仍然是机器人学中的基本挑战。尽管视觉-语言-动作(VLA)模型在大型多样化数据集上进行了预训练,但它们通常依赖于具身特定的微调才能在下游任务中实现强性能。这一要求严重限制了它们的泛化能力,并阻碍了执行相似任务的具身之间的知识迁移。为了克服这些限制,我们聚焦于共享机器人基座和异构末端执行器的跨具身设置,并提出X-DiffVLA,一种具有统一跨具身动作头的基于扩散的VLA模型。X-DiffVLA能够利用扩散模型的生成优势来捕捉跨具身数据集中的多样性和潜在相关性。具体地,我们引入了具身强制(Embodiment Forcing),一种无分类器引导技术,以隐式地将动作生成导向具身特定的功能组件,无需显式监督即可捕捉细粒度的结构细微差别。此外,设计了形态树扩散(Morphological Tree Diffusion)方法来增强不同末端执行器之间的行为相关性,最大化异构演示的可迁移性。在RoboCasa和Isaac Gym上的实验结果覆盖了从夹爪到灵巧手的多种具身,表明X-DiffVLA达到了最先进的性能,分别提升了15.3%和12.5%。真实世界评估进一步验证了所提出框架的鲁棒性及其在可扩展跨具身策略学习中的有效性。

英文摘要

Learning universal policies from cross-embodied data remains a fundamental challenge in robotics. Although Vision-Language-Action (VLA) models are pre-trained on large and diverse datasets, they typically rely on embodiment-specific fine-tuning to achieve strong performance in downstream tasks. This requirement severely limits their generalization capability and restricts knowledge transfer across embodiments performing similar tasks. To overcome these limitations, we focus on cross-embodied settings with shared robotic bases and heterogeneous end-effectors, and propose X-DiffVLA, a diffusion-based VLA model featuring a unified cross-embodied action head. X-DiffVLA can leverage the generative strengths of diffusion models to capture both the diversity and latent correlations in cross-embodied datasets. Specifically, we introduce Embodiment Forcing, a classifier-free guidance technique to implicitly steer action generation toward embodiment-specific functional components, capturing fine-grained structural nuances without explicit supervision. In addition, a Morphological Tree Diffusion approach is designed to strengthen behavioral correlations across diverse end-effectors, maximizing the transferability of heterogeneous demonstrations. Experimental results across RoboCasa and Isaac Gym, covering different embodiments from grippers to dexterous hands, show that X-DiffVLA achieves state-of-the-art performance, with improvements of 15.3% and 12.5%, respectively. Real-world evaluations further validate the robustness of the proposed framework and its effectiveness in scalable cross-embodied policy learning.
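Since Embodiment Forcing is described as a classifier-free guidance technique, the generic guidance rule it builds on can be sketched as follows. This shows only the standard combination of conditional and unconditional noise predictions; how X-DiffVLA defines the embodiment condition, and which guidance scale it uses, are not given in the abstract, so `cfg_combine` and its arguments are hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: move the unconditional noise prediction
    toward (and, for scales above 1, past) the conditional one.

    eps_uncond: denoiser output with the condition dropped.
    eps_cond:   denoiser output with the (assumed) embodiment condition supplied.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# A scale of 0 ignores the condition, 1 reproduces the conditional prediction,
# and larger values extrapolate beyond it.
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 0.0))
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 1.0))
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0))
```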

2605.25041 2026-05-26 cs.RO 版本更新

RAMBA: 4D Radar Mapping by Bundle Adjustment

RAMBA: 通过束调整的4D雷达建图

Jianzhu Huai, Yiwen Chen, Binliang Wang

发表机构 * State Key Lab of Info Engineering in Surveying, Mapping and Remote Sensing(信息工程测绘遥感国家重点实验室)

AI总结 提出RAMBA框架,利用束调整联合优化雷达帧状态,结合协方差加权几何残差、IMU预积分因子和雷达自速度约束,实现全局一致的4D雷达建图。

Comments 5 pages, 2 figures, to present in ISPRS2026 Thematic Session 10 on Radar Perception

详情
AI中文摘要

4D雷达在机器人建图中越来越有吸引力,因为它提供距离、方位角、仰角和多普勒测量,同时在恶劣可见度条件下保持鲁棒性。尽管最近的雷达和雷达-惯性里程计方法已经实现了有前景的在线状态估计性能,但4D雷达的离线全局地图优化仍未得到充分探索。本文提出了RAMBA,一种用于全局一致4D雷达建图的雷达束调整框架。给定来自雷达-惯性里程计前端的初始位姿和雷达帧,RAMBA使用协方差加权几何残差、IMU预积分因子和雷达自速度约束联合优化雷达帧状态。几何残差通过跨选定帧形成基于体素的对应关系,并用点协方差加权每个残差,将成对GICP扩展到多帧优化。为了提高对漂移和重访的鲁棒性,RAMBA在对应关系形成过程中强制时间一致性,同时明确支持闭环约束。在ColoRadar和SNAIL Radar数据集上的实验表明,与雷达-惯性里程计和位姿图优化基线相比,RAMBA提高了地图一致性并通常提升了轨迹精度。

英文摘要

4D radar is increasingly attractive for robotic mapping because it provides range, azimuth, elevation, and Doppler measurements while remaining robust in adverse visibility conditions. Although recent radar and radar-inertial odometry methods have achieved promising online state estimation performance, offline global map refinement for 4D radar remains underexplored. This paper presents RAMBA, a radar bundle-adjustment framework for globally consistent 4D radar mapping. Given initial poses and radar frames from a radar-inertial odometry front-end, RAMBA jointly refines radar frame states using covariance-weighted geometric residuals, IMU preintegration factors, and radar ego-velocity constraints. The geometric residuals extend pairwise GICP to a multi-frame optimization by forming voxel-based correspondences across selected frames and weighting each residual with point covariances. To improve robustness against drift and revisits, RAMBA enforces temporal consistency during correspondence formation while explicitly supporting loop-closure constraints. Experiments on the ColoRadar and SNAIL Radar datasets show that RAMBA improves map consistency and usually enhances trajectory accuracy over radar-inertial odometry and pose-graph optimization baselines.

2605.24985 2026-05-26 cs.RO cs.LG physics.comp-ph 版本更新

Learning, locomotion, and navigation of soft synthetic snakes in three-dimensional, heterogeneous environments

软体合成蛇在三维异质环境中的学习、运动与导航

Xiaotian Zhang, Ali Albazroun, Tixian Wang, Songyuan Cui, Prashant G. Mehta, Mattia Gazzola

发表机构 * Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana–Champaign(卡尔·R·乌斯基因组生物学研究所,伊利诺伊大学厄巴纳-香槟分校) Department of Mechanical and Aerospace Engineering, Hong Kong University of Science and Technology(香港科技大学机械与航空航天工程系) Department of Mechanical Science and Engineering, University of Illinois Urbana–Champaign(伊利诺伊大学厄巴纳-香槟分校机械科学与工程系)

AI总结 提出基于仿生驱动和感知模型的强化学习框架,使软体合成蛇能够自主导航非结构化三维地形,并通过高保真环境验证鲁棒性。

Comments 14 pages, 5 figures

详情
AI中文摘要

无肢陆地动物表现出卓越的运动多样性和控制能力,目前尚无法被工程对应物所超越。在这里,我们引入了一个计算框架,使软体合成蛇能够导航非结构化的、异质的三维地形。我们的方法基于仿生驱动和感知模型,这些模型降低了高自由度连续体固有的控制复杂性。这些模型被集成到强化学习架构中,以推导出穿越环境的策略。训练首先在简化的同质地形中进行,以学习运动基元。然后,这些基元被组合成针对复杂地形的自适应策略。我们通过将蛇部署在从真实世界成像重建的高保真三维环境中来展示鲁棒性,实现了可靠的导航。总体而言,这项工作为自然地形中连续系统的控制提供了一个物理真实的仿真平台和实用见解。

英文摘要

Limbless terrestrial animals exhibit exceptional locomotor versatility and control, currently unmatched by engineered counterparts. Here, we introduce a computational framework that enables soft synthetic snakes to navigate unstructured, heterogeneous 3D terrains. Our approach is grounded in bio-inspired actuation and sensing models that reduce the control complexity inherent to high-degree-of-freedom, continuum bodies. These models are integrated into a reinforcement learning architecture to derive environment-traversing policies. Training first occurs in simplified, homogeneous terrains to learn locomotion primitives. These are then composed into adaptive strategies for complex landscapes. We demonstrate robustness by deploying a snake in high-fidelity 3D environments reconstructed from real-world imaging, achieving reliable navigation. Overall, this work provides a physically-realistic simulation platform and practical insights for the control of continuum systems in natural terrains.

2605.24980 2026-05-26 cs.RO 版本更新

Loosely Coupled Factor Graph Optimization for Pseudolite-Augmented Navigation

松耦合因子图优化用于伪卫星增强导航

Chih-Chun Chen, Lipeng Tan, Shiyu Bai, Heike Vallery

发表机构 * Federal Ministry for Economic Affairs and Energy, Germany(德国经济事务和能源部长办公厅) German Federal Ministry of Research, Technology, and Space (BMFTR)(德国联邦研究、技术和空间部长办公厅) Robotics Institute Germany (RIG)(德国机器人研究所)

AI总结 提出一种松耦合因子图优化框架,融合GNSS/伪卫星最小二乘解与IMU数据,在低可见度环境下相比标准最小二乘方法将平均三维误差降低22.8%至41.3%。

详情
AI中文摘要

在全球导航卫星系统(GNSS)退化环境中,伪卫星(PL)提供额外的信号源以增强定位性能,但它们在基于优化的框架中的集成仍然有限。本文提出了一种松耦合因子图优化(FGO)框架,该框架融合了GNSS/PL最小二乘(LS)解与惯性测量单元(IMU)数据。评估考虑了低GNSS可见度场景,包括四颗高仰角GNSS卫星和最多两个PL发射机,时间窗口为80秒。与标准LS方法相比,FGO实现了平均三维误差降低22.8%至41.3%。与GNSS-IMU基线相比,加入PL发射机进一步提高了定位精度,性能取决于几何配置。

英文摘要

In Global Navigation Satellite System (GNSS)-degraded environments, pseudolites (PLs) provide additional signal sources to enhance positioning performance, but their integration in optimization-based frameworks remains limited. This paper presents a loosely coupled factor graph optimization (FGO) framework that fuses the GNSS/PL least-squares (LS) solutions with inertial measurement unit (IMU) data. The evaluation considers low GNSS visibility scenarios with four high-elevation GNSS satellites and up to two PL transmitters over an 80 s window. FGO achieves a 22.8% to 41.3% reduction in mean 3D error compared to standard LS methods. Compared to a GNSS-IMU baseline, incorporating PL transmitters further improves positioning accuracy, with performance depending on geometry.

2605.24975 2026-05-26 cs.RO cs.AI cs.LG 版本更新

Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion

弥合差距:实现软演员-评论家算法用于高性能腿部运动

Gianluca Sabatini, Chenhao Li, Marco Hutter

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 本文通过识别软演员-评论家(SAC)在并行训练中性能不足的根本原因,并提出策略初始化、超时感知评论家目标和多步回报估计等改进,使其在腿部运动任务中达到与近端策略优化(PPO)相当的性能。

详情
AI中文摘要

近端策略优化(PPO)由于其在IsaacLab等大规模并行仿真环境中的鲁棒性和可扩展性,已成为训练腿部机器人的事实标准。然而,其基于策略的性质使其天生样本效率低下,阻碍了其在真实硬件上的持续适应和微调。相比之下,软演员-评论家(SAC)是一种可以重用过去经验的离策略算法,使其成为模拟到现实迁移工作流程的自然候选,其中同一算法既可用于仿真,也可用于真实机器人的在线学习。尽管有这些优势,SAC在大规模并行训练设置中始终未能匹配PPO的经验性能。本工作确定了这一差距的根本原因,并引入了针对性的修改,包括策略初始化、超时感知评论家目标和多步回报估计,使SAC能够稳定地大规模训练。在多个腿部机器人平台和多样化的运动任务上评估,我们的方法完全弥合了与PPO的性能差距。

英文摘要

Proximal Policy Optimization (PPO) has become the de facto standard for training legged robots, thanks to its robustness and scalability in massively parallel simulation environments like IsaacLab. However, its on-policy nature makes it inherently sample-inefficient, preventing its use for continuous adaptation and fine-tuning on real hardware. Soft Actor-Critic (SAC), by contrast, is an off-policy algorithm that can reuse past experience, making it a natural candidate for sim-to-real transfer workflows where the same algorithm can be used both in simulation and for online learning on the real robot. Despite these advantages, SAC has consistently failed to match PPO's empirical performance in massively parallel training settings. This work identifies the root causes of this gap and introduces targeted modifications, covering policy initialization, timeout-aware critic targets, and multi-step return estimation, that enable SAC to train stably at scale. Evaluated across multiple legged robot platforms and diverse locomotion tasks, our approach closes the performance gap with PPO entirely.
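Two of the listed modifications, timeout-aware critic targets and multi-step returns, can be combined in one small sketch. It is an illustration under assumptions rather than the authors' implementation: the function name, argument layout, and the choice to cut the n-step window at the first episode boundary are hypothetical, and `next_values` is assumed to already hold the critic's (soft) value estimate for the state reached after each step.

```python
def n_step_target(rewards, terminated, timed_out, next_values, gamma=0.99):
    """Discounted n-step critic target that treats timeouts differently from
    true terminations.

    rewards[i]:     reward of step i in the window.
    terminated[i]:  True if step i ended the episode for a task reason (e.g. a fall).
    timed_out[i]:   True if step i hit the episode time limit.
    next_values[i]: assumed critic estimate for the state reached after step i.
    """
    target, discount = 0.0, 1.0
    for i, r in enumerate(rewards):
        target += discount * r
        discount *= gamma
        if terminated[i]:
            # True terminal state: no future return exists, so do not bootstrap.
            return target
        if timed_out[i] or i == len(rewards) - 1:
            # Timeout or end of window: the task would have continued, so bootstrap.
            return target + discount * next_values[i]
    return target


print(n_step_target([1.0, 1.0], [False, False], [False, False], [0.0, 4.0], gamma=0.5))
```

The distinction matters in massively parallel simulation because most episodes end by time limit rather than failure; treating those as terminal would teach the critic that value drops to zero near the time limit.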

2605.24950 2026-05-26 cs.RO cs.LG 版本更新

ARCANE-PedSynth: Synthetic Multi-Pedestrian Datasets with Behavioural Crossing Annotations

ARCANE-PedSynth:具有行为穿越注释的合成多行人数据集

Muhammad Naveed Riaz, Maciej Wielgosz, Antonio M. López Peña

发表机构 * Computer Vision Center (CVC), Universitat Autònoma de Barcelona (UAB), Bellaterra, Barcelona, Spain Institute of Electronics, Faculty of Computer Science, Electronics and Telecommunications, AGH University of Krakow, Kraków, Poland

AI总结 提出基于CARLA的开源框架ARCANE-PedSynth,通过混合AI-手动控制架构和12状态行为有限状态机生成高穿越率的多行人合成数据,支持RGB、LiDAR和DVS模态及行为标注,用于自动驾驶中的行人穿越预测。

详情
AI中文摘要

我们提出ARCANE-PedSynth,一个基于CARLA的开源软件框架,用于生成具有密集行为注释的合成多行人数据集,以支持自动驾驶中的行人穿越预测。该框架通过混合AI-手动行人控制架构克服了CARLA原生9%的穿越率,可实现高达75%的可配置目标穿越率。一个包含五种角色原型的12状态行为有限状态机产生了多样化的穿越行为。该框架生成同步的RGB、LiDAR和DVS数据,并带有每帧穿越标签、行为状态和估计的2D姿态关键点。我们通过PedSynth++(一个使用该框架生成的示例数据集)展示了ARCANE-PedSynth,该数据集包含533个多行人片段,覆盖12种天气条件,并带有RGB、LiDAR和DVS流。ARCANE-PedSynth通过CLI参数化和Docker容器化实现完全可重复性。

英文摘要

We present ARCANE-PedSynth, an open-source CARLA-based software framework for generating synthetic multi-pedestrian datasets with dense behavioural annotations for pedestrian crossing prediction in autonomous driving. The framework overcomes CARLA's native 9% crossing rate through a hybrid AI-manual pedestrian control architecture, enabling configurable target rates up to 75%. A 12-state behavioural finite state machine with five character archetypes produces diverse crossing behaviours. The framework generates synchronised RGB, LiDAR, and DVS data with per-frame crossing labels, behavioural states, and estimated 2D pose keypoints. We demonstrate ARCANE-PedSynth through PedSynth++, an example dataset generated with the framework, comprising 533 multi-pedestrian clips across 12 weather conditions with RGB, LiDAR, and DVS streams. ARCANE-PedSynth is fully reproducible via CLI parameterisation and Docker containerisation.

2605.24931 2026-05-26 cs.RO 版本更新

Learning High-Frequency Continuous Action Chunks in Latent Space

在潜在空间中学习高频连续动作块

Kunyun Wang, Yuhang Zheng, Yupeng Zheng, Jieru Zhao, Wenchao Ding

发表机构 * School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(上海交通大学计算机科学学院) National University of Singapore, Singapore(新加坡国立大学) Institute of Automation, Chinese Academy of Sciences, Beijing, China(中国科学院自动化研究所) Fudan University, Shanghai, China(复旦大学)

AI总结 本文提出通过变分自编码器将高频动作学习从动作空间转移到潜在空间,并引入Reuse-then-Refine块级精炼策略,以提升高频控制的时间与空间一致性,实现复杂接触任务的平滑执行。

Comments 17 pages, 10 figures

详情
AI中文摘要

现代机器人策略越来越依赖动作块来在物理世界中执行复杂任务。虽然动作块在中等动作频率下提高了时间一致性,但当动作频率进一步增加(例如到60 Hz)时,它变得不足。在这样的高频下,策略常常无法生成既时间平滑又空间一致的动作。我们通过使用变分自编码器(VAE)将高频动作学习从动作空间转移到潜在空间来解决这一挑战。这种表述显著提高了高频控制的时间与空间一致性。为了实现平滑的实时执行,我们进一步引入了Reuse-then-Refine,一种块级精炼策略,在异步推理下改善相邻动作块之间的连续性。因此,由我们的策略控制的机器人可以连续执行复杂的接触丰富任务,减少停顿和抖动。在三个真实世界的接触丰富机器人任务上的实验表明,我们的方法能够以平滑的动作一致地完成任务。我们的代码和数据可在 https://github.com/tars-robotics/RTR 获取。

英文摘要

Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is further increased (e.g., to 60 Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting high-frequency action learning from the action space to a latent space with a variational autoencoder (VAE). This formulation significantly improves both temporal and spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with fewer pauses and less jerky motion. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes tasks with smooth motions. Our code and data are available at https://github.com/tars-robotics/RTR.

2605.24924 2026-05-26 cs.RO 版本更新

Dynamic Neural Koopman Distillation for Real-Time Robot Control Using Diffusion Models

动态神经Koopman蒸馏:基于扩散模型的实时机器人控制

Lei Zheng, Peiqi Yu, Zengqi Peng, Changliu Liu, Armin Lederer

发表机构 * Department of Electrical and Computer Engineering, National University of Singapore(新加坡国立大学电气与计算机工程系) Department of Electrical and Computer Engineering, Carnegie Mellon University(卡内基梅隆大学电气与计算机工程系) Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology(香港科技大学机器人与自主系统学域)

AI总结 提出动态神经Koopman蒸馏框架,将多步扩散推理蒸馏为单步前向传递,通过因子化动态Koopman层保留多模态表达能力,在D4RL MuJoCo和物理机器人上实现毫秒级延迟的闭环控制。

Comments 8 pages, 5 figures

详情
AI中文摘要

扩散模型在生成多样化和多模态轨迹用于机器人规划方面表现出色,但其迭代去噪过程引入了与高频闭环控制不兼容的延迟。为了解决这个问题,我们提出了动态神经Koopman蒸馏,这是一个将多步扩散推理蒸馏为单步前向传递的框架,同时保留了教师模型的多模态表达能力。具体来说,我们引入了一个因子化动态Koopman层,通过具有状态依赖模态增益的因子化潜在转移来建模去噪过程。我们在标准D4RL MuJoCo运动基准测试和一个物理Kinova机械臂上评估了所提出的方法,并与单步基线进行了比较。结果表明,我们的方法在报告的运动任务上显著优于现有的单步蒸馏方法,并且与教师策略相比,将推理延迟降低到毫秒级别。硬件实验进一步证明,我们的方法能够在保持任务成功和相当准确性的同时,实现平滑且快速的闭环执行。项目页面可在 https://fdkoopman.github.io/ 获取。

英文摘要

Diffusion models excel at generating diverse and multimodal trajectories for robotic planning, yet their iterative denoising process introduces latency that is incompatible with high-frequency closed-loop control. To address this problem, we propose Dynamic Neural Koopman Distillation, a framework that distills multistep diffusion inference into a single forward pass while retaining the multimodal expressivity of the teacher model. Specifically, we introduce a Factorized Dynamic Koopman layer that models the denoising process through a factorized latent transition with state-dependent modal gains. We evaluate the proposed method on standard D4RL MuJoCo locomotion benchmarks and a physical Kinova manipulator, comparing against one-step baselines. The results show that our method significantly outperforms existing one-step distillation approaches on the reported locomotion tasks, and reduces the inference latency to the millisecond regime compared with the teacher policy. Hardware experiments further demonstrate that our method enables smooth and fast closed-loop execution while maintaining task success and comparable accuracy. A project page is available at https://fdkoopman.github.io/.

2605.24922 2026-05-26 cs.RO 版本更新

MuJoCoUni: Persistent Batched Runtime Primitives for MuJoCo

MuJoCoUni:MuJoCo的持久化批处理运行时原语

Yufei Jia, Junzhe Wu

发表机构 * Tsinghua University(清华大学)

AI总结 提出MuJoCoUni,一个用于在线机器人学习和批处理物理评估的MuJoCo下游发行版,通过BatchEnvPool提供有状态环境执行的运行时原语,支持高吞吐并行执行并保持上游语义。

Comments Technical report

详情
AI中文摘要

我们提出MuJoCoUni,一个用于在线机器人学习和批处理物理评估的MuJoCo下游发行版。除了上游mujoco.rollout已经提供的开环批处理轨迹生成外,MuJoCoUni还提供了用于有状态环境执行的运行时原语。目标工作负载需要高吞吐并行执行,同时保留上游CPU MuJoCo在模型、传感器、接触和约束方面的语义。其核心对象BatchEnvPool是一个C++/pybind11执行器,拥有每个环境的mjModel副本、每个线程的mjData工作线程以及一个内部线程池。它提供仅最终状态的短步进、稀疏重置、重置生命周期域随机化、不推进动力学的批处理传感器前向评估,以及批处理雅可比矩阵和高度场查询。该实现仅限于Python绑定层;MuJoCo的求解器、接触模型、积分器和核心源代码树保留上游语义。本报告描述了BatchEnvPool API、实现边界、与rollout的关系,以及随开源mujoco-uni包一起提供的验证和基准测试脚本,该包可通过 pip install mujoco-uni 安装。

英文摘要

We present MuJoCoUni, a downstream MuJoCo distribution for online robot learning and batched physics evaluation. Alongside the open-loop batched trajectory generation already provided by upstream mujoco.rollout, MuJoCoUni supplies runtime primitives for stateful environment execution. The target workloads need high-throughput parallel execution while retaining upstream CPU MuJoCo semantics for models, sensors, contact, and constraints. Its core object, BatchEnvPool, is a C++/pybind11 executor that owns per-environment mjModel copies, per-thread mjData workers, and an internal thread pool. It provides final-state-only short stepping, sparse reset, reset-lifecycle domain randomization, batched sensor forward evaluation without advancing dynamics, and batched Jacobian and height-field queries. The implementation is confined to the Python binding layer; MuJoCo's solver, contact model, integrator, and core source tree retain upstream semantics. This report describes the BatchEnvPool API, implementation boundary, relationship to rollout, and the validation and benchmark scripts shipped with the open-source mujoco-uni package, which can be installed via pip (pip install mujoco-uni).

2605.24860 2026-05-26 eess.SY cs.AI cs.ET cs.LG cs.RO cs.SY 版本更新

DBPnet: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Wheel Load Estimation

DBPnet:基于阻尼特性的贝叶斯物理信息神经网络用于车轮载荷估计

Tianyi Wang, Tianyi Zeng, Zimo Zeng, Feiyang Zhang, Yujin Wang, Xiangyu Li, Yiming Xu, Sikai Chen, Junfeng Jiao, Christian Claudel, Xinbo Chen

发表机构 * Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin(德克萨斯大学奥斯汀分校土木、建筑与环境工程系) School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(上海交通大学自动化与智能感知学院) College of Electrical Engineering, Zhejiang University(浙江大学电气工程学院) School of Automotive Studies, Tongji University(同济大学汽车学院) School of Architecture, The University of Texas at Austin(德克萨斯大学奥斯汀分校建筑学院) Department of Civil and Environmental Engineering, University of Wisconsin-Madison(威斯康星大学麦迪逊分校土木与环境工程系)

AI总结 提出DBPnet,一种结合阻尼特性嵌入模块的贝叶斯物理信息神经网络,通过悬架连杆级建模和物理信息损失函数,实现鲁棒的车轮载荷估计。

Comments 14 pages, 12 figures, 6 tables

详情
AI中文摘要

高级驾驶辅助系统(ADAS)在现代汽车智能化中扮演重要角色,显著提升车辆安全性和稳定性。ADAS的性能关键依赖于准确可靠的车辆状态估计,特别是来自车辆动态传感器的信号。在这些信号中,车轮载荷是底盘控制和安全关键功能的关键变量,但由于复杂的悬架几何结构、非线性动力学和测量噪声,难以鲁棒估计。为解决此问题,我们提出DBPnet,一种贝叶斯物理信息神经网络(PINN),其具有受阻尼特性启发的物理感知嵌入模块。首先,本文提出一种悬架连杆级建模(SLLM)方法,通过显式考虑悬架的复杂几何结构,构建非线性瞬时动态模型。在SLLM基础上,将贝叶斯推断集成到PINN中,有效应对车辆底盘系统中的噪声和不确定性,从而提高模型的鲁棒性。然后,采用物理信息损失函数确保与基本物理原理的一致性,同时受阻尼特性启发的嵌入模块提取输入信号的时间变化特征,并将其融入PINN的每一层,确保物理观测指导神经网络而不受固定物理模型的约束。在高保真仿真和真实世界实验上的广泛评估表明,我们的DBPnet在RMSE和MaxError上始终低于基线方法。这些结果凸显了我们的DBPnet在推进车轮载荷估计和为更可靠的ADAS执行器功能发展做出贡献的潜力。

英文摘要

Advanced driver assistance systems (ADAS) play an important role in modern automotive intelligence, significantly enhancing vehicle safety and stability. The performance of ADAS critically relies on accurate and reliable vehicle state estimation, particularly from vehicle dynamic sensors. Among these signals, wheel load is a key variable for chassis control and safety-critical functions, yet it remains difficult to estimate robustly due to complex suspension geometry, nonlinear dynamics, and measurement noise. To address this issue, we propose DBPnet, a Bayesian physics-informed neural network (PINN) with a physics-aware embedding module inspired by damper characteristics. First, this paper presents a suspension linkage-level modeling (SLLM) approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon SLLM, Bayesian inference is integrated into the PINN to effectively cope with noise and uncertainty in the vehicle chassis system, thereby improving the model's robustness. Then, a physics-informed loss function is employed to ensure consistency with fundamental physical principles, while the damper characteristics-inspired embedding module extracts temporal variation features of input signals and incorporates them into each layer of the PINN, ensuring that physical observations guide the neural network without being constrained by fixed physical models. Extensive evaluations on high-fidelity simulations and real-world experiments demonstrate that our DBPnet consistently achieves lower RMSE and MaxError than baseline methods. These results highlight the potential of our DBPnet to advance wheel load estimation and contribute to the development of more reliable ADAS actuator functions.

2605.24813 2026-05-26 cs.RO cs.SY eess.SY 版本更新

Manifold-Constrained MPPI: Real-Time Sampling-Based Control Under Hard Constraints

流形约束MPPI:硬约束下的实时采样控制

Seulchan Lee, Sanghyun Kim

发表机构 * School of Mechanical Engineering, Kyung Hee University(庆熙大学机械工程学院) Advanced Institute of Convergence Technology(融合技术高级研究院)

AI总结 提出流形约束MPPI(MC-MPPI),通过变分自编码器学习约束流形的低维表示,结合二次规划控制器,实现实时硬约束满足。

Comments International Journal of Control, Automation, and Systems

详情
AI中文摘要

基于采样的模型预测控制方法,如模型预测路径积分(MPPI),在复杂机器人系统中提供了无导数优化和鲁棒性。然而,标准MPPI依赖于基于成本的软惩罚,无法保证硬约束满足,严重限制了其在高度约束任务(如闭链操作)中的适用性。为解决这一问题,我们提出了流形约束MPPI(MC-MPPI),一种实时采样控制框架,在保持MPPI计算优势的同时强制执行基于流形的等式约束。关键思想是将约束最优控制问题解耦为潜在空间规划和执行级校正。在规划阶段,变分自编码器(VAE)学习约束流形的低维潜在表示,使MPPI能够高效生成接近可行的候选轨迹,无需逐样本修改。由于该参考能够精确线性化等式约束,执行级二次规划(QP)控制器通过单次求解而非迭代投影来解决残余流形不匹配。在14自由度闭链双臂系统上的仿真和实际实验表明,MC-MPPI以100 Hz稳定运行,可靠地导航动态环境,同时有效维持硬等式约束,并在跟踪精度上显著优于基线方法。补充视频和实现细节见https://rcilab.github.io/mcmppi。

英文摘要

Sampling-based model predictive control methods, such as Model Predictive Path Integral (MPPI), offer derivative-free optimization and robustness in complex robotic systems. However, standard MPPI relies on cost-based soft penalties that cannot guarantee hard-constraint satisfaction, severely limiting its applicability to highly constrained tasks such as closed-chain manipulation. To address this, we propose Manifold-Constrained MPPI (MC-MPPI), a real-time sampling-based control framework that enforces manifold-based equality constraints while preserving the computational advantages of MPPI. The key idea is to decouple the constrained optimal control problem into latent-space planning and execution-level correction. At the planning stage, a Variational Autoencoder (VAE) learns a low-dimensional latent representation of the constraint manifold, enabling MPPI to efficiently generate near-feasible candidate trajectories without per-sample modification. Since this reference enables accurate linearization of the equality constraints, an execution-level Quadratic Programming (QP) controller resolves the residual manifold mismatch in a single solve rather than through iterative projection. Experiments on a 14-DoF closed-chain dual-arm system in both simulation and real-world settings demonstrate that MC-MPPI operates stably at 100 Hz, reliably navigates dynamic environments while effectively maintaining hard equality constraints, and significantly outperforms baseline methods in tracking accuracy. Supplementary videos and implementation details are available at https://rcilab.github.io/mcmppi.
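The sampling step that the abstract builds on can be illustrated with a minimal sketch of one plain MPPI update. This is the standard soft-penalty MPPI that the paper takes as its starting point, not MC-MPPI itself: there is no VAE latent space and no QP correction here. The toy 1-D integrator, the function names, and every parameter value are assumptions made purely for illustration.

```python
import math
import random


def mppi_update(u_nominal, rollout_cost, num_samples=64, sigma=0.5, lam=1.0, rng=None):
    """One vanilla MPPI iteration: perturb the nominal controls, weight by cost, average."""
    if rng is None:
        rng = random.Random(0)
    horizon = len(u_nominal)
    # Sample Gaussian perturbations around the nominal control sequence.
    noises = [[rng.gauss(0.0, sigma) for _ in range(horizon)] for _ in range(num_samples)]
    costs = [rollout_cost([u + e for u, e in zip(u_nominal, eps)]) for eps in noises]
    # Softmin weights; subtracting the minimum cost keeps the exponentials well scaled.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(weights)
    # The cost-weighted average of the perturbations updates the nominal sequence.
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
        for t, u in enumerate(u_nominal)
    ]


def rollout_cost(controls, x0=0.0, target=1.0, dt=0.1):
    """Hypothetical toy system x' = u; cost is the summed squared distance to the target."""
    x, cost = x0, 0.0
    for u in controls:
        x += u * dt
        cost += (x - target) ** 2
    return cost
```

Because constraints enter such a scheme only through the cost, nothing in the update forces a sampled trajectory to satisfy an equality constraint exactly, which is the limitation the abstract describes.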

2605.24810 2026-05-26 cs.LG cs.AI cs.RO stat.AP 版本更新

Cross-Domain Energy-Guided Diffusion Generation for Off-Dynamics Reinforcement Learning

跨域能量引导扩散生成用于动态偏移强化学习

Yu Yang, Yihong Guo, Anqi Liu, Pan Xu

发表机构 * Duke University(杜克大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出CEDGE框架,利用能量引导扩散模型生成目标域轨迹,解决动态偏移下离线强化学习的域适应问题。

Comments 29 pages, 3 figures, and 14 tables

详情
AI中文摘要

动态偏移离线强化学习旨在从大规模源数据集和有限目标数据集中学习目标域策略,但面临转移动态不匹配的问题。现有方法如奖励增强和数据过滤受限于源数据集,无法合成新的目标行为以改善超出收集源轨迹的覆盖范围。虽然近期基于模型的方法尝试通过学习目标感知动态来解决此问题,但生成的体验仅在转移层面构建,导致长时域上的累积误差。这些限制促使动态偏移离线RL转向轨迹级生成。我们提出CEDGE,一种跨域能量引导扩散生成框架。CEDGE在源域轨迹上训练轨迹扩散模型,并通过能量引导将生成样本适应到目标域。该引导通过最小化源域与期望目标域轨迹之间的分布不匹配得到,并分解为回报、域和行为能量成分。得到的能量引导轨迹既可用于直接规划,也可作为策略学习的合成数据。由于目标适应通过能量引导而非重新训练扩散模型实现,与先前方法相比,CEDGE能高效适应新的目标动态。在ODRL基准上的实验表明,轨迹级能量引导生成改善了动态偏移下的扩散规划,并产生提升下游目标策略学习的合成数据。

英文摘要

Off-dynamics offline reinforcement learning seeks to learn a target-domain policy from a large source dataset and a limited target dataset under mismatched transition dynamics. Existing approaches such as reward augmentation and data filtering are constrained to the source dataset and cannot synthesize new target behavior to improve coverage beyond the collected source trajectories. While recent model-based methods attempt to address this by learning target-aware dynamics, the generated experience is constructed only at the transition level, which leads to accumulated errors over long horizons. These limitations necessitate a shift toward trajectory-level generation for off-dynamics offline RL. We propose CEDGE, a Cross-domain Energy-guided Diffusion GEneration framework. CEDGE trains a trajectory diffusion model on source-domain trajectories and adapts the generated samples to the target domain through energy guidance. This guidance is derived by minimizing the distribution mismatch between the source and desired target-domain trajectories and is decomposed into return, domain, and behavior energy components. The resulting energy-guided trajectories are useful both for direct planning and as synthetic data for policy learning. Since target adaptation is achieved via energy guidance rather than retraining the diffusion model, CEDGE can be efficiently adapted to new target dynamics compared to previous methods. Experiments on the ODRL benchmark demonstrate that trajectory-level energy-guided generation improves diffusion planning under dynamics shifts and produces synthetic data that improves downstream target policy learning.

2605.24777 2026-05-26 cs.RO 版本更新

MR-LiDAR: A Multi-Resolution Roadside LiDAR Benchmark for Perception Diagnostics and Deployment Guidance

MR-LiDAR:用于感知诊断和部署指导的多分辨率路边激光雷达基准

Shunlai Cui, Peng Cao, Yuan Zhu, Yongjiang He, Jiacheng Yin, Xiao Huo, Gang Cao, Xiaobo Liu

发表机构 * Intelligent Comprehensive Transportation Key Laboratory of Sichuan Province(四川省智能综合交通运输重点实验室)

AI总结 针对激光雷达选型缺乏实证基准的问题,提出MR-LiDAR多分辨率基准,通过控制光束数和分布等变量,系统分析其对感知性能的影响,并给出选型指导。

Comments 9 pages, 6 figures

详情
AI中文摘要

激光雷达选型是路边感知系统中的关键问题,因为它直接决定了感知能力和部署成本。然而,缺乏用于比较不同激光雷达配置下感知性能的经验基准,极大地限制了科学的传感器选择和部署规划。为填补这一空白,我们提出了MR-LiDAR,一个用于路边感知诊断的受控多分辨率激光雷达基准。在相同的路边场景中,使用16、32、80和128线激光雷达,我们收集了不同距离下各类交通参与者(包括车辆和弱势道路使用者(VRU))的点云和真实标注。这种受控设计将激光雷达的内在规格(特别是线束数和线束分布)隔离为精确性能诊断的关键变量。基于MR-LiDAR,我们进行了系统的实证分析,以考察线束数、线束分布、目标距离、目标类别和车辆遮挡如何影响激光雷达感知性能。结果表明,所有这些因素都有显著影响。特别是,与“更高线束数总是带来更好感知”的常见假设相反,我们发现,具有优化线束分布的80线激光雷达可以匹配甚至超越具有均匀线束分布的128线激光雷达。此外,我们提供了实用的激光雷达选型参考指南,包括目标点计数统计和基于两种广泛使用的检测算法的检测性能比较。这项工作为确定路边感知应用中经济高效的激光雷达配置提供了诊断基准和实用指导。

英文摘要

LiDAR model selection is a critical issue in roadside sensing systems, as it directly determines both perception capability and deployment cost. However, the lack of empirical benchmarks for comparing perception performance across different LiDAR configurations has greatly constrained scientific sensor selection and deployment planning. To address this gap, we present MR-LiDAR, a controlled multi-resolution LiDAR benchmark for roadside perception diagnostics. Using 16-, 32-, 80-, and 128-beam LiDARs in identical roadside scenarios, we collect point clouds and ground-truth annotations for diverse traffic participants, including vehicles and vulnerable road users (VRUs), across varying distances. This controlled design isolates intrinsic LiDAR specifications, particularly beam count and beam distribution, as the key variables for precise performance diagnostics. Based on MR-LiDAR, we conduct systematic empirical analyses to examine how beam count, beam distribution, target distance, object category, and vehicle occlusion affect LiDAR perception performance. The results reveal that all of these factors have substantial impacts. In particular, contrary to the common assumption that higher beam counts always yield better perception, we show that an 80-beam LiDAR with optimized beam distribution can match or even outperform a 128-beam LiDAR with uniform beam distribution. In addition, we provide a practical reference guide for LiDAR selection, including target point-count statistics and detection performance comparisons based on two widely used detection algorithms. This work offers a diagnostic benchmark and practical guidance for determining cost-effective LiDAR configurations in roadside perception applications.

2605.24767 2026-05-26 cs.RO 版本更新

Enhanced INS/GNSS State Estimation using GNSS-Based Acceleration Measurements

增强的INS/GNSS状态估计:利用基于GNSS的加速度测量

Gal Versano, Itzik Klein

发表机构 * Autonomous Navigation and Sensor Fusion Lab(自主导航与传感器融合实验室) Hatter Department of Marine Technologies(海洋技术系) Charney School of Marine Sciences(海洋科学学院) University of Haifa(海法大学)

AI总结 提出利用历史GNSS测量和运动模型提取车辆加速度信息,并集成到INS/GNSS滤波器中以提高定位鲁棒性和精度,在两组真实无人地面车辆数据集上分别实现11.40%和20.74%的平均位置均方根误差改进。

详情
AI中文摘要

精确可靠的导航对于自主地面车辆运行至关重要。标准的INS/GNSS融合依赖于GNSS位置更新,这提供了有限的方位和惯性传感器误差状态的可观测性,特别是在低动态运动期间。在这项工作中,我们提出利用过去的GNSS测量以及运动模型来提取有意义的车辆加速度信息。然后将该加速度测量集成到INS/GNSS滤波器中,以提高其鲁棒性和准确性。所提出的方法在两个来自不同移动平台和惯性传感器等级的真实无人地面车辆数据集上进行了评估。结果表明,相对于标准位置辅助滤波器,定位精度一致提高,在两个数据集上平均位置均方根误差分别提高了11.40%和20.74%。

英文摘要

Accurate and reliable navigation is essential for autonomous ground vehicle operations. Standard INS/GNSS fusion relies on GNSS position updates, which provide limited observability of orientation and inertial sensor error states, particularly during low-dynamic motion. In this work, we propose utilizing past GNSS measurements alongside a motion model to extract meaningful vehicle acceleration information. This acceleration measurement is then integrated into the INS/GNSS filter to improve its robustness and accuracy. The proposed approach is evaluated on two real-world unmanned ground vehicle datasets collected from different mobile platforms and inertial sensor grades. Results demonstrate consistent positioning accuracy improvements relative to the standard position-aided filter, with mean position root mean square error improvements of 11.40% and 20.74% on the two datasets, respectively.
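One simple way to picture "acceleration from past GNSS measurements" is a second-order finite difference over three equally spaced position fixes. The abstract does not specify the motion model the authors use, so the sketch below assumes constant acceleration across the three fixes; the function name and the sample values are hypothetical.

```python
def gnss_acceleration(p_prev2, p_prev1, p_curr, dt):
    """Second-order finite difference of three equally spaced position fixes.

    Assumes constant acceleration over the 2*dt window (an illustrative
    motion model, not necessarily the one used in the paper).
    """
    return tuple(
        (c - 2.0 * b + a) / dt**2 for a, b, c in zip(p_prev2, p_prev1, p_curr)
    )


# Positions generated by p(t) = 0.5 * a * t^2 with a = (2, -1, 0), sampled at t = 0, 1, 2 s.
accel = gnss_acceleration((0.0, 0.0, 0.0), (1.0, -0.5, 0.0), (4.0, -2.0, 0.0), dt=1.0)
print(accel)  # (2.0, -1.0, 0.0)
```

Under this scheme, independent position noise of variance σ² per axis becomes acceleration noise of variance 6σ²/dt⁴, so a practical filter would need to model that amplified noise in the measurement covariance.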

2605.24761 2026-05-26 cs.CV cs.RO 版本更新

Drift-Resistant Navigation World Model with Anchored Epipolar Guidance

抗漂移导航世界模型与锚定对极引导

Po-Chien Luan, Zimin Xia, Wuyang Li, Yang Gao, Alexandre Alahi

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 提出一种抗漂移导航世界模型,通过锚定引导滚动和双向对极几何约束,同时减轻感知漂移和几何漂移,提升长期视觉质量、几何一致性和多视图连贯性。

详情
AI中文摘要

我们提出抗漂移导航世界模型,这是一种生成模型,可减轻传统基于滚动的导航世界模型中的感知漂移和几何漂移。现有方法递归地将生成内容馈送到后续步骤,导致噪声累积和预测退化,即感知漂移。同时,它们的预测通常偏离智能体的运动,导致几何漂移。我们通过将世界模型预测重新设计为锚定引导滚动来解决这两种漂移。我们不顺序滚动每一帧,而是首先预测稀疏的未来锚点,作为稳定的长期目标,然后生成每个块内的中间帧,这些帧以过去上下文和未来锚点为条件。重要的是,这些稀疏锚点还提供几何约束,由双向对极几何支持,以定位中间帧中相应内容应出现的位置。在四个基准上的实验表明,在长期视觉质量、几何一致性和多视图连贯性方面,相对于强基线有一致的改进。这些提升进一步转化为相同规划器下下游规划性能的提高,突显了抗漂移、几何感知预测对于可靠导航世界模型的重要性。

英文摘要

We propose Drift-Resistant Navigation World Model, a generative model that mitigates both perceptual drift and geometric drift in conventional rollout-based navigation world models. Existing methods recursively feed generated content into subsequent steps, causing noise accumulation and degraded predictions, i.e., perceptual drift. Meanwhile, their predictions often deviate from the agent's motion, resulting in geometric drift. We address both types of drift by redesigning world-model prediction as an anchor-guided rollout. Instead of rolling out every frame sequentially, we first predict sparse future anchors that serve as stable long-range targets, and then generate intermediate frames within each chunk conditioned on both past context and future anchors. Importantly, these sparse anchors also provide geometric constraints, supported by bidirectional epipolar geometry, to localize where corresponding content should appear in the intermediate frames. Experiments on four benchmarks demonstrate consistent improvements over strong baselines in long-horizon visual quality, geometric consistency, and multi-view coherence. These gains further translate into improved downstream planning performance under the same planners, highlighting the importance of drift-resistant, geometry-aware prediction for reliable navigation world models.
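The epipolar constraint that the abstract relies on can be sketched with the textbook two-view relation. Given a relative pose with X2 = R·X1 + t and normalized image coordinates, corresponding points satisfy x2ᵀ E x1 = 0 with E = [t]× R. How the paper applies this bidirectionally between anchors and intermediate frames is not described in the abstract, so the helper names and the example pose below are illustrative assumptions only.

```python
def skew(t):
    """Cross-product matrix [t]x such that skew(t) @ v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(m, v):
    """3x3 matrix times 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential_matrix(rotation, translation):
    """E = [t]x R for the pose convention X2 = R X1 + t."""
    return matmul(skew(translation), rotation)


def epipolar_residual(E, x1, x2):
    """x2^T E x1; zero when x2 lies on the epipolar line of x1."""
    line = matvec(E, x1)  # epipolar line in view 2 induced by x1
    return sum(a * b for a, b in zip(x2, line))


# Hypothetical pose: 90-degree rotation about z plus a unit translation along x.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)
E = essential_matrix(R, t)
x1 = (0.125, 0.05, 1.0)  # 3D point (0.5, 0.2, 4.0) seen in view 1
x2 = (0.2, 0.125, 1.0)   # the same point, (0.8, 0.5, 4.0), seen in view 2
```

A correct correspondence gives a residual of zero, while a point placed off the epipolar line gives a nonzero residual, which is the kind of signal that can localize where content should appear in another frame.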

2605.24760 2026-05-26 cs.RO 版本更新

Geometric Workspace Analysis and Transmission-Aware Dynamics of a Serial Spherical Tool for Microsurgery

显微外科用串行球形工具的几何工作空间分析与传动感知动力学

Anestis Mablekos-Alexiou, Lyndon da Cruz, Christos Bergeles

发表机构 * Moorfields Eye Hospital NHS Foundation Trust(摩尔菲尔兹眼科医院 NHS 基金会信托) King’s College London(伦敦国王学院)

AI总结 提出一种用于显微外科的串行球形机构(带额外平移自由度)的运动学与传动感知设计框架,通过解析工作空间公式和传动感知动力学方法实现快速设计评估。

详情
AI中文摘要

我们提出了一种用于显微外科的串行球形机构(带额外平移自由度)的运动学与传动感知设计框架。第一个贡献是解析工作空间公式,提供可达运动的几何洞察,并能够快速选择旋转轴方向而无需数值优化。第二个贡献是一种用于自锁传动机构驱动的动力学感知方法,支持评估指定工作空间几何的扭矩需求。该框架附带一个用于摩擦识别和逆动力学分析的开源软件包。在专为玻璃体视网膜手术设计的机器人工具上进行的实验验证了模型的预测能力,并展示了其在工程设计中的实用价值。

英文摘要

We present a kinematic and transmission-aware design framework for a serial spherical mechanism with an additional translational degree of freedom for microsurgery. The first contribution is an analytical workspace formulation that provides geometric insight into reachable motion and enables rapid selection of rotation axis orientations without numerical optimization. The second contribution is a dynamics-informed methodology for mechanisms driven by self-locking transmissions, supporting evaluation of torque requirements for a prescribed workspace geometry. The framework is accompanied by an open-source software package for friction identification and inverse dynamics analysis. Experiments on a purpose-built robotic tool for vitreoretinal surgery validate the predictive capability of the models and demonstrate their practical utility for engineering design.

2605.24731 2026-05-26 eess.SY cs.RO cs.SY 版本更新

Passivity-based Semi-autonomous Rotational Motion Navigation for Rigid-body Networks: Stability and Human Passivity Analysis

基于无源性的刚体网络半自主旋转运动导航:稳定性与人体无源性分析

Reiji Terunuma, Yuta Nakamura, Takeshi Hatanaka

发表机构 * Institute of Science Tokyo(东京科学大学)

AI总结 提出一种基于无源性的半自主姿态控制框架,通过虚拟领导者和隐身控制实现多机器人系统在SO(3)上的人机交互稳定性,并证明在人体无源性假设下的闭环稳定性。

Comments This work is to be submitted to the 6th Workshop on Cyber-Physical Human Systems (CPHS2026) for possible publication

详情
AI中文摘要

本文提出了一种新颖的基于无源性的半自主姿态控制框架,特别关注定义在特殊正交群$SO(3)$上的姿态运动学。虽然人机交互有助于成功执行复杂任务,但确保$SO(3)$流形上人在回路系统的稳定性仍然是一个尚未解决的挑战。我们首先提出了一种新的控制架构,其中多机器人系统通过所谓的隐身控制保持反馈给人类操作员的平均信息的不变性,并且人类干预通过虚拟领导者进行调解,该虚拟领导者通过基于无源性的姿态同步律与机器人耦合。然后,我们在假设人类表现为无源系统的条件下,严格证明了所提出的人在回路系统的闭环稳定性。为支持这一分析,进行了仿真研究,将人类操作员识别为动态系统,并检查了所识别模型的无源性特性。

英文摘要

This paper presents a novel passivity-based semi-autonomous attitude control framework, with a particular focus on attitude kinematics defined on the special orthogonal group $SO(3)$. While human-robot interaction facilitates the successful execution of complex tasks, ensuring stability of human-in-the-loop systems on the $SO(3)$ manifold remains a largely unsolved challenge. We first propose a new control architecture in which a multi-robot system preserves invariance of the average information fed back to the human operator through so-called stealthy control, and the human intervention is mediated through a virtual leader, which is coupled with the robots via a passivity-based attitude synchronization law. We then rigorously prove closed-loop stability of the proposed human-in-the-loop system under the assumption that the human behaves as a passive system. To support this analysis, simulation studies are conducted to identify the human operator as a dynamical system, and to examine passivity properties of the identified model.

2605.24690 2026-05-26 cs.RO cs.LG 版本更新

Sum of Costs Diffusion with Dynamic Guidance for Motion Planning

运动规划的动态引导代价和扩散模型

Aysu Aylin Kaplan, Özgür Erkent

发表机构 * Computer Engineering Department, Hacettepe University(哈切特佩大学计算机工程系)

AI总结 提出一种基于扩散模型的高泛化运动规划方法,通过总碰撞代价梯度引导去噪过程并动态选择引导起始步,在Mπnets数据集上取得最优性能。

Comments Accepted at the Frontiers of Optimization for Robotics Workshop at the IEEE International Conference of Robotics & Automation (ICRA), 2026

详情
AI中文摘要

机器人操作的运动规划问题可以通过经典方法或深度学习方法来解决。现有方法在泛化到不同场景时面临重大挑战。在本研究中,我们提出了一种具有高泛化能力的方法,该方法使用扩散模型生成无碰撞轨迹,其中去噪过程由总碰撞代价的梯度引导。我们还提出了一种动态选择梯度引导起始步的方法。实验结果表明,通过动态引导扩散模型与碰撞代价之和,能够克服竞争方法面临的泛化问题,提供更鲁棒的性能。所提出的模型在Mπnets数据集的不同测试场景中,相比其他方法取得了最高性能,证明了其有效性。

英文摘要

The motion planning problem for robotic manipulation can be addressed through classical or deep learning approaches. Existing methods face significant challenges in generalizing to diverse settings. In this study, we present a method with high generalization capability that generates collision-free trajectories using diffusion models, where the denoising process is guided by the gradient of the total collision cost. We also present a dynamic approach for choosing the start step of the gradient guidance. Experimental results demonstrate that guiding the diffusion model dynamically with the sum of collision costs offers more robust performance by overcoming the generalization issues faced by competing methods. The proposed model demonstrates its effectiveness by achieving the highest performance among the compared methods on diverse test settings in the Mπnets dataset.
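The idea of guiding a trajectory with the gradient of a summed collision cost can be sketched in isolation from the diffusion model. In the example below the learned denoiser is omitted entirely and only the guidance step is shown; the hinge-style cost, the circular 2-D obstacles, and all names and parameter values are assumptions for illustration, and the paper's dynamic choice of the guidance start step is not modeled.

```python
import math


def collision_cost(point, obstacles, margin=0.1):
    """Sum over obstacles of the hinge penalty max(0, radius + margin - distance)."""
    return sum(
        max(0.0, r + margin - math.hypot(point[0] - cx, point[1] - cy))
        for cx, cy, r in obstacles
    )


def collision_cost_grad(point, obstacles, margin=0.1):
    """Gradient of the summed hinge penalties with respect to the waypoint."""
    gx = gy = 0.0
    for cx, cy, r in obstacles:
        dx, dy = point[0] - cx, point[1] - cy
        dist = math.hypot(dx, dy)
        if 0.0 < dist < r + margin:
            # d/dp of (r + margin - dist) is -(p - c) / dist.
            gx -= dx / dist
            gy -= dy / dist
    return gx, gy


def guided_step(trajectory, obstacles, step_size=0.05):
    """Move every waypoint one gradient-descent step down the total collision cost."""
    out = []
    for p in trajectory:
        gx, gy = collision_cost_grad(p, obstacles)
        out.append((p[0] - step_size * gx, p[1] - step_size * gy))
    return out


def total_cost(trajectory, obstacles):
    """Total collision cost of a trajectory of 2-D waypoints."""
    return sum(collision_cost(p, obstacles) for p in trajectory)
```

In a guided sampler, a step like `guided_step` would typically be interleaved with denoising updates, so that the learned prior keeps the trajectory plausible while the cost gradient pushes it out of collision.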

2605.24643 2026-05-26 cs.RO cs.SY eess.SY 版本更新

Towards Low-Gravity Planetary Exploration using Reinforcement Learning for Walking, Jumping, and In-flight Attitude Control

面向低重力行星探测的强化学习行走、跳跃与飞行姿态控制

Jørgen Anker Olsen, Kostas Alexis

发表机构 * Autonomous Robots Lab(自主机器人实验室) NTNU(挪威科技大学)

AI总结 本文利用强化学习为四足机器人在火星低重力环境下开发行走、垂直跳跃、前向跳跃及飞行姿态控制策略,实现跨越障碍物并安全着陆,仿真与实验验证了策略的有效性。

Comments 16 pages, 16 figures

详情
AI中文摘要

本文提出了用于行星探测场景中动态四足运动的强化学习策略。基于采用五杆腿设计的任务优化四足机器人,我们开发了针对行走、垂直跳跃、前向跳跃和飞行姿态控制的强化学习策略,这些策略明确针对火星上的低重力环境进行了调整。这些策略共同使机器人能够通过协调跳跃和精确的飞行中重新定向来克服比自身更大的障碍物,实现安全着陆。我们通过单轴重新定向测试在Olympus四足机器人上展示了姿态控制策略的Sim2Real迁移,而所有运动策略均在仿真中进行了验证。一个完整的火星探测任务场景展示了在复杂地形上协调策略部署的能力。实验结果显示,在2.6秒内完成90°姿态重新定向,仿真表明在火星重力条件下可实现3.1米的垂直跳跃和3.9米的前向跳跃。补充视频:https://www.youtube.com/watch?v=qlSJ3P87A4A

英文摘要

This paper presents reinforcement learning (RL) policies for dynamic quadrupedal locomotion in planetary exploration scenarios. Building on a task-optimized quadruped with a 5-bar leg design, we develop RL policies for walking, vertical jumping, forward jumping, and in-flight attitude control, explicitly tailored to the reduced gravity on Mars. These policies jointly enable such robots to overcome obstacles larger than themselves through coordinated jumping and precise in-flight reorientation for safe landings. We demonstrate Sim2Real transfer of the attitude control policy on the Olympus quadruped through single-axis reorientation tests, while all locomotion policies are validated in simulation. A complete Mars exploration mission scenario demonstrates coordinated policy deployment across challenging terrain. Experimental results show 90° attitude reorientation in 2.6 seconds, with simulations demonstrating 3.1-meter vertical jumps and 3.9-meter forward jumps under Martian gravity conditions. Supplementary video: https://www.youtube.com/watch?v=qlSJ3P87A4A

2605.24642 2026-05-26 cs.CV cs.RO 版本更新

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

理解几何基础模型对视觉-语言-动作模型的影响

Yurou Yang, Muyuan Lin, Roberto Martin-Martin, Martin Labrie, Shreekant Gayaka, Cheng-Hao Kuo, Luca Carlone

发表机构 * Amazon Personal Robotics Group(亚马逊个人机器人小组) University of Texas at Austin(德克萨斯大学奥斯汀分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文通过线性探测分析量化了视觉-语言-动作模型(VLA)与几何基础模型(GFM)之间的“几何差距”,比较了三种注入几何信息的架构,并研究了非架构因素对几何VLA性能的影响。

详情
AI中文摘要

近期工作探索了视觉-语言-动作模型(VLA)与用于3D重建的几何基础模型(GFM)(如VGGT)交叉领域的新机遇。虽然由此产生的几何VLA通常表现出改进的性能,但仍不清楚:(i) 现代VLA是否已经具备足够的几何理解能力,(ii) 将几何理解注入VLA的最佳架构是什么,以及(iii) 其他影响几何VLA的设计选择的效果。在本文中,我们针对特定的VLA(GR00T-N1.5)和GFM(VGGT)进行了严格的实验分析,以阐明这些问题。我们的第一个贡献是通过基于线性探测的严格分析,形式化了先前工作中关于当前VLA缺乏几何理解的直觉。该分析首次量化了VLA与GFM之间的“几何差距”。我们的第二个贡献是识别并比较了将GFM与VLA桥接的不同策略。我们实现了三种不同的架构,它们在将几何信息注入VLA的方式上有所不同,同时尽可能保持低级实现细节相似,以确保公平比较。最后,我们分析了非架构选择(例如,训练数据、相机数量、重建质量)对几何VLA性能的影响。

英文摘要

Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work's intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.

2605.24622 2026-05-26 cs.RO cs.CV 版本更新

PoseRefer: Pathway-Local Parameters for Semantically Grounded Reference Resolution

PoseRefer: 用于语义基础指代消解的通路-局部参数

Anna Deichler

发表机构 * KTH Royal Institute of Technology(皇家理工学院)

AI总结 提出PoseRefer架构,通过解耦姿态和文本通路并冻结MiniLM类别嵌入,在MM-Conv数据集上实现31.9%的top-1准确率,并揭示融合准确性可能受类别表示伪影影响。

Comments ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction

详情
AI中文摘要

一个机器人解析“把杯子放在那个上面”必须融合手势、语言和场景几何,然而3D基础基准测试仅部分捕获了这一情况:描述是事后编写的,手势是模板化的,或者指向是为相机摆拍的。MM-Conv从二元VR交互中捕获自然的伴随语音手势,同时包含全身动作捕捉和3D场景图。我们使用它来评估姿态-语言融合,采用解耦的后期融合架构,其中姿态和文本通路不共享任何学习参数。这两个选择共同使得通过受控消融更容易隔离类别、姿态和文本的贡献。使用冻结的MiniLM类别嵌入的融合在每种指代类型上都超过了仅姿态和最佳文本通路,达到31.9%的top-1。学习到的标量门根据文本通路是否有类别访问权限而在相反策略之间切换。这是一个可靠性诊断:除非通路在架构上解耦,否则语义基础系统的融合准确性声明与类别表示伪影无法区分。

英文摘要

A robot resolving "put the cup on that one" must fuse gesture, language, and scene geometry, yet 3D grounding benchmarks only partially capture this regime: descriptions are written post-hoc, gestures are templated, or pointing is staged for the camera. MM-Conv captures natural co-speech gesture from dyadic VR interaction alongside full-body motion capture and 3D scene graphs. We use it to evaluate pose-language fusion with a decoupled late-fusion architecture in which pose and text pathways share no learned parameters. The two choices together make category, pose, and text contributions easier to isolate through controlled ablations. Fusion with frozen MiniLM category embeddings exceeds pose alone and the best text-only pathway on every reference type, reaching 31.9% top-1. The learned scalar gate flips between opposing policies depending on whether the text pathway has category access. This is a reliability diagnostic: fusion-accuracy claims for semantic grounding systems are indistinguishable from category-representation artifacts unless pathways are architecturally decoupled.

2605.24592 2026-05-26 cs.RO 版本更新

MuGen: Multi-Skill Generative Locomotion Controller for Humanoid Robots

MuGen: 人形机器人的多技能生成式运动控制器

Yusen Feng, Xiang Wang, Heyuan Yao, Zixi Kang, Xinyu Huo, Boyang Yu, Pengyun Qiu, Ruijie Zhao, Baoquan Chen, Libin Liu

发表机构 * Peking University(北京大学)

AI总结 提出MuGen框架,利用VQ-VAE和教师-学生策略蒸馏,从异构人类运动数据中学习生成式运动表示,使人形机器人能够执行多技能运动并模仿未见过的动作。

详情
AI中文摘要

本文提出MuGen,一个数据驱动的框架,用于在人形机器人上学习和部署多技能运动。MuGen使机器人能够在示例运动序列的指导下,像人类一样执行富有表现力的动作。为此,我们采用基于模型强化学习训练的向量量化自编码器(VQ-VAEs),生成运动的表示,从数小时的异构人类运动数据中捕捉人类运动的关键模式。我们采用教师-学生学习框架,并开发了一种新的策略蒸馏策略,使可部署的学生策略能够学习这种高效的潜在表示。该策略允许机器人跟踪和模仿未见过的运动,并进一步使机器人能够将学到的潜在空间重用于其他任务。我们通过多样化的运动集和精确的执行来证明我们框架的有效性。

英文摘要

This paper presents MuGen, a data-driven framework for learning and deploying multi-skill locomotion on humanoid robots. MuGen enables a robot to perform expressive, human-like motions under the guidance of example motion sequences. To achieve this, we employ vector-quantized variational autoencoders (VQ-VAEs) trained with model-based reinforcement learning, resulting in a generative representation of locomotion that captures key patterns of human motion from hours of heterogeneous human performance data. We employ a teacher-student learning framework and develop a new policy distillation strategy that enables a deployable student policy to learn this efficient latent representation. This policy allows the robot to track and mimic unseen human motions and further enables the robot to reuse the learned latent space for other tasks. We demonstrate the effectiveness of our framework through a diverse set of motions and accurate execution.
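The discrete bottleneck at the heart of any VQ-VAE is a nearest-neighbor lookup in a learned codebook. The sketch below shows only that quantization step, with a hand-written codebook standing in for learned entries; the encoder, decoder, training losses, and MuGen's reinforcement learning pipeline are all omitted, and the names are hypothetical.

```python
def quantize(z, codebook):
    """Return (index, code) of the codebook vector nearest to z in squared Euclidean distance."""

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sqdist(z, codebook[i]))
    return idx, codebook[idx]


# A toy 2-D codebook with three entries (placeholder values, not learned).
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
index, code = quantize((0.9, 0.8), codebook)
print(index, code)  # 1 (1.0, 1.0)
```

Downstream components then consume the discrete index or the selected code vector instead of the continuous encoder output, which is what makes the latent space compact and reusable across tasks.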

2605.19430 2026-05-26 cs.RO 版本更新

Neuromorphic Control of a Flapping-Wing Robot on Resource-Constrained Hardware

资源受限硬件上扑翼机器人的神经形态控制

Rim El Filali, Chenrui Feng, Chao Gao, Weibin Gu

发表机构 * Institute for AI Industry Research (AIR)(人工智能产业研究院) Tsinghua University(清华大学) Department of Computer Science and Technology(计算机科学与技术系) Xinchen Qihang Inc.(新晨科技有限公司)

AI总结 针对重量小于30克的蝴蝶仿生扑翼机器人,提出一种层次化神经形态控制框架,在低成本ESP32微控制器上部署两个轻量级脉冲神经网络实现状态估计与控制,通过模仿学习训练,在无系留飞行中实现稳定俯仰和航向跟踪,相比传统人工神经网络延迟降低36%、功耗降低18%。

详情
AI中文摘要

扑翼微型飞行器(FWMAV)具有卓越的机动性和气动效率,但由于非线性动力学和严格的大小、重量和功率(SWaP)约束(例如重量小于30克的蝴蝶仿生机器人),给机载控制带来了重大挑战。为此,我们提出了一种层次化神经形态控制框架,能够在广泛可用、资源受限的ESP32微控制器(单价约5美元)上实现完全机载的闭环飞行。具体而言,我们的方法在机载部署了两个轻量级脉冲神经网络(SNN):一个用于从原始传感器反馈进行状态估计,另一个通过调节中央模式发生器(CPG)进行翅膀驱动控制。通过模仿学习训练,该系统在无系留真实飞行中实现了稳定的俯仰和航向角跟踪。实验结果进一步表明,与传统人工神经网络(ANN)基线相比,基于SNN的控制器推理延迟降低了36%(从1059微秒降至680微秒),功耗降低了18%(从0.033瓦降至0.027瓦),证明了无需专用硬件的脉冲计算可行性。据我们所知,这项工作首次展示了FWMAV自主飞行的完全机载神经形态控制,突显了SNN在严格SWaP约束下实现节能自主的潜力。视觉摘要:http://bit.ly/4nI8ECY 代码:https://anonymous.4open.science/r/Espikify-76E3/

英文摘要

Flapping-Wing Micro Aerial Vehicles (FWMAVs) provide exceptional maneuverability and aerodynamic efficiency but pose significant challenges for onboard control due to nonlinear dynamics and stringent Size, Weight, and Power (SWaP) constraints, as exemplified by a butterfly-inspired robot weighing less than 30 grams. To this end, we present a hierarchical neuromorphic control framework that enables fully onboard, closed-loop flight on a widely available, resource-constrained ESP32 microcontroller with a unit cost of approximately $5. Specifically, our method deploys two lightweight Spiking Neural Networks (SNNs) onboard: one for state estimation from raw sensory feedback and another for control via modulation of a Central Pattern Generator (CPG) for wing actuation. Trained by imitation learning, the system achieves stable pitch and heading angle tracking during untethered real-world flight. Experimental results further reveal that the SNN-based controller reduces latency by 36% (1059 μs to 680 μs) and power by 18% (0.033 W to 0.027 W) for inference compared to the conventional Artificial Neural Network (ANN) baseline, demonstrating the viability of spike-based computation without specialized hardware. To the best of our knowledge, this work constitutes the first demonstration of fully onboard neuromorphic control for autonomous flight of an FWMAV, highlighting the potential of SNNs to enable energy-efficient autonomy under stringent SWaP constraints. Visual abstract: http://bit.ly/4nI8ECY Code: https://anonymous.4open.science/r/Espikify-76E3/
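As a rough illustration of spike-based computation, the sketch below implements one discrete-time leaky integrate-and-fire (LIF) neuron and re-derives the reported percentage reductions from the raw numbers in the abstract. The LIF model, the `leak` and `threshold` values, and the function names are assumptions for illustration; the abstract does not state which neuron model the authors use.

```python
def lif_step(v, input_current, leak=0.9, threshold=1.0):
    """One leaky integrate-and-fire update; returns (new_potential, spike)."""
    v = leak * v + input_current  # leak the old potential, add the new input
    if v >= threshold:
        return 0.0, 1             # emit a spike and reset the potential
    return v, 0


def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# Drive the neuron with a short, hypothetical input sequence.
v, spikes = 0.0, []
for current in [0.5, 0.5, 0.5, 0.0]:
    v, spike = lif_step(v, current)
    spikes.append(spike)

latency_gain = relative_reduction(1059, 680)    # microseconds, from the abstract
power_gain = relative_reduction(0.033, 0.027)   # watts, from the abstract
```

The two ratios evaluate to about 0.358 and 0.182, consistent with the rounded 36% and 18% figures quoted above.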

2605.17268 2026-05-26 cs.AI cs.CV cs.RO 版本更新

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

VLA 推理是否忠实?自动驾驶模型中因果链的安全性探究

Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) Central South University(中南大学) School of Computer Science(计算机科学学院) University of Wollongong in Dubai(卧龙岗大学迪拜分校)

AI总结 通过分析300次VLA推理,发现输出推理与轨迹的忠实度仅42.5%,存在大量漏检行人、轨迹脆弱及推理-动作不一致问题,并提出了信息论忠实度形式化定义与安全架构。

Comments Accept (Poster), CVPR 2026 Workshop DriveX NonArchival Track

详情
AI中文摘要

我们首次系统研究了视觉-语言-动作(VLA)驾驶模型的忠实度,分析了100个多样化PhysicalAI-AV场景中300次Alpamayo-R1-10B推理。主要发现是,输出带有轨迹的自然语言推理可能显著不忠实:(i) 整体推理保真度仅为42.5%,因果链与场景现实匹配不到一半;(ii) 在三分之一涉及行人的场景中漏检了94个行人;(iii) 在轻微视觉扰动下轨迹脆弱性达97.7%;(iv) 平均推理-动作一致性仅为48.3%,53.3%的推理表现出一致性低,其中37.9%声称停止但模型继续前行。我们从信息论角度形式化定义了忠实度,定义了实体和动作保真度及验证标准,并概述了与这些结果一致的四组件安全架构。

英文摘要

We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scenarios. Our main finding is that output natural-language rationales with trajectories may be significantly unfaithful: (i) overall reasoning fidelity is only 42.5%, with Chain-of-Causation matching scene reality less than half the time; (ii) 94 missed pedestrians in one-third of pedestrian-relevant scenes; (iii) 97.7% trajectory fragility under mild visual perturbations; and (iv) only 48.3% mean reasoning-action consistency, with 53.3% of inferences exhibiting low consistency, including 37.9% of stop-claimed cases where the model continues instead. We formalize faithfulness information-theoretically, define entity and action fidelity with verification criteria, and outline a four-component safety architecture aligned with these results.

2605.15971 2026-05-26 cs.RO 版本更新

OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation

OHP-RL:在线人类偏好作为机器人操作强化学习中的指导

Yunyang Mo, Jian Li, Qiwei Wu, Yihang Kang, Renjing Xu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出OHP-RL框架,利用人类干预作为偏好信息,通过状态依赖偏好门自适应调节策略学习,在Franka机器人接触丰富的操作任务中实现高成功率、快速收敛和低人类干预。

详情
AI中文摘要

虽然强化学习使机器人能够自主获取技能,但其在实际部署中受到低效和不安全探索的严重限制。人类在环干预提供了一种实用的解决方案,但现有方法通常将这些干预作为辅助训练信号,未能充分捕捉它们提供的关于何时以及如何引导自主性的更丰富信息。人类干预通常编码了在安全和任务约束下对行为的相对偏好,而不是规定要模仿的精确动作。受此观点启发,我们提出在线人类偏好作为强化学习中的指导(OHP-RL),这是一个利用人类干预作为偏好信息来指导策略学习的框架。OHP-RL引入了一个状态依赖的偏好门,自适应地调节人类干预应在何时以及多大程度上塑造策略学习。这种设计使智能体能够从间歇性和不完美的人类反馈中受益,同时保持自主探索和稳定的策略优化。我们在Franka机器人上的三个具有挑战性的真实世界接触丰富操作任务中评估了OHP-RL。在所有任务中,OHP-RL始终实现了高成功率、更快的收敛以及比先前方法显著更低的人类干预努力。此外,学习到的策略在整个训练过程中表现出更稳定和与人类一致的行为。

英文摘要

While reinforcement learning (RL) enables robots to acquire skills autonomously, its real-world deployment is severely limited by inefficient and unsafe exploration. Human-in-the-loop interventions offer a practical solution, yet existing methods typically exploit these interventions as auxiliary training signals, without fully capturing the richer information they provide about when and how autonomy should be guided. Human interventions often encode relative preferences over behavior under safety and task constraints, rather than prescribing exact actions to imitate. Motivated by this perspective, we propose Online Human Preference as Guidance in Reinforcement Learning (OHP-RL), a framework that leverages human interventions as preference information to guide policy learning. OHP-RL introduces a state-dependent preference gate that adaptively regulates when and to what extent human interventions should shape policy learning. This design enables the agent to benefit from intermittent and imperfect human feedback while preserving autonomous exploration and stable policy optimization. We evaluate OHP-RL on three challenging real-world contact-rich manipulation tasks on a Franka robot. Across all tasks, OHP-RL consistently achieves strong success rates, faster convergence, and substantially lower human intervention effort than prior approaches. Moreover, the learned policies exhibit more stable and human-aligned behavior throughout training.

2605.05182 2026-05-26 cs.RO cs.SY eess.SY 版本更新

A Closed-Form Dual-Barrier CBF Safety Filter for Holonomic Robots on Incrementally Built Occupancy Grid Maps

基于增量构建占据栅格地图的全向机器人闭式双障碍CBF安全滤波器

Himanshu Paudel, Basanta Joshi, Dhirendra Raj Madai, Alina Bartaula, Biman Rimal, Sanjay Neupane

发表机构 * Raspberry Pi(树莓派) PX4 CUAV Nano 7 Raspberry Pi 4B quadrotor(树莓派4B四旋翼无人机)

AI总结 提出一种闭式双障碍控制障碍函数安全滤波器,通过解析推导占据栅格地图的符号距离场,同时避免已映射障碍物并限制进入未探索区域,实现全向机器人在资源受限平台上的实时安全控制。

详情
AI中文摘要

我们提出了一种双障碍控制障碍函数(CBF)安全滤波器,用于在增量构建的占据栅格地图中运行的全向机器人的实时、安全关键速度控制。当机器人探索未知环境时,未映射区域引入了不可约的不确定性,因为超出已探索前沿的障碍物几何形状未知,使得进入这些区域成为碰撞风险的来源,尤其是对于前向传感器。为了解决这个问题,我们强制执行两个约束:避免已映射障碍物和限制进入未探索区域。这两个约束都是从占据栅格地图的符号距离场解析推导出来的,产生了一个闭式安全滤波器,每个周期只需求解一个小型线性系统。在资源受限的平台(如Raspberry Pi)上,SLAM和规划已经消耗了大量计算资源,所提出的滤波器的低开销节省了资源。自适应增益调度在信息丰富的区域放松前沿约束,在良好映射的区域收紧约束,提高了探索效率,同时保持了安全性。该滤波器在速度空间中作为最小侵入性校正运行,并与任意标称控制器(包括基于学习的方法)组合。在PX4控制的四旋翼飞行器上的硬件飞行实验表明,在多次室内运行中实现了零碰撞。

英文摘要

We present a dual-barrier control barrier function (CBF) safety filter for real-time, safety-critical velocity control of holonomic robots operating in incrementally built occupancy grid maps. As a robot explores an unknown environment, unmapped regions introduce irreducible uncertainty, since obstacle geometry beyond the explored frontier is unknown, making entry into such regions a source of collision risk, especially with front-facing sensors. To address this, we enforce two constraints: avoidance of mapped obstacles and restriction from unexplored regions. Both constraints are derived analytically from the occupancy grid's signed distance field, yielding a closed-form safety filter that requires only a small linear system solve per cycle. On resource-constrained platforms such as the Raspberry Pi, where SLAM and planning already consume significant compute, the low overhead of the proposed filter preserves resources. An adaptive gain schedule relaxes the frontier constraint in information-rich regions and tightens it in well-mapped areas, improving exploration efficiency while maintaining safety. The filter operates in velocity space as a minimally invasive correction and composes with arbitrary nominal controllers, including learning-based methods. Hardware flight experiments on a PX4-controlled quadrotor demonstrate zero collisions across multiple indoor runs.
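To make the idea of a closed-form two-constraint safety filter concrete, here is a minimal sketch under stated assumptions; it is not the paper's derivation. It assumes single-integrator dynamics (the velocity command `u` is applied directly), two barrier values `h1` and `h2` with gradients `grad_h1` and `grad_h2` (in the paper's setting these would come from the signed distance field), a shared gain `alpha`, and linearly independent gradients. Each barrier yields the constraint grad_h · u + alpha · h ≥ 0, and the filter returns the velocity closest to the nominal one that satisfies both. When both constraints are active, this reduces to a 2x2 linear solve.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(u, a, c):
    """Closest point to u on the boundary a . u = c of a half-space."""
    scale = (c - dot(a, u)) / dot(a, a)
    return [ui + scale * ai for ui, ai in zip(u, a)]


def dual_barrier_filter(u_nom, grad_h1, h1, grad_h2, h2, alpha=1.0, tol=1e-9):
    """Minimally modify u_nom so that grad_hi . u + alpha * hi >= 0 for i = 1, 2."""
    a = [grad_h1, grad_h2]
    c = [-alpha * h1, -alpha * h2]

    def satisfied(u, i):
        return dot(a[i], u) >= c[i] - tol

    if satisfied(u_nom, 0) and satisfied(u_nom, 1):
        return list(u_nom)                     # nominal command is already safe
    for i, j in ((0, 1), (1, 0)):
        if not satisfied(u_nom, i):
            u = project(u_nom, a[i], c[i])     # only constraint i active
            if satisfied(u, j):
                return u
    # Both constraints active: solve the 2x2 Gram system for the multipliers.
    g11, g12, g22 = dot(a[0], a[0]), dot(a[0], a[1]), dot(a[1], a[1])
    r1 = c[0] - dot(a[0], u_nom)
    r2 = c[1] - dot(a[1], u_nom)
    det = g11 * g22 - g12 * g12
    lam1 = (g22 * r1 - g12 * r2) / det
    lam2 = (g11 * r2 - g12 * r1) / det
    return [u + lam1 * x + lam2 * y for u, x, y in zip(u_nom, a[0], a[1])]


# Hypothetical 2-D case: an obstacle ahead limits u_x, a frontier limits u_y.
u_single = dual_barrier_filter([1.0, 0.0], [-1.0, 0.0], 0.5, [0.0, -1.0], 2.0)
u_both = dual_barrier_filter([1.0, 1.0], [-1.0, 0.0], 0.5, [0.0, -1.0], 0.25)
```

In the first call only the obstacle constraint binds and the forward speed is clipped to 0.5. In the second call both constraints bind and the command is moved to the corner of the safe set.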

2605.02900 2026-05-26 cs.CR cs.AI cs.CV cs.RO 版本更新

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

具身人工智能的安全性:风险、攻击与防御综述

Xiao Li, Xiang Zheng, Yifeng Gao, Xinyu Xia, Yixu Wang, Xin Wang, Ye Sun, Yunhan Zhao, Ming Wen, Jiayu Li, Zixing Chen, Xun Gong, Yi Liu, Yige Li, Yutao Wu, Cong Wang, Jun Sun, Yixin Cao, Zhineng Chen, Jingjing Chen, Tao Gui, Qi Zhang, Zuxuan Wu, Xipeng Qiu, Xuanjing Huang, Tiehua Zhang, Zhipeng Wei, Kun Wang, Xinfeng Li, Hanxun Huang, Sarah Erfani, James Bailey, Jianping Wang, Chaowei Xiao, Ran He, Bo Li, Xingjun Ma, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) City University of Hong Kong(香港城市大学) Jilin University(吉林大学) Singapore Management University(新加坡管理大学) Deakin University(迪肯大学) Tongji University(同济大学) Nanyang Technological University(南洋理工大学) Chinese Academy of Sciences(中国科学院) The University of Melbourne(墨尔本大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文综述了具身AI在感知、认知、规划、行动及交互全流程中的安全风险、攻击与防御方法,提出了多层次分类体系,并指出了多模态感知融合脆弱性、规划不稳定及人机交互可信度等关键挑战。

Comments Survey paper; 75 pages, 4 figures, 18 tables; v2 expands embodied-specific coverage of agentic threats, World Action Model threats, and contextual risk mitigation, with over 100 new references added. Project page: https://x-zheng16.github.io/Awesome-Embodied-AI-Safety/

详情
AI中文摘要

具身人工智能将感知、认知、规划与交互集成到在开放、安全关键环境中运行的智能体中。随着这些系统获得自主性并进入交通、医疗、工业或辅助机器人等领域,确保其安全性在技术上具有挑战性,在社会上也变得不可或缺。与数字AI系统不同,具身智能体必须在不确定的感知、不完整的知识和动态的人机交互下行动,故障可能直接导致物理伤害。本综述对具身AI中的安全研究进行了全面且结构化的回顾,考察了从感知、认知到规划、行动与交互以及智能体系统的完整具身流程中的攻击与防御。我们引入了一个多层次分类体系,统一了分散的研究工作,并将具身特定的安全发现与视觉、语言和多模态基础模型的更广泛进展联系起来。我们的综述综合了来自500多篇论文的见解,涵盖对抗性攻击、后门攻击、越狱攻击和硬件级攻击;攻击检测、安全训练和鲁棒推理;以及风险感知的人机交互。这一分析揭示了几个被忽视的挑战,包括多模态感知融合的脆弱性、越狱攻击下规划的不稳定性,以及开放场景中人机交互的可信度。通过将领域组织成连贯的框架并识别关键研究空白,本综述为构建不仅具备能力和自主性,而且在现实部署中安全、鲁棒和可靠的具身智能体提供了路线图。

英文摘要

Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open-world, safety-critical environments. As these systems gain autonomy and enter domains such as transportation, healthcare, and industrial or assistive robotics, ensuring their safety becomes both technically challenging and socially indispensable. Unlike digital AI systems, embodied agents must act under uncertain sensing, incomplete knowledge, and dynamic human-robot interactions, where failures can directly lead to physical harm. This survey provides a comprehensive and structured review of safety research in embodied AI, examining attacks and defenses across the full embodied pipeline, from perception and cognition to planning, action and interaction, and agentic systems. We introduce a multi-level taxonomy that unifies fragmented lines of work and connects embodied-specific safety findings with broader advances in vision, language, and multimodal foundation models. Our review synthesizes insights from over 500 papers spanning adversarial, backdoor, jailbreak, and hardware-level attacks; attack detection, safe training, and robust inference; and risk-aware human-agent interaction. This analysis reveals several overlooked challenges, including the fragility of multimodal perception fusion, the instability of planning under jailbreak attacks, and the trustworthiness of human-agent interaction in open-ended scenarios. By organizing the field into a coherent framework and identifying critical research gaps, this survey provides a roadmap for building embodied agents that are not only capable and autonomous but also safe, robust, and reliable in real-world deployment.

2604.17538 2026-05-26 cs.RO 版本更新

Novel Algorithms for Smoothly Differentiable and Efficiently Vectorizable Contact Manifold Construction

用于光滑可微且高效可向量化的接触流形构建的新算法

Onur Beker, Andreas René Geist, Anselm Paulus, Georg Martius

发表机构 * University of Tübingen(图宾根大学)

AI总结 针对接触丰富场景中机器人行为优化,提出一种以光滑二次可微性和GPU大规模可向量化为优先的新碰撞检测流水线,包括可微SDF表示、宽/窄阶段例程和凸分解接触融合。

Comments This version adds late-breaking results in preparation for the CR2 workshop in ICRA 2026

详情
AI中文摘要

在接触丰富的环境中生成智能机器人行为是一个目前零阶方法占主导的研究问题。开发利用接触存在下刚体动力学的一阶/二阶信息的方法,在提高求解速度和计算效率方面具有巨大潜力。该研究方向的主要瓶颈在于,由于常见模拟流水线中所有三个步骤(i)碰撞检测、(ii)接触动力学、(iii)时间积分)的病态性,难以获得对数值优化实际有用的梯度和Hessian矩阵。本文提出了一种旨在解决该难题中碰撞检测部分的方法,通过一个从头设计的新流水线,以光滑(即二次)可微性和GPU上的大规模可向量化作为主要优先级。这与标准碰撞检测例程形成对比,后者针对CPU上的运行时间和最小内存占用进行了优化,但采用了阻碍可微性和向量化的逻辑和控制流。所提出的流水线包括以下贡献:i)高度表达力强且计算高效的SDF表示,ii)使用这些表示生成顶点-SDF和边-SDF接触的可微宽阶段和窄阶段例程,iii)基于凸分解的接触融合的可微例程。

英文摘要

Generating intelligent robot behavior in contact-rich settings is a research problem where zeroth-order methods currently prevail. Developing methods that make use of first/second order information about rigid-body dynamics in the presence of contact holds great promise in terms of increasing the solution speed and computational efficiency. The main bottleneck in this research direction is the difficulty in obtaining gradients and Hessians that are actually useful for numerical optimization, due to pathologies in all three steps of a common simulation pipeline: i) collision detection, ii) contact dynamics, iii) time integration. This abstract proposes a method that aims to address the collision detection part of the puzzle, via a novel pipeline designed from scratch with smooth (i.e. twice) differentiability and massive vectorizability on GPUs as the main priorities. This is in contrast to standard collision detection routines that are instead optimized for runtime on CPUs and minimal memory footprint, but do employ logic and control flow that hinder differentiability and vectorization. The proposed pipeline consists of the following contributions: i) highly expressive and compute efficient SDF representations, ii) differentiable broad-phase and narrow-phase routines that use these representations to generate vertex-SDF and edge-SDF contacts, iii) a differentiable routine for convex decomposition based contact blending.
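One standard way to obtain a smooth distance field from several primitives is to replace the hard `min` over their signed distance functions (SDFs) with a log-sum-exp soft minimum. The sketch below illustrates that general technique only. The abstract does not specify the authors' SDF representation, so the sphere primitives, the `sharpness` parameter, and the function names are assumptions.

```python
import math


def sphere_sdf(point, center, radius):
    """Signed distance from `point` to a sphere (negative inside)."""
    return math.dist(point, center) - radius


def smooth_min(values, sharpness=20.0):
    """Log-sum-exp soft minimum; approaches min(values) as sharpness grows.

    The smallest value is subtracted first for numerical stability.
    """
    m = min(values)
    total = sum(math.exp(-sharpness * (v - m)) for v in values)
    return m - math.log(total) / sharpness


# Hypothetical scene: two unit spheres, queried at the origin.
query = (0.0, 0.0, 0.0)
distance = smooth_min([
    sphere_sdf(query, (2.0, 0.0, 0.0), 1.0),
    sphere_sdf(query, (0.0, 5.0, 0.0), 1.0),
])
```

Unlike a hard `min`, this expression has no branch on which primitive is closest, so it vectorizes naturally and stays differentiable where two primitives are equidistant. The price is a small underestimate of the true distance, bounded by log(n) / sharpness for n primitives.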

2602.23872 2026-05-26 cs.CV cs.RO 版本更新

Altitude-Adaptive Vision-Only Geo-Localization for UAVs in GPS-Denied Environments

GPS拒止环境下无人机的高度自适应纯视觉地理定位

Xingyu Shao, Mengfan He, Chunyu Li, Liangzheng Sun, Ziyang Meng

发表机构 * Department of Precision Instrument, Tsinghua University(清华大学精密仪器系) School of Aerospace Engineering, Beijing Institute of Technology(北京理工大学航天工程学院) School of Instrumentation Science and Opto-electronics Engineering, Beijing Information Science and Technology University(北京信息科技大学仪器科学与光电工程学院)

AI总结 针对无人机视觉位置识别中高度变化导致的尺度不匹配问题,提出一种基于单目视觉的高度自适应地理定位框架,通过频域变换估计相对高度并用于图像尺度归一化,结合分类-检索视觉位置识别模块实现粗定位,引入质量自适应边缘分类器提升检索鲁棒性。

详情
AI中文摘要

为了解决无人机视觉位置识别中由高度大幅变化引起的尺度不匹配问题,我们提出了一种仅依赖单目视觉的高度自适应地理定位框架。该方法首先通过将输入图像转换到频域,并将高度估计建模为回归作为分类问题,从单张下视图像中估计相对高度。然后利用估计的高度将查询图像裁剪到规范尺度,之后通过分类-检索视觉位置识别模块进行粗定位。为了在图像质量变化的情况下提高检索鲁棒性,我们进一步引入了质量自适应边缘分类器,并通过加权坐标估计对最终位置进行精化,该估计基于前k个检索候选。在两个合成数据集和两个真实飞行数据集上的实验表明,相对高度估计模块在显著高度变化下,下游检索性能有显著提升。与使用相同检索流程但未进行高度归一化相比,我们的视觉位置识别模块通过高度自适应使平均R@1和R@5分别提高了41.50和56.83个百分点,完整系统在报告的工作站硬件上以13.3帧/秒运行。这些结果表明,相对高度估计为跨高度无人机地理定位提供了有效的尺度先验,并在无需辅助距离传感器或时间输入的情况下支持GPS拒止环境下的粗初始化。

英文摘要

To address the scale mismatch caused by large altitude variations in UAV visual place recognition, we propose a monocular vision-only altitude-adaptive geo-localization framework. The method first estimates relative altitude from a single downward-looking image by transforming the input into the frequency domain and formulating altitude estimation as a regression-as-classification (RAC) problem. The estimated altitude is then used to crop the query image to a canonical scale, after which a classification-then-retrieval visual place recognition module performs coarse localization. To improve retrieval robustness under varying image quality, we further introduce a quality-adaptive margin classifier (QAMC) and refine the final location by weighted coordinate estimation over the top retrieved candidates. Experiments on two synthetic datasets and two real-flight datasets show that the relative altitude estimation (RAE) module yields clear overall improvements in downstream retrieval performance under significant altitude changes. With our visual place recognition module, altitude adaptation improves average R@1 and R@5 by 41.50 and 56.83 percentage points, respectively, compared with using the same retrieval pipeline without altitude normalization, and the full system runs at 13.3 frames/s on the reported workstation hardware. These results indicate that relative altitude estimation provides an effective scale prior for cross-altitude UAV geo-localization and supports GPS-denied coarse initialization without auxiliary range sensors or temporal inputs.
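The two steps described above, decoding an altitude from class scores and cropping the query to a canonical scale, can be sketched as follows. This is an illustrative reading, not the authors' code: decoding by the probability-weighted mean of bin centers is one common regression-as-classification choice, and the crop rule assumes a downward-looking pinhole camera over flat ground with the query altitude at or above the canonical one. The bin centers and function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def altitude_from_logits(logits, bin_centers):
    """Decode a continuous altitude as the expected value over altitude bins."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, bin_centers))


def canonical_crop_size(image_size, altitude, canonical_altitude):
    """Side length in pixels of the centered crop whose ground footprint
    matches a full image taken at `canonical_altitude`.

    The ground footprint grows linearly with altitude, so the crop
    fraction is canonical_altitude / altitude.
    """
    return round(image_size * canonical_altitude / altitude)


# Hypothetical two-bin example with equal scores for 100 m and 200 m.
estimated = altitude_from_logits([0.0, 0.0], [100.0, 200.0])
crop = canonical_crop_size(1024, altitude=200.0, canonical_altitude=100.0)
```

The cropped region would then be resized to the network's input resolution before retrieval.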

2602.03983 2026-05-26 cs.RO cs.CV 版本更新

Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

通过静态-动态解耦实现高效长程视觉-语言-动作模型

Weikang Qiu, Huashuo Lei, Tinglin Huang, Rex Ying

发表机构 * Yale University(耶鲁大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出DySta框架,通过将视觉输入解耦为多级静态和动态令牌,减少上下文长度并复用KV缓存,实现高效多帧集成和推理,在基准测试和真实任务中显著提升性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型最近成为通用机器人控制的一种有前景的范式。基于视觉-语言模型(VLM)架构,VLA模型根据视觉观察和语言指令预测动作,在任务中实现了强大的性能和泛化能力。然而,VLA模型面临两个主要挑战:输入帧的有限上下文窗口,以及由于二次注意力复杂性和大参数数量导致的低效推理。为此,我们提出了DySta,一个将视觉输入解耦为多级静态和动态令牌的框架,使得(1)在帧间保留静态令牌的单一副本以显著减少上下文长度,以及(2)通过轻量级重缓存门(仅在必要时更新)重用静态令牌的键值(KV)缓存。这种设计实现了高效的多帧集成和高效推理。此外,我们引入了一个新的基准测试,更有效地评估VLA模型的多帧集成能力。实验表明,DySta在我们的基准测试中各项指标上提高了24.5%的多帧集成能力,在真实世界记忆依赖任务中绝对成功率达到23.3%,同时在模拟基准测试中推理速度提升2.0倍(成功率+2.3%),在真实世界通用任务中推理速度提升2.2倍(成功率+10.6%)。

英文摘要

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: a limited context window for input frames and inefficient inference due to the quadratic attention complexity and large parameter counts. To this end, we propose DySta, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates only when necessary. This design enables efficient multi-frame integration and efficient inference. In addition, we introduce a new benchmark that more effectively evaluates the multi-frame integration ability of VLAs. Experiments show that DySta improves multi-frame integration by 24.5% across metrics on our benchmark and 23.3% in absolute success rate on real-world memory-dependent tasks, while accelerating inference by 2.0x (with +2.3% success rate) on simulation benchmarks and 2.2x (with +10.6% success rate) on real-world general tasks.

2512.00375 2026-05-26 cs.RO 版本更新

DPNet: Doppler LiDAR Motion Planning for Highly-Dynamic Environments

DPNet: 面向高动态环境的多普勒激光雷达运动规划

Wei Zuo, Zeyi Ren, Chengyang Li, Yikun Wang, Mingle Zhao, Shuai Wang, Wei Sui, Fei Gao, Yik-Chung Wu, Chengzhong Xu

发表机构 * The University of Hong Kong(香港大学) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究所) University of Macau(澳门大学) D-Robotics Zhejiang University(浙江大学)

AI总结 提出DPNet,通过多普勒卡尔曼神经网络跟踪快速障碍物并利用多普勒调谐模型预测控制实现高动态环境下的高频高精度运动规划。

Comments Accepted to IEEE Robotics and Automation Letters in April, 2026

详情
AI中文摘要

现有的运动规划方法由于对环境变化理解不足,常常难以应对快速移动的障碍物。为了解决这一问题,我们提出将运动规划器与多普勒激光雷达集成,后者不仅提供测距测量,还提供瞬时点速度。然而,由于高精度和高频率的要求,这种集成并非易事。为此,我们引入了多普勒规划网络(DPNet),通过基于多普勒模型的学习来跟踪和应对快速障碍物。我们首先提出了一种多普勒卡尔曼神经网络(D-KalmanNet),用于在部分可观测的高斯状态空间模型下跟踪障碍物状态。然后,我们利用预测的障碍物运动构建了一个多普勒调谐模型预测控制(DT-MPC)框架用于自我运动规划,实现了控制器参数的运行时自动调优。这两个模块使得DPNet能够从最少数据中学习快速环境变化,同时保持轻量级,在跟踪和规划中实现高频率和高精度。在高保真模拟器和真实世界数据集上的实验表明,DPNet优于广泛的基准方案。代码可在 https://github.com/UUwei-zuo/DPNet 获取。

英文摘要

Existing motion planning methods often struggle with rapid-motion obstacles due to an insufficient understanding of environmental changes. To address this, we propose integrating motion planners with Doppler LiDARs, which provide not only ranging measurements but also instantaneous point velocities. However, this integration is nontrivial due to the requirements of high accuracy and high frequency. To this end, we introduce Doppler Planning Network (DPNet), which tracks and reacts to rapid obstacles via Doppler model-based learning. We first propose a Doppler Kalman neural network (D-KalmanNet) to track obstacle states under a partially observable Gaussian state space model. We then leverage the predicted motions of obstacles to construct a Doppler-tuned model predictive control (DT-MPC) framework for ego-motion planning, enabling runtime auto-tuning of controller parameters. These two modules allow DPNet to learn fast environmental changes from minimal data while remaining lightweight, achieving high frequency and high accuracy in both tracking and planning. Experiments on a high-fidelity simulator and real-world datasets demonstrate the superiority of DPNet over extensive benchmark schemes. Code available at https://github.com/UUwei-zuo/DPNet
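The per-point velocity a Doppler LiDAR reports is the component of relative velocity along the line of sight. The sketch below shows that measurement model under simplifying assumptions: a stationary sensor at the origin and a sign convention where positive means the point is moving away. It is not taken from the DPNet code, and the function name is hypothetical.

```python
import math


def radial_velocity(point, velocity):
    """Line-of-sight speed of a point seen from a static sensor at the origin.

    Positive values mean the point is receding, negative values mean
    it is approaching.
    """
    distance = math.sqrt(sum(p * p for p in point))
    return sum(p * v for p, v in zip(point, velocity)) / distance


# Hypothetical obstacle 5 m away (3 m ahead, 4 m to the side), moving at 5 m/s
# along the negative x-axis.
doppler = radial_velocity([3.0, 4.0], [-5.0, 0.0])
```

Only the radial component is observed, which is one reason the obstacle state is only partially observable and a filter is needed to recover the full velocity.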

2511.19211 2026-05-26 cs.RO 版本更新

Soft Pneumatic Grippers: Topology optimization, 3D-printing and Experimental validation

软体气动夹爪:拓扑优化、3D打印与实验验证

Prabhat Kumar, Chandra Prakash, Josh Pinskier, David Howard, Matthijs Langelaar

发表机构 * Department of Mechanical and Aerospace Engineering, Indian Institute of Technology Hyderabad, Telangana 502285, India(印度理工学院海得拉巴分校机械与航空航天工程系,特伦甘纳邦 502285,印度) CSIRO Robotics, Pullenvale, QLD 4069, Australia(澳大利亚CSIRO机器人部,Pullenvale,QLD 4069,澳大利亚) Faculty of Mechanical Engineering, Delft University of Technology, Mekelweg 2, Delft, 2628CD, Zuid-Holland, The Netherlands(代尔夫特理工大学机械工程学院,Mekelweg 2号,代尔夫特,2628CD,南荷兰省,荷兰)

AI总结 提出一种考虑载荷设计依赖性的软体气动夹爪拓扑优化框架,通过2D软臂单元优化、3D打印制造及实验验证,证明其优于传统矩形设计。

Comments 11 Figures

详情
AI中文摘要

本文提出了一种系统性的拓扑优化框架,用于设计软体气动夹爪(SPG),明确考虑了驱动载荷的设计依赖性。载荷使用达西定律并添加排水项进行建模。通过将问题表述为使用鲁棒公式的柔顺机构设计问题,优化了一个2D软臂单元。该问题被设定为最小-最大优化,其中考虑了蓝图设计和侵蚀设计的输出变形。对蓝图部分施加体积约束,对侵蚀部分施加应变能约束。采用MMA求解优化问题并获得优化的软单元。使用Ogden材料模型进行有限元分析证实,优化后的2D单元在气动载荷下优于传统的矩形设计。将优化后的2D单元拉伸得到3D模块,并组装十个这样的单元以形成软臂。分析了优化臂在不同压力载荷下的变形曲线。对四个臂进行3D打印,并与支撑结构集成以实现所提出的SPG。在具有不同重量、尺寸、刚度和形状的物体上展示了SPG的抓取性能。

英文摘要

This paper presents a systematic topology optimization framework for designing a soft pneumatic gripper (SPG), explicitly considering the design-dependent nature of the actuating load. The load is modeled using Darcy's law with an added drainage term. A 2D soft arm unit is optimized by formulating it as a compliant mechanism design problem using the robust formulation. The problem is posed as a min-max optimization, where the output deformations of blueprint and eroded designs are considered. A volume constraint is imposed on the blueprint part, while a strain-energy constraint is enforced on the eroded part. The MMA is employed to solve the optimization problem and obtain the optimized soft unit. Finite element analysis with the Ogden material model confirms that the optimized 2D unit outperforms a conventional rectangular design under pneumatic loading. The optimized 2D unit is extruded to obtain a 3D module, and ten such units are assembled to create a soft arm. Deformation profiles of the optimized arm are analysed under different pressure loads. Four arms are 3D-printed and integrated with a supporting structure to realize the proposed SPG. The gripping performance of the SPG is demonstrated on objects with different weights, sizes, stiffness, and shapes.

2510.23509 2026-05-26 cs.RO 版本更新

Logic-Guided Socially-aware Robot Navigation World Model

逻辑引导的社会感知机器人导航世界模型

Weizheng Wang, Obi Ike, Soyun Choi, Sungeun Hong, Aniket Bera, Byung-Cheol Min

发表机构 * School of Applied and Creative Computing, Purdue University(应用与创意计算学院,普渡大学) Department of Computer Science, Purdue University(计算机科学系,普渡大学) Department of Applied Artificial Intelligence, Sungkyunkwan University(应用人工智能系,成均馆大学) Department of Computer Science and Department of Intelligent Systems Engineering, Indiana University Bloomington(计算机科学系和智能系统工程系,印第安纳大学布卢明顿分校)

AI总结 提出NaviWM,通过结合结构化世界模型和逻辑驱动推理链,增强大语言模型在动态人类空间中生成社交合规且物理安全的导航决策的能力。

详情
AI中文摘要

社交机器人导航越来越依赖大语言模型进行推理、路径规划以及在动态人类空间中实现移动。然而,仅依赖LLM进行规划往往会导致不可预测和不安全的行为,尤其是在动态人类空间中,原因是物理基础有限且逻辑一致性弱。在这项工作中,我们引入了NaviWM,一种社会感知的机器人导航世界模型,它通过结构化世界模型和逻辑驱动的思维链过程增强LLM推理。NaviWM由两个主要组件组成:(1)一个时空世界模型,捕捉环境中智能体的位置、速度和活动;(2)一个演绎推理模块,通过多步、基于逻辑的推理过程引导LLM。这种集成使机器人能够在明确定义的约束(如个人空间、碰撞避免和时机)下生成既社交合规又物理安全的导航决策。与基于提示或微调的先前方法不同,NaviWM将社会规范编码为一阶逻辑,从而实现可解释和可验证的推理。实验表明,NaviWM提高了成功率并减少了社交违规,尤其是在拥挤环境中。这些结果证明了将形式推理与LLM结合用于鲁棒社交导航的好处。本工作的更多实验细节和演示视频可在以下网址找到:https://sites.google.com/view/NaviWM。

英文摘要

Social robot navigation increasingly relies on large language models for reasoning, path planning, and enabling movement in dynamic human spaces. However, relying solely on LLMs for planning often leads to unpredictable and unsafe behaviors, especially in dynamic human spaces, due to limited physical grounding and weak logical consistency. In this work, we introduce NaviWM, a socially-aware robot Navigation World Model that augments LLM reasoning with a structured world model and a logic-driven chain-of-thought process. NaviWM consists of two main components: (1) a spatial-temporal world model that captures the positions, velocities, and activities of agents in the environment, and (2) a deductive reasoning module that guides LLMs through a multi-step, logic-based inference process. This integration enables the robot to generate navigation decisions that are both socially compliant and physically safe, under well-defined constraints such as personal space, collision avoidance, and timing. Unlike previous methods based on prompting or fine-tuning, NaviWM encodes social norms as first-order logic, enabling interpretable and verifiable reasoning. Experiments show that NaviWM improves success rates and reduces social violations, particularly in crowded environments. These results demonstrate the benefit of combining formal reasoning with LLMs for robust social navigation. Additional experimental details and demo videos for this work can be found at: https://sites.google.com/view/NaviWM.

2510.01389 2026-05-26 cs.RO cs.AI cs.LG 版本更新

INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models

INSIGHT: 视觉-语言-动作模型中生成帮助触发器的推理时序列内省

Ulas Berk Karli, Ziyao Shangguan, Tesca Fitzgerald

发表机构 * Department of Computer Science, Yale University(耶鲁大学计算机科学系)

AI总结 提出INSIGHT框架,利用令牌级不确定性信号(熵、对数概率、不确定性估计)训练变压器分类器,预测VLA模型何时需要人类帮助,并对比强/弱监督下的性能,发现建模时间动态优于静态评分。

详情
AI中文摘要

最近的视觉-语言-动作(VLA)模型展现出强大的泛化能力,但它们缺乏用于预测失败和向人类监督者请求帮助的内省机制。我们提出了INSIGHT,一个利用令牌级不确定性信号来预测VLA何时应请求帮助的学习框架。使用π0-FAST作为基础模型,我们提取每个令牌的熵、对数概率以及基于狄利克雷的偶然不确定性和认知不确定性估计,并训练紧凑的变压器分类器将这些序列映射到帮助触发器。我们探索了强监督或弱监督的监督机制,并在分布内和分布外任务中进行了广泛比较。我们的结果显示了权衡:强标签使模型能够捕捉细粒度的不确定性动态以实现可靠的帮助检测,而弱标签虽然噪声较大,但在训练和评估对齐时仍能支持有竞争力的内省,为密集标注不可行时提供了可扩展的路径。关键的是,我们发现使用变压器建模令牌级不确定性信号的时间演化比静态序列级评分提供了更强的预测能力。本研究首次对VLA中基于不确定性的内省进行了系统评估,为主动学习和通过选择性人工干预实现实时错误缓解开辟了未来途径。

英文摘要

Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using π0-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore supervision regimes for strong or weak supervision, and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
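Two of the per-token signals named above, entropy and log-probability, follow directly from a token's logits. The sketch below computes both with a numerically stable log-softmax. The function name and the toy logits are assumptions, and the Dirichlet-based aleatoric and epistemic estimates are not covered here.

```python
import math


def token_uncertainty(logits, chosen):
    """Return (entropy, log_prob) for one decoding step.

    `logits` are the unnormalized scores over the vocabulary and `chosen`
    is the index of the token that was actually emitted.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    entropy = -sum(math.exp(lp) * lp for lp in log_probs)
    return entropy, log_probs[chosen]


# A uniform distribution over four tokens is maximally uncertain.
entropy, log_prob = token_uncertainty([0.0, 0.0, 0.0, 0.0], chosen=2)
```

Stacking these values over the tokens of an action sequence gives the kind of time series a sequence classifier could consume.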

2509.05614 2026-05-26 cs.CV cs.AI cs.RO 版本更新

SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

SpecPrune-VLA: 通过动作感知的自推测剪枝加速视觉-语言-动作模型

Hanzhen Wang, Jiaming Xu, Yushun Xiang, Jiayi Pan, Yongkang Zhou, Yong-Lu Li, Guohao Dai

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对视觉-语言-动作模型推理加速,提出结合全局上下文与局部信息的无训练两层剪枝方法,实现高达1.57倍加速且成功率几乎无下降。

Comments Accepted to ICML 2026

详情
AI中文摘要

剪枝是一种通过移除不重要值的计算来加速计算密集型模型的典型技术。最近,它被应用于加速视觉-语言-动作(VLA)模型推理。然而,现有的加速方法仅关注当前动作步骤的局部信息,忽略了全局上下文,导致在某些场景下成功率下降超过20%且加速效果有限。本文指出VLA任务中的时空一致性:连续步骤中的输入图像表现出高度相似性,并提出关键见解:令牌选择应结合局部信息与模型的全局上下文。基于此,我们提出SpecPrune-VLA,一种无需训练、具有启发式控制的两级剪枝方法。(1) 动作级静态剪枝:利用全局历史和局部注意力,在每个动作中静态减少视觉令牌。(2) 层级动态剪枝:根据逐层重要性自适应地剪枝每层的令牌。(3) 轻量级动作感知控制器:根据末端执行器的速度将动作分为粗粒度或细粒度,并相应调整剪枝激进程度。大量实验表明,SpecPrune-VLA在LIBERO模拟中实现高达1.57倍加速,在真实世界任务中实现1.70倍加速,且成功率下降可忽略不计。

英文摘要

Pruning is a typical acceleration technique for compute-bound models by removing computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existing acceleration methods focus on local information from the current action step and ignore the global context, leading to >20% success rate drop and limited speedup in some scenarios. In this paper, we point out spatial-temporal consistency in VLA tasks: input images in consecutive steps exhibit high similarity, and propose the key insight that token selection should combine local information with global context of the model. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. (1) Action-level static pruning. We leverage global history and local attention to statically reduce visual tokens per action. (2) Layer-level dynamic pruning. We prune tokens adaptively per layer based on layer-wise importance. (3) Lightweight action-aware controller: We classify actions as coarse- or fine-grained by the speed of the end effector and adjust pruning aggressiveness accordingly. Extensive experiments show that SpecPrune-VLA achieves up to 1.57× speedup in LIBERO simulation and 1.70× on real-world tasks, with negligible success rate degradation.
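The interplay between importance-based token pruning and a speed-based controller can be sketched as below. This is a simplified illustration, not SpecPrune-VLA itself: the importance scores are taken as given, and the speed threshold, the keep ratios, and the assumption that fast (coarse) motions tolerate more aggressive pruning than slow (fine) ones are all hypothetical choices.

```python
def keep_ratio(speed, speed_threshold=0.05, coarse_ratio=0.3, fine_ratio=0.7):
    """Fraction of visual tokens to keep, chosen from end-effector speed.

    Assumption: fast motion is treated as coarse-grained and pruned harder.
    """
    return coarse_ratio if speed > speed_threshold else fine_ratio


def prune_tokens(scores, ratio):
    """Indices of the highest-scoring tokens, returned in original order."""
    k = max(1, round(len(scores) * ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical importance scores for four visual tokens.
scores = [0.1, 0.9, 0.3, 0.7]
kept_half = prune_tokens(scores, 0.5)
kept_fast = prune_tokens(scores, keep_ratio(speed=0.2))
```

Returning the kept indices in their original order preserves positional structure for the layers that follow.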

2508.03526 2026-05-26 cs.RO 版本更新

CollaBot: Vision-Language Guided Simultaneous Collaborative Manipulation

CollaBot: 视觉-语言引导的同步协作操作

Kun Song, Gaoming Chen, Shentao Ma, Ninglong Jin, Guangbao Zhao, Mingyu Ding, Zhenhua Xiong, Jia Pan

发表机构 * Department of Computer Science, The University of Hong Kong(香港大学计算机科学系) School of Mechanical Engineering, Shanghai Jiao Tong University(上海交通大学机械工程学院)

AI总结 提出CollaBot通用框架,通过场景分割、协作抓取和两阶段规划,实现多机器人同步协作操作大型物体,在实验中达到72%成功率。

Comments 8 pages,6 figures

详情
AI中文摘要

机器人学的一个核心目标是使机器人能够与物理世界交互。传统的操作研究主要关注单个机器人和相对较小的物体。然而,工厂和家庭环境通常需要大型物体的操作,例如移动桌子,这需要多个机器人协同工作。现有研究仍然缺乏一个能够处理不同物体、任务和机器人团队规模的通用框架。在这项工作中,我们提出了CollaBot,一个用于同步协作操作的通用框架。首先,我们使用SEEM进行场景分割和目标物体提取。然后,我们提出了一个协作抓取框架,将任务分解为局部抓取姿态生成和全局协调。最后,我们设计了一个两阶段规划模块,以生成无碰撞轨迹用于任务执行。在不同物体、任务和机器人数量设置下的实验结果表明,我们的框架达到了72%的成功率。这比基于行为克隆的方法有显著改进,验证了所提出框架在复杂多机器人协作任务中的优势。真实世界实验进一步证明了我们的方法在实际应用中的可行性。

英文摘要

One central goal of robotics is to enable robots to interact with the physical world. Traditional manipulation studies primarily focus on single robots and relatively small objects. However, factory and domestic environments often require large-object manipulation, such as moving tables, where multiple robots must work collaboratively. Existing studies still lack a generalizable framework that can handle diverse objects, tasks, and robot team sizes. In this work, we propose CollaBot, a generalist framework for simultaneous collaborative manipulation. First, we use SEEM for scene segmentation and target-object extraction. Then, we propose a collaborative grasping framework that decomposes the task into local grasp pose generation and global coordination. Finally, we design a two-stage planning module to generate collision-free trajectories for task execution. Experimental results across different settings with varying objects, tasks, and numbers of robots indicate that our framework achieves a 72% success rate. This marks a substantial improvement over behavior cloning-based methods, validating the advantages of the proposed framework in complex multi-robot cooperative tasks. Real-world experiments further demonstrate the feasibility of our method in practical applications.

2407.00848 2026-05-26 cs.RO 版本更新

EgoExo++: Integrating On-demand Exocentric Visuals with 2.5D Ground Surface Estimation for Interactive Teleoperation of Underwater ROVs

EgoExo++:结合按需外中心视觉与2.5D地面估计的水下ROV交互式遥操作

Adnan Abdullah, Ruo Chen, Ioannis Rekleitis, Md Jahidul Islam

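Affiliations * RoboPI Laboratory, Dept. of ECE, University of Florida, USA; Dept. of ME, University of Delaware, USA

AI summary Addresses the restricted field of view in underwater ROV teleoperation with EgoExo++, which synthesizes exocentric views through geometry-driven visual SLAM and estimates a 2.5D ground surface on the fly, improving teleoperation performance and user experience.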

Comments EgoExo++ (Accepted in IJRR), V7/V3, metadata updated, 16 pages

详情
AI中文摘要

水下ROV(遥控潜水器)对于海底探索和任务执行不可或缺,但基于自我中心(第一人称)视频流的典型遥操作引擎限制了人类操作员的视野,并限制了在复杂、非结构化水下环境中的精确操控。为解决这一问题,我们首先提出EgoExo,一种集成到视觉SLAM流水线中的几何驱动解决方案,从自我中心摄像头馈送中按需合成外中心(第三人称)视图。我们进一步提出EgoExo++,它超越2D外中心视图合成(EgoExo),实时增强分段平面2.5D地面估计。其无锚点空中视角支持地面相对推理,如间隙和基于地形的导航标记跟随。所涉及的计算是闭式的,仅依赖于自我中心视图和单目SLAM估计,这使得它可移植到现有遥操作引擎,并对不同水体特性具有鲁棒性。我们通过2自由度室内导航和6自由度水下洞穴探索在挑战性低光条件下的广泛实验验证了方法的几何精度。为评估操作优势,我们进行了两项用户研究,分别使用模拟和真实数据,每项涉及15名参与者,比较基线自我中心遥操作和EgoExo++。结果表明,系统可用性(SUS)提高,感知工作负荷(NASA-TLX)降低,客观遥操作性能显著提升,包括任务速度提高16%,路径偏差比降低5倍,碰撞事件减少(试验中2次对比5次)。此外,我们强调了EgoExo++增强视觉在支持共享自主和具身遥操作中的作用。EgoExo++的源代码包可在https://github.com/uf-robopi/EgoExo获取。

英文摘要

Underwater ROVs (Remotely Operated Vehicles) are indispensable for subsea exploration and task execution, yet typical teleoperation engines based on egocentric (first-person) video feeds restrict human operators' field-of-view and limit precise maneuvering in complex, unstructured underwater environments. To address this, we first propose EgoExo, a geometry-driven solution integrated into a visual SLAM pipeline that synthesizes on-demand exocentric (third-person) views from egocentric camera feeds. We further propose EgoExo++, which extends beyond 2D exocentric view synthesis (EgoExo) to augment a piecewise planar 2.5D ground surface estimation on-the-fly. Its anchor-free aerial viewpoint supports ground-relative reasoning, such as clearance and terrain-based navigation marker following. The computations involved are closed-form and rely solely on egocentric views and monocular SLAM estimates, which makes it portable across existing teleoperation engines and robust to varying waterbody characteristics. We validate the geometric accuracy of our approach through extensive experiments of 2-DOF indoor navigation and 6-DOF underwater cave exploration in challenging low-light conditions. To assess operational benefits, we conduct two user studies with simulation and real-world data, each involving 15 participants, comparing baseline egocentric teleoperation and EgoExo++. Results indicate improved system usability (SUS), reduced perceived workload (NASA-TLX), and significant gains in objective teleoperation performance, including 16% faster missions, 5-fold reduction in path deviation ratio, and fewer collision events (2 vs. 5 across trials). Furthermore, we highlight the role of EgoExo++ augmented visuals in supporting shared autonomy and embodied teleoperation. The source packages for EgoExo++ are available at: https://github.com/uf-robopi/EgoExo.

2403.06636 2026-05-26 cs.RO 版本更新

Design, Control, and Motion Strategy for DELTA: Transformable Multilink Multirotor for Air-Ground Hybrid Locomotion and Manipulation

DELTA:可变形多连杆多旋翼飞行器的设计、控制与运动策略——用于空地混合运动与操作

Kazuki Sugihara, Moju Zhao, Takuzumi Nishio, Kei Okada, Masayuki Inaba

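AI summary Presents DELTA, a new multilink multirotor with thrusters on each link and joint actuation that can perform terrestrial rolling, aerial flight, and manipulation in multiple environments, together with a real-time control method based on nonlinear optimization and motion strategies that account for contact constraints.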

Comments 20 pages, 31 figures

详情
AI中文摘要

近年来,多模态运动能力使机器人能够在陆地和空中领域机动。然而,大多数此类机器人仅设计用于运动,很少具备实际任务所需的操作能力。通过添加机械臂,地面机器人可以执行操作,一些带有机械臂的无人机已展示了空中操作能力。尽管如此,这类多旋翼无法直接用于地面操作,且这种配置本身不适合空地混合运动。这是因为其推进器集中式结构难以同时实现足够的操作自由度(DoF)以及带接触和变形的稳定运动。因此,在本工作中,我们开发了一种新型多连杆多旋翼机器人,每个连杆上装有推进器,并能够与环境接触。该机器人可以利用关节驱动,在多种环境中执行地面滚动运动、空中飞行运动以及操作。首先,我们介绍了所提出机器人的最小配置设计。我们还描述了运动学模型,并基于该模型提出了每个组件的设计。其次,我们提出了一种基于非线性优化的实时控制方法,该方法考虑了接触和关节运动,可应用于各种多旋翼。第三,我们提出了包含空地混合多连杆多旋翼特有接触约束的运动策略,并基于多接触模型分析了操作能力的局限性。最后,我们使用实现的样机展示了两个领域中的多种运动。据我们所知,这是多连杆多旋翼首次展示空地混合运动与操作。

英文摘要

In recent years, multimodal locomotion capabilities have enabled robots to maneuver in both terrestrial and aerial domains. However, most of these robots are designed only for locomotion, and few possess the manipulation capabilities required for practical tasks. By adding a manipulator, ground robots can perform manipulation, and some drones with robotic arms have demonstrated aerial manipulation. Nonetheless, such multirotors cannot be directly used for manipulation on the ground, and this configuration itself is unsuitable for air-ground hybrid locomotion. This is because their thruster-centralized structure makes it difficult to achieve both sufficient degrees of freedom (DoF) for manipulation and stable motion with contact and transformation. Therefore, in this work, we develop a new multilink multirotor that has thrusters on each link and is capable of contact with the environment. This robot can perform terrestrial rolling locomotion, aerial flight locomotion, and manipulation in multiple environments using joint actuation. First, we introduce a minimal configuration design of the proposed robot. We also describe a kinematic model and propose a design for each component based on this model. Second, we propose a real-time control method based on nonlinear optimization that considers contact and joint motion, which can be applied to various multirotors. Third, we propose motion strategies that include contact constraints specific to air-ground hybrid multilink multirotors, and analyze the limitations of manipulation capabilities based on a multi-contact model. Finally, we demonstrate a variety of motions in both domains using the implemented prototype. To the best of our knowledge, this is the first demonstration of air-ground hybrid locomotion and manipulation by a multilink multirotor.

2105.01215 2026-05-26 cs.RO 版本更新

Lidar Scan Registration Robust to Extreme Motions

对极端运动鲁棒的激光雷达扫描配准

Simon-Pierre Deschênes, Dominic Baril, Vladimír Kubelka, Philippe Giguère, François Pomerleau

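Affiliations * Northern Robotics Laboratory

AI summary Targets registration failures caused by point cloud skewing under extreme motion with a de-skewing method that accounts for trajectory motion uncertainty and environment geometry; under peak accelerations of 200 m/s^2 and 800 rad/s^2, it reduces translation error by 9.26% and rotation error by 21.84%.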

Comments 8 pages, 8 figures, published in 2021 18th Conference on Robots and Vision (CRV), Burnaby, Canada

详情
Journal ref
2021 18th Conference on Robots and Vision (CRV), 2021, pp. 17-24
AI中文摘要

配准算法,如迭代最近点(ICP),在过去几十年中已被证明在移动机器人定位算法中有效。然而,当机器人承受极端速度和加速度时,它们容易失败。例如,这种运动可能在碰撞后发生,导致点云严重畸变。虽然过去已经探索了点云去畸变方法以提高定位和建图精度,但这些方法仍然依赖于高精度的里程计系统或理想的导航条件。在本文中,我们提出了一种方法,考虑了用于去畸变点云的轨迹的剩余运动不确定性以及环境几何,以提高当前配准算法的鲁棒性。我们在一个产生200 m/s^2和800 rad/s^2峰值加速度的3D地图测试台上将我们的方法与其他三种解决方案进行了比较。在这些极端场景中,我们证明了我们的方法将平移误差降低了9.26%,旋转误差降低了21.84%。所提出的方法具有足够的通用性,可以无需调整地集成到许多加权ICP的变体中,并支持在更恶劣地形中的定位鲁棒性。

英文摘要

Registration algorithms, such as Iterative Closest Point (ICP), have proven effective in mobile robot localization algorithms over the last decades. However, they are susceptible to failure when a robot sustains extreme velocities and accelerations. For example, this kind of motion can happen after a collision, causing a point cloud to be heavily skewed. While point cloud de-skewing methods have been explored in the past to increase localization and mapping accuracy, these methods still rely on highly accurate odometry systems or ideal navigation conditions. In this paper, we present a method taking into account the remaining motion uncertainties of the trajectory used to de-skew a point cloud along with the environment geometry to increase the robustness of current registration algorithms. We compare our method to three other solutions in a test bench producing 3D maps with peak accelerations of 200 m/s^2 and 800 rad/s^2. In these extreme scenarios, we demonstrate that our method decreases the error by 9.26% in translation and by 21.84% in rotation. The proposed method is generic enough to be integrated into many variants of weighted ICP without adaptation and supports localization robustness in harsher terrains.
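
For readers unfamiliar with de-skewing, the basic operation that the abstract builds on can be sketched as follows. This is a minimal 2D illustration under a constant-velocity assumption, not the authors' method: the function name `deskew_scan` and its parameters are hypothetical, and the paper's actual contribution (modelling the remaining uncertainty of this trajectory) is not shown.

```python
import math

def deskew_scan(points, velocity, yaw_rate):
    """Re-express each point in the sensor frame at scan start.

    points   : list of (x, y, t), t = time since scan start [s]
    velocity : (vx, vy) of the sensor in the scan-start frame (assumed constant)
    yaw_rate : angular velocity about z [rad/s] (assumed constant)
    """
    vx, vy = velocity
    corrected = []
    for x, y, t in points:
        yaw = yaw_rate * t  # sensor heading when this point was measured
        c, s = math.cos(yaw), math.sin(yaw)
        # Rotate by the sensor heading, then add the sensor displacement
        corrected.append((c * x - s * y + vx * t,
                          s * x + c * y + vy * t))
    return corrected
```

If the assumed velocity or yaw rate is wrong, which is likely right after a collision, the corrected points inherit that error; this is the situation the abstract targets.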

2605.24495 2026-05-26 cs.RO 版本更新

Elevator-LIO: Robust LiDAR-Inertial Odometry for Multi-Floor Navigation under Elevator-Induced Non-Inertial Motion

Elevator-LIO:电梯引起的非惯性运动下多层导航的鲁棒激光雷达-惯性里程计

Yifan Zhang, Yudong Huang, Yuchong Zhang, Changze Li, Haoran Liu, Ming Yang, Tong Qin

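Affiliations * Shanghai Jiao Tong University

AI summary Proposes Elevator-LIO, which uses a decoupled state-estimation model and a mode-dependent iterated error-state Kalman filter to localize continuously inside elevators, with adaptive voxel downsampling to keep the number of effective points stable and event-triggered updates to suppress vertical drift.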

Comments 16 pages, 10 figures, 5 tables

详情
AI中文摘要

本文提出了Elevator-LIO,一种旨在电梯行驶过程中实现机器人连续定位的激光雷达-惯性里程计框架,从而支持跨楼层机器人任务。为了解决非惯性框架下的状态估计问题,Elevator-LIO建立了一个解耦的状态估计模型,分别对机器人相对于电梯的运动和电梯自身的运动进行建模,并将其嵌入到模式依赖的迭代误差状态卡尔曼滤波器框架中。该框架在普通室内环境中退化为常规LIO估计,同时在电梯非惯性环境中实现电梯相关状态的传播和约束更新,从而实现连续稳定的定位。电梯模式管理器利用激光雷达测距统计和估计状态检测电梯进出事件,并在电梯停止时引入事件触发的零速度和零加速度更新,以抑制累积的垂直漂移。此外,本文采用自适应体素降采样策略,在环境尺度显著变化时保持有效点数的稳定。我们在包含79次电梯乘坐的20个真实世界序列上进行了广泛实验,包括大尺度空间、长垂直行程、动态行人干扰和镜面反射等实际挑战。结果表明,Elevator-LIO在所有序列中保持连续定位精度,其中17个序列的终端高度误差低于1厘米。相比之下,现有代表性定位系统在这些电梯序列上表现不佳。在Hilti 2022/2023数据集上的测试进一步表明,所提方法在标准室内场景中仍具有竞争力。项目页面位于https://xiaofan4122.github.io/Elevator_LIO_Page/。

英文摘要

This paper presents Elevator-LIO, a LiDAR-inertial odometry framework designed to achieve continuous robot localization during elevator travel, thereby supporting cross-floor robotic tasks. To address the state-estimation problem in non-inertial frames, Elevator-LIO establishes a decoupled state-estimation model that separately models the robot motion relative to the elevator and the elevator motion itself, and embeds it into a mode-dependent iterated error-state Kalman filter framework. This framework degenerates to conventional LIO estimation in ordinary indoor environments, while enabling the propagation and constrained update of elevator-related states in elevator non-inertial environments, thereby achieving continuous and stable localization. An elevator mode manager detects elevator entry and exit events using LiDAR ranging statistics and estimated states, and introduces event-triggered zero-velocity and zero-acceleration updates when the elevator stops to suppress accumulated vertical drift. In addition, this paper adopts an adaptive voxel downsampling strategy to maintain a stable number of effective points under significant environmental scale changes. We conduct extensive experiments on 20 real-world sequences containing 79 elevator rides, including practical challenges such as large-scale spaces, long vertical travel, dynamic pedestrian interference, and mirror reflections. The results show that Elevator-LIO maintains continuous localization accuracy in all sequences, with terminal height error below 1 cm in 17 sequences. In contrast, existing representative localization systems perform poorly on these elevator sequences. Tests on the Hilti 2022/2023 datasets further show that the proposed method remains competitive in standard indoor scenarios. The project page is available at https://xiaofan4122.github.io/Elevator_LIO_Page/.
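
The event-triggered zero-velocity update mentioned above can be illustrated with a scalar Kalman correction. This is a simplified sketch, assuming a single vertical-velocity state and a pseudo-measurement of zero; the paper uses an iterated error-state Kalman filter over a much larger state, and the name `zero_velocity_update` is hypothetical.

```python
def zero_velocity_update(v_est, p_var, meas_var):
    """Fuse the pseudo-measurement 'velocity = 0' into a scalar estimate.

    v_est    : current vertical velocity estimate [m/s]
    p_var    : variance of that estimate
    meas_var : variance assigned to the zero-velocity pseudo-measurement
    """
    if p_var + meas_var <= 0.0:
        raise ValueError("total variance must be positive")
    gain = p_var / (p_var + meas_var)       # Kalman gain
    v_new = v_est + gain * (0.0 - v_est)    # pull the estimate toward zero
    p_new = (1.0 - gain) * p_var            # uncertainty shrinks after the update
    return v_new, p_new
```

A small `meas_var` expresses high confidence that the elevator has stopped and pulls the estimate almost fully to zero, which is what limits drift accumulation between rides.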

2605.24449 2026-05-26 cs.RO cs.LG 版本更新

Vision-Guided Outdoor Flight and Obstacle Evasion via Reinforcement Learning

基于强化学习的视觉引导户外飞行与避障

Shiladitya Dutta, Aayush Gupta, Varun Saran, Avideh Zakhor

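Affiliations * College of Engineering, Department of Electrical Engineering and Computer Science, University of California Berkeley

AI summary Proposes a sensorimotor policy based on stereo-vision depth and visual-inertial odometry, trained in simulation with reinforcement and privileged learning, that transfers zero-shot to unseen outdoor obstacle environments and a new drone platform for autonomous obstacle-avoiding navigation.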

Comments Published in IEEE Robotics and Automation Letters, vol 11, no 2. Presented at the IEEE International Conference on Robotics and Automation 2026

详情
AI中文摘要

尽管四旋翼飞行器凭借其全向机动性拥有令人印象深刻的穿越能力,但在复杂环境中需要持续的人工操控限制了其在GNSS和遥测信号缺失场景中的应用。为此,我们提出了一种新颖的传感器运动策略,该策略使用立体视觉深度和视觉惯性里程计(VIO)在未知环境中自主穿越障碍物以到达目标点。该策略由一个预训练的自编码器作为感知前端,后接一个规划与控制LSTM网络,输出速度指令,可由现成的商用无人机执行。我们利用强化学习和特权学习范式,通过两阶段过程在仿真中训练该策略:1)以全局运动规划器生成的优化轨迹作为监督骨干进行初始训练;2)在课程环境中进一步微调。为弥合仿真到现实的差距,我们采用领域随机化和奖励塑造来创建对噪声和领域偏移具有鲁棒性的策略。在户外实验中,我们的方法成功实现了对训练中从未遇到的障碍环境和无人机平台的零样本迁移。

英文摘要

Although quadcopters boast impressive traversal capabilities enabled by their omnidirectional maneuverability, the need for continuous pilot control in complex environments impedes their application in GNSS- and telemetry-denied scenarios. To this end, we propose a novel sensorimotor policy that uses stereo-vision depth and visual-inertial odometry (VIO) to autonomously navigate through obstacles in an unknown environment to reach a goal point. The policy comprises a pre-trained autoencoder as the perception head followed by a planning and control LSTM network which outputs velocity commands that can be followed by an off-the-shelf commercial drone. We leverage reinforcement and privileged learning paradigms to train the policy in simulation through a two-stage process: 1) initial training with optimal trajectories generated by a global motion planner acting as a supervisory backbone, 2) further fine-tuning in a curriculum environment. To bridge the sim-to-real gap, we employ domain randomization and reward shaping to create a policy that is robust to both noise and domain shift. In outdoor experiments, our approach achieves successful zero-shot transfer to both obstacle environments and a drone platform that were never encountered during training.

2605.24436 2026-05-26 cs.MA cs.LG cs.RO 版本更新

A Reinforcement Learning Inspired Latent Yield Based Adaptive Algorithm Switching Mechanism

一种受强化学习启发的基于潜在收益的自适应算法切换机制

Jayprakash S. Nair, Jimson Mathew, Shivashankar B. Nair

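Affiliations * Indian Institute of Technology Patna; Indian Institute of Technology Guwahati

AI summary Addresses the difficulty of algorithm selection in online or dynamic environments with a reinforcement-learning-inspired latent yield that encapsulates rewards and penalties to trigger exploration and exploitation, producing adaptive algorithm switching; validated on sorting algorithms and robotic obstacle avoidance tasks.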

Comments Accepted and published in the Proceedings of the 29th European Conference on Applications of Evolutionary Computation (EvoApplications 2026), held as part of EvoStar 2026, Toulouse, France, April 8 to 10, 2026. Lecture Notes in Computer Science (LNCS), Springer Nature Switzerland

详情
Journal ref
Applications of Evolutionary Computation, EvoApplications 2026, LNCS, Springer Nature Switzerland, 2026
AI中文摘要

对于给定的问题实例,选择最合适的算法仍然是一项具有挑战性的任务,尤其是在问题特征随时间演变的在线或动态环境中。仅依赖瞬时性能指标可能导致反应性和不稳定的行为,通常会导致次优的算法切换。本文介绍了一种计算高效的方法,用于聚合算法在多个问题实例上的性能,该方法对实例特征的剧烈变化具有相当的免疫性。受强化学习(RL)固有特征的启发,该技术将奖励和惩罚封装到一个潜在收益中,进而触发利用和探索,从而产生自适应算法切换。所提出的技术采用受遗传算法启发的岛屿模型,以促进并行探索和算法种群之间的性能交换,这些算法种群栖息在局部库中。在排序算法和机器人避障任务上的实验评估证明了该方法的可行性和有效性,突显了其在自适应算法选择至关重要的领域中的潜力。

英文摘要

Selecting the most suitable algorithm for a given problem instance remains a challenging task, particularly in online or dynamic environments where problem characteristics evolve over time. Relying solely on instantaneous performance metrics can result in a reactive and unstable behaviour, often leading to suboptimal algorithm switching. This paper introduces a computationally efficient approach for aggregating an algorithm's performance across multiple problem instances that is fairly immune to erratic variations in instance features. Inspired by features inherent to Reinforcement Learning (RL), this technique encapsulates rewards and penalties into a latent yield that, in turn, triggers exploitation and exploration, consequently resulting in adaptive algorithm switching. The proposed technique employs island models, inspired by Genetic Algorithms, to facilitate parallel exploration and performance exchanges among algorithm populations inhabiting local repertoires. Experimental evaluations on sorting algorithms and robotic obstacle avoidance tasks demonstrate the feasibility and effectiveness of the approach, highlighting its potential in domains where adaptive algorithm selection is critical.
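
The idea of switching on an aggregated yield instead of instantaneous performance can be sketched as below. Everything here is an assumption made for illustration: the abstract does not give the update rule, so the exponential decay, the threshold, and the class name `YieldSwitcher` are hypothetical, and the island model is omitted.

```python
class YieldSwitcher:
    """Keep a decayed running yield per algorithm; switch only when it drops too low."""

    def __init__(self, names, decay=0.9, threshold=-1.0):
        if len(names) < 2:
            raise ValueError("need at least two algorithms to switch between")
        self.yields = {name: 0.0 for name in names}
        self.current = names[0]
        self.decay = decay
        self.threshold = threshold

    def update(self, outcome):
        """outcome > 0 is a reward, outcome < 0 a penalty for the current algorithm."""
        y = self.decay * self.yields[self.current] + outcome
        self.yields[self.current] = y
        if y < self.threshold:
            # Explore: move to the best-yielding alternative
            others = [n for n in self.yields if n != self.current]
            self.current = max(others, key=lambda n: self.yields[n])
        # Otherwise exploit: keep the current algorithm
        return self.current
```

With the default settings, one penalty of -0.6 leaves the current algorithm in place, while a second consecutive one pushes the yield to -1.14 and triggers a switch, so a single erratic instance does not cause churn.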

2605.24433 2026-05-26 cs.RO cs.LG 版本更新

Smoother Action Chunking Flow Policy via Prior-Corrected Orthogonal Trust-Region Guidance

基于先验校正的正交信任区域引导的平滑动作块流策略

Kai Fang, Hailong Pei, Xuemin Chi

发表机构 * South China University of Technology(华南理工大学) Zhejiang University(浙江大学)

AI总结 提出POTR方法,通过先验校正权重和正交信任区域约束,改善流匹配机器人策略中动作块推理的边界不连续性和横向扰动,提升成功率和运动平滑性。

详情
AI中文摘要

流匹配机器人策略通常使用动作块推理进行高效的闭环控制,但块边界可能引入不连续的动作转换。现有的RTC引导通过在去噪过程中注入校正信号来改善连续性,但其权重调度在中间时间步较弱,且无约束的校正方向可能引入横向扰动。我们提出POTR,一种先验校正的正交信任区域引导方法。首先,我们将数据先验尺度$σ_d$纳入RTC引导权重,产生更强的中间时间校正。其次,我们将引导向量分解为与去噪速度平行和垂直的分量,并将垂直分量约束在信任区域内。在LIBERO上使用$π_{0.5}$,与RTC相比,POTR提高了成功率,并持续减少了块边界不连续性、加速度和加加速度。消融实验表明,先验校正权重提供了主要的校正增益,而正交信任区域进一步提高了稳定性。

英文摘要

Flow-matching robot policies commonly use action-chunking inference for efficient closed-loop control, but chunk boundaries can introduce discontinuous action transitions. Existing RTC guidance improves continuity by injecting correction signals during denoising, yet its weight schedule is weak at intermediate timesteps and its unconstrained correction direction may introduce transverse perturbations. We propose POTR, a **p**rior-corrected **o**rthogonal **t**rust-**r**egion guidance method. First, we incorporate a data-prior scale $σ_d$ into the RTC guidance weight, yielding stronger intermediate-time correction. Second, we decompose the guidance vector into components parallel and perpendicular to the denoising velocity, and constrain the perpendicular component within a trust region. On LIBERO with $π_{0.5}$, POTR improves success rate and consistently reduces chunk-boundary discontinuity, acceleration, and jerk compared with RTC. Ablations show that the prior-corrected weight provides the main correction gain, while the orthogonal trust region further improves stability.
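
The second step, splitting the guidance vector relative to the denoising velocity and bounding the perpendicular part, can be sketched as follows. This is a minimal illustration in plain Python, assuming the trust region is a Euclidean norm bound `radius`; that choice, the function name `project_guidance`, and the handling of a zero velocity are assumptions, not details taken from the paper.

```python
import math

def project_guidance(g, v, radius):
    """Keep the part of g parallel to v; limit the perpendicular part to norm <= radius."""
    vv = sum(x * x for x in v)
    if vv == 0.0:
        # No velocity direction available: treat all of g as perpendicular
        par = [0.0] * len(g)
    else:
        scale = sum(a * b for a, b in zip(g, v)) / vv
        par = [scale * x for x in v]            # projection of g onto v
    perp = [a - b for a, b in zip(g, par)]      # remainder, orthogonal to v
    norm = math.sqrt(sum(x * x for x in perp))
    if norm > radius and norm > 0.0:
        perp = [x * radius / norm for x in perp]  # shrink onto the trust-region boundary
    return [a + b for a, b in zip(par, perp)]
```

For example, with `g = [3.0, 4.0]`, `v = [1.0, 0.0]`, and `radius = 1.0`, the parallel part `[3.0, 0.0]` is kept and the perpendicular part `[0.0, 4.0]` is scaled down to `[0.0, 1.0]`.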

2605.24394 2026-05-26 cs.RO 版本更新

RoboHitch: Learning Visual Affordance from Disordered Keypoints for Hitch Knots Tying

RoboHitch: 从无序关键点学习视觉可供性用于系结

Jiahui Zuo, Boyang Zhang, Fumin Zhang

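Affiliations * Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology

AI summary Proposes RoboHitch, a framework that learns hitch knot tying from human demonstrations using disordered 3D keypoints and RGB images; it fuses features from a dynamic graph autoencoder and a convolutional autoencoder to predict pick and place affordances, enabling knot tying under occlusion.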

详情
AI中文摘要

由于复杂的动力学和频繁的自遮挡,可变形线性物体的机器人操作面临重大挑战。现有的机器人打结方法通常依赖于有序关键点和显式边缘连接的精确拓扑状态跟踪。这种依赖使得它们在打结过程中因重复弯曲和交叉导致的跟踪漂移和拓扑不匹配而容易失败。为了解决这些限制,我们引入了RoboHitch,一个新颖的框架,它仅使用无序的3D关键点和RGB图像从人类演示中学习执行系结。这消除了对显式拓扑顺序的需求,允许更灵活的操作。我们的方法采用动态图自编码器从未跟踪的关键点中提取几何特征,并辅以卷积自编码器捕获必要的视觉上下文。然后,双向交叉注意力机制融合这些模态,共同预测抓取和放置可供性,促进对绳子状态的隐式推理,并在遮挡下实现系结。真实世界实验证明了我们方法的有效性和泛化能力,成功完成了自遮挡场景中的系结。

英文摘要

Robotic manipulation of deformable linear objects (DLOs) presents significant challenges due to complex dynamics and frequent self-occlusions. Existing robotic knot tying methods typically rely on precise topological state tracking with ordered keypoints and explicit edge connectivity. This reliance makes them prone to failures due to tracking drift and topology mismatch caused by repeated bending and crossings during knot formation. To address these limitations, we introduce RoboHitch, a novel framework that learns to perform hitch knot tying from human demonstrations using only disordered 3D keypoints and RGB images. This eliminates the need for explicit topological order, allowing for more flexible manipulation. Our method employs a dynamic Graph Autoencoder to extract geometric features from untracked keypoints, complemented by a Convolutional Autoencoder that captures essential visual context. A bidirectional cross-attention mechanism then fuses these modalities to jointly predict pick and place affordances, facilitating implicit reasoning about the rope's state and enabling knot tying under occlusion. Real-world experiments demonstrate the effectiveness and generalizability of our approach, successfully completing hitch knots in scenarios with self-occlusions.

2605.24350 2026-05-26 cs.RO cs.HC 版本更新

PACT: Proactive Asking for Continual Task Assistance in Human-Robot Collaboration

PACT:人机协作中持续任务辅助的主动询问

Chengbo He, Sheng Li, Chenyang Ma, Bochao Zou, Li Sun, Jiansheng Chen, Junliang Xing, Yuanchun Shi, Huimin Ma

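Affiliations * University of Science and Technology Beijing; University of Oxford; Tsinghua University

AI summary Proposes PACT, a framework that uses reinforcement learning to decide, under partial observations, when to proactively ask the user for clarification, progressively improving assistance accuracy and clarification utility in cross-day human-robot collaboration.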

详情
AI中文摘要

在长期人机协作中,机器人助手需要在部分观测下辅助用户,同时利用跨日的交互历史。然而,在协作开始时,人类的特征和常规通常是未知的,这使得被动的推断-行动辅助变得低效。为了解决这一挑战,我们研究了跨日主动询问设置以进行持续任务辅助,并提出了PACT(持续任务辅助的主动询问),这是一个询问-行动框架,决定在采取行动前是否应寻求澄清。PACT利用当前观测以及累积的交互历史来评估上下文充分性,使机器人能够提供更可靠的辅助,并逐步适应用户。我们使用强化学习实现了其主要的学习实例,并在同一框架下评估了替代实例。为了评估这种行为,我们进一步引入了一个澄清效用度量,量化了辅助准确性与澄清请求频率之间的权衡。在多日具身协作场景中的实验表明,与被动推断基线相比,PACT持续提高了辅助准确性和澄清效用,突显了主动询问在持续人机协作中的重要性。

英文摘要

Robotic assistants in long-term human-robot collaboration need to assist users under partial observations while leveraging cross-day interaction history. However, human traits and routines are often unknown at the beginning of collaboration, making passive infer-then-act assistance ineffective and inefficient. To address this challenge, we study a cross-day proactive asking setting for continual task assistance and propose PACT (Proactive Asking for Continual Task Assistance), an ask-or-act framework that determines whether clarification should be sought before taking action. PACT leverages current observations together with accumulated interaction history to evaluate contextual sufficiency, enabling the robot to provide more reliable assistance and progressively adapt to the user over time. We implement its primary learned instantiation using reinforcement learning and evaluate alternative instantiations under the same framework. To assess such behavior, we further introduce a clarification utility metric that quantifies the trade-off between assistance accuracy and the frequency of clarification requests. Experiments in multi-day embodied collaboration scenarios demonstrate that, compared with passive inference baselines, PACT consistently improves both assistance accuracy and clarification utility, highlighting the importance of proactive asking in continual human-robot collaboration.

2605.24339 2026-05-26 cs.RO 版本更新

IsaacIPC: Coupling High-Fidelity Simulation and Realistic Rendering for Contact-Rich Robotic Systems

IsaacIPC: 面向高接触度机器人系统的高保真仿真与逼真渲染耦合框架

Qixin Liang, Zhongqing Han

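Affiliations * Anker Humanoid Lab; The University of Hong Kong

AI summary Proposes IsaacIPC, which couples GPU-accelerated incremental potential contact (IPC) with IsaacSim/Lab, maps deformation between simulation and visual meshes, and introduces the geometric mortar contact potential (GMCP) to better resolve contact-pressure distributions in tactile sensing, supporting rigid-deformable robot simulation.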

Comments This is a tech report

详情
AI中文摘要

我们提出IsaacIPC,一个将GPU加速的增量势接触(IPC)与IsaacSim/Lab耦合的机器人仿真框架。IsaacIPC在仿真网格和视觉网格之间映射模拟变形,实现实时逼真渲染,可应用于数据收集和策略评估。对于触觉传感,我们引入了几何砂浆接触势(GMCP),它在触觉表面上的接触样本上定义了一个屏障势,以更好地解析接触压力分布。我们在接触基准测试上评估了GMCP,并在刚柔耦合机器人仿真中展示了IsaacIPC,包括四足机器人、灵巧手和通用操作接口(UMI)夹爪。

英文摘要

We present IsaacIPC, a robotic simulation framework that couples GPU-accelerated incremental potential contact (IPC) with IsaacSim/Lab. IsaacIPC maps simulated deformation between simulation and visual meshes, enabling real-time realistic rendering with applications to data collection and policy evaluation. For tactile sensing, we introduce the geometric mortar contact potential (GMCP), which defines a barrier potential over contact samples on tactile surfaces to better resolve contact-pressure distributions. We evaluate GMCP on contact benchmarks and demonstrate IsaacIPC on rigid-deformable robotic simulations including a quadruped robot, a dexterous hand, and a universal manipulation interface (UMI) gripper.

2605.24311 2026-05-26 cs.RO 版本更新

Terrain-Adaptive Grouser Wheel for Optimal Planetary Exploration: Design and Experimental Investigation

地形自适应履刺轮用于最优行星探测:设计与实验研究

Vincent Griffo, Yashwanth Kumar Nakka

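Affiliations * Aerospace Robotics Lab, Daniel Guggenheim School of Aerospace Engineering, Georgia Institute of Technology

AI summary Addresses rover mobility on granular terrain with a multimodal wheel whose grouser height can be adjusted continuously; experiments show that adaptive deployment reduces slip by 30.0-58.0% and improves travel time and energy consumption by up to 77.4%.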

Comments Under Review

详情
AI中文摘要

在星外环境中运行的行星车经常因地形特征(如坡度和颗粒度)的变化而面临显著的移动挑战。虽然最近在多模态轮设计方面的研究探索了调整刚度、顺应性和直径作为提高地形适应性的手段,但全轮履刺可调节设计在很大程度上仍未探索。履刺是一个引人注目的可驱动特征,因为颗粒地形通常需要更高的履刺高度来改善车轮性能。因此,我们引入了[匿名化机器人名称],这是一种能够连续调节其履刺高度以适应地形的多模态轮。该平台在四种代表性表面(包括乙烯基地板、粗岩石、豌豆砾石和两种压实状态下的沙子)上进行了评估,覆盖了各种颗粒条件。750次实验试验的结果表明,相对于固定配置,自适应部署在颗粒状态下减少了30.0-58.0%的滑转,并将行驶时间和能耗提高了高达77.4%。利用地形试验数据,开发并验证了一个简化的缩放分析,表明地形颗粒度与测试配置的最佳履刺高度之间存在关系。没有单一的履刺高度能在所有地形上最小化滑转,这凸显了常用于行星探测的固定轮系统的局限性。这一观察结果强化了履刺自适应形态(例如[匿名化机器人名称])作为增强行星车在多样且移动挑战性强的星外环境中移动性的有效解决方案的潜力。

英文摘要

Planetary rovers operating in extraterrestrial environments often encounter significant mobility challenges due to varying terrain features such as gradients and granularity. While recent works in multimodal wheel design have explored adjustments in stiffness, compliance, and diameter as a means to improve terrain adaptability, full wheel grouser-adjustable designs remain largely unexplored. Grousers are a compelling feature to actuate, as granular terrains tend to require increased grouser height for improved wheel performance. As a result, we introduce [Anonymized Robot Name], a multimodal wheel capable of continuously adjusting its grouser height for terrain adaptation. The platform was evaluated across four representative surfaces, including vinyl flooring, coarse rock, pea gravel, and sand under two packing states, spanning a range of granular conditions. Results from 750 experimental trials demonstrate that adaptive deployment reduces slip by 30.0-58.0% and improves travel time and energy consumption by up to 77.4% in granular regimes relative to fixed configurations. Using the terrain trial data, a simplified scaling analysis was developed and validated, suggesting a relationship between terrain granularity and optimal grouser height for the tested configuration. No single grouser height minimized slip across all terrains, underscoring the limitations of fixed-wheel systems commonly used for planetary exploration. This observation reinforces the potential of grouser-adaptive morphology, such as [Anonymized Robot Name], as an effective solution for enhancing rover mobility across diverse and mobility-challenging extraterrestrial environments.
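
The slip figures above can be read against the usual longitudinal slip ratio for a driven wheel, s = 1 - v / (r * omega). The sketch below assumes that standard definition; the abstract does not state which slip measure or effective radius the authors used, and both function names are hypothetical.

```python
def slip_ratio(forward_speed, wheel_radius, angular_speed):
    """Longitudinal slip of a driven wheel: 0 = pure rolling, 1 = spinning in place."""
    rim_speed = wheel_radius * angular_speed
    if rim_speed <= 0.0:
        raise ValueError("wheel must be driven forward (radius * angular speed > 0)")
    return 1.0 - forward_speed / rim_speed

def relative_reduction(baseline, adapted):
    """Fractional improvement of 'adapted' over 'baseline' (0.30 means 30% lower)."""
    if baseline == 0.0:
        raise ValueError("baseline must be non-zero")
    return (baseline - adapted) / baseline
```

Under this definition, a wheel whose slip falls from 0.50 with fixed grousers to 0.35 with adapted grousers shows a 30% relative reduction, the lower end of the reported range.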

2605.24301 2026-05-26 cs.RO 版本更新

AcroRL: Learning Aggressive Quadrotor Inversion using Bidirectional Thrust

AcroRL: 使用双向推力学习激进四旋翼翻转

Gabriel Rodriguez, Henri Sayag, Abhishek Rathod, John Stecklein, Siddharth Saha, Christopher Barngrover, Wennie Tabib

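Affiliations * Carnegie Mellon University; Shield AI

AI summary Proposes a reinforcement-learning framework that modulates a constant reference trajectory to perform compact, position-constrained quadrotor inversions; in simulation it reduces position RMSE by 32% and settling time by 57%, and hardware experiments confirm inversion across multiple yaw configurations.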

Comments 17 pages, 8 figures

详情
AI中文摘要

双向推力赋予四旋翼第二个平衡条件和更大的控制权限,扩展了可能激进机动的包络,并实现倒飞、栖息和感知。先前的几何控制方法通过基于Hopf纤维化的姿态表示扩展微分平坦性以支持双向推力,但在翻转过程中遇到执行器饱和和电机反转延迟的问题,需要启发式推力姿态调度和航点调整。我们提出一个基于学习的框架,该框架调制恒定参考轨迹以执行紧凑、位置约束的四旋翼翻转,同时保持与传统轨迹生成和跟踪在不同飞行状态下的兼容性。通过强化学习分别训练从正常到倒飞和从倒飞到正常转换的策略。在基于JAX的仿真中,所提方法在所有评估基线中实现了最低的位置偏差和稳定时间,相对于最强的基于优化的基线,位置均方根误差(RMSE)降低了32%,稳定时间减少了57%。硬件实验展示了在多个偏航配置下成功翻转,位置RMSE低于0.35米,并通过在两个状态下的圆形飞行展示了与下游轨迹生成和控制的兼容性。此外,我们提供了所提框架的开源实现。

英文摘要

Bidirectional thrust grants quadrotors a second equilibrium condition and increased control authority, expanding the envelope of possible aggressive maneuvers and enabling inverted flight, perching, and sensing. Prior geometric control approaches extend differential flatness through Hopf fibration-based attitude representations to support bidirectional thrust, but struggle with actuator saturation and motor reversal delay during inversions, requiring heuristic thrust posture scheduling and waypoint tuning. We propose a learning-based framework that modulates a constant reference trajectory to perform compact, position-constrained quadrotor inversions while remaining compatible with traditional trajectory generation and tracking across flight regimes. Separate policies are trained via reinforcement learning for nominal-to-inverted and inverted-to-nominal transitions. In JAX-based simulation, the proposed method achieves the lowest position deviation and settling time across all evaluated baselines, reducing position root mean square error (RMSE) by 32% and settling time by 57% relative to the strongest optimization-based baseline. Hardware experiments demonstrate successful inversion across multiple yaw configurations with position RMSE below 0.35m, and compatibility with downstream trajectory generation and control through circular flight in both regimes. Additionally, we provide an open-source implementation of the proposed framework.

2605.24225 2026-05-26 cs.RO 版本更新

ECo-MoE: Embodiment-Conditioned Mixture of Experts Increases the Evolvability of Robots

ECo-MoE: 基于具身条件的专家混合提升机器人的可进化性

Yibin Wang, Muhan Li, Zihan Guo, Sam Kriegman

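Affiliations * Northwestern University, Evanston, IL, USA

AI summary Proposes a model of robot evolution and learning that co-optimizes a distribution of latent design vectors and a mixture of control experts, sitting between per-robot policies (inefficient) and a single universal controller (overly conservative) within a unified yet modular framework; it yields adapted behavior for different body plans and supports knowledge reuse and steering of evolution.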

详情
AI中文摘要

在本文中,我们介绍了一种机器人进化与学习的模型,该模型联合优化潜在设计向量(基因型)的分布和混合控制专家(神经模块),这些专家由每个解码设计(表型)的潜在坐标门控。这为协同设计算法提供了一种可扩展的替代方案,这些算法要么为每个机器人训练一个单独的策略(效率低下),要么为所有机器人训练一个单一通用控制器(导致过于保守的结构和行为)。我们的方法介于这两个极端之间,将祖先知识保存在一个统一但模块化的框架中,其中不同的身体结构激活和停用不同的学习感觉运动回路组合,以实现目标导向行为。这使得控制器的一部分可以被彻底改造,以更好地适应新出现的物种设计,而不会破坏其他专家模块中包含的来之不易的知识。它还允许预训练的专家策略直接插入到混合中,从而将进化引导到包含所需形态特征的潜在空间中原本未探索的区域。我们将这个过程称为“演示进化”,并探索如何利用它将自由形态进化引导到由预训练模型定义的规范结构。视频和代码可在以下网址找到:https://eco-moe.github.io。

英文摘要

In this paper, we introduce a model of evolution and learning in robots that co-optimizes a distribution of latent design vectors (genotypes) and a mixture of control experts (neural modules), which are gated by the latent coordinates of each decoded design (phenotype). This provides a scalable alternative to co-design algorithms that either train an individual policy for every robot, which is inefficient, or a monolithic universal controller for all robots, which results in overly conservative structures and behaviors. Our approach lies somewhere between these two extremes, preserving ancestral knowledge in a unified yet modular framework in which different body plans activate and deactivate different combinations of learned sensorimotor circuits for goal-directed behavior. This allows one part of the controller to be overhauled to better suit new species of designs as they emerge without disrupting the hard-earned knowledge contained within other expert modules. It also allows pretrained expert policies to be directly plugged into the mixture, which can steer evolution into otherwise unexplored areas of latent space containing desired morphological traits. We refer to this process as "evo by demo" and explore how it may be used to guide freeform evolution toward canonical structures defined by the pretrained model. Videos and code can be found at: https://eco-moe.github.io.
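
How a design's latent coordinates might gate a set of control experts can be sketched as below. This is only an illustration of the general mixture-of-experts pattern: the distance-based scores, the per-expert `expert_keys`, and the function names are assumptions, since the abstract does not describe the gating function.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(latent, expert_keys):
    """Weight each expert by how close its key is to the design's latent coordinates."""
    scores = [-sum((a - b) ** 2 for a, b in zip(latent, key)) for key in expert_keys]
    return softmax(scores)

def mixture_action(latent, observation, expert_keys, experts):
    """Blend expert outputs; the gate depends on the body (latent), not the observation."""
    weights = gate_weights(latent, expert_keys)
    outputs = [expert(observation) for expert in experts]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(outputs[0]))]
```

Because the gate depends only on the latent design, two different body plans can share some experts while activating others differently, which is the modular reuse the abstract describes.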

2605.24203 2026-05-26 cs.RO 版本更新

Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance

Afford-VLA:通过内化可操作性实现动作对齐的视觉规划

Runze Wang, Yuqian Fu, Yu Li, Tao Lin, Tianwen Qian, Mohamed Elhoseiny, Bo Zhao, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue

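Affiliations * Fudan University; KAUST; SJTU; East China Normal University

AI summary Proposes Afford-VLA, which internalizes task-conditioned affordance as an explicit visual planning interface: learnable <AFF> tokens query interaction regions, and the decoded affordance masks are converted into compact embeddings that directly condition action generation, achieving state-of-the-art performance on multiple simulation benchmarks.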

Comments 20 pages

详情
AI中文摘要

视觉-语言-动作(VLA)模型在通用机器人操作中展现出巨大潜力,但仍受限于空间推理不足,特别是在复杂视觉场景中确定交互位置。虽然近期工作引入多种形式的视觉规划来解决此问题,但现有方法要么依赖全局几何线索、符号中间表示,要么依赖外部生成的视觉信号,这些往往与下游动作预测弱耦合。本文重新审视VLA系统中的视觉规划,认为有效的规划应是局部的、视觉锚定的、内部生成的且直接与动作对齐。基于此洞察,我们提出Afford-VLA,一个统一框架,将任务条件可操作性内化为VLA模型中的显式视觉规划接口。具体而言,我们引入可学习<AFF>令牌来查询任务相关交互区域,从多模态特征解码可操作性掩码,并将其转换为紧凑嵌入,直接条件化动作生成。此设计使可操作性在VLA内部生成和利用,形成紧密耦合的感知-动作通路。为进一步支持此集成,我们采用训练策略,使可操作性通路与动作预测联合优化,提高其对下游控制的有效性。我们在多个模拟基准(包括LIBERO、LIBERO-Plus和SimplerEnv)上评估方法,取得一致的最先进性能,并展示了强大的真实世界结果。这些发现表明,将可操作性内化为动作对齐的视觉规划为改进VLA系统提供了强大范式。

英文摘要

Vision-language-action (VLA) models have shown strong potential for generalist robot manipulation, yet they remain limited by insufficient spatial reasoning, particularly in determining where to interact in complex visual scenes. While recent efforts introduce various forms of visual planning to address this issue, existing approaches either rely on global geometric cues, symbolic intermediate representations, or externally generated visual signals, which are often weakly coupled with downstream action prediction. In this work, we revisit visual planning in VLA systems and argue that effective planning should be local, visually grounded, internally generated, and directly aligned with action. Based on this insight, we propose Afford-VLA, a unified framework that internalizes task-conditioned affordance as an explicit visual planning interface within VLA models. Concretely, we introduce learnable <AFF> tokens to query task-relevant interaction regions, decode affordance masks from multimodal features, and convert them into compact embeddings that directly condition action generation. This design enables affordance to be both generated and utilized within the VLA, forming a tightly coupled perception-action pathway. To further support this integration, we adopt a training strategy that allows the affordance pathway to be jointly optimized with action prediction, improving its effectiveness for downstream control. We evaluate our method on multiple simulation benchmarks, including LIBERO, LIBERO-Plus, and SimplerEnv, achieving consistent state-of-the-art performance, along with strong real-world results. These findings demonstrate that internalizing affordance as action-aligned visual planning provides a powerful paradigm for improving VLA systems.

2605.24127 2026-05-26 cs.RO 版本更新

Investigating the Effect of a Series Elastic Actuation Retrofit to Black-Box Actuators

研究串联弹性驱动改造对黑箱执行器的影响

Ivan Tregear, Ayhan Aktas, Ferdinando Rodriguez y Baena

发表机构 * Imperial College London, Mechanical Engineering Department(帝国理工学院机械工程系)

AI总结 通过为黑箱执行器加装串联弹性元件,利用有限元分析设计扭转弹性元件,实现了高保真力测量,将开环力控制带宽从10.32 Hz提升至30.32 Hz,提升2.93倍,且性能优于成本更高的商用传感器。

Comments Related GitHub repo available here: https://github.com/ITregear/SeriesElasticActuation-FYP

详情
AI中文摘要

在机器人应用中,执行器通常设计为具有最小间隙的刚性结构,以确保精度和可重复性。然而,这限制了柔顺性,导致在不确定环境中可能造成损坏和较差的力控制。串联弹性驱动(SEA)引入柔顺性以增强扰动抑制,并能够通过胡克定律进行力测量,但会降低系统带宽。 一个定制的串联弹性(SE)元件被加装到一个黑箱执行器上,以减轻间隙和静摩擦等非线性。集成SE元件实现了高保真力测量,提高了力控制带宽和性能。 通过有限元(FE)分析设计了一个扭转SE元件,其刚度为2155.4 Nm/rad。测量了原始电机和SEA集成配置的开环力控制带宽,同时利用SEA和商用力传感器的反馈评估了闭环带宽。SEA模块将带宽从10.32 Hz提高到30.32 Hz,提升了2.93倍。此外,尽管成本仅为25英镑,其性能仍比商用传感器高出7.63%。

英文摘要

In robotic applications, actuators are typically designed to be stiff with minimal backlash to ensure precision and repeatability. However, this limits compliance, leading to potential damage and poor force control in uncertain environments. Series Elastic Actuation (SEA) introduces compliance to enhance disturbance rejection and enable force measurement via Hooke's Law but reduces system bandwidth. A custom Series Elastic (SE) element was retrofitted to a black-box actuator to mitigate non-linearities like backlash and static friction. Integrating the SE element enabled high-fidelity force measurements, improving force control bandwidth and performance. A torsional SE element was designed through Finite Element (FE) analysis, yielding a stiffness of 2155.4 Nm/rad. Open-loop force control bandwidth was measured for the original motor and the SEA-integrated configuration, while closed-loop bandwidth was assessed using feedback from the SEA and a commercial force sensor. The SEA module increased bandwidth from 10.32 Hz to 30.32 Hz, a 2.93X improvement. Additionally, it outperformed the commercial sensor by 7.63% despite costing 25 GBP, a fraction of the price.
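The two quantities this abstract relies on, torque inferred from spring deflection via Hooke's law and the reported bandwidth ratio, can be checked with a few lines of arithmetic. This is an illustrative sketch only: the function name `sea_torque` and the 1 mrad example deflection are assumptions, while the stiffness and bandwidth figures are taken from the abstract.

```python
# Stiffness of the torsional SE element reported in the abstract (Nm/rad).
STIFFNESS_NM_PER_RAD = 2155.4


def sea_torque(theta_motor_rad, theta_load_rad, stiffness=STIFFNESS_NM_PER_RAD):
    """Hooke's law for a torsional spring: torque = stiffness * deflection."""
    return stiffness * (theta_motor_rad - theta_load_rad)


# Hypothetical example: 1 mrad of deflection across the elastic element.
example_torque = sea_torque(0.001, 0.0)

# Ratio of the two bandwidths quoted in the abstract (30.32 Hz vs 10.32 Hz).
bandwidth_gain = 30.32 / 10.32

print(f"torque at 1 mrad deflection: {example_torque:.4f} Nm")
print(f"bandwidth gain: {bandwidth_gain:.3f}x")
```

With a spring this stiff, a small encoder-measurable deflection already corresponds to a torque of a couple of newton-metres, which is why the deflection can serve as the force signal.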

2605.24125 2026-05-26 cs.RO 版本更新

Anisotropic Diffusion-Driven Ergodic Coverage in Multi-Robot Systems

多机器人系统中各向异性扩散驱动的遍历覆盖

Thales C. Silva, Anoop Kiran, Nora Ayanian

发表机构 * Department of Computer Science(计算机科学系) Brown University(布朗大学)

AI总结 提出一种基于Perona-Malik各向异性扩散的遍历搜索方法,通过非均匀误差传播生成势场,引导多机器人系统实现更灵活的遍历覆盖。

详情
AI中文摘要

我们考虑在多机器人系统中结合势场和遍历搜索的问题。传统的遍历搜索算法使用遍历性度量来考虑不同尺度下的期望分布。最近,提出了一种热方程驱动的遍历方法,增加了遍历度量平滑的灵活性。然而,这种方法作为各向同性扩散,无论期望分布如何变化,都会在所有方向上均匀传播误差。我们引入了遍历性问题的一类通用各向异性扩散公式,为遍历搜索生成势场。我们证明该方法推广了先前考虑径向基函数和热方程解来表示目标密度分布与覆盖轨迹之间差异的结果。在我们的解决方案中,智能体运动使用Perona-Malik扩散解的梯度进行引导,并且我们的公式将热方程作为特例。我们通过一系列不同场景的仿真来展示该方法。

英文摘要

We consider the problem of combining potential fields and ergodic search in multi-robot systems. Traditional ergodic search algorithms use metrics for ergodicity that account for the desired distribution at different scales. Recently, a heat equation-driven ergodic approach was proposed, which adds flexibility to the smoothing of the ergodic metric. However, because this approach is an isotropic diffusion, it propagates the error uniformly in all directions, regardless of changes in the desired distribution. We introduce a general class of anisotropic diffusion formulations of the ergodicity problem, which generate a potential field for the ergodic search. We demonstrate that this approach generalizes previous results, which consider radial basis functions and the solution of the heat equation to represent the difference between the goal density distribution and the covered trajectories. In our solution, the agent movement is directed using the gradient of the solution of the Perona-Malik diffusion, and our formulation includes the heat equation as a special case. We demonstrate the methodology with a series of simulations in different scenarios.
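To make the contrast between isotropic and anisotropic diffusion concrete, here is a minimal 1D sketch of one explicit Perona-Malik step. It is not the authors' formulation: the grid, the conductance function g(s) = 1 / (1 + (s/K)^2), the zero-flux boundaries, and all parameter values are assumptions chosen for illustration. Setting K to infinity makes g equal to 1 everywhere, which recovers the plain heat equation as the special case mentioned in the abstract.

```python
def conductance(grad, k):
    """Perona-Malik edge-stopping function: close to 1 for small gradients,
    close to 0 for large ones, so sharp features diffuse more slowly."""
    return 1.0 / (1.0 + (grad / k) ** 2)


def diffuse_step(u, dt=0.2, k=0.5):
    """One explicit step of u_t = d/dx( g(|u_x|) u_x ) with zero-flux boundaries.

    dt <= 0.5 keeps each update a convex combination of neighbouring values.
    """
    n = len(u)
    new = list(u)
    for i in range(n):
        flux_right = 0.0
        flux_left = 0.0
        if i + 1 < n:
            d = u[i + 1] - u[i]
            flux_right = conductance(d, k) * d
        if i > 0:
            d = u[i] - u[i - 1]
            flux_left = conductance(d, k) * d
        new[i] = u[i] + dt * (flux_right - flux_left)
    return new


# Hypothetical "error" field with a sharp step in the middle.
field = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
for _ in range(50):
    field = diffuse_step(field)

# Heat-equation special case: infinite K gives constant conductance 1.
heat = diffuse_step([0.0, 1.0, 0.0], dt=0.2, k=float("inf"))
```

In an ergodic-search setting the diffused field would play the role of the potential, and each agent would move along its gradient; that coupling is omitted here.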

2605.24111 2026-05-26 cs.RO cs.AI 版本更新

MASt3R-Nav: WayPixel Navigation in Relative 3D Maps

MASt3R-Nav: 相对3D地图中的WayPixel导航

Vansh Garg, Rohit Jayanti, Krish Pandya, Sarthak Chittawar, Siddharth Tourani, Muhammad Haris Khan, Sourav Garg, Madhava Krishna

发表机构 * Robotics Research Center, IIIT-Hyderabad, India(机器人研究中心,IIIT-海得拉巴,印度) University of Heidelberg(海德堡大学) MBZUAI(MBZUAI)

AI总结 提出一种基于像素相对连接性的地图表示,通过相对3D坐标系中的像素对应构建地图,并利用像素级图进行全局路径规划,训练控制器预测轨迹,实现高精度导航。

Comments 2026 IEEE International Conference on Robotics & Automation (ICRA)

详情
AI中文摘要

视觉导航能力与其底层世界表示紧密相关。与需要全局一致几何的传统3D地图不同,图像或物体相对拓扑图几乎完全放弃了几何理解,但这以牺牲导航能力为代价,通常仅限于教-重复模式。本文提出一种新颖的地图表示,即像素相对连接性,它在几何上精确但不需要全局几何一致性。受近期3D基础图像匹配进展的启发,我们通过基于单个图像对相对3D坐标系中像素对应的图像间连接性,从图像序列构建地图。然后,我们利用该像素级图通过近似和稀疏化图像内像素连接性来执行全局路径规划。由此,我们推导出“WayPixel Costmap”表示,并训练一个以此条件化的控制器来预测轨迹展开。我们展示了这种基于相对几何的密集像素级成本图比其图像级和物体级对应物是更精确的控制预测条件变量。这实现了一个高能力的导航系统,通过在模拟器中的四种导航任务和真实世界演示中得到验证。

英文摘要

Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally-consistent geometry, image- or object-relative topological graphs almost entirely do away with geometric understanding. However, this comes at the cost of navigation capability, often limiting it to merely teach-and-repeat. In this work, we propose a novel map representation in the form of pixel-relative connectivity, which is geometrically accurate but does not require global geometric consistency. Inspired by recent progress in 3D grounded image matching, we construct a map from an image sequence through inter-image connectivity based on pixel correspondences in the relative 3D coordinate systems of individual image pairs. We then use this pixel-level graph to perform global path planning by approximating and sparsifying intra-image pixel connectivity. Through this, we derive a "WayPixel Costmap" representation and train a controller conditioned on it to predict a trajectory rollout. We show that this dense pixel-level costmap based on relative geometry is a more accurate conditioning variable for control prediction than its image- and object-level counterparts. This enables a highly capable navigation system, as validated on four types of navigation tasks in the simulator and through real-world demonstrations.

2605.24074 2026-05-26 cs.CV cs.RO 版本更新

WideDepth: Millimeter-Accurate Benchmark for Fisheye Depth Estimation

WideDepth: 用于鱼眼深度估计的毫米级精度基准

Ilia Indyk, Ignat Penshin, Ivan Sosin, Maxim Monastyrny, Aleksei Valenkov, Ilya Makarov

发表机构 * Robotics Center(机器人中心) AXXX Trusted AI Research Center, RAS(可信人工智能研究中心,俄罗斯科学院)

AI总结 提出首个室内鱼眼深度估计数据集WideDepth,包含101个场景的5K高分辨率立体对和毫米级真值,并引入基于LiDAR的立体鱼眼图像生成方法,评估多种模型,微调后性能提升高达62%。

Comments Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

详情
AI中文摘要

鱼眼相机在机器人领域的近场操作、导航和沉浸式感知中应用日益广泛,但缺乏具有精确真值的室内深度基准。为此,我们引入WideDepth——首个用于鱼眼深度估计的室内数据集,包含101个场景的5K高分辨率立体对,标注了毫米级地面真值深度和视差。我们的数据集还包括在水平和垂直立体设置中,不同视场和基线下的配对针孔和鱼眼样本。我们进一步提出一种方法,将针孔训练的立体模型适配到鱼眼图像,并引入一种基于高分辨率LiDAR扫描的新型立体鱼眼图像生成流程。利用这些方法,我们在基准上全面评估了最先进的单目深度、立体匹配和深度补全模型。此外,我们提供了18K LiDAR导出的稀疏深度训练样本,在微调基于针孔的立体模型时,鱼眼数据性能提升高达62%。总之,我们基准的高精度和多功能性为推进鱼眼深度估计和机器人感知研究奠定了坚实基础。项目页面:https://ilyaind.github.io/WideDepth

英文摘要

Fisheye cameras are increasingly adopted in robotics for near-field manipulation, navigation, and immersive perception, yet indoor depth benchmarks with accurate ground truth are still missing. To address this, we introduce WideDepth - the first indoor dataset for fisheye depth estimation, featuring 101 scenes containing 5K high-resolution stereo pairs labeled with millimeter-level ground truth depth and disparity. Our dataset also includes paired pinhole and fisheye samples across varying fields of view and baselines in both horizontal and vertical stereo setups. We further propose a method to adapt pinhole-trained stereo models to fisheye images and introduce a novel stereo fisheye image generation pipeline based on high-resolution LiDAR scans. Leveraging these methods, we thoroughly evaluate state-of-the-art monocular depth, stereo matching, and depth completion models on our benchmark. Additionally, we provide 18K LiDAR-derived sparse depth training samples, achieving up to a 62% performance boost on fisheye data when fine-tuning pinhole-based stereo models. In summary, the high precision and versatility of our benchmark set a strong foundation for advancing research in fisheye depth estimation and robotics perception. Project page: https://ilyaind.github.io/WideDepth

2605.24044 2026-05-26 cs.RO cs.SE cs.SY eess.SY 版本更新

RED: Adaptive Real-Time DAG Scheduling for Robotic Inference under Environmental Dynamics

RED:面向环境动态下机器人推理的自适应实时DAG调度

Zexin Li, Tao Ren, Johnathan Liu, Xiaoxi He, Cong Liu

发表机构 * University of California, Riverside(加州大学河滨分校) University of Pittsburgh(匹兹堡大学) University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) University of Macau(澳门大学)

AI总结 提出RED框架,通过截止时间感知调度器和MIMONet结构对齐,在资源受限机器人平台上实现多任务深度神经网络工作负载的实时调度,适应环境动态并保证端到端时序约束。

Comments Extended version of the RTSS'23 paper

详情
AI中文摘要

部署在动态环境中的机器人必须应对环境驱动的变化,这些变化会在运行时重塑计算:新任务可能出现,优先级关系可能改变,整体工作负载结构会演变,所有这些都会降低性能,特别是在资源紧张和实时预算下需要多任务推理时。我们提出RED,一个用于资源受限机器人平台上多任务深度神经网络工作负载的实时调度框架,它适应机器人环境动态(RED),同时在建模假设下保留端到端时序保证。RED的核心是一个截止时间感知调度器,它分配中间子截止时间,从而能够适应由不可预测条件引起的计算图演变和异步推理。该框架还支持灵活部署MIMONet(多输入多输出神经网络),这种网络常用于多任务机器人,通过权重共享缓解内存压力。RED通过工作负载细化和图重构过程显式利用这种共享参数属性,将MIMONet结构与可调度性要求对齐,提高兼容性和效率。我们在NVIDIA Jetson系列平台和Apple M系列MacBook上实现RED,并在代表真实机器人场景的导航导向工作负载上进行评估。实验表明,在吞吐量、截止时间满足率、抗干扰鲁棒性、适应性和运行时开销方面,RED持续优于现有方法。

英文摘要

Robots deployed in dynamic environments must contend with environment-driven changes that reshape computation at runtime: new tasks may appear, precedence relations can shift, and overall workload structure evolves, all of which degrade performance, especially when multi-task inference is required under tight resource and real-time budgets. We present RED, a real-time scheduling framework for multi-task deep neural network workloads on resource-constrained robotic platforms that adapts to Robotic Environmental Dynamics (RED) while preserving end-to-end timing guarantees under modeling assumptions. The core of RED is a deadline-aware scheduler that assigns intermediate sub-deadlines, allowing it to accommodate evolving computation graphs and asynchronous inference induced by unpredictable conditions. The framework also supports flexible deployment of MIMONet (multi-input multi-output neural networks), commonly used in multi-tasking robots to alleviate memory pressure through weight sharing. RED explicitly leverages this shared-parameter property via a workload refinement and graph-reconstruction procedure that aligns MIMONet structure with schedulability requirements, improving compatibility and efficiency. We implement RED on NVIDIA Jetson family platforms and on an Apple M-series MacBook and evaluate it on navigation-oriented workloads representative of real robotic scenarios. Experiments show consistent gains over existing methods in throughput, deadline satisfaction, robustness to interference, adaptability, and runtime overhead.
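The idea of assigning intermediate sub-deadlines can be illustrated on a simple chain of inference stages. The abstract does not state RED's actual assignment rule, so the sketch below swaps in a generic proportional scheme: each stage's sub-deadline is the end-to-end deadline scaled by the share of worst-case execution time accumulated so far. The function name and the example numbers are hypothetical.

```python
def assign_sub_deadlines(wcets, end_to_end_deadline):
    """Split an end-to-end deadline across a chain of stages.

    wcets: worst-case execution times of the stages, in chain order.
    Returns the absolute sub-deadline of each stage, proportional to the
    cumulative execution demand (an assumed rule, not RED's own).
    """
    total = sum(wcets)
    if total > end_to_end_deadline:
        raise ValueError("chain cannot meet the deadline even without interference")
    deadlines = []
    elapsed = 0.0
    for wcet in wcets:
        elapsed += wcet
        deadlines.append(end_to_end_deadline * elapsed / total)
    return deadlines


# Hypothetical three-stage pipeline (e.g. backbone, task head, post-processing)
# with execution times in milliseconds and a 20 ms end-to-end budget.
sub_deadlines = assign_sub_deadlines([2, 3, 5], 20)
```

A scheduler can then treat each stage as an independent job with its own deadline, which is what lets the computation graph change at runtime without recomputing a global schedule from scratch.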

2605.24004 2026-05-26 cs.AI cs.CV cs.LG cs.RO 版本更新

Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

推理--想象--行动:基于世界模型的闭环LLM自动驾驶决策

Zhengqi Sun, Yiwen Sun, Boxuan Liu, Tailai Chen, Tianxu Guo, Jiabin Liu

发表机构 * (1) Department of Information Management, Peking University, Beijing 100871, China; (2) School of Intelligence Science Technology, Peking University, Beijing 100871, China; (3) State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing 100080, China; (4) Yuanpei College, Peking University, Beijing 100871, China; (5) China Agricultural University, Beijing, China; (6) CRSC Research & Design Institute Group Co., Ltd., Beijing, China

AI总结 提出Reason--Imagine--Act (RIA)闭环框架,结合LLM推理器与动作条件世界模型进行在线安全验证,在CARLA点目标协议下实现80.05%路线完成率、51.10%到达率和0.20%碰撞率。

Comments Accepted by the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). 8 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLM)在自动驾驶中具有潜力,但仅基于语义的决策策略可能在动态交通中产生物理上不安全的行为。现有方法要么在没有显式动力学验证的情况下进行在线语言推理,要么主要在离线流程中使用世界模型,在决策时语义意图与物理可行性之间存在差距。我们提出了Reason--Imagine--Act (RIA),一个闭环框架,将LLM推理器与动作条件世界模型耦合,用于在线安全验证。在每一步,LLM提出一个动作模板和候选子动作,世界模型执行短时域展开,安全评分器选择最安全的可执行动作并反馈给下一步推理。在统一的CARLA点目标协议(1000个回合)下,RIA实现了80.05%的路线完成率、51.10%的到达率和0.20%的碰撞率。在相同的闭环接口下,RIA在核心闭环指标上始终优于无训练基线,包括CARLA TM和MADA。为便于复现,代码可在https://github.com/pku-smart-city/source_code/tree/main/RIA获取。

英文摘要

Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason--Imagine--Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.
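The propose, roll out, and score loop described in this abstract can be sketched with toy stand-ins. Everything below is an assumption made for illustration: the candidate accelerations stand in for the LLM's proposed sub-actions, a constant-acceleration car-following model stands in for the learned world model, and the minimum gap to the lead vehicle stands in for the safety scorer.

```python
def rollout_min_gap(gap, ego_speed, lead_speed, accel, horizon=10, dt=0.1):
    """Toy world model: simulate a short horizon and return the smallest gap seen."""
    min_gap = gap
    for _ in range(horizon):
        ego_speed = max(0.0, ego_speed + accel * dt)  # no reversing
        gap += (lead_speed - ego_speed) * dt
        min_gap = min(min_gap, gap)
    return min_gap


def select_safest_action(candidates, gap, ego_speed, lead_speed):
    """Score every candidate by its imagined rollout and keep the safest one."""
    scored = [(rollout_min_gap(gap, ego_speed, lead_speed, a), a) for a in candidates]
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score


# Hypothetical situation: ego at 10 m/s closing on a lead car at 5 m/s, 5 m ahead.
# The candidates would come from the LLM reasoner in the real system.
action, score = select_safest_action([-2.0, 0.0, 1.0], gap=5.0, ego_speed=10.0, lead_speed=5.0)
```

In a closed loop, the chosen action and its score would be fed back into the next reasoning prompt, which is the feedback step the abstract mentions.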

2605.23987 2026-05-26 cs.AI cs.RO 版本更新

Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning

超越预定义学习对象:面向最新自主机器人学习的思维-学习交互模型

Hong Su

发表机构 * School of Computer Science, Chengdu University of Information Technology(成都信息科技大学计算机学院)

AI总结 针对自主机器人在开放环境中无法依赖预定义学习对象的问题,提出一种思维-学习交互模型,通过思维指导学习(识别变化、选择证据、组织训练、规划验证)和学习促进思维(更新知识、经验、策略、推理)的双向机制,实现输入特征发现、输出类别扩展、模型更新和动作例程重构,实验验证了模型在特征适应、新类别形成、模型更新和动作优化上的有效性。

详情
AI中文摘要

在开放和变化环境中运行的自主机器人不能总是依赖预定义的输入、输出和动作例程。尽管现有的学习方法使机器人能够通过环境交互提高性能,但学习对象往往是预先固定的,例如输入特征、识别输出、网络结构、任务目标或动作序列。这限制了它们在长期运行中出现新特征、新类别或更高效任务例程时的适应能力。为解决此问题,本文提出了一种面向自主机器人的思维-学习交互模型。核心思想是:思维通过识别潜在变化、选择有用证据、组织训练材料和规划验证动作来指导学习,而学习通过更新任务知识、特征选择经验、动作策略和未来推理过程来促进思维。基于这种双向机制,机器人可以逐步超越预定义的学习设置,并通过与环境的持续交互调整其识别关系和动作关系。具体来说,该模型支持自适应输入特征发现、输出类别扩展、学习模型更新和动作例程重构。实验结果表明,该模型在特征适应中将最终识别准确率从0.419提高到0.845,实现了更高的新类别形成准确率和模型更新成功率,并将动作例程重构中的平均动作长度从13.0减少到4.0。在学习增强思维方面,有用证据选择率从0.272提高到0.965,表明学习结果能有效改善未来的证据选择和推理。

英文摘要

Autonomous robots operating in open and changing environments cannot always rely on predefined inputs, outputs, and action routines. Although existing learning methods enable robots to improve their performance through environmental interaction, the objects of learning are often fixed in advance, such as input features, recognition outputs, network structures, task goals, or action sequences. This limits their ability to adapt when new features, new categories, or more efficient task routines appear during long-term operation. To address this problem, this paper proposes a thinking-learning interaction model for autonomous robots. The core idea is that thinking guides learning by identifying potential changes, selecting useful evidence, organizing training materials, and planning verification actions, while learning promotes thinking by updating task knowledge, feature-selection experience, action strategies, and future reasoning processes. Based on this bidirectional mechanism, the robot can gradually move beyond predefined learning settings and adapt its recognition relations and action relations through continuous interaction with the environment. Specifically, the proposed model supports adaptive input feature discovery, output category expansion, learning model update, and action routine reconstruction. Experimental results show that the proposed model improves the final recognition accuracy from 0.419 to 0.845 in feature adaptation, achieves higher new-category formation accuracy and model-update success rate, and reduces the average action length from 13.0 to 4.0 in action routine reconstruction. In learning-enhanced thinking, the useful evidence selection rate increases from 0.272 to 0.965, indicating that learning results can effectively improve future evidence selection and reasoning.

2605.23972 2026-05-26 cs.AI cs.CL cs.RO 版本更新

Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform

为什么我们需要世界模型来实现通用人工智能:大语言模型失败之处以及世界模型如何可能超越

Feisal Alaswad, Batoul Aljaddouh, Maher Alrahhal, Poovammal E, Talal Bonny

发表机构 * Department of Computing Technologies(计算技术系) SRM Institute of Science and Technology(SRM科学与技术学院) Bio-Sensing and Bio-Sensors Group(生物传感与生物传感器组) Smart Automation and Communication Technologies Research Institute of Sciences and Engineering(科学与工程智能自动化与通信技术研究所) University of Sharjah, UAE(阿联酋沙迦大学) Department of Computer Engineering(计算机工程系) College of Computing and Informatics(计算与信息学院)

AI总结 本文通过提出潜在动态推理(LDI)概念和Flux环境案例研究,论证了大语言模型在因果推理、状态跟踪和长程规划上的局限性,并展示基于显式状态空间的强化学习智能体在长程游戏中显著优于纯文本LLM。

Comments 19 pages, 5 figures

详情
AI中文摘要

大语言模型在语言生成和知识密集型任务中表现出色,但在需要因果推理、持久状态跟踪和长程规划的场景中仍然受限。我们认为,这些限制可能源于序列预测与对潜在环境动态进行推理之间的目标层级不匹配。为了形式化这一区别,我们引入了潜在动态推理(LDI),这是一种概念性视角,将语言和多模态观测解释为底层转移动态的部分证据。为了实证研究这一视角,我们引入了Flux,一个完全通过自然语言规则指定的序列推理环境。作为一个概念验证案例研究,这些规则首先被编译成一个显式的状态转移模拟器,说明在某些情况下,结构化的潜在转移动态可以从文本规则描述中操作性地提取出来。这使得我们能够在纯文本观测上运行的LLM与直接在提取的潜在状态空间中训练的强化学习智能体之间进行受控比较。在该案例研究中,能够显式访问潜在状态空间的智能体在长程游戏中表现出更稳定的行为,总胜率约为79%,而LLM仅为11%。定性分析进一步揭示了与不稳定的持久状态跟踪一致的失败模式,包括无效动作、状态跟踪错误和短程推理失败。Flux环境的完整实现可在https://github.com/FeisalAlaswad/FLUX-RL-Agent获取。在评估的设置中,这些结果表明,如果没有持久状态跟踪和转移建模的机制,仅凭强大的序列预测可能难以支持稳健的长程动态推理。

英文摘要

Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long-horizon planning. We argue that these limitations may arise from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural-language rules. As a proof-of-concept case study, the rules are first compiled into an explicit state-transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between LLMs operating purely over textual observations and reinforcement-learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long-horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state-tracking errors, and short-horizon reasoning failures. The complete implementation of the Flux environment is available at https://github.com/FeisalAlaswad/FLUX-RL-Agent. Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.

2605.23941 2026-05-26 cs.AI cs.RO 版本更新

MEMOR-E: In-Context and Fine-Tuned LLM Personalization for Alzheimer's Assistive Robotics

MEMOR-E: 面向阿尔茨海默病辅助机器人的上下文与微调大语言模型个性化

Maissa Abir Smaili, Eren Sadikoglu, Ransalu Senanayake

发表机构 * Istanbul Medipol University(伊斯坦布尔梅迪波大学) Arizona State University(亚利桑那州立大学)

AI总结 提出移动四足机器人MEMOR-E,结合微调与上下文学习的大语言模型,实现阿尔茨海默病患者的个性化认知支持与可解释人机交互。

Comments 8 pages, 14 figures

详情
AI中文摘要

阿尔茨海默病是一种神经退行性疾病,其特征是记忆和语言能力进行性衰退,导致日常生活独立性降低,从而激发社交辅助机器人的支持需求。本文介绍了MEMOR-E,一种配备交互式平板界面的移动四足机器人,通过药物提醒、日常指导、记忆导向互动和陪伴来协助患者和护理人员。我们评估了微调大语言模型(LLMs)以模拟阶段一致的认知行为并解释标准神经心理学语言任务中响应的可行性,使用了235名阿尔茨海默病患者的音频转录和合成生成的健康对照数据。我们还报告了在LLMs中使用上下文学习(ICL)的结果,其中第二个LLM生成了领域和严重程度级别的认知错误摘要。我们的结果表明,MEMOR-E能够生成阶段感知的非诊断性认知摘要,支持个性化辅助互动,同时可解释AI机制将模型输出转化为透明、人类可读的证据,以实现护理人员监督和可信赖的人机交互。

英文摘要

Alzheimer's disease is a neurodegenerative disorder marked by progressive declines in memory and language that reduce independence in daily life, motivating socially assistive robotic support. This paper presents MEMOR-E, a mobile quadruped robot with an interactive tablet interface that assists patients and caregivers through medication reminders, routine guidance, memory-oriented interactions, and companionship. We evaluated the feasibility of fine-tuning large language models (LLMs) to emulate stage-consistent cognitive behavior and interpret responses across standard neuropsychological language tasks, using audio transcriptions from 235 Alzheimer's patients and synthetically generated healthy controls. We also report findings on using in-context learning (ICL) in LLMs, where a second LLM produced domain- and severity-level cognitive error summaries. Our results show that MEMOR-E can generate stage-aware, non-diagnostic cognitive summaries that support personalized assistive interactions, while explainable AI mechanisms translate model outputs into transparent, human-readable evidence to enable caregiver oversight and trustworthy human-robot interaction.

2605.23915 2026-05-26 eess.SY cs.RO cs.SY 版本更新

SEIDM: A Safe and Efficient Intelligent Driver Model for Autonomous Driving Behavior

SEIDM:一种用于自动驾驶行为的安全高效智能驾驶员模型

Yuyang Yao, Shaocheng Luo

发表机构 * Department of Information and Communication Engineering, Tongji University(信息与通信工程系,同济大学) Department of Electrical and Computer Engineering, Duke University(电气与计算机工程系,杜克大学)

AI总结 提出SEIDM模型,通过引入自适应安全因子动态调整安全减速项的影响,在保证安全的前提下提高交通流效率。

Comments To appear in IEEE IV 2026

详情
AI中文摘要

智能驾驶员模型(IDM)是自适应巡航控制(ACC)的基石,因其可解释的参数和在跟车行为建模中的有效性而受到重视。然而,其固有的保守性导致稳定时间延长和交通效率降低,这一问题受到的关注有限。在本文中,我们提出SEIDM(安全高效智能驾驶员模型),一种增强的IDM扩展,旨在提高交通流效率而不牺牲安全性。SEIDM引入自适应安全因子,动态调节加速度决策中安全减速项的影响。这使得车辆在安全条件下能够更果断地跟车,同时在潜在危险时表现得更加谨慎。广泛的城市交通仿真表明,SEIDM实现了显著更短的稳定间距和更快的交通流平衡收敛,在交通稳定性和效率方面优于原始IDM及其变体。

英文摘要

The Intelligent Driver Model (IDM) is a cornerstone of Adaptive Cruise Control (ACC), valued for its interpretable parameters and effectiveness in car-following behavior modeling. However, its inherent conservatism leads to prolonged stabilization and reduced traffic efficiency, which have received limited attention. In this paper, we propose SEIDM (Safe and Efficient Intelligent Driver Model), an enhanced IDM extension designed to improve traffic flow efficiency without sacrificing safety. SEIDM introduces an adaptive safety factor to dynamically modulate the impact of the safe deceleration term in acceleration decisions. This allows vehicles to follow more assertively under safe conditions while behaving more cautiously in potential hazards. Extensive urban traffic simulations show that SEIDM achieves significantly shorter stabilization spacing and faster convergence to traffic flow equilibrium, outperforming the original IDM and its variants in traffic stability and efficiency.
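For reference, the standard IDM acceleration law that SEIDM builds on can be written in a few lines. The abstract does not give the form of the adaptive safety factor, so the `safety_factor` argument below is a hypothetical placeholder that simply scales the interaction (safe deceleration) term; with its default of 1.0 the function is the plain IDM. Parameter values are common textbook-style choices, not the paper's.

```python
import math


def idm_acceleration(v, dv, gap, v0=30.0, T=1.5, s0=2.0, a_max=1.0, b=2.0,
                     delta=4.0, safety_factor=1.0):
    """Intelligent Driver Model acceleration.

    v: ego speed (m/s); dv: approach rate v - v_lead (m/s); gap: bumper gap (m).
    v0: desired speed, T: time headway, s0: minimum gap,
    a_max: maximum acceleration, b: comfortable deceleration.
    safety_factor: hypothetical scaling of the interaction term (1.0 = plain IDM).
    """
    desired_gap = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    free_road_term = (v / v0) ** delta
    interaction_term = (desired_gap / gap) ** 2
    return a_max * (1.0 - free_road_term - safety_factor * interaction_term)


# Following at 20 m/s with a 30 m gap and no speed difference.
plain = idm_acceleration(v=20.0, dv=0.0, gap=30.0)
relaxed = idm_acceleration(v=20.0, dv=0.0, gap=30.0, safety_factor=0.5)
```

The comparison shows the mechanism the abstract describes in qualitative terms: shrinking the weight of the interaction term makes the follower more assertive, so any such factor has to grow again when the situation becomes hazardous.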

2605.02037 2026-05-26 cs.RO cs.AI 版本更新

VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation

VILAS:一种集成软抓取的VLA低成本机器人操作架构

Zijian An, Hadi Khezam, Bill Cai, Ran Yang, Shijie Geng, Yiming Feng, Yue Zheng, Lifeng Zhou

发表机构 * Drexel University(德雷塞尔大学) Virginia Seafood Agricultural Research and Extension Center(弗吉尼亚海鲜农业研究与推广中心) Amazon Store Foundation AI (SFAI)(亚马逊商店基础人工智能(SFAI))

AI总结 提出VILAS低成本模块化机器人操作平台,集成软抓取机构,支持端到端VLA策略学习与部署,并在葡萄抓取任务中验证有效性。

详情
AI中文摘要

我们提出了VILAS,一个完全低成本、模块化的机器人操作平台,旨在支持端到端视觉-语言-动作(VLA)策略学习并在可访问硬件上部署。该系统集成了Fairino FR5协作臂、Jodell RG52-50电动夹爪和双摄像头感知模块,通过基于ZMQ的通信架构统一协调遥操作、数据收集和策略部署于单一框架内。为了在不依赖显式力传感的情况下安全操作易碎物体,我们设计了一种基于kirigami的软柔性夹爪扩展件,在压缩载荷下产生可预测变形,提供对脆弱目标的温和且可重复接触。我们在VILAS平台上部署并评估了三种最先进的VLA模型:pi_0、pi_0.5和GR00T N1.6。所有模型均使用通过我们的遥操作流水线收集的相同演示数据集,从公开发布的预训练检查点进行微调。在葡萄抓取任务上的实验验证了所提系统的有效性,证实了有能力的操作策略可以在低成本模块化硬件上成功训练和部署。我们的结果进一步为当前VLA模型在真实环境中的部署特性提供了实践见解。

英文摘要

We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) policy learning and deployment on accessible hardware. The system integrates a Fairino FR5 collaborative arm, a Jodell RG52-50 electric gripper, and a dual-camera perception module, unified through a ZMQ-based communication architecture that seamlessly coordinates teleoperation, data collection, and policy deployment within a single framework. To enable safe manipulation of fragile objects without relying on explicit force sensing, we design a kirigami-based soft compliant gripper extension that induces predictable deformation under compressive loading, providing gentle and repeatable contact with delicate targets. We deploy and evaluate three state-of-the-art VLA models on the VILAS platform: pi_0, pi_0.5, and GR00T N1.6. All models are fine-tuned from publicly released pretrained checkpoints using an identical demonstration dataset collected via our teleoperation pipeline. Experiments on a grape grasping task validate the effectiveness of the proposed system, confirming that capable manipulation policies can be successfully trained and deployed on low-cost modular hardware. Our results further provide practical insights into the deployment characteristics of current VLA models in real-world settings.

2603.09458 2026-05-26 cs.RO 版本更新

Stein Variational Ergodic Surface Coverage with SE(3) Constraints

Stein变分遍历曲面覆盖与SE(3)约束

Jiayun Li, Yufeng Jin, Sangli Teng, Dejian Gong, Georgia Chalvatzaki

发表机构 * Department of Computer Science, TU Darmstadt(图宾根大学计算机科学系) Honda Research Institute Europe GmbH(本田欧洲研究院) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种基于预条件SE(3) Stein变分梯度下降的采样即优化方法,用于生成满足SE(3)约束的遍历轨迹,实现复杂3D点云曲面的高质量覆盖。

详情
AI中文摘要

曲面操作任务要求机器人生成能够全面覆盖复杂3D曲面同时保持精确末端执行器姿态的轨迹。现有的遍历轨迹优化(TO)方法在覆盖任务中表现出色,但由于非凸优化景观以及采样即优化(SAO)技术中对SE(3)约束处理不足,在处理点云目标时存在困难。在这项工作中,我们引入了一种预条件SE(3) Stein变分梯度下降(SVGD)方法用于SAO遍历轨迹生成。我们提出的方法包含多项创新。首先,我们将点云遍历覆盖重新表述为流形感知的采样问题。其次,我们推导了SE(3)特定的SVGD粒子更新,第三,我们开发了一个预条件子以加速TO收敛。与基于优化的强基线和SAO基线相比,我们的基于采样的框架在保持SE(3)几何结构的同时,一致地识别出更优的局部最优解。在3D点云曲面覆盖基准测试和机器人曲面绘制任务上的实验表明,相对于现有的TO和SAO方法,我们的方法在我们的设置中以可接受的计算量实现了更优的覆盖质量,并在真实机器人实验中得到了验证。

英文摘要

Surface manipulation tasks require robots to generate trajectories that comprehensively cover complex 3D surfaces while maintaining precise end-effector poses. Existing ergodic trajectory optimization (TO) methods demonstrate success in coverage tasks, while struggling with point-cloud targets due to the nonconvex optimization landscapes and the inadequate handling of SE(3) constraints in sampling-as-optimization (SAO) techniques. In this work, we introduce a preconditioned SE(3) Stein Variational Gradient Descent (SVGD) approach for SAO ergodic trajectory generation. Our proposed approach comprises multiple innovations. First, we reformulate point-cloud ergodic coverage as a manifold-aware sampling problem. Second, we derive SE(3)-specific SVGD particle updates, and, third, we develop a preconditioner to accelerate TO convergence. Our sampling-based framework consistently identifies superior local optima compared to strong optimization-based and SAO baselines while preserving the SE(3) geometric structure. Experiments on a 3D point-cloud surface coverage benchmark and robotic surface drawing tasks demonstrate that our method achieves superior coverage quality with tractable computation in our setting relative to existing TO and SAO approaches, and is validated in real-world robot experiments.
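As background for the SVGD component, the sketch below shows the standard Euclidean SVGD particle update in one dimension with an RBF kernel. It is not the preconditioned SE(3) variant the paper derives: the target density (a standard normal), the fixed kernel bandwidth, the step size, and the function names are all assumptions for illustration.

```python
import math


def svgd_step(particles, grad_log_p, step=0.1, bandwidth=1.0):
    """One Euclidean SVGD update.

    Each particle x_i moves along
        phi(x_i) = mean_j [ k(x_j, x_i) * grad_log_p(x_j) + d/dx_j k(x_j, x_i) ],
    with k(a, b) = exp(-(a - b)^2 / bandwidth). The first term pulls particles
    toward high-density regions; the second pushes them apart.
    """
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / bandwidth)
            attract = k * grad_log_p(xj)
            repel = 2.0 * (xi - xj) / bandwidth * k  # derivative of k w.r.t. xj
            phi += attract + repel
        updated.append(xi + step * phi / n)
    return updated


def grad_log_standard_normal(x):
    """Score function of N(0, 1)."""
    return -x


# Start all particles far from the target and let them transport toward it.
particles = [5.0 + 0.1 * i for i in range(20)]
for _ in range(1000):
    particles = svgd_step(particles, grad_log_standard_normal)
mean = sum(particles) / len(particles)
```

On SE(3) the same attract-and-repel structure has to respect the manifold, which is what the paper's SE(3)-specific update and preconditioner address.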

2602.02839 2026-05-26 cs.RO 版本更新

Language Movement Primitives: Grounding Language Models in Robot Motion

语言运动基元:将语言模型锚定在机器人运动中

Yinlong Dai, Benjamin A. Christie, Daniel J. Evans, Dylan P. Losey, Simon Stepputtis

发表机构 * Collab, Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24061(Collab实验室,机械工程系,弗吉尼亚理工大学,布莱克斯堡,VA 24061) TEA Lab, Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24061(TEA实验室,机械工程系,弗吉尼亚理工大学,布莱克斯堡,VA 24061)

AI总结 提出语言运动基元(LMP)框架,通过将视觉语言模型(VLM)推理与动态运动基元(DMP)参数化结合,实现零样本机器人操作任务。

详情
AI中文摘要

尽管在基于基础模型的通用问题解决方面取得了显著进展,但使机器人能够根据自然语言指令执行新颖的操作任务仍然是机器人学中的一个基本挑战。大型视觉和语言模型(VLM)能够处理高维输入数据以理解视觉场景和语言,并将任务分解为一系列逻辑步骤;然而,它们难以将这些步骤锚定在具体的机器人运动中。另一方面,机器人基础模型输出动作命令,但在成功执行新颖任务之前需要领域内的微调或经验。其核心仍然存在将抽象任务推理与低级运动控制连接起来的基本挑战。为了解决这一脱节,我们提出了语言运动基元(LMP),这是一个将VLM推理锚定在动态运动基元(DMP)参数化中的框架。我们的关键洞察是,DMP提供了少量可解释的参数,而VLM可以设置这些参数来指定多样、连续且稳定的轨迹。换句话说:VLM可以推理自由形式的自然语言任务描述,并将其期望的运动语义锚定到DMP中——弥合了高级任务推理与低级位置和速度控制之间的鸿沟。基于这种VLM和DMP的结合,我们制定了LMP流程,用于零样本机器人操作,通过生成一系列DMP运动有效完成桌面操作问题。在31个真实世界操作任务中,我们展示了LMP实现了65%的任务成功率,而最佳基线的成功率为35%。请访问我们的网站查看视频:https://collab.me.vt.edu/lmp

英文摘要

Enabling robots to perform novel manipulation tasks from natural language instructions remains a fundamental challenge in robotics, despite significant progress in generalized problem solving with foundational models. Large vision and language models (VLMs) are capable of processing high-dimensional input data for visual scene and language understanding, as well as decomposing tasks into a sequence of logical steps; however, they struggle to ground those steps in embodied robot motion. On the other hand, robotics foundation models output action commands, but require in-domain fine-tuning or experience before they are able to perform novel tasks successfully. At its core, there still remains the fundamental challenge of connecting abstract task reasoning with low-level motion control. To address this disconnect, we propose Language Movement Primitives (LMPs), a framework that grounds VLM reasoning in Dynamic Movement Primitive (DMP) parameterization. Our key insight is that DMPs provide a small number of interpretable parameters, and VLMs can set these parameters to specify diverse, continuous, and stable trajectories. Put another way: VLMs can reason over free-form natural language task descriptions, and semantically ground their desired motions into DMPs -- bridging the gap between high-level task reasoning and low-level position and velocity control. Building on this combination of VLMs and DMPs, we formulate our LMP pipeline for zero-shot robot manipulation that effectively completes tabletop manipulation problems by generating a sequence of DMP motions. Across 31 real-world manipulation tasks, we show that LMP achieves 65% task success as compared to 35% for the best performing baseline. See videos at our website: https://collab.me.vt.edu/lmp
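To show what "a small number of interpretable parameters" means for a DMP, here is a minimal single-degree-of-freedom discrete DMP rollout in the common Ijspeert-style form. It is a sketch under stated assumptions, not the LMP pipeline: the gains, basis-function placement, and weight values are illustrative, and in the paper's setting a VLM would choose parameters such as the goal and weights.

```python
import math


def dmp_rollout(y0, goal, weights, tau=1.0, steps=2000, dt=0.001,
                alpha=25.0, alpha_x=3.0):
    """Integrate a 1-DOF discrete DMP with explicit Euler steps.

    Transformation system: tau * dz = alpha * (beta * (goal - y) - z) + f(x)
                           tau * dy = z
    Canonical system:      tau * dx = -alpha_x * x
    f(x) is a normalized weighted sum of Gaussian basis functions, scaled by
    x * (goal - y0) so that it vanishes as the phase x decays.
    """
    beta = alpha / 4.0  # critical damping of the spring-damper part
    n = len(weights)
    centers = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c for c in centers]  # heuristic basis widths
    y, z, x = y0, 0.0, 1.0
    path = [y]
    for _ in range(steps):
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centers)]
        total = sum(psi)
        forcing = 0.0
        if total > 1e-10:
            weighted = sum(p * w for p, w in zip(psi, weights))
            forcing = weighted / total * x * (goal - y0)
        z += dt * (alpha * (beta * (goal - y) - z) + forcing) / tau
        y += dt * z / tau
        x += dt * (-alpha_x * x) / tau
        path.append(y)
    return path


# Zero weights give a plain point-to-point reach; nonzero weights reshape the
# path while the goal stays the attractor.
straight = dmp_rollout(0.0, 1.0, [0.0] * 5)
shaped = dmp_rollout(0.0, 1.0, [100.0] * 5)
```

Because the forcing term fades with the phase variable, any weight vector still yields a trajectory that settles near the goal, which is the stability property that makes DMP parameters a forgiving target for a language model to set.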

2510.20955 2026-05-26 cs.LG cs.RO 版本更新

Approximating Safety Feedback Without a Safety Oracle via Model Predictive Control

无安全神谕下通过模型预测控制近似安全反馈

Jeff Pflueger, Michael Everett

发表机构 * Northeastern University(东北大学)

AI总结 提出一种利用模拟器和模型预测路径积分算法,基于可逆性和正不变性假设来近似安全函数的方法,避免手动设计安全反馈。

Comments 8 pages, 5 figures

详情
AI中文摘要

移动机器人控制的安全决策算法通常需要存在反馈来验证提议动作的安全性。该反馈假定在控制系统的开发或部署过程中直接可用,可以采取显式约束公式或手工标记的安全数据集的形式,但两者都可能不准确或耗时。许多最近开发的模拟器可以处理复杂的交互和多样化的环境。这些环境具有隐式安全约束,可能难以建模。通过利用其中一个模拟器,我们可以构建一个安全函数的代理,从而绕过对手动设计反馈来捕获这些约束的需求。我们提出了一种算法,通过使用可逆性和对不安全状态空间的正不变性假设来近似安全性。该方法采用模型预测路径积分算法(MPPI)来建立这种可逆性并验证提议的动作。首先,通过模拟器将动作投影到未来状态。然后,如果MPPI能够找到一条路径返回到轨迹中的先前状态,则该状态保证在不安全(正不变)集合之外。实验结果表明,所提出的算法可以近似安全神谕的性能,同时避免将不安全状态分类为安全。

英文摘要

Safe decision-making algorithms for control of mobile robots often require the existence of feedback to verify the safety of proposed actions. This feedback is assumed to be directly available during the development or deployment of the control system. It can take the form of either an explicit constraint formulation or a set of hand-labeled safety data, both of which can be inaccurate or time-consuming to produce. Many recently developed simulators can handle complex interactions and varied environments. These environments have implicit safety constraints that may be hard to model. By leveraging one of these simulators, we can construct a proxy for a safety function that bypasses the need for hand-designed feedback in capturing these constraints. We present an algorithm that approximates safety by using reversibility and a positive-invariance assumption on the unsafe state space. This method employs the Model-Predictive Path Integral algorithm (MPPI) to establish this reversibility and verify a proposed action. First, the action is projected via the simulator to a future state. Then, if MPPI can find a path back to a previous state in the trajectory, that state is guaranteed to be outside the unsafe (positively invariant) set. Experimental results demonstrate that the proposed algorithm can approximate the performance of a safety oracle while avoiding classification of unsafe states as safe.
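The reversibility argument can be illustrated on a toy one-dimensional system. In this sketch a plain random-shooting search stands in for MPPI, and the simulator, the cliff position, and every name and parameter are hypothetical. The unsafe set is absorbing (positively invariant): once the robot passes the cliff it can no longer move, so no action sequence can bring it back.

```python
import random

CLIFF = 1.0  # hypothetical boundary; states at or beyond it are unsafe and absorbing


def simulate(x, action, step=0.5):
    """Toy simulator: past the cliff the state is stuck, otherwise it moves."""
    if x >= CLIFF:
        return x
    return x + step * action


def can_return(start, target, rng, samples=200, horizon=5, tol=0.1):
    """Random-shooting stand-in for MPPI: look for any sampled action
    sequence that brings the state back to within tol of the target."""
    for _ in range(samples):
        x = start
        for _ in range(horizon):
            x = simulate(x, rng.uniform(-1.0, 1.0))
            if abs(x - target) <= tol:
                return True
    return False


def action_is_safe(x, action, seed=0):
    """Project the action forward, then test whether the previous state
    is still reachable. Reachable means the new state is not in the
    absorbing unsafe set."""
    rng = random.Random(seed)
    x_next = simulate(x, action)
    return can_return(x_next, x, rng)


safe_far_from_cliff = action_is_safe(0.0, 0.5)
safe_over_cliff = action_is_safe(0.9, 1.0)
```

The check is one-sided by construction: a found return path certifies the state, whereas a failed search may only mean the sampler did not look hard enough, which matches the abstract's emphasis on never labeling unsafe states as safe.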

2510.16435 2026-05-26 cs.RO cs.CL cs.HC Updated version

What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics

Lennart Wachowiak, Andrew Coles, Gerard Canal, Oya Celiktutan

Affiliations * King's College London, CDT in Safe and Trusted AI; King's College London

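AI summary Builds a dataset of user questions for household robots by collecting 1,893 questions from 100 participants, organized into 12 categories and 70 subcategories, to help roboticists identify the key types of questions robots need to answer.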

Details

Abstract

With the growing use of large language models and conversational interfaces in human-robot interaction, robots' ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why-questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios -- thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (21.4%), the robot's capabilities (12.6%), and performance assessments (10.7%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment. As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question-answering modules, and (iii) designing explanation strategies that align with user expectations.