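arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.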

2605.23892 2026-05-25 cs.CV cs.AI cs.GR cs.LG cs.RO Updated version

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

Affiliations: University of Toronto & Vector Institute; Google; Technical University of Munich

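AI summary: Visual geometry transformers perform well in multi-view 3D reconstruction, but their computational cost grows quadratically with input sequence length, which limits efficiency and scalability. The paper proposes a simple, general remedy: limit the number of key/value tokens each query interacts with during global attention. The method uses a two-stage framework. It first selects which frames to keep, at the frame level, so that scene coverage stays diverse, and then removes redundant tokens within those frames using a layer-aware sparsification strategy guided by attention entropy. Experiments show speedups of over 85% for visual geometry transformers while maintaining or improving performance.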

Comments Project Page: https://zsh2000.github.io/good-token-hunting.github.io, Code: https://github.com/zsh2000/gotohunt

English abstract

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models, which limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards the more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints at how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.
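The two signals named in the abstract, frame diversity and attention entropy, can be illustrated with a minimal sketch. This is not the authors' implementation: greedy farthest-point selection is used here only as one simple stand-in for a "diversity-based" strategy, and the per-frame descriptors and both function names are hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one query's attention distribution.

    Low entropy means attention is concentrated on a few tokens;
    high entropy means it is spread out.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def farthest_point_frames(features, k):
    """Greedy farthest-point selection of k frame indices (assumes k >= 1).

    features: one hypothetical descriptor (list of floats) per frame.
    Starts from frame 0 and repeatedly adds the frame whose nearest
    already-selected frame is farthest away.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = [0]
    # Distance from every frame to its nearest selected frame.
    nearest = [dist(f, features[0]) for f in features]
    while len(selected) < min(k, len(features)):
        nxt = max(range(len(features)), key=lambda i: nearest[i])
        selected.append(nxt)
        nearest = [min(d, dist(f, features[nxt]))
                   for d, f in zip(nearest, features)]
    return selected
```

A layer whose attention entropy is low could, under this reading, tolerate more aggressive token pruning than a layer with diffuse attention. How the paper actually maps entropy to a per-layer budget is not stated in the abstract.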

2605.23863 2026-05-25 cs.RO Updated version

Robotic Strawberry Harvesting with Robust Vision and Deep Reinforcement Learning based Sim-to-Real Control

Al Bashir, Shao-Yang Chang, Partho Ghose, Prem Raj, Chen-Kang Huang, Azlan Zahid

Affiliations: Department of Biological & Agricultural Engineering, Texas A&M University; Texas A&M AgriLife Research, Texas A&M University System; Department of Biomechatronics Engineering, National Taiwan University

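AI summary: The paper presents a closed-loop robotic strawberry harvesting system that combines a robust vision module, deep reinforcement learning control trained in simulation, and execution on a real robot through ROS. The authors design HRAttnEdge-YOLO26-seg, a modified YOLO26-seg architecture that improves instance segmentation in complex scenes, and train a PPO-based policy controller in simulation for precise control of a UR10e manipulator. In greenhouse experiments the system harvested 281 strawberries with a high success rate, showing that combining simulation training with task-aware perception is practical and efficient for agricultural robotics.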

English abstract

This study presents a closed-loop robotic strawberry harvesting system that combines a robust vision module, simulation-trained deep reinforcement learning (DRL) control, and ROS-based real-robot execution. For perception, we propose HRAttnEdge-YOLO26-seg, a modified YOLO26-seg architecture that incorporates a high-resolution P2 branch, segmentation-path attention, and edge-supervised prototype learning to improve instance segmentation in cluttered scenes. For control, we train a target-conditioned Proximal Policy Optimization (PPO) policy in Isaac Lab to produce smooth joint-position commands for a UR10e manipulator and deploy it on a UR10e robot for target-fruit reaching and harvesting. This simulation-based approach reduces hardware dependency, lowers development cost, and allows scalable policy training without exhaustive physical trials before real deployment. The proposed vision model demonstrated the highest overall performance among the evaluated methods. On both self-collected and public datasets, the model showed a 10 to 14% improvement in segmentation performance. In controlled in-house tests, the PPO controller produced stable and dynamically smoother motion than an inverse kinematics (IK)-based MoveIt baseline. In greenhouse trials, the proposed integrated system harvested 281 strawberries, achieving 96.6% reaching success, 91.3% grasp-and-pull success, and 84.3% overall harvesting success. These results illustrate that task-specific perception combined with simulation-trained PPO can serve as a practical and resource-efficient alternative to conventional planner-dependent reaching in manipulation, enabling reliable closed-loop robotic harvesting in complex agricultural environments.

2605.23856 2026-05-25 cs.RO Updated version

Point Tracking Improves World Action Models

Jiarui Guan, Wenshuai Zhao, Yue Pei, Ziliang Chen, Arno Solin, Juho Kannala

Affiliations: Aalto University; ELLIS Institute Finland; University of Oulu; Sun Yat-sen University; Peng Cheng Laboratory; Beihang University

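AI summary: The paper proposes JOPAT, a joint pixel-and-track world action model, to improve environment dynamics modeling in robot policy learning. JOPAT uses a single denoising diffusion transformer to jointly predict latent visual observations, 2D point tracks with visibility flags, and actions. The core idea is that point tracks provide an explicit motion representation that is more robust and captures long-term dynamics. Experiments show that JOPAT significantly outperforms conventional pixel-based methods on long-horizon tasks involving occlusion, object interaction, and off-screen motion.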

2605.23847 2026-05-25 cs.RO Updated version

Instrumentation for Imitation Learning: Enhancing Training Datasets for Clothes Hanger Insertion

Remko Proesmans, Thomas Lips, Francis wyffels

Affiliations: AI and Robotics Lab (IDLab-AIRO); Ghent University-imec

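AI summary: The paper studies how integrating sensors into objects (instrumentation) can improve imitation learning for a clothes hanger insertion task. The authors propose an instrumented imitation learning approach, train diffusion policies on 180 teleoperated demonstrations, and compare policies with and without instrumentation data. Policies that use instrumentation data achieve success rates 14 to 25 percentage points higher than vision-only policies and show a better grasp of the task requirements. In addition, augmenting the training set with data generated by an instrumented expert policy lets a vision-only policy approach expert-level performance, which supports instrumentation as a way to improve imitation learning.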

Comments Accepted for presentation at ICRA2026

English abstract

Large behaviour models have transformed the field of robotic manipulation, but prohibitive data requirements have thus far prevented a revolution similar to vision language models. We believe that instrumentation, i.e. sensor integration in objects, can provide invaluable state information and enable efficient learning for robotic manipulation. In this paper, we present instrumented imitation learning of clothes hanger insertion. Using 180 teleoperated demonstrations, we train diffusion policies with and without access to instrumentation data. Results show that policies leveraging instrumentation outperform vision-only counterparts by 14 to 25 percentage points and exhibit greater task awareness. Crucially, a black-box imitation learning policy learns to prioritise instrumentation signals without explicit guidance. In addition, enhancing the teleoperation dataset with rollouts from an instrumented expert policy enables a vision-only student policy to achieve performance comparable to the instrumented expert, thereby surpassing the original vision-only policy. These findings establish instrumentation as a promising strategy to enhance imitation learning for robotic manipulation. Datasets are available on Zenodo.

2605.23832 2026-05-25 cs.RO Updated version

SFG-ROS: A Resource-Aware Framework for Dense Multi-Agent Perception

Constantin Blessing, Elias Geiger, Jakob Häringer, Dennis Grewe, Markus Enzweiler

Affiliations: Institute for Intelligent Systems, Faculty of Computer Sciences and Engineering

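AI summary: The paper proposes SFG-ROS, a resource-aware multi-agent software framework that addresses network congestion, naming conflicts, and excessive computational overhead in dense perception tasks for heterogeneous robot teams. Through schema-based traffic routing, an on-demand decoding pipeline, and hardware-agnostic containerized processing, the framework improves scalability and resource utilization. Experiments show that, compared with standard ROS 2, SFG-ROS significantly reduces network traffic and CPU overhead while keeping communication latency low.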

English abstract

Deploying heterogeneous multi-agent robot fleets for collaborative perception requires robust data exchange and scalable software architectures. However, standard ROS 2 implementations often suffer from network saturation, namespace collisions, and severe computational overhead when distributing dense sensor streams across devices. To address these bottlenecks, we present SFG-ROS, a resource-aware multi-agent software framework designed for dynamic fleet deployments. SFG-ROS addresses these challenges through three primary contributions. First, schema-driven traffic routing isolates high-frequency intra-agent traffic from the global network using a programmatic fully qualified name schema and targeted Fast DDS routing. Second, an on-demand centralized decoding pipeline automatically offloads high-bandwidth sensor data decompression, eliminating redundant processing across local consumer nodes. Finally, a hardware-agnostic container pipeline dynamically adapts to heterogeneous accelerators, seamlessly bridging development environments with zero-touch, field-ready execution. We evaluate the framework using a fleet of wheeled and legged robots equipped with LiDAR and stereo depth cameras. Experimental results show SFG-ROS bounds network traffic to O(1) and, by replacing redundant decompression with lightweight IPC, reduces the per-subscriber CPU scaling penalty by 72.3% versus standard ROS 2, all while maintaining low latency. Finally, we publish SFG-ROS under a permissive license, available via https://iis-esslingen.github.io/sfg-ros.

2605.23762 2026-05-25 cs.RO Updated version

Direct Dynamic Retargeting for Humanoid Imitation Learning from Videos

Constant Roux, Ludovic De Matteïs, Armand Jordana, Valentin Guillet, Nicolas Mansard, Olivier Stasse, Philippe Souères

Affiliations: LAAS-CNRS, Université de Toulouse, CNRS

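AI summary: The paper studies how to learn imitation skills from monocular videos of human motion and apply them to humanoid robots. To address the morphological mismatch between humans and humanoids, the authors propose Direct Dynamic Retargeting (DDR), which uses a task-space formulation and a sampling-based model predictive control solver to directly generate high-quality, physically consistent trajectories, avoiding the geometric bias of conventional approaches. Experiments show that DDR outperforms existing methods in both trajectory tracking accuracy and reinforcement learning training efficiency.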

English abstract

Imitation Learning from monocular video demonstrations provides a scalable approach for teaching complex skills to humanoid robots. However, translating human motion to humanoids requires overcoming significant morphological mismatches. Standard approaches rely on Geometric Retargeting or Indirect Dynamic Retargeting pipelines. We identify that these intermediate kinematic projections introduce a geometric bias, restricting the search space and yielding suboptimal dynamic behaviors. In this paper, we propose Direct Dynamic Retargeting (DDR), a novel single-stage framework that generates high-fidelity, dynamically feasible trajectories directly from expert videos. By formulating the problem in the task space and leveraging a sampling-based Model Predictive Control solver within a physics simulator, DDR natively optimizes over complex contact sequences while mitigating input drift. Our experiments demonstrate that bypassing the geometric bias allows DDR to outperform state-of-the-art baselines in demonstration tracking accuracy. Furthermore, we establish that providing such physically viable references to RL agents accelerates training convergence and enhances the final execution of agile and balancing behaviors. Source code will be made publicly available.
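To make the phrase "sampling-based Model Predictive Control" concrete, here is a minimal sketch of its simplest form, random shooting, on a toy one-dimensional integrator. Everything here is an assumption for illustration: DDR runs inside a physics simulator on a humanoid, and the abstract does not say which sampling-based solver it uses. The function names and the plant are hypothetical.

```python
import random

def rollout_cost(x0, controls, reference, dt=0.1):
    """Simulate the toy plant x' = u with Euler steps and sum the
    squared tracking error against the reference trajectory."""
    x, cost = x0, 0.0
    for u, ref in zip(controls, reference):
        x += u * dt
        cost += (x - ref) ** 2
    return cost

def sampling_mpc_step(x0, reference, n_samples=256, u_max=1.0, rng=None):
    """Random-shooting MPC: sample control sequences, score each by
    rollout, and return the first control of the cheapest sequence."""
    rng = rng or random.Random(0)
    best_cost, best_u0 = float("inf"), 0.0
    for _ in range(n_samples):
        controls = [rng.uniform(-u_max, u_max) for _ in reference]
        cost = rollout_cost(x0, controls, reference)
        if cost < best_cost:
            best_cost, best_u0 = cost, controls[0]
    return best_u0
```

In a receding-horizon loop, only the returned first control is applied, the state is re-measured, and the sampling repeats. Because candidates are scored by simulation rather than by gradients, this kind of solver can cope with non-smooth contact dynamics.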

2605.23717 2026-05-25 cs.RO Updated version

Vision-Based Agile Landing on Turbulent Waters

Dimosthenis Angelis, Leonard Bauersfeld, Davide Scaramuzza, Evangelos Boukas

Affiliations: Department of Electrical and Photonics Engineering, Technical University of Denmark; Robotics and Perception Group, Department of Informatics, University of Zurich

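AI summary: The paper addresses autonomous UAV landing on moving maritime platforms in rough sea conditions and proposes a reinforcement-learning-based method that does not require explicit platform-state information. The method combines multirotor state measurements with local visual features of the landing surface to predict attitude and thrust commands, which a low-level controller then tracks. The method outperforms a model predictive control baseline in simulation and is demonstrated in real-world landing experiments. According to the authors, it is the first agile landing approach for turbulent waters that needs no explicit platform-state representation.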

English abstract

Autonomous landing of Unmanned Aerial Vehicles on maritime vessels is challenging due to the coupled motion of the vehicle and landing platform in open-sea conditions. This paper presents a reinforcement-learning-based approach for autonomous multirotor landing on moving maritime platforms without requiring explicit platform-state information. The proposed method uses multirotor state measurements together with local visual features, consisting of keypoints and associated descriptors extracted from the landing surface, to predict attitude and thrust commands. These commands are tracked by a conventional low-level controller. The policy is trained in simulation using synthetic keypoints with randomly generated normalized descriptors, enabling zero-shot deployment with different local feature extractors onboard the UAV. We evaluate the method in a realistic simulator and show that it outperforms a state-of-the-art Model Predictive Control baseline under platform motions corresponding to "Very Rough" sea conditions. Finally, we perform extensive real-world experiments, demonstrating autonomous onboard landing using two different local feature extractors. To the best of our knowledge, this is the first approach for agile multirotor landing on maritime platforms in turbulent waters that does not rely on an explicit platform-state representation.

2605.21031 2026-05-25 cs.RO Updated version

Modeling and Control of a Pneumatic Morphing Soft Quadrotor based on the SOFA Framework for Dynamic Soft Robotic Simulation

F. Labra Caso, V. Sumathy, P. Ferrentino, B. Vanderborght, J. Haluska, G. Nikolakopoulos

Affiliations: Luleå University of Technology; VUB Brussels University; IMEC Center; Robotics & AI

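AI summary: The paper proposes a SOFA-based finite element method for modeling and controlling a pneumatic morphing soft quadrotor. The approach preserves the physical interpretability and control structure of conventional quadrotor dynamics while capturing the complex, time-varying behavior of the pneumatically actuated soft arms. The soft arms are discretized in SOFA as a tetrahedral mesh, with an elastic material law modeling their internal forces. This enables dynamic simulation and control analysis of pneumatically driven morphing, and the results demonstrate the effectiveness of the modeling framework and controller design.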

Comments 8 pages, 10 figures

English abstract

This article presents a novel SOFA-based finite element method for soft-body modeling and the corresponding dynamic simulation and control of a pneumatic morphing soft quadrotor. The proposed modeling preserves the physical interpretability and control structure of traditional quadrotor dynamics, while capturing the complex, time-varying behavior of pneumatically actuated soft arms. In SOFA, the soft pneumatically actuated arms are discretized as a tetrahedral mesh following an elastic material law that produces internal forces consistent with the real dynamic behavior of the body. Pneumatic actuation governed by both periodic and error-based control signals is applied within the internal cavities to analyze the morphing capability. Finally, a proportional-integral controller is proposed to study the controlled dynamic behavior and morphing capabilities of the pneumatic arm, wherein the pneumatic actuation of the soft arm is controlled to achieve the desired target position. The simulation results show the effectiveness of the proposed modeling framework and the related controller design.
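The proportional-integral law mentioned in the abstract can be sketched on a toy plant. This is an illustration under stated assumptions, not the paper's controller: the first-order plant x' = -x + u stands in for the soft arm's finite element dynamics, the control signal u stands in for cavity pressure, and the gains and function name are hypothetical.

```python
def simulate_pi(target, kp=2.0, ki=1.0, dt=0.01, steps=2000):
    """Drive a toy first-order plant x' = -x + u toward `target`
    with a discrete proportional-integral (PI) law."""
    x = 0.0          # plant state (stand-in for arm tip position)
    integral = 0.0   # accumulated tracking error
    for _ in range(steps):
        error = target - x
        integral += error * dt
        u = kp * error + ki * integral   # PI control signal
        x += (-x + u) * dt               # explicit Euler step of the toy plant
    return x
```

The integral term is what removes the steady-state offset: a purely proportional law on this plant would settle at kp / (1 + kp) of the target rather than at the target itself.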

2603.21880 2026-05-25 cs.RO Updated version

Optimal Solutions for the Moving Target Vehicle Routing Problem with Obstacles via Lazy Branch and Price

Anoop Bhat, Geordan Gutow, Surya Singh, Zhongqiang Ren, Sivakumar Rathinam, Howie Choset

Affiliations: Robotics Institute at Carnegie Mellon University; Mechanical and Aerospace Engineering at Michigan Technological University; Robotics and AI Institute; UM-SJTU Joint Institute and Department of Automation at Shanghai Jiao Tong University; Department of Mechanical Engineering and Department of Computer Science and Engineering at Texas A&M University

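AI summary: The paper studies the Moving Target Vehicle Routing Problem with Obstacles (MT-VRP-O), which plans paths for multiple agents to intercept moving targets while satisfying time-window, speed, and capacity constraints. The authors propose Lazy BPRC, an optimization method based on lazy branch-and-price that uses motion planning with relaxed continuity constraints inside the branch-and-price framework. This reduces computational cost and substantially improves solve time while still guaranteeing optimal solutions.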

English abstract

The Moving Target Vehicle Routing Problem with Obstacles (MT-VRP-O) seeks trajectories for several agents that collectively intercept a set of moving targets. Each target has one or more time windows in which it must be visited, and the agents must avoid static obstacles and satisfy speed and capacity constraints. We introduce Lazy Branch-and-Price with Relaxed Continuity (Lazy BPRC), which finds optimal solutions for the MT-VRP-O. Lazy BPRC applies the branch-and-price framework for VRPs, which alternates between a restricted master problem (RMP) and a pricing problem. The RMP aims to select a sequence of target-time window pairings (called a tour) for each agent to follow, from a limited subset of tours. The pricing problem adds tours to the limited subset. Conventionally, solving the RMP requires computing the cost for an agent to follow each tour in the limited subset. Computing these costs in the MT-VRP-O is computationally intensive, since it requires collision-free motion planning between moving targets. Lazy BPRC defers cost computations by solving the RMP using lower bounds on the costs of each tour, computed via motion planning with relaxed continuity constraints. We lazily evaluate the true costs of tours as needed. We compute a tour's cost by searching for a shortest path on a Graph of Convex Sets (GCS), and we accelerate this search using our continuity relaxation method. We demonstrate that Lazy BPRC runs up to an order of magnitude faster than two ablations.
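The lazy-evaluation idea in the abstract, working with cheap lower bounds and computing an expensive true cost only when it could matter, can be shown in isolation. The sketch below is not branch-and-price and not Lazy BPRC. It only finds the cheapest item in a set, and the function name and both callbacks are hypothetical stand-ins for the relaxed and exact motion-planning costs.

```python
import heapq

def lazy_argmin(items, lower_bound, true_cost):
    """Return (best_item, best_cost, n_evaluations) for the item with
    minimum true cost, calling the expensive `true_cost` lazily.

    Requires lower_bound(x) <= true_cost(x) for every x.
    """
    # Each heap entry is (key, index, key_is_exact).
    heap = [(lower_bound(x), i, False) for i, x in enumerate(items)]
    heapq.heapify(heap)
    evaluations = 0
    while heap:
        cost, i, exact = heapq.heappop(heap)
        if exact:
            # Every remaining key is >= cost and keys never exceed
            # true costs, so no other item can be cheaper.
            return items[i], cost, evaluations
        evaluations += 1
        heapq.heappush(heap, (true_cost(items[i]), i, True))
    raise ValueError("items must be non-empty")
```

For example, with items `[3, 1, 2]`, `lower_bound=lambda x: x`, and a true cost that adds 5 to item 1 and 0.5 to the others, the search returns item 2 after two exact evaluations and never evaluates item 3. The tighter the lower bounds, the fewer exact evaluations are needed.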

2511.03882 2026-05-25 cs.CV cs.AI cs.LG cs.RO Updated version

Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Axel Krieger, Mathias Unberath

Affiliations: Johns Hopkins University; Technical University of Munich; Johns Hopkins School of Medicine

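AI summary: The paper investigates imitation-learning-based robot control policies for X-ray-guided spine procedures, in particular the feasibility and challenges of cannula insertion in vertebroplasty. The authors build a highly realistic simulation environment and a dataset of correct trajectories with bi-planar X-ray sequences, and use them to train imitation learning policies that rely on visual information alone. The policy succeeded on the first attempt in 68.5% of cases and maintained safe trajectories across varied spinal anatomies and initial conditions, providing a basis for future lightweight, CT-free intra-operative robotic spine navigation.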

DOI: 10.1007/s11548-026-03716-x

English abstract

Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation, with sparse inputs. We examine the feasibility, opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula in a vertebroplasty setting solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy transferred to complex anatomy, including fractures, as well as varied anatomies and initializations. Rollouts on real X-ray indicate that partial sim-to-real transfer with plausible trajectories is possible. While these preliminary results are promising, we also identify limitations, especially in entry point precision. The current results present a clear benchmark for future efforts, while with more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

2510.07869 2026-05-25 cs.RO Updated version

USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots

Junwen Gu, Zhiheng Wu, Pengxuan Si, Shuang Qiu, Zhentao Zhang, Yukai Feng, Luoyang Sun, Laien Luo, Lianyi Yu, Jian Wang, Zhengxing Wu

Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; Baidu Inc.; The School of Artificial Intelligence, University of Chinese Academy of Sciences

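AI summary: The paper proposes a vision-language-action framework for general-purpose underwater robots, aimed at general intelligence for multi-task execution underwater. The authors build USIM, a large-scale simulation-based dataset, and design U0, a vision-language-action model that adds target pose estimation as an auxiliary task to strengthen spatial awareness and handles tasks ranging from obstacle-avoidance navigation to three-dimensional mobile manipulation. Experiments show that U0 achieves state-of-the-art offline action prediction error and online task success rate, supporting the feasibility of general-purpose intelligence in underwater robotics.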

Comments Project Page: https://vincentgu2000.github.io/u0project/

English abstract

Underwater environments pose unique challenges for robotic navigation and manipulation. While existing research has primarily focused on task-specific methods, studies on general-purpose intelligence for multi-task execution remain scarce. To address this gap, we propose a unified framework for general-purpose underwater robots that integrates perception and action driven by language instructions. First, we develop a data synthesis pipeline to construct USIM, a simulation-based dataset that comprises over 905K frames from 2275 trajectories, totaling approximately 25 hours of BlueROV2 interactions. Furthermore, we propose U0, a vision-language-action (VLA) model capable of executing various tasks from obstacle-avoidance navigation to three-dimensional mobile manipulation. The model features a convolution-attention-based perception (CAP) module, which incorporates target pose estimation as an auxiliary task to explicitly bolster the model's spatial awareness. For evaluation, we establish a systematic assessment framework and an automated pipeline encompassing both offline metrics and online task execution. Experimental results demonstrate that the USIM dataset significantly empowers existing VLA models to adapt to underwater scenarios. Notably, our U0 model achieves state-of-the-art performance: it reduces the offline mean action prediction error to 0.0359 and achieves an overall online success rate of 43.1%, a 5.5 percentage-point improvement over existing competitive baselines (below 37.6%), with navigation tasks reaching as high as 87.5%. These results validate the feasibility of general-purpose intelligence in underwater robotics, providing a foundation for scalable dataset synthesis and aquatic embodied agents.

2506.14135 2026-05-25 cs.RO cs.CV Updated version

GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation

Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, Yebin Liu

Affiliations: Tsinghua University; Beijing Normal University; Shadow AI

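AI summary: The paper proposes a four-dimensional representation based on the Gaussian Action Field (GAF) for dynamic world modeling in robotic manipulation. GAF extends 3D Gaussian Splatting (3DGS) with learnable motion attributes, giving a 4D model of dynamic scenes and manipulation actions. The method reasons about actions directly from the motion-aware 4D representation and improves manipulation accuracy through three related outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action. Experiments show that GAF outperforms existing methods in both reconstruction quality and manipulation success rate.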

Comments https://ChaiYing1.github.io/projects/GAF/

English abstract

Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by the GAF, to obtain more precise actions. Extensive experiments demonstrate significant improvements: GAF improves reconstruction quality by +11.5385 dB PSNR, +0.3864 SSIM, and -0.5574 LPIPS, and raises the average success rate in robotic manipulation tasks by 7.3% over state-of-the-art methods.

2501.08222 2026-05-25 cs.RO Updated version

Data-driven Spatial Classification using Multi-Arm Bandits for Monitoring with Energy-Constrained Mobile Robots

Xiaoshan Lin, Siddharth Nayak, Stefano Di Cairano, Abraham P. Vinod

Affiliations: Aerospace Engineering and Mechanics department, University of Minnesota; Aeronautics and Astronautics department, Massachusetts Institute of Technology; Mitsubishi Electric Research Laboratories (MERL)

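AI summary: The paper studies spatial classification for environmental monitoring with cooperating mobile robots, with the goal of quickly dividing a search area into interesting and uninteresting regions. It proposes a bi-level strategy built on a multi-armed bandit framework: a high-level planner chooses the regions to visit from real-time data, and a low-level planner coordinates paths through integer programming while accounting for sensor noise and energy constraints. The method shows good classification efficiency and task completion performance in simulation and in physical robot experiments.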

Comments 8 pages, 6 figures. See https://www.youtube.com/watch?v=gzulpOcVYzg for an overview of the approach along with videos of the hardware experiments

English abstract

We consider the spatial classification problem for monitoring using data collected by a coordinated team of mobile robots. Such classification problems arise in several applications including search-and-rescue and precision agriculture. Specifically, we want to classify the regions of a search environment into interesting and uninteresting as quickly as possible using a team of mobile sensors and mobile charging stations. We develop a data-driven strategy that accommodates the noise in sensed data and the limited energy capacity of the sensors, and generates collision-free motion plans for the team. We propose a bi-level approach, where a high-level planner leverages a multi-armed bandit framework to determine the potential regions of interest for the drones to visit next based on the data collected online. Then, a low-level path planner based on integer programming coordinates the paths for the team to visit the determined regions subject to the physical constraints. We characterize several theoretical properties of the proposed approach, including anytime guarantees and task completion time. We show the efficacy of our approach in simulation, and further validate these observations in physical experiments using mobile robots.
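As a rough illustration of how a multi-armed bandit can decide which region to sense next, the sketch below uses UCB1, a standard bandit rule. This is an assumption made for illustration: the abstract does not say which bandit algorithm the high-level planner uses, and treating each region as an arm with a scalar reward is a simplification. Both function names are hypothetical.

```python
import math

def ucb1_select(counts, means, c=2.0):
    """Pick the arm (region) index maximizing
    mean + sqrt(c * ln(t) / n); unvisited arms are tried first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    t = sum(counts)
    scores = [m + math.sqrt(c * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def update(counts, means, arm, reward):
    """Fold one noisy observation into the running mean of `arm`."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```

The exploration bonus shrinks as a region accumulates measurements, so rarely sensed regions are revisited until the noisy estimates are good enough to classify them. In the paper's setting, the selected regions would then be handed to the low-level path planner, which this sketch omits.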

2605.23583 2026-05-25 cs.RO cs.LG Updated version

How Many Training Samples Are Needed for the Inverse Kinematics Solutions by Artificial Neural Networks

人工神经网络求解逆运动学需要多少训练样本

Dong-Won Lim

发表机构 * The University of Suwon（苏won大学）

AI总结本文研究了使用人工神经网络求解机器人逆运动学问题时所需的最小训练样本数量。通过构建不同规模的训练数据集，训练前馈神经网络并评估其精度、收敛性和泛化能力，发现当样本数量超过125后，模型效率提升不再显著。该研究为实际机器人应用中优化神经网络数据规模、平衡计算成本与模型精度提供了有价值的指导。

Comments 14 pages, 5 figures

Abstract

Inverse Kinematics (IK) plays a critical role in robotic motion planning and control. The IK solutions of a robot manipulator can be obtained by conventional approaches such as geometric, algebraic, or Jacobian methods, which have drawbacks. Artificial Neural Networks (ANNs) have become a promising alternative for approximating IK solutions due to their generalization ability and computational efficiency. This approach essentially trains on only a few recorded samples of the end effector to solve the IK problem. However, a fundamental question remains: how many training samples are sufficient to achieve reliable and accurate IK predictions? This study investigates a mathematical framework relating the size of the training dataset to the accuracy of ANN-based IK solvers. Using an articulated robotic manipulator, we generate varying amounts of joint-position pairs to train feedforward neural networks and assess their accuracy, convergence, and generalization capability. The results reveal that using more than 125 training samples did not improve model efficiency, a comparable measure that relates approximation accuracy to sample size, offering valuable insight into data efficiency. This work provides practical guidance for optimizing the data sizing of ANN solutions, balancing computational cost and model accuracy for real-world robotic applications.
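To make "joint-position pairs" concrete, here is a minimal sketch of how such a training set could be generated from forward kinematics. It assumes a planar two-link arm with unit link lengths, which is a stand-in and not the manipulator used in the paper; `forward_kinematics` and `make_ik_dataset` are hypothetical names.

```python
import math
import random


def forward_kinematics(theta1, theta2, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link arm (assumed geometry)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def make_ik_dataset(n_samples, seed=0):
    """Return n_samples pairs ((x, y), (theta1, theta2)).

    Positions are the network inputs and joint angles are the labels.
    theta2 is limited to [0, pi] so only one elbow branch is sampled and
    the position-to-joint mapping stays single-valued.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        t1 = rng.uniform(-math.pi, math.pi)
        t2 = rng.uniform(0.0, math.pi)
        data.append((forward_kinematics(t1, t2), (t1, t2)))
    return data
```

Calling `make_ik_dataset(n)` for increasing `n` yields the datasets of different sizes on which a network's accuracy could be compared.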


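2605.23568 2026-05-25 cs.RO cs.SY eess.SY (updated version)

TactileReflex: Noise-Statistics-Driven Vision-Tactile Reflex Control for Force-Sensitive Manipulation

Ziyan Feng, Yulong Fu, Zheng Li, Yuxin He, Jieji Ren, Lujia Wang, Jinni Zhou, Yudong Zhong, Qiang Nie

Affiliations: Thrust of Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou); School of Mechanical Engineering, Shanghai Jiao Tong University

AI summary: This paper proposes TactileReflex, a vision-tactile reflex control method driven by noise statistics, for force-sensitive fine manipulation such as grasping and handling liquid-filled plastic cups. By analyzing the intrinsic noise characteristics of the tactile sensor, it derives the controller thresholds directly, with no external force calibration or manual tuning. Experiments show that TactileReflex prevents irreversible container deformation and achieves strong stability and success rates in a dynamic pouring task, suggesting it can serve as a safety layer for higher-level manipulation systems.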

Comments 8 pages, 4 figures, 6 tables

Abstract

Manipulating fragile deformable containers, such as disposable plastic cups filled with liquid, demands real-time grip-force adaptation within an extremely narrow force margin: insufficient force causes slip, while excessive force irreversibly deforms the thin wall. Existing approaches struggle to achieve such force-sensitive manipulation tasks. We propose a noise-statistics-based calibration-driven reflex control paradigm with vision-based tactile sensing: by analyzing the sensor's intrinsic noise characteristics (via a brief static-hold-and-unload protocol), we directly derive all controller thresholds, eliminating external force calibration, trial-and-error manual tuning, or material-specific physical models. Instantiating this paradigm, we present TactileReflex, a three-channel closed-loop controller that extracts three image-level proxies, shear intensity ($S_y$), contact intensity ($F_n$), and center of pressure ($C$), from dual visuo-tactile sensors and drives prioritized reflex channels at ~12 Hz for slip suppression, weight-adaptive release, and force protection. Each channel closes the loop directly on its proxy via noise-derived thresholds. Ablation demonstrates that only the full three-channel system is able to prevent irreversible container deformation (5/5 success vs. at most 1/5 for partial configurations). In a dynamic pouring task, fixed-effort baselines fail in all 10 attempts due to pose drift, while TactileReflex achieves 9/10 success across two water volumes. As a self-contained and interpretable controller, TactileReflex can serve as a plug-and-play safety layer beneath high-level manipulation pipelines, including haptic-free VR teleoperation and vision-language-action (VLA) policies.
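The abstract states that thresholds come from the sensor's noise statistics but gives no formula. One common way to do this, shown here purely as an assumed illustration, is a mean-plus-k-sigma rule over proxy readings recorded while the gripper holds still. The names `noise_threshold` and `slip_detected` and the factor `k` are hypothetical.

```python
import statistics


def noise_threshold(static_samples, k=3.0):
    """Threshold = mean + k * sample standard deviation of a proxy signal.

    static_samples are proxy readings (for example a shear-intensity value)
    recorded during a static hold; k is an assumed safety factor.
    """
    mu = statistics.fmean(static_samples)
    sigma = statistics.stdev(static_samples)
    return mu + k * sigma


def slip_detected(shear_value, threshold):
    """Flag a reading that rises above the noise-derived threshold."""
    return shear_value > threshold
```

With such a rule, a reflex channel triggers only when its proxy leaves the band explained by sensor noise, and no force calibration in newtons is required.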


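2605.23477 2026-05-25 cs.RO (updated version)

Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation

Chengyu Deng, Guanqi Chen, Yizhou Chen, Zejia Liu, Zhiwen Ruan, Guanhua Chen, Jia Pan

Affiliations: The University of Hong Kong; Southern University of Science and Technology

AI summary: This work targets the high computational cost and poor generalization of diffusion-based robot manipulation policies in multi-task settings and proposes a semantically structured mixture-of-experts diffusion policy (SMoDP). A lightweight skill predictor, supervised by vision-language model annotations, routes action chunks at inference time to experts specialized for particular behavioral phases, improving efficiency and interpretability. To keep routing robust, a dual contrastive alignment strategy strengthens the consistency between multi-modal observations and language-defined skill semantics; experiments on multi-task benchmarks show higher parameter efficiency and better task transfer.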

Comments Accepted to Robotics: Science and Systems (RSS) 2026

Abstract

Diffusion-based policies have established a new standard for precise robotic manipulation but face a critical scalability bottleneck: high-performance models are computationally expensive, while lightweight alternatives often fail to generalize across diverse multi-task environments. Mixture-of-Experts (MoE) architectures offer a promising path to efficiency by activating only a subset of parameters. However, existing MoE routing mechanisms typically rely on low-level noise or latent statistics, ignoring the compositional nature of manipulation tasks. This can fragment reusable behaviors across experts, limiting interpretability and transferability. We introduce Semantically Structured Mixture-of-Experts Diffusion Policy (SMoDP) for compositional robotic manipulation, a framework that grounds expert specialization in semantic task structure. SMoDP leverages a lightweight, inference-time skill predictor, supervised by offline annotations from Vision-Language Models (VLMs), to route action chunks to experts specialized for specific behavioral phases. To ensure robust assignment, we propose a dual contrastive alignment strategy that grounds multi-modal observations in language-defined skill semantics (Inter-modal) while enforcing routing consistency across visually distinct but functionally related behaviors (Intra-modal). Our approach outperforms representative diffusion and MoE-based baselines on multi-task benchmarks with significantly improved parameter efficiency and demonstrates effective compositional transfer to novel tasks through parameter-efficient fine-tuning. Project website: https://deng-cy20.github.io/SMoDP/


2605.23386 2026-05-25 cs.RO (updated version)

Droneulator: A Portable UAV Simulator for Agricultural Workflows with RotorPy and Godot 4

Jacob Swindell, Michael Lowen, Marija Popovic, Riccardo Polvara

Affiliations: Lincoln Centre for Autonomous Systems (L-CAS), University of Lincoln, Lincoln, UK; Faculty of Aerospace Engineering, MAVLab, Delft University of Technology (TU Delft), Delft, Netherlands

AI summary: This paper presents Droneulator, a portable UAV simulator designed for agricultural applications that combines RotorPy for multirotor dynamics with Godot 4 for scene rendering and sensor data generation. It supports PX4 control and a lightweight WebSocket command path, streams ROS 2-compatible data through Zenoh, and supports image collection, local path planning, and reinforcement learning experiments for agricultural UAVs without changes to the simulator infrastructure. Experiments show good performance across several agricultural UAV tasks, including 3D reconstruction, obstacle-avoidance planning, and training depth-sensing-based navigation policies.

Abstract

Agricultural UAV research requires simulators that integrate realistic 3D scenes, high-fidelity vehicle dynamics, and robotics middleware, while remaining practical to deploy across heterogeneous development machines. We present Droneulator, a portable UAV simulator architecture that combines RotorPy for multirotor dynamics with Godot 4 for rendering and sensor generation. Droneulator exposes both PX4-based control and a lightweight WebSocket command path, and publishes synchronised visual and state streams through a Zenoh-based ROS~2-compatible pipeline. This integration enables a single stack to support inspection-oriented data capture, ROS~2/PX4 local planning, and reinforcement learning experiments without modifying the simulator infrastructure. We present quantified validation of the current system across three agricultural UAV workflows: tree-scale image collection for 3D reconstruction with COLMAP, local planning around canopy obstacles using EGO-Planner, and closed-loop reinforcement learning through a custom Gymnasium environment. In the reported setup, the results show that the simulator can sustain low-latency sensing, support reconstruction-oriented data collection under varying capture density, execute collision-free local planning around canopy obstacles, and support stable depth-sensing-based policy training for obstacle-aware navigation. Together, these results show the potential of Droneulator for agricultural UAV inspection, planning, and learning within one deployable stack.


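2605.23350 2026-05-25 cs.RO (updated version)

Multi-Floor Exploration for Ground Robots via an Incremental Reachable Graph and Structural Priors

Zhiwen Zhu, Jiaqi Chen, Xiangyi Huang, Meiqi Hu, Boyu Zhou

AI summary: This paper studies autonomous exploration of multi-floor buildings by ground robots. Because conventional 2D and 2.5D maps cannot represent overlapping traversable surfaces such as stairs and ramps, it proposes a multi-floor exploration framework based on an incremental reachable graph. By building a sparse reachable graph and combining it with structural priors, the method achieves stable, physically feasible frontier detection and guides cross-floor exploration. In simulation it improves exploration efficiency and mapping completeness over the evaluated baselines, and onboard real-world experiments validate its feasibility and real-time performance.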

Abstract

Autonomous exploration of multi-floor buildings remains challenging for ground robots because conventional 2D and 2.5D maps cannot represent overlapping traversable surfaces such as stairs, ramps, and multiple reachable elevations. This letter presents a multi-floor exploration framework based on an incremental reachable graph. Built as a sparse graph over reachable support surfaces, the graph preserves potentially valid connectivity through tentative graph elements under sparse observations and enables stable, physically reachable frontier detection. To guide exploration beyond the currently mapped floor, we project task-zone priors from an explored floor to initialize a hypothetical graph on the target floor and reconcile it incrementally with incoming observations. A hierarchical planner then jointly reasons over confirmed and hypothetical structures for global guidance. In simulation, the proposed method demonstrates improved exploration efficiency and mapping completeness compared to evaluated baselines. Furthermore, onboard real-world experiments validate its practical feasibility and real-time performance.


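2605.23341 2026-05-25 cs.RO cs.AI (updated version)

Sparse Compositional Flow Matching by geometric assembly from motion primitives

Yan Tang, Yuanbo Tang, Tingyu Cao, Shaolun Huang, Yang Li

Affiliations: Tsinghua Shenzhen Graduate School, Tsinghua University, Shenzhen, China; School of AI, Chinese University of Hong Kong (Shenzhen)

AI summary: This paper studies how to generate executable motion trajectories for embodied agents (such as robots and underwater vehicles) and proposes a sparse compositional flow matching method built on motion primitives. The method composes reusable motion primitives directly in the physical trajectory space and introduces geometric constraints within a structured sparse flow-matching framework, modeling both the compositional structure and the spatio-temporal continuity of trajectories. Experiments report state-of-the-art results on multiple datasets, with notably better trajectory prediction accuracy.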

Abstract

Embodied trajectories, such as the executable motion sequences of robotic manipulators, underwater vehicles, and mobile robots, are a fundamental output of embodied AI. Modern generative models often treat them as a dense, monolithic signal generated point by point, fitting an intricate high-dimensional posterior while leaving the data's latent structure unmodeled, the same sample inefficiency long identified by the structured generative model literature. We argue that a compositional latent structure is a natural choice: many embodied tasks share recurring motion fragments that can be made explicit as a finite repertoire of reusable motion primitives, and compositional units naturally align with subtask boundaries to support task decomposition. Existing compositional generators, however, compose in a latent space and rely on post-hoc decoding to relate sampled units to actual trajectory segments. We instead compose directly in the physical trajectory space through a flow-matching framework with two coupled designs. Motion-Primitive Dictionary Learning equips each atom with a learnable length mask and binary starting indicators so the atom itself is the primitive, reused verbatim wherever it is placed. Structural Sparse Flow Matching with Geometric Constraints then generates a binary placement matrix using duration-aware tokenization and a differentiable geometric loss that enforces spatial continuity and temporal contiguity where adjacent primitives meet. On Open X-Embodiment and 3DMoTraj, the framework attains state-of-the-art accuracy and reduces the FDE/ADE ratio from 1.8 to 1.07, improving ADE by 19.2% and FDE by 21.0% over the strongest baseline.
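The reported metrics can be illustrated with a short sketch. ADE (average displacement error) is the mean Euclidean distance between predicted and ground-truth waypoints, and FDE (final displacement error) is that distance at the last waypoint. The function name `ade_fde` and the toy trajectories below are illustrative only and are not data from the paper.

```python
import math


def ade_fde(pred, truth):
    """Return (ADE, FDE) for two equally long waypoint sequences."""
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and equally long")
    dists = [math.dist(p, q) for p, q in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


# Toy example: the prediction drifts away from the truth over time.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, truth)
print(ade, fde, fde / ade)  # 1.0 2.0 2.0
```

An FDE/ADE ratio well above 1 means the error grows toward the end of the trajectory, so a ratio close to 1 suggests the error stays roughly uniform along the path.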


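2605.23270 2026-05-25 cs.CV cs.AI cs.RO (updated version)

ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

Xiyang Wang, Xinlin Wang, Tingguang Zhou, Gong Chen, Xingtai Gui, Zhi Xu, Xiaolei Wu, Feiyang Tan, Hangning Zhou, Mu Yang

Affiliations: Afari Intelligent Drive; Tianjin University; University of Macau

AI summary: Current end-to-end autonomous driving systems face a fundamental tension between temporal causal reasoning and global trajectory consistency. To address it, this paper proposes ChainFlow-VLA, which combines causal generation with global trajectory refinement in a single probabilistic framework. It uses a vision-language model as a semantic prior and refines trajectories while preserving their causal structure; experiments show strong performance in complex scenarios, with a score comparable to human-level performance.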

Abstract

Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies via causal factorization, but their step-wise decoding leads to error accumulation and suboptimal global structure. In contrast, diffusion models optimize trajectories globally but lack explicit causal constraints, making them unreliable in interactive and safety-critical scenarios. This dichotomy reveals a deeper issue: existing methods treat causal modeling and global optimization as separate paradigms, without a principled way to unify them within a single trajectory distribution. To address this, we propose ChainFlow-VLA, which unifies causal generation and global refinement within a unified probabilistic framework. We formulate planning as a mixture over AR-induced modes and learn Vision-Language Model (VLM)-conditioned residual distributions over these modes. An autoregressive generator (Chain) produces a discrete set of causal trajectory modes, followed by a diffusion-based refiner (Flow) that leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure. This straightforward conditioning seamlessly injects high-level scene understanding into fine-grained trajectory adjustments. Experiments demonstrate that ChainFlow-VLA achieves robust planning in ambiguous and long-tail scenarios, achieving a state-of-the-art score of 94.85 on the NAVSIM v1 leaderboard, matching human-level performance (94.8). Code will be available at https://github.com/AFARI-Research/ChainFlow-VLA.


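2605.23263 2026-05-25 cs.RO cs.AI cs.SY eess.SP eess.SY (updated version)

6G Communication Networks Enabling Embodied Agents: Architecture and Prototype

Lipeng Dai, Luping Xiang, Kun Yang

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University; Institute of Intelligent Networks and Communications (NINE), Nanjing University (Suzhou Campus)

AI summary: This article studies how 6G communication networks can meet the communication needs of embodied agents, examines the symbiotic relationship between embodied agents and 6G networks, and proposes a hierarchical communication architecture for human-robot remote interaction. A prototype that integrates a haptic device, an industrial robotic arm, and a 5G O-RAN testbed validates the architecture's feasibility in terms of millisecond-level latency and stable closed-loop control, providing a reference for future work combining 6G and embodied agents.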

Abstract

Embodied agents, which couple intelligent decision-making with physical actuation in the real world, impose far more stringent and heterogeneous communication requirements than purely software-based agents. While 6G promises sub-millisecond latency, ultra-high reliability, native intelligence, and integrated sensing, systematic studies on how to exploit these capabilities for embodied agent communication remain limited. This article investigates 6G-enabled communication systems for embodied agents from both conceptual and engineering perspectives. First, we review the concept, embodiment value of embodied agents, and clarify their distinctions from disembodied agents. Then, we analyse the symbiotic relationship between embodied agents and 6G networks. We highlight how key 6G enablers can support the stringent requirements of human-robot interaction. Furthermore, we demonstrate the proactive role of embodied agents in bolstering communication networks through coverage extension, environmental sensing, and physical world understanding. Building on these insights, we propose a hierarchical communication architecture for human-robot remote interaction, comprising a human-intent perception layer, an open radio access network (O-RAN)-based transport layer, an intelligent intermediary layer, and an embodiment layer. To validate its feasibility, we implement an end-to-end prototype that integrates a haptic device, an industrial robotic arm, an intermediary platform, and a 5G O-RAN testbed. Experimental results demonstrate millisecond-level latency and stable closed-loop operation, confirming the practicality of the proposed architecture and providing a reference for future 6G-embodied agent research and industrial deployments.


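2605.23257 2026-05-25 cs.RO cs.CV (updated version)

Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation

Zixuan Hu, Xuantuo Huang, Yancheng Li, Yichun Hu, Shengyong Xu, Ling-Yu Duan

Affiliations: School of Computer Science, Peking University, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; School of Electronics, Peking University, Beijing, China

AI summary: This paper studies how vision-language navigation (VLN) agents adapt in non-stationary environments and proposes IDEA, a test-time adaptation (TTA) framework that recasts online adaptation as the accumulation and composition of knowledge assets, addressing the catastrophic forgetting and negative transfer seen in existing methods. IDEA optimizes soft prompts with a Fisher-guided weighting scheme, pairs them with domain coordinates to build a dynamic asset library, and uses this historical knowledge to construct cross-domain bridges that enable training-free adaptation. Experiments on several benchmarks show consistent gains over existing methods.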

Comments Accepted by ICML 2026

Abstract

Navigating under non-stationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online adaptation as transient, isolated updates, leading to catastrophic forgetting and negative transfer. To overcome these issues, we propose Inter-Domain BridgE with Historical Assets (IDEA), a novel TTA framework that transforms adaptation into the accumulation and composition of assets. Specifically, IDEA introduces soft prompts optimized via a Fisher-guided weighting scheme to capture the transferable knowledge. These optimized prompts are then augmented with domain coordinates to form a dynamic asset library. Leveraging this library, IDEA constructs a cross-domain bridge by projecting the target domain onto the convex hull of historical knowledge. These designs form a complementary loop: the evolving library underpins bridge construction, while the bridge provides superior initialization to accelerate asset optimization. Extensive experiments across REVERIE, R2R, and R2R-CE benchmarks demonstrate the consistent superiority of IDEA over existing methods, showcasing its ability to enable training-free adaptation via asset sharing.


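2605.23240 2026-05-25 cs.RO cs.SY eess.SY (updated version)

Signal Temporal Logic Motion Planning via Graphs of Convex Sets

Yu Chen, Ancheng Hou, Mingyang Feng, Xiao Yu, Xiang Yin

Affiliations: School of Automation & Intelligent Sensing, Shanghai Jiao Tong University; Institute of Artificial Intelligence, Xiamen University

AI summary: This paper studies continuous-time motion planning under Signal Temporal Logic (STL) specifications, with the goal of generating smooth robot trajectories that satisfy high-level logical and timing requirements while respecting low-level motion constraints. The authors propose an efficient framework that combines timed-automata reasoning with graphs of convex sets (GCS), recasting STL motion planning as a shortest-path problem over a GCS whose solution is a Bézier-spline trajectory meeting the STL specification, smoothness requirements, and velocity bounds. Experiments on low-dimensional benchmarks, a 3-D quadrotor, a 30-DoF humanoid, and hardware tests on a UR-3 arm show that the method solves complex STL motion-planning problems efficiently.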

Abstract

This paper investigates continuous-time motion planning under Signal Temporal Logic (STL) specifications. The goal is to generate smooth robot trajectories that satisfy high-level logical and timing requirements while respecting low-level motion constraints. To this end, we propose an efficient framework that combines timed-automata reasoning with graphs of convex sets (GCS). An STL specification is first represented by a timed automaton, which is then coupled with a convex decomposition of the configuration space to form a joint transition system encoding both task progress and region occupancy. Based on this joint transition system, the STL motion-planning problem is reformulated as a shortest-path problem over a GCS, whose solution induces a smooth Bézier-spline trajectory satisfying the STL specification, smoothness requirements, and velocity bounds. We establish the soundness of the proposed formulation and analyze its computational complexity, showing that, once the timed automaton and convex decomposition are fixed, the convex relaxation scales polynomially with the configuration-space dimension and the Bézier degree. We further develop a compact timed-automaton construction for an expressive STL fragment using dedicated templates and Boolean composition. Numerical experiments on low-dimensional benchmarks, a $3$-D quadrotor, a $30$-DoF humanoid, and a hardware experiment on a UR-3 robot arm demonstrate that the proposed method efficiently solves complex STL motion-planning problems and produces smooth executable trajectories.
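Bézier segments are convenient here because a curve stays inside the convex hull of its control points, and its derivative is again a Bézier curve whose control points are scaled differences of neighboring control points. The sketch below illustrates both facts with hypothetical helper names; the per-axis velocity test is a sufficient (conservative) condition and is not claimed to be the paper's exact constraint.

```python
def de_casteljau(control_points, t):
    """Evaluate a Bézier curve at t in [0, 1] by repeated linear interpolation."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        pts = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def derivative_control_points(control_points, duration=1.0):
    """Control points of the velocity curve for a segment lasting `duration`."""
    n = len(control_points) - 1  # degree of the curve
    return [
        tuple(n * (b - a) / duration for a, b in zip(p, q))
        for p, q in zip(control_points, control_points[1:])
    ]


def velocity_within_bound(control_points, v_max, duration=1.0):
    """Sufficient check: every velocity control point lies in the box
    [-v_max, v_max] per axis, so the whole velocity curve does too."""
    return all(
        abs(c) <= v_max
        for v in derivative_control_points(control_points, duration)
        for c in v
    )
```

Because the check is linear in the control points, constraints of this form remain convex, which is what allows them to sit inside a convex program over each region.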


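2605.23203 2026-05-25 cs.CV cs.AI cs.LG cs.RO (updated version)

Lipschitz Optimization for Formal Verification of Homographies

Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio

Affiliations: Joby Aviation; Safe Intelligence

AI summary: This paper studies formal robustness verification for vision neural networks used in safety-critical domains, focusing on how 3D perturbations from camera motion affect image formation. The authors propose a verification approach based on Lipschitz optimization and piecewise-continuity analysis, establish a closed-form mapping from camera pose to pixel values, and derive tight linear bounds on perturbed pixel values. The approach applies to scenes with predominantly planar structure, such as those in augmented reality, autonomous driving, and robotic manipulation, and is validated on several benchmarks with improvements in both speed and bound tightness over prior methods.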

Comments 18 pages, 13 figures, 6 tables, to be published at CVPR 2026

Abstract

The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to $\ell_p$-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploy many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at https://github.com/jeangud/homography-verification .
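For readers unfamiliar with why planar scenes admit a closed-form pose-to-pixel mapping, the sketch below builds the textbook plane-induced homography. It assumes a pinhole camera, a plane written as n·X = d in the first camera frame, and a motion X2 = R·X1 + t, which gives H = K (R + t nᵀ / d) K⁻¹. The sign convention and the helper names are assumptions and may differ from the paper's formulation.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def intrinsics(fx, fy, cx, cy):
    """Return the pinhole intrinsic matrix K and its inverse."""
    k = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    k_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]
    return k, k_inv


def plane_homography(k, k_inv, rot, trans, normal, depth):
    """H = K (R + t n^T / d) K^-1 for the plane n.X = d in the first camera frame."""
    m = [
        [rot[i][j] + trans[i] * normal[j] / depth for j in range(3)]
        for i in range(3)
    ]
    return matmul3(matmul3(k, m), k_inv)


def warp_pixel(h, u, v):
    """Apply a homography to pixel (u, v) and dehomogenize."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w
```

As a sanity check under these assumptions, a sideways camera translation of 0.5 in front of a fronto-parallel plane at depth 2, with a focal length of 100 pixels, shifts every pixel by 100 × 0.5 / 2 = 25 pixels.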


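2605.23187 2026-05-25 cs.CV cs.RO (updated version)

IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Bangya Liu, Yanran Li, Hujun Yin

Affiliations: The University of Manchester; A*STAR; Responsible AI Research Centre, Adelaide University; University of Bedfordshire

AI summary: IntentionNav is a new benchmark for intent-driven object navigation that evaluates an agent's ability to infer a target object from implicit human instructions and navigate to it. Instead of naming the target object, the benchmark expresses the need implicitly in natural language, so the agent must understand the intent, identify the target, and complete the navigation. It introduces four intent modes and several instruction styles to support fine-grained analysis of target inference, language robustness, and navigation success, and shows that current vision-language models still struggle to understand implicit intent and complete precise navigation.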

Comments preprint

Abstract

Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.


2605.23165 2026-05-25 cs.RO cs.AI cs.CL 版本更新

Autonomous Frontier-Based Exploration with VLM Guidance

VLM引导的自主前沿探索

Aarush Aitha, Avideh Zakhor

发表机构 * EECS Department, University of California（加州大学EECS系）

AI总结本文提出了一种基于视觉语言模型（VLM）引导的自主前沿探索方法，用于提升机器人在未知和危险环境中的探索能力。该方法通过VLM进行高层战略决策，指导传统的底层机器人控制系统，利用当前地图和潜在路径的视觉信息生成多模态提示，从而选择最具前景的探索方向。实验表明，该方法在六个室内环境的仿真中提升了地图覆盖率，且具有轻量、无需训练和易于迁移的特点。

Comments 8 pages, 10 figures, CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments

2605.23160 2026-05-25 cs.RO cs.CV 版本更新

Semantic-Aware Guided Drone Exploration for Language-Conditioned 3D Indoor Mapping

语义感知引导的无人机探索：面向语言条件的三维室内建图

Nitin Vegesna, Avideh Zakhor

发表机构 * Department of Electrical Engineering and Computer Sciences（电气工程与计算机科学系）

AI总结本文提出了一种语义感知引导的无人机探索系统SAGE，用于在未知的室内3D环境中进行开放词汇的探索，能够在保持全面覆盖行为的同时，利用语义线索重新优先选择探索前沿。SAGE基于FALCON体积探索器，通过集成CLIP模型的四个关键组件，实现了语义与几何信息的联合规划，有效提升了目标发现效率。实验表明，SAGE在模拟和真实环境中均优于现有方法，尤其在目标发现速度和体积吞吐量方面表现突出。

Comments 10 pages, 6 figures, 4 tables. To be presented at the 2nd 3D-LLM/VLA Workshop at CVPR 2026 (non-archival workshop)

详情

AI中文摘要

我们提出语义感知引导探索（SAGE），一个用于未知三维室内环境的开放词汇探索系统，该系统在保持覆盖导向行为的同时，允许语义提示重新优先化前沿选择。基于FALCON体积探索器，SAGE通过四个关键组件集成对比语言-图像预训练（CLIP）：以物体为中心的嵌入存储、将最近观测投影到自由-未知边界的时间缓存、用于高相似度检测的物体前沿，以及统一的语义-几何规划成本。该成本函数限制了语义重新加权的影响，确保前沿被优先化而不牺牲总覆盖率。在基于Matterport3D的仿真中，SAGE在地图-查询对上的物体发现方面优于FALCON和纯语义消融。与Finding Things in the Unknown（FTU）相比，SAGE在九个共享地图-查询对上的探索速度提高了9.0到25.9倍，平均加速13.7倍。此外，SAGE的体积吞吐量显著高于FTU。最后，我们在Modal AI Starling 2四旋翼飞行器上，在两种环境中的五次真实飞行中部署了SAGE，配备机载感知和规划以及离板CLIP推理。比较SAGE和FALCON，我们发现虽然FALCON导致更快的探索和更短的建图轨迹，但SAGE在物体发现方面优于FALCON。

英文摘要

We present Semantic-Aware Guided Exploration, SAGE, a system for open-vocabulary exploration in unknown 3D indoor environments that preserves coverage-oriented behavior while allowing semantic cues to reprioritize frontier selection. Building on the FALCON volumetric explorer, SAGE integrates Contrastive Language-Image Pre-training (CLIP) via four key components: object-centric embedding storage, a temporal cache that projects recent observations onto the free-unknown boundary, object frontiers for high-similarity detections, and a unified semantic-geometric planning cost. This cost function bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage. In Matterport3D-based simulations, SAGE outperforms FALCON and a semantic-only ablation in object discovery across map-query pairs. Compared to Finding Things in the Unknown (FTU), SAGE completes exploration 9.0 to 25.9 times faster across the nine shared map-query pairs, achieving a mean speedup of 13.7. Furthermore, SAGE achieves substantially higher volumetric throughput than FTU. Finally, we deploy SAGE in five real-world flights in two environments on a Modal AI Starling 2 quadrotor with onboard sensing and planning, and offboard CLIP inference. Comparing SAGE and FALCON, we find that while FALCON results in faster exploration and shorter mapping trajectories, SAGE outperforms FALCON in terms of object discovery.

2605.23128 2026-05-25 cs.RO 版本更新

$π_0$-EqM: Equilibrium Matching for Closed-Loop Vision-Language-Action Control

$\pi_0$-EqM：闭环视觉-语言-动作控制的均衡匹配

Huanming Liu, Congsheng Xu, Jianmin Ji, Yao Mu

发表机构 * University of Science and Technology of China（中国科学技术大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结本文提出了一种名为 $π_0$-EqM 的闭环视觉-语言-动作控制方法，通过将传统的流匹配解码器替换为均衡匹配（EqM）解码器，提升了机器人操作任务的性能。在固定计算预算下，该方法在多个任务中显著提高了成功率，并揭示了任务依赖的“稳定性-可执行性”差距现象，为迭代式VLA控制的策略设计提供了新视角。

Comments Preprint. 5 pages, 3 figures

详情

AI中文摘要

目前，视觉-语言-动作（VLA）模型因其在任务泛化方面的巨大潜力而成为机器人操作最常用的范式。然而，大多数用于VLA控制的生成式流匹配动作解码器通常以固定的采样视界部署，限制了状态相关的计算和控制周期之间的时间复用。我们提出$\pi_0$-EqM，用均衡匹配（EqM）解码器替换$\pi_0$中的流匹配专家，同时保持上游VLA堆栈不变。在匹配的300步预算下，$\pi_0$-EqM在19个任务上将RoboTwin的平均成功率从40.4%提升到50.2%，并在LIBERO上保持竞争力，在LIBERO-10上获得最显著的提升（87.0%）。两次阈值扫描揭示了残差与成功率之间存在任务依赖的非单调关系，我们称之为平稳性-可执行性差距。结果表明，迭代VLA控制中的推理深度是策略设计的一部分，并引入了一种基于能量的VLA视角，这可能为未来跨任务和跨本体的可组合动作生成工作提供参考。

英文摘要

Currently, Vision-Language-Action (VLA) models have become the most widely adopted paradigm for robotic manipulation because of their great potential for task generalization. However, most generative flow-matching action decoders for VLA control are deployed with fixed sampling horizons, limiting state-dependent compute and temporal reuse across control cycles. We present $π_0$-EqM, which replaces the flow-matching expert in $π_0$ with an Equilibrium Matching (EqM) decoder while leaving the upstream VLA stack unchanged. Under a matched 300-step budget, $π_0$-EqM improves RoboTwin average success from 40.4% to 50.2% across 19 tasks and remains competitive on LIBERO, with its clearest gain on LIBERO-10 (87.0%). Two threshold scans reveal a task-dependent non-monotonic relation between residual and success, which we term the stationarity-executability gap. The results suggest that inference depth in iterative VLA control is part of policy design and introduce an energy-based VLA perspective that may inform future work on composable action generation across tasks and embodiments.
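
The idea of state-dependent compute with temporal reuse can be illustrated with a small sketch. It assumes a toy gradient field (the gradient of a quadratic energy) in place of a learned EqM decoder; `descend_to_equilibrium`, `toy_field`, the step size, and the threshold are hypothetical names and values chosen for illustration, not the paper's implementation. The loop stops once the residual norm falls below a threshold or the step budget is spent, and a warm start from the previous solution shows how later control cycles could need fewer steps.

```python
import math

def descend_to_equilibrium(grad_field, action, step_size=0.1, threshold=1e-3, max_steps=300):
    """Follow the negative gradient field until the residual norm is small or the budget is spent.

    Returns (action, steps_used, residual). All parameters are illustrative assumptions.
    """
    action = list(action)
    for steps_used in range(max_steps):
        grad = grad_field(action)
        residual = math.sqrt(sum(g * g for g in grad))
        if residual < threshold:
            return action, steps_used, residual
        action = [a - step_size * g for a, g in zip(action, grad)]
    grad = grad_field(action)
    residual = math.sqrt(sum(g * g for g in grad))
    return action, max_steps, residual

def toy_field(target):
    """Gradient of the toy energy 0.5 * ||action - target||^2 (stand-in for a learned field)."""
    return lambda action: [a - t for a, t in zip(action, target)]

target = [0.5, -0.2]
# Cold start: begin from a zero action chunk.
cold_action, cold_steps, cold_residual = descend_to_equilibrium(toy_field(target), [0.0, 0.0])
# Warm start: reuse the previous solution, as a later control cycle might.
warm_action, warm_steps, warm_residual = descend_to_equilibrium(toy_field(target), cold_action)
```

In this toy setting the warm start terminates immediately because the reused action already satisfies the threshold; with a real policy the observation changes between cycles, so the saving would be partial at best.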

2605.23100 2026-05-25 cs.RO 版本更新

Four Simple Proprioceptive Estimators for Legged Robots

腿式机器人的四种简单本体感受估计器

Frank Dellaert, Chiyun Noh, Varun Agrawal, Ayoung Kim

AI总结本文研究了如何利用足端间歇接触信息来改进腿式机器人在惯性测量单元（IMU）噪声影响下的姿态估计。作者提出了一系列逐步增强的估计方法，从基于接触辅助不变扩展卡尔曼滤波（EKF）的方法出发，逐步引入因子图和固定滞后平滑技术，以提升估计精度和鲁棒性。所有四种方法均在GTSAM中实现，并提供了ROS2兼容的代码，便于复现和进一步研究。

详情

AI中文摘要

腿式机器人携带IMU，但由于消费级IMU噪声大，惯性解会漂移。然而，足部与环境产生间歇性接触，可用于减轻这种漂移。本报告开发了一系列表达力逐渐增强的腿式机器人状态估计器，利用了这一特性。在所有情况下，浮动基座状态包括姿态、位置、速度和IMU偏置。为了建模足部接触，我们从Hartley等人的接触辅助不变EKF开始，但降低了接触更新率。然后通过用小因子图替换测量更新来增强。最后，我们将相同的因子转化为带有接触时段足端接触点的固定滞后平滑器，包括和不包括变化的IMU偏置。为了促进可重复性和本体感受腿式里程计的进一步研究，所有四种变体都在GTSAM（Dellaert等人）中可用，并且我们还提供了一个与ROS2兼容的实现。

英文摘要

Legged robots carry an IMU, but the inertial solution drifts because consumer-grade IMUs are noisy. However, the feet create intermittent contacts with the environment that can be used to mitigate that drift. This report develops a sequence of increasingly expressive legged robot state estimators that leverage this. In all cases, the floating-base state comprises attitude, position, velocity, and IMU biases. To model foot contacts, we start from the contact-aided invariant EKF of Hartley et al., albeit at a reduced contact update rate. This is then augmented by replacing the measurement update with a small factor graph. Finally, we turn the same factors into a fixed-lag smoother with contact-episode footholds, with and without an evolving IMU bias. To facilitate reproducibility and further research in proprioceptive legged odometry, all four variants are available in GTSAM (Dellaert et al.), and we additionally provide a ROS2-compatible implementation.

2605.23098 2026-05-25 cs.RO 版本更新

UfM: Uncertainty from Motion for DNN Depth Estimation Using Gaussians

UfM*：基于高斯分布的运动不确定性用于DNN深度估计

Soumya Sudhakar, Sertac Karaman, Vivienne Sze

发表机构 * Massachusetts Institute of Technology（麻省理工学院）

AI总结本文提出了一种名为UfM*的深度神经网络单目深度估计不确定性估计方法，通过使用高斯混合模型高效地衡量多视角预测之间的不一致性，仅需单次网络推理即可生成不确定性。相比传统方法，UfM*在计算和内存效率上显著提升，并在多个数据集上验证了其在提升校准误差和降低能耗方面的优越性，特别适用于资源受限的机器人系统。

Comments 18 pages, 15 figures

详情

AI中文摘要

可靠的不确定性估计对于在安全关键的机器人系统中部署单目深度深度神经网络（DNN）至关重要。传统的不确定性方法（如集成和基于采样的方法）需要每张图像多次推理，导致大量计算和内存开销。此外，从单张图像预测的不确定性无法衡量同一区域不同视图间预测的不一致性。我们提出UfM*（基于运动的不确定性），一种不确定性估计算法，通过使用紧凑高斯混合模型比较前后视图，高效衡量多视图不一致性，每张图像仅需一次DNN推理。使用高斯分布计算多视图不一致性不仅比先前使用点云的方法更节省计算和内存，而且通过衡量3D空间区域间的不一致性提高了不确定性质量。UfM*结合偶然不确定性，在100个分布外ScanNet序列上，与集成相比，期望校准误差改善24-28%，而能耗仅为集成的3%，内存仅为0.02%。我们证明，在微型能量受限机器人上，UfM*在Arm Cortex-A76 CPU上以30 FPS实时运行，每张224x224图像仅消耗63 mJ，突显了使用高斯分布衡量多视图不一致性能够为资源受限的机器人系统实现高效的不确定性估计。

英文摘要

Reliable uncertainty estimation is critical for deploying monocular depth deep neural networks (DNNs) in safety-critical robotic systems. Conventional uncertainty methods such as ensembles and sampling-based approaches require multiple inferences per image, incurring substantial compute and memory overhead. Moreover, uncertainty predicted from a single image misses out on measuring disagreement between predictions across views of the same region. We propose Uncertainty from Motion* (UfM*), an uncertainty estimation algorithm that measures multiview disagreement efficiently by comparing previous and current views using a compact Gaussian mixture, requiring only a single DNN inference per image. Using Gaussians to compute multiview disagreement is not only more compute- and memory-efficient than a prior approach using a point cloud, but also improves uncertainty by measuring disagreement across regions of 3D space. UfM* paired with aleatoric uncertainty improves expected calibration error by 24-28% compared to an ensemble, while requiring only 3% of the energy and 0.02% of the memory on 100 out-of-distribution ScanNet sequences. We demonstrate UfM* consumes only 63 mJ per 224x224 image while running real-time at 30 FPS on an Arm Cortex-A76 CPU onboard a miniature energy-constrained robot, highlighting that measuring multiview disagreement using Gaussians enables efficient uncertainty for resource-constrained robotic systems.

2605.23027 2026-05-25 cs.RO 版本更新

PIMbot: A Self-Adaptive Attack Framework for Adversarial Manipulation of Multi-Robot Reinforcement Learning

PIMbot：一种用于多机器人强化学习对抗性操纵的自适应攻击框架

Zexin Li, Ziliang Zhang, Hyoseung Kim, Cong Liu

发表机构 * University of California, Riverside（加州大学河滨分校）

AI总结本文提出了一种名为PIMbot的自适应攻击框架，用于对抗性地操控多机器人强化学习中的协作行为。该框架通过奖励通道的激励操控和智能体自身策略的操控两种互补手段，实现对多机器人合作环境的干预，并利用自适应多目标控制器在线平衡这两种手段。研究引入了一种针对多智能体强化学习社会困境中独特奖励函数的新操控方法，实验表明PIMbot在仿真和真实嵌入式系统中均能有效暴露多机器人协作任务中的关键漏洞。

Comments Extension version of IROS'23

详情

AI中文摘要

最近的研究证明了强化学习在多机器人有效协作中的潜力，特别是在机器人面临自身利益与集体利益权衡的社会困境中。然而，沟通不畅和对抗性机器人等环境因素可能影响合作，因此探索如何操纵多机器人通信以实现不同结果至关重要。本文提出了PIMbot，一个通过两种互补杠杆操纵结果的框架：(i) 奖励通道的激励操纵和(ii) 智能体自身动作的策略操纵。一个自适应多目标控制器在线平衡这些杠杆。我们的工作引入了一种新颖的方法来操纵最近的多智能体强化学习社会困境，这些困境利用独特的奖励函数进行激励。通过利用我们提出的PIMbot机制，机器人能够有效地操纵社会困境环境。全面的实验结果证明了我们提出的方法在Gazebo模拟的多机器人环境中的有效性。此外，在NVIDIA Jetson Orin Nano上的真实嵌入式设备案例研究量化了系统成本，并验证了PIMbot在超越仿真的现实自主嵌入式系统场景中的有效性。这些结果共同将PIMbot定位为一个严格的压力测试工具，暴露了多机器人协作任务中的关键漏洞。

英文摘要

Recent research has demonstrated the potential of reinforcement learning in effective multi-robot collaboration, particularly in social dilemmas where robots face a trade-off between self-interest and collective benefits. However, environmental factors such as miscommunication and adversarial robots can impact cooperation, making it crucial to explore how multi-robot communication can be manipulated to achieve different outcomes. This paper presents PIMbot, a framework that manipulates outcomes via two complementary levers: (i) incentive manipulation of the reward channel and (ii) policy manipulation of an agent's own actions. An adaptive multi-objective controller balances these levers in an online manner. Our work introduces a novel approach to manipulation in recent multi-agent RL social dilemmas that utilize a unique reward function for incentivization. By utilizing our proposed PIMbot mechanisms, a robot is able to manipulate the social dilemma environment effectively. Comprehensive experimental results demonstrate the effectiveness of our proposed methods in the Gazebo-simulated multi-robot environment. Moreover, a real embedded device case study on NVIDIA Jetson Orin Nano quantifies system cost and validates PIMbot's effectiveness on realistic autonomous embedded systems scenarios beyond simulation. Together, these results position PIMbot as a rigorous stress-test tool exposing critical vulnerabilities in multi-robot cooperative tasks.

2605.22991 2026-05-25 cs.RO 版本更新

Verified Task-Space Motion Planning Under Joint-Space Constraints

关节空间约束下的验证任务空间运动规划

Hanjiang Hu, Changliu Liu, Yebin Wang

发表机构 * Robotics Institute, Carnegie Mellon University（卡内基梅隆大学机器人研究所）； Mitsubishi Electric Research Laboratories (MERL)（三菱电机研究实验室）

AI总结本文研究了在关节空间约束下验证任务空间运动规划的问题，针对传统任务空间规划器如Bug2在面对关节角限制时可能出现的轨迹漂移和目标无法到达的问题，提出了一种基于二阶多项式逆运动学近似和S-过程的方法，计算出在关节位移限制下可验证的笛卡尔空间最大超矩形，从而实现自适应步长的规划。实验表明，该方法在多种对抗性场景中实现了零关节限制违反，并保持了100%的目标到达率。

详情

AI中文摘要

反应式任务空间规划器（如Bug2）使用固定的笛卡尔步长，且不考虑机械臂的关节角度限制。当雅可比矩阵病态时，即使很小的笛卡尔步长也可能导致关节变化超出允许范围；将关节限制在其极限会导致跟踪漂移，甚至完全无法到达目标。我们通过在每个规划步骤中计算在关节位移约束下可证明可达的最大笛卡尔超矩形来解决这一问题。利用逆运动学的二阶多项式近似和S过程，我们构建一个小型半定规划，其解给出可证明的半宽 $λ^\star$。利用二次结构的等效二分法在亚毫秒时间内完成验证。将此验证与Bug2集成，得到步长适应局部运动学条件的规划器。在跨越六种关节极限设置的94个对抗场景的统计评估中，SOS验证的规划器实现了零关节极限违反，目标到达率为100%，而标准Bug2规划器在6-11%的步骤中违反关节极限，并在高达18%的场景中无法到达目标。

英文摘要

Reactive task-space planners such as Bug2 operate with fixed Cartesian step sizes and are unaware of the manipulator's joint-angle limits. When the Jacobian is poorly conditioned, even small Cartesian steps can demand joint changes that exceed admissible bounds; clipping the joints to their limits causes tracking drift and can prevent goal reaching entirely. We address this by computing, at each planning step, the largest Cartesian hyperrectangle that is certifiably reachable under joint displacement bounds. Using a second-order polynomial approximation of the inverse kinematics and the S-procedure, we formulate a small semidefinite program whose solution yields the certified half-width $λ^\star$. An equivalent bisection procedure exploiting the quadratic structure solves the certification in sub-millisecond time. Integrating this certificate with Bug2 yields a planner whose step size adapts to local kinematic conditioning. In a statistical evaluation over 94 adversarial scenarios spanning six joint-limit settings, the SOS-verified planner achieves zero joint-limit violations with a 100% goal-reaching rate, whereas a standard Bug2 planner violates joint limits in 6-11% of steps and fails to reach the goal in up to 18% of scenarios.
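
The bisection idea can be sketched under a much simpler model than the one in the abstract. The sketch below assumes a first-order (linear) inverse-kinematics approximation, dq = J⁻¹ dx, instead of the paper's second-order polynomial model and S-procedure certificate; `max_joint_change`, `certified_half_width`, and the example matrices are hypothetical. For a linear map the worst-case joint change over the Cartesian box is the half-width times the absolute row sum, so feasibility is monotone in the half-width and bisection applies.

```python
def max_joint_change(inv_jacobian, half_width):
    """Worst-case |dq_i| over the box [-half_width, half_width]^m for dq = inv_jacobian @ dx."""
    return [half_width * sum(abs(v) for v in row) for row in inv_jacobian]

def certified_half_width(inv_jacobian, joint_bounds, upper=1.0, tol=1e-6):
    """Largest half-width (up to tol) whose worst-case joint change stays within joint_bounds.

    First-order illustration only; joint_bounds are assumed to be positive.
    """
    def feasible(half_width):
        changes = max_joint_change(inv_jacobian, half_width)
        return all(c <= b for c, b in zip(changes, joint_bounds))

    if feasible(upper):
        return upper
    lo, hi = 0.0, upper  # lo is always feasible, hi is always infeasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical inverse Jacobians: one well conditioned, one poorly conditioned.
well_conditioned = [[1.0, 0.0], [0.0, 1.0]]
poorly_conditioned = [[2.0, 0.0], [10.0, 5.0]]
bounds = [0.1, 0.1]  # allowed joint displacement per step (rad), illustrative

step_good = certified_half_width(well_conditioned, bounds)
step_bad = certified_half_width(poorly_conditioned, bounds)
```

The poorly conditioned case yields a much smaller admissible step, which is the qualitative behavior the abstract describes: the step size adapts to local kinematic conditioning rather than staying fixed.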

2605.22988 2026-05-25 q-bio.NC cs.LG cs.RO cs.SY eess.SY 版本更新

Active Sensing Subserves Task-Level Control

主动感知服务于任务级控制

Andrew Lamperski, Debojyoti Biswas, Eric S. Fortune, John Guckenheimer, Kathleen Hoffman, Noah J. Cowan

发表机构 * Department of Electrical and Computer Engineering, University of Minnesota（明尼苏达大学电气与计算机工程系）； Laboratory for Computational Sensing and Robotics, Johns Hopkins University（约翰霍普金斯大学计算感知与机器人实验室）； Federated Department of Biological Sciences, New Jersey Institute of Technology（新泽西理工学院联合生物科学系）； Department of Mathematics, Cornell University（康奈尔大学数学系）； Department of Mathematics and Statistics, University of Maryland, Baltimore County（马里兰大学巴尔的摩县分校数学与统计学系）； Department of Mechanical Engineering, Johns Hopkins University（约翰霍普金斯大学机械工程系）

AI总结本文探讨了主动感知在任务级控制中的作用，提出主动感知并非由感官目标驱动，而是任务控制的必要组成部分。研究结合生物实证数据和数学理论，表明主动感知行为通常以离散阶段出现，动物在“探索”与“利用”两种行为模式间切换，以适应性传感器和模式切换实现反馈控制。这一策略在生物系统中普遍存在，但在工程系统中却较少应用，提示当前机器人控制体系仍有待改进。

详情

AI中文摘要

主动感知传统上被定义为为了获取信息而消耗能量，通常以运动的形式。在这里，我们提出，对自适应传感器的依赖、运动与感知之间的联系以及任务级控制的结合，必然导致主动感知运动的出现。这样，主动感知并非由感官目标驱动，例如最小化状态不确定性，而是任务级控制所必需的。这一假设，即主动感知服务于控制，得到了来自生物体的经验数据和数学理论的支持。有趣的是，主动感知行为通常发生在离散的时段中，与目标导向行为交替出现。这表明动物在两种具有不同控制策略的行为模式之间切换：一种“探索”模式，动物产生动态运动以塑造感觉反馈；以及一种“利用”模式，动物产生与实现任务目标直接相关的较慢补偿运动。这种依赖于自适应传感器、主动感知和模式切换的反馈控制策略在工程系统中并不常用，尽管在生物学中普遍存在。由最先进的传感器、执行器和机械设计组成的工程系统在“成本函数”方面（如最大力生成、精度和速度）可以胜过动物。然而，动物通常能够实现目前工程系统无法比拟的稳健、优雅的行为，这表明当前的控制系统存在不足。这些以控制理论语言表达的见解可能对改进机器人感知和控制至关重要。

英文摘要

Active sensing is traditionally defined as the expenditure of energy, typically in the form of movement, for obtaining information. Here, we propose that the combination of reliance on adaptive sensors, the linkage between movement and sensing, and task-level control inevitably gives rise to the emergence of active sensing movements. In this way, active sensing is not driven by sensory goals, such as minimizing uncertainty about the state, but rather is necessary for task-level control. This hypothesis, that active sensing subserves control, is supported by both empirical data from organisms and mathematical theory. Interestingly, active sensing behaviors often occur in discrete epochs, interspersed with goal-oriented behavior. This suggests that animals switch between two behavioral modes with distinct control policies, an "explore" mode in which animals produce dynamic movements to shape sensory feedback, and an "exploit" mode in which animals produce slower compensatory movements that are directly related to achieving task goals. This strategy for feedback control that relies on adaptive sensors, active sensing, and mode switching is not commonly used in engineered systems despite being ubiquitous in biology. Engineered systems comprising state-of-the-art sensors, actuators, and mechanical designs can outperform animals with respect to "cost functions" such as maximum force generation, precision, and speed. Nevertheless, animals routinely achieve robust, graceful behaviors that are currently unmatched by engineered systems, suggesting that current control systems are insufficient. These insights, expressed in the language of control theory, may be critical for improving robotic sensing and control.

2605.22986 2026-05-25 cs.RO cs.AI cs.HC cs.LG 版本更新

Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations

知道该问什么的机器人：通过有针对性的解释恢复未对齐的奖励

Helena Merker, Nick Walker, Andreea Bobu

AI总结该研究针对从人类示范中学习奖励函数时存在的特征不充分问题，提出了一种通过有针对性的解释来识别并修正奖励函数偏差的框架。核心方法基于分析示范数据中各特征的一致性，识别出未充分说明的特征，并通过自然语言解释这些不确定性，主动请求针对性的补充示范。实验表明，该方法在模拟和真实机器人任务中显著提升了奖励函数的学习效果，优于随机查询和被动数据收集的方式。

详情

AI中文摘要

从演示中学习奖励函数假设演示对所有特征（或行为中与任务相关的方面）提供了充分的监督。实际上，演示往往不完美：由于认知负荷或物理难度，人类可能低估某些特征，或者训练机制可能未能充分覆盖所有相关情况。无论哪种情况，重要特征可能未被充分指定，导致学习到的奖励函数存在歧义，并在部署时出现未对齐的行为。我们提出一个框架，检测此类未充分指定的特征，并主动请求有针对性的纠正演示。我们的关键洞察是，演示隐含地揭示了哪些特征被良好指定：一致优化的特征在演示之间变化很小，而未充分指定的特征则变化很大。我们利用这一统计信号推断哪些特征可能未被充分演示。然后，机器人用自然语言解释它不确定哪些特征，并请求明确解决已识别差距的演示。我们在模拟桌面操作领域和真实Franka机器人的用户研究中评估了我们的方法。与随机查询和被动数据收集相比，有针对性的、解释引导的查询显著改善了奖励恢复，减少了否则会从有缺陷的演示中持续存在的歧义。

英文摘要

Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features -- or task-relevant aspects of behavior. In practice, demonstrations are often imperfect: humans may under-emphasize certain features due to cognitive load or physical difficulty, or the training regime may fail to sufficiently cover all relevant situations. In either case, important features may be underspecified, leading to ambiguity in the learned reward function and misaligned behavior at deployment. We propose a framework that detects such underspecified features and actively solicits targeted corrective demonstrations. Our key insight is that demonstrations implicitly reveal which features are well specified: features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely. We leverage this statistical signal to infer which features may have been insufficiently demonstrated. The robot then explains which features it is uncertain about in natural language and queries for demonstrations that explicitly address the identified gaps. We evaluate our approach in a simulated tabletop manipulation domain and in a user study with a real Franka robot. Targeted, explanation-guided queries significantly improve reward recovery compared to random querying and passive data collection, reducing ambiguity that would otherwise persist in learning from imperfect demonstrations.
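
The statistical signal described above (low variation for well-specified features, high variation for underspecified ones) can be sketched as a per-feature variance check. This is a minimal illustration, not the authors' method: `underspecified_features`, the feature names, and the threshold are hypothetical, and features are assumed to be normalized to comparable scales so that a single threshold is meaningful.

```python
from statistics import pvariance

def underspecified_features(demo_features, threshold):
    """Flag features whose value varies widely across demonstrations.

    demo_features: one dict per demonstration, mapping feature name -> summary value.
    Returns (sorted list of flagged names, dict of per-feature variances).
    """
    names = demo_features[0].keys()
    variances = {name: pvariance([demo[name] for demo in demo_features]) for name in names}
    flagged = sorted(name for name, var in variances.items() if var > threshold)
    return flagged, variances

# Hypothetical tabletop demonstrations: distance is demonstrated consistently, tilt is not.
demos = [
    {"distance_to_table": 0.10, "cup_tilt": 0.05},
    {"distance_to_table": 0.11, "cup_tilt": 0.60},
    {"distance_to_table": 0.09, "cup_tilt": 0.30},
]
flagged, variances = underspecified_features(demos, threshold=0.01)
```

A robot could then turn each flagged name into a natural-language query, for example asking for additional demonstrations that specifically address cup tilt.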

2605.22272 2026-05-25 cs.RO cs.CV 版本更新

Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

Imagine2Real: 通过视频生成先验实现零样本人形机器人-物体交互

Jiahe Chen, ZiRui Wang, Feiyu Jia, Xiao Chen, Xiaojie Niu, Weishuai Zeng, Tianfan Xue, Xiaowei Zhou, Jiangmiao Pang, Jingbo Wang

发表机构 * Zhejiang University（浙江大学）； Shanghai AI Laboratory（上海人工智能实验室）； The Chinese University of Hong Kong（香港中文大学）

AI总结全身人形机器人-物体交互（HOI）因高保真3D数据稀缺而面临瓶颈。现有基于视频生成先验的方法由于依赖几何先验（如显式CAD模型）而存在表示不对齐问题，并因密集变形和形态不匹配而面临重定向复杂性问题。本文提出Imagine2Real，一种无需几何信息的零样本HOI框架，通过将机器人和物体运动统一为4D点轨迹解决表示不对齐问题，并通过仅跟踪稀疏关键点绕开误差放大的重定向过程，结合行为基础模型的潜在空间保持自然步态，最终在动作捕捉系统内实现零样本物理部署。

详情

AI中文摘要

全身人形机器人-物体交互（HOI）受限于高保真3D数据的稀缺性。虽然视频生成先验提供了一种有前景的替代方案，但现有方法由于依赖几何先验（如显式CAD模型）而遭受表示不对齐问题，并且由于密集变形和形态不匹配而产生重定向复杂性。我们提出了Imagine2Real，一个零样本HOI框架，用于灵活、无几何的交互。为了解决不对齐问题，我们将机器人和物体的运动统一为4D点轨迹。为了克服重定向复杂性，我们的关键点跟踪器仅跟踪稀疏的关键点（基座、手和物体），完全绕过了误差放大的重定向过程。为了在这些稀疏信号下保持自然步态，我们利用行为基础模型（BFM）的潜在空间作为跟踪器的搜索域。通过渐进式训练策略，Imagine2Real学习到具有简单跟踪奖励的鲁棒行为，从而在动作捕捉（mocap）系统内实现零样本物理部署。

英文摘要

Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from Representation Misalignment due to their reliance on geometric priors (e.g., explicit CAD models), and Retargeting Complexity arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain. Using a progressive training strategy, Imagine2Real learns robust behaviors with simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.

2605.16087 2026-05-25 cs.RO cs.AI 版本更新

Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment

面向感知模型的可信与可解释人工智能：从概念到原型车辆部署

Till Beemelmanns, Shayan Sharifi, Manas Mehrotra, Ayushman Choudhuri, Lutz Eckstein

发表机构 * Institute for Automotive Engineering, RWTH Aachen University（汽车工程研究所，亚琛工业大学）

AI总结本文研究了如何在自动驾驶感知模型中实现可信且可解释的人工智能，针对深度神经网络在自动驾驶中应用时存在的不透明性和安全性问题，提出了一种集成忠实可解释性和校准不确定性估计的感知模块。该方法基于Transformer检测器，在推理时从注意力机制中导出解释，并通过基于扰动的一致性测试验证其忠实性，同时引入不确定性估计与校准模块以提升系统鲁棒性。研究还展示了该模块在原型车上的部署及可视化接口，验证了实时可信感知监控的可行性。

Comments Accepted for publication at IEEE ITSC 2026

详情

AI中文摘要

深度神经网络已成为自动驾驶感知的主流解决方案，但其不透明性与新兴的可信人工智能指南相冲突，并给安全保证、调试和人工监督带来复杂性。尽管存在安全与可解释人工智能（XAI）的理论框架，但针对3D场景理解的可信人工智能具体实现仍然稀缺。我们通过提出一个鲁棒性突出、集成忠实可解释性和校准不确定性估计的可信人工智能感知模块来填补这一空白。基于Transformer检测器，我们在推理时从注意力机制中导出解释，并使用基于扰动的一致性测试验证其忠实性。我们进一步集成了不确定性估计与校准模块，并应用了增强鲁棒性的训练方法。实验展示了忠实的显著性行为、改进的鲁棒性以及良好校准的不确定性估计。最后，我们将这些可信人工智能元素部署到原型车辆中，并提供一个可解释人工智能界面，可视化文档工件、模型不确定性状态和显著性图，展示了实时可信感知监控的可行性。补充材料见 https://tillbeemelmanns.github.io/trustworthy_ai/ 。

英文摘要

Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is remarkably robust and integrates faithful explainability and calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .

2605.06498 2026-05-25 cs.RO cs.SY eess.SY 版本更新

Lie Group Formulation of Recursive Dynamics Algorithms of Higher Order for Floating-Base Robots

浮动基座机器人高阶递归动力学算法的李群公式

Ahmed Ali, Chiara Gabellieri, Antonio Franchi

发表机构 * Robotics and Mechatronics Department, EEMCS Faculty, University of Twente（特文特大学机器人与机电系，EEMCS学院）； Department of Computer, Control and Management Engineering, Sapienza University of Rome（罗马大学计算机、控制与管理工程系）

AI总结本文研究了浮动基座机器人的高阶递归动力学算法在李群框架下的表示方法，提出了一种基于李群的牛顿-欧拉、铰接体惯量及混合动力学算法的高阶时间导数计算方法。该方法适用于基座构型在SE(3)上、附着机构构型在T^{n1} × R^{n2}流形上的树状机械系统，并采用旋量的空间表示，将递归式整理为闭式运动方程。研究还展示了该方法在12自由度空中机械臂上的应用，验证了其在几何正逆动力学及其高阶导数计算中的有效性，并表明在所考虑的测试中其计算成本随导数阶数呈二次增长，而自动微分基线呈指数增长。

详情

DOI: 10.1115/1.4071985
Journal ref: ASME. Journal of Mechanisms and Robotics (2026)

AI中文摘要

本文描述了计算浮动基座树状系统的李群牛顿-欧拉、铰接体惯量和混合动力学算法的高阶时间导数的过程，其中基座构型在SE(3)上演化，附着的机构是一个开运动学树，构型在(n1+n2)维流形T^{n1} × R^{n2}上，使用旋量的空间表示。在给出算法后，我们将得到的递归式整理成闭式运动方程，识别出满足无源性性质的容许科里奥利矩阵，并证明铰接体惯性张量在所有时间导数下保持不变。然后，我们将所开发的方法应用于一个12自由度空中机械臂，推导其几何正动力学和逆动力学及其一阶时间导数的解析表达式，而数值模拟成功评估了这些动力学直至五阶。最后，为了展示其实用性，我们对所提出的扩展进行了基准测试，并表明在考虑的测试中，其计算成本随导数阶数呈二次增长，而自动微分基线则呈指数增长。

英文摘要

In this paper, we describe procedures for computing higher-order time derivatives of the Lie-group Newton-Euler, Articulated-Body Inertia, and hybrid dynamics algorithms for floating-base trees, where the base configuration evolves on SE(3) and the attached mechanism is an open kinematic tree with configuration on the (n1+n2)-dimensional manifold T^{n1} × R^{n2}, using spatial representation of twists. After presenting the algorithms, we collect the resulting recursions into closed-form equations of motion, identifying an admissible Coriolis matrix satisfying the passivity property, and showing that the articulated inertia tensor remains unchanged across all time derivatives. We then apply the developed methods to a 12-DoF aerial manipulator to derive analytical expressions for its geometric forward and inverse dynamics along with their first time derivatives, whereas the numerical simulations successfully evaluate these dynamics up to fifth order. Finally, to demonstrate their practical utility, we benchmark the proposed extensions and show that, in the considered tests, their computational cost scales quadratically with the derivative order, whereas the automatic-differentiation baseline exhibits exponential scaling.

2605.04568 2026-05-25 cs.LG cs.AI cs.RO 版本更新

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

Dream-MPC：基于梯度与潜在想象的模型预测控制

Jonathan Spieler, Sven Behnke

发表机构 * Autonomous Intelligent Systems, Computer Science Institute VI - Intelligent Systems（自主智能系统，计算机科学研究所VI - 智能系统）； Robotics, Center for Robotics（机器人学，机器人中心）； the Lamarr Institute for Machine Learning（拉马尔机器学习研究所）； Artificial Intelligence, University of Bonn, Germany（人工智能，波恩大学，德国）

AI总结本文提出了一种名为 Dream-MPC 的新型模型预测控制方法，结合了梯度上升优化与学习到的世界模型，通过生成少量候选轨迹并利用不确定性正则化和优化迭代的复用机制进行优化。该方法在24个连续控制任务中表现出色，显著提升了基础策略的性能，优于传统的无梯度MPC和先进基线方法。

Comments Accepted for International Conference on Machine Learning (ICML) 2026

详情

AI中文摘要

最先进的基于模型的强化学习方法要么使用无梯度、基于种群的规划方法，要么使用学习到的策略网络，或者结合策略网络和规划。将模型预测控制（MPC）与学习到的模型和策略先验相结合的混合方法，以利用两种范式的优势，已显示出有希望的结果。然而，这些方法通常依赖于无梯度优化方法，对于高维控制任务可能计算成本高昂。虽然基于梯度的方法是一个有前途的替代方案，但最近的工作经验表明，基于梯度的方法通常比无梯度方法表现更差。我们提出了Dream-MPC，一种新颖的方法，从展开的策略生成少量候选轨迹，并通过使用学习的世界模型、不确定性正则化和通过重用先前优化的动作随时间摊销优化迭代，对每个轨迹进行梯度上升优化。我们在24个连续控制任务上的结果表明，Dream-MPC可以显著提高底层策略的性能，并且可以优于无梯度MPC和最先进的基线。代码和视频可在https://dream-mpc.github.io获取。

英文摘要

State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. Code and videos are available at https://dream-mpc.github.io.

2603.15278 2026-05-25 eess.SY cs.RO cs.SY 版本更新

Encirclement Guaranteed Finite-Time Capture against Unknown Evader Strategies

针对未知逃逸策略的包围保证有限时间捕获

Dinesh Patra, Prajakta Surve, Ashish R. Hota, Shaunak D. Bopardikar

发表机构 * Department of Electrical Engineering, IIT Kharagpur（印度理工学院Kharagpur电气工程系）； Department of Electrical and Computer Engineering, Michigan State University（密歇根州立大学电气与计算机工程系）

AI总结本文研究了在二维无界环境中，多个追捕者在未知逃逸策略下对单个逃逸者进行有限时间捕获的问题。提出了一类保证在有限时间内完成捕获并保持逃逸者始终被包围的策略，且该策略对逃逸者的策略具有鲁棒性。研究还推导了捕获时间的上界，并通过数值实验验证了所提方法的有效性。

2603.10688 2026-05-25 cs.RO cs.CV 版本更新

MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction

MapGCLR: 用于在线矢量化高清地图构建的地理空间对比学习表示

Jonas Merkert, Alexander Blumberg, Jan-Hendrik Pauls, Christoph Stiller

发表机构 * Institute of Measurement and Control Systems, Karlsruhe Institute of Technology (KIT)（卡尔斯鲁厄理工学院（KIT）测量与控制系统研究所）

AI总结本文提出了一种名为 MapGCLR 的方法，旨在提升在线矢量化高精地图构建中鸟瞰图（BEV）特征网格的表示能力。通过在对比损失函数中引入地理空间一致性约束，该方法增强了重叠区域特征的一致性，并结合多遍历数据集划分策略，实现了半监督学习框架。实验表明，该方法在矢量化地图感知任务和特征空间可视化方面均优于传统监督方法。

详情

AI中文摘要

自动驾驶汽车依赖地图信息来理解周围环境。然而，离线高清地图的创建和维护成本仍然很高。一种更具可扩展性的替代方案是在线高清地图构建，它仅在训练时需要地图标注。为了进一步减少标注大量训练标签的需求，自监督训练提供了一种替代方案。本文通过在地理空间上强制重叠的鸟瞰图特征网格之间的一致性作为对比损失函数的一部分，专注于改进矢量化在线高清地图构建模型中的潜在鸟瞰图特征网格表示。为了确保对比对的地理空间重叠，我们引入了一种方法来分析给定数据集中遍历之间的重叠，并根据可调整的多遍历要求生成子数据集划分。我们使用减少的单遍历标注数据对同一模型进行监督训练，并在更广泛的未标注数据集上根据我们的多遍历要求进行自监督训练，有效实现了半监督方法。我们的方法在各个方面都优于监督基线，无论是在下游任务矢量化地图感知性能的定量评估上，还是在鸟瞰图特征空间的主成分分析可视化的分割定性评估上。

英文摘要

Autonomous vehicles rely on map information to understand the world around them. However, the creation and maintenance of offline high-definition (HD) maps remains costly. A more scalable alternative lies in online HD map construction, which only requires map annotations at training time. To further reduce the need for annotating vast training labels, self-supervised training provides an alternative. This work focuses on improving the latent bird's-eye-view (BEV) feature grid representation within a vectorized online HD map construction model by enforcing geospatial consistency between overlapping BEV feature grids as part of a contrastive loss function. To ensure geospatial overlap for contrastive pairs, we introduce an approach to analyze the overlap between traversals within a given dataset and generate subsidiary dataset splits following adjustable multi-traversal requirements. We train the same model supervised using a reduced set of single-traversal labeled data and self-supervised on a broader unlabeled set of data following our multi-traversal requirements, effectively implementing a semi-supervised approach. Our approach outperforms the supervised baseline across the board, both quantitatively in terms of the downstream task's vectorized map perception performance and qualitatively in terms of segmentation in the principal component analysis (PCA) visualization of the BEV feature space.
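
The abstract does not state the exact form of the contrastive loss, so the sketch below swaps in a standard InfoNCE loss to illustrate what "enforcing geospatial consistency between overlapping BEV feature grids" could look like. The assumption is that `anchors[i]` and `positives[i]` are feature vectors of the same geographic cell observed in two different traversals, while all other cells act as negatives; `info_nce`, the temperature, and the toy features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: each anchor should match the positive with the same index."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy BEV cell features from traversal A and traversal B of the same two map cells.
traversal_a = [[1.0, 0.0], [0.0, 1.0]]
traversal_b_aligned = [[0.9, 0.1], [0.1, 0.9]]
traversal_b_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(traversal_a, traversal_b_aligned)
loss_swapped = info_nce(traversal_a, traversal_b_swapped)
```

Geographically consistent features give a loss close to zero, while swapped cells are penalized heavily, which is the kind of training signal a geospatial contrastive objective relies on.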

2602.15258 2026-05-25 cs.RO 版本更新

SEG-JPEG: Simple Visual Semantic Communications for Remote Operation of Automated Vehicles over Unreliable Wireless Networks

SEG-JPEG: 用于在不可靠无线网络上远程操作自动驾驶车辆的简单视觉语义通信

Sebastian Donnelly, Ruth Anderson, George Economides, James Broughton, Peter Ball, Alexander Rast, Andrew Bradley

发表机构 * Autonomous Driving and Intelligent Transport Group（自动驾驶与智能交通组）； Oxford Brookes University（牛津布鲁克斯大学）； Oxfordshire County Council（牛津郡议会）； Department for Transport（交通部）； School of Engineering, Computing & Mathematics（工程、计算与数学学院）； Artificial Intelligence, Data Analysis and Systems Institute（人工智能、数据分析与系统研究所）

AI总结本文研究了在不可靠无线网络环境下，如何通过视觉语义通信技术实现对自动驾驶车辆的远程操控。提出了一种名为SEG-JPEG的方法，通过在低分辨率灰度图像中用彩色高亮编码检测到的道路使用者分割信息，将所需数据率降低50%，同时保持视觉清晰度。实验表明，该方法能够在低带宽网络下实现低于200毫秒的端到端延迟，提升远程操作员的环境感知能力，为自动驾驶车辆的大规模远程部署提供了可行方案。

Comments 7 pages, 9 figures. Under minor revision for CSNDSP 2026

Abstract

Remote operation is touted as being key to the rapid deployment of automated vehicles. Streaming imagery to control connected vehicles remotely currently requires a reliable, high-throughput network connection, which can be limited in real-world remote operation deployments relying on public network infrastructure. This paper investigates how the application of computer-vision-assisted semantic communication can be used to circumvent data loss and corruption associated with traditional image compression techniques. By encoding the segmentations of detected road users into colour-coded highlights within low-resolution greyscale imagery, the required data rate can be reduced by 50% compared with conventional techniques, while maintaining visual clarity. This enables a median glass-to-glass latency of below 200 ms even when the network data rate is below 500 kbit/s, while clearly outlining salient road users to enhance situational awareness of the remote operator. The approach is demonstrated in an area of variable 4G mobile connectivity using an automated last-mile delivery vehicle. Results indicate that large-scale deployment of remotely operated automated vehicles could be possible even on the often-constrained public 4G/5G mobile network, providing the potential to expedite the nationwide roll-out of automated vehicles.
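
The encoding step described above can be illustrated with a small sketch. The abstract does not give the exact pipeline, so the palette, the class labels, the downsampling factor, and the function names below are all hypothetical. The idea shown is only that background pixels are reduced to low-resolution greyscale while pixels belonging to detected road users are painted in a fixed class colour before the frame goes to an ordinary image encoder.

```python
# Hypothetical class-to-colour palette; the actual colours are not given in the abstract.
PALETTE = {1: (255, 0, 0), 2: (0, 255, 0)}  # assumed labels: 1 = pedestrian, 2 = vehicle

def to_grey(pixel):
    """Luma approximation (ITU-R BT.601 weights) for an (R, G, B) pixel."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def highlight_frame(rgb, mask, step=2):
    """Downsample by `step`, convert to greyscale, and paint masked pixels in class colours."""
    out = []
    for y in range(0, len(rgb), step):
        row = []
        for x in range(0, len(rgb[0]), step):
            label = mask[y][x]
            if label in PALETTE:
                row.append(PALETTE[label])      # road user: solid class colour
            else:
                g = to_grey(rgb[y][x])
                row.append((g, g, g))           # background: greyscale
        out.append(row)
    return out

# Usage on a toy 4x4 frame with one labelled pixel.
frame = [[(100, 150, 200)] * 4 for _ in range(4)]
labels = [[0] * 4 for _ in range(4)]
labels[0][0] = 1
small = highlight_frame(frame, labels)
```

Sampling the mask at every `step`-th pixel is the simplest choice and can miss very small objects; a real system would more likely pool the mask over each block.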

2601.00969 2026-05-25 cs.RO cs.AI (updated version)

V-VLAPS: Value-Guided Planning for Vision-Language-Action Models

Ke Ren, Ali Salamatian, Kieran Pattison, Cyrus Neary

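Affiliations: The University of British Columbia

AI summary: This work proposes V-VLAPS, a value-guided planning method for vision-language-action (VLA) models that targets planning failures caused by biased policy priors in complex tasks. A lightweight value head, trained on offline VLA rollouts to predict Monte Carlo returns, guides Monte Carlo Tree Search toward higher-value branches. Across several LIBERO task suites, V-VLAPS matches the value-free baseline at the default search budget and outperforms it with a larger budget.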

Abstract

Vision-language-action (VLA) models provide strong action priors for robotic manipulation, but their reactive behavior can fail under distribution shift and long-horizon task structure. Recent VLA-guided planning methods improve execution by using pretrained policies to guide tree search, yet node selection still depends heavily on policy priors and visit-count exploration. Consequently, when the policy favors poor actions, the planner lacks a learned value signal to correct this bias. Prior work has shown that VLA representations encode rollout success and failure information, suggesting that they may also support value estimation during planning. We introduce Value-Guided Vision-Language-Action Planning and Search (V-VLAPS), which augments VLA-guided planning with a lightweight value head trained on offline VLA rollouts to predict Monte Carlo returns. These predictions guide Monte Carlo Tree Search toward higher-value branches. Across five LIBERO suites, V-VLAPS matches the value-free planning baseline at the default search budget in aggregate, and analysis shows that many hard failures are root-level timeouts where predicted values are weakly separated. With a larger search budget, V-VLAPS improves over the baseline in all task suites, with gains of +6 percentage points on LIBERO-Object and +4 percentage points on LIBERO-10. Our results suggest that VLA representations can support not only failure prediction, but also value-guided planning when search reaches branches where value-based ranking matters.
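
Two ingredients named in the abstract can be sketched in a few lines: the Monte Carlo return that a value head would be trained to predict, and a node-selection score that combines a value estimate with a policy prior and visit counts. The abstract does not give the paper's selection rule, so the AlphaZero-style PUCT formula, the discount factor, and all names below are assumptions for illustration.

```python
import math

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return G_t = r_t + gamma * G_{t+1} for every step of one rollout."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.0):
    """Value estimate plus a prior-weighted exploration bonus (assumed PUCT form)."""
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_child(children, parent_visits):
    """children: list of dicts with 'q', 'prior', 'visits'. Returns the index of the best child."""
    scores = [puct_score(c["q"], c["prior"], parent_visits, c["visits"]) for c in children]
    return max(range(len(children)), key=scores.__getitem__)

# A sparse-reward rollout that succeeds on its last step gives training targets
# for the value head at each earlier step.
targets = monte_carlo_returns([0.0, 0.0, 1.0], gamma=0.5)
```

With all value estimates at zero, selection follows the policy prior; a high predicted value on a low-prior action can overturn that ranking, which is the correction the abstract describes.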

2512.11551 2026-05-25 cs.RO (updated version)

CarlaNCAP: A Framework for Quantifying the Safety of Vulnerable Road Users in Infrastructure-Assisted Collective Perception Using EuroNCAP Scenarios

Jörg Gamerdinger, Sven Teufel, Simon Roller, Oliver Bringmann

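Affiliations: University of Tübingen, Faculty of Science, Department of Computer Science, Embedded Systems Group

AI summary: As the number of road users grows, accident risk has risen significantly in recent years, and vulnerable road users (VRUs) face elevated risk in urban environments because they are often occluded. This paper proposes CarlaNCAP, a framework for evaluating the VRU safety improvement from infrastructure-assisted collective perception (CP), together with a dataset of 11k frames of safety-critical EuroNCAP scenarios. Experiments show that infrastructure-assisted CP can significantly reduce accident rates compared with relying on vehicle sensors alone, achieving up to 100% accident avoidance.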

Abstract

The growing number of road users has significantly increased the risk of accidents in recent years. Vulnerable Road Users (VRUs) are particularly at risk, especially in urban environments where they are often occluded by parked vehicles or buildings. Autonomous Driving (AD) and Collective Perception (CP) are promising solutions to mitigate these risks. In particular, infrastructure-assisted CP, where sensor units are mounted on infrastructure elements such as traffic lights or lamp posts, can help overcome perceptual limitations by providing enhanced points of view, which significantly reduces occlusions. To encourage decision makers to adopt this technology, comprehensive studies and datasets demonstrating safety improvements for VRUs are essential. In this paper, we propose a framework for evaluating the safety improvement provided by infrastructure-based CP, specifically targeted at VRUs, including a dataset of safety-critical EuroNCAP scenarios (CarlaNCAP) with 11k frames. Using this dataset, we conduct an in-depth simulation study and demonstrate that infrastructure-assisted CP can significantly reduce accident rates in safety-critical scenarios, achieving up to 100% accident avoidance, compared with only 33% for a vehicle relying on its own sensors. Code is available at https://github.com/ekut-es/carla_ncap

2511.02239 2026-05-25 cs.RO cs.AI (updated version)

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi

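Affiliations: Department of Electrical and Computer Engineering, Univ. of Minnesota

AI summary: This paper proposes LACY, a vision-language-model-based Language-Action Cycle framework that aims to improve policy generalization in robotic manipulation. It jointly learns language-to-action generation (L2A), action-to-language explanation (A2L), and language-to-language consistency verification (L2C), so that the robot can both perform tasks and explain its own behaviour, forming richer internal representations. LACY uses an active augmentation strategy to autonomously generate and filter training data without additional human labels. Experiments on pick-and-place tasks show an average 56.46% improvement in task success rate and more robust language-action grounding.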

Comments Accepted to ICRA 2026. Project page: https://vla2026.github.io/LACY/

Abstract

Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: https://vla2026.github.io/LACY/

2511.00266 2026-05-25 cs.LG cs.RO (updated version)

X-TRACK: Physics-Aware xLSTM for Realistic Vehicle Trajectory Prediction

Aanchal Rajesh Chugh, Marion Neumeier, Sebastian Dorn

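AI summary: Accurate trajectory prediction is critical to the safety and reliability of autonomous driving systems and requires modelling long-term temporal dependencies and social interactions between vehicles in highway scenarios. This paper proposes X-TRAJ, a new xLSTM-based highway trajectory prediction framework, and its physics-aware variant X-TRACK, which explicitly integrates vehicle kinematic constraints to generate more realistic and feasible trajectories. On the public highD and NGSIM datasets, X-TRACK outperforms state-of-the-art baselines on highD and is competitive with them on NGSIM.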

Abstract

Accurate trajectory prediction is crucial for safe and reliable autonomous driving systems, requiring models that capture long-term temporal dependencies while accounting for social interactions among neighboring vehicles in highway driving scenarios. While Long Short-Term Memory (LSTM) networks have been widely used in the domain of trajectory prediction, they suffer from limitations such as limited memory capacity and a scalar cell state. The recently introduced Extended Long Short-Term Memory (xLSTM) addresses these limitations of traditional LSTMs by introducing exponential gating and enhanced memory structures, making it better suited for modeling long-term temporal dependencies. Despite their potential, xLSTM-based models remain underexplored in the context of vehicle trajectory prediction. This paper introduces a novel xLSTM-based highway trajectory prediction framework, X-TRAJ, the first application of xLSTM to this task, and its physics-aware variant, X-TRACK (eXtended LSTM for TRAjectory prediction Constraint by Kinematics), which explicitly integrates vehicle motion kinematics into the model learning process. By introducing physical constraints, the proposed model generates realistic and feasible highway trajectories. A comprehensive evaluation on the publicly available highway datasets, highD and NGSIM, demonstrates that X-TRACK outperforms state-of-the-art baselines on highD and is among the state-of-the-art models on the NGSIM dataset.
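
One common way to make predicted trajectories kinematically feasible is to have the network output controls rather than positions and to integrate those controls through a motion model. The abstract does not say which kinematic model X-TRACK uses, so the unicycle model, the time step, and the names below are assumptions chosen only to illustrate the idea.

```python
import math

def rollout(state, controls, dt=0.2):
    """Integrate (acceleration, yaw_rate) controls through a simple unicycle model.

    state is (x, y, speed, heading); returns the list of (x, y) waypoints.
    """
    x, y, v, heading = state
    path = []
    for accel, yaw_rate in controls:
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        v = max(0.0, v + accel * dt)   # a braking vehicle stops; it does not reverse
        heading += yaw_rate * dt
        path.append((x, y))
    return path

# Hypothetical usage: 10 m/s straight ahead for one second, no steering or acceleration.
waypoints = rollout((0.0, 0.0, 10.0, 0.0), [(0.0, 0.0)] * 5)
```

Because every waypoint is produced by the model's equations, the resulting path cannot contain jumps or sideways motion that a vehicle could not physically execute.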

2509.22271 2026-05-25 cs.HC cs.RO (updated version)

Human Autonomy and Sense of Agency in Human-Robot Interaction: A Systematic Literature Review

Felix Glawe, Tim Schmeckel, Philipp Brauner, Martina Ziefle

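Affiliations: Chair for Communication Science

AI summary: This systematic review synthesises 22 empirical studies published between 2011 and 2024 on the importance of human autonomy and sense of agency in human-robot interaction and the factors that influence them. Thematic synthesis reveals five clusters of factors: robot adaptiveness, communication style, anthropomorphism, presence of a robot, and individual differences. The review finds the current empirical evidence fragmented and calls for standardised definitions and more qualitative research to support ethically and psychologically grounded HRI design.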

DOI: 10.1007/s12369-026-01402-1

Abstract

Human autonomy and sense of agency are increasingly recognised as critical for user well-being, motivation, and the ethical deployment of robots in human-robot interaction (HRI). Given the rapid development of artificial intelligence, robot capabilities and their potential to function as colleagues and companions are growing. This systematic literature review synthesises 22 empirical studies selected from an initial pool of 728 articles published between 2011 and 2024. Articles were retrieved from major scientific databases and identified based on empirical focus and conceptual relevance, namely, how to preserve and promote human autonomy and sense of agency in HRI. Thematic synthesis reveals five clusters of potentially influential factors: robot adaptiveness, communication style, anthropomorphism, presence of a robot, and individual differences. Measured through psychometric scales or the intentional binding paradigm, perceptions of autonomy and agency varied across industrial, educational, healthcare, care, and hospitality settings. The review underscores the theoretical differences between the two concepts, as well as their still-entangled use in HRI. Despite increasing interest, the current body of empirical evidence remains limited and fragmented, underscoring the necessity for standardised definitions, more robust operationalisations, and further exploratory and qualitative research. By identifying existing gaps and highlighting emerging trends, this review contributes to the development of human-centered, autonomy-supportive robot design strategies that uphold ethical and psychological principles, ultimately supporting well-being in human-robot interaction.

2507.22345 2026-05-25 cs.RO (updated version)

A Reconfigured Wheel-Legged Robot for Enhanced Steering and Adaptability

Zhicheng Song, Jinglan Xu, Chunxin Zheng, Yulin Li, Zhihai Bi, Jun Ma

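Affiliations: Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper presents FLORES, a novel wheel-legged robot whose reconfigured front-leg structure enables efficient movement on both flat ground and complex terrain. The design replaces the conventional hip-roll degree of freedom with a hip-yaw degree of freedom and pairs it with a tailored reinforcement learning controller, enabling seamless switching and adaptive control between wheeled and legged locomotion. Experiments show clear improvements in steering capability, navigation efficiency, and multi-terrain adaptability.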

DOI: 10.1109/LRA.2026.3688384
Journal ref: IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7444-7451, June 2026

Abstract

Wheel-legged robots integrate leg agility on rough terrain with wheel efficiency on flat ground. However, most existing designs do not fully capitalize on the benefits of both legged and wheeled structures, which limits overall system flexibility and efficiency. We present FLORES, a novel wheel-legged robot design featuring a distinctive front-leg configuration that sets it apart from standard design approaches. Specifically, FLORES replaces the conventional hip-roll degree of freedom (DoF) of the front leg with hip-yaw DoFs, which allows for efficient movement on flat surfaces while ensuring adaptability when navigating complex terrains. This innovative design facilitates seamless transitions between different locomotion modes (i.e., legged locomotion and wheeled locomotion) and optimizes performance across varied environments. To fully exploit FLORES's mechanical capabilities, we develop a tailored reinforcement learning (RL) controller that adapts the Hybrid Internal Model (HIM) with a customized reward structure optimized for our unique mechanical configuration. This framework enables the generation of adaptive, multi-modal locomotion strategies that facilitate smooth transitions between wheeled and legged movements. Furthermore, our distinctive joint design enables the robot to exhibit novel and highly efficient locomotion gaits that capitalize on the synergistic advantages of both locomotion modes. Through comprehensive experiments, we demonstrate FLORES's enhanced steering capabilities, improved navigation efficiency, and versatile locomotion across various terrains. The open-source project can be found at https://github.com/ZhichengSong6/FLORES.

2506.00560 2026-05-25 cs.RO cs.CV (updated version)

Using Ensemble Diffusion to Estimate Uncertainty for End-to-End Autonomous Driving

Florian Wintel, Sigmund H. Høeg, Gabriel Kiss, Frank Lindseth

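Affiliations: Norwegian University of Science and Technology

AI summary: This paper proposes EnDfuser, an end-to-end autonomous driving system based on ensemble diffusion that estimates uncertainty in trajectory planning. By combining attention pooling and trajectory planning in a single diffusion transformer module, it makes effective use of fused camera and LiDAR features and generates 128 candidate trajectories from a single perception frame, providing interpretability for the uncertain space of future trajectories. A simple safety rule built on this information improves the driving score by 1.7% on the LAV benchmark, showing that ensemble diffusion can model the uncertainty of the posterior trajectory distribution in end-to-end driving policies.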

Comments Accepted at NLDL 2026

Abstract

End-to-end planning systems for autonomous driving are rapidly improving, especially in closed-loop simulation environments like CARLA. Many such driving systems either do not consider uncertainty as part of the plan itself or obtain it by using specialized representations that do not generalize. In this paper, we propose EnDfuser, an end-to-end driving system that uses a diffusion model as the trajectory planner. EnDfuser effectively leverages complex perception information, such as fused camera and LiDAR features, by combining attention pooling and trajectory planning into a single diffusion transformer module. Instead of committing to a single plan, EnDfuser produces a distribution of candidate trajectories (128 in our case) from a single perception frame through ensemble diffusion. By observing the full set of candidate trajectories, EnDfuser provides interpretability for uncertain, multimodal future trajectory spaces. Using this information, we design a simple safety rule that improves the system's driving score by 1.7% on the LAV benchmark. Our findings suggest that ensemble diffusion, used as a drop-in replacement for traditional point-estimate trajectory planning modules, can contribute to an uncertainty-aware decision-making process in end-to-end driving policies by modeling the uncertainty of the posterior trajectory distribution.
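
The abstract does not describe the safety rule itself, so the following is only a hypothetical example of how a rule could be built on top of an ensemble of candidate trajectories. It measures how far the candidates' final waypoints scatter around their mean and lowers the target speed when they disagree. The spread measure, the threshold, and both speed values are invented for illustration.

```python
import math

def endpoint_spread(trajectories):
    """Mean distance of each candidate's final waypoint from the ensemble's mean endpoint."""
    ends = [traj[-1] for traj in trajectories]
    mx = sum(x for x, _ in ends) / len(ends)
    my = sum(y for _, y in ends) / len(ends)
    return sum(math.hypot(x - mx, y - my) for x, y in ends) / len(ends)

def target_speed(trajectories, nominal=8.0, cautious=3.0, max_spread=2.0):
    """Drive at the nominal speed unless the ensemble disagrees about where to go."""
    return cautious if endpoint_spread(trajectories) > max_spread else nominal

# Hypothetical ensembles of 128 two-waypoint candidates.
agree = [[(0.0, 0.0), (10.0, 0.0)] for _ in range(128)]
split = ([[(0.0, 0.0), (10.0, 5.0)] for _ in range(64)]
         + [[(0.0, 0.0), (10.0, -5.0)] for _ in range(64)])
```

A bimodal ensemble such as `split` (half the candidates veer left, half right) is exactly the case a single point-estimate planner cannot express, and the spread makes it visible to downstream logic.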

2503.04929 2026-05-25 cs.RO cs.LG cs.SY eess.SY (updated version)

Neural Configuration-Space Barriers for Manipulation Planning and Control

Kehan Long, Ki Myung Brian Lee, Nikola Raicevic, Niyas Attasseri, Melvin Leok, Nikolay Atanasov

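Affiliations: Contextual Robotics Institute, University of California San Diego

AI summary: This paper studies efficient and safe planning and control for high-dimensional manipulators in cluttered dynamic environments. The authors propose a unified approach based on neural configuration-space distance functions (CDFs) that formulates safety constraints as CDF barriers, reducing the number of collision checks during motion planning. To handle uncertainty from modelling errors and sensor noise, they also propose a distributionally robust CDF barrier control formulation that does not assume a known noise distribution. Experiments show efficient and safe manipulator control using only onboard point-cloud observations.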

Abstract

Planning and control for high-dimensional robot manipulators in cluttered dynamic environments require computational efficiency and robust safety guarantees. Inspired by recent advances in learning configuration-space distance functions (CDFs) as representations of robot bodies, we propose a unified approach for motion planning and control that formulates safety constraints as CDF barriers. A CDF barrier approximates the local free configuration space, substantially reducing the number of collision-checking operations during motion planning. However, learning a CDF barrier with a neural network and relying on online sensor observations introduces uncertainties that must be considered during control synthesis. To address this, we develop a distributionally robust CDF barrier formulation for control that accounts for modeling errors and sensor noise without assuming a known underlying distribution. Simulations and hardware experiments on a UFactory xArm6 manipulator show that our neural CDF barrier formulation enables efficient planning and robust safe control in cluttered and dynamic environments, relying only on onboard point-cloud observations.
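
The reason a distance function in configuration space can cut down collision checking is that, if the distance from a configuration q to the nearest colliding configuration is d, every configuration within d of q is free. The sketch below uses that property to validate a straight joint-space edge with a handful of queries instead of a dense sweep. The toy distance function stands in for a learned CDF, and the margin, minimum step, and function names are assumptions, not the paper's formulation.

```python
import math

def toy_cdf(q, centre=(1.0, 1.0), radius=0.5):
    """Stand-in for a learned CDF: joint-space distance to a circular colliding region."""
    return math.dist(q, centre) - radius

def edge_is_certified_free(q_start, q_goal, cdf=toy_cdf, margin=0.05, min_step=1e-3):
    """Walk along the edge, advancing by the certified free radius after each query.

    Returns (certified, number_of_cdf_queries). False means the edge could not be
    certified within the safety margin, not that a collision was proven.
    """
    length = math.dist(q_start, q_goal)
    travelled, queries = 0.0, 0
    while True:
        t = travelled / length if length > 0 else 0.0
        q = tuple(a + t * (b - a) for a, b in zip(q_start, q_goal))
        step = cdf(q) - margin          # radius of the free ball, shrunk by the margin
        queries += 1
        if step < min_step:             # too close to the colliding set to advance
            return False, queries
        travelled += step
        if travelled >= length:         # the last free ball covers the goal
            return True, queries

clear_edge = edge_is_certified_free((0.0, 0.0), (0.0, 2.0))
blocked_edge = edge_is_certified_free((0.0, 0.0), (2.0, 2.0))
```

A fixed-resolution checker with the same 0.05 spacing would need about 40 queries for the first edge; the adaptive walk needs far fewer because each query certifies a whole ball of configurations.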

2605.22896 2026-05-25 cs.RO cs.AI cs.LG (updated version)

Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models

Ruofan Jin, Zaixi Zhang

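AI summary: This paper proposes Agentic-VLA, a training framework that aims to make online adaptation of vision-language-action (VLA) models for robotic manipulation more efficient. Three core innovations (adaptive reward synthesis, language-guided exploration, and experience memory) address the poor generalization to novel environments and low training efficiency of existing VLA training methods. On the LIBERO and RoboTwin 2.0 benchmarks, Agentic-VLA improves task success and learning efficiency, a step toward adaptive VLA systems capable of continual learning.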

Comments Total 15 pages

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by leveraging pre-trained vision-language representations. However, current VLA training methods suffer from two critical limitations: poor generalization to novel environments and low training efficiency requiring extensive demonstrations. We introduce Agentic-VLA, an agentic training framework that enables VLAs to efficiently adapt online through three key innovations: (1) Adaptive Reward Synthesis, which dynamically generates and adjusts reward functions based on the VLA's current capabilities and task complexity, decomposing complex tasks into learnable sub-goals for curriculum learning; (2) Language-Guided Exploration, where a critic model provides structured guidance for systematic exploration rather than random sampling; and (3) Experience Memory, which stores and retrieves task-relevant policy weights for warm-starting adaptation to similar tasks. We evaluate Agentic-VLA on the LIBERO benchmark, achieving substantial improvements: +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and enabling cross-task transfer from 0% to 31.2% without task-specific demonstrations. Our framework also demonstrates 2.4x faster convergence compared to existing online adaptation methods. Beyond LIBERO, Agentic-VLA retains its advantage on the dual-arm RoboTwin 2.0 benchmark, including under its randomized Hard setting. These results establish Agentic-VLA as a significant step toward truly adaptive VLA systems capable of continuous learning in deployment.

2605.22890 2026-05-25 cs.RO cs.CV (updated version)

Extending Deep Event Visual Odometry with Sparse Point-Cloud Export

Alireza Safdari, Sajad Ashraf

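AI summary: This work addresses event-camera visual odometry under high-speed motion and challenging lighting by extending Deep Event Visual Odometry (DEVO) with a sparse point-cloud export module. It extracts the 3D structure that DEVO already estimates internally and converts it into an explicit point-cloud representation for visualization and further processing, while preserving the original odometry pipeline. Experiments show that the exported sparse cloud is locally consistent and achieves high precision at a 5 cm threshold, with limitations in density, completeness, and sensitivity to accumulated odometry noise.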

Comments 9 pages, 4 figures, 5 tables

Abstract

Event cameras are well suited for visual odometry under high-speed motion and challenging lighting conditions due to their low latency, high temporal resolution, and high dynamic range. Deep Event Visual Odometry (DEVO) demonstrated that monocular event-only odometry can achieve strong performance by combining sparse patch tracking, learned patch selection, recurrent correspondence refinement, and differentiable bundle adjustment. In this project, we extend DEVO with a sparse point-cloud export pipeline. Rather than modifying the core odometry formulation, our approach exposes the internal 3D structure already estimated by DEVO and converts it into an explicit point-cloud representation for visualization and further processing. In addition, we implement a practical workflow for data export, format conversion, and point-cloud cleanup. The resulting system preserves the original visual odometry pipeline while enabling sparse geometric scene output. Experiments on the BOARD SLOW sequence show that the exported sparse cloud is locally consistent with EMVS reconstructions, achieving high precision at a 5 cm threshold, while also highlighting the expected limitations in density, completeness, and sensitivity to accumulated odometry noise.
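
Precision at a distance threshold is commonly computed as the fraction of reconstructed points that lie within the threshold of some reference point. The abstract does not define its metric in detail, so the sketch below assumes that common definition, with the EMVS reconstruction playing the role of the reference cloud and coordinates in metres. The function name and the brute-force search are illustrative choices.

```python
import math

def precision_at(points, reference, threshold=0.05):
    """Fraction of `points` whose nearest neighbour in `reference` is within `threshold` metres."""
    if not points:
        return 0.0
    hits = sum(
        1 for p in points
        if min(math.dist(p, r) for r in reference) <= threshold
    )
    return hits / len(points)

# Hypothetical clouds: two exported points lie close to the reference, two do not.
reference_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
exported_cloud = [(0.01, 0.0, 0.0), (1.0, 0.04, 0.0), (0.5, 0.0, 0.0), (0.0, 0.2, 0.0)]
score = precision_at(exported_cloud, reference_cloud)
```

The nearest-neighbour search here is quadratic in the number of points, which is acceptable for a sparse export but would normally be replaced by a spatial index such as a k-d tree for dense clouds. Precision alone says nothing about how much of the scene is covered, which matches the abstract's separate remarks on density and completeness.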

2605.22889 2026-05-25 cs.RO (updated version)

Remote Teleoperation of Endovascular Intervention Robots: A Systematic Review

Xingyu Chen, Yinchao Yang, Nikola Fischer, Harry Robertshaw, Benjamin Jackson, Mohammad Shikh-Bahaei, Christos Bergeles, Thomas C Booth

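Affiliations: School of Biomedical Engineering & Imaging Sciences, King’s College London; Department of Engineering, King’s College London; Department of Neuroradiology, King’s College Hospital

AI summary: This systematic review examines teleoperated endovascular robotic systems, assessing technical feasibility, communication infrastructure, and clinical outcomes. It finds that teleoperated catheters and guidewires driven by mechanical or electromagnetic systems can be navigated across distances of up to 7000 km, with network latency within clinically acceptable limits given robust communication infrastructure. Although small-scale human trials report 100% procedural success, most evidence comes from animal or phantom models. Future studies in low- and middle-income countries and multi-centre clinical trials are needed to validate safety and broad applicability.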

Comments The manuscript has been submitted to IEEE Transactions on Medical Robotics and Bionics

Abstract

Remote robotic-assisted endovascular intervention offers a promising approach to reduce clinician radiation exposure and physical strain, while extending specialized vascular care to geographically distant regions. Despite advancements, teleoperated endovascular intervention remains underexplored, especially for time-sensitive interventions like mechanical thrombectomy for acute stroke. The aim of the current review was to determine the evidence regarding teleoperated endovascular robotic systems, covering technical feasibility, communication infrastructure, and clinical outcomes. The review further identified research gaps and future directions. Following PRISMA guidelines, 16 studies that met the inclusion criteria were included out of 2501 initial search results. We found that teleoperated catheters and guidewires, driven by mechanical or electromagnetic systems, can be navigated across distances up to 7000 km. With robust communication infrastructure, network latency remained within clinically acceptable limits (30-163 ms). Although initial outcomes highlighted 100% procedural success in small-scale human trials, most evidence stemmed from animal or phantom models. Overall, the findings suggest that teleoperated endovascular intervention can reduce occupational hazards, expand patient access to urgent procedures, and optimize resource allocation. Future research should be conducted in low- and middle-income countries to demonstrate broader geographical access. Ultimately, multi-center clinical trials are required to validate safety, efficacy, and generalizability in diverse clinical settings.

2508.12043 2026-05-25 cs.RO (updated version)

Talk Less, Fly Lighter: Autonomous Semantic Compression for UAV Swarm Communication via LLMs

Fei Lin, Tengchao Zhang, Qinghua Ni, Jun Huang, Siji Ma, Yonglin Tian, Yisheng Lv, Naiqi Wu

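Affiliations: Department of Engineering Science, Faculty of Innovation Engineering, Macau University of Science and Technology; State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences

AI summary: This paper studies how large language models (LLMs) can enable autonomous semantic compression for UAV swarm communication under limited bandwidth. Using 2D simulation scenarios of varying complexity and a communication-execution pipeline that combines system prompts with task instruction prompts, it systematically evaluates the semantic compression performance of nine mainstream LLMs and analyses their adaptability and stability as environmental complexity and swarm size vary. Results indicate that LLM-based UAV swarms have the potential to achieve efficient collaborative communication under bandwidth-constrained and multi-hop link conditions.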

DOI: 10.1109/MESA68091.2025.11278876
Journal ref: Proc. 2025 21st IEEE International Conference on Mechatronic and Embedded Systems and Applications (MESA), pp. 29-34, 2025

Abstract

The rapid adoption of Large Language Models (LLMs) in unmanned systems has significantly enhanced the semantic understanding and autonomous task execution capabilities of Unmanned Aerial Vehicle (UAV) swarms. However, limited communication bandwidth and the need for high-frequency interactions pose severe challenges to semantic information transmission within the swarm. This paper explores the feasibility of LLM-driven UAV swarms for autonomous semantic compression communication, aiming to reduce communication load while preserving critical task semantics. To this end, we construct four types of 2D simulation scenarios with different levels of environmental complexity and design a communication-execution pipeline that integrates system prompts with task instruction prompts. On this basis, we systematically evaluate the semantic compression performance of nine mainstream LLMs in different scenarios and analyze their adaptability and stability through ablation studies on environmental complexity and swarm size. Experimental results demonstrate that LLM-based UAV swarms have the potential to achieve efficient collaborative communication under bandwidth-constrained and multi-hop link conditions.

2504.09583 2026-05-25 cs.RO cs.AI 版本更新

AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding

Fei Lin, Yonglin Tian, Tengchao Zhang, Jun Huang, Sangtian Guan, Fei-Yue Wang

Affiliations: Department of Engineering Science, Faculty of Innovation Engineering, Macau University of Science and Technology; State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences; State Key Laboratory for Management and Control of Complex Systems, Chinese Academy of Sciences

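AI summary: This paper proposes AirVista-II, an agentic system that aims to improve the semantic understanding of UAVs in dynamic scenes. The system combines agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored to different temporal scenarios, so that it captures critical information in dynamic environments efficiently. Experiments indicate that the system achieves high-quality zero-shot semantic understanding across a range of UAV application scenarios.

Details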

DOI: 10.1109/SMC58881.2025.11342598
Journal ref: Proc. 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 6319-6324, 2025

English abstract

Unmanned Aerial Vehicles (UAVs) are increasingly important in dynamic environments such as logistics transportation and disaster response. However, current tasks often rely on human operators to monitor aerial videos and make operational decisions. This mode of human-machine collaboration suffers from significant limitations in efficiency and adaptability. In this paper, we present AirVista-II -- an end-to-end agentic system for embodied UAVs, designed to enable general-purpose semantic understanding and reasoning in dynamic scenes. The system integrates agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored for various temporal scenarios, enabling the efficient capture of critical scene information. Experimental results demonstrate that the proposed system achieves high-quality semantic understanding across diverse UAV-based dynamic scenarios under a zero-shot setting.

2503.20066 2026-05-25 cs.RO cs.CV Updated version

Learning Scene-Level Signed Directional Distance Function with Ellipsoidal Priors and Neural Residuals

Zhirui Dai, Hojoon Shin, Yulun Tian, Ki Myung Brian Lee, Nikolay Atanasov

Affiliations: Department of Electrical and Computer Engineering, University of California San Diego; Brain Corporation; Robotics Department, University of Michigan

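AI summary: This paper proposes a new neural implicit representation, the signed directional distance function (SDDF), to address efficiency and accuracy problems in 3D reconstruction and differentiable rendering. SDDF takes a position and a viewing direction as input and directly outputs the distance to the surface, which enables efficient and accurate geometric reconstruction. To make learning efficient, the authors combine explicit ellipsoid priors with implicit neural residuals in a differentiable hybrid representation that handles distance discontinuities at obstacle boundaries. The reported results are competitive SDDF prediction accuracy, faster prediction than SDF and NeRF, and better geometric consistency than NeRF and Gaussian Splatting.

Details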

DOI: 10.1109/tpami.2026.3688658
Journal ref: 2026 IEEE Transactions on Pattern Analysis and Machine Intelligence

English abstract

Dense reconstruction and differentiable rendering are fundamental tightly connected operations in 3D vision and computer graphics. Recent neural implicit representations demonstrate compelling advantages in reconstruction fidelity and differentiability over conventional discrete representations such as meshes, point clouds, and voxels. However, many neural implicit models, such as neural radiance fields (NeRF) and signed distance function (SDF) networks, are inefficient in rendering due to the need to perform multiple queries along each camera ray. Moreover, NeRF and Gaussian Splatting methods offer impressive photometric reconstruction but often require careful supervision to achieve accurate geometric reconstruction. To address these challenges, we propose a novel representation called signed directional distance function (SDDF). Unlike SDF and similar to NeRF, SDDF has a position and viewing direction as input. Like SDF and unlike NeRF, SDDF directly provides distance to the observed surface rather than integrating along the view ray. As a result, SDDF achieves accurate geometric reconstruction and efficient differentiable directional distance prediction. To learn and predict scene-level SDDF efficiently, we develop a differentiable hybrid representation that combines explicit ellipsoid priors and implicit neural residuals. This allows the model to handle distance discontinuities around obstacle boundaries effectively while preserving the ability for dense high-fidelity distance prediction. Through extensive evaluation against state-of-the-art representations, we show that SDDF achieves (i) competitive SDDF prediction accuracy, (ii) faster prediction speed than SDF and NeRF, and (iii) superior geometric consistency compared to NeRF and Gaussian Splatting.
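
To make the input/output contract described in the abstract concrete, here is a minimal Python sketch of a directional distance query against explicit ellipsoid priors, with a residual term added on top. It is an illustration under stated assumptions, not the authors' implementation: the function names, the axis-aligned ellipsoids, the sign convention (negative when the query point lies inside an ellipsoid), the use of a minimum over ellipsoids, and the placeholder residual (zero by default, standing in for a learned network) are all hypothetical choices made for this example.

```python
import math


def ellipsoid_directional_distance(p, v, center, radii):
    """Directional distance from point p along unit direction v to an
    axis-aligned ellipsoid (hypothetical helper for illustration).

    Returns math.inf if the ray misses the ellipsoid or the ellipsoid
    lies behind p. Returns a negative value if p is inside (assumed
    sign convention, not taken from the abstract).
    """
    # Rescale coordinates so the ellipsoid becomes the unit sphere at the origin.
    q = [(pi - ci) / ri for pi, ci, ri in zip(p, center, radii)]
    w = [vi / ri for vi, ri in zip(v, radii)]
    # Solve |q + t*w|^2 = 1 for t; t is a Euclidean distance because v is unit length.
    a = sum(wi * wi for wi in w)
    b = 2.0 * sum(qi * wi for qi, wi in zip(q, w))
    c = sum(qi * qi for qi in q) - 1.0
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return math.inf  # the line through p along v never touches the ellipsoid
    t_near = (-b - math.sqrt(disc)) / (2.0 * a)
    if c > 0.0 and t_near < 0.0:
        return math.inf  # p is outside and the ellipsoid is behind it
    return t_near


def hybrid_sddf(p, v, ellipsoids, residual=lambda p, v: 0.0):
    """Explicit prior (closest ellipsoid along the ray) plus a residual term.

    `residual` is a placeholder for a learned correction; it is zero here.
    """
    prior = min(
        (ellipsoid_directional_distance(p, v, c, r) for c, r in ellipsoids),
        default=math.inf,
    )
    return prior + residual(p, v)


scene = [((0.0, 0.0, 0.0), (1.0, 2.0, 3.0)), ((5.0, 0.0, 0.0), (1.0, 1.0, 1.0))]
d = hybrid_sddf((-3.0, 0.0, 0.0), (1.0, 0.0, 0.0), scene)
```

A single query takes one position and one direction and returns one distance, in contrast to sampling many points along the ray. In this sketch the ellipsoid term supplies the sharp jump in distance at an obstacle's silhouette, and the residual is left to model finer detail.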
