2605.29773 2026-05-29 cs.CV cs.AI cs.RO Version update

Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation

Boyuan Zhang, Huanshan Huang, Yifei Cao

Affiliations: Ecole Polytechnique, Institut Polytechnique de Paris; CIAD, UTBM, Université Marie et Louis Pasteur; U2IS, ENSTA, Institut Polytechnique de Paris

AI summary: Proposes a hybrid method that combines the NECO geometric ratio with an Energy score to perform pixel-wise out-of-distribution detection in a single forward pass, reaching an AUROC of 0.8539 on the miniMUAD dataset and outperforming NECO or the Energy score used alone.

Comments 7 pages, 6 figures. Accepted at the ICRA 2026 Workshop on Long-term Deployments in the Wild (LoWi 2026)

Abstract

Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO
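The fusion step described in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the Energy score is taken as the negative log-sum-exp of the logits (a common definition), the NECO-style value is assumed to be precomputed per pixel and oriented so that larger means more OOD, and the names `energy_score`, `fit_standardizer`, `hybrid_score` and the weight `alpha` are hypothetical.

```python
import math
from statistics import mean, pstdev


def energy_score(logits, temperature=1.0):
    """Free energy -T * logsumexp(logits / T); higher values suggest OOD."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


def fit_standardizer(id_values):
    """Return (mean, std) fitted on in-distribution validation scores only."""
    mu = mean(id_values)
    sigma = pstdev(id_values)
    return mu, (sigma if sigma > 0 else 1.0)


def hybrid_score(neco_ood, energy, neco_stats, energy_stats, alpha=0.5):
    """Convex combination of the two z-scored components (0 <= alpha <= 1).

    neco_ood is assumed to be oriented so that larger means more OOD,
    for example a negated geometric ratio (an assumption of this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    z_neco = (neco_ood - neco_stats[0]) / neco_stats[1]
    z_energy = (energy - energy_stats[0]) / energy_stats[1]
    return alpha * z_neco + (1.0 - alpha) * z_energy
```

Fitting the statistics only on in-distribution data puts the two components on a comparable scale before they are mixed. The value of `alpha` would have to be tuned; the abstract does not state which value the authors use.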

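2605.29771 2026-05-29 cs.RO Version update

FLIP: Real-Time Resilient Formation Planning for Large-Scale Distributed Swarms via Point Cloud Registration

Yuan Zhou, Guangtong Xu, Zhenyu Hou, Jialiang Hou, Fei Gao

Affiliations: Institute of Cyber-Systems and Control, College of Control Science and Engineering, Zhejiang University; Huzhou Institute, Zhejiang University

AI summary: Recasts the computation of the optimal formation position sequence as a spatiotemporal point cloud registration problem and uses a PCR method with outlier rejection to achieve resilient, efficient trajectory planning for large-scale distributed swarms.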

Abstract

Traditional large-scale formation planning either oversimplifies the formation representation, which leads to poor performance, or employs complete collaborative relationships, which results in excessive computational load. To achieve high-performance, large-scale formation planning, we transform the Optimal Formation Position Sequence (OFPS) calculation problem into a spatiotemporal Point Cloud Registration (PCR) problem. Each agent derives its OFPS by distributively computing the matching result between current positions and the desired formation positions of all other agents. Each agent then optimizes the cooperative formation trajectory using the OFPS. We leverage a PCR method with outlier rejection to rapidly perform large-scale formation position registration. This prevents suboptimal trajectories and failed agents from propagating through the cooperative network and affecting more agents. Consequently, we achieve resilient, efficient, and distributed trajectory planning for large-scale swarms in a unified manner. The effectiveness and superiority of the proposed method are demonstrated through large-scale simulations of a 120-drone formation and rigorous benchmarking against state-of-the-art (SOTA) methods.
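The idea of registering a desired formation onto current agent positions while rejecting failed agents can be illustrated with a small sketch. It is not the FLIP algorithm: it assumes 2D positions, known agent-to-slot correspondences, and swaps in an exhaustive two-point consensus search (a RANSAC-like scheme) for the outlier rejection. All function names and the `threshold` parameter are hypothetical.

```python
import math
from itertools import combinations


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    s_dot = s_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay, bx, by = px - sx, py - sy, qx - dx, qy - dy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, (dx - (c * sx - s * sy), dy - (s * sx + c * sy))


def transform(point, theta, t):
    """Rotate a 2D point by theta and translate it by t."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * point[0] - s * point[1] + t[0], s * point[0] + c * point[1] + t[1])


def register_with_rejection(formation, current, threshold=0.1):
    """Pick the pairwise fit with the most inliers, then refit on those inliers."""
    best = []
    for i, j in combinations(range(len(formation)), 2):
        theta, t = fit_rigid_2d([formation[i], formation[j]], [current[i], current[j]])
        inliers = [k for k in range(len(formation))
                   if math.dist(transform(formation[k], theta, t), current[k]) <= threshold]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("not enough consistent agents to register the formation")
    theta, t = fit_rigid_2d([formation[k] for k in best], [current[k] for k in best])
    return theta, t, best
```

Agents whose positions are inconsistent with the consensus transform are left out of the final fit, so a single failed agent does not distort the targets assigned to the others. The exhaustive pair search costs O(n^3) and is meant only to show the principle; it would not scale to the swarm sizes discussed in the abstract.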

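2605.29693 2026-05-29 cs.LG cs.RO Version update

Momentum Based Reward Design for Low Emission Traffic Signal Control

Chinmay Mundane, Amith Manoharan, Arun Singh

Affiliations: Institute of Technology, University of Tartu

AI summary: Proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than merely penalizing congestion, achieving better throughput-emission trade-offs and more stable learning behavior in SUMO simulation.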

Abstract

Urban traffic congestion is a growing global issue contributing significantly to long commute times and environmental pollution. Traditional traffic signal control systems often fail to adapt to dynamic traffic conditions. Adaptive traffic signal control can improve urban traffic without changing road infrastructure. Deep Reinforcement Learning (DRL) has shown strong performance for this task, but existing delay and queue-based rewards often produce short-sighted or unstable policies. This paper proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than penalizing congestion alone. The method is evaluated in SUMO (Simulation of Urban MObility) using standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions. Results show that the proposed reward produces better throughput-emission trade-offs and more stable learning behavior than delay or queue-based rewards, as well as classical controllers such as Max Pressure and LQF.

2605.29605 2026-05-29 cs.RO Version update

VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models

Dehao Huang, Aoxiang Gu, Chengjie Zhang, Bolin Zou, Wenlong Dong, Zilang Cen, Yue Wang, Hong Zhang

Affiliations: Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China; Zhongguancun Academy, Beijing, China; National Cybersecurity Academy, Wuhan University, Wuhan, China

AI summary: Proposes VLAConf, a one-class discriminative confidence framework that uses frozen pretrained VLA internal representations and a lightweight confidence head to directly estimate step-wise anomaly scores in a single forward pass, providing efficient task-success confidence estimation that generalizes across architectures.

Comments 11 pages, 7 figures

Abstract

Confidence estimation for Vision-Language-Action (VLA) models is essential for robots to perform manipulation tasks in the open world, providing crucial signals for risk-sensitive decision-making and failure anticipation. Existing confidence estimation methods typically rely on ensemble-based paradigms or action-token probabilities to predict the likelihood of task success. However, they still encounter challenges in computational efficiency and cross-architecture generalizability. These methods usually require repeated sampling, leading to inference inefficiency, and are restricted to VLA models with discrete action outputs, making them difficult to apply to continuous action spaces. To address this issue, we propose VLAConf, a one-class discriminative confidence framework. By leveraging frozen pretrained VLA internal representations, VLAConf directly estimates step-wise anomaly scores in a single forward pass using a lightweight confidence head, thereby eliminating the overhead of exhaustive resampling. We additionally use step-conditioned modeling to encode rollout-phase information along the manipulation trajectory. Experiments on the LIBERO benchmark demonstrate that VLAConf significantly improves the quality of the confidence signal constructed for post-hoc calibration, outperforming existing baselines by a large margin in inference efficiency. The effectiveness of VLAConf is further validated in real-robot experiments. To access the source code and supplementary videos, visit https://sites.google.com/view/vlaconf.

2605.29599 2026-05-29 cs.RO cs.CV Version update

How to Relieve Distribution Shifts in Semantic Segmentation for Off-Road Environments

Ji-Hoon Hwang, Daeyoung Kim, Hyung-Suk Yoon, Dong-Wook Kim, Seung-Woo Seo

Affiliations: Department of Electrical and Communication Engineering, Seoul National University

AI summary: Proposes the ST-Seg framework, which uses style expansion and texture regularization to mitigate distribution shifts caused by source-target domain discrepancies and sensor corruption in off-road scenes, improving the robustness of semantic segmentation.

Comments 8 pages, 6 figures. Accepted to IEEE Robotics and Automation Letters (RA-L). © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

DOI: 10.1109/LRA.2025.3551536
Journal ref: IEEE Robotics and Automation Letters, vol. 10, issue 5, pp. 4500-4507, 2025

Abstract

Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that alter the data differently from the trained conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach for distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.

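2605.24934 2026-05-29 cs.RO cs.AI cs.CV cs.LG Version update

HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, Yiannis Aloimonos

Affiliations: University of Maryland

AI summary: Proposes the HumanEgo framework, which lifts human demonstrations to an entity-level representation of hand-object interaction and trains a flow matching policy with dense auxiliary objectives, enabling zero-shot, robot-data-free, hardware-agnostic skill transfer from human egocentric video to robots.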

Comments Project page: https://humanego-ai.github.io

Abstract

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments. We release HumanEgo as an easy-to-use, open-source framework for learning robot policies directly from human data: https://github.com/TX-Leo/HumanEgo

2605.20752 2026-05-29 cs.RO Version update

GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

Zijian Zhang, Yuqing Jiang, Qian Cheng, Xiaofan Li, Si Liu, Ding Zhao, Ping Luo, Weitao Zhou, Haibao Yu

Affiliations: University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Tsinghua University; Zhejiang University; Beihang University; Carnegie Mellon University; The University of Hong Kong

AI summary: Proposes GaussianDream, a feed-forward 3D Gaussian world-model plug-in that uses learnable queries to encode current-frame 3D spatial structure and short-horizon future evolution. It is supervised during training by static reconstruction and future prediction heads, keeps only the query-conditioned action generation at inference, and achieves state-of-the-art performance on multiple robotic manipulation benchmarks.

Comments 19 pages, 9 figures

Abstract

Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. However, standard action-imitation learning often lacks sufficient modeling of explicit 3D spatial information, dense geometric supervision, and future environment evolution, all critical for precise robotic interaction. To address this, we propose GaussianDream, a feed-forward 3D Gaussian world-model plug-in. Specifically, we introduce learnable GaussianDream Queries in the encoder, enabling the model to capture current-frame 3D spatial structure and short-horizon future evolution. During training, the latent GaussianDream prefix is processed by a static reconstruction head and a future prediction head to produce current 3D Gaussian scene states and future Gaussian evolution states. The current branch is supervised by RGB rendering and depth, while the future branch uses future RGB, depth, and pseudo 3D scene-flow signals. During inference, GaussianDream discards all auxiliary heads and retains only the learned prefix to condition action generation, without test-time Gaussian reconstruction or future prediction. Experimental results demonstrate that GaussianDream achieves state-of-the-art performance across multiple robotic manipulation benchmarks, reaching 98.4% on LIBERO, 54.8% on RoboCasa Human-50, and 50.0% on real-robot tasks. Compared with existing 3D-enhanced VLA methods, GaussianDream achieves strong accuracy while providing higher inference efficiency than video-based world-model approaches.

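2605.01395 2026-05-29 eess.SY cs.RO cs.SY Version update

Quasi-Static Control of Discrete Cosserat Rod

Srishti Siddharth

Affiliations: Centre for Systems and Control; Indian Institute of Technology Bombay; Nanyang Technological University

AI summary: For soft robots modeled as Cosserat rods, designs state-feedback linearizing control laws in strain space and task space, based on a piecewise-constant-strain spatial discretization and using external forces and moments as control inputs, to achieve end-effector trajectory tracking and shape control.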

Comments Submitted to 17th APCA International Conference on Automatic Control and Soft Computing (CONTROLO 2026)

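2604.19011 2026-05-29 cs.LG cs.RO Version update

ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling

Caelan Garrett, Fabio Ramos

Affiliations: NVIDIA Research Seattle Robotics Lab (SRL); University of Sydney

AI summary: Proposes ScheduleStream, the first general-purpose framework that combines hybrid durative actions, domain-independent algorithms, and GPU-accelerated samplers to perform parallel multi-arm task and motion planning and scheduling.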

Comments Project website: https://schedulestream.github.io

Journal ref: 2026 IEEE International Conference on Robotics and Automation (ICRA)

Abstract

Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans in which only one arm moves at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that is a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.
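The difference between a sequential plan and a schedule with parallel arm motion can be illustrated with a toy example. This is a minimal sketch, not the ScheduleStream API: `DurativeAction`, `motion_duration`, and `schedule` are hypothetical names, each action is assumed to occupy exactly one arm, and the greedy earliest-start rule stands in for the paper's planning algorithms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurativeAction:
    name: str
    arm: str            # resource the action occupies while it runs
    duration: float     # seconds
    after: tuple = ()   # names of actions that must finish first


def motion_duration(distance, speed):
    """Duration as a function of the action's parameters (assumed constant speed)."""
    return distance / speed


def schedule(actions):
    """Greedy earliest-start schedule; actions must be given in dependency order."""
    arm_free, start, end = {}, {}, {}
    for a in actions:
        ready = max((end[d] for d in a.after), default=0.0)
        t0 = max(ready, arm_free.get(a.arm, 0.0))
        start[a.name] = t0
        end[a.name] = t0 + a.duration
        arm_free[a.arm] = end[a.name]
    return start, max(end.values(), default=0.0)


actions = [
    DurativeAction("pick_left", "left", motion_duration(1.0, 0.5)),
    DurativeAction("pick_right", "right", motion_duration(1.5, 0.5)),
    DurativeAction("handover", "left", 1.0, after=("pick_left", "pick_right")),
]
start_times, makespan = schedule(actions)
sequential_time = sum(a.duration for a in actions)
```

In this toy case the two picks start together, so the schedule finishes earlier than a plan that moves one arm at a time. A full planner would also have to check that the concurrent motions are collision-free, which this sketch ignores.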

2508.14610 2026-05-29 cs.RO Version update

TRUST-Planner: Topology-guided Robust Trajectory Planner for AAVs with Uncertain Obstacle Spatial-temporal Avoidance

Junzhi Li, Teng Long, Jingliang Sun, Jianxin Zhong

Affiliations: School of Aerospace Engineering, Beijing Institute of Technology; Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education

AI summary: Proposes TRUST-Planner, a topology-guided hierarchical planning framework that uses a dynamic enhanced visible probabilistic roadmap, a uniform terminal-free minimum control polynomial, and a dynamic distance field for robust spatial-temporal obstacle avoidance in complex dynamic environments, achieving a 96% success rate and millisecond-level computation efficiency.

Comments Accepted by IEEE Transactions on Industrial Electronics (TIE) for publication. The final version will be available online at https://ieeexplore.ieee.org/ after publication

DOI: 10.1109/TIE.2026.3695224

Abstract

Despite extensive developments in motion planning of autonomous aerial vehicles (AAVs), existing frameworks face the challenges of local minima and deadlock in complex dynamic environments, leading to increased collision risks. To address these challenges, we present TRUST-Planner, a topology-guided hierarchical planning framework for robust spatial-temporal obstacle avoidance. In the frontend, a dynamic enhanced visible probabilistic roadmap (DEV-PRM) is proposed to rapidly explore topological paths for global guidance. The backend utilizes a uniform terminal-free minimum control polynomial (UTF-MINCO) and dynamic distance field (DDF) to enable efficient predictive obstacle avoidance and fast parallel computation. Furthermore, an incremental multi-branch trajectory management framework is introduced to enable spatio-temporal topological decision-making, while efficiently leveraging historical information to reduce replanning time. Simulation results show that TRUST-Planner outperforms baseline competitors, achieving a 96% success rate and millisecond-level computation efficiency in tested complex environments. Real-world experiments further validate the feasibility and practicality of the proposed method.

2507.23270 2026-05-29 cs.RO cs.SY eess.SY Version update

Simulation-based planning of Motion Sequences for Automated Procedure Optimization in Multi-Robot Assembly Cells

Loris Schneider, Marc Ungen, Elias Huber, Jan-Felix Klein

Affiliations: Institute for Material Handling and Logistics (IFL), Karlsruhe Institute of Technology; Bosch Corporate Research, Robert Bosch GmbH

AI summary: Proposes a simulation-based method that separates assembly steps into core operations and traverse operations and optimizes scheduling with a decomposition-based motion planning strategy, generating efficient, collision-free multi-robot motion sequences that reduce assembly time.

Comments Accepted for publication at IEEE CASE 2026

Abstract

Reconfigurable multi-robot cells offer a promising approach to meet fluctuating assembly demands. However, the recurrent planning of their configurations introduces new challenges, particularly in generating optimized, coordinated multi-robot motion sequences that minimize the assembly duration. This work presents a simulation-based method for generating such optimized sequences. The approach separates assembly steps into task-related core operations and connecting traverse operations. While core operations are constrained and predetermined, traverse operations offer substantial optimization potential. Scheduling the core operations is formulated as an optimization problem, requiring feasible traverse operations to be integrated using a decomposition-based motion planning strategy. Several solution techniques are explored, including a sampling heuristic, tree-based search and gradient-free optimization. For motion planning, a decomposition method is proposed that identifies specific areas in the schedule, which can be solved independently with modified centralized path planning algorithms. The proposed method generates efficient and collision-free multi-robot assembly procedures that outperform a baseline relying on decentralized, robot-individual motion planning. Its effectiveness is demonstrated through simulation experiments.

2409.01159 2026-05-29 cs.RO Version update

Remote telepresence over large distances via robot avatars: case studies

Mohamed Elobaid, Stefano Dafarra, Ehsan Ranjbari, Giulio Romualdi, Tomohiro Chaki, Tomohiro Kawakami, Takahide Yoshiike, Daniele Pucci

Affiliations: Artificial and Mechanical Intelligence AMI (Italian Institute of Technology); Frontier Robotics, Innovative Research Excellence; Honda R&D; Machine Learning and Optimisation, The University of Manchester

AI summary: Discusses how a recently proposed avatar system architecture can be adapted to robots of different morphologies (wheeled, legged, and various hand and kinematic structures) to achieve intercontinental telepresence under bandwidth constraints.

2409.01144 2026-05-29 cs.RO Version update

Adaptive Non-linear Centroidal MPC with Stability Guarantees for Robust Locomotion of Legged Robots

Mohamed Elobaid, Giulio Turrisi, Lorenzo Rapetti, Giulio Romualdi, Stefano Dafarra, Tomohiro Kawakami, Tomohiro Chaki, Takahide Yoshiike, Claudio Semini, Daniele Pucci

Affiliations: Artificial and Mechanical Intelligence (AMI), Istituto Italiano di Tecnologia (IIT); Dynamic Legged Systems (DLS), Istituto Italiano di Tecnologia (IIT); Frontier Robotics, Innovative Research Excellence; Honda R&D, Saitama, Japan; Machine Learning and Optimisation, The University of Manchester

AI summary: Reformulates the centroidal MPC controller using adaptive control and Lyapunov functions, providing closed-loop stability and robustness guarantees for legged robots under unknown payloads and constant disturbances.

DOI: 10.1109/LRA.2025.3536296

Abstract

Nonlinear model predictive locomotion controllers based on the reduced centroidal dynamics are nowadays ubiquitous in legged robots. These schemes, even though they assume an inherent simplification of the robot's dynamics, have been shown to endow robots with a step-adjustment capability in reaction to small pushes and, in the case of uncertain parameters such as unknown payloads, to provide some practical, albeit limited, robustness. In this work, we provide rigorous certificates of their closed-loop stability via a reformulation of the centroidal MPC controller. This is achieved thanks to a systematic procedure inspired by the machinery of adaptive control, together with ideas coming from control Lyapunov functions. Our reformulation, in addition, provides robustness for a class of unmeasured constant disturbances. To demonstrate the generality of our approach, we validated our formulation on a new generation of humanoid robots, the 56.7 kg ergoCub, as well as on a commercially available 21 kg quadruped robot, Aliengo.

2305.10917 2026-05-29 cs.RO Version update

Online Non-linear Centroidal MPC for Humanoid Robots Payload Carrying with Contact-Stable Force Parametrization

Mohamed Elobaid, Giulio Romualdi, Gabriele Nava, Lorenzo Rapetti, Hosameldin Awadalla Omer Mohamed, Daniele Pucci

Affiliations: Mechanical engineering department; Machine Learning and Optimisation, The University of Manchester

AI summary: For humanoid robots walking while carrying payloads, proposes combining online non-linear centroidal model predictive control with a contact-stable force parametrization to track a given footstep trajectory.

2205.04297 2026-05-29 cs.RO cs.AI Version update

Learning A Simulation-based Visual Policy for Real-world Peg In Unseen Holes

Liang Xie, Hongxiang Yu, Kechun Xu, Tong Yang, Minhang Wang, Haojian Lu, Rong Xiong, Yue Wang

Affiliations: College of Control Science and Engineering, Zhejiang University, Zhejiang, China; The Application Innovate Lab, Huawei Incorporated Company, China

AI summary: Proposes a learning-based visual peg-in-hole method that decouples the perception and policy modules, trains on several shapes in simulation, and adapts to arbitrary unseen shapes in the real world with only a small sim-to-real transfer cost.

Abstract

This paper proposes a learning-based visual peg-in-hole method that enables training with several shapes in simulation and adapting to arbitrary unseen shapes in the real world with minimal sim-to-real cost. The core idea is to decouple the generalization of the sensory-motor policy into the design of a fast-adaptable perception module and a simulated generic policy module. The framework consists of a segmentation network (SN), a virtual sensor network (VSN), and a controller network (CN). Concretely, the VSN is trained to measure the pose of the unseen shape from a segmented image. Then, given the shape-agnostic pose measurement, the CN is trained to achieve generic peg-in-hole insertion. Finally, when applying the method to real unseen holes, we only have to fine-tune the SN required by the simulated VSN+CN. To further minimize the transfer cost, we propose to automatically collect and annotate the data for the SN after one minute of human teaching. Simulated and real-world results are presented under eye-to-hand and eye-in-hand configurations. An electric vehicle charging system running the proposed policy achieves a 10/10 success rate in 2-3 s, using only hundreds of auto-labeled samples for the SN transfer.

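2605.29572 2026-05-29 cs.RO cs.HC Version update

Learning to Feel Materials from Multisensory Tactile Data via Interpretable Models

Li Zou, Yasemin Vardar

Affiliations: Delft University of Technology (TU Delft), Department of Cognitive Robotics

AI summary: Proposes an interpretable computational framework that models human material perception and recognition from multisensory tactile data (pressing, static contact, and sliding interactions), finding that thermal and compliance cues are crucial for perceptual modeling and material classification.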

Comments 12 pages, 3 figures, journal

Abstract

Human tactile perception of materials relies on complex multisensory touch cues, yet the relationship between low-level tactile signals and perceptual representations remains poorly understood. This knowledge gap hinders the integration of touch in digital environments and the development of robots capable of human-like tactile perception. Here, we present an interpretable computational framework for modeling human material perception and recognition using multisensory touch data. Our framework comprises three interconnected models: Model 1 maps finger-surface interaction features to psychophysical sensory attributes, Model 2 classifies materials based on these perceptual representations, and Model 3 directly classifies materials from tactile features. The results showed that combining information from pressing, static contact, and sliding interactions improves prediction accuracy, and that thermal cues are particularly informative for both perceptual modeling and material classification. These findings highlight the importance of thermal and compliance cues, which remain underrepresented in current robotic fingers and haptic displays. Incorporating such cues may enhance artificial systems' ability to approximate human material perception and guide the design of more perceptually grounded haptic interfaces.

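2605.29565 2026-05-29 cs.CV cs.RO Version update

From General Vision to Reliable Traversability Estimation: Adapting Vision Foundation Models for Unstructured Outdoor Environments

Ji-Hoon Hwang, Jisung Bae, Dong-Wook Kim, Yeonkyu Lee, Seung-Woo Seo

AI summary: Proposes the ViTA framework, which adapts vision foundation models for reliable traversability estimation in unstructured outdoor environments through learnable prompts, Perspective-Diversified Training, and geometric knowledge distillation, substantially reducing false positives and improving cross-domain generalization.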

Comments 8 pages, 5 figures

Abstract

Vision-based approaches have become the dominant paradigm for traversability estimation in unstructured outdoor environments, typically adapting vision foundation models (VFMs) via semantic segmentation supervision. However, this paradigm faces three fundamental challenges that undermine its reliability: the task-agnostic design of VFMs, the ambiguity of traversability annotations, and the discrepancy between semantic labels and physical safety. We propose Vision-to-Traversability Adaptation (ViTA), a framework that adapts VFMs for reliable traversability estimation, instantiated on SAM2. ViTA injects task-specific knowledge through learnable traversability prompts while preserving the VFM's cross-domain generalization. To handle annotation ambiguity, we introduce Perspective-Diversified Training, which estimates semantic uncertainty to suppress confident predictions at ambiguous boundaries. To bridge the semantic-traversability discrepancy, we distill geometric knowledge during training, enabling slope and elevation reasoning from RGB images alone at inference. The semantic and geometric outputs are fused into a continuous traversability score that reflects both semantic uncertainty and geometric risk. Evaluations across diverse domains, including challenging real-world off-road datasets, demonstrate that ViTA achieves state-of-the-art IoU and Precision with substantial false-positive reduction and strong cross-domain generalization.


2605.29564 2026-05-29 cs.RO 版本更新

VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world Reinforcement Learning for Robust Contact-Rich Manipulation

VE2VF: 基于真实世界强化学习的视觉使能到无视觉蒸馏用于鲁棒接触丰富操作

Victor Kowalski, Chengxi Li, Dongheui Lee

发表机构 * Autonomous Systems, Technische Universitaet Wien (TU Wien)（自动系统，维也纳技术大学）； Institute of Robotics and Mechatronics (DLR)（机器人与机电研究所）

AI总结提出一种人在环强化学习框架，通过教师-学生蒸馏将视觉使能策略的知识迁移到仅依赖本体感知的无视觉策略，在真实世界训练中实现鲁棒泛化，无需域随机化或数据增强。

详情

AI中文摘要

当使用强化学习进行接触丰富的机器人操作时，视觉可以提供任务相关信息，加速学习，超越仅靠本体感知所能达到的效果。然而，视觉使能策略容易过拟合训练时看到的视觉条件，限制了其鲁棒性和可迁移性。我们提出一种人在环强化学习框架，采用教师-学生蒸馏，在完全真实世界训练中实现跨多个任务变体的鲁棒性能，无需域随机化或数据增强。视觉使能教师将其知识蒸馏到仅依赖位姿、扭转和力传感的无视觉学生中，结合了快速训练与强任务泛化。在真实世界的NIST装配基准板上，我们的方法在3个代表性任务上经过约50分钟训练后达到95%的整体成功率，包括对8个未见任务变体的鲁棒泛化。通过蒸馏微调在最困难的任务上实现了完全成功。我们证明所得策略在鲁棒性和适应性上均优于基线。

英文摘要

When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist, and wrench sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.


2605.29562 2026-05-29 cs.RO cs.AI cs.CV 版本更新

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

VLA-Pro：面向视觉-语言-动作模型的跨任务程序性记忆迁移

Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI, Fudan University（复旦大学可信具身人工智能研究院）； Shanghai Key Laboratory of Multimodal Embodied AI（上海多模态具身人工智能重点实验室）； Shanghai Xinzhi Embodied Intelligence Technology Co., Ltd.（上海新智具身智能技术有限公司）

AI总结提出VLA-Pro框架，通过存储和检索任务相关的LoRA适配器作为程序性记忆，实现跨任务泛化，在仿真和真实任务中成功率显著提升。

详情

AI中文摘要

视觉-语言-动作（VLA）模型在通用机器人操作中展现出强大潜力，但在泛化到需要跨物体、场景和动作模式迁移相关经验的新任务时仍面临挑战。本文提出VLA-Pro，一种即插即用框架，通过在训练时存储任务相关的程序性记忆并在推理时迁移这些记忆来增强跨任务泛化。具体而言，VLA-Pro在训练时将任务特定的LoRA适配器存储为参数化的程序性记忆。在推理时，VLA-Pro基于当前多模态上下文检索相关程序性记忆，并动态融合这些记忆以生成当前动作块。在RoboTwin、RLBench和真实世界操作任务上的实验表明，VLA-Pro在多个骨干网络上持续提升跨任务泛化能力，在仿真中实现高达207%的相对改进，并将真实世界成功率从5.8%提升至65.0%。这些结果表明，程序性记忆检索与自适应为将操作经验迁移到新任务提供了一种有效机制，同时保持了模块化和执行稳定性。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.
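
As a rough illustration of the retrieve-and-fuse step described in the abstract, the sketch below looks up the stored adapters whose keys are most similar to the current context and blends their weight deltas. The key/delta representation, the cosine-similarity retrieval, the softmax weighting, and every name here are assumptions made for illustration; the abstract does not say how VLA-Pro actually retrieves or fuses its memories.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_and_fuse(context, memory, top_k=2, temperature=0.1):
    """Hypothetical retrieval + fusion over stored adapter deltas.

    memory: list of (key_vector, delta_vector) pairs, one per stored task adapter.
    Returns (weights, fused_delta) for the top_k most similar memories.
    """
    # Score every stored memory against the current context and keep the best top_k.
    scored = sorted(
        ((cosine(context, key), delta) for key, delta in memory),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]

    # Softmax over similarities (shifted by the max for numerical stability).
    sims = [s for s, _ in scored]
    peak = max(sims)
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the retrieved adapter deltas.
    dim = len(scored[0][1])
    fused = [
        sum(w * delta[i] for w, (_, delta) in zip(weights, scored))
        for i in range(dim)
    ]
    return weights, fused
```

A lower `temperature` makes the fusion behave more like picking the single closest memory, while a higher one averages the retrieved adapters more evenly.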


2605.29438 2026-05-29 cs.RO 版本更新

ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

ElegantVLA：学习何时思考以实现高效的视觉-语言-动作模型

Ye Li, Huanan Liu, Kangye Ji, Yuan Meng, Jiajun Fan, Yuansong Wang, Shiyu Qin, Chenglei Wu, Shu-Tao Xia, Zhi Wang

发表机构 * Tsinghua University（清华大学）； University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结提出ElegantVLA，一种即插即用的相位自适应推理框架，通过动态计算调度在视觉编码器、大语言模型和动作头之间分配计算资源，实现VLA模型加速，在GR00T和CogACT上分别获得最高2.55倍和3.77倍加速。

详情

AI中文摘要

视觉-语言-动作（VLA）模型是通用机器人控制的一种强大范式。然而，其高计算成本和有限的控制频率阻碍了实时机器人操作，尤其是在每个控制步骤都运行大型视觉-语言骨干网络和迭代动作头时。现有的VLA加速方法通常优化单个组件或依赖固定的加速规则，对不同控制步骤采用大致固定的计算量，忽略了序列化具身控制的非均匀推理需求。受人类运动控制的启发，其中认知和反馈资源集中在目标敏感阶段，我们认为VLA模型应该学习何时投入完整计算以及何时重用先前的计算。我们提出ElegantVLA，一种即插即用的相位自适应推理框架，通过模型内动态计算调度加速VLA模型。ElegantVLA引入一个轻量级调度器，观察时间表示相似性、机器人运动线索和任务进度，联合分配视觉编码器、大语言模型和动作头的计算。对于感知-语言推理，调度器根据视觉-语言表示稳定性选择五级视觉-大语言模型计算模式，从完全重计算到多步时间重用。对于动作生成，它选择三级去噪模式，在稳定运动期间重用中间去噪状态，同时在目标敏感阶段保留完整细化。通过协调这些决策，ElegantVLA为具有显式动作生成模块的现代VLA流水线提供了一个通用加速框架，无需修改或重新训练基础模型。在GR00T和CogACT上的实验分别实现了最高2.55倍和3.77倍的加速，在六个真实世界的GR00T任务中，ElegantVLA将计算量减少了2.18倍，同时将控制频率从13.8 Hz提高到26.3 Hz。

英文摘要

Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.
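
To make the idea of a stability-driven compute mode concrete, here is a deliberately simplified sketch that maps the similarity between consecutive vision-language representations to one of five levels. The abstract describes a lightweight scheduler that also uses robot-motion cues and episode progress; the fixed threshold rule, the threshold values, and the function names below are assumptions for illustration only, not the authors' scheduler.

```python
import math

# Hypothetical similarity thresholds separating the five compute levels.
VLM_THRESHOLDS = (0.80, 0.90, 0.95, 0.98)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_vlm_mode(similarity, thresholds=VLM_THRESHOLDS):
    """Return a mode from 0 (full recomputation) to 4 (most aggressive reuse).

    The mode is the number of thresholds the similarity reaches, so more
    stable representations are assigned more temporal reuse.
    """
    return sum(similarity >= t for t in thresholds)
```

For example, `select_vlm_mode(cosine_similarity(prev_feat, curr_feat))` would be evaluated once per control step, with `prev_feat` and `curr_feat` standing in for whatever representation the scheduler observes.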


2605.29416 2026-05-29 cs.RO cs.CV 版本更新

开放运动规划库2.0

Weihang Guo, Theodoros Tyrovouzis, Emiliano Flores, Clayton W. Ramsey, Zachary K. Kingston, Ioan A. Şucan, Mark Moll, Lydia E. Kavraki

发表机构 * Department of Computer Science, Rice University（计算机科学系，莱斯大学）； Department of Computer Science, Purdue University（计算机科学系，普渡大学）； Waymo, LLC（Waymo公司）； Metron, Inc.（Metron公司）； Ken Kennedy Institute at Rice University（莱斯大学Ken Kennedy研究所）

AI总结本文介绍OMPL 2.0，通过硬件加速实现实时运动规划，并集成现代AI研究流程，总结了库与运动规划领域的共同发展及其对研究社区的影响。

2605.29298 2026-05-29 cs.RO 版本更新

MonoDuo: Using One Robot Arm to Learn Bimanual Policies

MonoDuo: 使用单机械臂学习双臂策略

Sandeep Bajamahal, Lawrence Yunliang Chen, Toru Lin, Zehan Ma, Jitendra Malik, Ken Goldberg

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出MonoDuo框架，利用单臂机器人演示和人类协作数据，通过数据增强生成合成演示，训练双臂机器人策略，在五项任务中实现零样本部署和少样本微调，成功率高达70%。

Comments Accepted to appear in the 2026 IEEE International Conference on Robotics and Automation (ICRA), Vienna, Austria, 1-5 June 2026

详情

AI中文摘要

双臂协调对于许多现实世界的操作任务至关重要，然而学习双臂机器人策略受到双臂机器人和数据集稀缺的限制。相比之下，单臂机器人在研究实验室中广泛可用。我们能否利用它们来训练双臂机器人策略？我们提出MonoDuo，一个利用单臂机器人演示与人类协作来学习双臂操作策略的框架。MonoDuo通过遥操作单臂机器人执行双臂任务的一侧，同时由人类执行另一侧来收集数据，然后交换角色以覆盖两侧。来自腕部安装和固定摄像头的RGB-D观测通过最先进的手部姿态估计、图像和点云分割以及修复，被增强为目标双臂机器人的合成演示。这些基于真实机器人运动学的合成演示用于训练双臂策略。我们在五项任务上评估MonoDuo：举箱、背包打包、叠布、拉拉链和递盘子。与仅依赖人类双臂视频的方法相比，MonoDuo能够在未见过的双臂机器人配置上实现零样本部署，成功率高达70%。仅使用25个目标机器人演示进行少样本微调，相比从头训练，成功率进一步提升65-70%，展示了MonoDuo在将单臂机器人数据高效迁移到双臂机器人策略方面的有效性。

英文摘要

Bimanual coordination is essential for many real-world manipulation tasks, yet learning bimanual robot policies is limited by the scarcity of bimanual robots and datasets. Single-arm robots, however, are widely available in research labs. Can we leverage them to train bimanual robot policies? We present MonoDuo, a framework for learning bimanual manipulation policies using single-arm robot demonstrations paired with human collaboration. MonoDuo collects data by teleoperating a single-arm robot to perform one side of a bimanual task while a human performs the other, then swapping roles to cover both sides. RGB-D observations from a wrist-mounted and fixed camera are augmented into synthetic demonstrations for target bimanual robots using state-of-the-art hand pose estimation, image and point cloud segmentation, and inpainting. These synthetic demonstrations, grounded in real robot kinematics, are used to train bimanual policies. We evaluate MonoDuo on five tasks: box lifting, backpack packing, cloth folding, jacket zipping, and plate handover. Compared to approaches relying solely on human bimanual videos, MonoDuo enables zero-shot deployment on unseen bimanual robot configurations, achieving success rates up to 70%. With only 25 target robot demonstrations, few-shot finetuning further boosts success rates by 65-70% over training from scratch, demonstrating MonoDuo's effectiveness in efficiently transferring knowledge from single-arm robot data to bimanual robot policies.


2605.29254 2026-05-29 cs.RO cs.AI 版本更新

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

极端动态对称性实现全向多功能机器人

Jiaxun Liu, Boxi Xia, Boyuan Chen

发表机构 * Department of Mechanical Engineering and Materials Science, Duke University（杜克大学机械工程与材料科学系）； Department of Electrical and Computer Engineering, Duke University（杜克大学电气与计算机工程系）； Department of Computer Science, Duke University（杜克大学计算机科学系）

AI总结本文提出动态对称性概念，通过动态各向同性度量，在超过1000种模拟形态中发现高动态对称性可提升轨迹跟踪、任务成功率、鲁棒性等性能，并开发了Argus球形机器人系列验证近极端动态各向同性带来的全向运动、自适应地形、快速自稳定和抗故障能力。

Comments Published in Science Robotics (2026). Our project website is at: https://generalroboticslab.com/Argus

详情

Journal ref: Science Robotics 11, eaec1725 (2026)

AI中文摘要

对称性是自然系统中的核心组织原则，但其作为机器人统一设计策略的应用仍主要局限于几何形态。我们证明，对称性可以在动态驱动能力层面加以利用。我们引入动态对称性，即机器人可达质心加速度的均匀性，并通过称为动态各向同性的度量将其形式化。在超过1000种模拟形态中，我们发现更高的动态对称性持续改善了轨迹跟踪、任务成功率、鲁棒性、恢复能力和能量效率，且当动态各向同性接近其理论极限时，效益最为显著。为了系统地研究这一机制，我们开发了Argus，一系列球形机器人，旨在探索增加动态对称性的效果。Argus家族的成员在驱动几何和动态对称性水平上有所不同，但共享一个共同架构原则：径向定向的线性致动器直接塑造机器人的质心动力学。其中，我们构建了一个物理的20腿Argus变体，实现了接近极端的动态各向同性，并展示了方向无关的运动、在杂乱和可变形地形上的敏捷穿越、快速自稳定以及对部分致动器故障的鲁棒性。其分布式感知进一步实现了在连续运动中的全向感知和物体交互。这些结果表明，不仅在形态上而且在可达动力学上设计机器人的对称性，为在不确定的地球和地外环境中实现敏捷性、鲁棒性和多功能性提供了一条强大且通用的途径。

英文摘要

Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained limited to geometric form. We show that symmetry can instead be leveraged at the level of dynamic actuation capability. We introduce dynamic symmetry, the uniformity of a robot's attainable center-of-mass accelerations, and formalize it through a measure coined as dynamic isotropy. Across more than 1000 simulated morphologies, we found that higher dynamic symmetry consistently improved trajectory tracking, task success, robustness, resiliency, and energy efficiency, with the benefits becoming most pronounced as dynamic isotropy approached its theoretical limit. To study this regime systematically, we developed Argus, a family of spherical robots designed to explore the effects of increasing dynamic symmetry. Members of the Argus family vary in their actuation geometry and dynamic symmetry level while sharing a common architectural principle: radially oriented linear actuators that directly shape the robot's center-of-mass dynamics. Among them, we built a physical 20-leg Argus variant that achieved near-extreme dynamic isotropy and demonstrated orientation-invariant locomotion, agile traversal of cluttered and deformable terrain, rapid self-stabilization, and resilience to partial actuator failures. Its distributed sensing further enabled omnidirectional perception and object interaction during continuous motion. These results show that designing robots for symmetry not only in morphology but also in their attainable dynamics provides a powerful and general pathway toward agility, robustness, and multifunctionality in uncertain terrestrial and extraterrestrial environments.


2605.29191 2026-05-29 eess.SY cs.RO cs.SY math.OC 版本更新

Distributed Non-Uniform Scaling Control of Multi-Agent Formation with Dynamic Agent Joining

具有动态加入智能体的多智能体编队分布式非均匀缩放控制

Tao He, Gangshan Jing

发表机构 * School of Automation, Chongqing University, Chongqing, China（重庆大学自动化学院，重庆，中国）

AI总结针对动态加入智能体的多智能体编队，提出一种分布式非均匀缩放控制框架，通过保持图拉普拉斯矩阵的谱特性实现任意维度下的编队形状调整。

Comments This paper has been accepted by IFAC 2026

2605.29155 2026-05-29 cs.RO cs.AI cs.DC 版本更新

CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control

CA-AC-MPC: CUDA加速的Actor-Critic模型预测控制

Antoonio Buo, Vittorio Cammarota, Michele Avagnale, Pierluigi Arpenti, Vincenzo Lippiello, Fabio Ruggiero

发表机构 * PRISMA Lab and CREATE Consortium, Department of Electrical Engineering and Information Technology, University of Naples Federico II（PRISMA实验室和CREATE联盟，电气工程与信息技术系，那不勒斯费德里科二世大学）

AI总结提出CUDA加速的AC-MPC变体，通过GPU并行优化降低训练和推理延迟，在敏捷无人机竞速任务中实现最先进圈速和近极限动态性能。

Comments Accepted for presentation at the 2026 International Conference on Unmanned Aircraft Systems, ICUAS 2026

2605.29144 2026-05-29 cs.RO cs.SY eess.SY 版本更新

Learning and Adaptation in Wire Arc Additive Manufacturing Bead Geometry Control

线弧增材制造焊道几何控制中的学习与自适应

Chen-Lung Lu, John Wen

发表机构 * Rensselaer Polytechnic Institute（伦斯勒理工学院）

AI总结针对线弧增材制造中热场与几何耦合的非线性动态过程，提出基于循环神经网络和一步预测控制的数据驱动方法，并通过逐层预测误差更新模型实现自适应，实验验证了在焊道高度和宽度一致性上的显著提升。

详情

AI中文摘要

机器人线弧增材制造（WAAM）受复杂非线性过程动力学控制，将热场与构建几何耦合。该过程可视为多输入/多输出动态系统，以焊枪速度和送丝速率作为输入，焊道沉积高度和宽度作为输出。本文利用输入/输出数据学习数据驱动模型，并将其用于焊道规划和控制。我们证明，简单的循环神经网络架构和一步预测控制可以在高度和宽度一致性方面改善过程性能。为了考虑打印过程中热条件的变化，我们使用前一层的预测误差更新学习模型。该自适应步骤进一步提高了预测精度和控制器性能。在集成线扫描反馈的机器人WAAM实验平台上进行的实验表明，与恒定输入和静态模型基线相比，高度和宽度一致性有显著改善。所提出的学习和自适应框架为实现增材制造过程的鲁棒、数据驱动调控提供了实用途径。

英文摘要

Robotic Wire Arc Additive Manufacturing (WAAM) is governed by complex, nonlinear process dynamics that couple the thermal field to the build geometry. The process may be regarded as a multi-input/multi-output dynamical system with welding torch speed and wire feed rate as inputs and weld bead deposition height and width as outputs. In this paper, we use the input/output data to learn a data-driven model and use it for weld planning and control. We show that a simple recurrent neural network architecture and one-step-ahead predictive control can improve the process performance in terms of height and width consistency. To account for the changing thermal conditions during the printing process, we update the learned model using the prediction error from the previous layer. This adaptation step further improves the prediction accuracy and controller performance. Experiments on a robotic WAAM testbed with integrated line-scanner feedback show significant improvements in height and width consistency compared to constant-input and static-model baselines. The proposed learning and adaptation framework provides a practical pathway toward robust, data-driven regulation of additive manufacturing processes.
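
One-step-ahead predictive control, as mentioned in the abstract, can be sketched as choosing the input whose predicted next output is closest to the target. The candidate-grid search, the squared-error cost, and the `predict` callable (a stand-in for the learned recurrent model) are assumptions for illustration; the abstract does not describe the optimizer or model interface actually used.

```python
def one_step_control(predict, state, target, candidates):
    """Pick the input minimizing the one-step-ahead squared tracking error.

    predict:    callable (state, u) -> (height, width); stand-in for the learned model.
    state:      whatever history or hidden state the model needs.
    target:     desired (height, width) of the bead.
    candidates: iterable of inputs u = (torch_speed, wire_feed_rate) to evaluate.
    """
    def cost(u):
        height, width = predict(state, u)
        return (height - target[0]) ** 2 + (width - target[1]) ** 2

    return min(candidates, key=cost)
```

In an adaptive variant, `predict` would be refreshed after each layer using that layer's prediction error before the next call.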


2605.29138 2026-05-29 cs.RO cs.AI cs.LG cs.SY eess.SY 版本更新

Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving

用于优化自动驾驶延迟-准确性权衡的多分辨率端到端深度神经网络

Qitao Weng, Heechul Yun

发表机构 * University of Kansas Lawrence（堪萨斯大学劳伦斯分校）

AI总结提出一种多分辨率端到端CNN，通过运行时选择输入分辨率和分辨率重定向，在延迟预算下优化自动驾驶的延迟-安全性权衡。

Comments ICCPS 2026

详情

AI中文摘要

延迟-准确性权衡是深度神经网络在信息物理系统实时应用中的基础。在自动驾驶中，安全性尤其依赖于预测质量和从感知到执行的端到端延迟。我们观察到：(1) 当考虑延迟时，延迟最优的网络配置随场景上下文和计算可用性而变化；(2) 单一固定分辨率模型在条件变化时变得次优。我们提出了一种用于CARLA城市驾驶挑战的多分辨率端到端深度神经网络，使用单目摄像头输入。我们的方法采用支持多种输入分辨率的卷积神经网络，通过每分辨率批归一化，使得在延迟预算下运行时选择理想输入尺度成为可能，以及分辨率重定向，允许在没有原始训练数据集的情况下进行多分辨率训练。我们在CARLA中实现并评估了我们的多分辨率端到端CNN，以探索延迟-安全性边界。结果显示，相对于固定分辨率基线，每条路线的安全性指标——车道入侵、红灯违规和碰撞——一致改善。

英文摘要

Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change. We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset. We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. Results show consistent improvements in per-route safety metrics - lane invasions, red-light infractions, and collisions - relative to fixed-resolution baselines.
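
Runtime selection of an input scale under a latency budget can be illustrated with a simple rule: take the largest resolution whose measured latency still fits the budget. This rule, the latency profile format, and the function name are assumptions for illustration; the abstract notes that the latency-optimal configuration also depends on scene context, which this sketch ignores.

```python
def select_resolution(latency_ms, budget_ms):
    """Choose an input resolution given per-resolution latency measurements.

    latency_ms: dict mapping (width, height) -> measured inference latency in ms.
    budget_ms:  latency budget for the current control step.
    Returns the largest resolution that fits the budget, or the fastest
    resolution if none fits.
    """
    feasible = [res for res, t in latency_ms.items() if t <= budget_ms]
    if feasible:
        # Prefer the most pixels among resolutions that meet the budget.
        return max(feasible, key=lambda res: res[0] * res[1])
    # Nothing fits: fall back to the lowest-latency option.
    return min(latency_ms, key=latency_ms.get)
```

With a profile such as `{(160, 120): 8.0, (320, 240): 15.0, (640, 480): 40.0}` and a 20 ms budget, the rule selects `(320, 240)`.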


2605.29114 2026-05-29 cs.CR cs.LG cs.RO 版本更新

ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving

ReasonBreak: 探测自动驾驶中具备推理能力的视觉-语言-行动模型的脆弱性

Mohammadreza Teymoorianfard, Jean-Philippe Monteuuis, Jonathan Petit, Amir Houmansadr

发表机构 * University of Massachusetts Amherst（马萨诸塞大学阿默斯特分校）； Qualcomm（高通）

AI总结本文通过黑盒攻击方法，首次系统研究了具备推理能力的视觉-语言-行动模型在自动驾驶中面对真实输入扰动时的脆弱性，发现其推理和轨迹生成均易受攻击，导致碰撞率上升。

详情

AI中文摘要

具备集成推理能力的视觉-语言-行动（VLA）模型已被提出用于端到端自动驾驶，假设推理与轨迹生成之间存在紧密耦合。然而，此类系统在真实输入扰动下的鲁棒性尚未得到充分探索。我们表明，这些模型对真实输入扰动高度脆弱，在闭环仿真中推理攻击成功率高达89%，轨迹操控攻击成功率高达72%，导致碰撞率上升和安全指标下降。以NVIDIA近期开发的Alpamayo模型为代表，我们首次对具备推理能力的VLA模型在真实文本输入损坏下进行了系统性黑盒研究，评估了其对推理和驾驶行为的影响。我们引入了一个推理感知评估框架，捕捉推理的语义和结构方面，并结合以安全为中心的度量。我们还引入了一个基准，用于评估自动驾驶中推理-轨迹交互的攻击与防御。我们的结果强调了严格评估和改进防御的必要性，以确保自动驾驶中具备推理能力的VLA系统的安全性。

英文摘要

Vision-Language-Action (VLA) models with integrated reasoning have been proposed for end-to-end autonomous driving, assuming a tight coupling between reasoning and trajectory generation. However, the robustness of such systems under realistic input perturbations remains largely unexplored. We show that these models are highly vulnerable to realistic input perturbations, achieving up to 89% attack success rate (ASR) on reasoning and up to 72% on trajectory manipulation in closed-loop simulation, leading to increased collision rates and degraded safety metrics. Using NVIDIA's recent Alpamayo models as representative industry-developed VLAs, we conduct the first systematic black-box study of reasoning-enabled VLA models under realistic textual input corruptions, evaluating their impact on reasoning and driving behavior. We introduce a reasoning-aware evaluation framework capturing both semantic and structural aspects of reasoning, along with safety-centric measures. We also introduce a benchmark for evaluating attacks and defenses on reasoning-trajectory interactions in autonomous driving. Our results highlight the need for rigorous evaluation and improved defenses to ensure the safety of reasoning-enabled VLA systems in autonomous driving.


2605.29091 2026-05-29 cs.RO cs.MA 版本更新

Human-in-the-Loop Swarms: A Bionic Swarm Approach to Real-World Soil Mapping

人在环路的群体：一种用于真实土壤测绘的仿生群体方法

Petras Swissler, Mohammadali Rashidioun, Nicholas Sahu, Raaid Kabir, Ayodeji Aderibigbe, Oladoyin Kolawole

发表机构 * New Jersey Institute of Technology（新泽西理工学院）； University of Washington（华盛顿大学）

AI总结提出Bionic Swarm系统，通过人类用户替代机器人难以实现的任务，结合蓝牙传感器和集中式服务器运行群体算法，并在真实户外环境中验证了Score-Biased-Search算法，降低了实地群体机器人研究的门槛。

Comments 27 pages, 15 figures. Submitted to Advanced Intelligent Systems

详情

AI中文摘要

由于部署硬件的成本高和开发时间长，群体和现场机器人技术在真实世界验证中面临重大障碍。本文介绍了“Bionic Swarm”，一种新颖的系统，通过抽象出许多难以在机器人上实现但对整体算法评估无贡献的任务，并将这些任务交给人类用户，从而降低了这些障碍。这些人类用户通过智能手机网页应用接收指令，该应用从蓝牙连接的传感器获取测量数据并将其转发到集中式服务器。该服务器运行群体算法并向人类用户指示行动。我们通过实验验证了一种名为Score-Biased-Search的岩土聚焦搜索算法来评估该系统，该算法通过为重建地图上的每个位置分配“分数”，然后通过预期分数较高的区域偏置搜索模式，并表现出相对于搜索代理数量的超线性地图重建。在展示该算法的模拟结果后，我们在Bionic Swarm平台上应用该算法，以验证其在真实户外环境中的功能。这项工作表明，这种人在环路的方法显著降低了现场和群体机器人研究的入门门槛。

英文摘要

Swarm and field robotics face significant barriers to real-world validation due to the high cost and development time to deploy hardware. This paper introduces the "Bionic Swarm," a novel system that lowers these barriers by abstracting away many of the tasks that are difficult to implement on robots but which do not contribute to the overall algorithm evaluation, giving these tasks to human users. These human users take directions from a smartphone web-app that takes measurements from Bluetooth-connected sensors and relays them to a centralized server. This server runs the swarm algorithm and directs actions to the human users. We evaluate this system through the experimental validation of a geotechnically-focused search algorithm named Score-Biased-Search, which functions by assigning a "score" to each location on a reconstructed map, then biases search patterns through areas of higher expected scores, and which exhibits superlinear map reconstruction relative to the number of search agents. After presenting simulation results for the algorithm, we then apply the algorithm on the Bionic Swarm platform to validate its function in a real-world, outdoor setting. This work demonstrates that this human-in-the-loop approach significantly lowers the barrier to entry for field and swarm robotics research.


2605.29074 2026-05-29 cs.CV cs.RO 版本更新

Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

Embodied3DBench: 视觉语言模型低级具身空间智能的基准测试

Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong

发表机构 * CFCS, School of CS, PKU（计算机学院CFCS，北京大学）； Jingdong Technology Information Technology Co., Ltd（京东科技信息技术有限公司）

AI总结提出Embodied3DBench基准，通过6类任务（空间结构理解与交互导向感知）系统评估视觉语言模型在3D环境中的低级空间智能，并合成130万QA对训练数据以弥补能力差距。

详情

AI中文摘要

当前的视觉语言模型（VLM）是否准备好理解和推理3D环境中的复杂具身交互？我们引入了Embodied3DBench，一个以机器人为中心的基准，针对具身3D环境中的低级空间智能。为了系统评估这些基础感知能力，该基准包括6个任务类别，分为两个核心组：空间结构理解（定位、空间关系预测和多视图对应）和交互导向感知（可供性预测、抓取点预测和轨迹预测）。该基准涵盖12个子类别，包含超过21k个高质量问答对。我们评估了13个最先进的模型，结果显示，尽管当前模型在高级空间推理（如理解对象间位置关系）方面表现相对较强，但在交互导向感知方面仍然脆弱，突显了缺乏鲁棒的3D感知交互先验。为了积极弥合基准揭示的能力差距，我们进一步合成了一个包含130万问答对的大规模训练数据集。值得注意的是，在该数据集上微调显著提升了低级空间智能。最终，Embodied3DBench通过提供系统评估框架和可扩展的数据解决方案填补了关键空白，为交互感知多模态系统的发展设定了明确目标。

英文摘要

Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.


2605.28883 2026-05-29 cs.AI cs.RO 版本更新

Ultra-Reduced-Impact-Encased-Logging (URIEL): propose a new method for selective sustainable logging and post-harvest silvicultural treatment in tropical forest using airborne robotics systems

超低影响包裹式伐木（URIEL）：提出一种利用空中机器人系统在热带森林中进行选择性可持续伐木和采后造林处理的新方法

Daniel Albiero, Gelton Fernando de Morais, Daniela Han, Flávio Roberto de Freitas Gonçalves, Artur Vitório Andrade Santos, Wesllen Lins de Araújo, Alessandra Maia Freire, Cláudio Kiyoshi Umezu, Mateus Peressin, Francesco Toscano, Admilson Írio Ribeiro, Alfeu J. Sguarezi Filho, Américo Ferraz Dias Neto, Angel Pontin Garcia

发表机构 * School of Agricultural Engineering, University of Campinas (UNICAMP)（坎皮纳斯大学农业工程学院）； School of Mechanical Engineering, University of Campinas (UNICAMP)（坎皮纳斯大学机械工程学院）； Depart. of Agricultural, Forestry, Food and Environmental Sciences, University of Basilicata（巴西利卡塔大学农业、林业、食品与环境科学系）； Sorocaba Environmental Engineering, São Paulo State University (UNESP)（圣保罗州立大学索罗卡巴环境工程）； Center for Engineering, Modeling and Applied Social Sciences, Federal University of ABC (UFABC)（ABC联邦大学工程、建模和应用社会科学中心）

AI总结提出URIEL方法，结合直升机伐木、机器人、AI和无人机采后造林处理，实现高经济可行性和几乎零附带损害，维持生态系统服务。

Comments 196 pages, 40 figures, A revolutionary technology to help protect tropical forests. It was developed, scaled, detailed, calculated, and simulated in an advanced computational environment, with economic and social viability. "E pur si muove"

详情

AI中文摘要

全球热带森林正面临由经济和政治利益驱动的强烈砍伐压力，科学证据表明这种砍伐加剧了气候变化。本文提出了一种新颖的热带森林伐木方法——超低影响包裹式伐木（URIEL）。该方法基于直升机伐木技术，结合机器人技术和人工智能的密集使用，以及由无人机执行的采后造林处理。为此方法开发了合适的设备概念，确定了尺寸，在数字概念验证中完成了细节，并对各种直升机-木材-距离组合进行了有效的数字模拟和经济可行性分析。结果表明，URIEL方法具有高经济可行性，并能在维持生态系统服务的同时几乎消除对森林的附带损害。本文的主要结论是，尽管取得了令人满意的科学和技术成果，但URIEL方法的可行性取决于相关利益相关者的整合：高科技产业、政治政府、认证伐木公司和原住民。

英文摘要

Tropical forests worldwide are under intense deforestation pressure driven by economic and political interests, and scientific evidence suggests this deforestation contributes to climate change. This paper proposes a novel logging method for tropical forests, Ultra-Reduced-Impact-Encased-Logging (URIEL). This new method is based on heli-logging techniques combined with intensive use of robotics and AI integrated with post-harvest silvicultural treatments performed by drones. The concept of appropriate equipment for this method was developed, dimensions were determined, details were completed in a digital proof of concept, and an effective digital simulation and economic feasibility analysis were carried out for various helicopter-timber-distance combinations. The results demonstrated that the URIEL method has high economic viability and makes it possible to virtually eliminate collateral damage to forests while maintaining ecosystem services. The main conclusion of this paper is that, despite the satisfactory scientific and technological results, the feasibility of the URIEL method depends on the integration of stakeholders intrinsic to the context: high-tech industry; political governments; certified logging companies; and native populations.


2605.22082 2026-05-29 cs.RO cs.LG 版本更新

CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation

CoRMA: 用于接触丰富元适应的对比RMA

Wentian Wang, Chutong Wen, Hongxu Ma, Wuhao Wang, Zhexiong Xue, Abdul Haseeb Nizamani, Dandi Zhou, Xinhai Sun, Jianqiao Zhu

发表机构 * Synthoid AI

AI总结提出CoRMA框架，通过语义接触上下文和对比学习实现力主导装配任务的元适应，无需演示或梯度更新，在仿真和真实机器人上优于基线。

详情

AI中文摘要

我们提出CoRMA（对比机器人运动适应），一个基于上下文的元适应框架，修改了RMA以适用于力主导的装配任务。CoRMA用紧凑的6维仅仿真语义接触上下文（描述接触开始、侧向接合、引导过渡、接触方向和卡滞）替换原始仿真器参数适应。一个可部署的因果Transformer适配器通过语义回归和力状态对比目标，从力、本体感受和动作历史中在线推断该上下文。部署时，移除真实上下文并由推断上下文替代，从而无需演示、特权输入或梯度更新即可实现片段内适应。我们在Isaac Lab / Isaac Sim 5.0中的PegInsert、GearMesh和NutThread任务以及真实Marvin机械臂上评估CoRMA。与在仿真中成功率高但在硬件上大幅下降的FORGE基线相比，CoRMA在受控目标位姿噪声下保留了更高的验证真实成功率。这些结果支持语义接触推断作为相关装配任务族内可复用的适应接口，而更广泛的未见任务泛化和Real2Sim校准仍是未来工作。

英文摘要

We present CoRMA (Contrastive Robotic Motor Adaptation), a context-based meta-adaptation framework that modifies RMA for force-dominant assembly. CoRMA replaces raw simulator-parameter adaptation with a compact 6D simulator-only semantic contact context describing contact onset, lateral engagement, guided transition, contact direction, and jamming. A deployable causal Transformer adapter infers this context online from force, proprioceptive, and action histories using semantic regression and a force-regime contrastive objective. At deployment, oracle context is removed and replaced by the inferred context, enabling within-episode adaptation without demonstrations, privileged inputs, or gradient updates. We evaluate CoRMA on PegInsert, GearMesh, and NutThread in Isaac Lab / Isaac Sim 5.0 and on a real Marvin arm. Compared with FORGE baselines that achieve high simulation success but degrade substantially on hardware, CoRMA retains higher verified real success under controlled target-pose noise. These results support semantic contact inference as a reusable adaptation interface within a related assembly task family, while broader unseen-task generalization and Real2Sim calibration remain future work.


2605.01663 2026-05-29 cs.LG cs.RO 版本更新

Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

基于流锚定噪声条件Q学习的离线强化学习：高效且表达力强的方法

Sungyoung Lee, Dohyeong Kim, Eshan Balachandar, Zelal Su Mustafaoglu, Keshav Pingali

发表机构 * The University of Texas at Austin, Austin, TX, USA（德克萨斯大学奥斯汀分校）； Independent Researcher, Seoul, South Korea（首尔独立研究者）

AI总结提出FAN算法，通过单次流策略迭代和单高斯噪声样本实现高效离线强化学习，在保持高性能的同时显著降低计算成本。

Comments ICML 2026

详情

AI中文摘要

我们提出流锚定噪声条件Q学习（FAN），一种高效且高性能的离线强化学习算法。近期工作表明，表达力强的流策略和分布性评论家能提升离线强化学习性能，但计算成本高。具体而言，流策略需要迭代采样才能产生单个动作，分布性评论家需要计算多个样本（如分位数）来估计价值。为解决这些低效问题并保持高性能，我们引入FAN。我们的方法采用行为正则化技术，仅需单次流策略迭代，且分布性评论家仅需单个高斯噪声样本。我们对收敛性和性能边界的理论分析表明，这些简化不仅提高了效率，还带来了更优的任务性能。在机器人操作和运动任务上的实验表明，FAN实现了最先进的性能，同时显著减少了训练和推理时间。我们在https://github.com/brianlsy98/FAN 发布代码。

英文摘要

We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that uses a single flow policy iteration and requires a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.

2605.01194 2026-05-29 cs.RO (updated version)

VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

Wenhao Li, Xiu Su, Yichao Cao, Hongyan Xu, Xiaobo Xia, Shan You, Yi Chen, Chang Xu

Affiliations: University of Sydney; Central South University; University of Science and Technology of China; Sensetime Research; Hong Kong University of Science and Technology

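AI summary: Proposes VLA-ATTC, a framework that provides adaptive test-time compute through an uncertainty-driven "cognitive clutch" and a Relative Action Critic (RAC) model, reducing the failure rate of the SOTA model PI0.5 by over 50% on the LIBERO-LONG benchmark.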

Details

Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities and generalization in embodied manipulation. However, their decision-making relies on a fast, instinctive process that lacks deliberation. This strategy often leads to suboptimal or catastrophic actions when facing complex or ambiguous scenarios that require greater consideration. In this paper, we introduce VLA-ATTC, a framework that endows VLA models with adaptive test-time compute (TTC). VLA-ATTC employs an uncertainty-based "cognitive clutch" to dynamically transition from reflexive execution to a TTC deliberation phase when necessary. During the TTC phase, a novel Relative Action Critic (RAC) model identifies the optimal action from generated candidates via pairwise comparisons. This relative mechanism replaces unstable absolute value estimation, significantly simplifying the learning objective. Furthermore, we introduce an efficient sampling strategy to amortize computational costs and an automated data pipeline that curates preference pairs without manual annotation. On the LIBERO-LONG benchmark, VLA-ATTC reduces the failure rate of the SOTA model PI0.5 by over 50%. We will open-source all the code and weights.
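
To make the gating and pairwise selection described in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. The function name `select_action`, the scalar `uncertainty` compared against a fixed `threshold`, the `sample_candidates` callback, and the round-robin win count are all assumptions made for illustration; the `prefers` callback merely stands in for a learned Relative Action Critic.

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def select_action(
    reflex_action: A,
    uncertainty: float,
    threshold: float,
    sample_candidates: Callable[[], Sequence[A]],
    prefers: Callable[[A, A], bool],
) -> A:
    """Return the reflex action when confident, else the pairwise-preferred candidate.

    All parameters are hypothetical: `prefers(a, b)` should return True when
    action `a` is judged better than action `b`.
    """
    # Fast path: below the uncertainty threshold, no extra compute is spent.
    if uncertainty <= threshold:
        return reflex_action

    candidates = list(sample_candidates())
    if not candidates:
        return reflex_action

    # Deliberation path: count pairwise wins for every candidate.
    wins = [0] * len(candidates)
    for i, a in enumerate(candidates):
        for j, b in enumerate(candidates):
            if i != j and prefers(a, b):
                wins[i] += 1

    # Pick the candidate with the most wins (first one on ties).
    best = max(range(len(candidates)), key=wins.__getitem__)
    return candidates[best]
```

The round-robin comparison costs a number of critic calls quadratic in the number of candidates; a single-elimination pass would be a cheaper alternative under the same assumptions.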

2605.01191 2026-05-29 cs.RO (updated version)

Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

Wenhao Li, Xiu Su, Dan Niu, Yichao Cao, Hongyan Xu, Zhe Qu, Lei Fan, Shan You, Chang Xu

Affiliations: University of Sydney; Central South University; University of New South Wales; Sensetime Research

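AI summary: Proposes Sentinel-VLA, whose active sentinel module monitors execution status and triggers dynamic reasoning or error recovery only when necessary. Combined with a self-evolving continual learning algorithm and an orthogonal continual adapter, and trained on automatically generated data spanning 44 tasks, it raises real-world task success rate by over 30% compared with PI0.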

Details

Abstract

Vision-language-action (VLA) models have advanced the field of embodied manipulation by harnessing broad world knowledge and strong generalization. However, current VLA models still face several key challenges, including limited reasoning capability, lack of status monitoring, and difficulty in self-correction. In this paper, we introduce Sentinel-VLA, a metacognitive VLA model equipped with an active "sentinel" module to monitor real-time execution status. Only when necessary, such as during initial planning or upon detecting an error, does the model trigger dynamic reasoning or formulate error recovery solutions. This on-demand reasoning mechanism ensures robust decision-making while minimizing computational overhead. Notably, all training data (spanning 44 tasks and over 2.6 million transitions) is automatically generated and annotated through our designed pipeline. We also propose the Self-Evolving Continual Learning (SECL) algorithm, which allows Sentinel-VLA to identify its capability boundaries and automatically collect data for expansion, paired with the Orthogonal Continual Adapter (OC-Adapter), which constrains parameter updates to an orthogonal space, thereby preventing catastrophic forgetting. Real-world experiments demonstrate that Sentinel-VLA boosts the task success rate by over 30% compared to the SOTA model, PI0. We will open-source all the code, weights, and data generation pipeline.
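
One generic way to constrain parameter updates to an orthogonal space is to project each update onto the orthogonal complement of directions that earlier tasks depend on. The Python sketch below shows only that projection step. It is an assumption about what such a constraint could look like, not the OC-Adapter itself: `project_orthogonal` and the `basis` of protected directions are hypothetical names, and the basis is assumed to be orthonormal.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(
    update: Sequence[float], basis: Sequence[Sequence[float]]
) -> List[float]:
    """Remove from `update` its component along each protected direction.

    `basis` is assumed to hold orthonormal vectors; the returned update is
    then orthogonal to every one of them.
    """
    result = list(update)
    for u in basis:
        coeff = dot(result, u)
        result = [r - coeff * x for r, x in zip(result, u)]
    return result
```

Applying the projected update leaves the model's response along the protected directions unchanged to first order, which is the intuition behind using orthogonality to limit forgetting.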

2604.15864 2026-05-29 cs.RO (updated version)

Environment-Adaptive Solid-State LiDAR-Inertial Odometry

Zhi Zhang, Chalermchon Satirapod, Bingtao Ma, Changjun Gu

Affiliations: School of Automation, Chongqing University of Posts; Department of Survey Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand; School of Cyberspace Security, Hangzhou Dianzi University, Hangzhou, China

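AI summary: Proposes an environment-adaptive solid-state LiDAR-inertial odometry that integrates local normal-vector constraints with degeneracy-aware map maintenance, addressing the localization drift and map inconsistency caused by geometric degeneracy and unreliable observations in extreme environments.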

Details

Abstract

Solid-state LiDAR-inertial SLAM has attracted significant attention due to its advantages in speed and robustness. However, achieving accurate mapping in extreme environments remains challenging due to severe geometric degeneracy and unreliable observations, which often lead to ill-conditioned optimization and map inconsistencies. To address these challenges, we propose an environment-adaptive solid-state LiDAR-inertial odometry that integrates local normal-vector constraints with degeneracy-aware map maintenance to enhance localization accuracy. Specifically, we introduce local normal-vector constraints to improve the stability of state estimation, effectively suppressing localization drift in degenerate scenarios. Furthermore, we design a degeneration-guided map update strategy to improve map precision. Benefiting from the refined map representation, localization accuracy is further enhanced in subsequent estimation. Experimental results demonstrate that the proposed method achieves superior mapping accuracy and robustness in extreme and perceptually degraded environments, with an average RMSE reduction of up to 12.8% compared to the baseline method.

2603.16673 2026-05-29 cs.RO cs.AI cs.LG (updated version)

A Sliding-Window Filter for Online Continuous-Time Continuum Robot State Estimation

Spencer Teetaert, Sven Lilge, Jessica Burgner-Kahrs, Timothy D. Barfoot

Affiliations: University of Toronto Robotics Institute

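AI summary: Proposes a stochastic sliding-window filter designed specifically for continuum robots that improves on the accuracy of filtering approaches and lets continuous-time methods operate online, all while running faster than real time.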

Comments 8 pages, 6 figures. Submitted to IEEE-RAS International Conference on Soft Robotics 2026

Details

DOI: 10.1109/RoboSoft67810.2026.11522922
Journal ref: 2026 IEEE 9th International Conference on Soft Robotics (RoboSoft), 239-246

Abstract

Stochastic state estimation methods for continuum robots (CRs) often struggle to balance accuracy and computational efficiency. While several recent works have explored sliding-window formulations for CRs, these methods are limited to simplified, discrete-time approximations and do not provide stochastic representations. In contrast, current stochastic filter methods must run at the speed of measurements, limiting their full potential. Recent works in continuous-time estimation techniques for CRs show a principled approach to addressing this runtime constraint, but are currently restricted to offline operation. In this work, we present a sliding-window filter (SWF) for continuous-time state estimation of CRs that improves upon the accuracy of a filter approach while enabling continuous-time methods to operate online, all while running at faster-than-real-time speeds. This represents the first stochastic SWF specifically designed for CRs, providing a promising direction for future research in this area.

2509.19318 2026-05-29 eess.SP cs.RO (updated version)

Scensory: Real-Time Robotic Olfactory Perception for Joint Identification and Source Localization

Yanbaihui Liu, Erica Babusci, Claudia K. Gunsch, Boyuan Chen

Affiliations: Duke University

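AI summary: Proposes Scensory, a learning-based robotic olfaction framework that uses neural networks to decode temporal dynamics in short time series from affordable, cross-sensitive VOC sensor arrays, jointly performing fungal species identification (up to 89.85% accuracy) and source localization (up to 87.31% accuracy).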

Comments Our project website is at: http://generalroboticslab.com/Scensory

Details

Abstract

While robotic perception has advanced rapidly in vision and touch, enabling robots to reason about indoor fungal contamination from weak, diffusion-dominated chemical signals remains an open challenge. We introduce Scensory, a learning-based robotic olfaction framework that simultaneously identifies fungal species and localizes their source from short time series measured by affordable, cross-sensitive VOC sensor arrays. Temporal VOC dynamics encode both chemical and spatial signatures, which we decode through neural networks trained on robot-automated data collection with spatial supervision. Across five fungal species, Scensory achieves up to 89.85% species accuracy and 87.31% source localization accuracy under ambient conditions with 3-7s sensor inputs. These results demonstrate real-time, spatially grounded perception from diffusion-dominated chemical signals, enabling scalable and low-cost source localization for robotic indoor environmental monitoring.

2508.09976 2026-05-29 cs.RO (updated version)

Masquerade: Learning from In-the-wild Human Videos using Data-Editing

Marion Lepert, Jiaying Fang, Jeannette Bohg

Affiliations: Stanford University

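AI summary: Proposes Masquerade, which edits in-the-wild egocentric human videos (estimating 3D hand poses, inpainting the arms, and overlaying a rendered bimanual robot) to bridge the visual embodiment gap, then uses the edited videos to pre-train a visual encoder and fine-tune a diffusion policy head. On three long-horizon bimanual kitchen tasks it outperforms baselines by 5-6x.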

Comments Project website at https://masquerade-robot.github.io/

Details

Journal ref: 2026 IEEE International Conference on Robotics and Automation (ICRA), 2026

Abstract

Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into robotized demonstrations by (i) estimating 3-D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. Pre-training a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips, and continuing that auxiliary loss while fine-tuning a diffusion policy head on only 50 robot demonstrations per task, yields policies that generalize significantly better than prior work. On three long-horizon, bimanual kitchen tasks evaluated in three unseen scenes each, Masquerade outperforms baselines by 5-6x. Ablations show that both the robot overlay and co-training are indispensable, and performance scales logarithmically with the amount of edited human video. These results demonstrate that explicitly closing the visual embodiment gap unlocks a vast, readily available source of data from human videos that can be used to improve robot policies.

2506.05985 2026-05-29 cs.LG cs.RO (updated version)

Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning

Yuheng Lei, Sitong Mao, Shunbo Zhou, Hongyuan Zhang, Xuelong Li, Ping Luo

Affiliations: The University of Hong Kong; Institute of Artificial Intelligence (TeleAI), China Telecom; Huawei Cloud Computing Technologies; Ola Dimensions; HKU Shanghai Intelligent Computing Research Center

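AI summary: To address unavailable task identifiers and isolated knowledge in lifelong learning, proposes Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL), which builds a low-rank expert library with a lightweight router for flexible forward transfer and introduces expert coefficient replay to mitigate forgetting. On the LIBERO benchmark it outperforms existing methods with minimal trainable parameters and storage.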

Comments Accepted to Transactions on Machine Learning Research (TMLR) at https://openreview.net/forum?id=MHVBrjS8cG . Code is available at https://github.com/HarryLui98/DMPEL

Details

Abstract

A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.
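
The idea of a router combining low-rank experts can be pictured as adding a weighted sum of low-rank updates to a frozen weight matrix, W = W0 + sum_k c_k * (B_k A_k). The Python sketch below illustrates that arithmetic only. It is not the DMPEL code: the softmax over router scores, the helper names `matmul`, `softmax`, and `mix_experts`, and the representation of each expert as a `(B, A)` pair of nested lists are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax (assumed router normalization)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(
    w0: Matrix,
    experts: Sequence[Tuple[Matrix, Matrix]],
    router_scores: Sequence[float],
) -> Tuple[Matrix, List[float]]:
    """Return W0 + sum_k c_k * (B_k @ A_k) and the coefficients c."""
    coeffs = softmax(router_scores)
    out = [row[:] for row in w0]  # copy so the frozen weights stay untouched
    for c, (b, a) in zip(coeffs, experts):
        delta = matmul(b, a)  # low-rank update of the same shape as W0
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                out[i][j] += c * v
    return out, coeffs
```

Under these assumptions, the returned coefficient vector is the small per-input quantity one would store if replaying coefficients rather than full trajectories, which is far cheaper than storing data for the whole policy.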

2504.12512 2026-05-29 cs.RO cs.SY eess.SY (updated version)

Practical Insights on Grasp Strategies for Mobile Manipulation in the Wild

Isabella Huang, Richard Cheng, Sangwoon Kim, Dan Kruse, Carolyn Chen, Lukas Kaul, JC Hancock, Shanmuga Harikumar, Mark Tjersland, James Borders, Dan Helmick

Affiliations: Toyota Research Institute

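AI summary: Based on deploying the SHOPPER mobile manipulation robot in a real grocery store, presents an approach to designing general grasp strategies and analyzes the key failure modes observed over hundreds of pick attempts, offering practical insights and identifying open challenges for the robotics community.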

Comments 8 pages, 8 figures, submitted to IROS 2025

Details

Abstract

Mobile manipulation robots are continuously advancing, with their grasping capabilities rapidly progressing. However, there are still significant gaps preventing state-of-the-art mobile manipulators from widespread real-world deployments, including their ability to reliably grasp items in unstructured environments. To help bridge this gap, we developed SHOPPER, a mobile manipulation robot platform designed to push the boundaries of reliable and generalizable grasp strategies. We developed these grasp strategies and deployed them in a real-world grocery store, an exceptionally challenging setting chosen for its vast diversity of manipulable items, fixtures, and layouts. In this work, we present our detailed approach to designing general grasp strategies towards picking any item in a real grocery store. Additionally, we provide an in-depth analysis of our latest real-world field test, discussing key findings related to fundamental failure modes over hundreds of distinct pick attempts. Through our detailed analysis, we aim to offer valuable practical insights and identify key grasping challenges, which can guide the robotics community towards pressing open problems in the field.

2503.00779 2026-05-29 cs.RO (updated version)

Phantom: Training Robots Without Robots Using Only Human Videos

Marion Lepert, Jiaying Fang, Jeannette Bohg

Affiliations: Stanford University

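AI summary: Proposes a framework for training robot manipulation policies from human video demonstrations alone, converting the demonstrations into robot-compatible observation-action pairs via hand pose estimation and visual data editing, which enables zero-shot deployment with success rates of up to 92%.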

Comments Project website at https://phantom-human-videos.github.io

Details

Journal ref: The 9th Conference on Robot Learning (CoRL 2025)

Abstract

Training general-purpose robots requires learning from large and diverse data sources. Current approaches rely heavily on teleoperated demonstrations, which are difficult to scale. We present a scalable framework for training manipulation policies directly from human video demonstrations, requiring no robot data. Our method converts human demonstrations into robot-compatible observation-action pairs using hand pose estimation and visual data editing. We inpaint the human arm and overlay a rendered robot to align the visual domains. This enables zero-shot deployment on real hardware without any fine-tuning. We demonstrate strong success rates, up to 92%, on a range of tasks including deformable object manipulation, multi-object sweeping, and insertion. Our approach generalizes to novel environments and supports closed-loop execution. By demonstrating that effective policies can be trained using only human videos, our method broadens the path to scalable robot learning.
