arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30350 2026-05-29 cs.RO cs.LG Version update

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung, Sungha Kim, Daesol Cho, H. Jin Kim, Jia-Bin Huang, Furong Huang

Affiliations: Seoul National University; University of Maryland, College Park; Georgia Institute of Technology

AI summary: Proposes DynaFLIP, a dynamics-aware multimodal pre-training framework that trains an image encoder on image-language-3D flow triplets and aligns the three modalities by combining simplex-volume minimization with a cosine regularizer and a contrastive objective, improving motion understanding and generalization in robot manipulation.

Comments Project website: https://dynaflip-robotics.github.io

Abstract

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
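
The geometric idea in the DynaFLIP abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the paper's formulation: the "simplex" is taken to be the triangle whose vertices are the three L2-normalized embeddings, the cosine regularizer is taken to be the mean pairwise (1 - cosine similarity), `cos_weight` is a hypothetical weight, and the contrastive objective is left out entirely.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangle_area(a, b, c):
    """Area of the 2-simplex with vertices a, b, c, via the Gram determinant
    of the two edge vectors (b - a) and (c - a)."""
    u = [y - x for x, y in zip(a, b)]
    v = [y - x for x, y in zip(a, c)]
    gram_det = dot(u, u) * dot(v, v) - dot(u, v) ** 2
    return 0.5 * math.sqrt(max(gram_det, 0.0))  # clamp tiny negative round-off


def cosine_penalty(a, b, c):
    """Mean of (1 - cosine similarity) over the three pairs of unit vectors."""
    return ((1 - dot(a, b)) + (1 - dot(a, c)) + (1 - dot(b, c))) / 3.0


def alignment_loss(img, txt, flow, cos_weight=0.1):
    """Hypothetical combined loss: simplex volume plus weighted cosine term."""
    a, b, c = normalize(img), normalize(txt), normalize(flow)
    return triangle_area(a, b, c) + cos_weight * cosine_penalty(a, b, c)
```

In this toy setting, if two embeddings coincide and the third is orthogonal to them, the triangle area is zero even though the triplet is not aligned, while the cosine term stays at 2/3. That is one way a volume term on its own can be ambiguous, and it suggests why an extra regularizer is useful.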

2605.30342 2026-05-29 cs.CV cs.RO Version update

Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field

Shangjie Xue, Jesse Dill, Dhruv Ahuja, Frank Dellaert, Panagiotis Tsiotras, Danfei Xu

Affiliations: Georgia Institute of Technology

AI summary: Proposes the GAVIS framework, which quantifies 3DGS uncertainty through an anisotropic visibility field and performs active mapping by maximizing information gain, significantly outperforming existing methods in accuracy and efficiency.

Comments Accepted to CVPR 2026. Project page https://gatech-rl2.github.io/GAVIS/

Abstract

We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network-based uncertainty-aware 3DGS rasterizer, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation. Extensive experiments across diverse environments demonstrate that GAVIS consistently and significantly outperforms prior approaches in both accuracy and efficiency. Moreover, beyond standalone use, our method can be applied post-hoc to improve the performance of existing approaches.

2605.30326 2026-05-29 cs.RO cs.AI Version update

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen, Thomas L. Griffiths, Yejin Choi, David Held, Chuang Gan

Affiliations: University of Massachusetts Amherst; Princeton University; Stanford University; Carnegie Mellon University

AI summary: Presents RoboWits, a bi-manual robotic benchmark built with an automated task generation pipeline based on multi-agent cooperation. It evaluates cognitive reasoning, creative tool use, and robustness across geometry, material, and assembly reasoning, and finds that pre-trained VLAs are brittle on mutated tasks.

Comments The first two authors contributed equally

Abstract

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bi-manual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained VLAs, and oracle-state planners. Our results reveal a significant performance gap: while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle to perform on mutated tasks, implying their brittleness in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments. Project page is available at https://umass-embodied-agi.github.io/RoboWits.

2605.30282 2026-05-29 cs.RO Version update

Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation

Kuangji Zuo, Gen Li, Bofan Lyu, Yanshuo Lu, Boyu Ma, Shijia Han, Xinyu Zhou, Xichen Yuan, Chuhao Zhou, Jiaqi Bai, Geng Li, Jianfei Yang

Affiliations: MARS Lab, Nanyang Technological University

AI summary: Proposes the Gaze2Act framework, which treats human gaze as a dynamic intent signal and combines cross-view semantic matching with policy-level conditioning so that robots can manipulate precisely in complex interactive tasks.

Comments Project page: https://zuo-kuangji.github.io/Gaze2Act/

Abstract

Vision-Language-Action (VLA) models have recently shown strong potential for robot learning by following language instructions. However, in practice, language alone is often insufficient to precisely convey human intent. It is difficult to describe which exact object to interact with among similar candidates, where to act on the object, or how the target may change during execution. To address this limitation, we propose Gaze2Act, a novel VLA framework that leverages human gaze as a dynamic and intuitive intent signal for complex interactive manipulation. Gaze2Act first bridges the ego-exo view gap by mapping first-person gaze into the robot's perspective through cross-view semantic matching, producing both an object mask and a gaze point for coarse-to-fine target specification. These cues are then integrated into the policy through perception-level prompting and action-level conditioning, allowing the robot to attend to relevant regions and execute precise interactions under dynamic intent. In a systematic evaluation across seven task categories and 16 real-robot tasks on a Unitree G1 humanoid, Gaze2Act achieves state-of-the-art performance in both intent accuracy and task success rate. It notably outperforms baselines in object disambiguation, fine-grained interaction, and dynamic intent steering. These results demonstrate that human gaze provides a natural, low-burden, and highly expressive modality for human-in-the-loop VLA control.

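2605.30056 2026-05-29 cs.RO cs.LG Version update

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi

Affiliations: ShanghaiTech University

AI summary: To address the imbalance between exploration and exploitation when diffusion policies are used in reinforcement learning, proposes Critic-Guided diffusion Policy Optimization (CGPO), which balances the two with a training-free guidance technique and achieves the best performance on MuJoCo and Franka robot tasks.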

Comments Accepted by ICML 2026

Abstract

Recent advances in reinforcement learning (RL) have achieved great success by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on sampling-based policy optimization. This design gives the diffusion model better exploration capability, particularly at the beginning of training, but suffers from low exploitation of Q-value information, resulting in slow policy convergence. Another branch focuses on gradient-based policy optimization, which fully exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO, Critic-Guided diffusion Policy Optimization, which effectively balances exploration and exploitation by integrating a training-free guidance technique into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance through a better exploration-exploitation tradeoff. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, where it achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first method to successfully incorporate diffusion policy into real-world RL, showing superior performance on Franka robot arm grasping tasks. Our official page is released at https://dingsht.tech/cgpo-webpage.

2605.29937 2026-05-29 cs.RO cs.LG Version update

Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion Control

Hao Ren, Zetong Bi, Yiming Zeng, Le Zheng, Zhi Li, Zhaoliang Wan, Lu Qi, Hui Cheng

Affiliations: Sun Yat-sen University, Guangzhou, China; Insta360 Research, Shenzhen, China

AI summary: Proposes a training-free Fisher-preserving guidance method that computes Fisher-preserving updates via a low-rank Jacobian factorization and uses truncated Fisher denoising sensitivity as an uncertainty signal, yielding reliable and efficient trajectory prediction in visual navigation.

Comments ICML 2026

Abstract

Diffusion models are effective for waypoint prediction in visual navigation, but standard sampling and test-time guidance can produce unreliable or inefficient trajectories when updates drift off the training manifold. We propose Fisher-Preserving Guidance with Outer Product Span Projection, a training-free inference method that avoids large Fisher drift associated with off-distribution actions while optimizing a task objective. Our method computes the Fisher-preserving update via a low-rank Jacobian factorization, requiring only a single backward pass per step and enabling real-time use. We further introduce Truncated Fisher Denoising Sensitivity as an uncertainty signal and use it for robust multi-sample action blending. Experiments on toy and realistic navigation benchmarks, including Maze2D with TSDF-based guidance, PushT with official Diffusion Policy weights, and visual navigation in simulation and on real robots, demonstrate consistent improvements in performance over strong diffusion-policy baselines without additional training.

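2605.29864 2026-05-29 cs.RO Version update

LLM-Guided Future Hypotheses for Horizon-Aware Exploration in Multi-Step Robot Manipulation

Mohammad Khoshnazar, Andrew Melnik, Michael Beetz

Affiliations: Institute of Artificial Intelligence, University of Bremen

AI summary: Proposes the Future-Experience Conditioning (FEC) framework, which uses LLM-guided short-horizon future videos as structured priors and combines them with behavior cloning and reinforcement-learning fine-tuning to improve exploration and policy adaptation in multi-step robot manipulation.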

Abstract

Multi-step robot manipulation requires acting under uncertainty about how the scene will evolve, making exploration and policy adaptation challenging. We study whether short-horizon, task-consistent future videos can provide useful structured priors for control and reinforcement-learning fine-tuning. We formalize this idea through Future-Experience Conditioning (FEC), a simple interface that conditions closed-loop policies on a latent representation of a short future video. In our simulation setup, future clips are generated in three stages: an LLM reasoner operating over a task ontology initialized from the current scene state, a robot-free digital-twin rollout of the intended object motion, and a mask-free video diffusion model that synthesizes a robot-consistent future clip without requiring segmentation at inference. We instantiate this future-conditioning interface primarily with BC and BC+RL, and compare against a future-conditioned Streaming Flow Policy (SFP) baseline on RoboCasa and CALVIN under NoFuture, GTFuture, GenFuture, and WrongFuture. Generated futures improve performance over no-future conditioning, while mismatched futures degrade it, and our BC+RL instantiation achieves the strongest overall results. An average BC+RL learning-curve analysis across 8 CALVIN tasks further shows that GTFuture improves fastest, GenFuture improves earlier and to a higher level than NoFuture, and WrongFuture remains at zero throughout training. These results suggest that short-horizon future videos can serve as useful structured priors for exploration and policy adaptation under imperfect future predictions. https://enact2026.github.io/

2605.29773 2026-05-29 cs.CV cs.AI cs.RO Version update

Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation

Boyuan Zhang, Huanshan Huang, Yifei Cao

Affiliations: Ecole Polytechnique, Institut Polytechnique de Paris; CIAD, UTBM, Université Marie et Louis Pasteur; U2IS, ENSTA, Institut Polytechnique de Paris

AI summary: Proposes a hybrid method combining a NECO geometric ratio with an energy score for pixel-wise out-of-distribution detection in a single forward pass, reaching an AUROC of 0.8539 on the miniMUAD dataset and outperforming NECO or the energy score alone.

Comments 7 pages, 6 figures. Accepted at the ICRA 2026 Workshop on Long-term Deployments in the Wild (LoWi 2026)

Abstract

Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO
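
The fusion step described in the Energy-Aware NECO abstract (standardize two per-pixel scores with in-distribution statistics, then take a convex combination) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the geometric ratio is treated as an already-computed number, both scores are assumed to be oriented so that larger means "more OOD", the energy is the usual negative log-sum-exp of the logits at temperature 1, and the function names and `alpha` default are hypothetical.

```python
import math
from statistics import mean, pstdev


def energy_score(logits):
    """Negative log-sum-exp of the logits (temperature 1), computed stably.
    Confident, peaked logits give a lower (more in-distribution) energy."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))


def fit_standardizer(id_scores):
    """Fit (mean, std) on scores from a pure in-distribution validation split."""
    mu = mean(id_scores)
    sigma = pstdev(id_scores)
    return mu, (sigma if sigma > 0 else 1.0)  # guard against zero spread


def fuse(geom, energy, geom_stats, energy_stats, alpha=0.5):
    """Convex combination of the two z-scored components, alpha in [0, 1]."""
    z_geom = (geom - geom_stats[0]) / geom_stats[1]
    z_energy = (energy - energy_stats[0]) / energy_stats[1]
    return alpha * z_geom + (1 - alpha) * z_energy
```

Standardizing before fusing puts the two components on a comparable scale, so `alpha` expresses a relative weighting rather than compensating for different units.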

2605.29771 2026-05-29 cs.RO Version update

Joint Angle Estimation with Customized Wristband Based on Online Incremental Learning

Shuo Wang, Xiaobin Chen, Xiaoming Tao

Affiliations: Research Institute for Intelligent Wearable Systems, The Hong Kong Polytechnic University

AI summary: Proposes a customized wristband system based on online incremental learning that estimates wrist joint angle in two stages (online model updating, then model-based estimation), adapts to data drift across wearing configurations, and has an error of about 15 degrees.

Abstract

Intelligent wearable technology plays an increasingly important role in human-computer interaction, motion, and health monitoring. To ensure comfort and practicality of use, a common approach to motion monitoring is to use soft wearable sensors. However, many research applications of wearable sensors are simplistic and difficult to adapt to different situations. This study proposes a system for estimating the angle of the wrist joint using a customized wristband based on an online incremental learning approach. It is a two-stage estimation method: the first stage updates the model based on the wearer's wrist movement characteristics using online learning, integrating real-time data from an IMU as ground truth. The second stage uses the updated model to estimate the wrist joint angle with the wristband alone. In other words, model training is completed during data acquisition, allowing the trained model to be used for subsequent angle estimation. This method offers advantages in adapting to data drift caused by variations in testing configurations, such as the left and right wrists of the same subject, deviations in the wearing position on the same wrist, and even differences among subjects. The results indicate that the sensors exhibit good performance under strain variations, and the wrist joint trajectory estimation of the proposed system has an error of approximately 15 degrees in different scenarios.
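
The two-stage scheme in the wristband abstract can be pictured with a toy example. The abstract does not say which model is used, so the linear estimator, the per-sample gradient update, the learning rate, and all names below are hypothetical stand-ins chosen only to show the flow: stage 1 updates the model online with the IMU angle as ground truth, and stage 2 predicts from the wristband signal alone.

```python
class OnlineLinearEstimator:
    """Toy incremental regressor: angle ~ b + w . x, updated one sample at a time."""

    def __init__(self, n_features, lr=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        """Stage 2: estimate the joint angle from wristband features only."""
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, imu_angle):
        """Stage 1: one gradient step on the squared error against the IMU angle."""
        err = imu_angle - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b += self.lr * err
        return err
```

Because the model is refit from streaming samples, re-running stage 1 after the band is moved or worn by another person would adapt the weights to the new configuration, which is the kind of drift the abstract describes.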

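2605.29766 2026-05-29 cs.RO Version update

MARS Policy: Multimodality Only When It Matters

Jindou Jia, Tuo An, Yuxuan Hu, Gen Li, Jingliang Li, Bohan Hou, Xiangyu Chen, Jiaqi Bai, Bofan Lyu, Jianfei Yang

Affiliations: MARS Lab, Nanyang Technological University

AI summary: Proposes the MARS policy, which adaptively introduces stochasticity only when needed and uses deterministic learning in single-modal phases, balancing the multimodal capability of generative policies with the efficiency of deterministic models and improving success rate and inference speed on simulated and real-world tasks.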

Comments 13 figures, 17 pages

Abstract

Imitation learning has become a cornerstone for solving complex robotic manipulation tasks. In particular, multimodality, which enables robots to capture diverse yet valid behavioral patterns, has driven the rapid emergence of generative policies as a dominant paradigm in robot learning. However, achieving such multimodality typically relies on stochastic noise initialization and iterative denoising procedures, resulting in substantial training complexity and low inference efficiency. Meanwhile, not all phases of a robotic task inherently require behavioral diversity. Motivated by this insight, we propose the Modality-Adaptive Robot Sampling (MARS) policy, which adaptively invokes tailored stochasticity only when it is truly beneficial, while reverting to efficient deterministic learning during single-modal phases. In other words, the proper amount of noise is injected only at the proper time. By selectively activating multimodal generation, the MARS policy bridges the gap between the multimodal capability of generative policies and the superior training and inference efficiency of deterministic models. Empirical studies across 8 simulated and 4 real-world tasks demonstrate that MARS exhibits robust multimodal expressivity and high efficiency, with a 16.67% success rate improvement and an 83.20% inference latency reduction in real-world tests. Counterintuitively, MARS also outpaces deterministic policies in training efficiency on near-deterministic tasks by more effectively modeling nuanced action diversity.

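2605.29710 2026-05-29 cs.RO Version update

PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology

Sergey Arkhangelskiy

Affiliations: Positronic Robotics

AI summary: To address small sample sizes and unreliable statistical comparisons in current VLA policy evaluation, proposes the PhAIL benchmark, which uses the time-to-success cumulative distribution function as the evaluation primitive and, through Human-Relative Throughput scoring and Kolmogorov-Smirnov significance testing, compares models more reliably with few rollouts.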

Comments 22 pages, 10 figures, 8 tables. Dataset, analysis pipeline, and paper source: https://phail.ai and https://github.com/Positronic-Robotics/phail-paper

Abstract

Real-world evaluation of vision-language-action (VLA) policies still rests on binary success rate at a fixed timeout with $N \le 25$ rollouts per condition, almost always without confidence intervals or paired statistical comparison; these cohort sizes struggle to resolve close comparisons reliably. We introduce PhAIL (Physical AI Leaderboard, https://phail.ai), an open real-robot benchmark on a Franka FR3 (dataset, per-rollout artifacts, and end-to-end reference implementation) of a distributional evaluation methodology: the time-to-success cumulative distribution function (CDF) as the evaluation primitive, with two separated jobs. The first is scoring via Human-Relative Throughput (HRT), a dimensionless scalar with bootstrap confidence intervals, anchored to same-fixture human teleoperation. The second is a significance test (Kolmogorov-Smirnov, computed per-object and macro-averaged across objects). On four publicly-available VLAs, the macro-averaged KS test resolves two close comparisons (GR00T vs. ACT, OpenPI vs. ACT) at $N \le 30$ rollouts per (model, object) cell where binary-threshold metrics do not; the closest pair (OpenPI vs. GR00T) remains unresolved within our budget. The best evaluated VLA is $\sim 7\times$ slower per operation (RMST ratio) than the human reference.
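
To see why a time-to-success CDF can separate policies that a binary success rate cannot, consider the following minimal sketch. It is not the PhAIL reference implementation: failed rollouts are encoded as `math.inf` (an assumption), only the two-sample Kolmogorov-Smirnov statistic is computed (no p-value, no per-object macro-averaging, no HRT score), and the function names are hypothetical.

```python
import math


def ecdf(times, t):
    """Empirical time-to-success CDF: fraction of rollouts that succeeded by time t.
    Failed rollouts are stored as math.inf and therefore never count."""
    return sum(1 for x in times if x <= t) / len(times)


def success_rate(times, timeout):
    """Binary success rate at a fixed timeout, i.e. the CDF read at a single point."""
    return ecdf(times, timeout)


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs over observed finite times."""
    grid = sorted({x for x in a + b if math.isfinite(x)})
    return max((abs(ecdf(a, t) - ecdf(b, t)) for t in grid), default=0.0)


# Two made-up policies with four rollouts each (seconds to success).
policy_a = [10.0, 20.0, math.inf, math.inf]
policy_b = [10.0, 20.0, 30.0, 40.0]
print(success_rate(policy_a, 25.0), success_rate(policy_b, 25.0))  # same rate at 25 s
print(ks_statistic(policy_a, policy_b))  # the CDFs still differ
```

The binary metric reads the CDF at one timeout, so it can hide differences that appear elsewhere along the curve; a distributional statistic looks at the whole curve. Whether a given gap is statistically significant depends on the sample size, which this sketch does not assess.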

2605.29704 2026-05-29 cs.RO Version update

FLIP: Real-Time and Resilient Formation Planning for Large-Scale DIstributed Swarms via Point Cloud Registration

Yuan Zhou, Guangtong Xu, Zhenyu Hou, Jialiang Hou, Fei Gao

Affiliations: Institute of Cyber-Systems and Control, College of Control Science and Engineering, Zhejiang University; Huzhou Institute, Zhejiang University

AI summary: Recasts the computation of the optimal formation position sequence as a spatiotemporal point cloud registration problem and uses a PCR method with outlier rejection to achieve resilient, efficient trajectory planning for large-scale distributed swarms.

Abstract

Traditional large-scale formation planning methods either oversimplify the formation representation, which leads to poor performance, or employ complete collaborative relationships, which results in excessive computational load. To achieve high-performance and large-scale formation planning, we transform the Optimal Formation Position Sequence (OFPS) calculation problem into a spatiotemporal Point Cloud Registration (PCR) problem. Each agent derives its OFPS by distributively computing the matching result between current positions and the desired formation positions of all other agents. Each agent then optimizes the cooperative formation trajectory using the OFPS. We leverage a PCR method with outlier rejection to rapidly perform large-scale formation position registration. This prevents suboptimal trajectories and failed agents from propagating through the cooperative network and affecting more agents. Consequently, we uniformly achieve resilient, efficient, and distributed trajectory planning for large-scale swarms. The effectiveness and superiority of the proposed method are demonstrated through large-scale simulations of a 120-drone formation and rigorous benchmarking against state-of-the-art (SOTA) methods.

2605.29693 2026-05-29 cs.LG cs.RO Version update

Momentum Based Reward Design for Low Emission Traffic Signal Control

Chinmay Mundane, Amith Manoharan, Arun Singh

Affiliations: Institute of Technology, University of Tartu

AI summary: Proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than only penalizing congestion, achieving better throughput-emission trade-offs and more stable learning behavior in SUMO simulation.

Abstract

Urban traffic congestion is a growing global issue contributing significantly to long commute times and environmental pollution. Traditional traffic signal control systems often fail to adapt to dynamic traffic conditions. Adaptive traffic signal control can improve urban traffic without changing road infrastructure. Deep Reinforcement Learning (DRL) has shown strong performance for this task, but existing delay and queue-based rewards often produce short-sighted or unstable policies. This paper proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than penalizing congestion alone. The method is evaluated in SUMO (Simulation of Urban MObility) using standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions. Results show that the proposed reward produces better throughput-emission trade-offs and more stable learning behavior than delay or queue-based rewards, as well as classical controllers such as Max Pressure and LQF.

2605.29605 2026-05-29 cs.RO Version update

VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models

Dehao Huang, Aoxiang Gu, Chengjie Zhang, Bolin Zou, Wenlong Dong, Zilang Cen, Yue Wang, Hong Zhang

Affiliations: Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China; Zhongguancun Academy, Beijing, China; National Cybersecurity Academy, Wuhan University, Wuhan, China

AI summary: Proposes VLAConf, a one-class discriminative confidence framework that uses frozen pretrained VLA internal representations and a lightweight confidence head to estimate step-wise anomaly scores in a single forward pass, giving efficient task-success confidence estimation that generalizes across architectures.

Comments 11 pages, 7 figures

Abstract

Confidence estimation for Vision-Language-Action (VLA) models is essential for robots to perform manipulation tasks in the open world, providing crucial signals for risk-sensitive decision-making and failure anticipation. Existing confidence estimation methods typically rely on ensemble-based paradigms or action-token probabilities to predict the likelihood of task success. However, they still encounter challenges in computational efficiency and cross-architecture generalizability. These methods usually require repeated sampling, leading to inference inefficiency, and are restricted to VLA models with discrete action outputs, making them difficult to apply to continuous action spaces. To address this issue, we propose VLAConf, a one-class discriminative confidence framework. By leveraging frozen pretrained VLA internal representations, VLAConf directly estimates step-wise anomaly scores in a single forward pass using a lightweight confidence head, thereby eliminating the overhead of exhaustive resampling. We additionally use step-conditioned modeling to encode rollout-phase information along the manipulation trajectory. Experiments on the LIBERO benchmark demonstrate that VLAConf significantly improves the quality of the confidence signal constructed for post-hoc calibration, outperforming existing baselines by a large margin in inference efficiency. The effectiveness of VLAConf is further validated in real-robot experiments. To access the source code and supplementary videos, visit https://sites.google.com/view/vlaconf.

2605.29599 2026-05-29 cs.RO cs.CV Version update

How to Relieve Distribution Shifts in Semantic Segmentation for Off-Road Environments

Ji-Hoon Hwang, Daeyoung Kim, Hyung-Suk Yoon, Dong-Wook Kim, Seung-Woo Seo

Affiliations: Department of Electrical and Communication Engineering, Seoul National University

AI summary: Proposes the ST-Seg framework, which uses style expansion and texture regularization to mitigate distribution shifts caused by source-target domain discrepancies and sensor corruption in off-road scenes, improving the robustness of semantic segmentation.

Comments 8 pages, 6 figures. Accepted to IEEE Robotics and Automation Letters (RA-L). © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses.

Journal ref
IEEE Robotics and Automation Letters, vol. 10, no. 5, pp. 4500-4507, 2025
AI中文摘要

语义分割对于越野环境中的自主导航至关重要,能够精确分类周围环境以识别可通行区域。然而,越野条件固有的独特因素,如源-目标域差异和粗糙地形导致的传感器退化,可能引起分布偏移,使数据变化与训练条件不同。这常导致语义标签预测不准确,进而造成导航任务失败。为解决此问题,我们提出ST-Seg,一种通过风格扩展(SE)和纹理正则化(TR)扩展源分布的新框架。与先前在固定源分布内隐式应用泛化的方法不同,ST-Seg提供了一种直观的分布偏移处理方法。具体而言,SE通过生成多样化的逼真风格来拓宽域覆盖范围,增强源域有限的风格信息。TR通过深度纹理流形稳定受风格增强学习影响的局部纹理表示。在各种分布偏移的目标域上的实验证明了ST-Seg的有效性,相较于现有方法有显著改进。这些结果凸显了ST-Seg的鲁棒性,增强了越野导航中语义分割的实际应用性。

英文摘要

Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that alter the data differently from the trained conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach for distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.

2605.24934 2026-05-29 cs.RO cs.AI cs.CV cs.LG 版本更新

HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

HumanEgo:从几分钟的人类自我中心视频中进行零样本机器人学习

Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, Yiannis Aloimonos

发表机构 * University of Maryland(马里兰大学)

AI总结 提出HumanEgo框架,通过将人类演示提升为手-物体交互的实体级表示,并训练具有密集辅助目标的流匹配策略,实现从人类自我中心视频到机器人的零样本、无机器人数据、硬件无关的技能迁移。

Comments Project page: https://humanego-ai.github.io

AI中文摘要

人类自我中心视频捕捉了丰富的操作演示,无需任何机器人硬件,但由于人类和机器人在视觉外观和运动学上的具身差距,将这些技能迁移到机器人仍然具有挑战性。我们提出了HumanEgo,一个通过将每个人类演示提升为手-物体交互的实体级表示,并训练具有密集辅助目标的流匹配策略来弥合具身差距的框架,该策略放大了每个轨迹的监督信号。HumanEgo无需机器人数据、硬件无关、数据高效且可零样本地从人类迁移到机器人。每个任务仅需30分钟的人类视频,HumanEgo在四个真实世界任务中实现了92.5%的平均成功率(仅15分钟即可达到75%),比匹配时间的机器人遥操作高出41%,并且能够稳健地零样本迁移到新的机器人、相机和环境。我们发布了HumanEgo作为一个易于使用的开源框架,用于直接从人类数据学习机器人策略:https://github.com/TX-Leo/HumanEgo

英文摘要

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments. We release HumanEgo as an easy-to-use, open-source framework for learning robot policies directly from human data: https://github.com/TX-Leo/HumanEgo

2605.20752 2026-05-29 cs.RO 版本更新

GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

GaussianDream:用于机器人操作的前馈3D高斯世界模型

Zijian Zhang, Yuqing Jiang, Qian Cheng, Xiaofan Li, Si Liu, Ding Zhao, Ping Luo, Weitao Zhou, Haibao Yu

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Tsinghua University(清华大学) Zhejiang University(浙江大学) Beihang University(北京航空航天大学) Carnegie Mellon University(卡内基梅隆大学) The University of Hong Kong(香港大学)

AI总结 提出GaussianDream,一种前馈3D高斯世界模型插件,通过可学习查询编码当前帧3D空间结构和短期未来演化,在训练时用静态重建和未来预测头监督,推理时仅保留查询条件化动作生成,在多个机器人操作基准上达到最先进性能。

Comments 19 pages, 9 figures

AI中文摘要

视觉-语言-动作(VLA)策略通过将预训练的视觉-语言模型的语义先验迁移到动作生成,推进了语言条件机器人操作。然而,标准的动作模仿学习通常缺乏对显式3D空间信息、密集几何监督和未来环境演化的充分建模,而这些对于精确的机器人交互至关重要。为解决这一问题,我们提出GaussianDream,一种前馈3D高斯世界模型插件。具体地,我们在编码器中引入可学习的GaussianDream查询,使模型能够捕捉当前帧的3D空间结构和短时域的未来演化。训练时,潜在的GaussianDream前缀由静态重建头和未来预测头处理,生成当前3D高斯场景状态和未来高斯演化状态。当前分支通过RGB渲染和深度进行监督,而未来分支使用未来RGB、深度和伪3D场景流信号。推理时,GaussianDream丢弃所有辅助头,仅保留学习到的前缀以条件化动作生成,无需测试时的高斯重建或未来预测。实验结果表明,GaussianDream在多个机器人操作基准上取得了最先进的性能,在LIBERO上达到98.4%,在RoboCasa Human-50上达到54.8%,在真实机器人任务上达到50.0%。与现有的3D增强VLA方法相比,GaussianDream在实现高精度的同时,提供了比基于视频的世界模型方法更高的推理效率。

英文摘要

Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. However, standard action-imitation learning often lacks sufficient modeling of explicit 3D spatial information, dense geometric supervision, and future environment evolution, all critical for precise robotic interaction. To address this, we propose GaussianDream, a feed-forward 3D Gaussian world-model plug-in. Specifically, we introduce learnable GaussianDream Queries in the encoder, enabling the model to capture current-frame 3D spatial structure and short-horizon future evolution. During training, the latent GaussianDream prefix is processed by a static reconstruction head and a future prediction head to produce current 3D Gaussian scene states and future Gaussian evolution states. The current branch is supervised by RGB rendering and depth, while the future branch uses future RGB, depth, and pseudo 3D scene-flow signals. During inference, GaussianDream discards all auxiliary heads and retains only the learned prefix to condition action generation, without test-time Gaussian reconstruction or future prediction. Experimental results demonstrate that GaussianDream achieves state-of-the-art performance across multiple robotic manipulation benchmarks, reaching 98.4% on LIBERO, 54.8% on RoboCasa Human-50, and 50.0% on real-robot tasks. Compared with existing 3D-enhanced VLA methods, GaussianDream achieves strong accuracy while providing higher inference efficiency than video-based world-model approaches.

2605.01395 2026-05-29 eess.SY cs.RO cs.SY 版本更新

Quasi-Static Control of Discrete Cosserat Rod

离散Cosserat杆的准静态控制

Srishti Siddharth

发表机构 * Centre for Systems and Control(系统与控制中心) Indian Institute of Technology Bombay(印度理工学院孟买分校) Nanyang Technological University(南洋理工大学)

AI总结 针对使用Cosserat杆建模的软体机器人,基于分段常应变空间离散化方法,利用外部力/力矩作为控制输入,设计应变空间和任务空间的状态反馈线性化控制律,实现末端执行器轨迹跟踪和形状控制。

Comments Submitted to 17th APCA International Conference on Automatic Control and Soft Computing (CONTROLO 2026)

AI中文摘要

在本文中,我们为使用Cosserat杆建模的软体机器人设计了反馈控制律,其中Cosserat杆通过分段常应变(PCS)方法进行空间离散化。PCS方法将描述Cosserat杆的非线性偏微分方程转化为非线性常微分方程组。这种简化得到的软体机器人模型类似于串联刚性连杆机械臂。我们通过将外部力/力矩作为控制输入,为准静态PCS模型设计了反馈控制律。控制律基于应变空间和任务空间的状态反馈线性化设计。大量的数值结果展示了这些控制律在软体机器人末端执行器轨迹跟踪和形状控制中的性能。

英文摘要

In this paper, we design feedback control laws for soft robots modelled using the Cosserat rod, which is spatially discretised using the Piecewise Constant Strain (PCS) approach. The PCS approach transforms the nonlinear PDEs describing the Cosserat rod to a system of nonlinear ODEs. This simplification results in a model describing soft robots which is similar to the serial rigid-link manipulators. We design feedback control laws for the quasi-static PCS model by using the external wrenches as control input. The control laws are designed based on state-feedback linearisation in strain and task spaces. An extensive set of numerical results demonstrates the performance of the control laws for end-effector trajectory tracking and shape control of soft robots.

2604.19011 2026-05-29 cs.LG cs.RO 版本更新

Accelerating trajectory optimization with Sobolev-trained diffusion policies

基于Sobolev训练的扩散策略加速轨迹优化

Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec, Justin Carpentier

发表机构 * Inria - Département d’Informatique de l’École normale supérieure, PSL Research University(法国国家信息与自动化研究所(Inria)-巴黎高等师范学院计算机系,PSL研究大学) Courant Institute, New York University(纽约大学Courant研究所)

AI总结 针对梯度型轨迹优化求解器,提出利用Sobolev学习训练扩散策略以提供初始猜测,通过利用轨迹和反馈增益的一阶损失避免复合误差,实现求解时间减少2至20倍。

AI中文摘要

轨迹优化求解器利用已知系统动力学通过迭代改进计算局部最优轨迹。其缺点是每个新问题实例独立求解,因此收敛速度和求解质量依赖于初始轨迹。为提高效率,一种自然的方法是用学习策略生成的初始猜测对轨迹优化进行热启动,该策略在求解器先前生成的轨迹上训练。基于扩散的策略最近成为表达性模仿学习模型,使其成为这一角色的有前途候选者。然而,一个反直觉的挑战来自轨迹优化示范的局部最优性:当策略展开时,小的非最优偏差可能将其推入训练数据中未表示的情况,从而在长时域上引发复合误差。在这项工作中,我们专注于基于学习的热启动,用于同时提供反馈增益的梯度型轨迹优化求解器。利用这一特性,我们推导出一阶损失,用于使用轨迹和反馈增益对基于扩散的策略进行Sobolev学习。通过全面实验,我们证明所得策略避免了复合误差,因此可以从非常少的轨迹中学习,提供初始猜测,将求解时间减少2倍到20倍。结合一阶信息使得用更少的扩散步骤进行预测成为可能,从而降低推理延迟。

英文摘要

Trajectory Optimization (TO) solvers exploit known system dynamics to compute locally optimal trajectories through iterative improvements. A downside is that each new problem instance is solved independently; therefore, convergence speed and quality of the solution found depend on the initial trajectory proposed. To improve efficiency, a natural approach is to warm-start TO with initial guesses produced by a learned policy trained on trajectories previously generated by the solver. Diffusion-based policies have recently emerged as expressive imitation learning models, making them promising candidates for this role. Yet, a counterintuitive challenge comes from the local optimality of TO demonstrations: when a policy is rolled out, small non-optimal deviations may push it into situations not represented in the training data, triggering compounding errors over long horizons. In this work, we focus on learning-based warm-starting for gradient-based TO solvers that also provide feedback gains. Exploiting this specificity, we derive a first-order loss for Sobolev learning of diffusion-based policies using both trajectories and feedback gains. Through comprehensive experiments, we demonstrate that the resulting policy avoids compounding errors, and so can learn from very few trajectories to provide initial guesses reducing solving time by 2× to 20×. Incorporating first-order information enables predictions with fewer diffusion steps, reducing inference latency.
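
The idea of a first-order (Sobolev) loss can be sketched on a toy problem: besides matching the solver's controls, the policy's Jacobian with respect to the state is matched to the solver's feedback gain. The sketch below rests on assumptions that are not in the abstract: a linear policy u = W x + b stands in for the diffusion policy, the gain K is taken directly as the target Jacobian (sign conventions differ between solvers), and `lam` is a hypothetical weighting.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def sobolev_loss(W, b, samples, lam=1.0):
    """Zeroth-order (control) error plus first-order (Jacobian vs. gain) error.

    samples: list of (x, u, K) where x is a state, u the solver's control and
    K the solver's feedback gain, i.e. the assumed target for d(policy)/dx.
    """
    total = 0.0
    for x, u, K in samples:
        pred = [p + b_i for p, b_i in zip(matvec(W, x), b)]
        value_err = sum((p - t) ** 2 for p, t in zip(pred, u))
        # For a linear policy the Jacobian d(pred)/dx is simply W.
        jac_err = sum((w - k) ** 2
                      for w_row, k_row in zip(W, K)
                      for w, k in zip(w_row, k_row))
        total += value_err + lam * jac_err
    return total / len(samples)

# Hypothetical data: one control dimension, two state dimensions.
K_true = [[-2.0, -0.5]]
samples = [([1.0, 0.0], [-2.0], K_true), ([0.0, 1.0], [-0.5], K_true)]

exact = sobolev_loss(K_true, [0.0], samples)
wrong_slope = sobolev_loss([[-2.0, 0.0]], [0.0], samples)
```

A policy that reproduces both the controls and the gains gets zero loss. `wrong_slope` is penalised on both terms: it misses the control of the second sample, and its slope disagrees with the gain on every sample.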

2603.10474 2026-05-29 cs.LG cs.NE cs.RO 版本更新

Muscle Synergy Priors Enhance Biomechanical Fidelity in Predictive Musculoskeletal Locomotion Simulation

肌肉协同先验增强预测性肌肉骨骼运动模拟的生物力学保真度

Ilseung Park, Eunsik Choi, Jangwhan Ahn, Jooeun Ahn

发表机构 * Department of Mechanical Engineering(机械工程系) Carnegie Mellon University(卡内基梅隆大学) Department of Physical Education(体育系) Seoul National University(首尔国立大学) Lampe Joint Department of Biomedical Engineering(生物医学工程联合部门) UNC-Chapel Hill and NC State University(北卡罗来纳大学教堂山分校和北卡罗来纳州立大学)

AI总结 提出一种生理学启发的强化学习框架,通过肌肉协同约束控制,在有限实验数据下提高了预测性人体运动模拟的生物力学保真度和泛化能力。

Comments Added a manuscript footnote stating "Project page with supplementary videos: https://ces40320.github.io/WebHomepage__Walk-RL ."

AI中文摘要

人类运动源于高维神经肌肉控制,这使得预测性肌肉骨骼模拟具有挑战性。我们提出了一种生理学启发的强化学习框架,利用肌肉协同约束控制。我们从少量地面行走试验的逆肌肉骨骼分析中提取了低维协同基,并将其作为动作空间,用于训练一个肌肉驱动的三维模型,该模型在可变速度、坡度和不平坦地形上进行训练。由此产生的控制器在0.7-1.8 m/s的速度和±6°的坡度上生成了稳定的步态,并再现了关节角度、关节力矩和地面反作用力的条件依赖性调节。与无约束控制器相比,协同约束控制减少了非生理性膝关节运动学,并将膝关节力矩曲线保持在实验包络内。在各种条件下,模拟的垂直地面反作用力与人体测量值强相关,肌肉激活时间大多落在受试者间变异范围内。这些结果表明,将神经生理结构嵌入强化学习可以在有限实验数据下提高预测性人体运动模拟的生物力学保真度和泛化能力。

英文摘要

Human locomotion emerges from high-dimensional neuromuscular control, making predictive musculoskeletal simulation challenging. We present a physiology-informed reinforcement-learning framework that constrains control using muscle synergies. We extracted a low-dimensional synergy basis from inverse musculoskeletal analyses of a small set of overground walking trials and used it as the action space for a muscle-driven three-dimensional model trained across variable speeds, slopes and uneven terrain. The resulting controller generated stable gait from 0.7-1.8 m/s and on ±6° grades and reproduced condition-dependent modulation of joint angles, joint moments and ground reaction forces. Compared with an unconstrained controller, synergy-constrained control reduced non-physiological knee kinematics and kept knee moment profiles within the experimental envelope. Across conditions, simulated vertical ground reaction forces correlated strongly with human measurements, and muscle-activation timing largely fell within inter-subject variability. These results show that embedding neurophysiological structure into reinforcement learning can improve biomechanical fidelity and generalization in predictive human locomotion simulation with limited experimental data.
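
Using a synergy basis as the action space means the policy outputs a few synergy coefficients, which are then expanded into per-muscle activations. The snippet below is a minimal sketch of that mapping. The basis values, the number of muscles and synergies, and the clipping to [0, 1] are illustrative assumptions, not details from the paper.

```python
def synergies_to_activations(basis, action):
    """Map a low-dimensional synergy action to per-muscle activations.

    basis: one row per muscle, one column per synergy (non-negative weights).
    action: one coefficient per synergy, as output by the policy.
    """
    activations = []
    for row in basis:
        raw = sum(w * a for w, a in zip(row, action))
        activations.append(min(1.0, max(0.0, raw)))  # keep within [0, 1]
    return activations

# Hypothetical basis: 4 muscles driven by 2 synergies.
basis = [[0.8, 0.0],
         [0.6, 0.2],
         [0.0, 0.9],
         [0.1, 0.7]]
activations = synergies_to_activations(basis, [0.5, 1.5])
```

The policy here controls two numbers instead of four. With a full musculoskeletal model the reduction is much larger, and that reduction is the constraint the abstract refers to.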

2512.11944 2026-05-29 cs.RO cs.AI 版本更新

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

基于学习的运动规划综述:迈向数据驱动的最优控制方法

Jia Hu, Yang Chang, Haoran Wang

发表机构 * College of Transportation Key Laboratory of Road and Traffic Engineering of the Ministry of Education(交通运输学院 道路交通工程教育部重点实验室) Institute for Advanced Study(先进研究院) Tongji University(同济大学)

AI总结 本文系统综述了数据驱动最优控制范式,通过融合最优控制的理论保证与机器学习的自适应能力,为自动驾驶运动规划提供了三维实现路线图,并指出了四个未来研究方向。

Comments 44 pages, 14 figures

AI中文摘要

自动驾驶的运动规划面临一个关键的权衡。传统的基于规则的流程提供了可验证的安全性和可解释性,但往往难以在复杂场景中泛化。相反,新兴的基于学习的方法——包括模仿学习、强化学习和生成式AI——提供了更大的适应性,但通常受限于不透明性和安全风险。现有的综述通常孤立地分析这些AI方法,忽视了将它们与严格的控制框架相结合的潜力。为弥合这一差距,本文首次系统综述了数据驱动最优控制(DDOC)范式,明确考察了它如何协同最优控制的理论保证与现代机器学习的自适应能力。基于这一框架,我们提出了首个DDOC运动规划路线图,将其实现结构化为三个关键维度:定制化、动力学自适应和自整定。最后,为缩小剩余的现实差距,我们确定了四个未来研究方向,从而加速向可信赖且类人的自动驾驶的过渡。

英文摘要

Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and interpretability, they often fail to generalize in complex scenarios. Conversely, emerging learning-based methods, including imitation learning (IL), reinforcement learning (RL), and generative AI, offer greater adaptability but are often constrained by opacity and safety risks. Existing surveys typically analyze these AI methods in isolation, overlooking the potential of integrating them with rigorous control frameworks. To bridge this gap, this paper presents the first systematic review of the Data-Driven Optimal Control (DDOC) paradigm, explicitly examining how it synergizes the theoretical guarantees of optimal control with the adaptive capabilities of modern machine learning. Building on this framework, we propose the first roadmap for DDOC-based motion planning, structuring its implementation into three critical dimensions: customization, dynamics adaptation, and self-tuning. Finally, to close the remaining reality gap, we identify four future research directions, thereby accelerating the transition to trustworthy and human-like autonomous driving.

2511.17798 2026-05-29 cs.RO 版本更新

SM2ITH: Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control

SM2ITH:通过任务分层双层模型预测控制实现安全移动操作与人机交互预测

Francesco D'Orazio, Sepehr Samavi, Xintong Du, Siqi Zhou, Giuseppe Oriolo, Angela P. Schoellig

发表机构 * Department of Computer, Control and Management Engineering, Sapienza University of Rome(罗马萨皮恩扎大学计算机、控制与管理工程系) University of Toronto Institute for Aerospace Studies (UTIAS) and the Vector Institute for Artificial Intelligence(多伦多大学航空航天研究所(UTIAS)和向量人工智能研究所) Learning Systems and Robotics lab at the Technical University of Munich and the Munich Institute for Robotics and Machine Intelligence (MIRMI)(慕尼黑工业大学学习系统与机器人实验室及慕尼黑机器人与机器智能研究所(MIRMI)) School of Computing Science, Faculty of Applied Sciences, Simon Fraser University(西蒙·弗雷泽大学应用科学学院计算机科学系)

AI总结 提出SM$^2$ITH框架,结合分层任务模型预测控制与双层优化的人机交互预测,实现动态人机环境中的安全高效移动操作。

Comments Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

AI中文摘要

移动操作机器人被设计用于在以人为中心的环境中执行复杂的导航和操作任务序列。尽管最近基于优化的方法(如分层任务模型预测控制,HTMPC)能够以严格的任务优先级实现高效的多任务执行,但它们目前主要应用于静态或结构化场景。将这些方法扩展到动态的人为中心环境需要预测模型来捕捉人类对机器人行为的反应。本文提出了SM$^2$ITH(通过任务分层双层模型预测控制实现安全移动操作与人机交互预测),这是一个统一框架,通过双层优化联合考虑机器人和人类动力学,将HTMPC与交互式人体运动预测相结合。该框架在两种不同的移动操作机器人(Stretch 3和Ridgeback-UR10)上进行了验证,涉及三种实验设置:(i)具有不同导航和操作优先级的递送任务,(ii)使用不同人体运动预测模型的顺序抓取-放置任务,以及(iii)涉及对抗性人类行为的交互。我们的结果突出了交互式预测如何实现安全高效的协调,优于依赖加权目标或开环人体模型的基线方法。

英文摘要

Mobile manipulators are designed to perform complex sequences of navigation and manipulation tasks in human-centered environments. While recent optimization-based methods such as Hierarchical Task Model Predictive Control (HTMPC) enable efficient multitask execution with strict task priorities, they have so far been applied mainly to static or structured scenarios. Extending these approaches to dynamic human-centered environments requires predictive models that capture how humans react to the actions of the robot. This work introduces Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control (SM$^2$ITH), a unified framework that combines HTMPC with interactive human motion prediction through bilevel optimization that jointly accounts for robot and human dynamics. The framework is validated on two different mobile manipulators, the Stretch 3 and the Ridgeback-UR10, across three experimental settings: (i) delivery tasks with different navigation and manipulation priorities, (ii) sequential pick-and-place tasks with different human motion prediction models, and (iii) interactions involving adversarial human behavior. Our results highlight how interactive prediction enables safe and efficient coordination, outperforming baselines that rely on weighted objectives or open-loop human models.

2511.04758 2026-05-29 cs.RO cs.AI cs.MA 版本更新

ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling

ScheduleStream: 基于采样器的时序规划用于GPU加速的多臂任务与运动规划及调度

Caelan Garrett, Fabio Ramos

发表机构 * NVIDIA Research Seattle Robotics Lab (SRL)(NVIDIA西雅图机器人实验室) University of Sydney(悉尼大学)

AI总结 提出ScheduleStream,首个通用框架,通过混合持续动作和领域无关算法,结合GPU加速采样器,实现多臂并行任务与运动规划及调度。

Comments Project website: https://schedulestream.github.io

详情
Journal ref
2026 IEEE International Conference on Robotics and Automation (ICRA)
AI中文摘要

双臂和类人机器人因其类似人类利用多臂高效完成任务的能力而具有吸引力。然而,由于混合离散-连续动作空间的增长,同时控制多个臂在计算上具有挑战性。任务与运动规划(TAMP)算法可以在混合空间中高效规划,但通常生成一次只移动一个臂的计划,而不是允许并行臂运动的调度。为了将TAMP扩展到生成调度,我们提出了ScheduleStream,这是第一个用于带采样操作的规划与调度的通用框架。ScheduleStream使用混合持续动作对时间动态进行建模,这些动作可以异步启动,并持续一个由其参数决定的时长。我们提出了领域无关的算法,无需任何特定于应用的机制即可解决ScheduleStream问题。我们将ScheduleStream应用于任务与运动规划及调度(TAMPAS),其中我们利用采样器内的GPU加速来加快规划。我们将ScheduleStream算法与模拟中的几种消融方法进行比较,发现它们能产生更高效的解决方案。我们在https://schedulestream.github.io上展示了ScheduleStream在几个真实世界双臂机器人任务上的应用。

英文摘要

Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans in which only one arm moves at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that is a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.
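
The difference between a plan (one arm at a time) and a schedule (arms in parallel) can be seen in a toy makespan calculation with durative actions whose duration depends on their parameters. Everything in this sketch is hypothetical: the `DurativeAction` class, the distance and speed parameters, and the assumption that the two arms have no ordering constraints between them. It does not use the ScheduleStream API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DurativeAction:
    name: str
    arm: str
    distance: float      # hypothetical parameter, in metres
    speed: float = 0.5   # hypothetical parameter, in metres per second

    @property
    def duration(self) -> float:
        # Duration is a function of the action's parameters.
        return self.distance / self.speed

def sequential_makespan(actions):
    """Plan: actions run one after another, whichever arm they use."""
    return sum(a.duration for a in actions)

def parallel_makespan(actions):
    """Schedule: each arm runs its own actions back-to-back, arms overlap."""
    busy_until = {}
    for a in actions:
        busy_until[a.arm] = busy_until.get(a.arm, 0.0) + a.duration
    return max(busy_until.values())

actions = [
    DurativeAction("pick_a", "left", 1.0),
    DurativeAction("place_a", "left", 0.5),
    DurativeAction("pick_b", "right", 1.0),
]
```

With these made-up numbers the sequential plan takes 5.0 s and the parallel schedule 3.0 s. A real scheduler must also respect collision and precedence constraints between the arms, which this sketch ignores.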

2508.14610 2026-05-29 cs.RO 版本更新

TRUST-Planner: Topology-guided Robust Trajectory Planner for AAVs with Uncertain Obstacle Spatial-temporal Avoidance

TRUST-Planner:面向具有不确定障碍物时空避让的AAV拓扑引导鲁棒轨迹规划器

Junzhi Li, Teng Long, Jingliang Sun, Jianxin Zhong

发表机构 * School of Aerospace Engineering, Beijing Institute of Technology(北京理工大学航空航天工程学院) Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education(教育部飞行器动力学与控制重点实验室)

AI总结 提出TRUST-Planner拓扑引导分层规划框架,通过动态增强可见概率图、无终端最小控制多项式和动态距离场实现复杂动态环境下的鲁棒时空避障,达到96%成功率和毫秒级计算效率。

Comments Accepted by IEEE Transactions on Industrial Electronics (TIE) for publication. The final version will be available online at https://ieeexplore.ieee.org/ after publication

AI中文摘要

尽管自主飞行器(AAV)的运动规划已取得广泛进展,但现有框架在复杂动态环境中仍面临局部极小值和死锁的挑战,导致碰撞风险增加。为了解决这些问题,我们提出了TRUST-Planner,一种拓扑引导的分层规划框架,用于鲁棒的时空避障。在前端,提出了一种动态增强可见概率图(DEV-PRM),以快速探索拓扑路径进行全局引导。后端利用统一的无终端最小控制多项式(UTF-MINCO)和动态距离场(DDF),实现高效的预测性避障和快速并行计算。此外,引入了一种增量式多分支轨迹管理框架,以实现时空拓扑决策,同时有效利用历史信息减少重规划时间。仿真结果表明,TRUST-Planner优于基线竞争对手,在测试的复杂环境中实现了96%的成功率和毫秒级计算效率。真实世界实验进一步验证了所提方法的可行性和实用性。

英文摘要

Despite extensive developments in motion planning of autonomous aerial vehicles (AAVs), existing frameworks face the challenges of local minima and deadlock in complex dynamic environments, leading to increased collision risks. To address these challenges, we present TRUST-Planner, a topology-guided hierarchical planning framework for robust spatial-temporal obstacle avoidance. In the frontend, a dynamic enhanced visible probabilistic roadmap (DEV-PRM) is proposed to rapidly explore topological paths for global guidance. The backend utilizes a uniform terminal-free minimum control polynomial (UTF-MINCO) and dynamic distance field (DDF) to enable efficient predictive obstacle avoidance and fast parallel computation. Furthermore, an incremental multi-branch trajectory management framework is introduced to enable spatio-temporal topological decision-making, while efficiently leveraging historical information to reduce replanning time. Simulation results show that TRUST-Planner outperforms baseline competitors, achieving a 96% success rate and millisecond-level computation efficiency in tested complex environments. Real-world experiments further validate the feasibility and practicality of the proposed method.

2507.23270 2026-05-29 cs.RO cs.SY eess.SY 版本更新

Simulation-based planning of Motion Sequences for Automated Procedure Optimization in Multi-Robot Assembly Cells

基于仿真的多机器人装配单元自动化程序优化的运动序列规划

Loris Schneider, Marc Ungen, Elias Huber, Jan-Felix Klein

发表机构 * Institute for Material Handling and Logistics (IFL), Karlsruhe Institute of Technology(材料搬运与物流研究所(IFL),卡尔斯鲁厄理工学院) Bosch Corporate Research, Robert Bosch GmbH(博世企业研究,罗伯特·博世有限公司)

AI总结 提出一种基于仿真的方法,通过将装配步骤分解为核心操作和遍历操作,并采用分解式运动规划策略优化调度,以生成高效无碰撞的多机器人运动序列,减少装配时间。

Comments Accepted for publication at IEEE CASE 2026

AI中文摘要

可重构多机器人单元提供了一种应对波动装配需求的有前景的方法。然而,其配置的重复规划带来了新的挑战,特别是在生成优化、协调的多机器人运动序列以最小化装配时间方面。本文提出了一种基于仿真的方法,用于生成此类优化序列。该方法将装配步骤分解为与任务相关的核心操作和连接的遍历操作。核心操作受约束且预先确定,而遍历操作具有显著的优化潜力。核心操作的调度被形式化为一个优化问题,需要使用基于分解的运动规划策略集成可行的遍历操作。探索了几种求解技术,包括采样启发式、基于树的搜索和无梯度优化。对于运动规划,提出了一种分解方法,识别调度中的特定区域,这些区域可以使用改进的集中式路径规划算法独立求解。所提出的方法生成了高效且无碰撞的多机器人装配程序,优于依赖分散式、机器人个体运动规划的基线方法。通过仿真实验证明了其有效性。

英文摘要

Reconfigurable multi-robot cells offer a promising approach to meet fluctuating assembly demands. However, the recurrent planning of their configurations introduces new challenges, particularly in generating optimized, coordinated multi-robot motion sequences that minimize the assembly duration. This work presents a simulation-based method for generating such optimized sequences. The approach separates assembly steps into task-related core operations and connecting traverse operations. While core operations are constrained and predetermined, traverse operations offer substantial optimization potential. Scheduling the core operations is formulated as an optimization problem, requiring feasible traverse operations to be integrated using a decomposition-based motion planning strategy. Several solution techniques are explored, including a sampling heuristic, tree-based search and gradient-free optimization. For motion planning, a decomposition method is proposed that identifies specific areas in the schedule, which can be solved independently with modified centralized path planning algorithms. The proposed method generates efficient and collision-free multi-robot assembly procedures that outperform a baseline relying on decentralized, robot-individual motion planning. Its effectiveness is demonstrated through simulation experiments.

2409.01159 2026-05-29 cs.RO 版本更新

Remote telepresence over large distances via robot avatars: case studies

通过机器人化身进行远距离远程呈现:案例研究

Mohamed Elobaid, Stefano Dafarra, Ehsan Ranjbari, Giulio Romualdi, Tomohiro Chaki, Tomohiro Kawakami, Takahide Yoshiike, Daniele Pucci

发表机构 * Artificial and Mechanical Intelligence AMI (Italian Institute of Technology)(人工与机械智能AMI(意大利理工学院)) Frontier Robotics, Innovative Research Excellence, Honda R&D(前沿机器人,创新研究卓越,本田研发) Machine Learning and Optimisation, The University of Manchester(机器学习与优化,曼彻斯特大学)

AI总结 本文探讨了如何调整一种新提出的化身系统架构,以适应不同形态的机器人(轮式、腿式及多种手部与运动学结构),在带宽受限条件下实现洲际远程呈现。

AI中文摘要

本文讨论了必要的考虑因素和调整,使得最近提出的化身系统架构能够与不同的机器人化身形态(包括轮式和腿式机器人,具有各种类型的手部和运动学结构)配合使用,以在通信带宽限制下实现远程(洲际)远程呈现。所报告的案例研究涉及使用位置和力矩控制模式的机器人,独立于其软件中间件。

英文摘要

This paper discusses the necessary considerations and adjustments that allow a recently proposed avatar system architecture to be used with different robotic avatar morphologies (both wheeled and legged robots with various types of hands and kinematic structures) for the purpose of enabling remote (intercontinental) telepresence under communication bandwidth restrictions. The case studies reported involve robots using both position and torque control modes, independently of their software middleware.

2409.01144 2026-05-29 cs.RO 版本更新

Adaptive Non-linear Centroidal MPC with Stability Guarantees for Robust Locomotion of Legged Robots

具有稳定性保证的自适应非线性质心MPC用于腿式机器人鲁棒运动

Mohamed Elobaid, Giulio Turrisi, Lorenzo Rapetti, Giulio Romualdi, Stefano Dafarra, Tomohiro Kawakami, Tomohiro Chaki, Takahide Yoshiike, Claudio Semini, Daniele Pucci

发表机构 * Artificial and Mechanical Intelligence (AMI), Istituto Italiano di Tecnologia (IIT)(人工与机械智能(AMI),意大利技术研究院(IIT)) Dynamic Legged Systems (DLS), Istituto Italiano di Tecnologia (IIT)(动态腿部系统(DLS),意大利技术研究院(IIT)) Frontier Robotics, Innovative Research Excellence, Honda R&D, Saitama, Japan(前沿机器人,创新研究卓越,本田研发,日本埼玉) Machine Learning and Optimisation, The University of Manchester(机器学习与优化,曼彻斯特大学)

AI总结 通过自适应控制和李雅普诺夫函数重新表述质心MPC控制器,为腿式机器人在未知负载和恒定扰动下提供闭环稳定性与鲁棒性保证。

AI中文摘要

基于简化质心动力学的非线性模型预测运动控制器如今在腿式机器人中无处不在。这些方案即使假设了机器人动力学的固有简化,也被证明能够赋予机器人对微小推力的步态调整能力,此外,在参数不确定(如未知负载)的情况下,它们能够提供一些实用的、尽管有限的鲁棒性。在这项工作中,我们通过重新表述质心MPC控制器,为其闭环稳定性提供了严格的证明。这是通过一种受自适应控制机制启发的系统化程序以及来自控制李雅普诺夫函数的思想实现的。此外,我们的重新表述为一类未测量的恒定扰动提供了鲁棒性。为了展示我们方法的通用性,我们在新一代人形机器人——56.7千克的ergoCub,以及商用21千克四足机器人Aliengo上验证了我们的公式。

英文摘要

Nonlinear model predictive locomotion controllers based on the reduced centroidal dynamics are nowadays ubiquitous in legged robots. These schemes, even if they assume an inherent simplification of the robot's dynamics, were shown to endow robots with a step-adjustment capability in reaction to small pushes and, in the case of uncertain parameters such as unknown payloads, to provide some practical, albeit limited, robustness. In this work, we provide rigorous certificates of their closed-loop stability via a reformulation of the centroidal MPC controller. This is achieved thanks to a systematic procedure inspired by the machinery of adaptive control, together with ideas coming from control Lyapunov functions. Our reformulation, in addition, provides robustness for a class of unmeasured constant disturbances. To demonstrate the generality of our approach, we validated our formulation on a new-generation humanoid robot, the 56.7 kg ergoCub, as well as on a commercially available 21 kg quadruped robot, Aliengo.

2305.10917 2026-05-29 cs.RO 版本更新

Online Non-linear Centroidal MPC for Humanoid Robots Payload Carrying with Contact-Stable Force Parametrization

面向人形机器人负重任务的在线非线性质心模型预测控制与接触稳定力参数化

Mohamed Elobaid, Giulio Romualdi, Gabriele Nava, Lorenzo Rapetti, Hosameldin Awadalla Omer Mohamed, Daniele Pucci

发表机构 * Mechanical engineering department(机械工程系) Machine Learning and Optimisation, The University of Manchester(机器学习与优化,曼彻斯特大学)

AI总结 针对人形机器人负重行走问题,提出结合在线非线性质心模型预测控制与接触稳定力参数化的方法,实现给定脚步轨迹的跟踪。

AI中文摘要

本文考虑了一个问题:允许受到持续干扰(以负重任务形式)的人形机器人遵循给定的规划脚步。为解决此问题,我们结合了在线非线性质心模型预测控制器(MPC)与接触稳定力参数化。MPC的成本函数增加了处理干扰和正则化参数的项。通过仿真和人形机器人iCub上的实验验证了所提出控制器的性能。最后,简要研究了使用参数化对控制器计算时间的影响。

英文摘要

In this paper we consider the problem of allowing a humanoid robot that is subject to a persistent disturbance, in the form of a payload-carrying task, to follow given planned footsteps. To solve this problem, we combine an online nonlinear centroidal Model Predictive Controller (MPC) with a contact-stable force parametrization. The cost function of the MPC is augmented with terms handling the disturbance and regularizing the parameter. The performance of the resulting controller is validated both in simulations and on the humanoid robot iCub. Finally, the effect of using the parametrization on the computational time of the controller is briefly studied.

2205.04297 2026-05-29 cs.RO cs.AI 版本更新

Learning A Simulation-based Visual Policy for Real-world Peg In Unseen Holes

学习基于仿真的视觉策略用于真实世界中未见孔洞的插拔

Liang Xie, Hongxiang Yu, Kechun Xu, Tong Yang, Minhang Wang, Haojian Lu, Rong Xiong, Yue Wang

发表机构 * College of Control Science and Engineering, Zhejiang University, Zhejiang, China.(控制科学与工程学院,浙江大学,浙江,中国) The Application Innovate Lab, Huawei Incorporated Company, China.(应用创新实验室,华为公司,中国)

AI总结 提出一种基于学习的视觉插拔方法,通过解耦感知与策略模块,在仿真中训练多种形状,并仅需少量仿真到现实迁移成本即可适应真实世界中任意未见形状。

AI中文摘要

本文提出一种基于学习的视觉插拔方法,能够在仿真中训练多种形状,并在真实世界中以最小的仿真到现实迁移成本适应任意未见形状。核心思想是将感知-运动策略的泛化解耦为快速适应的感知模块和仿真通用策略模块的设计。框架包括分割网络(SN)、虚拟传感器网络(VSN)和控制器网络(CN)。具体地,VSN被训练用于从分割图像中测量未见形状的位姿。然后,给定与形状无关的位姿测量,CN被训练以实现通用插拔。最后,当应用于真实未见孔洞时,我们只需微调仿真VSN+CN所需的分割网络。为进一步最小化迁移成本,我们提出在一分钟人工教学后自动收集和标注分割网络的数据。展示了在眼在外/眼在手配置下的仿真和真实世界结果。采用所提策略的电动汽车充电系统在2-3秒内实现了10/10的成功率,仅使用数百个自动标注样本进行分割网络迁移。

英文摘要

This paper proposes a learning-based visual peg-in-hole method that enables training with several shapes in simulation and adapting to arbitrary unseen shapes in the real world with minimal sim-to-real cost. The core idea is to decouple the generalization of the sensory-motor policy into the design of a fast-adaptable perception module and a simulated generic policy module. The framework consists of a segmentation network (SN), a virtual sensor network (VSN), and a controller network (CN). Concretely, the VSN is trained to measure the pose of the unseen shape from a segmented image. After that, given the shape-agnostic pose measurement, the CN is trained to achieve generic peg-in-hole. Finally, when applying to real unseen holes, we only have to fine-tune the SN required by the simulated VSN+CN. To further minimize the transfer cost, we propose to automatically collect and annotate the data for the SN after one-minute human teaching. Simulated and real-world results are presented under both eye-to-hand and eye-in-hand configurations. An electric vehicle charging system running the proposed policy achieves a 10/10 success rate in 2-3 s, using only hundreds of auto-labeled samples for the SN transfer.

2605.29572 2026-05-29 cs.RO cs.HC 版本更新

Learning to Feel Materials from Multisensory Tactile Data via Interpretable Models

通过可解释模型从多感官触觉数据中学习感知材料

Li Zou, Yasemin Vardar

发表机构 * Delft University of Technology (TU Delft), Department of Cognitive Robotics(代尔夫特理工大学(TU Delft),认知机器人学系)

AI总结 提出一个可解释的计算框架,利用多感官触觉数据(包括按压、静态接触和滑动交互)建模人类材料感知与识别,发现热觉和顺应性线索对感知建模和材料分类至关重要。

Comments 12 pages, 3 figures, journal

详情
AI中文摘要

人类对材料的触觉感知依赖于复杂的多感官触觉线索,然而低级触觉信号与感知表征之间的关系仍不清楚。这一知识差距阻碍了触觉在数字环境中的集成以及具有类人触觉感知能力的机器人的开发。在这里,我们提出了一个可解释的计算框架,用于使用多感官触觉数据建模人类材料感知和识别。我们的框架包含三个相互关联的模型:模型1将手指-表面交互特征映射到心理物理感官属性,模型2基于这些感知表征对材料进行分类,模型3直接从触觉特征对材料进行分类。结果表明,结合按压、静态接触和滑动交互的信息提高了预测准确性,并且热觉线索对于感知建模和材料分类尤其具有信息量。这些发现强调了热觉和顺应性线索的重要性,这些线索在当前机器人手指和触觉显示器中仍未得到充分体现。纳入此类线索可能增强人工系统近似人类材料感知的能力,并指导设计更具感知基础的触觉界面。

英文摘要

Human tactile perception of materials relies on complex multisensory touch cues, yet the relationship between low-level tactile signals and perceptual representations remains poorly understood. This knowledge gap hinders the integration of touch in digital environments and the development of robots capable of human-like tactile perception. Here, we present an interpretable computational framework for modeling human material perception and recognition using multisensory touch data. Our framework comprises three interconnected models: Model 1 maps finger-surface interaction features to psychophysical sensory attributes, Model 2 classifies materials based on these perceptual representations, and Model 3 directly classifies materials from tactile features. The results showed that combining information from pressing, static contact, and sliding interactions improves prediction accuracy, and that thermal cues are particularly informative for both perceptual modeling and material classification. These findings highlight the importance of thermal and compliance cues, which remain underrepresented in current robotic fingers and haptic displays. Incorporating such cues may enhance artificial systems' ability to approximate human material perception and guide the design of more perceptually grounded haptic interfaces.

2605.29565 2026-05-29 cs.CV cs.RO 版本更新

From General Vision to Reliable Traversability Estimation: Adapting Vision Foundation Models for Unstructured Outdoor Environments

从通用视觉到可靠的可通行性估计:适应视觉基础模型用于非结构化户外环境

Ji-Hoon Hwang, Jisung Bae, Dong-Wook Kim, Yeonkyu Lee, Seung-Woo Seo

AI总结 提出ViTA框架,通过可学习提示、视角多样化训练和几何知识蒸馏,将视觉基础模型适应于非结构化户外环境的可靠可通行性估计,显著降低误报并提升跨域泛化。

Comments 8 pages, 5 figures

详情
AI中文摘要

基于视觉的方法已成为非结构化户外环境中可通行性估计的主导范式,通常通过语义分割监督来适应视觉基础模型(VFM)。然而,该范式面临三个根本性挑战,削弱了其可靠性:VFM的任务无关设计、可通行性标注的模糊性以及语义标签与物理安全性之间的差异。我们提出了视觉到可通行性适应(ViTA)框架,该框架将VFM适应于可靠的可通行性估计,并在SAM2上实例化。ViTA通过可学习的可通行性提示注入任务特定知识,同时保留VFM的跨域泛化能力。为处理标注模糊性,我们引入了视角多样化训练,通过估计语义不确定性来抑制模糊边界处的自信预测。为弥合语义与可通行性之间的差异,我们在训练期间蒸馏几何知识,使得推理时仅从RGB图像即可进行坡度和高程推理。语义和几何输出融合为一个连续的可通行性分数,同时反映语义不确定性和几何风险。在包括具有挑战性的真实越野数据集在内的多个领域的评估表明,ViTA实现了最先进的IoU和精确度,同时大幅减少误报并具备强大的跨域泛化能力。

英文摘要

Vision-based approaches have become the dominant paradigm for traversability estimation in unstructured outdoor environments, typically adapting vision foundation models (VFMs) via semantic segmentation supervision. However, this paradigm faces three fundamental challenges that undermine its reliability: the task-agnostic design of VFMs, the ambiguity of traversability annotations, and the discrepancy between semantic labels and physical safety. We propose Vision-to-Traversability Adaptation (ViTA), a framework that adapts VFMs for reliable traversability estimation, instantiated on SAM2. ViTA injects task-specific knowledge through learnable traversability prompts while preserving the VFM's cross-domain generalization. To handle annotation ambiguity, we introduce Perspective-Diversified Training, which estimates semantic uncertainty to suppress confident predictions at ambiguous boundaries. To bridge the semantic-traversability discrepancy, we distill geometric knowledge during training, enabling slope and elevation reasoning from RGB images alone at inference. The semantic and geometric outputs are fused into a continuous traversability score that reflects both semantic uncertainty and geometric risk. Evaluations across diverse domains, including challenging real-world off-road datasets, demonstrate that ViTA achieves state-of-the-art IoU and Precision with substantial false-positive reduction and strong cross-domain generalization.

2605.29564 2026-05-29 cs.RO 版本更新

VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world Reinforcement Learning for Robust Contact-Rich Manipulation

VE2VF: 基于真实世界强化学习的视觉使能到无视觉蒸馏用于鲁棒接触丰富操作

Victor Kowalski, Chengxi Li, Dongheui Lee

发表机构 * Autonomous Systems, Technische Universitaet Wien (TU Wien)(自动系统,维也纳技术大学) Institute of Robotics and Mechatronics (DLR)(机器人与机电研究所)

AI总结 提出一种人在环强化学习框架,通过教师-学生蒸馏将视觉使能策略的知识迁移到仅依赖本体感知的无视觉策略,在真实世界训练中实现鲁棒泛化,无需域随机化或数据增强。

详情
AI中文摘要

当使用强化学习进行接触丰富的机器人操作时,视觉可以提供任务相关信息,加速学习,超越仅靠本体感知所能达到的效果。然而,视觉使能策略容易过拟合训练时看到的视觉条件,限制了其鲁棒性和可迁移性。我们提出一种人在环强化学习框架,采用教师-学生蒸馏,在完全真实世界训练中实现跨多个任务变体的鲁棒性能,无需域随机化或数据增强。视觉使能教师将其知识蒸馏到仅依赖位姿、速度旋量(twist)和力旋量(wrench)传感的无视觉学生中,结合了快速训练与强任务泛化。在真实世界的NIST装配基准板上,我们的方法在3个代表性任务上经过约50分钟训练后达到95%的整体成功率,包括对8个未见任务变体的鲁棒泛化。通过蒸馏微调在最困难的任务上实现了完全成功。我们证明所得策略在鲁棒性和适应性上均优于基线。

英文摘要

When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist, and wrench sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.

2605.29562 2026-05-29 cs.RO cs.AI cs.CV 版本更新

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

VLA-Pro:面向视觉-语言-动作模型的跨任务程序性记忆迁移

Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究院) Shanghai Key Laboratory of Multimodal Embodied AI(上海多模态具身人工智能重点实验室) Shanghai Xinzhi Embodied Intelligence Technology Co., Ltd.(上海新智具身智能技术有限公司)

AI总结 提出VLA-Pro框架,通过存储和检索任务相关的LoRA适配器作为程序性记忆,实现跨任务泛化,在仿真和真实任务中成功率显著提升。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在通用机器人操作中展现出强大潜力,但在泛化到需要跨物体、场景和动作模式迁移相关经验的新任务时仍面临挑战。本文提出VLA-Pro,一种即插即用框架,通过在训练时存储任务相关的程序性记忆并在推理时迁移这些记忆来增强跨任务泛化。具体而言,VLA-Pro在训练时将任务特定的LoRA适配器存储为参数化的程序性记忆。在推理时,VLA-Pro基于当前多模态上下文检索相关程序性记忆,并动态融合这些记忆以生成当前动作块。在RoboTwin、RLBench和真实世界操作任务上的实验表明,VLA-Pro在多个骨干网络上持续提升跨任务泛化能力,在仿真中实现高达207%的相对改进,并将真实世界成功率从5.8%提升至65.0%。这些结果表明,程序性记忆检索与自适应为将操作经验迁移到新任务提供了一种有效机制,同时保持了模块化和执行稳定性。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.
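The retrieve-then-fuse idea in this abstract can be illustrated with a small NumPy sketch. This is not the VLA-Pro implementation: the key embeddings, the cosine-similarity retrieval, the softmax weighting, and the names `retrieve_and_fuse`, `top_k`, and `temperature` are all assumptions made for illustration. The only thing taken from the abstract is that stored LoRA adapters are retrieved by context and fused.

```python
import numpy as np

def retrieve_and_fuse(query, keys, adapters, top_k=2, temperature=0.1):
    """Fuse the top_k stored LoRA updates most similar to the query.

    query:    (d,) embedding of the current context (hypothetical)
    keys:     (n, d) one key embedding per stored adapter (hypothetical)
    adapters: list of (A, B) pairs; B @ A is that adapter's weight update
    """
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = k @ q                                # cosine similarity per adapter
    top = np.argsort(sims)[::-1][:top_k]        # indices of the best matches
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()           # softmax over retrieved adapters
    # Weighted sum of the low-rank updates B @ A of the retrieved adapters.
    delta = sum(w * (adapters[i][1] @ adapters[i][0]) for w, i in zip(weights, top))
    return delta, top, weights

# Three toy adapters of rank 2 for a 4x6 weight matrix.
rng = np.random.default_rng(0)
toy_adapters = [(rng.standard_normal((2, 6)), rng.standard_normal((4, 2))) for _ in range(3)]
toy_keys = np.eye(3)
delta, top, weights = retrieve_and_fuse(np.array([0.0, 1.0, 0.0]), toy_keys, toy_adapters)
```

The fused `delta` would be added to a frozen base weight matrix. Whether the paper fuses in weight space or in some other way is not stated in the abstract.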

2605.29438 2026-05-29 cs.RO 版本更新

ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

ElegantVLA:学习何时思考以实现高效的视觉-语言-动作模型

Ye Li, Huanan Liu, Kangye Ji, Yuan Meng, Jiajun Fan, Yuansong Wang, Shiyu Qin, Chenglei Wu, Shu-Tao Xia, Zhi Wang

发表机构 * Tsinghua University(清华大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出ElegantVLA,一种即插即用的阶段自适应推理框架,通过动态计算调度在视觉编码器、大语言模型和动作头之间分配计算资源,实现VLA模型加速,在GR00T和CogACT上分别获得最高2.55倍和3.77倍加速。

详情
AI中文摘要

视觉-语言-动作(VLA)模型是通用机器人控制的一种强大范式。然而,其高计算成本和有限的控制频率阻碍了实时机器人操作,尤其是在每个控制步骤都运行大型视觉-语言骨干网络和迭代动作头时。现有的VLA加速方法通常优化单个组件或依赖固定的加速规则,对不同控制步骤采用大致固定的计算量,忽略了序列化具身控制的非均匀推理需求。受人类运动控制的启发,其中认知和反馈资源集中在目标敏感阶段,我们认为VLA模型应该学习何时投入完整计算以及何时重用先前的计算。我们提出ElegantVLA,一种即插即用的阶段自适应推理框架,通过模型内动态计算调度加速VLA模型。ElegantVLA引入一个轻量级调度器,观察时序表征相似性、机器人运动线索和任务进度,联合分配视觉编码器、大语言模型和动作头的计算。对于感知-语言推理,调度器根据视觉-语言表示稳定性选择五级视觉-大语言模型计算模式,从完全重计算到多步时间重用。对于动作生成,它选择三级去噪模式,在稳定运动期间重用中间去噪状态,同时在目标敏感阶段保留完整细化。通过协调这些决策,ElegantVLA为具有显式动作生成模块的现代VLA流水线提供了一个通用加速框架,无需修改或重新训练基础模型。在GR00T和CogACT上的实验分别实现了最高2.55倍和3.77倍的加速,在六个真实世界的GR00T任务中,ElegantVLA将计算量减少了2.18倍,同时将控制频率从13.8 Hz提高到26.3 Hz。

英文摘要

Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.
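A rough sketch of what a per-step compute scheduler could look like follows. The abstract describes a lightweight scheduler that picks one of five Vision-LLM compute modes and one of three denoising modes. Here that scheduler is replaced by hand-written thresholds on cosine similarity, which is an assumption. The threshold values, the `goal_sensitive` flag, and all function names are hypothetical.

```python
import numpy as np

# Hypothetical cut points: 4 thresholds give 5 levels, 2 thresholds give 3 levels.
VLM_THRESHOLDS = (0.90, 0.95, 0.98, 0.995)
DENOISE_THRESHOLDS = (0.95, 0.99)

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_level(score, thresholds):
    """Level 0 means full recomputation; higher levels mean more reuse."""
    return int(sum(score >= t for t in thresholds))

def schedule(prev_feat, cur_feat, goal_sensitive):
    """Return (vision_llm_level, denoise_level) for the current control step."""
    if goal_sensitive:
        return 0, 0  # spend full compute on steps flagged as goal-sensitive
    sim = cosine(prev_feat, cur_feat)
    return pick_level(sim, VLM_THRESHOLDS), pick_level(sim, DENOISE_THRESHOLDS)

feat = np.array([1.0, 2.0, 3.0])
stable_step = schedule(feat, feat, goal_sensitive=False)
changed_step = schedule(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), goal_sensitive=False)
```

In this toy rule, nearly identical consecutive features select the highest reuse level, while a large change or a goal-sensitive step falls back to full computation.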

2605.29416 2026-05-29 cs.RO cs.CV 版本更新

3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding

3DVLA:通过3D空间和实例理解增强视觉-语言-动作模型

Zhongyu Xia, Yousen Tang, Bingqing Wei, Yongtao Wang

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机研究所)

AI总结 提出3DVLA框架,通过多视角一致性3D特征编码、实例估计模块和掩码自监督3D编码,解决VLA模型缺乏3D场景理解的问题,在LIBERO-Plus和RoboTwin 2.0上显著提升操作性能。

详情
AI中文摘要

视觉-语言-动作模型在机器人操作中取得了显著进展,但存在一个关键限制:缺乏3D场景理解。这一缺陷表现为三个相互交织的挑战:在不强制执行多视角一致性的情况下弱提取3D空间位置、不足的3D实例理解以及遮挡下的脆弱推理。尽管存在成熟的3D感知方法,但由于架构不兼容以及对昂贵实例级标注的严重依赖,它们难以直接集成到VLA流程中。为解决上述挑战,我们提出3DVLA,一个即插即用框架,将稳健的3D推理注入预训练的VLA,无需额外人工标注或丢弃VLM先验。具体来说,3DVLA通过以下方式应对三个挑战:(1)在所有模态上具有显式多视角一致性约束的普遍3D特征编码和空间条件几何聚合方法,(2)具有高级实例令牌的实例估计模块以实现3D实例感知,以及(3)保留预测器用于视觉令牌完成的掩码自监督3D编码分支以处理遮挡。我们将3DVLA与多个VLA基线集成,并在LIBERO-Plus和RoboTwin 2.0上进行评估。结果显示操作性能持续且显著提升,验证了我们方法的有效性和即插即用兼容性。

英文摘要

Vision-Language-Action models have achieved remarkable progress in robotic manipulation, yet they suffer from a critical limitation: a lack of 3D scene understanding. This deficiency manifests as three intertwined challenges: weak extraction of 3D spatial positions without enforcing multi-view consistency, inadequate 3D instance understanding, and fragile reasoning under occlusion. Although mature 3D perception methods exist, their direct integration into VLA pipelines is hindered by architectural incompatibility and by heavy reliance on costly instance-level annotations. To address the above challenges, we propose 3DVLA, a plug-and-play framework that injects robust 3D reasoning into pretrained VLAs without requiring extra manual labels or discarding VLM priors. Specifically, 3DVLA tackles the three challenges through: (1) pervasive 3D feature encoding with explicit multi-view consistency constraints across all modalities and a Spatially-Conditioned Geometry Aggregation method, (2) an instance estimation module with high-level instance tokens for 3D instance awareness, and (3) a masked self-supervised 3D encoding branch that retains its predictor for visual token completion to handle occlusions. We integrate 3DVLA with multiple VLA baselines and evaluate on LIBERO-Plus and RoboTwin 2.0. Results show consistent and significant gains in manipulation performance, validating both the effectiveness and plug-and-play compatibility of our approach.

2605.29410 2026-05-29 cs.RO 版本更新

A Progress-Aware Leader-Follower Midair Docking System for Dual-Drone Aerial Manipulation

面向双无人机空中操控的进度感知领航-跟随空中对接系统

Yifan Cai, Jan Ming Kevin Tan, Xiangqi Li, Chenzhe Jin, Narsimlu Kemsaram, Valerio Modugno

发表机构 * Department of Computer Science, University College London(计算机科学系,伦敦大学学院)

AI总结 提出一种进度感知的领航-跟随双四旋翼空中对接平台,通过被动磁锁紧模块和阶段管理器实现可靠对接,并基于定量指标进行仿真与实验评估。

Comments This paper has been accepted for publication in the Proceedings of the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), August 17-21, 2026, Shenyang, China

详情
AI中文摘要

小型无人机之间的可靠空中对接对于模块化空中合作与操控至关重要,但需要在严格的推力和载荷约束下实现精确的相对位姿控制和可重复的平台操作。我们提出了一种双无人机对接平台,其中两架四旋翼以领航-跟随编队运行,并使用带有被动磁锁紧的轻量级模块化框架进行对接。一个进度感知的任务监督器管理阶段转换:接近、对准、捕获和稳定。该平台集成了完整的硬件-软件栈(带有Crazyflie/PX4接口的ROS 2)和同步日志记录,用于基准评估。我们在仿真和实际实验中,使用编队误差、基线及偏航一致性、对接成功率、对接时间和失败模式统计等定量指标对平台进行评估。该平台能够对对接监督和同步策略进行基于统计的比较,并为模块化空中合作和可重复的空中操控提供了实用的测试平台。

英文摘要

Reliable midair docking between small unmanned aerial vehicles (UAVs) is essential for modular aerial cooperation and manipulation, but it requires precise relative-pose control and repeatable platform operation under tight thrust and payload constraints. We present a dual-drone docking platform where two quadrotors operate in a leader-follower formation and dock using a lightweight modular frame with passive magnetic latching. A progress-aware mission supervisor manages phase transitions: approach, alignment, capture, and settle. This platform integrates a complete hardware-software stack (ROS 2 with Crazyflie/PX4 interfaces) and synchronized logging for benchmark evaluation. We evaluate the platform in simulation and real-world experiments using quantitative metrics such as formation error, baseline and yaw consistency, docking success rate, time-to-dock, and failure-mode statistics. The platform enables statistically grounded comparison of docking supervision and synchronization strategies and provides a practical testbed for modular aerial cooperation and repeatable midair aerial manipulation.
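The four phases named in the abstract (approach, alignment, capture, settle) suggest a small state machine. The sketch below is only a guess at the shape of such a supervisor. The transition conditions, the error tolerances, the fallback to `APPROACH`, and the `stable_steps` counter are all hypothetical and are not taken from the paper.

```python
from enum import Enum

class Phase(Enum):
    APPROACH = 0
    ALIGNMENT = 1
    CAPTURE = 2
    SETTLE = 3
    DONE = 4

def next_phase(phase, pos_err, yaw_err, latched, stable_steps):
    """Advance the docking phase from relative-pose errors (metres, radians).

    All thresholds below are illustrative placeholders.
    """
    if phase in (Phase.ALIGNMENT, Phase.CAPTURE) and pos_err > 0.5:
        return Phase.APPROACH              # follower drifted away: start over
    if phase is Phase.APPROACH and pos_err < 0.30:
        return Phase.ALIGNMENT
    if phase is Phase.ALIGNMENT and pos_err < 0.05 and abs(yaw_err) < 0.10:
        return Phase.CAPTURE
    if phase is Phase.CAPTURE and latched:
        return Phase.SETTLE                # magnetic latch reported contact
    if phase is Phase.SETTLE and stable_steps >= 50:
        return Phase.DONE
    return phase                           # no condition met: stay in phase
```

Keeping the transition logic in one pure function makes it easy to replay logged errors offline and count phase-specific failures, which fits the benchmark-style evaluation the abstract describes.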

2605.29409 2026-05-29 eess.SY cs.RO cs.SY 版本更新

Decoupled Thrust-Axis Attitude Control Using Quaternions for Chandrayaan-3 Lunar Landing Mission

基于四元数的解耦推力轴姿态控制用于月船三号月球着陆任务

Aditya Rallapalli, Suraj Kumar, Rijesh M P, Ashok Kumar Kakula, Bharat Kumar GVP

发表机构 * Controls and Digital Area, U R Rao Satellite Centre, Indian Space Research Organization (ISRO), Bangalore, India(控制与数字领域部,U R Rao卫星中心,印度空间研究组织(ISRO),班加罗尔,印度)

AI总结 针对月船三号着陆任务,提出一种基于四元数的解耦方法,实现推力轴独立控制,避免制导与控制之间的不良耦合。

Comments 6 pages, 7 figures, Published in Indian Control Conference 2025

详情
AI中文摘要

月船三号任务在月球南极附近成功软着陆,实现了历史性里程碑,凸显了导航、制导与控制(NGC)系统的关键作用。导航提供了相对于月球中心的飞行器状态估计,而基于多项式的制导方案计算了满足终端着陆条件所需的加速度剖面。该加速度需求被转化为总推力大小和姿态指令生成。姿态指令生成涉及将推力轴与所需加速度矢量对齐,并约束绕推力轴的旋转,通常由任务特定要求决定。尽管基于四元数的控制律因其无奇点表示而受到青睐,但它们固有地耦合了所有三个旋转轴。这种耦合可能导致制导与控制之间的不良相互作用,特别是在绕推力轴进行大旋转时,由于四元数的最短路径特性。本文提出了一种新颖的基于四元数的解耦方法,能够实现独立的推力轴控制,减轻制导-控制相互作用,并确保着陆器姿态控制的正确姿态指令生成。

英文摘要

The Chandrayaan-3 mission achieved a historic milestone with its successful soft landing near the lunar south pole, highlighting the critical role of the navigation, guidance, and control (NGC) system. Navigation provided vehicle state estimates relative to the Moon's center, while a polynomial-based guidance scheme computed the required acceleration profile to meet terminal landing conditions. This acceleration demand was translated into a total thrust magnitude and attitude commands. Attitude command generation involved aligning the thrust axis with the required acceleration vector and constraining rotation about the thrust axis, typically governed by mission-specific requirements. Although quaternion-based control laws are preferred for their singularity-free representation, they inherently couple all three rotational axes. This coupling can lead to undesirable interactions between guidance and control, especially during large rotations about the thrust axis, due to the quaternion shortest-path property. This paper proposes a novel quaternion-based decoupling method that enables independent thrust-axis control, mitigating guidance-control interaction and ensuring proper attitude command generation for lander attitude control.
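One well-known way to separate rotation about a chosen axis from the rest of a quaternion is the swing-twist decomposition, sketched below. It is offered only as background on how such a split can be computed. The abstract does not say that the paper's decoupling method uses this decomposition, and the function names and the scalar-first `(w, x, y, z)` convention are choices made here.

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ])

def quat_conj(q):
    """Conjugate, which is the inverse for a unit quaternion."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def swing_twist(q, axis):
    """Split unit quaternion q into swing * twist, with twist about `axis`."""
    axis = axis / np.linalg.norm(axis)
    proj = np.dot(q[1:], axis) * axis        # vector part projected on the axis
    twist = np.array([q[0], proj[0], proj[1], proj[2]])
    norm = np.linalg.norm(twist)
    if norm < 1e-9:
        twist = np.array([1.0, 0.0, 0.0, 0.0])  # 180 degree swing: twist undefined
    else:
        twist = twist / norm
    swing = quat_mul(q, quat_conj(twist))    # remaining rotation, axis-free part
    return swing, twist

# Compose a 0.4 rad tilt about x with a 1.2 rad spin about the z (thrust) axis.
tilt = np.array([np.cos(0.2), np.sin(0.2), 0.0, 0.0])
spin = np.array([np.cos(0.6), 0.0, 0.0, np.sin(0.6)])
swing, twist = swing_twist(quat_mul(tilt, spin), np.array([0.0, 0.0, 1.0]))
```

With the thrust axis as `axis`, the swing part carries the pointing error of the thrust vector and the twist part carries the rotation about it, so the two could in principle be regulated with separate gains.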

2605.29407 2026-05-29 cs.RO 版本更新

Phase-Conditioned Imitation Learning with Autonomous Failure Recovery for Robust Deformable Object Manipulation

阶段条件化模仿学习与自主故障恢复用于鲁棒可变形物体操作

Dayuan Chen, Kai Tang, Yukuan Zhang, Kazuhiro Kosuge, Yasuhisa Hirata

发表机构 * Department of Robotics, Tohoku University(东北大学机器人系) JC STEM Lab of Robotics for Soft Materials, the Department of Electrical and Computer Engineering, Faculty of Engineering, The University of Hong Kong(香港大学工程学院电气与计算机工程系软材料机器人实验室)

AI总结 提出一种阶段条件化、力感知的闭环分层框架,通过FiLM调节的ACT编码器和多模态阶段预测器实现自主故障恢复,显著提升可变形物体操作的成功率。

Comments Accepted to IEEE/ASME Transactions on Mechatronics

详情
AI中文摘要

本文提出了一种阶段条件化、力感知的框架,用于鲁棒的可变形物体操作。标准的模仿学习策略(如使用Transformer的动作分块,ACT)在推理时依赖马尔可夫假设,当视觉上相似的观测需要矛盾的动作时会导致状态混淆,并阻止从执行故障中自主恢复。我们通过一个闭环分层架构解决了这一问题。一个FiLM条件化的ACT编码器根据当前任务阶段调节特征提取,使得单一统一策略能够产生阶段特定的行为,同时跨阶段共享动作动态。一个融合视觉、力和位姿反馈的多模态阶段预测器实时估计阶段,检测仅靠视觉无法发现的接触故障,并自主触发恢复轨迹。该系统由一个用于柔顺执行的混合阻抗控制器和一个用于力感知数据收集的触觉遥操作接口完成。消融研究表明,基于FiLM的调制显著优于无条件化和令牌级条件化的基线,t-SNE分析证实FiLM诱导了良好分离的、阶段特定的特征表示。在双臂挂上和脱下T恤的任务中验证,闭环系统通过自主错误恢复将挂上成功率从56%提高到87%。代码和视频:https://leledeyuan00.github.io/phaser/

英文摘要

This paper presents a phase-conditioned, force-aware framework for robust deformable object manipulation. Standard imitation learning policies such as Action Chunking with Transformers (ACT) rely on a Markovian assumption at inference, causing state aliasing when visually similar observations require contradictory actions and preventing autonomous recovery from execution failures. We address this with a closed-loop hierarchical architecture. A FiLM-conditioned ACT encoder modulates feature extraction based on the current task phase, enabling a single unified policy to produce phase-specific behaviors while sharing action dynamics across phases. A multi-modal phase predictor fusing visual, force, and pose feedback estimates the phase in real time, detecting contact failures that are invisible to vision alone and autonomously triggering recovery trajectories. The system is completed by a hybrid impedance controller for compliant execution and a haptic teleoperation interface for force-aware data collection. Ablation studies show that FiLM-based modulation significantly outperforms both unconditioned and token-level conditioned baselines, and t-SNE analysis confirms that FiLM induces well-separated, phase-specific feature representations. Validated on hanging and removing a T-shirt with dual arms, the closed-loop system improves the hanging success rate from 56% to 87% through autonomous error recovery. Code and videos: https://leledeyuan00.github.io/phaser/
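FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters chosen by a conditioning signal, here the task phase. The NumPy sketch below shows only that core operation with a lookup table per discrete phase. The class name, the table-based parameterisation, and the random initialisation are assumptions, and the paper's encoder may generate the parameters differently.

```python
import numpy as np

class FiLM:
    """Per-phase feature-wise affine modulation: y = gamma[phase] * x + beta[phase]."""

    def __init__(self, num_phases, num_features, seed=0):
        rng = np.random.default_rng(seed)
        # One (gamma, beta) row per phase; gamma starts near 1 and beta near 0,
        # so the layer is close to the identity before any training.
        self.gamma = 1.0 + 0.1 * rng.standard_normal((num_phases, num_features))
        self.beta = 0.1 * rng.standard_normal((num_phases, num_features))

    def __call__(self, features, phase):
        # features: (tokens, num_features); phase: integer phase index.
        return self.gamma[phase] * features + self.beta[phase]

film = FiLM(num_phases=4, num_features=8)
tokens = np.ones((3, 8))
out_phase0 = film(tokens, 0)
out_phase1 = film(tokens, 1)
```

The same input tokens produce different features under different phase indices, which is the mechanism that lets one shared network behave differently per phase.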

2605.29378 2026-05-29 cs.RO 版本更新

Decentralized LLM-Driven Coordination of Acoustic Robots for Contactless Object Manipulation

去中心化LLM驱动的声学机器人协调用于非接触式物体操控

Yingying Wang, Narsimlu Kemsaram, Sriram Subramanian

发表机构 * Department of Computer Science, University College London(计算机科学系,伦敦大学学院) Department of Artificial Intelligence, University of Malaya(人工智能系,马来亚大学)

AI总结 提出一种去中心化框架,利用Whisper语音识别和LLM语义解析将自然语言指令转换为多机器人任务计划,实现声学机器人的非接触式物体操控,实验验证了顺序、并行和同步协作任务的有效性。

Comments This paper has been accepted for publication in the Proceedings of the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), August 17-21, 2026, Shenyang, China

详情
AI中文摘要

自然语言接口可以简化与多机器人系统的交互,特别是当非专业用户需要发出高级命令时。使用超声相控阵的声学操控也实现了非接触式物体处理,适用于医疗保健、实验室自动化和精密运输等应用。然而,将大型语言模型(LLM)与分布式声学移动机器人相结合仍未被充分探索。本文提出了一种去中心化框架,用于自然语言驱动的声学机器人协调,实现非接触式物体操控。该系统使用基于Whisper的语音识别、基于LLM的语义解析、结构化JSON任务表示和分布式调度,将口语指令转换为可执行的多机器人任务计划。JSON模式编码了机器人分配、时间依赖、空间约束以及顺序、并行和同步执行的同步要求。该系统在两个基于TurtleBot3的声学机器人上实现,每个机器人配备一个超声相控阵用于非接触式物体运输。实验在三种场景下进行:顺序执行、并行多机器人运输和同步协作操控。系统在顺序任务中实现了96%的任务成功率,并行执行为86%,同步协作运输为70%。这些结果表明,自然语言命令可以转化为分布式机器人动作以实现非接触式操控,突显了LLM驱动的自动化在分布式机器人系统中用于人机交互的潜力。

英文摘要

Natural language interfaces can simplify interaction with multi-robot systems, especially when non-expert users need to issue high-level commands. Acoustic manipulation using ultrasonic phased arrays also enables contactless object handling for applications such as healthcare, laboratory automation, and precision transport. However, combining large language models (LLMs) with distributed acoustic mobile robots remains underexplored. This paper presents a decentralized framework for natural language-driven coordination of acoustic robots for contactless object manipulation. The system converts spoken instructions into executable multi-robot task plans using Whisper-based speech recognition, LLM-based semantic parsing, structured JSON task representation, and distributed scheduling. The JSON schema encodes robot assignments, temporal dependencies, spatial constraints, and synchronization requirements for sequential, parallel, and synchronized execution. The system is implemented on two TurtleBot3-based acoustic robots, each equipped with an ultrasonic phased array for contactless object transport. Experiments were conducted in three scenarios: sequential execution, parallel multi-robot transport, and synchronized cooperative manipulation. The system achieved task success rates of 96 percent for sequential tasks, 86 percent for parallel execution, and 70 percent for synchronized collaborative transport. These results show that natural language commands can be transformed into distributed robot actions for contactless manipulation, highlighting the potential of LLM-driven automation for human-robot interaction in distributed robotic systems.
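To make the structured JSON task representation more concrete, here is a made-up plan and a small routine that groups tasks into execution waves from their dependencies. The field names (`id`, `robot`, `action`, `target`, `after`, `sync_with`) and the wave-based ordering are hypothetical. The abstract only states that the schema encodes robot assignments, temporal dependencies, spatial constraints, and synchronization requirements.

```python
import json

# Hypothetical plan: two independent pick tasks, then a synchronized carry.
PLAN = json.loads("""
{
  "tasks": [
    {"id": "t1", "robot": "robot_a", "action": "pick",  "target": "bead_1", "after": []},
    {"id": "t2", "robot": "robot_b", "action": "pick",  "target": "bead_2", "after": []},
    {"id": "t3", "robot": "robot_a", "action": "carry", "target": "zone_c",
     "after": ["t1", "t2"], "sync_with": "t4"},
    {"id": "t4", "robot": "robot_b", "action": "carry", "target": "zone_c",
     "after": ["t1", "t2"], "sync_with": "t3"}
  ]
}
""")

def execution_waves(plan):
    """Group task ids into waves; every task in a wave has all dependencies done."""
    pending = {task["id"]: set(task["after"]) for task in plan["tasks"]}
    done, waves = set(), []
    while pending:
        ready = sorted(tid for tid, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(ready)
        done.update(ready)
        for tid in ready:
            del pending[tid]
    return waves

waves = execution_waves(PLAN)
```

Tasks in the same wave can run in parallel on different robots. A `sync_with` pair would additionally need a shared start signal, which this sketch does not model.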

2605.29301 2026-05-29 cs.RO 版本更新

The Open Motion Planning Library 2.0

开放运动规划库2.0

Weihang Guo, Theodoros Tyrovouzis, Emiliano Flores, Clayton W. Ramsey, Zachary K. Kingston, Ioan A. Şucan, Mark Moll, Lydia E. Kavraki

发表机构 * Department of Computer Science, Rice University(计算机科学系,莱斯大学) Department of Computer Science, Purdue University(计算机科学系,普渡大学) Waymo, LLC(Waymo公司) Metron, Inc.(Metron公司) Ken Kennedy Institute at Rice University(莱斯大学肯·肯尼迪研究所)

AI总结 本文介绍OMPL 2.0,通过硬件加速实现实时运动规划,并集成现代AI研究流程,总结了库与运动规划领域的共同发展及其对研究社区的影响。

详情
AI中文摘要

开放运动规划库(OMPL)于2008年首次发布,已成为运动规划社区的基石,提供了广泛的最先进的基于采样的算法的实现。经过近二十年的持续开发,我们不断扩展该库,增加了新的规划器、状态空间和问题表述。这些新增内容包括渐近最优和懒惰规划器、约束运动规划以及具有时序逻辑目标的规划。在此基础上,我们推出了OMPL 2.0,这是该库的一次重大演进,旨在通过硬件加速实现实时运动规划,并与现代AI研究流程无缝集成。我们还反思了OMPL和运动规划领域多年来如何共同成长,并讨论了该库对研究社区的更广泛影响。

英文摘要

The Open Motion Planning Library (OMPL), first released in 2008, has become a cornerstone of the motion planning community, providing implementations of a wide range of state-of-the-art sampling-based algorithms. Over almost two decades of continuous development, we have steadily expanded the library with new planners, state spaces, and problem formulations. These additions range from asymptotically optimal and lazy planners to constrained motion planning and planning with temporal-logic goals. Building on this foundation, we introduce OMPL 2.0, a major evolution of the library that targets real-time motion planning through hardware acceleration and integrates seamlessly with modern AI research workflows. We also reflect on how OMPL and the field of motion planning have grown together over the years, and discuss the library's broader impact on the research community.

2605.29298 2026-05-29 cs.RO 版本更新

MonoDuo: Using One Robot Arm to Learn Bimanual Policies

MonoDuo: 使用单机械臂学习双臂策略

Sandeep Bajamahal, Lawrence Yunliang Chen, Toru Lin, Zehan Ma, Jitendra Malik, Ken Goldberg

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MonoDuo框架,利用单臂机器人演示和人类协作数据,通过数据增强生成合成演示,训练双臂机器人策略,在五项任务中实现零样本部署和少样本微调,成功率高达70%。

Comments Accepted to appear in the 2026 IEEE International Conference on Robotics and Automation (ICRA), Vienna, Austria, 1-5 June 2026

详情
AI中文摘要

双臂协调对于许多现实世界的操作任务至关重要,然而学习双臂机器人策略受到双臂机器人和数据集稀缺的限制。相比之下,单臂机器人在研究实验室中广泛可用。我们能否利用它们来训练双臂机器人策略?我们提出MonoDuo,一个利用单臂机器人演示与人类协作来学习双臂操作策略的框架。MonoDuo通过遥操作单臂机器人执行双臂任务的一侧,同时由人类执行另一侧来收集数据,然后交换角色以覆盖两侧。来自腕部安装和固定摄像头的RGB-D观测通过最先进的手部姿态估计、图像和点云分割以及修复,被增强为目标双臂机器人的合成演示。这些基于真实机器人运动学的合成演示用于训练双臂策略。我们在五项任务上评估MonoDuo:举箱、背包打包、叠布、拉拉链和递盘子。与仅依赖人类双臂视频的方法相比,MonoDuo能够在未见过的双臂机器人配置上实现零样本部署,成功率高达70%。仅使用25个目标机器人演示进行少样本微调,相比从头训练,成功率进一步提升65-70%,展示了MonoDuo在将单臂机器人数据高效迁移到双臂机器人策略方面的有效性。

英文摘要

Bimanual coordination is essential for many real-world manipulation tasks, yet learning bimanual robot policies is limited by the scarcity of bimanual robots and datasets. Single-arm robots, however, are widely available in research labs. Can we leverage them to train bimanual robot policies? We present MonoDuo, a framework for learning bimanual manipulation policies using single-arm robot demonstrations paired with human collaboration. MonoDuo collects data by teleoperating a single-arm robot to perform one side of a bimanual task while a human performs the other, then swapping roles to cover both sides. RGB-D observations from a wrist-mounted and fixed camera are augmented into synthetic demonstrations for target bimanual robots using state-of-the-art hand pose estimation, image and point cloud segmentation, and inpainting. These synthetic demonstrations, grounded in real robot kinematics, are used to train bimanual policies. We evaluate MonoDuo on five tasks: box lifting, backpack packing, cloth folding, jacket zipping, and plate handover. Compared to approaches relying solely on human bimanual videos, MonoDuo enables zero-shot deployment on unseen bimanual robot configurations, achieving success rates up to 70%. With only 25 target robot demonstrations, few-shot finetuning further boosts success rates by 65-70% over training from scratch, demonstrating MonoDuo's effectiveness in efficiently transferring knowledge from single-arm robot data to bimanual robot policies.

2605.29254 2026-05-29 cs.RO cs.AI 版本更新

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

极端动态对称性实现全向多功能机器人

Jiaxun Liu, Boxi Xia, Boyuan Chen

发表机构 * Department of Mechanical Engineering and Materials Science, Duke University(杜克大学机械工程与材料科学系) Department of Electrical and Computer Engineering, Duke University(杜克大学电气与计算机工程系) Department of Computer Science, Duke University(杜克大学计算机科学系)

AI总结 本文提出动态对称性概念,通过动态各向同性度量,在超过1000种模拟形态中发现高动态对称性可提升轨迹跟踪、任务成功率、鲁棒性等性能,并开发了Argus球形机器人系列验证近极端动态各向同性带来的全向运动、自适应地形、快速自稳定和抗故障能力。

Comments Published in Science Robotics (2026). Our project website is at:https://generalroboticslab.com/Argus

详情
Journal ref
Science Robotics 11, eaec1725 (2026)
AI中文摘要

对称性是自然系统中的核心组织原则,但其作为机器人统一设计策略的应用仍主要局限于几何形态。我们证明,对称性可以在动态驱动能力层面加以利用。我们引入动态对称性,即机器人可达质心加速度的均匀性,并通过称为动态各向同性的度量将其形式化。在超过1000种模拟形态中,我们发现更高的动态对称性持续改善了轨迹跟踪、任务成功率、鲁棒性、恢复能力和能量效率,且当动态各向同性接近其理论极限时,效益最为显著。为了系统地研究这一机制,我们开发了Argus,一系列球形机器人,旨在探索增加动态对称性的效果。Argus家族的成员在驱动几何和动态对称性水平上有所不同,但共享一个共同架构原则:径向定向的线性致动器直接塑造机器人的质心动力学。其中,我们构建了一个物理的20腿Argus变体,实现了接近极端的动态各向同性,并展示了方向无关的运动、在杂乱和可变形地形上的敏捷穿越、快速自稳定以及对部分致动器故障的鲁棒性。其分布式感知进一步实现了在连续运动中的全向感知和物体交互。这些结果表明,不仅在形态上而且在可达动力学上设计机器人的对称性,为在不确定的地球和地外环境中实现敏捷性、鲁棒性和多功能性提供了一条强大且通用的途径。

英文摘要

Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained limited to geometric form. We show that symmetry can instead be leveraged at the level of dynamic actuation capability. We introduce dynamic symmetry, the uniformity of a robot's attainable center-of-mass accelerations, and formalize it through a measure coined as dynamic isotropy. Across more than 1000 simulated morphologies, we found that higher dynamic symmetry consistently improved trajectory tracking, task success, robustness, resiliency, and energy efficiency, with the benefits becoming most pronounced as dynamic isotropy approached its theoretical limit. To study this regime systematically, we developed Argus, a family of spherical robots designed to explore the effects of increasing dynamic symmetry. Members of the Argus family vary in their actuation geometry and dynamic symmetry level while sharing a common architectural principle: radially oriented linear actuators that directly shape the robot's center-of-mass dynamics. Among them, we built a physical 20-leg Argus variant that achieved near-extreme dynamic isotropy and demonstrated orientation-invariant locomotion, agile traversal of cluttered and deformable terrain, rapid self-stabilization, and resilience to partial actuator failures. Its distributed sensing further enabled omnidirectional perception and object interaction during continuous motion. These results show that designing robots for symmetry not only in morphology but also in their attainable dynamics provides a powerful and general pathway toward agility, robustness, and multifunctionality in uncertain terrestrial and extraterrestrial environments.

2605.29191 2026-05-29 eess.SY cs.RO cs.SY math.OC 版本更新

Distributed Non-Uniform Scaling Control of Multi-Agent Formation with Dynamic Agent Joining

具有动态加入智能体的多智能体编队分布式非均匀缩放控制

Tao He, Gangshan Jing

发表机构 * School of Automation, Chongqing University, Chongqing, China(重庆大学自动化学院,重庆,中国)

AI总结 针对动态加入智能体的多智能体编队,提出一种分布式非均匀缩放控制框架,通过保持图拉普拉斯矩阵的谱特性实现任意维度下的编队形状调整。

Comments This paper has been accepted by IFAC 2026

详情
AI中文摘要

编队的非均匀缩放控制使多智能体系统能够通过沿不同坐标轴以不同比例缩放来调整其形状,在复杂环境中提供增强的灵活性。然而,与大多数现有的编队机动策略一样,它通常假设一组固定的智能体,限制了其在需要动态团队扩展的场景中的适用性。本文介绍了一种分布式控制框架,该框架能够在任意维度的非均匀缩放机动过程中将新智能体纳入编队,同时保持图拉普拉斯矩阵的谱特性。仿真示例验证了理论结果的有效性。

英文摘要

Non-uniform scaling control of formation enables multi-agent systems to adjust their shape by scaling with different ratios along different coordinate axes, offering enhanced flexibility in complex environments. However, like most existing formation maneuver strategies, it typically assumes a fixed set of agents, limiting its applicability in scenarios requiring dynamic team expansion. This paper introduces a distributed control framework that enables a formation to incorporate new agents during non-uniform scaling maneuvers in arbitrary dimensions while preserving the spectral properties of the graph Laplacian. Simulation examples validate the effectiveness of the theoretical results.

2605.29155 2026-05-29 cs.RO cs.AI cs.DC 版本更新

CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control

CA-AC-MPC: CUDA加速的Actor-Critic模型预测控制

Antoonio Buo, Vittorio Cammarota, Michele Avagnale, Pierluigi Arpenti, Vincenzo Lippiello, Fabio Ruggiero

Affiliations * PRISMA Lab and CREATE Consortium, Department of Electrical Engineering and Information Technology, University of Naples Federico II

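AI summary Proposes a CUDA-accelerated variant of AC-MPC that uses GPU-parallel optimization to reduce training and inference latency, achieving state-of-the-art lap times and near-limit dynamic performance on an agile drone racing task.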

Comments Accepted for presentation at the 2026 International Conference on Unmanned Aircraft Systems, ICUAS 2026

详情
AI中文摘要

在文献中,actor-critic模型预测控制(AC-MPC)将MPC与强化学习相结合,以实现复杂动态系统的高性能控制。然而,其可微分的MPC层需要在正向和反向传播中反复求解优化问题,导致大量的训练和推理延迟。本文通过引入CUDA加速变体解决了这一瓶颈,显著减少了端到端执行时间,同时保持了基线公式的控制性能。在敏捷无人机竞速任务上的仿真结果表明,我们的方法实现了最先进的圈速和近极限动态行为,同时显著减少了训练和推理时间。

英文摘要

In the literature, actor-critic model predictive control (AC-MPC) integrates MPC with reinforcement learning to enable high-performance control of complex dynamical systems. However, its differentiable MPC layer requires repeatedly solving an optimization problem in both the forward and backward passes, leading to substantial training and inference latency. This paper tackles this bottleneck by introducing a CUDA-accelerated variant that significantly reduces end-to-end execution time while preserving the control performance of the baseline formulation. Simulation results on an agile drone racing task show that our approach achieves state-of-the-art lap times and near-limit dynamic behaviour with markedly reduced training and inference time.

2605.29144 2026-05-29 cs.RO cs.SY eess.SY 版本更新

Learning and Adaptation in Wire Arc Additive Manufacturing Bead Geometry Control

线弧增材制造焊道几何控制中的学习与自适应

Chen-Lung Lu, John Wen

Affiliations * Rensselaer Polytechnic Institute

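AI summary For the nonlinear dynamics of wire arc additive manufacturing, in which the thermal field is coupled to the build geometry, proposes a data-driven approach based on a recurrent neural network and one-step-ahead predictive control, adapts the model using layer-by-layer prediction error, and experimentally demonstrates significant improvements in bead height and width consistency.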

详情
AI中文摘要

机器人线弧增材制造(WAAM)受复杂非线性过程动力学控制,将热场与构建几何耦合。该过程可视为多输入/多输出动态系统,以焊枪速度和送丝速率作为输入,焊道沉积高度和宽度作为输出。本文利用输入/输出数据学习数据驱动模型,并将其用于焊道规划和控制。我们证明,简单的循环神经网络架构和一步预测控制可以在高度和宽度一致性方面改善过程性能。为了考虑打印过程中热条件的变化,我们使用前一层的预测误差更新学习模型。该自适应步骤进一步提高了预测精度和控制器性能。在集成线扫描反馈的机器人WAAM实验平台上进行的实验表明,与恒定输入和静态模型基线相比,高度和宽度一致性有显著改善。所提出的学习和自适应框架为实现增材制造过程的鲁棒、数据驱动调控提供了实用途径。

英文摘要

Robotic Wire Arc Additive Manufacturing (WAAM) is governed by complex and nonlinear process dynamics that couple the thermal field to the build geometry. The process may be regarded as a multi-input/multi-output dynamical system with welding torch speed and wire feed rate as inputs and weld bead deposition height and width as outputs. In this paper, we use the input/output data to learn a data-driven model and use it for weld planning and control. We show that a simple recurrent neural network architecture and one-step-ahead predictive control can improve the process performance in terms of height and width consistency. To account for the changing thermal conditions during the printing process, we update the learned model using the prediction error from the previous layer. This adaptation step further improves the prediction accuracy and controller performance. Experiments on a robotic WAAM testbed with integrated line-scanner feedback show significant improvements in height and width consistency compared to constant-input and static-model baselines. The proposed learning and adaptation framework provides a practical pathway toward robust, data-driven regulation of additive manufacturing processes.

2605.29138 2026-05-29 cs.RO cs.AI cs.LG cs.SY eess.SY 版本更新

Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving

用于优化自动驾驶延迟-准确性权衡的多分辨率端到端深度神经网络

Qitao Weng, Heechul Yun

Affiliations * University of Kansas, Lawrence

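AI summary Proposes a multi-resolution end-to-end CNN that uses runtime input-resolution selection and resolution retargeting to optimize the latency-safety tradeoff in autonomous driving under a latency budget.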

Comments ICCPS 2026

详情
AI中文摘要

延迟-准确性权衡是深度神经网络在信息物理系统实时应用中的基础。在自动驾驶中,安全性尤其依赖于预测质量和从感知到执行的端到端延迟。我们观察到:(1) 当考虑延迟时,延迟最优的网络配置随场景上下文和计算可用性而变化;(2) 单一固定分辨率模型在条件变化时变得次优。我们提出了一种用于CARLA城市驾驶挑战的多分辨率端到端深度神经网络,使用单目摄像头输入。我们的方法采用支持多种输入分辨率的卷积神经网络,通过每分辨率批归一化,使得在延迟预算下运行时选择理想输入尺度成为可能,以及分辨率重定向,允许在没有原始训练数据集的情况下进行多分辨率训练。我们在CARLA中实现并评估了我们的多分辨率端到端CNN,以探索延迟-安全性边界。结果显示,相对于固定分辨率基线,每条路线的安全性指标——车道入侵、红灯违规和碰撞——一致改善。

英文摘要

Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change. We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset. We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. Results show consistent improvements in per-route safety metrics - lane invasions, red-light infractions, and collisions - relative to fixed-resolution baselines.
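
To make the idea of runtime resolution selection under a latency budget concrete, here is a minimal sketch. It assumes a simple rule (pick the largest input resolution whose profiled latency fits the budget), which may differ from the selection logic in the paper; the function `select_resolution` and the latency numbers are hypothetical.

```python
from typing import Dict, Optional


def select_resolution(latency_ms: Dict[int, float],
                      budget_ms: float) -> Optional[int]:
    """Return the largest input height whose profiled latency fits the budget."""
    feasible = [res for res, lat in latency_ms.items() if lat <= budget_ms]
    # Fall back to None so the caller can decide what to do when nothing fits.
    return max(feasible) if feasible else None


# Hypothetical per-resolution latency profile (input height -> milliseconds).
profile = {120: 8.0, 240: 15.0, 480: 34.0}

print(select_resolution(profile, budget_ms=20.0))
print(select_resolution(profile, budget_ms=5.0))
```

In a deployed system the latency profile would presumably be re-measured as compute availability changes, which is what makes a single fixed resolution suboptimal.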

2605.29114 2026-05-29 cs.CR cs.LG cs.RO 版本更新

ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving

ReasonBreak: 探测自动驾驶中具备推理能力的视觉-语言-行动模型的脆弱性

Mohammadreza Teymoorianfard, Jean-Philippe Monteuuis, Jonathan Petit, Amir Houmansadr

Affiliations * University of Massachusetts Amherst; Qualcomm

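AI summary Using black-box attacks, presents the first systematic study of how reasoning-enabled vision-language-action models for autonomous driving respond to realistic input perturbations, finding that both reasoning and trajectory generation are vulnerable, which leads to increased collision rates.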

详情
AI中文摘要

具备集成推理能力的视觉-语言-行动(VLA)模型已被提出用于端到端自动驾驶,假设推理与轨迹生成之间存在紧密耦合。然而,此类系统在真实输入扰动下的鲁棒性尚未得到充分探索。我们表明,这些模型对真实输入扰动高度脆弱,在闭环仿真中推理攻击成功率高达89%,轨迹操控攻击成功率高达72%,导致碰撞率上升和安全指标下降。以NVIDIA近期开发的Alpamayo模型为代表,我们首次对具备推理能力的VLA模型在真实文本输入损坏下进行了系统性黑盒研究,评估了其对推理和驾驶行为的影响。我们引入了一个推理感知评估框架,捕捉推理的语义和结构方面,并结合以安全为中心的度量。我们还引入了一个基准,用于评估自动驾驶中推理-轨迹交互的攻击与防御。我们的结果强调了严格评估和改进防御的必要性,以确保自动驾驶中具备推理能力的VLA系统的安全性。

英文摘要

Vision-Language-Action (VLA) models with integrated reasoning have been proposed for end-to-end autonomous driving, assuming a tight coupling between reasoning and trajectory generation. However, the robustness of such systems under realistic input perturbations remains largely unexplored. We show that these models are highly vulnerable to realistic input perturbations: attacks achieve up to 89% attack success rate (ASR) on reasoning and up to 72% on trajectory manipulation in closed-loop simulation, leading to increased collision rates and degraded safety metrics. Using NVIDIA's recent Alpamayo models as representative industry-developed VLAs, we conduct the first systematic black-box study of reasoning-enabled VLA models under realistic textual input corruptions, evaluating their impact on reasoning and driving behavior. We introduce a reasoning-aware evaluation framework capturing both semantic and structural aspects of reasoning, along with safety-centric measures. We also introduce a benchmark for evaluating attacks and defenses on reasoning-trajectory interactions in autonomous driving. Our results highlight the need for rigorous evaluation and improved defenses to ensure the safety of reasoning-enabled VLA systems in autonomous driving.

2605.29091 2026-05-29 cs.RO cs.MA 版本更新

Human-in-the-Loop Swarms: A Bionic Swarm Approach to Real-World Soil Mapping

人在环路的群体:一种用于真实土壤测绘的仿生群体方法

Petras Swissler, Mohammadali Rashidioun, Nicholas Sahu, Raaid Kabir, Ayodeji Aderibigbe, Oladoyin Kolawole

Affiliations * New Jersey Institute of Technology; University of Washington

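AI summary Proposes the Bionic Swarm system, in which human users take over tasks that are difficult to implement on robots, combined with Bluetooth sensors and a centralized server that runs the swarm algorithm; validates the Score-Biased-Search algorithm in a real outdoor environment and lowers the barrier to entry for field swarm robotics research.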

Comments 27 pages, 15 figures. Submitted to Advanced Intelligent Systems

详情
AI中文摘要

由于部署硬件的成本高和开发时间长,群体和现场机器人技术在真实世界验证中面临重大障碍。本文介绍了“Bionic Swarm”,一种新颖的系统,通过抽象出许多难以在机器人上实现但对整体算法评估无贡献的任务,并将这些任务交给人类用户,从而降低了这些障碍。这些人类用户通过智能手机网页应用接收指令,该应用从蓝牙连接的传感器获取测量数据并将其转发到集中式服务器。该服务器运行群体算法并向人类用户指示行动。我们通过实验验证了一种名为Score-Biased-Search的岩土聚焦搜索算法来评估该系统,该算法通过为重建地图上的每个位置分配“分数”,然后通过预期分数较高的区域偏置搜索模式,并表现出相对于搜索代理数量的超线性地图重建。在展示该算法的模拟结果后,我们在Bionic Swarm平台上应用该算法,以验证其在真实户外环境中的功能。这项工作表明,这种人在环路的方法显著降低了现场和群体机器人研究的入门门槛。

英文摘要

Swarm and field robotics face significant barriers to real-world validation due to the high cost and development time needed to deploy hardware. This paper introduces the "Bionic Swarm," a novel system that lowers these barriers by abstracting away many of the tasks that are difficult to implement on robots but do not contribute to the overall algorithm evaluation, and giving these tasks to human users. These human users take directions from a smartphone web app that takes measurements from Bluetooth-connected sensors and relays them to a centralized server. This server runs the swarm algorithm and directs actions to the human users. We evaluate this system through the experimental validation of a geotechnically focused search algorithm named Score-Biased-Search, which assigns a "score" to each location on a reconstructed map, then biases search patterns through areas of higher expected scores, and which exhibits superlinear map reconstruction relative to the number of search agents. After presenting simulation results for the algorithm, we apply it on the Bionic Swarm platform to validate its function in a real-world, outdoor setting. This work demonstrates that this human-in-the-loop approach significantly lowers the barrier to entry for field and swarm robotics research.

2605.29074 2026-05-29 cs.CV cs.RO 版本更新

Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

Embodied3DBench: 视觉语言模型低级具身空间智能的基准测试

Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong

Affiliations * CFCS, School of CS, PKU; Jingdong Technology Information Technology Co., Ltd

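AI summary Proposes the Embodied3DBench benchmark, which systematically evaluates the low-level spatial intelligence of vision language models in 3D environments across 6 task categories (spatial structural understanding and interaction-oriented perception), and synthesizes a training dataset of 1.3M QA pairs to close the capability gap.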

详情
AI中文摘要

当前的视觉语言模型(VLM)是否准备好理解和推理3D环境中的复杂具身交互?我们引入了Embodied3DBench,一个以机器人为中心的基准,针对具身3D环境中的低级空间智能。为了系统评估这些基础感知能力,该基准包括6个任务类别,分为两个核心组:空间结构理解(定位、空间关系预测和多视图对应)和交互导向感知(可供性预测、抓取点预测和轨迹预测)。该基准涵盖12个子类别,包含超过21k个高质量问答对。我们评估了13个最先进的模型,结果显示,尽管当前模型在高级空间推理(如理解对象间位置关系)方面表现相对较强,但在交互导向感知方面仍然脆弱,突显了缺乏鲁棒的3D感知交互先验。为了积极弥合基准揭示的能力差距,我们进一步合成了一个包含130万问答对的大规模训练数据集。值得注意的是,在该数据集上微调显著提升了低级空间智能。最终,Embodied3DBench通过提供系统评估框架和可扩展的数据解决方案填补了关键空白,为交互感知多模态系统的发展设定了明确目标。

英文摘要

Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.

2605.28883 2026-05-29 cs.AI cs.RO 版本更新

Ultra-Reduced-Impact-Encased-Logging (URIEL): propose a new method for selective sustainable logging and post-harvest silvicultural treatment in tropical forest using airborne robotics systems

超低影响包裹式伐木(URIEL):提出一种利用空中机器人系统在热带森林中进行选择性可持续伐木和采后造林处理的新方法

Daniel Albiero, Gelton Fernando de Morais, Daniela Han, Flávio Roberto de Freitas Gonçalves, Artur Vitório Andrade Santos, Wesllen Lins de Araújo, Alessandra Maia Freire, Cláudio Kiyoshi Umezu, Mateus Peressin, Francesco Toscano, Admilson Írio Ribeiro, Alfeu J. Sguarezi Filho, Américo Ferraz Dias Neto, Angel Pontin Garcia

Affiliations * School of Agricultural Engineering, University of Campinas (UNICAMP); School of Mechanical Engineering, University of Campinas (UNICAMP); Department of Agricultural, Forestry, Food and Environmental Sciences, University of Basilicata; Sorocaba Environmental Engineering, São Paulo State University (UNESP); Center for Engineering, Modeling and Applied Social Sciences, Federal University of ABC (UFABC)

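AI summary Proposes the URIEL method, which combines heli-logging, robotics, AI, and drone-based post-harvest silvicultural treatment to achieve high economic viability and almost no collateral damage while maintaining ecosystem services.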

Comments 196 pages, 40 figures. A revolutionary technology to help protect tropical forests. It was developed, scaled, detailed, calculated, and simulated in an advanced computational environment, with economic and social viability. "E pur si muove"

详情
AI中文摘要

全球热带森林正面临由经济和政治利益驱动的强烈砍伐压力,科学证据表明这种砍伐加剧了气候变化。本文提出了一种新颖的热带森林伐木方法——超低影响包裹式伐木(URIEL)。该方法基于直升机伐木技术,结合机器人技术和人工智能的密集使用,以及由无人机执行的采后造林处理。为此方法开发了合适的设备概念,确定了尺寸,在数字概念验证中完成了细节,并对各种直升机-木材-距离组合进行了有效的数字模拟和经济可行性分析。结果表明,URIEL方法具有高经济可行性,并能在维持生态系统服务的同时几乎消除对森林的附带损害。本文的主要结论是,尽管取得了令人满意的科学和技术成果,但URIEL方法的可行性取决于相关利益相关者的整合:高科技产业、政治政府、认证伐木公司和原住民。

英文摘要

Tropical forests worldwide are under intense deforestation pressure driven by economic and political interests, and scientific evidence suggests this deforestation contributes to climate change. This paper proposes a novel logging method for tropical forests, Ultra-Reduced-Impact-Encased-Logging (URIEL). This new method is based on heli-logging techniques combined with intensive use of robotics and AI, integrated with post-harvest silvicultural treatments performed by drones. The concept of appropriate equipment for this method was developed, dimensions were determined, details were completed in a digital proof of concept, and a digital simulation and economic feasibility analysis were carried out for various helicopter-timber-distance combinations. The results demonstrated that the URIEL method has high economic viability and makes it possible to virtually eliminate collateral damage to forests while maintaining ecosystem services. The main conclusion of this paper is that, despite the satisfactory scientific and technological results, the feasibility of the URIEL method depends on the integration of stakeholders intrinsic to the context: high-tech industry; political governments; certified logging companies; and native populations.

2605.22082 2026-05-29 cs.RO cs.LG 版本更新

CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation

CoRMA: 用于接触丰富元适应的对比RMA

Wentian Wang, Chutong Wen, Hongxu Ma, Wuhao Wang, Zhexiong Xue, Abdul Haseeb Nizamani, Dandi Zhou, Xinhai Sun, Jianqiao Zhu

Affiliations * Synthoid AI

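AI summary Proposes the CoRMA framework, which uses a semantic contact context and contrastive learning to achieve meta-adaptation for force-dominant assembly tasks without demonstrations or gradient updates, and retains higher success on a real robot than FORGE baselines.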

详情
AI中文摘要

我们提出CoRMA(对比机器人运动适应),一个基于上下文的元适应框架,修改了RMA以适用于力主导的装配任务。CoRMA用紧凑的6维仅仿真语义接触上下文(描述接触开始、侧向接合、引导过渡、接触方向和卡滞)替换原始仿真器参数适应。一个可部署的因果Transformer适配器通过语义回归和力状态对比目标,从力、本体感受和动作历史中在线推断该上下文。部署时,移除真实上下文并由推断上下文替代,从而无需演示、特权输入或梯度更新即可实现片段内适应。我们在Isaac Lab / Isaac Sim 5.0中的PegInsert、GearMesh和NutThread任务以及真实Marvin机械臂上评估CoRMA。与在仿真中成功率高但在硬件上大幅下降的FORGE基线相比,CoRMA在受控目标位姿噪声下保留了更高的验证真实成功率。这些结果支持语义接触推断作为相关装配任务族内可复用的适应接口,而更广泛的未见任务泛化和Real2Sim校准仍是未来工作。

英文摘要

We present CoRMA (Contrastive Robotic Motor Adaptation), a context-based meta-adaptation framework that modifies RMA for force-dominant assembly. CoRMA replaces raw simulator-parameter adaptation with a compact 6D simulator-only semantic contact context describing contact onset, lateral engagement, guided transition, contact direction, and jamming. A deployable causal Transformer adapter infers this context online from force, proprioceptive, and action histories using semantic regression and a force-regime contrastive objective. At deployment, oracle context is removed and replaced by the inferred context, enabling within-episode adaptation without demonstrations, privileged inputs, or gradient updates. We evaluate CoRMA on PegInsert, GearMesh, and NutThread in Isaac Lab / Isaac Sim 5.0 and on a real Marvin arm. Compared with FORGE baselines that achieve high simulation success but degrade substantially on hardware, CoRMA retains higher verified real success under controlled target-pose noise. These results support semantic contact inference as a reusable adaptation interface within a related assembly task family, while broader unseen-task generalization and Real2Sim calibration remain future work.

2605.01663 2026-05-29 cs.LG cs.RO 版本更新

Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

基于流锚定噪声条件Q学习的离线强化学习:高效且表达力强的方法

Sungyoung Lee, Dohyeong Kim, Eshan Balachandar, Zelal Su Mustafaoglu, Keshav Pingali

Affiliations * The University of Texas at Austin, Austin, TX, USA; Independent Researcher, Seoul, South Korea

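AI summary Proposes the FAN algorithm, which achieves efficient offline reinforcement learning with a single flow policy iteration and a single Gaussian noise sample, substantially reducing computational cost while maintaining high performance.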

Comments ICML 2026

详情
AI中文摘要

我们提出流锚定噪声条件Q学习(FAN),一种高效且高性能的离线强化学习算法。近期工作表明,表达力强的流策略和分布性评论家能提升离线强化学习性能,但计算成本高。具体而言,流策略需要迭代采样才能产生单个动作,分布性评论家需要计算多个样本(如分位数)来估计价值。为解决这些低效问题并保持高性能,我们引入FAN。我们的方法采用行为正则化技术,仅需单次流策略迭代,且分布性评论家仅需单个高斯噪声样本。我们对收敛性和性能边界的理论分析表明,这些简化不仅提高了效率,还带来了更优的任务性能。在机器人操作和运动任务上的实验表明,FAN实现了最先进的性能,同时显著减少了训练和推理时间。我们在https://github.com/brianlsy98/FAN 发布代码。

英文摘要

We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that uses a single flow policy iteration and requires a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.

2605.01194 2026-05-29 cs.RO 版本更新

VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

VLA-ATTC:基于相对动作评判模型的VLA模型自适应测试时计算

Wenhao Li, Xiu Su, Yichao Cao, Hongyan Xu, Xiaobo Xia, Shan You, Yi Chen, Chang Xu

Affiliations * University of Sydney; Central South University; University of Science and Technology of China; SenseTime Research; Hong Kong University of Science and Technology

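AI summary Proposes the VLA-ATTC framework, which achieves adaptive test-time compute through an uncertainty-driven "cognitive clutch" and a Relative Action Critic (RAC) model, reducing the failure rate of the SOTA model PI0.5 by over 50% on the LIBERO-LONG benchmark.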

详情
AI中文摘要

视觉-语言-动作(VLA)模型在具身操作中展现了卓越的能力和泛化性。然而,它们的决策依赖于快速、本能的过程,缺乏深思熟虑。当面对需要更多考虑的复杂或模糊场景时,这种策略往往会导致次优或灾难性的动作。在本文中,我们引入了VLA-ATTC,一个赋予VLA模型自适应测试时计算(TTC)能力的框架。VLA-ATTC采用基于不确定性的“认知离合器”,在必要时动态地从反射执行过渡到TTC深思阶段。在TTC阶段,一种新颖的相对动作评判(RAC)模型通过成对比较从生成的候选动作中识别最优动作。这种相对机制取代了不稳定的绝对值估计,显著简化了学习目标。此外,我们引入了一种高效的采样策略来分摊计算成本,以及一个自动数据管道,无需人工标注即可整理偏好对。在LIBERO-LONG基准上,VLA-ATTC将SOTA模型PI0.5的失败率降低了50%以上。我们将开源所有代码和权重。

英文摘要

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities and generalization in embodied manipulation. However, their decision-making relies on a fast, instinctive process that lacks deliberation. This strategy often leads to suboptimal or catastrophic actions when facing complex or ambiguous scenarios that require greater consideration. In this paper, we introduce VLA-ATTC, a framework that endows VLA models with adaptive test-time compute (TTC). VLA-ATTC employs an uncertainty-based "cognitive clutch" to dynamically transition from reflexive execution to a TTC deliberation phase when necessary. During the TTC phase, a novel Relative Action Critic (RAC) model identifies the optimal action from generated candidates via pairwise comparisons. This relative mechanism replaces unstable absolute value estimation, significantly simplifying the learning objective. Furthermore, we introduce an efficient sampling strategy to amortize computational costs and an automated data pipeline that curates preference pairs without manual annotation. On the LIBERO-LONG benchmark, VLA-ATTC reduces the failure rate of the SOTA model PI0.5 by over 50%. We will open-source all the code and weights.
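
The idea of choosing an action through pairwise comparisons rather than absolute value estimates can be sketched as a small round-robin tournament. This is only an illustration under stated assumptions: `toy_critic` is a hypothetical stand-in for a learned comparison model, and counting wins is one simple aggregation rule, not necessarily the one used by RAC.

```python
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")


def select_by_pairwise_wins(candidates: Sequence[A],
                            prefers: Callable[[A, A], bool]) -> A:
    """Return the candidate that wins the most pairwise comparisons."""
    wins: List[int] = [0] * len(candidates)
    for i, a in enumerate(candidates):
        for j, b in enumerate(candidates):
            if i != j and prefers(a, b):
                wins[i] += 1
    # Ties are broken in favor of the earliest candidate.
    return candidates[wins.index(max(wins))]


def toy_critic(a: float, b: float) -> bool:
    """Hypothetical stand-in for a learned critic: prefer actions nearer 0.3."""
    return abs(a - 0.3) < abs(b - 0.3)


candidate_actions = [0.9, 0.25, -0.4, 0.5]
best = select_by_pairwise_wins(candidate_actions, toy_critic)
print(best)
```

The comparator only has to answer "which of these two is better", which is the sense in which a relative objective can be simpler to learn than a calibrated absolute score.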

2605.01191 2026-05-29 cs.RO 版本更新

Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

Sentinel-VLA:一种具有主动状态监控的元认知VLA模型,用于动态推理和错误恢复

Wenhao Li, Xiu Su, Dan Niu, Yichao Cao, Hongyan Xu, Zhe Qu, Lei Fan, Shan You, Chang Xu

Affiliations * University of Sydney; Central South University; University of New South Wales; SenseTime Research

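AI summary Proposes the Sentinel-VLA model, whose active sentinel module monitors execution status and triggers dynamic reasoning or error recovery only when necessary; combined with the Self-Evolving Continual Learning algorithm and the Orthogonal Continual Adapter, it raises real-world task success rate by over 30% relative to PI0, with training data spanning 44 tasks.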

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过利用广泛的世界知识和强大的泛化能力,推动了具身操作领域的发展。然而,当前的VLA模型仍面临几个关键挑战,包括推理能力有限、缺乏状态监控以及难以自我纠正。在本文中,我们引入了Sentinel-VLA,一种元认知VLA模型,配备了一个主动的“哨兵”模块来监控实时执行状态。仅在必要时,例如在初始规划或检测到错误时,模型会触发动态推理或制定错误恢复方案。这种按需推理机制确保了鲁棒的决策,同时最小化计算开销。值得注意的是,所有训练数据(涵盖44个任务和超过260万次转换)都是通过我们设计的流水线自动生成和标注的。我们还提出了自进化持续学习(SECL)算法,该算法允许Sentinel-VLA识别其能力边界并自动收集数据进行扩展,并与正交持续适配器(OC-Adapter)配对,将参数更新约束在正交空间中,从而防止灾难性遗忘。真实世界实验表明,与最先进的模型PI0相比,Sentinel-VLA将任务成功率提高了30%以上。我们将开源所有代码、权重和数据生成流水线。

英文摘要

Vision-language-action (VLA) models have advanced the field of embodied manipulation by harnessing broad world knowledge and strong generalization. However, current VLA models still face several key challenges, including limited reasoning capability, lack of status monitoring, and difficulty in self-correction. In this paper, we introduce Sentinel-VLA, a metacognitive VLA model equipped with an active "sentinel" module to monitor real-time execution status. Only when necessary, such as during initial planning or upon detecting an error, does the model trigger dynamic reasoning or formulate error-recovery solutions. This on-demand reasoning mechanism ensures robust decision-making while minimizing computational overhead. Notably, all training data (spanning 44 tasks and over 2.6 million transitions) is automatically generated and annotated through our designed pipeline. We also propose the Self-Evolving Continual Learning (SECL) algorithm, which allows Sentinel-VLA to identify its capability boundaries and automatically collect data for expansion, paired with an Orthogonal Continual Adapter (OC-Adapter) that constrains parameter updates to an orthogonal space, thereby preventing catastrophic forgetting. Real-world experiments demonstrate that Sentinel-VLA boosts the task success rate by over 30% compared to the SOTA model, PI0. We will open-source all the code, weights, and data generation pipeline.

2604.15864 2026-05-29 cs.RO 版本更新

Environment-Adaptive Solid-State LiDAR-Inertial Odometry

环境自适应固态激光雷达-惯性里程计

Zhi Zhang, Chalermchon Satirapod, Bingtao Ma, Changjun Gu

Affiliations * School of Automation, Chongqing University of Posts and Telecommunications; Department of Survey Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand; School of Cyberspace Security, Hangzhou Dianzi University, Hangzhou, China

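AI summary Proposes an environment-adaptive solid-state LiDAR-inertial odometry that integrates local normal-vector constraints with degeneracy-aware map maintenance to address the localization drift and map inconsistency caused by geometric degeneracy and unreliable observations in extreme environments.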

详情
AI中文摘要

固态激光雷达-惯性SLAM因其速度和鲁棒性优势而受到广泛关注。然而,在极端环境中实现精确建图仍然具有挑战性,因为严重的几何退化和不可靠的观测常常导致病态优化和地图不一致。为了解决这些问题,我们提出了一种环境自适应固态激光雷达-惯性里程计,它集成了局部法向量约束与退化感知地图维护,以增强定位精度。具体来说,我们引入局部法向量约束来提高状态估计的稳定性,有效抑制退化场景中的定位漂移。此外,我们设计了一种退化引导的地图更新策略以提高地图精度。得益于精细化的地图表示,后续估计中的定位精度进一步提高。实验结果表明,所提方法在极端和感知退化环境中实现了优越的建图精度和鲁棒性,与基线方法相比,平均RMSE降低高达12.8%。

英文摘要

Solid-state LiDAR-inertial SLAM has attracted significant attention due to its advantages in speed and robustness. However, achieving accurate mapping in extreme environments remains challenging due to severe geometric degeneracy and unreliable observations, which often lead to ill-conditioned optimization and map inconsistencies. To address these challenges, we propose an environment-adaptive solid-state LiDAR-inertial odometry that integrates local normal-vector constraints with degeneracy-aware map maintenance to enhance localization accuracy. Specifically, we introduce local normal-vector constraints to improve the stability of state estimation, effectively suppressing localization drift in degenerate scenarios. Furthermore, we design a degeneration-guided map update strategy to improve map precision. Benefiting from the refined map representation, localization accuracy is further enhanced in subsequent estimation. Experimental results demonstrate that the proposed method achieves superior mapping accuracy and robustness in extreme and perceptually degraded environments, with an average RMSE reduction of up to 12.8% compared to the baseline method.

2603.16673 2026-05-29 cs.RO cs.AI cs.LG 版本更新

When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

机器人何时应该思考?基于强化学习的资源感知推理在具身机器人决策中的应用

Jun Liu, Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang, Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang

Affiliations * Robotics Institute, Carnegie Mellon University; Northeastern University; Harvard University; Cornell University; MIT; Fujitsu Research of America; Tsinghua University; Peking University; University of Georgia; Florida International University; EmbodyX Inc; Cisco Systems

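AI summary Proposes the RARRL framework, which uses reinforcement learning to learn a high-level orchestration policy that lets an embodied agent adaptively decide whether to invoke LLM reasoning, which reasoning role to use, and how much computational budget to allocate, balancing reasoning overhead against task success rate.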

详情
AI中文摘要

具身机器人系统越来越依赖基于大语言模型(LLM)的代理来支持与环境交互过程中的高级推理、规划和决策。然而,调用LLM推理会引入大量的计算延迟和资源开销,这可能会中断动作执行并降低系统可靠性。过多的推理可能延迟动作,而推理不足则常常导致错误决策和任务失败。这引出了具身代理的一个基本问题:代理何时应该推理,何时应该行动?在这项工作中,我们提出了RARRL(基于强化学习的资源感知推理),一个用于具身代理资源感知编排的分层框架。RARRL不是学习低级控制策略,而是学习一个在代理决策层运行的高级编排策略。该策略使代理能够根据当前观察、执行历史和剩余资源,自适应地决定是否调用推理、使用哪个推理角色以及分配多少计算预算。大量实验,包括使用来自ALFRED基准测试的经验延迟配置文件进行评估,表明与固定或启发式推理策略相比,RARRL在减少执行延迟和增强鲁棒性的同时,持续提高了任务成功率。这些结果表明,自适应推理控制对于构建可靠且高效的具身机器人代理至关重要。

英文摘要

Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent's decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.

2603.08142 2026-05-29 cs.RO 版本更新

Multifingered force-aware control for humanoid robots

人形机器人的多指力感知控制

Pasquale Marra, Gabriele M. Caddeo, Ugo Pattacini, Lorenzo Natale

Affiliations * Humanoid Sensing and Perception, Istituto Italiano di Tecnologia, Genoa, Italy; DIBRIS, Università di Genova, Via All’Opera Pia, 13, Genoa, Italy; MESH Facility, Istituto Italiano di Tecnologia, Genoa, Italy

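AI summary Proposes a force-estimation-based control scheme for humanoid robots with multi-fingered hands that adjusts the motion of the torso, arm, wrist, and fingers to redistribute forces and maintain stable contact with objects, achieving an 82.7% success rate on a balancing task.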

Comments This work has been accepted for publication in ICRA 2026

详情
AI中文摘要

在本文中,我们研究了具有多指手的机器人平台中的力感知控制和力分配问题。给定目标以及来自触觉传感器的力估计,我们设计了一个控制器,能够调整躯干、手臂、手腕和手指的运动,重新分配力以维持与不同质量分布或不稳定接触的物体的稳定接触。为了估计力,我们使用五个Xela磁传感器与压头交互,收集触觉信号和地面真实力测量数据集,并训练力估计器。然后,我们引入一种基于模型的控制方案,该方案最小化压力中心(CoP)与指尖接触多边形质心之间的距离。由于我们的方法依赖于估计的力而非原始触觉信号,因此它有可能应用于任何能够进行力估计的传感器。我们在一个包含五个物体的平衡任务上验证了我们的框架,成功率达到82.7%,并在多物体场景中进一步评估,准确率达到80%。代码和数据可在此处找到:https://github.com/hsp-iit/multifingered-force-aware-control。

英文摘要

In this paper, we address force-aware control and force distribution in robotic platforms with multi-fingered hands. Given a target goal and force estimates from tactile sensors, we design a controller that adapts the motion of the torso, arm, wrist, and fingers, redistributing forces to maintain stable contact with objects of varying mass distribution or unstable contacts. To estimate forces, we collect a dataset of tactile signals and ground-truth force measurements using five Xela magnetic sensors interacting with indenters, and train force estimators. We then introduce a model-based control scheme that minimizes the distance between the Center of Pressure (CoP) and the centroid of the fingertip contact polygon. Since our method relies on estimated forces rather than raw tactile signals, it has the potential to be applied to any sensor capable of force estimation. We validate our framework on a balancing task with five objects, achieving an 82.7% success rate, and further evaluate it in multi-object scenarios, achieving 80% accuracy. Code and data are available at https://github.com/hsp-iit/multifingered-force-aware-control.
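
The quantity the controller is described as minimizing, the distance between the Center of Pressure and the centroid of the fingertip contacts, can be computed as in the sketch below. It is a planar simplification under stated assumptions: the CoP is taken as the force-weighted mean of the contact points, the polygon centroid is approximated by the mean of its vertices, and all names and numbers are illustrative rather than taken from the released code.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def center_of_pressure(contacts: Sequence[Point],
                       forces: Sequence[float]) -> Point:
    """Force-weighted average of the contact locations."""
    total = sum(forces)
    x = sum(f * p[0] for p, f in zip(contacts, forces)) / total
    y = sum(f * p[1] for p, f in zip(contacts, forces)) / total
    return (x, y)


def vertex_centroid(contacts: Sequence[Point]) -> Point:
    """Mean of the contact points (a simplification of the polygon centroid)."""
    n = len(contacts)
    return (sum(p[0] for p in contacts) / n, sum(p[1] for p in contacts) / n)


def cop_error(contacts: Sequence[Point], forces: Sequence[float]) -> float:
    """Distance between the CoP and the centroid of the contact points."""
    cop = center_of_pressure(contacts, forces)
    centroid = vertex_centroid(contacts)
    return math.dist(cop, centroid)


fingertips = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(cop_error(fingertips, [1.0, 1.0, 1.0, 1.0]))  # balanced load
print(cop_error(fingertips, [3.0, 1.0, 1.0, 3.0]))  # load shifted toward x = 0
```

A controller in this spirit would move the hand or redistribute finger forces so that this error shrinks toward zero.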

2602.13436 2026-05-29 cs.RO 版本更新

Force Sensing for Wearable Human-Robot Interfaces via Fluidic Innervation

用于可穿戴人机界面的力传感:基于流体神经支配

Noah Rubin, Ava Schraeder, Hrishikesh Sahu, Thomas C. Bulea, Lillian Chin

Affiliations * Rehabilitation Medicine Department, National Institutes of Health (NIH) Clinical Center; Department of Electrical and Computer Engineering, University of Texas at Austin

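AI summary Measures force through air channels in a 3D-printed silicone pad, obtaining a linear response, and validates the approach with isometric torque, dynamic movement, and a robotic exoskeleton.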

Comments 6 pages, 7 figures, accepted to BioRob 2026

详情
AI中文摘要

机械表征人机界面对于理解用户行为和优化可穿戴机器人性能至关重要。由于制造复杂性和非线性传感器响应,该界面的传感器化一直具有挑战性。在这里,我们通过流体神经支配测量人体肢体与设备的相互作用,创建了一个带有嵌入式空气通道的3D打印硅胶垫来测量力。当力施加到垫子上时,空气通道被压缩,导致压力变化,可由现成的压力传感器测量。我们在台架测试中证明,垫子压力与施加的力高度线性相关($R^2 = 0.998$),并通过策略性垫子放置,在临床测力计中确认了与等长膝关节扭矩的强线性关系。我们基于这些理想化设置,在更不受约束的环境中测试垫子性能,包括循环动态和逐步等长肱二头肌弯举。最后,我们将传感器集成到下肢机器人外骨骼中,并在设备未通电的情况下记录重复深蹲期间的垫子压力。垫子压力一致地跟踪深蹲阶段和整体任务动态。总的来说,我们的初步结果表明,流体神经支配是一种易于定制的传感方式,具有高信噪比和时间分辨率,可用于捕捉人机交互。从长远来看,这种方式可能提供一种替代的实时传感输入,用于控制/优化可穿戴机器人系统,并在设备使用期间捕捉用户功能。

英文摘要

Mechanically characterizing the human-machine interface is essential to understanding user behavior and optimizing wearable robot performance. This interface has been challenging to sensorize due to manufacturing complexity and non-linear sensor responses. Here, we measure human limb-device interaction via fluidic innervation, creating a 3D-printed silicone pad with embedded air channels to measure forces. As forces are applied to the pad, the air channels compress, resulting in a pressure change measurable by off-the-shelf pressure transducers. We demonstrate in benchtop testing that pad pressure is highly linearly related to applied force ($R^2 = 0.998$) and confirm strong linear relationships to isometric knee torque in a clinical dynamometer with strategic pad placement. We build on these idealized settings to test pad performance in more unconstrained settings, including during cyclic dynamic and stepwise isometric bicep curls. Finally, we integrate the sensor into a lower-extremity robotic exoskeleton and record pad pressure during repeated squats with the device unpowered. Pad pressure tracked squat phase and overall task dynamics consistently. Collectively, our preliminary results suggest fluidic innervation is a readily customizable sensing modality with high signal-to-noise ratio and temporal resolution for capturing human-machine interaction. In the long term, this modality may provide an alternative real-time sensing input to control or optimize wearable robotic systems and to capture user function during device use.
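
The linearity claim above is summarized by a coefficient of determination. As a reminder of how such a value is obtained, the sketch below fits an ordinary least-squares line and computes $R^2$; the calibration data are made up for illustration and are not the paper's measurements.

```python
from typing import Sequence, Tuple


def linear_fit_r2(x: Sequence[float],
                  y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares line y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up calibration data: applied force (N) vs. pad pressure change (kPa).
force_n = [0.0, 10.0, 20.0, 30.0, 40.0]
pressure_kpa = [0.1, 2.0, 4.1, 5.9, 8.1]

slope, intercept, r2 = linear_fit_r2(force_n, pressure_kpa)
print(slope, intercept, r2)
```

Once such a line is fitted, inverting it gives a force estimate from a pressure reading, which is what makes a near-linear sensor response convenient.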

2602.04516 2026-05-29 cs.RO 版本更新

TACO: Temporal Consensus Optimization for Continual Neural Mapping

TACO:用于连续神经映射的时间共识优化

Xunlan Zhou, Hongrui Zhao, Negar Mehr

Affiliations * School of Intelligence Science; Department of Aerospace Engineering; Department of Mechanical Engineering and Technology; University of Illinois Urbana-Champaign; University of California Berkeley; Nanjing University

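AI summary Proposes TACO, a framework that casts mapping as a temporal consensus optimization problem, enabling continual neural mapping in dynamic environments without replaying historical data while balancing memory efficiency and adaptability.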

Comments In: Robotics: Science and Systems (RSS 2026)

详情
AI中文摘要

神经隐式映射已成为机器人导航和场景理解的有力范式。然而,实际机器人部署需要在严格的内存和计算约束下持续适应变化的环境,现有映射系统无法支持这一点。大多数先前方法依赖于回放历史观测来保持一致性,并假设静态场景。因此,它们无法适应动态机器人环境中的连续学习。为应对这些挑战,我们提出TACO(时间共识优化),一种无回放的连续神经映射框架。我们将映射重新表述为时间共识优化问题,其中将过去的模型快照视为时间邻居。直观上,我们的方法类似于模型咨询其自身的过去知识。我们通过强制与历史表示进行加权共识来更新当前地图。我们的方法允许可靠的历史几何约束优化,同时允许不可靠或过时的区域根据新观测进行修订。TACO在无需存储或回放先前数据的情况下实现了内存效率与适应性之间的平衡。通过广泛的模拟和真实世界实验,我们表明TACO能够稳健地适应场景变化,并持续优于其他连续学习基线。代码可在 https://iconlab.negarmehr.com/TACO 获取。

英文摘要

Neural implicit mapping has emerged as a powerful paradigm for robotic navigation and scene understanding. However, real-world robotic deployment requires continual adaptation to changing environments under strict memory and computation constraints, which existing mapping systems fail to support. Most prior methods rely on replaying historical observations to preserve consistency and assume static scenes. As a result, they cannot adapt to continual learning in dynamic robotic settings. To address these challenges, we propose TACO (TemporAl Consensus Optimization), a replay-free framework for continual neural mapping. We reformulate mapping as a temporal consensus optimization problem, where we treat past model snapshots as temporal neighbors. Intuitively, our approach resembles a model consulting its own past knowledge. We update the current map by enforcing weighted consensus with historical representations. Our method allows reliable past geometry to constrain optimization while permitting unreliable or outdated regions to be revised in response to new observations. TACO achieves a balance between memory efficiency and adaptability without storing or replaying previous data. Through extensive simulated and real-world experiments, we show that TACO robustly adapts to scene changes, and consistently outperforms other continual learning baselines. Code is available at https://iconlab.negarmehr.com/TACO
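The idea of "weighted consensus with historical representations" can be sketched with a toy, one-parameter example. This is an assumption-laden illustration, not the TACO algorithm: it supposes a single scalar map value, a squared-error data term, and a quadratic penalty toward each past snapshot, so the minimizer has a closed form. The function name and weights are hypothetical.

```python
def consensus_update(observations, snapshots, weights):
    """Minimize sum_i (theta - z_i)^2 + sum_k w_k * (theta - theta_k)^2.

    observations: new measurements z_i of one map value
    snapshots:    the same value in past model snapshots theta_k
    weights:      trust w_k >= 0 placed in each snapshot
    Setting the derivative to zero gives a weighted average.
    """
    numerator = sum(observations) + sum(w * s for w, s in zip(weights, snapshots))
    denominator = len(observations) + sum(weights)
    return numerator / denominator


new_obs = [2.0, 2.0]   # the scene now looks different here
past = [1.0]           # what an earlier snapshot believed

# An unreliable or outdated snapshot (weight 0) is fully revised.
revised = consensus_update(new_obs, past, [0.0])
# A moderately trusted snapshot pulls the estimate toward the past value.
blended = consensus_update(new_obs, past, [2.0])
# A highly trusted snapshot effectively constrains the update.
anchored = consensus_update(new_obs, past, [1e6])
```

In this toy setting, no past observations are stored: only the snapshot values and their weights enter the update.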

2602.00324 2026-05-29 math.OC cs.CV cs.RO eess.SP 版本更新

Dual Quaternion SE(3) Synchronization with Recovery Guarantees

对偶四元数 SE(3) 同步及其恢复保证

Jianing Zhao, Linglingzhi Zhu, Anthony Man-Cho So

Affiliations * Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, NT, Hong Kong; H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA

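AI summary Uses a dual quaternion representation to solve SE(3) synchronization with a spectral initialization followed by a dual quaternion generalized power method, and provides error bounds together with a guarantee of linear error contraction up to a noise-dependent threshold.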

Comments ICML 2026

详情
AI中文摘要

特殊欧几里得群 SE(3) 上的同步旨在从含噪的成对相对变换中恢复绝对位姿,是机器人和 3D 视觉中的核心基本操作。标准方法通常需要多步启发式程序来恢复有效位姿,这些程序难以分析且通常缺乏理论保证。本文采用对偶四元数表示,并直接在对偶四元数单位上制定 SE(3) 同步。开发了一个两阶段算法:通过 Hermitian 对偶四元数测量矩阵上的幂法计算谱初始化,随后是对偶四元数广义幂法 (DQGPM),通过每次迭代投影来强制执行可行性。建立了谱估计器的估计误差界,并证明 DQGPM 具有有限迭代误差界,并实现线性误差收缩直至显式的噪声相关阈值。在合成基准和真实多扫描点集配准上的实验表明,所提出的流程在准确性和效率上均优于代表性的基于矩阵的方法。

英文摘要

Synchronization over the special Euclidean group SE(3) aims to recover absolute poses from noisy pairwise relative transformations and is a core primitive in robotics and 3D vision. Standard approaches often require multi-step heuristic procedures to recover valid poses, which are difficult to analyze and typically lack theoretical guarantees. This paper adopts a dual quaternion representation and formulates SE(3) synchronization directly over unit dual quaternions. A two-stage algorithm is developed: a spectral initializer computed via the power method on a Hermitian dual quaternion measurement matrix, followed by a dual quaternion generalized power method (DQGPM) that enforces feasibility through per-iteration projection. Estimation error bounds are established for spectral estimators, and DQGPM is shown to admit a finite-iteration error bound and to achieve linear error contraction up to an explicit noise-dependent threshold. Experiments on synthetic benchmarks and real-world multi-scan point-set registration demonstrate that the proposed pipeline improves both accuracy and efficiency over representative matrix-based methods.
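For readers unfamiliar with the representation, the sketch below shows how a rigid pose can be stored as a unit dual quaternion, how two poses compose, and one common way to normalize a dual quaternion back onto the unit set. It is a minimal illustration under stated assumptions: quaternions are `(w, x, y, z)` tuples, a pose acts as x -> R x + t, and all helper names are hypothetical. It is not the DQGPM algorithm, and the normalization shown is not necessarily the projection used in the paper.

```python
import math


def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)


def from_pose(rot, trans):
    """Unit dual quaternion (real, dual) for the pose x -> R x + t."""
    t = (0.0, *trans)
    dual = tuple(0.5 * c for c in qmul(t, rot))  # dual = (1/2) t * r
    return rot, dual


def compose(a, b):
    """Pose a applied after pose b (dual number product, eps^2 = 0)."""
    ar, ad = a
    br, bd = b
    real = qmul(ar, br)
    dual = tuple(p + q for p, q in zip(qmul(ar, bd), qmul(ad, br)))
    return real, dual


def translation(dq):
    """Recover t = 2 * dual * conj(real) from a unit dual quaternion."""
    real, dual = dq
    return tuple(2.0 * c for c in qmul(dual, qconj(real)))[1:]


def normalize(dq):
    """Rescale so |real| = 1 and real . dual = 0 (unit dual quaternion)."""
    real, dual = dq
    n = math.sqrt(sum(c * c for c in real))
    r = tuple(c / n for c in real)
    d = tuple(c / n for c in dual)
    dot = sum(p * q for p, q in zip(r, d))
    return r, tuple(dc - dot * rc for dc, rc in zip(d, r))


half = math.sqrt(0.5)
# Pose a: rotate 90 degrees about z, then translate by (1, 0, 0).
a = from_pose((half, 0.0, 0.0, half), (1.0, 0.0, 0.0))
# Pose b: pure translation by (1, 0, 0).
b = from_pose((1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
ab = compose(a, b)  # translation should be R_a (1,0,0) + (1,0,0) = (1, 1, 0)
```

The appeal of the representation is that a pose is one algebraic object with two simple constraints, so returning to a feasible point after an update is a cheap per-element normalization.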

2512.03010 2026-05-29 cs.CV cs.GR cs.RO 版本更新

SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting

SurfFill: 通过高斯曲面元填充完成LiDAR点云

Svenja Strobel, Matthias Innmann, Bernhard Egger, Marc Stamminger, Linus Franke

Affiliations * NavVis GmbH; Inria, Université Côte d'Azur

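AI summary Addresses missing thin structures and edge detail in LiDAR point clouds with SurfFill, a Gaussian surfel-based completion scheme that uses a heuristic motivated by beam divergence to identify ambiguous regions and focuses surfel reconstruction there to grow new points; it outperforms previous methods on synthetic and real-world scenes.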

Comments Project page: https://lfranke.github.io/surffill

详情
AI中文摘要

LiDAR捕获的点云通常被视为主动3D重建的金标准。尽管其在平坦区域精度极高,但捕获容易遗漏小的几何结构,并可能在暗色、吸光材料上失败。或者,拍摄场景的多张照片并应用3D摄影测量可以推断这些细节,因为它们通常代表特征丰富的区域。然而,对于无特征区域,LiDAR的精度很少能达到。因此,我们建议通过引入SurfFill:一种基于高斯曲面元的LiDAR补全方案,结合LiDAR和基于相机的捕获的优势。我们分析LiDAR捕获,并将LiDAR光束发散归因于伪影的主要因素,主要表现为薄结构和边缘。我们利用这一见解,通过评估点云中密度的变化,引入一种用于完成扫描的模糊启发式方法。这使我们能够识别靠近缺失区域的点,然后我们可以使用这些点生长额外的点以完成扫描。对于这种点生长,我们约束高斯曲面元重建,将优化和密集化集中在这些模糊区域。最后,提取模糊区域重建的高斯基元并采样以获取点来完成点云。为了解决大规模重建的挑战,我们将此流程扩展为一种分治方案,用于建筑大小的点云补全。我们在合成和真实场景的LiDAR点云补全任务上评估,发现我们的方法优于先前的重建方法。

英文摘要

LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capture is prone to missing small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details, as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR captures and identify LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, from which we can then grow additional points to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.

2511.11703 2026-05-29 cs.LG cs.AI cs.RO 版本更新

Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom

通过语义分割增强3D环境中的强化学习:以ViZDoom为例

Jin Huang

发表机构 * Hugo Huang(胡戈·黄)

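AI summary Proposes two input representations based on semantic segmentation, evaluated in ViZDoom: SS-only, which reduces the memory consumption of memory buffers, and RGB+SS, which improves reinforcement learning performance in 3D environments.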

Comments Master's Thesis at the University of Edinburgh (2024)

详情
AI中文摘要

在高维感官输入的3D环境中,强化学习面临两大挑战:(1) 稳定学习所需的内存缓冲区导致的高内存消耗,以及(2) 部分可观测马尔可夫决策过程(POMDPs)的复杂性。本项目通过提出两种新颖的输入表示:SS-only和RGB+SS,两者均对RGB彩色图像进行语义分割,以应对这些挑战。在ViZDoom的死斗模式中进行了实验,利用完美的分割结果进行受控评估。我们的结果表明,SS-only能够将内存缓冲区的内存消耗减少至少66.6%,当应用如游程编码等最小开销的可向量化无损压缩技术时,可减少高达98.6%。同时,RGB+SS通过提供的额外语义信息显著增强了强化学习代理的性能。此外,我们探索了基于密度的热力图作为可视化强化学习代理移动模式的工具,并评估了其用于数据收集的适用性。与先前方法的简要比较突出了我们的方法如何克服在ViZDoom等3D环境中应用语义分割时的常见陷阱。

英文摘要

Reinforcement learning (RL) in 3D environments with high-dimensional sensory input poses two major challenges: (1) the high memory consumption induced by memory buffers required to stabilise learning, and (2) the complexity of learning in partially observable Markov Decision Processes (POMDPs). This project addresses these challenges by proposing two novel input representations: SS-only and RGB+SS, both employing semantic segmentation on RGB colour images. Experiments were conducted in deathmatches of ViZDoom, utilizing perfect segmentation results for controlled evaluation. Our results showed that SS-only was able to reduce the memory consumption of memory buffers by at least 66.6%, and up to 98.6% when a vectorisable lossless compression technique with minimal overhead such as run-length encoding is applied. Meanwhile, RGB+SS significantly enhances RL agents' performance with the additional semantic information provided. Furthermore, we explored density-based heatmapping as a tool to visualise RL agents' movement patterns and evaluate their suitability for data collection. A brief comparison with a previous approach highlights how our method overcame common pitfalls in applying semantic segmentation in 3D environments like ViZDoom.
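The memory argument in this abstract can be sketched in a few lines. The example assumes a segmentation mask stores one class ID per pixel where an RGB frame stores three values per pixel, which gives a saving of 2/3 and is consistent with the "at least 66.6%" figure, although the abstract does not spell out this accounting. The tiny mask, the class IDs, and the function names are hypothetical, and this plain-Python run-length encoder is a teaching sketch rather than the vectorised implementation the thesis refers to.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs (lossless)."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, count] pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# Hypothetical 4x8 segmentation mask; each entry is a class ID.
mask = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [2, 2, 2, 2, 2, 2, 1, 1],
    [2, 2, 2, 2, 2, 2, 2, 2],
]
flat = [v for row in mask for v in row]
runs = run_length_encode(flat)

pixels = len(flat)
rgb_values = pixels * 3   # assumed: three values per pixel for RGB
ss_values = pixels * 1    # assumed: one class ID per pixel
channel_saving = 1.0 - ss_values / rgb_values   # 2/3 from dropping channels
rle_values = 2 * len(runs)                      # one value and one count per run
rle_saving = 1.0 - rle_values / rgb_values
```

Segmentation masks compress well under run-length encoding because large regions share a class ID, which is why the additional saving grows with image size and scene regularity.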

2510.27607 2026-05-29 cs.CV cs.RO 版本更新

Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

双流扩散用于世界模型增强的视觉-语言-动作模型

John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin

Affiliations * Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST), Seoul, Republic of Korea

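AI summary Proposes DUST, a framework that uses a dual-stream diffusion transformer and asynchronous sampling to address the modality gap in world-model augmented vision-language-action models, reporting gains of up to 6% on simulated benchmarks and 10% in real-world success rate.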

Comments Accepted at ICML 2026. Project page at https://periphanes.github.io/dust (20 pages, 10 figures)

详情
AI中文摘要

用世界模型增强视觉-语言-动作模型(VLA)对于机器人策略学习很有前景,但由于模态差距,在联合预测状态和动作方面面临挑战。为了解决这个问题,我们提出了DUal-STream diffusion(DUST),一个世界模型增强的VLA框架,其特点是一个多模态扩散Transformer,在保持独立模态流的同时实现跨模态知识共享。此外,DUST利用独立的噪声扰动和解耦的流匹配损失来学习跨模态因果关系。我们进一步引入了一种用于动作和视觉令牌的异步采样方法,通过推理时缩放来增强性能。在RoboCasa和GR-1等模拟基准上的实验结果表明,DUST相对于最先进的VLA和世界建模基线实现了高达6%的性能提升,推理时缩放额外提供了2-5%的提升。在使用Franka Research 3的真实世界任务中,DUST的成功率比基线高出10%。最后,我们证明了DUST通过在无动作视频上的预训练以及与异构机器人和人类数据集的联合训练,实现了有效的迁移学习。

英文摘要

Augmenting vision-language-action models (VLAs) with world models is promising for robotic policy learning but faces challenges in jointly predicting states and actions due to the modality gap. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, DUST utilizes independent noise perturbations and a decoupled flow matching loss to learn cross-modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that enhances performance through inference-time scaling. Experimental results on simulated benchmarks like RoboCasa and GR-1 show that DUST achieves up to 6% gains over state-of-the-art VLA and world-modeling baselines, with inference-time scaling providing an additional 2-5% improvement. In real-world tasks using the Franka Research 3, DUST outperforms baselines by 10% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action-free videos and joint-training with heterogeneous robot and human datasets.

2510.26623 2026-05-29 cs.RO 版本更新

A Sliding-Window Filter for Online Continuous-Time Continuum Robot State Estimation

用于在线连续时间连续体机器人状态估计的滑动窗口滤波器

Spencer Teetaert, Sven Lilge, Jessica Burgner-Kahrs, Timothy D. Barfoot

Affiliations * University of Toronto Robotics Institute

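AI summary Proposes a stochastic sliding-window filter designed specifically for continuum robots; it uses a continuous-time formulation to improve on the accuracy of filter approaches and to enable online operation while running faster than real time.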

Comments 8 pages, 6 figures. Submitted to IEEE-RAS International Conference on Soft Robotics 2026

详情
Journal ref
2026 IEEE 9th International Conference on Soft Robotics (RoboSoft), 239-246
AI中文摘要

连续体机器人的随机状态估计方法通常难以平衡精度和计算效率。尽管最近有几项研究探索了连续体机器人的滑动窗口公式,但这些方法仅限于简化的离散时间近似,并且不提供随机表示。相比之下,当前的随机滤波方法必须以测量速度运行,限制了其全部潜力。最近关于连续体机器人连续时间估计技术的研究显示了一种解决这一运行时约束的原则性方法,但目前仅限于离线操作。在这项工作中,我们提出了一种用于连续体机器人连续时间状态估计的滑动窗口滤波器,它在保持超实时运行速度的同时,改进了滤波方法的精度,并使连续时间方法能够在线操作。这是首个专门为连续体机器人设计的随机滑动窗口滤波器,为该领域的未来研究提供了有希望的方向。

英文摘要

Stochastic state estimation methods for continuum robots (CRs) often struggle to balance accuracy and computational efficiency. While several recent works have explored sliding-window formulations for CRs, these methods are limited to simplified, discrete-time approximations and do not provide stochastic representations. In contrast, current stochastic filter methods must run at the speed of measurements, limiting their full potential. Recent works in continuous-time estimation techniques for CRs show a principled approach to addressing this runtime constraint, but are currently restricted to offline operation. In this work, we present a sliding-window filter (SWF) for continuous-time state estimation of CRs that improves upon the accuracy of a filter approach while enabling continuous-time methods to operate online, all while running at faster-than-real-time speeds. This represents the first stochastic SWF specifically designed for CRs, providing a promising direction for future research in this area.

2509.19318 2026-05-29 eess.SP cs.RO 版本更新

Scensory: Real-Time Robotic Olfactory Perception for Joint Identification and Source Localization

Scensory:用于联合识别和源定位的实时机器人嗅觉感知

Yanbaihui Liu, Erica Babusci, Claudia K. Gunsch, Boyuan Chen

Affiliations * Duke University

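AI summary Proposes Scensory, a learning-based robotic olfaction framework that uses neural networks to decode temporal dynamics in short time series from affordable, cross-sensitive VOC sensor arrays, jointly identifying fungal species (up to 89.85% accuracy) and localizing their source (up to 87.31% accuracy).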

Comments Our project website is at: http://generalroboticslab.com/Scensory

详情
AI中文摘要

尽管机器人在视觉和触觉感知方面取得了快速进展,但使其能够从微弱的、扩散主导的化学信号中推理室内真菌污染仍然是一个未解决的挑战。我们提出了Scensory,一个基于学习的机器人嗅觉框架,该框架能够同时识别真菌种类,并通过由廉价、交叉敏感的VOC传感器阵列测量的短时序信号定位其来源。时间VOC动态编码了化学和空间特征,我们通过基于机器人自动化数据收集并带有空间监督训练的神经网络来解码这些特征。在五种真菌种类中,Scensory在环境条件下使用3-7秒的传感器输入实现了高达89.85%的种类准确率和87.31%的源定位准确率。这些结果证明了从扩散主导的化学信号中实现实时、空间基础的感知的能力,为机器人室内环境监测提供了可扩展且低成本的源定位方法。

英文摘要

While robotic perception has advanced rapidly in vision and touch, enabling robots to reason about indoor fungal contamination from weak, diffusion-dominated chemical signals remains an open challenge. We introduce Scensory, a learning-based robotic olfaction framework that simultaneously identifies fungal species and localizes their source from short time series measured by affordable, cross-sensitive VOC sensor arrays. Temporal VOC dynamics encode both chemical and spatial signatures, which we decode through neural networks trained on robot-automated data collection with spatial supervision. Across five fungal species, Scensory achieves up to 89.85% species accuracy and 87.31% source localization accuracy under ambient conditions with 3-7s sensor inputs. These results demonstrate real-time, spatially grounded perception from diffusion-dominated chemical signals, enabling scalable and low-cost source localization for robotic indoor environmental monitoring.

2508.09976 2026-05-29 cs.RO 版本更新

Masquerade: Learning from In-the-wild Human Videos using Data-Editing

Masquerade: 利用数据编辑从真实世界人类视频中学习

Marion Lepert, Jiaying Fang, Jeannette Bohg

Affiliations * Stanford University

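AI summary Proposes Masquerade, which edits in-the-wild egocentric human videos (estimating 3D hand poses, inpainting the human arms, and overlaying a rendered bimanual robot) to bridge the visual embodiment gap, then pre-trains a visual encoder on the edited videos and fine-tunes a diffusion policy head, outperforming baselines by 5-6x on three long-horizon bimanual kitchen tasks.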

Comments Project website at https://masquerade-robot.github.io/

详情
Journal ref
2026 IEEE International Conference on Robotics and Automation (ICRA), 2026
AI中文摘要

机器人操作研究仍然面临严重的数据稀缺问题:即使是最大的机器人数据集,其规模和多样性也比推动语言和视觉领域近期突破的数据集小几个数量级。我们提出Masquerade,一种编辑真实世界第一人称人类视频以弥合人类与机器人之间视觉具身差距,并利用这些编辑后的视频学习机器人策略的方法。我们的流程通过以下步骤将每段人类视频转化为机器人化演示:(i) 估计3D手部姿态,(ii) 修复人类手臂,(iii) 叠加一个追踪恢复的末端执行器轨迹的渲染双臂机器人。在67.5万帧编辑后的视频片段上预训练一个视觉编码器以预测未来的2D机器人关键点,并在每个任务仅使用50个机器人演示微调扩散策略头时继续该辅助损失,所得到的策略泛化能力显著优于先前工作。在三个分别于三个未见场景中评估的长时程、双臂厨房任务中,Masquerade的性能比基线高出5-6倍。消融实验表明,机器人叠加和联合训练均不可或缺,且性能随编辑后人类视频数量呈对数增长。这些结果表明,明确弥合视觉具身差距能够解锁来自人类视频的庞大、现成数据源,可用于改进机器人策略。

英文摘要

Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into robotized demonstrations by (i) estimating 3-D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. Pre-training a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips, and continuing that auxiliary loss while fine-tuning a diffusion policy head on only 50 robot demonstrations per task, yields policies that generalize significantly better than prior work. On three long-horizon, bimanual kitchen tasks evaluated in three unseen scenes each, Masquerade outperforms baselines by 5-6x. Ablations show that both the robot overlay and co-training are indispensable, and performance scales logarithmically with the amount of edited human video. These results demonstrate that explicitly closing the visual embodiment gap unlocks a vast, readily available source of data from human videos that can be used to improve robot policies.

2506.05985 2026-05-29 cs.LG cs.RO 版本更新

Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning

动态渐进式参数高效专家库混合用于终身机器人学习

Yuheng Lei, Sitong Mao, Shunbo Zhou, Hongyuan Zhang, Xuelong Li, Ping Luo

Affiliations * The University of Hong Kong; Institute of Artificial Intelligence (TeleAI), China Telecom; Huawei Cloud Computing Technologies; Ola Dimensions HKU Shanghai Intelligent Computing Research Center

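AI summary Targets two problems in lifelong learning, the unavailability of task identifiers at test time and knowledge isolated in separate adapters, with DMPEL (Dynamic Mixture of Progressive Parameter-Efficient Expert Library): it builds a low-rank expert library and uses a lightweight router for flexible forward transfer, adds expert coefficient replay to mitigate forgetting, and outperforms existing methods on the LIBERO benchmark with minimal trainable parameters and storage.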

Comments Accepted to Transactions on Machine Learning Research (TMLR) at https://openreview.net/forum?id=MHVBrjS8cG . Code is available at https://github.com/HarryLui98/DMPEL

详情
AI中文摘要

一个通用智能体必须在其生命周期中持续学习和适应,实现高效的前向迁移,同时最小化灾难性遗忘。先前在主导的预训练-微调范式中的工作探索了用于单任务适应的参数高效微调,通过少量参数有效引导冻结的预训练模型。然而,在终身学习背景下,这些方法依赖于测试时任务标识符这一不切实际的假设,并限制了孤立适配器之间的知识共享。为解决这些限制,我们提出了用于终身机器人学习的动态渐进式参数高效专家库混合(DMPEL)。DMPEL逐步构建一个低秩专家库,并采用轻量路由器将专家动态组合成端到端策略,从而实现灵活高效的终身前向迁移。此外,通过利用微调参数的模块化结构,我们引入了专家系数回放,引导路由器准确检索先前遇到任务的冻结专家。该技术缓解了遗忘,同时相比对整个策略进行经验回放,显著节省存储和计算。在终身机器人学习基准LIBERO上的大量实验表明,我们的框架在持续适应过程中的成功率上优于最先进的终身学习方法,同时使用了最少的可训练参数和存储。

英文摘要

A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.
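The notion of a router combining low-rank experts can be illustrated with a small sketch. It assumes each expert is a low-rank update B_i A_i added to a frozen weight matrix and that the router outputs softmax coefficients; neither detail is stated in the abstract, so this is a generic mixture-of-low-rank-adapters illustration with hypothetical names and values, not the DMPEL implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def softmax(scores):
    """Turn router scores into non-negative coefficients that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(frozen, experts, coeffs):
    """Return frozen + sum_i coeffs[i] * (B_i @ A_i) without modifying frozen."""
    out = [row[:] for row in frozen]
    for (b, a), c in zip(experts, coeffs):
        delta = matmul(b, a)  # low-rank update of the same shape as frozen
        for i in range(len(out)):
            for j in range(len(out[0])):
                out[i][j] += c * delta[i][j]
    return out


# Frozen 2x2 pretrained weight (hypothetical values).
frozen = [[1.0, 0.0], [0.0, 1.0]]
# Two rank-1 experts, each stored as (B: 2x1, A: 1x2).
experts = [
    ([[1.0], [0.0]], [[0.0, 2.0]]),
    ([[0.0], [1.0]], [[4.0, 0.0]]),
]
coeffs = softmax([0.0, 0.0])  # equal router scores give equal weights
effective = combine_experts(frozen, experts, coeffs)
```

Because only the small `B` and `A` factors and the router are trained, the number of trainable parameters stays low; storing router coefficients for past tasks is likewise far cheaper than storing raw experience.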

2504.12512 2026-05-29 cs.RO cs.SY eess.SY 版本更新

Practical Insights on Grasp Strategies for Mobile Manipulation in the Wild

野外移动操作抓取策略的实用见解

Isabella Huang, Richard Cheng, Sangwoon Kim, Dan Kruse, Carolyn Chen, Lukas Kaul, JC Hancock, Shanmuga Harikumar, Mark Tjersland, James Borders, Dan Helmick

Affiliations * Toyota Research Institute

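AI summary Reports on deploying the SHOPPER mobile manipulation robot in a real grocery store, presenting an approach to designing general grasp strategies and analyzing the key failure modes observed over hundreds of pick attempts, with practical insights and open challenges for the robotics community.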

Comments 8 pages, 8 figures, submitted to IROS 2025

详情
AI中文摘要

移动操作机器人不断进步,其抓取能力也在快速发展。然而,仍存在显著差距阻碍最先进的移动操作机器人在现实世界中广泛部署,包括它们在非结构化环境中可靠抓取物品的能力。为帮助弥合这一差距,我们开发了SHOPPER,一个旨在推动可靠且可泛化抓取策略边界的移动操作机器人平台。我们开发了这些抓取策略,并将其部署在真实的杂货店中——这是一个因其可操作物品、固定装置和布局的极大多样性而被选中的极具挑战性的环境。在这项工作中,我们提出了设计通用抓取策略以在真实杂货店中拾取任何物品的详细方法。此外,我们提供了对最新真实世界现场测试的深入分析,讨论了与数百次不同抓取尝试中基本故障模式相关的关键发现。通过我们的详细分析,我们旨在提供有价值的实用见解并识别关键的抓取挑战,从而引导机器人社区关注该领域亟待解决的开放问题。

英文摘要

Mobile manipulation robots are continuously advancing, with their grasping capabilities rapidly progressing. However, there are still significant gaps preventing state-of-the-art mobile manipulators from widespread real-world deployments, including their ability to reliably grasp items in unstructured environments. To help bridge this gap, we developed SHOPPER, a mobile manipulation robot platform designed to push the boundaries of reliable and generalizable grasp strategies. We develop these grasp strategies and deploy them in a real-world grocery store -- an exceptionally challenging setting chosen for its vast diversity of manipulable items, fixtures, and layouts. In this work, we present our detailed approach to designing general grasp strategies towards picking any item in a real grocery store. Additionally, we provide an in-depth analysis of our latest real-world field test, discussing key findings related to fundamental failure modes over hundreds of distinct pick attempts. Through our detailed analysis, we aim to offer valuable practical insights and identify key grasping challenges, which can guide the robotics community towards pressing open problems in the field.

2503.00779 2026-05-29 cs.RO 版本更新

Phantom: Training Robots Without Robots Using Only Human Videos

Phantom: 仅使用人类视频训练机器人,无需机器人

Marion Lepert, Jiaying Fang, Jeannette Bohg

Affiliations * Stanford University

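AI summary Proposes a framework that trains robot manipulation policies from human video demonstrations alone, converting them into robot-compatible observation-action pairs through hand pose estimation and visual data editing, which enables zero-shot deployment with success rates of up to 92%.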

Comments Project website at https://phantom-human-videos.github.io

详情
Journal ref
The 9th Conference on Robot Learning (CoRL 2025)
AI中文摘要

训练通用机器人需要从大规模且多样化的数据源中学习。当前方法严重依赖难以扩展的遥操作演示。我们提出一个可扩展的框架,可直接从人类视频演示中训练操作策略,无需任何机器人数据。我们的方法利用手部姿态估计和视觉数据编辑,将人类演示转化为机器人兼容的观察-动作对。我们修复人类手臂并叠加渲染的机器人以对齐视觉域。这使得无需任何微调即可在真实硬件上实现零样本部署。我们在包括可变形物体操作、多物体清扫和插入等一系列任务上展示了高达92%的强成功率。我们的方法可泛化到新环境并支持闭环执行。通过证明仅使用人类视频即可训练有效策略,我们的方法拓宽了可扩展机器人学习的路径。

英文摘要

Training general-purpose robots requires learning from large and diverse data sources. Current approaches rely heavily on teleoperated demonstrations, which are difficult to scale. We present a scalable framework for training manipulation policies directly from human video demonstrations, requiring no robot data. Our method converts human demonstrations into robot-compatible observation-action pairs using hand pose estimation and visual data editing. We inpaint the human arm and overlay a rendered robot to align the visual domains. This enables zero-shot deployment on real hardware without any fine-tuning. We demonstrate strong success rates, up to 92%, on a range of tasks including deformable object manipulation, multi-object sweeping, and insertion. Our approach generalizes to novel environments and supports closed-loop execution. By demonstrating that effective policies can be trained using only human videos, our method broadens the path to scalable robot learning.