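arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Robotics / Embodied Intelligence

Robotics, embodied intelligence, robot learning, manipulation, navigation, and embodied world models.

Papers indexed for today / the current date: 68. Sources: cs.RO, cs.AI, cs.CV, cs.LG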

1. Robot Foundation Models (3 papers)

2606.18375 2026-06-18 cs.RO New submission 95%

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu

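Affiliations * Institute of AI for Industries, Chinese Academy of Sciences

Topic match Robot Foundation Models: proposes a 3D-consistent world foundation model for robotic manipulation.

AI summary Proposes the PAIWorld framework, which addresses the 3D inconsistency of multi-view world models through geometry-aware cross-view attention, geometric rotary position embedding, and latent 3D-REPA distillation, and achieves leading performance on robotic manipulation benchmarks.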

Abstract

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.

2606.17846 2026-06-18 cs.RO cs.CV cs.LG New submission 90%

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen

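Affiliations * Qwen Team

Topic match Robot Foundation Models: foundation model for robotic manipulation, large-scale pretraining.

AI summary Proposes Qwen-RobotManip, which uses a unified alignment framework (spanning the representation, motion, and behavioral dimensions) to enable large-scale co-training on heterogeneous multi-source manipulation data, builds a pretraining corpus of about 38,100 hours, and surpasses prior models in generalization capabilities such as zero-shot instruction following and cross-embodiment transfer.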

Comments 44 pages

Abstract

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Updated version 90%

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

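Affiliations * NVIDIA

Topic match Robot Foundation Models: provides a general-purpose backbone for embodied agents.

AI summary Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and offers a scalable general-purpose backbone for embodied agents.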

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2. Robotic Manipulation (13 papers)

2606.18363 2026-06-18 cs.RO cs.AI New submission 95%

Guava: An Effective and Universal Harness for Embodied Manipulation

Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao

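Affiliations * University of Maryland College Park; University of Illinois Urbana-Champaign; University of Waterloo; Mohamed bin Zayed University of Artificial Intelligence; University of Pennsylvania; Amazon FAR

Topic match Robotic Manipulation: proposes Guava, a harness framework for embodied manipulation that combines reasoning with external modules.

AI summary Proposes the Guava framework, which uses three key design elements (iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations) to distill embodied manipulation capabilities into a 4B open-source model, with performance comparable to frontier proprietary models in simulation and real-world environments.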

Abstract

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

2606.19340 2026-06-18 cs.RO New submission 90%

Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning

Jisoo Kim, Sangwon Baik, Taeksoo Kim, Sungjoo Kim, Junyoung Lee, Mingi Choi, Hanbyul Joo

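Affiliations * Seoul National University

Topic match Robotic Manipulation: zero-shot framework for dexterous manipulation in which a VLM generates 3D plans.

AI summary Proposes a zero-shot framework that uses a VLM on multi-view RGB images to generate 3D task plans, combines triangulation with ray voting for accurate 3D grounding, supports grasping and tool use, and outperforms baselines in real-world experiments.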

Abstract

We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D task plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This lifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for geometrically consistent candidates across neighboring views. The resulting 3D keypoints support both pick-and-place and tool-use: for tool-use, we retrieve an object-centric atomic action corresponding to the inferred skill category and align its stored 6D tool trajectory to the scene; for dexterous execution, we expand the lifted grasp keypoint into a task-conditioned grasp affordance region and generate feasible grasp-motion pairs with an arm-hand motion generator. Real-world experiments show improved 3D grounding accuracy and execution reliability over single-view RGB-D grounding and fine-tuned VLA baselines. We further demonstrate long-horizon manipulation through closed-loop status verification and replanning, enabling zero-shot execution on unseen objects and tool-use tasks in novel scenes.
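
The lifting of 2D keypoints into 3D by triangulation can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which triangulation method is used, so the two-ray midpoint method is assumed here, and the function name `triangulate_midpoint` and the example ray values are hypothetical.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Return the 3D point closest to two camera rays o + t * d.

    o1, o2: camera centers; d1, d2: ray directions through the 2D keypoints
    (all given in one shared world frame, which is an assumption here).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [p - q for p, q in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; cannot triangulate")
    t1 = (b * e - c * d) / denom  # parameter of closest point on ray 1
    t2 = (a * e - b * d) / denom  # parameter of closest point on ray 2
    p1 = [o + t1 * k for o, k in zip(o1, d1)]
    p2 = [o + t2 * k for o, k in zip(o2, d2)]
    # Midpoint of the shortest segment joining the two rays.
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]


# Hypothetical example: two rays that both pass through (1, 2, 3).
point = triangulate_midpoint([0.0, 0.0, 0.0], [1.0, 2.0, 3.0],
                             [1.0, 0.0, 0.0], [0.0, 2.0, 3.0])
```

With noisy keypoints the two rays no longer intersect, and the midpoint gives a simple compromise; the length of the joining segment could serve as a consistency check across views.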

2606.19265 2026-06-18 cs.RO New submission 90%

Shape Sensing of Continuum Robots using Direct Laser Writing

Amber K. Rothe, Nidhi Malhotra, Jaydev P. Desai

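Affiliations * Medical Robotics and Automation (RoboMed) Laboratory; Wallace H. Coulter Department of Biomedical Engineering; Georgia Institute of Technology

Topic match Robotic Manipulation: shape sensing and closed-loop control of continuum robots.

AI summary Uses direct laser writing to fabricate strain sensors integrated into continuum robot joints, predicts joint angles with linear and nonlinear models with error as low as 1.76 degrees, and achieves closed-loop control with tracking error under 3 degrees.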

Comments This work has been submitted to the IEEE for possible publication

Abstract

Continuum robots offer a promising approach for minimally invasive and natural-orifice surgical procedures due to their inherent compliance and dexterity. However, this flexibility also makes estimating the current shape of the robot challenging. Several approaches have been used to reconstruct the shape of these robots, including imaging, optical sensing, magnetic sensing, and resistive sensing. Strain sensors fabricated using direct laser writing (DLW) could provide an alternative sensing method. This technique involves using a laser to induce carbonization of certain polymers to create graphene patterns, such as strain sensors. In this paper, we demonstrate how a flexible continuum joint and a DLW sensor can be machined as one monolithic structure using the same laser and the same setup. The fabricated sensors are characterized using linear and nonlinear models, which are used to predict the joint angle with error as low as 1.76 degrees. Furthermore, we demonstrate how a DLW sensor can be used to implement closed-loop control in a robotic joint, achieving tracking error under 3 degrees.
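
As a rough illustration of the linear characterization mentioned above, the sketch below fits a straight line from a sensor reading to a joint angle by ordinary least squares. The readings and angles are made-up placeholder values, not data from the paper, and the names `fit_line` and `rmse` are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the fitted line on (xs, ys)."""
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Placeholder calibration data (assumed, not measured):
# relative resistance change -> joint angle in degrees.
readings = [0.00, 0.01, 0.02, 0.03, 0.04]
angles = [0.0, 5.0, 10.0, 15.0, 20.0]
slope, intercept = fit_line(readings, angles)
predicted = slope * 0.025 + intercept  # angle estimate for a new reading
```

A real sensor would likely show hysteresis or curvature, which is presumably why the paper also considers nonlinear models.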

2606.19240 2026-06-18 cs.RO cs.CV cs.HC cs.SY eess.SY New submission 90%

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

Thomas M. Kwok, Nicholas Koenig, Yue Hu

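Affiliations * Department of Mechanical and Mechatronics Engineering, University of Waterloo, Canada

Topic match Robotic Manipulation: arm kinematic correction method for teleoperation.

AI summary Proposes an arm kinematic correction method that uses the constant-arm-length geometric constraint and the Pythagorean theorem to deterministically reconstruct occluded joint depths without complex modeling; it is validated against Vicon and applied successfully to teleoperation.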

Abstract

Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
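
The constant-arm-length constraint can be sketched as follows. The abstract does not give the exact formula, so this is one plausible reading under stated assumptions: the wrist's 3D position is trusted, the occluded joint's image-plane coordinates (x, y) are still usable, and the caller knows whether the joint lies behind or in front of the wrist. The function name and all values are hypothetical.

```python
import math


def occluded_joint_depth(wrist, joint_xy, segment_length, joint_behind_wrist=True):
    """Recover the depth (z) of an occluded joint from a fixed segment length.

    wrist: (x, y, z) of the wrist in camera coordinates, in meters.
    joint_xy: (x, y) of the occluded joint in the same frame.
    segment_length: predefined distance between the joint and the wrist.
    """
    wx, wy, wz = wrist
    jx, jy = joint_xy
    planar_sq = (jx - wx) ** 2 + (jy - wy) ** 2
    # Pythagorean theorem: length^2 = planar^2 + depth_offset^2.
    # Clamp at zero so measurement noise cannot produce a negative radicand.
    depth_offset = math.sqrt(max(segment_length ** 2 - planar_sq, 0.0))
    return wz + depth_offset if joint_behind_wrist else wz - depth_offset


# Hypothetical example: 0.5 m segment, 0.3 m lateral offset -> 0.4 m depth offset.
elbow_z = occluded_joint_depth((0.0, 0.0, 1.0), (0.3, 0.0), 0.5)
```

Because the result is a closed-form expression, there are no parameters to tune, which matches the deterministic character the abstract emphasizes.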

2606.19091 2026-06-18 cs.RO New submission 90%

GCNGrasp-VP: Affordance-Guided View Planning for Efficient Task-Oriented Grasping

Zanjia Tong, Wenlong Dong, Chengjie Zhang, Hong Zhang

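Affiliations * Shenzhen Key Laboratory of Robotics and Computer Vision

Topic match Robotic Manipulation: task-oriented grasping, active view planning.

AI summary Proposes the GCNGrasp-VP framework, which uses affordance field prediction to guide active view planning without scene reconstruction; a single view adjustment significantly improves task-oriented grasp success under occlusion.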

Comments Accepted to IROS 2026

Abstract

Task-oriented grasping performance degrades significantly when object views suffer from occlusions. Existing task-oriented grasping methods typically assume task-relevant regions are visible in the initial frame, while view planning approaches enable active perception but often ignore task semantics and rely on time-consuming scene reconstruction. To address these limitations, we present GCNGrasp-VP, an efficient framework integrating affordance field prediction with active view planning. Central to this framework is GCNGrasp-v2, a task-oriented grasp model that simultaneously supports grasp evaluation and affordance field prediction, achieving constant-time inference complexity. Leveraging this capability, our Affordance-guided View Planner (Affordance-VP) utilizes the affordance field as an information gain metric to guide camera observation of task-relevant regions without requiring scene reconstruction. View planning results show that our method significantly outperforms scene-uncertainty-driven baselines with only one view adjustment. Real-world validation further confirms substantial improvements in grasp success rates for single-object scenarios while maintaining millisecond-level computational latency. Code and models are available at https://github.com/Instinct323/GCNGrasp-VP.

2606.18601 2026-06-18 cs.RO New submission 90%

Admittance-Based Surface Alignment for Human-in-the-Loop Robotic Visual Inspection

Antara Banerjee, Colin Acton, Xu Chen

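Affiliations * University of Washington

Topic match Robotic Manipulation: proposes an admittance control framework for precise robot-to-surface alignment.

AI summary Proposes an admittance-based real-time closed-loop control framework that fuses operator input with perception-driven alignment to precisely align the robot end-effector with the local surface; validation on a 6-DOF manipulator shows stable normal tracking and a mean orientation error of 0.4°.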

Abstract

Precision visual inspection underpins quality assurance across aerospace, semiconductor, and medical manufacturing, where undetected surface anomalies on high-value parts translate directly into scrap, rework, and field failures. Robotic visual inspection requires precise alignment between the end-effector and local surface geometry in the presence of perception noise and surface irregularities. In industrial settings, a human operator is often kept in the loop via teleoperation or shared autonomy, introducing real-time adjustments that render purely offline motion planning inadequate. This motivates control architectures capable of reactive, compliant behavior under combined human and perceptual uncertainty. This paper presents a novel real-time, closed-loop robotic orientation control pipeline for precision visual inspection, with an admittance-based framework that unifies operator input and perception-driven surface alignment. We design the end-effector as a virtual sphere moving through a viscous medium, such that the resulting physically interpretable mass-damper system generates synchronized, compliant motion from orientation error and operator commands. We validate the framework on a 6-DOF manipulator, demonstrating stable normal-tracking and a final mean orientation error of 0.4°.
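
The mass-damper idea can be sketched in one rotational degree of freedom. This is a simplified illustration, not the paper's controller: the scalar model, the proportional mapping from orientation error to a virtual torque, and every gain and name below are assumptions.

```python
def admittance_step(omega, torque, mass, damping, dt):
    """One Euler step of  mass * d(omega)/dt + damping * omega = torque."""
    omega_dot = (torque - damping * omega) / mass
    return omega + omega_dot * dt


def simulate(err0, operator_torque, mass=1.0, damping=4.0, gain=4.0,
             dt=0.01, steps=2000):
    """Drive a scalar orientation error with a virtual mass-damper.

    err0: initial orientation error in radians.
    operator_torque: constant virtual torque standing in for operator input.
    Returns the final (error, angular velocity).
    """
    err, omega = err0, 0.0
    for _ in range(steps):
        # Virtual torque combines perception-driven error and operator command.
        torque = gain * err + operator_torque
        omega = admittance_step(omega, torque, mass, damping, dt)
        err -= omega * dt  # rotating toward the surface normal shrinks the error
    return err, omega


final_err, final_omega = simulate(err0=0.3, operator_torque=0.0)
```

With no operator input the error decays smoothly to zero; a constant operator torque shifts the equilibrium to `-operator_torque / gain`, which shows how the two inputs blend in one compliant motion.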

2606.18594 2026-06-18 cs.RO cs.AI New submission 90%

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

Seyed Alireza Azimi, Homayoon Farrahi, Abhishek Naik, Colin Bellinger, A. Rupam Mahmood

Affiliations * Department of Computing Science, University of Alberta; National Research Council Canada; School of Electrical Engineering and Computer Science, University of Ottawa; Vector Institute; Alberta Machine Intelligence Institute (Amii)

Topic match Robotic Manipulation: benchmarks the performance of RL action spaces in vision-based robotic manipulation.

AI summary Evaluates four action spaces on object picking and pushing tasks with sim-to-real transfer, finds that the joint velocity action space is best in smoothness and task performance, and offers RL practitioners guidance on choosing an action space.

Comments 9 pages with references

Abstract

In real-world reinforcement learning (RL), the choice of action space can play a key role in shaping motion smoothness, safety, and overall task performance. In this study, we evaluate pose increment, pose velocity, joint position increment, and joint velocity across two vision-based manipulation tasks: object picking and pushing. We train policies in simulation and deploy them to the real world using sim-to-real transfer. We find that action-space representation indeed significantly affects sim-to-real performance. In particular, we find that the joint velocity action space is best for the vision-based picking and pushing tasks in terms of smoothness and final task performance. We also provide practical guidance for RL practitioners in choosing action spaces for both simulation and real-world experiments.
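
To make the difference between two of the four action spaces concrete, the sketch below shows how a policy output might be turned into a joint position target in each case. The abstract does not describe the interfaces, so the clipping limits, the control period, and the function names are all assumptions.

```python
def clip(value, limit):
    """Clamp value to the symmetric range [-limit, limit]."""
    return max(-limit, min(limit, value))


def apply_joint_velocity(q, action, dt, v_max):
    """Joint-velocity action: integrate a clipped velocity over one control step."""
    return [qi + clip(a, v_max) * dt for qi, a in zip(q, action)]


def apply_joint_position_increment(q, action, dq_max):
    """Joint-position-increment action: add a clipped offset to the current joints."""
    return [qi + clip(a, dq_max) for qi, a in zip(q, action)]


# Hypothetical two-joint example with a 0.1 s control period.
q_now = [0.0, 1.0]
q_vel = apply_joint_velocity(q_now, [0.5, -2.0], dt=0.1, v_max=1.0)
q_inc = apply_joint_position_increment(q_now, [0.5, -2.0], dq_max=0.2)
```

In this reading, the velocity action scales with the control period while the increment does not, which is one way the two spaces can differ in smoothness when the control rate changes between simulation and hardware.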

2605.05925 2026-06-18 cs.RO Updated version 90%

DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

Hyesung Lee, Hyunwoo Jung, Si-Hwan Heo, Sungwook Yang

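Affiliations * Korea Institute of Science and Technology; KAIST; Hanyang University

Topic match Robotic Manipulation: proposes the DexSynRefine framework for dexterous robot manipulation.

AI summary Proposes the DexSynRefine framework, which synthesizes hand-object trajectories with the HOI-MMFP motion prior and combines task-space residual reinforcement learning with contact-dynamics adaptation to turn human-object interaction data into physically feasible dexterous manipulation, raising success rates by 50-70 percentage points across five tasks.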

Comments Project page: https://dexsynrefine.github.io/

Abstract

Learning dexterous manipulation from human-object interaction (HOI) data offers a scalable alternative to robot teleoperation, but HOI demonstrations are typically sparse and purely kinematic, making direct retargeting unreliable under embodiment mismatch and contact-rich dynamics. We present DexSynRefine, a coupled framework that treats HOI data as structured motion priors rather than executable robot actions. DexSynRefine first synthesizes hand-object trajectories conditioned on the task and initial object state using HOI Motion Manifold Flow Primitives (HOI-MMFP), a motion prior for coupled hand-object motion. It then physically grounds them with task-space residual reinforcement learning and adapts execution by inferring missing contact-dynamics context from proprioceptive history. Across five dexterous manipulation tasks, each stage addresses a complementary bottleneck: HOI-MMFP improves trajectory consistency and smoothness, task-space residuals provide the strongest grounding representation among the tested alternatives, and contact-dynamics adaptation enables robust real-world execution. Together, DexSynRefine improves real-world success rates over kinematic retargeting by 50-70 percentage points.

2606.19314 2026-06-18 cs.RO New submission 85%

Modeling Branches for Active Manipulation using Iterative Parameter Estimation

Madhav Rijal, Rashik Shrestha, Trevor Smith, Yu Gu

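Affiliations * Department of Mechanical and Aerospace Engineering, West Virginia University

Topic match Robotic Manipulation: plant branch modeling and active manipulation.

AI summary Proposes a method that models plant branches by iteratively estimating material parameters, and uses finite element simulation with a deformation-aware motion planner for precise branch manipulation, reducing deformation energy by 35.69% on average.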

Comments Accepted to IROS 2026

Abstract

This study presents a method for modeling diverse plant branches by iteratively estimating material parameters to support delicate branch manipulation. Branch manipulation is necessary in agricultural robotics for plant repositioning, stabilizing, and clearing visual obstructions in dense foliage. The proposed method builds a tetrahedral branch model from point-cloud data and simulates its behavior using the finite element method. Using real observed deformation data, it iteratively estimates branch parameters and then computes an optimal path with a deformation-aware motion planner to move and stabilize branches within another robot's field of view. Across 30 trials on branches with varying geometries and material properties, the proposed method reduced the deformation energy by 35.69% while increasing the path length by 8.10% on average.

2606.19233 2026-06-18 cs.RO New submission 85%

Mobile Pedipulation for Object Sliding via Hierarchical Control on a Wheeled Bipedal Robot

Yue Qin, Yulun Zhuang, Zelin Shen, Yanran Ding

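Affiliations * Department of Robotics, University of Michigan

Topic match Robotic Manipulation: a wheeled bipedal robot slides objects with its legs, which falls under robotic manipulation.

AI summary Proposes a hierarchical control framework that lets a wheeled bipedal robot slide planar objects with its legs, using a reduced-order three-rigid-body dynamics model and a trajectory-optimization motion planner; experiments show retrieval of a 1 kg object and sliding of a 4 kg object.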

Comments 8 pages, 7 figures

Abstract

In this letter, we present a hierarchical control framework that enables wheeled bipedal robots to perform planar object sliding tasks with their wheeled legs. The proposed approach formulates a nonlinear model predictive controller (NMPC) based on a reduced-order three rigid bodies (TRB) dynamical model that explicitly accounts for the hip roll degree of freedom and multiple wheel-environment contact modes, which is essential for lateral stepping and pedipulation tasks. Within this framework, the NMPC simultaneously regulates robot locomotion and interaction forces, allowing the robot to stably execute both rolling and object manipulation behaviors. A trajectory-optimization-based robot-object motion planner is developed to generate reference motions that incorporate stick-slip transitions in ground-object contact. Two representative pedipulation motions, namely scooting and lateral sliding, are validated through real-world hardware experiments, in which the robot successfully retrieves a 1 kg object from under a desk and slides a 4 kg object over a distance of 0.228 m via scooting.

2606.19194 2026-06-18 cs.RO 新提交 85%

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

用于机器人操作中一步流匹配的可逆神经网络适配器

Yu Zhang, Kangyi Ji, Yongxiang Zou, Rongtao Xu, Feng Zheng, Long Cheng

专题命中 机器人操作 :可逆神经网络用于机器人操作动作生成

AI总结 提出可逆神经网络适配器,通过一步去噪过程生成高维动作,降低推理复杂度并保持精度,在仿真和真实实验中提升效率。

详情
AI中文摘要

本文提出了一种用于通用机器人操作的可逆神经网络适配器,旨在通过一步去噪过程,基于多模态观测(包括视觉、语言和本体感受输入)生成精确的高维动作。基于流匹配公式,所提出的适配器有效地将动作生成轨迹约束在可逆潜空间内,从而仅需单次推理步骤即可实现高效、高质量的灵巧动作合成。与传统的迭代流匹配策略相比,所提出的框架显著降低了推理复杂度,同时保持了强大的动作预测精度和稳定性。在多种仿真基准和真实机器人平台上进行了大量实验,以评估所提出方法的有效性。在仿真基准测试中,所提出的适配器在广泛的操作任务上持续表现出优于或接近最先进的性能。此外,真实世界实验显示,视觉-语言-动作(VLA)模型的推理效率显著提升,平均推理延迟从110毫秒降低到61毫秒,同时保持了强大的任务性能。

英文摘要

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.
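The difference between iterative and one-step flow-matching inference can be illustrated with a toy example. The sketch below is not the paper's adapter: `toy_velocity` is a hypothetical hand-written velocity field that points along a straight line to a fixed `target_action`, and `euler_sample` is an assumed helper. It only shows that when the generation path is straight, a single Euler step reaches the same endpoint as ten steps while calling the velocity model once.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    evaluations = 0
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
        evaluations += 1  # one model call per integration step
    return x, evaluations

# Hypothetical target action; a real policy would condition on observations.
target_action = np.array([0.3, -0.1, 0.5])

def toy_velocity(x, t):
    # Straight-line flow toward the target (assumption, for illustration only).
    return (target_action - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(3)

iterative_action, iterative_calls = euler_sample(toy_velocity, noise, 10)
one_step_action, one_step_calls = euler_sample(toy_velocity, noise, 1)
```

With a learned, curved velocity field the one-step result would generally differ from the multi-step one, which is the gap a one-step method has to close.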

2606.19089 2026-06-18 cs.RO 新提交 85%

ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing

ART-VS:用于视觉Transformer视觉伺服的自适应分辨率分块

Alessandro Scherl, Bernhard Neuberger, Simon Schwaiger, David Mulero-Pérez, Lucas Muster, Jose Garcia-Rodriguez

发表机构 * Department of Computer Technology, University of Alicante(阿利坎特大学计算机技术系) Department of Industrial Engineering, UAS Technikum Vienna(维也纳应用技术大学工业工程系) Automation and Control Institute, TU Wien(维也纳工业大学自动化与控制研究所) Institute of Software Engineering and Artificial Intelligence, Graz University of Technology(格拉茨工业大学软件工程与人工智能研究所) Institute for Integrative Nature Conservation Research, University of Natural Resources and Life Sciences Vienna(维也纳自然资源与生命科学大学综合自然保护研究所)

专题命中 机器人操作 :视觉伺服,自适应分辨率分块。

AI总结 提出ART-VS方法,通过粗-精两阶段自适应调整特征粒度,在不需任务特定训练下提升视觉伺服鲁棒性和精度,显著降低定位误差并提高速度。

Comments Accepted at IROS 2026

详情
AI中文摘要

基于自监督视觉Transformer(ViT)特征的视觉伺服实现了无需训练的机器人定位,具有强泛化能力,但面临鲁棒性与精度之间的根本权衡。粗粒度的块级描述符提供稳定的对应关系,但限制了定位精度。提高图像分辨率可改善精度,但鲁棒性增益有限:在扰动下,高分辨率处理仅将收敛成功率从76.6%提升至81.0%,尽管ViT块数量增加了12倍。因此,我们提出自适应分辨率分块视觉伺服(ART-VS),一种两阶段方法,根据伺服进程调整特征粒度:先以原生ViT分辨率进行粗阶段实现稳定对齐,然后进行分块高分辨率阶段,将匹配限制在局部邻域以提高定位精度。无需任何任务特定训练,ART-VS在扰动下达到95.4%的收敛率,比标准分辨率和全分辨率ViT伺服分别高出18.8和14.4个百分点。与前者相比,定位误差降低53%,同时运行速度比后者快10倍以上,VRAM使用减少27%。我们在三个ViT骨干网络上验证了ART-VS,并展示了真实世界类别级抓取未见过的物体实例,透明瓶成功率95/100,鞋子成功率98/100。代码见 https://art-vs.github.io/。

英文摘要

Visual servoing with self-supervised Vision Transformer (ViT) features enables training-free robotic positioning with strong generalization, but faces a fundamental trade-off between robustness and precision. Coarse patch-level descriptors provide stable correspondences yet limit positioning accuracy. Increasing image resolution improves precision but yields only marginal robustness gains: under perturbation, high-resolution processing improves the convergence success rate from 76.6% to just 81.0% despite using 12x more ViT patches. Therefore, we propose Adaptive Resolution Tiling Visual Servoing (ART-VS), a two-phase method that adapts feature granularity to servoing progress: a coarse phase at native ViT resolution for stable alignment, then a tiled high-resolution phase that restricts matching to local neighborhoods to improve positioning accuracy. Without any task-specific training, ART-VS achieves 95.4% convergence under perturbation, outperforming standard and full-resolution ViT-based servoing by 18.8 and 14.4 percentage points, respectively. Compared with the former, it reduces positioning error by 53%, while running at over 10x higher speed and with 27% lower VRAM usage than the latter. We validate ART-VS across three ViT backbones and demonstrate real-world category-level grasping of unseen object instances, achieving 95/100 on transparent bottles and 98/100 on shoes. Code is available at https://art-vs.github.io/.
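The reported margins are differences in percentage points, not relative improvements. As a quick check using only the convergence rates quoted in the abstract (the variable names are illustrative), the arithmetic can be reproduced as follows:

```python
# Convergence rates under perturbation, as quoted in the abstract (percent).
standard_resolution = 76.6
full_resolution = 81.0
art_vs = 95.4

# Absolute gains in percentage points.
gain_over_standard = round(art_vs - standard_resolution, 1)
gain_over_full = round(art_vs - full_resolution, 1)

# Relative gain over the standard-resolution baseline, for contrast.
relative_gain_over_standard = round(
    100.0 * (art_vs - standard_resolution) / standard_resolution, 1
)
```

The two absolute gains match the 18.8 and 14.4 points stated in the abstract; the relative gain over the standard baseline is a different, larger number and is not a figure the paper reports.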

2606.18883 2026-06-18 cs.RO 新提交 85%

ZiMPedance: Impedance-Aware ZMP Modeling and Control for Payload Carrying with Quadruped Robots

ZiMPedance:面向四足机器人负载搬运的阻抗感知ZMP建模与控制

Giovanni B. Dessy, Lorenzo Amatucci, Victor Barasuol, Claudio Semini

发表机构 * Dynamic Legged Systems Lab, Istituto Italiano di Tecnologia (IIT)(动态腿部系统实验室,意大利技术研究院(IIT))

专题命中 机器人操作 :四足机器人负载搬运,阻抗感知控制

AI总结 提出扩展零力矩点(ZMP)公式以包含被动负载接口动力学,结合模型预测控制减少稳定性违规达10倍,并提高运动效率。

详情
AI中文摘要

四足机器人的负载运输受到机器人与负载之间物理接口动力学的强烈影响。与主动机械臂相比,被动弹簧臂减轻了重量和复杂性,但其弹簧-阻尼动力学可能引入振荡力,降低运动稳定性。本文推导了一个扩展的零力矩点(ZMP)公式,该公式包含被动负载接口动力学,将刚度、阻尼和负载质量与稳定性裕度联系起来。分析表明,欠阻尼配置可能与运动谐波共振。基于这一见解,我们通过被动子系统动力学增强了单刚体动力学模型,并将其集成到模型预测控制框架中。在仿真中,与标称基线相比,所提出的控制器将稳定性违规减少多达10倍(从7.0%降至0.7%),并通过将水平地面反作用力的施力量降低多达15%来提高运动效率。硬件实验表明,在标称控制器失效的拉放扰动下,携带2公斤负载的机器人能够稳定运动。同一模型还使得通过被动臂动力学实现末端执行器跟踪成为可能,而无需直接驱动臂。

英文摘要

Load transportation with quadruped robots is strongly affected by the dynamics of the physical interface between the robot and the load. Passive spring-based arms reduce weight and complexity compared to active manipulators, but their spring-damper dynamics can introduce oscillatory forces that degrade locomotion stability. This paper derives an extended Zero Moment Point (ZMP) formulation that includes passive payload-interface dynamics, relating stiffness, damping, and payload mass to the stability margin. The analysis shows that underdamped configurations can resonate with locomotion harmonics. Based on this insight, we augment a Single Rigid Body Dynamics model with passive subsystem dynamics and integrate it into a Model Predictive Control framework. In simulation, the proposed controller reduces stability violations by up to 10x, from 7.0% to 0.7%, and increases locomotion efficiency by lowering horizontal ground reaction force effort by up to 15% compared to a nominal baseline. Hardware experiments with a 2 kg payload show stable locomotion under pull-release disturbances where the nominal controller fails. The same model also enables end-effector tracking through passive arm dynamics without direct arm actuation.
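The remark about underdamped configurations can be made concrete with the textbook mass-spring-damper relations. The sketch below is not the paper's extended ZMP formulation: it treats the payload interface as a single second-order system, and the stiffness and damping values are hypothetical (only the 2 kg payload mass comes from the abstract).

```python
import math

def spring_damper_properties(stiffness, damping, mass):
    """Standard second-order quantities for m*x'' + c*x' + k*x = 0."""
    natural_frequency = math.sqrt(stiffness / mass)           # rad/s
    damping_ratio = damping / (2.0 * math.sqrt(stiffness * mass))
    underdamped = damping_ratio < 1.0
    # The oscillation frequency only exists for underdamped systems.
    damped_frequency_hz = (
        natural_frequency * math.sqrt(1.0 - damping_ratio ** 2) / (2.0 * math.pi)
        if underdamped
        else None
    )
    return natural_frequency, damping_ratio, underdamped, damped_frequency_hz

# Hypothetical interface parameters; payload mass of 2 kg as in the abstract.
omega_n, zeta, is_underdamped, f_d = spring_damper_properties(
    stiffness=200.0,  # N/m (assumed)
    damping=4.0,      # N*s/m (assumed)
    mass=2.0,         # kg
)
```

If the resulting oscillation frequency `f_d` lies close to the gait frequency or one of its harmonics, the payload can be excited by the locomotion itself, which is the resonance risk the abstract describes.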

3. 机器人学习 10 篇

2606.19333 2026-06-18 cs.RO cs.CV 新提交 90%

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Do as I Do: 从日常人类视频中获取灵巧操作数据

Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

发表机构 * UC Berkeley(加州大学伯克利分校)

专题命中 机器人学习 :从人类视频重建灵巧操作数据

AI总结 提出DO AS I DO算法,从单目RGB人类视频中重建手-物交互并重定向到多指灵巧机器人手,生成可执行的操作数据,优于现有方法。

Comments Project website: https://do-as-i-do.com/

详情
AI中文摘要

我们如何可扩展地生成机器人操作数据,特别是在像多指灵巧手这样的人形平台上?从人类视频中学习最近成为这个问题的可能答案。然而,估计手-物交互和跨越人-机器人具身差距的困难阻碍了将丰富的单目RGB人类视频作为机器人操作数据的主要来源。在这项工作中,我们提出了DO AS I DO,一种将单目RGB人类视频重建并重定向到多指灵巧机器人手的算法。DO AS I DO从各种自我中心和外部中心的野外视频源中重建手-物交互。然后,该算法将这些手-物交互估计重定向为一系列可在现实世界中执行的动作,从不同的人类视频中生成机器人完整的操作数据。总体而言,DO AS I DO在从RGB视频中估计手-物交互和提取灵巧操作轨迹方面优于先前的最先进技术,正如我们在具有真实标签的数据集和在线收集的视频片段数据集上的实验所示。我们的实验使我们能够为从业者收集人类操作数据提出一个有效性指南。

英文摘要

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

2606.18772 2026-06-18 cs.RO 新提交 90%

HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations

HALOMI: 从人类演示中学习具有主动感知的人形机器人全身操控

Zehui Zhao, Yuxuan Zhao, Gaojing Zhang, Chenxi Liu, Maolin Zheng, Wenzhao Lian

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of Sussex(萨塞克斯大学) East China University of Science and Technology(华东理工大学)

专题命中 机器人学习 :人形机器人全身操控,从人类演示学习。

AI总结 提出HALOMI框架,通过扩展通用操控接口(UMI)实现主动感知,利用流形约束控制器和观察-动作对齐,在Unitree G1人形机器人上完成五项真实任务验证,其中三项定量评估任务的平均成功率达85%。

详情
AI中文摘要

人类演示可以大规模收集,并自然捕捉主动的手眼协调,是学习人形机器人全身操控的有前景的数据源。然而,直接将人类演示迁移到人形机器人需要精确的世界坐标系跟踪控制器,这在分布外(OOD)目标下通常脆弱,而人形差异在自我中心观察和动作执行中持续存在。为解决这些挑战,我们提出HALOMI,一个从人类演示中学习具有主动感知的人形机器人全身操控的可扩展框架。HALOMI扩展了通用操控接口(UMI)并加入自我中心感知,以大规模收集自我视角和手腕视角观察以及头-手轨迹。我们进一步提出一个流形约束控制器,在学习的潜在行为流形中规划,以实现世界坐标系中精确鲁棒的头-手跟踪。为弥合人形差异,我们进行自我视角对齐,并引入控制器感知的参考轨迹自适应,以减少观察和动作执行中的不匹配。我们在配备活动脖子的Unitree G1人形机器人上验证HALOMI,涉及导航、抓取、双手操控、全身协调和动态行为五项真实任务。在三个定量评估的任务中,HALOMI平均成功率达85%,而额外定性演示显示其支持动态抛掷和深蹲抓取的能力。

英文摘要

Human demonstrations, which can be collected at scale and naturally capture active hand-eye coordination, are a promising data source for learning humanoid loco-manipulation. However, directly transferring human demonstrations to humanoids requires a precise world-frame tracking controller, which is often brittle under out-of-distribution (OOD) targets, while human-to-humanoid gaps persist in both egocentric observation and action execution. To address these challenges, we present HALOMI, a scalable framework for learning humanoid loco-manipulation with active perception from human demonstrations. HALOMI extends Universal Manipulation Interface (UMI) with egocentric sensing to collect ego-view and wrist-view observations along with head-hand trajectories at scale. We further propose a manifold-constrained controller that plans in a learned latent behavior manifold to enable precise and robust head-hand tracking in the world frame. To bridge the human-to-humanoid gap, we perform ego-view alignment and introduce a controller-aware reference trajectory adaptation to reduce mismatch in both observation and action execution. We validate HALOMI on a Unitree G1 humanoid robot with an actuated neck across five real-world tasks involving navigation, grasping, bimanual manipulation, whole-body coordination, and dynamic behaviors. Across the three quantitatively evaluated tasks, HALOMI achieves an average success rate of 85%, while additional qualitative demonstrations show its ability to support dynamic tossing and deep-squat grasping.

2606.18704 2026-06-18 cs.RO 新提交 90%

Selective Unit-Cell Actuation in Lattice Structures for Distributed Morphology in Soft Robots

晶格结构中的选择性单元胞驱动用于软体机器人的分布式形态变化

Trevor Exley, Altair Coutinho, Lucia Beccai

发表机构 * Istituto Italiano di Tecnologia (IIT)(意大利技术研究院)

专题命中 机器人学习 :软体机器人晶格结构驱动与形态控制

AI总结 提出嵌入式气动单元胞,将弯曲支柱晶格与双向波纹管致动器集成,通过空间驱动模式实现全局形态控制,实验验证了可扩展位移、力生成及弯曲、抓取和爬行运动。

Comments Accepted to IROS 2026, 8 pages, 5 figures

详情
AI中文摘要

软晶格结构越来越多地用于机器人中以定制柔顺性和引导变形;然而,驱动通常是在设备或模块级别引入,致动器插入到原本被动的架构中。在这项工作中,我们将致动器-晶格协同设计推进到单元胞尺度。我们提出了一种嵌入式气动单元胞,它将弯曲支柱晶格几何形状与双向波纹管致动器集成在一个单一的整体元件中。当镶嵌时,晶格作为一个分布式驱动场,其中全局形态由空间驱动模式而非均匀加压控制。对1x1、2x2和3x3镶嵌的实验表征展示了可扩展的位移和力生成,具有可重复的循环性能。在3x3x3阵列中,单元胞的选择性驱动产生了不同的全局变形模式,包括弯曲和定向抓取,而无需改变硬件配置。此外,耦合主动和被动单元胞实现了弯曲驱动的爬行运动,证明了异质镶嵌可以通过不对称变形进行平移。这些结果确立了单元胞级驱动作为晶格基软体机器人分布式变形的策略,并为可扩展的整体机器人架构提供了基础。

英文摘要

Soft lattice structures are increasingly used in robotics to tailor compliance and guide deformation; however, actuation is typically introduced at the device or module level, with actuators inserted into otherwise passive architectures. In this work, we move actuator-lattice co-design to the unit-cell scale. We present an embedded pneumatic unit cell that integrates curved-strut lattice geometry with a bidirectional bellow actuator within a single monolithic element. When tessellated, the lattice functions as a distributed actuation field in which global morphology is governed by spatial actuation patterns rather than uniform pressurization. Experimental characterization of 1x1, 2x2, and 3x3 tessellations demonstrates scalable displacement and force generation with repeatable cyclic performance. Selective actuation of unit cells in a 3x3x3 array produces distinct global deformation modes, including bending and directional grasping, without altering hardware configuration. Additionally, coupling active and passive unit cells enables bending-driven crawling locomotion, demonstrating that heterogeneous tessellations can translate through asymmetric deformation. These results establish unit-cell-level actuation as a strategy for distributed morphing in lattice-based soft robots and provide a foundation for scalable, monolithic robotic architectures.

2606.13672 2026-06-18 cs.RO 新提交 90%

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

WEAVER,更好、更快、更长:一种有效的机器人操作世界模型

Arnav Kumar Jain, Yilin Wu, Jesse Farebrother, Gokul Swamy, Andrea Bajcsy

发表机构 * Mila - Québec AI Institute(Mila - 魁北克人工智能研究所) Université de Montréal(蒙特利尔大学) Carnegie Mellon University(卡内基梅隆大学) McGill University(麦吉尔大学)

专题命中 机器人学习 :世界模型用于机器人操作策略评估与规划

AI总结 提出WEAVER世界模型架构,通过流匹配损失训练多视图潜在预测,同时实现高保真度、长程一致性和高效推理,在机器人操作任务中显著提升策略评估、改进和测试时规划性能。

详情
AI中文摘要

世界模型(即学习型模拟器)对机器人技术的潜在影响深远,涵盖策略评估、策略改进和测试时规划,且都只需有限的真实世界交互。为了解锁这些下游能力,世界模型需要同时满足三个期望:(i)保真度(即产生与现实相关的模拟轨迹),(ii)一致性(即产生在长时域上连贯的模拟轨迹),以及(iii)效率(即快速产生模拟轨迹)。我们提出WEAVER(面向具身推理的多视图世界估计):一种同时实现所有三个期望的世界模型架构,在机器人操作任务上提供了最先进的结果。WEAVER是一个多视图世界模型,通过流匹配损失训练以预测未来潜在状态和奖励值。我们提炼了模型架构、记忆和预测目标方面的关键设计决策,以解锁那些困扰先前世界建模方法的长时域动态操作任务。我们将WEAVER应用于机器人硬件,展示了其在策略评估(与真实世界成功率的相关系数ρ = 0.870)、策略改进(在π0.5机器人基础模型上真实世界成功率提升38%)和测试时规划(真实世界成功率提升14%,且比先前世界模型快5-10倍)方面的有效性。WEAVER在分布外场景评估中也表现出优于先前世界模型的性能。代码、模型和视频见:https://arnavkj1995.github.io/WEAVER/。

英文摘要

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: (i) fidelity (i.e., producing simulated trajectories that correlate with reality), (ii) consistency (i.e., producing simulated trajectories that are coherent over long horizons), and (iii) efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. WEAVER is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER in robotic hardware, demonstrating its effectiveness at policy evaluation (ρ = 0.870 correlation with real-world success rate), policy improvement (real-world success rate improvement of 38% on top of the π0.5 robot foundation model), and test-time planning (real-world success rate improvement of 14% with a 5-10x speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/

2510.18085 2026-06-18 cs.RO cs.AI cs.MA 版本更新 90%

R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

R2BC: 从单智能体演示进行多智能体模仿学习

Connor Mattson, Varun Raveendra, Ellen Novoseller, Nicholas Waytowich, Vernon J. Lawhern, Daniel S. Brown

发表机构 * Kahlert School of Computing, University of Utah(犹他大学凯勒尔计算学院) DEVCOM Army Research Laboratory(陆军研究实验室)

专题命中 机器人学习 :多机器人模仿学习,核心是机器人学习

AI总结 提出R2BC方法,通过轮换单智能体演示训练多机器人系统,无需联合动作空间演示,在模拟和实物任务中性能媲美或超越基于特权同步演示的基线方法。

Comments 8 pages, 6 figures. In Proceedings: IEEE International Conference on Robotics & Automation (ICRA 2026)

详情
AI中文摘要

模仿学习(IL)是人类教授机器人的自然方式,尤其是在高质量演示易于获取的情况下。虽然IL已广泛应用于单机器人场景,但将其扩展到多智能体系统的研究相对较少,尤其是在单个人类必须为协作机器人团队提供演示的场景中。本文介绍并研究了轮换行为克隆(R2BC),该方法使单个人类操作员能够通过顺序的单智能体演示有效训练多机器人系统。我们的方法允许人类一次远程操作一个智能体,并逐步向整个系统教授多智能体行为,无需联合多智能体动作空间的演示。我们表明,在四个多智能体模拟任务中,R2BC方法的性能与基于特权同步演示的Oracle行为克隆方法相当,甚至在某些情况下超越后者。最后,我们在两个使用真实人类演示训练的物理机器人任务上部署了R2BC。

英文摘要

Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.
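The idea of demonstrating for one agent at a time can be sketched as a simple rotation over agents. This is only an illustration of the round-robin scheduling idea under stated assumptions, not the R2BC algorithm itself: `round_robin_schedule`, `collect_demonstrations`, and the `record_demo` callback are hypothetical names, and the sketch omits what the non-teleoperated agents do during each demonstration.

```python
from itertools import cycle, islice

def round_robin_schedule(agent_ids, num_demonstrations):
    """Return which agent the operator teleoperates in each demonstration."""
    return list(islice(cycle(agent_ids), num_demonstrations))

def collect_demonstrations(agent_ids, num_demonstrations, record_demo):
    """Build one single-agent dataset per agent by rotating through the team."""
    datasets = {agent_id: [] for agent_id in agent_ids}
    for agent_id in round_robin_schedule(agent_ids, num_demonstrations):
        # record_demo is a placeholder for a teleoperation session.
        datasets[agent_id].append(record_demo(agent_id))
    return datasets

team = ["robot_0", "robot_1", "robot_2"]
schedule = round_robin_schedule(team, 5)
datasets = collect_demonstrations(team, 5, lambda agent_id: f"demo_for_{agent_id}")
```

Each agent ends up with its own single-agent dataset, so a behavior-cloning policy can be fit per agent without ever recording a joint multi-agent action.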

2606.18328 2026-06-18 cs.RO 新提交 88%

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

恢复、发现、规划:从机器人失败中学习技能与概念

Bowen Li, Mayank Mishra, Y. Isabel Liu, Stone Tao, Nishanth Kumar, Alexander G. Gray, Ruwan Wickramarachchi, Jonathan Francis, Sebastian Scherer, Tom Silver

发表机构 * CMU(卡内基梅隆大学) Princeton(普林斯顿大学) AI2(艾伦人工智能研究所) MIT(麻省理工学院) Centaur AI Bosch Center for AI(博世人工智能中心)

专题命中 机器人学习 :从机器人失败中学习技能与概念,实现长期规划。

AI总结 提出ReSYNC方法,通过技能学习与概念发现的交替过程,从失败恢复经验中逐步构建抽象谓词,实现全局失败避免和长期规划,性能提升超50%。

Comments 9 pages, 6 figures. Website: https://jaraxxus-me.github.io/ReSYNC/

详情
AI中文摘要

智能机器人不仅应该从失败中恢复,还应该获取必要的抽象知识以避免未来的失败。虽然强化学习(RL)可以学习反应性恢复行为,但为每种不同的失败模式训练单独的策略效率极低。我们引入了恢复驱动的关系概念综合(ReSYNC),这是第一种从失败恢复经验中逐步发现并细化状态抽象(关系谓词)以支持抽象规划的方法。与纯粹的反应性方法不同,ReSYNC通过增量双学习过程联合学习技能和概念。在技能学习阶段,机器人使用RL学习从训练任务中出现的失败中恢复。在概念学习阶段,机器人发现新的关系谓词并细化其抽象规划模型,以解释和泛化所学的恢复行为。这种交互使ReSYNC能够将训练中看到的局部恢复转化为测试时的全局失败避免。在四个模拟领域,我们展示了ReSYNC持续扩展和细化其抽象库的能力,使其能够解决长期、前所未见的问题,性能超过强基线50%以上。此外,我们展示了ReSYNC的仿真到现实迁移,其中它执行真实世界的非抓取操作技能,并通过抽象规划泛化到未见场景。总体而言,ReSYNC代表了朝着机器人自主获取抽象以实现物理世界中可扩展的、感知失败的规划迈出的重要一步。

英文摘要

Intelligent robots should not only recover from failures, but also acquire the abstract knowledge needed to avoid them in the future. While reinforcement learning (RL) can learn reactive recovery behaviors, training a separate policy for every distinct failure mode is highly inefficient. We introduce Recovery-Driven Synthesis of Relational Concepts (ReSYNC), the first approach that progressively discovers and refines state abstractions (relational predicates) from failure-recovery experience to support abstract planning. Unlike purely reactive methods, ReSYNC jointly learns skills and concepts through an incremental dual-learning process. In the skill-learning phase, the robot uses RL to learn to recover from failures seen in training tasks. In the concept-learning phase, the robot discovers new relational predicates and refines its abstract planning model to explain and generalize the learned recovery behaviors. This interaction enables ReSYNC to convert local recoveries seen during training into global failure avoidance at test time. Across four simulated domains, we show that ReSYNC's ability to continually expand and refine its abstraction library allows it to solve long-horizon, previously unseen problems, outperforming strong baselines by over 50%. Additionally, we demonstrate sim-to-real transfer of ReSYNC, where it performs real-world non-prehensile manipulation skills and generalizes to unseen scenarios through abstract planning. Overall, ReSYNC represents a significant step toward robots that autonomously acquire abstractions for scalable, failure-aware planning in the physical world.

2606.18959 2026-06-18 cs.RO 新提交 85%

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer

TactSpace: 学习富含物理信息的共享潜在空间以实现触觉模拟到现实的迁移

Arunim Joarder, Arjun Bhardwaj, René Zurbrügg, Mayank Mittal, Florin Püntener, Sira Bielefeldt, Cosmin Roman, Vaishakh Patil, Marco Hutter

发表机构 * Robotic Systems Lab, ETH Zürich(瑞士苏黎世联邦理工学院机器人系统实验室) Micro- and Nanosystems Lab, ETH Zürich(瑞士苏黎世联邦理工学院微纳系统实验室) ETH AI Center(苏黎世联邦理工学院人工智能中心) NVIDIA(NVIDIA公司)

专题命中 机器人学习 :学习共享潜在空间实现触觉模拟到现实迁移。

AI总结 提出多模态表示学习框架TactSpace,通过共享潜在空间对齐异构触觉模态,实现零样本模拟到现实迁移,在力预测和形状重建任务中分别降低误差16.7%和45.8%。

Comments 9 pages, 6 figures, 4 tables, accepted into IROS 2026

详情
AI中文摘要

触觉传感提供了对机器人操作至关重要的接触相互作用的直接测量。然而,当前的模拟器缺乏足够保真度来忠实模拟触觉传感器的复杂变形和换能机制,严重阻碍了机器人学习流程中的模拟到现实迁移。为了解决这一挑战,我们提出了一种多模态表示学习框架,该框架在共享潜在空间内对齐异构触觉模态,消除了对精确原始信号模拟的需求,同时保留了相关的接触信息。我们的方法采用模态特定编码器将不同的触觉观测(例如模拟穿透深度和真实电容)投影到公共嵌入空间中。该模型使用自重建和交叉重建目标以及对比对齐进行训练,鼓励模态不变且信息丰富的表示。我们在压头形状识别、力预测和几何重建任务上评估学习到的嵌入,仅在模拟中训练并直接在真实传感器测量上测试。我们的结果展示了跨物理不同表示的零样本模拟到现实迁移。此外,结合多物理模拟模态产生了更信息丰富的嵌入,这些嵌入可跨不同下游任务迁移,力预测误差降低16.7%,形状重建误差降低45.8%。最后,我们为Isaac Lab发布了一个基于Warp的高效罚函数触觉模拟模型实现,支持可扩展的触觉数据生成。

英文摘要

Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach employs modality-specific encoders to project diverse tactile observations, such as simulated penetration depth and real-world capacitance, into a common embedding space. The model is trained using self- and cross-reconstruction objectives alongside contrastive alignment, encouraging modality-invariant yet information-rich representations. We evaluate the learned embeddings on indenter shape identification, force prediction, and geometric reconstruction tasks, training exclusively in simulation and testing directly on real sensor measurements. Our results demonstrate zero-shot sim-to-real transfer across physically dissimilar representations. Furthermore, incorporating multi-physics simulation modalities yields more informative embeddings that transfer across diverse downstream tasks, demonstrating a 16.7% reduction in force prediction error and a 45.8% reduction in shape reconstruction error. Finally, we release an efficient Warp-based implementation of a penalty-based tactile simulation model for Isaac Lab, enabling scalable tactile data generation.

2606.18828 2026-06-18 cs.RO cs.AI 新提交 85%

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

空间即智能:用于黎曼度量生成的神经半群叠加

Chenghao Xu

发表机构 * National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University(湖南大学机器人视觉感知与控制技术国家工程研究中心)

专题命中 机器人学习 :通过黎曼度量生成实现机器人运动规划,零样本泛化

AI总结 提出将智能置于空间本身,通过神经半群叠加机制生成黎曼度量,使动作简化为测地线跟随,在单个双障碍场景上训练后零样本泛化到未见过的障碍配置。

详情
AI中文摘要

传统方法将智能置于智能体中,无论是作为学习策略还是搜索过程。我们则将智能置于空间本身:场景在构型流形上诱导一个黎曼度量,动作简化为跟随该度量的测地线,而无需调用单独的规划器或碰撞检查器。一个单一的编码器-路由器网络通过三个互补的参数组实现这一思想——框架参数(定向生成器)、调制参数(控制空间传播)和基本系数(决定强度)。这些组通过共享的半群叠加机制组合,产生单个黎曼度量场,形成一种紧凑的架构,其几何复杂度自然随场景复杂度扩展。在单个双障碍场景上训练后,该模型在未见过的障碍配置上展现出鲁棒的零样本泛化能力,无碰撞路径成本与障碍穿透路径成本相差数个数量级。

英文摘要

Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups -- frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.
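How a metric field can separate collision-free from obstacle-penetrating paths can be seen in a toy setting. The sketch below does not use the paper's network or semigroup mechanism: `metric` is a hypothetical hand-written field that inflates distances near one circular obstacle, and `path_length` only evaluates the Riemannian length of given polylines rather than solving for geodesics.

```python
import numpy as np

def metric(x, center=np.array([0.5, 0.0]), amplitude=1000.0, sigma=0.1):
    """Toy metric tensor G(x): identity far away, strongly inflated near the obstacle."""
    d2 = np.sum((x - center) ** 2)
    scale = 1.0 + amplitude * np.exp(-d2 / (2.0 * sigma ** 2))
    return scale * np.eye(2)

def path_length(waypoints, samples_per_segment=200):
    """Approximate the length sum of sqrt(dx^T G(x) dx) along a polyline."""
    total = 0.0
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        ts = np.linspace(0.0, 1.0, samples_per_segment + 1)
        pts = a + ts[:, None] * (b - a)
        for p, q in zip(pts[:-1], pts[1:]):
            step = q - p
            G = metric(0.5 * (p + q))  # evaluate the metric at the midpoint
            total += np.sqrt(step @ G @ step)
    return total

# A straight path through the obstacle versus a detour around it.
path_through = np.array([[0.0, 0.0], [1.0, 0.0]])
path_detour = np.array([[0.0, 0.0], [0.5, 0.6], [1.0, 0.0]])

cost_through = path_length(path_through)
cost_detour = path_length(path_detour)
```

Although the detour is longer in Euclidean terms, its length under the toy metric is much smaller, so a geodesic of such a metric would bend around the obstacle without a separate collision checker.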

2606.18747 2026-06-18 cs.RO cs.AI 新提交 85%

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

通过基于人类反馈的迭代强化学习利用大语言模型生成自然且富有表现力的机器人手势

Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

发表机构 * University of New South Wales(新南威尔士大学) Universidad Central de Chile(智利中央大学)

专题命中 机器人学习 :机器人手势生成,RLHF优化表达。

AI总结 针对社交机器人手势生成僵硬问题,提出将ChatGPT集成到Pepper机器人中生成共语手势,并引入基于人类反馈的迭代强化学习(RLHF)优化手势,实验表明RLHF提升了手势的表现力、相关性和流畅性。

Comments 8 pages, 6 figures

详情
AI中文摘要

富有表现力的手势对于自然有效的沟通至关重要,当仅靠语言线索不足时(例如,指向),手势可以补充言语。对于像Pepper这样的人形社交机器人,产生自然且富有表现力的动作对于改善人机交互(HRI)和长期接受度至关重要。然而,由于依赖专家编写的动画,生成手势仍然具有挑战性,导致行为僵硬,难以适应动态和多样化的环境。或者,机器学习方法通常难以捕捉感知的自然性,随着自由度的增加而变得更加困难。因此,产生富有表现力的机器人手势需要一个能够适应环境同时遵守社会规范和物理约束的系统。大语言模型(LLMs)的最新进展使得动态代码生成成为可能,为从自然语言实时合成手势提供了新的机会。在本文中,我们将ChatGPT集成到人形机器人Pepper中,以生成与对话输出一致的共语手势。虽然这一基线实现了灵活的手势生成,但生成的动作通常被认为僵硬且不自然。为了解决这一限制,我们引入了一种基于人类反馈的迭代强化学习(RLHF)系统,该系统根据用户评估微调手势生成,并利用迭代用户研究比较Pepper生成的手势。我们的结果表明,RLHF改进了LLM的共语生成能力,产生了更富有表现力、相关且流畅的动作。

英文摘要

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

2606.18698 2026-06-18 cs.RO cs.AI cs.LG 新提交 85%

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

利用能量特征进行基于深度学习的表面分类:三个独立数据集的比较分析

Alexander Belyaev, Oleg Kushnarev

专题命中 机器人学习 :移动机器人表面分类,使用能量特征

AI总结 研究评估能量特征作为表面分类的独立或辅助模态的可行性,在三个数据集上比较多种深度学习架构,发现CNN性能最优,纯能量特征准确率85-90%,与惯性特征结合可达96-99%,且能量特征可稳定提升1-2%准确率。

详情
AI中文摘要

基于能量的方法在移动机器人表面分类中仍是一个相对未被充分研究的途径,尽管在受限环境中取得了有希望的结果。本研究评估了使用能量衍生特征作为独立分类模态或作为惯性数据补充输入的可行性。在三个公开数据集上进行了全面评估,比较了现代深度学习架构(包括循环神经网络、卷积神经网络、仅编码器Transformer和Mamba状态空间模型)在自动超参数调整和输入序列长度优化下的性能。模型在所有评估数据集上均实现了比先前报道值更高的准确率,其中卷积神经网络取得了最高的整体性能。当仅依赖基于能量的特征时,模型分类准确率在85-90%范围内,比与惯性特征结合时(96-99%)低约5-10%。用能量特征增强惯性数据使平均准确率稳定提高1-2%。这些发现表明,仅依赖能量特征的分类器为独立部署提供了足够的准确性,同时在与其它感知模态结合使用时也提供了一致的增益。

英文摘要

The energy-based method remains a comparatively underexamined approach for surface classification in mobile robotics, despite promising results in constrained environments. This study evaluated the viability of using energy-derived features as either a standalone classification modality or as supplementary input to inertial data. A comprehensive evaluation was conducted across three publicly available datasets, comparing the performance of modern deep learning architectures including recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models, under automated hyperparameter tuning and input sequence length optimization. The models achieved higher accuracy than previously reported values on all evaluated datasets, with the convolutional neural network yielding the highest overall performance. When relying exclusively on energy-based features, the models attained classification accuracies in the range of 85-90%, approximately 5-10% lower than those achieved when combined with inertial features (96-99%). Augmenting inertial data with energy features resulted in a consistent mean accuracy improvement of 1-2%. These findings indicate that classifiers relying solely on energy features offer sufficient accuracy for standalone deployment, while also providing a consistent gain when used in combination with other sensing modalities.

4. 具身导航 3 篇

2606.18888 2026-06-18 cs.AI 新提交 90%

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

部分可观测环境下导航的生成模型预测规划

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

发表机构 * University of Manchester(曼彻斯特大学) Aalto University(阿尔托大学)

专题命中 具身导航 :部分可观测环境导航,结合扩散模型与MPC

AI总结 提出BeliefDiffusion框架,结合扩散模型和模型预测控制,显式建模多模态信念分布并进行前瞻规划,在合成地图环境中显著优于无模型强化学习和生成方法。

详情
AI中文摘要

部分可观测环境中的导航对自主智能体构成重大挑战,需要在未知环境中利用有限的感知信息做出有效决策。基于信念的方法,特别是那些使用神经网络近似信念空间的方法,往往无法捕捉信念空间固有的多模态性,尤其是在具有感知混淆的高维情况下。虽然生成模型提供了一种有吸引力的替代方案,但它们通常需要大量数据或专家演示,并且缺乏长期规划的显式机制。在本文中,我们介绍了BeliefDiffusion,一种结合了生成和规划优势的新框架。BeliefDiffusion利用扩散模型显式表征多模态信念分布,并利用模型预测控制(MPC)同时进行前瞻规划。它包含两个步骤:(1)基于观测历史想象合理的环境配置;(2)在聚合的配置上规划高效的导航策略。通过在合成地图环境中的大量实验,我们证明BeliefDiffusion在导航成功率和路径效率上显著优于无模型强化学习基线和其它生成方法。我们的结果验证了将多模态信念表示显式纳入规划能够在部分可观测设置中实现更鲁棒的导航。

英文摘要

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on the observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.
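The two-step structure (imagine plausible maps, then plan across them) can be sketched with a tiny grid example. This is an assumption-laden illustration, not BeliefDiffusion: a fixed list `sampled_maps` stands in for samples from a diffusion model, candidate paths are enumerated by hand instead of being optimized by MPC, and `path_cost` with its `collision_penalty` is a hypothetical cost.

```python
def path_cost(path, blocked_cells, collision_penalty=10.0):
    """Path length plus a penalty for every cell that is blocked in this map."""
    hits = sum(1 for cell in path if cell in blocked_cells)
    return len(path) + collision_penalty * hits

def plan_over_belief(candidate_paths, sampled_maps):
    """Pick the candidate with the lowest average cost over all imagined maps."""
    def expected_cost(path):
        return sum(path_cost(path, m) for m in sampled_maps) / len(sampled_maps)
    return min(candidate_paths, key=expected_cost)

# Step 1 (stand-in): three imagined maps; two of them block cell (1, 0).
sampled_maps = [{(1, 0)}, {(1, 0)}, set()]

# Step 2: choose between a short risky path and a longer safe one.
short_path = [(0, 0), (1, 0), (2, 0)]
long_path = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 0)]
best_path = plan_over_belief([short_path, long_path], sampled_maps)
```

Because the short path collides in two of the three imagined maps, its average cost exceeds that of the longer path, so the planner prefers the detour; a single-mode belief that kept only the obstacle-free map would have chosen the short path.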

2606.18426 2026-06-18 cs.RO 新提交 90%

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

VEGA: 从野外自我中心视频中通过几何轨迹监督学习导航VLA

Gershom Seneviratne, Yohan Abeysinghe, Jianyu An, Vaibhav Shende, Dinesh Manocha

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

专题命中 具身导航 :训练导航VLA模型,利用自我中心视频

AI总结 提出VEGA方法,利用未标注的自我中心视频通过重建场景几何生成障碍感知轨迹,训练流匹配VLA导航策略,在VEGA-Bench上碰撞减少33.0%,真实世界成功率提升至少150.0%。

详情
AI中文摘要

我们提出了VEGA,一种从未标注的自我中心导航视频中训练导航视觉-语言-动作(VLA)模型的方法。互联网规模的自我中心视频提供了可扩展的导航相关视觉观察来源,捕捉了杂乱场景、近距离障碍物以及通过真实世界空间的自然人体运动。然而,这些视频不能直接用于策略学习,因为它们没有提供在机器人坐标系中基于显式导航目标的障碍感知轨迹。VEGA通过从单目视频重建局部场景几何、采样导航目标(表示为文本、图像或空间路径点)并利用构建的几何生成障碍感知轨迹来解决这一差距。生成的轨迹分布随后用于训练流匹配VLA导航策略。通过仅在训练期间使用几何,VEGA将障碍感知规划直接蒸馏到基于视觉的策略中。此外,我们引入了VEGA-Bench,一个包含25万场景和约500万个导航目标(与场景几何配对)的基准,旨在评估VLA的目标进展、碰撞避免和障碍物间隙。我们的评估表明,VEGA在VEGA-Bench上实现了有竞争力的目标进展,同时相比最强基线碰撞减少33.0%,障碍物间隙提高17.9%,在真实世界试验中成功率至少提高150.0%,碰撞至少减少66.7%,障碍物间隙至少提高60.0%。最终,我们证明了视频衍生的几何监督为训练障碍感知导航VLA提供了可扩展且有效的信号。代码和基准将在发表时发布。

英文摘要

We introduce VEGA, an approach for training navigation Vision-Language-Action (VLA) models from unlabeled egocentric navigation videos. Internet-scale egocentric videos provide a scalable source of navigation-relevant visual observations, capturing cluttered scenes, close-range obstacles, and natural human motion through real-world spaces. However, these videos are not directly usable for policy learning because they do not provide obstacle-aware trajectories conditioned on explicit navigation goals in the robot's coordinate frame. VEGA addresses this gap by reconstructing local scene geometry from monocular video, sampling navigation goals (represented as text, image, or spatial waypoints), and generating obstacle-aware trajectories using the constructed geometry. The resulting trajectory distribution is then used to train a flow-matching VLA navigation policy. By using geometry exclusively during training, VEGA distills obstacle-aware planning directly into a vision-based policy. Furthermore, we introduce VEGA-Bench, a benchmark containing 250k scenes and approximately 5 million navigation goals paired with scene geometry, designed to evaluate goal progress, collision avoidance, and obstacle clearance of VLAs. Our evaluation shows that VEGA achieves competitive goal progress on VEGA-Bench while reducing collisions by 33.0% and improving obstacle clearance by 17.9% over the strongest baseline. In real-world trials, it improves success by at least 150.0%, reduces collisions by at least 66.7%, and improves obstacle clearance by at least 60.0%. Ultimately, we demonstrate that video-derived geometric supervision provides a scalable and effective signal for training obstacle-aware navigation VLAs. The code and benchmark will be released at the time of publication.

2606.18847 2026-06-18 cs.AI 新提交 85%

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

WorldLines: 对长时域有状态具身智能体进行基准测试与建模

Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

发表机构 * HKUST(GZ)(香港科技大学(广州)) HKUST(香港科技大学) Knowin

专题命中 具身导航 :长时域具身智能体基准,家庭任务规划。

AI总结 提出WorldLines基准,通过构建带时间跨度的家庭轨迹(含对话、动作、状态变化等)评估具身智能体的长时记忆与任务规划能力,并设计ObsMem记忆框架提升状态感知决策。

Comments 27 pages, 18 figures

详情
AI中文摘要

为了在真实家庭环境中长时间协助人类,具身智能体必须记住用户习惯、世界状态和过去的交互。现有的长期记忆基准主要评估以语言为中心的检索和问答,而具身基准通常关注短时域任务执行,未测试在动态环境中长期记忆的使用。我们引入WorldLines,一个项目驱动的长时域具身家庭辅助基准。它构建了带时间跨度的家庭轨迹,包含对话、动作、执行反馈、物体和设备状态变化,并将其转换为带有证据链接的样本,用于记忆问答和具身任务规划。我们进一步提出ObsMem,一个观察者锚定的记忆框架,维护可见性感知的记忆和动作原生状态轨迹,以实现状态感知的决策。实验揭示了在部分可观测性、被覆盖的世界状态以及将长期记忆转化为具身规划方面的持续挑战,而ObsMem为此场景提供了更强的参考架构。

英文摘要

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces containing dialogues, actions, execution feedback, and object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

5. 具身推理 1 篇

2602.15513 2026-06-18 cs.RO cs.AI 90%

HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering

HIMM:面向具身探索与问答的人类启发式长期记忆建模

Ji Li, Bo Wang, Jing Xia, Mingyi Li, Shiyan Hu

发表机构 * The University of Hong Kong(香港大学) Beijing Institute of Technology(北京理工大学)

专题命中 具身推理 :具身探索与问答,长期记忆建模

AI总结 本文提出HIMM模型,通过分离事件记忆与语义记忆,提升具身智能在长期观察和有限上下文下的探索与问答能力,实验显示在多个基准测试中表现优异。

Journal ref IROS 2026

详情
AI中文摘要

将多模态大语言模型作为具身代理的'大脑'仍面临挑战,特别是在长时间观测和有限上下文预算下。现有记忆辅助方法通常依赖文本摘要,丢弃了丰富的视觉和空间细节,且在非平稳环境中表现脆弱。本文提出非参数化记忆框架,明确分离事件记忆与语义记忆以支持具身探索与问答。我们的检索优先、推理辅助范式通过语义相似性召回事件经验并通过视觉推理进行验证,从而无需严格的几何对齐即可稳健地复用过去的观测。同时,我们引入程序式规则提取机制,将经验转换为结构化、可重用的语义记忆,促进跨环境泛化。大量实验表明,HIMM在具身问答和探索基准上达到最先进水平,在A-EQA上LLM-Match和LLM-Match × SPL分别提升7.3%和11.4%,在GOAT-Bench上成功率和SPL分别提升7.7%和6.8%。分析显示,事件记忆主要提升探索效率,而语义记忆增强具身代理的复杂推理能力。

英文摘要

Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match × SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.
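
The abstract reports SPL gains without defining the metric. SPL (Success weighted by Path Length) is commonly defined for embodied navigation as the mean over episodes of S_i · l_i / max(p_i, l_i), where S_i is the binary success flag, l_i the shortest-path length to the goal, and p_i the length of the path the agent actually took. The sketch below implements that common definition only. The `(success, shortest_len, agent_len)` tuple layout and the function name `spl` are assumptions for illustration, and this is not the paper's evaluation code.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_len, agent_len) tuples,
    with shortest_len > 0 assumed for every episode.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest_len, agent_len in episodes:
        if success:
            # A successful episode scores 1.0 only if the agent's path is
            # as short as the shortest path; longer paths score less.
            total += shortest_len / max(agent_len, shortest_len)
        # Failed episodes contribute 0 regardless of path length.
    return total / len(episodes)


# One optimal success, one success with a path twice as long, one failure.
score = spl([(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)])
```

Here the three episodes contribute 1.0, 0.5, and 0.0, so `score` is 0.5. The metric therefore rewards efficiency as well as success.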