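arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

Vision and Robotics

Robotics / Embodied AI

Robotics, embodied AI, robot learning, manipulation, navigation, and embodied world models.

Indexed for today / current date: 79. Sources: cs.RO, cs.AI, cs.CV, cs.LG

1. Robot Manipulation 13 papers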

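2606.19397 2026-06-19 cs.RO New submission 95%

DiffusionVS: A Generative Framework for Robust Visual Servoing Based on Diffusion Policy

Hongkang Cui, Rui He, Haoyao Chen

Topic match (Robot Manipulation): Proposes a visual servoing method based on Diffusion Policy for robotic manipulation and navigation.

AI summary: Proposes a visual servoing method based on Diffusion Policy that generates camera velocity through conditional denoising and uses online training to improve generalization, reaching a success rate of nearly 100% in simulation and 93% in physical experiments.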

Comments 8 pages, 4 figures, 7 tables

Abstract

Visual servoing is a fundamental technique in robotic manipulation and navigation. Regression-based visual servoing frequently experiences trajectory jitter as a result of noise-sensitive single-step mappings and the accumulation of errors during distribution shifts. In contrast, Diffusion Policy maintains temporal consistency by predicting action sequences and improves robustness through implicit data augmentation. This paper presents a novel diffusion-based servoing method. Based on Diffusion Policy, the proposed approach uses normalized image coordinates of observed tag corners as input and generates camera velocity through conditional denoising. To overcome the generalization limitations of models trained on static datasets, an online training paradigm is adopted, continuously expanding the diversity of training data through interactive experience collection. This strategy substantially enhances both the performance and generalization capability of the model. Comprehensive simulations and real-world experiments demonstrate the effectiveness of the proposed method, achieving success rates of nearly 100% in simulation and 93% in physical experiments. Beyond the specific pipeline, we further validate the generality of the diffusion mechanism. Experiments show that existing visual servoing networks consistently achieve improved performance when integrated with our diffusion-based module. These results indicate that the proposed strategy possesses broad applicability and can enhance various visual servoing systems beyond the specific architecture presented here.
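The input encoding mentioned in the abstract above can be sketched roughly as follows. This is a minimal illustration, assuming a pinhole camera and the common definition of normalized image coordinates, x = (u - cx) / fx and y = (v - cy) / fy. The intrinsics, corner ordering, and function names are hypothetical, the abstract does not state the paper's exact normalization, and the denoising network is not shown.

```python
def normalize_corners(corners_px, fx, fy, cx, cy):
    """Convert pixel corners (u, v) to normalized image coordinates (x, y)."""
    return [((u - cx) / fx, (v - cy) / fy) for u, v in corners_px]


def conditioning_vector(corners_px, fx, fy, cx, cy):
    """Flatten the normalized corners into one vector [x1, y1, x2, y2, ...]."""
    flat = []
    for x, y in normalize_corners(corners_px, fx, fy, cx, cy):
        flat.extend((x, y))
    return flat


# Hypothetical intrinsics and tag-corner detections, for illustration only.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0
corners = [(320.0, 240.0), (380.0, 240.0), (380.0, 300.0), (320.0, 300.0)]
condition = conditioning_vector(corners, FX, FY, CX, CY)
```

A vector like `condition` (eight values for four corners) would be what a conditional denoiser receives at every denoising step in a setup of this kind.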

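2606.17054 2026-06-19 cs.RO cs.AI cs.CV cs.LG New submission 95%

Human Universal Grasping

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

Affiliations * New York University; Tsinghua University; University of Michigan

Topic match (Robot Manipulation): Proposes the HUG model for zero-shot robotic grasping.

AI summary: Proposes HUG, which uses human grasp data (the 1M-HUGs dataset) and flow matching to generate diverse grasp poses from a single RGB-D image and retargets them to robot hands for zero-shot grasping, outperforming baselines by 23% and 34% on the HUG-Bench test objects.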

Comments 28 pages, 20 figures, 7 tables

Abstract

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/
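For readers unfamiliar with flow matching, here is a toy sketch of the general mechanism, assuming the common straight-line probability path. It is not HUG's network or training code: `training_pair` builds the regression target a velocity model would be trained on, and `euler_sample` integrates a velocity field from noise (t = 0) to a sample (t = 1). The 3-D vector merely stands in for a grasp parameter vector, and every name and value below is hypothetical.

```python
import random


def training_pair(x0, x1, t):
    """Linear-path flow matching: the point x_t on the line from noise x0 to
    data x1, and the target velocity u = x1 - x0 a model would regress."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]
    return x_t, u


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Stand-in for a trained network: the exact velocity field that carries any
# starting point to one fixed target along a straight line.
TARGET = [0.10, -0.05, 0.30]


def toy_velocity(x, t):
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
sample = euler_sample(toy_velocity, noise, steps=10)
```

With a learned, observation-conditioned velocity field in place of `toy_velocity`, different noise draws would yield different samples, which is how a model of this family can produce diverse outputs for the same input.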

2606.20562 2026-06-19 cs.RO New submission 90%

MemoryWAM: Efficient World Action Modeling with Persistent Memory

Sizhe Yang, Juncheng Mu, Tianming Wei, Chenhao Lu, Xiaofan Li, Linning Xu, Zhengrong Xue, Zhecheng Yuan, Dahua Lin, Jiangmiao Pang, Huazhe Xu

Affiliations * The Chinese University of Hong Kong; Tsinghua University; Zhejiang University

Topic match (Robot Manipulation): World action modeling and memory in robotic manipulation.

AI summary: Proposes MemoryWAM, which uses a hybrid memory design and a tailored attention mechanism to make memory-dependent decisions efficiently in long-horizon robotic manipulation tasks, outperforming existing VLA and WAM baselines.

Abstract

Robust robotic manipulation in the real world requires not only an understanding of the current observation, but also memory and dynamics modeling. World action models (WAMs) possess these capabilities by jointly modeling visual foresight and actions conditioned on both current and historical observations, making them a promising paradigm for robotic manipulation. However, existing WAMs face a fundamental trade-off: methods with efficient inference typically condition only on a bounded window of recent observations and therefore struggle in non-Markovian environments, whereas methods that preserve long histories incur time and space costs that grow substantially with sequence length. To address this challenge, we introduce MemoryWAM, a world action model with efficient persistent memory. MemoryWAM uses a hybrid memory design that combines recent frames, event-boundary anchor frames, and compact gist tokens that summarize long-range history. A tailored attention mechanism enables retrieval of both detailed short-term context and compressed long-term context, supporting memory-dependent decision-making with reduced inference latency and GPU memory usage. Across long-horizon, memory-dependent manipulation tasks in both simulation and the real world, MemoryWAM outperforms strong vision-language-action (VLA) and WAM baselines while maintaining favorable computational efficiency.

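2606.20193 2026-06-19 cs.RO New submission 90%

Belt-Finger: An Affordable Soft Belt-Driven Gripper for Dexterous In-Hand Manipulation

Boya Zhang, Andreas Zell, Georg Martius

Affiliations * University of Tübingen; Max Planck Institute for Intelligent Systems

Topic match (Robot Manipulation): A soft belt-driven gripper for dexterous in-hand manipulation.

AI summary: Proposes a double-soft-belt finger module that adds three in-hand degrees of freedom (translation, pitch, and roll) to a parallel gripper, improving dexterity while remaining low-cost and easy to integrate; its effectiveness is validated with MPC and teleoperation.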

Abstract

Parallel-jaw grippers are the default manipulator choice in robotics because they are simple, robust, and inexpensive. Their limited in-hand mobility, however, often forces large arm motions and restricts dexterous manipulation in confined workspaces. We present a parallel-gripper upgrade: a double-soft-belt-based finger module that preserves standard opening/closing while adding three in-hand degrees of freedom (DoF): translation, pitch, and roll. The mechanism is deliberately kept simple and engineered for inexpensive manufacturing and straightforward integration, preserving the reliability and precise control of traditional parallel grippers while greatly broadening the range of manipulation capabilities. To demonstrate the utility of the added DoFs, we integrate the gripper in two control pipelines. First, we adapt a model predictive controller for in-hand manipulation of known objects. Second, we introduce a lightweight teleoperation interface that enables simultaneous control of the robot arm and gripper (10 DoFs total) with minimal hardware. Across a suite of challenging manipulation tasks executed via teleoperation, MPC, and trained policies, the proposed gripper consistently improves dexterity and task feasibility compared to a conventional parallel gripper.

2606.20135 2026-06-19 cs.RO cs.AI New submission 90%

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

Jianing Guo, Fangzheng Chen, Zihao Mao, Wong Lik Hang Kenny, Zhenhong Wu, Yu Li, Yishuai Cai, Yuanpei Chen, Yikun Ban, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Simin Li

Affiliations * Beihang University; Peking University; The Chinese University of Hong Kong; PKU-Psibot Lab; Zhongguancun Laboratory; Hefei Comprehensive National Science Center

Topic match (Robot Manipulation): Frequency-aware flow matching for robot action generation.

AI summary: Proposes Frequency-Aware Flow Matching (FAFM), which transforms discrete action sequences into the frequency domain with the discrete cosine transform, performs flow matching on the coefficients, and regularizes the first-order temporal derivative to produce smooth, continuous actions, improving success rate, multimodal expressivity, and motion smoothness.

Abstract

Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. FAFM is simple, introduces no additional network parameters, and applies to both standalone flow-matching policies and vision-language-action models. Across a synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, and robustness to mechanical bias and mixed-frequency input. These gains persist when FAFM is deployed on a real-world Franka robot. Code is available at https://anonymous.4open.science/r/FAFM.
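The frequency-domain representation described in the abstract above can be illustrated with a small sketch. It assumes an orthonormal DCT-II and places sample i of an n-step chunk at time t = (i + 0.5) / n; the paper's actual transform convention and regularizer may differ, the flow-matching step is omitted, and all function names are hypothetical. `smoothness_penalty` is the closed-form integral over [0, 1] of the squared first derivative of this particular expansion, shown only as one example of a Sobolev-type penalty.

```python
import math


def _scale(k, n):
    """Orthonormal DCT-II scaling factor for coefficient k of an n-point transform."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_coeffs(actions):
    """Orthonormal DCT-II of a 1-D action sequence."""
    n = len(actions)
    return [
        _scale(k, n) * sum(a * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, a in enumerate(actions))
        for k in range(n)
    ]


def eval_continuous(coeffs, t):
    """Evaluate the cosine-basis expansion at any continuous time t in [0, 1]."""
    n = len(coeffs)
    return sum(_scale(k, n) * c * math.cos(math.pi * k * t)
               for k, c in enumerate(coeffs))


def smoothness_penalty(coeffs):
    """Integral of (d/dt expansion)^2 over [0, 1]; weights coefficient k by k^2."""
    n = len(coeffs)
    return (math.pi ** 2 / n) * sum((k ** 2) * c * c for k, c in enumerate(coeffs))


chunk = [0.0, 0.5, 1.0, 0.5]          # hypothetical 4-step action chunk
coeffs = dct_coeffs(chunk)
midpoint = eval_continuous(coeffs, 0.5)  # action between the 2nd and 3rd samples
```

Because the expansion is a function of continuous time, the same coefficients can be resampled at any control rate, which is one way to read the abstract's claim about mixed-frequency data. The k² weighting shows why penalizing the first derivative suppresses high-frequency components.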

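2606.20118 2026-06-19 cs.RO cs.LG New submission 90%

Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation

Jonghoon Lee, Seong Hyeon Park, Byungwoo Jeon, Minha Lee, Jinwoo Shin

Affiliations * KAIST; Korea University; RLWRLD

Topic match (Robot Manipulation): A data augmentation framework that improves the generalization of VLA policies.

AI summary: Proposes Pose6DAug, a failure-driven data augmentation framework that swaps the manipulated object in successful episodes using 3D meshes and 6D pose trajectories, producing multi-view-consistent, physically plausible demonstrations without additional data collection and improving VLA success rates on novel objects by 16.5% relative to the state-of-the-art baseline.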

Abstract

Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy's own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successful episode already encodes a physically valid action trajectory together with calibrated multi-view observations. By swapping only the manipulated object while preserving this trajectory, we obtain new and physically grounded demonstrations. However, naive 2D video editing breaks multi-view consistency and physical plausibility, particularly under heavy occlusion and egocentric viewpoints. Our method instead operates directly in 3D, anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent renderings across all camera views. Fine-tuning a VLA on data augmented by our method improves success rates by 16.5% relative to the state-of-the-art baseline on novel objects, while preserving in-distribution performance. These results show that multi-view and physically consistent augmentation is a practical path to scalable VLA generalization.

2606.19980 2026-06-19 cs.AI New submission 90%

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Wenli Xiao, Jia Xie, Tonghe Zhang, Haotian Lin, Letian "Max" Fu, Haoru Xue, Jalen Lu, Yi Yang, Cunxi Dai, Zi Wang, Jimmy Wu, Guanzhi Wang, S. Shankar Sastry, Ken Goldberg, Linxi "Jim" Fan, Yuke Zhu, Guanya Shi

Affiliations * NVIDIA; CMU; UC Berkeley

Topic match (Robot Manipulation): Proposes the ENPIRE framework for robot policy self-improvement.

AI summary: Proposes ENPIRE, a framework whose closed feedback loop of environment reset, policy execution, outcome verification, and iterative refinement lets coding agents autonomously improve robot manipulation policies, reaching a 99% success rate on dexterous manipulation tasks.

Abstract

Achieving dexterous robotic manipulation in the real world heavily relies on human supervision and algorithm engineering, which becomes a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction to automate robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with one or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, and improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world manipulation learning into a controllable optimization procedure, minimizing human effort while allowing fair ablations across training recipes and agent variants. Powered by ENPIRE, frontier coding agents can autonomously train a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks, such as organizing a pin box, fastening a zip tie, and tool use, a process that further accelerates when we dispatch an agent team on a robot fleet. Our results suggest a practical and scalable path toward deploying coding agents to autonomously advance robotics in the physical world.

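2606.19897 2026-06-19 cs.RO New submission 90%

One-to-Two Acting: A Novel Framework for Single-arm Agent Action Expansion to Dual Arms

Youbin Yao, Nieqin Cao, Mingyan Li, Yan Ding, Fuqiang Gu, Chao Chen

Affiliations * Chongqing University; Xi’an Jiaotong-Liverpool University; Lumos Robotics

Topic match (Robot Manipulation): A dual-arm manipulation framework learned from single-arm supervision.

AI summary: Proposes ExS2D, a hierarchical action expansion framework that achieves dual-arm manipulation from single-arm supervision through temporal precedence extraction, subtask-guided action mapping, and collision-avoiding coordinated planning; in simulation it reduces execution steps by 54.4% while maintaining a comparable success rate.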

Comments 6 pages, 5 figures, 3 tables

Abstract

Dual-arm manipulation can improve throughput via parallel execution, but collecting bimanual demonstrations for training is costly and difficult. We present ExS2D, a hierarchical action expansion framework that enables dual-arm manipulation from single-arm supervision. ExS2D first generates structured subtasks from textual instructions while explicitly capturing temporal precedence. It then grounds each subtask into executable actions through subtask-guided action mapping in observation. Finally, a coordinator driven by a multimodal large language model performs precedence-aware action allocation and synchronized planning to select collision-free dual-arm executions. Simulation experiments demonstrate that ExS2D reduces the average number of execution steps by 54.4% while maintaining a success rate comparable to that of a single-arm baseline. Real-robot experiments on four tasks further demonstrate the reliability of ExS2D for dual-arm execution under few-shot single-arm samples, while using zero bimanual demonstrations.
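To make the idea of precedence-aware allocation concrete, here is a simplified sketch. It is a plain greedy scheduler over a precedence graph, not the coordinator from the paper (which is driven by a multimodal large language model), and it performs no collision checking. The subtask names and the `schedule_two_arms` function are hypothetical.

```python
def schedule_two_arms(precedence):
    """Greedy two-arm schedule.

    precedence maps each subtask to the set of subtasks that must finish first.
    Returns a list of steps; each step holds one or two subtasks run in parallel.
    """
    remaining = {task: set(deps) for task, deps in precedence.items()}
    done = set()
    steps = []
    while remaining:
        # A subtask is ready once all of its prerequisites are done.
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("unsatisfiable precedence (cycle or unknown prerequisite)")
        batch = tuple(ready[:2])  # at most one subtask per arm
        steps.append(batch)
        for task in batch:
            del remaining[task]
        done.update(batch)
    return steps


# Hypothetical subtasks, for illustration only.
PRECEDENCE = {
    "open_drawer": set(),
    "pick_cup": set(),
    "place_cup_in_drawer": {"open_drawer", "pick_cup"},
    "close_drawer": {"place_cup_in_drawer"},
}
steps = schedule_two_arms(PRECEDENCE)
```

In this toy case the two independent subtasks share the first step, so four subtasks finish in three steps instead of four. Any saving of this kind depends entirely on how much of the task graph can run in parallel.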

2606.19358 2026-06-19 cs.RO New submission 90%

WorkBenchMark: A LEGO-Based Assembly Benchmark with an Assembly-by-Disassembly Baseline for the Smart Manufacturing League

Wenbo Ma, Daniel Swoboda, Matteo Tschesche, Till Hofmann

Affiliations * Chair of Machine Learning and Reasoning (i6), RWTH Aachen University; MASCOR Institute, FH Aachen University of Applied Science

Topic match (Robot Manipulation): A LEGO-based robotic assembly benchmark.

AI summary: Proposes a LEGO Duplo-based robotic assembly benchmark with 400 tasks across four complexity tiers, along with a planning-based baseline that outperforms a modern vision-language-action approach on all tiers.

Comments RoboCup Symposium 2026 accepted paper

Abstract

We introduce WorkBenchMark, a LEGO Duplo-based robotic assembly benchmark motivated by the RoboCup Smart Manufacturing League. Robotic assembly couples low-level manipulation with task-level symbolic reasoning under physical constraints, a combination that current end-to-end learning methods do not yet solve reliably. The benchmark provides 400 tasks across four complexity tiers. We provide a baseline solution that combines open-vocabulary perception with Assembly-by-Disassembly. Our planning-based pipeline outperforms a modern vision-language-action approach across all tiers. The benchmark, simulation environment, and baseline implementation will be released openly to support the broader robotic assembly community.

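2606.15516 2026-06-19 cs.RO New submission 90%

Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands

Soofiyan Atar, Yao-Ting Huang, Michael Yip

Affiliations * University of California San Diego

Topic match (Robot Manipulation): Compliant grasping across dexterous hands, which falls under robot manipulation.

AI summary: Proposes a cross-embodiment force-position interface that enables contact-aware grasping across heterogeneous dexterous hands through calibrated joint torques and fingertip forces, combined with a flow-matching visuomotor policy and a hybrid force-position controller, to achieve transferable compliant grasping.

Comments Website (overview): transferring-contact-not-just-motion.github.io/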

Abstract

Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires fingers to maintain appropriate object loading as contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, limiting transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent, while each hand's effort signal is calibrated through system identification into physical joint torque in N·m. These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent across training and deployment. Experiments across structurally different hands show that calibrated contact feedback enables transferable compliant grasping, with learned primitives reusable in long-horizon manipulation pipelines.
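One standard way to map joint torques to a fingertip force is the static relation τ = Jᵀf, where J is the fingertip Jacobian. The sketch below applies it to a hypothetical planar two-link finger. The abstract does not say which mapping the paper uses, so the kinematic model, link lengths, and function names here are all assumptions.

```python
import math


def planar_jacobian(q1, q2, l1, l2):
    """Fingertip position Jacobian of a planar two-link finger (angles in rad)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def joint_torques(jac, force):
    """Static relation tau = J^T f (torques in N·m for forces in N, lengths in m)."""
    (a, b), (c, d) = jac
    fx, fy = force
    return (a * fx + c * fy, b * fx + d * fy)


def fingertip_force(jac, tau):
    """Solve tau = J^T f for f in the 2x2 case, away from singular poses."""
    (a, b), (c, d) = jac
    det = a * d - b * c
    if abs(det) < 1e-9:
        raise ValueError("finger is near a singular configuration")
    t1, t2 = tau
    # Inverse of J^T = [[a, c], [b, d]] is (1 / det) * [[d, -c], [-b, a]].
    return ((d * t1 - c * t2) / det, (-b * t1 + a * t2) / det)


# Hypothetical finger pose and link lengths (metres), for illustration only.
jac = planar_jacobian(0.3, 0.8, 0.05, 0.04)
tau = joint_torques(jac, (0.5, -1.0))
fx, fy = fingertip_force(jac, tau)
load = math.hypot(fx, fy)  # a simple scalar per-finger load descriptor
```

Expressing contact as a fingertip force in newtons, rather than as raw motor effort, is what would make the observation comparable across hands with different actuators.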

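2606.20426 2026-06-19 cs.RO New submission 85%

TaCauchy: An Extensible FEM Framework for Vision-Based Tactile Simulation

Hengfei Zhao, Yifan Xie, Junhao Gong, Yue Sun, Kai Zhu, Weihua He, Shoujie Li, Haohuan Fu, Wenbo Ding

Affiliations * Shenzhen International Graduate School, Tsinghua University; Huawei Inc.

Topic match (Robot Manipulation): A tactile simulation framework for force computation in robot manipulation.

AI summary: Proposes TaCauchy, a framework that integrates the finite element method into Isaac Sim on top of the UIPC solver, computing Cauchy stress tensors directly and projecting them onto contact surfaces to obtain contact forces for high-fidelity tactile simulation; it supports multiple sensors and reaches SSIM above 0.93 in physical validation.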

Comments Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2026

Abstract

Vision-based tactile sensors require high-fidelity simulation for reinforcement learning, yet existing approaches struggle to provide accurate mechanical stress fields within GPU-accelerated robotics platforms. We present TaCauchy, an extensible Finite Element Method (FEM) framework that integrates rigorous physics-based force computation into Isaac Sim. Built on the Unified Incremental Potential Contact (UIPC) solver, TaCauchy directly computes Cauchy stress tensors from hyperelastic constitutive laws and projects them onto contact surfaces to obtain traction forces and pressure distributions, providing mechanical ground truth from first principles rather than empirical estimation. Our framework features automatic mesh generation with geometry-aware adaptive refinement and a modular sensor interface enabling rapid integration of diverse sensors (GelSight Mini, DIGIT, 9DTact) with minimal configuration. Performance benchmarks demonstrate 33.40 FPS for single environments and 555 FPS aggregate throughput across 60 parallel environments, with stress extraction overhead under 1 ms. Physical validation experiments show strong agreement between simulated and real tactile responses across force ranges from 1.2556 N to 4.7332 N, achieving SSIM above 0.93, confirming the framework's capability to provide accurate, physically-grounded force supervision for downstream robotic manipulation tasks.
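As a small worked example of the stress-to-traction step described above, the sketch below evaluates the Cauchy stress of a compressible Neo-Hookean solid, σ = (μ/J)(FFᵀ − I) + (λ ln J / J) I, and projects it onto a surface normal as t = σn. The Neo-Hookean model is an assumption chosen for illustration; the abstract says only "hyperelastic constitutive laws". The material constants and function names are hypothetical, and nothing here uses the UIPC or Isaac Sim APIs.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def cauchy_stress_neo_hookean(F, mu, lam):
    """Cauchy stress for deformation gradient F with Lame-type constants mu, lam."""
    J = det3(F)
    if J <= 0.0:
        raise ValueError("deformation gradient must have a positive determinant")
    sigma = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            b_ij = sum(F[i][k] * F[j][k] for k in range(3))  # (F F^T)_ij
            identity = 1.0 if i == j else 0.0
            sigma[i][j] = (mu / J) * (b_ij - identity) + (lam * math.log(J) / J) * identity
    return sigma


def traction(sigma, normal):
    """Traction vector t = sigma n on a surface with unit normal n."""
    return [sum(sigma[i][j] * normal[j] for j in range(3)) for i in range(3)]


def normal_pressure(sigma, normal):
    """Compressive pressure along n: p = -n . (sigma n)."""
    return -sum(t_i * n_i for t_i, n_i in zip(traction(sigma, normal), normal))


# Hypothetical 10% compression along z with made-up material constants (Pa).
F_COMPRESSED = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.9]]
sigma = cauchy_stress_neo_hookean(F_COMPRESSED, mu=1.0e5, lam=4.0e5)
pressure = normal_pressure(sigma, [0.0, 0.0, 1.0])
```

In a finite element setting this evaluation would be repeated per element, with F taken from the solver's deformed mesh. Integrating the tractions over the contact surface then gives a net contact force.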

2606.20285 2026-06-19 cs.RO New submission 85%

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

Yandong Wang, Jiaqian Yu, Xiongfeng Peng, Lu Xu, Yamin Mao, Weiming Li, Jaewook Yoo, Dongwook Lee, Daehyun Ji, Mingbo Zhao, Chao Zhang

Affiliations * Donghua University; Samsung R&D Institute China-Beijing (SRCB); Samsung AI Center, DS Division

Topic match (Robot Manipulation): Focuses on dual-arm robotic manipulation tasks.

AI summary: To address the limits of implicit coordination in tightly coupled dual-arm tasks, proposes Co-VLA, a framework that introduces explicit coordination priors through a Structured Action Expert and a Latent-Aware Controller, significantly improving success rate and efficiency in simulation and real-world settings.

Abstract

Vision-language-action (VLA) models show strong capabilities in single- and dual-arm robotic manipulation. Prior work shows that coordinated bimanual behaviors can emerge from end-to-end learning that leverages large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework introducing explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.

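2606.20120 2026-06-19 cs.RO cs.AI New submission 85%

Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform

Hyeonna Choi, Jung Yup Kim, Hyuneui Lim, Seunggyu Jeon

Affiliations * Department of Bionic Machinery, Research Institute of AI Robot, Korea Institute of Machinery & Materials

Topic match (Robot Manipulation): A dual-agent framework that translates natural-language protocols for a robotic platform.

AI summary: Proposes a dual-agent framework in which a parser formalizes the protocol, a rule-based mapping engine generates control commands, and a heterogeneous LLM validator catches and corrects errors, converting natural-language microplate protocols into executable commands for a robotic laboratory platform; end-to-end autonomous execution is demonstrated.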

Abstract

Biological experiment protocols are written in natural language, whereas automation systems rely on predefined control commands, creating a semantic gap that limits autonomous execution. Microplate-based automatic experiments are particularly challenging due to the need to simultaneously control well mapping, sample-reagent combinations, replicate placement, and parallel dispensing. This study proposes an agent-based protocol translation framework that converts natural-language microplate-based protocols into executable control commands for a robotic laboratory platform. A Parser Agent formalizes the natural-language protocol into a structured representation, and a rule-based mapping engine deterministically incorporates the operational constraints of the robotic laboratory platform to generate device-level control commands. A heterogeneous LLM Validation Agent verifies completeness, parameter accuracy, and execution order, and triggers a self-correction loop with structured feedback when errors are detected. A sweep involving 7 Parsers and 3 Validators on randomly selected ELISA protocols evaluates how model scale and Validator type affect translation accuracy and pass rates under cross-model verification. The accuracy-latency trade-off is further verified by comparing the rule-based mapping of the proposed framework with LLM end-to-end direct mapping. Finally, Bradford assay-based protein quantification using a microplate was demonstrated on a robotic laboratory platform, validating end-to-end autonomous execution from natural-language protocols to real-world experiments. The proposed framework provides a flexible approach to narrowing the semantic gap between natural-language protocols and microplate-based self-driving laboratories.
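A deterministic, rule-based mapping step of the kind the abstract describes might look roughly like the sketch below, which assigns samples and their replicates to wells and then checks the result. It assumes a 96-well plate (8 rows by 12 columns) filled column by column; the abstract does not state the plate format or fill order. The simple checks are a stand-in for the paper's LLM Validation Agent, and all names are hypothetical.

```python
ROWS = "ABCDEFGH"  # assumed 96-well plate: 8 rows x 12 columns
N_COLS = 12


def well_name(index):
    """Column-major index (0..95) to a well name such as 'A1'."""
    if not 0 <= index < len(ROWS) * N_COLS:
        raise ValueError("index is outside the plate")
    col, row = divmod(index, len(ROWS))
    return f"{ROWS[row]}{col + 1}"


def place_replicates(samples, replicates):
    """Assign each sample `replicates` consecutive wells, filling column by column."""
    layout = {}
    index = 0
    for sample in samples:
        layout[sample] = [well_name(index + r) for r in range(replicates)]
        index += replicates
    return layout


def validate_layout(layout, replicates):
    """Return a list of rule violations (an empty list means the layout passes)."""
    errors = []
    used = [well for wells in layout.values() for well in wells]
    if len(used) != len(set(used)):
        errors.append("a well is assigned more than once")
    for sample, wells in layout.items():
        if len(wells) != replicates:
            errors.append(f"{sample}: expected {replicates} replicates")
    return errors


# Hypothetical samples, for illustration only.
layout = place_replicates(["std1", "std2", "sampleA"], replicates=3)
errors = validate_layout(layout, replicates=3)
```

Keeping this step deterministic means the same structured protocol always yields the same well assignments, so a separate validator only has to check the parsed intent and the resulting commands.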

2. Robot Learning 12 papers

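2606.19357 2026-06-19 cs.RO cs.AI New submission 95%

Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots

Khurram Javed, Joseph Modayil, Gloria Kennickell, Richard S. Sutton, John Carmack

Affiliations * Keen Technologies; University of Alberta, Canada; Openmind Research Institute

Topic match (Robot Learning): A real-time reinforcement learning platform for robots that validates learning in the physical world.

AI summary: Proposes Physical Atari, a platform in which a robot actuates an Atari controller while game frames are rendered in real time, enabling reinforcement learning research in the physical world; it validates that algorithms can learn directly on a robot and shows that distribution shift significantly degrades policy performance.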

Comments To appear at RLC 2026

英文摘要

We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroller and the Atari Devbox, together with an off-the-shelf camera and a desktop computer, constitute a system that can be used to study reinforcement learning algorithms in the physical world. We call the full system Physical Atari. In this paper, we detail the key decisions that make Physical Atari a robust and accessible platform. To make the system robust, we designed the Robotroller so that all movement is done through bearings, which reduces wear. Additionally, we wrote software that monitors the state of the servos at a high frequency and intervenes to limit stress. To make the system accessible, we used affordable off-the-shelf components and parts that can be manufactured using consumer 3D printers. Physical Atari can be built for under $1,000 and has been used for weeks of non-stop reinforcement learning experiments without any mechanical failures. We used it to validate that reinforcement learning algorithms can learn directly on robots and show that even small distribution shifts between learning and deployment can significantly degrade the performance of policies. Our results underscore the importance of on-device adaptation for strong performance on robots.

2606.19729 2026-06-19 cs.RO cs.AI 新提交 90%

VOiLA: Vectorized Online Planning with Learned Diffusion Model for POMDP Agents

VOiLA: 基于学习扩散模型的向量化在线规划用于POMDP智能体

Marcus Hoerger, Rishikesh Joshi, Rahul Shome, Ian Manchester, Hanna Kurniawati

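Affiliations: Australian National University; The University of Sydney

Topic match (Robot Learning): Proposes an online POMDP planning framework for robot planning.

AI summary: Proposes the VOiLA framework, which learns POMDP models with conditional diffusion models, accelerates sampling through distillation, and integrates with a vectorized online planner, achieving efficient online planning on three benchmark tasks and a physical robot.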

Comments Submitted to the 2026 International Symposium of Robotics Research (ISRR)

详情
AI中文摘要

不确定性下的规划是自主机器人的关键能力。部分可观测马尔可夫决策过程(POMDP)为此提供了强大框架。尽管基于POMDP的规划已取得显著进展,但其在现实问题中的应用常受限于难以获得准确的POMDP模型。我们提出VOiLA(Vectorized Online planning wIth Learned diffusion model for POMDP Agents),一个学习任务无关POMDP模型以实现在不确定性下在线规划的框架。VOiLA使用条件扩散模型学习转移和观测采样器,并学习用于基于粒子的信念更新的观测似然模型。为实现高效在线规划,扩散采样器被蒸馏为紧凑的前馈生成器,并与VOPP(一种利用GPU并行化的在线POMDP规划器)集成。实验结果表明,蒸馏策略将采样成本降低了近三个数量级,使学习到的生成式POMDP模型对在线规划实用。在三个基准问题上的评估表明,VOiLA在使用不到10%训练数据的情况下,性能达到或优于递归软演员-评论家算法,并且对未见环境配置的泛化能力更强。实物机器人评估表明,VOiLA仅使用模拟数据学习模型,并在10次运行中全部成功完成任务。

英文摘要

Planning under uncertainty is an essential capability for autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such a capability. Although POMDP-based planning has advanced significantly, its application to real-world problems is often limited by the difficulty of obtaining faithful POMDP models. We present Vectorized Online planning wIth Learned diffusion model for POMDP Agents (VOiLA), a framework that learns task-agnostic POMDP models for online planning under uncertainty. VOiLA learns transition and observation samplers using conditional diffusion models and learns observation-likelihood models for particle-based belief updates. To enable efficient online planning, the diffusion samplers are distilled into compact feedforward generators and integrated with Vectorized Online POMDP Planner (VOPP), an online POMDP planner designed to leverage GPU parallelization. Experimental results indicate that the distillation strategy reduces sampling cost by up to nearly three orders of magnitude, making learned generative POMDP models practical for online planning. Evaluation of VOiLA on three benchmark problems indicates that VOiLA achieves equal or better performance than Recurrent Soft Actor Critic while using less than 10% of the training data, and generalizes much better to unseen environment configurations. Physical robot evaluation indicates that VOiLA, using models learned only from simulated data, generates a policy that successfully accomplishes the task in 10 of 10 runs.
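The particle-based belief update mentioned in the abstract can be sketched as one sampling-importance-resampling step. This is a generic particle filter, not VOiLA's code: the hand-written `noisy_step` and `gaussian_likelihood` merely stand in for the learned transition sampler and observation-likelihood model, and all names and noise levels are assumptions.

```python
import math
import random


def gaussian_likelihood(observation: float, state: float, sigma: float = 0.5) -> float:
    """Stand-in observation model p(o | s); hand-written here, learned in the paper."""
    z = (observation - state) / sigma
    return math.exp(-0.5 * z * z)


def noisy_step(state: float, action: float, rng: random.Random) -> float:
    """Toy 1-D transition sampler: move by `action` plus small Gaussian noise."""
    return state + action + rng.gauss(0.0, 0.1)


def belief_update(particles, action, observation, transition, likelihood, rng):
    """One sampling-importance-resampling step over a list of state particles."""
    # 1. Propagate every particle through the transition sampler.
    predicted = [transition(s, action, rng) for s in particles]
    # 2. Weight each propagated particle by how well it explains the observation.
    weights = [likelihood(observation, s) for s in predicted]
    if sum(weights) == 0.0:
        # No particle explains the observation; keep the unweighted prediction.
        return predicted
    # 3. Resample in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))
```

Each of the three steps is independent across particles, which is what makes this kind of update a natural fit for batched, GPU-parallel execution.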

2606.19728 2026-06-19 cs.RO cs.AI 新提交 90%

Bidirectional Tutoring for Developmental Motor Learning in Robots: Co-Developed Interaction Dynamics Support Stable Learning

机器人发展性运动学习的双向辅导:共同发展的交互动力学支持稳定学习

Rui Fukushima, Jun Tani

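Affiliations: Okinawa Institute of Science and Technology Graduate University

Topic match (Robot Learning): Proposes a bidirectional tutoring framework for robot motor-skill learning.

AI summary: Proposes a bidirectional tutoring framework in which a human or AI tutor and the robot adapt to each other dynamically. A free-energy-principle-based neural network enables stable sequence learning, and object manipulation tasks confirm behavioral coherence and generalization.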

Comments 16 pages, 14 figures

详情
AI中文摘要

众所周知,婴儿通过与照顾者的密集互动来发展运动技能。尽管这种社会互动对人类发展至关重要,但机器人的运动技能学习通常被视为单向过程,机器人被动接受导师的演示。这忽视了社会互动的一个关键特性:它本质上是双向的,导师和学习者相互动态适应。在这种互动中,机器人的过往经验可能作为先验约束,塑造共同发展轨迹的动态。我们假设双向辅导允许这些约束引导形成一致的行为模式,从而保持行为一致性并支持泛化,而单向互动缺乏此类约束,导致更广泛、更不一致的行为模式。为检验这一假设,我们使用实体人形机器人进行了两个物体操作实验:一个涉及人机互动,另一个采用AI导师通过自适应干预机制与真实机器人互动,以检验在更受控条件下是否会出现类似效果。我们使用基于自由能原理的神经网络并扩展生成回放来实现发展性学习框架,该框架支持从单个辅导情节中进行稳定的逐序列学习。在两种设置中,双向辅导促进了行为一致性和阶段性泛化,同时机器人逐渐需要更少的导师指导。这些结果表明,双向辅导作为一种具身和社会化方法,为机器人的发展性运动学习提供了有效支架。

英文摘要

Infants are well known to develop their motor skills through dense interaction with caregivers. Although such social interaction is crucial for human development, motor-skill learning in robots is often treated as a unidirectional process in which robots passively receive demonstrations from tutors. This overlooks a key property of social interaction: it is inherently bidirectional, with tutor and learner dynamically adapting to each other. In such interactions, the robot's past experiences may function as prior constraints that shape the dynamics of their co-developed trajectories. We hypothesize that bidirectional tutoring allows such constraints to guide the formation of consistent behavioral patterns that preserve behavioral coherence and support generalization, whereas unidirectional interaction lacks such constraints and leads to broader, less consistent behavioral patterns. To examine this hypothesis, we conducted two experiments with a physical humanoid robot performing an object manipulation task: one involving human-robot interaction and another employing an AI tutor interacting with the real robot through an adaptive intervention mechanism designed to examine whether similar effects would emerge under more controlled conditions. We implement the developmental learning framework using a free-energy-principle-based neural network extended with generative replay, which supports stable sequence-by-sequence learning from single tutored episodes. Across both settings, bidirectional tutoring fostered consistent behaviors and stage-wise generalization, while the robot gradually required less tutor guidance. These results suggest that bidirectional tutoring, as an embodied and socially grounded approach, provides an effective scaffold for developmental motor learning in robots.

2606.19699 2026-06-19 cs.RO cs.LG cs.SY eess.SY 新提交 90%

Comparative Study on Agility, Efficiency, and Impact Absorption of Bipedal Robots with Active Toes

具有主动脚趾的双足机器人敏捷性、效率和冲击吸收的比较研究

Joong-Gil Kim, Wontae Ye, Geunwoo Cho, Seong-Ho Yun, Se-Hyoung Cho, Yong-Jae Kim

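Affiliations: School of Electrical, Electronics and Communication Engineering, Korea University of Technology and Education; Artificial Intelligence and Robotics Institute, Korea Institute of Science and Technology; Robot Innovation Hub, WIRobotics Inc.

Topic match (Robot Learning): Compares the performance of bipedal robots with and without active toes.

AI summary: Proposes a 14-DOF biped robot that emulates the lightweight, high-torque, robust nature of human toes and compares configurations with and without active toes in a high-fidelity simulation training environment. At 1.33 m/s walking, the toe-equipped robot reduces CoT by 17.5% and heel-strike force by 5.0%, and average and maximum path deviation decrease by 25.0% and 34.0%, respectively.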

Comments 6 pages, 7 figures

详情
AI中文摘要

人类腿部表现出高效率、敏捷性和冲击吸收能力,其中脚趾在这些能力中起着关键作用。尽管已经有许多尝试在机器人中实现类似人类的脚趾,但它们尚未完全复制人类特征,也没有严格验证其益处。我们提出了一种14自由度的双足机器人,模拟人类脚趾的轻量、高扭矩、坚固特性。为了定量分析主动脚趾在敏捷性、效率和冲击吸收方面的有效性,我们开发了一个高保真仿真训练环境,该环境反映了具有耦合传动和精确功耗的实际执行器。为了确保有和没有主动脚趾的配置之间的公平比较,我们设计了一个最小化强化学习奖励函数,并对两者应用了相同的训练程序。仿真结果表明,在1.33米/秒行走时,与无脚趾配置相比,配备脚趾的机器人将CoT降低了17.5%,脚跟冲击力降低了5.0%。在敏捷性测试中,平均和最大路径偏差分别降低了25.0%和34.0%。

英文摘要

Human legs exhibit high efficiency, agility, and impact absorption, with toes playing a crucial role in these capabilities. While many attempts have been made to implement human-like toes in robots, they have neither fully replicated human characteristics nor rigorously validated their benefits. We propose a 14-DOF biped robot that emulates the lightweight, high-torque, robust nature of human toes. To quantitatively analyze the effectiveness of the active toes in terms of agility, efficiency, and impact absorption, we developed a high-fidelity simulation training environment that reflects actual actuators with coupled transmissions and accurate power consumption. To ensure a fair comparison between configurations with and without active toes, we designed a minimal RL reward function and applied an identical training procedure to both. The simulation results indicate that, at 1.33 m/s walking, the toe-equipped robot reduced cost of transport (CoT) by 17.5% and heel-strike ground reaction force (GRF) by 5.0% compared with the toe-ablation configuration. On the agility test, average and maximum path deviation decreased by 25.0% and 34.0%, respectively.
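For readers unfamiliar with the efficiency metric, cost of transport is commonly defined as the dimensionless ratio of power to weight times speed, CoT = P / (m g v). The abstract does not state which definition the authors use, so the sketch below assumes this standard one. The mass and power figures in the usage note are made up purely to show the arithmetic.

```python
G = 9.81  # gravitational acceleration in m/s^2


def cost_of_transport(mean_power_w: float, mass_kg: float, speed_mps: float) -> float:
    """Dimensionless cost of transport: power / (weight * speed)."""
    if mass_kg <= 0 or speed_mps <= 0:
        raise ValueError("mass and speed must be positive")
    return mean_power_w / (mass_kg * G * speed_mps)


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline
```

With a hypothetical 30 kg robot walking at 1.33 m/s, dropping mean power from 200 W to 165 W gives `relative_reduction(cost_of_transport(200, 30, 1.33), cost_of_transport(165, 30, 1.33))`, a reduction of 17.5%. At fixed mass and speed, a CoT reduction is therefore simply a reduction in mean power.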

2606.19419 2026-06-19 cs.RO cs.AI 新提交 90%

Playful Agentic Robot Learning

趣味性具身机器人学习

Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

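Affiliations: University of California, Berkeley; Impossible Research

Topic match (Robot Learning): Robots learn reusable skills through self-directed exploration.

AI summary: Proposes the RATs framework, which lets robots learn reusable skills through self-directed play, with gains of 20.6 and 17.0 percentage points on LIBERO-PRO and MolmoSpaces, respectively.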

Comments Project page: https://playful-rats.github.io/

详情
AI中文摘要

当前的具身机器人系统可以编写可执行的代码即策略程序、观察反馈并在多次尝试中修正行为,但它们仍然主要是任务驱动的:可复用技能仅在明确指令后获得。我们研究趣味性具身机器人学习,其中具身编码代理在下游任务到来之前,将自主导向的趣味性作为持续技能学习阶段。我们引入RATs,即专为趣味性技能获取设计的机器人代理团队。在趣味性阶段,RATs提出新颖且可学习的探索性任务,规划并执行机器人代码策略,验证中间进展,诊断失败,通过密集的步骤级反馈进行重试,并将成功执行提炼到持久代码技能库中。在测试时,代理从该冻结库中重用相关技能以帮助解决新任务。在LIBERO-PRO和MolmoSpaces上的实验表明,与无趣味性和随机趣味性基线相比,趣味性学习技能在保留的下游任务上分别提升了20.6和17.0个百分点(相对于CaP-Agent0)。此外,学习到的技能可以通过简单地检索到上下文中插入到其他推理时代码即策略代理中,无需微调基础模型,即可在RoboSuite和真实世界迁移中分别提升8.9和8.8个百分点。

英文摘要

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.

2511.16223 2026-06-19 cs.RO 90%

DynaMimicGen: A Data Generation Framework for Robot Learning of Dynamic Tasks

DynaMimicGen:一种用于机器人动态任务学习的数据生成框架

Vincenzo Pomponi, Paolo Franceschi, Stefano Baraldo, Loris Roveda, Oliver Avram, Luca Maria Gambardella, Anna Valente

Affiliations: Institute of Systems and Technologies for Sustainable Production (ISTePS); Department of Innovative Technologies (DTI); University of Applied Science and Arts of Southern Switzerland (SUPSI); Istituto Dalle Molle di studi sull’intelligenza artificiale (IDSIA); Department of Mechanical Engineering; Politecnico di Milano (PoliMi); Faculty of Informatics; Università della Svizzera Italiana (USI)

Topic match (Robot Learning): Proposes the DynaMimicGen framework, which generates dynamic-task data for robot learning.

AI summary: Proposes the DynaMimicGen framework, which generates data from a few human demonstrations, supports learning of dynamic tasks, and produces adaptive trajectories that improve robot performance in complex environments.

详情
AI中文摘要

学习稳健的操作策略通常需要大量且多样化的数据集,但收集这些数据耗时费力且不适用于动态环境。本文引入DynaMimicGen(D-MG),一种可扩展的数据生成框架,能够在极少量人类监督下训练策略,同时支持动态任务设置。仅需少量人类示范,D-MG首先将示范分割为有意义的子任务,然后利用动态运动片段(DMPs)来适应和推广演示行为到新颖且动态变化的环境。改进了依赖静态假设或简单轨迹插值的先前方法,D-MG生成平滑、真实且任务一致的笛卡尔轨迹,能够实时适应任务执行过程中物体姿态、机器人状态或场景几何的变化。我们的方法支持不同场景——包括场景布局、物体实例和机器人配置——使其适用于静态和高度动态的操作任务。我们证明机器人代理通过模仿学习在D-MG生成的数据上实现了在长时间跨度和接触丰富的基准测试中的强大表现,包括立方体堆叠和将杯子放入抽屉等任务,即使在不可预测的环境变化下也是如此。通过消除对大量人类示范的需求并使动态设置的泛化成为可能,D-MG提供了一种强大而高效的替代手动数据收集方法,为可扩展的自主机器人学习铺平道路。

英文摘要

Learning robust manipulation policies typically requires large and diverse datasets, the collection of which is time-consuming, labor-intensive, and often impractical for dynamic environments. In this work, we introduce DynaMimicGen (D-MG), a scalable dataset generation framework that enables policy training from minimal human supervision while uniquely supporting dynamic task settings. Given only a few human demonstrations, D-MG first segments the demonstrations into meaningful sub-tasks, then leverages Dynamic Movement Primitives (DMPs) to adapt and generalize the demonstrated behaviors to novel and dynamically changing environments. Improving on prior methods that rely on static assumptions or simplistic trajectory interpolation, D-MG produces smooth, realistic, and task-consistent Cartesian trajectories that adapt in real time to changes in object poses, robot states, or scene geometry during task execution. Our method supports different scenarios, including scene layouts, object instances, and robot configurations, making it suitable for both static and highly dynamic manipulation tasks. We show that robot agents trained via imitation learning on D-MG-generated data achieve strong performance across long-horizon and contact-rich benchmarks, including tasks like cube stacking and placing mugs in drawers, even under unpredictable environment changes. By eliminating the need for extensive human demonstrations and enabling generalization in dynamic settings, D-MG offers a powerful and efficient alternative to manual data collection, paving the way toward scalable, autonomous robot learning.
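How a Dynamic Movement Primitive follows a target that moves during execution can be illustrated with a one-dimensional sketch. This is a textbook-style discrete DMP, not D-MG's implementation: the gains, the explicit Euler integrator, and the names `dmp_rollout` and `goal_fn` are assumptions, and the learned forcing term defaults to zero.

```python
def dmp_rollout(y0, goal_fn, forcing=lambda x: 0.0, tau=1.0, dt=0.001,
                duration=3.0, alpha=25.0, beta=6.25, alpha_x=3.0):
    """Integrate a 1-D discrete DMP with explicit Euler.

    goal_fn(t) returns the (possibly moving) goal at time t, which is how the
    trajectory adapts online when the target object is displaced.
    """
    y, v, x = y0, 0.0, 1.0  # position, scaled velocity, canonical phase
    trajectory = []
    steps = int(round(duration / dt))
    for k in range(steps):
        g = goal_fn(k * dt)
        # Transformation system: critically damped spring (beta = alpha / 4)
        # pulling toward g, plus a phase-gated forcing term that shapes the path.
        dv = (alpha * (beta * (g - y) - v) + forcing(x) * x * (g - y0)) / tau
        dy = v / tau
        dx = -alpha_x * x / tau  # canonical system: phase decays from 1 toward 0
        v += dv * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory
```

Calling `dmp_rollout(0.0, lambda t: 1.0 if t < 1.0 else 2.0)` moves the goal one second into the motion; the trajectory settles near 1.0 first and then re-converges to 2.0 without replanning, because the goal enters the dynamics only through the spring term.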

2606.20521 2026-06-19 cs.CV 新提交 85%

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

HumanScale: 以自我为中心的人类视频在具身预训练中可超越真实机器人数据

Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang, Ye Huang, Bo Liang, Shukai Gong, Jiankai Tu, Xiaotian Tang, Jiaxin Li, Kaiqi Chen, Duomin Wang, Yuqi Wang, Bingyi Kang, Eric Huang, Zhiyang Dou, Zhen Dong, Enze Xie, Wojciech Matusik, Tat-Seng Chua, Daquan Zhou

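Affiliations: PKU (Peking University); NUS (National University of Singapore); MIT (Massachusetts Institute of Technology); UCSB (University of California, Santa Barbara); NVIDIA

Topic match (Robot Learning): Human video for pretraining embodied foundation models.

AI summary: Through a systematic comparison, finds that egocentric human video processed by a carefully designed filtering and labeling pipeline is not only viable for pretraining embodied foundation models but also outperforms teleoperated real-robot data, validating the scalable paradigm of pretraining on human video and then adapting with a small amount of robot data.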

Comments Github: https://github.com/DAGroup-PKU/HumanNet/

详情
AI中文摘要

具身基础模型有望像大型语言模型一样从数据扩展中受益,但面临更严重的数据瓶颈。遥操作真实机器人轨迹因其精确的动作监督和具身对齐而仍然是主要的预训练来源,但其可扩展性受限于高采集成本、获取难度以及低行为和环境多样性。这些限制引发了对以自我为中心的人类视频作为可扩展、成本显著更低且更多样化的具身模型预训练替代方案的兴趣。然而,与遥操作真实机器人数据相比,其有效性仍未得到充分探索。为了解决这个问题,我们在固定的后训练和验证协议下,进行了一项系统研究,比较以自我为中心的人类视频和遥操作真实机器人轨迹作为具身基础模型的预训练数据源。令人惊讶的是,我们发现经过精心设计的过滤和标注流程处理的以自我为中心的数据,不仅是模型预训练的可行替代品,而且可以带来更优的性能。在相同预训练数据量下,在以自我为中心数据上预训练的模型在真实机器人动作预测上的验证损失降低了24%,在分布内和分布外真实机器人任务执行上的成功率分别提高了52.5%和90%。这一发现验证了具身基础模型的一种可扩展范式:在以自我为中心的人类视频上预训练以学习多样化的世界表征,然后使用少量标注的真实机器人数据进行适配以实现动作空间对齐。我们希望这项研究能鼓励对以自我为中心数据的更广泛探索,并在昂贵的机器人数据收集之前为数据质量评估提供指导。

英文摘要

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.

2606.20495 2026-06-19 cs.RO 新提交 85%

Increasing Resilience of Continuum Robots via Motion Planning Algorithms

通过运动规划算法提高连续体机器人的韧性

Oxana Shamilyan, Ievgen Kabin, Zoya Dyka, Oleksandr Sudakov, Peter Langendoerfer

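Affiliations: IHP – Leibniz-Institut für innovative Mikroelektronik; BTU Cottbus-Senftenberg; Technical Center, National Academy of Sciences of Ukraine

Topic match (Robot Learning): Studies motion planning algorithms for continuum robots.

AI summary: An experimental study of how motion planning algorithms affect the resilience of continuum robots. A genetic algorithm and A* are modified to evaluate path quality with the Analytical Hierarchy Process; the genetic algorithm generates more diverse paths, which increases the robot's resilience.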

详情
AI中文摘要

本文介绍了针对韧性连续体机器人的运动规划实验研究。我们主要关注多准则决策、其在路径规划算法中的应用、对生成路径的影响以及执行时间。为此,我们使用了两种著名的路径规划算法,即遗传算法和A*算法,并通过添加层次分析法算法来评估生成路径的质量,对其进行了修改。在我们的实验中,层次分析法考虑了四个不同的准则,即距离、电机损伤、机器人手臂的机械损伤和精度,每个准则都被认为有助于连续体机器人的韧性。使用不同的准则对于延长连续体机器人的维护操作时间是必要的。我们使用两种不同的机器人模拟环境进行了实验。尽管我们显著简化了机器人模型及其环境,但我们仍然基于真实机器人原型实现了环境的一些特征。特别地,其中一个环境包含单路径点和多路径点,另一个环境仅包含多路径点。结果表明,与A*算法相比,遗传算法的性能时间不依赖于环境的基数。它生成更多样化的路径,从而提高了机器人的韧性。

英文摘要

This paper presents an experimental study of motion planning for resilient continuum robots. The study focuses on multi-criteria decision-making, its application to path-planning algorithms, and its impact on the generated path and execution time. To do this, we used two well-known path-planning algorithms, namely the genetic algorithm and the A* algorithm, and modified them by adding the Analytical Hierarchy Process to evaluate the quality of the generated paths. In our experiment the Analytical Hierarchy Process considers four criteria: distance, motor damage, mechanical damage to the robot's arm, and accuracy, each considered to contribute to the resilience of a continuum robot. Using several criteria is necessary to increase the time until the continuum robot needs maintenance. We conducted the experiments using two different simulated environments of the robot. Although we significantly simplified the robot's model and its environment, we still implemented some of the features of the environment based on the real robot prototype. In particular, one of the environments has single- as well as multi-path points, and the other consists of multi-path points only. The results show that, in contrast to A*, the execution time of the genetic algorithm does not depend on the environment's cardinality. It also generates more diverse paths, which increases the robot's resilience.
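The way the Analytical Hierarchy Process (AHP) turns pairwise judgements over the four criteria into weights for scoring a path can be sketched as follows. The row geometric-mean approximation, the example judgements, and the weighted-sum `path_score` are assumptions for illustration; the abstract does not say how the authors compute or apply the weights.

```python
import math

CRITERIA = ["distance", "motor_damage", "mechanical_damage", "accuracy"]


def ahp_weights(pairwise):
    """Approximate AHP priority weights by the row geometric-mean method.

    pairwise[i][j] states how much more important criterion i is than j,
    so pairwise[j][i] should be its reciprocal.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def path_score(costs, weights):
    """Weighted sum of normalised per-criterion costs (lower is better)."""
    return sum(w * c for w, c in zip(weights, costs))


# Hypothetical judgements: each entry is the ratio of two assumed importances.
importance = [0.4, 0.3, 0.2, 0.1]
pairwise = [[a / b for b in importance] for a in importance]
weights = ahp_weights(pairwise)
```

For a perfectly consistent matrix like this one the method recovers the underlying importances exactly; with real, slightly inconsistent judgements it gives a close approximation to the principal-eigenvector weights.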

2606.20389 2026-06-19 cs.RO 新提交 85%

CoLI: A Reproducible Platform for Continuum Robot Learning via Monolithic 3D Printing and Isomorphic Teleoperation

CoLI: 通过整体3D打印和同构遥操作实现连续体机器人学习的可复现平台

Ziyuan Tang, Chenxi Xiao*

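Affiliations: School of Information Science and Technology, ShanghaiTech University

Topic match (Robot Learning): A continuum robot learning platform supporting imitation learning and teleoperation.

AI summary: Proposes a continuum robot platform based on multi-material 3D printing and isomorphic teleoperation that simplifies fabrication, provides singularity-free mapping control, and supports imitation-learning-based autonomous control. Hardware characterization and manipulation tasks confirm its reproducibility and learning readiness.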

Comments 8 pages, 7 figures, 1 table, accepted by IROS2026

详情
AI中文摘要

连续体机器人因其高自由度、柔顺结构和操作安全性,在操作任务中展现出巨大潜力。然而,复杂的制造和组装过程、具有挑战性的运动学建模以及缺乏直观的控制接口,导致其在研究和实际应用中的可复现性受到阻碍。为解决这些问题,我们提出了一种新颖的开源连续体机器人设计。该平台采用多材料3D打印实现简化的制造流程,使机械臂能够作为整体柔顺结构制造,且组装工作量最小。控制通过同构遥操作接口实现,该接口建立了直接的执行器级映射,无需显式运动学建模,并提供无奇异映射。基于该硬件设计,平台进一步支持基于模仿学习的自主控制。通过硬件表征和一系列操作任务对所提出的系统进行了评估。实验结果表明,该平台提供了一个可复现的、学习就绪的连续体机器人系统,加速了连续体机器人社区的算法开发和系统基准测试。

英文摘要

Continuum robots offer strong potential for manipulation tasks due to their high degrees of freedom, compliant structures, and operational safety. However, their adoption in both research and practical applications has been hindered by reproducibility issues arising from complex fabrication and assembly processes, challenging kinematic modeling, and a lack of intuitive control interfaces. To address these challenges, we present a novel open-source continuum robot design. The platform features a simplified fabrication pipeline enabled by multi-material 3D printing, allowing the arm to be fabricated as a monolithic compliant structure with minimal assembly. Control is achieved through an isomorphic teleoperation interface that establishes a direct actuator-level mapping, eliminating the need for explicit kinematic modeling and providing a singularity-free mapping. Building on this hardware design, the platform further supports imitation-learning-based autonomous control. The proposed system is evaluated through hardware characterization and a set of manipulation tasks. Experimental results demonstrate that the platform provides a reproducible, learning-ready continuum robot system, accelerating algorithmic development and systematic benchmarking for the continuum robotics community.

2606.20365 2026-06-19 cs.RO cs.MA 新提交 85%

An Infrastructure-less, Control-Independent Solution to Relative Localisation of a Team of Mobile Robots using Ranging Measurements

基于测距的移动机器人团队相对定位的无基础设施、控制无关解决方案

Paolo Golinelli, Tommaso Faraci, Daniele Fontanelli

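Affiliations: Department of Industrial Engineering, University of Trento; Department of Information Engineering and Computer Science, University of Trento

Topic match (Robot Learning): A cooperative localisation algorithm for teams of mobile robots.

AI summary: Proposes an anchor-less, fully decentralised cooperative localisation algorithm that relies only on local odometry, sparse ranging, and short-range communication, achieves team observability without controlling robot motion, and uses a multi-hypothesis Bayesian framework for robustness.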

详情
AI中文摘要

定位机器人团队的能力对于从非结构化环境中的机器人舰队到协作控制和导航任务等应用至关重要。在此类场景中,固定基础设施通常不可用,部署必须快速灵活,系统要求必须最小化。我们提出了一种去中心化协作定位算法,同时解决了所有这些挑战。该方法无锚点、完全去中心化,并且与大多数现有方法不同,不需要控制机器人运动来确保团队可观测性。它仅依赖局部里程计、稀疏的代理间测距测量和短程通信,这些在实践中广泛可用。该算法采用多假设贝叶斯框架,维护所有可行解集,确保在瞬态不可观测条件下的鲁棒性。此外,通过信息共享,每个代理都能受益于整个群体的估计,即使在部分连接条件下也是如此。

英文摘要

The ability to localise teams of robots is essential for applications ranging from robotic fleets in unstructured environments to cooperative control and navigation tasks. In such contexts, fixed infrastructure is often unavailable, deployments must be fast and flexible, and system requirements must be minimal. We present a decentralised cooperative localisation algorithm that addresses all these challenges at once. The method is anchor-less, fully decentralised, and, unlike most existing approaches, does not require controlling the robots' motion to ensure team observability. It relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all of which are widely available in practice. The algorithm adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, ensuring robustness under transient unobservable conditions. Moreover, through information sharing, each agent benefits from the estimates of the entire group, even in partially connected conditions.
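Why keeping several hypotheses helps with range-only measurements can be shown with a toy example: a single range leaves mirror-image candidates for a neighbour's position, and a later range taken after some relative motion separates them. The sketch below is not the paper's algorithm. It assumes both robots share a common heading so that odometry displacements can be subtracted directly, and the function name and noise level are hypothetical.

```python
import math


def update_hypotheses(hypotheses, weights, rel_motion, measured_range, sigma=0.1):
    """Propagate candidate relative positions by odometry, then reweight by range.

    hypotheses    : list of (x, y) candidates for robot B in robot A's frame
    rel_motion    : (dx, dy) displacement of B relative to A since the last update
    measured_range: new inter-robot distance measurement
    """
    dx, dy = rel_motion
    moved = [(x + dx, y + dy) for x, y in hypotheses]
    new_weights = []
    for (x, y), w in zip(moved, weights):
        residual = math.hypot(x, y) - measured_range
        # Gaussian likelihood of the range residual, multiplied into the prior weight.
        new_weights.append(w * math.exp(-0.5 * (residual / sigma) ** 2))
    total = sum(new_weights)
    if total == 0.0:
        raise ValueError("no hypothesis is consistent with the measurement")
    return moved, [w / total for w in new_weights]
```

Starting from the mirror pair `(3, 4)` and `(3, -4)`, which both lie at range 5, a relative displacement of `(0, 1)` followed by a new range measurement leaves only one candidate with significant weight. No robot had to be steered to make this happen; ordinary motion was enough.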

2606.20209 2026-06-19 cs.RO cs.AI 新提交 85%

FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching

FlowMaps: 使用流匹配建模长期多模态物体动态

Francesco Argenziano, Miguel Saavedra-Ruiz, Sacha Morin, Charlie Gauthier, Daniele Nardi, Liam Paull

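Affiliations: Sapienza University of Rome; Université de Montréal; Mila - Quebec AI Institute

Topic match (Robot Learning): FlowMaps models object dynamics to improve robot navigation.

AI summary: Proposes the FlowMaps model, which learns multimodal spatio-temporal distributions of object locations through latent flow matching, predicts future locations of dynamic objects, and improves robot navigation in changing household environments.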

详情
AI中文摘要

对3D场景的联合空间和时间理解是部署在日常家庭环境中的机器人的关键要求。这些智能体不仅必须理解和导航空间布局,还必须推理这些空间如何随时间演变。特别是,人类每天与物体互动,导致物体在整个环境中改变位置,使机器人难以可靠地将当前观察与先前看到的物体关联起来。然而,这些互动并非随机:人类的习惯和日常行为在物体位置上产生了时空一致的模式,机器人智能体可以学习这些模式,然后将其用于下游任务,如导航。为此,我们引入了FlowMaps,一种潜在流匹配模型,用于估计连续3D空间中动态物体未来位置的多模态分布。通过学习物体之间的隐式依赖关系及其时间演变,FlowMaps预测物体位置在人类过去互动条件下的可能变化,同时支持在具有相似物体习惯的未见环境中的泛化。为了展示该方法的实用性,我们在模拟和真实环境中将FlowMaps部署到下游的动态物体导航任务中。在超过600个回合中,FlowMaps优于最先进的方法,表明通过连续、多模态的时空分布建模物体动态可以改善机器人在变化家庭环境中的搜索和导航。代码和附加材料可在此https URL获取。

英文摘要

Joint spatial and temporal understanding of 3D scenes is a crucial requirement for robots deployed in everyday household environments. Such agents must not only comprehend and navigate spatial layouts, but also reason about how these spaces evolve over time. In particular, humans interact with objects daily, causing them to change position throughout the environment and making it difficult for robots to reliably associate current observations with previously seen objects. However, these interactions are not random: human habits and routines induce spatio-temporally consistent patterns in object locations, which robotic agents can potentially learn and then exploit for downstream tasks such as navigation. To this end, we introduce FlowMaps, a latent flow matching model for estimating multimodal distributions over the future locations of dynamic objects in a continuous 3D space. By learning the implicit dependencies among objects and their temporal evolution, FlowMaps predicts likely changes in object locations conditioned on past human interactions, while supporting generalization across previously unseen environments that share similar object routines. To demonstrate the utility of this method, we deploy FlowMaps in a downstream dynamic Object Navigation task in both simulated and real-world environments. Across more than 600 episodes, FlowMaps outperforms state-of-the-art approaches, showing that modeling object dynamics through continuous, multimodal spatio-temporal distributions improves robotic search and navigation in changing household environments. Code and additional material are available at https://fra-tsuna.github.io/flowmaps/.
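The core mechanics of flow matching can be sketched without any neural network: training regresses a velocity field onto straight-line displacements between a noise sample and a data sample, and sampling integrates that field from t = 0 to t = 1. The code below works on raw coordinates with a linear interpolation path. FlowMaps itself operates in a latent space with conditioning that the abstract does not detail, so the function names and the hand-written field in the usage note are assumptions.

```python
import random


def cfm_training_pair(x0, x1, rng):
    """Sample (t, x_t, target velocity) for linear-path conditional flow matching.

    x0 is a noise sample and x1 a data sample, e.g. a future object location.
    """
    t = rng.random()
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]  # constant velocity along the line
    return t, x_t, target


def euler_sample(velocity, x0, steps=50):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with explicit Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

As a sanity check, the exact field for a single target point `p` is `(p - x) / (1 - t)`, and `euler_sample` carries any start point to `p` under it. A trained network replaces that hand-written field, and different noise samples can then land on different modes, such as several plausible places where an object might be found.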

2606.20150 2026-06-19 cs.RO 新提交 85%

Robust Assembly State Reasoning from Action Recognition for Human-Robot Collaboration

面向人机协作的基于动作识别的鲁棒装配状态推理

James Fant-Male, Roel Pieters

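Affiliations: Cognitive Robotics group, Unit of Automation Technology and Mechanical Engineering, Tampere University

Topic match (Robot Learning): Assembly state reasoning in human-robot collaboration.

AI summary: Studies methods for tracking assembly state from action recognition inputs, comparing logic-based, HMM, and neural network approaches. The best method varies by task, and logic-based approaches are more robust in highly variable scenarios.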

Comments Preprint accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). 8 pages, 9 figures, 3 tables

详情
AI中文摘要

人类动作识别(HAR)在人机协作(HRC)研究中经常被用于理解已执行的动作以及协作任务的状态。然而,从HAR准确跟踪装配状态尚未得到充分研究,并且在现实场景中并非易事。本研究系统性地调查并比较了使用动作识别输入跟踪装配状态的方法。使用两个不同数据集和五种状态跟踪方法(包括基于逻辑的、隐马尔可夫模型(HMM)和神经网络(NN)方法)进行的调查表明,最优方法在不同任务中并不统一,并且不同方法在不同情况下会失败。测试使用具有不同噪声水平的模拟输入和来自HAR模型的真实输入进行。结果表明,NN和HMM方法在变异性有限的任务中表现良好,但在其他场景中,基于逻辑的方法可能更鲁棒。对于没有额外传感的重复动作任务,建模预期动作持续时间的方法也很重要。

英文摘要

Human Action Recognition (HAR) is frequently investigated in Human-Robot Collaboration (HRC) research to understand what actions have been performed and hence the state of a collaborative task. Accurately tracking an assembly state from HAR is however not fully investigated, and in realistic scenarios is not a trivial task. This research systematically investigates and compares methods for tracking assembly state using action recognition inputs. Investigations using two diverse datasets and five state tracking approaches, including logic-based, Hidden Markov Model (HMM), and neural network (NN) methods, show that optimal approaches are not uniform across different tasks and that different methods fail under different circumstances. Testing is performed using both simulated inputs with varying noise levels and realistic inputs from a HAR model. Results show NN and HMM methods can perform well in tasks with limited variability, but for other scenarios logic-based approaches can be more robust. Methods which model expected action duration are also important for tasks with repeated actions where no additional sensing is provided.
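Of the approaches compared, the HMM variant is the easiest to show compactly: assembly states are hidden, recognised actions are noisy observations, and each new action updates a belief over states. The sketch below is a generic forward-filtering step, not the authors' tracker; the state layout, probabilities, and action labels in the usage note are invented for illustration.

```python
def hmm_forward_step(belief, transition, emission, observed_action):
    """One step of HMM filtering over assembly states.

    belief[i]           : P(state i | actions so far)
    transition[i][j]    : P(next state j | state i)
    emission[j][action] : P(recognised action | arrived in state j)
    """
    n = len(belief)
    # Predict: push the current belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight each state by how likely it makes the recognised action.
    unnorm = [predicted[j] * emission[j].get(observed_action, 0.0) for j in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        return predicted  # action impossible under the model; keep the prediction
    return [u / total for u in unnorm]
```

With three states (no part, part A attached, part B attached), a prior of `[1, 0, 0]`, a 50% chance of advancing per step, and a recogniser that reports `"attach_a"` with probability 0.9 on entering the second state but 0.1 as a false positive in the first, one `"attach_a"` observation moves the belief to `[0.1, 0.9, 0.0]`. The residual 0.1 is how the filter carries recogniser noise forward instead of committing to a state immediately.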

3. Embodied Navigation (5 papers)

2606.19555 2026-06-19 cs.RO 新提交 90%

SCAN-Planner: Spatial Collision-Aware Local Planning for Route-Guided Long-Range Quadruped Navigation

SCAN-Planner:用于路线引导的远程四足导航的空间碰撞感知局部规划

Han Zheng, Zhe Chen, Yiwen Fu, Ming Yang, Tong Qin

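Affiliations: Shanghai Jiao Tong University

Topic match (Embodied Navigation): Proposes SCAN-Planner for long-range quadruped navigation.

AI summary: Proposes the SCAN-Planner framework, which achieves spatial collision-aware local planning through a yaw-aware twin-cylinder footprint and projected A* search, generating safe and smooth trajectories in dense clutter, 3D unstructured environments, and long-range navigation.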

详情
AI中文摘要

四足机器人越来越需要能够在狭窄通道、杂乱室内场景和大规模3D非结构化环境中导航。现有的局部规划器通常使用各向同性几何膨胀来近似机器人,或依赖于平面和高程图表示,导致在狭窄空间中的保守运动以及对悬垂结构的推理有限。本文提出了SCAN-Planner,一种用于远程四足导航的空间碰撞感知局部规划框架。使用偏航感知的双圆柱足迹来建模细长的机器人身体,通过在膨胀的3D占用地图中进行稀疏查询实现全身碰撞评估。我们进一步引入投影A*搜索,在插值的地面跟随表面上生成无碰撞引导,并通过z梯度抑制来水平避开障碍物同时保持垂直稳定性。对于大规模部署,具有边界回退的机器人中心滑动地图提供高分辨率局部碰撞检查并从局部死胡同中恢复。仿真和真实实验表明,SCAN-Planner在密集杂乱、3D非结构化场景、楼梯穿越和远程导航任务中生成安全、平滑且高效的轨迹。

英文摘要

Quadruped robots are increasingly expected to navigate through narrow passages, cluttered indoor scenes, and large-scale 3D unstructured environments. Existing local planners commonly approximate the robot using isotropic geometric inflation or rely on planar and elevation-map representations, leading to conservative motion in tight spaces and limited reasoning about overhanging structures. This letter presents SCAN-Planner, a spatial collision-aware local planning framework for long-range quadruped navigation. A yaw-aware twin-cylinder footprint is used to model the elongated robot body, enabling whole-body collision evaluation through sparse queries in an inflated 3D occupancy map. We further introduce a projected A* search that generates collision-free guidance on an interpolated ground-following surface, with z-gradient suppression to avoid obstacles horizontally while maintaining vertical stability. For large-scale deployment, a robot-centric sliding map with boundary fallback provides high-resolution local collision checking and recovery from local dead ends. Simulation and real-world experiments demonstrate that SCAN-Planner generates safe, smooth, and efficient trajectories in dense clutter, 3D unstructured scenes, stair traversal, and long-range navigation tasks.
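One plausible reading of the yaw-aware twin-cylinder footprint is sketched below: two vertical cylinders sit fore and aft of the body centre along the heading, and because the occupancy map is already inflated by the cylinder radius, each cylinder reduces to a short column of point lookups. The geometry values, the voxel-set map representation, and the function names are assumptions; the abstract does not give the actual query scheme.

```python
import math


def twin_cylinder_centres(x, y, yaw, half_spacing):
    """Centres of two cylinders placed fore and aft of the body centre along the heading."""
    dx, dy = half_spacing * math.cos(yaw), half_spacing * math.sin(yaw)
    return [(x + dx, y + dy), (x - dx, y - dy)]


def in_collision(pose, occupied_voxels, half_spacing=0.25,
                 z_min=0.1, z_max=0.5, resolution=0.1):
    """Check a pose (x, y, z, yaw) against a set of occupied voxel indices.

    The set is assumed to be pre-inflated by the cylinder radius, so only the
    cylinder axes need to be sampled: a handful of lookups per pose.
    """
    x, y, z, yaw = pose
    layers = max(1, int(round((z_max - z_min) / resolution)))
    for cx, cy in twin_cylinder_centres(x, y, yaw, half_spacing):
        for k in range(layers):
            # Sample the middle of each voxel layer spanned by the body height.
            height = z + z_min + (k + 0.5) * resolution
            key = (math.floor(cx / resolution),
                   math.floor(cy / resolution),
                   math.floor(height / resolution))
            if key in occupied_voxels:
                return True
    return False
```

The same obstacle can block the robot at one heading and clear it at another, which an isotropic inflation by the body's half-length cannot express. That difference is what allows less conservative motion through narrow passages.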

2606.18112 2026-06-19 cs.RO cs.CV 新提交 90%

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Qwen-RobotNav 技术报告:为智能体导航系统设计的可扩展导航模型

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Zhibo Yang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

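Affiliations: Qwen Team

Topic match (Embodied Navigation): Proposes a scalable navigation model for agentic navigation systems.

AI summary: Proposes Qwen-RobotNav, a scalable navigation model whose parameterised interface supports multiple task modes and adjustable observation parameters. It is trained on 15.6M samples, co-trained with vision-language data to prevent behavioural collapse, sets new state-of-the-art results on multiple navigation benchmarks, and shows zero-shot generalisation.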

详情
AI中文摘要

智能体导航系统需要一个基础导航模型,其观测策略可以在推理时从外部重新配置,因为指令跟随、目标搜索、目标跟踪和自动驾驶共享相同的感知规划主干,但对视觉流的消费方式有根本不同的要求。我们提出 Qwen-RobotNav,一个建立在 Qwen-RobotNav 上的可扩展导航模型,通过一个具有两个互补维度的参数化接口来解决这个问题:多个任务模式选择导航行为,以及可控的观测参数(例如,token 预算、每个摄像头的权重)控制视觉历史的编码方式。通过训练时对所有参数进行随机化,Qwen-RobotNav 对任何推理时配置都具有鲁棒性,无需对 Qwen-RobotNav 主干进行任何架构修改。我们在15.6M样本上训练 Qwen-RobotNav;与视觉语言数据联合训练防止了在仅轨迹训练中观察到的反应性动作序列映射器的坍缩。参数化接口也使 Qwen-RobotNav 成为智能体系统的自然构建块:对于长时域场景,上层规划器将目标分解为子任务,并在情节中动态切换 Qwen-RobotNav 的任务模式和上下文策略,通过重复调用同一模型组合出复杂行为。大量实验表明,Qwen-RobotNav 在主要导航基准上取得了新的最优结果。该模型从2B到8B参数展现出良好的扩展性,联合多任务训练发展出一个跨任务族迁移的共享空间规划基板,并在多样环境中对真实世界机器人展现出强大的零样本泛化能力。

英文摘要

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses this requirement through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration, requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.

2606.16780 2026-06-19 cs.RO New submission 90%

DIFF-IPPO: Diffusion-Based Informative Path Planning with Open-Vocabulary Belief Maps

DIFF-IPPO:基于扩散的开放词汇信念地图信息路径规划

Sausar Karaf, Oleg Sautenkov, Mikhail Martynov, Dzmitry Tsetserukou

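Affiliations * Intelligent Space Robotics Laboratory, CDE, Skoltech

Topic match Embodied navigation: proposes a diffusion planner for robot object search.

AI summary Proposes DIFF-IPPO, a framework that combines an open-vocabulary belief map generator with a diffusion planner to generate global trajectories over non-Gaussian belief maps for efficient object search, reaching normalized detection scores of 81.49% to 86.55%.

Details
AI Chinese abstract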

探索和物体搜索要求机器人感知环境、识别感兴趣区域,并规划提高目标检测可能性或最大化信息增益的轨迹。许多IPP方法,特别是在连续环境监测中,依赖于高斯过程信念模型,而物体搜索场景通常从语义或开放词汇感知中产生复杂的多模态信念地图。直接基于这种非高斯信念地图的全局轨迹生成仍然相对未被充分探索。尽管基于扩散的规划器为此类分布建模提供了强大能力,但它们在信息路径规划中的应用仍然有限。在这项工作中,我们提出了DIFF-IPPO,一个集成了开放词汇信念地图生成器和基于扩散的规划器的流水线,用于在信念地图上生成全局轨迹。该方法生成的轨迹将传感器覆盖集中在高信念区域,在不同数据集场景下实现了81.49%至86.55%的归一化检测得分。我们在一个模拟的搜索与救援场景中验证了该系统,其中规划器搜索候选建筑区域以定位燃烧的建筑。在此设置中,一个由五架无人机组成的团队使用批处理信念地图条件轨迹生成,在3.5分钟内实现了首次检测。

English abstract

Exploration and object search require robots to perceive their environment, identify regions of interest, and plan trajectories that improve target-detection likelihood or maximize information gain. Many IPP methods, especially in continuous environmental monitoring, rely on Gaussian-process belief models, while object-search settings often produce complex, multimodal belief maps from semantic or open-vocabulary perception. Global trajectory generation directly conditioned on such non-Gaussian belief maps remains comparatively underexplored. Although diffusion-based planners offer strong capabilities for modeling such distributions, their use in informative path planning remains limited. In this work, we propose DIFF-IPPO, a pipeline that integrates an open-vocabulary belief map generator with a diffusion-based planner for global trajectory generation over belief maps. The method generates trajectories that concentrate sensor coverage over high-belief regions, achieving normalized detection scores between 81.49% and 86.55% across different dataset scenarios. We validate the system in a simulated search-and-rescue scenario where the planner searches candidate building regions to locate a burning building. In this setting, a team of five drones using batched belief-map-conditioned trajectory generation achieves first detections in 3.5 minutes.

2606.20479 2026-06-19 cs.RO New submission 85%

GroundControl: Anticipating Navigation Failures in Vision-Language Agents via Trajectory-Consistent Uncertainty Estimates

GroundControl: 通过轨迹一致的不确定性估计预测视觉语言智能体中的导航失败

Nastaran Darabi, Divake Kumar, Sina Tayebati, Devashri Naik, Amit Ranjan Trivedi

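Affiliations * University of Illinois at Chicago (UIC)

Topic match Embodied navigation: anticipates failures of vision-language navigation agents.

AI summary Proposes GroundControl, a trajectory-consistent uncertainty estimator that models distance-to-goal evolution with a Kalman filter and combines it with trajectory features to anticipate navigation failures, outperforming baselines in selective risk-coverage evaluation.

Details
AI Chinese abstract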

视觉语言导航智能体在基准任务上取得了具有竞争力的平均成功率,但失败通常源于可预测的轨迹级问题,如振荡、停滞或低效绕路。因此,可靠部署需要能够在执行过程中预测新兴失败动态的不确定性信号,而不仅仅是反映瞬时动作熵。我们引入了GroundControl,一种轨迹一致的不确定性估计器,定义为在一个回合中聚合的、相对于标称目标导向的距离-目标动态的统计偏差。GroundControl使用恒定速度卡尔曼滤波器对距离演化进行建模,并将归一化创新统计量与补充轨迹特征(捕捉进展、单调性、路径效率和振荡行为)相结合。由此产生的不确定性分数反映了导航行为中的几何和时间不一致性,而非局部预测分散。为了独立于任务成功评估不确定性质量,我们形式化了选择性风险-覆盖导航(SRCN)协议,该协议通过风险-覆盖曲线和AURC/E-AURC摘要,衡量不确定性分数按失败或低效对回合进行排序的有效性。在五个EB-Navigation分割($N=300$个回合)上,基于成功的选择性风险下,轨迹一致的不确定性实现了接近神谕的排序,GPT-4o模型的加权平均$\mathrm{E\text{-}AURC}_{\mathrm{SR}}=0.0024$,显著优于熵、共形和启发式基线。在基于SPL的选择性评估下,GroundControl在模型和导航分割上始终实现最低的AURC和E-AURC。这些结果表明,对目标导向动态的偏离进行建模,为预测视觉语言智能体中的导航失败提供了可解释且鲁棒的信号。

English abstract

Vision-language navigation agents achieve competitive average success on benchmark tasks, yet failures often arise through predictable trajectory-level breakdowns such as oscillation, stagnation, or inefficient detours. Reliable deployment, therefore, requires uncertainty signals that anticipate emerging failure dynamics during execution rather than reflect only instantaneous action entropy. We introduce GroundControl, a trajectory-consistent uncertainty estimator defined as statistical deviation from nominal goal-directed distance-to-goal dynamics aggregated over an episode. GroundControl models distance evolution using a constant-velocity Kalman filter and combines normalized innovation statistics with complementary trajectory features capturing progress, monotonicity, path efficiency, and oscillatory behavior. The resulting uncertainty score reflects geometric and temporal inconsistency in navigation behavior rather than local prediction dispersion. To evaluate uncertainty quality independently of task success, we formalize Selective Risk-Coverage Navigation (SRCN), a protocol that measures how effectively an uncertainty score ranks episodes by failure or inefficiency using risk-coverage curves and AURC / E-AURC summaries. Across five EB-Navigation splits ($N=300$ episodes), trajectory-consistent uncertainty achieves near-oracle ordering under success-based selective risk, with weighted-average $\mathrm{E\text{-}AURC}_{\mathrm{SR}}=0.0024$ for the GPT-4o model, substantially outperforming entropy-, conformal-, and heuristic baselines. Under SPL-based selective evaluation, GroundControl consistently achieves the lowest AURC and E-AURC across models and navigation splits. These results show that modeling deviation from goal-directed dynamics provides an interpretable and robust signal for anticipating navigation failures in vision-language agents.
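To make the Kalman-filter part of the abstract concrete, here is a minimal sketch of how normalized innovation statistics could be computed from a distance-to-goal sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names `normalized_innovations` and `uncertainty_score`, the noise parameters `q` and `r`, the initial covariance, and the use of a plain mean as the episode-level aggregate are all hypothetical. The complementary trajectory features mentioned in the abstract (progress, monotonicity, path efficiency, oscillation) are not modeled here.

```python
def normalized_innovations(distances, dt=1.0, q=0.01, r=0.05):
    """Run a constant-velocity Kalman filter over scalar distance-to-goal
    measurements and return the normalized innovation squared (NIS) per step.

    State is [distance, velocity]; only distance is observed.
    q (process noise) and r (measurement noise) are assumed values.
    """
    if len(distances) < 2:
        return []
    d, v = float(distances[0]), 0.0      # initial state: first reading, zero velocity
    p00, p01, p11 = 1.0, 0.0, 1.0        # symmetric 2x2 covariance (assumed prior)
    nis = []
    for z in distances[1:]:
        # Predict: x = F x, P = F P F^T + Q with F = [[1, dt], [0, 1]], Q = diag(q, q)
        d = d + dt * v
        p00, p01, p11 = (
            p00 + 2.0 * dt * p01 + dt * dt * p11 + q,
            p01 + dt * p11,
            p11 + q,
        )
        # Innovation and its variance for H = [1, 0]
        y = z - d
        s = p00 + r
        nis.append(y * y / s)
        # Update: K = P H^T / s, x = x + K y, P = (I - K H) P
        k0, k1 = p00 / s, p01 / s
        d += k0 * y
        v += k1 * y
        p00, p01, p11 = (1.0 - k0) * p00, (1.0 - k0) * p01, p11 - k1 * p01
    return nis


def uncertainty_score(distances):
    """Hypothetical episode-level aggregate: mean NIS over the episode."""
    nis = normalized_innovations(distances)
    return sum(nis) / len(nis) if nis else 0.0
```

Under these assumptions, an episode whose distance-to-goal shrinks steadily yields small innovations once the filter has estimated the approach velocity, while an oscillating episode keeps violating the constant-velocity prediction and accumulates a higher score.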

2606.20458 2026-06-19 cs.RO New submission 85%

Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation

慢速大脑,快速规划器:延迟鲁棒的VLM增强城市导航

Zhenghao "Mark" Peng, Honglin He, Quanyi Li, Yukai Ma, Bolei Zhou

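Affiliations * Amazon FAR; UCLA; Independent; Zhejiang University

Topic match Embodied navigation: proposes a VLM-augmented urban navigation method for mobile robots.

AI summary Targets the trajectory scoring gap in sidewalk navigation for mobile robots with a training-free, latency-resilient trajectory-level fusion layer: a VLM selects a candidate trajectory, which is fused with the planner's output, reducing ADE by 30% in challenging scenarios.

Details
AI Chinese abstract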

基于学习的人行道导航规划器可以实时生成多样化的候选轨迹,但其评分函数在挑战性场景中往往无法选择最佳轨迹,即使同一集合中存在更好的候选,也会输出使移动机器人驶入草地、朝向行人或错误方向的轨迹。我们称之为轨迹评分差距:在真实世界的人行道导航中,基于锚点的规划器的最佳选择与最佳候选之间的差距很大,这可能是由于规划器的高层场景理解能力有限。我们不是用端到端的视觉-语言-动作模型替换规划器,而是提出一种VLM-规划器接口,使用VLM从规划器的候选集合中选择一个候选索引,然后将其与规划器的初始输出融合。然而,VLM每次查询需要1-3秒,因此无法直接驱动5-20Hz的控制循环。我们贡献了一种无需训练、延迟鲁棒的轨迹级融合层,通过指数衰减的几何相似性将过时的VLM选择转化为实时规划器评分。在约2000个具有挑战性的真实世界场景(例如交叉口、行人相遇)中,VLM选择相比规划器的最佳选择实现了30%的ADE降低,而规划器在常规场景中仍保持竞争力。在仿真中,Score Fusion在高达5秒的延迟下仍保持>80%的成功率。我们在移动机器人上展示了完整系统,在具有不同网络延迟的具有挑战性的校园人行道上进行导航。

English abstract

Learning-based planners for sidewalk navigation can generate diverse candidate trajectories in real time, yet their scoring functions often fail to select the best trajectory in challenging situations, outputting trajectories that make the mobile robot drive onto grass, toward pedestrians, or in the wrong direction, even when better candidates exist in the same set. We call this the trajectory scoring gap: in real-world sidewalk navigation, the gap between an anchor-based planner's top choice and the best possible candidate is substantial, likely due to limited high-level scene understanding capability of the planner. Rather than replacing the planner with an end-to-end Vision-Language-Action model, we propose a VLM-Planner interface that uses a VLM to select a candidate index from the planner's proposal set and then fuses it with the planner's initial output. However, VLMs take 1-3 s per query and so cannot directly drive a 5-20 Hz control loop. We contribute a training-free, latency-resilient trajectory-level fusion layer that turns a stale VLM selection into real-time planner scoring via geometric similarity with exponential decay. On about 2,000 challenging real-world scenarios (e.g., junctions, pedestrian encounters), VLM selection achieves 30% ADE reduction versus the planner's best selection, while the planner remains competitive in routine situations. In simulation, Score Fusion maintains >80% success rate with delays up to 5 s. We demonstrate the full system on a mobile robot navigating challenging campus sidewalks with varied network latency.
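The fusion idea described above (a stale VLM choice influencing fresh planner scores through geometric similarity with exponential decay) can be sketched roughly as follows. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the additive form of the fusion, the similarity kernel, and the names `fuse_scores`, `weight`, `tau_s`, and `sigma_m` are hypothetical and are not taken from the paper.

```python
import math


def mean_point_distance(traj_a, traj_b):
    """Average Euclidean distance between corresponding waypoints (x, y)."""
    n = min(len(traj_a), len(traj_b))
    return sum(math.dist(p, q) for p, q in zip(traj_a[:n], traj_b[:n])) / n


def fuse_scores(candidates, planner_scores, vlm_traj, vlm_age_s,
                weight=1.0, tau_s=2.0, sigma_m=1.0):
    """Add a VLM bonus to each planner score.

    The bonus is larger for candidates that are geometrically close to the
    trajectory the VLM picked earlier, and it decays exponentially with the
    age of that pick. All parameters are assumed values:
      weight  - strength of the VLM bonus
      tau_s   - time constant of the decay, in seconds
      sigma_m - length scale of the similarity kernel, in metres
    """
    decay = math.exp(-vlm_age_s / tau_s)
    fused = []
    for traj, score in zip(candidates, planner_scores):
        similarity = math.exp(-mean_point_distance(traj, vlm_traj) / sigma_m)
        fused.append(score + weight * decay * similarity)
    return fused


def best_index(scores):
    """Index of the highest-scoring candidate."""
    return max(range(len(scores)), key=scores.__getitem__)
```

With a recent VLM pick, the candidate closest to it can overtake the planner's own top choice; as the pick ages, the bonus vanishes and the ranking falls back to the planner's scores, which is the behaviour a control loop running faster than the VLM would need.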