arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

1. Robot Learning, Imitation Learning, and Reinforcement Learning (25 papers)

2606.14862 2026-06-16 cs.RO New submission

TacStyle: Personalizing Tactile Robot Policies using Structured Behavior Representations

TacStyle: 使用结构化行为表示个性化触觉机器人策略

Kevin Robledo, Matías I. Torres Galaz, Kumar Dixhant Rai, Shelly Sara Ulman, Tasmia Tasrin, Heramb Nemlekar

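Affiliations: Department of Computer Science, California State University, Northridge; Department of Mechanical Engineering, California State University, Northridge

AI summary: Proposes organizing user preferences in a structured latent representation and using a foundation model to interpret language instructions, enabling fine-grained adjustment of robot behavior with fewer preference labels.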

Comments 14 pages, 5 figures

AI-generated Chinese abstract

辅助人类的机器人系统应能根据个人用户偏好调整其行为。例如,用户可能希望机器人手臂在折叠衣物或清洁家具时调整施加的力。自然语言为人类传达此类偏好提供了直观方式。语言条件机器人策略的最新进展表明,机器人可以成功使用语言提示来确定要执行的任务。然而,将相同方法扩展到实现任务应如何执行,需要描述任务数据中轨迹偏好或风格的详细标签。收集此类注释具有挑战性,而且直接以这些标签为条件可能无法提供对连续行为范围的精细控制。例如,通过“施加比之前稍大一点的压力”这样的抽象指令来传达机器人必须施加的确切力是困难的。因此,在这项工作中,我们提出使用语言来推理偏好行为,而不是直接生成它们。我们首先学习一个结构化的潜在表示,根据相应轨迹的差异来组织用户偏好。然后,给定一个偏好提示,我们使用基础模型来解释这个潜在空间,并选择一个能产生所需行为的值。通过仿真和真实世界实验,我们表明从直观结构化的潜在空间中选择机器人行为能够更精确地适应用户偏好,同时所需的偏好标签显著少于语言条件策略。

English abstract

Robotic systems that assist humans should be capable of adapting their behaviors to individual user preferences. For instance, users may want a robot arm to adjust the amount of force it applies while folding their laundry or cleaning furniture. Natural language provides an intuitive way for humans to communicate such preferences. Recent progress in language-conditioned robot policies has shown that robots can successfully use language prompts to determine what task to perform. However, extending the same approach to realize how the task should be performed requires detailed labels describing the preferences or styles of trajectories in the task data. Not only is collecting such annotations challenging, but conditioning directly on these labels may also fail to provide fine-grained control over a continuous range of behaviors. For example, it can be difficult to convey the exact force that a robot must apply through abstract instructions like "apply a bit more pressure than before". Therefore, in this work, we propose using language to reason over preferred behaviors instead of directly generating them. We first learn a structured latent representation that organizes user preferences according to differences in the corresponding trajectories. Then, given a preference prompt, we use a foundation model to interpret this latent space and choose a value that produces the desired behavior. Through both simulation and real-world experiments, we show that selecting robot behaviors from an intuitively structured latent space enables more precise adaptation to user preferences while requiring significantly fewer preference labels than language-conditioned policies.

2606.15232 2026-06-16 cs.RO New submission

Rethinking Implicit Spatial Representation in Visuomotor Policy Learning

重新思考视觉运动策略学习中的隐式空间表示

Xiangyu Chen, Yuxuan Hu, Chuhao Zhou, Jianfei Yang

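Affiliations: MARS Lab, Nanyang Technological University

AI summary: Re-evaluates the effectiveness of spatial softmax pooling for robotic manipulation, finds that it yields compact, stable spatial representations but is limited by a representation bottleneck, and proposes the PRISM encoder, which improves performance by fusing multiscale implicit spatial information.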

AI-generated Chinese abstract

基于生成模型的模仿学习已成为机器人操作广泛采用的范式,其中策略性能关键取决于条件视觉表示。尽管空间softmax表示已被用于先前的视觉运动策略,但其有效性和潜在机制仍未被充分理解。本文重新思考空间softmax池化的使用:这种隐式空间表示是否为机器人操作提供了有效且稳定的视觉特征?通过对视觉编码器中不同池化方法的系统研究,我们发现这种池化操作产生紧凑且稳定的空间表示,尽管使用更少的维度,但优于特征值表示。互补的显著性分析进一步表明,这些空间表示引导编码器更一致地关注任务相关区域。然而,这一优势受到当前视觉编码器中表示瓶颈的限制:重复的下采样操作在动作生成模块使用之前削弱了细粒度空间信息,尤其是在低分辨率观测下。受这些发现的启发,我们提出PRISM,一种通过自上而下的交叉注意力融合保留多尺度隐式空间信息的视觉编码器。跨多个任务和策略骨干的实验显示出一致的改进。特别是在低分辨率、高精度的ToolHang任务中,PRISM显示出明显的增益,将平均成功率从5.0%提高到13.4%,同时参数仅增加15.4%。这些结果支持将多尺度隐式空间表示作为机器人操作的有效且高效的设计原则。

English abstract

Generative model-based imitation learning has become a widely adopted paradigm for robotic manipulation, where policy performance depends critically on the conditioned visual representations. Although spatial softmax-based representations have been adopted in prior visuomotor policies, their effectiveness and underlying mechanisms remain insufficiently understood. This work rethinks the use of spatial softmax pooling: do such implicit spatial representations provide effective and stable visual features for robotic manipulation? Through systematic studies of different pooling methods in visual encoders, we find that this pooling operation produces compact and stable spatial representations, which outperform feature-value representations, despite using substantially fewer dimensions. Complementary saliency analysis further suggests that these spatial representations guide the encoder to focus more consistently on task-relevant regions. However, this advantage is limited by a representation bottleneck in current visual encoders: repeated downsampling operations weaken fine-grained spatial information before the action-generation module can use it, especially under low-resolution observations. Motivated by these findings, we propose PRISM, a visual encoder that preserves multiscale implicit spatial information through top-down cross-attention fusion. Experiments across multiple tasks and policy backbones show consistent improvements. In particular, on the low-resolution, high-precision ToolHang task, PRISM shows clear gains, improving the average success rate from 5.0% to 13.4% while increasing parameters by only 15.4%. These results support the use of multiscale implicit spatial representations as an effective and efficient design principle for robotic manipulation.
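As a rough illustration of the spatial softmax pooling that the abstract above studies, the sketch below reduces each feature map to one expected (x, y) keypoint, so C channels become a 2C-dimensional vector regardless of input resolution. It is a generic pure-Python version of the well-known operation, not the PRISM encoder; the names `spatial_softmax` and `encode` and the `temperature` parameter are assumptions made for this example.

```python
import math


def spatial_softmax(feature_map, temperature=1.0):
    """Return the softmax-weighted expected (x, y) location of one H x W map.

    Coordinates are normalized to [-1, 1] along both axes (an assumed convention).
    """
    height, width = len(feature_map), len(feature_map[0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in feature_map)
    weights = [[math.exp((v - peak) / temperature) for v in row]
               for row in feature_map]
    total = sum(sum(row) for row in weights)

    def coord(index, size):
        # Map a pixel index to [-1, 1]; a single-pixel axis maps to 0.
        return 0.0 if size == 1 else -1.0 + 2.0 * index / (size - 1)

    x = sum(w * coord(j, width) for row in weights for j, w in enumerate(row)) / total
    y = sum(w * coord(i, height) for i, row in enumerate(weights) for w in row) / total
    return x, y


def encode(feature_maps, temperature=1.0):
    """Flatten C feature maps into a 2*C keypoint vector."""
    keypoints = []
    for fmap in feature_maps:
        keypoints.extend(spatial_softmax(fmap, temperature))
    return keypoints
```

A map with a single strong activation yields a keypoint near that activation, while a flat map yields the image center; this is why the output stays compact compared with flattening the feature values themselves.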

2606.15373 2026-06-16 cs.RO New submission

A Hybrid Model-Based and Model-Free Framework for Active Multi-View Viewpoint Optimization in Sonar Target Recognition

一种基于模型与无模型混合框架的主动多视角声纳目标识别视点优化

Yongkyoon Park, Jane Shin

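Affiliations: University of California, Berkeley

AI summary: Proposes a hybrid model-based and model-free framework that combines CNN observation likelihoods with Radon-based orientation estimation and trains a PPO agent offline with an information-gain reward to learn a viewpoint selection policy. At deployment it selects viewpoints in real time using only CNN-based belief updates, reaching competitive recognition accuracy on a sonar dataset with fewer sensing steps and lower motion cost.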

AI-generated Chinese abstract

本文提出了一种基于模型与无模型混合框架,用于使用前视声纳进行主动多视角目标识别。卷积神经网络(CNN)提供数据驱动的观测似然,而基于Radon的方向估计无需角度标注即可实现视点感知。在训练过程中,基于信息增益的奖励引导近端策略优化(PPO)智能体离线学习信念感知的视点选择策略。在部署时,学习到的策略仅使用基于CNN的信念更新进行实时视点选择,无需计算昂贵的在线POMDP树搜索。在海洋垃圾前视声纳数据集上的实验表明,与基于模型的基线方法相比,所提方法在减少感知步数和运动成本的同时实现了具有竞争力的识别精度。

English abstract

This paper presents a hybrid model-based and model-free framework for active multi-view target recognition using forward-looking sonar. A convolutional neural network (CNN) provides data-driven observation likelihoods, while Radon-based orientation estimation enables viewpoint-aware sensing without requiring angle annotations. During training, an information-gain-based reward guides a Proximal Policy Optimization (PPO) agent to learn a belief-aware viewpoint selection policy offline. At deployment, the learned policy performs real-time viewpoint selection using only CNN-based belief updates, eliminating the need for computationally expensive online POMDP tree search. Experiments on a marine-debris forward-looking sonar dataset demonstrate that the proposed approach achieves competitive recognition accuracy while reducing sensing steps and motion cost compared to model-based baselines.
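The information-gain-based reward mentioned in the abstract above can be sketched, under assumptions, as the entropy reduction of a class belief after a Bayesian update with per-class observation likelihoods (which, in the paper, a CNN provides). The exact reward used by the authors is not given in the abstract; the function names and the optional motion-cost penalty below are hypothetical.

```python
import math


def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief over target classes."""
    return -sum(p * math.log(p) for p in belief if p > 0.0)


def bayes_update(belief, likelihoods):
    """Posterior over classes given per-class observation likelihoods."""
    unnormalized = [p * l for p, l in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        # Observation impossible under every class: keep the prior unchanged.
        return list(belief)
    return [u / total for u in unnormalized]


def information_gain_reward(belief, likelihoods, motion_cost=0.0, cost_weight=0.0):
    """Entropy reduction from one observation, minus an assumed motion penalty."""
    posterior = bayes_update(belief, likelihoods)
    reward = entropy(belief) - entropy(posterior) - cost_weight * motion_cost
    return reward, posterior
```

With a uniform two-class belief, an observation with likelihoods `[0.9, 0.1]` gives a positive reward, while an uninformative observation with likelihoods `[0.5, 0.5]` gives a reward of zero.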

2606.15514 2026-06-16 cs.RO cs.LG New submission

Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

强化学习引导的软融合检索用于缺失模态下的鲁棒多模态模仿学习

Hassan Ismkhan, Hamid Bouchahcia

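Affiliations: Bournemouth University

AI summary: Proposes RL4IL, which uses a reinforcement learning policy to retrieve the most relevant expert demonstrations from a training library and fuses them via soft cross-attention to generate actions, handling missing sensors effectively and outperforming existing methods on the LIBERO benchmarks.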

AI-generated Chinese abstract

机器人系统通过多种输入模态感知世界——包括视觉摄像头流和自然语言指令——并必须基于这些信号选择适当的动作。然而,假设所有输入设备永久可用是不现实的,因为在部署过程中传感器可能失效、被遮挡或完全丢失。因此,鲁棒处理此类缺失模态场景对于真实世界的机器人操作至关重要。本文介绍了RL4IL,一种强化学习引导的模仿学习方法,通过从训练库中识别最相关的专家演示,为给定观测选择最合适的动作。一个强化学习策略,通过基于广度优先搜索候选集的近端策略优化进行训练,对候选演示进行排序,一个软交叉注意力融合头聚合它们的动作信号以产生最终预测。当推理时模态缺失时,一个专用的每模态RL检索策略从训练库中识别捐赠演示,一个软插补头通过交叉注意力在排名靠前的捐赠者上重建缺失嵌入——无需对系统进行任何重新训练。在三个LIBERO基准套件上的实验表明,RL4IL在传感器丢失条件下显著优于最先进的模仿学习方法,同时无需策略网络训练。代码可在https://github.com/h-ismkhan/Reinforcement-Learning-via-kNN-for-Robotic-Learning-with-Missing-Camera找到。

English abstract

Robotic systems perceive the world through multiple input modalities -- including visual camera streams and natural language instructions -- and must select appropriate actions based on these signals. However, assuming the permanent availability of all input devices is unrealistic, as sensors may fail, become occluded, or drop out entirely during deployment. Robust handling of such missing-modality scenarios is therefore essential for real-world robot operation. This paper introduces RL4IL, a reinforcement learning guided method for imitation learning that selects the most suitable action for a given observation by identifying the most relevant expert demonstrations from a training library. A reinforcement learning policy, trained via Proximal Policy Optimisation over Breadth-First Search candidate sets, ranks candidate demonstrations and a soft cross-attention fusion head aggregates their action signals to produce the final prediction. When a modality is missing at inference time, a dedicated per-modality RL retrieval policy identifies donor demonstrations from the training library, and a soft imputation head reconstructs the missing embedding via cross-attention over the top-ranked donors -- without requiring any retraining of the system. Experiments on three LIBERO benchmark suites demonstrate that RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor dropout conditions, while requiring no policy network training. The code can be found at https://github.com/h-ismkhan/Reinforcement-Learning-via-kNN-for-Robotic-Learning-with-Missing-Camera
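The soft fusion step described in the abstract above, in which action signals from retrieved demonstrations are aggregated by cross-attention, might look roughly like the following. This is a minimal sketch using plain scaled dot-product attention; it omits the RL ranking policy entirely and is not the RL4IL implementation. The name `soft_fusion` and its arguments are hypothetical.

```python
import math


def soft_fusion(query, demo_keys, demo_actions, temperature=1.0):
    """Fuse candidate demonstration actions with softmax attention weights.

    query: embedding of the current observation.
    demo_keys: embeddings of the retrieved demonstrations.
    demo_actions: one action vector per retrieved demonstration.
    """
    scale = math.sqrt(len(query)) * temperature
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in demo_keys]
    # Stable softmax over the candidate demonstrations.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(demo_actions[0])
    # The fused action is the attention-weighted average of candidate actions.
    fused = [sum(w * action[d] for w, action in zip(weights, demo_actions))
             for d in range(dim)]
    return fused, weights
```

The same weighted-average pattern could, in principle, reconstruct a missing embedding from donor demonstrations by passing donor embeddings in place of `demo_actions`, though the abstract does not specify how the imputation head is built.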

2606.15587 2026-06-16 cs.RO New submission

Perfect Demo Makes Poor Teacher: Learning Robust Alignment from Critical Motion Segments

完美演示造就差劲教师:从关键运动片段学习鲁棒对齐

Mingyu Liu, Zeju Li, Jiuhe Shu, Hanqing Wang, Yuhao Chao, Hao Chen, Chunhua Shen

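Affiliations: Zhejiang University; Shanghai Innovation Institute; Hong Kong University of Science and Technology (GZ); Nanjing University

AI summary: Addresses the problem that fluent demonstrations in fine-grained manipulation compress critical alignment motions and leave policies under-supervised, proposing data-level resampling and the representation-level STAIR feature, which uses dense motion-aware supervision to improve policy robustness.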

AI-generated Chinese abstract

专家演示被广泛认为是机器人模仿学习的黄金标准。然而,对于插入、堆叠和对齐等精细操作,我们发现了一个反直觉的失败模式:流畅的演示可能是差劲的教师。熟练的遥操作员将对齐和恢复的决定性时刻压缩到一个短暂的时间窗口内,导致策略被冗余的自由空间运动淹没,并在精度决定成功的关键区域缺乏监督。我们在两个层面解决这一瓶颈。在数据层面,靠近对齐时减速和对关键片段重采样都有帮助,但收益主要来自拓宽策略必须学习的恢复状态的覆盖范围,而非重新加权已有的帧。然而,这种数据层面的修复并未触及策略的逐帧视角:单个图像仍然直接映射到动作,控制修正的局部运动仍然隐式。因此,我们转向表示层面,引入STAIR(时空特征作为机器人学习接口),这是一种紧凑的动态特征,连接视觉-语言模型和动作专家,将每个轨迹中已记录的短视运动蒸馏为稠密的、运动感知的监督。仅使用流畅数据训练,STAIR恢复了大部分精心演示带来的增益(总体从50.0%提升至62.2%,接近精心演示的64.4%)。这些结果呼吁对机器人数据采取更具教学性的视角,优化机器的可学习性而非仅考虑人类效率。

English abstract

Expert demonstrations are widely assumed to be the gold standard for robot imitation learning. Yet for fine-grained manipulation such as insertion, stacking, and alignment, we uncover a counterintuitive failure mode: fluent demonstrations can be poor teachers. A skilled teleoperator compresses the decisive moments of alignment and recovery into a brief temporal window, leaving the policy flooded with redundant free-space motion and starved of supervision exactly where precision determines success. We address this bottleneck at two levels. At the data level, slowing down near alignment and resampling critical segments both help, yet the gain comes mainly from broadening the coverage of recovery states the policy must learn, not from reweighting frames it already has. Such data-side fixes, however, leave the policy's per-frame view untouched: a single image still maps directly to an action, and the local motion that governs correction stays implicit. We therefore turn to the representation level and introduce STAIR (Spatio-Temporal feature As an Interface for Robot learning), a compact dynamic feature that bridges the vision-language model and the action expert, distilling the short-horizon motion already recorded in each trajectory into dense, motion-aware supervision. Trained on fluent data alone, STAIR recovers most of the deliberate-demonstration gain (50.0% to 62.2% overall, approaching the 64.4% of deliberate demonstrations). These results call for a more pedagogical view of robot data, optimized for machine learnability rather than human efficiency alone.

2606.15685 2026-06-16 cs.RO cs.CV New submission

Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning

通过可复用技能学习新任务:面向具身持续学习的技能组合专家

Shuaike Zhang, Shaokun Wang, Haoyu Tang, Jianlong Wu, Liqiang Nie

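Affiliations: Harbin Institute of Technology, Shenzhen; Shandong University; Shenzhen Loop Area Institute

AI summary: Proposes the Skill-Compositional Experts (SCE) framework, which decomposes demonstrations into reusable skills via Compositional Skill Grounding (CSG) and learns new tasks with Dual Execution-and-Transition Experts (DETE), mitigating catastrophic forgetting in embodied continual learning.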

Comments 13 pages, 5 figures

AI-generated Chinese abstract

具身持续学习(ECL)旨在使机器人能够在闭环控制下持续获取新的操作任务,同时保留先前学习的行为。与传统的持续学习相比,ECL遭受更严重的灾难性遗忘。在闭环控制下累积的特征漂移通过顺序决策逐步传播,导致先前学习的行为退化。ECL中的一个关键挑战在于如何在不断演变的任务中进行结构化的技能复用,因为现有方法主要关注技能学习,而没有明确组织它们以执行连贯的任务。为了解决这个问题,我们提出了SCE,一个用于ECL的技能组合专家框架。SCE通过组合技能基础(CSG)构建技能库,将任务演示分解为可复用的技能。在此基础上,双执行-转换专家(DETE)通过技能组合实现新任务学习,其中一个分支确保技能执行,另一个支持技能之间的转换以实现连贯行为。在LIBERO基准测试和真实世界操作任务上的实验表明,SCE持续提高了保留率和整体任务性能。进一步的特征漂移分析和消融研究验证了我们方法的有效性。项目网站:https://eqcy.github.io/sce/。

English abstract

Embodied Continual Learning (ECL) aims to enable robots to continually acquire new manipulation tasks while retaining previously learned behaviors under closed-loop control. Compared with conventional continual learning, ECL suffers from more severe catastrophic forgetting. Feature drift accumulated under closed-loop control progressively propagates through sequential decision-making, leading to degradation of previously learned behaviors. A key challenge in ECL lies in structured skill reuse across continually evolving tasks, since existing methods primarily focus on skill learning without explicitly organizing them for coherent task execution. To address this issue, we propose SCE, a Skill-Compositional Experts framework for ECL. SCE builds a skill base via Compositional Skill Grounding (CSG), which decomposes task demonstrations into reusable skills. Based on this, Dual Execution-and-Transition Experts (DETE) enable new task learning through skill composition, where one branch ensures skill execution and the other supports transitions between skills for coherent behavior. Experiments on LIBERO benchmarks and real-world manipulation tasks demonstrate that SCE consistently improves retention and overall task performance. Further feature drift analyses and ablation studies verify the effectiveness of our method. Project website: https://eqcy.github.io/sce/.

2606.16178 2026-06-16 cs.RO New submission

Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks

面向长时任务的视觉运动策略短期记忆扩展

Rutav Shah, Rajat Kumar Jenamani, Xiaohan Zhang, Lingfeng Sun, Roberto Martín-Martín, Yuke Zhu, Deva Ramanan, Karl Schmeckpeper

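Affiliations: University of California, Berkeley; Stanford University; Toyota Research Institute

AI summary: Proposes the PRISM architecture, which uses gated attention and hierarchical compression to scale the short-term memory of visuomotor policies to two minutes, outperforming existing methods on the ReMemBench benchmark by 5%-12%.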

Comments 14 pages, 9 Figures, 8 Tables

AI-generated Chinese abstract

许多机器人任务需要短期记忆,无论是检索不再可见的物体,还是在设定时间后关闭电器。然而,大多数通过模仿学习训练的视觉运动策略仅依赖即时感官输入,而不使用过去经验来指导决策。我们提出了PRISM,一种基于Transformer的视觉运动策略架构,通过两个关键组件有效利用短期记忆:(i) 门控注意力,过滤检索到的信息以抑制无关细节,通过减少历史与当前动作预测之间的虚假相关性来提高性能;(ii) 层次化架构,首先将局部信息压缩为紧凑令牌,然后整合它们以捕获时间上扩展的依赖关系,改善计算和内存占用。这些机制共同使我们将视觉运动策略的短期记忆扩展到长达两分钟。为了系统评估视觉运动控制中的记忆,我们引入了ReMemBench——一个包含八种多样化家务操作任务的基准,涵盖四类短期记忆——旨在促进通用记忆机制而非孤立的、特定任务的解决方案。PRISM持续优于先前的工作,包括循环架构、Transformer及其变体——在最强基线上实现了5%–12%的绝对改进。在RoboCasa和LIBERO基准上,尽管没有利用任何大规模预训练,它相对于其无记忆变体以及微调的视觉-语言-动作基线(如GR00T-N1-3B和OpenVLA)实现了11%–15%的绝对改进。PRISM和ReMemBench共同为开发和评估可扩展到长时任务的短期记忆增强视觉运动策略奠定了基础。更多资料请访问https://shahrutav.github.io/short-term-memory。

English abstract

Many robotic tasks require short-term memory, whether it's retrieving an object that's no longer visible or turning off an appliance after a set period. Yet, most visuomotor policies trained via imitation learning rely only on immediate sensory input without using past experiences to guide decisions. We present PRISM, a transformer-based architecture for visuomotor policies to effectively use short-term memory via two key components: (i) gated attention, which filters retrieved information to suppress irrelevant details, improving performance by reducing the spurious correlations between the history and current action prediction, (ii) a hierarchical architecture that first compresses local information into compact tokens and then integrates them to capture temporally extended dependencies, improving its compute and memory footprint. Together, these mechanisms enable us to scale short-term memory in visuomotor policies for up to two minutes. To systematically evaluate memory in visuomotor control, we introduce ReMemBench -- a benchmark of eight diverse household manipulation tasks spanning four categories of short-term memory -- designed to foster general memory mechanisms rather than siloed, task-specific solutions. PRISM consistently outperforms prior works, including recurrent architectures, transformers, and their variants -- achieving an absolute improvement of 5%--12% over the strongest baseline. On the RoboCasa and LIBERO benchmarks, it achieves absolute improvements of 11%--15% over its no-memory variant and fine-tuned Vision-Language-Action baselines such as GR00T-N1-3B and OpenVLA, despite not leveraging any large-scale pretraining. Together, PRISM and ReMemBench establish a foundation for developing and evaluating short-term memory-augmented visuomotor policies that scale to long-horizon tasks. Additional materials are available at https://shahrutav.github.io/short-term-memory

2606.16458 2026-06-16 cs.RO New submission

RHO: Your Coding Agent is Secretly a Roboticist

RHO:你的编码代理其实是个机器人专家

Karim Elmaaroufi, Justin Svegliato, Sarunas Kalade, Graham Schelle, Sanjit A. Seshia, Matei Zaharia

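Affiliations: University of California, Berkeley; AMD

AI summary: Proposes the RHO paradigm, which searches at training time for neurosymbolic multi-file policy repositories, enabling efficient zero-shot generalization on robot tasks and reaching success rates of 45% on LIBERO-PRO and 70% on Robosuite, well above existing methods.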

Comments 46 pages, 9 figures, 15 tables. Project page: https://rho-robotics.github.io

AI-generated Chinese abstract

代码即策略(CaP)表明,大型语言模型(LLM)可以通过组合感知、规划和控制原语来编写代码解决机器人任务。然而,最近的CaP系统在测试时依赖多轮代码生成循环,这对于实时机器人控制通常不可行。我们引入了机器人学优化(RHO),这是一种新颖的范式,其中支持工具编码的代理在训练时提出并搜索可解释的、神经符号化的多文件策略库(仓库即策略),这些库组合这些原语,而不是单个提示、函数或文件。RHO通过环境奖励和执行的反思性反馈进行搜索,而不是通过遥操作演示。它泛化到受扰动的拾取和放置场景,如LIBERO-PRO,其中OpenVLA得分为0.0%,π_{0.5}平均为12.83%。使用相同的低级原语,RHO达到45.0%的成功率,比最强的多轮代理系统高2.5倍,比π_{0.5}高3.5倍。在Robosuite上,RHO以70.0%的成绩创造了新的最先进水平,超过了之前多轮记录的68.29%,且部署时无需纠正性LLM代码编辑。当在控制循环中使用LLM时,如在RAI的O3DE基准测试中,RHO优化了部署代理的多文件提示、工具和控制代码,将保留成功率从23.5%提高到44.3%,同时减少了20%的墙钟时间和27%的工具调用。

English abstract

Code-as-Policies (CaP) has shown that large language models (LLMs) can write code to solve robotics tasks by composing perception, planning, and control primitives. Recent CaP systems, however, rely on multi-turn code-generation loops at test time, which is often infeasible for real-time robot control. We introduce Robotics Harness Optimization (RHO), a novel paradigm in which tool-enabled coding agents, at training time, propose and search for interpretable, neurosymbolic multi-file policy repositories (Repositories-as-Policies) that compose these primitives rather than a single prompt, function, or file. RHO searches with reflective feedback from environment reward and execution rather than teleoperation demonstrations. It generalizes to perturbed pick-and-place settings like LIBERO-PRO, where OpenVLA scores 0.0% and $π_{0.5}$ averages 12.83%. Using the same low-level primitives, RHO reaches a 45.0% success rate, 2.5x higher than the strongest multi-turn agentic system, and 3.5x higher than $π_{0.5}$. On Robosuite, RHO sets a new state-of-the-art of 70.0%, exceeding the prior multi-turn record of 68.29% using single-turn execution with no corrective LLM code edits at deployment. When an LLM is used in the control loop, as on RAI's O3DE benchmark, RHO optimizes the deployed agent's multi-file harness of prompts, tools, and control code, improving held-out success from 23.5% to 44.3% with 20% less wall-clock time and 27% fewer tool calls.

2606.16572 2026-06-16 cs.RO New submission

Steering Generative Reinforcement Learning into Stable Robotic Controller

将生成式强化学习引导至稳定机器人控制器

Yixuan Wang, Shutong Ding, Ke Hu, Tianxiang Gui, Jingya Wang, Ye Shi

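Affiliations: ShanghaiTech University

AI summary: Proposes the SteerGenPO framework, which uses latent-space reinforcement learning to turn a trained generative policy into a robust deterministic robot controller, outperforming baselines on Isaac Lab and Unitree G1 tasks with more stable inference-time behavior.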

AI-generated Chinese abstract

基于扩散和流的生成式策略通过迭代动作生成诱导丰富的随机探索,为强化学习提供了强大的策略类。然而,扩散策略的随机性不适用于高维机器人系统中的稳定精确控制,其中小的动作变化可能累积为不一致的运动并降低鲁棒性。为解决此问题,我们提出SteerGenPO,一种潜在空间强化学习框架,将训练好的生成式策略引导为鲁棒的确定性机器人控制器。关键思想是用学习到的潜在演员替换训练好的生成式策略的随机潜在采样,该潜在演员为生成式策略预测状态相关的潜在输入。这分离了探索和控制:随机生成采样在策略学习期间提供多样化的动作提议,而确定性潜在引导在部署时提供稳定和自适应的控制。我们在六个Isaac Lab基准测试和一个Unitree G1运动任务上评估了SteerGenPO。结果表明,SteerGenPO在经典RL和生成式RL基线上均有改进,同时其确定性潜在引导产生更稳定的推理时行为和更可靠的命令响应。

English abstract

Diffusion and flow-based generative policies provide a powerful policy class for reinforcement learning by inducing rich stochastic exploration through iterative action generation. However, the stochasticity of diffusion policies is not suitable for stable and precise control in high-dimensional robotic systems, where small action variations can accumulate into inconsistent motion and reduced robustness. To address this issue, we propose SteerGenPO, a latent-space reinforcement learning framework that steers a trained generative policy into a robust deterministic robotic controller. The key idea is to replace stochastic latent sampling of the trained generative policy with a learned latent actor that predicts a state-dependent latent input for the generative policies. This separates exploration and control: stochastic generative sampling provides diverse action proposals during policy learning, while deterministic latent steering provides stable and adaptive control at deployment. We evaluate SteerGenPO on six Isaac Lab benchmarks and a Unitree G1 locomotion task. The results show SteerGenPO improves over both classical RL and generative RL baselines, while its deterministic latent steering produces more stable inference-time behaviors and more reliable command responses.

2606.16856 2026-06-16 cs.RO New submission

Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning

基于视频的最优传输用于反馈高效的离线偏好强化学习

Tung M. Luu, Hwanhee Kim, Younghwan Lee, Chang D. Yoo

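AI summary: Proposes the VOTP framework, which uses video foundation models and optimal transport to generate pseudo-labels, learning effective reward functions from only a small amount of human feedback and substantially reducing labeling cost.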

Comments ICML 2026 (Oral)

AI-generated Chinese abstract

向强化学习智能体传达复杂目标通常需要精心的奖励工程。偏好强化学习(PbRL)通过从人类反馈中学习奖励函数提供了一种有前景的替代方案,但其可扩展性受到高标注成本的阻碍。受视频基础模型(ViFMs)进展的启发,我们提出了基于视频的最优传输偏好(VOTP),这是一个半监督框架,仅需少量标签即可学习有效的奖励函数。通过利用最优传输在ViFMs的丰富表示空间中对齐视觉轨迹,VOTP有效地为大量未标注数据生成高保真伪标签,大幅减少了人类监督。在运动控制和操作基准上的大量实验证明了VOTP的优越性,在有限的反馈预算下,其性能优于最先进的离线PbRL方法。我们还展示了VOTP在视觉干扰存在时的鲁棒性,并在真实机器人任务上验证了其实用性,其中它以最少的人类输入学习了有意义的奖励。

English abstract

Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by high labeling costs. Inspired by advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised framework that learns effective reward functions from only a handful of labels. By leveraging optimal transport to align visual trajectories within the rich representation space of ViFMs, VOTP effectively generates high-fidelity pseudo-labels for large amounts of unlabeled data, substantially reducing human supervision. Extensive experiments across locomotion and manipulation benchmarks demonstrate the superiority of VOTP, which outperforms state-of-the-art offline PbRL methods under limited feedback budgets. We also showcase the robustness of VOTP in the presence of visual distractors and validate its utility on real robotic tasks, where it learns meaningful rewards with minimal human input.
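To make the optimal-transport alignment in the abstract above more concrete, here is a small sketch that compares two trajectories of frame embeddings with entropy-regularized optimal transport (Sinkhorn iterations). The abstract does not say which solver VOTP uses or how pseudo-labels are derived, so the Sinkhorn choice, the uniform frame weights, and the `pseudo_label` rule (prefer the segment closer to a known preferred reference) are all assumptions made for illustration. The embeddings would come from a video foundation model; here they are plain lists of floats.

```python
import math


def sinkhorn_plan(cost, epsilon=0.1, iterations=200):
    """Entropy-regularized transport plan between uniform marginals.

    Not log-domain stabilized: very large cost / epsilon ratios can underflow.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform weight on each frame of the first trajectory
    b = [1.0 / m] * m  # uniform weight on each frame of the second trajectory
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_distance(traj_a, traj_b, epsilon=0.1):
    """Transport cost between two trajectories of frame embeddings."""
    cost = [[sum((x - y) ** 2 for x, y in zip(fa, fb)) for fb in traj_b]
            for fa in traj_a]
    plan = sinkhorn_plan(cost, epsilon)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(cost)) for j in range(len(cost[0])))


def pseudo_label(segment_0, segment_1, preferred_reference):
    """Hypothetical rule: return 0 if segment_0 is closer to the reference, else 1."""
    d0 = ot_distance(segment_0, preferred_reference)
    d1 = ot_distance(segment_1, preferred_reference)
    return 0 if d0 <= d1 else 1
```

In practice a library solver with log-domain stabilization would be a safer choice than this hand-rolled loop, which is kept short only to show the structure of the computation.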

2606.16888 2026-06-16 cs.RO New submission

LOPAL: Local Performance-Aware Active Learning from Imperfect Demonstrations

LOPAL:基于局部性能感知的不完美演示主动学习

Johannes Heidersberger, Shail Jadav, Dongheui Lee

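Affiliations: Autonomous Systems Lab, Institute of Computer Technology, TU Wien; Institute of Robotics and Mechatronics, German Aerospace Center (DLR)

AI summary: Proposes LOPAL, which exploits local demonstration quality by encoding trajectories and quality assessments in a Gaussian mixture model and actively collecting corrective data through shared autonomy, improving task performance from imperfect demonstrations.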

Comments Accepted for publication in IEEE Robotics and Automation Letters (RAL), 2026

AI-generated Chinese abstract

从演示中学习(LfD)通过允许机器人直接从人类任务演示中学习,实现了直观的机器人技能获取。然而,当前方法通常未能解决由于次优和不一致的人类行为,演示质量在每个演示内部可能变化的问题。因此,我们引入了LOPAL(局部性能感知主动学习),一种利用这种局部演示质量信息的主动学习方法。我们的方法由两个协同组件组成。首先,一种局部性能驱动的LfD方法使用高斯混合模型(GMM)来编码演示轨迹及其相关的局部质量评估。这使得能够通过利用高性能的互补局部数据生成优于不完美演示的轨迹。其次,主动数据采集允许通过收集额外的信息样本来超越不完美演示。在缺乏良好数据的区域,通过共享自主权(SA)机制主动请求用户提供纠正,同时机器人自主执行学习的行为。LOPAL的有效性在仿真和真实世界实验中得到了验证。真实世界管道检查任务的结果表明,所提出的方法可以实现高达27.31%的任务性能提升,同时减少了收集演示所需的努力。

English abstract

Learning from Demonstration (LfD) enables intuitive robot skill acquisition by allowing robots to learn directly from human task demonstrations. However, current methods often fail to address the fact that, due to suboptimal and inconsistent human behavior, quality can vary within each demonstration. Therefore, we introduce LOPAL (LOcal Performance-aware Active Learning), an active learning approach that leverages this local demonstration quality information. Our approach consists of two synergistic components. First, a local performance-driven LfD method uses a Gaussian Mixture Model (GMM) to encode both the demonstrated trajectories and their associated local quality assessments. This enables the generation of trajectories that outperform the imperfect demonstrations by utilizing complementary local data of high performance. Second, active data acquisition makes it possible to improve beyond the imperfect demonstrations by collecting additional informative samples. In areas missing good data, the user is actively requested to provide corrections through a shared autonomy (SA) mechanism, while the robot autonomously executes the learned behavior. The efficacy of LOPAL was validated in both a simulation and a real-world experiment. The results from a real-world pipe inspection task showed that the proposed approach can achieve up to 27.31% improvement in task performance while also reducing the effort required to collect the demonstrations.

2606.17011 2026-06-16 cs.RO cs.LG New submission

ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

ROVE: 通过强化学习解锁人类干预用于人形机器人操作

Wei Xiao, Weiliang Tang, Yuying Ge, Hui Zhou, Yao Mu, Li Zhang, Yixiao Ge

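Affiliations: XPENG Robotics; Fudan University; The Chinese University of Hong Kong; Shanghai Jiao Tong University

AI summary: Proposes the ROVE framework, which uses reinforcement learning with optimistic value estimation to learn high-value behaviors from suboptimal human intervention trajectories, improving humanoid manipulation performance.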

AI-generated Chinese abstract

人类干预为视觉-语言-动作(VLA)模型的后训练提供了关键的纠正信号。然而,由于复杂的全身运动学和灵巧手控制,实现无缝的人形干预是一个严峻的系统挑战。因此,收集到的干预轨迹往往是次优的,依赖人类干预作为专家监督的方法可能会吸收犹豫、低效甚至错误的行为。为了解决系统和算法两方面的挑战,我们提出了ROVE,一个用于人形VLA后训练的强化学习框架,能够处理不完美的人类干预。首先,ROVE引入了一个人在环的流水线,能够收集人形操作中的部署和干预数据。其次,它利用乐观价值估计(OVE)从混合质量的轨迹中优先考虑高价值行为。为了进一步增强价值估计的鲁棒性,我们融入了跨具身的人类经验视频,为长尾失败和恢复模式提供丰富的监督。由此产生的评论家产生信息丰富的优势信号,引导VLA演员专注于高价值行为,而不是不加区分地模仿所有动作。在具有挑战性的真实世界接触密集和精细的人形操作任务中,ROVE优于基于经验学习的基线,并在多次部署-干预迭代中持续改进。

English abstract

Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to complex whole-body kinematics and dexterous-hand control. Consequently, the collected intervention trajectories are often suboptimal, and methods that rely on human interventions as expert supervision can absorb hesitant, inefficient, or even erroneous behaviors. To address both the system and algorithmic challenges, we propose ROVE, a reinforcement learning framework for humanoid VLA post-training with imperfect human interventions. First, ROVE introduces a human-in-the-loop pipeline capable of collecting deployment and intervention data for humanoid manipulation. Second, it utilizes Optimistic Value Estimation (OVE) to prioritize high-value behaviors from mixed-quality trajectories. To further robustify value estimation, we incorporate cross-embodiment human experience videos to provide rich supervision for long-tailed failure and recovery modes. The resulting critic yields informative advantage signals, steering the VLA actor to focus on high-value behaviors rather than indiscriminately imitating all actions. On challenging real-world contact-rich and fine-grained humanoid manipulation tasks, ROVE outperforms experience-learning baselines and consistently improves across multiple rollout-intervention iterations.

2606.17043 2026-06-16 cs.RO cs.LG New submission

Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

基于层级优势加权的在线RL微调VLA策略从稀疏回合结果

Tongyan Fang, Siyuan Huang, Naiyu Fang, Ganlong Zhao, Zhongjin Luo, Jianbo Liu, Xiaogang Wang, Ying Dong, Hongsheng Li

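Affiliations: ACE Robotics; Shenzhen International Graduate School, Tsinghua University; The Chinese University of Hong Kong

AI summary: Proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC), which separates the viability and efficiency objectives and balances them adaptively to solve credit assignment when fine-tuning VLA policies online from sparse binary outcomes, raising success rates on three contact-rich bimanual tasks from 12-44% to 38-92%.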

Comments Website: https://acerobotics-vla.github.io/HABC-Website

AI-generated Chinese abstract

当预训练的VLA策略通过在线RL进行微调时,每次 rollout 回合仅产生单个二元结果(成功或失败),但 actor 更新需要每个时间步的监督。现有方法通常将此稀疏结果简化为单个标量奖励或优势信号,这混淆了不同形式的过渡级反馈,并且在基本任务成功可实现后提供的指导有限。首先,单个标量信号混淆了生存性和效率这两个目标;一旦基本成功实现,二元标签无法提供梯度来区分高效完成与缓慢完成。其次,真实世界的 rollout 混合了自主段和干预段;天真地将回合结果跨这些边界分配会导致不正确的信用分配。为解决这些问题,我们提出层级优势加权行为克隆(HABC),该方法在不同数据子集上为这两个目标训练独立的评论家头,并通过状态自适应平衡组合其输出。状态自适应门 $g_t$ 合并它们的一步优势,在成功不确定时优先考虑生存性,仅在生存性高时转向效率,并将结果转换为 actor 损失上的每时间步权重。干预感知的信用分配进一步将结果标签限制在当前策略执行的段,防止监督跨干预边界泄漏。在三个接触丰富的双臂任务上的真实机器人实验中,HABC 将监督微调(SFT)基线的成功率从 36%、44% 和 12% 提升至 92%、88% 和 38%。

English abstract

When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate $g_t$ merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.
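One way to picture the state-adaptive gate $g_t$ described in the abstract above is the sketch below: a gate that stays near 0 while estimated viability is low (so the viability advantage dominates) and moves toward 1 once viability is high (so the efficiency advantage takes over), with the merged advantage turned into a per-transition weight. The abstract does not give the functional forms, so the sigmoid gate, its `threshold` and `sharpness`, the exponential weighting with `beta`, and the clip at `max_weight` are all assumptions, not the HABC definition.

```python
import math


def gate(viability, threshold=0.8, sharpness=10.0):
    """Assumed sigmoid gate in [0, 1]; near 0 when success is uncertain."""
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def transition_weight(adv_viability, adv_efficiency, viability,
                      beta=1.0, max_weight=20.0):
    """Merge two one-step advantages and map them to a loss weight.

    adv_viability / adv_efficiency: advantages from the two critic heads.
    viability: estimated probability that the episode can still succeed.
    """
    g = gate(viability)
    merged = (1.0 - g) * adv_viability + g * adv_efficiency
    # Exponential advantage weighting, clipped to keep the loss bounded.
    return min(math.exp(merged / beta), max_weight)


def weighted_bc_loss(per_transition_losses, weights):
    """Weighted average of per-transition behavior cloning losses."""
    return sum(l * w for l, w in zip(per_transition_losses, weights)) / sum(weights)
```

Under these assumptions, a transition with a good viability advantage but poor efficiency is up-weighted when viability is low and down-weighted when viability is already high, which matches the prioritization the abstract describes.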

2606.14752 2026-06-16 cs.CV cs.AI cs.LG cs.RO Cross-list

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

X-Tokenizer: 一种用于视觉-语言-动作预训练的多模态动作分词器

Xirui Kang, Yanpei Shi, Lucy Liang, Roy Gan, Dongxiu Liu, Pushi Zhang, Danpeng Chen, Xiaoyi Qin, Yinan Zheng, Jinliang Zheng, Hao Wang, Xianyuan Zhan, Hang Su

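Affiliations: Square Robot; City University of Hong Kong; Tsinghua University

AI summary: Proposes X-Tokenizer, which discretizes actions into a semantic interface via Semantic Residual Quantization (SRQ) and Masked Action Modeling (MAM); after pretraining on 2.4M trajectories, it improves the multimodal grounding and long-horizon task performance of VLA models.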

Comments Project page: https://x-square-robot.github.io/X-Tokenizer_projectPage/

AI-generated Chinese abstract

现代视觉-语言-动作(VLA)模型必须桥接预训练的视觉-语言推理和精确的连续机器人控制。现有的动作分词器主要为了重建而离散化动作,产生的编码保留了运动几何结构,但仅向主干网络提供弱语义监督。因此,我们将动作分词化不仅视为压缩,而是作为多模态推理与可执行控制之间的语义接口学习。为此,我们引入了X-Tokenizer,一种轻量级的编码器-语义残差量化(SRQ)-解码器架构,为多种机械臂形态提供共享的动作接口。其关键组件SRQ在残差向量量化上施加了非对称结构:第一层通过掩码动作建模(MAM)训练,形成捕获粗略运动意图的离散动作语言,而更深层则保持面向重建的残差,保留细粒度细节。为了进一步将动作标记与多模态语义对齐,X-Tokenizer通过与预训练基础模型的表示空间进行对比对齐以及下一帧视觉-语言特征预测进行预训练。在2.4M轨迹(2.0B动作帧)上预训练后,单个冻结的X-Tokenizer作为表示塑造的监督信号插入混合离散-连续VLA中。X-Tokenizer在真实世界聚合指标上达到最佳,并在RoboTwin 2.0模拟中表现强劲。在多模态接地(+13.5%)和长程任务(+8.25)上优于FAST,表明动作分词器作为VLA预训练的语义接口,而不仅仅是动作压缩。

英文摘要

Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.
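For readers unfamiliar with residual vector quantization, the base scheme that SRQ is described as building on, here is a small sketch. It shows only plain multi-level quantization: the masked-action-modeling objective for the first level is not modeled, and the toy codebooks `COARSE` and `FINE` are hypothetical values chosen for illustration.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(action, codebooks):
    """Quantize level by level; each level encodes the residual left by the previous levels."""
    residual, codes = list(action), []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct the action by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Toy codebooks (hypothetical): a coarse level and a fine level for 2-D actions.
COARSE = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
FINE = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1], [0.0, 0.0]]
```

In this picture the first code carries the coarse direction of motion and later codes only refine it, which is the asymmetry the abstract attributes to SRQ.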

2606.14801 2026-06-16 cs.LG cs.AI cs.RO 交叉投稿

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

QPILOTS:面向流策略的高效测试时Q引导

Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart

发表机构 * University of Toronto(多伦多大学); Vector Institute(向量研究所); LG Electronics(LG电子)

AI总结 提出QPILOTS方法,在推理时通过投影去噪中间状态到最终动作估计并计算评论家梯度来引导流匹配和扩散策略,无需修改原策略,在离线到在线RL基准上达到90%平均成功率。

Comments 10 pages, 7 figures

AI中文摘要

流匹配和扩散策略是表达力强的动作生成器,但使用时序差分强化学习(RL)优化它们仍然困难。有效的策略提取需要利用评论家的动作梯度,但通过多步去噪过程直接反向传播该信号可能数值不稳定。现有方法要么丢弃梯度信息,将策略蒸馏为更简单的单步动作器,要么随着评论家改进而重复微调去噪策略。我们提出QPILOTS,一种保持原策略不变并在推理时引导去噪过程的方法。在每个去噪步骤中,我们不是评估评论家对噪声中间动作(其中评论家预测不可靠),而是首先将该中间状态投影到最终干净动作的估计,并在那里计算评论家梯度。我们引入两种变体:QPILOTS-U使用快速单点近似,而QPILOTS-M通过学习的辅助网络绘制可微后验样本。在标准的离线到在线RL基准测试中,QPILOTS实现了最佳整体性能,在50个任务中达到平均90%的成功率。我们还应用QPILOTS引导一个大型、冻结的预训练视觉-语言动作(VLA)基础模型,在模拟的六个操作任务中优于或匹配先前的推理时方法。

英文摘要

Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.
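The projection idea can be sketched on a 1-D toy problem. This is not the authors' implementation: it assumes a linear flow-matching path, so that a one-step estimate of the clean action is `x + (1 - t) * v`, it ignores the Jacobian of that projection, and the velocity field, critic, and `scale` parameter are all hypothetical stand-ins.

```python
def velocity(x, t):
    """Toy stand-in for a frozen flow policy: transports any sample to the action 0.0."""
    return (0.0 - x) / max(1.0 - t, 1e-3)

def critic_grad(a, target=1.0):
    """Gradient of a toy critic Q(a) = -(a - target)^2 with respect to the action."""
    return -2.0 * (a - target)

def steered_sample(x0, steps=10, scale=0.0):
    """Euler integration of the flow, nudged by the critic gradient at the clean-action estimate."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        a_hat = x + (1.0 - t) * v  # one-step projection to the final clean action
        x = x + dt * (v + scale * critic_grad(a_hat))
    return x
```

With `scale=0.0` the sampler reproduces the unmodified policy; a positive `scale` shifts the final action toward higher critic values without touching the policy itself.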

2606.16286 2026-06-16 cs.LG cs.AI cs.RO 交叉投稿

FlowMPC: Improving Flow Matching policies with World Models

FlowMPC:利用世界模型改进流匹配策略

Chandon Hamel

发表机构 * Stanford University(斯坦福大学)

AI总结 提出FlowMPC框架,结合流匹配模仿策略与学习的世界模型,通过MPPI规划提升测试时性能,在ManiSkill操作任务中显著提高成功率。

AI中文摘要

流匹配(FM)是一种在多模态动作空间中进行行为克隆的强大方法[Jiang et al., 2025],但由于它没有直接训练以最大化期望回报,FM策略在测试时的表现仍有改进空间。本文研究学习的世界模型是否可以通过对策略提出的候选动作序列进行模型预测路径积分(MPPI)规划来改进FM策略。基于TD-MPC2 [Hansen et al., 2024],我引入了FlowMPC,这是一个将模仿学习的FM策略与学习的世界模型相结合的框架,用于ManiSkill操作任务[Tao et al., 2025]中的测试时规划。在PickCube和PickSingleYCB上,添加世界模型比单独使用FM策略提高了性能,尤其是在回合结束时的成功率方面有显著提升。这些结果表明,基于世界模型的规划可以有效地补充基于流的模仿策略,而无需修改FM训练目标。

英文摘要

Flow Matching (FM) is a powerful approach for behavior cloning in multimodal action spaces [Jiang et al., 2025], but because it is not trained to directly maximize expected return, there is still room to improve how FM policies act at test time. This work investigates whether a learned world model can improve FM policies by enabling Model Predictive Path Integral (MPPI) planning over candidate action sequences proposed by the policy. Building on TD-MPC2 [Hansen et al., 2024], I introduce FlowMPC, a framework that combines an imitation-learned FM policy with a learned world model for test-time planning in ManiSkill manipulation tasks [Tao et al., 2025]. Across PickCube and PickSingleYCB, adding the world model improved performance over the FM policy alone, with especially clear gains in end-of-episode success. These results suggest that world-model-based planning can effectively complement flow-based imitation policies without modifying the FM training objective.
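The core MPPI update, exponentially weighting candidate action sequences by their predicted return, can be sketched as follows. Here the candidates are assumed to come from the FM policy and the returns from world-model rollouts; both are passed in as plain lists, and the function name and `temperature` parameter are illustrative rather than taken from the paper.

```python
import math

def mppi_combine(candidates, returns, temperature=1.0):
    """Exponentially weight candidate action sequences by predicted return and average them."""
    best = max(returns)
    # Subtracting the best return keeps the exponentials numerically stable.
    weights = [math.exp((r - best) / temperature) for r in returns]
    total = sum(weights)
    horizon, dim = len(candidates[0]), len(candidates[0][0])
    plan = [[0.0] * dim for _ in range(horizon)]
    for seq, w in zip(candidates, weights):
        for t in range(horizon):
            for d in range(dim):
                plan[t][d] += (w / total) * seq[t][d]
    return plan
```

A lower temperature makes the plan follow the best-scoring candidate more closely, while a higher one averages more evenly across the policy's proposals.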

2606.16515 2026-06-16 cs.LG cs.AI cs.RO 交叉投稿

Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

基于组合子目标评分的方向条件策略用于在线目标条件强化学习

Swaminathan S K, Damiya Gondha, Theyanesh Eswaramoorthy Rajahkrishnan, Aritra Hazra

AI总结 提出方向条件策略(DCP),通过共享InfoNCE表示将目标达成分解为子目标评分和方向条件动作,理论证明方向充分性、训练与部署一致性及可控子空间失效条件,在九个环境中优于对比RL。

Comments 17 pages, Accepted to the 2nd Workshop on Compositional Learning at ICML 2026 (Seoul, South Korea)

AI中文摘要

Hamilton-Jacobi-Bellman理论表明,最优目标条件动作仅通过当前状态下目标距离的梯度依赖于目标,然而标准的在线GCRL仍然将演员网络条件于原始目标——当目标远离数据分布时,这是一个几何上无信息的信号。我们提出方向条件策略(DCP),一种完全在线的方法,将目标达成分解为两个共享一个InfoNCE表示ψ的组件:一个子目标评分步骤,选择与最终目标g在ψ空间中对齐的已访问状态z_t;以及一个方向条件演员,它消耗从ψ(s_t)到ψ(z_t)的单位方向d_t和幅度r_t。这两个组件联合训练,在部署时干净地分解(子目标评分被移除,而方向条件保留,用g代替z_t),并允许在相同的(d_t, r_t)接口上进行独立修改。我们证明了三个结果。首先,HJB下的方向充分性:在控制仿射动力学下,最优动作仅通过价值梯度依赖于目标。其次,一个定量界表明,在学习表示的温和条件下,并假设评分规则返回一个路径上的z_t,演员在训练和部署时的条件输入在表示误差和测地线松弛下是一致的。第三,一个可控子空间刻画了方向条件失效的情况。在九个环境中,DCP在大多数最终指标上优于对比RL,在操作和障碍物交互任务上提升最大;对学习到的ψ-距离景观的定性分析表明,对比表示表现为一种在线拟度量,编码环境拓扑,而唯一的失败案例(AntSoccer)定位到理论预期的学习梯度病理。

英文摘要

Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal, a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $ψ$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $ψ$-space, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ from $ψ(s_t)$ to $ψ(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and admit independent modification at the same $(d_t,r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor's conditioning input at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks; a qualitative analysis of the learned $ψ$-distance landscape shows the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates.
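The $(d_t, r_t)$ interface is simple to state in code. The sketch below computes the unit direction and magnitude between two embedding vectors; the function name and the `eps` guard against a zero-length difference are assumptions added for illustration.

```python
import math

def direction_and_magnitude(psi_s, psi_z, eps=1e-8):
    """Return the unit direction d_t and magnitude r_t from psi(s_t) to psi(z_t)."""
    diff = [z - s for s, z in zip(psi_s, psi_z)]
    r = math.sqrt(sum(x * x for x in diff))
    d = [x / max(r, eps) for x in diff]
    return d, r
```

Per the abstract, the same computation is used at deployment with the embedding of the final goal $g$ passed in place of the embedding of the subgoal $z_t$.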

2408.15919 2026-06-16 cs.RO 版本更新

ReMoBot: Retrieval-Based Few-Shot Imitation Learning for Mobile Manipulation with Vision Foundation Models

ReMoBot: 基于检索的少样本模仿学习用于移动操作与视觉基础模型

Yuying Zhang, Wenyan Yang, Francesco Verdoja, Ville Kyrki, Joni Pajarinen

发表机构 * School of Electrical Engineering, Aalto University, Espoo, Finland(阿尔托大学电气工程学院,埃斯波,芬兰)

AI总结 提出ReMoBot,一种基于检索的少样本模仿学习框架,利用视觉基础模型从演示中检索信息,解决移动操作任务中的部分可观测性和数据有限问题,在真实世界任务中取得高成功率。

AI中文摘要

模仿学习算法通常将演示提炼为参数化策略以模仿专家行为。然而,在数据有限和部分可观测的情况下(例如在自我中心的移动操作中),现有方法往往难以生成准确的动作。为了解决这些挑战,我们提出了ReMoBot,一种少样本、轨迹条件的模仿学习框架,它直接从演示中检索信息,以解决具有自我中心视觉观察的移动操作任务。利用视觉基础模型,ReMoBot通过结合状态级相似性、历史感知轨迹对齐和动作序列一致性来识别相关的专家演示,从而消除感知上相似观察的歧义。然后,智能体以完全无需训练的方式基于这些检索到的演示选择适当的控制命令。我们在波士顿动力Spot机器人上,在仿真和真实世界环境中评估了ReMoBot在三个移动操作任务上的表现。在仿真中对比了五种方法后,我们将我们的方法与两个直接在真实世界数据上训练(无仿真到真实迁移)的基线进行了比较。每个任务仅用20个演示,ReMoBot就优于基线,在Table Uncover(70%)和Gap Cover(80%)任务中取得了高成功率,同时在更具挑战性的真实世界Curtain Open任务中也展示了有前景的性能。此外,ReMoBot能够泛化到不同的机器人位置、物体尺寸和材料属性,突显了其在真实世界可变形移动操作中的鲁棒性。更多细节请访问:https://sites.google.com/view/remobot/home

英文摘要

Imitation learning (IL) algorithms typically distill demonstrations into parametric policies to mimic expert behavior. However, with limited data and partial observability, such as in egocentric mobile manipulation, existing methods often struggle to generate accurate actions. To address these challenges, we propose ReMoBot, a few-shot, trajectory-conditioned imitation learning framework that directly Retrieves information from demonstrations to solve Mobile manipulation tasks with ego-centric visual observations. Leveraging vision foundation models, ReMoBot identifies relevant expert demonstrations by combining state-level similarity, history-aware trajectory alignment, and action-sequence consistency to disambiguate perceptually similar observations. The agent then selects appropriate control commands based on these retrieved demonstrations in a fully training-free manner. We evaluate ReMoBot on three mobile manipulation tasks using a Boston Dynamics Spot robot in both simulation and real-world settings. After benchmarking five approaches in simulation, we compare our method with two baselines trained directly on real-world data without sim-to-real transfer. With only 20 demonstrations per task, ReMoBot outperforms the baselines, achieving high success rates in Table Uncover (70%) and Gap Cover (80%), while also showing promising performance on the more challenging Curtain Open task in the real-world setting. Furthermore, ReMoBot generalizes across varying robot positions, object sizes, and material properties, highlighting its robustness in real-world deformable mobile manipulation. Additional details are available at: https://sites.google.com/view/remobot/home

2506.20668 2026-06-16 cs.RO cs.LG 版本更新

DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy

DemoDiffusion: 使用预训练扩散策略的一次性人类模仿

Sungjae Park, Homanga Bharadhwaj, Shubham Tulsiani

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出DemoDiffusion方法,通过单次人类演示和预训练扩散策略,无需任务特定训练即可使机器人执行操作任务,在8项任务中平均成功率达83.8%。

Comments 11 pages. Published at ICRA 2026

AI中文摘要

我们提出DemoDiffusion,一种简单的方法,使机器人能够通过模仿单次人类演示来执行操作任务,无需任务特定训练或配对的人-机器人数据。我们的方法基于两个见解。首先,人类演示中的手部运动为机器人的末端执行器轨迹提供了有用的先验,我们可以通过运动学重定向将其转换为粗略的开环机器人运动轨迹。其次,虽然这种重定向的运动捕捉了任务的整体结构,但它可能无法很好地与上下文中的合理机器人动作对齐。为了解决这个问题,我们利用预训练的通用扩散策略来修改轨迹,确保它既遵循人类运动,又保持在合理机器人动作的分布内。与基于在线强化学习或配对的人-机器人数据的方法不同,我们的方法能够以最小的努力稳健地适应新任务和场景。在涵盖8种不同操作任务的实际实验中,DemoDiffusion实现了83.8%的平均成功率,而预训练策略为13.8%,运动学重定向为52.5%,甚至在预训练通用策略完全失败的任务上也取得了成功。项目页面:https://demodiffusion.github.io/

英文摘要

We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Unlike approaches based on online reinforcement learning or paired human-robot data, our method enables robust adaptation to new tasks and scenes with minimal effort. In real-world experiments across 8 diverse manipulation tasks, DemoDiffusion achieves an 83.8% average success rate, compared to 13.8% for the pre-trained policy and 52.5% for kinematic retargeting, succeeding even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/

2509.18428 2026-06-16 cs.RO cs.CV 版本更新

Latent Action Pretraining Through World Modeling

通过世界建模的潜在动作预训练

Bahey Tharwat, Yara Nasser, Ali Abouzeid, Ian Reid

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学); Alexandria University(亚历山大大学)

AI总结 提出LAWM框架,通过世界建模从无标签视频中学习潜在动作表征,实现跨任务、环境和本体的迁移学习,在LIBERO基准和真实场景中优于使用真实动作预训练的方法。

AI中文摘要

视觉-语言-动作(VLA)模型在遵循语言指令的机器人操作任务学习中越来越受欢迎。最先进的VLA模型,如OpenVLA和$\pi_{0}$,是在通过遥操作收集的大规模手动标注动作数据集上训练的。最近的方法,包括LAPA和villa-X,引入了潜在动作表示,通过建模帧间的抽象视觉变化,实现在无标签数据集上的无监督预训练。尽管这些方法展示了强大的结果,但它们的大模型尺寸使得在真实世界环境中部署具有挑战性。在这项工作中,我们提出了LAWM,一个模型无关的框架,通过世界建模从无标签视频数据中学习潜在动作表示,以自监督方式预训练模仿学习模型。这些视频可以来自机器人记录或人类使用日常物品执行动作的视频。我们的框架能够跨任务、环境和本体迁移所学知识。它在LIBERO基准和真实世界设置中优于使用真实机器人动作预训练的模型以及其他类似的预训练方法,同时在真实世界环境中高效且实用。

英文摘要

Vision-Language-Action (VLA) models have gained popularity for learning robotic manipulation tasks that follow language instructions. State-of-the-art VLAs, such as OpenVLA and $π_{0}$, were trained on large-scale, manually labeled action datasets collected through teleoperation. More recent approaches, including LAPA and villa-X, introduce latent action representations that enable unsupervised pretraining on unlabeled datasets by modeling abstract visual changes between frames. Although these methods have shown strong results, their large model sizes make deployment in real-world settings challenging. In this work, we propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way, by learning latent action representations from unlabeled video data through world modeling. These videos can be sourced from robot recordings or videos of humans performing actions with everyday objects. Our framework is able to transfer learned knowledge across tasks, environments, and embodiments. It outperforms models pretrained with ground-truth robot actions and other similar pretraining methods on the LIBERO benchmark and real-world setup, while being efficient and practical for real-world settings.

2602.13197 2026-06-16 cs.RO cs.CV cs.LG 版本更新

Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

模仿有效的方法:基于仿真过滤的人类视频模块化策略学习

Albert J. Zhai, Kuo-Hao Zeng, Jiasen Lu, Ali Farhadi, Shenlong Wang, Wei-Chiu Ma

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); Allen Institute for AI(艾伦人工智能研究所); University of Washington(华盛顿大学); Cornell University(康奈尔大学)

AI总结 提出Perceive-Simulate-Imitate框架,通过仿真过滤人类视频中的抓取-轨迹对,学习任务导向的抓取与后抓取运动策略,无需机器人数据即可实现鲁棒操作。

Comments Transactions on Machine Learning Research (TMLR)

AI中文摘要

通过观看人类视频学习操作技能的能力有潜力为机器人学习解锁新的高度可扩展数据源。本文研究抓取操作,其中任务涉及在抓取物体后执行各种后抓取运动。人类视频为学习后抓取运动提供了强信号,但对于学习先决的抓取行为帮助较小,尤其是对于没有类人手的机器人。一个有前景的方法是采用模块化策略设计,利用专用抓取生成器产生稳定抓取。然而,任意稳定抓取通常与任务不兼容,阻碍机器人执行期望的下游运动。为解决这一挑战,我们提出Perceive-Simulate-Imitate (PSI)框架,该框架使用通过仿真中配对抓取-轨迹过滤处理的人类视频运动数据来训练模块化操作策略。这一仿真步骤用抓取适用性标签扩展轨迹数据,从而允许对任务导向的抓取能力进行监督学习。通过真实世界实验,我们展示了该框架可以在没有任何机器人数据的情况下高效学习精确操作技能,相比直接使用抓取生成器,性能显著更鲁棒。

英文摘要

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

2606.10449 2026-06-16 cs.RO 版本更新

GuideWalk: Learning Unified Autonomous Navigation and Locomotion for Humanoid Robots across Versatile Terrains

GuideWalk: 面向人形机器人的统一自主导航与运动学习,适用于多种地形

Haoxuan Han, Chen Chen, Linao Gong, Xin Yang, Hao Hu, Junhong Guo, Zhicheng He, Yao Su, Fenghua He

发表机构 * Harbin Institute of Technology(哈尔滨工业大学); Leju Robotics(乐聚机器人)

AI总结 提出GuideWalk框架,通过可通行性感知导航引导与地形自适应运动教师蒸馏,实现人形机器人在复杂地形上的稳定导航与运动协调。

AI中文摘要

人形机器人已具备强大的运动能力,但在多种地形上的可靠导航仍然具有挑战性,因为避障必须与动态可行的运动协调。在这项工作中,我们提出了GuideWalk,一个统一的端到端框架,将可通行性感知的导航引导与地形自适应运动教师相结合,用于人形机器人导航。具体来说,我们引入了一个导航模块,提供明确的速度引导,将避障与地形条件解耦,从而能够在不同环境中进行鲁棒的规划。我们提出了一种复合教师蒸馏方案,其中目标导向的命令和动态一致的动作被聚合并蒸馏到单个策略中。为了进一步提高鲁棒性,蒸馏后的策略通过强化学习和辅助行为克隆目标进行微调,这促进了探索同时保留了期望的教师行为。实验表明,GuideWalk在保持稳定的人形运动的同时,实现了稳定有效的导航。

英文摘要

Humanoid robots have achieved strong locomotion capabilities, but reliable navigation on versatile terrains remains challenging because obstacle avoidance must be coordinated with dynamically feasible motion. In this work, we present GuideWalk, a unified end-to-end framework that integrates traversability-aware navigation guidance with a terrain-adaptive locomotion teacher for humanoid navigation. Specifically, we introduce a navigation module that provides explicit velocity guidance, decoupling obstacle avoidance from terrain conditions to enable robust planning across diverse environments. We propose a composite teacher distillation scheme, where goal-directed commands and dynamically consistent actions are aggregated and distilled into a single policy. To further improve robustness, the distilled policy is refined with reinforcement learning and an auxiliary behavior cloning objective, which promotes exploration while preserving desirable teacher behaviors. Experiments demonstrate that GuideWalk achieves effective navigation while maintaining stable humanoid locomotion.

2606.13053 2026-06-16 cs.RO cs.AI 版本更新

EV-WM: Event-Verified World Models for Long-Horizon Robotic Manipulation

EV-WM: 面向长时域机器人操作的事件验证世界模型

Kailin Wang, Haoxiang Jie, Yaoyuan Yan, Jiacheng Zhou, Zhiyou Heng

发表机构 * AI Lab, Country Garden Services Group(碧桂园服务集团AI实验室); Fudan University(复旦大学); Omni AI

AI总结 提出EV-WM框架,通过谓词基础的验证对预训练特征世界模型展开的候选未来进行评分,为长时域操作的规划提供可靠的任务进展信号。

AI中文摘要

预训练特征世界模型为机器人想象提供了有用的基础,但仅凭视觉或潜在预测并不能确定想象的未来是否满足任务相关谓词。长时域操作需要关系性、谓词级和物理基础的进展信号:物体是否移动,抽屉或接触状态是否改变,放置谓词是否满足,以及候选未来是否足够可靠以执行。我们引入了EV-WM,一种面向世界模型规划的谓词基础验证框架。EV-WM在预训练视觉特征空间中展开候选未来,将其解码为结构化事件状态,并使用任务进展、语义一致性、物理可行性和不确定性项进行评分。验证器指导基于采样的规划,门控候选动作,并在接触敏感的LIBERO酒架设置中,从PPO生成的提议中进行选择。在导航、可变形物体、墙壁约束和语言描述的操作研究中,EV-WM表明谓词基础的验证可以使特征空间世界模型规划更可解释,并更好地与任务进展对齐。

英文摘要

Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant predicates. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce EV-WM, a predicate-grounded verification framework for world-model planning. EV-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EV-WM shows that predicate-grounded verification can make feature-space world-model planning more interpretable and better aligned with task progress.

2606.13769 2026-06-16 cs.RO cs.CV cs.LG 版本更新

$μ_0$: A Scalable 3D Interaction-Trace World Model

$\mu_0$: 一种可扩展的3D交互轨迹世界模型

Seungjae Lee, Yoonkyo Jung, Jusuk Lee, Jonghun Shin, Amir Hossein Shahidzadeh, Yao-Chih Lee, H. Jin Kim, Jia-Bin Huang, Furong Huang

发表机构 * University of Maryland, College Park(马里兰大学帕克分校) Seoul National University(首尔大学)

AI总结 提出基于3D轨迹的可扩展世界模型$\mu_0$,通过预测交互点轨迹实现跨本体机器人学习,无需动作标签,性能媲美有监督模型。

AI中文摘要

能够捕捉动作如何引起物理变化的世界模型使得可扩展的机器人学习成为可能,而无需依赖特定本体的动作标签。像素空间视频模型提供了广泛的视觉先验,但将模型容量消耗在密集外观重建上,而直接动作模型则需要特定本体的标签,阻碍了可扩展性。我们提出$\mu_0$,一种基于3D轨迹的可扩展世界模型。$\mu_0$不是预测密集像素或直接建模动作,而是预测显著交互点(如物体、工具、手和接触区域)的平滑3D轨迹,从而产生一个紧凑、与本体无关的运动接口。为了能够从多样化的视频源进行训练,我们的TraceExtract系统通过选择关键点、构建全局对齐的轨迹以及将运动片段与层次化语言描述关联,自动提取3D监督。这种TraceExtract监督通过将预训练的视觉-语言骨干网络与模块化轨迹专家相结合来预训练$\mu_0$,其中轨迹专家通过B样条控制点表示每个查询并预测未来轨迹。实验表明,$\mu_0$在2D和3D轨迹预测方面均优于基线方法,包括轨迹预测模型和分词VLM方法。由于$\mu_0$是冻结且可重用的,它可以与动作专家配对用于下游机器人本体。尽管是无动作预训练,由此产生的轨迹条件策略在性能上与使用动作监督预训练的VLA模型(如$\pi_0$)相当。这些结果确立了3D轨迹作为跨本体操作的可扩展和可迁移表示。

英文摘要

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $μ_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $μ_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $μ_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $μ_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $μ_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $π_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
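To illustrate why B-spline control points give a compact and smooth trace representation, the sketch below samples a curve from a handful of 3D control points. The abstract does not state the spline degree or knot layout; the uniform cubic form and the function names used here are assumptions.

```python
def cubic_bspline_point(ctrl, u):
    """Evaluate one uniform cubic B-spline segment from 4 control points at u in [0, 1]."""
    b = [(1 - u) ** 3 / 6.0,
         (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
         (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
         u ** 3 / 6.0]
    dim = len(ctrl[0])
    return [sum(b[i] * ctrl[i][d] for i in range(4)) for d in range(dim)]

def trace(ctrl_points, samples_per_segment=4):
    """Sample a smooth 3D trace from a list of control points (at least 4 are needed)."""
    pts = []
    for s in range(len(ctrl_points) - 3):
        seg = ctrl_points[s:s + 4]
        for k in range(samples_per_segment):
            pts.append(cubic_bspline_point(seg, k / samples_per_segment))
    return pts
```

A model that outputs a few control points per query therefore predicts an entire continuous trajectory, and the spline basis guarantees smoothness regardless of the predicted values.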

2602.17997 2026-06-16 cs.LG cs.RO 版本更新

Whole-Brain Connectomic Graph Model Enables Whole-Body Locomotion Control in Fruit Fly

全脑连接组图模型实现果蝇全身运动控制

Zehao Jin, Yaoye Zhu, Chen Zhang, Yanan Sui

发表机构 * Tsinghua University(清华大学)

AI总结 提出Fly-connectomic Graph Model,将果蝇全脑连接组作为图结构控制器,通过深度强化学习驱动仿真果蝇运动,在多种任务中表现稳定且样本效率优于基线。

AI中文摘要

动物在由全脑连接塑造的神经系统控制下执行协调的全身运动。全脑神经连接(即连接组)的映射为建模感觉运动信息流提供了天然的图结构,但其作为具身智能体神经控制器的潜力尚未被充分探索。本文介绍了Fly-connectomic Graph Model,该模型直接将成年果蝇的全脑连接组实例化为图结构神经控制器,通过深度强化学习驱动仿真生物力学果蝇的运动。我们在多种运动任务中实现了稳定的性能,并且与图和非图基线相比,样本效率更高。我们的结果展示了一种通过将全脑布线原理转化为可操作的架构先验来设计有效控制策略的生物启发式方法,同时通过动态信息流提高了可解释性。这项工作还通过提供一个计算平台来研究动物行为背后的感觉运动转换,以及一种推动更贴近自然的智能系统发展的范式,强调了连接神经力学与具身智能的潜力。

英文摘要

Animals perform coordinated whole-body movements under the control of neural systems shaped by brain-wide connectivity. The mapping of the whole-brain neural connections, or the connectomes, provides a natural graph for modeling sensorimotor information flow, yet its potential as a neural controller for embodied agents remains largely unexplored. Here, we introduce the Fly-connectomic Graph Model, which directly instantiates the whole-brain connectome of an adult Drosophila as a graph-structured neural controller for movements of a simulated biomechanical fruit fly via deep reinforcement learning. We achieve stable performance across diverse locomotion tasks, as well as better sample efficiency compared to both graph and non-graph baselines. Our results demonstrate a biologically informed way towards effective control policy design by translating whole-brain wiring principles into actionable architectural priors, while also improving the interpretability through dynamic information flow. This work also highlights the potential to bridge neuromechanics with embodied intelligence by providing a computational platform for investigating the sensorimotor transformation underlying animal behavior and a paradigm to advance the development of more nature-aligned intelligent systems.

2. 运动规划、控制与动力学 20 篇

2606.14763 2026-06-16 cs.RO cs.LG math.OC 新提交

Bayesian Optimization for Learning Nonlinear MPC in Autonomous Agent Navigation

自主智能体导航中学习非线性模型预测控制的贝叶斯优化

Lorenzo Ortolani, Gabriel Voss, Gabriele Beltrami, Francesco Dorati, Tommaso Felice Banfi

发表机构 * Talos Robotics AI

AI总结 提出一种无地图框架,结合滚动时域规划与非线性MPC,利用贝叶斯优化自动调参,在仿真和实物四足机器人上实现高效导航。

Comments Published at the IEEE ICRA 2026 Xplore Workshop (Oral), Cross-Disciplinary aspects of Exploration in Robotics, Reinforcement Learning, and Search

AI中文摘要

在动态未知环境中的实时自主导航仍然是移动机器人领域的一个基本挑战。我们提出了一种无地图框架,该框架紧密集成了反应式滚动时域规划与非线性模型预测控制(MPC)。在每个控制周期,构建基于激光雷达的高斯占据表示,并通过A*搜索生成无碰撞轨迹,随后由采用平滑sigmoid障碍屏障的CasADi/IPOPT MPC公式进行跟踪。为了提高对参数敏感性的鲁棒性,我们采用基于树结构Parzen估计器(TPE)的离线贝叶斯优化方案,该方案针对复合导航目标识别出接近最优的控制器参数。此外,使用高斯过程代理分析参数敏感性,并深入了解优化景观。所提出的框架与机器人无关,在仿真中使用Gazebo在Unitree Go2四足机器人上进行评估,随后部署到实体机器人上。实验结果表明,在仿真中调优的参数能有效迁移到硬件上,无需额外调优即可保持相当的性能。完整系统在部署时实现了高达90.0%的导航成功率,并且在仿真环境中评估指标平均提升38.9%。

英文摘要

Real-time autonomous navigation in dynamic, unknown environments remains a fundamental challenge for mobile robotics. We propose a map-free framework that tightly integrates reactive rolling-horizon planning with nonlinear Model Predictive Control (MPC). At each control cycle, a LiDAR-based Gaussian occupancy representation is constructed and used to generate collision-free trajectories via A* search, which are then tracked by a CasADi/IPOPT MPC formulation incorporating a smooth sigmoid obstacle barrier. To improve robustness to parameter sensitivity, we adopt an offline Bayesian optimization scheme based on Tree-structured Parzen Estimators (TPE), which identifies near-optimal controller parameters with respect to a composite navigation objective. In addition, a Gaussian Process surrogate is used to analyze parameter sensitivity and provide insight into the optimization landscape. The proposed framework is robot-agnostic and is evaluated on the Unitree Go2 quadruped in simulation using Gazebo, followed by deployment on the physical robot. Experimental results show that parameters tuned in simulation transfer effectively to hardware, maintaining comparable performance without additional tuning. The full system achieves up to a 90.0% navigation success rate when deployed, along with a 38.9% average improvement in the evaluation metrics across simulated environments.
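A smooth sigmoid obstacle barrier of the kind mentioned above can be written as a differentiable penalty on the distance to an obstacle. The exact cost used in the paper is not given in the abstract; the functional form and the `safe_radius`, `sharpness`, and `weight` parameters below are hypothetical, and are the sort of controller parameters an offline tuner could search over.

```python
import math

def sigmoid_barrier(distance, safe_radius=0.5, sharpness=10.0, weight=1.0):
    """Smooth obstacle penalty: close to `weight` inside safe_radius, close to 0 far outside."""
    z = sharpness * (distance - safe_radius)
    # 1 / (1 + exp(z)) written with tanh so that large z cannot overflow.
    return weight * 0.5 * (1.0 - math.tanh(0.5 * z))
```

Unlike a hard distance constraint, this penalty has a finite gradient everywhere, which is generally easier for a gradient-based solver to handle.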

2606.14794 2026-06-16 cs.RO 新提交

Computing Smooth Geodesics under Two-Sided Curvature Bounds with Applications to Robotics and Image Analysis

计算双侧曲率约束下的光滑测地线及其在机器人和图像分析中的应用

Da Chen, Zhenjiang Li, Jean-Marie Mirebeau, Xuecheng Tai, Jinglin Zhang, Wei Zhang, Laurent D. Cohen

发表机构 * CEREMADE, University Paris Dauphine, University-PSL, CNRS, UMR 7534(巴黎多芬纳大学CEREMADE实验室,巴黎文理研究大学,法国国家科学研究中心,UMR 7534); Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences(山东省肿瘤医院放射肿瘤科,山东第一医科大学,山东省医学科学院); Department of Mathematics, Centre Borelli, ENS Paris-Saclay, CNRS, University Paris-Saclay(巴黎萨克雷大学数学系,博雷利中心,巴黎萨克雷高等师范学校,法国国家科学研究中心); Norce(挪威研究中心); School of Control Science and Engineering, Shandong University(山东大学控制科学与工程学院)

AI总结 提出一种基于Hamilton-Jacobi-Bellman偏微分方程框架的曲率有界测地线模型,通过约束曲率上下界实现路径的光滑性和几何控制,并给出离散化求解方案,应用于机器人路径规划和图像曲线结构跟踪。

AI中文摘要

平面曲线的曲率由于与光滑性、刚性和弹性等理想几何特性密切相关,因此作为计算二阶最小路径的关键正则化项。本文解决计算物理和几何中一个更具挑战性的问题:跟踪曲率受任意上下界约束的最小路径。为此,我们提出了一种新的曲率有界测地线模型,该模型在Hamilton-Jacobi-Bellman (HJB) 偏微分方程 (PDE) 框架下开发。它通过强制曲率范围约束,对最小路径提供强大的几何控制,使得路径光滑且具有有界曲率限制。我们还提出了一种包含曲率约束的哈密顿量和HJB PDE的离散化方案,使得能够高效求解模型的数值解。最后,我们展示了所提出的曲率有界测地线模型在机器人路径规划和图像曲线结构跟踪中的应用能力。数值实验表明,所提出的曲率有界测地线模型是寻找满意路径的强大且鲁棒的工具。

英文摘要

Curvature of planar curves serves as a key regularization term for computing second-order minimal paths, due to its close connection to desirable geometric properties such as smoothness, rigidity, and elasticity. In this paper, we tackle a more challenging problem in computational physics and geometry: tracking minimal paths whose curvature is constrained by arbitrary upper and lower bounds. For that purpose, we propose a new curvature-bounded geodesic model, developed under the Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE) framework. It provides strong geometric control over minimal paths by enforcing curvature range constraints, so that the resulting paths are smooth and have bounded curvature. We also present a discretization scheme for the Hamiltonian and the HJB PDE incorporating curvature bounds, which allows numerical solutions of the model to be computed efficiently. Finally, we illustrate the capability of the proposed model in robot path planning and in tracking curvilinear structures in images. Numerical experiments demonstrate that the proposed curvature-bounded geodesic model is a powerful and robust tool for finding satisfactory paths.

2606.15046 2026-06-16 cs.RO 新提交

Exact, Efficient, and Safe Occlusion-Aware Planning Using AH-Polyhedrons

使用AH-多面体的精确、高效且安全的遮挡感知规划

Long Kiu Chung, David Isele, Toktam Mohammadnejad, Faizan M. Tariq, Sangjae Bae, Shreyas Kousik, Jovin D'sa

发表机构 * Honda Research Institute (HRI)(本田研究所) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出APRO框架,利用博弈论主动感知和AH-多面体可达性分析,通过线性规划实现精确安全验证,解决自动驾驶代客泊车中的遮挡问题,达到100%安全率且实时。

Comments 8 pages, 3 figures

AI中文摘要

安全处理遮挡是动态环境中自主移动机器人面临的基本挑战。这一问题在自动驾驶代客泊车(AVP)中尤为突出,因为交通规则宽松、遮挡频繁且杂乱,过度保守的行为可能导致车辆被困。然而,现有方法要么缺乏形式化安全保证,要么假设智能体遵循道路结构,要么引入保守性,使得AVP的遮挡感知规划仍然是一个开放挑战。在本文中,我们提出APRO(遮挡的AH-多面体可达性),一个基于博弈论主动感知和AH-多面体可达性分析的精确且高效的遮挡感知规划框架,以AVP作为典型用例。我们的关键洞察是将先前工作中基于集合的安全条件重新表述为AH-多面体的并集,从而通过线性规划(LP)实现精确的安全验证,无需在集合计算或道路拓扑假设中引入任何额外的保守性。我们进一步展示了如何将所得安全条件集成到基于优化的规划器或二分搜索方案中,以用于实时应用。我们在仿真和硬件实验中验证了我们的方法,包括在真实停车场数据集上的数据回放。实验结果表明,我们的方法在所有评估场景中始终达到100%的安全率,同时保持实时性能,从而比具有形式化安全保证的现有方法做出更安全、更优的决策。

英文摘要

Safely handling occlusions is a fundamental challenge for autonomous mobile robots operating in dynamic environments. This issue is especially prominent in autonomous valet parking (AVP), where traffic rules are lax, occlusions are frequent and cluttered, and overly conservative behavior can leave vehicles stuck. However, existing methods either lack formal safety guarantees, assume agents follow road structures, or introduce conservatism, leaving occlusion-aware planning for AVP an open challenge. In this paper, we propose APRO (AH-Polyhedron Reachability for Occlusions), an exact and efficient occlusion-aware planning framework based on game-theoretic active perception and AH-polyhedron reachability analysis with AVP as our canonical use case. Our key insight is to reformulate set-based safety conditions in prior work as unions of AH-polyhedrons, enabling exact safety verification through linear programming (LP) without any additional conservatism in set computations or assumptions on road topology. We further show how the resulting safety conditions can be integrated into optimization-based planners or a bisection search scheme for real-time applications. We validate our method in simulation and hardware experiments, including data replay on a real-world parking lot dataset. Experimental results demonstrate that our method consistently achieved a 100% safety rate across all evaluated scenarios while maintaining real-time performance, resulting in safer and more optimal decisions than existing methods with formal safety guarantees.

2606.15317 2026-06-16 cs.RO 新提交

Covariance-Regulated Recursive Koopman Learning for Nonlinear Systems with Uncertain Time-Varying Dynamics

面向不确定时变非线性系统的协方差调控递归Koopman学习

Weibin Gu, Chen Yang, Lu Shi, Chao Gao

发表机构 * Tsinghua University(清华大学) China University of Petroleum-Beijing at Karamay(中国石油大学(北京)克拉玛依校区) Xinchen Qihang Inc.(信辰启航有限公司)

AI总结 针对离线模型在时变动力学下失效的问题,提出协方差调控递归Koopman学习框架,通过误差死区门控和常迹归一化策略防止协方差爆炸和参数冻结,实现数值稳定的在线建模,并在非完整驱动轮式机器人和扑翼微型飞行器上验证了其跟踪性能。

详情
AI中文摘要

自主机器人的离线模型在训练分布之外的时变动力学下常常失效。Koopman算子理论通过提升提供非线性动力学的线性表示,但其向实时递归估计的过渡可能面临数值脆弱性:使用指数遗忘时低激励下的协方差风涌,以及无遗忘时增益消失。本文提出了一种协方差调控递归Koopman学习(CR-RKL)框架,包含两种互补策略——误差死区门控和常迹归一化——每种策略都能独立防止协方差爆炸和参数冻结,后者还额外保留了不确定性的几何结构。在具有车轮滑移和Stribeck摩擦的非完整差分驱动机器人以及26克仿蝴蝶扑翼微型飞行器上验证,CR-RKL实现了数值稳定且准确的在线建模,当嵌入模型预测控制时,在不确定时变动力学下保持了可靠的跟踪性能。

英文摘要

Offline models for autonomous robots often fail under time-varying dynamics outside their training distribution. Koopman operator theory offers a linear representation of nonlinear dynamics via lifting, but its transition to real-time recursive estimation may suffer numerical vulnerabilities: covariance windup under low excitation when using exponential forgetting, and vanishing gain without forgetting. This paper introduces a Covariance-Regulated Recursive Koopman Learning (CR-RKL) framework with two complementary strategies, error dead-zone gating and constant-trace normalization, each independently capable of preventing covariance explosion and parameter freezing, with the latter additionally preserving the geometric structure of uncertainty. Validated on a non-holonomic differential-drive robot with wheel slip and Stribeck friction and on a 26-gram butterfly-inspired flapping-wing micro aerial vehicle, CR-RKL achieves numerically stable and accurate online modeling, and when embedded in model predictive control, it maintains reliable tracking performance under uncertain, time-varying dynamics.
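To make the two regulation ideas concrete, here is a minimal sketch of a textbook recursive-least-squares step with exponential forgetting, extended with an optional error dead zone and an optional constant-trace rescaling of the covariance. It is a generic illustration for a scalar output, not the CR-RKL algorithm itself; the function name, the parameters `dead_zone` and `trace_target`, and their defaults are assumptions.

```python
def rls_step(theta, P, phi, y, lam=0.98, trace_target=None, dead_zone=0.0):
    """One RLS update for the model y ~ theta . phi (P is assumed symmetric)."""
    n = len(theta)
    err = y - sum(t * p for t, p in zip(theta, phi))
    # Dead-zone gating (assumed form): skip the update when the error is small.
    if abs(err) < dead_zone:
        return theta, P, err
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    theta = [theta[i] + gain[i] * err for i in range(n)]
    # Standard covariance update with forgetting factor lam.
    P = [[(P[i][j] - gain[i] * P_phi[j]) / lam for j in range(n)] for i in range(n)]
    # Constant-trace normalization (assumed form): rescale so trace(P) is fixed,
    # which keeps P from blowing up under low excitation or shrinking to zero.
    if trace_target is not None:
        scale = trace_target / sum(P[i][i] for i in range(n))
        P = [[scale * P[i][j] for j in range(n)] for i in range(n)]
    return theta, P, err
```

Rescaling by a scalar keeps the ratios between the entries of `P` unchanged, which is one way to read the abstract's remark that constant-trace normalization preserves the geometric structure of the uncertainty.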

2606.15469 2026-06-16 cs.RO 新提交

Learning Context-Aware Neural ODE Dynamics for Adaptive Robotic Control

学习上下文感知的神经ODE动力学用于自适应机器人控制

Shao-Yi Yu, Jen-Wei Wang, Maya Horii, Masayoshi Tomizuka, Vikas Garg

发表机构 * University of California, Berkeley(加州大学伯克利分校) Aalto University(阿尔托大学) YaiYai Ltd(YaiYai有限公司)

AI总结 提出基于神经ODE的上下文感知动力学模型,通过两阶段训练从状态-动作历史推断环境因素,实现模型预测控制下的自适应,在四旋翼、Sphero BOLT和Fanuc机械臂上验证了时空变化环境下的有效性。

详情
AI中文摘要

部署在不确定和动态变化环境中的机器人系统经常面临接触条件、空气动力学效应和外部干扰的变化,这些挑战了可靠的控制。为了在基于模型的控制下保持有效性,这些系统需要能够适应此类变化的动力学模型,特别是在直接获取完整环境信息受限的情况下。为了实现适应性并促进与模型预测控制的集成,我们提出了一种基于神经常微分方程的上下文感知动力学模型,该模型使用两阶段训练过程从状态-动作历史推断环境因素。我们在多种机器人平台上验证了该方法,包括仿真中的四旋翼,以及真实世界实验中的Sphero BOLT机器人和Fanuc机械臂。结果表明,我们的方法有效地适应了不同任务中时间和空间变化的环境变化。视频可在https://youtu.be/PY0sNyF2rqE 获取,源代码可在https://github.com/syyu410-yu/context-aware-neural-ode-control.git 获取。

英文摘要

Robotic systems deployed in uncertain and dynamically changing environments often face variations in contact conditions, aerodynamic effects, and external disturbances that challenge reliable control. To remain effective under model-based control, these systems require dynamics models that can adapt to such changes, especially when direct access to complete environmental information is limited. To enable adaptability and facilitate integration with model predictive control, we propose a context-aware dynamics model based on neural ordinary differential equations, which infers environmental factors from state-action histories using a two-phase training procedure. We validate the approach across diverse robotic platforms, including a quadrotor in simulation, as well as a Sphero BOLT robot and a Fanuc manipulator in real-world experiments. The results demonstrate that our method effectively adapts to temporally and spatially varying environmental changes across different tasks. Videos are available at https://youtu.be/PY0sNyF2rqE , and the source code is available at https://github.com/syyu410-yu/context-aware-neural-ode-control.git .

2606.15594 2026-06-16 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 新提交

Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC

从像素到证明:通过并行保形鲁棒MPC实现概率安全的潜在世界模型控制

Devesh Nath, Anutam Srinivasan, Haoran Yin, Ruitong Jiang, Jeffrey Fang, Glen Chou

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出SLS^2框架,结合保形预测与鲁棒模型预测控制,在学习的潜在世界模型中实现基于视觉的安全运动规划,提升目标到达性能与安全性。

详情
AI中文摘要

我们提出了SLS^2,一个使用鲁棒模型预测控制(MPC)在学习的潜在世界模型中进行安全反馈运动规划的框架。我们的方法训练了一个动作条件的联合嵌入世界模型,具有紧凑的马尔可夫潜在状态,通过学习的潜在动力学实现高效的基于梯度的轨迹优化。为了在潜在预测不完美的情况下确保真实系统的安全性,我们采用保形预测来通知GPU加速的系统级综合(SLS)鲁棒MPC方案,以获得校准的潜在误差界限和鲁棒的潜在空间约束集。我们还学习并保形化了一个潜在约束检查器,使SLS规划器能够在闭环执行期间施加概率安全约束。我们在基于视觉的控制任务上评估了我们的方法,与潜在世界模型和安全规划基线相比,它提高了目标到达性能和安全性。

英文摘要

We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.
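The calibrated error bounds mentioned above can be illustrated with standard split conformal prediction: given held-out prediction errors, the bound is a finite-sample-corrected empirical quantile. The sketch below shows only that generic step, under the usual exchangeability assumption; it is not the SLS^2 pipeline, and the function name and inputs are hypothetical.

```python
import math


def conformal_bound(calibration_errors, alpha=0.1):
    """Split-conformal bound intended to cover a new error with prob. >= 1 - alpha."""
    n = len(calibration_errors)
    # Rank of the finite-sample-corrected quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples for this alpha: only the trivial bound holds.
        return math.inf
    return sorted(calibration_errors)[k - 1]
```

With 20 calibration errors and `alpha=0.1`, the bound is the 19th smallest error; with only 5 samples the function returns infinity, signalling that more calibration data is needed for that confidence level.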

2606.15654 2026-06-16 cs.RO cs.AI 新提交

PO-PDDL: Learning Symbolic POMDPs from Visual Demonstrations for Robot Planning Under Uncertainty

PO-PDDL: 从视觉演示中学习符号化POMDP以实现不确定性下的机器人规划

Wenjing Tang, Xuanjin Jin, Yuan Liu, Renming Huang, Cewu Lu, Panpan Cai

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出PO-PDDL符号化POMDP框架,通过从机器人执行视频中重建潜在状态轨迹、识别部分可观测性并学习随机转移与观测模型,实现不确定性下的鲁棒任务规划。

详情
AI中文摘要

现实世界的机器人任务规划必须在随机动作执行和部分可观测性下进行,然而为真实机器人领域构建部分可观测马尔可夫决策过程(POMDP)模型仍然困难且劳动密集。我们引入了PO-PDDL,一种POMDP的符号化表述,它保留了规划领域定义语言(PDDL)的关系结构和LLM友好的语法,同时显式建模了部分可观测性、随机性和信念。基于此表述,我们提出了一种用于学习PO-PDDL模型的演示驱动流程。该方法从真实机器人执行视频中重建潜在符号状态轨迹,通过推断状态与视觉观测之间的不一致性识别部分可观测性,并相应地学习随机转移和观测模型。得到的PO-PDDL领域可跨任务重用,并在感知和执行不确定性下实现在线信念空间规划。在真实世界长时域操作任务上的实验表明,我们的方法持续优于现有的PDDL和POMDP模型学习方法,以显著更低的规划成本实现了不确定性下的鲁棒任务规划。

英文摘要

Real-world robot task planning must operate under both stochastic action execution and partial observability, yet constructing Partially Observable Markov Decision Process (POMDP) models for real robotics domains remains difficult and labor-intensive. We introduce PO-PDDL, a symbolic formulation of POMDPs that preserves the relational structure and LLM-friendly syntax of the Planning Domain Definition Language (PDDL), while explicitly modeling partial observability, stochasticity, and beliefs. Building on this formulation, we propose a demonstration-driven pipeline for learning PO-PDDL models. The proposed method reconstructs latent symbolic state trajectories from real-robot execution videos, identifies partial observability via inconsistencies between inferred states and visual observations, and learns stochastic transition and observation models accordingly. The resulting PO-PDDL domains are reusable across tasks and enable online belief-space planning under both perception and execution uncertainty. Experiments on real-world long-horizon manipulation tasks show that our method consistently outperforms existing PDDL and POMDP model-learning approaches, achieving robust task planning under uncertainty with significantly lower planning cost.
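Belief-space planning relies on updating a distribution over hidden states after each action and observation. The sketch below is the standard discrete Bayes filter for a POMDP with tabular transition and observation models; the dictionary layout, state names and probabilities are illustrative assumptions, not PO-PDDL syntax or learned values.

```python
def belief_update(belief, action, observation, T, O):
    """Discrete POMDP belief update.

    T[action][s][s2] = P(s2 | s, action)
    O[action][s2][o] = P(o | s2, action)
    """
    states = list(belief)
    # Prediction: push the belief through the stochastic transition model.
    predicted = {
        s2: sum(T[action][s][s2] * belief[s] for s in states) for s2 in states
    }
    # Correction: weight each predicted state by the observation likelihood.
    unnormalized = {s2: O[action][s2][observation] * predicted[s2] for s2 in states}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: v / total for s, v in unnormalized.items()}
```

For instance, with a uniform prior over two object locations and a detector that is right 80% of the time, a positive detection raises the belief in the detected location to 0.8.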

2606.15896 2026-06-16 cs.RO cs.LG 新提交

LoComposition: Terrain-Adaptive Energy-Efficient Quadruped Locomotion without Gait Priors

LoComposition:无需步态先验的地形自适应高效四足运动

Loukas Kordos, Leonard T. Franz, Simon Rappenecker, Oliver Hausdoerfer, Angela P. Schoellig, Pavel Kolev, Georg Martius

发表机构 * Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) University of Tübingen(图宾根大学) Technical University of Munich(慕尼黑工业大学) University of Stuttgart(斯图加特大学)

AI总结 提出一种将任务奖励、操作约束、能量最小化和地形感知分离的框架,无需显式步态先验,在四足机器人上实现高效地形自适应运动,运输成本降低56%,违规减少96%。

Comments 17 pages, 5 figures, 10 tables

详情
AI中文摘要

基于学习的四足运动通常依赖于复杂的奖励函数,将任务规范、操作限制、步态偏好和地形适应纠缠在单个优化目标中。我们通过不同的机制处理这些功能:任务规范用奖励,操作限制用约束,步态偏好用能量最小化,以及用外部感知来根据地形难度调整能量使用。我们表明,这些组件共同实现了高效、地形自适应的运动,并且移除每个组件会暴露出不同的失败模式。我们的公式移除了显式的步态先验(包括腾空时间、接触次数和足部间隙目标),转而支持涌现行为。与传统的复杂奖励基线相比,我们的公式在实现相当的地形穿越的同时,将运输成本降低了56%,操作限制违规减少了96%。得到的策略零样本迁移到使用基于LiDAR高程地图的物理Unitree Go2上。项目网站含视频:https://tinyurl.com/locomposition。

英文摘要

Learning-based quadrupedal locomotion typically relies on complex reward formulations that entangle task specification, operational limits, gait preference, and terrain adaptation within a single optimization objective. We instead treat these functions through distinct mechanisms: rewards for task specification, constraints for operational limits, energy minimization for gait preference, and exteroceptive perception for adapting energy use to terrain difficulty. We show that these components jointly enable efficient, terrain-adaptive locomotion, and that removing each component exposes a distinct failure mode. Our formulation removes explicit gait priors (including air-time, contact-count, and foot-clearance targets) in favor of emergent behavior. Compared to a conventional complex-reward baseline, our formulation achieves comparable terrain traversal while reducing cost of transport by 56% and operational-limit violations by 96%. The resulting policies transfer zero-shot to a physical Unitree Go2 using LiDAR-based elevation mapping. Project website with videos: https://tinyurl.com/locomposition.
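The abstract reports a reduction in cost of transport without stating the formula. A common dimensionless definition divides the energy consumed by weight times distance travelled; the sketch below uses that definition as an assumption, and the numbers in the usage note are made up for illustration, not taken from the paper.

```python
def cost_of_transport(energy_j, mass_kg, distance_m, gravity=9.81):
    """Dimensionless cost of transport: energy / (mass * gravity * distance)."""
    if mass_kg <= 0.0 or distance_m <= 0.0:
        raise ValueError("mass and distance must be positive")
    return energy_j / (mass_kg * gravity * distance_m)
```

A hypothetical 15 kg robot that spends 1000 J to walk 10 m would have a cost of transport of about 0.68 under this definition.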

2606.15918 2026-06-16 cs.RO 新提交

Energy-Efficient Arm Reaching for a Humanoid Robot via Deep Reinforcement Learning with Identified Power Models

基于识别功率模型的深度强化学习实现人形机器人节能手臂伸展

Nestor N. Deniz, Simon Parsons, Fernando Auat Cheein

发表机构 * Harper Adams University(哈珀亚当斯大学) Lincoln Institute for Agri-Food Technology(林肯农业食品技术研究所) Lincoln Centre for Autonomous Systems(林肯自主系统中心)

AI总结 提出一种端到端能量感知强化学习框架,结合物理实验识别的电功率模型与SAC策略,在Unitree G1人形机器人上实现节能手臂伸展,仿真成功率69.9%,实物验证平均能耗71.5 J。

详情
AI中文摘要

在田间执行操作任务(如机器人苹果采摘)的人形机器人面临严重的能量约束,这直接限制了每块电池充电可执行的伸展运动次数。本文针对Unitree G1人形机器人的7自由度左臂,提出了一种端到端的能量感知强化学习框架,该框架结合了基于物理实验识别的电功率模型和在基于Pinocchio的刚体动力学模拟器中训练的Soft Actor-Critic (SAC)策略。RL策略在增量关节位置动作空间上运行,并使用混合星座奖励进行训练,该奖励将四点末端执行器星座距离与扭矩范数能量代理相结合;经过$5\times10^6$次训练后,在运动学模拟中对$1\,000$个随机目标达到了$69.9\%$的成功率,成功情节的平均能量为98.16 J。最后,在物理Unitree G1上,该策略在三个独立的10目标批次上进行了验证,实现了平均能量$71.5 \pm 48.3$ J,末端执行器位置误差$2.64 \pm 1.04$ cm,方向误差$6.92 \pm 1.33^\circ$,均在4 cm/$8.6^\circ$的训练容差内。这些结果构成了基于能量感知强化学习的人形机器人手臂伸展的第一步。

英文摘要

Humanoid robots performing in-field manipulation tasks, such as robotic apple harvesting, face severe energy constraints that directly limit the number of reaching motions that can be executed per battery charge. This paper presents an end-to-end, energy-aware reinforcement learning framework for the 7-degree-of-freedom left arm of the Unitree G1 humanoid robot, combining a physics-based, experimentally identified electrical power model with a Soft Actor-Critic (SAC) policy trained in a Pinocchio-based rigid-body dynamics simulator. The RL policy operates on an incremental joint-position action space and is trained with a Hybrid Constellation Reward that combines a four-point end-effector constellation distance with a torque-norm energy proxy; after $5\times10^6$ training it reaches a $69.9\%$ success rate over $1\,000$ random targets in kinematic simulation, at a mean energy of 98.16 J on successful episodes. Finally, on the physical Unitree G1, the policy is validated over three independent 10-target batches, achieving a mean energy of $71.5 \pm 48.3$ J, an end-effector position error of $2.64 \pm 1.04$ cm, and an orientation error of $6.92 \pm 1.33^\circ$, all within the 4 cm / $8.6^\circ$ training tolerance. These results constitute a first step toward energy-aware reinforcement-learning-based arm reaching for humanoid robots.

2606.16480 2026-06-16 cs.RO cs.AI cs.SY eess.SY 新提交

HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization

HOLO-MPPI:通过分层策略优化的多场景运动规划

Youngjae Min, Jovin D'sa, Faizan M. Tariq, David Isele, Navid Azizan, Sangjae Bae

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Honda Research Institute, USA(本田研究所(美国))

AI总结 提出HOLO-MPPI框架,结合离线高层策略学习与在线低层随机最优控制,实现多场景运动规划,无需针对每个场景重新调整参数,在自动驾驶中优于MPPI和端到端RL基线。

详情
AI中文摘要

部署在现实世界中的机器人必须在不同场景下规划运动,而无需针对每个场景重新调整参数。端到端强化学习(RL)可以跨场景泛化,但在分布偏移、奖励错误指定和随机交互下往往变得脆弱。模型预测路径积分(MPPI)控制能够在无梯度的情况下实现强大的实时优化,但其性能依赖于良好形状的采样先验,而手动设计先验无法扩展到多场景部署。我们提出了HOLO-MPPI(高层离线,低层在线MPPI),一种多场景运动规划框架,结合了高层策略学习与低层随机最优控制。离线时,我们学习一个高层策略,在抽象动作空间中提出场景鲁棒的规划,并利用学习的世界模型进行在线推演。在线时,该策略作为数据驱动的先验生成器,根据当前观测和目标参数化MPPI的采样分布。然后MPPI围绕该先验实时优化低层控制序列,以适应局部扰动。我们通过设计有效的高层动作空间和定制模型架构,在自动驾驶中实例化HOLO-MPPI。在多种驾驶场景下的评估表明,HOLO-MPPI在保持实时控制的同时,优于MPPI和端到端RL基线。

英文摘要

Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.
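The division of labor described above, in which a prior proposes a nominal control sequence and MPPI refines it by sampling, can be sketched with a single generic MPPI update. This is a simplified textbook-style step for a scalar control sequence, not the HOLO-MPPI implementation; the function name, the cost function and all hyperparameters are placeholders.

```python
import math
import random


def mppi_update(prior, cost_fn, num_samples=256, sigma=0.5, temperature=1.0, rng=None):
    """Refine a nominal control sequence by cost-weighted averaging of samples."""
    rng = rng or random.Random(0)
    # Sample perturbed control sequences around the prior (the sampling mean).
    samples = [[u + rng.gauss(0.0, sigma) for u in prior] for _ in range(num_samples)]
    costs = [cost_fn(seq) for seq in samples]
    # Exponentiated negative cost; subtracting the minimum avoids underflow.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    # The refined sequence is the weighted mean of the samples at each time step.
    return [
        sum(w * seq[t] for w, seq in zip(weights, samples)) / total
        for t in range(len(prior))
    ]
```

In this picture, a learned high-level policy would supply `prior` (and possibly `sigma`) from the current observation and goal, and the update would be repeated at every control step in a receding-horizon fashion.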

2606.16542 2026-06-16 cs.RO 新提交

ADAPT: Analytical Disturbance-Aware Policy Training for Humanoid Locomotion

ADAPT: 面向人形机器人运动的解析干扰感知策略训练

Bofan Lyu, Jindou Jia, Kuangji Zuo, Yanshuo Lu, Shijia Han, Gen Li, Boyu Ma, Jingliang Li, Geng Li, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(南洋理工大学MARS实验室)

AI总结 提出ADAPT框架,通过解析全身干扰观测器在线估计外力/力矩,无需传感器,提升人形机器人在干扰下的运动精度与鲁棒性。

详情
AI中文摘要

部署在人类中心环境中的人形机器人必须处理力交互任务,其中外部接触会引入意外干扰,破坏运动精度和稳定性。现有的基于学习的方法依赖于广泛的域随机化、特定任务的力目标或基于运动历史的学习型力估计器,每种方法都会在精度、任务可迁移性或分布外鲁棒性上做出妥协。我们提出了解析干扰感知策略训练(ADAPT),这是一个框架,它为人形机器人策略配备了物理基础的干扰观测器。ADAPT的核心是一个解析全身干扰观测器,它利用可访问的机器人动力学在线估计残余力/力矩,无需力/力矩传感器。估计的干扰直接输入策略,使人形机器人获得对外力/力矩的显式、基于物理的感知,能够泛化到各种未见过的场景。在Unitree G1人形机器人上的实验表明,ADAPT在躯干扰动、站立推力和不对称手部负载下实现了比仅基于本体感觉的基线更准确的干扰预测和更强的鲁棒性,即使在分布外干扰下也能改善速度跟踪。此外,ADAPT能够惩罚在下肢关节推断出的干扰,以鼓励更轻快的运动。

英文摘要

Humanoids deployed in human-centered environments must handle force-interactive tasks, where external contacts introduce unexpected disturbances that disrupt locomotion accuracy and stability. Existing learning-based approaches rely on broad domain randomization, task-specific force objectives, or learning-based force estimators from motion history, each of which compromises accuracy, task transferability, or out-of-distribution (OOD) robustness. We present Analytical Disturbance-Aware Policy Training (ADAPT), a framework that equips humanoid policies with a physically grounded disturbance observer. The core of ADAPT is an analytical whole-body disturbance observer that estimates residual force/torque online with the accessible robot dynamics, without requiring force/torque sensors. Fed directly into the policy, the estimated disturbances give the humanoid an explicit, physics-derived sense of external force/torque that can generalize across diverse unseen scenes. Experiments on a Unitree G1 humanoid show that ADAPT achieves accurate disturbance prediction and stronger robustness than a proprioception-only baseline under torso perturbations, standing pushes, and asymmetric hand payloads, with improved velocity tracking even on OOD disturbances. Moreover, ADAPT enables penalizing inferred disturbances at lower-body joints to encourage lighter locomotion.

2606.16564 2026-06-16 cs.RO cs.LG 新提交

Elastic ODYN: Differentiable Optimization for Infeasible Control and Learning in Robotics

Elastic ODYN:面向机器人中不可行控制与学习的可微优化

Aristotelis Papatheodorou, Jose Rojas, Ioannis Havoutis, Carlos Mastalli

发表机构 * University of Oxford(牛津大学) Heriot-Watt University(赫瑞瓦特大学)

AI总结 提出Elastic ODYN,一种通过平滑平方ℓ2弹性松弛处理不可行二次规划(QP)的原始-对偶非内点求解器,支持热启动,在无可行点时收敛到最接近可行解,并基于此开发可微QP层和不可行感知SQP方法,在基准QP、奇异接触力学、可微参数辨识及四足/人形机器人轨迹优化中优于现有方法。

Comments 8 pages, 5 figures, 2 tables

详情
AI中文摘要

机器人系统经常遇到冲突的目标、建模误差和退化接触条件,这些条件使得二次规划(QP)不可行。然而,大多数优化求解器和可微QP层假设可行性,当约束无法同时满足时,会导致数值失败、梯度不稳定或求解器崩溃。我们提出Elastic ODYN,一种原始-对偶非内点QP求解器,通过平滑平方ℓ2弹性松弛处理不可行性。所得公式在病态和退化条件下保持良态,支持热启动,并在无可行点时收敛到最接近可行解。一个轻量级细化阶段从弹性解中恢复有物理意义的对偶变量。基于此框架,我们开发了Elastic OdynLayer,一个在不可行性下具有稳定梯度的可微QP层,以及Elastic OdynSQP,一种不可行感知的SQP方法,通过选择性约束松弛解决不一致的子问题和本质不可行的最优控制任务。我们在基准QP、奇异接触力学、可微参数辨识以及四足和人形机器人轨迹优化上评估该框架。在所有设置中,Elastic ODYN在鲁棒性、热启动性能和收敛可靠性方面始终优于最先进的弹性QP求解器,使得优化、仿真、控制和学习能够超越现有方法的可行性假设。

英文摘要

Robotic systems routinely encounter conflicting objectives, modeling errors, and degenerate contact conditions that render quadratic programs (QPs) infeasible. Yet most optimization solvers and differentiable QP layers assume feasibility, leading to numerical failures, unstable gradients, or solver breakdown when constraints cannot be simultaneously satisfied. We present Elastic ODYN, a primal-dual non-interior-point QP solver that handles infeasibility through smooth squared-$\ell_2$ elastic relaxations. The resulting formulation remains well posed under ill-conditioning and degeneracy, supports warm starting, and converges to closest-to-feasible solutions when no feasible point exists. A lightweight refinement stage recovers physically meaningful dual variables from the elastic solution. Building on this framework, we develop Elastic OdynLayer, a differentiable QP layer with stable gradients under infeasibility, and Elastic OdynSQP, an infeasibility-aware SQP method that resolves inconsistent subproblems and intrinsically infeasible optimal control tasks through selective constraint relaxation. We evaluate the framework on benchmark QPs, singular contact mechanics, differentiable parameter identification, and quadrupedal and humanoid trajectory optimization. Across all settings, Elastic ODYN consistently outperforms state-of-the-art elastic QP solvers in robustness, warm-start performance, and convergence reliability, enabling optimization, simulation, control, and learning beyond the feasibility assumptions of existing methods.
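The idea of a squared-ℓ2 elastic relaxation can be seen on a toy problem with one decision variable and conflicting equality constraints: each constraint gets a slack variable that is penalized quadratically, so the problem always has a solution and that solution approaches the least-squares compromise as the penalty grows. The sketch below solves this toy case in closed form; it is a generic quadratic-penalty illustration, not the Elastic ODYN solver, and the function name and the penalty weight `rho` are assumptions.

```python
def elastic_scalar_qp(q, c, constraints, rho=1e6):
    """Solve  min_x 0.5*q*x**2 + c*x + 0.5*rho*sum(s_i**2)  s.t.  a_i*x + s_i = b_i.

    `constraints` is a list of (a_i, b_i) pairs; requires q + rho*sum(a_i**2) > 0.
    """
    sum_aa = sum(a * a for a, _ in constraints)
    sum_ab = sum(a * b for a, b in constraints)
    # Stationarity: q*x + c + rho * sum(a_i * (a_i*x - b_i)) = 0.
    x = (rho * sum_ab - c) / (q + rho * sum_aa)
    # Slack (constraint violation) that the relaxation had to accept.
    slacks = [b - a * x for a, b in constraints]
    return x, slacks
```

With the contradictory constraints x = 1 and x = 3, the relaxed solution lands close to x = 2 with slacks of about -1 and +1, whereas a solver that assumes feasibility would have no valid answer to return.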

2606.16696 2026-06-16 cs.RO 新提交

VENOM: Versatile Embodied Network for Omni-bodied Motion tracking

VENOM: 用于全身运动追踪的多功能具身网络

Siddharth Padmanabhan, Kazuki Miyazawa, Takato Horii

发表机构 * Graduate School of Engineering Science, University of Osaka(大阪大学工学研究科)

AI总结 提出VENOM,一种基于GPT的跨具身全身运动追踪模型,在仿真中实现多个人形机器人的全身运动追踪,无需分离上下身控制。

详情
AI中文摘要

仅从演示数据中实现跨多个人形机器人的专家级表现力全身运动追踪,在人形机器人学习中仍然是一个具有挑战性且相对未充分探索的问题。跨具身运动追踪策略主要通过将控制问题分解为上身和下身控制来训练。本文提出VENOM,一种用于仿真中的人形机器人的跨具身全身运动追踪模型。VENOM是一种基于GPT的运动追踪器,在多人形机器人数据上训练,可以追踪整个身体,无需分割为上身和下身控制。我们整理了一个名为VENOM数据集的多人形机器人运动追踪数据集,包含状态、动作和奖励,并在此数据集上训练VENOM和基线模型。在本文中,我们评估了VENOM相对于基线的性能,并表明我们能够实现一个稳定的运动追踪器,其能力优于仅通过监督学习在多人形机器人数据上训练的MLP,并且还表明,尽管缺乏奖励反馈,VENOM与使用非对称演员-评论家强化学习训练的专家的追踪能力紧密匹配。

英文摘要

Achieving expert-level expressive full-body motion tracking across multiple humanoids solely from demonstration data remains a challenging and relatively underexplored problem in humanoid robot learning. Cross-embodiment motion tracking policies are mostly trained by decoupling the control problem into upper- and lower-body control. This work proposes VENOM, a cross-embodiment full-body motion tracking model for humanoids in simulation. VENOM is a GPT-based motion tracker trained on data from multiple humanoids that can track the entire body without splitting control into upper and lower body. We curate a multi-humanoid motion tracking dataset, the VENOM dataset, that contains states, actions, and rewards, and train VENOM and the baselines on it. In this letter, we evaluate VENOM against these baselines and show that it yields a stable motion tracker across different humanoids that is more capable than an MLP trained on multi-humanoid data with supervised learning alone. We also show that, despite the lack of reward feedback, VENOM closely matches the tracking capability of experts trained with asymmetric actor-critic reinforcement learning.

2606.16780 2026-06-16 cs.RO 新提交

DIFF-IPPO: Diffusion-Based Informative Path Planning with Open-Vocabulary Belief Maps

DIFF-IPPO:基于扩散的开放词汇信念地图信息路径规划

Sausar Karaf, Oleg Sautenkov, Mikhail Martynov, Dzmitry Tsetserukou

发表机构 * Intelligent Space Robotics Laboratory, CDE, Skoltech(智能空间机器人实验室,CDE,斯科尔科沃科学技术研究院)

AI总结 提出DIFF-IPPO框架,结合开放词汇信念地图生成器与扩散规划器,在非高斯信念图上生成全局轨迹,实现高效目标搜索,检测得分达81.49%-86.55%。

详情
AI中文摘要

探索和物体搜索要求机器人感知环境、识别感兴趣区域,并规划提高目标检测可能性或最大化信息增益的轨迹。许多IPP方法,特别是在连续环境监测中,依赖于高斯过程信念模型,而物体搜索场景通常从语义或开放词汇感知中产生复杂的多模态信念地图。直接基于这种非高斯信念地图的全局轨迹生成仍然相对未被充分探索。尽管基于扩散的规划器为此类分布建模提供了强大能力,但它们在信息路径规划中的应用仍然有限。在这项工作中,我们提出了DIFF-IPPO,一个集成了开放词汇信念地图生成器和基于扩散的规划器的流水线,用于在信念地图上生成全局轨迹。该方法生成的轨迹将传感器覆盖集中在高信念区域,在不同数据集场景下实现了81.49%至86.55%的归一化检测得分。我们在一个模拟的搜索与救援场景中验证了该系统,其中规划器搜索候选建筑区域以定位燃烧的建筑。在此设置中,一个由五架无人机组成的团队使用批处理信念地图条件轨迹生成,在3.5分钟内实现了首次检测。

英文摘要

Exploration and object search require robots to perceive their environment, identify regions of interest, and plan trajectories that improve target-detection likelihood or maximize information gain. Many IPP methods, especially in continuous environmental monitoring, rely on Gaussian-process belief models, while object-search settings often produce complex, multimodal belief maps from semantic or open-vocabulary perception. Global trajectory generation directly conditioned on such non-Gaussian belief maps remains comparatively underexplored. Although diffusion-based planners offer strong capabilities for modeling such distributions, their use in informative path planning remains limited. In this work, we propose DIFF-IPPO, a pipeline that integrates an open-vocabulary belief map generator with a diffusion-based planner for global trajectory generation over belief maps. The method generates trajectories that concentrate sensor coverage over high-belief regions, achieving normalized detection scores between 81.49% and 86.55% across different dataset scenarios. We validate the system in a simulated search-and-rescue scenario where the planner searches candidate building regions to locate a burning building. In this setting, a team of five drones using batched belief-map-conditioned trajectory generation achieves first detections in 3.5 minutes.

2606.16972 2026-06-16 cs.RO cs.SY eess.SY 新提交

When Should a Robot Replan? Regret-Guided Update Scheduling in Time-Varying MDPs

机器人何时应重新规划?时变MDP中的遗憾引导更新调度

Negin Musavi, Gokul Puthumanaillam, Ruben Hernandez, William Schafer, Melkior Ornik

发表机构 * University of Illinois Urbana–Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对时变环境下机器人因预算限制无法持续重规划的问题,提出基于动态遗憾的在线更新调度规则,在仿真和实物实验中优于固定预算基线。

详情
AI中文摘要

在非平稳环境中运行的机器人必须随着动态漂移不断调整其策略,但机载能量和计算预算限制了全状态估计和重规划步骤的执行频率。这引出一个问题:在时间轴上,机器人何时应花费其有限的预算?我们在具有已知转移漂移率边界的时变马尔可夫决策过程(TVMDP)中形式化该问题。我们将执行建模为一种“跳过更新”方案,即在选定的更新时间点,智能体通过最大似然估计转移核并计算有限时域策略,而在更新间隔之间,则在传播的状态估计下重用该策略。我们分析了该方案的动态遗憾,并展示了它如何根据TVMDP的性质和跳过长度在跳过区间内增长;由此产生的界限通过一种在线、遗憾引导的更新规则回答了开头的问题,该规则自适应地分配预算。我们在具有时变滑移动力学的模拟火星车导航任务和室内障碍物场中的Crazyflie四旋翼飞行器上评估了该规则。自适应分配优于其他预算基线。

英文摘要

Robots operating in non-stationary environments must continually adapt their policies as the dynamics drift, but onboard energy and compute budgets cap how often a full state estimation and re-planning step can be performed. This raises a question: when, along a horizon, should a robot spend its limited budget? We formulate this problem in time-varying Markov decision processes (TVMDPs) with a known bound on the rate of transition drift. We model execution as a skip-update scheme in which, at chosen update times, the agent estimates the transition kernel by maximum likelihood and computes a finite-horizon policy, and between updates reuses this policy under a propagated state estimate. We analyze the dynamic regret of this scheme and show how it grows during skip intervals in terms of the properties of the TVMDP and the skip lengths; the resulting bound answers the opening question via an online, regret-guided update rule that allocates the budget adaptively. We evaluate the rule in a simulated Mars-rover navigation task with time-varying slip dynamics and on a Crazyflie quadrotor in indoor obstacle fields. Adaptive allocation outperforms other budgeted baselines.
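The estimation half of an update step can be illustrated for a tabular MDP, where the maximum-likelihood estimate of the transition kernel is simply the empirical frequency of observed next states for each state-action pair. The sketch below shows only that piece; the data format and names are assumptions, and the regret-guided scheduling rule itself is not reproduced here.

```python
from collections import Counter, defaultdict


def estimate_transition_kernel(transitions):
    """Maximum-likelihood tabular kernel from (state, action, next_state) samples."""
    counts = defaultdict(Counter)
    for state, action, next_state in transitions:
        counts[(state, action)][next_state] += 1
    # Normalizing the counts per (state, action) pair gives the categorical MLE.
    return {
        sa: {s2: n / sum(c.values()) for s2, n in c.items()}
        for sa, c in counts.items()
    }
```

Between updates, a skip-update scheme would keep using the policy computed from the most recent estimate, which is why the estimate becomes stale as the true kernel drifts.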

2606.13485 2026-06-16 eess.SY cs.HC cs.NE cs.RO cs.SY physics.med-ph 交叉投稿

Impedance MPC with Patient-Torque Estimation for Knee Rehabilitation Exoskeletons

用于膝关节康复外骨骼的阻抗模型预测控制与患者力矩估计

Yongyan Cao, Jinshan Tang

发表机构 * Department of Biomedical Engineering and Engineering Science(生物医学工程与工程科学系)

AI总结 提出阻抗模型预测控制框架,结合卡尔曼扰动状态估计患者力矩,实现无偏移跟踪和辅助按需,在500 Hz下满足临床精度标准。

详情
AI中文摘要

膝关节康复外骨骼必须强制执行规定的关节轨迹,同时保持对非自主痉挛和自主患者努力的安全顺从——这是任何固定增益阻抗控制器的目标冲突。我们提出了一种用于膝关节康复外骨骼的阻抗模型预测控制框架,并在串联弹性执行器(SEA)平台上进行了演示:代数前馈将膝关节动力学简化为常系数标量双积分器,而滚动时域二次规划(QP)计算校正力矩,同时强制执行硬性的运动范围、力矩和速度限制(ISO 13482)。由直接基于SEA的力矩传感(通过弹性元件测量的串联弹性弹簧挠度——一种固有的、无EMG的患者力矩估计,而非单独的力传感器)驱动的卡尔曼扰动状态提供了标称无偏移保证,并通过其符号和期望运动方向实现无传感器的辅助按需。常状态矩阵允许离线预计算QP成本逆,从而实现多步时域下的500 Hz运行。在七个控制器基准测试(正弦跟踪、等长保持)中,500 Hz卡尔曼MPC在15 Nm痉挛下实现了0.1 mrad RMS、0.1 mrad稳态、0.2 mrad峰值的无偏移,而相同刚度下的经典阻抗控制器稳态偏移为515 mrad——直接测量通道几乎立即(几个采样周期内)收敛估计。没有估计器时,它实现经典阻抗(4.8 mrad RMS,8.3 mrad稳态)。所有MPC变体均满足87 mrad临床标准;没有经典控制器满足。该架构通过考虑耦合的每个关节QP为20自由度MyoSuite myoLeg设计。

英文摘要

Knee rehabilitation exoskeletons must enforce a prescribed joint trajectory while remaining safely compliant with involuntary spasm and voluntary patient effort, two objectives that are in tension for any fixed-gain impedance controller. We present an Impedance Model Predictive Control framework for knee rehabilitation exoskeletons, demonstrated on a series-elastic-actuator (SEA) platform: an algebraic feedforward reduces the knee dynamics to a constant-coefficient scalar double integrator, and a receding-horizon quadratic program (QP) computes corrective torques while enforcing hard range-of-motion, torque, and velocity limits (ISO 13482). A Kalman disturbance state driven by direct SEA-based torque sensing (the series-elastic spring deflection measured through the elastic element, which is an intrinsic, EMG-free patient-torque estimate rather than a separate load cell) gives a nominal offset-free guarantee and, via its sign and the desired-motion direction, sensorless Assist-as-Needed. The constant state matrix permits offline precomputation of the QP cost inverse, enabling 500 Hz operation with a multi-step horizon. Across seven-controller benchmarks (sinusoidal tracking, isometric hold), the 500 Hz Kalman MPC is offset-free (0.1 mrad RMS, 0.1 mrad steady-state, 0.2 mrad peak) under 15 Nm spasm, versus a 515 mrad steady-state offset for classical impedance at the same stiffness, with the direct-measurement channel converging the estimate near-immediately (within a few sampling periods). Without the estimator it realizes a classical impedance (4.8 mrad RMS, 8.3 mrad steady-state). All MPC variants meet the 87 mrad clinical criterion; no classical controller does. The architecture is formulated for the 20 DOF MyoSuite myoLeg via coupling-aware per-joint QPs.
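A constant-coefficient double integrator has an exact zero-order-hold discretization with fixed matrices, which is what makes offline precomputation possible. The sketch below shows one such discrete step at a 500 Hz sample rate; treating the input as an angular acceleration (corrective torque divided by inertia) is an assumption made for illustration, and the function name is hypothetical.

```python
def double_integrator_step(angle, velocity, accel, dt=1.0 / 500.0):
    """Exact zero-order-hold step of a scalar double integrator.

    Equivalent to x_next = A x + B u with A = [[1, dt], [0, 1]] and
    B = [dt**2 / 2, dt], where x = (angle, velocity) and u = accel.
    """
    new_angle = angle + dt * velocity + 0.5 * dt * dt * accel
    new_velocity = velocity + dt * accel
    return new_angle, new_velocity
```

Because `A` and `B` depend only on `dt`, the prediction matrices of a fixed-horizon QP built on this model are the same at every control step.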

2606.16068 2026-06-16 eess.SY cs.RO cs.SY math.OC 交叉投稿

Anisotropic Template Ansätze for Robust Positive Invariance under State-Dependent Uncertainty

各向异性模板Ansätze用于状态依赖不确定性下的鲁棒正不变性

Abdelrahman Ramadan, Melissa Greeff, Sidney Givigi

发表机构 * Electrical and Computer Engineering, Smith Engineering, and with Ingenuity Labs Research Institute, Queen’s University(电气与计算机工程系、史密斯工程系以及Ingenuity Labs研究研究院,皇后大学) School of Computing, and with Ingenuity Labs Research Institute, Queen’s University(计算学院以及Ingenuity Labs研究研究院,皇后大学)

AI总结 提出一种基于高斯过程导出的正定矩阵场映射固定椭球模板的方法,建立状态和输入依赖扰动下鲁棒正不变性的充分条件,通过LMI条件实现,仿真显示体积大幅缩减。

详情
AI中文摘要

我们建立了在具有各向异性协方差结构的状态和输入依赖扰动下鲁棒正不变性的充分条件。所提出的ansatz通过高斯过程导出的正定矩阵场映射一个固定的椭球模板,在保留基于有限图验证的同时,包含了标量同位缩放。得到的LMI条件将学习到的场与Schur稳定动力学耦合;一个带有膨胀因子$r=1/(1-γ_{\mathrm{cl}})$的各向同性后备方案被证明是可接受的。在每个学习周期中,场被冻结,因此在线管道评估仅需一次GP协方差查询和一个小的矩阵平方根,无需在线集迭代或LMI求解。四旋翼仿真显示,相对于非自适应同位基线,3D速度管道体积减少了$195\times$,联合7D速度-控制子空间体积减少了$2.1\times10^5$倍。此扩展版本增加了完整证明、分离的离线/在线复杂度分析以及控制器扫描、收缩和投影面积研究。

英文摘要

We establish sufficient conditions for robust positive invariance under state- and input-dependent disturbances with anisotropic covariance structure. The proposed ansatz maps a fixed ellipsoidal template through a GP-derived positive-definite matrix field, subsuming scalar homothetic scaling while retaining finite graph-based verification. The resulting LMI conditions couple the learned field to Schur-stable dynamics; an isotropic fallback with inflation factor $r=1/(1-\gamma_{\mathrm{cl}})$ proves admissibility. During each learning epoch the field is frozen, so online tube evaluation is one GP covariance query and a small matrix square root, with no online set iteration or LMI solve. Quadrotor simulations show a $195\times$ reduction in 3D velocity-tube volume and a $2.1{\times}10^5$ reduction in the joint 7D velocity-control subspace relative to a non-adaptive homothetic baseline. This extended version adds full proofs, a separated offline/online complexity analysis, and controller-sweep, contraction, and projection-area studies.

2509.00836 2026-06-16 cs.RO 版本更新

One-Step Model Predictive Path Integral for Manipulator Motion Planning Using Configuration Space Distance Fields

基于构型空间距离场的一步模型预测路径积分用于机械臂运动规划

Yulin Li, Tetsuro Miyazaki, Kenji Kawashima

发表机构 * Department of Information Physics and Computing, The University of Tokyo(东京大学信息物理与计算系)

AI总结 提出将构型空间距离场与模型预测路径积分结合,利用CDF梯度统一代价函数并缩短规划时域至一步,实现高效避障,在2D环境和7自由度机械臂仿真中成功率近100%,控制频率超750Hz。

详情
AI中文摘要

机械臂的运动规划是机器人学中的一个基本问题。经典的基于优化的方法通常依赖符号距离场(SDF)的梯度来施加避碰约束。然而,这些方法容易陷入局部最小值,并且在SDF梯度消失时可能失败。最近,构型空间距离场(CDF)被提出,它直接在机器人的构型空间中建模距离。与工作空间SDF不同,CDF几乎处处可微,因此提供了可靠的梯度信息。另一方面,无梯度方法如模型预测路径积分(MPPI)控制利用长时域滚动来实现避碰。虽然有效,但这些方法由于大量轨迹样本、重复碰撞检测以及设计具有异质物理单位的代价函数的困难而计算昂贵。在本文中,我们提出了一个将CDF与MPPI集成的框架,以实现机器人在其构型空间中的直接导航。利用CDF梯度,我们统一了关节空间中的MPPI代价,并将时域缩短为一步,大幅削减计算量,同时在实际中保持避碰能力。我们证明,我们的方法在2D环境中实现了接近100%的成功率,并在具有复杂障碍物的具有挑战性的7自由度Franka机械臂仿真中持续获得高成功率。此外,我们的方法达到了超过750Hz的控制频率,显著优于基于优化的方法和标准MPPI基线。这些结果突出了所提出的CDF-MPPI框架在高维运动规划中的有效性和效率。

英文摘要

Motion planning for robotic manipulators is a fundamental problem in robotics. Classical optimization-based methods typically rely on the gradients of signed distance fields (SDFs) to impose collision-avoidance constraints. However, these methods are susceptible to local minima and may fail when the SDF gradients vanish. Recently, Configuration Space Distance Fields (CDFs) have been introduced, which directly model distances in the robot's configuration space. Unlike workspace SDFs, CDFs are differentiable almost everywhere and thus provide reliable gradient information. On the other hand, gradient-free approaches such as Model Predictive Path Integral (MPPI) control leverage long-horizon rollouts to achieve collision avoidance. While effective, these methods are computationally expensive due to the large number of trajectory samples, repeated collision checks, and the difficulty of designing cost functions with heterogeneous physical units. In this paper, we propose a framework that integrates CDFs with MPPI to enable direct navigation in the robot's configuration space. Leveraging CDF gradients, we unify the MPPI cost in joint-space and reduce the horizon to one step, substantially cutting computation while preserving collision avoidance in practice. We demonstrate that our approach achieves nearly 100% success rates in 2D environments and consistently high success rates in challenging 7-DOF Franka manipulator simulations with complex obstacles. Furthermore, our method attains control frequencies exceeding 750 Hz, substantially outperforming both optimization-based and standard MPPI baselines. These results highlight the effectiveness and efficiency of the proposed CDF-MPPI framework for high-dimensional motion planning.
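
The one-step idea can be illustrated with a generic MPPI update whose rollout horizon is a single step. The sketch below is not the authors' implementation: `toy_distance` is a hypothetical stand-in (signed distance to a disc for a 2D point robot) for a learned configuration-space distance field, and the cost terms, weights, and parameter values are assumptions chosen for illustration. It does not use CDF gradients.

```python
import numpy as np

def toy_distance(q, center=np.array([0.5, 0.5]), radius=0.2):
    """Hypothetical stand-in for a distance field: signed distance to a disc."""
    return np.linalg.norm(q - center, axis=-1) - radius

def one_step_mppi(q, q_goal, rng, n_samples=256, sigma=0.5, dt=0.05,
                  lam=0.05, margin=0.05, w_coll=100.0):
    """Return one velocity command from a single-step MPPI update."""
    # Nominal command: unit-speed motion straight toward the goal.
    to_goal = q_goal - q
    u_nom = to_goal / (np.linalg.norm(to_goal) + 1e-9)
    # Sample perturbed commands and roll each one out for exactly one step.
    u = u_nom + sigma * rng.standard_normal((n_samples, q.shape[0]))
    q_next = q + dt * u
    # Cost: remaining distance to the goal plus a penalty inside the margin.
    cost = np.linalg.norm(q_next - q_goal, axis=1)
    cost += w_coll * np.maximum(0.0, margin - toy_distance(q_next))
    # Path-integral weights (softmax of negative cost), then weighted average.
    w = np.exp(-(cost - cost.min()) / lam)
    w /= w.sum()
    return w @ u
```

Because every sample is evaluated after a single step, the per-cycle work is one batched distance query rather than a long-horizon rollout with repeated collision checks.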

2509.20084 2026-06-16 cs.RO 版本更新

C-3TO: Continuous 3D Trajectory Optimization on Neural Euclidean Signed Distance Fields

C-3TO:基于神经欧几里得有符号距离场的连续三维轨迹优化

Guillermo Gil, Jose Antonio Cobano, Luis Merino, Fernando Caballero

发表机构 * Service Robotics Laboratory – Universidad Pablo de Olavide (Seville), Spain(帕布罗·奥拉维德大学服务机器人实验室(塞维利亚),西班牙)

AI总结 提出一种在杂乱环境中利用在线神经欧几里得有符号距离场进行连续三维轨迹优化的框架,通过两阶段非线性优化直接优化五次多项式表示的平滑轨迹,实现安全、高效且可动态执行的轨迹规划。

Comments 8 pages, 5 figures, submitted to and accepted at ICUAS 2026

详情
AI中文摘要

本文提出了一种新颖的框架,用于在杂乱环境中进行连续三维轨迹优化,利用在线神经欧几里得有符号距离场(ESDF)。与先前依赖离散化ESDF网格和插值的方法不同,我们的方法直接优化由五次多项式表示的平滑轨迹,该轨迹定义在连续的神经ESDF上,确保整个轨迹上的精确梯度信息。该框架集成了一个两阶段非线性优化管道,平衡了效率、安全性和平滑性。实验结果表明,C-3TO能够生成碰撞感知且动态可行的轨迹。此外,其在定义局部窗口大小和优化参数方面的灵活性,使得能够轻松适应不同用户的需求,而不影响性能。通过将连续轨迹参数化与持续更新的神经ESDF相结合,C-3TO为空中机器人安全高效的局部重规划建立了稳健且可泛化的基础。

英文摘要

This paper introduces a novel framework for continuous 3D trajectory optimization in cluttered environments, leveraging online neural Euclidean Signed Distance Fields (ESDFs). Unlike prior approaches that rely on discretized ESDF grids with interpolation, our method directly optimizes smooth trajectories represented by fifth-order polynomials over a continuous neural ESDF, ensuring precise gradient information throughout the entire trajectory. The framework integrates a two-stage nonlinear optimization pipeline that balances efficiency, safety and smoothness. Experimental results demonstrate that C-3TO produces collision-aware and dynamically feasible trajectories. Moreover, its flexibility in defining local window sizes and optimization parameters enables straightforward adaptation to diverse user needs without compromising performance. By combining continuous trajectory parameterization with a continuously updated neural ESDF, C-3TO establishes a robust and generalizable foundation for safe and efficient local replanning in aerial robotics.
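
A fifth-order polynomial segment has six coefficients per axis, which is exactly enough to match position, velocity, and acceleration at both ends. The sketch below shows only that parameterization for one axis; it is not the C-3TO optimizer, contains no ESDF term, and the function names are hypothetical.

```python
import numpy as np

def quintic_coeffs(p0, v0, a0, p1, v1, a1, T):
    """Coefficients c[0..5] of p(t) = sum_i c[i] * t**i on [0, T]."""
    # Rows: position, velocity, acceleration at t = 0, then the same at t = T.
    A = np.array([
        [1, 0, 0,    0,      0,       0],
        [0, 1, 0,    0,      0,       0],
        [0, 0, 2,    0,      0,       0],
        [1, T, T**2, T**3,   T**4,    T**5],
        [0, 1, 2*T,  3*T**2, 4*T**3,  5*T**4],
        [0, 0, 2,    6*T,    12*T**2, 20*T**3],
    ], dtype=float)
    b = np.array([p0, v0, a0, p1, v1, a1], dtype=float)
    return np.linalg.solve(A, b)

def evaluate(c, t):
    """Evaluate the polynomial; np.polyval expects the highest degree first."""
    return np.polyval(c[::-1], t)
```

In an optimizer, coefficients like these (or the boundary states that determine them) would be the decision variables, and a continuous distance field could be queried at any `t` along the segment rather than at grid nodes.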

2606.08059 2026-06-16 cs.RO 版本更新

Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain

感知行为基础模型:将人体运动先验适应到以机器人为中心的地形

Zifan Wang, Yizhao Li, Teli Ma, Qiang Zhang, Yudong Fan, Hao Xu, Shuo Yang, Junwei Liang

发表机构 * Mondo Robotics The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) Artificial General Intelligence Institute, University of Science and Technology of China(中国科学技术大学通用人工智能研究院)

AI总结 提出感知行为基础模型(Perceptive BFM),通过地形一致参考合成(TCRS)将人体运动先验适应到机器人局部地形,实现地形感知的人形机器人控制。

详情
AI中文摘要

人形机器人行为基础模型旨在从广泛的人体运动先验中获取可复用的全身控制策略,使单一控制器能够产生多样且富有表现力的行为。然而,现有的以运动为中心的基础策略大多假设参考运动已经与机器人周围环境物理兼容。当演示者、操作者和机器人处于不同环境时,这一假设不再成立:人体运动可能指定了预期行为,但并未指定机器人局部地形所需的落脚点、间隙、身体高度或接触时机。我们引入了\emph{感知行为基础模型}(Perceptive BFM),这是一种地形感知的人形机器人控制框架,将人体运动先验植根于以机器人为中心的感知。该模型保留原始运动学运动参考作为行为接口,同时利用局部地形观测来调整接触、姿态和时机。为了提供可扩展的地形监督,我们开发了\emph{地形一致参考合成}(TCRS),通过接触感知的落脚点构建、足部几何感知的摆动优化、支撑感知的根部重建、碰撞修复和多点逆运动学,将面向行走的人体运动片段转换为地形一致的参考。然后,我们训练一个盲适应参考教师,并通过目标帧动作对齐将其地形一致行为迁移到部署的原始参考学生。学生是一个身份门控Transformer跟踪器,其地形特征通过残差路径进入,这些路径的初始化方式保留了运动跟踪先验,并被训练为仅在需要时产生局部修正。

英文摘要

Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain. We introduce \emph{Perceptive Behavior Foundation Model} (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing. To provide scalable terrain supervision, we develop \emph{terrain-conformal reference synthesis} (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.

3. 操作、抓取与灵巧手 18 篇

2606.14981 2026-06-16 cs.RO cs.AI cs.LG 新提交

Inference-time Policy Steering via Vision and Touch

通过视觉和触觉进行推理时策略引导

Yilin Wu, Zilin Si, Zeynep Temel, Oliver Kroemer, Andrea Bajcsy

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出ViTaL框架,通过视觉采样验证和触觉引导扩散编辑的双层优化,在推理时引导机器人策略,显著提升接触丰富操作任务的成功率。

详情
AI中文摘要

推理时引导在部署过程中,通过在执行前验证候选动作来适配预训练的生成式机器人策略。虽然先前的方法通常仅使用视觉观察进行验证,但对于接触丰富的操作任务,仅靠视觉往往不足,因为成功既取决于全局任务进展,也取决于微妙的局部交互(如接触力)。我们提出了ViTaL,一个视觉-触觉推理时引导框架,将多模态引导形式化为双层优化问题。在高层,视觉采样与验证执行长时域模式选择,决定机器人应执行何种行为。在低层,触觉引导的扩散编辑在较短时域内细化所选动作序列,以满足局部接触要求。为了支持基于结果的引导,ViTaL学习了一个视觉-触觉潜在世界模型,并采用了语义对齐的视觉和触觉验证器,包括一个新颖的文本条件触觉奖励,直接在潜在空间中对预测的触觉未来进行评分。在三个真实世界的接触丰富操作任务中,ViTaL相对于基础策略将整体成功率提高了51%,比单模态引导至少高出33%,并且比朴素多模态融合至少高出20%。网站:https://yilin-wu98.github.io/vital_website。

英文摘要

Inference-time steering adapts pre-trained generative robot policies during deployment by verifying candidate actions before execution. While prior methods typically perform this verification only with visual observations, vision alone is often insufficient for contact-rich manipulation, where success depends on both global task progress and subtle local interactions such as contact force. We introduce ViTaL, a visuo-tactile inference-time steering framework that formulates multimodal guidance as a bi-level optimization problem. At the high level, visual sampling-and-verification performs long-horizon mode selection, deciding what behavior the robot should execute. At the low level, tactile-guided diffusion editing refines the selected action sequence over a shorter horizon to satisfy local contact requirements. To support outcome-based steering, ViTaL learns a visuo-tactile latent world model and employs semantically aligned visual and tactile verifiers, including a novel text-conditioned tactile reward that scores predicted tactile futures directly in latent space. Across three real-world contact-rich manipulation tasks, ViTaL improves overall success by 51% over the base policy, outperforms unimodal steering by at least 33%, and exceeds naive multimodal fusion by at least 20%. Website: https://yilin-wu98.github.io/vital_website.

2606.15133 2026-06-16 cs.RO cs.CV 新提交

DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

DragMesh-2: 与铰接物体的物理合理灵巧手-物体交互

Tianshan Zhang, Yijia Duan, Yanjun Li, Zeyu Zhang, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院)

AI总结 提出DragMesh-2框架,通过接触驱动的灵巧手-铰接物体交互,结合基于物理知识的接触感知训练机制PICA,在无触觉反馈下提升变接触负载的鲁棒性。

Comments Code: https://github.com/AIGeeksGroup/DragMesh-2. Website: https://aigeeksgroup.github.io/DragMesh-2

详情
AI中文摘要

与铰接物体的灵巧交互对于家庭、辅助和人形操作至关重要,其中多指手可以提供超越平行爪抓取的顺应接触模式。然而,铰接物体操作不同于静态物体操作:目标部件无法直接驱动,其运动必须通过持续的物理手-手柄接触来实现。这使得从以物体为中心的铰接生成到手驱动的灵巧手-物体交互的转变变得非平凡,因为几何轨迹重放或开环执行无法模拟移动铰接部件所需的接触动力学。此外,仅在固定动力学下为任务完成训练的策略可能会过拟合标称接触负载,尤其是在没有触觉或力反馈的情况下,并且当接触负载变化时性能可能会下降。为了应对这些挑战,我们提出了DragMesh-2,一个用于与铰接物体灵巧交互的接触驱动框架,它将铰接交互从以物体为中心的生成扩展到手驱动的灵巧手-物体交互,其中铰接运动必须通过物理接触产生。我们进一步提出了PICA,一种基于物理知识的接触感知训练机制,它在没有触觉或力反馈的情况下将物理信号注入策略学习,提高了在变化接触负载下的鲁棒性和任务成功率。最后,我们在多个阻尼条件和铰接物体类别上进行了系统评估,以研究接触负载变化下的鲁棒性,并提供了一个纯几何的灵巧交互资源,以支持未来的运动-操作(loco-manipulation)和人形手-物体交互研究。在七个GAPartNet物体上,DragMesh-2在接触负载变化下比对比方法实现了更强的鲁棒性,同时在各种阻尼条件下保持了高任务成功率。

英文摘要

Dexterous interaction with articulated objects is important for household, assistive, and humanoid manipulation, where multi-finger hands can provide compliant contact patterns beyond parallel-jaw grasping. However, articulated-object manipulation differs from static-object manipulation: the target part cannot be directly actuated, and its motion must emerge through sustained physical hand--handle contact. This makes the transition from object-centric articulated generation to hand-driven dexterous hand--object interaction non-trivial, since geometric trajectory replay or open-loop execution does not model the contact dynamics required to move the articulated part. Moreover, policies trained only for task completion under fixed dynamics can overfit nominal contact loads, especially without tactile or force feedback, and may degrade when the contact load changes. To address these challenges, we present DragMesh-2, a contact-driven framework for dexterous interaction with articulated objects that extends articulated interaction from object-centric generation to hand-driven dexterous hand--object interaction, where articulated motion must arise through physical contact. We further propose PICA, a physically informed contact-aware training mechanism that injects physical signals into policy learning without tactile or force feedback, improving robustness and task success under changing contact loads. Finally, we conduct systematic evaluation across multiple damping conditions and articulated-object categories to study robustness under contact-load variation, and provide a pure-geometry dexterous interaction resource to support future loco-manipulation and humanoid hand--object interaction research. Across seven GAPartNet objects, DragMesh-2 achieves stronger robustness under contact-load variation than the compared methods while maintaining high task success across damping conditions.

2606.15171 2026-06-16 cs.RO 新提交

Seam-to-Graph Reconstruction for Garment Configuration Alignment

用于服装配置对齐的缝线到图重建

Xuzhao Huang, Kai Tang, Fuyuki Tokuda, Norman C. Tien, Kazuhiro Kosuge

发表机构 * JC STEM Lab of Robotics for Soft Materials, Department of Electrical and Electronic Engineering, Faculty of Engineering, The University of Hong Kong(香港大学工程学院电气与电子工程系JC STEM软材料机器人实验室) Unprecedented-scale Data Analytics Center, Tohoku University(东北大学前所未有规模数据分析中心) Graduate School of Information Sciences, Tohoku University(东北大学信息科学研究科) Department of Electrical and Electronic Engineering, Faculty of Engineering, The University of Hong Kong(香港大学工程学院电气与电子工程系)

AI总结 提出基于图神经网络和注意力机制的Seam-to-Graph网络,将部分可观测的缝线信息映射为拓扑结构骨架图,用于实时服装状态估计,并设计变形感知分层视觉伺服控制器实现服装配置对齐,在双臂机器人系统上验证了人类级精度和鲁棒性。

Comments 11 pages, 9 figures

详情
AI中文摘要

缝线编码了服装的丰富结构信息,但在机器人操作场景中通常仅部分可观测。为了稳健地利用缝线信息,我们提出了一种基于图神经网络和注意力机制的Seam-to-Graph网络。该网络将非结构化的缝线观测映射为拓扑编码的结构骨架图,用于实时服装状态估计。基于这种骨架图状态估计,我们设计了一个变形感知的分层视觉伺服控制器,用于服装配置对齐。我们在双臂机器人系统上实现了该控制器,以将服装装载到丝网印刷平台上并精确对齐到期望配置。真实机器人实验表明,使用所提方法的机器人不仅实现了人类级别的对齐精度且对齐误差方差更小,而且对不同服装具有鲁棒性。这些结果表明,利用缝线信息对于服装操作是有效的。

英文摘要

Seams encode rich structural information about garments but are frequently partially observable in robotic manipulation scenarios. To robustly leverage seam information, we propose a Seam-to-Graph network based on graph neural networks and attention mechanisms. This network maps unstructured seam observations to a topology-encoded structural skeleton graph for real-time garment state estimation. Using this skeleton-graph-based state estimation, we design a deformation-aware, hierarchical visual servoing controller for garment configuration alignment. We implement this controller on a bimanual robot system to load a garment onto a screen printing platen and to align it to the desired configuration precisely. Real-robot experiments demonstrate that the robot using the proposed method not only achieves human-level alignment accuracy with reduced variance in alignment error but is also robust to different garments. These results demonstrate that the use of seam information is effective for garment manipulation.

2606.15516 2026-06-16 cs.RO 新提交

Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands

传递接触,而不仅仅是运动:跨灵巧手的柔顺抓取

Soofiyan Atar, Yao-Ting Huang, Michael Yip

发表机构 * University of California San Diego(加州大学圣迭戈分校)

AI总结 提出跨本体力-位置接口,通过校准力矩和指尖力实现异构灵巧手间的接触感知抓取,结合流匹配视觉运动策略和混合力位控制器,实现可迁移的柔顺抓取。

Comments Website(overview): transferring-contact-not-just-motion.github.io

详情
AI中文摘要

灵巧抓取依赖于接触调节,而不仅仅是运动。稳定操作要求手指在接触滑动、变形或视觉遮挡时保持适当的物体负载。现有的跨本体灵巧策略通过重定向手部姿态或潜在动作统一运动,但力反馈仍与每只手的感觉和驱动绑定,限制了迁移。本文引入了一种跨本体力-位置接口,用于异构灵巧手之间的接触感知操作。运动意图在共享的手部姿态潜在空间中表示,而每只手的力信号通过系统辨识校准为物理关节扭矩(单位N.m)。这些扭矩被映射为指尖力和紧凑的每指负载描述符,使策略获得关于手部应移动到哪里以及物体如何加载的可比观测。利用该接口,训练了一个流匹配视觉运动策略,输入视觉、本体感觉和校准后的接触,并采用结构化视觉掩码,在抓取相关遮挡下鼓励依赖力。相同的校准信号驱动混合力-位置控制器进行演示采集和执行,保持训练和部署中的力目标一致。在结构不同的手上进行的实验表明,校准的接触反馈实现了可迁移的柔顺抓取,学习到的基元可在长时程操作流程中重复使用。

英文摘要

Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires fingers to maintain appropriate object loading as contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, limiting transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent, while each hand's effort signal is calibrated through system identification into physical joint torque in N.m. These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent across training and deployment. Experiments across structurally different hands show that calibrated contact feedback enables transferable compliant grasping, with learned primitives reusable in long-horizon manipulation pipelines.

2606.15909 2026-06-16 cs.RO 新提交

GeoTLM: Geometry-aware Tactile-Language Models for Contact Motion Orientation Reasoning of Dynamic Objects

GeoTLM: 面向动态物体接触运动方向推理的几何感知触觉语言模型

Qiutian Li, Zinan Liu, Lin Wang

发表机构 * School of EEE, Nanyang Technological University (NTU)(南洋理工大学电气与电子工程学院)

AI总结 提出GeoTLM,通过可微几何表示(DGR)提取触觉剪切场中的几何先验,提升动态物体旋转和滑动方向推理能力,在旋转和滑动任务上分别提升14.6%和16.2%的准确率。

Comments 7 pages, 3 figures, 4 tables

详情
AI中文摘要

现代触觉语言模型(TLMs)在机器人学习任务(如材料和纹理识别)中展现出潜力。然而,对于接触密集场景,这些TLMs难以理解动态物体的物理属性,如旋转和滑动方向。例如,我们的初步实验表明,流行的TLMs(如Sparsh和AnyTouch2)在基于GelSight Mini触觉数据的基本旋转方向推理上表现较弱。这一令人惊讶的差距启发我们探索一个新的研究问题:能否将物理基础的几何先验注入TLMs,以实现对动态物体属性的可靠接触方向推理?为此,我们提出GeoTLM,一种新颖的几何表示引导的TLM,用于感知动态接触事件。我们的关键思想是在语言级推理之前保留并结构化触觉剪切场几何,而不是将低分辨率触觉令牌强行塞入脆弱的封闭形式物理算子。为实现这一点,我们提出一种轻量级(仅14k参数)但新颖的可微几何表示(DGR)。具体地,DGR在剪切场中学习接触掩码引导的表示,并通过反对称七区域池化设计进行聚合,其动机是旋转接触产生反对称变形模式的物理直觉。我们在两个代表性任务上进行实验:旋转方向和滑动方向推理。大量实验表明,相比不含几何编码器的相同骨干网络,GeoTLM将新物体旋转准确率提升了14.6%,将真实传感器滑动准确率提升了16.2%。总体而言,我们的工作为物理基础的触觉语言推理开辟了新途径,在动态物体理解和接触密集的机器人操作方面具有巨大潜力。

英文摘要

Modern tactile-language models (TLMs) have shown potential for robot learning tasks, such as material and texture recognition. However, for contact-rich scenarios, these TLMs struggle to understand the physical properties of dynamic objects, such as rotation and sliding directions. For instance, our preliminary experiments reveal that popular TLMs, such as Sparsh and AnyTouch2, exhibit weak performance on basic rotation direction reasoning from GelSight Mini tactile data. This surprising gap inspires us to explore a novel research question: Can we inject physically grounded geometric priors into TLMs to enable reliable contact orientation reasoning of dynamic object properties? To this end, we propose GeoTLM, a novel geometric representation-guided TLM for the perception of dynamic contact events. Our key idea is to preserve and structure tactile shear-field geometry before language-level reasoning, rather than forcing low-resolution tactile tokens into fragile closed-form physics operators. To achieve this, we propose a lightweight (only 14k parameters) yet novel Differentiable Geometric Representation (DGR). Specifically, DGR learns a contact-mask-guided representation in the shear field and aggregates it through an antisymmetric seven-region pooling design, motivated by the physical intuition that rotational contact produces antisymmetric deformation patterns. We conduct experiments on two representative tasks: rotation direction and sliding direction reasoning. Extensive experiments show that GeoTLM improves novel-object rotation accuracy by +14.6% and real-sensor sliding accuracy by +16.2% over the same backbone without the geometric encoder. Overall, our work paves a new way for physically grounded tactile-language reasoning, with strong potential for dynamic object understanding and contact-rich robotic manipulation.

2606.16078 2026-06-16 cs.RO 新提交

A Deployment Case Study in Robotic Apparel Automation: Digital Twin Integration, Interoperability, and Workforce Enablement

机器人服装自动化部署案例研究:数字孪生集成、互操作性与劳动力赋能

Gokul Narayanan, Abhiroop Ajith, Jonathan Zornow, Carlos Calle, Auralis Herrero Lugo, Jose Luis Susa Rincon, Chengtao Wen, Eugen Solowjow

发表机构 * Siemens Corporation(西门子股份公司) Sewbo Levi's(李维斯) Bluewater Defense

AI总结 针对织物柔性导致的机器人操作难题,本文通过牛仔布制造案例,提出集成数字线程、数字孪生、互操作层及运行时监控的机器人缝纫系统,实现快速部署与鲁棒性提升。

Comments 4 pages, 3 figures, IEEE ICRA 2026 Workshop Paper

详情
AI中文摘要

尽管在电子和汽车制造等领域的柔性自动化取得了稳步进展,但由于织物具有可变形性且难以用机器人操作,服装自动化仍然具有挑战性。本文介绍了一个面向部署的牛仔布制造机器人缝纫系统案例研究,强调了实际应用所需的系统级集成。在工程层面,数字线程模块将DXF生产图纸解析为工艺参数和可执行的机器人轨迹,减少了手动编程工作量,并实现了跨缝纫操作的快速重新定位。同时,在部署前使用工作单元的数字孪生来验证可达性和间隙、优化布局和顺序、评估操作员访问以及评估与上下游任务的节拍兼容性,从而降低调试风险。在部署阶段,系统通过互操作层将协作机器人与传统缝纫设备、焊接、吸盘夹具和机器级控制器集成。运行时监控与验证(包括缝迹监控、碰撞检查和轨迹级验证)提高了环境变化下的鲁棒性,而面向操作员的培训和指导工具支持设置、故障排除和技术采纳。在牛仔短裤上进行的两次分阶段工厂部署(涵盖2D口袋操作和3D服装成型缝迹)表明,基于数字孪生的验证、数字线程驱动的任务生成、互操作性、运行时验证和操作员培训对于扩展机器人服装自动化至关重要。

英文摘要

Despite steady advances in flexible automation in sectors such as electronics and automotive manufacturing, apparel automation remains challenging because fabrics are deformable and difficult to manipulate with robots. This paper presents a deployment-oriented case study of a robotic sewing system for denim manufacturing, emphasizing the system-level integration required for practical adoption. At the engineering level, a digital thread module parses DXF production drawings into process parameters and executable robot trajectories, reducing manual programming effort and enabling rapid re-targeting across sewing operations. In parallel, a digital twin of the workcell is used during pre-deployment to validate reach and clearance, refine layout and sequencing, evaluate operator access, and assess cycle-time compatibility with upstream and downstream tasks, thereby reducing commissioning risk. At deployment, the system integrates a collaborative robot with conventional sewing equipment, welding, suction fixtures, and machine-level controllers through an interoperability layer. Runtime monitoring and verification, including seam monitoring, collision checking, and trajectory-level validation, improve robustness under environmental variability, while operator-facing training and guidance tools support setup, troubleshooting, and technology adoption. Two staged factory deployments on denim shorts, covering 2D pocket operations and 3D garment-shaping seams, show that digital-twin-based validation, digital-thread-driven task generation, interoperability, runtime verification, and operator training are important for scaling robotic apparel automation.

2606.16272 2026-06-16 cs.RO 新提交

TopoRetarget: Interaction-Preserving Retargeting for Dexterous Manipulation

TopoRetarget:面向灵巧操作的交互保持重定向

Jielin Wu, Shenzhe Yao, Guanqi He, Xiaohan Liu, Zhaoqing Zeng, Xiangrui Jiang, Han Yang, Wentao Zhang, Hang Zhao

发表机构 * IIIS, Tsinghua University(清华大学交叉信息研究院)

AI总结 提出TopoRetarget框架,通过稀疏交互图和距离加权拉普拉斯变形,在重定向中保持手-物体交互结构,提升灵巧操作强化学习策略的性能。

Comments Project page: https://toporetarget2026.github.io/TopoRetarget/

详情
AI中文摘要

人类手-物体演示通过参考跟踪为训练灵巧操作强化学习策略提供了密集的参考运动。然而,要将此类演示用于策略学习,重定向必须保留手部姿态和任务相关的手-物体接触结构。否则,接触和可行性伪影会降低下游策略的性能。我们提出TopoRetarget,一种交互保持的重定向框架,它在不同重定向条件下使用单一参数集,同时保持任务相关的手-物体交互,并将人类演示适应到灵巧机器人手。该方法在手和物体关键点上构建稀疏交互图,并优化带有方向一致性、运动学约束和穿透处理的距离加权拉普拉斯变形。评估表明,生成的参考提高了交互保真度和策略学习:TopoRetarget在ContactPose数据集上实现了所有基线中最佳的接触精度和对齐,将笔旋转(Pen-Spin)训练成功率比现有基线方法提高了40.6个百分点,并在立方体重新定向和笔旋转任务上实现了对Wuji手硬件的零样本迁移。

英文摘要

Human hand-object demonstrations provide dense reference motions for training dexterous manipulation reinforcement learning (RL) policies through reference tracking. However, to use such demonstrations for RL policy learning, retargeting must preserve hand pose and task-relevant hand-object contact structure. Otherwise, contact and feasibility artifacts can degrade downstream RL policy performance. We introduce TopoRetarget, an interaction-preserving retargeting framework that uses a single set of parameters across diverse retargeting conditions while maintaining task-relevant hand-object interaction and adapting human demonstrations to dexterous robot hands. The method constructs a sparse interaction graph over hand and object keypoints and optimizes distance-weighted Laplacian deformation with directional consistency, kinematic constraints, and penetration handling. Evaluations show that the generated references improve both interaction fidelity and policy learning: TopoRetarget achieves the best contact precision and alignment over all baselines on the ContactPose Dataset, improves Pen-Spin training success by 40.6 percentage points over the existing baseline methods, and enables zero-shot transfer to Wuji Hand hardware on cube reorientation and pen spinning.

2606.16370 2026-06-16 cs.RO 新提交

ART-Glove: Articulated Tactile Glove for Contact-Grounded Dexterous Interaction Capture

ART-Glove:用于捕获以接触为基础的灵巧交互的关节式触觉手套

Changyi Lin, Ding Zhao

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出ART-Glove关节式触觉手套,通过16个刚性功能表面和22个解剖对齐关节,同步捕获22自由度关节运动和2048触觉点接触信息,支持下游灵巧机器人学习。

详情
AI中文摘要

我们提出ART-Glove,一种关节式触觉手套,旨在捕获以接触为基础的灵巧演示,同时保持人类灵巧性。ART-Glove通过覆盖手指、拇指和手掌的16个刚性功能表面使手侧接触几何显式化。22个解剖对齐关节连接这些表面,使其在灵巧操作过程中跟随人类手部运动。基于编码器的传感跟踪表面运动,而密集的压阻式触觉传感记录相同表面上的接触。完整系统以120 Hz同步捕获22自由度关节测量和2048个触觉点的测量。我们通过运动自由度、关节传感、触觉传感和接触丰富交互捕获实验评估ART-Glove,证明其能够在保持人类灵巧性的同时,记录可支持下游灵巧机器人学习的接触信息。

英文摘要

We present ART-Glove, an articulated tactile glove designed to capture contact-grounded dexterous demonstrations while preserving human dexterity. ART-Glove makes hand-side contact geometry explicit with 16 rigid functional surfaces covering the fingers, thumb, and palm. Twenty-two anatomically aligned joints connect these surfaces and allow them to follow human hand motion during dexterous manipulation. Encoder-based sensing tracks surface motion, while dense piezoresistive tactile sensing records contact over the same surfaces. The complete system captures synchronized 22-DoF joint measurements and 2048-taxel tactile measurements at 120 Hz. We evaluate ART-Glove across experiments on motion freedom, joint sensing, tactile sensing, and contact-rich interaction capture, demonstrating its ability to preserve human dexterity while recording contact-grounded information that can support downstream dexterous robot learning.

2606.16436 2026-06-16 cs.RO cs.CV 新提交

V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

V2P-Manip:从单目人类视频学习灵巧操作

Kaihan Chen, Yanming Shao, Haifeng Ji, Xiaokang Yang, Yao Mu

发表机构 * Zhejiang University(浙江大学) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出V2P-Manip框架,从单目人类演示视频中提取具有视觉保真度和物理合理性的轨迹,通过两阶段精炼实现空间对齐与物理一致性,在TACO和OakInk基准上显著优于先前方法。

详情
AI中文摘要

实现自主机器人灵巧操作需要大规模精确、类人的动作序列。作为昂贵遥操作数据的可扩展补充,从单目视频中提取兼具视觉保真度和物理合理性的轨迹是具身智能的一个有前景的前沿方向。为此,我们引入V2P-Manip,一个高效的框架,旨在直接从人类演示视频中学习灵巧操作策略。我们建立了一个高效、集成的流水线,涵盖3D资产获取、轨迹估计和灵巧策略学习。为了弥合视觉感知与物理约束之间的差距,我们引入了一个两阶段精炼过程,以强制执行空间对齐和物理一致性。在TACO和OakInk基准上的评估表明,我们的方法在姿态精度、对非结构化环境的适应性以及训练效率方面显著优于先前方法。最终,实验结果证实了在多个合成操作任务上平均成功率超过75%,并验证了提取的操作先验在不同灵巧手形态上的适应性。

英文摘要

Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process to enforce spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Ultimately, experimental results confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.

2606.16504 2026-06-16 cs.RO 新提交

APEX: Adaptive Policy Execution for Precise Manipulation

APEX: 用于精确操作的适应性策略执行

Mengfei Zhao, Chenxi Jiang, Tuo An, Jindou Jia, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(南洋理工大学MARS实验室)

AI总结 针对策略与控制器间的执行差距,提出即插即用的APEX框架,通过动态可行参考重建和测试时自适应,减少跟踪误差并提升操作成功率。

Comments 20 pages, 9 figures, 4 tables

详情
AI中文摘要

现代模仿学习方法,包括视觉运动策略和视觉-语言-动作(VLA)策略,通常输出高层动作参考,由低层控制器执行。然而,缺乏高阶参考信号以及策略在训练过程中对底层控制动态的不了解,不可避免地导致了执行差距。结果,实际执行的动作系统性地偏离策略所指令的动作,对精度敏感的操作产生关键影响。先前的工作要么修改策略架构,要么修改低层控制器,两者都需要对预训练策略或封装控制器进行侵入式更改。这引发了一个自然问题:当策略和控制器都被视为不可访问的黑盒时,我们能否弥合执行差距?我们提出了适应性策略执行(APEX),这是一个插入在策略和控制器之间的即插即用框架,从策略输出中重建动态可行的参考,并在测试时根据低层状态反馈进行自适应,具有可证明的收敛保证。广泛的实证研究表明,APEX在演示回放中将控制器引起的跟踪误差减少了41.2%,并在四种视觉运动策略和VLA策略类别上将操作成功率提高了4.8-25.8个百分点。

英文摘要

Modern imitation learning methods, including visuomotor and Vision-Language-Action (VLA) policies, typically output high-level action references that are executed by low-level controllers. However, the absence of higher-order reference signals, together with the policy's lack of awareness of the underlying low-level control dynamics during training, inevitably induces an execution gap. As a result, realized actions deviate systematically from policy-commanded ones, with a critical impact on precision-sensitive manipulation. Prior work either modifies the policy architecture or the low-level controller, both requiring intrusive changes to the pretrained policy or packaged controller. This raises a natural question: when the policy and controller are both treated as inaccessible black boxes, can we bridge the execution gap? We propose Adaptive Policy Execution (APEX), a plug-and-play framework inserted between the policy and the controller that reconstructs a dynamically feasible reference from policy outputs and adapts at test-time according to low-level state feedback, with a provable convergence guarantee. Extensive empirical studies show that APEX reduces controller-induced tracking error by 41.2% on demonstration replay and improves manipulation success by 4.8--25.8 percentage points across four visuomotor and VLA policy classes.

2606.16690 2026-06-16 cs.RO cs.AI cs.CV 新提交

PATCH: Action-Chunk-Conditioned Latent Patch Innovation Monitoring for Robot Manipulation

PATCH: 基于动作块条件潜在补丁创新的机器人操作监控

Yanan Zhou, Ranpeng Qiu, Yincong Chen, Jiajie Cui, Weiming Zhi

发表机构 * School of Computer Science, The University of Sydney(悉尼大学计算机科学学院) Australian Centre For Robotics, The University of Sydney(悉尼大学澳大利亚机器人中心)

AI总结 提出PATCH监控器,通过动作块条件潜在补丁创新检测局部场景动态,实现扰动感知的机器人操作干预与恢复。

详情
AI中文摘要

基于学习的操作策略在真实世界机器人操作中取得了实质性进展,特别是在短视界动作生成方面。然而,在开放工作空间中部署时,面对意外的局部场景动态(如移动物体、短暂遮挡或预期运动附近的干扰)仍然脆弱。现有的运行时监控器通常依赖全局观测异常、策略不确定性或帧级视觉变化,难以区分任务相关的执行风险与良性的视觉变化。我们提出PATCH,一种用于部署时干预的基于动作块条件的潜在补丁创新监控器。给定当前动作块,PATCH定义了一个投影执行走廊,预测其内部的潜在补丁演化,并累积机器人自身运动无法解释的持续残差。这些残差形成局部化的干预信号,使PATCH-Router能够暂停执行、选择可用的恢复源,并在局部创新消退后恢复原始策略。在真实机器人 rollout 数据上的实验表明,PATCH 比竞争性运行时监控器产生更稳定且上下文相关的触发信号。真实机器人部署进一步展示了监控驱动的干预和策略恢复,用于扰动感知的操作。项目页面:https://yananzhou5555.github.io/PATCH/。

英文摘要

Learning-based manipulation policies have made substantial progress in real-world robot manipulation, particularly for short-horizon action generation. However, deployment in open workspaces remains fragile under unexpected local scene dynamics, such as moving objects, transient occlusions, or disturbances near the intended motion. Existing runtime monitors often rely on global observation anomalies, policy uncertainty, or frame-level visual changes, and struggle to distinguish task-relevant execution risk from benign visual variation. We introduce PATCH, an action-chunk-conditioned latent patch innovation monitor for deployment-time intervention. Given the active action chunk, PATCH defines a projected execution corridor, predicts latent patch evolution inside it, and accumulates persistent residuals unexplained by the robot's own motion. These residuals form a localized intervention signal that allows PATCH-Router to pause execution, select an available recovery source, and resume the original policy once localized innovation subsides. Experiments on real robot rollout data show that PATCH produces more stable and context-relevant triggers than competing runtime monitors. Real-robot deployment further demonstrates monitor-driven intervention and policy resumption for disturbance-aware manipulation. Project Page: https://yananzhou5555.github.io/PATCH/.

2606.16978 2026-06-16 cs.RO cs.LG cs.SY eess.SY 新提交

Task-Error Residual Learning for Real-Robot Five-Ball Juggling

任务误差残差学习用于真实机器人五球杂耍

Kai Ploeger, Jan Peters

发表机构 * Technical University of Darmstadt(达姆施塔特工业大学) German Research Center for AI (DFKI)(德国人工智能研究中心) Hessian Center for Artificial Intelligence (hessian.AI)(黑森州人工智能中心)

AI总结 提出基于任务误差方向监督和误差模型驱动样本选择的残差学习方法,在Barrett WAM机械臂上实现稳定三、四、五球杂耍,首次尝试掉球后任务误差单调递减,此后未再失败。

Comments Submitted to the 2026 International Symposium on Robotics Research (ISRR)

AI中文摘要

对于改进现有行为的残差学习,样本效率取决于两个因素:每次rollout返回的信息量,以及学习器使用这些信息的效率。强化学习的标准标量奖励携带的信息远少于定义任务的方向性任务误差。随机探索进一步丢弃了每次rollout返回的信息。通过使用方向性任务误差监督和驱动样本选择的任务误差模型进行残差学习,我们在拟人化Barrett WAM机械臂上实现了稳定的三、四、五球杂耍。尽管规划和控制都经由一个简单、理想化的堆栈完成,系统从第二次尝试开始收敛。第一次尝试掉球,此后任务误差单调递减,没有进一步的失败。相比之下,五球杂耍通常需要人类多年的练习。我们在两个三元轴上比较残差学习器:学习反馈中的方向性信息,以及对解析先验的依赖程度,涵盖牛顿式雅可比更新、复合贝叶斯优化和随机搜索方法。两个轴都被证明是必要的:方向性反馈或信息性先验单独都不足够,而结合两者的最简单方法(固定雅可比牛顿更新)最为可靠。学习到的残差能够容忍大量的先验失准和退化的关节跟踪,受影响的主要是收敛速度。因此,真实机器人上残差学习的瓶颈是监督信号的信息内容以及学习器如何使用它,而不是周围堆栈的精度。所有实验的视频文档可在 https://kai-ploeger.com/residual-juggling 获取。

英文摘要

For residual learning that refines existing behavior, sample efficiency depends on two things: how much information each rollout returns, and how efficiently the learner uses that information. Reinforcement learning's standard scalar reward carries far less information than the directional task error that defines the task. Random exploration further discards whatever information each rollout returns. Through residual learning with directional task-error supervision and a task error model that drives sample selection, we achieve stable three-, four-, and five-ball juggling on anthropomorphic Barrett WAM arms. Despite planning and controlling through a simple, idealized stack, the system converges from the second attempt. The first attempt drops, after which task error decreases monotonically without further failures. In comparison, five-ball juggling typically takes humans years of practice. We compare residual learners across two ternary axes, the directional information in the learning feedback and the commitment of the analytic prior, spanning Newton-style Jacobian updates, Composite Bayesian Optimization, and stochastic search methods. Both axes prove necessary: neither directional feedback nor an informative prior suffices alone, and the simplest method that combines them, a fixed-Jacobian Newton update, is the most reliable. The learned residual tolerates substantial prior misalignment and degraded joint tracking, affecting mainly convergence speed. The bottleneck for residual learning on real robots is therefore the information content of the supervision signal and how the learner uses it, not the accuracy of the surrounding stack. Video documentation of all experiments is available at https://kai-ploeger.com/residual-juggling.
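To make the "fixed-Jacobian Newton update" concrete, here is a minimal sketch on a made-up two-parameter problem. It is not the paper's juggling setup: the `rollout` function, its coefficients, and the identity prior Jacobian are all hypothetical. The point it illustrates is that each rollout returns a directional (vector) task error, and a prior Jacobian that is never re-estimated can still drive the error down as long as it is not too far from the true one.

```python
def solve2(J, e):
    """Solve the 2x2 linear system J @ d = e by Cramer's rule."""
    (a, b), (c, d) = J
    det = a * d - b * c
    return [(e[0] * d - b * e[1]) / det, (a * e[1] - c * e[0]) / det]


def rollout(theta):
    """Toy stand-in for a real rollout: returns the directional task error."""
    x, y = theta
    return [1.3 * x + 0.2 * y + 0.05 * x * x - 1.0,
            -0.1 * x + 0.8 * y - 0.5]


# Deliberately misaligned analytic prior; it stays fixed across all attempts.
PRIOR_J = [[1.0, 0.0], [0.0, 1.0]]


def refine(theta, steps=15):
    """Repeat theta <- theta - PRIOR_J^{-1} * error, logging the error norm."""
    history = []
    for _ in range(steps):
        err = rollout(theta)
        history.append((err[0] ** 2 + err[1] ** 2) ** 0.5)
        step = solve2(PRIOR_J, err)
        theta = [theta[0] - step[0], theta[1] - step[1]]
    return theta, history


theta, history = refine([0.0, 0.0])
```

In this toy problem the iteration contracts because the prior Jacobian is close enough to the true one; a scalar reward would not provide the direction needed for this kind of update.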

2606.17054 2026-06-16 cs.RO 新提交

Human Universal Grasping

人类通用抓取

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

发表机构 * New York University(纽约大学) Tsinghua University(清华大学) University of Michigan(密歇根大学)

AI总结 提出HUG模型,利用人类抓取数据(1M-HUGs数据集)和流匹配方法,从单张RGB-D图像生成多样化抓取姿态,并重定向到机器人手,实现零样本抓取;在HUG-Bench的真实世界测试集上比最先进的抓取基线高出23%和34%。

Comments 28 pages, 20 figures, 7 tables

AI中文摘要

人类可以轻松抓取物体,而多指机器人远未达到这种通用性。我们认为机器人抓取数据最自然的来源是人类,他们每天拿起数千个物体。我们提出HUG,一个流匹配模型,能够为任何用户指定的物体(从立体相机捕获的单张RGB-D图像中)生成多样化的人类抓取。使用智能眼镜,我们首先收集了1M-HUGs,一个自我中心的人类抓取数据集,涵盖100万帧(27.8小时)和41栋建筑中的6,707个物体实例。接下来,为了建模自然人类抓取的分布,我们的新型流匹配模型融合RGB和深度观测,输出由手腕平移、手腕旋转和MANO手姿态参数化的抓取。预测的抓取可以重定向到各种机器人手,实现在日常场景中的零样本抓取。为了标准化评估,我们构建了一个新的模拟基准HUG-Bench,包含来自五个几何类别和不同尺寸的90个未见物体,并带有公制尺度的3D网格。我们在真实世界中评估HUG,使用HUG-Bench的30个物体测试集,跨越多个立体相机、机器人实体和家庭环境。HUG在我们具有挑战性的物体集上比最先进的抓取基线高出23%和34%。代码、数据、基准、检查点和交互式演示已在我们的网站上发布:https://grasping.io/

英文摘要

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/

2606.17055 2026-06-16 cs.RO 新提交

T-Rex: Tactile-Reactive Dexterous Manipulation

T-Rex: 触觉反应灵巧操作

Dantong Niu, Zhuoyang Liu, Zekai Wang, Boning Shao, Zhao-Heng Yin, Anirudh Pai, Yuvan Sharma, Stefano Saravalle, Ruijie Zheng, Jing Wang, Ryan Punamiya, Mengda Xu, Yuqi Xie, Yunfan Jiang, Letian Fu, Konstantinos Kallidromitis, Matteo Gioia, Junyi Zhang, Jiaxin Ge, Haiwen Feng, Fabio Galasso, Wei Zhan, David M. Chan, Yutong Bai, Roei Herzig, Jiahui Lei, Fei-Fei Li, Ken Goldberg, Jitendra Malik, Pieter Abbeel, Yuke Zhu, Danfei Xu, Jim Fan, Trevor Darrell

发表机构 * UC Berkeley(加州大学伯克利分校) NVIDIA(英伟达) Stanford(斯坦福大学) Panasonic(松下) La Sapienza University(罗马大学) ItalAI

AI总结 提出大规模触觉数据集和可变速率混合Transformer架构,在12项精细操作任务上平均成功率提升超30%。

Comments Project page: https://tactile-rex.github.io/

AI中文摘要

长期以来,对触觉信号做出动态反应的能力被认为是实现敏捷人类级灵巧操作的关键。然而,当前基于学习的视觉-语言-动作(VLA)模型在机器人操作中通常要么忽略触觉模态,要么局限于使用静态线索的编码器,部分原因是缺乏多样化的训练数据和标准化评估、当前VLA模型中的架构限制以及静态触觉编码器的局限性。在本文中,我们通过解决所有这些局限性来推动触觉反应操作的前沿。我们提出了一个大规模、100小时的触觉丰富数据集,该数据集通过一种新颖的、数据高效的配方收集,优先考虑基本运动基元。为了有效利用自然高频的触觉信号而不牺牲现有VLA的现有能力,我们引入了一种可变速率混合Transformer(MoT)架构,配备了一种新颖的时间触觉VQ-VAE编码器。我们在12项需要精细力控制和可变形物体操作的操作任务上展示了触觉反应策略的有效性,平均成功率比最强基线高出30%以上。

英文摘要

The ability to react dynamically to tactile signals has long been considered crucial to agile human-level dexterity. Yet contemporary learning-based Vision-Language-Action (VLA) models for robotic manipulation generally either overlook the tactile modality or are limited to encoders with static cues, due in part to the scarcity of diverse training data and standardized evaluation, architectural constraints in current VLA models, and limitations of static tactile encoders. In this paper, we push the frontier of tactile-reactive manipulation by addressing all of these limitations. We propose a large-scale, 100-hour tactile-rich dataset collected via a novel, data-efficient recipe that prioritizes elementary motor primitives. To effectively exploit naturally high-frequency touch signals without sacrificing the capabilities of existing VLAs, we introduce a variable-rate Mixture-of-Transformers (MoT) architecture equipped with a novel temporal tactile VQ-VAE encoder. We demonstrate the effectiveness of tactile-reactive policies on 12 manipulation tasks requiring delicate force control and deformable object manipulation, achieving over 30% higher average success rate than the strongest baseline.

2606.15064 2026-06-16 cs.LG cs.RO 交叉投稿

Phase-Localized Curation Does Not Help: A Negative Result on Per-Phase Metric Selection for Demonstration Filtering

阶段局部筛选无帮助:基于逐阶段度量选择的演示过滤负面结果

Aarav Bedi

发表机构 * Department of Mechanical Engineering, University of California, Berkeley(加州大学伯克利分校机械工程系)

AI总结 本文通过LIBERO任务实验证明,按阶段局部应用度量进行演示筛选不如全局或统一度量,原因是缺陷信号被稀释且阶段度量不可迁移。

Comments 5 pages, 3 tables. Code: https://github.com/aaravbedi/phase-gated-curation

AI中文摘要

操作演示具有时间阶段结构,一个自然的假设是演示筛选度量应在阶段内而非全局应用。其思想是将每条轨迹分割为阶段,用局部信息最丰富的度量对每个阶段评分,然后聚合。这直接源于先前工作,表明单个全局度量可能是缺陷的最佳检测器,但却是结果策略的最差筛选器。我们在三个接触丰富的LIBERO拾取放置任务上测试了逐阶段假设,使用受控的早期释放结构缺陷,将阶段门控筛选与相同度量的统一应用以及强单个全局度量进行比较。在所有三个任务和每个条件五个随机种子下,阶段门控筛选从未是最佳筛选策略,并且在三个任务中的两个上是最差的(任务1:86.0 vs. 全局92.0;任务3:22.7 vs. 统一48.0)。我们将失败归因于一个具体机制:当缺陷信号集中在单个阶段时,跨阶段排名聚合会用来自无缺陷阶段的无信息分数稀释该信号,从而选择比简单地在各处应用缺陷信息度量更差的演示子集。我们进一步表明,逐阶段度量选择不能跨任务迁移,因为任何两个任务之间没有阶段共享获胜度量,因此选择不能重用,必须从噪声扫描中为每个任务重新推导。这些结果限制了一种看似合理且先前未经测试的方法,并论证了实践者应优先识别单个缺陷信息度量,而非按阶段分解筛选。我们发布了完整流程、所有度量实现和每个种子的结果。

英文摘要

Manipulation demonstrations have temporal phase structure, and a natural hypothesis is that demonstration-curation metrics should be applied within phases rather than globally. The idea is to segment each trajectory into phases, score each phase with the metric that is locally most informative, and then aggregate. This follows directly from prior work showing that a single global metric can be the best detector of a defect and yet the worst curator of the resulting policy. We test the per-phase hypothesis on three contact-rich LIBERO pick-and-place tasks with a controlled early-release structural defect, comparing phase-gated curation against the same metrics applied uniformly and against a strong single global metric. Across all three tasks and five random seeds per condition, phase-gated curation is never the best curation strategy, and it is the worst of the three on two of the three tasks (Task 1: 86.0 vs. 92.0 for global; Task 3: 22.7 vs. 48.0 for uniform). We trace the failure to a concrete mechanism. When the defect signal is concentrated in a single phase, rank-aggregating across phases dilutes that signal with uninformative scores from defect-free phases, selecting a worse demonstration subset than simply applying the defect-informative metric everywhere. We further show that the per-phase metric selection does not transfer across tasks, since no phase shares a winning metric between any two tasks, so the selection cannot be reused and must be re-derived per task from a noisy sweep. These results bound a plausible and previously untested method, and they argue that practitioners should prefer identifying a single defect-informative metric over decomposing curation by phase. We release the full pipeline, all metric implementations, and per-seed results.
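The dilution mechanism described above can be reproduced with a small constructed example. The scores below are invented for illustration and are not the paper's data: six demonstrations, three phases, and a defect (demos 4 and 5) that only the "release" phase metric detects. Rank-aggregating across phases lets the uninformative phases outvote the informative one, whereas applying the defect-informative metric alone keeps only the clean demonstrations. Phase names and the top-4 budget are assumptions.

```python
def ranks(values):
    """Rank positions with 0 = lowest score; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order):
        out[index] = position
    return out


def top_k(values, k):
    """Indices of the k highest-scoring demonstrations, in ascending index order."""
    best = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return sorted(best)


# Higher score = demonstration looks better under that phase's metric.
scores = {
    "reach":   [0.1, 0.2, 0.3, 0.4, 0.9, 0.8],  # uninformative about the defect
    "grasp":   [0.2, 0.1, 0.4, 0.3, 0.8, 0.9],  # uninformative about the defect
    "release": [0.9, 0.8, 0.7, 0.6, 0.1, 0.2],  # flags demos 4 and 5 as defective
}

# Phase-gated curation: rank within each phase, then sum ranks across phases.
aggregate = [sum(r) for r in zip(*(ranks(v) for v in scores.values()))]
phase_gated_keep = top_k(aggregate, 4)

# Single defect-informative metric applied on its own.
informative_keep = top_k(scores["release"], 4)
```

Here `phase_gated_keep` retains both defective demonstrations while `informative_keep` drops them, which mirrors the direction of the reported effect in a deliberately extreme setting.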

2606.16470 2026-06-16 cs.CV cs.RO 交叉投稿

Decoupled Object-Centric Video Understanding for Generating Robotic Manipulation Commands

解耦的以对象为中心的视频理解用于生成机器人操作指令

Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Xiem HoangVan, Nak Young Chong

发表机构 * School of Information Science, Japan Advanced Institute of Science and Technology(日本北陆先端科学技术大学院大学信息科学学院) University of Engineering and Technology, Vietnam National University(越南国立大学工程与技术大学) Department of Robotics, Hanyang University(汉阳大学机器人学系)

AI总结 提出解耦动作识别与对象选择的框架,通过TSM分类动作和对象选择算法识别任务相关对象,结合VLM生成精确指令,在Something-Something V2上显著提升性能。

AI中文摘要

将视频演示翻译为可执行的机器人命令仍然具有挑战性,因为现有方法通常无法识别演示动作中功能涉及的对象。因此,它们可能生成语言上合理但操作上模糊的命令。我们提出了一种以对象为中心的视频理解框架,将动作识别与对象识别解耦,以生成精确的、无语法的操作命令。我们的方法集成了时间移位模块(TSM)用于高效的时空动作分类,以及一种新颖的对象选择(Object Selection)算法,通过基于轨迹的角色分类、模糊检测和重叠最小化来识别任务相关对象。然后,选定的对象由视觉语言模型(VLM)处理,以实现鲁棒的类别识别和零样本泛化。在修改后的Something-Something V2数据集上评估,我们的方法达到了86.79%的动作分类准确率,在标准对象上BLEU-4得分为0.337,在新颖对象上为0.261。这些结果分别比最强的任务特定基线提高了80.2%和143.9%。在METEOR和CIDEr指标上观察到更大的提升,在新颖对象上分别达到157.9%和171.7%。在所有语义指标上,我们的方法始终优于任务特定方法,并与大型通用VLM保持竞争力或超越它们,同时保留了模块化的、以对象为中心的设计。

英文摘要

Translating video demonstrations into executable robot commands remains challenging because existing methods often fail to identify which objects are functionally involved in the demonstrated action. As a result, they may generate commands that are linguistically plausible but operationally ambiguous. We propose an object-centric video understanding framework that decouples action recognition from object identification to generate precise, grammar-free manipulation commands. Our approach integrates Temporal Shift Modules (TSM) for efficient spatio-temporal action classification with a novel Object Selection algorithm that identifies task-relevant objects through trajectory-based role classification, blur detection, and overlap minimization. The selected objects are then processed by Vision-Language Models (VLMs) for robust category recognition and zero-shot generalization. Evaluated on a modified Something-Something V2 dataset, our method achieves 86.79% action classification accuracy and BLEU-4 scores of 0.337 on standard objects and 0.261 on novel objects. These results improve over the strongest task-specific baseline by 80.2% and 143.9%, respectively. Larger gains are observed in METEOR and CIDEr, reaching 157.9% and 171.7% on novel objects. Across all semantic metrics, our approach consistently outperforms task-specific methods and remains competitive with, or surpasses, large general-purpose VLMs while retaining a modular, object-centric design.

2606.09777 2026-06-16 cs.RO 版本更新

AetheRock: An Arm-Worn Robot Teaching System for Force-Guided Vision-Tactile Learning

AetheRock: 一种用于力引导视觉触觉学习的臂戴式机器人教学系统

Hong Li, Yue Xu, Yihan Tang, Yankang Dong, Chenyuan Liu, Chenyang Yu, Xuyang Li, Siyuan Huang, Yujun Shen, Nan Xue, Yong-Lu Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团) Shanghai Innovation Institute(上海创新研究院) Beijing Institute for General Artificial Intelligence (BIGAI)(北京通用人工智能研究院)

AI总结 提出臂戴式设备AetheRock采集夹爪力、视觉和触觉数据,并设计ForceVT框架利用力和视觉引导触觉学习,解决力感知机器人学习中传感器装配不兼容问题。

AI中文摘要

力和触觉感知在接触密集操作中不可或缺。然而,由于手持或可穿戴设备中触觉和力传感器的不兼容装配,力感知机器人学习面临关键挑战。为解决这些限制,我们首先引入AetheRock用于夹爪力、视觉和触觉数据收集,这是一种臂戴式设备,指尖配备模块化且易于制造的视觉触觉传感器GelSlim-MiniFab,人体手指接触区域配备电阻式压力传感器,定制PCB模块,以及用于舒适和稳健收集的可穿戴套件。在此基础上,我们提出ForceVT,一种表示学习框架,利用力和视觉引导保真度无关的触觉学习,实现在任何触觉情况下的鲁棒推理。实际实验表明,AetheRock实现了合格的数据效率,且ForceVT有效缓解了视觉触觉传感器在制造和使用不一致时的低效问题。总体而言,我们的工作通过创新的硬件设计和算法减轻了夹爪力-视觉-触觉机器人学习的局限性。

英文摘要

Force and tactile sensing are indispensable in contact-rich manipulation. However, force-aware robot learning faces critical challenges due to the incompatible assembly of tactile and force sensors in handheld or wearable devices. To address these limitations, we first introduce AetheRock for gripper-force, vision, and tactile data collection, which is an arm-worn device featuring a modular and easily manufactured visuo-tactile sensor, GelSlim-MiniFab, at the fingertip, a resistive pressure sensor at the human finger contact region, a customized PCB module, and a wearable kit for comfortable and robust collection. Building on this, we propose ForceVT, a representation learning framework that uses force and vision to guide fidelity-agnostic tactile learning, enabling robust inference in any tactile situation. Real-world experiments show that AetheRock achieves qualified data efficiency and that ForceVT effectively alleviates inefficiencies when visuo-tactile sensors exhibit manufacturing and utilization inconsistencies. Overall, our work mitigates the limitations of gripper-force vision-tactile robot learning through innovative hardware design and algorithms.

2601.13565 2026-06-16 cs.CV cs.RO eess.IV 版本更新

Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation

学习细粒度对应与跨视角感知用于开放词汇6D物体姿态估计

Yu Qin, Shimeng Fan, Fan Yang, Zixuan Xue, Zijie Mai, Wenrui Chen, Kailun Yang, Zhiyong Li

发表机构 * School of Artificial Intelligence and Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University(人工智能与机器人学院和机器人视觉感知与控制技术国家工程研究中心,湖南大学) State Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University(自主智能无人系统国家重点实验室,同济大学) School of Computer Science and Engineering, Hunan University of Science and Technology(计算机科学与工程学院,湖南科技大学)

AI总结 提出FiCoP框架,通过物体中心解耦、跨视角全局感知模块和补丁相关预测器,实现空间约束的细粒度对应,显著提升开放世界6D姿态估计的鲁棒性。

Comments Accepted to IEEE Robotics and Automation Letters (RA-L). The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP

AI中文摘要

开放词汇6D物体姿态估计使机器人能够仅凭自然语言指令操控任意未见过的物体。然而,现有方法的一个关键限制是它们依赖于无约束的全局匹配策略。在开放世界场景中,尝试将锚点特征与整个查询图像空间进行匹配会引入过多的歧义,因为目标特征容易与背景干扰物混淆。为解决这一问题,我们提出了细粒度对应姿态估计(FiCoP),这是一个从易受噪声影响的全局匹配过渡到空间约束的补丁级对应的框架。为了系统地消除背景干扰,FiCoP首先采用以物体为中心的解耦步骤,将目标从宏观环境噪声中隔离出来。基于这个局部区域,我们的核心方法创新有两个方面。首先,提出了跨视角全局感知(CPGP)模块,通过显式上下文推理和文本引导的语义注入融合双视图特征,建立结构一致性。其次,我们设计了一个补丁相关预测器(PCP),利用补丁到补丁的相关矩阵作为结构先验。这生成一个精确的块状关联图,作为空间滤波器,强制执行细粒度、抗噪声的匹配。在REAL275和Toyota-Light数据集上的实验表明,与最先进方法相比,FiCoP的平均召回率分别提高了8.0%和6.1%,突显了其在复杂、无约束的开放世界环境中为机器人代理提供鲁棒和泛化感知的能力。源代码将在 https://github.com/zjjqinyu/FiCoP 公开。

英文摘要

Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. To systematically eliminate background interference, FiCoP first employs an object-centric disentanglement step to isolate the target from macro-level environmental noise. Building upon this localized region, our core methodological innovations are twofold. Firstly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning and text-guided semantic injection. Secondly, we design a Patch Correlation Predictor (PCP) that leverages a patch-to-patch correlation matrix as a structural prior. This generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP.

4. 导航、定位与SLAM 21 篇

2606.14776 2026-06-16 cs.RO cs.LG 新提交

Deep Learning-Based Lunar Crater Terrain Relative Navigation

基于深度学习的月球陨石坑地形相对导航

Batu Candan, Simone Servadio

发表机构 * NASA(美国国家航空航天局) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出一种结合深度学习陨石坑检测器和扩展卡尔曼滤波的地形相对导航算法,在初始位置偏差达5公里时仍能将导航误差降至数百米。

详情
AI中文摘要

准确的位置估计对于未来使用自主飞行器实现月球着陆至关重要,尤其是在地形特征稀疏的危险环境中。本文提出了一种地形相对导航(TRN)算法,该算法结合了我们专门为NASA陨石坑检测挑战问题设计的深度学习陨石坑检测器和扩展卡尔曼滤波(EKF)。我们的检测器分析从轨道获取的单目图像中的陨石坑特征,并通过匈牙利分配方法及基于共识的离群点去除方法,识别它们与全球数据库中陨石坑的匹配。然后,估计的测量值用于优化EKF,其中航天器在月心月固(LCLF)参考系中的姿态估计,结合高度辅助信息,约束径向漂移。仿真结果表明,即使航天器偏离实际位置达5公里,TRN也能从这种情况中恢复,将导航误差降低到几百米。需要注意的是,为了保持陨石坑特征的对应关系,必须将图像分辨率和场景中的尺度与检测器训练集分布相匹配。

英文摘要

Accurate position estimation is crucial for the successful implementation of future lunar landings using autonomous vehicles, especially in dangerous environments with sparse terrain features. In this paper, we propose a terrain relative navigation (TRN) algorithm combining our deep-learning crater detector, which was designed specifically for the NASA Crater Detection Challenge problem, and an Extended Kalman Filter (EKF). Our detector analyzes crater features from monocular images acquired from orbit, and their matches with craters from a global database are identified via a Hungarian assignment approach followed by a consensus-based outlier removal method. The estimated measurements are then used to update an EKF, where spacecraft pose estimation in the Lunar-Centered Lunar-Fixed (LCLF) frame of reference, augmented with altitude aiding information, constrains radial drift. The simulation results indicate that even if the spacecraft estimate is up to 5 km off from its actual location, TRN can recover, reducing the navigation error to a few hundred meters. Note that, to maintain crater feature correspondences, the image resolution and the scales within the scene must match the detector's training set distribution.
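The matching stage (one-to-one assignment followed by consensus-based outlier removal) can be sketched on a toy 2D example. This is an assumption-laden illustration rather than the paper's pipeline: for a handful of points, the minimum-cost assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm (both return a minimum-cost matching), and "consensus" is approximated by agreement with the median offset. All coordinates, function names, and the tolerance are hypothetical.

```python
from itertools import permutations
from math import hypot
from statistics import median


def assign(detections, catalog):
    """Minimum-total-distance one-to-one matching (brute force, equal-length lists)."""
    def total_cost(perm):
        return sum(hypot(detections[i][0] - catalog[j][0],
                         detections[i][1] - catalog[j][1])
                   for i, j in enumerate(perm))
    best = min(permutations(range(len(catalog))), key=total_cost)
    return list(enumerate(best))  # (detection index, catalog index) pairs


def consensus_inliers(pairs, detections, catalog, tol=0.5):
    """Keep matches whose offset agrees with the median offset of all matches."""
    offsets = [(catalog[j][0] - detections[i][0], catalog[j][1] - detections[i][1])
               for i, j in pairs]
    mx = median(o[0] for o in offsets)
    my = median(o[1] for o in offsets)
    return [pair for pair, o in zip(pairs, offsets)
            if hypot(o[0] - mx, o[1] - my) <= tol]


catalog = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
# Three detections share a common offset; the last one is a spurious detection.
detections = [(-1.0, 1.0), (9.0, 1.0), (-1.0, 11.0), (6.0, 6.0)]

pairs = assign(detections, catalog)
inliers = consensus_inliers(pairs, detections, catalog)
```

The surviving pairs would then serve as measurements for the filter update; brute-force assignment scales factorially, so a real system would use a polynomial-time solver.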

2606.14879 2026-06-16 cs.RO cs.CV cs.LG 新提交

VANDERER: Map-Free Exploration using Future-Aware and Visual-Curiosity-Guided Diffusion Policy

VANDERER: 基于未来感知与视觉好奇心引导扩散策略的无地图探索

Venkata Naren Devarakonda, Raktim Gautam Goswami, Prashanth Krishnamurthy, Farshad Khorrami

发表机构 * Control/Robotics Research Laboratory (CRRL), Department of Electrical and Computer Engineering, NYU Tandon School of Engineering(纽约大学坦登工程学院电气与计算机工程系控制/机器人研究实验室(CRRL)) New York University Abu Dhabi (NYUAD) Center for Artificial Intelligence and Robotics (CAIR)(纽约大学阿布扎比分校人工智能与机器人中心(CAIR))

AI总结 提出VANDERER框架,利用视觉好奇心模块引导预训练扩散策略,仅依赖单目图像实现高效无地图探索,在多种模拟环境中平均探索面积比NoMaD多13.4%。

详情
AI中文摘要

移动智能体需要高效的探索策略来绘制未知环境并自主规划任务。传统方法依赖于生成占据地图并优化未探索区域的访问顺序。然而,在传感器受限的设置中,例如仅使用单目相机,生成准确的占据地图具有挑战性。为了解决这一问题,我们提出了VANDERER,一个探索框架,它利用视觉好奇心模块(VCM)仅使用单目图像数据来引导预训练的扩散策略。该好奇心模块通过导航世界模型预测所提议动作的结果,并通过好奇心成本对其进行评估。然后,该成本引导扩散过程生成最大化探索的动作。在多种模拟环境中进行评估,VANDERER始终优于现有基线,平均探索面积比NoMaD多13.4%。我们的结果揭示了室外环境中视觉好奇心与几何好奇心之间的直接相关性,表明VANDERER能够有效利用这种关系,在传感器受限的智能体上实现高效探索。

英文摘要

Mobile agents require efficient exploration strategies to map unseen environments and autonomously plan tasks. Traditional methods rely on generating occupancy maps and optimizing the sequence in which unexplored regions are visited. However, in sensor-constrained settings, such as those limited to monocular cameras, generating accurate occupancy maps is challenging. To address this, we propose VANDERER, an exploration framework that leverages a Visual Curiosity Module (VCM) to guide pre-trained diffusion policies using only monocular image data. This curiosity module predicts the outcomes of proposed actions via a navigation world model and evaluates them through a curiosity cost. The cost then guides the diffusion process toward generating actions that maximize exploration. Evaluated across diverse simulated environments, VANDERER consistently outperforms established baselines, exploring an average of 13.4% more area than NoMaD. Our results reveal a direct correlation between visual and geometric curiosity in outdoor environments, demonstrating that VANDERER can effectively leverage this relationship for efficient exploration using sensor-constrained agents.

2606.15154 2026-06-16 cs.RO 新提交

Task-Aware Environment Augmentation for Reliable Navigation via Shielded Conditional Diffusion

任务感知的环境增强:通过屏蔽条件扩散实现可靠导航

Bharawee Phoompho, Gokul Puthumanaillam, Yan Miao, Ruben Hernandez, Tim Bretl, Sayan Mitra, Melkior Ornik

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对部分可观测环境下的轨迹规划可靠性问题,提出任务感知的环境增强方法SCoDA,利用条件扩散模型学习最优视觉标记布局,通过屏蔽采样引导标记放置,提升轨迹执行可靠性和完成时间。

AI中文摘要

在部分可观测条件下的可靠轨迹规划不仅取决于计算可行的几何路径,还取决于机器人在执行该轨迹时是否能接收到信息丰富的观测。现有方法通常保持环境固定,通过信念空间规划、主动定位或增加传感来适应机器人,这往往导致在观测贫乏区域出现代价高昂的不确定性传播和脆弱行为。我们翻转这一视角,解决一个很大程度上开放的问题,即任务感知的环境增强:给定一个已建图的环境、一条规划的任务轨迹和少量视觉标记预算,应在何处增强环境,使得规划轨迹能在不确定性下可靠执行?我们的关键观察是,有用的标记布局由它们沿任务轨迹提供的定位支持定义:少量时机恰当的观测足以防止不确定性在状态估计误差会危及控制的区域累积。基于这一观察,我们提出了SCoDA,即Shielded Conditional Diffusion for Environment Augmentation(屏蔽条件扩散用于环境增强)。SCoDA从数据中学习高性能标记布局的条件分布,以环境、规划轨迹、干扰上下文和期望执行轮廓为条件。其屏蔽采样器推理沿规划执行应进行位姿校正的位置,并将该分布引导至任务相关、有限预算的增强。在模拟基准测试和硬件部署中,我们展示了SCoDA在轨迹执行可靠性和完成时间上优于强基线方法。代码、模型和数据集见:https://scoda-diffusion.github.io/

英文摘要

Reliable trajectory planning under partial observability depends not only on computing a feasible geometric path, but also on whether the robot receives informative observations while executing that trajectory. Existing approaches usually keep the environment fixed and adapt the robot through belief-space planning, active localization, or added sensing, often incurring costly uncertainty propagation and brittle behavior in observation-poor regions. We flip this perspective and address the largely open problem of task-aware environment augmentation: given a mapped environment, a planned task trajectory, and a small budget of visual fiducial markers, where should the environment be augmented so that the planned trajectory can be executed reliably under uncertainty? Our key observation is that useful marker layouts are defined by the localization support they provide along the task trajectory: a small number of well-timed observations can be sufficient to prevent uncertainty from accumulating in regions where state-estimation error would otherwise compromise control. Building on this observation, we present SCoDA, Shielded Conditional Diffusion for Environment Augmentation. SCoDA learns a conditional distribution over high-performing fiducial layouts from data, using the environment, planned trajectory, disturbance context, and desired execution profile as conditioning. Its shielded sampler reasons over where along the planned execution pose corrections should occur, and steers this distribution toward task-relevant, finite-budget augmentations. Across simulated benchmarks and hardware deployments, we show that SCoDA improves trajectory execution reliability and completion time over strong baselines. Code, models, and dataset are available at: https://scoda-diffusion.github.io/

2606.15476 2026-06-16 cs.RO 新提交

FARM: Find Anything using Relational Spatial Memory

FARM: 使用关系空间记忆找到任何物体

Siming He, Leo Huang, Adam Lilja, Fabio Hubel, Jonas Frey, Marco Pavone, S. Shankar Sastry, Jitendra Malik, Claire Tomlin

发表机构 * UC Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出FARM系统,通过实时构建包含几何、视觉语言描述和视角证据的开放词汇物体级记忆,并利用VLM解析查询和显式空间约束,在44k语言查询中Recall@5和Recall@10分别提升164%和224%,Accuracy@1提升35%。

AI中文摘要

在家庭、仓库及其他物体丰富的环境中运行的机器人需要能够按需找到特定物体实例的记忆系统。仅靠物体级记忆往往不够:场景中包含许多看似匹配的物体,用户通过目标与地标及周围物体的关系来指代目标(例如,“飞镖盘下方、海报左侧的高灯”),这要求一种支持通过语义、外观和空间谓词进行检索的关系空间记忆。为此,我们提出了FARM(使用关系空间记忆找到任何物体),该系统以5-10 Hz的实时速度构建一个紧凑的、开放词汇的物体级记忆,包含几何、视觉语言描述和视角证据。在查询时,FARM使用VLM解析查询并评分视觉证据,同时通过物体符号和关系谓词显式地约束空间关系。这种对VLM的结构化使用使得检索比基于帧历史或场景图上下文的端到端推理更准确和鲁棒。在涵盖67个室内外场景(面积从15到15,000平方米)的44k语言查询实验中,FARM的Recall@5和Recall@10相比先前方法分别提升了164%和224%,最终VLM重排序阶段将Accuracy@1提升了35%,同时保持实时运行。我们进一步在四足机器人上使用机载传感器和计算展示了闭环部署。

英文摘要

Robots operating in homes, warehouses, and other object-rich environments need memory systems that can find specific object instances on demand. Object-level memory alone is often insufficient: scenes contain many plausibly matching objects, and users refer to the target through relations to landmarks and surrounding objects (e.g. ``the tall lamp below the dartboard and to the left of the poster''), demanding a relational spatial memory that supports retrieval through semantic, appearance, and spatial predicates over objects. To achieve this, we present FARM (Find Anything using Relational Spatial Memory), which builds, in real time at 5-10 Hz, a compact, open-vocabulary, object-level memory with geometry, visual-language descriptors, and viewpoint evidence. At query time, FARM uses VLMs to parse the query and score visual evidence, while grounding spatial constraints explicitly through object symbols and relational predicates. This structured use of VLMs enables more accurate and robust retrieval than end-to-end reasoning over frame histories or scene-graph context. In experiments on 44k language queries spanning 67 indoor and outdoor scenes, ranging from 15 to 15,000 m^2, FARM improves Recall@5 and Recall@10 over prior methods by 164% and 224%, and a final VLM reranking stage improves Accuracy@1 by 35%, while running in real time. We further demonstrate closed-loop deployment on a quadrupedal robot using onboard sensors and compute.
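Grounding spatial constraints through object symbols and relational predicates can be sketched as a filter over an object-level memory. The example below is a hypothetical illustration, not FARM's implementation: it assumes each remembered object has a label and a position in a fixed frame (with +x to the viewer's right and +z up), and it treats "below" and "left of" as simple coordinate comparisons. In the described system a VLM would parse the query and score visual evidence; here the parsed query is written by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    label: str
    x: float  # metres, +x to the viewer's right (assumed frame)
    z: float  # metres, +z up (assumed frame)


def below(a, b):
    return a.z < b.z


def left_of(a, b):
    return a.x < b.x


def retrieve(memory, label, constraints):
    """Names of objects with `label` that satisfy every (predicate, anchor_label) pair."""
    hits = []
    for cand in memory:
        if cand.label != label:
            continue
        if all(any(pred(cand, anchor) for anchor in memory if anchor.label == anchor_label)
               for pred, anchor_label in constraints):
            hits.append(cand.name)
    return hits


memory = [
    Obj("lamp_a", "lamp", x=0.5, z=1.0),
    Obj("lamp_b", "lamp", x=3.0, z=1.0),
    Obj("lamp_c", "lamp", x=0.4, z=2.5),
    Obj("dartboard_1", "dartboard", x=1.0, z=1.8),
    Obj("poster_1", "poster", x=2.0, z=1.5),
]

# Hand-parsed form of "the lamp below the dartboard and to the left of the poster".
result = retrieve(memory, "lamp", [(below, "dartboard"), (left_of, "poster")])
```

Keeping the predicates explicit means the two lamps that match on appearance alone are rejected by geometry rather than by asking a model to reason over raw frames.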

2606.15491 2026-06-16 cs.RO 新提交

FD-SLAM: Fast Dense Radar-Inertial SLAM with Frequency-Domain Loop Closure and Pose Graph Optimization

FD-SLAM: 基于频域闭环和位姿图优化的快速密集雷达-惯性SLAM

Nader J. Abu-Alrub, Nathir A. Rawashdeh

发表机构 * University of Texas at Austin(得克萨斯大学奥斯汀分校)

AI总结 提出FD-SLAM,通过频域闭环检测和位姿图优化提升密集雷达-惯性SLAM的精度与鲁棒性,在公开数据集上达到先进水平。

AI中文摘要

雷达SLAM对于在视觉退化环境中运行的自主地面车辆具有吸引力,然而扫描雷达噪声大、扫描速率低,且其测量值难以在长轨迹上可靠匹配。本文提出FD-SLAM,一种快速密集雷达-惯性SLAM系统,它通过频域闭环检测和位姿图优化扩展了密集雷达-惯性里程计。所提方法通过使用紧凑的频域极坐标描述符进行闭环候选检索,以及基于时间滤波、相位相关筛选、扫描对齐相似性和几何一致性检查的多阶段验证流水线,保留了扫描雷达测量的类图像结构。验证后的闭环作为非顺序约束添加到SE(2)位姿图中,与雷达-惯性里程计因子一起。FD-SLAM在公开数据集上使用标准KITTI评估指标进行评估。结果表明,FD-SLAM改进了FD-RIO基线,与当前最先进的雷达SLAM方法相比具有竞争性能,并在多个评估的驾驶轨迹上提供了良好的旋转精度。运行时分析进一步表明,在仅CPU的设置下,雷达-惯性前端运行速度高于雷达采样率,而闭环检测和图优化适合并行后台执行。

英文摘要

Radar SLAM is attractive for autonomous ground vehicles operating in visually degraded environments. However, scanning radars are noisy and have low scanning rates, and their measurements are challenging to match reliably over long trajectories. This paper presents FD-SLAM, a fast dense radar-inertial SLAM system that extends dense radar-inertial odometry with frequency-domain loop closure and pose graph optimization. The proposed method preserves the image-like structure of scanning radar measurements by using a compact frequency-domain polar descriptor for loop-candidate retrieval and a multi-stage verification pipeline based on temporal filtering, phase-correlation screening, scan-alignment similarity, and geometric consistency checks. Verified loop closures are added as non-sequential constraints in an SE(2) pose graph together with radar-inertial odometry factors. FD-SLAM is evaluated on a publicly available dataset using standard KITTI evaluation metrics. The results show that FD-SLAM improves on the FD-RIO baseline, achieves competitive performance against current state-of-the-art radar SLAM methods, and provides favorable rotational accuracy across multiple evaluated driving trajectories. Runtime analysis further indicates that the radar-inertial front-end operates above the radar sampling rate on a CPU-only setup, while loop closure detection and graph optimization remain suitable for parallel background execution.
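Phase correlation, one of the screening steps named above, recovers a shift between two signals from the phase of their cross-power spectrum. The sketch below applies it to a toy 1D azimuth profile, where a circular shift along the azimuth axis of a polar scan corresponds to a heading change. It is a self-contained illustration with a naive O(n^2) DFT, not the FD-SLAM descriptor or pipeline; the profile values and function names are made up.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (adequate for short toy signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def circular_shift(ref, moved):
    """Estimate s such that moved[t] == ref[(t - s) % n], via phase correlation."""
    F, G = dft(ref), dft(moved)
    cross = []
    for f, g in zip(F, G):
        c = g * f.conjugate()
        # Keep only the phase; skip bins with (numerically) no energy.
        cross.append(c / abs(c) if abs(c) > 1e-12 else 0)
    corr = dft(cross, inverse=True)
    # The correlation surface peaks at the shift.
    return max(range(len(corr)), key=lambda t: corr[t].real)


profile = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 3.0, 1.0]  # toy azimuth intensity profile
n = len(profile)
rotated = [profile[(t - 3) % n] for t in range(n)]   # same scene, heading shifted by 3 bins

shift = circular_shift(profile, rotated)
```

Normalizing each frequency bin to unit magnitude is what distinguishes phase correlation from plain cross-correlation and tends to give a sharper peak.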

2606.15691 2026-06-16 cs.RO 新提交

Can Causal Models Enhance Robot Navigation? Online Causal Adaptation for Real-Robot Navigation

因果模型能否增强机器人导航?面向真实机器人导航的在线因果自适应

Zhitao Liang, Alex Mitrevski, Emmanuel Dean, Karinne Ramirez-Amaro

发表机构 * Chalmers University of Technology(查尔姆斯理工大学)

AI总结 研究因果模型在真实机器人导航中的迁移问题,提出离线评估和在线自适应两种应用方式,实验表明因果模型在复杂场景下能显著提升导航性能。

Comments Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

AI中文摘要

机器人学中的因果性旨在通过使机器人能够预测其行为的后果,产生更可解释和灵活的机器人行为;然而,在真实环境中将因果模型与现有系统(如导航)结合部署的研究仍不充分。本文解决了在真实机器人实验中为导航场景迁移因果模型的挑战性问题。我们通过两种方式研究该问题:(i) 使用因果模型作为离线评估模块,预测记录的机器人导航轨迹的胜任度,并将其与定量导航性能相关联;(ii) 使用因果模型作为在线自适应模块,在默认导航的预测胜任度较低时进行干预。我们在一个在走廊巡逻的物理服务机器人上验证了该方法。结果表明,预测的胜任度与路径效率正相关,与路径不规则性(次优行为)负相关。模型预测还与人工标注高度一致(Cohen's kappa值为0.88)。在在线实验中,所提方法在转弯和避障等复杂场景中提升了导航性能,相比默认导航基线获得了更高的预测胜任度和更好的导航指标。在基线已接近最优的简单场景中,因果自适应的收益有限。这些结果表明,因果模型在任务复杂度增加时尤其能有效增强导航。总体而言,我们的结果证明了为行为解释开发的因果模型可以成功集成到真实机器人导航系统中。

英文摘要

Causality in robotics aims to produce more interpretable and flexible robot behaviours by enabling robots to predict the consequences of their actions; however, deploying causal models with existing systems (e.g., navigation) operating in real environments remains understudied. This paper addresses the challenging problem of transferring causal models in real-robot experiments for a navigation scenario. We study this problem in two ways: (i) using the causal model as an offline evaluation module that predicts the competence of recorded real-robot navigation trajectories and relates it to quantitative navigation performance, and (ii) using the causal model as an online adaptation module that intervenes when the predicted competence of the default navigation is low. We validate our approach on a physical service robot that patrols corridors. We show that the predicted competence correlates positively with path efficiency, and negatively with path irregularities (suboptimal behaviour). The model predictions also show strong agreement with human annotations (Cohen's kappa value of 0.88). In online experiments, the proposed method improves navigation performance in complex scenarios such as cornering and obstacle avoidance, yielding higher predicted competence and better navigation metrics than the default navigation baseline. In simpler scenarios, where the baseline already performs near-optimally, the causal adaptation provides limited benefit. These results indicate that causal models are particularly effective in enhancing navigation under increased task complexity. Overall, our results demonstrate that causal models developed for behavioural interpretation can be successfully integrated into real-robot navigation systems.
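The agreement statistic quoted above, Cohen's kappa, corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The snippet below computes it for two label sequences. The labels are invented for illustration and do not reproduce the paper's annotations or its 0.88 value.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical competence labels: model predictions vs. a human annotator.
model = ["high"] * 5 + ["low"] * 5
human = ["high"] * 4 + ["low"] * 6

kappa = cohens_kappa(model, human)
```

With one disagreement in ten, raw agreement is 0.9 but chance agreement is 0.5, so kappa comes out at 0.8.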

2606.15846 2026-06-16 cs.RO 新提交

FlashNav: Ultra-Fast Policy Training for Robot Navigation within 20 Seconds

FlashNav:20秒内实现超快速机器人导航策略训练

Shanze Wang, Yiwei Qian, Xinming Zhang, Jun Xue, Siwei Cheng, Xianghui Wang, Qingyuan Hu, Xiaoyu Shen, Wei Zhang

发表机构 * Eastern Institute of Technology, Ningbo(宁波东方理工大学) The Hong Kong Polytechnic University(香港理工大学) National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出FlashNav框架,通过GPU加速和MDP对齐实现20秒内训练可部署的导航策略,在TurtleBot2和Unitree Go2上验证成功。

Comments 15 pages, 4 figures

详情
AI中文摘要

深度强化学习在机器人导航中展现出强大潜力,但其实际部署仍受限于策略训练较长的实际耗时(wall-clock)。本文提出FlashNav,一个用于超快速、基于测距的机器人导航训练的GPU优先框架。据我们所知,FlashNav是首个达到秒级策略训练的基于DRL的机器人导航框架,最快的可部署策略在不到20秒内训练完成。关键思想是将仿真与导航MDP对齐:FlashNav保留了速度级导航的必要组件,包括占据几何、测距感知、目标条件控制、机器人运动动力学、碰撞处理、终止和重置,同时从训练循环中移除不必要的渲染和高保真物理细节。FlashNav基于批量位图仿真器和采用我们FastDSAC学习器的全GPU驻留训练流水线构建,完全在GPU上生成大规模并行导航转移。在TurtleBot2和Unitree Go2上的实验表明,FlashNav在RTX 5090上20秒内达到100%成功率,并在各款桌面GPU上保持在几十秒内。学习到的策略进一步迁移到静态和动态室内场景中的物理轮式和腿式机器人,证明基于DRL的导航可以在秒级速度下训练,同时保持可部署的避障行为。

英文摘要

Deep reinforcement learning has shown strong potential for robot navigation, but its practical deployment is still limited by the long wall-clock cost of policy training. This paper presents FlashNav, a GPU-first framework for ultra-fast range-based robot navigation training. To the best of our knowledge, FlashNav is the first DRL-based robot navigation framework that reaches seconds-level policy training, with the fastest deployable policy trained in less than 20 seconds. The key idea is to align simulation with the navigation MDP: FlashNav preserves the essential components for velocity-level navigation, including occupancy geometry, range sensing, goal-conditioned control, robot motion dynamics, collision handling, termination, and reset, while removing unnecessary rendering and high-fidelity physical details from the training loop. Built on a batched bitmap simulator and a fully GPU-resident training pipeline with our FastDSAC learner, FlashNav generates massive parallel navigation transitions entirely on GPU. Experiments on TurtleBot2 and Unitree Go2 show that FlashNav achieves a 100% success rate in under 20 seconds on an RTX 5090 and remains within tens of seconds across desktop GPUs. The learned policies further transfer to physical wheeled and legged robots in static and dynamic indoor scenes, demonstrating that DRL-based navigation can be trained at seconds-level speed while preserving deployable obstacle-avoidance behavior.
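
To make the idea of range sensing on an occupancy bitmap concrete, here is a small CPU-only sketch. It is not FlashNav's simulator: the function name `cast_ranges`, the fixed-step ray marching, and the grid layout are all assumptions chosen for illustration, and a real GPU-resident version would vectorise the two outer loops across robots and beams.

```python
import math

def cast_ranges(grid, poses, num_beams=4, max_range=10.0, step=0.05):
    """Return one list of beam ranges per robot pose (x, y, heading).

    grid[row][col] is 1 for occupied cells and 0 for free cells. Each cell is
    1 x 1 in world units, with x along columns and y along rows.
    """
    rows, cols = len(grid), len(grid[0])
    all_ranges = []
    for x, y, heading in poses:
        ranges = []
        for k in range(num_beams):
            angle = heading + 2.0 * math.pi * k / num_beams
            dx, dy = math.cos(angle), math.sin(angle)
            dist = 0.0
            # March along the beam until it hits an occupied or outside cell.
            while dist < max_range:
                col = math.floor(x + dist * dx)
                row = math.floor(y + dist * dy)
                outside = not (0 <= row < rows and 0 <= col < cols)
                if outside or grid[row][col] == 1:
                    break
                dist += step
            ranges.append(min(dist, max_range))
        all_ranges.append(ranges)
    return all_ranges

# A 5 x 5 room with walls on the border and two robots inside.
grid = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
ranges = cast_ranges(grid, [(2.5, 2.5, 0.0), (1.5, 2.5, 0.0)])
```

The returned ranges are accurate only up to the marching step (0.05 here), which is the usual trade-off of fixed-step ray marching.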

2606.16057 2026-06-16 cs.RO cs.SY eess.SP eess.SY 新提交

A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation

一种智能调度混合(SSH)EKF-FGO状态估计方法

Eric Levi, Soosan Beheshti

AI总结 本文通过智能调度混合EKF-FGO框架,实验性地将优化调度作为独立设计变量,研究其在平衡估计精度与计算成本中的作用,并在平面SLAM仿真中验证了调度对预优化漂移、瞬态误差和运行时间的显著影响。

Comments This work has been accepted for presentation/publication at the 2026 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). The final published version will appear in IEEE Xplore

详情
AI中文摘要

在机器人学和控制中,可靠的状态估计需要在估计精度和计算成本之间取得平衡。虽然基于滤波的方法(如扩展卡尔曼滤波器,EKF)提供高效的实时更新,而使用因子图的优化公式化方法改善全局一致性,但优化调度的作用通常被隐式处理,而非作为明确的设计变量进行研究。本文提出了一项实验研究,通过使用智能调度混合(SSH)EKF-FGO框架作为受控测试平台,明确隔离了优化调度。通过将基于EKF的状态传播与定期调用的批量优化相结合,并保持求解器结构和计算量固定,本文的主要贡献是实验性地将优化调度表征为一个独立的设计变量,它控制着中间估计精度与计算成本之间的权衡。在平面SLAM环境中的仿真结果表明,调度强烈影响预优化漂移、瞬态误差行为和运行时间。特别是,结果识别出一些操作区域,在这些区域中,全局优化的大部分好处可以以一小部分计算成本保留,从而突显了优化调度作为混合状态估计系统中一个未被充分探索但至关重要的考虑因素。

英文摘要

Reliable state estimation in robotics and control requires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation-based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart-Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. By combining EKF-based state propagation with periodically invoked batch optimisation and holding solver structure and effort fixed, the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre-optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.
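
The trade-off described above can be illustrated with a deliberately simple 1D toy, which is not the SSH framework itself. The assumptions are loud ones: the "filter" is plain propagation with a constant odometry bias, and the scheduled "batch optimisation" is assumed to remove all accumulated drift. Only the optimisation period changes between runs.

```python
def run_schedule(num_steps, period, odom_bias=0.01, velocity=1.0):
    """Simulate a toy 1D estimator and return (max_drift, optimiser_calls)."""
    true_pos = 0.0
    est_pos = 0.0
    max_drift = 0.0
    optimiser_calls = 0
    for step in range(1, num_steps + 1):
        true_pos += velocity
        # Filter-style propagation with biased odometry (cheap, but drifts).
        est_pos += velocity + odom_bias
        max_drift = max(max_drift, abs(est_pos - true_pos))
        if step % period == 0:
            # Stand-in for the scheduled batch optimisation: it is assumed to
            # remove the accumulated drift completely.
            est_pos = true_pos
            optimiser_calls += 1
    return max_drift, optimiser_calls

frequent = run_schedule(num_steps=100, period=10)   # many optimiser calls
sparse = run_schedule(num_steps=100, period=50)     # few optimiser calls
```

In this toy, optimising every 10 steps bounds the pre-optimisation drift at about 0.1 with 10 solver calls, while optimising every 50 steps allows about 0.5 of drift with only 2 calls.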

2606.16232 2026-06-16 cs.RO 新提交

PolyMerge: Compressing 3D Gaussian Splats with Polytope Coverings for Provably Safe Resource-Constrained Navigation

PolyMerge: 用多面体覆盖压缩3D高斯泼溅以实现可证明安全的资源受限导航

Jihoon Hong, Chih-Yuan Chiu, Sara Fridovich-Keil, Glen Chou

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出PolyMerge,将大规模3D高斯泼溅模型转换为凸多面体覆盖,保证覆盖原模型所有障碍物,结合控制障碍函数实现实时安全路径规划,在Crazyflie无人机上验证。

详情
Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 7, pp. 8512-8519, July 2026
AI中文摘要

障碍物避免对于安全导航和运动规划至关重要。最近的辐射场重建方法能够以高保真度进行物体检测和建模,但对于机载感知路径规划而言,仍然过于消耗内存和计算资源。为了解决这些限制,我们提出PolyMerge,将场景的大规模、逼真的3D高斯泼溅(3DGS)模型转换为凸多面体的轻量级表示,这些多面体的并集可证明地过度逼近原始3DGS模型中的所有障碍物。PolyMerge调整多面体数量以权衡保守性和计算成本,并与控制障碍函数(CBF)集成以规划无碰撞路径。我们在Crazyflie无人机的仿真和硬件实验中展示了PolyMerge,该无人机在严重的机载计算约束下使用PolyMerge实时计算并跟踪安全轨迹,在保证安全的同时在速度上优于基线。有关我们的代码和视频,请访问https://athlon76.github.io/PolyMerge-website/。

英文摘要

Obstacle avoidance is essential for safe navigation and motion planning. Recent radiance field reconstruction methods enable object detection and modeling with high fidelity, but remain too memory- and compute-intensive for on-board perception-based path planning. To address these limitations, we propose PolyMerge to convert a large, photorealistic 3D Gaussian Splatting (3DGS) model of a scene into a lightweight representation of convex polytopes whose union provably over-approximates all obstacles in the original 3DGS model. PolyMerge tunes the polytope count to trade off conservativeness and compute cost, and integrates with control barrier functions (CBFs) to plan collision-free paths. We showcase PolyMerge in simulation and hardware experiments on a Crazyflie drone, which uses PolyMerge to compute and follow safe trajectories in real time under severe onboard compute constraints, outperforming baselines in speed while guaranteeing safety. For our code and videos, visit https://athlon76.github.io/PolyMerge-website/.
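
As a rough illustration of trading polytope count against conservativeness, the sketch below merges axis-aligned boxes (a very simple kind of convex polytope) until a target count is reached. This is not the PolyMerge algorithm: the greedy smallest-area-growth rule and all function names are assumptions. The property it shares with the abstract is that every merge only grows the covered region, so the union keeps over-approximating the original obstacle points.

```python
def bounding_box(points):
    """Axis-aligned box (xmin, ymin, xmax, ymax) around 2D points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def merge(a, b):
    """Smallest axis-aligned box containing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def contains(box, point):
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]

def compress(boxes, target_count):
    """Greedily merge the pair whose union adds the least area."""
    if target_count < 1:
        raise ValueError("target_count must be at least 1")
    boxes = list(boxes)
    while len(boxes) > target_count:
        best = None
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                merged = merge(boxes[i], boxes[j])
                growth = area(merged) - area(boxes[i]) - area(boxes[j])
                if best is None or growth < best[0]:
                    best = (growth, i, j, merged)
        _, i, j, merged = best
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]
    return boxes

# Three hypothetical obstacle clusters, compressed to two boxes.
clusters = [[(0, 0), (1, 1)], [(1.5, 0), (2.5, 1)], [(10, 10), (11, 11)]]
boxes = [bounding_box(c) for c in clusters]
compressed = compress(boxes, target_count=2)
all_points = [p for c in clusters for p in c]
```

Fewer boxes mean cheaper collision checks but more free space declared occupied: here the covered area grows from 3.0 to 3.5.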

2606.16400 2026-06-16 cs.RO 新提交

SemGeoNav: A Safety-Guided Visual Navigation Approach with Semantic Reasoning and Geometric Planning

SemGeoNav:一种结合语义推理与几何规划的安全引导视觉导航方法

Yu Liu, Zongyang Chen, Yan Guo, Chao Liu, Xianfei Pan

发表机构 * College of Intelligence Science and Technology, National University of Defense Technology(国防科技大学智能科学学院)

AI总结 提出SemGeoNav分层视觉导航框架,融合端到端模型的高层语义推理与几何方法的可靠局部规划,实现鲁棒图像导航并显著提升避障能力,在真实机器人上优于ViNT和NoMaD。

Comments The paper has been accepted by ICGNC 2026

详情
AI中文摘要

基于学习的视觉导航增强了语义目标到达能力。然而,由于其黑箱特性,纯端到端模型通常缺乏显式的几何约束,导致在开放环境中避障不可预测且不可靠。相反,传统几何规划器确保安全性,但难以处理高维视觉目标。为了解决这些限制,我们提出了SemGeoNav,一种新颖的分层视觉导航框架。它紧密集成了端到端模型的高层语义推理与基于几何方法的可靠局部规划能力,实现了鲁棒的基于图像的导航,同时显著改善了避障。此外,我们引入了一种时间轨迹平滑机制,以确保机器人运动连续稳定。我们在真实环境中的Unitree Go2四足机器人上评估了SemGeoNav。结果表明,SemGeoNav优于现有代表性方法(包括ViNT和NoMaD),实现了更高的成功率和更短的导航时间。

英文摘要

Learning-based visual navigation has enhanced semantic goal-reaching capabilities. However, due to their black-box nature, purely end-to-end models often lack explicit geometric constraints, leading to unpredictable and unreliable obstacle avoidance in open environments. Conversely, traditional geometric planners ensure safety but struggle with high-dimensional visual targets. To address these limitations, we propose SemGeoNav, a novel hierarchical visual navigation framework. It tightly integrates the high-level semantic reasoning of end-to-end models with the reliable local planning ability of geometry-based methods, achieving robust image-based navigation while significantly improving obstacle avoidance. Furthermore, we introduce a temporal trajectory smoothing mechanism to ensure continuous and stable robot motion. We evaluated SemGeoNav on a Unitree Go2 quadruped robot in real-world environments. The results demonstrate that SemGeoNav outperforms existing representative methods, including ViNT and NoMaD, achieving higher success rates and shorter navigation times.
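
The abstract does not say how the temporal trajectory smoothing works. One common option, shown here purely as an assumption, is to blend each newly planned waypoint with the corresponding waypoint of the previous plan using an exponential moving average; `alpha` is a hypothetical tuning parameter.

```python
def smooth_trajectory(previous, current, alpha=0.6):
    """Blend newly planned 2D waypoints with the previous plan.

    alpha = 1.0 trusts only the new plan; smaller values damp sudden changes.
    """
    if len(previous) != len(current):
        raise ValueError("plans must contain the same number of waypoints")
    return [
        (alpha * cx + (1.0 - alpha) * px, alpha * cy + (1.0 - alpha) * py)
        for (px, py), (cx, cy) in zip(previous, current)
    ]

# The new plan jumps sideways by 1 m; the smoothed plan moves only halfway.
previous_plan = [(0.0, 0.0), (1.0, 0.0)]
new_plan = [(0.0, 1.0), (1.0, 1.0)]
smoothed = smooth_trajectory(previous_plan, new_plan, alpha=0.5)
```

A lower `alpha` gives steadier motion at the cost of slower reaction to genuinely new obstacles, so it would need tuning on the real robot.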

2606.16881 2026-06-16 cs.RO 新提交

SGM-SLAM: Scene Graph Matching for Data-Efficient Distributed SLAM

SGM-SLAM:面向数据高效的分布式SLAM的场景图匹配

Yewei Huang, Tixiao Shan, Abhinav Rajvanshi, Niluthpol Chowdhury Mithun, Yaxuan Li, Brendan Englot, Han-Pang Chiu

发表机构 * Dartmouth College(达特茅斯学院) SRI International(SRI国际) Stevens Institute of Technology(史蒂文斯理工学院)

AI总结 提出一种基于场景图匹配的分布式SLAM框架,仅使用对象标签和质心进行匹配,通过多步数据交换与优化实现高效通信,在室内外环境中验证了有效性。

详情
AI中文摘要

我们介绍了一种面向配备LiDAR、相机和惯性传感器的机器人团队的数据高效分布式同步定位与地图构建(SLAM)框架。该框架使用场景图匹配来识别机器人间的测量约束。与依赖于特征级匹配的先前方法不同,我们的框架是首个仅使用对象标签和质心进行场景图匹配的方法。我们的方法通过使用融合的RGB-LiDAR点云构建场景图,生成语义分割点云层和离散有界对象层,以伴随估计的机器人轨迹。场景图匹配通过交换和匹配相邻机器人的对象数据协作完成。为最大化通信效率,我们采用了多步数据交换与优化过程。我们通过仿真和由腿式机器人在室内外环境中收集的真实世界数据集展示了我们方法的有效性和效率。

英文摘要

We introduce a data-efficient distributed Simultaneous Localization and Mapping (SLAM) framework designed for a team of robots equipped with LiDAR, cameras, and inertial sensors. Our framework uses scene graph matching to identify inter-robot measurement constraints. Unlike prior approaches that rely on feature-level matching, our framework is the first to perform scene graph matching using only object labels and centroids. Our approach constructs a scene graph by using fused RGB-LiDAR point clouds to generate both a semantically segmented point cloud layer, and a layer of discrete bounded objects, to accompany estimated robot trajectories. Scene graph matching is performed collaboratively through exchanging and matching object data with neighboring robots. To maximize communication efficiency, we utilize a multi-step data exchange and optimization process. We demonstrate the effectiveness and efficiency of our approach using both simulation and real-world datasets collected by legged robots in indoor and outdoor environments.
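
A minimal sketch of what matching on labels and centroids alone could look like is given below. It is a simplification, not the paper's method: objects are paired only when their label is unique in both robots' lists, and the inter-robot transform is then estimated in 2D with the closed-form least-squares rotation. All names and the example objects are hypothetical.

```python
import math

def match_by_label(objects_a, objects_b):
    """Pair centroids whose label occurs exactly once in each list."""
    def unique(objects):
        seen = {}
        for label, centroid in objects:
            seen.setdefault(label, []).append(centroid)
        return {l: c[0] for l, c in seen.items() if len(c) == 1}
    ua, ub = unique(objects_a), unique(objects_b)
    return [(ua[l], ub[l]) for l in sorted(ua) if l in ub]

def estimate_rigid_2d(pairs):
    """Least-squares rotation angle and translation mapping a onto b."""
    if len(pairs) < 2:
        raise ValueError("need at least two matched objects")
    n = len(pairs)
    ax = sum(a[0] for a, _ in pairs) / n
    ay = sum(a[1] for a, _ in pairs) / n
    bx = sum(b[0] for _, b in pairs) / n
    by = sum(b[1] for _, b in pairs) / n
    s_dot = s_cross = 0.0
    for a, b in pairs:
        pax, pay = a[0] - ax, a[1] - ay
        pbx, pby = b[0] - bx, b[1] - by
        s_dot += pax * pbx + pay * pby
        s_cross += pax * pby - pay * pbx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated a-mean onto the b-mean.
    return theta, (bx - (c * ax - s * ay), by - (s * ax + c * ay))

# Robot B sees robot A's objects rotated by 90 degrees and shifted by (2, -1).
objects_a = [("chair", (1, 0)), ("table", (0, 2)), ("door", (3, 1)), ("lamp", (5, 5))]
objects_b = [("chair", (2, 0)), ("table", (0, -1)), ("door", (1, 2)),
             ("lamp", (0, 0)), ("lamp", (9, 9))]
pairs = match_by_label(objects_a, objects_b)
theta, translation = estimate_rigid_2d(pairs)
```

Only a label and a centroid per object cross the network in this sketch, which is why such a representation is cheap to communicate. The ambiguous "lamp" label is simply skipped here, whereas a real system would need a way to resolve repeated labels.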

2606.16902 2026-06-16 cs.RO cs.AI 新提交

Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models

基于开放视觉语言模型的空间问答与导航的二分追踪

Dongbin Na, Chanwoo Kim, Soonbin Rho, Giyun Choi, Gangbok Lee, Dooyoung Hong

发表机构 * RGA Inc.(RGA公司)

AI总结 提出BinTrack,一种全开源的空间定位代理,通过对轨迹段进行二分搜索,在SpaceLocQA基准上准确率最高提升22.8%,推理速度提升超过1.5倍,并发布多行程室外数据集GangnamLoop。

Comments 21 pages, 4 figures, 15 tables. Project page: https://ndb796.github.io/BinaryTracking ; Code and dataset: https://github.com/ndb796/BinaryTracking

详情
AI中文摘要

本工作针对服务机器人在长距离自我中心路线上的空间问答问题。给定诸如“在回家的路上哪里可以找到干洗店?”的查询,系统返回一个度量坐标,下游导航组件可以据此行动。先前的空间问答方法利用基于闭源模型(如GPT-4o)的检索增强代理进行路径探索。然而,由于网络不稳定、通信延迟和部署成本,在现实世界中运行的机器人通常无法可靠地依赖在线闭源模型。这就需要能够在机器人上运行的开源空间问答方法,但这一方向的已有研究仍然有限。本工作提出BinTrack,一种简单而有效的全开源空间定位代理,它利用机器人轨迹的时间顺序。BinTrack对查询中识别的两个锚点地标之间的轨迹段进行二分搜索。与其他开源实现相比,它将整体准确率最高提高了22.8%,甚至在SpaceLocQA基准的全局类别上与已报告的闭源模型结果持平,而该类别是最具挑战性的设置,迄今需要GPT-4o等强大的推理代理。此外,其优化的推理策略始终比先前方法提供超过1.5倍的推理加速。最后,本工作发布了GangnamLoop,这是一个新颖且实用的多行程室外基准,通过在公共街道上部署真实四足机器人并采用匿名化策略收集而成。它在不同室外条件下重新访问相同位置,并将机器人的低视角与人类主人的视角配对。源代码和数据集可在https://github.com/ndb796/BinaryTracking公开获取。

英文摘要

This work addresses spatial question answering for service robots traversing long egocentric routes. Given a query such as "where can I find a dry cleaner on the way back home?", the system returns a metric coordinate that downstream navigation components can act on. Prior spatial question answering approaches leverage retrieval-augmented agents built on closed-source models such as GPT-4o for path exploration. However, robots operating in the real world often cannot reliably depend on online closed-source models due to network instability, communication latency, and deployment cost. This creates a need for open-source spatial question answering approaches that can run onboard the robot, yet prior research in this direction remains limited. This work proposes BinTrack, a simple yet effective, fully open-source spatial-localization agent that leverages the temporal ordering of a robot's trajectory. BinTrack performs a binary search over the trajectory segments between two anchor landmarks identified from a query. It improves overall accuracy by up to 22.8% over other open-source implementations and even matches the reported closed-source model result on the global category of the SpaceLocQA benchmark, the most challenging setting, which has so far required strong reasoning agents such as GPT-4o. Furthermore, its optimized inference strategy consistently yields more than a 1.5x inference speedup over previous approaches. Finally, this work releases GangnamLoop, a novel and practical multi-trip outdoor benchmark collected by deploying a real quadruped robot on public streets under an anonymization policy. It revisits the same locations under different outdoor conditions and pairs the robot's low viewpoint with the human owner's. The source code and datasets are publicly available at https://github.com/ndb796/BinaryTracking
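
The core idea of searching a time-ordered trajectory between two anchors can be sketched as a plain binary search. The abstract does not describe how BinTrack queries its vision-language model, so the `target_passed` predicate below is a hypothetical stand-in that is assumed to be monotone along the route (false before the target, true from the target onwards).

```python
def first_frame_with_target(lo, hi, target_passed):
    """Binary search for the first index in [lo, hi] where target_passed is True.

    Assumes target_passed is monotone along the trajectory and that
    target_passed(hi) is True. Returns (index, number_of_predicate_calls).
    """
    calls = 0
    while lo < hi:
        mid = (lo + hi) // 2
        calls += 1
        if target_passed(mid):
            hi = mid          # target is at mid or earlier
        else:
            lo = mid + 1      # target lies strictly after mid
    return lo, calls

# Hypothetical route: anchors at frames 200 and 900, target first seen at 637.
index, calls = first_frame_with_target(200, 900, lambda i: i >= 637)
```

For a segment of 701 frames this needs at most 10 predicate evaluations instead of hundreds, which is one plausible reason a search of this kind reduces inference cost when each evaluation is a model call.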

2606.16935 2026-06-16 cs.RO cs.AI cs.LG 新提交

CrossMaps: Confidence-Aware Open-Vocabulary Semantic Mapping for Rover Navigation

CrossMaps: 用于漫游车导航的置信度感知开放词汇语义地图

Jan-Niklas Klein, Sona Ghahremani, Christian Medeiros Adriano, Holger Giese

发表机构 * Hasso Plattner Institute for Digital Engineering, Potsdam, Germany(哈索·普拉特纳数字工程研究所(德国波茨坦))

AI总结 提出CrossMaps,一种实时置信度感知开放词汇语义地图构建流水线,通过多尺度CLIP嵌入、置信度融合和双记忆架构生成可查询语义地图,用于漫游车导航。

Comments IEEE International Conference on Robotics and Automation (ICRA) 2026: ROSE International Workshop on Robotics Software Engineering, June 01, 2026, Vienna, Austria

详情
AI中文摘要

漫游车依赖感知来维护空间地图,该地图编码物体和传感器质量(例如,距离可靠性、光照伪影、数据密度),指导数据融合、嵌入更新以及在部分可观测性下的导航。为了研究这些耦合的感知-导航过程,我们提出了CrossMaps,一种实时的置信度感知开放词汇语义地图构建流水线,该流水线从RGB-D数据构建可语言查询的地图。基于VLMaps风格的方法,CrossMaps集成了多尺度CLIP嵌入、置信度感知融合以及由短期记忆(STM)和长期记忆(LTM)组成的双记忆架构。STM使用几何、语义和时间置信度线索聚合噪声视觉观测,而置信且一致的单元被提升到LTM作为持久语义地标。CrossMaps设计用于与Jetson Orin驱动的UGV以及SLAM一起部署,实时运行并生成语义热力图,可通过自然语言查询来引导漫游车导航。

英文摘要

Rovers rely on perception to maintain spatial maps that encode both objects and sensor quality (e.g., range reliability, lighting artifacts, data density), guiding data fusion, embedding updates, and navigation under partial observability. To study these coupled perception-navigation processes, we present CrossMaps, a real-time confidence-aware open-vocabulary semantic mapping pipeline that constructs language-queryable maps from RGB-D data. Building on VLMaps-style approaches, CrossMaps integrates multi-scale CLIP embeddings with confidence-aware fusion and a dual-memory architecture consisting of Short-Term Memory (STM) and Long-Term Memory (LTM). The STM aggregates noisy visual observations using geometric, semantic, and temporal confidence cues, while confident and coherent cells are promoted to the LTM as persistent semantic landmarks. Designed for deployment with a Jetson Orin-powered UGV alongside SLAM, CrossMaps runs in real time and produces semantic heatmaps that can be queried with natural language to guide rover navigation.
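
The short-term/long-term memory split described above can be sketched as follows. The data layout, the confidence-weighted running average, and the promotion thresholds are all assumptions made for illustration. The abstract names geometric, semantic, and temporal confidence cues but not how they are combined, so a single scalar confidence is used here.

```python
def new_cell(dim):
    """Empty map cell: zero embedding, no accumulated confidence."""
    return {"embedding": [0.0] * dim, "weight": 0.0, "hits": 0}

def fuse(cell, embedding, confidence):
    """Confidence-weighted running average of a cell's embedding."""
    if confidence <= 0.0:
        return cell  # ignore observations that carry no confidence
    total = cell["weight"] + confidence
    cell["embedding"] = [
        (cell["weight"] * old + confidence * new) / total
        for old, new in zip(cell["embedding"], embedding)
    ]
    cell["weight"] = total
    cell["hits"] += 1
    return cell

def promote(stm, ltm, min_weight=2.0, min_hits=3):
    """Move confident, repeatedly observed cells from STM to LTM."""
    for key in list(stm):
        cell = stm[key]
        if cell["weight"] >= min_weight and cell["hits"] >= min_hits:
            ltm[key] = stm.pop(key)
    return ltm

stm = {(0, 0): new_cell(2), (0, 1): new_cell(2)}
fuse(stm[(0, 0)], [1.0, 0.0], confidence=1.0)
fuse(stm[(0, 0)], [0.0, 1.0], confidence=1.0)
fuse(stm[(0, 0)], [1.0, 0.0], confidence=2.0)
fuse(stm[(0, 1)], [0.0, 1.0], confidence=5.0)   # confident but seen only once
ltm = promote(stm, {})
```

Requiring both accumulated confidence and repeated hits keeps a single confident but possibly spurious observation out of the persistent map.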

2606.16474 2026-06-16 cs.CV cs.RO 交叉投稿

MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry

MVOFormer:用于鲁棒单目视觉里程计的流-语义Transformer

Jituo Li, Shunwang Sun, Jialu Zhang, Xinqi Liu, Jinyao Hu, Zhicheng Lu, Sajad Saeedi, Guodong Lu

发表机构 * State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University(浙江大学流体动力与机电系统国家重点实验室) Zhejiang Key Laboratory of Industrial Big Data and Robot Intelligent Systems(浙江省工业大数据与机器人智能系统重点实验室) School of Mechanical Engineering, Zhejiang University(浙江大学机械工程学院) Robotics Institute, Zhejiang University(浙江大学机器人研究院) School of Artificial Intelligence and Robotics, Hunan University(湖南大学人工智能与机器人学院) Rural Health Research Institute, Charles Sturt University(查尔斯特大学农村健康研究所) University College London(伦敦大学学院)

AI总结 提出MVOFormer,一种流-语义双分支编码器与迭代多模态解码器结合的Transformer框架,通过融合密集几何运动与语义先验实现粗到细位姿优化,在零样本泛化上显著超越现有方法。

Comments 8 pages, 6 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

单目视觉里程计(MVO)是自主导航和机器人定位的基础。然而,现有的基于学习的MVO方法通常缺乏可解释的互补特征或具有过于复杂的多阶段架构,这些局限性固有地限制了它们的鲁棒性和跨域泛化能力。在这项工作中,我们提出了MVOFormer,一种用于鲁棒单目视觉里程计的新型Transformer框架。我们的架构采用流-语义双分支编码器,将密集几何运动线索与以物体为中心的语义先验协同结合,明确区分静态结构与动态干扰物。然后,这些表示通过迭代多模态解码器融合,实现从粗到细的位姿优化,同时动态抑制对不可靠区域的注意力。大量评估表明,无需任何目标域微调,MVOFormer在TartanAir、KITTI、TUM-RGBD和ETH3D-SLAM等多个基准上实现了优越的零样本泛化和鲁棒性,显著优于先前基于学习的帧到帧方法。

英文摘要

Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often struggle with either a lack of interpretable, complementary features or overly complex multi-stage architectures. These limitations inherently restrict their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM.

2606.16569 2026-06-16 cs.CV cs.RO 交叉投稿

PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

PROSE: 基于视觉语言模型的无训练自我中心场景配准

Zhiang Chen, Nahyuk Lee, Boyang Sun, Taein Kwon, Marc Pollefeys, Zuria Bauer, Sunghwan Hong

发表机构 * ETH Zurich(苏黎世联邦理工学院) VGG, University of Oxford(牛津大学VGG实验室) ETH AI Center(苏黎世联邦理工学院人工智能中心)

AI总结 提出PROSE方法,利用预训练视觉语言模型将RGB序列提升为对象级3D场景图,通过对象高度先验和相同/不同查询匹配实例,无需训练或深度传感器即可实现自我中心场景配准,在Aria基准上超越几何和场景图基线。

Comments Project page: https://rckola.github.io/prose/

详情
AI中文摘要

对同一室内空间在不同时间的两次采集进行配准,是机器人和AR系统持久空间记忆的基础,但该任务的现实版本是自我中心的,且其最具可扩展性的形式是仅使用RGB。头戴式摄像头产生模糊、快速移动、部分重叠的视图,难以从中恢复密集几何。经典配准恰恰依赖于该场景所缺乏的干净点云,而基于学习的场景图方法需要预先构建或注释的图以及训练好的匹配器,我们发现后者在自我中心数据下十分脆弱。我们采取不同路线,使用预训练的视觉语言模型作为场景理解和跨扫描匹配的来源。我们的方法PROSE(Prompted Scene rEgistration)利用现成的几何、分割和语言基础模型将每个RGB序列提升为对象级3D场景图,然后提示同一VLM匹配两个RGB序列中的对象实例。为了使匹配易于处理且可靠,我们利用对象高度作为先验,并通过配对的相同/不同查询验证每个提议的匹配,然后为每个匹配对象假设一个候选变换,并选择几何一致性最强的候选来求解刚体变换。PROSE不添加任何学习参数,也不需要深度传感器、训练或注释图。在自我中心的Aria Digital Twin和Aria Everyday Activities基准测试中,它在真值点云和RGB重建点云上的配准精度均优于几何基线和基于学习的场景图基线,并且其生成的场景图可直接用于下游任务。

英文摘要

Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.

2412.05873 2026-06-16 cs.RO 版本更新

AC-LIO: Towards Asymptotic Compensation for Distortion in LiDAR-Inertial Odometry via Selective Intra-Frame Smoothing

AC-LIO:通过选择性帧内平滑实现激光雷达-惯性里程计中畸变的渐近补偿

Tianxiang Zhang, Xuanxuan Zhang, Wenlei Fan, Xin Xia, Huai Yu, Lin Wang, You Li

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, China(信息工程测绘遥感国家重点实验室(LIESMARS),武汉大学,中国) Department of Mechanical Engineering, University of Michigan-Dearborn(密歇根大学迪尔伯恩分校机械工程系) Electronic Information School, Wuhan University, China(武汉大学电子信息学院,中国) School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore(南洋理工大学电子电气工程学院,新加坡)

AI总结 提出AC-LIO框架,通过渐近反向传播更新项并在收敛准则指导下补偿残余运动畸变,以最小计算开销提升离散状态LIO系统精度,在长期大尺度定位中平均RMSE降低30.4%。

Comments 11 pages, 9 figures

详情
AI中文摘要

现有的激光雷达-惯性里程计(LIO)方法通常利用IMU积分得到的先验轨迹来补偿激光雷达帧内的运动畸变。然而,先验轨迹与真实轨迹之间的差异会导致残余运动畸变,破坏激光雷达帧与其对应几何环境的一致性。这种不一致可能导致点云配准陷入局部最优,从而在长期和大尺度定位中加剧漂移。为此,我们提出了一种新颖的LIO框架,称为AC-LIO,采用选择性帧内平滑。我们的核心思想是在收敛准则的指导下渐近反向传播当前更新项并补偿残余运动畸变,旨在以最小的计算开销提高离散状态LIO系统的精度。大量实验表明,与现有技术相比,我们的AC-LIO框架进一步提升了里程计精度,平均RMSE比第二名结果降低约30.4%,从而显著提高了长期和大尺度定位与建图的精度。

英文摘要

Existing LiDAR-Inertial Odometry (LIO) methods typically use the prior trajectory derived from IMU integration to compensate for motion distortion within LiDAR frames. However, discrepancies between the prior and true trajectories can lead to residual motion distortions that compromise the consistency of each LiDAR frame with its corresponding geometric environment. This inconsistency may cause point cloud registration to become trapped in local optima, thereby exacerbating drift during long-term and large-scale localization. To this end, we propose a novel LIO framework with selective intra-frame smoothing, dubbed AC-LIO. Our core idea is to asymptotically backpropagate the current update term and compensate for residual motion distortion under the guidance of convergence criteria, aiming to improve the accuracy of discrete-state LIO systems with minimal additional computation. Extensive experiments demonstrate that AC-LIO further improves odometry accuracy over prior work, with about a 30.4% reduction in average RMSE relative to the second-best result, leading to marked improvements in the accuracy of long-term and large-scale localization and mapping.
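
Why an inaccurate prior trajectory leaves residual distortion can be seen in a small deskewing sketch. It makes strong simplifying assumptions that are not from the paper: 2D points, a sensor that translates at constant velocity without rotating, and points re-expressed in the sensor frame at the end of the scan. AC-LIO's actual compensation scheme is not reproduced here.

```python
def deskew(points, velocity, frame_duration):
    """Re-express timestamped points in the sensor frame at the scan end.

    Each point is (x, y, t) with t in [0, frame_duration], measured in the
    sensor frame at time t. The sensor is assumed to translate with constant
    `velocity` (vx, vy) and not to rotate.
    """
    vx, vy = velocity
    corrected = []
    for x, y, t in points:
        remaining = frame_duration - t   # sensor motion still to come
        corrected.append((x - vx * remaining, y - vy * remaining))
    return corrected

# A static landmark at x = 10 m, seen at scan start and at scan end (0.1 s)
# by a sensor that truly moves at 10 m/s along x.
scan = [(10.0, 0.0, 0.0), (9.0, 0.0, 0.1)]
good = deskew(scan, velocity=(10.0, 0.0), frame_duration=0.1)
biased = deskew(scan, velocity=(9.0, 0.0), frame_duration=0.1)
residual = abs(biased[0][0] - biased[1][0])
```

With the true velocity both observations of the landmark coincide at x = 9 m. With a prior that is off by 1 m/s they end up 0.1 m apart, which is the kind of residual distortion the abstract refers to.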

2602.00222 2026-06-16 cs.RO cs.AI cs.CV 版本更新

MapDream: Task-Driven Map Learning for Vision-Language Navigation

MapDream: 面向视觉-语言导航的任务驱动地图学习

Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Deying Li, Zhaoxin Fan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MapDream框架,通过自回归鸟瞰图生成联合学习地图与动作预测,在R2R-CE和RxR-CE上达到单目最优性能。

详情
AI中文摘要

视觉-语言导航(VLN)要求智能体在部分可观测的3D环境中遵循自然语言指令,这促使地图表示能够聚合超出局部感知的空间上下文。然而,现有大多数方法依赖于独立于导航策略构建的手工地图。我们认为,地图应该是由导航目标直接塑造的学习表示,而非详尽的重建。基于这一见解,我们提出MapDream,一种地图在环框架,将地图构建表述为自回归鸟瞰图(BEV)图像合成。该框架联合学习地图生成和动作预测,将环境上下文蒸馏为紧凑的三通道BEV地图,仅保留导航关键的可通行性。监督预训练引导了可靠的地图到控制接口,而自回归设计通过强化微调实现端到端联合优化。在R2R-CE和RxR-CE上的实验取得了最先进的单目性能,验证了任务驱动的生成式地图学习。

英文摘要

Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.

2602.05608 2026-06-16 cs.RO 版本更新

HiCrowd: Hierarchical Crowd Flow Alignment for Dense Human Environments

HiCrowd:密集人群环境中的分层人群流对齐

Yufei Zhu, Shih-Min Yang, Martin Magnusson, Allan Wang

发表机构 * Robot Navigation and Perception Lab, AASS Research Center, Örebro University, Sweden(奥雷布罗大学机器人导航与感知实验室,AASS研究中心,瑞典) Miraikan – The National Museum of Emerging Science and Innovation, Japan(日本科学未来馆,Miraikan)

AI总结 提出HiCrowd分层框架,结合强化学习与模型预测控制,通过跟随人群流解决机器人冻结问题,在真实和合成数据集上提升导航效率与安全性。

Comments 2026 IEEE International Conference on Robotics and Automation (ICRA)

详情
AI中文摘要

在密集人群中导航对移动机器人仍是一个重大挑战。关键问题是机器人冻结问题,即机器人难以找到安全运动并被困在人群中。为解决此问题,我们提出HiCrowd,一个将强化学习(RL)与模型预测控制(MPC)相结合的分层框架。HiCrowd利用周围行人运动作为引导,使机器人能够与兼容的人群流对齐。高层RL策略生成一个跟随点,使机器人与合适的人群组对齐,而低层MPC通过短时域规划安全地跟踪该引导。该方法结合了长期人群感知决策与安全短期执行。我们在离线设置(回放记录的人类轨迹)和在线设置(人类轨迹更新以在仿真中对机器人做出反应)中,将HiCrowd与反应式和基于学习的基线进行了比较。在真实世界数据集和合成人群数据集上的实验表明,我们的方法在导航效率和安全性上表现更优,同时减少了冻结行为。我们进一步通过在公共博物馆和2025年大阪世博会的实际部署进行验证,该方法无需重新训练即可在密集人流中导航,展现出鲁棒且具有社会意识的行为。我们的结果表明,利用人类运动作为引导,而非将人类仅视为动态障碍,为机器人在人群中安全高效导航提供了有力原则。项目代码和演示可在 https://github.com/test-bai-cpu/HiCrowd 获取。

英文摘要

Navigating through dense human crowds remains a significant challenge for mobile robots. A key issue is the freezing robot problem, where the robot struggles to find safe motions and becomes stuck within the crowd. To address this, we propose HiCrowd, a hierarchical framework that integrates reinforcement learning (RL) with model predictive control (MPC). HiCrowd leverages surrounding pedestrian motion as guidance, enabling the robot to align with compatible crowd flows. A high-level RL policy generates a follow point to align the robot with a suitable pedestrian group, while a low-level MPC safely tracks this guidance with short-horizon planning. The method combines long-term crowd-aware decision making with safe short-term execution. We evaluate HiCrowd against reactive and learning-based baselines in an offline setting (replaying recorded human trajectories) and an online setting (human trajectories are updated to react to the robot in simulation). Experiments on a real-world dataset and a synthetic crowd dataset show that our method outperforms the baselines in navigation efficiency and safety, while reducing freezing behaviors. We further validate the method through real-world deployment in a public museum and at Expo 2025 Osaka, where it navigates dense pedestrian flows without retraining, demonstrating robust and socially aware behavior. Our results suggest that leveraging human motion as guidance, rather than treating humans solely as dynamic obstacles, provides a powerful principle for safe and efficient robot navigation in crowds. Project code and demos are available at https://github.com/test-bai-cpu/HiCrowd.

2603.16273 2026-06-16 cs.RO 版本更新

GenZ-LIO: Generalizable LiDAR-Inertial Odometry Beyond Confined-Open Boundaries

GenZ-LIO: 超越受限与开放边界的可泛化激光雷达-惯性里程计

Daehan Lee, Hyungtae Lim, Seongjun Kim, Soonbin Rho, Changhyeon Lee, Sanghyun Park, Junwoo Hong, Eunseon Choi, Hyunyoung Jo, Soohee Han

发表机构 * Computational Control Engineering Laboratory (CoCEL), Department of Convergence IT Engineering and Electrical Engineering, Pohang University of Science and Technology (POSTECH), Pohang 37673, South Korea(浦项科技大学(POSTECH)融合IT工程系与电气工程系计算控制工程实验室(CoCEL),韩国浦项 37673) Laboratory for Information & Decision Systems (LIDS), Massachusetts Institute of Technology, Cambridge, MA 02139, USA(麻省理工学院信息与决策系统实验室(LIDS),美国马萨诸塞州剑桥 02139)

AI总结 提出GenZ-LIO框架,通过尺度自适应体素化、混合度量状态更新和体素剪枝对应搜索,解决受限与开放空间过渡导致的鲁棒性和效率下降问题,在42个序列上保持稳定估计。

Comments 21 pages, 12 figures

详情
AI中文摘要

对于巡检、搜索救援和探索等现场机器人任务,激光雷达-惯性里程计(LIO)通过在GNSS拒绝或无结构环境中提供定位和建图,可作为自主性的核心组件。然而,现场部署中常见的受限与开放空间之间的过渡会导致扫描密度和局部几何结构发生显著变化,从而降低LIO的鲁棒性和计算效率。为解决这些问题,我们提出了GenZ-LIO,一个可泛化的LIO框架,旨在适应受限和开放环境中空间尺度的变化。GenZ-LIO包含三个组件:(i) 尺度感知自适应体素化,用于调节跨空间尺度变化的扫描下采样;(ii) 混合度量状态更新,用于在变化的几何结构下组合点到平面和点到点残差;(iii) 体素剪枝对应搜索,用于高效的点到点匹配。我们使用来自九个公共数据集的42个序列以及我们新收集的NarrowWide数据集进行了全面评估,以分析LIO在不同现场场景下空间尺度变化时的性能。在评估的序列中,GenZ-LIO保持稳定的里程计估计而不发散,表明在测试的现场条件下具有实际鲁棒性。源代码和收集的数据集将在发表后公开。

英文摘要

For field robotic missions such as inspection, search-and-rescue, and exploration, light detection and ranging (LiDAR)-inertial odometry (LIO) can serve as a core component of autonomy by providing localization and mapping in GNSS-denied or unstructured environments. However, transitions between confined and open spaces, which are commonly encountered in field deployments, can induce substantial changes in scan density and local geometric structure, thereby reducing the robustness and computational efficiency of LIO. To address these issues, we present GenZ-LIO, a generalizable LIO framework designed to adapt to variations in spatial scale across confined and open environments. GenZ-LIO comprises three components: (i) scale-aware adaptive voxelization for regulating scan downsampling across spatial scale changes, (ii) hybrid-metric state update for combining point-to-plane and point-to-point residuals under varying geometric structure, and (iii) voxel-pruned correspondence search for efficient point-to-point matching. We conduct a comprehensive evaluation using 42 sequences from nine public datasets and our newly collected NarrowWide dataset to analyze LIO performance under spatial scale variations across diverse field scenarios. Across the evaluated sequences, GenZ-LIO maintains stable odometry estimation without divergence, indicating practical robustness under the tested field conditions. The source code and collected dataset will be made publicly available upon publication.
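
Component (i) above, scale-aware adaptive voxelization, can be sketched roughly as below. The rule used here, a voxel size proportional to the median point range and clamped to a fixed interval, is an assumption for illustration only. The abstract does not state how GenZ-LIO actually regulates the voxel size.

```python
import math
from statistics import median

def adaptive_voxel_size(points, ratio=0.02, min_size=0.1, max_size=1.0):
    """Pick a voxel size proportional to the median point range (assumed rule)."""
    med_range = median(math.dist((0.0, 0.0, 0.0), p) for p in points)
    return min(max(ratio * med_range, min_size), max_size)

def voxel_downsample(points, voxel_size):
    """Keep one centroid per occupied voxel."""
    buckets = {}
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets.setdefault(key, []).append(p)
    return [
        tuple(sum(coord) / len(bucket) for coord in zip(*bucket))
        for bucket in buckets.values()
    ]

# Hypothetical scans: a narrow corridor (about 2 m ranges) and an open field.
confined = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
open_space = [(40.0, 0.0, 0.0), (0.0, 40.0, 0.0), (0.0, 0.0, 40.0)]
size_confined = adaptive_voxel_size(confined)
size_open = adaptive_voxel_size(open_space)
reduced = voxel_downsample([(0.01, 0.0, 0.0), (0.02, 0.0, 0.0), (5.0, 5.0, 5.0)], 0.1)
```

Small voxels keep detail where geometry is close and dense, and large voxels cap the point count in open scenes. That is the intent of the component, whatever the exact rule.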

2606.04907 2026-06-16 cs.RO 版本更新

WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

WAM-Nav:面向统一视觉导航的非对称潜在世界-动作建模

Ning Yang, Yan Huang, Kaiwen Peng, Ziheng He, Kai Wang, Cui Miao, Kailin Lyu, Guo Li, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu

发表机构 * Nanjing University(南京大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Chinese Academy of Sciences(中国科学院大学) FiveAges National University of Defense Technology(国防科技大学) Tsinghua University(清华大学) GigaAI

AI总结 提出WAM-Nav,一种联合学习动作生成与潜在视觉预测的非对称扩散Transformer模型,通过共享扩散Transformer实现长时程动作与短时程视觉预测的联合扩散,并引入双流上下文条件机制和目标对齐模块,在统一策略下支持图像目标、点目标和无目标导航,在ClutterScenes和InternScenes基准上分别提升15.7%和3.3%的成功率,并在真实环境中实现85%的任务成功率。

详情
AI中文摘要

视觉导航需要在复杂的几何和物理约束下生成平滑且无碰撞的轨迹。现有的反应式策略直接将观测映射到动作,缺乏预期推理能力,限制了其主动避障的能力。虽然视觉想象提供了预测性前瞻,但传统的模块化方法将场景预测与策略学习分离,常常导致误差累积和推理效率低下。为了解决这些限制,我们提出了WAM-Nav,一种用于具身视觉导航的潜在世界-动作模型,它联合学习动作生成和潜在视觉预测,从而在不影响推理效率的情况下实现更鲁棒和更具前瞻性的导航决策。具体来说,WAM-Nav利用共享的扩散Transformer进行非对称联合扩散,同时生成长时程动作和短时程视觉预测,减少了多步自回归展开中固有的推理延迟和视觉误差累积。为了进一步促进平滑且一致的轨迹生成,我们引入了一种双流上下文条件机制,将情节级别的自运动历史与顺序视觉观测相结合。结合统一的目标对齐模块,该模块在不同目标类型间保持平衡表示,WAM-Nav在单一策略下自然支持图像目标、点目标和无目标探索。在具有挑战性的ClutterScenes和InternScenes基准上的大量实验证明了WAM-Nav的强大泛化能力,特别是在图像目标和点目标导航中,成功率分别提高了15.7%和3.3%。真实世界部署进一步验证了有效的零样本模拟到现实迁移,在多样化的室内和室外环境中实现了平均85%的任务成功率。

英文摘要

Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.

2511.15645 2026-06-16 cs.CV cs.RO Updated version

FDIO: Frequency Decomposed Inertial Odometry

FDIO:频率分解惯性里程计

Shanshan Zhang, Liqin Wu, Wenying Cao, Lingxiang Zheng, Yu Yang

Affiliations Department of Information and Communication Engineering, National and Local Joint Engineering Research Center of Navigation and Location Based Services, Xiamen University; Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University

AI summary Targets the coupling of IMU signals in dual-device acquisition settings with frequency decomposed inertial odometry (FDIO): a Laplacian pyramid decomposes the signal, a Mamba module models low-frequency long-range motion, and multi-scale convolution extracts high-frequency local features, reducing average absolute trajectory error by 33.3% across five datasets.

AI Chinese abstract

行人惯性里程计(PIO)仅利用惯性测量单元(IMU)采集的加速度和角速度测量值估计自主行人运动,使其在消费级定位应用中具有极高价值。然而,在双设备采集设置下,自由携带的移动设备收集的IMU信号本质上是复合信号,其中人体躯干的全局运动与局部肢体运动引起的扰动耦合在一起。这种耦合使得精确的人体运动建模更具挑战性。为解决这一问题,本文提出了频率分解惯性里程计(FDIO)。该方法首先使用拉普拉斯金字塔将输入IMU信号分解为低频和高频分量。然后采用Mamba模块从低频分量中建模长程运动信息,并使用多尺度卷积模块从高频分量中提取细粒度局部动态特征。在五个公开PIO数据集上的实验表明,FDIO的平均绝对轨迹误差为3.221米,平均相对轨迹误差为2.550米,与RoNIN ResNet基线相比,误差分别降低了33.3%和16.7%。这些结果验证了所提出的频率分解策略的有效性。据我们所知,这项工作是将Mamba和频率分解架构引入惯性里程计的早期尝试之一。

English abstract

Pedestrian inertial odometry (PIO) estimates autonomous pedestrian motion using only acceleration and angular velocity measurements collected by an inertial measurement unit (IMU), making it highly valuable for consumer-level localization applications. However, under a dual-device acquisition setting, IMU signals collected by a freely carried mobile device are inherently composite signals in which the global motion of the human torso is coupled with perturbations induced by local limb motion. This coupling makes accurate human motion modeling more challenging. To address this issue, this paper proposes frequency decomposed inertial odometry (FDIO). The proposed method first decomposes input IMU signals into low-frequency and high-frequency components using a Laplacian pyramid. It then adopts a Mamba module to model long-range motion information from the low-frequency component and uses a multi-scale convolution module to extract fine-grained local dynamic features from the high-frequency component. Experiments on five public PIO datasets show that FDIO achieves an average absolute trajectory error of 3.221 m and an average relative trajectory error of 2.550 m, reducing the errors by 33.3% and 16.7%, respectively, compared with the RoNIN ResNet baseline. These results validate the effectiveness of the proposed frequency decomposition strategy. To the best of our knowledge, this work is among the first efforts to introduce Mamba and a frequency decomposition architecture into inertial odometry.
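
As an illustration of the frequency split described in this abstract, the sketch below separates a 1-D signal into a low-frequency band and a high-frequency band with a single Laplacian-pyramid level. It is a minimal sketch, not the authors' implementation: the 5-tap binomial kernel, the single pyramid level, the edge handling, and the names `smooth` and `split_bands` are all assumptions, since the abstract does not state these details.

```python
def smooth(x):
    """Blur with a 5-tap binomial kernel, replicating the edge samples."""
    kernel = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)  # assumed kernel
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - 2, 0), n - 1)  # clamp index at the borders
            acc += w * x[idx]
        out.append(acc)
    return out


def split_bands(x):
    """One Laplacian-pyramid level: return (low, high) with low + high == x."""
    coarse = smooth(x)[::2]            # blur, then downsample by 2
    upsampled = []
    for c in coarse:                   # upsample by repeating each sample
        upsampled.extend([c, c])
    low = smooth(upsampled[:len(x)])   # blur again to interpolate
    high = [a - b for a, b in zip(x, low)]  # residual holds the fast detail
    return low, high


if __name__ == "__main__":
    # Hypothetical single-axis accelerometer samples.
    accel = [0.0, 0.2, 0.9, 0.3, -0.4, -0.8, -0.1, 0.5]
    low, high = split_bands(accel)
    print("low :", [round(v, 3) for v in low])
    print("high:", [round(v, 3) for v in high])
```

By construction the two bands sum back to the input, so under these assumptions the Mamba branch and the convolution branch mentioned in the abstract would each see a complementary part of the same signal.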

5. Human-Robot Interaction and Collaborative Robots (11 papers)

2606.14969 2026-06-16 cs.RO New submission

Multimodal Physiological Assessment of Contact-Rich Physical Human-Robot Interaction Under Varying Environmental Conditions

多变环境条件下接触丰富型物理人机交互的多模态生理评估

Yanyi Chen, Xi Wang, Min Deng

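Affiliations Texas Tech University; Texas A&M University; University of Tennessee, Knoxville

AI summary Using multimodal physiological measurements (EDA, sEMG, eye tracking) and subjective comfort ratings, this study reveals the hidden physiological cost that environmental stressors (temperature, noise, illuminance) impose on operators in contact-rich physical human-robot interaction, and identifies a compensatory mechanism in which operators maintain task performance by increasing physiological effort.

AI Chinese abstract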

现实环境中的物理人机交互(pHRI)使操作者在接触丰富型任务中暴露于波动的环境条件。传统的以任务为中心的评估忽视了这些压力源带来的生理负担。因此,我们进行了一项多模态实证研究,涉及18种不同的温度、噪声和照度组合下的接触丰富型追踪任务。同步记录了皮肤电活动(EDA)、表面肌电图(sEMG)、眼动追踪数据和主观环境舒适度评分。将这些生理信号与执行数据一起评估,揭示了客观性能未捕捉到的隐藏生理成本。结果显示,任务性能在所有环境条件下保持稳定。自主神经负荷(由紧张性皮肤电导水平(SCL)指示)随温度升高而增加,而身体和认知负荷不受影响。感知环境舒适度与追踪误差或完成时间无显著关联。这些发现揭示了一种补偿机制,即操作者通过增加生理努力来抑制热不适,从而保持一致的性能。这一见解推动了开发生理感知控制架构的动机,该架构利用实时生理指标来减少非结构化环境中操作者的工作负荷。

English abstract

Physical human-robot interaction (pHRI) in real-world settings exposes operators to fluctuating environmental conditions during contact-rich tasks. Traditional task-centric evaluations overlook the physiological burdens imposed by these stressors. Therefore, we conducted a multimodal empirical study involving contact-rich tracing tasks under 18 distinct combinations of temperature, acoustic noise, and illuminance. Synchronously, we recorded electrodermal activity (EDA), surface electromyography (sEMG), eye-tracking data, and subjective environmental comfort ratings. Evaluating these physiological signals alongside execution data revealed hidden physiological costs not captured by objective performance. The results revealed that task performance remained stable across all environmental conditions. Autonomic workload, indexed by tonic skin conductance level (SCL), increased with temperature, while physical and cognitive workload were unaffected. Perceived environmental comfort showed no significant association with tracing error or completion time. These findings reveal a compensatory mechanism where operators maintain consistent performance by increasing their physiological effort to suppress thermal discomfort. Such insight motivates the development of physiology-aware control architectures that leverage real-time physiological metrics to reduce operator workload in unstructured environments.

2606.15239 2026-06-16 cs.RO cs.CY cs.HC New submission

Co-Creating Buildable and Open Social Robot Study Companions with University Students

与大学生共同创造可构建且开放的社会机器人学习伙伴

Farnaz Baksh, Matevž B. Zorec, Feiazie Baksh, Karl Kruusamäe

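Affiliations University of Tartu; University of Guyana Robotics Club

AI summary To lower the high build barrier of open-source robots, applies the Double Diamond framework to co-design the Robot Study Companion v4.1 with university students; design-for-assembly and design-for-disassembly features such as twist-lock fasteners and snap-fit joints raise system usability from Poor to Excellent (SUS 59.4 to 89.4), with a downward trend in perceived workload.

Comments Accepted for 18th International Conference on Social Robotics (ICSR + ART 2026), London, UK | 1-4 July 2026

AI Chinese abstract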

开源社会机器人提供了可访问性、可修复性和学生赋权,但构建本身往往是一个障碍。现有平台要么预组装发货,排除了动手学习的机会,要么让学生面对不熟悉的紧固件、不透明的布线和难以触及的维修点,从而削弱了参与度。针对性的机械重新设计能否在保持结构完整性的同时降低这一障碍,尚未得到验证。在这里,我们展示了面向装配的设计(DfA)和面向拆卸的设计(DfD)干预措施在缩短构建时间之前,先改变了构建的体验感受。与圭亚那和爱沙尼亚的大学生合作,我们应用双钻石框架共同创造了机器人学习伴侣(RSC)v4.1:映射痛点,然后围绕扭锁紧固件、卡扣接头和免工具维修锁扣重新设计其底盘。在两项涉及开发者和首次构建者的研究中,系统可用性从差提升至优(SUS 59.4到89.4),感知工作负荷呈下降趋势(NASA-TLX 4.29到4.00),平均组装时间呈下降趋势(21.4到13.7分钟,含初学者的学习效应),同时为首次构建者提供的方向提示和导航连续性成为下一个文档化前沿。感知工作负荷,而非完成时间,似乎是决定学生是否接受开源硬件的关键因素。

English abstract

Open-source social robots offer accessibility, repairability, and student empowerment, yet the build itself often presents a barrier. Existing platforms either ship pre-assembled, foreclosing hands-on learning, or expose students to unfamiliar fasteners, opaque wiring, and inaccessible service points that erode engagement. Whether targeted mechanical redesign can lower this barrier whilst maintaining structural integrity remains untested. Here we show that Design for Assembly (DfA) and Design for Disassembly (DfD) interventions reshape how a build feels before they shorten how long it takes. Working with university students in Guyana and Estonia, we applied the Double Diamond framework to co-create the Robot Study Companion (RSC) v4.1: mapping pain points, then redesigning its chassis around twist-lock fasteners, snap-fit joints, and tool-free service latches. Across two studies with developers and first-time builders, system usability climbed from Poor to Excellent (SUS 59.4 to 89.4), perceived workload trended downward (NASA-TLX 4.29 to 4.00), and mean assembly time trended downward (21.4 to 13.7 minutes, with juniors' learning effect), whilst orientation cues and navigation continuity for first-time builders emerged as the next documentation frontier. Perceived workload, not completion time, appears to govern whether students take up open hardware.

2606.15434 2026-06-16 cs.RO cs.HC cs.SY eess.SY New submission

A Bilateral Teleoperation Framework for Dexterous Manipulation

灵巧操作的双边遥操作框架

Stefano Dalla Gasperina, Dong Ho Kang, Haiyun Zhang, Aldo Galvan, Job D. Ramirez, Aaron Kim, Mark Helwig, Kazuto Yokoyama, Takahisa Ueno, Tetsuya Narita, Ann Majewicz-Fey, Ashish D. Deshpande, Luis Sentis

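Affiliations The University of Texas at Austin; Sony Group Corporation; Meta Reality Labs Research

AI summary Proposes a modular bilateral teleoperation framework that integrates operator-side input with a robot-side dexterous hand and compliant arm, achieving dexterous manipulation through position-based retargeting, differential arm control, multi-scale haptic feedback, and shared control, and validating coordinated control and contact-aware interaction.

Comments 4 pages, 7 figures, 1 appendix

AI Chinese abstract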

灵巧遥操作需要精确的手臂-手协调、低延迟反馈以及在真实接触丰富环境中的鲁棒交互。本文提出一个模块化双边遥操作框架,将操作端输入接口与机器人端灵巧手和柔顺机械臂集成在统一控制架构中。该系统支持基于位置的手部重定向、差分臂控制、多尺度触觉反馈和共享控制,以实现稳定操作。我们通过一个真实的灵巧操作任务验证了该框架,突出了协调的手臂-手控制和接触感知交互。除了可行性之外,我们还识别了与跨具身不匹配、触觉反馈粒度和共享控制相关的关键设计见解。所提出的平台提供了一个实用的遥操作系统,并为未来从演示中学习的研究收集高质量演示奠定了基础。

English abstract

Dexterous teleoperation requires precise arm-hand coordination, low-latency feedback, and robust interaction in real-world contact-rich environments. This paper presents a modular bilateral teleoperation framework that integrates operator-side input interfaces with a robot-side dexterous hand and compliant robotic arm in a unified control architecture. The system supports position-based hand retargeting, differential arm control, multi-scale haptic feedback, and shared control for stable manipulation. We validate the framework through a real-world dexterous manipulation task, highlighting coordinated arm-hand control and contact-aware interaction. Beyond feasibility, we identify key design insights related to cross-embodiment mismatch, haptic feedback granularity, and shared control. The proposed platform provides a practical teleoperation system and a foundation for collecting high-quality demonstrations for future learning-from-demonstration research.

2606.08281 2026-06-16 cs.RO cs.HC cs.SY eess.SY physics.med-ph New submission

Impedance MPC for Physical Human-Robot Interaction: Predictive Disturbance Rejection with Joint-Limit Safety

阻抗MPC用于物理人机交互:具有关节极限安全性的预测性扰动抑制

Yongyan Cao, Jinshan Tang

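Affiliations Voryx Robotic LLC; George Mason University

AI summary Addresses the tension between trajectory accuracy and safety in physical human-robot interaction with a two-layer Impedance MPC that analytically cancels the dynamics and estimates persistent disturbances with a Kalman filter to achieve zero steady-state error, while a null-space barrier and workspace projection enforce joint-limit safety.

Comments 7 pages and 3 figures

AI Chinese abstract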

物理人机交互(pHRI)要求在非计划接触下同时实现轨迹精度和顺应性安全。经典阻抗控制在持续人力作用下会产生非零稳态位置误差(施加力除以任务刚度),积分作用仅在狭窄的稳定增益预算内减少该误差。我们提出一种双层阻抗MPC来解决这一矛盾。第一层解析抵消重力、科里奥利力和任务空间惯性,将剩余被控对象简化为具有恒定状态转移矩阵的构型无关双积分器。第二层以100 Hz求解30变量凸QP,利用该恒定结构使得自由响应矩阵仅需预计算一次;增广卡尔曼滤波器估计持续扰动状态,提供形式化的零稳态误差保证。零空间逆势垒和任务空间工作空间投影在测试工作空间内保证关节极限安全。在7自由度Franka FR3上,与经典阻抗在持续15 N力下的44.8 mm稳态误差相比,带卡尔曼增广的阻抗MPC达到亚0.05 mm稳态误差(降低超过800倍),在四个3-D圆上实现亚毫米跟踪,并对测量噪声和高达30%的惯性失配具有优雅鲁棒性。

English abstract

Physical human-robot interaction (pHRI) demands simultaneous trajectory accuracy and compliant safety under unplanned contact. Classical impedance control incurs a nonzero steady-state position error under sustained human force (the applied force divided by the task stiffness), which integral action reduces only within a narrow stable-gain budget. We present a two-layer Impedance MPC that resolves this tension. Layer 1 analytically cancels gravity, Coriolis, and task-space inertia, reducing the residual plant to a configuration-independent double integrator with a constant state-transition matrix. Layer 2 solves a 30-variable convex QP at 100 Hz, exploiting this constant structure so the free-response matrix is precomputed once; an augmented Kalman filter estimates the persistent disturbance state, giving a formal zero-steady-state-error guarantee. A null-space inverse-barrier potential and a task-space workspace projection enforce joint-limit safety across the tested workspace. On a 7-DOF Franka FR3, Impedance MPC with Kalman augmentation attains sub-0.05 mm steady-state error versus 44.8 mm for classical impedance (a more than 800-fold reduction) under a sustained 15 N force, sub-millimeter tracking on four 3-D circles, and graceful robustness to measurement noise and inertial mismatch up to 30%.
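
The steady-state relation quoted in this abstract (error equals applied force divided by task stiffness) can be checked against the reported figures with a short calculation. The sketch below is only a back-of-envelope check under stated assumptions: it treats the 15 N force as acting along a single task axis and assumes the relation holds exactly, which gives an implied stiffness of roughly 335 N/m. That stiffness is inferred here and is not a value reported in the abstract; the function names are hypothetical.

```python
def steady_state_error_m(force_n: float, stiffness_n_per_m: float) -> float:
    """Steady-state position error of a pure stiffness under a constant force."""
    return force_n / stiffness_n_per_m


def implied_stiffness_n_per_m(force_n: float, error_m: float) -> float:
    """Invert the same relation to get the stiffness implied by an error."""
    return force_n / error_m


# Figures quoted in the abstract.
FORCE_N = 15.0
CLASSICAL_ERROR_M = 44.8e-3   # classical impedance control
MPC_ERROR_BOUND_M = 0.05e-3   # "sub-0.05 mm" upper bound for Impedance MPC

# Assumption: single-axis force and an exact F / K relation.
stiffness = implied_stiffness_n_per_m(FORCE_N, CLASSICAL_ERROR_M)
fold_reduction = CLASSICAL_ERROR_M / MPC_ERROR_BOUND_M

print(f"implied task stiffness: {stiffness:.1f} N/m")
print(f"error reduction: at least {fold_reduction:.0f}x")
```

The resulting ratio is consistent with the abstract's statement of a more than 800-fold reduction.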

2606.15494 2026-06-16 cs.RO New submission

Understanding and Modeling Perceived Cognitive and Physical Strain Dynamics for Planning-Oriented Human-Robot Collaboration in Prefabricated Construction

理解与建模感知的认知和体力应变动态:面向预制建筑中规划导向的人机协作

Yifan Wang, Bo Xiao, Shane T. Mueller

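Affiliations Department of Civil, Environmental, and Geospatial Engineering, Michigan Technological University

AI summary Through a controlled repeated work-rest experiment, this study builds empirically grounded models of perceived strain, with a linear mixed-effects model for cognitive strain accumulation and nonlinear decay for rest-phase recovery, to inform planning of human-robot collaboration in prefabricated construction.

Comments 53 pages, 15 figures

AI Chinese abstract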

预制建筑中的人机协作需要规划方法不仅考虑生产力,还要考虑重复工作和休息期间随时间变化的工人状态。现有的规划模型通常依赖于关于疲劳、工作量或恢复的简化假设,缺乏关于感知应变如何演变的特定领域经验证据。本研究开发了一种基于经验的、规划导向的方法,以表征预制建筑人机协作中感知应变的积累和恢复。通过受控重复工作-休息实验,使用心理努力评定量表和博格感知 exertion 评定量表评估感知的认知和体力应变。评估了线性和指数函数形式,然后进行混合效应建模以检查协作条件、会话效应和个体间变异性。结果表明,认知应变积累最好由线性混合效应模型表示,而休息阶段的恢复遵循非线性衰减。由此产生的规划导向模型可能为未来的人状态感知任务分配和调度研究提供信息。

English abstract

Human-robot collaboration (HRC) in prefabricated construction requires planning approaches that consider not only productivity but also time-dependent worker states during repeated work and rest. Existing planning models often rely on simplified assumptions about fatigue, workload, or recovery, with limited domain-specific empirical evidence on how perceived strain evolves. This study develops an empirically grounded, planning-oriented approach to characterize perceived strain accumulation and recovery in prefabricated construction HRC. A controlled repeated work-rest experiment assessed perceived cognitive and physical strain using the Rating Scale for Mental Effort and Borg's Rating of Perceived Exertion. Linear and exponential functional forms were evaluated, followed by mixed-effects modeling to examine collaborative conditions, session effects, and inter-individual variability. Results indicate that cognitive strain accumulation is best represented by a linear mixed-effects model, whereas rest-phase recovery follows nonlinear decay. The resulting planning-oriented models may inform future human-state-aware task allocation and scheduling research.
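
To make the two functional forms in this abstract concrete, the sketch below chains linear strain accumulation during work with a decaying recovery during rest. It is a hedged illustration only: the abstract says recovery follows nonlinear decay and that exponential forms were evaluated, so exponential decay toward a baseline is an assumption here, as are all parameter values and function names. The per-participant random effects of the mixed-effects model are omitted.

```python
import math


def strain_during_work(start, rate_per_min, minutes):
    """Linear accumulation: strain grows at a constant rate while working."""
    return start + rate_per_min * minutes


def strain_during_rest(start, time_constant_min, minutes, baseline=0.0):
    """Assumed exponential decay of strain toward a resting baseline."""
    return baseline + (start - baseline) * math.exp(-minutes / time_constant_min)


def simulate(cycles, work_min, rest_min, rate_per_min, time_constant_min):
    """Return (peak, after_rest) strain for each work-rest cycle."""
    strain = 0.0
    history = []
    for _ in range(cycles):
        strain = strain_during_work(strain, rate_per_min, work_min)
        peak = strain
        strain = strain_during_rest(strain, time_constant_min, rest_min)
        history.append((peak, strain))
    return history


if __name__ == "__main__":
    # Hypothetical schedule: 3 cycles of 10 min work and 5 min rest.
    for peak, rested in simulate(3, 10.0, 5.0, 2.0, 5.0):
        print(f"peak {peak:.1f} -> after rest {rested:.1f}")
```

A planner could evaluate candidate work-rest schedules with a function like `simulate`, although the fitted coefficients would have to come from the study itself.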

2606.15568 2026-06-16 cs.RO New submission

SAPS: Shared Autonomy for Policy Steering by Blending Teleoperation with a Pretrained VLA

SAPS: 通过混合遥操作与预训练VLA的策略引导共享自主性

Crystal Zhou, Jehan Yang, Douglas J. Weber, Zackory Erickson

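Affiliations Carnegie Mellon University

AI summary Proposes SAPS, a framework that blends human teleoperation commands with pretrained policy actions at the action level without retraining or auxiliary models; its arbitration strategies, including a dynamic cosine-similarity strategy, raise task success rates by up to 82% and reduce human intervention.

Comments 23 pages, 15 figures, 5 tables

AI Chinese abstract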

近期视觉-语言-动作(VLA)模型的进展展示了机器人操作中令人印象深刻的通用能力,但这些策略在分布外的空间和语义扰动下可能变得脆弱。虽然人类遥操作提供了可靠的恢复,但可能要求高认知负荷和精确的手动控制,且现有的策略引导方法通常需要辅助模型或采样器修改。在这项工作中,我们引入了策略引导的共享自主性(SAPS),这是一个在动作层面混合实时人类遥操作命令与预训练策略动作的框架。SAPS不需要策略重训练、辅助动力学模型或架构修改。我们提出并评估了三种仲裁策略来平衡人类和VLA策略控制,包括一种动态余弦相似度仲裁策略,该策略计算人类与策略动作之间的几何一致性。在仿真(LIBERO、LIBERO-PRO、CALVIN)和真实机器人硬件上的评估中,SAPS在仿真和真实世界中将任务成功率比自主执行提高了高达82%。此外,与纯遥操作相比,我们的方法大幅减少了人工干预,同时实现了比自主执行和纯遥操作更快的任务完成时间。这些结果表明,动作级共享自主性是一种实用的、模型无关的方法,用于在涉及人类操作员的真实世界环境中可靠部署通用机器人策略,在辅助遥操作和可扩展数据收集方面具有前景的应用。

English abstract

Recent advancements in Vision-Language-Action (VLA) models have demonstrated impressive generalist capabilities in robot manipulation, yet these policies can be brittle under out-of-distribution spatial and semantic perturbations. While human teleoperation offers reliable recovery, it can demand high cognitive load and precise manual control, and existing policy steering methods often require auxiliary models or sampler modifications. In this work, we introduce Shared Autonomy for Policy Steering (SAPS), a framework that blends real-time human teleoperation commands with pretrained policy actions at the action level. SAPS requires no policy retraining, auxiliary dynamics models, or architectural modifications. We propose and evaluate three arbitration strategies to balance human and VLA policy control, including a dynamic Cosine-similarity arbitration strategy that computes the geometric agreement between human and policy actions. Across evaluations in simulation (LIBERO, LIBERO-PRO, CALVIN) and on real-world robot hardware, SAPS improves task success rates over autonomous execution by up to 82% in both simulation and the real world. Furthermore, our approach drastically reduces human intervention compared to pure teleoperation, while simultaneously achieving faster task completion times than both autonomous execution and pure teleoperation. These results demonstrate that action-level shared autonomy is a practical, model-agnostic approach for reliably deploying generalist robot policies in real-world contexts involving a human operator, with promising applications in assistive teleoperation and scalable data collection.
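
The action-level blending described in this abstract can be sketched as follows. The abstract says only that the cosine-similarity strategy measures the geometric agreement between human and policy actions. How that agreement sets the blend weight is not stated, so the rule below (more agreement gives the policy more weight, disagreement defers to the human, an idle operator leaves the policy in control) is a hypothetical choice for illustration and not the SAPS arbitration rule. All function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two action vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def blend_actions(human, policy):
    """Blend one human command with one policy action at the action level."""
    if all(h == 0.0 for h in human):
        return list(policy)  # assumption: idle operator, policy acts alone
    agreement = cosine_similarity(human, policy)   # in [-1, 1]
    policy_weight = (1.0 + agreement) / 2.0        # hypothetical mapping to [0, 1]
    return [policy_weight * p + (1.0 - policy_weight) * h
            for h, p in zip(human, policy)]


if __name__ == "__main__":
    # Hypothetical 2-D end-effector velocity commands.
    print(blend_actions([1.0, 0.0], [0.0, 1.0]))
```

Because the blend operates only on the two action vectors, it needs no access to the policy's weights, which matches the abstract's description of the approach as model-agnostic.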

2606.16413 2026-06-16 cs.RO cs.HC New submission

An Augmented Reality Brain-Robot Interface for Generalist Robot Arm Manipulation

面向通用机器人臂操控的增强现实脑机接口

Shangkai Zhang, Rousslan Fernand Julien Dossa, Luca Nunziante, Marina Di Vincenzo, Kai Arulkumaran

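Affiliations Araya Inc.

AI summary Proposes an augmented reality brain-robot interface that combines eye tracking with motor-imagery EEG for intuitive generalist robot arm manipulation; a study with 18 participants shows effective execution of multi-step daily tasks and good usability.

Comments Accepted at the 2026 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

AI Chinese abstract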

增强现实(AR)与基于脑电图的脑机接口(BCI)的融合为辅助目的提供了直观控制机器人的有前景途径。然而,现有的AR脑机接口(BRI)系统通常受限于特定任务结构,限制了其在真实环境中的实用性。我们提出了一种面向通用机器人臂操控的AR BRI,它将基于注视的对象选择与运动想象动作控制相结合。我们的系统使用眼动追踪进行直观对象定位,并通过上下文感知的视觉覆盖(“放置”和“使用”)在共享自主框架内引导用户完成任务。我们通过一项可行性研究评估了该界面,18名健康参与者执行了三种多步骤日常生活活动:饮水、使用抽屉和操作烤箱。结果表明,这种交互范式实现了有效的顺序任务执行和高用户参与度,获得了“良好”的可用性评级(SUS > 70)。这些发现支持了所提出的交互范式用于复杂BCI驱动的机器人辅助的可行性,并激励了未来针对预期目标人群的评估。项目网站:https://ar-bri-manip.github.io/。

English abstract

The integration of augmented reality (AR) and EEG-based brain-computer interfaces (BCIs) offers a promising path for enabling intuitive control of robots for assistive purposes. However, existing AR brain-robot interface (BRI) systems are often constrained to task-specific structures, limiting their utility in real-world environments. We present an AR BRI designed for generalist robot arm manipulation that combines gaze-based object selection with motor imagery action control. Our system uses eye-tracking for intuitive object targeting and context-aware visual overlays ("Place" and "Use") to guide the user through tasks within a shared autonomy framework. We evaluated the interface through a feasibility study with 18 healthy participants performing three multi-step activities of daily living: drinking, using a drawer, and operating an oven. Our results demonstrate that this interaction paradigm enables effective sequential task execution and high user engagement, achieving a "Good" usability rating (SUS > 70). These findings support the feasibility of the proposed interaction paradigm for complex BCI-driven robotic assistance, and motivate future evaluation with the intended target population. Project website: https://ar-bri-manip.github.io/.

2606.16491 2026-06-16 cs.RO New submission

HATS: A Human-Agent Teleoperation System for Multi-Arm Data Collection

HATS:用于多臂数据收集的人-智能体遥操作系统

Zesen Lin, Jian-Jian Jiang, Haoming Cen, Xiao-Ming Wu, Dandan Zhang, Wei-Shi Zheng

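Affiliations School of Computer Science and Engineering, Sun Yat-sen University; Nanyang Technological University; Imperial College London

AI summary Proposes HATS, in which a single operator teleoperates two primary arms while an MLLM-based agent controls two assistive arms, enabling efficient multi-arm data collection with performance comparable to expert dual-human teams.

AI Chinese abstract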

许多真实世界的操作场景,例如处理复杂的协作任务和应对大工作空间,需要协调两个以上的机械臂。因此,需要一个有效的多臂遥操作系统来收集训练协调多臂操作策略的示范数据。然而,现有的遥操作框架主要关注单操作员或多操作员设置,面临着单操作员认知负荷与多操作员协调成本之间的实际权衡。为了解决这个问题,我们引入了HATS,一个人类-智能体遥操作系统,使单个人类操作员在基于MLLM的智能体辅助下,能够收集多臂操作任务的数据。我们的系统解耦了控制空间:两个主臂由人类直接遥操作,而两个辅助臂由处理子任务的无训练智能体控制。此外,人类操作员可以在执行过程中使用语音命令来防止碰撞并纠正辅助臂的行为。大量评估表明,HATS在数据收集效率和成功率上与专家双人团队相当。此外,下游策略评估证明了通过HATS收集的数据的有效性和质量。

English abstract

Many real-world manipulation scenarios, such as handling complex collaborative tasks and dealing with large workspaces, require coordination of more than two robotic arms. Consequently, an effective multi-arm teleoperation system is required to collect demonstrations for training coordinated multi-arm manipulation policies. However, existing teleoperation frameworks mainly focus on single-operator or multi-operator setups, facing a practical trade-off between the cognitive load placed on a single operator and the coordination cost incurred by multiple operators. To address this problem, we introduce HATS, a human-agent teleoperation system that enables a single human operator, assisted by an MLLM-based agent, to collect data for multi-arm manipulation tasks. Our system decouples the control space: two primary arms are directly teleoperated by the human, while two assistive arms are controlled by a training-free agent that handles sub-tasks. In addition, the human operator can use voice commands to prevent collisions and correct assistive arm behaviors during execution. Extensive evaluations demonstrate that HATS achieves data collection efficiency and success rates comparable to expert dual-human teams. Moreover, downstream policy evaluations demonstrate the efficacy and quality of the data collected through HATS.

2606.16600 2026-06-16 cs.RO New submission

WaveSync: Constrained Wavefront Optimization for Synchronized Co-Speech Gestures in Humanoid Robots

WaveSync: 面向人形机器人同步共语手势的约束波前优化

Thang Tran Viet, Thanh Nguyen Canh, Gia Huy Uong, Phuc Van Dinh, Tan Viet Tuyen Nguyen, Xiem HoangVan, Nak Young Chong

Affiliations University of Engineering and Technology, Vietnam National University; School of Information Science, Japan Advanced Institute of Science and Technology; School of Electronics and Computer Science, University of Southampton; Department of Robotics, Hanyang University

AI summary Proposes WaveSync, a framework in which a large language model decomposes semantics and builds an importance wave, Dynamic Movement Primitives shape gesture trajectories, and wavefront optimization aligns gesture and speech peaks while satisfying kinematic constraints, outperforming baselines across five dialogue scenarios.

AI Chinese abstract

富有表现力的共语手势对于自然的人机交互至关重要,但在物理人形机器人上生成这些手势非常困难,因为手势动作必须与语音重点对齐,同时满足严格的运动学和动力学约束。与虚拟化身不同,人形机器人无法自由执行快速或重叠的运动,这使得单词级别的同步和硬件安全的运动规划成为一个耦合问题。我们提出了\textbf{WaveSync},一个混合框架,其中大语言模型将对话响应分解为结构化的语义模式,并为每个单词分配重要性权重,构建连续的语义重要性波。手势轨迹通过动态运动基元进行塑造,在增强表现力的同时确保运动学可行性。波前优化阶段实现手势与语音的峰值到峰值同步,并通过手势持续时间压缩和前向传播解决剩余的运动学违规。基于五个对话场景的实验评估表明,我们的方法实现了高同步精度,并在客观和主观评估中优于三个基线。WaveSync中的每个组件在生成富有表现力、语义基础且符合运动学要求的手势中都发挥了必要作用。代码、资源和视频可在\href{https://github.com/pairs-lab/WaveSync}{WaveSync}获取。

English abstract

Expressive co-speech gestures are crucial for natural human-robot interaction, but generating them on physical humanoid robots is difficult because gesture strokes must align with speech emphasis while satisfying strict kinematic and dynamic constraints. Unlike virtual avatars, humanoid robots cannot freely execute rapid or overlapping motions, making word-level synchronization and hardware-safe motion planning a coupled problem. We present WaveSync, a hybrid framework in which a Large Language Model decomposes dialogue responses into structured semantic schemas and assigns per-word importance weights, constructing a continuous Semantic Importance Wave. Gesture trajectories are shaped through Dynamic Movement Primitives, enforcing kinematic feasibility while enhancing expressiveness. A Wavefront Optimization stage aligns peak-to-peak gesture-speech synchronization and resolves residual kinematic violations through gesture-duration compression and forward propagation. Experimental evaluation based on five dialogue scenarios shows that our method achieves high synchronization accuracy and outperforms three baselines in both objective and subjective evaluations. Each component in WaveSync plays a necessary role in producing gestures that are expressive, semantically grounded, and kinematically compliant. The code, resources, and videos are available at https://github.com/pairs-lab/WaveSync.

2602.02773 2026-06-16 cs.RO Updated version

Bimanual High-Density EMG Control for In-Home Mobile Manipulation by Users with Quadriplegia

用于四肢瘫痪用户居家移动操作的双侧高密度肌电控制

Jehan Yang, Eleanor Hodgson, Cindy Sun, Zackory Erickson, Doug Weber

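Affiliations University of California, Berkeley; University of Washington

AI summary For users with cervical spinal cord injury, proposes bimanual high-density EMG (HDEMG) forearm sleeves combined with a shared autonomy framework for real-time gesture-based control of a mobile manipulator, validated on daily tasks in a twelve-day in-home study.

Comments 17 pages, 20 figures

AI Chinese abstract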

家中的移动操作器可以使颈椎脊髓损伤(cSCI)患者执行他们原本无法自行完成的日常家务任务。然而,这些用户的瘫痪通常限制了他们对传统机器人控制界面(如操纵杆或键盘)的访问。在这项工作中,我们介绍并部署了首个系统,该系统使四肢瘫痪用户能够利用来自身体瘫痪部位的运动意图,通过双侧高密度肌电(HDEMG)控制移动操作器。我们开发了一对定制的、织物集成的HDEMG前臂袖套,佩戴在双臂上,捕获来自临床瘫痪自由度的残余神经肌肉活动,并支持基于手势的实时机器人控制。我们在(n=2)名cSCI用户中实现了基于运动意图的高分类准确率,最高达到98.0%。其次,通过集成视觉、语言和运动规划模块,我们引入了一个共享自主框架,支持稳健且用户驱动的遥操作,特别有利于家庭环境中导航密集型任务。最后,为了在真实环境中演示该系统,我们进行了一项为期12天的居家用户研究,评估可穿戴EMG界面在每日机器人控制中的实时使用。这些系统组件共同实现了在真实家庭环境中执行日常生活活动(ADL)和其他家务任务的有效机器人控制。

English abstract

Mobile manipulators in the home can enable people with cervical spinal cord injury (cSCI) to perform daily physical household tasks that they could not otherwise do themselves. However, paralysis in these users often limits access to traditional robot control interfaces such as joysticks or keyboards. In this work, we introduce and deploy the first system that enables a user with quadriplegia to control a mobile manipulator using intent from paralyzed parts of their body, using bimanual high-density electromyography (HDEMG). We develop a pair of custom, fabric-integrated HDEMG forearm sleeves, worn on both arms, that capture residual neuromotor activity from clinically paralyzed degrees of freedom and support real-time gesture-based robot control. We achieve high classification accuracies based on motor intent across (n = 2) users with cSCI, achieving up to 98.0%. Second, by integrating vision, language, and motion planning modules, we introduce a shared autonomy framework that supports robust and user-driven teleoperation, with particular benefits for navigation-intensive tasks in home environments. Finally, to demonstrate the system in the wild, we present a twelve-day in-home user study evaluating real-time use of the wearable EMG interface for daily robot control. Together, these system components enable effective robot control for performing activities of daily living (ADLs) and other household tasks in a real home environment.

2510.07063 2026-06-16 cs.HC cs.RO Updated version

Artists' Views on Robotics Involvement in Painting Productions

艺术家对机器人参与绘画创作的看法

Francesca Cocchella, Nilay Roy Choudhury, Eric Chen, Patrícia Alves-Oliveira

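Affiliations CONTACT Unit, Italian Institute of Technology; University of Michigan

AI summary An empirical study of eight abstract artists painting with a robot finds that human-robot collaboration feels more playful and reflective, offers greater autonomy, and prompts novel strategies for overcoming the system's limitations.

Comments 10 pages, 9 figures, submitted to RAM special issue: Arts and Robotics

AI Chinese abstract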

随着机器人技术的发展,其在艺术创作中的潜力成为一个日益相关的研究课题。本研究探讨了专业抽象艺术家如何感知和体验与自主绘画机械臂的协同创作互动。八位艺术家参与了六次绘画会话——三次与人类伙伴,随后三次与机器人——并随后参加了通过反思性主题分析分析的半结构化访谈。人与人之间的互动被描述为直观、对话性和情感投入,而人与机器人的会话则感觉更有趣和反思性,提供了更大的自主性,并促使采用新颖策略来克服系统的局限性。这项工作提供了对艺术家与机器人真实体验的首批实证研究之一,强调了长期参与和多学科方法在人机协同创作中的价值。

English abstract

As robotic technologies evolve, their potential in artistic creation becomes an increasingly relevant topic of inquiry. This study explores how professional abstract artists perceive and experience co-creative interactions with an autonomous painting robotic arm. Eight artists engaged in six painting sessions (three with a human partner, followed by three with the robot) and subsequently participated in semi-structured interviews analyzed through reflexive thematic analysis. Human-human interactions were described as intuitive, dialogic, and emotionally engaging, whereas human-robot sessions felt more playful and reflective, offering greater autonomy and prompting novel strategies to overcome the system's limitations. This work offers one of the first empirical investigations into artists' lived experiences with a robot, highlighting the value of long-term engagement and a multidisciplinary approach to human-robot co-creation.

6. Embodied Intelligence and Vision-Language-Action Models (19 papers)

2606.15021 2026-06-16 cs.RO New submission

Steering Autoregressive Vision-Language-Action Policies via Action Token Intervention

通过动作令牌干预引导自回归视觉-语言-动作策略

Jason Chan, Jonathan C. Kao

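Affiliations University of California, Los Angeles

AI summary Proposes Token Steering, which dynamically steers VLA trajectory generation at inference time by intervening in the action-token space, requires no training or fine-tuning, and raises success rates on two household manipulation tasks (from 10.0% to 72.5% and from 16.7% to 93.8%).

Comments 9 pages, 5 figures

AI Chinese abstract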

我们提出Token Steering (TS),一种通过直接干预动作令牌空间来动态引导自回归视觉-语言-动作(VLA)模型生成轨迹的方法。TS将低维用户输入注入模型的原生动作令牌表示,允许用户在无需修改底层视觉-语言模型(VLM)架构的情况下影响轨迹生成。由于TS完全在推理时运行,因此不需要额外的训练或微调。用户输入引导而非覆盖预训练策略,允许用户影响机器人动作,同时保留VLA学习的灵巧性、平滑性和任务先验。我们在两个家务操作任务——物体放置后关闭抽屉和状态感知物体交换——上评估TS,成功率分别从10.0%提高到72.5%,从16.7%提高到93.8%。通过实现对机器人基础模型的轻量级、直观引导,我们的界面有潜力改善消费环境中的交互,并拓宽对有限身体控制个体的可及性。项目网站:https://jasontchan.github.io/token-steering/ 。

English abstract

We present Token Steering (TS), a method for dynamically steering trajectories generated by an autoregressive vision-language-action (VLA) model through direct intervention in the action-token space. TS injects low-dimensional user inputs into the model's native action-token representation, allowing users to influence trajectory generation without modifying the underlying vision-language model (VLM) architecture. Because TS operates entirely at inference time, it requires no additional training or finetuning. User inputs guide rather than override the pretrained policy, allowing users to influence robot actions while preserving the dexterity, smoothness, and task priors learned by the VLA. We evaluate TS on two household manipulation tasks -- drawer closing after object placement and state-aware object swapping -- and improve success rates from 10.0% to 72.5% and from 16.7% to 93.8%, respectively. By enabling lightweight, intuitive steering over robot foundation models, our interface has the potential to improve human-robot interaction in consumer environments and broaden accessibility for individuals with limited physical control. Project website: https://jasontchan.github.io/token-steering/ .

2606.15285 2026-06-16 cs.RO New submission

Acting While Understanding: Asynchronous Semantic-Action Decoupling for Real-Time Vision-Language-Action Models

理解中行动:面向实时视觉-语言-动作模型的异步语义-动作解耦

Shenhao Yan, Ge Wang, Qi Liu, Weilin Meng, Jiahao Yang, Chengsi Yao, Fan Feng, Xiaoguang Ma, Yiming Zhao, Yatong Han

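Affiliations Northeastern University (东北大学); Ising AI; CUHK-Shenzhen

AI summary Proposes an asynchronous semantic-action decoupling framework that separates semantic understanding from action generation in VLAs, pairing low-frequency semantic updates with high-frequency action inference for high-frequency closed-loop control; experiments on LIBERO and a real robot show action-module throughput of up to 35.6 Hz.

AI Chinese abstract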

视觉-语言-动作模型(VLAs)在机器人操作中展现出强大的任务理解和泛化能力,但全模型推理的高计算成本限制了其在低延迟、高频率闭环控制中的部署。我们提出一种异步语义-动作解耦框架,该框架沿现有VLAs的内部语义-动作接口分离语义理解与动作生成,无需重新设计视觉-语言骨干网络或引入外部规划器。低频理解模块异步更新可复用的语义条件,而高频动作模块持续输出控制动作,无需重复调用完整模型。为缓解陈旧语义与当前执行状态之间的时间不匹配,我们进一步引入历史动作条件化和时间错位训练,提供短时域执行上下文,并在陈旧语义条件下提高反馈控制鲁棒性。在LIBERO上使用$π_{0.5}$和UniVLA进行的实验,以及使用UniVLA的真实机器人部署表明,所提框架实现了高达35.6 Hz的服务端动作模块推理吞吐率,并提供了一条低侵入性路径,无需以控制速率运行完整VLA推理即可实现高频闭环控制。

English abstract

Vision-Language-Action models (VLAs) have demonstrated strong task understanding and generalization in robotic manipulation, yet the high computational cost of full-model inference limits their deployment in low-latency, high-frequency closed-loop control. We propose an asynchronous semantic-action decoupling framework that separates semantic understanding from action generation along the internal semantic-action interface of existing VLAs, without redesigning the vision-language backbone or introducing an external planner. A low-frequency understanding module asynchronously updates reusable semantic conditions, while a high-frequency action module continuously outputs control actions without repeatedly invoking the full model. To mitigate the temporal mismatch between stale semantics and the current execution state, we further introduce historical action conditioning and time-misalignment training, which provide short-horizon execution context and improve feedback control robustness under stale semantic conditions. Experiments on LIBERO with π0.5 and UniVLA, together with real-robot deployment using UniVLA, show that the proposed framework achieves up to 35.6 Hz server-side action-module inference throughput and offers a low-intrusion path to high-frequency closed-loop control without running full VLA inference at control rate.

2606.15631 2026-06-16 cs.RO cs.AI New submission

Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

检索,不重新训练:在测试时将视觉语言动作模型扩展到新任务

Jeongeun Park, Juhan Park, Taekyung Kim, Sungjoon Choi, Dongyoon Han, Sangdoo Yun

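Affiliations NAVER AI Lab; Korea University

AI summary Proposes a retrieval-augmented policy that is trained once and then frozen; new tasks are added at deployment by appending retrieval data, with no per-task fine-tuning, and the method outperforms cross-embodiment baselines on unseen tasks.

Comments https://recap-robot.github.io/

AI Chinese abstract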

将视觉-语言-动作(VLA)策略扩展到新任务通常需要特定任务的遥操作演示和逐任务微调,这使得适应在数据收集和计算方面成本高昂。在本文中,我们表明这种目标侧逐任务适应成本可以被检索所取代。我们的检索增强策略在目标本体(查询)和更廉价的本体(池,例如人手视频)的配对演示上训练一次,然后冻结。新任务在部署时通过将池侧演示附加到检索池来添加。冻结策略在每个控制步骤中根据检索到的轨迹进行条件化,因此新任务通过索引数据而非更新参数来吸收。微调仅在面对新的、未见过的本体时需要,而不是每个新任务。我们表明,检索改进了超越特定骨干网络的策略,包括标准VLA策略,但其效果在基于视频生成的世界动作模型(WAM)Cosmos Policy中尤为显著。在这种设置中,检索提供了粗略的任务进展,而WAM的未来图像目标提供了额外的视觉一致性信号,增强了检索条件化的动作。在PushT上,我们研究了检索如何为跨本体泛化到未见目标角度提供可重用的高级运动先验,而在RoboTwin 2.0上,我们的方法在未见任务上优于跨本体基线,并且我们还在真实机器人上演示了该方法。

English abstract

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.

2606.15768 2026-06-16 cs.RO cs.AI 新提交

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

LaWAM: 用于高效动力学感知机器人策略的潜在世界行动模型

Jialei Chen, Kai Wang, Kang Chen, Shuaihang Chen, Feng Gao, Wenhao Tang, Zhiyuan Li, Weilin Liu, Zhuyu Yao, Boxun Li, Yuanbo Xu, Chao Yu

发表机构 * Tsinghua University(清华大学) Jilin University(吉林大学) Nankai University(南开大学) Peking University(北京大学) Harbin Institute of Technology(哈尔滨工业大学) Zhongguancun Academy(中关村学院) Striding.AI Infinigence AI

AI总结 提出LaWAM模型,通过潜在视觉子目标预测场景变化,实现动力学感知的机器人控制,在多个基准上达到最优或竞争性成功率,且推理延迟低。

AI中文摘要

视觉-语言-行动模型(VLA)利用大规模视觉-语言预训练进行语义机器人控制,但通常缺乏对机器人行动如何改变场景的明确预见。世界行动模型(WAM)通过基于预测的未来条件化策略来解决这一限制,但现有方法通常依赖计算昂贵的视频生成,且存在大量像素级冗余。我们提出LaWAM,一种潜在世界行动模型,通过紧凑的潜在视觉子目标(而非重建的未来视频)向机器人策略暴露预测动力学。LaWAM的核心是一个潜在行动条件化的潜在世界模型(LaWM)。我们通过在预训练视觉基础模型的潜在空间中训练潜在行动模型,并重新利用其前向解码器来预测未来观察特征以描述场景演变,从而获得LaWM。然后,LaWAM基于这些预测的潜在视觉子目标条件化行动生成,以实现动力学感知的机器人控制。LaWAM在LIBERO(98.6%成功率)、RoboTwin(91.22%成功率)和真实世界操作任务中取得了最优或具有竞争力的成功率,同时保持低延迟推理。LaWAM每次行动块预测运行时间为187毫秒,相比像素空间WAM,实现了高达24倍的墙钟延迟降低。

英文摘要

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.

2606.15898 2026-06-16 cs.RO 新提交

VL2Spike: Spike-driven Distillation from VLMs for Low-Power Visual Perception in Embodied AI

VL2Spike:面向具身AI低功耗视觉感知的VLM脉冲驱动蒸馏

Zinan Liu, Eric Zheng, Soumyaratna Debnath, Hao Shi, Ling Xiao, Lin Wang

发表机构 * School of EEE, Nanyang Technological University (NTU)(南洋理工大学电气与电子工程学院) Department of Computer Science, University of Toronto(多伦多大学计算机科学系) Advanced Micro Devices, Inc.(超威半导体公司) State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University(浙江大学极端光子学与仪器国家重点实验室) Faculty of Information Science and Technology, Hokkaido University(北海道大学信息科学与技术学院)

AI总结 提出VL2Spike框架,通过时空视觉脉冲蒸馏和脉冲原型引导语言蒸馏,将VLM多模态知识迁移至Spikformer,在静态数据集上提升6.81%性能且能耗仅15.7%,并显著增强机器人视觉地点识别能力。

Comments 9 pages, 4 figures, 8 tables

AI中文摘要

脉冲神经网络(SNN)是受大脑启发的、事件驱动的模型,通过稀疏脉冲进行计算,从而在资源受限的具身AI模型中实现高效的视觉感知。具有脉冲自注意力的Spiking-Transformer模型的出现显著提升了纯SNN的学习能力。尽管SNN具有能效优势,但其性能仍受限于基于脉冲的架构和优化挑战,因为标准梯度下降规则无法直接应用。最近,视觉语言模型(VLM)展示了丰富的多模态知识表示能力,可用于视觉感知。因此,利用VLM来更好地训练Spikformer是很有前景的。为此,我们提出了VL2Spike,一种新颖的基于脉冲的知识蒸馏(KD)框架,将VLM的多模态知识与紧凑的Spikformer模型桥接起来。该设计增强了Spikformer模型的学习能力,同时保留了其能效优势,从而为低功耗机器人感知提供了一条实用路径。我们的VL2Spike带来了两项关键技术贡献。为了与脉冲动态对齐,我们首先提出了时空视觉脉冲(SVS)蒸馏,实现了(1)VLM图像特征与脉冲令牌之间的共享流形对齐,以及(2)膜电位和脉冲率上的暖启动时间一致性。然后,我们设计了一种新颖的脉冲原型引导语言(SPL)蒸馏策略,将Spikformer的类别原型和logits与可提示的VLM文本嵌入对齐。大量实验表明,VL2Spike在三个静态数据集上仅消耗15.7%的能量就实现了6.81%的性能提升。它在机器人视觉地点识别(VPR)上也表现出强大的泛化能力,性能提升6.63%,突显了其在具身AI中低功耗感知的潜力。

英文摘要

Spiking neural networks (SNNs) are brain-inspired, event-driven models that compute with sparse spikes, which enables highly efficient visual perception in resource-constrained embodied AI models. The emergence of Spiking-Transformer models with spike self-attention has substantially improved the learning capacity of pure SNNs. Although SNNs are energy efficient, their performance is still limited by the spike-based architecture and optimization challenges, as standard gradient descent rules cannot be directly applied. Recently, vision-language models (VLMs) have shown rich multi-modal knowledge representation capabilities for visual perception. Thus, it is promising to leverage VLMs for better Spikformer training. To this end, we present VL2Spike, a novel spike-based knowledge distillation (KD) framework that bridges multi-modal knowledge from VLMs with compact Spikformer models. This design enhances the learning capacity of Spikformer models while preserving their energy-efficiency merits, thereby offering a practical pathway toward low-power robotic perception. Our VL2Spike brings two key technical contributions. To align with spiking dynamics, we first propose spatial-temporal visual spike (SVS) distillation, which achieves (1) shared manifold alignment between VLM image features and spike tokens, and (2) warm-started temporal consistency on membrane potentials and spike rates. We then design a novel spike prototype-guided linguistic (SPL) distillation strategy that aligns Spikformer's class prototypes and logits with promptable VLM text embeddings. Extensive experiments show that VL2Spike achieves 6.81% gain across three static datasets with only 15.7% energy consumption. It also exhibits strong generalization capacity on robotic visual place recognition (VPR) with a gain of 6.63%, highlighting its potential for low-power perception in embodied AI.

2606.17046 2026-06-16 cs.RO cs.CV cs.LG 新提交

Geometric Action Model for Robot Policy Learning

几何动作模型用于机器人策略学习

Jisang Han, Seonghu Jeon, Jaewoo Jung, René Zurbrügg, Honggyu An, Tifanny Portela, Marco Hutter, Marc Pollefeys, Seungryong Kim, Sunghwan Hong

发表机构 * KAIST AI(韩国科学技术院人工智能学院) ETH Zurich(苏黎世联邦理工学院) ETH AI Center(苏黎世联邦理工学院人工智能中心)

AI总结 提出几何动作模型(GAM),通过重用预训练几何基础模型(GFM)作为共享骨干,实现语言条件下的操作策略,在仿真和真实机器人任务中优于现有方法。

Comments Project page: https://cvlab-kaist.github.io/Geometric-Action-Model/

AI中文摘要

通用机器人策略必须遵循用户指令,同时推理物体、相机和机器人动作如何在3D物理世界中交互。最近的视觉-语言-动作模型(VLAs)和视频世界-动作模型(WAMs)从大规模基础模型中继承了强大的语义或时间先验,但它们仍然主要在2D图像帧或2D派生的潜在空间上操作,隐含了接触丰富操作所需的3D几何信息。我们提出了几何动作模型(GAM),一种语言条件操作策略,直接重用预训练的几何基础模型(GFM)作为感知、时间预测和动作解码的共享基础。GAM在中间层分割GFM:浅层作为观察编码器,在分割层插入一个因果未来预测器,根据语言、本体感受和动作历史预测未来的潜在令牌。然后,预测的未来令牌通过剩余的GFM块进行特征传播和解码,使得单个骨干能够同时产生未来几何和动作。这种设计通过最小的架构修改赋予GFM语言条件的时间世界建模能力,同时保留其丰富的几何先验。在广泛的仿真和真实机器人操作基准测试中,GAM比当前基础模型规模的基线更准确、更鲁棒、更快、更轻量。

英文摘要

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.

2606.15099 2026-06-16 cs.CV cs.LG cs.RO 交叉投稿

Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models

少思考,早行动:视觉-语言-动作模型中带早退的强化潜在推理

Dianqiao Lei, Lianlei Shan

AI总结 提出AVA-VLA框架,通过强化学习去噪和早退策略优化潜在推理轨迹,在LIBERO上实现6倍推理加速和98.3%平均成功率。

Comments Accepted at ICML 2026

AI中文摘要

现有的视觉-语言-动作(VLA)模型主要依赖显式的思维链(CoT)推理来桥接感知和动作。虽然有效,但这种范式在多步骤任务中面临高计算成本和错误传播的问题。在本文中,我们提出了自适应变量对齐VLA(AVA-VLA),一种新颖的潜在推理VLA框架,将推理建模为一系列不可观测的潜在变量,绕过了显式文本生成的需求。然而,潜在轨迹本质上容易受到噪声干扰和与下游目标不对齐的影响。为了解决这个问题,我们引入了一种基于强化学习的去噪机制,将潜在状态生成视为一个顺序决策过程,通过任务级奖励优化推理轨迹。此外,我们结合了一种早退策略,根据状态置信度自适应地终止推理,实现了深度和效率之间的动态权衡。在具身决策基准上的大量实验表明,AVA-VLA在LIBERO上实现了比显式CoT方法6倍的推理加速,同时达到了98.3%的平均成功率,在效率和长期稳定性上均优于全推理基线。

英文摘要

Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards. Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA achieves a 6x inference speedup over explicit CoT methods while attaining a 98.3% average success rate on LIBERO, improving both efficiency and long-horizon stability over full-reasoning baselines.
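
Confidence-based early exit can be sketched as a loop that stops refining a latent state once a confidence score crosses a threshold. This is a toy illustration under stated assumptions, not the AVA-VLA implementation: `reason_with_early_exit`, `toy_step`, the threshold of 0.9, and the confidence formula are all hypothetical.

```python
def reason_with_early_exit(step_fn, state, max_steps=8, threshold=0.9):
    """Run latent reasoning steps until confidence reaches the threshold.

    step_fn maps a state to (next_state, confidence in [0, 1]).
    Returns the final state and the number of steps actually used.
    """
    steps_used = 0
    for _ in range(max_steps):
        state, confidence = step_fn(state)
        steps_used += 1
        if confidence >= threshold:
            break  # confident enough: stop reasoning and act
    return state, steps_used


def toy_step(state):
    """Hypothetical reasoning step: confidence grows as 1 - 0.5 ** state."""
    next_state = state + 1
    return next_state, 1.0 - 0.5 ** next_state


print(reason_with_early_exit(toy_step, 0))               # exits before max_steps
print(reason_with_early_exit(toy_step, 0, max_steps=2))  # budget runs out first
```

The threshold is the knob that trades reasoning depth for latency: a lower value exits sooner, a higher value uses more of the step budget.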

2606.15714 2026-06-16 cs.CL cs.RO 交叉投稿

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

超越英语:揭示视觉-语言-动作模型中的多语言差距

Hanyang Chen, Hongliang Li, Jiarui Cao, Yang Li, Yang Jiang, Haonan Wen, Kaiyu Huang, Shengnan Guo, Huaiyu Wan

发表机构 * Beijing Jiaotong University(北京交通大学)

AI总结 本研究首次系统探究VLA模型的多语言指令跟随能力,发现英语训练模型在其他语言上性能显著下降,并提出多语言主成分对齐方法缩小差距。

AI中文摘要

视觉-语言-动作模型最近展示了从大规模多模态数据学习通用机器人策略的能力。然而,大多数现有的VLA系统主要使用英语指令进行训练和评估,使得它们理解和执行其他语言指令的能力在很大程度上未被探索。虽然底层的大语言模型通常具备多语言能力,但这些多语言能力在训练过程中是否能迁移到VLA尚不清楚。在这项工作中,我们首次对VLA模型中的多语言指令跟随进行了系统研究。我们首先通过扩展现有基准测试并翻译其指令来构建多语言指令。利用这些指令,我们在模拟环境中评估了几个代表性的VLA模型在一系列任务上的表现。我们的实验揭示了一个显著的多语言差距:主要用英语指令训练的模型在评估其他语言时表现出显著的性能下降,即使底层语言骨干是多语言的。我们提供了若干发现和分析来理解多语言差距。跨语言迁移行为分析表明,性能下降与指令理解和动作执行都相关。表示分析表明,多语言指令引起的表示偏移可能导致了多语言差距。受这些发现的启发,我们进一步探索了提高VLA多语言性能的策略。我们提出了一种简单而有效的多语言微调方法——多语言主成分对齐,该方法利用主成分分析获取主成分子空间并对齐投影后的多语言表示,有效缩小了多语言性能差距。

英文摘要

Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these multilingual capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that multilingual instruction-caused representation shifts may contribute to the multilingual gap. Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which leverages Principal Component Analysis to get the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap.
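
Aligning representations inside a principal component subspace can be sketched in a few lines. The version below is a deliberately reduced illustration, not the paper's method: it assumes a single principal direction (found by power iteration), paired English and other-language representations, and a mean-squared gap as the alignment objective. All function names and the toy data are hypothetical.

```python
import math


def column_means(rows):
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]


def top_principal_direction(rows, iterations=200):
    """Leading principal direction via power iteration on X^T X (X centered)."""
    mean = column_means(rows)
    centered = [[value - m for value, m in zip(row, mean)] for row in rows]
    dim = len(mean)
    direction = [1.0 / (j + 1) for j in range(dim)]  # asymmetric start vector
    for _ in range(iterations):
        scores = [sum(c * d for c, d in zip(row, direction)) for row in centered]
        updated = [
            sum(row[j] * s for row, s in zip(centered, scores)) for j in range(dim)
        ]
        norm = math.sqrt(sum(u * u for u in updated))
        if norm == 0.0:
            break
        direction = [u / norm for u in updated]
    return direction, mean


def project(vector, direction, mean):
    """Scalar coordinate of a vector along the principal direction."""
    return sum((v - m) * d for v, m, d in zip(vector, mean, direction))


def alignment_loss(english_reps, other_reps, direction, mean):
    """Mean squared gap between paired projections onto the principal direction."""
    gaps = [
        (project(e, direction, mean) - project(o, direction, mean)) ** 2
        for e, o in zip(english_reps, other_reps)
    ]
    return sum(gaps) / len(gaps)


english = [[-2.0, 0.1], [-1.0, -0.1], [1.0, 0.1], [2.0, -0.1]]
shifted = [[x + 0.5, y] for x, y in english]  # toy "other language" representations

direction, mean = top_principal_direction(english)
print(alignment_loss(english, english, direction, mean))  # no shift, no gap
print(alignment_loss(english, shifted, direction, mean))  # shift shows up as a gap
```

A fine-tuning procedure could minimize such a gap so that instructions in different languages land near each other in the projected space; how the paper weights or schedules this term is not stated in the abstract.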

2508.08706 2026-06-16 cs.RO 版本更新

OmniVTLA: Vision-Tactile-Language-Action Models with Semantic-Aligned Tactile Sensing

OmniVTLA:具有语义对齐触觉感知的视觉-触觉-语言-动作模型

Zhengxue Cheng, Yiqian Zhang, Anni Tang, Keyu Wang, Wenkang Zhang, Haoyu Li, Hengdi Zhang, Li Song

发表机构 * Shanghai Jiao Tong University(上海交通大学) Paxini Tech(帕辛尼科技)

AI总结 提出OmniVTLA架构,通过双路径触觉编码器框架和语义对齐触觉ViT,结合新数据集ObjTac,显著提升机器人操作中接触密集任务的成功率和轨迹平滑度。

Comments Accepted by IEEE Robotics and Automation Letters (RA-L). ObjTac dataset: https://readerek.github.io/Objtac.github.io

AI中文摘要

最近的视觉-语言-动作(VLA)模型建立在视觉-语言基础之上,在机器人操作中取得了有希望的结果并展现出任务泛化的可能性。然而,由于触觉传感器的异质性和获取触觉数据的困难,当前的VLA模型显著忽视了触觉感知的重要性,并在接触密集任务中失败。为了解决这个问题,本文提出了OmniVTLA,一种涉及触觉感知的新型架构。具体来说,我们的贡献有三方面。首先,我们的OmniVTLA具有双路径触觉编码器框架。该框架通过使用预训练的视觉变换器(ViT)和语义对齐的触觉ViT(SA-ViT),增强了跨多种基于视觉和基于力的触觉传感器的触觉感知。其次,我们引入了ObjTac,一个全面的基于力的触觉数据集,捕捉了10个类别56个物体的文本、视觉和触觉信息。拥有135K三模态样本,ObjTac补充了现有的视觉-触觉数据集。第三,利用该数据集,我们训练了一个语义对齐的触觉编码器,以学习统一的触觉表示,作为OmniVTLA更好的初始化。真实世界实验表明,与最先进的VLA基线相比,取得了显著改进,在拾取和放置任务中,使用夹爪的成功率达到96.9%(比基线高21.9%),使用灵巧手的成功率达到100%(比基线高6.2%)。此外,与现有VLA相比,OmniVTLA通过触觉感知显著减少了任务完成时间并生成了更平滑的轨迹。我们的ObjTac数据集可在以下网址找到:https://readerek.github.io/Objtac.github.io

英文摘要

Recent vision-language-action (VLA) models build upon vision-language foundations and have achieved promising results, showing potential for task generalization in robot manipulation. However, due to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models significantly overlook the importance of tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture involving tactile sensing. Specifically, our contributions are threefold. First, our OmniVTLA features a dual-path tactile encoder framework. This framework enhances tactile perception across diverse vision-based and force-based tactile sensors by using a pretrained vision transformer (ViT) and a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories. With 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically-aligned tactile encoder to learn a unified tactile representation, serving as a better initialization for OmniVTLA. Real-world experiments demonstrate substantial improvements over state-of-the-art VLA baselines, achieving 96.9% success rates with grippers (21.9% higher than the baseline) and 100% success rates with dexterous hands (6.2% higher than the baseline) in pick-and-place tasks. In addition, OmniVTLA significantly reduces task completion time and generates smoother trajectories through tactile sensing compared to existing VLAs. Our ObjTac dataset can be found at https://readerek.github.io/Objtac.github.io

2601.04061 2026-06-16 cs.RO cs.CV 版本更新

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

CLAP: 从人类视频中学习视觉-语言-动作模型的对比潜在动作预训练

Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang

发表机构 * Tsinghua University(清华大学) Astribot University of Hong Kong(香港大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出CLAP框架,通过对比学习将人类视频与机器人动作词汇对齐,利用伪标签训练VLA模型,实现从人类视频到机器人执行的有效技能迁移。

Comments The code is available at: https://github.com/LinShan-Bin/OpenCLAP

AI中文摘要

通用视觉-语言-动作模型仍然受限于机器人数据的稀缺性,而人类视频演示则相对丰富。现有的潜在动作模型试图利用视频数据,但常常遭受视觉纠缠,编码噪声而非操作技能。为了解决这一限制,我们提出了对比潜在动作预训练(CLAP),该框架首先使用Act-VAE从机器人轨迹中学习可执行的动作标记词汇,然后通过对比学习将人类视觉转换与该词汇对齐。这种对齐将未标记的人类视频映射到物理上可行的潜在动作空间,而不是重建外观。基于对齐的标记,我们使用机器人演示和伪标记的人类视频训练CLAP-NTP作为自回归VLA,保持指令遵循和物体泛化能力。为了部署和目标域适应,我们进一步引入了一种后训练策略,该策略将CLAP-RF(一种用于低延迟连续动作块预测的整流流动作头)与知识匹配正则化相结合,以在微调期间保留预训练的语义知识。大量实验表明,CLAP在竞争基线上取得了强劲的性能,同时实现了从人类视频到机器人执行的有效技能迁移。

英文摘要

Generalist Vision-Language-Action models remain constrained by the scarcity of robotic data relative to the abundance of human video demonstrations. Existing Latent Action Models attempt to use video data but often suffer from visual entanglement, encoding noise rather than manipulation skills. To address this limitation, we propose Contrastive Latent Action Pretraining (CLAP), a framework that first uses Act-VAE to learn an executable action-token vocabulary from robot trajectories and then aligns human visual transitions with this vocabulary through contrastive learning. This alignment maps unlabeled human videos into a physically grounded latent action space rather than reconstructing appearance. Building on the aligned tokens, we train CLAP-NTP as an autoregressive VLA using robot demonstrations and pseudo-labeled human videos, preserving instruction following and object generalization. For deployment and target-domain adaptation, we further introduce a post-training strategy that combines CLAP-RF, a Rectified Flow action head for low-latency continuous action chunk prediction, with Knowledge Matching regularization to preserve pretrained semantic knowledge during fine-tuning. Extensive experiments show that CLAP achieves strong performance against competitive baselines while enabling effective skill transfer from human videos to robotic execution.
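
The contrastive alignment step can be illustrated with a standard InfoNCE-style loss, where each human visual-transition feature should score highest against its matching action-token embedding. The abstract does not specify the exact loss, so the InfoNCE form, the temperature of 0.1, the function name `info_nce`, and the toy unit-vector features are assumptions made for this sketch.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(human_feats, action_feats, temperature=0.1):
    """Average InfoNCE loss; row i of each list is treated as a positive pair.

    All other rows of action_feats act as negatives for human_feats[i].
    """
    total = 0.0
    for i, human in enumerate(human_feats):
        logits = [dot(human, action) / temperature for action in action_feats]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(human_feats)


human = [[1.0, 0.0], [0.0, 1.0]]          # toy human-transition features
aligned = [[1.0, 0.0], [0.0, 1.0]]        # action embeddings that match
mismatched = [[0.0, 1.0], [1.0, 0.0]]     # action embeddings that are swapped

print(info_nce(human, aligned))     # small loss: positives score highest
print(info_nce(human, mismatched))  # large loss: negatives score highest
```

Minimizing such a loss pulls each human transition toward one entry of the action vocabulary, which is one way unlabeled video could receive pseudo-labels.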

2601.05248 2026-06-16 cs.RO 版本更新

LaST$_{0}$: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model

LaST$_{0}$:面向机器人视觉-语言-动作模型的潜在时空思维链

Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, Zhengping Che, Jian Tang, Pheng-Ann Heng, Shanghang Zhang

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出LaST$_0$框架,通过潜在时空思维链在隐空间进行高效推理,避免显式推理的延迟和语言瓶颈,在10个真实世界任务中平均成功率提升13%-14%。

Comments Project page: https://vla-last0.github.io/

AI中文摘要

视觉-语言-动作(VLA)模型近期展现出强大的泛化能力,一些方法试图在执行前显式生成语言推理轨迹或预测未来观测。然而,显式推理通常会产生不可忽视的推理延迟,限制了机器人操作所需的时间分辨率。此外,这种推理局限于语言空间,造成表征瓶颈,难以忠实捕捉难以言表的物理属性。为缓解这些限制,我们提出LaST$_0$,一种通过潜在时空思维链(CoT)在执行前实现高效推理的框架,捕捉通常难以用语言描述的细粒度物理和机器人动态。具体而言,我们引入一个token高效的潜在CoT空间,建模未来视觉动态、3D结构信息和机器人本体感知状态,并进一步将这些表征跨时间扩展,以实现时间一致的隐式推理轨迹。此外,LaST$_0$采用通过混合Transformer设计实现的双系统架构,其中推理专家执行低频潜在推理,动作专家基于面向机器人的潜在表征生成高频动作。为促进协调,LaST$_0$以异构操作频率进行训练,在部署时实现自适应切换。在涵盖桌面、移动和灵巧手操作的10个真实世界任务中,LaST$_0$的平均成功率分别比先前的SOTA VLA方法提高了13%、14%和14%。

英文摘要

Vision-Language-Action (VLA) models have recently shown strong generalization, with some approaches seeking to explicitly generate linguistic reasoning traces or predict future observations prior to execution. However, explicit reasoning typically incurs non-negligible inference latency, which constrains the temporal resolution required for robotic manipulation. Moreover, such reasoning is confined to the linguistic space, imposing a representational bottleneck that struggles to faithfully capture ineffable physical attributes. To mitigate these limitations, we propose LaST$_0$, a framework that enables efficient reasoning before acting through a Latent Spatio-Temporal Chain-of-Thought (CoT), capturing fine-grained physical and robotic dynamics that are often difficult to verbalize. Specifically, we introduce a token-efficient latent CoT space that models future visual dynamics, 3D structural information, and robot proprioceptive states, and further extends these representations across time to enable temporally consistent implicit reasoning trajectories. Furthermore, LaST$_0$ adopts a dual-system architecture implemented via a Mixture-of-Transformers design, where a reasoning expert conducts low-frequency latent inference and an acting expert generates high-frequency actions conditioned on robotics-oriented latent representations. To facilitate coordination, LaST$_0$ is trained with heterogeneous operation frequencies, enabling adaptive switching during deployment. Across 10 real-world tasks spanning tabletop, mobile, and dexterous hand manipulation, LaST$_0$ improves mean success rates by 13%, 14% and 14% over prior SOTA VLA methods, respectively.

2601.16207 2026-06-16 cs.RO 版本更新

IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

IVRA: 通过无需训练的提示引导改善机器人动作策略的视觉-令牌关系

Jongwoo Park, Kanchana Ranasinghe, Jinhyeok Jang, Cristina Mata, Yoo Sung Jang, Michael S Ryoo

发表机构 * Stony Brook University(石溪大学) ETRI(电子技术研究院)

AI总结 提出IVRA方法,利用模型内置视觉编码器的亲和力提示,无需额外编码器或重训练,在推理时调整视觉-令牌交互以增强空间理解,在多种VLA架构和任务上提升操作成功率。

AI中文摘要

许多视觉-语言-动作(VLA)模型将图像块展平为一维令牌序列,削弱了精确操作所需的二维空间线索。我们提出IVRA,一种轻量级、无需训练的方法,通过利用模型内置视觉编码器中已有的亲和力提示来改善空间理解,无需任何外部编码器或重训练。IVRA选择性地将这些亲和力信号注入到实例级特征所在的语言模型层中。这种推理时的干预重新调整了视觉-令牌交互,更好地保留了几何结构,同时保持所有模型参数固定。我们通过将IVRA应用于多种VLA架构(LLaRA、OpenVLA和FLOWER),在涵盖2D和3D操作的模拟基准(VIMA和LIBERO)以及各种真实机器人任务上展示了其通用性。在2D VIMA上,IVRA在低数据情况下比基线LLaRA平均成功率提高了+4.2%。在3D LIBERO上,它在OpenVLA和FLOWER基线上取得了一致的提升,包括在基线准确率接近饱和时的改进(96.3% -> 97.1%)。代码和可视化可在以下网址获取:jongwoopark7978.github.io/IVRA

英文摘要

Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% -> 97.1%). Code and visualizations are available at: jongwoopark7978.github.io/IVRA

2601.18692 2026-06-16 cs.RO cs.CV 版本更新

A Pragmatic VLA Foundation Model

一个务实的VLA基础模型

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, Kecheng Zheng

发表机构 * robbyant.com

AI总结 提出LingBot-VLA,基于约2万小时真实数据和9种双臂机器人配置,在3个平台上完成100个任务,性能优于竞品,并实现高效训练吞吐。

Comments Project Webpage: https://technology.robbyant.com/lingbot-vla/, Code: https://github.com/Robbyant/lingbot-vla/, GM-100: https://huggingface.co/datasets/robbyant/lingbot-GM-100

AI中文摘要

在机器人操作领域,一个有能力的视觉-语言-动作(VLA)基础模型有望在任务和平台上忠实泛化,同时确保成本效率(例如,适应所需的数据和GPU小时数)。为此,我们开发了LingBot-VLA,使用了来自9种流行的双臂机器人配置的约2万小时真实数据。通过对3个机器人平台的系统评估,每个平台完成100个任务,每个任务有130个训练后回合,我们的模型在性能上明显优于竞争对手,展示了其强大的性能和广泛的泛化能力。我们还构建了一个高效的代码库,在8-GPU训练设置下实现了每秒261个样本的吞吐量,相比现有的VLA导向代码库,加速了1.5~2.8倍(取决于所依赖的VLM基础模型)。上述特性确保我们的模型非常适合实际部署。为了推动机器人学习领域的发展,我们开放了代码、基础模型和基准数据,重点关注更具挑战性的任务和促进合理的评估标准。

英文摘要

Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, representing a 1.5~2.8$\times$ speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.

2604.21391 2026-06-16 cs.RO cs.AI 版本更新

From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

从噪声到意图:基于残差桥的生成式VLA策略锚定

Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang, Qingqiu Huang, Xinge Zhu, Yuexin Ma

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ResVLA架构,通过频谱分析将机器人控制解耦为确定性低频锚点和随机高频残差,利用残差扩散桥聚焦局部动态精化,实现高效表示与强条件对齐。

Comments Accepted to ICML 2026

AI中文摘要

在具身智能中,连接高层语义理解与低层物理控制仍是一个持续挑战,源于认知与行动之间基本的时空尺度不匹配。现有的生成式VLA策略通常采用“从噪声生成”范式,忽略了这种差异,导致表示效率低下和优化过程中条件对齐薄弱。在这项工作中,我们提出ResVLA,一种将范式转变为“从意图精化”的架构。认识到机器人运动自然分解为全局意图和局部动态,ResVLA利用频谱分析将控制解耦为确定性低频锚点和随机高频残差。通过将生成过程锚定在预测的意图上,我们的模型通过残差扩散桥严格专注于精化局部动态。大量仿真实验表明,ResVLA实现了具有竞争力的性能,对语言和机器人本体扰动的强鲁棒性,以及比标准生成基线更快的收敛速度。ResVLA在真实世界机器人实验中也表现出强劲性能。

英文摘要

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. ResVLA also demonstrates strong performance in real-world robot experiments.
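
Splitting a trajectory into a low-frequency anchor and a high-frequency residual can be sketched with a plain discrete Fourier transform. This only illustrates the decomposition of a known 1D signal; it is not ResVLA itself, where the anchor is predicted and the residual is generated by a diffusion bridge. The choice of DFT, the cutoff `keep=2`, and the function names are assumptions for this sketch.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part (input is conjugate-symmetric)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_anchor_residual(trajectory, keep=2):
    """Keep frequency bins 0..keep (and their mirrors) as the anchor."""
    n = len(trajectory)
    spectrum = dft(trajectory)
    low_pass = [c if min(k, n - k) <= keep else 0 for k, c in enumerate(spectrum)]
    anchor = idft(low_pass)
    residual = [x - a for x, a in zip(trajectory, anchor)]
    return anchor, residual


n = 32
slow = [math.sin(2 * math.pi * t / n) for t in range(n)]            # global intent
fast = [0.1 * math.sin(2 * math.pi * 8 * t / n) for t in range(n)]  # local detail
trajectory = [s + f for s, f in zip(slow, fast)]

anchor, residual = split_anchor_residual(trajectory, keep=2)
print(max(abs(a - s) for a, s in zip(anchor, slow)))    # anchor tracks the slow part
print(max(abs(r - f) for r, f in zip(residual, fast)))  # residual tracks the fast part
```

By construction the anchor and residual sum back to the original trajectory, so a generative model only needs to cover the residual once the anchor is fixed.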

2605.22183 2026-06-16 cs.RO cs.AI 版本更新

Action with Visual Primitives

基于视觉基元的动作生成

Weilong Guo, Yuchen Wang, Renping Zhou, Yunfeng Zhang, Rui Fang, Yuyang Pang, Wenda Xu, Gao Huang

发表机构 * Anyverse Dynamics Tsinghua University(清华大学)

AI总结 提出AVP架构,通过视觉语言模型推断下一阶段目标并生成视觉基元令牌,条件化流匹配动作专家,在通用拾放任务中成功率比pi_0.5提升27.61%。

Comments 9 pages, 6 figures. Project page: https://kingdroper.github.io/AVP/

AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人操作的一种有前景的范式。当前架构的常见设计是将语言指令和视觉观察映射到单次前向传播中的动作。虽然概念上简单,但这种表述将指令理解、空间场景理解和运动控制纠缠在单一学习目标中。因此,动作专家必须隐式地重新学习预训练VLM中已经存在的认知和感知能力,这可能限制学习效率和泛化能力。我们提出AVP(基于视觉基元的动作生成),一种端到端架构,实现了这种以视觉基元为中心的接口:VLM推断下一阶段目标并生成视觉基元令牌,这些令牌条件化一个流匹配动作专家,其监督来自末端执行器运动学。在通用拾放任务上的真实机器人实验表明,AVP相比pi_0.5将成功率提高了27.61%,并优于其他近期方法,在数据效率、空间组合泛化和对象级迁移方面持续取得增益。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements this visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector kinematics. Real-robot experiments on general pick-and-place tasks show that AVP improves the success rate by 27.61% over pi_0.5 and outperforms other recent methods, with consistent gains in data efficiency, spatial-compositional generalization, and object-level transfer.

2605.27284 2026-06-16 cs.RO cs.AI 版本更新

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

FineVLA:面向可操控视觉-语言-动作策略的细粒度指令对齐

Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu

发表机构 * XLANG Lab, The University of Hong Kong(XLANG实验室,香港大学) Qwen Team, Alibaba Inc.(通义团队,阿里巴巴公司)

AI总结 提出FineVLA框架,通过构建细粒度数据集和训练策略,在保持任务成功率的同时实现机器人动作的细粒度可控性。

Comments 26 pages, 7 figures, 25 tables

AI中文摘要

视觉-语言-动作(VLA)模型日益被期望不仅完成机器人任务,还能遵循人类关于如何执行这些任务的指令。然而,现有的机器人数据集通常将轨迹与粗略的目标级语言配对,留下执行关键细节(如活动臂、接近方向和接触区域)未指定。这限制了可操控策略学习和机器人视频理解。我们引入了FineVLA,一个用于动作对齐的细粒度VLA监督的开放框架。该框架包括:(1)一个数据构建工具,统一了来自10个开源机器人数据集的85K任务中的972,247条轨迹,并构建了FineVLA-Data,一个包含47,159条细粒度轨迹的人工验证数据集;(2)一个包含500个视频、10,816个原子事实和1,030个VQA问题的留出基准;(3)一个机器人专用的VLM标注器,用于可扩展的细粒度标注;(4)一个使用细粒度和原始目标级指令的受控混合训练的可操控VLA策略。我们的实验得出了三个发现。首先,细粒度监督不会牺牲目标级成功率:在不同设置下,仅使用细粒度指令相比仅使用原始指令成功率提高了1.4到8.1个百分点。其次,细粒度指令和原始指令互补,遵循一致的倒U形趋势,在FG:Raw = 1:2到1:1时达到峰值。最佳混合设置在RoboTwin模拟中达到86.8%/82.5%的成功率,在真实世界双臂操作中达到62.7/100(相比之下仅使用原始指令为49.9)。第三,细粒度监督改善了可操控控制:最大的真实世界增益出现在姿态(+23)、颜色(+18)和接近方向(+18)上——这些因素中目标级指令没有提供指导。总体而言,细粒度语言应增强目标级指令:指定如何执行以及实现什么。项目页面:https://finevla.xlang.ai/

英文摘要

Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 10,816 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions. Our experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level success: FG-only improves over Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at FG:Raw = 1:2 to 1:1. The best mixed setting reaches 86.8%/82.5% in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation (vs. 49.9 Raw-only). Third, fine-grained supervision improves steerable control: the largest real-world gains appear on pose (+23), color (+18), and approach direction (+18), factors where goal-level instructions provide no guidance. Overall, fine-grained language should augment goal-level instructions: specifying how to execute alongside what to achieve. Project page: https://finevla.xlang.ai/

2606.10495 2026-06-16 cs.RO 版本更新

Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models

Act on What You See: 在视觉-语言-动作模型中解锁安全社交导航

Qingzi Wang, Xiyang Wu, Guangyao Shi, Dianwei Chen, Xianfeng Yang, Dinesh Manocha

发表机构 * University of Maryland(马里兰大学) University of Southern California(南加州大学)

AI总结 提出SALSA框架,通过两阶段无标注后训练(社交行为对齐和时间安全对齐),使预训练VLA模型利用已有表征实现安全社交导航,减少86.4%的近距离碰撞。

详情
AI中文摘要

安全社交导航要求机器人区分行人与普通障碍物,并在危险迫近前做出反应。我们表明,预训练的视觉-语言-动作(VLA)模型已在其内部表征中编码了行人-物体区分和未来碰撞信号,但行为克隆未能将这些信号转化为社交上合适的动作。为解决这一不匹配问题,我们提出SALSA,一个两阶段无标注后训练框架:(1)社交行为对齐将中间层社交特征桥接到动作头,并在反事实人-物场景对上训练以打破视觉显著性捷径;(2)时间安全对齐提供自动生成的未来风险监督,实现预期性碰撞避免。在SCAND和实际部署中,SALSA将近距离碰撞减少86.4%,并将社交反事实准确率从53%提升至93%,表明通过教导VLA策略利用其已拥有的表征来行动,可以实现更安全的社交导航。这些结果表明,通过更好地对齐潜在表征与动作生成,预训练VLA策略可被调整用于更安全的社交导航。

英文摘要

Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%, demonstrating that safer social navigation can be achieved by teaching VLA policies to act on representations they already possess. These results show that pretrained VLA policies can be adapted for safer social navigation by better aligning their latent representations with action generation.

2511.18960 2026-06-16 cs.LG cs.CV cs.RO 版本更新

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

AVA-VLA: 通过主动视觉注意力改进视觉-语言-动作模型

Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu

发表机构 * LiAuto Inc.(理想汽车) Beijing University of Technology(北京工业大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 针对VLA模型忽视历史信息的问题,提出AVA-VLA框架,利用循环状态近似信念并引入主动视觉注意力动态重加权视觉令牌,在LIBERO和CALVIN等基准上取得最优性能。

Comments Accepted at CVPR 2026 (Highlight)

详情
AI中文摘要

视觉-语言-动作(VLA)模型最近在具身任务中取得了显著进展,但大多数方法在每个时间步独立处理视觉观察。这种历史无关的设计将机器人操作视为马尔可夫决策过程,而现实中的机器人控制本质上是部分可观测的,需要推理过去的交互。为了解决这一不匹配,我们从部分可观测马尔可夫决策过程的角度重新表述VLA策略学习,并提出AVA-VLA,一种将动作生成建立在循环状态上的框架,该状态作为智能体对任务历史信念的神经近似。基于此循环状态,我们引入了主动视觉注意力(AVA),它动态地重新加权当前观测中的视觉令牌,以关注与指令和执行历史最相关的区域。大量实验表明,AVA-VLA在标准机器人基准测试(包括LIBERO和CALVIN)上达到了最先进的性能,并有效迁移到真实世界的双臂操作任务。这些结果证明了时间基础的主动视觉处理在改善机器人序列决策中VLA性能的有效性。项目页面:https://liauto-dsr.github.io/AVA-VLA-Page

英文摘要

Vision-Language-Action (VLA) models have shown remarkable progress in embodied tasks recently, but most methods process visual observations independently at each timestep. This history-agnostic design treats robot manipulation as a Markov Decision Process, even though real-world robotic control is inherently partially observable and requires reasoning over past interactions. To address this mismatch, we reformulate VLA policy learning from a Partially Observable Markov Decision Process perspective and propose AVA-VLA, a framework that conditions action generation on a recurrent state that serves as a neural approximation to the agent's belief over task history. Built on this recurrent state, we introduce Active Visual Attention (AVA), which dynamically reweights visual tokens in the current observation to focus on regions most relevant given both the instruction and execution history. Extensive experiments show that AVA-VLA achieves state-of-the-art performance on standard robotic benchmarks, including LIBERO and CALVIN, and transfers effectively to real-world dual-arm manipulation tasks. These results demonstrate the effectiveness of temporally grounded active visual processing for improving VLA performance in robotic sequential decision-making. The project page is available at https://liauto-dsr.github.io/AVA-VLA-Page.

2606.13578 2026-06-16 cs.CL cs.AI cs.LG cs.MM cs.RO 版本更新

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

LabVLA:在科学实验室中落地视觉-语言-动作模型

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

发表机构 * Zhejiang University(浙江大学) Shanghai AI Laboratory(上海人工智能实验室) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 针对科学实验室中机器人执行协议面临的数据和实体瓶颈,提出模拟数据引擎RoboGenesis和两阶段训练策略LabVLA,在LabUtopia基准上取得最高平均成功率。

Comments Work in progress. Project website at https://zjunlp.github.io/LabVLA/

详情
AI中文摘要

科学实验室越来越依赖AI系统来推理实验,但物理实验操作仍超出其能力范围。AI可以帮助阅读文献、生成假设和规划协议,但实验台前的协议执行仍需人类操作员。视觉-语言-动作(VLA)模型为书面协议与机器人执行之间提供了一种可能的接口,但现有策略主要在家庭和桌面演示上训练,很少遇到科学实验室中的仪器、透明液体或固定协议工作流。弥补这一差距需要实验室特定的监督和统一的学习框架,以适应执行实验协议所使用的不同机器人实体。因此,我们将数据和实体视为与模型设计并列的核心瓶颈。为解决数据方面的问题,我们构建了RoboGenesis,这是一个基于模拟的工作流和数据引擎,能够从原子技能组合配置的实验室工作流,验证和过滤 rollout,并跨支持的机器人配置文件导出结构化演示。在策略方面,我们提出了LabVLA,采用两阶段训练方案:首先进行FAST动作标记预训练,使Qwen3-VL-4B-Instruct骨干网络在学习任何连续控制之前具备动作意识;然后进行流匹配后训练,在知识隔离下附加一个DiT动作专家。在LabUtopia基准上,LabVLA在分布内和分布外设置下均达到了所有评估基线中最高的平均成功率。

英文摘要

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

7. 多机器人与群体系统 8 篇

2606.14882 2026-06-16 cs.RO 新提交

DynaHMRC: Decentralized Heterogeneous Multi-Robot Collaboration for Dynamic Tasks with Large Language Models

DynaHMRC: 基于大语言模型的动态任务去中心化异构多机器人协作

Wenhao Yu, Yu'ang Xie, Yifan Duan, Jie Peng, Guanting Ye, Ka-Veng Yuen, Yanyong Zhang, Jianmin Ji

发表机构 * University of Science and Technology of China (USTC)(中国科学技术大学) University of Macau (UM)(澳门大学)

AI总结 提出DynaHMRC去中心化框架,每个机器人作为角色感知的LLM智能体,通过四阶段闭环流程(自我描述、任务分配与领导竞标、领导者选举、反思执行)实现动态异构多机器人协作,并构建基准测试验证其高效性和可扩展性。

详情
AI中文摘要

大型语言模型(LLMs)为机器人提供了更丰富的任务理解和适应性,使其在协调长期任务中的异构多机器人系统方面具有前景。尽管有这种潜力,但仍存在几个挑战尚未充分探索:(1)集中式LLM调度器随着团队规模和环境复杂性的增加而扩展性差。单个模型必须处理过多的上下文信息,长上下文近似可能降低推理质量;(2)现有任务公式未能充分考虑动态设置,而对不断变化的任务条件的鲁棒适应对于实际部署至关重要;(3)领域特定数据稀缺限制了专门的机器人推理,使得专有通用模型在专家任务上效率低下。为了解决这些限制,我们提出了DynaHMRC,一个去中心化框架,其中每个机器人充当角色感知的LLM智能体。这种设计减轻了单模型上下文瓶颈,并支持跨异构团队配置的灵活协作。DynaHMRC将协作组织为四阶段闭环过程:自我描述、带有领导竞标的任务分配、领导者选举和反思执行,由可执行的机器人接口支持。我们进一步开发了一个基准测试,涵盖三个任务族、四种动态变化和六种团队配置,以系统研究动态任务建模。此外,我们进行了实证分析,以指导领域特定专家数据集的构建,并微调预训练LLM以提高专业能力。实验表明,DynaHMRC在更少的动作和通信步骤下实现了比强基线更高的成功率,同时在评估的设置中随着团队规模的增长显示出有希望的可扩展性趋势。

英文摘要

Large language models (LLMs) provide robots with richer task understanding and adaptability, making them promising for coordinating heterogeneous multi-robot systems in long-horizon tasks. Despite this potential, several challenges remain underexplored: (1) Centralized LLM schedulers scale poorly as team size and environmental complexity increase. A single model must process excessive contextual information, and long-context approximation may degrade reasoning quality; (2) Existing task formulations insufficiently consider dynamic settings, while robust adaptation to evolving task conditions is essential for real-world deployment; (3) Domain-specific data scarcity limits specialized robotic reasoning, making proprietary general-purpose models inefficient for expert tasks. To address these limitations, we propose DynaHMRC, a decentralized framework in which each robot acts as a role-aware LLM agent. This design mitigates the single-model context bottleneck and supports flexible collaboration across heterogeneous team configurations. DynaHMRC organizes collaboration as a four-stage closed-loop process: self-description, task allocation with leadership bidding, leader election, and reflective execution, supported by executable robot interfaces. We further develop a benchmark covering three task families, four dynamic variations, and six team configurations to systematically study dynamic task modeling. In addition, we conduct an empirical analysis to guide the construction of domain-specific expert datasets and fine-tune pretrained LLMs to improve specialized competence. Experiments show that DynaHMRC achieves higher success rates than strong baselines with fewer action and communication steps, while demonstrating promising scalability trends as team size grows within the evaluated settings.

2606.15255 2026-06-16 cs.RO 新提交

OSDAG: Online Scheduling for Efficient Multi-Robot Collaboration

OSDAG: 面向高效多机器人协作的在线调度

Thanh Nguyen Canh, Thang Tran Viet, Phuc Van Dinh, Xiem HoangVan, Nak Young Chong

发表机构 * Japan Advanced Institute of Science and Technology(日本北陆先端科学技术大学院大学) University of Engineering and Technology, Vietnam National University(越南国立大学工程技术大学) Hanyang University(汉阳大学)

AI总结 提出OSDAG框架,结合LLM任务推理与DAG在线调度,通过一次性将指令分解为依赖图并实时分配任务,推理时间比对话式方法快5-15倍,完成时间(makespan)比顺序基线最多缩短38%。

详情
AI中文摘要

协调异构多机器人系统(MRS)完成复杂、长周期任务需要灵活的高层推理和高效的低层调度。现有的基于LLM的方法解决了推理方面,但引入了两个关键瓶颈:(1)执行过程中重复的LLM推理,随着智能体数量增加而增加延迟;(2)离线、预提交的调度,即使存在独立工作,也会迫使机器人等待顺序排列的前驱任务而闲置。本文提出了OSDAG,一种新颖的框架,将基于LLM的任务推理与有向无环图(DAG)表示和约束感知的在线调度相结合。LLM被调用一次,将自然语言指令分解为带有依赖注释的任务图,然后轻量级在线调度器实时将就绪任务分配给空闲智能体。DAG表示编码了前驱和资源约束,确保正确性同时暴露所有可用的并行性。在五个基准场景上的实验表明,与基于对话的方法相比,OSDAG的推理时间快5-15倍,与顺序基线相比,完成时间最多减少38%,并保持有竞争力的成功率。在双臂操作任务上的仿真和真实世界实验验证了所提方法在高效多机器人协调中的有效性和实用性。网站和资源可在 http://thanhnguyencanh.github.io/LLM_DAG4MultiRobot 获取。

英文摘要

Coordinating heterogeneous multi-robot systems (MRS) for complex, long-horizon tasks requires both flexible high-level reasoning and efficient low-level scheduling. Existing LLM-based approaches address the reasoning side but introduce two critical bottlenecks: (1) repeated LLM inference during execution, which inflates latency with agent count, and (2) offline, pre-committed scheduling, which forces robots to idle while waiting for sequentially ordered predecessors even when independent work is available. This paper presents OSDAG, a novel framework that integrates LLM-based task reasoning with Directed Acyclic Graph (DAG) representation and constraint-aware online scheduling. The LLM is invoked once to decompose a natural-language instruction into a dependency-annotated task graph, and a lightweight online scheduler then allocates ready tasks to idle agents in real time. The DAG representation encodes both precedence and resource constraints, ensuring correctness while exposing all available parallelism. Experiments across five benchmark scenarios demonstrate that OSDAG achieves 5-15x faster reasoning time compared to dialogue-based methods, reduces makespan by up to 38% over sequential baselines, and maintains competitive success rates. Both simulation and real-world experiments on dual-arm manipulation tasks validate the effectiveness and practicality of the proposed approach for efficient multi-robot coordination. The website and resources are available at http://thanhnguyencanh.github.io/LLM_DAG4MultiRobot
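The scheduling idea in this abstract (decompose once into a dependency graph, then hand ready tasks to idle agents as they free up) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `online_schedule`, the toy task graph, and the fixed task durations are all assumptions made for illustration. The LLM decomposition step is replaced by a hand-written dictionary, and resource constraints are not modeled.

```python
import heapq

def online_schedule(durations, deps, n_agents):
    """Greedy event-driven scheduler over a task DAG (illustrative sketch only).

    durations: {task: duration}; deps: {task: [predecessor tasks]}.
    Assumes n_agents >= 1 and an acyclic graph.
    Returns ({task: (agent, start, end)}, makespan).
    """
    remaining = {t: set(deps.get(t, ())) for t in durations}
    children = {t: [] for t in durations}
    for task, preds in remaining.items():
        for p in preds:
            children[p].append(task)

    ready = sorted(t for t, preds in remaining.items() if not preds)
    idle = list(range(n_agents))
    running = []  # min-heap of (finish_time, task, agent)
    schedule, now = {}, 0.0

    while ready or running:
        # Hand every ready task to an idle agent, if one exists.
        while ready and idle:
            task, agent = ready.pop(0), idle.pop(0)
            end = now + durations[task]
            schedule[task] = (agent, now, end)
            heapq.heappush(running, (end, task, agent))
        # Advance time to the next task completion and release its successors.
        now, done, agent = heapq.heappop(running)
        idle.append(agent)
        for child in children[done]:
            remaining[child].discard(done)
            if not remaining[child]:
                ready.append(child)
    return schedule, now

# Hypothetical dual-arm task graph standing in for the LLM's one-shot decomposition.
durations = {"pick_a": 2, "pick_b": 3, "place_a": 1, "place_b": 1, "inspect": 2}
deps = {"place_a": ["pick_a"], "place_b": ["pick_b"], "inspect": ["place_a", "place_b"]}

schedule, makespan = online_schedule(durations, deps, n_agents=2)
print("parallel makespan:", makespan)
print("sequential total:", sum(durations.values()))
```

On this toy graph, two agents finish in 6 time units versus 9 for strictly one-at-a-time execution. The numbers are properties of the made-up example, not results from the paper.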

2606.15550 2026-06-16 cs.RO 新提交

Robots as Tokens: Unified Diffusion Transformer for Coordinated Multi-Robot Trajectory Generation

机器人作为令牌:面向协调多机器人轨迹生成的统一扩散Transformer

Ruofei Bai, Jie Chen, Yuxin Cai, Jun Li, Wei-Yun Yau, Lihua Xie

发表机构 * Nanyang Technological University(南洋理工大学) Agency for Science, Technology and Research(新加坡科技研究局) National University of Singapore(新加坡国立大学)

AI总结 提出Roken框架,将每个机器人表示为离散令牌,通过扩散Transformer直接生成满足安全和连通性约束的多机器人轨迹,无需迭代后处理。

Comments 23 pages, 13 figures; Project page: https://bairuofei.github.io/roken-project-page/

详情
AI中文摘要

生成模型在语言和视觉生成中的成功激发了其在生成式机器人规划中的广泛应用。然而,现有工作大多聚焦于单机器人规划,或以顺序方式生成多机器人轨迹并通过迭代后处理解决机器人间冲突。本文研究协调多机器人轨迹(作为一种特殊的时空分布)是否可以通过生成模型以前馈方式学习和生成。我们提出Roken(Robots as Tokens),一种统一的扩散Transformer,直接生成同时满足(个体)安全和(全局)连通性约束的多机器人轨迹。Roken的核心设计是将每个机器人表示为一个离散令牌,使它们能够通过自注意力自然交互,并通过交叉注意力关注地图令牌以获取环境布局。我们进一步引入基于贝叶斯定理的多个辅助任务,提供多尺度时空监督以高效学习条件分布。训练时,Roken吸收来自不同团队规模的多样化专家轨迹。推理时,Roken作为一个多功能多机器人规划器,可处理单机器人规划、协调多机器人轨迹生成,以及通过固定部分机器人令牌作为条件进行条件轨迹生成。在多种杂乱环境中的实验表明,Roken能够生成协调的多机器人轨迹,以高成功率执行连通性约束的目标导航任务,优于用于生成训练数据集的基线方法。Roken在混合团队规模训练后展现出良好的可扩展性,并对未见或部分观测环境具有泛化能力,验证了其从多样化数据中学习并执行多种任务的潜力。

英文摘要

The success of generative models in language and visual generation has inspired extensive applications to generative robot planning. However, most existing works either focus on single-robot planning, or generate multi-robot trajectories in a sequential manner with iterative post-processing to resolve inter-robot conflicts. In this work, we investigate whether coordinated multi-robot trajectories, as a special spatiotemporal distribution, can be learned and generated with a generative model in a feed-forward manner. We propose Robots as Tokens (Roken), a unified diffusion transformer that directly generates multi-robot trajectories that satisfy both (individual) safety and (global) connectivity constraints. The core design of Roken is to represent each robot as a discrete token, allowing them to naturally interact with each other through self-attention, and cross-attend to map tokens for environment layouts. We further introduce several auxiliary tasks based on Bayes' theorem to provide multi-scale spatial-temporal supervision for efficient learning of the conditional distribution. In training, Roken absorbs diverse expert trajectories from different team sizes. During inference, Roken behaves as a versatile multi-robot planner that can handle single-robot planning, coordinated multi-robot trajectory generation, and conditional trajectory generation by fixing some robot tokens as conditions. Experiments in diverse cluttered environments show that Roken can generate coordinated multi-robot trajectories to perform connectivity-constrained goal navigation tasks with high success rates, outperforming the baseline method used to generate the training dataset. Roken also demonstrates good scalability after training with mixed team sizes, and shows generalization to unseen or partially observed environments, verifying its potential to learn from diverse data and perform versatile tasks.

2606.16490 2026-06-16 cs.RO 新提交

Robots that Collaborate: Sequential Asymmetric Imitation for Learning Coupled Robot Policies

协作机器人:用于学习耦合机器人策略的序列非对称模仿

Yincong Chen, Ranpeng Qiu, Zihao Li, Yanan Zhou, Guoqiang Ren, Weiming Zhi

发表机构 * Zeno AI University of Sydney(悉尼大学)

AI总结 提出序列非对称模仿(SAI),通过单操作员课程学习耦合多机器人行为,无需同步双操作员演示或显式通信,在真实双机器人操作任务中提升成功率与相位同步。

详情
AI中文摘要

协作移动操作要求机器人与部分可观测的伙伴协调,同时通过共享物体进行物理交互。这很困难,因为失败通常不是由于局部技能差,而是由于不合时宜的等待、让步、拉动、释放或重新定位。我们通过两个双臂移动操作器与刚性和可变形物体耦合来研究这个问题。我们提出序列非对称模仿(SAI),一种单操作员课程,用于学习耦合的多机器人行为,无需同步双操作员演示或显式机器人间通信。SAI 首先从与顺从人类伙伴的单侧演示中训练机器人 A,然后针对已部署的机器人 A 策略训练机器人 B,最后在协调失败附近使用稀疏干预来优化机器人 A。这种分阶段过程使策略暴露于越来越真实的伙伴行为,包括延迟、相位不匹配、让步不足和交互冲突。在真实世界的双机器人操作任务中,SAI 在任务成功率、相位同步和伙伴条件性让步方面优于独立模仿和课程消融基线。这些结果表明,物理耦合协作可以通过模仿课程的结构来学习,而不是通过同步多操作员演示或显式协调机制。项目页面:http://cyc0429.github.io/sai-project-page/

英文摘要

Collaborative mobile manipulation requires robots to coordinate with a partially observed partner while physically interacting through shared objects. This is difficult because failures often arise not from poor local skills, but from mistimed waiting, yielding, pulling, releasing, or repositioning. We study this problem with two bimanual mobile manipulators coupled through rigid and deformable objects. We propose Sequential Asymmetric Imitation (SAI), a single-teleoperator curriculum for learning coupled multi-robot behaviors without synchronized dual-operator demonstrations or explicit inter-robot communication. SAI trains Robot A from unilateral demonstrations with a compliant human partner, trains Robot B against the deployed Robot A policy, and then refines Robot A using sparse interventions near coordination failures. This staged process exposes the policies to increasingly realistic partner behaviors, including delay, phase mismatch, insufficient yielding, and interaction conflict. Across real-world dual-robot manipulation tasks, SAI improves task success, phase synchronization, and partner-contingent yielding over independent imitation and curriculum-ablation baselines. These results suggest that physically coupled collaboration can be learned through the structure of the imitation curriculum, rather than through synchronized multi-operator demonstrations or explicit coordination mechanisms. Project page: http://cyc0429.github.io/sai-project-page/

2606.16116 2026-06-16 eess.SY cs.MA cs.RO cs.SY math.DS 交叉投稿

Distributed Safe Consensus Under Asymmetric Input and Time-Varying Output Constraints

非对称输入与时变输出约束下的分布式安全一致性

Abhinav Sinha, Shashi Ranjan Kumar

发表机构 * Guidance, Autonomy, Learning, and Control for Intelligent Systems (GALACxIS) Lab, Department of Aerospace Engineering and Engineering Mechanics, University of Cincinnati(智能系统引导、自主、学习与控制实验室,航空航天工程与工程力学系,辛辛那提大学) Intelligent Systems and Control (ISaC) Lab, Department of Aerospace Engineering, Indian Institute of Technology Bombay(智能系统与控制实验室,航空航天工程系,印度理工学院孟买分校)

AI总结 针对单积分多智能体系统,提出一种结合障碍坐标变换的分布式控制律,同时满足非对称执行器约束和时变输出安全约束,实现渐近同步。

详情
AI中文摘要

本文研究了在连通无向图上的单积分多智能体系统中,同时存在非对称执行器约束和输出安全约束下的安全分布式一致性问题。每个智能体配备一个连续可微的非对称执行器动力学,将命令控制信号映射到实际输入,同时使后者严格保持在规定的允许区间内。为了解决输出安全性问题,在公共时变安全区间上引入障碍坐标变换,并在变换后的坐标中设计分布式同步律。所得到的控制器将基于图的协调层与执行器侧跟踪层相结合,从而同时实现输入可容许性、安全输出集的前向不变性和渐近同步。对于初始条件的紧致可容许集,证明了闭环解是完整的,所有信号有界,执行器输入始终严格在其非对称界限内,并且智能体输出始终保持在规定的安全区间内。此外,变换后的同步误差指数收敛到零,原始智能体输出渐近同步到嵌入公共安全区间中的设计者选择的可容许轨迹。数值仿真验证了所提出的框架,并展示了在非对称执行器界限和时变输出约束下的安全一致性。

英文摘要

This paper studies safe distributed consensus for single-integrator multi-agent systems over connected undirected graphs under simultaneous asymmetric actuator constraints and output safety constraints. Each agent is equipped with a continuously differentiable asymmetric actuator dynamics that maps a commanded control signal to the realized plant input while keeping the latter strictly inside a prescribed admissible interval. To address output safety, a barrier-coordinate transformation is introduced over a common time-varying safe interval, and a distributed synchronization law is designed in the transformed coordinates. The resulting controller integrates a graph-based coordination layer with an actuator-side tracking layer, thereby enabling simultaneous enforcement of input admissibility, forward invariance of the safe output set, and asymptotic synchronization. For compact admissible sets of initial conditions, it is shown that the closed-loop solution is complete, all signals remain bounded, the actuator inputs remain strictly within their asymmetric bounds, and the agent outputs remain inside the prescribed safe interval for all time. Moreover, the transformed synchronization errors converge exponentially to zero, and the original agent outputs asymptotically synchronize to a designer-selected admissible trajectory embedded in the common safe interval. Numerical simulations validate the proposed framework and demonstrate safe consensus under both asymmetric actuation bounds and time-varying output constraints.
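The abstract does not state which barrier-coordinate transformation is used. As a generic illustration only, one common choice for an output $y_i$ confined to a time-varying interval is the log-ratio map below. The symbols $\underline{y}(t)$, $\overline{y}(t)$, and $z_i$ are assumptions introduced here, not the paper's notation.

```latex
\[
z_i \;=\; \ln\!\frac{y_i - \underline{y}(t)}{\overline{y}(t) - y_i},
\qquad
y_i \;=\; \frac{\underline{y}(t) + \overline{y}(t)\,e^{z_i}}{1 + e^{z_i}},
\qquad
\underline{y}(t) < y_i < \overline{y}(t).
\]
```

Under a map of this kind, $z_i$ is finite exactly when $y_i$ lies strictly inside the safe interval, so keeping the transformed coordinate bounded keeps the original output inside its bounds. This is the usual reason for designing the synchronization law in transformed coordinates.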

2511.21957 2026-06-16 cs.RO cs.MA 版本更新

RSPECT: Robust and Scalable Planner for Energy-Aware Coordination of UAV-UGV Teams in Aerial Monitoring

RSPECT:空中监控中无人机-无人车团队能量感知协调的鲁棒可扩展规划器

Cahit Ikbal Er, Amin Kashiri, Yasin Yazicioglu

发表机构 * Department of Electrical and Computer Engineering, Northeastern University(东北大学电气与计算机工程系)

AI总结 针对无人机与作为移动充电站的无人车在长期空中监控任务中的鲁棒规划问题,提出可扩展启发式算法RSPECT,通过混合整数规划建模并保证规划可行性与鲁棒性。

Comments Accepted to the Journal of Intelligent & Robotic Systems (JINT)

详情
AI中文摘要

我们考虑能量受限的无人机(UAV)和作为移动充电站的无人车(UGV)的鲁棒规划,以执行长期空中监控任务。具体而言,给定一组需要无人机访问的点以及无人机-无人车团队的期望最终位置,目标是在不确定性(例如,未知障碍物/地形、风)下,找到一个无需重大修订即可实现的鲁棒计划(车辆轨迹),以最小时间完成此任务。我们提供了该问题的形式化描述,将其建模为混合整数规划(MIP),该问题是NP难的。由于精确求解方法对此类问题在计算上难以处理,我们提出了RSPECT,一种可扩展且高效的启发式算法。我们给出了关于算法复杂度以及所得计划的可行性和鲁棒性的理论结果。我们还通过仿真和实验展示了我们方法的性能。

英文摘要

We consider the robust planning of energy-constrained unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), which act as mobile charging stations, to perform long-horizon aerial monitoring missions. More specifically, given a set of points to be visited by the UAVs and desired final positions of the UAV-UGV teams, the objective is to find a robust plan (the vehicle trajectories) that can be realized without a major revision in the face of uncertainty (e.g., unknown obstacles/terrain, wind) to complete this mission in minimum time. We provide a formal description of this problem as a mixed-integer program (MIP), which is NP-hard. Since exact solution methods are computationally intractable for such problems, we propose RSPECT, a scalable and efficient heuristic. We provide theoretical results on the complexity of our algorithm and the feasibility and robustness of resulting plans. We also demonstrate the performance of our method via simulations and experiments.

2512.13090 2026-06-16 cs.RO 版本更新

Multi-Robot Motion Planning from Vision and Language using Heat-Inspired Diffusion

基于热启发的扩散模型实现从视觉和语言的多机器人运动规划

Jebeom Chae, Junwoo Chang, Seungho Yeom, Yujin Kim, Jongeun Choi

发表机构 * Department of Artificial Intelligence, Yonsei University(延世大学人工智能系) School of Mechanical Engineering, Yonsei University(延世大学机械工程学院)

AI总结 提出LHD框架,结合CLIP语义先验与碰撞避免扩散核,实现语言条件化的多机器人无碰撞轨迹规划,在成功率上优于先前方法并降低延迟。

Comments 8 pages, 6 figures, accepted by IEEE Robotics and Automation Letters (RA-L)

详情
Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7118-7125, June 2026
AI中文摘要

扩散模型最近通过捕捉可行轨迹的多模态分布,成为机器人运动规划的强大工具。然而,它们在具有灵活、语言条件化任务规范的多机器人场景中的扩展仍然有限。此外,当前的基于扩散的方法在推理过程中计算成本高,并且由于需要显式构建环境表示以及缺乏几何可达性推理机制,难以泛化。为了解决这些限制,我们提出了语言条件化热启发扩散(LHD),一个端到端的基于视觉的框架,生成语言条件化的无碰撞轨迹。LHD将来自视觉语言模型(VLM)CLIP的语义先验与作为物理归纳偏置的碰撞避免扩散核相结合,使规划器能够在可达工作空间内严格解释语言命令。这自然地处理了分布外(OOD)场景——在可达性方面——通过引导机器人朝向匹配语义意图的可访问替代方案,同时消除了推理时对显式障碍物信息的需求。在多种真实世界启发的地图上的广泛评估以及真实机器人实验表明,LHD在成功率上持续优于先前的基于扩散的规划器,同时减少了规划延迟。项目页面见:https://jebeom.github.io/lhd_project_page/

英文摘要

Diffusion models have recently emerged as powerful tools for robot motion planning by capturing the multi-modal distribution of feasible trajectories. However, their extension to multi-robot settings with flexible, language-conditioned task specifications remains limited. Furthermore, current diffusion-based approaches incur high computational cost during inference and struggle with generalization because they require explicit construction of environment representations and lack mechanisms for reasoning about geometric reachability. To address these limitations, we present Language-conditioned Heat-inspired Diffusion (LHD), an end-to-end vision-based framework that generates language-conditioned, collision-free trajectories. LHD integrates semantic priors from CLIP, a vision-language model (VLM), with a collision-avoiding diffusion kernel serving as a physical inductive bias that enables the planner to interpret language commands strictly within the reachable workspace. This naturally handles out-of-distribution (OOD) scenarios -- in terms of reachability -- by guiding robots toward accessible alternatives that match the semantic intent, while eliminating the need for explicit obstacle information at inference time. Extensive evaluations on diverse real-world-inspired maps, along with real-robot experiments, show that LHD consistently outperforms prior diffusion-based planners in success rate, while reducing planning latency. Project page is available at: https://jebeom.github.io/lhd_project_page/

2509.07561 2026-06-16 cs.MA cs.RO 版本更新

Bio-inspired decision making in robot swarms under biases

偏差下机器人群体中的生物启发式决策

Raina Zakir, Timoteo Carletti, Marco Dorigo, Andreagiovanni Reina

发表机构 * IRIDIA, Université Libre de Bruxelles(布鲁塞尔自由大学IRIDIA实验室) Department of Mathematics and Namur Institute for Complex Systems, naXys, University of Namur(那慕尔大学数学系和那慕尔复杂系统研究所) Centre for the Advanced Study of Collective Behaviour, Universität Konstanz(康斯坦茨大学集体行为高级研究中心) Department of Computer and Information Science, Universität Konstanz(康斯坦茨大学计算机与信息科学系) Department of Collective Behaviour, Max Planck Institute of Animal Behaviour(马克斯·普朗克动物行为研究所集体行为系)

AI总结 研究在存在非社会偏差时,直接切换与交叉抑制两种意见动力学机制对机器人群体决策性能的影响,发现交叉抑制在偏差条件下更优。

详情
AI中文摘要

最小化机器人群体提供了一种可扩展、鲁棒且成本效益高的方法来执行复杂任务,有望改变医疗、灾难响应和环境监测等领域的应用。然而,协调这种去中心化系统仍然是一个基本挑战,特别是当机器人在通信、计算和内存方面受限时。在我们的研究中,单个机器人在感知环境时经常出错,但群体能够快速可靠地就$n$个离散选项中的最佳选项达成共识。我们比较了两种典型的意见动力学机制——直接切换和交叉抑制——它们是简单而有效的集体信息处理规则,在从神经群体到昆虫群体的生物系统中广泛存在。我们通过考虑影响意见动力学的非社会偏差,推广了现有的平均场模型。使用直接切换的群体在没有非社会动力学时能可靠地选择最佳选项,但一旦引入此类偏差,其性能下降,常常导致决策死锁。相比之下,受生物启发的交叉抑制在广泛的偏差条件下实现了更快、更一致、更准确、更鲁棒和更可扩展的决策。我们的研究结果为最小化群体的协调提供了理论和实践见解,并扩展到了生物学和工程学中广泛类别的去中心化决策系统。

英文摘要

Minimalistic robot swarms offer a scalable, robust, and cost-effective approach to performing complex tasks with the potential to transform applications in healthcare, disaster response, and environmental monitoring. However, coordinating such decentralised systems remains a fundamental challenge, particularly when robots are constrained in communication, computation, and memory. In our study, individual robots frequently make errors when sensing the environment, yet the swarm can rapidly and reliably reach consensus on the best among $n$ discrete options. We compare two canonical mechanisms of opinion dynamics -- direct-switch and cross-inhibition -- which are simple yet effective rules for collective information processing observed in biological systems across scales, from neural populations to insect colonies. We generalise the existing mean-field models by considering asocial biases influencing the opinion dynamics. While swarms using direct-switch reliably select the best option in absence of asocial dynamics, their performance deteriorates once such biases are introduced, often resulting in decision deadlocks. In contrast, bio-inspired cross-inhibition enables faster, more cohesive, accurate, robust, and scalable decisions across a wide range of biased conditions. Our findings provide theoretical and practical insights into the coordination of minimal swarms and offer insights that extend to a broad class of decentralised decision-making systems in biology and engineering.
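To make the two mechanisms named in the abstract concrete, the following is a minimal mean-field sketch. It uses one commonly used textbook-style form of direct-switch and cross-inhibition dynamics. The abstract does not give the paper's exact equations, rate definitions, or asocial bias terms, so none of those are reproduced here. The function names, the quality values `q`, and the Euler step size are assumptions made for illustration.

```python
def step_direct_switch(x, q, dt):
    """One Euler step. x[i] is the fraction committed to option i (sums to 1)."""
    n = len(x)
    dx = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                # Agents on j that meet an agent voicing i switch straight to i
                # at rate q[i]; the symmetric term removes agents from i.
                dx[i] += q[i] * x[i] * x[j] - q[j] * x[j] * x[i]
    return [xi + dt * di for xi, di in zip(x, dx)]

def step_cross_inhibition(x, u, q, dt):
    """One Euler step. u is the uncommitted fraction; sum(x) + u == 1."""
    n = len(x)
    dx = []
    for i in range(n):
        recruited = q[i] * x[i] * u  # uncommitted agents recruited to option i
        inhibited = x[i] * sum(q[j] * x[j] for j in range(n) if j != i)  # i -> uncommitted
        dx.append(recruited - inhibited)
    du = -sum(dx)  # population is conserved
    return [xi + dt * di for xi, di in zip(x, dx)], u + dt * du

def simulate(q, steps=10_000, dt=0.01):
    """Run both models from an even split over the options in q."""
    n = len(q)
    ds = [1.0 / n] * n
    ci, u = [1.0 / n] * n, 0.0
    for _ in range(steps):
        ds = step_direct_switch(ds, q, dt)
        ci, u = step_cross_inhibition(ci, u, q, dt)
    return ds, ci, u

ds, ci, u = simulate(q=[1.0, 0.8])
print("direct-switch:", ds)
print("cross-inhibition:", ci, "uncommitted:", u)
```

The key structural difference is the intermediate uncommitted state: under cross-inhibition a disagreeing encounter sends an agent to `u` rather than directly to the other opinion. How each rule behaves once asocial biases are added is the paper's subject and is not modeled in this sketch.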

8. 无人车、无人机与移动机器人 13 篇

2606.15251 2026-06-16 cs.RO cs.AI cs.LG 新提交

Driving, Fast or Slow? Neuro-Symbolic Guidance for Motion Prediction in Multi-Modal Ground Mobility

驾驶,快或慢?多模态地面移动中运动预测的神经符号引导

Simon Kohaut, Felix Divo, Julius Hahnewald, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

发表机构 * Artificial Intelligence and Machine Learning Lab, TU Darmstadt(达姆施塔特工业大学人工智能与机器学习实验室) Honda Research Institute(本田研究所) Hessian Center for AI (hessian.AI)(黑森州人工智能中心) Centre for Cognitive Science(认知科学中心) German Center for AI (DFKI)(德国人工智能研究中心) Uncertainty in Artificial Intelligence Lab, TU Eindhoven(埃因霍温理工大学人工智能不确定性实验室)

AI总结 提出TraCS框架,通过神经符号方法将交通规则编码为概率一阶逻辑,增强黑盒运动预测模型的可解释性和合规性,在Argoverse 2上持续提升SOTA性能。

详情
AI中文摘要

准确且可解释的异构交通空间(包括行人、自行车、汽车和卡车)运动预测对于安全的自主导航至关重要。然而,最先进的方法仍然是黑盒,缺乏对现实世界移动的监管和行为约束的显式编码。我们提出Trajectory Compliance-Shaping (TraCS),一种神经符号框架,通过可解释的概率一阶逻辑增强现有的黑盒运动预测骨干网络。为此,TraCS采用智能体代码生成流水线,弥合交通规则的自然语言描述与概率运动预测之间的差距。此外,TraCS采用反应式数据流推理引擎,随着场景演变维护并高效更新合规性景观。为防止TraCS过度自信地将骨干网络的预测引导到错误方向,我们提出一种神经置信度评分,作为上下文感知的合规性信号衰减。我们在Argoverse 2基准上展示了TraCS如何持续改进最先进的预测骨干网络,表明概率和符号合规性推理是纯神经运动预测的广泛适用且计算高效的补充。

英文摘要

Accurate and interpretable motion prediction for heterogeneous traffic spaces, including pedestrians, bicycles, cars, and trucks, is essential for safe autonomous navigation. Nevertheless, state-of-the-art approaches remain predominantly black-box, lacking explicit encoding of the regulatory and behavioral constraints of real-world mobility. We propose Trajectory Compliance-Shaping (TraCS), a neuro-symbolic framework that augments existing black-box motion prediction backbones with interpretable and probabilistic first-order logic. To do so, TraCS employs an agentic code-generation pipeline to bridge the gap between natural-language descriptions of traffic regulations and probabilistic motion prediction. Furthermore, TraCS employs a reactive data-streaming inference engine that maintains and efficiently updates compliance landscapes as scenes evolve. To prevent TraCS from overconfidently steering the backbone's predictions in the wrong direction, we propose a neural confidence rating learned as a context-aware attenuation of the compliance signal. We demonstrate on the Argoverse 2 benchmark how TraCS consistently improves state-of-the-art prediction backbones, showing that probabilistic and symbolic compliance reasoning is a broadly applicable and computationally efficient complement to purely neural motion predictors.

2606.16042 2026-06-16 cs.RO cs.AI 新提交

Leveraging Deep Learning for Object and Position Recognition of Load Carriers for Autonomous Logistics Vehicles

利用深度学习实现自主物流车辆对载具的物体与位置识别

Christoph Legat, Tobias Miller, Marco Riess

发表机构 * Research Group on Cognitive Autonomy & Predictive Intelligence, Technical University of Applied Sciences, Augsburg, Germany(认知自主与预测智能研究组,奥格斯堡应用技术大学,德国) Grenzebach Maschinenbau GmbH, Asbach-Bäumenheim, Germany(Grenzebach Maschinenbau GmbH,德国阿斯巴赫-博伊门海姆)

AI总结 提出基于深度学习的框架,通过卷积神经网络从RGBD数据中识别载具上的预定义地标并计算其位姿,实现自主物流车辆对载具的检测与定位,实验验证了工业环境下的可靠性。

Comments 6 pages, 6 figures, IFAC World Congress 2026, © 2026 the authors. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND

详情
AI中文摘要

本工作探索了在移动机器人中利用人工智能实现载具的自主检测和位姿估计,以便自动拾取。设计了一个深度神经网络,从RGBD数据中识别载具上的预定义地标;然后利用这些地标计算载具的位姿。该网络直接处理RGBD图像以估计地标位置,这些位置构成了确定载具位置的基础。该方法在大量实验中得到了验证,并包含软件和硬件实现。提出了一个基于深度学习的框架,用于检测载具并估计其位姿,以应用于自主物流车辆。我们的方法使用卷积神经网络从RGBD输入中识别载具上的特征参考点,并通过将这些推断出的地标与先验几何知识相结合来计算其位姿。实验表明,所得精度足以在工业环境中可靠地检测载具,证实了该方法适用于自主内部物流应用。

英文摘要

This work explores the use of artificial intelligence in mobile robotics to achieve autonomous detection and pose estimation of load carriers for automated pickup. A deep neural network is designed to recognize predefined landmarks on the carrier from RGBD data; these landmarks are then used to compute the carrier's pose. The network operates directly on RGBD images to estimate landmark positions, which form the basis for determining the carrier's location. The approach is validated in extensive experiments and comprises both software and hardware implementations. A deep learning-based framework is presented to detect load carriers and estimate their pose for use with autonomous logistics vehicles. Our method uses a convolutional neural network to identify characteristic reference points on the carrier from RGBD input and computes its pose by combining these inferred landmarks with prior geometric knowledge. Experiments show that the resulting accuracy is sufficient for reliable load carrier detection in industrial environments, confirming the suitability of the method for autonomous intralogistics applications.

2606.16513 2026-06-16 cs.RO 新提交

Agile Fall Recovery for Quadrotors with Bidirectional Thrust via Reinforcement Learning

基于强化学习的双向推力四旋翼敏捷坠落恢复

Anke Zhao, Yuhang Zhong, Kenghou Hoi, Junyu Mou, Junjie Wang, Lijie Wang, Jialiang Hou, Fei Gao

发表机构 * Institute of Cyber-Systems and Control, College of Control Science and Engineering, Zhejiang University(浙江大学控制科学与工程学院工业控制技术研究所) Differential Robotics

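AI Summary Proposes a reinforcement-learning framework that uses lightweight onboard sensors to recover a quadrotor from arbitrary ground attitudes to stable hover. An asymmetric actor-critic architecture and an Incremental Nonlinear Dynamic Inversion (INDI) controller address partial observability and sensor invalidity; simulation and real-world experiments demonstrate zero-shot transfer and robustness.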

AI中文摘要

自主坠落恢复是四旋翼在现实环境中运行的关键能力,因为碰撞或故障可能导致飞行器以任意姿态停在地面上。该问题具有挑战性,因为恢复必须在有限的机载感知、受限的自由空间、地面接触以及存在未知干扰的情况下实现。本文提出了一种基于强化学习的框架,用于四旋翼从任意地面姿态自主恢复至稳定悬停,仅使用轻量级机载传感器。为了解决严重的部分可观测性和间歇性传感器失效问题,我们在非对称演员-评论家架构中训练了一个循环策略,并利用增量非线性动态逆(INDI)控制器跟踪策略输出。结合电机响应和光流的高保真仿真,整体训练框架显著缩小了仿真到现实的差距。仿真消融研究验证了主要设计选择的重要性,而真实世界实验展示了在不同初始姿态、风干扰和额外负载下的零样本迁移和鲁棒恢复。这些结果表明,无需明确的状态估计,仅使用有限且不可靠的机载传感即可实现敏捷的四旋翼坠落恢复。

英文摘要

Autonomous fall recovery is a critical capability for quadrotors operating in real-world environments, where collisions or failures may leave the vehicle resting on the ground in an arbitrary attitude. This problem is challenging because recovery must be achieved under limited onboard sensing, in constrained free space, with ground contact, and in the presence of unknown disturbances. In this letter, we present an RL-based framework for autonomous fall recovery of a quadrotor from arbitrary ground attitudes to stable hover using only lightweight onboard sensors. To address severe partial observability and intermittent sensor invalidity, we train a recurrent policy within an asymmetric actor--critic architecture, leveraging an Incremental Nonlinear Dynamic Inversion (INDI) controller to track the policy output. Combined with high-fidelity simulations of motor response and optical flow, the overall training framework significantly reduces the sim-to-real gap. Simulation ablation studies validate the importance of the main design choices, while real-world experiments demonstrate zero-shot transfer and robust recovery under different initial attitudes, wind disturbances, and additional payloads. These results demonstrate that agile quadrotor fall recovery can be achieved without explicit state estimation using only limited and unreliable onboard sensing.

2606.16621 2026-06-16 cs.RO 新提交

Reinforcement Learning with Inner-loop Dynamics Estimator for Aerial Manipulation under Uncertainty

基于内环动力学估计器的强化学习在不确定性下的空中操纵

Shivansh Pratap Singh, Samaksh Ujjwal, Ishita Chaudhary, V R Vasudevan, Rishabh Dev Yadav, Spandan Roy

发表机构 * International Institute of Information Technology Hyderabad(国际信息技术学院海德拉巴) University of Manchester(曼彻斯特大学)

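AI Summary Proposes a hierarchical control framework that combines a reinforcement-learning outer loop with an inner-loop dynamics estimator for direct task-driven control, reducing end-effector tracking error and improving task success rate in hardware experiments.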

AI中文摘要

空中操纵器能够在难以到达的环境中进行物理交互;然而,在快速臂运动、载荷变化及相关未知动态不确定性下,直接全身空中操纵的组合问题在很大程度上仍未解决。我们提出了一种分层控制框架,结合强化学习(RL)与内环动力学估计器来解决这一问题。RL外环将期望的六自由度(DOF)末端执行器目标映射到协调的全身指令,从而实现直接任务驱动控制,而无需在策略层依赖完全准确的耦合动态模型。内环随后跟踪这些指令,同时通过动力学估计器方案在执行过程中补偿瞬态惯性偏移和不确定性,无需系统模型知识。我们通过硬件实验,在变化的载荷条件下,在配备3自由度操纵器的定制四旋翼飞行器上验证了所提出的方法。与RL+PID和RL+INDI+PID基线相比,所提出的方法在测试的硬件条件下降低了末端执行器跟踪误差并提高了任务成功率。这些结果表明,将学习到的全身协调与基于估计器的低层补偿相结合,提高了在变化操作条件下空中操纵的精度和鲁棒性。

英文摘要

Aerial manipulators enable physical interaction in hard-to-reach environments; however, direct whole-body aerial manipulation under rapid arm motion, payload changes, and the associated unknown dynamic uncertainty remains largely unsolved. We present a hierarchical control framework that combines Reinforcement Learning (RL) with an inner-loop dynamics estimator to address this problem. The RL outer loop maps desired 6-degree-of-freedom (DoF) end-effector targets to coordinated whole-body commands, enabling direct task-driven control without relying on a fully accurate coupled dynamic model in the policy layer. An inner loop then tracks these commands while compensating for transient inertial shifts and uncertainty during execution via a dynamics estimator scheme without requiring system model knowledge. We validate the proposed approach on a custom quadrotor equipped with a 3-DoF manipulator through hardware experiments under varying payload conditions. Compared with RL+PID and RL+INDI+PID baselines, the proposed method reduces end-effector tracking error and improves task success rate across the tested hardware conditions. These results show that combining learned whole-body coordination with estimator-based low-level compensation improves the precision and robustness of aerial manipulation under changing operating conditions.

2606.16735 2026-06-16 cs.RO 新提交

Pride and Prejudice: Toward an Information-Theoretic Framework for Mutually Communicative Driver Behavior Modeling

傲慢与偏见:迈向相互通信的驾驶员行为建模的信息论框架

Tingjun Li, Nan Xu, Shuo Feng, Hassan Askari, Bruno Henrique Groenner Barbosa, Konghui Guo

发表机构 * State Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University(吉林大学汽车底盘集成与仿生国家重点实验室) Beijing National Research Center for Information Science and Technology, Tsinghua University(清华大学北京信息科学与技术国家研究中心) Department of Engineering, Brock University(布鲁克大学工程系) Department of Automatics, Federal University of Lavras(拉夫拉斯联邦大学自动化系)

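AI Summary To address the safety and efficiency problems caused by autonomous and human-driven vehicles misreading each other's intentions, proposes an information-theoretic model of implicit mutual communication that combines a Bayesian persuasion game with information-theoretic rewards, reducing the prediction error of mandatory lane changes on the NGSIM dataset by up to 20%.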

Comments 16 pages, 10 figures. Accepted for the IEEE Transactions on Intelligent Transportation Systems (T-ITS), June 2026

AI中文摘要

当自动驾驶车辆(AV)和人类驾驶车辆(HV)误读彼此的意图时,混合自主驾驶会变得不安全且低效。我们将此问题研究为换道中的隐式相互通信。所提出的框架建模了自车如何在认知不确定性下既表达自身意图又探测对方驾驶员的偏好。它结合了用于主动信号传递的带有虚拟特征的k级贝叶斯说服博弈、用于相互通信的信息论奖励以及通信能力的自适应权重。我们进一步引入了Pride-Inquiry (P-I) 和 Pride-Prejudice (P-P) 平面来分析通信强度和倾向。该模型使用基于通信的多智能体逆强化学习算法(C-MIRL)在自然主义NGSIM数据集上进行校准。与非通信基线相比,所提出的模型将强制换道的预测误差降低了高达20%,同时保持了强大的泛化能力。驾驶员在环问卷得分与校准后的通信变量呈正相关,支持了模型的主观有效性。学习到的奖励进一步表明,询问和倾听能力比单纯的骄傲和表达贡献更大,并且询问偏好在不同驾驶员之间变化更强烈。这些结果支持在交互驾驶中对相互通信和认知不确定性进行显式建模。

英文摘要

Mixed autonomy driving becomes unsafe and inefficient when autonomous vehicles (AVs) and human-driven vehicles (HVs) misread each other's intentions. We study this problem as implicit mutual communication in lane changes. The proposed framework models how the ego vehicle both expresses its intent and probes the other driver's preference under epistemic uncertainty. It combines a level-k Bayesian persuasion game with virtual features for proactive signaling, information-theoretic rewards for mutual communication, and adaptive weights of communication affordances. We further introduce the Pride-Inquiry (P-I) and Pride-Prejudice (P-P) planes to analyze communication intensity and tendency. The model is calibrated with a Communication-Based Multi-Agent Inverse Reinforcement Learning algorithm (C-MIRL) on the naturalistic NGSIM dataset. Compared with the non-communicative baseline, the proposed model reduces the prediction error of mandatory lane changes by up to 20% while maintaining strong generalization. Driver-In-the-Loop questionnaire scores are positively correlated with the calibrated communication variables, supporting the subjective validity of the model. The learned rewards further show that inquiry and listening affordances contribute more than pride and expression alone, and that inquiry preference varies more strongly across drivers. These results support explicit modeling of mutual communication and epistemic uncertainty in interactive driving.

2606.14716 2026-06-16 cs.CV cs.AI cs.RO 交叉投稿

RAMS: Resource-Adaptive and Detection-Conditioned Model Switching for Embedded Edge Perception

RAMS: 面向嵌入式边缘感知的资源自适应与检测条件模型切换

Kushal Khemani, Evan Leri, George Xu, Amit Hod

发表机构 * NEXEDGE Research Lab(NEXEDGE研究实验室)

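AI Summary Proposes RAMS, a runtime controller that monitors device pressure, calibrates switching thresholds, and switches dynamically among three YOLOv8 model sizes; it introduces detection-conditioned policies and the VRU-Weighted Accuracy Score, balancing latency and accuracy across several embedded platforms.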

AI中文摘要

嵌入式硬件上的边缘目标检测需要在变化的资源压力下平衡推理延迟和检测质量。我们提出RAMS,一种轻量级运行时控制器,它监控设备压力,从空闲行为校准切换阈值,并在三个驻留的YOLOv8层级(NANO/SMALL/MEDIUM,分辨率320/416/640 px)之间动态选择,无需模型重新加载延迟。RAMS定义了五种切换策略,包括两种检测条件变体,可在最近检测到易受伤道路使用者(VRU)后防止激进的降级。我们进一步引入VRU加权准确率评分(SWAS),一种用于离线策略比较的标量指标,无需真实标注,以及一种基于oracle的变体,用于分离检测器循环性与真正的层级保留收益。在Raspberry Pi 5、x86笔记本电脑和Jetson Orin ONNX/TensorRT部署中,相同的控制器方程在37倍的延迟范围内运行。在重负载下的Jetson Orin TensorRT上,safety2策略实现了3.41毫秒的平均延迟,比固定MEDIUM推理快5.6倍,同时通过接近NANO操作并在VRU阳性窗口期间选择性锁定SMALL和MEDIUM,保留了其代理准确率的74%。与重负载下仅基于阈值的策略相比,检测条件切换在oracle评分下将SWAS提高了25.4%,在检测器衍生评分下提高了47.3%。实时KITTI评估报告了每层级VRU召回率分别为24.2%、41.2%和59.0%,表明反应性覆盖从根本上受限于基线检测器的召回率。

英文摘要

Edge object detection on embedded hardware requires balancing inference latency and detection quality under changing resource pressure. We present RAMS, a lightweight runtime controller that monitors device pressure, calibrates switching thresholds from idle behavior, and dynamically selects among three resident YOLOv8 tiers (NANO/SMALL/MEDIUM at 320/416/640 px) without model-reload latency. RAMS defines five switching policies, including two detection-conditioned variants that prevent aggressive downgrades after recent vulnerable-road-user (VRU) detections. We further introduce the VRU-Weighted Accuracy Score (SWAS), a scalar metric for offline policy comparison without ground-truth annotations, together with an oracle-bounded variant that separates detector circularity from genuine tier-retention benefit. Across Raspberry Pi 5, x86 laptops, and Jetson Orin ONNX/TensorRT deployments, the same controller equations operate over a 37x latency range. On Jetson Orin TensorRT under heavy load, the safety2 policy achieves 3.41 ms mean latency, 5.6x faster than fixed-MEDIUM inference, while retaining 74% of its proxy accuracy through near-NANO operation with selective SMALL and MEDIUM locks during VRU-positive windows. Detection-conditioned switching improves SWAS by 25.4% under oracle scoring and 47.3% under detector-derived scoring relative to threshold-only policies under heavy load. Live KITTI evaluation reports per-tier VRU recall of 24.2%, 41.2%, and 59.0%, showing that reactive overrides are fundamentally limited by baseline detector recall.
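
The detection-conditioned switching idea can be sketched as a small controller. This is an illustrative sketch under stated assumptions, not the RAMS implementation: the class name `TierSwitcher`, the pressure scale in [0, 1], the two thresholds, and the hold length are all hypothetical. Only the three tier names come from the abstract.

```python
TIERS = ["NANO", "SMALL", "MEDIUM"]  # tier names taken from the abstract

class TierSwitcher:
    """Threshold-based tier choice with a floor after a recent VRU detection."""

    def __init__(self, low=0.5, high=0.8, hold_frames=10, floor_tier="SMALL"):
        self.low = low                  # hypothetical: below this, run MEDIUM
        self.high = high                # hypothetical: at or above this, run NANO
        self.hold_frames = hold_frames  # frames (counting the detection frame) the floor applies
        self.floor = TIERS.index(floor_tier)
        self.hold_left = 0

    def select(self, pressure, vru_detected):
        # Threshold-only part: higher device pressure selects a smaller model.
        if pressure >= self.high:
            choice = 0
        elif pressure >= self.low:
            choice = 1
        else:
            choice = 2
        # Detection-conditioned part: a VRU detection restarts the hold window,
        # otherwise the window counts down by one frame.
        if vru_detected:
            self.hold_left = self.hold_frames
        elif self.hold_left > 0:
            self.hold_left -= 1
        # While the window is open, never drop below the floor tier.
        if self.hold_left > 0:
            choice = max(choice, self.floor)
        return TIERS[choice]

# Usage: under sustained high pressure, a detection lifts the tier for two frames.
switcher = TierSwitcher(hold_frames=2)
for pressure, vru in [(0.9, False), (0.9, True), (0.9, False), (0.9, False)]:
    print(switcher.select(pressure, vru))
```

Keeping all three models resident, as the abstract describes, is what makes such a per-frame choice cheap; the sketch only models the decision, not the inference.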

2606.16558 2026-06-16 cs.AI cs.RO cs.SY eess.SY 交叉投稿

ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning

ROSA-RL:基于强化学习的不确定性感知环岛优化速度建议

Anna-Lena Schlamp, Jeremias Gerner, Klaus Bogenberger, Werner Huber, Stefanie Schmidtner

发表机构 * Universität der Bundeswehr München(慕尼黑联邦国防军大学) Hochschule für angewandte Wissenschaften Landshut(兰茨胡特应用科学大学)

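AI Summary To handle uncertainty at roundabouts in mixed traffic, proposes the ROSA-RL framework, which combines Transformer-based prediction of conflict-zone occupancy probability with reinforcement learning to coordinate roundabout entry speeds safely and efficiently.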

Comments 8 pages, 2 figures, 2 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version

AI中文摘要

环岛在混合交通中对自动驾驶构成挑战,因为异质且非确定性的人类行为、未知的驾驶意图以及高交互复杂性使得在进入时刻冲突区域是被阻塞还是可用存在不确定性。我们提出ROSA-RL——基于强化学习的不确定性感知环岛优化速度建议。它通过概率冲突预测,实现混合交通中自动驾驶和人类驾驶车辆的安全高效环岛进入。一个基于Transformer的模型预测未来五秒内的冲突区域占用情况,捕捉多智能体交互以预测即将发生的冲突和可用间隙。预测输出编码了未来运动和意图的不确定性,并增强经典强化学习框架的状态,实现不确定性感知的速度协调。在基于真实世界数据的仿真评估中,ROSA-RL能有效处理不确定性,并优于基于模型的基线方法,缩小了与假设完全已知占用的理想设置之间的差距,同时提高了交通效率和安全性。本工作的源代码可在github.com/urbanAIthi/ROSA-RL获取。

英文摘要

Roundabouts challenge automated driving in mixed traffic, as heterogeneous and non-deterministic human behavior, unknown driving intentions, and high interaction complexity create uncertainty about whether the conflict zone will be blocked or available at the moment of entry. We present ROSA-RL -- uncertainty-aware Roundabout Optimized Speed Advisory with Reinforcement Learning. It enables safe and efficient roundabout entry for automated and human-driven vehicles in mixed traffic through probabilistic conflict forecasting. A Transformer-based model predicts conflict zone occupancy over a five-second horizon, capturing multi-agent interactions to anticipate upcoming conflicts and available gaps. The prediction outputs encode uncertainty in future motion and intent, and augment the state of a classical RL framework, enabling uncertainty-aware speed coordination. Evaluated in simulations grounded in real-world data, ROSA-RL can effectively handle uncertainty and outperform a comparable model-based baseline, closing the gap to an ideal setting assuming fully known occupancy while improving traffic efficiency and safety. The source code of this work is available under: github.com/urbanAIthi/ROSA-RL.

2411.18714 2026-06-16 cs.RO cs.AI cs.LG 版本更新

Explainable deep learning improves human mental models of self-driving cars

可解释深度学习提升人类对自动驾驶汽车的心理模型

Eoin M. Kenny, Akshay Dharmavaram, Sang Uk Lee, Tung Phan-Minh, Shreyas Rajesh, Yunqing Hu, Laura Major, Momchil S. Tomov, Julie A. Shah

发表机构 * Computer Science & Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology(计算机科学与人工智能实验室(CSAIL),麻省理工学院) Motional AD Inc.(Motional AD公司) Department of Psychology and Center for Brain Science, Harvard University(心理学系和大脑科学中心,哈佛大学) Department of Aeronautics and Astronautics, Massachusetts Institute of Technology(航空与宇航系,麻省理工学院)

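AI Summary Proposes the Concept-Wrapper Network (CW-Net), which enables explainable planning on a real self-driving car; its causally grounded concept explanations improve drivers' ability to predict vehicle behavior, particularly in surprising situations.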

Comments MST & JAS contributed equally to this work

AI中文摘要

自动驾驶汽车越来越依赖深度神经网络来实现类人驾驶。这种黑箱规划器的不透明性使得准确预测其何时会失败变得具有挑战性,可能带来灾难性后果。尽管关于解释这些系统的研究激增,但由于实际部署的困难,大部分研究局限于模拟或玩具设置,使得这些技术的实际效用未知。在此,我们引入概念包装网络(CW-Net),一种忠实解释基于机器学习的规划器行为的方法,该方法在不牺牲性能的情况下,将其推理因果地扎根于人类可解释的概念。我们在真实自动驾驶车上部署CW-Net,并表明由此产生的解释改善了人类驾驶员对车辆的心理模型,使他们能够更好地预测其行为,特别是在意外情况下。这表明,集成到自动驾驶汽车中的可解释深度学习在现实部署环境中既易于理解又有用。我们预计我们的方法可以应用于其他安全关键系统,如自主无人机和机器人外科医生,以及其他架构,如端到端学习系统和视觉-语言-动作模型。总体而言,我们的研究为自主代理的可解释性建立了一条经过部署验证的路径,这可能有助于使其更加透明和安全。

英文摘要

Self-driving cars increasingly rely on deep neural networks to achieve human-like driving. The opacity of such black-box planners makes it challenging to accurately anticipate when they will fail, with potentially catastrophic consequences. While research into interpreting these systems has surged, most of it is confined to simulations or toy setups due to the difficulty of real-world deployment, leaving the practical utility of such techniques unknown. Here, we introduce the Concept-Wrapper Network (CW-Net), a method for faithfully explaining the behavior of machine-learning-based planners that causally grounds their reasoning in human-interpretable concepts without sacrificing performance. We deploy CW-Net on a real self-driving car and show that the resulting explanations improve the human driver's mental model of the vehicle, allowing them to better predict its behavior, particularly in surprising situations. This demonstrates that explainable deep learning integrated into self-driving cars can be both understandable and useful in a realistic deployment setting. We anticipate our method could be applied to other safety-critical systems, such as autonomous drones and robotic surgeons, as well as to other architectures, such as end-to-end learning systems and vision-language-action models. Overall, our study establishes a deployment-validated pathway to interpretability for autonomous agents, which could help make them more transparent and safe.

2501.04988 2026-06-16 cs.RO cs.SY eess.SY 版本更新

Intelligent Sailing Model for Open Sea Navigation

公海航行智能航行模型

Hanna Krasowski, Stefan Schärdinger, Murat Arcak, Matthias Althoff

发表机构 * University of California, Berkeley(加州大学伯克利分校) Technical University of Munich(慕尼黑技术大学)

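AI Summary Proposes the first intelligent sailing model (ISM), which simulates vessels that comply with maritime traffic rules and uses model predictive control for waypoint tracking, achieving goal-reaching rates of about 97% with no collisions in interactive simulations.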

AI中文摘要

自主船舶有望提升海上贸易的安全性和可靠性。为促进自主船舶的发展,需要仿真来模拟与其他船舶的真实交互。然而,由于环境非结构化、交通规则粗略以及船舶类型差异大,模拟真实的交互式海上交通具有挑战性。目前,尚无用于严格基准测试自主船舶算法的交互式海上环境仿真标准。本文首次提出智能航行模型(ISM),该模型模拟公海航行中遵守规则的船舶。ISM船舶根据海上交通规则对其他交通参与者做出反应,同时解决由航点表征的运动规划任务。具体而言,ISM监控适用规则,据此生成合规航点,并利用模型预测控制跟踪航点。我们在两种环境中评估ISM:仅含ISM船舶的交互式交通,以及混合交通(部分船舶轨迹来自记录的真实海上交通数据或为关键场景手工设计)。结果表明,包含多种类型ISM船舶的仿真具有规则合规性和可扩展性。我们测试了4,049个关键交通场景。在ISM船舶的交互式交通中,未发生碰撞,目标到达率约为97%。

英文摘要

Autonomous vessels potentially enhance safety and reliability of seaborne trade. To facilitate the development of autonomous vessels, simulations are required to model realistic interactions with other vessels. However, modeling realistic interactive maritime traffic is challenging due to the unstructured environment, coarsely specified traffic rules, and largely varying vessel types. Currently, there is no standard for simulating interactive maritime environments in order to rigorously benchmark autonomous vessel algorithms. In this paper, we introduce the first intelligent sailing model (ISM), which simulates rule-compliant vessels for navigation on the open sea. An ISM vessel reacts to other traffic participants according to maritime traffic rules while at the same time solving a motion planning task characterized by waypoints. In particular, the ISM monitors the applicable rules, generates rule-compliant waypoints accordingly, and utilizes a model predictive control for tracking the waypoints. We evaluate the ISM in two environments: interactive traffic with only ISM vessels and mixed traffic where some vessel trajectories are from recorded real-world maritime traffic data or handcrafted for criticality. Our results show that simulations with many ISM vessels of different vessel types are rule-compliant and scalable. We tested 4,049 critical traffic scenarios. For interactive traffic with ISM vessels, no collisions occurred while goal-reaching rates of about 97 percent were achieved.

2510.12560 2026-06-16 cs.CV cs.LG cs.RO 版本更新

CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving

CoIRL-AD:面向自动驾驶的潜在世界模型中的协作-竞争模仿-强化学习

Xiaoji Zheng, Ziyuan Yang, Yanhao Chen, Yuhang Peng, Yuanrong Tang, Gengyuan Liu, Bokui Chen, Jiangtao Gong

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

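AI Summary Proposes the CoIRL-AD framework, which decouples imitation learning from reinforcement learning, uses a latent world model for long-horizon reward estimation, and adds a competition mechanism to improve the robustness of autonomous driving under offline training, with particularly strong results in cross-city generalization and long-tail scenarios.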

Comments 19 pages, 22 figures, ICML 2026

AI中文摘要

基于模仿学习(IL)训练的端到端自动驾驶模型通常泛化能力较差,尤其是在专家演示稀疏的长尾场景中。强化学习(RL)可以提供互补的任务级监督,但在没有交互模拟器的离线设置中,将RL应用于真实世界的自动驾驶具有挑战性,因为数据集主要由专家动作主导,行为多样性有限。我们提出CoIRL-AD,一个竞争性的双策略框架,在统一的离线训练机制下整合IL和RL。CoIRL-AD将模仿和奖励优化解耦到不同的智能体中,以缓解目标冲突,使用想象的未来轨迹进行长时程奖励估计,并引入竞争机制,选择性地传递有益行为,同时使RL保持与专家驾驶行为一致。在nuScenes基准上的实验表明,CoIRL-AD在强IL基线上持续提升鲁棒性,尤其在跨城市泛化和长尾场景中取得了显著改进。代码可在以下网址获取:https://github.com/SEU-zxj/CoIRL-AD。

英文摘要

End-to-end autonomous driving models trained with imitation learning (IL) often generalize poorly, particularly in long-tail scenarios where expert demonstrations are sparse. Reinforcement learning (RL) can provide complementary task-level supervision, but applying RL to real-world autonomous driving is challenging in offline settings without interactive simulators, where datasets are dominated by expert actions and provide limited behavioral diversity. We propose CoIRL-AD, a competitive dual-policy framework that integrates IL and RL under a unified offline training regime. CoIRL-AD decouples imitation and reward optimization into separate actors to alleviate objective conflicts, uses imagined future rollouts for long-horizon reward estimation, and introduces a competition mechanism that selectively transfers beneficial behaviors while keeping RL anchored to expert-like driving. Experiments on the nuScenes benchmark show that CoIRL-AD consistently improves robustness over strong IL-based baselines, with especially large gains in cross-city generalization and long-tail scenarios. Code is available at: https://github.com/SEU-zxj/CoIRL-AD.

2512.24838 2026-06-16 cs.CV cs.RO 版本更新

CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture

CropTrack: 面向精准农业的跟踪与重识别框架

Md Ahmed Al Muzaddid, Jordan A. James, William J. Beksi

发表机构 * Department of Computer Science and Engineering, The University of Texas at Arlington(计算机科学与工程系,德克萨斯大学阿灵顿分校)

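AI Summary To address tracking difficulties caused by similar object appearances and frequent occlusions in agricultural scenes, proposes CropTrack, an MOT framework that combines appearance and motion information; reranking-enhanced appearance association, one-to-many association with conflict resolution, and an exponential moving average prototype feature bank significantly improve identity preservation and association accuracy.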

Comments 8 pages, 5 figures, and 4 tables

AI中文摘要

农业环境中的多目标跟踪(MOT)由于重复模式、相似物体外观、突然光照变化和频繁遮挡而面临重大挑战。该领域的当代跟踪器依赖物体运动而非外观进行关联。然而,当目标经历频繁且强烈的遮挡时,它们难以维持物体身份。物体外观的高度相似性使得在农业场景中集成基于外观的关联变得非平凡。为解决此问题,我们提出CropTrack,一种基于外观和运动信息结合的新型MOT框架。CropTrack集成了重排序增强的外观关联、基于外观冲突解决策略的一对多关联以及指数移动平均原型特征库,以改进基于外观的关联。在公开可用的农业MOT数据集上评估,CropTrack展示了一致的身份保持,优于传统的基于运动的跟踪方法。与现有技术相比,CropTrack在关联准确性和识别精度得分上取得了显著提升,同时身份切换次数更低。

英文摘要

Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in association accuracy and identification precision scores with a lower number of identity switches.
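
One component named above, the exponential moving average prototype feature bank, can be sketched in a few lines. This is a generic illustration of the technique, not CropTrack's code: the class name `PrototypeBank`, the momentum value, and the use of cosine similarity on unit-length features are assumptions.

```python
import math

class PrototypeBank:
    """One appearance prototype per track, smoothed by an exponential moving average."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # hypothetical weight given to the stored prototype
        self.prototypes = {}      # track id -> unit-length feature vector

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm > 0 else list(vec)

    def update(self, track_id, feature):
        """Blend a new appearance feature into the track's prototype."""
        feature = self._normalize(feature)
        old = self.prototypes.get(track_id)
        if old is None:
            # First observation of this track becomes its prototype.
            self.prototypes[track_id] = feature
            return
        m = self.momentum
        mixed = [m * o + (1.0 - m) * f for o, f in zip(old, feature)]
        self.prototypes[track_id] = self._normalize(mixed)

    def similarity(self, track_id, feature):
        """Cosine similarity between a candidate feature and the stored prototype."""
        proto = self.prototypes[track_id]
        feature = self._normalize(feature)
        return sum(p * f for p, f in zip(proto, feature))

# Usage: a single noisy observation moves the prototype only slightly.
bank = PrototypeBank(momentum=0.9)
bank.update(7, [1.0, 0.0])
bank.update(7, [0.0, 1.0])
print(round(bank.similarity(7, [1.0, 0.0]), 3))
```

The smoothing is what makes the prototype tolerant of a briefly occluded or badly lit detection, which is the failure mode the abstract highlights for agricultural scenes.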

2602.07343 2026-06-16 cs.CV cs.AI cs.LG cs.RO 版本更新

Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

通过文字看道路:一种语言引导的RGB-T驾驶场景分割框架

Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy

发表机构 * National University of Singapore(新加坡国立大学) University of Technology Sydney(悉尼科技大学)

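AI Summary Proposes the CLARITY framework, which uses vision-language model priors to adapt the RGB-T fusion strategy dynamically and adds dark-object semantics preservation and a hierarchical decoder, reaching a new state of the art of 62.3% mIoU and 77.5% mAcc on the MFNet dataset.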

AI中文摘要

在恶劣光照、照明和阴影条件下,道路场景的鲁棒语义分割仍然是自动驾驶应用的核心挑战。RGB-热融合是一种标准方法,但现有方法在所有条件下统一应用静态融合策略,导致模态特定噪声在网络中传播。因此,我们提出CLARITY,它根据检测到的场景条件动态调整融合策略。在视觉语言模型(VLM)先验的引导下,网络学习根据光照状态调节每种模态的贡献,同时利用对象嵌入进行分割,而不是应用固定的融合策略。我们进一步引入了两种机制:一种保留有效的暗对象语义,这些语义在先前的噪声抑制方法中被错误丢弃;另一种是层次化解码器,它在不同尺度上强制结构一致性,以锐化薄对象的边界。在MFNet数据集上的实验表明,CLARITY建立了新的最先进水平(SOTA),实现了62.3%的mIoU和77.5%的mAcc。

英文摘要

Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state of the art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.

2602.14780 2026-06-16 cs.MA cs.CY cs.RO cs.SY eess.SY 版本更新

ROSA: Roundabout Optimized Speed Advisory with Multi-Agent Trajectory Prediction in Multimodal Traffic

ROSA: 多模式交通中基于多智能体轨迹预测的环岛优化速度建议

Anna-Lena Schlamp, Jeremias Gerner, Klaus Bogenberger, Werner Huber, Stefanie Schmidtner

发表机构 * IEEE

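AI Summary Proposes the ROSA system, which combines Transformer-based multi-agent trajectory prediction with coordinated speed guidance to improve the efficiency and safety of multimodal mixed traffic at roundabouts, with prediction accuracy exceeding prior work.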

Comments 8 pages, 1 figure, 4 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2025 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version

AI中文摘要

我们提出ROSA(环岛优化速度建议),一个结合多智能体轨迹预测与协调速度引导的系统,用于环岛处的多模式混合交通。使用基于Transformer的模型,ROSA联合预测环岛处车辆和弱势道路使用者(VRU)的未来轨迹。该模型针对单步预测训练并自回归部署,生成确定性输出,从而实现可操作的速度建议。结合运动动力学,模型在五秒预测范围内实现了高精度(ADE: 1.29m, FDE: 2.99m),超越了先前工作。添加路线意图进一步提升了性能(ADE: 1.10m, FDE: 2.36m),展示了网联车辆数据的价值。基于与VRU和环岛内车辆的预测冲突,ROSA为接近和进入环岛的车辆提供实时、主动的速度建议。尽管存在预测不确定性,ROSA显著提升了车辆效率和安全性,甚至从VRU视角对感知安全性也有积极影响。本工作的源代码可在以下网址获取:github.com/urbanAIthi/ROSA。

英文摘要

We present ROSA -- Roundabout Optimized Speed Advisory -- a system that combines multi-agent trajectory prediction with coordinated speed guidance for multimodal, mixed traffic at roundabouts. Using a Transformer-based model, ROSA jointly predicts the future trajectories of vehicles and Vulnerable Road Users (VRUs) at roundabouts. Trained for single-step prediction and deployed autoregressively, it generates deterministic outputs, enabling actionable speed advisories. Incorporating motion dynamics, the model achieves high accuracy (ADE: 1.29m, FDE: 2.99m at a five-second prediction horizon), surpassing prior work. Adding route intention further improves performance (ADE: 1.10m, FDE: 2.36m), demonstrating the value of connected vehicle data. Based on predicted conflicts with VRUs and circulating vehicles, ROSA provides real-time, proactive speed advisories for approaching and entering the roundabout. Despite prediction uncertainty, ROSA significantly improves vehicle efficiency and safety, with positive effects even on perceived safety from a VRU perspective. The source code of this work is available under: github.com/urbanAIthi/ROSA.
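
The ADE and FDE figures quoted above are standard trajectory-prediction metrics: the average displacement error is the mean Euclidean distance between predicted and true positions over the horizon, and the final displacement error is that distance at the last step. A minimal sketch for a single agent follows; the function name `ade_fde` and the toy trajectories are illustrative and not taken from the ROSA code base.

```python
import math

def ade_fde(pred, truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) points in metres."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equal in length")
    # Per-step Euclidean distance between prediction and ground truth.
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(errors) / len(errors)  # mean over the whole horizon
    fde = errors[-1]                 # error at the final step only
    return ade, fde

# Usage with a toy three-step trajectory whose error grows by 1 m per step.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(ade_fde(pred, truth))  # (1.0, 2.0)
```

Because errors usually accumulate over an autoregressive rollout, FDE tends to exceed ADE, which matches the pattern in the reported numbers.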

9. Soft Robotics and Hardware Design (9 papers)

2606.15028 2026-06-16 cs.RO 新提交

An Autonomous Subgram SMA-Based Swimmer

基于SMA的亚克级自主游泳器

Conor K. Trygstad, Francisco M. F. R. Gonçalves, Néstor O. Pérez-Arancibia

发表机构 * Washington State University(华盛顿州立大学)

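AI Summary Proposes Swima, a 900-mg bioinspired swimmer propelled by high-work-density actuators driven by shape-memory alloy wires, with integrated onboard power and computation. It swims autonomously for more than 18 min at up to 22.4 mm/s, turns at up to 14°/s, and tracks heading with an RMS error of about 6.5°, making it the first subgram microswimmer with onboard power, actuation, and computation.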

Comments Under review, 6 pages, 5 figures

AI中文摘要

我们介绍了Swima,一种仿生900毫克游泳器,由两个10毫克高功密度(HWD)执行器驱动,这些执行器由形状记忆合金(SMA)线驱动。通过使用定制印刷电路板(PCB)和11毫安时3.7伏507毫克单节锂离子(Li-Ion)电池,我们集成了机载电源和计算,从而实现超过18分钟的自主游泳。Swima可以以高达22.4毫米/秒(0.56体长/秒)的速度游泳,达到高达14°/秒的转弯速率,并且能够在多次测试中跟随0度航向参考轨迹,跟踪误差的均方根(RMS)值约为6.5°。该机器人是迄今为止开发的首个具有机载电源、驱动和计算的亚克级微型游泳器。

英文摘要

We present the Swima, a bioinspired 900-mg swimmer propelled by two 10-mg high-work-density (HWD) actuators driven by shape-memory alloy (SMA) wires. We integrated onboard power and computation by using a custom-built printed circuit board (PCB) and an 11-mAh 3.7-V 507-mg single-cell lithium-ion (Li-Ion) battery, which in conjunction enable autonomous swimming in excess of 18 min. The Swima can swim at speeds of up to 22.4 mm/s (0.56 Bl/s), achieves turning rates of up to 14°/s, and can follow 0-degree heading reference trajectories with root mean square (RMS) values of tracking errors of about 6.5° across multiple tests. This robot is the first subgram microswimmer with onboard power, actuation, and computation developed to date.
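
A few quantities follow directly from the figures quoted in this abstract. The short calculation below is a back-of-the-envelope sketch: the input numbers are from the abstract, while the derived values (body length, battery mass fraction, nominal energy, and an upper bound on average current) are inferences that assume the nominal battery rating and are not reported by the authors.

```python
# Figures quoted in the abstract.
speed_mm_s = 22.4        # top swimming speed
speed_bl_s = 0.56        # the same speed in body lengths per second
robot_mass_mg = 900.0
battery_mass_mg = 507.0
capacity_mah = 11.0
voltage_v = 3.7
runtime_min = 18.0       # the abstract says "in excess of" this value

# Body length implied by the two speed figures: about 40 mm.
body_length_mm = speed_mm_s / speed_bl_s

# Share of total mass taken by the battery: a little over half.
battery_mass_fraction = battery_mass_mg / robot_mass_mg

# Nominal stored energy: about 40.7 mWh.
energy_mwh = capacity_mah * voltage_v

# If at most the nominal capacity is used over at least 18 min,
# the average current cannot exceed roughly 36.7 mA.
max_avg_current_ma = capacity_mah / (runtime_min / 60.0)

print(round(body_length_mm, 1), round(battery_mass_fraction, 3),
      round(energy_mwh, 1), round(max_avg_current_ma, 1))
```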

2606.15068 2026-06-16 cs.RO 新提交

Design and Fabrication of a Spin Coater with In-Situ Optical Measurement for Soft Thin Films

用于软薄膜的原位光学测量旋涂机的设计与制造

Daniel Gliksberg, Jiajie Qiu, Jun Suzuki, Kamal Youcef-Toumi

发表机构 * The Japan Steel Works, LTD.(日本制钢所)

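AI Summary To address the difficulty of measuring the thickness of soft elastomer films, designs a low-cost 3D-printed spin coater with an integrated in-situ laser-reflection optical thickness measurement system, enabling thickness control for 50 to 300 micron films with a resolution of 3.6 to 3.7 microns.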

Comments 8 pages, 7 figures, 5 tables. To be published in the conference proceedings for AIM 2026

AI中文摘要

旋涂广泛用于聚合物和弹性体薄膜的制造,但由于接触式测量的变形以及传统光学计量成本高、复杂度大,高柔性材料的可靠厚度验证仍然具有挑战性。在介电弹性体致动器等软弹性应用中,精确的厚度控制尤为关键,因为机械和功能性能与薄膜厚度密切相关。本文提出了一种低成本的、主要采用3D打印的台式旋涂机,集成了最小变形的光学厚度测量系统,用于软薄膜制备流程。该系统设计用于制造厚度在50至300微米之间的薄膜,重复性在10微米以内。通过四象限光电探测器跟踪反射激光束的位移,实现原位厚度测量,避免了显著变形。讨论了光学几何、传感器线性约束以及通过有限元分析进行的结构验证。使用校准金属垫片的实验验证显示厚度分辨率为3.6-3.7微米,最佳情况下的测量重复性为13微米(95%置信区间)。该平台可重复生产厚度在目标值9微米以内的硅胶薄膜,表明可访问的光学计量可以集成到低成本旋涂系统中,用于无需专门工业仪器的、厚度可控的柔性薄膜实际制造。

英文摘要

Spin coating is widely used for fabrication of thin polymer and elastomer films, yet reliable thickness verification of highly compliant materials remains challenging due to deformation from contact-based measurements and the cost and complexity of conventional optical metrology. Accurate thickness control is especially critical in soft elastomer applications such as dielectric elastomer actuators (DEAs), where mechanical and functional performance scales strongly with film thickness. This work presents a low-cost, primarily 3D-printed benchtop spin coater with an integrated, minimally deforming optical thickness measurement system for soft-film fabrication workflows. The system is designed to manufacture films between 50 and 300 microns thick with repeatability within 10 microns. Thickness is measured in-situ by tracking displacement of a reflected laser beam via quadrant photodetector, avoiding significant deformation. Optical geometry, sensor linearity constraints, and structural validation via finite element analysis are discussed. Experimental validation using calibrated metal shims demonstrated a thickness resolution of 3.6-3.7 microns and best-case measurement repeatability of 13 microns (95 percent confidence interval). The platform repeatably produced silicone films within 9 microns of target thickness, demonstrating that accessible optical metrology can be integrated into a low-cost spin coating system for practical, thickness-controlled fabrication of compliant thin films without specialized industrial instrumentation.

2606.15645 2026-06-16 cs.RO 新提交

TO-SoFiT: Topology Optimization of Hydraulic Soft Fish Tail Design for programmable undulating locomotion

TO-SoFiT: 用于可编程波动运动的液压软鱼尾拓扑优化设计

A Padmaprabhan, Amal Shaji, Prabhat Kumar

发表机构 * Indian Institute of Technology Hyderabad(印度理工学院海得拉巴分校)

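AI Summary Proposes a topology optimization method to design a hydraulic soft fish tail automatically, balancing deformation efficiency, fluid-structure interaction, manufacturability, and stiffness; the result achieves tunable undulation amplitude and multi-axis bending and outperforms a conventional rectangular tail.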

Comments Accepted for publication at the Advances in Robotics (AIR), 2025, IIT Jodhpur

AI中文摘要

软体机器人利用柔性材料通过受控弹性变形产生运动,使其非常适合水下探测和仿生海洋系统等精细任务。尽管液压/气动驱动对此类系统仍然至关重要,但缺乏系统化的设计框架阻碍了能够实现复杂三维运动(如鱼类游泳)的机器人开发。本文引入一种拓扑优化方法来自动设计液压软鱼尾,明确处理流体驱动与结构变形之间的设计依赖耦合。我们使用基于达西定律的模型,并增加排水项来模拟空间变化的液压压力载荷,通过有限元分析将其转化为一致的节点力。采用的鲁棒多准则优化公式平衡了变形效率、流固耦合、几何可制造性和所需刚度,以优化用于三维游泳运动学的仿生软鱼尾。优化后的尾鳍拓扑被集成到气动网络驱动器中,并在各种液压载荷下进行计算验证,实现了可调波动幅度和用于深度调节的多轴弯曲。优化的二维尾鳍优于其矩形对应物。通过级联优化的尾鳍段,我们展示了在不同液压载荷下软体机器鱼尾的可编程游泳模式。这项工作推进了液压驱动器和软结构的系统化协同设计,为在受限水生环境中自动化设计具有优化设计和脊椎动物般灵活性的水下机器人提供了途径。我们的实现和模拟公开于 https://github.com/PrabhatIn/TO-SoFiT。

英文摘要

Soft robots leverage compliant materials to generate motion through controlled elastic deformation, making them ideal for delicate tasks such as underwater exploration and biomimetic marine systems. Although hydraulic/pneumatic actuation remains pivotal for such systems, the lack of systematic design frameworks has hindered the development of robots capable of complex 3D motion, such as fish-like swimming. This work introduces a topology optimization method to automate the design of a hydraulic soft fish tail, explicitly addressing the design-dependent coupling between fluidic actuation and structural deformation. We use a Darcy law-based model augmented with a drainage term to simulate spatially varying hydraulic pressure loads, translating these into consistent nodal forces via finite element analysis. The employed robust multi-criteria optimization formulation balances deformation efficiency, fluid-structure interaction, geometric manufacturability, and required stiffness for optimizing a bioinspired soft fish tail for 3D swimming kinematics. The optimized tail topology is incorporated into a pneumatic network actuator and computationally validated under various hydraulic loads, achieving tunable undulatory amplitudes and multiaxis bending for depth adjustment. The optimized 2D tail outperforms its rectangular counterpart. By cascading optimized tail segments, we demonstrate programmable swimming patterns in soft robotic fish tails at different hydraulic loads. This work advances the systematic codesign of hydraulic actuators and soft structures, offering a pathway to automate underwater robots with optimized design and vertebrate-like agility in confined aquatic environments. Our implementations and simulations are publicly available at 'https://github.com/PrabhatIn/TO-SoFiT'.

2606.15915 2026-06-16 cs.RO 新提交

Identification of a Physics-Based Electrical Power Consumption Model for the Unitree G1 Humanoid Arm

基于物理的Unitree G1人形机器人手臂电力消耗模型辨识

Nestor N. Deniz, Sebastian Vega, Simon Parsons, Fernando Auat Cheein

发表机构 * Harper Adams University(哈珀亚当斯大学) Lincoln Institute for Agri-Food Technology(林肯农业食品技术研究所) Lincoln Centre for Autonomous Systems(林肯自主系统中心)

AI总结 提出一种基于物理的参数线性模型,用于预测Unitree G1人形机器人左臂的电力消耗,通过实验数据辨识参数,在897条轨迹上达到R²=0.933,并在未见速度轨迹上验证泛化能力。

详情
AI中文摘要

精确预测电力消耗对于电池供电人形机器人的能量感知运动规划、电池管理和热监测至关重要。本文提出了一个基于物理的参数线性模型,用于Unitree G1人形机器人七自由度左臂的电力消耗。所提出的公式将执行器损耗项与基线扭矩校正相结合,该校正捕捉重力补偿负载的变化,并能够准确预测净功率为负的轨迹。引入成对交互项来模拟多关节同时运动期间的功率耦合。模型参数从物理Unitree G1上收集的实验数据中辨识,使用板载功率测量作为回归目标。在覆盖单关节和协调手臂运动、多个速度水平的897条轨迹上,辨识所得模型实现了$R^2 = 0.933$,RMSE为1.07 W。在46条以先前未见速度执行的轨迹上验证,得到$R^2 = 0.965$,显示出在辨识数据集之外的强泛化能力。对辨识参数的分析揭示了手臂各关节不同的功耗特性:粘性摩擦主导大多数关节(肩部俯仰和所有三个腕关节),铜损主导肩部偏航和肘部,而肩部滚动则独特地由库仑摩擦主导。

英文摘要

Accurate prediction of electrical power consumption is essential for energy-aware motion planning, battery management, and thermal monitoring in battery-powered humanoid robots. This letter presents a physics-based, linear-in-parameters model for the electrical power consumption of the seven-degree-of-freedom left arm of the Unitree G1 humanoid robot. The proposed formulation combines actuator loss terms with a baseline-torque correction that captures changes in gravity-compensation load and enables accurate prediction of negative net power trajectories. Pairwise interaction terms are introduced to model power coupling during simultaneous multi-joint motion. Model parameters are identified from experimental data collected on a physical Unitree G1 using onboard power measurements as the regression target. Across 897 trajectories covering single-joint and coordinated arm motions at multiple speed levels, the identified model achieves $R^2 = 0.933$ with an RMSE of 1.07 W. Validation on 46 trajectories executed at previously unseen speeds yields $R^2 = 0.965$, demonstrating strong generalisation beyond the identification dataset. Analysis of the identified parameters reveals distinct power-consumption characteristics across the arm, with viscous friction dominating most joints (shoulder pitch and all three wrist joints), copper losses dominating shoulder yaw and the elbow, and shoulder roll uniquely dominated by Coulomb friction.
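
Because the model is linear in its parameters, identification reduces to ordinary least squares on a regressor matrix. The sketch below illustrates that idea for a single joint only. The regressor set (torque squared for copper loss, speed squared for viscous friction, absolute speed for Coulomb friction, plus a constant) is an assumption chosen for illustration; the paper's baseline-torque correction and pairwise interaction terms are not reproduced, and all data here are synthetic.

```python
import numpy as np


def regressors(tau, omega):
    """Build an illustrative single-joint regressor matrix (assumed form).

    Columns: tau^2 (copper loss), omega^2 (viscous friction power),
    |omega| (Coulomb friction power), 1 (constant baseline power).
    """
    return np.column_stack([tau**2, omega**2, np.abs(omega), np.ones_like(tau)])


def identify(tau, omega, power):
    """Least-squares fit of the parameter vector, with R^2 and RMSE."""
    phi = regressors(tau, omega)
    theta, *_ = np.linalg.lstsq(phi, power, rcond=None)
    residual = power - phi @ theta
    r2 = 1.0 - np.sum(residual**2) / np.sum((power - power.mean()) ** 2)
    rmse = np.sqrt(np.mean(residual**2))
    return theta, r2, rmse


# Synthetic data standing in for measured torque, speed and power.
rng = np.random.default_rng(0)
tau = rng.uniform(-5.0, 5.0, 2000)        # joint torque, N*m
omega = rng.uniform(-2.0, 2.0, 2000)      # joint speed, rad/s
true_theta = np.array([0.8, 1.5, 0.6, 12.0])  # hypothetical parameters
power = regressors(tau, omega) @ true_theta + rng.normal(0.0, 0.2, 2000)

theta, r2, rmse = identify(tau, omega, power)
print("theta:", theta, "R^2:", r2, "RMSE [W]:", rmse)
```

Comparing the magnitude of each fitted term over a trajectory is one simple way to see which loss mechanism dominates a given joint.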

2606.15997 2026-06-16 cs.RO cs.SY eess.SY 新提交

Friction Characterization of a Cable-Driven Differential Actuation System for Lower-Limb Exoskeletons

下肢外骨骼用缆绳驱动差动驱动系统的摩擦特性

Alberto Maria Nobili, Fabio Salsedo, Alessandro Filippeschi

发表机构 * Institute of Mechanical Intelligence, Department of Excellence in Robotics and AI, Sant'Anna School of Advanced Studies(机械智能研究所、机器人及人工智能卓越系,圣安娜高等研究学院) Wearable Robotics s.r.l.(可穿戴机器人有限公司)

AI总结 提出一种用于下肢外骨骼髋-膝关节屈伸的差动驱动架构,通过电机与关节间的线性差动映射实现协同扭矩共享,并开发基于模型的摩擦估计策略实现无传感器扭矩估计。

Comments Accepted for presentation IEEE RAS/EMBS 11th International Conference on Biomedical Robotics and Biomechatronics

详情
AI中文摘要

下肢外骨骼需要能够提供精确关节扭矩控制同时保持低质量和低负担的驱动系统。传统架构通常依赖独立驱动关节和关节级扭矩传感器,增加了系统复杂性和重量。本文提出一种用于髋-膝关节屈伸的新型差动驱动架构,通过电机与关节间的线性差动映射实现两个电机之间的协同扭矩共享。为了补偿传动损失,开发并实验实现了一种基于模型的摩擦估计策略,允许在无需扭矩传感器的情况下进行精确的关节扭矩估计。所提出的解决方案在物理原型上进行了验证,证明了在下肢外骨骼的差动驱动髋-膝模块中无传感器扭矩估计的可行性。

英文摘要

Lower-limb exoskeletons require actuation systems that can provide accurate joint torque control while preserving low mass and encumbrance. Conventional architectures often rely on independently actuated joints and joint-level torque sensors, increasing system complexity and weight. This paper presents a novel differential actuation architecture for hip-knee flexion/extension, enabling cooperative torque sharing between two motors via a linear differential mapping between motor and joint. To compensate for transmission losses, a model-based friction estimation strategy is developed and experimentally implemented, allowing accurate joint torque estimation without the need for torque sensors. The proposed solution is validated on a physical prototype, demonstrating the feasibility of sensorless torque estimation in a differentially actuated hip-knee module of a lower-limb exoskeleton.
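
A linear differential mapping between motor and joint can be sketched with a constant matrix. Under the virtual-work principle, if joint velocities are `A @ motor_velocities`, then lossless motor torques are `A.T @ joint_torques`; sensorless estimation inverts that relation after subtracting a friction estimate. The matrix `A`, the Coulomb-plus-viscous friction form, and every coefficient below are hypothetical placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical differential mapping: joint velocity = A @ motor velocity.
# Row 0 is the hip, row 1 is the knee.
A = np.array([[0.5, 0.5],
              [0.5, -0.5]])


def motor_friction(omega_m, coulomb=0.05, viscous=0.01):
    """Assumed Coulomb + viscous friction torque seen at each motor."""
    return coulomb * np.sign(omega_m) + viscous * omega_m


def motor_torque_for(tau_joint, omega_m):
    """Motor torques needed for a joint torque, including friction losses."""
    return A.T @ tau_joint + motor_friction(omega_m)


def estimate_joint_torque(tau_m, omega_m):
    """Sensorless joint torque estimate from motor torque and motor speed."""
    tau_net = tau_m - motor_friction(omega_m)
    return np.linalg.solve(A.T, tau_net)


# A pure hip torque is shared equally by both motors in this mapping.
print(A.T @ np.array([2.0, 0.0]))
```

With this particular `A`, a hip-only demand loads both motors equally, which is the cooperative torque sharing the abstract describes.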

2606.16876 2026-06-16 cs.RO 新提交

ExoTraj: A General Lower-limb Exoskeleton Assistance Policy for Complex Environments

ExoTraj:面向复杂环境的通用下肢外骨骼辅助策略

Xiao-Yin Liu, Guotao Li, Long Sun, Xu Liang, Zeng-Guang Hou

发表机构 * The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所多模态人工智能系统国家重点实验室) The School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) CASIA-MUST Joint Laboratory of Intelligence Science and Technology, Institute of Systems Engineering, Macau University of Science and Technology(澳门科技大学系统工程研究所中科院自动化所-澳门科技大学智能科学与技术联合实验室) The School of Automation and Intelligence, Beijing Jiaotong University(北京交通大学自动化与智能学院)

AI总结 提出ExoTraj统一策略,通过快速流匹配实现多模态特征到轨迹的精确预测,并利用模型预测控制优化力矩,在复杂户外场景中实现自适应辅助,无需昂贵运动捕捉系统。

Comments 28 pages, 19 figures, project page: https://xiaoyinliu0714.github.io/Home_ExoTraj/

详情
AI中文摘要

在动态外骨骼场景中,自适应力矩预测需要昂贵的运动捕捉系统,这在复杂户外环境中不可行。轨迹预测已成为解决该问题的有效方法之一。然而,外骨骼轨迹预测的核心挑战有两个:建立从多模态特征到轨迹信息的映射;构建从轨迹到力矩的映射。对于前者,现有方法大多仅执行单步预测并忽略受试者间轨迹变异性,从而限制了轨迹优化空间和预测泛化能力。为此,本文提出一种快速流匹配方法,能够实现精确的轨迹预测和更好的实时性能泛化,其中轨迹生成误差和编码观测用于指导训练方向。对于第二个挑战,由于人-机器人系统的高动态性以及感知与控制之间的强耦合,简单的控制方法难以基于预测轨迹实现高效辅助。本文利用模型预测控制并设计了一种新颖的优化目标来优化力矩,确保外骨骼实现舒适且鲁棒的辅助。通过整合上述两个组件,开发了统一策略ExoTraj,使其能够在复杂户外场景中实现自适应辅助,而无需高昂的数据采集成本。实验结果表明,与传统方法相比,ExoTraj在在线阶段将跨受试者预测误差降低了14.0%,并保持对外部噪声的鲁棒性。相对于零力矩条件,ExoTraj分别将代谢率降低11.5-24.4%,心率降低1.7-19.5%,峰值肌肉激活水平降低10.9-41.3%。

英文摘要

Adaptive torque prediction in dynamic exoskeleton scenarios requires expensive motion capture systems, which are infeasible in complex outdoor environments. Trajectory prediction has emerged as one of the effective approaches to address such an issue. However, the core challenges of exoskeleton trajectory prediction are twofold: establishing the mapping from multi-modal features to trajectory information; constructing the mapping from trajectory to torque. For the former, most existing methods perform only single-step prediction and neglect inter-subject trajectory variability, thereby limiting the trajectory optimization space and prediction generalization. To address this, this paper proposes a fast flow matching method that enables accurate trajectory prediction and better generalization for real-time performance, where trajectory generation errors and encoded observations are used to guide the training direction. For the second challenge, due to the high dynamics of the human-robot system and the strong coupling between perception and control, simple control methods struggle to achieve efficient assistance based on the predicted trajectory. This paper utilizes model predictive control and designs a novel optimization objective to optimize torque, ensuring the exoskeleton achieves comfortable and robust assistance. By integrating the above two components, the unified policy, denoted as ExoTraj, is developed to enable adaptive assistance in complex outdoor scenarios without high data acquisition cost. Experimental results show that compared to traditional methods, ExoTraj reduces cross-subject prediction error by 14.0% during the online phase and maintains robustness against external noise. Relative to the zero torque condition, ExoTraj decreases metabolic rate by 11.5-24.4%, heart rate by 1.7-19.5%, and peak muscle activation levels by 10.9-41.3%, respectively.

2606.16657 2026-06-16 eess.SP cs.RO 交叉投稿

Towards mm-Level Accurate UWB Radar: High-Accuracy Phase-Based Obstacle Detection through Multi-Channel Fusion

迈向毫米级精度的UWB雷达:通过多通道融合实现基于相位的高精度障碍物检测

Jelle De Moerloose, Adnan Shahid, Eli De Poorter

AI总结 提出一种在无源UWB雷达中利用相位信息进行距离估计的框架,通过多通道融合实现厘米级精度,中位误差1.69 cm,比仅用幅度的方法提升显著。

Comments 13 pages, Submitted to IEEE Transactions On Wireless Communications

详情
AI中文摘要

对于自主导引车、机器人及环境表征等应用,使用超宽带(UWB)雷达进行精确、无标签的距离估计至关重要。对于基于标签的定位系统,基于相位的UWB信号处理技术已展现出亚波长测距精度,但这些方法不适用于无源(无标签)雷达设置,因为其反射弱、多径条件复杂且缺乏已知的飞行时间(ToF)首径参考。本文首次证明,在完全无源的UWB雷达设置中可以有效利用相位信息。我们提出一种信号处理框架,通过将基于幅度的粗估计与跨多个频率通道的高分辨率相位变化相结合,提取可靠的距离信息。通过参考视距分量的相位测量,该方法补偿了硬件引起的相位漂移,而多通道频率分集的使用则能够消除周期性相位信息的模糊性,并提高对特定频率信道退化(如菲涅尔区)的鲁棒性。所提方法在配备使用DW3000设备的双基地UWB雷达的机器人上进行了验证,并在真实的金属工业环境中进行了评估。实验结果表明,我们的工作即使在高速下也能持续达到厘米级精度,中位误差为1.69 cm,显著优于仅依赖幅度信息的现有约10 cm精度的UWB雷达方法。我们进一步展示了多通道融合如何利用不相关的信道退化,相比单通道操作将误差降低超过40%,并概述了如何将相位建模与融合推向亚厘米级精度。

英文摘要

Accurate, tag-free distance estimation with ultrawideband (UWB) radar is essential for applications such as autonomous guided vehicles, robotics, and environment characterization. For tag-based localization systems, phase-based UWB signal processing techniques have demonstrated sub-wavelength ranging precision, but these approaches are not applicable to passive (tagless) radar setups with weak reflections, mixed multipath conditions, and the absence of a known time-of-flight (ToF) first-path reference. This paper demonstrates for the first time that phase information can be effectively exploited in a fully passive UWB radar setting. We introduce a signal processing framework that extracts reliable distance information by combining coarse amplitude-based estimates with high-resolution phase changes across multiple frequency channels. By referencing phase measurements to the line-of-sight component, the method compensates for hardware-induced phase drift, while the use of multichannel frequency diversity enables disambiguation of periodic phase information and improves robustness against frequency-specific channel degradation such as Fresnel zones. The proposed approach is validated on a robot equipped with a bistatic UWB radar using DW3000 devices and evaluated in a realistic metallic industrial environment. Experimental results show that our work consistently achieves centimeter-level accuracy even at high speeds, with a median error of 1.69 cm, significantly outperforming existing UWB radar approaches that rely only on amplitude information and reach roughly 10 cm accuracy. We further show how multi-channel fusion exploits uncorrelated channel degradation to reduce the error by more than 40% compared to single-channel operation, and outline how phase modeling and fusion can be pushed toward sub-centimeter accuracy.
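
The combination of a coarse amplitude-based estimate with multi-channel phase can be illustrated by a small grid search: within the uncertainty window of the coarse estimate, pick the path length whose predicted carrier phases best match the measured phases on every channel. This is a noise-free sketch under stated assumptions, not the paper's algorithm. It assumes two channels at the UWB channel 5 and channel 9 centre frequencies, a phase convention of `-2*pi*f*d/c`, and a coarse error below the chosen window; line-of-sight phase referencing is omitted.

```python
import numpy as np

C = 299_792_458.0                              # speed of light, m/s
CHANNELS_HZ = np.array([6.4896e9, 7.9872e9])   # assumed channel centre frequencies


def expected_phase(path_m, freqs_hz=CHANNELS_HZ):
    """Wrapped carrier phase for a given total path length (assumed convention)."""
    return np.mod(-2.0 * np.pi * freqs_hz * path_m / C, 2.0 * np.pi)


def refine_path(coarse_m, phases, window_m=0.08, step_m=1e-4):
    """Search around the coarse estimate for the best multi-channel phase match."""
    candidates = np.arange(coarse_m - window_m, coarse_m + window_m, step_m)
    model = np.exp(-2j * np.pi * np.outer(candidates, CHANNELS_HZ) / C)
    # Compare on the unit circle so that phase wrapping is handled implicitly.
    cost = np.sum(np.abs(model - np.exp(1j * phases)) ** 2, axis=1)
    return candidates[np.argmin(cost)]


true_path = 3.2173                 # metres, synthetic ground truth
coarse = 3.25                      # amplitude-based estimate, about 3 cm off
measured = expected_phase(true_path)
refined = refine_path(coarse, measured)
print(f"coarse error {abs(coarse - true_path) * 100:.2f} cm, "
      f"refined error {abs(refined - true_path) * 1000:.3f} mm")
```

A single channel would leave several candidates one wavelength apart inside the window; the second channel breaks that tie because the two wavelengths differ.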

2511.06998 2026-06-16 cs.RO 版本更新

Raspi$^2$USBL: An open-source Raspberry Pi-Based Passive Inverted Ultra-Short Baseline Positioning System for Underwater Robotics

Raspi$^2$USBL:一种基于树莓派的开源被动倒置超短基线水下机器人定位系统

Jin Huang, Yingqiang Wang, Ying Chen

发表机构 * State Key Laboratory of Ocean Sensing, Ocean College of Zhejiang University(海洋感知全国重点实验室,浙江大学海洋学院) School of Oceanography, Shanghai Jiao Tong University(上海交通大学海洋学院)

AI总结 提出一种基于树莓派的低成本被动倒置超短基线定位系统,通过被动声学接收器和主动信标实现水下定位,在消声池、淡水湖和开放海域测试中达到0.1%斜距精度和0.1°方位角精度。

详情
AI中文摘要

精确的水下定位仍然是水下机器人领域的一个基本挑战,因为全球导航卫星系统(GNSS)信号无法穿透海面。本文介绍了Raspi$^2$USBL,一种基于树莓派的被动倒置超短基线(piUSBL)定位系统,为水下机器人研究提供了一个低成本、易获取且可复现的平台。该系统由一个被动声学接收器和一个主动信标组成。接收器集成了水听器阵列、多通道前置放大器、恒温晶振(OCXO)、树莓派5和MCC系列数据采集(DAQ)板。信标集成了匹配网络、功率放大器和发射换能器。一个开源C++框架支持单向传播时间(OWTT)消息的时钟同步和触发,同时执行匹配滤波、阵列波束成形和自适应增益控制,以估计飞行时间(TOF)和到达方向(DOA)。该系统在消声水池、淡水湖和开放海域试验中进行了验证。结果表明,斜距精度优于0.1%,方位角精度在0.1°以内,且在高达1.3公里的距离上性能稳定。这些发现表明,低成本的系统级可复现硬件能够提供研究级的水下定位精度。通过发布软件框架并提供可复现的硬件架构,Raspi$^2$USBL提供了一个参考平台,降低了水下机器人实验室的入门门槛,并促进了水下声学导航和群体机器人领域的可复现研究。

英文摘要

Precise underwater positioning remains a fundamental challenge for underwater robotics because global navigation satellite system (GNSS) signals cannot penetrate the sea surface. This paper presents Raspi$^2$USBL, a Raspberry Pi-based passive inverted ultra-short baseline (piUSBL) positioning system that provides a low-cost, accessible, and reproducible platform for underwater robotic research. The system consists of a passive acoustic receiver and an active beacon. The receiver integrates a hydrophone array, multichannel preamplifier, oven-controlled crystal oscillator (OCXO), Raspberry Pi 5, and MCC-series data acquisition (DAQ) board. The beacon integrates a matching network, power amplifier, and transmitting transducer. An open-source C++ framework supports clock synchronization and triggering for one-way travel-time (OWTT) messaging, while performing matched filtering, array beamforming, and adaptive gain control to estimate the time of flight (TOF) and direction of arrival (DOA). The system was validated in an anechoic tank, a freshwater lake, and open-sea trials. Results demonstrate a slant-range accuracy better than 0.1%, a bearing accuracy within 0.1°, and stable performance over distances up to 1.3 km. These findings show that low-cost, system-level reproducible hardware can deliver research-grade underwater positioning accuracy. By releasing the software framework and providing a reproducible hardware architecture, Raspi$^2$USBL offers a reference platform that lowers the entry barrier for underwater robotics laboratories and promotes reproducible research in underwater acoustic navigation and swarm robotics.
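
Matched filtering for time-of-flight estimation can be sketched in a few lines: correlate the received signal with the known transmit waveform and take the lag of the correlation peak. The example below is a single-hydrophone illustration in Python rather than the project's C++ framework. The chirp band, sample rate, noise level and sound speed of 1500 m/s are assumptions, and sample 0 of the recording is assumed to coincide with the beacon's emission time, which is what synchronised clocks provide in one-way travel-time operation.

```python
import numpy as np

FS = 100_000.0        # sample rate in Hz (assumed)
SOUND_SPEED = 1500.0  # nominal sound speed in water, m/s (assumed)


def chirp(f0=10_000.0, f1=20_000.0, duration=0.01, fs=FS):
    """Linear frequency-modulated pulse used as the known transmit waveform."""
    t = np.arange(round(duration * fs)) / fs
    sweep_rate = (f1 - f0) / duration
    return np.sin(2.0 * np.pi * (f0 * t + 0.5 * sweep_rate * t**2))


def estimate_range(rx, template, fs=FS, c=SOUND_SPEED):
    """Matched filter: return (time of flight in s, slant range in m)."""
    corr = np.correlate(rx, template, mode="valid")
    delay_samples = int(np.argmax(np.abs(corr)))
    tof = delay_samples / fs
    return tof, tof * c


# Synthetic recording: attenuated pulse arriving 20 ms after emission, plus noise.
rng = np.random.default_rng(0)
template = chirp()
rx = rng.normal(0.0, 0.1, 4000)
true_delay = 2000
rx[true_delay:true_delay + template.size] += 0.5 * template

tof, slant_range = estimate_range(rx, template)
print(f"TOF {tof * 1e3:.2f} ms, slant range {slant_range:.2f} m")
```

Direction of arrival would additionally need the per-hydrophone delays across the array, which this sketch does not cover.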

2604.00768 2026-06-16 cs.RO cs.HC 版本更新

An Ergonomic, Customizable Soft Robotic Glove toward Personalized Hand Rehabilitation

一种符合人体工程学、可定制的软体机器人手套,用于个性化手部康复

Rui Chen, Firman Isma Serdana, Domenico Chiaradia, Xianlong Mai, Elena Losanno, Gabriele Righi, Claudia De Santis, Federica Serra, Vincent Mendez, Cristian Camardella, Daniele Leonardis, Giulio Del Popolo, Silvestro Micera, Antonio Frisoli

发表机构 * The Biorobotics Inst. and Dept. of Excellence in Robotics and AI, Scuola Superiore Sant’Anna, Pisa, Italy(生物机器人研究所和机器人与人工智能卓越系,圣安娜高等学院,比萨,意大利) Institute of Mechanical Intelligence and Department of Excellence in Robotics and AI, Scuola Superiore Sant’Anna (SSSA), 56127, Pisa, Italy(机械智能研究所和机器人与人工智能卓越系,圣安娜高等学院(SSSA),56127,比萨,意大利) CAS Key Laboratory of Mechanical Behavior and Design of Materials, Institute of Humanoid Robots, School of Engineering Sciences, University of Science and Technology of China, Hefei, China(中国科学院材料力学行为和设计重点实验室,人形机器人研究所,工程科学学院,中国科学技术大学,合肥,中国) Modular Implantable Neuroprostheses (MINE) Laboratory, Università Vita-Salute San Raffaele & Sant’Anna School of Advanced Studies, Milan, Italy(模块化可植入神经假体(MINE)实验室,圣拉斐尔生命健康大学及圣安娜高等学院,米兰,意大利) Careggi University Hospital, 50134 Florence, Italy(Careggi大学医院,50134,佛罗伦萨,意大利) Bertarelli Fndn. Chair in Translational Neuroengineering, Neuro X Institute, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland(Bertarelli基金会转化神经工程讲席,Neuro X研究所,洛桑联邦理工学院(EPFL),洛桑,瑞士)

AI总结 针对手部康复需求,提出一种基于织物的可定制软体机器人手套,采用数控热封技术制造双腔致动器,通过个性化指关节适配和凹形外表面设计提升舒适度与抓握力,实验验证其能降低前臂肌肉活动并改善脊髓损伤患者的抓握模式。

详情
AI中文摘要

神经系统疾病后的手部损伤严重限制了日常生活活动的独立性,促使了有效辅助和康复策略的发展。在此背景下,软体机器人手套引起了越来越多的兴趣,然而在定制化、人体工程学适配和用户舒适度方面的持续挑战限制了其临床实用性。本文提出了一种符合人体工程学、可定制的基于织物的软体机器人手套,其致动器可根据个体手指关节几何形状进行定制。该手套包含五个支持手指屈伸的双作用致动器,以及一个专用的拇指外展致动器。利用计算机数控热封技术,我们制造了对称腔室致动器,该致动器在充气时呈现凹形外表面,从而增加手指接触面积并提高舒适度。表征测试证实,其关节力矩和抓握力足以完成日常生活活动相关任务。在十名健康受试者中,主动辅助显著降低了操作过程中的前臂肌肉活动;一项针对三名颈髓损伤患者的初步研究显示,抓握模式更加自然,且对腱固定抓握的依赖减少。

英文摘要

Hand impairment following neurological disorders substantially limits independence in activities of daily living, motivating the development of effective assistive and rehabilitation strategies. Soft robotic gloves have attracted growing interest in this context, yet persistent challenges in customization, ergonomic fit, and user comfort constrain their clinical utility. Here, we present an ergonomic, customizable fabric-based soft robotic glove whose actuators can be tailored to individual finger-joint geometry. The glove comprises five dual-action actuators supporting finger flexion and extension, together with a dedicated thumb abduction actuator. Leveraging computer numerical control heat sealing technology, we fabricated symmetrical-chamber actuators that adopt a concave outer surface upon inflation, thereby increasing finger contact area and improving comfort. Characterization confirmed joint moment and grasping force sufficient for ADL-relevant tasks. In ten healthy subjects, active assistance significantly reduced forearm muscle activity during manipulation, and a pilot study in three individuals with cervical spinal cord injury showed more natural grasp patterns and reduced reliance on tenodesis grasp.

10. 仿真、数据集与评测 19 篇

2606.14767 2026-06-16 cs.RO 新提交

Synthetic-to-Real Pipeline for Safe Landing Zone Detection

面向安全着陆区检测的合成到真实流水线

Shrikant Banerjee, Reza Faieghi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种合成数据生成与感知流水线,通过域随机化生成逼真城市环境并微调Transformer架构,结合欧几里得距离变换实现无碰撞着陆区检测,消除手动标注需求。

Comments Proceedings of Conference on Robots and Vision (CRV) 2026, Vancouver, British Columbia, Canada

详情
AI中文摘要

随着无人飞行器(UAV)向更高自主性过渡,在非合作、非结构化环境中执行无辅助回收的能力变得至关重要。实现安全自主着陆需要高保真语义分辨率以区分可通行地形与危险障碍物,然而开发常因标注航拍数据集的稀缺而受阻。本文提出一种全面的感知与数据生成流水线,旨在弥合自主着陆任务的模拟到现实差距。我们引入一个程序化合成数据引擎,通过域随机化生成具有自动语义标注的逼真城市环境。基于Transformer的OneFormer架构仅在此合成数据上微调,利用多头自注意力机制进行全局上下文解析。为确保操作安全,一个确定性着陆模块利用欧几里得距离变换(EDT)和动态推理逻辑,在障碍物周围保持严格安全缓冲的同时,识别最大的内接安全着陆区。针对UAVid数据集的定量基准测试展示了鲁棒的语义分割性能,而在真实世界无人机视频上的定性验证证实了系统在未见环境中识别无碰撞着陆点的能力。我们的结果凸显了高保真程序化模拟在消除手动标注需求的同时,为自主UAV回收提供鲁棒、边缘可部署的态势感知的潜力。

英文摘要

As Uncrewed Aerial Vehicles (UAVs) transition toward higher levels of autonomy, the ability to perform unassisted recovery in non-cooperative, unstructured environments becomes critical. Achieving safe autonomous landing requires high-fidelity semantic resolution to distinguish navigable terrain from hazardous obstacles, yet development is often hindered by the scarcity of annotated aerial datasets. This work proposes a comprehensive perception and data generation pipeline designed to bridge the sim-to-real gap for autonomous landing tasks. We introduce a procedural synthetic data engine that generates photorealistic urban environments with automated semantic annotations through domain randomization. A Transformer-based OneFormer architecture is fine-tuned exclusively on this synthetic data, leveraging multi-head self-attention mechanisms for global context resolution. To ensure operational safety, a deterministic landing module utilizes a Euclidean Distance Transform (EDT) and dynamic inference logic to identify the largest inscribed safe landing zones while maintaining strict clearance buffers around obstacles. Quantitative benchmarking against the UAVid dataset demonstrates robust semantic segmentation performance, while qualitative validation on real-world UAV footage confirms the system's ability to identify collision-free landing sites in unseen environments. Our results highlight the potential of high-fidelity procedural simulation to eliminate the need for manual annotation while providing robust, edge-deployable situational awareness for autonomous UAV recovery.
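
The role of the Euclidean Distance Transform in the landing module can be sketched as follows: for every pixel labelled safe, compute the distance to the nearest unsafe pixel; the pixel with the largest distance is the centre of the largest inscribed circle, and that circle is accepted only if it exceeds the required clearance. The brute-force transform below is written with NumPy alone so the example stays self-contained on small masks; a library routine such as `scipy.ndimage.distance_transform_edt` would be the usual choice at image scale. The function names and the clearance handling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def distance_to_unsafe(safe):
    """Brute-force Euclidean distance from each pixel to the nearest unsafe pixel."""
    safe = np.asarray(safe, dtype=bool)
    unsafe_rc = np.argwhere(~safe)
    if unsafe_rc.size == 0:
        return np.full(safe.shape, np.inf)
    rows, cols = np.indices(safe.shape)
    pixels = np.stack([rows.ravel(), cols.ravel()], axis=1)
    diff = pixels[:, None, :] - unsafe_rc[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1)).min(axis=1)
    return dist.reshape(safe.shape)  # unsafe pixels get distance 0


def largest_landing_zone(safe, clearance_px):
    """Return (row, col, usable_radius_px) or None if no zone clears the buffer."""
    dist = distance_to_unsafe(safe)
    row, col = np.unravel_index(np.argmax(dist), dist.shape)
    usable = dist[row, col] - clearance_px
    if usable <= 0:
        return None
    return int(row), int(col), float(usable)


# Toy segmentation mask: a 9x9 patch that is safe except for its border.
mask = np.ones((9, 9), dtype=bool)
mask[0, :] = mask[-1, :] = False
mask[:, 0] = mask[:, -1] = False
print(largest_landing_zone(mask, clearance_px=1.0))
```

Because the rule is deterministic, the same segmentation mask always yields the same landing site, which makes the safety buffer straightforward to audit.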

2606.15010 2026-06-16 cs.RO 新提交

LV-Calib: LiDAR-Camera Extrinsic Calibration with Boundary-Response Modeling

LV-Calib:基于边界响应建模的LiDAR-相机外参标定

Sheng Hong

发表机构 * Pen-Tung Sah Institute of Micro-Nano Science and Technology, Xiamen University(厦门大学萨本栋微米纳米科学技术研究院)

AI总结 提出LV-Calib框架,利用可打印平面靶标,通过视觉基准和圆形反射率边界,结合强度与几何约束优化LiDAR特征点,实现LiDAR-相机外参标定和边界响应校准,达到亚像素重投影精度和毫米级特征一致性。

Comments 8 pages, 6 figures, 3 tables

详情
AI中文摘要

我们提出LV-Calib,一个使用可打印平面靶标进行LiDAR-相机外参估计和LiDAR边界响应校准的标定框架。靶标作为共享观测载体:视觉基准提供索引图像测量,而圆形反射率边界提供LiDAR可观测的结构特征点。LV-Calib不直接将边界点拟合为理想几何轮廓,而是自动裁剪背景点,估计靶标平面,并通过强度与几何约束迭代优化精确的LiDAR侧3D特征点。该优化显式处理了由有限光束足迹和黑白反射率不连续处的混合强度返回引起的展宽和畸变过渡带。基于这些优化的LiDAR特征,我们构建了加权重投影一致的外参优化,其中图像观测保持在重投影域,LiDAR特征残差由优化置信度加权。最后,利用估计的外参和提取的过渡带,LV-Calib通过估计边界重叠样本的俯仰-偏航-距离残差统计量来校准LiDAR边界响应。在印刷板标定数据上的实验展示了亚像素重投影精度、毫米级LiDAR特征一致性以及改进的里程计性能。代码和标定数据将发布以供可重复评估。

英文摘要

We present LV-Calib, a calibration framework for LiDAR-camera extrinsic estimation and LiDAR boundary-response calibration using a printable planar target. The target serves as a shared observation carrier: visual fiducials provide indexed image measurements, while circular reflectivity boundaries provide LiDAR-observable structural feature points. Instead of directly fitting boundary points as ideal geometric contours, LV-Calib automatically crops background points, estimates the target plane, and iteratively refines accurate LiDAR-side 3-D feature points from intensity and geometric constraints. The refinement explicitly handles the broadened and distorted transition band induced by finite beam footprint and mixed-intensity returns around black-white reflectivity discontinuities. Given these refined LiDAR features, we formulate a weighted reprojection-consistent extrinsic optimization with LiDAR feature alignment, where image observations are kept in the reprojection domain and LiDAR feature residuals are weighted by refinement confidence. Finally, using the estimated extrinsic and the extracted transition band, LV-Calib calibrates the LiDAR boundary response by estimating pitch-yaw-range residual statistics of boundary-overlap samples. Experiments on printed-board calibration data demonstrate sub-pixel reprojection accuracy, millimeter-level LiDAR feature consistency, and improved odometry performance. Code and calibration data will be released for reproducible evaluation.
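
The target-plane estimation step can be illustrated with a standard least-squares plane fit: subtract the centroid of the cropped LiDAR points and take the right singular vector with the smallest singular value as the plane normal. The abstract does not say which estimator LV-Calib uses, so this is only one common choice, shown on synthetic points.

```python
import numpy as np


def fit_plane(points):
    """Least-squares plane through 3-D points: returns (unit normal, centroid)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    # The direction of least variance is the last right singular vector.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return vt[-1], centroid


def signed_distance(points, normal, centroid):
    """Signed point-to-plane distances, useful as residuals or for outlier rejection."""
    return (np.asarray(points, dtype=float) - centroid) @ normal


# Synthetic board: a grid on the plane z = 0.5 with millimetre-level range noise.
rng = np.random.default_rng(0)
xs, ys = np.meshgrid(np.linspace(-0.3, 0.3, 15), np.linspace(-0.2, 0.2, 10))
board = np.column_stack([xs.ravel(), ys.ravel(), np.full(xs.size, 0.5)])
board[:, 2] += rng.normal(0.0, 0.001, xs.size)

normal, centroid = fit_plane(board)
print("normal:", normal, "rms residual [m]:",
      np.sqrt(np.mean(signed_distance(board, normal, centroid) ** 2)))
```

The sign of the normal is arbitrary in this fit; a real pipeline would typically orient it toward the sensor.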

2606.15338 2026-06-16 cs.RO 新提交

SimWeaver: Zero-Shot RGB Sim-to-Real for Deformable Manipulation

SimWeaver:面向可变形操作的零样本RGB仿真到现实

Wenkang Hu, Haoran Wang, Yitong Li, Liu Liu, Mengao Zhao, Lai Jiang, Xincheng Tang, Junhang Wei, Zhengjie Shu, Zhendong Wang, Zhizhong Su, Huamin Wang, Ruigang Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Horizon Robotics(地平线机器人) Style3D Research(Style3D研究院)

AI总结 提出SimWeaver框架,通过200条仿真演示训练零样本RGB VLA策略,在5种可变形任务中达到91%平均真实成功率,无需遥操作或任务校准。

详情
AI中文摘要

可变形操作的RGB仿真到现实在没有真实世界微调的情况下基本上仍未解决。我们提出了SimWeaver,它在每个任务的200个模拟演示上训练零样本RGB VLA策略,在5种不同的可变形任务(包括塑料袋操作)中达到每个任务超过80%和平均91%的真实世界成功率,无需遥操作或每个任务校准。SimWeaver结合了一个可靠的基于测量的模拟器(SimWeaver-Sim)、一个支持单图像生成的可扩展资产框架(SimWeaver-Asset)、一个确定性拓扑感知轨迹合成器(SimWeaver-Syn)以及一个具有ISP感知光度增强的仿真到现实协议(SimWeaver-Real)。在丝绸抓取任务中,模拟训练的策略在视觉分布偏移下达到100%的成功率,而基于真实数据的基线下降到9-70%,且每个轨迹的成本低两个数量级。我们将发布SimWeaver和一个代表性资产子集。项目页面:https://simweaver.github.io/

英文摘要

RGB sim-to-real for deformable manipulation has remained largely unsolved without real-world fine-tuning. We present SimWeaver, which trains zero-shot RGB VLA policies on 200 simulated demonstrations per task, reaching above 80% per-task and 91% average real-world success across 5 diverse deformable tasks including plastic-bag manipulation, without teleoperation or per-task calibration. SimWeaver combines a reliable measurement-backed simulator (SimWeaver-Sim) with an extensible asset framework supporting single-image generation (SimWeaver-Asset), a deterministic topology-aware trajectory synthesizer (SimWeaver-Syn), and a sim-to-real protocol with ISP-aware photometric augmentation (SimWeaver-Real). On silk grasping, the sim-trained policy reaches 100% under visual distribution shifts where real-data baselines drop to 9-70%, at two orders of magnitude lower per-trajectory cost. We will release SimWeaver and a representative asset subset. Project page: https://simweaver.github.io/

2606.15431 2026-06-16 cs.RO 新提交

A Corridor-Scale CARLA-VISSIM Co-Simulation Framework for Multi-Intersection Urban Traffic

面向多交叉口城市走廊的CARLA-VISSIM联合仿真框架

Sima Ashayer, Austin Haris, Mina Sartipi

发表机构 * University of Tennessee at Chattanooga(田纳西大学查塔努加分校)

AI总结 提出CARLA-VISSIM双向步进同步联合仿真框架,集成微观交通逻辑与高保真3D渲染,在田纳西州查塔努加市MLK大道约15个交叉口走廊上验证,支持混合控制与感知就绪的走廊级交通研究。

详情
AI中文摘要

本文提出了一个已实现的CARLA-VISSIM联合仿真框架,用于美国田纳西州查塔努加市马丁·路德·金大道上约15个相连交叉口的城市走廊。该系统通过双向步进同步接口集成CARLA 0.10.0(Unreal Engine 5)与PTV VISSIM 2026,将VISSIM的微观车辆、行人和信号控制器逻辑与CARLA的高保真3D渲染相结合。基于LiDAR的高程模型和RoadRunner的高清地图提供了地形精确的道路几何,并在两个仿真器中一致部署。该框架包含显式的参与者所有权、镜像生命周期管理、坐标协调以及每个参与者最新状态的更新策略,实现了VISSIM控制的交通流与CARLA控制的自我车辆之间的稳定交互。一个走廊规模的案例研究展示了在约100辆车和100名行人的峰值负载下,交通信号镜像、车辆-行人同步交互以及稳定的混合控制操作。该部署捕捉了MLK街上五个信号化交叉口及其连接的上游和下游交叉口的交互,揭示了多交叉口走廊特有的同步挑战。结果表明,以MLK为中心的走廊为验证跨仿真器一致性提供了有效测试平台,且所提出的架构支持可靠的、感知就绪的走廊级交通联合仿真。

英文摘要

This paper presents an implemented CARLA-VISSIM co-simulation framework for an urban corridor comprising approximately fifteen connected intersections centered on Martin Luther King Jr. Boulevard in Chattanooga, Tennessee. The system integrates CARLA 0.10.0 (Unreal Engine 5) with PTV VISSIM 2026 through a bidirectional, step-synchronized interface that couples VISSIM's microscopic vehicle, pedestrian, and signal-controller logic with CARLA's high-fidelity 3D rendering. A LiDAR-derived elevation model and RoadRunner-based High Definition (HD) map provide terrain-accurate road geometry deployed consistently across both simulators. The framework incorporates explicit actor ownership, mirrored lifecycle management, coordinate reconciliation, and a latest-state-per-actor update policy, enabling stable interaction between VISSIM-controlled traffic and a CARLA-controlled ego vehicle. A corridor-scale case study demonstrates consistent traffic-signal mirroring, synchronized vehicle-pedestrian interactions, and stable mixed-authority operation under peak loads of approximately 100 vehicles and 100 pedestrians. The deployment captures the interaction of the five signalized intersections along MLK Street and their connecting upstream and downstream intersections, revealing synchronization challenges unique to multi-intersection corridors. Results indicate that this MLK-centered corridor provides an effective testbed for verifying cross-simulator consistency and that the proposed architecture supports reliable, perception-ready co-simulation for corridor-level traffic studies.
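
A latest-state-per-actor update policy can be pictured as a small buffer that keeps only the newest state for each actor between synchronisation steps, so that a burst of updates for one vehicle never causes stale poses to be applied. The sketch below is a generic illustration in plain Python. `ActorState` and `LatestStateBuffer` are hypothetical names, and no CARLA or VISSIM API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorState:
    """Pose of one mirrored actor at a given simulation step (hypothetical type)."""
    actor_id: int
    step: int
    x: float
    y: float
    yaw_deg: float


class LatestStateBuffer:
    """Keeps at most one pending state per actor: the one with the highest step."""

    def __init__(self):
        self._latest = {}

    def push(self, state):
        current = self._latest.get(state.actor_id)
        # Ignore updates that are older than what is already buffered.
        if current is None or state.step >= current.step:
            self._latest[state.actor_id] = state

    def drain(self):
        """Return pending states ordered by actor id and empty the buffer."""
        states = sorted(self._latest.values(), key=lambda s: s.actor_id)
        self._latest.clear()
        return states


buffer = LatestStateBuffer()
buffer.push(ActorState(1, step=1, x=0.0, y=0.0, yaw_deg=0.0))
buffer.push(ActorState(1, step=3, x=2.0, y=0.0, yaw_deg=0.0))
buffer.push(ActorState(1, step=2, x=1.0, y=0.0, yaw_deg=0.0))  # late arrival, dropped
buffer.push(ActorState(2, step=1, x=5.0, y=1.0, yaw_deg=90.0))
pending = buffer.drain()
print([(s.actor_id, s.step) for s in pending])
```

Draining once per synchronised step keeps the work per step bounded by the number of actors rather than by the number of messages received.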

2606.15930 2026-06-16 cs.RO cs.AI 新提交

ControlMap: Controllable High-Definition Map Generation for Traffic Scenario Simulation

ControlMap: 用于交通场景仿真的可控高清地图生成

Marwan Farag, Steffen Wäldele, Yu Yao

发表机构 * University of Stuttgart(斯图加特大学) Robert Bosch GmbH(博世公司) Motional, Inc(Motional公司)

AI总结 提出基于潜在扩散和ControlNet的数据驱动管道,实现可控高清地图生成,支持空间引导、条件强度调整和城市风格迁移,并引入新指标评估控制信号遵循度和地图真实性。

详情
AI中文摘要

仿真是验证自动驾驶系统的核心,但当前流程因高精(HD)地图创建成本高昂而受限于场景多样性不足。扩展HD地图需要昂贵的数据收集和人工处理。此外,现有生成模型缺乏在生成过程中针对特定道路拓扑进行细粒度控制的能力。本文提出一种数据驱动的可控HD地图生成管道,使用潜在扩散和ControlNet进行空间条件控制。据我们所知,我们是首个将空间引导信号注入扩散模型用于HD地图合成的工作。此外,我们的模型支持通过无分类器引导调整条件强度,并通过城市标签条件实现城市级风格迁移。为补充现有指标,我们引入两个新指标来评估对控制信号的遵循程度以及与真实地图的相似性。实验表明,我们的模型生成的HD地图真实且忠实遵循输入道路拓扑,同时准确保留城市特定细节。

英文摘要

Simulation is central to validating autonomous driving systems, yet current pipelines are limited by insufficient scenario diversity due to costly High Definition (HD) map creation. Scaling HD maps requires expensive data collection and manual processing. Moreover, existing generative models lack the fine-grained control necessary to target specific road topologies during generation. This paper presents a data-driven pipeline for controllable HD map generation using latent diffusion and ControlNet for spatial conditioning. To our knowledge, we are the first to inject spatial guidance signals into a diffusion model for HD map synthesis. Furthermore, our model supports adjustable conditioning strength through classifier-free guidance and city-level style transfer via city label conditioning. To complement existing metrics, we introduce two novel metrics to evaluate adherence to the control signal and similarity to ground-truth maps. Experiments demonstrate that our model generates realistic HD maps that faithfully follow input road topologies while accurately preserving city-specific details.
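
Adjustable conditioning strength through classifier-free guidance is commonly implemented by blending the conditional and unconditional noise predictions at each denoising step. The snippet below shows that widely used blending rule on placeholder arrays. How ControlMap wires it into its latent diffusion model and ControlNet branch is not described in the abstract, so the function name and array shapes are illustrative assumptions.

```python
import numpy as np


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Blend noise predictions: scale 0 ignores the condition, 1 uses it as-is,
    and values above 1 push the sample further toward the condition."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Placeholder predictions standing in for two forward passes of a denoiser.
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 8, 8))
eps_cond = rng.normal(size=(4, 8, 8))

for scale in (0.0, 1.0, 3.0):
    guided = classifier_free_guidance(eps_uncond, eps_cond, scale)
    print(scale, float(np.abs(guided - eps_uncond).mean()))
```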

2606.16208 2026-06-16 cs.RO 新提交

ATHENA: Accelerated Multi-Task Heterogeneous Influence Functions for Robot Data Curation

ATHENA: 加速的多任务异构影响函数用于机器人数据筛选

Tao Xu, Jiaxin Wang, Runhao Zhang, Jiayi Guan, Xianchao Zeng, Weixi Song, Xinyu Zhou, Zhetao Chen, Guang Chen, Yong-Lu Li

发表机构 * Tongji University(同济大学) Shanghai Innovation Institute(上海创新研究院) Xi'an Jiaotong University(西安交通大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出ATHENA框架,利用Kronecker梯度结构和秩r随机截断近似加速影响函数计算,实现多任务VLA模型数据筛选,在模拟和真实机器人任务中以更少数据达到或超越全数据微调性能。

详情
AI中文摘要

在机器人模仿学习中,影响函数提供了一种原则性方法来量化每个演示对机器人任务结果的影响,但将其扩展到十亿参数的视觉-语言-动作(VLA)模型受到计算和多任务瓶颈的限制。为此,我们提出ATHENA,一个专为十亿参数规模的多任务VLA数据筛选设计的影响函数框架。具体来说,它利用线性层梯度的Kronecker结构来降低投影成本,并通过秩r随机截断近似来近似稠密Hessian矩阵的逆,在影响计算中实现了约313.4倍的加速。此外,ATHENA制定了全局和局部交互影响,以平衡50个联合训练任务间的数据筛选。在RoboTwin 2.0和真实机器人部署上的广泛评估,分别涵盖9.34小时和6.90小时的演示,表明ATHENA在模拟中仅使用50%的演示、在六个真实机器人任务中使用66.7%的数据,即可达到或超过全数据联合微调的性能。总体而言,ATHENA证明了其在十亿参数多任务VLA微调中用于数据筛选的有效性。

英文摘要

In robot imitation learning, influence functions provide a principled approach to quantify each demonstration's effect on robot task outcomes, yet scaling them to billion-parameter Vision-Language-Action (VLA) models is limited by computational and multitask bottlenecks. To this end, we propose ATHENA, an influence function framework tailored for multitask VLA data curation at a billion-parameter scale. Concretely, it leverages the Kronecker structure of linear-layer gradients to reduce projection cost, and approximates dense Hessian inversion with a rank-r Random Truncated Approximation, achieving about a 313.4x speedup in influence computation. Furthermore, ATHENA formulates global and local interactive influence to balance data curation across 50 jointly trained tasks. Extensive evaluations on RoboTwin 2.0 and real-robot deployment, covering 9.34 and 6.90 hours of demonstrations, respectively, show that ATHENA matches or exceeds full-data joint fine-tuning using only 50% of demonstrations in simulation and 66.7% of data across six real-robot tasks. Overall, ATHENA demonstrates its effectiveness for data curation in billion-parameter multitask VLA fine-tuning.
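
The cost saving from the Kronecker structure of linear-layer gradients can be checked numerically. For a layer `y = W x`, the per-sample gradient with respect to `W` is the outer product of the backpropagated signal and the input. Projecting it with a Kronecker-structured matrix therefore never requires forming the full gradient, because `(P_out ⊗ P_in) vec(δ xᵀ) = vec((P_out δ)(P_in x)ᵀ)` for row-major `vec`. The random projections and sizes below are assumptions for illustration; this is not ATHENA's actual projection or its Hessian approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 6, 5      # layer size (toy values)
k_out, k_in = 3, 2      # projected size (toy values)

delta = rng.normal(size=d_out)   # backpropagated output gradient
x = rng.normal(size=d_in)        # layer input
grad = np.outer(delta, x)        # per-sample gradient of W for y = W x

P_out = rng.normal(size=(k_out, d_out))
P_in = rng.normal(size=(k_in, d_in))

# Dense route: build the (k_out*k_in) x (d_out*d_in) projection explicitly.
dense = np.kron(P_out, P_in) @ grad.reshape(-1)

# Factored route: project each factor separately, then take the outer product.
factored = np.outer(P_out @ delta, P_in @ x).reshape(-1)

print(np.allclose(dense, factored))
```

The dense route costs on the order of `k_out * k_in * d_out * d_in` multiplications per sample, while the factored route costs about `k_out * d_out + k_in * d_in + k_out * k_in`.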

2606.16447 2026-06-16 cs.RO cs.AI 新提交

Training and Evaluating Diffusion Policies with Long Context Lengths

训练和评估具有长上下文长度的扩散策略

Abhinav Agarwal, Adam Wei, Taylan Kargin, Michael Zeng, Cole Becker, Arif Kerem Dayi, Pablo Parrilo, Asuman Ozdaglar, Russ Tedrake

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文首次详细研究模仿学习中上下文长度的影响,发现简单扩展上下文长度并不脆弱,并提出联合训练多上下文长度策略的方法以降低样本复杂度。

详情
AI中文摘要

模仿学习已经能够从RGB观测中实现高度灵巧的机器人操作。然而,使用这些方法训练的策略通常仅基于短历史观测来调节机器人动作。这些策略无法解决需要记忆的任务,并且可能反复执行相同的失败动作。在这项工作中,我们首先在任务具有不同局部稳定性和记忆需求以及多种数据体制下,将上下文长度从短到长逐步增加,对策略性能进行基准测试。据我们所知,这是首次如此详细地研究模仿学习中上下文长度的影响。我们的结果挑战了先前的说法:简单地扩展上下文长度并不像文献中声称的那样脆弱。使用适当的调节方法和去噪骨干网络(UNet+交叉注意力),单任务策略在通常的数据体制下即使采用简单扩展也能在许多任务上取得高成功率。接下来,我们提出一种训练算法,用于联合训练多个上下文长度的策略,进一步降低长上下文学习的样本复杂度。最后,我们将我们的发现应用于重新评估先前提出的一些长上下文模仿学习解决方案。

英文摘要

Imitation learning has enabled highly-dexterous robotic manipulation from RGB observations. Policies trained with these methods, however, typically condition robot actions on only a short history of observations. These policies cannot solve tasks that require memory and can get stuck repeatedly executing the same failing motions. In this work, we first benchmark policy performance as context length is incrementally increased from short to long, across a spectrum of tasks with varying local stability and memory requirements, and in multiple data regimes. To our knowledge, this is the first study to investigate context length in imitation learning at this level of detail. Our results challenge prior claims: naively scaling context length is not as brittle as advertised in literature. With an appropriate conditioning method and denoising backbone (UNet+Cross-Attention), single-task policies achieve high success rates on many tasks in the usual data regime even with naive scaling. Next, we propose a training algorithm to jointly train policies at multiple context lengths, further reducing the sample complexity of long-context learning. Finally, we apply our findings to re-evaluate some previously proposed solutions to long-context imitation learning.

2606.16570 2026-06-16 cs.RO 新提交

Automated Digital Twin Construction for Highway Scenarios Using LiDAR Point Clouds and OpenStreetMap

基于LiDAR点云和OpenStreetMap的高速公路场景自动数字孪生构建

Yongqi Zhao, Dong Bi, Paul Kovacevic, Tomislav Mihalj, Martin Schabauer, Johannes Betz, Arno Eichberger

发表机构 * Institute of Automotive Engineering, Graz University of Technology(格拉茨技术大学汽车工程研究所) School of Intelligent Connected Vehicle, Hubei University of Automotive Technology(湖北汽车工业学院智能网联汽车学院) Professorship of Autonomous Vehicle Systems, Technical University of Munich(慕尼黑工业大学自动驾驶系统教席)

AI总结 提出融合LiDAR点云与OpenStreetMap数据的自动化流程,生成地理参考的ASAM OpenDRIVE高速公路地图,实现车道级几何与拓扑的完整建模,平均横向RMSE为0.740米。

Comments 9 pages, 5 figures

详情
AI中文摘要

精确的道路环境建模是自动驾驶系统仿真和验证的基础。然而,从真实传感器数据构建标准格式(如ASAM OpenDRIVE)的道路地图仍然是一个耗时且昂贵的过程。移动测绘LiDAR可以捕获精确的车道级几何,但仅限于行驶走廊,而OpenStreetMap(OSM)提供广泛的道路网络拓扑,但缺乏车道级的几何精度。为了解决这一问题,提出了一种自动化工作流程,融合LiDAR点云与OSM数据,生成地理参考的ASAM OpenDRIVE高速公路环境地图,所需人工干预最少。该流程从LiDAR测量中重建主线道路,并从OSM道路图中推断匝道几何和拓扑,从而在无需完整传感器覆盖的情况下实现完整的高速公路互通立交建模。实验表明,平均横向RMSE为0.740米,生成的地图可直接用于主流仿真平台,包括IPG CarMaker和Esmini。这些结果验证了将测量几何与地图拓扑相结合用于自动OpenDRIVE数字孪生生成的有效性。项目代码可在https://github.com/ftgTUGraz/opendrive-digital-twin-generator获取。

英文摘要

Accurate road environment modeling is fundamental to the simulation and validation of automated driving systems. However, constructing road maps in standardized formats such as ASAM OpenDRIVE from real-world sensor data remains a time-consuming and costly process. Mobile mapping LiDAR captures accurate lane-level geometry but is confined to the driven corridor, while OpenStreetMap (OSM) provides broad road network topology but lacks geometric precision at the lane level. To address this, an automated workflow is proposed to fuse LiDAR point clouds with OSM data to generate georeferenced ASAM OpenDRIVE maps of highway environments, requiring minimal manual intervention. The pipeline reconstructs mainline roads from LiDAR-derived measurements and infers ramp geometry and topology from the OSM road graph, enabling complete highway interchange modeling without full sensor coverage. Experiments demonstrate a mean lateral RMSE of 0.740 m, and the generated maps are directly usable in mainstream simulation platforms including IPG CarMaker and Esmini. These results validate the effectiveness of combining measurement-derived geometry with map-derived topology for automated OpenDRIVE digital twin generation. The project code is available at https://github.com/ftgTUGraz/opendrive-digital-twin-generator
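
To make the reported lateral RMSE concrete, the following is a minimal sketch of how such a figure could be computed. It assumes the error is the perpendicular offset of each generated centerline point from a reference polyline in a local metric frame. The function names and this polyline-based definition are illustrative assumptions; the abstract does not state the paper's exact evaluation protocol.

```python
import math


def point_to_segment_distance(p, a, b):
    """Shortest distance from 2D point p to the segment a-b (metres)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: fall back to point-to-point distance.
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def lateral_rmse(generated_points, reference_polyline):
    """RMSE of the offsets between generated points and a reference polyline.

    Assumption: both inputs are lists of (x, y) tuples in the same local
    metric coordinate frame.
    """
    if not generated_points or len(reference_polyline) < 2:
        raise ValueError("need at least one point and a two-point polyline")
    segments = list(zip(reference_polyline, reference_polyline[1:]))
    squared = []
    for p in generated_points:
        d = min(point_to_segment_distance(p, a, b) for a, b in segments)
        squared.append(d * d)
    return math.sqrt(sum(squared) / len(squared))
```

Under this definition, a lane centerline that sits a constant 0.5 m to one side of the reference yields a lateral RMSE of 0.5 m, regardless of how far it extends along the road.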

2606.16776 2026-06-16 cs.RO 新提交

DataLadder: A Simulation-Enabled Interconversion Toolchain for the Embodied Data Pyramid

DataLadder:面向具身数据金字塔的仿真赋能互转换工具链

Peidong Liu, Yongce Liu, Songyan Guo, Fuyuan Ma, Zhihao Yuan, Ao Li, Zengjue Chen, Wenhao Li, Tianle Zhang, Mingyang Li, Jiale Zhang, Junzhe Xiong, Zhiyuan Xiang, Dafeng Chi, Yuzheng Zhuang, Yihang Li, Qingrong He, Jiaming Liang, Chen Cai, Peng Hao, Mingxi Luo, Song Wang, Junwu Xiong, Ruodai Li, Liyi Luo, Wei Tan, Dongjiang Li, Jiawei Li, Hui Shen, Yicheng Gong, Liang Lin

发表机构 * Joy Future Academy, JD Group(京东集团未来研究院) JD Technology, JD Group(京东集团京东科技)

AI总结 提出DataLadder工具链,通过机器人↔仿真↔人类双向路径,实现人机对齐的模型评估与数据生成,利用数字孪生和仿真一致性过滤解决物理机器人扩展难题。

Comments Project Page: https://joyai-sim.github.io/

详情
AI中文摘要

通用机器人策略需要可信的评估和机器人可用的训练数据,但仅靠物理机器人难以规模化。真实机器人试验和演示仍然是部署信号最可靠的来源,但它们缓慢、昂贵且难以复现。我们提出DataLadder,一个仿真赋能的人机对齐模型评估与数据生成互转换工具链,记为Robot $\rightleftharpoons$ Simulation $\rightleftharpoons$ Human。一方面,Robot $\rightarrow$ Simulation $\rightarrow$ Human路径通过将真实机器人桌面整理任务重建为校准的数字孪生体以进行可扩展评估,同时利用人类具身反馈检查和优化仿真运动的自然性,支持人机对齐的模型评估。另一方面,Human $\rightarrow$ Simulation $\rightarrow$ Robot路径支持人机对齐的数据生成:它将自我中心的人类演示提升到仿真中,在机器人物理约束下检查它们,并将其转换为以机器人为中心的轨迹、标注和视觉观察。这些路径共同使用JoySim仿真器作为机器人数据生成的可扩展评估层和物理一致性过滤器。我们进一步将核心重建、仿真、渲染和真实性增强模块打包为京东云上的云服务,将系统转变为机器人数据生成和模型评估的可复用基础设施。

英文摘要

Generalist robot policies require trustworthy evaluation and robot-usable training data, but both are difficult to scale with physical robots alone. Real-robot trials and demonstrations remain the most faithful source of deployment signals, yet they are slow, costly, and hard to reproduce. We present DataLadder, a simulation-enabled interconversion toolchain for human-robot aligned model evaluation and data generation, denoted as Robot $\rightleftharpoons$ Simulation $\rightleftharpoons$ Human. On the one hand, the Robot $\rightarrow$ Simulation $\rightarrow$ Human pathway supports human-robot aligned model evaluation by reconstructing real-robot tabletop organization tasks as calibrated digital twins for scalable evaluation, while using human embodied feedback to inspect and refine the naturalness of simulated motions. On the other hand, the Human $\rightarrow$ Simulation $\rightarrow$ Robot pathway supports human-robot aligned data generation: it lifts ego-centric human demonstrations into simulation, checks them under robot physical constraints, and converts them into robot-centered trajectories, annotations, and visual observations. Together, these pathways use the JoySim simulator as both a scalable evaluation layer and a physical consistency filter for robot data generation. We further package the core reconstruction, simulation, rendering, and realism-augmentation modules as cloud services on JD Cloud, turning the system into reusable infrastructure for robot data generation and model evaluation.

2606.16826 2026-06-16 cs.RO cs.AI 新提交

ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies

ATOM-Bench:用于操作策略中原子技能与组合泛化的真实世界基准

Zenan Wu, Bingqing Wei, Lu Liu, Zheqi He, Xi Wang, Jiakang Liu, Zehui Li, Guocai Yao, Jing-Shu Zheng, Xi Yang, Yongtao Wang

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Peking University(北京大学)

AI总结 提出ATOM-Bench基准,通过分解桌面操作为原子任务和组合任务,评估操作策略的原子技能获取与组合泛化能力,发现当前策略在细粒度原子技能和组合重用上存在不足。

Comments Homepage: https://flageval-baai.github.io/AtomBenchPage

详情
AI中文摘要

通用操作策略越来越多地被呈现为机器人控制的基础模型,但它们的真实世界泛化能力仍然难以诊断。一个策略可能在演示任务上成功,但仍无法执行细粒度的原子技能或在新的任务结构中重新组合已学习的技能。我们引入了\textbf{ATOM-Bench},一个用于评估操作策略中原子技能和组合泛化的真实世界基准。ATOM-Bench将桌面操作分解为运动原子和指令原子,包含30个原子任务和24个保留的组合任务,涵盖配对单臂和双臂机器人轨道。我们收集了3000个人类演示用于原子微调,并发布演示数据和评估rollout数据以支持可重复的真实世界评估。策略在原子任务上进行微调,并在原子技能获取和保留的组合任务上进行评估。我们进一步引入了原子分数(AS)和组合失败份额(CFS),以区分由弱原子技能引起的失败和由有限组合重用引起的失败。通过对五种代表性操作策略进行2700次物理rollout,我们发现当前策略可以获取简单的指令落地技能,但在细粒度运动原子、计数和逻辑过滤方面仍然困难。更重要的是,强大的原子性能并不能可靠地迁移到保留的组合任务上。ATOM-Bench提供了一个诊断测试平台,用于研究失败是由弱运动执行、差指令落地还是有限组合重用引起的。

英文摘要

Generalist manipulation policies are increasingly presented as foundation models for robotic control, but their real-world generalization remains difficult to diagnose. A policy may succeed on demonstrated tasks while still failing to execute fine-grained atomic skills or recombine learned skills in new task structures. We introduce \textbf{ATOM-Bench}, a real-world benchmark for evaluating both atomic skills and compositional generalization in manipulation policies. ATOM-Bench factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. We collect 3,000 human demonstrations for atomic fine-tuning and release both the demonstration data and evaluation rollout data to support reproducible real-world evaluation. Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. We further introduce Atomic Score (AS) and Compositional Failure Share (CFS) to distinguish failures caused by weak atomic skills from failures caused by limited compositional reuse. Through 2,700 physical rollouts on five representative manipulation policies, we find that current policies can acquire simple instruction-grounding skills, but still struggle with fine-grained motor atoms, counting, and logical filtering. More importantly, strong atomic performance does not reliably transfer to held-out compositional tasks. ATOM-Bench provides a diagnostic testbed for studying whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.

2606.16953 2026-06-16 cs.RO 新提交

SidewalkBench: Benchmarking Visual Navigation on Urban Sidewalks

SidewalkBench: 城市人行道视觉导航基准

Zhizheng Liu, Honglin He, Vivek Alumootil, Akshat Pandya, Brad Squicciarini, Wayne Wu, Bolei Zhou

发表机构 * University of California, Los Angeles(加利福尼亚大学洛杉矶分校) Coco Robotics

AI总结 针对城市人行道视觉导航缺乏统一基准的问题,提出SidewalkBench,基于NVIDIA Isaac Sim构建高保真仿真环境,评估9种模型,发现行人交互和长距离鲁棒性是关键瓶颈。

Comments Project Page: https://vail-ucla.github.io/SidewalkBench/

详情
AI中文摘要

城市人行道导航由于复杂的结构布局、动态的行人行为和长距离而面临重大挑战。虽然最近的视觉导航模型提供了一种有前景的解决方案,但缺乏统一的基准阻碍了定量和可重复的评估。为了弥补这一差距,我们提出了SidewalkBench,一个专为城市人行道视觉导航设计的综合基准。基于NVIDIA Isaac Sim,SidewalkBench提供了GPU加速的多样化、高保真人行道环境模拟,包括程序化生成和真实世界扫描的场景。我们进一步用丰富的、反应性的事件驱动行人行为和灵活高效的动画填充场景,从而在逼真的现实世界设置下实现标准化的模型评估。我们在330个单元测试场景、800个行人反应场景和105个长时域场景上对9个视觉导航模型进行了全面评估。我们的发现强调,行人交互和长时域鲁棒性仍然是现有模型的关键瓶颈,而利用合成数据扩大人行道训练成为一种有前景的解决方案。

英文摘要

Urban sidewalk navigation presents significant challenges due to complex structural layouts, dynamic pedestrian behaviors, and long distances. While recent visual navigation models offer a promising solution, the lack of a unified benchmark hinders quantitative and reproducible evaluation. To bridge this gap, we propose SidewalkBench, a comprehensive benchmark designed for visual navigation on urban sidewalks. Built upon NVIDIA Isaac Sim, SidewalkBench brings GPU-accelerated simulation of diverse, high-fidelity sidewalk environments, including both procedurally generated and real-world scanned scenes. We further populate the scenes with rich, reactive event-based pedestrian behaviors and flexible, efficient animation, enabling standardized model evaluation under realistic real-world settings. We conduct a comprehensive evaluation of 9 visual navigation models on 330 unit-test scenarios, 800 pedestrian-reactive scenarios, and 105 long-horizon scenarios. Our findings highlight that pedestrian interaction and long-horizon robustness remain critical bottlenecks for existing models, and scaling up sidewalk training with synthetic data emerges as a promising solution.

2606.17040 2026-06-16 cs.RO cs.CV 新提交

R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies

R2RDreamer: 面向空间泛化的2D操作策略的3D感知数据增强

Xiuwei Xu, Haowen Sun, Angyuan Ma, Yiwei Zhang, Zhenyu Wu, Xiaofeng Wang, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu

发表机构 * Tsinghua University(清华大学) BUPT(北京邮电大学) GigaAI

AI总结 提出R2RDreamer框架,通过轻量级3D编辑和2D视频补全,从少量真实演示生成几何一致的增强数据,提升2D操作策略的空间泛化能力。

Comments Project page: https://r2rdreamer.github.io/

详情
AI中文摘要

空间泛化对于模仿学习的操作策略至关重要,但通常需要跨不同物体姿态、机器人配置和相机视角的大规模演示。从少量源演示中进行数据增强为昂贵的真实世界数据收集提供了一种实用替代方案。基于仿真的增强可以创建可控变化,但需要复杂的环境和物体设置,并可能引入仿真到现实的差距。最近的实到实方法通过联合编辑真实演示的3D观测和动作轨迹来避免这些问题,但它们仍然依赖于强大的3D场景解析和几何补全,并且通常生成针对3D点云策略而非基于RGB的2D策略的观测。我们提出R2RDreamer,一个实到实演示增强框架,它在保持3D动作-观测编辑的几何一致性的同时,将视觉补全迁移到2D视频空间。具体来说,R2RDreamer首先通过在一个共享的3D框架中编辑不完整的物体点云和末端执行器轨迹来执行轻量级3D增强;然后,它将编辑后的场景投影到具有遮挡感知推理的掩码图像空间控制视频中,并使用密集控制图像到视频模型来补全时间上连贯的RGB观测。在空间偏移操作任务上的实验,包括2D扩散风格策略和视觉-语言-动作策略,表明R2RDreamer从有限的源演示中提高了空间泛化能力,分析验证了3D编辑、遮挡感知投影和视频补全的贡献。

英文摘要

Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Data augmentation from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and often produce observations tailored to 3D pointcloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object pointclouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, with analyses validating the contributions of 3D editing, occlusion-aware projection, and video completion.

2606.15139 2026-06-16 cs.GT cs.RO 交叉投稿

Self-Driving Negotiator: An interactive, verifiable benchmark for social negotiation and theory of mind under hidden intent

自动驾驶谈判者:一个在隐藏意图下进行社会谈判和心理理论的交互式可验证基准

Ashutosh Kumar

发表机构 * Owl Autonomous Imaging, Inc(Owl 自动成像公司)

AI总结 提出一个文本多轮程序化生成环境,用于衡量自动驾驶中基于隐藏意图推断的隐式社会协调能力,通过特权模拟器状态计算奖励和诊断,当前最佳模型平均成功率仅0.68。

详情
AI中文摘要

自动驾驶充满了微小的社会谈判:一个司机向前推进,另一个让行,行人假意向路边移动,或车道车辆选择是否打开并线间隙。这类互动需要在部分可观测性下从行为推断隐藏意图,然后安全高效地行动。现有的自动驾驶语言基准主要关注感知、视觉问答或开环规划,而现有的语言智能体谈判基准通常将谈判明确表达在文本中。自动驾驶谈判者弥合了两者之间的差距:一个纯文本、多轮、程序化生成的环境,用于衡量驾驶中的隐式社会协调。智能体生成具体的驾驶动作。奖励和诊断从特权模拟器状态计算,而非模型的解释。本报告涵盖任务设计、奖励和反博弈不变量、验证场景、非LLM基线以及六模型推理排行榜。当前模型与脚本专家相去甚远。三个场景中最佳平均成功率为0.68;争议并线场景中模型表现统计上持平;难度层级区分了线索跟随与真正的等待承诺行为。

英文摘要

Autonomous driving is full of tiny social negotiations: a driver presses forward, another yields, a pedestrian fakes toward the curb, or a lane vehicle chooses whether to open a merge gap. Such interactions require inferring hidden intent from behavior under partial observability and then acting safely and efficiently. Existing autonomous-driving language benchmarks mostly focus on perception, visual question answering, or open-loop planning, while existing language-agent negotiation benchmarks typically make the negotiation explicit in text. Self-Driving Negotiator bridges the gap between the two: a text-only, multi-turn, procedurally generated environment for measuring implicit social coordination in driving. Agents generate specific driving actions. Reward and diagnostics are computed from the privileged simulator state, not from the explanation of the model. This report covers task design, reward and anti-gaming invariants, validated scenarios, non-LLM baselines, and a six-model inference leaderboard. Current models are far removed from the scripted expert. The best average success rate across three scenarios is 0.68; contested merge is statistically flat across models; and difficulty tiers separate cue-following from true wait-for-commitment behavior.

2606.15522 2026-06-16 cond-mat.mtrl-sci cs.RO 交叉投稿

NIMO: A Software Platform for Closed-Loop Materials Exploration with Diverse AI Algorithms

NIMO:一个集成多种AI算法的闭环材料探索软件平台

Ryo Tamura, Naruki Yoshikawa, Koji Tsuda, Shoichi Matsuda

发表机构 * Center for Basic Research on Materials, National Institute for Materials Science(材料基础研究所,日本国家材料科学研究所) Graduate School of Frontier Sciences, The University of Tokyo(东京大学前沿科学研究生院) Center for Green Research on Energy and Environmental Materials, National Institute for Materials Science(能源与环境材料绿色研究中心,日本国家材料科学研究所)

AI总结 提出开源平台NIMO,通过模块化AI-机器人解耦、候选池架构和统一接口,集成12种AI算法,实现跨实验室的自主材料探索。

Comments 29 pages, 5 figures

详情
AI中文摘要

自主实验室(SDL)中,人工智能提出后续实验,机器人系统执行这些实验,正迅速成为材料发现的先锋。然而,一个关键的瓶颈在于如何将针对特定探索目标定制的多样化AI算法与不同实验室中异构的机器人硬件无缝衔接。在此,我们介绍NIMO,一个开源软件平台,旨在通过三个核心范式消除这一障碍:通过简单的CSV文件交换实现模块化AI-机器人解耦,一个离散候选池架构无缝吸收领域知识,以及一个预装了十二种不同AI算法的统一Python接口。在这篇视角文章中,我们回顾了每种算法的操作原理,以及由NIMO驱动的六个不同SDL实现,涵盖电解质发现、有机合成、薄膜探索、燃料电池过程信息学、咖啡环相探索和遗留液体处理自动化。其中一个实现还展示了NIMO与IvoryOS编排框架的无缝互操作性。为了普及自主科学,我们还介绍了一个无代码桌面应用程序,使非程序员能够进行直观的人机交互探索。NIMO可在https://github.com/NIMS-DA/nimo免费获取,为加速跨不同实验场景的自主材料探索提供了多功能的即插即用基础。

英文摘要

Self-driving laboratories (SDLs), where artificial intelligence proposes subsequent experiments and robotic systems execute them, are rapidly becoming the vanguard of materials discovery. A critical bottleneck, however, lies in seamlessly bridging diverse AI algorithms tailored for specific exploration goals with the heterogeneous robotic hardware found across different laboratories. Here, we present NIMO, an open-source software platform designed to dissolve this barrier through three core paradigms: a modular AI-robot decoupling mediated via simple CSV file exchange, a discrete candidate-pool architecture that seamlessly absorbs domain knowledge, and a unified Python interface pre-loaded with twelve distinct AI algorithms. In this Perspective, we review the operational principles of each algorithm alongside six diverse SDL implementations driven by NIMO, covering electrolyte discovery, organic synthesis, thin-film exploration, fuel-cell process informatics, coffee-ring phase exploration, and legacy liquid-handling automation. One of these also demonstrates NIMO's seamless interoperability with the IvoryOS orchestration framework. To democratize autonomous science, we also introduce a no-code desktop application that enables intuitive, human-in-the-loop exploration for non-programmers. NIMO is freely available at https://github.com/NIMS-DA/nimo, offering a versatile, plug-and-play foundation to accelerate autonomous materials exploration across diverse experimental landscapes.
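
The CSV-mediated decoupling and the discrete candidate pool described above can be illustrated with a short sketch. It assumes a pool file whose rows are candidate conditions, with descriptor columns followed by one objective column that stays blank until the robot has measured it. The column layout, the helper names, and the random selection rule are hypothetical and chosen only for illustration; the sketch does not use or reproduce NIMO's actual API.

```python
import csv
import io
import random


def propose_candidates(pool_csv_text, objective_column, num_proposals, seed=0):
    """Pick untested rows from a candidate-pool CSV (hypothetical layout).

    A row counts as untested when its objective cell is blank. Random
    selection stands in for whichever AI algorithm is plugged in.
    """
    rows = list(csv.DictReader(io.StringIO(pool_csv_text)))
    untested = [
        (index, row)
        for index, row in enumerate(rows)
        if (row[objective_column] or "").strip() == ""
    ]
    rng = random.Random(seed)
    chosen = rng.sample(untested, min(num_proposals, len(untested)))
    return sorted(chosen, key=lambda item: item[0])


def proposals_to_csv(proposals, fieldnames):
    """Serialise proposals to CSV text that a robot-side script could read."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["row_index"] + list(fieldnames), lineterminator="\n"
    )
    writer.writeheader()
    for index, row in proposals:
        writer.writerow({"row_index": index, **row})
    return buffer.getvalue()
```

Because the two sides only share plain CSV text, the selection rule can be swapped without touching the robot-side code, which is the decoupling the abstract emphasizes.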

2606.16202 2026-06-16 cs.CV cs.AI cs.RO 交叉投稿

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

EgoPhys: 从第一人称视频学习可变形物体的通用物理模型

Hyunjin Kim, Ri-Zhao Qiu, Guangqi Jiang, Xiaolong Wang

发表机构 * UC San Diego(加州大学圣地亚哥分校)

AI总结 提出EgoPhys框架,从第一人称RGB视频中通过可泛化先验构建可变形物体的物理数字孪生,无需测试时优化即可预测弹簧刚度场,在重建、未来预测和零样本泛化上优于基线。

Comments Project Page: https://hjhyunjinkim.github.io/EgoPhys

详情
AI中文摘要

人类通过日常互动自然地理解物体物理,但准确预测复杂的可变形动力学(如弹性材料和织物)仍然是计算机视觉和机器人学的主要挑战。我们提出EgoPhys,一个利用可泛化先验从仅RGB的第一人称视频构建可变形物理数字孪生的框架。EgoPhys通过将每个物体的逆物理解蒸馏到紧凑码本中,克服了现有方法的局限性,从而能够为未见物体预测密集的弹簧刚度场,而无需每个弹簧的测试时优化。使用来自多样化第一人称交互的可泛化先验进行训练,EgoPhys在重建、未来预测和零样本泛化方面优于基线。为了支持训练和评估,我们整理了一个涵盖多样化可变形物体、场景和操作风格的第一人称交互数据集。我们将EgoPhys部署在真实的xArm6机器人上,证明从单个第一人称人类游戏视频初始化的数字孪生可以作为内部世界表示,辅助可变形物体规划,突显第一人称RGB观测作为通往真实到模拟管道的可扩展路径。

英文摘要

Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods to enable controllable deformable digital twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, enabling prediction of dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.

2509.14548 2026-06-16 cs.RO cs.HC 版本更新

SimCoachCorpus: A naturalistic dataset with language and trajectories for embodied teaching

SimCoachCorpus:一个包含语言和轨迹的自然主义数据集,用于具身教学

Emily Sumner, Deepak E. Gopinath, Laporsha Dees, Patricio Reyes Gomez, Xiongyi Cui, Andrew Silva, Jean Costa, Allison Morgan, Mariah Schrum, Tiffany L. Chen, Avinash Balachandran, Guy Rosman

发表机构 * Toyota Research Institute(丰田研究院) Cambridge, MA 02139(马萨诸塞州剑桥市02139) Los Altos, CA 94022(加州洛斯阿尔托斯94022)

AI总结 为填补具身交互领域中语言与物理动作结合的数据集空白,我们构建了SimCoachCorpus,包含29名受试者在赛车模拟器中的驾驶数据(15名有教练指导,14名无指导),同步记录车辆状态、教练反馈及认知负荷等,可用于研究运动学习、语言现象及训练教学模型。

Comments This is an extended version of a paper accepted to KDD Datasets & Benchmarks Track 2026

详情
AI中文摘要

高质量策划的数据集对于训练和评估人工智能方法至关重要,但在语言和物理动作交织的具身交互领域中,这类数据集往往缺乏。特别是,很少有数据集能够捕捉人们通过口头指令随时间在具身任务中获取运动技能的过程。为填补这一空白,我们引入了SimCoachCorpus:一个独特的赛车模拟器驾驶数据集,能够研究在指导和非指导运动技能获取过程中的丰富现象。在该数据集中,29名受试者被要求在驾驶模拟器中围绕赛道驾驶约九十分钟。15名参与者接受了一位专业性能驾驶教练的一对一指导,14名参与者没有接受教练指导。SimCoachCorpus包括车辆状态和输入、地图(赛道边界和赛车线)以及锥形标志等特征。此外,这些特征与教练的同步口头反馈以及每圈结束后的额外终端反馈同步。我们还为每个同步反馈话语提供了高层教练类别的高质量注释、学生对教练建议的遵从度评分,以及参与者的自我报告认知负荷和情绪状态(通过研究期间的调查收集)。最终数据集包含超过20,000个同步反馈话语、超过400个终端反馈话语以及超过40小时的交互驾驶数据。我们的自然主义交互数据集可用于研究运动学习动态、探索语言现象以及训练教学和学习计算模型。我们展示了该数据集在上下文学习、模仿学习和主题建模中的应用。数据托管在https://doi.org/10.7910/DVN/W7VTKZ,代码可在https://github.com/ToyotaResearchInstitute/sim_coach_corpus获取。

英文摘要

High-quality curated datasets are essential for training and evaluating AI approaches, but are often lacking in embodied interactive domains where language and physical action are intertwined. In particular, few datasets capture how people acquire motor skills in embodied tasks through verbal instruction over time. To address this gap, we introduce SimCoachCorpus: a unique dataset of race car simulator driving that enables the investigation of rich phenomena during guided and unguided motor skill acquisition. In this dataset, 29 humans were asked to drive in a driving simulator around a race track for approximately ninety minutes. Fifteen participants received one-on-one instruction from a professional performance driving coach, and 14 participants drove without coaching instruction. SimCoachCorpus includes features such as vehicle state and inputs, map (track boundaries and race-line), and cone landmarks. Additionally, these are synchronized with the coach's concurrent verbal feedback and additional terminal feedback at the end of each lap. We also provide high-quality annotations of high-level coaching categories for each concurrent feedback utterance, ratings on students' compliance with coaching advice, and self-reported cognitive load and emotional state of participants (gathered from surveys during the study). The final dataset includes over 20,000 concurrent feedback utterances, over 400 terminal feedback utterances, and over 40 hours of interactive driving data. Our naturalistic interactive dataset can be used to investigate motor learning dynamics, explore linguistic phenomena, and train computational models of teaching and learning. We demonstrate applications of this dataset for in-context learning, imitation learning, and topic modeling. Data is hosted at https://doi.org/10.7910/DVN/W7VTKZ and code is available at https://github.com/ToyotaResearchInstitute/sim_coach_corpus

2411.19567 2026-06-16 cs.SE cs.RO 版本更新

DynNPC: Finding More Violations Induced by ADS in Simulation Testing through Dynamic NPC Behavior Generation

DynNPC:通过动态NPC行为生成在仿真测试中发现更多由ADS引发的违规

You Lu, Yifan Tian, Dingji Wang, Bihuan Chen, Xin Peng

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University(计算机科学与人工智能学院,复旦大学)

AI总结 提出DynNPC框架,让NPC车辆在仿真执行中根据交通信号和自车行为动态生成驾驶策略,以生成更多由自动驾驶系统(ADS)引发的违规场景,提升测试效率。

Comments Accepted by TOSEM 2026

详情
AI中文摘要

最近,许多仿真测试方法被提出,用于生成多样化的驾驶场景以测试自动驾驶系统(ADS)。然而,先前方法生成的场景中NPC车辆的行为是预定义并在仿真执行前变异的,忽略了交通信号和自车(Ego)车辆的行为。因此,它们发现的大量违规是由NPC车辆的不现实行为引发的,并未揭示ADS的缺陷。此外,迭代变异过程中NPC行为的巨大场景搜索空间限制了先前方法的效率。为解决这些限制,我们提出了一种新颖的基于场景的测试框架DynNPC,以生成更多由ADS引发的违规场景。具体来说,DynNPC允许NPC车辆在仿真执行期间根据交通信号和自车车辆的实时行为,使用不同的驾驶策略动态生成行为。我们将DynNPC与最先进的基于场景的测试方法进行比较。评估结果表明,DynNPC在发现更多由ADS引发的违规场景方面具有有效性和高效性。

英文摘要

Recently, a number of simulation testing approaches have been proposed to generate diverse driving scenarios for autonomous driving systems (ADSs) testing. However, the behaviors of NPC vehicles in these scenarios generated by previous approaches are predefined and mutated before simulation execution, ignoring traffic signals and the behaviors of the Ego vehicle. Thus, a large number of the violations they found are induced by unrealistic behaviors of NPC vehicles, revealing no bugs of ADSs. Besides, the vast scenario search space of NPC behaviors during the iterative mutations limits the efficiency of previous approaches. To address these limitations, we propose a novel scenario-based testing framework, DynNPC, to generate more violation scenarios induced by the ADS. Specifically, DynNPC allows NPC vehicles to dynamically generate behaviors using different driving strategies during simulation execution based on traffic signals and the real-time behavior of the Ego vehicle. We compare DynNPC with state-of-the-art scenario-based testing approaches. Our evaluation has demonstrated the effectiveness and efficiency of DynNPC in finding more violation scenarios induced by the ADS.

2502.19544 2026-06-16 cs.LG cs.RO 版本更新

Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data

通过非策划数据引导世界模型的高效强化学习

Yi Zhao, Aidan Scannell, Wenshuai Zhao, Yuxin Hou, Tianyu Cui, Le Chen, Dieter Büchler, Arno Solin, Juho Kannala, Joni Pajarinen

发表机构 * Aalto University(阿尔托大学) University of Edinburgh(爱丁堡大学) ELLIS Institute Finland(芬兰ELLIS研究所) Deep Render Imperial College London(伦敦帝国理工学院) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) CIFAR AI Chair(CIFAR人工智能主席) University of Alberta(阿尔伯塔大学) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔机器智能研究所(Amii)) University of Oulu(奥卢大学)

AI总结 提出利用无奖励、混合质量、多本体的非策划离线数据,通过经验回放和执行引导技术解决分布偏移问题,显著提升在线强化学习的样本效率。

详情
AI中文摘要

利用离线数据是提高在线强化学习(RL)样本效率的一种有前景的方法。本文通过利用丰富的非策划数据(无奖励、混合质量、跨多个本体收集)来扩展离线到在线RL的可用数据池。尽管学习世界模型似乎有望利用此类数据,但我们发现简单的微调在许多任务上无法加速RL训练。通过仔细研究,我们将这种失败归因于微调期间离线数据和在线数据之间的分布偏移。为了解决这个问题并有效使用离线数据,我们提出了两种技术:\emph{i)} 经验回放和\emph{ii)} 执行引导。通过这些修改,非策划离线数据显著提高了RL的样本效率。在有限的样本预算下,我们的方法在跨越6个本体的72个视觉运动任务上,实现了几乎两倍于从头学习基线的总得分。在诸如移动和机器人操作等具有挑战性的任务上,它显著优于先前利用离线数据的方法。

英文摘要

Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two techniques: \emph{i)} experience rehearsal and \emph{ii)} execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves nearly twice the aggregate score of learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.

2603.05876 2026-06-16 cs.CV cs.RO 版本更新

Systematic Evaluation of Novel View Synthesis for Video Place Recognition

面向视频地点识别的合成新视角系统性评估

Muhammad Zawad Mahmud, Samiha Islam, Damian Lyons

AI总结 系统评估合成新视角对视频地点识别的影响,发现少量合成视角可提升识别性能,且视角变化幅度不如添加数量和图像类型重要。

Comments Submitted to IEEE IROS 2026

详情
AI中文摘要

合成新视角的生成在多个方面对机器人导航具有积极影响。在基于图像的导航中,由地面机器人拍摄的场景生成的俯视新视角可用于引导空中机器人到达该位置。在视频地点识别中,可以添加地面位置的空中新视角,使无人机能够识别地面机器人所见的地点,同样,俯视视角也可用于生成地面新视角。本文使用五个公开视频地点识别图像数据库和七种典型图像相似度方法,对视频地点识别中的合成新视角进行了系统性评估。我们表明,对于少量合成添加,新视角能提升视频地点识别的统计指标。我们发现,对于较大添加量,视角变化幅度不如添加视角数量和数据集中的图像类型重要。

英文摘要

The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.

11. 安全、鲁棒性与可信机器人 11 篇

2606.16022 2026-06-16 cs.RO 新提交

$λ$-Reachability: Geometric-Horizon Safety Bellman Equations for Humanoid Safety

$λ$-可达性:面向人形机器人安全性的几何视界安全贝尔曼方程

Rui Chen, Shangtao Li, Yifan Sun, Changliu Liu

发表机构 * Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所) Mechanical Engineering, Carnegie Mellon University(卡内基梅隆大学机械工程系)

AI总结 提出$λ$-可达性方法,通过几何分布视界和随机吸收终端的随机多步估计,实现高维机器人系统的Hamilton-Jacobi安全性分析,显著提升安全边界分类和裕度估计。

详情
AI中文摘要

我们引入了$λ$-可达性,一种用于高维机器人系统的Hamilton-Jacobi安全性分析的可扩展方法。与依赖固定一步贝尔曼更新的先前折扣公式不同,$λ$-可达性采用安全值的随机多步估计器,使用几何分布的展开视界和随机吸收终端。概念上类似于TD($λ$),$λ$-可达性通过可解释的视界控制参数,在局部自一致性更新和长视界轨迹最大安全目标之间插值。与TD($λ$)不同(其中终端值始终包含在学习目标中),$λ$-可达性中的终端安全值仅以参数$δ$控制的概率使用。我们正式证明,对于$δ<1$,更新诱导一个收缩映射,允许时序差分学习;当$λ\to 1$时,估计器恢复无折扣的可达性目标。我们将$λ$-可达性应用于高维安全学习问题,包括在平衡和碰撞避免约束下的模拟和真实人形机器人。实验结果表明,与单步时序差分基线相比,$λ$-可达性显著改进了安全集边界分类和安全裕度估计。

英文摘要

We introduce $λ$-Reachability, a scalable approach to Hamilton--Jacobi safety analysis for high-dimensional robotic systems. Unlike prior discounted formulations that rely on fixed one-step Bellman updates, $λ$-Reachability employs a stochastic multi-step estimator of the safety value, using a geometrically distributed rollout horizon together with a randomly absorbed terminal. Conceptually analogous to TD($λ$), $λ$-Reachability interpolates between local self-consistency updates and long-horizon max-over-trajectory safety targets via an interpretable horizon-control parameter. Unlike TD($λ$), where the terminal value is always incorporated in learning targets, the terminal safety value in $λ$-Reachability is only used at a probability controlled by parameter $δ$. We formally show that for $δ<1$, the update induces a contraction mapping that allows temporal-difference learning; as $λ\to 1$, the estimator recovers the undiscounted reachability objective. We apply $λ$-Reachability to high-dimensional safety learning problems with both simulated and real humanoid robots under balance and collision avoidance constraints. Experimental results demonstrate that $λ$-Reachability significantly improves both safe-set boundary classification and safety margin estimation compared to single-step temporal-difference baselines.
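
The estimator described above can be read as the following minimal sketch. It is one possible interpretation of the abstract, not the paper's algorithm: it assumes a per-state violation measure where higher means less safe, a rollout length drawn from a geometric distribution with continuation probability λ, and a terminal value estimate that is folded into the target only with probability δ. All function and parameter names are hypothetical.

```python
import random


def sample_geometric_horizon(lam, rng, max_horizon):
    """Sample a rollout length; each extra step continues with probability lam."""
    steps = 1
    while steps < max_horizon and rng.random() < lam:
        steps += 1
    return steps


def multi_step_safety_target(violations, terminal_value, lam, delta, rng):
    """Build one stochastic multi-step safety target from a recorded rollout.

    violations[t] is an assumed violation measure at step t (higher = less
    safe). terminal_value(t) returns the current value estimate at step t.
    """
    if len(violations) < 2:
        raise ValueError("rollout must contain at least two states")
    horizon = sample_geometric_horizon(lam, rng, max_horizon=len(violations) - 1)
    # Max-over-trajectory target up to and including the sampled horizon.
    target = max(violations[: horizon + 1])
    # Bootstrap from the terminal value only with probability delta.
    if rng.random() < delta:
        target = max(target, terminal_value(horizon))
    return target
```

In this reading, λ = 0 reduces to a one-step update, while λ close to 1 makes the target approach the maximum over the whole recorded rollout. Because the bootstrapped term enters only with probability δ, two value estimates can differ in the expected target by at most δ times their largest gap, which is consistent with the contraction claim for δ < 1.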

2606.16313 2026-06-16 cs.RO cs.AI 新提交

Is Your Trajectory Displacement Safe in Long-tail?

你的轨迹位移在长尾场景中安全吗?

Qiao Sun, Weicheng Zheng, Yixin Huang, Hang Zhao

发表机构 * Shanghai Qi Zhi Institute(上海期智研究院) Tsinghua University(清华大学) Tongji University(同济大学)

AI总结 提出FluidTest评估框架,通过成对WebUI协议、32种语义威胁分类和三智能体验证系统,检测规划轨迹相对于专家参考的额外威胁,实验发现SOTA规划器仍存在大量安全相关失败。

Comments 20 pages, 15 figures

详情
AI中文摘要

长尾场景仍然是自动驾驶评估的主要瓶颈,即使数据集规模增长数个数量级。现有的评估流水线很少同时具备人类对齐、安全感知、可验证和可解释性:闭环指标在强规划器中常常饱和,而无结构的人类评分在没有精心设计协议的情况下可能充满噪声。我们将规划评估表述为额外威胁检测:给定规划器轨迹和专家参考,规划器的位移是否引入了新的不安全驾驶行为?我们提出FluidTest,一个包含三个组件的评估流水线:用于可靠人工标注的成对WebUI协议;包含32种语义威胁及其基于证据的决策图的分类法;以及一个带有反思机制的三智能体验证系统,用于保证精确性和可审计性。在WOD-E2E数据集上的实验表明,FluidTest在训练过的标注者中产生一致的标签,并在65%的Poutine轨迹和51%的RAP轨迹中识别出额外威胁。这些结果表明,尽管具有高评分者反馈分数(RFS)和低平均位移误差(ADE),最先进的规划器仍可能表现出大量与安全相关的失败。更多细节、指导和代码请访问https://fluidtest.web.app。

英文摘要

Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner's displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at https://fluidtest.web.app.
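
For readers unfamiliar with the ADE metric mentioned above, here is a minimal sketch of the standard definition: the mean Euclidean distance between planned and reference waypoints at matching time steps. The function name and the 2D waypoint format are assumptions for illustration, and the sketch is not part of FluidTest.

```python
import math


def average_displacement_error(planned, reference):
    """Mean Euclidean distance between time-aligned waypoints (metres)."""
    if not planned or len(planned) != len(reference):
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.dist(p, r) for p, r in zip(planned, reference))
    return total / len(planned)
```

ADE summarizes only geometric distance to the expert trajectory. It carries no information about what occupies the space the planner drifted into, which is why a small displacement can still introduce a new threat and why the abstract argues for a separate threat-detection step.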

2606.16788 2026-06-16 cs.RO 新提交

SoK: Security and Privacy of Foundation-Model-Powered Robots

SoK: 基础模型驱动机器人的安全与隐私

Xueluan Gong, Chen Chen, Jinxin Liu, Qian Wang, Kwok-Yan Lam

发表机构 * College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Cyber Science and Engineering, Wuhan University(武汉大学网络空间安全学院)

AI总结 本文提出F-E-S-G结构边界框架,系统分析基础模型驱动机器人的安全与隐私风险,并基于96篇论文揭示威胁模式、防御不匹配和评估差距。

Comments 21 pages, 2 figures

详情
AI中文摘要

基础模型正在重塑机器人技术,使机器人能够解释开放式指令、推理多模态上下文并在复杂的开放世界环境中运行。然而,它们的集成也引入了安全与隐私(S&P)风险,这些风险从基础模型本身扩展到具身执行管道、支持生态系统以及更广泛的治理影响。现有文献综述提供了宝贵的见解,但通常侧重于特定的基础模型类型、风险类别、缓解策略或信任边界。因此,该领域缺乏一个统一的结构来分析风险源自何处、如何在机器人系统中传播以及缓解措施应在何处干预。为填补这一空白,我们提出了一个渐进式的F-E-S-G结构边界框架,用于分析基础模型驱动机器人的安全与隐私。该框架包含四个层次:基础模型层(F)、具身系统层(E)、支持生态系统层(S)和治理影响层(G)。基于此结构,我们开发了一个多级分类法,沿三个层次组织先前的研究:F-E-S-G信任边界、安全-隐私关注点以及风险-缓解视角。我们进一步使用细粒度编码属性对每项研究进行注释,包括目标、生命周期阶段、机制、系统访问和效果。在此框架和分类法的指导下,我们对96篇论文进行了系统化分析。我们的分析揭示了从单一边界视角难以识别的多种威胁模式、防御不匹配和评估差距。基于这些发现,我们确定了开放挑战和未来方向,为开发安全、隐私保护且负责任治理的基础模型驱动机器人系统提供了研究议程。

英文摘要

Foundation models are reshaping robotics by enabling robots to interpret open-ended instructions, reason over multimodal contexts, and operate in complex, open-world environments. However, their integration also introduces security and privacy (S&P) risks that extend beyond the FMs themselves to embodied execution pipelines, supporting ecosystems, and broader governance impacts. Existing literature reviews provide valuable insights but often focus on specific FM types, risk categories, mitigation strategies, or trust boundaries. Consequently, the field lacks a unified structure for analyzing where risks originate, how they propagate across robotic systems, and where mitigations should intervene. To address this gap, we propose a progressive F-E-S-G structural boundary framework for analyzing the S&P of FM-powered robots. The framework comprises four layers: the Foundation model layer (F), Embodied system layer (E), Supporting ecosystem layer (S), and Governance impact layer (G). Building on this structure, we develop a multi-level taxonomy that organizes prior studies along three levels: F-E-S-G trust boundary, security-privacy concerns, and risk-mitigation perspectives. We further annotate each study using fine-grained coding attributes, including target, lifecycle stage, mechanism, system access, and effect. Guided by this framework and taxonomy, we systematize 96 papers. Our analysis uncovers multiple threat patterns, defense mismatches, and evaluation gaps that are difficult to identify from a single-boundary perspective. Based on these findings, we identify open challenges and future directions to provide a research agenda for developing secure, privacy-preserving, and responsibly governed FM-powered robotic systems.

2606.15165 2026-06-16 cs.CR cs.RO 交叉投稿

VLALeaks: Membership Inference Attacks against Vision-Language-Action Models

VLALeaks:针对视觉-语言-动作模型的成员推理攻击

Xukun Luan, Jinyan Liu, Xuesong Li, Yuanguo Bi, Renjun Wu, Zhongxiang Lei, Di Wang

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出VLALeaks方法,利用VLA模型注意力差异,通过两阶段流程(成员特征提取和攻击模型构建)首次揭示VLA模型的隐私漏洞,在多个基准上实现最优攻击性能。

Comments Security and Privacy

详情
AI中文摘要

视觉-语言-动作(VLA)模型实现了端到端的机器人控制,并引起了广泛关注。然而,VLA模型对训练数据的记忆特性,加上机器人数据采集的高昂成本,引发了关于数据隐私泄露和知识产权侵权的严重担忧。成员推理攻击(MIA)旨在判断给定样本是否属于训练集。尽管这种攻击代表了重大的隐私威胁,但在VLA模型的背景下尚未得到充分探索。为填补这一空白,我们提出了VLALeaks,该方法基于VLA模型中的注意力差异。我们首次揭示了VLA模型的隐私漏洞。具体而言,它包括两个阶段:(1)成员特征提取,和(2)攻击模型构建。在多个VLA基准上的实验结果表明,VLALeaks能够轻易揭示成员信息,并实现了最优的攻击AUC和TPR@1%FPR,突显了当前VLA模型部署中的隐私漏洞。我们的工作是首个对VLA模型进行MIA的系统性研究,旨在为安全可信的VLA模型提供见解。

英文摘要

Vision-Language-Action (VLA) models enable end-to-end robot control and have garnered widespread attention. However, the memorization of training data inherent to VLA, coupled with the high cost of robotic data acquisition, raises serious concerns regarding data privacy leakage and intellectual property infringement. Membership inference attacks (MIAs) aim to determine whether a given sample belongs to the training set. While representing a significant privacy threat, this attack remains underexplored in the context of VLA models. To bridge this gap, we propose VLALeaks, which is based on attention discrepancies in VLA models. We reveal, for the first time, the privacy vulnerabilities of VLA models. Specifically, it comprises a two-stage process: (1) membership feature extraction, and (2) attack model construction. Experimental results across multiple VLA benchmarks demonstrate that VLALeaks readily reveals membership information and achieves optimal attack AUC and TPR@1%FPR, highlighting the privacy vulnerabilities in current VLA model deployments. Our work is the first systematic study of MIAs on VLA models, aiming to provide insights for secure and trustworthy VLA models.
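The TPR@1%FPR metric named in the abstract can be made concrete with a short sketch. The function below is a generic way to compute it for any membership score, not the VLALeaks pipeline itself; the function name and the toy scores are hypothetical and only illustrate the definition (the best true-positive rate among thresholds whose false-positive rate stays at or below 1%).

```python
def tpr_at_fpr(scores, labels, max_fpr=0.01):
    """Best TPR of the rule `score >= threshold` among thresholds with FPR <= max_fpr.

    labels: 1 for training-set members, 0 for non-members.
    """
    members = [s for s, y in zip(scores, labels) if y == 1]
    non_members = [s for s, y in zip(scores, labels) if y == 0]
    best_tpr = 0.0
    # Candidate thresholds: every observed score, plus +inf (flag nothing).
    for threshold in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= threshold for s in non_members) / len(non_members)
        if fpr <= max_fpr:
            tpr = sum(s >= threshold for s in members) / len(members)
            best_tpr = max(best_tpr, tpr)
    return best_tpr


# Toy data (hypothetical): 100 non-member scores and 4 member scores.
non_member_scores = [i / 100 for i in range(100)]
member_scores = [0.995, 0.999, 0.5, 0.2]
scores = member_scores + non_member_scores
labels = [1] * len(member_scores) + [0] * len(non_member_scores)

result = tpr_at_fpr(scores, labels)
print(result)  # 0.5: two of the four members score above all but one non-member
```

Reporting the low-FPR regime rather than AUC alone matters for membership inference because an attack is only practically useful if it can flag members while almost never accusing non-members.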

2606.15311 2026-06-16 eess.SY cs.RO cs.SY 交叉投稿

Hamilton-Jacobi Reachability-Based Safe Reinforcement Learning for Emergency Collision Avoidance

基于Hamilton-Jacobi可达性的安全强化学习用于紧急碰撞避免

Yuhong Jiang, Shiyue Zhao, Junzhi Zhang, Junfeng Zhang, Xinhan Li, Shijie Zhao, Chengkun He

发表机构 * Tsinghua University(清华大学) Jilin University(吉林大学)

AI总结 提出一种基于Hamilton-Jacobi可达性运动安全集的安全强化学习框架,通过离线数据近似安全集并嵌入约束马尔可夫决策过程,实现紧急避撞中的前瞻性安全监督与策略优化。

Comments Preprint

详情
AI中文摘要

极端驾驶条件下的紧急碰撞避免需要安全关键控制,该控制需考虑未来时间范围内的障碍物接近度和车辆动态稳定性,然而现有方法通常依赖瞬时或局部安全评估。本文提出一种安全强化学习框架,由基于Hamilton-Jacobi (HJ) 可达性的运动安全集引导,为约束策略优化提供前瞻性安全监督。具体而言,通过结合几何碰撞裕度和底盘稳定性极限,构建统一的带符号安全函数,并通过可达性分析将其扩展为有限时域运动安全集,该集合表征在未来车辆状态演化下能否维持安全。为实现实用计算,从离线极端驾驶数据中近似运动安全集,减轻基于网格的HJ求解器的计算负担。然后将学习到的运动安全集作为连续安全成本嵌入约束马尔可夫决策过程,并采用PID-Lagrangian策略优化方案自适应调节拉格朗日乘子以强制执行安全约束。在低附着避障场景中的仿真和实车实验表明,所提方法相比基线方法实现了更高的目标到达率、更平滑的避让机动,并保持了更大的统一安全裕度。

英文摘要

Emergency collision avoidance under extreme driving conditions demands safety-critical control that accounts for both obstacle proximity and vehicle dynamic stability over a future time horizon, yet existing methods often rely on instantaneous or local safety evaluations. This paper proposes a safe reinforcement learning framework guided by a Hamilton-Jacobi (HJ) reachability based motion safety set that provides forward-looking safety supervision for constrained policy optimization. Specifically, a unified signed safety function is formulated by combining geometric collision margins and chassis stability limits, and is then extended through reachability analysis into a finite-horizon motion safety set that characterizes whether safety can be maintained under future vehicle state evolution. To enable practical computation, the motion safety set is approximated from offline extreme driving data, mitigating the computational burden of grid-based HJ solvers. The learned motion safety set is then embedded as a continuous safety cost into a constrained Markov decision process, and a PID-Lagrangian policy optimization scheme is employed to adaptively regulate the Lagrange multiplier for safety constraint enforcement. Simulation and real-vehicle experiments on low-adhesion obstacle-avoidance scenarios demonstrate that the proposed method achieves higher goal-reaching rates, produces smoother avoidance maneuvers, and maintains larger unified safety margins than baseline methods.
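The PID-Lagrangian step mentioned in the abstract can be sketched as a small controller on the constraint violation. This is a minimal illustration of the general PID-Lagrangian idea, not the authors' implementation: the class name, the gains, the cost limit, and the choice to clip the integral and derivative terms at zero are all assumptions made for the example.

```python
class PIDLagrangian:
    """PID controller that turns safety-cost violations into a Lagrange multiplier."""

    def __init__(self, kp, ki, kd, cost_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_error = None

    def update(self, episode_cost):
        # Positive error means the safety cost exceeded its budget.
        error = episode_cost - self.cost_limit
        # Integral term accumulates violations but never goes negative.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to a growing violation.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = max(0.0, error - self.prev_error)
        self.prev_error = error
        # The multiplier must stay non-negative.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)


# Hypothetical gains and budget, for illustration only.
controller = PIDLagrangian(kp=0.1, ki=0.01, kd=0.05, cost_limit=1.0)
multipliers = [controller.update(cost) for cost in (3.0, 4.0, 0.0)]
print(multipliers)
```

The resulting multiplier would then weight the safety cost in the policy objective (reward minus multiplier times cost), so the penalty rises while the constraint is violated and relaxes once the policy is back within budget.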

2606.15366 2026-06-16 eess.SY cs.RO cs.SY math.OC 交叉投稿

Robust Conformal CBF and CLF Controllers via Iterative Policy Updates

通过迭代策略更新的鲁棒共形CBF和CLF控制器

Omid Mirzaeedodangeh, Eliot Shekhtman, Nikolai Matni, Lars Lindemann

发表机构 * Automatic Control Laboratory, ETH Zürich(瑞士苏黎世联邦理工学院自动控制实验室) Computer and Information Science, University of Pennsylvania(宾夕法尼亚大学计算机与信息科学系) Electrical and Systems Engineering, University of Pennsylvania(宾夕法尼亚大学电气与系统工程系)

AI总结 针对共形预测嵌入鲁棒控制时因分布偏移导致安全/稳定性保证失效的问题,提出迭代更新策略框架,结合对抗鲁棒共形预测与分布偏移预算,实现跨回合保证。

详情
AI中文摘要

共形预测(CP)已被用于获取学习动力学模型与真实未知系统之间误差的概率界限。然后,这些CP界限可以嵌入到鲁棒控制李雅普诺夫函数(CLF)和控制屏障函数(CBF)框架中。然而,由于部署的CLF/CBF策略下的闭环轨迹分布与推导CP界限及其保证的轨迹分布之间存在分布偏移,这种方法无法保留稳定性/安全性保证。为了解决这个问题,我们提出了一种情节式框架,该框架迭代更新鲁棒共形CLF/CBF策略,同时保持跨情节的稳定性/安全性保证。我们通过(1)使用对抗鲁棒共形预测,以及(2)量化分布偏移预算来实现这一点,该预算允许我们控制模型误差在策略更新中增加的程度。该分布偏移预算通过闭环轨迹灵敏度分析推导得出,为CP界限提供了隐式和显式更新规则。我们分析了算法的收敛性,并在三个案例研究中进行了演示。据我们所知,这是首次为鲁棒共形CBF/CLF策略提供稳定性/安全性保证的结果。

英文摘要

Conformal prediction (CP) has been used to obtain probabilistic bounds on the error between a learned dynamics model and the true but unknown system. Such CP bounds can then be embedded into robust control Lyapunov function (CLF) and control barrier function (CBF) frameworks. However, such an approach does not retain stability/safety guarantees because of the distribution shift between the closed-loop trajectory distribution under the deployed CLF/CBF policy and the trajectory distribution from which the CP bound and its guarantees were derived. To address this issue, we propose an episodic framework that iteratively updates the robust conformal CLF/CBF policy while maintaining stability/safety guarantees across episodes. We achieve this by (1) using adversarially robust conformal prediction, and (2) quantifying a distribution shift budget that allows us to control how much the model error can increase across policy updates. This distribution shift budget is derived via a closed-loop trajectory sensitivity analysis, yielding an implicit and an explicit update rule for the CP bound. We analyze convergence of our algorithm, which we demonstrate on three case studies. To the best of our knowledge, these are the first results that provide stability/safety guarantees for robust conformal CBF/CLF policies.
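As background for the CP bound discussed above, the sketch below computes a standard split conformal bound on model error from calibration residuals. It covers only the basic exchangeable case; the adversarially robust variant and the distribution shift budget proposed in the paper are not modeled here, and the function name and sample residuals are hypothetical.

```python
import math


def conformal_bound(residuals, alpha):
    """Split conformal upper bound on model error at miscoverage level alpha.

    Under exchangeability, a fresh residual stays below the returned value
    with probability at least 1 - alpha.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial bound holds.
        return float("inf")
    return sorted(residuals)[rank - 1]


# Hypothetical calibration residuals |f_learned(x, u) - f_true(x, u)|.
calibration = [float(i) for i in range(19, 0, -1)]
bound = conformal_bound(calibration, alpha=0.1)
print(bound)  # 18.0, the 18th smallest of 19 residuals
```

The guarantee depends on calibration and deployment data being exchangeable, which is exactly what breaks when the deployed CLF/CBF policy changes the closed-loop trajectory distribution.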

2606.16467 2026-06-16 cs.CR cs.RO 交叉投稿

A Formal Resilience Framework for Cyber-Physical Embodied Systems under Device-Level Cyberattacks

面向设备级网络攻击的信息物理具身系统形式化弹性框架

Alberto Giaretta

发表机构 * Department of Computer Science, Örebro University, Sweden(瑞典 Örebro 大学计算机科学系) AI, Robotics and Cybersecurity Center (ARC), Örebro University, Sweden(瑞典 Örebro 大学人工智能、机器人与网络安全中心)

AI总结 提出一种形式化可靠性框架,将入侵检测系统信息融入弹性评估谓词,用于分析网络攻击对具身CPS任务执行和实体保全的影响,并指导缓解策略部署。

Comments 8 pages, 2 tables

详情
AI中文摘要

在信息物理系统(CPS)中,容错通常通过分析传感器和执行器输出、检测渐进漂移或突然故障以及启动适当的容错机制来实现。这种方法在一般故障模型下是合理的,但无法捕捉网络攻击造成的细微破坏,因为攻击可能采用隐蔽的策略。这在具身CPS中尤为关键,因为计算和物理设备不仅在任务完成中起积极作用,还在实体保全(即维护系统物理完整性)中起积极作用。为防止结构性物理损伤,具身CPS需要一个能够主动响应网络攻击的框架。本文提出一个形式化可靠性框架,将入侵检测系统(IDS)信息融入弹性评估谓词,从而能够评估对破坏和退化的容忍能力。该框架支持关于网络攻击如何影响任务执行和实体保全以及是否需要部署缓解策略的结构化推理。若干分析性示例展示了其分析能力和合理性,为可靠且安全的具身CPS奠定了理论基础。

英文摘要

In cyber-physical systems (CPSs), fault tolerance is traditionally achieved by analysing sensor and actuator outputs, detecting progressive drift or sudden failures, and initiating suitable tolerance mechanisms. Reasonable under general failure models, this approach fails to capture nuanced disruptions caused by cyberattacks, which may employ subtle strategies. This is particularly critical in embodied CPSs, where computational and physical devices not only have an active role in task completion, but also in embodiment preservation (that is, maintaining the system's physical integrity). To prevent structural physical damage, embodied CPSs require a framework that enables proactive response to cyberattacks. This paper proposes a formal dependability framework that incorporates IDS information into resilience evaluation predicates, enabling assessment of tolerance to disruption and degradation. The framework supports structured reasoning about how cyberattacks affect task execution and embodiment preservation, and whether mitigation strategies must be deployed. Analytical examples demonstrate its analytical capability and soundness, establishing a theoretical foundation for dependable and secure embodied CPSs.

2601.15459 2026-06-16 cs.RO 版本更新

Neural Minimum-Distance Estimation for Collision-Aware Operation of Multi-Arm Laparoscopy Surgical Robots Through Learning-from-Simulation

基于仿真学习的多臂腹腔镜手术机器人碰撞感知操作的神经最小距离估计

Sarvin Ghiasi, Majid Roshanfar, Jake Barralet, Liane S. Feldman, Amir Hooshiar

发表机构 * Surgical Performance Enhancement and Robotics (SuPER) Centre, Department of Surgery(外科系外科性能增强与机器人(SuPER)中心) The Wilfred and Joyce Posluns Centre for Image Guided Innovation & Therapeutic Intervention (PCIGITI)(威尔弗雷德与乔伊斯·波斯伦影像引导创新与治疗干预中心(PCIGITI)) The Hospital for Sick Children (SickKids)(病童医院(SickKids))

AI总结 提出结合分析建模、实时仿真与深度残差神经网络的框架,用于多臂手术机器人最小距离估计与碰撞预警,模型在验证集上R²=0.940,RMSE=42.0 mm。

详情
Journal ref
Sensors 2026, 26(12), 3744
AI中文摘要

本研究提出了一个集成框架,通过解决多臂操纵器之间的最小距离估计和相关的碰撞感知警告,提高腹腔镜手术中机械臂的安全性和操作效率。通过结合分析建模、实时仿真和机器学习,该框架为确保机器人安全操作提供了稳健的解决方案。开发了一个分析模型,基于关节配置估计机械臂之间的最小距离,提供理论计算作为验证工具和基准。为补充这一点,创建了一个3D仿真环境,模拟两个7自由度Kinova机械臂(Kinova inc., Boisbriand, QC, Canada),生成了用于距离估计和碰撞警告的多样化配置数据集。利用这些见解,训练了一个以关节配置为输入的深度残差神经网络模型。在保留的验证集上,模型达到了R²=0.940,RMSE=42.0 mm,MAE=28.7 mm,且平均偏差接近零,展示了强大的预测准确性和在整个工作空间中的一致泛化能力。该框架旨在作为早期碰撞警告层,当预测的臂间距离低于0.2 m阈值时触发警告,考虑到Kinova Gen3(Kinova inc., Boisbriand, QC, Canada)的横截面半径,这对应于大约50 mm的表面到表面间隙。这项工作展示了将分析建模与机器学习相结合以提高多臂机器人系统精度和可靠性的有效性。

英文摘要

This study presents an integrated framework for enhancing the safety and operational efficiency of robotic arms in laparoscopic surgery by addressing minimum-distance estimation between multi-arm manipulators and the associated collision-aware warning. By combining analytical modeling, real-time simulation, and machine learning, the framework offers a robust solution for ensuring safe robotic operations. An analytical model was developed to estimate the minimum distances between robotic arms based on their joint configurations, offering theoretical calculations that serve as both a validation tool and a benchmark. To complement this, a 3D simulation environment was created to model two 7-DOF Kinova robotic arms (Kinova inc., Boisbriand, QC, Canada), generating a diverse dataset of configurations for distance estimation and collision warning. Using these insights, a deep residual neural network model was trained with joint configurations as inputs. On the held-out validation set, the model achieves R² = 0.940, RMSE = 42.0 mm, MAE = 28.7 mm, and a near-zero mean bias, demonstrating strong predictive accuracy and consistent generalization across the workspace. The framework is intended as an early collision warning layer, where a warning is triggered when the predicted inter-arm distance falls below a 0.2 m threshold, which corresponds to a surface-to-surface clearance of approximately 50 mm given the Kinova Gen3 (Kinova inc., Boisbriand, QC, Canada) cross-sectional radius. This work demonstrates the effectiveness of combining analytical modeling with machine learning to enhance the precision and reliability of multi-arm robotic systems.
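The relation between the 0.2 m warning threshold and the roughly 50 mm surface clearance can be written out as simple arithmetic. The sketch assumes the predicted distance is measured between link centerlines and that both links share the same radius; the 75 mm radius is only the value implied by those two figures under that assumption, not a quoted Kinova specification, and both function names are hypothetical.

```python
def surface_clearance_mm(center_distance_mm, link_radius_mm):
    """Surface-to-surface gap between two links of equal radius."""
    return center_distance_mm - 2.0 * link_radius_mm


def should_warn(predicted_distance_mm, threshold_mm=200.0):
    """Early-warning rule: fire when the predicted inter-arm distance drops below the threshold."""
    return predicted_distance_mm < threshold_mm


# Assumed radius: (200 mm - 50 mm) / 2 = 75 mm.
print(surface_clearance_mm(200.0, 75.0))  # 50.0
print(should_warn(180.0), should_warn(250.0))  # True False
```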

2606.12978 2026-06-16 cs.RO cs.CV cs.SY eess.SY 版本更新

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

针对视觉-语言-动作模型的轨迹级重定向攻击

Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文发现VLA模型存在轨迹级漏洞:看似保留原始指令的对抗性提示,能重定向机器人最终物理结果,并提出了命令保持的轨迹重定向威胁模型和同策略(on-policy)提示搜索方法。

详情
AI中文摘要

视觉-语言-动作(VLA)策略将自然语言引入闭环机器人控制,使机器人能够直接根据文本指令执行操作任务。同一接口使文本在控制中反复发挥作用,因为提示在每个重新规划步骤中都会被重复使用,而每个以提示为条件的动作都会改变策略后续所依据的观测。现有的VLA攻击研究的是能够诱发目标低级动作、或使此类动作在变化的图像中持续存在的对抗性提示。我们识别出一种更强的轨迹级失效模式:提示看起来仍然指定了预期任务,却重定向了最终的物理结果。我们在数学上将这一设置形式化为命令保持的轨迹重定向,这是一种仅提示的威胁模型:攻击者在回合开始前选择一个提示,所有策略和环境组件保持不变,并且提示必须保持接近良性指令,同时不包含目标词和纠正性语言。为了找到这样的提示,我们引入了一种同策略(on-policy)提示搜索方法,该方法利用rollout来发现扰动,使其闭环行为跟踪目标任务,同时满足命令保持约束。在仿真和硬件上的实验表明,接近良性的提示扰动可以将VLA的rollout重定向到攻击者指定的目标。这些结果暴露了VLA指令落地(instruction grounding)中的轨迹级漏洞:看似保留预期命令的文本仍然可以让对手控制机器人的最终物理结果。项目网站:https://vla-redirection-attack.github.io/

英文摘要

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still appears to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as command-preserving trajectory redirection, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/

2606.14238 2026-06-16 cs.RO cs.AI 版本更新

When and How Severely: Scenario-Specific Safety Envelopes for Driving VLAs

何时以及多严重:驾驶VLA的场景特定安全包络

Abhinaw Priyadershi, Jelena Frtunikj

发表机构 * NVIDIA Corporation(英伟达公司) NVIDIA GmbH(英伟达德国有限公司)

AI总结 针对ISO 21448下VLA驾驶规划器的安全认证,提出二维安全包络方法,通过GMM识别六种严重性等级,揭示场景特定风险差异。

详情
AI中文摘要

根据ISO 21448 (SOTIF)对视觉-语言-动作(VLA)驾驶规划器的安全认证依赖于运行设计域(ODD)规范,该规范回答两个互补的问题:规划器何时开始失效,以及一旦失效其严重程度如何?我们评估了Alpamayo R1(一个100亿参数的开放权重驾驶VLA)在15,968个(片段,攻击)对上的表现。我们发现一个保守的聚合差距:在15%平均位移误差(ADE)预算下,聚合安全阈值σ ≤ 50掩盖了能够容忍测试网格顶部(σ = 70)的良好采样场景。在解释发生变化的子集上的高斯混合模型(GMM)识别出六个离散的严重性等级(BIC最优k=6),因此具有相同平均误差的两个扰动条件在高严重性(C4/C5)失效份额上可能有实质性差异。将两种分析结合在同一个语料库上,得到了任何单独分析都无法得出的发现:噪声阈值最宽松的场景并非高严重性率最低的场景:STOP_SIGNAL的C4/C5份额大约是LANE_KEEPING的4倍,尽管它容忍更大的σ。因此,用于驾驶VLA的可部署SOTIF ODD规范需要二维安全包络,而不是每个危害的单一聚合值。

英文摘要

Safety certification of Vision-Language-Action (VLA) driving planners under ISO 21448 (SOTIF) rests on an Operational Design Domain (ODD) specification that answers two complementary questions: when does the planner start to fail, and how severely does it fail once it does? We evaluate Alpamayo R1, a 10B-parameter open-weight driving VLA, on 15,968 (clip, attack) pairs. We find a conservative-aggregate gap: an aggregate safe threshold of $\sigma \leq 50$ under a 15% average displacement error (ADE) budget masks well-sampled scenarios that tolerate the top of the tested grid ($\sigma = 70$). A Gaussian Mixture Model (GMM) on the changed-explanation subset identifies six discrete severity bands (BIC-optimal $k = 6$), so two perturbation conditions with the same mean error can differ materially in their share of high-severity (C4/C5) failures. Joining the two analyses on the same corpus surfaces a finding neither yields in isolation: the scenarios with the loosest noise thresholds are not those with the lowest high-severity rate: STOP_SIGNAL concentrates roughly $4\times$ the C4/C5 share of LANE_KEEPING despite tolerating a larger $\sigma$. A deployable SOTIF ODD specification for driving VLAs therefore requires a two-dimensional safety envelope, not a single aggregate value per hazard.
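The two-dimensional envelope described above can be illustrated by computing both axes per scenario from the same records. This is a sketch under stated assumptions, not the paper's analysis code: the record layout, the function name, and every number in the toy data are hypothetical, and severity bands are taken as given labels rather than fitted with a GMM.

```python
from collections import defaultdict


def safety_envelope(records, ade_budget=0.15, severe=("C4", "C5")):
    """Return {scenario: (largest tolerated sigma, share of high-severity failures)}.

    records: iterable of (scenario, sigma, relative ADE increase, severity band).
    """
    by_scenario = defaultdict(list)
    for scenario, sigma, ade_increase, band in records:
        by_scenario[scenario].append((sigma, ade_increase, band))

    envelope = {}
    for scenario, rows in by_scenario.items():
        # Axis 1 ("when"): scan sigma upward until the mean ADE increase exceeds the budget.
        per_sigma = defaultdict(list)
        for sigma, ade_increase, _ in rows:
            per_sigma[sigma].append(ade_increase)
        threshold = None
        for sigma in sorted(per_sigma):
            if sum(per_sigma[sigma]) / len(per_sigma[sigma]) > ade_budget:
                break
            threshold = sigma
        # Axis 2 ("how severely"): fraction of records in the high-severity bands.
        share = sum(band in severe for _, _, band in rows) / len(rows)
        envelope[scenario] = (threshold, share)
    return envelope


# Toy records (hypothetical values chosen only to show the two axes disagreeing).
records = [
    ("LANE_KEEPING", 30, 0.05, "C1"), ("LANE_KEEPING", 50, 0.12, "C2"),
    ("LANE_KEEPING", 70, 0.25, "C4"), ("LANE_KEEPING", 70, 0.21, "C2"),
    ("STOP_SIGNAL", 30, 0.02, "C4"), ("STOP_SIGNAL", 50, 0.06, "C5"),
    ("STOP_SIGNAL", 70, 0.10, "C4"), ("STOP_SIGNAL", 70, 0.12, "C5"),
]
envelope = safety_envelope(records)
print(envelope)
```

In this toy data the scenario with the looser noise threshold also has the larger high-severity share, which is the kind of disagreement a single aggregate threshold would hide.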

2601.19612 2026-06-16 cs.LG cs.AI cs.RO 版本更新

Safe Exploration via Policy Priors

通过策略先验进行安全探索

Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, Andreas Krause

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出SOOPER方法,利用次优但保守的策略先验,结合概率动力学模型进行乐观探索和悲观回退,在保证安全的同时收敛到最优策略。

详情
AI中文摘要

安全探索是强化学习智能体在受控(例如模拟)环境之外在线学习和适应的关键要求。在这项工作中,我们通过利用次优但保守的策略(例如,从离线数据或模拟器中获得)作为先验来应对这一挑战。我们的方法SOOPER使用概率动力学模型进行乐观探索,但在必要时悲观地回退到保守的策略先验。我们证明了SOOPER在整个学习过程中保证安全性,并通过限制其累积遗憾建立了收敛到最优策略的保证。在关键的安全强化学习基准测试和真实硬件上的大量实验表明,SOOPER具有可扩展性,优于现有技术,并在实践中验证了我们的理论保证。

英文摘要

Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g. simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to optimistically explore, yet pessimistically fall back to the conservative policy prior if needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable, outperforms the state-of-the-art and validate our theoretical guarantees in practice.

12. 其他/综合机器人 12 篇

2606.14721 2026-06-16 cs.GR cs.CV cs.RO 交叉投稿

DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation

DC-Motion: 通过离散-连续令牌解耦语义与细节以生成人体运动

Hequan Wang, Jiaxu Zhang, Zhengbo Zhang, Zhigang Tu

发表机构 * Wuhan University(武汉大学)

AI总结 提出DC-Motion框架,通过离散-连续VAE将运动分解为语义离散令牌和细节连续残差,结合掩码自回归模型和残差扩散模型,实现复杂文本指令下的高质量运动生成。

详情
AI中文摘要

文本到运动生成需要合成物理上真实的动态,这些动态严格遵循复杂且长程的文本指令。现有方法依赖于同质表示空间,可能无法捕捉人体运动的层次结构,扩散模型在组合语义推理上表现不佳,而自回归模型由于量化牺牲了细粒度的物理细节。为了解决这个问题,我们引入了DC-Motion,一个分解式生成框架,旨在通过离散-连续令牌显式解耦语义和细节。首先,离散-连续VAE(DC-VAE)将运动分解为用于语义的离散令牌和用于细粒度动态的连续残差。然后,一个掩码自回归模型从文本预测离散结构,一个轻量级残差扩散模型恢复连续的物理细节。大量实验表明,DC-Motion有效提高了遵循复杂指令的能力。通过有效平衡语义可控性和物理真实性,我们的方法为人体运动生成提供了一种高度可适应的建模范式。在HumanML3D和KIT-ML数据集上,DC-Motion实现了最先进的性能,在运动真实感方面获得了最佳的FID,在文本对齐方面获得了最佳的R-precision。

英文摘要

Text-to-motion generation requires synthesizing physically realistic dynamics that strictly follow complex and long-horizon textual instructions. Existing approaches rely on homogeneous representation spaces that may fail to capture the hierarchical nature of human motion, with diffusion models struggling at compositional semantic reasoning and AR models sacrificing fine-grained physical details due to quantization. To address this, we introduce DC-Motion, a factorized generative framework designed to explicitly decouple semantics and details via discrete-continuous tokens. A Discrete-Continuous VAE (DC-VAE) first decomposes motion into discrete tokens for semantics and continuous residuals for fine-grained dynamics. Then, a masked AR model predicts the discrete structure from text, and a lightweight residual diffusion model recovers the continuous physical details. Extensive experiments demonstrate that DC-Motion effectively improves the capability to follow complex instructions. By effectively balancing semantic controllability and physical realism, our approach offers a highly adaptable modeling paradigm for human motion generation. On both HumanML3D and KIT-ML datasets, DC-Motion achieves state-of-the-art performance, delivering the best FID for motion realism and R-precision for text alignment.

2606.15142 2026-06-16 cs.CV cs.RO 交叉投稿

MotionVLA: Vision-Language-Action Model for Humanoid Motion

MotionVLA:面向人形运动的视觉-语言-动作模型

Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) AI² Robotics

AI总结 针对人形运动生成中低频姿态与高频物理信号量化不匹配的问题,提出双流频率分词器DSFT和基于Qwen3.5的MotionVLA模型,在HumanML3D上将与真实数据的多样性差距缩小50%以上,并在MBench上将运动条件一致性提高3.8%。

详情
AI中文摘要

从场景图像和文本生成逼真的人形运动涉及低频姿态语义和高频物理动力学。然而,许多现有方法使用单个共享码本对运动进行分词,将异质运动信号强制映射到相同的量化空间。我们对人体运动数据的频域分析揭示了单码本量化与运动统计之间的明显不匹配:五个DCT系数捕获了93%的关节位置能量,但仅捕获了37%的关节速度能量,这可能导致量化偏向姿态统计,而低估高频速度分量。第二个挑战在于使标准自回归模型有效建模运动序列中的高频物理信号。因此,我们提出了DSFT,一种双流频率分词器,将运动分离为基础流和物理流,并使用DCT截断和BPE独立压缩它们。此外,我们提出了MotionVLA,一个基于Qwen3.5的模型,将基础令牌和物理令牌排列在统一序列中,其中物理令牌在基础令牌之后预测。在HumanML3D和MBench上的实验表明,尽管使用轻量级2B骨干网络,MotionVLA在HumanML3D上将与真实数据的多样性差距减少了50%以上,并在MBench上将运动条件一致性提高了3.8%,支持频率感知的双流解耦作为自回归运动生成的有效公式。代码:https://github.com/AIGeeksGroup/MotionVLA。网站:https://aigeeksgroup.github.io/MotionVLA。

英文摘要

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
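The frequency-domain argument above (a few DCT coefficients capture most position energy but much less velocity energy) can be reproduced on a synthetic signal. The sketch below is an illustration of the measurement, not the authors' analysis: the signal, its length, and the function names are assumptions, and the resulting fractions are unrelated to the 93% and 37% figures measured on real motion data.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II, so the sum of squared coefficients equals the signal energy."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * k * (i + 0.5) / n) for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def low_freq_energy_fraction(signal, keep=5):
    """Fraction of signal energy held by the first `keep` DCT coefficients."""
    coeffs = dct2(signal)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total


# Synthetic joint position: a slow swing plus a small fast jitter (hypothetical).
n = 64
position = [
    math.cos(math.pi * (i + 0.5) / n) + 0.1 * math.cos(math.pi * 20 * (i + 0.5) / n)
    for i in range(n)
]
# Velocity as a finite difference, which amplifies the fast component.
velocity = [b - a for a, b in zip(position, position[1:])]

pos_fraction = low_freq_energy_fraction(position)
vel_fraction = low_freq_energy_fraction(velocity)
print(f"position: {pos_fraction:.3f}, velocity: {vel_fraction:.3f}")
```

Differentiation scales each frequency component by roughly its frequency, so a truncation that is nearly lossless for positions can discard most of the velocity signal, which is the mismatch a single shared codebook would inherit.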

2606.15647 2026-06-16 cs.AI cs.CV cs.RO 交叉投稿

Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action

迈向下一代医疗:医疗具身AI在感知、决策与行动中的综述

Cheng Zhang, Qing Cai, Xingzheng Wu, Xun Yang, Xiaojun Chang, Bingkun Bao, Liqiang Nie, Xinwang Liu, Yi Yang

发表机构 * School of Information Science and Engineering, Ocean University of China(中国海洋大学信息科学与工程学院) Innovation School of Artificial Intelligence, Hefei University of Technology(合肥工业大学人工智能创新学院) School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学技术学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机与信息工程学院) School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院) College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院) ReLER Laboratory, CCAI, Zhejiang University(浙江大学CCAI ReLER实验室)

AI总结 本文系统综述医疗具身AI的核心组件,强调感知、决策与行动的协调集成,并分析临床实践中的挑战与未来方向。

Comments 19 pages, 9 figures

详情
AI中文摘要

基础模型在提升医疗效率方面表现出色,广泛应用于各类医疗场景。然而,它们在感知、理解和与物理世界交互方面的能力有限,严重制约了其在真实临床工作流中的有效性,而临床工作流中安全关键的决策和物理执行紧密耦合。近年来,具身人工智能(AI)作为一种有前景的物理交互范式出现,使智能体能够在复杂医疗环境中操作。随着该领域研究的迅速扩展,理解智能体如何在临床环境中作为集成的端到端系统运行变得日益关键。然而,现有关于医疗具身AI的综述大多强调单个方面或功能组件,缺乏统一的系统级组织。为支持和巩固最新进展,我们系统调查了医疗具身AI的核心组件,特别关注感知、决策与行动的协调集成。我们进一步回顾了代表性医疗应用和相关数据集,并分析了真实临床实践中遇到的主要挑战。最后,我们讨论了这一快速发展领域未来研究的关键方向。相关项目见 https://github.com/VMVLab/Medical_Embodied_AI_Paper_List。

英文摘要

Foundation models have demonstrated impressive performance in enhancing healthcare efficiency across a wide range of medical applications. Nevertheless, their limited ability to perceive, understand, and interact with the physical world significantly constrains their effectiveness in real-world clinical workflows, where safety-critical decision-making and physical execution are tightly coupled. Recently, embodied artificial intelligence (AI) has emerged as a promising physical-interactive paradigm for intelligent healthcare, enabling agents to operate in complex medical environments. As research in this area rapidly expands, understanding how intelligent agents function as integrated, end-to-end systems in clinical environments becomes increasingly critical. However, existing surveys on medical embodied AI largely emphasize individual aspects or functional components, lacking a unified system-level organization of the field. To support and consolidate recent advances, we systematically survey the core components of medical embodied AI, with a particular emphasis on the coordinated integration of perception, decision-making, and action. We further review representative medical applications and relevant datasets, and we analyze the major challenges encountered in real-world clinical practice. Finally, we discuss key directions for future research in this rapidly evolving field. The associated project can be found at https://github.com/VMVLab/Medical_Embodied_AI_Paper_List.

2601.08514 2026-06-16 cs.RO 版本更新

Simplifying ROS2 controllers with a modular architecture for robot-agnostic reference generation

简化ROS2控制器:用于机器人无关参考生成的模块化架构

Davide Risi, Vincenzo Petrone, Antonio Langella, Lorenzo Pagliara, Enrico Ferrentino, Pasquale Chiacchio

发表机构 * Department of Information Engineering, Electrical Engineering and Applied Mathematics (DIEM), University of Salerno(信息工程、电气工程与应用数学系(DIEM),萨勒诺大学)

AI总结 提出一种模块化ROS2架构,通过专用参考生成器解耦参考处理与控制逻辑,减少重复代码并提升跨平台复用性,在UR和Franka机器人上验证了可靠跟踪与流水线构建效率。

Comments 5 pages, 7 figures

详情
AI中文摘要

本文介绍了一种新颖的ROS2模块化架构,该架构将获取、验证和插值参考所需的逻辑与跟踪这些参考的控制律解耦。该设计包含一个名为参考生成器的专用组件,它从外部节点(如规划器)接收参考(以单点或轨迹的形式),并通过现有的ros2_control链式机制以控制器的采样周期向下游控制器写入单点参考。这种分离消除了控制器中重复的参考处理代码,并提高了跨机器人平台的可重用性。我们实现了两个参考生成器:一个用于处理关节空间参考,另一个用于笛卡尔参考,以及一组新的控制器(带重力补偿的PD控制器、笛卡尔位姿控制器和导纳控制器),并在仿真和真实的Universal Robots及Franka Emika机械臂上验证了该方法。结果表明:(i) 在所有测试场景中参考均被可靠跟踪;(ii) 参考生成器减少了跨链式控制器的重复参考处理代码,有利于构建和复用复杂的控制器流水线;(iii) 控制器实现仅专注于控制律。

英文摘要

This paper introduces a novel modular architecture for ROS2 that decouples the logic required to acquire, validate, and interpolate references from the control laws that track them. The design includes a dedicated component, named Reference Generator, that receives references, in the form of either single points or trajectories, from external nodes (e.g., planners), and writes single-point references at the controller's sampling period via the existing ros2_control chaining mechanism to downstream controllers. This separation removes duplicated reference-handling code from controllers and improves reusability across robot platforms. We implement two reference generators: one for handling joint-space references and one for Cartesian references, along with a set of new controllers (PD with gravity compensation, Cartesian pose, and admittance controllers) and validate the approach on simulated and real Universal Robots and Franka Emika manipulators. Results show that (i) references are tracked reliably in all tested scenarios, (ii) reference generators reduce duplicated reference-handling code across chained controllers to favor the construction and reuse of complex controller pipelines, and (iii) controller implementations remain focused only on control laws.
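The Reference Generator's role (validate an incoming trajectory, then emit one reference point per controller sampling period) can be sketched in plain Python. This is a language-agnostic illustration only: it does not use the ros2_control API, the function names are hypothetical, and linear interpolation with clamping at the endpoints is an assumed scheme rather than the one confirmed by the paper.

```python
def validate_trajectory(trajectory):
    """Check that waypoints are (time, positions) with increasing times and equal dimension."""
    if not trajectory:
        raise ValueError("empty trajectory")
    dim = len(trajectory[0][1])
    for (t0, _), (t1, q1) in zip(trajectory, trajectory[1:]):
        if t1 <= t0:
            raise ValueError("waypoint times must be strictly increasing")
        if len(q1) != dim:
            raise ValueError("all waypoints must have the same dimension")


def sample_reference(trajectory, t):
    """Single-point reference at time t, linearly interpolated between waypoints."""
    if t <= trajectory[0][0]:
        return list(trajectory[0][1])
    if t >= trajectory[-1][0]:
        return list(trajectory[-1][1])
    for (t0, q0), (t1, q1) in zip(trajectory, trajectory[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return [a + w * (b - a) for a, b in zip(q0, q1)]


# A two-joint trajectory from a hypothetical planner, sampled every 0.25 s.
trajectory = [(0.0, [0.0, 0.0]), (1.0, [1.0, 2.0]), (2.0, [1.0, 0.0])]
validate_trajectory(trajectory)
references = [sample_reference(trajectory, k * 0.25) for k in range(9)]
print(references[2])  # [0.5, 1.0]
```

Keeping this logic in one component means a downstream tracking controller only ever sees a single point per cycle, whichever planner produced the trajectory.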

2603.24350 2026-06-16 cs.RO cs.AI cs.LG 版本更新

Evidence of an Emergent "Self" in Continual Robot Learning

持续机器人学习中涌现的“自我”证据

Adidev Jhunjhunwala, Judah Goldfeder, Hod Lipson

发表机构 * Creative Machines Lab, Department of Mechanical Engineering, Columbia University(哥伦比亚大学机械工程系创意机器实验室) Creative Machines Lab, Department of Computer Science, Columbia University(哥伦比亚大学计算机科学系创意机器实验室)

AI总结 通过比较恒定任务与持续学习下机器人的认知结构,发现持续学习机器人形成显著更稳定的不变子网络,该子网络对适应性至关重要,为量化智能系统自我概念提供原则性方法。

Comments 44 pages, 24 figures, includes supplementary materials

详情
AI中文摘要

理解自我意识的一个关键挑战是,如何以原则性的方式量化一个智能系统是否具有“自我”概念,以及如果存在,如何将“自我”与其他认知结构区分开来。我们提出,可以通过寻找认知过程中相对于快速获得的认知技能变化较小的不变部分来隔离“自我”——因为我们的自我是我们经验中最持久的方面。我们利用这一原则分析了两种条件下机器人的认知结构:一个机器人学习恒定任务,而另一个在可变任务下进行持续学习。我们发现,经历持续学习的机器人形成了一个不变子网络,该子网络比对照组显著更稳定(p < 0.001),并且该子网络在功能上也很重要:保留它有助于适应,而破坏它会损害性能。我们在跨越运动控制和操作的三种不同机器人上验证了这一模式。

英文摘要

A key challenge to understanding self-awareness has been a principled way of quantifying whether an intelligent system has a concept of a "self", and if so how to differentiate the "self" from other cognitive structures. We propose that the "self" can be isolated by seeking the invariant portion of cognitive process that changes relatively little compared to more rapidly acquired cognitive skills - because our self is the most persistent aspect of our experiences. We used this principle to analyze the cognitive structure of robots under two conditions: One robot learns a constant task, while a second undergoes continual learning under variable tasks. We find that robots subjected to continual learning develop an invariant subnetwork that is significantly more stable (p < 0.001) compared to the control, and that this subnetwork is also functionally important: preserving it aids adaptation while damaging it impairs performance. We validate this pattern across three different robots spanning locomotion and manipulation.

2604.03386 2026-06-16 cs.RO cs.NE 版本更新

Activity-Dependent Plasticity in Morphogenetically-Grown Recurrent Networks

形态发生生长递归网络中的活动依赖可塑性

Sergii Medvid, Andrii Valenia, Mykola Glybovets

发表机构 * National University of Kyiv-Mohyla Academy(基輔-莫希拉學院國立大學)

AI总结 研究形态发生生长递归网络中Hebbian与反Hebbian可塑性的作用,发现反Hebbian可塑性显著优于Hebbian,且协同进化独立发现该模式,表明反Hebbian优势对小递归网络具有普适性。

Comments 8 pages, 6 figures. Camera-ready version; accepted at GECCO 2026 Companion (EvoSelf workshop)

AI Chinese abstract

神经架构搜索的发育方法通过自组织从紧凑基因组生长出功能网络,但所得网络以固定的生长后权重运行。我们在50,000个形态发生生长的递归控制器(CartPole和Acrobot上超过5M种配置)中刻画了Hebbian和反Hebbian可塑性,然后测试协同进化实验——其中可塑性参数编码在基因组中并与发育架构共同进化——是否独立恢复这些模式。我们的刻画显示:(1)对于胜任网络,反Hebbian可塑性显著优于Hebbian(Cohen's d = 0.53-0.64);(2)遗憾(在最佳固定设置下损失的oracle改进比例)达到52-100%;(3)在非平稳条件下,可塑性的作用从微调转变为真正的适应。协同进化独立发现这些模式:在CartPole上,70%的运行进化出反Hebbian可塑性(p = 0.043);在Acrobot上,进化发现接近零的eta且符号混合——与刻画完全匹配。随机RNN对照表明,反Hebbian优势对小递归网络具有普适性,但拓扑依赖程度是发育特异的:形态发生生长网络的遗憾比具有匹配拓扑统计的随机图高2-6倍。

English abstract

Developmental approaches to neural architecture search grow functional networks from compact genomes through self-organisation, but the resulting networks operate with fixed post-growth weights. We characterise Hebbian and anti-Hebbian plasticity across 50,000 morphogenetically grown recurrent controllers (5M+ configurations on CartPole and Acrobot), then test whether co-evolutionary experiments -- where plasticity parameters are encoded in the genome and evolved alongside the developmental architecture -- recover these patterns independently. Our characterisation reveals that (1) anti-Hebbian plasticity significantly outperforms Hebbian for competent networks (Cohen's d = 0.53-0.64), (2) regret (fraction of oracle improvement lost under the best fixed setting) reaches 52-100%, and (3) plasticity's role shifts from fine-tuning to genuine adaptation under non-stationarity. Co-evolution independently discovers these patterns: on CartPole, 70% of runs evolve anti-Hebbian plasticity (p = 0.043); on Acrobot, evolution finds near-zero eta with mixed signs -- exactly matching the characterisation. A random-RNN control shows that anti-Hebbian dominance is generic to small recurrent networks, but the degree of topology-dependence is developmental-specific: regret is 2-6x higher for morphogenetically grown networks than for random graphs with matched topology statistics.
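
As a rough illustration of the distinction the abstract draws, the sketch below applies one step of a plain rate-based plasticity rule, `dw[i][j] = eta * post[i] * pre[j]`. The function name, the example values, and the sign convention (positive `eta` for Hebbian, negative `eta` for anti-Hebbian) are assumptions made for this example; the abstract does not state the exact update rule used in the paper.

```python
def plastic_update(weights, pre, post, eta):
    """Return new weights after one step of dw[i][j] = eta * post[i] * pre[j].

    weights[i][j] connects presynaptic unit j to postsynaptic unit i.
    eta > 0 strengthens co-active pairs (Hebbian);
    eta < 0 weakens them (anti-Hebbian). Both conventions are assumed here.
    """
    return [
        [w + eta * post[i] * pre[j] for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]


# Hypothetical 2x2 recurrent weight matrix and activity vectors.
weights = [[0.5, -0.2], [0.1, 0.3]]
pre = [1.0, 0.5]
post = [0.8, -0.4]

hebbian = plastic_update(weights, pre, post, eta=0.1)
anti_hebbian = plastic_update(weights, pre, post, eta=-0.1)
```

With these values the two rules move every weight by the same amount in opposite directions, which is all the sign of `eta` controls in this simplified form.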

2604.16592 2026-06-16 cs.RO cs.AI cs.CV cs.ET updated version

Human Cognition in Machines: A Unified Perspective of World Models

机器中的人类认知:世界模型的统一视角

Timothy Rupprecht, Pu Zhao, Amir Taherin, Arash Akbari, Arman Akbari, Yumei He, Tooba Imtiaz, Sean Duffy, Juyi Lin, Yixiao Chen, Rahul Chowdhury, Enfu Nan, Yixin Shen, Yifan Cao, Haochen Zeng, Weiwei Chen, Geng Yuan, Jennifer Dy, Sarah Ostadabbas, Xuan Zhang, David Kaeli, Edmund Yeh, Yanzhi Wang

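Affiliations: Northeastern University; EmbodyX Inc.; Tulane University; Cornell University; University of Georgia

AI summary: Proposes a unified framework that integrates cognitive functions such as memory and perception, identifies motivation and metacognition as under-researched, and introduces epistemic world models as a new category.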

AI Chinese abstract

本报告通过区分先前工作在认知功能上的创新来审视世界模型。许多工作声称其世界模型具有近乎人类般的认知能力。评估这些主张需要基于人类和机器认知理论的第一原理。在迈向类人世界模型的过程中,我们提出了一个概念性的统一框架,该框架完全整合了所有认知功能(即记忆、感知、语言、推理、想象、动机和元认知),并指出现有研究的空白,以指导未来技术的发展。特别是,我们发现动机(尤其是内在动机)和元认知仍然严重研究不足,并提出了基于主动推理和全局工作空间理论的具体方向来解决这些空白。我们还引入了认识论世界模型(epistemic world models),这是一个新的类别,涵盖在结构化知识上运行的科学发现代理框架。我们的分类法应用于视频、具身和认识论世界模型,提出了先前分类法未涉及的研究方向。

English abstract

This report on world models distinguishes prior works by the cognitive functions they innovate. Many works claim an almost human-like cognitive capability in their world models. Evaluating these claims requires a proper grounding in first principles from human and machine cognition theory. In moving towards human-like world models, we present a conceptual unified framework for world models that fully incorporates all the cognitive functions (i.e., memory, perception, language, reasoning, imagining, motivation, and metacognition) and identify gaps in existing research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and metacognition remain drastically under-researched, and we propose concrete directions to address these gaps informed by active inference and global workspace theory. We also introduce epistemic world models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied to video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.

2601.08056 2026-06-16 q-bio.NC cs.RO updated version

The embodied brain: Bridging the brain, body, and behavior with biorealistic neuromechanical models

具身大脑:通过生物真实神经力学模型连接大脑、身体与行为

Sibo Wang-Chen, Pavan Ramdya

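Affiliations: EPFL

AI summary: Reviews biorealistic neuromechanical models, which embed artificial neural controllers within body models in simulated environments to uncover the behavioral-control algorithms that arise from interactions among the nervous system, body, and environment, and which foster exchange between neuroscience, robotics, and machine learning.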

Comments 18 pages, 4 figures (including 1 graphical abstract), 1 table

AI Chinese abstract

动物行为反映了神经系统、身体和环境之间的相互作用。因此,必须考虑生物力学和环境背景,以理解行为控制的算法。将人工神经控制器嵌入模拟环境中的身体模型的计算模型,是用于此目的的有力工具。在这里,我们回顾了生物真实神经力学模型的进展,同时强调了即将到来的新兴机遇。我们首先展示了这些模型如何能够推断出难以通过实验测量的生物物理变量。通过系统性扰动,可以通过这些模型生成新的可实验检验的假设。然后,我们考察了神经力学模型如何促进神经科学、机器人学和机器学习之间的交流,并展示了它们在医疗保健中的应用。我们设想,将实验研究与对其神经力学替代物的主动探测相结合,将显著加速神经科学的进展。

English abstract

Animal behavior reflects interactions between the nervous system, body, and environment. Therefore, biomechanics and environmental context must be considered to understand algorithms for behavioral control. Computational models that embed artificial neural controllers within body models in simulated environments are a powerful tool for this purpose. Here, we review advances in biorealistic neuromechanical models while also highlighting emerging opportunities ahead. We first show how these models enable inference of biophysical variables that are difficult to measure experimentally. Through systematic perturbation, one can use these models to generate new experimentally testable hypotheses. We then examine how neuromechanical models facilitate the exchange between neuroscience, robotics, and machine learning, and showcase their applications in healthcare. We envision that coupling experimental studies with active probing of their neuromechanical surrogates will significantly accelerate progress in neuroscience.

2605.25006 2026-06-16 cs.RO cs.LG cs.NE updated version

Convex-Neural RRT*: Fast and Reliable Learning-Guided Sampling for High-Quality Robot Path Planning

Convex-Neural RRT*: 快速可靠的基于学习引导的高质量机器人路径规划采样

Hichem Cheriet, Badra Khellat Kihel, Samira Chouraqui, Bara J. Emran

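AI summary: Proposes Convex-Neural RRT*, which uses a neural network to predict convex candidate regions near high-quality paths to guide sampling; across multiple environments it cuts computation time by 30-75% relative to neural-guided variants, shortens paths by about 5% on average relative to classical RRT*, and keeps the success rate above 99%.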

AI Chinese abstract

基于采样的机器人路径规划算法在不同障碍物配置的环境中提供了概率完备性和强经验收敛性。然而,在实践中,这些方法通常需要多次迭代才能获得高质量解。本文提出了Convex-Neural RRT*,一种增强的RRT*变体,它结合神经引导来预测高质量路径附近的信息性航点区域。从这些预测中提取凸候选区域,使规划器能够将探索集中在几何相关区域,同时保持全局探索。该算法在三种环境类型和18个基准地图上与Neural RRT*、Neural Informed RRT*、经典RRT*和LTA*进行了评估。实验结果表明,与神经引导变体相比,Convex-Neural RRT*减少了30-75%的计算时间,相对于LTA*减少了高达88-98%,同时与经典RRT*相比,平均路径长度减少了约5%,在复杂环境中改进更大。该方法在不同障碍物密度下保持了超过99%的整体成功率。这些发现表明,凸引导神经采样在计算效率和解质量之间提供了有效平衡,支持其在时间敏感的机器人导航任务中的适用性。

English abstract

Sampling-based algorithms for robot path planning offer probabilistic completeness and strong empirical convergence properties across environments with diverse obstacle configurations. However, in practice, these methods often require many iterations to obtain high-quality solutions. This paper proposes Convex-Neural RRT*, an enhanced RRT* variant that incorporates neural guidance to predict informative waypoint regions near high-quality paths. Convex candidate regions are extracted from these predictions, enabling the planner to concentrate exploration on geometrically relevant areas while preserving global exploration. The proposed algorithm is evaluated against Neural RRT*, Neural Informed RRT*, classical RRT*, and LTA* across three environment types and 18 benchmark maps. Experimental results show that Convex-Neural RRT* reduces computation time by 30-75% compared to neural-guided variants and up to 88-98% relative to LTA*, while achieving an average path length reduction of approximately 5% compared to classical RRT*, with larger improvements observed in complex environments. The method also maintains an overall success rate above 99% across varying obstacle densities. These findings indicate that convex-guided neural sampling provides an effective balance between computational efficiency and solution quality, supporting its applicability to time-sensitive robotic navigation tasks.

2602.21954 2026-06-16 physics.soc-ph cs.RO updated version

The Swarm Intelligence Freeway-Urban Trajectories (SWIFTraj) Dataset -- Part II: A Graph-Based Approach for Trajectory Connection

蜂群智能高速公路-城市轨迹(SWIFTraj)数据集——第二部分:基于图的方法用于轨迹连接

Xinkai Ji, Pan Liu, Ying Yang, Yu Han

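Affiliations: Hong Kong University of Science and Technology - Guangzhou

AI summary: Proposes a graph-based approach to the time-alignment and vehicle-matching problems that arise when connecting trajectories captured by a UAV swarm, and validates it on simulated and real-world data, where it connects trajectories with high accuracy.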

AI Chinese abstract

在本系列论文第一部分中,我们介绍了SWIFTraj,一个通过无人机群收集的新开源车辆轨迹数据集。该数据集有两个显著特点:首先,通过连接连续无人机视频中的轨迹,提供长距离连续轨迹,最长超过4.5公里;其次,涵盖由高速公路及其连接的城市道路组成的综合交通网络。从无人机群获取如此长距离的连续轨迹具有挑战性,因为需要在多个视频之间进行准确的时间对齐,并且无人机的空间分布不规则。为解决这些挑战,本文提出了一种新颖的基于图的方法用于连接无人机群捕获的车辆轨迹。构建了一个无向图来表示灵活的无人机布局,并开发了一种基于轨迹匹配成本最小化的自动时间对齐方法,以估计视频之间的最佳时间偏移。为了关联不同视频中同一辆车的轨迹,使用匈牙利算法建立了车辆匹配表。所提出的方法在模拟和真实数据上进行了评估。真实世界实验的结果显示,时间对齐误差在三个视频帧内,对应约0.1秒,且车辆匹配的F1分数约为0.99。这些结果证明了所提出方法在解决无人机轨迹连接中的关键挑战的有效性,并突显了其在大规模车辆轨迹收集中的潜力。

English abstract

In Part I of this companion paper series, we introduced SWIFTraj, a new open-source vehicle trajectory dataset collected using an unmanned aerial vehicle (UAV) swarm. The dataset has two distinctive features. First, by connecting trajectories across consecutive UAV videos, it provides long-distance continuous trajectories, with the longest exceeding 4.5 km. Second, it covers an integrated traffic network consisting of both freeways and their connected urban roads. Obtaining such long-distance continuous trajectories from a UAV swarm is challenging, due to the need for accurate time alignment across multiple videos and the irregular spatial distribution of UAVs. To address these challenges, this paper proposes a novel graph-based approach for connecting vehicle trajectories captured by a UAV swarm. An undirected graph is constructed to represent flexible UAV layouts, and an automatic time alignment method based on trajectory matching cost minimization is developed to estimate optimal time offsets across videos. To associate trajectories of the same vehicle observed in different videos, a vehicle matching table is established using the Hungarian algorithm. The proposed approach is evaluated using both simulated and real-world data. Results from real-world experiments show that the time alignment error is within three video frames, corresponding to approximately 0.1 s, and that the vehicle matching achieves an F1-score of about 0.99. These results demonstrate the effectiveness of the proposed method in addressing key challenges in UAV-based trajectory connection and highlight its potential for large-scale vehicle trajectory collection.
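
The vehicle-matching step described above is a minimum-cost one-to-one assignment problem. The sketch below solves a tiny instance by brute force over permutations, which stands in for the Hungarian algorithm named in the abstract: on small inputs it reaches the same optimum, only far less efficiently. The cost matrix and the function name `match_vehicles` are made up for illustration; the paper's actual matching cost is not given in the abstract.

```python
from itertools import permutations


def match_vehicles(cost):
    """Return (total_cost, pairs) for the cheapest one-to-one assignment.

    cost[i][j] is a hypothetical mismatch score between trajectory i in one
    video and trajectory j in the next. The matrix must be square.
    Brute force is O(n!), so this is only usable for a handful of vehicles.
    """
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(enumerate(best_perm))


# Greedy row-by-row matching would pair 0 with 0 and then be forced to pay
# 100.0 for row 1; the optimal assignment avoids that.
cost = [
    [1.0, 2.0, 9.0],
    [2.0, 100.0, 9.0],
    [9.0, 9.0, 0.5],
]
total, pairs = match_vehicles(cost)
```

Here `pairs` lists `(row, column)` matches. The example is chosen so that a greedy choice is suboptimal, which is the reason an exact assignment method is preferred for this step.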

2603.11729 2026-06-16 cs.DS cs.AI cs.RO updated version

Adapting Dijkstra for Buffers and Unlimited Transfers

为缓冲区和无限换乘调整Dijkstra算法

Denys Katkalo, Andrii Rohovyi, Toby Walsh

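Affiliations: University of Oxford

AI summary: Proposes Transfer Aware Dijkstra (TAD), which scans entire trip sequences rather than individual edges and thereby fixes the failure of conventional Dijkstra dominance filtering in unlimited-transfer routing with buffer times; on the London and Switzerland networks it runs more than twice as fast as MR while remaining optimal.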

Comments v4: clarified RAPTOR description in the Background section

AI Chinese abstract

近年来,基于RAPTOR的算法被认为是无需预处理即可处理无限换乘路径规划的最先进技术。然而,这一地位很大程度上源于路由研究的演进,其中基于Dijkstra的解决方案被基于时间表的算法取代,而缺乏系统性的比较。在这项工作中,我们重新审视了经典的基于Dijkstra的无限换乘公共交通路由方法,并证明时间依赖Dijkstra (TD-Dijkstra) 优于MR。然而,高效的TD-Dijkstra实现依赖于在预处理期间过滤被支配的连接,这假设乘客总是可以切换到更快的连接。我们表明,当站点有缓冲时间时,这种过滤是不可靠的,因为它无法区分无需等待即可继续乘车的在座乘客和必须遵守缓冲时间的换乘乘客。为了解决这一限制,我们引入了Transfer Aware Dijkstra (TAD),这是一种修改后的算法,它扫描整个行程序列而不是单个边,从而正确处理缓冲时间,同时保持相对于MR的性能优势。我们在伦敦和瑞士网络上的实验表明,与MR相比,我们可以在有和没有缓冲时间的两个网络上实现超过两倍的速度提升,同时产生最优结果。

English abstract

In recent years, RAPTOR-based algorithms have been considered the state of the art for path-finding with unlimited transfers without preprocessing. However, this status largely stems from the evolution of routing research, where Dijkstra-based solutions were superseded by timetable-based algorithms without a systematic comparison. In this work, we revisit classical Dijkstra-based approaches for public transit routing with unlimited transfers and demonstrate that Time-Dependent Dijkstra (TD-Dijkstra) outperforms MR. However, efficient TD-Dijkstra implementations rely on filtering dominated connections during preprocessing, which assumes passengers can always switch to a faster connection. We show that this filtering is unsound when stops have buffer times, as it cannot distinguish between seated passengers who may continue without waiting and transferring passengers who must respect the buffer. To address this limitation, we introduce Transfer Aware Dijkstra (TAD), a modification that scans entire trip sequences rather than individual edges, correctly handling buffer times while maintaining performance advantages over MR. Our experiments on the London and Switzerland networks show that we can achieve more than a twofold speedup over MR while producing optimal results on both networks, with and without buffer times.
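
The unsoundness argument above can be reproduced on a toy timetable. The sketch below is not the paper's TD-Dijkstra or TAD; it is a small connection-scan-style earliest-arrival routine written only for this illustration, and the timetable, stop names, and function names are all hypothetical. Times are minutes after midnight.

```python
import math
from collections import namedtuple

Conn = namedtuple("Conn", "trip dep_stop dep_time arr_stop arr_time")


def filter_dominated(conns):
    """Drop a connection if another one between the same two stops
    departs no earlier and arrives no later (and differs in its times)."""
    def dominates(o, c):
        return (o.dep_stop == c.dep_stop and o.arr_stop == c.arr_stop
                and o.dep_time >= c.dep_time and o.arr_time <= c.arr_time
                and (o.dep_time, o.arr_time) != (c.dep_time, c.arr_time))
    return [c for c in conns if not any(dominates(o, c) for o in conns)]


def earliest_arrival(conns, source, target, start, buffer):
    """Scan connections by departure time. Staying seated on a trip needs
    no buffer; boarding at a stop needs `buffer` minutes (none at source)."""
    arrival = {source: start}
    aboard = set()  # trips the passenger can already be sitting in
    for c in sorted(conns, key=lambda c: c.dep_time):
        at_stop = arrival.get(c.dep_stop, math.inf)
        wait = 0 if c.dep_stop == source else buffer
        if c.trip in aboard or at_stop + wait <= c.dep_time:
            aboard.add(c.trip)
            arrival[c.arr_stop] = min(arrival.get(c.arr_stop, math.inf),
                                      c.arr_time)
    return arrival.get(target, math.inf)


timetable = [
    Conn("A", "X", 590, "S", 600),  # trip A reaches S at 10:00 ...
    Conn("A", "S", 601, "T", 620),  # ... and continues to T at 10:20
    Conn("B", "S", 603, "T", 615),  # faster S->T leg, dominates trip A's
    Conn("C", "S", 633, "T", 645),  # later fallback
]
filtered = filter_dominated(timetable)

full_with_buffer = earliest_arrival(timetable, "X", "T", 580, buffer=5)
filtered_with_buffer = earliest_arrival(filtered, "X", "T", 580, buffer=5)
full_no_buffer = earliest_arrival(timetable, "X", "T", 580, buffer=0)
filtered_no_buffer = earliest_arrival(filtered, "X", "T", 580, buffer=0)
```

Without a buffer, filtering is harmless: both timetables give an arrival at 615. With a 5-minute buffer at S, the passenger seated on trip A cannot board trip B at 603, so the unfiltered timetable yields 620 by staying seated, while the filtered one has lost that leg and yields 645.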

2509.16370 2026-06-16 math.OC cs.MS cs.RO cs.SY eess.SY updated version

Dual-Regularized Riccati Recursions for Interior-Point Optimal Control

双正则化Riccati递归用于内点最优控制

João Sousa-Pinto, Dominique Orban

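Affiliations: IMT School for Advanced Studies, Lucca

AI summary: Derives closed-form extensions of the Riccati recursions for dual-regularized linear-quadratic regulator problems, solvable sequentially in $O(N)$ time and in parallel in $O(\log(N))$ time, and proves that every nonzero primal step is a descent direction of an augmented barrier-Lagrangian merit function when certain inertia conditions hold.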

AI Chinese abstract

我们推导了顺序和并行Riccati递归的闭式扩展,用于求解双正则化线性二次调节器(LQR)问题,分别具有O(N)顺序时间和O(log(N))并行时间。我们展示,当使用正则化原始-对偶内点方法并通过多重打靶法求解光滑、约束、非凸、离散时间最优控制问题时,这些子问题会出现,即使存在分阶段的等式或不等式约束,也不需要对约束雅可比矩阵施加任何秩要求。我们证明,当Newton-KKT矩阵满足某些惯性条件时,每个非零原始步都是增广障碍-拉格朗日评价函数(merit function)的下降方向。我们通过双正则化Riccati主元(pivots)的正定性(比标准LQR正定性要求更弱的条件)来表征这些惯性条件,从而获得廉价的惯性证书。我们提供了MIT授权的C++和JAX实现,以及在Lean中的完整形式化结果。我们在复杂轨迹优化问题上将我们的算法与领先的最优控制和非线性规划求解器进行了基准比较,证明在中等规模问题上具有竞争力的性能,并在时间跨度、问题维度和约束数量增加时获得显著收益。

English abstract

We derive closed-form extensions of the sequential and parallel Riccati recursions for solving dual-regularized linear-quadratic regulator (LQR) problems, with $O(N)$ sequential time and $O(\log(N))$ parallel time, respectively. We show that these subproblems arise when using regularized primal-dual interior-point methods to solve smooth, constrained, non-convex, discrete-time optimal control problems via multiple-shooting, even in the presence of stagewise equality or inequality constraints, and without imposing any rank requirements on constraint Jacobians. We prove that, when certain inertia conditions on the Newton-KKT matrix are met, each nonzero primal step is a descent direction of an augmented barrier-Lagrangian merit function. We characterize these inertia conditions in terms of the positive-definiteness of the dual-regularized Riccati pivots (a weaker condition than the standard LQR positive-definiteness requirements), thereby yielding inexpensive certificates of the required inertia. We provide MIT-licensed implementations of our methods in C++ and in JAX, as well as a full formalization of our results in Lean. We benchmark our algorithm against leading optimal control and nonlinear programming solvers on complex trajectory optimization problems, establishing competitive performance on moderate problems and substantial gains as the horizon length, problem dimension, and constraint count increase.
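
For orientation, the classical unregularized sequential Riccati recursion that the abstract says the paper extends can be written as below. The notation ($A_k$, $B_k$, $Q_k$, $R_k$, $P_k$, $K_k$) is the usual textbook one and is an assumption here, not taken from the paper; cross terms and affine terms are omitted, and the dual-regularized variant is not reproduced.

$$
\begin{aligned}
&\min_{u_0,\dots,u_{N-1}} \; \tfrac12 x_N^\top Q_N x_N + \sum_{k=0}^{N-1} \tfrac12\left(x_k^\top Q_k x_k + u_k^\top R_k u_k\right)
\quad \text{s.t. } x_{k+1} = A_k x_k + B_k u_k, \\
&P_N = Q_N, \\
&K_k = \left(R_k + B_k^\top P_{k+1} B_k\right)^{-1} B_k^\top P_{k+1} A_k, \\
&P_k = Q_k + A_k^\top P_{k+1} A_k - A_k^\top P_{k+1} B_k K_k, \\
&u_k = -K_k x_k, \qquad k = N-1, \dots, 0.
\end{aligned}
$$

Each backward step costs a fixed amount of work, which is where the $O(N)$ sequential time comes from. In this standard setting the matrix $R_k + B_k^\top P_{k+1} B_k$ is the one that must be positive definite for the step to be well defined.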