arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

1. Robot Learning, Imitation Learning, and Reinforcement Learning (13 papers)

2606.19419 2026-06-19 cs.RO cs.AI New submission

Playful Agentic Robot Learning

Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

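Affiliations: University of California, Berkeley; Impossible Research

AI summary: Proposes the RATs framework, in which robots learn reusable skills through self-directed exploration, improving results on LIBERO-PRO and MolmoSpaces by 20.6 and 17.0 percentage points, respectively.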

Comments Project page: https://playful-rats.github.io/

Abstract

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.

2606.19656 2026-06-19 cs.RO cs.LG New submission

DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning

Calvin Luo, Chen Sun, Shuran Song

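Affiliations: Stanford University; Brown University

AI summary: Proposes DF-ExpEnse, an exploration technique that uses the multimodal modeling capacity of generative control policies together with a critic ensemble to collect online experience efficiently during finetuning, improving sample efficiency.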

Comments ICML 2026

Abstract

A natural recipe for intelligent robotic decision-making is initializing from pretrained generative control policies, which have summarized offline experience, and adapting them to self-collected online experience. We present DF-ExpEnse, an exploration technique that improves the quality of online experience collection, thus increasing finetuning sample-efficiency. DF-ExpEnse leverages the multimodal modeling capabilities of the generative control policy to create an expressive and tractably evaluatable candidate set. It then utilizes an ensemble of critics to identify the action that best balances quality with high exploration interest. In fleet settings, DF-ExpEnse further enables cross-agent communication to facilitate collaborative exploration as a group. DF-ExpEnse can be seamlessly integrated with existing strategies that finetune pretrained generative control policies via reinforcement learning. We experimentally validate consistent sample-efficiency benefits through DF-ExpEnse across a variety of manipulation and locomotion tasks, compared to default finetuning and alternative action selection schemes. Project can be found at https://df-expense.github.io.
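
The candidate-then-select step described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the scoring rule (ensemble mean plus a `beta`-weighted ensemble standard deviation, a common optimism-style bonus), the function name `select_action`, and the toy critics are all assumptions made for the example.

```python
import numpy as np


def select_action(candidates, critics, state, beta=1.0):
    """Pick one action from a candidate set using a critic ensemble.

    candidates: array of shape (K, action_dim), e.g. K samples drawn from
        a generative policy (the sampling itself is out of scope here).
    critics: list of callables q(state, action) -> float.
    beta: weight on ensemble disagreement (hypothetical knob).
    """
    # q_values[i, k] is the estimate of critic i for candidate k.
    q_values = np.array(
        [[q(state, a) for a in candidates] for q in critics]
    )
    quality = q_values.mean(axis=0)       # ensemble agreement on value
    disagreement = q_values.std(axis=0)   # proxy for exploration interest
    scores = quality + beta * disagreement
    best = int(np.argmax(scores))
    return best, scores


# Toy usage: three hand-written critics over 1-D actions.
critics = [
    lambda s, a: -float((a[0] - 0.5) ** 2),
    lambda s, a: -float((a[0] - 0.4) ** 2),
    lambda s, a: -float((a[0] - 0.6) ** 2),
]
candidates = np.array([[0.0], [0.5], [1.0]])
best_index, scores = select_action(candidates, critics, state=None, beta=0.0)
```

With `beta=0.0` the rule reduces to picking the candidate with the highest mean value; raising `beta` shifts the choice toward candidates the critics disagree on.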

2606.19728 2026-06-19 cs.RO cs.AI New submission

Bidirectional Tutoring for Developmental Motor Learning in Robots: Co-Developed Interaction Dynamics Support Stable Learning

Rui Fukushima, Jun Tani

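Affiliations: Okinawa Institute of Science and Technology Graduate University

AI summary: Proposes a bidirectional tutoring framework in which a human or AI tutor and the robot adapt to each other dynamically; a free-energy-principle-based neural network enables stable sequence learning, and object manipulation experiments confirm behavioral consistency and generalization.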

Comments 16 pages, 14 figures

Abstract

Infants are well known to develop their motor skills through dense interaction with caregivers. Although such social interaction is crucial for human development, motor-skill learning in robots is often treated as a unidirectional process in which robots passively receive demonstrations from tutors. This overlooks a key property of social interaction: it is inherently bidirectional, with tutor and learner dynamically adapting to each other. In such interactions, the robot's past experiences may function as prior constraints that shape the dynamics of their co-developed trajectories. We hypothesize that bidirectional tutoring allows such constraints to guide the formation of consistent behavioral patterns that preserve behavioral coherence and support generalization, whereas unidirectional interaction lacks such constraints and leads to broader, less consistent behavioral patterns. To examine this hypothesis, we conducted two experiments with a physical humanoid robot performing an object manipulation task: one involving human-robot interaction and another employing an AI tutor interacting with the real robot through an adaptive intervention mechanism designed to examine whether similar effects would emerge under more controlled conditions. We implement the developmental learning framework using a free-energy-principle-based neural network extended with generative replay, which supports stable sequence-by-sequence learning from single tutored episodes. Across both settings, bidirectional tutoring fostered consistent behaviors and stage-wise generalization, while the robot gradually required less tutor guidance. These results suggest that bidirectional tutoring, as an embodied and socially grounded approach, provides an effective scaffold for developmental motor learning in robots.

2606.19752 2026-06-19 cs.RO cs.AI New submission

Temporal Self-Imitation Learning

Yinsen Jia, Boyuan Chen

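Affiliations: Duke University

AI summary: Proposes Temporal Self-Imitation Learning, which mines temporally efficient successful trajectories and converts them into reusable supervision, improving learning efficiency and robustness on long-horizon robot manipulation tasks.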

Abstract

Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.

2606.19774 2026-06-19 cs.RO New submission

Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection

Trong-Bao Ho, Quang-Tan Nguyen, Thien-Loc Ha, Gia-Binh Nguyen, Viet-Thanh Nguyen, Long Dinh, Minh N. Vu, Duy M. H. Nguyen, An Thai Le, Ngo Anh Vien

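Affiliations: VinRobotics; VinUniversity; DFKI (German Research Center for Artificial Intelligence); University of Stuttgart; IMPRS-IS (International Max Planck Research School for Intelligent Systems)

AI summary: Addresses inconsistency at action-chunk boundaries during asynchronous execution of flow-based policies with PAINT, a training-free method that achieves prefix consistency through initial noise selection rather than trajectory steering, improving execution consistency and task performance on 12 simulated benchmarks and 6 real-world manipulation tasks.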

Comments First version 19 pages, project site: https://paint-action-chunking.github.io

Abstract

Action chunking enables robot policies to produce temporally coherent behavior, but generating multi-step action sequences with flow-based policies incurs latency that is incompatible with real-time control. Under asynchronous execution, the robot continues executing the current chunk while the next one is generated, causing even minor delays to create inconsistencies at chunk boundaries. Existing methods address this problem by steering generation toward the already executed action prefix. We instead show that prefix consistency can be achieved by selecting an appropriate initial noise before generation begins, allowing the unmodified flow ODE to produce a coherent next chunk. This reframes asynchronous inference as a noise selection problem rather than a trajectory steering problem. We introduce PAINT, a training-free method that finds this noise via backward Euler inversion and constructs the final chunk through a repainting rule. In summary, PAINT requires no gradients, retraining, or policy modification; yet it improves execution consistency and task performance across 12 simulated benchmarks and 6 real-world manipulation tasks spanning single-arm, bimanual, and humanoid embodiments. Website: https://paint-action-chunking.github.io.

2606.20048 2026-06-19 cs.RO New submission

MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs

Zheyu Zhuang, Ruiyu Wang, Giovanni Luca Marchetti, Florian T. Pokorny, Danica Kragic

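AI summary: Proposes MirrorDuo, which uses reflection consistency to generate a mirrored copy of each original demonstration as data augmentation, markedly improving behaviour cloning under the same data budget and supporting zero- and few-shot skill transfer.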

Comments Published in CoRL 2025

Journal ref CoRL 2025

Abstract

Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras. However, it remains constrained by the cost of collecting diverse demos, especially for generalizing across workspace variations. We propose MirrorDuo, a reflection-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving "collect one, get one for free". It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning or diffusion policy, or as a structural prior for reflection-equivariant policy networks. By leveraging the overlap between the original and mirrored domains, MirrorDuo achieves significantly improved performance under the same data budget when demonstrations are evenly distributed across both sides of the workspace. When demonstrations are confined to one side, MirrorDuo enables efficient skill transfer to the mirrored workspace with as few as zero or five demos in the target arrangement.
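
To make the idea of mirroring an image, pose, and action tuple concrete, here is a minimal sketch of reflecting a 6-DoF pose and an image. It is an illustration under stated assumptions, not the paper's formulation: the mirror plane (the x-z plane, which flips the y axis), the pairing of that plane with a left-right image flip, and the helper names `mirror_pose` and `mirror_image` are all chosen for the example.

```python
import numpy as np

# Reflection across the x-z plane (flips the y axis). Which plane is
# appropriate depends on the workspace and camera; this is an assumption.
M = np.diag([1.0, -1.0, 1.0])


def mirror_pose(position, rotation):
    """Mirror a pose given as a (3,) position and a (3, 3) rotation."""
    mirrored_position = M @ position
    # Conjugating by M keeps the result a proper rotation (det = +1),
    # whereas M @ rotation alone would have det = -1.
    mirrored_rotation = M @ rotation @ M
    return mirrored_position, mirrored_rotation


def mirror_image(image):
    """Flip an (H, W, C) image left to right."""
    return image[:, ::-1, :]


# Toy usage: a 90-degree rotation about z and a small 3-channel image.
rot_z = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
pos = np.array([0.3, 0.2, 0.1])
m_pos, m_rot = mirror_pose(pos, rot_z)
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)
m_image = mirror_image(image)
```

Applying `mirror_pose` twice returns the original pose, which is a quick sanity check that the reflection is consistent.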

2606.20056 2026-06-19 cs.RO New submission

VFILC: Accurate Frequency Extrapolations in Imitation Learning via Sampling Frequency ILC

Nozomu Masuya, Toshiaki Tsuji, Sho Sakaino

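Affiliations: Grad. School of Science Technology, University of Tsukuba, Tsukuba, Japan; Engineering, Saitama University, Saitama, Japan; Information Engineering, University of Tsukuba, Tsukuba, Japan

AI summary: Proposes VFILC, which combines variable-frequency imitation learning with feedforward and feedback iterative learning control to achieve accurate speed extrapolation on three tasks, reducing frequency error by up to 81%.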

Comments 8 pages, 17 figures. Accepted at IROS 2026

Abstract

Conventional neural network (NN)-based imitation learning methods for variable-speed motion either restricted their scope to interpolated speeds, or generated unpredictable motions when extrapolating beyond trained velocity ranges. Variable-frequency imitation learning (VFIL) enabled extrapolations of speeds by linking the NN model's sampling frequency to the motion frequency, whereas its open-loop configuration caused frequency errors, especially in the extrapolated high-frequency settings. This study proposes variable-frequency imitation learning with iterative learning control (VFILC) based on a combination of VFIL and iterative learning control (ILC) with both feedforward and feedback parts, the former taking advantage of VFIL and the latter adjusting the frequency errors. The experimental results showed that the proposed method successfully and accurately extrapolated motion speeds and reduced frequency errors in all three tasks, and that the feedback especially reduced the frequency errors by a remarkable 81% in the wiping task and 50% in the shaking task, both compared to simple feedforward VFIL, when extrapolating at double the average speed in the training data. The proposed method also improved accuracy by 27% compared with VFIL even at an interpolated frequency for a contact-rich mixing task affected by complex friction traits.
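
The role of the feedback part can be illustrated with a textbook P-type iterative learning control update on a commanded frequency. This is a toy sketch only: the plant model (motion comes out 20% slower than commanded), the gain, and the names `run_ilc`, `feedforward`, and `correction` are hypothetical and are not taken from the paper.

```python
def run_ilc(target_hz, plant, gain=0.5, iterations=20):
    """Correct a commanded frequency from trial to trial.

    plant: callable mapping a commanded frequency to the frequency the
        robot actually achieved on that trial (hypothetical stand-in).
    """
    feedforward = target_hz   # open-loop command, standing in for VFIL
    correction = 0.0          # learned feedback term, updated per trial
    errors = []
    for _ in range(iterations):
        achieved = plant(feedforward + correction)
        error = target_hz - achieved
        errors.append(error)
        # P-type ILC update: carry the correction over to the next trial.
        correction += gain * error
    return feedforward + correction, errors


# Toy plant: the motion comes out 20% slower than commanded.
command, errors = run_ilc(target_hz=2.0, plant=lambda f: 0.8 * f)
```

For this toy plant the error shrinks by a constant factor each trial, so the open-loop frequency error of the first trial is progressively removed.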

2606.20135 2026-06-19 cs.RO cs.AI New submission

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

Jianing Guo, Fangzheng Chen, Zihao Mao, Wong Lik Hang Kenny, Zhenhong Wu, Yu Li, Yishuai Cai, Yuanpei Chen, Yikun Ban, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Simin Li

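Affiliations: Beihang University; Peking University; The Chinese University of Hong Kong; PKU-Psibot Lab; Zhongguancun Laboratory; Hefei Comprehensive National Science Center

AI summary: Proposes Frequency-Aware Flow Matching (FAFM), which transforms discrete action sequences into the frequency domain with the discrete cosine transform for flow matching and regularizes the first-order temporal derivative to generate smooth, continuous actions, improving success rate, multimodal expressivity, and motion smoothness.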

Abstract

Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. FAFM is simple, introduces no additional network parameters, and applies to standalone flow-matching policies and vision-language-action models. Across a synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, and robustness to mechanical bias and mixed-frequency input. These gains are consistent when deployed on a real-world Franka robot. Code available at https://anonymous.4open.science/r/FAFM.
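
The DCT round trip mentioned in this abstract (discrete chunk to coefficients, then a cosine expansion evaluated at arbitrary times) can be sketched with NumPy alone. This is an illustrative sketch assuming an orthonormal DCT-II; the flow-matching model over the coefficients is omitted, and the function names `dct_basis`, `to_coeffs`, and `from_coeffs` are made up for the example.

```python
import numpy as np


def dct_basis(times, num_coeffs, num_samples):
    """Orthonormal DCT-II cosine basis at (possibly fractional) times.

    Times are in units of the original sample index, so times 0..N-1
    reproduce the original sampling grid.
    """
    k = np.arange(num_coeffs)
    scale = np.full(num_coeffs, np.sqrt(2.0 / num_samples))
    scale[0] = np.sqrt(1.0 / num_samples)
    t = np.asarray(times, dtype=float)[:, None]
    angles = np.pi * (t + 0.5) * k[None, :] / num_samples
    return scale[None, :] * np.cos(angles)


def to_coeffs(actions):
    """Map an (N, action_dim) chunk to (N, action_dim) DCT coefficients."""
    n = actions.shape[0]
    basis = dct_basis(np.arange(n), n, n)   # (N, N), orthogonal matrix
    return basis.T @ actions


def from_coeffs(coeffs, times):
    """Evaluate the cosine expansion at arbitrary times."""
    n = coeffs.shape[0]
    return dct_basis(times, n, n) @ coeffs


# Toy usage: a 10-step, 2-D action chunk re-sampled at 4x the rate.
steps = np.linspace(0.0, 3.0, 10)
chunk = np.stack([np.linspace(0.0, 1.0, 10), np.sin(steps)], axis=1)
coeffs = to_coeffs(chunk)
same_rate = from_coeffs(coeffs, np.arange(10))
dense = from_coeffs(coeffs, np.linspace(0.0, 9.0, 37))
```

Because the expansion is a function of continuous time, the same coefficients can be decoded at whatever control frequency the robot runs at, which is the property the abstract relies on for mixed-frequency data.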

2606.20562 2026-06-19 cs.RO New submission

MemoryWAM: Efficient World Action Modeling with Persistent Memory

Sizhe Yang, Juncheng Mu, Tianming Wei, Chenhao Lu, Xiaofan Li, Linning Xu, Zhengrong Xue, Zhecheng Yuan, Dahua Lin, Jiangmiao Pang, Huazhe Xu

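Affiliations: The Chinese University of Hong Kong; Tsinghua University; Zhejiang University

AI summary: Proposes MemoryWAM, which uses a hybrid memory design and a tailored attention mechanism for efficient memory-dependent decision-making in long-horizon robot manipulation, outperforming existing VLA and WAM baselines.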

Abstract

Robust robotic manipulation in the real world requires not only an understanding of the current observation, but also memory and dynamics modeling. World action models (WAMs) possess these capabilities by jointly modeling visual foresight and actions conditioned on both current and historical observations, making them a promising paradigm for robotic manipulation. However, existing WAMs face a fundamental trade-off: methods with efficient inference typically condition only on a bounded window of recent observations and therefore struggle in non-Markovian environments, whereas methods that preserve long histories incur time and space costs that grow substantially with sequence length. To address this challenge, we introduce MemoryWAM, a world action model with efficient persistent memory. MemoryWAM uses a hybrid memory design that combines recent frames, event-boundary anchor frames, and compact gist tokens that summarize long-range history. A tailored attention mechanism enables retrieval of both detailed short-term context and compressed long-term context, supporting memory-dependent decision-making with reduced inference latency and GPU memory usage. Across long-horizon, memory-dependent manipulation tasks in both simulation and the real world, MemoryWAM outperforms strong vision-language-action (VLA) and WAM baselines while maintaining favorable computational efficiency.

2606.19408 2026-06-19 cs.LG cs.RO Cross-listing

FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning

Takanori Yoshimoto, Yang Hu, Naruya Kondo, Tatsuya Matsushima

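Affiliations: University of Tsukuba; The University of Tokyo

AI summary: Addresses the trade-off caused by fixed-capacity bottlenecks in latent action models with FlexLAM, which learns variable-length latent actions via nested dropout; without new architectures or losses, it matches or surpasses fixed-capacity models under scarce-label supervision and on a low-return task, and supports adjusting the token budget at inference time.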

Abstract

Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.
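
Nested dropout, the training trick this abstract names, keeps a random-length prefix of each code and drops everything after it, which is what makes truncated codes usable at inference time. The sketch below shows only that masking step; the uniform prefix-length distribution, the tensor layout, and the helper names `nested_dropout` and `truncate` are assumptions for illustration.

```python
import numpy as np


def nested_dropout(latents, rng):
    """Keep a random-length prefix of each latent code and zero the rest.

    latents: array of shape (batch, num_tokens, dim).
    Returns the masked latents and the sampled prefix lengths.
    """
    batch, num_tokens, _ = latents.shape
    # Uniform prefix lengths in [1, num_tokens]; the sampling
    # distribution is an assumption of this sketch.
    keep = rng.integers(1, num_tokens + 1, size=batch)
    token_index = np.arange(num_tokens)[None, :]                 # (1, T)
    mask = (token_index < keep[:, None]).astype(latents.dtype)   # (B, T)
    return latents * mask[:, :, None], keep


def truncate(latents, budget):
    """Inference-time budget: use only the first `budget` tokens."""
    return latents[:, :budget, :]


rng = np.random.default_rng(0)
codes = rng.normal(size=(4, 8, 16))
masked, keep = nested_dropout(codes, rng)
short = truncate(codes, budget=3)
```

Since later tokens are dropped more often than earlier ones during training, the earliest tokens are pushed to carry the most essential information, matching the "compact structure first, detail only when needed" behaviour described above.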

2509.19658 2026-06-19 cs.RO cs.AI Updated version

RoboSSM: Scalable In-context Imitation Learning via State-Space Models

Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, Peter Stone

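Affiliations: The University of Texas at Austin; KAIST; FAIR at Meta; Amazon; Sony AI

AI summary: Proposes RoboSSM, which replaces Transformers with a state-space model for in-context imitation learning, generalizes better to unseen and long-horizon tasks on the LIBERO benchmark, and shows for the first time that SSMs are an efficient, scalable backbone for ICIL.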

Comments IROS 2026

Abstract

In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSMs). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. Through diverse experiments on the LIBERO benchmark, we demonstrate the effectiveness of applying SSMs to ICIL, achieving better generalization to both unseen and long-horizon tasks than Transformer-based ICIL methods by handling longer contexts at test time. These results show for the first time that SSMs are an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.

2505.17006 2026-06-19 cs.CV cs.RO Updated version

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang

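Affiliations: Nanjing University; Shanghai AI Lab; University of Science and Technology of China; Zhejiang University; Fudan University; Tongji University

AI summary: Proposes CoMo, which learns continuous latent motion from internet videos through early temporal difference and temporal contrastive learning, avoiding the information loss of discretization; it generalizes zero-shot to generate pseudo action labels, and co-trained policies perform strongly in simulated and real-world experiments.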

Comments CVPR 2026

Abstract

Unsupervised learning of latent motion from Internet videos is crucial for robot learning. Existing discrete methods generally mitigate the shortcut learning caused by extracting excessive static backgrounds through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot action, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the shortcut learning difficulty and explicitly enhance motion cues. Additionally, to ensure latent motion better captures meaningful foregrounds, we further propose a temporal contrastive learning (Tcl) scheme. Specifically, positive pairs are constructed with a small future frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. The proposed Td and Tcl work synergistically and effectively ensure that the latent motion focuses better on the foreground and reinforces motion cues. Critically, CoMo exhibits strong zeroshot generalization, enabling it to generate effective pseudo action labels for unseen videos. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and auto-regressive architectures.

2602.04037 2026-06-19 cs.LG cs.RO Updated version

DADP: Domain Adaptive Diffusion Policy

Pengcheng Wang, Qinghang Liu, Haotian Lin, Yiheng Li, Guojian Zhan, Masayoshi Tomizuka, Yixiao Wang

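Affiliations: University of California, Berkeley, California, USA; Peking University, Beijing, China; Tsinghua University, Beijing, China

AI summary: Proposes DADP, which achieves robust zero-shot adaptation across environments with different dynamics through unsupervised disentanglement and domain-aware diffusion injection, outperforming prior methods on locomotion and manipulation tasks.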

Abstract

Learning domain adaptive policies that can generalize to unseen transition dynamics remains a fundamental challenge in learning-based control. Substantial progress has been made through domain representation learning to capture domain-specific information, thus enabling domain-aware decision making. We analyze the process of learning domain representations through dynamical prediction and find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such a mixture can confuse the conditioned policy, thereby constraining zero-shot adaptation. To tackle the challenge, we propose DADP (Domain Adaptive Diffusion Policy), which achieves robust adaptation through unsupervised disentanglement and domain-aware diffusion injection. First, we introduce Lagged Context Dynamical Prediction, a strategy that conditions future state estimation on a historical offset context; by increasing this temporal gap, we disentangle static domain representations without supervision by filtering out transient properties. Second, we integrate the learned domain representations directly into the generative process by biasing the prior distribution and reformulating the diffusion target. Extensive experiments on challenging benchmarks across locomotion and manipulation demonstrate the superior performance and generalizability of DADP over prior methods. More visualization results are available at https://outsider86.github.io/DomainAdaptiveDiffusionPolicy/.

2. Motion Planning, Control, and Dynamics (11 papers)

2606.19512 2026-06-19 cs.RO cs.SY eess.SY New submission

Proprioceptive Invariant State Estimation for Humanoid Robots on Non-Inertial Ground

Falak Mandali, Zijian He, Yan Gu

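Affiliations: Purdue University

AI summary: Proposes a proprioception-only InEKF method that uses foot-mounted IMUs and kinematic constraints for real-time state estimation of humanoid robots on non-inertial ground, with a 96% faster convergence rate and an 80% lower position estimation error.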

Abstract

This paper presents an invariant extended Kalman filtering (InEKF) approach for real-time state estimation of humanoid robots operating on non-inertial ground using only onboard proprioceptive sensing. The proposed approach estimates the robot's base position and velocity relative to the moving ground frame without requiring direct measurements of ground motion or externally mounted sensors. By exploiting kinematic constraints at the stance foot through foot-mounted IMUs, the filter accounts for ground-induced nonlinearities in the process and measurement models while remaining fully proprioceptive. The estimator is formulated to admit a right-invariant measurement model, enabling favorable error dynamics under large initial uncertainties. Observability analysis establishes conditions under which the robot's relative base position and velocity are observable with respect to the non-inertial ground frame. Experiments with the Digit humanoid robot standing and squatting atop a swaying and pitching ground showcase a 96% speedup in convergence rate and an 80% reduction in position estimate errors over existing InEKFs. Walking experiments on a uni-axially rotating ground achieve an average estimation error of less than 9 cm for an initial error of up to 1 m.

2606.19633 2026-06-19 cs.RO cs.AI New submission

CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion

Francisco Affonso, Matheus P. Angarola, Ana Luiza Mineiro, Aditya Potnis, Marcelo Becker, Girish Chowdhary

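Affiliations: University of Illinois Urbana-Champaign; University of São Paulo

AI summary: For perceptive locomotion over discontinuous terrain, proposes CTS-MoE, which composes shared behaviors with a dense mixture-of-experts policy and perception-based gating and uses a multi-critic to prevent value interference, enabling end-to-end training and implicit terrain adaptation and outperforming baselines in simulation and on hardware.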

Abstract

Perceptive legged locomotion over discontinuous terrain (e.g., stairs, gaps, and obstacles) requires adaptive behavior, as a single conservative gait cannot produce the anticipatory maneuvers needed for abrupt topology changes. Cast as multi-task reinforcement learning, this problem introduces a tension between sharing and separation. Tasks use a common locomotion base but have conflicting rewards, so a policy must share behavior while avoiding value interference. Prior work addresses only one side, with monolithic policies sacrificing specialization and hierarchical sub-policies sacrificing generalization across transitions and unseen terrain. We propose CTS-MoE, which combines a dense mixture-of-experts actor with perception-based gating to compose shared behaviors and a multi-critic with task-specific value heads to prevent interference. The model is trained end-to-end in a single-stage concurrent teacher-student setup that handles partial observability and avoids sequential distillation, with task labels used only during training. At deployment, routing depends solely on perception, allowing terrain adaptation without a high-level selector or terrain classifier. Experiments on a Unitree Go1 in simulation and on hardware across seen and unseen terrains show task-aware specialization, with lower tracking error and higher success rates than monolithic baselines. Project Website: https://cts-moe.github.io/ .
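The dense, perception-gated mixture described in this abstract can be illustrated with a minimal NumPy sketch. It is not the CTS-MoE implementation: the linear experts, the layer sizes, and the names `dense_moe_action`, `gate_w`, and `expert_ws` are assumptions chosen for illustration. The point it shows is that every expert contributes to the action (dense, not top-k), and the blending weights depend on perception alone.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dense_moe_action(proprio, perception, gate_w, expert_ws):
    """Blend the outputs of all experts with weights computed from perception only."""
    gate = softmax(perception @ gate_w)                           # (num_experts,)
    expert_actions = np.stack([proprio @ w for w in expert_ws])   # (num_experts, act_dim)
    return gate @ expert_actions, gate

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
num_experts, perc_dim, prop_dim, act_dim = 4, 16, 8, 12
gate_w = rng.normal(size=(perc_dim, num_experts))
expert_ws = [rng.normal(size=(prop_dim, act_dim)) for _ in range(num_experts)]
action, gate = dense_moe_action(
    rng.normal(size=prop_dim), rng.normal(size=perc_dim), gate_w, expert_ws
)
```

Because the gate never sees a task label, the same forward pass works at deployment without a terrain classifier, which is the property the abstract emphasizes.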

2606.19699 2026-06-19 cs.RO cs.LG cs.SY eess.SY 新提交

Comparative Study on Agility, Efficiency, and Impact Absorption of Bipedal Robots with Active Toes

具有主动脚趾的双足机器人敏捷性、效率和冲击吸收的比较研究

Joong-Gil Kim, Wontae Ye, Geunwoo Cho, Seong-Ho Yun, Se-Hyoung Cho, Yong-Jae Kim

发表机构 * School of Electrical, Electronics and Communication Engineering, Korea University of Technology and Education(韩国技术教育大学电气、电子与通信工程学院) Artificial Intelligence and Robotics Institute, Korea Institute of Science and Technology(韩国科学技术研究院人工智能与机器人研究所) Robot Innovation Hub, WIRobotics Inc.(WIRobotics公司机器人创新中心)

AI总结 提出一种14自由度双足机器人,模拟人类脚趾的轻量、高扭矩、坚固特性,通过高保真仿真训练环境,对比有无主动脚趾的配置,发现脚趾机器人以1.33米/秒行走时,CoT降低17.5%,脚跟冲击力降低5.0%,路径偏差平均和最大分别降低25.0%和34.0%。

Comments 6 pages, 7 figures

详情
AI中文摘要

人类腿部表现出高效率、敏捷性和冲击吸收能力,其中脚趾在这些能力中起着关键作用。尽管已经有许多尝试在机器人中实现类似人类的脚趾,但它们尚未完全复制人类特征,也没有严格验证其益处。我们提出了一种14自由度的双足机器人,模拟人类脚趾的轻量、高扭矩、坚固特性。为了定量分析主动脚趾在敏捷性、效率和冲击吸收方面的有效性,我们开发了一个高保真仿真训练环境,该环境反映了具有耦合传动和精确功耗的实际执行器。为了确保有和没有主动脚趾的配置之间的公平比较,我们设计了一个最小化强化学习奖励函数,并对两者应用了相同的训练程序。仿真结果表明,在1.33米/秒行走时,与无脚趾配置相比,配备脚趾的机器人将CoT降低了17.5%,脚跟冲击力降低了5.0%。在敏捷性测试中,平均和最大路径偏差分别降低了25.0%和34.0%。

英文摘要

Human legs exhibit high efficiency, agility, and impact absorption, with toes playing a crucial role in these capabilities. While many attempts have been made to implement human-like toes in robots, they have not fully replicated human characteristics nor rigorously validated their benefits. We propose a 14-DOF biped robot emulating human toes' lightweight, high-torque, robust nature. To quantitatively analyze the effectiveness of the active toes in terms of agility, efficiency, and impact absorption, we developed a high-fidelity simulation training environment that reflects actual actuators with coupled transmissions and accurate power consumption. To ensure a fair comparison between configurations with and without active toes, we designed a minimal RL reward function and applied an identical training procedure to both. The simulation results indicate that, at 1.33 m/s walking, the toe-equipped robot reduced CoT by 17.5% and heel-strike GRF by 5.0% compared with the toe-ablation configuration. On the agility test, average and maximum path deviation decreased by 25.0% and 34.0%, respectively.
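The abstract reports CoT without defining it. Assuming it refers to the standard dimensionless cost of transport, CoT = P / (m g v), the metric and a relative reduction can be computed as in the sketch below. The power, mass, and speed values are hypothetical placeholders, not measurements from the paper.

```python
G = 9.81  # gravitational acceleration in m/s^2

def cost_of_transport(mean_power_w, mass_kg, speed_mps):
    """Dimensionless cost of transport: mean power / (weight * speed)."""
    return mean_power_w / (mass_kg * G * speed_mps)

def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline

# Hypothetical numbers for illustration only.
cot_baseline = cost_of_transport(mean_power_w=400.0, mass_kg=30.0, speed_mps=1.33)
cot_variant = cost_of_transport(mean_power_w=340.0, mass_kg=30.0, speed_mps=1.33)
reduction = relative_reduction(cot_baseline, cot_variant)
```

At equal mass and speed the reduction in CoT equals the reduction in mean power, which is why a fair comparison needs both configurations to walk at the same commanded speed.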

2606.19729 2026-06-19 cs.RO cs.AI 新提交

VOiLA: Vectorized Online Planning with Learned Diffusion Model for POMDP Agents

VOiLA: 基于学习扩散模型的向量化在线规划用于POMDP智能体

Marcus Hoerger, Rishikesh Joshi, Rahul Shome, Ian Manchester, Hanna Kurniawati

发表机构 * Australian National University(澳大利亚国立大学) The University of Sydney(悉尼大学)

AI总结 提出VOiLA框架,利用条件扩散模型学习POMDP模型,通过蒸馏加速采样并与向量化在线规划器集成,在三个基准任务和实物机器人上实现高效在线规划。

Comments Submitted to the 2026 International Symposium of Robotics Research (ISRR)

详情
AI中文摘要

不确定性下的规划是自主机器人的关键能力。部分可观测马尔可夫决策过程(POMDP)为此提供了强大框架。尽管基于POMDP的规划已取得显著进展,但其在现实问题中的应用常受限于难以获得准确的POMDP模型。我们提出VOiLA(Vectorized Online planning wIth Learned diffusion model for POMDP Agents),一个学习任务无关POMDP模型以实现在不确定性下在线规划的框架。VOiLA使用条件扩散模型学习转移和观测采样器,并学习用于基于粒子的信念更新的观测似然模型。为实现高效在线规划,扩散采样器被蒸馏为紧凑的前馈生成器,并与VOPP(一种利用GPU并行化的在线POMDP规划器)集成。实验结果表明,蒸馏策略将采样成本降低了近三个数量级,使学习到的生成式POMDP模型对在线规划实用。在三个基准问题上的评估表明,VOiLA在使用不到10%训练数据的情况下,性能达到或优于递归软演员-评论家算法,并且对未见环境配置的泛化能力更强。实物机器人评估表明,VOiLA仅使用模拟数据学习模型,并在10次运行中全部成功完成任务。

英文摘要

Planning under uncertainty is an essential capability for autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such a capability. Although POMDP-based planning has advanced significantly, its application to real-world problems is often limited by the difficulty of obtaining faithful POMDP models. We present Vectorized Online planning wIth Learned diffusion model for POMDP Agents (VOiLA), a framework that learns task-agnostic POMDP models for online planning under uncertainty. VOiLA learns transition and observation samplers using conditional diffusion models and learns observation-likelihood models for particle-based belief updates. To enable efficient online planning, the diffusion samplers are distilled into compact feedforward generators and integrated with Vectorized Online POMDP Planner (VOPP), an online POMDP planner designed to leverage GPU parallelization. Experimental results indicate the distillation strategy reduces sampling cost by up to nearly three orders of magnitude, making learned generative POMDP models practical for online planning. Evaluation of VOiLA on three benchmark problems indicates that VOiLA achieves equal or better performance than Recurrent Soft Actor Critic while using less than 10% of the training data, and generalizes much better to unseen environment configurations. Physical robot evaluation indicates VOiLA uses the models learned using only simulated data and generates a policy that successfully accomplishes the task in 10 of 10 runs.
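The particle-based belief update mentioned in this abstract follows the usual propagate, weight, resample pattern. The sketch below shows that pattern with toy 1-D stand-ins for the learned components: `toy_transition` and `toy_likelihood` are hypothetical placeholders where a learned transition sampler and a learned observation-likelihood model would plug in, and none of the names come from the VOiLA code.

```python
import numpy as np

def belief_update(particles, action, observation, sample_transition, obs_likelihood, rng):
    """One particle-filter step: propagate, weight by observation likelihood, resample."""
    propagated = sample_transition(particles, action, rng)
    weights = obs_likelihood(observation, propagated, action)
    total = weights.sum()
    if total <= 0.0:
        # All particles are inconsistent with the observation: fall back to uniform weights.
        weights = np.full(len(propagated), 1.0 / len(propagated))
    else:
        weights = weights / total
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]

# Toy 1-D stand-ins (assumptions) for the learned sampler and likelihood model.
def toy_transition(states, action, rng):
    return states + action + rng.normal(scale=0.1, size=states.shape)

def toy_likelihood(obs, states, action):
    return np.exp(-0.5 * ((obs - states) / 0.2) ** 2)

rng = np.random.default_rng(0)
belief = rng.uniform(-5.0, 5.0, size=2000)
belief = belief_update(belief, 1.0, 2.0, toy_transition, toy_likelihood, rng)
```

The transition sampler is called once per particle per step, so its cost dominates online planning; that is the motivation the abstract gives for distilling the diffusion samplers into feedforward generators.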

2606.19031 2026-06-19 cs.RO 新提交

Congestion-Aware Robot Tour Planning in Crowded Environments

拥挤环境中的拥塞感知机器人巡视规划

Stefano Bernagozzi, Charlie Street, Masoumeh Mansouri, Lorenzo Natale

发表机构 * Istituto Italiano di Tecnologia(意大利理工学院) Università di Genova(热那亚大学) University of Birmingham(伯明翰大学)

AI总结 提出一种基于概率的巡视规划器,通过学习人流预测模型并在线构建马尔可夫决策过程,在拥挤环境中高效规划机器人路径,减少拥塞影响。

Comments Accepted to IEEE IROS 2026

Journal ref IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2026

详情
AI中文摘要

自主移动服务机器人通常需要完成在环境中遍历一组位置的巡视任务。例如,引导人们穿过购物中心、在配送中心递送包裹或在博物馆提供导览。然而,在拥挤环境中,人群的存在可能对机器人性能产生负面影响。例如,人类会触发机器人的碰撞避免操作,从而降低机器人速度。人群随机移动且随时间变化。本文提出一种针对拥挤环境的概率巡视规划器,该规划器明确考虑人类拥塞。我们学习圆形线性流场(CLiFF)地图,该地图根据初始观测预测人类轨迹。然后,我们利用这些预测在线构建并求解马尔可夫决策过程,从而高效地将机器人引导通过环境。我们的方法具有足够的可扩展性,能够在观察到新人群时重新规划。我们在购物中心的真实人群数据集上评估了该方法。

英文摘要

Autonomous mobile service robots are often required to complete tours that require navigating through a set of locations in an environment. Example domains include guiding people through a shopping mall, delivering packages in a fulfilment centre, or giving guided tours in a museum. However, in crowded environments, the presence of people may negatively impact robot performance. For example, humans will activate robot collision avoidance manoeuvres that slow the robot down. Crowds move stochastically and vary throughout the day. In this paper we present a probabilistic tour planner for crowded environments which explicitly reasons over human congestion. We learn circular linear flow field (CLiFF) maps which predict human trajectories given an initial observation. We then use these predictions to build and solve a Markov decision process online which efficiently routes the robot through the environment. Our approach is scalable enough to re-plan as new people are observed. We evaluate our approach on a real-world crowd dataset in a shopping mall.
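To make the idea of routing over congestion predictions concrete, here is a deliberately small value-iteration sketch. It assumes each edge has a free-flow traversal time, a predicted congestion probability, and a delay incurred when congested; in the paper such predictions would come from the learned CLiFF maps, whereas the graph, the numbers, and the function name `expected_time_to_goal` here are hypothetical.

```python
def expected_time_to_goal(edges, goal, sweeps=100):
    """Value iteration; edges[node] = [(next_node, free_time, p_congested, delay), ...]."""
    nodes = set(edges) | {goal}
    for outgoing in edges.values():
        nodes |= {nxt for nxt, *_ in outgoing}
    value = {n: (0.0 if n == goal else float("inf")) for n in nodes}
    for _ in range(sweeps):
        for node, outgoing in edges.items():
            if node == goal:
                continue
            # Expected edge cost plus the expected remaining time from the successor.
            value[node] = min(
                free + p * delay + value[nxt] for nxt, free, p, delay in outgoing
            )
    return value

# Hypothetical map: the short route through the atrium is usually crowded.
edges = {
    "entrance": [("atrium", 10.0, 0.8, 30.0), ("corridor", 25.0, 0.1, 10.0)],
    "atrium": [("shop", 5.0, 0.0, 0.0)],
    "corridor": [("shop", 6.0, 0.0, 0.0)],
}
value = expected_time_to_goal(edges, "shop")
```

In this toy map the longer corridor route wins in expectation, which is the kind of trade-off a congestion-aware planner is meant to capture. A tour planner would additionally track which locations have already been visited.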

2606.19928 2026-06-19 cs.RO 新提交

SWAP: Symmetric Equivariant World-Model for Agile Robot Parkour

SWAP: 用于敏捷机器人跑酷的对称等变世界模型

Kaixin Lan, Ze Wang, Hongyi Li, Lei Jiang, Chaojie Fu, Chengkai Su, Choi Lam Wong, Yongbin Jin, Hongtao Wang

发表机构 * Center for X-Mechanics, Zhejiang University(浙江大学交叉力学中心) ZJU-Hangzhou Global Scientific and Technology Innovation Center(浙江大学杭州国际科创中心) Mirrorme Technology Co., Ltd.(魔镜科技有限公司)

AI总结 提出SWAP框架,将对称等变性嵌入世界模型和演员-评论家网络,实现四足机器人跑酷记录突破(跨越2.13米间隙、攀爬1.63米平台),并展现出对未见镜像地形的几何泛化与零样本迁移能力。

详情
AI中文摘要

虽然潜在世界模型能够实现极限跑酷所需的主动预测,但其纯数据驱动的特性迫使它们将左右对称交互冗余编码为独立模式。这增加了学习负担并阻碍了几何规律性的捕获,限制了潜在空间对下游策略的效率。为了解决这个问题,我们提出了SWAP,一个端到端的等变对称世界模型。该框架将对称性直接嵌入到世界模型和演员-评论家网络中。在真实世界测试中,机器人跨越了2.13米的间隙并攀爬了1.63米的高台,打破了四足机器人跑酷的记录。此外,该框架对未见过的镜像地形展现出鲁棒的几何泛化能力,并在多种户外环境中具有卓越的零样本迁移能力。这些结果表明,对称等变性是推动学习型腿式运动物理极限的有效结构先验。

英文摘要

While latent world models enable the proactive predictions required for extreme parkour, their purely data-driven nature forces them to redundantly encode left-right symmetric interactions as independent patterns. This inflates the learning burden and hinders the capture of geometric regularities, restricting the latent space's efficiency for downstream policies. To address this, we propose SWAP, an end-to-end equivariant symmetric world model. This framework embeds symmetry directly into both the world model and the actor-critic networks. In real-world tests, the robot leaps across a 2.13 m gap and climbs a 1.63 m platform, breaking records for quadruped parkour. Furthermore, the framework exhibits robust geometric generalization to unseen mirrored terrains and exceptional zero-shot transferability across diverse outdoor environments. These results demonstrate that symmetry equivariance is an effective structural prior for pushing the physical boundaries of learned legged locomotion.
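Left-right equivariance means that mirroring the observation mirrors the action: π(M_o o) = M_a π(o). SWAP builds this property into its networks; the sketch below uses a simpler and different technique, averaging an arbitrary policy over the two-element mirror group, only to show what the constraint means. The mirror matrices and the toy policy are assumptions.

```python
import numpy as np

def symmetrize(policy, mirror_obs, mirror_act):
    """Wrap `policy` so that pi(M_o @ o) == M_a @ pi(o); both mirrors must be involutions."""
    def equivariant_policy(obs):
        return 0.5 * (policy(obs) + mirror_act @ policy(mirror_obs @ obs))
    return equivariant_policy

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))

def policy(obs):
    # Arbitrary non-symmetric toy policy.
    return np.tanh(W @ obs)

# Hypothetical mirrors that swap left/right channels of observation and action.
mirror_obs = np.eye(4)[[1, 0, 3, 2]]
mirror_act = np.eye(2)[[1, 0]]

eq_policy = symmetrize(policy, mirror_obs, mirror_act)
obs = rng.normal(size=4)
lhs = eq_policy(mirror_obs @ obs)   # act on the mirrored observation
rhs = mirror_act @ eq_policy(obs)   # mirror the action for the original observation
```

With the constraint in place, experience on a left-hand obstacle also determines the behavior on its mirrored right-hand counterpart, which matches the generalization to unseen mirrored terrain reported in the abstract.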

2606.20197 2026-06-19 cs.RO 新提交

Stable Transformer-Actor-Critic Model Predictive Control: A Contraction Analysis Approach

稳定的Transformer-Actor-Critic模型预测控制:一种收缩分析方法

Antonio Marino, Valerio Modugno, Marco Cognetti

AI总结 提出一种Transformer-Actor-Critic MPC架构,通过证明Transformer满足增量输入-状态稳定性并利用黎曼收缩理论分析互联动力学,将理论界作为训练正则化项,实现可证明鲁棒的控制策略。

详情
AI中文摘要

Actor-Critic模型预测控制(MPC)有效解决了复杂的非凸控制问题,但保证这些流程中基于序列的学习模型的闭环稳定性仍然具有挑战性。本文介绍了一种新颖的Transformer-Actor-Critic MPC架构,具有形式化的鲁棒性保证。首先,我们证明了Transformer网络可以满足全局增量输入-状态稳定性($\delta$ISS)。然后,我们利用黎曼收缩理论分析物理对象与预测神经网络之间的互联动力学。最后,我们将这些理论界作为训练正则化项,以产生可证明鲁棒的策略。该框架在非线性3D无人机模型上进行了验证,执行目标到达和避障机动。

英文摘要

Actor-Critic Model Predictive Control (MPC) effectively addresses complex, non-convex control problems, but guaranteeing the closed-loop stability of sequence-based learning models within these pipelines remains challenging. This paper introduces a novel Transformer-Actor-Critic MPC architecture with formal robustness guarantees. First, we prove that Transformer networks can satisfy global incremental Input-to-State Stability ($\delta$ISS). We then leverage Riemannian contraction theory to analyze the interconnected dynamics between the physical plant and the predictive neural network. Finally, we integrate these theoretical bounds as a training regularizer to yield a certifiably robust policy. The framework is validated on a nonlinear 3D drone model executing target-reaching and obstacle-avoidance maneuvers.
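For reference, the usual textbook definition of incremental input-to-state stability is given below for a generic discrete-time system $x_{k+1} = f(x_k, u_k)$. The discrete-time setting and the notation are assumptions made here; the paper's exact formulation may differ. The system is $\delta$ISS if there exist a class-$\mathcal{KL}$ function $\beta$ and a class-$\mathcal{K}_\infty$ function $\gamma$ such that, for all initial states $\xi_1, \xi_2$, all input sequences $u^1, u^2$, and all $k \ge 0$:

$$
\left\| x(k;\xi_1,u^1) - x(k;\xi_2,u^2) \right\| \;\le\; \beta\!\left(\left\|\xi_1 - \xi_2\right\|,\, k\right) + \gamma\!\left(\sup_{0 \le j < k} \left\| u^1_j - u^2_j \right\|\right)
$$

Two trajectories therefore forget their initial offset over time and stay within a distance bounded by the largest difference between their inputs.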

2606.20495 2026-06-19 cs.RO 新提交

Increasing Resilience of Continuum Robots via Motion Planning Algorithms

通过运动规划算法提高连续体机器人的韧性

Oxana Shamilyan, Ievgen Kabin, Zoya Dyka, Oleksandr Sudakov, Peter Langendoerfer

AI总结 本文实验研究运动规划算法对连续体机器人韧性的影响,通过改进遗传算法和A*算法,结合层次分析法评估路径质量,发现遗传算法生成更多样化路径,提升机器人韧性。

详情
AI中文摘要

本文介绍了针对韧性连续体机器人的运动规划实验研究。我们主要关注多准则决策、其在路径规划算法中的应用、对生成路径的影响以及执行时间。为此,我们使用了两种著名的路径规划算法,即遗传算法和A*算法,并通过添加层次分析法算法来评估生成路径的质量,对其进行了修改。在我们的实验中,层次分析法考虑了四个不同的准则,即距离、电机损伤、机器人手臂的机械损伤和精度,每个准则都被认为有助于连续体机器人的韧性。使用不同的准则对于延长连续体机器人的维护操作时间是必要的。我们使用两种不同的机器人模拟环境进行了实验。尽管我们显著简化了机器人模型及其环境,但我们仍然基于真实机器人原型实现了环境的一些特征。特别地,其中一个环境包含单路径点和多路径点,另一个环境仅包含多路径点。结果表明,与A*算法相比,遗传算法的性能时间不依赖于环境的基数。它生成更多样化的路径,从而提高了机器人的韧性。

英文摘要

This paper presents an experimental study of motion planning for resilient continuum robots. In this study we focus mainly on multi-criteria decision-making, its application to path-planning algorithms, and its impact on the generated path and execution time. To do this, we used two well-known path-planning algorithms, namely a genetic algorithm and the A* algorithm, and modified them by adding the Analytical Hierarchy Process to evaluate the quality of the generated paths. In our experiment the Analytical Hierarchy Process considers four different criteria, i.e. distance, motor damage, mechanical damage to the robot's arm, and accuracy, each considered to contribute to the resilience of a continuum robot. The use of different criteria is necessary to increase the time to maintenance operations of the continuum robot. We conducted the experiments using two different simulated environments of the robot. Although we significantly simplified the robot's model and its environment, we still implemented some of the features of the environment based on the real robot prototype. In particular, one of the environments has single- as well as multi-path points, and the other consists of multi-path points only. The results show that, in contrast to A*, the performance time of the genetic algorithm does not depend on the environment's cardinality. It generates more diverse paths, which increases the robot's resilience.
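The Analytical Hierarchy Process turns pairwise importance judgments into criterion weights, commonly by taking the principal eigenvector of the comparison matrix. The sketch below applies that standard procedure to the four criteria named in the abstract. The pairwise judgments and the helper names `ahp_weights` and `score_path` are hypothetical and do not reproduce the values used in the paper.

```python
import numpy as np

def ahp_weights(pairwise):
    """Priority weights (principal eigenvector) and consistency index of a reciprocal matrix."""
    pairwise = np.asarray(pairwise, dtype=float)
    eigvals, eigvecs = np.linalg.eig(pairwise)
    k = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, k].real)
    w /= w.sum()
    n = pairwise.shape[0]
    consistency_index = (eigvals[k].real - n) / (n - 1)
    return w, consistency_index

criteria = ["distance", "motor damage", "mechanical damage", "accuracy"]
# Hypothetical judgments: entry [i][j] says how much more important criterion i is than j.
pairwise = [
    [1,   1/2, 1/3, 2],
    [2,   1,   1/2, 3],
    [3,   2,   1,   4],
    [1/2, 1/3, 1/4, 1],
]
weights, ci = ahp_weights(pairwise)

def score_path(criteria_scores, weights):
    """Weighted sum of per-criterion scores for one candidate path."""
    return float(np.dot(weights, criteria_scores))
```

A planner can call `score_path` as the fitness function of a genetic algorithm or as the edge cost in A*. A consistency index close to zero indicates that the pairwise judgments do not contradict each other.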

2508.21677 2026-06-19 cs.RO 版本更新

Robust Convex Model Predictive Control with collision avoidance guarantees for robot manipulators

具有碰撞避免保证的机器人操作器鲁棒凸模型预测控制

Bernhard Wullt, Johannes Köhler, Per Mattsson, Mikael Norrlöf, Thomas B. Schön

发表机构 * ABB robotics(ABB机器人公司) Department of Mechanical Engineering, Imperial College London(帝国理工学院机械工程系) Department of Information Technology, Uppsala University(乌普萨拉大学信息科技系)

AI总结 提出一种结合鲁棒管MPC与走廊规划算法的凸MPC方案,在模型不确定下实现工业机器人快速无碰撞运动,优于基准方法。

详情
AI中文摘要

工业操作器通常在杂乱环境中运行,安全运动规划至关重要。然而,模型不确定性使任务更加复杂,导致保守的速度限制以减少干扰影响。因此,需要能够保证快速执行安全运动的控制方法。我们通过为操作器提出一种新颖的模型预测控制(MPC)方案来解决这一问题,其中两个主要组件是鲁棒管MPC和用于获得无碰撞运动的走廊规划算法。我们的方案形成凸MPC公式,可以快速求解,使方法具有实际应用价值。我们在模拟环境中展示了方法的有效性,该环境包含一个6自由度工业机器人在具有不确定模型参数的杂乱环境中运行。通过容忍更高水平的模型不确定性同时实现更快的运动,我们优于基准方法。

英文摘要

Industrial manipulators typically operate in cluttered environments, where safe motion planning is critical. However, model uncertainties further complicate this task, which leads to conservative speed limits to reduce the influence of disturbances. Hence, there is a need for control methods that can guarantee safe motions which are executed fast. We address this by suggesting a novel model predictive control (MPC) solution for manipulators, where our two main components are a robust tube MPC and a corridor planning algorithm to obtain collision-free motion. Our solution results in a convex MPC formulation, which we can solve fast, making our method practically useful. We demonstrate the efficacy of our method in a simulated environment with a 6 DOF industrial robot operating in cluttered environments with uncertain model parameters. We outperform benchmark methods by tolerating higher levels of model uncertainty while achieving faster motion.

2606.16780 2026-06-19 cs.RO 版本更新

DIFF-IPPO: Diffusion-Based Informative Path Planning with Open-Vocabulary Belief Maps

DIFF-IPPO:基于扩散的开放词汇信念地图信息路径规划

Sausar Karaf, Oleg Sautenkov, Mikhail Martynov, Dzmitry Tsetserukou

发表机构 * Intelligent Space Robotics Laboratory, CDE, Skoltech(智能空间机器人实验室,CDE,斯科尔科沃科学技术研究院)

AI总结 提出DIFF-IPPO框架,结合开放词汇信念地图生成器与扩散规划器,在非高斯信念图上生成全局轨迹,实现高效目标搜索,检测得分达81.49%-86.55%。

详情
AI中文摘要

探索和物体搜索要求机器人感知环境、识别感兴趣区域,并规划提高目标检测可能性或最大化信息增益的轨迹。许多IPP方法,特别是在连续环境监测中,依赖于高斯过程信念模型,而物体搜索场景通常从语义或开放词汇感知中产生复杂的多模态信念地图。直接基于这种非高斯信念地图的全局轨迹生成仍然相对未被充分探索。尽管基于扩散的规划器为此类分布建模提供了强大能力,但它们在信息路径规划中的应用仍然有限。在这项工作中,我们提出了DIFF-IPPO,一个集成了开放词汇信念地图生成器和基于扩散的规划器的流水线,用于在信念地图上生成全局轨迹。该方法生成的轨迹将传感器覆盖集中在高信念区域,在不同数据集场景下实现了81.49%至86.55%的归一化检测得分。我们在一个模拟的搜索与救援场景中验证了该系统,其中规划器搜索候选建筑区域以定位燃烧的建筑。在此设置中,一个由五架无人机组成的团队使用批处理信念地图条件轨迹生成,在3.5分钟内实现了首次检测。

英文摘要

Exploration and object search require robots to perceive their environment, identify regions of interest, and plan trajectories that improve target-detection likelihood or maximize information gain. Many IPP methods, especially in continuous environmental monitoring, rely on Gaussian-process belief models, while object-search settings often produce complex, multimodal belief maps from semantic or open-vocabulary perception. Global trajectory generation directly conditioned on such non-Gaussian belief maps remains comparatively underexplored. Although diffusion-based planners offer strong capabilities for modeling such distributions, their use in informative path planning remains limited. In this work, we propose DIFF-IPPO, a pipeline that integrates an open-vocabulary belief map generator with a diffusion-based planner for global trajectory generation over belief maps. The method generates trajectories that concentrate sensor coverage over high-belief regions, achieving normalized detection scores between 81.49% and 86.55% across different dataset scenarios. We validate the system in a simulated search-and-rescue scenario where the planner searches candidate building regions to locate a burning building. In this setting, a team of five drones using batched belief-map-conditioned trajectory generation achieves first detections in 3.5 minutes.

2604.09795 2026-06-19 eess.SY cs.RO cs.SY 版本更新

On Feedback Speed Control for a Planar Tracking

平面跟踪中的反馈速度控制

Xincheng Li, Tengyue Liu, Udit Halder

发表机构 * Department of Mechanical and Aerospace Engineering, University of South Florida(南佛罗里达大学机械与航空航天工程系)

AI总结 针对领航-跟随平面跟踪问题,提出一种反馈速度控制律与恒定方位角转向策略,实现并排编队并证明渐近稳定性,扩展至N-agent链网络。

详情
AI中文摘要

本文研究了领航者和跟随者之间的平面跟踪问题。我们提出了一种新颖的反馈速度控制律,结合恒定方位角转向策略,以保持两个智能体之间的并排编队。我们证明了当领航者的转向已知时,所提出的控制使闭环系统渐近稳定。对于跟随者无法获取领航者转向的情况,我们表明系统相对于被视为输入的领航者转向仍然是输入-状态稳定的。此外,我们证明如果领航者的转向是周期性的,跟随者将渐近收敛到具有相同周期的周期轨道。我们通过数值模拟和移动机器人实验验证了这些结果。最后,我们通过将两智能体控制律扩展到N智能体链网络,展示了所提出方法的可扩展性,并说明了其在生物和工程群体中方向信息传播的意义。

英文摘要

This paper investigates a planar tracking problem between a leader and follower agent. We propose a novel feedback speed control law, paired with a constant bearing steering strategy, to maintain an abreast formation between the two agents. We prove that the proposed control yields asymptotic stability of the closed-loop system when the steering of the leader is known. For the case when the leader's steering is unavailable to the follower, we show that the system is still input-to-state stable with respect to the leader's steering viewed as an input. Furthermore, we demonstrate that if the leader's steering is periodic, the follower will asymptotically converge to a periodic orbit with the same period. We validate these results through numerical simulations and experimental implementations on mobile robots. Finally, we demonstrate the scalability of the proposed approach by extending the two-agent control law to an N-agent chain network, illustrating its implications for directional information propagation in biological and engineered flocks.

3. 操作、抓取与灵巧手 13 篇

2606.19397 2026-06-19 cs.RO 新提交

DiffusionVS: A Generative Framework for Robust Visual Servoing Based on Diffusion Policy

DiffusionVS:基于扩散策略的鲁棒视觉伺服生成框架

Hongkang Cui, Rui He, Haoyao Chen

AI总结 提出基于扩散策略的视觉伺服方法,通过条件去噪生成相机速度,并采用在线训练增强泛化能力,仿真成功率近100%,物理实验93%。

Comments 8 pages, 4 figures, 7 tables

详情
AI中文摘要

视觉伺服是机器人操作和导航中的基础技术。基于回归的视觉伺服常因噪声敏感的单步映射和分布偏移时的误差累积而出现轨迹抖动。相比之下,扩散策略通过预测动作序列保持时间一致性,并通过隐式数据增强提高鲁棒性。本文提出一种新颖的基于扩散的伺服方法。基于扩散策略,该方法使用观测标签角点的归一化图像坐标作为输入,通过条件去噪生成相机速度。为了克服在静态数据集上训练的模型的泛化限制,采用了在线训练范式,通过交互经验收集持续扩展训练数据的多样性。该策略显著提升了模型的性能和泛化能力。全面的仿真和实际实验证明了该方法的有效性,在仿真中实现了近100%的成功率,在物理实验中达到93%。除了具体的流程,我们进一步验证了扩散机制的通用性。实验表明,现有的视觉伺服网络在与我们的扩散模块集成时,性能持续提升。这些结果表明,所提出的策略具有广泛的适用性,能够增强除本文具体架构之外的各种视觉伺服系统。

英文摘要

Visual servoing is a fundamental technique in robotic manipulation and navigation. Regression-based visual servoing frequently experiences trajectory jitter as a result of noise-sensitive single-step mappings and the accumulation of errors during distribution shifts. In contrast, Diffusion Policy maintains temporal consistency by predicting action sequences and improves robustness through implicit data augmentation. This paper presents a novel diffusion-based servoing method. Based on Diffusion Policy, the proposed approach uses normalized image coordinates of observed tag corners as input and generates camera velocity through conditional denoising. To overcome the generalization limitations of models trained on static datasets, an online training paradigm is adopted, continuously expanding the diversity of training data through interactive experience collection. This strategy substantially enhances both the performance and generalization capability of the model. Comprehensive simulations and real-world experiments demonstrate the effectiveness of the proposed method, achieving success rates of nearly 100% in simulation and 93% in physical experiments. Beyond the specific pipeline, we further validate the generality of the diffusion mechanism. Experiments show that existing visual servoing networks consistently achieve improved performance when integrated with our diffusion-based module. These results indicate that the proposed strategy possesses broad applicability and can enhance various visual servoing systems beyond the specific architecture presented here.
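The policy input described here, normalized image coordinates of tag corners, is obtained from pixel coordinates by removing the camera intrinsics. The sketch below assumes an undistorted pinhole camera; the intrinsic values and corner pixels are made up for illustration.

```python
import numpy as np

def normalize_corners(pixels, fx, fy, cx, cy):
    """Convert (N, 2) pixel coordinates to normalized image coordinates (pinhole, no distortion)."""
    pixels = np.asarray(pixels, dtype=float)
    x = (pixels[:, 0] - cx) / fx
    y = (pixels[:, 1] - cy) / fy
    return np.stack([x, y], axis=1)

# Hypothetical tag corners (pixels) and camera intrinsics.
corners_px = [[300.0, 220.0], [340.0, 220.0], [340.0, 260.0], [300.0, 260.0]]
features = normalize_corners(corners_px, fx=600.0, fy=600.0, cx=320.0, cy=240.0).ravel()
```

Four corners give an 8-dimensional feature vector that does not depend on image resolution or focal length, so a policy conditioned on it is less tied to a particular camera.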

2606.19586 2026-06-19 cs.RO 新提交

One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies

一个演示胜过千条轨迹:用于视觉运动策略的动作-视角增强

Chuer Pan, Litian Liang, Dominik Bauer, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Shuran Song

发表机构 * Stanford University(斯坦福大学) Columbia University(哥伦比亚大学) Toyota Research Institute(丰田研究所)

AI总结 提出一种数据增强框架,通过高斯泼溅和轨迹优化生成逼真的鱼眼图像序列和物理可行的动作轨迹,提升操作策略在场景变化和障碍物下的成功率。

Comments Project website: https://chuerpan.com/1001-demos.github.io/. Published at CoRL 2025

Journal ref Proceedings of The 9th Conference on Robot Learning, PMLR 305:3902-3914, 2025

详情
AI中文摘要

用于操作的视觉运动策略在建模复杂机器人行为方面展现出显著潜力,但机器人初始配置的微小变化和未见障碍物容易导致分布外观测。在没有大量数据收集工作的情况下,这些会导致灾难性的执行失败。在这项工作中,我们引入了一个有效的数据增强框架,该框架从真实世界的眼在手演示中生成视觉上逼真的鱼眼图像序列和相应的物理上可行的动作轨迹,这些演示使用带有单个鱼眼摄像头的便携式平行夹爪捕获。我们引入了一种新颖的高斯泼溅公式,适用于广角鱼眼摄像头,以重建和编辑带有未见物体的3D场景。我们利用轨迹优化生成平滑、无碰撞、视图渲染友好的动作轨迹,并从相应新视角渲染视觉观测。在仿真和现实世界中的综合实验表明,我们的增强框架提高了各种操作任务在相同场景和需要避障的增强场景中的成功率。

英文摘要

Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot's initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.

2606.19897 2026-06-19 cs.RO 新提交

One-to-Two Acting: A Novel Framework for Single-arm Agent Action Expansion to Dual Arms

一对二执行:一种面向单臂智能体动作扩展至双臂的新框架

Youbin Yao, Nieqin Cao, Mingyan Li, Yan Ding, Fuqiang Gu, Chao Chen

发表机构 * Chongqing University(重庆大学) Xi’an Jiaotong-Liverpool University(西交利物浦大学) Lumos Robotics

AI总结 提出ExS2D层次化动作扩展框架,利用单臂监督实现双臂操作,通过时间优先关系提取、子任务引导动作映射和碰撞避免协调规划,在仿真中减少54.4%执行步骤并保持成功率。

Comments 6 pages, 5 figures, 3 tables

详情
AI中文摘要

双臂操作可以通过并行执行提高吞吐量,但收集双臂演示进行训练成本高且困难。我们提出ExS2D,一种层次化动作扩展框架,能够从单臂监督实现双臂操作。ExS2D首先从文本指令生成结构化子任务,同时显式捕获时间优先关系。然后通过观察中的子任务引导动作映射,将每个子任务落地为可执行动作。最后,由多模态大语言模型驱动的协调器执行考虑优先关系的动作分配和同步规划,以选择无碰撞的双臂执行。仿真实验表明,ExS2D在保持与单臂基线相当的成功率的同时,平均执行步骤减少了54.4%。在四个任务上的真实机器人实验进一步证明了ExS2D在少量单臂样本下进行双臂执行的可靠性,且未使用任何双臂演示。

英文摘要

Dual-arm manipulation can improve throughput via parallel execution, but collecting bimanual demonstrations for training is costly and difficult. We present ExS2D, a hierarchical action expansion framework that enables dual-arm manipulation from single-arm supervision. ExS2D first generates structured subtasks from textual instructions while explicitly capturing temporal precedence. It then grounds each subtask into executable actions through subtask-guided action mapping in observation. Finally, precedence-aware action allocation and synchronized planning are performed by a multimodal large language model driven coordinator to select collision-free dual-arm executions. Simulation experiments demonstrate that ExS2D reduces the average execution steps by 54.4% while maintaining a comparable success rate to a single-arm baseline. Real-robot experiments on four tasks further demonstrate the reliability of ExS2D for dual-arm execution under few-shot single-arm samples, while using zero bimanual demonstrations.
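The benefit of respecting temporal precedence while running subtasks in parallel can be seen in a plain greedy list scheduler. This is a swapped-in classical technique, not the ExS2D coordinator: the paper uses a multimodal large language model and selects collision-free executions, neither of which is modeled here. Task names, durations, and the function name are hypothetical.

```python
def schedule_two_arms(durations, precedence):
    """Greedy list scheduling; durations[task] = steps, precedence = [(before, after), ...]."""
    preds = {task: set() for task in durations}
    for before, after in precedence:
        preds[after].add(before)
    finish = {}
    arm_free = {"left": 0, "right": 0}
    plan = []
    remaining = set(durations)

    def earliest_start(task):
        # A task may start once all of its predecessors have finished.
        return max((finish[p] for p in preds[task]), default=0)

    while remaining:
        ready = sorted(t for t in remaining if preds[t].issubset(finish))
        if not ready:
            raise ValueError("precedence relation contains a cycle")
        task = min(ready, key=earliest_start)
        arm = min(arm_free, key=arm_free.get)   # the arm that becomes idle first
        start = max(arm_free[arm], earliest_start(task))
        finish[task] = start + durations[task]
        arm_free[arm] = finish[task]
        plan.append((task, arm, start, finish[task]))
        remaining.remove(task)
    return plan, max(finish.values())

# Hypothetical subtasks extracted from an instruction.
durations = {"open_drawer": 4, "pick_block": 3, "place_block_in_drawer": 2, "pick_cup": 3}
precedence = [("open_drawer", "place_block_in_drawer"), ("pick_block", "place_block_in_drawer")]
plan, makespan = schedule_two_arms(durations, precedence)
serial_steps = sum(durations.values())
```

Comparing `makespan` with `serial_steps` shows how much a second arm can shorten execution when the precedence graph leaves independent subtasks available.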

2606.20193 2026-06-19 cs.RO 新提交

Belt-Finger: An Affordable Soft Belt-Driven Gripper for Dexterous In-Hand Manipulation

Belt-Finger: 一种经济实惠的软带驱动夹爪,用于灵巧的手内操作

Boya Zhang, Andreas Zell, Georg Martius

发表机构 * University of Tübingen(图宾根大学) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所)

AI总结 提出一种双软带手指模块,为平行夹爪增加三个手内自由度(平移、俯仰、滚动),在保持低成本、易集成的同时提升灵巧操作能力,并通过MPC和遥操作验证其有效性。

详情
AI中文摘要

平行夹爪是机器人中默认的操纵器选择,因为它们简单、坚固且廉价。然而,其有限的手内移动性常常迫使大幅度的臂部运动,并限制了在狭窄工作空间中的灵巧操作。我们提出了一种平行夹爪的升级方案:一种基于双软带的指模块,在保留标准开合功能的同时增加了三个手内自由度(DoF):平移、俯仰和滚动。该机制故意保持简单,并设计为经济制造和直接集成,保留了传统平行夹爪的可靠性和精确控制,同时大大拓宽了操作能力的范围。为了展示新增自由度的实用性,我们将该夹爪集成到两个控制流程中。首先,我们调整了一个模型预测控制器,用于已知物体的手内操作。其次,我们引入了一个轻量级遥操作接口,能够以最少的硬件同时控制机器人臂和夹爪(总共10个自由度)。通过遥操作、MPC和训练策略执行的一系列具有挑战性的操作任务,与传统的平行夹爪相比,所提出的夹爪在灵巧性和任务可行性上持续改进。

英文摘要

Parallel-jaw grippers are the default manipulator choice in robotics because they are simple, robust, and inexpensive. Their limited in-hand mobility, however, often forces large arm motions and restricts dexterous manipulation in confined workspaces. We present a parallel-gripper upgrade: a double-soft-belt-based finger module that preserves standard opening/closing while adding three in-hand degrees of freedom (DoF): translation, pitch, and roll. The mechanism is deliberately kept simple and engineered for inexpensive manufacturing and straightforward integration, preserving the reliability and precise control of traditional parallel grippers while greatly broadening the range of manipulation capabilities. To demonstrate the utility of the added DoFs, we integrate the gripper in two control pipelines. First, we adapt a model predictive controller for in-hand manipulation of known objects. Second, we introduce a lightweight teleoperation interface that enables simultaneous control of the robot arm and gripper (10 DoFs total) with minimal hardware. Across a suite of challenging manipulation tasks executed via teleoperation, MPC, and trained policies, the proposed gripper consistently improves dexterity and task feasibility compared to a conventional parallel gripper.

2606.20285 2026-06-19 cs.RO 新提交

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

Co-VLA:面向双臂视觉-语言-动作系统的协调感知结构化动作建模

Yandong Wang, Jiaqian Yu, Xiongfeng Peng, Lu Xu, Yamin Mao, Weiming Li, Jaewook Yoo, Dongwook Lee, Daehyun Ji, Mingbo Zhao, Chao Zhang

发表机构 * Donghua University(东华大学) Samsung R&D Institute China-Beijing (SRCB)(三星中国北京研究院) Samsung AI Center, DS Division(三星DS部门AI中心)

AI总结 针对双臂紧耦合任务中隐式协调不足的问题,提出Co-VLA框架,通过结构化动作专家和潜在感知控制器显式引入协调先验,在仿真和真实场景中显著提升成功率和效率。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在单臂和双臂机器人操作中展现出强大能力。先前研究表明,通过端到端学习,利用大型视觉-语言骨干网络和连续动作预测,可以涌现出协调的双臂行为。然而,随着双臂任务变得紧密耦合且执行约束变得关键,仅靠隐式协调不足以确保可靠、可解释且稳定的行为。在这项工作中,我们提出了Co-VLA,一个协调感知的双臂操作框架,将显式结构先验引入VLA模型。我们在一个最先进的视觉-语言骨干网络上实例化我们的方法,用专为双臂协调设计的结构化动作专家(SAE)替换其单一动作头。具体来说,我们在动作生成层面引入显式结构,采用模块化的协调感知损失,根据任务特定结构塑造共享和残差潜在变量。共享潜在变量编码任务级协调意图,而残差潜在变量捕获每个手臂的执行调整。在部署时,潜在感知控制器(LAC)解释学习到的表示,以实时调节同步强度、执行不对称性、平滑性和安全约束。LAC在关节命令级别运行,并与标准控制流水线兼容,无需力或阻抗控制。在仿真和真实世界基准上的实验表明,Co-VLA显著优于单一基线,在紧协调任务中成功率达到27%的提升,在OOD真实世界场景中性能翻倍(从13%提升至27%),并将任务完成时间减少高达25%。

英文摘要

Vision-language-action (VLA) models show strong capabilities in single and dual-arm robotic manipulation. Prior works show coordinated bimanual behaviors can emerge from end-to-end learning, leveraging large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework introducing explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.

2606.20549 2026-06-19 cs.RO 新提交

Generating Robot Hands from Human Demonstrations

从人类演示生成机器人手

Sha Yi, Nicklas Hansen, Xueqian Bai, Carmelo Sferrazza, Michael T. Tolley, Xiaolong Wang

发表机构 * University of California San Diego(加州大学圣迭戈分校) Amazon Frontier AI & Robotics(亚马逊前沿人工智能与机器人)

AI总结 提出数据驱动框架,利用人类日常操作中超过400万帧指尖运动数据,通过逆运动学匹配指尖位置,优化树状结构机器人手的设计,生成通用6自由度手和低自由度任务专用手,并训练强化学习智能体加速设计搜索。

详情
AI中文摘要

机器人学习在控制学习方面取得了快速进展,但学习机器人的物理身体仍然困难得多,因为同时搜索设计和控制会产生一个非常大的组合问题。在这里,我们提出了一个数据驱动的框架,用于从人类演示生成机器人手。我们不是为每个候选设计学习一个复杂的控制器,而是使用制造后使用的相同简单控制策略来生成机器人手设计:通过逆运动学匹配指尖位置。利用来自日常操作的超过400万帧人类指尖运动数据,我们的算法优化树状结构机器人手以再现所需的目标运动。该框架产生了一个6自由度(DoF)通用手和具有空间四杆仿生关节的低自由度任务专用手。为了加速设计搜索,我们训练了一个强化学习(RL)智能体来提出好的手设计和关节角度,将搜索时间从数小时减少到数分钟。我们直接将机制制作为具有打印就绪关节的一体式铰接结构。在真实世界实验中,6自由度手实现了高度精确的遥操作指尖跟踪,优于现有的商用机器人手,而专门的3自由度手以降低的机械复杂性再现了结构化的人类和合成轨迹。这些结果表明,大规模人类运动数据不仅可以用于训练机器人控制器,还可以作为优化和生成机器人物理实体的参考。

英文摘要

Robot learning has advanced rapidly in learning control, but learning the physical body of a robot remains much more difficult because jointly searching over design and control creates a very large combinatorial problem. Here, we present a data-driven framework for generating robot hands from human demonstrations. Instead of learning a complex controller together with each candidate design, we generate robot hand designs using the same simple control policy used after fabrication: matching fingertip positions through inverse kinematics. Using more than 4 million frames of human fingertip motion from everyday manipulation, our algorithm optimizes tree-structured robot hands to reproduce desired target motions. The framework produced both a 6-degree-of-freedom (DoF) general-purpose hand and lower-DoF task-specific hands with spatial four-bar mimic joints. To accelerate the search over designs, we trained a reinforcement-learning (RL) actor to propose good hand designs and joint angles, reducing search time from hours to minutes. We fabricated the mechanisms directly as one-piece articulated structures with print-in-place joints. In real-world experiments, the 6-DoF hand achieved highly accurate teleoperated fingertip tracking better than available commercial robot hands, whereas the specialized 3-DoF hands reproduced structured human and synthetic trajectories with reduced mechanical complexity. These results showed that large-scale human motion data can be used not only to train robot controllers but also as a reference for optimizing and generating the physical embodiment of robots.
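Matching fingertip positions through inverse kinematics, the control policy this abstract relies on, can be sketched with damped least squares on a toy planar finger. The paper's hands are tree-structured spatial mechanisms; the three-link planar chain, link lengths, damping, and step size below are assumptions for illustration only.

```python
import numpy as np

def fingertip(q, lengths):
    """Planar forward kinematics: fingertip (x, y) of a serial chain with joint angles q."""
    angles = np.cumsum(q)
    return np.array([np.sum(lengths * np.cos(angles)), np.sum(lengths * np.sin(angles))])

def jacobian(q, lengths):
    """2 x n Jacobian of the fingertip position with respect to the joint angles."""
    angles = np.cumsum(q)
    J = np.zeros((2, len(q)))
    for i in range(len(q)):
        J[0, i] = -np.sum(lengths[i:] * np.sin(angles[i:]))
        J[1, i] = np.sum(lengths[i:] * np.cos(angles[i:]))
    return J

def solve_ik(target, q0, lengths, damping=2e-2, step=0.5, iters=200):
    """Damped least-squares IK: repeatedly move the joints to shrink the fingertip error."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = target - fingertip(q, lengths)
        J = jacobian(q, lengths)
        dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
        q += step * dq
    return q

lengths = np.array([0.05, 0.03, 0.02])   # hypothetical link lengths in metres
target = np.array([0.06, 0.04])          # hypothetical fingertip target in metres
q = solve_ik(target, q0=[0.3, 0.3, 0.3], lengths=lengths)
error = np.linalg.norm(fingertip(q, lengths) - target)
```

In a design search, the residual `error` over many recorded fingertip targets can serve as the score of a candidate hand, because the same IK controller is used both during optimization and after fabrication.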

2606.17054 2026-06-19 cs.RO cs.AI cs.CV cs.LG new submission

Human Universal Grasping

人类通用抓取

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

Affiliations * New York University; Tsinghua University; University of Michigan

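AI summary Proposes HUG, a flow-matching model trained on human grasp data (the 1M-HUGs dataset) that generates diverse grasp poses from a single RGB-D image and retargets them to robot hands for zero-shot grasping, outperforming baselines by 23% and 34% on the HUG-Bench test objects.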

Comments 28 pages, 20 figures, 7 tables

Details
AI Chinese abstract

人类可以轻松抓取物体,而多指机器人远未达到这种通用性。我们认为机器人抓取数据最自然的来源是人类,他们每天拿起数千个物体。我们提出HUG,一个流匹配模型,能够为任何用户指定的物体(从立体相机捕获的单张RGB-D图像中)生成多样化的人类抓取。使用智能眼镜,我们首先收集了1M-HUGs,一个自我中心的人类抓取数据集,涵盖100万帧(27.8小时)和41栋建筑中的6,707个物体实例。接下来,为了建模自然人类抓取的分布,我们的新型流匹配模型融合RGB和深度观测,输出由手腕平移、手腕旋转和MANO手姿态参数化的抓取。预测的抓取可以重定向到各种机器人手,实现在日常场景中的零样本抓取。为了标准化评估,我们构建了一个新的模拟基准HUG-Bench,包含来自五个几何类别和不同尺寸的90个未见物体,并带有公制尺度的3D网格。我们在真实世界中评估HUG,使用HUG-Bench的30个物体测试集,跨越多个立体相机、机器人实体和家庭环境。HUG在我们具有挑战性的物体集上比最先进的抓取基线高出23%和34%。代码、数据、基准、检查点和交互式演示已在我们的网站上发布:https://grasping.io/

English abstract

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/
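
As a rough illustration of how a flow-matching sampler turns noise into a sample, the sketch below integrates a velocity field from t = 0 to t = 1 with Euler steps. It is not HUG's model: `toy_velocity` is a hypothetical closed-form stand-in for the learned network, and the 3-vector target is a made-up wrist translation. Because this toy field points at a single target, every sample collapses onto it, whereas a learned field conditioned on RGB-D input would produce diverse grasps.

```python
import random


def toy_velocity(x, t, target):
    """Velocity of a straight-line path that ends at `target` when t = 1.

    Hypothetical stand-in for a learned network v(x, t | observation).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_flow(target, steps=50, seed=0):
    """Integrate dx/dt = v(x, t) from noise (t = 0) toward t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumed example target: a wrist translation in meters (illustrative only).
wrist_translation = sample_flow([0.30, -0.10, 0.45])
```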

2504.15535 2026-06-19 cs.RO updated version

VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation

VibeCheck: 使用主动声学触觉传感进行接触丰富的操作

Kaidi Zhang, Do-Gon Kim, Eric T. Chang, Hua-Hsuan Liang, Zhanpeng He, Kathryn Lampo, Philippe Wu, Ioannis Kymissis, Matei Ciocarlie

Affiliations * Dept. of Mechanical Engineering; Dept. of Computer Science; Dept. of Electrical Engineering; Columbia University

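AI summary Builds an active acoustic sensing gripper with two piezoelectric fingers that sends acoustic vibrations through a grasped object to sense its acoustic properties and contact state. The system is used for object classification, grasp position estimation, pose estimation of internal structures, and extrinsic contact type classification, and the contact classifier supports a robust peg-insertion policy.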

Comments Published at IROS 2025. 8 pages, 7 figures

Details
AI Chinese abstract

物体的声学响应可以揭示其全局状态,例如材料属性或与外界的外部接触。在这项工作中,我们构建了一个主动声学传感夹爪,配备两个压电手指:一个用于生成信号,另一个用于接收信号。通过将一个手指的声学振动通过物体传递到另一个手指,我们能够洞察物体的声学特性和接触状态。我们使用该系统进行物体分类、估计抓取位置、估计内部结构的姿态,以及分类物体与环境的外部接触类型。利用我们的接触类型分类模型,我们解决了一个标准的长时域操作问题:插销插入。我们基于传感器的性能使用一个简单的模拟转移模型来训练一个模仿学习策略,该策略对分类器的不完美预测具有鲁棒性。最后,我们在UR5机器人上演示了该策略,仅使用主动声学传感作为反馈。视频可在此 https URL 找到。

English abstract

The acoustic response of an object can reveal a lot about its global state, for example its material properties or the extrinsic contacts it is making with the world. In this work, we build an active acoustic sensing gripper equipped with two piezoelectric fingers: one for generating signals, the other for receiving them. By sending an acoustic vibration from one finger to the other through an object, we gain insight into an object's acoustic properties and contact state. We use this system to classify objects, estimate grasping position, estimate poses of internal structures, and classify the types of extrinsic contacts an object is making with the environment. Using our contact type classification model, we tackle a standard long-horizon manipulation problem: peg insertion. We use a simple simulated transition model based on the performance of our sensor to train an imitation learning policy that is robust to imperfect predictions from the classifier. We finally demonstrate the policy on a UR5 robot with active acoustic sensing as the only feedback. Videos can be found at https://roamlab.github.io/vibecheck.

2508.02604 2026-06-19 cs.RO cs.SY eess.SY updated version

Periodic robust robotic rock chop via virtual model control

基于虚拟模型控制的周期性鲁棒机器人砍切

Yi Zhang, Fumiya Iida, Fulvio Forni

Affiliations * University of Cambridge; University of Tokyo

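AI summary Proposes a physically structured virtual-model controller that switches between virtual mechanisms to generate a robust periodic rock-chop motion without pre-planned trajectories, achieving sub-millimeter slice accuracy on several vegetables with a Franka manipulator.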

Details
AI Chinese abstract

机器人切割是一项具有挑战性的、接触丰富的操作任务,机器人必须同时协商未知的物体力学、大接触力和精确的运动要求。我们的假设是,这种复杂性可以通过设计一个物理结构化的虚拟模型控制器来缓解,该控制器使用切换虚拟机构生成鲁棒的、有节奏的岩石砍切运动,无需预先规划的轨迹或精确的环境信息。运动是由环境、机器人动力学和切换虚拟机构的虚拟力之间的相互作用产生的,最终通过可用的驱动实现。通过理论分析和实验验证,我们证明了受控的机器人行为会稳定到周期性的运动。使用Franka机械臂进行的实验表明,在五种不同的蔬菜上实现了鲁棒的切割,对于1毫米到6毫米的厚度,以每秒近一次切割的速度实现了亚毫米级的切片精度。尽管刀的形状或砧板的高度发生变化,控制器仍保持高性能,并成功适应了不同的人形机械臂,展示了鲁棒性和平台独立性。

English abstract

Robotic cutting is a challenging, contact-rich manipulation task where the robot must simultaneously negotiate unknown object mechanics, large contact forces, and precise motion requirements. Our hypothesis is that this complexity can be alleviated through the design of a physically structured virtual-model controller that uses switched virtual mechanisms to generate a robust, rhythmic rock-chop motion for robotic cutting, without requiring pre-planned trajectories or precise environmental information. Motion is generated by the interaction between the environment, the robot's dynamics, and the virtual forces of the switching virtual mechanism, ultimately realized through the available actuation. Through theoretical analysis and experimental validation, we demonstrate that the controlled robot behavior settles into a stable periodic motion. Experiments with a Franka manipulator demonstrate robust cuts across five different vegetables, achieving sub-millimeter slice accuracy for thicknesses from 1 mm to 6 mm at a rate of nearly one cut per second. The controller maintains high performance despite changes in knife shape or cutting board height, and successfully adapts to a different humanoid manipulator, demonstrating robustness and platform independence.

2509.00271 2026-06-19 cs.RO updated version

Learn from What We HAVE: History-Aware VErifier that Reasons about Past Interactions Online

从我们所拥有的学习:在线推理过去交互的历史感知验证器

Yishu Li, Xinyi Mao, Ying Yuan, Kyutae Sim, Ben Eisner, David Held

Affiliations * Robotics Institute, Carnegie Mellon University; Computer Science and Technology, Tsinghua University

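AI summary Proposes HAVE, a history-aware verifier that decouples action generation from verification and uses past interactions to resolve ambiguity online. A theoretical analysis shows that it improves expected action quality, and experiments in multiple simulated and real-world environments confirm its effectiveness.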

Comments CoRL 2025

Details
AI Chinese abstract

我们引入了一种新颖的历史感知验证器(HAVE),通过利用过去的交互来在线消除不确定场景中的歧义。机器人经常遇到视觉上模糊的物体,这些物体的操作结果直到物理交互之前都是不确定的。虽然仅凭生成模型理论上可以适应这种模糊性,但在实践中,即使在以动作历史为条件的情况下,它们在模糊情况下也会获得次优性能。为了解决这个问题,我们提出明确地将动作生成与验证解耦:我们使用无条件的基于扩散的生成器来提出多个候选动作,并采用我们的历史感知验证器通过推理过去的交互来选择最有希望的动作。通过理论分析,我们证明了使用验证器显著提高了期望动作质量。在多个模拟和真实环境(包括铰接物体、多模态门和不均匀物体拾取)中的实证评估和分析证实了我们方法的有效性以及对基线的改进。我们的项目网站位于:this https URL

English abstract

We introduce a novel History-Aware VErifier (HAVE) to disambiguate uncertain scenarios online by leveraging past interactions. Robots frequently encounter visually ambiguous objects whose manipulation outcomes remain uncertain until physically interacted with. While generative models alone could theoretically adapt to such ambiguity, in practice they obtain suboptimal performance in ambiguous cases, even when conditioned on action history. To address this, we propose explicitly decoupling action generation from verification: we use an unconditional diffusion-based generator to propose multiple candidate actions and employ our history-aware verifier to select the most promising action by reasoning about past interactions. Through theoretical analysis, we demonstrate that employing a verifier significantly improves expected action quality. Empirical evaluations and analysis across multiple simulated and real-world environments including articulated objects, multi-modal doors, and uneven object pick-up confirm the effectiveness of our method and improvements over baselines. Our project website is available at: https://liy1shu.github.io/HAVE_CoRL25/

2603.04531 2026-06-19 cs.RO updated version

PTLD: Sim-to-real Privileged Tactile Latent Distillation for Dexterous Manipulation

PTLD: 从仿真到现实的触觉潜在知识蒸馏用于灵巧操作

Rosy Chen, Mustafa Mukadam, Michael Kaess, Tingfan Wu, Francois R Hogan, Jitendra Malik, Akash Sharma

Affiliations * Carnegie Mellon University; University of Washington; FAIR at Meta; UC Berkeley

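AI summary Proposes PTLD, which distills a robust state estimator from real-world tactile policy data to sidestep the difficulty of simulating tactile sensing, improving on proprioception-only policies by 182% and 57% on dexterous manipulation tasks.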

Details
AI Chinese abstract

触觉灵巧操作对于自动化复杂家务任务至关重要,但学习有效控制策略仍然是一个挑战。虽然最近的工作依赖于模仿学习,但通过机器人遥操作或动觉教学获取多指手的高质量演示是困难的。另一种方法是,通过强化学习我们可以在仿真中学习技能,但快速且真实的触觉观测仿真具有挑战性。为了弥合这一差距,我们引入了PTLD:从仿真到现实的触觉潜在知识蒸馏,这是一种无需触觉仿真即可学习触觉操作技能的新方法。我们的关键思想不是模拟触觉传感器或纯粹依赖本体感策略进行零样本从仿真到现实的迁移,而是利用现实世界中的特权传感器收集真实的触觉策略数据。然后,这些数据用于蒸馏一个鲁棒的状态估计器,该估计器基于触觉输入运行。我们的实验表明,PTLD可以通过结合触觉感知显著改善在仿真中训练的本体感操作策略。在基准的掌内旋转任务中,PTLD相比纯本体感策略实现了182%的提升。我们还展示了PTLD能够学习具有挑战性的触觉掌内重定向任务,在该任务中,我们观察到达到的目标数量相比仅使用本体感提高了57%。网站:此 https URL。

English abstract

Tactile dexterous manipulation is essential to automating complex household tasks, yet learning effective control policies remains a challenge. While recent work has relied on imitation learning, obtaining high-quality demonstrations for multi-fingered hands via robot teleoperation or kinesthetic teaching is prohibitive. Alternatively, with reinforcement learning we can learn skills in simulation, but fast and realistic simulation of tactile observations is challenging. To bridge this gap, we introduce PTLD: sim-to-real Privileged Tactile Latent Distillation, a novel approach to learning tactile manipulation skills without requiring tactile simulation. Instead of simulating tactile sensors or relying purely on proprioceptive policies to transfer zero-shot sim-to-real, our key idea is to leverage privileged sensors in the real world to collect real-world tactile policy data. This data is then used to distill a robust state estimator that operates on tactile input. We demonstrate from our experiments that PTLD can be used to improve proprioceptive manipulation policies trained in simulation significantly by incorporating tactile sensing. On the benchmark in-hand rotation task, PTLD achieves a 182% improvement over a proprioception-only policy. We also show that PTLD enables learning the challenging task of tactile in-hand reorientation where we see a 57% improvement in the number of goals reached over using proprioception alone. Website: https://akashsharma02.github.io/ptld-website/.

2606.15516 2026-06-19 cs.RO updated version

Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands

传递接触,而不仅仅是运动:跨灵巧手的柔顺抓取

Soofiyan Atar, Yao-Ting Huang, Michael Yip

Affiliations * University of California San Diego

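AI summary Proposes a cross-embodiment force-position interface that uses calibrated joint torques and fingertip forces for contact-aware grasping across heterogeneous dexterous hands, combined with a flow-matching visuomotor policy and a hybrid force-position controller to achieve transferable compliant grasping.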

Comments Website (overview): transferring-contact-not-just-motion.github.io/

Details
AI Chinese abstract

灵巧抓取依赖于接触调节,而不仅仅是运动。稳定操作要求手指在接触滑动、变形或视觉遮挡时保持适当的物体负载。现有的跨本体灵巧策略通过重定向手部姿态或潜在动作统一运动,但力反馈仍与每只手的感觉和驱动绑定,限制了迁移。本文引入了一种跨本体力-位置接口,用于异构灵巧手之间的接触感知操作。运动意图在共享的手部姿态潜在空间中表示,而每只手的力信号通过系统辨识校准为物理关节扭矩(单位N.m)。这些扭矩被映射为指尖力和紧凑的每指负载描述符,使策略获得关于手部应移动到哪里以及物体如何加载的可比观测。利用该接口,训练了一个流匹配视觉运动策略,输入视觉、本体感觉和校准后的接触,并采用结构化视觉掩码,在抓取相关遮挡下鼓励依赖力。相同的校准信号驱动混合力-位置控制器进行演示采集和执行,保持训练和部署中的力目标一致。在结构不同的手上进行的实验表明,校准的接触反馈实现了可迁移的柔顺抓取,学习到的基元可在长时程操作流程中重复使用。

English abstract

Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires fingers to maintain appropriate object loading as contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, limiting transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent, while each hand's effort signal is calibrated through system identification into physical joint torque in N·m. These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent across training and deployment. Experiments across structurally different hands show that calibrated contact feedback enables transferable compliant grasping, with learned primitives reusable in long-horizon manipulation pipelines.

2606.18960 2026-06-19 cs.CV cs.RO updated version

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World:用于持久机器人操作的内存增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, jun shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

Affiliations * Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

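AI summary Proposes Mem-World, whose 4D wrist-view-centered surfel-indexed memory (W-VMem) addresses scene forgetting caused by occlusion and camera motion during manipulation, enabling persistent world modeling and improving both policy evaluation and policy improvement.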

Details
AI Chinese abstract

动作条件世界模型已成为机器人学习的一种有前景的范式,通过生成动作一致的视频推演,为昂贵的真实世界实验提供了可扩展的替代方案。然而,在操作中持久世界建模仍然具有挑战性:频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图,导致模型遗忘或幻觉先前帧中看到的场景细节。现有的内存检索策略在动态操作场景中往往无法识别信息丰富的历史。为解决这一限制,我们提出了Mem-World,一种内存增强的多视图动作条件世界模型。其核心是W-VMem,一种4D腕部视图为中心的曲面元索引内存,将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素被观测的时间和位置,W-VMem能够根据未来动作实现几何感知的相关历史帧检索。在生成过程中,通过基于曲面元的渲染和评分选择相关历史帧,为预测提供信息丰富且非冗余的上下文。大量实验表明,Mem-World在复杂操作场景中生成持久推演,比Ctrl-World实现更可靠的策略评估,将皮尔逊相关系数提高14.5%,并通过合成数据生成支持有效的策略改进,在长时域任务中将成功率从58%提升到72%。

English abstract

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
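
The policy-evaluation claim above rests on the Pearson correlation between success rates measured in world-model rollouts and on the real robot. A minimal sketch of that statistic follows. The function name and the success-rate values are illustrative placeholders, not numbers from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Placeholder per-policy success rates (made up for illustration).
simulated_success = [0.20, 0.40, 0.60]
real_success = [0.30, 0.50, 0.70]
r = pearson(simulated_success, real_success)
```

A value of `r` near 1 would mean that ranking policies inside the world model agrees with ranking them on hardware.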

4. Navigation, Localization, and SLAM 22 papers

2606.19383 2026-06-19 cs.RO cs.CV new submission

3D Scene Graphs: Open Challenges and Future Directions

3D场景图:开放挑战与未来方向

Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

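AI summary A unified survey of how 3D scene graphs (3DSGs) are constructed, applied, and evaluated, analyzing existing modeling choices and open challenges with the aim of enabling robust deployment.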

Comments Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

Details
AI Chinese abstract

3D场景图(3DSG)通过将几何基础与环境的语义和关系抽象相结合,已成为空间AI的强大表示。其表现力使其与机器人和计算机视觉中的广泛问题相关,包括操作、导航、任务规划、场景理解等。然而,该领域仍然分散:不同的社区采用不同的公式、构建流程和评估协议,使得比较方法、识别共同假设以及评估鲁棒实际部署的剩余挑战变得困难。本综述提供了对3DSG的统一和批判性回顾,特别强调开放挑战和未来方向。我们首先在共同定义下形式化3DSG,并分析表征现有公式的主要建模选择,包括节点和边属性、层次结构、动态场景表示和可供性感知扩展。然后,我们回顾如何从原始感官观察构建3DSG,讨论最常见的术语、约定和技术。最后,我们检查下游应用和评估策略,从内在图质量到任务级性能。为支持社区,我们还提供了一个专用网站,组织和扩展所调查的内容,可访问此 https URL。

English abstract

3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.

2606.19555 2026-06-19 cs.RO new submission

SCAN-Planner: Spatial Collision-Aware Local Planning for Route-Guided Long-Range Quadruped Navigation

SCAN-Planner:用于路线引导的远程四足导航的空间碰撞感知局部规划

Han Zheng, Zhe Chen, Yiwen Fu, Ming Yang, Tong Qin

Affiliations * Shanghai Jiao Tong University

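AI summary Proposes SCAN-Planner, a spatial collision-aware local planning framework that uses a yaw-aware twin-cylinder footprint and a projected A* search to generate safe, smooth trajectories in dense clutter, 3D unstructured environments, and long-range navigation.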

Details
AI Chinese abstract

四足机器人越来越需要能够在狭窄通道、杂乱室内场景和大规模3D非结构化环境中导航。现有的局部规划器通常使用各向同性几何膨胀来近似机器人,或依赖于平面和高程图表示,导致在狭窄空间中的保守运动以及对悬垂结构的推理有限。本文提出了SCAN-Planner,一种用于远程四足导航的空间碰撞感知局部规划框架。使用偏航感知的双圆柱足迹来建模细长的机器人身体,通过在膨胀的3D占用地图中进行稀疏查询实现全身碰撞评估。我们进一步引入投影A*搜索,在插值的地面跟随表面上生成无碰撞引导,并通过z梯度抑制来水平避开障碍物同时保持垂直稳定性。对于大规模部署,具有边界回退的机器人中心滑动地图提供高分辨率局部碰撞检查并从局部死胡同中恢复。仿真和真实实验表明,SCAN-Planner在密集杂乱、3D非结构化场景、楼梯穿越和远程导航任务中生成安全、平滑且高效的轨迹。

English abstract

Quadruped robots are increasingly expected to navigate through narrow passages, cluttered indoor scenes, and large-scale 3D unstructured environments. Existing local planners commonly approximate the robot using isotropic geometric inflation or rely on planar and elevation-map representations, leading to conservative motion in tight spaces and limited reasoning about overhanging structures. This letter presents SCAN-Planner, a spatial collision-aware local planning framework for long-range quadruped navigation. A yaw-aware twin-cylinder footprint is used to model the elongated robot body, enabling whole-body collision evaluation through sparse queries in an inflated 3D occupancy map. We further introduce a projected A* search that generates collision-free guidance on an interpolated ground-following surface, with z-gradient suppression to avoid obstacles horizontally while maintaining vertical stability. For large-scale deployment, a robot-centric sliding map with boundary fallback provides high-resolution local collision checking and recovery from local dead ends. Simulation and real-world experiments demonstrate that SCAN-Planner generates safe, smooth, and efficient trajectories in dense clutter, 3D unstructured scenes, stair traversal, and long-range navigation tasks.
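
To make the footprint idea concrete, here is a minimal sketch of a yaw-aware twin-cylinder collision check, under assumptions that are not from the paper: the occupancy map is a set of integer voxel indices, it is inflated horizontally by the cylinder radius ahead of time, and each cylinder is then tested with a few queries along its vertical axis. All names and dimensions (`inflate`, `twin_cylinder_collides`, 0.1 m resolution, 0.3 m half spacing) are hypothetical.

```python
import math


def inflate(occupied, r_cells):
    """Grow every occupied voxel horizontally by r_cells using a disc kernel."""
    offsets = [(di, dj)
               for di in range(-r_cells, r_cells + 1)
               for dj in range(-r_cells, r_cells + 1)
               if di * di + dj * dj <= r_cells * r_cells]
    return {(i + di, j + dj, k) for (i, j, k) in occupied for di, dj in offsets}


def twin_cylinder_collides(inflated, x, y, yaw,
                           half_spacing=0.3, k_range=(1, 4), res=0.1):
    """Check the front and rear cylinder axes against the inflated map.

    Only 2 * (number of height layers) voxels are queried per pose.
    """
    for sign in (1.0, -1.0):  # front cylinder, then rear cylinder
        cx = x + sign * half_spacing * math.cos(yaw)
        cy = y + sign * half_spacing * math.sin(yaw)
        i, j = round(cx / res), round(cy / res)
        for k in range(k_range[0], k_range[1] + 1):
            if (i, j, k) in inflated:
                return True
    return False


# Toy corridor along x with walls at y = +/-0.3 m (voxel rows j = -3 and j = 3).
walls = {(i, j, k) for i in range(-10, 11) for j in (-3, 3) for k in range(6)}
inflated_map = inflate(walls, r_cells=2)  # assumed cylinder radius: 0.2 m
```

In this toy corridor the robot fits when aligned with the passage (`yaw = 0`) and collides when turned sideways, a distinction that a single isotropic inflation radius covering the whole body could not make.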

2606.19687 2026-06-19 cs.RO new submission

Route-Constrained Robust Fusion Estimation for MEMS/GNSS Integrated Navigation of Unmanned Ground Vehicles in GNSS Degraded Environments

MEMS/GNSS组合导航中无人地面车辆在GNSS退化环境下的路径约束鲁棒融合估计

Jingzhi Cui, Chao Zhang, Yuliang Mao, Shaolin Lü, Dongmei Li, Huan Che, Rong Zhang

Affiliations * State Key Laboratory of Precision Space-time Information Sensing Technology, Tsinghua University; Xiaomi Inc.

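AI summary To counter the cumulative localization drift of unmanned ground vehicles on structured roads under severe GNSS signal occlusion, proposes a robust route-constrained state estimation method. It matches the historical dead-reckoning trajectory against a high-definition map to generate pseudo-position observations, then continuously injects road-level constraints through an Extended Kalman Filter to suppress position deviation and improve azimuth estimation.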

Comments Accepted workshop paper, 1st Workshop on Robot Meets GNSS and Ranging for Seamless Autonomy, IEEE ICRA 2026

Journal ref 1st Workshop on Robot Meets GNSS and Ranging for Seamless Autonomy, IEEE ICRA 2026, Vienna, Austria, June 5, 2026

Details
AI Chinese abstract

为了解决在严重全球导航卫星系统信号遮挡下结构化道路环境中无人地面车辆的累积定位漂移问题,本文提出了一种鲁棒的路径约束状态估计方法。在无卫星信号期间,该方法建立了历史航位推算轨迹与从高精地图中提取的任务路线局部段之间的对应关系,并通过二维刚性变换估计出路线参考位置。然后将估计的位置作为伪位置观测,纳入扩展卡尔曼滤波更新中。这样,道路级的路径约束可以持续注入到统一的状态估计框架中,从而抑制相对于任务路线的位置偏差,同时间接改善方位估计。为了增强实际适用性,进一步引入了触发控制、匹配质量验证、路径偏移补偿和单次更新修正限制等工程策略。在三个代表性场景(长隧道、多段隧道和弯曲隧道)中的实验表明,所提方法有效抑制了卫星中断期间的误差累积,降低了最大偏差过大的风险,并提高了定位连续性和道路级可用性。

English abstract

To address cumulative localization drift of unmanned ground vehicles in structured road environments under severe Global Navigation Satellite System signal occlusion, this paper proposes a robust route-constrained state estimation method. During periods without satellite signals, the proposed method establishes the correspondence between the historical dead reckoning trajectory and local segments of the mission route extracted from a high-definition map, and estimates a route-referenced position via a two-dimensional rigid transformation. The estimated position is then formulated as a pseudo-position observation and incorporated into an Extended Kalman Filter update. In this way, route constraints at the road level can be continuously injected into a unified state estimation framework, thereby suppressing position deviation relative to the mission route while indirectly improving azimuth estimation. To enhance practical applicability, engineering strategies, such as trigger control, matching quality validation, route offset compensation, and single update correction limiting, are further introduced. Experiments in three representative scenarios, including a long tunnel, a multi-segment tunnel, and a curved tunnel, show that the proposed method effectively suppresses error accumulation during satellite outages, reduces the risk of large maximum deviation, and improves localization continuity and road-level usability.
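
The two core steps described above can be sketched as follows, with heavy simplifications that are assumptions rather than the paper's design: point correspondences between the dead-reckoning trajectory and the route segment are taken as given, the filter update is a linear per-axis Kalman update on position only (diagonal covariance, identity measurement model) instead of a full EKF, and correction limiting is modeled as a simple clamp. All function names are hypothetical.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    scx, scy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dcx, dcy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        px, py, qx, qy = px - scx, py - scy, qx - dcx, qy - dcy
        dot += px * qx + py * qy
        cross += px * qy - py * qx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, (dcx - (c * scx - s * scy), dcy - (s * scx + c * scy))


def apply_rigid(theta, t, p):
    """Rotate point p by theta, then translate by t."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + t[0], s * p[0] + c * p[1] + t[1])


def pseudo_position_update(pos, var, z, meas_var, max_step):
    """Per-axis Kalman update with a pseudo-position z and a clamped correction."""
    new_pos, new_var = [], []
    for x, p, zi in zip(pos, var, z):
        gain = p / (p + meas_var)
        step = max(-max_step, min(max_step, gain * (zi - x)))
        new_pos.append(x + step)
        new_var.append((1.0 - gain) * p)  # simplification: ignores the clamp
    return new_pos, new_var
```

In use, the latest dead-reckoning position would be passed through `apply_rigid` with the fitted transform to obtain the route-referenced pseudo-position `z`, which then feeds `pseudo_position_update`.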

2606.19874 2026-06-19 cs.RO cs.CV new submission

MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM

MMD-SLAM:结构增强的多元高斯分布引导视觉SLAM

Fan Zhu, Ziyu Chen, Peichen Liu, Yifan Zhao, Zhisong Xu, Hui Zhu, Hongxing Zhou, Sixun Liu, Chunmao Jiang

Affiliations * HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; Aarhus University; University of Tokyo; Beijing University of Chemical Technology; North China Electric Power University

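AI summary Proposes MMD-SLAM, which uses the Atlanta World assumption to guide a Multi-Meta Gaussian representation and improves visual SLAM tracking accuracy and mapping quality through point-line fusion, dominant-direction encoding, and a Gaussian evolution strategy.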

Comments ICRA 2026

Details
AI Chinese abstract

3D高斯泼溅(3DGS)显著提升了新视角合成和高保真场景重建,扩展了基于3DGS的视觉同步定位与建图(SLAM)方法的潜力。然而,大多数现有系统未能充分利用底层结构信息,这限制了渲染质量并常常导致地图不一致。为了解决这些限制,我们提出了MMD-SLAM,一个结构增强的视觉SLAM框架,利用亚特兰大世界(AW)假设来引导多元高斯表示以实现逼真的建图。首先,我们引入了一种点线融合策略用于位姿优化,其中3D线段被纳入以提高跟踪鲁棒性并为建图提供额外约束。其次,我们设计了一种具有主导方向的多元高斯表示,显式编码来自AW假设的结构先验。最后,我们提出了一种高斯进化策略,该策略适应场景几何并将结构线索融入全局优化。大量实验表明,这些创新使MMD-SLAM在跟踪精度和建图质量方面均达到了最先进的性能。例如,与MonoGS相比,我们的方法在ScanNet上实现了48.56%的ATE RMSE降低,在Replica上实现了5.71%的PSNR提升。

English abstract

3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, compared with MonoGS, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica.

2606.20209 2026-06-19 cs.RO cs.AI new submission

FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching

FlowMaps: 使用流匹配建模长期多模态物体动态

Francesco Argenziano, Miguel Saavedra-Ruiz, Sacha Morin, Charlie Gauthier, Daniele Nardi, Liam Paull

Affiliations * Sapienza University of Rome; Université de Montréal; Mila - Quebec AI Institute

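AI summary Proposes FlowMaps, a latent flow matching model that learns multimodal spatio-temporal distributions over object locations to predict where dynamic objects will be, improving robot navigation in changing household environments.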

Details
AI Chinese abstract

对3D场景的联合空间和时间理解是部署在日常家庭环境中的机器人的关键要求。这些智能体不仅必须理解和导航空间布局,还必须推理这些空间如何随时间演变。特别是,人类每天与物体互动,导致物体在整个环境中改变位置,使机器人难以可靠地将当前观察与先前看到的物体关联起来。然而,这些互动并非随机:人类的习惯和日常行为在物体位置上产生了时空一致的模式,机器人智能体可以学习这些模式,然后将其用于下游任务,如导航。为此,我们引入了FlowMaps,一种潜在流匹配模型,用于估计连续3D空间中动态物体未来位置的多模态分布。通过学习物体之间的隐式依赖关系及其时间演变,FlowMaps预测物体位置在人类过去互动条件下的可能变化,同时支持在具有相似物体习惯的未见环境中的泛化。为了展示该方法的实用性,我们在模拟和真实环境中将FlowMaps部署到下游的动态物体导航任务中。在超过600个回合中,FlowMaps优于最先进的方法,表明通过连续、多模态的时空分布建模物体动态可以改善机器人在变化家庭环境中的搜索和导航。代码和附加材料可在此https URL获取。

English abstract

Joint spatial and temporal understanding of 3D scenes is a crucial requirement for robots deployed in everyday household environments. Such agents must not only comprehend and navigate spatial layouts, but also reason about how these spaces evolve over time. In particular, humans interact with objects daily, causing them to change position throughout the environment and making it difficult for robots to reliably associate current observations with previously seen objects. However, these interactions are not random: human habits and routines induce spatio-temporally consistent patterns in object locations, which robotic agents can potentially learn and then exploit for downstream tasks such as navigation. To this end, we introduce FlowMaps, a latent flow matching model for estimating multimodal distributions over the future locations of dynamic objects in a continuous 3D space. By learning the implicit dependencies among objects and their temporal evolution, FlowMaps predicts likely changes in object locations conditioned on past human interactions, while supporting generalization across previously unseen environments that share similar object routines. To demonstrate the utility of this method, we deploy FlowMaps in a downstream dynamic Object Navigation task in both simulated and real-world environments. Across more than 600 episodes, FlowMaps outperforms state-of-the-art approaches, showing that modeling object dynamics through continuous, multimodal spatio-temporal distributions improves robotic search and navigation in changing household environments. Code and additional material are available at https://fra-tsuna.github.io/flowmaps/.

2606.20322 2026-06-19 cs.RO new submission

Towards 3D karst underwater scene reconstruction from rotating sonar data

基于旋转声纳数据的3D喀斯特水下场景重建

Georgios Evangelos Margaritis, Lionel Lapierre, Simon Rohou, Zhi Yan, Andreas Nüchter, François Goulette

Affiliations * U2IS, ENSTA, Institut Polytechnique de Paris; Lab-STICC, ENSTA, Institut Polytechnique de Paris; Informatics XVII – Robotics, Julius-Maximilians-Universität Würzburg

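AI summary Because sonar data is sparse and noisy and navigation drift hampers 3D reconstruction, proposes a pipeline that combines continuous-time SLAM for trajectory correction with a two-stage deep learning surface reconstruction method to produce an immersive, navigable 3D mesh.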

Comments 1st Workshop on Long-term Deployments in the Wild (LoWi)

Details
AI Chinese abstract

喀斯特含水层提供关键的淡水资源,但由于其复杂且了解不足的地下几何结构,构成重大危害。由于水下探测的声纳数据稀疏且噪声大,而导航估计存在漂移,限制了标准3D重建方法,因此绘制这些环境具有挑战性。我们提出了一种从声纳剖面仪重建水下喀斯特管道的流水线。我们将连续时间SLAM方法用于校正轨迹漂移,与一种新颖的两阶段深度学习表面重建方法相结合,生成用于水文地质分析的沉浸式可导航3D网格。

English abstract

Karst aquifers provide critical freshwater resources but pose significant hazards due to their complex and poorly understood subsurface geometry. Mapping these environments is challenging because sonar data from underwater exploration is sparse and noisy, while navigation estimates suffer from drift, limiting standard 3D reconstruction methods. We present a pipeline for reconstructing underwater karst conduits from a sonar profiler. We combine a continuous-time SLAM approach to correct trajectory drift with a novel two-stage deep learning method for surface reconstruction, producing an immersive and navigable 3D mesh for hydrogeological analysis.

2606.20424 2026-06-19 cs.RO new submission

LIT-GS: LiDAR-Inertial-Thermal Gaussian Splatting for Illumination-Robust Mapping

LIT-GS: 面向光照鲁棒建图的激光雷达-惯性-热高斯泼溅

Shikuan Shi, Chunran Zheng, Jiaming Xu, Tianyong Ye, Tao Yu, Yukang Cui

Affiliations * College of Mechatronics and Control Engineering, Shenzhen University; Department of Mechanical Engineering, The University of Hong Kong

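AI summary Proposes LIT-GS, a framework that uses LiDAR-derived plane geometry as a constraint to jointly refine poses and Gaussians, addressing the fragility of RGB-dependent pipelines under illumination changes and in texture-deficient scenes and improving geometric accuracy and rendering quality.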

Comments Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Details
AI Chinese abstract

高斯泼溅实现了实时神经渲染,但现有的激光雷达-惯性-视觉(LIV)高斯建图流程由于依赖RGB光度线索,在光照变化和纹理缺失场景下仍然脆弱。我们提出了LIT-GS,一个激光雷达-惯性-热高斯泼溅框架,将激光雷达导出的平面几何作为显式约束注入到位姿/结构优化和高斯优化中。具体来说,我们利用LIV视觉地图点作为置信度感知的跨模态锚点,建立可靠的热-激光雷达关联,并在弱热监督下将加权的激光雷达点到平面残差引入光束法平差,以联合优化相机位姿和3D点。基于优化后的结构,我们进一步引入一个激光雷达平面正则化的可微泼溅目标,约束渲染的3D点与局部观测平面对齐,从而减轻低对比度热图像中的表面增厚和结构漂移。在专有序列和公开数据集上的实验表明,LIT-GS在几何精度和渲染质量上持续优于最先进的基于LIV的高斯泼溅基线,尤其是在具有挑战性的光照条件下。

English abstract

Gaussian Splatting has enabled real-time neural rendering, yet existing LiDAR-inertial-visual (LIV) Gaussian mapping pipelines remain fragile under illumination changes and texture-deficient scenes due to their reliance on RGB photometric cues. We present LIT-GS, a LiDAR-inertial-thermal Gaussian Splatting framework that injects LiDAR-derived plane geometry as an explicit constraint in both pose/structure refinement and Gaussian optimization. Specifically, we exploit LIV visual map points as confidence-aware cross-modal anchors to establish reliable thermal-LiDAR associations, and incorporate weighted LiDAR point-to-plane residuals into bundle adjustment to jointly refine camera poses and 3D points under weak thermal supervision. Building on the refined structure, we further introduce a LiDAR-plane-regularized differentiable splatting objective that constrains rendered 3D points to align with locally observed planes, mitigating surface thickening and structural drift in low-contrast thermal imagery. Experiments on proprietary sequences and public datasets demonstrate that LIT-GS consistently improves geometric accuracy and rendering quality over state-of-the-art LIV-based Gaussian Splatting baselines, particularly in challenging lighting conditions.

2606.20458 2026-06-19 cs.RO new submission

Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation

慢速大脑,快速规划器:延迟鲁棒的VLM增强城市导航

Zhenghao "Mark" Peng, Honglin He, Quanyi Li, Yukai Ma, Bolei Zhou

Affiliations * Amazon FAR; UCLA; Independent; Zhejiang University

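AI summary To close the trajectory scoring gap in sidewalk navigation for mobile robots, proposes a training-free, latency-resilient trajectory-level fusion layer in which a VLM selects a candidate trajectory that is fused with the planner's output, reducing ADE by 30% in challenging scenarios.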

Details
AI Chinese abstract

基于学习的 sidewalk 导航规划器可以实时生成多样化的候选轨迹,但其评分函数在挑战性场景中往往无法选择最佳轨迹,即使同一集合中存在更好的候选,也会输出使移动机器人驶入草地、朝向行人或错误方向的轨迹。我们称之为轨迹评分差距:在真实世界的人行道导航中,基于锚点的规划器的最佳选择与最佳候选之间的差距很大,这可能是由于规划器的高层场景理解能力有限。我们不是用端到端的视觉-语言-动作模型替换规划器,而是提出一种VLM-规划器接口,使用VLM从规划器的候选集合中选择一个候选索引,然后将其与规划器的初始输出融合。然而,VLM每次查询需要1-3秒,因此无法直接驱动5-20Hz的控制循环。我们贡献了一种无需训练、延迟鲁棒的轨迹级融合层,通过指数衰减的几何相似性将过时的VLM选择转化为实时规划器评分。在约2000个具有挑战性的真实世界场景(例如交叉口、行人相遇)中,VLM选择相比规划器的最佳选择实现了30%的ADE降低,而规划器在常规场景中仍保持竞争力。在仿真中,Score Fusion在高达5秒的延迟下仍保持>80%的成功率。我们在移动机器人上展示了完整系统,在具有不同网络延迟的具有挑战性的校园人行道上进行导航。

英文摘要

Learning-based planners for sidewalk navigation can generate diverse candidate trajectories in real time, yet their scoring functions often fail to select the best trajectory in challenging situations, outputting trajectories that make the mobile robot drive onto grass, toward pedestrians, or in the wrong direction, even when better candidates exist in the same set. We call this the trajectory scoring gap: in real-world sidewalk navigation, the gap between an anchor-based planner's top choice and the best possible candidate is substantial, likely due to limited high-level scene understanding capability of the planner. Rather than replacing the planner with an end-to-end Vision-Language-Action model, we propose a VLM-Planner interface that uses a VLM to select a candidate index from the planner's proposal set and then fuse it with the planner's initial output. However, VLMs take 1--3s per query and so cannot directly drive a 5--20Hz control loop. We contribute a training-free, latency-resilient trajectory-level fusion layer that turns a stale VLM selection into real-time planner scoring via geometric similarity with exponential decay. On $\sim$2,000 challenging real-world scenarios (e.g., junctions, pedestrian encounters), VLM selection achieves 30% ADE reduction versus the planner's best selection, while the planner remains competitive in routine situations. In simulation, Score Fusion maintains >80% success rate with delays up to 5s. We demonstrate the full system on a mobile robot navigating challenging campus sidewalks with varied network latency.
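One way to read "geometric similarity with exponential decay" is sketched below. It is a minimal illustration under stated assumptions, not the paper's Score Fusion layer: the additive bonus, the mean point-wise distance, and the parameters `tau_s`, `sigma_m`, and `weight` are all hypothetical, and the stale VLM-selected trajectory is assumed to be already expressed in the current robot frame.

```python
import math

def trajectory_distance(a, b):
    """Mean Euclidean distance between two equal-length 2D trajectories."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def fuse_scores(candidates, planner_scores, vlm_choice, age_s,
                tau_s=2.0, sigma_m=1.0, weight=1.0):
    """Add a time-decayed similarity bonus toward the stale VLM-selected trajectory."""
    decay = math.exp(-age_s / tau_s)  # trust in the VLM pick fades as it ages
    fused = []
    for traj, score in zip(candidates, planner_scores):
        similarity = math.exp(-trajectory_distance(traj, vlm_choice) / sigma_m)
        fused.append(score + weight * decay * similarity)
    return fused

def select(candidates, planner_scores, vlm_choice, age_s):
    """Index of the candidate with the highest fused score."""
    fused = fuse_scores(candidates, planner_scores, vlm_choice, age_s)
    return max(range(len(fused)), key=fused.__getitem__)
```

With a fresh VLM answer the bonus dominates and the candidate closest to the VLM pick wins; as the answer ages, the selection falls back to the planner's own ranking, so the control loop never has to wait for the VLM.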

2606.20479 2026-06-19 cs.RO 新提交

GroundControl: Anticipating Navigation Failures in Vision-Language Agents via Trajectory-Consistent Uncertainty Estimates

GroundControl: 通过轨迹一致的不确定性估计预测视觉语言智能体中的导航失败

Nastaran Darabi, Divake Kumar, Sina Tayebati, Devashri Naik, Amit Ranjan Trivedi

发表机构 * University of Illinois at Chicago (UIC)(伊利诺伊大学芝加哥分校)

AI总结 提出轨迹一致的不确定性估计方法GroundControl,通过卡尔曼滤波建模距离变化并结合轨迹特征,有效预测导航失败,在选择性风险-覆盖评估中优于基线。

详情
AI中文摘要

视觉语言导航智能体在基准任务上取得了具有竞争力的平均成功率,但失败通常源于可预测的轨迹级问题,如振荡、停滞或低效绕路。因此,可靠部署需要能够在执行过程中预测新兴失败动态的不确定性信号,而不仅仅是反映瞬时动作熵。我们引入了\emph{GroundControl},一种轨迹一致的不确定性估计器,定义为在一个回合中聚合的、相对于标称目标导向的距离-目标动态的统计偏差。GroundControl使用恒定速度卡尔曼滤波器对距离演化进行建模,并将归一化创新统计量与补充轨迹特征(捕捉进展、单调性、路径效率和振荡行为)相结合。由此产生的不确定性分数反映了导航行为中的几何和时间不一致性,而非局部预测分散。为了独立于任务成功评估不确定性质量,我们形式化了\emph{选择性风险-覆盖导航(SRCN)}协议,该协议通过风险-覆盖曲线和AURC/E-AURC摘要,衡量不确定性分数按失败或低效对回合进行排序的有效性。在五个EB-Navigation分割($N=300$个回合)上,基于成功的选择性风险下,轨迹一致的不确定性实现了接近神谕的排序,GPT-4o模型的加权平均$\mathrm{E\text{-}AURC}_{\mathrm{SR}}=0.0024$,显著优于熵、共形和启发式基线。在基于SPL的选择性评估下,GroundControl在模型和导航分割上始终实现最低的AURC和E-AURC。这些结果表明,对目标导向动态的偏离进行建模,为预测视觉语言智能体中的导航失败提供了可解释且鲁棒的信号。

英文摘要

Vision-language navigation agents achieve competitive average success on benchmark tasks, yet failures often arise through predictable trajectory-level breakdowns such as oscillation, stagnation, or inefficient detours. Reliable deployment, therefore, requires uncertainty signals that anticipate emerging failure dynamics during execution rather than reflect only instantaneous action entropy. We introduce \emph{GroundControl}, a trajectory-consistent uncertainty estimator defined as statistical deviation from nominal goal-directed distance-to-goal dynamics aggregated over an episode. GroundControl models distance evolution using a constant-velocity Kalman filter and combines normalized innovation statistics with complementary trajectory features capturing progress, monotonicity, path efficiency, and oscillatory behavior. The resulting uncertainty score reflects geometric and temporal inconsistency in navigation behavior rather than local prediction dispersion. To evaluate uncertainty quality independently of task success, we formalize \emph{Selective Risk--Coverage Navigation (SRCN)}, a protocol that measures how effectively an uncertainty score ranks episodes by failure or inefficiency using risk--coverage curves and AURC / E-AURC summaries. Across five EB-Navigation splits ($N=300$ episodes), trajectory-consistent uncertainty achieves near-oracle ordering under success-based selective risk, with weighted-average $\mathrm{E\text{-}AURC}_{\mathrm{SR}}=0.0024$ for the GPT-4o model, substantially outperforming entropy-, conformal-, and heuristic baselines. Under SPL-based selective evaluation, GroundControl consistently achieves the lowest AURC and E-AURC across models and navigation splits. These results show that modeling deviation from goal-directed dynamics provides an interpretable and robust signal for anticipating navigation failures in vision-language agents.
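The constant-velocity Kalman filter over distance-to-goal can be sketched as follows. This is an illustrative reading of the abstract, not the GroundControl code: the state layout, the noise values `q` and `r`, the initial covariance, and the use of the mean normalized innovation squared (NIS) as the episode score are assumptions, and the complementary trajectory features are omitted.

```python
def nis_sequence(distances, dt=1.0, q=0.01, r=0.05):
    """Return the normalized innovation squared (NIS) for each new distance reading."""
    d, v = distances[0], 0.0                  # state: distance-to-goal and its rate
    p00, p01, p10, p11 = 1.0, 0.0, 0.0, 1.0   # state covariance P (assumed prior)
    out = []
    for z in distances[1:]:
        # Predict with the constant-velocity model: d <- d + dt * v, v <- v.
        d = d + dt * v
        n00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
        n01 = p01 + dt * p11
        n10 = p10 + dt * p11
        n11 = p11 + q
        # Innovation and its variance for the measurement z = d + noise.
        y = z - d
        s = n00 + r
        out.append(y * y / s)
        # Kalman update of state and covariance.
        k0, k1 = n00 / s, n10 / s
        d, v = d + k0 * y, v + k1 * y
        p00, p01 = (1.0 - k0) * n00, (1.0 - k0) * n01
        p10, p11 = n10 - k1 * n00, n11 - k1 * n01
    return out

def uncertainty_score(distances):
    """Episode-level score: mean NIS (a hypothetical aggregation)."""
    nis = nis_sequence(distances)
    return sum(nis) / len(nis)
```

An episode that approaches the goal steadily yields small innovations once the filter has locked onto the closing rate, while an oscillating episode keeps violating the constant-velocity prediction and scores higher.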

2606.20491 2026-06-19 cs.RO cs.CV 新提交

Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation

用于自主导航中注视引导主动感知的快速人类注意力预测

Fatma Youssef Mohammed, Grzegorz Malczyk, Kostas Alexis

发表机构 * Norwegian University of Science and Technology (NTNU)(挪威科技大学)

AI总结 提出GazeLNN,一种基于液态神经网络和MobileNetV3的轻量级扫描路径预测模型,在MIT低分辨率数据集上达到最优性能,计算成本降低99.40%,推理速度提升6倍,并集成到强化学习训练的主动相机-机器人控制策略中,实现自主导航中的注视引导感知。

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

详情
AI中文摘要

人类视觉注意力依赖于结构化的扫描路径来高效处理场景,但将这种行为注入机器人自主性仍处于初级阶段,且受到现有预测模型高计算成本的阻碍。为了解决这一问题,我们提出了GazeLNN,一种计算轻量级的扫描路径预测模型,该模型采用液态神经网络作为其循环引擎,并使用MobileNetV3进行特征提取。该架构以自回归方式运行,根据当前视觉刺激和注视历史预测顺序注视热图。尽管仅需0.61 GFLOPs,GazeLNN在MIT低分辨率数据集上达到了最先进的性能,获得了0.47的ScanMatch分数。它在多种评估指标上优于现有的循环基线,同时将计算成本降低了99.40%,并将推理速度提高了六倍。为了研究人类注意力建模在机器人自主性中的作用,并展示这种高效架构的实际效用,我们将GazeLNN集成到通过强化学习训练的主动相机-机器人控制策略中。这种集成使得在自主导航过程中能够实现人类注视引导的感知,并通过在无人机上的成功实际部署得到了验证。

英文摘要

Human visual attention relies on structured scanpaths to efficiently process scenes, yet instilling this behavior into robot autonomy is in its infancy and hindered by the high computational costs of existing predictive models. To address this, we introduce GazeLNN, a computationally lightweight scanpath prediction model that leverages Liquid Neural Networks as its recurrent engine and employs MobileNetV3 for feature extraction. Operating auto-regressively, the architecture predicts sequential fixation heatmaps conditioned on the current visual stimulus and fixation history. Despite requiring only 0.61 GFLOPs, GazeLNN achieves state-of-the-art performance on the MIT Low Resolution dataset, with a ScanMatch score of 0.47. It outperforms existing recurrent baselines across diverse evaluation metrics, while reducing computational costs by 99.40% and accelerating inference by up to six times. To investigate the role of human attention modeling in robot autonomy and demonstrate the practical utility of this highly efficient architecture, we integrate GazeLNN into an active camera-robot control policy trained via Reinforcement Learning. This integration enables human-fixation-guided perception during autonomous navigation, validated through successful real-world deployments on an aerial robot.

2509.13972 2026-06-19 cs.RO 版本更新

BIM Informed Visual SLAM for Construction Environments

BIM 引导的视觉 SLAM 在建筑环境中的应用

Asier Bikandi-Noya, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez

发表机构 * Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg(自动化与机器人研究组,安全、可靠与信任跨学科研究中心(SnT),卢森堡大学)

AI总结 针对建筑环境中视觉SLAM轨迹漂移问题,提出利用建筑信息模型(BIM)的结构先验增强RGB-D SLAM系统,通过墙面对应与几何约束优化减少漂移,提升全局一致性,实验显示轨迹误差降低25.23%,地图精度提升7.14%。

Comments 9 pages, 7 tables, 4 figures

详情
AI中文摘要

监测建筑施工现场需要将计划设计与实际建造状态进行比较,而同步定位与地图构建(SLAM)技术可以实时估计实际状态。然而,视觉SLAM在建筑环境中容易产生轨迹漂移,生成的地图在几何上与实际环境不准确。为解决这一局限,我们利用从建筑信息模型(BIM)导出的结构先验增强现有的RGB-D SLAM系统。该系统将检测到的墙面与BIM中的对应墙面关联,并将这些对应关系作为几何约束加入后端优化,从而减少漂移并增强全局一致性。所提方法实时运行,并在多个真实建筑工地上验证,与最先进的基线相比,平均轨迹误差降低25.23%,地图精度提升7.14%。鲁棒性分析进一步表明,该方法对不完整的BIM数据以及计划模型与实际环境之间的几何差异具有韧性。

英文摘要

Monitoring building construction sites requires comparing the as-planned design with the as-built state, which can be estimated in real time using Simultaneous Localization and Mapping (SLAM) techniques. However, visual SLAM is prone to trajectory drift in construction environments, producing maps that are geometrically inconsistent with the actual environment. To address this limitation, we augment an existing RGB-D SLAM system with structural priors derived from the Building Information Model (BIM). The system associates detected walls with their BIM counterparts and includes these correspondences as geometric constraints in the back-end optimization, reducing drift and enhancing global consistency. The proposed method operates in real time and is validated on multiple real construction sites, achieving an average trajectory error reduction of 25.23% and a 7.14% improvement in map accuracy over state-of-the-art baselines. Robustness analyses further demonstrate resilience to incomplete BIM data and geometric discrepancies between as-planned models and the as-built environment.

2512.11173 2026-06-19 cs.RO 版本更新

Learning Category-level Last-meter Navigation from RGB Demonstrations of a Single-instance

从单实例RGB演示中学习类别级最后米导航

Tzu-Hsien Lee, Fidan Mahmudova, Karthik Desingh

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学 Twin Cities 分校)

AI总结 提出面向对象的模仿学习框架,利用RGB观测实现四足移动机械臂在最后米阶段的精确导航,无需深度或地图先验,在类别级泛化中达到高成功率。

详情
AI中文摘要

移动机械臂基座的精确定位对于后续成功操作至关重要。大多数基于RGB的导航系统仅保证粗略的米级精度,不适合移动操作的精确定位阶段。这一差距导致操作策略无法在其训练演示的分布内运行,从而导致频繁的执行失败。我们通过引入一种面向对象的模仿学习框架来解决这一差距,用于最后米导航,使四足移动机械臂机器人仅使用其机载摄像头的RGB观测即可实现可操作的定位。我们的方法将导航策略条件化为三个输入:目标图像、来自机载摄像头的多视角RGB观测以及指定目标对象的文本提示。然后,语言驱动的分割模块和空间得分矩阵解码器提供显式的对象定位和相对姿态推理。使用类别内单个对象实例的真实世界数据,该系统能够泛化到不同环境中具有挑战性光照和背景条件的未见对象实例。为了全面评估这一点,我们引入了两个指标:边缘对齐度量(使用真实方向)和对象对齐度量(评估机器人视觉上面对目标的程度)。在这些指标下,我们的策略在相对于未见目标对象定位时,边缘对齐成功率达到74.58%,对象对齐成功率达到89.42%。这些结果表明,无需深度、LiDAR或地图先验,即可在类别级实现精确的最后米导航,为统一的移动操作提供可扩展的途径。项目页面:此https URL

英文摘要

Achieving precise positioning of the mobile manipulator's base is essential for successful manipulation actions that follow. Most of the RGB-based navigation systems only guarantee coarse, meter-level accuracy, making them less suitable for the precise positioning phase of mobile manipulation. This gap prevents manipulation policies from operating within the distribution of their training demonstrations, resulting in frequent execution failures. We address this gap by introducing an object-centric imitation learning framework for last-meter navigation, enabling a quadruped mobile manipulator robot to achieve manipulation-ready positioning using only RGB observations from its onboard cameras. Our method conditions the navigation policy on three inputs: goal images, multi-view RGB observations from the onboard cameras, and a text prompt specifying the target object. A language-driven segmentation module and a spatial score-matrix decoder then supply explicit object grounding and relative pose reasoning. Using real-world data from a single object instance within a category, the system generalizes to unseen object instances across diverse environments with challenging lighting and background conditions. To comprehensively evaluate this, we introduce two metrics: an edge-alignment metric, which uses ground truth orientation, and an object-alignment metric, which evaluates how well the robot visually faces the target. Under these metrics, our policy achieves 74.58% success in edge-alignment and 89.42% success in object-alignment when positioning relative to unseen target objects. These results show that precise last-meter navigation can be achieved at a category-level without depth, LiDAR, or map priors, enabling a scalable pathway toward unified mobile manipulation. Project page: https://rpm-lab-umn.github.io/category-level-last-meter-nav/

2601.03040 2026-06-19 cs.RO cs.AI cs.LG 版本更新

PiDR: Physics-Informed Inertial Dead Reckoning for Autonomous Platforms

PiDR:面向自主平台的物理信息惯性航位推算

Arup Kumar Sahoo, Itzik Klein

发表机构 * Autonomous Navigation and Sensor Fusion Lab (ANSFL)(自主导航与传感器融合实验室(ANSFL)) Hatter Department of Marine Technologies(海洋技术系) Charney School of Marine Sciences(海洋科学学院) University of Haifa(海法大学)

AI总结 提出PiDR框架,将惯性导航原理作为物理信息残差融入网络训练,在纯惯性导航中减少轨迹漂移,在移动机器人和水下自主航行器数据集上定位精度提升超29%。

Comments 11 pages and 7 figures

详情
AI中文摘要

完全自主的一个基本要求是在缺乏外部数据(如GNSS信号或视觉信息)的情况下维持精确导航的能力。在这些具有挑战性的环境中,平台必须完全依赖惯性传感器,导致纯惯性导航。然而,在现实场景中,惯性传感器的固有噪声和其他误差项会导致导航解随时间漂移。尽管传统的深度学习模型已成为惯性导航的一种可能方法,但它们本质上是黑箱的。此外,它们在有限的监督传感器数据下难以有效学习,并且常常无法保持物理原理。为了解决这些局限性,我们提出了PiDR,一种用于纯惯性导航情况下自主平台的物理信息惯性航位推算框架。PiDR通过物理信息残差组件将惯性导航原理明确地整合到网络训练过程中,从而提供了透明性。即使在有限或稀疏监督下,PiDR在减轻轨迹突然偏差方面也起着关键作用。我们在移动机器人和自主水下航行器收集的真实世界数据集上评估了PiDR。在两个数据集中,我们获得了超过29%的定位改进,证明了PiDR在不同环境和动力学下运行的不同平台上的泛化能力。因此,PiDR提供了一种鲁棒、轻量级且有效的架构,可以部署在资源受限的平台上,在不利场景中实现实时纯惯性导航。

英文摘要

A fundamental requirement for full autonomy is the ability to sustain accurate navigation in the absence of external data, such as GNSS signals or visual information. In these challenging environments, the platform must rely exclusively on inertial sensors, leading to pure inertial navigation. However, the inherent noise and other error terms of the inertial sensors in such real-world scenarios will cause the navigation solution to drift over time. Although conventional deep-learning models have emerged as a possible approach to inertial navigation, they are inherently black-box in nature. Furthermore, they struggle to learn effectively with limited supervised sensor data and often fail to preserve physical principles. To address these limitations, we propose PiDR, a physics-informed inertial dead-reckoning framework for autonomous platforms in situations of pure inertial navigation. PiDR offers transparency by explicitly integrating inertial navigation principles into the network training process through the physics-informed residual component. PiDR plays a crucial role in mitigating abrupt trajectory deviations even under limited or sparse supervision. We evaluated PiDR on real-world datasets collected by a mobile robot and an autonomous underwater vehicle. We obtained more than 29% positioning improvement in both datasets, demonstrating the ability of PiDR to generalize across different platforms operating in various environments and dynamics. Thus, PiDR offers a robust, lightweight, yet effective architecture and can be deployed on resource-constrained platforms, enabling real-time pure inertial navigation in adverse scenarios.
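The idea of a physics-informed residual under sparse supervision can be sketched in one dimension. This is a hypothetical illustration, not PiDR's loss: the forward-Euler kinematic residuals, the weighting `lam`, and the assumption that `accel` is already gravity-compensated and expressed in the navigation frame are all choices made here for clarity.

```python
def physics_informed_loss(pred_pos, pred_vel, accel, dt, labels, lam=1.0):
    """Supervised error on sparse labels plus a kinematic-consistency penalty.

    pred_pos, pred_vel : predicted 1D position and velocity per time step
    accel              : measured acceleration per time step (assumed compensated)
    labels             : dict mapping time index -> ground-truth position (may be sparse)
    """
    # Data term: mean squared error on the few labeled time steps.
    data = sum((pred_pos[i] - p) ** 2 for i, p in labels.items()) / max(len(labels), 1)
    # Physics term: predictions should satisfy dp/dt = v and dv/dt = a.
    residuals = []
    for k in range(len(pred_pos) - 1):
        r_pos = (pred_pos[k + 1] - pred_pos[k]) / dt - pred_vel[k]
        r_vel = (pred_vel[k + 1] - pred_vel[k]) / dt - accel[k]
        residuals.append(r_pos ** 2 + r_vel ** 2)
    physics = sum(residuals) / max(len(residuals), 1)
    return data + lam * physics
```

Because the physics term is evaluated at every time step, it still constrains the trajectory between labeled samples, which is one plausible reason such a term would suppress abrupt deviations when supervision is sparse.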

2601.15614 2026-06-19 cs.RO 版本更新

AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning

AION: 基于双策略强化学习的空中室内目标导航

Zichen Yan, Yuchen Hou, Shenao Wang, Yichao Gao, Rui Huang, Lin Zhao

发表机构 * Department of Electrical and Computer Engineering, National University of Singapore(新加坡国立大学电子与计算机工程系)

AI总结 提出AION,一种端到端双策略强化学习框架,解耦探索与目标到达行为,用于视觉空中目标导航,无需外部定位或全局地图,在AI2-THOR和IsaacSim中验证了优越性能。

Comments Accepted to IROS 2026

详情
AI中文摘要

目标导航要求智能体自主探索未知环境并导航至由语义标签指定的目标对象。以往工作主要研究二维移动下的零样本目标导航,将其扩展到具有三维移动能力的空中平台仍未被充分探索。空中机器人具有优越的机动性和搜索效率,但也带来了空间感知、动态控制和安全性保障方面的新挑战。本文提出AION,用于基于视觉的空中目标导航,无需依赖外部定位或全局地图。AION是一个端到端的双策略强化学习框架,将探索和目标到达行为解耦为两个专门策略。我们在AI2-THOR基准上评估AION,并在IsaacSim中使用高保真无人机模型进一步评估其实时性能。实验结果表明,AION在探索、导航效率和安全性的综合评估指标上均取得了优越性能。视频可在\url{this https URL}找到,代码和模型检查点可在\url{this https URL}获取。

英文摘要

Object-Goal Navigation (ObjectNav) requires an agent to autonomously explore an unknown environment and navigate toward target objects specified by a semantic label. While prior work has primarily studied zero-shot ObjectNav under 2D locomotion, extending it to aerial platforms with 3D locomotion capability remains underexplored. Aerial robots offer superior maneuverability and search efficiency, but they also introduce new challenges in spatial perception, dynamic control, and safety assurance. In this paper, we propose AION for vision-based aerial ObjectNav without relying on external localization or global maps. AION is an end-to-end dual-policy reinforcement learning (RL) framework that decouples exploration and goal-reaching behaviors into two specialized policies. We evaluate AION on the AI2-THOR benchmark and further assess its real-time performance in IsaacSim using high-fidelity drone models. Experimental results show that AION achieves superior performance across comprehensive evaluation metrics in exploration, navigation efficiency, and safety. The video can be found at \url{https://youtu.be/TgsUm6bb7zg}, code and model checkpoints are available at \url{https://github.com/Zichen-Yan/AION}.

2605.09383 2026-06-19 cs.RO 版本更新

Safety-Critical LiDAR-Inertial Odometry with On-Manifold Deterministic Protection Level

安全关键的激光雷达-惯性里程计与在线流形确定性保护级别

Yueqi Zhu, Yan Pan, Chufan Rui, Jiasheng Luo, Shihua Li, Bo Zhou

发表机构 * School of Automation, Southeast University(东南大学自动化学院) Key Laboratory of Measurement and Control of CSE, Ministry of Education(教育部测控CSE重点实验室)

AI总结 本文提出一种安全关键的激光雷达-惯性里程计,通过在线流形确定性状态估计提供确定性保护级别,以提升移动机器人在安全关键场景中的导航安全性。

详情
AI中文摘要

在安全关键场景中,自主导航系统的保护级别对于使移动机器人安全执行任务至关重要。然而,现有针对机器人概率导航系统的研究通常使用有限数据集进行离线准确性评估,并假设结果可应用于未知真实环境。因此,当前自主移动机器人往往缺乏在线安全评估的保护级别。为填补这一空白,我们提出了一种安全关键的激光雷达-惯性里程计(LIO),其基于在线流形确定性状态估计提供确定性保护级别。通过采用未知但有界的假设,我们推导出点云噪声与迭代最近点算法估计不确定性之间的简洁闭式关系。利用这一关系,我们设计了一种在线流形椭球集成员滤波器,并将其实现于LIO系统中。利用集成员滤波器的性质,我们的系统将估计位置的可行集作为确定性保护级别,用作机器人下游自主操作的安全参考。实验结果表明,我们的系统能够为各种环境中的不同机器人提供有效的确定性在线安全参考。

英文摘要

In safety-critical scenarios, the protection level of the autonomous navigation system is crucial for enabling mobile robots to perform safe tasks. However, existing studies on probabilistic navigation systems for robots usually perform offline accuracy evaluations using limited datasets and assume that the results can be applied to unknown real-world environments. As a result, current autonomous mobile robots often lack protection levels for online safety assessment. To fill this gap, we propose a safety-critical LiDAR-inertial odometry (LIO) that provides deterministic protection levels based on on-manifold deterministic state estimation. By adopting the unknown but bounded assumption, we derive a neat closed-form relationship between point cloud noise and the uncertainty of the estimation from the iterated closest point algorithm. Using this relationship, we design an on-manifold ellipsoidal set-membership filter and implement it within the LIO system. Leveraging the properties of the set-membership filter, our system offers the feasible sets of the estimated locations as the deterministic protection levels, serving as safety references for the robots' downstream autonomous operations. The experimental results show that our system can provide effective deterministic online safety references for diverse robots in various environments.

2606.14776 2026-06-19 cs.RO cs.LG 版本更新

Deep Learning-Based Lunar Crater Terrain Relative Navigation

基于深度学习的月球陨石坑地形相对导航

Batu Candan, Simone Servadio

发表机构 * NASA(美国国家航空航天局) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出一种结合深度学习陨石坑检测器和扩展卡尔曼滤波的地形相对导航算法,在初始位置偏差达5公里时仍能将导航误差降至数百米。

详情
AI中文摘要

准确的位置估计对于未来使用自主飞行器实现月球着陆至关重要,尤其是在地形特征稀疏的危险环境中。本文提出了一种地形相对导航(TRN)算法,该算法结合了我们专门为NASA陨石坑检测挑战问题设计的深度学习陨石坑检测器和扩展卡尔曼滤波(EKF)。我们的检测器分析从轨道获取的单目图像中的陨石坑特征,并通过匈牙利分配方法及基于共识的离群点去除方法,识别它们与全球数据库中陨石坑的匹配。然后,估计的测量值用于优化EKF,其中航天器在月心月固(LCLF)参考系中的姿态估计,结合高度辅助信息,约束径向漂移。仿真结果表明,即使航天器偏离实际位置达5公里,TRN也能从这种情况中恢复,将导航误差降低到几百米。需要注意的是,为了保持陨石坑特征的对应关系,必须将图像分辨率和场景中的尺度与检测器训练集分布相匹配。

英文摘要

Accurate position estimation is crucial for the successful implementation of future lunar landings using autonomous vehicles, especially in dangerous environments with sparse terrain features. In this paper, we propose a terrain relative navigation (TRN) algorithm combining our deep-learning crater detector, which was designed specifically for the NASA Crater Detection Challenge problem, and an Extended Kalman Filter (EKF). Our detector analyzes crater features from the monocular images acquired from orbit, and their matches with craters from a global database are identified via a Hungarian assignment approach followed by a consensus-based outlier removal method. The estimated measurements are then used to refine an EKF, where spacecraft pose estimation in the Lunar-Centered Lunar-Fixed (LCLF) frame of reference, augmented with altitude aiding information, constrains radial drift. The simulation results indicate that even when the spacecraft's position estimate is off by up to 5 km from its actual location, TRN can recover, reducing the navigation error to a few hundred meters. It should be noted that in order to maintain crater feature correspondences, it is important to match the image resolution and the scales within the scene to the detector training set distribution.

2606.16057 2026-06-19 cs.RO cs.SY eess.SP eess.SY 版本更新

A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation

一种智能调度混合(SSH)EKF-FGO状态估计方法

Eric Levy, Soosan Beheshti

发表机构 * GitHub arXiv

AI总结 本文通过智能调度混合EKF-FGO框架,实验性地将优化调度作为独立设计变量,研究其在平衡估计精度与计算成本中的作用,并在平面SLAM仿真中验证了调度对预优化漂移、瞬态误差和运行时间的显著影响。

Comments This work has been accepted for presentation/publication at the 2026 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). The final published version will appear in IEEE Xplore

详情
AI中文摘要

在机器人学和控制中,可靠的状态估计需要在估计精度和计算成本之间取得平衡。虽然基于滤波的方法(如扩展卡尔曼滤波器,EKF)提供高效的实时更新,而使用因子图的优化公式化方法改善全局一致性,但优化调度的作用通常被隐式处理,而非作为明确的设计变量进行研究。本文提出了一项实验研究,通过使用智能调度混合(SSH)EKF-FGO框架作为受控测试平台,明确隔离了优化调度。通过将基于EKF的状态传播与定期调用的批量优化相结合,并保持求解器结构和计算量固定,本文的主要贡献是实验性地将优化调度表征为一个独立的设计变量,它控制着中间估计精度与计算成本之间的权衡。在平面SLAM环境中的仿真结果表明,调度强烈影响预优化漂移、瞬态误差行为和运行时间。特别是,结果识别出一些操作区域,在这些区域中,全局优化的大部分好处可以以一小部分计算成本保留,从而突显了优化调度作为混合状态估计系统中一个未被充分探索但至关重要的考虑因素。

英文摘要

Reliable state estimation in robotics and control requires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation-based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart-Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. By combining EKF-based state propagation with periodically invoked batch optimisation and holding solver structure and effort fixed, the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre-optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.

2606.18112 2026-06-19 cs.RO cs.CV 版本更新

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Qwen-RobotNav 技术报告:为智能体导航系统设计的可扩展导航模型

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Zhibo Yang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

发表机构 * Qwen Team(通义实验室)

AI总结 提出 Qwen-RobotNav 可扩展导航模型,通过参数化接口支持多种任务模式和可调观测参数,在15.6M样本上训练,联合视觉语言数据防止行为坍缩,在多个导航基准上取得新最优结果,并展示零样本泛化能力。

详情
AI中文摘要

智能体导航系统需要一个基础导航模型,其观测策略可以在推理时从外部重新配置,因为指令跟随、目标搜索、目标跟踪和自动驾驶共享相同的感知规划主干,但对视觉流的消费方式有根本不同的要求。我们提出 Qwen-RobotNav,一个建立在 Qwen-RobotNav 上的可扩展导航模型,通过一个具有两个互补维度的参数化接口来解决这个问题:多个任务模式选择导航行为,以及可控的观测参数(例如,token 预算、每个摄像头的权重)控制视觉历史的编码方式。通过训练时对所有参数进行随机化,Qwen-RobotNav 对任何推理时配置都具有鲁棒性,无需对 Qwen-RobotNav 主干进行任何架构修改。我们在15.6M样本上训练 Qwen-RobotNav;与视觉语言数据联合训练防止了在仅轨迹训练中观察到的反应性动作序列映射器的坍缩。参数化接口也使 Qwen-RobotNav 成为智能体系统的自然构建块:对于长时域场景,上层规划器将目标分解为子任务,并在情节中动态切换 Qwen-RobotNav 的任务模式和上下文策略,通过重复调用同一模型组合出复杂行为。大量实验表明,Qwen-RobotNav 在主要导航基准上取得了新的最优结果。该模型从2B到8B参数展现出良好的扩展性,联合多任务训练发展出一个跨任务族迁移的共享空间规划基板,并在多样环境中对真实世界机器人展现出强大的零样本泛化能力。

英文摘要

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.

2606.18951 2026-06-19 cs.RO 版本更新

A High-accuracy Event-based Underwater SLAM System

高精度事件相机水下SLAM系统

Yifan Peng, Qihang Liu, Haoying Li, Yuzhe Li, Junfeng Wu, Ziyang Hong

AI总结 针对事件相机水下SLAM中时间曲面成像质量差和匹配失败问题,提出基于结构感知度量和贝叶斯优化的高精度立体SLAM系统,并贡献首个高质量水下事件数据集UWE。

详情
AI中文摘要

虽然事件相机为水下SLAM提供了巨大潜力,但现有的基于时间曲面(TS)的方法在水下部署时被证明非常不可靠。波动的相机速度严重降低了TS成像质量,而宽立体基线和重复的水下纹理导致关键匹配失败,频繁引发系统崩溃。为克服这些挑战,我们开发了首个高精度事件相机水下立体SLAM系统。基于结构张量相干性和梯度,设计了一种结构感知度量来定量评估TS结构信息密度。通过将最优TS生成解耦为基于系统初始化的两个不同阶段,贝叶斯优化(BO)在初始化前首先预测最优先验TS,同时我们设置异步在线局部搜索方法,在跟踪阶段实时获取合适的TS。我们使用先验视差保证精确的数据关联,并采用“最新观测优先”三角测量机制实现稳定三角测量。作为这些解决方案的基准和社区资源,我们还贡献了UWE,这是首个高质量真实世界水下事件数据集,包含变化的相机运动、复杂纹理和不同轨迹特征。在公共数据集和UWE上的广泛评估表明,所提出的SLAM系统与最先进的事件相机方法相比具有竞争力的精度性能。代码和数据将开源。

英文摘要

While event cameras offer immense potential for underwater SLAM, existing Time Surface (TS)-based methods prove highly unreliable when deployed underwater. Fluctuating camera velocities severely degrade TS imaging quality, while wide stereo baselines and repetitive underwater textures induce critical matching failures, frequently triggering system failure. To overcome these challenges, we develop the first high-accuracy event-based underwater stereo SLAM system. A structure-aware metric for TS is designed based on structure tensor coherence and gradients to quantitatively evaluate TS structural information density. By decoupling the optimal TS generation into two distinct stages based on system initialization, Bayesian Optimization (BO) first sequentially predicts an optimal prior TS before initialization, while an asynchronous online local search runs periodically to obtain an appropriate TS in real time during the tracking stage. We use the prior disparity to guarantee precise data association and a "latest-observation-first" triangulation mechanism to achieve stable triangulation. As a benchmark for these solutions and a resource for the community, we also contribute UWE, the first high-quality real-world underwater event dataset containing variable camera motions, complex textures and different trajectory features. Extensive evaluations on public datasets and UWE show that the proposed SLAM system achieves competitive accuracy compared with the state-of-the-art event-based method. The code and data will be open-sourced.

2510.24399 2026-06-19 cs.CV cs.RO 版本更新

GenTrack: A New Generation of Multi-Object Tracking

GenTrack:新一代多目标跟踪

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

发表机构 * SDU Robotics, University of Southern Denmark(SDU机器人实验室,南丹麦大学)

AI总结 提出GenTrack多目标跟踪方法,采用随机与确定性混合策略,结合粒子群优化与社会交互,在弱检测器、遮挡等场景下有效维持目标身份一致性并减少ID切换。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

本文介绍了一种新颖的多目标跟踪(MOT)方法,称为GenTrack,其主要贡献包括:第一,一种混合跟踪方法,采用随机和确定性方式,以鲁棒地处理未知且时变的目标数量,特别是在维持目标身份(ID)一致性和管理非线性动态方面;第二,利用粒子群优化(PSO)和一些提出的适应度度量,引导随机粒子朝向其目标分布模式,从而即使在弱且噪声大的目标检测器下也能实现有效跟踪;第三,整合目标间的社会交互,以增强PSO引导的粒子,并改进强(匹配)和弱(未匹配)轨迹的连续更新,从而减少ID切换和轨迹丢失,尤其是在遮挡期间;第四,基于GenTrack重新定义的视觉MOT基线,结合了基于空间一致性、外观、检测置信度、轨迹惩罚和社会分数的综合状态与观测模型,以实现系统且高效的目标更新;第五,首个公开可用的最小依赖源代码参考实现,包含三种变体,包括GenTrack Simple、Strengthen和Super,便于灵活重新实现。实验结果表明,与最先进的跟踪器相比,GenTrack在标准基准和现实场景中提供了优越的性能,并集成了基线实现以进行公平比较。还讨论了未来工作的潜在方向。所提方法和比较跟踪器的源代码参考实现已在GitHub上提供:this https URL

英文摘要

This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: first, a hybrid tracking approach that combines stochastic and deterministic methods to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics; second, the use of particle swarm optimization (PSO) with proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors; third, the integration of social interactions among targets to enhance PSO-guided particles and improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions; fourth, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates; and fifth, the first publicly available source-code reference implementation with minimal dependencies, featuring three variants (GenTrack Simple, Strengthen, and Super) that facilitate flexible reimplementation. Experimental results show that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and the compared trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
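How PSO can pull stochastic particles toward a mode of a fitness function is sketched below with the textbook global-best update. It is not the GenTrack implementation: the function name `pso_guide`, the coefficients `w`, `c1`, `c2`, and the single scalar fitness are assumptions, and the paper's appearance, motion, and social-interaction fitness terms are not modelled.

```python
import random

def pso_guide(fitness, particles, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Move particles toward high-fitness regions; return (global best, final positions)."""
    rng = random.Random(seed)
    pos = [list(p) for p in particles]
    vel = [[0.0] * len(p) for p in pos]
    best = [list(p) for p in pos]               # per-particle best position
    best_fit = [fitness(p) for p in pos]
    g = max(range(len(pos)), key=best_fit.__getitem__)
    g_pos, g_fit = list(best[g]), best_fit[g]   # swarm-wide best position
    for _ in range(iters):
        for i, p in enumerate(pos):
            for d in range(len(p)):
                # Inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (best[i][d] - p[d])
                             + c2 * rng.random() * (g_pos[d] - p[d]))
                p[d] += vel[i][d]
            f = fitness(p)
            if f > best_fit[i]:
                best[i], best_fit[i] = list(p), f
                if f > g_fit:
                    g_pos, g_fit = list(p), f
    return g_pos, pos
```

In a tracker, `fitness` would score a candidate state against the current frame, so the swarm can still concentrate around a target when the detector output is weak or missing.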

2510.24410 2026-06-19 cs.CV cs.RO 版本更新

GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

GenTrack2: 一种改进的多目标跟踪混合方法

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

发表机构 * SDU Robotics, University of Southern Denmark(SDU机器人研究所,南丹麦大学)

AI总结 提出结合随机粒子滤波与确定性关联的多目标跟踪方法,通过粒子群优化和新型代价矩阵解决非线性动态下的标识一致性问题,性能优于现有方法。

Comments The content of this paper was included in the full manuscript of GenTrack family which has been submitted to the journal for possible publication

详情
AI中文摘要

本文提出一种视觉多目标跟踪方法,联合使用随机和确定性机制,以确保在非线性动态下未知且时变目标数量的标识一致性。随机粒子滤波处理非线性动态和非高斯噪声,并借助粒子群优化(PSO)将粒子引导至状态分布模式,通过提出的适应度度量(包含运动一致性、外观相似性和与邻近目标的社交互动线索)减轻发散。确定性关联通过提出的代价矩阵进一步强制标识一致性,该矩阵包含粒子与当前检测之间的空间一致性、检测置信度和轨迹惩罚。随后,提出一种新颖方案,在保持目标身份的同时平滑更新目标状态,特别是对于与其他目标交互和长时间遮挡期间的弱轨迹。此外,对过去状态的速度回归提供趋势种子速度,增强粒子采样和状态更新。所提出的跟踪器设计灵活,适用于预录视频和相机直播流(未来帧不可用)。实验结果表明,与最先进的跟踪器相比,性能优越。所提出方法和对比跟踪器的源代码参考实现已在GitHub上提供:此 https URL

英文摘要

This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and the compared trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2

2602.23172 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

潜在高斯泼溅用于4D全景占据跟踪

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

发表机构 * University of Freiburg(弗赖堡大学) Bosch Research(博世研究院) University of Haifa(海法大学)

AI总结 提出潜在高斯泼溅(LaGS)方法,通过特征高斯体作为动态关键点实现多视图特征聚合,用于4D全景占据跟踪,在Occ3D nuScenes和Waymo上达到最优性能。

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

详情
AI中文摘要

捕捉4D时空场景结构对于机器人在动态环境中安全可靠运行至关重要。然而,现有方法通常只解决部分问题:它们要么通过边界框提供粗略的几何跟踪,要么提供缺乏显式时间关联和实例级推理的详细3D占据估计。在这项工作中,我们提出了潜在高斯泼溅(LaGS)用于4D全景占据跟踪(4D-POT)。我们重新审视底层表示,将3D特征建模为一组稀疏的带特征高斯体。这些高斯体作为动态的、面向体积的关键点,在泼溅到体素网格进行解码之前,能够实现多视图特征的空间连续、距离加权聚合。这种以点为中心的公式实现了灵活、数据相关的感受野和长程空间交互,这是局部密集体素算子难以捕捉的。分层高斯表示通过结合来自粗超点的全局上下文和来自高分辨率流的细粒度细节,进一步实现了多尺度推理。在Occ3D nuScenes和Waymo上的大量实验证明了4D-POT的最先进性能。我们在以下网址提供代码和模型:https://lags.cs.uni-freiburg.de/。

英文摘要

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.
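
To make the "splatted into a voxel grid" step concrete, here is a minimal sketch. It assumes isotropic Gaussians and a normalized, distance-weighted average per voxel; `splat_to_voxels`, its arguments, and the toy values are hypothetical and do not reproduce the LaGS implementation, which also covers multi-view aggregation and a hierarchical representation.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def splat_to_voxels(
    means: Sequence[Vec3],
    sigmas: Sequence[float],
    features: Sequence[Sequence[float]],
    grid_shape: Tuple[int, int, int],
    voxel_size: float = 1.0,
) -> Dict[Tuple[int, int, int], List[float]]:
    """Write Gaussian-weighted averages of per-Gaussian features into each voxel."""
    dim = len(features[0])
    grid: Dict[Tuple[int, int, int], List[float]] = {}
    for ix in range(grid_shape[0]):
        for iy in range(grid_shape[1]):
            for iz in range(grid_shape[2]):
                center = (
                    (ix + 0.5) * voxel_size,
                    (iy + 0.5) * voxel_size,
                    (iz + 0.5) * voxel_size,
                )
                acc = [0.0] * dim
                total = 0.0
                for mu, sigma, feat in zip(means, sigmas, features):
                    d2 = sum((c - m) ** 2 for c, m in zip(center, mu))
                    w = math.exp(-0.5 * d2 / (sigma * sigma))  # isotropic Gaussian weight
                    total += w
                    for k in range(dim):
                        acc[k] += w * feat[k]
                # Small epsilon avoids division by zero far from every Gaussian.
                grid[(ix, iy, iz)] = [a / (total + 1e-12) for a in acc]
    return grid


# Two Gaussians at opposite ends of a 4x1x1 grid, with scalar features 1 and 0.
grid = splat_to_voxels(
    means=[(0.5, 0.5, 0.5), (3.5, 0.5, 0.5)],
    sigmas=[1.0, 1.0],
    features=[[1.0], [0.0]],
    grid_shape=(4, 1, 1),
)
```

Each voxel ends up dominated by the nearest Gaussian, while voxels in between blend the two features smoothly.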

5. 人机交互与协作机器人 6 篇

2606.19914 2026-06-19 cs.RO cs.AI 新提交

Co-policy: Responsive Human-Robot Co-Creation for Musical Performances

Co-policy: 响应式人机音乐共创框架

Xuetao Li, Wenke Huang, Mang Ye, Zijian Liu, Jinhua Xie, Jifeng Xuan, Miao Li

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Automation, Wuhan University of Technology(武汉理工大学自动化学院) School of Geodesy and Geomatics, Wuhan University(武汉大学测绘学院) School of Robotics, Wuhan University(武汉大学机器人学院)

AI总结 提出Co-policy框架,通过语义锚定、约束变分和视觉运动策略实现人机音乐实时共创,在真实钟琴实验中优于扩散策略基线。

详情
AI中文摘要

艺术长期以来一直是人类创造力的关键表达。具身人工智能为生成模型通过物理动作而非无形数字内容参与创造力提供了一条途径。在机器人音乐共创中,将语义音乐理解与实时且可物理执行的表演连接起来具有挑战性。我们提出了Co-policy,一个人机音乐共创框架,它分离了语义意图接地、约束音乐变分和视觉运动执行。为了接地音乐语义,Co-policy使用预推理语义锚点和微调的Qwen-vl规划器(F-Qwen)将语音、实时音乐种子和视觉观察转化为结构化的共创计划。为了支持低延迟执行,Co-policy引入了高斯混合视觉运动策略(GMP),实现为条件混合密度策略,在单次前向传递中将目标音符和视觉上下文映射到多模态机器人动作。与仅复现用户指定音符的机器人回放系统不同,Co-policy在音乐和物理约束下生成互补的音乐响应。真实机器人钟琴实验、消融研究和专家评估显示,与扩散策略和消融基线相比,意图对齐、执行准确性和响应频率均有提升,支持物理接地动作生成作为具身人机共创的关键要求。

英文摘要

Art has long stood as a pivotal expression of human creativity. Embodied artificial intelligence offers a route for generative models to participate in that creativity through physical action rather than disembodied digital content. In robotic music co-creation, it is challenging to connect semantic musical understanding with real-time and physically executable performance. We present Co-policy, a framework for human-robot musical co-creation that separates semantic intent grounding, constrained musical variation, and visuomotor execution. To ground musical semantics, Co-policy uses pre-inference semantic anchors and a fine-tuned Qwen-vl planner (F-Qwen) to transform speech, live musical seeds, and visual observations into structured co-creation plans. To support low-latency execution, Co-policy introduces a Gaussian-Mixture Visuomotor Policy (GMP), implemented as a conditional mixture-density policy that maps target notes and visual context to multimodal robot actions in a single forward pass. Unlike robotic playback systems that merely reproduce user-specified notes, Co-policy generates complementary musical responses under both musical and physical constraints. Real-robot chime experiments, ablations, and expert evaluation show improved intent alignment, execution accuracy, and response frequency over diffusion-policy and ablated baselines, supporting physically grounded action generation as a key requirement for embodied human-AI co-creation.
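
The abstract describes a conditional mixture-density policy that produces multimodal actions in one forward pass. A minimal sketch of the sampling step follows. It assumes the network has already produced mixture logits, per-component means, and per-component standard deviations; the names `sample_action` and `softmax` and all numeric values are hypothetical and not taken from Co-policy.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over mixture logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(
    logits: Sequence[float],
    means: Sequence[Sequence[float]],
    stds: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[float]:
    """Pick one mixture component, then sample each action dimension from its Gaussian."""
    weights = softmax(logits)
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(mu, sd) for mu, sd in zip(means[k], stds[k])]


rng = random.Random(0)
# Hypothetical head outputs: two modes for a 2-D action, the first strongly preferred.
logits = [100.0, -100.0]
means = [[0.5, -0.2], [-0.5, 0.2]]
stds = [[0.0, 0.0], [0.0, 0.0]]  # zero spread makes this example deterministic
action = sample_action(logits, means, stds, rng)
```

Unlike an iterative diffusion sampler, this draws an action with a single categorical choice and one Gaussian sample per dimension, which is consistent with the low-latency motivation stated in the abstract.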

2606.19971 2026-06-19 cs.RO 新提交

Evaluation of Augmented Reality-based Intuitive Interface for Robot-Assisted Transesophageal Echocardiography: A User Study

基于增强现实的机器人辅助经食管超声心动图直观界面评估:用户研究

Xiu Zhang*, Matteo Di Mauro*, Sofia Breschi, Angela Peloso, Emiliano Votta, Arianna Menciassi, Elena De Momi

AI总结 本研究提出并评估了一种基于增强现实的直观界面,用于机器人辅助经食管超声心动图,通过3D可视化与尖端控制显著提升空间精度并降低操作误差。

详情
AI中文摘要

经食管超声心动图(TEE)对于诊断和引导结构性心脏病(SHD)介入治疗至关重要。然而,手动TEE操作需要操作者具备丰富的专业技能,体力消耗大,并且在透视下操作会使临床医生暴露于辐射中。机器人辅助TEE系统已被引入以改进探头操作并减少操作者疲劳,但直观有效的用户界面设计仍是一个开放挑战。本研究提出并评估了一种模型增强的、基于增强现实(AR)的直观界面,用于机器人辅助TEE,旨在提高空间意识和控制直观性。使用集成电磁跟踪和虚拟模拟器的机器人TEE平台,比较了三种在可视化和交互模式上不同的用户界面:2D关节级(2D-JI)、3D关节级(3D-JI)和3D尖端级(3D-TI)。36名参与者执行标准化导航任务以再现目标超声心动图视图,通过位置和方向误差、完成时间和NASA-TLX工作量评分评估性能。结果表明,3D可视化显著提高了空间精度,与2D界面相比,中位位置误差从13毫米减少到3毫米,方向误差减半。尖端级交互相比关节级控制,方向误差进一步降低50%,并减少了用户间变异性。总体而言,3D-TI配置结合了沉浸式可视化与直接尖端级控制,被证明是最有效且符合人体工程学的界面,支持将基于AR的可视化和直观控制范式集成到下一代机器人TEE系统中,以增强操作者性能和手术安全性。

英文摘要

Transesophageal Echocardiography (TEE) is essential for diagnosing and guiding Structural Heart Disease (SHD) interventions. However, manual TEE manipulation demands significant operator expertise, is physically demanding, and exposes clinicians to radiation when performed alongside fluoroscopy. Robotic-assisted TEE systems have been introduced to improve probe handling and reduce operator fatigue, yet the design of intuitive and effective user interfaces remains an open challenge. This study presents and evaluates a model-enhanced, Augmented Reality (AR)-based intuitive interface for robot-assisted TEE, designed to improve spatial awareness and control intuitiveness. A robotic TEE platform integrated with electromagnetic tracking and a virtual simulator was used to compare three user interfaces differing in visualization and interaction modalities: 2D joint-level (2D-JI), 3D joint-level (3D-JI), and 3D tip-level (3D-TI). Thirty-six participants performed standardized navigation tasks to reproduce target echocardiographic views, with performance assessed via position and orientation errors, completion time, and NASA-TLX workload scores. Results show that 3D visualization significantly improved spatial accuracy, reducing median position error from 13 mm to 3 mm and halving the orientation error compared with the 2D interface. Tip-level interaction yielded a further 50% reduction in orientation error and reduced inter-user variability relative to joint-level control. Overall, the 3D-TI configuration, combining immersive visualization with direct tip-level control, proved the most effective and ergonomic interface, supporting the integration of AR-based visualization and intuitive control paradigms into next-generation robotic TEE systems to enhance operator performance and procedural safety.

2606.20120 2026-06-19 cs.RO cs.AI 新提交

Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform

用于将自然语言协议翻译为机器人实验室平台的双智能体跨模型验证框架

Hyeonna Choi, Jung Yup Kim, Hyuneui Lim, Seunggyu Jeon

AI总结 提出双智能体框架,通过解析器形式化协议、规则映射引擎生成控制命令、异构LLM验证器纠错,实现自然语言微孔板协议到机器人平台可执行命令的转换,并验证了端到端自主执行。

详情
AI中文摘要

生物实验协议以自然语言编写,而自动化系统依赖预定义控制命令,这造成了限制自主执行的语义鸿沟。微孔板自动实验由于需要同时控制孔映射、样本-试剂组合、重复放置和平行分配而尤其具有挑战性。本研究提出一种基于智能体的协议翻译框架,将自然语言微孔板协议转换为机器人实验室平台的可执行控制命令。解析器智能体将自然语言协议形式化为结构化表示,基于规则的映射引擎确定性地融入机器人实验室平台的操作约束以生成设备级控制命令。异构LLM验证器检查完整性、参数准确性和执行顺序,并在检测到错误时触发带有结构化反馈的自校正循环。在随机选择的ELISA协议上对7个解析器和3个验证器进行扫描,评估模型规模和验证器类型在跨模型验证下对翻译准确率和通过率的影响。通过将所提框架的基于规则映射与LLM端到端直接映射进行比较,进一步验证了准确率-延迟权衡。最后,在机器人实验室平台上演示了基于Bradford法的微孔板蛋白质定量,验证了从自然语言协议到真实实验的端到端自主执行。所提框架为缩小自然语言协议与基于微孔板的自主实验室之间的语义鸿沟提供了一种灵活方法。

英文摘要

Biological experiment protocols are written in natural language, whereas automation systems rely on predefined control commands, creating a semantic gap that limits autonomous execution. Microplate-based automatic experiments are particularly challenging due to the need to simultaneously control well mapping, sample-reagent combinations, replicate placement, and parallel dispensing. This study proposes an agent-based protocol translation framework that converts natural-language microplate-based protocols into executable control commands for a robotic laboratory platform. A Parser Agent formalizes the natural-language protocol into a structured representation, and a rule-based mapping engine deterministically incorporates the operational constraints of the robotic laboratory platform to generate device-level control commands. A heterogeneous LLM Validation Agent verifies completeness, parameter accuracy, and execution order, and triggers a self-correction loop with structured feedback when errors are detected. A sweep involving 7 Parsers and 3 Validators on randomly selected ELISA protocols evaluates how model scale and Validator type affect translation accuracy and pass rates under cross-model verification. The accuracy-latency trade-off is further verified by comparing the rule-based mapping of the proposed framework with LLM end-to-end direct mapping. Finally, Bradford assay-based protein quantification using a microplate was demonstrated on a robotic laboratory platform, validating end-to-end autonomous execution from natural-language protocols to real-world experiments. The proposed framework provides a flexible approach to narrowing the semantic gap between natural-language protocols and microplate-based self-driving laboratories.

2606.20150 2026-06-19 cs.RO 新提交

Robust Assembly State Reasoning from Action Recognition for Human-Robot Collaboration

面向人机协作的基于动作识别的鲁棒装配状态推理

James Fant-Male, Roel Pieters

发表机构 * Cognitive Robotics group, Unit of Automation Technology and Mechanical Engineering, Tampere University(坦佩雷大学自动化技术与机械工程系认知机器人组)

AI总结 研究从动作识别输入跟踪装配状态的方法,比较逻辑、HMM和神经网络方法,发现最优方法因任务而异,逻辑方法在多变场景更鲁棒。

Comments Preprint accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). 8 pages, 9 figures, 3 tables

详情
AI中文摘要

人类动作识别(HAR)在人机协作(HRC)研究中经常被用于理解已执行的动作以及协作任务的状态。然而,从HAR准确跟踪装配状态尚未得到充分研究,并且在现实场景中并非易事。本研究系统性地调查并比较了使用动作识别输入跟踪装配状态的方法。使用两个不同数据集和五种状态跟踪方法(包括基于逻辑的、隐马尔可夫模型(HMM)和神经网络(NN)方法)进行的调查表明,最优方法在不同任务中并不统一,并且不同方法在不同情况下会失败。测试使用具有不同噪声水平的模拟输入和来自HAR模型的真实输入进行。结果表明,NN和HMM方法在变异性有限的任务中表现良好,但在其他场景中,基于逻辑的方法可能更鲁棒。对于没有额外传感的重复动作任务,建模预期动作持续时间的方法也很重要。

英文摘要

Human Action Recognition (HAR) is frequently investigated in Human-Robot Collaboration (HRC) research to understand what actions have been performed and hence the state of a collaborative task. Accurately tracking an assembly state from HAR has, however, not been fully investigated, and in realistic scenarios it is not a trivial task. This research systematically investigates and compares methods for tracking assembly state using action recognition inputs. Investigations using two diverse datasets and five state tracking approaches, including logic-based, Hidden Markov Model (HMM), and neural network (NN) methods, show that optimal approaches are not uniform across different tasks and that different methods fail under different circumstances. Testing is performed using both simulated inputs with varying noise levels and realistic inputs from a HAR model. Results show NN and HMM methods can perform well in tasks with limited variability, but for other scenarios logic-based approaches can be more robust. Methods that model expected action duration are also important for tasks with repeated actions where no additional sensing is provided.
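
As an illustration of the HMM family of trackers compared in this study, the sketch below runs a forward filter over noisy action labels. It assumes a strictly left-to-right assembly with one expected action per state; the function name `forward_filter`, the stay probability, and the recognition accuracy are hypothetical values, not parameters reported by the authors.

```python
from typing import List, Sequence


def forward_filter(
    observations: Sequence[int],
    n_states: int,
    p_stay: float = 0.6,
    accuracy: float = 0.8,
) -> List[float]:
    """Return the belief over assembly states after a sequence of recognized actions."""
    belief = [1.0] + [0.0] * (n_states - 1)  # assembly starts in state 0
    for obs in observations:
        # Predict: stay in the current state or advance to the next one.
        predicted = [0.0] * n_states
        for s, b in enumerate(belief):
            if s == n_states - 1:
                predicted[s] += b  # final state is absorbing
            else:
                predicted[s] += b * p_stay
                predicted[s + 1] += b * (1.0 - p_stay)
        # Update: the expected action is recognized with probability `accuracy`,
        # any other label shares the remaining probability mass uniformly.
        updated = [
            p * (accuracy if obs == s else (1.0 - accuracy) / (n_states - 1))
            for s, p in enumerate(predicted)
        ]
        total = sum(updated)
        belief = [u / total for u in updated]
    return belief


# Recognized actions with one spurious label (the second 0) in the middle.
labels = [0, 0, 1, 0, 1, 2, 2]
belief = forward_filter(labels, n_states=3)
estimated_state = max(range(3), key=lambda s: belief[s])
```

The single misrecognized label lowers confidence temporarily but does not move the estimate backwards, which is the kind of robustness to noisy HAR input the comparison is about.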

2509.10416 2026-06-19 cs.RO 版本更新

TASC: Task-Aware Shared Control for Relational Telemanipulation

TASC:面向关系遥操作的任务感知共享控制

Ze Fu, Pinhao Song, Yutong Hu, Renaud Detry

发表机构 * KU Leuven, Dept. Mechanical Engineering, Research unit Robotics, Automation and Mechatronics(鲁汶大学机械工程系,机器人、自动化与机电一体化研究单位) KU Leuven, Dept. Electrical Engineering, Research unit Processing Speech and Images(鲁汶大学电气工程系,语音与图像处理研究单位)

AI总结 提出TASC框架,通过视觉构建开放词汇交互图推断任务级用户意图,并基于空间约束提供共享控制辅助,提升关系遥操作效率与泛化能力。

Comments Accepted to IROS 2026

详情
AI中文摘要

我们提出了TASC,一个面向关系遥操作的任务感知共享控制框架,该框架从仅运动输入中推断任务级用户意图并提供辅助。为了在没有预定义模板的情况下支持抓取关系任务,TASC从视觉输入构建一个开放词汇的交互图来表示功能性物体关系,并据此推断用户意图。然后,共享控制策略在抓取和物体交互过程中提供辅助,该辅助由视觉语言模型预测的空间约束引导。我们的方法解决了共享控制下关系遥操作的两个关键挑战:(1)从低级运动命令中推断任务级意图,以及(2)跨不同物体和任务的泛化辅助。在仿真和真实世界的实验表明,与先前方法相比,TASC提高了任务效率并减少了用户输入努力,同时实现了跨多种关系遥操作任务的零样本泛化。支持我们实验的代码公开提供于:https://github.com/fitz0401/tasc。

英文摘要

We present TASC, a Task-Aware Shared Control framework for relational telemanipulation that infers task-level user intent and provides assistance from motion-only input. To support prehensile relational tasks without predefined templates, TASC constructs an open-vocabulary interaction graph from visual input to represent functional object relationships, and infers user intent accordingly. A shared control policy then provides assistance during both grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges in relational telemanipulation under shared control: (1) task-level intent inference from low-level motion commands, and (2) generalizable assistance across diverse objects and tasks. Experiments in both simulation and the real world demonstrate that TASC improves task efficiency and reduces user input effort compared to prior methods, while enabling zero-shot generalization across diverse relational telemanipulation tasks. The code that supports our experiments is publicly available at https://github.com/fitz0401/tasc.

2503.20646 2026-06-19 cs.HC cs.RO cs.SY eess.SY 版本更新

Immersive and Wearable Thermal Rendering for Augmented Reality

增强现实的沉浸式可穿戴热渲染

Alexandra Watkins, Ritam Ghosh, Evan Chow, Nilanjan Sarkar

发表机构 * Vanderbilt University(范德比大学)

AI总结 提出一种掌戴式热反馈原型,通过间接反馈、主动热透传和时空变化渲染策略,在增强现实中实现沉浸式热触觉体验,实验验证了其可行性与权衡。

详情
AI中文摘要

我们提出了一种概念验证的掌戴式热反馈原型,针对增强现实(AR)中的热渲染挑战,用户必须在其物理工作空间中与真实和虚拟物体交互。与为虚拟现实开发的热反馈系统相比,AR热反馈必须保持手部灵活性、维持对真实世界热线索的访问,并在不阻碍自然物体交互的情况下提供连贯的虚拟温度感知。我们提出了三个AR特定的设计考虑,并由我们的原型实现:间接反馈以保持指尖灵活性、主动热透传以感知和渲染接触物理表面的温度,以及手掌上的空间和时间变化热渲染。人体实验评估了AR交互过程中的感知灵敏度、间接反馈、主动热透传、空间模式识别和移动热渲染。结果表明,尽管间接反馈在指尖视觉接触时降低了感知真实感,但并未降低沉浸感或舒适度;主动热透传支持真实与渲染表面之间的温度辨别;时空渲染相比静态热刺激显著提高了沉浸感和真实感。这些发现表明,我们的设计考虑是AR热触觉的可行设计策略,同时澄清了需要精确真实感与更广泛沉浸式热体验的应用之间的权衡。

英文摘要

We present a proof-of-concept palm-mounted thermal feedback prototype addressing thermal rendering challenges specific to augmented reality (AR), where users must interact with both real and virtual objects in their physical workspace. In contrast to thermal feedback systems developed for virtual reality, AR thermal feedback must preserve manual dexterity, maintain access to real-world thermal cues, and provide coherent virtual temperature sensations without obstructing natural object interaction. We propose three AR-specific design considerations, which our prototype implements: indirect feedback to preserve fingertip dexterity, active thermal passthrough to sense and render the temperature of contacted physical surfaces, and spatially and temporally varying thermal rendering across the palm. Human-subject experiments evaluated perceptual sensitivity, indirect feedback, active thermal passthrough, spatial pattern recognition, and moving thermal rendering during AR interaction. Results showed that although indirect feedback reduced perceived realism during visual contact at the fingertips, it did not reduce immersion or comfort; active thermal passthrough supported temperature discrimination between real and rendered surfaces; and spatiotemporal rendering significantly improved immersion and realism compared with static thermal stimulation. These findings suggest that our design considerations are viable design strategies for AR thermal haptics, while also clarifying tradeoffs for applications that require precise realism versus broader immersive thermal experience.

6. 具身智能与视觉语言动作模型 6 篇

2606.19784 2026-06-19 cs.RO 新提交

EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models

EquiVLA: 旋转等变视觉-语言-动作模型的通用框架

Thien-Loc Ha, Quang-Tan Nguyen, Trong-Bao Ho, Long Dinh, Minh Duc Nguyen, Gia-Binh Nguyen, Pham Tri Quang, Minh N. Vu, Duy M. H. Nguyen, An Thai Le, Ngo Anh Vien

发表机构 * VinRobotics VinUniversity DFKI(德国人工智能研究中心) University of Stuttgart(斯图加特大学) IMPRS-IS(国际马克斯·普朗克智能系统研究学院)

AI总结 提出EquiVLA,首个端到端SO(2)等变VLA框架,通过EquiPerceptor和EquiActor实现从视觉到动作的近似等变链,在LIBERO、CALVIN和真实机器人任务上显著提升性能。

Comments First version, 22 pages. Project site: https://equivla.github.io/

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人操作的有力范式,但它们缺乏几何归纳偏置:在特定方向训练的策略需要大量数据才能泛化到不同旋转配置。我们提出 EquiVLA,首个端到端 SO(2) 等变 VLA 模型的通用框架,适用于任何将冻结的视觉-语言骨干与流匹配扩散 Transformer 动作头耦合的架构。EquiVLA 引入了 EquiPerceptor,它从冻结的 ViT 特征生成近似 SO(2) 等变的视觉表示;以及 EquiActor,一个精确 SO(2) 等变的流匹配扩散 Transformer 动作头。两者共同建立了一条从相机观测到预测动作序列的近似 SO(2) 等变链。在 GR00T N1.5 上实例化,并在四个 LIBERO 套件、CALVIN ABCD→D 以及 Mobile ALOHA 上的五个真实机器人任务中评估,EquiVLA 在 LIBERO 上达到 92.6% 的平均成功率(基线为 78.1%),在 CALVIN 上平均序列长度为 4.03(基线为 3.45),并将真实机器人成功率从 54% 提升至 72%。

英文摘要

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present EquiVLA, the first general framework for end-to-end SO(2)-equivariant VLA models, applicable to any architecture coupling a frozen vision-language backbone with a flow-matching Diffusion Transformer action head. EquiVLA introduces EquiPerceptor, which produces approximately SO(2)-equivariant visual representations from frozen ViT features; and EquiActor, an exactly SO(2)-equivariant flow-matching Diffusion Transformer action head. Together, they establish an approximate SO(2) equivariance chain from camera observations to predicted action sequences. Instantiated on GR00T N1.5 and evaluated across four LIBERO suites, CALVIN ABCD→D, and five real-robot tasks on Mobile ALOHA, EquiVLA achieves 92.6% average success on LIBERO (vs. 78.1% baseline), an average sequence length of 4.03 on CALVIN (vs. 3.45), and improves real-robot success from 54% to 72%.
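
SO(2) equivariance means that rotating the input by an angle rotates the output by the same angle, i.e. f(R x) = R f(x). The sketch below checks this property numerically for two toy 2-D policies. Both policies and the helper `equivariance_error` are hypothetical illustrations of the definition, not components of EquiVLA.

```python
import math
from typing import Callable, Tuple

Vec2 = Tuple[float, float]


def rotate(v: Vec2, theta: float) -> Vec2:
    """Rotate a planar vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def equivariant_policy(obj_xy: Vec2) -> Vec2:
    """Move toward the object with a gain that depends only on its distance."""
    gain = 1.0 / (1.0 + math.hypot(obj_xy[0], obj_xy[1]))
    return (gain * obj_xy[0], gain * obj_xy[1])


def biased_policy(obj_xy: Vec2) -> Vec2:
    """Adds a fixed offset along x, which breaks rotational symmetry."""
    return (obj_xy[0] + 0.1, obj_xy[1])


def equivariance_error(policy: Callable[[Vec2], Vec2], obj_xy: Vec2, theta: float) -> float:
    """Distance between policy(R x) and R policy(x); zero for an equivariant policy."""
    a = policy(rotate(obj_xy, theta))
    b = rotate(policy(obj_xy), theta)
    return math.hypot(a[0] - b[0], a[1] - b[1])


err_equivariant = equivariance_error(equivariant_policy, (0.3, -0.4), 1.0)
err_biased = equivariance_error(biased_policy, (0.3, -0.4), 1.0)
```

A policy with this symmetry built in does not need separate demonstrations for each object orientation, which is the data-efficiency argument made in the abstract.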

2606.20246 2026-06-19 cs.RO cs.AI 新提交

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

微调视觉-语言-动作模型所需的层数比你想象的少

Gia-Binh Nguyen, Trong-Bao Ho, Thien-Loc Ha, Khoa Vo, Philip Lund Møller, Quang T. Nguyen, Long Dinh, Tuan Dam, Vu Duong, Tung M. Luu, Trung Le, Tran Nguyen Le, Minh Vu, An Thai Le, Ngan Le, Daniel Sonntag, James Zou, Jan Peters, Duy M. H. Nguyen, Ngo Anh Vien

发表机构 * Center for AI Research, VinUniversity(VinUniversity人工智能研究中心) VinRobotics University of Arkansas(阿肯色大学) Technical University of Denmark(丹麦技术大学) Hanoi University of Science and Technology(河内科技大学) KAIST(韩国科学技术院) Monash University(莫纳什大学) Oldenburg University(奥尔登堡大学) DFKI(德国人工智能研究中心) University of Stuttgart(斯图加特大学) IMPRS-IS(国际马克斯·普朗克智能系统研究学院) Stanford University(斯坦福大学) Technische Universität Darmstadt(达姆施塔特工业大学)

AI总结 本文发现VLA模型存在层间表示冗余,提出无需训练的压缩方法,通过去除冗余层将模型深度减少50%,实现40-50%训练加速和30%推理加速,性能不变。

详情
AI中文摘要

在大规模视频-机器人数据集上预训练的视觉-语言-动作(VLA)模型彻底改变了机器人操作,但其数十亿参数架构在下游微调和实时推理过程中带来了巨大的计算负担。在这项工作中,我们揭示了这些连续控制基础策略(例如pi_0、GR00T-N1.5)的一个高度非平凡的结构特性:尽管在多样化的物理轨迹上训练,它们表现出严重的逐层表示冗余。为了利用这一点,我们引入了一个完全无需训练的结构压缩流程,避免了现有方法需要加载全尺寸模型来学习优化的令牌缩减或动态层选择器的需求。相反,仅通过使用中心核对齐的单次前向传递来识别冗余层特征,我们移除孪生层以永久压缩模型深度高达50%,涵盖VLM主干和连续控制策略头。这种精简架构的下游微调带来了双重加速效益:训练时间减少40-50%,实时推理速度提升高达30%,同时匹配或超越全尺寸基模型性能。我们在三个模拟基准(LIBERO、RoboCasa、SimplerEnv)和10个跨4种不同机器人实体的多样化真实世界操作任务上全面验证了我们的方法。这些结果证明,先进的VLA所需的层数远少于先前假设,为可扩展的机器人学习提供了一种高度计算高效的范式。

英文摘要

Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that is entirely training-free, bypassing the need of existing methods to load full-scale models to learn optimized token reductions or dynamic layer selectors. Instead, using only a single forward pass via Centered Kernel Alignment to identify redundant layer features, we remove twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head. Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding full-scale base model performance. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results prove that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.
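
Centered Kernel Alignment scores how similar two layers' activations are on the same inputs; a score near 1 suggests one of the layers may be redundant. The sketch below implements linear CKA in plain Python and flags adjacent layers above a threshold. The helper `redundant_pairs`, the 0.95 threshold, and the toy activations are assumptions for illustration; the abstract does not specify the selection rule or the CKA variant the authors use.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # rows are samples, columns are features


def center(m: Matrix) -> List[List[float]]:
    """Subtract the per-column mean from every row."""
    n, d = len(m), len(m[0])
    means = [sum(row[j] for row in m) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in m]


def cross_frob_sq(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    xc, yc = center(x), center(y)
    return cross_frob_sq(xc, yc) / math.sqrt(cross_frob_sq(xc, xc) * cross_frob_sq(yc, yc))


def redundant_pairs(layers: Sequence[Matrix], threshold: float = 0.95) -> List[Tuple[int, int]]:
    """Hypothetical rule: flag adjacent layers whose CKA exceeds the threshold."""
    return [
        (i, i + 1)
        for i in range(len(layers) - 1)
        if linear_cka(layers[i], layers[i + 1]) >= threshold
    ]


# Toy activations for three layers on four inputs.
layer0 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
layer1 = [[2.0 * v for v in row] for row in layer0]  # a rescaled copy of layer0
layer2 = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
pairs = redundant_pairs([layer0, layer1, layer2])
```

Because linear CKA is invariant to isotropic scaling, the rescaled layer scores 1.0 against its source and is flagged, while the structurally different third layer is not.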

2606.19451 2026-06-19 cs.LG cs.CV cs.RO 交叉投稿

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

3D-DLP:自监督3D物体中心场景表示学习

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

AI总结 提出3D-DLP模型,通过自监督学习将场景级RGB-D或体素观测分解为3D潜在粒子,每个粒子编码解耦属性,实现可解释的逐粒子分割图,并支持场景操控和下游机器人操作。

Comments ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp

详情
AI中文摘要

我们引入了3D-DLP,一种自监督的物体中心表示学习模型,它将场景级RGB-D或体素观测分解为一组3D潜在粒子。基于深度潜在粒子(DLP)框架,每个粒子编码解耦的属性,包括3D关键点位置、边界框尺寸和外观特征,并代表场景中的一个独特实体。该模型通过端到端的自监督重建目标学习可解释的逐粒子分割图。我们在模拟和真实数据集上证明,学习到的潜在空间是可解释和可控的:通过操纵粒子位置并解码,我们可以生成新颖的场景配置。此外,我们展示了将这些紧凑的3D潜在粒子用于下游机器人操作,相比缺乏显式3D信息或依赖无物体中心结构的密集3D输入的基线方法,性能有所提升。代码和视频可在以下网址获取:https://eubooks3003.github.io/3d-dlp。

英文摘要

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.

2606.19531 2026-06-19 cs.CV cs.RO 交叉投稿

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

ImageWAM:世界动作模型真的需要视频生成,还是只需要图像编辑?

Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Eastern Institute of Technology(东方理工学院) Tencent Robotics X(腾讯机器人X) Tsinghua University(清华大学) Zhongguancun Academy(中关村学院)

AI总结 提出ImageWAM框架,利用预训练图像编辑模型替代视频生成进行机器人动作预测,通过编辑去噪的KV缓存作为世界动作上下文,在多个模拟和真实实验中优于基线,计算量降至1/6,延迟降至1/4。

Comments Project Page: https://zhangwenyao1.github.io/ImageWAM/

详情
AI中文摘要

世界动作模型(WAMs)通常依赖视频生成来桥接视觉世界建模和机器人控制。然而,基于视频的WAMs面临三个耦合的限制:密集的多帧未来令牌使得推理成本高昂,完整的视频预测将容量花费在与动作无关的时间和外观细节上,以及长期未来想象可能引入误导动作预测的错误。这些问题提出了一个简单的问题:世界动作模型真的需要视频生成吗?我们提出ImageWAM,一个简单的WAM框架,将预训练的图像编辑模型重新用于机器人动作预测。与视频生成相比,图像编辑提供了更匹配的先验:它只需要建模目标帧变换,关注与动作相关的当前到目标视觉差异,并通过编辑预训练将任务指令接地到局部视觉变化。在实践中,ImageWAM在推理时不解码目标帧;相反,它根据图像编辑去噪产生的KV缓存条件化一个流匹配动作专家,将其用作紧凑的世界动作上下文。ImageWAM在多个模拟和真实世界实验中优于标准VLA基线和匹配的竞争性WAM,且无需额外的策略预训练。它还将FLOPs降低到基于视频的WAMs的1/6,延迟降低到1/4。注意力分析进一步表明,编辑缓存聚焦于任务相关的变化区域,支持图像编辑作为基于视频的世界动作建模的有效替代方案。

英文摘要

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.

2512.20014 2026-06-19 cs.RO cs.AI 版本更新

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Bring My Cup! 使用视觉注意力提示个性化视觉-语言-动作模型

Sangoh Lee, Sangwoo Mo, Wook-Shin Han

发表机构 * GSAI, POSTECH(POSTECH 人工智能研究所) IME, POSTECH(POSTECH 信息媒体研究所)

AI总结 针对VLA模型难以处理个性化指令的问题,提出无需训练的视觉注意力提示(VAP)方法,通过参考图像作为非参数记忆,利用开放词汇检测和嵌入匹配定位个人物品,并以视觉提示注入模型,在多个仿真和真实场景中显著提升成功率和正确物体操作。

Comments ICML 2026. Project page: https://vap-project.github.io/

详情
AI中文摘要

尽管视觉-语言-动作(VLA)模型能够很好地泛化到通用指令,但在处理个性化命令(如“bring my cup”)时却存在困难,因为机器人必须在视觉相似的物体中识别并操作特定实例。我们研究了这种操作个人物品的场景,其中VLA必须仅使用少量参考图像来识别并控制训练中未见过的用户特定物体。为了解决这一挑战,我们提出了视觉注意力提示(VAP),一种简单而有效的无需训练的感知适配器,为冻结的VLA模型赋予自上而下的选择性注意力。VAP将参考图像视为非参数视觉记忆,通过开放词汇检测和基于嵌入的匹配将个人物品定位到场景中,然后通过突出显示该物体并重写指令,将这种定位作为视觉提示注入模型。我们构建了两个仿真基准(Personalized-SIMPLER和Personalized-VLABench)以及一个真实桌面基准,用于评估多个机器人和任务上的个性化操作。实验表明,VAP在成功率和正确物体操作方面始终优于通用策略和令牌学习基线,有助于弥合语义理解与实例级控制之间的差距。

英文摘要

While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
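
The embedding-based matching step can be sketched as a nearest-reference search with cosine similarity. This sketch assumes that an open-vocabulary detector has already produced candidate crops and that some encoder has mapped reference images and crops to vectors; `match_personal_object`, the 0.5 threshold, and the toy embeddings are hypothetical and not taken from the VAP implementation.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def match_personal_object(
    reference_embeddings: Sequence[Sequence[float]],
    candidate_embeddings: Sequence[Sequence[float]],
    min_score: float = 0.5,
) -> Optional[int]:
    """Return the index of the candidate closest to any reference, or None if all are weak."""
    best_idx, best_score = None, min_score
    for idx, cand in enumerate(candidate_embeddings):
        score = max(cosine(ref, cand) for ref in reference_embeddings)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


# Toy embeddings: two reference views of "my cup" and three detected candidates.
references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.05, 0.0], [0.0, 0.0, 1.0]]
selected = match_personal_object(references, candidates)
rejected = match_personal_object(references, [[0.0, 1.0, 0.0]])
```

The selected index identifies which detected object would be highlighted in the visual prompt; returning None when no candidate is similar enough avoids highlighting the wrong object.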

2605.23733 2026-06-19 cs.RO cs.AI 版本更新

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 高效跨本体迁移用于人形机器人全身跟踪

Ming Yang, Tao Yu, Feng Li, Hua Chen

发表机构 * LimX Dynamics(LimX动力学)

AI总结 提出Any2Any范式,通过运动学对齐和动力学微调,实现预训练全身跟踪模型高效迁移至新的人形机器人本体,仅需少量数据和计算即可达到竞争性跟踪性能。

Comments Project Page: https://any2any.top/

详情
AI中文摘要

全身跟踪(WBT)模型已成为人形机器人的关键基础,使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算,使得在新人形平台上快速部署成本高昂。这自然引发一个问题:预训练的WBT模型能否通过最小化适应跨本体迁移?为回答这个问题,我们提出Any2Any,一种范式,能够高效地将现有WBT专家迁移到新人形本体,仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐,对齐其输入和输出空间,使得预训练的源策略可以在目标本体上有意义地重用。然后,Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调(PEFT)组件进行动力学适应,保留有用的行为先验,同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明,与从头训练相比,Any2Any显著加速收敛并降低训练成本,同时实现具有竞争力或更优的跟踪性能。值得注意的是,仅使用完整训练所需计算和数据的1%,Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明,预训练的WBT专家可以跨本体高效重用,为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots. More results and videos are available on our project page: https://any2any.top/.

7. 多机器人与群体系统 6 篇

2606.19632 2026-06-19 cs.RO cs.AI cs.LG cs.LO cs.MA 新提交

Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

通过决策树蒸馏对学习到的多智能体通信策略进行形式化验证

Ahmad Farooq, Kamran Iqbal

发表机构 * University of Arkansas at Little Rock(阿肯色大学小石城分校)

AI总结 提出通过决策树蒸馏将多智能体强化学习策略转化为可解释模型,并利用PRISM进行形式化验证,确保安全属性转移至原始网络,在无人机编队任务中实现88.9%属性满足率。

Comments 9 pages, 3 figures, 7 tables. Accepted at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026), Pittsburgh, Pennsylvania, USA, September 27-October 1, 2026

详情
AI中文摘要

多智能体强化学习使智能体能够通过涌现通信发展协调策略,但神经策略缺乏无人机群和自动驾驶车队等安全关键机器人部署所需的形式化安全保证。我们提出了首个通过学习策略抽象进行安全验证的端到端框架:神经策略被蒸馏为可解释的决策树,然后进行形式化验证,并通过经验验证确认验证的安全属性可转移至原始网络。我们的四阶段流程包括:从智能体观测中提取领域特定特征;决策树蒸馏达到97.9% +/- 1.2%的神经策略保真度;自动翻译为PRISM概率模型检查器规范,具有完整的特征到状态变量对应关系;以及通过成对分解、联合界聚合和经验邻居建模对概率计算树逻辑属性进行组合验证。评估用于5-7个智能体多无人机协调的矢量量化变分信息瓶颈策略,我们验证了18个涵盖安全性、活性和合作的时间逻辑属性,实现了88.9%的属性满足率,所有五个安全阈值均满足(碰撞概率0.3% vs 阈值1%)。原始神经策略的蒙特卡洛验证确认验证的安全属性转移偏差<=0.6个百分点(95%置信区间)。离散VQ-VIB消息相比连续方法提供+11.6至+13.6个百分点的保真度优势,实现3-4倍更快的验证。我们的框架为蒸馏策略抽象提供了经验验证的安全验证,作为深度多智能体强化学习与多机器人部署形式化安全工作流之间的实用桥梁。

英文摘要

Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
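
Illustrative note (not from the paper): the union-bound aggregation step mentioned in the abstract can be sketched in a few lines. The sketch assumes each agent pair has already been checked separately and yielded a collision-probability bound. The pair probabilities, the agent count, and the function name `union_bound` are hypothetical, chosen only to show the arithmetic.

```python
from itertools import combinations


def union_bound(pairwise_probs):
    """Upper-bound P(at least one pairwise event) by the sum of pair probabilities."""
    return min(1.0, sum(pairwise_probs.values()))


# Made-up per-pair collision probabilities for 5 agents (10 unordered pairs),
# standing in for the output of pairwise model checking.
agents = range(5)
pairwise = {pair: 0.0005 for pair in combinations(agents, 2)}

bound = union_bound(pairwise)
threshold = 0.01  # hypothetical system-level safety threshold

print(f"{len(pairwise)} pairs, system-level bound = {bound:.4f}")
print("meets threshold" if bound <= threshold else "inconclusive")
```

Because the union bound ignores overlap between pairwise events, the result is conservative: it can only overestimate the system-level probability.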

2606.19920 2026-06-19 cs.RO cs.LG cs.MA 新提交

Deep-Unfolded Coordination

深度展开协调

Hunter Kuperman, Minchan Jung, Rahul V. Ghosh, Alex Oshin, Evangelos A. Theodorou

发表机构 * Autonomous Control and Decision Systems Laboratory Georgia Institute of Technology United States(佐治亚理工学院自主控制与决策系统实验室)

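AI summary Proposes Deep Coordinator, a deep-unfolding framework that unrolls ADMM-DDP iterations and learns to adjust hyperparameters dynamically, adapting the penalty parameters of a non-convex optimizer at solve time. In simulations of car and quadrotor fleets it is 6.18-9.44x faster and keeps its advantage on systems up to 8x larger than those seen in training.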

Comments The second and third authors contributed equally (equal second authorship). 35 pages (10 pages main text), 17 figures, 3 tables

详情
AI中文摘要

分布式优化是一种高度可扩展且结构透明的技术,用于解决多机器人问题;然而,这类方法通常需要高度专门化、针对特定问题的超参数调整。在这项工作中,我们提出了Deep Coordinator,一个深度展开框架,学习在求解时根据优化器性能动态调整ADMM-DDP(一种流行的机器人任务分布式求解器)的超参数。我们的架构包括将固定数量的ADMM-DDP迭代展开成一个神经网络,层之间具有可学习的函数,将优化器状态映射到下一个超参数。据我们所知,Deep Coordinator是第一个在求解时调整非凸优化器惩罚参数的深度展开框架;我们展示了主流的监督方法在训练此类模型时可能产生退化解,并提出了一种无监督学习方案。在车队和四旋翼飞行器的仿真中,Deep Coordinator生成的轨迹质量与常规求解器相当,但速度快6.18-9.44倍。此外,当部署到比训练规模大8倍的系统时,Deep Coordinator仍能保持其性能优势。

英文摘要

Distributed optimization is a highly scalable and structurally transparent technique to solve multi-agent robotics problems; however, such methods often suffer from the need for highly specialized, problem-specific hyperparameter tuning. In this work, we propose Deep Coordinator, a deep-unfolding framework that learns to dynamically adjust the hyperparameters of ADMM-DDP, a popular distributed solver for robotics tasks, at solve-time in response to optimizer performance. Our architecture consists of unrolling a fixed number of ADMM-DDP iterations into a neural network with learnable functions between layers mapping the optimizer state to the next hyperparameters. To the best of our knowledge, Deep Coordinator is the first deep-unfolding framework to adapt the penalty parameters of a non-convex optimizer at solve-time; we show that the mainstream supervised approach can yield degenerate solutions when training such models, and propose an unsupervised learning scheme. On simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18-9.44x faster than conventional solvers. Furthermore, Deep Coordinator retains its performance benefits when deployed to systems up to 8x larger than those it was trained on.
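
Illustrative note (not from the paper): for readers unfamiliar with solve-time penalty adaptation, the sketch below shows the classical hand-designed rule, residual balancing, on a toy scalar consensus problem. It is neither ADMM-DDP nor Deep Coordinator's learned update. The toy problem, the function names, and the constants `mu` and `tau` are assumptions made for illustration.

```python
import numpy as np


def residual_balancing(rho, primal_res, dual_res, mu=10.0, tau=2.0):
    """Classical heuristic: grow rho if the primal residual dominates, shrink it if the dual does."""
    if primal_res > mu * dual_res:
        return rho * tau
    if dual_res > mu * primal_res:
        return rho / tau
    return rho


def consensus_admm(a, rho=1.0, iters=40):
    """Scaled-form ADMM for: minimize sum_i 0.5*(x_i - a_i)^2 subject to x_i = z."""
    a = np.asarray(a, dtype=float)
    x, z, u = a.copy(), 0.0, np.zeros_like(a)
    for _ in range(iters):
        x = (a + rho * (z - u)) / (1.0 + rho)      # local updates, one per agent
        z_old, z = z, float(np.mean(x + u))        # consensus (averaging) update
        u = u + x - z                              # scaled dual update
        primal = np.linalg.norm(x - z)
        dual = rho * np.sqrt(a.size) * abs(z - z_old)
        new_rho = residual_balancing(rho, primal, dual)
        u *= rho / new_rho                         # keep the unscaled dual rho*u unchanged
        rho = new_rho
    return z, rho


z_star, final_rho = consensus_admm([1.0, 2.0, 6.0])
print(f"consensus value = {z_star:.6f}, final rho = {final_rho}")
```

A learned scheme in the spirit of the abstract would put a trained function of the optimizer state where `residual_balancing` sits here; how the paper parameterizes that function is not stated in the abstract.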

2606.20031 2026-06-19 cs.RO cs.AI 新提交

A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems

一种用于机器人移动履行系统高效路径规划的神经形态强化学习框架

Junzhe Xu, Zecui Zeng, Lusong Li, Yuetong Fang, Renjing Xu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) JD Explore Academy(京东探索研究院)

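AI summary Proposes SDQN-RMFS, which uses ANN-to-SNN conversion with hard-label knowledge distillation to run ultra-low-power pathfinding on a neuromorphic chip, with up to 11,281x lower energy use and nearly half the latency of a GPU baseline.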

详情
AI中文摘要

动态环境变化、受限工作空间和严格的实时约束使得机器人移动履行系统(RMFS)中的路径规划对传统的搜索和基于规则的方法来说是一个具有挑战性的问题,这些方法通常遭受高计算复杂性和长决策延迟。虽然强化学习(RL)已成为一种强大的替代方案,但在资源受限的硬件上以极端的能源效率部署学习到的策略仍然是一个开放的挑战。我们提出了SDQN-RMFS,一个端到端的框架,实现了从全精度人工神经网络(ANN)训练的RL策略到神经形态芯片的高保真部署。通过仅在稀疏事件触发时进行计算,该框架实现了超低功耗的RMFS路径规划。我们的全栈流水线操作如下:首先通过碰撞允许策略高效训练ANN策略以密集化信息轨迹,然后通过硬标签知识蒸馏方法将其转换为脉冲神经网络(SNN)。这有效地解决了输出分布不匹配问题,在保持策略能力的同时显著降低了推理延迟。硬件实验表明,与高性能GPU基线相比,能耗节省高达11281倍,延迟几乎减少两倍,同时决策质量与原始训练策略相当。这些结果确立了物理神经形态推理作为大规模RMFS运营的实用且能源可持续的途径。

英文摘要

Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281x energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.

2606.20232 2026-06-19 cs.RO cs.GT 新提交

Mobile Target Search with Imperfect Perception: A Partially Observable Stochastic Game Theoretical Approach

不完美感知下的移动目标搜索:一种部分可观测随机博弈论方法

Hanzheng Zhang, Shu Liang, Shuyu Liu

发表机构 * Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University(同济大学上海自主智能无人系统科学中心) Department of Control Science and Engineering, Tongji University(同济大学控制科学与工程系)

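AI summary Addresses imperfect perception caused by sensor limitations, malicious jamming, or communication noise by modeling the adversarial searcher-target interaction as a partially observable stochastic game (POSG). Introduces a detectability concept with sufficient criteria based on stochastic recurrence analysis, and develops a server-assisted distributed algorithm.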

详情
AI中文摘要

本文研究了在传感器限制、恶意干扰或通信噪声导致的不完美感知下的移动目标搜索问题。搜索者和目标在具有有限移动性的网格状区域中运行,导致搜索与逃避之间的动态相互作用。为了捕捉不完美感知下的这种对抗互动,我们采用部分可观测随机博弈(POSG)方法,该方法通过引入目标智能来推广部分可观测马尔可夫决策过程(POMDP)。为了处理感知不确定性引起的虚警和漏检,我们提出了一种新颖的可检测性概念,以确定搜索策略是否能保证最终检测,并基于随机递归分析提供了充分的可检测性准则。我们进一步开发了一种服务器辅助的分布式算法,该算法利用搜索者的聚合势博弈结构和基于KL散度的目标预测约简。数值模拟验证了所提算法的有效性,并支持了可检测性分析。

英文摘要

This paper investigates mobile target search under imperfect perceptions caused by sensor limitations, malicious jamming, or communication noise. Searchers and targets operate in a grid-shaped area with bounded mobility, leading to a dynamic interplay between search and evasion. To capture this adversarial interaction under imperfect perceptions, we adopt the partially observable stochastic game (POSG) approach, which generalizes partially observable Markov decision processes (POMDPs) by incorporating target intelligence. To handle false alarms and missed detections caused by perceptual uncertainties, we propose a novel detectability concept to determine whether a search strategy guarantees eventual detection, and provide sufficient detectability criteria based on stochastic recurrence analysis. We further develop a server-assisted distributed algorithm that utilizes the aggregative potential game structure for searchers and a KL-divergence-based reduction for target prediction. Numerical simulations validate the effectiveness of the proposed algorithm and support the detectability analysis.

2606.20365 2026-06-19 cs.RO cs.MA 新提交

An Infrastructure-less, Control-Independent Solution to Relative Localisation of a Team of Mobile Robots using Ranging Measurements

基于测距的移动机器人团队相对定位的无基础设施、控制无关解决方案

Paolo Golinelli, Tommaso Faraci, Daniele Fontanelli

发表机构 * Department of Industrial Engineering, University of Trento(特伦托大学工业工程系) Department of Information Engineering and Computer Science, University of Trento(特伦托大学信息工程与计算机科学系)

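AI summary Proposes an anchor-less, fully decentralised cooperative localisation algorithm that relies only on local odometry, sparse ranging, and short-range communication, does not need to control robot motion to ensure team observability, and uses a multi-hypothesis Bayesian framework for robustness.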

详情
AI中文摘要

定位机器人团队的能力对于从非结构化环境中的机器人舰队到协作控制和导航任务等应用至关重要。在此类场景中,固定基础设施通常不可用,部署必须快速灵活,系统要求必须最小化。我们提出了一种去中心化协作定位算法,同时解决了所有这些挑战。该方法无锚点、完全去中心化,并且与大多数现有方法不同,不需要控制机器人运动来确保团队可观测性。它仅依赖局部里程计、稀疏的代理间测距测量和短程通信,这些在实践中广泛可用。该算法采用多假设贝叶斯框架,维护所有可行解集,确保在瞬态不可观测条件下的鲁棒性。此外,通过信息共享,每个代理都能受益于整个群体的估计,即使在部分连接条件下也是如此。

英文摘要

The ability to localise teams of robots is essential for applications ranging from robotic fleets in unstructured environments to cooperative control and navigation tasks. In such contexts, fixed infrastructure is often unavailable, deployments must be fast and flexible, and system requirements must be minimal. We present a decentralised cooperative localisation algorithm that addresses all these challenges at once. The method is anchor-less, fully decentralised, and, unlike most existing approaches, does not require controlling the robots' motion to ensure team observability. It relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all of which are widely available in practice. The algorithm adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, ensuring robustness under transient unobservable conditions. Moreover, through information sharing, each agent benefits from the estimates of the entire group, even in partially connected conditions.

2605.22748 2026-06-19 cs.RO cs.AI cs.LG cs.MA 版本更新

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

通过多智能体强化学习实现超人类安全且敏捷的赛车

Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza

发表机构 * Robotics and Perception Group, University of Zurich(苏黎世大学机器人与感知组) Google DeepMind(谷歌DeepMind) Nomagic

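AI summary Uses multi-agent reinforcement learning to achieve safe and agile high-speed quadrotor racing, shows that multi-agent interaction is key to safety in real-world interaction, and outperforms a human pilot in high-speed races while reducing collision rates.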

Comments 12 pages (+4 supplementary). Website: https://rpg.ifi.uzh.ch/marl

详情
AI中文摘要

自主系统在孤立或模拟环境中已实现超人类性能,但在共享、动态的真实世界空间中仍显得脆弱。这种失败源于物理应用中主导的单智能体范式,其中其他参与者被忽略或视为环境噪声,阻碍了有效协调。本文证明多智能体强化学习为真实世界交互提供了必要的安全性基础。使用高速四旋翼赛车作为高风险测试平台,训练智能体在复杂空气动力学相互作用和战略机动中导航,具有可变数量的赛车。通过联赛基于的自我对战,智能体进化出复杂的前瞻性行为,包括主动避障、超车和处理多智能体物理交互,包括空气动力学下洗。我们的智能体在超过22米/秒的速度下多玩家赛车中超越了冠军级人类飞行员,同时与最先进的单智能体基线相比,碰撞率减少了50%。关键的是,使用多样化的人工智能体进行训练能够实现零样本泛化到更安全的人类交互。这些结果表明,实现稳健的机器人共存的路径不在于孤立的安全约束,而在于多智能体交互的严格要求。多媒体材料可在:https://rpg.ifi.uzh.ch/marl

英文摘要

Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50% compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

8. Autonomous Vehicles, UAVs, and Mobile Robots 12 papers

2606.19641 2026-06-19 cs.RO cs.CV 新提交

Scaling Self-Play for End-to-End Driving

扩展端到端驾驶的自我对弈

Luke Rowe, Roger Girgis, Rodrigue de Schaetzen, Daphne Cornelisse, Alaap Grandhi, Felix Heide, Eugene Vinitsky, Christopher Pal, Liam Paull

发表机构 * Mila(米拉研究所) Université de Montréal(蒙特利尔大学) Polytechnique Montréal(蒙特利尔理工学院) Torc Robotics NYU Tandon School of Engineering(纽约大学坦登工程学院) McMaster University(麦克马斯特大学) Princeton University(普林斯顿大学)

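AI summary Proposes large-scale self-play training directly from pixels, enabled by the high-throughput simulator Gigapixel and combined with DAgger distillation and perception adaptation, to improve end-to-end driving models.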

详情
AI中文摘要

端到端自动驾驶模型通常基于离线的人类演示数据集进行训练,这些数据集提供的状态覆盖有限,且通常没有闭环反馈,使得模型在闭环部署时容易出现复合误差,并对长尾智能体交互脆弱。为克服这些限制,我们提出了一种替代策略:直接在模拟中的像素上进行大规模自我对弈。虽然先前的自我对弈方法已显示出向真实世界驾驶的有前景的迁移,但它们通常假设向量化的鸟瞰图(BEV)观测,这与直接基于传感器观测的端到端策略不兼容。为此,我们引入了Gigapixel,一个具有透视渲染的高吞吐量批处理驾驶模拟器,实现了直接从像素观测的可扩展自我对弈。Gigapixel并非针对计算成本高的逼真传感器模拟,而是渲染一个简化的边界框世界,保留基本场景结构,同时实现每秒5万智能体步的吞吐量。由于直接像素空间的自我对弈强化学习在端到端模型规模下样本效率极低,我们提出了自我对弈DAgger训练:通过从特权RL教师进行在线策略蒸馏来训练基于像素的策略。为弥合模拟到现实的差距,我们随后通过轻量级感知适应将自我对弈训练的策略迁移到真实世界传感器数据。在Gigapixel中训练并适应真实世界传感器数据的策略在HUGSIM和NAVSIM-v2基准测试中取得了竞争性表现,无需人类轨迹监督。此外,扩展自我对弈训练带来策略性能的成比例提升,确立了自我对弈作为训练端到端模型的实用且可扩展的策略。

英文摘要

End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving a throughput of 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.

2606.19672 2026-06-19 cs.RO 新提交

Safe Local Navigation for Ackermann-Steered Robots in Unmapped Environments

阿克曼转向机器人在未映射环境中的安全局部导航

Christian Schaible, Shahin Sirouspour

发表机构 * McMaster University(麦克马斯特大学)

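AI summary Proposes a control framework that determines the safest heading angle from local obstacle detections, constructs bounding lines, and optimizes vehicle-to-obstacle clearance, enabling safe local navigation of Ackermann-steered robots in environments without a global goal.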

Comments Presented at the 23rd Conference on Robots and Vision (CRV 2026)

Journal ref Proc. 23rd Conference on Robots and Vision (CRV), 2026

详情
AI中文摘要

提出了一种控制框架,用于在缺乏全局目标的未映射环境中,对配备阿克曼转向的移动机器人进行安全局部导航。基于局部障碍物检测,沿车辆前方最大开阔空间方向确定最安全航向角。在该方向引导下,在车辆左右两侧构建边界线以实现障碍物分离。这些边界线通过求解一个最大化车辆-障碍物间距的凸二次优化获得。可选地,对边界线施加约束以保持平行性并平滑先前控制步骤的突变。然后使用反馈线性化控制器调节车辆与一条或两条边界线的距离,从而有效跟踪通过最大化障碍物间距保证安全的局部参考路径。该控制方案包含开源代码。实验结果表明,与一些现有的基于探索的规划器相比,所提方法生成的导航路径更安全,计算时间显著缩短。

英文摘要

A control framework is proposed for safe local navigation of mobile robots equipped with Ackermann steering in unmapped environments where a global goal is absent. Based on local obstacle detections, the safest heading angle is determined along the direction of the largest open space ahead of the vehicle. Guided by this direction, bounding lines are constructed on the left and right sides of the vehicle to achieve obstacle separation. These bounding lines are obtained by solving a convex quadratic optimization that maximizes vehicle-to-obstacle clearance. Optionally, conditions are imposed on the bounding lines to preserve parallelism and smooth abrupt changes from prior control steps. A feedback-linearizing controller is then used to regulate the vehicle's distance from one or both bounding lines, effectively enabling tracking of a local reference path that preserves safety through obstacle clearance maximization. Open-source code is included for the application of this control scheme. Experimental results demonstrate that the proposed method produces safer navigation paths with significantly shorter computation times, compared to some existing exploration-based planners.
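
Illustrative note (not from the paper): one simple reading of "the direction of the largest open space ahead of the vehicle" is the center of the widest run of range beams that exceed a clearance distance. The sketch below implements only that reading. The scan layout, the `clear_range` threshold, and the function name are assumptions, and the bounding-line optimization and the controller described in the abstract are not covered.

```python
import numpy as np


def safest_heading(angles, ranges, clear_range):
    """Center angle of the widest run of consecutive beams with range >= clear_range.

    Returns None when no beam is clear. Assumes a forward-facing scan with
    beams sorted by angle (no wrap-around).
    """
    free = np.asarray(ranges) >= clear_range
    best_start, best_len, start = None, 0, None
    # A trailing False acts as a sentinel that closes a run ending at the last beam.
    for i, is_free in enumerate(np.append(free, False)):
        if is_free and start is None:
            start = i
        elif not is_free and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_start is None:
        return None
    gap = np.asarray(angles)[best_start:best_start + best_len]
    return float((gap[0] + gap[-1]) / 2.0)


# Hypothetical scan: 19 beams from -90 to +90 degrees, mostly blocked at 1 m.
angles = np.deg2rad(np.arange(-90, 91, 10))
ranges = np.full(angles.shape, 1.0)
ranges[2:4] = 6.0     # narrow opening at -70..-60 degrees
ranges[10:15] = 6.0   # wider opening at +10..+50 degrees

heading = safest_heading(angles, ranges, clear_range=3.0)
print(f"heading = {np.rad2deg(heading):.1f} deg")
```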

2606.19711 2026-06-19 cs.RO cs.LG cs.SY eess.SY 新提交

A Differentiable Composite Approximation Framework for Autonomous Underwater Vehicle Maneuvering Modeling from Sea-Trial Data

一种可微复合近似框架:基于海试数据的自主水下航行器机动建模

Aobo Wang, Aifei Xia, Zihao Wang, Lizhu Hao

发表机构 * College of Shipbuilding Engineering, Harbin Engineering University(哈尔滨工程大学船舶工程学院) China Academy of Aerospace Aerodynamics(中国航天空气动力技术研究院) Institute of Artificial Intelligence, Shanghai University(上海大学人工智能研究院) China Ship Scientific Research Center(中国船舶科学研究中心)

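AI summary Proposes a differentiable composite-approximation framework that jointly calibrates polynomial and data-adaptive bases and adds turning-motion-based ocean-current estimation and compensation, improving AUV maneuvering prediction accuracy.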

详情
AI中文摘要

基于机载测量的场建模可以生成反映真实运行特性的自主水下航行器(AUV)机动模型。从近似角度看,传统机动模型使用预定义的约束多项式基,而数据驱动模型使用数据自适应基。受此基函数视角启发,本文提出一种可微复合近似公式,其中多项式基分量和数据自适应基分量被视为单个预测器的可微部分并联合校准。开发了一种基于梯度的协同校准方法用于全尺寸AUV机动预测,其中灵敏度感知机制调节有界多项式更新,而神经残差在共享预测目标下捕获剩余非线性差异。为了考虑现场数据中的海流效应,引入了一种基于转向运动的电流估计和补偿程序,以构建电流补偿的学习目标用于训练和滚动预测。该框架使用从7米长AUV在多种机动条件下收集的海试数据进行评估。结果表明,与纯多项式、纯神经网络和冻结先验混合基线相比,所提方法改进了递归轨迹和速度预测,证明了其在基于现场数据的AUV机动建模中的适用性。

英文摘要

Field-based modeling from onboard measurements can produce autonomous underwater vehicle (AUV) maneuvering models that reflect real operating characteristics. From an approximation perspective, conventional maneuvering models use predefined constraint polynomial bases, whereas data-driven models use data-adaptive bases. Motivated by this basis-function view, this paper presents a differentiable composite-approximation formulation, in which the polynomial-basis component and the data-adaptive basis component are treated as differentiable parts of a single predictor and calibrated jointly. A gradient-based co-calibration method is developed for full-scale AUV maneuvering prediction, where a sensitivity-aware mechanism regulates bounded polynomial updates while the neural residual captures remaining nonlinear discrepancies under a shared prediction objective. To account for ocean-current effects in field data, a turning-motion-based current estimation and compensation procedure is incorporated to construct current-compensated learning targets for training and rollout. The framework is evaluated using sea-trial data collected from a 7-meter AUV under multiple maneuvering conditions. Results show that the proposed method improves recursive trajectory and velocity prediction compared with polynomial-only, neural-only, and frozen-prior hybrid baselines, demonstrating its applicability to field-data-based AUV maneuvering modeling.

2606.19836 2026-06-19 cs.RO cs.CV 新提交

World Engine: Towards the Era of Post-Training for Autonomous Driving

World Engine:迈向自动驾驶后训练时代

Tianyu Li, Li Chen, Caojun Wang, Haochen Liu, Kashyap Chitta, Zhenjie Yang, Yuhang Lu, Naisheng Ye, Yihang Qiu, Yufei Wang, Luoxi Zou, Jiaxin Peng, Jin Pan, Zhaoyu Su, Andrei Bursuc, Shengbo Eben Li, Andreas Geiger, Peng Su, Hongyang Li

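AI summary Proposes World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and extrapolates them into safety-critical variations, then uses reinforcement-based post-training to align policies with safety constraints, substantially reducing failures in rare safety-critical scenarios and improving driving safety.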

Comments Technical Report. Project Page: https://opendrivelab.com/WorldEngine/

详情
AI中文摘要

自动驾驶车辆必须在现实世界中安全运行,而错误可能带来严重后果。尽管现代端到端驾驶策略在常规场景中表现出色,但其可靠性受限于真实驾驶数据集中安全关键的“长尾”事件的稀缺性。这些罕见交互定义了学习策略的实际安全边界,但在现实世界中难以大规模收集。我们展示了这一根本限制可以通过在合成的关键交互上对预训练驾驶模型进行后训练来解决。我们引入了World Engine,一个生成式框架,从真实日志中重建高保真交互环境,并系统性地将其外推为现实的安全关键变体。这一范式使得基于强化的后训练能够将策略与安全约束对齐,规避现实世界探索中固有的物理风险。在基于nuPlan构建的公开基准上,World Engine显著减少了罕见安全关键场景中的故障,并且相比仅扩展预训练数据带来了更大的增益。此外,当部署到生产级自动驾驶系统时,所得策略减少了模拟碰撞,并在道路测试中显示出可衡量的改进,表明在合成的安全关键交互上进行后训练为更安全的自动驾驶提供了一条可扩展且有效的途径。完整的代码库套件(包括训练)已向公众发布。

英文摘要

Autonomous vehicles must operate safely in the real world, where errors can have severe consequences. Although modern end-to-end driving policies excel in routine scenarios, their reliability is limited by the scarcity of safety-critical "long-tail" events in real driving datasets. These rare interactions define the practical safety boundary of the learned policy, yet they are difficult to collect at scale in the real world. Here we show that this fundamental limitation can be addressed by post-training pre-trained driving models on synthesized high-stakes interactions. We introduce World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and systematically extrapolates them into realistic safety-critical variations. This paradigm enables reinforcement-based post-training to align policies with safety constraints, circumventing the physical risks inherent in real-world exploration. On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields significantly larger gains than scaling pre-training data alone. Furthermore, when deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and demonstrates measurable improvements in on-road testing, showing that post-training on synthesized, safety-critical interactions offers a scalable and effective pathway to safer autonomous driving. The full codebase suite, including training, is released to the public.

2606.19929 2026-06-19 cs.RO 新提交

Motor Angular Speed Preintegration for Multirotor UAV State Estimation

多旋翼无人机状态估计中的电机角速度预积分

Matěj Petrlík, Filip Novák, Robert Pěnička, Martin Saska

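AI summary To counter the loss of IMU accuracy caused by propeller vibrations on UAVs, proposes preintegrating accelerations derived from motor speeds to replace the IMU in state propagation, builds a factor for factor-graph optimization, and combines it with LiDAR into the MAS-LO algorithm, improving position accuracy by 28% and velocity accuracy by 65% over LIO-SAM.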

详情
AI中文摘要

精确的状态估计对于实现无人机的敏捷和近障碍飞行所需的紧密反馈控制至关重要。最先进的方法融合慢速位姿测量与高频惯性测量以获得精确的状态估计。然而,来自无人机上IMU的惯性测量会受到旋转螺旋桨振动的退化,导致估计状态的精度下降。我们提出了一种基于电机转速加速度预积分的新方法。我们展示了以这种方式获得的加速度可以单独用于状态传播,在不包含IMU的情况下实现更好的精度。此外,我们提出了一个由预积分电机转速组成的因子,可以直接用于因子图优化框架。我们将该因子与LiDAR测量结合,提出电机角速度LiDAR里程计(MAS-LO)算法,用于精确状态估计,并开源该算法。最后,我们与最先进的惯性算法LIO-SAM进行估计精度评估,结果显示位置估计精度提升28%,速度估计精度提升65%,测量延迟降低14%,并且对错误参数值具有高鲁棒性。

英文摘要

A precise state estimate is crucial for the tight feedback control that enables agile and near-obstacle flights of UAVs. The state-of-the-art methods fuse slow pose measurements with high-frequency inertial measurements to obtain a precise state estimate. However, the inertial measurements from the IMU onboard the UAV are degraded by vibrations from spinning propellers, and the precision of the estimated state suffers. We propose a novel approach based on the preintegration of accelerations obtained from motor speeds. We show that the accelerations obtained in this manner can be used for state propagation on their own to achieve better precision without including the IMU. Further, we propose a factor composed of the preintegrated motor speeds that can be directly employed in factor graph optimization frameworks. We combine our factor with LiDAR measurements into the proposed Motor Angular Speed LiDAR Odometry (MAS-LO) algorithm for precise state estimation, which we open-source. Lastly, we evaluate the estimation precision against the state-of-the-art inertial algorithm LIO-SAM, showing a 28% improvement in position and a 65% improvement in velocity estimation accuracy, 14% lower measurement lag, and high robustness to wrong parameter values.

2606.20336 2026-06-19 cs.RO 新提交

Autonomous Driving with Priority-Ordered STL Specifications Under Multimodal Uncertainty

多模态不确定性下基于优先级排序STL规范的自动驾驶

Taha Bouzid, Shuhao Qi, Mircea Lazar, Sofie Haesaert

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

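AI summary Proposes an uncertainty-aware trajectory planning framework that resolves conflicting objectives through a lexicographic priority ordering over Signal Temporal Logic specifications, implemented with Model Predictive Path Integral control and validated in simulation.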

详情
AI中文摘要

自动驾驶车辆必须规划满足安全、乘客舒适度和交通规则等多重要求的轨迹。然而,在安全关键场景中,不可能同时满足所有要求,因此需要根据重要性进行优先级排序。同时,在这些安全关键场景中,应明确考虑周围交通(如其他车辆和行人)轨迹预测的不确定性。在这项工作中,我们提出了一种不确定性感知的轨迹规划框架,该框架结合了信号时序逻辑(STL)规范上的预定义词典序,该排序在不确定性下仍然有效。我们使用模型预测路径积分(MPPI)控制实现了该公式,并在仿真场景中展示了我们方法的有效性,表明我们的框架在现实的多模态不确定性下有效处理了冲突目标。

英文摘要

Autonomous vehicles must plan trajectories that satisfy a multitude of requirements on safety, passenger comfort, and compliance with traffic rules. However, in safety-critical scenarios, it is not always possible to satisfy all requirements simultaneously, necessitating their prioritization based on importance. At the same time, in these safety-critical scenarios, the uncertainty in trajectory predictions of the surrounding traffic, such as other vehicles and pedestrians, should be explicitly accounted for. In this work, we propose an uncertainty-aware trajectory planning framework that incorporates a predefined lexicographic ordering over Signal Temporal Logic (STL) specifications that stays valid under uncertainty. We implement this formulation with Model Predictive Path Integral (MPPI) control and we demonstrate the effectiveness of our method on simulation scenarios, showing that our framework efficiently handles conflicting objectives under realistic multi-modal uncertainty.
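
Illustrative note (not from the paper): the idea of a lexicographic ordering over STL specifications can be shown on two hand-made candidate trajectories. The sketch below is deterministic and ignores both prediction uncertainty and MPPI sampling. The two specifications, their bounds, the candidate values, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def rho_always_greater(signal, bound):
    """Robustness of 'always (signal > bound)': worst-case margin over the horizon."""
    return float(np.min(np.asarray(signal) - bound))


def rho_always_less(signal, bound):
    """Robustness of 'always (signal < bound)': worst-case margin over the horizon."""
    return float(np.min(bound - np.asarray(signal)))


def lexicographic_key(robustness_by_priority):
    """Violation depth per spec, highest priority first; 0.0 means the spec is satisfied."""
    return tuple(min(r, 0.0) for r in robustness_by_priority)


def score(traj):
    safety = rho_always_greater(traj["gap"], 2.0)            # priority 1: keep gap > 2 m
    comfort = rho_always_less(np.abs(traj["accel"]), 3.0)    # priority 2: |accel| < 3 m/s^2
    return lexicographic_key([safety, comfort])


# Two made-up candidates: gap to the lead vehicle (m) and acceleration (m/s^2).
candidates = {
    "brake_hard": {"gap": [8.0, 6.0, 5.0], "accel": [-1.0, -4.5, -4.0]},
    "keep_speed": {"gap": [8.0, 4.0, 1.0], "accel": [0.0, 0.2, 0.1]},
}

# Python compares tuples lexicographically, so a safety violation always
# outweighs any amount of comfort violation.
best = max(candidates, key=lambda name: score(candidates[name]))
print(best, score(candidates[best]))
```

Here the hard-braking candidate wins: it violates only the lower-priority comfort specification, while keeping speed violates the safety specification.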

2505.18201 2026-06-19 cs.RO cs.LG 版本更新

Reinforcement Twinning for Hybrid Control of Flapping-Wing Drones

强化孪生用于扑翼无人机的混合控制

Romain Poletti, Lorenzo Schena, Lilla Koloszar, Joris Degroote, Miguel Alfonso Mendez

发表机构 * Environmental and Applied Fluid Dynamics, von Karman Institute for Fluid Dynamics(环境与应用流体动力学,冯·卡门流体动力学研究所) Department of Mechanical Engineering, Vrije Universiteit Brussel(机械工程系,自由大学布鲁塞尔) Department of Electromechanical, Systems and Metal Engineering, Ghent University(机电系统与金属工程系,根特大学) Aero-Thermo-Mechanics Laboratory, École Polytechnique de Bruxelles, Université Libre de Bruxelles(航空热力学力学实验室,布鲁塞尔理工学院,自由大学布鲁塞尔) Experimental Aerodynamics and Propulsion Lab, Universidad Carlos III de Madrid(实验空气动力学与推进实验室,马德里卡洛斯三世大学)

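AI summary Proposes a hybrid model-free/model-based control approach for flapping-wing drones in which the reinforcement twinning algorithm combines reinforcement learning with an adaptive digital twin, using transfer learning and a policy referee to improve sample efficiency and control robustness.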

详情
AI中文摘要

控制扑翼无人机需要能够处理来自不完整、有噪声传感器数据的时变、非线性、欠驱动动力学的控制器。人工智能的最新进展,特别是强化学习,通过从环境交互中进行数据驱动的策略优化,为解决此类复杂控制问题开辟了新视角。然而,纯数据驱动方法样本效率低,需要大量甚至不安全的探索,尤其是在缺乏引导物理模型的情况下。这激发了混合人工智能-物理框架。本文提出了一种使用强化孪生算法的混合无模型/基于模型的飞行控制方法。基于模型的组件使用伴随公式和从实时轨迹中连续识别的自适应数字孪生;无模型组件使用强化学习。两个智能体通过迁移学习、模仿学习以及真实环境与数字孪生之间的共享经验来共享知识,并由一个策略裁判协调,该裁判根据数字孪生性能和真实到虚拟一致性比率选择哪个智能体在现实中行动。该框架针对扑翼无人机的纵向控制进行了评估,该无人机被建模为由准稳态气动力驱动的非线性时变系统。混合策略在三种自适应模型初始化下进行了测试:(1)从现有数据进行离线识别,(2)随机初始化并进行完全在线识别,以及(3)使用有偏参数进行离线预训练,然后进行在线自适应。在所有情况下,混合框架在性能、鲁棒性和样本效率方面均优于纯无模型和纯基于模型的方法。

英文摘要

Controlling flapping-wing drones requires controllers that handle time-varying, nonlinear, underactuated dynamics from incomplete, noisy sensor data. Recent advances in artificial intelligence (AI), particularly reinforcement learning (RL), have opened new perspectives for addressing such complex control problems through data-driven policy optimization from interaction with the environment. Yet purely data-driven methods are sample-inefficient, demanding extensive, sometimes unsafe exploration, especially without guiding physical models. This motivates hybrid AI-physics frameworks. This article proposes a hybrid model-free/model-based flight-control approach using the reinforcement twinning algorithm. The model-based (MB) component uses an adjoint formulation and an adaptive digital twin continuously identified from live trajectories; the model-free (MF) component uses RL. The two agents share knowledge via transfer learning, imitation learning, and shared experience between the real environment and the digital twin, coordinated by a policy referee that selects which agent acts in reality based on digital-twin performance and a real-to-virtual consistency ratio. The framework is evaluated for the longitudinal control of a flapping-wing drone, modelled as a nonlinear time-varying system driven by quasi-steady aerodynamic forces. The hybrid strategy is tested under three adaptive-model initializations: (1) offline identification from existing data, (2) random initialization with fully online identification, and (3) offline pre-training with biased parameters followed by online adaptation. In all cases, the hybrid framework improves performance, robustness, and sample efficiency over purely model-free and purely model-based approaches.

2605.08525 2026-06-19 cs.RO cs.SY eess.SY 版本更新

Model-Reference Adaptive Flight Control of a 95-mg Insect-Scale Flapping-Wing Aerial Robot

95毫克昆虫尺度扑翼飞行机器人的模型参考自适应飞行控制

Francisco M. F. R. Gonçalves, Conor K. Trygstad, Néstor O. Pérez-Arancibia

发表机构 * Washington State University(华盛顿州立大学)

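AI summary To handle parameter uncertainty and disturbances in insect-scale flapping-wing aerial robots, proposes a model-reference adaptive control (MRAC) architecture combined with a hybrid multiplicative extended Kalman filter for high-performance position control, validated in hovering and trajectory-tracking experiments with a 95-mg robot.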

Comments Under review, 8 pages, 7 figures

详情
AI中文摘要

由于系统尺度和复杂制造,描述扑翼昆虫尺度飞行机器人动力学的模型存在参数不确定性,例如惯性矩阵和飞行器的执行器映射。此外,由于其低惯性,这种机器人在飞行中受到随机和系统性扰动的严重影响,包括电源线张力、阵风和机翼不对中产生的非期望气动力。因此,在亚分克尺度上执行复杂机动的高性能要求机器人调整其行为以抵消扰动和模型不确定性。为此,我们引入了一种模型参考自适应控制(MRAC)架构,用于可实现为三维空间中刚体的扑翼机器昆虫的高性能位置控制。此外,我们展示了在飞行中实现混合乘性扩展卡尔曼滤波以估计当前和期望角速度,如何显著抑制姿态振动,特别是沿滚转和俯仰自由度,并提高飞行性能。为了展示所提方法的适用性、功能性和高性能,我们使用一个95毫克的昆虫尺度飞行机器人进行了实时悬停和轨迹跟踪六自由度飞行控制实验。

英文摘要

Due to the system's scale and complex fabrication, the model describing the dynamics of a flapping-wing insect-scale aerial robot is subject to parameter uncertainty; for example, in the inertia matrix and the actuator mapping of the flier. Furthermore, due to its low inertia, this type of robot is greatly affected by stochastic and systematic disturbances during flight, including power-wire tension, gusts, and undesired aerodynamic forces produced by wing misalignment. Therefore, the high-performance execution of complex maneuvers at the subdecigram scale requires the robot to adapt its behavior to counteract disturbances and model uncertainty. Toward this objective, we introduce a model-reference adaptive control (MRAC) architecture for high-performance position control of flapping-wing robotic insects that can be modeled as rigid bodies in the three-dimensional (3D) space. In addition, we demonstrate how the implementation of a hybrid multiplicative extended Kálmán filter for estimating current and desired angular velocities during flight significantly dampens attitude vibrations, especially along the roll and pitch degrees of freedom (DOFs), and also improves flight performance. To show the suitability, functionality, and high performance of the proposed approach, we conducted real-time hovering and trajectory-tracking 6-DOF flight control experiments with a 95-mg insect-scale aerial robot.

2605.28654 2026-06-19 cs.RO cs.SY eess.SY math.OC 版本更新

Integrated Exploration-Aware UAV Route Optimization and Path Planning

集成探索感知的无人机路径优化与轨迹规划

Jimin Choi, Grant Stagg, Cameron K. Peterson, Max Z. Li

发表机构 * Department of Aerospace Engineering, University of Michigan(密歇根大学航空航天工程系) Department of Electrical Engineering, Brigham Young University(BYU 电子工程系) Department of Aerospace Engineering, Department of Civil and Environmental Engineering, and Department of Industrial and Operations Engineering, University of Michigan(密歇根大学航空航天工程系、土木与环境工程系和工业与运营管理工程系)

AI总结 提出一种集成探索感知的无人机路径优化与轨迹规划框架,通过风险地图、不确定兴趣区域建模、B样条轨迹优化和在线重规划,在灾害监测中平衡报告点访问与新信息探索,平均KL散度降低量较离线优化规划器提升15.9%。

详情
AI中文摘要

无人机越来越多地用于危险环境(如灾区、污染场地、野火区域和受损基础设施)中的探索驱动监测,此时有限的飞行续航必须在访问报告位置和收集新信息之间分配。在这些场景中,关于危险的先验信息通常不完整、空间不精确,并且在执行过程中可能发生变化。例如,初始报告可能识别出危险可能存在的区域,但实际危险可能发生了位移、仅被部分观察到或完全未被报告。我们提出了一种集成的探索感知无人机路径优化与轨迹规划框架,用于在不确定和演变的先验信息下进行危险监测。环境被表示为空间风险地图,每个位置都有相关的危险状况信念。报告的危险被建模为不确定的兴趣区域(ROI),而不是确认的目标位置,要求无人机在检查报告区域的同时,利用有限的飞行续航探索信息丰富的区域。所提出的方法求解报告ROI上的车辆路径问题,通过辅助伪节点增强路径以改善空间覆盖,将剩余飞行距离预算分配到路径段,并优化用于局部探索的动态可行B样条轨迹。在执行过程中,无人机测量更新基于网格的信念地图,当新信息和剩余预算证明调整合理时,对剩余轨迹进行重规划。在48种场景配置中,在线重规划的平均KL散度降低量比离线优化规划器提高15.9%,比直线遍历提高48.6%。

英文摘要

Uncrewed aerial vehicles (UAVs) are increasingly used for exploration-driven monitoring in hazardous environments such as disaster zones, contaminated sites, wildfire areas, and damaged infrastructure, where limited flight endurance must be allocated between visiting reported locations and gathering new information. In these settings, prior information regarding hazards is often incomplete, spatially imprecise, and subject to change during execution. For example, initial reports may identify a region where a hazard is likely to exist, but the actual hazard may be displaced, partially observed, or entirely unreported. We present an integrated exploration-aware UAV route optimization and path planning framework for hazard monitoring under uncertain and evolving prior information. The environment is represented as a spatial risk map, where each location has an associated belief of hazardous conditions. Reported hazards are modeled as uncertain regions of interest (ROIs) rather than confirmed target locations, requiring the UAV to inspect reported areas while also using its limited flight endurance to explore informative regions. The proposed method solves a vehicle routing problem over reported ROIs, augments the route with auxiliary pseudo-nodes to improve spatial coverage, allocates the remaining flight distance budget across route segments, and optimizes dynamically feasible B-spline trajectories for local exploration. During execution, UAV measurements update a grid-based belief map, and the remaining trajectory is replanned when new information and the remaining budget justify adaptation. Across 48 scenario configurations, online replanning improves average KL reduction by 15.9% over the offline optimized planner and 48.6% over straight-line traversal.
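
The abstract reports results as a "KL reduction" on a grid-based belief map but does not define the metric. As a rough sketch of how such a quantity could be computed, the code below assumes each cell holds a Bernoulli hazard probability, applies a Bayes update for a binary detection, and measures the summed KL divergence from the ground-truth 0/1 map to the belief. The sensor rates `tpr` and `fpr`, the function names, and the metric definition itself are assumptions for illustration, not the paper's formulation.

```python
import math


def bayes_update(prior, detected, tpr=0.9, fpr=0.1):
    """Posterior hazard probability of one cell after a binary detection."""
    like_hazard = tpr if detected else 1.0 - tpr
    like_clear = fpr if detected else 1.0 - fpr
    numerator = like_hazard * prior
    return numerator / (numerator + like_clear * (1.0 - prior))


def kl_truth_to_belief(truth, belief, eps=1e-9):
    """Sum over cells of KL(truth || belief) for Bernoulli cells with 0/1 truth."""
    total = 0.0
    for t, b in zip(truth, belief):
        b = min(max(b, eps), 1.0 - eps)  # keep the logarithm finite
        total += -math.log(b) if t else -math.log(1.0 - b)
    return total


truth = [1, 0, 0, 1]            # hypothetical ground-truth hazard map
belief = [0.5, 0.5, 0.5, 0.5]   # uninformative prior belief

kl_before = kl_truth_to_belief(truth, belief)
# The UAV flies over cells 0 and 1 and receives correct readings.
belief[0] = bayes_update(belief[0], detected=True)
belief[1] = bayes_update(belief[1], detected=False)
kl_after = kl_truth_to_belief(truth, belief)

kl_reduction = kl_before - kl_after  # larger means the flight was more informative
```

Under this definition, a planner that spends its distance budget on cells with uncertain belief gains more KL reduction than one that revisits cells it is already confident about.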

2606.10688 2026-06-19 cs.RO 版本更新

Self-Supervised Relevance Modelling in Autonomous Driving via Counterfactual Analysis

自动驾驶中基于反事实分析的自监督相关性建模

Luca Lusvarghi, Javier Gozalvez, Pablo Urbano Hidalgo

发表机构 * Networked Systems Lab, Universidad Miguel Hernandez de Elche(埃尔切米格尔·埃尔南德斯大学网络系统实验室)

AI总结 提出一种基于反事实分析的自监督方法,用于量化自动驾驶中物体的相关性,实现毫秒级实时估计,并生成相关性热图以辅助感知与规划。

详情
AI中文摘要

自动驾驶依赖于计算密集型的感知管线,以持续检测和跟踪周围环境中的物体。虽然某些物体对于规划安全有效的操作至关重要,但其他物体可能不相关,并且对自动驾驶车辆的驾驶决策没有影响。关注相关物体可以更有效地利用可用计算资源,减少处理延迟,并限制感知噪声的下游传播。在这项工作中,我们提出了一种基于反事实分析的新型自监督方法,以开发相关性模型——一种基于AI的工具,用于量化物体对自动驾驶车辆的相关性。为了展示所提出方法的潜力,我们在选定城市场景中生成的合成因果数据集上训练了相关性模型。结果表明,该相关性模型能够以毫秒级延迟准确估计物体的相关性,从而在高密度场景中实现实时相关性估计。我们还展示了该相关性模型可用于构建相关性热图,为自动驾驶车辆的驾驶策略提供有价值的见解,并可用于主动通知感知和规划任务。我们公开发布了相关性模型和因果数据集。

英文摘要

Autonomous driving relies on computationally intensive perception pipelines to continuously detect and track objects in the surrounding environment. While some objects are key to plan safe and effective maneuvers, others may not be relevant and have no impact on the autonomous vehicle's driving decisions. Focusing on relevant objects allows a more efficient usage of available computational resources, reduces processing latencies, and limits the downstream propagation of perception noise. In this work, we propose a novel self-supervised approach based on counterfactual analysis to develop a relevance model - an AI-based tool that quantifies the relevance of objects for an autonomous vehicle. To demonstrate the potential of the proposed approach, we train a relevance model on a synthetic causal dataset generated in a selected urban scenario. Results show that the relevance model is able to accurately estimate the objects' relevance with millisecond-level latency, enabling real-time relevance estimation also in high-density scenarios. We also show that the relevance model can be used to build relevance heatmaps that offer valuable insights into the autonomous vehicle's driving policy and can be used to proactively inform perception and planning tasks. We openly release both the relevance model and the causal dataset.
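
The counterfactual idea can be sketched in a few lines: remove one object from the scene, re-run the planner, and use the change in the planned action as a label that needs no human annotation. The toy planner, the scene fields (`in_lane`, `gap_m`), and the relevance measure below are hypothetical stand-ins chosen for illustration. The abstract does not specify the planner or the relevance definition used by the authors.

```python
def plan_acceleration(objects, ego_speed=10.0):
    """Toy planner: brake for the nearest in-lane object ahead of the ego vehicle."""
    gaps = [o["gap_m"] for o in objects if o["in_lane"] and o["gap_m"] > 0]
    if not gaps:
        return 0.0
    # Constant deceleration needed to stop within the gap: v^2 / (2 * gap).
    return -(ego_speed ** 2) / (2.0 * min(gaps))


def counterfactual_relevance(objects):
    """Label each object by how much the plan changes when it is removed."""
    baseline = plan_acceleration(objects)
    labels = {}
    for i, obj in enumerate(objects):
        scene_without = objects[:i] + objects[i + 1:]
        labels[obj["id"]] = abs(baseline - plan_acceleration(scene_without))
    return labels


scene = [
    {"id": "lead_car", "in_lane": True, "gap_m": 20.0},
    {"id": "far_truck", "in_lane": True, "gap_m": 80.0},
    {"id": "parked_van", "in_lane": False, "gap_m": 15.0},
]
labels = counterfactual_relevance(scene)
```

Labels produced this way could then serve as regression targets for a fast learned model, which is what would make millisecond-level estimation plausible compared with re-running the planner once per object.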

2603.09420 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Class-Incremental Motion Forecasting

类别增量运动预测

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

发表机构 * Department of Computer Science, University of Freiburg, Germany(德国弗赖堡大学计算机科学系) Qualcomm SARL France(高通法国公司) Automated Driving, Qualcomm Technologies, Inc.(高通技术公司自动驾驶部门)

AI总结 提出类别增量运动预测新任务,通过端到端框架结合伪标签与开放词汇分割,利用3D-2D投票机制和查询特征方差重放策略,缓解灾难性遗忘并适应新类别。

Comments V3: Change title. Add further experiments

详情
AI中文摘要

运动预测使自动驾驶车辆能够通过预测动态智能体的未来轨迹来预判场景演化。然而,现有方法通常假设一个封闭世界设定,具有固定的对象分类法并依赖高质量感知,限制了其在现实世界中的应用,因为现实世界中感知不完美,且新对象类别可能随时间出现。在这项工作中,我们引入了类别增量运动预测,这是一个新颖的设定,其中新对象类别随时间顺序引入,并且直接从相机图像预测未来对象轨迹。我们提出了首个针对该设定的端到端框架,该框架适应新引入的类别,同时减轻对先前学习类别的灾难性遗忘。我们的方法为已知类别生成运动预测伪标签,并将其与开放词汇分割模型的2D实例掩码进行匹配。这种3D到2D关键点投票机制过滤不一致和过度自信的预测,而基于查询特征方差的重放策略采样信息丰富的过去序列以保留先验知识。在nuScenes和Argoverse 2上的广泛评估表明,我们的方法成功地在已知类别上保持性能,同时有效适应新类别。我们进一步展示了向真实世界驾驶的零样本迁移,并表明该框架自然地扩展到nuScenes和NeuroNCAP上的开环和闭环端到端类别增量规划。代码和模型将在 https://omen.cs.uni-freiburg.de 上公开。

英文摘要

Motion forecasting enables autonomous vehicles to anticipate scene evolution by predicting the future trajectories of dynamic agents. However, existing approaches typically assume a closed-world setting with a fixed object taxonomy and access to high-quality perception, limiting their applicability in the real world where perception is imperfect, and new object classes may emerge over time. In this work, we introduce class-incremental motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are predicted directly from camera images. We propose the first end-to-end framework for this setting, which adapts to newly introduced classes while mitigating catastrophic forgetting of previously learned ones. Our method generates motion forecasting pseudo-labels for known classes and matches them with 2D instance masks from an open-vocabulary segmentation model. This 3D-to-2D keypoint voting mechanism filters inconsistent and overconfident predictions, while a query feature variance-based replay strategy samples informative past sequences to preserve prior knowledge. Extensive evaluations on nuScenes and Argoverse 2 show that our approach successfully preserves performance on known classes while effectively adapting to novel ones. We further demonstrate zero-shot transfer to real-world driving and show that the framework extends naturally to open- and closed-loop end-to-end class-incremental planning on nuScenes and NeuroNCAP. Code and models will be made publicly available at https://omen.cs.uni-freiburg.de.

2606.13794 2026-06-19 eess.SY cs.AI cs.RO cs.SY 版本更新

An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts

过驱动飞行器的可解释控制效能学习与非线性控制分配集成方法

Umut Demir, Aamir Ahmad, Walter Fichter

发表机构 * University of Stuttgart, Faculty of Aerospace Engineering and Geodesy, Institute of Flight Mechanics and Control (iFR)(斯图加特大学航空航天工程与大地测量学院飞行力学与控制研究所)

AI总结 提出一种基于稀疏非线性动力学辨识的学习控制效能映射方法,结合在线自适应机制,实现过驱动飞行器的高效非线性控制分配,兼具可解释性和低计算成本。

详情
AI中文摘要

非线性动力学以及多个执行器之间产生的强耦合削弱了传统线性控制分配技术背后的假设。当飞行进入非线性效应主导的模态时,线性分配器因模型失配增加而精度下降,进而降低飞行控制系统的性能和鲁棒性。高保真机载模型和黑箱数据驱动方法可以在整个飞行包线内恢复精度,但分别带来实时分配难以承受的计算负担,并牺牲了验证和故障诊断所需的可解释性。本文通过使用稀疏非线性动力学辨识从代表性飞行数据中学习显式的、受物理约束的控制效能映射解析模型,解决了这些限制。所得映射紧凑、可解释,并允许解析导数,从而能够在非线性求解器中高效计算,同时额外包含执行器动力学,无需机载模型。在线自适应机制监控预测残差,并在检测到显著对象变化时刷新模型,从而在执行器故障和变化工况下提供平滑重构。该方法在一款高保真非线性基准飞行器上经过一系列激进机动评估,达到了与完整非线性机载模型相当的精度,同时相对于现有基线显著降低了计算成本。

英文摘要

Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.
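
Sparse Identification of Nonlinear Dynamics is commonly solved with sequentially thresholded least squares: fit all candidate terms, zero the small coefficients, and refit on the survivors. The sketch below applies that loop to a made-up scalar effectiveness curve with a hypothetical monomial library `[1, u, u^2, u^3]`. The data, the library, and the threshold are illustrative assumptions, and the paper's physics constraints and online adaptation are not modeled.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(theta, y, active):
    """Least-squares fit restricted to the active library columns (normal equations)."""
    ata = [[sum(row[i] * row[j] for row in theta) for j in active] for i in active]
    aty = [sum(row[i] * yk for row, yk in zip(theta, y)) for i in active]
    return solve_linear(ata, aty)


def stlsq(theta, y, threshold=0.1, iterations=5):
    """Sequentially thresholded least squares: fit, prune small terms, refit."""
    n_terms = len(theta[0])
    active = list(range(n_terms))
    coeffs = [0.0] * n_terms
    for _ in range(iterations):
        if not active:
            break
        fitted = least_squares(theta, y, active)
        coeffs = [0.0] * n_terms
        for idx, value in zip(active, fitted):
            coeffs[idx] = value
        active = [i for i in active if abs(coeffs[i]) >= threshold]
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


# Hypothetical data: moment produced by one effector deflection u.
deflections = [-1.0 + 0.1 * k for k in range(21)]
moments = [2.0 * u - 0.5 * u ** 2 for u in deflections]
library = [[1.0, u, u ** 2, u ** 3] for u in deflections]

coeffs = stlsq(library, moments)
```

Because the result is an explicit polynomial, its derivative with respect to `u` is available in closed form, which is the property the abstract highlights for use inside a nonlinear allocation solver.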

9. 软体机器人与硬件设计 3 篇

2606.20389 2026-06-19 cs.RO 新提交

CoLI: A Reproducible Platform for Continuum Robot Learning via Monolithic 3D Printing and Isomorphic Teleoperation

CoLI: 通过整体3D打印和同构遥操作实现连续体机器人学习的可复现平台

Ziyuan Tang, Chenxi Xiao*

AI总结 提出一种基于多材料3D打印和同构遥操作的连续体机器人平台,简化制造流程并实现无奇异映射控制,支持模仿学习自主控制,通过硬件表征和操作任务验证其可复现性和学习就绪性。

Comments 8 pages, 7 figures, 1 table, accepted by IROS2026

详情
AI中文摘要

连续体机器人因其高自由度、柔顺结构和操作安全性,在操作任务中展现出巨大潜力。然而,复杂的制造和组装过程、具有挑战性的运动学建模以及缺乏直观的控制接口,导致其在研究和实际应用中的可复现性受到阻碍。为解决这些问题,我们提出了一种新颖的开源连续体机器人设计。该平台采用多材料3D打印实现简化的制造流程,使机械臂能够作为整体柔顺结构制造,且组装工作量最小。控制通过同构遥操作接口实现,该接口建立了直接的执行器级映射,无需显式运动学建模,并提供无奇异映射。基于该硬件设计,平台进一步支持基于模仿学习的自主控制。通过硬件表征和一系列操作任务对所提出的系统进行了评估。实验结果表明,该平台提供了一个可复现的、学习就绪的连续体机器人系统,加速了连续体机器人社区的算法开发和系统基准测试。

英文摘要

Continuum robots offer strong potential for manipulation tasks due to their high degrees of freedom, compliant structures, and operational safety. However, their adoption in both research and practical applications has been hindered by reproducibility issues arising from complex fabrication and assembly processes, challenging kinematic modeling, and a lack of intuitive control interfaces. To address these challenges, we present a novel open-source continuum robot design. The platform features a simplified fabrication pipeline enabled by multi-material 3D printing, allowing the arm to be fabricated as a monolithic compliant structure with minimal assembly. Control is achieved through an isomorphic teleoperation interface that establishes a direct actuator-level mapping, eliminating the need for explicit kinematic modeling and providing a singularity-free mapping. Building on this hardware design, the platform further supports imitation-learning-based autonomous control. The proposed system is evaluated through hardware characterization and a set of manipulation tasks. Experimental results demonstrate that the platform provides a reproducible, learning-ready continuum robot system, accelerating algorithmic development and systematic benchmarking for the continuum robotics community.

2605.25005 2026-06-19 cs.RO 版本更新

Stiffness Optimization for Concentrated Bending in Magnetically Actuated Catheters: Maintaining Steerability under Gradient Stiffness

磁驱动导管集中弯曲的刚度优化:在梯度刚度下保持可操控性

Jiewen Tan, Junnan Xue, Shing Shin Cheng, Shuang Song, Erli Lyu, Jiaole Wang

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) The Chinese University of Hong Kong(香港中文大学) Macao Polytechnic University(澳门理工大学)

AI总结 针对磁驱动软导管在推送性与近端集中弯曲之间的权衡,提出一种刚度优化的多段磁驱动导管(SO-MAC),通过解耦转向-推进机构和梯度刚度架构,在推进过程中实现稳定的近端枢轴弯曲,同时远端被动自直以传递推进力。

详情
AI中文摘要

对于磁驱动软导管,实现高效的推送性(推进力传递)和近端集中弯曲以保持可操控性具有挑战性:较高的轴向/弯曲刚度可改善力传递但降低可操控性,而较低的刚度可实现大的近端集中弯曲,但在压缩推送载荷下增加扭结/屈曲风险。为了解决这一权衡,我们提出了一种刚度优化的多段磁驱动导管(SO-MAC),它集成了解耦的转向-推进机构与梯度刚度架构。SO-MAC在推进过程中将弯曲集中在稳定的近端枢轴周围,而远端部分通过优化的刚度分布和弹簧骨架的弹性恢复抵抗摩擦引起的扭结/屈曲,被动自直以传递推进力。在$0{-}180^{\circ}$的组合转向和推进过程中,枢轴保持稳定,远端尖端几乎直线地向目标方向推进。直径为1.5 mm的SO-MAC在其10 mm尖端处实现了高达$180^{\circ}$的转向,弯曲半径为3 mm,平均形状误差为$1.39 \pm 0.56$ mm,转向枢轴误差为$0.35 \pm 0.10$ mm。在支气管模型中的视觉反馈控制进一步验证了通过高度弯曲的分叉路径的鲁棒导航。

英文摘要

Achieving both efficient pushability (propulsion transmission) and proximally concentrated bending for steerability is challenging for magnetically actuated soft catheters: higher axial/bending stiffness improves force transmission but reduces steerability, whereas lower stiffness enables large, proximally concentrated bending yet increases kinking/buckling risk under compressive push loads. To address this trade-off, we propose a stiffness-optimized multi-segment magnetically actuated catheter (SO-MAC) that integrates a decoupled steering-advancement mechanism with a gradient-stiffness architecture. The SO-MAC concentrates bending about a stable proximal pivot during advancement while the distal section passively self-straightens to transmit propulsion, aided by the optimized stiffness distribution and elastic recovery of the spring backbone against friction-induced kinking/buckling. Over $0{-}180^{\circ}$ combined steering and advancement, the pivot remained stable and the distal tip advanced near-straight toward the target direction. A 1.5 mm-diameter SO-MAC achieved up to $180^{\circ}$ steering with a 3 mm bending radius at its 10 mm tip, with an average shape error of $1.39 \pm 0.56$ mm and a steering-pivot error of $0.35 \pm 0.10$ mm. Visual feedback control in a bronchial phantom further confirmed robust navigation through highly curved, bifurcating paths.

2604.00527 2026-06-19 math.MG cs.RO math.DG 版本更新

Bistable Quad-Nets Composed of Four-Bar Linkages

由四杆机构组成的双稳态四边网

Gudrun Szewieczek, Daniel Huczala, Martin Pfurner, Hans-Peter Schröcker

发表机构 * University of Innsbruck, Department of Basic Sciences in Engineering Sciences(因斯布鲁克大学工程科学基础科学系) Seoul National University, Robotics Laboratory(首尔国立大学机器人实验室)

AI总结 研究由空间四杆机构组成的双稳态机械结构,通过Study二次曲面解释并利用Whiteley去平均化从柔性四边网构造,无需数值优化即可控制几何参数。

详情
AI中文摘要

我们研究了一种新型机械结构,由空间四杆机构组成,具有双稳态特性,即允许两种不同的构型。这些结构在Study二次曲面中具有四边网的解释,我们利用该解释证明了具有无限数量连杆和关节的组装体的存在性。我们提出了一种纯几何构造方法,从欧几里得空间中的无穷小柔性四边网出发,应用Whiteley去平均化。这一观点将问题置于离散微分几何的更广泛框架内,并能够从众所周知的四边网类别(如离散极小曲面)构造双稳态结构。与许多其他双稳态结构构造方法相比,我们的方法不依赖于数值优化,并且允许简单控制相关几何参数,如轴位置和卡扣角度。

英文摘要

We study a novel type of mechanical structures, composed of spatial four-bar linkages, that are bistable, that is, they allow for two distinct configurations. These structures have an interpretation as quad nets in the Study quadric which we use to prove existence of assemblies with an unbounded number of links and joints. We propose a purely geometric construction of such objects, starting from infinitesimally flexible quad nets in Euclidean space and applying Whiteley de-averaging. This point of view situates the problem within the broader framework of discrete differential geometry and enables the construction of bistable structures from well-known classes of quad nets, such as discrete minimal surfaces. In contrast to many other construction methods for bistable structures, our approach does not rely on numerical optimization and it allows for simple control of relevant geometric parameters such as axis positions and snap angles.

10. 仿真、数据集与评测 14 篇

2606.19357 2026-06-19 cs.RO cs.AI 新提交

Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots

Physical Atari: 一个用于机器人实时强化学习的鲁棒且可访问的平台

Khurram Javed, Joseph Modayil, Gloria Kennickell, Richard S. Sutton, John Carmack

AI总结 提出Physical Atari平台,通过机器人操作Atari控制器和实时渲染游戏帧,实现物理世界中的强化学习研究,验证了算法可直接在机器人上学习,并指出分布偏移会显著降低策略性能。

Comments To appear at RLC 2026

详情
AI中文摘要

我们构建了一个名为Robotroller的机器人,它能够操作Atari CX40+控制器,以及一个名为Atari Devbox的设备,该设备在屏幕上渲染来自Arcade Learning Environment的游戏帧和奖励信号。Robotroller和Atari Devbox,连同现成的摄像头和台式计算机,构成一个可用于研究物理世界中强化学习算法的系统。我们将整个系统称为Physical Atari。在本文中,我们详细介绍了使Physical Atari成为一个鲁棒且可访问平台的关键决策。为了使系统鲁棒,我们设计了Robotroller,使得所有运动都通过轴承完成,从而减少磨损。此外,我们编写了软件,以高频监控伺服电机的状态并进行干预以限制应力。为了使系统可访问,我们使用了价格合理的现成组件和可通过消费级3D打印机制造的零件。Physical Atari的建造成本低于1000美元,并且已用于数周不间断的强化学习实验,未出现任何机械故障。我们用它验证了强化学习算法可以直接在机器人上学习,并表明即使学习和部署之间的微小分布偏移也会显著降低策略的性能。我们的结果强调了设备端适应对于在机器人上获得强性能的重要性。

英文摘要

We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroller and the Atari Devbox, together with an off-the-shelf camera and a desktop computer, constitute a system that can be used to study reinforcement learning algorithms in the physical world. We call the full system Physical Atari. In this paper, we detail the key decisions that make Physical Atari a robust and accessible platform. To make the system robust, we designed the Robotroller so that all movement is done through bearings, which reduces wear. Additionally, we wrote software that monitors the state of the servos at a high frequency and intervenes to limit stress. To make the system accessible, we used affordable off-the-shelf components and parts that can be manufactured using consumer 3D printers. Physical Atari can be built for under $1,000 and has been used for weeks of non-stop reinforcement learning experiments without any mechanical failures. We used it to validate that reinforcement learning algorithms can learn directly on robots and show that even small distribution shifts between learning and deployment can significantly degrade the performance of policies. Our results underscore the importance of on-device adaptation for strong performance on robots.

2606.19358 2026-06-19 cs.RO 新提交

WorkBenchMark: A LEGO-Based Assembly Benchmark with an Assembly-by-Disassembly Baseline for the Smart Manufacturing League

WorkBenchMark:面向智能制造联赛的基于乐高积木的装配基准与通过拆卸进行装配的基线方法

Wenbo Ma, Daniel Swoboda, Matteo Tschesche, Till Hofmann

发表机构 * Chair of Machine Learning and Reasoning (i6), RWTH Aachen University(亚琛工业大学机器学习与推理教席(i6)) MASCOR Institute, FH Aachen University of Applied Science(亚琛应用技术大学MASCOR研究所)

AI总结 提出一个基于乐高Duplo的机器人装配基准,包含400个任务和四个复杂度层级,并提供一个基于规划的基线方法,在所有层级上优于现代视觉-语言-动作方法。

Comments RoboCup Symposium 2026 accepted paper

详情
AI中文摘要

我们介绍了WorkBenchMark,一个受RoboCup智能制造联赛启发的基于乐高Duplo的机器人装配基准。机器人装配将低层操作与物理约束下的任务级符号推理相结合,当前端到端学习方法尚未可靠解决这一组合。该基准提供跨四个复杂度层级的400个任务。我们提供了一个结合开放词汇感知与拆卸式装配(Assembly-by-Disassembly)的基线解决方案。我们的基于规划的流水线在所有层级上优于一种现代视觉-语言-动作方法。该基准、仿真环境和基线实现将公开发布,以支持更广泛的机器人装配社区。

英文摘要

We introduce WorkBenchMark, a LEGO Duplo-based robotic assembly benchmark motivated by the RoboCup Smart Manufacturing League. Robotic assembly couples low-level manipulation with task-level symbolic reasoning under physical constraints, a combination that current end-to-end learning methods do not yet solve reliably. The benchmark provides 400 tasks across four complexity tiers. We provide an open-vocabulary perception, Assembly-by-Disassembly baseline solution. Our planning-based pipeline outperforms a modern vision-language-action approach across all tiers. The benchmark, simulation environment, and baseline implementation will be released openly to support the broader robotic assembly community.

2606.19504 2026-06-19 cs.RO cs.SY eess.SY 新提交

Simulating Robotic Locomotion in Sand: Resistive Force Theory in an Open-Source Physics Engine

模拟沙地中的机器人运动:开源物理引擎中的阻力理论

Ryan Walker Brown, Laura K. Treers, Kathryn A. Daltorio

发表机构 * Case Western Reserve University(凯斯西储大学) University of Vermont(佛蒙特大学)

AI总结 将三维颗粒阻力理论(3D RFT)集成到MuJoCo物理引擎中,实现沙地行走模拟,验证了足端形状、速度和负载对运动的影响,并在六足机器人实验中预测行走距离和沉陷误差在20%以内。

Comments 12 pages, 7 figures

详情
AI中文摘要

阻力理论(RFT)的最新进展使得无需模拟单个颗粒相互作用即可近似沙地运动中的地面反作用力,从而降低了计算成本。然而,这些工具在常用于机器人仿真的3D物理引擎中尚不可用。我们探讨了将阻力近似与标准动力学计算相结合,是否能为自由行走的机器人提供稳定的支撑。为此,我们在物理仿真引擎MuJoCo中实现了三维颗粒阻力理论(3D RFT)。我们在多个场景中验证了仿真,证明了由于末端执行器形状、速度和负载引起的关键趋势得以保留。我们的实现预测了12自由度六足机器人在沙地中的行走距离和足部下沉,误差在实验值的20%以内。尽管RFT存在固有近似,但本文描述的开源工具有望帮助开发新的和改进的机器人设计,以穿越颗粒介质基底。

英文摘要

Recent advancements in Resistive Force Theory (RFT) enable approximation of ground reaction forces for locomotion in sand without the computational expense of modeling interactions with individual grains. However, these tools have been absent in 3D physics engines commonly used for robot simulation. We explore if resistive force approximations are sufficient, when integrated with standard dynamics calculations, to provide a stable substrate for a freely walking robot. To determine this, we implement 3D Granular Resistive Force Theory (3D RFT) in a physics simulation engine, MuJoCo. We verify simulations in multiple scenarios to demonstrate that key trends due to end effector shape, speed, and loading are preserved. Our implementation predicts walking distance and foot sinkage of a 12-Degree of Freedom hexapod robot within 20\% of experiments in sand. While RFT has inherent approximations, the open source tool described here has potential to help develop new and improved robot designs to traverse granular media substrates.
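
At its simplest, resistive force theory treats the stress on a small intruding surface element as proportional to its depth, and sums the element forces over the body. The sketch below covers only the special case of a flat, horizontal foot pressed vertically into the sand, with a single constant coefficient `alpha` whose value here is an illustrative placeholder. The full 3D RFT in the paper depends on element orientation and motion direction, and this code does not use the MuJoCo API.

```python
def vertical_resistive_force(depth_m, area_m2, alpha_n_per_m3):
    """Force on a horizontal plate element: F = alpha * depth * area (zero above the surface)."""
    if depth_m <= 0.0:
        return 0.0
    return alpha_n_per_m3 * depth_m * area_m2


def foot_force(element_depths_m, element_area_m2, alpha_n_per_m3):
    """Total vertical force on a foot discretized into equal-area elements."""
    return sum(
        vertical_resistive_force(d, element_area_m2, alpha_n_per_m3)
        for d in element_depths_m
    )


def static_sinkage(load_n, foot_area_m2, alpha_n_per_m3):
    """Depth at which a flat foot's resistive force balances a static load."""
    return load_n / (alpha_n_per_m3 * foot_area_m2)


ALPHA = 2.0e5  # N/m^3, illustrative placeholder rather than a calibrated value
sinkage = static_sinkage(load_n=10.0, foot_area_m2=0.002, alpha_n_per_m3=ALPHA)
support = foot_force([sinkage] * 4, element_area_m2=0.0005, alpha_n_per_m3=ALPHA)
```

In a physics engine, forces of this kind would be recomputed for each element at every time step and applied to the corresponding body as external forces, which is what avoids simulating individual grains.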

2606.19675 2026-06-19 cs.RO 新提交

ForEnt: A Multi-Modal Dataset for Characterizing Quadruped Robot Entrapments in Forest Environments

ForEnt: 用于表征四足机器人在森林环境中被困的多模态数据集

Natapat Kirdwichai, Danesh Tarapore

发表机构 * University of Southampton(南安普顿大学)

AI总结 针对四足机器人在森林中因植被缠绕而倾覆的问题,提出多模态数据集ForEnt,包含RGB-D、LiDAR、本体感知和第三人称视频,记录69次被困事件,支持可重复的基准测试。

Comments 8 pages, 7 figures

详情
AI中文摘要

腿式机器人越来越多地被部署在森林中进行生态调查和监测,但由于穿越森林环境带来的挑战,它们的自主性经常中断。森林被困,例如当机器人的腿被藤蔓或其他植被缠住时,会导致失去稳定性并翻倒。此类事件不仅中断任务并需要人工干预,还可能损坏机器人硬件。为了解决缺乏专门数据集来研究森林环境中这些故障模式的问题,我们提出了ForEnt,这是一个多模态数据集,使用低成本的Unitree Go2四足机器人在英国南安普顿公共林地的八个森林地点收集。在我们的数据集中,进行了约1.7公里的穿越,共11个序列,记录了69次被困事件。ForEnt包括时间同步的RGB-D图像、LiDAR扫描、本体感知数据和第三人称视频,能够分析导致被困的地形因素,并提供标记的传感器流用于可重复的基准测试。通过支持被困检测策略的评估,ForEnt降低了在具有挑战性的森林环境中开发稳健四足机器人部署的门槛。

英文摘要

Legged robots are increasingly deployed in forests for ecological surveying and monitoring, yet their autonomy is often interrupted consequent to the challenges posed in traversing forest environments. Forest entrapments, for example, when a robot's legs are ensnared in vines or other vegetation, result in loss of stability and toppling. Such events not only disrupt the mission and require manual intervention, but also risk damage to the robot hardware. To address the absence of a dedicated dataset to investigate these failure modes in forest environments, we present ForEnt, a multi-modal dataset collected with the low-cost Unitree Go2 quadruped across eight forest sites in the Southampton Common Woodlands, UK. For our dataset, over approximately 1.7 km of traversals in 11 sequences were conducted, yielding 69 recorded entrapment events. ForEnt includes time-synchronized RGB-D images, LiDAR scans, proprioceptive data, and third-person video, enabling analysis of terrain factors contributing to entrapment and providing labeled sensor streams for reproducible benchmarking. By supporting the evaluation of entrapment detection strategies, ForEnt lowers the barrier to developing robust quadruped robot deployments in challenging forest environments.

2606.19769 2026-06-19 cs.RO cs.AI 新提交

Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI

人形机器人数据标准:物理AI缺失的基础设施

Shaoshan Liu, Xiugong Qin, Xuan Wu, Xuan Xia, Ning Ding, Jialu Liu, Jie Tang

AI总结 本文论证数据标准是人形机器人可扩展性的关键基础设施,通过提出ISO/WD 26264-1标准,解决数据非累积性问题,使具身经验可解释、可共享、可追溯和可复用。

详情
AI中文摘要

人形机器人的可扩展性不仅取决于模型和硬件,还取决于物理经验能否在机器人、任务、组织及时间维度上积累。基于作者在ISO/TC 299/WG 16内制定ISO/WD 26264-1《人形机器人数据集——第1部分:通用要求》的工作,本文论证数据标准正成为物理AI的基础设施。我们提出三个见解:第一,人形机器人数据是具身交互数据,而非孤立数字样本的集合;有用的数据集必须保留机器人本体、动作、任务、场景、执行轨迹和结果之间的关系。第二,其价值取决于物理一致性:多模态流仅在时序、坐标系、标定、运动学、单位和同步假设可检查时才可复用。第三,主要瓶颈不仅是数据稀缺,更是由高采集成本、数据孤岛和不一致评估导致的非累积性数据。我们认为人形机器人数据标准通过使具身经验可解释、可共享、可追溯和可复用来解决这些瓶颈。通用标准应为生命周期管理、元数据、来源、质量、版本控制和可追溯性提供横向基础设施,而能力特定部分应定义操作、移动、人机交互、认知及未来人形能力的领域语法。随着AI从屏幕进入实体,数据标准必须从组织数字信息演变为结构化物理交互。

英文摘要

The scalability of humanoid robots will depend not only on models and hardware, but also on whether physical experience can accumulate across robots, tasks, organizations, and time. Drawing on the authors' work in developing ISO/WD 26264-1, Humanoid robot datasets -- Part 1: General requirements, within ISO/TC 299/WG 16, this article argues that data standards are becoming foundational infrastructure for Physical AI. We develop three insights. First, humanoid robot data is embodied interaction data, not a collection of isolated digital samples; a useful dataset must preserve the relationship among robot body, action, task, scene, execution trace, and outcome. Second, its value depends on physical coherence: multimodal streams are reusable only when timing, coordinate frames, calibration, kinematics, units, and synchronization assumptions remain inspectable. Third, the main bottleneck is not only data scarcity, but non-cumulative data caused by high collection costs, data silos, and inconsistent evaluation. We argue that humanoid robot data standards address these bottlenecks by making embodied experience interpretable, shareable, traceable, and reusable. A general standard should provide horizontal infrastructure for lifecycle management, metadata, provenance, quality, versioning, and traceability, while capability-specific parts should define domain grammar for manipulation, locomotion, human-robot interaction, cognition, and future humanoid capabilities. As AI moves from screens into bodies, data standards must evolve from organizing digital information to structuring physical interaction.

2606.19813 2026-06-19 cs.RO 新提交

TIDY: Thermal Infrared Image Denoising via Wavelet Domain Entropy and Directional Stripe Index

TIDY: 基于小波域熵和方向条纹指数的热红外图像去噪

Tai Hyoung Rhee, Dong-Guw Lee, Ayoung Kim

发表机构 * Dept. of Mechanical Engineering, SNU(首尔大学机械工程系)

AI总结 提出轻量级小波域去噪器TIDY,利用真实噪声数据训练,通过小波熵和方向条纹指数损失项抑制随机噪声和条纹伪影,在室内恶劣条件下提升热红外图像质量及下游机器人任务性能。

详情
AI中文摘要

热红外(TIR)成像因其在低光视觉退化下的鲁棒感知能力,已成为野外机器人的热门选择,但它受到严重的随机噪声和固定模式噪声的影响,破坏了后续估计。由于低热对比度和均匀温度分布,这种噪声在室内会加剧,导致室内TIR部署相对缺乏。现有的TIR去噪方法在精度和效率之间权衡不佳,要么对于机器人所需的在线部署来说太慢,要么对严重退化不够鲁棒,而且通常是在合成噪声上训练的。针对这些问题,我们提出了TIDY,一种轻量级的小波域去噪器,在真实的干净-噪声TIR数据上训练。通过在小波域中重新表述TIR去噪,TIDY明确地将噪声与结构内容分离,实现了有针对性的抑制,降低了空间复杂度,推理速度较已有方法显著提高(约34 Hz)。TIDY引入了两个新指标,小波熵和小波方向条纹指数,作为互补的损失项,以明确抑制随机噪声和条纹伪影。在严重的室内损坏和零样本设置中,TIDY提高了鲁棒性,并在下游机器人任务(包括热惯性里程计和单目深度估计)中产生一致的增益。代码和数据集可在以下网址获取:https://github.com/williamrheeth/TIDY

英文摘要

Thermal infrared (TIR) imaging has been a popular choice for field robotics due to its robust perception capability under low-light visual degradation, but it suffers from severe stochastic and fixed-pattern noise that breaks downstream estimation. This noise is intensified indoors due to low thermal contrast and uniform temperature distributions, contributing to the relative lack of indoor TIR deployments. Existing TIR denoising methods exhibit a poor accuracy-efficiency tradeoff, either too slow for online deployment required in robotics or insufficiently robust to severe degradation, while typically being trained on synthetic noise. Addressing these problems, we propose TIDY, a lightweight wavelet-domain denoiser trained on real clean-noisy TIR data. By reformulating TIR denoising in the wavelet domain, TIDY explicitly disentangles noise from structural content, enabling targeted suppression with reduced spatial complexity, significantly improving inference speed over prior methods (~34 Hz). TIDY introduces two new metrics, Wavelet Entropy and Wavelet Directional Stripe Index, as complementary loss terms to explicitly suppress stochastic noise and stripe artifacts. Across severe indoor corruption and zero-shot settings, TIDY improves robustness and yields consistent gains in downstream robotics tasks including thermal inertial odometry and monocular depth estimation. Code and dataset are available at: https://github.com/williamrheeth/TIDY
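
The abstract names two wavelet-domain metrics without giving their formulas. One plausible reading, shown here only as a sketch, is to take a single-level 2D Haar transform, treat the Shannon entropy of the detail-band energy distribution as a "wavelet entropy", and treat the share of energy in the band that responds to column-to-column changes as a "stripe index". Both definitions and all function names are assumptions made for illustration, and they may differ from the paper's loss terms.

```python
import math


def haar_detail_energies(image):
    """Energies of the three single-level Haar detail bands of a 2D image."""
    e_dx = e_dy = e_dd = 0.0
    for i in range(0, len(image) - 1, 2):
        for j in range(0, len(image[0]) - 1, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            e_dx += ((a - b + c - d) / 2.0) ** 2  # change across columns
            e_dy += ((a + b - c - d) / 2.0) ** 2  # change across rows
            e_dd += ((a - b - c + d) / 2.0) ** 2  # diagonal change
    return e_dx, e_dy, e_dd


def wavelet_entropy(image):
    """Shannon entropy of how detail energy is spread over the three bands."""
    energies = haar_detail_energies(image)
    total = sum(energies)
    if total == 0.0:
        return 0.0
    return -sum((e / total) * math.log(e / total) for e in energies if e > 0.0)


def column_stripe_index(image):
    """Share of axis-aligned detail energy caused by column-to-column changes."""
    e_dx, e_dy, _ = haar_detail_energies(image)
    if e_dx + e_dy == 0.0:
        return 0.0
    return e_dx / (e_dx + e_dy)


striped = [[0.0, 1.0, 0.0, 1.0] for _ in range(4)]  # pure vertical stripes
impulse = [[2.0, 0.0], [0.0, 0.0]]                  # energy split over all bands

stripe_score = column_stripe_index(striped)
impulse_entropy = wavelet_entropy(impulse)
```

Under these definitions, fixed-pattern column noise pushes the stripe index toward 1, while unstructured noise spreads energy across the bands and raises the entropy, so penalizing both would target the two noise types separately.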

2606.20118 2026-06-19 cs.RO cs.LG 新提交

Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation

Pose6DAug: 用于机器人数据增强的物理合理多视图物体替换

Jonghoon Lee, Seong Hyeon Park, Byungwoo Jeon, Minha Lee, Jinwoo Shin

AI总结 提出Pose6DAug,一种基于失败驱动的数据增强框架,通过3D网格和6D姿态轨迹替换成功轨迹中的物体,生成多视图一致的物理合理演示,无需额外数据收集,在新型物体上提升VLA策略成功率16.5%。

详情
AI中文摘要

视觉-语言-动作(VLA)策略在通用操作中展现出强大潜力,但在外观或几何形状偏离训练分布的新型分布外物体上常常失败。标准的补救措施是为每个失败案例收集多视图遥操作数据,但这在成本和时间上扩展性差。我们提出Pose6DAug,一种失败驱动的数据增强框架,将策略自身的成功回合转化为针对其失败模式的目标演示,无需任何新数据收集。我们的关键洞察是,每个成功回合已经编码了一个物理有效的动作轨迹以及校准的多视图观测。通过仅替换被操作物体同时保留该轨迹,我们获得新的且物理基础的演示。然而,简单的2D视频编辑会破坏多视图一致性和物理合理性,特别是在严重遮挡和以自我为中心的视角下。我们的方法直接在3D中操作,通过时间一致的6D姿态轨迹驱动的显式网格锚定目标物体,确保所有相机视图的几何一致渲染。在我们方法增强的数据上微调VLA,相对于最先进的基线,在新型物体上的成功率提高了16.5%,同时保持了分布内性能。这些结果表明,多视图和物理一致的增强是实现可扩展VLA泛化的实用途径。

英文摘要

Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy's own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successful episode already encodes a physically valid action trajectory together with calibrated multi-view observations. By swapping only the manipulated object while preserving this trajectory, we obtain new and physically grounded demonstrations. However, naive 2D video editing breaks multi-view consistency and physical plausibility, particularly under heavy occlusion and egocentric viewpoints. Our method instead operates directly in 3D, anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent renderings across all camera views. Fine-tuning a VLA on data augmented by our method improves success rates by 16.5% relative to the state-of-the-art baseline on novel objects, while preserving in-distribution performance. These results show that multi-view and physically consistent augmentation is a practical path to scalable VLA generalization.
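
Multi-view consistency follows from rendering every camera from the same posed 3D geometry rather than editing each video separately. The sketch below places mesh vertices with one 6D pose (rotation plus translation) and projects them through a pinhole model into each camera. The camera dictionary layout, the intrinsics, and the function names are hypothetical, and real rendering would also need rasterization, occlusion handling, and compositing, none of which is shown.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transform(rotation, translation, point):
    """Apply a rigid transform: R * p + t."""
    rotated = mat_vec(rotation, point)
    return [rotated[i] + translation[i] for i in range(3)]


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection; returns None for points behind the camera."""
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def project_vertices(vertices, object_pose, cameras):
    """Pixel coordinates of every mesh vertex in every camera for one time step."""
    rot_obj, trans_obj = object_pose
    views = {}
    for name, cam in cameras.items():
        pixels = []
        for vertex in vertices:
            world = transform(rot_obj, trans_obj, vertex)
            in_camera = transform(cam["R"], cam["t"], world)  # world-to-camera
            pixels.append(project(in_camera, *cam["K"]))
        views[name] = pixels
    return views


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = {
    "front": {"R": IDENTITY, "t": [0.0, 0.0, 0.0], "K": (500.0, 500.0, 320.0, 240.0)},
    "side": {"R": IDENTITY, "t": [-0.5, 0.0, 0.0], "K": (500.0, 500.0, 320.0, 240.0)},
}
pose = (IDENTITY, [0.0, 0.0, 2.0])  # object 2 m in front of the front camera
views = project_vertices([[0.0, 0.0, 0.0]], pose, cameras)
```

Repeating this for each time step of a 6D pose trajectory gives per-view pixel tracks that agree with one another by construction, which is the property naive 2D editing cannot guarantee.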

2606.20272 2026-06-19 cs.RO cs.CV 新提交

Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications

高效连接真实场景与合成数据生成以支持基于AI的认知机器人和计算机视觉应用

Paul Koch, Vivek Chavan, André Sers, Adem Karakurt, Paul Hofmann, Mohamad Zaher Ziadeh, Jörg Krüger

发表机构 * Fraunhofer IPK(弗劳恩霍夫生产设备和设计技术研究所) TU Berlin(柏林工业大学)

AI总结 本文讨论当前AI视觉模型在认知机器人应用中的局限,并提出通过连接仿真与真实世界训练数据生成来弥合领域差距的方法。

Comments Accepted and best paper award at MHI-Kolloquium 2024

详情
AI中文摘要

AI视觉模型是认知机器人在工业和家庭应用中潜在用例的驱动因素。基于最新的AI成就,已经提出了从语义环境分析到6D和抓取姿态估计的大量方法。然而,这些进展需要在训练数据和AI架构方面更强大、更高效的方法,它们能够协同应对当前挑战、精度限制以及跨越领域差距的可扩展性。在本文中,我们讨论了这些当前限制,以及相关最先进技术中正在挑战这些限制的趋势。此外,我们讨论了当前在弥合仿真与真实世界应用之间领域差距方面的在研工作,其做法是在训练数据生成中将两者连接起来。

英文摘要

AI vision models are a driving factor for the potential use cases of cognitive robotics in industrial and household applications. A large array of methods, from semantic environment analysis to 6D and grasping pose estimation, has been proposed based on the latest AI achievements. However, such advancements require stronger and more efficient methods with respect to training data and AI architectures, which together can tackle current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and the trends in the related state of the art that are challenging them. Further, we discuss our current work in progress on bridging the domain gap between simulations and real-world applications by linking the two in training data generation.

2606.20426 2026-06-19 cs.RO 新提交

TaCauchy: An Extensible FEM Framework for Vision-Based Tactile Simulation

TaCauchy:面向视觉触觉仿真的可扩展有限元框架

Hengfei Zhao, Yifan Xie, Junhao Gong, Yue Sun, Kai Zhu, Weihua He, Shoujie Li, Haohuan Fu, Wenbo Ding

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Huawei Inc.(华为技术有限公司)

AI总结 提出TaCauchy框架,基于UIPC求解器在Isaac Sim中集成有限元法,直接计算柯西应力张量并投影为接触力,实现高保真触觉仿真,支持多种传感器,物理验证SSIM>0.93。

Comments Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2026

详情
AI中文摘要

基于视觉的触觉传感器需要高保真仿真以支持强化学习,然而现有方法难以在GPU加速的机器人平台中提供精确的机械应力场。我们提出TaCauchy,一个可扩展的有限元法(FEM)框架,将严格的基于物理的力计算集成到Isaac Sim中。TaCauchy基于统一增量势接触(UIPC)求解器,直接从超弹性本构定律计算柯西应力张量,并将其投影到接触表面以获得牵引力和压力分布,从而从第一性原理而非经验估计提供机械真实值。我们的框架具有几何感知自适应细化的自动网格生成和模块化传感器接口,能够以最小配置快速集成多种传感器(GelSight Mini、DIGIT、9DTact)。性能基准测试显示,单环境帧率为33.40 FPS,60个并行环境的总吞吐量为555 FPS,应力提取开销低于1 ms。物理验证实验表明,在1.2556 N至4.7332 N的力范围内,仿真与真实触觉响应高度一致,SSIM超过0.93,证实了该框架为下游机器人操作任务提供准确、基于物理的力监督的能力。

英文摘要

Vision-based tactile sensors require high-fidelity simulation for reinforcement learning, yet existing approaches struggle to provide accurate mechanical stress fields within GPU-accelerated robotics platforms. We present TaCauchy, an extensible Finite Element Method (FEM) framework that integrates rigorous physics-based force computation into Isaac Sim. Built on the Unified Incremental Potential Contact (UIPC) solver, TaCauchy directly computes Cauchy stress tensors from hyperelastic constitutive laws and projects them onto contact surfaces to obtain traction forces and pressure distributions, providing mechanical ground truth from first principles rather than empirical estimation. Our framework features automatic mesh generation with geometry-aware adaptive refinement and a modular sensor interface enabling rapid integration of diverse sensors (GelSight Mini, DIGIT, 9DTact) with minimal configuration. Performance benchmarks demonstrate 33.40 FPS for single environments and 555 FPS aggregate throughput across 60 parallel environments, with stress extraction overhead under 1 ms. Physical validation experiments show strong agreement between simulated and real tactile responses across force ranges from 1.2556 N to 4.7332 N, achieving SSIM above 0.93, confirming the framework's capability to provide accurate, physically-grounded force supervision for downstream robotic manipulation tasks.
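The projection step described in the abstract above (Cauchy stress onto a contact surface to obtain traction and pressure) can be illustrated with Cauchy's formula, t = σ·n. The following is a minimal sketch only: the stress values, the surface normal, and the function names are hypothetical, and the sign convention (compression counted as positive pressure) is an assumption rather than something stated in the abstract.

```python
def traction(stress, normal):
    """Cauchy's formula: traction t = sigma . n for a 3x3 stress tensor and a unit normal."""
    return [sum(stress[i][j] * normal[j] for j in range(3)) for i in range(3)]


def contact_pressure(stress, normal):
    """Pressure as the negative normal component of traction (assumed: compression positive)."""
    t = traction(stress, normal)
    return -sum(t[i] * normal[i] for i in range(3))


# Hypothetical stress state in Pa: 1 kPa compression along z plus a small xz shear.
sigma = [
    [0.0, 0.0, 50.0],
    [0.0, 0.0, 0.0],
    [50.0, 0.0, -1000.0],
]
n = [0.0, 0.0, 1.0]  # hypothetical outward unit normal of the contact surface

t = traction(sigma, n)          # shear component along x, normal component along z
p = contact_pressure(sigma, n)  # scalar pressure on the surface
```

In a finite element setting this projection would be evaluated per surface element; the sketch shows a single point only.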

2606.20189 2026-06-19 cs.CV cs.AI cs.RO 交叉投稿

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

HilDA:利用扩散的分层蒸馏推进自监督LiDAR预训练

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院) Linköping University(林雪平大学) TRATON AB(TRATON公司) Qualcomm Auto Ltd Sweden Filial(高通汽车有限公司瑞典分公司)

AI总结 提出HilDA框架,通过分层蒸馏(多层蒸馏和全局上下文蒸馏)结合时间占用扩散目标,自监督预训练LiDAR骨干网络,在3D检测、场景流和语义占用预测任务上达到最先进水平。

Comments Accepted to ECCV 2026. Maciej and Jesper contributed equally

详情
AI中文摘要

利用视觉基础模型(VFM)进行相机到LiDAR的知识蒸馏为解决真实世界自动驾驶中巨大的几何和运动多样性所需的标注数据稀缺问题提供了一种有前景的方案。然而,当前方法通常将VFM视为黑盒教师,仅依赖逐帧特征相似性。因此,它们未能充分利用教师的逐层语义结构和全局上下文,以及LiDAR序列中固有的丰富时空信息。我们提出HilDA,一个用于LiDAR骨干网络的自监督预训练框架,能更好地捕捉驾驶任务所需的语义“是什么”和几何“在哪里”。HilDA结合了分层蒸馏(包括用于渐进语义对齐的多层蒸馏和用于场景级语义的全局上下文蒸馏)与一个促进时空一致性的时间占用扩散目标。使用HilDA预训练的模型在跨模态蒸馏基准上取得了最先进的结果,并在3D目标检测、场景流和语义占用预测任务上优于通过先前蒸馏方法训练的模型。代码见:此 https URL。

英文摘要

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they fully exploit neither the teacher's layer-wise semantic structure and global context nor the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic "what" and geometric "where" needed for driving tasks. HilDA combines hierarchical distillation, comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.

2510.08807 2026-06-19 cs.RO cs.LG 版本更新

Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation

Humanoid Everyday:面向开放世界人形机器人操作的综合机器人数据集

Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, Abha Jha, Hanwen Yang, Rong Xue, Sergey Zakharov, Vitor Guizilini, Yue Wang

发表机构 * University of Southern California(南加州大学) Toyota Research Institute(丰田研究院)

AI总结 提出Humanoid Everyday数据集,包含10.3k轨迹、260个任务的多模态数据,用于人形机器人灵巧操作、人机交互和移动操作研究,并配套云评估平台。

详情
AI中文摘要

从运动到灵巧操作,人形机器人在展示复杂的全身能力方面取得了显著进展。然而,当前大多数机器人学习数据集和基准主要关注固定机器人臂,少数现有人形数据集要么局限于固定环境,要么任务多样性有限,通常缺乏人机交互和下肢运动。此外,缺乏用于在人形数据上对基于学习的策略进行基准测试的标准化评估平台。在这项工作中,我们提出了Humanoid Everyday,一个大规模且多样化的人形操作数据集,其特点是涉及灵巧物体操作、人机交互、运动集成动作等广泛的任务多样性。利用高效的人工监督遥操作流水线,Humanoid Everyday聚合了高质量的多模态感官数据,包括RGB、深度、LiDAR和触觉输入,以及自然语言注释,包含10.3k条轨迹和超过300万帧数据,涵盖7个大类共260个任务。此外,我们对数据集上的代表性策略学习方法进行了分析,提供了它们在不同任务类别中的优势和局限性的见解。为了标准化评估,我们引入了一个基于云的评估平台,允许研究人员在我们的受控环境中无缝部署他们的策略并接收性能反馈。通过发布Humanoid Everyday以及我们的策略学习分析和标准化的基于云的评估平台,我们旨在推进通用人形操作的研究,并为现实世界中更有能力和具身化的机器人代理奠定基础。我们的数据集、数据收集代码和云评估网站在我们的项目网站上公开发布。

英文摘要

From locomotion to dextrous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dextrous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data spanning 260 tasks in 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.

2606.19186 2026-06-19 cs.RO cs.LG 版本更新

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise

学习标注延迟和误报AEB事件:针对极端类别不平衡和非对称标签噪声的实用系统

Mengxiang Hao, Xin Jiang, Xinghao Huang, Wenliang Su, Zhiteng Wang, Junjie Rao, Xiaotian Yang, Wei Liao, Chengyu Han, Gen Liang, Yulun Song, Zhitao Xu, Xianpeng Lang

发表机构 * Li Auto(理想汽车)

AI总结 提出首个自动化AEB标注框架,通过特定数据增强和噪声抑制技术,解决极端类别不平衡和非对称标签噪声问题,将延迟/误报触发召回率提升80%,人工工作量减少50%。

Comments 8 pages, 5 figures, accepted by IEEE International Conference on Robotics and Automation (ICRA)

Journal ref 2026 IEEE International Conference on Robotics and Automation (ICRA)

详情
AI中文摘要

自主紧急制动(AEB)优化依赖于准确标注的真实世界触发事件,特别是揭示系统缺陷的罕见但关键的延迟和误报AEB触发事件。然而,这些少数样本在每天数千次触发事件中占比不到5%,使得大规模人工标注成本过高。我们提出了首个自动化AEB标注框架来解决这一问题。在开发过程中,我们识别出两个严重损害延迟/误报触发标注准确性的基本挑战:(1)极端类别不平衡,其中延迟/误报触发被真实触发淹没;(2)非对称标签噪声,其中误标注的多数样本(真实触发)抑制了少数样本(延迟/误报触发)的学习。为克服这些挑战,我们提出两项关键创新:(1)特定数据增强,通过操纵焦点目标属性、移植自车动态和掩蔽非焦点代理来合成逼真样本;(2)噪声抑制,使用稳定硬度估计和探针引导的自适应阈值来清理误标注的真实触发样本。关键的是,我们将模型部署为具有全栈架构的实用标注系统,从每天数千个AEB事件中高效识别关键的延迟/误报触发。生产结果表明,延迟/误报触发的召回率提高了80%,人工工作量减少了50%。除了直接收益,该系统通过积累高质量标注实现持续自我改进,为车载AEB系统优化奠定了必要的数据基础。

英文摘要

Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) extreme class imbalance, where delayed/false triggers are overwhelmed by true triggers; (2) asymmetric label noise, where mislabeled majority samples (true triggers) suppress the learning of minority samples (delayed/false triggers). To overcome these challenges, we propose two key innovations: (1) specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; (2) noise suppression using stable hardness estimation and a probe-guided adaptive threshold to clean mislabeled true trigger samples. Crucially, we deploy our model as a practical annotation system with full-stack architecture, efficiently identifying critical delayed/false triggers from thousands of daily AEB events. Production results demonstrate an 80% improvement in recall of delayed/false triggers and a 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization.

2501.17015 2026-06-19 cs.AI cs.MA cs.RO 版本更新

UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation

UniMM:一种用于多智能体仿真的统一混合模型框架

Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, Yue Wang

发表机构 * Zhejiang University(浙江大学) Horizon Robotics

AI总结 提出UniMM框架统一回归混合模型与离散NTP模型,通过闭环样本生成缓解分布偏移,并在WOSAC基准上取得最优性能。

Comments Accepted author manuscript. The version of record has been published in IEEE Transactions on Pattern Analysis and Machine Intelligence

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, 2026

详情
AI中文摘要

仿真在评估自动驾驶系统中起着关键作用,其中生成逼真的多智能体行为是一个关键方面。在多智能体仿真中,主要挑战包括行为多模态性和闭环分布偏移。在本研究中,我们提出了一个统一的混合模型(UniMM)框架,用于生成多模态智能体行为,该框架涵盖了主流方法,包括基于回归的混合模型和离散NTP模型。此外,我们引入了一种针对混合模型的闭环样本生成方法,以缓解分布偏移。在UniMM框架内,我们从模型和数据角度识别了关键配置。我们对各种模型配置进行了系统检查,并全面描述了它们的效果。此外,我们对数据配置的研究强调了闭环样本在实现逼真仿真中的关键作用。为了将闭环样本的优势扩展到更广泛的混合模型中,我们进一步引入了一种时间解缠和对齐机制,以解决捷径学习和离策略学习问题。利用我们探索的见解,UniMM框架内提出的不同变体,包括离散模型、无锚模型和基于锚点的模型,均在WOSAC基准上取得了最先进的性能。

英文摘要

Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we formulate a unified mixture model (UniMM) framework for generating multimodal agent behaviors, which can cover the mainstream methods including regression-based mixture models and discrete NTP models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the UniMM framework, we recognize critical configurations from both the model and data perspectives. We conduct a systematic examination of various model configurations, and comprehensively characterize their effects. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further introduce a temporal disentanglement-and-alignment mechanism to address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.

2507.05169 2026-06-19 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Critique of World Model

世界模型批判:一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

AI总结 本文从心理学“假设性思维”出发,提出世界模型的核心目标是模拟真实世界的所有可行动可能性,并设计了一种基于状态化、分层、多级、混合连续/离散表示的生成式潜在预测(GLP)架构。

详情
AI中文摘要

世界模型,即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器,近年来因开发具有人工(通用)智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估,已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中“假设性思维”的概念,论证世界模型的主要目标是模拟真实世界中所有可行动的可能性,以进行有目的的推理和行动。我们审视了世界建模的关键设计维度:数据、表示、架构、学习目标和使用,调查了现有方法并分析了它们的权衡。在此基础上,我们提出了一种新的通用世界模型生成式潜在预测(GLP)架构,基于有状态的、分层的、多层次的、混合连续/离散表示,以及生成式和自监督学习框架,并展望了由这种模型支持的物理、智能体和嵌套(PAN)AGI系统。

英文摘要

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

11. 安全、鲁棒性与可信机器人 6 篇

2606.19561 2026-06-19 cs.RO cs.SY eess.SY 新提交

pdSTL: Probabilistic Differentiable Signal Temporal Logic for Stochastic Systems

pdSTL: 面向随机系统的概率可微信号时序逻辑

Bennett Dogbey, Hemanth Manjunatha

发表机构 * Oklahoma State University(俄克拉荷马州立大学)

AI总结 提出pdSTL框架,将概率语义与可微鲁棒性结合,通过区间值概率语义和LSTM式展开实现线性时间可微监控,在障碍物规避、换道和真实四旋翼飞行实验中优于确定性可微STL。

详情
AI中文摘要

在不确定环境中运行的自主机器人必须满足复杂的时序和安全规范,尽管存在随机动力学和感知噪声。虽然信号时序逻辑(STL)为基于梯度的优化提供了鲁棒性度量,但现有的扩展要么缺乏可微性,要么忽略了信念空间的不确定性。我们引入了pdSTL(概率可微信号时序逻辑),这是一个将概率语义与信念轨迹上的可微鲁棒性统一起来的框架。pdSTL采用区间值概率语义来计算保守的满足界限,并通过STL语法树组合传播。我们将时序鲁棒性评估制定为STL算子的循环、LSTM式展开,从而实现适用于端到端轨迹优化的线性时间、可微监控。我们在模拟障碍物规避、换道操作以及真实世界的Crazyflie四旋翼飞行实验中验证了pdSTL,这些实验在气动干扰下进行。结果表明,pdSTL在保持形式化概率保证的同时实现了高效优化,在现实世界的不确定性下,在维持安全裕度方面显著优于确定性可微STL。

英文摘要

Autonomous robots operating in uncertain environments must satisfy complex temporal and safety specifications despite stochastic dynamics and sensing noise. While Signal Temporal Logic (STL) offers robustness measures for gradient-based optimization, existing extensions either lack differentiability or ignore belief-space uncertainty. We introduce pdSTL (probabilistic differentiable Signal Temporal Logic), a framework that unifies probabilistic semantics with differentiable robustness over belief trajectories. pdSTL employs interval-valued probabilistic semantics to compute conservative satisfaction bounds, propagated compositionally through the STL syntax tree. We formulate the temporal robustness evaluation as a recurrent, LSTM-style unfolding of STL operators, enabling linear-time, differentiable monitoring suitable for end-to-end trajectory optimization. We validate pdSTL on simulated obstacle avoidance, lane-change maneuvers, and real-world Crazyflie quadcopter flight experiments under aerodynamic disturbances. Results demonstrate that pdSTL achieves efficient optimization with formal probabilistic guarantees, significantly outperforming deterministic differentiable STL in maintaining safety margins under real-world uncertainty.
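To make the idea of interval-valued satisfaction bounds propagated through a recurrent, linear-time unfolding more concrete, here is a rough sketch. It is not the pdSTL semantics, which the abstract does not spell out: as a stand-in it uses the classical Fréchet bounds for a conjunction with unknown dependence, and folds an "always" operator over time one step at a time. The function names and the per-step bounds are hypothetical.

```python
def and_interval(a, b):
    """Frechet bounds on P(A and B) from bounds (lo, hi) on P(A) and P(B); no independence assumed."""
    lo = max(0.0, a[0] + b[0] - 1.0)
    hi = min(a[1], b[1])
    return (lo, hi)


def always(step_bounds):
    """Fold 'always' over a trajectory: one conjunction per time step, so the cost is linear in length."""
    state = (1.0, 1.0)  # the empty conjunction holds with certainty
    for bounds in step_bounds:
        state = and_interval(state, bounds)
    return state


# Hypothetical per-step bounds on the probability that a safety predicate holds.
per_step = [(0.99, 1.0), (0.97, 0.99), (0.98, 1.0)]
lo, hi = always(per_step)  # conservative bounds on satisfying the predicate at every step
```

The hard `max` and `min` here are only piecewise differentiable; a gradient-based optimizer would typically need smooth surrogates, which this sketch leaves out.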

2606.19590 2026-06-19 cs.RO cs.SY eess.SY 新提交

Safe, Real-Time Active Model Discrimination and Fault Diagnosis for Nonlinear Systems via Differentiable Reachability

通过可微可达性实现非线性系统的安全、实时主动模型辨识与故障诊断

Xinpei Ni, Melkior Ornik, Glen Chou, Samuel Coogan

发表机构 * Institute of Robotics and Intelligent Machines (IRIM), Georgia Institute of Technology(佐治亚理工学院机器人与智能机器研究所) Department of Aerospace Engineering, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校航空航天工程系)

AI总结 针对不确定非线性系统,提出一种基于可微可达性近似的实时主动故障诊断算法,通过优化控制输入使输出集分离,在保证安全的同时实现快速模型辨识。

详情
AI中文摘要

我们提出了一种安全、实时的算法,用于对具有过程和测量扰动的连续时间不确定非线性系统进行主动故障诊断和模型辨识。给定一组表示正常和故障模式(包括执行器和传感器故障)的候选模型,我们制定了一个输出反馈、时变策略优化问题,该问题(i)在有限时域内鲁棒地强制执行状态输入安全约束,并且(ii)驱动系统产生与至多一个模型一致的采样测量,从而实现确定性诊断。为了实时解决这个问题,我们使用可达状态和输出集的区间过近似开发了一个可处理的近似,并通过一个可微目标函数对诊断能力进行编码,该函数惩罚可能模型的可达输出集之间的重叠。由此产生的优化使用基于梯度的JAX和可微可达性原语在线高效求解。我们在几个高维非线性机器人系统(包括模拟四旋翼和战斗机模型、硬件差速驱动机器人和四足导航)上评估了我们的方法,用于传感器和执行器故障诊断(最多11种故障模式)。在这些案例研究中,我们的方法在50毫秒内实现了可靠的模型辨识,在辨识成功率和速度上优于基线方法,同时提供了形式化的安全保证。

英文摘要

We present a safe, real-time algorithm for active fault diagnosis and model discrimination for uncertain continuous-time nonlinear systems with process and measurement disturbances. Given a finite set of candidate models representing nominal and faulty modes, including actuator and sensor faults, we formulate an output-feedback, time-varying policy optimization problem that (i) robustly enforces state-input safety constraints over a finite horizon and (ii) drives the system to produce sampled measurements consistent with at most one model, enabling deterministic diagnosis. To solve this problem in real time, we develop a tractable approximation using interval over-approximations of reachable state and output sets, and encode diagnosability via a differentiable objective that penalizes overlap between the reachable output sets of possible models. The resulting optimization is solved efficiently online with gradient-based methods using JAX and differentiable reachability primitives. We evaluate our method on sensor and actuator fault diagnosis (up to 11 fault modes) in several high-dimensional nonlinear robotic systems, including a simulated quadrotor and fighter-jet model, a hardware differential-drive robot, and quadrupedal navigation. Across these case studies, our approach achieves reliable model discrimination in under 50 ms, outperforming baselines in discrimination success rate and speed while providing formal safety guarantees.
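The abstract above describes an objective that penalizes overlap between interval over-approximations of the reachable output sets of candidate models. A minimal sketch of such a penalty follows, assuming axis-aligned boxes and using intersection volume as the overlap measure; the paper's actual objective, its JAX implementation, and all names below are not taken from the source.

```python
def interval_overlap(box_a, box_b):
    """Intersection volume of two axis-aligned boxes, each a list of (lo, hi) per output dimension."""
    volume = 1.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(box_a, box_b):
        width = min(a_hi, b_hi) - max(a_lo, b_lo)
        if width <= 0.0:
            return 0.0  # disjoint along this dimension, so the boxes do not intersect
        volume *= width
    return volume


def overlap_penalty(boxes):
    """Sum of pairwise overlaps; zero means the over-approximated output sets are pairwise disjoint."""
    total = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            total += interval_overlap(boxes[i], boxes[j])
    return total


# Hypothetical reachable output boxes for a nominal model and two fault models.
nominal = [(0.0, 1.0), (0.0, 1.0)]
fault_a = [(0.5, 1.5), (0.25, 2.0)]
fault_b = [(2.0, 3.0), (0.0, 1.0)]

penalty = overlap_penalty([nominal, fault_a, fault_b])
```

When the penalty reaches zero, any measurement inside one box rules out the other models, which matches the "consistent with at most one model" condition in the abstract. A volume-based penalty has zero gradient once boxes separate, so a practical objective would likely add a margin term.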

2606.19598 2026-06-19 cs.RO 新提交

Fail-RAG: A Retrieval Augmented Generation Informed Framework for Robot Failure Identification

Fail-RAG:一种基于检索增强生成的机器人故障识别框架

Ameya Salvi, Jie Hu

发表机构 * Hitachi America, Ltd.(日立美国有限公司)

AI总结 提出Fail-RAG框架,利用检索增强生成和视觉语言模型,通过嵌入故障图像和上下文信息并查询数据库,实现机器人操作故障的高效检测,在仓库自动化任务中平均检测准确率提升25个百分点。

详情
AI中文摘要

工业自动化正经历由技术突破和社会变革驱动的机器人演进:向通用机器人、具身和物理人工智能发展,以及劳动力短缺的加剧。智能自主机器人不仅需要按计划运动,还需对意外事件做出反应。本研究聚焦于仓库中物料搬运机器人的意外事件,将其定义为故障,并开发检测机器人操作故障的方法。由于环境和任务的动态性,故障形式可能变化,基于规则的检测方法可能失效。我们提出'Fail-RAG',一种基于检索增强生成(RAG)的故障检测框架,其中故障图像和上下文信息被嵌入,并通过计算相似度查询故障数据库。进一步使用视觉语言模型(VLM)按照指令模板分析故障并提供细节。通过使用固定机械臂和移动操作器在仓库自动化常见任务中进行仿真和物理实验,评估了Fail-RAG的性能。与使用现成VLM相比,Fail-RAG在五种机器人操作类型上的平均故障检测准确率提高了25个百分点,表明其在真实世界故障检测中的有效性。

英文摘要

Industry automation is witnessing an evolution in robotics driven by both technological breakthroughs and societal changes: progress towards generalist robots, embodied and physical artificial intelligence (AI), and increasing labor shortages in manufacturing. An intelligent autonomous robot needs to not only act according to planned motions but also react to any unexpected events. In this study, we focus on such unexpected events in warehouses where robots are used for material handling. Specifically, we refer to any unexpected events as failures and develop methods to detect failures related to robot operations. Rule-based detection methods may break since the form of failures could change due to the dynamic nature of both environments and tasks. We propose 'Fail-RAG', a Retrieval Augmented Generation (RAG)-based failure detection framework where failure images and context information are embedded and queried against a failure database by calculating their similarities. Vision-Language Models (VLMs) are further used to analyze failures and provide details by following our instruction template. We evaluated the performance of Fail-RAG by conducting both simulation and physical experiments using fixed robot arms and a mobile manipulator for multiple tasks that are common in warehouse automation. Fail-RAG achieved 25 percentage point higher failure detection accuracy on average across five types of robot operations compared to using off-the-shelf VLMs, indicating its effectiveness for real-world failure detection.
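The retrieval step described above (embed the observation, then query a failure database by similarity) might look roughly like the following. This is only an illustration: the embeddings are toy three-dimensional vectors, cosine similarity is an assumed choice of similarity measure, and the database entries and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, database, k=2):
    """Return the k database entries whose embeddings are most similar to the query."""
    ranked = sorted(database, key=lambda e: cosine(query, e["embedding"]), reverse=True)
    return ranked[:k]


# Hypothetical failure database; real embeddings would come from an image/text encoder.
failure_db = [
    {"label": "dropped item", "embedding": [0.9, 0.1, 0.0]},
    {"label": "missed grasp", "embedding": [0.1, 0.9, 0.1]},
    {"label": "collision", "embedding": [0.0, 0.2, 0.9]},
]

query = [0.8, 0.3, 0.0]  # hypothetical embedding of the current observation
top = retrieve(query, failure_db, k=2)
labels = [entry["label"] for entry in top]
```

The retrieved entries would then be passed to a VLM as context alongside the instruction template; that stage is not sketched here.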

2606.19998 2026-06-19 cs.RO cs.AI cs.CV cs.LG 新提交

Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory

Tri-Info: 基于信息论的VLA模型可泛化、可解释的故障预测

Jinghan Yang, Yunchao Zhang, Wang Yuan, Haolun Wan, Jiaming Zhang, Zhengyang Hu, Yanchao Yang

发表机构 * InfoBodied AI Lab, The University of Hong Kong(香港大学信息具身人工智能实验室) HKU Musketeers Foundation Institute of Data Science(香港大学赛马会数据科学研究院)

AI总结 提出Tri-Info方法,通过信息论信号捕捉动作多样性、时间一致性和状态耦合,实现跨架构、环境及仿真到现实的零样本故障检测,准确率达83%。

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越多地部署在各种任务中,但它们仍然是黑箱,其物理交互可能导致不可逆的伤害,因此需要可泛化和可解释的故障检测。我们观察到成功和失败的轨迹具有系统不同的信息论特征。基于此,我们将VLA控制形式化为闭环信息管道,并推导出三重信息论(Tri-Info)信号,这些信号捕捉动作是否保持多样性、时间一致性以及与状态转换的耦合。在六个VLA模型和三个基准环境中,Tri-Info在域内匹配最强的基线。此外,Tri-Info无需重新训练即可跨架构、环境和仿真到现实差距迁移,在现实世界任务中达到83%的准确率,而先前的检测器则降至随机水平。这确立了Tri-Info作为一种简单而强大的方法,不仅能够检测故障并具有强大的跨域泛化能力,还能提供底层故障模式的可解释诊断。

英文摘要

Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.
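As a loose illustration of what an information-theoretic signal for "actions remain diverse" could be, the sketch below computes the Shannon entropy of a discretized one-dimensional action sequence. The abstract does not define the Tri-Info signals, so entropy over fixed-width bins is purely an assumed proxy here, and the bin width and sample rollouts are hypothetical.

```python
import math
from collections import Counter


def action_entropy(actions, bin_width=0.1):
    """Shannon entropy (bits) of scalar actions after fixed-width binning."""
    bins = Counter(math.floor(a / bin_width) for a in actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in bins.values())


# Hypothetical rollouts: one with varied actions, one stuck repeating the same command.
diverse_rollout = [0.05, 0.15, 0.25, 0.35]
collapsed_rollout = [0.05, 0.05, 0.05, 0.05]

h_diverse = action_entropy(diverse_rollout)
h_collapsed = action_entropy(collapsed_rollout)
```

A detector built on such a signal would flag rollouts whose entropy drops toward zero; the temporal-consistency and state-coupling signals mentioned in the abstract would need separate estimators.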

2606.20428 2026-06-19 cs.RO 新提交

ARC: Adaptive Robust Joint State and Covariance Estimation

ARC:自适应鲁棒联合状态与协方差估计

Alexandre Hadji-Thomas, Andrew Stirling, James R. Forbes

AI总结 提出统一块坐标下降框架,结合自适应鲁棒损失、迭代重加权最小二乘状态更新和最小加权协方差行列式估计器,实现离群值下状态与协方差的自适应联合估计。

Comments Submitted to IEEE Robotics and Automation Letters (RA-L), June 2026. 8 pages, 7 figures, 1 table

详情
AI中文摘要

传感器测量经常受到离群值和非高斯噪声的污染。这些传感器数据中的缺陷会导致经典状态估计器产生有偏且不可靠的状态和不确定性估计。鲁棒估计器拒绝或降低离群值的权重,但不进行测量协方差估计,而联合状态和协方差估计器假设高斯残差和固定的损失形状参数。将这两种能力整合到一个框架中,可以在存在离群值的情况下同时估计状态和协方差。本文提出了一种统一的块坐标下降框架,该框架结合了范数感知自适应鲁棒损失、迭代重加权最小二乘状态更新和最小加权协方差行列式协方差估计器,产生了一个自调谐的联合状态和协方差估计器。该框架在蒙特卡洛模拟和真实世界超宽带定位实验(在杂乱的视距外环境中)中进行了评估。结果表明,所提出的估计器能够一致地恢复真实的内点测量协方差,并在状态估计精度上达到或超过所有基线方法,且无需任何手动参数调整。

英文摘要

Sensor measurements are frequently corrupted by outliers and non-Gaussian noise. These imperfections in the sensor data can cause classical state estimators to generate biased and unreliable state and uncertainty estimates. Robust estimators reject or downweight outliers but do not perform measurement covariance estimation, whereas joint state and covariance estimators assume Gaussian residuals and fixed loss shape parameters. Integrating these two capabilities into a single framework is an opportunity to simultaneously estimate both state and covariance in the presence of outliers. This paper proposes a unified Block-Coordinate Descent framework that combines a norm-aware adaptive robust loss, an Iteratively Reweighted Least-Squares state update, and a Minimum Weighted Covariance Determinant covariance estimator, yielding a self-tuning joint state and covariance estimator. The framework is evaluated in a Monte-Carlo simulation and on real-world ultra-wideband localization experiments in cluttered non-line-of-sight environments. Results show that the proposed estimator consistently recovers the true inlier measurement covariance and matches or exceeds the state estimation accuracy of all baselines, without requiring any manual parameter tuning.
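The Iteratively Reweighted Least-Squares state update mentioned above can be sketched for the simplest possible case, a scalar location estimated from range-like measurements. This sketch covers only the IRLS component, uses a fixed Cauchy loss with a hand-picked scale in place of the adaptive loss, and omits the covariance step entirely; the data and names are hypothetical.

```python
def irls_location(measurements, scale=1.0, iterations=50):
    """Robust scalar location estimate via IRLS with Cauchy weights (fixed scale, assumed)."""
    ordered = sorted(measurements)
    x = ordered[len(ordered) // 2]  # initialize at the median to start inside the inlier cluster
    for _ in range(iterations):
        # Cauchy weight: near 1 for small residuals, decaying as 1/r^2 for large ones.
        weights = [1.0 / (1.0 + ((z - x) / scale) ** 2) for z in measurements]
        # Weighted least-squares update of the state given the current weights.
        x = sum(w * z for w, z in zip(weights, measurements)) / sum(weights)
    return x


# Hypothetical measurements in meters: five inliers near 10 m and two gross outliers.
ranges_m = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0, 30.0]

plain_mean = sum(ranges_m) / len(ranges_m)  # pulled far from 10 m by the outliers
robust = irls_location(ranges_m)            # stays close to the inlier cluster
```

In a block-coordinate scheme of the kind the abstract describes, this state update would alternate with updates of the loss shape and the measurement covariance.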

2601.15459 2026-06-19 cs.RO 版本更新

Neural Minimum-Distance Estimation for Collision-Aware Operation of Multi-Arm Laparoscopy Surgical Robots Through Learning-from-Simulation

基于仿真学习的多臂腹腔镜手术机器人碰撞感知操作的神经最小距离估计

Sarvin Ghiasi, Majid Roshanfar, Jake Barralet, Liane S. Feldman, Amir Hooshiar

发表机构 * Surgical Performance Enhancement and Robotics (SuPER) Centre, Department of Surgery(外科性能增强与机器人中心(SuPER)中心,外科部) The Wilfred and Joyce Posluns Centre for Image Guided Innovation & Therapeutic Intervention (PCIGITI)(威廉与乔伊斯·波斯伦中心(PCIGITI)影像引导创新与治疗干预中心) The Hospital for Sick Children (SickKids)(儿童医院(SickKids))

AI总结 提出结合分析建模、实时仿真与深度残差神经网络的框架,用于多臂手术机器人最小距离估计与碰撞预警,模型在验证集上R²=0.940,RMSE=42.0 mm。

Journal ref Sensors 2026, 26(12), 3744

详情
AI中文摘要

本研究提出了一个集成框架,通过解决多臂操纵器之间的最小距离估计和相关的碰撞感知警告,提高腹腔镜手术中机械臂的安全性和操作效率。通过结合分析建模、实时仿真和机器学习,该框架为确保机器人安全操作提供了稳健的解决方案。开发了一个分析模型,基于关节配置估计机械臂之间的最小距离,提供理论计算作为验证工具和基准。为补充这一点,创建了一个3D仿真环境,模拟两个7自由度Kinova机械臂(Kinova inc., Boisbriand, QC, Canada),生成了用于距离估计和碰撞警告的多样化配置数据集。利用这些见解,训练了一个以关节配置为输入的深度残差神经网络模型。在保留的验证集上,模型达到了R²=0.940,RMSE=42.0 mm,MAE=28.7 mm,且平均偏差接近零,展示了强大的预测准确性和在整个工作空间中的一致泛化能力。该框架旨在作为早期碰撞警告层,当预测的臂间距离低于0.2 m阈值时触发警告,考虑到Kinova Gen3(Kinova inc., Boisbriand, QC, Canada)的横截面半径,这对应于大约50 mm的表面到表面间隙。这项工作展示了将分析建模与机器学习相结合以提高多臂机器人系统精度和可靠性的有效性。

英文摘要

This study presents an integrated framework for enhancing the safety and operational efficiency of robotic arms in laparoscopic surgery by addressing minimum-distance estimation between multi-arm manipulators and the associated collision-aware warning. By combining analytical modeling, real-time simulation, and machine learning, the framework offers a robust solution for ensuring safe robotic operations. An analytical model was developed to estimate the minimum distances between robotic arms based on their joint configurations, offering theoretical calculations that serve as both a validation tool and a benchmark. To complement this, a 3D simulation environment was created to model two 7-DOF Kinova robotic arms (Kinova inc., Boisbriand, QC, Canada), generating a diverse dataset of configurations for distance estimation and collision warning. Using these insights, a deep residual neural network model was trained with joint configurations as inputs. On the held-out validation set, the model achieves R² = 0.940, RMSE = 42.0 mm, MAE = 28.7 mm, and a near-zero mean bias, demonstrating strong predictive accuracy and consistent generalization across the workspace. The framework is intended as an early collision warning layer, where a warning is triggered when the predicted inter-arm distance falls below a 0.2 m threshold, which corresponds to a surface-to-surface clearance of approximately 50 mm given the Kinova Gen3 (Kinova inc., Boisbriand, QC, Canada) cross-sectional radius. This work demonstrates the effectiveness of combining analytical modeling with machine learning to enhance the precision and reliability of multi-arm robotic systems.
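The warning logic above (trigger when the inter-arm distance drops below 0.2 m) can be illustrated with a crude geometric stand-in. The sketch approximates the minimum distance between two straight links by sampling points along each centerline; this is neither the paper's analytical model nor its neural estimator, the link coordinates are hypothetical, and sampling only gives an upper bound on the true minimum distance.

```python
import math

WARNING_THRESHOLD_M = 0.2  # threshold stated in the abstract


def sample_segment(start, end, n):
    """Return n evenly spaced 3D points from start to end, inclusive."""
    return [
        tuple(start[i] + (end[i] - start[i]) * k / (n - 1) for i in range(3))
        for k in range(n)
    ]


def min_distance(link_a, link_b, n=51):
    """Approximate minimum centerline distance between two links by dense sampling."""
    points_a = sample_segment(link_a[0], link_a[1], n)
    points_b = sample_segment(link_b[0], link_b[1], n)
    return min(math.dist(p, q) for p in points_a for q in points_b)


# Hypothetical link centerlines in meters: one along z, one along y passing 0.15 m away.
link_a = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
link_b = ((0.15, -0.5, 0.5), (0.15, 0.5, 0.5))

distance = min_distance(link_a, link_b)
warn = distance < WARNING_THRESHOLD_M
```

For a full arm, the same check would run over every pair of links obtained from forward kinematics, which is the kind of computation a learned estimator is meant to replace with a single forward pass.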

12. 其他/综合机器人 6 篇

2606.19525 2026-06-19 cs.RO 新提交

A Categorial and Sheaf-Theoretic Semantics for Autonomic Component Ensembles

自主组件集合的范畴与层论语义

Manuel Hernández, Eduardo Sánchez-Soto

AI总结 针对自主组件集合语言SCEL,提出基于范畴论和层论的多层数学模型,将机器人社会建模为拓扑空间上的层,通过层上同调量化系统故障,将分布式系统验证转化为几何分析。

详情
AI中文摘要

大规模、去中心化的自主代理系统(如机器人集群和网络化信息物理系统)的激增对传统形式化方法提出了严峻挑战。软件组件集合语言(SCEL)为这类系统提供了形式化模型,但其操作语义不适合推理全局、结构和涌现属性。本报告利用范畴论和层论为SCEL提出了一种新的多层数学模型。我们认为,用SCEL描述的机器人社会可以形式化地建模为拓扑空间上的层,其中组件是点,集合是开集,分布式知识构成层的数据。在此框架下,信息共享等计算过程等价于“粘合”局部数据的层论操作。系统故障可以被理解并量化为拓扑障碍,通过层上同调可测量。该方法将复杂分布式系统的验证转化为数学对象的几何分析,为设计鲁棒的自主系统提供了深刻的结构性见解。

英文摘要

The proliferation of large-scale, decentralized systems of autonomous agents, such as swarms of robots and networked cyber-physical systems, presents a formidable challenge to traditional formal methods. The Software Component Ensemble Language (SCEL) offers a formal model for such systems, but its operational semantics is not ideal for reasoning about global, structural, and emergent properties. This report proposes a new, multi-layered mathematical model for SCEL using category theory and sheaf theory. We argue that a society of robots described in SCEL can be formally modeled as a sheaf on a topological space, where components are points, ensembles are open sets, and distributed knowledge forms the sheaf's data. In this framework, computational processes like information sharing become equivalent to the sheaf-theoretic operation of "gluing" local data. System failures can then be understood and quantified as topological obstructions, measurable by sheaf cohomology. This approach transforms the verification of a complex distributed system into the analysis of the geometry of a mathematical object, providing deep, structural insights for the design of robust autonomic systems.

2606.20394 2026-06-19 cs.RO math.OC 新提交

Agentic AutoResearch for Space Autonomy: An Auditable, LLM-Driven Research Agent for Aerospace Control Problems

面向空间自主性的智能体自动研究:用于航空航天控制问题的可审计、LLM驱动的研究代理

Amit Jain, Richard Linares

发表机构 * Department of Aeronautics and Astronautics(航空航天学系)

AI总结 提出AutoResearch框架,利用大语言模型作为离线研究代理,自动迭代开发航天控制策略,并通过内置可信层审计结果,消除种子噪声影响,在交会和对接问题上验证了有效性。

详情
AI中文摘要

航天器的制导、导航与控制功能日益通过从专家求解器中提炼的学习策略来实现。开发这样的策略本身就是一个研究过程:研究者选择架构和超参数,运行实验,并必须判断一个明显的改进是真实的还是仅仅是种子噪声。本文提出了AutoResearch框架,其中大语言模型自主驱动这一循环,用于航空航天控制问题,并结合了一个内置在循环中的可信层,该层根据问题自身测量的种子噪声对每个报告的结果进行认证。语言模型仅作为离线研究代理,负责开发控制策略;它产生的训练策略随后部署在航天器上,而模型本身从不操作飞行器。在每次迭代中,代理读取自然语言描述的问题描述和运行历史,对训练脚本提出一次编辑,执行它,并记录结果。任何报告的结果在通过相同的三项检查之前不会被认可:测量的每个问题的种子噪声、最佳配置的重新播种验证,以及代理编辑的留一法剪枝。相同的循环被原样应用于两个航空航天控制问题:Clohessy-Wiltshire相对交会问题和带有安全约束的避碰对接问题(经过禁飞区),每个问题都针对已知的最优控制基准进行了校准。在这两个问题中,经过审计的策略以多个标准差超过了测量的种子噪声;对相同参数的未定向搜索则没有。在对接问题上,差距变得明显:未定向搜索没有产生可行的策略,而学习到的策略在每个种子上都保持在禁飞区之外。

英文摘要

Spacecraft guidance, navigation, and control functions are increasingly realized as learned policies distilled from expert solvers. Developing such a policy is itself a research process: an investigator selects an architecture and hyperparameters, runs experiments, and must determine whether an apparent improvement is genuine or merely seed noise. This paper presents AutoResearch, a framework in which a large language model autonomously drives that loop for aerospace control problems, coupled with a credibility layer, built into the loop, that certifies each reported result against the problem's own measured seed noise. The language model serves only as the offline research agent that develops the control policy; the trained policy it produces is then deployed onboard the spacecraft, while the model itself never operates the vehicle. At each iteration the agent reads a plain-language problem description and the run history, proposes a single edit to the training script, executes it, and logs the outcome. No reported result is credited until it passes the same three checks: measured per-problem seed noise, reseeded verification of the best configuration, and leave-one-out pruning of the agent's edits. The same loop is applied, unchanged, to two aerospace control problems: a Clohessy-Wiltshire relative rendezvous and a safety-constrained collision-avoidance docking past a keep-out zone, each calibrated against a known optimal control benchmark. In both, the audited policy clears the measured seed noise by many standard deviations; an undirected search over the same parameters does not. On the docking problem the gap becomes categorical: undirected search yields no feasible policy, while the learned policy stays outside the keep-out zone on every seed.
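One of the checks described above compares a reported improvement against measured seed noise. A minimal sketch of that idea follows, assuming seed noise is the sample standard deviation of a baseline across seeds and using a hypothetical 3-sigma acceptance threshold; the abstract says only that audited results clear the noise "by many standard deviations", so both the threshold and the scores below are illustrative.

```python
import statistics


def clears_seed_noise(baseline_scores, candidate_scores, min_sigmas=3.0):
    """Return (gain in units of seed-noise sigma, whether it exceeds min_sigmas)."""
    noise = statistics.stdev(baseline_scores)  # seed-to-seed spread of the baseline
    if noise == 0.0:
        raise ValueError("baseline shows no seed-to-seed variation; noise cannot be estimated")
    gain = statistics.mean(candidate_scores) - statistics.mean(baseline_scores)
    sigmas = gain / noise
    return sigmas, sigmas > min_sigmas


# Hypothetical success rates over five seeds each.
baseline = [0.70, 0.72, 0.71, 0.69, 0.73]
real_gain = [0.80, 0.82, 0.81, 0.79, 0.83]
noise_only = [0.72, 0.70, 0.73, 0.71, 0.72]

sigmas_real, accepted_real = clears_seed_noise(baseline, real_gain)
sigmas_noise, accepted_noise = clears_seed_noise(baseline, noise_only)
```

The reseeded verification and leave-one-out pruning checks from the abstract would sit on top of this and are not sketched here.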

2606.20547 2026-06-19 cs.LG cs.CV cs.GR cs.RO math.DG Cross-listed

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

Token 是群元素:关于矩阵李群上的李代数注意力

Przemyslaw Musialski

Affiliations * New Jersey Institute of Technology

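AI summary Proposes Lie-Algebra Attention, which defines each token as an element of a matrix Lie group and uses the Lie-algebra norm of the relative pose as the attention score. It needs no learned kernel or representation-theoretic machinery and applies to non-compact, non-abelian groups such as the affine full-frame groups.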

Comments preprint, 19 pages, 3 figures

Details
AI Chinese abstract

我们将注意力token置于群上:一个token是矩阵李群$G$的一个元素$g_i$——一个纯粹的变换,没有特征负载,也没有外部作用$\rho(g)$承载它。据我们所知,这是第一个token为裸矩阵李群元素的注意力构造:它们的分数是相对位姿的闭式代数范数,而非学习核,并且它达到了每个基于不可约表示或满射指数的方法必须排除的仿射全帧群。我们称之为李代数注意力。一旦token是群元素,其余部分无需通常的表示论机制。一对的相对几何是规范的,即$g_i^{-1} g_j$,因此成对不变量$w_{ij} = \log(g_i^{-1} g_j)$是内在的而非设计的;在$G$对角作用下的等变性是重言式的,且余循环条件自动成立。注意力分数是负平方代数范数$s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$:在块加权Frobenius内积下的规范邻近核,无需不可约表示、球谐函数、Clebsch-Gordan积或学习核。该构造适用于任何矩阵李群,在包含相对位姿的选定对数图上,包括具有尺度和剪切的非紧致非阿贝尔仿射群,这些是向量token注意力方法无法达到的:既不是不可约表示传统,也不是满射指数方法。在SE(2)、SO(3)和Aff(2)上的三个序列补全实验证实了这一点:闭式分数匹配了相同不变量上的学习MLP核,并在SE(2)上优于它,使用的分数参数少50到80倍,而向量token基线破坏了不变量,误差达五到十二个数量级。

English abstract

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $\rho(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
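The score formula in the abstract can be made concrete for SE(2). The sketch below is not the author's code: it stores each token as a hypothetical `(theta, x, y)` triple, uses the principal logarithm, and assumes one particular block weighting (`lam_rot` on the rotation block, `lam_trans` on the translation block) for the norm, since the abstract does not spell out the weights.

```python
import math

def compose(a, b):
    """Group product a * b for SE(2) elements stored as (theta, x, y)."""
    ta, xa, ya = a
    tb, xb, yb = b
    c, s = math.cos(ta), math.sin(ta)
    return (ta + tb, xa + c * xb - s * yb, ya + s * xb + c * yb)

def inverse(g):
    """Group inverse: rotation R^T, translation -R^T t."""
    t, x, y = g
    c, s = math.cos(t), math.sin(t)
    return (-t, -(c * x + s * y), -(-s * x + c * y))

def log_se2(g):
    """Principal logarithm: (theta, v1, v2) with theta wrapped to [-pi, pi]."""
    t, x, y = g
    theta = math.atan2(math.sin(t), math.cos(t))
    if abs(theta) < 1e-9:
        a, b = 1.0, 0.5 * theta  # small-angle limit of the coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta
    det = a * a + b * b
    # Invert V = [[a, -b], [b, a]], which maps v to the translation t.
    return theta, (a * x + b * y) / det, (-b * x + a * y) / det

def score(gi, gj, lam_rot=1.0, lam_trans=1.0, tau=1.0):
    """s_ij = -||log(g_i^{-1} g_j)||^2 / tau under an assumed block weighting."""
    theta, v1, v2 = log_se2(compose(inverse(gi), gj))
    # The rotation block [[0, -theta], [theta, 0]] has squared Frobenius norm 2 theta^2.
    sq_norm = lam_rot * 2.0 * theta * theta + lam_trans * (v1 * v1 + v2 * v2)
    return -sq_norm / tau

def attention_weights(tokens, **kw):
    """Row-wise softmax over the pairwise scores."""
    rows = []
    for gi in tokens:
        s = [score(gi, gj, **kw) for gj in tokens]
        m = max(s)
        e = [math.exp(v - m) for v in s]
        z = sum(e)
        rows.append([v / z for v in e])
    return rows
```

Because the score depends only on $g_i^{-1} g_j$, replacing every token $g$ by $h g$ for a fixed $h$ leaves all scores unchanged, which is the invariance the abstract calls tautological.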

2601.02379 2026-06-19 cs.RO cs.AI Updated version

Movement Primitives in Robotics: A Comprehensive Survey

机器人运动基元:综合综述

Nolan B. Gutierrez, Joseph M. Cloud, William J. Beksi

Affiliations * Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, USA

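AI summary Surveys movement primitive frameworks in robotics, covering methods that encode trajectories from human demonstrations, analyzing properties such as spring-damper dynamics, probabilistic coupling, and neural networks, and discussing applications and challenges.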

Comments 105 pages, 3 figures, and 6 tables

Details
AI Chinese abstract

生物系统表现出连续的运动流,由顺序片段组成,使它们能够以创造性和多功能的方式执行复杂任务。这一观察促使研究人员识别出被称为运动基元的运动基本构建块,这些基元非常适合在自主系统(如机器人)中生成运动指令。在本综述中,我们按时间顺序提供了运动基元方法和应用的百科全书式概述。具体来说,我们将运动基元框架呈现为一种表示通过人类示教获得的机器人控制轨迹的方式。在机器人领域,运动基元可以在轨迹级别编码基本运动,例如机器人如何抓取杯子或抛球所需的运动序列。此外,运动基元已开发出具有弹簧-阻尼系统的理想分析特性、多个示教的概率耦合、在高维系统中使用神经网络等特性,以应对机器人领域的困难挑战。尽管运动基元广泛应用于各个领域,本综述的目标是告知从业者如何在机器人背景下使用这些框架。具体而言,我们旨在(i)系统回顾主要运动基元框架并检查其优缺点;(ii)突出已成功使用运动基元的应用;(iii)检查开放问题并讨论在机器人中应用运动基元时的实际挑战。

English abstract

Biological systems exhibit a continuous stream of movements, consisting of sequential segments, that allow them to perform complex tasks in a creative and versatile fashion. This observation has led researchers towards identifying elementary building blocks of motion known as movement primitives, which are well-suited for generating motor commands in autonomous systems, such as robots. In this survey, we provide an encyclopedic overview of movement primitive approaches and applications in chronological order. Concretely, we present movement primitive frameworks as a way of representing robotic control trajectories acquired through human demonstrations. Within the area of robotics, movement primitives can encode basic motions at the trajectory level, such as how a robot would grasp a cup or the sequence of motions necessary to toss a ball. Furthermore, movement primitives have been developed with the desirable analytical properties of a spring-damper system, probabilistic coupling of multiple demonstrations, using neural networks in high-dimensional systems, and more, to address difficult challenges in robotics. Although movement primitives have widespread application to a variety of fields, the goal of this survey is to inform practitioners on the use of these frameworks in the context of robotics. Specifically, we aim to (i) present a systematic review of major movement primitive frameworks and examine their strengths and weaknesses; (ii) highlight applications that have successfully made use of movement primitives; and (iii) examine open questions and discuss practical challenges when applying movement primitives in robotics.
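The spring-damper property mentioned in the abstract can be illustrated with a one-dimensional rollout in the style of a dynamic movement primitive. This is a generic textbook-style sketch, not code from the survey: the gains, the phase variable `x`, and the `forcing` callable are illustrative assumptions.

```python
def rollout_dmp(y0, goal, forcing=None, tau=1.0, alpha=25.0,
                alpha_x=4.0, dt=0.001, duration=1.5):
    """Integrate a 1-D spring-damper primitive with explicit Euler steps.

    tau * dz = alpha * (beta * (goal - y) - z) + f,   tau * dy = z,
    tau * dx = -alpha_x * x   (phase variable that gates the forcing term).
    """
    beta = alpha / 4.0  # critically damped choice
    y, z, x = y0, 0.0, 1.0
    traj = [y]
    for _ in range(int(round(duration / dt))):
        # The forcing term is scaled by the phase, so it fades as x decays to 0.
        f = forcing(x) * x * (goal - y0) if forcing is not None else 0.0
        dz = (alpha * (beta * (goal - y) - z) + f) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        y, z, x = y + dy * dt, z + dz * dt, x + dx * dt
        traj.append(y)
    return traj
```

With no forcing term the state approaches the goal without overshoot; a forcing term reshapes the path in between, and because it is gated by the decaying phase, the trajectory still settles close to the goal.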

2603.27361 2026-06-19 cs.RO Updated version

Online Inertia Tensor Identification for Non-Cooperative Spacecraft via Augmented UKF

Batu Candan, Simone Servadio

Affiliations * Department of Aerospace Engineering, Iowa State University

Journal ref AIAA 2026 Region V Student Conference, AIAA 2026-108993

Details
English abstract

Autonomous proximity operations, such as active debris removal and on-orbit servicing, require high-fidelity relative navigation solutions that remain robust in the presence of parametric uncertainty. Standard estimation frameworks typically assume that the target spacecraft's mass properties are known a priori; however, for non-cooperative or tumbling targets, these parameters are often unknown or uncertain, leading to rapid divergence in model-based propagators. This paper presents an augmented Unscented Kalman Filter (UKF) framework designed to jointly estimate the relative 6-DOF pose and the full inertia tensor of a non-cooperative target spacecraft. The proposed architecture fuses visual measurements from monocular vision-based Convolutional Neural Networks (CNN) with depth information from LiDAR to constrain the coupled rigid-body dynamics. By augmenting the state vector to include the six independent elements of the inertia tensor, the filter dynamically recovers the target's normalized mass distribution in real-time without requiring ground-based pre-calibration. To ensure numerical stability and physical consistency during the estimation of constant parameters, the filter employs an adaptive process noise formulation that prevents covariance collapse while allowing for the gradual convergence of the inertial parameters. Numerical validation is performed via Monte Carlo simulations, demonstrating that the proposed Augmented UKF enables the simultaneous convergence of kinematic states and inertial parameters, thereby facilitating accurate long-term trajectory prediction and robust guidance in non-cooperative deep-space environments.
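The state augmentation described above can be sketched for the rotational part of the state alone. The code below is an illustration under stated assumptions, not the paper's filter: it covers only the angular velocity and the six inertia elements (no pose, no UKF sigma points, no measurement model), assumes torque-free motion, and the parameter ordering `(Ixx, Iyy, Izz, Ixy, Ixz, Iyz)` is a hypothetical choice.

```python
def unpack_inertia(p):
    """Build the symmetric 3x3 inertia tensor from six independent elements."""
    ixx, iyy, izz, ixy, ixz, iyz = p
    return [[ixx, ixy, ixz], [ixy, iyy, iyz], [ixz, iyz, izz]]

def matvec(m, v):
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(m, b):
    """Solve m x = b by Cramer's rule (adequate for a 3x3 sketch)."""
    d = det3(m)
    out = []
    for col in range(3):
        mc = [[b[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]
        out.append(det3(mc) / d)
    return out

def augmented_derivative(state):
    """state = [wx, wy, wz, Ixx, Iyy, Izz, Ixy, Ixz, Iyz].

    Torque-free Euler equations: I * w_dot = -(w x I w).
    The inertia elements are modeled as constants, so their derivative is zero.
    """
    omega, params = state[:3], state[3:]
    inertia = unpack_inertia(params)
    rhs = [-c for c in cross(omega, matvec(inertia, omega))]
    return solve3(inertia, rhs) + [0.0] * 6
```

Scaling every inertia element by the same factor leaves the torque-free angular acceleration unchanged, which is consistent with the abstract's statement that the filter recovers a normalized mass distribution rather than absolute values.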

2511.16223 2026-06-19 cs.RO Updated version

DynaMimicGen: A Data Generation Framework for Robot Learning of Dynamic Tasks

Vincenzo Pomponi, Paolo Franceschi, Stefano Baraldo, Loris Roveda, Oliver Avram, Luca Maria Gambardella, Anna Valente

Affiliations * Institute of Systems and Technologies for Sustainable Production (ISTePS); Department of Innovative Technologies (DTI); University of Applied Sciences and Arts of Southern Switzerland (SUPSI); Istituto Dalle Molle di studi sull’intelligenza artificiale (IDSIA); Department of Mechanical Engineering; Politecnico di Milano (PoliMi); Faculty of Informatics; Università della Svizzera Italiana (USI)

Details
English abstract

Learning robust manipulation policies typically requires large and diverse datasets, the collection of which is time-consuming, labor-intensive, and often impractical for dynamic environments. In this work, we introduce DynaMimicGen (D-MG), a scalable dataset generation framework that enables policy training from minimal human supervision while uniquely supporting dynamic task settings. Given only a few human demonstrations, D-MG first segments the demonstrations into meaningful sub-tasks, then leverages Dynamic Movement Primitives (DMPs) to adapt and generalize the demonstrated behaviors to novel and dynamically changing environments. Improving on prior methods that rely on static assumptions or simplistic trajectory interpolation, D-MG produces smooth, realistic, and task-consistent Cartesian trajectories that adapt in real time to changes in object poses, robot states, or scene geometry during task execution. Our method supports different scenarios, including scene layouts, object instances, and robot configurations, making it suitable for both static and highly dynamic manipulation tasks. We show that robot agents trained via imitation learning on D-MG-generated data achieve strong performance across long-horizon and contact-rich benchmarks, including tasks like cube stacking and placing mugs in drawers, even under unpredictable environment changes. By eliminating the need for extensive human demonstrations and enabling generalization in dynamic settings, D-MG offers a powerful and efficient alternative to manual data collection, paving the way toward scalable, autonomous robot learning.
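How a movement primitive fitted to one demonstration can follow a goal that changes during execution is sketched below in one dimension. This is not the D-MG implementation: raw forcing samples indexed by time step stand in for the basis-function regression commonly used with DMPs, the gains are arbitrary, and `goal_at` is a hypothetical callback returning the current target.

```python
ALPHA = 25.0
BETA = ALPHA / 4.0  # critically damped spring-damper

def finite_diff(values, dt):
    """Central differences in the interior, one-sided at both ends."""
    n = len(values)
    out = []
    for k in range(n):
        lo, hi = max(k - 1, 0), min(k + 1, n - 1)
        out.append((values[hi] - values[lo]) / ((hi - lo) * dt))
    return out

def fit_forcing(demo, dt):
    """Forcing samples under which the spring-damper reproduces `demo`."""
    goal = demo[-1]
    vel = finite_diff(demo, dt)
    acc = finite_diff(vel, dt)
    return [a - ALPHA * (BETA * (goal - y) - v)
            for y, v, a in zip(demo, vel, acc)]

def rollout(y0, forcing, goal_at, dt, steps):
    """Replay the forcing while re-reading the goal at every control step."""
    y, z, traj = y0, 0.0, [y0]
    for k in range(steps):
        f = forcing[k] if k < len(forcing) else 0.0  # no forcing after the demo ends
        goal = goal_at(k * dt)  # the target may have moved since the last step
        dz = ALPHA * (BETA * (goal - y) - z) + f
        y, z = y + z * dt, z + dz * dt
        traj.append(y)
    return traj

# Hypothetical demonstration: a minimum-jerk reach from 0 to 1 in one second.
DT = 0.001
DEMO = [10 * s**3 - 15 * s**4 + 6 * s**5 for s in (k / 1000 for k in range(1001))]
FORCING = fit_forcing(DEMO, DT)
```

With a fixed goal the rollout approximately retraces the demonstration; if `goal_at` starts returning a new target partway through, the spring-damper term pulls the trajectory toward that target while the replayed forcing keeps the overall shape of the motion.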