arXivDaily: arXiv daily academic digest, updated Monday through Friday
1. Robot Learning, Imitation, and Reinforcement Learning (6 papers)

2606.12499 2026-06-12 cs.RO New submission

Action-Effect Memory Pretraining for Robot Manipulation

Yijing Zhou, Qiwei Liang, Sitong Zhuang, Jiaxi Li, Xianpeng Wang, Boyang Cai, Yunyang Mo, Renjing Xu

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Shenzhen University

AI summary: Proposes the AEM framework, which learns compact temporal representations through masked modeling of vision-action history, improving robot manipulation under partial observability and outperforming single-frame pretraining and frame stacking.

Abstract

We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.
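The interleaving and masking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token layout (`v_0, a_0, v_1, ..., v_T`), the `MASK` placeholder, the `mask_ratio` parameter, and the choice to never mask the final vision token are all assumptions made for the example, and string tokens stand in for feature vectors.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a masked token


def interleave(vision_tokens, action_tokens):
    """Build [v_0, a_0, v_1, a_1, ..., v_T] from T+1 vision and T action tokens."""
    if len(vision_tokens) != len(action_tokens) + 1:
        raise ValueError("expected one more vision token than action tokens")
    seq = []
    for v, a in zip(vision_tokens, action_tokens):
        seq.extend([v, a])
    seq.append(vision_tokens[-1])
    return seq


def mask_history(seq, mask_ratio, rng):
    """Mask a random subset of the history, keeping the final token visible.

    Returns the masked sequence and a dict {index: original token} that a
    model would be trained to reconstruct.
    """
    candidates = list(range(len(seq) - 1))  # assumption: last token is never masked
    n_mask = round(mask_ratio * len(candidates))
    masked_idx = sorted(rng.sample(candidates, n_mask))
    masked = list(seq)
    for i in masked_idx:
        masked[i] = MASK
    targets = {i: seq[i] for i in masked_idx}
    return masked, targets


if __name__ == "__main__":
    seq = interleave(["v0", "v1", "v2"], ["a0", "a1"])
    masked, targets = mask_history(seq, mask_ratio=0.5, rng=random.Random(0))
    print(masked, targets)
```

In the abstract's design, a sequence model (Mamba) would encode the masked sequence and the output at the final vision token would serve as the single-vector history representation; that encoder is omitted here.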

2606.12890 2026-06-12 cs.RO New submission

Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer

Aryan Naveen, Haitong Ma, Haldun Balim, Na Li

Affiliations: Massachusetts Institute of Technology; Harvard School of Engineering and Applied Sciences

AI summary: Proposes RepMT-SAC, which uses spectral MDP decomposition to capture transferable dynamics and structures the value function as a task-agnostic core plus a minimal task-specific adjustment, outperforming baselines by up to 30% on quadcopter trajectory-following tasks.

Comments: 8 pages, 4 figures, 1 table

Abstract

Reinforcement learning has achieved remarkable success in learning complex control policies, yet its applicability remains limited due to sample inefficiency and poor generalization across tasks. In this work, we propose RepMT-SAC, a framework for multi-task RL that enables efficient knowledge sharing and robust transfer to new tasks. RepMT-SAC uses spectral MDP decomposition to capture transferable dynamics, structuring the value function into a task-agnostic core with a minimal task-specific adjustment. This design allows for strong zero-shot performance on in-distribution tasks and rapid few-shot adaptation to out-of-distribution tasks. We evaluate RepMT-SAC on quadcopter trajectory-following tasks across in-distribution and out-of-distribution contexts, demonstrating that it outperforms baselines by up to 30%.
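One common way to write down a "task-agnostic core plus task-specific adjustment" is sketched below. It assumes that the spectral MDP decomposition is a low-rank factorization of the transition kernel; the abstract does not give the exact form, and the symbols $\phi$, $\mu$, $w^{\pi}_{\tau}$, and $r_{\tau}$ are illustrative rather than taken from the paper.

```latex
P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle
\quad\Longrightarrow\quad
Q^{\pi}_{\tau}(s, a) = r_{\tau}(s, a) + \gamma\, \phi(s, a)^{\top} w^{\pi}_{\tau},
\qquad
w^{\pi}_{\tau} = \int \mu(s')\, V^{\pi}_{\tau}(s')\, \mathrm{d}s'.
```

Under this assumption the feature map $\phi$ depends only on the shared dynamics and can be reused across tasks, while the vector $w^{\pi}_{\tau}$ is the low-dimensional part that would need to be adapted for a new task $\tau$.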

2606.13169 2026-06-12 cs.RO New submission

Redesigning Regularization for Effective Policy Smoothing

Taisuke Kobayashi, Naoto Yamanaka

Affiliations: National Institute of Informatics (NII); The Graduate University for Advanced Studies (SOKENDAI)

AI summary: Addresses policy smoothing in reinforcement learning by identifying the gap between theory and practice in the existing regularization implementation and proposing remedies that yield smooth motion and improved control performance across multiple tasks and algorithms. Sim-to-real experiments on a quadruped robot show that smoothness improves robustness to sudden changes in target velocity commands.

Comments: submitted to RA-L

Abstract

This paper proposes a novel regularization design to effectively smooth policy functions in reinforcement learning. While regularization that enhances "global" Lipschitz continuity was initially considered, it has been limited to "local" Lipschitz continuity due to a tradeoff between smoothness and expressiveness. However, it has become apparent that the original implementation is cumbersome and does not provide sufficient smoothing, leading to a preference for simpler implementations. This stems from a discrepancy between theory and implementation, and a more appropriate implementation can be expected to facilitate smoothing. Therefore, this paper identifies three reasons why the original implementation does not function adequately and provides remedies for them. This modified regularization performs well across multiple tasks and algorithms, successfully achieving smooth motion while improving control performance. Furthermore, applying it to sim-to-real reinforcement learning for a quadruped robot demonstrates that smooth motion provides robustness against sudden changes in target velocity commands.

2606.13355 2026-06-12 cs.RO cs.AI New submission

Real-Time Execution with Autoregressive Policies

Sangkyu Lee, Seohyeon Park, Tackgeun You, Avi Caciularu, Idan Szpektor, Hwasup Lim, Youngjae Yu

Affiliations: Korea Institute of Science and Technology; Seoul National University; Google Research

AI summary: Achieves real-time execution for autoregressive policies by adjusting the tokenization horizon and applying constrained decoding under asynchronous inference, guaranteeing strict latency bounds while speeding up task completion. Experiments show it outperforms a comparable flow-matching policy.

Abstract

Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds over synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

2606.11092 2026-06-12 cs.RO cs.AI Replacement

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li

Affiliations: The University of Hong Kong; The Chinese University of Hong Kong; Archon Robotics

AI summary: Proposes RoboNaldo, a three-stage motion-guided curriculum RL framework that starts from a single human-kick reference and progressively optimizes shooting performance. In simulation, free-kick shot error is 48.6% lower and shot velocity is 2.96x that of prior-work baselines. On a real robot, average target shooting error from 3 m away is 0.73 m (free kick) and 0.86 m (moving ball), and post-contact ball velocity reaches 13.10 m/s.

Abstract

Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold, and optimization progressively shifts towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves a free-kick shot error 48.6% lower and a shot velocity 2.96x that of prior-work baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: https://opendrivelab.com/RoboNaldo.

2604.08958 2026-06-12 cs.LG cs.AI cs.RO Replacement

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

Mintae Kim, Koushil Sreenath

Affiliations: Hybrid Robotics, UC Berkeley

AI summary: Proposes WOMBET, which learns a world model in the source task, generates offline data via uncertainty-penalized planning, and then fine-tunes online in the target task with adaptive sampling, for robust and sample-efficient reinforcement learning transfer.

Comments: 13 pages, 6 figures, 8th Annual Learning for Dynamics & Control Conference (L4DC)

Abstract

Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-Based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
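The uncertainty-penalized objective and the trajectory filter can be illustrated with a small sketch. Everything here is an assumption made for illustration: the per-step penalty form `r - penalty * u`, the trajectory dictionary layout, and the thresholds `min_return` and `max_uncertainty` are hypothetical and are not taken from the paper.

```python
def penalized_return(rewards, uncertainties, penalty=1.0, gamma=0.99):
    """Discounted return with a per-step epistemic-uncertainty penalty.

    Since every penalty term is non-negative, this value never exceeds
    the unpenalized discounted return of the same trajectory.
    """
    total = 0.0
    for t, (r, u) in enumerate(zip(rewards, uncertainties)):
        total += (gamma ** t) * (r - penalty * u)
    return total


def filter_trajectories(trajectories, min_return, max_uncertainty):
    """Keep trajectories with high return and low mean uncertainty."""
    kept = []
    for traj in trajectories:
        ret = sum(traj["rewards"])
        mean_u = sum(traj["uncertainties"]) / len(traj["uncertainties"])
        if ret >= min_return and mean_u <= max_uncertainty:
            kept.append(traj)
    return kept


if __name__ == "__main__":
    trajs = [
        {"rewards": [1.0, 1.0], "uncertainties": [0.1, 0.1]},
        {"rewards": [1.0, 1.0], "uncertainties": [0.9, 0.9]},
        {"rewards": [0.1, 0.1], "uncertainties": [0.1, 0.1]},
    ]
    print(len(filter_trajectories(trajs, min_return=1.5, max_uncertainty=0.5)))
```

In practice the uncertainty values would come from the learned world model (for example, ensemble disagreement), which is outside the scope of this sketch.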

2. Motion Planning, Control, and Dynamics (6 papers)

2606.12579 2026-06-12 cs.RO New submission

G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation

Tanmay Bishnoi, Riddhiman Laha, Tobias Löw, Jose Alex Chandy, Luis F. C. Figueredo, Sami Haddadin

Affiliations: Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University; Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich (TUM); Institute for Experiential Robotics, Northeastern University; Idiap Research Institute; EPFL; CHART Group at the School of Computer Science, University of Nottingham; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

AI summary: Proposes a GPU-accelerated framework that uses parallel state exploration and a tightly coupled perception-action loop for real-time reactive motion generation in unstructured environments, achieving up to a 5x speedup over the CPU version and avoiding collisions in real-world experiments on a 7-DoF robot.

Comments: The implementation is available at: https://github.com/chart-research/g-mapp

Journal ref: IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7516-7523, June 2026

Abstract

Reactive motion generation in unstructured environments remains an open challenge in robotics. Due to the computational complexity of collision-free motion generation, existing methods either generate global trajectories for static scenarios, or employ models that make conservative assumptions about the environment. This paper identifies the primary bottleneck as the runtime performance demand of planning on high-fidelity environments, and the temporal integration between the perception and planning modules. Therefore, we propose a framework that does not compromise on runtime performance and world representations for perception and planning by accelerating world modeling and vector-field based planning using the GPU. This allows us to achieve faster parallel state exploration for quasi-global trajectory planning, and tighter coupling of the perception-action loop in real-time for dynamic cluttered environments with off-the-shelf depth sensors. We quantitatively evaluate the computation-time and success rate differences for the CPU and GPU versions of our planner, and perform qualitative evaluations of our coupled framework using real-world experiments on a 7-DoF Franka Emika robot. Experimental results demonstrate that our GPU-based framework achieves up to a 5x speedup over the CPU version and successfully avoids collisions across both trivial and challenging physical world scenarios.

2606.12814 2026-06-12 cs.RO cs.AI New submission

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong

Affiliations: Southern University of Science and Technology

AI summary: Proposes Stubborn, which unifies humanoid motion tracking and fall recovery using an asymmetric Actor-Critic architecture, a yaw-aligned tracking representation, a Bernoulli-based probabilistic termination mechanism, and an adaptive sampling strategy, achieving performance competitive with state-of-the-art methods.

Abstract

Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that enables the policy to encourage exploration of fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieved competitive performance, and the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.
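The idea behind Bernoulli-based probabilistic termination can be shown in a few lines. This is a sketch under stated assumptions rather than the paper's mechanism: the function name `should_terminate` and the fixed `terminate_prob` are hypothetical, and the paper may condition the probability on the failure mode.

```python
import random


def should_terminate(failure_detected, terminate_prob, rng):
    """Decide whether to end the episode after a tracking failure.

    Without a failure the episode always continues. With a failure, the
    episode ends with probability `terminate_prob` (a Bernoulli draw), so
    some rollouts continue from unstable or fallen states and the policy
    gets a chance to explore recovery.
    """
    if not failure_detected:
        return False
    return rng.random() < terminate_prob


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [should_terminate(True, 0.7, rng) for _ in range(1000)]
    print(sum(draws) / len(draws))  # empirical termination rate
```

Setting `terminate_prob` to 1.0 recovers the conventional behavior of always terminating on failure, which makes the mechanism easy to ablate.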

2606.13113 2026-06-12 eess.SY cs.RO cs.SY Cross-list

MPC for underactuated spacecraft control with a Lyapunov supervised physics-informed neural network correction layer

Amirhossein Ayanmanesh Motlaghmofrad, Carlo Cena, Mauro Martini, Marcello Chiaberge

AI summary: For attitude control of underactuated spacecraft, proposes a hierarchical architecture combining nonlinear model predictive control, a physics-informed neural network, and a Lyapunov-based supervisory mechanism, reducing steady-state error while remaining robust under uncertainty.

Comments: Accepted at SPAICE (AI in and for Space) 2026

Abstract

Underactuated spacecraft face controllability limitations and heightened sensitivity to environmental disturbances, complicating attitude maneuvering and stabilization. Due to the lack of control authority along the underactuated axis, conventional controllers cannot directly stabilize all attitude components and therefore require reference planning strategies. Furthermore, MPC approaches remain sensitive to inertia uncertainty and unmodeled dynamic couplings, resulting in degraded tracking performance under mismatch. To address these issues, we consider a hierarchical architecture integrating three layers: (i) a nonlinear model predictive controller (NMPC) for constraint and underactuation-aware maneuver planning and nominal closed-loop stability under actuator limits; (ii) a physics-informed neural network (PINN) trained offline on simulation data to estimate residual disturbance torques, with loss terms that enforce consistency with rigid-body rotational dynamics; (iii) a Lyapunov-based supervisory safety mechanism that evaluates the learned correction online and bounds or suppresses its influence to preserve the stability properties of the baseline controller. The architecture is evaluated in a high-fidelity simulation environment modelling reaction wheel dynamics, actuator saturation, and environmental disturbances. Monte Carlo studies show statistically significant reductions in steady-state attitude error relative to standalone NMPC while maintaining robust behavior under uncertainty. The supervisory layer ensures graceful degradation to purely model-based control when the learning-based augmentation is unreliable.
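A supervisory layer that "bounds or suppresses" a learned correction might look roughly like the sketch below. The acceptance rule is an assumption chosen for illustration: the correction is first clipped to `max_norm` and is then kept only if a user-supplied `lyapunov_rate` callable (hypothetical, returning an estimate of the Lyapunov derivative under a given torque) is no worse than for the baseline torque. The paper's actual criterion is not stated in the abstract.

```python
import math


def supervised_torque(u_nmpc, u_corr, lyapunov_rate, max_norm):
    """Combine a baseline torque with a learned correction under supervision.

    u_nmpc, u_corr: lists of floats (baseline torque, learned correction).
    lyapunov_rate:  hypothetical callable mapping a torque to an estimated
                    Lyapunov derivative (more negative is better).
    max_norm:       upper bound on the Euclidean norm of the correction.
    """
    norm = math.sqrt(sum(c * c for c in u_corr))
    # Bound: shrink the correction so its norm does not exceed max_norm.
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 0.0
    bounded = [scale * c for c in u_corr]
    candidate = [a + b for a, b in zip(u_nmpc, bounded)]
    # Suppress: fall back to the baseline if the correction makes things worse.
    if lyapunov_rate(candidate) <= lyapunov_rate(u_nmpc):
        return candidate
    return list(u_nmpc)


if __name__ == "__main__":
    # Toy one-axis example: with a positive error, negative torque decreases V.
    rate = lambda u: 1.0 * u[0]
    print(supervised_torque([-1.0], [-0.5], rate, max_norm=1.0))  # accepted
    print(supervised_torque([-1.0], [0.5], rate, max_norm=1.0))   # suppressed
```

The fallback branch is what gives the graceful degradation to purely model-based control: when the learned term cannot be certified, the output is exactly the baseline torque.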

2606.13400 2026-06-12 cs.LG cs.AI cs.RO Cross-list

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao

Affiliations: University of Science and Technology of China

AI summary: Proposes PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. A discrete-time flow formulation and a projection-free architecture eliminate discretization error and strictly satisfy arbitrary polyhedral constraints, achieving zero constraint violation and lower inference latency on planning and control tasks.

Comments: 30 pages, 12 figures, Accepted to ICML 2026

Abstract

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available on https://github.com/MJianM/PolyFlow.

2602.15424 2026-06-12 cs.RO Replacement

Lyapunov-Based PI-Like Control for Robust Trajectory Tracking of a Four-Wheel Independently Driven and Steered Robot: Design and Experimental Validation

Branimir Ćaran, Vladimir Milić, Marko Švaco, Bojan Jerbić

Affiliations: Faculty of Mechanical Engineering and Naval Architecture, University of Zagreb; Regional Centre of Excellence for Robotic Technology (CRTA); Croatian Academy of Sciences and Arts

AI summary: Proposes a Lyapunov-based PI-like controller with model-based feedforward compensation for robust trajectory tracking of a four-wheel independently driven and steered robot, validated experimentally and benchmarked against PI and sliding-mode controllers.

Comments: This work has been submitted to the IEEE for possible publication

Abstract

In this paper, a Lyapunov-based synthesis of a PI-like controller is proposed for robust trajectory tracking of an independently driven and steered four-wheel mobile robot. For the robot considered in this work, an explicit, structurally verified mathematical model is used to enable systematic controller design with rigorous stability guarantees suitable for real-time implementation. An augmented Lyapunov-based practical stability analysis is developed for the combined velocity-error and integral-error dynamics of the inner loop, yielding explicit bounds and sufficient conditions for practical stability and uniform ultimate boundedness of the combined velocity-error and integral-error state. The resulting control law retains a PI-like structure with model-based feedforward compensation, making it suitable for implementation on standard embedded platforms while improving robustness against configuration-dependent residual dynamics and unmodelled effects. The effectiveness and robustness of the proposed design are demonstrated experimentally on a four-wheel independently steered and independently driven mobile robot platform, under both horizontal and vertical operating conditions, and benchmarked against a PI controller and a sliding-mode controller.
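The general shape of a PI-like law with feedforward compensation is sketched below for a single scalar velocity channel. This is only a structural illustration: the class name, the gains `kp` and `ki`, the forward-Euler integration, and the externally supplied `feedforward` term are assumptions, and the Lyapunov-derived gain conditions from the paper are not reproduced.

```python
class PILikeController:
    """Scalar PI-like velocity controller with an additive feedforward term."""

    def __init__(self, kp, ki, dt):
        self.kp = kp          # proportional gain on the velocity error
        self.ki = ki          # gain on the integrated velocity error
        self.dt = dt          # sample time in seconds
        self.integral = 0.0   # running integral of the velocity error

    def step(self, v_ref, v_meas, feedforward):
        """Return the control input for one sample.

        `feedforward` stands in for the model-based compensation term,
        which would be computed from the robot model and the reference.
        """
        error = v_ref - v_meas
        self.integral += error * self.dt  # forward-Euler integration
        return feedforward + self.kp * error + self.ki * self.integral


if __name__ == "__main__":
    ctrl = PILikeController(kp=2.0, ki=1.0, dt=0.5)
    print(ctrl.step(v_ref=1.0, v_meas=0.0, feedforward=0.25))
```

Keeping the feedback part in plain PI form is what makes such a law cheap to run on embedded hardware; the model knowledge enters only through the feedforward term.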

2604.20428 2026-06-12 cs.RO Replacement

Lexicographic Minimum-Violation Motion Planning using Signal Temporal Logic

Patrick Halder, Lothar Kiltz, Hannes Homburger, Johannes Reuter, Matthias Althoff

AI summary: Proposes converting lexicographic multi-objective optimization into single-objective scalar optimization via non-uniform quantization and bit-shifting, extends an MPPI solver, and introduces a predicate-robustness measure combining spatial and temporal violations, giving an interpretable and scalable approach to lexicographic STL minimum-violation motion planning.

Comments: Submitted to the IEEE Open Journal of Intelligent Transportation Systems (under review)

Abstract

Motion planning for autonomous vehicles often requires satisfying multiple conditionally conflicting specifications. In situations where not all specifications can be met simultaneously, minimum-violation motion planning maintains system operation by minimizing violations of specifications in accordance with their priorities. Signal temporal logic (STL) provides a formal language for rigorously defining these specifications and enables the quantitative evaluation of their violations. However, a total ordering of specifications yields a lexicographic optimization problem, which is typically computationally expensive to solve using standard methods. We address this problem by transforming the multi-objective lexicographic optimization problem into a single-objective scalar optimization problem using non-uniform quantization and bit-shifting. Specifically, we extend a deterministic model predictive path integral (MPPI) solver to efficiently solve optimization problems without quadratic input cost. Additionally, a novel predicate-robustness measure that combines spatial and temporal violations is introduced. Our results show that the proposed method offers an interpretable and scalable solution for lexicographic STL minimum-violation motion planning within a single-objective solver framework.
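The quantize-and-shift idea can be demonstrated with integers: each violation is quantized to a fixed number of bits, and higher-priority violations occupy higher-order bits, so comparing the packed scalars matches comparing the quantized violations lexicographically. The sketch below uses uniform quantization for simplicity, whereas the abstract describes non-uniform quantization; the function names, `bits`, and `max_violation` are hypothetical.

```python
def quantize(violation, max_violation, bits):
    """Map a violation in [0, max_violation] to an integer in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    clipped = min(max(violation, 0.0), max_violation)
    return round(clipped / max_violation * levels)


def lexicographic_cost(violations, max_violation=1.0, bits=8):
    """Pack prioritized violations into one integer cost.

    `violations` is ordered from highest to lowest priority. Each quantized
    value fills its own `bits`-wide field, so a lower-priority field can
    never outweigh a difference in a higher-priority field.
    """
    cost = 0
    for v in violations:
        cost = (cost << bits) | quantize(v, max_violation, bits)
    return cost


if __name__ == "__main__":
    # A small violation of the top-priority rule costs more than
    # maximal violations of all lower-priority rules.
    a = lexicographic_cost([0.0, 1.0, 1.0])
    b = lexicographic_cost([0.1, 0.0, 0.0])
    print(a < b)
```

One consequence of quantization is that violations falling into the same level are treated as ties at that priority, which lets lower-priority specifications break them.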

3. Manipulation, Grasping, and Dexterous Hands (14 papers)

2606.12604 2026-06-12 cs.RO New submission

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Yangcen Liu, Shuo Cheng, Xinchen Yin, Woo Chul Shin, Alfred Cueva, Yiran Yang, Zhenyang Chen, Chuye Zhang, Danfei Xu

Affiliations: Georgia Institute of Technology; Tsinghua University

AI summary: Proposes EgoEngine, which bridges the visual and action gaps to convert egocentric human videos into high-fidelity robot data, and demonstrates what the authors describe as the first zero-shot dexterous policy learning from such videos.

Abstract

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable action. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video replacing human with robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.

2606.12759 2026-06-12 cs.RO New submission

Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation

Yu Guo, Chang Yu, Siyu Ma, Yunuo Chen, Yin Yang, Ying Nian Wu, Chenfanfu Jiang

Affiliations: University of California, Los Angeles; University of California, San Diego; University of Utah

AI summary: Proposes Sparse2Act, which pretrains masked sparse 3D encoders with action alignment for cross-domain robot manipulation, reaching 86.9% average success on LIBERO-10 and supporting cross-domain and sim-to-real transfer.

Abstract

Explicit 3D representations are attractive for manipulation because they expose object shape, workspace geometry, and robot-object relations in metric coordinates. However, sparse 3D encoders are often learned through downstream task objectives, tying the representation to a particular data distribution, policy architecture, and action parameterization. We introduce Sparse2Act, an observation-action alignment framework for pretraining sparse point-cloud encoders. The key idea is to use task-space end-effector actions as geometric supervision: masked sparse 3D tokens are trained to organize scene features around the workspace motion paired with the observation. After pretraining, only the encoder initialization is reused by downstream policies, allowing them to retain their own architectures and action spaces, including joint-space commands. On the LIBERO-10 benchmark, our method achieves 86.9% average success after 500 fine-tuning steps. The same pretrained encoder supports LIBERO-to-Meta-World cross-domain transfer, achieving 73.4% average success on the Meta-World-5 benchmark. Ablations on the objective and decoder capacity show that the gains come from the masked action-alignment signal and remain useful across downstream action decoders. In real-world experiments, simulation pretraining followed by limited real-data fine-tuning achieves an average success rate of 72.5% across four tasks, demonstrating effective sim-to-real transfer. These results suggest that robot actions can provide compact geometric supervision for reusable sparse 3D representations.

2606.12954 2026-06-12 cs.RO New submission

Towards Reliable Sequential Object Picking in Clutter: The Runner-up Solution to RGMC 2025

Wei Yu, Xidan Zhang, Ziyi Zheng, Weijie Kong, Huixu Dong

Affiliations: School of Mechanical Engineering, Zhejiang University

AI summary: For sequential object picking in clutter, proposes an integrated hardware-software pipeline combining a multifunctional gripper design with new representations of object distribution and occlusion relationships, enabling efficient recognition, search, and sequential grasping. The solution was the runner-up at RGMC 2025.

Comments: First, Second and Third Coauthor contributed equally to this work

Abstract

As a long-standing challenge in robotic manipulation, stable and efficient grasping in cluttered environments is of great importance in industrial settings. While recent studies have achieved relatively high success rates in grasping from clutter, there remain few mature solutions for more demanding tasks such as sequential object search and sorting. This work addresses sequential object picking in cluttered environments based on the Cluttered Environment Picking Benchmark (CEPB) and presents our solution to the Pick-in-Clutter track of the 10th Robotic Grasping and Manipulation Competition (RGMC) at ICRA 2025. The task poses several key challenges. First, it requires robust and collision-aware grasping with high success rates across a diverse set of objects, including both rigid and deformable ones. Second, it demands efficient search for target objects, which places stringent requirements on the decluttering and searching strategies of the solution. To address the above challenges, we design an integrated hardware-software pipeline that combines object recognition, decluttering, and multi-modal grasping. The main contributions include the hardware design of a multifunctional gripper and novel representations for object distribution and occlusion relationships in cluttered space. This pipeline enables efficient recognition, search, and sequential grasping of objects in clutter, demonstrating strong performance in both laboratory tests and competition scenarios, and ultimately achieving second place in the Pick-in-Clutter track of the RGMC 2025.

2606.12965 2026-06-12 cs.RO 新提交

EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment

EmbodiSteer: 用关节空间引导的具身无关视觉运动策略实现零样本跨具身部署

Shihefeng Wang, Kangchen Lv, Mingrui Yu, Xiang Li

发表机构 * Department of Automation, Tsinghua University(清华大学自动化系) Beijing Key Laboratory of Embodied Intelligence Systems(北京具身智能系统重点实验室) Institute for Embodied Intelligence and Robotics, Tsinghua University(清华大学具身智能与机器人研究所)

AI总结 提出EmbodiSteer框架,通过前向运动学和雅可比更新将推理时的扩散采样提升到目标机器人关节空间,并加入全身碰撞感知引导,实现零样本、具身感知的部署,在模拟和物理机器人上显著降低碰撞率并提高任务成功率。

Comments The first two authors contribute equally

详情
AI中文摘要

可扩展的机器人模仿学习依赖于来自不同机器人的大规模异构数据或无身体数据,使得笛卡尔末端执行器动作成为具身无关策略学习的关键接口。然而,仅末端执行器的抽象使得笛卡尔策略对部署的机器人身体无感知,导致其在全身碰撞避免等机器人特定约束下脆弱。为克服这一限制,我们提出EmbodiSteer,一种无需训练的框架,将具身无关的视觉运动策略引导至零样本、具身感知的部署。EmbodiSteer将策略学习保持在笛卡尔空间,同时通过前向运动学和基于雅可比的更新,高效地将推理时的扩散采样提升到目标机器人的关节空间。在每个去噪步骤后,通过关节轨迹上的全身碰撞感知引导,机械臂可以在保持学习到的末端执行器行为的同时避开碰撞。与仅笛卡尔执行相比,EmbodiSteer在9个模拟机器人上将碰撞率降低46.1%,任务成功率提高28.5%,并在高度受限场景下的两个物理机器人上实现碰撞率降低90.0%,成功率提高36.7%。项目页面:https://frankwang67.github.io/EmbodiSteer-Page

英文摘要

Scalable robot imitation learning relies on large-scale heterogeneous data from diverse robots or body-free data, making Cartesian end-effector actions a key interface for embodiment-agnostic policy learning. However, end-effector-only abstraction leaves Cartesian policies unaware of the deployed robot body, making them brittle under robot-specific constraints such as whole-body collision avoidance. To overcome this limitation, we present EmbodiSteer, a training-free framework that steers embodiment-agnostic visuomotor policies toward zero-shot, embodiment-aware deployment. EmbodiSteer keeps policy learning in Cartesian space while efficiently lifting inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian-based updates. With whole-body collision-aware guidance over joint trajectories after each denoising step, the arm can be steered away from collisions while preserving learned end-effector behavior. Compared with Cartesian-only execution, EmbodiSteer reduces collision rate by 46.1% and improves task success rate by 28.5% across 9 simulated robots, and further achieves 90.0% collision rate reduction and 36.7% success rate increase on two physical robots in highly constrained scenarios. Our project page is at https://frankwang67.github.io/EmbodiSteer-Page.
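Editor's sketch: the abstract mentions lifting Cartesian end-effector targets into joint space "via forward kinematics and Jacobian-based updates" but gives no formulas. The snippet below is only a generic illustration of that idea on a hypothetical planar 2-link arm, using a textbook damped least-squares update. The arm, the link lengths, the damping value, and all function names are assumptions for illustration; it is not EmbodiSteer's algorithm and omits diffusion sampling and collision guidance entirely.

```python
import math

def fk(q, l1=1.0, l2=1.0):
    """End-effector (x, y) of a hypothetical planar 2-link arm."""
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return (x, y)

def jacobian(q, l1=1.0, l2=1.0):
    """2x2 Jacobian d(x, y)/d(q0, q1) of the same toy arm."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]

def dls_step(q, target, damping=0.05):
    """One damped least-squares update: dq = J^T (J J^T + damping^2 I)^-1 e."""
    x, y = fk(q)
    e = (target[0] - x, target[1] - y)          # Cartesian error
    J = jacobian(q)
    # A = J J^T + damping^2 I, a symmetric 2x2 matrix [[a, b], [b, d]]
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b
    # Solve A w = e in closed form, then map back with J^T
    w0 = (d * e[0] - b * e[1]) / det
    w1 = (-b * e[0] + a * e[1]) / det
    dq0 = J[0][0] * w0 + J[1][0] * w1
    dq1 = J[0][1] * w0 + J[1][1] * w1
    return (q[0] + dq0, q[1] + dq1)

def track(q, target, iters=50):
    """Iterate the joint-space update until the Cartesian target is reached."""
    for _ in range(iters):
        q = dls_step(q, target)
    return q
```

For example, `track((0.3, 0.8), (1.2, 0.9))` returns joint angles whose forward kinematics land on the requested Cartesian point. In a guided sampler, one could imagine adding a collision-cost gradient to `dq0` and `dq1`, but how the paper does this is not stated in the abstract.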

2606.13102 2026-06-12 cs.RO 新提交

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

FTP-1:一种跨触觉传感器的通用基础触觉策略,用于密集接触操作

Chengbo Yuan, Zicheng Zhang, Mingjie Zhou, Wendi Chen, Yi Wang, Zhuoyang Liu, Dantong Niu, Shuo Wang, Hui Zhang, Wenkang Zhang, Yingdong Hu, Yuanqing Gong, Wanli Xing, Chuan Wen, Cewu Lu, Kaifeng Zhang, Yang Gao

发表机构 * Tsinghua University(清华大学) Shanghai Qi Zhi Institute(上海期智研究院) Sharpa Shanghai Jiao Tong University(上海交通大学) University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院) Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出FTP-1,首个通用基础触觉策略,通过异构编码器和共享Transformer专家,跨21种传感器和3000小时数据预训练,实现触觉操作技能的跨传感器迁移,在未见传感器上成功率提升31%。

详情
AI中文摘要

尽管基于视觉的通用机器人策略取得了成功,但现有的基于触觉的策略仍然局限于固定的具身和传感器设置。这是因为触觉信号在不同硬件之间高度异构,使得跨传感器泛化变得困难。我们提出了FTP-1,这是第一个通用基础触觉策略,预训练以获取跨不同传感器和具身的可迁移触觉操作能力。FTP-1支持多种触觉输入,包括基于图像、阵列和状态的信号,通过使用异构编码器将它们投影到统一的形态感知潜在标记中,并由共享的触觉Transformer专家联合建模。FTP-1在来自26个数据源的约3000小时触觉操作数据上进行预训练,涵盖21种传感器的人类和机器人演示,学习到的触觉技能可以迁移到预训练期间未见过的传感器上。在涵盖5种硬件配置的下游微调实验中,FTP-1在见过的传感器设置上将密集接触操作的成功率提高了17.2%,并且令人惊讶地,迁移到两种先前未见过的触觉传感器设置上,成功率提高了31%。FTP-1为触觉操作建立了第一个统一的基础基线,为未来的触觉策略提供了共享的模型级起点。预训练模型、数据集、训练代码及更多可视化内容见:https://ftp1-policy.github.io

英文摘要

Despite the success of vision-based generalist robotic policies, existing tactile-based policies remain tied to fixed embodiments and sensor setups. This is because tactile signals are highly heterogeneous across hardware, making cross-sensor generalization difficult. We present FTP-1, the first generalist foundation tactile policy pretrained to acquire transferable tactile manipulation abilities across diverse sensors and embodiments. FTP-1 supports varied tactile inputs, including image-, array-, and state-based signals, by using heterogeneous encoders to project them into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretrained on around 3,000 hours of tactile manipulation data aggregated from 26 data sources, spanning human and robot demonstrations across 21 sensors, FTP-1 learns tactile skills that transfer beyond the sensors seen during pretraining. Across downstream finetuning experiments spanning 5 hardware configurations, FTP-1 improves contact-rich manipulation on seen sensor setups by +17.2% and, surprisingly, transfers to two previously unseen tactile-sensor setups, achieving a +31% gain in success rate. FTP-1 establishes the first unified foundation baseline for tactile manipulation, providing future tactile policies with a shared model-level starting point. Pretrained models, datasets, training code, and more visualizations are available at https://ftp1-policy.github.io.

2606.13232 2026-06-12 cs.RO 新提交

WT-UMI: Tactile-based Whole-Body Manipulation via Force-Supervised Contact-Aware Planning

WT-UMI: 通过力监督的接触感知规划实现基于触觉的全身操控

Jaehwi Jang, Zhaoyuan Gu, Alfred Cueva, Zimeng Chai, Junjie Sheng, Thong Nguyen, Himank Galundia, Yifan Wu, Huishu Xue, Isaac Legene, Ojas Mediratta, Davin Doan, Andrew Collins, Sarah Sadegh, KyoungMok Kim, Rishita Dhalbisoi, Zun Chen, Ye Zhao

发表机构 * The Institute for Robotics and Intelligent Machines, Georgia Institute of Technology(机器人与智能机械研究所,佐治亚理工学院)

AI总结 提出WT-UMI系统,结合人体演示与遥操作数据,通过力监督规划器预测末端执行器位姿和接触力轨迹,并利用触觉导纳控制器提升全身操控性能。

Comments 18 pages, 8 figures

详情
AI中文摘要

全身人形操控笨重、可变形和共享负载物体需要分布式接触感知和显式力调节,然而大多数模仿策略仅隐式处理接触力。另一方面,不同的演示来源提供了具有固有权衡的互补模态:人体演示捕捉自然接触力但不包含机器人可执行动作,而遥操作直接记录机器人动作但力调节不够自然。本文提出WT-UMI,一种可穿戴全身触觉接口,可由人类操作员佩戴或安装在人形机器人上,在人体演示和人形遥操作模式下提供触觉图像、接触力和末端执行器位姿的精确观测。我们引入一个力条件目标位姿校正模块,通过从遥操作数据中学习校正,将测量的人体位姿转换为接触感知的机器人目标。为了利用人体数据中的自然力交互,我们提出一个力监督规划器,预测末端执行器位姿块和接触力轨迹。预测的接触力作为基于触觉的导纳控制器的参考。在五个接触密集型任务中,涵盖可变形物体、笨重刚体物体和人-人形协作,WT-UMI在成功率上优于四个策略基线,并降低了接触位置跟踪误差。项目页面:https://wt-umi.github.io/WTUMI/

英文摘要

Whole-body humanoid manipulation of bulky, deformable, and shared-load objects requires distributed contact sensing and explicit force regulation, yet most imitation policies treat contact force only implicitly. On the other hand, different demonstration sources provide complementary modalities with inherent trade-offs: human demonstrations capture natural contact forces but not robot-executable actions, while teleoperation directly records robot actions but with less natural force regulation. This paper presents WT-UMI, a wearable whole-body tactile interface worn by human operators or mounted on humanoids, providing accurate observations of tactile images, contact forces, and end-effector poses across both human demonstration and humanoid teleoperation modes. We introduce a force-conditioned target-pose correction module that converts measured human poses into contact-aware robot targets by learning corrections from teleoperation data. To leverage the natural force interaction in human data, we propose a force-supervised planner that predicts end-effector pose chunks and contact-force trajectories. The predicted contact force serves as the reference for a tactile-based admittance controller. Across five contact-rich tasks spanning deformable objects, bulky rigid objects, and human-humanoid collaboration, WT-UMI improves success rate and reduces contact-position tracking error over four policy baselines. Our project page is available at https://wt-umi.github.io/WTUMI/.
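Editor's sketch: the abstract says the predicted contact force is used as the reference for an admittance controller, without giving the control law. As a rough illustration only, the snippet below simulates a generic one-dimensional textbook admittance law, mass * a + damping * v = f_ref - f_meas, pressing against a spring-like surface. The virtual mass, damping, environment stiffness, time step, and function name are all assumed values for this toy, not parameters from the paper.

```python
def simulate_admittance(f_ref, k_env=100.0, mass=1.0, damping=20.0,
                        dt=0.01, steps=1000):
    """Simulate a 1-D admittance loop against a linear spring (toy model).

    x is the commanded offset along the contact normal (positive = pressing in).
    The environment is assumed to push back with f_meas = k_env * max(x, 0).
    Returns the final offset and the final contact force.
    """
    x, v = 0.0, 0.0
    for _ in range(steps):
        f_meas = k_env * max(x, 0.0)             # simulated force reading
        # Admittance law: the force error drives a virtual mass-damper
        a = (f_ref - f_meas - damping * v) / mass
        v += dt * a                              # semi-implicit Euler step
        x += dt * v
    return x, k_env * max(x, 0.0)
```

With the default values, `simulate_admittance(5.0)` settles at an offset where the simulated contact force matches the 5 N reference, which is the basic behavior one expects from force-reference tracking.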

2606.13279 2026-06-12 cs.RO 新提交

See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

选择性观察,适应性行动:双水平结构分解用于双臂机器人操作

Yoon-Ji Choi, Young-Chae Son, Soo-Chul Lim

发表机构 * Dongguk University(东国大学)

AI总结 提出基于双水平结构分解的双臂操作VLA框架,通过视觉选择路由和动作专家混合机制分别处理视觉相关性和双臂交互模式,在模拟和真实任务中成功率分别提升27.7%和43.3%。

详情
AI中文摘要

在双臂机器人操作中,任务相关的视觉信息随任务阶段和上下文变化,而两臂的交互在独立和协调模式之间切换,使得策略学习具有挑战性。然而,现有的整体式视觉-语言-动作(VLA)策略通过单一共享表示和动作生成路径处理多样的视觉输入和交互模式,往往无法分别考虑视觉相关性和双臂交互结构。为了解决这个问题,我们提出了一个基于双水平结构分解的双臂操作VLA框架。视图选择视觉路由器动态调整腕部视角的贡献以强调相关视觉线索,而交互感知动作专家混合(MoE)将动作生成分解为协调和单臂路径,以适应不同的双臂交互模式。我们在RoboTwin 2.0中的六个模拟双臂操作任务和三个长时域真实世界任务上评估了所提方法。我们的模型在模拟和真实世界评估中,整体平均成功率分别比整体式基线提高了27.7%和43.3%,并且在两种设置下始终优于单模块变体。这些结果表明,联合考虑选择性视觉处理和双臂交互结构的显式分解为鲁棒的双臂操作提供了有效的归纳偏置。

英文摘要

In bimanual robotic manipulation, task-relevant visual information varies with the task stage and context, while the interaction of the two arms shifts between independent and coordinated modes, making policy learning challenging. However, existing monolithic Vision-Language-Action (VLA) policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway, often failing to separately account for visual relevance and bimanual interaction structure. To address this issue, we propose a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. The View-Selective Visual Router dynamically adjusts wrist-view contributions to emphasize relevant visual cues, while the Interaction-Aware Action Mixture-of-Experts (MoE) decomposes action generation into coordinated and arm-wise pathways to adapt to varying bimanual interaction modes. We evaluate the proposed method on six simulated bimanual manipulation tasks in RoboTwin 2.0 and three long-horizon real-world tasks. Our model improves the overall average success rate over a monolithic baseline by 27.7% in simulation and 43.3% in real-world evaluation, while consistently outperforming single-module variants across both settings. These results demonstrate that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.
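Editor's sketch: the abstract describes a Mixture-of-Experts that splits action generation into coordinated and arm-wise pathways, but not the gating details. The snippet below shows only the generic softmax-gated mixing step common to MoE layers. The gate logits, the two toy "expert" outputs, and the function names are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(gate_logits, expert_actions):
    """Blend per-expert action vectors with softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_actions[0])
    return [sum(w * a[i] for w, a in zip(weights, expert_actions))
            for i in range(dim)]

# Hypothetical usage: a "coordinated" expert and an "arm-wise" expert,
# each proposing a 2-D action; a strongly positive first logit favors
# the coordinated pathway.
blended = mix_experts([4.0, 0.0], [[1.0, 1.0], [0.0, 2.0]])
```

A router over camera views could reuse the same weighting pattern, applied to view features instead of actions.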

2606.13394 2026-06-12 cs.RO 新提交

GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

GeoHAT: 几何自适应混合动作Transformer用于移动操作

Xiangyu Zhu, Renjun Wu, Luzhou Ge, Jinyan Liu, Xuesong Li

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出GeoHAT框架,通过轻量级傅里叶空间编码器注入几何信息,并采用混合全身动作解码器分解机械臂与基座动作,在ManiSkill-HAB基准上成功率提升23.7%。

详情
AI中文摘要

全身移动操作需要在不断变化的视角下协调移动基座和机械臂,这对几何感知和动作生成提出了挑战。当前的策略要么依赖2D特征,要么依赖缺乏密集空间结构的稀疏3D表示,并且通常将机械臂和基座编码在一个动作向量中,忽略了它们各自不同的控制需求。此外,现有的密集融合策略在噪声深度下可能破坏预训练表示,同时带来沉重的计算开销。我们提出了GeoHAT,一个基于简单原则的端到端扩散框架:几何信息应仅在可靠处注入,且仅在需要处被关注。GeoHAT采用轻量级傅里叶空间编码器,将密集的逐像素3D坐标映射为几何标记,无需额外的3D视觉骨干网络。然后,通过由深度有效性调制的逐标记门控融合,将这些标记选择性地注入视觉基础模型特征中,在保留语义先验的同时丰富空间理解。对于动作生成,混合全身动作解码器将机械臂和基座分解到不同的子空间,并通过稀疏交叉注意力让每个动作模态关注其任务相关的视觉上下文,同时因果时序建模捕获时间步内协调和时间步间依赖。在ManiSkill-HAB仿真基准上的实验表明,GeoHAT实现了79.3%的平均成功率,比最强基线高出23.7%。此外,在多种任务上的真实世界实验也证实了在所有基线上的一致改进。

英文摘要

Whole-body mobile manipulation requires coordinating mobile base and manipulator under shifting viewpoints, posing challenges in geometric perception and action generation. Current policies either rely on 2D features or sparse 3D representations that lack dense spatial structure, and typically encode arm and base within one action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations under noisy depth while incurring heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Furthermore, real-world experiments on diverse tasks also confirm consistent improvements over all baselines.
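Editor's sketch: two ideas in the abstract lend themselves to a small illustration: mapping per-pixel 3D coordinates to Fourier features, and gating the geometric signal by depth validity. The snippet below uses a standard sinusoidal encoding and a scalar gate. The number of frequency bands, the gate value, and the assumption that visual and geometric vectors already share a dimension are all simplifications chosen for illustration, not details from the paper.

```python
import math

def fourier_encode(xyz, num_bands=4):
    """Sin/cos features of a 3D coordinate at octave-spaced frequencies."""
    feats = []
    for c in xyz:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            feats.append(math.sin(freq * c))
            feats.append(math.cos(freq * c))
    return feats  # length = 3 * 2 * num_bands

def gated_fuse(visual, geometric, gate, depth_valid):
    """Add gated geometric features to a visual token.

    If the depth reading for this token is invalid, the gate is forced to
    zero so the pretrained visual feature passes through unchanged.
    """
    g = gate if depth_valid else 0.0
    return [v + g * z for v, z in zip(visual, geometric)]
```

In a learned model the gate would presumably be predicted per token and the geometric features projected to the visual feature width; here both are fixed by hand to keep the example self-contained.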

2606.13601 2026-06-12 cs.RO cs.SY eess.SY 新提交

MCR-Bionic Hand: Anatomical Structural Priors for Dexterous Manipulation

MCR-Bionic Hand: 用于灵巧操作的解剖结构先验

Haosen Yang, Guowu Wei

发表机构 * University of Salford(索尔福德大学)

AI总结 本文提出MCR-Bionic Hand,一种基于人体手部解剖结构先验的仿生机械手,通过结构智能实现低维控制到灵巧操作的映射,在接触密集型任务中验证了其有效性。

详情
AI中文摘要

灵巧机器人手通常被表述为由自由度、驱动和控制算法支配的高维主动控制系统。然而,人类手的灵巧性部分编码在骨骼、韧带、肌腱、腱膜和内在肌肉的物理结构中。本文将这种贡献描述为两种相互关联的结构智能形式:结构先验生成,其中腕指腱固定、FDS/FDP路径和背侧伸肌腱帽将低维姿态输入转换为默认抓取构型及PIP到DIP协调;以及肌肉介导的调节,其中外在肌、蚓状肌和骨间肌围绕该默认状态调节MCP姿态、远端稳定性、指尖力路径和接触状态。基于此框架,MCR-Bionic Hand被开发为一个1:1肌肉骨骼仿生手,在一个主体内集成了两排八骨手腕、跨腕肌腱、解剖屈肌腱路径、掌板和侧副韧带约束、背侧伸肌腱帽以及内在肌通路。功能演示和几何力学模型表明,手腕姿态诱导多关节预塑形,伸肌腱帽将PIP姿态映射为耦合的DIP响应,而内在肌通路在抓取形成后调节远端稳定性和指尖动作方向。接触密集型任务,包括硬币旋转、笔传递、手背翻硬币和立方体操作,表明MCR-Bionic将低维状态生成与精细接触后调节联系起来。这些结果表明,解剖仿生学的价值不在于视觉相似性,而在于识别执行部分控制功能的人手结构。

英文摘要

Dexterous robotic hands are usually formulated as high-dimensional active control systems governed by degrees of freedom, actuation, and algorithms. Human hand dexterity, however, is partly encoded in the physical architecture of bones, ligaments, tendons, aponeuroses, and intrinsic muscles. This work describes that contribution as two linked forms of structural intelligence: structural prior generation, in which wrist-to-finger tenodesis, FDS/FDP routing, and the dorsal extensor hood transform low-dimensional posture inputs into default grasp configurations and PIP-to-DIP coordination; and muscle-mediated modulation, in which extrinsic muscles, lumbricals, and interossei regulate MCP posture, distal stability, fingertip force paths, and contact states around that default state. Based on this framework, MCR-Bionic Hand is developed as a 1:1 musculoskeletal biomimetic hand integrating a two-row, eight-bone wrist, cross-wrist tendons, anatomical flexor routing, volar plate and collateral ligament constraints, the dorsal extensor hood, and intrinsic muscle pathways within one body. Functional demonstrations and geometric mechanical models show that wrist posture induces multi-joint pre-shaping, the extensor hood maps PIP posture to a coupled DIP response, and intrinsic plus pathways modulate distal stability and fingertip action direction after grasp formation. Contact-rich tasks, including coin rotation, pen transfer, dorsal coin flipping, and cube manipulation, show that MCR-Bionic links low-dimensional state generation with fine post-contact modulation. These results suggest that anatomical biomimetics is valuable not for visual similarity, but for identifying human hand structures that perform part of the control function.

2606.13677 2026-06-12 cs.RO cs.AI cs.CV cs.LG 新提交

Mana: Dexterous Manipulation of Articulated Tools

Mana: 铰接工具的灵巧操作

Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

发表机构 * UC Berkeley(加州大学伯克利分校) CMU(卡内基梅隆大学) Stanford University(斯坦福大学) Amazon FAR(亚马逊FAR)

AI总结 提出Mana框架,将灵巧操作重解释为动画问题,通过粗到细的流水线自动生成操作轨迹,实现铰接工具的零样本仿真到现实迁移。

Comments Project Page: https://zhaohengyin.github.io/mana

详情
AI中文摘要

铰接工具的操作由于需要协调内部自由度与接触丰富的交互,仍然是灵巧机器人学中的一个主要挑战。虽然先前的工作主要集中在刚性物体上,但铰接工具的使用由于其物理复杂性以及学习功能性抓取和操作策略的困难,仍未得到充分探索。我们提出了Mana(操作动画器),一个通用的仿真到现实框架,将灵巧操作重新解释为动画问题。受计算机动画启发,Mana采用粗到细的流水线,通过运动规划和强化学习将程序生成的抓取关键帧转化为操作轨迹。数据生成过程基本自动化,仅需几次鼠标点击即可指定功能可供性(每个工具不到1分钟)。在跨越不同尺度和关节类型的四个铰接工具上,Mana实现了抓取和手内操作的零样本仿真到现实迁移,展示了灵巧铰接工具操作的可扩展方法。

英文摘要

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

2604.08983 2026-06-12 cs.RO 版本更新

AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly

AssemLM: 用于机器人装配的空间推理多模态大语言模型

Zhi Jing, Jinbin Qiao, Ouyang Lu, Jicong Ao, Shuang Qiu, Huazhe Xu, Yu-Gang Jiang, Chenjia Bai

发表机构 * Fudan University(复旦大学) Institute of Artificial Intelligence (TeleAI), China Telecom(人工智能研究所(TeleAI),中国电信) Tianjin University(天津大学) Northwestern Polytechnical University(西北工业大学) Tsinghua University(清华大学) City University of Hong Kong(香港城市大学)

AI总结 提出AssemLM,一种融合装配手册、点云和文本指令的多模态大语言模型,通过专用点云编码器提取几何与旋转特征,实现精确的6D装配位姿推理,并构建含90万样本的AssemBench基准,在真实机器人装配任务中取得最优性能。

Comments Project Page: https://assemlmhome.github.io/

详情
AI中文摘要

空间推理是具身智能的基本能力,尤其对于机器人装配等精细操作任务。当前基于视觉语言模型(VLM)的方法主要依赖粗粒度的2D感知,难以对复杂3D几何进行精确推理。为解决这一局限,我们提出AssemLM,一种用于机器人装配的空间多模态大语言模型,它整合装配手册、点云和文本指令,通过显式几何理解预测任务关键的6D装配位姿。为桥接原始3D感知与高层语言推理,AssemLM采用专用点云编码器提取细粒度几何与旋转特征,以实现装配任务中精确的3D空间推理。此外,我们引入AssemBench,一个面向装配空间推理的大规模基准,包含超过90万多模态样本和精确的6D位姿标注,将评估从2D定位扩展到完整的3D几何推理。大量实验和真实机器人评估表明,AssemLM在6D位姿推理性能上达到最优,并有效支持真实环境中的精细多步装配任务。代码、模型和AssemBench数据集将公开提供。

英文摘要

Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. Recent methods based on vision-language models (VLMs) largely rely on coarse 2D perception and struggle to perform accurate reasoning over complex 3D geometry. To address this limitation, we propose AssemLM, a spatial multimodal large language model for robotic assembly that integrates assembly manuals, point clouds, and textual instructions to predict task-critical 6D assembly poses with explicit geometric understanding. To bridge raw 3D perception and high-level linguistic reasoning, AssemLM employs a specialized point cloud encoder to extract fine-grained geometric and rotational features for accurate 3D spatial reasoning in assembly tasks. In addition, we introduce AssemBench, a large-scale benchmark for assembly-oriented spatial reasoning with over 900K multimodal samples and precise 6D pose annotations, extending evaluation from 2D grounding to full 3D geometric inference. Extensive experiments and real-robot evaluations demonstrate that AssemLM achieves state-of-the-art 6D pose reasoning performance and effectively supports fine-grained, multi-step assembly tasks in real-world settings. Code, models, and the AssemBench dataset will be made publicly available.

2606.08765 2026-06-12 cs.RO cs.CV 版本更新

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

RGB-S: 用于鲁棒灵巧操作的图像对齐触觉显著性

Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

发表机构 * ShanghaiTech University(上海科技大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院)

AI总结 提出RGB-S框架,通过正向运动学和相机标定将触觉传感器位置投影到RGB图像平面,生成力调制高斯显著性图,显式对齐触觉与视觉,在严重遮挡下灵巧操作成功率提升26.7个百分点。

Comments 20 pages, 7 figures

详情
AI中文摘要

有效的视觉-触觉整合对于机器人灵巧操作至关重要,尤其是在视觉观测不可靠或被遮挡时。然而,将稀疏、异构的触觉测量与密集的视觉表示鲁棒地对齐仍然是一个基本挑战。大多数现有方法需要策略从有限的演示中隐式学习跨模态对应关系,而不利用几何先验。因此,它们在视觉观测退化时往往数据效率低且泛化能力差。为解决这一限制,我们提出一个框架,显式地将物理接触锚定在图像域中。利用机器人正向运动学和相机标定,我们将触觉传感器位置直接投影到RGB图像平面上。然后,我们渲染力调制的高斯显著性图,以模拟由运动学和标定误差引起的空间不确定性。通过零初始化的条件架构整合这些2D空间锚点,我们的方法将物理接触先验注入标准视觉骨干网络,同时保留预训练的视觉表示。我们在模拟和现实世界的六项灵巧操作任务中评估了我们的方法,在严重视觉遮挡下。现实世界实验表明,在图像域中显式的RGB-S锚定将现实世界遮挡操作成功率比最强的隐式视觉-触觉基线提高了26.7个百分点,表明其空间推理能力和对遮挡的鲁棒性得到了改善。项目页面:touch-as-saliency.github.io

英文摘要

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves occluded manipulation success rates by 26.7 percentage points over the strongest implicit visuo-tactile baseline, suggesting improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
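Editor's sketch: the projection-and-render step described in the abstract can be illustrated with a plain pinhole camera model and an isotropic Gaussian blob scaled by contact force. The snippet assumes the tactile sensor position has already been expressed in the camera frame (in the paper this would come from forward kinematics and camera calibration). The intrinsics, the fixed `sigma`, and the function names are assumptions for illustration; how the paper actually modulates the Gaussian by force is not specified in the abstract.

```python
import math

def project_point(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in the camera frame to pixel (u, v)."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def saliency_map(contacts, width, height, fx, fy, cx, cy, sigma=3.0):
    """Render one force-scaled Gaussian per contact into a 2D map.

    contacts: list of ((x, y, z), force) with positions in the camera frame.
    Returns a height x width nested list of floats.
    """
    img = [[0.0] * width for _ in range(height)]
    for p_cam, force in contacts:
        u, v = project_point(p_cam, fx, fy, cx, cy)
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                # Peak height scales with force; sigma stands in for the
                # spatial uncertainty from kinematic/calibration error.
                img[row][col] += force * math.exp(-d2 / (2 * sigma ** 2))
    return img
```

The resulting map has the same resolution as the RGB frame, so it can be stacked with or injected alongside the image as an extra conditioning channel.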

2606.10683 2026-06-12 cs.RO cs.AI cs.CV 版本更新

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

UniDexTok:基于真实数据的统一灵巧手分词器

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Hefei University of Technology(合肥工业大学) Rimbot Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出统一灵巧手模型(UDHM)将人手和机器人手状态映射到共享22自由度语义接口,并基于此开发UniDexTok,一种免重定向的状态分词器,学习基于真实关节状态的离散token,实现异构灵巧手的统一表示,误差降低98%以上。

详情
AI中文摘要

灵巧手对于精细操作至关重要,但其硬件设计在不同具身形态之间存在显著差异。运动学、关节定义和自由度方面的差异使得定义共享状态表示比平行夹爪困难得多。因此,灵巧手数据仍然碎片化,难以用于联合训练。在这项工作中,我们提出了统一灵巧手模型(UDHM),它将人手和机器人手状态映射到一个共享的22自由度语义接口。基于UDHM,我们引入了UniDexTok,一种免重定向的状态分词器,它从标准化的真实关节状态中学习以具身形态为条件的离散token。UniDexTok为异构灵巧手提供了统一表示,无需依赖重定向或仿真数据。与最近的基线UniHM相比,UniDexTok将MPJAE从15.63度降低到0.16度,MPJPE从18.51毫米降低到0.18毫米,误差分别减少了98.98%和99.03%。这些结果将重建精度从厘米级提升到亚毫米级。实验进一步表明,来自其他具身形态的数据提高了目标具身形态的重建精度,证明了跨具身分词的优势。当引入新的灵巧手时,UniDexTok还表现出强大的零样本和少样本重建能力。

英文摘要

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
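Editor's check: the two relative error reductions quoted in the abstract follow directly from the reported before/after values. The short calculation below reproduces them; the helper name is ours.

```python
def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before

# Values quoted in the abstract (UniHM baseline vs. UniDexTok)
mpjae_reduction = relative_reduction(15.63, 0.16)   # degrees
mpjpe_reduction = relative_reduction(18.51, 0.18)   # millimetres

print(f"MPJAE reduction: {mpjae_reduction:.2f}%")   # 98.98%
print(f"MPJPE reduction: {mpjpe_reduction:.2f}%")   # 99.03%
```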

2606.11767 2026-06-12 cs.RO cs.AI 版本更新

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

通过真实到仿真到真实触觉策略学习的盲操作灵巧抓取

Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

发表机构 * ShanghaiTech University(上海科技大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院)

AI总结 提出一种结合Real2Sim触觉校准、布局感知触觉编码器和触觉条件扩散策略的框架,实现仅依赖触觉的灵巧手盲抓取,在真实机器人上对20个物体达到27%成功率。

Comments 23 pages, 6 figures

详情
AI中文摘要

使用灵巧手进行盲抓取是一项关键的操作能力。然而,由于触觉的仿真到真实差距以及稀疏触觉信号的有限表达能力,为真实机器人学习这种仅依赖触觉的策略仍然具有挑战性。为了弥合这一差距,我们提出了一个仅依赖触觉的盲抓取框架,该框架可部署在物理多指机器人手上。我们的方法结合了三个关键组成部分。首先,我们引入了一个Real2Sim触觉校准流程,构建了一个接触校准的数字孪生模拟器,能够复现真实的触觉信号。其次,我们使用布局感知触觉编码器改进了稀疏触觉观测的表达能力,该编码器通过自监督预训练融入了传感器几何先验。第三,为了提高对未见物体的泛化能力,我们在校准后的模拟器中训练了特定物体的强化学习专家,并将其成功的抓取轨迹聚合为触觉条件扩散策略。我们在配备分布式触觉传感的物理LEAP手上评估了我们的方法,涉及10个见过和10个未见过的物体。部署的策略在所有20个物体上实现了27%的真实世界抓取成功率,无需真实世界的抓取演示或视觉输入。仿真消融实验表明,布局感知触觉预训练提高了抓取性能,而传感级评估确认Real2Sim校准增加了仿真与硬件之间触觉接触事件的一致性。这些结果表明,接触事件校准、几何感知触觉表示学习和基于扩散的策略聚合为真实灵巧机器人手上的仅触觉盲抓取提供了一条有效路径。项目页面:Dex-Blind-Grasp.github.io

英文摘要

Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: Dex-Blind-Grasp.github.io.

4. 导航、定位与SLAM 13 篇

2606.12550 2026-06-12 cs.RO cs.AI 新提交

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Foresight: 关于导航关键线索的迭代推理

Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas

发表机构 * UT Austin(德克萨斯大学奥斯汀分校) FieldAI

AI总结 提出Foresight框架,利用微调VLM交替提出和批评图像空间运动计划,通过人类反馈学习奖励模型进行强化学习后训练,实现无地图导航中稀疏语言指令下的迭代运动优化,任务成功率提升37%。

Comments 22 pages, 10 figures, 3 tables

详情
AI中文摘要

从稀疏语言指令进行开放世界无地图导航需要解决未明确指定的目标,并推断哪些环境线索与到达目标相关。例如,到达一个视野外的目的地可能需要解释坡道、标志或绕行路线,这些揭示了去哪里或走哪条路线。先前的工作受限于对已知导航因素和封闭集因素类别的依赖,或者在运动规划之前识别线索而遗漏了依赖于计划的线索。我们认为预训练的视觉语言模型(VLM)可以发现新的指令相关线索,但需要适应以关注哪些线索重要以及它们应如何影响运动规划。我们在Foresight中实现了这些想法,这是一个测试时框架,其中微调的VLM交替提出图像空间运动计划并使用语言目标和视觉上下文对其进行批评。后续计划基于先前的批评,使得在执行前能够进行迭代运动优化。为了将计划批评和优化与开放集行为偏好对齐,我们从人类反馈中学习一个奖励模型,并使用它在计划-批评循环中通过强化学习对VLM进行后训练。在离线评估和6个真实世界环境中,相对于最先进的测试时推理和基础模型基线,Foresight将平均任务成功率提高了37%,并将每次任务的干预次数减少了52%,同时在Jetson AGX Orin上实时运行。我们将发布代码、数据和训练细节,以支持未来关于机器人运动优化的测试时推理工作。更多视频请见:https://amrl.cs.utexas.edu/foresight

英文摘要

Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: https://amrl.cs.utexas.edu/foresight
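Editor's sketch: the propose-then-critique loop described in the abstract can be outlined as a small control-flow skeleton. The `propose` and `critique` callables stand in for the finetuned VLM's two roles and are entirely hypothetical, as are the stopping rule and the round limit. The snippet shows only the loop structure in which each new plan is conditioned on earlier critiques.

```python
def refine_plan(propose, critique, max_rounds=3):
    """Alternate between proposing a plan and critiquing it.

    propose(critiques) -> plan      (conditioned on all prior critiques)
    critique(plan)     -> str/None  (None means "no objections")
    Returns the last plan and the list of critiques gathered on the way.
    """
    critiques = []
    plan = None
    for _ in range(max_rounds):
        plan = propose(critiques)
        feedback = critique(plan)
        if feedback is None:        # plan accepted, stop refining
            break
        critiques.append(feedback)
    return plan, critiques

# Toy stand-ins: the "plan" is just a number that grows with each critique.
def toy_propose(critiques):
    return len(critiques)

def toy_critique(plan):
    return None if plan >= 2 else "too short"

final_plan, history = refine_plan(toy_propose, toy_critique)
```

In the toy run the first two proposals are rejected and the third is accepted, so refinement ends before execution with two critiques on record.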

2606.12603 2026-06-12 cs.RO cs.AI 新提交

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

从模仿到对齐:面向长距离人行道导航的人类偏好流策略

Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出FlowPilot,一种仅使用单目RGB相机的无地图导航策略,通过锚定流匹配进行预训练,并引入人类偏好学习实现对齐,在长距离人行道导航中提升鲁棒性和社会合规性。

详情
AI中文摘要

自主长距离人行道导航对于微出行应用(如机器人送餐和辅助电动轮椅)至关重要。与道路上的自动驾驶不同,长距离人行道导航需要在不可预测的人行道地形和行人中精确操作,且感知栈轻量,仅需单个单目RGB相机。虽然从演示中模仿学习(IL)提供了一种实用解决方案,但由此产生的自动驾驶策略常常遭受复合误差、人行道上缺乏社会合规性以及缺乏处理复杂情况的反事实推理能力。为解决这些挑战,我们提出了FlowPilot,一种仅使用单目RGB相机即可实现稳健高效长距离导航性能的无地图导航策略。我们首先提出使用锚定流匹配作为动作表示,用于在大型机器人车队数据上进行策略预训练,并捕捉人行道导航行为的多样、复杂、多模态分布。为弥合模仿与对齐之间的差距,我们进一步设计了一种人在环的偏好学习方案,通过少量人类干预数据调整策略。它增强了模型的反事实推理能力和在人行道上的社会合规性。我们通过在多样化人行道环境中的广泛仿真和真实世界实验评估了FlowPilot。在仿真中,FlowPilot实现了42%的成功率和66%的路线完成率,而FlowPilot-HP进一步提升了真实世界的鲁棒性和社会合规性,相对于基础模型,IR降低了40.0%,NIR降低了52.1%。

英文摘要

Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.

2606.12956 2026-06-12 cs.RO 新提交

SERF: Spatiotemporal Environment and Robot Feature Map for Long-Horizon Mobile Manipulation

SERF:面向长时域移动操作任务的时空环境与机器人特征地图

Sunghwan Kim, Byeonghyun Pak, Kehan Long, Yulun Tian, Nikolay Atanasov

发表机构 * UC San Diego(加州大学圣地亚哥分校) Agency for Defense Development(韩国国防科学研究所) SceniX Inc.(SceniX公司) University of Michigan(密歇根大学)

AI总结 提出SERF地图,将环境与机器人身体表示为共享潜空间中的神经点,并在线更新,作为VLA模型的状态输入,提升长时域移动操作中的推理能力,在BEHAVIOR-1K上优于纯图像基线。

Comments Project page: https://existentialrobotics.org/serf/

详情
AI中文摘要

长时域机器人移动操作需要对定位、环境变化和任务进度进行持续推理,而这些都难以仅从图像观测中推断。在本文中,我们表明,让移动操作策略以时空特征地图为条件,可以改善长时域上的推理。该地图将环境和铰接机器人身体表示为共享潜空间中的神经点,并根据自我中心观测和本体感受状态在线更新。我们使用对象级刚性跟踪更新环境神经点,并使用正运动学更新机器人神经点。通过从多个参考坐标系和空间尺度提取地图token,我们将时空环境与机器人特征(SERF)地图作为状态输入到视觉-语言-动作(VLA)模型中,为策略同时提供局部和全局上下文。我们在BEHAVIOR-1K(一个家庭环境中的长时域移动操作基准)上展示了SERF。实验表明,SERF VLA策略优于纯图像基线,通过遵循更直接的轨迹更快地达到子目标,提高了对场景配置变化的鲁棒性,并能从物体掉落失败中恢复。

英文摘要

Long-horizon robot mobile manipulation requires continual reasoning about localization, environment changes, and task progress, all of which are challenging to infer from image observations alone. In this paper, we show that conditioning a mobile manipulation policy on a spatiotemporal feature map improves reasoning over long horizons. The map represents the environment and the articulated robot body as neural points in a shared latent space and is updated online from egocentric observations and proprioceptive state. We update the environment neural points using object-level rigid tracking and the robot neural points using forward kinematics. We use our spatiotemporal environment and robot feature (SERF) map as a state input to a vision-language-action (VLA) model by extracting map tokens from multiple reference frames and spatial scales, providing the policy with both local and global context. We demonstrate SERF on BEHAVIOR-1K, a benchmark for long-horizon mobile manipulation in household environments. Experiments show that the SERF VLA policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, improves robustness to scene-configuration shifts, and recovers from object-drop failures.

2606.13494 2026-06-12 cs.RO cs.CV 新提交

NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

NavWAM:用于目标条件视觉导航的导航世界动作模型

Daichi Azuma, Taiki Miyanishi, Koya Sakamoto, Shuhei Kurita, Yaonan Zhu, Petr Khrapchenkov, Motoaki Kawanabe, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学) National Institute of Informatics(国立信息学研究所) AIRoA ATR

AI总结 提出NavWAM,一种扩散变换器策略,通过联合学习未来观测、目标进度值和动作块,将导航世界模型预测直接转化为可执行动作,在离线基准和真实机器人部署中优于基于规划的世界模型基线。

Comments Project page: https://dachii-azm.github.io/navwam/

详情
AI中文摘要

目标条件视觉导航要求机器人在部分可观测性下行动,通过预测其运动将如何改变未来的自我中心视图以及这种变化是否使其更接近目标。导航世界模型提供了这种视觉预见,但它们仍然是预测模块,需要外部规划器将预测的未来转化为闭环控制。我们提出导航世界动作模型(NavWAM),一种扩散变换器策略,通过将未来观测、目标进度值和动作块表示在同一共享潜在序列中,将导航世界模型预测转化为可执行动作。通过联合学习未来预测与决定闭环行为的动作和价值目标,NavWAM使视觉预见可直接用于机器人控制。我们通过模拟预训练和真实机器人适应构建NavWAM,并在图像目标导航任务上将其与基于规划的世界模型和一种代表性直接导航策略进行对比评估。在离线基准和闭环真实机器人部署中,NavWAM在使用默认策略模式(无CEM式动作搜索)的情况下,在我们的评估中优于基于规划的世界模型基线。项目页面:https://dachii-azm.github.io/navwam/

英文摘要

Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: https://dachii-azm.github.io/navwam/

2606.12849 2026-06-12 cs.DC cs.CV cs.RO 交叉投稿

SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

SemanticXR: 低功耗实时可查询语义建图与对象级设备-云架构

Rahul Singh, Devdeep Ray, Connor Smith, Sarita Adve

AI总结 提出首个设备-云协同系统SemanticXR,通过对象级通信、执行和内存管理,在XR功耗、带宽和内存约束下实现实时开放词汇语义建图与查询,服务器端建图延迟降至基线的约1/2.2(即改善2.2倍),设备功耗仅增加2%。

详情
AI中文摘要

语义建图是新兴扩展现实(XR)应用(如AI助手和空间对象搜索)中实现有现实依据(grounded)的交互的核心服务。在移动XR设备上部署此功能需要系统具备开放词汇、实时和低功耗特性。现有方法计算密集且假设服务器级资源。云卸载提供了一条实用路径,但现有系统未在设备-云边界拆分语义建图或管理其通信、执行和内存占用。我们提出SemanticXR,首个在XR功耗、带宽和内存约束下实现实时开放词汇语义建图与查询的设备-云系统。我们的关键洞察是将语义可识别对象提升为跨设备和服务器的通信、执行和内存的一级单元。在服务器端,对象级并行和几何下采样改善了建图延迟,而对象级深度建图协同设计降低了上行带宽。在设备端,具有增量更新和更新优先级的对象级稀疏局部地图实现了网络鲁棒的查询,并限制了内存和下行带宽。对象级可配置的资源使用与质量权衡让应用和系统分别根据应用需求和运行条件调整建图。与使用相同感知模型的设备-云基线相比,对象级组织在同等语义质量下将服务器端建图延迟降至基线的约1/2.2(即改善2.2倍)。深度建图协同设计将上行带宽维持在2.5 Mbps以下。在设备端,SemanticXR即使在网络中断时也能为多达10,000个对象维持低于100 ms的查询延迟,在500 MB内支持数万个对象,且下行带宽随地图变化量而非场景总规模增长。系统在正常运行时仅增加2%的设备功耗。

英文摘要

Semantic mapping is a core service that enables grounded interactions in emerging Extended Reality (XR) applications such as AI assistants and spatial object search. Deploying this capability on mobile XR devices requires a system that is open-vocabulary, real-time, and low-power. Existing approaches are compute-intensive and assume server-class resources. Cloud offloading offers a practical path, but no existing system splits semantic mapping across the device-cloud boundary or manages its communication, execution, and memory footprint. We present SemanticXR, the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Our key insight is to elevate semantically identifiable objects to first-class units of communication, execution, and memory across the device and server. On the server, object-level parallelism and geometry downsampling improve mapping latency, while object-level depth-mapping co-design reduces upstream bandwidth. On the device, an object-level sparse local map with incremental updates and update prioritization enables network-robust querying with bounded memory and downstream bandwidth. Object-level configurable resource usage vs. quality trade-offs let applications and the system adapt mapping to application requirements and operating conditions, respectively. Against a device-cloud baseline with the same perception models, object-level organization improves server-side mapping latency by 2.2X at equal semantic quality. Depth-mapping co-design maintains upstream bandwidth under 2.5 Mbps. On the device, SemanticXR sustains sub-100 ms query latency for up to 10,000 objects even under network drops, supports tens of thousands of objects within 500 MB, and scales downstream bandwidth with map changes, not total scene size. The system adds only 2% device power during normal operation.

2606.13206 2026-06-12 cs.CV cs.RO 交叉投稿

Visual Place Recognition in Forests with Depth-Aware Distillation

基于深度感知蒸馏的森林视觉地点识别

Walter Nedov, Saimunur Rahman, Kavindie Katuwandeniya, David Hall, Kaushik Roy, Peyman Moghadam

发表机构 * CSIRO Robotics, Brisbane, Australia(澳大利亚联邦科学与工业研究组织机器人实验室,布里斯班,澳大利亚) University of Queensland, Brisbane, Australia(昆士兰大学,布里斯班,澳大利亚) Queensland University of Technology, Brisbane, Australia(昆士兰科技大学,布里斯班,澳大利亚)

AI总结 针对森林环境中视觉地点识别因植被重复、结构线索弱及外观变化大而困难的问题,提出轻量级深度感知蒸馏框架,将几何线索注入DINOv2模型,在WildCross基准上提升鲁棒性。

Comments IEEE ICRA Workshop on Field Robotics 2026

详情
AI中文摘要

在自然森林环境中,由于植被重复、结构线索弱以及穿越过程中外观变化显著,视觉地点识别仍然具有挑战性。为解决这一限制,本文提出了一种轻量级的深度感知蒸馏框架,该框架将几何线索注入基于DINOv2的地点识别模型,同时保持其预训练的描述符空间。在最近的WildCross基准上进行评估,所提出的方法相比仅依赖外观的对应方法取得了性能提升,对外观变化具有鲁棒性。这些结果证明了深度作为自然环境中地点识别的强互补模态的重要性,并指出深度感知蒸馏是迈向更鲁棒森林感知的一个有前景的方向。

英文摘要

Visual place recognition in natural forest environments remains challenging due to repetitive vegetation, weak structural cues, and significant appearance variation across traversals. To address this limitation, this paper proposes a lightweight depth-aware distillation framework that injects geometric cues into a DINOv2-based place recognition model, while maintaining its pre-trained descriptor space. Evaluated on the recent WildCross benchmark, the proposed approach yields gains over an appearance-only counterpart, providing robustness to appearance variations. These results demonstrate the importance of depth as a strong complementary modality for place recognition in natural environments and identify depth-aware distillation as a promising direction for more robust forest perception.

2606.13503 2026-06-12 cs.CV cs.AI cs.RO 交叉投稿

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

异构激光雷达早期融合与学习重排序策略用于非结构化环境中的鲁棒长期地点识别

Judith Vilella-Cantos, Juan José Cabrera, Mónica Ballesta, David Valiente, Luis Payá

发表机构 * Miguel Hernández University of Elche(米格尔·埃尔南德斯·德埃尔切大学)

AI总结 提出MinkUNeXt-VINE++方法,通过异构LiDAR数据早期融合和学习重排序策略,在非结构化环境(如葡萄园)中显著提升长期地点识别性能,Recall@1指标提升20%-30%。

详情
AI中文摘要

在非结构化环境(如农田)中,鲁棒定位是自主系统的关键挑战。LiDAR传感器提供环境的详细3D信息,且不受光照条件影响,因此基于LiDAR的地点识别方法备受关注。本文提出MinkUNeXt-VINE++,一种结合两个传感器(Livox Mid-360和Velodyne VLP-16)异构LiDAR数据早期融合与推理时学习重排序策略的新方法。这种融合利用每个传感器的优势,提供更全面的环境表示。此外,重排序方法在重复环境(如葡萄园)中尤为重要,因为找到真正匹配是一项重大挑战。我们使用TEMPO-VINE数据集评估了该方法,该数据集提供了不同物候阶段葡萄园环境中的异构LiDAR数据。结果表明,与单传感器方法和现有最优方法相比,MinkUNeXt-VINE++显著提升了地点识别性能。与单传感器方法相比,MinkUNeXt-VINE++在Recall@1指标上提升了20%,加入重排序后提升30%。我们的方法代码已公开,可复现结果。

英文摘要

Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) and a learned re-ranking strategy in inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, as finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and +30% including re-ranking. The code of our method is publicly available for reproduction.
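The Recall@1 figures quoted above can be read more easily with the metric written out. The following is a minimal sketch of Recall@K for place recognition, not the authors' evaluation code. It assumes each query comes with a ranked list of retrieved candidate IDs and a set of ground-truth positive IDs; the function name `recall_at_k` and the toy data are illustrative only.

```python
def recall_at_k(ranked, positives, k=1):
    """Fraction of queries with at least one true positive in the top-k results.

    ranked:    list of ranked candidate-ID lists, one per query (best first)
    positives: list of sets of ground-truth positive IDs, one per query
    """
    if len(ranked) != len(positives):
        raise ValueError("ranked and positives must have the same length")
    if not ranked:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for candidates, truth in zip(ranked, positives)
        if any(c in truth for c in candidates[:k])
    )
    return hits / len(ranked)


if __name__ == "__main__":
    # Hypothetical retrieval results for three queries.
    ranked = [[3, 1], [2, 0], [5, 4]]
    positives = [{1}, {2}, {4}]
    print("Recall@1:", recall_at_k(ranked, positives, k=1))
    print("Recall@2:", recall_at_k(ranked, positives, k=2))
```

A re-ranking stage only reorders each candidate list, so it can raise Recall@1 without changing Recall@K for a K that already covers the whole shortlist.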

2510.05430 2026-06-12 cs.RO 版本更新

Active Semantic Perception

主动语义感知

Huayi Tang, Pratik Chaudhari

发表机构 * General Robotics, Automation, Sensing and Perception (GRASP) Laboratory(通用机器人、自动化、传感与感知实验室)

AI总结 提出一种基于紧凑多层场景图和大语言模型的主动语义感知方法,用于高效探索未知环境,在仿真和真实机器人上验证了优于现有方法。

详情
AI中文摘要

我们开发了一种主动语义感知方法,该方法利用场景的语义进行探索等任务。我们构建了一个紧凑的多层场景图,能够以不同抽象级别表示大型复杂室内环境,例如对应于房间、物体、墙壁、窗户等的节点,以及它们几何结构的细粒度细节。我们基于大语言模型(LLM)开发了一个程序,用于采样与场景部分观测一致的未观测区域的新可能场景图。我们开发了一个程序,用于计算潜在航点在该场景图上的信息增益,以实现复杂的空间推理:例如,从客厅出去的两扇门中,一扇可能通向厨房,另一扇通向卧室。我们在仿真中的逼真3D室内公寓以及现实世界中的Unitree Go 2机器人上评估了我们的方法。定性和定量分析表明,我们的方法能够比现有方法更快、更准确地确定环境中高层和低层的语义信息。

英文摘要

We develop an approach for active semantic perception, which refers to using the semantics of the scene for tasks such as exploration. We build a compact, multi-layer scene graph that can represent large, complex indoor environments at various levels of abstraction, e.g., nodes corresponding to rooms, objects, walls, windows etc., as well as fine-grained details of their geometry. We develop a procedure based on large language models (LLMs) to sample new plausible scene graphs of unobserved regions that are consistent with partial observations of the scene. We develop a procedure to compute the information gain of a potential waypoint upon this scene graph to enable sophisticated spatial reasoning: for example, of the two doors that lead out of the living room, one probably leads to the kitchen and the other to the bedroom. We evaluate our approach in realistic 3D indoor apartments in simulation and also on a Unitree Go 2 robot in the real world. Qualitative and quantitative analysis shows that our approach can pin down high-level and low-level semantic information in the environment quickly and more accurately than existing approaches.
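As a rough illustration of scoring waypoints by information gain over sampled scene graphs, the sketch below is not the authors' procedure. It assumes each candidate waypoint has a list of sampled hypotheses (here, room labels behind a door) and that visiting the waypoint fully resolves the label, so the expected gain equals the entropy of the samples. The names `entropy_bits` and `best_waypoint` are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_waypoint(hypotheses):
    """Return the waypoint whose sampled hypotheses are most uncertain.

    hypotheses: dict mapping waypoint name -> list of sampled labels
    """
    return max(hypotheses, key=lambda w: entropy_bits(hypotheses[w]))


if __name__ == "__main__":
    # Hypothetical samples: what might lie behind each door.
    hypotheses = {
        "door_a": ["kitchen", "kitchen", "kitchen", "bedroom"],
        "door_b": ["kitchen", "bedroom", "bathroom", "bedroom"],
    }
    for name, samples in hypotheses.items():
        print(name, round(entropy_bits(samples), 3))
    print("visit first:", best_waypoint(hypotheses))
```

Under this simplification the planner prefers the door whose outcome the samples disagree on most; a fuller treatment would also weigh travel cost and partial observations.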

2511.23030 2026-06-12 cs.RO cs.CV 版本更新

DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management

DiskChunGS:基于分块内存管理的大规模3D高斯SLAM

Casimir Feldmann, Maximum Wilder-Smith, Vaishakh Patil, Michael Oechsle, Michael Niemeyer, Keisuke Tateno, Marco Hutter

发表机构 * Robotic Systems Lab, ETH Zurich(机器人系统实验室,瑞士苏黎世联邦理工学院) Google(谷歌)

AI总结 提出DiskChunGS,通过将场景划分为空间块并将非活跃区域存储于磁盘,突破GPU内存限制,实现大规模3D高斯SLAM,在多个数据集上完成全序列重建并提升视觉质量。

详情
Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 4, 2026
AI中文摘要

近期3D高斯溅射(3DGS)的进展在实时渲染的新视角合成中展现了令人印象深刻的结果。然而,将3DGS与SLAM系统集成面临根本的可扩展性限制:方法受限于GPU内存容量,只能重建小规模环境。我们提出DiskChunGS,一种可扩展的3DGS SLAM系统,通过一种核外(out-of-core)方法克服这一瓶颈,该方法将场景划分为空间块,并在GPU内存中仅维护活跃区域,同时将非活跃区域存储在磁盘上。我们的架构与现有的用于位姿估计和闭环检测的SLAM框架无缝集成,实现大规模全局一致的重建。我们在室内场景(Replica、TUM-RGBD)、城市驾驶场景(KITTI)以及资源受限的Nvidia Jetson平台上验证了DiskChunGS。我们的方法是唯一在不出现内存故障的情况下完成全部11个KITTI序列的方法,同时实现了更优的视觉质量,证明了算法创新可以克服先前限制3DGS SLAM方法的内存约束。

英文摘要

Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. We validate DiskChunGS on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.
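The out-of-core idea described above can be sketched in a few lines. This is a toy illustration, not the DiskChunGS implementation: it stores plain 3D points instead of Gaussians, keeps chunks in host memory rather than GPU memory, and evicts the least recently used chunk, which is an assumption (the abstract only says that inactive regions are stored on disk). The class name `ChunkStore` and its parameters are hypothetical.

```python
import math
import os
import pickle
import tempfile
from collections import OrderedDict


class ChunkStore:
    """Keeps at most `max_active` spatial chunks in memory; the rest live on disk."""

    def __init__(self, chunk_size, max_active, root):
        if max_active < 1:
            raise ValueError("max_active must be at least 1")
        self.chunk_size = chunk_size
        self.max_active = max_active
        self.root = root
        self.active = OrderedDict()  # chunk key -> list of points, in LRU order

    def key_for(self, x, y, z):
        """Map a position to the integer index of the chunk that contains it."""
        s = self.chunk_size
        return (math.floor(x / s), math.floor(y / s), math.floor(z / s))

    def _path(self, key):
        return os.path.join(self.root, "chunk_%d_%d_%d.pkl" % key)

    def _load(self, key):
        """Return the chunk's point list, reading it from disk if needed."""
        if key in self.active:
            self.active.move_to_end(key)  # mark as most recently used
            return self.active[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                data = pickle.load(f)
        else:
            data = []
        self.active[key] = data
        # Spill the least recently used chunks until the budget is respected.
        while len(self.active) > self.max_active:
            old_key, old_data = self.active.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_data, f)
        return data

    def add_point(self, x, y, z):
        self._load(self.key_for(x, y, z)).append((x, y, z))

    def points_in(self, key):
        return list(self._load(key))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        store = ChunkStore(chunk_size=10.0, max_active=1, root=root)
        store.add_point(1.0, 2.0, 3.0)   # lands in chunk (0, 0, 0)
        store.add_point(25.0, 0.0, 0.0)  # lands in chunk (2, 0, 0), evicting (0, 0, 0)
        print(sorted(os.listdir(root)))
        print(store.points_in((0, 0, 0)))
```

The memory bound depends only on `max_active` and the chunk contents, not on the total map size, which is the property that lets reconstruction scale beyond device memory.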

2603.00167 2026-06-12 cs.RO 版本更新

EgoMoD: Predicting Global Maps of Dynamics from Local Egocentric Observations

EgoMoD:从局部自我中心观测预测全局动态地图

Iacopo Catalano, David Morilla-Cabello, Jorge Pena-Queralta, Eduardo Montijano

发表机构 * University of Turku, Finland(芬兰图尔库大学) Centre for Artificial Intelligence, Zürich University of Applied Sciences, Winterthur, Switzerland(瑞士温特图尔,苏黎世应用科学大学人工智能中心) Instituto de Investigación en Ingeniería de Aragón, Universidad de Zaragoza, Spain(西班牙萨拉戈萨大学阿拉贡工程研究所)

AI总结 提出EgoMoD方法,利用短时自我中心视频和位姿条件架构,学习从局部观测预测全局运动动态地图,替代传统全局感知基础设施,实现零样本迁移。

详情
AI中文摘要

在动态环境中高效导航需要预测机器人即时感知范围之外的运动模式演变,从而在拥挤场景中实现先发制人而非纯粹反应式规划。运动动态地图(MoDs)提供了空间中运动趋势的结构化表示,有助于长期全局规划,但传统上需要长时间全局环境观测来构建。我们提出EgoMoD,这是第一种学习直接从机器人操作期间收集的短时自我中心视频片段预测未来MoDs的方法。我们的方法使用视频和位姿条件架构,以从外部观测计算的MoDs作为特权监督进行训练,从而学习从局部动态线索推断环境范围的运动趋势,使局部观测成为全局运动结构的预测信号。因此,我们能够预测整个环境的未来运动动态,而不仅仅是扩展机器人视野中的过去模式。作为特定地点的动态先验,EgoMoD在推理时用标准车载传感器替代了先前MoD方法所需的外部全局感知基础设施。在大型模拟环境中的实验表明,EgoMoD能在有限可观测性下预测未来MoDs,而使用真实图像的评估展示了其对真实系统的零样本迁移能力。

英文摘要

Efficient navigation in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, enabling preemptive rather than purely reactive planning in crowded scenes. Maps of Dynamics (MoDs) offer a structured representation of motion tendencies in space useful for long-term global planning, but constructing them traditionally requires global environment observations over extended periods of time. We introduce EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. Our method learns to infer environment-wide motion tendencies from local dynamic cues using a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, allowing local observations to serve as predictive signals of global motion structure. Thanks to this, we offer the capacity to forecast future motion dynamics over the whole environment rather than merely extend past patterns in the robot's field of view. As a site-specific dynamic prior, EgoMoD replaces the external global sensing infrastructure required by prior MoD methods at inference time with standard onboard sensors. Experiments in large simulated environments show that EgoMoD predicts future MoDs under limited observability, while evaluation with real images showcases its zero-shot transferability to real systems.

2603.05965 2026-06-12 cs.RO cs.CV 版本更新

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

PROBE: 具有解析平移鲁棒性的概率占用BEV编码用于3D地点识别

Jinseop Lee, Byoungho Lee, Gichul Yoo

发表机构 * SK Intellix

AI总结 提出无学习的LiDAR地点描述符PROBE,通过极坐标雅可比解析边缘化连续平移,实现距离自适应角度不确定性,在跨传感器泛化中取得高精度。

Comments 8 pages, 8 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L). © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses.

详情
AI中文摘要

我们提出PROBE(概率占用BEV编码),一种无学习的LiDAR地点识别描述符,将每个BEV单元的占用建模为伯努利随机变量。PROBE不依赖于离散点云扰动,而是通过极坐标雅可比解析边缘化连续笛卡尔平移,在O(R·S)时间内得到距离自适应角度不确定性σ_θ = σ_t / r。主要参数σ_t表示以米为单位的预期平移不确定性,这是一种与传感器无关的物理量,增强了跨传感器泛化能力,同时减少了对每个数据集大量调参的需求。成对相似性结合了伯努利-KL Jaccard与指数不确定性门控以及基于FFT的高度余弦相似性用于旋转对齐。在涵盖四种不同LiDAR类型的四个数据集上评估,PROBE在多会话评估中实现了手工描述符中最高的精度,并且在单会话性能上与手工和监督基线相比具有竞争力。源代码和补充材料可在 https://sites.google.com/view/probe-pr 获取。

英文摘要

We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R \cdot S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
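The distance-adaptive relation in the abstract is easy to evaluate numerically. The sketch below is not the PROBE code; it only applies the stated formula, angular uncertainty equals translational uncertainty divided by range, to the ring radii of a polar BEV grid. Geometrically, a tangential displacement of sigma_t at range r subtends an angle of roughly sigma_t / r radians. The function name `angular_sigma` and the example radii are assumptions for illustration.

```python
import math


def angular_sigma(sigma_t, radii):
    """Return the angular uncertainty (radians) for each ring radius.

    sigma_t: expected translational uncertainty in meters
    radii:   ring-centre distances from the sensor in meters (all positive)
    """
    if sigma_t < 0:
        raise ValueError("sigma_t must be non-negative")
    result = []
    for r in radii:
        if r <= 0:
            raise ValueError("ring radius must be positive")
        result.append(sigma_t / r)
    return result


if __name__ == "__main__":
    # Hypothetical grid: four rings, with an assumed sigma_t of 0.5 m.
    radii = [5.0, 10.0, 20.0, 40.0]
    for r, s in zip(radii, angular_sigma(0.5, radii)):
        print(f"r = {r:5.1f} m  sigma_theta = {math.degrees(s):.2f} deg")
```

Near-range cells get a wide angular tolerance and far-range cells a narrow one, which is why a single sigma_t in meters can be reused across sensors with different resolutions.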

2507.22028 2026-06-12 cs.CV cs.RO 版本更新

From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

从看见到体验:通过强化学习扩展导航基础模型

Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Coco Robotics(Coco机器人)

AI总结 提出S2E框架,结合离线视频预训练和模拟环境强化学习,通过锚点引导分布匹配和残差注意力模块,提升导航基础模型的交互性和安全性。

Comments 27 pages, 20 figures, 9 tables, conference

详情
AI中文摘要

基于大规模网络数据训练的导航基础模型使智能体能够跨不同环境和具身形态进行泛化。然而,这些仅基于离线数据训练的模型往往缺乏推理其行为后果或通过反事实理解进行适应的能力。因此,它们在现实世界城市导航中面临重大限制,而在这类场景中,交互式的安全行为(如避开障碍物和移动的行人)至关重要。为解决这些挑战,我们引入了从看见到体验(S2E)学习框架,通过强化学习扩展导航基础模型的能力。S2E结合了离线视频预训练和强化学习后训练的优势。它保持了从大规模真实世界视频中获得的模型泛化能力,同时通过模拟环境中的强化学习增强了其交互性。具体而言,我们引入了两项创新:(1)用于离线预训练的锚点引导分布匹配策略,通过基于锚点的监督稳定学习并建模多样化的运动模式;(2)用于强化学习的残差注意力模块,从模拟环境中获得反应性行为,同时不抹除模型的预训练知识。此外,我们建立了一个全面的端到端评估基准NavBench-GS,该基准基于真实世界场景的照片级逼真3D高斯溅射重建,并融入了物理交互。它可以系统评估导航基础模型的泛化能力和安全性。

英文摘要

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning. It maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations: (1) an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and (2) a Residual-Attention Module for reinforcement learning, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models.

2605.31419 2026-06-12 cs.CV cs.RO 版本更新

Triangle Splatting SLAM

三角形泼溅SLAM

Nicholas Fry, Eric Dexheimer, Kirill Mazur, Paul H. J. Kelly, Andrew J. Davison

发表机构 * Software Performance Optimisation Group(软件性能优化组) Department of Computing(计算系)

AI总结 提出首个使用可微三角形作为3D地图表示的密集RGB-D SLAM系统,通过在线可微渲染实现跟踪与建图,并支持实时网格转换与编辑。

Comments 26 pages, 11 figures

详情
AI中文摘要

我们提出了一种密集RGB-D SLAM系统,使用可微三角形作为3D地图表示。虽然3D高斯泼溅已成为新颖视角合成的主要方法,但三角形仍然是传统渲染硬件、游戏引擎以及需要显式几何的下游任务(如模拟、碰撞和编辑)的标准图元。最近的离线方法表明,通过在一组带姿态的图像上进行Delaunay三角剖分,可以将非结构化的“三角形汤”优化为照片级逼真的网格。基于这一见解,我们提出了第一个密集SLAM系统,通过在线可微渲染三角形汤来执行跟踪和建图。地图可以通过受限Delaunay三角剖分实时转换为连通网格,从而实现网格变形和碰撞检测等新的在线功能。在Replica和TUM-RGBD数据集上,我们的系统在3D几何方面优于基线,匹配相机跟踪精度,并支持基于网格的在线场景编辑。

英文摘要

We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.

5. 人机交互与协作机器人 7 篇

2606.12475 2026-06-12 cs.RO 新提交

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

学习辅助:面向隐式人机协作的协作式VLA模型

Leo Xu, Letian Li, Alex Cuellar, Michael Hagenow

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文研究利用视觉-语言-动作(VLA)模型通过模仿学习实现人机协作,发现动作分块策略在隐式协作中存在演示动作泄漏问题,提出推理时引导方法缓解过早辅助行为,并通过用户研究验证其有效性。

详情
AI中文摘要

人机协作(HRC)结合了人类和机器人的互补优势,以提高任务效率。然而,许多现有的协作系统依赖于手工设计的流程,限制了其对新任务的可扩展性和灵活性。在这项工作中,我们展示了通过模仿学习进行端到端训练的模型,特别是视觉-语言-动作(VLA)模型,可以支持协作操作,并刻画了影响其真实世界性能的关键因素。我们评估了两种最先进的模型,并识别了隐式HRC中动作分块策略的一种失败模式,其中演示动作泄漏(即动作块跨越潜在任务转换)可能导致过早的辅助行为。我们发现,这个问题随着执行时域的增长而加剧,并在真实世界的协作VLA系统中出现,例如当机器人试图在人员准备好之前移交工具时。我们提出了一种推理时引导方法,以减轻这些错误的辅助动作,同时保持策略性能。最后,通过一项16名参与者在长时域协作组装任务上的用户研究,我们表明引导能够实现更长的执行时域,同时减轻过早辅助,与短时域策略相比,实现了更快的协作和更少的失败。

英文摘要

Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.

2606.12995 2026-06-12 cs.RO 新提交

GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

GenHOI: 通过模仿生成视频实现接触感知的人形机器人-物体交互,无需任务特定训练

Zhihai Bi, Qiang Zhang, Guoyang Zhao, Jiahang Cao, Xueyin Luo, Yushan Zhang, Jinglan Xu, Ruoyu Geng, Yulin Li, Andrew F. Luo, Jun Ma

发表机构 * The University of Tokyo(东京大学) National University of Singapore(新加坡国立大学) University of California, Los Angeles(加州大学洛杉矶分校) Tsinghua University(清华大学)

AI总结 提出GenHOI框架,通过模仿单个生成视频实现人形机器人零样本执行多种物体交互任务,无需任务特定训练或物理演示数据,利用接触事件和手-物接触区域编码为几何约束优化轨迹。

详情
AI中文摘要

人形机器人-物体交互(HOI)是人形机器人的基本能力,但由于动态平衡与多样物体稳定交互之间的紧密耦合,它仍然具有挑战性。现有方法通常需要耗时的任务特定策略训练或依赖于刚性轨迹回放,这限制了它们适应新颖交互场景的能力。在这项工作中,我们提出了GenHOI,一个简单而有效的框架,通过直接模仿单个生成视频,使人形机器人能够以零样本方式执行多样化的物体交互任务,无需任务特定训练或物理演示数据。GenHOI首先在仿真中重建机器人-物体场景并渲染第一帧图像,该图像与语言指令共同作为条件,用于合成面向任务的交互视频。然后分析生成的视频以识别交互相关的接触事件并估计手-物体接触区域,这些被编码为以物体为中心的几何约束,将视觉交互线索转化为具有物理依据的优化先验。在这些先验的指导下,从视频中恢复的参考运动被细化和平滑,以解决2D视频生成中固有的尺度模糊性,同时将单个参考轨迹适应于未见过的机器人-物体相对姿态。优化后的轨迹最终由闭环跟踪控制器执行。我们在包括箱子抓取、非对称双臂椅子搬运、从下方抬桌子和圆柱物体环抱在内的多样化物体交互任务中,通过大量仿真和真实世界实验验证了所提出的框架。

英文摘要

Humanoid-Object Interaction (HOI) is a fundamental capability for humanoid robots, yet it remains challenging due to the tight coupling between dynamic balance and stable interaction with diverse objects. Existing methods often require time-consuming task-specific policy training or rely on rigid trajectory replay, which limits their ability to accommodate novel interaction scenarios. In this work, we present GenHOI, a simple yet effective framework that enables humanoid robots to perform diverse object-interaction tasks in a zero-shot manner by directly imitating a single generated video, without task-specific training or physical demonstration data. GenHOI first reconstructs the robot-object scene in simulation and renders a first-frame image, which, together with the language command, conditions the synthesis of a task-oriented interaction video. The generated video is then analyzed to identify interaction-relevant contact events and estimate hand-object contact regions, which are encoded as object-centric geometric constraints that convert visual interaction cues into physically grounded optimization priors. Guided by these priors, the reference motion recovered from the video is refined and smoothed to resolve the scale ambiguity inherent in 2D video generation, while adapting a single reference trajectory to unseen robot-object relative poses. The optimized trajectory is finally executed by a closed-loop tracking controller. We validate the proposed framework in extensive simulation and real-world experiments across diverse object-interaction tasks, including box grasping, asymmetric bimanual chair carrying, table lifting from below, and cylindrical-object enveloping.

2606.13190 2026-06-12 cs.RO cs.HC 新提交

Multi-Modal Multi-Agent Robotic Cognitive Alignment enabled by Non-Invasive Consumer Brain Computer Interfaces: A Proof of Concept Exploration

基于非侵入式消费级脑机接口的多模态多智能体机器人认知对齐:概念验证探索

Nataliya Kosmyna, Liz Jenkins, Anoop K. Sinha

发表机构 * GOOGLE(谷歌) Paradigms of Intelligence(智能范式) Cambridge, MA, United States(马萨诸塞州剑桥市,美国) Mountain View, CA, United States(加利福尼亚州山景城,美国)

AI总结 提出一种框架,利用消费级脑机接口监测脑电信号,在高认知负荷时延迟智能体通信,实现认知对齐的多智能体交互,初步验证了实时信号处理、大语言模型与机器人结合的可行性。

Comments 19 pages, 9 figures, for associated video, see https://youtu.be/0Tav-G87XGs

详情
AI中文摘要

尽管非语言行为和表达性动作对于自然的人机交互至关重要,但现有方法常常忽略一个关键要素:人类的内在认知状态。主动式多智能体系统经常在不合时宜的时刻打断人类,导致认知过载和任务性能下降。本文引入了一个生成“认知对齐”多智能体交互的框架,增强了机器人系统在人类高心理工作负荷和高投入度时刻,能够上下文相关地延迟向智能体系统用户发送通信的能力。我们介绍了一种闭环架构的设计与实现,该架构探索了自主任务执行与实时神经生理学专注度之间的相互作用。使用消费级脑机接口(BCI),我们的方法在人类执行投入度诱导任务时持续监测脑电图(EEG)频谱带功率。我们提出了一种基于投入度的流水线,其中基于HTTP的信令机制在检测到高投入度时将主智能体的感官输入和音频输出置于保持状态,从而允许次级智能体在后台无缝处理复杂的委托任务。一旦人类的认知状态恢复到较低的认知负荷基线,主智能体释放排队的智能体消息。我们的初步结果证明了利用实时信号处理、大语言模型(LLMs)和物理机器人实体创建认知感知、不打扰用户的多智能体系统的可行性。

英文摘要

While non-verbal behaviors and expressive movements are essential for natural human-robot interaction, existing methods often overlook a crucial element: the human's internal cognitive state. Frequently, proactive multi-agent systems can interrupt humans at inopportune moments, leading to cognitive overload and decreased task performance. This paper introduces a framework for generating "cognitively aligned" multi-agent interactions, enhancing the ability of robotic systems to contextually defer communications to the user of an agent system during moments of high human mental workload and engagement. We present the design and implementation of a closed-loop architecture that explores the interplay between autonomous task execution and real-time neurophysiological focus. Using a consumer-grade Brain-Computer Interface (BCI), our approach continuously monitors Electroencephalography (EEG) spectral band powers while a human performs an engagement-inducing task. We propose an engagement-driven pipeline where an HTTP-based signaling mechanism places a primary agent's sensory inputs and audio outputs into a holding state upon detecting high engagement. This allows secondary agents to seamlessly process complex, delegated tasks in the background. Once the human's cognitive state returns to a lower cognitive load baseline, the primary agent releases the queued agent message. Our preliminary results demonstrate the feasibility of leveraging real-time signal processing, Large Language Models (LLMs), and physical robotic embodiments to create cognitively-aware, non-intrusive multi-agent systems.
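The hold-and-release behaviour described above can be sketched as a small message gate. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how engagement is computed, so the sketch uses the commonly cited ratio beta / (alpha + theta) of EEG band powers, and it replaces the HTTP-based signaling with direct method calls. The names `engagement_index` and `DeferringGate`, and the threshold value, are hypothetical.

```python
from collections import deque


def engagement_index(theta, alpha, beta):
    """Engagement proxy from EEG band powers (assumed: beta / (alpha + theta))."""
    denom = alpha + theta
    if denom <= 0:
        raise ValueError("alpha + theta must be positive")
    return beta / denom


class DeferringGate:
    """Holds agent messages while engagement is high, releases them when it drops."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.queue = deque()

    def submit(self, message, engagement):
        """Queue a message, then return whatever may be delivered right now."""
        self.queue.append(message)
        return self.update(engagement)

    def update(self, engagement):
        """Return queued messages in order if engagement is below the threshold."""
        if engagement >= self.threshold:
            return []  # user is focused: keep holding
        released = list(self.queue)
        self.queue.clear()
        return released


if __name__ == "__main__":
    gate = DeferringGate(threshold=1.0)  # threshold chosen arbitrarily
    busy = engagement_index(theta=4.0, alpha=4.0, beta=12.0)
    calm = engagement_index(theta=6.0, alpha=6.0, beta=6.0)
    print(gate.submit("report ready", busy))
    print(gate.submit("battery low", busy))
    print(gate.update(calm))
```

In a real system the threshold would need per-user calibration, and urgent messages would likely bypass the queue; neither is modelled here.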

2606.13256 2026-06-12 cs.RO cs.AI 新提交

Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes

幽默风格驱动笑声,话题塑造可接受性:评估双语个人与政治机器人交付的AI笑话

Anna-Maria Velentza, Anne-Gwenn Bosser

发表机构 * Univ Brest-Bretagne INP, COMMEDIA team, Lab-STICC CNRS UMR 6285(布列塔尼大学-INP,COMMEDIA团队,Lab-STICC CNRS UMR 6285)

AI总结 本研究通过混合因素设计,评估机器人用双语讲AI生成笑话时,幽默类型(亲和、自我增强、攻击、自贬)和内容(个人vs政治)对趣味性和适当性的影响,发现幽默类型显著影响趣味性,内容影响适当性,语言偏好受内容及参与者流利度影响。

Comments Accepted in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026), Kitakyushu, Fukuoka, Japan

详情
AI中文摘要

幽默在人类社交关系中扮演核心角色,计算幽默的最新进展为将幽默融入人机交互(HRI)创造了新机会。虽然大型语言模型(LLMs)能生成多种形式的幽默,但在群体环境中,幽默风格、笑话内容和语言偏好如何影响对机器人传递幽默的感知仍不清楚。在这项探索性研究中,我们采用混合因素设计,让参与者在大学教室中评估由机器人传递的AI生成笑话。我们考察了幽默类型(亲和型、自我增强型、攻击型、自贬型)和笑话内容(个人相关vs政治)对感知趣味性和适当性的影响,以及语言偏好。结果表明,幽默类型显著影响趣味性,攻击型和亲和型幽默评分更高;而笑话内容主要影响适当性,个人相关笑话优于政治笑话。语言偏好受笑话内容和参与者自我报告的流利度及幽默实践的影响。

英文摘要

Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrating humor into human-robot interaction (HRI). While large language models (LLMs) can generate diverse forms of humor, it remains unclear how humor style, joke content, and language preference shape perceptions of robot-delivered humor in group settings. In this exploratory study, we employed a mixed factorial design in which participants evaluated AI-generated jokes delivered by a robot in a university classroom. We examined the effects of humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political) on perceived funniness and appropriateness, as well as preferred language. Results show that humor type significantly influences funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, with person-related jokes preferred over political ones. Language preference was shaped by both joke content and participants' self-reported fluency and humor practices.

2606.13340 2026-06-12 cs.RO 新提交

EMG-Based Adaptation of Anisotropic Virtual Fixtures for Robot-Assisted Surgical Resection and Dissection

基于EMG的各向异性虚拟夹具自适应方法用于机器人辅助手术切除与解剖

Dario Onfiani, Michael Dyck, Luigi Biagiotti, Julian Klodmann

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学) German Aerospace Center (DLR)(德国航空航天中心)

AI总结 提出一种基于EMG信号自适应调节各向异性虚拟夹具的框架,通过实时推断外科医生意图动态调整约束,实验证明能提高手术精度和运动一致性,降低认知负荷。

AI中文摘要

本文针对机器人辅助腹腔镜手术中的精细任务(如切除和解剖),开发了一种自适应辅助系统。尽管虚拟夹具在引导外科医生运动方面具有显著优势,但传统虚拟夹具通常由固定几何形状定义,缺乏适应手术流程或外科医生即时意图的灵活性。为解决这些局限性,我们提出了一种自适应各向异性虚拟夹具的新框架。此外,我们引入了一种直观的控制接口,该接口基于从EMG信号推断的外科医生意图,实时调节夹具的几何形状。该方法允许外科医生通过收缩前臂肌肉动态扩展或解除约束,实现精确引导运动和工具自由重新定位之间的无缝切换。基于标准化手术训练任务的初步用户研究实验结果表明了所提方法的有效性。该系统在任务精度和运动一致性方面表现出显著改善,同时降低了感知认知负荷、努力和挫败感。

英文摘要

In this paper, we address the development of an adaptive assistance system for robot-assisted laparoscopic surgery, specifically for delicate tasks such as resection and dissection. Although virtual fixtures offer significant advantages for guiding a surgeon's movements, conventional virtual fixtures are often defined by fixed geometries and lack the flexibility to adapt to the surgical workflow or the surgeon's immediate intent. To address these limitations, we propose a novel framework for an adaptive and anisotropic virtual fixture. In addition, we introduce an intuitive control interface that modulates the fixture's geometry in real time based on the surgeon's intent, inferred from EMG signals. This approach allows the surgeon to dynamically expand or disengage the constraint by contracting their forearm muscles, enabling seamless transitions between precise guided motion and free repositioning of the tool. Experimental results from a pilot user study, based on a standardized surgical training task, demonstrate the effectiveness of the proposed method. The system showed significant improvements in task accuracy and movement consistency, alongside a reduction in perceived cognitive load, effort, and frustration.
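One way to picture an EMG-modulated anisotropic fixture is a spring that is stiff across a reference path and soft along it, with the overall stiffness scaled down as muscle activation rises. The sketch below rests on several assumptions that the abstract does not confirm: a straight reference path, a normalized activation in [0, 1], a linear `1 - activation` scaling, and the hypothetical function name `fixture_force`. It is not the authors' controller.

```python
def fixture_force(pos, anchor, tangent, k_normal, k_tangent, activation):
    """Restoring force of an anisotropic fixture around a straight path.

    pos, anchor: 3D points; tangent: unit vector along the path.
    k_normal / k_tangent: stiffness across / along the path (assumed values).
    activation: normalized EMG activation in [0, 1]; 1 fully disengages.
    """
    scale = 1.0 - min(max(activation, 0.0), 1.0)
    err = [p - a for p, a in zip(pos, anchor)]
    along = sum(e * t for e, t in zip(err, tangent))
    err_tangent = [along * t for t in tangent]
    err_normal = [e - et for e, et in zip(err, err_tangent)]
    return [
        -scale * (k_tangent * et + k_normal * en)
        for et, en in zip(err_tangent, err_normal)
    ]
```

With `k_tangent` set to zero the tool slides freely along the path while deviations across it are pushed back, which is the anisotropy the abstract refers to.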

2606.13435 2026-06-12 cs.RO 新提交

GIVE: Grounding Human Gestures in Vision-Language-Action Models

GIVE:在视觉-语言-动作模型中接地人类手势

Pengfei Liu, Gen Li, Junqiao Fan, Boyu Ma, Jindou Jia, Yang Xiao, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(南洋理工大学MARS实验室)

AI总结 针对VLA模型忽略手势导致意图理解不准的问题,提出GIVE方法,通过视觉和语义双路径增强手势理解,在真实HRI实验中目标识别准确率提升40%,任务成功率提升80%。

Comments Project page: https://luis-cloud-sg.github.io/GIVE-project/

AI中文摘要

人类交流本质上是多模态的,语言通常伴随着非语言线索(如手势)来传达意图。然而,当前的视觉-语言-动作(VLA)模型将机器人操作视为纯文本驱动的任务,忽视了手势在人机交互(HRI)中的重要作用。当语言指令模糊或不明确时,这往往导致意图接地不准确和操作不可靠。为了解决这一挑战,我们提出了GIVE(通过视觉-语义增强的手势意图),一种有效的方法,在不修改架构的情况下,用人类手势理解增强预训练的VLA模型。具体来说,GIVE通过两条互补的路径融入手势信息:一条视觉路径,将手部骨架和指尖射线叠加到机器人观测上,用于显式对象接地;一条语义路径,生成人类手势和任务指令的高级描述,用于鲁棒的意图接地。通过联合利用视觉和语义指导,GIVE使VLA策略能够更好地将手势与操作行为关联,并适应动态交互意图。在真实世界的HRI实验中,GIVE显著优于基线,目标对象识别准确率提升40%,整体任务成功率提升80%,同时展现出对未见空间布局和不同参与者的强大鲁棒性和泛化能力。

英文摘要

Human communication is inherently multimodal, where language is often accompanied by non-verbal cues such as gestures to convey intentions. However, current Vision-Language-Action (VLA) models treat robotic manipulation as a pure text-driven task, overlooking the important role of gestures in Human-Robot Interaction (HRI). This often leads to inaccurate intent grounding and unreliable manipulation when language instructions are ambiguous or underspecified. To address this challenge, we propose GIVE (Gesture Intent via Visual-Semantic Enhancement), an effective approach that enhances pre-trained VLA models with human gesture understanding without architectural modifications. Specifically, GIVE incorporates gesture information through two complementary pathways: a visual pathway that overlays hand skeletons and fingertip rays onto robot observations for explicit object grounding, and a semantic pathway that generates high-level descriptions of human gestures and task instructions for robust intent grounding. By jointly leveraging visual and semantic guidance, GIVE enables VLA policies to better associate gestures with manipulation behaviors and adapt to dynamic interaction intents. In real-world HRI experiments, GIVE substantially outperforms the baseline, improving target object recognition accuracy by 40% and overall task success rate by 80%, while demonstrating strong robustness and generalization to unseen spatial layouts and diverse participants.

2601.22090 2026-06-12 cs.RO 版本更新

ReactEMG Stroke: Healthy-to-Stroke Few-shot Adaptation for sEMG-Based Intent Detection

ReactEMG 中风:基于表面肌电图的意图检测的健康到中风少样本适应

Runsheng Wang, Katelyn Lee, Xinyue Zhu, Lauren Winterbottom, Dawn M. Nilsen, Joel Stein, Matei Ciocarlie

发表机构 * Department of Mechanical Engineering, Columbia University in the City of New York(哥伦比亚大学机械工程系) Department of Computer Science, Columbia University in the City of New York(哥伦比亚大学计算机科学系) Department of Rehabilitation and Regenerative Medicine, Columbia University Irving Medical Center(哥伦比亚大学欧文医学中心康复与再生医学系)

AI总结 提出一种健康到中风的适应流程,利用大规模健康受试者sEMG预训练模型,仅用少量中风患者数据微调,显著提升意图检测准确率和鲁棒性。

AI中文摘要

表面肌电图(sEMG)是一种有前景的控制信号,用于中风后按需辅助手部康复,但从瘫痪肌肉检测意图通常需要长时间、特定于受试者的校准,并且对变异性很脆弱。我们提出了一种健康到中风的适应流程,该流程从在大规模健全受试者sEMG上预训练的模型初始化意图检测器,然后仅使用少量特定于受试者的数据为每个中风参与者进行微调。使用从三位慢性中风患者收集的新数据集,我们比较了适应策略(仅头部调优、参数高效的LoRA适配器和全端到端微调),并在包含现实分布偏移(如会话内漂移、姿势变化和臂带重新定位)的保留测试集上评估。在各种条件下,与相同数据预算下的零样本迁移和仅中风训练相比,健康预训练适应一致地改善了中风意图检测;最佳适应方法将平均转换准确率从0.42提高到0.61,原始准确率从0.69提高到0.78。这些结果表明,迁移可复用的健康域EMG表示可以减少校准负担,同时提高实时中风后意图检测的鲁棒性。我们的项目网站、视频、代码和数据集可在以下网址获取:https://roamlab.github.io/reactemg-stroke/。

英文摘要

Surface electromyography (sEMG) is a promising control signal for assist-as-needed hand rehabilitation after stroke, but detecting intent from paretic muscles often requires lengthy, subject-specific calibration and remains brittle to variability. We propose a healthy-to-stroke adaptation pipeline that initializes an intent detector from a model pretrained on large-scale able-bodied sEMG, then fine-tunes it for each stroke participant using only a small amount of subject-specific data. Using a newly collected dataset from three individuals with chronic stroke, we compare adaptation strategies (head-only tuning, parameter-efficient LoRA adapters, and full end-to-end fine-tuning) and evaluate on held-out test sets that include realistic distribution shifts such as within-session drift, posture changes, and armband repositioning. Across conditions, healthy-pretrained adaptation consistently improves stroke intent detection relative to both zero-shot transfer and stroke-only training under the same data budget; the best adaptation methods improve average transition accuracy from 0.42 to 0.61 and raw accuracy from 0.69 to 0.78. These results suggest that transferring a reusable healthy-domain EMG representation can reduce calibration burden while improving robustness for real-time post-stroke intent detection. Our project website, video, code, and dataset are available at: https://roamlab.github.io/reactemg-stroke/.
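Of the three adaptation strategies compared above, LoRA is the least self-explanatory, so a small sketch may help. It shows the standard low-rank update `W + (alpha / r) * B @ A` on a single weight matrix in plain Python, plus a parameter count against full fine-tuning. The helper names and the toy shapes are assumptions for illustration; the paper's actual model, ranks, and training code are not reproduced here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A.

    w: frozen d_out x d_in matrix; b: d_out x r; a: r x d_in.
    Only a and b would be trained; w stays fixed.
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_parameters(d_out, d_in, r):
    """Trainable parameter counts for full fine-tuning versus a rank-r adapter."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}
```

Initializing `b` to zeros makes the adapted layer start out identical to the pretrained one, which is the usual LoRA convention.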

6. 具身智能与视觉语言动作模型 8 篇

2606.12690 2026-06-12 cs.RO cs.AI 新提交

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

EWAM:一种用于具身智能闭环在线自适应的增强世界动作模型

Xin Zhou, Cong Miao

发表机构 * Astronex Robotics Nanjing University of Information Science and Technology(南京信息工程大学)

AI总结 提出EWAM架构,基于冻结的Cosmos3骨干网络,通过四个轻量级神经层实现零样本在线自适应,无需微调或额外演示数据,显著减少新任务布局的部署数据需求。

AI中文摘要

在本文中,我们提出了增强世界动作模型(EWAM),这是一种基于预训练且完全冻结的Cosmos3骨干网络构建的闭环在线自适应架构。EWAM完全在零样本任务协议下进行评估,其核心目标是减少适应新任务布局所需的额外部署数据量。值得注意的是,所有评估中均未引入额外的任务特定演示集,也未对骨干网络进行微调。其性能提升完全源于由四个插入的轻量级神经层组成的推理时协同推理机制:位于扩散变换器(DiT)中间层的神经经验记忆层提供任务相关的执行上下文;状态预测头之后的神经异常检测层实时监测预测状态与实际状态之间的差异;神经策略路由层根据异常严重程度动态选择直接执行、保守重规划或回滚恢复;神经动作校正层利用执行诊断优化生成的动作块。与简单的特征融合不同,记忆、异常检测和校正模块以可微分的方式深度集成到Cosmos3的前向路径中,仅最终路由决策是离散监督的。

英文摘要

In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.
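The anomaly-then-route step described above can be sketched as a simple rule. In the paper both components are learned neural layers; the Euclidean divergence, the fixed thresholds `replan_at` and `rollback_at`, and the function names below are stand-in assumptions used only to make the three-way decision concrete.

```python
import math


def anomaly_score(predicted_state, actual_state):
    """Euclidean divergence between predicted and observed state vectors."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted_state, actual_state)))


def route(score, replan_at=0.1, rollback_at=0.5):
    """Pick an execution mode from anomaly severity (thresholds are hypothetical)."""
    if score >= rollback_at:
        return "rollback_recovery"
    if score >= replan_at:
        return "conservative_replanning"
    return "direct_execution"
```

A call such as `route(anomaly_score(pred, obs))` would run once per executed action chunk in a closed loop.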

2606.13049 2026-06-12 cs.RO 新提交

Y-BotFrame: An Extensible Embodied Agent Framework for Quadruped Robot Assistants

Y-BotFrame:一种用于四足机器人助手的可扩展具身智能体框架

Luyao Zhang, Ke Li, Yuan Ding, Xulong Zhao, Guo Yu, Chengwei Yan, Fuyu Dong, Jiawei Hu, Di Wang, Nan Luo, Gang Liu, Quan Wang

发表机构 * Xidian University(西安电子科技大学)

AI总结 提出Y-BotFrame框架,集成多模态感知与大语言模型认知核心,将自然语言指令映射为可执行任务单元,实现无遥控器的人机协作,支持模块化扩展。

AI中文摘要

四足机器人能够以高灵活性穿越各种复杂地形。作为高机动性的地面智能平台,它们可以配备导航控制、环境感知和智能交互模块,从而成为各种算法在现实世界中的移动部署平台。本文介绍了Y-BotFrame,一个可扩展的具身平台,它将机器人转变为智能地面助手。Y-BotFrame集成了多模态感知能力,包括语音、视觉和激光雷达,并采用大语言模型作为环境理解、上下文推理和任务规划的认知核心。该系统将用户的自然语言指令映射为机器人可执行的具体任务单元。Y-BotFrame通过语音命令和视觉反馈支持自然交互,无需遥控器即可实现高效的人机协作。凭借高度可扩展的框架,Y-BotFrame支持新功能模块的即插即用集成以及模块化升级和迭代开发,为通用、指令驱动的具身智能体在现实世界中的部署提供了参考实现。补充视频见https://xdei-group.github.io/Y-BotFrame/。

英文摘要

Quadruped robots are capable of traversing a wide range of complex terrains with high flexibility. As highly mobile ground-based intelligent platforms, they can be equipped with modules for navigation control, environmental perception, and intelligent interaction, thereby serving as real-world mobile deployment platforms for various algorithms. In this paper, we introduce Y-BotFrame, an extensible embodied platform that turns a robot into an intelligent ground assistant. Y-BotFrame integrates multimodal perception capabilities, including speech, vision, and LiDAR, and employs a large language model as the cognitive core for environmental understanding, contextual reasoning, and task planning. The system maps user natural-language instructions into executable embodied task units that can be carried out by the robot. Y-BotFrame supports natural interaction through voice commands and visual feedback, removing the need for a remote controller and enabling efficient human-robot collaboration. With a highly extensible framework, Y-BotFrame supports plug-and-play integration of new functional modules as well as modular upgrades and iterative development, offering a reference implementation for the real-world deployment of general-purpose, instruction-driven embodied agents. The supplementary video is available at https://xdei-group.github.io/Y-BotFrame/.

2606.13222 2026-06-12 cs.RO cs.AI 新提交

Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

本体感觉-视觉对应使能人形机器人的自我-他人区分

Yurun Chen, Tianyuan Gao, Yizhong Ge, Shikun Ban, Yizhou Wang, Hongkai Xiong, Wenjun Zeng, Wentao Zhu

发表机构 * Eastern Institute of Technology, Ningbo(宁波东方理工大学) Shanghai Jiao Tong University(上海交通大学) Peking University(北京大学) Carnegie Mellon University(卡内基梅隆大学) East China Normal University(华东师范大学) Ningbo Institute of Digital Twin(宁波数字孪生研究院)

AI总结 提出通过本体感觉与视觉的对应学习自我-他人区分,无需身份标签或运动学模型,并建立预测性自我模型,支持目标到达、碰撞感知运动规划和运动重定向。

Comments 23 pages, 9 figures, 1 supplementary table

AI中文摘要

区分自我与他人是社会智能的前提,然而与人类共享工作空间的人形机器人仍然缺乏这种能力。在这里,我们展示了一个人形机器人可以通过本体感觉-视觉对应学习自我-他人区分,无需任何身份标签或运动学模型。一旦建立,这种区分引导出一个预测性自我模型,该模型将关节配置映射到三维身体占用,捕捉机器人身体如何随动作变化。在涉及人类或形态相同机器人的多智能体场景中,系统可靠地识别自身,学习三维自我模型,并支持下游任务,包括目标到达、碰撞感知运动规划和人类到机器人的运动重定向。这些结果共同勾勒出一条路径,使机器人在共享物理环境中与其他人行动和协调时具备身体自我表征。项目页面:https://euron-zc.github.io/humanoid-self-model/。

英文摘要

Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still lack this ability. Here we show that a humanoid robot can learn self-other distinction from proprioceptive-visual correspondence, without any identity labels or kinematic models. Once established, this distinction bootstraps a predictive self-model that maps joint configurations to three-dimensional body occupancy, capturing how the robot's body changes with action. In multi-agent scenes involving humans or morphologically identical robots, the system reliably identifies itself, learns a 3D self-model, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting. Together, these results outline a route toward bodily self-representation in robots that act and coordinate alongside others in shared physical environments. Project page: https://euron-zc.github.io/humanoid-self-model/.

2606.12497 2026-06-12 cs.LG cs.RO 交叉投稿

$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

$μ$VLA:部分可观测操作中VLA模型的循环记忆研究

Egor Cherepanov, Nikita Kachaev, Daniil Zelezetsky, Aydar Bulatov, Artem Pshenitsyn, Yuri Kuratov, Alexey Skrynnik, Aleksandr I. Panov, Alexey K. Kovalev

发表机构 * CogAI Lab, Moscow, Russia(CogAI实验室,莫斯科,俄罗斯) MIRAI, Moscow, Russia(MIRAI,莫斯科,俄罗斯)

AI总结 针对VLA模型在部分可观测场景中的记忆缺失问题,提出仅通过可学习记忆令牌和随时间截断反向传播(TBPTT)实现最小化循环记忆增强,在MIKASA-Robo上将训练任务成功率从0.42提升至0.84,并在LIBERO上保持全可观测性能。

Comments 34 pages, 20 figures, 9 tables

AI中文摘要

视觉-语言-动作(VLA)模型从当前观测预测未来动作块,这一假设在部分可观测性下失效,因为决策依赖于不再可见的信息。现有的记忆增强VLA同时引入了循环、检索、压缩模块、辅助目标、层次化记忆或特定任务架构变化,因此循环本身的贡献与周围机制纠缠不清。我们提出了一个在强预训练VLA骨干网络中的受控隔离研究。我们的方案通过一小部分可学习的记忆令牌增强Transformer,这些令牌跨时间步传递并通过自注意力更新,使用随时间截断反向传播(TBPTT)进行端到端训练,没有辅助损失和架构变化。我们将其实例化为$μ$VLA,一组由记忆宽度m、TBPTT长度K和记忆更新规则(跨步梯度或分离的EMA)参数化的OpenVLA-OFT变体,使得循环是唯一变化的因素。在MIKASA-Robo上,$μ$VLA在最强设置下将五个训练任务的平均成功率从0.42提高到0.84,并在具有相同记忆结构的保留任务上达到0.23,而无记忆基线为0.07。在需要不同记忆结构的任务上,性能接近基线。在LIBERO上,最强的循环变体达到96.2%的平均成功率,表明在全可观测性下没有性能下降。我们将这些结果解释为对最小化骨干网络循环能力范围的校准,识别了其足够的情况以及需要额外记忆结构的情况。演示和视频可在以下链接找到:https://avanturist322.github.io/mu-vla/。

英文摘要

Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $μ$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $μ$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found in https://avanturist322.github.io/mu-vla/.
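Two of the knobs named above, the detached EMA update rule and the TBPTT length K, can be illustrated without a neural network. The sketch below treats the memory as a plain list of floats; the decay value, the function names, and the windowing convention are assumptions for illustration and are not taken from the paper.

```python
def ema_update(memory, candidate, decay=0.9):
    """Detached EMA rule: blend the carried memory with the new candidate values.

    No gradient flows through this update; decay is a hypothetical setting.
    """
    return [decay * m + (1.0 - decay) * c for m, c in zip(memory, candidate)]


def tbptt_windows(episode_length, k):
    """Split an episode into consecutive windows of at most k timesteps.

    With cross-step gradients, backpropagation would be truncated at each
    window boundary, while the memory values are still carried forward.
    """
    return [(start, min(start + k, episode_length))
            for start in range(0, episode_length, k)]
```

For example, a 7-step episode with K = 3 yields the windows (0, 3), (3, 6), and (6, 7).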

2606.13515 2026-06-12 cs.CV cs.LG cs.RO 交叉投稿

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

MaskWAM:统一掩码提示与预测的世界-动作模型

Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, Ping Tan

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tencent Robotics X(腾讯机器人X实验室) Tsinghua University(清华大学)

AI总结 提出MaskWAM,通过统一掩码输入与预测的混合Transformer架构,解决世界-动作模型的空间瓶颈,提升策略泛化能力,在LIBERO等任务上显著优于基线。

AI中文摘要

世界-动作模型(WAMs)通过视频预测为机器人控制提供了一种有前景的范式。然而,当前的WAMs存在根本性的空间瓶颈:标准文本输入在杂乱场景中引入指代歧义,而非结构化的RGB预测缺乏语义基础,并受任务无关背景的偏差影响。为克服这些限制,我们引入了MaskWAM,一种以对象为中心的世界-动作模型。通过统一的混合Transformer(MoT)将掩码同时作为显式输入和预测进行联合集成,MaskWAM实现了鲁棒的策略泛化。该设计提供两个关键优势:(1)预测未来掩码产生以对象为中心的语义监督,抑制视觉噪声,显著增强甚至标准文本条件的WAMs;(2)将此预测监督与第一帧视觉提示(如目标对象掩码)耦合,建立精确的空间锚点,大幅减少语言歧义。关键在于,由于WAMs本质上是视觉驱动的架构,直接掩码条件化比单独文本提供更强的引导,为操作未见对象建立了精确且鲁棒的范式。在LIBERO、RoboTwin和真实世界任务上的评估表明,MaskWAM在语言清晰和语言模糊任务中均显著优于基线。

英文摘要

World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

2602.04208 2026-06-12 cs.RO cs.AI cs.LG 版本更新

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

SCALE: 基于自不确定性条件自适应观察与执行的视觉-语言-动作模型

Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出SCALE推理策略,利用自不确定性联合调节视觉感知和动作,无需额外训练或验证器,仅单次前向传播,提升VLA模型在模拟和真实环境中的鲁棒性。

Comments ICML 2026 Spotlight. Project page: https://dcahn12.github.io/projects/scale/

AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人控制的一种有前景的范式,测试时缩放(TTS)在增强训练外鲁棒性方面受到关注。然而,现有的VLA TTS方法需要额外训练、验证器和多次前向传播,使其部署不切实际。此外,它们仅干预动作解码,而保持视觉表示固定——在感知模糊的情况下不足,此时重新考虑如何感知与决定做什么同样重要。为解决这些限制,我们提出SCALE,一种简单的推理策略,基于“自不确定性”联合调节视觉感知和动作,受主动推理理论中不确定性驱动探索的启发——无需额外训练、无需验证器,且仅需单次前向传播。SCALE在高不确定性下拓宽感知和动作的探索,而在自信时聚焦于利用——实现在不同条件下的自适应执行。在模拟和真实世界基准上的实验表明,SCALE改进了最先进的VLA模型,并优于现有TTS方法,同时保持单次前向传播的效率。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention as a way to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory. It requires no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.

2510.03896 2026-06-12 cs.CV cs.RO 版本更新

GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert

GAE: 利用可泛化动作专家释放VLM的物理潜力

Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出通用动作专家(GAE),通过稀疏几何接口将VLM的高层意图转化为连续动作轨迹,采用动作预训练-点云微调(APPF)方案解耦动作动力学与几何基础,实现跨视觉域、视角和指令的强泛化。

AI中文摘要

视觉语言模型展示了强大的推理和规划能力,但将这些预测转化为精确的机器人动作仍是一个核心挑战。现有的视觉-语言-动作方法通常将推理和动作生成纠缠在一起,导致泛化能力有限。我们提出了通用动作专家(GAE),一个任务无关的模型,将稀疏几何规划转化为密集的机器人动作。我们的方法引入了一个稀疏几何接口:VLM预测代表高层意图的稀疏3D路点,而GAE将这些路点与实时点云观测一起映射到连续动作轨迹。GAE在一个包含来自仿真和真实世界机器人的15万条轨迹的大规模点云-轨迹数据集上进行预训练。为了进一步提高效率和泛化能力,我们引入了动作预训练-点云微调(APPF)方案,将学习动作动力学与几何基础解耦。预训练后,GAE被冻结并在下游任务中重用,只需对VLM进行轻量级微调以生成稀疏接口。实验表明,我们的方法在多样化的视觉域、相机视角和自然语言指令下实现了强大的性能和泛化能力。

英文摘要

Vision-language models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization. We propose Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud-trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.

2606.01621 2026-06-12 cs.CV cs.RO 版本更新

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Goal2Pixel: 将目标锚定到像素以实现视觉语言导航

Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Nanyang Technological University(南洋理工大学)

AI总结 提出Goal2Pixel范式,通过将连续环境中的视觉语言导航(VLN-CE)重新定义为可导航像素锚定,利用图像平面作为统一空间接口,预测可见导航像素并反投影为3D航点,结合可见性感知关键帧记忆和坐标感知辅助损失,在减少VLM调用次数的同时实现竞争性性能。

Comments 8 pages

详情
AI中文摘要

视觉语言模型(VLM)已成为连续环境中视觉语言导航(VLN-CE)的常见基础。然而,大多数基于VLM的方法将导航视为低级动作预测,这种接口模糊、受限于短视运动基元,且由于重复的VLM查询而效率低下。我们提出Goal2Pixel,一种纯基于像素的范式,将VLN-CE重新定义为可导航像素锚定。Goal2Pixel不预测动作,而是使用图像平面作为VLM推理与机器人运动之间的统一空间接口:模型预测一个对智能体可见的可导航像素,该像素被反投影为3D航点以进行前向导航。对于非前向动作,我们在图像平面上附加辅助指令区域,其中左/右/下区域分别解释为左转、右转和停止。为了实现长程导航,我们提出了一种可见性感知的关键帧记忆,用于紧凑且信息丰富的历史表示。为了将预训练的VLM适应于可导航像素锚定,我们引入了语义嵌入和坐标感知辅助损失。Goal2Pixel在需要比先前方法更少的VLM推理调用的情况下,实现了具有竞争力的最新性能。在R2R-CE Val-Unseen上,它以每集仅7.75次VLM调用达到54.1%的SR和52.5%的SPL,而直接动作预测在32.9%的SR下需要46.62次调用,减少了6倍。同样的趋势在RxR-CE上也成立。项目页面:https://baobao0926.github.io/Goal2Pixel/。

英文摘要

Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a navigable pixel visible to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE. Project page: https://baobao0926.github.io/Goal2Pixel/.
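The pixel interface described above combines two steps: deciding whether a predicted pixel falls in a directive region or in the image proper, and back-projecting an image pixel into a 3D waypoint. The sketch below assumes a standard pinhole camera (x right, y down, z forward), a known depth at the pixel, and a padded layout with a margin on the left, right, and bottom; the layout, the margin size, and the function names are illustrative assumptions, not the paper's implementation.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with known depth to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def decode_prediction(u, v, width, height, margin):
    """Interpret a pixel on the padded canvas (assumed layout).

    The camera image occupies columns [margin, margin + width) and rows
    [0, height); the left, right, and bottom margins are directive regions.
    Returns (action, image_pixel or None).
    """
    if v >= height:
        return ("stop", None)
    if u < margin:
        return ("turn_left", None)
    if u >= margin + width:
        return ("turn_right", None)
    return ("forward", (u - margin, v))
```

A "forward" result would then be passed to `backproject` together with the depth reading at that pixel to obtain the waypoint.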

7. 多机器人与群体系统 6 篇

2606.12614 2026-06-12 cs.RO 新提交

DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

DARRMS——资源受限多智能体系统中动态注意力半径的高效算法

Benjamin Alcorn, Eman Hammad

发表机构 * Texas A&M University(德克萨斯A&M大学)

AI总结 提出DARRMS算法,通过优化注意力半径和决策,在资源受限下降低计算需求,提升协调性和可扩展性。

AI中文摘要

多智能体系统是机器人、网络安全和自动驾驶规划等领域不可或缺的工具。这类系统通常面临计算资源约束,需要高效的轻量级算法。传统决策框架常假设理想条件(如完全可观测性和无限计算能力),这与现实挑战不符。本文提出一种新算法,在不显著牺牲其他性能指标的前提下,降低对计算资源的需求。智能体将可观测性限制在某个注意力半径内,从而有意识地忽略对行动规划可能不必要的环境部分。通过同时优化注意力半径和决策,我们的方法在不确定环境中增强了协调性和可扩展性。通过理论分析和实证验证,我们证明了自适应观测在资源受限系统中提升系统性能并维持稳健决策策略的有效性。

英文摘要

Multi-agent systems are integral tools for domains such as robotics, cybersecurity, and autonomous vehicle planning. Such systems often operate under constrained computational resources, which calls for efficient, lightweight algorithms. Traditional decision-making frameworks often assume ideal conditions, such as full observability and unlimited computational capacity, which do not align with real-world challenges. In this paper, we introduce a new algorithm that reduces the demand on computational resources without a large cost to other performance metrics. Agents limit their observability to an attention radius, which intentionally lets them ignore parts of the environment that are likely unnecessary for action planning. By optimizing both the attention radius and decision-making, our approach enhances coordination and scalability in uncertain environments. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of adaptive observation in improving system performance and maintaining robust decision-making strategies in resource-constrained systems.
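The idea of limiting observation to an attention radius can be sketched as a filter plus a budget-driven choice of radius. The selection heuristic below (take the largest candidate radius whose observation count fits a budget) is a hypothetical stand-in and is not the DARRMS optimization, which the abstract does not detail; the function names are likewise assumptions.

```python
import math


def visible_agents(own_pos, others, radius):
    """Return the positions of other agents inside the attention radius."""
    return [p for p in others if math.dist(own_pos, p) <= radius]


def pick_radius(own_pos, others, candidates, max_observed):
    """Largest candidate radius whose observation count fits the budget.

    Falls back to the smallest candidate when none fits (assumed policy).
    """
    feasible = [
        r for r in sorted(candidates)
        if len(visible_agents(own_pos, others, r)) <= max_observed
    ]
    return feasible[-1] if feasible else min(candidates)
```

Planning cost then scales with the number of agents returned by `visible_agents` rather than with the size of the whole team.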

2606.12640 2026-06-12 cs.LG cs.RO cs.SY eess.SY 交叉投稿

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

个体控制障碍函数引导的扩散模型用于安全离线多智能体强化学习

Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi

发表机构 * Department of Electrical Engineering and Automation, Aalto University(阿尔托大学电气工程与自动化系) School of Computing and Data Science, Xiamen University Malaysia(厦门大学马来西亚分校计算与数据科学学院) Department of Computer Science, University of Toronto(多伦多大学计算机科学系)

AI总结 提出一种将神经个体控制障碍函数嵌入扩散模型的离线多智能体强化学习算法,通过逆动力学恢复控制策略,在保证奖励的同时显著提升轨迹生成的安全性。

Comments Accepted to the 23rd IFAC World Congress, 2026

AI中文摘要

离线强化学习允许直接从数据中学习控制策略而无需在线交互,使其适用于安全关键任务。最近的研究将扩散模型应用于离线强化学习,以利用其建模复杂数据分布的强大能力。然而,现有方法主要关注单智能体设置,多智能体环境中的安全挑战在很大程度上未被探索。在这项工作中,我们提出了一种安全的离线多智能体强化学习算法,该算法将神经个体控制障碍函数嵌入扩散模型中,以增强轨迹生成过程中的安全性,并通过逆动力学恢复控制策略。我们在多种基准上评估了我们的算法,证明了在保持竞争性奖励的同时实现了显著的安全改进。

英文摘要

Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.
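For readers unfamiliar with control barrier functions, the safety condition they encode can be checked on a pair of trajectories as sketched below. This uses a handcrafted pairwise distance barrier and the standard discrete-time condition `h(x_{k+1}) >= (1 - alpha) * h(x_k)`; the paper instead learns neural individual barrier functions and embeds them in the diffusion model, so the barrier choice, `alpha`, and the function names here are assumptions for illustration only.

```python
def barrier(p_i, p_j, d_min):
    """h = squared distance minus d_min squared; h >= 0 means the pair is safe."""
    return sum((a - b) ** 2 for a, b in zip(p_i, p_j)) - d_min ** 2


def cbf_satisfied(traj_i, traj_j, d_min, alpha):
    """Check the discrete-time CBF condition at every step of two trajectories.

    alpha in (0, 1] bounds how quickly the barrier value may decay per step.
    """
    h = [barrier(a, b, d_min) for a, b in zip(traj_i, traj_j)]
    return all(
        h_next >= (1.0 - alpha) * h_now
        for h_now, h_next in zip(h, h[1:])
    )
```

A smaller `alpha` is more conservative: it rejects trajectories in which two agents approach each other quickly even if they have not yet collided.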

2606.13068 2026-06-12 cs.MA cs.RO 交叉投稿

Effects of Social Interactions in Self-Organising Railway Traffic Management

自组织铁路交通管理中社交互动的影响

Fabio Oddi, Federico Naldini, Leo D'Amato, Grégory Marlière, Paola Pellegrini, Vito Trianni

AI总结 研究自组织铁路交通管理中预测邻域范围(horizon)对分布式协调过程的影响,发现短时间范围足够,长范围会损害局部可解性和计算响应性而无全局收益。

AI中文摘要

最近的研究正在探索自组织交通管理作为扩展到复杂现实网络的一种解决方案。在这样的系统中,列车预测其邻域,生成交通计划假设,并通过与邻居的共识达成未来要实施的交通计划。本文研究了该流程中的一个结构参数:预测邻域范围。列车使用该范围来识别与邻居的未来潜在冲突,并建立局部交互拓扑,即要与之协商的列车子集。作为主要设计变量,范围直接决定了社交互动图的大小和密度,而其对局部子问题复杂性和分布式共识动态的影响则代表了需要探索的权衡。通过闭环仿真框架,研究评估了范围变化如何影响整个分散协调过程,从初始冲突检测到分布式调度共识。分析重点在于研究范围选择引入的潜在权衡:平衡局部可解性和计算响应性与安全关键环境中全局调度一致性和可行性的需求。与直觉相反,我们的实证结果表明,短时间范围就足够了,而长时间范围会损害局部可解性和计算响应性,且不会带来全局调度最优性的提升。

英文摘要

Recent research is exploring self-organised traffic management as a solution for scaling to complex real-world networks. In such a system, trains predict their neighbourhood, produce traffic plan hypotheses, and agree via consensus with neighbours on a future traffic plan to be implemented. This paper investigates a structural parameter within this pipeline: the predictive neighbourhood horizon. The horizon is used by trains to identify future potential conflicts with neighbours, and to establish the local interaction topology, that is, the subset of trains to negotiate with. As the primary design variable, the horizon directly determines the size and density of the social interaction graph, whereas its impact on the complexity of local sub-problems and the distributed consensus dynamics represents a trade-off to be explored. Through a closed-loop simulation framework, the study evaluates how variations of the horizon impact the overall decentralised coordination process, from initial conflict detection to distributed schedule consensus. The analysis focuses on the potential trade-off introduced by the horizon choice: balancing local tractability and computational responsiveness with the need for global schedule coherence and feasibility in safety-critical environments. Contrary to intuition, our empirical results indicate that short time horizons suffice, while longer ones compromise local tractability and computational responsiveness with no gain in global schedule optimality.

2509.14210 2026-06-12 cs.RO 版本更新

GLIDE: A Coordinated Aerial-Ground Framework for Search and Rescue in Unknown Environments

GLIDE:未知环境下的空地协同搜索与救援框架

Seth Farrell, Chenghao Li, Henrik I. Christensen

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出GLIDE框架,通过两架无人机与一辆无人地面车协同,实现未知环境中的快速受害者定位和障碍物感知导航,利用角色分离和地形侦察提升救援效率。

详情
AI中文摘要

我们提出了一种空地协同搜索与救援(SAR)框架,将两架无人机(UAV)与一辆无人地面车(UGV)配对,以在未知环境中实现快速受害者定位和障碍物感知导航。我们将该框架命名为引导式长视距集成无人机护航(GLIDE),强调UGV在长视距规划中对UAV引导的依赖。在我们的框架中,目标搜索UAV执行实时机载受害者检测和地理参考,为地面平台提名目标,而地形侦察UAV则在UGV计划路径前方飞行,提供中程可通行性更新。UGV融合空中线索与本地感知,执行时间高效的A*规划,并在信息到达时持续重新规划。此外,我们进行了硬件演示(使用GEM e6高尔夫球车作为UGV和两架X500 UAV),以评估端到端SAR任务性能,并包括模拟消融实验,以独立于检测评估规划栈。实证结果表明,UAV之间的明确角色分离,结合地形侦察和引导规划,在时间关键的SAR任务中改善了到达时间和导航安全性。

英文摘要

We present a cooperative aerial-ground search-and-rescue (SAR) framework that pairs two unmanned aerial vehicles (UAVs) with an unmanned ground vehicle (UGV) to achieve rapid victim localization and obstacle-aware navigation in unknown environments. We dub this framework Guided Long-horizon Integrated Drone Escort (GLIDE), highlighting the UGV's reliance on UAV guidance for long-horizon planning. In our framework, a goal-searching UAV executes real-time onboard victim detection and georeferencing to nominate goals for the ground platform, while a terrain-scouting UAV flies ahead of the UGV's planned route to provide mid-level traversability updates. The UGV fuses aerial cues with local sensing to perform time-efficient A* planning and continuous replanning as information arrives. Additionally, we present a hardware demonstration (using a GEM e6 golf cart as the UGV and two X500 UAVs) to evaluate end-to-end SAR mission performance and include simulation ablations to assess the planning stack in isolation from detection. Empirical results demonstrate that explicit role separation across UAVs, coupled with terrain scouting and guided planning, improves reach time and navigation safety in time-critical SAR missions.
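The replanning loop described in the abstract (plan with A*, then replan whenever scouting information arrives) can be illustrated with a toy grid example. This is only a sketch under assumptions, not the GLIDE planning stack: the 4-connected grid, the unit step cost, the function names `astar` and `drive_with_replanning`, and the `scout_updates` dictionary are all hypothetical.

```python
import heapq


def astar(grid, start, goal):
    """4-connected A* on a grid of 0 (free) / 1 (blocked); returns a path or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > cost[cell]:
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < cost.get(nxt, float("inf")):
                    cost[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None


def drive_with_replanning(grid, start, goal, scout_updates):
    """Follow an A* path, replanning when the (hypothetical) scout reports blocked cells.

    scout_updates maps a step index to a list of cells newly reported as blocked.
    Returns the list of visited cells, or None if the goal becomes unreachable.
    """
    grid = [row[:] for row in grid]  # work on a copy of the map
    pos, step, travelled = start, 0, [start]
    path = astar(grid, pos, goal)
    while path is not None and pos != goal:
        new_blocks = scout_updates.get(step, [])
        for r, c in new_blocks:
            grid[r][c] = 1
        if new_blocks:
            path = astar(grid, pos, goal)  # replan from the current position
            if path is None:
                break
        pos = path[1]
        path = path[1:]
        travelled.append(pos)
        step += 1
    return travelled if pos == goal else None
```

Calling `drive_with_replanning(grid, (0, 0), (0, 2), {0: [(0, 1)]})` on an empty 3x3 grid makes the vehicle detour through the second row, because the direct cell is reported blocked before the first move.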

2602.12024 2026-06-12 cs.RO 版本更新

Adaptive-Horizon Conflict-Based Search for Closed-Loop Multi-Agent Path Finding

自适应视界冲突搜索用于闭环多智能体路径规划

Jiarui Li, Federico Pecora, Runyu Zhang, Gioele Zardini

发表机构 * Laboratory for Information and Decision Systems, Massachusetts Institute of Technology(信息与决策系统实验室,麻省理工学院) Schwarzman College of Computing(施瓦茨曼计算学院)

AI总结 提出ACCBS算法,通过动态调整规划视界和重用约束树,在有限计算预算下快速生成高质量可行解,兼具渐近最优性和扰动适应性。

详情
AI中文摘要

MAPF是自动化仓库和物流中大型机器人编队的核心协调问题。现有方法要么是开环规划器,生成固定轨迹并难以处理扰动,要么是闭环启发式方法,没有可靠性能保证,限制了其在安全关键部署中的使用。本文提出ACCBS,一种基于有限视界CBS变体的闭环算法,具有受MPC中迭代加深启发的视界变化机制。ACCBS根据可用计算预算动态调整规划视界,并重用单个约束树以实现视界之间的无缝过渡。因此,它能在预算增加时快速产生高质量可行解,同时渐近最优,表现出任意时间行为。大量案例研究表明,ACCBS结合了对扰动的灵活性和强性能保证,有效弥合了大规模机器人部署中理论最优性与实际鲁棒性之间的差距。

英文摘要

MAPF is a core coordination problem for large robot fleets in automated warehouses and logistics. Existing approaches are typically either open-loop planners, which generate fixed trajectories and struggle to handle disturbances, or closed-loop heuristics without reliable performance guarantees, limiting their use in safety-critical deployments. This paper presents ACCBS, a closed-loop algorithm built on a finite-horizon variant of CBS with a horizon-changing mechanism inspired by iterative deepening in MPC. ACCBS dynamically adjusts the planning horizon based on the available computational budget, and reuses a single constraint tree to enable seamless transitions between horizons. As a result, it produces high-quality feasible solutions quickly while being asymptotically optimal as the budget increases, exhibiting anytime behavior. Extensive case studies demonstrate that ACCBS combines flexibility to disturbances with strong performance guarantees, effectively bridging the gap between theoretical optimality and practical robustness for large-scale robot deployment.

2509.01630 2026-06-12 cs.LG cs.MA cs.RO cs.SY eess.SY 版本更新

DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization

DiffCoord: 分布式多智能体轨迹优化的可微协调

Bingheng Wang, Yichao Gao, Tianchen Sun, Shanker Ajay, Lin Zhao

发表机构 * Department of Electrical and Computer Engineering, National University of Singapore(新加坡国立大学电子与计算机工程系)

AI总结 提出DiffCoord框架,将截断ADMM-DDP管道的耦合参数通过端到端元学习联合优化,利用智能体神经网络实现任务自适应,并扩展到不同智能体数量。在协作空中运输系统中验证,相比现有方法将每智能体梯度计算时间减少70%。

详情
AI中文摘要

将交替方向乘子法(ADMM)与微分动态规划(DDP)相结合,为分布式多智能体轨迹优化提供了一个可扩展的框架。在实践中,ADMM通常被截断以提高计算效率,这紧密耦合了原本分别控制协调质量和任务性能的参数。在本文中,我们提出了可微协调(DiffCoord),一个统一框架,联合元学习截断ADMM-DDP管道的这些耦合参数。这些参数由智能体神经网络生成以实现任务自适应,并且同构智能体之间共享相同的网络,从而能够扩展到不同数量的智能体。我们通过端到端微分ADMM-DDP管道实现了高效的元学习。值得注意的是,这产生了一个辅助的ADMM-LQR分布式梯度求解器,用于计算和协调关于这些参数的元梯度。该求解器继承了管道的计算结构,使得关键计算结果可以重用,并能够在智能体和轨迹时间线上高效并行化。我们通过协作空中运输系统的数值和物理实验验证了DiffCoord,该系统在狭窄空间中重新配置四旋翼编队以实现安全的六自由度负载操作。它能够鲁棒地适应变化的团队规模和负载动力学,同时与最先进的轨迹梯度方法相比,将每智能体梯度计算时间减少高达70%。

英文摘要

Integrating the Alternating Direction Method of Multipliers (ADMM) with Differential Dynamic Programming (DDP) provides a scalable framework for distributed multi-agent trajectory optimization. In practice, ADMM is typically truncated for computational efficiency, tightly coupling parameters that would otherwise separately govern coordination quality and task performance. In this paper, we propose Differentiable Coordination (DiffCoord), a unified framework that jointly meta-learns these coupled parameters for the truncated ADMM-DDP pipeline. These parameters are generated by agent-wise neural networks for task adaptation, and the same networks are shared among isomorphic agents to enable scalability to varying agent counts. We achieve efficient meta-learning by differentiating the ADMM-DDP pipeline end-to-end. Notably, this yields an auxiliary ADMM-LQR distributed gradient solver that computes and coordinates meta-gradients with respect to these parameters. This solver inherits the computational structure of the pipeline, enabling reuse of key computation results and efficient parallelization over agents and along trajectory horizons. We validate DiffCoord through numerical and physical experiments on a cooperative aerial transport system, where it reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces. It adapts robustly to varying team sizes and load dynamics, while reducing per-agent gradient computation time by up to 70% compared with state-of-the-art trajectory-gradient methods.
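To make the effect of truncating ADMM concrete, the sketch below runs scaled-form consensus ADMM on a deliberately simple problem: each agent i holds a scalar target a_i and the agents must agree on a shared value z minimizing the sum of 0.5 (x_i - a_i)^2. This toy problem, the function name, and the default `rho` are assumptions for illustration; it is not the ADMM-DDP pipeline of the paper and involves no trajectory dynamics.

```python
def truncated_consensus_admm(targets, rho=1.0, iterations=5):
    """Scaled-form consensus ADMM, stopped after a fixed number of iterations.

    Solves min sum_i 0.5 * (x_i - a_i)^2 subject to x_i = z for all i.
    Returns the consensus variable z reached when the loop is cut off.
    """
    n = len(targets)
    z = 0.0
    u = [0.0] * n  # scaled dual variables, one per agent
    for _ in range(iterations):
        # Local step: each agent minimizes its own cost plus the penalty term.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Coordination step: average of the local estimates plus duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z
```

With targets `[0.0, 4.0]` the exact consensus is 2.0; stopping after three iterations gives 1.75, which shows how the iteration budget and the penalty `rho` jointly determine the quality of the agreed value.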

8. 无人车、无人机与移动机器人 4 篇

2606.12859 2026-06-12 cs.RO 新提交

AIR-VLA+: Decoupling Movement and Manipulation via Cascaded Dual-Action Decoders with Asymmetric MoE for Aerial Robots

AIR-VLA+: 通过级联双动作解码器与非对称MoE解耦空中机器人的移动与操作

Jianli Sun, Bin Tian, Qiyao Zhang, Zijian Liu, Yutong Wang, Zhiyong Cui, Bai Li, Yisheng Lv, Yonglin Tian

发表机构 * The Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Automation, Beijing Institute of Technology(北京理工大学自动化学院) College of Automotive and Energy Engineering, Tongji University(同济大学汽车与能源工程学院) School of Transportation Science and Engineering, Beihang University(北京航空航天大学交通科学与工程学院) Information Science, East China Normal University(华东师范大学信息科学)

AI总结 针对空中机器人移动与操作在动作尺度、动力学和控制目标上的显著差异,提出级联双动作解码器与非对称MoE架构,实现解耦协调控制,在AIR-VLA基准上取得48.0平均分,任务完成度提升80.2%。

详情
AI中文摘要

空中操作系统长期以来在端到端控制中遭受表示耦合问题,因为平台级无人机(UAV)移动与末端执行器级机械臂操作在动作尺度、动力学和控制目标上存在显著差异。本文提出AIR-VLA+,一种专为空中操作设计的流匹配动作生成架构,具有级联双动作解码器和非对称特征级混合专家(MoE)。我们构建了级联的操作和移动解码器,使无人机在移动过程中单向观察机械臂的意图以实现工作流协调,同时隔离无人机移动信息反向传播对机械臂操作稳定性的影响。针对空中操作中无人机移动高度依赖高层语义并负责任务状态转换的特点,我们为无人机移动解码器设计了输入特征增强模块,该模块引入隐式视觉抓取投影器以感知夹爪与物体的交互状态,并注入压缩的全局语义特征。在无人机移动解码器内部,我们部署了隐式MoE架构,使不同的移动专家在训练过程中自发地对不同任务阶段表现出能力倾向。通过在特征流形上进行密集软混合计算,无人机移动获得了更强的任务阶段适应性。在标准化AIR-VLA基准上的实验表明,我们的方法以48.0的总体平均分全面超越所有基线。与单头$\pi_{0.5}$策略相比,整体任务完成分数提高了80.2%,有效缓解了复合机器人的异构协调控制冲突。

英文摘要

Aerial manipulation systems have long suffered from representation coupling in end-to-end control, as platform-level Unmanned Aerial Vehicle (UAV) movement and end-effector-level arm manipulation differ substantially in action scale, dynamics, and control objectives. In this paper, we propose AIR-VLA+, a flow matching action generation architecture specifically designed for aerial manipulation, featuring cascaded dual-action decoders and an asymmetric feature-level Mixture of Experts (MoE). We construct cascaded manipulation and movement decoders, allowing the UAV to unidirectionally observe the manipulator's intent during movement to achieve workflow coordination, while isolating the impact of UAV movement information backpropagation on arm manipulation stability. Addressing the characteristic that UAV movement is highly dependent on high-level semantics and responsible for task state transitions in aerial manipulation, we design an input feature enhancement module for the UAV movement decoder. This module introduces an implicit visual grasp projector to perceive the interaction state between the gripper and the object, and injects compressed global semantic features. Within the UAV movement decoder, we deploy an implicit MoE architecture, enabling different movement experts to spontaneously exhibit capacity inclinations for various task stages during training. Through dense soft blending computation on the feature manifold, the UAV movement is endowed with stronger task-stage adaptability. Experiments on the standardized AIR-VLA benchmark demonstrate that our method comprehensively surpasses all baselines with an overall average score of 48.0. The overall task completion score improves by 80.2% compared to the single-head $\pi_{0.5}$ policy, effectively mitigating the heterogeneous coordinated control conflicts of composite robots.

2606.12987 2026-06-12 cs.CV cs.AI cs.LG cs.RO 交叉投稿

Diffusion Transformer World-Action Model for AV Scene Prediction

扩散Transformer世界-动作模型用于自动驾驶场景预测

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

发表机构 * Stanford University(斯坦福大学)

AI总结 提出紧凑潜世界模型,结合扩散Transformer(DiT)预测未来场景,在nuScenes上实现4.8倍更好的KID,并实现动作可控性(转向ρ=0.81)。

Comments 10 pages, 9 figures, 2 tables

详情
AI中文摘要

动作条件世界模型使自动驾驶车辆能够根据自身规划的控制预测未来摄像头场景,从而无需真实世界部署即可进行规划和仿真,但在紧凑、可训练的规模下,未来具有模糊性,且该领域的标准失真度量具有误导性:它们奖励模糊的回归均值而非逼真的预测。我们通过一个紧凑的潜世界模型应对这一问题,该模型给定当前前摄像头潜变量和一系列自我动作,预测未来场景潜变量,由冻结解码器渲染为$256 \times 256$帧,最多提前8秒,在150个保留的nuScenes场景上评估。我们首先基准测试预测位置:在跨越四个表示族的六个冻结编码器中,具有时间上下文的V-JEPA2将转向RMSE比最佳单帧编码器降低40%。然后我们训练一个潜扩散Transformer(DiT),并通过受控诊断识别其所需的四个要素:空间token、$x_0$目标、残差锚定以及与目标不确定性匹配的采样。在Stable-Diffusion-VAE编码-预测-解码流水线中,我们揭示了核心矛盾:失真度量(余弦相似度、SSIM)倾向于模糊均值,掩盖了扩散模型更接近真实帧分布的事实。基于Inception的FID和KID揭示了清晰的感知-失真边界:扩散模型达到KID 0.078,而回归为0.375(好4.8倍),且可部署的训练校准使其无需测试时真实值即可实用。该模型真正具有动作可控性(转向驱动场景位移,Spearman $\rho = 0.81$,而回归为$-0.18$)。我们将有限的单次运动归因于共享当前锚点,并设计了一个紧凑的170万参数“跳跃”模型,恢复完整的真实运动幅度($1.02\times$ GT),而单次模型捕获不到一半。

英文摘要

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.
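The action-controllability figure above is a Spearman rank correlation between steering and scene displacement. As a reminder of what that statistic measures, here is a small self-contained sketch that computes Spearman's rho as the Pearson correlation of average ranks. The function names are illustrative, and the abstract does not say how the authors computed the statistic.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Assumes both inputs have the same length and neither is constant
    (a constant input gives zero rank variance and a division by zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because only ranks matter, any monotonically increasing relation between steering magnitude and displacement yields a value of 1, whether or not the relation is linear.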

2511.11022 2026-06-12 cs.RO 版本更新

Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving

用于验证多智能体协同自动驾驶的微型测试平台

Hyunchul Bae, Eunjae Lee, Jehyeop Han, Minhee Kang, Jaehyeon Kim, Junggeun Seo, Minkyun Noh, Heejin Ahn

发表机构 * School of Electrical Engineering(电气工程学院) School of Mechanical Engineering(机械工程学院) Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 提出CIVAT微型测试平台,集成V2V/V2I通信与ROS2框架,通过基础设施感知和交叉口管理实验验证协同自动驾驶功能。

Comments Accepted by ICRA 2026, 8 pages

详情
AI中文摘要

协同自动驾驶通过实现车辆与智能路侧基础设施之间的实时协作来扩展车辆自主性,仍然是一个具有挑战性但至关重要的问题。然而,现有的测试平台均未采用配备感知、边缘计算和通信能力的智能基础设施。为填补这一空白,我们设计并实现了一个1:15比例的微型测试平台CIVAT,用于验证协同自动驾驶,该平台包括一个缩小的城市地图、配备车载传感器的自动驾驶车辆以及智能基础设施。所提出的测试平台通过共享Wi-Fi和ROS2框架,以发布-订阅模式集成V2V和V2I通信,实现车辆与基础设施之间的信息交换,从而达成协同驾驶功能。作为案例研究,我们通过基于基础设施的感知和交叉口管理实验验证了该系统。

英文摘要

Cooperative autonomous driving, which extends vehicle autonomy by enabling real-time collaboration between vehicles and smart roadside infrastructure, remains a challenging yet essential problem. However, none of the existing testbeds employ smart infrastructure equipped with sensing, edge computing, and communication capabilities. To address this gap, we design and implement a 1:15-scale miniature testbed, CIVAT, for validating cooperative autonomous driving, consisting of a scaled urban map, autonomous vehicles with onboard sensors, and smart infrastructure. The proposed testbed integrates V2V and V2I communication with the publish-subscribe pattern through a shared Wi-Fi and ROS2 framework, enabling information exchange between vehicles and infrastructure to realize cooperative driving functionality. As a case study, we validate the system through infrastructure-based perception and intersection management experiments.
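The publish-subscribe pattern mentioned in the abstract can be shown with a minimal in-process example. This is a plain-Python sketch of the pattern only: it does not use the ROS2 API, and the `Broker` class and the topic names are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy message broker: publishers and subscribers only share topic names."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # Register a callable that receives every message published on topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to all current subscribers of the topic.
        for callback in self._subscribers[topic]:
            callback(message)


broker = Broker()
vehicle_inbox = []

# A vehicle subscribes to detections shared by roadside infrastructure (V2I).
broker.subscribe("infrastructure/detections", vehicle_inbox.append)

# The infrastructure publishes without knowing who is listening.
broker.publish("infrastructure/detections", {"object": "pedestrian", "x": 1.2, "y": 0.4})

# A message on a topic with no subscribers is simply dropped.
broker.publish("vehicle_1/pose", {"x": 0.0, "y": 0.0})
```

The decoupling is the point of the pattern: vehicles and infrastructure nodes can be added or removed without changing the code of the other participants.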

2606.12236 2026-06-12 cs.RO cs.CV 版本更新

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

DrivingAgent: 自动驾驶系统的设计与调度智能体

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机技术研究所) University of California, Merced(加州大学默塞德分校)

AI总结 提出DrivingAgent框架,通过自动化模块开发(设计阶段)和强化学习训练的轻量级LLM实时调度(调度阶段),解决自动驾驶系统集成新模型和满足实时约束的挑战,在nuScenes和Bench2Drive上取得更优速度-精度权衡。

详情
AI中文摘要

许多自动驾驶系统越来越多地整合基础模型以提高泛化能力并处理长尾场景。然而,这一趋势带来了两个关键挑战:(i)设计和集成新模型的手动且劳动密集型过程,以及(ii)缺乏智能、动态的调度机制以满足严格的实时约束。虽然基于大语言模型(LLM)的智能体为自动化提供了有前景的途径,但现有框架并不适合自动驾驶。具体来说,它们未能区分系统设计和实时调度的根本不同需求,将模块视为不透明的黑盒,并且并非为持续运行而设计。为了解决这些局限性,我们提出了DrivingAgent,这是一个针对自动驾驶系统设计和调度双重挑战的新型智能体框架。在设计阶段,DrivingAgent通过解释系统架构、生成代码以及通过超网络训练验证模块来自动化模块开发。在调度阶段,它采用一个通过强化学习训练的轻量级LLM来实时动态编排系统模块,并由一个集成长期存储与带时间戳短期上下文的结构化记忆支持。实验结果表明,DrivingAgent在nuScenes和Bench2Drive基准测试上实现了更优的速度-精度权衡。

英文摘要

Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed--accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

9. 软体机器人与硬件设计 4 篇

2606.13352 2026-06-12 cs.RO 新提交

Low cost, easily manufactured, highly flexible strain and touch sensitive fiber for robotics applications

低成本、易制造、高柔性应变与触觉传感纤维用于机器人应用

Christian Diaz Herrera, Srushti Raste, Simin Liu, Miles Modeste, Jiyang Yin, Katelyn McCall, Yuxing Jared Yao, Roopkamal Chahal, Simon Chidley, Trung Ha, T. David Westmoreland, Sonia Roberts

发表机构 * Wesleyan University(卫斯理大学)

AI总结 提出一种仅用廉价商用部件和工具快速制造的导电纤维,兼具电阻应变传感和电容触觉传感功能,实验验证其在机器人抓取、姿态估计和近场跟踪中的应用。

详情
AI中文摘要

现有的机器人拉伸和触觉传感器通常在材料成本、所需制造设备或制造时间方面至少有一项昂贵。我们提出并实验表征了一种导电纤维,仅使用廉价的商用现成部件(导电线程$0.07/英尺,硅胶管$0.94/英尺)和工具(环形针穿线器$2),可快速制造(20厘米长度2分钟)。我们展示了其作为电阻应变传感器的三种应用:触发气动辅助手指的抓取、感知气动机器人带的位置、以及估计柔性固体的姿态。我们还展示了其作为电容传感器的两种应用:首先,作为触觉传感器触发商业机器人手臂移动;其次,作为近场传感器使机器人手臂跟随移动的手。电容传感器通过编织制成,展示了纤维的高柔性。我们讨论了提高制造可扩展性的方法及其成本权衡。最后,我们展示了一种修复切断纤维的方法。

英文摘要

Existing stretch and touch sensors for robots are generally expensive with respect to at least one of material costs, required manufacturing equipment, or manufacturing time. We present and experimentally characterize a conductive fiber made using only inexpensive commercial off-the-shelf parts (conductive thread at $0.07/ft, silicone tubing at $0.94/ft) and tools (loop-style needle threader at $2), which can be manufactured quickly (20 cm length in 2 minutes). We demonstrate its use as a resistive strain sensor with three applications: triggering a grasp in a pneumatically actuated assistive finger, sensing the pose of a pneumatically actuated robotic strap, and estimating the pose of a flexible solid. We also demonstrate that it can be used as a capacitive sensor with two applications: first, as a touch sensor that triggers a commercial robot arm to move, and second, as a near-field sensor enabling the robot arm to follow a moving hand. The capacitive sensors are knitted, showcasing the high flexibility of the fiber. We discuss methods for improving manufacturing scalability and their cost trade-offs. Finally, we demonstrate a method for repairing a cut fiber.

2503.10919 2026-06-12 cs.RO cs.SY eess.SY nlin.PS 版本更新

Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds

基于绝热谱子流形的数据驱动软体机器人控制

Roshan S. Kaundinya, John Irvin Alora, Jonas G. Matt, Luis A. Pabon, Marco Pavone, George Haller

发表机构 * Institute for Mechanical Systems, ETH Zürich(机械系统研究所,苏黎世联邦理工学院) Autonomous Systems Lab, Stanford University(自主系统实验室,斯坦福大学) Automatic Control Laboratory, ETH Zürich(自动控制实验室,苏黎世联邦理工学院)

AI总结 针对软体机器人在非线性区域控制难题,提出基于绝热谱子流形(aSSM)的模型预测控制策略,通过数据驱动构建低维吸引子流形,实现高精度轨迹跟踪,性能提升达10倍。

Comments 41 pages, 24 figures, IJRR (2026) in press

详情
AI中文摘要

软体机器人的机械复杂性给基于模型的控制带来了重大挑战。具体而言,线性数据驱动模型难以在探索具有显著非线性行为的复杂空间扩展路径上控制软体机器人。为了解释这些非线性,我们基于最新的绝热谱子流形(aSSM)理论开发了一种模型预测控制策略。该理论适用是因为重度阻尼机器人的内部振动衰减速度远快于机器人沿预定路径的期望速度。在这种情况下,低维吸引不变流形(aSSM)从路径发出并承载机器人的主导动力学。借助这一最新理论,我们仅从数据出发设计了一种基于aSSM的模型预测控制方案。我们展示了数据驱动模型在跨不同任务跟踪动态轨迹方面的有效性。我们在软体躯干机器人和基于Cosserat杆的弹性软臂的高保真、高维有限元模型上进行了验证,额外实验确认了即使在存在实验噪声的情况下也具有鲁棒性能。值得注意的是,我们发现五维或六维aSSM简化模型在所有闭环控制任务中的跟踪性能比其他数据驱动建模方法高出最多10倍。

英文摘要

The mechanical complexity of soft robots creates significant challenges for their model-based control. Specifically, linear data-driven models have struggled to control soft robots on complex, spatially extended paths that explore regions with significant nonlinear behavior. To account for these nonlinearities, we develop here a model-predictive control strategy based on the recent theory of adiabatic spectral submanifolds (aSSMs). This theory is applicable because the internal vibrations of heavily overdamped robots decay at a speed that is much faster than the desired speed of the robot along its intended path. In that case, low-dimensional attracting invariant manifolds (aSSMs) emanate from the path and carry the dominant dynamics of the robot. Aided by this recent theory, we devise an aSSM-based model-predictive control scheme purely from data. We demonstrate the effectiveness of our data-driven model in tracking dynamic trajectories across diverse tasks. We validate on high-fidelity, high-dimensional finite-element models of a soft trunk robot and Cosserat-rod-based elastic soft arms, with additional experiments confirming robust performance even in the presence of experimental noise. Notably, we find that five- or six-dimensional aSSM-reduced models outperform the tracking performance of other data-driven modeling methods by a factor up to 10 across all closed-loop control tasks.

2508.12681 2026-06-12 cs.RO cs.LG cs.SY eess.SY 版本更新

Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory

基于Cosserat杆理论物理信息神经网络的软体连续机器人自适应模型预测控制

Johann Licher, Max Bartholdt, Henrik Krauss, Tim-Lukas Habich, Thomas Seel, Moritz Schappler

发表机构 * Institute of Mechatronic Systems, Leibniz University Hannover(机电一体化系统研究所,汉诺威莱布尼茨大学) Department of Advanced Interdisciplinary Studies, The University of Tokyo(先进跨学科研究部,东京大学) Institute of Assembly Technology and Robotics, Leibniz University of Hannover(组装技术与机器人研究所,汉诺威莱布尼茨大学)

AI总结 提出一种基于域解耦物理信息神经网络(DD-PINN)的实时非线性模型预测控制框架,实现软体连续机器人的高精度动态控制,位置误差低于3 mm。

Comments Submitted to IEEE Transactions on Robotics, 20 pages, 14 figures

详情
AI中文摘要

软体连续机器人(SCR)的动态控制对其应用扩展具有巨大潜力,但由于精确动态模型的高计算需求,仍然是一个具有挑战性的问题。虽然已经提出了如Koopman算子方法等数据驱动方法,但它们通常缺乏自适应性,且无法重建完整的机器人形状,限制了其适用性。本文介绍了一种基于具有自适应弯曲刚度的域解耦物理信息神经网络(DD-PINN)的实时非线性模型预测控制(MPC)框架。DD-PINN作为动态Cosserat杆模型的替代模型,加速比高达44,000倍。它还被用于无迹卡尔曼滤波器中,从末端执行器位置测量中估计模型状态和弯曲柔度。我们在GPU上实现了一个以70 Hz运行的非线性进化MPC。在仿真中,它展示了动态轨迹的精确跟踪和设定点控制,末端执行器位置误差低于3 mm(执行器长度的2.3%)。在实际实验中,控制器实现了类似的精度和高达3.55 m/s²的加速度。

英文摘要

Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot reconstruct the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of up to 44,000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3% of the actuator's length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s².

2511.18322 2026-06-12 cs.RO cs.CV cs.LG 版本更新

Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video

从视频中学习软体连续体机器人的视觉可解释振荡器网络

Henrik Krauss, Johann Licher, Naoya Takeishi, Annika Raatz, Takehisa Yairi

发表机构 * Department of Advanced Interdisciplinary Studies, The University of Tokyo(东京大学先进跨学科研究系) Institute of Assembly Technology and Robotics, Leibniz University Hannover(汉诺威莱布尼茨大学装配技术与机器人研究所) Research Center for Advanced Science and Technology, The University of Tokyo(东京大学先进科学研究中心)

AI总结 提出注意力广播解码器(ABCD)和视觉振荡器网络(VONs),实现从视频中学习软体连续体机器人动力学的视觉和机械可解释性,多步预测误差降低5.8倍。

Comments Code available at: https://github.com/UThenrik/visual_oscillators_for_SCR Dataset available at: https://zenodo.org/records/17812071 Video available at: https://youtu.be/i80H8erVISM

详情
AI中文摘要

从视频中学习软体连续体机器人(SCR)动力学提供了灵活性,但现有方法缺乏可解释性或依赖先验假设。基于模型的方法需要先验知识和手动设计。我们通过引入以下内容来弥补这一差距:(1)注意力广播解码器(ABCD),一种用于基于自编码器的潜在动力学学习的即插即用模块,生成像素级注意力图,定位每个潜在维度的贡献,同时过滤静态背景,通过空间接地潜在变量和图像叠加实现视觉可解释性。(2)视觉振荡器网络(VONs),一种二维潜在振荡器网络,与ABCD注意力图耦合,用于学习到的质量、耦合刚度和力的图像可视化,从而实现机械可解释性。我们在单段和双段SCR上验证了我们的方法,表明基于ABCD的模型显著提高了多步预测精度,在双段机器人上,Koopman算子的误差降低了5.8倍,振荡器网络的误差降低了3.5倍。VONs自主发现了振荡器的链式结构。这种完全数据驱动的方法产生了紧凑、机械可解释的模型,对未来的控制应用具有潜在意义。

英文摘要

Learning soft continuum robot (SCR) dynamics from video offers flexibility but existing methods lack interpretability or rely on prior assumptions. Model-based approaches require prior knowledge and manual design. We bridge this gap by introducing: (1) The Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension's contribution while filtering static backgrounds, enabling visual interpretability via spatially grounded latents and on-image overlays. (2) Visual Oscillator Networks (VONs), a 2D latent oscillator network coupled to ABCD attention maps for on-image visualization of learned masses, coupling stiffness, and forces, thereby enabling mechanical interpretability. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy with 5.8x error reduction for Koopman operators and 3.5x for oscillator networks on a two-segment robot. VONs autonomously discover a chain structure of oscillators. This fully data-driven approach yields compact, mechanically interpretable models with potential relevance for future control applications.

10. 仿真、数据集与评测 7 篇

2606.12936 2026-06-12 cs.RO cs.AI 新提交

An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics

面向湿实验室机器人的具身仿真平台、基准测试及数据高效增强框架

Zhe Liu, Huanbo Jin, Zhaohui Du, Zhe Wang, He Xu, Peijia Li, Jiaming Gu, Quan Lu, Qi Wang, Bin Ji, Ting Xiao

发表机构 * Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education(能源化工过程智能制造教育部重点实验室) Department of Computer Science and Engineering(计算机科学与工程系) Department of Laboratory Medicine(实验室医学系) Shanghai Jiao Tong University School of Medicine(上海交通大学医学院)

AI总结 提出Pipette平台,包含可编辑资产、仿真数据增强管道和11任务基准测试,将30次演示的VLA成功率从44.1%提升至74.7%。

Comments 25 pages, 17 figures

详情
AI中文摘要

湿实验室机器人可以提高生物医学实验的可重复性、通量和安全性,但扩展其学习需要可定制的模拟器以进行安全和可重复的任务生成、开放的可编辑实验室资产,以及将有限演示转化为可用训练数据的高效管道。我们提出了Pipette,一个用于湿实验室机器人学习的具身仿真平台、基准测试和数据高效增强框架。Pipette发布了超过43个开源且可重新编辑的湿实验室资产,以及一个可扩展的资产构建管道。Pipette的一个关键组件是其基于仿真的数据增强管道,在仿真中重放人类演示,应用光照、相机、速度和动作扰动,并通过自动任务成功检查过滤生成的片段,从有限的手动演示中快速扩展可用的训练数据。我们进一步引入了一个包含11个任务的湿实验室具身基准测试,涵盖样本处理、培养器具操作、设备操作和精确放置。每个任务仅需30次演示,ACT实现了65.5%的平均成功率,而仿真增强将SmolVLA从44.1%提升至74.7%,将π0从40.4%提升至46.5%,验证了Pipette在数据高效的VLA训练和评估中的有效性。Pipette还支持自然语言驱动的场景构建和任务注册,降低了非专家用户定义新湿实验室机器人任务的门槛。

英文摘要

Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires customizable simulators for safe and reproducible task generation, open editable laboratory assets, and efficient pipelines that turn limited demonstrations into usable training data. We present Pipette, an embodied simulation platform, benchmark, and data-efficient augmentation framework for wet-lab robot learning. Pipette releases over 43 open-source and re-editable wet-lab assets, together with an extensible asset-building pipeline. A key component of Pipette is its simulation-based data augmentation pipeline, which replays human demonstrations in simulation, applies lighting, camera, speed, and action perturbations, and filters generated episodes with automatic task success checks, rapidly expanding usable training data from limited manual demonstrations. We further introduce an 11-task wet-lab embodied benchmark covering sample handling, culture-ware manipulation, device operation, and precision placement. With only 30 demonstrations per task, ACT achieves a 65.5% average success rate, while simulation augmentation improves SmolVLA from 44.1% to 74.7% and π0 from 40.4% to 46.5%, validating the effectiveness of Pipette for data-efficient VLA training and evaluation. Pipette also supports natural-language-driven scene construction and task registration, lowering the barrier for non-expert users to define new wet-lab robotic tasks.
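The replay, perturb, and filter loop described in the abstract can be sketched as follows. This is a schematic under assumptions, not Pipette's implementation: the demonstration format (a dict with an `actions` list), the perturbation ranges, and the `replay` callback that stands in for a simulator rollout plus success check are all hypothetical.

```python
import random


def perturb(demo, rng, speed_range=(0.8, 1.2), action_noise=0.01):
    """Create one perturbed variant of a demonstration (hypothetical format)."""
    return {
        # Small Gaussian noise on every recorded action value.
        "actions": [a + rng.gauss(0.0, action_noise) for a in demo["actions"]],
        # Playback speed multiplier for the replay.
        "speed": rng.uniform(*speed_range),
        # Lighting intensity multiplier for the rendered scene.
        "light": rng.uniform(0.5, 1.5),
    }


def augment(demos, replay, n_variants, seed=0):
    """Generate variants of each demo and keep only those that still succeed.

    replay(candidate) is assumed to roll the candidate out in simulation and
    return True when the automatic task success check passes.
    """
    rng = random.Random(seed)
    kept = []
    for demo in demos:
        for _ in range(n_variants):
            candidate = perturb(demo, rng)
            if replay(candidate):
                kept.append(candidate)
    return kept
```

The success filter is what keeps the perturbations safe to use: a variant whose noise or speed change breaks the task never reaches the training set.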

2606.13028 2026-06-12 cs.RO cs.CV 新提交

Comparing Commercial Depth Sensor Accuracy for Medical Applications

面向医疗应用的商用深度传感器精度比较

Pit Henrich, Maximilian Weiherer, Franziska Hansen, Bernhard Egger, Franziska Mathis-Ullrich

AI总结 本文在猪骨、猪肚和硅胶肾模型上,以触针采样为参考,比较了涵盖立体视觉、结构光和飞行时间三类技术的四款深度传感器在约50cm距离下的精度,发现Zivid 2M+ 60在所有物体和指标上表现最佳。

Comments 4 Pages

详情
AI中文摘要

深度估计在医疗和外科手术中有众多应用。我们使用触针采样的参考数据,在猪骨标本、猪肚标本和硅胶肾脏模型上对四种深度传感器进行了基准测试。这些物体包含多个现实挑战,包括均匀表面、镜面反射表面和次表面散射。比较包括距离约50厘米处的立体视觉、结构光和飞行时间传感器。具体而言,比较了Intel RealSense D405(美国Intel RealSense)、PMD Flexx2(德国pmdtechnologies)、Stereolabs ZED 2i(法国Stereolabs)和Zivid 2M+ 60(挪威Zivid)。在本研究考虑的所有物体和指标中,Zivid 2M+ 60表现最佳。ZED在真实组织上排名第二,但在模型上排名最后。

英文摘要

Depth estimation has numerous medical and surgical applications. We benchmark four depth sensors on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom using stylus-sampled references. These objects contain several real-world challenges, including homogeneous surfaces, specular surfaces, and subsurface scattering. The comparison includes stereo, structured-light, and time-of-flight sensors at a distance of approximately 50 cm. Specifically, the Intel RealSense D405 (Intel RealSense, United States), PMD Flexx2 (pmdtechnologies, Germany), Stereolabs ZED 2i (Stereolabs, France), and Zivid 2M+ 60 (Zivid, Norway) are compared. The Zivid 2M+ 60 performed best across all objects and metrics considered in this work. The ZED ranked second for real tissue, but last on the phantom.

2606.13040 2026-06-12 cs.RO 新提交

RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

RoboProcessBench:视觉语言机器人操作中的过程感知理解基准测试

Dayu Xia, Yue Shi, Yao Mu, Huiting Ji, Chaofan Ma, Yingjie Zhou, Hua Chen, Yang Liu, Jiezhang Cao, Guangtao Zhai

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Zhejiang University(浙江大学) Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学) China University of Mining Technology(中国矿业大学)

AI总结 提出RoboProcessBench基准,通过静态监控和动态推理两个维度、12个诊断问题家族,评估视觉语言模型在机器人操作中的过程感知理解能力,并基于58k问答对数据集验证了当前模型的局限性及后训练的有效性。

详情
AI中文摘要

视觉语言模型(VLM)正越来越多地被探索作为机器人操作中的视觉评判者、奖励生成器和故障检测器。这些角色隐含地要求模型不仅判断最终任务成功与否,还要判断操作执行在物理和时间上的进展。然而,现有评估未能测试VLM是否具备细粒度的过程理解。为填补这一空白,我们提出了RoboProcessBench,一个用于视觉语言机器人操作中过程感知理解的基准测试。RoboProcessBench将这种能力分解为两个互补维度:静态监控和动态推理,具体化为12个诊断问题家族,涵盖阶段、接触、运动、协调、原始局部进展、时间顺序、结果和原始级转换。基于物理基础的执行轨迹,构建的基准语料库ProcessData包含约58k个问答对,涵盖260个操作任务,进一步分为ProcessData-SFT和ProcessData-Eval,分别用于后训练和评估。对ProcessData-Eval上各种VLM的广泛评估揭示了12个诊断任务家族的普遍局限性,表明当前模型仍缺乏对操作执行的鲁棒过程感知理解。但通过ProcessData-SFT,后训练的Qwen2.5-VL-7B和InternVL-3-8B在局部状态、运动、进展和原始级线索上表现出持续改进。这些结果表明,RoboProcessBench既可作为评估基准,也可作为可学习的监督源,用于开发能够监控和评估机器人操作过程的VLM。项目网页:https://processbench-2026.github.io/RoboProcessBench-Web/

英文摘要

Vision-language models (VLMs) are increasingly explored as visual critics, reward generators, and failure detectors in robotic manipulation. These roles implicitly require models to judge not only final task success, but also how a manipulation execution is physically and temporally progressing. However, existing evaluations fail to test whether VLMs possess fine-grained process understanding. To address this gap, we present RoboProcessBench, a benchmark for process-aware understanding in vision-language robotic manipulation. RoboProcessBench decomposes such capability into two complementary dimensions, static monitoring and dynamic reasoning, instantiated as 12 diagnostic question families covering phase, contact, motion, coordination, primitive-local progress, temporal order, outcome, and primitive-level transitions. Built from physically grounded execution traces, the curated benchmark corpus ProcessData contains ~58k question-answer pairs across 260 manipulation tasks, which is further split into ProcessData-SFT and ProcessData-Eval for post-training and evaluation purposes. Extensive evaluation of various VLMs on ProcessData-Eval reveals broad limitations across 12 diagnostic task families, suggesting current models still lack robust process-aware understanding of manipulation executions. But with ProcessData-SFT, the post-trained Qwen2.5-VL-7B and InternVL-3-8B exhibit consistent gains on local state, motion, progress, and primitive-aware cues. These results demonstrate that RoboProcessBench serves as both an evaluation benchmark and a learnable supervision source for developing VLMs capable of monitoring and evaluating robotic manipulation processes. Project webpage: https://processbench-2026.github.io/RoboProcessBench-Web/

2606.13497 2026-06-12 cs.RO cs.CV 新提交

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

SPARC:从机器人演示中大规模获取可靠的空间标注

Nils Blank, Paul Mattes, Maximilian Xiling Li, Jakub Suliga, Thomas Roth, Moritz Reuss, Pankhuri Vanjani, Rudolf Lioutikov

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) NVIDIA(英伟达) Robotics Institute Germany(德国机器人研究所)

AI总结 提出SPARC框架,利用机器人任务的时空结构生成可靠性评分,自动标注演示中的空间信息,减少噪声标签并保留更多有用样本,在物体定位基准上优于纯检测基线。

详情
AI中文摘要

本文介绍了SPARC(Spatial Annotations from Robot Demonstrations with Reliability Calibration,带可靠性校准的机器人演示空间标注),这是一个风险感知框架,能够自动为机器人演示标注结构化的空间信息,并为每个标注分配可靠性评分。结构化的空间标注,如边界框、物体轨迹和操作阶段标签,有益于广泛的机器人应用,从训练具备视觉定位能力的机器人策略和具身基础模型,到运动规划和层次化任务组合。现有的自动化流水线可以大规模生成此类标注,但无法提供可靠的质量信号:检测器置信度对于标注正确性的校准不佳,迫使人们在接受噪声标签或丢弃有用样本之间做出选择。与现有的自动化流水线不同,SPARC利用机器人任务固有的时空结构生成可靠性信号,减少噪声标签并保留更多有用样本。我们进一步引入了交互感知基准(IA-Bench),用于衡量模型在机器人演示中定位被交互物体位置的准确性。在涵盖多种机器人形态和场景的1.7k个人工标注演示上,SPARC在定位准确性上显著优于纯检测基线,同时在高精度操作点保留了三倍数量的样本。我们的实验表明,基于我们的标注微调的模型在物体定位和指向基准上取得了同等规模模型中的最先进结果,同时在更广泛的空间推理套件上保持竞争力,且训练数据无需人工验证或标注。此外,基于SPARC生成的标注训练的策略在杂乱、视觉模糊的真实场景中优于基线。代码、数据和模型可从 intuitive-robots.github.io/sparc-labeling 获取。

英文摘要

This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as bounding boxes, object trajectories, and manipulation phase labels, benefit a broad range of robotics applications from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition. Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels or discarding useful samples. In contrast to existing automated pipelines, SPARC leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels and retaining more useful samples. We further introduce Interaction-Aware Bench (IA-Bench), a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy while retaining three times more samples at high-precision operating points. Our experiments demonstrate that models finetuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations outperform baselines in cluttered, visually ambiguous real-world scenes. Code, data, and models are available at intuitive-robots.github.io/sparc-labeling.

2606.13092 2026-06-12 cs.LG cs.RO math.DS 交叉投稿

Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models

规模换来插值,结构换来预测时域:等变世界模型的可认证可预测性

Hongbo Wang

AI总结 针对等变潜在世界模型,提出可计算的多步可预测时域证书,证明T步滚动误差在每个对称轨道上恒定,并由李雅普诺夫谱逐通道分层界定,且该证书为等变模型所独有。

Comments 23 pages (9 main + appendices). Code: https://github.com/TimothyWang418/se3-ejepa

详情
AI中文摘要

规模换来插值;结构换来可认证的预测时域。世界模型的平均误差无法说明某个具体预测是否可信,或可信多久。对于等变潜在世界模型,我们给出可计算的多步可预测时域证书:$T$步滚动误差在每个对称轨道上可证明地恒定(定理A),并由预测器的李雅普诺夫谱逐通道分层,$T_j(\epsilon)\sim\log(1/\epsilon)/\lambda_j$。该时域是双侧的:匹配的下界使近似等变性可证明地受时域限制;且该证书为结构所独有:轨道恒定误差刻画了等变性,因此任何非等变模型无论规模多大都不具备。实验上,在40维Lorenz-96上,只有$\mathbb{Z}_N$等变网络恢复了完整的李雅普诺夫谱($R^2=0.98$);稠密和循环基线均失败。由于谱是忠实的,证书可先验地发挥作用:在固定传感预算下,膨胀$c$倍的证书可证明地需要$c$倍预算,而等变证书能满足其膨胀的稠密对应物无法满足的预算,且无需任何校准数据。同一读出不加修改即可免训练地审计公开的预训练世界模型:TD-MPC2检查点落在证书自身的适用范围分类上,即在强扩张处校准良好(比率0.94-1.02),在弱扩张处偏乐观,在收缩处正确弃权;部署的监控器可在样本外逐单元复现这一映射。在官方1M-317M多任务模型阶梯上,校准并不随参数量增加而改善。在V-JEPA 2-AC(1B,真实机器人数据)上,实测交叉检验正确地否决了过度承诺的切空间谱;可部署的对象是经过交叉验证的审计,而非原始数值。规模换来的是插值,而非校准的预测时域。

英文摘要

Scale buys interpolation; structure buys a certified horizon. A world model's average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi-step certificate of the predictable horizon: $T$-step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel-by-channel by the predictor's Lyapunov spectrum, $T_j(ε)\sim\log(1/ε)/λ_j$. The horizon is two-sided -- a matching lower bound makes approximate equivariance provably horizon-limited -- and the certificate is exclusive to structure: orbit-constant error characterizes equivariance, so no non-equivariant model has it at any scale. Empirically, on 40-D Lorenz-96 only a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2{=}0.98$); dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget a $c\times$-inflated certificate provably needs $c\times$ the budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot -- with zero calibration data. The same read-out, unchanged, audits public pretrained world models training-free: TD-MPC2 checkpoints land on the certificate's own scope taxonomy -- calibrated where strongly expansive (ratio 0.94-1.02), optimistic where weakly expansive, correctly abstaining where contracting -- a map a deployed monitor replicates cell-by-cell, out-of-sample. Across the official 1M-317M multitask ladder, calibration does not improve with parameters. On V-JEPA 2-AC (1B, real robot data) the measured cross-check correctly overrides an over-promising tangent spectrum -- the cross-validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.
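The scaling $T_j(\epsilon)\sim\log(1/\epsilon)/\lambda_j$ quoted above can be read as a per-channel budget. The following is a minimal sketch, not the paper's certificate: it assumes an initial error `eps` grows purely exponentially at rate `lambda_j` until it reaches an order-one `tolerance`, and the function names and the example spectrum are hypothetical.

```python
import math


def predictable_horizon(eps, lyapunov_exponent, tolerance=1.0):
    """Time until an error eps grows to `tolerance` under eps * exp(lambda * t).

    Assumption: pure exponential growth. Contracting or neutral channels
    (lambda <= 0) never reach the tolerance, so their horizon is infinite.
    """
    if eps <= 0 or tolerance <= 0:
        raise ValueError("eps and tolerance must be positive")
    if eps >= tolerance:
        return 0.0  # already outside the tolerated error
    if lyapunov_exponent <= 0:
        return math.inf
    return math.log(tolerance / eps) / lyapunov_exponent


def channel_horizons(eps, spectrum, tolerance=1.0):
    """Horizon for every channel of a (hypothetical) Lyapunov spectrum."""
    return [predictable_horizon(eps, lam, tolerance) for lam in spectrum]


if __name__ == "__main__":
    # Illustrative spectrum only; these values are not taken from the paper.
    for lam, horizon in zip([1.0, 0.5, -0.2], channel_horizons(1e-3, [1.0, 0.5, -0.2])):
        print(f"lambda={lam:+.1f} -> horizon={horizon:.2f}")
```

Halving an exponent doubles that channel's horizon, while shrinking `eps` by a factor of ten adds only `log(10)/lambda` to it, which is why the horizon is bounded by the spectrum rather than by measurement precision alone.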

2511.17221 2026-06-12 cs.CV cs.RO 版本更新

QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

QueryOcc:基于查询的3D语义占据自监督方法

Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand

发表机构 * Chalmers University of Technology(查尔姆斯理工大学) Zenseact

AI总结 提出QueryOcc,一种基于查询的自监督框架,通过相邻帧的4D时空查询直接学习连续3D语义占据,利用视觉基础模型或激光雷达数据提供监督,并引入收缩场景表示以在恒定内存下实现远程监督,在Occ3D-nuScenes基准上语义RayIoU提升26%。

详情
AI中文摘要

从图像学习3D场景几何和语义是计算机视觉的核心挑战,也是自动驾驶的关键能力。由于大规模3D标注成本过高,近期研究探索直接从传感器数据中进行自监督学习,无需人工标签。现有方法要么依赖2D渲染一致性(3D结构仅隐式出现),要么依赖来自累积激光雷达点云的离散化体素网格,限制了空间精度和可扩展性。我们提出QueryOcc,一种基于查询的自监督框架,通过跨相邻帧采样的独立4D时空查询直接学习连续3D语义占据。该框架支持来自视觉基础模型导出的伪点云或原始激光雷达数据的监督。为了实现恒定内存下的远程监督和推理,我们引入了一种收缩场景表示,在平滑压缩远处区域的同时保留近场细节。QueryOcc在自监督Occ3D-nuScenes基准上以11.6 FPS运行,语义RayIoU比之前的基于相机的方法提升26%,表明直接4D查询监督能够实现强大的自监督占据学习。

英文摘要

Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
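The abstract does not say which contraction QueryOcc uses. As a hedged illustration of what "preserves near-field detail while smoothly compressing distant regions" can mean, the sketch below uses a radial contraction in the style of mip-NeRF 360 as a stand-in: points inside `near_radius` are left untouched, and everything beyond is squeezed into a shell of outer radius `2 * near_radius`. The function name and the radius parameter are assumptions.

```python
import math


def contract(point, near_radius=1.0):
    """Map an unbounded 3D point into a ball of radius 2 * near_radius.

    Stand-in contraction (mip-NeRF 360 style), not necessarily the one
    used by QueryOcc: identity inside near_radius, and radius
    near_radius * (2 - near_radius / r) for points at distance r outside.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= near_radius:
        return tuple(float(c) for c in point)
    scale = near_radius * (2.0 - near_radius / norm) / norm
    return tuple(c * scale for c in point)


if __name__ == "__main__":
    for p in [(0.5, 0.0, 0.0), (4.0, 0.0, 0.0), (1000.0, 1000.0, 1000.0)]:
        print(p, "->", contract(p))
```

Because the output is bounded, a fixed-size grid or feature volume over the contracted space can cover arbitrarily distant queries, which is one way to keep memory constant.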

2601.21570 2026-06-12 cs.AI cs.RO 版本更新

From Digital to Physical: Digital Agents as Autonomous Coaches for Physical Intelligence

从数字到物理:数字代理作为物理智能的自主教练

Zixing Lei, Genjia Liu, Yuanshuo Zhang, Qipeng Liu, Yuzhu Cai, Sixiang Chen, Jixian Wu, Yunhong Wang, Weixin Li, Chuan Wen, Bo Zhao, Shanghang Zhang, Wenzhao Lian, Siheng Chen

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China(上海交通大学人工智能学院) Zhongguancun Academy, Beijing, China(中关村学院) School of Integrated Circuits, Shanghai Jiao Tong University, Shanghai, China(上海交通大学集成电路学院) School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(上海交通大学计算机科学学院) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China(北京大学计算机科学学院多媒体信息处理国家重点实验室)

AI总结 提出EmboCoach-Bench基准,评估LLM代理自主设计具身策略的能力,通过迭代调试和优化,代理在平均成功率上超越人工基线26.5%,并具备自我修正能力。

Comments 53 pages, 12 figures

详情
AI中文摘要

具身AI领域正朝着通用机器人系统快速发展,得益于高保真模拟和大规模数据收集。然而,这种扩展能力仍然受到劳动密集型人工监督的严重瓶颈,从复杂的奖励塑造到跨异构后端的超参数调整。受LLM在软件自动化和科学发现中成功的启发,我们引入了EmboCoach-Bench,一个评估LLM代理自主设计具身策略能力的基准。涵盖32个专家精选的RL和IL任务,我们的框架将可执行代码作为通用接口。我们超越静态生成,评估动态闭环工作流,其中代理利用环境反馈迭代地起草、调试和优化解决方案,涵盖从物理信息奖励设计到扩散策略等策略架构的改进。广泛评估得出三个关键见解:(1)自主代理在平均成功率上可以定性超越人工设计的基线26.5%;(2)具有环境反馈的代理工作流有效增强了策略开发,并显著缩小了开源和专有模型之间的性能差距;(3)代理对病态工程案例表现出自我修正能力,通过迭代仿真循环调试成功从近乎完全失败中恢复任务性能。最终,这项工作为自我进化的具身智能奠定了基础,加速了具身AI领域从劳动密集型手动调优到可扩展自主工程的范式转变。

英文摘要

The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight, from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs' success in software automation and scientific discovery, we introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5% in average success rate; (2) agentic workflows with environment feedback effectively strengthen policy development and substantially narrow the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in the embodied AI field.

11. 安全、鲁棒性与可信机器人 4 篇

2606.13203 2026-06-12 cs.RO 新提交

Embedding ISO 10218 Safety Compliance in Robots via Control Barrier Functions for Human-Robot Collaboration

通过控制障碍函数将ISO 10218安全合规性嵌入机器人以实现人机协作

Federico Parma, Cesare Tonola, Nicola Pedrocchi, Manuel Beschi

发表机构 * Dept. of Electrical and Information Engineering, Polytechnic of Bari(巴里理工大学电气与信息工程系) Dipartimento di Ingegneria Meccanica e Industriale, University of Brescia(布雷西亚大学机械与工业工程系) Institute of Intelligent Industrial Technologies and Systems, National Research Council of Italy, STIIMA-CNR(意大利国家研究委员会智能工业技术与系统研究所)

AI总结 提出基于控制障碍函数(CBF)的方法,利用人体加速度数据预测机器人最坏情况制动过程中的最小人机距离,并将其作为不等式约束纳入序列二次规划(SQP)框架;在UR10e上的仿真与真实实验表明,方法II相比方法I将平均轨迹误差降低63%,同时符合ISO 10218的SSM要求。

详情
AI中文摘要

人机协作(HRC)需要严格遵守安全标准(如ISO 10218),以防止有害交互。标准的速度与分离监控(SSM)滤波器基于保守假设(如人体速度恒定)计算安全机器人速度,这阻碍了对最小分离距离的准确预测,并导致不必要的操作停止。本文提出一种控制障碍函数(CBF),明确纳入人体加速度数据,以在机器人最坏情况制动轨迹期间解析地前向预测最小人机分离距离。为保证控制层面的安全性,该预测性CBF作为不等式约束被集成到序列二次规划(SQP)框架中。具体地,提出了两种方法:方法I,一种CBF约束的PD安全滤波器;方法II,一种执行空间管约束的任务缩放SQP控制器。在UR10e机器人上的仿真和实际实验评估了两种方法相对于标准工业SSM模块基线的性能。结果表明,方法II动态调节执行速度并限制空间偏差。与方法I相比,方法II在平均轨迹误差上减少了63%,并避免了过度规避动作,在遵守ISO 10218 SSM指南的同时确保了高任务吞吐量。

英文摘要

Human-Robot Collaboration (HRC) requires strict adherence to safety standards, such as ISO 10218, to prevent harmful interactions. Standard Speed and Separation Monitoring (SSM) filters calculate safe robotic speeds based on conservative assumptions, such as constant human velocity, which prevents accurate predictions of minimum separation distances and causes unnecessary operational halts. This paper proposes a Control Barrier Function (CBF) that explicitly incorporates human acceleration data to analytically forward-predict the minimum human-robot separation distance during a worst-case robotic stopping trajectory. To guarantee safety at the control level, this predictive CBF is integrated as an inequality constraint within a Sequential Quadratic Programming (SQP) framework. Specifically, two methods are proposed: Method I, a CBF-constrained PD safety filter; and Method II, a task-scaling SQP controller that enforces a spatial tube constraint. Simulated and real-world experiments on a UR10e robot evaluate the two proposed methods against a standard industrial SSM module baseline. Results demonstrate that Method II dynamically modulates execution speed and confines spatial deviations. Compared to Method I, Method II achieves a 63% reduction in mean trajectory error and avoids excessive evasive manoeuvres, ensuring high task throughput while complying with ISO 10218 SSM guidelines.
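The idea of forward-predicting the smallest human-robot gap over a worst-case stop can be illustrated in one dimension. The sketch below is not the paper's analytic CBF. It assumes motion along the line joining robot and human, a constant human acceleration over the stopping window, a fixed reaction delay followed by constant braking, and it finds the minimum by sampling in time rather than in closed form. All names and numeric values are hypothetical.

```python
def predicted_min_separation(d0, v_robot, v_human, a_human,
                             reaction_time, max_decel, steps=1000):
    """Smallest predicted gap while the robot performs a worst-case stop.

    d0:        current separation (m)
    v_robot:   robot speed toward the human (m/s, >= 0)
    v_human:   human speed toward the robot (m/s)
    a_human:   human acceleration toward the robot (m/s^2), assumed constant
    The robot keeps v_robot for reaction_time, then brakes at max_decel.
    """
    if max_decel <= 0:
        raise ValueError("max_decel must be positive")
    t_stop = reaction_time + v_robot / max_decel
    d_min = d0
    for i in range(steps + 1):
        t = t_stop * i / steps
        t_brake = max(0.0, t - reaction_time)
        robot_travel = v_robot * t - 0.5 * max_decel * t_brake * t_brake
        human_travel = v_human * t + 0.5 * a_human * t * t
        d_min = min(d_min, d0 - robot_travel - human_travel)
    return d_min


def barrier_value(d_min, d_safe):
    """CBF-style safety margin: non-negative means the stop stays safe."""
    return d_min - d_safe


if __name__ == "__main__":
    constant_velocity = predicted_min_separation(2.0, 1.0, 0.0, 0.0, 0.0, 1.0)
    with_acceleration = predicted_min_separation(2.0, 1.0, 0.0, 1.0, 0.0, 1.0)
    print(constant_velocity, with_acceleration)
```

Setting `a_human = 0` recovers the constant-velocity assumption that the abstract attributes to standard SSM filters. A controller would then constrain its commanded speed so that `barrier_value` stays non-negative.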

2501.04823 2026-06-12 cs.RO math.OC stat.AP 版本更新

Learning Robot Safety from Sparse Human Feedback using Conformal Prediction

基于共形预测从稀疏人类反馈中学习机器人安全

Aaron O. Feldman, Joseph A. Vincent, Maximilian Adang, JunEn Low, Mac Schwager

发表机构 * Department of Aeronautics and Astronautics, Stanford University(斯坦福大学航空航天系)

AI总结 通过人类对策略轨迹的二元反馈,利用共形预测识别包含未来策略错误的状态区域,构建具有保证漏检率的预警系统,并用于改进模型预测控制器的安全性。

详情
AI中文摘要

确保机器人安全可能具有挑战性;用户定义的约束可能遗漏边缘情况,策略即使从安全数据训练也可能变得不安全,并且安全可能是主观的。因此,我们通过向标记不安全行为的人类展示策略轨迹来学习机器人安全。从这种二元反馈中,我们使用共形预测的统计方法识别一个状态区域(可能在学习的潜在空间中),保证包含用户指定比例的未来策略错误。我们的方法是样本高效的,因为它基于最近邻分类,避免了共形预测中常见的保留数据。通过提醒机器人是否到达可疑的不安全区域,我们获得了一个模拟人类安全偏好且具有保证漏检率的预警系统。通过视频标注,我们的系统可以检测四旋翼视觉运动策略何时无法通过指定门。我们提出了一种通过避免可疑不安全区域来改进策略的方法。通过它,我们提高了模型预测控制器的安全性,这在30次四旋翼飞行跨越6个导航任务的实验测试中得到了证明。提供了代码和视频。

英文摘要

Ensuring robot safety can be challenging; user-defined constraints can miss edge cases, policies can become unsafe even when trained from safe data, and safety can be subjective. Thus, we learn about robot safety by showing policy trajectories to a human who flags unsafe behavior. From this binary feedback, we use the statistical method of conformal prediction to identify a region of states, potentially in learned latent space, guaranteed to contain a user-specified fraction of future policy errors. Our method is sample-efficient, as it builds on nearest neighbor classification and avoids withholding data as is common with conformal prediction. By alerting if the robot reaches the suspected unsafe region, we obtain a warning system that mimics the human's safety preferences with guaranteed miss rate. From video labeling, our system can detect when a quadcopter visuomotor policy will fail to steer through a designated gate. We present an approach for policy improvement by avoiding the suspected unsafe region. With it we improve a model predictive controller's safety, as shown in experimental testing with 30 quadcopter flights across 6 navigation tasks. Code and videos are provided.
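The guaranteed miss rate comes from conformal calibration. The sketch below shows the standard split-conformal version with a nearest-neighbor score, which is a simplification: the abstract states that the authors' method avoids withholding calibration data, whereas this sketch does hold out a calibration set. Scores are distances to the nearest human-flagged unsafe state, so small scores look unsafe. All function names are hypothetical.

```python
import math


def nearest_distance(state, references):
    """Distance from a state to its nearest flagged-unsafe reference state."""
    return min(math.dist(state, ref) for ref in references)


def conformal_threshold(calibration_scores, miss_rate):
    """Score threshold that covers a new unsafe state with prob. >= 1 - miss_rate.

    calibration_scores: scores of held-out states that were flagged unsafe.
    Uses the ceil((n + 1) * (1 - miss_rate))-th smallest score; if that rank
    exceeds n, no finite threshold gives the guarantee, so return infinity.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - miss_rate))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def should_alert(state, unsafe_references, threshold):
    """Warn when the state falls inside the suspected unsafe region."""
    return nearest_distance(state, unsafe_references) <= threshold


if __name__ == "__main__":
    threshold = conformal_threshold([3, 1, 2, 5, 4, 9, 8, 7, 6], miss_rate=0.5)
    print(threshold, should_alert((0.0, 0.0), [(1.0, 0.0), (5.0, 5.0)], threshold))
```

The guarantee is marginal and relies on the calibration states and future unsafe states being exchangeable. With too few flagged examples the threshold becomes infinite and the system alerts everywhere, which is the price of a strict miss-rate target.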

2603.16013 2026-06-12 cs.RO cs.SE 版本更新

Safety Case Patterns for VLA-based driving systems: Insights from SimLingo

基于VLA的驾驶系统的安全案例模式:来自SimLingo的见解

Gerhard Yu, Fuyuki Ishikawa, Oluwafemi Odu, Alvine Boaye Belle

发表机构 * York University(约克大学) National Institute of Informatics(国家信息研究所)

AI总结 针对VLA驾驶系统提出RAISE安全案例设计方法,通过扩展HARA和定制模式,结合SimLingo案例验证其构建基于证据的安全声明的有效性。

详情
AI中文摘要

基于视觉-语言-动作(VLA)的驾驶系统代表了自动驾驶领域的重大范式转变,因为通过结合交通场景理解、语言解释和动作生成,这些系统能够实现更灵活、自适应和响应指令的驾驶行为。然而,尽管它们被越来越多地采用,并具有支持社会责任型自动驾驶以及理解高级人类指令的潜力,基于VLA的驾驶系统可能表现出新型的危险行为。例如,将开放式的自然语言输入(如用户或导航指令)集成到多模态控制回路中可能导致不可预测和不安全的行为,从而危及车辆乘员和行人。因此,确保这些系统的安全性对于建立对其运行的信任至关重要。为此,我们提出了一种名为RAISE的新型安全案例设计方法。我们的方法引入了针对基于指令的驾驶系统(如VLA驾驶系统)定制的新模式,扩展了危害分析和风险评估(HARA),详细说明了安全场景及其结果,并设计了一种创建VLA驾驶系统安全案例的技术。在SimLingo上的案例研究说明了如何使用我们的方法为这类新兴的自动驾驶系统构建严谨的、基于证据的安全声明。

英文摘要

Vision-Language-Action (VLA)-based driving systems represent a significant paradigm shift in autonomous driving since, by combining traffic scene understanding, linguistic interpretation, and action generation, these systems enable more flexible, adaptive, and instruction-responsive driving behaviors. However, despite their growing adoption and potential to support socially responsible autonomous driving as well as understanding high-level human instructions, VLA-based driving systems may exhibit new types of hazardous behaviors. For instance, the integration of open-ended natural language inputs (e.g., user or navigation instructions) into the multimodal control loop may lead to unpredictable and unsafe behaviors that could endanger vehicle occupants and pedestrians. Hence, assuring the safety of these systems is crucial to help build trust in their operations. To support this, we propose a novel safety case design approach called RAISE. Our approach introduces novel patterns tailored to instruction-based driving systems such as VLA-based driving systems, an extension of Hazard Analysis and Risk Assessment (HARA) detailing safe scenarios and their outcomes, and a design technique to create the safety cases of VLA-based driving systems. A case study on SimLingo illustrates how our approach can be used to construct rigorous, evidence-based safety claims for this emerging class of autonomous driving systems.

2602.05121 2026-06-12 eess.SY cs.RO cs.SY 版本更新

Trojan Attacks on Neural Network Controllers for Robotic Systems

针对机器人系统神经网络控制器的木马攻击

Farbod Younesi, Walter Lucia, Amr Youssef

发表机构 * Concordia University(康考迪亚大学) Concordia Institute for Information Systems Engineering(康考迪亚信息系统工程研究所) Fonds de recherche du Québec – Nature et Technologies(魁北克自然与技术研究基金) National Cybersecurity Consortium(国家网络安全联盟)

AI总结 针对机器人神经网络控制器,设计轻量级并行木马网络,在特定触发条件下篡改控制指令,通过仿真验证攻击有效性。

Comments Paper submitted to the 2026 IEEE Conference on Control Technology and Applications (CCTA)

详情
AI中文摘要

神经网络控制器越来越多地应用于机器人系统中,用于轨迹跟踪和姿态稳定等任务。然而,它们对可能不可信的训练流程或供应链的依赖引入了显著的安全漏洞。本文以差速驱动移动机器人平台为案例,研究针对神经控制器的后门(木马)攻击。具体来说,假设机器人的跟踪控制器实现为神经网络,我们设计了一个轻量级的并行木马网络,可以嵌入到控制器中。该恶意模块在正常操作期间保持休眠,但在检测到由机器人姿态和目标参数定义的高度特定触发条件时,会破坏主控制器的轮速命令,导致不良且可能不安全的机器人行为。我们提供了所提出的木马网络的概念验证实现,并通过两种不同攻击场景下的仿真进行了验证。结果证实了所提出攻击的有效性,并表明基于神经网络的机器人控制系统面临潜在的关键安全威胁。

英文摘要

Neural network controllers are increasingly deployed in robotic systems for tasks such as trajectory tracking and pose stabilization. However, their reliance on potentially untrusted training pipelines or supply chains introduces significant security vulnerabilities. This paper investigates backdoor (Trojan) attacks against neural controllers, using a differential-drive mobile robot platform as a case study. In particular, assuming that the robot's tracking controller is implemented as a neural network, we design a lightweight, parallel Trojan network that can be embedded within the controller. This malicious module remains dormant during normal operation but, upon detecting a highly specific trigger condition defined by the robot's pose and goal parameters, compromises the primary controller's wheel velocity commands, resulting in undesired and potentially unsafe robot behaviours. We provide a proof-of-concept implementation of the proposed Trojan network, which is validated through simulation under two different attack scenarios. The results confirm the effectiveness of the proposed attack and demonstrate that neural network-based robotic control systems are subject to potentially critical security threats.

12. 其他/综合机器人 4 篇

2606.12657 2026-06-12 cs.AI cs.DB cs.RO 交叉投稿

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

TrajGenAgent: 一种用于人类移动轨迹生成的分层LLM智能体

Siyu Li, Toan Tran, Lingyi Zhao, Khurram Shafique, Li Xiong

发表机构 * Emory University(埃默里大学) University of Florida(佛罗里达大学)

AI总结 提出TrajGenAgent,一种无需微调的分层LLM智能体框架,通过编排器-工作者两阶段设计生成真实轨迹,在时空保真度、语义一致性和个体行为真实性上优于现有方法。

Comments 14 pages, 2 figures, 8 tables. Accepted by the 27th IEEE International Conference on Mobile Data Management (MDM 2026)

详情
AI中文摘要

人类移动数据对于交通、城市规划和流行病控制至关重要,但大规模轨迹收集通常成本高昂且受隐私限制,这推动了逼真的合成轨迹生成。现有的基于LLM的生成器通常依赖于提示工程(保留了零样本推理但缺乏细粒度的时空基础)或轨迹级微调(提高了统计精度但产生了大量计算成本并可能削弱一般推理)。我们提出了TrajGenAgent,一种语义感知的分层LLM智能体框架,用于无需模型微调的人类移动轨迹生成。TrajGenAgent采用两阶段编排器-工作者设计:LLM首先通过上下文学习从历史证据中合成个体和星期条件化的活动链,然后确定性工作流通过个性化POI检索、距离感知位置选择、运动学感知旅行时间传播和基于LLM的持续时间估计将每个活动落地为完整的访问。为了评估超越聚合时空统计的真实性,我们引入了一个基于异常检测的评估框架,使用两个互补检测器来评估行为和语义合理性。在基准和大规模模拟数据集上的实验表明,与代表性的神经网络和基于LLM的基线相比,TrajGenAgent在时空保真度、语义一致性和个体特定行为真实性方面有所改进,同时避免了参数更新。

英文摘要

Human mobility data is important for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often costly and privacy-constrained, motivating realistic synthetic trajectory generation. Existing LLM-based generators typically rely on either prompt engineering, which preserves zero-shot reasoning but lacks fine-grained spatiotemporal grounding, or trajectory-level fine-tuning, which improves statistical precision but incurs substantial computational cost and may weaken general reasoning. We propose TrajGenAgent, a semantic-aware hierarchical LLM-agent framework for human mobility trajectory generation without model fine-tuning. TrajGenAgent uses a two-stage orchestrator-worker design: an LLM first synthesizes an individual- and weekday-conditioned activity chain from historical evidence via in-context learning, and a deterministic workflow then grounds each activity into a complete visit using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. To evaluate realism beyond aggregate spatiotemporal statistics, we introduce an anomaly-detection-based evaluation framework using two complementary detectors to assess behavioral and semantic plausibility. Experiments on benchmark and large-scale simulation datasets show that TrajGenAgent improves spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over representative neural and LLM-based baselines, while avoiding parameter updates.

2602.02181 2026-06-12 cs.RO 版本更新

Extending the Law of Intersegmental Coordination: Implications for Powered Prosthetic Controls

扩展节段间协调定律:对动力假肢控制的启示

Elad Siman Tov, Nili E. Krausz

发表机构 * Faculty of Mechanical Engineering, Technion – Israel Institute of Technology(以色列理工学院机械工程学院)

AI总结 针对下肢截肢者步行代谢成本问题,提出基于节段间协调定律的假肢控制框架,通过分析三维运动学数据扩展出力矩协调定律,并开发了开源工具包。

Comments Submitted to 2026 IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob)

详情
AI中文摘要

动力假肢能够为截肢者提供净正功,并在过去二十年中取得了进步。然而,降低截肢者步行代谢成本仍是一个未解决的问题。节段间协调定律(ISC)已在多种步态中被观察到,并先前被认为与步行能量消耗有关,但很少在下肢截肢者步态背景下进行分析或应用。该定律指出,大腿、小腿和足部在步态周期中的仰角是协变的。在这项工作中,我们开发了一种方法,用于分析下肢三维运动学数据的节段间协调,以简化ISC分析。此外,受运动控制、生物力学和机器人学文献的启发,我们使用该方法将ISC扩展为一种新的力矩协调定律。我们发现了这些仰角空间力矩(ESM),并展示了显示健全步态基于力矩的协调的结果。我们还分析了使用动力和被动假肢的截肢者步态的ISC,发现虽然仰角保持平面性,但ESM缺乏平面协调。我们提出了一个ISC驱动的动力假肢控制框架,使用健康协调作为约束来预测小腿角度/力矩,以补偿由于被动足部引起的改变。我们开发了ISC3d工具箱,该工具可在线免费获取,可用于计算三维运动学和动力学ISC。这为进一步研究协调在步态中的作用提供了手段,并可能有助于解决人类运动神经控制的基本问题。

英文摘要

Powered prostheses are capable of providing net positive work to amputees and have advanced in the past two decades. However, reducing amputee metabolic cost of walking remains an open problem. The Law of Intersegmental Coordination (ISC) has been observed across gaits and previously implicated in energy expenditure of walking, yet it has rarely been analyzed or applied within the context of lower-limb amputee gait. This law states that the elevation angles of the thigh, shank and foot over the gait cycle covary. In this work, we developed a method to analyze intersegmental coordination for lower-limb 3D kinematic data, to simplify ISC analysis. Moreover, inspired by motor control, biomechanics and robotics literature, we used our method to extend ISC to a new law of coordination of moments. We find these Elevation Space Moments (ESM), and present results showing a moment-based coordination for able bodied gait. We also analyzed ISC for amputee gait with powered and passive prostheses, and found that while elevation angles remained planar, the ESM lacked planar coordination. We present an ISC-driven powered prosthetic control framework, using healthy coordination as a constraint to predict the shank angles/moments to compensate for alterations due to a passive foot. We developed the ISC3d toolbox that is freely available online, which may be used to compute kinematic and kinetic ISC in 3D. This provides a means to further study the role of coordination in gait and may help address fundamental questions of the neural control of human movement.

2604.24449 2026-06-12 cs.RO cs.AI cs.LG 版本更新

SPLIT: Separating Physical-Contact via Latent Arithmetic in Image-Based Tactile Sensors

SPLIT:在基于图像的触觉传感器中通过潜在空间算术分离物理接触

Wadhah Zai El Amri, Nicolás Navarro-Guerrero

发表机构 * Leibniz Universität Hannover, L3S Research Center(莱布尼茨汉诺威大学,L3S研究所)

AI总结 本文提出SPLIT方法,通过潜在空间算术分离接触几何与传感器光学特性,实现触觉传感器的高效模拟,支持多传感器迁移和双向模拟,提升机器人触觉感知研究效率。

Comments Accepted to Elsevier Robotics and Autonomous Systems Journal

详情
AI中文摘要

训练机器人触觉感知的机器学习模型需要大量数据,但获取真实交互数据因物理复杂性和变异性而具有挑战性。模拟触觉传感器是加速进展的关键步骤。本文提出了SPLIT,一种新的基于图像的触觉传感器模拟方法,重点在于DIGIT传感器。我们的方法核心是一种潜在空间算术策略,明确分离接触几何与传感器特定的光学属性。与需要重新校准的现有方法不同,这种分离使SPLIT能够适应多样化的DIGIT背景,甚至在不完全重训练的情况下将数据转移到不同的传感器如GelSight R1.5。此外,我们的方法在推理速度上优于现有替代方案。我们还提供了一种校准的有限元方法(FEM)软体网格模拟,具有可变分辨率,提供速度与保真度之间的可调权衡。此外,我们的算法支持双向模拟,允许从变形网格生成逼真图像以及从触觉图像重建网格。这种多功能性使SPLIT成为加速机器人触觉感知研究进展的重要工具。

英文摘要

Training machine learning models for robotic tactile sensing requires vast amounts of data, yet obtaining realistic interaction data remains a challenge due to physical complexity and variability. Simulating tactile sensors is thus a crucial step in accelerating progress. This paper presents SPLIT, a novel method for simulating image-based tactile sensors, with a primary focus on the DIGIT sensor. Central to our approach is a latent space arithmetic strategy that explicitly disentangles contact geometry from sensor-specific optical properties. Unlike methods that require recalibration for every new unit, this disentanglement allows SPLIT to adapt to diverse DIGIT backgrounds and even transfer data to distinct sensors like the GelSight R1.5 without full model retraining. Beyond this adaptability, our approach achieves faster inference speeds than existing alternatives. Furthermore, we provide a calibrated finite element method (FEM) soft-body mesh simulation with variable resolution, offering a tunable trade-off between speed and fidelity. Additionally, our algorithm supports bidirectional simulation, allowing for both the generation of realistic images from deformation meshes and the reconstruction of meshes from tactile images. This versatility makes SPLIT a valuable tool for accelerating progress in robotic tactile sensing research.

2204.10552 2026-06-12 cs.RO 版本更新

Making Parameterization and Constrains of Object Landmark Globally Consistent via SPD(3) Manifold and Improved Cost Functions

通过SPD(3)流形和改进的成本函数使物体地标参数化和约束实现全局一致

Yutong Hu, Wei Wang

AI总结 本文通过SPD(3)流形和改进成本函数解决物体级SLAM后端的奇异性问题,提升收敛速度和鲁棒性,实验显示建图精度平均提高22%。

Comments 8 pages, 8 figures, submitted to IROS 2022 & RA-L

详情
AI中文摘要

物体级SLAM引入了具有语义意义且紧凑的物体地标,有助于室内机器人应用和室外自动驾驶任务。然而,现有方法因分别用尺度和姿态参数化物体地标而导致后端出现奇异性问题。在这种参数化下,同一物体在将物体坐标系旋转90度并交换长宽值后仍可得到等价表示,导致同一物体地标的位姿不具有全局一致性。本文引入对称正定(SPD)矩阵流形作为改进的物体级地标表示,并改进后端成本函数使其兼容该表示。实验表明,所提方法在仿真中收敛更快且更鲁棒。在真实数据集上的实验也显示,使用相同前端数据时,本策略平均提高了22%的建图精度。

英文摘要

Object-level SLAM introduces semantically meaningful and compact object landmarks that help both indoor robot applications and outdoor autonomous driving tasks. However, the back end of object-level SLAM suffers from singularity problems because existing methods parameterize object landmarks separately by their scales and poses. Under that parameterization, the same abstract object can also be represented by rotating the object coordinate frame by 90 degrees and swapping its length and width values, so the pose of the same object landmark is not globally consistent. To avoid the singularity problem, we first introduce the symmetric positive-definite (SPD) matrix manifold as an improved object-level landmark representation and further improve the cost functions in the back end to make them compatible with the representation. Our method demonstrates a faster convergence rate and more robustness in simulation experiments. Experiments on real datasets also reveal that, using the same front-end data, our strategy improves the mapping accuracy by 22% on average.
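The 90-degree ambiguity described above can be checked numerically. The sketch below assumes the common ellipsoid shape matrix $S = R\,\mathrm{diag}(s^2)\,R^\top$ as the SPD representation. The abstract does not spell out the paper's exact construction, so this form, the restriction to yaw-only rotation, and the function names are illustrative assumptions.

```python
import math


def rot_z(yaw):
    """3x3 rotation about the z axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def shape_matrix(yaw, scales):
    """SPD shape matrix S = R diag(scales^2) R^T of an oriented ellipsoid."""
    r = rot_z(yaw)
    return [[sum(r[i][k] * scales[k] ** 2 * r[j][k] for k in range(3))
             for j in range(3)]
            for i in range(3)]


def max_abs_diff(a, b):
    """Largest element-wise difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


if __name__ == "__main__":
    # Two different (pose, scale) pairs describing the same physical object.
    s_a = shape_matrix(0.0, (2.0, 1.0, 0.5))
    s_b = shape_matrix(math.pi / 2, (1.0, 2.0, 0.5))
    print("difference between the two SPD matrices:", max_abs_diff(s_a, s_b))
```

Both parameter sets collapse to the same matrix up to floating-point error, so an optimizer working on the SPD representation has one target where the separate pose-and-scale parameterization has several equivalent ones.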