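arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.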

2606.12499 2026-06-12 cs.RO New submission

Action-Effect Memory Pretraining for Robot Manipulation

Yijing Zhou, Qiwei Liang, Sitong Zhuang, Jiaxi Li, Xianpeng Wang, Boyang Cai, Yunyang Mo, Renjing Xu

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Shenzhen University

AI summary: Proposes the AEM framework, which learns compact temporal representations through masked modeling of vision-action history and improves robot manipulation under partial observability, outperforming single-frame pretraining and frame stacking.

Abstract

We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.
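
As a rough illustration of the interleaving and masking described in the abstract, the following Python sketch builds a vision-action history and hides part of it. The token layout (v1, a1, ..., vT), the `MASK` placeholder, the mask ratio, the choice to never mask the final vision token, and both function names are assumptions made for illustration. The abstract does not specify them, and the Mamba encoder and the decoder are omitted.

```python
import random
from typing import List, Sequence, Tuple

MASK = "<mask>"  # hypothetical placeholder for a masked token


def interleave(vision: Sequence[str], actions: Sequence[str]) -> List[str]:
    """Interleave tokens as v1, a1, v2, a2, ..., vT (one fewer action than frames)."""
    if len(vision) != len(actions) + 1:
        raise ValueError("expected one more vision token than action tokens")
    history: List[str] = []
    for v, a in zip(vision, actions):
        history.extend([v, a])
    history.append(vision[-1])  # the final vision token closes the history
    return history


def mask_history(
    history: Sequence[str], ratio: float, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Replace a random subset of tokens (never the final one) with MASK."""
    candidates = list(range(len(history) - 1))
    num_masked = int(round(ratio * len(candidates)))
    masked_idx = sorted(rng.sample(candidates, num_masked))
    corrupted = [MASK if i in masked_idx else tok for i, tok in enumerate(history)]
    return corrupted, masked_idx


history = interleave(["v1", "v2", "v3"], ["a1", "a2"])
corrupted, masked_idx = mask_history(history, ratio=0.5, rng=random.Random(0))
print(history)
print(corrupted)
```

In a pretraining setup along these lines, a sequence model would read `corrupted` and be trained to reconstruct the tokens at `masked_idx`; the encoder output at the last position would then serve as the single-vector history summary the abstract mentions.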

2606.12890 2026-06-12 cs.RO New submission

Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer

Aryan Naveen, Haitong Ma, Haldun Balim, Na Li

Affiliations: Massachusetts Institute of Technology; Harvard School of Engineering and Applied Sciences

AI summary: Proposes the RepMT-SAC framework, which captures transferable dynamics through a spectral MDP decomposition, giving the value function a structure with a task-agnostic core and minimal task-specific adjustment, and improves zero-shot performance on quadrotor trajectory-tracking tasks by 30%.

Comments: 8 pages, 4 figures, 1 table

2606.13169 2026-06-12 cs.RO New submission

Redesigning Regularization for Effective Policy Smoothing

Taisuke Kobayashi, Naoto Yamanaka

Affiliations: National Institute of Informatics (NII); The Graduate University for Advanced Studies (SOKENDAI)

AI summary: Addressing policy smoothing in reinforcement learning, the paper identifies discrepancies between the theory and the practical implementation of existing regularization and proposes remedies. The modified regularization yields smooth motion and better control performance across multiple tasks and algorithms, and sim-to-real transfer on a quadruped robot shows that smoothness improves robustness to sudden changes in target velocity.

Comments: submitted to RA-L

Abstract

This paper proposes a novel regularization design to effectively smooth policy functions in reinforcement learning. While regularization that enhances "global" Lipschitz continuity was initially considered, it has been limited to "local" Lipschitz continuity due to a tradeoff between smoothness and expressiveness. However, it has become apparent that the original implementation is cumbersome and does not provide sufficient smoothing, leading to a preference for simpler implementations. This stems from a discrepancy between theory and implementation, and a more appropriate implementation can be expected to facilitate smoothing. Therefore, this paper identifies three reasons why the original implementation does not function adequately and provides remedies for them. This modified regularization performs well across multiple tasks and algorithms, successfully achieving smooth motion while improving control performance. Furthermore, by applying it to sim-to-real reinforcement learning for a quadruped robot, it is demonstrated that smooth motion provides robustness against sudden changes in target velocity commands.
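
To make the idea of "local" Lipschitz regularization concrete, here is a minimal Python sketch of a generic local smoothness penalty. It is not the paper's modified regularizer: the abstract does not state the three issues or their remedies, so the hinge form, the `target_ratio` threshold, and all function names below are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], List[float]]


def local_lipschitz_ratio(
    policy: Policy, state: Sequence[float], noise: Sequence[float]
) -> float:
    """Ratio ||pi(s + d) - pi(s)|| / ||d|| for one small perturbation d."""
    perturbed = [s + d for s, d in zip(state, noise)]
    action_change = math.dist(policy(perturbed), policy(state))
    state_change = math.hypot(*noise)
    return action_change / state_change


def smoothness_penalty(
    policy: Policy, state: Sequence[float], noise: Sequence[float], target_ratio: float
) -> float:
    """Hinge penalty (an assumed form): only ratios above target_ratio are penalized."""
    return max(0.0, local_lipschitz_ratio(policy, state, noise) - target_ratio)


def linear_policy(state: Sequence[float]) -> List[float]:
    """Toy policy a = 2 * s, whose Lipschitz constant is exactly 2."""
    return [2.0 * x for x in state]


ratio = local_lipschitz_ratio(linear_policy, [1.0, -1.0], [0.3, 0.4])
penalty = smoothness_penalty(linear_policy, [1.0, -1.0], [0.3, 0.4], target_ratio=1.5)
print(ratio, penalty)
```

For the toy linear policy the ratio equals its slope, so a penalty of this kind is zero whenever the allowed ratio is at least 2 and grows linearly below that. In training, such a term would typically be averaged over sampled states and perturbations and added to the policy loss.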

2606.13355 2026-06-12 cs.RO cs.AI New submission

Real-Time Execution with Autoregressive Policies

Sangkyu Lee, Seohyeon Park, Tackgeun You, Avi Caciularu, Idan Szpektor, Hwasup Lim, Youngjae Yu

Affiliations: Korea Institute of Science and Technology; Seoul National University; Google Research

AI summary: Achieves real-time execution of autoregressive policies through asynchronous inference and constrained decoding, guaranteeing low latency while speeding up task completion; experiments show it outperforms flow-matching policies.

Abstract

Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds over synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

2606.11092 2026-06-12 cs.RO cs.AI Updated version

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li

Affiliations: The University of Hong Kong; The Chinese University of Hong Kong; Archon Robotics

AI summary: Proposes RoboNaldo, a three-stage motion-guided curriculum reinforcement learning framework that starts from a single human-kick reference and progressively optimizes shooting performance. In simulation, shot error falls by 48.6% and shot velocity rises 2.96x; on a real robot, average shot error from 3 m is 0.73 to 0.86 m and post-contact ball velocity reaches 13.10 m/s.

Abstract

Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold, and optimization is progressively shifted towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves a free-kick shot error 48.6% lower and a shot velocity 2.96x that of prior-work baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: https://opendrivelab.com/RoboNaldo.
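
The last comparison can be unpacked with a short calculation. The sketch below assumes the 59-71% figure is a simple ratio of the reported 13.10 m/s to a range of professional reference speeds; the abstract does not state those reference speeds, so the values printed here are only what that assumption implies.

```python
ball_speed = 13.10  # m/s, reported post-contact ball velocity
low_share, high_share = 0.59, 0.71  # reported share of professional shot speed

# A larger share corresponds to a slower professional reference speed.
reference_min = ball_speed / high_share
reference_max = ball_speed / low_share

print(f"{reference_min:.1f} to {reference_max:.1f} m/s")  # prints "18.5 to 22.2 m/s"
```

Under that reading, the implied professional open-play shot speeds lie between roughly 18.5 and 22.2 m/s.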

2604.08958 2026-06-12 cs.LG cs.AI cs.RO Updated version

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

Mintae Kim, Koushil Sreenath

Affiliations: Hybrid Robotics, UC Berkeley

AI summary: Proposes the WOMBET framework, which learns a world model in the source task, generates uncertainty-penalized offline data, and then fine-tunes online with adaptive sampling, achieving robust and sample-efficient reinforcement learning transfer.

Comments: 13 pages, 6 figures, 8th Annual Learning for Dynamics & Control Conference (L4DC)

Abstract

Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-Based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
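
The uncertainty penalty and the trajectory filter can be pictured with a small Python sketch. It assumes that epistemic uncertainty is approximated by the disagreement (population standard deviation) of an ensemble of reward predictions, which is a common choice but is not stated in the abstract; the function names, `penalty_weight`, and both thresholds are likewise hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def penalized_reward(ensemble_rewards: Sequence[float], penalty_weight: float) -> float:
    """Ensemble-mean reward minus a weighted disagreement term (assumed form)."""
    return mean(ensemble_rewards) - penalty_weight * pstdev(ensemble_rewards)


def keep_trajectory(
    step_predictions: Sequence[Sequence[float]],
    min_return: float,
    max_uncertainty: float,
) -> bool:
    """Keep a trajectory only if its predicted return is high and disagreement is low."""
    predicted_return = sum(mean(step) for step in step_predictions)
    avg_uncertainty = mean(pstdev(step) for step in step_predictions)
    return predicted_return >= min_return and avg_uncertainty <= max_uncertainty


# Each inner list holds the reward predicted by every ensemble member at one step.
confident = [[1.0, 1.0], [1.0, 1.0]]
uncertain = [[0.0, 2.0], [0.0, 2.0]]

print(keep_trajectory(confident, min_return=1.5, max_uncertainty=0.1))
print(keep_trajectory(uncertain, min_return=1.5, max_uncertainty=0.1))
```

Both toy trajectories have the same mean predicted return, but only the first passes the filter, because the ensemble members disagree on the second. Subtracting a disagreement term in this way is what makes a penalized objective pessimistic about poorly modeled regions.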

2606.12579 2026-06-12 cs.RO New submission

G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation

Tanmay Bishnoi, Riddhiman Laha, Tobias Löw, Jose Alex Chandy, Luis F. C. Figueredo, Sami Haddadin

Affiliations: Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University; Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich (TUM); Institute for Experiential Robotics, Northeastern University; Idiap Research Institute; EPFL; CHART Group at the School of Computer Science, University of Nottingham; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

AI summary: Proposes a GPU-accelerated framework that achieves real-time reactive motion generation in unstructured environments through parallel state exploration and a tightly coupled perception-action loop, reaching up to a 5x speedup and successful collision avoidance on a 7-DoF robot.

Comments: The implementation is available at: https://github.com/chart-research/g-mapp

DOI: 10.1109/LRA.2026.3678839
Journal ref: IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7516-7523, June 2026

Abstract

Reactive motion generation in unstructured environments remains an open challenge in robotics. Due to the computational complexity of collision-free motion generation, existing methods either generate global trajectories for static scenarios, or employ models that make conservative assumptions about the environment. This paper identifies the primary bottleneck as the runtime performance demand of planning on high-fidelity environments, and the temporal integration between the perception and planning modules. Therefore, we propose a framework that does not compromise on runtime performance and world representations for perception and planning by accelerating world modeling and vector-field based planning using the GPU. This allows us to achieve faster parallel state exploration for quasi-global trajectory planning, and tighter coupling of the perception-action loop in real-time for dynamic cluttered environments with off-the-shelf depth sensors. We quantitatively evaluate the computation-time and success rate differences for the CPU and GPU versions of our planner, and perform qualitative evaluations of our coupled framework using real-world experiments on a 7-DoF Franka Emika robot. Experimental results demonstrate that our GPU-based framework achieves up to a 5x speedup over the CPU version and successfully avoids collisions across both trivial and challenging physical world scenarios.

2606.12814 2026-06-12 cs.RO cs.AI New submission

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong

Affiliations: Southern University of Science and Technology

AI summary: Proposes the Stubborn framework, which unifies humanoid motion tracking and fall recovery through an asymmetric Actor-Critic architecture, a yaw-aligned representation, a Bernoulli probabilistic termination mechanism, and an adaptive sampling strategy, surpassing existing methods in performance and robustness.

Abstract

Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that enables the policy to encourage exploration of fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieved competitive performance, and the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.
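
The Bernoulli-based termination idea can be sketched in a few lines of Python. This is a minimal illustration only: the abstract does not say how the termination probability is chosen or scheduled, so `terminate_prob`, the failure flag, and the function name are assumptions.

```python
import random


def should_terminate(
    tracking_failed: bool, terminate_prob: float, rng: random.Random
) -> bool:
    """On a severe tracking failure, end the episode only with probability terminate_prob."""
    if not tracking_failed:
        return False
    return rng.random() < terminate_prob


rng = random.Random(0)
num_failures = 10_000
num_terminated = sum(should_terminate(True, 0.3, rng) for _ in range(num_failures))
print(num_terminated / num_failures)  # close to 0.3 for a large sample
```

With a deterministic rule every failure would end the episode, so the policy would never gather experience in fallen states. Drawing a Bernoulli sample instead lets some episodes continue past the failure, which is where recovery behavior can be explored.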

2606.13113 2026-06-12 eess.SY cs.RO cs.SY Cross-listed

Lexicographic Minimum-Violation Motion Planning Using Signal Temporal Logic

Patrick Halder, Lothar Kiltz, Hannes Homburger, Johannes Reuter, Matthias Althoff

AI summary: Proposes a method that converts lexicographic multi-objective optimization into single-objective scalar optimization using non-uniform quantization and bit-shifting, extends an MPPI solver, and introduces a predicate-robustness measure combining spatial and temporal violations, yielding interpretable and scalable lexicographic STL minimum-violation motion planning.

Comments: Submitted to the IEEE Open Journal of Intelligent Transportation Systems (under review)

Abstract

Motion planning for autonomous vehicles often requires satisfying multiple conditionally conflicting specifications. In situations where not all specifications can be met simultaneously, minimum-violation motion planning maintains system operation by minimizing violations of specifications in accordance with their priorities. Signal temporal logic (STL) provides a formal language for rigorously defining these specifications and enables the quantitative evaluation of their violations. However, a total ordering of specifications yields a lexicographic optimization problem, which is typically computationally expensive to solve using standard methods. We address this problem by transforming the multi-objective lexicographic optimization problem into a single-objective scalar optimization problem using non-uniform quantization and bit-shifting. Specifically, we extend a deterministic model predictive path integral (MPPI) solver to efficiently solve optimization problems without quadratic input cost. Additionally, a novel predicate-robustness measure that combines spatial and temporal violations is introduced. Our results show that the proposed method offers an interpretable and scalable solution for lexicographic STL minimum-violation motion planning within a single-objective solver framework.
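
One way to picture the scalarization step is the Python sketch below, which packs quantized violation levels into a single integer so that ordinary integer comparison reproduces the lexicographic order. The bin edges, the bit widths, and the function names are illustrative assumptions; the abstract does not give the paper's actual quantization scheme or how it is embedded in the MPPI solver.

```python
from typing import List, Sequence


def quantize(violation: float, edges: Sequence[float]) -> int:
    """Map a non-negative violation to a level index using non-uniform bin edges."""
    return sum(1 for edge in edges if violation > edge)


def lexicographic_cost(
    violations: Sequence[float],
    edges_per_spec: Sequence[Sequence[float]],
    bits_per_spec: Sequence[int],
) -> int:
    """Pack levels into one integer, highest-priority specification first."""
    cost = 0
    for violation, edges, bits in zip(violations, edges_per_spec, bits_per_spec):
        level = quantize(violation, edges)
        if level >= (1 << bits):
            raise ValueError("not enough bits for this number of levels")
        cost = (cost << bits) | level  # shift earlier levels into higher bits
    return cost


# Two specifications, each with 4 levels (2 bits); finer bins near zero violation.
edges: List[Sequence[float]] = [(0.0, 0.1, 1.0), (0.0, 0.1, 1.0)]
bits = [2, 2]

# Trajectory A satisfies the top-priority spec but badly violates the second one.
cost_a = lexicographic_cost([0.0, 5.0], edges, bits)
# Trajectory B slightly violates the top-priority spec and satisfies the second one.
cost_b = lexicographic_cost([0.05, 0.0], edges, bits)

print(cost_a, cost_b)  # prints "3 4"
```

Because the higher-priority level occupies the more significant bits, no amount of violation of the second specification can outweigh even the smallest quantized violation of the first, so trajectory A receives the lower cost. A single-objective solver can then minimize this scalar directly.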

2606.12604 2026-06-12 cs.RO New submission

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Yangcen Liu, Shuo Cheng, Xinchen Yin, Woo Chul Shin, Alfred Cueva, Yiran Yang, Zhenyang Chen, Chuye Zhang, Danfei Xu

Affiliations: Georgia Institute of Technology; Tsinghua University

AI summary: Proposes the EgoEngine framework, which bridges the visual and action gaps to convert egocentric human videos into high-fidelity robot data, enabling zero-shot dexterous policy learning for the first time.

Abstract

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable action. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video that replaces the human with the robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.

2606.12759 2026-06-12 cs.RO New submission

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

Chengbo Yuan, Zicheng Zhang, Mingjie Zhou, Wendi Chen, Yi Wang, Zhuoyang Liu, Dantong Niu, Shuo Wang, Hui Zhang, Wenkang Zhang, Yingdong Hu, Yuanqing Gong, Wanli Xing, Chuan Wen, Cewu Lu, Kaifeng Zhang, Yang Gao

Affiliations: Tsinghua University; Shanghai Qi Zhi Institute; Sharpa; Shanghai Jiao Tong University; University of California, Berkeley; ETH Zurich; Fudan University; Shanghai Innovation Institute

AI summary: Proposes FTP-1, the first generalist foundation tactile policy. Using heterogeneous encoders and a shared Transformer expert, it is pretrained on 3,000 hours of data across 21 sensors, enabling cross-sensor transfer of tactile manipulation skills and raising the success rate on unseen sensors by 31%.

Abstract

Despite the success of vision-based generalist robotic policies, existing tactile-based policies remain tied to fixed embodiments and sensor setups. This is because tactile signals are highly heterogeneous across hardware, making cross-sensor generalization difficult. We present FTP-1, the first generalist foundation tactile policy pretrained to acquire transferable tactile manipulation abilities across diverse sensors and embodiments. FTP-1 supports varied tactile inputs, including image-, array-, and state-based signals, by using heterogeneous encoders to project them into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretrained on around 3,000 hours of tactile manipulation data aggregated from 26 data sources, spanning human and robot demonstrations across 21 sensors, FTP-1 learns tactile skills that transfer beyond the sensors seen during pretraining. Across downstream finetuning experiments spanning 5 hardware configurations, FTP-1 improves contact-rich manipulation on seen sensor setups by +17.2% and, surprisingly, transfers to two previously unseen tactile-sensor setups, achieving a +31% gain in success rate. FTP-1 establishes the first unified foundation baseline for tactile manipulation, providing future tactile policies with a shared model-level starting point. Pretrained models, datasets, training code, and more visualizations are available at https://ftp1-policy.github.io.

2606.13232 2026-06-12 cs.RO New submission

WT-UMI: Tactile-based Whole-Body Manipulation via Force-Supervised Contact-Aware Planning

Jaehwi Jang, Zhaoyuan Gu, Alfred Cueva, Zimeng Chai, Junjie Sheng, Thong Nguyen, Himank Galundia, Yifan Wu, Huishu Xue, Isaac Legene, Ojas Mediratta, Davin Doan, Andrew Collins, Sarah Sadegh, KyoungMok Kim, Rishita Dhalbisoi, Zun Chen, Ye Zhao

Affiliations: The Institute for Robotics and Intelligent Machines, Georgia Institute of Technology

AI summary: Proposes the WT-UMI system, which combines human demonstration and teleoperation data, uses a force-supervised planner to predict end-effector pose and contact-force trajectories, and applies a tactile admittance controller to improve whole-body manipulation performance.

Comments: 18 pages, 8 figures

Abstract

Whole-body humanoid manipulation of bulky, deformable, and shared-load objects requires distributed contact sensing and explicit force regulation, yet most imitation policies treat contact force only implicitly. On the other hand, different demonstration sources provide complementary modalities with inherent trade-offs: human demonstrations capture natural contact forces but not robot-executable actions, while teleoperation directly records robot actions but with less natural force regulation. This paper presents WT-UMI, a wearable whole-body tactile interface worn by human operators or mounted on humanoids, providing accurate observations of tactile images, contact forces, and end-effector poses across both human demonstration and humanoid teleoperation modes. We introduce a force-conditioned target-pose correction module that converts measured human poses into contact-aware robot targets by learning corrections from teleoperation data. To leverage the natural force interaction in human data, we propose a force-supervised planner that predicts end-effector pose chunks and contact-force trajectories. The predicted contact force serves as the reference for a tactile-based admittance controller. Across five contact-rich tasks spanning deformable objects, bulky rigid objects, and human-humanoid collaboration, WT-UMI improves success rate and reduces contact-position tracking error over four policy baselines. Our project page is available at https://wt-umi.github.io/WTUMI/.
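
To show how a predicted contact force can act as a controller reference, here is a minimal one-dimensional admittance step in Python. It is a textbook mass-damper admittance law integrated with a simple Euler scheme, not the paper's tactile-based controller: the virtual `mass` and `damping`, the sign convention, and the function name are all assumptions.

```python
from typing import Tuple


def admittance_step(
    pos_offset: float,
    vel: float,
    force_measured: float,
    force_ref: float,
    mass: float,
    damping: float,
    dt: float,
) -> Tuple[float, float]:
    """One Euler step of mass * acc + damping * vel = force_measured - force_ref."""
    force_error = force_measured - force_ref
    acc = (force_error - damping * vel) / mass
    vel = vel + acc * dt
    pos_offset = pos_offset + vel * dt
    return pos_offset, vel


# Measured force matches the reference: the commanded offset does not move.
print(admittance_step(0.0, 0.0, 5.0, 5.0, mass=1.0, damping=2.0, dt=0.01))

# Measured force exceeds the reference by 1 N: the offset starts to yield.
print(admittance_step(0.0, 0.0, 6.0, 5.0, mass=2.0, damping=0.0, dt=0.1))
```

The offset returned here would be added to the planned end-effector pose, so the robot gives way when contact force is above the reference and presses in when it is below, under the assumed sign convention.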

2606.13279 2026-06-12 cs.RO New submission

See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

Yoon-Ji Choi, Young-Chae Son, Soo-Chul Lim

Affiliations: Dongguk University

AI summary: Proposes a bimanual manipulation VLA framework based on dual-level structural decomposition, which handles visual relevance and bimanual interaction modes separately through view-selective routing and an action mixture-of-experts mechanism, improving success rate by 27.7% on simulated tasks and 43.3% on real-world tasks.

Abstract

In bimanual robotic manipulation, task-relevant visual information varies with the task stage and context, while the interaction of the two arms shifts between independent and coordinated modes, making policy learning challenging. However, existing monolithic Vision-Language-Action (VLA) policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway, often failing to separately account for visual relevance and bimanual interaction structure. To address this issue, we propose a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. The View-Selective Visual Router dynamically adjusts wrist-view contributions to emphasize relevant visual cues, while the Interaction-Aware Action Mixture-of-Experts (MoE) decomposes action generation into coordinated and arm-wise pathways to adapt to varying bimanual interaction modes. We evaluate the proposed method on six simulated bimanual manipulation tasks in RoboTwin 2.0 and three long-horizon real-world tasks. Our model improves the overall average success rate over a monolithic baseline by 27.7% in simulation and 43.3% in real-world evaluation, while consistently outperforming single-module variants across both settings. These results demonstrate that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.

2606.13394 2026-06-12 cs.RO New submission

GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

Xiangyu Zhu, Renjun Wu, Luzhou Ge, Jinyan Liu, Xuesong Li

Affiliations: Beijing Institute of Technology

AI summary: Proposes the GeoHAT framework, which injects geometric information through a lightweight Fourier spatial encoder and uses a hybrid whole-body action decoder to decompose arm and base actions, improving success rate on the ManiSkill-HAB benchmark by 23.7%.

Abstract

Whole-body mobile manipulation requires coordinating mobile base and manipulator under shifting viewpoints, posing challenges in geometric perception and action generation. Current policies either rely on 2D features or sparse 3D representations that lack dense spatial structure, and typically encode arm and base within one action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations under noisy depth while incurring heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Furthermore, real-world experiments on diverse tasks also confirm consistent improvements over all baselines.
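
The two perception ideas in the abstract, Fourier encoding of 3D coordinates and depth-validity gating, can be sketched with plain Python lists. The frequency schedule (powers of two times pi), the scalar gate, and the additive fusion are assumptions chosen for illustration; the abstract does not specify GeoHAT's actual encoder or gate, which would be learned modules operating on tensors.

```python
import math
from typing import List, Sequence


def fourier_encode(point: Sequence[float], num_bands: int) -> List[float]:
    """Map each coordinate to sin/cos features at frequencies 2**k * pi."""
    features: List[float] = []
    for coord in point:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


def gated_fuse(
    semantic: Sequence[float],
    geometric: Sequence[float],
    gate: float,
    depth_valid: bool,
) -> List[float]:
    """Add gated geometry to a semantic token; invalid depth closes the gate."""
    weight = gate if depth_valid else 0.0
    return [s + weight * g for s, g in zip(semantic, geometric)]


geo_token = fourier_encode((0.0, 0.0, 0.0), num_bands=2)
print(len(geo_token))  # 3 coordinates * 2 bands * (sin, cos) = 12

print(gated_fuse([1.0, 2.0], [10.0, 10.0], gate=0.5, depth_valid=True))
print(gated_fuse([1.0, 2.0], [10.0, 10.0], gate=0.5, depth_valid=False))
```

When the depth reading for a token is invalid, the fused token is exactly the original semantic feature, which reflects the abstract's principle that geometry should be injected only where it is reliable.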

2606.13601 2026-06-12 cs.RO cs.SY eess.SY New submission

UniDexTok: A Unified Dexterous Hand Tokenizer Grounded in Real Data

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

Affiliations: Fudan University; Hefei University of Technology; Rimbot; Beijing University of Posts and Telecommunications

AI summary: Proposes the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface, and builds on it UniDexTok, a retargeting-free state tokenizer that learns discrete tokens from real joint states, providing a unified representation for heterogeneous dexterous hands and reducing error by more than 98%.

Abstract

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
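
The reported error reductions follow directly from the before and after values in the abstract. The short Python check below recomputes them; the helper name is only for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


mpjae_reduction = relative_reduction(15.63, 0.16)  # degrees
mpjpe_reduction = relative_reduction(18.51, 0.18)  # millimeters

print(f"MPJAE: {mpjae_reduction:.2f}%")  # prints "MPJAE: 98.98%"
print(f"MPJPE: {mpjpe_reduction:.2f}%")  # prints "MPJPE: 99.03%"
```

The position figures also explain the scale claim: 18.51 mm is about 1.85 cm, while 0.18 mm is below one millimeter.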


2606.11767 2026-06-12 cs.RO cs.AI 版本更新

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

通过真实到仿真到真实触觉策略学习的盲操作灵巧抓取

Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

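Institutions * ShanghaiTech University; Beijing Institute for General Artificial Intelligence

AI summary: Proposes a framework combining Real2Sim tactile calibration, a layout-aware tactile encoder, and a tactile-conditioned diffusion policy for tactile-only blind grasping with a dexterous hand, reaching a 27% success rate across 20 objects on a real robot.

Comments 23 pages, 6 figures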

AI中文摘要

使用灵巧手进行盲抓取是一项关键的操作能力。然而，由于触觉的仿真到真实差距以及稀疏触觉信号的有限表达能力，为真实机器人学习这种仅依赖触觉的策略仍然具有挑战性。为了弥合这一差距，我们提出了一个仅依赖触觉的盲抓取框架，该框架可部署在物理多指机器人手上。我们的方法结合了三个关键组成部分。首先，我们引入了一个Real2Sim触觉校准流程，构建了一个接触校准的数字孪生模拟器，能够复现真实的触觉信号。其次，我们使用布局感知触觉编码器改进了稀疏触觉观测的表达能力，该编码器通过自监督预训练融入了传感器几何先验。第三，为了提高对未见物体的泛化能力，我们在校准后的模拟器中训练了特定物体的强化学习专家，并将其成功的抓取轨迹聚合为触觉条件扩散策略。我们在配备分布式触觉传感的物理LEAP手上评估了我们的方法，涉及10个见过和10个未见过的物体。部署的策略在所有20个物体上实现了27%的真实世界抓取成功率，无需真实世界的抓取演示或视觉输入。仿真消融实验表明，布局感知触觉预训练提高了抓取性能，而传感级评估确认Real2Sim校准增加了仿真与硬件之间触觉接触事件的一致性。这些结果表明，接触事件校准、几何感知触觉表示学习和基于扩散的策略聚合为真实灵巧机器人手上的仅触觉盲抓取提供了一条有效路径。项目页面：此HTTP URL。

英文摘要

Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: Dex-Blind-Grasp.github.io.


2606.12550 2026-06-12 cs.RO cs.AI 新提交

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Foresight: 关于导航关键线索的迭代推理

Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas

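Institutions * UT Austin; FieldAI

AI summary: Proposes Foresight, a framework in which a fine-tuned VLM alternates between proposing and critiquing image-space motion plans and is post-trained with reinforcement learning against a reward model learned from human feedback. It enables iterative motion refinement for mapless navigation from sparse language instructions and raises task success by 37%.

Comments 22 pages, 10 figures, 3 tables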

AI中文摘要

从稀疏语言指令进行开放世界无地图导航需要解决未明确指定的目标，并推断哪些环境线索与到达目标相关。例如，到达一个视野外的目的地可能需要解释坡道、标志或绕行路线，这些揭示了去哪里或走哪条路线。先前的工作受限于对已知导航因素和封闭集因素类别的依赖，或者在运动规划之前识别线索而遗漏了依赖于计划的线索。我们认为预训练的视觉语言模型（VLM）可以发现新的指令相关线索，但需要适应以关注哪些线索重要以及它们应如何影响运动规划。我们在Foresight中实现了这些想法，这是一个测试时框架，其中微调的VLM交替提出图像空间运动计划并使用语言目标和视觉上下文对其进行批评。后续计划基于先前的批评，使得在执行前能够进行迭代运动优化。为了将计划批评和优化与开放集行为偏好对齐，我们从人类反馈中学习一个奖励模型，并使用它在计划-批评循环中通过强化学习对VLM进行后训练。在离线评估和6个真实世界环境中，相对于最先进的测试时推理和基础模型基线，Foresight将平均任务成功率提高了37%，并将每次任务的干预次数减少了52%，同时在Jetson AGX Orin上实时运行。我们将发布代码、数据和训练细节，以支持未来关于机器人运动优化的测试时推理工作。更多视频请见：this https URL

英文摘要

Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: https://amrl.cs.utexas.edu/foresight


2606.12603 2026-06-12 cs.RO cs.AI 新提交

SemanticXR: 低功耗实时可查询语义建图与对象级设备-云架构

Rahul Singh, Devdeep Ray, Connor Smith, Sarita Adve

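AI summary: Proposes SemanticXR, described as the first device-cloud system to deliver real-time open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints, using object-level communication, execution, and memory management. Server-side mapping latency improves by 2.2x, and device power rises by only 2%.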

AI中文摘要

语义建图是新兴扩展现实（XR）应用（如AI助手和空间对象搜索）中实现具身交互的核心服务。在移动XR设备上部署此功能需要系统具备开放词汇、实时和低功耗特性。现有方法计算密集且假设服务器级资源。云卸载提供了一条实用路径，但现有系统未在设备-云边界拆分语义建图或管理其通信、执行和内存占用。我们提出SemanticXR，首个在XR功耗、带宽和内存约束下实现实时开放词汇语义建图与查询的设备-云系统。我们的关键洞察是将语义可识别对象提升为跨设备和服务器的通信、执行和内存的一级单元。在服务器端，对象级并行和几何下采样改善了建图延迟，而对象级深度建图协同设计降低了上行带宽。在设备端，具有增量更新和更新优先级的对象级稀疏局部地图实现了网络鲁棒的查询，并限制了内存和下行带宽。对象级可配置的资源使用与质量权衡让应用和系统分别根据应用需求和运行条件调整建图。与使用相同感知模型的设备-云基线相比，对象级组织在同等语义质量下将服务器端建图延迟提升了2.2倍。深度建图协同设计将上行带宽维持在2.5 Mbps以下。在设备端，SemanticXR即使在网络中断时也能为多达10,000个对象维持低于100 ms的查询延迟，在500 MB内支持数万个对象，并将下行带宽随地图变化而非总场景大小缩放。系统在正常运行时仅增加2%的设备功耗。

DiskChunGS：基于分块内存管理的大规模3D高斯SLAM

Casimir Feldmann, Maximum Wilder-Smith, Vaishakh Patil, Michael Oechsle, Michael Niemeyer, Keisuke Tateno, Marco Hutter

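Institutions * Robotic Systems Lab, ETH Zurich; Google

AI summary: Proposes DiskChunGS, which partitions the scene into spatial chunks and stores inactive regions on disk to get past GPU memory limits, enabling large-scale 3D Gaussian SLAM. It reconstructs full sequences on several datasets with improved visual quality.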

DOI: 10.1109/LRA.2026.3668704
Journal ref: IEEE Robotics and Automation Letters, vol. 11, no. 4, 2026

AI中文摘要

近期3D高斯溅射（3DGS）的进展在实时渲染的新视角合成中展现了令人印象深刻的结果。然而，将3DGS与SLAM系统集成面临根本的可扩展性限制：方法受限于GPU内存容量，只能重建小规模环境。我们提出DiskChunGS，一种可扩展的3DGS SLAM系统，通过一种外核方法克服这一瓶颈，该方法将场景划分为空间块，并在GPU内存中仅维护活跃区域，同时将非活跃区域存储在磁盘上。我们的架构与现有的用于位姿估计和闭环检测的SLAM框架无缝集成，实现大规模全局一致的重建。我们在室内场景（Replica、TUM-RGBD）、城市驾驶场景（KITTI）以及资源受限的Nvidia Jetson平台上验证了DiskChunGS。我们的方法独特地完成了所有11个KITTI序列，没有出现内存故障，同时实现了卓越的视觉质量，证明了算法创新可以克服先前限制3DGS SLAM方法的内存约束。

英文摘要

Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. We validate DiskChunGS on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.
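The out-of-core idea in the abstract (keep only active spatial chunks in fast memory, spill the rest to disk) can be sketched in a few lines of Python. This is a toy under stated assumptions, not the DiskChunGS implementation: the class name `ChunkStore`, the cubic chunk grid, the chunk-radius activity test, and the pickle file layout are all hypothetical, and host memory stands in for GPU memory.

```python
import math
import pickle
import tempfile
from itertools import product
from pathlib import Path


class ChunkStore:
    """Keeps chunks near the camera in memory and spills the rest to disk."""

    def __init__(self, disk_dir, chunk_size=10.0, active_radius=1):
        self.disk_dir = Path(disk_dir)
        self.chunk_size = chunk_size        # edge length of a cubic chunk (metres)
        self.active_radius = active_radius  # half-width of the active window, in chunks
        self.in_memory = {}                 # chunk key -> list of primitives

    def key_for(self, position):
        """Integer grid cell that contains a 3D position."""
        return tuple(math.floor(c / self.chunk_size) for c in position)

    def _path(self, key):
        return self.disk_dir / "chunk_{}_{}_{}.pkl".format(*key)

    def _load(self, key):
        path = self._path(key)
        if path.exists():
            with path.open("rb") as f:
                return pickle.load(f)
        return []

    def insert(self, position, primitive):
        key = self.key_for(position)
        if key not in self.in_memory:
            self.in_memory[key] = self._load(key)
        self.in_memory[key].append(primitive)

    def update_active(self, camera_position):
        center = self.key_for(camera_position)
        r = self.active_radius

        def is_active(key):
            return all(abs(k - c) <= r for k, c in zip(key, center))

        # Evict chunks that left the active window.
        for key in [k for k in self.in_memory if not is_active(k)]:
            with self._path(key).open("wb") as f:
                pickle.dump(self.in_memory.pop(key), f)

        # Reload chunks that re-entered the window and exist on disk.
        for offset in product(range(-r, r + 1), repeat=3):
            key = tuple(c + o for c, o in zip(center, offset))
            if key not in self.in_memory and self._path(key).exists():
                self.in_memory[key] = self._load(key)


with tempfile.TemporaryDirectory() as tmp:
    store = ChunkStore(tmp)
    store.insert((0.5, 0.0, 0.0), "gaussian-a")
    store.insert((25.0, 0.0, 0.0), "gaussian-b")
    store.update_active((0.0, 0.0, 0.0))
    near_origin = dict(store.in_memory)
    store.update_active((25.0, 0.0, 0.0))
    near_far_end = dict(store.in_memory)
```

With the camera at the origin only the first chunk stays resident; moving the camera to x = 25 m evicts it and reloads the second chunk from disk, so memory use tracks the active window and not the total map size.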


2603.00167 2026-06-12 cs.RO 版本更新

EgoMoD: Predicting Global Maps of Dynamics from Local Egocentric Observations

EgoMoD：从局部自我中心观测预测全局动态地图

Iacopo Catalano, David Morilla-Cabello, Jorge Pena-Queralta, Eduardo Montijano

Institutions * University of Turku, Finland; Centre for Artificial Intelligence, Zürich University of Applied Sciences, Winterthur, Switzerland; Instituto de Investigación en Ingeniería de Aragón, Universidad de Zaragoza, Spain

AI summary: Proposes EgoMoD, which uses short egocentric video clips and a pose-conditioned architecture to predict global Maps of Dynamics from local observations, replacing the global sensing infrastructure that earlier methods need and transferring zero-shot to real images.

AI中文摘要

在动态环境中高效导航需要预测机器人即时感知范围之外的运动模式演变，从而在拥挤场景中实现先发制人而非纯粹反应式规划。运动动态地图（MoDs）提供了空间中运动趋势的结构化表示，有助于长期全局规划，但传统上需要长时间全局环境观测来构建。我们提出EgoMoD，这是第一种学习直接从机器人操作期间收集的短时自我中心视频片段预测未来MoDs的方法。我们的方法使用视频和位姿条件架构，以从外部观测计算的MoDs作为特权监督进行训练，从而学习从局部动态线索推断环境范围的运动趋势，使局部观测成为全局运动结构的预测信号。因此，我们能够预测整个环境的未来运动动态，而不仅仅是扩展机器人视野中的过去模式。作为特定地点的动态先验，EgoMoD在推理时用标准车载传感器替代了先前MoD方法所需的外部全局感知基础设施。在大型模拟环境中的实验表明，EgoMoD能在有限可观测性下预测未来MoDs，而使用真实图像的评估展示了其对真实系统的零样本迁移能力。

英文摘要

Efficient navigation in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, enabling preemptive rather than purely reactive planning in crowded scenes. Maps of Dynamics (MoDs) offer a structured representation of motion tendencies in space useful for long-term global planning, but constructing them traditionally requires global environment observations over extended periods of time. We introduce EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. Our method learns to infer environment-wide motion tendencies from local dynamic cues using a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, allowing local observations to serve as predictive signals of global motion structure. Thanks to this, we offer the capacity to forecast future motion dynamics over the whole environment rather than merely extend past patterns in the robot's field of view. As a site-specific dynamic prior, EgoMoD replaces the external global sensing infrastructure required by prior MoD methods at inference time with standard onboard sensors. Experiments in large simulated environments show that EgoMoD predicts future MoDs under limited observability, while evaluation with real images showcases its zero-shot transferability to real systems.


2603.05965 2026-06-12 cs.RO cs.CV 版本更新

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

PROBE: 具有解析平移鲁棒性的概率占用BEV编码用于3D地点识别

Jinseop Lee, Byoungho Lee, Gichul Yoo

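Institutions * SK Intellix

AI summary: Proposes PROBE, a learning-free LiDAR place descriptor that analytically marginalizes over continuous translations via the polar Jacobian to obtain a distance-adaptive angular uncertainty, achieving high accuracy in cross-sensor generalization.

Comments 8 pages, 8 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L). © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses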

DOI: 10.1109/LRA.2026.3703245

AI中文摘要

我们提出PROBE（概率占用BEV编码），一种无学习的LiDAR地点识别描述符，将每个BEV单元的占用建模为伯努利随机变量。PROBE不依赖于离散点云扰动，而是通过极坐标雅可比解析边缘化连续笛卡尔平移，在O(R·S)时间内得到距离自适应角度不确定性σ_θ = σ_t / r。主要参数σ_t表示以米为单位的预期平移不确定性，这是一种与传感器无关的物理量，增强了跨传感器泛化能力，同时减少了对每个数据集大量调参的需求。成对相似性结合了伯努利-KL Jaccard与指数不确定性门控以及基于FFT的高度余弦相似性用于旋转对齐。在涵盖四种不同LiDAR类型的四个数据集上评估，PROBE在多会话评估中实现了手工描述符中最高的精度，并且在单会话性能上与手工和监督基线相比具有竞争力。源代码和补充材料可在该https URL获取。

英文摘要

We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R \cdot S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
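One way to read the relation between angular and translational uncertainty: for small angles the arc length at range r is roughly r times the angle, so a tangential displacement with standard deviation sigma_t spans about sigma_t / r radians. The Python sketch below only evaluates that relation on a polar grid. The function names, the 60-sector grid, and the ring radii are assumptions made for illustration, not details of the PROBE code.

```python
import math


def angular_sigma(sigma_t: float, r: float) -> float:
    """Angular std. dev. (radians) induced by a translational std. dev. sigma_t (m) at range r (m)."""
    if r <= 0:
        raise ValueError("range must be positive")
    return sigma_t / r


def sigma_in_sector_bins(sigma_t: float, r: float, num_sectors: int) -> float:
    """The same uncertainty in units of angular bins of width 2*pi/num_sectors."""
    return angular_sigma(sigma_t, r) / (2.0 * math.pi / num_sectors)


# Hypothetical grid: 60 sectors, ring centres at 2 m, 10 m and 50 m, sigma_t = 0.5 m.
for r in (2.0, 10.0, 50.0):
    print(r, angular_sigma(0.5, r), sigma_in_sector_bins(0.5, r, 60))
```

Near rings receive a wide angular spread and far rings a narrow one, which is what makes the uncertainty distance-adaptive while keeping a single physical parameter in meters.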


2507.22028 2026-06-12 cs.CV cs.RO 版本更新

From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

从看见到体验：通过强化学习扩展导航基础模型

Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou

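Institutions * University of California, Los Angeles; Coco Robotics

AI summary: Proposes the S2E framework, which combines offline video pretraining with reinforcement learning in simulation and uses anchor-guided distribution matching and a residual-attention module to improve the interactivity and safety of navigation foundation models.

Comments 27 pages, 20 figures, 9 tables, conference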

AI中文摘要

基于大规模网络数据训练的导航基础模型使智能体能够跨不同环境和实体进行泛化。然而，这些仅基于离线数据训练的模型往往缺乏推理其行为后果或通过反事实理解进行适应的能力。因此，它们在现实世界城市导航中面临重大限制，其中交互性和安全行为（如避开障碍物和移动行人）至关重要。为解决这些挑战，我们引入了从看见到体验（S2E）学习框架，通过强化学习扩展导航基础模型的能力。S2E结合了离线视频预训练和强化学习后训练的优势。它保持了从大规模真实世界视频中获得的模型泛化能力，同时通过模拟环境中的强化学习增强了其交互性。具体而言，我们引入了两项创新：（1）用于离线预训练的锚点引导分布匹配策略，通过基于锚点的监督稳定学习并建模多样化的运动模式；（2）用于强化学习的残差注意力模块，从模拟环境中获得反应性行为，同时不抹除模型的预训练知识。此外，我们建立了一个全面的端到端评估基准NavBench-GS，该基准基于真实世界场景的光照逼真3D高斯溅射重建，并融入了物理交互。它可以系统评估导航基础模型的泛化能力和安全性。

英文摘要

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning. It maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations: (1) an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and (2) a Residual-Attention Module for reinforcement learning, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models.


2605.31419 2026-06-12 cs.CV cs.RO 版本更新

Triangle Splatting SLAM

三角形泼溅SLAM

Nicholas Fry, Eric Dexheimer, Kirill Mazur, Paul H. J. Kelly, Andrew J. Davison

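Institutions * Software Performance Optimisation Group; Department of Computing

AI summary: Presents what the authors describe as the first dense RGB-D SLAM system to use differentiable triangles as the 3D map representation, performing tracking and mapping through online differentiable rendering and supporting on-the-fly mesh conversion and editing.

Comments 26 pages, 11 figures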

AI中文摘要

我们提出了一种密集RGB-D SLAM系统，使用可微三角形作为3D地图表示。虽然3D高斯泼溅已成为新颖视角合成的主要方法，但三角形仍然是传统渲染硬件、游戏引擎以及需要显式几何的下游任务（如模拟、碰撞和编辑）的标准图元。最近的离线方法表明，通过在一组带姿态的图像上进行Delaunay三角剖分，可以将非结构化的“三角形汤”优化为照片级逼真的网格。基于这一见解，我们提出了第一个密集SLAM系统，通过在线可微渲染三角形汤来执行跟踪和建图。地图可以通过受限Delaunay三角剖分实时转换为连通网格，从而实现网格变形和碰撞检测等新的在线功能。在Replica和TUM-RGBD数据集上，我们的系统在3D几何方面优于基线，匹配相机跟踪精度，并支持基于网格的在线场景编辑。

英文摘要

We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.


2606.12475 2026-06-12 cs.RO 新提交

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

学习辅助：面向隐式人机协作的协作式VLA模型

Leo Xu, Letian Li, Alex Cuellar, Michael Hagenow

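Institutions * University of Wisconsin–Madison; Massachusetts Institute of Technology

AI summary: Studies human-robot collaboration with vision-language-action (VLA) models trained by imitation learning, finds that action-chunking policies suffer from demonstration action leakage in implicit collaboration, proposes an inference-time steering method to curb premature assistance, and validates it in a user study.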

AI中文摘要

人机协作（HRC）结合了人类和机器人的互补优势，以提高任务效率。然而，许多现有的协作系统依赖于手工设计的流程，限制了其对新任务的可扩展性和灵活性。在这项工作中，我们展示了通过模仿学习进行端到端训练的模型，特别是视觉-语言-动作（VLA）模型，可以支持协作操作，并刻画了影响其真实世界性能的关键因素。我们评估了两种最先进的模型，并识别了隐式HRC中动作分块策略的一种失败模式，其中演示动作泄漏（即动作块跨越潜在任务转换）可能导致过早的辅助行为。我们发现，这个问题随着执行时域的增长而加剧，并在真实世界的协作VLA系统中出现，例如当机器人试图在人员准备好之前移交工具时。我们提出了一种推理时引导方法，以减轻这些错误的辅助动作，同时保持策略性能。最后，通过一项16名参与者在长时域协作组装任务上的用户研究，我们表明引导能够实现更长的执行时域，同时减轻过早辅助，与短时域策略相比，实现了更快的协作和更少的失败。

英文摘要

Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.
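The definition of demonstration action leakage (action chunks crossing a task transition) can be made concrete with a toy count. The Python sketch below assumes a demonstration whose timesteps carry explicit phase labels, which is a simplification: in the paper the transitions are latent. The function name `leak_fraction` and the example phases are hypothetical, and the code illustrates only the definition, not the proposed steering method.

```python
def leak_fraction(phases, horizon):
    """Fraction of length-`horizon` chunks that contain more than one phase label."""
    starts = range(len(phases) - horizon + 1)
    if len(starts) == 0:
        raise ValueError("horizon is longer than the demonstration")
    leaking = sum(1 for t in starts if len(set(phases[t:t + horizon])) > 1)
    return leaking / len(starts)


# Toy demonstration: the robot waits for 20 steps, then hands over a tool for 10 steps.
demo = ["wait"] * 20 + ["handover"] * 10

for horizon in (4, 8, 16):
    print(horizon, round(leak_fraction(demo, horizon), 3))
```

In this toy setting, longer chunks straddle the wait-to-handover boundary more often, which is consistent with the abstract's observation that the issue grows with the execution horizon.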


2606.12995 2026-06-12 cs.RO 新提交

基于EMG的各向异性虚拟夹具自适应方法用于机器人辅助手术切除与解剖

Dario Onfiani, Michael Dyck, Luigi Biagiotti, Julian Klodmann

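Institutions * University of Modena and Reggio Emilia; German Aerospace Center (DLR)

AI summary: Proposes a framework that adapts an anisotropic virtual fixture from EMG signals, adjusting the constraint in real time according to the inferred intent of the surgeon. Experiments indicate better surgical accuracy and movement consistency with lower cognitive load.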

AI中文摘要

本文针对机器人辅助腹腔镜手术中的精细任务（如切除和解剖），开发了一种自适应辅助系统。尽管虚拟夹具在引导外科医生运动方面具有显著优势，但传统虚拟夹具通常由固定几何形状定义，缺乏适应手术流程或外科医生即时意图的灵活性。为解决这些局限性，我们提出了一种自适应各向异性虚拟夹具的新框架。此外，我们引入了一种直观的控制接口，该接口基于从EMG信号推断的外科医生意图，实时调节夹具的几何形状。该方法允许外科医生通过收缩前臂肌肉动态扩展或解除约束，实现精确引导运动和工具自由重新定位之间的无缝切换。基于标准化手术训练任务的初步用户研究实验结果表明了所提方法的有效性。该系统在任务精度和运动一致性方面表现出显著改善，同时降低了感知认知负荷、努力和挫败感。

英文摘要

In this paper, we address the development of an adaptive assistance system for robot-assisted laparoscopic surgery, specifically for delicate tasks such as Resection and Dissection. Even if Virtual Fixtures offer significant advantages for guiding a surgeon's movements, conventional Virtual Fixtures are often defined by fixed geometries, lacking the flexibility to adapt to the surgical workflow or the surgeon's immediate intent. To address these limitations, we propose a novel framework for an adaptive and anisotropic virtual fixture. In addition, we introduce an intuitive control interface that modulates the fixture's geometry in real-time based on the surgeon's intent, inferred from EMG signals. This approach allows the surgeon to dynamically expand or disengage the constraint by contracting their forearm muscles, enabling seamless transitions between precise guided motion and free repositioning of the tool. Experimental results from a pilot user study, based on a standardized surgical training task, demonstrate the effectiveness of the proposed method. The system showed significant improvements in task accuracy and movement consistency, alongside a reduction in perceived cognitive load, effort, and frustration.


2606.13435 2026-06-12 cs.RO 新提交

GIVE: Grounding Human Gestures in Vision-Language-Action Models

GIVE：在视觉-语言-动作模型中接地人类手势

Pengfei Liu, Gen Li, Junqiao Fan, Boyu Ma, Jindou Jia, Yang Xiao, Jianfei Yang

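Institutions * MARS Lab, Nanyang Technological University

AI summary: Targets the inaccurate intent grounding that results when VLA models ignore gestures. GIVE strengthens gesture understanding through complementary visual and semantic pathways and, in real-world HRI experiments, improves target object recognition accuracy by 40% and task success rate by 80%.

Comments Project page: https://luis-cloud-sg.github.io/GIVE-project/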

AI中文摘要

人类交流本质上是多模态的，语言通常伴随着非语言线索（如手势）来传达意图。然而，当前的视觉-语言-动作（VLA）模型将机器人操作视为纯文本驱动的任务，忽视了手势在人机交互（HRI）中的重要作用。当语言指令模糊或不明确时，这往往导致意图接地不准确和操作不可靠。为了解决这一挑战，我们提出了GIVE（通过视觉-语义增强的手势意图），一种有效的方法，在不修改架构的情况下，用人类手势理解增强预训练的VLA模型。具体来说，GIVE通过两条互补的路径融入手势信息：一条视觉路径，将手部骨架和指尖射线叠加到机器人观测上，用于显式对象接地；一条语义路径，生成人类手势和任务指令的高级描述，用于鲁棒的意图接地。通过联合利用视觉和语义指导，GIVE使VLA策略能够更好地将手势与操作行为关联，并适应动态交互意图。在真实世界的HRI实验中，GIVE显著优于基线，目标对象识别准确率提升40%，整体任务成功率提升80%，同时展现出对未见空间布局和不同参与者的强大鲁棒性和泛化能力。

英文摘要

Human communication is inherently multimodal, where language is often accompanied by non-verbal cues such as gestures to convey intentions. However, current Vision-Language-Action (VLA) models treat robotic manipulation as a pure text-driven task, overlooking the important role of gestures in Human-Robot Interaction (HRI). This often leads to inaccurate intent grounding and unreliable manipulation when language instructions are ambiguous or underspecified. To address this challenge, we propose GIVE (Gesture Intent via Visual-Semantic Enhancement), an effective approach that enhances pre-trained VLA models with human gesture understanding without architectural modifications. Specifically, GIVE incorporates gesture information through two complementary pathways: a visual pathway that overlays hand skeletons and fingertip rays onto robot observations for explicit object grounding, and a semantic pathway that generates high-level descriptions of human gestures and task instructions for robust intent grounding. By jointly leveraging visual and semantic guidance, GIVE enables VLA policies to better associate gestures with manipulation behaviors and adapt to dynamic interaction intents. In real-world HRI experiments, GIVE substantially outperforms the baseline, improving target object recognition accuracy by 40% and overall task success rate by 80%, while demonstrating strong robustness and generalization to unseen spatial layouts and diverse participants.


2601.22090 2026-06-12 cs.RO 版本更新

ReactEMG Stroke: Healthy-to-Stroke Few-shot Adaptation for sEMG-Based Intent Detection

ReactEMG 中风：基于表面肌电图的意图检测的健康到中风少样本适应

Runsheng Wang, Katelyn Lee, Xinyue Zhu, Lauren Winterbottom, Dawn M. Nilsen, Joel Stein, Matei Ciocarlie

Institutions * Department of Mechanical Engineering, Columbia University in the City of New York; Department of Computer Science, Columbia University in the City of New York; Department of Rehabilitation and Regenerative Medicine, Columbia University Irving Medical Center

AI summary: Proposes a healthy-to-stroke adaptation pipeline that starts from a model pretrained on large-scale sEMG from healthy subjects and fine-tunes it with only a small amount of data from each stroke participant, improving the accuracy and robustness of intent detection.

AI中文摘要

表面肌电图（sEMG）是一种有前景的控制信号，用于中风后按需辅助手部康复，但从瘫痪肌肉检测意图通常需要长时间、特定于受试者的校准，并且对变异性很脆弱。我们提出了一种健康到中风的适应流程，该流程从在大规模健全受试者sEMG上预训练的模型初始化意图检测器，然后仅使用少量特定于受试者的数据为每个中风参与者进行微调。使用从三位慢性中风患者收集的新数据集，我们比较了适应策略（仅头部调优、参数高效的LoRA适配器和全端到端微调），并在包含现实分布偏移（如会话内漂移、姿势变化和臂带重新定位）的保留测试集上评估。在各种条件下，与相同数据预算下的零样本迁移和仅中风训练相比，健康预训练适应一致地改善了中风意图检测；最佳适应方法将平均转换准确率从0.42提高到0.61，原始准确率从0.69提高到0.78。这些结果表明，迁移可复用的健康域EMG表示可以减少校准负担，同时提高实时中风后意图检测的鲁棒性。我们的项目网站、视频、代码和数据集可在以下网址获取：this https URL。

英文摘要

Surface electromyography (sEMG) is a promising control signal for assist-as-needed hand rehabilitation after stroke, but detecting intent from paretic muscles often requires lengthy, subject-specific calibration and remains brittle to variability. We propose a healthy-to-stroke adaptation pipeline that initializes an intent detector from a model pretrained on large-scale able-bodied sEMG, then fine-tunes it for each stroke participant using only a small amount of subject-specific data. Using a newly collected dataset from three individuals with chronic stroke, we compare adaptation strategies (head-only tuning, parameter-efficient LoRA adapters, and full end-to-end fine-tuning) and evaluate on held-out test sets that include realistic distribution shifts such as within-session drift, posture changes, and armband repositioning. Across conditions, healthy-pretrained adaptation consistently improves stroke intent detection relative to both zero-shot transfer and stroke-only training under the same data budget; the best adaptation methods improve average transition accuracy from 0.42 to 0.61 and raw accuracy from 0.69 to 0.78. These results suggest that transferring a reusable healthy-domain EMG representation can reduce calibration burden while improving robustness for real-time post-stroke intent detection. Our project website, video, code, and dataset are available at: https://roamlab.github.io/reactemg-stroke/.


2606.12690 2026-06-12 cs.RO cs.AI 新提交

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

EWAM：一种用于具身智能闭环在线自适应的增强世界动作模型

Xin Zhou, Cong Miao

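Institutions * Astronex Robotics; Nanjing University of Information Science and Technology

AI summary: Proposes the EWAM architecture, which adds four lightweight neural layers to a frozen Cosmos3 backbone for zero-shot online adaptation without fine-tuning or extra demonstration data, aiming to reduce the deployment data needed for new task layouts.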

AI中文摘要

在本文中，我们提出了增强世界动作模型（EWAM），这是一种基于预训练且完全冻结的Cosmos3骨干网络构建的闭环在线自适应架构。EWAM完全在零样本任务协议下进行评估，其核心目标是减少适应新任务布局所需的额外部署数据量。值得注意的是，所有评估中均未引入额外的任务特定演示集，也未对骨干网络进行微调。其性能提升完全源于由四个插入的轻量级神经层组成的推理时协同推理机制：位于扩散变换器（DiT）中间层的神经经验记忆层提供任务相关的执行上下文；状态预测头之后的神经异常检测层实时监测预测状态与实际状态之间的差异；神经策略路由层根据异常严重程度动态选择直接执行、保守重规划或回滚恢复；神经动作校正层利用执行诊断优化生成的动作块。与简单的特征融合不同，记忆、异常检测和校正模块以可微分的方式深度集成到Cosmos3的前向路径中，仅最终路由决策是离散监督的。

英文摘要

In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.
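The control flow described for the anomaly detection and policy routing layers (compare predicted and actual state, then choose among direct execution, conservative replanning, and rollback recovery) can be sketched as follows. In the abstract both components are learned neural layers; this Python sketch swaps in a Euclidean distance and hand-set thresholds purely to make the three-way routing concrete. The names `anomaly_score` and `route` and the threshold values are hypothetical.

```python
import math
from enum import Enum


class Route(Enum):
    EXECUTE = "direct execution"
    REPLAN = "conservative replanning"
    ROLLBACK = "rollback recovery"


def anomaly_score(predicted, actual):
    """Euclidean distance between predicted and observed state vectors."""
    if len(predicted) != len(actual):
        raise ValueError("state vectors must have the same length")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)))


def route(score, replan_threshold=0.05, rollback_threshold=0.20):
    """Pick a recovery mode from anomaly severity (thresholds are illustrative)."""
    if score >= rollback_threshold:
        return Route.ROLLBACK
    if score >= replan_threshold:
        return Route.REPLAN
    return Route.EXECUTE


print(route(anomaly_score([0.0, 0.0], [0.0, 0.01])))
print(route(anomaly_score([0.0, 0.0], [0.3, 0.4])))
```

Small deviations leave the generated action chunk untouched, moderate ones trigger replanning, and large ones trigger rollback, so the discrete decision is the only non-differentiable step, matching the description in the abstract.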


2606.13049 2026-06-12 cs.RO 新提交

MaskWAM：统一掩码提示与预测的世界-动作模型

Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, Ping Tan

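Institutions * The Hong Kong University of Science and Technology; Tencent Robotics X; Tsinghua University

AI summary: Proposes MaskWAM, whose Mixture-of-Transformers architecture treats masks as both inputs and predictions to address the spatial bottleneck of world-action models and improve policy generalization, outperforming baselines on LIBERO and other tasks.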

AI中文摘要

世界-动作模型（WAMs）通过视频预测为机器人控制提供了一种有前景的范式。然而，当前的WAMs存在根本性的空间瓶颈：标准文本输入在杂乱场景中引入指代歧义，而非结构化的RGB预测缺乏语义基础，并受任务无关背景的偏差影响。为克服这些限制，我们引入了MaskWAM，一种以对象为中心的世界-动作模型。通过统一的混合Transformer（MoT）将掩码同时作为显式输入和预测进行联合集成，MaskWAM实现了鲁棒的策略泛化。该设计提供两个关键优势：（1）预测未来掩码产生以对象为中心的语义监督，抑制视觉噪声，显著增强甚至标准文本条件的WAMs；（2）将此预测监督与第一帧视觉提示（如目标对象掩码）耦合，建立精确的空间锚点，大幅减少语言歧义。关键在于，由于WAMs本质上是视觉驱动的架构，直接掩码条件化比单独文本提供更强的引导，为操作未见对象建立了精确且鲁棒的范式。在LIBERO、RoboTwin和真实世界任务上的评估表明，MaskWAM在语言清晰和语言模糊任务中均显著优于基线。

英文摘要

World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.


2602.04208 2026-06-12 cs.RO cs.AI cs.LG 版本更新

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

SCALE: 基于自不确定性条件自适应观察与执行的视觉-语言-动作模型

Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi

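Institutions * Seoul National University

AI summary: Proposes SCALE, an inference strategy that uses self-uncertainty to jointly modulate visual perception and action. It needs no additional training or verifier and only a single forward pass, and it improves the robustness of VLA models in simulated and real-world settings.

Comments ICML 2026 Spotlight. Project page: https://dcahn12.github.io/projects/scale/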

AI中文摘要

视觉-语言-动作（VLA）模型已成为通用机器人控制的一种有前景的范式，测试时缩放（TTS）在增强训练外鲁棒性方面受到关注。然而，现有的VLA TTS方法需要额外训练、验证器和多次前向传播，使其部署不切实际。此外，它们仅干预动作解码，而保持视觉表示固定——在感知模糊的情况下不足，此时重新考虑如何感知与决定做什么同样重要。为解决这些限制，我们提出SCALE，一种简单的推理策略，基于“自不确定性”联合调节视觉感知和动作，受主动推理理论中不确定性驱动探索的启发——无需额外训练、无需验证器，且仅需单次前向传播。SCALE在高不确定性下拓宽感知和动作的探索，而在自信时聚焦于利用——实现在不同条件下的自适应执行。在模拟和真实世界基准上的实验表明，SCALE改进了最先进的VLA模型，并优于现有TTS方法，同时保持单次前向传播的效率。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory, requiring no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.
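The explore-when-uncertain, exploit-when-confident behaviour can be illustrated on the action side with a small Python sketch. The abstract does not say how SCALE measures self-uncertainty or how it modulates perception, so the choices here are assumptions: uncertainty is taken as the normalized entropy of a single softmax over action logits, and it is mapped linearly to a sampling temperature. The function names and the temperature range are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two outcomes")
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def adaptive_temperature(logits, t_min=0.2, t_max=1.5):
    """Low temperature when the model is confident, high when it is uncertain."""
    u = normalized_entropy(softmax(logits))
    return t_min + u * (t_max - t_min)


print(adaptive_temperature([10.0, 0.0, 0.0, 0.0]))  # confident: close to t_min
print(adaptive_temperature([0.0, 0.0, 0.0, 0.0]))   # uniform: close to t_max
```

Because the uncertainty estimate reuses the logits of the forward pass that is already being run, a rule of this kind adds no extra passes, which is the efficiency property the abstract emphasizes.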

2510.03896 2026-06-12 cs.CV cs.RO version update

GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert

GAE: 利用可泛化动作专家释放VLM的物理潜力

Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

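Affiliations * University of California, Berkeley; Stanford University

AI summary: Proposes the Generalizable Action Expert (GAE), which converts a VLM's high-level intent into continuous action trajectories through a sparse geometric interface. An Action Pre-training, Pointcloud Fine-tuning (APPF) scheme decouples action dynamics from geometry grounding, giving strong generalization across visual domains, camera viewpoints, and instructions.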

AI中文摘要

视觉语言模型展示了强大的推理和规划能力，但将这些预测转化为精确的机器人动作仍是一个核心挑战。现有的视觉-语言-动作方法通常将推理和动作生成纠缠在一起，导致泛化能力有限。我们提出了通用动作专家（GAE），一个任务无关的模型，将稀疏几何规划转化为密集的机器人动作。我们的方法引入了一个稀疏几何接口：VLM预测代表高层意图的稀疏3D路点，而GAE将这些路点与实时点云观测一起映射到连续动作轨迹。GAE在一个包含来自仿真和真实世界机器人的15万条轨迹的大规模点云-轨迹数据集上进行预训练。为了进一步提高效率和泛化能力，我们引入了动作预训练-点云微调（APPF）方案，将学习动作动力学与几何基础解耦。预训练后，GAE被冻结并在下游任务中重用，只需对VLM进行轻量级微调以生成稀疏接口。实验表明，我们的方法在多样化的视觉域、相机视角和自然语言指令下实现了强大的性能和泛化能力。

英文摘要

Vision-language models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization. We propose Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud-trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.

2606.01621 2026-06-12 cs.CV cs.RO version update

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Goal2Pixel: 将目标锚定到像素以实现视觉语言导航

Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang

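Affiliations * Carnegie Mellon University; Nanyang Technological University

AI summary: Proposes the Goal2Pixel paradigm, which reformulates vision-and-language navigation in continuous environments (VLN-CE) as navigable pixel grounding. Using the image plane as a unified spatial interface, it predicts a visible navigable pixel and back-projects it into a 3D waypoint. Combined with a visibility-aware keyframe memory and coordinate-aware auxiliary losses, it reaches competitive performance with fewer VLM calls.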

Comments 8 pages

AI中文摘要

视觉语言模型（VLM）已成为连续环境中视觉语言导航（VLN-CE）的常见基础。然而，大多数基于VLM的方法将导航视为低级动作预测，这种接口模糊、受限于短视运动基元，且由于重复的VLM查询而效率低下。我们提出Goal2Pixel，一种纯基于像素的范式，将VLN-CE重新定义为可导航像素锚定。Goal2Pixel不预测动作，而是使用图像平面作为VLM推理与机器人运动之间的统一空间接口：模型预测一个对智能体可见的可导航像素，该像素被反投影为3D航点以进行前向导航。对于非前向动作，我们在图像平面上附加辅助指令区域，其中左/右/下区域分别解释为左转、右转和停止。为了实现长程导航，我们提出了一种可见性感知的关键帧记忆，用于紧凑且信息丰富的历史表示。为了将预训练的VLM适应于可导航像素锚定，我们引入了语义嵌入和坐标感知辅助损失。Goal2Pixel在需要比先前方法更少的VLM推理调用的情况下，实现了具有竞争力的最新性能。在R2R-CE Val-Unseen上，它以每集仅7.75次VLM调用达到54.1%的SR和52.5%的SPL，而直接动作预测在32.9%的SR下需要46.62次调用，减少了6倍。同样的趋势在RxR-CE上也成立。项目页面：https://baobao0926.github.io/Goal2Pixel/。

英文摘要

Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a purely pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a navigable pixel visible to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE. Project page: https://baobao0926.github.io/Goal2Pixel/.
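
The back-projection step mentioned in the abstract can be sketched with the standard pinhole camera model. This is a minimal illustration rather than the authors' code: the function name `backproject_pixel`, the camera intrinsics, and the depth value are assumptions made up for the example. The last line simply re-derives the reported "6x fewer calls" figure from the two call counts quoted above.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into a camera-frame 3D point.

    Standard pinhole model; axes are x right, y down, z forward.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical intrinsics for a 640x480 image and a depth reading of 2 m.
waypoint = backproject_pixel(u=400, v=300, depth=2.0,
                             fx=320.0, fy=320.0, cx=320.0, cy=240.0)

# Ratio of VLM calls per episode quoted in the abstract (46.62 vs 7.75).
call_ratio = 46.62 / 7.75
```

Here the chosen pixel lies 0.5 m to the right of and 0.375 m below the optical axis at 2 m depth, and the call ratio comes out at roughly 6.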

2606.12614 2026-06-12 cs.RO new submission

DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

DARRMS——资源受限多智能体系统中动态注意力半径的高效算法

Benjamin Alcorn, Eman Hammad

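Affiliations * Texas A&M University

AI summary: Proposes the DARRMS algorithm, which jointly optimizes the attention radius and decision-making to reduce computational demand under resource constraints while improving coordination and scalability.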

AI中文摘要

多智能体系统是机器人、网络安全和自动驾驶规划等领域不可或缺的工具。这类系统通常面临计算资源约束，需要高效的轻量级算法。传统决策框架常假设理想条件（如完全可观测性和无限计算能力），这与现实挑战不符。本文提出一种新算法，在不显著牺牲其他性能指标的前提下，降低对计算资源的需求。智能体将可观测性限制在某个注意力半径内，从而有意识地忽略对行动规划可能不必要的环境部分。通过同时优化注意力半径和决策，我们的方法在不确定环境中增强了协调性和可扩展性。通过理论分析和实证验证，我们证明了自适应观测在资源受限系统中提升系统性能并维持稳健决策策略的有效性。

英文摘要

Multi-agent systems are integral tools in domains such as robotics, cybersecurity, and autonomous vehicle planning. These systems often face constraints on computational resources, creating a need for efficient, lightweight algorithms. Traditional decision-making frameworks often assume ideal conditions, such as full observability and unlimited computational capacity, which do not align with real-world challenges. In this paper, we introduce a new algorithm that reduces the demand on computational resources without a large cost to other performance metrics. Agents limit their observability to some attention radius, which intentionally allows them to ignore parts of the environment that might be unnecessary for action planning. By optimizing both the attention radius and decision-making, our approach enhances coordination and scalability in uncertain environments. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of adaptive observation in improving system performance and maintaining robust decision-making strategies in resource-constrained systems.
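
The idea of restricting observability to an attention radius can be illustrated with a simple distance filter. This sketch covers only the filtering step and says nothing about how DARRMS actually chooses or optimizes the radius; the function name `observe_within_radius` and the toy positions are assumptions for illustration.

```python
import math

def observe_within_radius(agent_pos, other_positions, radius):
    """Return only the positions within `radius` of the agent (Euclidean distance)."""
    return [p for p in other_positions if math.dist(agent_pos, p) <= radius]

agent = (0.0, 0.0)
others = [(1.0, 0.0), (3.0, 4.0), (0.0, 2.0), (-6.0, 0.0)]

near = observe_within_radius(agent, others, radius=2.5)   # small radius, fewer inputs
wide = observe_within_radius(agent, others, radius=5.0)   # larger radius, more inputs
```

A smaller radius shrinks the set of entities the planner has to process, which is where the computational saving would come from; the cost is that distant but relevant agents may be ignored.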

2606.12640 2026-06-12 cs.LG cs.RO cs.SY eess.SY cross-list

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

个体控制障碍函数引导的扩散模型用于安全离线多智能体强化学习

Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi

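Affiliations * Department of Electrical Engineering and Automation, Aalto University; School of Computing and Data Science, Xiamen University Malaysia; Department of Computer Science, University of Toronto

AI summary: Proposes an offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into a diffusion model and recovers the control policy through inverse dynamics, substantially improving the safety of generated trajectories while preserving reward.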

Comments Accepted to the 23rd IFAC World Congress, 2026

2606.13068 2026-06-12 cs.MA cs.RO cross-list

Diffusion Transformer World-Action Model for AV Scene Prediction

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

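Affiliations * Stanford University

AI summary: Proposes a compact latent world model that uses a Diffusion Transformer (DiT) to predict future scenes, achieving a 4.8x better KID on nuScenes and action controllability (steering ρ = 0.81).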

Comments 10 pages, 9 figures, 2 tables

AI中文摘要

动作条件世界模型使自动驾驶车辆能够根据自身规划的控制预测未来摄像头场景，从而无需真实世界部署即可进行规划和仿真，但在紧凑、可训练的规模下，未来具有模糊性，且该领域的标准失真度量具有误导性：它们奖励模糊的回归均值而非逼真的预测。我们通过一个紧凑的潜世界模型应对这一问题，该模型给定当前前摄像头潜变量和一系列自我动作，预测未来场景潜变量，由冻结解码器渲染为$256 \times 256$帧，最多提前8秒，在150个保留的nuScenes场景上评估。我们首先基准测试预测位置：在跨越四个表示族的六个冻结编码器中，具有时间上下文的V-JEPA2将转向RMSE比最佳单帧编码器降低40%。然后我们训练一个潜扩散Transformer（DiT），并通过受控诊断识别其所需的四个要素：空间token、$x_0$目标、残差锚定以及与目标不确定性匹配的采样。在Stable-Diffusion-VAE编码-预测-解码流水线中，我们揭示了核心矛盾：失真度量（余弦相似度、SSIM）倾向于模糊均值，掩盖了扩散模型更接近真实帧分布的事实。基于Inception的FID和KID揭示了清晰的感知-失真边界：扩散模型达到KID 0.078，而回归为0.375（好4.8倍），且可部署的训练校准使其无需测试时真实值即可实用。该模型真正具有动作可控性（转向驱动场景位移，Spearman $\rho = 0.81$，而回归为$-0.18$）。我们将有限的单次运动归因于共享当前锚点，并设计了一个紧凑的170万参数“跳跃”模型，恢复完整的真实运动幅度（$1.02\times$ GT），而单次模型捕获不到一半。

英文摘要

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents that a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.
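
Two of the numbers in the abstract are easy to make concrete. The sketch below re-derives the "4.8x better" KID ratio from the two reported values and shows how a Spearman rank correlation of the kind quoted for steering versus scene displacement is computed. The `steering` and `displacement` lists are made-up toy data, not results from the paper, and the `ranks` helper assumes there are no tied values.

```python
def ranks(values):
    """1-based ranks of the values; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation via the no-ties closed form."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# KID values quoted in the abstract: regression vs diffusion.
kid_ratio = 0.375 / 0.078

# Toy data only: steering commands and the resulting scene displacement.
steering = [-0.4, -0.1, 0.0, 0.2, 0.5]
displacement = [-3.0, -1.2, 0.3, 0.1, 4.0]
rho = spearman_rho(steering, displacement)
```

A rank correlation near 1 means larger steering commands consistently produce larger displacements, which is the sense in which the abstract calls the model action-controllable.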

2511.11022 2026-06-12 cs.RO version update

Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving

用于验证多智能体协同自动驾驶的微型测试平台

Hyunchul Bae, Eunjae Lee, Jehyeop Han, Minhee Kang, Jaehyeon Kim, Junggeun Seo, Minkyun Noh, Heejin Ahn

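Affiliations * School of Electrical Engineering; School of Mechanical Engineering; Korea Advanced Institute of Science and Technology

AI summary: Proposes CIVAT, a miniature testbed that integrates V2V/V2I communication with the ROS2 framework and validates cooperative autonomous driving functions through infrastructure-based perception and intersection management experiments.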

Comments Accepted by ICRA 2026, 8 pages

AI中文摘要

协同自动驾驶通过实现车辆与智能路侧基础设施之间的实时协作来扩展车辆自主性，仍然是一个具有挑战性但至关重要的问题。然而，现有的测试平台均未采用配备感知、边缘计算和通信能力的智能基础设施。为填补这一空白，我们设计并实现了一个1:15比例的微型测试平台CIVAT，用于验证协同自动驾驶，该平台包括一个缩小的城市地图、配备车载传感器的自动驾驶车辆以及智能基础设施。所提出的测试平台通过共享Wi-Fi和ROS2框架，以发布-订阅模式集成V2V和V2I通信，实现车辆与基础设施之间的信息交换，从而达成协同驾驶功能。作为案例研究，我们通过基于基础设施的感知和交叉口管理实验验证了该系统。

英文摘要

Cooperative autonomous driving, which extends vehicle autonomy by enabling real-time collaboration between vehicles and smart roadside infrastructure, remains a challenging yet essential problem. However, none of the existing testbeds employ smart infrastructure equipped with sensing, edge computing, and communication capabilities. To address this gap, we design and implement a 1:15-scale miniature testbed, CIVAT, for validating cooperative autonomous driving, consisting of a scaled urban map, autonomous vehicles with onboard sensors, and smart infrastructure. The proposed testbed integrates V2V and V2I communication with the publish-subscribe pattern through a shared Wi-Fi and ROS2 framework, enabling information exchange between vehicles and infrastructure to realize cooperative driving functionality. As a case study, we validate the system through infrastructure-based perception and intersection management experiments.

2606.12236 2026-06-12 cs.RO cs.CV version update

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

DrivingAgent: 自动驾驶系统的设计与调度智能体

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

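Affiliations * Wangxuan Institute of Computer Technology, Peking University; University of California, Merced

AI summary: Proposes the DrivingAgent framework, which addresses the challenges of integrating new models and meeting real-time constraints in autonomous driving systems through automated module development (design phase) and real-time scheduling by a lightweight LLM trained with reinforcement learning (scheduling phase), achieving a better speed-accuracy trade-off on nuScenes and Bench2Drive.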

AI中文摘要

许多自动驾驶系统越来越多地整合基础模型以提高泛化能力并处理长尾场景。然而，这一趋势带来了两个关键挑战：（i）设计和集成新模型的手动且劳动密集型过程，以及（ii）缺乏智能、动态的调度机制以满足严格的实时约束。虽然基于大语言模型（LLM）的智能体为自动化提供了有前景的途径，但现有框架并不适合自动驾驶。具体来说，它们未能区分系统设计和实时调度的根本不同需求，将模块视为不透明的黑盒，并且并非为持续运行而设计。为了解决这些局限性，我们提出了DrivingAgent，这是一个针对自动驾驶系统设计和调度双重挑战的新型智能体框架。在设计阶段，DrivingAgent通过解释系统架构、生成代码以及通过超网络训练验证模块来自动化模块开发。在调度阶段，它采用一个通过强化学习训练的轻量级LLM来实时动态编排系统模块，并由一个集成长期存储与带时间戳短期上下文的结构化记忆支持。实验结果表明，DrivingAgent在nuScenes和Bench2Drive基准测试上实现了更优的速度-精度权衡。

英文摘要

Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed-accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

2606.13352 2026-06-12 cs.RO new submission

From Digital to Physical: Digital Agents as Autonomous Coaches for Physical Intelligence

Zixing Lei, Genjia Liu, Yuanshuo Zhang, Qipeng Liu, Yuzhu Cai, Sixiang Chen, Jixian Wu, Yunhong Wang, Weixin Li, Chuan Wen, Bo Zhao, Shanghang Zhang, Wenzhao Lian, Siheng Chen

Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; Zhongguancun Academy, Beijing, China; School of Integrated Circuits, Shanghai Jiao Tong University, Shanghai, China; School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China

AI summary: Proposes the EmboCoach-Bench benchmark, which evaluates the ability of LLM agents to autonomously engineer embodied policies. Through iterative debugging and optimization, agents surpass human-engineered baselines by 26.5% in average success rate and show self-correction capabilities.

Comments 53 pages, 12 figures

AI中文摘要

具身AI领域正朝着通用机器人系统快速发展，得益于高保真模拟和大规模数据收集。然而，这种扩展能力仍然受到劳动密集型人工监督的严重瓶颈，从复杂的奖励塑造到跨异构后端的超参数调整。受LLM在软件自动化和科学发现中成功的启发，我们引入了EmboCoach-Bench，一个评估LLM代理自主设计具身策略能力的基准。涵盖32个专家精选的RL和IL任务，我们的框架将可执行代码作为通用接口。我们超越静态生成，评估动态闭环工作流，其中代理利用环境反馈迭代地起草、调试和优化解决方案，涵盖从物理信息奖励设计到扩散策略等策略架构的改进。广泛评估得出三个关键见解：（1）自主代理在平均成功率上可以定性超越人工设计的基线26.5%；（2）具有环境反馈的代理工作流有效增强了策略开发，并显著缩小了开源和专有模型之间的性能差距；（3）代理对病态工程案例表现出自我修正能力，通过迭代仿真循环调试成功从近乎完全失败中恢复任务性能。最终，这项工作为自我进化的具身智能奠定了基础，加速了具身AI领域从劳动密集型手动调优到可扩展自主工程的范式转变。

英文摘要

The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight, from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs' success in software automation and scientific discovery, we introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5% in average success rate; (2) an agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in the embodied AI field.

2606.13203 2026-06-12 cs.RO new submission

Embedding ISO 10218 Safety Compliance in Robots via Control Barrier Functions for Human-Robot Collaboration

通过控制障碍函数将ISO 10218安全合规性嵌入机器人以实现人机协作

Federico Parma, Cesare Tonola, Nicola Pedrocchi, Manuel Beschi

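Affiliations * Dept. of Electrical and Information Engineering, Polytechnic of Bari; Dipartimento di Ingegneria Meccanica e Industriale, University of Brescia; Institute of Intelligent Industrial Technologies and Systems, National Research Council of Italy, STIIMA-CNR

AI summary: Proposes a Control Barrier Function (CBF) method that uses human acceleration data to predict the minimum human-robot separation distance and enforces the safety constraint within a Sequential Quadratic Programming (SQP) framework. In experiments on a UR10e, Method II reduces mean trajectory error by 63% relative to Method I while complying with ISO 10218.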

AI中文摘要

人机协作（HRC）需要严格遵守安全标准（如ISO 10218），以防止有害交互。标准的速度与分离监控（SSM）滤波器基于保守假设（如人体速度恒定）计算安全机器人速度，这阻碍了对最小分离距离的准确预测，并导致不必要的操作停止。本文提出一种控制障碍函数（CBF），明确纳入人体加速度数据，以在机器人最坏情况制动轨迹期间解析地前向预测最小人机分离距离。为保证控制层面的安全性，该预测性CBF作为不等式约束被集成到序列二次规划（SQP）框架中。具体地，提出了两种方法：方法I，一种CBF约束的PD安全滤波器；方法II，一种执行空间管约束的任务缩放SQP控制器。在UR10e机器人上的仿真和实际实验评估了两种方法相对于标准工业SSM模块基线的性能。结果表明，方法II动态调节执行速度并限制空间偏差。与方法I相比，方法II在平均轨迹误差上减少了63%，并避免了过度规避动作，在遵守ISO 10218 SSM指南的同时确保了高任务吞吐量。

英文摘要

Human-Robot Collaboration (HRC) requires strict adherence to safety standards, such as ISO 10218, to prevent harmful interactions. Standard Speed and Separation Monitoring (SSM) filters calculate safe robotic speeds based on conservative assumptions, such as constant human velocity, which prevents accurate predictions of minimum separation distances and causes unnecessary operational halts. This paper proposes a Control Barrier Function (CBF) that explicitly incorporates human acceleration data to analytically forward-predict the minimum human-robot separation distance during a worst-case robotic stopping trajectory. To guarantee safety at the control level, this predictive CBF is integrated as an inequality constraint within a Sequential Quadratic Programming (SQP) framework. Specifically, two methods are proposed: Method I, a CBF-constrained PD safety filter; and Method II, a task-scaling SQP controller that enforces a spatial tube constraint. Simulated and real-world experiments on a UR10e robot evaluate the two proposed methods against a standard industrial SSM module baseline. Results demonstrate that Method II dynamically modulates execution speed and confines spatial deviations. Compared to Method I, Method II achieves a 63% reduction in mean trajectory error and avoids excessive evasive manoeuvres, ensuring high task throughput while complying with ISO 10218 SSM guidelines.

2501.04823 2026-06-12 cs.RO math.OC stat.AP version update

Learning Robot Safety from Sparse Human Feedback using Conformal Prediction

基于共形预测从稀疏人类反馈中学习机器人安全

Aaron O. Feldman, Joseph A. Vincent, Maximilian Adang, JunEn Low, Mac Schwager

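Affiliations * Department of Aeronautics and Astronautics, Stanford University

AI summary: Uses binary human feedback on policy trajectories and conformal prediction to identify a region of states containing a user-specified fraction of future policy errors, yielding a warning system with a guaranteed miss rate that is then used to improve the safety of a model predictive controller.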

AI中文摘要

确保机器人安全可能具有挑战性；用户定义的约束可能遗漏边缘情况，策略即使从安全数据训练也可能变得不安全，并且安全可能是主观的。因此，我们通过向标记不安全行为的人类展示策略轨迹来学习机器人安全。从这种二元反馈中，我们使用共形预测的统计方法识别一个状态区域（可能在学习的潜在空间中），保证包含用户指定比例的未来策略错误。我们的方法是样本高效的，因为它基于最近邻分类，避免了共形预测中常见的保留数据。通过提醒机器人是否到达可疑的不安全区域，我们获得了一个模拟人类安全偏好且具有保证漏检率的预警系统。通过视频标注，我们的系统可以检测四旋翼视觉运动策略何时无法通过指定门。我们提出了一种通过避免可疑不安全区域来改进策略的方法。通过它，我们提高了模型预测控制器的安全性，这在30次四旋翼飞行跨越6个导航任务的实验测试中得到了证明。提供了代码和视频。

英文摘要

Ensuring robot safety can be challenging; user-defined constraints can miss edge cases, policies can become unsafe even when trained from safe data, and safety can be subjective. Thus, we learn about robot safety by showing policy trajectories to a human who flags unsafe behavior. From this binary feedback, we use the statistical method of conformal prediction to identify a region of states, potentially in learned latent space, guaranteed to contain a user-specified fraction of future policy errors. Our method is sample-efficient, as it builds on nearest neighbor classification and avoids withholding data as is common with conformal prediction. By alerting if the robot reaches the suspected unsafe region, we obtain a warning system that mimics the human's safety preferences with guaranteed miss rate. From video labeling, our system can detect when a quadcopter visuomotor policy will fail to steer through a designated gate. We present an approach for policy improvement by avoiding the suspected unsafe region. With it we improve a model predictive controller's safety, as shown in experimental testing with 30 quadcopter flights across 6 navigation tasks. Code and videos are provided.
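
The "guaranteed miss rate" idea can be illustrated with the textbook split conformal quantile. Note that this is a swapped-in, simpler technique: the abstract says the authors build on nearest neighbor classification and avoid withholding data, whereas the sketch below uses a held-out calibration set. The function names and the integer calibration scores (for example, distances in centimetres from each held-out unsafe state to its nearest labelled unsafe state) are assumptions for illustration.

```python
import math

def conformal_threshold(cal_scores, miss_rate):
    """Split conformal quantile: the k-th smallest calibration score with
    k = ceil((n + 1) * (1 - miss_rate)). Under exchangeability, a new unsafe
    state scores at or below this value with probability >= 1 - miss_rate."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - miss_rate))
    if k > n:
        return math.inf  # too few samples for this guarantee: flag everything
    return sorted(cal_scores)[k - 1]

def is_flagged(score, threshold):
    """Raise a warning when the state's score falls inside the suspected unsafe region."""
    return score <= threshold

# Hypothetical calibration scores from 19 held-out unsafe states.
cal_scores = list(range(1, 20))
q10 = conformal_threshold(cal_scores, miss_rate=0.10)
q01 = conformal_threshold(cal_scores, miss_rate=0.01)
```

With 19 calibration points, a 10% miss rate picks the 18th smallest score, while a 1% miss rate cannot be certified from so few samples, so the threshold becomes infinite and every state is flagged.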

2603.16013 2026-06-12 cs.RO cs.SE version update

SPLIT: Separating Physical Contact via Latent Arithmetic for Image-Based Tactile Sensors

Wadhah Zai El Amri, Nicolás Navarro-Guerrero

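Affiliations * Leibniz Universität Hannover, L3S Research Center

AI summary: Proposes SPLIT, which uses latent space arithmetic to disentangle contact geometry from sensor optical properties, enabling efficient simulation of image-based tactile sensors with transfer across sensors and bidirectional simulation.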

Comments Accepted to Elsevier Robotics and Autonomous Systems Journal

DOI: 10.1016/j.robot.2026.105498

AI中文摘要

训练机器人触觉感知的机器学习模型需要大量数据，但获取真实交互数据因物理复杂性和变异性而具有挑战性。模拟触觉传感器是加速进展的关键步骤。本文提出了SPLIT，一种新的基于图像的触觉传感器模拟方法，重点在于DIGIT传感器。我们的方法核心是一种潜在空间算术策略，明确分离接触几何与传感器特定的光学属性。与需要重新校准的现有方法不同，这种分离使SPLIT能够适应多样化的DIGIT背景，甚至在不完全重训练的情况下将数据转移到不同的传感器如GelSight R1.5。此外，我们的方法在推理速度上优于现有替代方案。我们还提供了一种校准的有限元方法（FEM）软体网格模拟，具有可变分辨率，提供速度与保真度之间的可调权衡。此外，我们的算法支持双向模拟，允许从变形网格生成逼真图像以及从触觉图像重建网格。这种多功能性使SPLIT成为加速机器人触觉感知研究进展的重要工具。

英文摘要

Training machine learning models for robotic tactile sensing requires vast amounts of data, yet obtaining realistic interaction data remains a challenge due to physical complexity and variability. Simulating tactile sensors is thus a crucial step in accelerating progress. This paper presents SPLIT, a novel method for simulating image-based tactile sensors, with a primary focus on the DIGIT sensor. Central to our approach is a latent space arithmetic strategy that explicitly disentangles contact geometry from sensor-specific optical properties. Unlike methods that require recalibration for every new unit, this disentanglement allows SPLIT to adapt to diverse DIGIT backgrounds and even transfer data to distinct sensors like the GelSight R1.5 without full model retraining. Beyond this adaptability, our approach achieves faster inference speeds than existing alternatives. Furthermore, we provide a calibrated finite element method (FEM) soft-body mesh simulation with variable resolution, offering a tunable trade-off between speed and fidelity. Additionally, our algorithm supports bidirectional simulation, allowing for both the generation of realistic images from deformation meshes and the reconstruction of meshes from tactile images. This versatility makes SPLIT a valuable tool for accelerating progress in robotic tactile sensing research.
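
A rough picture of what "latent space arithmetic" can mean here: subtract the latent of the source sensor's no-contact background and add the latent of the target sensor's background, leaving the contact component in place. The abstract does not give SPLIT's exact formulation, so the function `latent_swap`, the three-dimensional latents, and their values are hypothetical.

```python
def latent_swap(z_image, z_src_background, z_tgt_background):
    """Hypothetical latent arithmetic: remove the source sensor's no-contact
    latent and add the target sensor's no-contact latent, element by element."""
    if not (len(z_image) == len(z_src_background) == len(z_tgt_background)):
        raise ValueError("all latents must have the same dimensionality")
    return [z - s + t for z, s, t in zip(z_image, z_src_background, z_tgt_background)]

# Toy latents: a contact image on sensor A, and the backgrounds of sensors A and B.
z_contact_on_a = [1.5, -0.5, 2.0]
z_background_a = [1.0, -1.0, 0.5]
z_background_b = [0.25, 0.0, 1.0]

z_contact_on_b = latent_swap(z_contact_on_a, z_background_a, z_background_b)
```

In a real pipeline the latents would come from an encoder and the result would be passed to a decoder to render the image for the target sensor; both components are omitted here.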

2204.10552 2026-06-12 cs.RO version update

Making Parameterization and Constrains of Object Landmark Globally Consistent via SPD(3) Manifold and Improved Cost Functions

通过SPD(3)流形和改进的成本函数使物体地标参数化和约束实现全局一致

Yutong Hu, Wei Wang

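AI summary: Uses the SPD(3) manifold and improved cost functions to resolve singularity problems in the back end of object-level SLAM, improving convergence speed and robustness; experiments show mapping accuracy improves by 22% on average.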

Comments 8 pages, 8 figures, submitted to IROS 2022 & RA-L

1. Robot Learning, Imitation Learning and Reinforcement Learning (6 papers)

Action-Effect Memory Pretraining for Robot Manipulation

Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer

Redesigning Regularization for Effective Policy Smoothing

Real-Time Execution with Autoregressive Policies

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

2. Motion Planning, Control and Dynamics (6 papers)

G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

MPC for underactuated spacecraft control with a Lyapunov supervised physics-informed neural network correction layer

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

Lyapunov-Based PI-Like Control for Robust Trajectory Tracking of a Four-Wheel Independently Driven and Steered Robot: Design and Experimental Validation

Lexicographic Minimum-Violation Motion Planning using Signal Temporal Logic

3. Manipulation, Grasping and Dexterous Hands (14 papers)

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation

Towards Reliable Sequential Object Picking in Clutter: The Runner-up Solution to RGMC 2025

EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

WT-UMI: Tactile-based Whole-Body Manipulation via Force-Supervised Contact-Aware Planning

See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

MCR-Bionic Hand: Anatomical Structural Priors for Dexterous Manipulation

Mana: Dexterous Manipulation of Articulated Tools

AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

4. Navigation, Localization and SLAM (13 papers)

Foresight: Iterative Reasoning About Clues that Matter for Navigation

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

SERF: Spatiotemporal Environment and Robot Feature Map for Long-Horizon Mobile Manipulation

NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

Visual Place Recognition in Forests with Depth-Aware Distillation

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

Active Semantic Perception

DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management

EgoMoD: Predicting Global Maps of Dynamics from Local Egocentric Observations

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

Triangle Splatting SLAM

5. Human-Robot Interaction and Collaborative Robots (7 papers)

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

Multi-Modal Multi-Agent Robotic Cognitive Alignment enabled by Non-Invasive Consumer Brain Computer Interfaces: A Proof of Concept Exploration

Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes

EMG-Based Adaptation of Anisotropic Virtual Fixtures for Robot-Assisted Surgical Resection and Dissection

GIVE: Grounding Human Gestures in Vision-Language-Action Models

ReactEMG Stroke: Healthy-to-Stroke Few-shot Adaptation for sEMG-Based Intent Detection

6. Embodied Intelligence and Vision-Language-Action Models (8 papers)

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

Y-BotFrame: An Extensible Embodied Agent Framework for Quadruped Robot Assistants

Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

7. Multi-Robot and Swarm Systems (6 papers)

DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

Effects of Social Interactions in Self-Organising Railway Traffic Management

GLIDE: A Coordinated Aerial-Ground Framework for Search and Rescue in Unknown Environments

Adaptive-Horizon Conflict-Based Search for Closed-Loop Multi-Agent Path Finding

DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization

8. Autonomous Vehicles, UAVs and Mobile Robots (4 papers)

AIR-VLA+: Decoupling Movement and Manipulation via Cascaded Dual-Action Decoders with Asymmetric MoE for Aerial Robots

Diffusion Transformer World-Action Model for AV Scene Prediction

Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

9. Soft Robotics and Hardware Design (4 papers)

Low cost, easily manufactured, highly flexible strain and touch sensitive fiber for robotics applications

Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds

Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory

Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video

10. Simulation, Datasets and Evaluation (7 papers)

An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics

Comparing Commercial Depth Sensor Accuracy for Medical Applications