arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22816 2026-05-22 cs.RO cs.CV Updated version

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu

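Affiliations Tsinghua University

AI summary This paper proposes AwareVLN, a framework that uses a self-aware reasoning mechanism for end-to-end vision-language navigation. It addresses the weak grasp that earlier methods have of the relationships between the agent, the instruction, and the scene, and it outperforms existing methods on multiple datasets.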

Comments Accepted to CVPR 2026. Project page: https://gwxuan.github.io/AwareVLN/

English abstract

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in the Habitat simulator show that AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.

2605.22812 2026-05-22 cs.RO cs.CV Updated version

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

Wenxuan Guo, Ziyuan Li, Meng Zhang, Yichen Liu, Yimeng Dong, Chuxi Xu, Yunfei Wei, Ze Chen, Erjin Zhou, Jianjiang Feng

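Affiliations Tsinghua University; Dexmal

AI summary This paper proposes GesVLA, which introduces gesture as a parallel instruction modality to resolve the spatial ambiguity that existing VLA systems face in complex scenes. A dual-VLM architecture tightly couples gesture representations with action policies, and a gesture data generation pipeline together with a two-stage training strategy improves target grounding accuracy and human-robot interaction efficiency.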

Comments Project page: https://gwxuan.github.io/GesVLA/

English abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap while producing rich data with diverse motion patterns and corresponding pointing annotations. In addition, we employ a two-stage training strategy to equip the model with both gesture perception and action prediction capabilities. We evaluate our approach on multiple real-world robotic tasks, including a controlled block manipulation task for validation and more practical scenarios such as product and produce selection. Experimental results show that incorporating gesture consistently improves target grounding accuracy and human-robot interaction efficiency, especially in complex and cluttered environments. Project page: https://gwxuan.github.io/GesVLA/.

2605.22722 2026-05-22 cs.RO cs.SY eess.SY Updated version

N3P: Accelerated Automated Parking via a Learning-Based Naturalistic Three-Stage Scheme

Yifan Xue, Toktam Mohammadnejad, Faizan M Tariq, Sangjae Bae, David Isele, Yosuke Sakamoto, Nadia Figueroa, Jovin D'sa

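Affiliations Honda Research Institute (HRI); University of Pennsylvania

AI summary This paper proposes N3P, a learning-based three-stage framework for automated parking. It introduces an intermediate preparatory pose, predicted by a learning module, to decompose the parking maneuver into simpler subproblems, which lowers computational complexity and accelerates path generation. Experiments show markedly faster planning in perpendicular and parallel parking scenarios, and better success rate and trajectory quality than reinforcement learning baselines.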

Comments Accepted at IEEE Intelligent Transportation Systems Conference (ITSC 2026)

English abstract

Autonomous parking requires efficient path planning that ensures kinematic feasibility and collision avoidance in constrained environments. Hybrid A* is widely used but computationally expensive, while reinforcement learning (RL) methods lack reliability and often struggle with long-horizon geometric constraints, leading to suboptimal trajectories. We present N3P, a fast learning-based three-stage framework for automated parking. By introducing an intermediate preparatory pose and using a learning module to predict it, N3P decomposes the maneuver into simpler subproblems, thereby reducing computational complexity and accelerating path generation. We validate the framework by integrating it with Hybrid A* algorithms. Experiments in perpendicular and parallel parking scenarios show that N3P-enhanced Hybrid A* speeds up planning by more than 80%. It also outperforms RL baselines in success rate and trajectory quality, producing shorter trajectories with fewer gear changes, while achieving comparable or lower planning time in most cases.

2605.22709 2026-05-22 cs.CR cs.ET cs.RO cs.SY eess.SY Updated version

TriSweep: A Four-Drone Swarm Framework for Electromagnetic Side-Channel Analysis

Eric Yocam, Varghese Vaidyan

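Affiliations Department of Computer Science, College of Engineering, California Polytechnic State University, San Luis Obispo, CA 93407, USA

AI summary This paper proposes TriSweep, a framework in which a four-drone swarm performs autonomous standoff electromagnetic side-channel analysis of embedded microcontrollers at 0.25-1.5 m. Spatially specialized collector drones work with a stationary accumulator drone to combine signals coherently and cancel masks. The results, all obtained in simulation, indicate that a drone swarm can act as an effective aerial adversary.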

Comments Simulation framework + systems design for a four-drone swarm performing standoff electromagnetic side-channel analysis. No hardware fabricated yet

English abstract

Electromagnetic (EM) side-channel analysis traditionally assumes a stationary, close-proximity probe - a threat model that underestimates aerial adversaries. TriSweep is a simulation framework that designs and evaluates a four-drone swarm architecture for autonomous standoff EM-SCA of embedded microcontrollers at 0.25-1.5 m. Three spatially specialized collector drones - Anchor (full-spectrum), Mask Probe (mask-register loading leakage), and Cipher Probe (masked SubBytes output leakage) - feed a stationary Accumulator drone that performs coherent combining (+4.8 dB SNR gain) and second-order mask cancellation via a centered product of the two spatially separated leakage streams. Evaluated against three real ANSSI ASCAD datasets (ATmega8515 masked AES-128 and 50/100-sample desynchronized variants), the framework achieves a simulated key rank of 18 +/- 1.7 (five-seed) at 0.25 m on the primary masked dataset. Profiling-trace cross-correlation alignment reduces single-drone rank from 89 to 21 on the 100-sample-jitter variant, demonstrating compensation for drone hover vibration. A two-channel CNN in the Accumulator converges to a loss of 0.454 (vs. random baseline 5.545) and improves rank on desynchronized datasets. No physical hardware has been fabricated; prototype construction is the planned next step.
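The "centered product" that the abstract names for second-order mask cancellation can be illustrated with a minimal sketch. The function name `centered_product` and the toy one-bit leakage model are assumptions made here for illustration; the abstract does not describe the Accumulator's actual implementation. Each input list holds one leakage value per trace, taken from one of the two spatially separated streams.

```python
from statistics import mean


def centered_product(mask_stream, cipher_stream):
    """Combine two leakage streams trace by trace.

    Each stream is centered on its own mean over all traces, then the two
    centered values of each trace are multiplied. This is the standard
    second-order preprocessing step; the names are hypothetical.
    """
    if len(mask_stream) != len(cipher_stream):
        raise ValueError("both streams need one value per trace")
    mu_mask = mean(mask_stream)
    mu_cipher = mean(cipher_stream)
    return [
        (a - mu_mask) * (b - mu_cipher)
        for a, b in zip(mask_stream, cipher_stream)
    ]


# Toy model (assumption): the first stream leaks a one-bit mask m and the
# second leaks the masked bit v XOR m. Neither stream alone depends on v.
masks = [0, 1, 0, 1]
combined_v0 = centered_product(masks, [m ^ 0 for m in masks])
combined_v1 = centered_product(masks, [m ^ 1 for m in masks])
```

In this toy model every combined value is 0.25 when the secret bit is 0 and -0.25 when it is 1, so the product depends on the unmasked value even though each stream on its own does not.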

2605.22693 2026-05-22 cs.RO cs.AI Updated version

Scout-Assisted Planning for Heterogeneous Robot Teams under Partially Known Environments

Hoang-Dung Bui, Abhish Khanal, Raihan Islam Arnob, Gregory J. Stein

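Affiliations George Mason University

AI summary This paper proposes Scout-Assisted Planning, a framework in which UAVs proactively gather environmental information to improve ground-vehicle navigation. Information-gain-guided action pruning reduces backtracking cost, and experiments show a significant reduction in ground robot travel cost across different environments.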

English abstract

Autonomous robot teams navigating partially known environments face costly backtracking when ground robots encounter blocked roads that are only revealed upon physical traversal. We address this with Scout-Assisted Planning, a heterogeneous planning framework in which scouting Unmanned Aerial Vehicles proactively gather environmental information to improve Unmanned Ground Vehicle navigation. To focus scouting on the most consequential edges, we propose Information Gain-based Action Pruning, which scores candidate scouting actions by their expected impact on ground robot behavior. Since exact Information Gain-based Action Pruning computation is prohibitively expensive, we develop a Graph Neural Network based model that predicts information gain values directly from graph structure and belief state, reducing planning time to real-time levels without sacrificing solution quality. Experiments across three environment types show that SAP with Information Gain Action Pruning reduces ground robot travel cost by 31.9-37.7% over the Canadian Traveler Problem baseline, and outperforms proximity-based scouting guidance by an additional 8-14%, confirming that principled information-gain-guided scouting is both more effective and computationally feasible for real-world deployment.

2605.22633 2026-05-22 cs.RO Updated version

SE3Kit: A Lightweight Python Library for Specialized Geometric Primitives in Robotics

Daniyal Maroufi, Omid Rezayof, Farshid Alambeigi

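Affiliations Walker Department of Mechanical Engineering and Texas Robotics at The University of Texas at Austin

AI summary This paper presents SE3Kit, a lightweight Python library for efficient operations on the Special Euclidean Group SE(3) and the Special Orthogonal Group SO(3). It provides a mathematically rigorous implementation suited to embedded deployment, rapid prototyping, and education.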

English abstract

The Python robotics ecosystem faces a challenge: while many libraries exist for rigid body transformations, few are both lightweight and mathematically strict. This paper introduces SE3Kit, a lightweight Python library for efficient operations on the Special Euclidean Group SE(3) and the Special Orthogonal Group SO(3). Unlike established frameworks that require heavy dependencies (e.g., SpatialMath, PyPose) or general tools that lack robotics-specific features (e.g., SciPy), SE3Kit targets the gap between these extremes. It is designed for embedded deployment, rapid prototyping, and education while providing rigorous mathematical implementation. It provides a pure-Python, NumPy-only implementation of Lie Group operations, without the overhead of deep learning or other visualization software.
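For readers unfamiliar with the SE(3) operations the abstract refers to, the following is a minimal sketch of composition and inversion of rigid transforms. It is not SE3Kit's API: the function names are hypothetical, and plain Python lists stand in for NumPy arrays only to keep the sketch dependency-free. A transform is represented as a pair `(R, p)` with a 3x3 rotation `R` and a translation `p`.

```python
import math


def mat_mul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def mat_vec(R, v):
    """Product of a 3x3 matrix and a length-3 vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]


def compose(T1, T2):
    """Composition in SE(3): (R1, p1) * (R2, p2) = (R1 R2, R1 p2 + p1)."""
    R1, p1 = T1
    R2, p2 = T2
    return mat_mul(R1, R2), [a + b for a, b in zip(mat_vec(R1, p2), p1)]


def inverse(T):
    """Inverse in SE(3): (R, p)^-1 = (R^T, -R^T p), valid because R is orthogonal."""
    R, p = T
    Rt = transpose(R)
    return Rt, [-x for x in mat_vec(Rt, p)]


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# A quarter turn about z followed by a translation, then undone by its inverse.
T = (rot_z(math.pi / 2), [1.0, 2.0, 3.0])
R_id, p_zero = compose(T, inverse(T))
```

Composing a transform with its inverse should return the identity rotation and a zero translation up to floating-point error, which is a convenient sanity check for any SE(3) implementation.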

2605.22605 2026-05-22 cs.RO cs.CV Updated version

Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection

Liuyang Wang, Feitian Zhang

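Affiliations Department of Robotics, School of Advanced Manufacturing and Robotics; State Key Laboratory of Turbulence and Complex Systems; Peking University; Great Bay University

AI summary This paper proposes a vision-only motion-guided detection framework. A dual-interval motion extraction strategy and a lightweight motion-guided attention module decouple target motion from camera-induced disturbances, improving UAV detection under severe ego-motion.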

English abstract

Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing motion-based methods either rely on computationally expensive optical flow or use single-interval differencing, which is sensitive to jitter and limited in capturing diverse motion patterns. We propose a vision-only motion-guided detection framework that decouples target motion from camera-induced disturbances. A homography-based Global Motion Compensation (GMC) first aligns adjacent frames. We then introduce a Dual-Interval Motion Extraction strategy that captures both short-term and long-term motion cues. To integrate these cues, a lightweight Motion-Guided Attention (MGA) module enhances feature representations within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset demonstrate consistent improvements over a strong YOLOv8 baseline under severe ego-motion. Ablation studies further confirm the effectiveness of the dual-interval design and the proposed motion-guided attention mechanism.
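The idea of extracting motion cues at two temporal intervals can be sketched with simple frame differencing. This is an illustrative assumption, not the paper's method: the abstract does not say how the cues are computed, the interval lengths 1 and 3 are arbitrary, and the frames are assumed to be grayscale and already aligned by global motion compensation.

```python
def abs_diff(frame_a, frame_b):
    """Per-pixel absolute difference of two equally sized grayscale frames."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


def dual_interval_cues(frames, t, short_gap=1, long_gap=3):
    """Return (short-term, long-term) difference maps for frame index t.

    frames: list of aligned frames, each a list of pixel rows.
    short_gap and long_gap are hypothetical interval lengths in frames.
    """
    if t - long_gap < 0:
        raise IndexError("not enough history for the long interval")
    short_cue = abs_diff(frames[t], frames[t - short_gap])
    long_cue = abs_diff(frames[t], frames[t - long_gap])
    return short_cue, long_cue


# A bright pixel moving one column per frame across a 1x4 image.
frames = [[[9, 0, 0, 0]], [[0, 9, 0, 0]], [[0, 0, 9, 0]], [[0, 0, 0, 9]]]
short_cue, long_cue = dual_interval_cues(frames, t=3)
```

In this toy sequence the short interval highlights the last two positions of the target, while the long interval highlights its start and end positions, which hints at why the two cues carry complementary information.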

2605.22600 2026-05-22 cs.RO Updated version

Branch-Stochastic Model Predictive Control for Motion Planning under Multi-Modal Uncertainty with Scenario Clustering

Zekun Xing, Ramkrishna Chaudhari, Marion Leibold, Dirk Wollherr, Martin Buss

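Affiliations Chair of Automatic Control Engineering

AI summary This paper combines stochastic model predictive control with a branching structure for motion planning under multi-modal uncertainty, and uses scenario clustering to keep computation real-time while reducing conservatism.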

Comments This work has been accepted for presentation at IFAC World Congress 2026

English abstract

Motion planning for autonomous driving must account for multi-modal uncertainty in both the intentions and trajectories of surrounding vehicles. Handling uncertainty in a worst-case manner guarantees robustness but often leads to excessive conservatism. Stochastic Model Predictive Control (SMPC) reduces trajectory-level conservatism through chance constraints, yet remains conservative with respect to intention uncertainty since constraints must hold across all intentions. We present a novel combination of SMPC and the branching structure, enabling the planner to generate distinct trajectories for different possible intentions while maintaining safety under trajectory uncertainty. A novel scenario clustering is proposed to merge prediction scenarios based on high-level decision similarity, thereby ensuring real-time tractability. Furthermore, an adaptive branching-time computation postpones commitment to separate plans until intention uncertainty is sufficiently reduced. Simulation studies in challenging highway scenarios demonstrate that the proposed method improves safety, reduces conservatism, and achieves real-time computational performance.

2605.22597 2026-05-22 cs.LG cs.AI cs.GR cs.RO Updated version

MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy

Jiaxu Wang, Junhao He, Jingkai Sun, Yi Gu, Yunyang Mo, Jiahang Cao, Qiang Zhang, Renjing Xu

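Affiliations Hong Kong University of Science; MMLab, Chinese University of Hong Kong, Hong Kong SAR; The University of Hong Kong, Hong Kong SAR

AI summary This paper proposes MoSA, a motion-constrained stress adaptation framework that narrows the real-to-sim gap in continuum dynamics. It uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity, and it is validated in a robot manipulation setting.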

Journal ref
International Conference on Machine Learning 2026

English abstract

Learning real-world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer. Project Page is available at https://mercerai.github.io/MoSA/.

2605.22521 2026-05-22 cs.RO cs.HC Updated version

Quantifying Full-Body Immersion

Alihan Bakir, Ekrem Yüksel, Fabio Zuliani, Neil Chennoufi, Francesco Bruno, Jamie Paik

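Affiliations Reconfigurable Robotics Lab

AI summary This paper proposes a new paradigm for immersive virtual experiences based on whole-body kinetic interaction. It defines three levels of immersion (audio-visual, physical, and full-body) and uses modular robotic surface units to render immersive environments at scale, with the aim of a more symbiotic relationship between people and virtual environments.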

Comments This manuscript is under consideration for possible publication in Nature. Copyright may be transferred to Nature if the manuscript is accepted for publication, without further notice

English abstract

Humanity is at the forefront of yet another digital revolution, where the lines between real and virtual worlds are dissolving, reshaping how we perceive and interact with our surroundings. In this context, we introduce a transformative paradigm for immersive virtual experiences centered around whole-body kinetic interactions. Our approach redefines immersion through three distinct levels: audio-visual immersion, capturing sensory realism; physical immersion, delivering haptic feedback; and full-body immersion (FBI), where dynamic bodily interaction integrates seamlessly with virtual environments. At the core of this innovation lies a scalable, distributable platform based on modular robotic surface units inspired by the adaptive designs of nature. These units enable the rendering of immersive environments at any scale, from intimate personal experiences to expansive multi-user settings, dynamically adapting to interactions in real-time. The modular system distributes force, shape, and motion feedback throughout entire spaces, replicating the physical characteristics of the environment and enabling new depth of engagement through FBI. By combining scalability, adaptability, and dynamic physical engagement, this framework bridges the gap between real and virtual worlds. It offers an unprecedented level of immersion where users can engage their entire bodies in symbiotic interactions with the virtual space. This work not only advances immersive technology but also redefines how humans and virtual environments coexist, setting a foundation for a new era of human-environment synthesis.

2605.22493 2026-05-22 cs.LG cs.AI cs.RO Updated version

Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

Lorenzo Mazza, Massimiliano Datres, Ariel Rodriguez, Sebastian Bodenstedt, Gitta Kutyniok, Stefanie Speidel

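Affiliations NCT-Dresden

AI summary This paper studies why behavioral cloning fails under multimodality. It analyzes how different multimodal parameterizations of action-chunking policies fail in different ways, and relates robustness to the strength of posterior-prior regularization and to the smoothness of the generative map.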

English abstract

Behavioral cloning becomes difficult when the same observation admits several valid actions. We study this problem for action-chunking policies and show that different multimodal parameterizations fail in different ways. For latent-variable policies, posterior-prior regularization makes deployment-time sampling more reliable, but excessive regularization removes the action-conditioned information needed to distinguish demonstrated modes. Reducing this regularization can preserve mode information, but then success depends on whether the prior covers the relevant latent regions. For action-space generative policies, multimodality is constrained by the smoothness of the base-to-action transport: a map with small Lipschitz constant cannot assign substantial probability to many well-separated modes. Covering many modes therefore requires either sharp transitions in base space or off-support bridge regions in action space. Experiments on synthetic multimodal tasks and robotic simulation benchmarks support these mechanisms.

2605.22456 2026-05-22 cs.RO cs.AI Updated version

Steins;Gate Drive: Semantic Safety Arbitration over Structured Futures for Latency-Decoupled LLM Planning

Anjie Qiu, Hans D. Schotten

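Affiliations Institute for Wireless Communication and Navigation; RPTU University Kaiserslautern-Landau; German Research Center for Artificial Intelligence

AI summary This paper proposes SteinsGateDrive, a latency-decoupled planner-runtime architecture. While preserving the measured safety boundary, it reduces effective lag from +3.07 s at a 1-second horizon to -0.01 s at a 4-second horizon, improving planning efficiency for autonomous driving.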

Comments 10 pages, 2 figures, 5 tables, submitted to IEEE Transactions on Intelligent Vehicles

English abstract

Cloud-hosted LLM driver agents provide useful semantic judgments, but their inference latency exceeds stepwise vehicle-control windows. Learned world models predict futures, but they usually keep future generation and action selection inside large coupled loops. We present SteinsGateDrive, a latency-decoupled planner-runtime architecture in which the worldline metaphor from the eponymous story names one plausible consequence of an intervention: the LLM selects counterfactual driving futures before the final control instant, and a runtime reuses the selected forecast only while safety contracts remain valid. The generator builds three world-line roles: alpha nominal ego-conditioned futures, beta interaction counterfactuals around nearby vehicles, and gamma hazard-stress futures such as braking, cut-ins, or blocked corridors. The selected branch becomes a typed StrategicForecast with horizon, validity/abort conditions, fallback, and authority. On a within-subject, matched-seed normal-highway protocol with 10 seeds and 20 steps, GPT-5.4 mini reduces effective lag from +3.07 s at 1-second horizon to -0.01 s at 4-second horizon while preserving the measured no-collision safety boundary. The architecture's safety contribution comes from the atom-predicate runtime check, not from the drift score, which functions as a refresh-frequency knob.

2605.22446 2026-05-22 cs.CV cs.AI cs.RO Updated version

Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng

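Affiliations Beihang University; Tsinghua University; Peking University; JDT AI Infra; Zhejiang University

AI summary This paper proposes Pre-VLA, a unified runtime verification architecture that assesses action validity before physical execution or world-model imagination, improving the reliability of vision-language-action execution and world-model rollouts.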

English abstract

While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.
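The "Focal classification" term mentioned in the abstract most likely refers to the focal loss, which down-weights easy examples so that a rare class is not swamped by the common one. The sketch below shows the standard binary form for a single prediction. The defaults `gamma=2.0` and `alpha=0.25` are commonly used values assumed here for illustration; the abstract does not state Pre-VLA's settings or how this term is weighted against the other objectives.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, in [0, 1].
    y: ground-truth label, 0 or 1.
    gamma: focusing parameter; gamma = 0 recovers weighted cross-entropy.
    alpha: weight of the positive class (the negative class gets 1 - alpha).
    """
    p_t = p if y == 1 else 1.0 - p          # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)           # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # confident and correct: strongly down-weighted
hard = focal_loss(0.1, 1)   # confident and wrong: keeps most of its weight
```

With `gamma=2.0`, a correct prediction at probability 0.9 has its cross-entropy scaled by roughly 0.01, while a wrong prediction at 0.1 is scaled by roughly 0.81, so training concentrates on the hard cases.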

2605.22443 2026-05-22 cs.RO Updated version

Terminal Constraint Model Predictive Control for Image-Based Visual Servoing of UAVs with Kalman Filter-Based Moment Loss Compensation

X. Wang, Y. Cao, W. L. W. Leong, Y. R. Tan, S. Huang, S. H. R. Teo, C. Xiang

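Affiliations College of Design and Engineering, National University of Singapore

AI summary This paper proposes a terminal-constraint model predictive control (TC-MPC) framework with a Kalman filter mechanism for image-based visual servoing. It targets two problems: loss of closed-loop stability caused by input and state constraints, and loss of moment features under aggressive motion.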

English abstract

Image-Based Visual Servoing (IBVS) provides an efficient vision-guided control paradigm for unmanned aerial vehicles (UAVs) by directly regulating image-space errors. However, conventional IBVS controllers are vulnerable to two critical issues: loss of closed-loop stability near the target due to input and state constraints, and control failure caused by intermittent loss of moment-based visual features under aggressive motion. To address these challenges, this paper proposes a terminal-constraint model predictive control (TC-MPC) framework for IBVS, integrated with a Kalman filter (KF)-based state-prediction mechanism. The TC-MPC explicitly incorporates terminal-state constraints and a terminal cost into the IBVS error dynamics, ensuring recursive feasibility, improved convergence behavior, and closed-loop stability under control and state constraints. In parallel, the Kalman filter predicts the temporal evolution of image moments during short-term visual degradation, enabling the controller to preserve control continuity when moment measurements are partially unavailable. The proposed approach is validated through real-time UAV visual servoing experiments.
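How a Kalman filter can carry an image moment through a short loss of measurements is sketched below for a single scalar moment. The constant-velocity model, the noise levels `q` and `r`, and the function name `kf_step` are all assumptions made for illustration; the abstract does not specify the paper's state model. When the measurement `z` is `None`, the step returns the prediction alone, which is what lets a controller keep running while the feature is lost.

```python
def kf_step(x, P, z, dt=0.05, q=1e-3, r=1e-2):
    """One Kalman step for a scalar image moment under a constant-velocity model.

    x: [moment, rate]; P: 2x2 covariance as nested lists.
    z: measured moment, or None when the visual feature is unavailable.
    dt, q, r: hypothetical sample time, process noise, and measurement noise.
    """
    # Predict with F = [[1, dt], [0, 1]] and Q = q * I.
    m, v = x
    m_pred = m + dt * v
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    P_pred = [
        [p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11],
        [p10 + dt * p11, p11 + q],
    ]
    if z is None:
        # Measurement lost: pass the prediction on; uncertainty has grown.
        return [m_pred, v], P_pred

    # Update with H = [1, 0]: only the moment itself is measured.
    s = P_pred[0][0] + r
    k0, k1 = P_pred[0][0] / s, P_pred[1][0] / s
    innovation = z - m_pred
    x_new = [m_pred + k0 * innovation, v + k1 * innovation]
    P_new = [
        [(1.0 - k0) * P_pred[0][0], (1.0 - k0) * P_pred[0][1]],
        [P_pred[1][0] - k1 * P_pred[0][0], P_pred[1][1] - k1 * P_pred[0][1]],
    ]
    return x_new, P_new


x0, P0 = [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]]
x_lost, P_lost = kf_step(x0, P0, None, dt=0.5)   # feature lost: predict only
x_seen, P_seen = kf_step(x0, P0, 2.0, dt=0.5)    # feature visible: predict and update
```

Comparing the two calls shows the expected behavior: without a measurement the moment variance grows, and with a measurement it shrinks below the predicted value.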

2605.22431 2026-05-22 cs.RO Updated version

Real-Time Auto-Optimization in Unknown Environments via Structure-Exploiting Dual Control for Exploration and Exploitation

Shiying Dong, Haoyang Yang, Qiwei Liu, Wen-Hua Chen

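Affiliations Research Centre for Low Altitude Economy; Hong Kong Polytechnic University

AI summary This paper proposes a fast numerical dual control method for auto-optimization problems in unknown environments. By exploiting the structure of the dual control problem, it improves exploration-exploitation performance and computation speed.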

English abstract

This paper develops a fast numerical dual control for exploration and exploitation (DCEE) method to address auto-optimization problems in unknown environments. In auto-optimization problems, the optimal operating condition is unknown a priori and may vary with the environment. As in classical dual control techniques, computational burden remains a major concern in DCEE for active learning. Existing DCEE methods provide a principled exploration-exploitation objective, but are mainly realized through standard optimization packages or explicit gradient-type update laws, where the numerical structure of the DCEE has not been fully exploited. This paper shows that the reward function in DCEE has an inherent convex-over-nonlinear structure, where the exploitation and exploration terms form a unified nonlinear residual map equipped with a convex outer loss. Benefiting from this structure, a structure-exploiting numerical method is developed by linearizing only the nonlinear residual map while preserving the convex outer loss. Thus, each subproblem is transformed into a structured convex form that can be solved reliably. The resulting generalized Gauss-Newton Hessian approximation is positive semidefinite and depends only on first-order derivatives, thereby supporting fast online computation. The proposed method is evaluated on a vehicle cruising auto-optimization problem and compared with existing methods. Simulation and hardware-in-the-loop experimental results show that the proposed method improves control performance and achieves a speedup of approximately one order of magnitude, with a microsecond-level maximum computation time of only 83 μs on a typical vehicle embedded CPU.
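The convex-over-nonlinear structure and the generalized Gauss-Newton step described in the abstract can be written out as a short derivation. The symbols $u$, $r$, $\phi$, and $J_k$ are notation assumed here for illustration and are not taken from the paper, and the objective is written as a loss to be minimized.

```latex
% Assumed notation: u = decision variable, r(u) = nonlinear residual map,
% \phi = convex outer loss, J_k = Jacobian of r at the current iterate u_k.
\begin{align}
  f(u) &= \phi\bigl(r(u)\bigr), \\
  r(u) &\approx r(u_k) + J_k\,(u - u_k), \\
  u_{k+1} &\in \arg\min_{u}\; \phi\bigl(r(u_k) + J_k\,(u - u_k)\bigr), \\
  H_{\mathrm{GGN}} &= J_k^{\top}\,\nabla^{2}\phi\bigl(r(u_k)\bigr)\,J_k \;\succeq\; 0 .
\end{align}
```

Only $r$ is linearized, so each subproblem is a convex function of an affine map and is therefore convex. The exact Hessian of $f$ would add the term $\sum_i \partial_{r_i}\phi\,\nabla^{2} r_i$; the generalized Gauss-Newton approximation drops it, which is why only first derivatives of $r$ are needed and why the result is positive semidefinite whenever $\nabla^{2}\phi \succeq 0$.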

2605.22420 2026-05-22 cs.CV cs.AI cs.RO 版本更新

Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction

基于扩散的通用增强器用于城市场景重建

Henry Che, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun

发表机构 * Waabi University of Toronto(多伦多大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出GenRe,一种基于扩散的通用增强器,用于城市场景重建,通过学习不同场景中的生成先验,高效地生成稳健且高保真的表示,能够可靠地泛化到挑战性的未见过的视角,从而在自动驾驶中实现鲁棒和可扩展的传感器模拟。

Comments ICRA 2026. Project page: https://waabi.ai/genre

详情
AI中文摘要

从真实世界观测重建城市场景已成为自动驾驶开发和测试的强大工具。尽管当前的神经渲染方法在记录轨迹上实现了高质量的渲染,但其在大视角变化下质量显著下降,限制了闭环模拟的应用。最近的研究表明,使用扩散模型在这些具有挑战性的视角上增强质量并将其改进回3D表示具有前景。然而,它们通常需要昂贵的每场景优化,且提炼的表示仍然脆弱,无法超越有限的合成视角泛化。为了解决这些限制,我们提出了GenRe,一种新的基于扩散的通用增强器用于城市场景重建。GenRe输入任何预训练的3D高斯表示,并在几分钟内修复其中的缺陷。通过学习在多样化场景中提炼生成先验,GenRe高效地生成稳健且高质量的表示,能够可靠地泛化到具有挑战性的未见过的视角(例如,变道)。实验表明,GenRe在质量和效率上均优于现有方法,并且受益于各种下游任务,使自动驾驶中的传感器模拟更加稳健和可扩展。

英文摘要

Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting their applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes its deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe efficiently produces a robust, high-fidelity representation that generalizes reliably to challenging unseen viewpoints (e.g., lane change). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.

2605.22322 2026-05-22 cs.RO 版本更新

How can reasoning capability empower the AI copilot robot in endoscopic surgery

推理能力如何赋能内窥手术中的AI助手机器人

Guankun Wang, Long Bai, Hongliang Ren

发表机构 * Department of Electronic Engineering(电子工程系)

AI总结 本文研究了推理能力在内窥手术中AI助手机器人中的应用,提出通过整合多模态线索、解读手术意图和推断隐藏组织动态来提高手术的精确性、安全性和可持续性。

Comments Accepted by npj digital medicine

详情
AI中文摘要

推理能力已显著提升了复杂逻辑推理和机器人决策制定在一般领域的能力。然而,其在人工智能(AI)助手机器人——特别是基于视觉-语言-动作(VLA)模型实现——在内窥手术中的潜力仍待探索。有效的推理应使AI助手机器人能够整合多模态线索、解读手术意图并推断隐藏的组织动态,从而缓解术中不确定性和对外科医生的认知负担。正确实施的推理驱动自主性可以将AI助手机器人从被动执行者转变为认知合作者,从而在临床实践中提高精确性、安全性和可持续性。

英文摘要

Reasoning capability has significantly advanced complex logical inference and robotic decision-making in general domains. However, its potential in the Artificial Intelligence (AI) copilot robot, particularly when implemented on a Vision-Language-Action (VLA) model, remains unexplored in endoscopic surgery. Effective reasoning should enable AI copilot robots to integrate multimodal cues, interpret surgical intent, and infer hidden tissue dynamics, thereby alleviating intraoperative uncertainty and the cognitive burden on surgeons. Properly implemented, reasoning-driven autonomy can transform AI copilot robots from reactive executors into cognitive collaborators, enhancing precision, safety, and sustainability in clinical practice.

2605.22283 2026-05-22 cs.RO 版本更新

Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action

视觉-语言-动作中视界外操作的空间记忆

Pengteng Li, Weiyu Guo, He Zhang, Tiefu Cai, Xiao He, Yandong Guo, Hui Xiong

发表机构 * Thrust of Artificial Intelligence, The Hong Kong University of Science(人工智能推动部,香港科学大学)

AI总结 本文提出SOMA框架,用于解决视觉-语言-动作模型中视界外操作的问题,通过构建持久的空间记忆,使模型能够超越当前视觉范围进行推理,提升任务成功率和操作行为质量。

Comments Accepted by ICML 2026

详情
AI中文摘要

我们引入SOMA,即视觉-语言-动作(VLA)模型中的空间记忆框架,用于视界外操作。现有VLAs通常隐式假设任务相关物体始终可见,当目标物体处于相机视野外时,会导致行为脆弱和反应迟钝。SOMA通过为VLAs配备由移动头部相机获取的多视角观察构建的持久空间记忆,解决了这一限制。该框架包含三个组件:空间记忆构建,通过扫描将角度方向的观察聚合为统一的空间-语义表示;动态记忆细化,保持时间上的全局一致性;以及情境记忆检索,激活操作过程中与指令相关的空间线索。我们评估SOMA在五个具有挑战性的现实世界视界外操作任务上,包括多步骤和双臂场景,其中目标物体最初不可见。实验结果表明,SOMA不仅提高了任务成功率,还诱导了质不同操作行为,具有更快的目标定位、减少视角搜索和近似单次抓取在部分可观察性条件下。在RoboCasa GR1和SimplerEnv上的额外实验进一步验证了SOMA记忆设计在传统完全可观察设置下的有效性。代码将很快发布。

英文摘要

We introduce SOMA, the Spatial Memory framework for Out-of-Vision Manipulation in Vision-Language-Action (VLA) models. Most existing VLAs implicitly assume that task-relevant objects are always visible, leading to brittle and reactive behaviors when targets fall outside the camera's field of view. SOMA addresses this limitation by equipping VLAs with a persistent spatial memory constructed from multi-view observations acquired via a movable head camera, enabling reasoning beyond the current visual frustum. The framework consists of three components: Spatial Memory Construction, which aggregates angular-wise observations into a unified spatial-semantic representation through scanning; Dynamic Memory Refinement, which maintains global consistency over time; and Contextual Memory Retrieval, which activates instruction-relevant spatial cues during manipulation. We evaluate SOMA on five challenging real-world out-of-vision manipulation tasks, including multi-step and dual-arm scenarios where target objects are initially invisible. Experimental results show that SOMA not only improves task success rates, but also induces qualitatively different manipulation behaviors, with faster target localization, reduced viewpoint search, and near one-shot grasping under partial observability. Additional experiments on RoboCasa GR1 and SimplerEnv further validate the effectiveness of SOMA's memory design under conventional fully observable settings. Code will be released soon.

2605.22259 2026-05-22 cs.LG cs.CV cs.RO 版本更新

An Evidence Hierarchy for Bayesian Object Classification via OSINT-Aided Heterogeneous Sensor Fusion

基于OSINT辅助异质传感器融合的贝叶斯目标分类证据层级

Jan Nausner, Michael Hubner

发表机构 * Center for Digital Safety & Security, Austrian Institute of Technology GmbH (AIT)(数字安全与安全研究所,奥地利技术研究院(AIT))

AI总结 本文提出了一种基于OSINT辅助的异质传感器融合方法,通过建立新的证据层级模型,结合上下文信息和领域知识,提升对CBRNE威胁的分类准确率,实验结果表明该方法在抗干扰和先验不匹配方面具有优势,分类准确率高达95%。

Comments 6 pages, 1 figure; © 2026 The Authors. Submitted to the 2026 IEEE International Conference on Multisensor Fusion and Integration (MFI 2026). Under review

详情
AI中文摘要

异质传感器融合对于检测、定位和分类CBRNE威胁至关重要。然而,单独的传感器通常只能检测相关威胁的子集,其可靠性各异,甚至只能提供间接威胁指示,使威胁分类变得困难。此外,传感器侧的高杂波率对融合系统提出了巨大挑战。此外,高质量数据集的有限供应阻碍了智能传感器中基于学习的检测和分类模型的发展。为缓解这些传感器相关缺点,提出了一种上下文感知和领域知识增强的融合过程。首先,建立了一个新的证据层级,能够建模直接、指示性和上下文信息。其次,通过收集、处理和利用OSINT输入,将环境上下文信息引入融合过程。第三,利用证据层级的所有级别,构建一个结合领域知识的贝叶斯威胁类型分类机制。所提出的方法在模拟场景中进行了评估,结果表明该融合方法在抗杂波和先验不匹配方面具有优势,总体分类准确率高达95%。

英文摘要

Heterogeneous sensor fusion is vital for detecting, localizing, and classifying CBRNE threats. However, individual sensors are often only capable of detecting a subset of relevant threats with varying reliability, or can provide only indirect threat indications, making threat classification challenging. Furthermore, high clutter rates on the sensor side present a great challenge for fusion systems. Additionally, the limited availability of high-quality datasets hinders the advancement of learning-based detection and classification models in smart sensors. To mitigate these sensor-related shortcomings, a context-aware and domain knowledge-enhanced fusion process is proposed. First, a novel evidence hierarchy is established that enables modeling of direct, indicative, and contextual information. Second, contextual information about the environment is introduced into the fusion process by collecting, processing, and exploiting OSINT inputs. Third, all levels of the evidence hierarchy are used to craft a Bayesian threat type classification mechanism with domain knowledge-informed priors. The proposed methodology is evaluated in simulated scenarios, and the results demonstrate the benefit of the proposed fusion approach in terms of robustness to clutter and prior mismatch, with an overall classification accuracy of up to 95%.
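
As a rough illustration of the Bayesian classification step described above, the sketch below fuses a prior with one likelihood vector per evidence level (direct, indicative, contextual). It assumes the evidence levels are conditionally independent given the threat class, which the abstract does not state, and every number and class ordering here is hypothetical.

```python
import numpy as np

def fuse_evidence(prior, likelihoods):
    # Posterior over threat classes: prior times the product of likelihoods,
    # computed in log space for numerical stability.
    # ASSUMPTION: evidence sources are conditionally independent given the class.
    log_post = np.log(np.asarray(prior, dtype=float))
    for lik in likelihoods:
        log_post = log_post + np.log(np.asarray(lik, dtype=float))
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical three-class example (values invented for illustration).
prior = [0.5, 0.3, 0.2]           # domain-knowledge-informed prior
direct = [0.2, 0.7, 0.1]          # likelihood from a direct detection
indicative = [0.4, 0.4, 0.2]      # likelihood from an indirect indication
contextual = [0.3, 0.5, 0.2]      # likelihood from OSINT-derived context

posterior = fuse_evidence(prior, [direct, indicative, contextual])
best_class = int(np.argmax(posterior))
```

In this toy setting the second class wins even though the prior favors the first, which shows how lower levels of the hierarchy can override a mismatched prior.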

2605.22206 2026-05-22 cs.NE cs.AI cs.RO 版本更新

Temporal Coding as a Substrate for Sensorimotor Object Inference: A Spiking Reinterpretation of Thousand Brains Architecture

时间编码作为感觉运动物体推断的基质:千脑架构的脉冲化重新诠释

Joy Bose

发表机构 * Independent Researcher(独立研究者)

AI总结 该研究提出用脉冲编码替代密集向量,以更有效地编码传感器接触顺序,从而提升物体识别的准确性和鲁棒性,核心方法是基于STDP的学习规则和可学习参数lambda,主要贡献是验证了时间编码在不同空间排列和噪声水平下的优越性能。

Comments 18 pages, 5 figures

详情
AI中文摘要

千脑理论(TBT)及其开源的Monty框架通过感觉运动推断进行物体识别——通过主动移动传感器跨物体表面并逐接触建立证据。当前实现将每个接触编码为密集浮点向量。虽然Monty跟踪步间位移并跨接触积累证据,但其将每个接触的特征激活模式视为无序集合——特征遇到的顺序不具有表征意义。在TBT中,接触的顺序具有空间意义:知道在从左到右的扫过中特征A在特征B之前被感受到,可以告诉你A和B在物体上的位置。密集向量丢弃了这种顺序。我们提出用等级顺序脉冲包替代密集向量:每个接触产生一连串神经事件的短暂爆发,其中最强烈激活的神经元首先放电。连续爆发之间的时间间隔隐含地编码传感器位移,而无需显式坐标计算。一种生物启发的学习规则(STDP)将遍历方向编码到突触权重中。一个可学习的参数lambda调整对早期与近期接触的依赖程度,适应每个物体的几何形状。我们推导出三个可检验的预测,并指定了四个组件的大约450行NumPy实现。三个合成实验验证了核心主张:时间编码在具有相同特征但不同空间排列的物体上实现完美判别准确性,而密集积累在偶然情况下表现不佳;时间编码在所有测试噪声水平上保持30-50个百分点的优势;适应性的lambda收敛到不同的值,反映物体几何复杂性。对Monty的YCB基准的端到端评估留待未来工作。

英文摘要

The Thousand Brains Theory (TBT) and its open-source Monty framework model object recognition through sensorimotor inference: identifying objects by actively moving a sensor across their surface and building evidence contact by contact. The current implementation encodes each contact as a dense floating-point vector. While Monty tracks inter-step displacement and accumulates evidence across contacts, it treats the feature activation pattern at each contact as an unordered set: the directional sequence in which features are encountered carries no representational weight. In TBT, the sequence of contacts carries spatial meaning: knowing that feature A was felt before feature B during a left-to-right sweep tells you something about where A and B sit on the object. Dense vectors discard this ordering. We propose replacing dense vectors with rank-order spike packets: each contact produces a brief burst of neural events where the most strongly activated neuron fires first. The time gap between successive bursts implicitly encodes sensor displacement without explicit coordinate calculations. A biologically motivated learning rule (STDP) encodes traversal direction into synaptic weights. A learnable parameter lambda adjusts reliance on earlier versus recent contacts, adapting to each object's geometry. We derive three testable predictions and specify an implementation of four components in approximately 450 lines of NumPy. Three synthetic experiments confirm the core claims: temporal coding achieves perfect discrimination accuracy on objects with identical features in different spatial arrangements, where dense accumulation performs at chance; temporal coding maintains a 30-50 percentage point advantage across all tested noise levels; the adaptive lambda converges to distinct values, reflecting object geometric complexity. End-to-end evaluation on Monty's YCB benchmark is left for future work.
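
The contrast between dense accumulation and rank-order coding can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: dense accumulation is modeled as a plain sum over contacts, the function name `rank_order_packet` and the activation values are hypothetical, and none of this is taken from the paper's implementation or from Monty.

```python
import numpy as np

def rank_order_packet(activations, n_spikes=3):
    # Neuron indices in firing order: the most strongly activated fires first.
    order = np.argsort(-np.asarray(activations), kind="stable")
    return tuple(order[:n_spikes].tolist())

# Two hypothetical contacts with different feature activations.
contact_a = np.array([0.9, 0.1, 0.4, 0.0])
contact_b = np.array([0.0, 0.8, 0.2, 0.5])

# Two objects with identical features met in opposite order.
object_1 = [contact_a, contact_b]
object_2 = [contact_b, contact_a]

# Assumed dense baseline: order-free sum of contact vectors.
dense_1 = np.sum(object_1, axis=0)
dense_2 = np.sum(object_2, axis=0)

# Temporal code: the sequence of spike packets keeps the traversal order.
spikes_1 = [rank_order_packet(c) for c in object_1]
spikes_2 = [rank_order_packet(c) for c in object_2]
```

Here `dense_1` and `dense_2` are identical, so the summed representation cannot tell the objects apart, while `spikes_1` and `spikes_2` differ because the packet sequence is ordered.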

2605.22189 2026-05-22 cs.RO 版本更新

Learning A Unified Risk Map for Autonomous Driving in Partially Observable Environments

面向部分可观测环境下自动驾驶的统一风险图学习

Jie Jia, Yaofeng Su, Zeyu Bao, Yun Hong, Bingzhao Gao, Zhongxue Gan, Wenchao Ding

发表机构 * Fudan University(复旦大学) Tongji University(同济大学)

AI总结 本文提出了一种统一的风险图建模与学习框架,用于部分可观察环境中的自动驾驶,通过时空建模整合交通流风险和碰撞风险,以更精细地评估遮挡引起的危险,并引入扩散基场景生成框架来解决遮挡交互场景稀缺的问题,实验表明该方法在Waymo Open Motion Dataset上显著优于现有方法。

Comments Published in IEEE Robotics and Automation Letters

详情
英文摘要

Occlusion-aware prediction remains a critical challenge in autonomous driving due to the inherent uncertainty of unobserved regions. Existing approaches either overestimate risk based on reachable states or struggle to predict accurate trajectories under high occlusion uncertainty. To address these limitations, we propose a unified risk map modeling and learning framework for partially observable environments. Our method integrates traffic flow risk and collision risk through spatiotemporal modeling, enabling fine-grained assessment of occlusion-induced hazards. To address the scarcity of scenarios involving occluded interactions, we introduce a diffusion-based scenario generation framework that produces realistic yet adversarial scenarios. We integrate the modeling and learning of a unified risk map into a framework that supports risk-aware planning under partial observability. Experiments on the Waymo Open Motion Dataset show that our method significantly outperforms the state-of-the-art occlusion-aware baseline, improving minimum time-to-collision by 0.78 times and average time-to-collision by 1.67 times. The proposed framework offers a comprehensive and practical solution for risk-aware planning in partially observable environments.

2605.22164 2026-05-22 cs.LG cs.RO 版本更新

Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics

超越欧几里得邻近性:利用时域匹配的轨迹可达性度量修复潜在世界模型

Liangyu Li, Shengzhi Wang, Qingwen Liu

发表机构 * Tongji University(同济大学)

AI总结 本文提出轨迹可达性度量(TRM)作为固定潜在世界模型的后处理终端排名方法,通过训练小的成对头部来改进终端排名,从而提高连续操控任务的性能。

Comments 26 pages, 7 figures

详情
AI中文摘要

潜在世界模型可以包含用于控制的状态,但其终端成本接口可能会向规划器暴露错误的决策相关信息。在常见的潜在MPC中,候选序列通过预测终端和目标潜在状态之间的欧几里得距离进行排名;这假设了原始潜在距离权重能够正确地反映可达性相关变量。我们提出轨迹可达性度量(TRM),一种用于固定潜在世界模型的后处理终端排名方法。TRM从记录的轨迹结构中训练一个小的成对头部,并将其用作替代或混合成本;编码器、动力学、采样器、优化器和评估表现保持不变。关键设计选择是地平线意识监督:该度量在广泛的、平衡的时间分离上进行训练,以匹配长地平线终端候选排名问题。在硬TwoRoom基准上,使用LeWorldModel(LeWM)的原始潜在规划成功率为7.0%,而全地平线TRM成功率为97.0%;洗牌时间标签控制仍为0.0%。同样的配方在三个种子上将PLDM基线从32.7%提高到84.0%,而短地平线TRM变体在100,000对预算下仅达到35.0%。在TwoRoom中,我们提供了TRM为何有效的机理证据:XY位置是线性可解码的(R²=0.998),但原始潜在MSE错误地排名候选;XY探针行空间在终端-目标潜在MSE中占比不到1%,但承载了大部分候选质量信号;SCSA审计显示TRM提高了规划器看到的排序和选定终点。在PushT go50/go75中,TRM风格的任务-状态度量比闭环成功更清晰地改进了SCSA排名和选定最终距离,推动了连续操控中的辅助混合成本。TRM是规划器面对的修复,审计解释了何时终端可达性度量应替代或补充原始潜在接近度。

英文摘要

Latent world models can contain the state needed for control, yet their terminal-cost interface can expose the planner to the wrong decision-relevant information. In common latent MPC, candidate sequences are ranked by Euclidean distance between predicted terminal and goal latent states; this assumes that raw latent distance weights reachability-relevant variables correctly. We propose trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models. TRM trains a small pairwise head from logged trajectory structure and uses it as a replacement or hybrid cost; the encoder, dynamics, sampler, optimizer, and evaluation manifests remain fixed. The key design choice is horizon-aware supervision: the metric is trained on broad, balanced temporal separations to match the long-horizon terminal candidate ranking problem. On a hard TwoRoom benchmark, raw latent planning with LeWorldModel (LeWM) reaches 7.0% success, while full-horizon TRM reaches 97.0%; shuffled temporal-label controls stay at 0.0%. The same recipe improves a PLDM baseline from 32.7% to 84.0% across three seeds, and a short-horizon TRM variant reaches only 35.0% with the 100,000 pair budget. In TwoRoom, we provide mechanistic evidence for why TRM works: XY position is linearly decodable (R^2=0.998), yet raw latent MSE misranks candidates; the XY-probe rowspace accounts for less than 1% of terminal-goal latent MSE but carries most candidate-quality signal; and SCSA audits show that TRM improves the ordering and selected endpoint seen by the planner. On PushT go50/go75, TRM-style task-state metrics improve SCSA ranking and selected final distance more cleanly than closed-loop success, motivating auxiliary hybrid costs in continuous manipulation. TRM is the planner-facing repair, and audits explain when terminal reachability metrics should replace or augment raw latent proximity.
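
The misranking problem described above can be made concrete with a toy example. It assumes a 5-dimensional latent whose first two coordinates encode XY position and whose remaining coordinates are task-irrelevant but large in scale. The linear probe `W` below is only a stand-in for a reachability-aware cost; it is not the learned pairwise TRM head, and all values are hypothetical.

```python
import numpy as np

def euclidean_cost(z, goal):
    # Raw latent proximity, as used for terminal ranking in common latent MPC.
    return float(np.sum((z - goal) ** 2))

def probe_cost(z, goal, W):
    # Distance measured only in the subspace read out by a linear probe W.
    return float(np.sum((W @ z - W @ goal) ** 2))

# Hypothetical probe that reads XY from the first two latent coordinates.
W = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
])

goal = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
candidates = [
    np.array([1.0, 1.0, 3.0, -3.0, 2.0]),  # at the goal in XY, far in nuisance dims
    np.array([0.0, 0.0, 0.1, 0.0, 0.1]),   # far in XY, close in nuisance dims
]

best_euclid = int(np.argmin([euclidean_cost(z, goal) for z in candidates]))
best_probe = int(np.argmin([probe_cost(z, goal, W) for z in candidates]))
```

Raw Euclidean distance prefers the second candidate because the nuisance coordinates dominate the squared error, whereas the probe-based cost prefers the first, which is the candidate that actually reaches the goal position.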

2605.22138 2026-05-22 cs.AI cs.CL cs.LG cs.RO 版本更新

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

通过自我调节模拟规划实现高效的代理推理

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

发表机构 * Institute of Foundation Models (IFM)(基础模型研究所) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出通过分解决策过程为三个系统:模拟推理、自我调节和反应执行,来提升代理推理的效率,并展示了SR$^2$AM模型在不同任务中的表现。

Comments Code and model artifacts are available at https://github.com/sailing-lab/sr2am

详情
AI中文摘要

代理应该如何决定何时以及如何规划?主流方法将代理建模为具有自适应计算的反应策略(例如链式思考),通过端到端训练期望规划隐式地出现。由于无法控制规划的存在、结构或时间范围,这些系统显著增加了推理长度,导致无效的令牌使用,而没有可靠的准确性提升。我们主张高效的代理推理受益于将决策过程分解为三个系统:模拟推理(系统II)通过世界模型将推理根植于未来状态预测;自我调节(系统III)通过学习的配置器决定何时以及如何深入规划;以及反应执行(系统I)处理细粒度的动作。模拟推理在不同任务中提供统一的规划,而无需每个领域的工程,同时自我调节确保规划只在需要时被调用。为了测试这一点,我们开发了SR$^2$AM(Self-Regulated Simulative Reasoning Agentic LLM),在LLM的链式思考中实现这两个系统作为独立阶段,其中LLM作为世界模型。我们探索了两种实现:从提示的多模块系统中记录决策(v0.1)和从预训练推理LLM的痕迹中重建结构化计划(v1.0),通过监督学习和强化学习(RL)训练。在数学、科学、表格分析和网络信息检索中,v0.1-8B和v1.0-30B在性能上与120-355B和685B-1T参数系统相当,而v1.0-30B使用的推理令牌比同类代理LLM少25.8-95.3%。强化学习使平均规划时间增加22.8%,而规划频率仅增加2.0%,表明它学会了更远地规划而不是更频繁地规划。更广泛地说,学习的自我调节实例化了一个原则,我们预计可以扩展到代理如何管理自己的学习和适应。

英文摘要

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

2605.22123 2026-05-22 cs.RO 版本更新

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

超越像素:从少量示范中学习不变的奖励以实现实世界机器人学

Tengye Xu, Yangting Sun, Ziju Shen, Guanqi Chen, Zhen Fu, Chen Yizhou, Hua Chen, Jia Pan

发表机构 * School of Computing and Data Science, The University of Hong Kong(计算科学与数据科学学院,香港大学) LimX Dynamics Technology Co., Ltd(LimX动力技术有限公司) Southern University of Science and Technology(南方科技大学) Peking University(北京大学) Zhejiang University(浙江大学)

AI总结 本文提出了一种从少量示范中学习不变奖励的方法,以实现实世界机器人学中的泛化能力,通过发现行为不变量来改进奖励函数的设计,从而在多个任务中提升策略学习效果。

详情
AI中文摘要

设计能够超越受控实验室环境的奖励函数仍然是强化学习在机器人学中的基本挑战。在开放世界操纵问题中,单一任务可以通过不同的物体实例、位置和摄像头视角出现多种变体。最近基于视觉的奖励模型倾向于记忆特定的像素分布,并且无法超越其训练条件进行泛化。为了解决这个问题,我们提出了一种框架,该框架可以从最少的五个示范中学习不变的符号奖励函数。关键思想是将视觉特征拟合转向发现行为不变量:在多样化的视觉实例中保持不变的任务级属性。该框架有两个耦合的组件:一个结构化奖励公式,它编码任务级策略和物理约束,同时保持最优策略不变性;以及一个混合的符号-数值过程,该过程从示范中提炼这些不变量,而无需在线交互。在八个Meta-World任务和三个Franka操纵任务上的实验表明,我们的方法在过程对齐和策略展开排名能力方面优于基线方法,加速了下游策略学习。三个现实世界的出分布实验进一步表明,学习到的奖励能够零样本泛化到位置、视角和物体变体,使单一奖励表示能够在实践中重用于多种任务变体。

英文摘要

Designing reward functions that generalize beyond controlled laboratory settings remains a fundamental challenge in reinforcement learning for robotics. In open-world manipulation problems, a single task can appear in numerous variants through different object instances, positions, and camera viewpoints. Recent vision-based reward models tend to memorize specific pixel distributions and fail to generalize beyond their training conditions. To address this, we propose a framework that learns invariant symbolic reward functions from as few as five demonstrations. The insight is to shift from visual feature-fitting to the discovery of behavioral invariants: task-level properties that remain constant across diverse visual instantiations. The framework has two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations without online interaction. Experiments on eight Meta-World tasks and three Franka manipulation tasks demonstrate that our method achieves stronger process alignment and policy rollout ranking abilities compared to baselines, accelerating downstream policy learning. Three real-world out-of-distribution experiments further show that the same learned reward generalizes zero-shot to position, viewpoint, and object variations, enabling a single reward representation to be reused across diverse task variants in practice.

2605.18047 2026-05-22 cs.RO 版本更新

FUSE: A Framework for Unified State Estimation in Vehicular and Robotic SLAM Systems

FUSE:一种用于车辆和机器人SLAM系统统一状态估计的框架

Wei Wu, Honglin Chen, Wenhan Cao, Yao Lyu, Shaobing Xu, Kun Jiang, Jiangtao Li, Tao Zhang, Lei Guo, Shengbo Eben Li

发表机构 * State Key Lab of Intelligent Green Vehicle and Mobility, Tsinghua University(智能绿色车辆与移动国家重点实验室,清华大学) School of VM and College of AI, Tsinghua University(车辆学院与人工智能学院,清华大学) SunRisingAI Ltd.(SunRisingAI有限公司) China Intelligent and Connected Vehicles (Beijing) Research Institute Co., Ltd.(中国汽车智能互联车辆(北京)研究院有限公司)

AI总结 本文提出FUSE框架,用于统一车辆和机器人SLAM系统中的状态估计,通过分离时间处理、局部几何关联、估计器公式和地图更新策略,提高状态估计设计的灵活性和准确性。

详情
AI中文摘要

在混合速率传感下,紧密耦合的SLAM公式通常将时间处理、局部几何关联、估计器公式和地图更新策略绑定到特定方法的设计中。这种绑定使得难以在不重新设计其余状态估计过程的情况下改变一个设计选择。本文提出了FUSE,一种用于车辆和机器人SLAM系统统一状态估计的框架。FUSE围绕观察摄入、传播、更新和状态查询组织状态估计接口,并利用此接口将时间处理、残差准备的局部几何关联、估计器公式和地图更新策略分开。开发了一个LiDAR-IMU实例来在混合速率传感和方向退化下检验该框架,其中高速惯性传播、LiDAR触发的几何更新、残差筛选和退化感知的修正通过相同的接口边界操作。在418米的环形走廊序列中,该实例报告了1.626米的端到端轨迹误差,与Faster-LIO相比,相对误差减少了7.9%。结果支持FUSE作为组织状态估计设计选择的框架,并展示了评估实例如何在弱可观测方向上正则化更新。

英文摘要

Tightly coupled SLAM formulations under mixed-rate sensing often bind temporal processing, local geometric association, estimator formulation, and map-update policy into method-specific designs. Such binding makes it difficult to vary one design choice without re-engineering the rest of the state-estimation process. This paper presents FUSE, a framework for unified state estimation in vehicular and robotic SLAM systems. FUSE organizes the state-estimation interface around observation ingestion, propagation, update, and state query, and uses this interface to separate temporal processing, residual-ready local geometric association, estimator formulation, and map-update policy. A LiDAR-IMU instantiation is developed to examine the framework under mixed-rate sensing and directional degeneracy, where high-rate inertial propagation, LiDAR-triggered geometric update, residual screening, and degeneracy-aware correction operate through the same interface boundaries. On a 418 m loop-corridor sequence, the instantiation reports a 1.626 m end-to-end trajectory error, corresponding to a 7.9% relative error reduction compared with Faster-LIO, the lowest-error baseline on this sequence. The results support FUSE as a framework for organizing state-estimation design choices and show how the evaluated instantiation regularizes updates along weakly observable directions.

2605.17950 2026-05-22 cs.RO cs.SY eess.SY 版本更新

Active Defense Against False Data Injection Attacks in Robotic Manipulators

对抗机器人机械臂中虚假数据注入攻击的主动防御

Gabriele Gualandi, Carl Mikael Larsson, Alessandro V. Papadopoulos

发表机构 * Mälardalen University, Västerås, Sweden

AI总结 本文提出两种防御方法,即异常感知虚拟阻尼和操作性降低,以提高机器人机械臂在有限时间范围内抵御虚假数据注入攻击的能力,并通过仿真验证其有效性。

Comments Extended 8-page version containing full proofs. An abridged 6-page version has been accepted for publication in the Proceedings of the 23rd IFAC World Congress (2026). v3: Minor typographical fixes and updated reference formatting

详情
AI中文摘要

机器人系统容易受到虚假数据注入攻击(FDIAs)的影响,其中攻击者通过篡改传感器信号来获得恶意控制。反馈线性化使机器人系统暴露于积分器漏洞,使其容易受到隐蔽攻击,这些攻击可能导致末端执行器行为出现显著偏差而不会引发警报。本文通过形式化两种防御方法,即异常感知虚拟阻尼和操作性降低,以提高机械臂在有限时间范围内抵御FDIAs的韧性,并在名义任务执行中提供概率保证。在7自由度冗余机械臂上的仿真显示,所提出的防御方法在与仅使用阈值基ADS如卡方检测相比时,显著减少了FDIA的影响,同时在无攻击情况下保持了名义任务性能。

英文摘要

Robotic systems are vulnerable to False Data Injection Attacks (FDIAs), where adversaries corrupt sensor signals to gain malicious control. Feedback linearization exposes robotic systems to integrator vulnerability, making them susceptible to stealthy attacks that can cause significant deviations in end-effector behavior without raising alarms. This paper addresses the resilience of manipulators against finite-horizon FDIAs by formalizing two defense methods, namely anomaly-aware virtual damping and manipulability reduction, with probabilistic guarantees on nominal task execution. Simulations on a 7-DOF redundant manipulator show that the proposed defenses substantially reduce the impact of FDIAs compared with relying solely on a threshold-based anomaly detection system (ADS) such as the chi-squared detector, while preserving nominal task performance in the absence of attack.
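
For context on the threshold-based baseline mentioned above, here is a minimal sketch of a chi-squared residual detector. It assumes a 3-dimensional measurement residual with known covariance and uses the standard 95% quantile of the chi-squared distribution with 3 degrees of freedom (about 7.815); the residual values and covariance are hypothetical and do not come from the paper.

```python
import numpy as np

# 95% quantile of the chi-squared distribution with 3 degrees of freedom.
CHI2_95_DOF3 = 7.815

def chi2_alarm(residual, cov, threshold=CHI2_95_DOF3):
    # Detection statistic g = r^T S^{-1} r; an alarm is raised when g > threshold.
    g = float(residual @ np.linalg.solve(cov, residual))
    return g, g > threshold

cov = np.diag([0.01, 0.01, 0.01])        # hypothetical residual covariance

# A small, spread-out bias stays below the threshold (a "stealthy" injection).
g_small, alarm_small = chi2_alarm(np.array([0.1, 0.1, 0.1]), cov)

# A larger bias on one channel trips the alarm.
g_large, alarm_large = chi2_alarm(np.array([0.3, 0.0, 0.0]), cov)
```

The first case shows why a pure threshold test can miss an attack: an injected signal that keeps the statistic under the threshold raises no alarm, which is the gap the proposed defenses aim to cover.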

2605.15153 2026-05-22 cs.RO cs.AI 版本更新

Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Pelican-Unify 1.0:一种用于理解和推理、想象和行动的统一具身智能模型

Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Zeting Liu, Xianzhou Hou, Yong Dai, Jian Tang, Xiaozhu Ju

发表机构 * Beijing Innovation Center of Humanoid Robotics (X-Humanoid)(北京人形机器人创新中心(X-Humanoid))

AI总结 本文提出Pelican-Unify 1.0,一种基于统一原则训练的首个具身基础模型,通过单一视觉语言模型作为统一理解模块,将场景、指令、视觉上下文和行动历史映射到共享语义空间,并通过统一推理模块生成任务、行动和未来导向的思维链,最终将隐藏状态投影到密集潜在变量中,再通过统一未来生成器生成未来视频和行动。

详情
AI中文摘要

我们提出了Pelican-Unify 1.0,首个根据统一原则训练的具身基础模型。Pelican-Unify 1.0使用单一视觉语言模型作为统一理解模块,将场景、指令、视觉上下文和行动历史映射到共享语义空间。同一视觉语言模型也作为统一推理模块,通过单次前向传递自回归地生成任务、行动和未来导向的思维链,并将最终隐藏状态投影到密集潜在变量中。统一未来生成器(UFG)然后基于该潜在变量,在同一去噪过程中通过两个模态特定的输出头联合生成未来视频和未来行动。语言、视频和行动损失均反向传播到共享表示中,使模型在训练过程中共同优化理解和推理、想象和行动,而非训练三个独立专家系统。实验表明,统一并不意味着妥协。通过单一检查点,Pelican-Unify 1.0在所有三种能力上均取得强劲表现:在八个VLM基准测试中得分为64.7,是同类模型中最佳;在WorldArena中得分为66.03,排名第一;在RoboTwin中得分为93.5,是对比行动方法中第二好的平均值。这些结果表明,统一范式在保持专业能力的同时,将理解和推理、想象和行动整合到一个模型中。

英文摘要

We present Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unify 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unify 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.

2605.10696 2026-05-22 cs.RO 版本更新

VRA: Grounding Discrete-Time Joint Acceleration in Voltage-Constrained Actuation

VRA:将离散时间关节加速度落实于电压受限的致动

Lingwei Zhang, Jiaming Wang, Tianlin Zhang, Zhitao Song, Xuanqi Zeng, Weipeng Xia, Zhongyu Li, Yun-hui Liu

发表机构 * Department of Mechanical and Automation Engineering(机械与自动化工程系) Hong Kong Embodied AI Lab(香港具身AI实验室) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出VRA方法,通过将运动学加速度与电压受限致动器物理相联系,解决在电压受限情况下不可实现的加速度问题,实验表明该方法能消除不可实现的加速度,恢复一致的近约束执行并减少约束引起的振荡。

Comments 10 pages, Accepted by RSS 2026

详情
AI中文摘要

离散时间关节加速度约束被广泛用于强制位置和速度限制。然而,在电压受限的电动致动器中,运动学上可行的加速度可能无法物理实现,暴露了缺失的执行层面抽象。我们提出电压可实现加速度(VRA),一种关节级加速度接口,通过限制命令加速度到电压可实现的约束,将运动学加速度接地在电压受限致动器物理上。在电动致动器和轮腿四足机器人上的硬件实验表明,VRA消除了不可实现的加速度,恢复了一致的近约束执行,并减少了约束引起的振荡。

英文摘要

Discrete-time joint acceleration constraints are widely used to enforce position and velocity limits. However, under voltage-constrained electric actuators, kinematically admissible accelerations may be physically unrealizable, exposing a missing execution-level abstraction. We propose Voltage-Realizable Acceleration (VRA), a joint-level acceleration interface that grounds kinematic acceleration in voltage-constrained actuator physics by restricting commanded accelerations to voltage-realizable constraints. Hardware experiments on electric actuators and a wheel-legged quadruped show that VRA removes unrealizable accelerations, restores consistent near-constraint execution, and reduces constraint-induced oscillations.
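
To show why a kinematically admissible acceleration can be unrealizable under a voltage limit, the sketch below uses a simplified DC-motor model as an assumption: steady-state electrical equation `V = R*i + k_e*omega`, torque `k_t*i`, rigid inertia, and no load torque or inductance. The function names and all parameter values are hypothetical, and this is not the VRA formulation from the paper.

```python
def voltage_realizable_accel_bounds(omega, v_max, resistance, k_e, k_t, inertia):
    # Current range reachable with |V| <= v_max at speed omega
    # (ASSUMED model: V = R*i + k_e*omega, inductance neglected).
    i_min = (-v_max - k_e * omega) / resistance
    i_max = (v_max - k_e * omega) / resistance
    # Acceleration range from torque k_t*i on a rigid inertia (no load torque).
    return k_t * i_min / inertia, k_t * i_max / inertia

def clamp_accel(a_cmd, omega, v_max, resistance, k_e, k_t, inertia):
    # Restrict a commanded acceleration to the voltage-realizable interval.
    a_lo, a_hi = voltage_realizable_accel_bounds(
        omega, v_max, resistance, k_e, k_t, inertia
    )
    return min(max(a_cmd, a_lo), a_hi)

# Hypothetical actuator: 24 V supply, spinning at 200 rad/s.
params = dict(omega=200.0, v_max=24.0, resistance=0.5, k_e=0.1, k_t=0.1, inertia=0.01)

a_safe = clamp_accel(300.0, **params)      # command exceeds what the voltage allows
a_unchanged = clamp_accel(50.0, **params)  # command already realizable
```

At high speed the back-EMF term `k_e*omega` consumes most of the supply voltage, so the upper acceleration bound shrinks well below what a purely kinematic limit would permit.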

2604.11028 2026-05-22 cs.RO cs.AI 版本更新

Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation

联邦单体机器人:多机器人协调无需机器人内部多代理碎片化

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

发表机构 * School of Software, Harbin Institute of Technology(哈尔滨工业大学软件学院) School of Computer Science and Technology, Harbin Institute of Technology(哈尔滨工业大学计算机科学与技术学院) School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus(赫瑞-沃顿大学马来西亚校区数学与计算机科学学院) School of Future Science and Engineering, Soochow University(苏州大学未来科学与工程学院)

AI总结 本文提出了一种联邦单体机器人(FSAR)架构,通过在单体机器人运行时基础上实现多机器人协调,避免了机器人内部的多代理碎片化,提升了协调效率和恢复能力。

Comments 30 pages, 10 figures, 9 tables. Code: https://github.com/s20sc/fsar-fleet-coordination

Details
Abstract

As embodied robots move toward fleet-scale operation, multi-robot coordination is becoming a central systems challenge. Existing approaches often treat this as motivation for increasing internal multi-agent decomposition within each robot. We argue for a different principle: multi-robot coordination does not require intra-robot multi-agent fragmentation. Each robot should remain a single embodied agent with its own persistent runtime, local policy scope, capability state, and recovery authority, while coordination emerges through federation across robots at the fleet level. We present Federated Single-Agent Robotics (FSAR), a runtime architecture for multi-robot coordination built on single-agent robot runtimes. Each robot exposes a governed capability surface rather than an internally fragmented agent society. Fleet coordination is achieved through shared capability registries, cross-robot task delegation, policy-aware authority assignment, trust-scoped interaction, and layered recovery protocols. We formalize key coordination relations including authority delegation, inter-robot capability requests, local-versus-fleet recovery boundaries, and hierarchical human supervision, and describe a fleet runtime architecture supporting shared Embodied Capability Module (ECM) discovery, contract-aware cross-robot coordination, and fleet-level governance. We evaluate FSAR on representative multi-robot coordination scenarios against decomposition-heavy baselines. Results show statistically significant gains in governance locality (d=2.91, p<.001 vs. centralized control) and recovery containment (d=4.88, p<.001 vs. decomposition-heavy), while reducing authority conflicts and policy violations across all scenarios. Our results support the view that the path from embodied agents to embodied fleets is better served by federation across coherent robot runtimes than by fragmentation within them.

2604.07799 2026-05-22 cs.RO cs.AI Updated version

Learning Without Losing Identity: Capability Evolution for Embodied Agents

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

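Affiliations * School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University; School of Future Science and Engineering, Soochow University

AI summary This paper proposes a capability-centric evolution paradigm for embodied agents. It introduces Embodied Capability Modules (ECMs) to enable continuous improvement while keeping the agent's identity stable. Simulated experiments show higher task success rates than agent-modification and skill-learning baselines, with zero safety violations.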

Comments 12 pages, 2 figures, 7 tables

Details
Abstract

Embodied agents are expected to operate persistently in dynamic physical environments, continuously acquiring new capabilities over time. Existing approaches to improving agent performance often rely on modifying the agent itself -- through prompt engineering, policy updates, or structural redesign -- leading to instability and loss of identity in long-lived systems. In this work, we propose a capability-centric evolution paradigm for embodied agents. We argue that a robot should maintain a persistent agent as its cognitive identity, while enabling continuous improvement through the evolution of its capabilities. Specifically, we introduce the concept of Embodied Capability Modules (ECMs), which represent modular, versioned units of embodied functionality that can be learned, refined, and composed over time. We present a unified framework in which capability evolution is decoupled from agent identity. Capabilities evolve through a closed-loop process involving task execution, experience collection, model refinement, and module updating, while all executions are governed by a runtime layer that enforces safety and policy constraints. We demonstrate through simulated embodied tasks that capability evolution improves task success rates from 32.4% to 91.3% over 20 iterations, outperforming both agent-modification baselines and established skill-learning methods (SPiRL, SkiMo), while preserving zero policy drift and zero safety violations. Our results suggest that separating agent identity from capability evolution provides a scalable and safe foundation for long-term embodied intelligence.

2602.06995 2026-05-22 cs.RO cs.CV cs.IT cs.MA math.IT Updated version

When Simultaneous Localization and Mapping Meets Wireless Communications: A Survey

Konstantinos Gounis, Sotiris A. Tegos, Dimitrios Tyrovolas, Panagiotis D. Diamantoulakis, George K. Karagiannidis

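Affiliations * Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki

AI summary This survey reviews the state of the art at the intersection of SLAM and wireless communications, focusing on their bidirectional impact in visual SLAM (V-SLAM) integration. It summarizes key concepts in wireless signal propagation, geometric channel modeling, and radio frequency (RF)-based localization and sensing, shows how image processing techniques can detect landmarks and predict optimal paths for wireless channels, and analyzes the techniques, challenges, and future directions of the field.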

Details
Abstract

This paper surveys the state-of-the-art in the nexus of SLAM and Wireless Communications, attributing the bidirectional impact of each with a focus on visual SLAM (V-SLAM) integration. We provide an overview of key concepts related to wireless signal propagation, geometric channel modeling, and radio frequency (RF)-based localization and sensing. In addition to this, we show image processing techniques that can detect landmarks, proactively predicting optimal paths for wireless channels. Several dimensions are considered, including the prerequisites, techniques, background, and future directions and challenges of the intersection between SLAM and wireless communications. We analyze estimation and control approaches such as Bayesian filters, feature-based pose estimation, perception-aware motion control, spatial methods for signal processing such as vector fields, and key technological aspects. We expose techniques and items towards enabling a highly effective retrieval of the autonomous robot state. Among other interesting findings, we observe that monocular V-SLAM would benefit from RF relevant information, as the latter can serve as a proxy for the scale ambiguity resolution. Conversely, we find that wireless communications in the context of 5G and beyond can potentially benefit from visual odometry that is central in SLAM. Moreover, we examine other sources besides the camera for SLAM and describe the twofold relation with wireless communications. Finally, integrated solutions performing joint communications and SLAM appear to be in their infancy: theoretical and practical advancements are required to add higher-level localization and semantic perception capabilities to RF and multi-antenna technologies.

2511.07820 2026-05-22 cs.RO cs.AI cs.CV cs.GR cs.SY eess.SY Updated version

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Fernando Castañeda, Sirui Chen, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Jinhyung Park, David Sami, Zi Wang, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu

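Affiliations * NVIDIA

AI summary This paper scales up motion tracking along model capacity, data, and compute to obtain a generalist humanoid controller capable of natural, robust whole-body movements. It demonstrates both the scalability of motion tracking and its utility for downstream tasks.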

Comments Project page: https://nvlabs.github.io/SONIC/

Details
Abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

2507.23773 2026-05-22 cs.AI cs.CL cs.LG cs.RO Updated version

General Agentic Planning Through Simulative Reasoning with World Models

Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing

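Affiliations * Institute of Foundation Models (IFM); Carnegie Mellon University; UC San Diego

AI summary This paper argues for general agentic planning through simulative reasoning, using a world model to predict future states and ground decisions in them. Its SiRA architecture achieves higher task completion rates than a matched reactive baseline across three task categories.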

Comments Winner of Berkeley LLM Agents Hackathon (Fundamentals Track); code available at https://github.com/sailing-lab/sira

Details
Abstract

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.

2404.05307 2026-05-22 cs.CV cs.RO Updated version

4D Radar Semantic Segmentation of People in Field Conditions Using Temporal Multi-View Networks

Mikael Skog, Oleksandr Kotlyar, Vladimír Kubelka, Martin Magnusson

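Affiliations * Center for Advanced Autonomous Sensor Systems (AASS)

AI summary This paper proposes TMVA4D, a family of networks that performs semantic segmentation of people from 4D radar data, using multi-view 2D projections to separate the person class from the background. It reaches a Dice score of 75.9% and an IoU of 61.2% for the person class, including in low-visibility conditions.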

Details
Abstract

Reliable people detection is crucial for the safe autonomy of mobile robots and heavy vehicles, both on roads and in industrial settings like mining and construction. However, common sensors like cameras or lidars are prone to failure in adverse conditions such as dust, fog, or smoke, which limits their use in real-world robotic systems. Radar, on the other hand, delivers robust measurements in a wide range of environmental conditions. In particular, modern high-resolution 4D imaging radars provide 4D point clouds across range, azimuth, and elevation, as well as per-point Doppler velocity data, well suited for robot perception. We propose TMVA4D, a family of artificial neural network architectures based on CNN and ConvLSTM encoders that leverage the 4D radar modality for semantic segmentation. The architectures are trained to distinguish between background and person classes using a series of 2D projections of the 4D radar data, encompassing elevation, azimuth, range, and Doppler velocity dimensions. Evaluated across several operational sites, our models achieve promising performance (Dice 75.9%, IoU 61.2% for class person) even in low-visibility conditions. The data and code will be made publicly available upon publication.
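
A quick consistency check on the two reported metrics: for a single class scored over one pooled set of points, Dice and IoU are tied by Dice = 2·IoU / (1 + IoU). The sketch below applies that identity to the reported IoU. The pooling assumption is ours, not the paper's, and the identity does not hold exactly when scores are averaged per frame.

```python
def dice_from_iou(iou: float) -> float:
    """Convert IoU to Dice for one class over a pooled set of predictions."""
    if not 0.0 <= iou <= 1.0:
        raise ValueError("IoU must lie in [0, 1]")
    return 2.0 * iou / (1.0 + iou)


def iou_from_dice(dice: float) -> float:
    """Inverse conversion: IoU = Dice / (2 - Dice)."""
    if not 0.0 <= dice <= 1.0:
        raise ValueError("Dice must lie in [0, 1]")
    return dice / (2.0 - dice)


reported_iou = 0.612  # IoU for class person, taken from the abstract
implied_dice = dice_from_iou(reported_iou)
print(f"Dice implied by IoU 61.2%: {implied_dice:.1%}")
```

For the reported IoU of 61.2%, the identity gives a Dice of about 75.9%, which agrees with the reported value.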

2605.22021 2026-05-22 cs.RO Updated version

Industrial Dual-Arm Box Handling via Online Inertial Estimation and Convex Wrench Optimization

Kenzhi Iskandar Wong, Lin Yang, Qian Ying Lee, Domenico Campolo

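Affiliations * School of Mechanical and Aerospace Engineering, Nanyang Technological University

AI summary This paper presents a friction-aware dual-arm box-handling framework for objects with unknown inertial properties. It estimates the object's mass and center of mass online and uses a second-order cone program under ellipsoidal friction-limit-surface constraints to compute friction-feasible contact forces and torsional moments, achieving stable lifting.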

Comments 14 pages, submitted to Robotics and Computer-Integrated Manufacturing (RCIM) Journal

Details
Abstract

Industrial robotic object handling often involves boxes and packages whose mass and center of mass are not known in advance. These uncertainties affect the force--moment balance required for stable lifting, and improper regulation of contact wrenches can lead to slip, object drop, orientation deviation, or excessive squeezing. This paper presents a friction-aware dual-arm box-handling framework for objects with unknown inertial properties. The proposed approach estimates the object mass and center of mass online from measured contact wrenches, and computes friction-feasible contact forces and torsional moments through a second-order cone program (SOCP) under ellipsoidal friction-limit-surface constraints. An offline trajectory refinement stage is also included to reduce undesired object--environment contact when geometric constraints are present. By enforcing friction feasibility as a hard constraint and minimizing contact effort within the feasible region, the framework achieves stable lifting without treating slip avoidance and excessive squeezing as separately tuned objectives. Experiments on a real dual-arm robotic system under different center-of-mass configurations demonstrate that the method lifts objects with unknown inertial properties while maintaining stable frictional contact.
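
To make the estimation step concrete, here is a minimal sketch of how mass and the horizontal center of mass could be recovered from contact wrenches under a quasi-static assumption. It is not the paper's estimator: the function name, the world-aligned frame with z pointing up, and the use of the summed wrench of both contacts about a fixed reference point are all assumptions made for illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2, world z axis pointing up


def estimate_mass_and_com_xy(wrenches):
    """Estimate mass and horizontal CoM from static net contact wrenches.

    Each wrench is (fx, fy, fz, tx, ty, tz): the summed force and torque of
    both contacts, expressed in a world-aligned frame about a fixed reference
    point. Static balance gives fz = m*G, tx = m*G*cy and ty = -m*G*cx.
    """
    n = len(wrenches)
    fz = sum(w[2] for w in wrenches) / n   # average to suppress sensor noise
    tx = sum(w[3] for w in wrenches) / n
    ty = sum(w[4] for w in wrenches) / n
    mass = fz / G
    return mass, (-ty / fz, tx / fz)


# Hypothetical test box: 2 kg, CoM offset 5 cm in x and -2 cm in y.
true_mass, true_cx, true_cy = 2.0, 0.05, -0.02
w = (0.0, 0.0, true_mass * G,
     true_mass * G * true_cy, -true_mass * G * true_cx, 0.0)
mass, (cx, cy) = estimate_mass_and_com_xy([w, w, w])
print(mass, cx, cy)
```

With gravity along z, the vertical component of the center of mass does not appear in the static balance, so a single box orientation cannot recover it.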

2605.21976 2026-05-22 cs.RO Updated version

TacO: Benchmarking Tactile Sensors for Object Manipulation

Anya Zorin, Zilin Si, Myungsun Park, Junsung Park, Alexiy Buynitsky, Sachin Bhadang, Taejun Park, Sohee John Yoon, Yong-Lae Park, Oliver Kroemer, Zeynep Temel, Michael T. Tolley, Sha Yi, Xiaolong Wang

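Affiliations * UC San Diego; CMU; SNU

AI summary This paper proposes a task-driven framework for evaluating tactile sensors. Separate manipulation policies are trained for sensors of four modalities (visual, acoustic, magnetic, and resistive) on three tasks, and the results show that the usefulness of tactile information depends on sensor modality, material properties, and the task.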

Details
Abstract

Vision-based learning from demonstrations has achieved remarkable success in enabling robots to perform manipulation tasks and high-level semantic reasoning, yet it remains insufficient for complex, contact-rich manipulation. While there is broad agreement that tactile sensing improves manipulation, there is no empirical guidance on which tactile sensors are best suited for which manipulation tasks. In this paper, we provide a systematic, task-driven evaluation of tactile sensors for robot manipulation and propose a framework for selecting and evaluating sensors based on manipulation policy performance. Separate manipulation policies are trained for tactile sensors of four distinct modalities: visual, acoustic, magnetic, and resistive, across three tasks: pick-and-place with unknown mass, object reorientation, and plug insertion. For each task, an analysis of how sensor properties such as spatial resolution, shear sensing, and tactile representation, and the inherent material friction affect task performances is done. Rather than tactile sensing being universally beneficial in the same way, our results show that the usefulness of tactile information depends strongly on sensor modality, material properties, and the specific manipulation tasks. All of the tactile sensors, code, data, and hardware setup will be publicly available on the project website.

2605.21947 2026-05-22 cs.RO Updated version

A Visitation Grid for Complete Coverage Foraging in Robot Swarms

Qi Arturo Gonzalez, Yifeng Gao, Li Zhang, Qi Lu

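Affiliations * Department of Computer Science, The University of Texas Rio Grande Valley

AI summary This paper proposes a grid-based stochastic foraging strategy that reduces redundant visits and accelerates late-stage collection, improving the efficiency and completeness of resource collection by robot swarms in large, unknown environments.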

Comments The 23rd International Conference on Ubiquitous Robots, 10 figures, 3 tables

Details
Abstract

The complete collection of sparse resources in large, unknown environments remains a challenging problem for autonomous robot swarms. Previous studies have shown that a substantial portion of total mission time is consumed during the final stage of collection, where only a small fraction of randomly scattered resources remain. Consequently, many existing swarm foraging algorithms (search and collection) focus on collecting most resources within a limited time window, rather than improving end-stage efficiency for collecting all resources. We propose a grid-based stochastic foraging strategy that explicitly reduces redundant visits and accelerates late-stage collection. The unknown search area is partitioned into a grid map, which is maintained by a lightweight central server. To maintain scalability, both robots and the server operate within limited memory and computational constraints. The server updates the grid-level visitation counts based on robot-reported locations, producing a global estimate of the exploration density. For each new foraging trip, a robot selects its next search area from a local 3 X 3 neighborhood of grids probabilistically with the lowest visitation count, thus biasing exploration toward under-visited regions while maintaining stochasticity. Extensive simulation experiments demonstrate that the proposed strategy consistently outperforms the canonical centrally placed baseline foraging algorithm (CPFA). Compared to CPFA, the proposed method reduces the total collection time by up to 33% and improves collection efficiency by more than 48% during the final stage of the mission. These results indicate that the proposed strategy is robust, flexible, and scalable for near-complete and complete resource collection in robot swarms and can serve as a general enhancement for stochastic swarm foraging methods under limited onboard resources.
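
The selection rule in the abstract can be sketched as follows. The abstract does not give the exact probability law, so the weight 1 / (1 + visit count) is an assumption, as are all function names and the choice to include the robot's current cell in the 3 × 3 neighborhood.

```python
import random


def neighborhood(cell, rows, cols):
    """Return the in-bounds cells of the 3 x 3 block centered on `cell`."""
    r, c = cell
    return [(i, j)
            for i in range(r - 1, r + 2)
            for j in range(c - 1, c + 2)
            if 0 <= i < rows and 0 <= j < cols]


def selection_weights(visits, cells):
    """Assumed rule: weight each cell by 1 / (1 + visit count)."""
    return [1.0 / (1.0 + visits[i][j]) for i, j in cells]


def choose_next_cell(visits, cell, rng=random):
    """Sample the next search cell, biased toward under-visited neighbors."""
    cells = neighborhood(cell, len(visits), len(visits[0]))
    return rng.choices(cells, weights=selection_weights(visits, cells), k=1)[0]


def report_visit(visits, cell):
    """Server-side update: increment the visitation count of a reported cell."""
    visits[cell[0]][cell[1]] += 1


visits = [[0] * 5 for _ in range(5)]   # 5 x 5 grid map held by the server
robot_cell = (2, 2)
rng = random.Random(0)                 # seeded only to make the demo repeatable
for _ in range(20):                    # twenty foraging trips by one robot
    robot_cell = choose_next_cell(visits, robot_cell, rng)
    report_visit(visits, robot_cell)
print(visits)
```

Sampling with weights, rather than always taking the least-visited cell, keeps the search stochastic while still steering robots toward under-visited regions.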

2605.21935 2026-05-22 cs.RO Updated version

Learning to Evolve: Multi-modal Interactive Fields for Robust Humanoid Navigation in Dynamic Environments

Peifeng Jiang, Hong Liu, Jin Jin, Wenshuai Wang, Xia Li

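Affiliations * State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School; Oxford Robotics Institute, University of Oxford; Institute for Machine Learning, Department of Computer Science, ETH Zurich

AI summary This paper proposes the Multi-modal Interactive Field (MIF), a system for robust humanoid navigation that combines confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction in a closed-loop perception-adaptation pipeline. It raises relocation success in non-static environments from 12% to 94% and reduces the semantic memory footprint by 91.4%.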

Comments Accepted by Robotics: Science and Systems 2026

Details
Abstract

Safe manipulation-oriented navigation for humanoid robots requires scene memory that remains reliable under locomotion-induced perceptual distortion, environmental changes, and interaction-level geometric safety constraints. Existing semantic mapping and scene-graph systems are difficult to deploy directly in this setting because they often assume stable camera trajectories, static environments, or coarse object geometry. We introduce the Multi-modal Interactive Field (MIF), a humanoid-oriented system that integrates confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction within a closed-loop perception-adaptation pipeline. MIF couples three fields: an uncertainty-aware 3DGS Appearance Field that suppresses gait-induced blur, a Spatial Field that maintains topological memory, and a Geometry Field that supports Interaction Pose Safety (IPS) before manipulation. A discrepancy detection score is introduced to separate locomotion-induced false-positive changes from persistent changes and updates only locally inconsistent regions. On a Unitree-G1 humanoid in a real dynamic office, MIF improves relocation success in non-static environments from 12% to 94% compared with static scene-graph memory, while reducing semantic memory footprint by 91.4% through feature distillation for practical online operation. Project page and code: https://ziya-jiang.github.io/MIF-homepage/

2605.21932 2026-05-22 cs.RO Updated version

Auction-Consensus Algorithm with Learned Bidding Scheme for Multi-Robot Systems

Jose Rodriguez, Constantine Tarawneh, Sven Koenig, Wenjie Dong, Qi Lu

Affiliations * Department of Electrical and Computer Engineering, The University of Texas at Rio Grande Valley (UTRGV); Department of Mechanical Engineering, UTRGV; Department of Computer Science, Donald Bren School of Information and Computer Sciences, University of California, Irvine; Department of Computer Science, UTRGV

AI summary This paper proposes a learning-enhanced auction-consensus framework that improves multi-robot task allocation by training a neural bidding policy with reinforcement learning, while retaining the standard auction and consensus phases for decentralized coordination.

Comments The 23rd International Conference on Ubiquitous Robots, 9 figures, 6 pages

Details
Abstract

Multi-Robot Task Allocation (MRTA) is a central challenge in decentralized multi-agent systems, where teams of robots must cooperatively assign and execute tasks under limited communication while optimizing global performance objectives. Auction-consensus algorithms, such as the Consensus-Based Bundle Algorithm (CBBA), provide scalable decentralized coordination with provable convergence, but rely on hand-crafted greedy scoring functions that often lead to suboptimal task allocations. This paper proposes a learning-enhanced auction-consensus framework in which CBBA's deterministic bidding mechanism is replaced by a neural bidding policy trained using reinforcement learning. Under a centralized training and decentralized execution paradigm, agents learn to compute task bids from partial local observations while retaining the standard auction and consensus phases for decentralized coordination. The learned bidding policy is trained using Proximal Policy Optimization with rewards shaped by proximity to globally optimal solutions obtained via mixed-integer linear programming. Multiple neural architectures are evaluated, including a Neural Additive Model, the Long Short-Term Memory (LSTM) model, and the Set Transformer Model. Experimental results across varying swarm sizes demonstrate that learned bidding policies can improve solution quality over classical CBBA while preserving decentralized execution. The proposed approach highlights the effectiveness of integrating reinforcement learning with classical distributed coordination algorithms, offering a scalable pathway toward higher-quality decentralized multi-robot task allocation.
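
The idea of swapping the bidding function while keeping the auction logic can be illustrated with a much simpler mechanism than CBBA. The sketch below is a single-task auction with centralized max-bid conflict resolution standing in for the consensus phase. It is not CBBA, and `greedy_bid`, `auction_round`, and the distance-based score are hypothetical names and choices.

```python
def greedy_bid(robot_pos, task_pos):
    """Hand-crafted score: closer tasks receive higher bids."""
    dist = abs(robot_pos[0] - task_pos[0]) + abs(robot_pos[1] - task_pos[1])
    return 1.0 / (1.0 + dist)


def auction_round(robots, tasks, bid_fn=greedy_bid):
    """Single-task auction: each task goes to the highest bidder still free.

    `bid_fn` is the pluggable piece; a learned policy would replace it.
    """
    bids = sorted(((bid_fn(rp, tp), r, t)
                   for r, rp in robots.items()
                   for t, tp in tasks.items()), reverse=True)
    assignment, busy = {}, set()
    for bid, r, t in bids:                 # resolve conflicts by max bid
        if r not in busy and t not in assignment:
            assignment[t] = r
            busy.add(r)
    return assignment


robots = {"r1": (0, 0), "r2": (5, 5)}
tasks = {"a": (1, 0), "b": (4, 5)}
print(auction_round(robots, tasks))
```

Passing a different `bid_fn` changes the allocation without touching the conflict-resolution loop, which is the separation the abstract describes between the learned bidding policy and the retained auction and consensus phases.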

2605.21914 2026-05-22 cs.RO Updated version

Non-Contact Vibration-Based Damage Detection of Civil Structures Using a Cost-Effective Autonomous UAV

Javier Becerril, Maximiliano Vargas, Jennifer Herrera, Joanna Gutierrez, Jorge Rios, Mohsen Amjadian, Constantine Tarawneh, Jinghao Yang, Qi Lu

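Affiliations * Department of Computer Science, The University of Texas at Rio Grande Valley (UTRGV); Department of Mechanical Engineering at UTRGV; Department of Civil Engineering at UTRGV; Department of Electrical and Computer Engineering at UTRGV

AI summary This paper presents a non-contact, vibration-based method for detecting damage in civil structures with a cost-effective autonomous UAV. Vibration signals are extracted from video through vision-based motion tracking, and shifts in natural frequencies are used to detect structural degradation. Tests on a laboratory-scale frame structure under healthy and simulated-damage conditions show that the UAV reliably detects damage-induced frequency changes despite slightly higher errors (up to 5.7%), with inspection performance comparable to commercial UAV systems at significantly lower cost.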

Comments 8 pages, 8 figures, The 2026 International Conference on Unmanned Aircraft Systems, ICUAS 2026

Details
Abstract

This paper presents a non-contact approach for vibration-based structural damage detection using an autonomous and customized cost-effective unmanned aerial vehicle (UAV). Vibration signals are extracted from video recordings through vision-based motion tracking to identify shifts in natural frequencies indicative of structural degradation. A laboratory-scale frame structure is evaluated under healthy and simulated-damage conditions. The proposed system is validated through an experimental study involving two smartphones, a USB camera, and a custom-built low-cost UAV equipped with an onboard camera and an autonomous alignment system for operation in GPS-denied environments. The displacement time is extracted and analyzed in the frequency domain and compared to reference measurements from contact accelerometers and a finite element model. Experimental results show that all platforms successfully capture the fundamental frequency and its shift due to damage. Although the UAV exhibits slightly higher errors (up to 5.7%) due to platform-induced disturbances and sensing limitations, it reliably detects damage-induced frequency changes. Compared to commercial UAV systems, the proposed platform achieves comparable inspection performance at significantly lower cost. These results demonstrate that low-cost autonomous UAVs provide a practical, flexible, and scalable solution for structural health monitoring, particularly in scenarios where contact-based sensing is impractical. The findings also support the potential for the deployment of multiple cooperative UAVs to further enhance inspection coverage and robustness.
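
The frequency-domain step can be sketched with a naive discrete Fourier transform on a tracked displacement signal. This is an illustration only: the sinusoidal test signals, the 4.0 Hz and 3.6 Hz frequencies, and the 30 frames-per-second rate are made-up values, and the function names are hypothetical.

```python
import cmath
import math


def dominant_frequency(signal, sample_rate_hz):
    """Return the frequency (Hz) of the largest non-DC bin of a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]          # remove the static offset
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):                 # positive frequencies only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate_hz / n


def synthetic_displacement(freq_hz, sample_rate_hz, n):
    """Hypothetical tracked displacement: a single sinusoid plus an offset."""
    return [0.5 + math.sin(2 * math.pi * freq_hz * t / sample_rate_hz)
            for t in range(n)]


fs = 30.0                                           # assumed camera frame rate
healthy = dominant_frequency(synthetic_displacement(4.0, fs, 300), fs)
damaged = dominant_frequency(synthetic_displacement(3.6, fs, 300), fs)
shift_percent = 100.0 * (healthy - damaged) / healthy
print(healthy, damaged, shift_percent)
```

The frequency resolution is the frame rate divided by the number of samples (0.1 Hz here), so a longer recording is needed to resolve smaller damage-induced shifts.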

2605.21901 2026-05-22 cs.RO Updated version

Higher Order Reasoning for Collaborative Communicationless Mobile Robot Operations

Jonathan Reasoner, Nicola Bezzo

发表机构 * Department of Electrical and Computer Engineering, University of Virginia(弗吉尼亚大学电气与计算机工程系)

AI总结 本文提出了一种基于高阶推理的动态认知规划框架,使机器人能够在无通信环境下实现隐式协调和长周期规划,通过仿真和实物实验验证了其在通信受限领域中提升任务完成效率的能力。

详情
AI中文摘要

在无通信环境下,多机器人系统必须在缺少许多协调策略通常假设的持续信息交换的情况下运作。本文提出了一种新颖的动态认知规划框架,通过机器人之间的高阶推理实现隐式协调和长时域规划。在我们的方法中,机器人形成并传播高阶信念粒子,利用贝叶斯推断更新对世界的信念,并通过能够预测队友可能决策的行为树选择动作。一种时间感知的模型预测路径积分(MPPI)控制器将这种推理整合到低层执行中,使机器人能够在部分可观测条件下规划拦截并调整轨迹。所提出的框架在仿真和实物实验中均得到评估,与一阶基线相比持续缩短了任务完成时间,表明认知逻辑可以作为通信受限领域中弹性协调的稳健基础。

英文摘要

In communicationless environments, multi-robot systems must operate without the constant information exchange that many coordination strategies typically assume. This paper presents a novel dynamic epistemic planning framework that enables implicit coordination and long horizon planning through higher-order reasoning among robots. With our approach, robots form and propagate higher-order belief particles, update world beliefs using Bayesian inference, and select actions via a behavior tree that anticipates teammates' likely decisions. A temporally aware Model Predictive Path Integral (MPPI) controller integrates this reasoning into low-level execution, allowing robots to plan intercepts and adapt trajectories under partial observability. The proposed framework is evaluated in both simulations and physical experiments, where it consistently reduces task completion time compared to a first-order baseline, demonstrating that epistemic logic can serve as a robust foundation for resilient coordination in communication-restricted domains.
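The Bayesian belief update mentioned above can be illustrated with a generic discrete example. This is only a sketch of Bayes' rule over a small hypothesis set, not the paper's higher-order belief-particle scheme; the hypothesis labels and probabilities are invented for illustration.

```python
def bayes_update(prior, likelihood):
    """Posterior over hypotheses: proportional to likelihood times prior."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical example: where is a teammate heading, given one observation?
prior = {"goal_A": 0.5, "goal_B": 0.5}
likelihood = {"goal_A": 0.8, "goal_B": 0.2}  # P(observation | hypothesis)
posterior = bayes_update(prior, likelihood)
```

A particle-based version would apply the same reweighting to each particle and then resample.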

2605.21863 2026-05-22 cs.RO 版本更新

OCELOT: Odometry and Contact Estimation for Legged Robots

OCELOT:用于足式机器人的里程计与接触估计

Emre Girgin, Cagri Kilic

发表机构 * Department of Aerospace Engineering, Embry-Riddle Aeronautical University(安柏瑞德航空大学航空航天工程系)

AI总结 本文提出了一种基于误差状态扩展卡尔曼滤波器(ESEKF)的完整腿部里程计管道,通过仅使用本体感觉数据(如固定IMU、关节编码器和力传感器)来实现准确的里程计估计,核心贡献是融合接触检测和不确定性量化模块,用于显式识别并拒绝滑动。

Comments 8 pages

详情
AI中文摘要

足式机器人领域的一项重大挑战是仅使用机载本体感觉传感器实现准确的里程计。在本研究中,我们提出了一种基于误差状态扩展卡尔曼滤波器(ESEKF)的完整腿式里程计流程,该流程仅依赖本体感觉数据:机身固连IMU、关节编码器和力传感器,其中滤波器状态由被判定处于静止支撑相的足端进行校正。我们的核心贡献是融合式接触检测和不确定性量化模块,用于显式识别并剔除打滑。该模块为每只脚并行运行两个检测器:1)一个经去抖处理、基于力的高斯混合模型(GMM)引导的有限状态机(FSM),用于确认物理接触;2)一个作用于估计足端速度的、基于运动学的广义似然比检验(GLRT)。两个估计器输出的连续质量分数被融合,用于判断足端是否同时处于物理受载和运动学静止状态,并作为每次接触的不确定性信号。为了验证我们的方法,我们采集了一个多模态数据集,包含29个序列,覆盖多样的室内外地形(例如混凝土、草地、鹅卵石和岩石),总长度为2.4公里。我们将该方法与本体感觉和外感受方法进行了基准对比。结果表明,我们的方法能够提供准确的里程计估计,并能稳健应对易打滑环境。我们还将代码和实时ROS2软件包开源。

英文摘要

One of the significant challenges in legged robotics is achieving accurate odometry using only onboard proprioceptive sensors. In this study, we present a complete leg odometry pipeline based on an Error-State EKF (ESEKF) that relies exclusively on proprioceptive data: a body-fixed IMU, joint encoders, and force sensors, where the filter's state is corrected by feet determined to be in a stationary stance. The core of our contribution is a fused contact detection and uncertainty quantification module designed to explicitly identify and reject slippage. This module runs two detectors in parallel for each foot: 1) a debounced, force-based Gaussian Mixture Model (GMM) guided Finite State Machine (FSM) to confirm physical contact, and 2) a kinematics-based Generalized Likelihood Ratio Test (GLRT) on the estimated velocity of the foot. The continuous quality scores from both estimators are fused to detect whether the foot is both physically loaded and kinematically stationary, and serve as an uncertainty signal for each contact. To validate our approach, we collected a multi-modal dataset of 29 sequences spanning diverse indoor and outdoor terrains (e.g., concrete, grass, pebble, and rock), totaling 2.4 km. We benchmarked our approach against both proprioceptive and exteroceptive methods. The results demonstrate our method's efficacy in providing accurate odometry estimates and robustly handling slippage-prone environments. We also share our code and real-time ROS2 package as open source.

2605.21862 2026-05-22 cs.RO cs.AI 版本更新

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

EvoScene-VLA: 在动作解码器中进化场景信念用于分块机器人控制

Chushan Zhang, Ruihan Lu, Jinguang Tong, Xuesong Li, Yikai Wang, Hongdong Li

发表机构 * Australian National University(澳大利亚国立大学) The University of Queensland(昆士兰大学) Beijing Normal University(北京师范大学)

AI总结 本文提出EvoScene-VLA,通过在动作解码器中维护更新的场景状态,改进分块机器人控制中的多步控制预测,提升了场景信念的持续性和准确性。

详情

英文摘要

Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.

2605.21836 2026-05-22 cs.RO 版本更新

Analytical and Experimental Force Analysis of a Soft Linear Pneumatic Actuator

软线性气动执行器的分析与实验力分析

Mohammed Abboodi

AI总结 本文通过分析和实验研究了一种线性软套筒执行器(LSSA)的力特性,探讨了压力、几何形状、位移、负载和轴向刚度之间的耦合效应,揭示了其力生成机制。

详情
AI中文摘要

软套筒执行器(SSAs)最近被开发为可穿戴和辅助机器人系统中的气动驱动方法。通过将驱动结构集成到套筒状几何形状中,这些执行器可以减少对外部附件层和传动机制的依赖,同时保持与肢体形状表面的顺应性。然而,SSAs的力生成行为仍然解释不足,特别是在伸展过程中输出力的变化、外部负载的影响以及轴向刚度的机械作用方面。本文提出了线性软套筒执行器(LSSA)的分析和实验力分析。通过将净轴向力表示为由帽和折叠壁产生的压力生成贡献,并减去与轴向刚度相关的力,开发了一个准静态分析模型。该模型结合了内部压力、投影压力面积、折叠壁几何形状、轴向位移以及实验拟合的轴向刚度关系。进行了预设伸展和静态负载实验以评估执行器响应。在125kPa时,生成的力从零伸展时的约112N减少到40mm时几乎为零。静态负载延迟了可测量的力生成并减少了力输出,特别是在低和中等压力下。结果表明,LSSA的力生成由压力、几何形状、位移、负载和轴向刚度的耦合效应所支配。

英文摘要

Soft sleeve actuators (SSAs) have recently been developed as a pneumatic actuation approach for wearable and assistive robotic systems. By integrating the actuation structure into a sleeve-like geometry, these actuators can reduce reliance on external attachment layers and transmission mechanisms while maintaining compliance with limb-shaped surfaces. However, the force-generation behavior of SSAs remains insufficiently explained, particularly with respect to the variation of output force during extension, the influence of external loading, and the mechanical role of axial stiffness. This paper presents an analytical and experimental force analysis of a linear soft sleeve actuator (LSSA). A quasi-static analytical model was developed by expressing the net axial force as the pressure-generated contribution from the cap and folded walls, reduced by the force associated with axial stiffness. The model incorporates internal pressure, projected pressure areas, folded wall geometry, axial displacement, and an experimentally fitted axial stiffness relation. Prescribed-extension and static-load experiments were conducted to evaluate the actuator response. At 125 kPa, the generated force decreased from approximately 112 N at zero extension to nearly zero at 40 mm. Static loading delayed measurable force generation and reduced force output, particularly at low and intermediate pressures. The results show that LSSA force generation is governed by coupled effects of pressure, geometry, displacement, loading, and axial stiffness.
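The abstract describes the quasi-static model only in words. One way to write a net axial force of that form is sketched below; the symbols are notation assumed here, not taken from the paper: $P$ is the internal pressure, $A_{\mathrm{cap}}$ and $A_{\mathrm{wall}}(x)$ are the projected pressure areas of the cap and folded walls, $x$ is the axial displacement, and $F_k(x)$ is the force from the experimentally fitted axial stiffness relation. The second line is a back-of-envelope check under the added assumptions that the stiffness force vanishes at zero extension and that 125 kPa is a gauge pressure.

```latex
% Assumed notation; a sketch of the structure described in the abstract.
F_{\mathrm{net}}(P, x) = P\,\bigl[A_{\mathrm{cap}} + A_{\mathrm{wall}}(x)\bigr] - F_k(x)

% Effective pressure area implied by the reported zero-extension force,
% assuming F_k(0) = 0:
A_{\mathrm{eff}} \approx \frac{F_{\mathrm{net}}(P, 0)}{P}
  = \frac{112\ \mathrm{N}}{125 \times 10^{3}\ \mathrm{Pa}}
  \approx 8.96 \times 10^{-4}\ \mathrm{m^2}
  \approx 9.0\ \mathrm{cm^2}
```

In this form, the reported drop to nearly zero force at 40 mm corresponds to the stiffness term (and any reduction in projected wall area) balancing the pressure term at that extension.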

2605.21811 2026-05-22 cs.RO 版本更新

Safe and Steerable Geometric Motion Policies for Robotic Dexterous Manipulation

安全且可操控的几何运动策略用于机器人灵巧操作

Albert Wu, Riccardo Bonalli, Thomas Lew, C. Karen Liu

发表机构 * Computer Science Department, Stanford University(斯坦福大学计算机科学系) Laboratory of Signals and Systems, University of Paris-Saclay, CNRS, CentraleSupélec(巴黎-萨克雷大学信号与系统实验室,CNRS,CentraleSupélec) Toyota Research Institute(丰田研究院)

AI总结 本研究提出SafePBDS框架,通过几何一致的方法计算最优且可证明安全的配置流形加速度,以实现机器人灵巧操作中的目标和约束的持续协调,并在模拟和Franka Panda-Allegro手平台上验证了其在灵巧抓取和手部重定向中的高效规划和安全保障。

Comments 24 pages, 10 figures, 5 tables. Project page and demo video: https://tml.stanford.edu/safe-pbds

详情
AI中文摘要

机器人灵巧操作需要持续协调在异构几何空间上定义的目标和约束:一个在$\mathbb{R}^7$配置流形上控制的机器人可能需要在$\mathrm{SE}(3)$上跟踪末端执行器姿态,同时在$\mathbb{R}$上满足避障裕度。我们提出了Safe Pullback Bundle Dynamical Systems(SafePBDS),一种几何一致的框架,它根据任意任务流形上的目标和安全要求,计算最优且可证明安全的配置流形加速度。SafePBDS建立在先前工作的基础上,该工作通过组合预定义的任务流形动力系统来产生自主运动。其第一个创新是拉回控制屏障函数构造,将任务流形上的安全条件转换为配置流形加速度上的线性约束。第二个创新是任务流形动作接口,允许高层策略注入低维残差运动;零输入时恢复自主行为,而在任意输入下安全性都得以保持。这使高层策略能够高效地引导探索,同时将精确运动留给自主行为。我们在仿真和23自由度Franka Panda-Allegro Hand平台上验证了SafePBDS。在灵巧抓取中,SafePBDS在20个家用物体和120次试验中实现了92.5%的成功率。利用动作接口,该方法可通过一维动作在抓取时排除四根手指中的任意一根,在3个物体和36次试验中实现了94.4%的三指抓取成功率。SafePBDS的高效规划和安全保证还首次实现了基于模型的、全驱动的掌心向下手内重定向,在物体重量和腕部运动变化的情况下,两个方向的偏航旋转均超过360度。演示视频和细节:https://tml.stanford.edu/safe-pbds

英文摘要

Robotic dexterous manipulation requires continuously reconciling objectives and constraints defined on heterogeneous geometric spaces: a robot controlled on an $\mathbb{R}^7$ configuration manifold may need to track end-effector poses on $\mathrm{SE}(3)$ while satisfying obstacle avoidance margins in $\mathbb{R}$. We present Safe Pullback Bundle Dynamical Systems (SafePBDS), a geometrically consistent framework that computes optimal, certifiably safe configuration manifold accelerations from objectives and safety requirements on arbitrary task manifolds. SafePBDS builds on prior work that combines predefined task manifold dynamical systems to produce autonomous motion. Its first innovation is a pullback control barrier function construction, which converts task manifold safety conditions into linear constraints on configuration manifold accelerations. The second innovation is a task manifold action interface that allows a high-level policy to inject low-dimensional residual motions; zero input recovers the autonomous behavior, while safety is preserved under arbitrary inputs. This lets high-level policies efficiently steer exploration while leaving precise motion to the autonomous behavior. We validate SafePBDS in simulation and on a 23-DOF Franka Panda-Allegro Hand platform. On dexterous grasping, SafePBDS achieves a $92.5\%$ success rate across 20 household objects and 120 trials. Using the action interface, the method can exclude any one of the four fingers during grasping via a one-dimensional action, achieving $94.4\%$ 3-finger grasp success across 3 objects and 36 trials. The efficient planning and safety guarantee of SafePBDS also enable the first model-based, fully actuated palm-down in-hand reorientation, exceeding $360^\circ$ of yaw rotation in both directions under varying object weight and wrist motion. Demo video and details: https://tml.stanford.edu/safe-pbds

2605.21800 2026-05-22 cs.LG cs.RO 版本更新

stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

stable-worldmodel: 一个用于可重复世界建模研究和评估的平台

Lucas Maes, Quentin Le Lidec, Luiz Facury, Nassim Massaudi, Ayush Chaurasia, Francesco Capuano, Richard Gao, Taj Gillin, Dan Haramati, Damien Scieur, Yann LeCun, Randall Balestriero

发表机构 * Mila & Université de Montréal(Mila与蒙特利尔大学) New York University(纽约大学) Universidade Federal de Minas Gerais(米纳斯吉拉斯联邦大学) Independent Researcher(独立研究者) LanceDB University of Oxford(牛津大学) Brown University(布朗大学)

AI总结 本文提出stable-worldmodel平台,旨在解决世界建模研究中代码库、数据管道和评估协议碎片化的问题,通过提供高性能的数据层、现代世界模型基线和规划求解器的实现,以及扩展的环境和任务,实现标准化和可重复的世界建模研究和评估。

详情
AI中文摘要

世界模型是构建能够推理、规划并泛化到训练数据之外的智能体的核心。然而,目前世界模型的研究仍然碎片化,各不相同的代码库、数据管道和评估协议阻碍了可重复性和公平比较。当前实践还受到三个关键瓶颈的限制:脆弱的一次性代码库、缓慢的视频数据加载以及缺乏标准化的泛化基准。我们提出了stable-worldmodel(swm),一个用于标准化、可重复的世界建模研究和评估的开源平台。它提供了(1)一个基于Lance的高性能数据层,原生支持MP4、HDF5和LeRobot数据集并提供转换工具;(2)干净、经过充分测试的现代世界模型基线和规划求解器实现;(3)一套广泛的环境和任务,并扩展了可控的视觉、几何和物理变化因素,用于在仿真中系统地评估动力学理解、控制性能、表示质量和分布外泛化。通过将整个流程统一在单一的可扩展框架下,swm显著减少了研究开销,并加速了迈向可靠世界模型的可信进展。

英文摘要

World models are central to building agents that can reason, plan, and generalize beyond their training data. However, research on world models is currently fragmented, with disparate codebases, data pipelines, and evaluation protocols hindering reproducibility and fair comparison. Current practice is further limited by three key bottlenecks: fragile one-off codebases, slow video data loading, and the lack of standardized generalization benchmarks. We present stable-worldmodel (swm), an open-source platform for standardized and reproducible world modeling research and evaluation. It delivers (1) a high-performance Lance-based data layer with native support and conversion tools for MP4, HDF5, and LeRobot datasets, (2) clean, well-tested implementations of modern world model baselines and planning solvers, and (3) a broad suite of environments and tasks extended with controllable visual, geometric, and physical factors of variation for systematic in-silico evaluation of dynamics understanding, control performance, representation quality, and out-of-distribution generalization. By unifying the full pipeline under a single, scalable framework, swm dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models.

2605.21788 2026-05-22 cs.CV cs.RO 版本更新

SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

SceneGraphGrounder: 通过结构化场景图匹配实现零样本3D视觉定位

Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 本文提出SceneGraphGrounder框架,通过结构化场景图匹配将3D定位问题转化为结构化图匹配问题,利用视觉标记提示策略从2D视图推断物体间关系,并在3D场景图中建立持久编码,从而在ScanRefer基准测试中实现了零样本条件下与现有方法相当的性能,并在真实机器人部署中验证了其在长周期物理环境中的鲁棒空间推理能力。

详情
AI中文摘要

零样本3D视觉定位需要从非结构化环境中通过自由形式自然语言定位物体。最近的视觉-语言模型(VLM)方法取得了有希望的结果,但依赖于视点依赖的推理或隐式表示,限制了组合查询的空间一致性和可解释性。我们提出了SceneGraphGrounder,一个将3D定位重新表述为在重建的3D场景图上的结构化图匹配的框架。为了实现这种表述,我们引入了一种视觉标记提示策略,使VLM能够从2D视图推断物体-物体关系,这些关系随后被提升为持久的3D场景图编码,既包含空间关系又包含语义关系。给定一个查询,我们构建查询图并与场景图进行受限对齐,确保多视图一致性和可解释的推理。在ScanRefer基准测试中,我们的方法在零样本条件下实现了与现有方法相当的性能,仅使用RGB-D输入。我们进一步通过在移动机器人上的真实世界部署验证了我们的框架,展示了其在长周期物理环境中的鲁棒空间推理能力。我们将在接受后公开我们的代码。

英文摘要

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.

2605.21747 2026-05-22 cs.CV cs.RO 版本更新

Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

通过利用视觉语言模型推断车辆信息以改进自动驾驶中的3D标注

Steven Chen, Shivesh Khaitan, Nemanja Djuric

发表机构 * Aurora Innovation, Inc.(Aurora创新公司)

AI总结 本文提出了一种利用视觉语言模型零样本推断车辆信息的方法,以提高自动驾驶中3D车辆标注的精度;该方法结合车辆品牌与型号识别(VMMR),输出3D包围盒尺寸来引导人工标注,从而提升标注效率和质量。

Comments To appear in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), 2026. Accepted for oral presentation

详情
AI中文摘要

我们提出了一种通过零样本推理车辆信息来提高自动驾驶应用中3D车辆标注的方法,利用车辆制造商和型号识别(VMMR)方法。所提出的方法利用视觉语言模型(VLM)从图像片段中推断车辆的制造商、型号和代数,并输出准确的3D包围盒尺寸以引导手动标注。我们评估了迭代提示工程和不同VLMs选择对车辆包围盒推断和制造商/型号/代数识别的影响。与强大的基线相比,所提出的方法不仅在准确性上表现出色,而且在缓解特定失败模式方面也表现出色,例如在车辆显著遮挡的情况下,VLMs提供的尺寸比初始激光雷达辅助的人工标注标签更优。在公共和专有数据上的实验强烈表明,我们的结论可以推广到不同的标注者和数据集。结果表明,将VLMs整合到标注过程中可以减少手动标注时间,同时提高标注质量。

英文摘要

We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle's make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.

2605.21723 2026-05-22 cs.RO cs.AI cs.MA cs.SY eess.SY 版本更新

Learning Altruistic Collaboration in Heterogeneous Multi-Team Systems

在异质多团队系统中学习利他性协作

Riwa Karam, Ruoyu Lin, Brooks A. Butler, Magnus Egerstedt

发表机构 * Samueli School of Engineering, University of California, Irvine(加州大学欧文分校萨缪利工程学院) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文研究了通过动态机器人分配实现的异质多团队协作,将机器人视为可转移资源。利用生态学中的汉密尔顿法则作为利他决策机制,提出了一种考虑异质能力、转移成本和能力依赖贡献的多团队协作资源分配框架。所得到的分配问题是组合性的,并被证明是NP难的。为了解决可扩展性问题,作者开发了一种图神经网络策略,在集中训练、分散执行的模式下近似基于汉密尔顿法则的利他性分配。该模型在团队交互图上运行,并预测机器人层面的转移决策和下一步的机器人到团队分配。所提出的方法在消防场景中通过仿真和实验得到验证,表明所学习的策略在扩展到更大系统的同时能够实现接近最优的性能。

详情
AI中文摘要

本文研究了通过动态机器人分配实现的异质多团队协作,其中机器人被视为可转移资源。利用生态学中的汉密尔顿法则作为利他决策机制,我们提出了一种考虑异质能力、转移成本和能力依赖贡献的多团队协作资源分配框架。所得到的分配问题是组合性的,并被证明是NP难的。为了解决可扩展性问题,我们开发了一种图神经网络策略,在集中训练、分散执行的模式下近似基于汉密尔顿法则的利他性分配。该模型在团队交互图上运行,并预测机器人层面的转移决策和下一步的机器人到团队分配。所提出的方法在消防场景中通过仿真和实验得到验证,表明所学习的策略在扩展到更大系统的同时能够实现接近最优的性能。

英文摘要

This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, where robots are treated as transferable resources. Leveraging Hamilton's rule from ecology as an altruistic decision-making mechanism, we propose a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and is shown to be NP-hard. To address scalability, we develop a graph neural network policy under centralized training and decentralized execution that approximates the altruistic allocations based on Hamilton's rule. The model operates over the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. The proposed approach is validated in a firefighting scenario through simulations and experiments, demonstrating that the learned policy achieves near-optimal performance while scaling to larger systems.
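For readers unfamiliar with Hamilton's rule, its classical form states that an altruistic act is favored when r * B > C, where r is the relatedness between actor and recipient, B the benefit to the recipient, and C the cost to the actor. The sketch below applies that inequality to a single robot-transfer decision. How the paper actually defines relatedness, benefit, and cost for robot teams is not stated in the abstract, so the function name and all numbers here are hypothetical.

```python
def hamilton_favors_transfer(relatedness, benefit, cost):
    """Classical Hamilton's rule: act altruistically when r * B > C."""
    return relatedness * benefit > cost


# Hypothetical decision: should a donor team lend one robot to another team?
# benefit = gain for the receiving team, cost = loss for the donor team
# (including any transfer cost), relatedness = how much the donor values
# the receiver's gain.
lend_when_benefit_is_large = hamilton_favors_transfer(0.5, 10.0, 3.0)
lend_when_benefit_is_small = hamilton_favors_transfer(0.5, 4.0, 3.0)
```

Evaluating this rule for every candidate robot and team pair is what makes the exact allocation combinatorial, which is the scalability problem the learned policy is meant to sidestep.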

2605.21719 2026-05-22 cs.RO cs.SY eess.SY 版本更新

Mind the Gaps: Multi-Robot Feedback-Driven Ergodic Coverage in Unknown Environments

注意缝隙:未知环境中的多机器人反馈驱动的遍历覆盖

Thales Costa Silva, Nora Ayanian

发表机构 * Department of Computer Science at Brown University(布朗大学计算机科学系)

AI总结 本文提出了一种多机器人反馈驱动的遍历覆盖策略,通过实时环境模型反馈调整机器人采样行为,以提高未知环境中的覆盖效率和资源分配。

详情
AI中文摘要

在本文中,我们研究多机器人自适应覆盖问题,即机器人团队通过持续调整位置进行动态采样,以在环境中收集数据。该任务具有挑战性,尤其是在必须随时间将机器人高效分配到新采样位置时。遍历搜索方法通过使机器人时间平均的空间分布与环境信息的空间分布一致来优化机器人轨迹。虽然这些方法在给定目标分布时能促进有效探索,但往往无法处理环境先验分布未知的情况。为克服这一限制,我们提出了一种自适应覆盖策略,利用环境模型的实时反馈调整机器人的采样行为,以应对未知条件。我们的方法基于在线更新的环境参数化模型构建目标空间信息分布,从而增强传统的遍历轨迹优化。该策略假设环境是静态的,或其变化相对于机器人运动较为缓慢。我们的框架使机器人能够动态优先考虑高兴趣区域,从而提高覆盖效率,为单个智能体合成有效的控制策略,并在先验分布未知的场景中优化资源使用。我们通过仿真验证了该方法,证明了其在改善覆盖和资源分配方面的有效性。

英文摘要

In this work, we address the problem of multi-robot adaptive coverage, where teams of robots perform dynamic sampling by continuously adjusting their positions to collect data in an environment. This task can be challenging, particularly when robots must be efficiently allocated to new sampling locations over time. Ergodic search methods optimize robot trajectories by ensuring that the robots' time-averaged spatial distribution aligns with the spatial distribution of environmental information. While these methods promote effective exploration provided a target distribution, they often fail to account for unknown prior distributions of the environment. To overcome this limitation, we propose an adaptive coverage strategy that utilizes real-time feedback from an environmental model to adjust robot sampling behavior in response to unknown conditions. Our approach enhances traditional ergodic trajectory optimization by constructing a target spatial information distribution based on parametric models of the environment, which are updated online. This strategy assumes that the environment is either static or changes slowly compared to the robot's motion. Our framework allows robots to dynamically prioritize regions of high interest, improving coverage efficiency, synthesizing effective control policies for individual agents, and optimizing resource use in settings with unknown prior distributions. We validate our approach through simulations, demonstrating its effectiveness in enhancing coverage and resource allocation.

2605.21714 2026-05-22 cs.CV cs.RO 版本更新

AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

AVI-HT:自适应视觉-IMU融合用于3D手部跟踪

Ziyi Kou, Ankit Kumar, Mia Huang, Taylor Niehues, Vatsal Mehta, Ergys Ristani, Li Guan

发表机构 * Meta Reality Labs(Meta现实实验室)

AI总结 本文提出AVI-HT,一种自适应视觉-IMU融合方法,通过联合建模第一人称视角图像与手套上的6自由度IMU信号,用于跟踪3D手部姿态。核心方法包括同步多模态训练数据配对和跨传感器深度注意力机制,主要贡献是提高了在手-物体交互场景中的准确性和可用性。

详情
AI中文摘要

我们提出了AVI-HT,一种自适应视觉-IMU融合方法,通过联合建模第一人称视角图像与手套上的6自由度IMU信号来跟踪3D手部姿态。AVI-HT显著提高了准确性和可用性,尤其是在存在严重视觉遮挡的手-物体交互(HOI)场景中。其成功基于两个互补的要素:(1)同步的多模态训练数据,将穿戴式视觉-IMU传感器流与来自运动捕捉系统的真值3D手部姿态配对;(2)一种跨传感器深度注意力机制,能够自适应地调节对视觉和各个IMU传感器的信任度。为了在真实场景中评估AVI-HT,我们在DexGloveHOI数据集上进行了大量实验,该数据集包含10万多对带有同步3D标注姿态的视觉-IMU样本,其中用户在日常任务中操作各种物体。我们在两种手部模型(UmeTrack、MANO)下与多种单模态和多模态跟踪方法进行了比较。结果表明,相对于基线,AVI-HT将平均关键点误差降低了16.1%,其腕部对齐变体降低了24.2%。消融研究进一步揭示了IMU传感器在不同活动类型中对各手指的贡献,以及模型对IMU噪声和视觉-IMU融合中时间错位的敏感性。

英文摘要

We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.

2605.21710 2026-05-22 cs.RO 版本更新

PGDG: Physically Grounded Data Generation for Robust Bimanual Policy Learning from a Single Demonstration

PGDG: 为从单个示范中学习鲁棒双臂策略而设计的物理基础数据生成

Cunxi Dai, Haoran Chang, Aditya Nisal, Rahul Kumar, Guofei Chen, Tao Chen, Yuzhe Qin, Guanya Shi

发表机构 * Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所) Dexmate

AI总结 本文提出PGDG,一种基于物理的数据生成框架,通过零样本筛选将单个示范扩展为包含物理上合理、成功且多样的恢复行为的紧凑数据集,从而提升接触丰富的双臂操作中的行为克隆性能。

详情
AI中文摘要

接触丰富的双臂操作中的行为克隆仍然具有挑战性,因为多样化示范的采集成本高,而且即使很小的扰动也可能将系统推入没有恢复监督的流形外状态。我们提出PGDG,一种带有零样本筛选的数据生成框架,能够在无需额外人工标注的情况下,将单个示范扩展为一个由物理上合理、成功且多样的恢复行为组成的紧凑数据集。PGDG在物理基础采样器和数据集筛选器之间迭代:筛选器选择信息量大、非冗余且可恢复的行为,以将采样分布更新到覆盖不足的恢复模式;采样器则从更新后的分布中抽取物理上合理的候选轨迹(rollout),并保留成功的轨迹。为进一步提高数据质量,PGDG应用短时域的基于采样的控制,为选定的高风险状态重新标注纠正动作。在四个双臂操作任务中,PGDG在仿真和零样本真实世界迁移中均持续优于仅空间增强的方法。在RotateBox-Pitch任务中,仿真成功率从38%提升到93%,真实世界成功率从35%提升到82%。PGDG还能有效支持GR00T等基础模型的微调,使成功率从46%提升到77%。更多结果见我们的网站:https://cunxid.github.io/PGDG/。

英文摘要

Behavior cloning for contact-rich bimanual manipulation remains challenging because diverse demonstrations are expensive to collect, and even small disturbances can push the system into off-manifold states where no recovery supervision is available. We propose PGDG, a data generation framework with zero-shot curation that expands a single demonstration into a compact dataset of physically plausible, successful, and diverse recovery behaviors without additional human labeling. PGDG iterates between a physics-grounded sampler and a dataset curator, where the curator selects informative, non-redundant, and recoverable behaviors to update the sampling distribution toward under-covered recovery modes, and the sampler draws physically plausible rollout candidates from this updated distribution and retains successful trajectories. To further improve data quality, PGDG applies short-horizon sampling-based control to relabel selected risky states with corrective actions. Across four bimanual manipulation tasks, PGDG consistently outperforms spatial-only augmentation in both simulation and zero-shot real-world transfer. On RotateBox-Pitch, success improves from 38% to 93% in simulation and from 35% to 82% in the real world. PGDG also enables effective fine-tuning of foundation models such as GR00T, increasing success from 46% to 77%. Additional results are available on our website: https://cunxid.github.io/PGDG/.

2605.21704 2026-05-22 cs.RO cs.SY eess.SY 版本更新

Motion Design for Grasp-Based Dynamic Locomotion in Microgravity

微重力环境下基于抓取的动态移动运动设计

Chaerim Moon, Joohyung Kim, Justin K. Yim

发表机构 * Department of Mechanical Science and Engineering at the University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校机械科学与工程系)

AI总结 本文针对微重力环境下多肢体机器人系统基于抓取的动态移动问题,提出了一种可参数化的移动规划框架,通过调整步态模式、步长、移动速度和名义姿态等参数,评估其在稳定性和驱动需求方面的性能。研究结果表明,扩大可行接触力空间并抑制脉冲全身动力学可提升移动性能。

详情
AI中文摘要

在微重力环境中,移动通常依赖于稀疏且不规则排列的锚点,这促使了基于抓取的多肢体移动。在此设置中,动态移动只有通过有意识地调节锚定相互作用和全身协调,才能在耦合的动力学和运动学约束下实现。本文提出了针对微重力环境下多肢体机器人系统基于抓取的动态移动的设计见解,目标是需要六维肢体操作以与候选锚点建立接触的场景。研究的设计参数包括步态模式、步长、移动速度和名义姿态。提出了一种可参数化的移动规划框架,以支持这些参数的变化,并评估由此产生的移动性能,包括稳定性和驱动需求。在基于物理的仿真中采用了两种代表性四足形态进行评估。结果表明,扩大可行接触力空间并抑制脉冲全身动力学可提高移动性能。这些发现为微重力移动中多肢体系统的接触配置选择和全身协调策略提供了指导。

英文摘要

Locomotion in microgravity often relies on sparsely and irregularly arranged anchors, motivating grasp-based mobility with multiple limbs. In this setting, dynamic locomotion is feasible only through deliberate regulation of both anchored interactions and whole-body coordination under coupled dynamic and kinematic constraints. This paper presents design insights for grasp-based dynamic locomotion with multi-limbed robotic systems in microgravity, targeting scenarios that require 6D limb manipulation to establish contacts with candidate anchors. The investigated design parameters include gait pattern, stride length, locomotion speed, and nominal posture. A parameterizable locomotion planning framework is proposed to support variations of these parameters and to evaluate the resulting locomotion performance in terms of stability and actuation demand. Two representative quadruped morphologies are adopted for evaluation in physics-based simulation. The results demonstrate that enlarging the feasible contact wrench space and attenuating impulsive whole-body dynamics improve locomotion performance. These findings inform strategies for contact configuration selection and whole-body coordination in microgravity locomotion with multi-limbed systems.

2605.21688 2026-05-22 cs.RO cs.SY eess.SY 版本更新

Closed-Loop Sim-to-Real Reinforcement Learning for Deformable Microfiber Shape Control

闭环仿真到现实强化学习用于可变形微纤维形状控制

Alessandro Amici, Houari Bettahar, Veeti Jaakkola, Quan Zhou

发表机构 * Department of Electrical Engineering and Automation, Aalto University(阿尔托大学电气工程与自动化系)

AI总结 本文提出了一种闭环仿真到现实强化学习方法,用于控制表面上可变形微纤维的形状:在简化的无摩擦模拟器中训练几何形状调节,并在部署过程中利用实时视觉反馈迭代修正未建模表面相互作用的影响。

Comments 7 pages, 7 figures

详情
AI中文摘要

自主的基于接触的微操作具有挑战性,因为微尺度下的表面和界面相互作用难以准确建模,这限制了传统基于模型的控制和仿真到现实学习的使用。我们提出了一种闭环仿真到现实强化学习(RL)方法,用于表面上的微纤维形状控制。核心思想是在简化的无摩擦模拟器中训练几何形状调节,并在部署过程中依靠实时视觉反馈,迭代修正观测到的未建模表面相互作用的影响。一个完全在仿真中训练的RL策略被直接迁移到以40 Hz运行的物理双夹爪微操作系统上,无需重新训练或领域自适应。以蚕丝微纤维为测试对象,该策略在24种不同的初始构型上实现了270 ± 80微米的平均逐点形状误差。在覆盖三种纤维直径(50、80和120微米)与三种操作长度(10 mm、15 mm和20 mm)全部组合的九个样本上,同一策略无需任何重新训练或重新调参即实现了亚毫米级的最终形状误差。这些结果表明,只要仿真到现实失配中与任务相关的影响在闭环反馈回路中仍然可观测且可纠正,在简化模拟器中学习的策略就能够在表面接触条件下实现可重复的真实世界微纤维形状调节。

英文摘要

Autonomous contact-based micromanipulation is challenging because surface and interfacial interactions at the microscale are difficult to model accurately, limiting the use of conventional model-based control and sim-to-real learning. We present a closed-loop sim-to-real reinforcement learning (RL) approach for microfiber shape control on a surface. The central idea is to train geometric shape regulation in a simplified frictionless simulator and rely on real-time visual feedback during deployment to iteratively correct the observed effects of unmodeled surface interactions. An RL policy trained entirely in simulation is transferred directly to a physical dual-gripper micromanipulation system operating at 40 Hz, without retraining or domain adaptation. Using silk microfibers as a testbed, the policy achieves a mean point-wise shape error of 270 $\pm$ 80 $\mu$m across twenty-four diverse initial configurations. Across nine specimens covering all combinations of three fiber diameters (50, 80, and 120 $\mu$m) and three manipulated lengths (10 mm, 15 mm, and 20 mm), the same policy achieves sub-millimeter final shape error without any retraining or retuning. These results show that a policy learned in a simplified simulator can achieve repeatable real-world microfiber shape regulation under surface contact, provided that the task-relevant effects of the sim-to-real mismatch remain observable and correctable within the closed feedback loop.
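A mean point-wise shape error of the kind reported above can be read as the average Euclidean distance between corresponding points on the observed and target fiber shapes. The abstract does not give the exact definition or the point-sampling scheme, so the sketch below is one plausible reading; the function name and coordinates (in micrometers) are made up for illustration.

```python
import math


def mean_pointwise_error(current, target):
    """Average Euclidean distance between corresponding 2D points."""
    if len(current) != len(target) or not current:
        raise ValueError("shapes must be non-empty and have equal length")
    return sum(math.dist(p, q) for p, q in zip(current, target)) / len(current)


# Hypothetical fiber shapes sampled at three matching points (micrometers).
current_shape = [(0.0, 0.0), (1000.0, 0.0), (2000.0, 300.0)]
target_shape = [(0.0, 0.0), (1000.0, 200.0), (2000.0, 0.0)]
error_um = mean_pointwise_error(current_shape, target_shape)
```

In a closed-loop setting this quantity would be recomputed from each camera frame, which is what lets the controller correct for unmodeled surface effects as they appear.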

2605.21686 2026-05-22 cs.RO 版本更新

Distributed Multi-Coverage for Robot Swarms

机器人群的分布式多覆盖

Mariem Guitouni, Aaron T. Becker

发表机构 * University of Houston(休斯顿大学)

AI总结 本文提出了一种分布式多覆盖算法,用于解决机器人群在局部感知、局部通信和无全局协调的情况下,维持关键资产可靠覆盖的问题,同时应对机器人故障等约束条件。

Comments Accepted at ANTS 2026 (International Conference on Swarm Intelligence), published by Springer Nature

详情
AI中文摘要

自主无人机群用于监视、环境监测和基础设施检查时,必须在机器人故障的情况下保持关键资产的可靠覆盖。这要求多覆盖:每个资产必须由多个机器人观察以实现冗余,且覆盖要求因资产的重要性而异。尽管最近的工作已通过整数规划最优地解决了集中式问题,但实际部署面临约束,需要分布式解决方案:机器人具有有限的通信范围,机载计算限制了全局规划,且部分系统故障不得导致任务中止。本文提出了一种适用于具有局部感知、局部通信和无全局协调的机器人群的分布式多覆盖算法。

英文摘要

Autonomous drone swarms deployed for surveillance, environmental monitoring, and infrastructure inspection must maintain reliable coverage of critical assets despite robot failures. This requires multicoverage: each asset must be observed by multiple robots for redundancy, with coverage requirements varying by asset importance. While recent work has solved the centralized problem optimally using integer programming, practical deployments face constraints that demand distributed solutions: robots operate with limited communication ranges, onboard computation restricts global planning, and partial system failures must not cause mission abort. We present a distributed multicoverage algorithm for robot swarms operating with local sensing, local communication, and no global coordination.
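
The multicoverage requirement described above (each asset observed by several robots, with the required count depending on the asset) can be made concrete with a small centralized check. This is only a sketch under stated assumptions: a disk sensing model with a common radius, 2D positions, and hypothetical names; it is not the distributed algorithm the paper proposes.

```python
import math


def coverage_deficit(assets, robots, sensing_radius):
    """Return {asset_name: missing_observers} for assets below their requirement.

    assets: {name: ((x, y), required_observers)}
    robots: list of (x, y) robot positions
    Assumption: a robot observes an asset if it lies within sensing_radius.
    """
    deficit = {}
    for name, (position, required) in assets.items():
        observers = sum(
            1 for robot in robots if math.dist(position, robot) <= sensing_radius
        )
        if observers < required:
            deficit[name] = required - observers
    return deficit


# Hypothetical scenario: the pump needs two observers, the valve needs one.
assets = {"pump": ((0.0, 0.0), 2), "valve": ((10.0, 0.0), 1)}
robots = [(1.0, 0.0), (0.0, 1.0), (20.0, 0.0)]
deficit = coverage_deficit(assets, robots, sensing_radius=2.0)
```

Here the pump is covered by two robots while the valve has none in range, so only the valve appears in the result. A distributed method has to reach the same outcome using only what each robot senses and hears locally.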

2605.21680 2026-05-22 cs.RO 版本更新

Flying Together: Human-Guided Immersive Shared Control for Aerial Robot Teams in Unknown Environments

Lou De Bel-Air, Luca Morando, Ruitao Chen, Keru Wang, Benjamin Jarvis, Charbel Toumieh, Yang Zhou, Ken Perlin, Dario Floreano, Giuseppe Loianno

发表机构 * New York University(纽约大学) Ecole Polytechnique Federale de Lausanne(洛桑联邦理工学院) University of California Berkeley(加州大学伯克利分校)

AI总结 本文提出了一种基于虚拟现实的共享控制框架,用于在受限和未知环境中操作无人机团队,通过实时的用户引导探索,弥补自主系统在非结构化环境中适应能力的不足。核心方法是一种用户引导的运动原语规划器,与导纳控制器相结合,使操作员能够灵活影响团队行为,并引导无人机前往自主规划器可能忽略的感兴趣区域。

Comments Accepted at IEEE International Conference on Robotics and Automation, Vienna 2026

详情
AI中文摘要

尽管自主多机器人系统能够实现安全且协调的导航,但它们往往难以适应突发状况,也难以在非结构化环境中体现操作员的意图。本文提出了一种基于虚拟现实(VR)的共享控制框架,用于在受限和未知环境中操作无人机团队,实现实时的用户引导探索。我们方法的核心是一种新颖的用户引导运动原语规划器,能够计算连续的无碰撞轨迹,同时持续融合操作员输入。该规划器与导纳控制器相结合,使操作员能够灵活影响团队行为,并引导无人机前往自主规划器可能忽略的感兴趣区域。系统支持同时包含物理和模拟无人机的混合现实操作,并实现了双向的VR界面,使操作员能够通过迁移点引导机器人团队,同时获得团队状态的即时视觉反馈。实验结果表明,共享控制改善了避障效果,保持了机器人之间的间距,并减轻了操作员的负担,展示了沉浸式、人在回路的多机器人导航的可行性和优势。

英文摘要

While autonomous multi-robot systems can achieve safe and coordinated navigation, they often struggle to adapt to unforeseen conditions and to capture operator-driven objectives in unstructured environments. We present a Virtual Reality (VR)-based shared control framework for teams of drones operating in constrained and unknown environments, enabling real-time, user-guided exploration. At the core of our approach is a novel, user-guided motion-primitive-based planner that computes continuous, collision-free trajectories while continuously integrating operator input. This planner is coupled with an admittance controller, allowing the operator to flexibly influence team behavior and guide drones toward regions of interest that autonomous planners may overlook. The system supports mixed-reality operations with both physical and simulated drones, and implements a bilateral VR-based interface, allowing the operator to guide the robot team via migration points while receiving immediate visual feedback on the team state. Experimental results show that shared control improves obstacle avoidance, maintains inter-agent spacing, and reduces operator effort, demonstrating the feasibility and advantages of immersive, human-in-the-loop multi-robot navigation.
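
For readers unfamiliar with admittance control, the textbook form may help. It maps an input force to a reference motion through virtual mass-damper dynamics. The symbols below are illustrative assumptions (a virtual mass $M$, a virtual damping $D$, and an operator-derived virtual force $f_{\mathrm{op}}$); the abstract does not give the controller actually used.

```latex
% Generic admittance law: the operator input acts as a virtual force that
% drives the reference velocity v(t) of the commanded point.
M \, \dot{v}(t) + D \, v(t) = f_{\mathrm{op}}(t)

% Forward-Euler discretisation with time step \Delta t:
v_{k+1} = v_k + \frac{\Delta t}{M} \left( f_{\mathrm{op},k} - D \, v_k \right)
```

With a constant input the reference velocity settles at $f_{\mathrm{op}} / D$, so a larger damping yields a gentler response to the same operator command.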

2605.21572 2026-05-22 cs.CV cs.RO 版本更新

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

PhysX-Omni: 面向刚体、可变形体和铰接物体的统一仿真就绪物理3D生成

Ziang Cao, Yinghao Liu, Haitian Li, Runmao Yao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu

发表机构 * S-Lab, Nanyang Technological University(南洋理工大学S实验室)

AI总结 本文提出PhysX-Omni,一种统一的仿真就绪物理3D生成框架。作者为视觉-语言模型设计了高效的几何表示,构建了首个通用的仿真就绪3D数据集PhysXVerse,并提出了用于评估生成与理解能力的PhysX-Bench;实验表明该框架在生成和理解两方面均表现出色,有望推动具身AI和物理仿真等下游应用。

Comments Project page: https://physx-omni.github.io/

详情
英文摘要

Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

2605.16258 2026-05-22 cs.CV cs.AI cs.RO 版本更新

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

IVGT:隐式视觉几何变换器用于神经场景表示

Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu

发表机构 * Intelligent Vision Group, Tsinghua University(清华大学智能视觉组)

AI总结 本文提出IVGT,一种隐式视觉几何Transformer,能够从无位姿的多视角图像中隐式建模连续且一致的几何结构,从而得到神经场景表示,支持在任意3D位置进行连续空间查询,以预测有符号距离和颜色,并在多个任务中表现出色。

Comments Code: https://github.com/wzzheng/IVGT/

详情
AI中文摘要

从无位姿的多视角图像中重建一致的3D几何和外观,是计算机视觉中一个基础而具有挑战性的问题。现有的视觉几何基础模型大多通过回归像素对齐的点图来预测显式几何,常常存在冗余和几何连续性有限的问题。我们提出了IVGT,一种隐式视觉几何Transformer,能够从无位姿的多视角图像中隐式建模连续且一致的几何。该方法在规范坐标系中学习连续的神经场景表示,支持在任意3D位置进行连续空间查询:检索局部特征,并通过轻量级解码器预测有符号距离(SDF)值和颜色。它允许直接提取连续且一致的表面几何,从而能够从任意视角渲染RGB图像、深度图和表面法线图。我们通过多数据集联合优化来训练IVGT,结合2D监督和3D几何正则化。IVGT展现出跨场景的泛化能力,并在多种任务上取得了优异的性能,包括网格和点云重建、新视角合成、深度和表面法线估计以及相机位姿估计。

英文摘要

Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D position, retrieving local features to predict signed distance function (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.

2605.14598 2026-05-22 cs.RO 版本更新

DSSP: Diffusion State Space Policy with Full-History Encoding

DSSP:具有完整历史编码的扩散状态空间策略

Zhiyuan Guan, Jianshu Hu, Han Fang, Yunpeng Jiang, Yize Huang, Shujia Li, Xiao Li, Yutong Ban

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出DSSP,一种基于扩散模型的状态空间策略,通过完整历史编码提升机器人操作中长时程任务的历史依赖处理能力,并以显著更小的模型规模取得了最先进的性能。

详情
AI中文摘要

基于扩散的模仿学习在机器人操作中显示出强大的前景。然而,大多数现有策略仅依赖于当前观察或最近的短窗口观察,限制了它们在长周期任务中解决历史依赖性模糊性的能力。为此,我们引入DSSP,一种具有完整历史编码的扩散状态空间策略,能够为机器人操作提供高效的完整历史条件。利用状态空间模型(SSMs)的连续序列建模特性,我们的历史编码器有效地将整个观察流压缩成一个紧凑的上下文表示。为了确保此上下文保留有关未来状态演化的关键信息,编码器通过动态感知的辅助训练目标进行优化。此高层上下文表示随后与近期状态观察无缝融合,形成一个分层的条件机制用于动作生成。此外,为了保持架构一致性并减少GPU内存开销,我们还用SSM实例化扩散骨干网络。在模拟基准和真实世界操作任务中的广泛实验表明,DSSP在显著更小的模型规模下实现了最先进的性能,展示了分层条件在历史长度增加时捕获关键信息的优越效率。

英文摘要

Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.

2604.24681 2026-05-22 cs.RO 版本更新

Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

从大规模人类示范中学习人类意图先验以用于机器人操作

Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu, Yu Sun, Wenbo Ding

发表机构 * Tsinghua University(清华大学) ByteDance(字节跳动)

AI总结 本文提出MoT-HRA框架,通过大规模人类示范学习人类意图先验,用于机器人操作,通过构建HA-2.2M数据集和三个耦合专家提升动作合理性和鲁棒性。

Comments 13 pages, 5 figures

详情
AI中文摘要

人类视频包含丰富的操作先验,但将其用于机器人学习仍然困难,因为原始观测将场景理解、人类运动和与具身形态相关的动作纠缠在一起。我们提出MoT-HRA,一种层次化的视觉-语言-动作框架,从大规模人类示范中学习人类意图先验。我们首先构建了HA-2.2M,这是一个包含220万条片段(episode)的动作-语言数据集,通过以手为中心的过滤、空间重建、时间分割和语言对齐从异构人类视频中重建而来。在此数据集之上,MoT-HRA将操作分解为三个相互耦合的专家:视觉-语言专家预测与具身形态无关的3D轨迹,意图专家将MANO风格的手部运动建模为潜在的人类运动先验,精细专家将意图感知的表示映射为机器人动作块。共享注意力主干和只读的键值传递使下游控制能够利用人类先验,同时限制对上游表示的干扰。在手部运动生成、仿真操作和真实世界机器人任务上的实验表明,MoT-HRA在分布偏移下提升了运动合理性和控制鲁棒性。

英文摘要

Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.

2603.22508 2026-05-22 cs.RO cs.SY eess.SY 版本更新

Parallel OctoMapping: A Scalable Framework for Enhanced Path Planning in Autonomous Navigation

并行八叉树映射:一种用于自主导航中路径规划增强的可扩展框架

Yihui Mao, Tian Tan, Xuehui Shen, Warren E. Dixon, Rushikesh Kamalapurkar

发表机构 * Department of Mechanical and Aerospace Engineering, University of Florida(佛罗里达大学机械与航空航天工程系) Department of Electrical and Systems Engineering, University of Pennsylvania(宾夕法尼亚大学电气与系统工程系)

AI总结 本文提出并行八叉树映射(POMP),一种高效的基于八叉树的映射技术,通过在固定占用网格分辨率下优化自由空间表示,提升路径规划效率和成功率,特别是在复杂环境中。

详情
AI中文摘要

建图在机器人和自主系统中至关重要,因为它为路径规划提供了空间基础。高效的建图使规划算法能够生成可靠的路径,同时确保安全并实时适应复杂环境。固定分辨率的建图方法往往产生过于保守的障碍物表示,导致在杂乱场景中得到次优路径或规划失败。为了解决这个问题,我们提出了并行八叉树映射(POMP),一种基于OctoMap的高效建图技术,能够最大化可用自由空间并支持多线程计算。据我们所知,POMP是首个在固定占用栅格分辨率下细化自由空间表示、同时保持地图保真度并与现有基于搜索的规划器兼容的方法。因此,它可以集成到现有的规划流程中,带来更高的寻路成功率和更短的路径长度(在杂乱环境中尤为明显),同时显著提高计算效率。

英文摘要

Mapping is essential in robotics and autonomous systems because it provides the spatial foundation for path planning. Efficient mapping enables planning algorithms to generate reliable paths while ensuring safety and adapting in real time to complex environments. Fixed-resolution mapping methods often produce overly conservative obstacle representations that lead to suboptimal paths or planning failures in cluttered scenes. To address this issue, we introduce Parallel OctoMapping (POMP), an efficient OctoMap-based mapping technique that maximizes available free space and supports multi-threaded computation. To the best of our knowledge, POMP is the first method that, at a fixed occupancy-grid resolution, refines the representation of free space while preserving map fidelity and compatibility with existing search-based planners. It can therefore be integrated into existing planning pipelines, yielding higher pathfinding success rates and shorter path lengths, especially in cluttered environments, while substantially improving computational efficiency.

2603.11642 2026-05-22 cs.RO 版本更新

Noise-Space Attribution and Control of Chunk-Boundary Artifact

块边界伪影的噪声空间归因与控制

Rui Wang

发表机构 * Rui Wang(王瑞)

AI总结 本文研究了生成视觉-运动策略中块边界伪影的机制,通过分析噪声空间中的变量,展示了如何通过控制隐含噪声来调节伪影,并证明伪影变化可以影响最终任务结果。

详情
AI中文摘要

动作分块在生成式视觉-运动策略中被广泛使用,但块边界处反复出现的执行不连续性仍然缺乏机制性解释。本文将块边界伪影视为一个可分析的机制变量。我们首先证明,成功回合与失败回合在伪影度量上稳定分离。随后我们表明,在随机的动作分块策略中,固定观测上下文并仅改变潜在噪声,就足以系统性地调节伪影。在同一个Diffusion Policy检查点上,对DDPM、零方差DDPM和DDIM的比较进一步表明,这种局部可控性取决于从初始噪声到动作输出的信息路径是否保持完整。最后,通过在固定的局部执行状态下进行受控干预,我们发现伪影的变化可以传导至最终结果,并且即使在同一任务中,偏好方向也可能反转:某些上下文在较低伪影下成功率更高,而另一些上下文在较高伪影下成功率更高。在一个通过留出的匹配续接验证(held-out matched-continuation validation)选出的、具有代表性的偏好高伪影的关键上下文中,成功率从0.033提高到0.717。这些结果表明,块边界伪影并非单纯的执行侧副产物,而是噪声空间中的一个变量,可以被归因、被控制,并与任务结果建立机制性关联。

英文摘要

Action chunking is widely used in generative visuomotor policies, yet the recurring execution discontinuities at chunk boundaries still lack a mechanistic explanation. This paper treats chunk-boundary artifact as an analyzable mechanism variable. We first show that successful and failed episodes separate stably on artifact metrics. We then show that, in stochastic action-chunked policies, fixing the observation context and changing only latent noise is sufficient to modulate artifact systematically. On the same Diffusion Policy checkpoint, comparisons among DDPM, zero-variance DDPM, and DDIM further show that this local controllability depends on whether the information path from initial noise to action output remains intact. Finally, from controlled interventions at fixed local execution states, we find that artifact changes can carry through to final outcome, and that the preferred direction can reverse even within the same task: some contexts achieve higher success under lower artifact, whereas others achieve higher success under higher artifact. In a representative high-artifact-favoring key context selected by held-out matched-continuation validation, success rate increases from 0.033 to 0.717. These results show that chunk-boundary artifact is not a mere execution-side by-product, but a variable in noise space that can be attributed, controlled, and mechanistically linked to task outcome.

2602.03205 2026-05-22 cs.RO 版本更新

HUSKY: Humanoid Skateboarding System via Physics-Aware Whole-Body Control

HUSKY:通过物理感知的全身控制实现人形机器人滑板系统

Jinrui Han, Dewei Wang, Chenyun Zhang, Xinzhe Liu, Ping Luo, Chenjia Bai, Xuelong Li

发表机构 * Institute of Artificial Intelligence (TeleAI), China Telecom(人工智能研究院(TeleAI),中国电信) The University of Hong Kong(香港大学) University of Science and Technology of China(中国科学技术大学) ShanghaiTech University(上海科技大学)

AI总结 本文提出HUSKY框架,通过整合人形机器人-滑板系统建模和物理感知的全身控制,解决高动态、复杂交互任务中的稳定动态操控问题,在真实场景中实现了稳定而敏捷的滑板操控。

Comments Accepted to RSS2026

详情
AI中文摘要

尽管当前的人形机器人全身控制框架大多依赖静态环境假设,但处理具有高动态性和复杂交互的任务仍是一项艰巨的挑战。本文研究人形机器人滑板运动,这是一项需要在欠驱动轮式平台上进行稳定动态操控的高难度任务。这一整体系统受非完整约束和紧密耦合的人-物交互支配。成功完成该任务需要同时掌握混合接触动力学,以及在机械耦合、动态不稳定的滑板上实现稳健的平衡控制。为克服上述挑战,我们提出了HUSKY,一个基于学习的框架,整合了人形机器人-滑板系统建模和物理感知的全身控制。我们首先对板面倾斜与滑板桥(truck)转向角之间的耦合关系进行建模,从而能够对系统动力学进行有原则的分析。在此基础上,HUSKY利用对抗运动先验(AMP)学习类人的蹬地动作,并采用物理引导的、面向航向的策略来实现倾斜转向行为。此外,轨迹引导机制确保了蹬地与转向之间平滑而稳定的过渡。在Unitree G1人形平台上的实验结果表明,我们的框架能够在真实场景中实现稳定而敏捷的滑板操控。项目页面见 https://husky-humanoid.github.io/ 。

英文摘要

While current humanoid whole-body control frameworks predominantly rely on static-environment assumptions, addressing tasks characterized by high dynamism and complex interactions presents a formidable challenge. In this paper, we address humanoid skateboarding, a highly challenging task requiring stable dynamic maneuvering on an underactuated wheeled platform. This integrated system is governed by non-holonomic constraints and tightly coupled human-object interactions. Successfully executing this task requires simultaneous mastery of hybrid contact dynamics and robust balance control on a mechanically coupled, dynamically unstable skateboard. To overcome these challenges, we propose HUSKY, a learning-based framework that integrates humanoid-skateboard system modeling and physics-aware whole-body control. We first model the coupling relationship between board tilt and truck steering angles, enabling a principled analysis of system dynamics. Building upon this, HUSKY leverages Adversarial Motion Priors (AMP) to learn human-like pushing motions and employs a physics-guided, heading-oriented strategy for lean-to-steer behaviors. Moreover, a trajectory-guided mechanism ensures smooth and stable transitions between pushing and steering. Experimental results on the Unitree G1 humanoid platform demonstrate that our framework enables stable and agile maneuvering on skateboards in real-world scenarios. The project page is available at https://husky-humanoid.github.io/.

2510.08759 2026-05-22 cs.CV cs.RO 版本更新

Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis

通过技能级评估与诊断解构多模态语言模型的具身能力

Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Yizhe Zhu, Shiji Xin, Yijian Huang, Boce Hu, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Haojie Huang, Lawson L. S. Wong

发表机构 * Northeastern University, Boston, MA, USA The Chinese University of Hong Kong, Hong Kong, China Peking University, Beijing, China Westlake University, Hangzhou, China Harvard University, Cambridge, MA, USA Purdue University, West Lafayette, IN, USA University of Oxford, Oxford, United Kingdom

AI总结 本文提出BEAR基准,通过分解具身任务为14个原子技能进行细粒度评估,发现感知能力是推理失败的主要瓶颈,并提出BEAR-Agent多模态对话代理,显著提升具身技能性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

理解具身多模态大语言模型(MLLMs)的能力瓶颈对于改进具身智能体至关重要。然而,现有的具身基准主要关注任务级评估,无法就模型失败的根本原因提供可操作的见解。为解决这一局限,我们提出BEAR,一个将具身任务分解为14个原子技能以进行细粒度技能级评估的基准。BEAR包含4,469个图像-视频-文本交错的样本,涵盖6个类别中的14种技能,范围从低层感知到高层规划。我们在分层的技能级诊断框架下评估了20个MLLMs在BEAR上的表现,并揭示了两个关键发现:(1)感知能力是推理失败背后的主要瓶颈;(2)当前模型的时空建模不稳定,而这一问题在先前的基准中基本未被暴露。受这些发现启发,我们进一步提出BEAR-Agent,一个通过视觉和空间推理工具增强MLLMs的多模态对话智能体。BEAR-Agent显著提升了各项具身技能的性能,在BEAR上使GPT-5相对其基础模型取得了17.5%的相对提升,同时在仿真和真实世界机器人实验中也优于强基线。项目页面:https://bear-official66.github.io/

英文摘要

Understanding the capability bottlenecks of embodied multimodal large language models (MLLMs) is crucial for improving embodied agents. However, existing embodied benchmarks mainly focus on task-level evaluation and fail to provide actionable insights into the underlying causes of model failures. To address this limitation, we introduce BEAR, a benchmark that decomposes embodied tasks into 14 atomic skills for fine-grained skill-level evaluation. BEAR comprises 4,469 interleaved image-video-text samples spanning 14 skills across 6 categories, ranging from low-level perception to high-level planning. We evaluate 20 MLLMs on BEAR under a hierarchical skill-level diagnosis framework and uncover two key findings: (1) perceptual capabilities are major bottlenecks behind reasoning failures, and (2) current models suffer from unstable spatiotemporal modeling that remains largely unexposed in prior benchmarks. Motivated by these findings, we further propose BEAR-Agent, a multimodal conversational agent that augments MLLMs with visual and spatial reasoning tools. BEAR-Agent substantially improves performance across embodied skills, achieving a relative improvement of 17.5% on GPT-5 over the base model on BEAR, while also outperforming strong baselines in both simulation and real-world robotic experiments. Project page: https://bear-official66.github.io/

2510.04280 2026-05-22 cs.LG cs.AI cs.RO 版本更新

A KL-regularization Framework for Learning to Plan with Adaptive Priors

一种用于学习带自适应先验规划的KL正则化框架

Álvaro Serra-Gomez, Daniel Jarne Ornia, Dhruva Tirumala, Thomas Moerland

发表机构 * LIACS, Leiden University, Leiden, The Netherlands(荷兰莱顿,莱顿大学莱顿高级计算机科学研究所) Google DeepMind, London, United Kingdom(谷歌DeepMind,英国伦敦) University of Oxford, Oxford, United Kingdom(牛津大学,英国牛津)

AI总结 本文提出了一种基于KL正则化的学习规划框架,通过将规划器的动作分布作为先验整合到策略优化中,提升了基于模型的强化学习在高维连续控制任务中的性能。

Comments Published at ICML2026

详情
AI中文摘要

有效的探索仍然是基于模型的强化学习(MBRL)中的核心挑战,尤其是在样本效率至关重要的高维连续控制任务中。近期一条重要的研究路线利用学习到的策略作为模型预测路径积分(MPPI)规划的提议分布。早期方法在更新采样策略时不考虑规划器分布,通常借助确定性策略梯度和熵正则化来最大化学习到的价值函数。然而,由于训练过程中遇到的状态取决于MPPI规划器,使采样策略与规划器对齐可以提高价值估计的准确性和长期性能。为此,近期的方法通过最小化与规划器分布之间的KL散度,或在策略更新中引入规划器引导的正则化来更新采样策略。在本文中,我们提出策略优化-模型预测控制(PO-MPC),将这些基于MPPI的强化学习方法统一到同一框架下;PO-MPC是一族KL正则化的MBRL方法,它们在策略优化中将规划器的动作分布作为先验。通过使学习到的策略与规划器的行为对齐,PO-MPC使策略更新能够更灵活地在回报最大化与KL散度最小化之间进行权衡。我们阐明了先前的方法如何作为该方法族的特例出现,并探索了此前未被研究的变体。实验表明,这些扩展配置带来了显著的性能提升,推进了基于MPPI的强化学习的最新水平。

英文摘要

Effective exploration remains a central challenge in model-based reinforcement learning (MBRL), particularly in high-dimensional continuous control tasks where sample efficiency is crucial. A prominent line of recent work leverages learned policies as proposal distributions for Model-Predictive Path Integral (MPPI) planning. Initial approaches update the sampling policy independently of the planner distribution, typically maximizing a learned value function with deterministic policy gradient and entropy regularization. However, because the states encountered during training depend on the MPPI planner, aligning the sampling policy with the planner improves the accuracy of value estimation and long-term performance. To this end, recent methods update the sampling policy by minimizing KL divergence to the planner distribution or by introducing planner-guided regularization into the policy update. In this work, we unify these MPPI-based reinforcement learning methods under a single framework by introducing Policy Optimization-Model Predictive Control (PO-MPC), a family of KL-regularized MBRL methods that integrate the planner's action distribution as a prior in policy optimization. By aligning the learned policy with the planner's behavior, PO-MPC allows more flexibility in the policy updates to trade off return maximization and KL divergence minimization. We clarify how prior approaches emerge as special cases of this family, and we explore previously unstudied variations. Our experiments show that these extended configurations yield significant performance improvements, advancing the state of the art in MPPI-based RL.
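
The trade-off between return maximization and KL minimization can be written in a generic form. The expression below is a sketch of a standard KL-regularized policy objective, not the paper's exact formulation: the symbols $Q$ (learned value), $\pi_{\mathrm{plan}}$ (planner action distribution), $\lambda$ (trade-off weight), and $\mathcal{D}$ (states visited during training) are assumptions, and the abstract does not state the direction of the KL term.

```latex
% Generic KL-regularised policy objective with the planner as prior.
\max_{\pi} \;
\mathbb{E}_{s \sim \mathcal{D}} \Big[
  \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ Q(s, a) \big]
  \; - \; \lambda \,
  \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_{\mathrm{plan}}(\cdot \mid s) \big)
\Big]
```

In this form, $\lambda \to 0$ recovers pure value maximization that ignores the planner, while a large $\lambda$ pushes the policy toward imitating the planner distribution; intermediate values interpolate between the two.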

2506.14648 2026-05-22 cs.RO cs.AI 版本更新

SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning

SENIOR: 在基于偏好的强化学习中高效查询选择与偏好引导探索

Hexian Ni, Tao Lu, Haoyuan Hu, Yinghao Cai, Shuo Wang

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所多模态人工智能系统国家重点实验室) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 本文提出SENIOR方法,通过高效查询选择和偏好引导探索提升人类反馈效率和策略学习速度,解决基于偏好的强化学习在反馈和样本效率方面的不足。

Comments 8 pages, 8 figures, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)

详情
AI中文摘要

基于偏好的强化学习(PbRL)方法通过基于人类偏好学习奖励模型来避免奖励工程。然而,反馈效率和样本效率低下仍是阻碍PbRL应用的问题。本文提出了一种新颖的高效查询选择与偏好引导探索方法,称为SENIOR,它能够选择有意义且易于比较的行为片段对,以提高人类反馈效率,并借助所设计的偏好引导内在奖励加速策略学习。我们的核心思想包含两个方面:(1)我们设计了一种基于运动区分的选择方案(MDS),通过对状态进行核密度估计,选择运动明显且方向不同的片段对,这类片段对与任务更相关,也更便于人类进行偏好标注;(2)我们提出了一种新颖的偏好引导探索方法(PGE),鼓励智能体探索偏好高且访问次数少的状态,并持续引导智能体获取有价值的样本。两种机制的协同作用可以显著加快奖励学习和策略学习的进度。实验表明,在六个复杂的仿真机器人操作任务和四个真实世界任务上,SENIOR在人类反馈效率和策略收敛速度两方面均优于其他五种现有方法。视频可在我们的项目网站上找到:https://2025senior.github.io/

英文摘要

Preference-based Reinforcement Learning (PbRL) methods avoid reward engineering by learning reward models from human preferences. However, poor feedback efficiency and sample efficiency remain problems that hinder the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which selects meaningful and easy-to-compare behavior segment pairs to improve human feedback efficiency and accelerates policy learning with designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We design a Motion-Distinction-based Selection scheme (MDS). It selects segment pairs with apparent motion and different directions through kernel density estimation of states, which are more task-related and easier for humans to label with preferences; (2) We propose a novel preference-guided exploration method (PGE). It encourages exploration towards states with high preference and low visitation and continuously guides the agent toward valuable samples. The synergy between the two mechanisms can significantly accelerate reward and policy learning. Our experiments show that SENIOR outperforms five other existing methods in both human feedback efficiency and policy convergence speed on six complex simulated robot manipulation tasks and four real-world tasks. Videos can be found on our project website: https://2025senior.github.io/
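
The idea of steering exploration toward states with high preference and low visitation can be sketched with a kernel density estimate. This is a hypothetical illustration rather than SENIOR's actual intrinsic reward: the bonus form (preference divided by estimated visit density), the Gaussian kernel, the bandwidth, and all names are assumptions.

```python
import numpy as np


def gaussian_kde_density(query, visited, bandwidth):
    """Gaussian kernel density estimate of `query` states under `visited` states."""
    query = np.atleast_2d(np.asarray(query, dtype=float))
    visited = np.atleast_2d(np.asarray(visited, dtype=float))
    dim = visited.shape[1]
    # Pairwise squared distances, shape (num_query, num_visited).
    sq_dist = ((query[:, None, :] - visited[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * bandwidth**2) ** (dim / 2.0)
    return np.exp(-0.5 * sq_dist / bandwidth**2).mean(axis=1) / norm


def exploration_bonus(query, visited, preference, bandwidth=0.5, eps=1e-6):
    """Hypothetical bonus: preference score divided by estimated visit density."""
    density = gaussian_kde_density(query, visited, bandwidth)
    return np.asarray(preference, dtype=float) / (density + eps)


# States visited so far cluster near the origin; compare a familiar state
# with a rarely visited one that has the same preference score.
visited = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
bonus = exploration_bonus([[0.0, 0.0], [3.0, 3.0]], visited, preference=[1.0, 1.0])
```

With equal preference scores, the rarely visited state receives the larger bonus, which is the qualitative behavior the abstract describes.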

2503.00747 2026-05-22 cs.CV cs.RO eess.IV 版本更新

LFX: Towards Unified Light Field Dense Semantic Segmentation and Salient Object Detection

LFX:迈向统一的光场密集语义分割和显著物体检测

Fei Teng, Lingxin Huang, Buyin Deng, Kai Luo, Boyuan Zheng, Zheng Fang, Hong Zheng, Kunyu Peng, Jiaming Zhang, Yaonan Wang, Kailun Yang

发表机构 * School of Artificial Intelligence and Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, China(人工智能与机器人学院和机器人视觉感知与控制技术国家工程研究中心,湖南大学,中国) China Mobile Group Hunan Company Ltd., China(中国移动集团湖南有限公司,中国) Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany(人机学与机器人研究所,卡尔斯鲁厄理工学院,德国)

AI总结 本文提出LFX框架,通过统一的光场表示特征调制空间,实现了对多种光场表示和不同感知任务的适应,从而在三个光场基准测试中取得最先进的结果,显著优于特定表示方法。

Comments The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX

详情
AI中文摘要

光场相机在单次曝光内捕获多视角观测。然而,现有研究通常针对特定的LF表示进行优化,导致该领域缺乏统一的学习框架。为弥合这一差距,我们提出了LFX,首个统一的光场感知框架。LFX建立了一个表示不变的特征调制空间,使其能够适应异构的LF表示和多样的感知任务。具体而言,我们提出了Field-of-Parallax Angular Subspace Modeling(FoP-ASM),为每个辅助视图分配独立的角标记,实现视图间的独立建模。同时,共享流形子空间约束和正则化损失强制在视图间保持全局一致的语义调制。在三个LF基准测试中的广泛评估表明,LFX在不同的LF表示上均取得最佳结果,比特定表示方法高出高达12%和20%,在显著物体检测中达到0.029/0.027的MAE,且在语义分割中达到84.37 mIoU。源代码将在https://github.com/FeiT-FeiTeng/LFX上公开。

英文摘要

Light field cameras capture multi-view observations within a single exposure. However, existing studies are typically tailored to specific LF representations, leaving the field without a unified learning framework. To bridge this gap, we present LFX, the first unified framework for LF perception. LFX establishes a representation-invariant feature modulation space, enabling it to adapt to heterogeneous LF representations and diverse perception tasks. Specifically, we propose Field-of-Parallax Angular Subspace Modeling (FoP-ASM), which assigns an independent angular marker to each auxiliary view, enabling view-wise independent modeling. Meanwhile, shared manifold subspace constraints and regularization losses enforce globally consistent semantic modulation across views. Extensive evaluations across three LF benchmarks show that LFX achieves state-of-the-art results across distinct LF representations, outperforming representation-specific methods by up to 12% and 20% with 0.029/0.027 MAE for salient object detection, and achieving 84.37 mIoU for semantic segmentation. The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX.