SE3Kit: A Lightweight Python Library for Specialized Geometric Primitives in Robotics

Daniyal Maroufi, Omid Rezayof, Farshid Alambeigi

Affiliations: Walker Department of Mechanical Engineering and Texas Robotics, The University of Texas at Austin

AI summary: This paper presents SE3Kit, a lightweight Python library focused on efficient operations on the special Euclidean group SE(3) and the special orthogonal group SO(3). It provides mathematically rigorous implementations suited to embedded deployment, rapid prototyping, and education.

2605.22605 2026-05-22 cs.RO cs.CV Version update

Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection

Liuyang Wang, Feitian Zhang

Affiliations: Department of Robotics, School of Advanced Manufacturing and Robotics; State Key Laboratory of Turbulence and Complex Systems; Peking University; Great Bay University

AI summary: This paper proposes a vision-based, motion-guided detection framework that uses a dual-interval motion extraction strategy and a lightweight motion-guided attention module to decouple target motion from camera-induced disturbances, improving UAV detection under severe ego-motion.

Abstract

Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing motion-based methods either rely on computationally expensive optical flow or use single-interval differencing, which is sensitive to jitter and limited in capturing diverse motion patterns. We propose a vision-only motion-guided detection framework that decouples target motion from camera-induced disturbances. A homography-based Global Motion Compensation (GMC) first aligns adjacent frames. We then introduce a Dual-Interval Motion Extraction strategy that captures both short-term and long-term motion cues. To integrate these cues, a lightweight Motion-Guided Attention (MGA) module enhances feature representations within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset demonstrate consistent improvements over a strong YOLOv8 baseline under severe ego-motion. Ablation studies further confirm the effectiveness of the dual-interval design and the proposed motion-guided attention mechanism.
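
The dual-interval idea can be sketched on toy data. The snippet below is a minimal illustration, not the authors' implementation: it assumes the frames have already been aligned by the global motion compensation step (not shown), and the names `dual_interval_cues`, `short_gap`, and `long_gap`, as well as the gap values, are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale image as rows of pixel intensities


def abs_diff(a: Frame, b: Frame) -> Frame:
    """Per-pixel absolute difference of two equally sized frames."""
    return [[abs(pa - pb) for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def dual_interval_cues(frames: List[Frame], short_gap: int = 1,
                       long_gap: int = 3) -> Tuple[Frame, Frame]:
    """Return (short_cue, long_cue) for the most recent frame.

    Assumption: every frame is already warped into the coordinates of the
    last frame, so remaining differences come from moving targets.
    """
    if len(frames) <= long_gap:
        raise ValueError("need more than long_gap frames")
    current = frames[-1]
    short_cue = abs_diff(current, frames[-1 - short_gap])
    long_cue = abs_diff(current, frames[-1 - long_gap])
    return short_cue, long_cue


def strip(pos: int) -> Frame:
    """A 1x5 frame with a single bright pixel at column `pos`."""
    return [[1.0 if x == pos else 0.0 for x in range(5)]]


# A slow target: it moved one pixel three frames ago and has not moved since.
frames = [strip(0), strip(1), strip(1), strip(1)]
short_cue, long_cue = dual_interval_cues(frames)
```

In this toy sequence the short-interval cue is empty because the target did not move between the last two frames, while the long-interval cue still marks both its old and new positions. This is one way to read the claim that the two intervals capture different motion patterns.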

2605.22600 2026-05-22 cs.RO Version update

Branch-Stochastic Model Predictive Control for Motion Planning under Multi-Modal Uncertainty with Scenario Clustering

Zekun Xing, Ramkrishna Chaudhari, Marion Leibold, Dirk Wollherr, Martin Buss

Affiliations: Chair of Automatic Control Engineering

AI summary: This paper combines stochastic model predictive control with a branching structure for motion planning under multi-modal uncertainty, and uses scenario clustering to achieve real-time computation while reducing conservatism.

Comments: This work has been accepted for presentation at IFAC World Congress 2026

Abstract

Motion planning for autonomous driving must account for multi-modal uncertainty in both the intentions and trajectories of surrounding vehicles. Handling uncertainty in a worst-case manner guarantees robustness but often leads to excessive conservatism. Stochastic Model Predictive Control (SMPC) reduces trajectory-level conservatism through chance constraints, yet remains conservative with respect to intention uncertainty since constraints must hold across all intentions. We present a novel combination of SMPC and the branching structure, enabling the planner to generate distinct trajectories for different possible intentions while maintaining safety under trajectory uncertainty. A novel scenario clustering is proposed to merge prediction scenarios based on high-level decision similarity, thereby ensuring real-time tractability. Furthermore, an adaptive branching-time computation postpones commitment to separate plans until intention uncertainty is sufficiently reduced. Simulation studies in challenging highway scenarios demonstrate that the proposed method improves safety, reduces conservatism, and achieves real-time computational performance.
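
As background on the chance constraints mentioned in the abstract, the sketch below shows the standard reformulation of a single linear chance constraint under a Gaussian assumption. It is a textbook construction, not the paper's formulation: the Gaussian model, the function name `tightened_bound`, and the numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def tightened_bound(b, a, cov, eps):
    """Deterministic bound on a . mu that enforces Pr(a . x <= b) >= 1 - eps.

    Assumes x ~ N(mu, cov). The chance constraint is then equivalent to
    a . mu <= b - z * sqrt(a^T cov a), with z the (1 - eps) normal quantile.
    """
    n = len(a)
    variance = sum(a[i] * cov[i][j] * a[j] for i in range(n) for j in range(n))
    z = NormalDist().inv_cdf(1.0 - eps)
    return b - z * sqrt(variance)


# Hypothetical example: keep the lateral position below 2.0 m with 95% probability,
# given a predicted position covariance for a surrounding vehicle.
cov = [[0.04, 0.0], [0.0, 0.01]]
bound_95 = tightened_bound(2.0, [1.0, 0.0], cov, eps=0.05)
bound_50 = tightened_bound(2.0, [1.0, 0.0], cov, eps=0.5)
```

A smaller `eps` tightens the bound further, which is the trajectory-level conservatism that chance constraints let a planner trade against risk.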

2605.22597 2026-05-22 cs.LG cs.AI cs.GR cs.RO Version update

MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy

Jiaxu Wang, Junhao He, Jingkai Sun, Yi Gu, Yunyang Mo, Jiahang Cao, Qiang Zhang, Renjing Xu

Affiliations: Hong Kong University of Science; MMLab, Chinese University of Hong Kong, Hong Kong SAR; The University of Hong Kong, Hong Kong SAR

AI summary: This paper proposes MoSA, a motion-constrained stress adaptation framework for narrowing the real-to-sim gap in continuum dynamics. It uses an isotropic model as a physics prior, learns residual stress operators to capture mild anisotropy and heterogeneity, and is validated in robot manipulation.

Journal ref: International Conference on Machine Learning 2026

Abstract

Learning real-world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer. Project Page is available at https://mercerai.github.io/MoSA/.

2605.22521 2026-05-22 cs.RO cs.HC Version update

Quantifying Full-Body Immersion

Alihan Bakir, Ekrem Yüksel, Fabio Zuliani, Neil Chennoufi, Francesco Bruno, Jamie Paik

Affiliations: Reconfigurable Robotics Lab

AI summary: This paper introduces a new paradigm for immersive virtual experiences built on whole-body kinetic interaction. It defines three levels of immersion (audio-visual, physical, and full-body) and uses modular robotic surface units to render immersive environments at scale, supporting symbiotic interaction between people and virtual environments.

Comments: This manuscript is under consideration for possible publication in Nature. Copyright may be transferred to Nature if the manuscript is accepted for publication, without further notice

Abstract

Humanity is at the forefront of yet another digital revolution, where the lines between real and virtual worlds are dissolving, reshaping how we perceive and interact with our surroundings. In this context, we introduce a transformative paradigm for immersive virtual experiences centered around whole-body kinetic interactions. Our approach redefines immersion through three distinct levels: audio-visual immersion, capturing sensory realism; physical immersion, delivering haptic feedback; and full-body immersion (FBI), where dynamic bodily interaction integrates seamlessly with virtual environments. At the core of this innovation lies a scalable, distributable platform based on modular robotic surface units inspired by the adaptive designs of nature. These units enable the rendering of immersive environments at any scale, from intimate personal experiences to expansive multi-user settings, dynamically adapting to interactions in real-time. The modular system distributes force, shape, and motion feedback throughout entire spaces, replicating the physical characteristics of the environment and enabling new depth of engagement through FBI. By combining scalability, adaptability, and dynamic physical engagement, this framework bridges the gap between real and virtual worlds. It offers an unprecedented level of immersion where users can engage their entire bodies in symbiotic interactions with the virtual space. This work not only advances immersive technology but also redefines how humans and virtual environments coexist, setting a foundation for a new era of human-environment synthesis.

2605.22493 2026-05-22 cs.LG cs.AI cs.RO Version update

Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

Lorenzo Mazza, Massimiliano Datres, Ariel Rodriguez, Sebastian Bodenstedt, Gitta Kutyniok, Stefanie Speidel

Affiliations: NCT-Dresden

AI summary: This paper studies why behavioral cloning fails under multimodality, analyzes how different multimodal parameterizations of action-chunking policies fail in different ways, and proposes improving robustness by adjusting the degree of regularization and improving the generative policy.

Abstract

Behavioral cloning becomes difficult when the same observation admits several valid actions. We study this problem for action-chunking policies and show that different multimodal parameterizations fail in different ways. For latent-variable policies, posterior-prior regularization makes deployment-time sampling more reliable, but excessive regularization removes the action-conditioned information needed to distinguish demonstrated modes. Reducing this regularization can preserve mode information, but then success depends on whether the prior covers the relevant latent regions. For action-space generative policies, multimodality is constrained by the smoothness of the base-to-action transport: a map with small Lipschitz constant cannot assign substantial probability to many well-separated modes. Covering many modes therefore requires either sharp transitions in base space or off-support bridge regions in action space. Experiments on synthetic multimodal tasks and robotic simulation benchmarks support these mechanisms.
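
The Lipschitz argument can be illustrated with a one-dimensional sketch. The setup below (a uniform base variable, interval-shaped modes, and the symbols $K$, $d$, $r$, $L$) is an assumption chosen for illustration and is not taken from the paper.

```latex
% Illustrative 1-D setting (an assumption, not the paper's general statement):
% z ~ Uniform[0,1], T : [0,1] -> R is continuous and L-Lipschitz,
% modes M_k = [c_k - r, c_k + r], k = 1..K, with c_{k+1} - c_k >= d > 2r.
\begin{align*}
  &\text{If } T \text{ reaches both } M_k \text{ and } M_{k+1},
    \text{ it must cross the gap between them, of width } w_k \ge d - 2r, \\
  &\text{and an } L\text{-Lipschitz map needs a base interval of length at least }
    w_k / L \text{ to do so.} \\
  &\Pr\Bigl[\, T(z) \in \textstyle\bigcup_{k=1}^{K} M_k \Bigr]
    \;\le\; 1 - \frac{(K-1)\,(d - 2r)}{L}
    \quad \text{when } T \text{ reaches all } K \text{ modes.}
\end{align*}
```

Under these assumptions, keeping the off-support mass small requires $L$ to grow in proportion to $(K-1)(d-2r)$, which matches the trade-off between sharp transitions in base space and bridge regions in action space.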

2605.22456 2026-05-22 cs.RO cs.AI Version update

Steins;Gate Drive: Semantic Safety Arbitration over Structured Futures for Latency-Decoupled LLM Planning

Anjie Qiu, Hans D. Schotten

Affiliations: Institute for Wireless Communication and Navigation; RPTU University Kaiserslautern-Landau; German Research Center for Artificial Intelligence

AI summary: This paper presents SteinsGateDrive, a latency-decoupled planner-runtime architecture that reduces effective lag from +3.07 s to -0.01 s while preserving the safety boundary, improving planning efficiency for autonomous driving.

Comments: 10 pages, 2 figures, 5 tables, submitted to IEEE Transactions on Intelligent Vehicles

Abstract

Cloud-hosted LLM driver agents provide useful semantic judgments, but their inference latency exceeds stepwise vehicle-control windows. Learned world models predict futures, but they usually keep future generation and action selection inside large coupled loops. We present SteinsGateDrive, a latency-decoupled planner-runtime architecture in which the worldline metaphor from the eponymous story names one plausible consequence of an intervention: the LLM selects counterfactual driving futures before the final control instant, and a runtime reuses the selected forecast only while safety contracts remain valid. The generator builds three world-line roles: alpha nominal ego-conditioned futures, beta interaction counterfactuals around nearby vehicles, and gamma hazard-stress futures such as braking, cut-ins, or blocked corridors. The selected branch becomes a typed StrategicForecast with horizon, validity/abort conditions, fallback, and authority. On a within-subject, matched-seed normal-highway protocol with 10 seeds and 20 steps, GPT-5.4 mini reduces effective lag from +3.07 s at 1-second horizon to -0.01 s at 4-second horizon while preserving the measured no-collision safety boundary. The architecture's safety contribution comes from the atom-predicate runtime check, not from the drift score, which functions as a refresh-frequency knob.

2605.22446 2026-05-22 cs.CV cs.AI cs.RO Version update

Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng

Affiliations: Beihang University; Tsinghua University; Peking University; JDT AI Infra; Zhejiang University

AI summary: This paper proposes Pre-VLA, a unified runtime verification architecture that assesses action validity before physical execution or world-model imagination, improving the reliability of vision-language-action and world-model rollouts.

Abstract

While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.
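
For readers unfamiliar with the focal term in the training objective, here is a minimal sketch of the standard binary focal loss. It shows only the generic loss, not Pre-VLA's full multi-task objective, and the default values of `gamma` and `alpha` are common choices from the focal-loss literature rather than values stated in the abstract.

```python
from math import log


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction p = Pr(y = 1) and a label y in {0, 1}.

    The factor (1 - p_t) ** gamma down-weights examples that are already
    classified confidently, which helps under severe class imbalance.
    """
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to keep log() finite
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log(p_t)


easy = binary_focal_loss(0.95, 1)   # confident and correct
hard = binary_focal_loss(0.30, 1)   # uncertain and wrong-leaning
plain = binary_focal_loss(0.80, 1, gamma=0.0, alpha=0.5)  # scaled cross-entropy
```

With `gamma=0` the modulating factor disappears and the loss reduces to a class-weighted cross-entropy, which makes the effect of the focal term easy to isolate.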

2605.22443 2026-05-22 cs.RO Version update

Terminal Constraint Model Predictive Control for Image-Based Visual Servoing of UAVs with Kalman Filter-Based Moment Loss Compensation

X. Wang, Y. Cao, W. L. W. Leong, Y. R. Tan, S. Huang, S. H. R. Teo, C. Xiang

Affiliations: College of Design and Engineering, National University of Singapore

AI summary: This paper proposes a terminal-constraint model predictive control (TC-MPC) framework combined with a Kalman filter mechanism. It addresses two problems in image-based visual servoing: loss of closed-loop stability caused by input and state constraints, and loss of moment features caused by aggressive motion.

Abstract

Image-Based Visual Servoing (IBVS) provides an efficient vision-guided control paradigm for unmanned aerial vehicles (UAVs) by directly regulating image-space errors. However, conventional IBVS controllers are vulnerable to two critical issues: loss of closed-loop stability near the target due to input and state constraints, and control failure caused by intermittent loss of moment-based visual features under aggressive motion. To address these challenges, this paper proposes a terminal-constraint model predictive control (TC-MPC) framework for IBVS, integrated with a Kalman filter (KF)-based state-prediction mechanism. The TC-MPC explicitly incorporates terminal-state constraints and a terminal cost into the IBVS error dynamics, ensuring recursive feasibility, improved convergence behavior, and closed-loop stability under control and state constraints. In parallel, the Kalman filter predicts the temporal evolution of image moments during short-term visual degradation, enabling the controller to preserve control continuity when moment measurements are partially unavailable. The proposed approach is validated through real-time UAV visual servoing experiments.
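
The role of the Kalman filter during measurement dropouts can be sketched with a generic filter. The example below tracks a single scalar image moment with a constant-velocity model and skips the update step when no measurement arrives. The model, the noise values `q` and `r`, and the function names are illustrative assumptions, not the paper's design.

```python
def predict(x, P, dt, q):
    """Time update for state x = (moment, moment_rate) with F = [[1, dt], [0, 1]]."""
    m, v = x
    x_new = (m + dt * v, v)
    # P_new = F P F^T + q * I, written out for the 2x2 case.
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    return x_new, [[p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11],
                   [p10 + dt * p11, p11 + q]]


def update(x, P, z, r):
    """Measurement update with H = [1, 0], i.e. only the moment is observed."""
    m, v = x
    s = P[0][0] + r                      # innovation variance
    k0, k1 = P[0][0] / s, P[1][0] / s    # Kalman gain
    resid = z - m
    x_new = (m + k0 * resid, v + k1 * resid)
    # P_new = (I - K H) P
    P_new = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
             [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
    return x_new, P_new


def step(x, P, z, dt=0.5, q=1e-4, r=1e-2):
    """One filter step; pass z=None when the visual feature is lost."""
    x, P = predict(x, P, dt, q)
    if z is not None:
        x, P = update(x, P, z, r)
    return x, P


x0, P0 = (0.0, 1.0), [[1.0, 0.0], [0.0, 1.0]]
x_lost, P_lost = step(x0, P0, z=None)   # dropout: prediction only
x_seen, P_seen = step(x0, P0, z=0.5)    # measurement available
```

During a dropout the estimate keeps moving along the predicted rate while its variance grows, so a controller can continue to act on the predicted moment for a short time and weight it accordingly.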

2605.22431 2026-05-22 cs.RO Version update

Real-Time Auto-Optimization in Unknown Environments via Structure-Exploiting Dual Control for Exploration and Exploitation

Shiying Dong, Haoyang Yang, Qiwei Liu, Wen-Hua Chen

Affiliations: Research Centre for Low Altitude Economy; Hong Kong Polytechnic University

AI summary: This paper proposes a fast numerical dual control method for auto-optimization problems in unknown environments. By exploiting the structure of the dual control formulation, it improves the efficiency of exploration and exploitation and the computation speed.

Abstract

This paper develops a fast numerical dual control for exploration and exploitation (DCEE) method to address auto-optimization problems in unknown environments. In auto-optimization problems, the optimal operating condition is unknown a priori and may vary with the environment. As in classical dual control techniques, computational burden remains a major concern in DCEE for active learning. Existing DCEE methods provide a principled exploration-exploitation objective, but they are mainly realized through standard optimization packages or explicit gradient-type update laws, where the numerical structure of the DCEE has not been fully exploited. This paper shows that the reward function in DCEE has an inherent convex-over-nonlinear structure, where the exploitation and exploration terms form a unified nonlinear residual map equipped with a convex outer loss. Benefiting from this structure, a structure-exploiting numerical method is developed by linearizing only the nonlinear residual map while preserving the convex outer loss. Thus, each subproblem is transformed into a structured convex form that can be solved reliably. The resulting generalized Gauss-Newton Hessian approximation is positive semidefinite and depends only on first-order derivatives, thereby supporting fast online computation. The proposed method is evaluated on a vehicle cruising auto-optimization problem and compared with existing methods. Simulation and hardware-in-the-loop experimental results show that the proposed method improves control performance and achieves a speedup of approximately one order of magnitude, with a microsecond-level maximum computation time of only 83 μs on a typical vehicle embedded CPU.
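
The convex-over-nonlinear structure and the resulting Hessian approximation can be written out in generic form. The symbols below ($u$, $r$, $\phi$, $J_k$) are illustrative choices, and the block shows the standard generalized Gauss-Newton construction rather than the paper's exact DCEE reward.

```latex
% Illustrative notation: u is the decision variable, r(u) the stacked
% exploitation/exploration residual, \phi the convex outer loss,
% J_k = \partial r / \partial u evaluated at the current iterate u_k.
\begin{align*}
  \min_{u}\; & \phi\bigl(r(u)\bigr) \\
  r(u) &\approx r(u_k) + J_k\,(u - u_k) \\
  u_{k+1} &= \arg\min_{u}\; \phi\bigl(r(u_k) + J_k\,(u - u_k)\bigr) \\
  H_{\mathrm{GGN}} &= J_k^{\top}\,\nabla^2\phi\bigl(r(u_k)\bigr)\,J_k \;\succeq\; 0
\end{align*}
```

The approximation is positive semidefinite because $\nabla^2\phi \succeq 0$ for a convex, twice-differentiable $\phi$. The exact Hessian would add the term $\sum_i \frac{\partial \phi}{\partial r_i}\,\nabla^2 r_i(u_k)$, which needs second derivatives of the residual and can be indefinite; dropping it is what leaves only first-order derivatives of $r$.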

2605.22420 2026-05-22 cs.CV cs.AI cs.RO Version update

Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction

Henry Che, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun

Affiliations: Waabi; University of Toronto; University of Illinois Urbana-Champaign

AI summary: This paper proposes GenRe, a diffusion-guided generalizable enhancer for urban scene reconstruction. By learning generative priors across diverse scenes, it efficiently produces robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints, enabling robust and scalable sensor simulation for autonomous driving.

Comments: ICRA 2026. Project page: https://waabi.ai/genre

Learning a Unified Risk Map in Partially Observable Environments

Jie Jia, Yaofeng Su, Zeyu Bao, Yun Hong, Bingzhao Gao, Zhongxue Gan, Wenchao Ding

Affiliations: Fudan University; Tongji University

AI summary: This paper proposes a unified risk map modeling and learning framework for autonomous driving in partially observable environments. It integrates traffic flow risk and collision risk through spatiotemporal modeling to assess occlusion-induced hazards at a finer granularity, and introduces a diffusion-based scenario generation framework to address the scarcity of occluded-interaction scenarios. Experiments on the Waymo Open Motion Dataset show that it significantly outperforms existing methods.

Comments: Published in IEEE Robotics and Automation Letters

DOI: 10.1109/LRA.2026.3666393

Abstract

Occlusion-aware prediction remains a critical challenge in autonomous driving due to the inherent uncertainty of unobserved regions. Existing approaches either overestimate risk based on reachable states or struggle to predict accurate trajectories under high occlusion uncertainty. To address these limitations, we propose a unified risk map modeling and learning framework for partially observable environments. Our method integrates traffic flow risk and collision risk through spatiotemporal modeling, enabling fine-grained assessment of occlusion-induced hazards. To address the scarcity of scenarios involving occluded interactions, we introduce a diffusion-based scenario generation framework that produces realistic yet adversarial scenarios. We integrate the modeling and learning of a unified risk map into a framework that supports risk-aware planning under partial observability. Experiments on the Waymo Open Motion Dataset show that our method significantly outperforms the state-of-the-art occlusion-aware baseline, improving minimum time-to-collision by 0.78 times and average time-to-collision by 1.67 times. The proposed framework offers a comprehensive and practical solution for risk-aware planning in partially observable environments.

2605.22164 2026-05-22 cs.LG cs.RO Version update

Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics

Liangyu Li, Shengzhi Wang, Qingwen Liu

Affiliations: Tongji University

AI summary: This paper proposes trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models. It trains a small pairwise head to improve terminal ranking, which in turn improves performance on continuous manipulation tasks.

Comments: 26 pages, 7 figures

Abstract

Latent world models can contain the state needed for control, yet their terminal-cost interface can expose the planner to the wrong decision-relevant information. In common latent MPC, candidate sequences are ranked by Euclidean distance between predicted terminal and goal latent states; this assumes that raw latent distance weights reachability-relevant variables correctly. We propose trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models. TRM trains a small pairwise head from logged trajectory structure and uses it as a replacement or hybrid cost; the encoder, dynamics, sampler, optimizer, and evaluation manifests remain fixed. The key design choice is horizon-aware supervision: the metric is trained on broad, balanced temporal separations to match the long-horizon terminal candidate ranking problem. On a hard TwoRoom benchmark, raw latent planning with LeWorldModel (LeWM) reaches 7.0% success, while full-horizon TRM reaches 97.0%; shuffled temporal-label controls stay at 0.0%. The same recipe improves a PLDM baseline from 32.7% to 84.0% across three seeds, and a short-horizon TRM variant reaches only 35.0% with the 100,000 pair budget. In TwoRoom, we provide mechanistic evidence for why TRM works: XY position is linearly decodable (R^2=0.998), yet raw latent MSE misranks candidates; the XY-probe rowspace accounts for less than 1% of terminal-goal latent MSE but carries most candidate-quality signal; and SCSA audits show that TRM improves the ordering and selected endpoint seen by the planner. On PushT go50/go75, TRM-style task-state metrics improve SCSA ranking and selected final distance more cleanly than closed-loop success, motivating auxiliary hybrid costs in continuous manipulation. TRM is the planner-facing repair, and audits explain when terminal reachability metrics should replace or augment raw latent proximity.
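
The misranking problem described in the abstract can be reproduced on a toy example. The sketch below is not TRM: it replaces the learned pairwise head with a hand-written cost that looks only at position-like coordinates, and the four-dimensional latent layout and candidate values are invented for illustration.

```python
from math import dist

# Hypothetical latent layout: first two dims encode position, last two are nuisance.
goal = (1.0, 1.0, 0.0, 0.0)
candidates = {
    "near_in_position": (1.0, 0.9, 3.0, -3.0),
    "near_in_nuisance": (-1.0, -1.0, 0.1, 0.0),
}


def euclidean_cost(z, g):
    """Raw latent distance: every coordinate is weighted equally."""
    return dist(z, g)


def position_only_cost(z, g):
    """Stand-in for a reachability-aware metric: ignores nuisance coordinates."""
    return dist(z[:2], g[:2])


def best(cost):
    """Name of the candidate terminal state with the lowest cost to the goal."""
    return min(candidates, key=lambda name: cost(candidates[name], goal))


picked_by_euclidean = best(euclidean_cost)
picked_by_position = best(position_only_cost)
```

Here the raw distance prefers the candidate that is far from the goal position because nuisance coordinates dominate the sum, even though position is fully recoverable from the latent. A learned metric addresses the same failure without hand-picking which coordinates matter.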

2605.22138 2026-05-22 cs.AI cs.CL cs.LG cs.RO Version update

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

Affiliations: Institute of Foundation Models (IFM); Carnegie Mellon University

AI summary: This paper argues for improving the efficiency of agentic reasoning by decomposing decision-making into three systems (simulative reasoning, self-regulation, and reactive execution) and reports the performance of the SR$^2$AM model across different tasks.

Comments: Code and model artifacts are available at https://github.com/sailing-lab/sr2am

Abstract

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

2605.22123 2026-05-22 cs.RO Version update

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

Tengye Xu, Yangting Sun, Ziju Shen, Guanqi Chen, Zhen Fu, Chen yizhou, Hua Chen, Jia Pan

Affiliations: School of Computing and Data Science, The University of Hong Kong; LimX Dynamics Technology Co., Ltd; Southern University of Science and Technology; Peking University; Zhejiang University

AI summary: This paper proposes a method for learning invariant rewards from a few demonstrations so that rewards generalize in real-world robotics. It improves reward design by discovering behavioral invariants, which improves policy learning across multiple tasks.

Abstract

Designing reward functions that generalize beyond controlled laboratory settings remains a fundamental challenge in reinforcement learning for robotics. In open-world manipulation problems, a single task can appear in numerous variants through different object instances, positions, and camera viewpoints. Recent vision-based reward models tend to memorize specific pixel distributions and fail to generalize beyond their training conditions. To address this, we propose a framework that learns invariant symbolic reward functions from as few as five demonstrations. The insight is to shift from visual feature-fitting to the discovery of behavioral invariants: task-level properties that remain constant across diverse visual instantiations. The framework has two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations without online interaction. Experiments on eight Meta-World tasks and three Franka manipulation tasks demonstrate that our method achieves stronger process alignment and policy rollout ranking abilities compared to baselines, accelerating downstream policy learning. Three real-world out-of-distribution experiments further show that the same learned reward generalizes zero-shot to position, viewpoint, and object variations, enabling a single reward representation to be reused across diverse task variants in practice.

2605.18047 2026-05-22 cs.RO 版本更新

FUSE: A Framework for Unified State Estimation in Vehicular and Robotic SLAM Systems

FUSE：一种用于车辆和机器人SLAM系统统一状态估计的框架

Wei Wu, Honglin Chen, Wenhan Cao, Yao Lyu, Shaobing Xu, Kun Jiang, Jiangtao Li, Tao Zhang, Lei Guo, Shengbo Eben Li

发表机构 * State Key Lab of Intelligent Green Vehicle and Mobility, Tsinghua University（智能绿色车辆与移动国家重点实验室，清华大学）； School of VM and College of AI, Tsinghua University（车辆学院与人工智能学院，清华大学）； SunRisingAI Ltd.（SunRisingAI有限公司）； China Intelligent and Connected Vehicles (Beijing) Research Institute Co., Ltd.（中国汽车智能互联车辆（北京）研究院有限公司）

AI总结本文提出FUSE框架，用于统一车辆和机器人SLAM系统中的状态估计，通过分离时间处理、局部几何关联、估计器公式和地图更新策略，提高状态估计设计的灵活性和准确性。

AI中文摘要

在混合速率传感条件下，紧耦合SLAM方案通常将时间处理、局部几何关联、估计器形式和地图更新策略绑定在特定方法的设计之中。这种绑定使得很难在不重新设计其余状态估计流程的情况下单独改变某一项设计选择。本文提出FUSE，一种用于车辆和机器人SLAM系统统一状态估计的框架。FUSE围绕观测接入、传播、更新和状态查询来组织状态估计接口，并利用该接口将时间处理、可直接用于残差计算的局部几何关联、估计器形式和地图更新策略相互分离。本文开发了一个LiDAR-IMU实例，在混合速率传感和方向退化条件下检验该框架，其中高频惯性传播、LiDAR触发的几何更新、残差筛选和退化感知修正均通过相同的接口边界运行。在一条418米的环形走廊序列上，该实例的端到端轨迹误差为1.626米，相比该序列上误差最低的基线Faster-LIO，相对误差降低了7.9%。结果支持将FUSE作为组织状态估计设计选择的框架，并展示了所评估的实例如何在弱可观测方向上对更新进行正则化。

英文摘要

Tightly coupled SLAM formulations under mixed-rate sensing often bind temporal processing, local geometric association, estimator formulation, and map-update policy into method-specific designs. Such binding makes it difficult to vary one design choice without re-engineering the rest of the state-estimation process. This paper presents FUSE, a framework for unified state estimation in vehicular and robotic SLAM systems. FUSE organizes the state-estimation interface around observation ingestion, propagation, update, and state query, and uses this interface to separate temporal processing, residual-ready local geometric association, estimator formulation, and map-update policy. A LiDAR-IMU instantiation is developed to examine the framework under mixed-rate sensing and directional degeneracy, where high-rate inertial propagation, LiDAR-triggered geometric update, residual screening, and degeneracy-aware correction operate through the same interface boundaries. On a 418 m loop-corridor sequence, the instantiation reports a 1.626 m end-to-end trajectory error, corresponding to a 7.9% relative error reduction compared with Faster-LIO, the lowest-error baseline on this sequence. The results support FUSE as a framework for organizing state-estimation design choices and show how the evaluated instantiation regularizes updates along weakly observable directions.
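
The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a reader's sanity check, not part of the paper. It assumes that the 7.9% figure is defined as (baseline error - FUSE error) / baseline error on the same sequence; the abstract does not state the Faster-LIO error, so the baseline value computed here is implied rather than reported. All variable names are illustrative.

```python
# Values quoted in the abstract.
path_length_m = 418.0
fuse_error_m = 1.626
relative_reduction = 0.079

# Assumption: reduction = (baseline - fuse) / baseline, so baseline = fuse / (1 - reduction).
implied_baseline_error_m = fuse_error_m / (1.0 - relative_reduction)

# End-to-end error expressed as a share of the travelled distance.
drift_percent = 100.0 * fuse_error_m / path_length_m

print(f"implied baseline error: {implied_baseline_error_m:.3f} m")
print(f"FUSE error relative to path length: {drift_percent:.2f} %")
```

Under that assumption the implied baseline error is roughly 1.77 m, and the FUSE error is about 0.39% of the 418 m path.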

2605.17950 2026-05-22 cs.RO cs.SY eess.SY 版本更新

Active Defense Against False Data Injection Attacks in Robotic Manipulators

对抗机器人机械臂中虚假数据注入攻击的主动防御

Gabriele Gualandi, Carl Mikael Larsson, Alessandro V. Papadopoulos

发表机构 * Mälardalen University, Västerås, Sweden

AI总结本文提出两种防御方法，即异常感知虚拟阻尼和操作性降低，以提高机器人机械臂在有限时间范围内抵御虚假数据注入攻击的能力，并通过仿真验证其有效性。

Comments Extended 8-page version containing full proofs. An abridged 6-page version has been accepted for publication in the Proceedings of the 23rd IFAC World Congress (2026). v3: Minor typographical fixes and updated reference formatting

2605.15153 2026-05-22 cs.RO cs.AI 版本更新

Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Pelican-Unify 1.0：一种用于理解、推理、想象与行动的统一具身智能模型

Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Zeting Liu, Xianzhou Hou, Yong Dai, Jian Tang, Xiaozhu Ju

发表机构 * Beijing Innovation Center of Humanoid Robotics (X-Humanoid)（北京人形机器人创新中心（X-Humanoid））

AI总结本文提出Pelican-Unify 1.0，一种基于统一原则训练的首个具身基础模型，通过单一视觉语言模型作为统一理解模块，将场景、指令、视觉上下文和行动历史映射到共享语义空间，并通过统一推理模块生成任务、行动和未来导向的思维链，最终将隐藏状态投影到密集潜在变量中，再通过统一未来生成器生成未来视频和行动。

AI中文摘要

当同时定位与建图遇见无线通信：一篇综述

Konstantinos Gounis, Sotiris A. Tegos, Dimitrios Tyrovolas, Panagiotis D. Diamantoulakis, George K. Karagiannidis

发表机构 * Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki（塞萨洛尼基亚里士多德大学电气与计算机工程系）

AI总结本文综述了SLAM与无线通信交汇领域的最新进展，重点探讨了视觉SLAM（V-SLAM）整合中的双向影响，总结了无线信号传播、几何信道建模、基于射频（RF）的定位与感知等关键概念，以及图像处理技术如何检测地标并预测无线信道的最优路径，同时分析了SLAM与无线通信交叉领域的技术、挑战和未来方向。

英文摘要

This paper surveys the state-of-the-art in the nexus of SLAM and Wireless Communications, attributing the bidirectional impact of each with a focus on visual SLAM (V-SLAM) integration. We provide an overview of key concepts related to wireless signal propagation, geometric channel modeling, and radio frequency (RF)-based localization and sensing. In addition to this, we show image processing techniques that can detect landmarks, proactively predicting optimal paths for wireless channels. Several dimensions are considered, including the prerequisites, techniques, background, and future directions and challenges of the intersection between SLAM and wireless communications. We analyze estimation and control approaches such as Bayesian filters, feature-based pose estimation, perception-aware motion control, spatial methods for signal processing such as vector fields, and key technological aspects. We expose techniques and items towards enabling a highly effective retrieval of the autonomous robot state. Among other interesting findings, we observe that monocular V-SLAM would benefit from RF relevant information, as the latter can serve as a proxy for the scale ambiguity resolution. Conversely, we find that wireless communications in the context of 5G and beyond can potentially benefit from visual odometry that is central in SLAM. Moreover, we examine other sources besides the camera for SLAM and describe the twofold relation with wireless communications. Finally, integrated solutions performing joint communications and SLAM appear to be in their infancy: theoretical and practical advancements are required to add higher-level localization and semantic perception capabilities to RF and multi-antenna technologies.

2511.07820 2026-05-22 cs.RO cs.AI cs.CV cs.GR cs.SY eess.SY 版本更新

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

SONIC：面向自然人形全身控制的超大规模运动追踪

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Fernando Castañeda, Sirui Chen, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Jinhyung Park, David Sami, Zi Wang, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu

发表机构 * NVIDIA

AI总结本文提出了一种超大规模运动追踪方法，通过扩大模型容量、数据和计算资源，得到一个能够产生自然且稳健全身运动的通用人形控制器，并展示了其在运动追踪任务中的可扩展性及在下游任务中的应用价值。

Comments Project page: https://nvlabs.github.io/SONIC/

AI中文摘要

尽管在数千块GPU上训练的十亿参数级基础模型不断涌现，类似的规模化收益尚未在人形机器人控制中得到验证。当前的人形机器人神经控制器规模仍然较小，仅针对有限的行为集，并在少量GPU上训练。我们证明，扩大模型容量、数据和算力可以得到一个通用的人形控制器，能够实现自然且稳健的全身运动。我们将运动追踪定位为人形控制的可扩展任务，利用多样化动作捕捉数据提供的密集监督来获取人类运动先验，而无需手动设计奖励。我们沿三个维度进行扩展，构建了一个运动追踪基础模型：网络规模（120万到4200万参数）、数据集规模（来自700小时动作捕捉的1亿+帧）以及算力（2.1万GPU小时）。除了展示规模化的收益外，我们还通过以下两点展示了其下游价值：（1）实时运动学规划器，将运动追踪与导航等任务衔接起来，实现自然的交互式控制；（2）统一的token空间，以单一策略支持VR遥操作和视觉-语言-动作（VLA）模型。通过这一接口，我们展示了需要手脚落点协调的、由VLA驱动的自主全身移动操作。规模化的运动追踪表现出良好的特性：性能随算力和数据多样性稳步提升，学到的策略能泛化到未见过的运动，从而使大规模运动追踪成为人形控制的实用基础。

英文摘要

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

2507.23773 2026-05-22 cs.AI cs.CL cs.LG cs.RO 版本更新

General Agentic Planning Through Simulative Reasoning with World Models

通过世界模型的模拟推理实现通用代理规划

Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing

发表机构 * Institute of Foundation Models (IFM)（基础模型研究所）； Carnegie Mellon University（卡内基梅隆大学）； UC San Diego（加州大学圣迭戈分校）

AI总结本文提出通过模拟推理实现通用代理规划，利用世界模型进行未来状态预测，提升决策能力，通过SiRA架构在不同任务中取得更高任务完成率。

Comments Winner of Berkeley LLM Agents Hackathon (Fundamentals Track); code available at https://github.com/sailing-lab/sira

AI中文摘要

规划意味着什么？当前的智能体系统，无论是脚手架式工作流还是端到端策略，都依赖反应式决策：通过固定流程选择下一步动作，至多辅以未加区分的自适应计算（例如思维链），缺乏对未来结果的显式建模。这限制了通用性，因为每个新任务都需要重新设计，而不是迁移共享的推理能力。相比之下，人类通过在内部世界模型中对候选动作的后果进行心理模拟来规划，这种能力被称为模拟推理（系统II），它支持在不同情境下灵活的、目标导向的行为。我们认为，基于世界模型的模拟推理为智能体系统提供了一种通用的规划机制，它让决策建立在预测的未来状态而非模式匹配的响应之上，从而改进了反应式策略（系统I）。为验证这一点，我们提出SiRA（模拟推理架构），这是一种目标导向的架构，利用基于LLM的世界模型和自然语言信念状态来实现模拟推理，同时保持模型无关性。我们在网页浏览器环境中评估了三类性质不同的任务：受约束导航、多跳信息聚合和通用指令跟随。在所有类别中，与配置相匹配的反应式基线相比，模拟推理的任务完成率最高提升124%；与一个有代表性的开放网络智能体相比，受约束导航的成功率从0%提高到32.2%。这一优势在不同任务类型中持续存在，表明收益来自可泛化的反事实评估，而非针对特定任务的调优。

英文摘要

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.
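
The contrast between reactive selection and planning by simulation can be illustrated with a toy example. The sketch below is not SiRA: the abstract describes an LLM-based world model with natural-language belief states, whereas this sketch assumes a hand-written deterministic corridor model, and a one-step lookahead stands in for the reactive baseline. Every name in it (`corridor_model`, `score`, `rollout`, `plan`) is hypothetical.

```python
from itertools import product


def corridor_model(state, action):
    """Toy deterministic world model: state is (position, has_key) on cells 0..3."""
    position, has_key = state
    if action == "pick":
        return (position, has_key or position == 0)  # the key lies at cell 0
    if action == "left":
        return (max(position - 1, 0), has_key)
    if action == "right":
        if position == 2 and not has_key:
            return state  # locked door between cells 2 and 3
        return (min(position + 1, 3), has_key)
    raise ValueError(f"unknown action: {action}")


def score(state):
    """Higher is better: negative distance to the goal cell 3."""
    return -abs(3 - state[0])


def rollout(state, actions):
    """Predict the final state after applying a sequence of actions in the model."""
    for action in actions:
        state = corridor_model(state, action)
    return state


def plan(state, horizon):
    """Return the first action of the best simulated action sequence."""
    actions = ("left", "right", "pick")
    best = max(product(actions, repeat=horizon),
               key=lambda seq: score(rollout(state, seq)))
    return best[0]


start = (1, False)
print(plan(start, horizon=1))  # one-step lookahead: "right"
print(plan(start, horizon=5))  # five-step simulation: "left"
```

With a horizon of one step the agent walks toward the goal and stalls at the locked door. Simulating five steps ahead shows that fetching the key first is the only way to reach the goal, so the planner moves away from the goal on its first step.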

2404.05307 2026-05-22 cs.CV cs.RO 版本更新

4D Radar Semantic Segmentation of People in Field Conditions Using Temporal Multi-View Networks

利用时序多视角网络进行野外条件下4D雷达的人体语义分割

Mikael Skog, Oleksandr Kotlyar, Vladimír Kubelka, Martin Magnusson

发表机构 * Center for Advanced Autonomous Sensor Systems (AASS)（先进自主传感器系统中心）

AI总结本文提出TMVA4D网络，利用4D雷达数据进行人体语义分割，通过多视角投影区分背景与人体，在低能见度条件下实现75.9%的Dice系数和61.2%的IoU指标。

AI中文摘要

可靠的人员检测对于移动机器人和重型车辆在道路和工业环境（如采矿和建筑）中的安全自主至关重要。然而，摄像头或激光雷达等常用传感器在尘埃、雾或烟等恶劣条件下容易失效，限制了其在现实机器人系统中的应用。相比之下，雷达在各种环境条件下都能提供稳健的测量。特别是现代高分辨率4D成像雷达可提供覆盖距离、方位和仰角的4D点云，以及每个点的多普勒速度数据，非常适合机器人感知。我们提出TMVA4D，一个基于CNN和ConvLSTM编码器的神经网络架构家族，利用4D雷达模态进行语义分割。这些架构使用4D雷达数据的一系列2D投影进行训练，以区分背景和人员类别，投影涵盖仰角、方位、距离和多普勒速度维度。在多个作业现场的评估中，我们的模型即使在低能见度条件下也取得了良好的性能（人员类别的Dice为75.9%，IoU为61.2%）。数据和代码将在论文发表后公开发布。

英文摘要

Reliable people detection is crucial for the safe autonomy of mobile robots and heavy vehicles, both on roads and in industrial settings like mining and construction. However, common sensors like cameras or lidars are prone to failure in adverse conditions such as dust, fog, or smoke, which limits their use in real-world robotic systems. Radar, on the other hand, delivers robust measurements in a wide range of environmental conditions. In particular, modern high-resolution 4D imaging radars provide 4D point clouds across range, azimuth, and elevation, as well as per-point Doppler velocity data, well suited for robot perception. We propose TMVA4D, a family of artificial neural network architectures based on CNN and ConvLSTM encoders that leverage the 4D radar modality for semantic segmentation. The architectures are trained to distinguish between background and person classes using a series of 2D projections of the 4D radar data, encompassing elevation, azimuth, range, and Doppler velocity dimensions. Evaluated across several operational sites, our models achieve promising performance (Dice 75.9%, IoU 61.2% for class person) even in low-visibility conditions. The data and code will be made publicly available upon publication.
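
The two reported metrics are related. When Dice and IoU are computed from the same true-positive, false-positive, and false-negative counts, Dice = 2·IoU / (1 + IoU). The sketch below is a reader's consistency check, not the authors' evaluation code. It assumes both figures come from the same aggregated counts, which the abstract does not state, and the function names are illustrative.

```python
def iou_and_dice(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """IoU and Dice for one class from pixel (or point) counts."""
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


def dice_from_iou(iou: float) -> float:
    """Dice implied by IoU when both use the same TP/FP/FN counts."""
    return 2.0 * iou / (1.0 + iou)


reported_iou = 0.612
print(f"implied Dice: {100 * dice_from_iou(reported_iou):.1f} %")  # implied Dice: 75.9 %
```

The implied value matches the reported Dice of 75.9%, so the two numbers are mutually consistent under this assumption.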

2605.22021 2026-05-22 cs.RO 版本更新

Industrial Dual-Arm Box Handling via Online Inertial Estimation and Convex Wrench Optimization

基于在线惯性估计和凸力旋量优化的工业双臂箱体搬运

Kenzhi Iskandar Wong, Lin Yang, Qian Ying Lee, Domenico Campolo

发表机构 * School of Mechanical and Aerospace Engineering, Nanyang Technological University（机械与航空航天工程学院，南洋理工大学）

AI总结本文提出了一种摩擦感知的双臂箱体搬运框架，用于处理具有未知惯性特性的物体。通过在线估计物体质量和质心，并利用二阶锥规划在椭球摩擦极限面约束下计算满足摩擦可行性的接触力和扭转力矩，从而实现稳定的搬运。

Comments 14 pages, submitted to Robotics and Computer-Integrated Manufacturing (RCIM) Journal

AI中文摘要

工业机器人物体搬运经常涉及箱子和包裹，其质量和质心通常无法事先获知。这些不确定性会影响稳定提升所需的力-力矩平衡，而对接触力旋量（wrench）调节不当可能导致滑动、物体掉落、姿态偏差或过度挤压。本文提出了一种摩擦感知的双臂箱体搬运框架，用于惯性特性未知的物体。该方法根据测得的接触力旋量在线估计物体质量和质心，并在椭球摩擦极限面约束下，通过二阶锥规划（SOCP）计算满足摩擦可行性的接触力和扭转力矩。框架还包含一个离线轨迹细化阶段，用于在存在几何约束时减少不期望的物体-环境接触。通过将摩擦可行性作为硬约束，并在可行域内最小化接触作用力，该框架无需将防滑和避免过度挤压作为分别调节的目标，即可实现稳定提升。在真实双臂机器人系统上针对不同质心配置进行的实验表明，该方法能够提起惯性特性未知的物体，同时保持稳定的摩擦接触。

英文摘要

Industrial robotic object handling often involves boxes and packages whose mass and center of mass are not known in advance. These uncertainties affect the force-moment balance required for stable lifting, and improper regulation of contact wrenches can lead to slip, object drop, orientation deviation, or excessive squeezing. This paper presents a friction-aware dual-arm box-handling framework for objects with unknown inertial properties. The proposed approach estimates the object mass and center of mass online from measured contact wrenches, and computes friction-feasible contact forces and torsional moments through a second-order cone program (SOCP) under ellipsoidal friction-limit-surface constraints. An offline trajectory refinement stage is also included to reduce undesired object-environment contact when geometric constraints are present. By enforcing friction feasibility as a hard constraint and minimizing contact effort within the feasible region, the framework achieves stable lifting without treating slip avoidance and excessive squeezing as separately tuned objectives. Experiments on a real dual-arm robotic system under different center-of-mass configurations demonstrate that the method lifts objects with unknown inertial properties while maintaining stable frictional contact.
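
An ellipsoidal friction limit surface can be written as a second-order cone condition, which is what makes an SOCP formulation possible. The sketch below checks that condition for a single contact. It assumes a common textbook parameterization, sqrt(f_x² + f_y² + (τ/r)²) ≤ μ·f_n, where `torsion_radius` (r) scales torsional moment into an equivalent force. The paper's exact constraint, parameters, and solver are not given in the abstract, so the function and its arguments are hypothetical.

```python
import math


def within_friction_limit(f_normal, f_tangent_x, f_tangent_y, torsion,
                          mu, torsion_radius):
    """Test an ellipsoidal limit surface as a second-order cone condition.

    Assumed form: sqrt(fx^2 + fy^2 + (torsion / torsion_radius)^2) <= mu * f_normal.
    """
    if f_normal <= 0.0:
        return False  # the contact must be pressing to transmit friction
    combined = math.sqrt(f_tangent_x ** 2 + f_tangent_y ** 2
                         + (torsion / torsion_radius) ** 2)
    return combined <= mu * f_normal


# 10 N normal force, friction coefficient 0.5: pure tangential load of 3 N is feasible.
print(within_friction_limit(10.0, 3.0, 0.0, 0.0, mu=0.5, torsion_radius=0.02))
# Adding 0.1 N·m of torsion pushes the same contact outside the limit surface.
print(within_friction_limit(10.0, 3.0, 0.0, 0.1, mu=0.5, torsion_radius=0.02))
```

The second call shows why tangential force and torsion have to be constrained jointly: each load is feasible on its own, but the combination exceeds what the normal force can support.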

2605.21976 2026-05-22 cs.RO 版本更新

高阶推理用于无通信协作移动机器人操作

Jonathan Reasoner, Nicola Bezzo

发表机构 * Department of Electrical and Computer Engineering, University of Virginia（弗吉尼亚大学电气与计算机工程系）

AI总结本文提出了一种基于高阶推理的动态认知规划框架，使机器人能够在无通信环境下实现隐式协调和长周期规划，通过仿真和实物实验验证了其在通信受限领域中提升任务完成效率的能力。

AI中文摘要

在无通信环境下，多机器人系统必须在没有持续信息交换的情况下运作，而许多协调策略通常都假设存在这种交换。本文提出了一种新颖的动态认知规划框架，通过机器人之间的高阶推理实现隐式协调和长时域规划。我们的方法使机器人能够形成并传播高阶信念粒子，利用贝叶斯推断更新对世界的信念，并通过能够预判队友可能决策的行为树来选择动作。一种时间感知的模型预测路径积分（MPPI）控制器将这种推理整合到底层执行中，使机器人能够在部分可观测条件下规划拦截并调整轨迹。所提出的框架在仿真和实物实验中均得到评估，与一阶基线方法相比，它始终缩短了任务完成时间，证明了认知逻辑可以作为通信受限领域中稳健协调的基础。

英文摘要

In communicationless environments, multi-robot systems must operate without the constant information exchange that many coordination strategies typically assume. This paper presents a novel dynamic epistemic planning framework that enables implicit coordination and long horizon planning through higher-order reasoning among robots. With our approach, robots form and propagate higher-order belief particles, update world beliefs using Bayesian inference, and select actions via a behavior tree that anticipates teammates' likely decisions. A temporally aware Model Predictive Path Integral (MPPI) controller integrates this reasoning into low-level execution, allowing robots to plan intercepts and adapt trajectories under partial observability. The proposed framework is evaluated in both simulations and physical experiments, where it consistently reduces task completion time compared to a first-order baseline, demonstrating that epistemic logic can serve as a robust foundation for resilient coordination in communication-restricted domains.

2605.21863 2026-05-22 cs.RO 版本更新

OCELOT: Odometry and Contact Estimation for Legged Robots

OCELOT：用于足式机器人的里程计与接触估计

Emre Girgin, Cagri Kilic

发表机构 * Department of Aerospace Engineering, Embry-Riddle Aeronautical University（安柏瑞德航空大学航空航天工程系）

AI总结本文提出了一种基于误差状态扩展卡尔曼滤波器（ESEKF）的完整腿部里程计管道，通过仅使用本体感觉数据（如固定IMU、关节编码器和力传感器）来实现准确的里程计估计，核心贡献是融合接触检测和不确定性量化模块，用于显式识别并拒绝滑动。

Comments 8 pages

AI中文摘要

足式机器人领域的一大挑战是仅使用机载本体感受传感器实现准确的里程计。在本研究中，我们提出了一个基于误差状态扩展卡尔曼滤波器（ESEKF）的完整腿部里程计流程，它仅依赖本体感受数据：机身固连的IMU、关节编码器和力传感器，其中滤波器状态由被判定为处于静止支撑状态的足端进行校正。我们的核心贡献是融合式接触检测和一个不确定性量化模块，用于显式识别并剔除打滑。该模块为每只脚并行运行两个检测器：1）一个经过去抖处理、由高斯混合模型（GMM）引导的基于力的有限状态机（FSM），用于确认物理接触；2）一个作用于足端速度估计值的基于运动学的广义似然比检验（GLRT）。两个检测器输出的连续质量分数被融合，用于判断足端是否同时处于受力承载和运动学静止状态，并作为每次接触的不确定性信号。为验证我们的方法，我们采集了一个多模态数据集，包含29个序列，覆盖多种室内外地形（例如混凝土、草地、鹅卵石和岩石），总长度2.4公里。我们将该方法与本体感受和外感受方法进行了对比。结果表明，我们的方法能够提供准确的里程计估计，并能稳健应对易打滑环境。我们还开源了代码和实时ROS2软件包。

英文摘要

One of the significant challenges in legged robotics is achieving accurate odometry using only onboard proprioceptive sensors. In this study, we present a complete leg odometry pipeline based on an Error-State EKF (ESEKF) that relies exclusively on proprioceptive data: a body fixed IMU, joint encoders, and force sensors, where filter's state is corrected by feet determined to be in a stationary stance. The core of our contribution is fused contact detection and an uncertainty quantification module designed to explicitly identify and reject slippage. This module runs two detectors in parallel for each foot, 1) a debounced, force-based Gaussian Mixture Model (GMM) guided Finite State Machine (FSM) to confirm physical contact, and 2) a kinematic-based Generalized Likelihood Ratio Test (GLRT) on the estimated velocity of the foot. The continuous quality scores from both estimators are fused to detect if the foot is both physically loaded and kinematically stationary and served as an uncertainty signal for each contact. To validate our approach, we collected a multi-modal dataset of 29 sequences spanning diverse indoor and outdoor terrains (e.g., concrete, grass, pebble, and rock) total of 2.4 km long. We benchmarked our approach against both proprioceptive and exteroceptive methods. The results demonstrate our method's efficacy in providing accurate odometry estimates, robustly handling slippage-prone environments. We also share our code and real-time ROS2 package as open-source.

2605.21862 2026-05-22 cs.RO cs.AI 版本更新

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

EvoScene-VLA: 在动作解码器中进化场景信念用于分块机器人控制

Chushan Zhang, Ruihan Lu, Jinguang Tong, Xuesong Li, Yikai Wang, Hongdong Li

发表机构 * Australian National University（澳大利亚国立大学）； The University of Queensland（昆士兰大学）； Beijing Normal University（北京师范大学）

AI总结本文提出EvoScene-VLA，通过在动作解码器中维护更新的场景状态，改进分块机器人控制中的多步控制预测，提升了场景信念的持续性和准确性。

英文摘要

Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.

2605.21836 2026-05-22 cs.RO 版本更新

Analytical and Experimental Force Analysis of a Soft Linear Pneumatic Actuator

软体线性气动执行器受力的解析与实验分析

Mohammed Abboodi

AI总结本文通过分析和实验研究了一种线性软套筒执行器（LSSA）的力特性，探讨了压力、几何形状、位移、负载和轴向刚度之间的耦合效应，揭示了其力生成机制。

AI中文摘要

软套筒执行器（SSAs）最近被开发为可穿戴和辅助机器人系统中的气动驱动方法。通过将驱动结构集成到套筒状几何形状中，这些执行器可以减少对外部附件层和传动机制的依赖，同时保持与肢体形状表面的顺应性。然而，SSAs的力生成行为仍然解释不足，特别是在伸展过程中输出力的变化、外部负载的影响以及轴向刚度的机械作用方面。本文提出了线性软套筒执行器（LSSA）的分析和实验力分析。通过将净轴向力表示为由帽和折叠壁产生的压力生成贡献，并减去与轴向刚度相关的力，开发了一个准静态分析模型。该模型结合了内部压力、投影压力面积、折叠壁几何形状、轴向位移以及实验拟合的轴向刚度关系。进行了预设伸展和静态负载实验以评估执行器响应。在125kPa时，生成的力从零伸展时的约112N减少到40mm时几乎为零。静态负载延迟了可测量的力生成并减少了力输出，特别是在低和中等压力下。结果表明，LSSA的力生成由压力、几何形状、位移、负载和轴向刚度的耦合效应所支配。

英文摘要

Soft sleeve actuators (SSAs) have recently been developed as a pneumatic actuation approach for wearable and assistive robotic systems. By integrating the actuation structure into a sleeve-like geometry, these actuators can reduce reliance on external attachment layers and transmission mechanisms while maintaining compliance with limb-shaped surfaces. However, the force-generation behavior of SSAs remains insufficiently explained, particularly with respect to the variation of output force during extension, the influence of external loading, and the mechanical role of axial stiffness. This paper presents an analytical and experimental force analysis of a linear soft sleeve actuator (LSSA). A quasi-static analytical model was developed by expressing the net axial force as the pressure-generated contribution from the cap and folded walls, reduced by the force associated with axial stiffness. The model incorporates internal pressure, projected pressure areas, folded wall geometry, axial displacement, and an experimentally fitted axial stiffness relation. Prescribed-extension and static-load experiments were conducted to evaluate the actuator response. At 125 kPa, the generated force decreased from approximately 112 N at zero extension to nearly zero at 40 mm. Static loading delayed measurable force generation and reduced force output, particularly at low and intermediate pressures. The results show that LSSA force generation is governed by coupled effects of pressure, geometry, displacement, loading, and axial stiffness.
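
The quasi-static force balance described above (pressure-generated force minus the force associated with axial stiffness) can be sketched numerically. The paper's model uses separate cap and folded-wall pressure areas and an experimentally fitted stiffness relation, none of which appear in the abstract. The sketch below therefore collapses them into a single effective area and a linear stiffness, both calibrated from the two data points quoted at 125 kPa. These simplifications and all names are assumptions made for illustration.

```python
# Data points quoted in the abstract (at 125 kPa).
PRESSURE_PA = 125e3
FORCE_AT_ZERO_EXTENSION_N = 112.0
ZERO_FORCE_EXTENSION_M = 0.040

# Assumption: one lumped effective area, F(x=0) = P * A_eff.
effective_area_m2 = FORCE_AT_ZERO_EXTENSION_N / PRESSURE_PA

# Assumption: linear axial stiffness chosen so the net force vanishes at 40 mm.
stiffness_n_per_m = FORCE_AT_ZERO_EXTENSION_N / ZERO_FORCE_EXTENSION_M


def net_axial_force(pressure_pa: float, extension_m: float) -> float:
    """Net force = pressure-generated force minus stiffness-related force."""
    return pressure_pa * effective_area_m2 - stiffness_n_per_m * extension_m


for extension_mm in (0, 10, 20, 30, 40):
    force = net_axial_force(PRESSURE_PA, extension_mm / 1000.0)
    print(f"{extension_mm:2d} mm -> {force:6.1f} N")
```

Under these assumptions the effective area is about 8.96 cm² and the equivalent stiffness is 2800 N/m. The linear drop between the two quoted points is a property of the simplification, not a claim about the measured curve.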

2605.21811 2026-05-22 cs.RO 版本更新

Safe and Steerable Geometric Motion Policies for Robotic Dexterous Manipulation

安全且可操控的几何运动策略用于机器人灵巧操作

Albert Wu, Riccardo Bonalli, Thomas Lew, C. Karen Liu

发表机构 * Computer Science Department, Stanford University（斯坦福大学计算机科学系）； Laboratory of Signals and Systems, University of Paris-Saclay, CNRS, CentraleSupélec（巴黎-萨克雷大学信号与系统实验室，CNRS，CentraleSupélec）； Toyota Research Institute（丰田研究院）

AI总结本研究提出SafePBDS框架，通过几何一致的方法计算最优且可证明安全的配置流形加速度，以实现机器人灵巧操作中的目标和约束的持续协调，并在模拟和Franka Panda-Allegro手平台上验证了其在灵巧抓取和手部重定向中的高效规划和安全保障。

Comments 24 pages, 10 figures, 5 tables. Project page and demo video: https://tml.stanford.edu/safe-pbds

AI中文摘要

机器人灵巧操作需要持续协调定义在异构几何空间上的目标和约束：一个在$\mathbb{R}^7$配置流形上受控的机器人，可能需要在$\mathrm{SE}(3)$上跟踪末端执行器位姿，同时在$\mathbb{R}$上满足避障裕度。我们提出了Safe Pullback Bundle Dynamical Systems（SafePBDS），这是一个几何一致的框架，可根据任意任务流形上的目标和安全要求，计算最优且可证明安全的配置流形加速度。SafePBDS建立在已有工作的基础之上，后者通过组合预定义的任务流形动力系统来产生自主运动。其第一个创新是拉回控制屏障函数构造，它将任务流形上的安全条件转换为配置流形加速度上的线性约束。第二个创新是任务流形动作接口，它允许高层策略注入低维残差运动；零输入时恢复自主行为，而在任意输入下安全性都得以保持。这使高层策略能够高效地引导探索，同时将精确运动交给自主行为完成。我们在仿真和23自由度的Franka Panda-Allegro Hand平台上验证了SafePBDS。在灵巧抓取任务中，SafePBDS在20个家用物体、120次试验中实现了$92.5\%$的成功率。借助动作接口，该方法可通过一个一维动作在抓取时排除四根手指中的任意一根，在3个物体、36次试验中实现了$94.4\%$的三指抓取成功率。SafePBDS的高效规划和安全保证还实现了首个基于模型的、全驱动的掌心向下手内重定向，在物体重量和腕部运动变化的情况下，两个方向的偏航旋转均超过$360^\circ$。演示视频和细节：https://tml.stanford.edu/safe-pbds

英文摘要

Robotic dexterous manipulation requires continuously reconciling objectives and constraints defined on heterogeneous geometric spaces: a robot controlled on a $\mathbb{R}^7$ configuration manifold may need to track end effector poses on $\mathrm{SE}(3)$ while satisfying obstacle avoidance margins in $\mathbb{R}$. We present Safe Pullback Bundle Dynamical Systems (SafePBDS), a geometrically consistent framework that computes optimal, certifiably safe configuration manifold accelerations from objectives and safety requirements on arbitrary task manifolds. SafePBDS builds on prior work that combines predefined task manifold dynamical systems to produce autonomous motion. Its first innovation is a pullback control barrier function construction, which converts task manifold safety conditions into linear constraints on configuration manifold accelerations. The second innovation is a task manifold action interface that allows a high-level policy to inject low dimensional residual motions; zero input recovers the autonomous behavior, while safety is preserved under arbitrary inputs. This lets high-level policies efficiently steer exploration while leaving precise motion to the autonomous behavior. We validate SafePBDS in simulation and on a 23-DOF Franka Panda-Allegro Hand platform. On dexterous grasping, SafePBDS achieves a $92.5\%$ success rate across 20 household objects and 120 trials. Using the action interface, the method can exclude any one of the four fingers during grasping via a one-dimensional action, achieving $94.4\%$ 3-finger grasp success across 3 objects and 36 trials. The efficient planning and safety guarantee of SafePBDS also enables the first model-based, fully actuated palm-down in-hand reorientation, exceeding $360^\circ$ of yaw rotation in both directions under varying object weight and wrist motion. Demo video and details: https://tml.stanford.edu/safe-pbds

2605.21800 2026-05-22 cs.LG cs.RO 版本更新

stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

stable-worldmodel: 一个用于可重复世界建模研究和评估的平台

Lucas Maes, Quentin Le Lidec, Luiz Facury, Nassim Massaudi, Ayush Chaurasia, Francesco Capuano, Richard Gao, Taj Gillin, Dan Haramati, Damien Scieur, Yann LeCun, Randall Balestriero

发表机构 * Mila & Université de Montréal（Mila与蒙特利尔大学）； New York University（纽约大学）； Universidade Federal de Minas Gerais（米纳斯吉拉斯联邦大学）； Independent Researcher（独立研究者）； LanceDB ； University of Oxford（牛津大学）； Brown University（布朗大学）

AI总结本文提出stable-worldmodel平台，旨在解决世界建模研究中代码库、数据管道和评估协议碎片化的问题，通过提供高性能的数据层、现代世界模型基线和规划求解器的实现，以及扩展的环境和任务，实现标准化和可重复的世界建模研究和评估。

AI中文摘要

世界模型是构建能够推理、规划并泛化到训练数据之外的智能体的核心。然而，目前世界模型的研究仍然碎片化，不同的代码库、数据管道和评估协议阻碍了可重复性和公平比较。当前实践还受到三个关键瓶颈的限制：脆弱的一次性代码库、缓慢的视频数据加载以及缺乏标准化的泛化基准。我们提出了stable-worldmodel (swm)，一个用于标准化、可重复的世界建模研究和评估的开源平台。它提供了（1）一个基于Lance的高性能数据层，原生支持MP4、HDF5和LeRobot数据集并提供相应的转换工具；（2）干净、经过充分测试的现代世界模型基线和规划求解器实现；（3）一套广泛的环境和任务，并扩展了可控的视觉、几何和物理变化因素，用于在仿真中系统地评估动态理解、控制性能、表示质量和分布外泛化。通过将整个流程统一在单一可扩展的框架下，swm显著降低了研究开销，并加速了迈向可靠世界模型的可信进展。

英文摘要

World models are central to building agents that can reason, plan, and generalize beyond their training data. However, research on world models is currently fragmented, with disparate codebases, data pipelines, and evaluation protocols hindering reproducibility and fair comparison. Current practice is further limited by three key bottlenecks: fragile one-off codebases, slow video data loading, and the lack of standardized generalization benchmarks. We present stable-worldmodel (swm), an open-source platform for standardized and reproducible world modeling research and evaluation. It delivers (1) a high-performance Lance-based data layer with native support and conversion tools for MP4, HDF5, and LeRobot datasets, (2) clean, well-tested implementations of modern world model baselines and planning solvers, and (3) a broad suite of environments and tasks extended with controllable visual, geometric, and physical factors of variation for systematic in-silico evaluation of dynamics understanding, control performance, representation quality, and out-of-distribution generalization. By unifying the full pipeline under a single, scalable framework, swm dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models.

2605.21788 2026-05-22 cs.CV cs.RO (updated version)

SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman

Affiliations: University of Colorado Boulder

AI summary: This paper proposes SceneGraphGrounder, a framework that recasts 3D grounding as a structured graph matching problem. A visual marker prompting strategy infers object-object relations from 2D views, and these relations are encoded persistently in a 3D scene graph. On the ScanRefer benchmark the method performs comparably to existing zero-shot approaches, and a real-robot deployment validates its robust spatial reasoning in long-horizon physical environments.

Abstract

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.
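The constrained alignment of a query graph with a scene graph can be pictured with a small brute-force matcher. The sketch below is only an illustration under stated assumptions: the function name `match_query_to_scene`, the object ids, the categories, and the relation strings are all hypothetical, and the abstract does not say how SceneGraphGrounder actually performs the alignment.

```python
from itertools import permutations


def match_query_to_scene(query_nodes, query_edges, scene_nodes, scene_edges):
    """Return every assignment of query nodes to scene objects that satisfies
    all category and relation constraints (brute force, small graphs only)."""
    query_ids = list(query_nodes)
    matches = []
    for candidate in permutations(scene_nodes, len(query_ids)):
        assignment = dict(zip(query_ids, candidate))
        # Node constraint: the category of each query node must match the object.
        if any(query_nodes[q] != scene_nodes[s] for q, s in assignment.items()):
            continue
        # Edge constraint: every query relation must exist in the scene graph.
        if all((assignment[a], rel, assignment[b]) in scene_edges
               for a, rel, b in query_edges):
            matches.append(assignment)
    return matches


# Hypothetical scene graph: object id -> category, plus (subject, relation, object) triples.
scene_nodes = {"o1": "chair", "o2": "chair", "o3": "table", "o4": "lamp"}
scene_edges = {("o1", "next to", "o3"), ("o4", "on", "o3"), ("o2", "next to", "o4")}

# Hypothetical query: "the chair next to the table that has a lamp on it".
query_nodes = {"target": "chair", "anchor": "table", "item": "lamp"}
query_edges = [("target", "next to", "anchor"), ("item", "on", "anchor")]

matches = match_query_to_scene(query_nodes, query_edges, scene_nodes, scene_edges)
print(matches)
```

Because every candidate must satisfy explicit node and edge constraints, a rejected object can be traced to the specific relation it violates, which is the kind of interpretability a structured formulation makes possible. Exhaustive search is exponential in the query size, so a practical system would need pruning or a dedicated matching algorithm.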

2605.21747 2026-05-22 cs.CV cs.RO (updated version)

Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

Steven Chen, Shivesh Khaitan, Nemanja Djuric

Affiliations: Aurora Innovation, Inc.

AI summary: This paper proposes using vision language models to infer vehicle information and thereby improve the accuracy of 3D vehicle labeling for self-driving. Zero-shot inference of vehicle information, combined with vehicle make and model recognition methods, improves labeling efficiency and quality.

Comments: To appear in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), 2026. Accepted for oral presentation.

Abstract

We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle's make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.

2605.21723 2026-05-22 cs.RO cs.AI cs.MA cs.SY eess.SY (updated version)

Learning Altruistic Collaboration in Heterogeneous Multi-Team Systems

Riwa Karam, Ruoyu Lin, Brooks A. Butler, Magnus Egerstedt

Affiliations: Samueli School of Engineering, University of California, Irvine; University of North Carolina at Chapel Hill

AI summary: This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, treating robots as transferable resources. Using Hamilton's rule from ecology as an altruistic decision-making mechanism, it proposes a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting combinatorial allocation problem is shown to be NP-hard, so a graph neural network policy, trained centrally and executed in a decentralized manner, approximates the altruistic allocations. The model operates on the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. Simulations and experiments in a firefighting scenario show near-optimal performance while scaling to larger systems.

Abstract

This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, where robots are treated as transferable resources. Leveraging Hamilton's rule from ecology as an altruistic decision-making mechanism, we propose a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and is shown to be NP-hard. To address scalability, we develop a graph neural network policy under centralized training and decentralized execution that approximates the altruistic allocations based on Hamilton's rule. The model operates over the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. The proposed approach is validated in a firefighting scenario through simulations and experiments, demonstrating that the learned policy achieves near-optimal performance while scaling to larger systems.
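Hamilton's rule in its textbook form says an altruistic act is favored when r * B > C, where r is the relatedness between actor and recipient, B the benefit to the recipient, and C the cost to the actor. The sketch below applies that inequality to hypothetical robot-transfer candidates. How the paper defines relatedness, benefit, and cost for teams is not stated in the abstract, so the field names and numbers here are assumptions for illustration only.

```python
def hamilton_favors_transfer(relatedness, benefit, cost):
    """Textbook Hamilton's rule: the altruistic act is favored when r * B > C."""
    return relatedness * benefit > cost


def best_transfer(candidates):
    """Return the favored candidate with the largest margin r * B - C, or None."""
    favored = [c for c in candidates
               if hamilton_favors_transfer(c["r"], c["benefit"], c["cost"])]
    if not favored:
        return None
    return max(favored, key=lambda c: c["r"] * c["benefit"] - c["cost"])


# Hypothetical candidates: which robot a donor team could send to which team.
candidates = [
    {"robot": "uav_1", "to": "team_B", "r": 0.5, "benefit": 4.0, "cost": 1.0},
    {"robot": "ugv_2", "to": "team_C", "r": 0.25, "benefit": 2.0, "cost": 1.0},
    {"robot": "uav_3", "to": "team_B", "r": 0.5, "benefit": 3.0, "cost": 1.0},
]

choice = best_transfer(candidates)
print(choice["robot"] if choice else "keep all robots")
```

This greedy, single-transfer view ignores the coupling between simultaneous transfers. That coupling is what makes the full allocation problem combinatorial, which is the motivation the abstract gives for a learned policy.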

2605.21719 2026-05-22 cs.RO cs.SY eess.SY (updated version)

Mind the Gaps: Multi-Robot Feedback-Driven Ergodic Coverage in Unknown Environments

Thales Costa Silva, Nora Ayanian

Affiliations: Department of Computer Science at Brown University

AI summary: This paper proposes a feedback-driven ergodic coverage strategy for multi-robot teams that uses real-time feedback from an environment model to adjust robot sampling behavior, improving coverage efficiency and resource allocation in unknown environments.

Abstract

In this work, we address the problem of multi-robot adaptive coverage, where teams of robots perform dynamic sampling by continuously adjusting their positions to collect data in an environment. This task can be challenging, particularly when robots must be efficiently allocated to new sampling locations over time. Ergodic search methods optimize robot trajectories by ensuring that the robots' time-averaged spatial distribution aligns with the spatial distribution of environmental information. While these methods promote effective exploration provided a target distribution, they often fail to account for unknown prior distributions of the environment. To overcome this limitation, we propose an adaptive coverage strategy that utilizes real-time feedback from an environmental model to adjust robot sampling behavior in response to unknown conditions. Our approach enhances traditional ergodic trajectory optimization by constructing a target spatial information distribution based on parametric models of the environment, which are updated online. This strategy assumes that the environment is either static or changes slowly compared to the robot's motion. Our framework allows robots to dynamically prioritize regions of high interest, improving coverage efficiency, synthesizing effective control policies for individual agents, and optimizing resource use in settings with unknown prior distributions. We validate our approach through simulations, demonstrating its effectiveness in enhancing coverage and resource allocation.
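The idea that a trajectory's time-averaged spatial distribution should align with a target information distribution can be checked numerically on a coarse grid. The sketch below is a simplified histogram-based proxy and an assumption on our part: ergodic metrics in the literature are typically defined through Fourier coefficients, and the abstract does not specify the metric used in this paper. The function names, the unit-square workspace, and the 2 x 2 grid are hypothetical.

```python
def time_averaged_distribution(trajectory, grid_size):
    """Fraction of time steps spent in each cell of a grid over the unit square."""
    counts = [[0] * grid_size for _ in range(grid_size)]
    for x, y in trajectory:
        col = min(int(x * grid_size), grid_size - 1)
        row = min(int(y * grid_size), grid_size - 1)
        counts[row][col] += 1
    total = len(trajectory)
    return [[count / total for count in row] for row in counts]


def coverage_mismatch(visit_dist, target_dist):
    """Sum of absolute differences; 0 means visits match the target exactly."""
    return sum(abs(v - t)
               for visit_row, target_row in zip(visit_dist, target_dist)
               for v, t in zip(visit_row, target_row))


# Hypothetical target: information concentrated in two diagonal cells.
target = [[0.5, 0.0],
          [0.0, 0.5]]

balanced = [(0.1, 0.1), (0.2, 0.2), (0.8, 0.9), (0.9, 0.8)]
stuck = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (0.1, 0.3)]

print(coverage_mismatch(time_averaged_distribution(balanced, 2), target))
print(coverage_mismatch(time_averaged_distribution(stuck, 2), target))
```

In an adaptive setting such as the one described above, `target` would not be fixed: it would be rebuilt from the environment model each time the model is updated online, and the trajectory would then be steered to reduce the mismatch against the new target.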

2605.21714 2026-05-22 cs.CV cs.RO (updated version)

AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

Ziyi Kou, Ankit Kumar, Mia Huang, Taylor Niehues, Vatsal Mehta, Ergys Ristani, Li Guan

Affiliations: Meta Reality Labs

AI summary: This paper presents AVI-HT, an adaptive vision-IMU fusion method that tracks 3D hand poses by jointly modeling egocentric images with on-glove 6-DoF IMU signals. Its core components are synchronized, paired multi-modal training data and a cross-sensor deep attention mechanism. The main contribution is improved accuracy and availability in hand-object interaction scenarios.

Abstract

We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.
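Adaptively modulating the trust assigned to each sensor can be illustrated with softmax-weighted fusion, the basic operation behind attention. The sketch below is a toy example, not the AVI-HT architecture: the function names, the two-sensor setup, and the trust scores are hypothetical, and in a learned model the scores would be produced by a network rather than supplied by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(estimates, trust_scores):
    """Weighted average of per-sensor estimates, weights = softmax(trust_scores)."""
    weights = softmax(trust_scores)
    dim = len(estimates[0])
    fused = [sum(w * est[i] for w, est in zip(weights, estimates))
             for i in range(dim)]
    return fused, weights


# Hypothetical 2D fingertip estimates from a camera and from an IMU.
vision_estimate = [0.0, 0.0]
imu_estimate = [2.0, 4.0]

# Equal trust: the fused estimate is the plain average.
fused_equal, weights_equal = fuse([vision_estimate, imu_estimate], [0.0, 0.0])

# Heavy occlusion: a low vision score shifts nearly all weight to the IMU.
fused_occluded, weights_occluded = fuse([vision_estimate, imu_estimate], [-10.0, 0.0])

print(fused_equal, weights_equal)
print(fused_occluded, weights_occluded)
```

The useful property is that the weights always sum to one, so lowering the score of an occluded camera view automatically raises the relative contribution of the IMU streams.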

2605.21710 2026-05-22 cs.RO (updated version)

PGDG: Physically Grounded Data Generation for Robust Bimanual Policy Learning from a Single Demonstration

Cunxi Dai, Haoran Chang, Aditya Nisal, Rahul Kumar, Guofei Chen, Tao Chen, Yuzhe Qin, Guanya Shi

Affiliations: Robotics Institute, Carnegie Mellon University; Dexmate

AI summary: This paper proposes PGDG, a physically grounded data generation framework that uses zero-shot curation to expand a single demonstration into a compact dataset of physically plausible, successful, and diverse recovery behaviors, improving behavior cloning for contact-rich bimanual manipulation.

Abstract

Behavior cloning for contact-rich bimanual manipulation remains challenging because diverse demonstrations are expensive to collect, and even small disturbances can push the system into off-manifold states where no recovery supervision is available. We propose PGDG, a data generation framework with zero-shot curation that expands a single demonstration into a compact dataset of physically plausible, successful, and diverse recovery behaviors without additional human labeling. PGDG iterates between a physics-grounded sampler and a dataset curator, where the curator selects informative, non-redundant, and recoverable behaviors to update the sampling distribution toward under-covered recovery modes, and the sampler draws physically plausible rollout candidates from this updated distribution and retains successful trajectories. To further improve data quality, PGDG applies short-horizon sampling-based control to relabel selected risky states with corrective actions. Across four bimanual manipulation tasks, PGDG consistently outperforms spatial-only augmentation in both simulation and zero-shot real-world transfer. On RotateBox-Pitch, success improves from 38% to 93% in simulation and from 35% to 82% in the real world. PGDG also enables effective fine-tuning of foundation models such as GR00T, increasing success from 46% to 77%. Additional results are available on our website: https://cunxid.github.io/PGDG/.

2605.21704 2026-05-22 cs.RO cs.SY eess.SY (updated version)

Motion Design for Grasp-Based Dynamic Locomotion in Microgravity

Chaerim Moon, Joohyung Kim, Justin K. Yim

Affiliations: Department of Mechanical Science and Engineering at the University of Illinois at Urbana-Champaign

AI summary: This paper addresses grasp-based dynamic locomotion of multi-limbed robotic systems in microgravity. It proposes a parameterizable locomotion planning framework that varies gait pattern, stride length, locomotion speed, and nominal posture, and evaluates the resulting performance in terms of stability and actuation demand. The results indicate that enlarging the feasible contact wrench space and attenuating impulsive whole-body dynamics improve locomotion performance.

Abstract

Locomotion in microgravity often relies on sparsely and irregularly arranged anchors, motivating grasp-based mobility with multiple limbs. In this setting, dynamic locomotion is feasible only through deliberate regulation of both anchored interactions and whole-body coordination under coupled dynamic and kinematic constraints. This paper presents design insights for grasp-based dynamic locomotion with multi-limbed robotic systems in microgravity, targeting scenarios that require 6D limb manipulation to establish contacts with candidate anchors. The investigated design parameters include gait pattern, stride length, locomotion speed, and nominal posture. A parameterizable locomotion planning framework is proposed to support variations of these parameters and to evaluate the resulting locomotion performance in terms of stability and actuation demand. Two representative quadruped morphologies are adopted for evaluation in physics-based simulation. The results demonstrate that enlarging the feasible contact wrench space and attenuating impulsive whole-body dynamics improve locomotion performance. These findings inform strategies for contact configuration selection and whole-body coordination in microgravity locomotion with multi-limbed systems.

2605.21688 2026-05-22 cs.RO cs.SY eess.SY (updated version)

Closed-Loop Sim-to-Real Reinforcement Learning for Deformable Microfiber Shape Control

Alessandro Amici, Houari Bettahar, Veeti Jaakkola, Quan Zhou

Affiliations: Department of Electrical Engineering and Automation, Aalto University

AI summary: This paper presents a closed-loop sim-to-real reinforcement learning method for controlling the shape of deformable microfibers on a surface. Geometric shape regulation is trained in a simplified frictionless simulator, and real-time visual feedback during deployment iteratively corrects the effects of unmodeled surface interactions.

Comments: 7 pages, 7 figures

Abstract

Autonomous contact-based micromanipulation is challenging because surface and interfacial interactions at the microscale are difficult to model accurately, limiting the use of conventional model-based control and sim-to-real learning. We present a closed-loop sim-to-real reinforcement learning (RL) approach for microfiber shape control on a surface. The central idea is to train geometric shape regulation in a simplified frictionless simulator and rely on real-time visual feedback during deployment to iteratively correct the observed effects of unmodeled surface interactions. An RL policy trained entirely in simulation is transferred directly to a physical dual-gripper micromanipulation system operating at 40 Hz, without retraining or domain adaptation. Using silk microfibers as a testbed, the policy achieves a mean point-wise shape error of 270 $\pm$ 80 $\mu$m across twenty-four diverse initial configurations. Across nine specimens covering all combinations of three fiber diameters (50, 80, and 120 $\mu$m) and three manipulated lengths (10 mm, 15 mm, and 20 mm), the same policy achieves sub-millimeter final shape error without any retraining or retuning. These results show that a policy learned in a simplified simulator can achieve repeatable real-world microfiber shape regulation under surface contact, provided that the task-relevant effects of the sim-to-real mismatch remain observable and correctable within the closed feedback loop.
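A mean point-wise shape error can be read as the average distance between corresponding points on the observed fiber and on the target shape. The sketch below computes that quantity under assumptions the abstract does not confirm: that both shapes are sampled at the same number of points, that points correspond by index, and that plain Euclidean distance is used. The function name and the coordinates (in micrometers) are hypothetical.

```python
import math
from statistics import mean


def pointwise_shape_error(fiber_points, target_points):
    """Mean Euclidean distance between index-matched points (input units)."""
    if len(fiber_points) != len(target_points):
        raise ValueError("both shapes must be sampled at the same number of points")
    return mean(math.dist(p, q) for p, q in zip(fiber_points, target_points))


# Hypothetical 2D samples along a fiber and its target shape, in micrometers.
fiber = [(0.0, 0.0), (1000.0, 300.0), (2000.0, 0.0)]
target = [(0.0, 0.0), (1000.0, 0.0), (2000.0, 300.0)]

error_um = pointwise_shape_error(fiber, target)
print(error_um)
```

In a closed-loop deployment such as the one described above, an error of this kind would be recomputed from each camera frame, so that the effects of unmodeled surface interactions show up in the measurement and can be corrected by the next action.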

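2605.21686 2026-05-22 cs.RO (updated version)

Distributed Multi-Coverage for Robot Swarms

Mariem Guitouni, Aaron T. Becker

Affiliations: University of Houston

AI summary: This paper proposes a distributed multi-coverage algorithm for robot swarms that must maintain reliable coverage of critical assets using only local sensing and local communication, without global coordination, while coping with constraints such as robot failures.

Comments: Accepted at ANTS 2026 (International Conference on Swarm Intelligence), published by Springer Nature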

2605.21680 2026-05-22 cs.RO (updated version)

DSSP: Diffusion State Space Policy with Full-History Encoding

Zhiyuan Guan, Jianshu Hu, Han Fang, Yunpeng Jiang, Yize Huang, Shujia Li, Xiao Li, Yutong Ban

Affiliations: Shanghai Jiao Tong University

AI summary: This paper proposes DSSP, a diffusion-based state space policy whose full-history encoding improves the handling of history dependence in long-horizon robot manipulation tasks, while using a smaller model.

Abstract

Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.
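The appeal of a state space model as a history encoder is that a fixed-size state is updated once per observation, so the whole stream is summarized without storing it. The sketch below shows this with the simplest possible case, a diagonal linear recurrence h_t = a * h_{t-1} + b * x_t. It is an assumption-laden illustration, not DSSP's encoder: the function name, the two-dimensional state, and the coefficients are hypothetical, and practical SSM layers add learned parameters and nonlinearities.

```python
def ssm_encode(observations, decay, input_gain):
    """Run h_t = decay * h_{t-1} + input_gain * x_t elementwise over the
    observation stream and return the final fixed-size state."""
    state = [0.0] * len(decay)
    for x in observations:
        state = [a * h + b * xi
                 for a, h, b, xi in zip(decay, state, input_gain, x)]
    return state


# Hypothetical two-channel observation stream of length three.
observations = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]

# Channel 0 keeps a decaying memory of the past; channel 1 keeps only the latest input.
context = ssm_encode(observations, decay=[0.5, 0.0], input_gain=[1.0, 1.0])
print(context)
```

The state size stays constant as the history grows, which is why memory cost does not scale with history length in this formulation. What the state retains is controlled by the decay terms, and a training objective can shape that choice.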

2604.24681 2026-05-22 cs.RO (updated version)

Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu, Yu Sun, Wenbo Ding

Affiliations: Tsinghua University; ByteDance

AI summary: This paper proposes MoT-HRA, a framework that learns human-intention priors from large-scale human demonstrations for robotic manipulation. By constructing the HA-2.2M dataset and using three coupled experts, it improves motion plausibility and robustness.

Comments: 13 pages, 5 figures

Abstract

Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.

2603.22508 2026-05-22 cs.RO cs.SY eess.SY (updated version)

Parallel OctoMapping: A Scalable Framework for Enhanced Path Planning in Autonomous Navigation

Yihui Mao, Tian Tan, Xuehui Shen, Warren E. Dixon, Rushikesh Kamalapurkar

Affiliations: Department of Mechanical and Aerospace Engineering, University of Florida; Department of Electrical and Systems Engineering, University of Pennsylvania

AI summary: This paper proposes Parallel OctoMapping (POMP), an efficient OctoMap-based mapping technique that refines the free-space representation at a fixed occupancy-grid resolution, improving path planning efficiency and success rates, especially in complex environments.

Abstract

Mapping is essential in robotics and autonomous systems because it provides the spatial foundation for path planning. Efficient mapping enables planning algorithms to generate reliable paths while ensuring safety and adapting in real time to complex environments. Fixed-resolution mapping methods often produce overly conservative obstacle representations that lead to suboptimal paths or planning failures in cluttered scenes. To address this issue, we introduce Parallel OctoMapping (POMP), an efficient OctoMap-based mapping technique that maximizes available free space and supports multi-threaded computation. To the best of our knowledge, POMP is the first method that, at a fixed occupancy-grid resolution, refines the representation of free space while preserving map fidelity and compatibility with existing search-based planners. It can therefore be integrated into existing planning pipelines, yielding higher pathfinding success rates and shorter path lengths, especially in cluttered environments, while substantially improving computational efficiency.

2603.11642 2026-05-22 cs.RO (updated version)

LFX: Towards Unified Light Field Dense Semantic Segmentation and Salient Object Detection

Fei Teng, Lingxin Huang, Buyin Deng, Kai Luo, Boyuan Zheng, Zheng Fang, Hong Zheng, Kunyu Peng, Jiaming Zhang, Yaonan Wang, Kailun Yang

Affiliations: School of Artificial Intelligence and Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, China; China Mobile Group Hunan Company Ltd., China; Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany

AI summary: This paper proposes LFX, a framework that uses a unified, representation-invariant feature modulation space to adapt to multiple light field representations and different perception tasks. It achieves state-of-the-art results on three light field benchmarks and clearly outperforms representation-specific methods.

Comments: The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX

Abstract

Light field cameras capture multi-view observations within a single exposure. However, existing studies are typically tailored to specific LF representations, leaving the field without a unified learning framework. To bridge this gap, we present LFX, the first unified framework for LF perception. LFX establishes a representation-invariant feature modulation space, enabling it to adapt to heterogeneous LF representations and diverse perception tasks. Specifically, we propose Field-of-Parallax Angular Subspace Modeling (FoP-ASM), which assigns an independent angular marker to each auxiliary view, enabling view-wise independent modeling. Meanwhile, shared manifold subspace constraints and regularization losses enforce globally consistent semantic modulation across views. Extensive evaluations across three LF benchmarks show that LFX achieves state-of-the-art results across distinct LF representations, outperforming representation-specific methods by up to 12% and 20% with 0.029/0.027 MAE for salient object detection, and achieving 84.37 mIoU for semantic segmentation. The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX.
