2606.02510 2026-06-02 cs.CV cs.RO 版本更新

Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis

并非所有点都同等重要：不确定性感知的4D LiDAR场景合成

Xiang Xu, Alan Liang, Youquan Liu, Xian Sun, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu

发表机构 * NUAA（南京航空航天大学）； NUS（新加坡国立大学）； FDU（复旦大学）； Duke（杜克大学）； NTU（南洋理工大学）； NJUPT（南京邮电大学）； SKL-TI（特种信息处理实验室）

AI总结提出U4D框架，利用空间不确定性引导LiDAR场景生成，通过熵图识别高不确定性区域并优先合成，再补全其余区域，实现高保真4D场景。

Comments CVPR 2026 E2E3D Workshop; GitHub at https://github.com/worldbench/U4D

详情

AI中文摘要

从LiDAR获取的序列构建忠实的4D世界对于具身AI至关重要，但当前的生成框架对所有空间区域采用统一的建模能力。这忽略了单个扫描中感知难度的巨大差异：远距离表面、遮挡边界和小尺度物体比良好观测的结构具有更高的不确定性。我们提出了U4D，一种新的框架，明确利用空间不确定性以“从难到易”的顺序引导LiDAR场景生成。U4D通过预训练分割器的香农熵推导逐点不确定性图，然后应用无条件扩散阶段合成具有精确几何的高熵区域，接着是条件补全阶段，利用这些结构作为先验填充剩余区域。MoST（时空混合）块通过动态平衡空间细节和时间连续性进一步维护跨帧一致性。在nuScenes和SemanticKITTI上的大量实验证明了最先进的场景保真度、时间一致性和下游性能。

英文摘要

Constructing faithful 4D worlds from LiDAR-acquired sequences is crucial for embodied AI, yet current generative frameworks apply uniform modeling capacity across all spatial regions. This ignores that perceptual difficulty varies dramatically within a single scan: distant surfaces, occluded boundaries, and small-scale objects carry far higher uncertainty than well-observed structures. We present U4D, a new framework that explicitly leverages spatial uncertainty to guide LiDAR scene generation in a "hard-to-easy" schedule. U4D derives per-point uncertainty maps via Shannon Entropy from a pretrained segmentor, then applies an unconditional diffusion stage to synthesize high-entropy areas with precise geometry, followed by a conditional completion stage that fills in the remaining regions using these structures as priors. A MoST (Mixture of Spatio-Temporal) block further maintains cross-frame coherence by dynamically balancing spatial detail and temporal continuity. Extensive experiments on nuScenes and SemanticKITTI demonstrate state-of-the-art scene fidelity, temporal consistency, and downstream performance.
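The uncertainty step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the segmentor outputs per-point class probabilities, and the function names and the quantile-based split into "hard" and "easy" points are hypothetical choices.

```python
import numpy as np

def shannon_entropy(probs, eps=1e-12):
    """Per-point Shannon entropy in nats; probs has shape (N, C)."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=1)

def split_hard_easy(probs, quantile=0.8):
    """Mark points at or above the given entropy quantile as 'hard' (assumed rule)."""
    h = shannon_entropy(probs)
    return h >= np.quantile(h, quantile), h

probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction: low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction: high entropy
    [0.70, 0.20, 0.10],
])
hard_mask, entropy = split_hard_easy(probs, quantile=0.5)
```

In a hard-to-easy schedule, the points selected by `hard_mask` would be generated first and then used as conditioning for the rest.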

2606.02486 2026-06-02 cs.RO 版本更新

Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation

拦截未来：用于动态VLA操作的潜在空间预测世界模型

Shahram Najam Syed, Arthur Jakobsson, Haoran Hao, Jeffrey Ichnowski

发表机构 * Robotics Institute, Carnegie Mellon University（卡内基梅隆大学机器人研究所）

AI总结提出AHEAD框架，通过潜在空间世界模型预测未来视觉特征，使冻结的VLA模型在动态场景中实现高成功率操作。

Comments 28 pages, 7 figures, 16 tables, Su

详情

AI中文摘要

视觉-语言-动作（VLA）模型在静态操作中具有泛化能力，但当物体在任务执行过程中移动时则失效。它们将当前观测映射为动作，并假设观测与执行之间场景静止，因此在任何非平凡的物体速度下，产生的延迟都会超过可用的抓取时间。我们通过AHEAD（自适应动态预期视界外推）弥补了这一差距，这是一种先预测后执行的包装器，用运动感知的潜在世界模型增强冻结的VLA。一个在操作视频上训练的小型世界模型，基于光流计算的每个令牌的速度和加速度，预测VLA特征空间中的未来块令牌。语言和运动显著性掩码将预测集中在任务相关的块上，模型向前滚动自适应视界，当预测不确定性超过阈值时停止。然后冻结的动作解码器接收预测的未来令牌代替当前令牌。AHEAD为冻结的7B OpenVLA增加了4.9M参数，在20个动态模拟场景中达到79%至97%的成功率，而最强基线仅为31%至58%。在物理UFactory xArm 7上，AHEAD在三个传送带和滚球任务中成功率为29/30至30/30，在桨叶拦截任务中为23/30，在抛射物捕捉任务中为19/30，而所有基线均为0/30。

英文摘要

Vision-Language-Action (VLA) models generalize across static manipulation but fail when objects move during task execution. They map the current observation to an action and assume the scene is stationary between observation and execution, so at any non-trivial object speed the resulting latency exceeds the time available to grasp. We close this gap with AHEAD (Anticipatory Horizon Extrapolation with Adaptive Dynamics), a predict-then-act wrapper that augments a frozen VLA with a motion-aware latent world model. A small world model trained on manipulation video forecasts future patch tokens in the VLA's feature space, conditioned on per-token velocity and acceleration from optical flow. A language-and-motion saliency mask concentrates prediction on task-relevant patches, and the model rolls forward for an adaptive horizon, halting when prediction uncertainty crosses a threshold. The frozen action decoder then receives the predicted future tokens in place of the current ones. AHEAD adds 4.9M parameters to a frozen 7B OpenVLA and reaches 79 to 97% success across 20 dynamic simulation scenarios where the strongest baseline reaches 31 to 58%. On a physical UFactory xArm 7, AHEAD succeeds on 29/30 to 30/30 on three conveyor and rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching where every baseline scores 0/30.
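The adaptive-horizon rollout can be sketched as a loop that stops before accepting a prediction whose uncertainty exceeds a threshold. The world model here is replaced by a hypothetical constant-velocity stand-in (`toy_step`), and the threshold and uncertainty schedule are illustrative assumptions.

```python
import numpy as np

def rollout_adaptive(tokens, velocity, step_fn, max_horizon=10, threshold=0.5):
    """Roll a latent predictor forward until its uncertainty exceeds threshold.

    Returns the last accepted prediction and the number of accepted steps.
    """
    steps = 0
    for _ in range(max_horizon):
        candidate, uncertainty = step_fn(tokens, velocity, steps + 1)
        if uncertainty > threshold:
            break  # stop before accepting an unreliable prediction
        tokens, steps = candidate, steps + 1
    return tokens, steps

def toy_step(tokens, velocity, k):
    # Hypothetical stand-in for the learned world model: constant-velocity
    # extrapolation with uncertainty growing linearly in the step index.
    return tokens + velocity, 0.125 * k

tokens0 = np.zeros((4, 2))
vel = np.full((4, 2), 0.5)
future, horizon = rollout_adaptive(tokens0, vel, toy_step, threshold=0.5)
```

The tokens in `future` are what a frozen action decoder would receive in place of the current observation tokens.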

2606.02432 2026-06-02 cs.RO 版本更新

NDPP-Grasp: Non-Differentiable Physical Plausibility Constraint-Guided Task-Oriented Dexterous Grasp Generation

NDPP-Grasp：非可微物理合理性约束引导的任务导向灵巧抓取生成

Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu

发表机构 * Lancaster University（兰卡斯特大学）

AI总结提出一种框架，通过将非可微物理合理性约束直接注入任务对齐的抓取扩散模型的去噪过程，实现物理合理性引导的灵巧抓取生成，同时保持任务对齐。

详情

AI中文摘要

任务导向的灵巧抓取生成旨在产生既物理合理又适用于特定操作任务的灵巧抓取姿态。现有的基于扩散的方法通常以解耦的方式处理这两个要求：它们首先训练一个用于任务对齐的抓取扩散模型，然后依赖生成后的细化来提高物理合理性。然而，这种事后修正策略仅在抓取已经生成后才应用物理合理性指导，使得生成轨迹本身不受物理约束引导，可能导致次优的抓取。为了解决这个问题，我们提出了一种新颖的框架，该框架以实用且有效的方式将物理合理性指导直接注入任务对齐的抓取扩散模型的去噪过程中，即使物理合理性约束是非可微的。这使得物理合理性能够在整个去噪过程中塑造抓取生成，同时保持任务对齐。大量实验证明了我们框架的有效性。

英文摘要

Task-oriented dexterous grasp generation aims to produce dexterous grasp poses that are both physically plausible and functionally suitable for specified manipulation tasks. Existing diffusion-based methods often address these two requirements in a decoupled manner: they first train a grasp diffusion model for task alignment and then rely on post-generation refinement to improve physical plausibility. However, this after-the-fact correction strategy applies physical plausibility guidance only once the grasp has already been generated, leaving the generation trajectory itself unguided by physical constraints and potentially leading to suboptimal grasps. To address this problem, we propose a novel framework that directly injects physical plausibility guidance into the denoising process of a task-aligned grasp diffusion model in a practical and effective manner, even when physical plausibility constraints are non-differentiable. This allows physical plausibility to shape grasp generation throughout denoising while preserving task alignment. Extensive experiments demonstrate the efficacy of our framework.

2606.02370 2026-06-02 cs.RO 版本更新

动力学是学出来的，不是告诉的：零样本策略适应的潜在动力学几何半监督发现

Zhiming Xu, Weitao Zhou, Xianghui Pan, Nanshan Deng, Chengju Liu, Qijun Chen, Chenpeng Yao

发表机构 * Zhiming Xu（徐志明）； Weitao Zhou（周伟涛）； Xianghui Pan（潘向辉）； Nanshan Deng（邓南山）； Chengju Liu（刘成军）； Qijun Chen（陈齐军）； Chenpeng Yao（姚晨鹏）

AI总结针对机器人强化学习中动力学变化导致策略失效的问题，提出基于对比学习的半监督方法，通过构建平滑、任务相关的潜在拓扑结构，实现零样本策略适应，在MuJoCo基准上优于参数中心方法。

Comments Proceedings of the 43rd International Conference on Machine Learning

详情

AI中文摘要

现实世界中的动力学变化对机器人强化学习构成了严峻挑战，因为与标称环境紧密耦合的策略在物理条件变化时往往会灾难性地失败。大多数现有方法依赖于将明确识别的物理参数编码到潜在上下文中，这是一种以参数为中心的范式，依赖于预先指定的变化轴，在未建模或复合动力学变化下变得脆弱。我们从以结果为中心的角度重新审视动力学适应：不是告诉策略动力学是什么，而是让它们学习动力学如何影响交互结果。理论上，这基于目标域遗憾与轨迹动力学编码器的Lipschitz常数之间的单调关系。实际上，该常数可以通过对比学习来上界，从而在没有特权动力学信息的情况下产生平滑、任务相关的潜在拓扑。在MuJoCo基准上，我们的方法在严重的动力学变化下（包括未建模和时变参数）始终优于以参数为中心的基线，同时提高了分布内稳定性和潜在可解释性。总体而言，这些结果验证了控制潜在几何是实现鲁棒适应的原则性机制。

英文摘要

Real-world dynamics shifts pose a critical challenge for reinforcement learning in robotics, as policies tightly coupled to nominal environments often fail catastrophically when physical conditions change. Most existing methods rely on encoding explicitly identified physical parameters into a latent context, a parameter-centric paradigm that depends on pre-specified axes of variation and becomes brittle under unmodeled or compound dynamics changes. We revisit dynamics adaptation from an outcome-centric perspective: rather than telling policies what the dynamics are, we enable them to learn how dynamics affect interaction outcomes. Theoretically, this is grounded in a monotonic relationship between target-domain regret and the Lipschitz constant of a trajectory dynamics encoder. Practically, this constant can be upper-bounded through contrastive learning, yielding a smooth, task-relevant latent topology without privileged dynamics information. On MuJoCo benchmarks, our method consistently outperforms parameter-centric baselines under severe dynamics shifts, including unmodeled and time-varying parameters, while also improving in-distribution stability and latent interpretability. Overall, these results validate that controlling latent geometry is a principled mechanism for robust adaptation.
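The abstract does not say which contrastive objective is used. As one common possibility, the sketch below implements an InfoNCE-style loss over trajectory embeddings; treating embeddings of trajectories from the same dynamics as positives is an assumption made for illustration.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: row i of anchors should match row i of positives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))  # hypothetical trajectory embeddings
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
mismatched = info_nce(z, z[::-1])  # positives paired with the wrong anchors
```

The loss is low when matching pairs are close in the latent space and high when they are not, which is the pressure that shapes the latent geometry.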

2606.02277 2026-06-02 cs.RO 版本更新

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

RoboSemanticBench: 诊断 VLA 模型在动作预测中的语义基础

Bin Yu, Yao Zhang, Haishan Liu, Shijie Lian, Yuliang Wei, Xiaopeng Lin, Zhaolong Shen, Changti Wu, Ruina Hu, Bailing Wang, Cong Huang, Kai Chen

发表机构 * HIT（哈尔滨工业大学）； ZGCA（中关村学院）； ZGCI（中关村人工智能研究院）； WHU（武汉大学）； HUST（华中科技大学）； HKUST(GZ)（香港科技大学（广州））； BUAA（北京航空航天大学）； ECNU（华东师范大学）； DeepCybo

AI总结提出 RoboSemanticBench 基准，通过多选问答任务评估 VLA 模型是否利用指令语义选择正确物体，发现模型在语义正确选择上接近随机，揭示语义理解与动作预测之间的差距。

Comments GitHub: https://github.com/ZGC-EmbodyAI/RoboSemanticBench

详情

AI中文摘要

视觉-语言-动作（VLA）模型建立在预训练语言或视觉-语言骨干网络的语义理解应指导机器人动作预测的前提上。然而，机器人微调被优化为对任务特定动作分布的模仿，许多评估可以通过视觉或指令-动作捷径解决。我们引入 RoboSemanticBench（RSB），一个用于诊断动作预测中语义基础的具身基准：即后训练的 VLA 模型是否能够使用复杂的指令语义来选择并操作正确的物理目标。在每个回合中，机器人接收一个多项选择的数学或常识问题，观察候选答案块，并必须抓取对应正确答案的块。RSB 涵盖受控算术、小学数学理解以及常识或事实理解，分为四选和十选套件。在代表性的 VLA 模型上，我们发现许多策略学会了抓取候选块，但在控制抓取成功率后，选择语义正确块的比例接近随机或低于随机，揭示了骨干网络级别的语义能力与动作预测之间持续存在的差距。

英文摘要

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.
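"Controlling for grasp success" can be read as computing selection accuracy only over episodes in which a block was actually grasped, then comparing it with the chance rate. The sketch below assumes that reading; the episode fields are hypothetical.

```python
def semantic_accuracy(episodes):
    """Fraction of correct selections among episodes with a successful grasp."""
    grasped = [e for e in episodes if e["grasped"]]
    if not grasped:
        return None  # undefined when nothing was grasped
    return sum(e["correct"] for e in grasped) / len(grasped)

def chance_rate(num_choices):
    """Expected accuracy of a policy that grasps a candidate block at random."""
    return 1.0 / num_choices

episodes = [
    {"grasped": True, "correct": True},
    {"grasped": True, "correct": False},
    {"grasped": True, "correct": False},
    {"grasped": True, "correct": False},
    {"grasped": False, "correct": False},  # failed grasp, excluded
]
accuracy = semantic_accuracy(episodes)
```

Here the policy grasps reliably yet selects the right block no more often than a random pick in a four-choice suite.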

2606.02251 2026-06-02 cs.RO cs.AI eess.SP 版本更新

FW-NKF: Frequency-Weighted Neural Kalman Filters

FW-NKF: 频率加权神经卡尔曼滤波器

Adnan Harun Dogan, Berken Utku Demirel, Christian Holz

发表机构 * Department of Computer Science, ETH Zürich（苏黎世联邦理工学院计算机科学系）

AI总结提出频率加权神经卡尔曼滤波器（FW-NKF），通过将因果谱整形算子嵌入卡尔曼测量残差并联合学习观测和状态转移网络，抑制频带受限噪声，在混沌系统和惯性姿态估计等任务中定位误差降低达10%。

Comments Published at ICRA 2026

详情

AI中文摘要

鲁棒状态估计是机器人自主性的核心，然而经典卡尔曼滤波器难以应对频率相关干扰和模型失配，如传感器振动、电磁干扰和周期性噪声。尽管深度卡尔曼滤波器（DKF）变体通过学习潜在状态转移扩展了扩展卡尔曼滤波（EKF）框架，但它们缺乏明确的机制来抑制在实际场景中通常污染传感器测量的带限噪声分量。我们引入了频率加权神经卡尔曼滤波器（FW-NKF），这是一种统一的混合方法，将因果谱整形算子嵌入卡尔曼测量残差，并联合学习观测网络和状态转移网络。通过同时调整滤波器频谱和潜在状态表示，FW-NKF在抑制噪声主导频带的同时捕获复杂的残差结构。我们在四个异构基准上进行了广泛实验，包括混沌系统（如多维洛伦兹系统）和全身惯性姿态估计，发现定位误差降低高达10%，且方向精度显著提升。我们的消融研究证实，频率加权和深度潜在状态建模对整体性能有贡献。

英文摘要

Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and model mismatch such as sensor vibrations, electromagnetic interference, and periodic noise. Although Deep Kalman Filter (DKF) variants extend the Extended Kalman Filtering (EKF) framework by learning latent transitions, they lack explicit mechanisms to suppress band-limited noise components that typically corrupt sensor measurements in real-world scenarios. We introduce the Frequency-Weighted Neural Kalman Filter (FW-NKF), a unified hybrid approach that embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation and transition networks. By adapting both the filter spectrum and the latent state representation, FW-NKF attenuates the noise-dominated frequency bands while capturing complex residual structures. We conduct extensive experiments on four heterogeneous benchmarks, including chaotic systems such as multi-dimensional Lorenz systems and full-body inertial pose estimation, and find a reduction in localization error of up to 10% as well as marked improvements in orientation accuracy. Our ablation studies confirm that frequency weighting and deep latent-state modeling contribute to overall performance.
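The idea of shaping the measurement residual with a causal filter can be illustrated on a scalar random-walk Kalman filter. This is a toy sketch under stated assumptions: the kernel is a fixed two-tap FIR filter rather than a learned operator, there are no neural networks, and the covariance update is left in its standard form even though it is no longer exact once the residual is filtered.

```python
import numpy as np

def kalman_1d(z, q=1e-4, r=0.1, kernel=(1.0,)):
    """Scalar random-walk Kalman filter with a causal FIR kernel on the innovation."""
    kernel = np.asarray(kernel, dtype=float)
    x, p = 0.0, 1.0
    history = np.zeros(len(kernel))  # most recent innovation first
    estimates = []
    for zk in z:
        p = p + q                         # predict: x_k = x_{k-1} + w
        history = np.roll(history, 1)
        history[0] = zk - x               # raw innovation
        shaped = float(kernel @ history)  # causal: current and past residuals only
        gain = p / (p + r)
        x = x + gain * shaped
        p = (1.0 - gain) * p
        estimates.append(x)
    return np.array(estimates)

k = np.arange(200)
z = 1.0 + 0.5 * (-1.0) ** k                    # true value 1 plus narrow-band noise
plain = kalman_1d(z)                           # kernel (1.0,) is the standard filter
shaped = kalman_1d(z, kernel=(0.5, 0.5))       # two-tap average nulls that band
```

With the averaging kernel, the alternating component cancels in the residual, so the estimate settles near 1 with far less ripple than the unshaped filter.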

2606.02107 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

网络分布式多智能体强化学习用于四旋翼无人机一致性控制

Youssef Mahran, Zeyad Gamal, Aamir Ahmad, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department, German University in Cairo (GUC), Egypt（埃及开罗德国大学（GUC）机电工程系）； Institute of Flight Mechanics and Control (IFR), Head of Flight Robotics, University of Stuttgart, Germany（德国斯图加特大学飞行力学与控制研究所）； Faculty of EMS, Head of Mechatronics Engineering Department, German University in Cairo (GUC), Egypt（埃及开罗德国大学（GUC）EMS学院）

AI总结提出网络分布式多智能体强化学习框架，利用通信图实现分布式策略，通过MASAC训练高层规划器，实现零样本扩展到250个智能体。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情

DOI: 10.1109/MELECON64486.2026.11418865
Journal ref: 2026 IEEE 23rd Mediterranean Electrotechnical Conference (MELECON)

AI中文摘要

本文提出了一种用于四旋翼无人机一致性控制的网络分布式多智能体强化学习（ND-MARL）框架。与依赖集中式规划或完全分散式执行的传统MARL公式相比，ND-MARL将群体通信图纳入决策过程。在2-邻居通信拓扑下，每个智能体仅观察两个邻居的信息，并通过分布式策略输出动作。使用多智能体软演员-评论家（MASAC）训练高层分布式一致性规划器，并将其嵌入层次化堆栈中，以生成由低层四旋翼控制器跟踪的参考目标位置。结果表明，与集中式MARL控制器相比，实现了平滑的一致性轨迹和规划器-跟踪器集成。最值得注意的是，学习到的控制器表现出零样本可扩展性，即在三智能体系统上训练的策略，在相同的2-邻居通信拓扑下，无需重新训练或微调即可部署到多达250个智能体的群体中，实现了随着团队规模增大而稳态散布增加的一致收敛，这是由于稀疏信息传播所致。这些发现突显了ND-MARL作为分布式、通信感知的四旋翼一致性控制的稳定框架。

英文摘要

This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared to conventional MARL formulations that rely on centralized planning or fully decentralized execution, ND-MARL incorporates the swarm communication graph into the decision process. Under a 2-Neighbor communication topology, each agent observes information of only two neighbors and outputs an action through a distributed policy. A high-level distributed consensus planner is trained using Multi-Agent Soft Actor-Critic (MASAC) and embedded in a hierarchical stack to generate reference target positions tracked by a low-level quadcopter controller. Results demonstrate smooth consensus trajectories and planner-tracker integration when compared to a centralized MARL controller. Most notably, the learned controller exhibits zero-shot scalability, as policies trained on a three-agent system are deployed to swarms of up to 250 agents under the same 2-Neighbor communication topology without retraining or fine-tuning, achieving consistent convergence with increasing steady-state spread at large team sizes due to sparse information propagation. These findings highlight ND-MARL as a stable framework for distributed, communication-aware quadcopter consensus control.
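To make the topology concrete, the sketch below runs a plain averaging rule on a ring graph in which every agent sees exactly two neighbors. Both the ring layout and the linear update are assumptions standing in for the paper's learned MASAC policy; the point is only to show why sparse information propagation leaves a larger residual spread in bigger swarms after the same number of steps.

```python
import numpy as np

def ring_neighbors(n):
    """Assumed 2-neighbor topology: each agent sees its two ring neighbors."""
    return [((i - 1) % n, (i + 1) % n) for i in range(n)]

def consensus_step(positions, neighbors, gain=0.3):
    """Move each agent toward the midpoint of its two neighbors."""
    new = positions.copy()
    for i, (a, b) in enumerate(neighbors):
        new[i] += gain * ((positions[a] + positions[b]) / 2.0 - positions[i])
    return new

def spread(positions):
    """Largest distance of any agent from the swarm centroid."""
    return float(np.max(np.linalg.norm(positions - positions.mean(axis=0), axis=1)))

def run(n, steps=50, seed=0):
    pos = np.random.default_rng(seed).uniform(-5.0, 5.0, size=(n, 3))
    start = spread(pos)
    nbrs = ring_neighbors(n)
    for _ in range(steps):
        pos = consensus_step(pos, nbrs)
    return start, spread(pos)

start_small, end_small = run(3)
start_large, end_large = run(30)
```

The same update rule works unchanged for any swarm size, which mirrors the zero-shot deployment described above.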

2606.02058 2026-06-02 cs.CV cs.RO 版本更新

TIDES: Time-Derivative Event Simulation via Deformable Reconstruction

TIDES：基于可变形重建的时间导数事件模拟

Christopher Thirgood, Dipon Kumar Ghosh, Simon Hadfield

发表机构 * University of Surrey（萨里大学）

AI总结提出TIDES，一种基于动态高斯泼溅的连续时间事件模拟器，通过显式3D场景表示推导逐像素强度动态，实现精确的阈值交叉预测，并利用遮挡引导自适应时间步长，达到最先进的事件流保真度。

详情

AI中文摘要

事件相机响应环境外观变化而发出异步事件。真实世界事件数据集的稀缺使得模拟至关重要。然而，大多数模拟器从帧序列推断事件时间戳，迫使许多阈值交叉共享一小组离散时间；我们将这种失效模式称为时间戳批处理，它在快速运动和遮挡下会恶化。我们提出TIDES，一种基于动态高斯泼溅的连续时间事件模拟器。由于TIDES在具有学习几何和运动的显式3D场景表示上运行，它可以直接从场景推导每像素强度动态，而不是通过渲染帧的差分。这使得能够精确预测阈值交叉，包括每个渲染步骤的多次交叉，而无需时间上采样或帧插值。相同的3D场景模型揭示了物体之间部分遮挡的位置；TIDES利用这一点来指导自适应时间步长，仅将计算集中在遮挡动力学使简单亮度变化模型不可靠的区域。最后，我们使用瓦片级仲裁器对有限传感器带宽进行建模，其吞吐量、抖动和事件丢失再现了真实的传感器伪影。在配对的RGB-事件基准测试中，TIDES达到了最先进的事件流保真度。我们还表明，TIDES模拟的事件比竞争对手更有效地转移到真实下游任务。

英文摘要

Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times, a failure mode we term timestamp batching that worsens under fast motion and occlusion. We present TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learnt geometry and motion, it can derive per-pixel intensity dynamics directly from the scene, rather than by differencing rendered frames. This enables accurate threshold-crossing prediction, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple models of brightness change unreliable. Finally, we model finite sensor bandwidth using a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. Across paired RGB-event benchmarks, TIDES attains state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than competitors'.
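The difference between timestamp batching and continuous-time crossings can be shown for a single pixel. The sketch assumes the log intensity changes linearly within one rendering step and that the reference level starts within one threshold of the current log intensity; it is a simplified illustration, not the simulator's actual intensity model.

```python
def crossing_times(log_i0, slope, t0, t1, ref, threshold):
    """Exact event times for log intensity log_i0 + slope * (t - t0) on [t0, t1].

    Returns a list of (timestamp, polarity) and the updated reference level.
    """
    events = []
    if slope == 0.0:
        return events, ref
    polarity = 1 if slope > 0 else -1
    level = ref + polarity * threshold
    while True:
        t = t0 + (level - log_i0) / slope  # solve for the crossing time
        if t > t1:
            break
        events.append((t, polarity))
        ref = level
        level += polarity * threshold
    return events, ref

events, new_ref = crossing_times(
    log_i0=0.0, slope=1.0, t0=0.0, t1=1.0, ref=0.0, threshold=0.25
)
times = [t for t, _ in events]
```

A frame-differencing simulator would stamp all four of these events at the frame time `t1`, whereas solving for the crossings spreads them across the step.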

2606.02027 2026-06-02 cs.RO cs.LG cs.MA 版本更新

World-Task Factorization for Robot Learning

世界-任务分解用于机器人学习

Eduardo Sebastián, Adrian Pfisterer, Vito Mengers, Oliver Brock, Amanda Prorok

发表机构 * Department of Computer Science and Technology, University of Cambridge, United Kingdom（计算机科学与技术系，剑桥大学，英国）； Robotics and Biology Laboratory, Technische Universität Berlin（机器人与生物学实验室，柏林技术大学）； Science of Intelligence (SCIoI), Cluster of Excellence, Berlin, Germany（智能科学（SCIoI），卓越中心，柏林，德国）； Robotics Institute Germany（德国机器人研究所）

AI总结提出将策略分解为世界因子和任务因子，通过可微图模型AICON与紧凑学习策略结合，实现零样本泛化到新配置并迁移到真实硬件。

详情

AI中文摘要

机器人学习必须产生能够泛化到新的约束、队友和环境组合的策略。为此，我们必须对策略进行结构性分解，这种选择决定了哪些部分泛化、哪些需要重新训练、哪些保持纠缠。现有方法涵盖从期望结构从数据扩展中涌现，到通过层次结构、技能库或学习专门化手工设计。在本文中，我们研究我们认为机器人学中最基本的分解：将世界与任务分离。我们研究了这种分解有原则的条件。世界因子是具身系统和环境的属性；它们独立于意图存在。任务因子由任务在世界所允许的事物上的逻辑定义。我们通过贝叶斯模型证据形式化这种不对称性：它与数据生成过程一致，通过分析世界模型保持高似然，并减少奥卡姆剃刀对任务参数的惩罚。我们通过将AICON（一个可微分的递归估计器和互连图，具有组合性，无需任务特定数据即可运行，并将成本梯度传播到执行器）与一个紧凑的学习策略配对来实例化这种分解，该策略调节梯度路径。梯度作为两个因子之间的接口：它们通过图携带世界结构，通过成本携带任务结构，从而在保持结构泛化的同时实现低维学习。我们在三个问题上测试了世界/任务分解，这些问题包含异构机器人、环境、任务逻辑和感觉运动模态。我们的框架在所有设置中优于端到端基线和分析启发式方法，零样本泛化到分布外配置，并无需重新训练即可迁移到真实硬件。

英文摘要

Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, which is a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task's logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam's razor penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.

2606.01970 2026-06-02 cs.RO cs.MA cs.SY eess.SY 版本更新

Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions

基于市场重规划的搜救任务中安全关键无人机群

Luiz Giacomossi, Andrea Haglund, Claire Namatovu, Emily Zainali, Esaias Målqvist, Yonatan M. Beyene, Ivan Tomasic, Baran Çürüklü, Håkan Forsberg

发表机构 * KTH Royal Institute of Technology（皇家理工学院）； Swedish Defence Research Agency（瑞典国防研究机构）

AI总结提出一种分布式协调架构IRDS，通过反向拍卖市场机制和几何共识协议，在无人机故障下自主重分配任务，在25%退化下保持93%任务成功率。

Comments 6 pages, 4 figures, accepted at MIPRO 2026

详情

AI中文摘要

搜救任务中可靠自主无人机群需要能够容忍代理退化并维持操作的容错协调。本文介绍了智能重规划无人机群（IRDS），一种为资源受限环境设计的分布式协调架构。所提出的框架采用反向拍卖市场机制，其中代理基于距离加权成本函数竞标服务搜索区域，并结合几何共识协议进行目标验证。我们通过物理仿真（N=8个代理，8x8网格）评估该方法，并施加随机故障注入。结果表明，无人机群能够以相对于总任务持续时间较低的延迟自主重新分配来自故障代理的任务，在25%劳动力退化下保持93%的任务成功率。所提出的框架展示了一种稳健的、经过实证测试的空中机器人自愈协调方法。

英文摘要

Reliable autonomous UAV swarms in Search and Rescue (SAR) missions require fault-tolerant coordination capable of sustaining operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework employs a Reverse-Auction market mechanism where agents bid to service search sectors based on a distance-weighted cost function, coupled with a geometric consensus protocol for target verification. We evaluate the approach through physics-based simulations (N=8 agents, 8x8 grid) subjected to stochastic fault injection. Results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a mission success rate of 93% under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested method for self-healing aerial robotic coordination.
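A reverse auction with replanning after a failure can be sketched as below. The cost function (distance plus current load), the greedy sector-by-sector auction, and the name-based tie-break are illustrative assumptions; the abstract specifies only that bids are based on a distance-weighted cost.

```python
import math

def bid(agent_pos, sector_pos, load, weight=1.0):
    """Distance-weighted cost; the load term is a hypothetical balancing penalty."""
    return weight * math.dist(agent_pos, sector_pos) + load

def auction(sectors, agents, load):
    """Each sector goes to the lowest bidder; ties are broken by agent name."""
    assignment = {}
    for name, pos in sectors.items():
        winner = min(agents, key=lambda a: (bid(agents[a], pos, load[a]), a))
        assignment[name] = winner
        load[winner] += 1
    return assignment

def replan(assignment, sectors, agents, failed):
    """Re-auction only the sectors that belonged to the failed agent."""
    survivors = {a: p for a, p in agents.items() if a != failed}
    kept = {s: a for s, a in assignment.items() if a != failed}
    load = {a: 0 for a in survivors}
    for a in kept.values():
        load[a] += 1
    orphaned = {s: sectors[s] for s, a in assignment.items() if a == failed}
    kept.update(auction(orphaned, survivors, load))
    return kept

agents = {"uav0": (0, 0), "uav1": (7, 0), "uav2": (0, 7), "uav3": (7, 7)}
sectors = {"sw": (1, 1), "nw": (1, 6), "se": (6, 1), "ne": (6, 6)}
plan = auction(sectors, agents, {a: 0 for a in agents})
new_plan = replan(plan, sectors, agents, failed="uav3")
```

Only the orphaned sector is re-auctioned, so the surviving agents keep their existing assignments.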

2606.01955 2026-06-02 cs.RO cs.CV 版本更新

WALL-WM: Carving World Action Modeling at the Event Joints

WALL-WM：在事件关节处雕刻世界动作建模

Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

发表机构 * X Square Robot Team（X Square机器人团队）

AI总结提出WALL-WM世界动作模型，通过事件级视觉-语言-动作预训练解决固定长度动作块与语言、视觉、动作之间的粒度不匹配问题，实现跨语言、场景和任务的泛化，在大规模真实世界评估中达到最先进性能。

详情

AI中文摘要

WALL-WM是一种世界动作模型，它将视频-动作学习从以块为中心的优化转变为以事件为基础的视觉-语言-动作预训练，使用语义连贯的动作事件作为学习的基本单元。现有的WAM通常从多模态或视频基础模型初始化，然后直接基于当前观测和指令优化固定长度的动作块。尽管方便，但这种以块为中心的公式造成了基本的粒度不匹配。语言描述语义目标和事件，视觉通过连续场景动态演变，动作在控制级时间尺度上运行；将三者强制纳入相同的固定长度预测窗口，使得VLA训练变成短视的相关性拟合。WALL-WM通过围绕语义事件组织监督和数据来解决这种不匹配。具体来说，它将基于事件的VLA预训练与由事件级标题和聚类平衡采样构建的数据生态系统配对，从而实现对多样化行为、场景和任务结构的可扩展学习。从相同的事件预训练骨干出发，WALL-WM支持两种互补的推理模式。事件模式消耗下一事件描述并实现可变长度的执行块，而统一模式使用带有阶梯式解码的VLM来调节传统的固定长度块推理，同时保留梯度连续的VLA路径。结合基于Muon优化器的大规模预训练基础设施，WALL-WM为通用WAM提供了实用的规模化方案。实验表明，WALL-WM在语言、场景和任务上广泛泛化，在大规模真实世界泛化评估中达到了最先进的性能。

英文摘要

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

2606.01951 2026-06-02 cs.RO 版本更新

Co-training with Ego-centric Video and Demonstration for Robot Navigation Task

基于自我中心视频与示范的机器人导航任务协同训练

Shoya Kuno, Yumo Ouchi, Kanata Suzuki

发表机构 * Department of Informatics, Graduate School of Informatics, Kyoto University（信息学系，京都大学研究生院）； Spatial Robotics Research Center, Fujitsu Limited（空间机器人研究中心，富士通有限公司）

AI总结提出将自我中心行走视频转化为移动机器人模仿学习数据集的框架，通过联合训练VLA模型提升语言理解和动作生成能力。

详情

AI中文摘要

视觉-语言-动作（VLA）模型在多种机器人任务中展现出潜力，但其性能严重依赖于大规模高质量训练数据，而在真实机器人上收集这些数据成本高昂且耗时。虽然先前的工作已经探索了利用自我中心人类视频来增强操作数据集，但由于运动过程中的视角变化，将此类方法应用于移动机器人导航仍然具有挑战性。在本文中，我们提出了一个框架，将自我中心行走视频转化为移动机器人模仿学习的数据集。该方法从人类视频中估计相机运动，并将其转换为与地面移动机器人兼容的动作表示。通过联合训练基于人类数据和机器人收集数据的VLA模型，该模型在语言理解和鲁棒动作生成方面比单独使用任一数据源训练取得了更好的性能。在水果搜索导航任务上的实验表明，人类自我中心视频为移动机器人学习提供了有效且可扩展的数据源。

英文摘要

Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to viewpoint changes during locomotion. In this paper, we propose a framework that converts egocentric walking videos into datasets for mobile robot imitation learning. The proposed method estimates camera motion from human videos and transforms it into action representations compatible with ground mobile robots. By jointly training a VLA model on human-derived and robot-collected datasets, the model achieves improved language understanding and more robust action generation than training with either data source alone. Experiments on a fruit-search navigation task demonstrate that human egocentric videos provide an effective and scalable data source for mobile robot learning.
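One plausible form of the conversion step is to project estimated camera poses onto the ground plane and turn consecutive poses into linear and angular velocity commands. The planar `(x, y, yaw)` pose format and the unicycle-style `(v, omega)` action are assumptions for illustration; the abstract does not state the exact action representation.

```python
import numpy as np

def poses_to_unicycle(poses, dt):
    """Convert planar poses (x, y, yaw) sampled every dt seconds to (v, omega)."""
    poses = np.asarray(poses, dtype=float)
    dxy = np.diff(poses[:, :2], axis=0)
    yaw = poses[:-1, 2]
    # Forward speed: displacement projected onto the heading at the step start.
    v = (dxy[:, 0] * np.cos(yaw) + dxy[:, 1] * np.sin(yaw)) / dt
    # Yaw rate: wrap the heading difference into [-pi, pi) before dividing.
    dyaw = np.diff(poses[:, 2])
    dyaw = (dyaw + np.pi) % (2.0 * np.pi) - np.pi
    return np.stack([v, dyaw / dt], axis=1)

poses = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, np.pi / 2)]
actions = poses_to_unicycle(poses, dt=0.5)
```

Sideways motion of a walking camera is discarded by the projection, which is one way to make human motion compatible with a ground robot that cannot strafe.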

2606.01950 2026-06-02 cs.RO cs.CV cs.LG 版本更新

Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

面向刚性物体的学习动作条件与对象中心高斯溅射世界模型

Jens U. Kreber, Lukas Mack, Joerg Stueckler

发表机构 * Intelligent Perception in Technical Systems Group（技术系统智能感知组）

AI总结提出MRO-GWM模型，通过对象中心高斯表示和时空变换器架构，学习刚性物体在3D中的动作条件动力学，支持多物体场景和部分观测下的未来运动预测。

详情

AI中文摘要

世界模型使智能体能够预测其动作对环境的影响。在本文中，我们提出了多刚性物体高斯世界模型（MRO-GWM），一种学习刚性物体在3D中动作条件动力学的新模型。通过用对象中心高斯表示场景，我们可以表示任意物体形状和多物体场景。我们开发了一种新颖的时空变换器架构，该架构根据物体高斯的历史和未来动作预测未来的刚体运动。物体通过其在规范坐标系中的高斯表示，从而可以将物体运动描述为刚体变换。我们的模型在多视角重建上进行训练，这要求模型处理因遮挡导致的物体部分观测。我们分析了该方法在由典型家庭物体组成的合成数据集上的预测性能，这些数据集包含多物体动力学和机器人末端执行器的交互。我们还在模拟中评估了模型在非抓取操作中的模型预测控制性能。

英文摘要

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.
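Keeping an object's Gaussians in a canonical frame means a predicted rigid motion moves all of them with a single rotation and translation. The sketch below shows the standard transformation of Gaussian means and covariances; array shapes and names are illustrative.

```python
import numpy as np

def transform_gaussians(means, covs, R, t):
    """Apply a rigid transform to Gaussians stored in an object's canonical frame.

    means: (N, 3), covs: (N, 3, 3), R: (3, 3) rotation, t: (3,) translation.
    """
    new_means = means @ R.T + t
    new_covs = R @ covs @ R.T  # broadcasts over the N covariance matrices
    return new_means, new_covs

# 90 degree rotation about the z axis, then a shift along z.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.0, 0.0, 1.0])
means = np.array([[1.0, 0.0, 0.0]])
covs = np.array([np.diag([4.0, 1.0, 1.0])])
world_means, world_covs = transform_gaussians(means, covs, R, t)
```

Because shape is fixed in the canonical frame, a dynamics model only has to predict `R` and `t` per object rather than the motion of every Gaussian.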

2606.01946 2026-06-02 cs.RO 版本更新

Closed-Form Pose Estimation of Endoluminal Medical Devices via Gradiometer-Based Electromagnetic Localization System

基于梯度计的电磁定位系统实现腔内医疗器械的闭式位姿估计

Zhiwei Wu, Jiahao Luo, Yubo Pu, Siyi Wei, Yuankai Chen, Jinhui Zhang

AI总结提出一种基于梯度计的电磁定位系统（GELS），利用紧凑型磁力计阵列作为准梯度计估计局部磁场和梯度张量，通过欧拉齐次关系映射为位移，再经多源Procrustes配准实现闭式位姿估计，无需预校准场图或迭代优化。

详情

AI中文摘要

嵌入式磁跟踪对于腔内医疗器械的远程导航具有极具吸引力的前景。然而，现有的六自由度位姿恢复方法通常需要预校准的工作空间场图或迭代非线性优化。本文提出了一种基于梯度计的电磁定位系统（GELS），这是一种闭式跟踪框架，使用紧凑型磁力计阵列作为嵌入式准梯度计来估计局部磁场和梯度张量。这些量通过欧拉齐次关系映射为源与阵列之间的位移，随后利用至少三个非共线源的多源Procrustes配准恢复阵列的方向和位置。该算法需要已知的源位置和阵列几何结构，但无需预校准的工作空间场图、初始位姿猜测或校准的激励源矩。恢复的位姿还可作为移动磁参考框架，实现概念验证的子级偶极子定位任务。跨传感器阵列配置和激励模式的台架实验显示，序列平均位置误差为10.80 mm至15.57 mm，最快更新率为14.49 Hz，中位求解器运行时间为172.00 µs。基于扰动的误差传播分析进一步确定了传感器间不一致性和偶极子模型失配是主要的精度限制因素，从而为未来进一步减少位姿估计误差的传感器阵列和磁源设计提供指导。

英文摘要

Embedded magnetic tracking holds highly attractive prospects for remote navigation of endoluminal medical devices. However, existing six-degree-of-freedom pose recovery approaches often require pre-calibrated workspace field maps or iterative nonlinear optimization. This letter presents a Gradiometer-Based Electromagnetic Localization System (GELS), a closed-form tracking framework that uses a compact magnetometer array as an embedded quasi-gradiometer to estimate local magnetic fields and gradient tensors. These quantities are mapped by the Euler homogeneous relation to displacements between source and array, from which multi-source Procrustes registration recovers the array orientation and position using at least three non-collinear sources. The algorithm requires known source positions and array geometry, but no pre-calibrated workspace field maps, initial pose guesses, or calibrated excitation-source moments. The recovered pose also enables a proof-of-concept sub-level dipole localization task by serving as a mobile magnetic reference frame. Benchtop experiments across sensor-array configurations and excitation modes demonstrate sequence-averaged position errors of 10.80 mm to 15.57 mm, a fastest update rate of 14.49 Hz, and a median solver runtime of 172.00 µs. A perturbation-based error propagation analysis further identifies inter-sensor inconsistency and dipole-model mismatch as the dominant accuracy limits, thereby informing future sensor array and magnetic source design for further reducing pose-estimation error.
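The Euler homogeneous relation can be checked numerically for an ideal point dipole, whose field is homogeneous of degree -3 in the displacement r. Euler's theorem then gives G r = -3 B, where G is the gradient tensor, so r follows from one linear solve without knowing the source moment. The sketch assumes a noise-free point-dipole model with the constant prefactor dropped and uses finite differences in place of a sensor array; it covers only this displacement step, not the Procrustes registration.

```python
import numpy as np

def dipole_field(r, m):
    """Point-dipole field at displacement r from the source (prefactor dropped)."""
    d = np.linalg.norm(r)
    return (3.0 * r * np.dot(m, r) / d**2 - m) / d**3

def gradient_tensor(r, m, h=1e-5):
    """Central-difference estimate of G[i, j] = dB_i / dr_j."""
    G = np.zeros((3, 3))
    for j in range(3):
        step = np.zeros(3)
        step[j] = h
        G[:, j] = (dipole_field(r + step, m) - dipole_field(r - step, m)) / (2.0 * h)
    return G

def displacement_from_field(B, G):
    """Solve G r = -3 B, which follows from Euler's theorem for a degree -3 field."""
    return np.linalg.solve(G, -3.0 * B)

r_true = np.array([0.10, 0.05, 0.20])  # metres, source to sensor
moment = np.array([0.0, 0.0, 1.0])     # only used to synthesize the measurements
B = dipole_field(r_true, moment)
G = gradient_tensor(r_true, moment)
r_est = displacement_from_field(B, G)
```

The solve needs an invertible gradient tensor, which fails at some geometries such as points in the dipole's equatorial plane.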

2606.01865 2026-06-02 cs.RO 版本更新

Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections

集合监督扩散策略：通过修正学习动作分块扩散

Zhaoting Li, Gang Chen, Javier Alonso-Mora, Cosimo Della Santina, Jens Kober

发表机构 * Delft University of Technology（代尔夫特理工大学）

AI总结提出集合监督扩散策略（SDP），利用人类修正中的对比动作分块数据，通过构建期望动作分块集合来训练扩散策略，有效缓解分布偏移并提升鲁棒性。

详情

AI中文摘要

扩散策略最近已成为机器人操作的一个强大框架。然而，与其他行为克隆方法一样，它仍然容易受到分布偏移的影响，通常需要人在回路中进行干预以纠正部署过程中的失败。这些交互自然提供了成对监督，形式为机器人的不期望动作和人类教师的纠正动作。然而，现有的数据聚合流程和标准行为克隆损失在很大程度上忽略了来自不期望动作的负面信号，导致对教师动作的过拟合以及对昂贵专家数据的日益依赖。为了解决这一限制，我们提出了集合监督扩散策略（SDP），这是一种新颖的学习框架，利用对比动作分块数据从人类修正中训练扩散策略。从配对的正负动作分块中，SDP构建了一组期望的动作分块，并设计了一个训练流程，鼓励扩散策略与该集合对齐。通过在多个机器人操作任务上的大量实验，我们证明了SDP持续提高了策略性能，在对噪声数据的鲁棒性方面尤其显著。此外，SDP生成了高质量的聚合数据集，使得从人在回路修正中进行更高效、更可靠的策略学习成为可能。我们的代码可在 https://set-supervised-diffusion-policy.github.io/ 获取。

英文摘要

Diffusion policies have recently emerged as a powerful framework for robotic manipulation. However, like other behavior cloning methods, they remain vulnerable to distributional shift, often requiring human-in-the-loop interventions to correct failures during deployment. These interactions naturally provide paired supervision in the form of the robot's undesired actions and the human teacher's corrective actions. Yet existing data aggregation pipelines and standard behavior cloning losses largely ignore this negative signal from undesired actions, leading to overfitting to the teacher's actions and an increasing reliance on costly expert data. To address this limitation, we propose Set-Supervised Diffusion Policy (SDP), a novel learning framework that utilizes contrastive action-chunk data to train diffusion policies from human corrections. From paired positive and negative action-chunks, SDP constructs a set of desired action-chunks and designs a training pipeline that encourages the diffusion policy to align with the set. Through extensive experiments across multiple robotic manipulation tasks, we demonstrate that SDP consistently improves policy performance, with particularly strong gains in robustness to noisy data. Moreover, SDP induces high-quality aggregated datasets, enabling more efficient and reliable policy learning from human-in-the-loop corrections. Our code is available at https://set-supervised-diffusion-policy.github.io/.

2606.01847 2026-06-02 cs.RO cs.LG 版本更新

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

我们说的谎言：通过切空间上的分数匹配纠正视觉-语言-动作策略中的欧几里得谬误

Bing-Cheng Chuang, I-Hsuan Chu, Bor-Jiun Lin, YuanFu Yang, Min Sun, Chun-Yi Lee

发表机构 * National Taiwan University（台湾大学）

AI总结针对扩散视觉-语言-动作策略将SE(3)位姿表示为平坦R^12向量导致的欧几里得谬误，提出Lie Diffuser Actor (LDA)框架，通过左不变SDE注入噪声、在切空间预测分数并利用指数映射回缩样本，从根本上消除流形漂移、保证坐标框架等变性和测地线最优性，在CALVIN ABC→D上平均任务长度从3.27提升至3.51。

Comments ICML 2026 Accepted

详情

AI中文摘要

基于扩散的视觉-语言-动作策略在机器人操作中取得了显著成功，但犯了一个我们称之为$\textbf{欧几里得谬误}$的基本几何错误：将SE(3)位姿表示为平坦的$\mathbb{R}^{12}$向量。这种近似导致(1)违反SO(3)约束的流形漂移，(2)坐标变换下等变性的破坏，以及(3)具有过高运动学代价的非测地线轨迹。我们提出$\textbf{Lie Diffuser Actor (LDA)}$，一个本质上在SE(3)上运行的扩散框架。我们的方法通过左不变SDE注入噪声，在切空间中预测分数，并通过指数映射回缩样本。这种表述通过构造消除了流形漂移，同时保证了坐标框架等变性和测地线最优性。在CALVIN ABC$\rightarrow$D上，LDA将平均任务长度从$3.27$提升到$3.51$（$+7.3\%$）。我们进一步在真实机器人上验证了该方法，结果表明我们的方法在大多数任务上优于基线。

英文摘要

Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce $\textbf{Lie Diffuser Actor (LDA)}$, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.27$ to $3.51$ ($+7.3\%$). We further validate our method on a real robot, and the results show that it outperforms the baseline on the majority of tasks.
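The manifold-drift argument can be made concrete on the rotation part alone. The sketch below is an illustration under stated assumptions, not the LDA code: it perturbs a rotation once through the SO(3) exponential map (Rodrigues' formula, applied by right multiplication) and once by adding a fixed offset to every matrix entry, then measures how far each result is from being orthogonal. All names and numeric values are made up for the example.

```python
import math

def hat(w):
    """Map a rotation vector to its 3x3 skew-symmetric matrix."""
    wx, wy, wz = w
    return [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def exp_so3(w):
    """Rodrigues' formula: exp(hat(w)) = I + a*K + b*K^2."""
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1 - cos(t))/t^2
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta**2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j] for j in range(3)]
            for i in range(3)]

def orthogonality_error(R):
    """Largest entry of |R^T R - I|; zero for an exact rotation matrix."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    P = matmul(Rt, R)
    return max(abs(P[i][j] - (1.0 if i == j else 0.0)) for i in range(3) for j in range(3))

R = exp_so3([0.3, -0.2, 0.5])      # starting rotation
xi = [0.05, 0.10, -0.08]           # tangent-space perturbation (illustrative)
flat_offset = 0.1                  # per-entry perturbation for the flat-vector case

R_manifold = matmul(R, exp_so3(xi))
R_flat = [[R[i][j] + flat_offset for j in range(3)] for i in range(3)]

err_manifold = orthogonality_error(R_manifold)
err_flat = orthogonality_error(R_flat)
```

The retracted sample stays orthogonal up to floating-point error, while the entry-wise perturbation leaves SO(3) and would need a separate projection step.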

2606.01824 2026-06-02 cs.RO 版本更新

DisFlow: Scene Flow from Distance Field for Object Pose, Velocity Tracking, and Dynamic Object Reconstruction

DisFlow: 基于距离场的场景流用于物体姿态、速度跟踪和动态物体重建

Lan Wu, Sheila Sutjipto, Jennifer Wakulicz, Teresa Vidal-Calleja

发表机构 * Robotics Institute, University of Technology Sydney（悉尼科技大学机器人研究所）； School of Engineering, University of Western Australia（西澳大学工程学院）

AI总结提出DisFlow框架，利用高斯过程隐式曲面表示从距离场估计场景流，实现6自由度动态物体姿态估计、运动跟踪和表面重建。

详情

AI中文摘要

我们提出了DisFlow，一种新颖的从距离场进行在线场景流估计的框架，能够实现6自由度动态物体姿态估计、运动跟踪和表面重建。场景由高斯过程隐式曲面（GPIS）表示，表面法线作为导数约束，使得在表面附近能够进行准确的符号距离计算和带不确定性的梯度查询。以此表示为基石，我们从距离场计算场景流，描述表面点如何在连续帧中随时间传输。通过我们的流，我们可以通过优雅的闭式优化逐步注册新观测的点云来估计物体的姿态和运动。与先前在相机或世界坐标系中操作的方法不同，我们的方法直接在物体坐标系中进行概率融合，其中物体随时间保持几何一致性。DisFlow方法在空间和时间上的紧密耦合产生了密集几何、表面法线、物体姿态轨迹、速度和不确定性，且均达到实时速率。我们在动态物体序列上评估了DisFlow，并证明它在同时重建高质量物体表面的同时，实现了准确的姿态和运动跟踪。代码公开于https://github.com/LanWu076/disflow_ros2。

英文摘要

We present DisFlow, a novel framework for online scene flow estimation from a distance field that enables 6DoF dynamic object pose estimation, motion tracking, and surface reconstruction. The scene is represented by Gaussian Process Implicit Surfaces (GPIS), with surface normals serving as derivative constraints, enabling accurate signed distance computations near the surface and gradient queries with uncertainty. With this representation as a foundation, we compute a scene flow from the distance field that describes how surface points are transported over time in consecutive frames. Through our flow, we can estimate an object's pose and motion by incrementally registering a newly observed point cloud via an elegant closed-form optimisation. Unlike prior methods that operate in the camera or world frame, our approach performs probabilistic fusion directly in the object frame, where the object remains geometrically consistent over time. The tight coupling of the DisFlow method in space and time yields dense geometry, surface normals, object pose trajectories, velocities, and uncertainty, all at real-time rates. We evaluate DisFlow on dynamic object sequences and demonstrate that it achieves accurate pose and motion tracking while simultaneously reconstructing high-quality object surfaces. Code is publicly available at https://github.com/LanWu076/disflow_ros2.

2606.01777 2026-06-02 cs.RO 版本更新

Trans2Occ: Voxel Occupancy Estimation and Grasp for Transparent Objects from Simulation to Reality

Trans2Occ: 从仿真到现实的透明物体体素占用估计与抓取

Yixuan Yang, Sha Zhang, Rui Li, Zhenfei Yin, Xinzhu Ma, Yiran Qin, Lei Bai, Xudong Xu, Shilin Shan, Wangmeng Zuo, Yanyong Zhang, Wanli Ouyang, Feng Zheng, Shixiang Tang, Dongzhan Zhou

发表机构 * Shanghai AI Laboratory（上海人工智能实验室）； SUSTech（南方科技大学）； CUHK（香港中文大学）； Harbin Institute of Technology（哈尔滨工业大学）； University of Oxford（牛津大学）； Beihang University（北京航空航天大学）； Nanyang Technological University（南洋理工大学）； University of Science and Technology of China（中国科学技术大学）

AI总结提出基于单视图RGB输入的体素占用预测框架，结合仿真数据生成与规则抓取策略，实现透明物体的鲁棒3D感知与操作。

详情

AI中文摘要

透明物体由于折射和反射导致的深度感知不可靠，对机器人感知构成挑战。先前的方法依赖多视图重建或深度补全，但往往难以在真实机器人系统中扩展或部署。本文提出一个基于单视图RGB输入的透明物体感知与操作实用框架。我们的方法直接从单张图像预测体素空间占用，提供支持下游机器人抓取的几何感知表示。为实现大规模训练，我们构建了一个仿真流水线，在不同材质和光照条件下生成配对的RGB图像和体素占用标注。我们证明预测的占用表示对领域偏移具有鲁棒性，并能从仿真有效迁移到真实机器人设置，无需微调。基于占用构建的简单规则抓取策略进一步实现了透明物体的可靠抓取性能。在仿真和真实环境中的大量实验表明，我们的框架提供了准确的3D理解，并实现了透明物体的实用操作。这些结果表明，单视图占用预测为机器人中的透明物体感知提供了一种可扩展且有效的解决方案。

英文摘要

Transparent objects remain challenging for robotic perception due to unreliable depth sensing caused by refraction and reflection. While prior approaches rely on multi-view reconstruction or depth completion, they are often difficult to scale or deploy in real-world robotic systems. In this paper, we present a practical framework for transparent object perception and manipulation based on single-view RGB input. Our approach predicts voxel-space occupancy directly from a single image, providing a geometry-aware representation that supports downstream robotic grasping. To enable large-scale training, we construct a simulation pipeline that generates paired RGB images and voxel occupancy annotations under diverse materials and lighting conditions. We demonstrate that the predicted occupancy representation is robust to domain shifts and transfers effectively from simulation to real-world robotic setups without fine-tuning. A simple rule-based grasping strategy built on top of the occupancy further achieves reliable grasp performance on transparent objects. Extensive experiments in both simulation and real-world environments show that our framework provides accurate 3D understanding and enables practical manipulation of transparent objects. These results suggest that single-view occupancy prediction offers a scalable and effective solution for transparent object perception in robotics.

2606.01734 2026-06-02 cs.CV cs.LG cs.RO 版本更新

FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds

FlatVPR: 用于基础模型特征流形几何校正的即插即用地线性残差适配器

Rai Hisada, Kanji Tanaka

发表机构 * Fundamental Engineering for Knowledge-Based Society, Graduate School of Engineering, University of Fukui（知识社会基础工程，工程研究生院，福井大学）

AI总结提出FlatVPR范式，通过可学习残差适配器和Pullback Flatness Loss抑制特征流形曲率，实现稀疏锚点下的线性插值重建，在NCLT数据集上显著提升视觉位置识别精度。

Comments 5 pages, 1 figure, technical report

详情

AI中文摘要

本文提出“FlatVPR”，一种新颖的几何校正范式，通过强制特征流形结构，使得两个相邻锚点 $\mathbf{z}_A$ 和 $\mathbf{z}_B$ 之间的任何描述符都可以通过线性插值 $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$（其中 $t \in [0,1]$ 表示相对位置）精确重建，从而有效平衡视觉位置识别（VPR）中地图轻量化和定位精度之间的权衡。尽管最先进的基础模型（如DINOv2-ViT-S/14）提供了鲁棒的语义特征，但其潜在流形表现出显著的曲率，将物理空间中的均匀线性运动投影到特征空间中高度非线性的轨迹上，这阻碍了稀疏锚点条件下的可靠重建。为了实现上述基于插值的重建，我们对原始基础特征 $\mathbf{z}$ 引入残差变换 $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$，其中 $\text{Res}(\cdot)$ 表示可学习的适配器。我们的方法通过数学上严谨的Pullback Flatness Loss显式抑制流形曲率，该损失最小化中间特征与连接相邻锚点的线性段之间的偏差，从而最小化流形的内在曲率。通过这种空间展平，地图构建被公式化为期望最大化（EM）框架，解耦为用于流形适应的连续M步和用于最优锚点选择准则的概念性E步。在NCLT数据集上的实验表明，即使在100米间隔的极端稀疏锚点和极端季节变化条件下，应用我们的适配器也能带来显著的性能提升。

英文摘要

This paper proposes ``FlatVPR,'' a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100m intervals and extreme seasonal changes.
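The interpolation rule and the idea of penalizing deviation from the chord between anchors can be sketched in a few lines. The abstract does not give the exact form of the Pullback Flatness Loss, so the squared-distance term below is an assumption used only for illustration; the function names and the toy 2-D descriptors are likewise hypothetical.

```python
def lerp(z_a, z_b, t):
    """Pseudo-descriptor (1 - t) * z_a + t * z_b for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

def flatness_deviation(z_a, z_mid, z_b, t):
    """Squared distance between an intermediate descriptor and its chord point
    (assumed stand-in for one term of a flatness loss)."""
    z_hat = lerp(z_a, z_b, t)
    return sum((m - h) ** 2 for m, h in zip(z_mid, z_hat))

# Toy 2-D descriptors for two adjacent anchors.
z_a, z_b = [0.0, 0.0], [2.0, 0.0]

# A descriptor lying on the chord reconstructs exactly; one off the chord does not.
on_chord = flatness_deviation(z_a, [1.0, 0.0], z_b, 0.5)
off_chord = flatness_deviation(z_a, [1.0, 0.5], z_b, 0.5)
```

In this picture, the learnable residual adapter would be trained so that adapted intermediate descriptors move toward the zero-deviation case.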

2606.01713 2026-06-02 cs.RO cs.SY eess.SY math.OC 版本更新

FlipItRight: Stable Pose-Targeted Throw-Flip Across Diverse Objects

FlipItRight: 面向多样物体的稳定姿态目标投掷翻转

Axel Dawne, Shinkyu Park

发表机构 * King Abdullah University of Science and Technology（阿卜杜拉国王科技大学）

AI总结提出FlipItRight框架，通过将任务分解为物体级规划器和机器人级规划器，利用释放状态作为显式中间表示，实现无需先验数据或学习模型的高自由度机械臂稳定平面姿态目标投掷翻转，在120次实验中达到90%成功率。

详情

AI中文摘要

我们提出了FlipItRight，一个用于高自由度机械臂进行稳定平面姿态目标投掷翻转的框架。该任务被分解为一个物体级规划器，它生成满足期望着陆姿态的候选释放状态，以及一个机器人级规划器，它评估可执行性并构建可行的摆动运动。将释放状态视为显式中间表示，能够实现原则性的候选过滤、释放和预摆动配置的自适应选择，以及结构化的近释放运动设计——特别是在最终摆动阶段保持近似恒定的末端执行器速度，以提高对释放时间不确定性的鲁棒性。我们在一个真实平台上对不同形状、大小和质量的物体进行了验证，在120次试验中达到了90%的成功率。消融研究证实，每个设计选择都对投掷性能有所贡献，并且该框架不需要先验数据或学习模型，能够直接部署到新物体和目标上，无需特定环境的校准或数据收集。

英文摘要

We propose FlipItRight, a framework for stable planar pose-targeted throw-flip with a high-DoF manipulator. The task is decomposed into an object-level planner, which generates candidate release states satisfying the desired landing pose, and a robot-level planner, which evaluates executability and constructs a feasible swing motion. Treating the release state as an explicit intermediate representation enables principled candidate filtering, adaptive selection of release and pre-swing configurations, and structured near-release motion design -- in particular, approximately constant end-effector velocities during the final swing phase to improve robustness to release-timing uncertainty. We validate on a real platform across objects of varying shape, size, and mass, achieving a 90% success rate across 120 trials. Ablation studies confirm that each design choice contributes to throwing performance, and the framework requires no prior data or learned model, enabling direct deployment on new objects and targets without environment-specific calibration or data collection.
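To make the notion of a release state that satisfies a desired landing pose more tangible, here is a minimal planar free-flight sketch. It assumes drag-free ballistic motion, a constant angular velocity after release, and landing at height zero without bouncing; these simplifications, the function names, and the numbers are assumptions for illustration and do not reproduce the paper's object-level planner.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def flight_time(z0, vz):
    """Positive root of z0 + vz*t - 0.5*G*t^2 = 0 (release height z0 > 0)."""
    return (vz + math.sqrt(vz * vz + 2.0 * G * z0)) / G

def landing_state(x0, z0, theta0, vx, vz, omega):
    """Landing time, horizontal position, and orientation after free flight."""
    t = flight_time(z0, vz)
    return t, x0 + vx * t, theta0 + omega * t

def omega_for_target(theta0, theta_target, full_flips, t):
    """Angular velocity that reaches theta_target plus a number of full flips at time t."""
    return (theta_target + 2.0 * math.pi * full_flips - theta0) / t

# Candidate release state (illustrative values).
x0, z0, theta0 = 0.0, 0.5, 0.0
vx, vz = 1.5, 1.0
theta_target, full_flips = math.pi, 0   # land turned over by half a revolution

t_land = flight_time(z0, vz)
omega = omega_for_target(theta0, theta_target, full_flips, t_land)
t, x_land, theta_land = landing_state(x0, z0, theta0, vx, vz, omega)
```

A planner in this spirit could enumerate many such release states and pass them on to a robot-level feasibility check, which is the split the abstract describes.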

2606.01600 2026-06-02 cs.CV cs.CL cs.RO 版本更新

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

RoboTrustBench：机器人操作视频世界模型的可信度基准测试

Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, Bin Zhu

发表机构 * Singapore Management University（新加坡管理大学）； Fudan University（复旦大学）； Princeton University（普林斯顿大学）

AI总结针对视频世界模型在机器人操作中的可信度问题，提出RoboTrustBench基准，包含正常、约束敏感、反事实和对抗四种场景，通过专家验证的指令-图像对和六维评估协议，发现当前模型在约束推理、反事实基础、物理交互和不安全指令抑制方面存在不足。

Comments Project: https://huiqiongli.github.io/RoboTrustBench/

2606.01597 2026-06-02 cs.RO cs.MA 版本更新

S2M-Trek: 从单球到多球运输：基于轮腿机器人的逐帧深度集方法

Zong Chen, Xuebin Li, Jinpeng Xiao, Shaoyang Li, Ben Liu, Min Li, Zhouping Yin, Yiqun Li

发表机构 * School of Mechanical Science and Engineering, Huazhong University of Science and Technology（华中科技大学机械科学与工程学院）； School of Mathematics, Harbin Institute of Technology（哈尔滨工业大学数学学院）

AI总结针对轮腿四足机器人背部同时运输多个自由滚动球体的动态操作问题，提出逐帧深度集（PFDS）编码器，通过逐帧置换不变池化解决历史拼接编码器的置换对称性不匹配，实现五球100%无掉落运输。

详情

AI中文摘要

我们研究了从单个自由滚动球体到多个球体同时运输的动态操作缩放问题，这些球体在轮腿四足机器人背部运输，无需围栏、夹具或机械止动器。多个相同的自由滚动球体构成一个无序集合，没有持久身份：它们的顺序可能在每个历史帧中独立变化，产生一种逐帧置换对称性，而标准的历史拼接集合编码器并未显式强制这种对称性；这些编码器仅在整个历史上施加共享的对角置换对称性。我们表明，这种对称性不匹配导致基于课程强化学习的具体失败模式。在相同的PPO训练预算内，平坦MLP和分支编码器在双球阶段或以下停滞，而历史拼接深度集基线（HCDS）在我们的运行中无法超越双球阶段，除非在训练期间随机化球到槽的分配，这表明它利用槽索引作为课程捷径，而不是学习无身份的多球动力学。我们提出逐帧深度集（PFDS），它在时间读出之前在每个历史帧内执行置换不变池化；我们证明PFDS是$\Gframe$-不变的，并且能普遍逼近连续的$\Gframe$-不变策略。一个$2\times 2$消融实验（编码器架构和槽随机化）分离了架构和数据增强路径，PFDS在所有五个随机种子下达到五球阶段，模拟中实现100%无掉落运输。我们进一步通过DAgger将PFDS教师蒸馏为TactSet，用$16\times 16$布尔联合接触图替代特权球体状态观测，产生紧凑且自然$\Gframe$-不变的触觉表示。

英文摘要

We study the problem of scaling dynamic loco-manipulation from a single free-rolling sphere to multiple spheres transported simultaneously on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: their ordering may change independently at each history frame, creating a per-frame permutation symmetry that standard history-concatenation set encoders do not explicitly enforce; these encoders impose only a shared, diagonal permutation symmetry over the full history. We show that this symmetry mismatch leads to a concrete failure mode in curriculum-based reinforcement learning. Within the same PPO training budget, flat MLPs and branch-wise encoders plateau at or below the two-sphere stage, while a history-concatenation Deep Sets baseline (HCDS) fails to progress past the two-sphere stage in our runs unless ball-to-slot assignments are randomised during training, suggesting that it exploits slot indices as a curriculum shortcut rather than learning identity-free multi-sphere dynamics. We propose Per-Frame Deep Sets (PFDS), which performs permutation-invariant pooling within each history frame before temporal readout; we prove that PFDS is $\Gframe$-invariant and universally approximates continuous $\Gframe$-invariant policies. A $2\times 2$ ablation over encoder architecture and slot randomisation separates the architectural and data-augmentation pathways, and PFDS reaches the five-sphere stage with 100% no-drop transport in simulation across all five random seeds. We further distill the PFDS teacher into TactSet via DAgger, replacing privileged sphere-state observations with a $16\times 16$ Boolean union contact map, yielding a compact and naturally $\Gframe$-invariant tactile representation.
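The difference between per-frame pooling and plain history concatenation can be checked on a toy history. The sketch below is a simplified illustration, not the paper's encoder: the per-sphere feature map, the mean pooling, and the readout by concatenating pooled frames are all assumptions chosen to keep the example short.

```python
def phi(sphere):
    """Toy per-sphere feature map (assumption): position and squared radius from origin."""
    x, y = sphere
    return [x, y, x * x + y * y]

def per_frame_pool(history):
    """Mean-pool sphere features inside each frame, then concatenate over time."""
    encoded = []
    for frame in history:
        feats = [phi(s) for s in frame]
        pooled = [sum(col) / len(feats) for col in zip(*feats)]
        encoded.extend(pooled)
    return encoded

def flat_concat(history):
    """Baseline: concatenate raw sphere states in slot order across the history."""
    return [v for frame in history for sphere in frame for v in sphere]

# Two history frames with three spheres each (2-D positions, illustrative).
history = [
    [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
    [[0.5, 1.5], [2.5, 3.5], [4.5, 5.5]],
]

# Reorder the spheres with a different permutation in each frame.
shuffled = [
    [history[0][2], history[0][0], history[0][1]],
    [history[1][1], history[1][2], history[1][0]],
]

pooled_same = all(
    abs(a - b) < 1e-12
    for a, b in zip(per_frame_pool(history), per_frame_pool(shuffled))
)
flat_same = flat_concat(history) == flat_concat(shuffled)
```

Per-frame pooling gives the same encoding when each frame is shuffled independently, whereas the slot-ordered concatenation changes, which is the symmetry mismatch the abstract points to.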

2606.01313 2026-06-02 cs.RO cs.AI 版本更新

PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making

PSG-Nav: 通过多元宇宙决策的概率场景图导航

Rufeng Chen, Yue Chang, Xiaqiang Tang, Hechang Chen, Sihong Xie

发表机构 * Tsinghua University（清华大学）

AI总结提出PSG-Nav方法，通过构建3D概率场景图并利用多元宇宙决策从联合分布中采样最可能的世界设置，以处理开放词汇导航中的感知不确定性，并引入证据经验校准器实现在线终身适应，在多个基准上取得最新最优结果。

Comments 21 pages, 7 figures. ICML 2026

详情

AI中文摘要

开放词汇导航要求具身智能体管理由语义歧义和模型错误引起的显著感知不确定性。然而，大多数现有工作满足于局部最优的确定性方法，剥夺了在多个复合可能性上的复杂导航决策，而这些对于全局更优解至关重要。在本文中，我们提出概率场景图导航（PSG-Nav），它构建了一个3D概率场景图，使用完整的语义类别分布来考虑感知不确定性。为了有效利用局部分布来组合和推理最优导航地标，我们提出多元宇宙决策，从联合分布中采样多个最可能的世界设置，并基于地标与多元宇宙之间的兼容性评估导航地标。为了减轻开放词汇导航中因认知不确定性导致的误报，我们引入证据经验校准器，通过将检测与过去成功和失败的记忆进行交叉验证，实现在线终身适应。在广泛使用的基准MP3D、HM3D和HSSD上的大量实验表明，PSG-Nav建立了新的最先进结果，分别实现了66.1%、44.8%和67.9%的成功率。代码可在https://psg-nav.github.io/获取。

英文摘要

Open-vocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model errors. However, most existing works settle for local optimal deterministic approaches, depriving complex navigation decision-making over multiple composite possibilities that are critical for globally better solutions. In this paper, we propose Probabilistic Scene Graph Navigation (PSG-Nav), which constructs a 3D Probabilistic Scene Graph that uses full semantic categorical distributions to account for perception uncertainty. To efficiently use the local distributions to compose and reason about the optimal navigation landmarks, we propose Multiverse Decision to sample multiple most likely world settings from the joint distribution, and evaluate navigation landmarks based on the compatibility between landmarks and multiverses. To mitigate false positives due to epistemic uncertainty in open-vocabulary navigation, we introduce the Evidential Experience Calibrator, which enables online lifelong adaptation by cross-validating detections against memories of past successes and failures. Extensive experiments on widely-used benchmarks MP3D, HM3D, and HSSD demonstrate that PSG-Nav establishes new state-of-the-art results, achieving Success Rates of 66.1%, 44.8%, and 67.9%, respectively. Code is available at: https://psg-nav.github.io/
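A rough sketch of reasoning over several likely world settings follows. It assumes that node labels are independent, so the joint probability of a world is the product of per-node probabilities, and it enumerates worlds exhaustively instead of sampling them; both are simplifications for illustration. The node names, labels, probabilities, and the compatibility score are hypothetical and are not taken from PSG-Nav.

```python
from itertools import product

def top_k_worlds(node_dists, k):
    """Return the k most probable label assignments under an independence assumption."""
    names = list(node_dists)
    worlds = []
    for labels in product(*(node_dists[n].items() for n in names)):
        prob = 1.0
        assignment = {}
        for name, (label, p) in zip(names, labels):
            assignment[name] = label
            prob *= p
        worlds.append((prob, assignment))
    worlds.sort(key=lambda w: w[0], reverse=True)
    return worlds[:k]

def landmark_score(worlds, node, target):
    """Share of the kept probability mass in which `node` carries the target label."""
    total = sum(p for p, _ in worlds)
    return sum(p for p, a in worlds if a[node] == target) / total

# Per-node categorical distributions (illustrative values).
node_dists = {
    "obj_1": {"sofa": 0.6, "bed": 0.4},
    "obj_2": {"tv": 0.7, "monitor": 0.3},
}

worlds = top_k_worlds(node_dists, k=3)
sofa_score = landmark_score(worlds, "obj_1", "sofa")
```

Keeping the full distributions, rather than committing to the single most likely label per node, is what lets less likely but still plausible worlds influence the landmark ranking.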

2606.01277 2026-06-02 cs.RO cs.AI cs.CV cs.SY eess.IV eess.SY 版本更新

DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance

DeepIPCv3: 面向突发行人穿越避让的事件感知多模态传感器融合

Oskar Natan, Andi Dharmawan, Aufaclav Zatu Kusuma Frisky, Jazi Eko Istiyanto, Jun Miura

发表机构 * Department of Computer Science and Electronics, Universitas Gadjah Mada（计算机科学与电子系，加查马达大学）； Department of Computer Science and Engineering, Toyohashi University of Technology（计算机科学与工程系，丰桥技术科学大学）

AI总结提出DeepIPCv3框架，通过Transformer交叉模态注意力融合LiDAR点云与DVS事件流，实现突发行人穿越场景下的高反应性避让，在自定义多模态数据集上达到最优轨迹与控制精度。

详情

AI中文摘要

当前的端到端自动驾驶系统主要依赖基于帧的传感器，这类传感器在高度动态的突发行人穿越场景中存在固有的感知延迟和运动模糊问题。为解决这一关键安全漏洞，我们提出DeepIPCv3，一种新颖的多模态自主导航框架，它将LiDAR点云的密集3D空间几何与动态视觉传感器（DVS）的微秒级异步事件流协同融合。我们引入了一种受Transformer启发的交叉模态注意力机制，以动态关联这些不同模态，使网络能够即时优先处理高速动态更新，同时不牺牲场景结构感知。融合后的潜在表示通过一个混合策略网络映射到安全的局部路径点和可执行控制命令，该网络结合了启发式轨迹跟踪与直接神经预测。由于在真实场景中测试这些突发穿越场景存在严重物理风险，该框架使用在光照良好的正午和具有挑战性的傍晚条件下收集的自定义多模态数据集进行严格离线评估。广泛的对比和消融研究表明，DeepIPCv3达到了最先进的预测性能。通过有效消除曝光失败和运动模糊，所提出的LiDAR与DVS融合实现了最低的轨迹和控制命令误差，使得无论环境光照如何，都能实现高反应性、数学上有界的规避机动。为支持未来研究，我们将代码发布到GitHub仓库：https://github.com/oskarnatan/DeepIPCv3。

英文摘要

Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/DeepIPCv3.

2606.01238 2026-06-02 cs.RO cs.LG 版本更新

Training-Free Imitation Learning with Closed-Form Diffusion Policies

基于闭式扩散策略的免训练模仿学习

Raghav Mishra, Ian R. Manchester

发表机构 * Australian Center for Robotics, ARIAM Hub, and School of Aerospace, Mechanical and Mechatronic Engineering University of Sydney（澳大利亚机器人中心、ARIAM中心和悉尼大学航空航天、机械与机电工程学院）

AI总结提出一种基于演示数据集闭式得分的无训练扩散策略（CFDP），实现毫秒级实时模仿学习，性能媲美需数小时训练的神经基线，并支持推理时策略编辑与演示增强。

详情

AI中文摘要

尽管基于扩散的策略具有令人印象深刻的性能和表达能力，但其长时间离线训练拖慢了数据收集和策略部署循环。我们引入了闭式扩散策略（CFDP），这是一类使用从演示数据集导出的闭式得分的无训练扩散策略，用于模仿学习。我们在硬件实验中用移动CPU进行实时推理部署CFDP，表明它能够直接从数据集中毫秒级成功执行模仿，并且推理速度比神经扩散策略更快。在模仿学习基准实验中，我们展示了CFDP与需要数小时训练的神经基线相比具有竞争力，在训练时间和性能之间提供了有利的权衡。最后，我们展示了闭式扩散策略如何作为一种可组合原语，实现对预训练神经扩散策略的数据驱动推理时编辑，包括策略引导和新颖的演示增强。

英文摘要

While diffusion-based policies have impressive performance and expressivity, their long offline training slows down the data collection and policy deployment loop. We introduce Closed-Form Diffusion Policies (CFDP), a class of training-free diffusion-based policies for imitation learning using the closed-form score derived from the demonstration dataset. We deploy CFDP with real-time inference on a mobile CPU in hardware experiments, showing it can successfully perform imitation directly from the dataset in milliseconds and with faster inference than neural diffusion policies. In experiments on imitation learning benchmarks, we show that CFDP is competitive against neural baselines that require hours of training, providing a favorable tradeoff between training time and performance. Finally, we show how closed-form diffusion policies act as a composable primitive that enables data-driven inference-time editing of pre-trained neural diffusion policies, including policy guidance and novel demonstration augmentation.
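One standard way to obtain a score in closed form from a finite dataset is to smooth the empirical distribution with an isotropic Gaussian of width sigma; the score is then a softmax-weighted average of the directions toward the data points, divided by sigma squared. The sketch below implements that textbook expression as an illustration. Whether CFDP uses exactly this form, and how it conditions on observations, is not stated in the abstract, so the function name and the toy demonstrations are assumptions.

```python
import math

def closed_form_score(x, data, sigma):
    """Score of the Gaussian-smoothed empirical distribution at x:
    sum_i w_i * (x_i - x) / sigma^2, with w = softmax(-||x - x_i||^2 / (2 sigma^2))."""
    sq_dists = [sum((xk - dk) ** 2 for xk, dk in zip(x, d)) for d in data]
    logits = [-s / (2.0 * sigma ** 2) for s in sq_dists]
    top = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    mean = [sum(w * d[k] for w, d in zip(weights, data)) for k in range(len(x))]
    return [(mk - xk) / sigma ** 2 for mk, xk in zip(mean, x)]

# Two toy "demonstration actions" in 2-D (illustrative values).
demos = [[0.0, 0.0], [1.0, 0.0]]

score_mid = closed_form_score([0.5, 0.0], demos, sigma=0.5)    # halfway between them
score_left = closed_form_score([-1.0, 0.0], demos, sigma=0.5)  # to the left of both
```

Since the expression only needs distances to stored demonstrations, no network has to be trained, at the cost of touching the dataset on every evaluation.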

2606.01170 2026-06-02 cs.MA cs.RO 版本更新

Coordinating Task Switching in a Robotics Multi-Agent System Using Behavior Trees

使用行为树协调机器人多智能体系统中的任务切换

Lucas Haug, Anarosa Alves Franco Brandão, Arthur Casals

发表机构 * LTI - Laboratório de Técnicas Inteligentes, Universidade de São Paulo, SP（智能技术实验室，圣保罗大学）

AI总结本文提出一种基于行为树的方法，用于在IEEE VSSS机器人足球多智能体系统中协调机器人行为，并通过与有限状态机的对比实验及竞赛验证其有效性。

Comments 7 pages, 7 figures. Preprint of a manuscript submitted to the XXVI Congresso Brasileiro de Automática (CBA 2026)

详情

AI中文摘要

动态环境中四旋翼飞行器的鲁棒集成规划与控制：基于带CBF惩罚的NMPC

Zeinab Shayan, Mohammadreza Izadi, Reza Faieghi

发表机构 * Autonomous Vehicles Laboratory, Department of Aerospace Engineering, Toronto Metropolitan University（自主车辆实验室，航空航天工程系，多伦多都会大学）

AI总结提出一种将控制障碍函数作为指数惩罚嵌入非线性模型预测控制的鲁棒集成规划与控制策略，通过高增益扰动观测器和卡尔曼滤波器增强系统鲁棒性，实现动态环境中的安全避障。

Comments Accepted to Conference on Robots and Vision (CRV 2026), Vancouver, Canada

详情

DOI: 10.21428/d82e957c.79116e3d

AI中文摘要

本文提出了一种新的多旋翼无人飞行器鲁棒集成规划与控制策略。我们提出了一种非线性模型预测控制公式，将控制障碍函数作为指数惩罚嵌入，在严格输入约束下提高可行性并确保平滑避障。惩罚权重提供了一个实用的调节旋钮，用于在跟踪精度和避障激进程度之间进行权衡。我们通过采用高增益扰动观测器来估计和补偿外部扰动，从而增强系统鲁棒性。我们还结合了卡尔曼滤波器，用于计算高效的实时障碍物运动预测，从而实现对移动障碍物的规避。与传统的NMPC以及带有硬CBF约束的NMPC的对比研究，在Gazebo和硬件实验中得到了验证，展示了优越的可行性、安全性和鲁棒性。据我们所知，这是首个经过硬件验证的NMPC-CBF IPC框架，为四旋翼飞行器在动态环境中的安全部署迈出了实际的一步。

英文摘要

This paper presents a new robust integrated planning and control (IPC) strategy for multirotor uncrewed aerial vehicles. We propose a nonlinear model predictive control (NMPC) formulation that embeds control barrier functions (CBFs) as exponential penalties, improving feasibility while ensuring smooth obstacle avoidance under tight input bounds. The penalty weights provide a practical tuning knob to trade off tracking accuracy against avoidance aggressiveness. We enhance the system robustness by employing a high-gain disturbance observer (HGDO) to estimate and compensate for external disturbances. We also incorporate a Kalman filter (KF) for computationally efficient, real-time prediction of obstacle motion, enabling avoidance of moving obstacles. Comparative studies against both conventional NMPC and NMPC with hard CBF constraints, validated in Gazebo and hardware experiments, demonstrate superior feasibility, safety, and robustness. To the best of our knowledge, this is the first hardware-validated NMPC-CBF IPC framework, offering a practical step toward safe quadrotor deployment in dynamic environments.
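The idea of turning a barrier condition into a soft cost can be sketched as follows. The abstract does not specify the barrier function or the penalty expression, so everything here is an assumption made for illustration: a distance-based barrier h, the discrete-time condition h(x_{k+1}) - (1 - gamma) h(x_k) >= 0, and a penalty of the form weight * exp(-sharpness * residual).

```python
import math

def barrier(p, obstacle, radius):
    """h(p) = ||p - obstacle||^2 - radius^2; positive outside the obstacle."""
    return sum((a - b) ** 2 for a, b in zip(p, obstacle)) - radius ** 2

def cbf_residual(p_now, p_next, obstacle, radius, gamma):
    """Discrete-time CBF condition residual; non-negative when the condition holds."""
    return barrier(p_next, obstacle, radius) - (1.0 - gamma) * barrier(p_now, obstacle, radius)

def exponential_penalty(residual, weight, sharpness):
    """Soft cost that grows quickly as the residual becomes negative (assumed form)."""
    return weight * math.exp(-sharpness * residual)

obstacle, radius, gamma = [0.0, 0.0], 1.0, 0.2
p_now = [3.0, 0.0]

# A cautious step and an aggressive step toward the obstacle (illustrative).
res_safe = cbf_residual(p_now, [2.9, 0.0], obstacle, radius, gamma)
res_fast = cbf_residual(p_now, [2.0, 0.0], obstacle, radius, gamma)

cost_safe = exponential_penalty(res_safe, weight=1.0, sharpness=2.0)
cost_fast = exponential_penalty(res_fast, weight=1.0, sharpness=2.0)
```

Unlike a hard constraint, the penalty never makes the optimization infeasible, and the weight plays the role of the tuning knob between tracking accuracy and avoidance aggressiveness mentioned in the abstract.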

2606.01036 2026-06-02 cs.RO 版本更新

Position: Good Embodied Reward Models Need Bad Behavior Data

立场：好的具身奖励模型需要不良行为数据

Ran Tian, Yilin Wu, Andrea Bajcsy

发表机构 * Ran Tian, Yilin Wu, Andrea Bajcsy

AI总结本文主张为获得可靠的具身奖励模型，社区必须投资于“不良”机器人数据（失败、次优、易错甚至危险行为），并通过实验证明即使少量真实不良数据也能改善与人类偏好的一致性。

Comments This position paper has been accepted by the ICML 2026 position track as a spotlight paper

详情

AI中文摘要

这篇立场论文认为，为了获得可靠的具身奖励模型，社区必须投资于“不良”机器人数据：失败、次优、易错甚至危险的行为。虽然奖励模型是任何基础模型生命周期的核心，但今天的具身奖励模型主要基于成功行为进行训练。我们分析了三个最先进的具身奖励模型，发现它们系统性地过度奖励那些真实人类评估者会惩罚的行为，包括不安全交互、糟糕执行以及仅表面满足任务的捷径策略。我们将这些失败归因于一个关键的数据缺口：负面具身数据的稀缺性，这些数据收集成本高昂，并且在现有的机器人数据集中经常被过滤掉或保留。此外，我们表明，即使是少量真实不良行为数据也能改善与人类偏好的一致性，并减少代价高昂的误报。因此，我们呼吁具身AI社区整理并发布他们的不良机器人数据，构建合成不良数据生成引擎，开发更去中心化的物理评估系统，并设计用于细粒度具身奖励模型评估的基准。

英文摘要

This position paper argues that to obtain reliable embodied reward models, the community must invest in ``bad'' robot data: failed, suboptimal, error-prone, and even hazardous behaviors. While reward models are central to any foundation model's lifecycle, today's embodied reward models are trained primarily on successful behaviors. We analyze three state-of-the-art embodied reward models and find that they systematically over-reward behaviors that real human evaluators would penalize, including unsafe interactions, poor execution, and shortcut strategies that only superficially satisfy tasks. We attribute these failures to a key data gap: the scarcity of negative embodied data which is costly to collect and often filtered out or withheld in existing robotics datasets. Furthermore, we show that even modest exposure to real bad behavior data can improve alignment with human preferences and reduce costly false positives. We therefore call on the embodied AI community to curate and release their bad robot data, build synthetic bad data generation engines, develop more decentralized physical evaluation systems, and design benchmarks for fine-grained embodied reward model evaluations.

2606.01027 2026-06-02 cs.RO 版本更新

$\tau_0$-WM: A Unified Video-Action World Model for Robotic Manipulation

$\tau_0$-WM：一种用于机器人操作的统一视频-动作世界模型

Pengfei Zhou, Shengcong Chen, Di Chen, Jiaxu Wang, Rongjun Jin, Bingwen Zhu, Yike Pan, Songen Gu, Kuanning Wang, Shufeng Nan, Xingyu Qiu, Chenhao Qiu, Pu Yang, Yunuo Cai, Jianxiong Gao, Yifan Li, Yanwei Fu, Xiangyu Yue, Zhi Chen, Jianlan Luo

发表机构 * Shanghai Innovation Institute（上海创新研究院）； AGIBOT Finch

AI总结提出$\tau_0$-WM，一个统一视频-动作世界模型，通过共享视频扩散骨干集成策略学习、视频预测和动作评估，在长时域和精细操作任务上优于基线。

Comments Our project homepage: https://finch.agibot.com/research/tau0-wm

详情

AI中文摘要

机器人操作需要能够生成可执行动作并在物理执行前预测和评估其未来后果的模型。我们提出$\tau_0$-世界模型（$\tau_0$-WM），一个统一的视频-动作世界模型，在单个未来预测框架内整合了策略学习、视频预测和动作评估。基于共享的视频扩散骨干，$\tau_0$-WM提供两个互补接口。首先，一个视频动作模型从多视角观察、语言指令和机器人状态中联合预测未来视觉潜变量和连续动作块。其次，一个动作条件视频模拟器将候选动作块展开为多视角未来并预测密集的任务进度分数。该模型在大约27,300小时的实机遥操作、UMI风格交互、自我中心人类视频以及使用模态特定监督掩码的展开或失败轨迹上进行训练。在推理时，$\tau_0$-WM利用测试时计算来采样动作候选，通过重新去噪一致性对其进行排序，并对低质量候选调用基于模拟器的修正。在具有挑战性的长时域和精细机器人操作任务上，$\tau_0$-WM表现出优于其他相关基线的性能。

英文摘要

Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present $\tau_0$-World Model ($\tau_0$-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, $\tau_0$-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately 27,300 hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, $\tau_0$-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, $\tau_0$-WM shows superior performance over other relevant baselines.

2606.01015 2026-06-02 cs.RO cs.AI cs.NI cs.SY eess.SY 版本更新

AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics

AI-IoT-机器人集成：框架、新兴趋势及迈向互联机器人的路径综述

Ranulfo Bezerra, Satoshi Tadokoro, Kazunori Ohno

发表机构 * Tohoku University（东北大学）

AI总结本文综述了人工智能、物联网和机器人三者融合的现状，提出了模块化系统架构，并强调了小语言模型（SLM）和大型语言模型（LLM）在分布式认知与自主决策中的作用，为下一代互联机器人和物理AI生态系统提供了概念和技术路线图。

Comments 15 pages, 3 figures, 3 tables. Published in IEEE Internet of Things Journal

详情

DOI: 10.1109/JIOT.2026.3670191
Journal ref: IEEE Internet of Things Journal, vol. 13, no. 10, pp. 20398-20412, 15 May 2026

AI中文摘要

人工智能、物联网和机器人的融合不再是未来的愿景；它正迅速成为实时、智能和上下文感知系统的基础。AI实现感知和推理，IoT提供可扩展的感知和通信，而机器人则提供具身驱动。尽管在AIoT和物联网机器人（IoRT）等两两组合方面取得了显著进展，但仍缺乏完全整合这三者的统一设计框架。本综述综合了这些领域的最新进展，强调了边缘端的小语言模型（SLM）和云端的大型语言模型（LLM）在分布式认知和自主决策中的新兴作用。我们提出了一个符合这些趋势的模块化系统架构，分析了互操作性和反馈控制中存在的持续差距，并根据集成深度对现有工作进行了分类。我们的综述强调了混合SLM-LLM系统与IoT基础设施和机器人代理相结合时，如何应对实时适应、可扩展性和可靠性方面的挑战。这项工作为设计模块化、可解释且能够在动态环境中学习的下一代AI-IoT-机器人生态系统提供了概念和技术路线图，为新兴的互联机器人和物理AI范式铺平了道路。

英文摘要

The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming the foundation of real-time, intelligent, and context-aware systems. AI enables perception and reasoning, IoT provides scalable sensing and communication, and robotics delivers embodied actuation. Despite significant progress in pairwise combinations such as AIoT and the Internet of Robotic Things (IoRT), there remains a lack of unified design frameworks that fully integrate all three. This survey synthesizes the state-of-the-art across these domains, emphasizing the emerging role of Small Language Models (SLMs) at the edge and Large Language Models (LLMs) in the cloud for distributed cognition and autonomous decision-making. We propose a modular system architecture that aligns with these trends, analyze persistent gaps in interoperability and feedback control, and classify existing work by integration depth. Our review highlights how hybrid SLM-LLM systems, when coupled with IoT infrastructure and robotic agents, can address challenges in real-time adaptation, scalability, and reliability. This work offers a conceptual and technical roadmap for designing next-generation AI-IoT-Robotic ecosystems that are modular, interpretable, and capable of learning within dynamic environments, paving the way for the emerging paradigm of Connected Robotics and Physical AI.

2606.00998 2026-06-02 cs.RO 版本更新

GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping

GraspGen-X: 跨形态6自由度扩散抓取

Beining Han, Yu-Wei Chao, Erwin Coumans, Clemens Eppner, Balakumar Sundaralingam, Jia Deng, Stan Birchfield, Adithyavairavan Murali

发表机构 * NVIDIA ； Princeton University（普林斯顿大学）

AI总结提出一种基于扩散模型的跨形态6自由度抓取方法，通过扫描体积启发式编码夹爪表示，在20亿抓取数据上训练，实现对新物体、场景和夹爪形态的零样本泛化。

详情

AI中文摘要

我们研究跨形态6自由度机器人抓取。与先前工作不同，我们要求模型不仅泛化到新物体/场景，还要泛化到新夹爪形态和物理抓取过程。我们的方法将基于扩散模型的生成式6自由度抓取模型扩展到对额外夹爪表示的条件化。我们提出一种用于编码夹爪的扫描体积启发式方法。我们使用程序化生成的夹爪和一个包含20亿抓取的大规模数据集训练跨形态模型。在仿真实验中，我们的模型在零样本泛化到新型真实世界夹爪和物体方面优于基线方法。我们的模型也可作为微调以适应新夹爪的良好初始化。在消融实验中，我们展示了扫描体积夹爪表示和程序化夹爪训练数据集的效率。最后，我们展示了在6自由度抓取中对真实世界新型夹爪的零样本泛化，在跨形态泛化方面超越了基线。

英文摘要

We study cross-embodiment 6-DOF robot grasping. Unlike prior works, we require the model to generalize not only to novel objects and scenes but also to novel gripper morphologies and physical grasping processes. Our method extends diffusion-based generative 6-DOF grasping models to condition on an additional gripper representation. We propose a swept-volume heuristic for encoding the gripper. We train our cross-embodiment model with procedural grippers and a large-scale dataset of 2 billion grasps. In simulation experiments, our model achieves the best zero-shot generalization to novel real-world grippers and objects among the compared baselines. Our model also serves as a good initialization for fine-tuning to adapt to novel grippers. In ablations, we demonstrate the efficiency of our swept-volume gripper representation and our procedural gripper training dataset. Last, we show zero-shot generalization to real-world novel grippers for 6-DOF grasping, surpassing baselines in cross-embodiment generalization.

2606.00990 2026-06-02 cs.RO 版本更新

OSCAR: Obstacle Survival Curves for Adaptive Robot Navigation

OSCAR: 用于自适应机器人导航的障碍物生存曲线

Hshmat Sahak, Aoran Jiao, Nicholas Rhinehart, Tim Barfoot

发表机构 * University of Toronto（多伦多大学）

AI总结提出OSCAR框架，利用生存模型学习障碍物清除时间分布，并通过图规划器动态调整等待与重路由的阈值，以减少导航时间。

Comments 8 pages main text, appendices included

详情

AI中文摘要

一个沿已知路线图行驶的移动机器人在临时障碍物阻塞关键边时可能会犯代价高昂的导航错误：在停放的推车后面等待太久浪费时间，但立即绕过一个几秒钟后会移动的人也是低效的。标准的反应式避障处理障碍物周围的局部运动，而固定的等待或重路由规则忽略了不同障碍物类型通常持续的时间。我们提出了OSCAR：一种用于具有临时阻塞的基于图的导航的自适应生存建模框架。假设在遇到障碍物时可以获得障碍物类别标签，机器人从在线经验中学习类别条件的残余清除时间分布，包括在重路由之前未观察到清除时的右删失观测。这些生存模型被集成到一个时间相关的图规划器中，该规划器维护障碍物记忆并计算每个阻塞边的耐心阈值：在采取替代路线之前等待多长时间。该方法在多个回合中持续更新其清除估计，并使用它们来平衡等待与重路由。我们在仿真中和真实移动机器人上（在大学中庭，障碍物包括人、椅子、垃圾桶和管道）评估了该方法。在仿真中，学习策略的目标时间在每类障碍物少于20次观测后收敛到具有真实清除分布的神谕的1%以内，优于所有启发式基线。实际部署证实该策略在线改进，从50个导航回合的经验中调整其耐心阈值。

英文摘要

A mobile robot following a graph of known routes can make costly navigation errors when a temporary obstacle blocks a critical edge: waiting too long behind a parked cart wastes time, but immediately rerouting around a person who would move in a few seconds is also inefficient. Standard reactive obstacle avoidance addresses local motion around obstacles, while fixed wait-or-reroute rules ignore how long different obstacle types tend to persist. We propose OSCAR: an adaptive survival-modeling framework for graph-based navigation with temporary blockages. Assuming obstacle class labels are available at encounter time, the robot learns class-conditioned residual clearance-time distributions from online experience, including right-censored observations when it reroutes before observing clearance. These survival models are integrated into a time-dependent graph planner that maintains obstacle memory and computes a patience threshold at each blocked edge: how long to wait before taking an alternate route. The method continuously updates its clearance estimates across episodes and uses them to balance waiting against rerouting. We evaluate the approach in simulation and on a real mobile robot in a university atrium with obstacles including people, chairs, bins, and tubes. In simulation, the learned policy's time-to-goal converges to within 1% of an oracle with access to ground-truth clearance distributions after fewer than 20 observations per obstacle class, outperforming all heuristic baselines. Real-world deployment confirms that the policy improves online, adapting its patience thresholds from experience across 50 navigation episodes.
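The abstract does not say which estimator OSCAR uses, so the following is only a sketch of the general idea under stated assumptions: a Kaplan-Meier estimator is one standard way to learn a survival curve from right-censored clearance times, and the patience rule here (minimize expected waiting time plus the detour cost paid if the obstacle is still present) is a hypothetical stand-in for the paper's planner. The names `kaplan_meier`, `expected_cost`, `best_patience`, and `detour_cost` are illustrative and do not come from the paper.

```python
def kaplan_meier(durations, cleared):
    """Kaplan-Meier survival curve as [(t, S(t))] at each observed clearance time.

    durations[i]: how long the robot watched obstacle i.
    cleared[i]:   True if the obstacle cleared at that time, False if the robot
                  rerouted first (a right-censored observation).
    """
    samples = sorted(zip(durations, cleared))
    at_risk = len(samples)
    survival = 1.0
    curve = []
    i = 0
    while i < len(samples):
        t = samples[i][0]
        events = 0
        ties = 0
        # Group all observations that share the same time stamp.
        while i < len(samples) and samples[i][0] == t:
            events += 1 if samples[i][1] else 0
            ties += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Censored samples leave the risk set without lowering the curve.
        at_risk -= ties
    return curve


def expected_cost(curve, patience, detour_cost):
    """Expected time lost when waiting up to `patience`, then rerouting."""
    # E[min(T, patience)] equals the area under the step curve on [0, patience].
    waited, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t > patience:
            break
        waited += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    waited += prev_s * (patience - prev_t)
    # prev_s is now S(patience): the chance the obstacle is still blocking.
    return waited + prev_s * detour_cost


def best_patience(curve, detour_cost):
    """Pick the cheapest threshold among 0 and the observed clearance times."""
    candidates = [0.0] + [t for t, _ in curve]
    return min(candidates, key=lambda p: expected_cost(curve, p, detour_cost))


if __name__ == "__main__":
    curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
    print(curve)
    print(best_patience(curve, detour_cost=20.0))
```

Between two observed clearance times the expected cost only grows with the threshold, so checking zero and the observed times is enough for this simple rule. With the toy data above, a 20-second detour makes waiting worthwhile, while a 4-second detour makes immediate rerouting cheapest.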

2606.00985 2026-06-02 cs.RO 版本更新

Make Your VLA More Robust Without More Data By Interleaving Motion Planning

通过交错运动规划使您的VLA更鲁棒而无需更多数据

Dan BW Choe, Sundhar Vinodh Sangeetha, Samuel Coogan, Shreyas Kousik

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结提出MPVI框架，将基于模型的运动规划与视觉-语言-动作模型交错结合，通过VLM完成检查和本体感受触发实现可靠切换，无需额外训练即可提升长时域移动操作任务的鲁棒性，在BEHAVIOR-1K基准上任务进度提升113%。

详情

AI中文摘要

视觉-语言-动作（VLA）模型在移动操作方面取得了显著进展，但在长时域任务上的表现仍然较差。这些任务尤其具有挑战性，因为（1）必须在空间分布的子任务的长序列中保持对高层目标的进展，并且（2）早期执行错误会在任务时域内迅速累积。尽管在大规模人类遥操作移动操作数据上进行了微调，这些挑战仍然存在，表明仅靠更多数据可能无法解决问题。为了应对这些挑战，我们提出了MPVI：运动规划器/VLA交错框架，该框架将基于模型的运动规划与VLA集成，无需进一步训练即可提高鲁棒性。所提出的集成通过开放词汇目标检测、前沿探索和运动规划，实现了在杂乱场景中对远处或遮挡目标物体的定位和导航。然而，这种集成并非易事，需要模块之间的可靠切换；我们通过基于VLM的完成检查与本体感受触发器展示了一种可行的方法。我们在BEHAVIOR-1K基准上评估了我们的方法，并展示了在任务进度上比顶级端到端VLA基线提升113%。更多详情请访问项目页面：https://mpvi.netlify.app/。

英文摘要

Vision-Language-Action (VLA) models have shown remarkable progress for mobile manipulation, but their performance on long-horizon tasks remains poor. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across extended sequences of spatially distributed subtasks, and (2) early execution errors compound rapidly over the task horizon. These challenges persist despite finetuning on large-scale human-teleoperated mobile manipulation data, indicating that more data alone may not resolve the problem. To address these challenges, we propose MPVI: Motion Planner / VLA Interleaving, a framework that integrates model-based motion planning with VLAs to improve robustness without further training. The proposed integration enables localization of and navigation to distant or occluded target objects through cluttered scenes using open-vocabulary object detection, frontier exploration, and motion planning. However, such integration is non-trivial, requiring reliable switching between modules; we show one way forward via VLM-based completion checking with proprioceptive triggers. We evaluate our approach on the BEHAVIOR-1K benchmark and demonstrate a 113% improvement in task progress over a top end-to-end VLA baseline. Additional details are available at the project page: https://mpvi.netlify.app/.

2606.00966 2026-06-02 cs.RO 版本更新

模仿学习中的不可行优化问题与分层增广拉格朗日方法

Roland Andrews, Justin Carpentier, Ajay Sathya

发表机构 * University of Cambridge（剑桥大学）

AI总结针对模仿学习中约束不可行导致训练不稳定的问题，提出基于增广拉格朗日方法的解决方案，将策略引导至最近可行约束问题的解，并在驾驶示例中验证其有效性。

2606.00709 2026-06-02 cs.RO 版本更新

BEVIO: Efficient Bird's-Eye-View based Sparse-Update Visual-Inertial Odometry for Lunar Day-Night Navigation

BEVIO: 基于鸟瞰图的稀疏更新视觉-惯性里程计用于月球昼夜导航

Mohit Singh, Shehryar Khattak, Ashish Goel, Michael Paton, Kostas Alexis, Issa A. Nesnas

发表机构 * Jet Propulsion Laboratory, California Institute of Technology（喷气推进实验室，加州理工学院）； Autonomous Robots Lab at the Norwegian University of Science and Technology（挪威科学技术大学自主机器人实验室）

AI总结提出一种基于鸟瞰图的图像匹配方案，在极低视觉更新率下实现可靠的视觉-惯性里程计，适用于资源受限的月球车昼夜导航。

Comments Accepted at the 2026 IEEE International Conference on Robotics and Automation, Vienna

详情

AI中文摘要

视觉-惯性里程计（VIO）提供平滑、高频率的状态估计，已广泛应用于地面和行星应用的机器人导航。然而，其性能通常依赖于视觉更新的频率，这对于在极端资源约束和低帧率下运行的行星车来说是一个挑战。本文研究如何为月球车应用实现具有极稀疏视觉更新的可靠VIO，解决昼夜操作中自照明条件下特征关联特别困难的问题。我们提出了一种基于鸟瞰图（BEV）的图像匹配方案，该方案在较大的帧间运动和显著的视觉外观变化下仍能保持鲁棒性，实现更可靠的特征匹配。我们通过高保真照片级月球仿真和半比例月球车在加利福尼亚州普拉斯特城进行的长期昼夜部署实时机器人实验，广泛评估了我们提出的BEVIO方法。结果表明，我们的方法能够在低至0.25 Hz的视觉更新率下实现可靠的昼夜自照明穿越，突显了其在功耗和计算受限的月球车导航中的适用性。

英文摘要

Visual-Inertial Odometry (VIO) provides smooth, high-rate state estimates and has been widely used for robotic navigation in both terrestrial and planetary applications. However, its performance typically depends on the frequency of visual updates, which is a challenge for planetary rovers operating under extreme resource constraints and low frame rates. This work investigates enabling reliable VIO with very sparse visual updates for lunar rover applications, addressing both day and night-time operations where feature associations become especially difficult under self-illumination conditions. We propose a Bird's Eye View (BEV)-based image matching scheme that remains robust to larger inter-frame motions and enables more reliable feature matching despite significant visual appearance changes. We extensively evaluate our proposed approach, BEVIO, through high-fidelity photorealistic lunar simulation and real-time robotic experiments conducted using a half-scale lunar rover in a long-term day-night deployment at Plaster City, CA, USA. The results demonstrate that our method enables reliable day and nighttime self-illuminated traverses at visual update rates as low as 0.25 Hz, underscoring its suitability for navigation on power- and compute-limited lunar rovers.

2606.00702 2026-06-02 cs.RO cs.AI 版本更新

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

塑造你的身体：用于多形态机器人设计的价值梯度

Nico Bohlinger, Jan Peters

发表机构 * Technical University of Darmstadt（达姆施塔特工业大学）； Robotics Institute Germany (RIG)（德国机器人研究所）； German Research Center for AI (DFKI)（德国人工智能研究中心）； hessian.AI（黑森AI）

AI总结提出将通用多形态价值函数转化为可复用模型，通过价值梯度优化机器人设计，无需为每个机器人重新进行强化学习协同设计。

2606.00664 2026-06-02 cs.RO cs.CV 版本更新

SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models

SKIP: 用于高效具身世界模型的稀疏关键帧插值范式

Ziheng He, Yixiang Chen, Ning Yang, Zhanqian Wu, Qisen Ma, Yuan Xu, Jiabing Yang, Peiyan Li, Xiangnan Wu, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu, Yan Huang

发表机构 * UCAS（中国科学院大学）； CASIA（中国科学院自动化研究所）； NJU（南京大学）； GigaAI ； THU（清华大学）； FiveAges

AI总结提出稀疏关键帧插值范式(SKIP)，通过识别任务相关关键帧并仅生成这些帧，再基于机器人动作插值缺失帧，实现高效视频生成，在LIBERO上速度提升4.16倍，FVD降低89%，且生成视频作为训练数据时策略性能下降极小。

Comments 25 pages, 10 figures

详情

AI中文摘要

具身世界模型通过预测机器人动作如何影响周围场景，已成为机器人学中一种有前景的范式。然而，在像素空间中进行 rollout 推理在计算上仍然昂贵，因为长时程操作视频通常必须逐帧生成。这种成本不能通过不加区分地丢弃帧来轻易降低，因为下游策略依赖于对稀疏任务相关事件（如接近、接触、抓取和释放）的完整保留。为了解决这一挑战，我们提出了稀疏关键帧插值范式（SKIP），这是一种事件保留的稀疏到密集框架，避免了密集的逐帧生成。SKIP 首先通过利用机器人感知的多模态特征来识别任务相关的关键帧。然后，它仅用稀疏视频扩散模型合成这些关键帧。一个学习到的间隙预测器和一个动作条件插值器随后根据机器人动作重建缺失的间隔。在 LIBERO 上，SKIP 生成密集 rollouts 的速度比密集基线快 4.16 倍，同时提高了视觉保真度并将聚合 FVD 降低了 89.0%。重要的是，SKIP 生成的视频是有效的策略训练数据。即使它们完全替代真实演示，π_{0.5} 的成功率在 LIBERO 模拟中仅下降 1.3 个百分点，在真实机器人上下降 6.7 个百分点，而完全密集的逐帧生成则下降 48 到 58 个百分点。

英文摘要

Embodied world models have emerged as a promising paradigm in robotics by predicting how robot actions affect the surrounding scene. However, rollout inference remains computationally expensive in pixel space, as long-horizon manipulation videos typically have to be generated frame by frame. This cost cannot be easily reduced by indiscriminately dropping frames, since downstream policies rely on complete preservation of sparse task-relevant events such as approach, contact, grasp, and release. To address this challenge, we propose Sparse Keyframe Interpolation Paradigm (SKIP), an event-preserving sparse-to-dense framework that avoids dense frame-by-frame generation. SKIP first identifies task-relevant keyframes by leveraging robot-aware multimodal features. It then synthesizes only these keyframes with a sparse video diffusion model. A learned gap predictor and an action-conditioned interpolator subsequently reconstruct the missing intervals according to the robot actions. On LIBERO, SKIP generates dense rollouts $4.16\times$ faster than a dense baseline while improving visual fidelity and reducing aggregate FVD by $89.0\%$. Importantly, SKIP-generated videos are effective policy-training data. Even when they fully replace real demonstrations, $\pi_{0.5}$ success drops only $1.3$ pp in LIBERO simulation and $6.7$ pp on the real robot, whereas fully dense frame-by-frame generation collapses by $48$ to $58$ pp.

2606.00637 2026-06-02 cs.RO 版本更新

DriveAnchor: 用于自动驾驶规划的渐进式基于锚点的流学习

Limin Yan, Haoyun Tang, Yutao Qiu, Hongqing Liu, Haoyu Xu

发表机构 * Meituan Autonomous Driving（美团自动驾驶）； Xi’an Jiaotong University（西安交通大学）； Beijing Institute of Technology（北京理工大学）

AI总结提出三阶段框架DriveAnchor，通过示范流预训练、引导流后训练和奖励精炼流微调，实现行为多样性、可控性和安全性，在200万场景中近距碰撞率降低89%，平均奖励提升32%。

详情

AI中文摘要

我们提出DriveAnchor，一个用于自动驾驶规划的三阶段框架，在可组合流水线中实现行为多样性、可控性和安全性。示范流预训练通过最远点采样构建的2398个轨迹形状词汇表替代无结构高斯先验，在词汇覆盖中结构化地奠定行为多样性基础。引导流后训练联合后训练一个能量场模块与流匹配（FM），仅以静态道路几何为条件，在流生成前将锚点重新定位到用户指定的走廊多边形，无需可微引导即可增加可控性；在第二阶段后，新的走廊预设只需更新能量场，无需重新训练FM。奖励精炼流微调应用零阶强化学习，使每个锚点的输出与避碰目标对齐：由于流匹配模型在单步模式下是确定性前馈网络，每个锚点唯一确定输出轨迹，将奖励优化简化为锚点空间中的方向搜索，无需对数似然计算或ODE到SDE转换。在约200万个保留驾驶场景上的评估表明，DriveAnchor将近距碰撞率降低89%，平均奖励提升32%，且模仿精度不下降，在NVIDIA Drive Orin上推理时间为2.06毫秒。DriveAnchor已通过真实车辆测试验证，确认其适用于生产部署。

英文摘要

We present DriveAnchor, a three-stage framework for autonomous driving planning that achieves behavioral diversity, controllability, and safety in a composable pipeline. Demonstration Flow Pretraining replaces the unstructured Gaussian prior with a vocabulary of 2,398 trajectory shapes constructed by farthest-point sampling, structurally grounding behavioral diversity in vocabulary coverage. Guided Flow Post-training jointly post-trains an Energy Field module with flow matching (FM), conditioning the Energy Field on static road geometry alone, to relocate anchors toward user-specified corridor polygons before flow generation, adding controllability without differentiable guidance; after Stage 2, new corridor presets require only Energy Field updates, not FM retraining. Reward-Refined Flow Fine-tuning applies zeroth-order reinforcement learning to align each anchor's output with collision-avoidance objectives: because the flow-matching model is a deterministic feedforward network in single-step mode, each anchor uniquely determines the output trajectory, reducing reward optimization to a direction search in anchor space without log-likelihood computation or ODE-to-SDE conversion. Evaluated on approximately 2 million held-out driving scenarios, DriveAnchor reduces near-range collision rates by 89% and improves mean reward by 32% without degradation in imitation accuracy, with 2.06 ms inference on NVIDIA Drive Orin. DriveAnchor has been validated through real-world vehicle testing, confirming its practicality for production deployment.
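Farthest-point sampling, which the abstract names as the way the trajectory vocabulary is built, can be sketched as a greedy loop. This is a minimal illustration rather than the authors' implementation: the starting index, the distance function `trajectory_distance`, and the toy trajectories are all assumptions made for the example.

```python
import math


def trajectory_distance(a, b):
    """Mean pointwise Euclidean distance between two equal-length 2D trajectories."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def farthest_point_sampling(items, k, dist):
    """Greedily pick k indices, each as far as possible from those already chosen."""
    if not 0 < k <= len(items):
        raise ValueError("k must be between 1 and len(items)")
    chosen = [0]  # assumption: seed the vocabulary with the first item
    # min_dist[i] is the distance from item i to its nearest chosen item.
    min_dist = [dist(items[0], x) for x in items]
    while len(chosen) < k:
        nxt = max(range(len(items)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, x in enumerate(items):
            min_dist[i] = min(min_dist[i], dist(items[nxt], x))
    return chosen


if __name__ == "__main__":
    straight = [(0, 0), (1, 0), (2, 0)]
    nearly_straight = [(0, 0), (1, 0.1), (2, 0.2)]
    left = [(0, 0), (1, 1), (2, 2)]
    right = [(0, 0), (1, -1), (2, -2)]
    pool = [straight, nearly_straight, left, right]
    print(farthest_point_sampling(pool, 3, trajectory_distance))
```

In the toy pool the near-duplicate of the straight trajectory is skipped in favor of the two turning shapes, which is the coverage property that motivates using this sampler for a shape vocabulary.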

2606.00515 2026-06-02 cs.RO cs.AI cs.SY eess.SY 版本更新

PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation

PaCo-VLA: 用于富接触视觉-语言-动作操控的被动屏蔽柔顺先验

Haofan Cao, Zhaoyang Li, Zhichao You, Liang Guo, Tianrui Li

发表机构 * Southwest Jiaotong University（西南交通大学）； University of Leeds（利兹大学）

AI总结提出PaCo-VLA框架，通过被动屏蔽将VLA模型输出转化为任务级柔顺建议，并利用能量罐和边界检查防止无效预测绕过底层接触物理，实现安全精确的富接触操控。

Comments Under review, code will be available soon

详情

AI中文摘要

富接触操控既需要高层语义推理，也需要对高频接触动态的安全调节。虽然视觉-语言-动作（VLA）模型提供了前所未有的语义泛化能力，但其低速率输出缺乏在力敏感任务中直接控制执行器所需的可靠性。为弥合这一语义到控制的鸿沟，我们引入PaCo-VLA，一种被动屏蔽的柔顺先验，重新定义了VLA接口。PaCo-VLA不将直接电机指令托付给VLA，而是将网络输出视为任务级柔顺建议：语义绑定、任务阶段和导纳调度。一个高频、建议无关的被动屏蔽通过能量罐核算和边界检查来管理这些建议，防止无效、过时或未经验证的模型预测绕过底层接触物理。这种解耦架构还支持因果评估，将语义贡献与几何捷径分离。大量仿真和真实世界的连接器插入实验表明，PaCo-VLA在无屏蔽VLA基线上实现了卓越的精度，即使在对抗性柔顺偏移下也能保持零被动违规。该框架在导纳端口建立了一个可证明的采样被动运行时契约，并为在富接触领域部署基础模型提供了运行时接口。

英文摘要

Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.

2606.00470 2026-06-02 cs.RO cond-mat.soft 版本更新

A passive universal grasping mechanism based on an everting shell

基于外翻壳体的被动通用抓取机构

Mythra V. S. Balakuntala, Safvan Palathingal, G. K. Ananthasuresh

发表机构 * Indian Institute of Science（印度科学研究院）

AI总结提出一种基于弹性可变形双稳态壳体外翻的被动单片柔性抓取机构，通过梁段构成的抓取臂与外翻壳体协同工作，实现对任意形状刚性物体的包络抓取。

详情

DOI: 10.1007/978-981-15-4477-4_43

AI中文摘要

概念化了一种基于弹性可变形双稳态壳体外翻的被动单片柔性抓取机构。它由梁段构成的抓取臂与外翻壳体协同工作。该抓取器能够抓取任意形状的刚性物体，最大尺寸和重量受限于机构设计。双稳态壳体在接触物体时外翻，使抓取臂包裹物体形成封闭空间。机构保持该构型直到再次被驱动，使壳体恢复原始构型，从而打开封闭空间释放物体。臂的刚度决定机构的有效载荷，臂的尺寸决定可抓取的最大物体。臂具有分布式柔性，可适应物体形状而不施加过大压力。

英文摘要

A passive monolithic compliant grasping mechanism that works based on the eversion of an elastically deformable bistable shell is conceptualized. It comprises grasping arms made of beam segments that work in conjunction with the everting shell. The grasper is capable of picking up a stiff object of any shape up to a maximum size and weight. The bistable shell everts upon contact with the object, enabling the grasping arms to envelop the object and form an enclosure. The mechanism then stays in that configuration until it is actuated again to turn the shell back to its original configuration, thereby opening the enclosure to release the object. The stiffness of the arms determines the payload of the mechanism. The size of the arms determines the largest object that can be grasped and held. The arms have distributed compliance so that they can conform to the shape of the object without applying undue force on it.

2606.00459 2026-06-02 cs.RO cs.SY eess.SY 版本更新

Adaptive PD Gains for Energy-Conscious Control in Physical Human-Robot Interaction

物理人机交互中节能控制的自适应PD增益

Danyal Saqib, Francisco Andrade Chavez, Marie Charbonneau

发表机构 * University of Calgary（卡尔加里大学）； University of Waterloo RoboHub（滑铁卢大学RoboHub）

AI总结提出一种自适应PD控制器，通过限制机器人动能和势能实现安全物理人机交互，并给出稳定性证明与实验验证。

详情

DOI: 10.21428/d82e957c.37d70c9b
Journal ref: Proceedings of the 23rd Conference on Robots and Vision, 2026

AI中文摘要

柔顺力或力矩控制是常被研究以实现安全物理人机交互（pHRI）的方法。然而，这些方法存在局限性。力控制要求机器人配备外部力传感器以跟踪施加力的幅度和方向。力矩控制需要在每个关节进行力矩感知或估计。由于并非所有机器人都具备这些条件，基于能量的方法提供了一种有前景的替代方案。此类方法旨在通过限制机器人的机械能来实现安全的pHRI。当前利用基于能量方法的方案往往实现复杂，且部分可能需要进一步稳定性验证。因此，我们提出一种自适应比例-微分（PD）控制器，能够在任意给定限制下限制机器人的能量，以实现安全的pHRI。所提出的控制器可以同时限制机器人的动能和势能，并且控制器增益的行为可通过多种参数进行塑造，精确界定截止限制和锐度。我们为控制器构建了稳定性证明，并定义了确保控制器稳定性的条件。所提出控制器的行为和柔顺性在PAL Robotics的TALOS机器人上进行了仿真和硬件测试，验证了控制器预期的柔顺和能量限制行为。

英文摘要

Compliant force or torque control are approaches often investigated to achieve safe physical human-robot interaction (pHRI). However, these approaches have limitations. Force control requires a robot to be equipped with external force sensors to track the amplitude and direction of applied forces. Torque control requires torque sensing or estimation in each joint. As this is not available on every robot, energy-based approaches offer a promising alternative. Such approaches aim to achieve safe pHRI by limiting the mechanical energy of the robot. Current schemes leveraging an energy-based approach tend to have a complex implementation, and some may require further stability verification. We hence propose an adaptive proportional-derivative (PD) controller that can keep a robot's energy below any given limit to achieve safe pHRI. The proposed controller can limit both the kinetic and potential energy of a robot, and the behaviour of the controller gains can be shaped using various parameters that precisely define the cutoff limit and sharpness. We construct a stability proof for the controller and define a condition to ensure the controller's stability. The proposed controller's behaviour and compliance are tested on the TALOS robot from PAL Robotics both in simulation and on hardware, verifying the expected compliant and energy-limiting behaviour of the controller.
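To make the energy-limiting idea concrete, here is a hypothetical single-joint sketch. It is not the controller from the paper, whose gain law and shaping parameters are not given in the abstract. It only shows one simple way a proportional gain can be lowered so that the virtual-spring energy 0.5 * kp * error^2 never exceeds a chosen limit. The names `capped_stiffness`, `pd_torque`, `kp_max`, and `energy_limit` are assumptions for illustration.

```python
def capped_stiffness(error, kp_max, energy_limit):
    """Largest gain <= kp_max whose spring energy 0.5 * kp * error**2 stays within the limit."""
    if error == 0.0:
        return kp_max  # no stored energy at zero error, so the nominal gain is safe
    return min(kp_max, 2.0 * energy_limit / error ** 2)


def pd_torque(error, error_rate, kp_max, kd, energy_limit):
    """PD command with an energy-capped proportional gain.

    error      = desired position - measured position
    error_rate = time derivative of error
    """
    kp = capped_stiffness(error, kp_max, energy_limit)
    return kp * error + kd * error_rate


if __name__ == "__main__":
    # Small error: the nominal gain is kept. Large error: the gain is reduced.
    for e in (0.1, 1.0):
        kp = capped_stiffness(e, kp_max=100.0, energy_limit=2.0)
        print(e, kp, 0.5 * kp * e ** 2)
```

This hard cap switches abruptly at the energy limit; the abstract describes gains whose cutoff and sharpness can be shaped by parameters, which suggests a smoother transition than the `min` used here.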

2606.00449 2026-06-02 cs.RO 版本更新

ROG-Grasp: Root-Oriented Geometry for Robotic Grasping and Placement

ROG-Grasp：面向根部的几何方法用于机器人抓取与放置

Zijian An, Augustus Sroka, Ran Yang, Bill Cai, Satoru Eto, Brian Poon, Kelvin Cai, Shijie Geng, Feng Liu, Yiming Feng, Lifeng Zhou

发表机构 * Department of Electrical and Computer Engineering, Drexel University（德雷塞尔大学电气与计算机工程系）； Virginia Seafood Agricultural Research and Extension Center, and Department of Biological Systems Engineering, Virginia Tech（弗吉尼亚理工学院生物系统工程系和弗吉尼亚海鲜农业研究与推广中心）； Amazon Store Foundation AI (SFAI)（亚马逊商店基金会人工智能（SFAI））

AI总结提出基于根部表面几何的ROG-Grasp框架，通过RGB-D感知估计农产品朝向，结合YOLO检测器和点云平面拟合生成稳定抓取姿态，在番茄和洋葱实验中实现高成功率与快速执行。

Comments 7 pages, 6 figures. Video: https://youtu.be/Ir2UtGODdMo

详情

AI中文摘要

朝向感知操作在采后农业加工中至关重要，其中农产品必须以一致的配置被抓取和放置。本文提出ROG-Grasp，一种基于几何的机器人抓取和放置框架，通过RGB-D感知从根部表面几何估计农产品朝向。使用基于YOLO的根部检测器和点云平面拟合来推断根部法线，从而生成稳定的抓取姿态和朝向约束的笛卡尔运动规划。在番茄和洋葱上的实验表明，在孤立和杂乱场景中均具有高成功率和稳定的执行时间。与视觉-语言-动作（VLA）策略相比，所提出的方法实现了更可靠、更准确的抓取完成，且执行速度更快。这些结果突显了几何驱动感知对于实际朝向控制操作任务的有效性。我们的论文视频可在网上获取：https://youtu.be/Ir2UtGODdMo。

英文摘要

Orientation-aware manipulation is essential in post-harvest agricultural processing, where produce must be grasped and placed in consistent configurations. This paper presents ROG-Grasp, a geometry-based robotic grasping and placement framework that estimates the produce orientation from root surface geometry using RGB-D perception. A YOLO-based root detector and point cloud plane fitting are used to infer the root normal, enabling stable grasp pose generation and orientation-constrained Cartesian motion planning. Experiments on tomatoes and onions demonstrate high success rates and stable execution time in both isolated and cluttered scenarios. Compared with vision-language-action (VLA) policies, the proposed method achieves more reliable and accurate grasp completion with faster execution. These results highlight the effectiveness of geometry-driven perception for practical orientation-controlled manipulation tasks. A video of our paper is available online at https://youtu.be/Ir2UtGODdMo.

2606.00418 2026-06-02 cs.RO cs.HC 版本更新

Literary Emotions in Motion: A Soft Robotics Installation for Tactile Storytelling

文学情感在运动中：用于触觉叙事的软体机器人装置

Carolina Silva-Plata, Abraham Villavicencio-Carmona, Miguel Silva Plata, Stefan Escaida, Ruben Fernandez

发表机构 * Department of Mechanical Engineering, University of Chile（智利大学机械工程系）； Independent Researcher（独立研究员）； Bolivian Catholic University（玻利维亚天主大学）； Institute of Engineering Sciences, University of O’Higgins（奥希金斯大学工程科学研究所）

AI总结提出一种将叙事文本语义情感分析映射到软体气动模块可变刚度的交互装置，通过用户研究评估刚度与LED强度多感官耦合对情感感知的影响。

Comments 8 pages, 8 figures

详情

DOI: 10.1109/MRA.2026.3693101
Journal ref: IEEE Robotics and Automation Magazine, 2026

AI中文摘要

软体机器人越来越多地在艺术语境中被探索，其中触觉交互为观众提供了超越视觉或听觉信号的具身参与。本作品展示了一个交互装置，将叙事文本的语义情感分析映射到软体气动模块的可变刚度。一个自然语言模型从预定义的六种情感中识别出两种主导情感，驱动七个六边形排列的软体执行器充气。中心执行器代表主要情感，而周围的执行器表达次要情感。我们开发并机械表征了称为软模块的硅胶执行器，其具有薄膜层，展示了这种形态控制如何扩展可实现的刚度范围，同时保持简单性和低成本制造。一项包含十名参与者的用户研究进一步评估了刚度和LED强度的多感官耦合如何影响情感感知。结果表明，伴随颜色变化的刚度调制可以支持软体机器人装置中具有情感意义和吸引力的触觉交互。

英文摘要

Soft robotics is increasingly explored in artistic contexts, where tactile interaction provides audiences with embodied engagement beyond visual or auditory signals. This work presents an interactive installation that maps semantic emotion analysis of narrative text into variable stiffness of soft pneumatic modules. A natural language model identifies two dominant emotions from a predefined set of six, driving the inflation of seven hexagonally arranged soft actuators. The central actuator represents the primary emotion, while the surrounding ones express the secondary. We develop and mechanically characterize silicone actuators, called soft modules, featuring a thin membrane layer, demonstrating how this morphological control expands the achievable stiffness range while preserving simplicity and low-cost fabrication. A user study with ten participants further evaluates how multisensory coupling of stiffness and LED intensity influences emotional perception. The results suggest that stiffness modulation accompanied by color change can support emotionally meaningful and engaging tactile interaction in soft robotic installations.

2606.00397 2026-06-02 cs.RO cs.SY eess.SY 版本更新

SoFiE: Soft Finger Exoskeleton for Intelligent Grasping

SoFiE: 用于智能抓取的软手指外骨骼

Magnus Malthe Sigsgaard Nielsen, Nicklas Nikolaj Grønvall, Xiaofeng Xiong, Saravana Prashanth Murali Babu

发表机构 * SDU Soft Robotics, SDU Biorobotics, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark (SDU)（SDU柔性机器人实验室、SDU生物机器人实验室、马士基麦金尼莫勒研究所、南丹麦大学）

AI总结本文提出一种模块化软手指外骨骼SoFiE，采用3D打印柔性材料、肌腱驱动和集成触觉传感，实现轻量化、低轮廓的抓取辅助与智能感知。

详情

AI中文摘要

软体可穿戴机器人系统已成为辅助手部功能减退个体的有前景解决方案。本文提出SoFiE，一种模块化软手指外骨骼，旨在辅助抓取任务中的食指屈曲。该系统主要采用3D打印柔性材料制造，实现了轻量、低轮廓和模块化设计。驱动通过紧凑型直流电机驱动的肌腱机构实现，而被动伸展由柔性导电弹簧提供。该元件称为StretchSense，通过变形下的电阻变化也作为本体感受传感器。此外，引入了一种新颖的触觉传感方法MagSense，使用嵌入软指尖结构中的磁铁和磁力计对来估计接触力和物体柔顺性。该系统完全无线，并由嵌入式微控制器控制。此外，通过电机编码器反馈的驱动器级传感能够估计系统状态，为安全和自适应控制策略提供基础。实验验证表明，该系统能够提供可靠的姿态估计，区分不同刚度的材料，并在不同抓取任务中生成独特的传感器特征。本文详细介绍了所提出的外骨骼的设计、制造和传感概念，作为模块化、软体和辅助可穿戴机器人的概念验证。

英文摘要

Soft wearable robotic systems have emerged as a promising solution for assisting individuals with reduced hand function. This paper presents SoFiE, a modular soft finger exoskeleton designed to assist index-finger flexion during grasping tasks. The proposed system is primarily fabricated using 3D-printed flexible materials, enabling a lightweight, low-profile, and modular design. Actuation is achieved through a tendon-driven mechanism powered by a compact DC motor, while passive extension is provided by a compliant conductive spring. This element, termed StretchSense, also functions as a proprioceptive sensor by exhibiting resistance changes under deformation. Furthermore, a novel tactile sensing approach, MagSense, is introduced, using a magnet and magnetometer pair embedded in a soft fingertip structure to estimate contact force and object compliance. The system is fully untethered and controlled by an embedded microcontroller. In addition, actuator-level sensing through motor encoder feedback enables estimation of the system state, providing a foundation for safe and adaptive control strategies. Experimental validation demonstrates the capability of the system to provide reliable pose estimation, distinguish between materials with different stiffness, and generate distinct sensor signatures across different grasping tasks. This paper details the design, fabrication, and sensing concepts of the proposed exoskeleton as a proof of concept toward modular, soft, and assistive wearable robotics.

2606.00383 2026-06-02 cs.RO cs.LG cs.SY eess.SY 版本更新

Behavior Cloning of MPC for 3-DOF Robotic Manipulators

三自由度机械臂MPC的行为克隆

Theo Guegan, Dexter Wen Jie Teo

发表机构 * University of Waterloo（滑铁卢大学）； Universite de Technologie de Compiègne（贡比涅技术大学）； Nanyang Technological University（南洋理工大学）； Polytechnique Montréal（蒙特利尔理工学院）

AI总结针对MPC实时计算负担重的问题，采用行为克隆方法近似MPC策略，通过多种神经网络架构实现三自由度机械臂的实时控制，在宽松容差下推理延迟降低3倍，成功率84.98%。

Comments Accepted at the IEEE ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning (RL4IL), 6 pages excluding references

详情

AI中文摘要

虽然模型预测控制（MPC）提供了强大的稳定性和鲁棒性，但它给实时系统带来了显著的计算负担。本文研究了行为克隆在近似MPC策略以实时控制三自由度机械臂中的应用。我们提出了一个结合逆运动学与MPC的基线控制器，并评估了从经典回归算法到深度学习模型（包括深度MLP和RNN）的神经网络架构，以推导计算高效的替代策略。我们分析了泛化能力、稳定性考虑以及不同架构选择固有的权衡。我们的实证研究采用了在线和离线评估，以评估在准确性、计算效率和对原始MPC策略的忠实度方面的性能。结果表明，行为克隆可以有效减少三自由度机械臂MPC策略的计算负担，在宽松容差下推理延迟降低3倍，成功率达到84.98%。值得注意的是，我们发现静态架构优于时间变体，证实了瞬时状态观测对此任务的充分性。然而，在严格容差下我们观察到精度差距，这表明虽然行为克隆捕获了全局最优轨迹，但需要进一步研究以最小化终端稳态误差。

英文摘要

While Model Predictive Control (MPC) provides strong stability and robustness, it imposes a significant computational burden on real-time systems. This paper investigates the application of Behavior Cloning to approximate MPC policies for the real-time control of a 3-degree-of-freedom robotic manipulator. We present a baseline controller combining Inverse Kinematics with MPC and evaluate learned models, ranging from classical regression algorithms to deep learning models including Deep MLPs and RNNs, to derive computationally efficient surrogate policies. We analyze generalization capabilities, stability considerations, and the trade-offs inherent in different architectural choices. Our empirical study employs both online and offline evaluations to assess performance regarding accuracy, computational efficiency, and fidelity to the original MPC policy. Our results demonstrate that Behavior Cloning can effectively reduce the computational burden of MPC policies for 3-DOF robotic manipulators, achieving a 3x reduction in inference latency with an 84.98% success rate under relaxed tolerances. Notably, we find that static architectures outperform temporal variants, confirming the sufficiency of instantaneous state observations for this task. However, we observe a precision gap under strict tolerances, which suggests that while Behavior Cloning captures the global optimal trajectory, further research is needed to minimize terminal steady-state error.

2606.00374 2026-06-02 cs.RO 版本更新

Constrained Whole-Body Tracking for Humanoid Robots

人形机器人的约束全身跟踪

Daniel Morton, Pranit Mohnot, Marco Pavone

发表机构 * Stanford University（斯坦福大学）； NVIDIA Research（NVIDIA研究）

AI总结提出 ConstrainedMimic 框架，结合操作空间控制与控制障碍函数，在强化学习跟踪策略中实现实时约束满足，用于人形机器人全身运动跟踪与遥操作。

详情

AI中文摘要

强化学习的最新进展已展示出人形机器人令人印象深刻的全身灵活性，但确保安全性和满足约束（特别是训练后指定的约束）仍然是一个挑战。为此，我们提出了 ConstrainedMimic，一个利用全身运动学和动力学在 RL 跟踪策略中实时执行约束的控制框架。通过整合操作空间控制和控制障碍函数（CBF）的原理，我们能够满足对运动学参考运动和底层动力学的任意运行时约束。在（模拟的）Unitree G1 上使用学习策略进行的全身运动跟踪和遥操作实验中，我们展示了碰撞避免（包括与机器人自身和外部障碍物的碰撞）、关节限制和质心稳定性约束。通过保持与当前接触模式和跟踪目标一致，我们在约束激活时最小化地限制了策略的能力。我们的方法完全可微，可在 CPU、GPU 和 TPU 上运行，并能以高达 300-500 Hz 的频率部署。所有软件将在发表后免费提供。

英文摘要

Recent advances in reinforcement learning (RL) have demonstrated impressive whole-body agility for humanoid robots, yet ensuring safety and satisfying constraints -- particularly those specified after training -- remains a challenge. Towards this goal, we present ConstrainedMimic, a control framework that leverages whole-body kinematics and dynamics for real-time constraint enforcement within RL tracking policies. By integrating principles from operational space control and control barrier functions (CBFs), we enable the satisfaction of arbitrary runtime constraints on both the kinematic reference motion and the underlying dynamics. In whole-body motion-tracking and teleoperation experiments on a (simulated) Unitree G1 with a learned policy, we demonstrate collision avoidance (both with the robot body and external obstacles), joint limits, and center of mass stability constraints. By remaining consistent with the current contact mode and tracking objectives, we minimally restrict the capabilities of the policy when constraints are active. Our method is fully differentiable, runs on CPU, GPU, and TPU, and can be deployed at up to 300-500 Hz. All software will be freely available upon publication.
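As background for the control-barrier-function idea mentioned above, the textbook CBF safety filter can be written in closed form when there is a single constraint. The sketch below assumes a simple single-integrator model (state derivative equals the command), which is far simpler than the whole-body dynamics the abstract describes, and the function name `cbf_filter` and its arguments are hypothetical. It solves: minimize ||u - u_nom||^2 subject to grad_h . u + alpha * h >= 0.

```python
def cbf_filter(u_nom, grad_h, alpha_h):
    """Minimally modify u_nom so that grad_h . u + alpha_h >= 0 holds.

    u_nom:   nominal command, e.g. from a learned tracking policy
    grad_h:  gradient of the barrier function h at the current state
    alpha_h: alpha * h(x), the allowed rate of approach to the boundary
    """
    slack = sum(g * u for g, u in zip(grad_h, u_nom)) + alpha_h
    if slack >= 0.0:
        return list(u_nom)  # the nominal command already satisfies the constraint
    norm_sq = sum(g * g for g in grad_h)
    if norm_sq == 0.0:
        raise ValueError("constraint is infeasible: zero gradient with negative margin")
    # Project u_nom onto the constraint boundary along grad_h.
    return [u - slack * g / norm_sq for u, g in zip(u_nom, grad_h)]


if __name__ == "__main__":
    # Joint limit q <= q_max, so h(q) = q_max - q and grad_h = [-1].
    q, q_max, alpha = 0.9, 1.0, 5.0
    h = q_max - q
    print(cbf_filter([2.0], [-1.0], alpha * h))  # fast approach is slowed down
    print(cbf_filter([0.2], [-1.0], alpha * h))  # slow approach passes through
```

With several simultaneous constraints the projection no longer has this one-line form and a quadratic-program solver is normally used instead.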

2606.00355 2026-06-02 cs.RO 版本更新

FAIR^2 Drones: An AI-Ready Standard for Cross-Domain Wildlife Drone Datasets

FAIR^2 Drones：跨领域野生动物无人机数据集的AI就绪标准

Jenna Kline, Kilian Meier, Vandita Shukla, Edouard G. A. Rolland, Elena Iannino, Lucie Laporte-Devylder, Constanza Andrea Molina Catricheo, Blair Costelloe, Elizabeth Campolongo, Henrik S. Midtiby, Devis Tuia, Benjamin Risse, Ulrik P. S. Lundquist, Anders Lyhne Christensen, Fabio Remondino, Thomas Richardson, Tanya Berger-Wolf

发表机构 * The Ohio State University, Department of Computer Science and Engineering（俄亥俄州立大学计算机科学与工程系）； School of Civil, Aerospace and Design Engineering, University of Bristol（布里斯托尔大学土木、航空航天与设计工程学院）； 3D Optical Metrology (3DOM), Fondazione Bruno Kessler (FBK)（布鲁诺·凯斯勒基金会（FBK）3D光学计量（3DOM）部门）； Computer Vision and Machine Learning Systems Group, Institute for Geoinformatics, University of Muenster（明斯特大学地理信息研究所计算机视觉与机器学习系统组）； Unmanned Aerial Systems Center, University of Southern Denmark（南丹麦大学无人机系统中心）； Department of Collective Behavior, Max Planck Institute of Animal Behavior（马克斯·普朗克动物行为研究所集体行为系）； Department of Biology, University of Konstanz（康斯坦茨大学生物学系）； Department of Biology, University of Southern Denmark（南丹麦大学生物学系）

AI summary: Proposes the FAIR^2 Drones standard, which combines the FAIR and AI-ready data frameworks and adds platform metadata and annotation specifications, so that a drone dataset can support ecological analysis, robotics algorithm development, and computer vision benchmarking at the same time.

Abstract

Animal ecology data collection using drones represents a substantial investment of time, expertise, and financial resources. Yet most existing datasets serve only a single research community, limiting interdisciplinary reuse. We propose a unified drone dataset standard, FAIR^2 Drones, that bridges ecology, robotics, and computer vision by building on existing FAIR and AI-ready data frameworks while adding essential platform metadata and annotation specifications. Our standard enables datasets to simultaneously support ecological analysis, robotics algorithm development, and computer vision benchmarking. We provide open-source validation tools, reference implementations, and multimodal extensions linking drone imagery with complementary sensors such as camera traps, GPS, and acoustics. By standardizing metadata across disciplines, this framework maximizes the scientific return on investment for costly field deployments and accelerates cross-domain collaboration in environmental monitoring.

2606.00318 2026-06-02 cs.RO cs.CV updated version

Predicted-Flow Control Barrier Functions for Real-Time Safe Optimal Control

Amirsaeid Safari, Jesse B. Hoagg

Affiliations: Department of Mechanical and Aerospace Engineering, University of Kentucky

AI summary: Proposes predicted-flow control barrier functions (P-CBFs), which generalize a CBF to a functional of the predicted flow. Combined with a terminal candidate P-CBF and a planning-time shift, they yield a safety certificate over a finite prediction horizon and unify finite-horizon integral-cost optimization with safety certification.

Abstract

Control barrier functions (CBFs) provide real-time safety guarantees through pointwise conditions on the state. However, synthesizing a valid CBF is difficult and the resulting controllers are myopic. To address myopia, this article introduces predicted-flow control barrier functions (P-CBFs), which generalize the CBF from a function of the current state to a functional of a predicted flow under a parametrized control plan over a finite prediction horizon. For safety, a P-CBF can certify that the predicted flow is in a safe set over the entire prediction horizon. However, candidate P-CBFs suffer from the same challenge as candidate CBFs, namely, control constraints make it difficult to guarantee that the P-CBF is valid. This article resolves this challenge by introducing a terminal candidate P-CBF requiring that the predicted flow end in a backup safe set at the terminal time, and a planning-time shift that modulates the prediction horizon, providing an additional degree of freedom to ensure feasibility. The real-time control and the evolution of the control-plan parameter and planning-time shift are determined jointly by a single convex optimization that is guaranteed to be feasible and renders the associated safe set forward invariant. The resulting safe optimal flow control provides a safety certificate over the entire prediction horizon and unifies finite-horizon integral-cost optimization with safety certification. This optimization reduces to a quadratic program (QP) if the control constraints are a convex polytope. The QP implementation, termed FlowBarrier, is validated on a nonholonomic ground robot navigating a dense environment. FlowBarrier is compared to nonlinear model predictive control and two CBF-based safety filter methods across 100 trials, where FlowBarrier achieves the highest goal-reaching rate, zero safety violations, and the lowest computation time.
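For readers unfamiliar with the pointwise condition that the abstract calls myopic, here is a minimal sketch of a standard CBF safety filter. It is not the P-CBF or FlowBarrier method. The dynamics (a 1D single integrator x' = u), the barrier h(x) = x_max - x, the gain alpha, and all function names are illustrative assumptions. With one scalar constraint, the usual minimum-norm QP reduces to a clamp.

```python
def cbf_filter(x, u_nom, x_max=1.0, alpha=2.0):
    """Return the control closest to u_nom that satisfies the CBF condition.

    Assumed toy model: x' = u, safe set {x : h(x) = x_max - x >= 0}.
    CBF condition: dh/dt >= -alpha * h  <=>  -u >= -alpha * h  <=>  u <= alpha * h.
    """
    h = x_max - x          # barrier value, nonnegative inside the safe set
    u_bound = alpha * h    # largest control allowed by the pointwise condition
    return min(u_nom, u_bound)


def simulate(x0=0.0, u_nom=1.0, dt=0.01, steps=1000):
    """Forward-Euler rollout of the filtered closed loop (hypothetical setup)."""
    x = x0
    for _ in range(steps):
        x += dt * cbf_filter(x, u_nom)
    return x
```

The filter looks only at the current state, so it begins to slow down only once the bound becomes active. That is the myopia that a prediction over a finite horizon is meant to address.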

2606.00267 2026-06-02 cs.CV cs.AI cs.LG cs.RO updated version

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy

Affiliations: Carnegie Mellon University; NVIDIA Research; University of Washington; Stanford University

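AI summary: Proposes StressDream, which optimizes the initial noise of a diffusion-based video world model at inference time to steer generation toward high-impact yet plausible futures, supporting robust policy evaluation and improvement.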

Comments Project page: https://junwon.me/StressDream/

Abstract

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.

2606.00253 2026-06-02 cs.RO cs.LG updated version

Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation

Pau Montagut Bofi, Mario García Blasco, Tessa Pulli, Markus Vincze

Affiliations: University of California, Berkeley; ETH Zurich

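AI summary: When vision-language-action models are fine-tuned for mobile manipulators with heterogeneous joint spaces, the checkpoint with the lowest total MSE is not the one that performs best in practice. The authors propose per-group error as a more reliable checkpoint-selection metric.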

Comments 4 pages, 3 figures, 3 tables. Accepted as poster at ICRA 2026 Workshop "From Data to Decisions: VLA Pipelines for Real Robots". Code: https://github.com/paumontagut/per-group-mse-vla

Abstract

Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the real robot. We argue this is a predictable consequence of collapsing heterogeneous joint groups (arm, gripper, head, wheeled base) into a single metric, where easy-to-predict joints can mask joints that still fail. We fine-tune SmolVLA (450M, action-expert only) on the 11-DoF Toyota HSR and compare it against $\pi_{0.5}$ (3.3B), a stronger pretrained baseline. Per-group analysis exposes two patterns: in SmolVLA, the mobile base converges slowest and limits overall performance. In expert-only fine-tuning of $\pi_{0.5}$ (training only the action head, backbone frozen), total MSE drops below the baseline but arm accuracy degrades. On 60 real-robot trials (20 per model), $\pi_{0.5}$ 80k (4.0/4) significantly outperforms both fine-tuned variants (expert-only 3k: 3.75/4; HSR-SmolVLA: 3.5/4; Mann-Whitney $p \leq 0.010$), despite expert-only 3k having the lowest total MSE. This separation is most consistent with the offline arm-group error, not total MSE or base-group error. We conclude that per-group error is a more reliable signal than total MSE for checkpoint selection on robots with heterogeneous action spaces. Code: https://github.com/paumontagut/per-group-mse-vla
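The masking effect described in the abstract can be reproduced with a few lines of arithmetic. The sketch below is not taken from the linked repository. The index layout of the 11 action dimensions (5 arm, 1 gripper, 2 head, 3 base) and the function name are assumptions made for illustration only.

```python
# Hypothetical index layout for an 11-dimensional action vector.
GROUPS = {
    "arm": range(0, 5),
    "gripper": range(5, 6),
    "head": range(6, 8),
    "base": range(8, 11),
}


def per_group_mse(pred, target, groups=GROUPS):
    """Return ({group: MSE}, total MSE) for a batch of action vectors."""
    # Squared error for every sample and every action dimension.
    sq = [[(p - t) ** 2 for p, t in zip(p_row, t_row)]
          for p_row, t_row in zip(pred, target)]
    n = len(sq)
    by_group = {}
    for name, idx in groups.items():
        idx = list(idx)
        by_group[name] = sum(row[i] for row in sq for i in idx) / (n * len(idx))
    total = sum(sum(row) for row in sq) / (n * len(sq[0]))
    return by_group, total
```

As a usage example, a checkpoint that is off by 0.2 on the arm and perfect elsewhere has a lower total MSE than one that is off by 0.15 everywhere, yet its arm-group error is higher. Selecting on total MSE would pick the checkpoint with the worse arm.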

2606.00252 2026-06-02 cs.RO cs.LG updated version

HOIST: Humanoid Optimization with Imitation and Sample-efficient Tuning for Manipulating Suspended Loads

Songyang Liu, Shunyu Yao, Dingyuan Huang, Shuai Li

Affiliations: Department of Civil and Coastal Engineering, University of Florida

AI summary: Proposes HOIST, which combines imitation learning with sample-efficient batched reinforcement learning to improve placement accuracy and stopping behavior when a humanoid robot manipulates suspended loads.

Abstract

Manipulating suspended payloads with humanoid robots is challenging because the robot can only influence an underactuated, oscillatory load through whole-body motion and intermittent contact. Imitation learning provides safe initial behavior but does not directly optimize final placement, while reinforcement learning from scratch is unsafe and sample-inefficient on real humanoids. We present HOIST (Humanoid Optimization with Imitation and Sample-efficient Tuning) for manipulating suspended loads. HOIST first finetunes a high-level vision-language-action (VLA) policy from virtual-reality (VR) teleoperation demonstrations and executes its commands through a whole-body controller. It then uses VLA rollouts and iterative batched RL to improve placement accuracy and stopping behavior. Experiments in simulation and on a real humanoid show that HOIST improves over imitation-only and additional-demonstration baselines; compared with pure VLA rollouts, HOIST reduces translational placement error by 19.9 cm and raw angular error by 3.56 degrees, demonstrating the potential of humanoids for underactuated material-handling tasks.

2606.00201 2026-06-02 cs.RO updated version

Series-Parallel Integrated Nonlinear Elastic Actuator applied to the lean motion of a bicycle simulator

Christina Kohler, Michiel Plooij, Nuria Peña-Perez, Arend L. Schwab, Heike Vallery

Affiliations: Institute of Automatic Control, RWTH Aachen University; Demcon Life Sciences & Health; Hapticlink Technologies; Department of BioMechanical Engineering, Delft University of Technology; Department of Rehabilitation Medicine, Erasmus MC

AI summary: Proposes the Series-Parallel Integrated Nonlinear Elastic Actuator (SPINEA), in which a nonlinear transmission lets a single elastic element act in series and in parallel at once, giving high torque and precise torque tracking. It is applied to the lean motion of a bicycle simulator.

Abstract

Designing robots for high-torque, high-fidelity haptic interaction is challenging. Parallel Elastic Actuators (PEAs) use elastic elements in parallel to smaller motors to complement torques, and Series Elastic Actuators (SEAs) use elastic elements in series to decouple motor impedance and improve force control. Recent work combines SEAs and PEAs to obtain both benefits but requires separate elastic elements or clutching. This paper presents the Series Parallel Integrated Nonlinear Elastic Actuator (SPINEA), which merges SEA and PEA such that a single elastic element takes on dual roles simultaneously, parallel and series. This is achieved by a nonlinear transmission in which the motor and load have misaligned rotation axes and are elastically connected. This geometry enables both high peak torque and precise torque tracking. We apply SPINEA to actuate lean of a haptic bicycle simulator, which requires high moments and precise rendering for safe and realistic rider interactions. We realized a prototype and performed experiments, both with an external excitation setup and with riders cycling. Our results confirm SPINEA's low impedance and precise torque tracking, up to 4.25 Hz with the bicycle frame fixed and up to 4 Hz with riders. The benefits may transfer to other applications requiring compact, high-performance actuation.

2606.00197 2026-06-02 cs.RO updated version

Cuttlebot: a platform demonstration for complex, autonomous, bio-inspired swimmers

Alexander Nicholas White, Ang Leo Li, Alexander Yin, Derrick Roseman, Valeria Saro-Cortes, Hannah Wiswell, Aimy Wissa, Mihai Duduta

Affiliations: School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut; Department of Mechanical and Industrial Engineering, University of Toronto; University of Connecticut, Institute of Materials Science; Department of Mechanical and Aerospace Engineering, Princeton University

AI summary: Presents CORE, an autonomous robotic platform that drives six artificial muscles while sensing visual and spatial information, and validates it with Cuttlebot, a cuttlefish-inspired robot that swims in three dimensions using undulating fins.

Abstract

Increasing interest in deep-sea operations and resources motivates the development of ecologically sensitive but environmentally durable robots. Dielectric elastomer actuator artificial muscles are good candidates for powering such systems due to their pressure and temperature tolerance and soft makeup, but they are difficult to integrate with robotic systems. This work presents an autonomous robotic platform: the CORE, capable of driving six artificial muscles while sensing visual and spatial information. To validate the platform, we developed the Cuttlebot - a cuttlefish-inspired robot that swims in three dimensions using undulatory fin locomotion. The Cuttlebot has four primary artificial muscles in its fins in addition to a tentacle-inspired soft gripper. The robot was evaluated in a series of tethered and untethered swimming tests, demonstrating a top speed of 2.5 centimeters per second translation and 10 degrees per second rotation. Furthermore, the CORE system was capable of driving specialized control signals into the artificial muscles to controllably output force and torque in six axes. This work provides a platform for developing complex, bio-inspired swimming robots for ocean exploration and monitoring, laying the foundation with our leading example: the Cuttlebot.

2606.00191 2026-06-02 cs.RO cs.CV updated version

Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models

Nishad Sahu, Kalpana Panda, Congyuan Yu, Changzhong Qian, Shounak Sural, Ragunathan Rajkumar

Affiliations: Carnegie Mellon University; Birla Institute of Technology and Science Pilani

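AI summary: End-to-end autonomous driving models are brittle in common safety-critical scenarios. The paper proposes the Safe2Drive scenario set and the Safe Driving Score (SDS). Evaluation shows that the driving scores of leading models drop sharply on these scenarios and that their SDS is low.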

Journal ref: CVPR Workshops 2026

Abstract

Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces the Safe Driving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.

2606.00162 2026-06-02 cs.RO cs.CV cs.LG updated version

Modeling Robotics Dataset Construction as an Artifact-Based Build Process

Leon Pohl, Lukas Beer, George Sebastian, Mirko Maehlisch

Affiliations: Institute for Autonomous Driving, University of the Bundeswehr Munich

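AI summary: Models robotics dataset construction as an artifact-based build process and implements it in the open-source tool Bagzel, which uses dependency-graph management and incremental builds to cut dataset update latency. Experiments show speedups of up to 386x in iterative workflows.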

Comments Accepted at the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), 6 pages, 6 figures, 2 tables

Abstract

Robotic systems generate large volumes of multimodal sensor data, but converting ROS bag recordings into machine learning datasets is often handled by ad hoc sequential scripts, creating engineering overhead and slow iteration cycles. We model dataset construction as an artifact-based build process over a dependency graph and implement this approach in Bagzel, an open-source Bazel extension for reproducible, incremental dataset generation (including nuScenes-format export). We compare Bagzel and Bagzel-xattr (server-side digest management) against a sequential rosbag2nuscenes baseline. Bagzel reduces runtime in all evaluated execution modes, with the largest gains in iterative workflows (up to 386.26x in warm builds and 7.21x in incremental builds on a 20.4 GB dataset). Across dataset sizes from 5.1 to 20.4 GB, Bagzel variants show markedly better scaling behavior than the baseline, especially in warm and incremental modes. Bagzel-xattr provides additional gains, with a mean runtime reduction of 5.9% compared to Bagzel in the input granularity study. Overall, modeling robotics dataset construction as an artifact-based build process substantially reduces dataset update latency while maintaining a deterministic build design that supports reproducibility. Bagzel is publicly available at https://github.com/UniBwTAS/bagzel.
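The gap between warm, incremental, and cold builds comes from caching artifacts by the digest of their inputs. The following is a minimal sketch of that idea in plain Python. It does not use Bazel or Bagzel APIs, and `ArtifactCache`, `extract_frames`, and `build_dataset` are hypothetical names introduced for illustration.

```python
import hashlib


class ArtifactCache:
    """Toy content-addressed cache: an action reruns only if its inputs change."""

    def __init__(self):
        self.store = {}       # digest of (action name, inputs) -> artifact bytes
        self.executions = 0   # how many actions were actually executed

    def build(self, name, action, inputs):
        h = hashlib.sha256(name.encode())
        for blob in inputs:
            # Hash each input separately so input boundaries stay unambiguous.
            h.update(hashlib.sha256(blob).digest())
        key = h.hexdigest()
        if key not in self.store:
            self.store[key] = action(inputs)
            self.executions += 1
        return self.store[key]


def extract_frames(inputs):
    # Stand-in for an expensive per-recording conversion step.
    return b"frames:" + b"|".join(inputs)


def build_dataset(cache, bags):
    """Two-level dependency graph: one extract node per bag, then one merge node."""
    parts = [cache.build("extract", extract_frames, [bag]) for bag in bags]
    return cache.build("merge", lambda blobs: b"\n".join(blobs), parts)
```

With three recordings, a cold build executes four actions, a warm rebuild executes none, and editing a single recording reruns only its extract step and the merge step.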

2606.00145 2026-06-02 cs.RO cs.AI updated version

Completion at the Boundary (CaB): Deployable Switching with Completion-Aware Control under Limited Calibration

Yusuke Sano, Takeshi Itoga

Affiliations: Intelligent Systems Laboratory, SECOM Co., Ltd.

AI summary: Proposes Completion at the Boundary (CaB), which uses Boundary-Phase Tokens (Before/Hit/After) to retain two-sided boundary evidence, enabling completion-aware switching for VLA agents under limited calibration and improving composite-instruction execution and handoff quality.

Abstract

Vision-language-action (VLA) agents can execute natural-language instructions, yet deployed systems still lack an operational interface: deciding when the instruction is complete. This gap is acute in short composites ("do A, then B"), where mistimed handoffs cascade into downstream failures. Completion is inherently closed-loop because switching is an intervention that changes the instruction context and thus future actions and observations. We study completion under a deployable low-calibration regime motivated by open-ended instruction spaces, enforcing no test-time relearning and a single globally calibrated switching rule selected once on the development set and reused unchanged on the test set. Under this constraint, collapsing asymmetric boundary evidence into a single scalar can be brittle under polarity shifts across tasks. We propose Completion at the Boundary (CaB), which predicts an event-local completion object in the form of Boundary-Phase Tokens (Before/Hit/After), retaining two-sided boundary evidence under this discipline. CaB-When converts this completion object into a minimal, auditable switching decision (when), while CaB-How reuses the same completion object to condition action generation for boundary-stable control through handoffs (how). Using an intervention-aware E1/E2 protocol, we show that CaB improves composite execution and handoff quality on a first-person Minecraft VLA benchmark under matched capacity and deployability constraints.

2606.00119 2026-06-02 cs.RO cs.AI updated version

V2I Work Zone Geometry Reconstruction with Pose-Conditioned UWB Range Denoising

Jiaxi Liu, Hangyu Li, Yang Cheng, Rui Gana, Junwei You, Weizhe Tang, Peng Zhang, Steven T. Parker, Xiaopeng Li, Bin Ran

Affiliations: Department of Civil & Environmental Engineering, University of Wisconsin-Madison

AI summary: UWB ranging for V2I work zone geometry reconstruction suffers from bursty outliers, non-line-of-sight errors, and pose uncertainty. The paper proposes a pose-conditioned, permutation-equivariant predictive denoiser that combines shared anchor-wise temporal prediction, symmetric set aggregation, and pose-conditioned residual decoding, substantially improving range accuracy and geometry reconstruction quality.

Abstract

Reliable work zone mapping is important for connected and autonomous vehicles (CAVs) to navigate safely and smoothly through work zone areas. Cone-mounted ultra-wideband (UWB) roadside units (RSUs) offer a cost-effective way to infer work zone layout, as roadside anchors and vehicle tags provide direct vehicle-to-infrastructure (V2I) range constraints for work zone geometry reconstruction. However, UWB range estimation is degraded by bursty outliers, non-line-of-sight (NLOS) errors, arbitrary anchor-ordering issues, and vehicle pose uncertainties in practical field deployments. To address these challenges, this study proposes a pose-conditioned, permutation-equivariant predictive denoiser for multi-anchor UWB ranging. The model employs shared anchor-wise temporal prediction to capture range dynamics, symmetric set aggregation to handle unordered and missing anchors, and pose-conditioned residual decoding to incorporate vehicle motion as a geometric prior. A two-stage training strategy first learns prediction from observed ranges, and then fine-tunes the denoiser with NLOS-weighted supervision. The method is evaluated on rare real-world V2I UWB field data collected with a CAV, as well as on controlled large-scale simulation benchmarks for ablative insights. Results show that the proposed method substantially improves range accuracy, cone localization, and work zone geometry reconstruction in challenging NLOS-dominated regimes, remains robust to anchor re-indexing and moderate anchor dropout, and reduces measurement-weighted field MSE by 66.9% relative to the raw input.
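Permutation equivariance and tolerance to missing anchors can be illustrated with a single Deep Sets style layer. This is a toy sketch and not the paper's architecture. The fixed weights `w_self` and `w_pool` stand in for learned networks, and `None` is an assumed encoding for a dropped anchor.

```python
import math


def equivariant_layer(ranges, w_self=0.7, w_pool=0.3):
    """Mix each anchor's range with a symmetric pooled statistic.

    ranges: list of floats, with None marking a missing anchor (assumed encoding).
    Reordering the input reorders the output in the same way.
    """
    valid = [r for r in ranges if r is not None]
    if not valid:
        return list(ranges)
    # math.fsum is exactly rounded, so the pooled mean is independent of anchor order.
    pooled = math.fsum(valid) / len(valid)
    return [None if r is None else w_self * r + w_pool * pooled for r in ranges]
```

Because the only cross-anchor term is a mean over the valid entries, re-indexing the anchors permutes the outputs identically, and a dropped anchor changes only the pooled value without breaking the layer.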

2606.00117 2026-06-02 cs.RO updated version

Ontology-Guided Reasoning for Affordance-Based Explanations of Robot Navigation

Amar Halilovic, Vahidin Hasic, Senka Krivic

Affiliations: Institute of Artificial Intelligence, Ulm University; Faculty of Electrical Engineering, University of Sarajevo

AI summary: Proposes ontology-guided reasoning in which a local affordance ontology represents entities, affordance states, and spatial relations, and hypothetical object-affordance state changes are evaluated as explanation factors. The resulting explanations are semantically grounded and actionable, and the approach is shown to be accurate and robust in a robot librarian scenario.

Abstract

This paper proposes ontology-guided reasoning for affordance-based explanations of robot navigation. In human environments, it is not sufficient for a robot to detect that its route is blocked. It must also reason about what nearby objects afford, which state changes are possible, and which of these changes would allow it to continue safely. We address this problem by representing nearby entities, their affordances, affordance states, and qualitative spatial relations in a local affordance ontology and by evaluating hypothetical object-affordance state changes as candidate explanation factors. This yields explanations that are not only semantically grounded but also actionable. We instantiate the approach in a lightweight benchmark centered on a robot librarian scenario and evaluate it on procedurally generated navigation cases. The results show that ontology-guided reasoning identifies relevant explanation factors more accurately than a semantic-only baseline and remains robust as semantic clutter increases. Overall, the paper argues that affordance ontologies can serve not merely as semantic descriptions of the environment, but as reasoning foundations for explainability and reliable robot autonomy.

2606.00113 2026-06-02 cs.RO updated version

World Models for Robotic Manipulation: A Survey

Fangyuan Wang, Ziyuan Wang, Guorui Pei, Mengshi Zhang, Canxi Liang, Jun Hu, Zhongxuan Li, Jinsong Wu, Ning Han, Zeqing Zhang, Jiaming Qi, Hongmin Wu, Shiyao Zhang, Pai Zheng, Jia Pan, David Navarro-Alarcon, Sichao Liu, Peng Zhou

Affiliations: Department of Mechanical Engineering, The Hong Kong Polytechnic University; Department of Mechanical Engineering and Automation, Harbin Institute of Technology; School of Advanced Engineering, Great Bay University; College of Robotics Science and Engineering, Taiyuan University of Technology; School of Data Science, City University of Hong Kong (Dongguan); Department of Mechatronic Engineering, Guangdong Polytechnic Normal University; School of Computing and Data Science, The University of Hong Kong; School of Electrical and Electronic Engineering, Nanyang Technological University; College of Mechanical and Electrical Engineering, Northeast Forestry University; Greater Bay Area National Center of Technology Innovation; Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University

AI summary: Surveys world models for robotic manipulation through three questions: what future representation is predicted, how prediction connects to action, and when prediction is used in the robot-learning pipeline. It defines world models as action-conditioned predictive systems, groups them into five representation families, proposes a functional taxonomy, and summarizes infrastructure roles, datasets, and evaluation protocols. It also traces the shift from task-specific dynamics predictors to predictive infrastructure and lists open challenges.

Abstract

Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.

2606.00110 2026-06-02 cs.CV cs.RO 版本更新

General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling

广义协变动作建模：通过时空解耦构建广义流形

Huaihai Lyu, Chaofan Chen, Mingyu Cao, Yuheng Ji, Changsheng Xu

发表机构 * National University of Singapore（新加坡国立大学）

AI总结提出广义动作流形框架，通过时间不变性和几何不变性解耦实现广义协变，提升从稀疏演示中泛化的鲁棒性。

AI中文摘要

从有限数据中实现鲁棒泛化是具身智能的核心挑战。现有方法通过回归绝对坐标失败，这违反了广义协变原理。根本上，这混淆了内在任务几何与刚性执行模式，将策略绑定到特定运动风格和固定速度。为解决此问题，我们提出广义动作流形（GAM）框架，通过结构解耦强制执行广义协变。具体地，GAM通过强制两个正交维度的不变性来实现流形：（1）时间不变性，利用弧长参数化将空间路径几何与时间动力学正交化，确保对速度变化的鲁棒性；（2）几何不变性，其中模式-仿射-分解机制将轨迹映射到姿态归一化坐标框架中的规范“世界线”。这区分了不变几何模式与仿射调制，确保空间泛化性。通过将GAM集成到结构化视觉-语言-动作（VLA）架构中，我们使稀疏演示能够密集填充连续有效的动作流形。实验结果表明，GAM实现了优越的迁移和鲁棒性，优于几何无关基线。

英文摘要

Achieving robust generalization from limited data is a central challenge in embodied intelligence. Prevailing methods fail by regressing absolute coordinates, which violates the principle of general covariance. Fundamentally, this conflates the intrinsic task geometry with rigid execution patterns, binding policies to specific motion styles and fixed speeds. To resolve this, we propose the Generalized Action Manifold (GAM) framework that enforces general covariance through structural disentanglement. Specifically, GAM realizes the manifold by enforcing invariance across two orthogonal dimensions: (1) Temporal Invariance, utilizing an Arc-Length Parameterizer to orthogonalize the spatial path geometry from temporal dynamics, ensuring robustness to velocity variations; (2) Geometric Invariance, where a Schema-Affine-Factorization mechanism maps trajectories to canonical "world lines" in a pose-normalized coordinate frame. This distinguishes invariant geometric schemas from affine modulations, ensuring spatial generalizability. By integrating GAM within a structured Vision-Language-Action (VLA) architecture, we enable sparse demonstrations to densely populate a continuous, valid action manifold. Empirical results demonstrate that GAM enables superior transfer and robustness capabilities, outperforming geometry-agnostic baselines.
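
The Temporal Invariance idea, separating path geometry from timing, can be illustrated with plain arc-length resampling. The sketch below is not GAM's Arc-Length Parameterizer, whose implementation the abstract does not describe; the function name `resample_by_arc_length` and the piecewise-linear interpolation are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def resample_by_arc_length(path: Sequence[Point], num_samples: int) -> List[Point]:
    """Resample a polyline at equal arc-length steps, discarding its timing.

    Hypothetical helper: waypoints are assumed to be recorded in order, and
    straight-line interpolation between them is an illustrative choice.
    """
    if len(path) < 2 or num_samples < 2:
        raise ValueError("need at least two waypoints and two samples")

    # Cumulative arc length at every recorded waypoint.
    cumulative = [0.0]
    for a, b in zip(path, path[1:]):
        cumulative.append(cumulative[-1] + math.dist(a, b))
    total = cumulative[-1]
    if total == 0.0:
        return [tuple(path[0])] * num_samples

    resampled: List[Point] = []
    segment = 0
    for k in range(num_samples):
        target = total * k / (num_samples - 1)
        # Advance to the segment that contains the target arc length.
        while segment < len(path) - 2 and cumulative[segment + 1] < target:
            segment += 1
        seg_len = cumulative[segment + 1] - cumulative[segment]
        t = 0.0 if seg_len == 0.0 else (target - cumulative[segment]) / seg_len
        a, b = path[segment], path[segment + 1]
        resampled.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    return resampled
```

Under these assumptions, two demonstrations that trace the same straight line with different waypoint spacing (a slow start versus a uniform pace) resample to the same points, which is the velocity robustness the abstract refers to.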

2606.00104 2026-06-02 cs.RO cs.AI 版本更新

PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs

PEACE: 一种用于无人机的带约束执行的规划-执行智能体

Erdem Uysal, Timo Kehrer, Sebastiano Panichella

发表机构 * Institute of Computer Science, University of Bern（伯尔尼大学计算机科学研究所）； AI4I - The Italian Institute of Artificial Intelligence（意大利人工智能研究所）

AI总结提出一种基于大语言模型的规划-执行智能体架构，通过解耦高层任务规划与低层控制，并引入约束执行层和有限重规划，实现无人机可解释、可约束的自主飞行。

Comments Accepted to ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction

AI中文摘要

基础模型越来越多地被用于驱动自主系统，然而现有方法要么将模型保持在紧密的控制循环中，增加延迟和幻觉风险，要么将自然语言编译成不透明的端到端策略，难以解释、约束，且需要特定领域的数据集和微调。我们提出一种用于基于PX4的无人机的规划-执行智能体，将高层任务规划与低层控制解耦。大语言模型执行单次任务规划，而执行通过结构化的ROS 2工具调用接口（桥接到MAVLink）处理。该系统通过将模块化2D检测器（如YOLO或视觉语言模型）与用于3D物体定位的针孔深度投影模块相结合，构建世界模型。约束执行层强制执行高度限制和水平地理围栏，有限重规划能够从执行时的动作失败中恢复。我们将我们的方法定位在基于基础模型的机器人系统的三种常见设计模式中，并在Gazebo中的PX4软件在环仿真中展示其可行性。结果突出了与紧密耦合的LLM控制相比，改进的可解释性、约束执行和减少的LLM调用。代码、数据集、视频和其他材料可在以下链接找到：https://github.com/erdemuysalx/PEACE

英文摘要

Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end-to-end policies that are hard to explain and constrain and that require domain-specific datasets and fine-tuning. We propose a planner-executor agent for PX4-based drones that decouples high-level mission planning from low-level control. A large language model performs single-pass task planning, while execution is handled through a structured ROS 2 tool-calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision-language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer enforces altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution-time action failures. We position our approach within three common design patterns for foundation-model-based robotics systems and demonstrate its feasibility in PX4 software-in-the-loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: https://github.com/erdemuysalx/PEACE
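
Two of the components named above, pinhole depth projection and the altitude/geofence constraint layer, can be sketched in a few lines. This is a hedged illustration, not code from the PEACE repository: the names `Intrinsics`, `backproject`, and `enforce_constraints` are hypothetical, the geofence is assumed to be a circle around the home position, and clamping (rather than rejecting) a violating setpoint is an assumption. The camera-to-world transform is omitted.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical container)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


def enforce_constraints(target: Vec3, min_alt: float, max_alt: float,
                        fence_radius: float) -> Vec3:
    """Clamp a local setpoint (x, y, altitude) to assumed safety limits."""
    x, y, z = target
    # Altitude limit: keep the commanded height inside [min_alt, max_alt].
    z = min(max(z, min_alt), max_alt)
    # Horizontal geofence: pull the point back onto the circle if it is outside.
    r = math.hypot(x, y)
    if r > fence_radius:
        scale = fence_radius / r
        x, y = x * scale, y * scale
    return (x, y, z)
```

A detector's bounding-box centre plus a depth reading would go through `backproject`, and any setpoint the planner derives from it would pass through `enforce_constraints` before being sent to the flight controller.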

2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO 版本更新

基于图扩散的全身逆运动学

Helong Huang, Kai Tan, Feng Wen, Guowei Huang, Xingyue Quan

发表机构 * Large Model Algorithm Lab, Huawei（华为大模型算法实验室）

AI总结提出GraphDiff-IK，一种结构感知的图扩散逆运动学框架，通过将机器人表示为运动学图并引入分层消息传递和躯干感知条件，实现了多分支机器人的准确稳定IK求解。

AI中文摘要

逆运动学（IK）是机器人学中的一个基本问题，需要生成满足目标末端执行器位姿的关节配置。现有方法通常难以在多种机器人形态间泛化，并且无法有效建模IK的多模态特性，特别是在具有多个运动学分支的关节系统中。在这项工作中，我们提出了GraphDiff-IK，一种结构感知的图扩散逆运动学框架。具体来说，我们将机器人表示为从机器人URDF构建的运动学图，其中节点对应驱动关节，边编码运动学依赖关系。基于这种表示，我们将IK表述为条件图扩散过程，直接在机器人图上生成关节配置。为了更好地捕捉关节系统中的结构依赖关系，我们进一步引入了一种结构感知的图推理框架，具有分层阶段式消息传递和针对多分支机器人的躯干感知条件。此外，我们结合了带噪声的正向运动学反馈和任务空间监督，以提高去噪过程中的几何一致性。所提出的框架提供了一种统一的公式，自然支持单臂机器人、双臂系统以及具有躯干或腰部结构的关节机器人。在多种机器人平台上的大量实验表明，所提出的方法实现了准确且稳定的IK性能，同时保留了为冗余机器人系统生成多个可行解的能力。

英文摘要

Inverse kinematics (IK) is a fundamental problem in robotics, requiring the generation of joint configurations that satisfy target end-effector poses. Existing approaches often struggle to generalize across diverse robot morphologies and to effectively model the multi-modal nature of IK, particularly in articulated systems with multiple kinematic branches. In this work, we propose GraphDiff-IK, a structure-aware graph diffusion framework for inverse kinematics. Specifically, we represent the robot as a kinematic graph constructed from the robot URDF, where nodes correspond to actuated joints and edges encode kinematic dependencies. Building upon this representation, we formulate IK as a conditional graph diffusion process that directly generates joint configurations on the robot graph. To better capture structural dependencies in articulated systems, we further introduce a structure-aware graph reasoning framework with hierarchical stage-wise message passing and torso-aware conditioning for multi-branch robots. In addition, we incorporate noisy forward kinematics feedback and task-space supervision to improve geometric consistency during denoising. The proposed framework provides a unified formulation that naturally supports single-arm robots, dual-arm systems, and articulated robots with torso or waist structures. Extensive experiments on diverse robotic platforms demonstrate that the proposed method achieves accurate and stable IK performance while preserving the ability to generate multiple feasible solutions for redundant robotic systems.

2606.00085 2026-06-02 cs.RO 版本更新

Balancing Accuracy and Efficiency: Adaptive Dynamics Orchestration for Model Predictive Control

平衡精度与效率：模型预测控制的自适应动力学编排

Francesco Cancelliere, Aniket Datar, Giovanni Muscato, Xuesu Xiao

发表机构 * Department of Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI, USA（密歇根大学电气与计算机工程系，美国密歇根州安娜堡）

AI总结提出自适应动力学编排（ADO）框架，通过在线反事实滚动评估模型残差，动态选择最适合当前导航上下文的动力学模型，在计算效率与预测精度之间取得平衡。

Comments 8 pages, 7 figures

AI中文摘要

自主导航的模型预测控制（MPC）面临模型精度与实时效率之间的基本权衡。高保真动力学模型能够准确预测轨迹展开过程中复杂的车辆-地形交互，但计算成本高，增加推理延迟并降低控制频率。相反，轻量级模型支持快速更新和密集采样，但在安全关键条件下可能产生错误预测，导致灾难性故障如车辆侧翻。为解决这一权衡，我们提出自适应动力学编排（ADO），一种根据当前导航上下文动态选择最合适动力学模型的框架。ADO维护一个涵盖不同精度-效率特征的模型库，并通过在线反事实滚动（即执行的控制动作在模型库中重放以评估预测差异）的残差误差，持续细化地形条件性能估计。这些估计实时指导模型选择，平衡计算效率与预测精度。在越野地面机器人上的真实实验表明，与固定低延迟基线相比，ADO显著降低了建模误差，同时接近最高保真模型的精度而不产生其计算成本，从而在复杂地形中实现更可靠和有效的导航。

英文摘要

Model Predictive Control (MPC) for autonomous navigation faces a fundamental trade-off between model accuracy and real-time efficiency. High-fidelity dynamics models can accurately predict complex vehicle-terrain interactions during trajectory rollouts, but incur significant computational cost, increasing inference latency and reducing control frequency. Conversely, lightweight models enable fast updates and dense sampling, yet may produce erroneous predictions under safety-critical conditions, potentially leading to catastrophic failures such as vehicle rollover. To address this trade-off, we propose Adaptive Dynamics Orchestration (ADO), a framework that dynamically selects the most appropriate dynamics model for the current navigation context. ADO maintains a library of models spanning diverse accuracy-efficiency profiles and continuously refines terrain-conditioned performance estimates using residual errors from online counterfactual rollouts, where executed control actions are replayed across the model library to assess predictive discrepancy. These estimates guide model selection in real time, balancing computational efficiency and predictive accuracy. Real-world experiments on an off-road ground robot demonstrate that ADO significantly reduces modeling error compared to a fixed low-latency baseline, while approaching the accuracy of the highest-fidelity model without incurring its computational cost, resulting in more reliable and effective navigation in challenging terrain.
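
The selection loop described above (replay executed controls through every model, track terrain-conditioned residuals, pick a model per context) can be sketched as follows. The abstract does not give ADO's update or selection rule, so everything concrete here is an assumption: the class name `DynamicsOrchestrator`, the exponential moving average, the score `error + latency_weight * latency`, and the optimistic default for unseen terrain.

```python
from typing import Dict, Optional, Tuple


class DynamicsOrchestrator:
    """Toy model selector trading estimated error against latency (hypothetical)."""

    def __init__(self, latencies: Dict[str, float], latency_weight: float,
                 smoothing: float = 0.5) -> None:
        self.latencies = latencies            # seconds per rollout, per model
        self.latency_weight = latency_weight  # assumed cost of one second of latency
        self.smoothing = smoothing            # weight given to the newest residual
        self.errors: Dict[Tuple[str, str], float] = {}

    def update(self, terrain: str, residuals: Dict[str, float]) -> None:
        """Fold in residuals from replaying executed controls through each model."""
        for name, residual in residuals.items():
            key = (terrain, name)
            old: Optional[float] = self.errors.get(key)
            if old is None:
                self.errors[key] = residual
            else:
                self.errors[key] = (1.0 - self.smoothing) * old + self.smoothing * residual

    def select(self, terrain: str) -> str:
        """Return the model with the lowest combined error and latency score."""
        def score(name: str) -> float:
            # Unseen (terrain, model) pairs default to zero error, so the
            # cheapest model is tried first on new terrain.
            error = self.errors.get((terrain, name), 0.0)
            return error + self.latency_weight * self.latencies[name]
        return min(self.latencies, key=score)
```

With a lightweight and a high-fidelity model registered, the selector in this sketch stays on the lightweight model where its residuals are small and switches to the expensive one only on terrain where the cheap model's error outweighs the latency penalty.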

2606.00083 2026-06-02 cs.LG cs.AI cs.RO 版本更新

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

从演示到奖励：VLM奖励模型的测试时提示优化

Christian Gumbsch, Leonardo Barcellona, Lennard Schünemann, Platon Karageorgis, Andrii Zadaianchuk, Zehao Wang, Sergey Zakharov, Fabien Despinoy, Rahaf Aljundi, Efstratios Gavves

发表机构 * University of Amsterdam（阿姆斯特丹大学）； Catholic University of Leuven（鲁汶天主大学）； Toyota Research Institute（丰田研究院）； Toyota Motor Europe（丰田欧洲公司）

AI总结提出Demo2Reward方法，利用少量专家演示在测试时优化VLM奖励模型的提示指令，减少假阳性并保持真阳性，无需额外训练即可提升下游策略学习。

AI中文摘要

强化学习依赖于准确的奖励函数，但在现实应用（如机器人技术）中，这些函数通常是手工设计的，甚至不可用。最近的研究探索了预训练视觉-语言模型（VLM）作为奖励模型的零样本推理能力。然而，如果没有仔细的提示工程，这些方法往往会产生次优的奖励，其中假阳性预测会严重降低下游策略学习。在机器人技术中，通常收集包含专家演示的有限数据集来引导策略学习。这种场景提供了在策略训练之前优化奖励模型的机会。我们提出Demo2Reward，一种测试时自适应技术，基于少量演示（3-10条轨迹）优化奖励模型的语言指令，以减少假阳性同时保持真阳性。关键是，这在策略学习期间不需要额外的模型训练或计算资源。我们表明，Demo2Reward在一系列模拟机器人任务和策略骨干上始终优于现有的零样本和少样本VLM奖励模型。最后，我们证明Demo2Reward有效迁移到真实世界的机器人学习场景，无需手动设计奖励函数即可实现策略学习。

英文摘要

Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications, such as robotics. Recent work has explored the zero-shot reasoning capabilities of pre-trained Vision-Language Models (VLMs) as reward models. However, without careful prompt engineering, these approaches tend to produce suboptimal rewards, where false positive predictions can severely degrade downstream policy learning. In robotics, limited datasets comprising expert demonstrations are often collected to bootstrap policy learning. This scenario provides an opportunity to optimize a reward model prior to policy training. We propose Demo2Reward, a test-time adaptation technique that optimizes the language instruction of a reward model based on a few demonstrations (3-10 trajectories) to reduce false positives while preserving true positives. Crucially, this requires no additional model training or computation resources during policy learning. We show that Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across a range of simulated robotic tasks and policy backbones. Finally, we demonstrate that Demo2Reward effectively transfers to a real-world robotic learning scenario, enabling policy learning without manually engineering a reward function.

2606.00069 2026-06-02 cs.RO eess.IV 版本更新

Invascal: Inverse-Vacuity Self-Calibration for Uncertainty-Aware LiDAR Range-View Semantic Segmentation

Invascal: 面向不确定性感知激光雷达距离视图语义分割的逆空性自校准

Kerim Turacan, Hannes Reichert, Andrei Bolandut, Konrad Doll

发表机构 * Faculty of Engineering and Computer Science, University of Applied Sciences Aschaffenburg（阿沙芬堡应用科学大学工程与计算机科学学院）

AI总结提出一种与架构无关的不确定性感知适配器头，通过偏好头和强度头分解预测，并设计逆空性自校准目标（Invascal）来监督强度信号，实现可靠且校准良好的不确定性估计，同时保持分割精度。

Comments Accepted for publication at the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC)

AI中文摘要

激光雷达语义分割是自动驾驶车辆和移动机器人的核心感知能力。然而，安全运行还取决于知道预测何时不可靠。现有方法通常依赖softmax置信度，这往往校准不良且过度自信，而来自蒙特卡洛dropout或集成方法的更强不确定性估计对于实时使用通常计算成本高昂。为此，我们引入了一种新颖的、与架构无关的不确定性感知适配器头。它将预测分解为用于类别排名的偏好头和用于细化不确定性评估的强度头，从而能够原则性地构建证据狄利克雷表示。基于此设计，我们提出了逆空性自校准目标（Invascal），它直接监督强度信号以产生可靠且校准良好的不确定性估计，同时防止证据无节制增长。我们在多个激光雷达数据集和骨干架构上评估了我们的框架。我们与确定性训练、蒙特卡洛dropout和集成方法以及先前的证据方法进行了比较。我们的方法在最小计算开销下，持续改进了不确定性校准，优于传统的确定性方法。同时，它保持了有竞争力的分割精度，而先前的证据方法往往会出现性能下降。

英文摘要

LiDAR semantic segmentation is a core perception capability for autonomous vehicles and mobile robots. However, safe operation also depends on knowing when predictions are unreliable. Existing approaches typically rely on softmax confidence, which is often miscalibrated and overconfident, while stronger uncertainty estimates from Monte Carlo dropout or ensembles are often computationally expensive for real-time use. To this end, we introduce a novel, architecture-agnostic uncertainty-aware Adapter Head. It decomposes the prediction into a Preference Head for class ranking and a Strength Head that refines uncertainty assessment, thereby enabling a principled construction of evidential Dirichlet representations. Building on this design, we propose our inverse-vacuity self-calibration objective (Invascal), which directly supervises the strength signal to produce reliable and well-calibrated uncertainty estimates while preventing runaway evidence growth. We evaluate our framework across multiple LiDAR datasets and backbone architectures. We compare against deterministic training, Monte Carlo dropout and ensembles, and prior evidential methods. Our approach consistently improves uncertainty calibration over traditional deterministic methods with minimal computational overhead. At the same time, it preserves competitive segmentation accuracy, where prior evidential methods often suffer performance degradation.
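
For readers unfamiliar with the term, "vacuity" in evidential Dirichlet models is the share of belief mass left unassigned when little evidence has been collected. The sketch below computes the standard subjective-logic quantities from per-class evidence; it does not reproduce the Invascal objective or the Preference/Strength head decomposition, and the function name `dirichlet_summary` is hypothetical.

```python
from typing import List, Sequence, Tuple


def dirichlet_summary(evidence: Sequence[float]) -> Tuple[List[float], float]:
    """Return (expected class probabilities, vacuity) for non-negative evidence.

    Uses the common evidential formulation: alpha_k = evidence_k + 1,
    strength S = sum(alpha), expected probability alpha_k / S, vacuity K / S.
    """
    if not evidence or any(e < 0.0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probabilities = [a / strength for a in alpha]
    vacuity = num_classes / strength
    return probabilities, vacuity
```

With zero evidence for every class the vacuity is 1 and the prediction is uniform; as evidence accumulates the Dirichlet strength grows and vacuity shrinks toward 0, which is why unbounded evidence growth makes a model look overconfident.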

2606.00063 2026-06-02 cs.RO math-ph math.MP physics.flu-dyn 版本更新

Linear Motility Maps in Nonlinear Viscous Fluids

非线性粘性流体中的线性运动映射

Yishun Zhou, Shai Revzen

发表机构 * Department of Robotics, University of Michigan（机器人学系，密歇根大学）； Departments of Electrical Engineering and Computer Science, and Ecology and Evolutionary Biology（电气工程与计算机科学系、生态与进化生物学系）

AI总结研究在低雷诺数流体中，线性运动映射扩展到幂律流体，并发现Carreau-Yasuda流体可违反该线性性质实现净运动，方向可随速度改变。

AI中文摘要

已知在低雷诺数流体中运动的系统受“运动映射”支配，该映射线性地将形状变化率与通过流体的本体框架速度联系起来。其结果是“珀塞尔扇贝定理”——经历时间上前后相同路径的形状变化（往复身体变形）的运动系统无法实现净位移，无论这些变化的速度如何。我们证明线性速度运动映射扩展到任何幂律粘度（即Ostwald-de Waele流体），因此也适用于中间剪切范围内的许多生物流体。我们还表明，在Carreau-Yasuda流体中，线性速度性质可以被违反，使用由两个不等质量且具有不等阻力系数的质量组成的“尺蠖”模型进行往复运动，从而产生净运动。有趣的是，运动方向可以通过改变速度来切换。我们的结果表明，几何力学的线性运动映射可用于分析和设计幂律流体中的运动，并且某些非线性阻力关系（如Carreau-Yasuda）可用于产生净运动，看似违反了“扇贝定理”。

英文摘要

Systems moving in low-Reynolds-number fluid regimes are known to be governed by a "motility map" that linearly relates their shape-change rates to their body-frame velocity through the fluid. A consequence of this is "Purcell's Scallop Theorem": a locomotion system that undergoes shape changes that follow the same path forward and backward in time (reciprocal body deformations) cannot achieve net displacement, regardless of the pacing of those changes. We show that linear-in-velocity motility maps extend to any power-law viscosity (a.k.a. Ostwald-de Waele fluid), and therefore to many biological fluids in intermediate shear ranges. We also show that the linear-in-velocity property can be violated in Carreau-Yasuda fluids to produce net motion using an "inchworm" model consisting of two unequal masses with unequal drag coefficients performing reciprocal motions. Interestingly, the direction of motion can be switched by changing speeds. Our results show that the linear motility map of geometric mechanics can be used to analyze and design locomotion in power-law fluids, and that some nonlinear drag relationships such as Carreau-Yasuda can be exploited to generate net locomotion in seeming violation of the "scallop theorem".
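
The step from a linear motility map to the scallop theorem can be written out explicitly. What follows is the standard textbook argument, restricted for simplicity to a scalar displacement $x$ and a shape variable $r$ with map $A(r)$; these symbols are illustrative and are not taken from the paper.

```latex
\begin{align}
  \dot{x}(t) &= A\bigl(r(t)\bigr)\,\dot{r}(t), \\
  \Delta x &= \int_0^T A\bigl(r(t)\bigr)\,\dot{r}(t)\,dt
            = \int_{\gamma} A(r)\,dr, \\
  % Reciprocal stroke: \gamma follows \gamma_0 and then returns along the same curve.
  \int_{\gamma} A(r)\,dr &= \int_{\gamma_0} A(r)\,dr - \int_{\gamma_0} A(r)\,dr = 0 .
\end{align}
```

Because the line integral depends only on the path in shape space and not on how fast it is traversed, the net displacement vanishes for any pacing. A drag law that is not linear in the rates breaks this reparameterization invariance, which is the loophole the abstract attributes to Carreau-Yasuda fluids.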

2606.00059 2026-06-02 cs.RO cs.LG 版本更新

Reinforcement Learning for Optimal Experiment Design in Parameter Identification of Mechatronic Systems

机电系统参数辨识中最优实验设计的强化学习方法

Julian Langschwert, Georg Schaefer, Jakob Rehrl, Stefan Huber, Simon Hirlaender

发表机构 * Josef Ressel Centre for Intelligent and Secure Industrial Automation, Salzburg University of Applied Sciences, Salzburg, Austria（约瑟夫·雷斯尔智能与安全工业自动化中心，萨尔茨堡应用技术大学，萨尔茨堡，奥地利）； Paris Lodron University of Salzburg, Salzburg, Austria（萨尔茨堡帕里斯·洛德隆大学，萨尔茨堡，奥地利）

AI总结提出一种强化学习智能体，通过奖励塑形自主满足安全约束，为Quanser Aero 2测试平台学习最优激励信号，在三个辨识参数上均达到竞争性估计精度，且安全违规率仅0.75%。

Comments Accepted at DEXA AI4IP 2026

2606.00054 2026-06-02 cs.RO cs.AI cs.CV 版本更新

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

从人类视频到机器人操作：基于人类中心数据的可扩展视觉-语言-动作学习综述

Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, Zongqing Lu, Oier Mees, Marc Pollefeys, Jiaolong Yang, Baining Guo

发表机构 * Tsinghua University（清华大学）； HKUST（香港科技大学）； Xi’an Jiaotong University（西安交通大学）； Fudan University（复旦大学）； Microsoft Research Asia（微软亚洲研究院）； Peking University（北京大学）； Microsoft Zurich Project（微软苏黎世实验室）

AI总结本文综述了如何将丰富的人类视频转化为视觉-语言-动作（VLA）模型的有效知识，分类了四种方法（潜在动作表示、预测世界模型、显式2D监督、显式3D重建），并指出了结构化非结构化视频、跨具身和视角的动作映射、以及评估协议设计三大挑战。

Comments Accepted to IJCAI 2026 Survey Track. Project page: https://aaronfengzy.github.io/HumanCentricToVLA-Survey/

AI中文摘要

近期在可泛化具身控制方面的进展由大规模预训练的视觉-语言-动作（VLA）模型驱动。然而，大多数现有方法依赖于大量机器人演示数据，这些数据获取成本高昂且与特定具身紧密耦合。相比之下，人类视频丰富且捕捉了丰富的交互，为真实世界操作提供了多样的语义和物理线索。然而，具身差异以及任务对齐标注的频繁缺失使得它们直接用于VLA模型具有挑战性。本综述提供了一个统一的视角，探讨如何将人类视频转化为VLA模型的有效知识。我们根据所提取的动作相关信息将现有方法分为四类：(i) 编码帧间变化的潜在动作表示；(ii) 预测未来帧的预测世界模型；(iii) 提取图像平面线索的显式2D监督；(iv) 恢复几何或运动的显式3D重建。除分类外，我们强调了该领域的三个关键开放挑战：将非结构化视频结构化为可训练的片段、在具身和视角异质性下将视频导出的监督接地到机器人可执行动作中，以及设计能更好预测真实世界部署性能和迁移效率的评估协议，从而为未来研究方向提供参考。论文和资源的精选列表见 https://github.com/AaronFengZY/HumanCentricToVLA-Survey。

英文摘要

Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real-world manipulation. Yet, embodiment differences and the frequent absence of task-aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action-related information they derive: (i) latent action representations that encode inter-frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image-plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training-ready episodes, grounding video-derived supervision into robot-executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real-world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at https://github.com/AaronFengZY/HumanCentricToVLA-Survey.

2606.00053 2026-06-02 cs.RO 版本更新

VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis

VLAMotor: 通过基于智能体的数据合成实现视觉-语言-动作模型的测试引导增强

Zeqin Liao, Peifan Ren, Zixu Gao, Hongyu Gong, Lianyu Hu, Wenbing Tang, Yuhong Nan, Zibin Zheng, Yang Liu

发表机构 * School of computing and data science, Nanyang Technological University（计算与数据科学学院，南洋理工大学）； School of Software Engineering, Sun Yat-sen University（软件工程学院，中山大学）； GuangDong Engineering Technology Research Center of Blockchain, China（区块链工程技术研发中心，中国）； Northwest A&F University（西北农林科技大学）

AI总结提出VLAMotor框架，通过距离感知测试暴露失败案例，并利用基于智能体的数据合成生成成功轨迹微调VLA模型，显著提升模型在仿真和真实环境中的成功率。

AI中文摘要

视觉-语言-动作（VLA）模型遵循数据驱动范式，受训练数据覆盖范围的限制，在部署后容易在边缘情况配置上失败。为了减轻此类风险，必须暴露高质量失败模式，并将由此产生的失败转化为监督数据用于模型增强。现有研究大多止步于失败检测，缺乏利用发现的失败进行模型修复的机制。我们提出VLAMotor，这是首个用于VLA增强的分析框架，它集成了距离感知模型测试以暴露失败，以及基于智能体的数据合成以进行模型微调。首先，VLAMotor基于与训练样本的距离估计输入不确定性，并将不确定性排序与冗余消除相结合，构建暴露多样化失败的紧凑测试集。然后，VLAMotor将失败轨迹抽象为结构化语义表示，并规划参数化的修复技能序列，通过逆运动学和运动执行将其实现为可执行轨迹。由此产生的成功轨迹被自动标注并用于微调原始VLA模型，从而得到增强的VLA模型。在四个代表性机器人操作任务上的评估表明，VLAMotor生成的仿真测试用例中有92.33%触发了VLA失败，并且VLAMotor将测试覆盖率相比最先进工具提高了18.93%。通过使用从失败测试用例中导出的合成数据微调VLA模型，VLAMotor进一步将VLA模型的总体成功率提高了49.25%。当部署在真实硬件上时，仿真增强模型相比原始VLA模型成功率提高了57.50%，展示了VLA增强的一种有效且低成本的方向。

英文摘要

Vision-Language-Action (VLA) models follow a data-driven paradigm and are constrained by the coverage of training data, making them prone to failure on edge-case configurations after deployment. To mitigate such risks, it is essential to expose high-quality failure modes and convert the resulting failures into supervisory data for model enhancement. Existing studies largely stop at failure detection and lack a mechanism for leveraging discovered failures for model repair. We propose VLAMotor, the first analysis framework for VLA enhancement, which integrates distance-aware model testing for failure exposure and agent-based data synthesis for model fine-tuning. First, VLAMotor estimates input uncertainty based on the distance to training samples, and combines uncertainty ranking with redundancy elimination to build compact test sets that expose diverse failures. Then, VLAMotor abstracts failure trajectories into structured semantic representations and plans parameterized repair-skill sequences, which are then realized as executable trajectories through inverse kinematics and motion execution. The resulting successful trajectories are automatically labeled and used to fine-tune the original VLA model, yielding an enhanced VLA model. Evaluation on four representative robotic manipulation tasks shows that 92.33% of the in-simulation test cases generated by VLAMotor trigger VLA failures, and VLAMotor improves test coverage over the state-of-the-art tool by 18.93%. By fine-tuning VLA models with synthetic data derived from failed test cases, VLAMotor further enhances the overall success rate of VLA models by 49.25%. When deployed on real hardware, the simulation-enhanced models improve the success rate over the original VLA models by 57.50%, demonstrating an effective and low-cost direction for VLA enhancement.

2605.30877 2026-06-02 cs.RO 版本更新

Wall-OSS-0.5 Technical Report

Wall-OSS-0.5 技术报告

Ryan Yu, Pushi Zhang, Starrick Liu, Brae Liu, Miracle Kang, Shalfun Li, Lights Shi, Ellie Ma, Ping Yang, Chris Pan, Jerry Chen, Dongxiu Liu, Rain Sun, Miles Guo, Byron Zhang, Hugo Zhou, Zach Xu, Vincent Chen, Harrison Huang, James Wang, Dance Kuzi, Andy Zhai, Hang Su, Roy Gan, Lucy Liang, Hao Wang, Qian Wang

发表机构 * arXiv

AI总结本文提出Wall-OSS-0.5，一个基于3B VLM骨干网络并增强动作生成组件的4B开源VLA模型，通过梯度桥接联合训练策略，在超过20个实体上预训练，实现零样本真实机器人行为，并在微调后超越π_0.5，证明VLA预训练本身即可产生可执行的机器人能力。

AI中文摘要

大规模视觉-语言-动作（VLA）预训练正日益成为机器人策略的基础，然而预训练VLA的证据几乎总是在任务特定微调后报告。这留下了一个基本问题未解答：VLA预训练本身是否产生可执行的机器人行为，还是仅仅为下游策略学习提供更好的初始化？我们提出Wall-OSS-0.5，一个基于3B VLM骨干网络并增强动作生成组件的开源4B VLA，设计使得预训练的机器人能力可直接在物理硬件上测量。该模型在超过20个实体上进行预训练，每轮处理超过一百万个机器人轨迹以及一个多模态语料库。我们采用梯度桥接联合训练方案，其中三个目标扮演不同且互补的角色：离散动作预测将强大的VLM原生梯度注入骨干网络，多模态预测保持基于视觉-语言的理解，连续流匹配作为部署时的动作接口。在任务特定微调之前，预训练检查点实现了非平凡的零样本真实机器人行为，在17个任务套件中完成了包括一个保留的变形操作任务在内的多个任务，并取得了高任务进度。微调后，同一检查点作为更强的适应先验，在15个真实机器人任务上达到60.5%的平均任务进度，比π_0.5高出17.5%。多模态评估进一步证实动作训练不会侵蚀基于视觉-语言的能力：模型在保持广泛视觉-语言能力的同时增强了具身基础。总之，这些结果将VLA预训练从初始化策略重新定位为可直接测试且已经有用的机器人能力来源。

英文摘要

Large-scale Vision-Language-Action (VLA) pretraining is increasingly adopted as the foundation for robot policies, yet the evidence for pretrained VLAs is almost invariably reported after task-specific fine-tuning. This leaves a foundational question unanswered: does VLA pretraining itself yield executable robot behavior, or does it merely furnish a better initialization for downstream policy learning? We present Wall-OSS-0.5, an open-source 4B VLA built upon a 3B VLM backbone augmented with action-generation components, designed so that pretrained robotic capability is directly measurable on physical hardware. The model is pretrained across more than 20 embodiments, processing over one million robot trajectories per epoch alongside a grounded multimodal corpus. We adopt a gradient-bridged co-training recipe in which three objectives play distinct and complementary roles: discrete action prediction routes strong VLM-native gradients into the backbone, multimodal prediction preserves grounded vision-language understanding, and continuous flow matching serves as the deployment-time action interface. Before task-specific fine-tuning, the pretrained checkpoint achieves non-trivial zero-shot real-robot behavior, completing several tasks, including a held-out deformable manipulation task, at high task progress on a 17-task suite. After fine-tuning, the same checkpoint serves as a stronger adaptation prior, reaching 60.5% average task progress on 15 real-robot tasks and outperforming π_0.5 by 17.5%. Multimodal evaluations further confirm that action training does not erode grounded vision-language competence: the model preserves broad vision-language ability while strengthening embodied grounding. Together, these results reposition VLA pretraining from an initialization strategy to a directly testable, already useful source of robot capability.

2605.30581 2026-06-02 cs.CV cs.AI cs.RO 版本更新

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

工业视觉模拟到现实中的先验可用性：CAD引导与CAD不可用机制的综述

Chenxi Tao, Seung-Kyum Choi

发表机构 * George W. Woodruff School of Mechanical Engineering（乔治·W·伍德鲁夫机械工程学院）； Georgia Institute of Technology（佐治亚理工学院）

AI总结本文通过先验可用性视角重新组织工业视觉模拟到现实问题，区分CAD可用、CAD不可用和边界先验三种机制，并基于T-LESS/BOP、MVTec AD和VisA数据集进行实证分析，揭示了源分布设计、检测器容量和真实校准的重要性，以及CAD在测试时提供的独特验证通道。

Comments Review article; 103 references; 9 main figures; empirical anchors on T-LESS/BOP, MVTec AD, and VisA

AI中文摘要

工业视觉模拟到现实通常被描述为从合成图像到真实图像的迁移，但工业部署通常涉及可用证据与所需决策之间更广泛的错配。系统可能基于CAD渲染、模拟RGB-D观测、正常参考图像、合成缺陷、预训练特征空间或语言提示构建，却在不同的传感器、光照、材料、夹具、校准、生产变化和罕见缺陷模式下部署。本综述将工业视觉模拟到现实重新定义为由先验可用性组织的域差距问题。我们区分了CAD可用设置（其中显式物体几何可支持渲染、校准、姿态估计、分割和测试时几何验证）、CAD不可用设置（其中几何被正常参考外观、特征分布、师生残差、合成异常假设、基础特征或视觉语言先验取代）以及边界先验设置（其中近似模型、模板、参考视图或语义对应仅保留CAD的部分作用）。这一框架将基于CAD的检测和6D姿态估计文献与通常单独综述的工业异常和表面检测文献联系起来。为使分类具体化，我们使用T-LESS/BOP、MVTec AD和VisA上的实证锚点。这些锚点表明，仅靠CAD渲染数量并不能弥合迁移；源分布设计、检测器容量和小规模真实校准可能更为重要。它们还表明，测试时的CAD通过掩码、姿态和深度一致性创建了独特的验证通道，而CAD不可用的检测则依赖于校准的正常性和特征偏差。因此，本综述反对单一跨任务排行榜，而是询问什么先验支撑了部署决策。

英文摘要

Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually involves a broader mismatch between available evidence and required decisions. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, yet deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects CAD-based detection and 6D pose-estimation literature with industrial anomaly and surface-inspection literature that is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close transfer; source-distribution design, detector capacity, and small real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.

2605.27180 2026-06-02 cs.RO 版本更新

Towards Drone-based Mapping of Volcanic Gases using Gas Tomography

面向基于无人机的火山气体测绘：使用气体断层成像

Marius Schaab, Niklas Karbach, Antonia Rabe, Thomas Wiedemann, Patrick Hinsen, Dmitriy Shutin, Thorsten Hoffmann, Achim J. Lilienthal

发表机构 * German Research Foundation (DFG)（德国研究基金会）； Istituto Nazionale di Geofisica e Vulcanologia (INGV)（意大利国家地球物理与火山学研究所）

AI总结针对无人机旋翼下洗流干扰问题，提出基于拉格朗日模型的模型驱动气体断层成像方法，实现火山气体排放的准确测绘。

AI中文摘要

火山排放大量二氧化碳，直接影响人类生活。测绘火山气体排放有助于预测喷发并了解火山对气候和环境的影响。基于无人机的气体传感显著降低了火山监测的风险，但在测量气体时面临技术限制，因为旋翼下洗流会在检测前驱散气体羽流。使用远程气体传感的气体断层成像解决了这一挑战。在Salinelle dei Cappuccini泥火山，我们证明，尽管无人机搭载的原位传感器因空气动力学干扰未能检测到CO2排放，但开路路径传感成功实现了远程气体分布测绘。我们提出了一种新颖的基于模型的气体断层重建方法，该方法结合拉格朗日模型来补偿风引起的平流。所得气体分布图与手动收集的原位测量结果一致，证实了基于模型的气体断层成像有效克服了下洗流限制，并实现了火山排放的准确测绘。

英文摘要

Volcanoes emit large amounts of CO2, directly influencing human lives. Mapping volcanic gas emissions helps to forecast eruptions and understand the impact of volcanoes on climate and the environment. Drone-based gas sensing significantly reduces risks in volcanic monitoring but faces technical limitations when measuring gas, as rotor downwash disperses the gas plume before detection. Gas Tomography using remote gas sensing addresses this challenge. At the Salinelle dei Cappuccini mud volcanoes, we demonstrate that while drone-mounted in-situ sensors failed to detect CO2 emissions due to aerodynamic disturbance, open-path sensing successfully enabled remote gas distribution mapping. We present a novel model-based gas tomographic reconstruction approach that incorporates a Lagrangian model to compensate for wind-induced advection. The resulting gas distribution maps align with manually collected in-situ measurements, confirming that model-based gas tomography effectively overcomes downwash limitations and enables accurate mapping of volcanic emissions.

2605.26625 2026-06-02 cs.RO cs.SY eess.SY 版本更新

AttenA+: 纠正机器人基础模型中的动作不平等性

Daojie Peng, Fulong Ma, Jiahang Cao, Qiang Zhang, Xupeng Xie, Jian Guo, Ping Luo, Andrew F. Luo, Boyu Zhou, Jun Ma

发表机构 * HKUST(GZ)（香港科技大学（广州））； HKU（香港大学）； USTC（中国科学技术大学）； IDEA Research（IDEA研究院）； SUSTech（南方科技大学）； X-Humanoid

AI总结针对机器人基础模型忽视动作物理重要性的问题，提出AttenA+框架，通过速度驱动的动作注意力重加权训练目标，提升复杂长程任务性能。

AI中文摘要

现有的机器人基础模型虽然强大，但基于一个隐含的时间同质性假设：在优化过程中将所有动作视为同等信息量。这种从语言模型继承的“平坦”训练范式，对操作的内在物理层次结构无动于衷。实际上，机器人轨迹本质上是异质的，其中低速段通常通过需要精确交互来决定任务成功，而高速运动则作为容错过渡。这种均匀损失权重与物理关键性之间的错位从根本上限制了当前视觉-语言-动作（VLA）模型和世界-动作模型（WAM）在复杂长程任务中的性能。为了纠正这一点，我们引入了AttenA+，一个与架构无关的框架，通过速度驱动的动作注意力优先考虑运动学关键段。通过基于逆速度场重新加权训练目标，AttenA+自然地使模型的学习能力与操作的物理需求对齐。作为一种即插即用的增强，AttenA+可以集成到现有骨干网络中，无需结构修改或额外参数。大量实验表明，AttenA+显著提升了当前最先进模型的上限。具体来说，它在Libero基准上将OpenVLA-OFT提升至98.6%（+1.5%），并将FastWAM在RoboTwin 2.0上推进至92.4%（+0.6%）。在Franka机械臂上的真实世界验证进一步展示了其鲁棒性和跨任务泛化能力。我们的工作表明，挖掘动作序列的内在结构先验为标准缩放定律提供了一种高效、物理感知的补充，为通用机器人控制开辟了新路径。

英文摘要

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.
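
Reweighting a training objective by inverse velocity can be sketched on per-timestep losses. The exact weighting used by AttenA+ is not given in the abstract, so the form `1 / (speed + eps)`, the normalization to unit mean, and the function names below are assumptions for illustration only.

```python
from typing import List, Sequence


def inverse_velocity_weights(speeds: Sequence[float], eps: float = 1e-3) -> List[float]:
    """Per-step weights proportional to 1 / (speed + eps), normalized to mean 1."""
    if not speeds or eps <= 0.0:
        raise ValueError("need at least one speed and a positive eps")
    raw = [1.0 / (s + eps) for s in speeds]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def reweighted_loss(step_losses: Sequence[float], speeds: Sequence[float],
                    eps: float = 1e-3) -> float:
    """Average per-step losses so that slow (precision-critical) steps count more."""
    if len(step_losses) != len(speeds):
        raise ValueError("step_losses and speeds must have the same length")
    weights = inverse_velocity_weights(speeds, eps)
    return sum(w * l for w, l in zip(weights, step_losses)) / len(step_losses)
```

Normalizing the weights to unit mean keeps the overall loss scale unchanged, so in this sketch only the relative emphasis shifts toward low-velocity segments; `eps` bounds the weight of a stationary step.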


2510.01711 2026-06-02 cs.RO cs.LG 版本更新

Contrastive Representation Regularization for Vision-Language-Action Models

视觉-语言-动作模型的对比表示正则化

Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin

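Affiliations * KAIST

AI summary: Proposes Robot State-aware Contrastive Loss (RS-CL), which uses contrastive learning to align VLM representations with the robot's proprioceptive states, improving VLA model performance on robot manipulation tasks.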

Comments ICML 2026

详情

AI中文摘要

视觉-语言-动作（VLA）模型通过利用预训练视觉-语言模型（VLM）的丰富表示，在机器人操作中展现了强大的能力。然而，它们的表示可以说仍然次优，缺乏对控制动作和本体感受信息等机器人信号的敏感性。为了解决这个问题，我们引入了机器人状态感知对比损失（RS-CL），一种简单有效的VLA模型表示正则化方法，旨在弥合VLM表示与机器人信号之间的差距。特别地，RS-CL通过使用状态之间的相对距离作为软监督，使表示更紧密地对齐机器人的本体感受状态。作为原始动作预测目标的补充，RS-CL增强了控制相关表示学习，同时轻量级且与标准VLA训练流程完全兼容。我们的实验结果表明，RS-CL显著提升了最先进VLA模型的性能；它将先前技术在RoboCasa-Kitchen基准上的性能提升至69.7%，达到最先进水平，并在具有挑战性的真实机器人操作任务中将成功率从45.0%提升至58.3%。

英文摘要

Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models; it lifts the prior art to 69.7% on the RoboCasa-Kitchen benchmark, achieving state-of-the-art performance, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
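One way to read "relative distances between the states as soft supervision" is sketched below. This is an assumed formulation for illustration only: the soft targets, the temperature `tau`, the scale `beta`, and the cross-entropy form are hypothetical choices, and the abstract does not specify the actual RS-CL loss.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_contrastive_loss(embeddings, states, tau=0.1, beta=1.0):
    """Cross-entropy between state-distance soft targets and embedding similarities.

    Hypothetical sketch: for each anchor i, samples whose proprioceptive state
    is closer to state i receive more target probability mass.
    """
    n = len(embeddings)
    loss = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        targets = softmax([-beta * math.dist(states[i], states[j]) for j in others])
        preds = softmax([dot(embeddings[i], embeddings[j]) / tau for j in others])
        loss -= sum(t * math.log(p) for t, p in zip(targets, preds))
    return loss / n

# Toy batch: 1-D states, unit-norm 2-D embeddings whose angle tracks the state.
states = [(0.0,), (0.1,), (1.0,)]
aligned = [(math.cos(s[0] * math.pi / 2), math.sin(s[0] * math.pi / 2)) for s in states]
shuffled = [aligned[0], aligned[2], aligned[1]]  # embeddings no longer match states
```

With these toy inputs, the aligned embeddings yield a lower loss than the shuffled ones, which is the behavior a state-aware regularizer is meant to reward.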


2605.24881 2026-06-02 cs.RO 版本更新

Learning Transferable Motor Skills for Geometry-Aware Robotic Surface Tasks

面向几何感知的机器人表面任务的可迁移运动技能学习

Miroslav David, Karla Stepanova, Robert Babuska

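Affiliations * Czech Institute of Informatics, Robotics, and Cybernetics; Czech Technical University in Prague; Delft University of Technology

AI summary: Proposes a modular framework that decouples geometric motion planning from execution-level expert behavior, using interpretable atomic motor rules and neural-network inference to transfer skills across geometries.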

Comments In: Workshop on Geometry in the Age of Data-Driven Robotics, ICRA 2026, Vienna, 2026

详情

AI中文摘要

机器人表面交互任务，如喷涂或焊接，需要精确的几何规划和精确的运动执行。虽然现代运动规划器能够生成有效的几何路径，但它们通常缺乏人类操作员所具备的专家运动模式。相反，从示范中学习往往将任务执行紧密耦合到特定的训练几何形状，限制了可迁移性。我们提出了一种模块化框架，将几何运动规划与执行级专业知识解耦。专家行为被表示为一个可解释的、原子的运动规则词汇表，例如速度缩放和方向偏移，这些规则系统地修改几何规划的参考路径。我们训练了一个多模态神经网络，从运动轨迹数据和CAD模型几何中联合推断规则参数。我们通过在L形和窗形物体上的动态仿真评估了我们的方法，证明了模型在两种拓扑结构上成功提取了速度和方向规则。

英文摘要

Robotic surface-interaction tasks, such as spray painting or welding, require both accurate geometric planning and precise motion execution. While modern motion planners generate valid geometric paths, they often lack the expert motor patterns observed in human operators. Conversely, learning from demonstration often tightly couples task execution to the specific training geometry, limiting transferability. We propose a modular framework that decouples geometric motion planning from execution-level expertise. Expert behavior is represented as a vocabulary of interpretable, atomic motor rules, such as velocity scaling and orientation offsets, that systematically modify a geometrically planned reference path. We train a multimodal neural network to infer rule parameters jointly from kinematic trajectory data and CAD model geometry. We evaluate our approach through dynamic simulation on L-shaped and window-shaped objects, demonstrating on simulated data that the model successfully extracts velocity and orientation rules across both topologies.


2603.02845 2026-06-02 cs.RO cs.AI 版本更新

SPARC: Spatial-Aware Path Planning via Attentive Agent Communication

SPARC: 通过注意力智能体通信实现空间感知路径规划

Sayang Mu, Xiangyu Wu, Bo An

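Affiliations * Nanyang Technological University

AI summary: Proposes a Relation-enhanced Multi-Head Attention (RMHA) mechanism that embeds Manhattan distance into the attention-weight computation to prioritize messages from spatially nearby robots. When generalizing zero-shot from 8 to 128 robots on a 40x40 grid, it achieves a success rate of about 75% at 30% obstacle density, exceeding the baseline by more than 25 percentage points.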

Comments The manuscript is being withdrawn at the request of the first author for the purpose of revising content and re-uploading a revised version with updated data/figures/text. The revised manuscript will be resubmitted to arXiv promptly with the same author list and research theme

详情

旋转的特殊酉参数化估计器

Akshay Chandrasekhar

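Affiliations * University of California, Berkeley

AI summary: Revisits rotation estimation through special unitary matrices, proposes two new continuous representations for learning rotations in neural networks, and validates them experimentally.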

Comments Published at ICLR 2026; clarified paper contribution and theoretical narrative; 33 pages

2604.22896 2026-06-02 cs.RO cs.LG 版本更新

Magnetic Indoor Localization through CNN Regression and Rotation Invariance

基于CNN回归和旋转不变性的磁室内定位

Helge Rosé, Konstantin Klipp, Tom Koubek, Bernd Schäufele, Ilja Radusch

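Affiliations * University of Freiburg

AI summary: Trains a lightweight CNN on rotation-invariant features (magnetic field magnitude and projection onto the gravity axis) to localize indoors without orientation calibration, matching or exceeding state-of-the-art accuracy on the MagPie dataset.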

Comments Published and presented at the 2026 4th International Conference on Mechatronics, Control and Robotics (ICMCR)

详情

DOI: 10.1109/ICMCR69541.2026.11533953

AI中文摘要

室内定位是GNSS拒止环境中广泛应用的关键技术，包括室内导航和物联网系统。结合卷积神经网络（CNN）和基于磁场特征的方法，提供了一种低成本、无需基础设施的精确定位解决方案。尽管磁指纹是室内定位的一种有前景的方法，但基于原始3D磁力计数据训练的模型对设备方向高度敏感。我们通过使用从3D磁场导出的两个旋转不变特征来解决这个问题：磁场强度（Mn）和重力轴投影（Mg）。我们在磁序列上训练轻量级7层扩张CNN（MagNetS/XL），直接回归（x, y）位置。使用MagPie数据集（三栋建筑，手持轨迹），我们系统评估了测试和/或训练数据的固定和随机旋转。原始3D输入（Mx, My, Mz）在固定90°旋转下表现出各向同性误差增加，并随着随机旋转增大而进一步恶化。相比之下，2D输入（Mn, Mg）保持旋转不变精度，并且一旦旋转超过三个参考建筑的特定阈值（Loomis大建筑0°，Talbot中建筑5°，CSL小建筑6°），其性能就超过3D输入。MagNetXL在MagPie数据集上达到或超越了现有最优精度，而MagNetS以约三分之一的参数实现了相似性能，有利于移动部署。这些结果表明，在实际使用中，从旋转不变输入获得的鲁棒性超过了输入维度降低的损失，从而无需方向校准或额外基础设施即可进行地图构建和定位。

英文摘要

Indoor positioning is an essential technology for a wide range of applications in GNSS-denied environments, including indoor navigation and IoT systems. Combining convolutional neural networks (CNNs) and magnetic field-based features offers a low-cost, infrastructure-free solution for precise positioning. While magnetic fingerprints are a promising approach for indoor positioning, models trained on raw 3D magnetometer data are highly sensitive to device orientation. We address this by using two rotation-invariant features derived from the 3D magnetic field: the norm (Mn) and the projection onto the gravity axis (Mg). We train a lightweight 7-layer dilated CNN (MagNetS/XL) on magnetic sequences to directly regress (x, y) positions. Using the MagPie dataset (three buildings, handheld trajectories), we systematically evaluate fixed and random rotations of test and/or train data. Raw 3D inputs (Mx, My, Mz) exhibit isotropic error increases under fixed 90° rotations and further degrade with growing random rotations. In contrast, 2D (Mn, Mg) inputs maintain rotation-invariant accuracy and surpass the 3D inputs once rotation exceeds building-specific thresholds for three reference buildings: 0° for Loomis (large), 5° for Talbot (medium), and 6° for CSL (small). MagNetXL achieves or exceeds state-of-the-art accuracy on the MagPie dataset, and MagNetS delivers similar performance with roughly one third of the parameters, favoring mobile deployment. These results show that the robustness gained from rotation-invariant inputs outweighs the loss of input dimensionality in realistic usage, allowing mapping and localization without orientation alignment or added infrastructure.
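The two rotation-invariant features named above, the field norm (Mn) and the projection onto the gravity axis (Mg), can be sketched as follows. The sketch assumes the gravity direction is available in the same device frame as the magnetometer reading (for example from an accelerometer); that assumption, the sample values, and the function names are illustrative and not taken from the paper.

```python
import math

def rotation_invariant_features(m, g):
    """Return (Mn, Mg): the norm of m and its projection onto the gravity axis g.

    m and g are 3-vectors expressed in the same (device) frame.
    """
    norm_m = math.sqrt(sum(c * c for c in m))
    norm_g = math.sqrt(sum(c * c for c in g))
    proj = sum(a * b for a, b in zip(m, g)) / norm_g
    return norm_m, proj

def rotate(v, yaw_deg, roll_deg):
    """Rotate v about the z axis by yaw_deg, then about the x axis by roll_deg."""
    x, y, z = v
    a = math.radians(yaw_deg)
    x, y = math.cos(a) * x - math.sin(a) * y, math.sin(a) * x + math.cos(a) * y
    b = math.radians(roll_deg)
    y, z = math.cos(b) * y - math.sin(b) * z, math.sin(b) * y + math.cos(b) * z
    return (x, y, z)

# Made-up sample: magnetometer reading (microtesla) and gravity (m/s^2).
m = (22.0, -5.0, -41.0)
g = (0.3, 0.2, 9.78)
before = rotation_invariant_features(m, g)
after = rotation_invariant_features(rotate(m, 90, 35), rotate(g, 90, 35))
```

Because a device rotation acts on both vectors at once, the norm and the dot product are preserved, so `before` and `after` agree up to floating-point error even though the raw components (Mx, My, Mz) change.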


2603.15956 2026-06-02 cs.RO cs.AI 版本更新

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

ExpertGen: 从非完美行为先验的可扩展仿真到现实专家策略学习

Zifan Xu, Ran Gong, Maria Vittoria Minniti, Kausik Sivakumar, Ahmet Salih Gundogdu, Eric Rosen, Riedana Yan, Tushar Kusnur, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper

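Affiliations * Robotics and AI Institute; University of Texas at Austin; Sony AI

AI summary: Proposes ExpertGen, a framework that initializes a behavior prior with a diffusion policy and uses reinforcement learning to optimize its initial noise, producing high-quality expert policies from sparse rewards alone and enabling scalable sim-to-real transfer.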

详情

AI中文摘要

学习通用且鲁棒的行为克隆策略需要大量高质量的机器人数据。虽然人类演示（例如通过遥操作）是专家行为的标准来源，但在现实世界中大规模获取此类数据成本过高。本文介绍了ExpertGen，一个在仿真中自动化专家策略学习的框架，以实现可扩展的仿真到现实迁移。ExpertGen首先使用在非完美演示（可能由大语言模型合成或由人类提供）上训练的扩散策略初始化行为先验。然后，通过优化扩散模型的初始噪声同时保持原始策略冻结，使用强化学习将该先验引导至高的任务成功率。通过保持预训练的扩散策略冻结，ExpertGen将探索正则化到安全、类人的行为流形内，同时仅使用稀疏奖励即可实现有效学习。在具有挑战性的操作基准上的实证评估表明，ExpertGen无需奖励工程即可可靠地生成高质量的专家策略。在工业装配任务中，ExpertGen实现了90.5%的整体成功率，而在长时域操作任务中达到了85%的整体成功率，优于所有基线方法。所得策略表现出灵巧的控制，并在不同的初始配置和失败状态下保持鲁棒。为了验证仿真到现实的迁移，学习到的基于状态的专家策略通过DAgger进一步提炼为视觉运动策略，并成功部署在真实的机器人硬件上。

英文摘要

Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.


2604.14344 2026-06-02 cs.RO 版本更新

CART: Context-Aware Terrain Adaptation using Temporal Sequence Selection for Legged Robots

CART: 基于时间序列选择的上下文感知地形自适应方法用于腿式机器人

Kartikeya Singh, Youngjin Kim, Yash Turkar, Karthik Dantu

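Affiliations * DRONES LAB, University at Buffalo, NY, USA

AI summary: Proposes CART, a high-level controller that fuses contextual information from proprioception and exteroception to improve stable legged locomotion on complex terrain. It raises the average success rate by 5% in simulation and reduces base oscillation by up to 41% in simulation and 22% in real-world experiments.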

详情

AI中文摘要

自然界中的动物结合多种模态（如视觉和触觉）来感知地形，并发展出在不平坦地形上高效行走的理解。同样，腿式机器人需要通过发展对视觉和本体感觉之间关系的理解，来增强其在复杂地形上稳定行走的能力。目前大多数地形自适应方法在复杂的越野地形上仍然容易失败，因为它们没有明确建模外部感知地形外观与本体感觉物理交互之间的上下文关系。这种基于经验的学习往往会在所见与真实感受之间产生视觉-纹理悖论。在这项工作中，我们引入了CART，一种基于上下文感知地形自适应方法的高层控制器，它集成了来自机载传感器的本体感觉和外部感知，以实现对地形的鲁棒理解。我们在多种地形上使用Unitree Go2和ANYmal-C机器人在IsaacSim模拟器中进行评估，并在真实世界实验中使用Boston Dynamics SPOT机器人。为了评估学习到的上下文是否能在各种悖论情况下改善运动行为，我们在仿真和真实实验中测量了机器人的稳定性、穿越成功率和任务完成时间。我们将CART与多种地形条件下的最先进运动控制和地形自适应基线进行比较。CART在仿真中将平均成功率比基线提高了5%，同时改善了上下文条件化的运动行为，包括在仿真中将基座振荡降低最多41%，在真实世界中降低22%，且不增加完成运动任务所需的时间。

英文摘要

Animals in nature combine multiple modalities, such as sight and feel, to perceive terrain and develop an understanding of how to walk on uneven terrain in an efficient manner. Similarly, legged robots need to develop their ability to stably walk on complex terrains by developing an understanding of the relationship between vision and proprioception. Most current terrain-adaptation methods remain susceptible to failure on complex off-road terrain because they do not explicitly model the context between exteroceptive terrain appearance and proprioceptive physical interaction. This experience-based learning often creates a Visual-Texture Paradox between what has been seen and how it actually feels. In this work, we introduce CART, a high-level controller built on a context-aware terrain adaptation approach that integrates proprioception and exteroception from onboard sensing to achieve a robust understanding of terrain. We evaluate our method on multiple terrains using the Unitree Go2 and ANYmal-C robots in the IsaacSim simulator and a Boston Dynamics SPOT robot for our real-world experiments. To evaluate whether the learned context improves locomotion behavior under the various paradox circumstances, we measure the robot's stability, traversal success, and task completion time in both simulation and real-world experiments. We compare CART against state-of-the-art locomotion and terrain-adaptation baselines across diverse terrain conditions. CART improves the average success rate by 5% over the baselines in simulation, while improving context-conditioned locomotion behavior, including up to 41% lower base oscillation in simulation and 22% in the real world, without increasing the time required to complete the locomotion tasks.


2504.08278 2026-06-02 math.OC cs.RO cs.SY eess.SY 版本更新

Line-Search Filter Differential Dynamic Programming for Optimal Control with Nonlinear Equality Constraints

带非线性等式约束最优控制的线搜索滤波微分动态规划

Ming Xu, Stephen Gould, Iman Shames

Affiliations * School of Computer and Communication Sciences, EPFL; School of Computing, Australian National University; Department of Electrical and Electronic Engineering, University of Melbourne

AI summary: Proposes FilterDDP, an algorithm that handles nonlinear equality constraints with a line search and a step filter, proves local quadratic convergence, and validates the method on contact-implicit trajectory optimization problems in robotics.

Comments Accepted for publication in the IEEE International Conference on Robotics and Automation (ICRA) 2026. Revised version with more exposition in methodology and updated results with improved implementation

详情

AI中文摘要

我们提出FilterDDP，一种用于求解带非线性等式约束的离散时间最优控制问题的微分动态规划算法。与基于价值函数或增广拉格朗日类算法的先前方法不同，FilterDDP使用步长滤波器结合线搜索来处理等式约束。我们确定了步长滤波器准则的两个重要设计选择，这些选择带来了鲁棒的数值性能：1）在步长接受准则中使用拉格朗日函数而非代价函数；2）在反向传播中扰动值函数Hessian矩阵。这两个选择都有严格的理论依据，特别是对于2），我们给出了局部二次收敛的形式化证明。除了提供处理同时含等式和不等式约束的最优控制问题的原始-对偶内点扩展外，我们还在机器人学中出现的三个接触隐式轨迹优化问题上验证了FilterDDP。

英文摘要

We present FilterDDP, a differential dynamic programming algorithm for solving discrete-time optimal control problems (OCPs) with nonlinear equality constraints. Unlike prior methods based on merit functions or the augmented Lagrangian class of algorithms, FilterDDP uses a step filter in conjunction with a line search to handle equality constraints. We identify two important design choices for the step filter criteria which lead to robust numerical performance: 1) we use the Lagrangian instead of the cost in the step acceptance criterion and, 2) in the backward pass, we perturb the value function Hessian. Both choices are rigorously justified, for 2) in particular by a formal proof of local quadratic convergence. In addition to providing a primal-dual interior point extension for handling OCPs with both equality and inequality constraints, we validate FilterDDP on three contact-implicit trajectory optimisation problems which arise in robotics.
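The general idea of combining a step filter with a backtracking line search can be sketched as follows. This is a generic filter acceptance test in the style used by filter line-search methods, not FilterDDP's exact criterion: `theta` stands for constraint violation, `phi` for the objective-like quantity (the abstract says the Lagrangian plays this role), and the single margin `gamma`, the shrink factor, and the toy `evaluate` function are assumptions.

```python
def acceptable(trial, filter_entries, gamma=1e-5):
    """A trial (theta, phi) is acceptable if, against every filter entry,
    it sufficiently reduces either the constraint violation or phi."""
    theta, phi = trial
    return all(
        theta <= (1.0 - gamma) * th or phi <= ph - gamma * th
        for th, ph in filter_entries
    )

def backtracking_line_search(evaluate, filter_entries, alpha=1.0,
                             shrink=0.5, min_alpha=1e-8):
    """Halve the step size until the filter accepts the trial point.

    evaluate(alpha) must return (theta, phi) for the candidate step.
    Returns (alpha, trial), or (None, None) if no step is accepted.
    """
    while alpha >= min_alpha:
        trial = evaluate(alpha)
        if acceptable(trial, filter_entries):
            return alpha, trial
        alpha *= shrink
    return None, None

# Toy problem: larger steps increase both violation and phi.
filter_entries = [(1.0, 10.0)]
step, point = backtracking_line_search(lambda a: (2.0 * a, 10.0 + a), filter_entries)
```

In the toy problem the full step and the half step are rejected because they improve neither quantity, and the quarter step is accepted because it halves the constraint violation.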


2604.12792 2026-06-02 cs.RO 版本更新

Actuation space reduction to facilitate insightful shape matching in a novel reconfigurable tendon driven continuum manipulator

驱动空间缩减以促进新型可重构腱驱动连续体机器人的形状匹配洞察

Sabyasachi Dash, John Golden, Girish Krishnan

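Affiliations * Department of Industrial and Enterprise Systems Engineering, University of Illinois Urbana-Champaign; Department of Mechanical Science and Engineering, University of Illinois, Urbana-Champaign

AI summary: Proposes a continuum manipulator design that reconfigures tendon routing by rotating spacer disks, and uses an intermediate curvature-torsion space to simplify the mapping from a desired backbone curve to actuator inputs, enabling a model-free, sequential shape-matching strategy.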

详情

DOI: 10.1109/RoboSoft67810.2026.11522832

AI中文摘要

在腱驱动连续体机器人（TDCM）中，重构腱路径可以实现骨架的定制空间变形。本文提出一种设计，其中腱可以在驱动之前或之后通过主动旋转各个间隔盘来重新布线。每个盘的旋转因此为驱动空间增加了一个自由度，使得从期望骨架曲线到相应驱动器输入的映射复杂化。然而，当骨架形状投影到由曲率和扭转（C-T）定义的中间空间时，会出现一些模式，突出显示哪些盘对实现全局形状最有影响。这种洞察力使得一种简化的顺序形状匹配策略成为可能：首先，旋转近端和中间盘以近似全局形状；然后，调整远端盘以微调末端执行器位置，同时对整体形状影响最小。所提出的驱动框架为传统控制方法提供了一种无模型替代方案，绕过了建模可重构TDCM的复杂性。

英文摘要

In tendon driven continuum manipulators (TDCMs), reconfiguring the tendon routing enables tailored spatial deformation of the backbone. This work presents a design in which tendons can be rerouted either prior to or after actuation by actively rotating the individual spacer disks. Each disk rotation thus adds a degree of freedom to the actuation space, complicating the mapping from a desired backbone curve to the corresponding actuator inputs. However, when the backbone shape is projected into an intermediate space defined by curvature and torsion (C-T), patterns emerge that highlight which disks are most influential in achieving a global shape. This insight enables a simplified, sequential shape-matching strategy: first, the proximal and intermediate disks are rotated to approximate the global shape; then, the distal disks are adjusted to fine-tune the end-effector position with minimal impact on the overall shape. The proposed actuation framework offers a model-free alternative to conventional control approaches, bypassing the complexities of modeling reconfigurable TDCMs.


2604.10579 2026-06-02 cs.RO cs.AI 版本更新

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

AffordGen: 通过可供性对应生成多样化演示以实现通用物体操作

Jiawei Zhang, Kaizhe Hu, Yingqian Huang, Yuanchen Ju, Zhengrong Xue, Huazhe Xu

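Affiliations * Shanghai Qi Zhi Institute; Tsinghua University; Fudan University; UC Berkeley

AI summary: Proposes AffordGen, a framework that uses 3D generative models and vision foundation models to find semantic correspondences across large-scale 3D meshes and generate diverse manipulation trajectories, then trains a robust closed-loop visuomotor policy that generalizes zero-shot to unseen objects.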

详情

AI中文摘要

尽管现代模仿学习方法在机器人操作中取得了近期成功，但其性能常常受到数据多样性不足导致的几何变化的限制。利用强大的3D生成模型和视觉基础模型（VFMs），所提出的AffordGen框架通过利用大规模3D网格上有意义关键点的语义对应来生成新的机器人操作轨迹，从而克服了这一限制。然后，这个大规模、可供性感知的数据集被用于训练一个鲁棒的、闭环的视觉运动策略，结合了可供性的语义泛化能力和端到端学习的反应性鲁棒性。在仿真和现实世界中的实验表明，使用AffordGen训练的策略实现了高成功率，并能够零样本泛化到真正未见过的物体，显著提高了机器人学习中的数据效率。项目页面：https://jiaweiz9.github.io/AffordGen-release/

英文摘要

Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric variations due to limited data diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning. Project Page: https://jiaweiz9.github.io/AffordGen-release/


2604.09877 2026-06-02 cs.CV cs.AI cs.RO 版本更新

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

Genie 4D：语义先验引导的4D动态场景重建

Yiru Yang, Zhuojie Wu, Nishant Kumar Singh, Max Schulthess

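Affiliations * University of Zurich; ETH Zurich

AI summary: Proposes Genie 4D, a framework that combines a real-time visual-inertial Gaussian splatting front-end with a feed-forward 4D backbone, uses frozen DINOv3 features as structural priors to suppress identity drift, recovers high-frequency detail with a conditional diffusion refiner, and exposes the reconstruction to a user-controllable 4D world model through a lightweight latent-action head.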

详情

AI中文摘要

在计算机视觉与机器人感知的交汇处，动态场景的4D重建将低层几何感知与高层语义理解联系起来。我们提出Genie 4D，一个将手持手机拍摄转化为语义化、动作可控的4D世界模型的框架。Genie 4D将用于度量几何的实时视觉惯性高斯泼溅前端与由冻结的DINOv3特征（作为结构先验）正则化的前馈4D骨干网络相结合。语义先验抑制了动态跟踪中的身份漂移，而短条件扩散精炼器恢复了回归骨干网络平滑掉的高频表面细节。最后，一个轻量级潜在动作头将重建的4D状态暴露给以JEPA风格下一嵌入目标训练的Genie式世界模型，使得场景可以在用户动作下向前推进。在Point Odyssey和TUM-Dynamics基准测试上，Genie 4D保留了前馈基线的线性时间复杂度O(T)，同时提高了3D跟踪精度（APD）和重建完整性，并且可以在单个消费级GPU（RTX 5090）上通过iPhone、Mac、Windows和Linux采集客户端交互式运行。Genie 4D为走向物理基础的世界模型提供了一条实用的、语义先验引导的路径。

英文摘要

At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing with high-level semantic understanding. We present Genie 4D, a framework that turns hand-held phone capture into a semantically grounded, action-controllable 4D world model. Genie 4D couples a real-time visual-inertial Gaussian splatting front-end for metric geometry with a feed-forward 4D backbone regularized by frozen DINOv3 features acting as structural priors. The semantic priors suppress identity drift during dynamic tracking, while a short conditional diffusion refiner recovers high-frequency surface detail that regression backbones smooth away. Finally, a lightweight latent-action head exposes the reconstructed 4D state to a Genie-style world model trained with a JEPA-style next-embedding objective, so that the scene can be rolled forward under user actions. On the Point Odyssey and TUM-Dynamics benchmarks, Genie 4D retains the linear time complexity O(T) of feed-forward baselines while improving 3D tracking accuracy (APD) and reconstruction completeness, and it runs interactively on a single consumer GPU (RTX 5090) from iPhone, Mac, Windows, and Linux capture clients. Genie 4D offers a practical, semantic-prior-guided path toward physically grounded world models.


2604.09487 2026-06-02 cs.RO cs.LG 版本更新

Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks

基于广义执行器网络的肌肉驱动机器人仿真到现实迁移

Jan Schneider, Mridul Mahajan, Le Chen, Simon Guist, Bernhard Schölkopf, Ingmar Posner, Dieter Büchler

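AI summary: Proposes the Generalized Actuator Network (GenAN), which learns an actuator model from joint position trajectories to transfer policies from simulation to real muscle-actuated robots, and reports the first successful sim-to-real transfer of dynamic tasks on a four-degrees-of-freedom muscle-actuated robot arm.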

详情

AI中文摘要

肌腱驱动配合软肌肉执行器使机器人更快、更安全，同时可能加速技能获取。然而，由于固有的非线性、摩擦和迟滞，这些系统在实际中很少使用，这给建模和控制带来了复杂性。到目前为止，这些挑战阻碍了策略从仿真到真实系统的迁移。为弥合这一差距，我们提出了一种仿真到现实的流程，该流程学习这种复杂执行器的神经网络模型，并利用成熟的刚体仿真来处理手臂动力学和与环境的交互。我们的方法称为广义执行器网络（GenAN），通过直接从关节位置轨迹学习，而不是需要扭矩传感器，从而能够在广泛的机器人上进行执行器模型识别。在PAMY2（一种由气动人工肌肉驱动的肌腱驱动机器人）上使用GenAN，我们成功部署了完全在仿真中训练的、动态但精确的到达目标、杯中球和乒乓球策略。据我们所知，这一结果构成了四自由度肌肉驱动机器人臂首次成功的仿真到现实迁移。

英文摘要

Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems are rarely used in practice due to inherent nonlinearities, friction, and hysteresis, which complicate modeling and control. So far, these challenges have hindered policy transfer from simulation to real systems. To bridge this gap, we propose a sim-to-real pipeline that learns a neural network model of this complex actuation and leverages established rigid body simulation for the arm dynamics and interactions with the environment. Our method, called Generalized Actuator Network (GenAN), enables actuation model identification across a wide range of robots by learning directly from joint position trajectories rather than requiring torque sensors. Using GenAN on PAMY2, a tendon-driven robot powered by pneumatic artificial muscles, we successfully deploy dynamic but precise goal-reaching, ball-in-a-cup, and table tennis policies, trained entirely in simulation. To the best of our knowledge, this result constitutes the first successful sim-to-real transfer for a four-degrees-of-freedom muscle-actuated robot arm.


2604.02878 2026-06-02 cs.RO cs.SY eess.SY 版本更新

An Asynchronous Two-Speed Kalman Filter for Real-Time UUV Cooperative Navigation Under Acoustic Delays

一种用于声学延迟下实时UUV协同导航的异步双速卡尔曼滤波器

Shuyue Li, Miguel López-Benítez, Eng Gee Lim, Fei Ma, Qian Dong, Mengze Cao, Limin Yu, Xiaohui Qin

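Affiliations * Xi’an Jiaotong-Liverpool University; Suzhou Municipal Key Laboratory Broadband Wireless Access Technology; XJTLU-JITRI Academy

AI summary: To address the difficulty of real-time state estimation under underwater acoustic communication delays, proposes an Asynchronous Two-Speed Kalman Filter (TSKF) that decouples estimation with a Variational History Distillation (VHD) mechanism, handling high-frequency real-time control and delayed collaborative information separately while keeping trajectory error comparable to batch-optimization methods under severe delays.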

Comments 6 pages, 6 figures. Accepted for publication in the 2026 IEEE International Conference on Industrial Informatics (INDIN). © 2026 IEEE. Personal use of this material is permitted. See PDF for the full IEEE copyright notice

详情

AI中文摘要

在全球导航卫星系统（GNSS）受限的水下环境中，单个无人水下航行器（UUV）会遭受无界航位推算漂移，因此协同导航（CN）对于精确状态估计至关重要。然而，水声信道固有的严重通信延迟对实时状态估计构成了严峻挑战。传统滤波器，如扩展卡尔曼滤波器（EKF）或无迹卡尔曼滤波器（UKF），通常在等待延迟数据时阻塞主控制回路，或者有效丢弃乱序测量（OOSM），导致严重漂移。为了解决这一问题，我们提出了一种由新颖投影机制——变分历史蒸馏（VHD）增强的异步双速卡尔曼滤波器（TSKF）。所提出的架构将估计过程解耦为两个并行线程：一个快速线程利用高斯过程（GP）补偿的航位推算来保证高频实时控制，另一个慢速线程专门处理异步延迟的协同信息。通过引入有限长度循环状态缓冲区（FLCSB），该算法将延迟测量应用于对应的历史状态，并利用基于VHD的投影将修正快速前向传播到当前时刻，而无需计算密集的重新计算。仿真结果表明，所提出的TSKF在严重延迟（高达30秒）下保持了与计算密集的批量优化方法相当的轨迹误差。在亚毫秒时间内执行，它显著优于标准EKF/UKF。结果展示了一种有效的控制、通信和计算（3C）协同设计，显著增强了自主海洋自动化系统的鲁棒性。

英文摘要

In Global Navigation Satellite System (GNSS)-denied underwater environments, individual unmanned underwater vehicles (UUVs) suffer from unbounded dead-reckoning drift, making collaborative navigation (CN) crucial for accurate state estimation. However, the severe communication delay inherent in underwater acoustic channels poses serious challenges to real-time state estimation. Traditional filters, such as Extended Kalman Filters (EKFs) or Unscented Kalman Filters (UKFs), usually block the main control loop while waiting for delayed data, or effectively discard Out-of-Sequence Measurements (OOSMs), resulting in serious drift. To address this, we propose an Asynchronous Two-Speed Kalman Filter (TSKF) enhanced by a novel projection mechanism, which we term Variational History Distillation (VHD). The proposed architecture decouples the estimation process into two parallel threads: a fast-rate thread that utilizes Gaussian Process (GP) compensated dead reckoning to guarantee high-frequency real-time control, and a slow-rate thread dedicated to processing asynchronously delayed collaborative information. By introducing a Finite-Length Circular State Buffer (FLCSB), the algorithm applies delayed measurements to their corresponding historical states, and utilizes a VHD-based projection to fast-forward the correction to the current time without computationally heavy recalculations. Simulation results demonstrate that the proposed TSKF maintains a trajectory error comparable to computationally intensive batch-optimization methods under severe delays (up to 30 s). Executing in sub-millisecond time, it significantly outperforms standard EKF/UKF. The results demonstrate an effective control, communication, and computing (3C) co-design that significantly enhances the resilience of autonomous marine automation systems.
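The role of a finite-length circular state buffer can be illustrated with a deliberately simplified 1-D sketch. It is not the TSKF: the fixed scalar `gain` replaces a Kalman gain, and carrying the historical correction forward unchanged is a crude stand-in for the VHD-based projection, whose details the abstract does not give. The class and method names are hypothetical.

```python
from collections import deque

class DelayedFusion:
    """1-D dead reckoning with a fixed-capacity history for delayed measurements."""

    def __init__(self, capacity=64, gain=0.5):
        self.buffer = deque(maxlen=capacity)  # (timestamp, position) pairs
        self.gain = gain                      # assumed constant, not a Kalman gain
        self.position = 0.0
        self.time = 0.0

    def predict(self, velocity, dt):
        """Fast path: integrate velocity and record the state in the buffer."""
        self.time += dt
        self.position += velocity * dt
        self.buffer.append((self.time, self.position))

    def apply_delayed(self, stamp, measured_position):
        """Slow path: correct the buffered state nearest to `stamp`, then carry
        the same correction forward to all later states and the current one."""
        if not self.buffer or stamp < self.buffer[0][0]:
            return False  # measurement is older than the retained history
        past_time, past_pos = min(self.buffer, key=lambda e: abs(e[0] - stamp))
        correction = self.gain * (measured_position - past_pos)
        self.buffer = deque(
            ((t, p + correction) if t >= past_time else (t, p)
             for t, p in self.buffer),
            maxlen=self.buffer.maxlen,
        )
        self.position += correction
        return True

# Ten 1-second steps at 1 m/s, then a measurement that arrives 5 s late.
fusion = DelayedFusion(capacity=8, gain=0.5)
for _ in range(10):
    fusion.predict(velocity=1.0, dt=1.0)
accepted = fusion.apply_delayed(stamp=5.0, measured_position=6.0)
```

The fixed buffer capacity bounds memory and sets the largest delay that can still be fused; measurements older than the oldest buffered state are rejected rather than blocking the fast path.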


2602.12724 2026-06-02 cs.RO 版本更新

URDF-Anything+：面向仿真就绪铰接资产的端到端生成

Zhuangzhe Wu, Yue Xin, Chengkai Hou, Minghao Chen, Yaoxu Lyu, Jieyu Zhang, Shanghang Zhang

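Affiliations * Peking University; Visual Geometry Group, University of Oxford; University of Washington

AI summary: Proposes URDF-Anything+, an end-to-end autoregressive diffusion framework that generates simulation-ready URDF models directly from a single RGB image, jointly modeling part geometry and articulation structure.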

详情

AI中文摘要

铰接物体是机器人学、物理仿真和交互式虚拟环境的基础。然而，从视觉观测中恢复它们本质上具有挑战性，因为图像仅提供关于部件几何及其底层运动学结构的部分和模糊线索。现有方法通常依赖多阶段流水线、从资产库检索或显式部件分割。我们提出URDF-Anything+，一种端到端自回归扩散框架，直接从单张RGB图像生成仿真就绪的URDF模型。以视觉观测和物体几何为条件，URDF-Anything+在结构化潜在空间中运行，并在统一生成过程中联合建模部件几何和铰接。具体而言，模型顺序预测每个铰接部件及其关联的关节参数，同时一个终止标记动态确定部件数量。这种设计使得无需外部检索或后处理阶段即可直接生成完全可执行的URDF。在大规模铰接物体基准上的实验表明，URDF-Anything+在几何重建质量、关节参数估计和物理可执行性方面优于先前方法，同时比现有多阶段方法显著更高效。此外，生成的URDF作为忠实数字孪生，使得纯仿真训练的操作策略能够零样本迁移。

英文摘要

Articulated objects are fundamental for robotics, simulation of physics, and interactive virtual environments. However, recovering them from visual observations is inherently challenging, as images provide only partial and ambiguous cues about both part geometry and their underlying kinematic structure. Existing approaches typically rely on multi-stage pipelines, retrieval from asset libraries, or explicit part segmentation. We present URDF-Anything+, an end-to-end autoregressive diffusion framework that generates simulation-ready URDF models directly from a single RGB image. Conditioned on visual observations and object geometry, URDF-Anything+ operates in a structured latent space and jointly models part geometry and articulation in a unified generation process. Specifically, the model sequentially predicts each articulated part together with its associated joint parameters, while a termination token dynamically determines the number of parts. This design enables direct generation of fully executable URDFs without external retrieval or post-processing stages. Experiments on large-scale articulated object benchmarks demonstrate that URDF-Anything+ outperforms prior methods in geometric reconstruction quality, joint parameter estimation, and physical executability, while being substantially more efficient than existing multi-stage approaches. Furthermore, the generated URDFs serve as faithful digital twins, enabling the zero-shot transfer of manipulation policies trained purely in simulation.


2603.11653 2026-06-02 cs.LG cs.RO 版本更新

Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

简单配方有效：视觉-语言-动作模型通过强化学习成为自然持续学习者

Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, Roberto Martin-Martin

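Affiliations * University of Southern California; University of Texas at Austin; University of California, Berkeley

AI summary: Through a systematic study, finds that for large pretrained vision-language-action models, simple sequential fine-tuning with low-rank adaptation shows high plasticity, almost no forgetting, and strong zero-shot generalization in continual reinforcement learning, outperforming more complex methods.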

Comments Accepted at RLC 2026

详情

AI中文摘要

基于事件的四旋翼飞行器在杂乱环境中的近似模仿学习

Nico Messikommer, Jiaxu Xing, Leonard Bauersfeld, Marco Cannici, Elie Aljalbout, Davide Scaramuzza

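Affiliations * Robotics and Perception Group, University of Zurich, Switzerland

AI summary: Proposes Approximate Imitation Learning, a framework that separates representation learning from policy search and cuts training time for event-camera quadrotor flight policies from 52.44 hours to 1.86 hours (a 28x speedup), with high-speed flight validated in simulation and in the real world.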

详情

AI中文摘要

事件相机具有高时间分辨率和低延迟，使其成为高速机器人应用的理想传感器，而传统相机会因运动模糊而失效。然而，它们在机器人学习中的广泛应用受到在线训练期间模拟高频事件数据计算成本的严重制约。在这项工作中，我们提出了近似模仿学习，这是一个从根本上解决这一瓶颈的新框架，将复杂、敏捷无人机飞行的策略训练时间从52.44小时减少到仅1.86小时——实现了28倍的计算加速。我们的关键见解是将表征学习与策略搜索分离。我们首先利用大规模离线数据集学习特定于任务的表征空间。随后，通过仅依赖轻量级状态信息的在线交互对策略进行微调，完全消除了在主动策略搜索阶段渲染事件的需求。这种训练范式极大地降低了开发开销，并使基于事件的控制策略能够扩展到复杂环境。此外，我们的方法在部署期间消除了对标准相机或中间表示的依赖，直接将事件映射到控制命令。在仿真中，我们的方法匹配或超过了需要完整在线事件渲染的标准模仿学习基线的性能。最后，我们在真实世界中成功验证了该框架，展示了通过这种超高效范式训练的策略使四旋翼飞行器能够在高度杂乱的环境中以前所未有的速度（高达9.8米/秒）飞行。

英文摘要

Event cameras offer high temporal resolution and low latency, making them ideal sensors for high-speed robotic applications where conventional cameras suffer from motion blur. However, their widespread adoption in robot learning is severely bottlenecked by the computational cost of simulating high-frequency event data during online training. In this work, we present Approximate Imitation Learning, a novel framework that fundamentally resolves this bottleneck, reducing policy training time for complex, agile drone flight from 52.44 hours to just 1.86 hours - a 28x computational speedup. Our key insight is to separate representation learning from policy search. We first leverage a large-scale offline dataset to learn a task-specific representation space. Subsequently, the policy is fine-tuned through online interactions that rely solely on lightweight state information, completely eliminating the need to render events during the active policy search phase. This training paradigm drastically reduces development overhead and enables event-based control policies to scale to complex environments. Furthermore, our approach eliminates the reliance on standard cameras or intermediate representations during deployment, mapping events directly to control commands. In simulation, our method matches or exceeds the performance of standard imitation learning baselines that require full online event rendering. Finally, we successfully validate the framework in the real world, demonstrating that a policy trained via this ultra-efficient paradigm enables a quadrotor to fly through highly cluttered environments at remarkable speeds of up to 9.8 m/s.


2602.01662 2026-06-02 cs.RO 版本更新

PLanAR: Planning-Language-Grounded Agentic Reasoning for Robot Manipulation

PLanAR：面向机器人操作的规划语言基础智能推理

Pengyuan Guo, Zhonghao Mai, Zhengtong Xu, Kaidi Zhang, Quan Khanh Luu, Heng Zhang, Zichen Miao, Arash Ajoudani, Zachary Kingston, Qiang Qiu, Yu She

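Affiliations: Purdue University; Istituto Italiano di Tecnologia

AI summary: Proposes PLanAR, a framework that uses a planning-language interface to define the VLM reasoning space, enabling open-vocabulary, long-horizon robot manipulation with stepwise verification and replanning.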

Comments New version with updated framing, contributions, experiments, and figures

详情

AI中文摘要

近期视觉-语言模型（VLM）的进步推动了真实世界机器人操作的进展。然而，非结构化环境中的长时域操作要求VLM推理变化的场景状态、动作约束和执行结果，这仅靠自然语言推理仍然困难。我们提出PLanAR，一个规划语言基础的机器人智能体框架，用于开放词汇的长时域操作。PLanAR使用规划语言接口定义VLM推理空间：对象谓词表示场景状态，动作模式指定具有前提条件和效果的机器人技能，符号规划提供可执行的中间表示。该接口支持逐步验证：在每个动作之后，PLanAR利用机载观测检查预期符号效果是否实现，使基于VLM的智能体能够更新任务状态、检测失败，并在执行偏离预期时重新规划。在多种机器人形态、VLM后端以及包括堆叠、填字游戏和长时域厨房工作流程的任务中，PLanAR展示了强大的真实世界能力，同时揭示了当前VLM在具身推理中的关键局限性。

英文摘要

Recent advances in vision-language models (VLMs) have enabled increasing progress in real-world robot manipulation. However, long-horizon manipulation in unstructured environments requires VLMs to reason about changing scene states, action constraints, and execution outcomes, which remains difficult with natural language reasoning alone. We present PLanAR, a planning-language-grounded robot agent framework for open-vocabulary, long-horizon manipulation. PLanAR uses a planning-language interface to define the VLM reasoning space: object predicates represent scene states, action schemas specify robot skills with preconditions and effects, and symbolic plans provide executable intermediate representations. This interface enables stepwise verification: after each action, PLanAR uses onboard observations to check whether the expected symbolic effects have been achieved, allowing the VLM-based agent to update task states, detect failures, and replan when execution deviates from expectation. Across robot embodiments, VLM backends, and tasks including stacking, crossword solving, and long-horizon kitchen workflows, PLanAR demonstrates strong real-world capability while revealing key limitations of current VLMs in embodied reasoning.
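The stepwise verification described above can be sketched with a small symbolic model. This is a minimal illustration under stated assumptions, not PLanAR's actual interface: the `ActionSchema` class, the predicate strings, and the helper names are all hypothetical, and grounding predicates from onboard observations (done by a VLM in the paper) is replaced here by hand-written sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSchema:
    """A robot skill with symbolic preconditions and effects (hypothetical structure)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def applicable(schema, state):
    """A skill may run only if every precondition holds in the current state."""
    return schema.preconditions <= state


def expected_state(schema, state):
    """Symbolic state the planner expects once the skill has succeeded."""
    return (state - schema.del_effects) | schema.add_effects


def verify_step(schema, observed_state):
    """Compare predicates grounded from an observation with the expected effects."""
    missing = schema.add_effects - observed_state    # effects that did not appear
    lingering = schema.del_effects & observed_state  # predicates that should be gone
    ok = not missing and not lingering
    return ok, missing, lingering


stack = ActionSchema(
    name="stack(red, blue)",
    preconditions=frozenset({"holding(red)", "clear(blue)"}),
    add_effects=frozenset({"on(red, blue)", "clear(red)", "hand_empty"}),
    del_effects=frozenset({"holding(red)", "clear(blue)"}),
)
before = frozenset({"holding(red)", "clear(blue)", "on_table(blue)"})
assert applicable(stack, before)

ok, _, _ = verify_step(stack, expected_state(stack, before))  # execution went as planned
failed_ok, missing, _ = verify_step(stack, before)            # nothing changed: replan
print(ok, failed_ok, sorted(missing))
```

A failed check returns the predicates that did not hold, which is the kind of signal an agent could use to update its task state and replan.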


2602.23694 2026-06-02 cs.RO cs.AI 版本更新

Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion

基于对数似然比融合的可解释多模态手势识别用于无人机和移动机器人遥操作

Seungyeol Baek, Jaspreet Singh, Lala Shakti Swarup Ray, Hymalai Bello, Paul Lukowicz, Sungho Suh

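Affiliations: Department of Artificial Intelligence, Korea University; Department of Computer Science, RPTU Kaiserslautern-Landau; Embedded Intelligence, German Research Center for Artificial Intelligence (DFKI)

AI summary: Proposes a multimodal gesture recognition framework that fuses inertial data from wrist-worn Apple Watches with capacitive sensing signals from custom gloves. A late-fusion strategy based on the log-likelihood ratio improves performance and provides interpretability, matching a vision-based baseline at lower computational cost.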

详情

AI中文摘要

人类操作员仍经常暴露在危险环境中，如灾区及工业设施，在这些场景中，移动机器人和无人飞行器（UAV）的直观可靠遥操作至关重要。在此背景下，免手持遥操作增强了操作员的移动性和态势感知能力，从而提高了危险环境中的安全性。尽管基于视觉的手势识别已被探索作为免手持遥操作的一种方法，但其性能在遮挡、光照变化和杂乱背景下常会下降，限制了其在真实操作中的适用性。为克服这些限制，我们提出一种多模态手势识别框架，该框架融合来自双手腕上Apple Watch的惯性数据（加速度计、陀螺仪和方向）与来自定制手套的电容传感信号。我们设计了一种基于对数似然比（LLR）的后期融合策略，该策略不仅提升了识别性能，还通过量化模态特定贡献提供了可解释性。为支持本研究，我们引入了一个包含20种受飞机引导信号启发的手势的新数据集，包含同步的RGB视频、IMU和电容传感器数据。实验结果表明，我们的框架在显著降低计算成本、模型大小和训练时间的同时，达到了与最先进的视觉基线相当的性能，使其非常适合实时机器人控制。因此，我们强调了基于传感器的多模态融合作为手势驱动的移动机器人和无人机遥操作的鲁棒且可解释解决方案的潜力。

英文摘要

Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
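One way to picture LLR-based late fusion is the sketch below. It assumes that each modality has its own classifier producing class probabilities, that the per-class log-odds of those probabilities stand in for the log-likelihood ratio, and that the fused score is the sum across modalities. The abstract does not give the exact formulation, so the function names and the fusion rule here are assumptions.

```python
import math


def log_likelihood_ratio(p, eps=1e-6):
    """Log-odds of a class probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fuse_llr(modality_probs):
    """modality_probs maps a modality name to its list of class probabilities."""
    num_classes = len(next(iter(modality_probs.values())))
    llrs = {
        name: [log_likelihood_ratio(p) for p in probs]
        for name, probs in modality_probs.items()
    }
    # Late fusion: add the per-class evidence contributed by each modality.
    fused = [sum(llrs[name][c] for name in llrs) for c in range(num_classes)]
    predicted = max(range(num_classes), key=fused.__getitem__)
    # Interpretability: how much each modality pushed toward the chosen class.
    contributions = {name: llrs[name][predicted] for name in llrs}
    return predicted, fused, contributions


probs = {"imu": [0.6, 0.3, 0.1], "capacitive": [0.2, 0.7, 0.1]}
predicted, fused, contributions = fuse_llr(probs)
print(predicted, contributions)
```

In this toy input the capacitive channel supports the chosen class while the IMU channel argues against it, and the signed contributions make that disagreement visible.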


2512.00470 2026-06-02 cs.RO 版本更新

LAP: Fast LAtent Diffusion Planner for Autonomous Driving

LAP：面向自动驾驶的快速潜在扩散规划器

Jinhao Zhang, Wenlong Xia, Zhexuan Zhou, Haoming Song, Youmin Gong, Jie Mei

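Affiliations: University of Science and Technology of China

AI summary: Proposes LAP, a framework that plans in a VAE-learned latent space and produces high-quality plans in a single denoising step, achieving state-of-the-art closed-loop performance among learning-based planners on the nuPlan benchmark with up to 10x faster inference.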

详情

AI中文摘要

扩散模型在自动驾驶中模拟类人驾驶行为方面展现出强大能力，但其迭代采样过程导致大量延迟，且直接对原始轨迹点操作迫使模型将容量用于低级运动学而非高级多模态语义。为解决这些限制，我们提出潜在规划器（LAP），该框架在VAE学习的潜在空间中进行规划，将高级意图与低级运动学解耦，使规划器能够捕获丰富的多模态驾驶策略。为弥合高级语义规划空间与向量化场景上下文之间的表征差距，我们引入中间特征对齐机制，促进鲁棒的信息融合。值得注意的是，LAP可在单步去噪中生成高质量规划，大幅降低计算开销。通过在大型nuPlan基准上的广泛评估，LAP在学习型规划方法中实现了最先进的闭环性能，同时推理速度相比先前SOTA方法提升高达10倍。

英文摘要

Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space that disentangles high-level intents from low-level kinematics, enabling our planner to capture rich, multi-modal driving strategies. To bridge the representational gap between the high-level semantic planning space and the vectorized scene context, we introduce an intermediate feature alignment mechanism that facilitates robust information fusion. Notably, LAP can produce high-quality plans in a single denoising step, substantially reducing computational overhead. Through extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods, while demonstrating an inference speed-up of up to 10x over previous SOTA approaches.


2603.03741 2026-06-02 cs.RO cs.AI 版本更新

HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization

HALO：通过异质智能体李雅普诺夫策略优化学习人机协作

Hao Zhang, Yaru Niu, Yikai Wang, Ding Zhao, H. Eric Tseng

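Affiliations: University of California, Berkeley; Stanford University

AI summary: To address the generalization and robustness problems that diverse human behavior and changing contexts pose for human-robot collaboration, proposes heterogeneous-agent Lyapunov policy optimization (HALO), which stabilizes decentralized multi-agent reinforcement learning through Lyapunov contraction and rectifies gradients via optimal quadratic projections, yielding monotonic contraction of the rationality gap and better collaboration performance.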

Comments https://HaoZhang-THU.github.io/HALO/

详情

AI中文摘要

为了提高人机协作（HRC）的泛化性和韧性，机器人必须应对人类行为和情境的多种组合，这推动了多智能体强化学习（MARL）的应用。然而，机器人与人类之间的固有异质性造成了理性差距（RG），使得去中心化的策略更新偏离了合作联合优化。由此产生的学习问题是一个一般和可微博弈，因此独立的策略梯度更新在没有额外结构的情况下可能会振荡或发散。我们提出了异质智能体李雅普诺夫策略优化（HALO），这是一个通过强制策略参数空间中的李雅普诺夫收缩来稳定去中心化MARL的框架。与针对约束马尔可夫决策过程中状态/轨迹约束的基于李雅普诺夫的安全RL不同，HALO使用李雅普诺夫认证来稳定去中心化策略学习。HALO通过最优二次投影修正去中心化梯度，确保RG的单调收缩，并实现对开放式交互空间的有效探索。大量的仿真和真实人形机器人实验表明，这种认证的稳定性提高了协作边缘情况下的泛化性和鲁棒性。我们的项目网站位于https://HaoZhang-THU.github.io/HALO/。

英文摘要

To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG), where decentralized policy updates deviate from cooperative joint optimization. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALO), a framework that stabilizes decentralized MARL by enforcing Lyapunov-based contraction in policy-parameter space. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALO uses Lyapunov certification to stabilize decentralized policy learning. HALO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases. Our project website is available at https://HaoZhang-THU.github.io/HALO/.
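The idea of rectifying an update through a quadratic projection can be illustrated with the simplest such problem: find the direction closest to the raw one that satisfies a single linear decrease condition. The sketch assumes a constraint of the form `grad_v . d <= bound`, where `grad_v` would be the gradient of some Lyapunov function; this is a generic closed-form projection chosen for illustration, not HALO's actual certificate or solver.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def project_onto_halfspace(direction, grad_v, bound=0.0):
    """Solve min ||d - direction||^2 subject to grad_v . d <= bound (closed form)."""
    norm_sq = dot(grad_v, grad_v)
    if norm_sq == 0.0:
        return list(direction)  # constraint is vacuous
    violation = dot(grad_v, direction) - bound
    if violation <= 0.0:
        return list(direction)  # already satisfies the decrease condition
    scale = violation / norm_sq
    # Remove just enough of the component along grad_v to reach the boundary.
    return [d - scale * g for d, g in zip(direction, grad_v)]


raw = [1.0, 1.0]     # hypothetical decentralized update direction
grad_v = [1.0, 0.0]  # hypothetical Lyapunov gradient
rectified = project_onto_halfspace(raw, grad_v)
print(rectified)
```

The projection changes the update only when the raw direction would violate the condition, so feasible updates pass through untouched.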


2603.02650 2026-06-02 cs.LG cs.AI cs.RO 版本更新

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

通过自监督动作能量门控改进扩散规划器

Yuan Lu, Dongqi Han, Yansen Wang, Dongsheng Li

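Affiliations: University of Science and Technology of China

AI summary: Proposes SAGE, which re-ranks trajectories at inference time using a latent consistency signal and penalizes dynamically inconsistent plans, improving the performance and robustness of diffusion planners.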

详情

AI中文摘要

扩散规划器是离线强化学习的一种强大方法，但当价值引导选择偏好得分高但局部与环境动态不一致的轨迹时，它们可能会失败，导致执行脆弱。我们提出了自监督动作能量门控（SAGE），一种推理时重排序方法，使用潜在一致性信号惩罚动态不一致的计划。SAGE在离线状态序列上训练联合嵌入预测架构（JEPA）编码器，并训练一个动作条件的潜在预测器用于短时域过渡。在测试时，SAGE为每个采样候选分配一个由其潜在预测误差给出的能量，并将此可行性得分与价值估计相结合以选择动作。SAGE可以集成到现有的扩散规划流程中，这些流程可以通过价值评分采样轨迹和选择动作；它不需要环境回滚，也不需要重新训练策略。在运动、导航和操作基准测试中，SAGE提高了扩散规划器的性能和鲁棒性。

英文摘要

Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
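The re-ranking step can be sketched as follows. The energy is taken to be the mean squared error between predicted and encoded latents, and the combined score is `value - weight * energy`. Both the error metric and the linear combination are assumptions made for illustration, since the abstract only says that the two signals are combined.

```python
def latent_energy(predicted_latents, encoded_latents):
    """Mean squared latent prediction error over the steps of one candidate."""
    total, steps = 0.0, 0
    for z_pred, z_enc in zip(predicted_latents, encoded_latents):
        total += sum((p - e) ** 2 for p, e in zip(z_pred, z_enc))
        steps += 1
    return total / max(steps, 1)


def rerank(candidates, weight=1.0):
    """Return the index of the candidate with the best value-minus-energy score."""
    scores = [c["value"] - weight * c["energy"] for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


candidates = [
    {"value": 10.0, "energy": 4.0},  # scores well but is dynamically inconsistent
    {"value": 9.0, "energy": 0.5},   # slightly lower value, consistent dynamics
]
best_with_gate = rerank(candidates, weight=1.0)
best_value_only = rerank(candidates, weight=0.0)
print(best_with_gate, best_value_only)
```

With the gate switched off (`weight=0.0`) the high-value but inconsistent plan wins, which is the failure mode the abstract describes.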


2603.02478 2026-06-02 eess.SY cs.RO cs.SY 版本更新

Scalar-Measurement Attitude Estimation on $\mathbf{SO}(3)$ with Bias Compensation

$\mathbf{SO}(3)$ 上带偏差补偿的标量测量姿态估计

Alessandro Melis, Tarek Bouazza, Hassan Alnahhal, Sifeddine Benahmed, Soulaimane Berkane, Tarek Hamel

Affiliations: I3S, CNRS, Université Côte d’Azur, Sophia Antipolis, France; Institut Universitaire de France; Department of Technology & Innovation, Capgemini Engineering; Department of Computer Science and Engineering, Université du Québec en Outaouais (UQO)

AI summary: Proposes nonlinear deterministic observers on $\mathbf{SO}(3)$ that use scalar measurements and compensate gyroscope bias, achieving local exponential stability under suitable observability conditions, and shows that two scalar measurements suffice for attitude estimation under suitable excitation while three suffice in the static case.

Comments 9 pages, 4 figures. Accepted to ICRA 2026

详情

AI中文摘要

姿态估计方法通常依赖于来自惯性传感器（如加速度计和磁力计）的完整矢量测量。本文表明，仅使用标量测量也能实现可靠估计，这些标量测量自然出现为矢量读数的分量或来自其他传感模态的独立约束。我们提出了 $\mathbf{SO}(3)$ 上的非线性确定性观测器，该观测器结合了陀螺仪偏差补偿，并在适当的可观测性条件下保证均匀局部指数稳定性。该框架的一个关键特性是对部分感知的鲁棒性：即使只有矢量分量的子集可用，也能保持准确估计。在 BROAD 数据集上的实验验证确认了在逐步减少的测量配置下性能一致，即使在严重信息丢失的情况下估计误差仍然很小。据我们所知，这是第一项建立基本可观测性结果的工作，表明在适当激励下两个标量测量足以进行姿态估计，而在静态情况下三个足够。这些结果将基于标量测量的观测器定位为传统基于矢量方法的实用且可靠的替代方案。

英文摘要

Attitude estimation methods typically rely on full vector measurements from inertial sensors such as accelerometers and magnetometers. This paper shows that reliable estimation can also be achieved using only scalar measurements, which naturally arise either as components of vector readings or as independent constraints from other sensing modalities. We propose nonlinear deterministic observers on $\mathbf{SO}(3)$ that incorporate gyroscope bias compensation and guarantee uniform local exponential stability under suitable observability conditions. A key feature of the framework is its robustness to partial sensing: accurate estimation is maintained even when only a subset of vector components is available. Experimental validation on the BROAD dataset confirms consistent performance across progressively reduced measurement configurations, with estimation errors remaining small even under severe information loss. To the best of our knowledge, this is the first work to establish fundamental observability results showing that two scalar measurements under suitable excitation suffice for attitude estimation, and that three are enough in the static case. These results position scalar-measurement-based observers as a practical and reliable alternative to conventional vector-based approaches.


2603.01302 2026-06-02 cs.RO 版本更新

Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space

混合TD3：混合动作空间中的过估计偏差分析与稳定策略优化

Thanh-Tuan Tran, Thanh Nguyen Canh, Nak Young Chong, Xiem HoangVan

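Affiliations: Department of Computer Science, University of California, Berkeley

AI summary: Addresses reinforcement learning in discrete-continuous hybrid action spaces with Hybrid TD3, which analyzes overestimation bias theoretically and introduces a weighted clipped Q-learning target for stable policy optimization.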

详情

AI中文摘要

离散-连续混合动作空间中的强化学习对机器人操作提出了基本挑战，其中高层任务决策和低层关节空间执行必须联合优化。现有方法要么离散化连续组件，要么将离散选择松弛为连续近似，这些方法在高维动作空间和域随机化下存在可扩展性限制和训练不稳定性。在本文中，我们提出Hybrid TD3，这是对双延迟深度确定性策略梯度（TD3）的扩展，以原则性方式原生处理参数化混合动作空间。我们对混合动作设置中的过估计偏差进行了严格的理论分析，推导了双评论家架构下的形式化界限，并在同步高斯误差假设下建立了五种算法变体之间的完整偏差排序。基于此分析，我们引入了一个加权裁剪Q学习目标，该目标对离散动作分布进行边缘化，在实现与标准裁剪最小化等效的偏差减少的同时，提高了策略平滑性。实验结果表明，Hybrid TD3在训练稳定性和竞争性能方面优于最先进的混合动作基线。
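A weighted clipped Q-learning target that marginalizes over the discrete action distribution might look like the sketch below. It assumes the target is the reward plus the discounted expectation, under the discrete-action probabilities, of the minimum of two critics. The function name, argument layout, and exact weighting are assumptions, because the abstract does not spell out the formula.

```python
def weighted_clipped_target(reward, done, gamma, discrete_probs, q1_next, q2_next):
    """TD target with a twin-critic minimum averaged over discrete actions.

    q1_next[k] and q2_next[k] are the two critics' values for discrete action k
    at the next state, each paired with its own continuous parameters.
    """
    clipped = [min(a, b) for a, b in zip(q1_next, q2_next)]  # per-action clipped value
    expected_q = sum(p * q for p, q in zip(discrete_probs, clipped))
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * expected_q


target = weighted_clipped_target(
    reward=1.0, done=False, gamma=0.9,
    discrete_probs=[0.5, 0.5],
    q1_next=[2.0, 4.0],
    q2_next=[3.0, 1.0],
)
print(target)
```

Averaging over the discrete distribution, rather than taking the value of a single sampled discrete action, is one plausible reading of why the target is described as smoother.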

HyperDet: 基于超4D雷达点云的3D目标检测

Yichun Xiao, Runwei Guan, Jin Jin, Fangqiang Ding

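Affiliations: University of Edinburgh; HKUST (GZ); University of Oxford; MIT

AI summary: Proposes HyperDet, a detector-agnostic framework that constructs task-aware hyper 4D radar point clouds through spatio-temporal accumulation, cross-sensor validation, Doppler-guided motion compensation, and foreground generative enhancement, significantly improving radar-only 3D object detection.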

Comments 11 pages, 3 figures, 3 tables

详情

AI中文摘要

仅使用4D雷达进行3D目标检测能达到什么程度？尽管现代4D雷达为自主感知提供了鲁棒天气和速度感知能力，但其点云仍然稀疏、嘈杂且不稳定，限制了仅用雷达的3D检测。我们提出HyperDet，一种与检测器无关的框架，在检测前构建任务感知的超4D雷达点云。HyperDet首先通过时空累积、跨传感器验证和多普勒引导的运动补偿来细化短窗口环视雷达观测，提高返回可靠性和时间一致性。然后，它利用仅在训练时可用的激光雷达引导的伪雷达监督进行前景生成增强，在保留测量雷达背景和雷达原生属性的同时丰富目标几何。在检测器训练期间，雷达感知的目标级增强进一步在几何重定位下保持多普勒一致性。在推理时，HyperDet仅需雷达输入，可直接与标准3D检测器配合使用。在两个公开的环视4D雷达数据集上的实验表明，与原始雷达输入相比，在标准3D检测器上均取得一致改进，验证了输入级雷达增强作为仅用雷达3D检测的有效方法。

英文摘要

How far can 3D object detection go using 4D radar alone? Despite offering weather-robust and velocity-aware sensing for autonomous perception, modern 4D radar still yields sparse, noisy, and unstable point clouds, limiting radar-only 3D detection. We present HyperDet, a detector-agnostic framework that constructs task-aware hyper 4D radar point clouds before detection. HyperDet first refines short-window surround-view radar observations through spatio-temporal accumulation, cross-sensor validation, and Doppler-guided motion compensation, improving return reliability and temporal coherence. It then performs foreground generative enhancement using LiDAR-guided pseudo-radar supervision available only during training, enriching object geometry while preserving measured radar background and radar-native attributes. During detector training, radar-aware object-level augmentation further preserves Doppler consistency under geometric relocation. At inference time, HyperDet requires radar input alone and can be directly paired with standard 3D detectors. Experiments on two public surround-view 4D radar datasets demonstrate consistent improvements over raw radar inputs across standard 3D detectors, validating input-level radar enhancement as an effective approach to radar-only 3D detection.


2509.18046 2026-06-02 cs.RO cs.AI cs.ET cs.SY eess.SP eess.SY 版本更新

HuMam: Humanoid Motion Control via End-to-End Deep Reinforcement Learning with Mamba

HuMam: 基于Mamba的端到端深度强化学习人形机器人运动控制

Yinuo Wang, Yuanyang Qi, Jinzhao Zhou, Pengxiang Meng, Xiaowen Tao

Affiliations: College of Graduate and Professional Studies, Trine University; Department of Civil Engineering, University of Hong Kong; Faculty of Engineering and Information Technology, University of Technology Sydney; National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University; School of Computer Science and Statistics, Trinity College Dublin

AI summary: Proposes HuMam, a framework that uses a single-layer Mamba encoder to fuse robot state with footstep targets and is optimized with PPO, achieving stable and efficient end-to-end locomotion control for humanoid robots.

Comments 12 pages

详情

Journal ref: 2026 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM) (CIS-RAM 2026)

AI中文摘要

端到端强化学习（RL）用于人形机器人运动因其紧凑的感知-动作映射而具有吸引力，但实际策略常受训练不稳定、特征融合低效和高执行成本困扰。我们提出HuMam，一种以状态为中心的端到端RL框架，采用单层Mamba编码器融合机器人中心状态与定向脚步目标及连续相位时钟。策略输出由低级PD环跟踪的关节位置目标，并通过PPO优化。一个简洁的六项奖励平衡接触质量、摆动平滑度、脚部放置、姿态和身体稳定性，同时隐含促进节能。在mc-mujoco中的JVRC-1人形机器人上，HuMam在强前馈基线上持续提高了学习效率、训练稳定性和整体任务性能，同时降低了功耗和扭矩峰值。据我们所知，这是首个采用Mamba作为融合骨干的端到端人形机器人RL控制器，展示了在效率、稳定性和控制经济性方面的切实提升。

英文摘要

End-to-end reinforcement learning (RL) for humanoid locomotion is appealing for its compact perception-action mapping, yet practical policies often suffer from training instability, inefficient feature fusion, and high actuation cost. We present HuMam, a state-centric end-to-end RL framework that employs a single-layer Mamba encoder to fuse robot-centric states with oriented footstep targets and a continuous phase clock. The policy outputs joint position targets tracked by a low-level PD loop and is optimized with PPO. A concise six-term reward balances contact quality, swing smoothness, foot placement, posture, and body stability while implicitly promoting energy saving. On the JVRC-1 humanoid in mc-mujoco, HuMam consistently improves learning efficiency, training stability, and overall task performance over a strong feedforward baseline, while reducing power consumption and torque peaks. To our knowledge, this is the first end-to-end humanoid RL controller that adopts Mamba as the fusion backbone, demonstrating tangible gains in efficiency, stability, and control economy.
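The low-level loop that tracks the policy's joint position targets is a standard PD law. A minimal sketch follows; the gains and joint values are made up for illustration, and the abstract does not report the gains used on JVRC-1.

```python
def pd_torques(q_target, q, q_dot, kp, kd):
    """Per-joint PD law: tau = kp * (q_target - q) - kd * q_dot."""
    return [
        kp_i * (qt - qi) - kd_i * qd
        for qt, qi, qd, kp_i, kd_i in zip(q_target, q, q_dot, kp, kd)
    ]


# One joint: the policy asks for 0.5 rad, the joint is at 0.3 rad moving at 0.1 rad/s.
tau = pd_torques(q_target=[0.5], q=[0.3], q_dot=[0.1], kp=[100.0], kd=[5.0])
print(tau)
```

Because the policy outputs position targets rather than torques, the PD loop can run at a higher rate than the policy and damps the commanded motion.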


2602.09153 2026-06-02 cs.RO cs.AI cs.CV cs.GR 版本更新

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

SceneSmith: 面向仿真就绪室内场景的智能体生成

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

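Affiliations: Massachusetts Institute of Technology; Harvard University

AI summary: Proposes SceneSmith, a hierarchical agentic framework in which cooperating VLM agents generate simulation-ready indoor scenes from natural language, producing 3-6x more objects than prior methods with a collision rate below 2%.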

Comments ICML 2026 Spotlight; Project page: https://scenesmith.github.io/

详情

AI中文摘要

仿真已成为大规模训练和评估家用机器人的关键工具，但现有环境未能捕捉真实室内空间的多样性和物理复杂性。当前的场景合成方法生成的房间稀疏布置，缺乏机器人操作所必需的密集杂乱、铰接式家具和物理属性。我们提出了SceneSmith，一个层次化智能体框架，能够从自然语言提示生成仿真就绪的室内环境。SceneSmith通过连续阶段构建场景——从建筑布局到家具放置再到小物体填充——每个阶段都实现为VLM智能体（设计师、评论家和编排者）之间的交互。该框架通过文本到3D合成生成静态物体、数据集检索获取铰接式物体以及物理属性估计，紧密集成了资产生成。SceneSmith生成的物体数量是先前方法的3-6倍，物体间碰撞率低于2%，且96%的物体在物理仿真下保持稳定。在205名参与者参与的用户研究中，与基线相比，平均真实感胜率达到92%，平均提示忠实度胜率达到91%。我们进一步证明了这些环境可用于端到端的自动机器人策略评估流程。

英文摘要

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages (from architectural layout to furniture placement to small object population), each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.


2602.06925 2026-06-02 cs.RO cs.GT 版本更新

高度可变形本体感觉膜用于实时三维形状重建

Guanyu Xu, Jiaqi Wang, Dezhong Tong, Xiaonan Huang

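AI summary: Proposes a soft, stretchable proprioceptive silicone membrane based on optical waveguide sensing. A data-driven model decodes deformation-dependent light-intensity signals for real-time 3D shape reconstruction, reaching a 90 Hz update rate and a 1.307 mm average reconstruction error on a 140 mm square membrane.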

Comments 13 pages, 9 figures

详情

AI中文摘要

重建物体表面的三维几何形状对于机器人感知至关重要，但基于视觉的方法在低光照或遮挡条件下效果不佳。这一局限性促使我们设计一种本体感觉膜，该膜贴合感兴趣表面并通过重建自身变形来推断三维几何形状。传统的变形感知膜通常依赖于电阻、电容或磁敏机制，但可能存在结构复杂、在大规模变形下顺应性有限以及易受电磁干扰等问题。本文提出一种基于光波导传感的柔软、灵活且可拉伸的本体感觉硅胶膜。该膜将边缘安装的LED和中心分布的光电二极管集成在多层弹性体复合材料中。丰富的变形相关光强信号通过数据驱动模型解码，以恢复膜的几何形状。在定制的140mm方形膜上，以90Hz的端到端更新率实现了实时重建，对于高达25mm的面外变形，平均重建误差为1.307mm。所提出的传感器还在大面内变形下展示了精确重建，在高达75%应变下实现了可靠的形状恢复，平均Chamfer距离为1.214mm。所提出的框架为可变形机器人系统中的全局形状感知提供了一种可扩展、稳健且低剖面的解决方案。

英文摘要

Reconstructing the three-dimensional (3D) geometry of object surfaces is essential for robot perception, yet vision-based approaches degrade under low illumination or occlusion. This limitation motivates the design of a proprioceptive membrane that conforms to the surface of interest and infers 3D geometry by reconstructing its own deformation. Conventional deformation-aware membranes typically rely on resistive, capacitive, or magneto-sensitive mechanisms, but can suffer from structural complexity, limited compliance during large-scale deformation, and susceptibility to electromagnetic interference. This work presents a soft, flexible, and stretchable proprioceptive silicone membrane based on optical waveguide sensing. The membrane integrates edge-mounted LEDs and centrally-distributed photodiodes (PDs) within a multilayer elastomeric composite. Rich deformation-dependent light-intensity signals are decoded by a data-driven model to recover the membrane geometry. Real-time reconstruction is demonstrated on a customized 140 mm square membrane at an end-to-end update rate of 90 Hz, achieving an average reconstruction error of 1.307 mm for out-of-plane deformation of up to 25 mm. The proposed sensor also demonstrates accurate reconstruction under large in-plane deformation, achieving reliable shape recovery up to 75% strain with an average Chamfer distance of 1.214 mm. The proposed framework provides a scalable, robust, and low-profile solution for global shape perception in deformable robotic systems.
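The Chamfer distance used to report in-plane reconstruction quality compares two point sets through nearest-neighbour distances. The sketch below uses the symmetric, unsquared, averaged variant. The abstract does not say which variant the authors use (squared or unsquared, summed or averaged), so that choice is an assumption.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))


# Two copies of a 2-point "surface", the second one lifted by 1 mm along z.
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(reconstructed, ground_truth))
```

This brute-force version is quadratic in the number of points; a spatial index would be the usual choice for dense meshes.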


2601.11460 2026-06-02 cs.RO cs.LG 版本更新

Semantic-Geometric Task Representations for Bimanual Manipulation from Human Demonstrations to Robot Action Planning

面向双臂操作的语义-几何任务表示：从人类示范到机器人动作规划

Franziska Herbert, Vignesh Prasad, Han Liu, Dorothea Koert, Georgia Chalvatzaki

Affiliations: Interactive Robot Perception & Learning (PEARL) Lab, Computer Science Dept., TU Darmstadt, Germany; Hessian.AI, Darmstadt, Germany; Robotics Institute Germany (RIG); Interactive AI Algorithms & Cognitive Models for Human-AI Interaction (IKIDA), Computer Science Dept., TU Darmstadt, Germany; Center for Cognitive Science, TU Darmstadt, Germany

AI summary: Proposes a semantic-geometric graph task representation in which a message-passing neural network encoder and a Transformer decoder jointly encode object identities, semantic relations, and motion histories, learning structured task representations from human demonstrations and supporting cross-embodiment transfer and bimanual manipulation planning.

Comments 9 pages, 7 figures, preprint

详情

AI中文摘要

从人类示范中学习结构化任务表示对于双臂操作至关重要，因为动作顺序、对象参与和交互几何在不同执行中变化显著。一个关键挑战在于以支持任务进展推理的形式，联合捕获离散的语义任务结构和对象中心几何关系的时间演化。我们引入一种基于语义-几何图的任务表示，通过消息传递神经网络（MPNN）编码器和基于Transformer的解码器，联合编码对象身份、对象间语义关系和每个对象的运动历史。编码器仅对时间场景图进行操作，产生与动作标签解耦的结构化表示。解码器则根据动作上下文预测未来动作、关联对象和对象运动。这种解耦学习了任务无关的表示，使得编码器可以通过仅在小型机器人数据集上微调解码器而跨实体复用。在两个数据集的十一个双臂任务中，我们发现结构化语义-几何表示相对于更简单的基于序列模型的优势随着动作顺序和对象参与的任务变异性增加而增长。在部署时，规划器将动作和运动预测与学习的概率运动基元相结合，在两个真实机器人双臂任务上实现了完全任务成功，并优于图消融、Transformer、仅解码器和微调的视觉-语言模型基线。

英文摘要

Learning structured task representations from human demonstrations is essential for bimanual manipulation, where action ordering, object involvement, and interaction geometry vary significantly across executions. A key challenge lies in jointly capturing the discrete semantic task structure and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. We introduce a semantic-geometric graph-based task representation that jointly encodes object identities, inter-object semantic relations, and per-object motion histories, via a Message Passing Neural Network (MPNN) encoder and a Transformer-based decoder. The encoder operates solely on the temporal scene graph, producing structured representations decoupled from action labels. The decoder then conditions on action-context to forecast future actions, associated objects, and object motions. This decoupling learns task-agnostic representations, enabling encoder reuse across embodiments through decoder-only finetuning on a small robot dataset. Across eleven bimanual tasks from two datasets, we find that the benefit of structured semantic-geometric representations over simpler sequence-based models grows with task variability in action ordering and object involvement. At deployment, a planner couples the action and motion predictions with learned Probabilistic Movement Primitives, achieving full task success on two real-robot bimanual tasks and outperforming graph ablations, Transformer, decoder-only, and finetuned vision-language model baselines.


2508.20072 2026-06-02 cs.CV cs.LG cs.RO 版本更新

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

离散扩散VLA：将离散扩散引入视觉-语言-动作策略中的动作解码

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

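Affiliations: University of Science and Technology of China

AI summary: Proposes Discrete Diffusion VLA, which discretizes action chunks and refines them progressively with discrete diffusion inside a unified Transformer backbone, enabling an adaptive decoding order and error correction, achieving strong results on several benchmarks while preserving pretrained vision-language priors.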

Comments Accepted by ICML 2026. 17 pages

详情

AI中文摘要

视觉-语言-动作（VLA）模型将大型视觉-语言骨干网络适配为将图像和指令映射为机器人动作。然而，当前的VLA要么以固定的从左到右顺序自回归生成动作，性能较差；要么在骨干网络外附加独立的扩散头，这会割裂信息通路并阻碍统一、可扩展的架构。相反，我们提出了离散扩散VLA，它将动作块离散化，并使用离散扩散模式在统一的Transformer骨干内保留渐进细化。我们的方法实现了自适应解码顺序，在解决较难的动作元素之前先解决高置信度的动作元素，并采用二次重掩码来重新审视不确定的预测，从而实现鲁棒的纠错。这种设计保留了预训练的视觉-语言先验，支持并行解码，并提高了效率。离散扩散VLA在LIBERO上达到96.4%的平均成功率，在SimplerEnv-Fractal上达到71.2%的视觉匹配，在SimplerEnv-Bridge上达到54.2%的整体性能。在LIBERO-Goal的分布外测试中，我们的方法仅表现出0.8%的语言退化（相比之下并行解码为8.0%），以及20.4%的视觉退化（相比之下连续扩散为29.0%），表明其很好地保留了预训练的视觉-语言能力。我们还在AgileX Cobot Magic平台上进行了两次真实机器人评估，以展示该方法的有效性。

英文摘要

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order, which performs poorly, or attach separate diffusion heads outside the backbone, which fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA, which discretizes action chunks and models them with discrete diffusion, retaining progressive refinement inside the unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves efficiency. Discrete Diffusion VLA achieves 96.4% average success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% for parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating good retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on the AgileX Cobot Magic platform to show the method's effectiveness.
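Confidence-ordered decoding with re-masking can be sketched with a toy predictor. Everything here is an assumption made for illustration: the `predict` callback signature, the per-round commit quota, and the fixed re-mask threshold are not taken from the paper, which the abstract describes only at a high level.

```python
MASK = None


def decode(predict, length, rounds, remask_threshold=0.5):
    """predict(tokens) returns a (token, confidence) pair for every position."""
    tokens = [MASK] * length
    for step in range(rounds):
        proposals = predict(tokens)
        # Secondary re-masking: reopen committed positions that now look uncertain.
        for i, (_, confidence) in enumerate(proposals):
            if tokens[i] is not MASK and confidence < remask_threshold:
                tokens[i] = MASK
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        # Commit an even share of the masked positions per remaining round,
        # most confident first; the final round commits everything left.
        remaining_rounds = rounds - step
        quota = max(1, -(-len(masked) // remaining_rounds))  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:quota]:
            tokens[i] = proposals[i][0]
    return tokens


def toy_predict(tokens):
    """Stand-in for the backbone: fixed action tokens with fixed confidences."""
    return [(3, 0.9), (1, 0.2), (4, 0.8), (1, 0.6)]


decoded = decode(toy_predict, length=4, rounds=2)
print(decoded)
```

With this toy predictor, positions 0 and 2 are committed in the first round and the two less certain positions in the second, so the order adapts to confidence rather than running left to right.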


2512.18336 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

强化学习低层四旋翼控制中的动态熵调节：随机性与确定性

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

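Affiliations: Mechatronics Engineering Department; The German University in Cairo

AI summary: Studies reinforcement learning algorithms that train stochastic policies with dynamic entropy tuning for quadcopter control and compares them with deterministic-policy algorithms, finding that dynamic entropy tuning prevents catastrophic forgetting and improves exploration efficiency.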

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情

DOI: 10.1109/ICCTA64612.2024.10974880
Journal ref: 2024 IEEE 34th International Conference on Computer Theory and Applications (ICCTA)

AI中文摘要

本文探讨了在训练随机策略的强化学习算法中动态熵调节的影响，并将其性能与训练确定性策略的算法进行了比较。随机策略通过优化动作的概率分布来最大化奖励，而确定性策略则为每个状态选择一个确定的动作。本文研究了使用静态熵和动态熵训练随机策略，然后执行确定性动作来控制四旋翼的效果，并与训练确定性策略并执行确定性动作进行了对比。为此，随机算法选择了软演员-评论家（SAC）算法，确定性算法选择了双延迟深度确定性策略梯度（TD3）算法。训练和仿真结果表明，动态熵调节通过防止灾难性遗忘和提高探索效率，对控制四旋翼产生了积极影响。

英文摘要

This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its performance is compared against algorithms that train a deterministic one. Stochastic policies optimize a probability distribution over actions to maximize rewards, while deterministic policies select a single deterministic action per state. The effect of training a stochastic policy with both static entropy and dynamic entropy and then executing deterministic actions to control the quadcopter is explored. It is then compared against training a deterministic policy and executing deterministic actions. For the purpose of this research, the Soft Actor-Critic (SAC) algorithm was chosen for the stochastic algorithm while the Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen for the deterministic algorithm. The training and simulation results show the positive effect the dynamic entropy tuning has on controlling the quadcopter by preventing catastrophic forgetting and improving exploration efficiency.
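Assuming that "dynamic entropy tuning" refers to SAC's automatic temperature adjustment (the abstract does not say so explicitly), one update of the temperature can be sketched as a gradient step on `log_alpha`. The loss form below is the one commonly used in SAC implementations; the learning rate and sample values are illustrative only.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient step on J = -log_alpha * mean(log_pi + target_entropy)."""
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -mean_term  # dJ / d(log_alpha)
    return log_alpha - lr * grad


target_entropy = -2.0  # a common heuristic is minus the action dimension

# Policy entropy (about 0.6) is above the target: the temperature should shrink.
lowered = update_log_alpha(0.0, [-0.5, -0.7], target_entropy)
# Policy entropy (about -3.1) is below the target: the temperature should grow.
raised = update_log_alpha(0.0, [3.0, 3.2], target_entropy)
print(math.exp(lowered), math.exp(raised))
```

A static-entropy variant would simply keep `log_alpha` fixed, which is the comparison the abstract sets up.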


2512.18333 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)

基于软演员-评论家(SAC)的四旋翼强化学习位置控制

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

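Affiliations: Mechatronics Engineering Department; The German University in Cairo

AI summary: Proposes a reinforcement-learning-based thrust-vector control architecture for quadrotors, trained with the Soft Actor-Critic algorithm, that trains faster and follows paths more smoothly and accurately than conventional RPM controllers.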

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情

DOI: 10.1109/NILES63360.2024.10753187
Journal ref: 2024 IEEE 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES)

AI中文摘要

本文提出了一种新的基于强化学习(RL)的四旋翼控制架构。现有文献主要关注直接控制四个旋翼的转速，而本文旨在控制四旋翼的推力矢量。RL智能体计算沿四旋翼z轴的总推力百分比以及期望的滚转角(ϕ)和俯仰角(θ)。然后，智能体将计算出的控制信号连同当前四旋翼的偏航角(ψ)发送给姿态PID控制器。PID控制器再将控制信号映射为电机转速。采用软演员-评论家算法（一种无模型离策略随机RL算法）来训练RL智能体。训练结果表明，与传统的RPM控制器相比，所提出的推力矢量控制器训练时间更短。仿真结果表明，所提出的推力矢量控制器具有更平滑、更精确的路径跟踪性能。

英文摘要

This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. Whereas the literature focuses on controlling the four rotors' RPMs directly, this paper aims to control the quadrotor's thrust vector. The RL agent computes the percentage of overall thrust along the quadrotor's z-axis along with the desired roll ($\phi$) and pitch ($\theta$) angles. The agent then sends the calculated control signals, along with the quadrotor's current yaw angle ($\psi$), to an attitude PID controller. The PID controller then maps the control signals to motor RPMs. The Soft Actor-Critic algorithm, a model-free off-policy stochastic RL algorithm, was used to train the RL agents. Training results show that the proposed thrust vector controller trains faster than conventional RPM controllers. Simulation results show smoother and more accurate path-following for the proposed thrust vector controller.


2512.13356 2026-06-02 cs.RO cs.AI cs.LG (updated version)

Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)

Zeyad Gamal, Youssef Mahran, Ayman El-Badawy

Affiliations: Mechatronics Engineering Department; The German University in Cairo

AI summary: Proposes a TD3-based reinforcement learning framework for stabilizing a Twin Rotor Aerodynamic System at given pitch and azimuth angles and for trajectory tracking; simulations with wind disturbances compare its effectiveness against conventional PID controllers, and laboratory experiments confirm it on real hardware.

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

Details

DOI: 10.1109/ICSTCC62912.2024.10744717
Journal ref: 2024 28th IEEE International Conference on System Theory, Control and Computing (ICSTCC)

Abstract

This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.
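For orientation, TD3's critic target combines target policy smoothing with a clipped double-Q estimate. The sketch below assumes that standard form; the callables `q1_target` and `q2_target`, the noise settings, and the action bounds are hypothetical placeholders, not values from the paper.

```python
import random


def td3_target(reward, done, target_action, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               act_low=-1.0, act_high=1.0, rng=random):
    """Compute the TD3 bootstrap target for a single transition.

    target_action: action proposed by the target actor for the next state.
    q1_target, q2_target: callables mapping an action (list of floats)
    to the two target critics' values at the next state.
    """
    smoothed = []
    for a in target_action:
        # Target policy smoothing: add clipped Gaussian noise, then clamp.
        eps = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
        smoothed.append(max(act_low, min(act_high, a + eps)))
    # Clipped double-Q: take the smaller of the two critic estimates.
    q_min = min(q1_target(smoothed), q2_target(smoothed))
    return reward + gamma * (1.0 - float(done)) * q_min
```

The "delayed" part of TD3 (updating the actor less often than the critics) lives in the training loop and is not shown here.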


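2512.09065 2026-06-02 cs.RO cs.AI (updated version)

ShelfAware: Real-Time Semantic Localization in Quasi-Static Environments with Low-Cost Sensors

Shivendra Agrawal, Jake Brawer, Ashutosh Naik, Alessandro Roncone, Bradley Hayes

Affiliations: Department of Computer Science, University of Colorado Boulder

AI summary: Proposes ShelfAware, a semantic particle filter that models scene semantics as statistical evidence over object categories rather than fixed landmarks, combines a depth likelihood with category-level semantic similarity, and uses precomputed semantic viewpoints for inverse semantic proposals, achieving robust global localization on low-cost vision hardware.

Comments 8 pages

Details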

DOI: 10.1109/LRA.2026.3682613
Journal ref: IEEE Robotics and Automation Letters (RA-L), 2026

Abstract

Many indoor workspaces are quasi-static: their global geometric layout is stable, but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat standard vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed quantity landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside Monte Carlo Localization (MCL), yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. To demonstrate perception-agnostic scalability, we evaluate ShelfAware across two domains. In a rigorously controlled mock retail environment, ShelfAware achieves a 97% global localization success rate, maintaining the highest tracking success (66%) across cart, wearable, and dynamic occlusion conditions. Furthermore, in a 3,500 sq. ft. operational grocery store leveraging an open-vocabulary vision pipeline, ShelfAware significantly outperforms both geometric and fixed-quantity semantic baselines. By modeling semantics distributionally and leveraging inverse proposals, ShelfAware resolves geometric aliasing, providing an infrastructure-free building block for mobile and assistive robots in dynamic real-world environments.
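One way to picture the fusion of a depth likelihood with a category-centric semantic similarity is as a particle reweighting step. The sketch below is only an illustration under stated assumptions: the cosine similarity over category histograms, the exponential weighting, and the `beta` gain are hypothetical choices, since the abstract does not give the actual likelihood model.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two category histograms (dict: category -> count)."""
    cats = set(p) | set(q)
    dot = sum(p.get(c, 0.0) * q.get(c, 0.0) for c in cats)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm > 0 else 0.0


def fused_weights(depth_likelihoods, observed_hist, expected_hists, beta=2.0):
    """Combine per-particle depth likelihoods with semantic agreement.

    expected_hists[i] is the category histogram the map predicts at particle i.
    Returns normalised particle weights.
    """
    raw = [d * math.exp(beta * cosine_similarity(observed_hist, h))
           for d, h in zip(depth_likelihoods, expected_hists)]
    total = sum(raw)
    if total == 0:
        # Degenerate case: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]
```

Because the semantic term compares category statistics rather than individual landmarks, two particles with identical depth likelihoods (geometric aliasing) can still be separated by what is on the shelves.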


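2512.06182 2026-06-02 cs.RO cs.SY eess.SY (updated version)

SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning

Taewook Nam, Junmo Cho, Youngsoo Jang, Sung Ju Hwang

Affiliations: KAIST; UNIST; DeepAuto.ai

AI summary: Proposes SpeedAug, a framework in which a tempo-enriched prior policy and RL fine-tuning let robot policies learn a task-optimal execution tempo, substantially improving execution speed and sample efficiency while maintaining high success rates.

Details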

Abstract

Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to collect demonstrations through human operation. However, policies trained from such demonstrations often execute tasks far more slowly than the robot's physical capabilities, as demonstration data is collected under practical constraints that favor conservative, success-oriented trajectories over execution speed. Existing policy acceleration methods determine execution tempo through data preprocessing or heuristic rules, rather than learning execution speed optimized for the task. In this paper, we propose SpeedAug, a policy acceleration framework that enables policies to learn task-optimal execution tempo via reinforcement learning (RL). SpeedAug first learns a tempo-enriched prior policy from speed-augmented demonstrations that captures diverse execution tempos. Building on this tempo-enriched prior, RL fine-tuning guides exploration to refine action trajectories and optimize execution tempo efficiently. Experiments on robotic manipulation benchmarks demonstrate that SpeedAug substantially improves the sample efficiency of policy acceleration while maintaining high success rates, achieving fast and stable task execution. Applied to a real-world manipulation task, SpeedAug improves task throughput by 1.8x using only 16 minutes of online interactions without compromising the success rate.
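The abstract does not describe how the speed-augmented demonstrations are produced. A minimal sketch, assuming the simplest possible scheme: time-resampling a recorded trajectory by a speed factor with linear interpolation. The function name and the interpolation choice are hypothetical.

```python
def speed_up(trajectory, factor):
    """Resample a trajectory so it plays back `factor` times faster.

    trajectory: list of waypoints, each a list of floats, recorded at a
    fixed control rate (at least two waypoints). The result has fewer
    waypoints covering the same path, so executing it at the same rate
    is faster.
    """
    n = len(trajectory)
    new_len = max(2, round((n - 1) / factor) + 1)
    out = []
    for i in range(new_len):
        t = i * (n - 1) / (new_len - 1)   # fractional index into the original
        lo = int(t)
        hi = min(lo + 1, n - 1)
        frac = t - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(trajectory[lo], trajectory[hi])])
    return out
```

Generating copies at several factors (for example 1.0, 1.5, and 2.0) would give a prior policy exposure to multiple tempos before RL fine-tuning selects among them.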


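2511.22445 2026-06-02 cs.RO (updated version)

DIPOLE: Fusing Vision and Geometry for Robust Visuomotor Generalization

Yikai Tang, Haoran Geng, Jindou Jia, Yuxuan Hu, Sheng Zang, Jianfei Yang, Pieter Abbeel, Jitendra Malik

Affiliations: University of California, Berkeley; Nanyang Technological University

AI summary: Proposes DIPOLE, which fuses visual and geometric information through training-time modality dropout and lightweight cross-attention, yielding policies that generalize robustly across changes in lighting, texture, viewpoint, and more; across 18 simulated and 4 real-world tasks it outperforms six baselines by 39.1% on average.

Details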

Abstract

Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods tend to struggle once test-time conditions differ from the demonstrations, such as changes in lighting, texture, viewpoint, object placement, or object identity. To address this challenge, we propose DIffusion POlicy with compLementarity Encoders (DIPOLE), a visuomotor policy that learns to fuse complementary modalities through a training-time mechanism rather than a specialized fusion architecture. A modality-wise dropout masks one branch at each training step, encouraging each modality to remain individually informative. A lightweight cross-attention layer then exchanges complementary cues between the two. This design endows DIPOLE with five core strengths: stable high performance across diverse tasks, robustness to visual changes, spatial generalization at sub-centimeter precision, emergent capability beyond either modality, and zero-shot transfer to unseen objects. Across 18 simulated and 4 real-world tasks, DIPOLE outperforms six baselines by 39.1% on average, with gains of 41.5% under unseen visual distractors and 15.2% under randomized object placement.
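The modality-wise dropout described above can be sketched in a few lines. This is an illustration only: masking by zeroing the feature vector and choosing the masked branch uniformly at random are assumptions, and the function name is hypothetical.

```python
import random


def modality_dropout(visual_feat, geom_feat, rng=random):
    """Mask exactly one of the two encoder branches for this training step.

    Returns (visual, geometry) feature lists with one branch zeroed, so the
    downstream policy must stay accurate from either modality alone.
    """
    if rng.random() < 0.5:
        return [0.0] * len(visual_feat), list(geom_feat)
    return list(visual_feat), [0.0] * len(geom_feat)
```

At test time no branch is masked, and the cross-attention layer can draw on both sets of features.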


2511.17502 2026-06-02 cs.RO (updated version)

RynnVLA-002: A Unified Vision-Language-Action and World Model

Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Bohan Hou, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, Hao Chen

Affiliations: DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University

AI summary: Proposes RynnVLA-002, a framework unifying a vision-language-action (VLA) model and a world model; by jointly learning environment dynamics and action planning, it significantly raises success rates on simulated and real-robot tasks.

2511.10276 2026-06-02 cs.RO cs.AI (updated version)

RoboBenchMart: Benchmarking Robots in Retail Environment

Konstantin Soshin, Alexander Krapukhin, Andrei Spiridonov, Gregorii Bukhtuev, Andrey Kuznetsov, Vlad Shakhuro, Denis Shepelev

Affiliations: FusionBrain Lab, Robotics Group; NUST MISIS; Lomonosov Moscow State University

AI summary: For mobile manipulation in retail environments, proposes RoboBenchMart, an open-source simulated benchmark that evaluates generalist vision-language-action (VLA) models under dense object clutter and complex spatial configurations, and finds that existing models still perform poorly on common retail tasks.

Details

Abstract

Most existing robotic manipulation benchmarks focus on tabletop or household scenarios. While these setups have driven impressive progress, it remains unclear whether generalist VLAs that excel there can truly generalize to domains with different geometry, semantics, and workflows. We introduce RoboBenchMart, an open-source simulated benchmark targeting retail dark-store environments, where a mobile manipulator must perform complex manipulation tasks with diverse grocery items. This setting presents significant challenges, including dense object clutter and varied spatial configurations, with items positioned at different heights and depths and in close proximity. By targeting the retail domain, our benchmark addresses a setting with strong potential for near-term automation impact. Using generated trajectories, we model a standard, realistic fine-tuning setup for current generalist VLAs and evaluate several state-of-the-art models. We find that they still struggle even on common retail tasks, indicating that these models are not yet truly general across domains. To support further research, we release the RoboBenchMart suite, which includes a procedural store layout generator, a trajectory generation pipeline, evaluation tools, and fine-tuned baseline models.


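2511.02937 2026-06-02 cs.RO cs.SE cs.SY eess.SY (updated version)

Toward an Agricultural Operational Design Domain: A Framework

Mirco Felske, Jannik Redenius, Georg Happich, Julius Schöning

AI summary: Addressing the particular challenges of agricultural autonomous systems operating in complex, changing environments, proposes an agricultural Operational Design Domain framework comprising an Ag-ODD description concept, a 7-layer model, and an iterative verification process, to make environment descriptions structured, transparent, and verifiable.

Comments 18 pages, 7 figures, 2 tables

Details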

DOI: 10.1016/j.atech.2026.102246
Journal ref: Smart Agricultural Technology, Volume 14, August 2026

Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation

Oskar Natan, Jun Miura

Affiliations: Department of Computer Science and Electronics, Universitas Gadjah Mada; Department of Computer Science and Engineering, Toyohashi University of Technology

AI summary: Proposes Seq-DeepIPC, a model that fuses multi-modal perception (RGB-D + GNSS) with temporal sequences for end-to-end navigation control of legged robots in real-world environments, validated on a robot dog.

Comments This work has been accepted for publication in the IEEE Sensors Journal. https://ieeexplore.ieee.org/document/11373257/

Details

DOI: 10.1109/JSEN.2026.3656442

Abstract

We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in real-world environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use a lightweight model as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead deriving global heading via differential analysis of sequential GNSS coordinates. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/Seq-DeepIPC.
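Deriving heading from consecutive GNSS fixes can be illustrated with the standard initial-bearing formula on a sphere. This is a sketch under that assumption; the abstract says only that heading comes from differential analysis of sequential GNSS coordinates, so the exact computation used by the authors may differ, and the function name is hypothetical.

```python
import math


def gnss_heading_deg(lat1, lon1, lat2, lon2):
    """Heading of travel from fix 1 to fix 2, in degrees clockwise from north.

    Inputs are latitudes and longitudes in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0
```

Because the estimate depends on the difference between two fixes, GNSS position error near tall buildings translates directly into heading error, which is consistent with the limitation the abstract reports.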


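2411.09241 2026-06-02 cs.RO eess.SP (updated version)

BlueME: Robust Underwater Robot-to-Robot Communication Using Compact Magnetoelectric Antennas

Mehron Talebi, Sultan Mahmud, Adam Khalifa, Md Jahidul Islam

Affiliations: Department of ECE, University of Florida, USA

AI summary: Proposes and validates BlueME, a compact underwater communication system based on magnetoelectric antennas that maintains reliable signal transmission at distances beyond 700 meters while consuming 10 watts, and operates under turbidity, obstacles, and multipath interference.

Details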

Abstract

We present the design, development, and experimental validation of BlueME, a compact magnetoelectric (ME) antenna array system for underwater robot-to-robot communication. BlueME employs ME antennas operating at their natural mechanical resonance frequency to efficiently transmit and receive very-low-frequency (VLF) electromagnetic signals underwater. We outline the design, simulation, fabrication, and integration of the proposed system on low-power embedded platforms, focusing on portable and scalable applications. For performance evaluation, we deployed BlueME on an autonomous surface vehicle (ASV) and a remotely operated vehicle (ROV) in open-water field trials. Ocean trials demonstrate that BlueME maintains reliable signal transmission at distances beyond 700 meters while consuming only 10 watts of power. Field trials show that the system operates effectively in challenging underwater conditions such as turbidity, obstacles, and multipath interference -- conditions that generally affect acoustics and optics. Our analysis also examines the impact of complete submersion on system performance and identifies key deployment considerations. This work represents the first practical underwater deployment of ME antennas outside the laboratory and implements the largest VLF ME array system to date. BlueME demonstrates significant potential for marine robotics and automation in multi-robot cooperative systems and remote sensor networks.


2510.10676 2026-06-02 cs.AR cs.CL cs.RO eess.AS (updated version)

Bhasha-Rupantarika: Algorithm-Hardware Co-design approach for Multilingual Neural Machine Translation

Mukul Lokhande, Tanushree Dewangan, Mohd Sharik Mansoori, Tejas Chaudhari, Akarsh J., Damayanti Lokhande, Adam Teman, Santosh Kumar Vishvakarma

Affiliations: Special Manpower Development Program for Chip to Start-Up (SMDP-C2S); Ministry of Electronics and Information Technology (MeitY)

AI summary: Proposes Bhasha-Rupantarika, a lightweight, efficient multilingual translation system built through algorithm-hardware co-design; sub-octet precision quantization (FP8/INT8/INT4/FP4) yields a 4.1x smaller model and 4.2x faster inference, with deployment on an FPGA accelerator, offering a viable route to multilingual AI in resource-constrained settings.

Details

DOI: 10.1109/ISQED69900.2026.11534749
Journal ref: International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 2026

Abstract

This paper introduces Bhasha-Rupantarika, a lightweight and efficient multilingual translation system tailored through algorithm-hardware codesign for resource-limited settings. The method investigates model deployment at sub-octet precision levels (FP8, INT8, INT4, and FP4), with experimental results indicating a 4.1x reduction in model size (FP4) and a 4.2x inference speedup, which corresponds to an increased throughput of 66 tokens/s (a 4.8x improvement). This underscores the importance of ultra-low precision quantization for real-time deployment in IoT devices using FPGA accelerators, achieving performance on par with expectations. Our evaluation covers bidirectional translation between Indian and international languages, showcasing its adaptability in low-resource linguistic contexts. The FPGA deployment demonstrated a 1.96x reduction in LUTs and a 1.65x decrease in FFs, resulting in a 2.2x enhancement in throughput compared to OPU and a 4.6x enhancement compared to HPTA. Overall, the evaluation provides a viable solution based on quantisation-aware translation along with hardware efficiency suitable for deployable multilingual AI systems. The complete code [https://github.com/mukullokhande99/Bhasha-Rupantarika/] and dataset for reproducibility are publicly available, facilitating rapid integration and further development by researchers.
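To make the integer precision levels concrete, here is a minimal sketch of symmetric per-tensor quantization to a signed integer grid such as INT8 or INT4. It is a generic textbook scheme, not the paper's quantization pipeline, and it does not cover the floating-point formats (FP8, FP4); all names are illustrative.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [qi * scale for qi in q]
```

Storing 4-bit codes instead of 16-bit weights gives roughly a 4x reduction in weight storage before overheads such as scales, which is the general order of magnitude of the size reduction the abstract reports for FP4.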


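2510.04074 2026-06-02 cs.RO (updated version)

Feedback Matters: Augmenting Autonomous Dissection with Visual and Topological Feedback

Chung-Pang Wang, Changwei Chen, Xiao Liang, Soofiyan Atar, Florian Richter, Michael Yip

Affiliations: Department of Electrical and Computer Engineering, University of California San Diego

AI summary: Proposes a feedback-driven framework for autonomous tissue dissection that reasons about topological changes from endoscopic images, quantifies visibility, and actively manipulates tissue; integrated with planning-based and learning-based methods, it significantly improves surgical autonomy and robustness.

Details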

Abstract

Autonomous surgical systems must adapt to highly dynamic environments where tissue properties and visual cues evolve rapidly. Central to such adaptability is feedback: the ability to sense, interpret, and respond to changes during execution. While feedback mechanisms have been explored in surgical robotics, ranging from tool and tissue tracking to error detection, existing methods remain limited in handling the topological and perceptual challenges of tissue dissection. In this work, we propose a feedback-enabled framework for autonomous tissue dissection that explicitly reasons about topological changes from endoscopic images after each dissection action. This structured feedback guides subsequent actions, enabling the system to localize dissection progress and adapt policies online. To improve the reliability of such feedback, we introduce visibility metrics that quantify tissue exposure and formulate optimal controller designs that actively manipulate tissue to maximize visibility. Finally, we integrate these feedback mechanisms with both planning-based and learning-based dissection methods, and demonstrate experimentally that they significantly enhance autonomy, reduce errors, and improve robustness in complex surgical scenarios.


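2509.20070 2026-06-02 cs.RO (updated version)

LLM Trainer: Automated Robotic Data Generation via Demonstration Augmentation using LLMs

Abraham George, Amir Barati Farimani

Affiliations: Carnegie Mellon University

AI summary: Proposes LLM Trainer, a pipeline that uses the world knowledge of large language models to expand a few human demonstrations automatically into a large robot dataset, generating new trajectories through offline annotation and online keypose retargeting and optimizing the annotation with Thompson sampling.

Comments 9 pages, 5 figures, 4 tables. Accepted in ICRA 2026

Details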

Abstract

We present LLM Trainer, a fully automated pipeline that leverages the world knowledge of Large Language Models (LLMs) to transform a small number of human demonstrations (as few as one) into a large robot dataset for imitation learning. Our approach decomposes demonstration generation into two steps: (1) offline demonstration annotation that extracts keyframes, salient objects, and pose-object relations; and (2) online keypose retargeting that adapts those keyframes to a new scene, given an initial observation. Using these modified keypoints, our system warps the original demonstration to generate a new trajectory, which is then executed, and the resulting demo, if successful, is saved. Because the annotation is reusable across scenes, we use Thompson sampling to optimize the annotation, significantly improving generation success rate. We evaluate our method on a range of tasks, and find that our data annotation method consistently outperforms expert-engineered baselines. We further show an ensemble policy that combines the optimized LLM feed-forward plan with a learned feedback imitation learning controller. Finally, we demonstrate hardware feasibility on a Franka Emika Panda robot. For additional materials and demonstration videos, please see the project website: https://sites.google.com/andrew.cmu.edu/llm-trainer
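The use of Thompson sampling to choose among candidate annotations can be sketched as a Beta-Bernoulli bandit, where each candidate annotation is an arm and a successful generated demonstration is a reward of 1. This framing, the uniform Beta(1, 1) prior, and all names are assumptions for illustration; the abstract does not give the formulation.

```python
import random


def thompson_select(successes, failures, rng=random):
    """Pick the arm whose sampled success rate is highest."""
    samples = [rng.betavariate(s + 1, f + 1)   # Beta(1, 1) prior per arm
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_rates, rounds=2000, seed=0):
    """Simulate annotation selection against hypothetical success rates."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes, failures = [0] * k, [0] * k
    for _ in range(rounds):
        i = thompson_select(successes, failures, rng)
        if rng.random() < true_rates[i]:
            successes[i] += 1
        else:
            failures[i] += 1
    return successes, failures
```

In the simulation, rollouts concentrate on the annotation with the higher success rate as evidence accumulates, which is the mechanism by which reusing annotations across scenes can raise the overall generation success rate.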


2509.19246 2026-06-02 cs.RO cs.MA cs.SY eess.SY (updated version)

RCM-ACT: Imitation Learning with Dynamic RCM Calibration for Autonomous Intraocular Foreign Body Removal

Yue Wang, Wenjie Deng, Haotian Xue, Di Cui, Yiqi Chen, Mingchuan Zhou, Haochao Ying, Jian Wu

Affiliations: College of Computer Science and Technology, Zhejiang University; State Key Laboratory of Transvascular Implantation Devices of the Second Affiliated Hospital, Zhejiang University School of Medicine; Dessight Biomedical; Center for Rehabilitation Medicine, Department of Ophthalmology, Zhejiang Provincial People's Hospital; School of Biosystems Engineering and Food Science, Zhejiang University; School of Public Health and Second Affiliated Hospital, Zhejiang University School of Medicine; State Key Laboratory of Transvascular Implantation Devices of the Second Affiliated Hospital and School of Public Health, Zhejiang University School of Medicine; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence

AI summary: Proposes RCM-ACT, an imitation learning framework that addresses kinematic uncertainty in intraocular surgery through dynamic RCM calibration and action chunking transformers, achieving autonomous ring grasping and positioning with a mean 3-D grasp deviation of 0.686 mm.

Details

Abstract

Intraocular foreign body removal demands millimeter-level precision in confined intraocular spaces, yet existing robotic systems predominantly rely on manual teleoperation with steep learning curves. To address the challenges of autonomous manipulation, particularly kinematic uncertainties from variable motion scaling and Remote Center of Motion (RCM) point variation, we propose RCM-ACT, an imitation learning framework for autonomous intraocular foreign body ring manipulation. Our approach integrates RCM dynamic calibration to resolve coordinate system inconsistencies caused by intraocular instrument variation and introduces the RCM-ACT architecture, which combines action chunking transformers with episode-level kinematic realignment. Trained solely on stereo visual data and instrument kinematics from expert demonstrations in an artificial eye model, RCM-ACT successfully completes ring grasping and positioning tasks without explicit depth sensing. Experimental validation demonstrates the successful implementation of end-to-end autonomy under uncalibrated microscopy conditions, achieving a mean 3-D Euclidean grasp deviation of 0.686 mm and 11/20 full-task successes. The results provide a viable framework for developing intelligent eye surgical systems capable of complex intraocular procedures.


2406.09953 2026-06-02 cs.RO cs.AI (updated version)

DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning

Zeyu Gao, Yao Mu, Jinye Qu, Mengkang Hu, Shijia Peng, Chengkai Hou, Lingyue Guo, Ping Luo, Shanghang Zhang, Yanfeng Lu

Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS); School of Computer Science, Shanghai Jiao Tong University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Department of Computer Science, The University of Hong Kong; OpenGVLab, Shanghai AI Laboratory

AI summary: Proposes DAG-Plan, a framework that for the first time uses a directed acyclic graph as the core representation for dual-arm coordination; a single LLM parse produces a structured DAG that enables adaptive parallel execution, giving a 48% higher success rate than single-query linear-sequence methods and 84.1% higher execution efficiency than iterative-query methods on a dual-arm kitchen benchmark.

Comments ICRA 2026

Details

Abstract

Dual-arm robots promise greater efficiency but require planning for complex tasks with nonlinear sub-task dependencies. Current methods using Large Language Models (LLMs) suffer from a fundamental trade-off: generating linear sequences is efficient but fails to model parallelism and adapt to changes, while iterative querying is adaptive but too slow and costly. To bridge this gap, we introduce DAG-Plan, a novel task planning framework that for the first time employs a Directed Acyclic Graph (DAG) as the central representation for dual-arm coordination. The key insight is that a DAG natively captures complex sub-task dependencies and explicitly reveals opportunities for parallel execution. Within this framework, an LLM is used only once as a powerful semantic parser to translate a natural language instruction into a structured DAG. During execution, our system dynamically assigns candidate nodes to the suitable arm based on real-time environmental observations, enabling truly adaptive and parallel operation. Extensive evaluation on a dual-arm kitchen benchmark shows that DAG-Plan's structured approach fundamentally outperforms existing paradigms. It achieves a 48% higher success rate than dual-arm single-query linear-sequence methods by robustly managing dependencies, and an 84.1% higher execution efficiency than iterative querying methods by eliminating the latency of repeated LLM calls. Our work demonstrates that a principled, graph-based representation is the key to unlocking efficient and reliable LLM-based planning for complex robotic systems. More demos and code are available at https://sites.google.com/view/dag-plan.
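The idea that a DAG exposes parallelism can be illustrated with a small scheduler that repeatedly takes sub-tasks whose dependencies are all complete and hands up to two of them to the arms. This is a sketch under simplifying assumptions: the alphabetical arm assignment stands in for the observation-based assignment described in the abstract, and the task names are hypothetical.

```python
def schedule_dual_arm(deps):
    """Group sub-tasks into parallel steps for two arms.

    deps maps each sub-task to the set of sub-tasks it depends on.
    Returns a list of steps; each step maps an arm name to a sub-task.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    done, steps = set(), []
    while remaining:
        # Candidate nodes: every dependency already executed.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        batch = ready[:2]                      # at most one task per arm
        steps.append(dict(zip(("left", "right"), batch)))
        for t in batch:
            del remaining[t]
        done.update(batch)
    return steps
```

For a plan in which `fry_egg` needs both `take_egg` and `grab_pan`, and `take_egg` needs `open_fridge`, the scheduler runs `grab_pan` and `open_fridge` together in the first step, something a linear sequence cannot express.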


2503.15371 2026-06-02 cs.RO cs.LG (updated version)

GIFT: Geometry-Induced Functional Transfer for Category-level Object Manipulation

Cristiana de Farias, Luis Figueredo, Riddhiman Laha, Maxime Adjigble, Brahim Tamadazte, Rustam Stolkin, Sami Haddadin, Naresh Marturi

Affiliations: Extreme Robotics Laboratory, School of Metallurgy and Materials, University of Birmingham; Munich Institute of Robotics & Machine Intelligence, Technische Universität München (TUM); School of Computer Science, University of Nottingham; Sorbonne Université, ISIR, Paris, France

AI summary: Proposes GIFT, a framework that uses functional maps and screw interpolation to transfer complex object manipulation skills from a single human demonstration to new objects without additional training.

Comments 8 pages, 6 figures. ICRA 2026

Details

Abstract

Robotic manipulation of unfamiliar objects in new environments is challenging due to limited generalisation capabilities. We propose a new skill transfer framework, GIFT (Geometry-Induced Functional Transfer), which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration. Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions. By leveraging the Functional Maps (FMC) framework, we efficiently map interaction functions between objects and their environments, allowing the robot to replicate task operations across objects of similar topologies or categories, even when they have significantly different shapes. Additionally, our method incorporates screw interpolation (ScLERP) for generating smooth, geometrically-aware robot paths to ensure the transferred skills adhere to the demonstrated task constraints. We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.

2412.12036 2026-06-02 cs.LG cs.RO Updated version

LeARN: Learnable and Adaptive Representations for Nonlinear Dynamics in System Identification

Arunabh Singh, Joyjit Mukherjee

Affiliations: Visual Computing Lab, Indian Institute of Science; Department of Electrical and Electronics Engineering, BITS Pilani Hyderabad Campus

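AI summary: Proposes LeARN, a framework that uses meta-learning to learn the library of basis functions directly from data, without domain knowledge, enabling adaptive identification of nonlinear dynamics; on the Neural Fly dataset its dynamical error is comparable to that of SINDy.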

Comments This work has been accepted at the 34th Mediterranean Conference on Control and Automation (MED 2026)

Abstract

System identification, the process of deriving mathematical models of dynamical systems from observed input-output data, has undergone a paradigm shift with the advent of learning-based methods. Addressing the intricate challenges of data-driven discovery in nonlinear dynamical systems, these methods have garnered significant attention. Among them, Sparse Identification of Nonlinear Dynamics (SINDy) has emerged as a transformative approach, distilling complex dynamical behaviors into interpretable linear combinations of basis functions. However, SINDy's reliance on domain-specific expertise to construct its foundational 'library' of basis functions limits its adaptability and universality. In this work, we introduce a nonlinear system identification framework, LeARN, that transcends the need for prior domain knowledge by learning the library of basis functions directly from data. To enhance adaptability to evolving system dynamics under varying noise conditions, we employ a novel meta-learning-based system identification approach that utilizes a lightweight Deep Neural Network (DNN) to dynamically refine these basis functions. This not only captures intricate system behaviors but also adapts effectively to new dynamical regimes. We validate our framework on the Neural Fly dataset, showcasing its robust adaptation and generalization capabilities. Despite its simplicity, LeARN achieves dynamical error performance competitive with SINDy. This work presents a step towards autonomous discovery of dynamical systems, paving the way for a future where machine learning uncovers the governing principles of complex systems without requiring extensive domain-specific interventions.
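
For readers unfamiliar with the baseline, the sketch below shows the SINDy idea of expressing dynamics as a sparse linear combination of basis functions, fitted by sequentially thresholded least squares. The toy system, the polynomial library, the threshold, and the function names are all assumptions made for illustration. The library here is hand-chosen, which is exactly the step LeARN is described as replacing with a learned one; that learned part is not shown.

```python
import numpy as np

def build_library(x):
    """Hand-chosen candidate basis functions of a 1-D state: [1, x, x^2, x^3]."""
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if not big.any():
            break
        # Refit using only the basis functions that survived thresholding.
        xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
    return xi

# Synthetic data from dx/dt = -2 x + 0.5 x^3, with derivatives assumed
# to be measured exactly (no noise, no numerical differentiation).
x = np.linspace(-2.0, 2.0, 200)
dxdt = -2.0 * x + 0.5 * x**3
xi = stlsq(build_library(x), dxdt)
# xi is approximately [0, -2, 0, 0.5]: only the x and x^3 terms remain.
```

The recovered coefficient vector is interpretable because each entry corresponds to a named basis function, but the result can only be as good as the library that was supplied.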

2407.12014 2026-06-02 cs.HC cs.CY cs.RO Updated version

Surprising Performances of Students with Autism in Classroom with NAO Robot

Qin Yang, Huan Lu, Dandan Liang, Shengrong Gong, Huanghao Feng

Affiliations: School of Computer and Information Technology; Northeast Petroleum University; School of Computer Science and Engineering; Changshu Institute of Technology; Changshu Special Education School; School of Chinese Language and Literature; Nanjing Normal University

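AI summary: In a collective classroom experiment assisted by the NAO robot, this study finds that students with Autism Spectrum Disorder show higher engagement and fewer stereotyped behaviors in the robot classroom, suggesting that social robots can significantly improve their classroom focus and educational performance.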

DOI: 10.1007/s44366-026-0083-1
Journal ref: Frontiers of Digital Education 3(2), 2024

Abstract

Autism is a developmental disorder that manifests in early childhood and persists throughout life, profoundly affecting social behavior and hindering the acquisition of learning and social skills in those diagnosed. As technological advancements progress, an increasing array of technologies is being utilized to support the education of students with Autism Spectrum Disorder (ASD), aiming to improve their educational outcomes and social capabilities. Numerous studies on autism intervention have highlighted the effectiveness of social robots in behavioral treatments. However, research on the integration of social robots into classroom settings for children with autism remains sparse. This paper describes the design and implementation of a group experiment in a collective classroom setting mediated by the NAO robot. The experiment involved special education teachers and the NAO robot collaboratively conducting classroom activities, aiming to foster a dynamic learning environment through interactions among teachers, the robot, and students. Conducted in a special education school, this experiment served as a foundational study in anticipation of extended robot-assisted classroom sessions. Data from the experiment suggest that ASD students in classrooms equipped with the NAO robot exhibited notably better performance compared to those in regular classrooms. The humanoid features and body language of the NAO robot captivated the students' attention, particularly during talent shows and command tasks, where students demonstrated heightened engagement and a decrease in stereotypical repetitive behaviors and irrelevant minor movements commonly observed in regular settings. Our preliminary findings indicate that the NAO robot significantly enhances focus and classroom engagement among students with ASD, potentially improving educational performance and fostering better social behaviors.

2307.06647 2026-06-02 cs.RO cs.AI cs.CV Updated version

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

Oskar Natan, Jun Miura

Affiliations: Department of Computer Science and Electronics, Universitas Gadjah Mada; Department of Computer Science and Engineering, Toyohashi University of Technology

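AI summary: Proposes DeepIPCv2, an end-to-end autonomous driving framework that builds robust scene representations by combining LiDAR point cloud segmentation with multi-view projection, and jointly estimates waypoints and navigational control commands using gated recurrent units, command-specific multi-layer perceptrons, and PID controllers; under varying illumination it achieves the lowest total metric error and the fewest driving interventions.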

Comments This work has been accepted for publication in IEEE Access. https://ieeexplore.ieee.org/document/11313052

DOI: 10.1109/ACCESS.2025.3647530

Abstract

We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific control learning. Unlike prior camera-reliant models, DeepIPCv2 employs point cloud segmentation and multi-view projection to construct robust scene representations. These features are fused and decoded through a combination of gated recurrent units, command-specific multi-layer perceptrons, and PID controllers to estimate both waypoints and navigational control commands. This design enhances maneuverability and addresses action imbalance in driving datasets. To validate the model, we constructed a dataset covering diverse illumination conditions and conducted ablation studies and comparative tests against recent methods, including TransFuser. Results demonstrate that DeepIPCv2 achieves the lowest total metric error and the fewest driving interventions, highlighting both its robustness to illumination changes and its improved control accuracy. We plan to release the code at https://github.com/oskarnatan/DeepIPCv2 to support reproducibility and future advances in end-to-end autonomous driving research.
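
The abstract names PID controllers as one stage of the control pipeline but gives no gains or error definitions. As a rough illustration only, the sketch below is a generic textbook PID loop tracking a target speed on a toy vehicle model. The class, the gains, the drag coefficient, and the way the error is formed are all assumptions and are not taken from DeepIPCv2.

```python
from dataclasses import dataclass

@dataclass
class PID:
    """Textbook discrete PID controller with internal state."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float, dt: float) -> float:
        """Return the control output for the current error sample."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def track_speed(target: float, steps: int = 200, dt: float = 0.05) -> float:
    """Drive a toy first-order vehicle model toward a target speed."""
    pid = PID(kp=1.0, ki=0.5, kd=0.0)  # hypothetical gains
    speed = 0.0
    for _ in range(steps):
        throttle = pid.step(target - speed, dt)
        # Toy dynamics (assumed): acceleration equals throttle minus linear drag.
        speed += (throttle - 0.1 * speed) * dt
    return speed
```

In a waypoint-following setting, the error fed to `step` would typically be derived from the predicted waypoints, for example a heading or speed difference, with one controller per actuated quantity.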
