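arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.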

2605.31486 2026-06-01 cs.RO Version update

Learning Controlled Separation of Small Objects Between Two Fingers with a Tactile Skin

Ulf Kasolowsky, Berthold Bäuml

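Affiliations: Learning AI for Dextrous Robots Lab; Technical University of Munich; DLR Institute of Robotics and Mechatronics

AI summary: Introduces and solves the task of controlled separation of small objects between two fingers of a multi-purpose robotic hand, trains a purely tactile policy with reinforcement learning, and analyzes the benefits of spatially resolved tactile feedback.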

Abstract

We introduce and solve the novel task of controlled separation of small objects with two fingers of a multi-purpose robotic hand: after grasping into a box of small objects, the task is to drop as many of them until a desired number remains between the fingers. The objects are small compared to the width of the fingers but also in absolute terms. In our case little pellets with a diameter of only 6mm are handled. We show that the task can be performed purely tactile (no vision) using a spatially-resolved tactile skin on a fingertip. The separation policy is trained in simulation via reinforcement learning using a straightforward sparse reward, which basically checks if the desired number of objects is reached. In simulation experiments, we provide an exhaustive analysis of the benefits of using spatially-resolved tactile feedback: while an ideal (high-resolution) tactile sensor allows solving the task almost perfectly, a sensor with lower spatial resolution (here 4x4 taxels) still leads to an improvement of up to 20% compared to using only the fingers' joint sensors. For this analysis, we further train an estimator alongside the policy that predicts the ground truth contact positions. Finally, we demonstrate the successful sim-to-real transfer for the DLR-Hand II equipped with a tactile skin.

2605.31481 2026-06-01 cs.RO Version update

Batched Differentiable Rigid Body Dynamics in PyTorch for GPU-Accelerated Robot Learning

Yue Wang, Yanran Xu, Wenbo Wu, Chuanhang Qiu, Zhaoxing Li

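Affiliations: University of Southampton

AI summary: Presents BARD, a PyTorch-based batched differentiable rigid-body dynamics library that uses a three-tier cache, matmul-free joint transforms, and level-parallel propagation to reach up to 64x faster forward kinematics on GPU while supporting gradient computation.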

Abstract

As robot control shifts toward large-scale reinforcement learning with in-loop dynamics computation, the community's reliance on CPU-bound libraries such as Pinocchio creates a throughput bottleneck in GPU-based training pipelines. We present BARD (Batched Articulated Rigid-body Dynamics), a self-contained PyTorch implementation of Featherstone's rigid-body dynamics algorithms, optimized for batched GPU evaluation and automatic differentiation. Three design choices make this efficient: a tiered lazy-evaluation cache that avoids redundant tree traversals, matmul-free joint transforms via pre-computed Rodrigues constants, and level-parallel propagation that reduces sequential operations to tree-depth batched steps. On five robot models (7-23 DOFs), BARD matches Pinocchio numerically while reaching up to 64x higher throughput for Forward Kinematics and 63x for Jacobians at batch size 4096 on an NVIDIA H200. We validate differentiability through gradient-based system identification on a 7-DOF manipulator, recovering link masses to 1.24% mean error under 5% torque noise, and integrate BARD into an Isaac Lab AMP training pipeline for an 11-DOF spined quadruped with 4096 parallel environments, where it is 8.5x faster than Pinocchio and 2.0x faster than ADAM for in-loop dynamics. BARD is open-sourced at: https://github.com/YueWang996/bard-pytorch-dynamics.
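
The abstract mentions matmul-free joint transforms via pre-computed Rodrigues constants. As a rough illustration only (this is not BARD's code, and the function names are hypothetical), Rodrigues' formula R = I + sin(θ) K + (1 - cos(θ)) K² lets the axis-dependent matrices K and K² be computed once per joint, so each later evaluation needs only sin(θ), cos(θ), and elementwise scaling:

```python
import math

def precompute_rodrigues_constants(axis):
    """Return (K, K2) for a rotation axis; both depend only on the axis.

    Hypothetical helper for illustration, not part of any real library.
    """
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    # Skew-symmetric cross-product matrix of the unit axis
    K = [[0.0, -z, y],
         [z, 0.0, -x],
         [-y, x, 0.0]]
    # K2 = K @ K, computed once up front
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    return K, K2

def joint_rotation(constants, theta):
    """Rotation matrix for joint angle theta using the cached constants."""
    K, K2 = constants
    s = math.sin(theta)
    c1 = 1.0 - math.cos(theta)
    # Only scalar-times-matrix sums here; no matrix product per evaluation
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + c1 * K2[i][j]
             for j in range(3)] for i in range(3)]
```

In a batched setting the same idea would apply per element of a tensor of joint angles; that part is left out of this sketch.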

2605.31476 2026-06-01 cs.RO Version update

IDOL: Inverse-Dynamics-Guided Future Prediction for End-to-End Autonomous Driving

Chenghao Zhang, Timin Li, Dongmei Li

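Affiliations: Department of Electronic Engineering, Tsinghua University

AI summary: Proposes the IDOL framework, which uses an inverse dynamics model to convert future latent scene states predicted by a BEV world model into planning-relevant trajectory deltas, tightly coupling future prediction with trajectory optimization and achieving state-of-the-art performance among comparable methods on the NAVSIM benchmarks.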

Comments 20 pages, 5 figures

Abstract

End-to-end autonomous driving has emerged as a compelling paradigm for learning planning directly from sensor observations, while recent world-model-based approaches further enrich this paradigm by enabling explicit reasoning about how the scene may evolve in the future. Yet future prediction alone does not guarantee better planning unless the predicted evolution can be converted into planning-relevant trajectory updates. Many current methods still forecast future scene states without explicitly decoding the motion implications hidden in state transitions. As a result, future reasoning often remains descriptively useful but only weakly coupled to executable motion generation. To address this limitation, we propose IDOL, an inverse-dynamics-guided future prediction framework for world-model-based end-to-end planning in latent BEV space, where inverse dynamics serves as the key bridge between future prediction and trajectory optimization. IDOL first predicts multiple future latent scene states with a BEV world model, then applies an inverse dynamics model to adjacent latent futures to decode transition-aware trajectory features and recover planning-relevant motion deltas that explain how the latent world evolves over time. These inverse-dynamics-derived signals are used to optimize the planned trajectory, turning future forecasting from passive scene anticipation into actionable planning guidance. A lightweight closed-loop refinement module further improves long-horizon consistency by reusing the optimized trajectory for another round of future-aware reasoning. By introducing inverse dynamics into latent future reasoning, IDOL tightens the coupling between world modeling and planning. Extensive experiments on the NAVSIM v1 and NAVSIM v2 benchmarks show that IDOL achieves state-of-the-art performance among comparable methods.

2605.31460 2026-06-01 cs.RO cs.SY eess.SY Version update

On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making

Joonhee Lee, Hyunseung Shin, Hyunmi Kim, Pei Zhang, Jeonggil Ko

Affiliations: School of Integrated Technology, College of Computing, Yonsei University; Department of Hyperscale AI SoC Research Section, ETRI; EECS - Electrical and Computer Engineering, University of Michigan

AI summary: Proposes the REIS framework, which reduces inference redundancy through scene gating, KV-guided affordance routing, and deliberate reasoning, accelerating robot control while preserving semantic adaptability.

Comments 19 pages

2605.31436 2026-06-01 cs.RO Version update

Actuator-Aware Inverse Kinematics with Joint-Limit Admissibility for Torque-Controlled Redundant Robots

Mohammad Dastranj, Mahdi Hejrati, Jouni Mattila

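Affiliations: Unit of Automation Technology and Mechanical Engineering, Faculty of Engineering and Natural Sciences

AI summary: Proposes an inverse kinematics method based on convex quadratic programming that enforces joint limits with control-barrier-function-style bounds and resolves redundancy with a controller-compatibility objective, improving realized task behavior without modifying the downstream controller.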

Abstract

This paper proposes actuator-aware inverse kinematics for torque-controlled redundant robots under joint-limit constraints. In the considered architecture, the inverse-kinematic output is not merely a purely kinematic joint-velocity command; it is the required joint velocity supplied to a downstream torque-level controller. Therefore, a small commanded task residual may not necessarily improve realized motion. The proposed method formulates a convex quadratic programming problem whose decision variable is the joint-level required velocity. Control barrier function style bounds impose reference-level joint-limit admissibility, while the task equation is handled through a penalized slack variable. Redundancy is resolved using a controller-compatibility objective that accounts for previous-command consistency and actuator torque-capacity weighting. The method is independent of the particular torque-level controller and can serve as an intermediate IK layer between an endpoint trajectory and a redundant robot controller. Experiments on a virtual-decomposition-controlled seven-degree-of-freedom upper-limb exoskeleton compare the method with standard inverse-kinematic baselines and a constrained task-preserving quadratic programming baseline. The results indicate lower limit-pushing commands, bounded admissible required velocities, and improved realized task behavior in the tested trajectory, without modifying the downstream controller.
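
The abstract describes control-barrier-function-style bounds that keep the required joint velocity admissible with respect to joint limits. A common textbook form of such bounds, for barriers h = q - q_min and h = q_max - q, is q̇ ≥ -γ (q - q_min) and q̇ ≤ γ (q_max - q). The sketch below shows only these box bounds and a simple clamp; it does not reproduce the paper's quadratic program, and the gain `gamma` and the function names are assumptions made for illustration.

```python
def cbf_velocity_bounds(q, q_min, q_max, gamma=5.0):
    """Per-joint velocity bounds from CBF-style joint-limit conditions.

    gamma is an assumed barrier gain; larger values allow faster
    approach to a limit before the bound becomes active.
    """
    lower = [-gamma * (qi - lo) for qi, lo in zip(q, q_min)]
    upper = [gamma * (hi - qi) for qi, hi in zip(q, q_max)]
    return lower, upper

def clamp_velocity(dq_desired, lower, upper):
    """Project a desired joint velocity onto the admissible box."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(dq_desired, lower, upper)]
```

As a joint approaches its upper limit, the upper bound shrinks to zero, so velocity commands that would push further into the limit are cut off while motion away from it stays unrestricted.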

2605.31387 2026-06-01 cs.CL cs.RO Version update

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

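Affiliations: Computational Linguistics, Department of Linguistics, University of Potsdam; German Research Center for Artificial Intelligence (DFKI)

AI summary: Evaluates vision-language models on collaborative spatial reasoning tasks through a multi-turn multi-agent dialogue framework, finding that visual spatial understanding remains the main bottleneck and that textual representations and decomposed image representations partially improve performance.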

Comments Preprint

Exploring Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

Jun Wang, Xiaohao Xu, Xiaonan Huang

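Affiliations: University of Michigan, Ann Arbor

AI summary: For safe human-robot collaboration, introduces the concept of collision grounding and the physics-grounded benchmark TouchSafeBench, which evaluates vision-language models on classifying the current safety state and warning of imminent collisions; existing models prove unreliable, and visual fluency does not imply physical accountability.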

Comments 31 pages, 9 figures

Abstract

Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.

2605.31121 2026-06-01 cs.RO cs.AI Version update

TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues

Tianle Zeng, Hanjing Ye, Jianwei Peng, Jingwen Yu, Hanxuan Chen, Hong Zhang

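Affiliations: Shenzhen Key Laboratory of Robotics and Computer Vision; Southern University of Science and Technology; CKS Robotics Institute; Hong Kong University of Science and Technology; College of Electrical and Information Engineering

AI summary: Addresses the navigation degradation caused by semantic-cue interruptions in outdoor vision-language navigation with a unified framework that keeps navigation stable through prolonged cue-free phases, using traversability-consistent executable guidance and an uncertainty-aware 3D cue memory, and improves success rates on quadrupedal and wheeled platforms.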

Abstract

Outdoor vision-language navigation (VLN) in long-range, open-world environments is frequently disrupted by semantic-cue interruptions, where informative goal cues become sparse, occluded, or leave the field of view. Once such cues disappear, agents enter a cue-free phase and often degrade into backtracking, oscillatory headings, or aimless exploration. While memory-based methods attempt to bridge these gaps, they often fail under traversability-driven detours: the remembered cue direction may be infeasible, forcing detours that prolong cue-free phases and gradually render robot-centric cues stale and implicit histories blurred. This makes traversability a stability condition for maintaining goal-directed guidance, rather than merely a local safety concern. We propose a unified outdoor VLN framework that survives semantic-cue interruptions by maintaining traversability-consistent executable guidance throughout prolonged cue-free phases. Specifically, our method extracts semantic bearings from visibility-gated goal or exploration cues and grounds them into executable headings using a real-time near-field traversability profile, providing goal-consistent feasible guidance beyond reject-only safety filtering. To prevent guidance degradation during detours, we lift intermittent 2D evidence into a world-aligned 3D cue memory with an uncertainty-aware readout mechanism, ensuring guidance remains continuously reachable and stable as the robot moves. We evaluate the framework on quadrupedal and wheeled platforms over 600-1000 m routes. Our method improves simulation success rate by over 10 percentage points over the strongest baseline and achieves a real-world success rate of 40%, compared to 17.5% for the strongest baseline, with substantially higher robustness during prolonged cue-free intervals.

2605.31119 2026-06-01 cs.RO cs.LG Version update

Don't Fool Me Twice: Adapting to Adversity in the Wild with Experience-Driven Reasoning

Navin Sriram Ravie, Andrew Jong, Krrish Jain, John Liu, Omar Alama, Bijo Sebastian, Sebastian Scherer

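Affiliations: Department of Engineering Design, Indian Institute of Technology, Madras; Robotics Institute, Carnegie Mellon University

AI summary: Proposes a continual learning framework that lets mobile robots learn online from disturbances and attribute anomalous behaviors to causes through semantics, enabling better prediction and planning.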

Abstract

In robotics, dangers and adversity modes are often embodiment-specific and relative to each agent. A frontier of autonomous mobile robotics is to enable agents to operate effectively in the wild in unseen unstructured environments. A significant challenge in unseen unstructured environments is that it may not be possible to predict all the dangers to the specific robot. Although recent work has used large foundation vision-language models (VLMs) to preemptively predict an exhaustive list of common-sense dangers, it remains difficult to capture possible interaction and embodiment-dependent adversities. We propose a continual learning framework for a mobile embodied agent to learn online from disturbances and attribute anomalous behaviours to causes through semantics, enabling better prediction and planning of the world in the future. Our framework, "Don't Fool Me Twice", first observes disturbances and describes their effects on the robot; this description is augmented with visual context to query a VLM to predict possible causes; the local disturbance is characterized using kernel regression, which allows for efficient, few-shot modeling of transient anomalies. We leverage semantic voxel-centric modeling to estimate epistemic uncertainty, enabling richer downstream recovery by treating interaction-driven disturbances as learnable spatial behaviors. We present four hypotheses and validate them in simulation and on hardware across embodiments and adversity modes.
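
The abstract says local disturbances are characterized with kernel regression but does not specify the estimator. As one plausible reading, the sketch below uses a Nadaraya-Watson estimator with a Gaussian kernel over robot positions; the bandwidth, the scalar disturbance magnitude, and the function names are all assumptions for illustration rather than the authors' formulation.

```python
import math

def gaussian_kernel(a, b, bandwidth):
    """Gaussian similarity between two positions."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def predict_disturbance(query, positions, disturbances, bandwidth=0.5):
    """Kernel-weighted average of disturbances observed near `query`."""
    weights = [gaussian_kernel(query, p, bandwidth) for p in positions]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no nearby evidence; assume no disturbance
    return sum(w * d for w, d in zip(weights, disturbances)) / total
```

Because the estimate is just a weighted average of stored observations, a handful of samples is enough to produce a local prediction, which fits the few-shot setting the abstract describes.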

2605.31116 2026-06-01 cs.CV cs.RO Version update

RDGen: Generating High-Quality Demonstrations for Robot Learning via Reinforcement Learning

Zijian Zhu, Menglin Zou, Zhuang Li, Yaojie Tu, Xinhai Sun

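AI summary: Proposes the RDGen framework, which uses sim-to-real reinforcement learning policies to generate high-quality robot demonstration trajectories for training vision-language-action models, producing smoother trajectories than human teleoperation and improving downstream performance.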

Comments 13 pages, 4 figures, 3 tables

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robot control. However, their performance remains fundamentally constrained by the availability of high-quality robot trajectory data. In current robot learning practice, such data are primarily collected through human teleoperation, which is labor-intensive, costly, and difficult to scale. In this paper, we propose RDGen, a sim-to-real reinforcement learning framework for generating high-quality robot demonstrations. Rather than employing reinforcement learning solely as the final control policy, RDGen leverages trained RL policies as a structured trajectory generator. The system consists of a VLM-based task parser that identifies task-relevant objects, a Grounding DINO-based object localizer, and an RL policy transferred from simulation to the real robot. Successful rollouts are then harvested as clean, high-quality demonstrations for downstream VLA training, while the simulation stage further provides a scalable source of additional trajectories at little marginal cost. Experiments on a pick-and-place task demonstrate that the transferred RL policy achieves a high task success rate. Compared with human teleoperation, RDGen produces significantly smoother trajectories and yields superior downstream VLA performance. These results indicate that RL-generated demonstrations can serve as more reliable and consistent supervisory signals for robot policy learning.

2605.30928 2026-06-01 cs.RO Version update

Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization

Usman Nizamani, M. Shaheer Luqman, Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, M. Zeeshan Zia, Quoc-Huy Tran

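Affiliations: Retrocausal, Inc.

AI summary: Proposes a hierarchical macro action quantization framework (HiMAQ) that encodes human demonstrations into macro actions via two-level vector quantization, so that reinforcement learning agents produce more human-like behavior sequences while maintaining high returns; it outperforms non-hierarchical baselines on the D4RL benchmark and is compatible with multiple RL algorithms.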

DisPlace: Discriminative Place Projection for Multi-Reference Visual Place Recognition

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

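Affiliations: QUT Centre for Robotics, School of Electrical Engineering and Robotics at the Queensland University of Technology

AI summary: Proposes the DisPlace framework, which fuses multi-reference descriptors through a generalized eigenvalue problem that maximizes between-place separability and suppresses within-place variation, improving the robustness of visual place recognition under varying conditions.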

Comments Under review

Abstract

A key challenge in Visual Place Recognition (VPR) is matching query images against reference maps captured under diverse environmental conditions and viewpoints. While multiple reference traversals improve robustness, existing fusion strategies either aggregate references uniformly or rely on heuristic selection, without distinguishing descriptor variations that preserve stable place identity from those caused by changing conditions or viewpoints. In this paper, we propose DisPlace, a multi-reference VPR framework that fuses multiple reference descriptors into a single compact and discriminative place representation. DisPlace formulates descriptor fusion as a generalized eigenvalue problem that maximizes between-place separability while suppressing within-place variation across references, rather than preserving overall descriptor variance. Unlike existing multi-reference fusion methods, DisPlace exploits variation across reference traversals to identify which linear combinations of descriptor dimensions preserve place identity and which capture condition- or viewpoint-specific variation. We evaluate DisPlace on Oxford RobotCar, Nordland, Pittsburgh30k, and Google Landmarks v2 across six state-of-the-art VPR descriptors. DisPlace outperforms seven multi-reference baselines in 49 out of 54 appearance-varying conditions, consistently improves descriptor-level fusion performance under viewpoint and unstructured settings, and requires less storage during inference than all compared fusion methods.

2605.30749 2026-06-01 cs.LG cs.RO Version update

Primitive Subspaces Mediate Few-Shot Transfer in VLAs

Anya Singh, Cabrel Happi, Jai Relan, Varun Nair, Vidyut Baradwaj

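AI summary: Shows that primitive-aware training builds a transferable sub-skill library in vision-language-action (VLA) policies, enabling few-shot transfer from a handful of demonstrations with a 3x sample-efficiency gain over flat training.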

Abstract

Deploying vision-language-action (VLA) policies in industrial environments requires the ability to teach new tasks at low cost, a property current VLAs lack, since each new task requires fine-tuning. We investigate whether primitive-aware training produces a transferable artifact: a learned library of sub-skills that can be composed at inference time, conditioned on a small number of demonstrations, to perform tasks the policy was never trained on. We train two VLA architectures with different inductive biases, OpenVLA and $π_{0.5}$, on the REASSEMBLE contact-rich assembly dataset under matched LoRA fine-tuning recipes and locked hyperparameters, varying training between flat trajectories and primitive-segmented episodes with primitive-specific language prompts. We hold out 6 object-task combinations from training and evaluate few-shot transfer: models receive $m \in \{0, 1, 3, 5, 10\}$ demonstrations of a held-out task and attempt execution without weight updates. We replicate across three training seeds and validate on a second dataset (LIBERO-Long). Primitive-trained models reach 78% of fine-tuned upper-bound performance with only m=3 demonstrations, while flat-trained models require m=10 demonstrations to reach the same level -- a $3\times$ sample efficiency gap that replicates across seeds, architectures, and datasets. To establish causation, we ablate the primitive-decodable subspace of hidden states and show few-shot transfer degrades by 32 percentage points while ablating a random subspace of equal dimensionality has no effect, indicating primitive representations are causally necessary rather than incidentally correlated with transfer. We identify and correct a methodological pitfall in evaluating chunked policies: family-wise inflation of single-step action-range gates produces order-of-magnitude higher false-failure rates against ground-truth human demonstrations.
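
The causal test in the abstract ablates a subspace of the hidden states and compares against a random subspace of the same dimensionality. The abstract does not say how the subspace is found or removed; one standard way to ablate a known subspace is to project it out, h' = h - Σ_k (h · u_k) u_k, for an orthonormal basis {u_k}. The sketch below assumes such a basis is already available (for example from a linear probe), and the function name is hypothetical.

```python
def ablate_subspace(hidden, basis):
    """Remove the component of `hidden` lying in span(basis).

    `basis` is assumed to be a list of orthonormal vectors; with a
    non-orthonormal basis this projection would not be exact.
    """
    result = list(hidden)
    for u in basis:
        # Coefficient of `hidden` along this basis direction
        coeff = sum(h * ui for h, ui in zip(hidden, u))
        result = [r - coeff * ui for r, ui in zip(result, u)]
    return result
```

The random-subspace control would call the same function with a random orthonormal basis of equal size, so any difference in downstream performance can be attributed to which directions were removed rather than how many.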

2605.30671 2026-06-01 cs.CV cs.RO Version update

WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation

Varun Nair, Vidyut Baradwaj, Jiahang He, Anya Singh, Jai Relan, Cabrel Happi

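AI summary: Proposes WristCompass, which treats the kinematic coupling between wrist motion and camera orientation as a visual concept and recovers ego-camera orientation from manipulation video using compact 4D features and GRU temporal modeling; it transfers zero-shot to kitchen videos and approaches the performance of a 1B-parameter scene model.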

Abstract

Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark. We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, WristCompass transfers zero-shot to Epic Kitchens cooking video, achieving 14.3$^\circ$ median geodesic error and approaching the performance of a 1B-parameter scene model at 200K GRU parameters.
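
The reported metric is a geodesic error between orientations. For two rotation matrices, the usual definition is the angle of the relative rotation, θ = arccos((trace(R_predᵀ R_true) - 1) / 2). The sketch below implements that standard formula; whether the paper computes it exactly this way is an assumption, and the function name is illustrative.

```python
import math

def geodesic_error_deg(R_pred, R_true):
    """Angle in degrees of the relative rotation between two 3x3 matrices."""
    # trace(R_pred^T R_true) equals the elementwise (Frobenius) inner product
    trace = sum(R_pred[i][j] * R_true[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the value outside [-1, 1]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))
```

A median over per-frame values of this quantity would give a median geodesic error of the kind quoted in the abstract.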

2605.30660 2026-06-01 cs.LG cs.RO Version update

BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

Anya Singh, Cabrel Happi, Jai Relan, Varun Nair, Vidyut Baradwaj

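AI summary: For test-time scaling of vision-language-action (VLA) policies, proposes BOKBO, the first conformal abstention layer, whose global and per-task variants provide finite-sample, distribution-free control of the executed-violation rate, and exposes a structural failure of policy-internal nonconformity scores under perturbation-based K-sampling.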

Abstract

Test-time scaling for vision-language-action (VLA) policies, methods such as RoboMonkey, SEAL, MG-Select, and V-GPS, samples K candidate action chunks at inference and executes the verifier-best. When all K candidates are unsafe, the system executes a violating action with no warning. We propose BOKBO, the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate. We provide both global and per-task (Mondrian) variants, with the per-task variant closing the conditional gap on the hardest tasks. Our analysis exposes a structural failure of policy-internal nonconformity scores under perturbation-based K-sampling: the base-policy confidence proxy and K-sample disagreement correlate at 0.98 with the action-noise hyperparameter $σ$, while correlating at the noise floor with actual safety violations. We test the failure's scope by replicating the analysis under token-level temperature sampling and find the failure is mechanism-specific and partially mitigated under policy-stochasticity-based sampling. A learned violation predictor conditioned on semantic visual features and task identity supports tight calibration: at $ε$ = 0.05 on libero_object_temp_x0.1 with OpenVLA-OFT, the conditional CRC bound holds on 86% of bootstrap splits with 78% coverage and 70% net task success. Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93. Results are stable across 5 training seeds, replicate within bootstrap noise on $π_0$-FAST, hold on libero_spatial_temp_x0.1 as a co-equal benchmark, and survive four within-suite distribution shifts. We additionally identify and correct a methodological pitfall: globally-set force thresholds well below expert-typical manipulation forces conflate unsafe behavior with normal manipulation, inflating violation rates by $5\times$.
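
To make the idea of a calibrated abstention threshold concrete, here is a generic conformal-risk-control style sketch; it is not the paper's procedure. It assumes each calibration episode provides a predicted violation score for the verifier-best candidate (higher means riskier) and a binary label saying whether executing it caused a violation. The threshold is the largest score for which the corrected empirical risk (violations + 1) / (n + 1) stays at or below ε. All names are hypothetical.

```python
def calibrate_threshold(scores, violated, epsilon):
    """Largest score threshold whose corrected violation risk is <= epsilon.

    An action is executed when its score is <= the threshold. Returns None
    when no candidate threshold is safe, meaning: always abstain.
    """
    n = len(scores)
    best = None
    for lam in sorted(set(scores)):
        executed_violations = sum(
            1 for s, v in zip(scores, violated) if s <= lam and v
        )
        # Finite-sample correction: count one extra worst-case loss
        if (executed_violations + 1) / (n + 1) <= epsilon:
            best = lam
        else:
            break  # risk is non-decreasing in the threshold
    return best

def should_execute(score, threshold):
    """Execute only when the score is within the calibrated threshold."""
    return threshold is not None and score <= threshold
```

A per-task (Mondrian) variant in this style would simply run the same calibration separately on each task's calibration episodes.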

2605.30647 2026-06-01 cs.RO Version update

Bidirectional Incremental Generalized Hybrid A*

Sidharth Talia, Oren Salzman, Siddhartha Srinivasa

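AI summary: For efficient anytime kinodynamic planning of systems with complex dynamics in unstructured environments, proposes Bidirectional Incremental Generalized Hybrid A*, which uses bidirectional search to mitigate frozen vertices hiding solutions, preserves guarantees on monotonic cost improvement and termination, and substantially reduces expansions.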

Abstract

We focus on the problem of efficient anytime kinodynamic planning for systems with complex dynamics in unstructured environments that make precomputing motion primitives infeasible. Directly applying A* to such problems is computationally infeasible due to the curse of dimensionality. Methods such as Hybrid A* address this burden by discretizing the state space, but in turn create a coupling between tree discovery and the discretization resolution. The Incremental Generalized Hybrid A* (IGHA*) performs search over a hierarchy of resolutions in an anytime fashion to break this coupling, by freezing vertices to use in later search iterations rather than pruning them. However, the frozen vertices can hide solution-supporting vertices from the search at a particular iteration. While classical bidirectional search is motivated by the reduction of search depth, extending IGHA* into the bidirectional setting (termed Bi-IGHA*) obtains an additional benefit by fundamentally mitigating the behaviour induced by frozen vertices hiding solutions. We show that Bi-IGHA* preserves IGHA*'s guarantees on monotonic cost improvement and termination. We empirically show that Bi-IGHA* substantially reduces expansions on R3, R4, and R6 planning problems, and achieves equivalent closed-loop performance with kinodynamic planning for high-speed off-road autonomy while requiring significantly fewer expansions. Website: https://personalrobotics.github.io/IGHAStar/biighastar.html

2605.30639 2026-06-01 cs.CV cs.AI cs.RO 版本更新

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

PInVerify：面向主动实例验证的离线具身基准

Yuhang Jiang

发表机构 * University of Trento（特伦托大学）

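AI summary: Proposes the Active Instance Verification task and builds PInVerify, an offline embodied benchmark that evaluates embodied agents through multi-view navigation and fine-grained attribute matching, with baselines built on multimodal large language models.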

Comments Accepted as a poster at the Foundation Models Meet Embodied Agents (FMEA) Workshop, CVPR 2026. 44 pages including appendix. Code: https://github.com/Avalon-S/PInVerify

AI中文摘要

具身智能体在导航到目标物体方面取得了显著进展，但到达目标附近并不能保证智能体找到了正确的实例：微妙的属性差异（例如“白色花卉”与“白色条纹”）通常需要近距离、多视角检查。我们通过主动实例验证（AIV）来解决这一差距，该任务要求智能体主动围绕候选对象选择视角，以判断其是否匹配细粒度的自然语言描述。我们将AIV形式化为一个有限时域决策过程，并引入PInVerify，一个用于AIV的离线具身基准：包含18个物体类别的3000个评估回合，以多视角捕获形式提供，并采用6扇区导航拓扑，暴露陷阱视角（可导航但无信息）和不可达扇区。作为参考基线，我们构建了一个无需训练的流水线和一个基于开源多模态大语言模型（MLLMs）的LoRA微调端到端智能体（参数规模≤8B），包括属性分解、可见性加权多视角跟踪器和三种下一最佳视角（NBV）策略。在Qwen3-VL（4B/8B）、SenseNova-SI-1.2-InternVL3-8B、CLIP和SigLIP2上的评估中，最佳MLLM基线超过最佳嵌入基线4.9个百分点；GT框消融实验显示检测差距为+3.1个百分点；在测试的NBV策略中，我们未观察到主动视角选择带来的可靠增益。LoRA微调智能体（SFT+GSPO）达到85.6%。PInVerify旨在支持具身AI中主动、细粒度语义验证的进一步研究。代码：https://github.com/Avalon-S/PInVerify。

英文摘要

Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.

2605.30617 2026-06-01 cs.RO math.OC 版本更新

Exploiting Chordal Sparsity for Globally Optimal Estimation with Factor Graphs

利用弦稀疏性实现因子图的全局最优估计

Avinash Subramanian, Connor Holmes, Timothy D. Barfoot, Frank Dellaert, Frederike Dümbgen

发表机构 * College of Computing, Georgia Institute of Technology（佐治亚理工学院计算机学院）； Robotics Institute, University of Toronto（多伦多大学机器人研究所）； Department of Mechanical Engineering, Carnegie Mellon University（卡内基梅隆大学机械工程系）

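AI summary: The paper automatically constructs convex semidefinite programming relaxations within the GTSAM framework and exploits the Bayes tree decomposition to speed up solving, enabling globally optimal estimation with factor graphs.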

Journal ref: ICRA 2026 WORKSHOP ON FRONTIERS OF OPTIMIZATION FOR ROBOTICS

AI中文摘要

鲁棒且高效的状态估计对于机器人感知、导航和控制至关重要。状态估计问题可以方便地使用因子图框架建模，如现代软件包GTSAM或g2o所支持的那样。然而，这些框架中包含的标准求解器是局部的，可能收敛到较差的局部最小值，带来显著的安全隐患。相反，基于凸松弛的技术已被证明能够全局求解或认证许多状态估计问题。但是，这些松弛方法1)通常需要大量精力来构建，并且2)与高效的局部求解器相比，可能产生显著更高的成本，因为它们需要求解一个大规模半定规划（SDP）。在这项工作中，我们通过以下方式解决了这两个缺点：1)在GTSAM框架内创建了一个新过程，用于自动为任何具有常见因子和变量类型的因子图构建凸SDP松弛，以及2)利用GTSAM原生的贝叶斯树结构来分解SDP问题，从而在弦稀疏问题上显著加速求解器时间。我们通过两个案例研究展示了这种利用结构的全局估计器与标准局部求解器相比的有利扩展性：一个带有环因子图的三维位姿图SLAM问题和一个带有链因子图的二维定位问题。软件框架可在https://github.com/borglab/gtsam获取。

英文摘要

Robust and efficient state estimation is crucial for perception, navigation, and control in robotics. State estimation problems are conveniently modeled using the factor-graph framework as enabled by modern software packages such as GTSAM or g2o. However, the standard solvers included in such frameworks are local and may converge to poor local minima, posing significant safety concerns. Conversely, techniques based on convex relaxations have been shown to provide a means of globally solving or certifying many state estimation problems. However, these relaxations 1) often require substantial effort to formulate, and 2) may incur significantly higher cost compared to efficient local solvers, as they require solving a large semidefinite program (SDP). In this work, we address both shortcomings by 1) creating a new procedure within the GTSAM framework for automatically constructing convex SDP relaxations for any factor graphs with common factor and variable types, and by 2) exploiting the Bayes tree constructions native to GTSAM to decompose the SDP problem, leading to significant speedup in solver time for chordally sparse problems. We demonstrate the favorable scaling of this structure-exploiting global estimator compared to standard local solvers for two case studies: A 3D pose-graph SLAM problem with a ring factor graph and a 2D localization problem with a chain factor graph. The software framework is available at https://github.com/borglab/gtsam.

2605.30612 2026-06-01 cs.RO cs.LG cs.SY eess.SY 版本更新

ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning

ZAPS-DA：基于解耦演员的零相位动作策略平滑用于强化学习中的连续控制

Faiq Shamass

发表机构 * Independent Researcher（独立研究者）

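AI summary: Proposes the ZAPS-DA framework, in which a decoupled actor network imitates zero-phase filtered targets to reduce action jitter in continuous control policies with negligible phase lag and no post-processing, validated in driving simulation.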

Comments 7 pages, 5 figures, 5 tables. Submitted to IEEE RA-L

AI中文摘要

基于离策略强化学习训练的连续控制策略经常表现出高频动作抖动，使得直接部署在物理执行器上不可行。事后滤波可以减弱抖动但引入相位延迟；在演员损失中嵌入平滑惩罚会将其与RL梯度耦合，并将奖励退化与过度激进的平滑混为一谈。我们提出ZAPS-DA，一个在部署时减少动作抖动且具有可忽略相位延迟和无后处理的框架。ZAPS-DA将一个未修改的主演员（由基础RL损失训练）与一个单独的解耦演员配对，该解耦演员通过监督学习模仿存储在回放缓冲区中的零相位滤波目标。部署的策略是解耦演员：一个从当前观测到平滑动作的前馈映射，没有推理时滤波和动作历史输入——我们称之为非因果滤波器的因果蒸馏机制。幅度匹配的MSE损失提供了跨优化器类别的零超参数可移植性。使用Soft Actor-Critic和Savitzky-Golay滤波器在两个驾驶模拟器中通过配对n=150评估协议进行验证：在MetaDrive上，ZAPS-DA将转向抖动减少14-21倍，油门抖动减少3-5倍（所有p < 10^{-4}，Bonferroni校正），同时以6.3%的奖励成本匹配任务完成率（成功率p=0.28，碰撞率p=0.31）；在自定义Webots自适应巡航控制环境中，相同的SG配置产生了帕累托改进——奖励持平（p=0.121），转向抖动减少8-45倍，总任务失败率从2.0%降至0.7%。

英文摘要

Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but introduces phase lag; embedding smoothness penalties in the actor's loss couples them with the RL gradient and conflates reward regression with over-aggressive smoothing. We present ZAPS-DA, a framework that reduces action jitter at deployment with negligible phase lag and no post-processing. ZAPS-DA pairs an unmodified main actor (trained by the base RL loss) with a separate decoupled actor trained via supervised imitation of zero-phase filtered targets stored in the replay buffer. The deployed policy is the decoupled actor: a feed-forward map from the current observation to a smooth action, with no inference-time filter and no action-history input -- a mechanism we term causal distillation of a non-causal filter. A magnitude-matched MSE loss provides zero-hyperparameter portability across optimizer classes. Validated with Soft Actor-Critic and a Savitzky--Golay filter in two driving simulators using paired n=150 evaluation protocols: on MetaDrive, ZAPS-DA reduces steering jitter by 14--21x and throttle jitter by 3--5x (all $p < 10^{-4}$, Bonferroni-corrected) while matching task-completion (p=0.28 success, p=0.31 crash) at a 6.3% reward cost; on a custom Webots adaptive cruise control environment, the same SG configuration produces a Pareto improvement -- reward parity (p=0.121), 8--45x steering jitter reduction, and total task-failure rate reduced from 2.0% to 0.7%.
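
As an illustration of why a zero-phase target needs offline access to a stored trajectory, here is a minimal sketch of a centered Savitzky-Golay smoother applied to a one-dimensional action sequence. The 5-sample window and quadratic fit are assumptions, since the abstract does not state the filter configuration, and `zero_phase_targets` and `jitter` are hypothetical helper names, not part of ZAPS-DA.

```python
from typing import List, Sequence

# Savitzky-Golay smoothing weights for a 5-sample window and a quadratic fit.
# (Window length and order are illustrative assumptions.)
SG5_QUADRATIC = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def zero_phase_targets(actions: Sequence[float]) -> List[float]:
    """Smooth a stored action sequence with a symmetric (non-causal) window.

    The window is centered on each step, so it uses two future samples.
    That symmetry is what removes phase lag, and it is also why the filter
    can only run on data that has already been recorded.
    """
    half = len(SG5_QUADRATIC) // 2
    smoothed = list(actions)  # edge samples keep their raw values
    for t in range(half, len(actions) - half):
        window = actions[t - half : t + half + 1]
        smoothed[t] = sum(w * a for w, a in zip(SG5_QUADRATIC, window))
    return smoothed


def jitter(actions: Sequence[float]) -> float:
    """Mean absolute step-to-step change, used here as a simple jitter proxy."""
    diffs = [abs(b - a) for a, b in zip(actions, actions[1:])]
    return sum(diffs) / len(diffs)
```

A student network trained to regress these targets from the current observation alone would then be causal at inference time, which is one reading of the "causal distillation of a non-causal filter" described above.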

2605.30583 2026-06-01 cs.RO cs.PF 版本更新

Caspar: CUDA Accelerator for Symbolic Programming with Adaptive Reordering

Caspar: 基于自适应重排序的符号编程CUDA加速器

Emil Martens, Aaron Miller, Matias Varnum, Annette Stahl

发表机构 * Norwegian University of Science and Technology（挪威科学与技术大学）； Skydio

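AI summary: Proposes the Caspar library, which automatically generates optimized CUDA kernels to bridge symbolic expressions in Python and a high-performance GPU runtime, and achieves a 5 to 20x speedup on the Bundle Adjustment in the Large (BAL) dataset.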

Comments Accepted at ICRA 2026

AI中文摘要

我们提出Caspar，一个使现代GPU在机器人领域更易用的库，并提供可应用于多种优化问题的最先进非线性GPU求解器。Caspar通过从符号表达式自动生成优化的CUDA内核，弥合了Python中表达性符号编程与C++中高性能GPU运行时之间的差距。基于SymForce库，用户可以轻松定义和组合符号表达式（包括李群运算），以生成自定义CUDA内核。要将Caspar用作求解器，用户只需定义符号残差函数；Caspar随后使用符号微分生成必要的GPU内核和接口以执行非线性优化。本文介绍了Caspar的核心组件，并通过在Bundle Adjustment in the Large (BAL)数据集上执行光束法平差展示了其性能。我们将Caspar与其他最先进的光束法平差器进行基准测试，结果表明它比最佳替代方案快5到20倍，所需内存更少，且达到相似的精度。这说明了我们的符号GPU编程方法的优势。Caspar作为SymForce的一部分发布，可在https://github.com/symforce-org/symforce免费获取。

英文摘要

We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library, users can easily define and combine symbolic expressions, including Lie group operations, to generate custom CUDA kernels. To use Caspar as a solver, users need only define the symbolic residual functions; Caspar then uses symbolic differentiation to generate the necessary GPU kernels and interfaces to perform nonlinear optimization. In this paper, we present the core components of Caspar and showcase its performance by performing bundle adjustment on the Bundle Adjustment in the Large (BAL) dataset. We benchmark Caspar against other state-of-the-art bundle adjusters and show that it is 5 to 20 times faster than the best alternative, requires less memory, and achieves similar accuracy. This illustrates the benefit of our symbolic GPU programming approach. Caspar is released as part of SymForce and is freely available at https://github.com/symforce-org/symforce

2605.30571 2026-06-01 cs.AR cs.AI cs.DC cs.PF cs.RO 版本更新

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

受限于内存但不受限于带宽：批量1的LLM解码中的物理AI推理差距

Josef Chen

发表机构 * KAIKAKU（卡伊卡普）

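AI summary: By measuring batch-1 autoregressive decode performance on different GPUs, the paper finds that physical-AI inference is limited not only by memory bandwidth but also by launch overhead, and notes that the actual benefit of quantized paths depends on the runtime implementation.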

AI中文摘要

物理AI系统，包括机器人、自动驾驶车辆、具身智能体和边缘副驾驶，通常运行与云端LLM服务不同的推理工作负载：单流、批量1的自回归解码，其中一个机器人、摄像头流或用户会话等待下一个token。这种工作负载通常被描述为受内存带宽限制。每个解码步骤都会流式传输模型权重和活跃的KV缓存，因此延迟应与峰值HBM带宽成比例。我们表明这种说法是正确的但不完整。我们测量了三个7至8B类GQA Transformer模型在四个NVIDIA GPU（H100 SXM5、A100-80GB SXM4、L40S和L4）上的批量1解码。我们评估了从2048到16384的上下文长度，在受控的bf16 SDPA设置下产生了44个有效单元。实际达到的峰值HBM带宽占比随着峰值带宽的增加而下降。在标题性的Qwen-2.5-7B ctx=2048单元中，L4达到了其分析内存下限的大约81%，而H100仅达到27%。物理AI解码是内存主导的，但更快的内存并不能转化为成比例的延迟增益。我们通过CUDA Graphs A/B实验测试了缺失项。在H100上，ctx=2048时，CUDA Graphs在N=10个新会话中将解码延迟改善了1.259倍，95%自助法置信区间为1.253至1.267。在L4上，相同的干预仅提供了1.028倍的提升。这分离出了在快速GPU上可见但在较慢、带宽受限的GPU上基本隐藏的启动侧开销。部署的含义是，只有当运行时真正兑现内存节省时，内存节省才重要。在L4上，bf16解码接近内存下限，但常见的量化路径并未恢复预期的4倍权重流量减少：从62.32 ms/step的bf16基线，bnb-nf4达到59.36 ms/step，AutoAWQ+Marlin达到45.24 ms/step。使用Ada调优的int4内核的GPTQ+ExLlamaV2达到17.36 ms/step。

英文摘要

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
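
The quantization numbers in the abstract are easier to compare as speedups against the ideal 4x weight-traffic reduction. The sketch below does that arithmetic using only the per-step latencies quoted above. The `memory_floor_ms` helper shows the usual bytes-over-bandwidth estimate. Its example inputs in the usage note are hypothetical round numbers, not values from the paper.

```python
from typing import Dict

# Per-step decode latencies on L4, as quoted in the abstract (ms/step).
BF16_BASELINE_MS = 62.32
QUANTIZED_MS = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}
IDEAL_SPEEDUP = 4.0  # expected from a 4x reduction in weight traffic


def memory_floor_ms(bytes_per_step: float, bandwidth_bytes_per_s: float) -> float:
    """Analytic lower bound on step latency if only memory traffic mattered."""
    return 1e3 * bytes_per_step / bandwidth_bytes_per_s


def speedups(baseline_ms: float, measured_ms: Dict[str, float]) -> Dict[str, float]:
    """Speedup of each quantized path relative to the bf16 baseline."""
    return {name: baseline_ms / ms for name, ms in measured_ms.items()}


def fraction_of_ideal(baseline_ms: float, measured_ms: Dict[str, float]) -> Dict[str, float]:
    """How much of the ideal 4x speedup each path actually delivers."""
    return {name: s / IDEAL_SPEEDUP for name, s in speedups(baseline_ms, measured_ms).items()}
```

Calling `speedups(BF16_BASELINE_MS, QUANTIZED_MS)` gives roughly 1.05x, 1.38x, and 3.59x, so only the GPTQ+ExLlamaV2 path comes close to the ideal. As a purely illustrative use of the floor estimate, `memory_floor_ms(15e9, 300e9)` evaluates to 50 ms for an assumed 15 GB of traffic per step at an assumed 300 GB/s.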

2605.30569 2026-06-01 cs.RO 版本更新

Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity

Any-ttach: 快速末端执行器更换实现简洁的灵巧操作

Weizhe Ni, Jinzhou Li, Haoyu Li, Cody Andres Alessio-Bunnell, Wenjing Pan, Xianyi Cheng

发表机构 * Department of Mechanical Engineering and Materials Science, Duke University（杜克大学机械工程与材料科学系）

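AI summary: Proposes the Any-ttach framework, which combines a low-cost quick end-effector swapping mechanism with task planning to achieve dexterous manipulation with a variety of tools and end-effector modules; reliability and efficiency gains are validated on long-horizon tasks.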

AI中文摘要

机器人操作灵巧性通常通过构建越来越复杂的高自由度多指手来实现。虽然许多机器人手被设计为复制人类形态，但人手的功能角色暗示了不同的视角：其复杂性可能很大程度上是为了支持工具使用和工具制造。这一观察启发了Any-ttach，一个以工具为中心的操作框架，将快速末端执行器更换视为实现简单灵巧性的机制。Any-ttach结合了用于开合机器人接口的低成本自动更换机制、用于收集人类演示的手持设备，以及一个组合了学习、参数化和规划的工具使用技能的任务规划框架。该系统通过相同的共享接口支持多种工具和末端执行器模块，包括日常工具、铰接工具（如剪刀）、Fin Ray手指和低成本拟人手。我们的实验表明，Any-ttach提高了工具更换的可靠性，增加了演示效率，减少了工具位姿变异性，并支持多样化的工具使用技能。在两个长时任务（制作三明治和准备黄瓜）中，Any-ttach通过末端执行器切换和执行监控执行了六个工具使用子技能。这些结果表明，机器人不仅可以通过更复杂的末端执行器，还可以通过快速可更换的工具和末端执行器模块来扩展操作能力。更多详情和视频请访问https://any-ttach.github.io/。

英文摘要

Robotic manipulation dexterity is often pursued by building increasingly complex high-DoF multifingered hands. While many robotic hands are designed to replicate human morphology, the functional role of human hands suggests a different perspective: much of their complexity may exist to enable tool use and tool making. This observation motivates Any-ttach, a tool-centric manipulation framework that treats quick end-effector swapping as a mechanism for dexterity with simplicity. Any-ttach combines a low-cost automatic swapping mechanism for an open-close robot interface, a handheld device for collecting human demonstrations, and a task planning framework that composes learned, parameterized, and planned tool-use skills. The system supports diverse tools and end-effector modules, including daily tools, articulated tools such as scissors, Fin Ray fingers, and a low-cost anthropomorphic hand, through the same shared interface. Our experiments show that Any-ttach improves tool-swapping reliability, increases demonstration efficiency, reduces tool-pose variability, and supports diverse tool-use skills. In two long-horizon tasks, making a sandwich and preparing a cucumber, Any-ttach executes six tool-use subskills through end-effector switching and execution monitoring. These results suggest that robots can expand manipulation capability not only through more complex end-effectors, but also through rapidly exchangeable tools and end-effector modules. More details and videos are available at https://any-ttach.github.io/.

2605.30508 2026-06-01 cs.RO 版本更新

ARISTO Hand: Sensing-Driven Distal Hyperextension for Fine-Grained Manipulation

ARISTO Hand：基于感知驱动的远端过伸实现精细操作

Aaron Kim, Dong Ho Kang, Mark Helwig, Mingyo Seo, Kazuto Yokoyama, Tetsuya Narita, Luis Sentis

发表机构 * Human Centered Robotics Lab at The University of Texas at Austin（德克萨斯大学奥斯汀分校人本机器人实验室）； Sony Group Corporation（索尼集团）

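AI summary: Proposes ARISTO Hand, a tendon-driven robotic hand that uses active distal hyperextension and a hybrid fingertip sensing architecture (a rigid nail-mounted force-torque sensor plus a soft capacitive tactile array) to improve manipulation of thin objects, increasing pull-out force by 2.76x for object thicknesses of 1-20 mm and performing fine tasks such as SD card extraction and insertion.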

AI中文摘要

操作薄物体需要精确的接触几何和可靠的力感知，然而许多拟人化机械手缺乏此类交互所需的机械和传感能力。我们提出ARISTO Hand，一种肌腱驱动机器人手，它将主动远端过伸与混合指尖传感架构相结合，该架构结合了刚性指甲安装的力-扭矩传感器和软电容触觉阵列。主动过伸使得指尖能够在标准屈曲的运动学极限之外进行受控接合，对于1-20 mm的物体厚度，拔出力提高了2.76倍，同时保留了标称抓取能力。刚性指甲安装传感器在边缘接触期间提供可靠的力测量，此时本体感觉力估计的灵敏度随着接触几何接近运动学奇点而下降。我们通过定量力表征和多阶段SD卡提取与插入任务验证了所提出的架构。视频和补充材料可在 https://aristohand.github.io 获取。

英文摘要

Manipulating thin objects requires precise contact geometry and reliable force perception, yet many anthropomorphic robotic hands lack the mechanical and sensing capabilities needed for such interactions. We present the ARISTO Hand, a tendon-driven robotic hand that integrates active distal hyperextension with a hybrid fingertip-sensing architecture that combines a rigid, nail-mounted force-torque sensor and a soft capacitive tactile array. Active hyperextension enables controlled fingertip engagement beyond the kinematic limits of standard flexion, increasing pull-out force by 2.76x for object thicknesses of 1-20 mm while preserving the nominal grasp capability. The rigid nail-mounted sensor provides reliable force measurements during edge contacts, where the sensitivity of proprioceptive force estimation degrades as the contact geometry approaches kinematic singularities. We validate the proposed architecture through quantitative force characterization and a multi-stage SD card extraction and insertion task. Video and supplementary materials are available at: https://aristohand.github.io

2605.30506 2026-06-01 cs.RO cs.CV 版本更新

VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments

VLM-GLoc：视觉语言模型增强的蒙特卡洛定位，用于杂乱准静态环境中的鲁棒语义全局定位

Shivendra Agrawal, Bradley Hayes

发表机构 * University of Colorado Boulder（科罗拉多大学博尔德分校）

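AI summary: Proposes VLM-GLoc, which uses open-vocabulary vision-language models as a unified semantic observation front-end and, through an inverse semantic proposal mechanism with text-to-map retrieval, achieves robust global localization in geometrically aliased and semantically ambiguous quasi-static environments.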

AI中文摘要

在几何模糊的准静态环境（如杂货店、办公室、学校和医院）中，全局定位对移动机器人构成重大挑战。具有平行过道和长尾产品分布的杂货店，以及具有重复家具（如椅子、桌子、显示器和门）的办公室和实验室，是常见的室内环境，存在几何甚至语义歧义。传统方法要么依赖独特的几何特征，要么依赖特定领域的视觉管道，这些方法难以处理长尾语义分布和瞬态视觉杂乱。我们提出VLM-GLoc，一种分层语义蒙特卡洛定位（MCL）方法，利用开放词汇视觉语言模型（VLM）作为统一语义观测前端。我们假设VLM具有三重优势：（1）提取高度判别性的丰富文本特征，（2）对模糊或动态对象进行隐式质量过滤，（3）针对数据增强的持久性推理。我们引入一种逆语义提议机制，通过文本到地图检索播种粒子。在两个具有不同特征的真实世界环境和两个不同平台上进行评估：一个3500平方英尺的杂货店（使用手机）和一个3700平方英尺的实验室空间（使用四足机器人），VLM-GLoc分别实现了70%和74%的全局定位成功率，显著优于传统的纯几何和特定领域基线方法。

英文摘要

Global localization in geometrically aliased, quasi-static environments such as grocery stores, offices, schools, and hospitals poses a significant challenge for mobile robots. Grocery stores with parallel aisles and a long tailed distribution of products, as well as offices and labs with repetitive furniture such as chairs, desks, monitors, and doors, exemplify common indoor environments that present geometric and even semantic ambiguity. Traditional approaches rely either on distinct geometric features or on domain-specific vision pipelines that struggle with long-tail semantic distributions and transient visual clutter. We present VLM-GLoc, a method for hierarchical semantic Monte Carlo Localization (MCL) that leverages open-vocabulary Vision-Language Models (VLMs) as a unified semantic observation front-end. We hypothesize a three-fold benefit from VLMs: (1) extracting highly discriminative rich text features, (2) implicit quality filtering of blurry or dynamic objects, and (3) permanence reasoning for targeted data augmentation. We introduce an inverse semantic proposal mechanism that seeds particles via text-to-map retrieval. Evaluated across two real-world environments with different characteristics and two different platforms: a 3,500 sq. ft. grocery store with a cellphone and a 3,700 sq. ft. lab space with a quadruped, VLM-GLoc achieves 70% and 74% global localization success respectively, substantially outperforming traditional geometry-only and domain-specific baselines.

2605.30503 2026-06-01 cs.RO cs.SY eess.SY stat.ML 版本更新

Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics

混合接触动力学下的物理信息目标条件强化学习

Vittorio Giammarino, Anastasios Manganaris, Ahmed H. Qureshi

发表机构 * Department of Computer Science（计算机科学系）

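AI summary: To address the performance degradation of existing physics-informed goal-conditioned reinforcement learning methods caused by hybrid dynamics in contact-rich tasks, the paper proposes contact-aware and hierarchical formulations that apply physics-informed inductive biases selectively, as a step toward contact-rich manipulation.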

AI中文摘要

从稀疏反馈中学习达到任意目标需要智能体推断状态-目标对之间的丰富可达性概念。目标条件强化学习（GCRL）通过学习跨目标泛化的策略来应对这一挑战，但随着底层动力学变得高维、混合或接触依赖，这种泛化变得越来越困难。为了解决这个问题，物理信息GCRL（Pi-GCRL）将最优控制启发的归纳偏置引入目标条件价值学习。虽然Pi-GCRL方法在导航和无物体的目标到达领域已被证明有效，但它们在接触丰富任务中的可靠性仍不清楚，其中接触交互导致混合动力学、模式依赖的可控性和非光滑价值景观。在这项工作中，我们表明这些结构特性可能导致现有Pi-GCRL方法在朴素应用于接触丰富操作时性能下降。受此分析启发，我们引入了接触感知和分层公式，在操作问题的不同部分选择性地应用物理信息归纳偏置。我们的结果为将Pi-GCRL扩展到接触丰富操作提供了原则性的一步。

英文摘要

Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state--goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies that generalize across goals, but this generalization becomes increasingly difficult as the underlying dynamics become high-dimensional, hybrid, or contact-dependent. To address this issue, physics-informed GCRL (Pi-GCRL) introduces optimal-control-inspired inductive biases into goal-conditioned value learning. While Pi-GCRL methods have proven effective in navigation and object-free goal-reaching domains, their reliability in contact-rich tasks remains unclear, where contact interactions induce hybrid dynamics, mode-dependent controllability, and nonsmooth value landscapes. In this work, we show that these structural properties can cause existing Pi-GCRL methods to degrade when applied naively to contact-rich manipulation. Motivated by this analysis, we introduce contact-aware and hierarchical formulations that apply physics-informed inductive biases selectively across the manipulation problem. Our results provide a principled step toward extending Pi-GCRL to contact-rich manipulation.

2605.30488 2026-06-01 cs.RO 版本更新

CoMo3R-SLAM: Collaborative Monocular Dense SLAM with Learned 3D Reconstruction Priors for Outdoor Multi-Agent Systems

CoMo3R-SLAM: 面向室外多智能体系统的协作式单目稠密SLAM与学习型3D重建先验

Zhihao Cao, Qi Shao, Shuhao Zhai, Feng Tian, Anh Nguyen, Hesheng Wang, Baoru Huang

发表机构 * ETH Zurich（苏黎世联邦理工学院）； University of Liverpool（利物浦大学）； Harbin Engineering University（哈尔滨工程大学）； University of Ottawa（渥太华大学）； Shanghai Jiao Tong University（上海交通大学）； Imperial College London（伦敦帝国理工学院）

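AI summary: Proposes CoMo3R-SLAM, the first collaborative monocular dense RGB SLAM system, which uses learned feed-forward 3D reconstruction priors for outdoor multi-agent mapping and produces globally consistent metric maps without depth sensors.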

AI中文摘要

协作式稠密SLAM对于多机器人团队在大规模室外环境中实现可扩展且一致的3D感知至关重要。现有系统通常依赖深度传感器，导致显著的载荷、功耗和标定成本。单目RGB相机是一种轻量级替代方案，但协作式单目稠密SLAM仍面临尺度模糊、智能体间数据关联不可靠等困难，尤其是在室外场景中，低重叠和重复结构使得传统特征匹配不可靠，从而需要鲁棒的几何信息。我们提出CoMo3R-SLAM，这是首个利用鲁棒的学习前馈3D重建先验进行室外多智能体地图构建的协作式单目稠密RGB SLAM系统。每个智能体运行一个先验引导的前端，用于实时跟踪和局部稠密融合，而协调器执行稠密点图匹配以进行跨智能体验证、闭式Sim(3)规范同步以及GPU加速的全局光束法平差与分段深度优化。我们的系统既不需要深度传感器也不需要参数化内参，仅凭单目RGB即可产生鲁棒的跨智能体约束和全局一致的度量地图。在Tanks and Temples和Waymo序列上，CoMo3R-SLAM在四个Tanks and Temples场景中的三个上实现了最佳ATE，并在Waymo上达到竞争性精度，匹配或超越最先进的RGB-D方法，同时以8 FPS在线运行。

英文摘要

Collaborative dense SLAM is essential for multi-robot teams to achieve scalable and consistent 3D perception across large-scale outdoor environments. Existing systems typically depend on depth sensors, incurring significant payload, power, and calibration costs. Monocular RGB cameras are a lightweight alternative, but collaborative monocular dense SLAM remains difficult due to scale ambiguity and unreliable inter-agent data association, especially in outdoor scenes where low overlap and repetitive structures make traditional feature matching unreliable, which motivates the use of robust geometric information. We propose CoMo3R-SLAM, the first collaborative monocular dense RGB SLAM system that leverages robust learned feed-forward 3D reconstruction priors for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion, while a coordinator performs dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment with segment-level depth optimization. Requiring neither depth sensors nor parametric intrinsics, our system produces robust cross-agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences, CoMo3R-SLAM achieves the best ATE on three of four Tanks and Temples scenes and competitive Waymo accuracy, matching or exceeding state-of-the-art RGB-D methods while running online at 8 FPS.

2605.30484 2026-06-01 cs.RO 版本更新

ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation

ELAN4D：以具身为中心的4D监督用于视觉-语言-动作模型的即插即用适配

Zeyuan He, Bowen Yang, Zhirui Fang, Keru Zhou, Lei Jiang, Jingjing Qian, Fan Mo, Junchi Yan, Philip Torr, Xiu Li, Li Jiang, Jialin Yu

发表机构 * Torr Vision Group, University of Oxford（托尔视觉组，牛津大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Tsinghua University（清华大学）； Shanghai Jiao Tong University（上海交通大学）； University College London（伦敦大学学院）； University of Cambridge（剑桥大学）

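AI summary: Proposes the ELAN4D framework, which uses future robot keypoint tracks as predictive spatio-temporal supervision to improve the robustness and generalization of VLA policies in a plug-and-play manner.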

AI中文摘要

视觉-语言-动作（VLA）模型在机器人操作中展现出潜力，但现有大多数策略通过直接从当前观测回归动作来反应式运行，没有显式建模未来动态。这限制了它们在分布外扰动下的泛化能力。为解决此问题，我们提出ELAN4D，一个以具身为中心的4D感知训练框架，通过未来机器人关键点轨迹作为预测性时空监督来增强VLA策略。仅利用本体感觉状态的前向运动学，我们推导出机器人关键点（如关节和末端执行器）的3D位移轨迹，预处理成本可忽略。这些轨迹提供度量且紧凑的监督，无需外部跟踪器或重建。一个即插即用的辅助分支，配备轻量级轨迹解码器，在通过梯度隔离保护预训练视觉-语言主干的同时，将4D信号注入动作专家。推理时丢弃轨迹解码器，保持基础策略接口不变。在LIBERO、LIBERO-Plus、RoboTwin2.0和真实世界操作任务上的大量实验表明，ELAN4D持续优于强VLA基线，在分布外扰动（包括相机、背景和布局变化）下取得最佳整体性能和显著提升。这些结果凸显了以具身为中心的4D监督对于构建更鲁棒和可泛化的操作策略的有效性。

英文摘要

Vision-Language-Action (VLA) models have shown promise for robotic manipulation, yet most existing policies operate reactively by directly regressing actions from current observations, without explicitly modeling future dynamics. This limits their ability to generalize under out-of-distribution perturbations. To address this issue, we propose ELAN4D, an embodiment-centric, 4D-aware training framework that enhances VLA policies with future robot keypoint tracks as predictive spatio-temporal supervision. Using only forward kinematics from proprioceptive states, we derive 3D displacement tracks of robot keypoints, such as joints and the end-effector, with negligible preprocess cost. These tracks provide metric and compact supervision without requiring external trackers or reconstruction. A plug-and-play auxiliary branch with a lightweight track decoder injects this 4D signal into the action expert while preserving the pretrained vision-language backbone through gradient isolation. The track decoder is discarded during inference, leaving the base policy interface unchanged. Extensive experiments on LIBERO, LIBERO-Plus, RoboTwin2.0 and real-world manipulation tasks demonstrate that ELAN4D consistently improves over strong VLA baselines, achieving the best overall performance and substantial gains under out-of-distribution perturbations, including camera, background, and layout shifts. These results highlight the effectiveness of embodiment-centric 4D supervision for building more robust and generalizable manipulation policies.
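
The step of deriving keypoint displacement tracks from proprioception alone can be sketched with a toy example. The code below uses a hypothetical planar two-link arm rather than a real 3D manipulator, and the function names and link lengths are assumptions for illustration. It is not ELAN4D's implementation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def planar_keypoints(q1: float, q2: float, l1: float, l2: float) -> List[Point]:
    """Forward kinematics of a planar 2-link arm: elbow and end-effector positions."""
    elbow = (l1 * math.cos(q1), l1 * math.sin(q1))
    tip = (
        elbow[0] + l2 * math.cos(q1 + q2),
        elbow[1] + l2 * math.sin(q1 + q2),
    )
    return [elbow, tip]


def keypoint_tracks(
    joint_states: Sequence[Tuple[float, float]], l1: float, l2: float
) -> List[List[Point]]:
    """Displacement of every keypoint at each future step, relative to step 0.

    joint_states[0] is the current joint configuration and the remaining
    entries are future ones. Only joint angles are needed, so no external
    tracker or scene reconstruction is involved.
    """
    frames = [planar_keypoints(q1, q2, l1, l2) for q1, q2 in joint_states]
    reference = frames[0]
    return [
        [(x - rx, y - ry) for (x, y), (rx, ry) in zip(frame, reference)]
        for frame in frames[1:]
    ]
```

In a training pipeline, tracks like these would serve as regression targets for an auxiliary decoder. The abstract states that this decoder is dropped at inference.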

2605.30468 2026-06-01 cs.RO 版本更新

Learning-Based Navigation for Indoor Mobile Robots

基于学习的室内移动机器人导航

Tri-Tin Nguyen, Tien-Dat Nguyen, Gia-Uy Le, Vinh Nguyen, Vinh-Hao Nguyen

发表机构 * Faculty of Electrical & Electronic Engineering, Ho Chi Minh City University of Technology, VNU-HCM Ho Chi Minh City, Vietnam

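AI summary: Proposes a navigation framework that combines a supervised-learning global planner with a learning-based DWA local planner, trained with behavior cloning and refined with PPO, for safe obstacle-avoiding navigation.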

AI中文摘要

本文提出了一种基于学习的室内移动机器人导航框架。该方法将基于代价感知A*专家轨迹训练的监督神经全局规划器与提出的基于学习的DWA局部规划器相结合，后者被表述为动态窗口法（DWA）动作格上的离散候选选择。对于局部规划，策略首先通过行为克隆进行训练，然后在可行性感知掩码下通过近端策略优化（PPO）进行精炼。该框架在模拟和真实室内环境中进行了实现和评估。实验结果表明，所提方法能够在存在障碍物的情况下生成可行的全局路径和可靠的局部运动指令，以实现安全的目标导向导航。这些结果证明了将基于学习的全局规划与强化学习精炼的局部控制相结合用于室内移动机器人导航的有效性。源代码将在 https://ntdathp.github.io/rl_robot_web/ 发布。

英文摘要

This paper presents a learning-based navigation framework for indoor mobile robots. The proposed method combines a supervised neural global planner, trained from cost-aware A* expert trajectories, with the proposed Learning-Based DWA local planner, which is formulated as discrete candidate selection over the Dynamic Window Approach (DWA) action lattice. For local planning, the policy is first trained by behavior cloning and then refined by Proximal Policy Optimization (PPO) under feasibility-aware masking. The framework is implemented and evaluated in both simulated and real-world indoor environments. Experimental results show that the proposed method generates feasible global routes and reliable local motion commands for safe goal-directed navigation in the presence of obstacles. These results demonstrate the effectiveness of integrating learning-based global planning with reinforcement-learning-refined local control for indoor mobile robot navigation. The source code will be released at https://ntdathp.github.io/rl_robot_web/.
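
Discrete candidate selection over a DWA action lattice with feasibility masking can be sketched as follows. This is a minimal illustration and not the paper's planner: the lattice size, the limits, and the helper names are assumptions, and the logits stand in for the output of a learned policy.

```python
from typing import List, Optional, Sequence, Tuple

Velocity = Tuple[float, float]  # (linear m/s, angular rad/s)


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """Evenly spaced values from lo to hi inclusive."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def dwa_lattice(
    v: float, w: float, a_max: float, alpha_max: float, dt: float,
    v_limit: float, w_limit: float, n_v: int = 5, n_w: int = 5,
) -> List[Velocity]:
    """Velocities reachable within one control step, clipped to robot limits."""
    v_lo, v_hi = max(0.0, v - a_max * dt), min(v_limit, v + a_max * dt)
    w_lo, w_hi = max(-w_limit, w - alpha_max * dt), min(w_limit, w + alpha_max * dt)
    return [(vi, wi) for vi in linspace(v_lo, v_hi, n_v) for wi in linspace(w_lo, w_hi, n_w)]


def select_masked(
    candidates: Sequence[Velocity],
    logits: Sequence[float],
    feasible: Sequence[bool],
) -> Optional[Velocity]:
    """Pick the highest-scoring candidate among those marked feasible."""
    best = None
    for i, ok in enumerate(feasible):
        if ok and (best is None or logits[i] > logits[best]):
            best = i
    return None if best is None else candidates[best]  # None: no safe action
```

Masking before selection means the policy can never output a command that the feasibility check (for example, a collision test on the rolled-out arc) has rejected, which is presumably the purpose of the feasibility-aware masking used during PPO refinement.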

2605.30383 2026-06-01 cs.RO cs.AI 版本更新

VR-DAgger: 用于灵巧数据收集和不确定性引导的在线策略校正的沉浸式VR

René Zurbrügg, Tifanny Portela, Arjun Bhardwaj, Aravind Elanjimattathil Vijayan, Maximum Wilder-Smith, Marco Hutter

发表机构 * Robotics Systems Lab（机器人系统实验室）； ETH Zürich（苏黎世联邦理工学院）； ETH AI Center（ETH人工智能中心）； ETH Augmented Reality Research Lab（ETH增强现实研究实验室）； ETH Mobility Initiative（ETH移动性倡议）； ANYbotics AG（ANYbotics公司）； Swiss Federal Railways（瑞士联邦铁路）

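AI summary: Proposes the VR-DAgger framework, which uses a VR application for dexterous teleoperation and data collection and MC dropout uncertainty scores to select critical failure segments for correction, improving over behavioral cloning by up to 23 percentage points on dexterous manipulation tasks and reducing per-sample collection time by approximately 40% compared to unguided human-in-the-loop inspection.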

AI中文摘要

从示范中学习对于机器人操作是有效的，但收集足够的任务特定数据仍然是一个主要瓶颈。在分布偏移下，小误差会累积，性能下降，专家时间往往花费在冗余、低价值的修正上，而不是少数关键失败案例。我们提出了VR-DAgger，一个以沉浸式VR应用为中心的人机协作框架，用于灵巧遥操作、示范收集和选择性策略校正。VR客户端提供直观的手部控制和同步场景可视化，而后台工作站运行仿真和学习，实现无需操作员持续监督的自主部署。我们使用蒙特卡洛（MC）Dropout在Isaac Lab部署扩散策略时对不确定性进行评分，并选择信息量大的失败片段进行校正。这些片段在VR中作为剪辑重放，操作员选择性地标记和校正策略的行为，将监督集中在不确定性最高的地方，无需全程监控或单独的中断分类器。我们在三个灵巧操作任务（平底锅抓取放置、抽屉打开、阀门旋转）上使用10自由度XHand在标准和具有挑战性的初始配置下进行评估。主动标记在所有任务上持续优于行为克隆，提升高达23个百分点。与无指导的人机协作检查相比，VR-DAgger通过将审查集中在选定的片段而非完整部署上，将每个样本的收集时间减少了约40%。

英文摘要

Learning from demonstrations is effective for robotic manipulation, but collecting sufficient task-specific data remains a major bottleneck. Under distribution shift, small errors compound, performance degrades, and expert time is often spent on redundant, low-value corrections instead of the few critical failure cases. We present VR-DAgger, a human-in-the-loop framework centered on an immersive VR application for dexterous teleoperation, demonstration collection, and selective policy correction. The VR client provides intuitive hand control with synchronized scene visualization, while a backend workstation runs simulation and learning, enabling autonomous rollouts without continuous operator oversight. We use Monte Carlo (MC) dropout to score uncertainty during Isaac Lab rollouts of a diffusion policy and select informative failure segments for correction. These segments are replayed in VR as clips, where the operator selectively labels and corrects the policy's behavior, concentrating supervision where uncertainty is highest without full-rollout monitoring or a separate intervention classifier. We evaluate on three dexterous manipulation tasks (Pan pick-and-place, Drawer opening, Valve turning) with a 10-DoF XHand under standard and challenging initial configurations. Active labeling consistently improves over behavioral cloning across all tasks, with gains of up to 23 percentage points. Compared to unguided human-in-the-loop inspection, VR-DAgger reduces per-sample collection time by approximately 40% by focusing review on selected segments rather than full rollouts.
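
Uncertainty-guided segment selection can be sketched in a few lines. The code below assumes that, at each rollout step, the policy has been run several times with dropout active and that the sampled action vectors are available. The variance-based score, the fixed segment length, and the function names are illustrative assumptions, not VR-DAgger's exact scoring rule.

```python
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def step_uncertainty(samples: Sequence[Sequence[float]]) -> float:
    """Mean per-dimension variance across stochastic forward passes at one step."""
    per_dimension = list(zip(*samples))
    return fmean([pvariance(values) for values in per_dimension])


def most_uncertain_segment(scores: Sequence[float], length: int) -> Tuple[int, int]:
    """Start (inclusive) and end (exclusive) of the window with the highest mean score."""
    if not 0 < length <= len(scores):
        raise ValueError("segment length must be between 1 and len(scores)")
    best_start, best_mean = 0, float("-inf")
    for start in range(len(scores) - length + 1):
        mean_score = fmean(scores[start : start + length])
        if mean_score > best_mean:
            best_start, best_mean = start, mean_score
    return best_start, best_start + length


def score_rollout(rollout_samples: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Uncertainty score for every step of a rollout."""
    return [step_uncertainty(step) for step in rollout_samples]
```

Only the returned segment would then be replayed to the operator, rather than the full rollout. The abstract credits this focus for the reduction in review time.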

2605.26430 2026-06-01 cs.RO (version update)

Multi-Robot Box Transport over Different Surfaces with Decentralized Role-based Proportional Control

Aditya Bhatt, Himavarshini Yarragangu, Urvish Shah, Venkata Sai Yaswanth Mohan Thota, Souma Chowdhury

Affiliation: Mechanical & Aerospace Eng., University at Buffalo, Buffalo, NY

AI summary: Proposes R2P2, an asynchronous decentralized task and motion planning method that uses role assignment and proportional control for collaborative multi-robot box transport over surfaces with different inclination and friction. Simulations show generalizability and a higher success rate than a standard virtual-leader-follower method, and a physical experiment validates the approach.

Comments: Accepted for presentation at the 2026 ASME IDETC-CIE

Abstract

Collaborative transport of objects via pushing by multiple robots has many applications, ranging from construction and warehouse environments to post-disaster debris clean-up. Achieving collaborative transport over surfaces with different inclination and friction properties, however, poses unique challenges. To address these challenges, this paper presents an asynchronous decentralized task and motion planning approach for transporting rectangular boxes of varying mass over flat, uphill, and downhill terrain. Such a decentralized approach alleviates communication, synchronization, and consensus needs and mitigates single-point-of-failure issues. Our approach, called R2P2 (Roles with Rules and Proportional-control Primitive), assigns roles (e.g., push, support, and prevent) to robots based on rules cognizant of the mode of manipulation needed (box rotation vs. translation); this is followed by either rule-based control or proportional control of robot velocity based on the roles. Each robot is assumed to observe the location and heading of itself and the box when executing its role and controls. R2P2 is evaluated with a six-robot team deployed in a simulator built using NVIDIA IsaacSim, demonstrating generalizability across different surface friction/inclination and box mass scenarios and a better success rate than a standard virtual-leader-follower method. R2P2 is also successfully validated in a physical experiment, where it is executed onboard four TurtleBots tasked with moving a 1.2 kg box.
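As a rough illustration of what proportional control of robot velocity can look like for a single robot, the sketch below computes a velocity command from the robot's own pose and a target point (for example, a contact point on the box). The gains, the velocity limits, and the function names are assumptions chosen for the example; the abstract does not specify R2P2's actual control law or role rules.

```python
import math

def wrap_angle(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def proportional_velocity(robot_xy, robot_heading, target_xy,
                          kp_lin=0.8, kp_ang=2.0, v_max=0.2, w_max=1.0):
    """Velocity command proportional to distance and heading error (gains are assumed)."""
    dx = target_xy[0] - robot_xy[0]
    dy = target_xy[1] - robot_xy[1]
    distance = math.hypot(dx, dy)
    heading_error = wrap_angle(math.atan2(dy, dx) - robot_heading)
    # Clamp both commands so the sketch respects assumed actuator limits.
    v = max(-v_max, min(v_max, kp_lin * distance))
    w = max(-w_max, min(w_max, kp_ang * heading_error))
    return v, w

# Target straight ahead and close: small forward speed, no turning.
v_near, w_near = proportional_velocity((0.0, 0.0), 0.0, (0.1, 0.0))
# Target far to the left: both commands saturate at their limits.
v_far, w_far = proportional_velocity((0.0, 0.0), 0.0, (0.0, 1.0))
```

Each robot needs only its own pose and the target point to evaluate this, which matches the decentralized setting the abstract describes.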

2605.26304 2026-06-01 cs.RO (version update)

Collaborative Navigation and Exploration with $β$-Sparse Gaussian Processes

Evangelos Psomiadis, Dipankar Maity, Panagiotis Tsiotras

Affiliations: D. Guggenheim School of Aerospace Engineering, Georgia Tech, Atlanta, GA, USA; Department of Electrical and Computer Engineering, UNC Charlotte, Charlotte, NC, USA

AI summary: For collaborative navigation of heterogeneous robots in unknown environments, proposes a framework that uses $β$-Sparse Gaussian Processes to jointly optimize map-point selection and navigation actions under bandwidth constraints. In simulation it reduces path cost by 18% relative to no communication and transmitted information by 76% relative to raw-data transmission baselines.

Comments: 16 pages, 6 figures

Abstract

Collaborative navigation of heterogeneous robots in unknown environments poses significant challenges due to sensing, communication, and computational limitations. In this work, a lead robot navigates toward a target while a mobile sensor robot (e.g., a drone) assists by transmitting information about its locally observed map under bandwidth constraints. We propose a framework that enables the sensor to jointly select its transmitted map points and navigation actions online, while also predicting unexplored regions of the environment. To this end, we present $β$-Sparse Gaussian Processes, a robust variational sparse Gaussian Process model for task-aware inducing point selection under cardinality constraints. Furthermore, we develop an action-selection strategy that balances task relevance with exploration. Simulations on Mars and Earth maps show that the framework can reduce path cost by 18% relative to no communication and decrease transmitted information by 76% compared to raw-data transmission baselines.

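2605.29879 2026-06-01 cs.CV cs.RO (version update)

DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding

Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li

Affiliation: School of Computer Science, Beijing Institute of Technology, China

AI summary: Proposes DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system that couples a probabilistic voxel grid with explicit 3D Gaussians for robust cross-modal instance fusion and incremental semantic mapping, and builds a hierarchical scene graph and a 3D Gaussian Mind for multimodal reasoning. It reports the best zero-shot 3D visual grounding among methods based on self-reconstructed maps and strong results in open-vocabulary semantic segmentation and scene reconstruction.

Comments: 9 pages, 6 figures

Abstract (translated from the AI-generated Chinese abstract)

Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association because cross-view cues are incomplete, and their limited ability to handle object-level topological changes restricts long-term robot task execution. In addition, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we propose DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system equipped with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians, enabling robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and local mask refinement guided by geometric-semantic consistency. Building on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops a 3D Gaussian Mind (3D高斯思维 in the Chinese text) that integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3D visual grounding performance among methods based on self-reconstructed maps, while also performing strongly in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on a real-world robot to demonstrate its goal-directed reasoning and dynamic updating capabilities. The DGSG-Mind project page (https://icr-lab.github.io/DGSG-Mind) is available online.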

LangForce: Bayesian Decomposition of Vision-Language-Action Models via Latent Action Queries

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

Affiliations: Huazhong University of Science and Technology; Beijing Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Harbin Institute of Technology; The Hong Kong University of Science and Technology (Guangzhou); Zhengzhou University; Beihang University; East China Normal University; DeepCybot Co., Ltd.

AI summary: To address language information being ignored by VLA models because of training-data bias, proposes LangForce, which uses Bayesian decomposition and latent action queries to build a dual-branch architecture and maximizes the pointwise mutual information between actions and instructions. It improves generalization without new data, including an 11.3% gain on the OOD SimplerEnv benchmark.

Comments: ICML 2026

Abstract

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $π(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments across SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
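For readers unfamiliar with the quantity being maximized, the conditional PMI between an action and an instruction given an observation is the log-ratio of the language-conditioned posterior to the vision-only prior, log π(a | v, ℓ) − log p(a | v). The sketch below evaluates it on a toy discrete action set. The action names and probabilities are hypothetical, and this is only the definition of PMI, not the LangForce training loss or architecture.

```python
import math

def conditional_pmi(posterior_prob, prior_prob):
    """log pi(a | v, l) - log p(a | v): positive when the instruction raises the action's probability."""
    return math.log(posterior_prob) - math.log(prior_prob)

# Hypothetical probabilities over three discrete actions for one observation v.
vision_only_prior = {"pick_red": 0.5, "pick_blue": 0.4, "idle": 0.1}
# Hypothetical posterior after conditioning on the instruction "pick the blue block".
language_posterior = {"pick_red": 0.1, "pick_blue": 0.85, "idle": 0.05}

pmi = {action: conditional_pmi(language_posterior[action], vision_only_prior[action])
       for action in vision_only_prior}
best_action = max(pmi, key=pmi.get)
```

An action that the vision-only branch already predicts equally well has a PMI near zero, so in this sketch only actions that the instruction makes more likely receive a positive score.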

2605.01581 2026-06-01 cs.RO (version update)

Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

Jinhao Zhang, Zhexuan Zhou, Huizhe Li, Yichen Lai, Wenlong Xia, Haoming Song, Youmin Gong, Jie Mei

Affiliations: Harbin Institute of Technology, Shenzhen; Shanghai Jiao Tong University

AI summary: To reduce the high computational cost of diffusion policies in robot manipulation, analyzes the smoothness of action trajectories in the frequency domain and proposes Hyper-DP3, a lightweight 3D diffusion policy with a Diffusion Mixer decoder and two-step DDIM inference. It reaches state-of-the-art performance with fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially lower inference latency.

Abstract

Diffusion-based visuomotor policies perform well in robotic manipulation, yet current methods still inherit image-generation-style decoders and multi-step sampling. We revisit this design from a frequency-domain perspective. Robot action trajectories are highly smooth, with most energy concentrated in a few low-frequency discrete cosine transform modes. Under this structure, we show that the error of the optimal denoiser is bounded by the low-frequency subspace dimension and residual high-frequency energy, implying that denoising error saturates after very few reverse steps. This also suggests that action denoising requires a much simpler denoising model than image generation. Motivated by this insight, we propose Hyper-DP3 (HDP3), a pocket-scale 3D diffusion policy with a lightweight Diffusion Mixer decoder that supports two-step DDIM inference. Our synthetic experiments validate the theory and support the sufficiency of two-step denoising. Furthermore, across RoboTwin2.0, Adroit, MetaWorld, and real-world tasks, HDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially lower inference latency.
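The claim that smooth trajectories concentrate their energy in a few low-frequency DCT modes can be checked on a toy signal. The sketch below applies an orthonormal DCT-II (written out by hand so it needs no external library) to a minimum-jerk profile and measures the fraction of energy in the first four modes. The trajectory, the horizon of 32 steps, and the choice of four modes are assumptions for illustration and are not taken from the paper's experiments.

```python
import math

def dct2_orthonormal(x):
    """Orthonormal DCT-II, so the sum of squared coefficients equals the signal energy."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                                  for i in range(n)))
    return coeffs

def low_frequency_energy_fraction(x, num_modes):
    """Fraction of total energy captured by the first num_modes DCT coefficients."""
    coeffs = dct2_orthonormal(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:num_modes]) / total

# Hypothetical smooth 1-D action trajectory: a minimum-jerk profile from 0 to 1.
horizon = 32
trajectory = [10 * t**3 - 15 * t**4 + 6 * t**5
              for t in (i / (horizon - 1) for i in range(horizon))]
fraction = low_frequency_energy_fraction(trajectory, num_modes=4)
```

For this profile nearly all of the energy falls in the first few modes, which is the kind of structure the frequency-domain argument relies on; noisier or jerkier trajectories would spread more energy into higher modes.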

2604.27994 2026-06-01 cs.RO (version update)

Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA

Feeza Khan Khanzada, Jaerock Kwon

Affiliation: Department of Electrical and Computer Engineering, University of Michigan–Dearborn

AI summary: Proposes a training method that combines future semantic prediction with town-adversarial regularization, improving zero-shot transfer of a CARLA driving agent trained only in Town05 and Town06 to the held-out Town03 and Town04.

Abstract

Driving agents trained in one simulated town often perform poorly in a new town because the road shapes, intersections, and lane layouts can be different. This paper studies how to improve this kind of transfer in the CARLA driving simulator without giving the agent any training data from the test towns. The agent is trained only in Town05 and Town06, then evaluated directly in Town03 and Town04. To focus on road-layout differences, all experiments use the same weather and traffic settings. We propose a training method that encourages the agent to learn features that are useful across towns rather than features tied to one training town. During training, the agent is asked to predict the high-level visual meaning of future camera views and is also discouraged from relying on cues that reveal which source town the data came from. These extra learning signals are used only during training; at test time, the driving policy uses the same observation and control interface as the baseline agent. In controlled comparisons with matched DreamerV3-style world-model driving agents, the proposed method achieves the highest mean held-out success: 36.6% on Town03 with a 95% confidence interval of [30.5, 42.7] and 85.6% on Town04 with a 95% confidence interval of [84.0, 87.2], computed across five training seeds. Seed-paired tests against the strongest primary baselines show positive success-rate differences in both held-out towns. Additional experiments show that predicting future visual meaning alone or removing town-specific cues alone is not enough to match the combined method. These results suggest that combining future-scene understanding with reduced reliance on source-town-specific features can improve cross-town driving performance in this CARLA setting.

2604.20395 2026-06-01 cs.CV cs.RO (version update)

SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

Chris Choy, Junha Lee, Chunghyun Park, Minsu Cho, Jan Kautz

Affiliation: NVIDIA

AI summary: Proposes SpaCeFormer, a proposal-free space-curve transformer that segments a scene in 0.12-0.30 seconds, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines, and builds SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset. Zero-shot mAP on ScanNet200 reaches 11.1, a 2.8x improvement over the prior best proposal-free method.

Comments: Project page: https://nvlabs.github.io/SpaCeFormer/

Abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs in 0.12--0.30 seconds per scene across standard benchmarks, 2--3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21$\times$ higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU$>$0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8$\times$ improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
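Morton-curve serialization, mentioned above, orders 3D points by interleaving the bits of their integer voxel coordinates so that points close in space tend to be close in the resulting sequence. The sketch below shows one common convention (x in the lowest bit of each triple). The bit order, the 10-bit coordinate range, and the sample voxels are assumptions; the abstract does not say which convention SpaCeFormer uses.

```python
def morton_code_3d(x, y, z, bits=10):
    """Interleave the bits of non-negative integer voxel coordinates (x lowest, then y, then z)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def serialize_points(voxels):
    """Order voxels along the Morton (Z-order) curve."""
    return sorted(voxels, key=lambda v: morton_code_3d(*v))

# Hypothetical voxel coordinates, deliberately given out of order.
voxels = [(3, 3, 3), (0, 0, 1), (1, 0, 0), (0, 1, 0), (2, 0, 0)]
ordered = serialize_points(voxels)
```

Once points are serialized this way, a window over consecutive sequence positions covers a roughly compact spatial region, which is what makes the ordering useful for windowed attention.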

2604.10432 2026-06-01 cs.RO (version update)

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

Zhaofeng Hu, Sifan Zhou, Qinbo Zhang, Rongtao Xu, Qi Su, Jorge Mendez-Mendz, Ci-Jyun Liang

Affiliations: Stony Brook University; Carnegie Mellon University; MBZUAI; Peking University

AI summary: Proposes AnySlot, a framework that converts language instructions into a spatial visual goal, decoupling high-level slot selection from low-level execution to achieve precise zero-shot slot-level placement.

Abstract

Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language remains challenging for end-to-end VLA policies. Slot-level placement requires reliable slot grounding and centimeter-level geometric precision. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal between language grounding and control. AnySlot converts language into a visual goal by rendering a spatial marker at the intended slot, then executes this goal with a goal-conditioned VLA policy. This hierarchical design decouples high-level slot selection from low-level execution, improving semantic accuracy and spatial robustness. Furthermore, recognizing the lack of benchmarks for such precision-demanding tasks, we introduce SlotBench, a structured simulation benchmark with nine task categories for evaluating spatial reasoning in slot-level placement. Extensive experiments show that AnySlot significantly outperforms flat VLA baselines and modular grounding methods in zero-shot slot-level placement.

2604.01985 2026-06-01 cs.LG cs.AI cs.RO (version update)

World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, Yilun Du

Affiliations: Stanford University; UC San Diego; Carnegie Mellon University; Google DeepMind; Harvard University

AI summary: Proposes World Action Verifier (WAV), which verifies state plausibility and action reachability independently and exploits forward-inverse asymmetry. A diverse subgoal generator obtained from video corpora and a sparse inverse model enforce cycle consistency, letting the world model self-improve in under-explored regimes. Across nine tasks it reaches 2x higher sample efficiency and improves downstream policy performance by over 22%.

Comments: Project Website: https://world-action-verifier.github.io

Abstract

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning which primarily focuses on optimal actions, a world model needs to be reliable over a vast space of suboptimal actions, which are often underrepresented in action-labeled robot interactions. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two independently verifiable factors: state plausibility and action reachability. We show that verifying these factors is significantly more tractable than direct forward prediction due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among proposed subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods often fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by over 22%.

2603.28579 2026-06-01 cs.RO (version update)

EBuddy: a workflow orchestrator for industrial human-machine collaboration

Michele Banfi, Rocco Felici, Stefano Baraldo, Oliver Avram, Anna Valente

Affiliation: Laboratory of Automation, Robotics and Machines (ARM)

AI summary: Proposes EBuddy, a voice-guided workflow orchestrator that formalizes expert practice as a finite state machine (FSM) driven application for natural human-machine collaboration in industrial settings, substantially shortening end-to-end process duration while preserving repeatability.

Abstract

This paper presents EBuddy, a voice-guided workflow orchestrator for natural human-machine collaboration in industrial environments. EBuddy targets a recurrent bottleneck in tool-intensive workflows: expert know-how is effective but difficult to scale, and execution quality degrades when procedures are reconstructed ad hoc across operators and sessions. EBuddy operationalizes expert practice as a finite state machine (FSM) driven application that provides an interpretable decision frame at runtime (current state and admissible actions), so that spoken requests are interpreted within state-grounded constraints, while the system executes and monitors the corresponding tool interactions. Through modular workflow artifacts, EBuddy coordinates heterogeneous resources, including GUI-driven software and a collaborative robot, leveraging fully voice-based interaction through automatic speech recognition and intent understanding. An industrial pilot on impeller blade inspection and repair preparation for directed energy deposition (DED), realized by human-robot collaboration, shows substantial reductions in end-to-end process duration across onboarding, 3D scanning and processing, and repair program generation, while preserving repeatability and low operator burden.
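The idea of interpreting requests inside a state-grounded decision frame can be sketched with a very small finite state machine. In the example below, a recognized request is applied only if the current state admits it, and the set of admissible actions is available at every step. The state names, action phrases, and transitions are hypothetical placeholders, not EBuddy's actual workflow artifacts, and speech recognition is assumed to have already produced the action string.

```python
# Hypothetical workflow: state names, actions, and transitions are illustrative only.
WORKFLOW = {
    "idle":       {"start scan": "scanning"},
    "scanning":   {"process scan": "processing", "abort": "idle"},
    "processing": {"generate repair program": "done", "abort": "idle"},
    "done":       {},
}

class WorkflowFSM:
    def __init__(self, transitions, initial_state):
        self.transitions = transitions
        self.state = initial_state

    def admissible_actions(self):
        """Actions allowed in the current state (the decision frame)."""
        return sorted(self.transitions[self.state])

    def request(self, action):
        """Apply a recognized request only if the current state admits it."""
        if action not in self.transitions[self.state]:
            return False
        self.state = self.transitions[self.state][action]
        return True

fsm = WorkflowFSM(WORKFLOW, "idle")
rejected = fsm.request("generate repair program")  # not admissible while idle
accepted = fsm.request("start scan")               # admissible, moves to "scanning"
```

Rejecting requests that the current state does not admit is what keeps execution consistent across operators and sessions in a design like this.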

2603.26612 2026-06-01 cs.RO (version update)

Meta-Adaptive Beam Search Planning for Transformer-Based Reinforcement Learning Control of UAVs with Overhead Manipulators under Flight Disturbances

Hazim Alzorgan, Sayed Pedram Haeri Boroujeni, Abolfazl Razi

AI summary: To reduce end-effector tracking error caused by the coupling between a UAV and its overhead manipulator, proposes a reinforcement learning framework built on a Transformer-based double deep Q-network (DDQN). An adaptive beam-search planner uses the learned critic as a forward estimator to run a short-horizon beam search in a software-in-the-loop setup, lowering tracking error and raising reward.

Comments: The paper will be reworked significantly

Abstract

Drones equipped with overhead manipulators offer unique capabilities for inspection, maintenance, and contact-based interaction. However, the motion of the drone and its manipulator is tightly linked, and even small attitude changes caused by wind or control imperfections shift the end-effector away from its intended path. This coupling makes reliable tracking difficult and also limits the direct use of learning-based arm controllers that were originally designed for fixed-base robots. These effects appear consistently in our tests whenever the UAV body experiences drift or rapid attitude corrections. To address this behavior, we develop a reinforcement-learning (RL) framework built on a Transformer-based double deep Q-network (DDQN). Its core idea is an adaptive beam-search planner that performs a short-horizon beam search over candidate control sequences, using the learned critic as the forward estimator. This allows the controller to anticipate the end-effector's motion through simulated rollouts rather than executing those actions directly on the actual model, realizing a software-in-the-loop (SITL) approach. The lookahead relies on value estimates from a Transformer critic that processes short sequences of states, while a DDQN backbone provides the one-step targets needed to keep the learning process stable. Evaluated on a 3-DoF aerial manipulator under identical training conditions, the proposed meta-adaptive planner shows the strongest overall performance, with a 10.2% reward increase, a substantial reduction in mean tracking error (from about 6% to 3%), and a 29.6% improvement in the combined reward-error metric relative to the DDQN baseline. Compared with the fixed-beam and Transformer-only variants, our method tracks the target tip trajectory more stably (maintaining a 5 cm tracking error) when the drone base drifts due to external disturbances.
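A short-horizon beam search that ranks candidate control sequences with a critic can be sketched in a few lines. In the example below, a toy 1-D tip position, additive actions, and a critic equal to the negative tracking error stand in for the learned models; all of these, along with the fixed horizon and beam width, are assumptions. The adaptive part of the paper's planner is not reproduced here.

```python
def beam_search_plan(state, actions, step_fn, critic_fn, horizon=3, beam_width=2):
    """Short-horizon beam search: keep the beam_width best partial sequences by critic value."""
    beam = [(critic_fn(state), state, [])]
    for _ in range(horizon):
        candidates = []
        for _, current, sequence in beam:
            for action in actions:
                # Simulated rollout only; nothing is applied to the real system.
                nxt = step_fn(current, action)
                candidates.append((critic_fn(nxt), nxt, sequence + [action]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][2]

# Toy stand-ins (assumptions): 1-D tip position and a target it should reach.
TARGET = 0.3

def step(position, action):
    return position + action

def critic(position):
    return -abs(position - TARGET)

plan = beam_search_plan(0.0, actions=[-0.1, 0.0, 0.1], step_fn=step, critic_fn=critic)
```

Only the first action of the returned plan would normally be executed before replanning, which is the usual receding-horizon pattern for this kind of lookahead.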

2509.22550 2026-06-01 cs.RO (version update)

An Intention-driven Lane Change Framework Considering Heterogeneous Dynamic Cooperation in Mixed-traffic Environment

Xiaoyun Qiu, Haichao Liu, Yue Pan, Jun Ma, Xinhu Zheng

Affiliations: Intelligent Transportation Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou); Guangdong Provincial Key Lab of Integrated Communication, Sensing and Computation for Ubiquitous Internet of Things; Robotics and Autonomous Systems Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes an intention-driven lane change framework that combines driving-style recognition with cooperation-aware decision-making and motion planning, using deep learning and inverse reinforcement learning for safe and efficient lane changes in mixed traffic.

DOI: 10.1109/TITS.2026.3690192
Journal ref: IEEE Transactions on Intelligent Transportation Systems, May, 2026

Abstract

In mixed-traffic environments, autonomous vehicles (AVs) must interact with heterogeneous human-driven vehicles (HVs) whose intentions and driving styles vary across individuals and scenarios. Such variability introduces uncertainty into lane change interactions, where safety and efficiency critically depend on accurately anticipating surrounding drivers' cooperative responses. Existing methods often oversimplify these interactions by assuming uniform or fixed behavioral patterns. To address this limitation, we propose an intention-driven lane change framework that integrates driving-style recognition with cooperation-aware decision-making and motion-planning. A deep learning-based classifier identifies distinct human driving styles in real time. We then introduce a dual-perspective cooperation score composed of intrinsic style-dependent tendencies and interactive dynamic components, enabling interpretable and adaptive intention prediction and quantitative inference. A decision-making module combines behavior cloning (BC) and inverse reinforcement learning (IRL) to determine lane change feasibility. Later, a coordinated motion-planning architecture integrating IRL-based intention inference with model predictive control (MPC) is established to generate collision-free and socially compliant trajectories. Experiments on the NGSIM dataset show that the proposed decision-making model outperforms representative rule-based and learning-based baselines, achieving 96.98% accuracy in lane change classification. Motion-planning evaluations further demonstrate improved maneuver success and execution stability in mixed-traffic environments. These results validate the effectiveness of structured cooperation modeling for intention-driven autonomous lane changes.

2603.11586 2026-06-01 cs.RO (version update)

Unsupervised LiDAR-Based Multi-UAV Detection and Tracking Under Extreme Sparsity

Nivand Khosravi, Rodrigo Ventura, Meysam Basiri

Affiliation: Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal

AI summary: For the extremely sparse point clouds produced by non-repetitive solid-state LiDAR scanning, proposes an unsupervised detection and tracking pipeline. Range-adaptive DBSCAN clustering with a temporal consistency check gives high-precision detection, and deterministic assignment is compared with probabilistic data association for tracking.

Comments: Presented at the International Conference on Mechatronics and Robotics Engineering (ICMRE2026). To appear in IEEE conference proceedings

DOI: 10.1109/ICMRE69538.2026.11533899
Journal ref: Proc. 2026 12th International Conference on Mechatronics and Robotics Engineering (ICMRE), Oldenburg, Germany, 2026

Abstract

Non-repetitive solid-state LiDAR scanning leads to an extremely sparse measurement regime for detecting airborne UAVs: a small quadrotor at 10-25 m typically produces only 1-2 returns per scan, which is far below the point densities assumed by most existing detection approaches and inadequate for robust multi-target data association. We introduce an unsupervised, LiDAR-only pipeline that addresses both detection and tracking without the need for labeled training data. The detector integrates range-adaptive DBSCAN clustering with a three-stage temporal consistency check and is benchmarked on real-world air-to-air flight data under eight different parameter configurations. The best setup attains 0.891 precision, 0.804 recall, and 0.63 m RMSE, and a systematic minPts sweep verifies that most scans contain at most 1-2 target points, directly quantifying the sparsity regime. For multi-target tracking, we compare deterministic Hungarian assignment with joint probabilistic data association (JPDA), each coupled with Interacting Multiple Model filtering, in four simulated scenarios with increasing levels of ambiguity. JPDA cuts identity switches by 64% with negligible impact on MOTA, demonstrating that probabilistic association is advantageous when UAV trajectories approach one another closely. A two-environment evaluation strategy, combining real-world detection with RTK-GPS ground truth and simulation-based tracking with identity-annotated ground truth, overcomes the limitations of GNSS-only evaluation at inter-UAV distances below 2 m.

LangMap：一个用于分层开放词汇目标导航的人工验证基准

Bo Miao, Weijia Liu, Jun Luo, Lachlan Shinnick, Jian Liu, Thomas Hamilton-Smith, Yuhe Yang, Zijie Wu, Vanja Videnovic, Feras Dayoub, Anton van den Hengel

发表机构 * AIML, Adelaide University（AIML，阿德莱德大学）； East China Normal University（华东师范大学）； NERC-RVC, Hunan University（NERC-RVC，湖南大学）； The University of Western Australia（西澳大学）； Breaker Industries

AI总结针对现有基准在分层语义目标导航中的不足，提出LangMap基准，通过人工验证的语义标注和对比注释协议，支持场景、房间、区域和实例四个层级的目标导航任务，并引入PlaNaVid基线方法。

详情

AI中文摘要

语言条件目标导航（LGN）要求智能体在没有逐步指导的情况下定位用户指定的目标。然而，现有基准主要关注类别级目标或依赖视觉语言模型（VLM）生成的实例描述，这些描述通常包含歧义和语义错误，限制了系统性和可靠的评估。我们提出了HieraNav，一个开放词汇的LGN任务，目标在四个分层语义层级上指定：场景、房间、区域和实例。为此，我们提出了Language as a Map（LangMap），据我们所知，这是第一个具有人工验证语义标注的真实世界3D室内导航基准，支持所有四个目标层级的任务。LangMap提供了区域标签以及覆盖414个对象类别的区分性区域和实例描述，通过比较同一场景区域和实例的严格对比注释协议生成，包含超过18K个任务。每个目标都配有简洁和详细的描述，支持跨指令风格的评估。定量和定性分析验证了我们的注释质量；值得注意的是，我们的实例描述在文本到视图匹配上比GOAT-Bench注释高出23个百分点。我们进一步引入了PlaNaVid，一个强大的仅RGB基线，它将有界多样记忆（BDM）与高级规划相结合，以激发用于多目标导航的反应策略。PlaNaVid在没有深度、3D场景表示或对象掩码的情况下实现了顶级成功率。进一步分析表明，记忆和更丰富的上下文提升了性能，而长尾类别、小物体、远距离目标和多目标完成仍然是开放的挑战。该基准可在https://bo-miao.github.io/LangMap获取。

English abstract

Language-conditioned goal navigation (LGN) requires agents to locate user-specified targets without step-by-step guidance. However, existing benchmarks largely focus on category-level goals or rely on instance descriptions generated by vision-language models (VLMs), which often contain ambiguities and semantic errors, limiting systematic and reliable evaluation. We introduce HieraNav, an open-vocabulary LGN task with goals specified at four hierarchical semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), to our knowledge the first real-world 3D indoor navigation benchmark with human-verified semantic annotations to support tasks across all four goal levels. LangMap provides region labels and discriminative region and instance descriptions covering 414 object categories, produced through a rigorous contrastive annotation protocol comparing same-scene regions and instances, and contains over 18K tasks. Each target is paired with concise and detailed descriptions, enabling evaluation across instruction styles. Quantitative and qualitative analyses validate our annotation quality; notably, our instance descriptions outperform GOAT-Bench annotations by 23 percentage points in text-to-view matching. We further introduce PlaNaVid, a strong RGB-only baseline that combines Bounded Diverse Memory (BDM) with high-level planning to prime a reactive policy for multi-goal navigation. PlaNaVid achieves top-tier success rates without depth, 3D scene representations, or object masks. Further analysis shows that memory and richer context boost performance, while long-tailed categories, small objects, distant targets, and multi-goal completion remain open challenges. The benchmark is available at https://bo-miao.github.io/LangMap

2601.18537 2026-06-01 cs.RO cs.AI updated version

SKETCH: Semantic Key-Point Conditioning for Long-Horizon Vessel Trajectory Prediction

SKETCH: 面向长时域船舶轨迹预测的语义关键点条件建模

Linyong Gan, Zimo Li, Wenxin Xu, Xingjian Li, Jianhua Z. Huang, Enmei Tu, Shuhang Chen

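Affiliations: School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China; School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China; COSCO SHIPPING Advanced Technology Institute, Shanghai, China

AI summary: To address directional drift in long-horizon trajectory prediction, the paper proposes a trajectory modeling framework conditioned on a semantic key point (the Next Key Point, NKP). It decomposes prediction into global semantic decision-making and local motion modeling, estimates the NKP prior with a pretrain-finetune strategy, and significantly improves long-horizon performance, directional accuracy, and fine-grained prediction on real-world AIS data.

AI Chinese abstract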

由于复杂导航行为和环境因素导致的复合不确定性，准确的长时域船舶轨迹预测仍然具有挑战性。现有方法在长时间外推时往往难以保持全局方向一致性，导致轨迹漂移或不合理。为解决这一问题，我们提出了一种语义关键点条件轨迹建模框架，通过以捕获导航意图的高级下一关键点（NKP）为条件来预测未来轨迹。该公式将长时域预测分解为全局语义决策和局部运动建模，有效将未来轨迹的支持集限制在语义可行的子集内。为了从历史观测中高效估计NKP先验，我们采用了预训练-微调策略。在真实AIS数据上的大量实验表明，所提方法在长旅行时长、方向精度和细粒度轨迹预测方面持续优于现有最先进方法。

English abstract

Accurate long-horizon vessel trajectory prediction remains challenging due to compounded uncertainty from complex navigation behaviors and environmental factors. Existing methods often struggle to maintain global directional consistency, leading to drifting or implausible trajectories when extrapolated over long time horizons. To address this issue, we propose a semantic-key-point-conditioned trajectory modeling framework, in which future trajectories are predicted by conditioning on a high-level Next Key Point (NKP) that captures navigational intent. This formulation decomposes long-horizon prediction into global semantic decision-making and local motion modeling, effectively restricting the support of future trajectories to semantically feasible subsets. To efficiently estimate the NKP prior from historical observations, we adopt a pretrain-finetune strategy. Extensive experiments on real-world AIS data demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly for long travel durations, directional accuracy, and fine-grained trajectory prediction.

2411.04073 2026-06-01 cs.RO cs.CC cs.MA updated version

A Two-Stage Reactive Auction Framework for the Multi-Depot Rural Postman Problem with Dynamic Vehicle Failures

面向动态车辆故障的多仓库农村邮差问题的两阶段反应式拍卖框架

Eashwar Sathyamurthy, Jeffrey W. Herrmann, Shapour Azarm

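Affiliations: Department of Mechanical Engineering, University of Maryland; Department of Mechanical Engineering, The Catholic University of America

AI summary: To handle mission interruptions caused by vehicle failures in the multi-depot rural postman problem, the paper proposes a two-stage real-time rescheduling framework that combines a centralized auction with a peer auction, cutting rescheduling time from hours to seconds while preserving solution quality.

DOI: 10.1109/ACCESS.2026.3695779

AI Chinese abstract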

尽管无人车车队在运输、物流和巡检中提供了效率，但它们对故障的敏感性对任务连续性构成了重大挑战。我们研究了带有可充电和可重复使用车辆的多仓库农村邮差问题（MD-RPP-RRV），其中放置在多个仓库、具有容量约束的无人充电车辆在为基于弧的需求服务时可能发生故障。为了解决运行中意外的车辆故障，我们提出了一种两阶段实时重调度框架。首先，集中式拍卖快速生成可行的重调度方案；对于此阶段，我们推导了一个理论加性界，为最坏情况下的重调度惩罚提供了分析保证。其次，对等拍卖通过一个针对问题的磁场路由器对局部调度进行修复，以细化基线方案，该路由器利用通过敏感性分析校准的参数，确保计算增长可控。我们将该方法与模拟退火元启发式算法进行基准比较，以评估解质量和执行速度。在257个不同故障场景上的实验结果表明，与元启发式基线相比，该框架实现了平均运行时间减少超过95%，将重调度时间从小时级缩短到秒级，同时保持高质量的解。两阶段框架在大规模实例上表现出色，在近80%的场景中优于集中式拍卖，平均解改进超过12%。此外，它在59%和28%的场景中分别优于模拟退火的平均结果和最佳结果，为实时任务连续性提供了所需的鲁棒速度-质量权衡。

English abstract

Although unmanned vehicle fleets offer efficiency in transportation, logistics and inspection, their susceptibility to failures poses a significant challenge to mission continuity. We study the Multi-Depot Rural Postman Problem with Rechargeable and Reusable Vehicles (MD-RPP-RRV) with vehicle failures, where unmanned rechargeable vehicles placed at multiple depots with capacity constraints may fail while serving arc-based demands. To address unexpected vehicle breakdowns during operation, we propose a two-stage real-time rescheduling framework. First, a centralized auction quickly generates a feasible rescheduling solution; for this stage, we derive a theoretical additive bound that establishes an analytical guarantee on the worst-case rescheduling penalty. Second, a peer auction refines this baseline through a problem-specific magnetic field router for local schedule repair, utilizing parameters calibrated via sensitivity analysis to ensure controlled computational growth. We benchmark this approach against a simulated annealing metaheuristic to evaluate solution quality and execution speed. Experimental results on 257 diverse failure scenarios demonstrate that the framework achieves an average runtime reduction of over 95% relative to the metaheuristic baseline, cutting rescheduling times from hours to seconds while maintaining high solution quality. The two-stage framework excels on large-scale instances, surpassing the centralized auction in nearly 80% of scenarios with an average solution improvement exceeding 12%. Moreover, it outperforms the simulated annealing mean and best results in 59% and 28% of scenarios, respectively, offering the robust speed-quality trade-off required for real-time mission continuity.

2512.11571 2026-06-01 cs.RO updated version

Cross-Entropy Optimization of Physically Grounded Task and Motion Plans

物理基础的任务与运动规划的交叉熵优化

Andreu Matoses Gimenez, Nils Wilde, Chris Pek, Javier Alonso-Mora

Affiliations: Department of Cognitive Robotics, Delft University of Technology; Faculty of Computer Science, Dalhousie University

AI summary: The paper uses a GPU-parallelized physics simulator and cross-entropy optimization to sample controller parameters and obtain low-cost solutions, addressing the tendency of traditional TAMP algorithms to ignore dynamics and contacts.

Comments: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

DOI: 10.1109/LRA.2026.3685463
Journal ref: IEEE Robotics and Automation Letters, 2026

AI Chinese abstract

自主执行任务通常需要机器人规划高级离散动作和连续低级运动来实现它们。先前的TAMP算法主要关注计算性能、完备性或最优性，通过简化和抽象使问题易于处理。然而，这可能导致生成的计划在需要操作物体时，未能考虑可靠执行任务所必需的动力学或复杂接触。此外，忽略低级控制器影响的方法可能无法为真实系统获得最优或可行的计划实现。我们研究使用GPU并行物理模拟器来计算带有运动控制器的计划实现，明确考虑动力学，并考虑与环境的接触。通过交叉熵优化，我们对控制器或动作的参数进行采样，以获得低成本解决方案。由于我们的方法使用与真实系统相同的控制器，机器人可以直接执行计算出的计划。我们在一组任务中展示了我们的方法，其中机器人能够利用环境的几何形状来移动物体。网站和代码：https://andreumatoses.github.io/research/parallel-realization

English abstract

Autonomously performing tasks often requires robots to plan high-level discrete actions and continuous low-level motions to realize them. Previous TAMP algorithms have focused mainly on computational performance, completeness, or optimality by making the problem tractable through simplifications and abstractions. However, this comes at the cost of the resulting plans potentially failing to account for the dynamics or complex contacts necessary to reliably perform the task when object manipulation is required. Additionally, approaches that ignore effects of the low-level controllers may not obtain optimal or feasible plan realizations for the real system. We investigate the use of a GPU-parallelized physics simulator to compute realizations of plans with motion controllers, explicitly accounting for dynamics, and considering contacts with the environment. Using cross-entropy optimization, we sample the parameters of the controllers, or actions, to obtain low-cost solutions. Since our approach uses the same controllers as the real system, the robot can directly execute the computed plans. We demonstrate our approach for a set of tasks where the robot is able to exploit the environment's geometry to move an object. Website and code: https://andreumatoses.github.io/research/parallel-realization
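The cross-entropy optimization step mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not the authors' implementation: `toy_cost` is a hypothetical stand-in for the cost of a simulated plan realization, and the diagonal-Gaussian sampling distribution, population size, and elite fraction are all assumptions chosen for illustration.

```python
import random
import statistics


def cross_entropy_minimize(cost, dim, iterations=30, population=64,
                           elite_frac=0.2, init_std=2.0, seed=0):
    """Minimize `cost` over R^dim with a diagonal-Gaussian cross-entropy method."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidate parameter vectors from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Keep the lowest-cost candidates (the "elite" set).
        elites = sorted(samples, key=cost)[:n_elite]
        # Refit the Gaussian to the elites; a small floor avoids zero variance.
        for d in range(dim):
            column = [e[d] for e in elites]
            mean[d] = statistics.fmean(column)
            std[d] = max(statistics.pstdev(column), 1e-6)
    return mean


def toy_cost(params):
    # Hypothetical rollout cost with its minimum at (2, -1).
    return (params[0] - 2.0) ** 2 + (params[1] + 1.0) ** 2


best = cross_entropy_minimize(toy_cost, dim=2)
```

In the setting the abstract describes, each call to the cost function would correspond to one simulated rollout, so the whole population could in principle be evaluated in parallel on the GPU rather than sequentially as here.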

2511.19433 2026-06-01 cs.RO cs.AI cs.CV updated version

Mixture of Horizons in Action Chunking

动作分块中的视野混合

Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, Mingyu Ding

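Affiliations: Renmin University of China; University of North Carolina at Chapel Hill; The Chinese University of Hong Kong

AI summary: To address the trade-off in choosing the action chunk length (horizon) in vision-language-action models, the paper proposes a mixture-of-horizons strategy that processes action segments with different horizons in parallel and fuses their outputs, improving both long-term foresight and short-term precision and yielding better performance and generalization.

Comments: Accepted at ICML 2026

AI Chinese abstract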

视觉-语言-动作（VLA）模型在机器人操作中展现出显著能力，但其性能对训练中使用的动作分块长度（称为视野）敏感。我们的实证研究揭示了一个内在权衡：较长的视野提供更强的全局预见但降低细粒度精度，而较短的视野增强局部控制但在长期任务上表现不佳，这意味着固定选择单一视野是次优的。为缓解这一权衡，我们提出混合视野（MoH）策略。MoH将动作分块重新排列为多个不同视野的片段，通过共享动作变换器并行处理，并使用轻量线性门控融合输出。它具有三个吸引人的优点：1) MoH在单个模型中联合利用长期预见和短期精度，提高了复杂任务的性能和泛化能力。2) MoH对全注意力动作模块即插即用，训练或推理开销极小。3) MoH支持自适应视野的动态推理，通过跨视野共识选择稳定动作，实现比基线高2.5倍的吞吐量，同时保持优越性能。在基于流的策略$π_0$、$π_{0.5}$和单步回归策略$π_{\text{reg}}$上的大量实验表明，MoH在仿真和真实世界任务上均取得一致且显著的提升。值得注意的是，在混合任务设置下，带有MoH的$π_{0.5}$在LIBERO上仅经过30k次训练迭代即达到99%的平均成功率，创下新纪录。项目页面：https://timsty1.github.io/moh/

English abstract

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the action chunk length used during training, termed the horizon. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that a fixed choice of a single horizon is suboptimal. To mitigate the trade-off, we propose a mixture of horizons (MoH) strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5× higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under the mixed-task setting, $π_{0.5}$ with MoH reaches a new state of the art with a 99% average success rate on LIBERO after only 30k training iterations. Project page: https://timsty1.github.io/moh/

2510.17111 2026-06-01 cs.RO cs.LG updated version

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

面向具身操作的高效视觉-语言-动作模型：系统综述

Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng

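Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; AiRiA; Nanjing University of Information Science and Technology

AI summary: This survey systematically reviews methods for reducing the latency, memory footprint, and computational cost of vision-language-action models along four dimensions: model architecture, perception features, action generation, and training/inference strategies.

AI Chinese abstract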

视觉-语言-动作（VLA）模型通过将自然语言指令和视觉观察映射到机器人动作，将视觉-语言模型扩展到具身控制。尽管功能强大，但VLA系统因其巨大的计算和内存需求而面临重大挑战，这与需要实时性能的边缘平台（如机载移动操作器）的约束相冲突。解决这一矛盾已成为近期研究的核心焦点。鉴于对更高效、可扩展的VLA系统的日益关注，本综述系统回顾了提高VLA效率的方法，重点在于减少延迟、内存占用以及训练和推理成本。我们将现有解决方案分为四个维度：模型架构、感知特征、动作生成和训练/推理策略，并总结了每个类别中的代表性技术。最后，我们讨论了未来趋势和开放挑战，指出了推进高效具身智能的方向。

English abstract

Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.

2509.19452 2026-06-01 cs.RO cs.CV cs.LG updated version

HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

HUNT：通过瞬时相对帧在非结构化环境中进行高速无人机导航与跟踪

Alessandro Saviolo, Jeffrey Mao, Giuseppe Loianno

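Affiliations: New York University; University of California Berkeley

AI summary: The paper proposes the HUNT framework, which uses instantaneous relative frames to unify search and tracking, enabling high-speed flight and robust autonomy.

AI Chinese abstract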

搜索与救援任务要求无人机既能高速穿越未知的非结构化环境，又能在检测到目标后跟踪目标。在感知退化且无全局定位的情况下实现这两种能力仍是一个开放挑战。最近的相对导航工作通过将规划和控制锚定到可见的检测目标上展示了鲁棒跟踪，但在视野中没有目标时无法进行导航。我们提出了HUNT（高速无人机导航与跟踪），一个实时框架，在单一相对公式中统一了穿越、获取和跟踪。HUNT直接从机载瞬时观测量（如姿态、高度和速度）定义导航目标，从而在搜索过程中实现反应式高速飞行。一旦检测到目标，相同的感知-控制管道无缝过渡到跟踪。在茂密森林、集装箱场地以及使用车辆和人体模型的搜索与救援任务中的户外实验表明，在全局方法失败的情况下，该框架实现了鲁棒自主性。

English abstract

Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

2503.21168 2026-06-01 cs.RO cs.SY eess.SY updated version

TAGA: A Tangent-Based Reactive Approach for Socially Compliant Robot Navigation Around Human Groups

TAGA：一种基于切线的反应式方法，用于在人群体周围实现社交合规的机器人导航

Utsha Kumar Roy, Sejuti Rahman

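Affiliations: Department of Computer Science and Engineering, BRAC University; New Uzbekistan University

AI summary: The paper proposes TAGA, which follows tangent paths around detected group boundaries and coordinates group-level and individual avoidance, introduces the Group Crossing Rate (GCR) metric, and reports a reactive-versus-learned asymmetry across several crowd dynamics models.

Comments: 8 pages, 3 figures, 3 tables. Submitted to IEEE Robotics and Automation Letters (RA-L)

AI Chinese abstract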

机器人在有人群的环境中导航时，必须避免碰撞并尊重人群的社会结构，特别是社会群体的隐含边界。大多数导航方法将人类建模为独立个体，即使无碰撞也会导致社交干扰行为。本文提出TAGA（Tangent Action for Group Avoidance，群体避障切线动作），通过切线路径机动绕开检测到的群体边界，无需修改底层导航策略。一个分层安全控制器协调群体级避障与个体碰撞预防。我们提出群体穿越率（GCR），一个连续度量，衡量机器人在任何群体凸包内停留的时间步比例，提供比终端度量更细粒度的社交合规评估。我们引入了一个现实的人群模拟基准，包含五个基于经验的阶段：个体速度异质性、群体速度耦合、F-formation静态群体、领导者-跟随者动力学和凸包边界，并在ORCA和Social Force行人动力学下进行评估。在ORCA、Social Force、DS-RNN和Intention-RL上的实验揭示了反应式-学习型非对称性：TAGA对经典反应式基线提升最大（成功率最高+8pp，GCR减半），而对学习型策略成本近乎为零。这些发现为模块化群体感知何时增加价值以及端到端群体感知训练何时更优提供了可操作的指导。

English abstract

Robots navigating human-populated environments must avoid collisions while respecting the social structure of crowds, particularly the implicit boundaries of social groups. Most navigation approaches model humans as independent individuals, causing socially disruptive behavior even when collision-free. This paper presents TAGA (Tangent Action for Group Avoidance), which steers the robot around detected group boundaries via tangent-path maneuvers without modifying the underlying navigation policy. A hierarchical safety controller coordinates group-level avoidance with individual collision prevention. We propose the Group Crossing Rate (GCR), a continuous metric measuring the fraction of timesteps the robot spends inside any group convex hull, providing finer-grained social compliance assessment than terminal metrics alone. We introduce a realistic crowd simulation benchmark with five empirically grounded phases: individual speed heterogeneity, group speed coupling, F-formation static groups, leader-follower dynamics, and convex-hull boundaries, evaluated under both ORCA and Social Force pedestrian dynamics. Experiments across ORCA, Social Force, DS-RNN, and Intention-RL reveal a reactive-learning asymmetry: TAGA provides the largest gains for classical reactive baselines (up to +8pp success rate, GCR halved) with near-zero cost for learned policies. These findings offer actionable guidance for when modular group-awareness adds value versus when end-to-end group-aware training is preferable.
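The Group Crossing Rate described in the abstract (the fraction of timesteps the robot spends inside any group's convex hull) can be sketched as follows. This is an illustrative reading of that one-sentence definition, not the authors' code: the function names, the 2D point representation, and the choices to count boundary points as inside and to ignore groups with fewer than three hull vertices are all assumptions.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(point, hull):
    """True if `point` is inside or on a counter-clockwise convex polygon."""
    if len(hull) < 3:
        return False  # assumption: degenerate groups enclose no area
    for i in range(len(hull)):
        a, b = hull[i], hull[(i + 1) % len(hull)]
        side = (b[0] - a[0]) * (point[1] - a[1]) - (b[1] - a[1]) * (point[0] - a[0])
        if side < 0:  # point is to the right of a CCW edge, hence outside
            return False
    return True


def group_crossing_rate(robot_path, groups_per_step):
    """Fraction of timesteps at which the robot is inside any group's hull."""
    if not robot_path:
        return 0.0
    inside = sum(
        1
        for robot, groups in zip(robot_path, groups_per_step)
        if any(inside_hull(robot, convex_hull(g)) for g in groups)
    )
    return inside / len(robot_path)


# Hypothetical example: one static four-person group, robot crosses it twice.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
path = [(-1.0, 1.0), (1.0, 1.0), (3.0, 1.0), (1.0, 0.5)]
rate = group_crossing_rate(path, [[square]] * len(path))
```

Because the metric accumulates over every timestep rather than only recording the final outcome, a trajectory that briefly cuts through a group is penalized even if the episode ends successfully.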

2506.23768 2026-06-01 cs.RO updated version

Motion Tracking with Muscles: Predictive Control of a Parametric Musculoskeletal Canine Model

基于肌肉的运动追踪：参数化肌肉骨骼犬模型的预测控制

Vittorio La Barbera, Steven Bohez, Leonard Hasenclever, Yuval Tassa, John R. Hutchinson

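Affiliations: DeepMind; Royal Veterinary College

AI summary: The paper presents a canine musculoskeletal model generated procedurally from accurate 3D muscle meshes, combined with an improved muscle dynamics model and a motion capture task, and validated by comparing simulated muscle activation patterns with experimental EMG data, with the aim of bridging biomechanics, robotics, and computational neuroscience.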

2505.20795 2026-06-01 cs.RO updated version

Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt

以人类演示视频为提示学习可泛化的机器人策略

Xiang Zhu, Yichen Liu, Hezhong Li, Jianyu Chen

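Affiliations: Tsinghua University, China; Shanghai Qi Zhi Institute, China

AI summary: The paper proposes a two-stage framework that learns a generalizable robot policy from human demonstration videos and can perform new tasks without teleoperation data or finetuning.

Comments: Accepted to the IEEE International Conference on Robotics and Automation (ICRA), 2026

AI Chinese abstract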

最近的机器人学习方法通常依赖于通过遥操作收集的大规模机器人数据集的模仿学习。面对新任务时，这些方法通常需要收集一组新的遥操作数据并微调策略。此外，遥操作数据收集流程也繁琐且昂贵。相反，人类能够通过观察他人操作高效学习新任务。在本文中，我们介绍了一种新颖的两阶段框架，利用人类演示学习可泛化的机器人策略。该策略可以直接以人类演示视频为提示，执行新任务，无需任何新的遥操作数据和模型微调。在第一阶段，我们训练视频生成模型，通过交叉预测捕获人类和机器人演示视频数据的联合表示。在第二阶段，我们使用新颖的原型对比损失将学习到的表示与人类和机器人之间的共享动作空间融合。在真实世界灵巧操作任务上的实证评估显示了所提出方法的有效性和泛化能力。

English abstract

Recent robot learning methods commonly rely on imitation learning from massive robotic datasets collected with teleoperation. When facing a new task, such methods generally require collecting a new set of teleoperation data and finetuning the policy. Furthermore, the teleoperation data collection pipeline is tedious and expensive. In contrast, humans are able to learn new tasks efficiently just by watching others. In this paper, we introduce a novel two-stage framework that utilizes human demonstrations to learn a generalizable robot policy. Such a policy can directly take a human demonstration video as a prompt and perform new tasks without any new teleoperation data or model finetuning. In the first stage, we train a video generation model that captures a joint representation for both the human and robot demonstration video data using cross-prediction. In the second stage, we fuse the learned representation with a shared action space between human and robot using a novel prototypical contrastive loss. Empirical evaluations on real-world dexterous manipulation tasks show the effectiveness and generalization capabilities of our proposed method.

2407.16167 2026-06-01 cs.RO cs.SY eess.SY updated version

Consideration of Vehicle Characteristics on the Motion Planner Algorithm

运动规划算法中车辆特性的考虑

Syed Adil Ahmed, Taehyun Shim

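Affiliations: University of Michigan Dearborn

AI summary: Existing trajectory planners ignore the effect of center-of-gravity (CG) height, which yields non-optimal trajectories for different vehicles, especially those with a high CG. The paper proposes a planner that uses a simplified double-track model, steady-state estimates of lateral and roll load transfer, and a simplified tire model to reduce solver workload, and compares it with particle-model and kinematic-model planners under high and low acceleration and at different vehicle heights.

Comments: This paper has been accepted for conference proceedings in MECC 2024, Chicago under a Creative Commons License CC-BY-NC-ND

DOI: 10.1016/j.ifacol.2025.01.086
Journal ref: IFAC-PapersOnLine, Vol 58, Num 28, 2024, pgs 444-449

AI Chinese abstract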

自主车辆控制通常分为两个主要领域：轨迹规划和轨迹跟踪。目前，轨迹规划大多通过粒子或基于运动学模型的优化控制器完成。由于这些规划器不考虑质心高度及其影响，其输出对于不同车辆类型（尤其是高质心车辆）并非唯一。因此，跟踪控制器在尝试实现这些次优轨迹时，可能需要付出较大努力以避免车辆操纵性和舒适性约束。本文尝试通过考虑一种采用简化双轨模型的规划器来解决该问题，该模型利用稳态方程估计侧向和侧倾载荷转移，并采用简化轮胎模型以降低求解器负担。将所开发的规划器与广泛使用的粒子模型和运动学模型规划器在碰撞避免场景下进行对比，涵盖高/低加速度条件和不同车辆高度。

English abstract

Autonomous vehicle control is generally divided into two main areas: trajectory planning and trajectory tracking. Currently, trajectory planning is mostly done by optimization-based controllers built on particle or kinematic models. Because these planners do not consider CG height and its effects, their output is not tailored to different vehicle types, especially high-CG vehicles. As a result, the tracking controller may have to work hard to avoid violating vehicle handling and comfort constraints while trying to realize these sub-optimal trajectories. This paper addresses the problem with a planner that uses a simplified double-track model, estimates lateral and roll-based load transfer from steady-state equations, and adopts a simplified tire model to reduce solver workload. The developed planner is compared with the widely used particle and kinematic model planners in collision avoidance scenarios under both high and low acceleration conditions and with different vehicle heights.
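To make the role of CG height concrete, the sketch below evaluates the textbook steady-state lateral load transfer for a rigid vehicle, m·a_y·h/t. It is only an illustration of why a higher CG matters to a planner, under the assumption of a rigid body with no suspension roll; it is not the paper's double-track formulation, and the function name and example vehicle values are hypothetical.

```python
def lateral_load_transfer(mass_kg, lateral_accel_mps2, cg_height_m, track_width_m):
    """Steady-state load shifted from the inner to the outer wheels, in newtons.

    Rigid-vehicle moment balance: delta_Fz = m * a_y * h / t.
    """
    if track_width_m <= 0:
        raise ValueError("track width must be positive")
    return mass_kg * lateral_accel_mps2 * cg_height_m / track_width_m


# Hypothetical vehicles with the same mass and track width but different CG heights.
sedan = lateral_load_transfer(1500.0, 5.0, 0.55, 1.5)
suv = lateral_load_transfer(1500.0, 5.0, 0.80, 1.5)
```

For the same lateral acceleration, the higher-CG vehicle shifts more load to its outer wheels, which is the effect that a particle or kinematic planner cannot represent.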
