arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)

1. Motion Planning, Control, and Dynamics (1 paper)

2512.21109 2026-06-18 cs.RO Updated version

Robust and Efficient MuJoCo-based Model Predictive Control via Web of Affine Spaces Derivatives

Chen Liang, Daniel Rakita

Affiliations: Department of Computer Science, Yale University

AI summary: To address the finite-difference derivative bottleneck in MJPC, the paper substitutes Web of Affine Spaces (WASP) derivatives, giving efficient and stable derivative computation, up to a 2x speedup across a range of robot tasks, and better results than stochastic sampling-based planners.

Comments: Accepted to 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract

MuJoCo is a powerful and efficient physics simulator widely used in robotics. One common way it is applied in practice is through Model Predictive Control (MPC), which uses repeated rollouts of the simulator to optimize future actions and generate responsive control policies in real time. To make this process more accessible, the open source library MuJoCo MPC (MJPC) provides ready-to-use MPC algorithms and implementations built directly on top of the MuJoCo simulator. However, MJPC relies on finite differencing (FD) to compute derivatives through the underlying MuJoCo simulator, which is often a key bottleneck that can make it prohibitively costly for time-sensitive tasks, especially in high-DOF systems or complex scenes. In this paper, we introduce the use of Web of Affine Spaces (WASP) derivatives within MJPC as a drop-in replacement for FD. WASP is a recently developed approach for efficiently computing sequences of accurate derivative approximations. By reusing information from prior, related derivative calculations, WASP accelerates and stabilizes the computation of new derivatives, making it especially well suited for MPC's iterative, fine-grained updates over time. We evaluate WASP across a diverse suite of MJPC tasks spanning multiple robot embodiments. Our results suggest that WASP derivatives are particularly effective in MJPC: it integrates seamlessly across tasks, delivers consistently robust performance, and achieves up to a 2x speedup compared to an FD backend when used with derivative-based planners, such as iLQG. In addition, WASP-based MPC outperforms MJPC's stochastic sampling-based planners on our evaluation tasks, offering both greater efficiency and reliability. To support adoption and future research, we release an open-source implementation of MJPC with WASP derivatives fully integrated.
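
For context on the bottleneck this abstract describes, the following is a minimal Python sketch of forward finite differencing. It is not MJPC or WASP code: `fd_jacobian` and `toy_step` are hypothetical names, and `toy_step` is a made-up two-state stand-in for a simulator step. It only illustrates that a forward-difference Jacobian needs one extra function evaluation per input dimension, which is the cost that grows with the number of degrees of freedom.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def fd_jacobian(f: Callable[[Sequence[float]], Vector],
                x: Sequence[float],
                eps: float = 1e-6) -> List[Vector]:
    """Forward-difference Jacobian with J[i][j] ~= d f_i / d x_j.

    Requires len(x) + 1 evaluations of f.
    """
    base = f(x)
    jac = [[0.0] * len(x) for _ in base]
    for j in range(len(x)):
        shifted = list(x)
        shifted[j] += eps          # perturb one input coordinate
        out = f(shifted)           # one extra function call per input
        for i in range(len(base)):
            jac[i][j] = (out[i] - base[i]) / eps
    return jac


def toy_step(x: Sequence[float]) -> Vector:
    """Hypothetical stand-in for one simulator step (not MuJoCo)."""
    pos, vel = x
    dt = 0.01
    return [pos + dt * vel, vel - dt * 9.81 * pos]
```

Calling `fd_jacobian(toy_step, [0.5, 0.1])` evaluates `toy_step` three times: once at the base point and once per state variable.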

2. Manipulation, Grasping, and Dexterous Hands (4 papers)

2501.02874 2026-06-18 cs.RO Updated version

Steering Flexible Linear Objects in Planar Environments by Two Robot Hands Using Euler's Elastica Solutions

Aharon Levin, Elon Rimon, Amir Shapiro

Affiliations: Dept. of ME, Technion, Israel; Dept. of ME, Ben-Gurion University, Israel

AI summary: Using Euler's elastica solutions, the paper steers flexible linear objects in planar environments by controlling the grasp endpoint positions and tangents of two robot hands, keeping the object free of self-intersection, stable, and clear of obstacles.

Abstract

The manipulation of flexible objects such as cables, wires and fresh food items by robot hands forms a special challenge in robot grasp mechanics. This paper considers the steering of flexible linear objects in planar environments by two robot hands. The flexible linear object, modeled as an elastic non-stretchable rod, is manipulated by varying the gripping endpoint positions while keeping equal endpoint tangents. The flexible linear object shape has a closed form solution in terms of the grasp endpoint positions and tangents, called Euler's elastica. This paper obtains the elastica solutions under the optimal control framework, then uses the elastica solutions to obtain closed-form criteria for non self-intersection, stability and obstacle avoidance of the flexible linear object. The new tools are incorporated into a planning scheme for steering flexible linear objects in planar environments populated by sparsely spaced obstacles. The scheme is fully implemented and demonstrated with detailed examples.

2601.20381 2026-06-18 cs.RO Updated version

STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation

Alexandre Chapin, Emmanuel Dellandréa, Liming Chen

Affiliations: Ecole Centrale de Lyon, LIRIS

AI summary: Proposes the STORM module, which uses a multi-phase training strategy to combine frozen visual foundation models with semantic-aware slots, producing task-aware object-centric representations that improve generalization under visual distractors and control performance in robotic manipulation.

Abstract

Visual foundation models provide strong perceptual features for robotics, but their dense representations lack explicit object-level structure, limiting robustness and contractility in manipulation tasks. We propose STORM (Slot-based Task-aware Object-centric Representation for robotic Manipulation), a lightweight object-centric adaptation module that augments frozen visual foundation models with a small set of semantic-aware slots for robotic manipulation. Rather than retraining large backbones, STORM employs a multi-phase training strategy: object-centric slots are first stabilized through visual--semantic pretraining using language embeddings, then jointly adapted with a downstream manipulation policy. This staged learning prevents degenerate slot formation and preserves semantic consistency while aligning perception with task objectives. Experiments on object discovery benchmarks and simulated manipulation tasks show that STORM improves generalization to visual distractors, and control performance compared to directly using frozen foundation model features or training object-centric representations end-to-end. Our results highlight multi-phase adaptation as an efficient mechanism for transforming generic foundation model features into task-aware object-centric representations for robotic control.

2605.05925 2026-06-18 cs.RO Updated version

DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

Hyesung Lee, Hyunwoo Jung, Si-Hwan Heo, Sungwook Yang

Affiliations: Korea Institute of Science and Technology; KAIST; Hanyang University

AI summary: Proposes the DexSynRefine framework, which synthesizes hand-object trajectories with the HOI-MMFP motion prior and combines task-space residual reinforcement learning with contact-dynamics adaptation to turn human-object interaction data into physically feasible dexterous manipulation, raising success rates by 50-70 percentage points across five tasks.

Comments: Project page: https://dexsynrefine.github.io/

Abstract

Learning dexterous manipulation from human-object interaction (HOI) data offers a scalable alternative to robot teleoperation, but HOI demonstrations are typically sparse and purely kinematic, making direct retargeting unreliable under embodiment mismatch and contact-rich dynamics. We present DexSynRefine, a coupled framework that treats HOI data as structured motion priors rather than executable robot actions. DexSynRefine first synthesizes hand-object trajectories conditioned on the task and initial object state using HOI Motion Manifold Flow Primitives (HOI-MMFP), a motion prior for coupled hand-object motion. It then physically grounds them with task-space residual reinforcement learning and adapts execution by inferring missing contact-dynamics context from proprioceptive history. Across five dexterous manipulation tasks, each stage addresses a complementary bottleneck: HOI-MMFP improves trajectory consistency and smoothness, task-space residuals provide the strongest grounding representation among the tested alternatives, and contact-dynamics adaptation enables robust real-world execution. Together, DexSynRefine improves real-world success rates over kinematic retargeting by 50-70 percentage points.

2606.13672 2026-06-18 cs.RO Updated version

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

Arnav Kumar Jain, Yilin Wu, Jesse Farebrother, Gokul Swamy, Andrea Bajcsy

Affiliations: Mila - Québec AI Institute; Université de Montréal; Carnegie Mellon University; McGill University

AI summary: Proposes the WEAVER world model architecture, which trains multi-view latent prediction with a flow-matching loss to achieve high fidelity, long-horizon consistency, and efficient inference at once, markedly improving policy evaluation, policy improvement, and test-time planning on robotic manipulation tasks.

Abstract

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: (i) fidelity (i.e., producing simulated trajectories that correlate with reality), (ii) consistency (i.e., producing simulated trajectories that are coherent over long horizons), and (iii) efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. WEAVER is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER in robotic hardware, demonstrating its effectiveness at policy evaluation ($\rho = 0.870$ correlation with real-world success rate), policy improvement (real-world success rate improvement of $38\%$ on top of the $\pi_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of $14\%$ with a $5$-$10\times$ speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/ .
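
The abstract names a flow-matching loss but does not spell out its form. The Python sketch below shows one common variant, assumed here purely for illustration: a straight-line path between a noise sample `x0` and a data sample `x1`, with the model regressed onto the constant velocity `x1 - x0`. The function name, the linear path, and the mean-squared error are assumptions, not details taken from WEAVER.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def flow_matching_loss(velocity_model: Callable[[Sequence[float], float], Vector],
                       x0: Sequence[float],
                       x1: Sequence[float],
                       t: float) -> float:
    """Mean squared error between predicted and target velocity at time t.

    Assumes the straight-line path x_t = (1 - t) * x0 + t * x1, whose
    velocity is the constant x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = velocity_model(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)
```

In training, `t` would typically be sampled uniformly from [0, 1] and the loss averaged over a batch; a model that always predicts `x1 - x0` attains zero loss on that pair.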

3. Navigation, Localization, and SLAM (3 papers)

2511.02036 2026-06-18 cs.RO Updated version

TurboMap: GPU-Accelerated Local Mapping for Visual SLAM

Parsa Hosseininejad, Kimia Khabiri, Shishir Gopinath, Soudabeh Mohammadhashemi, Karthik Dantu, Steven Y. Ko

Affiliations: Simon Fraser University; University at Buffalo

AI summary: To reduce local mapping latency in visual SLAM, proposes TurboMap, a backend that combines GPU parallelization with CPU optimization; by restructuring map point creation, map point fusion, and keyframe management, it achieves 1.3-1.6x speedups while preserving accuracy.

Comments: Accepted for presentation at IROS 2026, preprint

Abstract

In real-time Visual SLAM systems, local mapping must operate under strict latency constraints, as delays degrade map quality and increase the risk of tracking failure. GPU parallelization offers a promising way to reduce latency. However, parallelizing local mapping is challenging due to synchronized shared-state updates and the overhead of transferring large map data structures to the GPU. This paper presents TurboMap, a GPU-parallelized and CPU-optimized local mapping backend that holistically addresses these challenges. We restructure Map Point Creation to enable parallel Keypoint Correspondence Search on the GPU, redesign and parallelize Map Point Fusion, optimize Redundant Keyframe Culling on the CPU, and integrate a fast GPU-based Local Bundle Adjustment solver. To minimize data transfer and synchronization costs, we introduce persistent GPU-resident keyframe storage. Experiments on the EuRoC and TUM-VI datasets show average local mapping speedups of 1.3x and 1.6x, respectively, while preserving accuracy.

2602.04401 2026-06-18 cs.RO cs.CV Updated version

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

Affiliations: QUT Centre for Robotics; School of Electrical Engineering and Robotics; Queensland University of Technology

AI summary: Proposes a method that transfers thresholds via quantile normalisation to automatically select the operating point of a visual place recognition system, maximising recall at 100% precision without manual tuning.

Comments: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

Abstract

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
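
One way to picture the threshold transfer described above is the Python sketch below. It is an interpretation under stated assumptions, not the authors' implementation (their code is at the URL in the abstract): a threshold chosen on calibration scores is converted to its quantile rank, and the deployment threshold is the score at the same quantile of the deployment distribution. All function names are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def quantile_rank(scores: Sequence[float], threshold: float) -> float:
    """Fraction of scores that are <= threshold."""
    ordered = sorted(scores)
    return bisect_right(ordered, threshold) / len(ordered)


def quantile_value(scores: Sequence[float], q: float) -> float:
    """Score at quantile q in [0, 1], with linear interpolation."""
    ordered = sorted(scores)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def transfer_threshold(calib_scores: Sequence[float],
                       calib_threshold: float,
                       deploy_scores: Sequence[float]) -> float:
    """Map a calibration threshold to the deployment score scale."""
    q = quantile_rank(calib_scores, calib_threshold)
    return quantile_value(deploy_scores, q)
```

Because only the rank of the threshold is carried over, the transferred value follows shifts and rescalings of the deployment score distribution rather than staying fixed at the calibration value.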

2606.01605 2026-06-18 cs.RO Updated version

Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control

Dawei Zhang, Nuo Chen, Shuo Liu, Roberto Tron, Zhiwen Fan

Affiliations: Division of Systems Engineering, Boston University; Department of Mechanical Engineering, Boston University; Department of Electrical and Computer Engineering, Texas A&M University

AI summary: Proposes an online monocular perception-to-control framework that embeds semantic risk directly into the Euclidean Signed Distance Field (ESDF), encoding risk before control optimization to enable semantic-aware safe navigation and teleoperation based on Control Barrier Functions (CBFs).

Abstract

We propose an online monocular perception-to-control framework that embeds semantic risk into the distance field used by Control Barrier Function (CBF)-based safe navigation and teleoperation. Many perception-based safety filters assign the same distance-based safety margin to all mapped obstacles or use semantics only as a downstream controller adjustment, rather than encoding semantic risk in the spatial representation. Our framework instead reasons online about obstacle geometry and class-dependent risk by embedding semantic information directly into the Euclidean Signed Distance Field (ESDF). This design encodes semantic risk before control optimization, so high-risk objects exert a larger spatial influence in the safety field while retaining efficient ESDF queries at runtime. Specifically, a foundation-model-based SLAM front end reconstructs dense 3-D geometry from monocular RGB video, while per-frame semantic segmentation provides pixel-level class labels that are fused into the reconstructed geometry. The resulting geometric-semantic representation is then converted into an ESDF, where semantic labels identify safety-relevant regions and impose class-dependent inflation before field computation. The semantic-aware ESDF provides the local distance values and spatial derivatives required by the CBF controller, while class-dependent gains further regulate the controller response. Extensive simulation and hardware experiments demonstrate online operation at 10--20 Hz and semantic-aware safe behavior in both teleoperation and autonomous navigation.
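
To make the idea of class-dependent inflation concrete, here is a small Python sketch under strong simplifying assumptions that are not in the abstract: a 2-D single-integrator robot, one circular obstacle standing in for an ESDF query, and a closed-form projection in place of a general QP solver. The `INFLATION` table, its values, and both function names are hypothetical.

```python
import math
from typing import Dict, Tuple

Vec2 = Tuple[float, float]

# Hypothetical class-dependent inflation radii in metres (illustrative only).
INFLATION: Dict[str, float] = {"chair": 0.1, "person": 0.5}


def barrier(x: Vec2, center: Vec2, radius: float, label: str) -> Tuple[float, Vec2]:
    """Return h(x) and its gradient for one inflated circular obstacle.

    h(x) = distance to obstacle surface minus the class-dependent margin.
    Assumes x is not exactly at the obstacle center.
    """
    dx, dy = x[0] - center[0], x[1] - center[1]
    dist = math.hypot(dx, dy)
    h = dist - radius - INFLATION[label]
    return h, (dx / dist, dy / dist)


def cbf_filter(u_nom: Vec2, h: float, grad: Vec2, alpha: float = 1.0) -> Vec2:
    """Closest control to u_nom that satisfies grad . u >= -alpha * h."""
    slack = grad[0] * u_nom[0] + grad[1] * u_nom[1] + alpha * h
    if slack >= 0.0:
        return u_nom               # nominal command is already safe
    norm_sq = grad[0] ** 2 + grad[1] ** 2
    scale = -slack / norm_sq       # minimal correction along the gradient
    return (u_nom[0] + scale * grad[0], u_nom[1] + scale * grad[1])
```

With the robot at (-1, 0) driving at unit speed toward an obstacle of radius 0.2 at the origin, the filter permits a faster approach when the obstacle is labelled "chair" than when it is labelled "person", because the larger inflation shrinks `h`.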

4. Human-Robot Interaction and Collaborative Robots (2 papers)

2503.08895 2026-06-18 cs.RO Updated version

Mutual Adaptation in Human-Robot Co-Transportation with Human Preference Uncertainty

Al Jaber Mahmud, Weizi Li, Xuan Wang

Affiliations: George Mason University; University of California, Riverside

AI summary: To handle uncertain human preference parameters and the balancing of adaptation strategies in human-robot co-transportation, proposes a unified framework that models a probability distribution over preferences, a time-varying stubbornness measure, and a coordinated planning model, combined with a pose optimization strategy, to improve task performance through mutual adaptation.

Comments: 9 pages, 6 figures

Abstract

Mutual adaptation can enhance overall task performance in human-robot co-transportation by integrating both the robot's and the human's understanding of the environment. While human modeling helps capture humans' subjective preferences, two challenges persist: (i) the uncertainty of human preference parameters and (ii) the need to balance adaptation strategies that benefit both humans and robots. In this paper, we propose a unified framework to address these challenges and improve task performance through mutual adaptation. First, instead of relying on fixed parameters, we model a probability distribution of human choices by incorporating a range of uncertain human preference parameters. Building on this, we introduce a time-varying stubbornness measure and a coordinated planning model, which allows either the robot to lead the team's trajectory or, if a human's preferred path conflicts with the robot's plan and their stubbornness exceeds a threshold, the robot to transition to following the human. Finally, we introduce a pose optimization strategy for low-level control to mitigate the uncertain human behaviors when they are leading. To validate the framework, we design and perform a study with human feedback from twenty human participants. We then demonstrate, through simulations, the effectiveness of our models in enhancing task performance with mutual adaptation and pose optimization.

2501.06348 2026-06-18 cs.HC cs.RO Updated version

Why Automate This? Exploring Correlations Between Desire for Robotic Automation, Invested Time and Well-Being

Ruchira Ray, Leona Pang, Sanjana Srivastava, Li Fei-Fei, Samantha Shorey, Roberto Martín-Martín

Affiliations: University of Texas at Austin; Stanford University; University of Pittsburgh

AI summary: Using BEHAVIOR-1K and other datasets, the study finds that time spent on an activity is not a strong predictor of automation preference, whereas happiness and pain are the strongest indicators, and it reveals differences by gender and income level.

Comments: 26 pages, 14 figures

Abstract

Understanding the motivations underlying the human inclination to automate tasks is vital for developing robots that fit seamlessly into daily life. Accordingly, we ask: are individuals more inclined to automate activities based on the time they consume or the feelings experienced while performing them? This study explores these preferences and whether they vary across social groups, specifically gender category and income level. Leveraging data from the BEHAVIOR-1K dataset, the American Time-Use Survey, and the American Time-Use Survey Well-Being Module, we investigate the relationship between the desire for robot automation, time spent, and associated feelings: Happiness, Meaningfulness, Sadness, Painfulness, Stressfulness, or Tiredness. Our key findings show that, despite common assumptions, time spent on activities does not strongly predict automation preferences; instead, happiness and pain are the strongest indicators. We also identify differences by gender and economic level: Women prefer to automate stressful activities, whereas men prefer to automate those that make them unhappy; mid-income individuals prioritize automating less enjoyable and meaningful activities, while low and high-income show no significant correlations. We hope our research helps motivate the design of robots that align with user priorities, moving domestic robotics toward more socially relevant solutions. All data and an interactive tool are publicly available at https://robin-lab.cs.utexas.edu/why-automate-this/.

5. Embodied AI and Vision-Language-Action Models (2 papers)

2606.17846 2026-06-18 cs.RO cs.CV cs.LG Updated version

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen

Affiliations: Qwen Team

AI summary: Proposes Qwen-RobotManip, which uses a unified alignment framework (across representation, motion, and behavioral dimensions) to enable large-scale joint training on heterogeneous multi-source manipulation data, builds a pretraining corpus of about 38,100 hours, and surpasses prior models on generalization capabilities such as zero-shot instruction following and cross-embodiment transfer.

Comments: 44 pages

Abstract

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi_{0.5}$, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Updated version

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

Affiliations: NVIDIA

AI summary: Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and offers a scalable, general-purpose backbone for embodied agents.

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

6. Multi-Robot and Swarm Systems (1 paper)

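2510.18085 2026-06-18 cs.RO cs.AI cs.MA Updated version

R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

Connor Mattson, Varun Raveendra, Ellen Novoseller, Nicholas Waytowich, Vernon J. Lawhern, Daniel S. Brown

Affiliations: Kahlert School of Computing, University of Utah; DEVCOM Army Research Laboratory

AI summary: Proposes R2BC, which trains multi-robot systems through round-robin single-agent demonstrations without requiring demonstrations in the joint action space. On four simulated tasks it matches or surpasses a baseline trained on privileged synchronized demonstrations, and it is also deployed on two physical robot tasks.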

Comments 8 pages, 6 figures. In Proceedings: IEEE International Conference on Robotics & Automation (ICRA 2026)

Abstract

Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.
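As a rough illustration of the round-robin idea described above, the sketch below rotates which agent is "teleoperated" in each round and records demonstrations only for that agent. The abstract does not spell out the algorithm, so the function name `collect_round_robin`, the scalar observations, and the `expert` and `policies` callables are all hypothetical stand-ins rather than the authors' implementation.

```python
import random


def collect_round_robin(num_agents, num_rounds, steps_per_round, expert, policies, seed=0):
    """Toy round-robin data collection: one agent is demonstrated per round.

    expert(agent, obs) stands in for the human teleoperator (hypothetical).
    policies[agent](obs) stands in for each agent's current learned policy.
    """
    rng = random.Random(seed)
    datasets = [[] for _ in range(num_agents)]
    for round_idx in range(num_rounds):
        controlled = round_idx % num_agents  # rotate the demonstrated agent
        for _ in range(steps_per_round):
            # Toy scalar observation per agent; a real system would use sensor data.
            obs = [rng.uniform(-1.0, 1.0) for _ in range(num_agents)]
            for agent in range(num_agents):
                if agent == controlled:
                    action = expert(agent, obs[agent])
                    datasets[agent].append((obs[agent], action))
                else:
                    # The other agents act on their own; nothing is recorded for them.
                    policies[agent](obs[agent])
    return datasets


datasets = collect_round_robin(
    num_agents=3,
    num_rounds=6,
    steps_per_round=4,
    expert=lambda agent, obs: -obs,
    policies=[lambda obs: 0.0] * 3,
)
print([len(d) for d in datasets])  # [8, 8, 8]
```

Each per-agent dataset could then be fed to an ordinary behavior cloning step. Joint-action demonstrations are never needed in this scheme, which is the property the abstract emphasizes.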

7. Unmanned Vehicles, Drones, and Mobile Robots (1 paper)

2602.01700 2026-06-18 cs.RO Updated version

Tilt-Ropter: A Fully Actuated Hybrid Aerial-Terrestrial Vehicle with Tilt Rotors and Passive Wheels

Ruoyu Wang, Xuchen Liu, Zongzhou Wu, Zixuan Guo, Wendi Ding, Ben M. Chen

Affiliations: Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong; Faculty of Engineering, The University of Hong Kong; Peng Cheng Laboratory

AI summary: Proposes Tilt-Ropter, a fully actuated hybrid aerial-terrestrial vehicle that uses tilt rotors and passive wheels for efficient multi-modal locomotion, with a unified nonlinear model predictive controller that achieves low tracking error and a 92.8% reduction in power consumption during ground locomotion compared to flight.

Comments 8 pages, 10 figures. Accepted by the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract

In this work, we present Tilt-Ropter, a fully actuated hybrid aerial-terrestrial vehicle (HATV) that integrates tilt rotors with passive wheels to enable efficient multi-modal locomotion. Unlike conventional underactuated HATVs, the fully actuated design of Tilt-Ropter allows decoupled force and torque control, improving maneuverability and ground locomotion efficiency. A unified nonlinear model predictive controller (NMPC) is developed to track reference trajectories, enforce non-holonomic constraints, and accommodate contact effects across locomotion modes, while ensuring actuator feasibility through dedicated control allocation. To address complex wheel-ground dynamics, an external wrench estimator is incorporated to provide real-time interaction wrench estimates. The system is validated through simulation and real-world experiments, including seamless air-ground transitions and trajectory tracking tasks. Experimental results demonstrate low tracking errors in both modes and reveal a 92.8% reduction in power consumption during ground locomotion compared to flight, highlighting the platform's suitability for long-duration missions in energy-constrained environments.
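To put the 92.8% figure in perspective, the short calculation below converts a power reduction into an endurance ratio. It assumes a fixed usable energy budget and constant average power in each mode, neither of which is stated in the abstract, so the result is only a back-of-the-envelope estimate and not a number reported by the authors.

```python
def endurance_multiplier(power_reduction):
    """Ratio of ground endurance to flight endurance for a fixed energy budget.

    Assumes constant average power in each mode (an illustrative simplification).
    """
    if not 0.0 <= power_reduction < 1.0:
        raise ValueError("power_reduction must be in [0, 1)")
    return 1.0 / (1.0 - power_reduction)


# A 92.8% reduction means ground locomotion draws 7.2% of flight power.
print(round(endurance_multiplier(0.928), 1))  # 13.9
```

Under these assumptions the same battery would last roughly 14 times longer on the ground than in the air.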

8. Simulation, Datasets, and Benchmarks (4 papers)

2512.11736 2026-06-18 cs.RO Updated version

Bench-Push: Benchmarking Pushing-based Navigation and Manipulation Tasks for Mobile Robots

Ninghan Zhong, Steven Caro, Megnath Ramesh, Rishi Bhatnagar, Avraiem Iskandar, Stephen L. Smith

Affiliations: Institute for Robotics and Intelligent Machines, Georgia Institute of Technology; Department of Electrical and Computer Engineering, University of Waterloo; Department of Mechanical Engineering, University of Alberta

AI summary: Proposes Bench-Push, the first unified benchmark for pushing-based mobile robot navigation and manipulation, comprising a range of simulated environments, new evaluation metrics, and baseline implementations for evaluating robot pushing tasks in environments with movable obstacles.

Comments Published in CRV 2026

Abstract

Mobile robots are increasingly deployed in cluttered environments with movable objects, posing challenges for traditional methods that prohibit interaction. In such settings, the mobile robot must go beyond traditional obstacle avoidance, leveraging pushing or nudging strategies to accomplish its goals. While research in pushing-based robotics is growing, evaluations rely on ad hoc setups, limiting reproducibility and cross-comparison. To address this, we present Bench-Push, the first unified benchmark for pushing-based mobile robot navigation and manipulation tasks. Bench-Push includes multiple components: 1) a comprehensive range of simulated environments that capture the fundamental challenges in pushing-based tasks, including navigating a maze with movable obstacles, autonomous ship navigation in ice-covered waters, box delivery, and area clearing, each with varying levels of complexity; 2) novel evaluation metrics to capture efficiency, interaction effort, and partial task completion; and 3) demonstrations using Bench-Push to evaluate example implementations of established baselines across environments. Bench-Push is open-sourced as a Python library with a modular design. The code, documentation, and trained models can be found at https://github.com/IvanIZ/BenchNPIN.

2512.14428 2026-06-18 cs.RO Updated version

Odyssey: An Automotive Lidar-Inertial Odometry Dataset with GNSS-denied situations

Aaron Kurda, Simon Steuernagel, Lukas Jung, Marcus Baum

Affiliations: University of Göttingen; iMAR Navigation

AI summary: Presents the Odyssey dataset, which uses a navigation-grade ring laser gyroscope RTK/INS for high-accuracy ground truth and includes 36 sequences and prolonged GNSS-denied environments (tunnels, indoor parking garages) for evaluating LIO and SLAM systems.

Comments 10 pages, 4 figures, 3 tables, submitted to International Journal of Robotics Research (IJRR)

Abstract

The development and evaluation of Lidar-Inertial Odometry (LIO) and Simultaneous Localization and Mapping (SLAM) systems require precise ground truth. The Global Navigation Satellite System (GNSS) is often used as a foundation for this, but its signals can be unreliable in obstructed environments due to multi-path effects or loss of signal. While existing datasets compensate for sporadic GNSS loss by incorporating Inertial Measurement Unit (IMU) measurements, the commonly used systems do not permit prolonged study of GNSS-denied environments due to accumulated drift. Therefore, the diversity of such datasets is limited. To close this gap, we present Odyssey, an automotive LIO dataset featuring: (1) a ground truth derived from a navigation-grade Ring Laser Gyroscope (RLG)-based RTK/INS, offering bias stability one to four orders of magnitude better than existing automotive datasets; (2) a comprehensive collection of 36 sequences across diverse environments, enabling robust and comprehensive evaluation; and (3) prolonged GNSS-denied environments, including tunnels and, previously unseen in the context of automotive benchmarks, indoor parking garages. Here, our RLG-based system enables accurate evaluation in scenarios where commonly employed systems would drift excessively. Besides providing data for LIO, Odyssey also supports place recognition tasks through threefold trajectory repetition and integration of external mapping data via precise geodetic coordinates. All data, the dataloader, and supplementary material are available online at https://odyssey.uni-goettingen.de/.

2601.07052 2026-06-18 cs.RO Updated version

RSLCPP -- Deterministic Simulations Using ROS 2

Simon Sagmeister, Marcel Weinmann, Phillip Pitschi, Markus Lienkamp

Affiliations: Technical University of Munich, Germany; School of Engineering & Design, Department of Mobility Systems Engineering, Institute of Automotive Technology; School of Engineering & Design, Department of Engineering Physics and Computation, Institute of Automatic Control

AI summary: To address the irreproducible simulation results caused by ROS's asynchronous, multi-process design, proposes the RSLCPP library, which achieves reproducible simulations across platforms through deterministic callback execution, usually without modifying existing node code.

Comments Accepted for publication at the 'IEEE Robotics and Automation Practice'

Abstract

Simulation is crucial in real-world robotics, offering safe, scalable, and efficient environments for developing a variety of robotic applications. While the Robot Operating System (ROS) has been widely adopted as the backbone of these robotic applications in both academia and industry, its asynchronous, multi-process design complicates reproducibility, especially across varying hardware platforms. Deterministic callback execution cannot be guaranteed when computation times and communication delays vary. This lack of reproducibility complicates scientific benchmarking and continuous integration, where consistent results are essential. To address this, we present a methodology to create deterministic simulations using ROS 2 nodes. Our ROS Simulation Library for C++ (RSLCPP) implements this approach, enabling existing nodes to be combined into a simulation routine that yields reproducible results, usually without requiring any source code changes. We demonstrate that our approach produces identical results across various CPUs and architectures when testing both a synthetic benchmark and a real-world robotics system. RSLCPP is open-sourced at https://github.com/TUMFTM/rslcpp.
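One generic way to make callback order independent of computation time and communication delay is to drive every callback from a simulated clock and break ties by scheduling order. The sketch below illustrates that idea only. It is not the RSLCPP or rclcpp API, and the abstract does not say that RSLCPP works this way; `DeterministicExecutor` and `make_timer` are hypothetical names.

```python
import heapq


class DeterministicExecutor:
    """Runs callbacks in (simulated time, scheduling order), never wall-clock order."""

    def __init__(self):
        self._queue = []
        self._seq = 0  # unique tie-breaker so equal-time callbacks keep a fixed order
        self.now = 0.0

    def schedule(self, sim_time, callback):
        heapq.heappush(self._queue, (sim_time, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._queue:
            sim_time, _, callback = heapq.heappop(self._queue)
            self.now = sim_time
            callback(self)


def make_timer(name, period, end_time, trace):
    """Build a periodic callback that logs its firing time and reschedules itself."""
    def tick(executor):
        trace.append((executor.now, name))
        next_time = executor.now + period
        if next_time <= end_time:
            executor.schedule(next_time, tick)
    return tick


def simulate():
    trace = []
    executor = DeterministicExecutor()
    executor.schedule(0.0, make_timer("sensor", 0.5, 1.0, trace))
    executor.schedule(0.0, make_timer("controller", 1.0, 1.0, trace))
    executor.run()
    return trace


print(simulate() == simulate())  # True
```

Because the order depends only on simulated timestamps and insertion order, repeated runs produce the same trace regardless of how long each callback takes on a given machine.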

2606.17639 2026-06-18 cs.RO cs.CV Updated version

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

Hong Yang, Basura Fernando

Affiliations: Centre for Frontier AI Research, Agency for Science, Technology and Research; College of Computing and Data Science, Nanyang Technological University

AI summary: Proposes the ERQA-Plus benchmark, containing 1,766 question-answer instances grounded in robot-centric images and covering perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning, for diagnosing the reasoning abilities of embodied AI.

Abstract

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.
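The contrast between overall accuracy and category-level results can be made concrete with a small breakdown like the one below. The record fields (`category`, `prediction`, `answer`) and the sample records are hypothetical and do not reflect the actual ERQA-Plus schema or its results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return exact-match accuracy per reasoning category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        category = record["category"]
        totals[category] += 1
        correct[category] += int(record["prediction"] == record["answer"])
    return {category: correct[category] / totals[category] for category in totals}


# Hypothetical records, for illustration only.
records = [
    {"category": "perceptual", "prediction": "A", "answer": "A"},
    {"category": "perceptual", "prediction": "C", "answer": "C"},
    {"category": "spatial", "prediction": "B", "answer": "B"},
    {"category": "spatial", "prediction": "A", "answer": "D"},
]
per_category = accuracy_by_category(records)
overall = sum(r["prediction"] == r["answer"] for r in records) / len(records)
print(overall, per_category)  # 0.75 {'perceptual': 1.0, 'spatial': 0.5}
```

A single overall number (0.75 here) can hide a weak category (0.5 for the toy "spatial" group), which is the kind of gap a diagnostic benchmark is meant to expose.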

9. Other / General Robotics (2 papers)

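2507.16859 2026-06-18 cs.RO cs.AI Updated version

Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation

Luobin Cui, Yanlai Wu, Tang Ying, Weikai Li

AI summary: To address the unavailability of high-quality sensors in real deployment environments, proposes a heterogeneous multi-source fatigue-detection framework that performs cross-domain modality imputation based on shared modalities, drawing on source-domain knowledge to assist fatigue detection in the target domain.

Comments 4 figures, 14 pages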

Abstract

Fatigue detection for human operators is important in safety-related applications such as aviation, mining, and long-haul transport. Reliable estimation of operator fatigue can support timely warnings, adaptive task scheduling, takeover reminders, and other safety-management decisions in human-machine systems. However, the effectiveness of these functions depends on whether fatigue-related signals can be reliably captured in the deployment environment. While many studies have shown the value of high-fidelity sensors in controlled laboratory environments, their performance often degrades when used in real-world settings because of noise, lighting conditions, and field-of-view constraints, thereby limiting their practical use. This paper formalizes a deployment-oriented setting for real-world fatigue detection, where high-quality sensors are often unavailable in practical applications. To address this issue, we use knowledge from heterogeneous source domains, including high-fidelity sensors that are difficult to deploy in the field but commonly used in controlled environments, to assist fatigue detection in the real-world target domain. Based on this idea, we design a heterogeneous and multi-source fatigue-detection framework that uses the available modalities in the target domain while leveraging diverse configurations in the source domains through cross-domain modality imputation based on shared modalities.

2602.15513 2026-06-18 cs.RO cs.AI Updated version

HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering

Ji Li, Bo Wang, Jing Xia, Mingyi Li, Shiyan Hu

Affiliations: The University of Hong Kong; Beijing Institute of Technology

Journal ref: IROS 2026

Abstract

Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match × SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.
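The "recall via semantic similarity" step can be pictured as a nearest-neighbor lookup over stored embeddings. The sketch below ranks memory entries by cosine similarity to a query. The hand-written two-dimensional embeddings and the `note` field are hypothetical, and the visual-reasoning verification stage from the abstract is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, memory, k=2):
    """Return the k memory entries most similar to the query embedding."""
    ranked = sorted(memory, key=lambda item: cosine(query, item["embedding"]), reverse=True)
    return ranked[:k]


# Toy episodic memory; a real system would store learned embeddings of observations.
memory = [
    {"note": "kitchen counter", "embedding": [1.0, 0.0]},
    {"note": "hallway", "embedding": [0.0, 1.0]},
    {"note": "pantry shelf", "embedding": [0.9, 0.1]},
]
print([item["note"] for item in retrieve([1.0, 0.0], memory)])
# ['kitchen counter', 'pantry shelf']
```

In a retrieval-first design, candidates returned this way would then be checked by a separate reasoning step before being reused.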