2606.06312 2026-06-05 cs.RO (updated version)

Meridian: Metric-Semantic Primitive Matching for Cross-View Geo-Localization Beyond Urban Environments

Mason Peterson, Qingyuan Li, Yixuan Jia, Fernando Cladera, Carlos Nieto-Granda, Camillo Jose Taylor, Jonathan P. How

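Affiliations: Massachusetts Institute of Technology; GRASP Laboratory, University of Pennsylvania; U.S. Army Combat Capabilities Development Command, Army Research Laboratory

AI summary: Proposes Meridian, a method that matches high-level metric-semantic primitives between aerial imagery and ground-robot RGB-D data, achieving global localization across diverse environments without area-specific training, with an average trajectory error of 2.4 m.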

Comments 9 pages, 6 figures

Abstract

Successful robot automation requires accurate global localization to support repeatability, task planning, goal specification, and safe operation. However, reliable localization in GNSS-denied environments remains an open problem. Overhead aerial imagery offers a promising solution, but existing approaches primarily target structured urban environments and have rarely been demonstrated in unstructured natural terrain. Limitations of the state of the art include a reliance on models trained for specific environments, as well as difficulty handling repetitive geometries and featureless landscapes commonly found in natural outdoor areas. To overcome these challenges, we present Meridian, a method for matching high-level metric-semantic primitives across aerial images and ground robot RGB-D camera data that achieves accurate global localization and generalizes well across diverse environments, all without any training or algorithmic fine-tuning on area-specific data. We formulate novel consistency metrics to estimate a distribution over robot submap poses and to reject outlier hypotheses in a robust pose graph optimization step for accurate robot trajectory estimation. We demonstrate that our algorithm can localize a ground robot across a wide variety of environments, including an autonomous driving dataset, a park and campus area, and a wilderness camp, with an average optimized trajectory error of 2.4 m over 19 km of ground traversal.

2606.06308 2026-06-05 cs.RO (updated version)

Attitude-Aided Linear Calibration of Triaxial Accelerometers

Yongqiang Yu, Tian Huang, Yipeng Yang

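Affiliations: Tsinghua University

AI summary: Proposes attitude-aided linear accelerometer calibration (ALAC), a method for triaxial accelerometers that builds a combined error matrix to enable linear least-squares estimation, needs only five arbitrarily oriented measurements, and is validated for accuracy and robustness in static and quasi-static experiments.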

Abstract

Triaxial MEMS accelerometers are widely used for inertial sensing, navigation, and sensor fusion, but existing calibration methods often rely on costly reference setups or nonlinear iterative optimization, limiting their efficiency and applicability to low-cost or self-calibrating systems. We present attitude-aided linear accelerometer calibration (ALAC), a method that operates on any platform providing orientation information, such as turntables, robotic arms, or inertial measurement units. ALAC constructs a combined error matrix (CEM) to represent sensor errors in a unified calibration model and enables linear least-squares estimation. The bias and gravity vector are jointly estimated, implicitly accounting for platform misalignment, and matrix decomposition of the CEM recovers scale, non-orthogonality, and alignment rotation parameters. Under static gravity, calibration is formulated as a constrained homogeneous least-squares (CHLS) problem and solved in closed form using standard linear algebra. Only five arbitrarily oriented measurements are required, and a recursive extension supports online or in-field calibration. Experiments on a stationary robot-mounted accelerometer and a quasi-static public IMU trajectory show that ALAC, in both offline and online modes, outperforms reference-based and online baselines in accuracy and robustness to sensor noise. On the same dataset, it matches iterative self-calibration under filtered conditions and surpasses all evaluated baselines on raw measurements. These results demonstrate a robust and practical calibration scheme for MEMS-based inertial platforms, especially low-cost IMUs and online calibration scenarios.
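
As a hedged illustration of why a constrained homogeneous least-squares problem can be solved in closed form, consider the textbook unit-norm case. The stacked matrix $A$ and the unknown vector $x$ are generic placeholders, and the unit-norm constraint is an assumption; the abstract does not state the actual constraint or parameterization used by ALAC.

$$
\hat{x} = \arg\min_{x} \lVert A x \rVert_2^2 \quad \text{subject to} \quad \lVert x \rVert_2 = 1,
\qquad A^\top A \, \hat{x} = \lambda_{\min} \, \hat{x}
$$

In this generic form the minimizer is the eigenvector of $A^\top A$ associated with its smallest eigenvalue $\lambda_{\min}$, or equivalently the right singular vector of $A$ with the smallest singular value, so a single SVD replaces iterative optimization.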

2606.06292 2026-06-05 cs.CV cs.RO (updated version)

CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

Yining Xing, Zehong Ke, Zhiyuan Liu, Yanbo Jiang, Wenhao Yu, Jianqiang Wang

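AI summary: Proposes CLEAR, a framework that replaces the multi-step denoising of diffusion models with a single-step conditional drift and combines the Drive-JEPA visual encoder with a fine-tuned Qwen 3.5 0.8B for semantic reasoning, achieving efficient multi-modal planning with a PDMS of 93.7 on NAVSIM v1.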

Abstract

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $α$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.

2606.06218 2026-06-05 cs.RO cs.AI (updated version)

MotionDisco: Motion Discovery for Extreme Humanoid Loco-Manipulation

Ilyass Taouil, Michal Ciebelski, Shafeef Omar, Haizhou Zhao, Angela Dai, Aaron M. Johnson, Majid Khadiv

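Affiliations: Technical University of Munich, Germany; New York University, USA; Carnegie Mellon University, USA

AI summary: Proposes MotionDisco, a framework that uses LLM-guided evolutionary search and sequential kinodynamic trajectory optimization to automatically discover long-horizon, contact-rich humanoid loco-manipulation skills from scratch and deploys them on a real robot.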

Abstract

We present MotionDisco, a framework that discovers contact-rich, long-horizon humanoid loco-manipulation motions from scratch, without relying on teleoperation or motion retargeting from human demonstrations. This is challenging because the space of possible contact interactions grows combinatorially with the task horizon and the number of objects in the scene. MotionDisco enables rapid discovery of novel skills by coupling a large language model (LLM)-guided evolutionary search over sequences of interactions with an efficient sequential kinodynamic trajectory optimizer and a pruning strategy. Through extensive ablation studies, we show that our LLM-guided search discovers successful whole-body trajectories across several challenging long-horizon tasks. Finally, by training reinforcement learning tracking policies on the discovered trajectories, we transfer the motions to a real humanoid robot. This is the first work to discover and deploy long-horizon humanoid loco-manipulation skills entirely through automated evolutionary search. Supplementary videos of the experiments are available at: https://youtu.be/DHiVz34QYlw.

2606.06130 2026-06-05 cs.RO (updated version)

L-SDPPO: Spiking Diffusion Policy Optimization for Intra-Vehicular Robotic Manipulation

Liwen Zhang, Dong Zhou, Guanghui Sun, Yifei Zheng, Yuhui Hu, Kaihong Ouyang, Zuoquan Zhao

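Affiliations: Department of Control Science and Engineering, Harbin Institute of Technology; Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong

AI summary: Proposes L-SDPPO, a framework that optimizes a spiking diffusion policy with reinforcement learning and introduces a state-dependent latency injection mechanism, achieving high success rates and low energy consumption on intra-vehicular robotic manipulation tasks.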

Abstract

Intra-vehicular robots in spacecraft help reduce astronaut workload and improve mission efficiency. Recent research focuses on using deep learning methods to achieve the precise control required for operations in these complex environments. However, objects exhibit unpredictable, unconstrained drift without gravitational damping. These factors demand robustness against complex multimodal action distributions. Diffusion policies (DP) can model these complex actions, but their iterative sampling process consumes too much energy for the limited power budgets of spacecraft. We therefore propose a low-energy intra-vehicular robotic manipulation framework, L-SDPPO, in which the Spiking Diffusion Policy (SDP) is optimized with a reinforcement learning (RL) algorithm. Furthermore, to address the insufficient perception of dynamic spatiotemporal features in microgravity, we propose the state-dependent latency injection (SDLI) mechanism, which mimics biological neural delays to dynamically regulate the timing of input information. Evaluation on five representative intra-vehicular daily tasks (e.g., hatch opening and precision container capping) shows that our method consistently achieves higher success rates and lower energy consumption than state-of-the-art robotic manipulation methods. These results demonstrate that our method is viable for intra-vehicular robotic manipulation.

2606.06041 2026-06-05 cs.RO cs.AI cs.NE (updated version)

Sample-efficient Low-level Motion Planning for Robotic Manipulation Tasks via Zero-shot Transfer Learning

Yuanzhi He, Victor Romero-Cano, José J. Patiño, Juan David Hernández, William Sawtell, Gualtiero Colombo

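Affiliations: School of Computer Science & Informatics, Cardiff University, Cardiff, UK

AI summary: Proposes the iCEM+TL framework, which uses transfer learning and reward redesign to raise success rates on complex manipulation tasks, improving them by up to 23% in simulation, with validation on a real robot.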

Comments 12 pages, 5 figures, accepted at the International Conference on Artificial Neural Networks (ICANN) 2026

Abstract

As robotic systems become more sophisticated, the growing complexity of their motion planning models and the longer training times pose substantial challenges. Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently demonstrated promising potential for low-level real-time planning by leveraging efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM's performance can be constrained in more complex scenarios, particularly those requiring stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly leverages Transfer Learning (TL), where key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. Additionally, we applied Reward Redesign (RR) through task decomposition for stacking objects and shelf placement to optimize task-specific performance. Results from the simulation show that our framework achieves success rate improvements of up to 23%. The framework is further validated on a real Franka Emika robot in a stacking task, demonstrating its practical feasibility for real-world deployment.
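
The idea of reusing planner parameters from a simpler task can be sketched with a plain cross-entropy method on a toy scalar problem. This is a minimal illustration, not the authors' iCEM+TL: the two quadratic costs, the choice of the sampling mean as the transferred parameter, and every name and constant below are hypothetical, and iCEM's own refinements are omitted.

```python
import random
import statistics


def cem_minimize(cost, mean, std, iterations, population=64, n_elite=8, seed=0):
    """Plain cross-entropy method on a scalar decision variable.

    Each iteration samples candidates from N(mean, std), keeps the
    lowest-cost elites, and refits mean and std to those elites.
    """
    rng = random.Random(seed)
    for _ in range(iterations):
        samples = [rng.gauss(mean, std) for _ in range(population)]
        elites = sorted(samples, key=cost)[:n_elite]
        mean = statistics.fmean(elites)
        # Floor the spread so sampling never collapses to a single point.
        std = max(statistics.pstdev(elites), 1e-3)
    return mean


def upstream_cost(x):
    # Hypothetical simple task: optimum at x = 2.0.
    return (x - 2.0) ** 2


def downstream_cost(x):
    # Hypothetical related, harder task: optimum shifted to x = 2.3.
    return (x - 2.3) ** 2


# Solve the upstream task from an uninformed start.
upstream_mean = cem_minimize(upstream_cost, mean=0.0, std=5.0, iterations=20)

# Transfer: seed the downstream search with the upstream solution and a
# small spread, then run only a few iterations.
warm_mean = cem_minimize(downstream_cost, mean=upstream_mean, std=0.5, iterations=3)
```

Because the downstream search starts near a good region, a handful of iterations is enough in this toy setting; how much a real manipulation task benefits depends on how related the upstream and downstream tasks are.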

2606.06040 2026-06-05 cs.RO cs.SY eess.SY (updated version)

Gotta Grow Fast: Design and Benchmarking of a Tip Mount for High-Speed Vine Robots

Antonio Alvarez Valdivia, Robert Reeve, Ankush Dhawan, Ciera McFarland, Chad Council, Margaret McGuinness, Nathaniel Hanson

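Affiliations: Massachusetts Institute of Technology; Lincoln Laboratory; Stanford University; University of Notre Dame

AI summary: Proposes a triangular roller tip mount that rolls rather than slides to reduce growth resistance, enabling consistent eversion of a TPU-coated ripstop nylon vine robot, and establishes a repeatable benchmarking framework.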

Comments Accepted to IEEE Robotics & Automation Letters

Abstract

Soft, growing vine robots extend through tip eversion, a mechanism that enables navigation through cluttered environments. However, integrating cameras and other sensors at the tip is uniquely challenging because the material forming the tip is constantly renewed as the robot grows. This continual material turnover, combined with friction between internal layers, added tip weight, and fabric constriction, complicates sensor and tool mounting. These limitations hinder the deployment of vine robots for inspection and search tasks, where rapid growth while carrying tip-mounted sensors is essential. In this work, we present a triangular roller tip mount that reduces internal resistance during growth by rolling rather than sliding against the robot body. The design was refined through iterative failure analysis, enabling, for the first time, consistent eversion on a TPU-coated ripstop nylon vine robot. To quantitatively evaluate mount performance, we introduce a custom testbed that isolates tip mounting effects by measuring tail tension during eversion. Comparative experiments across multiple mount variants, including prior designs, show that our triangular roller mount achieves the lowest tail tension and most repeatable growth performance. These results establish both a validated tip mount design and a repeatable benchmarking framework for advancing sensor and tool integration in soft growing robots. CAD for the mount and testbed is available at: https://sprout-mitll.github.io/tip_mounts/.

2606.06014 2026-06-05 cs.AI cs.RO (updated version)

T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation

Jingkun Feng, Reza Sabzevari

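Affiliations: P4MARS Lab at the Faculty of Aerospace Engineering, Delft University of Technology

AI summary: Proposes T-FunS3D, which builds an open-vocabulary scene graph and uses a vision-language model for task-driven hierarchical 3D functionality segmentation, running faster and using less memory while maintaining performance.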

Abstract

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to state-of-the-art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.

2606.05960 2026-06-05 cs.RO (updated version)

Towards a Data Flywheel for Embodied Intelligence in Logistics

Anlan Yu, Zaishu Chen, Zhiqing Hong, Daqing Zhang

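Affiliations: Peking University; JD Logistics; HKUST (Guangzhou)

AI summary: Proposes a data-driven framework for embodied intelligence in logistics that builds a data flywheel to turn daily operations into reusable data assets, uses a world model to generate reliable supervision for long-tail parcel handling, and integrates multimodal data for continual policy improvement.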

Amortized Nonlinear Model Predictive Control

Francesco Pillitteri, Alberto Bemporad

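Affiliations: IMT School for Advanced Studies

AI summary: For input-affine nonlinear systems, proposes a single-network residual-corrector architecture built on a state-dependent quadratic program, with a differentiable interior-point layer that guarantees constraint satisfaction, enabling real-time nonlinear model predictive control with orders-of-magnitude speedup on a robotic-arm tracking task.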

Comments 6 pages

Abstract

Nonlinear Model Predictive Control requires solving a constrained nonlinear program (NLP) in real-time at every sampling instant, a computational bottleneck that limits deployment on resource-constrained hardware or at high sampling rates. We address this challenge for the broad class of input-affine nonlinear systems to show that the optimal control move can be approximated by a state-dependent quadratic program (QP) whose cost parameters depend on the current state and reference. We propose a single-network residual-corrector architecture: a state-dependent analytic baseline provides initial QP parameters, and the network learns only the corrections needed to match the full NLP solution; the QP is solved by a differentiable interior-point layer, guaranteeing constraint satisfaction for the first control action. The network is trained offline on data generated by an NLP solver using a hybrid loss that combines supervised imitation and KKT-residual penalties. We validate the approach on a three-link planar robotic arm with Cartesian end-effector tracking, demonstrating orders-of-magnitude speedup over the NLP solver while maintaining comparable tracking performance.
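
One way to write down the structure described above is the following sketch. All symbols are assumptions introduced for illustration: $x$ is the current state, $r$ the reference, $(H_0, f_0)$ the analytic baseline, $(\Delta H_\theta, \Delta f_\theta)$ the network corrections, $G u \le h$ a generic linear constraint set, $r_{\mathrm{KKT}}$ a KKT residual, and $\lambda$ a penalty weight. The additive form of the correction and the exact loss terms are not given in the abstract.

$$
u_{\mathrm{QP}}(x, r) = \arg\min_{u} \; \tfrac{1}{2} u^\top H(x, r)\, u + f(x, r)^\top u
\quad \text{subject to} \quad G u \le h,
$$

$$
H = H_0(x, r) + \Delta H_\theta(x, r), \qquad f = f_0(x, r) + \Delta f_\theta(x, r),
$$

$$
\mathcal{L}(\theta) = \lVert u_{\mathrm{QP}} - u_{\mathrm{NLP}} \rVert_2^2 + \lambda \, \lVert r_{\mathrm{KKT}} \rVert_2^2 .
$$

Under these assumptions the network only has to learn the gap between the baseline QP and the NLP solution $u_{\mathrm{NLP}}$, while the QP layer enforces the constraints on the applied control move.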

2606.05773 2026-06-05 cs.RO (updated version)

PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

Chong Ma, Taiyi Su, Jian Zhu, Jianjun Zhang, Zitai Huang, Yi Xu, Hanli Wang

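Affiliations: Tongji University; AIRC, Midea Group

AI summary: Proposes PiL-World, a chunk-wise world model that alternates VLA inference with world-model prediction to enable closed-loop evaluation without real robot execution, substantially reducing the error in estimated success rates.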

Abstract

Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
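
The alternation between policy inference and world-model prediction can be sketched as a simple loop. This is a toy illustration of the closed-loop control flow only: the scalar observation, `toy_policy`, `toy_world_model`, and `GOAL` are hypothetical stand-ins, not the VLA policy or the video world model from the paper.

```python
def closed_loop_rollout(policy, world_model, observation, num_chunks):
    """Alternate policy inference and world-model prediction.

    Each action chunk is conditioned on the observation the world model
    produced for the previous chunk, not on a pre-recorded trajectory.
    """
    observations = [observation]
    for _ in range(num_chunks):
        action_chunk = policy(observation)
        observation = world_model(observation, action_chunk)
        observations.append(observation)
    return observations


GOAL = 1.0  # hypothetical 1-D target position


def toy_policy(position):
    # Stand-in for a VLA policy: a chunk of 4 equal steps that together
    # cover half of the remaining distance to the goal.
    step = (GOAL - position) / 8.0
    return [step] * 4


def toy_world_model(position, action_chunk):
    # Stand-in for the learned world model: integrates the commanded steps.
    return position + sum(action_chunk)


rollout = closed_loop_rollout(toy_policy, toy_world_model, 0.0, num_chunks=5)
# rollout == [0.0, 0.5, 0.75, 0.875, 0.9375, 0.96875]
```

In open-loop evaluation the action sequence would be fixed in advance; here each chunk depends on the imagined outcome of the previous one, which is the property the abstract identifies as necessary for policy-in-the-loop evaluation.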

2606.05737 2026-06-05 cs.CV cs.AI cs.LG cs.RO (updated version)

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu

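Affiliations: University of Science and Technology of China; Shanghai Innovation Institute; Fudan University

AI summary: For vision-language-action (VLA) models, proposes biasing the training time distribution toward high-noise states, enabling one-step action generation without a teacher model, distillation, or auxiliary objectives, with performance matching ten-step decoding.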

Comments 20 pages, 10 figures

Abstract

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
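
Biasing the training time distribution toward high-noise states can be sketched as follows. The power-law sampler, the `bias` parameter, and the convention that t = 1 means pure noise are assumptions made for illustration; the abstract does not specify the schedule the authors use.

```python
import random


def sample_training_time(rng, bias=3.0):
    """Draw a training time t in [0, 1], skewed toward t = 1.

    With u ~ Uniform(0, 1), t = u ** (1 / bias) has density
    bias * t ** (bias - 1), so bias = 1 recovers the uniform schedule
    and larger values put more mass near t = 1.
    """
    return rng.random() ** (1.0 / bias)


def noisy_action(action, noise, t):
    # Linear interpolation path, assuming t = 1 is pure noise.
    return [(1.0 - t) * a + t * n for a, n in zip(action, noise)]


rng = random.Random(0)
times = [sample_training_time(rng) for _ in range(10000)]
mean_time = sum(times) / len(times)  # theoretical mean is bias / (bias + 1)
```

With `bias=3.0` the theoretical mean training time is 0.75, compared with 0.5 for a uniform schedule, so the model sees high-noise inputs more often, which is the regime a one-step sampler starts from.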

2606.05699 2026-06-05 cs.RO (updated version)

DexFuture: Hierarchical Future-State Visuomotor Targeting for Bimanual Dexterous Tool Use

Runfa Blark Li, Kuang-Ting Tu, Nikola Raicevic, Dwait Bhatt, Xinshuang Liu, Keito Suzuki, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen

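Affiliations: UC San Diego

AI summary: Proposes DexFuture, a hierarchical system that pairs a high-level future-state visuomotor target predictor with a low-level target-conditioned structured dexterous policy for bimanual dexterous tool use, reaching 90% of privileged-oracle performance and running at 60 Hz, about 250 times faster than DexWM-style CEM planning.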

Abstract

Bimanual dexterous tool use remains challenging for robots due to high-dimensional hand configurations and complex hand-tool-object dynamics and contact. Most existing control policies depend on future configuration references provided from demonstrations, while future action-conditioned world models require slow online planning over high-dimensional action sequences. A significant challenge is generating a dynamically consistent future reference trajectory without relying on privileged states from demonstrations or slow counterfactual planning. We propose DexFuture, a hierarchical system that couples a high-level Future-State Visuomotor Target Predictor with a low-level Target-Conditioned Structured Dexterous Policy. Conditioned on egocentric RGB, proprioceptive and geometric history, the high-level predictor constructs structured hand-tool-object visuomotor embeddings and uses a horizon-conditioned transformer to generate a multi-step future target trajectory. Then, the low-level policy tracks them with a target-conditioned per-link transformer. This hierarchy decouples coarse future reference generation from fine-grained action control, and slow long-horizon semantic prediction from high-frequency execution. On OakInk2 bimanual tool-use tasks, DexFuture achieves 90% of the privileged-oracle performance, compared to 7% for a no-reference policy. DexFuture operates at 60 Hz, approximately 250 times faster than DexWM-style Cross-Entropy Method (CEM) planning with a future action-conditioned world model.

2606.05687 2026-06-05 cs.RO cs.SY eess.SY (updated version)

Accelerating and Scaling MPC-Guided Reinforcement Learning for Humanoid Locomotion and Manipulation

Junheng Li, Liang Wu, Sergio A. Esteban, Lizhi Yang, Ján Drgoňa, Aaron D. Ames

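Affiliations: California Institute of Technology; Johns Hopkins University

AI summary: Proposes an MPC-RL framework built on a centroidal-dynamics MPC reward and develops $π^n$MPC, a parallel batched GPU solver, to efficiently learn humanoid locomotion and manipulation skills.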

Comments 8 pages, 5 figures

Abstract

In humanoid motion control, model predictive control (MPC) offers physically grounded prediction and constraint handling, while reinforcement learning (RL) enables robust whole-body skills through large-scale simulation. However, using MPC inside RL often requires time-consuming problem construction or excessive training overhead, making such frameworks difficult to justify in practice. This work studies efficient training-time MPC guidance for humanoid locomotion and manipulation, termed MPC-RL. We introduce a centroidal-dynamics MPC reward formulation that leverages guidance from MPC trajectories at training time. To make this practical in massively parallel RL, we develop $π^n$MPC, a parallel-in-horizon and construction-free batched GPU MPC solver that operates directly on time-varying dynamics to avoid high memory usage and pre-compilation. Through a variety of comparative studies and hardware validations, we find that MPC-RL achieves superior performance in locomotion and manipulation skills. The code base is available at https://github.com/junhengl/mpc-rl.

2606.05669 2026-06-05 cs.RO cs.SY eess.SY (updated version)

Dynamic Multi-Agent Pickup and Delivery in Robotic Cellular Warehousing Systems

Cheng Ren, Ming Li, Xinping Guan, George Q. Huang

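Affiliations: Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University; School of Automation and Intelligent Sensing, Shanghai Jiao Tong University

AI summary: For warehouse scenarios in which SKUs are dynamically appended within an order, formalizes the dynamic multi-agent pickup and delivery problem for the first time and proposes two event-triggered online replanning algorithms based on token passing that significantly reduce order flowtime.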

Abstract

Robotic Cellular Warehousing Systems (RCWS) give rise to multi-agent pickup and delivery (MAPD) processes in which robots sequentially collect multiple stock-keeping units (SKUs) for each order. Unlike classical MAPD formulations that assume static tasks, real warehouse operations often involve dynamic order evolution, where new SKUs may be appended to an order while it is being executed. Motivated by this practical requirement, this letter formulates the Dynamic Multi-Agent Pickup and Delivery problem considering internal order evolution for the first time. Building on the token passing paradigm, we propose two event-triggered online replanning algorithms. The first, Dynamic Token Passing, performs localized replanning upon order updates through add-order decomposition and priority-based token scheduling while preserving collision-free execution. The second, Cooperative Token Passing, further enables idle robots to opportunistically assist newly added pickups, improving system-level efficiency. Simulation results in RCWS environments demonstrate that the proposed methods significantly reduce order flowtime compared with static and non-cooperative baselines.

2606.05663 2026-06-05 cs.RO (updated version)

Wave Focusing in Metamaterials: Tactile Displays Beyond the Diffraction Limit

Gregory Reardon, Max Linnander, Dustin Goetz, Neeli Tummala, Yon Visell

Affiliations: Media Arts and Technology Program; Department of Mechanical Engineering; Department of Electrical and Computer Engineering; University of California, Santa Barbara

AI summary: Uses the slow-wave branch of a locally resonant metamaterial plate to focus mechanical waves beyond the diffraction limit, generating high-resolution virtual tactile pixels and reducing pixel area tenfold.

AI中文摘要

我们解决了工程化分布式触觉显示器的挑战，该显示器能够在表面上任意位置再现多个局部化、可独立寻址的振动——代表虚拟触觉像素。我们的技术基于使用稀疏的致动器阵列在弯曲板中聚焦机械波。在触觉频率下，波衍射阻止了在多指触摸交互相关空间尺度上形成局部化虚拟触觉像素。我们通过在板上增加机械共振器晶格，形成局部共振超材料板，克服了这一限制。板的动态模式与共振器模式之间的耦合改变了控制波传播的色散关系，引入了一个慢波分支，使得能够超越未修改板所施加的衍射极限进行聚焦。我们使用数值模拟来设计超材料系统的色散关系，以实现触觉频率下的高分辨率聚焦。然后，我们制造了一个超材料触觉显示器，并实验证明虚拟像素比在没有共振器的相同板上生成的像素更加局部化，导致虚拟像素面积缩小十倍。在行为实验中，我们展示了该系统能够传递感知上局部化的单点和多点触觉反馈以及移动触觉源，同时保持对多个显示位置的时间波形的独立控制。这里报告的方法可以使用少量致动自由度实现高分辨率触觉显示器，适用于广泛应用。

英文摘要

We address the challenge of engineering distributed haptic displays capable of reproducing multiple localized, independently addressable vibrations -- representing virtual tactile pixels -- at arbitrary locations on a surface. Our technique is based on the focusing of mechanical waves in a flexural plate using a sparse set of actuators. At tactile frequencies, wave diffraction prevents the formation of localized virtual tactile pixels at spatial scales relevant for multi-digit touch interactions. We overcome this limitation by augmenting the plate with a lattice of mechanical resonators, forming a locally resonant metamaterial plate. Coupling between the plate's dynamic modes and those of the resonators alters the dispersion relation governing wave transmission, introducing a slow-wave branch that enables focusing beyond the diffraction limit imposed by the unmodified plate. We use numerical simulations to engineer the dispersion relation of the metamaterial system for high-resolution focusing at tactile frequencies. We then fabricate a metamaterial tactile display and experimentally demonstrate virtual pixels that are far more localized than those generated on an otherwise identical plate without resonators, resulting in a tenfold reduction in virtual-pixel area. In behavioral experiments, we show that this system can deliver perceptually localized single- and multi-point tactile feedback and moving tactile sources while maintaining independent control over temporal waveforms at multiple display locations. The methods reported here can enable high-resolution haptic displays for widespread applications using a small number of actuated degrees of freedom.

2606.05533 2026-06-05 cs.LG cs.AI cs.CV cs.RO 版本更新

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

物体能做什么，而非它们是什么：面向功能可供性推理的功能潜在空间

Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

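Affiliations: The University of Texas at Austin; Neurosymbolic Intelligence; University of Colorado Boulder

AI summary: Proposes the A4D framework, which builds a shared latent space organized around affordances, maps visual observations into it, and measures their distance to affordances. This lets planning reason about what objects enable rather than how they look, markedly improving generalization and inference efficiency.

Comments Code, videos, and data available at: https://A4Dance-reasoning.github.io
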
AI中文摘要

现有的机器人规划系统依赖于基于外观的推理，其中视觉观察被编码到围绕物体外观组织的潜在空间中（例如，根据外观识别“手推车”）。然而，规划需要推理物体的任务相关功能（例如，物体是否“可移动”），而基于外观的潜在空间无法捕捉这些信息。因此，现有方法难以泛化到新颖的机器人-物体交互。我们通过功能可供性推理解决这一泛化能力有限的问题，使规划基于任务相关的物体功能而非仅外观。我们提出A4D，它将视觉观察映射到一个围绕可供性（例如“可移动”）组织的共享潜在空间中。通过将视觉观察投影到这个功能潜在空间并测量它们与可供性的接近程度，A4D推断出与观察物体相关的功能。此外，我们引入了一种可供性发现机制，扩展潜在空间以处理现有可供性不足的未见场景。A4D利用功能潜在空间中的接近度来量化可供性推理的不确定性，并选择性地触发可供性发现。我们在涉及多样化和未见可供性的多个规划任务上评估A4D。A4D在现有可供性上达到94%的推理准确率，比最先进方法高出超过15个百分点；在不到原始训练数据10%的情况下，将新可供性推理准确率从70%提升到90%以上，并实现100倍更快的推理。代码、视频和数据可在https://A4Dance-reasoning.github.io获取。

英文摘要

Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with fewer than 10% of the original training data, and enables 100x faster inference. Code, videos, and data available at: https://A4Dance-reasoning.github.io.

2606.05501 2026-06-05 cs.RO 版本更新

Learning Contact Representation for Leg Odometry

学习足式里程计的接触表示

Emre Girgin, Cagri Kilic

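Affiliations: Department of Aerospace Engineering, Embry Riddle Aeronautical University

AI summary: Proposes a self-supervised representation learning framework that detects contact using only the standard sensor set of joint encoders, with no force sensors, and outperforms supervised methods and baseline probabilistic approaches in legged-robot odometry.

Comments 17 pages
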
AI中文摘要

足式机器人里程计的估计依赖于一个假设：在支撑相期间，足部相对于世界的速度保持为零。主体速度的反馈来自足部的运动学串行链，因此准确的腿部相位检测是一个关键子问题。大量研究使用安装在足尖的地面反作用力传感器进行分类，但这些传感器可能并非所有足式机器人普遍可用。此外，这些传感器通常对未考虑的干扰（如足部与地面接触时的滑动）不敏感。在本研究中，我们提出了一种用于接触检测的自监督表示学习框架，该框架利用关节编码器的标准传感器集，无需依赖力传感器增强。我们使用学习到的表示来概率性地建模支撑相和摆动相。实验结果证实了所提出的自监督接触检测器的有效性。我们的框架在性能上优于需要传感器集增强和标注的监督方法以及基线概率方法。此外，我们将代码公开。

英文摘要

The estimation of odometry in legged robots depends on the assumption that the velocity of the foot with respect to the world remains zero during the stance phase. Feedback for the main body velocity is derived from the kinematic serial chain of the feet, making accurate leg phase detection a critical subproblem. A considerable number of studies employ ground reaction force sensors mounted at the tip of the foot to classify leg phases, yet these sensors may not be universally available for all legged robots. Additionally, these sensors are often unresponsive to unaccounted disturbances, such as slippage, while the foot remains in contact with the ground. In this study, we propose a self-supervised representation learning framework for contact detection that utilizes the standard sensor set of joint encoders without reliance on force sensor augmentations. We employ learned representations to model the stance and swing phases probabilistically. The experimental results obtained confirm the efficacy of the proposed self-supervised contact detector. Our framework exhibited superior performance in comparison to supervised methods, which necessitate sensor set augmentation and labeling, as well as baseline probabilistic approaches. Additionally, we make our code available to the public.
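
The zero-velocity stance assumption in the abstract above can be illustrated with a small sketch. It is not the authors' code: it assumes a hypothetical planar two-link leg with made-up link lengths, and the function names are illustrative. If the stance foot is stationary in the world, the body velocity (expressed in the body frame) is the negative of the foot velocity relative to the body, including the contribution of the body's angular rate.

```python
import math

def foot_position(q1, q2, l1=0.5, l2=0.5):
    """Foot position in the body frame (x forward, y up) for a planar 2-link leg."""
    x = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    y = -(l1 * math.cos(q1) + l2 * math.cos(q1 + q2))
    return x, y

def foot_velocity(q1, q2, dq1, dq2, l1=0.5, l2=0.5):
    """Foot velocity relative to the body, J(q) * dq, from the analytic Jacobian."""
    c1, c12 = math.cos(q1), math.cos(q1 + q2)
    s1, s12 = math.sin(q1), math.sin(q1 + q2)
    vx = (l1 * c1 + l2 * c12) * dq1 + l2 * c12 * dq2
    vy = (l1 * s1 + l2 * s12) * dq1 + l2 * s12 * dq2
    return vx, vy

def body_velocity_from_stance(q1, q2, dq1, dq2, omega=0.0):
    """Body velocity implied by a stance foot with zero world velocity.

    Solves 0 = v_body + omega x p + J(q) dq for v_body, where the planar
    cross product is omega x p = omega * (-p_y, p_x).
    """
    px, py = foot_position(q1, q2)
    vx, vy = foot_velocity(q1, q2, dq1, dq2)
    return -(vx - omega * py), -(vy + omega * px)
```

The estimate is only valid while the leg is actually in stance, which is why the contact (leg phase) detector matters: feeding a swing-phase leg into this formula would report the swing motion as body motion.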

2606.05491 2026-06-05 cs.CV cs.RO 版本更新

Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers

无配对RGB-热成像高斯泼溅使用视觉几何变换器

Jean Cordonnier, Chenghao Xu, Olga Fink, Malcolm Mielle

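Affiliations: Ecole Polytechnique Federale de Lausanne; Schindler EPFL Lab

AI summary: Proposes an unpaired RGB-thermal novel view synthesis framework that uses VGGT to estimate camera poses for each modality, aligns them with Procrustes, and performs joint reconstruction with multi-modal 3D Gaussian Splatting, achieving thermal view synthesis while maintaining RGB fidelity.

Comments Accepted at ICRA 2026's Workshop MM-SpatialAI: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding
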
AI中文摘要

结合RGB和热成像的多模态新视角合成（NVS）能够利用视觉和热信息进行精确的3D场景重建。然而，现有方法通常依赖于精确校准的RGB-热成像图像对或立体设置，限制了可扩展性和实际部署。为了解决这个问题，我们引入了一个无配对RGB-热成像NVS框架，该框架利用VGGT（一种3D前馈变换器架构）独立估计每个模态的相机位姿。然后使用Procrustes算法与跨模态特征匹配器对齐位姿集，从而无需配对校准即可实现联合配准。在此对齐基础上，我们进一步提出了一种多模态3D高斯泼溅方法，直接从无配对的RGB和热成像图像中学习。在多种场景上的实验表明，我们的方法在热成像视图合成中取得了有竞争力的性能，同时保持了RGB保真度。此外，我们表明现有的重建方法可能产生缺乏跨模态一致性的特定模态重建。因此，我们引入了一个基准框架，以严格评估每个模态的图像合成以及重建场景的多模态一致性。

英文摘要

Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
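
The Procrustes alignment step mentioned in the abstract above can be sketched in a simplified form. This is a planar (2D) illustration with known point correspondences, not the paper's 3D pipeline; in the paper the correspondences come from a cross-modal feature matcher, which is not modeled here. The function names are hypothetical.

```python
import math

def procrustes_2d(src, dst):
    """Least-squares similarity transform (scale, angle, translation) mapping src onto dst.

    src and dst are equal-length lists of (x, y) points with known correspondences.
    """
    n = len(src)
    mean_s = (sum(p[0] for p in src) / n, sum(p[1] for p in src) / n)
    mean_d = (sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n)
    dot = cross = norm = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay = sx - mean_s[0], sy - mean_s[1]  # centered source point
        bx, by = dx - mean_d[0], dy - mean_d[1]  # centered target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
        norm += ax * ax + ay * ay
    theta = math.atan2(cross, dot)           # optimal rotation angle
    scale = math.hypot(dot, cross) / norm    # optimal isotropic scale
    c, s = math.cos(theta), math.sin(theta)
    tx = mean_d[0] - scale * (c * mean_s[0] - s * mean_s[1])
    ty = mean_d[1] - scale * (s * mean_s[0] + c * mean_s[1])
    return scale, theta, (tx, ty)

def apply_similarity(scale, theta, t, p):
    """Apply the recovered transform to a single point."""
    c, s = math.cos(theta), math.sin(theta)
    return (scale * (c * p[0] - s * p[1]) + t[0],
            scale * (s * p[0] + c * p[1]) + t[1])
```

The scale term matters in this setting because poses estimated independently per modality generally live in coordinate frames with different, arbitrary scales.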

2606.05468 2026-06-05 cs.RO 版本更新

FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

FlowPRO：通过近端偏好优化对流匹配VLA进行无奖励强化微调

Yihao Wu, He Zhang, Junbo Tan, Xueqian Wang, Zhengyou Zhang

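Affiliations: Tencent Robotics X; Futian Laboratory; Tsinghua University

AI summary: Proposes the FlowPRO framework, which combines proximalized preference optimization (RPRO) with an intervention-and-rollback data collection scheme for reward-free offline reinforced fine-tuning, attaining the highest success rate on four long-horizon bimanual tasks.
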
AI中文摘要

将视觉-语言-动作（VLA）模型后训练为可在真实机器人上可靠部署的策略仍然是一个主要瓶颈。SFT和DAgger仅间接利用失败信号，而基于奖励的强化学习则受限于真实世界奖励设计的难度以及训练可靠评论家的困难。我们提出FlowPRO，一种针对流匹配VLA的无奖励离线强化微调框架。在算法上，我们提出RPRO（机器人流匹配近端偏好优化），一种针对VLA模型流匹配动作头定制的偏好优化目标。RPRO将对比优化器与显式近端正则化器配对，该正则化器锚定隐式奖励的绝对幅度，从而消除了普通Flow-DPO的奖励黑客失败模式。在数据方面，一种遥操作干预-回滚范式通过单个操作员动作在真实机器人上自然产生成对的正负轨迹$(τ^w, τ^l)$；平滑插值过程结合批量混合，然后将这些稀疏修正转换为密集的每状态监督，同时保留基础策略的能力。在四项长时程双臂任务上，FlowPRO取得了最高成功率，优于四个代表性基线，消融实验证实了每个损失组件的贡献。

英文摘要

Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. SFT and DAgger exploit failure signals only indirectly, and reward-based RL is bottlenecked by the difficulty of real-world reward design and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. Algorithmically, we propose RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the flow-matching action head of VLA models. RPRO pairs a contrastive optimizer with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward, thereby eliminating the reward-hacking failure mode of plain Flow-DPO. On the data side, a teleoperated intervention-and-rollback paradigm produces naturally paired positive and negative trajectories $(τ^w, τ^l)$ on a real robot from a single operator action; a Smooth Interpolation procedure, combined with batch mixing, then converts these sparse corrections into dense per-state supervision while preserving the base policy's capabilities. On four long-horizon bimanual tasks, FlowPRO attains the highest success rate, outperforming four representative baselines, and ablations confirm the contribution of each loss component.

2606.05437 2026-06-05 cs.RO cs.CV 版本更新

Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation

不确定性感知的自适应传感器融合用于自主导航

Simegnew Yihunie Alaba, Yuichi Motai

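Affiliations: IEEE

AI summary: Proposes a hybrid deep learning method incorporating an unscented Kalman filter (UKF) that adaptively fuses visual and inertial features in an uncertainty-aware manner, improving pose estimation accuracy for visual-inertial odometry (VIO) in autonomous navigation.

Comments 13 pages
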
AI中文摘要

Efficient Computation of Distance Functions for Navigation Vector Fields in Lie Groups

Vinicius M. Gonçalves, João Baião, Felipe Bartelt, Douglas G. Macharet, Gustavo M. Freitas, Héctor Azpúrua, Luciano C. A. Pimenta

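Affiliations: University of São Paulo

AI summary: For vector-field-based path tracking in Lie groups, proposes an efficient method that exploits the structure of G-polynomial curves to reduce distance computation to polynomial root-finding, significantly cutting computation time while maintaining accuracy.
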
AI中文摘要

基于向量场的方法被广泛用于机器人控制，并常应用于路径跟踪问题。一些向量场方法需要重复计算机器人配置与曲线之间的距离以及相应的最近点。最近，向量场已被扩展到李群。在这种情况下，这种计算可能非常昂贵，尤其是在嵌入式平台上以高控制频率执行时。本文提出了一种高效计算点与曲线之间距离的方法，该曲线表示为所谓的G-多项式曲线，这是一种将多项式曲线推广到矩阵李群的曲线表示。所提出的方法利用这些曲线的结构，将问题简化为少量多项式求根计算。仿真结果表明，与现有的基于优化的方法相比，该方法在保持精度的同时显著减少了计算时间。还提供了SE(3)群情况下的实用公式，并在机器人机械臂上进行了实验验证。该方法已在一个计算包中实现，可在线获取。

英文摘要

Vector-field-based methods are widely used for robot control and are often applied to the path-tracking problem. Some vector field approaches require repeatedly computing the distance between the robot configuration and the curve, as well as the corresponding closest point. Recently, vector fields have been extended to Lie Groups. In this case, this computation can be expensive, especially when performed at high control frequencies on embedded platforms. This paper proposes a method for efficiently computing the distance between a point and a curve represented as what is called a G-polynomial curve, which is a curve representation that generalizes polynomial curves to matrix Lie groups. The proposed approach exploits the structure of these curves to reduce the problem to a small number of polynomial root-finding computations. Simulation results show that the method significantly reduces computation time while maintaining accuracy compared to existing optimization-based approaches. Practical formulas are also provided for the case of the group SE(3), and the method is validated experimentally on a robotic manipulator. The methodology is implemented in a computational package, available online.
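
The idea of reducing a point-to-curve distance query to polynomial root-finding can be sketched in the ordinary Euclidean case. This is only an analogue, not the paper's method for G-polynomial curves on matrix Lie groups: it assumes a polynomial curve c(t) on t in [0, 1], and finds the roots of (c(t) - p) · c'(t) by sign-change bracketing and bisection. All names are hypothetical.

```python
def poly_eval(coeffs, t):
    """Evaluate a polynomial (coefficients in ascending order) with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result

def poly_deriv(coeffs):
    """Coefficients of the derivative polynomial."""
    return [i * c for i, c in enumerate(coeffs)][1:] or [0.0]

def poly_mul(a, b):
    """Product of two polynomials."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def poly_add(a, b):
    """Sum of two polynomials of possibly different degree."""
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0.0) + (b[i] if i < len(b) else 0.0)
            for i in range(n)]

def closest_point_on_curve(curve, point, samples=200, tol=1e-12):
    """Return (t, distance) of the curve point on [0, 1] closest to `point`.

    `curve` holds one ascending coefficient list per coordinate.
    """
    # g(t) = (c(t) - p) . c'(t); its roots are the stationary points of the distance.
    g = [0.0]
    for coeffs, p in zip(curve, point):
        shifted = [coeffs[0] - p] + list(coeffs[1:])
        g = poly_add(g, poly_mul(shifted, poly_deriv(coeffs)))
    candidates = [0.0, 1.0]  # interval endpoints are always candidates
    grid = [i / samples for i in range(samples + 1)]
    for a, b in zip(grid, grid[1:]):
        fa, fb = poly_eval(g, a), poly_eval(g, b)
        if fa == 0.0:
            candidates.append(a)
        elif fa * fb < 0.0:
            while b - a > tol:  # bisection on the bracketed root
                m = 0.5 * (a + b)
                fm = poly_eval(g, m)
                if fa * fm <= 0.0:
                    b = m
                else:
                    a, fa = m, fm
            candidates.append(0.5 * (a + b))

    def dist(t):
        return sum((poly_eval(c, t) - p) ** 2 for c, p in zip(curve, point)) ** 0.5

    best_t = min(candidates, key=dist)
    return best_t, dist(best_t)
```

Because only a handful of stationary points need to be compared, a query like this avoids running an iterative optimizer at every control step, which is the kind of saving the abstract describes.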

2606.05254 2026-06-05 cs.LG cs.CV cs.RO 版本更新

Flash-WAM: Modality-Aware Distillation for World Action Models

Flash-WAM：面向世界动作模型的模态感知蒸馏

Arman Akbari, Ci Zhang, Arash Akbari, Lin Zhao, Yixiao Chen, Weiwei Chen, Xuan Zhang, Geng Yuan, Yanzhi Wang

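Affiliations: Northeastern University; University of Georgia; EmbodyX Inc.

AI summary: When world action models jointly generate video and robot actions, distillation breaks down because the noise distributions of the modalities are asymmetric. Flash-WAM, a modality-aware step-distillation framework, chooses a parametrization matched to each modality's noise regime, enabling single-step inference and a large speedup.
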
AI中文摘要

世界动作模型（WAMs）通过迭代扩散联合生成未来视频和机器人动作，在操作基准上表现出色，但需要数十个去噪步骤，这一成本阻碍了实时控制。步蒸馏已成为自然的补救措施，但现成的方法在联合视频-动作设置中失效，因为视频和动作流使用不同的信噪比偏移噪声调度，并以显著不同的边际噪声分布到达训练，这种不对称性是单模态蒸馏方法无法处理的。我们提出Flash-WAM，一个受一致性蒸馏启发的模态感知步蒸馏框架，为每个模态选择一致性函数以匹配其噪声机制：针对动作流的低噪声机制采用线性梯度缩放参数化，针对视频流的高噪声机制采用方差保持参数化，该框架基于对一致性函数族的结构分析，该分析刻画了在一致性边界条件下可实现的梯度缩放。在LingBot-VA上实例化，Flash-WAM将每个模态的推理压缩到单步。在RoboTwin 2.0上，这将每个块延迟从8.1秒减少到NVIDIA L40S上的348毫秒，实现了23倍的加速，从而支持实时推理。Flash-WAM在模拟基准上保持了任务成功率（RoboTwin 2.0上85.5%，LIBERO上95.7%），并大幅恢复了真实世界性能（Unitree G1人形机器人上平均60%），而朴素的一致性蒸馏在相同步预算下降至24%。

英文摘要

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from $8.1$ seconds to $348$ ms on NVIDIA L40S, a $23{\times}$ speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks ($85.5\%$ RoboTwin 2.0, $95.7\%$ LIBERO) and substantially recovers real-world performance ($60\%$ average on a Unitree G1 humanoid robot), while naive consistency distillation drops to $24\%$ at the same step budget.

2606.05248 2026-06-05 cs.RO 版本更新

Inverse Manipulation through Symbolic Planning and Residual Operator Learning

通过符号规划与残差算子学习的逆操作

Yigit Yildirim, Giuseppe Rauso, Riccardo Caccavale, Alberto Finzi

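Affiliations: University of Bologna

AI summary: Proposes a hybrid framework combining STRIPS-like symbolic planning with residual reinforcement learning for inverse manipulation, and validates on the ManiSkill3 PushCube task that an approximate symbolic inverse can be turned into a physically grounded inverse skill.

Comments To be presented in PlanRob26
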
AI中文摘要

逆推机器人任务需要的不仅仅是逆转符号状态转换或回放运动轨迹。在机器人操作任务中，在连续交互动力学下，符号逆计划通常无法完全恢复正向执行的效果。我们提出了一种用于逆操作的混合框架，该框架通过软几何谓词从演示中自动提取STRIPS-like算子，并推导出逆技能目标。对于每个提取的算子，我们构建一个逆恢复目标，该目标保留前提条件、恢复删除效果并否定添加效果。任务规划器首先尝试使用可用的动作原语来满足该目标。未解决的符号谓词随后引出一个残差算子学习问题，通过强化学习（RL）解决。我们在ManiSkill3 PushCube任务上评估了该框架。对于正向推动技能，符号逆操作执行粗略的抓取-放置恢复，而残差Soft Actor-Critic策略则细化立方体姿态以满足剩余的逆谓词。我们的结果表明，谓词导出的残差控制可以将近似的符号逆操作转化为物理上可行的逆技能。

英文摘要

Inverting a robotic task requires more than reversing symbolic state transitions or rewinding motor trajectories. In robot manipulation tasks, symbolic inverse plans often fail to fully restore the effects of forward executions under continuous interaction dynamics. We present a hybrid framework for inverse manipulation that derives inverse-skill objectives from STRIPS-like operators automatically extracted from demonstrations through soft geometric predicates. For each extracted operator, we construct an inverse restoration objective that preserves preconditions, restores delete effects, and negates add effects. A task planner first attempts to satisfy this objective using available action primitives. Unresolved symbolic predicates then induce a residual operator learning problem solved through Reinforcement Learning (RL). We evaluate the framework on the ManiSkill3 PushCube task. For a forward pushing skill, the symbolic inverse performs a coarse pick-and-place restoration, while a residual Soft Actor-Critic policy refines the cube pose to satisfy the remaining inverse predicates. Our results show that predicate-derived residual control can turn an approximate symbolic inverse into a physically grounded inverse skill.
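
The inverse restoration objective described in the abstract above (preserve preconditions, restore delete effects, negate add effects) can be sketched with plain sets. This is an illustrative reading of that one sentence, not the authors' implementation; the `Operator` class, the predicate strings, and the `unresolved` helper are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    """A STRIPS-like operator with set-valued preconditions and effects."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

def inverse_objective(op):
    """Build the goal for undoing `op`.

    Predicates that must hold: the preconditions plus the restored delete effects.
    Predicates that must not hold: the add effects.
    """
    must_hold = op.preconditions | op.delete_effects
    must_not_hold = frozenset(op.add_effects)
    return must_hold, must_not_hold

def unresolved(state, objective):
    """Predicates of the objective that a given symbolic state still violates."""
    must_hold, must_not_hold = objective
    missing = must_hold - state
    lingering = {f"not {p}" for p in must_not_hold & state}
    return missing | lingering

# Hypothetical forward push skill for a cube.
push = Operator(
    name="push(cube)",
    preconditions=frozenset({"on_table(cube)", "at(cube, start)"}),
    add_effects=frozenset({"at(cube, goal)"}),
    delete_effects=frozenset({"at(cube, start)"}),
)
```

In this reading, whatever `unresolved` returns after the symbolic planner has done its coarse restoration would be the part left for the learned residual policy.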

2606.05236 2026-06-05 cs.RO cs.LG 版本更新

A New Quaternion-Joint Cable-Driven Redundant Manipulator Configuration and its Control Through FABRIK and Residual Reinforcement Learning

一种新型四元数关节缆驱动冗余机械臂配置及其通过FABRIK和残差强化学习的控制

Tanapath Pornthisan, Thanapat Kemthong, Thanyapisit Kangsathien, Pasut Aranchaiya, Paulo Garcia, Viboon Sangveraphunsiri

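Affiliations: University of California, San Diego

AI summary: Proposes a 4-segment, 8-joint quaternion-joint cable-driven redundant manipulator configuration and uses residual reinforcement learning to control it with position and orientation accuracy three orders of magnitude better than the FABRIK algorithm.
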
AI中文摘要

能够穿越任意空间路径的机械臂，特别是在高度阻塞的工作空间中，在多个行业中备受期待。四元数关节最近赋予了一类特定的机械臂——缆驱动冗余机械臂——超越其先前能力的新功能。具体来说，四元数关节减少了每个自由度所需的电机数量，为更紧凑的解决方案铺平了道路。一个持续的挑战是，四元数关节运动学模型的复杂性给机械臂配置的先验决策带来了困难，并对控制系统提出了更高的计算需求，其非线性放大了由于制造不精确而产生的设计与物理实物之间的所有差异。在这里，我们展示了一个4段、8关节的机械臂可以在更低的硬件成本下实现比现有配置更广阔的工作空间，并且残差强化学习在控制此类机械臂方面优于现有最先进的方法——特别是FABRIK算法。我们的结果表明，这种配置比先前设计更有效地利用工作空间，并且残差强化学习在位置和方向精度上比FABRIK高出三个数量级，实现了对新型4段、8关节机械臂的精确控制。此外，控制实现更简单：我们描述了完整的FABRIK控制过程及相应的学习实现。我们的方法适用于新系统的设计，为设计者提供了开发此类机械臂及新型配置相应控制系统的更多工具。

英文摘要

Robotic arms capable of traversing arbitrary spatial paths, especially in highly obstructed workspaces, are highly desired across several industries. Quaternion-joints have recently empowered a specific class of robotic arms -- cable-driven redundant manipulators -- beyond their prior capabilities. Specifically, quaternion-joints reduce the number of required motors per degree of freedom, paving the way for more compact solutions. An ongoing challenge is that the complexity of the kinematic model of quaternion joints complicates a priori decisions on manipulator configurations and imposes higher computational demands on the control system, while its non-linearities amplify all discrepancies between design and physical artifact arising from fabrication imprecision. Here we show that a 4-segment, 8-joint manipulator can achieve a broader workspace than extant configurations, at lower hardware cost, and that Residual Reinforcement Learning outperforms extant state-of-the-art methods -- specifically, the FABRIK algorithm -- on the control of such a manipulator. Our results show that this configuration is more workspace-effective than prior designs, and that Residual Reinforcement Learning outperforms FABRIK by three orders of magnitude on positional and orientational accuracy, effecting precise control of the novel 4-segment, 8-joint manipulator. Additionally, the control implementation is simpler: we describe the complete FABRIK process for control and corresponding learning implementation. Our methodology is applicable to the design of new systems, providing designers with further tools for the development of this class of manipulators and corresponding control systems for novel configurations.

2606.05234 2026-06-05 cs.RO cs.LG 版本更新

Test-Time Training for Visual Foresight Vision-Language-Action Models

Sangwu Park, Wonjoong Kim, Yeonjun In, Sein Kim, Hongseok Kang, Chanyoung Park

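Affiliations: KAIST

AI summary: Proposes a test-time training method that makes visual foresight vision-language-action models more robust to out-of-distribution data, with an adaptive update filtering mechanism to mitigate the practical challenges of test-time updates.

Comments Accepted at ICML 2026 Workshop on Continual Adaptation at Scale (CATS)
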
AI中文摘要

Visual Foresight VLA (VF-VLA) 已成为最近 VLA 中的重要架构选择，因其出色的性能。然而，VF-VLA 的固有设计使其特别容易受到分布外（OOD）偏移的影响。由于动作的质量直接取决于预测未来视觉信息的准确性，OOD 条件会影响两个阶段。为了解决这一脆弱性，我们提出了测试时训练视觉前瞻 VLA（$T^3$VF），这是一种受观察启发的测试时训练方法，即预测的未来图像及其后续观察形成自然的监督对。为了进一步解决由于随意测试时更新而产生的实际挑战，我们引入了自适应更新过滤机制。经验上，$T^3$VF 在不改变任何架构或辅助模块的情况下，以适度的额外推理成本缓解了 VF-VLA 的 OOD 脆弱性。

英文摘要

Visual Foresight VLA (VF-VLA) has become a prominent architectural choice in recent VLAs due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribution (OOD) shifts. Because the quality of the action directly depends on the accuracy of the predicted future visual information, OOD conditions affect both stages at once. To address this vulnerability, we propose Test-Time Training Visual Foresight VLA ($T^3$VF), a test-time training approach motivated by the observation that the predicted future image and its subsequent observation form a natural supervision pair. To further address the practical challenges that arise from indiscriminate test-time updates, we introduce an adaptive update filtering mechanism. Empirically, $T^3$VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without requiring any architectural modification or auxiliary modules.

2604.21017 2026-06-05 cs.RO cs.AI 版本更新

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Open-H-Embodiment: 一个大规模数据集，用于在医疗机器人中启用基础模型

Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin Calmé, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. DeLong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn, Philipp Fürnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg, Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, Tamás Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Zhaoyang Jacopo Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, Przemysław Korzeniowski, Chandra Kuchi, Nithesh Kumar, Alan Kuntz, Federico Lavagno, Yu Chung Lee, Hao-Chih Lee, Hang Li, Zhen Li, Xiao Liang, Xinxin Lin, Jinsong Lin, Chang Liu, Fei Liu, Pei Liu, Yun-hui Liu, Wanli Liuchen, Eszter Lukács, Sareena Mann, Miles Mannas, Brett Marinelli, Sabina Martyniak, Francesco Marzola, Lorenzo Mazza, Xueyan Mei, Maria Clara Morais, Luigi Muratore, Chetan Reddy Narayanaswamy, Michał Naskręt, David Navarro-Alarcon, Cyrus Neary, Chi Kit Ng, Christopher Nguan, David Noonan, Ki Hwan Oh, Tom Christian Olesch, Allison M. Okamura, Justin Opfermann, Matteo Pescio, Doan Xuan Viet Pham, Tito Porras, Hongliang Ren, Ariel Rodriguez Jimenez, Ferdinando Rodriguez y Baena, Septimiu E. Salcudean, Asmitha Sathya, Preethi Satish, Lalithkumar Seenivasan, Jiaqi Shao, Yiqing Shen, Yu Sheng, Lucy XiaoYang Shi, Zoe Soulé, Stefanie Speidel, Mingwu Su, Jianhao Su, Idris Sunmola, Kristóf Takács, Yunxi Tang, Patrick Thornycroft, Yu Tian, Jordan Thompson, Mehmet K. Turkcan, Mathias Unberath, Pietro Valdastri, Carlos Vives, Quan Vuong, Martin Wagner, Farong Wang, Wei Wang, Lidian Wang, Chung-Pang Wang, Guankun Wang, Junyi Wang, Erqi Wang, Ziyi Wang, Tanner Watts, Wolfgang Wein, Yimeng Wu, Zijian Wu, Hongjun Wu, Luohong Wu, Jie Ying Wu, Junlin Wu, Victoria Wu, Kaixuan Wu, Mateusz Wójcikowski, Yunye Xiao, Nan Xiao, Wenxuan Xie, Hao Yang, Tianqi Yang, Yinuo Yang, Menglong Ye, Ryan S. Yeung, Nural Yilmaz, Chim Ho Yin, Michael Yip, Rayan Younis, Chenhao Yu, Sayem Nazmuz Zaman, Milos Zefran, Han Zhang, Yuelin Zhang, Yidong Zhang, Yanyong Zhang, Xuyang Zhang, Yameng Zhang, Joyce Zhang, Ning Zhong, Peng Zhou, Haoying Zhou, Xiuli Zuo, Nassir Navab, Mahdi Azizian, Sean D. Huver, Axel Krieger

Affiliations: Open-H-Embodiment Consortium; University of California, Berkeley; University of California, Los Angeles; University of Southern California; University of Cambridge; University of Tokyo; University of Tokyo, Graduate School of Information Science and Technology; University of Tokyo, Institute of Industrial Science

AI summary: Presents the Open-H-Embodiment dataset and demonstrates its use in medical robotics through two foundation models, showing the key role of large-scale open data in advancing robot learning and world modeling.

Comments Project website: https://open-h.github.io/open-h-embodiment/

AI中文摘要

自主医疗机器人有希望提高患者预后、减少从业者的工作量、普及医疗访问并实现超人精度。然而，自主医疗机器人受到根本性数据问题的限制：现有的医疗机器人数据集较小、单一躯体且很少公开共享，限制了该领域所需的基础模型的发展。我们介绍了Open-H-Embodiment，这是迄今为止最大的开放医疗机器人视频数据集，包含同步运动学，涵盖超过50个机构和多种机器人平台，包括CMR Versius、Intuitive Surgical的da Vinci、da Vinci Research Kit（dVRK）、Rob Surgical BiTrack、Virtual Incision的MIRA、Moon Surgical Maestro以及多种定制系统，涵盖手术操作、机器人超声和内窥镜程序。我们通过两个基础模型展示了该数据集的研究价值。GR00T-H是首个开放的基础视觉-语言-动作模型，是唯一在结构缝合基准测试中实现完整端到端任务完成的模型（25%的试验 vs. 其他所有模型的0%），并在29步体外缝合序列中实现了64%的平均成功率。我们还训练了Cosmos-H-Surgical-Simulator，这是首个动作条件的世界模型，能够从单个检查点实现多躯体手术模拟，涵盖九种机器人平台，并支持计算机模拟政策评估和医学领域合成数据生成。这些结果表明，开放、大规模的医疗机器人数据收集可以作为研究社区的关键基础设施，推动机器人学习、世界建模以及更广泛的研究进展。

英文摘要

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 50 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

2604.12474 2026-06-05 cs.RO cs.AI 版本更新

From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution

从运动学到动力学：学习精炼混合计划以实现物理可行的执行

Lidor Erez, Shahaf S. Shperberg, Ayal Taitler

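Affiliations: Technion - Israel Institute of Technology

AI summary: Uses reinforcement learning in continuous space to make hybrid plans physically feasible to execute. A Markov Decision Process with analytical second-order constraints refines the first-order trajectories produced by a hybrid planner, reliably recovering physical feasibility.
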
AI中文摘要

在许多机器人任务中，智能体必须穿越一系列空间区域以完成任务。此类问题本质上是混合离散-连续的：一个高层动作序列和一个在物理上可行的连续轨迹。生成的轨迹和动作序列还必须满足诸如截止时间、时间窗口和速度或加速度限制等约束条件。尽管混合时间规划器试图解决这一挑战，但它们通常使用线性（一阶）动力学建模运动，这无法保证生成的计划满足机器人的真实物理约束。因此，即使高层动作序列固定，生成动态可行的轨迹也变成了一个双层优化问题。我们通过连续空间中的强化学习来解决这个问题。我们定义了一个明确包含分析二阶约束的马尔可夫决策过程，并用它来改进由混合规划器生成的一阶计划。我们的结果表明，这种方法可以可靠地恢复物理可行性，并有效弥合规划器初始一阶轨迹与实际执行所需动力学之间的差距。

英文摘要

In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot's true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner's initial first-order trajectory and the dynamics required for real execution.

2604.08882 2026-06-05 cs.RO 版本更新

Simulation of Adaptive Running with Flexible Sports Prosthesis using Reinforcement Learning of Hybrid-link System

使用混合链接系统强化学习模拟适应性跑步与柔性运动假肢

Yuta Shimane, Ko Yamamoto

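Affiliations: Department of Biological Sciences, The University of Tokyo; Institute of Systems and Information Engineering, University of Tsukuba

AI summary: Proposes a reinforcement learning-based framework for simulating the adaptive running of a unilateral transtibial amputee under different virtual prosthetic stiffness conditions. A hybrid-link system captures the flexibility of a leaf-spring-type sports prosthesis, and the effect of prosthetic stiffness on running dynamics and metabolic cost is analyzed.
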
DOI: 10.1109/LRA.2026.3693933

Abstract

This study proposes a reinforcement learning-based framework for adaptive running motion simulation in a unilateral transtibial amputee using a hybrid-link system that incorporates the flexibility of a leaf-spring-type sports prosthesis. The design and selection of sports prostheses typically rely on trial and error. A comprehensive whole-body dynamics analysis that accounts for interactions between human motion and prosthetic deformation can provide valuable insights for user-specific design and selection. The proposed hybrid-link system enables such analysis by integrating a Piece-wise Constant Strain (PCS) model to represent prosthetic flexibility. Based on this system, the simulation methodology generates whole-body dynamic motions of a unilateral transtibial amputee using a reinforcement learning approach. This framework integrates imitation learning based on motion capture data with accurate computation of prosthetic dynamics. Running motions are simulated under multiple virtual prosthetic stiffness conditions, and the corresponding metabolic cost of transport (COT) obtained from these simulations is analyzed. The results suggest that variations in prosthetic stiffness influence running dynamics and performance, and that COT is consistent with values reported in prior studies. Our findings demonstrate the potential of the proposed approach for simulation and analysis under virtual conditions that differ from real-world conditions.
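
Cost of transport is commonly defined as energy expended per unit body weight per unit distance, $\mathrm{COT} = E / (m g d)$. The Python sketch below computes that quantity. The abstract does not state which normalization the authors use, so the dimensionless form, the function name `cost_of_transport`, and the default value of `g` are assumptions for illustration only.

```python
def cost_of_transport(energy_j: float, mass_kg: float, distance_m: float,
                      g: float = 9.81) -> float:
    """Dimensionless cost of transport: energy / (body weight * distance).

    Assumed definition for this sketch; other conventions report
    energy per unit mass per unit distance (J/(kg*m)) instead.
    """
    if mass_kg <= 0 or distance_m <= 0:
        raise ValueError("mass and distance must be positive")
    return energy_j / (mass_kg * g * distance_m)
```

Under this definition, comparing simulated runs at different prosthetic stiffness values reduces to comparing one scalar per run, provided the body mass and distance covered are recorded alongside the energy estimate.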

2604.03042 2026-06-05 cs.RO Version update

Enhancing Multi-Robot Exploration Using Probabilistic Frontier Prioritization with Dirichlet Process Gaussian Mixtures

John Lewis Devassy, Meysam Basiri, Mário A. T. Figueiredo, Pedro U. Lima

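Affiliations: Institute for Systems and Robotics / LARSyS and Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de Lisboa

AI summary: This paper proposes an enhancement based on probabilistic frontier prioritization and Dirichlet process Gaussian mixture models to improve the efficiency of multi-robot exploration. Integrated into two state-of-the-art multi-agent exploration algorithms, it improves performance across varying environment complexity, communication constraints, and team sizes, with reported average gains of 10% to 14%.

Comments Accepted: IEEE Robotics and Automation Letters (RA-L)

Details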

Abstract

Multi-agent autonomous exploration is essential for applications such as environmental monitoring, search and rescue, and industrial-scale surveillance. However, effective coordination under communication constraints remains a significant challenge. Frontier exploration algorithms analyze the boundary between the known and unknown regions to determine the next-best view that maximizes exploratory gain. This article proposes an enhancement to existing frontier-based exploration algorithms by introducing a probabilistic approach to frontier prioritization. By leveraging a Dirichlet process Gaussian mixture model (DP-GMM) and a probabilistic formulation of information gain, the method improves the quality of frontier prioritization. The proposed enhancement, integrated into two state-of-the-art multi-agent exploration algorithms, consistently improves performance across environments of varying clutter, communication constraints, and team sizes. Simulations showcase an average gain of $10\%$ and $14\%$ for the two algorithms across all combinations. Successful deployment in real-world experiments with a dual-drone system further corroborates these findings.

2603.10971 2026-06-05 cs.RO cs.AI Version update

ContactExplorer: Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation

Zixuan Liu, Ruoyi Qiao, Chenrui Tie, Xuanwei Liu, Yunfan Lou, Chongkai Gao, Zhixuan Xu, Lin Shao

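Affiliations: School of Computing, National University of Singapore; RoboScience

AI summary: Proposes ContactExplorer, which uses a contact coverage reward and an energy-guided reward to explore contact patterns efficiently in dexterous manipulation tasks, improving sample efficiency and success rates.

Comments 24 pages

Details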

Abstract

Reinforcement learning has achieved remarkable success in domains such as Atari games, navigation, and locomotion, where exploration can often be guided by novelty over states or dynamics. In contrast, dexterous manipulation requires rich physical hand-object interactions, but existing methods often suffer from unstable contact-based novelty signals, inefficient distance novelty signals, or reliance on task-specific priors. We propose ContactExplorer, a general exploration method for dexterous manipulation tasks. ContactExplorer represents contact as the intersection between object surface points and hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) to define an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate ContactExplorer on a diverse set of dexterous manipulation tasks. Experimental results show that ContactExplorer substantially improves sample efficiency and success rates over existing exploration methods, and that the contact patterns learned with ContactExplorer transfer robustly to the real world. Project page: https://contact-explorer.github.io.
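
The count-based part of this idea can be sketched in a few lines of Python. Everything below is a hypothetical illustration: the class name `ContactCounter`, the integer `state_hash` standing in for a learned hash code, and the `1/sqrt(count)` bonus (a common count-based exploration form) are assumptions, not the reward defined in the paper.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Iterable, Tuple

# (object-state hash, finger name, object region index)
ContactKey = Tuple[int, str, int]


class ContactCounter:
    """Counts how often each finger touches each object region per state bucket."""

    def __init__(self) -> None:
        self.counts: Dict[ContactKey, int] = defaultdict(int)

    def coverage_reward(self, state_hash: int,
                        contacts: Iterable[Tuple[str, int]]) -> float:
        """Update the counts and return a bonus that decays as 1/sqrt(count).

        The decay form is an assumption of this sketch; it only illustrates
        that rarely seen finger-region pairs earn a larger bonus.
        """
        reward = 0.0
        for finger, region in contacts:
            key = (state_hash, finger, region)
            self.counts[key] += 1
            reward += 1.0 / sqrt(self.counts[key])
        return reward
```

Calling `coverage_reward` repeatedly with the same finger-region pair yields a shrinking bonus, while a new pair in the same state bucket receives the full bonus again, which is what pushes a policy toward unexplored contact patterns.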

2602.12628 2026-06-05 cs.RO Version update

EVE: A Generator-Verifier System for Generative Policies

Yusuf Ali, Gryphon Patlin, Karthik Kothuri, Jeremiah Coholich, Muhammad Zubair Irshad, Wuwei Liang, Zsolt Kira

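Affiliations: Georgia Institute of Technology; Toyota Research Institute; Symbotic Inc.

AI summary: This paper proposes EVE, a generator-verifier framework that improves the performance of pretrained generative policies at test time by using zero-shot vision-language model verifiers to refine actions, with no additional training.

Details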

Abstract

Visuomotor policies based on generative models such as diffusion and flow matching have shown strong performance for robotics applications but degrade under distribution shifts, demonstrating limited recovery capabilities without costly finetuning. In the language modeling domain, test-time compute scaling has revolutionized the reasoning capabilities of modern LLMs by enabling candidate solution refinement. These methods typically leverage foundation models as verification modules in a zero-shot manner to score candidate solutions. We hypothesize that generative policies can similarly benefit from additional inference-time compute that employs zero-shot VLM-based verifiers in a generation-verification framework. To this end, we introduce EVE: a modular, generator-verifier interaction framework that boosts the performance of pretrained generative policies at test time, with no additional training. EVE wraps a frozen base policy with multiple zero-shot, VLM-based verifier agents. Each verifier proposes action refinements to the base policy's candidate actions, while an action incorporator uses classifier guidance to fuse aggregated verifier feedback into action denoising. We study design choices for generator-verifier information interfacing across a system of verifiers with distinct capabilities. Across diverse simulated and real robotic tasks and embodiments, EVE consistently improves success rates without additional policy or verifier training. Through extensive ablations, we isolate the contribution of verifier capabilities and action incorporator strategies, offering practical guidelines to build scalable, modular generator-verifier systems for embodied control.

2510.26236 2026-06-05 cs.RO Version update

RECON: Reducing Causal Confusion with Human-Placed Markers

Robert Ramirez Sanchez, Heramb Nemlekar, Shahabedin Sagheb, Cara M. Nunez, Dylan P. Losey

Affiliations: Collaborative Robotics Lab (Collab), Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24061; Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY 14853

AI summary: This study proposes the RECON framework, in which humans proactively mark task-critical parts to reduce causal confusion in robot learning. The beacon data is used to train a task-relevant state embedding, which improves learning efficiency.

Comments 7 pages, 5 figures

Details

DOI: 10.1109/IROS60139.2025.11246741

Abstract

Imitation learning enables robots to learn new tasks from human examples. One fundamental limitation while learning from humans is causal confusion. Causal confusion occurs when the robot's observations include both task-relevant and extraneous information: for instance, a robot's camera might see not only the intended goal, but also clutter and changes in lighting within its environment. Because the robot does not know which aspects of its observations are important a priori, it often misinterprets the human's examples and fails to learn the desired task. To address this issue, we highlight that -- while the robot learner may not know what to focus on -- the human teacher does. In this paper we propose that the human proactively marks key parts of their task with small, lightweight beacons. Under our framework (RECON) the human attaches these beacons to task-relevant objects before providing demonstrations: as the human shows examples of the task, beacons track the position of marked objects. We then harness this offline beacon data to train a task-relevant state embedding. Specifically, we embed the robot's observations to a latent state that is correlated with the measured beacon readings: in practice, this causes the robot to autonomously filter out extraneous observations and make decisions based on features learned from the beacon data. Our simulations and a real robot experiment suggest that this framework for human-placed beacons mitigates causal confusion. Indeed, we find that using RECON significantly reduces the number of demonstrations needed to convey the task, lowering the overall time required for human teaching. See videos here: https://youtu.be/oy85xJvtLSU
