arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06491 2026-06-05 cs.RO cs.AI Version update

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies


Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding

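Affiliations: RUC (Renmin University of China); FDU (Fudan University); UNC (University of North Carolina at Chapel Hill)

AI summary: Proposes TempoVLA, which uses variable-speed trajectory augmentation and a speed-conditioning mechanism to give robot manipulation flexible speed control in both directions, and supports dynamic speed adjustment.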


Abstract

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components: (1) a data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times demonstrations to any target speed by merging or splitting actions while preserving their motion semantics, and (2) a model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
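The merge-or-split idea behind VSTA can be illustrated with a minimal sketch. It assumes each action is a per-step displacement (delta) vector and covers only integer speed factors, whereas the abstract describes re-timing to any target speed. The names `merge_actions`, `split_actions`, and `total_displacement` are hypothetical and not taken from the paper.

```python
from typing import List

Action = List[float]  # assumption: one delta (displacement) action per control step


def merge_actions(actions: List[Action], factor: int) -> List[Action]:
    """Speed up by `factor`: sum each run of `factor` consecutive deltas."""
    merged = []
    for start in range(0, len(actions), factor):
        group = actions[start:start + factor]
        merged.append([sum(dims) for dims in zip(*group)])
    return merged


def split_actions(actions: List[Action], factor: int) -> List[Action]:
    """Slow down by `factor`: replace each delta with `factor` equal parts."""
    return [[d / factor for d in a] for a in actions for _ in range(factor)]


def total_displacement(actions: List[Action]) -> Action:
    """Net motion of the whole trajectory, used to check that re-timing preserves it."""
    return [sum(dims) for dims in zip(*actions)]


demo = [[0.02, 0.0], [0.02, 0.01], [0.0, 0.01], [0.01, 0.0]]
fast = merge_actions(demo, 2)   # half as many steps, larger deltas
slow = split_actions(demo, 2)   # twice as many steps, smaller deltas
```

Both re-timed trajectories cover the same net displacement as the original, which is one simple reading of "preserving motion semantics"; non-additive dimensions such as a gripper open/close command would need separate handling.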

2606.06461 2026-06-05 cs.RO Version update

Flow-based Policy Adaptation without Policy Updates


Luzhe Sun, Jingtian Ji, Haoran Chen, Jiawei Zhou, Matthew R. Walter

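Affiliations: Toyota Technological Institute at Chicago; Stony Brook University

AI summary: Proposes GLOVES, a flow-based method that transports non-expert actions toward the expert action distribution, performing selective action-level adaptation that improves task success while preserving agent intent.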


Abstract

Leveraging prior knowledge from pretrained policies, foundation models, or human operators offers an efficient alternative to learning robot skills from scratch. However, these agents often provide actions that are suboptimal, noisy, or misaligned with task-specific expert behavior. We propose GLOVES, a family of flow-based adaptation methods that correct non-expert actions by transporting them toward an expert action distribution. Rather than replacing agentic control with full autonomy, GLOVES performs selective action-level adaptation, improving task success while preserving agent intent. The learned flow also provides a natural in-distribution scoring mechanism through reverse flow evaluation. We use this signal as an intervention gate: actions that appear consistent with the expert distribution are passed through unchanged, while anomalous or out-of-distribution (OOD) actions are corrected. In this way, assistance is only provided when necessary. GLOVES requires only limited expert supervision, using a small number of demonstrations or reusable successful skill segments. By learning local expert action patterns and stitching them during execution, GLOVES provides a lightweight shared-control module for robust action adaptation across tasks and environments. Code and demos are available at ripl.github.io/GLOVES_web.
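The intervention gate described above can be sketched as a small wrapper: score the incoming action, pass it through if the score clears a threshold, and correct it otherwise. This is only an illustration of the gating logic. The learned flow is replaced here by a stand-in isotropic Gaussian around a fixed mean action, and every name and number (`gated_adaptation`, `gaussian_score`, `pull_toward_mean`, the threshold) is an assumption rather than part of GLOVES.

```python
from typing import Callable, List, Sequence


def gated_adaptation(
    action: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    correct_fn: Callable[[Sequence[float]], List[float]],
    threshold: float,
) -> List[float]:
    """Pass in-distribution actions through unchanged; correct the rest."""
    if score_fn(action) >= threshold:
        return list(action)
    return correct_fn(action)


# Stand-in expert model (assumption): isotropic Gaussian around a mean action.
EXPERT_MEAN = [0.5, 0.0]
EXPERT_STD = 0.1


def gaussian_score(action: Sequence[float]) -> float:
    """Higher is more expert-like: negative squared distance in units of std."""
    return -sum(((a - m) / EXPERT_STD) ** 2 for a, m in zip(action, EXPERT_MEAN))


def pull_toward_mean(action: Sequence[float], step: float = 0.5) -> List[float]:
    """Toy correction: move the action part of the way toward the expert mean."""
    return [a + step * (m - a) for a, m in zip(action, EXPERT_MEAN)]


kept = gated_adaptation([0.52, 0.03], gaussian_score, pull_toward_mean, threshold=-9.0)
fixed = gated_adaptation([1.5, 0.0], gaussian_score, pull_toward_mean, threshold=-9.0)
```

The first action lies close to the stand-in expert mean and is returned unchanged; the second is far outside it and is pulled halfway back, so assistance is applied only to the outlier.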

2606.06423 2026-06-05 cs.RO cs.AI Version update

RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation


Qi Lan, Yining Tang, Yu Shen, Yi Zhou, Yuhao Wei, Jie Li, Guofa Li

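Affiliations: National University of Singapore

AI summary: Proposes the RiskFlow framework, which replaces iterative denoising with a single forward transport in action space to generate safety-critical multi-agent traffic scenarios quickly and faithfully.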


Abstract

Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts such as jitter, abnormal acceleration, and off-road behavior. To address these issues, we propose RiskFlow, a closed-loop safety-critical multi-agent traffic generation framework that formulates future trajectory generation as transport in the action space. Instead of relying on iterative denoising, RiskFlow learns an average velocity field over a finite interval to transform Gaussian action sequences into future acceleration and yaw-rate commands with a single forward pass, using a JVP-based objective for efficient and stable training. At test time, RiskFlow applies output-space guidance to the generated actions, steering selected critical agents toward risky interactions while regularizing off-road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Experiments on nuScenes with tbsim closed-loop evaluation show that RiskFlow achieves a strong adversariality-realism trade-off across multi-agent and long-horizon settings. Compared with representative baselines, RiskFlow consistently improves realism while maintaining competitive safety-critical generation capability, and substantially reduces inference time for evaluation.
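As an illustration of reconstructing a trajectory from acceleration and yaw-rate commands, the sketch below integrates a unicycle model with forward Euler steps. The abstract only says "vehicle dynamics", so the unicycle model, the Euler scheme, the non-negative speed clamp, and the name `rollout` are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

State = Tuple[float, float, float, float]  # (x, y, heading [rad], speed [m/s])


def rollout(state: State, commands: Sequence[Tuple[float, float]], dt: float) -> List[State]:
    """Integrate (acceleration, yaw_rate) commands into a state trajectory."""
    x, y, heading, speed = state
    trajectory = [(x, y, heading, speed)]
    for accel, yaw_rate in commands:
        # Advance the position with the current heading and speed.
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        # Then apply the commanded yaw rate and acceleration.
        heading += yaw_rate * dt
        speed = max(0.0, speed + accel * dt)  # assumption: no reversing
        trajectory.append((x, y, heading, speed))
    return trajectory


# One second of straight driving at 10 m/s with zero commands.
straight = rollout((0.0, 0.0, 0.0, 10.0), [(0.0, 0.0)] * 10, dt=0.1)
```

Because positions are derived from the commands rather than generated directly, every reconstructed trajectory is consistent with the assumed dynamics by construction.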

2606.06370 2026-06-05 cs.RO Version update

Ensuring Interaction Safety in Multitask Exoskeleton Control: A Simulation-Trained Variable Impedance Framework


Muyuan Ma, Houcheng Li, Haotian Zhai, Lijun Han, Xinpan Meng, Xiuze Xia, Long Cheng

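Affiliations: Tsinghua University

AI summary: Proposes a simulation-trained variable impedance control framework that constrains stiffness variations using Lyapunov stability theory, enabling safe interaction control for a multitask exoskeleton and reducing metabolic cost.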


Abstract

Wearable exoskeletons can augment human physical capabilities during complex activities. However, ensuring adaptation across diverse tasks while guaranteeing interaction safety remains a critical challenge. To address this, a simulation-trained variable impedance control approach with stability guarantees is proposed. First, a simulation-based human-exoskeleton motion data generation pipeline is established, utilizing Proximal Policy Optimization (PPO) to synthesize human muscle activations while the exoskeleton provides direct compensation for human biological joint torques. Subsequently, the generated dataset is used to train a dual-modality policy that fuses semantic instructions with proprioceptive history, enabling the prediction of reference trajectories and variable impedance gains for nine different motion tasks. To guarantee safety, the network outputs are constrained by a stability criterion derived from Lyapunov stability theory, which bounds stiffness variations to ensure the asymptotic stability of the coupled system. Experimental results indicate that the proposed framework reduces metabolic cost in real-world scenarios compared with standard baseline methods. These findings suggest the feasibility of the proposed framework for safe, multitask exoskeleton control.

2606.06366 2026-06-05 cs.RO Version update

Waypoints Matter: A Systematic Study for Sampling-Based Trajectory Planning


Josep M. Barbera, Antonio Artuñedo, Jorge Villagra

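Affiliations: AUTOPIA Program at the Centre for Automation and Robotics, CSIC-Universidad Politécnica de Madrid

AI summary: Systematically studies how waypoint placement strategies (uniform spacing, an RDP* variant, and curvature-conditioned allocation) affect the performance of sampling-based trajectory planners; finds that the nominal waypoint spacing is the main performance driver and that uniform sampling at a well-tuned spacing matches or surpasses the other strategies.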

Comments 8 pages, 5 figures, 3 tables; accepted at IEEE ITSC 2026


Abstract

Real-time autonomous driving commonly relies on sampling-based trajectory planners that link candidate trajectories to target waypoints along the road centerline. The placement of these waypoints directly impacts both the existence and quality of feasible trajectories. Yet, its effect on planner performance remains largely unexplored. In this paper, we treat waypoint placement as a first-class design variable. We hold the trajectory primitive and candidate budget fixed, and systematically sweep three placement strategies (uniform spacing, an augmented Ramer-Douglas-Peucker variant (RDP*), and a novel curvature-conditioned allocation) across 449 configurations and five CommonRoad maps of increasing geometric complexity. Our results show that the nominal inter-waypoint spacing $d_s$ is the primary performance driver, with large differences in planner reliability attributed to placement alone. Uniform sampling at a well-tuned spacing matches or surpasses both RDP* and the centered curvature variant. The curvature variant offers a small but consistent advantage on geometrically complex roads under reliability-first and balanced weightings, while RDP* never outperforms uniform sampling. These findings suggest that $d_s$ should be treated as the dominant tuning parameter, with geometry-aware strategies reserved for curvature-rich corridors where feasibility is the limiting factor.
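The uniform-spacing strategy, with the nominal spacing $d_s$ as its only parameter, can be sketched as arc-length resampling of a centerline polyline. This is a generic illustration and not the paper's implementation; the function name `uniform_waypoints` and the polyline representation of the centerline are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def uniform_waypoints(centerline: Sequence[Point], d_s: float) -> List[Point]:
    """Place a waypoint every d_s units of arc length along a polyline."""
    waypoints = [centerline[0]]
    next_s = d_s       # arc length at which the next waypoint is due
    travelled = 0.0    # arc length at the start of the current segment
    for (x0, y0), (x1, y1) in zip(centerline, centerline[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        # Emit every waypoint that falls on this segment (small tolerance at the end).
        while seg > 0 and next_s <= travelled + seg + 1e-12:
            t = (next_s - travelled) / seg
            waypoints.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            next_s += d_s
        travelled += seg
    return waypoints


# An L-shaped centerline of total length 20, sampled every 5 units.
corner = uniform_waypoints([(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)], d_s=5.0)
```

Sweeping `d_s` in a loop over this function is the kind of single-parameter tuning the abstract argues should come first, before switching to geometry-aware placement.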

2606.06312 2026-06-05 cs.RO Version update

Meridian: Metric-Semantic Primitive Matching for Cross-View Geo-Localization Beyond Urban Environments


Mason Peterson, Qingyuan Li, Yixuan Jia, Fernando Cladera, Carlos Nieto-Granda, Camillo Jose Taylor, Jonathan P. How

Affiliations: Massachusetts Institute of Technology; GRASP Laboratory, University of Pennsylvania; U.S. Army Combat Capabilities Development Command, Army Research Laboratory

AI summary: Proposes Meridian, which matches high-level metric-semantic primitives between aerial imagery and ground-robot RGB-D data to achieve global localization across diverse environments without area-specific training, with an average trajectory error of 2.4 m.

Comments 9 pages, 6 figures


Abstract

Successful robot automation requires accurate global localization to support repeatability, task planning, goal specification, and safe operation. However, reliable localization in GNSS-denied environments remains an open problem. Overhead aerial imagery offers a promising solution, but existing approaches primarily target structured urban environments and have rarely been demonstrated in unstructured natural terrain. Limitations of the state of the art include a reliance on models trained for specific environments, as well as difficulty handling repetitive geometries and featureless landscapes commonly found in natural outdoor areas. To overcome these challenges, we present Meridian, a method for matching high-level metric-semantic primitives across aerial images and ground robot RGB-D camera data that achieves accurate global localization and generalizes well across diverse environments, all without any training or algorithmic fine-tuning on area-specific data. We formulate novel consistency metrics to estimate a distribution over robot submap poses and to reject outlier hypotheses in a robust pose graph optimization step for accurate robot trajectory estimation. We demonstrate that our algorithm can localize a ground robot across a wide variety of environments, including an autonomous driving dataset, a park and campus area, and a wilderness camp, with an average optimized trajectory error of 2.4 m over 19 km of ground traversal.

2606.06308 2026-06-05 cs.RO Version update

Attitude-Aided Linear Calibration of Triaxial Accelerometers


Yongqiang Yu, Tian Huang, Yipeng Yang

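Affiliations: Tsinghua University

AI summary: Proposes attitude-aided linear accelerometer calibration (ALAC) for triaxial accelerometers, which builds a combined error matrix to enable linear least-squares estimation, needs only five arbitrarily oriented measurements, and is validated for accuracy and robustness in static and quasi-static experiments.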


Abstract

Triaxial MEMS accelerometers are widely used for inertial sensing, navigation, and sensor fusion, but existing calibration methods often rely on costly reference setups or nonlinear iterative optimization, limiting their efficiency and applicability to low-cost or self-calibrating systems. We present attitude-aided linear accelerometer calibration (ALAC), a method that operates on any platform providing orientation information, such as turntables, robotic arms, or inertial measurement units. ALAC constructs a combined error matrix (CEM) to represent sensor errors in a unified calibration model and enables linear least-squares estimation. The bias and gravity vector are jointly estimated, implicitly accounting for platform misalignment, and matrix decomposition of the CEM recovers scale, non-orthogonality, and alignment rotation parameters. Under static gravity, calibration is formulated as a constrained homogeneous least-squares (CHLS) problem and solved in closed form using standard linear algebra. Only five arbitrarily oriented measurements are required, and a recursive extension supports online or in-field calibration. Experiments on a stationary robot-mounted accelerometer and a quasi-static public IMU trajectory show that ALAC, in both offline and online modes, outperforms reference-based and online baselines in accuracy and robustness to sensor noise. On the same dataset, it matches iterative self-calibration under filtered conditions and surpasses all evaluated baselines on raw measurements. These results demonstrate a robust and practical calibration scheme for MEMS-based inertial platforms, especially low-cost IMUs and online calibration scenarios.
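To show why attitude information makes the problem linear, here is a simplified sketch, not ALAC itself. It assumes the gravity vector in the sensor frame is already known from the platform attitude, and fits the affine model $y_i = C r_i + b$ by ordinary least squares. ALAC as described estimates bias and gravity jointly and solves a constrained homogeneous problem instead. The name `calibrate_linear` and all numeric values are hypothetical.

```python
import numpy as np


def calibrate_linear(gravity_in_sensor, measured):
    """Fit y = C r + b by linear least squares.

    gravity_in_sensor: (N, 3) gravity vectors r_i in the nominal sensor frame,
                       assumed known from the platform attitude.
    measured:          (N, 3) raw accelerometer readings y_i.
    Returns the 3x3 error matrix C and the bias vector b.
    """
    r = np.asarray(gravity_in_sensor, dtype=float)
    y = np.asarray(measured, dtype=float)
    A = np.hstack([r, np.ones((len(r), 1))])    # each row is [r_i^T, 1]
    X, *_ = np.linalg.lstsq(A, y, rcond=None)   # X stacks C^T on top of b^T
    return X[:3].T, X[3]


# Synthetic, noise-free check with six axis-aligned static poses.
g = 9.81
directions = np.array(
    [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]],
    dtype=float,
)
gravity_in_sensor = g * directions
C_true = np.array([[1.02, 0.01, 0.0], [0.0, 0.98, 0.02], [0.0, 0.0, 1.01]])
b_true = np.array([0.1, -0.05, 0.2])
measured = gravity_in_sensor @ C_true.T + b_true
C_est, b_est = calibrate_linear(gravity_in_sensor, measured)
```

In this simplified setting there are 12 unknowns and each pose contributes three equations, so four well-spread poses already determine the solution; the five-measurement requirement quoted in the abstract belongs to the paper's joint formulation, not to this sketch.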

2606.06292 2026-06-05 cs.CV cs.RO Version update

Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation


Ariel Herrera, Xueyang Kang, Atal Anil Kumar

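Affiliations: Department of Engineering, University of Luxembourg; School of Electrical and Electronic Engineering, Nanyang Technological University; Université de Lorraine, Arts et Metiers Institute of Technology, LCFC

AI summary: To address the difficulty of visual perception in cloth manipulation, proposes a Blender-based synthetic data generation pipeline and a perception framework combining a CNN with a YOLOv8-OpenCV pipeline, enabling wrinkle-based grasping and keypoint-based ironing; the keypoint model has a mean position error of 1.7615 pixels.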


Abstract

Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth's state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.

2606.06281 2026-06-05 cs.RO Version update

Multi-Resolution Tactile Imitation Learning for Contact-Rich Robotic Manipulation


Rickmer Krohn, Erik Helmut, Niklas Funk, Jan Peters, Vignesh Prasad, Georgia Chalvatzaki

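Affiliations: Interactive Robot Perception & Learning, TU Darmstadt; Intelligent Autonomous Systems, TU Darmstadt; Hessian AI; Robotics Institute Germany

AI summary: Proposes MiTaS, a multi-resolution tactile representation framework that fuses tactile sensors with different temporal resolutions (GelSight Mini and Evetac) with an RGB camera through modality-specific convolutional stems and transformer-based fusion, enabling imitation learning of complex contact-rich manipulation tasks with an 80% average success rate.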

Comments 20 pages, preprint


Abstract

Touch sensing is beneficial for solving a wide variety of manipulation tasks. While there exists a wide range of tactile sensors with different properties, exploiting the fusion of multiple heterogeneous tactile sensors to improve manipulation learning remains underexplored. We present Multi-Resolution Tactile Sensing (MiTaS), a representation framework that leverages multiple tactile sensors operating at different temporal resolutions in order to solve complex contact-rich manipulation tasks. We propose a novel architecture using modality-specific convolutional stems and transformer-based fusion that effectively fuses information from an RGB camera stream, a vision-based GelSight Mini sensor and a high-frequency event-based Evetac sensor. This multi-sensor representation then conditions a flow-matching policy for solving downstream tasks. Experimental results across five contact-rich manipulation tasks demonstrate the effectiveness of multi-resolution tactile features in imitation learning. MiTaS achieves an average success rate of 80%, while vision-only (31%) and visual-tactile (54%) baselines cannot solve the task reliably. Co-training a visuo-tactile model with multi-tactile data boosts performance by over 10% in certain tasks, without having access to the Evetac sensor during policy evaluation. A detailed sensor-reading and attention analysis reveals the importance of different sensors throughout task execution, validating our multi-resolution tactile sensing approach. Project Page: http://mitas-touch.github.io.

2606.06255 2026-06-05 cs.RO cs.CV cs.DC Version update

RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning


Ziyang Yu, Xiang Li, Qiong Chang, Jun Miyazaki

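Affiliations: School of Computing, Institute of Science Tokyo

AI summary: Proposes RadiusFPS, a framework that accelerates farthest point sampling (FPS) with spherical voxel pruning; it keeps the standard update rule while cutting redundant computation through a conservative geometric bound and a coordinate-wise point-skip test, and a fused-kernel GPU implementation substantially increases speed and reduces memory use.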

Comments 28 pages, 15 figures


Abstract

Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the best-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the high time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time requirements and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest end-to-end inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.
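For reference, the classical FPS update rule that RadiusFPS is said to preserve can be written in a few lines. This sketch is the plain baseline, which recomputes all distances in every iteration; it does not include the spherical voxel pruning or the point-skip test. The function name and the lowest-index tie-breaking are assumptions of the sketch.

```python
import numpy as np


def farthest_point_sampling(points, k, start=0):
    """Classical FPS: repeatedly pick the point farthest from the selected set."""
    pts = np.asarray(points, dtype=float)
    selected = [start]
    # Squared distance from every point to its nearest selected point so far.
    min_d2 = np.sum((pts - pts[start]) ** 2, axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_d2))  # ties resolve to the lowest index
        selected.append(nxt)
        d2 = np.sum((pts - pts[nxt]) ** 2, axis=1)
        min_d2 = np.minimum(min_d2, d2)  # the standard FPS update rule
    return selected


cloud = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0]]
picked = farthest_point_sampling(cloud, k=3)
```

Each iteration touches all N points, giving O(Nk) distance evaluations overall; that full sweep is the redundant work a conservative pruning bound aims to skip without changing which points are selected.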

2606.06250 2026-06-05 cs.RO Version update

Breaking Time: A Fully Gaussian Framework for Distributed and Continuous-Time SLAM


Davide Ceriola, Simone Ferrari, Luca Di Giammarino, Leonardo Brizi, Giorgio Grisetti

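Affiliations: Department of Computer, Control, and Management Engineering "Antonio Ruberti", Sapienza University of Rome; University of Stuttgart

AI summary: Proposes G-solver, a distributed continuous-time trajectory estimation framework that combines Gaussian belief propagation with Gaussian process motion priors and supports heterogeneous, asynchronous sensors and multi-camera scenarios.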

Comments To be published in RA-L. Open-source implementation is released at https://github.com/rvp-group/gsolver


Abstract

Continuous-time SLAM provides a principled framework for fusing heterogeneous sensors while estimating smooth trajectories, and is particularly well-suited for handling heterogeneous, asynchronous sensor streams with non-uniform readout patterns, such as rolling shutter cameras, LiDAR scanners, radar sweeps, or event-based sensors. In this work, we introduce G-solver, a fully Gaussian and distributed framework that combines Gaussian Belief Propagation (GBP) with Gaussian Process (GP) motion priors for continuous-time trajectory estimation. Our GP model provides a probabilistic representation of the trajectory, enabling consistent interpolation and the use of data-driven hyperparameters, while GBP offers a scalable message-passing formulation well-suited for decentralized settings. The resulting solver naturally extends to multi-camera scenarios without specialized synchronization or engineering effort. We evaluate the approach on synthetic and real data, including rolling shutter and distributed multi-camera optimization, demonstrating accurate and stable estimation with runtimes comparable to existing continuous-time methods. An open-source implementation is released.

2606.06245 2026-06-05 cs.RO cs.AI Version update

MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action


Boyang Zhang, Lianlei Shan

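Affiliations: Department of Electrical and Computer Engineering, Boston University; Department of Computer Science, Tsinghua University

AI summary: Proposes MPCoT, a reward-guided multi-path latent reasoning framework that improves VLA policy performance on long-horizon, high-uncertainty control tasks while generating zero reasoning tokens and keeping the original action interface.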

Comments 14 pages, 5 figures, submitted to CoRL


Abstract

Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$ hypotheses, refines them for $K$ weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective evaluates candidate action branches with expert-action consistency, world-model/VLM-based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls $(K, M)$. Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, with ablations confirming depth-width effects, confidence-weighted aggregation, and reward-guided path supervision.
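The initialize, refine, and aggregate loop can be sketched with toy components to make the roles of $M$ (width) and $K$ (depth) concrete. Everything below is an assumption for illustration: the hypotheses are noisy copies of one latent vector, the refinement step is a single shared `tanh` layer, and the path scorer is a fixed linear probe. None of these choices comes from MPCoT.

```python
import numpy as np


def multi_path_reason(z0, W, scorer, M, K, rng):
    """Refine M latent hypotheses for K weight-tied steps, then softly aggregate."""
    # Initialize M hypotheses as perturbed copies of the initial latent.
    paths = z0 + 0.1 * rng.standard_normal((M, z0.shape[0]))
    for _ in range(K):
        paths = np.tanh(paths @ W)          # the same W is reused at every step
    scores = paths @ scorer                 # one confidence score per path
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the M paths
    return weights @ paths, weights         # confidence-weighted latent


rng = np.random.default_rng(0)
d = 8
z0 = rng.standard_normal(d)
W = rng.standard_normal((d, d)) / np.sqrt(d)
scorer = rng.standard_normal(d)
latent, weights = multi_path_reason(z0, W, scorer, M=4, K=3, rng=rng)
```

The aggregated latent would then be handed to the unchanged action decoder, which is how a scheme of this kind can add inference-time computation without emitting any reasoning tokens.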

2606.06219 2026-06-05 cs.RO cs.AI Version update

CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving


Yining Xing, Zehong Ke, Zhiyuan Liu, Yanbo Jiang, Wenhao Yu, Jianqiang Wang


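AI summary: Proposes the CLEAR framework, which replaces the multi-step denoising of diffusion models with a single-step conditional drift and combines the Drive-JEPA visual encoder with a fine-tuned Qwen 3.5 0.8B for semantic reasoning, achieving efficient multi-modal planning with a PDMS of 93.7 on NAVSIM v1.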


Abstract

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $\alpha$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.

2606.06218 2026-06-05 cs.RO cs.AI Version update

TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation


Dongwon Son, Florian Shkurti, Jason Lee, Naman Shah, Beomjoon Kim, Dieter Fox

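Affiliations: KAIST; Allen Institute for AI; University of Toronto; University of Washington

AI summary: Proposes the Torque Adaptation Module (TAM), which corrects torque commands using a history encoder and a torque adaptor to transfer motion across robots or payloads, without training the policy with domain randomization or re-collecting data.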


Abstract

A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that must be recollected for each robot or payload. We introduce the Torque Adaptation Module (TAM), a learned module that adapts the torque commands sent to the robot to match the behavior of an ideal robot. TAM operates between the low-level controller that tracks the policy's actions and the robot's torque interface. It includes a history encoder that embeds proprioceptive history into a latent state and a torque adaptor that computes residual torque corrections. Because TAM depends only on proprioceptive history and not on policy observations or the action space, the same TAM weights can be reused to adapt policies with different action spaces (joint targets, end-effector targets, or direct torques). The policies themselves do not need to be trained with domain randomization of robot parameters. Instead, we offload the need for domain randomization to TAM by training it entirely in randomized simulation, using multi-robot pretraining followed by a robot-specific fine-tuning step that still requires no real-robot data. We evaluate TAM zero-shot on a real Franka Panda robot across dynamic manipulation tasks that include a vision-based box-pushing policy (from RL), a flip policy (from BC), and MPC ball-on-plate balancing. Our experiments show that TAM improves zero-shot real-robot execution compared to online system identification and RMA baselines and enables robust dynamic manipulation performance.

2606.06194 2026-06-05 cs.RO cs.CV version update

ActiveMimic: Egocentric Video Pretraining with Active Perception

ActiveMimic: 基于主动感知的自我中心视频预训练

Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, Yu-Gang Jiang

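Affiliations * Fudan University; Shanghai Innovation Institute; Current Robotics NeoteAI

AI summary: Proposes ActiveMimic, a framework that recovers synchronized camera and wrist trajectories from egocentric human video, models camera motion as a viewpoint action, and jointly learns active perception and manipulation skills, so that the pretrained model performs on par with robot-data pretraining on robot tasks.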

Comments Project Page: https://activemimic.github.io/

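Details
AI abstract (translated from Chinese)

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently fall short of models pretrained on robot data. We attribute this gap to a missing signal: the active perception behavior in egocentric video, in which humans continually reposition their viewpoint while manipulating, producing camera motion that standard pipelines treat as noise. To address this, we propose ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Experiments show that, on tasks with varying active perception demands, ActiveMimic consistently outperforms baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that the active perception capability comes from egocentric human video pretraining rather than robot-specific fine-tuning, confirming that active perception is the key to unlocking egocentric human video for robot pretraining.

English abstract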

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.

2606.06155 2026-06-05 cs.RO cs.CV cs.MM version update

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

AffordanceVLA:一种通过可供性感知理解赋能动作生成的视觉-语言-动作模型

Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen

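Affiliations * Peking University; Hong Kong University of Science and Technology (Guangzhou); The Chinese University of Hong Kong; Knowin AI

AI summary: Proposes the AffordanceVLA framework, which introduces structured affordance forecasting as a task-oriented intermediate representation to address the structural mismatch between VLM semantic spaces and embodied control policies in VLA models, enabling precise perception-action mapping.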

Comments Preprint. Code and project page are available. Code: https://github.com/Skywalker-yqz/AffordanceVLA Project page: https://skywalker-yqz.github.io/AffordanceVLA/

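Details
AI abstract (translated from Chinese)

Vision-language-action (VLA) models use the rich world knowledge of pretrained vision-language models (VLMs) to achieve instruction-following robot manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders learning precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to build a more precise and robust perception-action mapping. Specifically, we model manipulation priors progressively through three complementary components: 1) Which2Act, object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act, 2D interaction localization via affordance map estimation; and 3) How2Act, 3D geometric reasoning to guide the manipulation policy. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model with a three-stage training strategy and a progressive data curriculum. To overcome the scarcity of dense affordance labels in robot datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and the real world show that AffordanceVLA achieves strong performance across diverse manipulation scenarios.

English abstract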

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.

2606.06139 2026-06-05 cs.RO version update

MotionDisco: Motion Discovery for Extreme Humanoid Loco-Manipulation

MotionDisco: 用于极端人形机器人移动操作的运动发现

Ilyass Taouil, Michal Ciebelski, Shafeef Omar, Haizhou Zhao, Angela Dai, Aaron M. Johnson, Majid Khadiv

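Affiliations * Technical University of Munich, Germany; New York University, USA; Carnegie Mellon University, USA

AI summary: Proposes the MotionDisco framework, which uses LLM-guided evolutionary search and sequential kinodynamic trajectory optimization to automatically discover long-horizon, contact-rich humanoid loco-manipulation skills from scratch, and deploys them on a real robot.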

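Details
AI abstract (translated from Chinese)

We propose MotionDisco, a framework that discovers contact-rich, long-horizon humanoid loco-manipulation motions from scratch, without relying on teleoperation or motion retargeting from human demonstrations. This is challenging because the space of possible contact interactions grows combinatorially with the task horizon and the number of objects in the scene. MotionDisco combines large language model (LLM)-guided evolutionary search with an efficient sequential kinodynamic trajectory optimizer and a pruning strategy to search quickly over interaction sequences and thereby discover new skills rapidly. Through extensive ablation studies, we show that the LLM-guided search successfully discovers whole-body trajectories on several challenging long-horizon tasks. Finally, by training reinforcement learning tracking policies on the discovered trajectories, we transfer the motions to a real humanoid robot. This is the first work to discover and deploy long-horizon humanoid loco-manipulation skills entirely through automated evolutionary search. Supplementary videos of the experiments are available at: https://youtu.be/DHiVz34QYlw.

English abstract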

We present MotionDisco, a framework that discovers contact-rich, long-horizon humanoid loco-manipulation motions from scratch, without relying on teleoperation or motion retargeting from human demonstrations. This is challenging because the space of possible contact interactions grows combinatorially with the task horizon and the number of objects in the scene. MotionDisco enables rapid discovery of novel skills by coupling a large language model (LLM)-guided evolutionary search over sequences of interactions with an efficient sequential kinodynamic trajectory optimizer and a pruning strategy. Through extensive ablation studies, we show that our LLM-guided search discovers successful whole-body trajectories across several challenging long-horizon tasks. Finally, by training reinforcement learning tracking policies on the discovered trajectories, we transfer the motions to a real humanoid robot. This is the first work to discover and deploy long-horizon humanoid loco-manipulation skills entirely through automated evolutionary search. Supplementary videos of the experiments are available at: https://youtu.be/DHiVz34QYlw.

2606.06130 2026-06-05 cs.RO version update

Towards Realistic 3D Sonar Simulation

面向真实3D声纳仿真

Youssef Attia, Davide Costa, Francesco Wanderlingh, Filippo Campagnaro, Enrico Simetti

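Affiliations * IEEE

AI summary: Proposes a modular architecture that combines GPU-accelerated graphics engines with physical acoustic propagation principles, implements a volumetric 3D sonar model based on the Water Linked 3D-15 sensor in NVIDIA Isaac Sim, and validates it in a hardware-in-the-loop configuration.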

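Details
AI abstract (translated from Chinese)

As underwater robotics research increasingly involves complex 3D perception and autonomous navigation, the fidelity of sonar simulation has become a key factor in algorithm development. Current simulation frameworks typically rely on geometry-driven rendering and approximate 3D sonar as an underwater equivalent of LiDAR, which fails to account for fundamental acoustic phenomena such as refraction, multipath interference, and phase-dependent signal formation. This paper proposes a modular architecture for realistic 3D sonar simulation that combines GPU-accelerated graphics engines with physics-based acoustic propagation principles. We implement a volumetric 3D sonar model in the NVIDIA Isaac Sim environment, modeled on the Water Linked 3D-15 sensor, and integrate it into a comprehensive underwater simulation framework. The system is validated in a hardware-in-the-loop configuration in which a modified FastLIO2 SLAM pipeline running on an NVIDIA Jetson Orin Nano fuses synthetic 3D sonar, DVL, IMU, and pressure data. Finally, a qualitative comparison between simulated outputs and real data from harbor sheet-pile inspections is provided, characterizing the remaining sim-to-real gap and establishing a roadmap toward fully acoustics-driven volumetric sensing.

English abstract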

As underwater robotics research increasingly addresses complex 3D perception and autonomous navigation, the fidelity of sonar simulation has become a key factor in algorithm development. Current simulation frameworks typically rely on geometry-driven rendering, approximating 3D sonar as an underwater equivalent to LiDAR, which fails to account for fundamental acoustic phenomena such as refraction, multi-path interference, and phase-dependent signal formation. This paper proposes a modular architecture for realistic 3D sonar simulation that integrates GPU-accelerated graphics engines with physically grounded acoustic propagation principles. We implement a volumetric 3D sonar model within the NVIDIA Isaac Sim environment, modeled after the Water Linked 3D-15 sensor, and integrate it into a comprehensive underwater simulation framework. The system is validated through a hardware-in-the-loop configuration, where a modified FastLIO2 SLAM pipeline, executed on an NVIDIA Jetson Orin Nano, performs sensor fusion using synthetic 3D sonar, DVL, IMU, and pressure data. Finally, a qualitative comparison between simulated outputs and real-world data from harbor sheet-pile inspections is provided, characterizing the remaining sim-to-real gap and establishing a roadmap toward fully acoustics-driven volumetric sensing.

2606.06077 2026-06-05 cs.RO cs.LG version update

3D Underwater Path Planning via Generative Flow Field Surrogates

基于生成流场代理的三维水下路径规划

Zachary Cooper-Baldock, Paulo E. Santos, Russell S. A. Brinkworth, Karl Sammut

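Affiliations * Flinders University

AI summary: To avoid the high cost of CFD simulation of the complex 3D propeller wake encountered during autonomous underwater vehicle recovery, proposes conditional generative adversarial networks (cGANs) as surrogates, combined with energy-weighted A* path planning, for fast and effective path planning.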

Comments 41 pages, 5 figures, 11 tables

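Details
AI abstract (translated from Chinese)

Launch and recovery (LAR) of an autonomous underwater vehicle (AUV) from the hull of an advancing host vessel requires traversing a complex three-dimensional propeller wake whose hydrodynamic structure cannot be characterised by a uniform current model. High-fidelity Reynolds-Averaged Navier-Stokes (RANS) computational fluid dynamics (CFD) simulations resolve this structure accurately enough for path planning, but their computational cost rules out onboard use. We fill this gap by integrating two conditional generative adversarial network (cGAN) architectures, a regularised PatchGAN and a 2D3DGAN with self-attention, as drop-in replacements for RANS CFD data in a three-dimensional, energy-weighted A* path planning framework. Both generators are driven by a hierarchical pipeline that synthesises a full $128^3$ voxel flow field volume from scalar operating-condition inputs alone, with end-to-end inference times of roughly 28-146 microseconds, whereas a single RANS computation takes hours. We benchmark all four levels of environmental knowledge (uniform current, ground-truth CFD, PatchGAN, and 2D3DGAN SA) on 19,800 independently generated trajectories across 550 distinct flow conditions. Compared with uniform-current planning, full CFD wake knowledge reduces energy consumption by 5.7-12.5% and high-velocity wake-core encounters by up to 77.8%, and both benefits grow with operating severity. The cGAN surrogates recover about 45-60% of the CFD energy benefit and of the high-velocity cell avoidance benefit while running at inference speeds compatible with edge devices. These results provide the first systematic quantification of the downstream path planning value of cGAN-predicted hydrodynamic fields in a three-dimensional maritime robotics application.

English abstract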

Autonomous underwater vehicle (AUV) launch and recovery (LAR) into the hull of an advancing host platform requires traversal of a complex, three-dimensional propeller wake whose hydrodynamic structure cannot be characterised by a uniform current model. High-fidelity Reynolds-Averaged Navier-Stokes (RANS) Computational Fluid Dynamics (CFD) simulations resolve this structure with sufficient accuracy for path planning, but their computational cost renders them impractical for onboard use. We address this gap by integrating two conditional generative adversarial network (cGAN) architectures (a regularised PatchGAN and a 2D3DGAN with self-attention) as drop-in replacements for RANS CFD data within a three-dimensional, energy-weighted A* path planning framework. Both generators are driven by a hierarchical pipeline that synthesises full $128^3$ voxel flow field volumes from scalar operating condition inputs alone, with end-to-end inference times of approximately 28-146 $\mu$s, compared to hours for a single RANS computation. We benchmark all four environmental knowledge levels (uniform current, ground-truth CFD, PatchGAN, and 2D3DGAN SA) across 19,800 independently generated trajectories spanning 550 distinct flow conditions. Full CFD wake knowledge reduces energy expenditure by 5.7-12.5% and high-velocity wake-core encounters by up to 77.8% relative to uniform-current planning, with both benefits scaling with operating severity. The cGAN surrogates recover approximately 45-60% of the CFD energy benefit and high-velocity cell avoidance benefit while operating at inference speeds compatible with edge device use. These results provide the first systematic quantification of the downstream path planning value of cGAN-predicted hydrodynamic fields in a three-dimensional maritime robotics application.
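
The energy-weighted A* idea mentioned in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 2D grid, the scalar `flow` speed per cell, the quadratic `drag` penalty, and the function name `energy_astar` are all hypothetical, whereas the paper plans in 3D over CFD or cGAN-predicted flow fields with its own energy model.

```python
import heapq


def energy_astar(flow, start, goal, drag=1.0):
    """A* on a 4-connected grid where entering a cell costs 1 + drag * speed**2.

    flow[r][c] is a toy scalar current speed (hypothetical energy model).
    Returns (path, total_cost), or (None, inf) if the goal is unreachable.
    """
    rows, cols = len(flow), len(flow[0])

    def heuristic(cell):
        # Manhattan distance is admissible because every step costs at least 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0.0, start)]
    best = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if g > best[cell]:
            continue  # stale queue entry, a cheaper route was found already
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1], g
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
                new_g = g + 1.0 + drag * flow[nxt[0]][nxt[1]] ** 2
                if new_g < best.get(nxt, float("inf")):
                    best[nxt] = new_g
                    parent[nxt] = cell
                    heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, nxt))
    return None, float("inf")


# Toy fields: a calm grid, and one with a fast "wake core" in the centre cell.
CALM = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
WAKE = [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]

calm_path, calm_cost = energy_astar(CALM, (1, 0), (1, 2))
wake_path, wake_cost = energy_astar(WAKE, (1, 0), (1, 2))
print(calm_path, calm_cost)
print(wake_path, wake_cost)
```

In this toy setting the planner goes straight across the calm grid but detours around the centre cell once the wake is known, which mirrors, in a very reduced form, why better flow knowledge can lower energy use and wake-core encounters.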

2606.06061 2026-06-05 cs.RO version update

A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models

基于分布式生成式AI模型的人机协作操作对话框架

Arash Ghasemzadeh Kakroudi, Roel Pieters

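Affiliations * Automation Technology and Mechanical Engineering, Tampere University

AI summary: Proposes a distributed conversational framework that integrates language and vision-language models with a ROS 2 execution stack, generates structured manipulation requests from free-form user commands, and converts image-space targets into robot-frame goals through visual grounding; experiments evaluate end-to-end task reliability and latency.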

Comments Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). The final published version will appear under the title "A Distributed Conversational Framework for Human-Robot Collaborative Manipulation Using Local LLMs and VLMs"

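Details
AI abstract (translated from Chinese)

This paper presents a distributed conversational framework for human-robot collaborative manipulation that integrates local language and vision-language models (VLMs) with an execution stack based on Robot Operating System 2 (ROS 2). Language understanding, visual grounding, orchestration, and motion execution run as separate ROS 2 nodes, allowing flexible deployment across distributed hardware while keeping the control loop responsive. The system generates structured action requests for pick, place, and handover from free-form user commands. It uses a VLM to return image-space targets and converts them into metric robot-frame goals using depth and calibration. A web dashboard displays intermediate intent and grounding overlays (pixel, depth, and robot frame) and requires explicit operator confirmation before any motion is executed. Experiments on a Franka FR3 platform evaluate end-to-end task reliability and latency under increasing tabletop scene ambiguity and compare alternative LLM/VLM configurations within the same pipeline. Code and full documentation are available at https://github.com/cogrob-tuni/franka-llm.

English abstract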

This paper presents a distributed conversational framework for human-robot collaborative manipulation that integrates local language and vision-language models (VLMs) with a Robot Operating System 2 (ROS 2)-based execution stack. Language understanding, visual grounding, orchestration, and motion execution run as separate ROS 2 nodes, enabling flexible deployment across distributed hardware while maintaining a responsive control loop. From free-form user commands, the system generates structured action requests for pick, place, and handover. It uses a VLM to return image-space targets, which are converted into metric robot-frame goals using depth and calibration. A web dashboard exposes intermediate intent and grounding overlays (pixel, depth, and robot-frame) and requires explicit operator confirmation before any motion is executed. Experiments on a Franka FR3 platform evaluate end-to-end task reliability and latency under increasing working table scene ambiguity and compare alternative LLM/VLM configurations in the same pipeline. Code and full documentation are available at https://github.com/cogrob-tuni/franka-llm.
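
The step that turns an image-space target into a metric robot-frame goal "using depth and calibration" is commonly done with a pinhole back-projection followed by a rigid transform. The sketch below illustrates that standard technique only; it is not the paper's code, and the intrinsics, the camera-to-robot transform `T_ROBOT_CAM`, and all function names are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical example values below)."""
    fx: float
    fy: float
    cx: float
    cy: float


def pixel_to_camera(u, v, depth_m, k):
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return [x, y, depth_m]


def camera_to_robot(p_cam, t_robot_cam):
    """Apply a 4x4 homogeneous transform given as row-major nested lists."""
    p = list(p_cam) + [1.0]
    return [sum(t_robot_cam[r][c] * p[c] for c in range(4)) for r in range(3)]


K = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)

# Assumed calibration: camera axes aligned with the robot base frame,
# camera origin offset by (0.5, 0.0, 0.2) metres.
T_ROBOT_CAM = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.2],
    [0.0, 0.0, 0.0, 1.0],
]

# A target the VLM might return at pixel (380, 240), with 1.0 m measured depth.
goal = camera_to_robot(pixel_to_camera(380, 240, 1.0, K), T_ROBOT_CAM)
print(goal)  # approximately [0.6, 0.0, 1.2]
```

In a real deployment the intrinsics would come from the camera driver and the extrinsic transform from hand-eye calibration; the operator-confirmation step described in the abstract would sit between computing `goal` and sending any motion command.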

2606.06049 2026-06-05 cs.RO version update

L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation

L-SDPPO:用于舱内机器人操作的脉冲扩散策略优化

Liwen Zhang, Dong Zhou, Guanghui Sun, Yifei Zheng, Yuhui Hu, Kaihong Ouyang, Zuoquan Zhao

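Affiliations * Department of Control Science and Engineering, Harbin Institute of Technology; Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong

AI summary: Proposes the L-SDPPO framework, which combines a spiking diffusion policy with reinforcement learning optimization and introduces a state-dependent latency injection mechanism, achieving high success rates and low energy consumption on intra-vehicular robotic manipulation tasks.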

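Details
AI abstract (translated from Chinese)

Intra-vehicular robots in spacecraft help reduce astronaut workload and improve mission efficiency. Recent research has focused on using deep learning methods to achieve the precise control required to operate in these complex environments. However, without gravitational damping, objects exhibit unpredictable, unconstrained drift. These factors demand robustness to complex multimodal action distributions. Diffusion policies (DP) can model these complex actions, but their iterative sampling process consumes too much energy for a spacecraft's limited power budget. We therefore propose L-SDPPO, a low-energy framework for intra-vehicular robotic manipulation in which a Spiking Diffusion Policy (SDP) is optimized with a reinforcement learning (RL) algorithm. In addition, to address the insufficient perception of dynamic spatiotemporal features in microgravity, we propose the state-dependent latency injection (SDLI) mechanism, which mimics biological neural delays to dynamically regulate the timing of input information. Evaluation on five representative intra-vehicular daily tasks (for example, hatch opening and precision container capping) shows that our method consistently achieves higher success rates and lower energy consumption than state-of-the-art robotic manipulation methods. These results indicate that our method is a viable approach to intra-vehicular robotic manipulation.

English abstract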

Intra-vehicular robots in spacecraft help reduce astronaut workload and improve mission efficiency. Recent research focuses on using deep learning methods to achieve the precise control required for operations in these complex environments. However, objects exhibit unpredictable, unconstrained drift without gravitational damping. These factors demand robustness against complex multimodal action distributions. Diffusion policies (DP) can model these complex actions, but their iterative sampling process consumes too much energy for the limited power budgets of spacecraft. We therefore propose a low-energy intra-vehicular robotic manipulation framework, L-SDPPO, in which the Spiking Diffusion Policy (SDP) is optimized with a reinforcement learning (RL) algorithm. Furthermore, to address the insufficient perception of dynamic spatiotemporal features in microgravity, we propose the state-dependent latency injection (SDLI) mechanism, which mimics biological neural delays to dynamically regulate the timing of input information. Evaluation on five representative intra-vehicular daily tasks (e.g., hatch opening and precision container capping) shows that our method consistently achieves higher success rates and lower energy consumption than state-of-the-art robotic manipulation methods. These results demonstrate that our method is a viable approach to intra-vehicular robotic manipulation.

2606.06041 2026-06-05 cs.RO cs.AI cs.NE version update

Sample-efficient Low-level Motion Planning for Robotic Manipulation Tasks via Zero-shot Transfer Learning

通过零样本迁移学习实现机器人操作任务的样本高效低级运动规划

Yuanzhi He, Victor Romero-Cano, José J. Patiño, Juan David Hernández, William Sawtell, Gualtiero Colombo

Affiliations * School of Computer Science & Informatics, Cardiff University, Cardiff, UK

AI summary: Proposes the iCEM+TL framework, which raises success rates on complex manipulation tasks through transfer learning and reward redesign, improving them by up to 23% in simulation, with validation on a real robot.

Comments 12 pages, 5 figures; accepted at the International Conference on Artificial Neural Networks (ICANN) 2026

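Details
AI abstract (translated from Chinese)

As robotic systems grow more complex, the complexity of their motion planning models and their longer training times pose major challenges. Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently shown promise for low-level real-time planning by using efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM's performance can be limited in more complex scenarios, particularly those that require stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly uses transfer learning (TL), in which key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. In addition, we apply Reward Redesign (RR) through task decomposition to object stacking and shelf placement to optimize task-specific performance. Simulation results show that our framework improves success rates by up to 23%. The framework is further validated on a stacking task with a real Franka Emika robot, demonstrating its feasibility for practical deployment.

English abstract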

As robotic systems become more sophisticated, the growing complexity of their motion planning models and the longer training times pose substantial challenges. Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently demonstrated promising potential for low-level real-time planning by leveraging efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM's performance can be constrained in more complex scenarios, particularly those requiring stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly leverages Transfer Learning (TL), where key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. Additionally, we applied Reward Redesign (RR) through task decomposition for stacking objects and shelf placement to optimize task-specific performance. Results from the simulation show that our framework achieves success rate improvements of up to 23%. The framework is further validated on a real Franka Emika robot in a stacking task, demonstrating its practical feasibility for real-world deployment.
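
To make the idea of transferring sampler parameters between tasks concrete, here is a minimal sketch. It uses a plain cross-entropy method, not iCEM (no colored noise or elite reuse), and it assumes that the "key parameters" being transferred are the sampling mean and standard deviation; the abstract does not say which parameters the authors transfer, so that choice, the toy quadratic costs, and every name below are hypothetical.

```python
import random


def cem(cost, mean, std, iters, pop=64, n_elite=8, seed=0):
    """Plain cross-entropy method over a diagonal Gaussian (illustrative only)."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        elites = sorted(samples, key=cost)[:n_elite]
        for d in range(len(mean)):
            vals = [e[d] for e in elites]
            mean[d] = sum(vals) / n_elite
            var = sum((v - mean[d]) ** 2 for v in vals) / n_elite
            std[d] = max(var ** 0.5, 1e-3)  # floor keeps the sampler from collapsing
    return mean, std


def reach_cost(target):
    """Toy stand-in for a task cost: squared distance to a target action."""
    return lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target))


UPSTREAM_TARGET = [1.0, 1.0]    # simpler task, solved from a generic prior
DOWNSTREAM_TARGET = [1.2, 0.8]  # related, "harder" task

up_mean, up_std = cem(reach_cost(UPSTREAM_TARGET), [0.0, 0.0], [1.0, 1.0], iters=30)

# Transfer step (assumed): warm-start the downstream sampler from the upstream
# solution, re-widening the spread so it can still explore.
warm_std = [max(s, 0.3) for s in up_std]
down_mean, _ = cem(reach_cost(DOWNSTREAM_TARGET), up_mean, warm_std, iters=10, seed=1)
print(up_mean, down_mean)
```

The downstream run gets a third of the iterations of the upstream run because it starts close to a good solution, which is the intuition behind reusing knowledge from simpler tasks.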

2606.06040 2026-06-05 cs.RO cs.SY eess.SY version update

Gotta Grow Fast: Design and Benchmarking of a Tip Mount for High-Speed Vine Robots

快速生长:高速藤蔓机器人尖端支架的设计与基准测试

Antonio Alvarez Valdivia, Robert Reeve, Ankush Dhawan, Ciera McFarland, Chad Council, Margaret McGuinness, Nathaniel Hanson

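Affiliations * Massachusetts Institute of Technology; Lincoln Laboratory; Stanford University; University of Notre Dame

AI summary: Proposes a triangular roller tip mount that reduces growth resistance by rolling instead of sliding, achieves consistent eversion on a TPU-coated ripstop nylon vine robot, and establishes a repeatable benchmarking framework.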

Comments Accepted to IEEE Robotics & Automation Letters

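Details
AI abstract (translated from Chinese)

Soft, growing vine robots extend through a tip eversion mechanism that lets them navigate cluttered environments. However, integrating cameras and other sensors at the tip is uniquely challenging because the material that forms the tip is continually renewed as the robot grows. This continual material turnover, together with friction between inner layers, added tip weight, and fabric constriction, complicates sensor and tool mounting. These limitations hinder the use of vine robots in inspection and search tasks, where rapid growth while carrying tip-mounted sensors is essential. In this work, we present a triangular roller tip mount that contacts the robot body by rolling rather than sliding, reducing internal resistance during growth. The design was refined through iterative failure analysis, achieving consistent eversion on a TPU-coated ripstop nylon vine robot for the first time. To evaluate mount performance quantitatively, we introduce a custom testbed that isolates tip-mount effects by measuring tail tension during eversion. Comparative experiments across multiple mount variants, including prior designs, show that our triangular roller mount achieves the lowest tail tension and the most repeatable growth performance. These results establish both a validated tip mount design and a repeatable benchmarking framework for advancing sensor and tool integration in soft growing robots. CAD files for the mount and testbed are available at: https://sprout-mitll.github.io/tip_mounts/.

English abstract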

Soft, growing vine robots extend through tip eversion, a mechanism that enables navigation through cluttered environments. However, integrating cameras and other sensors at the tip is uniquely challenging because the material forming the tip is constantly renewed as the robot grows. This continual material turnover, combined with friction between internal layers, added tip weight, and fabric constriction, complicates sensor and tool mounting. These limitations hinder the deployment of vine robots for inspection and search tasks, where rapid growth while carrying tip-mounted sensors is essential. In this work, we present a triangular roller tip mount that reduces internal resistance during growth by rolling rather than sliding against the robot body. The design was refined through iterative failure analysis, enabling, for the first time, consistent eversion on a TPU-coated ripstop nylon vine robot. To quantitatively evaluate mount performance, we introduce a custom testbed that isolates tip mounting effects by measuring tail tension during eversion. Comparative experiments across multiple mount variants, including prior designs, show that our triangular roller mount achieves the lowest tail tension and most repeatable growth performance. These results establish both a validated tip mount design and a repeatable benchmarking framework for advancing sensor and tool integration in soft growing robots. CAD for the mount and testbed is available at: https://sprout-mitll.github.io/tip_mounts/.

2606.06014 2026-06-05 cs.AI cs.RO version update

PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

PLAN-S:通过潜在风格动态桥接规划以实现自动驾驶世界模型

Xiaoyun Qiu, Jingtao He, Yijie Chen, Yusong Huang, Haotian Wang, Yixuan Wang, Xinhu Zheng

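Affiliations * Intelligent Transportation Thrust, Systems Hub, and Center of Seamless Connectivity & Connected Intelligence, The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the PLAN-S framework, which decodes a style-conditioned semantic cost map from the latent representation to address the controllability of latent-world-model planning in autonomous driving, lowering collision rates and improving driving performance on nuScenes and NAVSIM.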

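Details
AI abstract (translated from Chinese)

Latent world models strengthen end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing planners based on latent world models usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, which makes driving-style dynamics hard to supervise, inspect, or modulate before the final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that resolves this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed upstream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while freezing the host backbones to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S lowers the baseline L2 error at every horizon, with an average L2 of 0.55 m and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches a Predictive Driver Model Score (PDMS) of 89.4, while the learned-cost variant provides complementary gains on scenes that challenge the baseline. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps whose spatially consistent variations align with different driving styles.

English abstract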

Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed upstream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned-cost variant provides complementary gains on baseline-challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.

2606.06011 2026-06-05 cs.RO cs.LG cs.MA version update

Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies

将基于模型的控制与多智能体强化学习相结合以实现多智能体协作团队策略

Christian Llanes, Spencer W. Jensen, Samuel Coogan

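Affiliations * Georgia Institute of Technology; Sandia National Laboratories

AI summary: Proposes a framework that combines multi-agent reinforcement learning with model predictive control (MA-AC-MPC), extending actor-critic model predictive control to obtain safe, dynamically feasible cooperative strategies, and shows in pursuit-evasion and heterogeneous environments that it outperforms a multi-layer perceptron model.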

Comments 12 pages, 8 figures, 7 tables

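Details
AI abstract (translated from Chinese)

In this work, we propose a framework that combines multi-agent reinforcement learning (MARL) with model-based control to achieve safe, dynamically feasible actions in cooperative multi-agent tasks. Multi-agent reinforcement learning has the advantage of learning cooperative policies for multi-agent teams from discrete, non-differentiable rewards over a long planning horizon. Model predictive control is robust and provides safe, dynamically feasible actions over short horizons in a fast replanning framework. We propose an algorithm that extends actor-critic model predictive control to MARL, which we call multi-agent actor-critic model predictive control (MA-AC-MPC). We demonstrate the algorithm's capabilities by applying it to a multi-agent pursuit-evasion scenario. Specifically, we compare the evader team's strategy when using the MA-AC-MPC model and a multi-layer perceptron model (MA-AC-MLP). The pursuer team uses augmented proportional navigation, since it is accepted as an advanced adversarial control law. We also provide an example in a heterogeneous environment in which a drone and an omni-wheeled rover cooperate to achieve repeatable, successful landings in hardware, with a 100% success rate for MA-AC-MPC compared with 60% for MA-AC-MLP. We demonstrate the robustness of the proposed MA-AC-MPC algorithm in hardware in both environments.

English abstract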

In this work, we propose a framework that combines multi-agent reinforcement learning (MARL) with model-based control to achieve safe, dynamically feasible actions in cooperative multi-agent tasks. Multi-agent reinforcement learning provides the advantage of learning cooperative policies for multi-agent teams from discrete non-differentiable rewards in a long planning horizon. Model-predictive control is robust and offers safe, dynamically feasible actions in a fast replanning framework for short horizons. We propose an algorithm that extends actor-critic model predictive control for MARL which we refer to as multi-agent actor-critic model predictive control (MA-AC-MPC). We demonstrate the capabilities of this algorithm by applying it to a multi-agent pursuit-evasion scenario. Specifically, we compare the evader team's strategy using the MA-AC-MPC model and a multi-layer perceptron model (MA-AC-MLP). The pursuer team uses augmented proportional navigation as it is accepted as an advanced adversarial control law. We also provide an example with a heterogeneous environment where a drone and omni-wheeled rover cooperate to achieve repeatable and successful landing with 100% success rate in hardware for MA-AC-MPC compared to 60% for MA-AC-MLP. We demonstrate the robustness of the proposed MA-AC-MPC algorithm in hardware for both environments.

2606.05979 2026-06-05 cs.RO cs.AI version update

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

世界-语言-动作模型:统一世界建模、语言推理与动作合成

Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

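Affiliations * SJTU (Shanghai Jiao Tong University); SII (Shanghai Innovation Institute); HUST (Huazhong University of Science and Technology); SCUT (South China University of Technology); ECUST (East China University of Science and Technology); SHU (Shanghai University); NJUPT (Nanjing University of Posts and Telecommunications)

AI summary: Proposes the world-language-action (WLA) model, in which an autoregressive Transformer jointly predicts textual subtasks, subgoal images, and robot actions, fusing world modeling and language reasoning to reach state-of-the-art performance on multi-task and long-horizon tasks.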

Comments 19 pages, 10 figures

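Details
AI abstract (translated from Chinese)

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as input and jointly predicts textual subtasks, subgoal images, and robot actions, combining the world modeling interface of world-action models (WAMs), which learns from large amounts of egocentric video, with the language reasoning ability of vision-language-action (VLA) models for solving complex long-horizon tasks. At the core of WLA is an autoregressive (AR) Transformer backbone, rather than the bidirectional diffusion Transformer used in WAMs, that predicts the next state, comprising semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by a world modeling objective based on a dedicated World Expert and are used to simplify the characterization of the state-action correlation for the Action Expert. WLA uses meta-queries so that world prediction influences action generation implicitly, which allows world prediction to be disabled at inference time. World prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype has 2B active parameters and takes 40 ms per inference on an NVIDIA RTX 5090. Evaluations in simulated and real-world environments show that WLA-0 achieves state-of-the-art multi-task and long-horizon learning ability, for example a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning new tasks directly from cross-embodiment robot videos without action annotations.

English abstract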

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.

2606.05975 2026-06-05 cs.CV cs.RO version update

T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation

T-FunS3D:任务驱动的分层开放词汇3D功能分割

Jingkun Feng, Reza Sabzevari

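Affiliations * P4MARS Lab at the Faculty of Aerospace Engineering, Delft University of Technology

AI summary: Proposes T-FunS3D, which builds an open-vocabulary scene graph and uses a vision-language model to perform task-driven hierarchical 3D functionality segmentation, improving speed and reducing memory consumption while maintaining performance.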

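Details
AI abstract (translated from Chinese)

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods focus mainly on object-level recognition, while scene-level part segmentation methods try to segment the entire scene exhaustively, which makes them resource-intensive and time-consuming. Balancing segmentation performance across granularity, accuracy, and speed remains a challenge. As a step toward alleviating this problem, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes the 3D point cloud and posed RGB-D images of an indoor scene as input. We build an open-vocabulary scene graph by extracting the instances in the environment and their visual embeddings. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and uses a vision-language model to locate their functional components. Experiments on the SceneFun3D dataset show that T-FunS3D is comparable to state-of-the-art methods in open-vocabulary 3D functionality segmentation while achieving faster runtime and lower memory usage.

English abstract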

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time-consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to the state of the art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.
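
One plausible way to "identify the most relevant instances in the scene graph" from a task description is to rank instance embeddings by cosine similarity against an embedding of the task text. The sketch below shows only that generic retrieval step under stated assumptions: the abstract does not specify the ranking rule, the three-dimensional vectors are made-up stand-ins for real visual and text embeddings, and the names `cosine` and `top_k_instances` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_instances(task_embedding, scene_graph, k=2):
    """Return the names of the k instances most similar to the task embedding."""
    ranked = sorted(
        scene_graph.items(),
        key=lambda item: cosine(task_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy scene graph: instance name -> made-up embedding (hypothetical values).
SCENE_GRAPH = {
    "cabinet": [0.9, 0.1, 0.0],
    "lamp": [0.0, 1.0, 0.1],
    "sofa": [0.1, 0.0, 1.0],
}

# Made-up embedding standing in for the task text "open the cabinet drawer".
TASK_EMBEDDING = [0.8, 0.2, 0.1]

candidates = top_k_instances(TASK_EMBEDDING, SCENE_GRAPH, k=2)
print(candidates)
```

Restricting the expensive part-level localization to a short candidate list like this is one way a hierarchical method can avoid segmenting the whole scene exhaustively.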

2606.05960 2026-06-05 cs.RO version update

Towards a Data Flywheel for Embodied Intelligence in Logistics

面向物流具身智能的数据飞轮

Anlan Yu, Zaishu Chen, Zhiqing Hong, Daqing Zhang

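Affiliations * Peking University; JD Logistics; HKUST (Guangzhou)

AI summary: Proposes a data-driven framework for embodied intelligence in logistics that builds a data flywheel to turn daily operations into reusable data assets, uses world models to generate reliable supervision for long-tail parcel manipulation, and integrates multimodal data for continual policy improvement.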

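Details
AI abstract (translated from Chinese)

Embodied intelligence is moving from laboratory demonstrations toward industrial deployment, and the logistics industry is a key application scenario. Learning-based policies offer a promising path beyond traditional perception-planning-control pipelines, but their scalability depends on how embodied data is collected, organized, and reused. This research explores a data-centric framework for industrial embodied intelligence by building a logistics data flywheel. Our framework turns daily operations into reusable data assets, uses world models to generate reliable supervision for long-tail parcel manipulation, and feeds deployment feedback back into policy improvement. As an initial result, WM-DAgger introduces a world-model-based data aggregation framework that synthesizes out-of-distribution recovery data for robust imitation learning. Building on this result, ongoing work explores how large-scale in-the-wild multimodal data, including labeled human demonstrations, unlabeled operational videos, and system-level robot logs, can be aligned for policy learning and turned into feedback for continual system improvement.

English abstract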

Embodied intelligence is moving from laboratory demonstrations toward industrial deployment, with the logistics industry serving as a key application scenario. Learning-based policies offer a promising path beyond traditional perception-planning-control pipelines, but their scalability depends on how embodied data can be collected, organized, and reused. This research studies a data-centric framework for industrial embodied intelligence by constructing a logistics data flywheel. Our framework converts daily operations into reusable data assets, uses World Models to generate reliable supervision for long-tail parcel manipulation, and feeds deployment feedback back into policy improvement. As an initial result, WM-DAgger introduces a World-Model-based data aggregation framework that synthesizes out-of-distribution recovery data for robust imitation learning. Building on this result, ongoing work explores how large-scale in-the-wild multimodal data, including labeled human demonstrations, unlabeled operational videos, and system-level robot logs, can be aligned for policy learning and transformed into feedback for continual system improvement.

2606.05952 2026-06-05 cs.RO cs.AI 版本更新

Learning of Robot Safety Policies via Adversarial Synthetic Scenarios

通过对抗性合成场景学习机器人安全策略

Nikolai Dorofeev, Alexey Odinokov, Rostislav Yavorskiy

发表机构 * National Research Institute of Automation and Applied Mathematics(国家自动化与应用数学研究所)

AI总结 提出一个基于对抗性游戏的框架,通过红蓝两队对抗生成危险场景并迭代优化安全策略,以高效发现高风险边缘案例。

详情
AI中文摘要

在这项工作中,我们提出了一种基于代理的博弈框架,通过合成场景进行危险告知的机器人安全策略学习。我们将场景生成建模为两个代理之间的对抗游戏:红队通过构建危险情况探索潜在故障空间,蓝队则逐步完善安全策略以防止这些故障。这种迭代过程能够高效发现通过随机模拟或手动枚举难以捕获的高风险边缘案例。通过将经典风险建模与对抗性场景生成及现代学习范式相结合,这项工作为在复杂现实环境中运行的物理AI系统嵌入安全性提供了一条可扩展的路径。本文描述了正在进行的工作,贡献在于问题形式化和提出的解决方案架构。

英文摘要

In this work, we propose an agentic gamification framework for hazard-informed learning of robot safety policies through synthetic scenarios. We model scenario generation as an adversarial game between two agents: a Red Team that explores the space of potential failures by constructing hazardous situations, and a Blue Team that incrementally refines safety policies to prevent them. This iterative process enables efficient discovery of high-risk edge cases that are unlikely to be captured through random simulation or manual enumeration. By combining classical risk modeling with adversarial scenario generation and modern learning paradigms, this work provides a scalable pathway for embedding safety into Physical AI systems operating in complex real-world environments. The paper describes ongoing work. The contribution is a problem formulation and a proposed solution architecture.

2606.05903 2026-06-05 cs.RO 版本更新

A Novel Method with Encoder-Decoder for Cross-Sensor Adaptation in Surface Shape Sensing with Sparse Strain Sensors

一种基于编码器-解码器的跨传感器自适应方法,用于稀疏应变传感器的表面形状感知

Shuo Wang, Heng Luo, Dian Jin, Xiaoming Tao

发表机构 * IEEE

AI总结 提出一种结合元学习和少样本适应的编码器-解码器架构,实现不同传感器阵列间的跨传感器自适应,显著降低新传感器部署所需的标注数据量和适应时间,将感知误差从23.0 mm降至约4.0 mm。

详情
AI中文摘要

由内在差异或安装条件引起的传感器阵列性能变化可能导致形状感知结果不一致。为了获得准确结果,通常需要大量数据,并且必须为每个传感器阵列重新训练单独的模型,从而增加了数据采集、传输和计算的时间和成本。为解决这一问题,本文提出了一种基于稀疏应变传感器的表面形状感知编码器-解码器架构,并进一步结合元学习和少样本适应策略,实现不同传感器阵列组之间的自适应。实验结果表明,经过跨传感器自适应后,新部署的传感器阵列仅需少于5.0%的新标注数据,适应时间低于1秒,即可达到约4.0 mm的感知误差,相比未适应时的23.0 mm误差和训练新模型所需的20分钟数据采集时间,有显著提升。此外,误差低于5.0 mm的点数增加了超过65.0%。这些结果表明,所提方法能大幅降低表面形状感知的成本和训练负担,在软体机器人和可穿戴设备中具有广泛的应用潜力。

英文摘要

Performance variations in sensor arrays, caused by intrinsic differences or installation conditions, can lead to inconsistent results during shape sensing. To obtain accurate results, a large amount of data is usually required, and a separate model must be retrained for each sensor array, thereby increasing the cost and time of data acquisition, transmission, and computation. To address this issue, this work proposes an encoder-decoder architecture for surface shape sensing based on sparse strain sensors and further incorporates meta-learning and few-shot adaptation strategies to enable adaptation across different groups of sensor arrays. Experimental results demonstrate that, after the cross-sensor adaptation, a newly deployed sensor array achieves a sensing error of approximately 4.0 mm relying on less than 5.0% newly labeled data and requiring an adaptation time of under 1 second, which represents a substantial improvement from 23.0 mm error without adaptation and 20-minute data collection time required to train a new model. Moreover, the number of points with errors below 5.0 mm increased by more than 65.0%. These results indicate that the proposed method can substantially reduce the cost and training burden of surface shape sensing, and it has broad potential applications in soft robotics and wearable devices.

2606.05880 2026-06-05 cs.RO 版本更新

TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion

TAGA:面向可泛化敏捷人形运动的地形感知主动注视学习

Peizhuo Li, Hongyi Li, Mingfeng Fan, Fangzhou Xu, Shuhao Liao, Yuxuan Ma, Zicheng Zeng, Ze Wang, Yongbin Jin, Yuhong Cao, Hongtao Wang, Guillaume Sartoretti

发表机构 * MarmotLab, National University of Singapore(马尔莫实验室,新加坡国立大学) Center of X-Mechanics, Zhejiang University(浙大X力学中心) South China University of Technology(华南理工大学)

AI总结 提出TAGA框架,通过融合视觉、本体感觉和运动命令,让模型学习主动注视地形关键区域,在有限计算资源下提高感知密度,实现鲁棒且可泛化的敏捷人形运动。

详情
AI中文摘要

在多样挑战性地形上的敏捷人形运动需要广泛的感知覆盖和精确的局部几何理解。受人类在运动中选择性注视相关地形的启发,我们提出了TAGA,一种用于基于注意力的人形控制的地形感知主动注视学习框架。通过融合视觉、本体感觉和运动命令,我们的框架引导模型学习预期线索并主动关注高度扫描的特定区域,选择性地将这些信息区域用于下游网络。这自适应地提高了在严格机载计算约束下观测的信息密度,从而在更大尺度地形上实现细粒度感知运动。我们发现,这种注视行为可以仅通过强化学习自然涌现,无需额外监督或显式指导,显著提高了训练效率。因此,训练后的策略在仿真和硬件上展示了鲁棒且可泛化的运动,包括可靠的地形感知落脚点选择、高台穿越、竞争性稀疏落脚点穿越,以及在感知人形运动系统中报告的最大实际间隙穿越距离1.2米,同时在严重感知干扰和环境干扰下保持稳定性。

英文摘要

Agile humanoid locomotion across diverse challenging terrain demands both wide perceptual coverage and precise local geometry understanding. Motivated by the way humans selectively look at relevant terrain during locomotion, we introduce TAGA, a Terrain-aware Active Gaze learning framework for Attention-based humanoid control. By fusing vision, proprioception, and motion commands, our framework guides the model to learn anticipatory cues and actively attend to specific areas of the height scan, selectively using these informative regions for the downstream network. This adaptively increases the information density of observations under tight onboard computational constraints, thus enabling fine-grained perceptive locomotion over larger-scale terrains. We find that such gaze behaviors can naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance, which significantly improves training efficiency. As a result, the trained policy demonstrates robust and generalizable locomotion in simulation and on hardware, including reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold traversal, and the largest reported real-world gap traversal distance of 1.2 m among perceptive humanoid locomotion systems, while maintaining stability under severe perceptual disturbances and environmental interference.

2606.05873 2026-06-05 cs.RO cs.AI cs.CV cs.LG 版本更新

LadderMan: Learning Humanoid Perceptive Ladder Climbing

LadderMan: 学习人形机器人感知爬梯

Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi

发表机构 * Amazon FAR(亚马逊FAR) USC(美国南加州大学) UC Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) CMU(卡内基梅隆大学)

AI总结 提出LadderMan系统,通过两阶段学习管道和视觉基础模型,使人形机器人能够鲁棒地攀爬多种梯子并在梯子上进行操控。

详情
AI中文摘要

人形机器人在以人为中心的环境中具有巨大潜力,但由于稀疏的立足点和手抓点、复杂的全身协调以及对感知和控制误差的敏感性,爬梯仍然是最具挑战性的任务之一。我们提出了LadderMan,一个统一的系统,使人形机器人能够鲁棒地攀爬多种梯子并在这种受限条件下进行操控。我们的攀爬策略基于一个可扩展的两阶段学习管道,其中我们使用混合运动跟踪从单个参考运动学习多个攀爬专家,并通过混合模仿和强化学习将这些专家蒸馏成一个统一的基于深度视觉的运动攀爬策略。为了实现真实世界部署,我们利用视觉基础模型来弥合深度感知中的模拟到现实差距。基于学习到的攀爬策略,我们进一步使用双智能体公式训练一个独立的操控策略,允许通过遥操作在梯子上进行稳定操控。实验表明,LadderMan在多种几何形状的梯子上实现了鲁棒的攀爬,以零样本方式成功迁移到真实世界硬件,并在具有挑战性的梯子约束下支持各种操控任务。视频结果见https://ladderman-robot.github.io。

英文摘要

Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io.

2606.05848 2026-06-05 cs.RO 版本更新

Visuotactile and Explicitly Force-Controlled Robotic Ultrasound for Abdominal Volumetric Reconstruction

用于腹部体积重建的视觉触觉和显式力控制机器人超声

Adrian Piedra, R Brooke Jeffrey, Oussama Khatib

发表机构 * Stanford Robotics Laboratory, Computer Science Department, Stanford University(斯坦福机器人实验室、计算机科学系、斯坦福大学) Department of Radiology, School of Medicine, Stanford University(放射科、医学院、斯坦福大学)

AI总结 提出一种结合立体视觉、触觉反馈和专家策略的机器人超声采集系统,通过力控机械臂实现自适应腹部扫描,并实现三维体积重建以增强诊断能力。

详情
AI中文摘要

在本文中,我们提出了一种机器人超声采集系统,该系统集成立体视觉、基于触摸的反馈和专家策略,以执行自主和自适应的腹部扫描。系统记录来自放射科专家的徒手运动和力数据,创建一个框架来捕获探头运动、施加的力和解剖扫描策略。这些专家数据被重放以用机器人复制特征扫描,为进一步的自主能力奠定基础。利用立体视觉,系统生成患者腹部的三维地形图,并通过关键点的刚度测量来细化,以描绘肋骨边界。这些组合技术使机器人能够执行两种不同的扫描路径:肋骨下方向上倾斜的扫描以可视化上腹部附近的结构,以及穿过软组织区域的垂直扫描。一个柔顺的、扭矩控制的七自由度机器人操纵器通过闭环力控制来保持与不同解剖表面的一致探头接触。物理实验表明,该系统在动态适应患者特定地形的同时,实现了与专家扫描相当的高质量成像。此外,机器人系统通过实现三维体积采集超越了专家能力,这增强了诊断潜力并为高级分析提供了体积数据。这项工作突出了将专家知识集成到自主机器人系统中,并强调了将基于感知的自主性与物理推理相结合以增强诊断性能的潜力。

英文摘要

In this paper, we present a robotic ultrasound acquisition system that integrates stereo vision, touch-based feedback, and expert-informed strategies to perform autonomous and adaptive abdominal scans. The system records freehand motion and force data from expert radiologists, creating a framework to capture transducer motion, applied forces, and anatomical scanning strategies. This expert data is replayed to replicate characteristic scans with the robot, forming a foundation for further autonomous capabilities. Using stereo vision, the system generates three-dimensional topography maps of the patient's abdomen, which are refined through stiffness measurements at key points to delineate the rib cage boundary. These combined techniques enable the robot to execute two distinct scanning paths: an upward-angled sweep beneath the rib cage to visualize structures near the upper abdomen and a perpendicular sweep across soft tissue regions. A compliant, torque-controlled seven degree-of-freedom robotic manipulator is controlled to maintain consistent probe contact through closed-loop force control over the varied anatomical surfaces. Physical experiments demonstrate that the system achieves high-quality imaging comparable to expert scans while dynamically adapting to patient-specific topographies. Furthermore, the robotic system surpasses expert capabilities by enabling three-dimensional volume acquisition, which enhances diagnostic potential and provides volumetric data for advanced analyses. This work highlights the integration of expert knowledge into autonomous robotic systems and underscores the potential of combining perception-based autonomy with physical reasoning for enhanced diagnostic performance.

2606.05840 2026-06-05 eess.SY cs.RO cs.SY 版本更新

Amortized Nonlinear Model Predictive Control

摊销非线性模型预测控制

Francesco Pillitteri, Alberto Bemporad

发表机构 * IMT School for Advanced Studies(IMT高级研究学院)

AI总结 针对输入仿射非线性系统,提出一种基于状态依赖二次规划的单网络残差校正架构,通过可微内点层保证约束满足,实现实时非线性模型预测控制,在机械臂跟踪任务中取得数量级加速。

Comments 6 pages

详情
AI中文摘要

非线性模型预测控制需要在每个采样时刻实时求解一个约束非线性规划(NLP),这是一个计算瓶颈,限制了在资源受限硬件或高采样率下的部署。我们针对输入仿射非线性系统这一广泛类别解决了这一挑战,证明了最优控制动作可以通过一个状态依赖的二次规划(QP)来近似,其成本参数取决于当前状态和参考。我们提出了一种单网络残差校正架构:一个状态依赖的解析基线提供初始QP参数,网络仅学习匹配完整NLP解所需的校正;QP通过一个可微内点层求解,保证了第一个控制动作的约束满足。该网络使用由NLP求解器生成的数据进行离线训练,采用结合监督模仿和KKT残差惩罚的混合损失。我们在一个具有笛卡尔末端执行器跟踪的三连杆平面机械臂上验证了该方法,展示了相比NLP求解器数量级的加速,同时保持了可比的跟踪性能。

英文摘要

Nonlinear Model Predictive Control requires solving a constrained nonlinear program (NLP) in real time at every sampling instant, a computational bottleneck that limits deployment on resource-constrained hardware or at high sampling rates. We address this challenge for the broad class of input-affine nonlinear systems by showing that the optimal control move can be approximated by a state-dependent quadratic program (QP) whose cost parameters depend on the current state and reference. We propose a single-network residual-corrector architecture: a state-dependent analytic baseline provides initial QP parameters, and the network learns only the corrections needed to match the full NLP solution; the QP is solved by a differentiable interior-point layer, guaranteeing constraint satisfaction for the first control action. The network is trained offline on data generated by an NLP solver using a hybrid loss that combines supervised imitation and KKT-residual penalties. We validate the approach on a three-link planar robotic arm with Cartesian end-effector tracking, demonstrating orders-of-magnitude speedup over the NLP solver while maintaining comparable tracking performance.
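The residual-corrector idea in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names, the toy baseline, and above all the solver. The abstract describes a differentiable interior-point layer, whereas this sketch restricts the QP to a positive diagonal Hessian with box constraints so that clipping the unconstrained minimiser is exact.

```python
# Illustrative sketch only: names and the toy model are assumptions, not the paper's code.

def solve_box_qp_diag(h_diag, g, lower, upper):
    """Minimise 0.5 * sum(h_i * u_i^2) + sum(g_i * u_i) with lower_i <= u_i <= upper_i.

    With a positive diagonal Hessian the problem separates per coordinate,
    so clipping the unconstrained minimiser -g_i / h_i to the box is exact.
    """
    return [min(max(-gi / hi, lo), up)
            for hi, gi, lo, up in zip(h_diag, g, lower, upper)]


def baseline_qp_params(state, reference):
    """Hypothetical analytic baseline: unit curvature, minimiser at (reference - state)."""
    h_diag = [1.0 for _ in state]
    g = [-(r - x) for x, r in zip(state, reference)]
    return h_diag, g


def amortized_control(state, reference, corrector, lower, upper, h_floor=1e-6):
    """Baseline QP parameters plus a learned residual, followed by a constrained QP solve."""
    h_base, g_base = baseline_qp_params(state, reference)
    dh, dg = corrector(state, reference)  # stands in for the trained network
    # Flooring the curvature keeps the QP strictly convex after correction.
    h_diag = [max(hb + d, h_floor) for hb, d in zip(h_base, dh)]
    g = [gb + d for gb, d in zip(g_base, dg)]
    return solve_box_qp_diag(h_diag, g, lower, upper)


def zero_corrector(state, reference):
    """Placeholder corrector that leaves the baseline untouched."""
    return [0.0 for _ in state], [0.0 for _ in state]
```

Because the control is always the output of a constrained solve, the returned input respects the bounds regardless of what the corrector outputs, which mirrors the constraint-satisfaction argument in the abstract.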

2606.05773 2026-06-05 cs.RO 版本更新

PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

PiL-World: 用于VLA策略环内评估的块式世界模型

Chong Ma, Taiyi Su, Jian Zhu, Jianjun Zhang, Zitai Huang, Yi Xu, Hanli Wang

发表机构 * Tongji University(同济大学) AIRC, Midea Group(美的集团人工智能研究院)

AI总结 提出PiL-World,一种块式世界模型,通过交替VLA推理和世界模型预测实现闭环评估,无需真实机器人执行,显著降低成功率估计误差。

详情
AI中文摘要

视觉-语言-动作(VLA)策略在真实机器人任务中闭环运行:机器人观察场景,执行一个动作块,并根据结果观察决定下一步。然而,大多数现有的用于机器人动作评估的世界模型仅限于沿预收集动作轨迹进行开环预测。这阻碍了它们支持闭环VLA评估,其中每个动作块必须基于先前执行产生的观察。为填补这一空白,我们提出PiL-World,一种专为策略环内VLA评估设计的块式世界模型。给定当前观察和VLA策略展开的动作轨迹,PiL-World生成与VLA展开一致的多视角未来观察,并匹配策略所需的图像输入。通过交替VLA推理和世界模型预测,PiL-World实现了无需每一步真实机器人执行的闭环评估。为提高展开保真度,PiL-World将视频生成条件化为从头部视角机器人运动导出的动作视觉控制和编码任务执行上下文的潜在历史,同时联合预测互补的多视角观察。除了成功的遥操作演示,它还从失败的执行轨迹中学习,帮助想象展开更好地匹配真实策略执行的分布。我们在三个真实双臂操作任务上评估PiL-World。PiL-World生成的想象展开与真实机器人执行高度一致。更重要的是,与基线相比,它将真实世界展开中测量的VLA成功率与通过闭环世界模型评估估计的VLA成功率之间的误差从63.2%降低到12.0%。

英文摘要

Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
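The alternation between VLA inference and world-model prediction described above can be sketched as a short loop. This is a minimal illustration under stated assumptions: `policy`, `world_model`, and `is_success` are hypothetical callables, and the integer toy environment stands in for image observations. None of these names come from PiL-World itself.

```python
# Illustrative sketch only: the callables and the toy task are hypothetical stand-ins.

def closed_loop_rollout(policy, world_model, first_obs, is_success, max_chunks=20):
    """Alternate policy inference and world-model prediction, chunk by chunk."""
    obs = first_obs
    for chunk_index in range(max_chunks):
        action_chunk = policy(obs)            # policy acts on the current (imagined) observation
        obs = world_model(obs, action_chunk)  # imagined observation after executing the chunk
        if is_success(obs):
            return True, chunk_index + 1
    return False, max_chunks


def estimated_success_rate(policy, world_model, initial_observations, is_success, max_chunks=20):
    """Fraction of imagined rollouts that reach success within the chunk budget."""
    outcomes = [
        closed_loop_rollout(policy, world_model, obs, is_success, max_chunks)[0]
        for obs in initial_observations
    ]
    return sum(outcomes) / len(outcomes)


# Toy stand-ins: the "observation" is an integer position and the goal is position >= 10.
def toy_policy(obs):
    return [1, 1, 1]


def toy_world_model(obs, action_chunk):
    return obs + sum(action_chunk)


def toy_success(obs):
    return obs >= 10
```

The key property is that each call to `policy` sees the observation produced by the previous chunk, which is what distinguishes this loop from open-loop prediction along a pre-collected action trajectory.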

2606.05737 2026-06-05 cs.CV cs.AI cs.LG cs.RO 版本更新

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

让它简单:视觉-语言-动作模型的单步动作生成

Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Innovation Institute(上海创新研究院) Fudan University(复旦大学)

AI总结 针对视觉-语言-动作(VLA)模型,提出通过偏置训练时间分布至高噪声状态,实现无需教师模型、蒸馏或辅助目标的单步动作生成,性能可匹配十步解码。

Comments 20 pages, 10 figures

详情
AI中文摘要

基于扩散的视觉-语言-动作(VLA)模型通常继承图像生成的观点:动作通过迭代去噪生成。我们认为VLA动作生成具有不同的条件-目标结构:策略以丰富的观测、语言和状态为条件,但仅预测紧凑的低维动作块。在这种不对称性下,强单步动作生成不一定需要为图像合成开发的先进单步方法。我们保持标准速度预测,不添加教师模型、蒸馏阶段或辅助目标;在我们的主要方案中,我们简单地将训练时间分布偏向高噪声状态。我们首先在受控的MNIST网格到序列任务中隔离效果,然后通过广泛的机器人策略实验进行测试。在标准LIBERO、LIBERO-Plus和LIBERO-Pro上,使用偏向高噪声的调度训练的单步策略通常匹配相同方案下的十步解码,并且在标准LIBERO上可以超过使用均匀时间分布训练的十步策略。真实机器人双臂YAM RSS评估提供了相同采样器趋势的小样本跨架构检查。在具有30M动作头的1.4B VLM模型上,单步解码在LIBERO-Long上达到95.6%。这些结果表明,强单步VLA动作生成可以从标准扩散训练中涌现,而无需引入为图像生成开发的完整少步扩散机制。

英文摘要

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
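A minimal sketch of what biasing the training time distribution toward high-noise states could look like, together with one-step decoding under velocity prediction. The conventions are assumptions chosen for illustration and may differ from the paper: t = 0 is pure noise, t = 1 is the clean action chunk, the interpolation is linear, and the bias is produced by raising a uniform draw to a power (the `power` parameter is hypothetical).

```python
import random

# Illustrative sketch only: the time convention and the power-law bias are assumptions.

def sample_time_high_noise(rng, power=3.0):
    """Draw t in [0, 1); raising a uniform draw to power > 1 concentrates mass near t = 0 (noise)."""
    return rng.random() ** power


def interpolate(noise, action, t):
    """Linear path from noise (t = 0) to the clean action chunk (t = 1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def velocity_target(noise, action):
    """Velocity of the linear path; the regression target for standard velocity prediction."""
    return [a - n for n, a in zip(noise, action)]


def make_training_example(rng, action, power=3.0):
    """One training tuple (x_t, t, v_target) with the time drawn from the biased distribution."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = sample_time_high_noise(rng, power)
    return interpolate(noise, action, t), t, velocity_target(noise, action)


def one_step_decode(velocity_fn, noise):
    """Single Euler step of size 1 from t = 0: action = noise + v(noise, 0)."""
    v = velocity_fn(noise, 0.0)
    return [n + vi for n, vi in zip(noise, v)]
```

With `power=3.0` the mean sampled time is 0.25 rather than 0.5, so the model sees far more near-noise inputs, which is exactly the regime a single decoding step starts from.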

2606.05699 2026-06-05 cs.RO 版本更新

DexFuture: Hierarchical Future-State Visuomotor Targeting for Bimanual Dexterous Tool Use

DexFuture: 用于双手灵巧工具使用的分层未来状态视觉运动目标

Runfa Blark Li, Kuang-Ting Tu, Nikola Raicevic, Dwait Bhatt, Xinshuang Liu, Keito Suzuki, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen

发表机构 * UC San Diego(加州大学圣迭戈分校)

AI总结 提出DexFuture分层系统,通过高层未来状态视觉运动目标预测器和低层目标条件结构化灵巧策略,实现双手灵巧工具使用,达到90%的特权oracle性能,运行速度60Hz,比DexWM式CEM规划快约250倍。

详情
AI中文摘要

双手灵巧工具使用对机器人来说仍然具有挑战性,因为手部配置维度高,且手-工具-物体动力学和接触复杂。大多数现有控制策略依赖于演示提供的未来配置参考,而未来动作条件世界模型需要对高维动作序列进行缓慢的在线规划。一个重大挑战是生成动态一致的未来参考轨迹,而不依赖于演示中的特权状态或缓慢的反事实规划。我们提出DexFuture,一个分层系统,将高层未来状态视觉运动目标预测器与低层目标条件结构化灵巧策略耦合。基于自我中心RGB、本体感觉和几何历史,高层预测器构建结构化的手-工具-物体视觉运动嵌入,并使用时域条件Transformer生成多步未来目标轨迹。然后,低层策略通过目标条件逐连杆Transformer跟踪这些轨迹。这种分层结构将粗略的未来参考生成与细粒度的动作控制解耦,并将缓慢的长时域语义预测与高频执行解耦。在OakInk2双手工具使用任务上,DexFuture达到了90%的特权oracle性能,而无参考策略仅为7%。DexFuture以60Hz运行,比DexWM风格的交叉熵方法(CEM)规划(使用未来动作条件世界模型)快约250倍。

英文摘要

Bimanual dexterous tool use remains challenging for robots due to high-dimensional hand configurations and complex hand-tool-object dynamics and contact. Most existing control policies depend on future configuration references provided by demonstrations, while future action-conditioned world models require slow online planning over high-dimensional action sequences. A significant challenge is generating a dynamically consistent future reference trajectory without relying on privileged states from demonstrations or slow counterfactual planning. We propose DexFuture, a hierarchical system that couples a high-level Future-State Visuomotor Target Predictor with a low-level Target-Conditioned Structured Dexterous Policy. Conditioned on egocentric RGB, proprioceptive and geometric history, the high-level predictor constructs structured hand-tool-object visuomotor embeddings and uses a horizon-conditioned transformer to generate a multi-step future target trajectory. The low-level policy then tracks this trajectory with a target-conditioned per-link transformer. This hierarchy decouples coarse future reference generation from fine-grained action control, and slow long-horizon semantic prediction from high-frequency execution. On OakInk2 bimanual tool-use tasks, DexFuture achieves 90% of the privileged-oracle performance, compared to 7% for a no-reference policy. DexFuture operates at 60 Hz, approximately 250 times faster than DexWM-style Cross-Entropy Method (CEM) planning with a future action-conditioned world model.

2606.05687 2026-06-05 cs.RO cs.SY eess.SY 版本更新

Accelerating and Scaling MPC-Guided Reinforcement Learning for Humanoid Locomotion and Manipulation

加速与扩展MPC引导的强化学习在类人机器人行走与操作中的应用

Junheng Li, Liang Wu, Sergio A. Esteban, Lizhi Yang, Ján Drgoňa, Aaron D. Ames

发表机构 * California Institute of Technology(加州理工学院) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文提出了一种基于质心动力学MPC奖励的MPC-RL框架,并开发了并行批处理GPU求解器π^nMPC,以高效实现类人机器人的行走与操作技能。

Comments 8 pages, 5 figures

详情
AI中文摘要

在类人运动控制中,模型预测控制(MPC)提供基于物理的预测和约束处理,而强化学习(RL)通过大规模仿真实现鲁棒的全身技能。然而,在RL内部使用MPC通常需要耗时的问题构建或过高的训练开销,使得此类框架在实践中难以证明其合理性。本文研究了训练时高效的MPC引导方法用于类人机器人行走与操作,称为MPC-RL。我们引入了一种基于质心动力学的MPC奖励公式,在训练时利用MPC轨迹的引导。为了在大规模并行RL中实现这一点,我们开发了π^nMPC,一种并行时域且无需构建的批处理GPU MPC求解器,它直接操作时变动力学以避免高内存使用和预编译。通过多种对比研究和硬件验证,我们发现MPC-RL在行走和操作技能上实现了优越的性能。代码库可在https://github.com/junhengl/mpc-rl获取。

英文摘要

In humanoid motion control, model predictive control (MPC) offers physically grounded prediction and constraint handling, while reinforcement learning (RL) enables robust whole-body skills through large-scale simulation. However, using MPC inside RL often requires time-consuming problem construction or excessive training overhead, making such frameworks difficult to justify in practice. This work studies efficient training-time MPC guidance for humanoid locomotion and manipulation, termed MPC-RL. We introduce a centroidal-dynamics MPC reward formulation that leverages guidance from MPC trajectories at training time. To make this practical in massively parallel RL, we develop $π^n$MPC, a parallel-in-horizon and construction-free batched GPU MPC solver that operates directly on time-varying dynamics to avoid high memory usage and pre-compilation. Through a variety of comparative studies and hardware validations, we find that MPC-RL achieves superior performance in locomotion and manipulation skills. The code base is available at https://github.com/junhengl/mpc-rl.

2606.05669 2026-06-05 cs.RO cs.SY eess.SY 版本更新

Dynamic Multi-Agent Pickup and Delivery in Robotic Cellular Warehousing Systems

机器人化仓储系统中的动态多智能体取送货

Cheng Ren, Ming Li, Xinping Guan, George Q. Huang

发表机构 * Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University(工业与系统工程系,香港理工大学) School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(自动化与智能感知学院,上海交通大学)

AI总结 针对订单内部SKU动态追加的仓库场景,首次形式化动态多智能体取送货问题,提出两种基于令牌传递的事件触发在线重规划算法,显著降低订单流时间。

详情
AI中文摘要

机器人化仓储系统(RCWS)引发多智能体取送货(MAPD)过程,其中机器人按顺序为每个订单收集多个库存单位(SKU)。与假设静态任务的经典MAPD公式不同,真实仓库操作通常涉及动态订单演变,即在订单执行过程中可能追加新的SKU。受此实际需求驱动,本文首次考虑内部订单演变,形式化了动态多智能体取送货问题。基于令牌传递范式,我们提出了两种事件触发在线重规划算法。第一种,动态令牌传递,通过添加订单分解和基于优先级的令牌调度,在订单更新时执行局部重规划,同时保持无碰撞执行。第二种,协作令牌传递,进一步使空闲机器人能够机会性地协助新添加的取货任务,提高系统级效率。在RCWS环境中的仿真结果表明,与静态和非协作基线相比,所提方法显著减少了订单流时间。

英文摘要

Robotic Cellular Warehousing Systems (RCWS) give rise to multi-agent pickup and delivery (MAPD) processes in which robots sequentially collect multiple stock-keeping units (SKUs) for each order. Unlike classical MAPD formulations that assume static tasks, real warehouse operations often involve dynamic order evolution, where new SKUs may be appended to an order while it is being executed. Motivated by this practical requirement, this letter formulates the Dynamic Multi-Agent Pickup and Delivery problem considering internal order evolution for the first time. Building on the token passing paradigm, we propose two event-triggered online replanning algorithms. The first, Dynamic Token Passing, performs localized replanning upon order updates through add-order decomposition and priority-based token scheduling while preserving collision-free execution. The second, Cooperative Token Passing, further enables idle robots to opportunistically assist newly added pickups, improving system-level efficiency. Simulation results in RCWS environments demonstrate that the proposed methods significantly reduce order flowtime compared with static and non-cooperative baselines.

2606.05663 2026-06-05 cs.RO 版本更新

Preserving Full 6-DOF Actuation Under Abrupt Total Rotor Failures: Passive Fault-Tolerant Flight Control Using a Biaxial-Tilt Hexacopter

在突然完全旋翼故障下保持完整六自由度驱动:使用双轴倾斜六旋翼的被动容错飞行控制

Yipeng Yang, Yiqiao Tang, Hao Zhang, Jinqi Jiang, Jianfeng He, Rumo Chen, Xinghu Yu, Zhan Li, Huijun Gao

发表机构 * Tsinghua University(清华大学)

AI总结 本文针对双轴倾斜过驱动六旋翼在突发完全旋翼故障下,提出两种无需故障检测的被动容错控制方案,实现完整六自由度轨迹跟踪,并通过仿真和实验验证其鲁棒性。

详情
AI中文摘要

传统多旋翼在突发完全旋翼故障下,可达力旋量空间(AWS)迅速缩小,使得完整的六自由度恢复在物理上不可能。本文研究了双轴倾斜过驱动六旋翼(BTO)在控制器事先未知的突发完全旋翼故障下的被动容错飞行。控制设计与分析聚焦于代表性的突发旋翼故障情况,其中故障后系统仍保持完全驱动,且不假设显式的故障检测、隔离或故障模式切换。首先,我们通过引入瞬态力旋量跳跃项扩展了AWS的内接球度量,从而能够在最多三个同时旋翼故障下进行定量可行性评估,并与单轴倾斜和共面六旋翼进行基准比较。其次,我们开发了两种计算高效的被动方案,不依赖故障检测或在线优化。一种方案在控制器层运行,将高阶全驱动(HOFA)控制器与线性扩展状态观测器(LESO)结合,用于集总扰动抑制。另一种方案在分配器层运行,使用基于模型参考的自适应控制分配和基于动量的力旋量估计来补偿控制分配偏差。仿真和飞行实验验证了在单个和多个旋翼故障下的稳定悬停和六自由度轨迹跟踪。进一步系统比较证实,BTO比单轴倾斜和共面设计提供更大的恢复裕度。额外的仅机载传感器实验,包括风扰下的室内跟踪、极端条件下的室外跟踪、窄框穿越和基于接触的空中书写,进一步验证了所提框架在复杂操作环境中的鲁棒性。

英文摘要

Conventional multirotors suffer from a rapid collapse of attainable wrench space (AWS) under abrupt total rotor failures, rendering full 6-DOF recovery physically impossible. This paper addresses passive fault-tolerant flight of a biaxial-tilt overactuated hexacopter (BTO) under abrupt total rotor failures that are a priori unknown to the controller. The control design and analysis focus on representative abrupt rotor-failure cases for which the post-failure system remains fully actuated, while no explicit fault detection, isolation, or fault-mode switching is assumed. First, we extend the inscribed-sphere metric of the AWS by incorporating the transient-wrench-jump term, enabling quantitative feasibility assessment under up to three simultaneous rotor failures and benchmarking against uniaxial-tilt and coplanar hexacopters. Second, we develop two computationally efficient passive schemes without relying on fault detection or online optimization. One scheme operates at the controller layer by combining a high-order fully actuated (HOFA) controller with a linear extended state observer (LESO) for lumped-disturbance rejection. The other scheme operates at the allocator layer by using model-reference adaptive control allocation with momentum-based wrench estimation to compensate for control-allocation biases. Simulations and flight experiments validate stable hovering and 6-DOF trajectory tracking under single and multiple rotor failures. Further systematic comparisons confirm that the BTO provides larger recovery margins than uniaxial-tilt and coplanar designs. Additional onboard-sensor-only experiments, including indoor tracking under wind disturbance, outdoor tracking under extreme conditions, narrow-frame traversal, and contact-based aerial writing, further validate the robustness of the proposed framework in complex operational environments.

2606.05660 2026-06-05 cs.RO cs.AI 版本更新

Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation

面向长时域任务的安全具身AI:机器人操作跨层分析

Dabin Kim, Daemin Park, Sangyub Lee, Jinsik Kim, Yeongtak Oh, Jongho Shin, Sungroh Yoon

发表机构 * UNIST InnoCORE AI-Space Solar Initiative(UNIST创新核心人工智能空间太阳能计划) Ulsan National Institute of Science and Technology (UNIST)(蔚山国立科学技术研究院) Automation and Systems Research Institute(自动化与系统研究所) Department of Electrical and Computer Engineering(电气与计算机工程系) Interdisciplinary Program in Artificial Intelligence(人工智能跨学科项目) LG Electronics(LG电子)

AI总结 本文从具身AI视角,系统综述长时域机器人操作中的安全问题,按干预时机(规划时、策略时、执行时)组织文献,分析证据强度,并指出当前安全保证的不足与未来方向。

Comments 63 pages, 6 figures

详情
AI中文摘要

具身AI系统日益被期望在物理环境中进行长时间跨度的推理和行动。这种不断增强的能力将安全问题推向前台,因为物理世界中的失败可能伤害人、损坏物体并扰乱工作场所。尽管安全具身AI已引起广泛关注,但文献在规划、策略设计和运行时执行方面仍然分散。长时域机器人操作是这一问题特别具有揭示性的锚定领域,因为语义误解、子任务级错误传播、执行漂移和接触丰富的物理风险可能在同一个闭环系统中累积。因此,本综述从具身AI视角对长时域机器人操作中的安全性进行了结构化回顾。我们按干预时机组织文献,涵盖规划时、策略时和执行时的安全性,并分析每条工作提供的证据强度,区分形式化保证、统计支持和经验安全启发式。这一框架阐明了骨干能力论文、直接安全机制以及基准或评估研究的独特作用,同时揭示了当前安全声明在哪些方面得到良好支持,在哪些方面仍然间接。我们识别了持续的空白,包括策略时安全性的有限证据、接触丰富长时域操作的形式化支持薄弱、不成熟的不确定性触发干预以及缺乏操作特定的安全基准。最后,我们概述了跨层保证、评估设计以及长时域机器人代理在真实世界环境中更安全部署的研究方向。

英文摘要

Embodied AI systems are increasingly expected to reason and act over extended horizons in physical environments. This growing capability brings safety to the foreground, because failures in the physical world can harm people, damage objects, and disrupt workplaces. Although safe embodied AI has attracted substantial attention, the literature remains fragmented across planning, policy design, and runtime execution. Long-horizon robotic manipulation is a particularly revealing anchor domain for this problem because semantic misgrounding, subtask-level error propagation, execution drift, and contact-rich physical risk can accumulate within the same closed-loop system. This survey therefore provides a structured review of safety in long-horizon robotic manipulation from an embodied AI perspective. We organize the literature by intervention locus, covering planning-time, policy-time, and execution-time safety, and we analyze the strength of the evidence that each line of work provides, distinguishing formal guarantees, statistical support, and empirical safety heuristics. This framework clarifies the distinct roles of backbone capability papers, direct safety mechanisms, and benchmark or evaluation studies, while exposing where current safety claims are well supported and where they remain indirect. We identify persistent gaps, including limited evidence for policy-time safety, weak formal support for contact-rich long-horizon manipulation, immature uncertainty-triggered intervention, and a shortage of manipulation-specific safety benchmarks. We conclude by outlining research directions for cross-layer assurance, evaluation design, and safer deployment of long-horizon robotic agents in real-world settings.

2606.05588 2026-06-05 cs.RO cs.LG 版本更新

Auditing Demonstration Curation Metrics: Action-Only Scorers Fail on the Structural Defects That Degrade Imitation Policies

审计示范策展指标:仅动作评分器在降低模仿策略的结构缺陷上失败

Aarav Bedi

发表机构 * Aarav Bedi

AI总结 本研究构建受控测试平台,注入两类示范缺陷(细微扰动和结构错误),审计七种策展指标,发现仅动作指标无法检测结构错误,且部分指标评分倒置,而状态轨迹指标能部分检测但下游性能恢复有限。

Comments 5 pages, 3 figures, 4 tables

详情
AI中文摘要

模仿学习策略继承了其训练示范的质量,越来越多的策展指标声称能自动评分和过滤低质量示范。这些指标各自在不同协议的不同数据上验证,因此不清楚哪些指标真正识别出损害策略的示范。我们构建了一个受控测试平台,其中示范缺陷以已知类型注入,并沿两个轴审计七种策展指标:每个指标区分缺陷示范与清洁示范的效果,以及基于每个指标策展的子集训练行为克隆策略是否提高任务成功率。我们研究两种缺陷机制。细微扰动(相关动作噪声、震颤、截断)可通过多变量离群值评分检测,一旦移除,可恢复全部下游差距。结构错误,即示范在关键时刻执行错误动作,对我们测试的每个仅动作指标都是不可见的,其中两个指标是倒置的:它们将缺陷示范评分为更高质量,并用于策展时,往往使策略处于或低于未策展基线,而非高于基线。只有检查状态轨迹的指标能检测结构错误,即使最好的指标也只能恢复三分之一的下游差距。高检测准确性并不保证下游改进。我们发布了测试平台和所有策展实现。

英文摘要

Imitation-learning policies inherit the quality of the demonstrations they are trained on, and a growing set of curation metrics promise to score and filter low-quality demonstrations automatically. These metrics are each validated on different data with different protocols, so it is unclear which of them actually identify the demonstrations that harm a policy. We build a controlled testbed in which demonstration defects are injected with known type, and audit seven curation metrics along two axes: how well each separates defective from clean demonstrations, and whether training a behavior-cloning policy on each metric's curated subset improves task success. We study two defect regimes. Subtle perturbations (correlated action noise, tremor, truncation) are detectable by multivariate outlier scoring and, once removed, recover the full downstream gap. Structural errors, where the demonstration executes a wrong action at a key moment, are invisible to every action-only metric we test, and two of them are inverted: they score defective demonstrations as higher quality and, used for curation, tend to leave the policy at or below the uncurated baseline rather than above it. Only metrics that examine the state trajectory detect structural errors, and even the best of them recovers just a third of the downstream gap. High detection accuracy does not guarantee downstream improvement. We release the testbed and all curation implementations.
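The abstract says subtle perturbations are detectable by multivariate outlier scoring. One common instance of that idea is the Mahalanobis distance, sketched below for two per-demonstration features. Both the choice of Mahalanobis distance and the two features are assumptions made for illustration (the paper's seven metrics are not specified in the abstract). The sketch also covers only the subtle-perturbation regime, because the abstract reports that action-only scores of this kind miss structural errors.

```python
# Illustrative sketch only: the scorer and the two features are assumptions.

def mahalanobis_scores_2d(features):
    """Mahalanobis distance of each 2-D feature vector from the sample mean.

    `features` is a list of (f1, f2) pairs, one per demonstration, for example
    hypothetical summaries such as mean action magnitude and mean action jerk.
    """
    n = len(features)
    mean_x = sum(f[0] for f in features) / n
    mean_y = sum(f[1] for f in features) / n
    # Unbiased sample covariance entries.
    sxx = sum((f[0] - mean_x) ** 2 for f in features) / (n - 1)
    syy = sum((f[1] - mean_y) ** 2 for f in features) / (n - 1)
    sxy = sum((f[0] - mean_x) * (f[1] - mean_y) for f in features) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance matrix is singular; features are degenerate")
    scores = []
    for x, y in features:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form with the closed-form inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        scores.append(max(d2, 0.0) ** 0.5)
    return scores


def flag_most_anomalous(features):
    """Index of the demonstration with the largest outlier score."""
    scores = mahalanobis_scores_2d(features)
    return scores.index(max(scores))
```

A curation step would drop the highest-scoring demonstrations before training. As the abstract stresses, a high detection score does not guarantee a downstream gain, so any such filter still needs to be validated by the resulting policy's task success.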

2606.05572 2026-06-05 cs.ET cs.HC cs.RO physics.app-ph 版本更新

Wave Focusing in Metamaterials: Tactile Displays Beyond the Diffraction Limit

超材料中的波聚焦:超越衍射极限的触觉显示器

Gregory Reardon, Max Linnander, Dustin Goetz, Neeli Tummala, Yon Visell

发表机构 * Media Arts and Technology Program(媒体艺术与技术项目) Department of Mechanical Engineering(机械工程系) Department of Electrical and Computer Engineering(电气与计算机工程系) University of California, Santa Barbara(加州大学圣芭芭拉分校)

AI总结 本文利用局部共振超材料板中的慢波分支实现机械波聚焦,突破衍射极限,生成高分辨率虚拟触觉像素,并将像素面积缩小十倍。

详情
AI中文摘要

我们解决了工程化分布式触觉显示器的挑战,该显示器能够在表面上任意位置再现多个局部化、可独立寻址的振动——代表虚拟触觉像素。我们的技术基于使用稀疏的致动器阵列在弯曲板中聚焦机械波。在触觉频率下,波衍射阻止了在多指触摸交互相关空间尺度上形成局部化虚拟触觉像素。我们通过在板上增加机械共振器晶格,形成局部共振超材料板,克服了这一限制。板的动态模式与共振器模式之间的耦合改变了控制波传播的色散关系,引入了一个慢波分支,使得能够超越未修改板所施加的衍射极限进行聚焦。我们使用数值模拟来设计超材料系统的色散关系,以实现触觉频率下的高分辨率聚焦。然后,我们制造了一个超材料触觉显示器,并实验证明虚拟像素比在没有共振器的相同板上生成的像素更加局部化,导致虚拟像素面积缩小十倍。在行为实验中,我们展示了该系统能够传递感知上局部化的单点和多点触觉反馈以及移动触觉源,同时保持对多个显示位置的时间波形的独立控制。这里报告的方法可以使用少量致动自由度实现高分辨率触觉显示器,适用于广泛应用。

英文摘要

We address the challenge of engineering distributed haptic displays capable of reproducing multiple localized, independently addressable vibrations -- representing virtual tactile pixels -- at arbitrary locations on a surface. Our technique is based on the focusing of mechanical waves in a flexural plate using a sparse set of actuators. At tactile frequencies, wave diffraction prevents the formation of localized virtual tactile pixels at spatial scales relevant for multi-digit touch interactions. We overcome this limitation by augmenting the plate with a lattice of mechanical resonators, forming a locally resonant metamaterial plate. Coupling between the plate's dynamic modes and those of the resonators alters the dispersion relation governing wave transmission, introducing a slow-wave branch that enables focusing beyond the diffraction limit imposed by the unmodified plate. We use numerical simulations to engineer the dispersion relation of the metamaterial system for high-resolution focusing at tactile frequencies. We then fabricate a metamaterial tactile display and experimentally demonstrate virtual pixels that are far more localized than those generated on an otherwise identical plate without resonators, resulting in a tenfold reduction in virtual-pixel area. In behavioral experiments, we show that this system can deliver perceptually localized single- and multi-point tactile feedback and moving tactile sources while maintaining independent control over temporal waveforms at multiple display locations. The methods reported here can enable high-resolution haptic displays for widespread applications using a small number of actuated degrees of freedom.

2606.05533 2026-06-05 cs.LG cs.AI cs.CV cs.RO 版本更新

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

物体能做什么,而非它们是什么:面向功能可供性推理的功能潜在空间

Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Neurosymbolic Intelligence(神经符号智能) University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出A4D框架,通过构建基于功能可供性的共享潜在空间,将视觉观察映射到该空间并测量与可供性的距离,实现基于物体功能而非外观的规划推理,显著提升泛化能力和推理效率。

Comments Code, videos, and data available at: https://A4Dance-reasoning.github.io

详情
AI中文摘要

现有的机器人规划系统依赖于基于外观的推理,其中视觉观察被编码到围绕物体外观组织的潜在空间中(例如,根据外观识别“手推车”)。然而,规划需要推理物体的任务相关功能(例如,物体是否“可移动”),而基于外观的潜在空间无法捕捉这些信息。因此,现有方法难以泛化到新颖的机器人-物体交互。我们通过功能可供性推理解决这一泛化能力有限的问题,使规划基于任务相关的物体功能而非仅外观。我们提出A4D,它将视觉观察映射到一个围绕可供性(例如“可移动”)组织的共享潜在空间中。通过将视觉观察投影到这个功能潜在空间并测量它们与可供性的接近程度,A4D推断出与观察物体相关的功能。此外,我们引入了一种可供性发现机制,扩展潜在空间以处理现有可供性不足的未见场景。A4D利用功能潜在空间中的接近度来量化可供性推理的不确定性,并选择性地触发可供性发现。我们在涉及多样化和未见可供性的多个规划任务上评估A4D。A4D在现有可供性上达到94%的推理准确率,比最先进方法高出超过15个百分点;在不到原始训练数据10%的情况下,将新可供性推理准确率从70%提升到90%以上,并实现100倍更快的推理。代码、视频和数据可在https://A4Dance-reasoning.github.io获取。

英文摘要

Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with fewer than 10% of the original training data, and enables 100x faster inference. Code, videos, and data are available at: https://A4Dance-reasoning.github.io.
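
The proximity-based inference and the selective trigger for affordance discovery can be sketched as a nearest-prototype lookup with a similarity threshold. This is a hedged illustration, not A4D's implementation: the use of cosine similarity, the prototype vectors, the threshold `min_similarity`, and every name below are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def infer_affordance(embedding, prototypes, min_similarity=0.8):
    """Return (best_affordance, similarity, needs_discovery).

    needs_discovery is True when even the closest known affordance is
    too far away, i.e. the inference is considered uncertain.
    """
    best_name, best_sim = max(
        ((name, cosine(embedding, proto)) for name, proto in prototypes.items()),
        key=lambda item: item[1],
    )
    return best_name, best_sim, best_sim < min_similarity


# Hypothetical 3-D latent space with two known affordances.
prototypes = {"movable": [1.0, 0.0, 0.0], "graspable": [0.0, 1.0, 0.0]}
cart_embedding = [0.9, 0.1, 0.0]      # close to "movable"
unseen_embedding = [0.1, 0.1, 1.0]    # far from every known affordance

print(infer_affordance(cart_embedding, prototypes))
print(infer_affordance(unseen_embedding, prototypes))
```

In this toy setup the first observation is assigned "movable" with high similarity, while the second falls below the threshold and would be handed to a discovery step that adds a new affordance to the space.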

2606.05501 2026-06-05 cs.RO 版本更新

Learning Contact Representation for Leg Odometry

学习足式里程计的接触表示

Emre Girgin, Cagri Kilic

发表机构 * Department of Aerospace Engineering, Embry Riddle Aeronautical University(航空航天工程系,埃姆布里-瑞德航空大学)

AI总结 提出一种自监督表示学习框架,仅利用关节编码器标准传感器集进行接触检测,无需力传感器,在足式机器人里程计中优于监督方法和基线概率方法。

Comments 17 pages

详情
AI中文摘要

足式机器人里程计的估计依赖于一个假设:在支撑相期间,足部相对于世界的速度保持为零。主体速度的反馈来自足部的运动学串行链,因此准确的腿部相位检测是一个关键子问题。大量研究使用安装在足尖的地面反作用力传感器进行分类,但这些传感器可能并非所有足式机器人普遍可用。此外,这些传感器通常对未考虑的干扰(如足部与地面接触时的滑动)不敏感。在本研究中,我们提出了一种用于接触检测的自监督表示学习框架,该框架利用关节编码器的标准传感器集,无需依赖力传感器增强。我们使用学习到的表示来概率性地建模支撑相和摆动相。实验结果证实了所提出的自监督接触检测器的有效性。我们的框架在性能上优于需要传感器集增强和标注的监督方法以及基线概率方法。此外,我们将代码公开。

英文摘要

The estimation of odometry in legged robots depends on the assumption that the velocity of the foot with respect to the world remains zero during the stance phase. Feedback for the main body velocity is derived from the kinematic serial chain of the feet, making accurate leg phase detection a critical subproblem. A considerable number of studies employ ground reaction force sensors mounted at the tip of the foot to classify leg phases, yet these sensors may not be universally available for all legged robots. Additionally, these sensors are often unresponsive to unaccounted disturbances, such as slippage, while the foot remains in contact with the ground. In this study, we propose a self-supervised representation learning framework for contact detection that utilizes the standard sensor set of joint encoders without reliance on force sensor augmentations. We employ learned representations to model the stance and swing phases probabilistically. The experimental results confirm the efficacy of the proposed self-supervised contact detector. Our framework outperforms supervised methods, which require sensor-set augmentation and labeling, as well as baseline probabilistic approaches. Additionally, we make our code available to the public.

2606.05491 2026-06-05 cs.CV cs.RO 版本更新

Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers

无配对RGB-热成像高斯泼溅使用视觉几何变换器

Jean Cordonnier, Chenghao Xu, Olga Fink, Malcolm Mielle

发表机构 * Ecole Polytechnique Federale de Lausanne(洛桑联邦理工学院) Schindler EPFL Lab(迅达EPFL实验室)

AI总结 提出一种无配对RGB-热成像新视角合成框架,利用VGGT估计各模态相机位姿并通过Procrustes对齐,结合多模态3D高斯泼溅实现联合重建,在保持RGB保真度的同时实现热成像视图合成。

Comments Accepted at ICRA 2026's Workshop MM-SpatialAI: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding

详情
AI中文摘要

结合RGB和热成像的多模态新视角合成(NVS)能够利用视觉和热信息进行精确的3D场景重建。然而,现有方法通常依赖于精确校准的RGB-热成像图像对或立体设置,限制了可扩展性和实际部署。为了解决这个问题,我们引入了一个无配对RGB-热成像NVS框架,该框架利用VGGT(一种3D前馈变换器架构)独立估计每个模态的相机位姿。然后使用Procrustes算法与跨模态特征匹配器对齐位姿集,从而无需配对校准即可实现联合配准。在此对齐基础上,我们进一步提出了一种多模态3D高斯泼溅方法,直接从无配对的RGB和热成像图像中学习。在多种场景上的实验表明,我们的方法在热成像视图合成中取得了有竞争力的性能,同时保持了RGB保真度。此外,我们表明现有的重建方法可能产生缺乏跨模态一致性的特定模态重建。因此,我们引入了一个基准框架,以严格评估每个模态的图像合成以及重建场景的多模态一致性。

英文摘要

Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
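
The pose-set alignment step can be illustrated with a planar analogue of Procrustes alignment. The sketch below is an assumption-laden toy, not the paper's pipeline: it works in 2-D using complex numbers, assumes point correspondences are already known (in the paper they would come from the cross-modal feature matcher), and the function name `procrustes_2d` is made up for this example. The 3-D case is usually solved with an SVD-based variant instead.

```python
def procrustes_2d(source, target):
    """Least-squares similarity transform mapping source onto target.

    Points are complex numbers (x + yj). Returns (a, b) such that
    target ~= a * source + b, where abs(a) is the scale and the phase
    of a is the rotation angle.
    """
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    src_c = [z - mean_s for z in source]
    tgt_c = [w - mean_t for w in target]
    # Closed-form minimizer of sum |w - a z|^2 over complex a.
    a = sum(z.conjugate() * w for z, w in zip(src_c, tgt_c)) / sum(
        abs(z) ** 2 for z in src_c
    )
    b = mean_t - a * mean_s
    return a, b


# Hypothetical camera centers in the RGB frame, and the same cameras
# expressed in a thermal frame that is rotated 90 degrees, scaled by 2,
# and shifted.
rgb_centers = [0 + 0j, 1 + 0j, 1 + 1j, 0 + 1j]
true_a, true_b = 2j, 3 - 1j
thermal_centers = [true_a * z + true_b for z in rgb_centers]

a, b = procrustes_2d(rgb_centers, thermal_centers)
print("scale:", abs(a), "offset:", b)
```

Because the toy data are noise-free, the recovered transform matches the one used to generate them; with real, noisy matches the same formula gives the least-squares fit.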

2606.05468 2026-06-05 cs.RO 版本更新

FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

FlowPRO:通过近端偏好优化对流匹配VLA进行无奖励强化微调

Yihao Wu, He Zhang, Junbo Tan, Xueqian Wang, Zhengyou Zhang

发表机构 * Tencent Robotics X(腾讯机器人X实验室) Futian Laboratory(福田实验室) Tsinghua University(清华大学)

AI总结 提出FlowPRO框架,通过近端偏好优化(RPRO)和干预-回滚数据收集方法,实现无奖励的离线强化微调,在四类长时程双臂任务中取得最高成功率。

详情
AI中文摘要

将视觉-语言-动作(VLA)模型后训练为可在真实机器人上可靠部署的策略仍然是一个主要瓶颈。SFT和DAgger仅间接利用失败信号,而基于奖励的强化学习则受限于真实世界奖励设计的难度以及训练可靠评论家的困难。我们提出FlowPRO,一种针对流匹配VLA的无奖励离线强化微调框架。在算法上,我们提出RPRO(机器人流匹配近端偏好优化),一种针对VLA模型流匹配动作头定制的偏好优化目标。RPRO将对比优化器与显式近端正则化器配对,该正则化器锚定隐式奖励的绝对幅度,从而消除了普通Flow-DPO的奖励黑客失败模式。在数据方面,一种遥操作干预-回滚范式通过单个操作员动作在真实机器人上自然产生成对的正负轨迹$(τ^w, τ^l)$;平滑插值过程结合批量混合,然后将这些稀疏修正转换为密集的每状态监督,同时保留基础策略的能力。在四项长时程双臂任务上,FlowPRO取得了最高成功率,优于四个代表性基线,消融实验证实了每个损失组件的贡献。

英文摘要

Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. SFT and DAgger exploit failure signals only indirectly, and reward-based RL is bottlenecked by the difficulty of real-world reward design and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. Algorithmically, we propose RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the flow-matching action head of VLA models. RPRO pairs a contrastive optimizer with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward, thereby eliminating the reward-hacking failure mode of plain Flow-DPO. On the data side, a teleoperated intervention-and-rollback paradigm produces naturally paired positive and negative trajectories $(τ^w, τ^l)$ on a real robot from a single operator action; a Smooth Interpolation procedure, combined with batch mixing, then converts these sparse corrections into dense per-state supervision while preserving the base policy's capabilities. On four long-horizon bimanual tasks, FlowPRO attains the highest success rate, outperforming four representative baselines, and ablations confirm the contribution of each loss component.

2606.05437 2026-06-05 cs.RO cs.CV 版本更新

Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation

不确定性感知的自适应传感器融合用于自主导航

Simegnew Yihunie Alaba, Yuichi Motai

发表机构 * IEEE

AI总结 提出一种结合无迹卡尔曼滤波(UKF)的混合深度学习方法,通过不确定性感知的自适应融合视觉和惯性特征,提高自主导航中视觉惯性里程计(VIO)的位姿估计精度。

Comments 13 pages

详情
AI中文摘要

本文介绍了一种混合深度学习方法,与无迹卡尔曼滤波(UKF)相结合,以增强自主导航中视觉惯性里程计(VIO)的位姿估计精度。所提出的模型采用视觉变换器(ViT)网络有效捕获惯性测量单元(IMU)数据的时间依赖性,并利用多尺度卷积神经网络(MCNN)从视觉数据中学习基于光流的运动线索。自适应传感器融合模块通过利用估计的不确定性动态加权IMU和视觉特征,从而在多样且具有挑战性的环境条件下提高鲁棒性。此外,提出了一种新颖的不确定性感知损失函数,将预测不确定性明确纳入学习过程,使得在噪声、不完整或不可靠的传感器输入下实现鲁棒且准确的导航。在KITTI数据集上的全面评估表明,所提出的方法显著优于基线方法,在绝对轨迹误差(ATE)和相对位姿误差(RPE)方面实现了优越性能。该轻量且计算高效的模型在NVIDIA A100 GPU上以155 FPS处理数据,非常适合部署在资源受限的自主系统中。

英文摘要

This work introduces a hybrid deep learning approach integrated with an Unscented Kalman Filter (UKF) to enhance pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation. The proposed model employs a Vision Transformer (ViT) network to effectively capture temporal dependencies from inertial measurement unit (IMU) data and utilizes a Multiscale Convolutional Neural Network (MCNN) to learn optical flow-based motion cues from visual data. An adaptive sensor fusion module dynamically weights IMU and visual features by leveraging estimated uncertainty, thus improving robustness in diverse and challenging environmental conditions. Additionally, a novel uncertainty-aware loss function is proposed to explicitly incorporate prediction uncertainty into the learning process, enabling robust and accurate navigation under noisy, incomplete, or unreliable sensor inputs. Comprehensive evaluations on the KITTI dataset demonstrate that the proposed method significantly outperforms baseline approaches, achieving superior performance in terms of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The lightweight and computationally efficient model processes data at 155 FPS on an NVIDIA A100 GPU, making it highly suitable for deployment in resource-constrained autonomous systems.
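
The abstract does not give the exact form of the adaptive fusion module. One common way to weight two sources by their estimated uncertainty is inverse-variance weighting, sketched below on scalar estimates. This is an assumed stand-in for illustration only, and the function name `fuse` and the numbers are hypothetical.

```python
def fuse(estimates, variances):
    """Inverse-variance weighted fusion of scalar estimates.

    Each estimate is weighted by 1 / variance, so the more certain
    source dominates. Returns (fused_estimate, fused_variance).
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, estimates)) / total
    return fused, 1.0 / total


# Hypothetical forward-displacement estimates (meters) for one frame pair:
# the visual branch is confident, the IMU branch is noisier.
visual_estimate, visual_variance = 1.0, 0.04
imu_estimate, imu_variance = 1.4, 0.16

fused, fused_variance = fuse(
    [visual_estimate, imu_estimate], [visual_variance, imu_variance]
)
print(f"fused estimate: {fused:.3f} m, variance: {fused_variance:.3f}")
```

The fused value lands closer to the visual estimate because its variance is four times smaller; if the visual branch became unreliable (for example in low light) and reported a larger variance, the weighting would shift toward the IMU.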

2606.05422 2026-06-05 cs.RO 版本更新

Learning from Demonstrations over Riemannian Manifolds using Neural ODEs: An Extended Abstract

利用神经常微分方程在黎曼流形上从示范中学习:扩展摘要

Diana Cuervo Espinosa, Mahathi Anand, Angela P. Schoellig

发表机构 * ETH Zürich(苏黎世联邦理工学院)

AI总结 针对机器人状态(如方向)在弯曲空间上演化的问题,提出利用神经常微分方程在黎曼流形上从示范中学习,通过数值估计测地线实现自然运动生成,并降低计算开销。

Comments 2 pages

详情
AI中文摘要

从示范中学习(LfD)通常在欧几里得空间中进行,而机器人状态(例如方向)自然地在弯曲空间上演化。因此,为了确保自然、复杂的运动生成,我们研究在能够编码位置和方向数据的黎曼流形上从示范中学习。在这里,测地线提供了流形内任意两点之间的自然运动。我们提出通过神经常微分方程数值估计测地线,以减轻现有方法的大计算开销。最后,这些测地线可以在部署到机器人之前解码回原始任务空间。在这篇扩展摘要中,我们讨论了我们框架的架构,提供了一些来自仿真实验的初步见解,包括与其他测地线计算机制的比较,并讨论了未来工作的挑战和前景。

英文摘要

Learning from demonstrations (LfD) is usually performed over Euclidean spaces, while the robot state, e.g., orientation, naturally evolves over curved spaces. Therefore, to ensure natural, complex motion generation, we investigate learning from demonstrations over Riemannian manifolds that are capable of encoding both position and orientation data. Here, geodesic paths provide natural motion between two arbitrary points within the manifold. We propose to numerically estimate geodesics via neural ordinary differential equations, mitigating the large computational overhead of existing approaches. Finally, these geodesics can be decoded back into the original task space before deployment on the robot. In this extended abstract, we discuss the architecture of our framework, provide some initial insights from our simulation experiments, including a comparison to other geodesic computation mechanisms, and discuss the challenges and prospects for future work.

2606.05407 2026-06-05 cs.RO 版本更新

MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping

MoDex:用于顺序多物体灵巧抓取的扩散策略

Haofei Lu, Hongjia Liu, Yifei Dong, Florian T. Pokorny, Jens Lundell, Danica Kragic

发表机构 * Department of Robotics, Perception and Learning, KTH Royal Institute of Technology(机器人、感知与学习系,皇家理工学院) Robotics and Autonomous Systems at University of Turku(图尔库大学机器人与自主系统)

AI总结 提出MoDex扩散策略,通过对向空间(opposition space)和点云条件预测抓取姿态,实现单只灵巧手顺序抓取多物体而不释放已抓物体,并通过两阶段训练(模仿学习+强化学习微调)提升成功率。

Comments Submitted to CoRL 2026

详情
AI中文摘要

本工作解决了用单只灵巧手顺序抓取多个物体而不释放已抓物体的问题。大多数灵巧抓取方法将手的所有自由度用于单个物体,未能充分利用其灵巧性,且没有为后续抓取留下冗余。所提出的解决方案MoDex是一种扩散策略,它直接从观测中预测下一个抓取器姿态,并以对向空间(opposition space)和点云为条件。对向空间条件指定了哪些手指参与当前抓取,使抓取器仅使用其可用自由度的一个子集,同时保留剩余自由度用于后续抓取。为了促进从仿真到现实的迁移,MoDex分两个阶段训练:首先通过专家演示的模仿学习,然后通过强化学习微调,这持续提高了预训练策略的成功率。我们在基于MuJoCo的Franka Emika Panda机器人(配备Allegro Hand)的仿真中以及相应的真实世界硬件平台上评估了MoDex。在仿真和真实世界实验中,MoDex均取得了比所评估的基于学习的基线方法更高的成功率,性能分别提升了2.92-17.92%和6.67-17.78%。项目页面:https://modex2026.github.io/。

英文摘要

This work addresses sequentially grasping multiple objects with a single dexterous hand without releasing those already held. Most dexterous grasping methods commit all of the hand's degrees of freedom to a single object, underutilizing its dexterity and leaving no redundancy for subsequent grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the remaining degrees of freedom for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first through imitation learning on expert demonstrations, and subsequently through reinforcement learning fine-tuning, which consistently improves success rates over the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand and on the corresponding real-world hardware platform. Across both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively. Project page: https://modex2026.github.io/.

2605.02192 2026-06-05 cs.RO 版本更新

Do We Really Need Immediate Resets? Rethinking Collision Handling for Efficient Robot Navigation

我们真的需要立即重置吗?重新思考高效机器人导航的碰撞处理

Shanze Wang, Xinming Zhang, Siwei Cheng, Xianghui Wang, Changwen Chen, Hailong Huang, Wei Zhang

发表机构 * College of Information Science and Technology, Eastern Institute of Technology(信息科学与技术学院,东部技术学院) Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University(航空与航空工程系,香港理工大学) Department of Computing, The Hong Kong Polytechnic University(计算系,香港理工大学) School of Computer Science and Technology, University of Science and Technology of China(计算机科学与技术学院,中国科学技术大学) Department of Mechanical Engineering, The Hong Kong Polytechnic University(机械工程系,香港理工大学)

AI总结 针对机器人导航中每次碰撞立即重置环境的惯例,提出多碰撞重置预算(MCB)框架,通过将局部碰撞终止与全局环境重置解耦,允许智能体在同一回合内重试困难配置,从而提高早期学习效率。

Comments 8 pages, 9 figures

详情
AI中文摘要

一次碰撞是否必然终止整个导航回合?在大多数用于机器人导航的深度强化学习(DRL)框架中,这仍然是标准做法:每次碰撞都会立即触发全局环境重置,并被视为完全任务失败而受到惩罚。虽然部署期间的碰撞自然表示任务失败,但在训练期间应用相同的处理会阻止智能体探索具有挑战性的障碍物配置,从而在早期训练阶段减慢学习进度。在这项工作中,我们挑战了这一惯例,并提出了一种多碰撞重置预算(MCB)框架,该框架将局部碰撞终止与全局环境重置解耦,允许智能体在同一回合内重试困难配置。仿真实验表明,MCB通过更少的交互达到目标成功率水平,提高了早期学习效率,其中小的碰撞预算产生最一致的收益。在异构机器人平台上的真实世界实验进一步验证了所学策略在杂乱环境中的可部署性。

英文摘要

Should a single collision necessarily terminate an entire navigation episode? In most deep reinforcement learning (DRL) frameworks for robot navigation, this remains the standard practice: every collision immediately triggers a global environment reset and is penalized as a complete task failure. While a collision during deployment naturally indicates task failure, applying the same treatment during training prevents the agent from exploring challenging obstacle configurations, which slows learning progress in the early training phase. In this work, we challenge this convention and propose a Multi-Collision reset Budget (MCB) framework that decouples local collision termination from global environment resets, allowing the agent to retry difficult configurations within the same episode. Simulation experiments show that MCB improves early-stage learning efficiency by reaching target success-rate levels with fewer interactions, with small collision budgets producing the most consistent gains. Real-world experiments on heterogeneous robot platforms further validate the deployability of the learned policies in cluttered environments.
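
The decoupling of local collision termination from global environment resets can be sketched as an episode loop with a collision counter. Everything below is a toy under stated assumptions, not the MCB implementation: the `CorridorEnv` class, the rule that the agent stays in its pre-collision cell after a collision, the hand-written policy, and the omission of any reward or penalty are all choices made for this example.

```python
class CorridorEnv:
    """Toy 1-D corridor: start at cell 0, goal at the last cell.

    Stepping onto an obstacle cell counts as a collision.
    """

    def __init__(self, length=6, obstacles=(3,)):
        self.length = length
        self.obstacles = set(obstacles)
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        """action is a step size (1 or 2). Returns (pos, collided, reached_goal)."""
        nxt = min(self.pos + action, self.length - 1)
        if nxt in self.obstacles:
            return self.pos, True, False  # local termination: agent stays put
        self.pos = nxt
        return self.pos, False, self.pos == self.length - 1


def run_episode(env, policy, collision_budget, max_steps=50):
    """Run one episode; it ends in failure only once the budget is exceeded."""
    state = env.reset()
    collisions = 0
    for _ in range(max_steps):
        state, collided, done = env.step(policy(state, collisions))
        if collided:
            collisions += 1
            if collisions > collision_budget:
                return {"success": False, "collisions": collisions}
            continue  # retry from the pre-collision state, same episode
        if done:
            return {"success": True, "collisions": collisions}
    return {"success": False, "collisions": collisions}


def policy(state, collisions):
    """Hypothetical policy: steps by 1, but jumps by 2 at cell 2 after a collision."""
    return 2 if (state == 2 and collisions > 0) else 1


immediate_reset = run_episode(CorridorEnv(), policy, collision_budget=0)
with_budget = run_episode(CorridorEnv(), policy, collision_budget=1)
print(immediate_reset)
print(with_budget)
```

A budget of 0 reproduces the conventional immediate-reset behavior, so the episode ends at the first collision. With a budget of 1 the same episode continues, and the agent gets to try the obstacle again and reach the goal.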

2606.05395 2026-06-05 cs.RO cs.AI 版本更新

VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents

VASO:物理AI智能体的形式可验证自进化技能

Yunhao Yang, Neel P. Bhatt, Kevin Wang, Samuel Tetteh, Zhangyang Wang, Ufuk Topcu

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Iowa State University(爱荷华州立大学)

AI总结 提出VASO框架,通过形式验证引导LLM生成的机器人技能合约自进化,将模型检查的反例转化为文本梯度更新技能合约,无需微调模型权重,在Jackal和四旋翼任务中达到97.2%的形式规范符合率。

Comments Project webpage: https://languagegroundedriskdetection.github.io/ProjectPage/vaso-webpage/

详情
AI中文摘要

可重用的机器人技能正在成为具身智能体将开放式指令转化为长时域物理行为的基本单元。我们认为,虽然基础模型大幅降低了创建这些技能的成本,但信任它们的成本并未降低。现有的技能进化循环通过执行反馈、单元测试、环境奖励或LLM自我批评来改进技能,但这些信号仅提供痕迹级别的证据:它们表明技能在采样执行中有效,而非技能引发的计划在未经测试的条件下满足时间安全合约。我们提出VASO,一个用于验证引导的LLM生成机器人技能合约自进化的框架。在VASO中,每个技能被表示为具有两个耦合接口的语义合约:一个形式接口,将机器人状态、观测和控制命令与用于模型检查的逻辑命题对齐;一个面向规划器的接口,指导可执行行为的生成。模型检查器首先过滤逻辑不一致的技能合约,然后验证由该技能引发的计划是否满足全局和局部时间规范。当验证失败时,VASO将反例轨迹转化为文本梯度,更新可重用的技能合约,同时保持基础模型权重冻结。在Clearpath Jackal和PX4四旋翼任务中,VASO使用少于100个优化样本达到了97.2%的形式规范符合率,优于执行反馈、提示优化和微调基线。据我们所知,VASO是首个将形式验证与物理AI智能体的自进化LLM生成技能闭环的框架:形式反例成为可重用机器人技能合约的优化反馈,而不仅仅是验证一次性计划、调优规划器提示或微调模型权重。

英文摘要

Reusable robot skills are becoming the basic units through which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level evidence: they show that a skill worked on sampled executions, not that skill-induced plans satisfy temporal safety contracts under untested conditions. We introduce VASO, a framework for verification-guided self-evolution of LLM-generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner-facing interface that guides executable behavior generation. A model checker first filters logically inconsistent skill contracts, then verifies plans induced by the skill against global and local temporal specifications. When verification fails, VASO translates the counterexample trace into a textual gradient that updates the reusable skill contract while keeping foundation-model weights frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal-specification compliance using fewer than 100 optimization samples, outperforming execution-feedback, prompt-optimization, and fine-tuning baselines. To our knowledge, VASO is the first framework that closes the loop between formal verification and self-evolving LLM-generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than merely verifying one-off plans, tuning planner prompts, or fine-tuning model weights.

2606.05372 2026-06-05 cs.RO cs.CG 版本更新

Efficient Computation of Distance Functions for Navigation Vector Fields in Lie Groups

李群中导航向量场距离函数的高效计算

Vinicius M. Gonçalves, João Baião, Felipe Bartelt, Douglas G. Macharet, Gustavo M. Freitas, Héctor Azpúrua, Luciano C. A. Pimenta

发表机构 * University of São Paulo(圣保罗大学)

AI总结 针对李群中基于向量场的路径跟踪问题,提出一种利用G-多项式曲线结构将距离计算简化为多项式求根的高效方法,显著降低计算时间并保持精度。

详情
AI中文摘要

基于向量场的方法被广泛用于机器人控制,并常应用于路径跟踪问题。一些向量场方法需要重复计算机器人配置与曲线之间的距离以及相应的最近点。最近,向量场已被扩展到李群。在这种情况下,这种计算可能非常昂贵,尤其是在嵌入式平台上以高控制频率执行时。本文提出了一种高效计算点与曲线之间距离的方法,该曲线表示为所谓的G-多项式曲线,这是一种将多项式曲线推广到矩阵李群的曲线表示。所提出的方法利用这些曲线的结构,将问题简化为少量多项式求根计算。仿真结果表明,与现有的基于优化的方法相比,该方法在保持精度的同时显著减少了计算时间。还提供了SE(3)群情况下的实用公式,并在机器人机械臂上进行了实验验证。该方法已在一个计算包中实现,可在线获取。

英文摘要

Vector-field-based methods are widely used for robot control and are often applied to the path-tracking problem. Some vector field approaches require repeatedly computing the distance between the robot configuration and the curve, as well as the corresponding closest point. Recently, vector fields have been extended to Lie Groups. In this case, this computation can be expensive, especially when performed at high control frequencies on embedded platforms. This paper proposes a method for efficiently computing the distance between a point and a curve represented as what is called a G-polynomial curve, which is a curve representation that generalizes polynomial curves to matrix Lie groups. The proposed approach exploits the structure of these curves to reduce the problem to a small number of polynomial root-finding computations. Simulation results show that the method significantly reduces computation time while maintaining accuracy compared to existing optimization-based approaches. Practical formulas are also provided for the case of the group SE(3), and the method is validated experimentally on a robotic manipulator. The methodology is implemented in a computational package, available online.

2606.05254 2026-06-05 cs.LG cs.CV cs.RO 版本更新

Flash-WAM: Modality-Aware Distillation for World Action Models

Flash-WAM:面向世界动作模型的模态感知蒸馏

Arman Akbari, Ci Zhang, Arash Akbari, Lin Zhao, Yixiao Chen, Weiwei Chen, Xuan Zhang, Geng Yuan, Yanzhi Wang

发表机构 * Northeastern University(东北大学) University of Georgia(佐治亚大学) EmbodyX Inc.(EmbodyX公司)

AI总结 针对世界动作模型联合生成视频和机器人动作时因多模态噪声分布不对称导致蒸馏失效的问题,提出模态感知步蒸馏框架Flash-WAM,通过为不同模态选择匹配噪声机制的参数化方法,实现单步推理并大幅加速。

详情
AI中文摘要

世界动作模型(WAMs)通过迭代扩散联合生成未来视频和机器人动作,在操作基准上表现出色,但需要数十个去噪步骤,这一成本阻碍了实时控制。步蒸馏已成为自然的补救措施,但现成的方法在联合视频-动作设置中失效,因为视频和动作流使用不同的信噪比偏移噪声调度,并以显著不同的边际噪声分布到达训练,这种不对称性是单模态蒸馏方法无法处理的。我们提出Flash-WAM,一个受一致性蒸馏启发的模态感知步蒸馏框架,为每个模态选择一致性函数以匹配其噪声机制:针对动作流的低噪声机制采用线性梯度缩放参数化,针对视频流的高噪声机制采用方差保持参数化,该框架基于对一致性函数族的结构分析,该分析刻画了在一致性边界条件下可实现的梯度缩放。在LingBot-VA上实例化,Flash-WAM将每个模态的推理压缩到单步。在RoboTwin 2.0上,这将每个块延迟从8.1秒减少到NVIDIA L40S上的348毫秒,实现了23倍的加速,从而支持实时推理。Flash-WAM在模拟基准上保持了任务成功率(RoboTwin 2.0上85.5%,LIBERO上95.7%),并大幅恢复了真实世界性能(Unitree G1人形机器人上平均60%),而朴素的一致性蒸馏在相同步预算下降至24%。

英文摘要

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from $8.1$ seconds to $348$ ms on an NVIDIA L40S, a $23\times$ speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks ($85.5\%$ RoboTwin 2.0, $95.7\%$ LIBERO) and substantially recovers real-world performance ($60\%$ average on a Unitree G1 humanoid robot), while naive consistency distillation drops to $24\%$ at the same step budget.
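
As a quick consistency check on the reported latency figures, the speedup follows directly from the two per-chunk latencies quoted in the abstract. The implied chunk rate on the second line is derived here for illustration and is not stated in the source.

```latex
\[
\text{speedup} = \frac{8.1\ \text{s}}{0.348\ \text{s}} \approx 23.3,
\qquad
\text{chunk rate after distillation} = \frac{1}{0.348\ \text{s}} \approx 2.9\ \text{chunks/s}.
\]
```

The ratio agrees with the roughly 23-fold speedup quoted in the abstract.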

2606.05248 2026-06-05 cs.RO 版本更新

Inverse Manipulation through Symbolic Planning and Residual Operator Learning

通过符号规划与残差算子学习的逆操作

Yigit Yildirim, Giuseppe Rauso, Riccardo Caccavale, Alberto Finzi

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出一种混合框架,结合STRIPS-like符号规划与残差强化学习,实现机器人操作任务的逆操作,在ManiSkill3 PushCube任务中验证了将近似符号逆操作转化为物理可行的逆技能。

Comments To be presented in PlanRob26

详情
AI中文摘要

逆推机器人任务需要的不仅仅是逆转符号状态转换或回放运动轨迹。在机器人操作任务中,在连续交互动力学下,符号逆计划通常无法完全恢复正向执行的效果。我们提出了一种用于逆操作的混合框架,该框架通过软几何谓词从演示中自动提取STRIPS-like算子,并推导出逆技能目标。对于每个提取的算子,我们构建一个逆恢复目标,该目标保留前提条件、恢复删除效果并否定添加效果。任务规划器首先尝试使用可用的动作原语来满足该目标。未解决的符号谓词随后引出一个残差算子学习问题,通过强化学习(RL)解决。我们在ManiSkill3 PushCube任务上评估了该框架。对于正向推动技能,符号逆操作执行粗略的抓取-放置恢复,而残差Soft Actor-Critic策略则细化立方体姿态以满足剩余的逆谓词。我们的结果表明,谓词导出的残差控制可以将近似的符号逆操作转化为物理上可行的逆技能。

英文摘要

Inverting a robotic task requires more than reversing symbolic state transitions or rewinding motor trajectories. In robot manipulation tasks, symbolic inverse plans often fail to fully restore the effects of forward executions under continuous interaction dynamics. We present a hybrid framework for inverse manipulation that derives inverse-skill objectives from STRIPS-like operators automatically extracted from demonstrations through soft geometric predicates. For each extracted operator, we construct an inverse restoration objective that preserves preconditions, restores delete effects, and negates add effects. A task planner first attempts to satisfy this objective using available action primitives. Unresolved symbolic predicates then induce a residual operator learning problem solved through Reinforcement Learning (RL). We evaluate the framework on the ManiSkill3 PushCube task. For a forward pushing skill, the symbolic inverse performs a coarse pick-and-place restoration, while a residual Soft Actor-Critic policy refines the cube pose to satisfy the remaining inverse predicates. Our results show that predicate-derived residual control can turn an approximate symbolic inverse into a physically grounded inverse skill.
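
The construction of the inverse restoration objective (preserve preconditions, restore delete effects, negate add effects) can be sketched with plain sets. This is a minimal illustration under assumptions, not the paper's code: the `Operator` dataclass, the predicate names, and the helper `unresolved` are hypothetical, and the soft geometric predicates of the paper are reduced here to plain true/false symbols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A STRIPS-like operator over symbolic predicates."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def inverse_restoration_goal(op):
    """Literals the inverse skill should satisfy, as (predicate, truth_value)."""
    goal = {(p, True) for p in op.preconditions}      # preserve preconditions
    goal |= {(p, True) for p in op.delete_effects}    # restore what was deleted
    goal |= {(p, False) for p in op.add_effects}      # negate what was added
    return goal


def unresolved(goal, state):
    """Goal literals not satisfied in `state` (the set of true predicates)."""
    return {(p, v) for p, v in goal if (p in state) != v}


# Hypothetical forward push operator for a cube-pushing task.
push = Operator(
    name="push_cube",
    preconditions=frozenset({"cube_on_table"}),
    add_effects=frozenset({"cube_at_goal"}),
    delete_effects=frozenset({"cube_at_start"}),
)
goal = inverse_restoration_goal(push)

after_push = {"cube_on_table", "cube_at_goal"}
after_coarse_inverse = {"cube_on_table"}  # cube moved back, but not precisely

print(sorted(unresolved(goal, after_push)))
print(sorted(unresolved(goal, after_coarse_inverse)))
```

In this toy trace the coarse symbolic inverse clears the added effect but leaves `cube_at_start` unsatisfied. That leftover literal is the kind of unresolved predicate that, per the abstract, defines the residual problem handed to reinforcement learning.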

2606.05236 2026-06-05 cs.RO cs.LG 版本更新

A New Quaternion-Joint Cable-Driven Redundant Manipulator Configuration and its Control Through FABRIK and Residual Reinforcement Learning

一种新型四元数关节缆驱动冗余机械臂配置及其通过FABRIK和残差强化学习的控制

Tanapath Pornthisan, Thanapat Kemthong, Thanyapisit Kangsathien, Pasut Aranchaiya, Paulo Garcia, Viboon Sangveraphunsiri

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出一种4段8关节四元数关节缆驱动冗余机械臂配置,并利用残差强化学习实现比FABRIK算法高三个数量级的位置和方向精度控制。

详情
AI中文摘要

能够穿越任意空间路径的机械臂,特别是在高度阻塞的工作空间中,在多个行业中备受期待。四元数关节最近赋予了一类特定的机械臂——缆驱动冗余机械臂——超越其先前能力的新功能。具体来说,四元数关节减少了每个自由度所需的电机数量,为更紧凑的解决方案铺平了道路。一个持续的挑战是,四元数关节运动学模型的复杂性给机械臂配置的先验决策带来了困难,并对控制系统提出了更高的计算需求,其非线性放大了由于制造不精确而产生的设计与物理实物之间的所有差异。在这里,我们展示了一个4段、8关节的机械臂可以在更低的硬件成本下实现比现有配置更广阔的工作空间,并且残差强化学习在控制此类机械臂方面优于现有最先进的方法——特别是FABRIK算法。我们的结果表明,这种配置比先前设计更有效地利用工作空间,并且残差强化学习在位置和方向精度上比FABRIK高出三个数量级,实现了对新型4段、8关节机械臂的精确控制。此外,控制实现更简单:我们描述了完整的FABRIK控制过程及相应的学习实现。我们的方法适用于新系统的设计,为设计者提供了开发此类机械臂及新型配置相应控制系统的更多工具。

英文摘要

Robotic arms capable of traversing arbitrary spatial paths, especially in highly obstructed workspaces, are highly desired across several industries. Quaternion joints have recently empowered a specific class of robotic arms -- cable-driven redundant manipulators -- beyond their prior capabilities. Specifically, quaternion joints reduce the number of required motors per degree of freedom, paving the way for more compact solutions. An ongoing challenge is that the complexity of the kinematic model of quaternion joints complicates a priori decisions on manipulator configurations and imposes higher computational demands on the control system, while its non-linearities amplify all discrepancies between design and physical artifact arising from fabrication imprecision. Here we show that a 4-segment, 8-joint manipulator can achieve a broader workspace than extant configurations, at lower hardware cost, and that Residual Reinforcement Learning outperforms extant state-of-the-art methods -- specifically, the FABRIK algorithm -- on the control of such a manipulator. Our results show that this configuration is more workspace-effective than prior designs, and that Residual Reinforcement Learning outperforms FABRIK by three orders of magnitude on positional and orientational accuracy, effecting precise control of the novel 4-segment, 8-joint manipulator. Additionally, the control implementation is simpler: we describe the complete FABRIK process for control and the corresponding learning implementation. Our methodology is applicable to the design of new systems, providing designers with further tools for the development of this class of manipulators and corresponding control systems for novel configurations.
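
For readers unfamiliar with the FABRIK baseline, the sketch below shows the generic planar form of the algorithm (alternating backward and forward reaching passes over a chain of fixed-length links). It is a textbook-style illustration under assumptions, not the paper's controller: it ignores quaternion-joint kinematics, cable actuation, joint limits, and orientation, and the function names are made up here. It requires Python 3.8+ for `math.dist`.

```python
import math


def _lerp(a, b, t):
    """Point at fraction t of the way from a to b."""
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))


def fabrik(joints, target, tol=1e-4, max_iter=100):
    """Planar FABRIK for a serial chain with a fixed base.

    joints: list of (x, y) positions, base first, end effector last.
    Returns joint positions whose end effector approaches `target`
    while every link keeps its original length.
    """
    lengths = [math.dist(joints[i], joints[i + 1]) for i in range(len(joints) - 1)]
    base = tuple(joints[0])
    pts = [tuple(p) for p in joints]

    if math.dist(base, target) > sum(lengths):
        # Unreachable target: stretch the chain straight toward it.
        for i in range(len(lengths)):
            t = lengths[i] / math.dist(pts[i], target)
            pts[i + 1] = _lerp(pts[i], target, t)
        return pts

    for _ in range(max_iter):
        if math.dist(pts[-1], target) <= tol:
            break
        # Backward pass: pin the end effector to the target, work toward the base.
        pts[-1] = tuple(target)
        for i in range(len(pts) - 2, -1, -1):
            t = lengths[i] / math.dist(pts[i + 1], pts[i])
            pts[i] = _lerp(pts[i + 1], pts[i], t)
        # Forward pass: pin the base back in place, work toward the end effector.
        pts[0] = base
        for i in range(len(pts) - 1):
            t = lengths[i] / math.dist(pts[i], pts[i + 1])
            pts[i + 1] = _lerp(pts[i], pts[i + 1], t)
    return pts


chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
goal_point = (1.5, 1.5)
solution = fabrik(chain, goal_point)
print("end effector:", solution[-1])
```

In a residual scheme of the kind the abstract describes, a solver like this would supply the nominal command and a learned policy would add a correction on top of it; the details of that correction are not given in the abstract.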

2606.05234 2026-06-05 cs.RO cs.LG 版本更新

OLIVE: Online Low-Rank Incremental Learning for Efficient Adaptive Exoskeletons

OLIVE: 面向高效自适应外骨骼的在线低秩增量学习

Dong Liu, Yanxuan Yu, Ben Lengerich, Tony Geng, Ying Nian Wu

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Columbia University(哥伦比亚大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Rice University(莱斯大学)

AI总结 提出OLIVE框架,通过低秩残差分解和奖励驱动策略梯度实现外骨骼控制的在线个性化自适应,在多种地形上提升步态平滑度、降低努力并增强稳定性。

详情
AI中文摘要

可穿戴外骨骼系统有望恢复身体障碍者的行动能力,但大多数现有控制器依赖于静态步态策略,缺乏适应动态真实环境或个体用户特征的能力。我们提出OLIVE(Online Low-rank Incremental Learning for Efficient Adaptive Exoskeletons),一种参数高效的在线自适应框架,在部署期间持续个性化外骨骼控制。OLIVE将控制策略的自适应组件分解为低秩残差形式 $\Delta W = A_t B_t^\top$,秩 $r \ll \min(d,k)$,将在线更新成本从$\mathcal{O}(dk)$降低到$\mathcal{O}(r(d+k))$,同时保持预训练基础控制器 $W_0$ 的稳定性。参数通过奖励塑造的策略梯度更新,完全由身体传感器反馈(EMG、IMU、振动)驱动,消除了对离线参考轨迹的依赖。门控机制根据上下文状态调节个性化强度,动态秩调度器根据地形复杂度调整更新维度——在简单平坦地形上分配最小容量,在要求高的不平坦地形上扩展到更高秩更新——从而在多种活动中实现稳健性能:平地行走、楼梯导航、斜坡和不平坦地形。在可穿戴平台上的实验表明,OLIVE在步态平滑度、努力减少和运动稳定性上比最强基线分别提高了13、22和15个百分点,在大约1,800步内收敛,端到端延迟为7.4毫秒。我们的代码实现可在https://github.com/FastLM/OLIVE获取。

英文摘要

Wearable exoskeleton systems hold promise for restoring mobility in individuals with physical impairments, yet most existing controllers rely on static gait policies that lack the ability to adapt to dynamic real-world environments or individual user characteristics. We present OLIVE (Online Low-rank Incremental Learning for Efficient Adaptive Exoskeletons), a parameter-efficient online adaptation framework that continuously personalizes exoskeleton control during deployment. OLIVE decomposes the adaptive component of the control policy into a low-rank residual form $\Delta W = A_t B_t^\top$ with rank $r \ll \min(d,k)$, reducing online update cost from $\mathcal{O}(dk)$ to $\mathcal{O}(r(d+k))$ while preserving the stability of a pretrained base controller $W_0$. Parameters are updated via a reward-shaped policy gradient driven purely by on-body sensor feedback (EMG, IMU, vibration), eliminating dependence on offline reference trajectories. A gating mechanism modulates the strength of personalization based on contextual state, and a dynamic rank scheduler adapts the update dimensionality to terrain complexity -- allocating minimal capacity on simple flat terrain and expanding to higher-rank updates on demanding uneven surfaces -- enabling robust performance across diverse activities: flat walking, stair navigation, slopes, and uneven terrain. Experiments on the wearable platform demonstrate that OLIVE achieves +13, +22, and +15 percentage-point improvements in gait smoothness, effort reduction, and motion stability over the strongest baseline, converging within $\sim$1,800 walking steps at 7.4 ms end-to-end latency. Our code implementation is available at https://github.com/FastLM/OLIVE.
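
The cost claim in the abstract can be checked with a small sketch of a low-rank residual layer: the base weights stay frozen and only two thin factors are adapted. The code below is an illustration under assumptions, not the OLIVE implementation. The helper names, the scalar `gate`, and the example dimensions are all hypothetical, and the policy-gradient update itself is not shown.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def adapted_forward(W0, A, B, x, gate=1.0):
    """Compute y = W0 x + gate * A (B^T x).

    W0 is d x k (frozen base), A is d x r, B is k x r.
    `gate` is a hypothetical scalar stand-in for a gating mechanism.
    """
    base = matvec(W0, x)
    low = matvec(transpose(B), x)   # r-dimensional projection B^T x
    resid = matvec(A, low)          # lifted back to d dimensions
    return [b + gate * r for b, r in zip(base, resid)]

def update_cost(d, k, r):
    """Number of adapted parameters: full matrix versus low-rank factors."""
    return {"full": d * k, "low_rank": r * (d + k)}
```

For example, with the assumed sizes `d=256`, `k=128`, `r=4`, a full update touches 32,768 parameters while the two factors hold 1,536, which is the $\mathcal{O}(dk)$ versus $\mathcal{O}(r(d+k))$ gap stated in the abstract.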

2606.04708 2026-06-05 cs.RO cs.AI 版本更新

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

VISTA: 基于视觉和物理验证的UMI数据适配用于VLA训练

Siyuan Yang, Linzheng Guo, Ouyang Lu, Zhaxizhuoma, Daoran Zhang, Xinmiao Wang, Ting Xiao, Fangzheng Yan, Zhijun Chen, Yan Ding, Chao Yu, Chenjia Bai, Xuelong Li

发表机构 * Institute of AI (TeleAI), China Telecom(人工智能研究院(TeleAI),中国电信) Lumos Robotics(Lumos机器人) University of Science and Technology of China(中国科学技术大学) Northwestern Polytechnical University(西北工业大学) Shanghai Jiao Tong University(上海交通大学) East China University of Science and Technology(华东理工大学) Harbin Engineering University(哈尔滨工程大学) Fudan University(复旦大学)

AI总结 提出VISTA框架,通过UMI-VQA数据集对齐视觉表示、物理验证流水线筛选可行轨迹以及两阶段联合训练,解决UMI数据训练VLA模型时的视觉分布偏移和物理不可行动作问题。

Comments Corrected the typing error

详情
AI中文摘要

通用操作接口(UMI)实现了无需特定硬件遥操作的可扩展真实世界机器人数据收集,但利用UMI数据训练大规模视觉-语言-动作(VLA)模型仍然面临根本性挑战。我们识别出两个关键不匹配:腕部安装的鱼眼视图具有严重的径向畸变和以夹爪为中心的局部视角,对于预训练VLM而言是分布外数据;人类收集的轨迹经常违反运动学限制、发生碰撞或超出控制器带宽,导致VLA策略学习到物理上不可行的动作。为解决这些挑战,我们提出了VISTA框架,通过三个协同组件弥合这一双重差距。(i) UMI-VQA,首个专门针对腕部鱼眼观测的大规模VQA数据集,通过辅助视觉-语言监督将VLM表示对齐到畸变视觉领域。(ii) 系统性的物理验证流水线,在训练前进行数据完整性预检查,并对每条有效轨迹的轨迹连续性、自碰撞风险和执行保真度进行评分。(iii) 两阶段联合训练方案,在UMI-VQA上联合学习视觉-语言基础,并在验证轨迹上学习动作预测。我们的实验经验表明,引入UMI-VQA能持续提升下游策略性能,且物理验证分数对部署成功具有强预测性。在多种仿真和真实世界操作任务中,VISTA显著优于包括$π_{0.5}$、LingBot-VLA和Wall-X在内的强基线。我们向社区发布了物理验证流水线、UMI-VQA、验证轨迹数据和预训练模型。

英文摘要

Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address these challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $π_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model to the community.

2606.04463 2026-06-05 cs.RO 版本更新

OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

OSCAR: 面向机器人的全具身动作条件世界模型

Zhuoyuan Wu, Jun Gao

发表机构 * Peking University(北京大学) University of Michigan(密歇根大学) NVIDIA(英伟达)

AI总结 提出OSCAR,一种基于动作条件的视频世界模型,通过大规模数据管道和2D骨架渲染统一表示,实现跨机器人具身的泛化,并用于策略评估。

Comments Project page: https://wuzy2115.github.io/oscar-project-page/

详情
AI中文摘要

我们提出OSCAR,一种精确的动作条件视频世界模型,能够泛化到不同的机器人具身并支持机器人策略评估。现有的视频世界模型在真实机器人评估中面临三个主要挑战:当前机器人训练数据集的场景多样性有限、动作跟随不精确、以及跨具身泛化能力差以支持广泛采用。我们从两个角度应对这些挑战。其核心是一个大规模标准化数据管道,用于整理、过滤和去重广泛的机器人和以自我为中心的人类数据集,产生一个涵盖多样化任务、场景、动作和机器人具身的干净联合训练数据集。为了给视频模型提供条件,我们采用2D运动学骨架渲染作为统一的条件表示,能够泛化到不同的机器人手臂甚至人类手部。我们在单个GH200 GPU上微调Cosmos-Predict2.5-2B模型。与现有基线相比,我们的模型在动作跟随、外观质量和运动一致性方面取得了显著改进,而基线要么模型规模大得多,要么需要更多GPU。我们进一步将OSCAR部署到RoboArena中评估机器人策略。大量实验表明,OSCAR中的虚拟策略评估与真实世界评估之间存在显著相关性,为未来机器人策略可以纯粹在虚拟生成的世界中评估铺平了道路。

英文摘要

We present OSCAR, a precise action-conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real-world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments, which hinders broad adoption. We tackle these challenges from two perspectives. At the core of our approach is a large-scale standardized data pipeline that curates, filters, and deduplicates broad robotics and egocentric human datasets, yielding a clean joint-training dataset that spans diverse tasks, scenarios, actions, and robot embodiments. To condition the video model, we adopt 2D kinematic skeleton rendering as a unified conditioning representation that generalizes across different robot arms or even human hands. We finetune the Cosmos-Predict2.5-2B model on a single GH200 GPU. Compared to existing baselines, which either have a much larger model size or require more GPUs, our model achieves significant improvements in action following, appearance quality, and motion consistency. We further deploy OSCAR to evaluate robot policies from RoboArena. Extensive experiments demonstrate a significant correlation between virtual policy evaluation in OSCAR and real-world evaluation, paving the way for a future in which robot policies can be evaluated purely in virtually generated worlds.

2605.28367 2026-06-05 cs.RO cs.SY eess.SY 版本更新

Safety-Critical Adaptive Impedance Control via Nonsmooth Control Barrier Functions under State and Input Constraints

基于非光滑控制障碍函数的状态与输入约束下安全关键自适应阻抗控制

Faisal Lawan, Xiaoran Han, Joaquin Carrasco, Barry Lennox, Xiaoxiao Cheng

发表机构 * Department of Electrical and Electronic Engineering, The University of Manchester(电气与电子工程系,曼彻斯特大学)

AI总结 提出一种在线自适应阻抗控制框架,结合二次规划安全滤波器与新型组合位置-速度非光滑控制障碍函数,在不确定动力学下实现关节状态安全约束与柔顺交互,并通过区间二型模糊逻辑补偿未知动力学、软约束处理执行器力矩限制,利用复合Lyapunov分析证明安全集前向不变性与阻抗跟踪误差一致最终有界。

Comments 12 pages, 3 figures

详情
AI中文摘要

安全物理交互对于在人类-机器人交互和接触密集型任务中部署机器人操作臂至关重要,其中不确定性、外力和执行器限制可能危及性能和安全性。我们提出一种在线自适应阻抗控制框架,在不确定动力学下强制执行关节状态安全,同时实现柔顺交互。该方法结合了基于二次规划的安全滤波器与一种新颖的组合位置-速度非光滑控制障碍函数(NCBF),使得关节位置和速度约束能够通过统一的相对度一障碍来实施。未知动力学通过区间二型模糊逻辑系统在线补偿,而执行器力矩限制则通过软约束处理,并利用精确罚函数恢复可行解。一种增强的扰动观测器安全机制提高了对建模误差和外部交互力的鲁棒性。使用复合Lyapunov分析,我们证明了安全集的前向不变性和阻抗跟踪误差的一致最终有界性。在具有严重参数不确定性和外部交互力的7自由度操作臂上的仿真展示了安全约束满足和鲁棒的阻抗跟踪。

英文摘要

Safe physical interaction is critical for deploying robotic manipulators in human-robot interaction and contact-rich tasks, where uncertainty, external forces, and actuator limitations can compromise both performance and safety. We propose an online adaptive impedance control framework that enforces joint-state safety while achieving compliant interaction under uncertain dynamics. The approach combines a quadratic-program-based safety filter with a novel composed position-velocity non-smooth control barrier function (NCBF), enabling joint position and velocity constraints to be enforced through a unified relative-degree-one barrier. Unknown dynamics are compensated online using an interval type-2 fuzzy logic system, while actuator torque limits are handled through soft constraints with exact penalty recovery of feasible solutions. A disturbance-observer-enhanced safety mechanism improves robustness against modelling errors and external interaction forces. Using composite Lyapunov analysis, we prove forward invariance of the safe set and uniform ultimate boundedness of the impedance-tracking error. Simulations on a 7-DOF manipulator with severe parametric uncertainty and external interaction wrenches demonstrate safe constraint satisfaction and robust impedance tracking.
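
To make the idea of a quadratic-program safety filter concrete, the sketch below handles the simplest case: one smooth barrier constraint that is affine in the input, for which the QP has a closed-form projection. It is a simplified illustration under assumptions, not the paper's non-smooth composed barrier. The single-integrator joint model, the barrier `h(q) = q_max - q`, the gain `alpha`, and all function names are hypothetical.

```python
def safety_filter(u_nom, a, b):
    """Return the input closest to u_nom (Euclidean) with a . u + b >= 0.

    This is the closed-form solution of
        min ||u - u_nom||^2  subject to  a . u + b >= 0.
    """
    s = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if s >= 0.0:
        return list(u_nom)          # nominal input is already safe
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint cannot be influenced by the input")
    # Project onto the boundary of the half-space.
    return [ui - s / norm_sq * ai for ai, ui in zip(a, u_nom)]

def simulate(q0=0.0, q_max=1.0, u_nom=2.0, alpha=5.0, dt=0.01, steps=500):
    """Euler simulation of dq/dt = u with an upper position limit."""
    q, history = q0, [q0]
    for _ in range(steps):
        # h(q) = q_max - q; the condition dh/dt + alpha * h >= 0
        # becomes (-1) * u + alpha * (q_max - q) >= 0.
        (u,) = safety_filter([u_nom], a=[-1.0], b=alpha * (q_max - q))
        q += dt * u
        history.append(q)
    return history
```

In this toy run the nominal command keeps pushing toward the limit, and the filter scales it back so the position approaches `q_max` without crossing it. With several constraints or input bounds, a numerical QP solver would replace the closed-form projection.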

2604.15524 2026-06-05 eess.SY cs.RO cs.SY 版本更新

Safe and Energy-Aware Multi-Robot Density Control via PDE-Constrained Optimization for Long-Duration Autonomy

面向长期自主性的安全与能量感知多机器人密度控制:基于PDE约束优化

Longchen Niu, Andrew Nasif, Gennaro Notomista

发表机构 * Department of Electrical and Computer Engineering, University of Waterloo(滑铁卢大学电气与计算机工程系)

AI总结 提出一种结合Fokker-Planck偏微分方程与控制李雅普诺夫/障碍函数的密度控制框架,实现多机器人系统的目标密度跟踪、避障和能量可持续性。

详情
AI中文摘要

本文提出了一种新颖的多机器人系统密度控制框架,具有空间安全性和能量可持续性保证。随机机器人运动通过Fokker-Planck偏微分方程在密度层面进行编码。控制李雅普诺夫函数和控制障碍函数与PDE相结合,以强制实现目标密度跟踪、障碍区域避免以及多个充电周期内的能量充足性。由此产生的二次规划实现了快速的在环实现,可实时调整指令。进行了多机器人实验和广泛仿真,以证明控制器在定位和运动不确定性下的有效性。

英文摘要

This paper presents a novel density control framework for multi-robot systems with spatial safety and energy sustainability guarantees. Stochastic robot motion is encoded at the density level through the Fokker-Planck Partial Differential Equation (PDE). Control Lyapunov and control barrier functions are integrated with the PDE to enforce target density tracking, obstacle region avoidance, and energy sufficiency over multiple charging cycles. The resulting quadratic program enables fast in-the-loop implementation that adjusts commands in real time. Multi-robot experiments and extensive simulations were conducted to demonstrate the effectiveness of the controller under localization and motion uncertainties.
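
For readers unfamiliar with the density-level model, the standard Fokker-Planck equation is written out below. It assumes each robot follows $dx = u(x,t)\,dt + \sqrt{2D}\,dW_t$ with a scalar diffusion coefficient $D$; the symbols $\rho$, $u$, and $D$ are notation chosen here, and the paper's exact formulation may differ.

```latex
% Fokker-Planck equation for the robot density \rho(x, t),
% assuming dynamics dx = u(x,t) dt + \sqrt{2D} dW_t.
\frac{\partial \rho}{\partial t}
  = -\nabla \cdot \bigl( u(x,t)\, \rho \bigr) + D\, \Delta \rho
```

The first term transports density along the commanded velocity field, and the second spreads it according to the motion noise, which is what lets constraints be stated on $\rho$ rather than on individual robots.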

2605.17249 2026-06-05 cs.RO 版本更新

SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation

SEDualVLN:一种空间增强的双系统用于视觉语言导航

Jingzhi Huang, Junkai Huang, Wenxuan Song, Haoyang Yang, Hailong Huang, Haoang Li, Yi Wang

发表机构 * Hong Kong Polytechnic University(香港理工大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出SEDualVLN,一种空间增强的双系统框架,用于解决视觉语言导航中的长距离导航和动态推理问题,通过两个系统协同工作实现高效导航。

详情
AI中文摘要

视觉语言导航(VLN)方法目前主要遵循两种范式:一种是端到端视觉语言模型(VLM)策略,在导航轨迹上微调后直接预测动作;另一种是零样本模块化流程,整合预训练的多模态大语言模型(MLLM),无需训练即可泛化到未见环境。然而,端到端方法在长距离导航中表现不佳且缺乏动态推理能力,而零样本方法受限于有限的空间定位能力,且需要大量推理时间。为弥合这一差距,我们引入SEDualVLN,一种空间增强的双系统VLN框架。系统1是一个增强了全局和局部空间感知的VLM,用于动作生成。系统2整合了一个通用MLLM和一个建图模块,其中MLLM利用实时3D地图的俯视图和渲染路径图像流来规划路径点。两个系统利用不同形式的空间增强来培养智能体在VLN任务中的方向感。最终,它们通过快慢协调的方式合作完成导航任务。SEDualVLN在VLN-CE基准上实现了最先进的性能,进一步的消融研究证明了每个系统和模块的有效性。

英文摘要

Vision-Language Navigation (VLN) approaches currently follow two primary paradigms: the end-to-end Vision-Language Model (VLM) policy, fine-tuned on navigation trajectories to directly predict actions, and the zero-shot modular pipeline, which integrates a pre-trained Multimodal Large Language Model (MLLM) for training-free generalization to unseen environments. However, end-to-end methods struggle with long-horizon navigation and lack dynamic reasoning, whereas zero-shot methods are constrained by limited spatial grounding for reliable planning and also require substantial reasoning time. To bridge this gap, we introduce SEDualVLN, a spatially-enhanced dual-system VLN framework. System 1 is a VLM enhanced with both global and local spatial awareness, used for action generation. System 2 integrates a general MLLM with a mapping module, wherein the MLLM plans waypoints by leveraging top-down views of the real-time 3D map alongside streams of rendered path images. Both systems leverage different forms of spatial enhancement to cultivate the agent's sense of direction in VLN tasks. Ultimately, they cooperate to complete the navigation task through a fast-slow coordinated approach. SEDualVLN achieves state-of-the-art performance on VLN-CE benchmarks, and further ablation studies demonstrate the effectiveness of each system and module.

2605.09989 2026-06-05 cs.RO cs.CV 版本更新

StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

StereoPolicy:通过立体视觉改进机器人操作策略

Evans Han, Yunfan Jiang, Yingke Wang, Haoyue Xiao, Huang Huang, Jianwen Xie, Jiajun Wu, Li Fei-Fei, Ruohan Zhang

发表机构 * Stanford University(斯坦福大学) Northwestern University(西北大学) Lambda, Inc(Lambda公司)

AI总结 该研究提出StereoPolicy,一种利用立体视觉提升机器人操作策略的框架,通过同步立体图像对增强几何推理,无需构建显式3D表示,在多个仿真和真实机器人任务中优于RGB、RGB-D、点云等基线方法。

详情
AI中文摘要

最近的机器人模仿学习进展产生了能够从视觉输入中操控多样化物体的强大视觉-运动策略。然而,单目观测缺乏深度信息,这对于在杂乱或几何复杂的场景中进行精确操作至关重要。显式的深度图和点云在现实世界操作中往往噪声大且易碎。我们引入了StereoPolicy,一种视觉-运动策略学习框架,直接利用同步的立体图像对来改进几何推理,而无需构建显式的3D表示。StereoPolicy通过预训练的2D视觉编码器处理每张图像,并通过基于交叉注意力的Stereo Transformer融合左右特征,隐式地捕捉空间对应关系和视差线索。该框架与基于扩散和预训练的视觉-语言-动作(VLA)策略集成,在三个仿真基准和七个真实机器人桌面和双臂移动操作任务中,相比RGB、RGB-D、点云和多视角基线方法均实现了持续改进。我们的结果表明,立体视觉能够将预训练的2D表示与3D几何理解联系起来,以提升机器人操作性能。

英文摘要

Recent advances in robot imitation learning have produced powerful visuomotor policies that manipulate diverse objects from visual inputs. However, monocular observations lack depth information, which is critical for precise manipulation in cluttered or geometrically complex scenes. Explicit depth maps and point clouds are often noisy and fragile in real-world manipulation. We introduce StereoPolicy, a visuomotor policy learning framework that directly leverages synchronized stereo image pairs to improve geometric reasoning without constructing explicit 3D representations. StereoPolicy processes each image with pretrained 2D vision encoders and fuses left-right features through a cross-attention-based Stereo Transformer, capturing spatial correspondence and disparity cues implicitly. The framework integrates with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks and seven real-robot tabletop and bimanual mobile manipulation tasks. Our results show that stereo vision bridges 2D pretrained representations and 3D geometric understanding for robotic manipulation.

2605.08215 2026-06-05 cs.CV cs.LG cs.RO 版本更新

Test-Time Training for Visual Foresight Vision-Language-Action Models

测试时训练用于视觉前瞻视觉-语言-动作模型

Sangwu Park, Wonjoong Kim, Yeonjun In, Sein Kim, Hongseok Kang, Chanyoung Park

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出了一种测试时训练方法,用于增强视觉前瞻视觉-语言-动作模型在面对分布外数据时的鲁棒性,通过引入适应性更新过滤机制来减少测试时更新带来的实际挑战。

Comments Accepted at ICML 2026 Workshop on Continual Adaptation at Scale (CATS)

详情
AI中文摘要

Visual Foresight VLA (VF-VLA) 已成为最近 VLA 中的重要架构选择,因其出色的性能。然而,VF-VLA 的固有设计使其特别容易受到分布外(OOD)偏移的影响。由于动作的质量直接取决于预测未来视觉信息的准确性,OOD 条件会影响两个阶段。为了解决这一脆弱性,我们提出了测试时训练视觉前瞻 VLA($T^3$VF),这是一种受观察启发的测试时训练方法,即预测的未来图像及其后续观察形成自然的监督对。为了进一步解决由于随意测试时更新而产生的实际挑战,我们引入了自适应更新过滤机制。经验上,$T^3$VF 在不改变任何架构或辅助模块的情况下,以适度的额外推理成本缓解了 VF-VLA 的 OOD 脆弱性。

英文摘要

Visual Foresight VLA (VF-VLA) has become a prominent architectural choice among recent VLAs due to its strong performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribution (OOD) shifts. Because action quality depends directly on the accuracy of the predicted future visual information, OOD conditions affect both stages at once. To address this vulnerability, we propose Test-Time Training Visual Foresight VLA ($T^3$VF), a test-time training approach motivated by the observation that the predicted future image and its subsequent observation form a natural supervision pair. To further address the practical challenges that arise from indiscriminate test-time updates, we introduce an adaptive update filtering mechanism. Empirically, $T^3$VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without requiring any architectural modification or auxiliary modules.

2604.21017 2026-06-05 cs.RO cs.AI 版本更新

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Open-H-Embodiment: 一个大规模数据集,用于在医疗机器人中启用基础模型

Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin Calmé, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. DeLong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn, Philipp Fürnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg, Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, Tamás Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Zhaoyang Jacopo Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, Przemysław Korzeniowski, Chandra Kuchi, Nithesh Kumar, Alan Kuntz, Federico Lavagno, Yu Chung Lee, Hao-Chih Lee, Hang Li, Zhen Li, Xiao Liang, Xinxin Lin, Jinsong Lin, Chang Liu, Fei Liu, Pei Liu, Yun-hui Liu, Wanli Liuchen, Eszter Lukács, Sareena Mann, Miles Mannas, Brett Marinelli, Sabina Martyniak, Francesco Marzola, Lorenzo Mazza, Xueyan Mei, Maria Clara Morais, Luigi Muratore, Chetan Reddy Narayanaswamy, Michał Naskręt, David Navarro-Alarcon, Cyrus Neary, Chi Kit Ng, Christopher Nguan, David Noonan, Ki Hwan Oh, Tom Christian Olesch, Allison M. Okamura, Justin Opfermann, Matteo Pescio, Doan Xuan Viet Pham, Tito Porras, Hongliang Ren, Ariel Rodriguez Jimenez, Ferdinando Rodriguez y Baena, Septimiu E. Salcudean, Asmitha Sathya, Preethi Satish, Lalithkumar Seenivasan, Jiaqi Shao, Yiqing Shen, Yu Sheng, Lucy XiaoYang Shi, Zoe Soulé, Stefanie Speidel, Mingwu Su, Jianhao Su, Idris Sunmola, Kristóf Takács, Yunxi Tang, Patrick Thornycroft, Yu Tian, Jordan Thompson, Mehmet K. Turkcan, Mathias Unberath, Pietro Valdastri, Carlos Vives, Quan Vuong, Martin Wagner, Farong Wang, Wei Wang, Lidian Wang, Chung-Pang Wang, Guankun Wang, Junyi Wang, Erqi Wang, Ziyi Wang, Tanner Watts, Wolfgang Wein, Yimeng Wu, Zijian Wu, Hongjun Wu, Luohong Wu, Jie Ying Wu, Junlin Wu, Victoria Wu, Kaixuan Wu, Mateusz Wójcikowski, Yunye Xiao, Nan Xiao, Wenxuan Xie, Hao Yang, Tianqi Yang, Yinuo Yang, Menglong Ye, Ryan S. Yeung, Nural Yilmaz, Chim Ho Yin, Michael Yip, Rayan Younis, Chenhao Yu, Sayem Nazmuz Zaman, Milos Zefran, Han Zhang, Yuelin Zhang, Yidong Zhang, Yanyong Zhang, Xuyang Zhang, Yameng Zhang, Joyce Zhang, Ning Zhong, Peng Zhou, Haoying Zhou, Xiuli Zuo, Nassir Navab, Mahdi Azizian, Sean D. Huver, Axel Krieger

发表机构 * Open-H-Embodiment Consortium University of California, Berkeley(加州大学伯克利分校) University of California, Los Angeles(加州大学洛杉矶分校) University of Southern California(南加州大学) University of Cambridge(剑桥大学) University of Tokyo(东京大学) University of Tokyo, Graduate School of Information Science and Technology(东京大学信息科学与技术研究生院) University of Tokyo, Institute of Industrial Science(东京大学工业科学研究所)

AI总结 本文提出Open-H-Embodiment数据集,通过两个基础模型展示了其在医疗机器人领域的应用,展示了大规模开放数据在推动机器人学习和世界建模方面的关键作用。

Comments Project website: https://open-h.github.io/open-h-embodiment/

详情
AI中文摘要

自主医疗机器人有望提高患者预后、减少医护人员的工作量、普及医疗服务并实现超人精度。然而,自主医疗机器人受到根本性数据问题的限制:现有的医疗机器人数据集规模小、仅含单一具身且很少公开共享,限制了该领域所需的基础模型的发展。我们介绍了Open-H-Embodiment,这是迄今为止最大的开放医疗机器人视频数据集,包含同步运动学,涵盖超过50个机构和多种机器人平台,包括CMR Versius、Intuitive Surgical的da Vinci、da Vinci Research Kit(dVRK)、Rob Surgical BiTrack、Virtual Incision的MIRA、Moon Surgical Maestro以及多种定制系统,涵盖手术操作、机器人超声和内窥镜程序。我们通过两个基础模型展示了该数据集的研究价值。GR00T-H是首个面向医疗机器人的开放基础视觉-语言-动作模型,是唯一在结构化缝合基准测试中实现完整端到端任务完成的被评估模型(25%的试验 vs. 其他所有模型的0%),并在29步离体缝合序列中实现了64%的平均成功率。我们还训练了Cosmos-H-Surgical-Simulator,这是首个动作条件的世界模型,能够从单个检查点实现多具身手术模拟,涵盖九种机器人平台,并支持计算机模拟(in silico)策略评估和医学领域合成数据生成。这些结果表明,开放、大规模的医疗机器人数据收集可以作为研究社区的关键基础设施,推动机器人学习、世界建模以及更广泛的研究进展。

英文摘要

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 50 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

2604.12474 2026-06-05 cs.RO cs.AI 版本更新

From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution

从运动学到动力学:学习精炼混合计划以实现物理可行的执行

Lidor Erez, Shahaf S. Shperberg, Ayal Taitler

发表机构 * Technion - Israel Institute of Technology(以色列理工学院)

AI总结 该研究通过连续空间中的强化学习,解决混合计划在物理可行性执行中的问题,通过引入分析二阶约束的马尔可夫决策过程,改进混合规划器生成的一阶轨迹,从而可靠地恢复物理可行性。

详情
AI中文摘要

在许多机器人任务中,智能体必须穿越一系列空间区域以完成任务。此类问题本质上是混合离散-连续的:一个高层动作序列和一个在物理上可行的连续轨迹。生成的轨迹和动作序列还必须满足诸如截止时间、时间窗口和速度或加速度限制等约束条件。尽管混合时间规划器试图解决这一挑战,但它们通常使用线性(一阶)动力学建模运动,这无法保证生成的计划满足机器人的真实物理约束。因此,即使高层动作序列固定,生成动态可行的轨迹也变成了一个双层优化问题。我们通过连续空间中的强化学习来解决这个问题。我们定义了一个明确包含分析二阶约束的马尔可夫决策过程,并用它来改进由混合规划器生成的一阶计划。我们的结果表明,这种方法可以可靠地恢复物理可行性,并有效弥合规划器初始一阶轨迹与实际执行所需动力学之间的差距。

英文摘要

In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot's true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner's initial first-order trajectory and the dynamics required for real execution.
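
The gap between first-order plans and second-order limits can be seen with a simple finite-difference check. The sketch below is an illustration under assumptions, not the paper's method: it only tests a sampled one-dimensional trajectory against velocity and acceleration bounds, and the function names and limits are hypothetical.

```python
def finite_differences(times, values):
    """Forward-difference derivative of samples taken at increasing times."""
    return [(values[i + 1] - values[i]) / (times[i + 1] - times[i])
            for i in range(len(values) - 1)]

def check_second_order(times, positions, v_max, a_max):
    """Report peak velocity and acceleration and whether both limits hold."""
    vel = finite_differences(times, positions)
    # Velocities live on segment midpoints, so difference them there too.
    mids = [0.5 * (times[i] + times[i + 1]) for i in range(len(times) - 1)]
    acc = finite_differences(mids, vel)
    peak_v = max((abs(v) for v in vel), default=0.0)
    peak_a = max((abs(a) for a in acc), default=0.0)
    return {"peak_velocity": peak_v,
            "peak_acceleration": peak_a,
            "feasible": peak_v <= v_max and peak_a <= a_max}
```

A piecewise-linear plan that moves at constant speed and then stops abruptly passes the velocity test but fails the acceleration test, and the estimated acceleration grows as the sampling gets finer. That is the kind of violation a refinement stage has to remove.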

2604.08882 2026-06-05 cs.RO 版本更新

Simulation of Adaptive Running with Flexible Sports Prosthesis using Reinforcement Learning of Hybrid-link System

使用混合链接系统强化学习模拟适应性跑步与柔性运动假肢

Yuta Shimane, Ko Yamamoto

发表机构 * Department of Biological Sciences, The University of Tokyo(东京大学生物科学系) Institute of Systems and Information Engineering, University of Tsukuba(筑波大学系统与信息工程研究所)

AI总结 本文提出了一种基于强化学习的框架,用于模拟单侧小腿截肢者在不同虚拟假肢刚度条件下的适应性跑步运动,通过混合链接系统整合了叶弹簧型运动假肢的灵活性,分析了假肢刚度对跑步动态和代谢成本的影响。

详情
AI中文摘要

本研究提出了一种基于强化学习的框架,用于模拟单侧小腿截肢者的适应性跑步运动,该框架采用混合链接系统,整合了板簧型运动假肢的柔性。运动假肢的设计和选择通常依赖于试错法。考虑人体运动与假肢变形之间相互作用的全面全身动力学分析,可以为用户特定的设计和选择提供有价值的见解。所提出的混合链接系统通过整合分段常应变(PCS)模型来表示假肢的柔性,从而使此类分析成为可能。基于此系统,模拟方法利用强化学习生成单侧小腿截肢者的全身动态运动。该框架整合了基于运动捕捉数据的模仿学习与准确的假肢动力学计算。在多种虚拟假肢刚度条件下模拟跑步运动,并分析由此获得的代谢运输成本(COT)。结果表明,假肢刚度的变化影响跑步动态和性能,且COT与先前研究中报告的值一致。我们的发现证明了所提出方法在不同于现实条件的虚拟条件下进行模拟和分析的潜力。

英文摘要

This study proposes a reinforcement learning-based framework for adaptive running motion simulation in a unilateral transtibial amputee using a hybrid-link system that incorporates the flexibility of a leaf-spring-type sports prosthesis. The design and selection of sports prostheses typically rely on trial and error. A comprehensive whole-body dynamics analysis that accounts for interactions between human motion and prosthetic deformation can provide valuable insights for user-specific design and selection. The proposed hybrid-link system enables such analysis by integrating a Piece-wise Constant Strain (PCS) model to represent prosthetic flexibility. Based on this system, the simulation methodology generates whole-body dynamic motions of a unilateral transtibial amputee using a reinforcement learning approach. This framework integrates imitation learning based on motion capture data with accurate computation of prosthetic dynamics. Running motions are simulated under multiple virtual prosthetic stiffness conditions, and the corresponding metabolic cost of transport (COT) obtained from these simulations is analyzed. The results suggest that variations in prosthetic stiffness influence running dynamics and performance, and that COT is consistent with values reported in prior studies. Our findings demonstrate the potential of the proposed approach for simulation and analysis under virtual conditions that differ from real-world conditions.
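
As background for the COT metric, one common dimensionless definition of cost of transport is given below. The symbols are notation chosen here, and the abstract does not state which normalization the authors use, so this should be read as an assumption.

```latex
% Dimensionless cost of transport: energy E spent to move a body of
% mass m over distance d, normalized by gravitational acceleration g.
\mathrm{COT} = \frac{E}{m \, g \, d}
```

Metabolic COT is also frequently reported without the division by $g$, in units of J/(kg m), so values should be compared only after checking which convention a study follows.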

2604.03042 2026-06-05 cs.RO 版本更新

Enhancing Multi-Robot Exploration Using Probabilistic Frontier Prioritization with Dirichlet Process Gaussian Mixtures

利用概率前沿优先级与狄利克雷过程高斯混合模型增强多机器人探索

John Lewis Devassy, Meysam Basiri, Mário A. T. Figueiredo, Pedro U. Lima

发表机构 * Institute for Systems and Robotics / LARSyS and Instituto Superior Técnico, Universidade de Lisboa(系统与机器人研究所 / LARSyS 和里斯本大学理工学院) Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de Lisboa(电信研究所和里斯本大学理工学院)

AI总结 本文提出了一种基于概率前沿优先级和狄利克雷过程高斯混合模型的改进方法,以提升多机器人探索的效率,通过在两种先进的多智能体探索算法中集成该方法,实现了在不同环境复杂度、通信限制和团队规模下的性能提升,实验结果表明平均性能提升了10%至14%。

Comments Accepted: IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

多智能体自主探索对于环境监测、搜索救援和大规模工业监控等应用至关重要。然而,在通信限制下有效协调仍是一个重大挑战。前沿探索算法分析已知区域与未知区域之间的边界,以确定下一个最佳视图,以最大化探索收益。本文提出了一种改进现有基于前沿的探索算法的方法,通过引入概率前沿优先级方法,利用狄利克雷过程高斯混合模型(DP-GMM)和信息增益的概率公式,提高前沿优先级的质量。该改进方法整合到两种最先进的多智能体探索算法中,在不同环境复杂度、通信限制和团队规模下均实现了性能提升。仿真显示,两种算法在所有组合中平均收益提高了10%和14%。在双无人机真实世界实验中的成功部署进一步证实了这些发现。

英文摘要

Multi-agent autonomous exploration is essential for applications such as environmental monitoring, search and rescue, and industrial-scale surveillance. However, effective coordination under communication constraints remains a significant challenge. Frontier exploration algorithms analyze the boundary between the known and unknown regions to determine the next-best view that maximizes exploratory gain. This article proposes an enhancement to existing frontier-based exploration algorithms by introducing a probabilistic approach to frontier prioritization. By leveraging a Dirichlet process Gaussian mixture model (DP-GMM) and a probabilistic formulation of information gain, the method improves the quality of frontier prioritization. The proposed enhancement, integrated into two state-of-the-art multi-agent exploration algorithms, consistently improves performance across environments of varying clutter, communication constraints, and team sizes. Simulations show average gains of $10\%$ and $14\%$ for the two algorithms across all combinations. Successful deployment in real-world experiments with a dual-drone system further corroborates these findings.

2603.10971 2026-06-05 cs.RO cs.AI 版本更新

ContactExplorer: Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation

ContactExplorer: 接触覆盖引导的通用灵巧操作探索

Zixuan Liu, Ruoyi Qiao, Chenrui Tie, Xuanwei Liu, Yunfan Lou, Chongkai Gao, Zhixuan Xu, Lin Shao

发表机构 * School of Computing, National University of Singapore(新加坡国立大学计算机学院) RoboScience(机器人科学)

AI总结 提出ContactExplorer方法,通过接触覆盖奖励和能量引导奖励,在灵巧操作任务中高效探索接触模式,提升样本效率和成功率。

Comments 24 pages

详情
AI中文摘要

强化学习在Atari游戏、导航和移动等任务中取得了显著成功,这些任务中的探索通常可以通过状态或动态的新颖性来引导。相比之下,灵巧操作需要丰富的物理手-物体交互,但现有方法常受限于不稳定的基于接触的新颖性信号、低效的距离新颖性信号或依赖任务先验知识。我们提出ContactExplorer,一种用于灵巧操作任务的通用探索方法。ContactExplorer将接触表示为物体表面点与手部关键点的交集,鼓励灵巧手发现多样且新颖的接触模式,即哪些手指接触物体的哪些区域。它维护一个基于离散化物体状态(通过学习的哈希码获得)的接触计数器,捕捉每个手指与不同物体区域交互的频率。该计数器以两种互补方式利用:(1)分配基于计数的接触覆盖奖励,促进对新接触模式的探索;(2)基于能量的到达奖励,引导智能体朝向未充分探索的接触区域。我们在多种灵巧操作任务上评估ContactExplorer。实验结果表明,ContactExplorer在样本效率和成功率上显著优于现有探索方法,并且通过ContactExplorer学习的接触模式能鲁棒地迁移到现实世界。项目页面:https://contact-explorer.github.io。

英文摘要

Reinforcement learning has achieved remarkable success in domains such as Atari games, navigation, and locomotion, where exploration can often be guided by novelty over states or dynamics. In contrast, dexterous manipulation requires rich physical hand--object interactions, but existing methods often suffer from unstable contact-based novelty signals, inefficient distance-based novelty signals, or reliance on task-specific priors. We propose ContactExplorer, a general exploration method for dexterous manipulation tasks. ContactExplorer represents contact as the intersection between object surface points and hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate ContactExplorer on a diverse set of dexterous manipulation tasks. Experimental results show that ContactExplorer substantially improves sample efficiency and success rates over existing exploration methods, and that the contact patterns learned with ContactExplorer transfer robustly to the real world. Project page: https://contact-explorer.github.io.
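
The contact counter described above can be sketched as a dictionary keyed by object-state code, finger, and object region. This is a minimal illustration under assumptions: the abstract does not give the reward formula, so the inverse-square-root bonus used here is a generic count-based choice, the class and argument names are hypothetical, and the energy-based reaching reward is not shown.

```python
from collections import defaultdict
import math

class ContactCounter:
    """Count (state code, finger, region) contacts and pay a novelty bonus."""

    def __init__(self):
        self.counts = defaultdict(int)

    def coverage_reward(self, state_code, contacts):
        """Return the bonus for the contacts active at this step.

        state_code: hashable code for the discretized object state.
        contacts: iterable of (finger, region) pairs currently in contact.
        """
        reward = 0.0
        # Deduplicate so one contact pair is counted once per step.
        for finger, region in set(contacts):
            key = (state_code, finger, region)
            self.counts[key] += 1
            # Assumed bonus: decays as the same contact is revisited.
            reward += 1.0 / math.sqrt(self.counts[key])
        return reward
```

Because the key includes the state code, touching the same region with the same finger earns a fresh bonus once the object has moved to a new discretized state, which is how such a counter rewards coverage rather than a single repeated contact.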

2602.12628 2026-06-05 cs.RO 版本更新

Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

超越模仿:基于强化学习的仿真-现实协同训练用于VLA模型

Liangzhi Shi, Shuaihang Chen, Feng Gao, Yinuo Chen, Kang Chen, Tonghe Zhang, Hongzhi Zang, Jiakai Zhou, Weinan Zhang, Chao Yu, Yu Wang

发表机构 * Tsinghua University(清华大学) Harbin Institute of Technology(哈尔滨工业大学) Peking University(北京大学) Carnegie Mellon University(卡内基梅隆大学) Shanghai AI Laboratory(上海人工智能实验室) Zhongguancun Academy(中关村学院)

AI总结 本文提出基于强化学习的仿真-现实协同训练框架,通过结合仿真交互与真实世界数据,提升VLA模型的现实应用能力和泛化能力。

详情
AI中文摘要

仿真提供了一种可扩展且低成本的方式来丰富视觉-语言-动作(VLA)训练,减少了对昂贵真实机器人演示的依赖。然而,大多数仿真-现实协同训练方法依赖于监督微调(SFT),将仿真视为静态演示源,并未利用大规模闭环交互。因此,现实世界收益和泛化能力往往受到限制。在本文中,我们提出了一种基于强化学习的仿真-现实协同训练(RL-Co)框架,该框架在利用交互式仿真的同时保持现实世界的能力。我们的方法遵循一种通用的两阶段设计:首先使用SFT在真实和仿真演示的混合数据上预热策略,然后在仿真中通过强化学习进行微调,同时在真实世界数据上添加辅助监督损失以锚定策略并缓解灾难性遗忘。我们在四个现实世界桌面操作任务上评估了该框架,使用两种代表性的VLA架构OpenVLA和π_{0.5},结果相比仅用真实数据的微调和基于SFT的协同训练均有持续提升,其中OpenVLA的现实世界成功率提高24%,π_{0.5}提高20%。除了更高的成功率外,RL协同训练还对未见任务变化表现出更强的泛化能力,并显著提高了现实世界的数据效率,为利用仿真提升真实机器人部署提供了实用且可扩展的途径。

英文摘要

Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and $π_{0.5}$, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on $π_{0.5}$. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.
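The second stage described above (RL in simulation plus an auxiliary supervised loss on real data) can be sketched as a single combined objective. This is a toy illustration under stated assumptions: the REINFORCE-style surrogate, the discrete softmax policy, the helper names, and `aux_weight=0.5` are all hypothetical and are not confirmed by the abstract, which does not name the RL algorithm or the loss weighting.

```python
import math


def log_prob(logits, action):
    """Log-probability of `action` under a softmax over `logits`."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[action] - log_z


def sft_loss(logits_batch, expert_actions):
    """Supervised loss: mean negative log-likelihood of demonstrated actions."""
    nll = [-log_prob(l, a) for l, a in zip(logits_batch, expert_actions)]
    return sum(nll) / len(nll)


def rl_surrogate(logits_batch, actions, advantages):
    """Assumed REINFORCE-style surrogate: mean of -advantage * log pi(a|s)."""
    terms = [-adv * log_prob(l, a)
             for l, a, adv in zip(logits_batch, actions, advantages)]
    return sum(terms) / len(terms)


def rl_co_objective(sim_batch, real_batch, aux_weight=0.5):
    """RL loss on simulated rollouts plus a weighted SFT anchor on real data."""
    return rl_surrogate(*sim_batch) + aux_weight * sft_loss(*real_batch)


# One simulated transition (logits, action, advantage) and one real demo step.
sim_batch = ([[0.0, 0.0]], [1], [1.0])
real_batch = ([[0.0, 0.0]], [0])
total = rl_co_objective(sim_batch, real_batch, aux_weight=0.5)
```

The supervised term is computed only on real-world data, so gradient updates driven by simulation reward are still pulled toward the real demonstrations, which is the anchoring role the abstract describes.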

2602.16705 2026-06-05 cs.RO cs.CV 版本更新

HERO: Learning Humanoid End-Effector Control for Visual Whole-Body Open-Vocabulary Object Grasping

HERO: 学习人形机器人末端执行器控制,用于基于视觉的全身开放词汇物体抓取

Runpei Dong, Ziyan Li, Arjun Gupta, Xialin He, Saurabh Gupta

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 该研究提出HERO方法,通过结合大视觉模型和仿真训练,实现了视觉全身开放词汇物体抓取任务中末端执行器的高精度控制和场景理解,显著提升了抓取精度和泛化能力。

Comments Project page: https://hero-humanoid.github.io/

详情
AI中文摘要

对任意真实场景中的物体进行视觉移动操作(loco-manipulation),需要精确的末端执行器(EE)控制,以及从视觉输入(如RGB-D图像)中获得的可泛化场景理解。现有的模仿学习和仿真到现实方法通过单体式端到端学习同时学习这两个方面,因此难以扩展。在本工作中,我们为每个问题选用最合适的工具:大视觉模型用于可泛化的场景理解,仿真训练用于精确的末端执行器控制,从而得到一个整体模块化的移动操作系统,表现出强大的泛化能力。我们的核心技术创新是HERO,一种结合经典机器人学与机器学习的高精度残差感知末端执行器跟踪策略。它利用a)逆运动学将残差末端执行器目标转换为参考轨迹,b)学习得到的神经前向模型以实现准确的前向运动学,以及c)目标调整和重新规划。这些创新共同将末端执行器跟踪误差降低到2.44厘米,比最强的先前方法好5.5倍。我们的整体系统可在从办公室到咖啡馆的多样化现实环境中运行,机器人能够可靠地抓取放在高度43厘米到92厘米的表面上的各种日常物体(如杯子、苹果、玩具)。系统性的模块化和端到端测试验证了我们所提出设计的有效性。我们相信我们的进展为训练人形机器人与日常物体交互开辟了新途径。

英文摘要

Visual loco-manipulation of arbitrary in-the-wild objects requires accurate end-effector (EE) control and a generalizable understanding of the scene from visual inputs (e.g., RGB-D images). Existing imitation and sim2real methods jointly learn both these aspects via monolithic end-to-end learning and are thus hard to scale. In this work, we bring to bear the best tools for each of these problems -- large vision models for generalizable scene understanding and simulated training for accurate EE control -- leading to an overall modular loco-manipulation system that exhibits strong generalization. Our core technical innovation is HERO, an accurate residual-aware EE tracking policy made possible by combining classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, and c) goal adjustment and replanning. Together, these innovations reduce the end-effector tracking error to 2.44 cm, outperforming the strongest prior method by 5.5x. Our overall system operates in diverse real-world environments, from offices to coffee shops, where the robot reliably grasps various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43 cm to 92 cm in height. Systematic modular and end-to-end tests demonstrate the effectiveness of our proposed design. We believe our advances open up new ways of training humanoids to interact with daily objects.

2602.10106 2026-06-05 cs.RO 版本更新

EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration

EgoHumanoid: 通过无机器人的第一人称示范解锁真实场景中的移动操作

Modi Shi, Shijia Peng, Jin Chen, Haoran Jiang, Tianyu Li, Di Huang, Ping Luo, Hongyang Li, Li Chen

发表机构 * University of California, Berkeley(加州大学伯克利分校) Tsinghua University(清华大学)

AI总结 本文提出EgoHumanoid框架,通过结合大量第一人称人类示范数据和少量机器人数据共同训练视觉-语言-动作策略,使人形机器人能够在多样化的现实环境中执行移动操作任务,实验表明无机器人数据显著提升了性能,尤其在未见过的环境中表现更优。

Comments Project page: https://opendrivelab.com/EgoHumanoid

详情
AI中文摘要

人类示范提供丰富的环境多样性,并能自然扩展规模,使其成为机器人遥操作的有吸引力的替代方案。尽管这一范式已促进了机械臂操作的发展,但它在更具挑战性且数据需求更高的人形机器人移动操作问题上的潜力仍基本未被探索。我们提出了EgoHumanoid,这是首个利用大量第一人称人类示范数据和少量机器人数据共同训练视觉-语言-动作策略的框架,使人形机器人能够在多样化的现实环境中执行移动操作任务。为弥合人类与机器人之间的具身差距,包括身体形态和视角的差异,我们引入了一个系统化的对齐流程,涵盖从硬件设计到数据处理的各个方面。我们开发了一种便携式系统用于可扩展的人类数据收集,并建立了实用的收集协议以提高可迁移性。我们的人类到人形机器人对齐流程的核心包含两个关键组件。视图对齐减少了由相机高度和视角变化引起的视觉域差异。动作对齐将人类动作映射到一个统一的、运动学上可行的人形机器人控制动作空间。大量现实世界实验表明,加入无机器人的第一人称数据后,性能显著优于仅使用机器人数据的基线,提升51%,在未见过的环境中尤为明显。我们的分析进一步揭示了哪些行为能够有效迁移,以及扩展人类数据规模的潜力。

英文摘要

Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. While this paradigm has advanced robot-arm manipulation, its potential for the more challenging, data-hungry problem of humanoid loco-manipulation remains largely unexplored. We present EgoHumanoid, the first framework to co-train a vision-language-action policy using abundant egocentric human demonstrations together with a limited amount of robot data, enabling humanoids to perform loco-manipulation across diverse real-world environments. To bridge the embodiment gap between humans and robots, including discrepancies in physical morphology and viewpoint, we introduce a systematic alignment pipeline spanning from hardware design to data processing. A portable system for scalable human data collection is developed, and we establish practical collection protocols to improve transferability. At the core of our human-to-humanoid alignment pipeline lie two key components. The view alignment reduces visual domain discrepancies caused by camera height and perspective variation. The action alignment maps human motions into a unified, kinematically feasible action space for humanoid control. Extensive real-world experiments demonstrate that incorporating robot-free egocentric data significantly outperforms robot-only baselines by 51%, particularly in unseen environments. Our analysis further reveals which behaviors transfer effectively and the potential for scaling human data.

2512.21430 2026-06-05 cs.RO 版本更新

EVE: A Generator-Verifier System for Generative Policies

EVE: 一种生成策略的生成-验证系统

Yusuf Ali, Gryphon Patlin, Karthik Kothuri, Jeremiah Coholich, Muhammad Zubair Irshad, Wuwei Liang, Zsolt Kira

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Toyota Research Institute(丰田研究院) Symbotic Inc.(Symbotic公司)

AI总结 本文提出EVE系统,通过生成-验证框架在测试时提升预训练生成策略的性能,利用零样本视觉语言模型验证者进行动作优化,无需额外训练。

详情
AI中文摘要

基于扩散和流匹配等生成模型的视觉运动策略在机器人应用中表现出色,但在分布偏移下性能下降,若不进行昂贵的微调,其恢复能力有限。在语言模型领域,测试时计算扩展通过支持候选解的细化,革新了现代LLM的推理能力。这些方法通常以零样本方式利用基础模型作为验证模块,对候选解进行评分。我们假设生成策略同样可以从额外的推理时计算中受益,即在生成-验证框架中使用基于VLM的零样本验证器。为此,我们引入EVE:一个模块化的生成器-验证器交互框架,无需额外训练即可在测试时提升预训练生成策略的性能。EVE用多个基于VLM的零样本验证器智能体包裹冻结的基础策略。每个验证器针对基础策略的候选动作提出动作细化建议,而动作融合器使用分类器引导将聚合后的验证器反馈融入动作去噪过程。我们研究了在由能力各异的验证器组成的系统中,生成器-验证器信息接口的设计选择。在多样化的仿真和真实机器人任务及具身形态上,EVE在不进行额外策略或验证器训练的情况下持续提高了成功率。通过大量消融实验,我们分离了验证器能力和动作融合器策略各自的贡献,为构建可扩展、模块化的具身控制生成器-验证器系统提供了实用指南。

英文摘要

Visuomotor policies based on generative models such as diffusion and flow matching have shown strong performance for robotics applications but degrade under distribution shifts, demonstrating limited recovery capabilities without costly finetuning. In the language modeling domain, test-time compute scaling has revolutionized the reasoning capabilities of modern LLMs by enabling candidate solution refinement. These methods typically leverage foundation models as verification modules in a zero-shot manner to score candidate solutions. We hypothesize that generative policies can similarly benefit from additional inference-time compute that employs zero-shot VLM-based verifiers in a generation-verification framework. To this end, we introduce EVE: a modular, generator-verifier interaction framework that boosts the performance of pretrained generative policies at test time, with no additional training. EVE wraps a frozen base policy with multiple zero-shot, VLM-based verifier agents. Each verifier proposes action refinements to the base policy's candidate actions, while an action incorporator uses classifier guidance to fuse aggregated verifier feedback into action denoising. We study design choices for generator-verifier information interfacing across a system of verifiers with distinct capabilities. Across diverse simulated and real robotic tasks and embodiments, EVE consistently improves success rates without additional policy or verifier training. Through extensive ablations, we isolate the contribution of verifier capabilities and action incorporator strategies, offering practical guidelines to build scalable, modular generator-verifier systems for embodied control.
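The idea of fusing verifier feedback into action denoising through guidance can be sketched in a toy form. The abstract gives no implementation details, so everything below is an assumption: the denoiser is replaced by a simple step toward a fixed `base_target`, verifier suggestions are averaged, and the guidance term is the gradient of a quadratic cost `0.5 * ||a - suggestion||^2`. The function name and all parameters are hypothetical.

```python
def guided_denoise(action, base_target, verifier_targets,
                   steps=10, step_size=0.5, guidance_scale=0.3):
    """Toy guided refinement of an action vector (lists of floats)."""
    dim = len(action)
    # Aggregate verifier suggestions by simple averaging (an assumption).
    agg = [sum(t[i] for t in verifier_targets) / len(verifier_targets)
           for i in range(dim)]
    a = list(action)
    for _ in range(steps):
        # Stand-in for one denoising update of the frozen base policy.
        a = [x + step_size * (b - x) for x, b in zip(a, base_target)]
        # Guidance: gradient step on 0.5 * ||a - agg||^2, whose gradient is (a - agg).
        a = [x - guidance_scale * (x - g) for x, g in zip(a, agg)]
    return a


noisy = [2.0]
unguided = guided_denoise(noisy, base_target=[0.0],
                          verifier_targets=[[1.0]], guidance_scale=0.0)
guided = guided_denoise(noisy, base_target=[0.0],
                        verifier_targets=[[0.8], [1.2]], guidance_scale=0.3)
```

With `guidance_scale=0.0` the action simply follows the base policy's target; with guidance enabled it settles between the base target and the aggregated verifier suggestion, and the base policy itself is never modified.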

2510.26236 2026-06-05 cs.RO 版本更新

PHUMA: Physically Reliable Humanoid Locomotion Dataset

PHUMA:物理可靠的仿人运动数据集

Kyungmin Lee, Sibeen Kim, Youngdo Lee, Minho Park, Hyunseung Kim, Dongyoon Hwang, Donghu Kim, Hojoon Lee, Jaegul Choo

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出PHUMA数据集,通过结合物理感知的筛选和物理约束的重定向,整合动作捕捉和网络视频,生成物理可靠的仿人运动数据,提升仿人运动的稳定性和泛化能力。

详情
AI中文摘要

运动模仿是实现仿人运动的一种有前景的方法,使代理能够获取类人行为。现有方法通常依赖高质量的动作捕捉数据集,如AMASS,但这些数据稀缺且昂贵,限制了可扩展性和多样性。最近的研究尝试通过转换大规模互联网视频来扩大数据收集,例如Humanoid-X。然而,这些方法常面临物理伪影,如漂浮、穿透和脚滑,阻碍了稳定的模仿。为此,我们引入PHUMA,一个物理可靠的仿人运动数据集,通过两阶段流程结合物理感知的筛选和物理约束的重定向,将动作捕捉和互联网视频整合为一个73小时的物理可靠数据集。在动作跟踪基准测试中,PHUMA训练的策略比AMASS和Humanoid-X训练的策略成功率更高,并成功在真实Unitree G1上实现零样本迁移。代码可在https://davian-robotics.github.io/PHUMA获取。

英文摘要

Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often suffer from physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. To address this, we introduce PHUMA, a Physically Reliable HUMAnoid locomotion dataset produced by a two-stage pipeline combining physics-aware curation and physics-constrained retargeting, aggregating both motion capture and internet video into a physically reliable, 73-hour corpus. On motion tracking benchmarks, PHUMA-trained policies achieve higher success rates than those trained on AMASS and Humanoid-X, and successfully transfer zero-shot to a real Unitree G1. The code is available at https://davian-robotics.github.io/PHUMA.

2509.15061 2026-06-05 cs.RO cs.CV 版本更新

Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Ask-to-Clarify: 通过多轮对话解决指令歧义

Xingyao Lin, Xinghao Zhu, Tianyi Lu, Sicheng Xie, Hui Zhang, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China(复旦大学计算机科学与人工智能学院) Shanghai Innovation Institute, Shanghai, China(上海创新研究院) Mechanical Systems Control Lab, UC Berkeley, California, USA(伯克利机械系统控制实验室)

AI总结 本文提出Ask-to-Clarify框架,通过多轮对话解决指令歧义问题,结合视觉语言模型和扩散模型,采用两阶段知识隔离策略训练,实现多任务中更高效的协作式具身代理。

Comments 9 pages, 4 figures, 7 tables

详情
AI中文摘要

具身代理的最终目标是创造能够与人类交互的合作者,而非仅仅被动执行指令的执行者。这要求代理能够根据人类反馈进行沟通、协调并调整自身行动。最近,视觉-语言-动作模型(VLA)的进步为实现这一目标提供了途径。然而,大多数当前基于VLA的具身代理仍处于单向模式:接收指令并执行,而没有反馈。这种做法在现实场景中往往失效,因为指令通常存在歧义。在本文中,我们提出了Ask-to-Clarify框架来解决这一问题。该框架首先通过多轮对话提问来消除指令歧义,然后端到端地生成低层动作。具体来说,Ask-to-Clarify框架由两个组件组成:一个用于协作的视觉语言模型(VLM)和一个用于动作生成的扩散模型。我们还引入了一个连接模块,该模块根据VLM的输出为扩散模型生成条件。该模块依据指令调整观察,以生成可靠的条件。我们采用两阶段知识隔离策略来训练我们的框架。首先,我们使用歧义消解对话数据微调协作组件以处理歧义。然后,我们在冻结协作组件的情况下整合动作组件。这样在微调扩散模型以生成动作的同时保留了交互能力。该训练策略保证了我们的框架能够先提问,再生成动作。在推理过程中,一个信号检测器充当路由器,帮助框架在提问和执行动作之间切换。我们在8个现实任务中评估了Ask-to-Clarify框架,其表现优于现有最先进的VLA。结果表明,所提出的框架及其训练策略为实现协作式具身代理提供了一条可行路径。

英文摘要

The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components: a VLM for collaboration and a diffusion model for action. We also introduce a connection module that generates conditions for the diffusion model based on the output of the VLM. This module adjusts the observation according to the instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one. This preserves the interaction abilities while fine-tuning the diffusion model to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.

2507.06219 2026-06-05 cs.RO cs.AI cs.LG 版本更新

Is Diversity All You Need for Scalable Robotic Manipulation?

多样性是否是可扩展机器人操作的全部需求?

Modi Shi, Li Chen, Jin Chen, Yuxiang Lu, Chiming Liu, Guanghui Ren, Ping Luo, Di Huang, Maoqing Yao, Hongyang Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文研究了数据多样性在机器人学习中的作用,发现任务多样性比单任务演示量更重要,多身体预训练数据在跨身体转移中可选,专家多样性可能对策略学习产生干扰,提出分布去偏方法提升性能。

Comments Code is available at https://github.com/OpenDriveLab/AgiBot-World

详情
AI中文摘要

数据扩展在自然语言处理和计算机视觉的基础模型中取得了显著成功,但机器人操作中有效数据扩展的原则仍不够清楚。本文通过研究机器人学习中数据多样性的细微作用,探讨了三个关键维度:任务(做什么)、身体(使用哪种机器人)和专家(谁演示)。通过在各种机器人平台上进行广泛实验,我们发现:(1)任务多样性比单任务演示数量更重要,有助于从多样预训练任务转移到新下游场景;(2)多身体预训练数据在跨身体转移中是可选的,高质量单身体预训练模型可以高效地转移到不同平台,在微调过程中表现出比多身体预训练模型更优的扩展特性;(3)专家多样性源于个体操作偏好和人类演示中的随机变化,可能对策略学习产生干扰,速度多模态成为关键贡献因素。基于这一洞察,我们提出了一种分布去偏方法以缓解速度模糊性,所提出的GO-1-Pro方法实现了15%的性能提升,相当于使用2.5倍的预训练数据。这些发现提供了新的视角,并为如何有效扩展机器人操作数据集提供了实用指导。

英文摘要

Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions -- task (what to do), embodiment (which robot to use), and expert (who demonstrates) -- challenging the conventional intuition of "more diverse is better". Through extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing a more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.

2503.23300 2026-06-05 cs.CV cs.RO 版本更新

Learning Predictive Visuomotor Coordination

学习预测性视觉-运动协调

Wenqi Jia, Bolin Lai, Miao Liu, Danfei Xu, James M. Rehg

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Georgia Tech(佐治亚理工学院) Meta AI

AI总结 本文提出了一种基于预测的视觉-运动协调建模任务,通过结合第一人称视觉和运动学观测预测头部姿态、目光方向和上半身运动,展示了多模态整合在理解视觉-运动协调中的重要性。

Comments CVPR 2026 Findings

详情
AI中文摘要

理解并预测人类视觉-运动协调对于机器人学、人机交互和辅助技术的应用至关重要。本文介绍了一种基于预测的视觉-运动协调建模任务,目标是从第一人称视觉和运动学观测中预测头部姿态、目光方向和上半身运动。我们提出了一种视觉-运动协调表示(VCR),学习这些多模态信号之间的结构时间依赖性。我们扩展了基于扩散的运动建模框架,整合了第一人称视觉和运动学序列,实现了时间一致且准确的视觉-运动预测。我们的方法在大规模EgoExo4D数据集上进行了评估,展示了在多样化现实活动中的强大泛化能力。我们的结果强调了多模态整合在理解视觉-运动协调中的重要性,为视觉-运动学习和人类行为建模的研究做出了贡献。项目页面:https://vjwq.github.io/VCR/.

英文摘要

Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling. Project Page: https://vjwq.github.io/VCR/.

2409.13607 2026-06-05 cs.RO 版本更新

RECON: Reducing Causal Confusion with Human-Placed Markers

RECON: 通过人类放置的标记减少因果混淆

Robert Ramirez Sanchez, Heramb Nemlekar, Shahabedin Sagheb, Cara M. Nunez, Dylan P. Losey

发表机构 * Collaborative Robotics Lab ( Collab ), Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24061(协作机器人实验室(Collab),机械工程系,弗吉尼亚理工学院,布莱克斯堡,VA 24061) Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY 14853(西伯利机械与航空航天工程学院,康奈尔大学,伊萨卡,NY 14853)

AI总结 该研究提出RECON框架,通过人类主动标记任务关键部分来减少机器人学习中的因果混淆,利用标记物数据训练任务相关状态嵌入,从而提高学习效率。

Comments 7 pages, 5 figures

详情
AI中文摘要

模仿学习使机器人能够从人类示例中学习新任务。然而,从人类学习时的一个根本限制是因果混淆。因果混淆发生在机器人观察到的任务相关和无关信息同时存在时:例如,机器人的摄像头可能不仅看到目标,还看到环境中的杂物和光照变化。由于机器人事先不知道哪些观察方面是重要的,它经常误解人类的例子,无法学习所需任务。为了解决这个问题,我们指出——尽管机器人学习者可能不知道该关注什么,但人类教师知道。在本文中,我们提出人类应主动用小型轻量的标记物标记任务关键部分。在我们的框架(RECON)中,人类在提供演示前将这些标记物附着在任务相关对象上:当人类展示任务示例时,标记物跟踪标记对象的位置。我们随后利用这些离线标记数据来训练任务相关状态嵌入。具体来说,我们将机器人的观察嵌入到一个与测量标记读数相关的潜在状态中:在实践中,这使机器人能够自动过滤掉无关观察,并基于从标记数据中学习的特征做出决策。我们的模拟和一个真实机器人实验表明,这种人类放置标记的框架可以缓解因果混淆。确实,我们发现使用RECON显著减少了传达任务所需的演示次数,从而降低人类教学的总体时间。见此处视频:https://youtu.be/oy85xJvtLSU

英文摘要

Imitation learning enables robots to learn new tasks from human examples. One fundamental limitation while learning from humans is causal confusion. Causal confusion occurs when the robot's observations include both task-relevant and extraneous information: for instance, a robot's camera might see not only the intended goal, but also clutter and changes in lighting within its environment. Because the robot does not know which aspects of its observations are important a priori, it often misinterprets the human's examples and fails to learn the desired task. To address this issue, we highlight that -- while the robot learner may not know what to focus on -- the human teacher does. In this paper we propose that the human proactively marks key parts of their task with small, lightweight beacons. Under our framework (RECON) the human attaches these beacons to task-relevant objects before providing demonstrations: as the human shows examples of the task, beacons track the position of marked objects. We then harness this offline beacon data to train a task-relevant state embedding. Specifically, we embed the robot's observations to a latent state that is correlated with the measured beacon readings: in practice, this causes the robot to autonomously filter out extraneous observations and make decisions based on features learned from the beacon data. Our simulations and a real robot experiment suggest that this framework for human-placed beacons mitigates causal confusion. Indeed, we find that using RECON significantly reduces the number of demonstrations needed to convey the task, lowering the overall time required for human teaching. See videos here: https://youtu.be/oy85xJvtLSU