Robotics / Embodied AI - arXivDaily Topic Digest

2606.19333 2026-06-18 cs.RO cs.CV New submission 90%

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

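Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

Affiliations: UC Berkeley (University of California, Berkeley)

Topic match (robot learning): reconstructing dexterous manipulation data from human videos

AI summary: Proposes DO AS I DO, an algorithm that reconstructs hand-object interactions from monocular RGB human videos and retargets them to multi-fingered dexterous robot hands, producing executable manipulation data and outperforming prior methods.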

Comments Project website: https://do-as-i-do.com/

Abstract

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

2606.18772 2026-06-18 cs.RO New submission 90%

HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations

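Zehui Zhao, Yuxuan Zhao, Gaojing Zhang, Chenxi Liu, Maolin Zheng, Wenzhao Lian

Affiliations: Shanghai Jiao Tong University; University of Sussex; East China University of Science and Technology

Topic match (robot learning): humanoid loco-manipulation learned from human demonstrations

AI summary: Proposes the HALOMI framework, which extends the Universal Manipulation Interface (UMI) with egocentric sensing for active perception and combines a manifold-constrained controller with observation-action alignment. On a Unitree G1 humanoid it is validated on five real-world tasks and reaches an 85% average success rate on the three that were evaluated quantitatively.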

Abstract

Human demonstrations, which can be collected at scale and naturally capture active hand-eye coordination, are a promising data source for learning humanoid loco-manipulation. However, directly transferring human demonstrations to humanoids requires a precise world-frame tracking controller, which is often brittle under out-of-distribution (OOD) targets, while human-to-humanoid gaps persist in both egocentric observation and action execution. To address these challenges, we present HALOMI, a scalable framework for learning humanoid loco-manipulation with active perception from human demonstrations. HALOMI extends the Universal Manipulation Interface (UMI) with egocentric sensing to collect ego-view and wrist-view observations along with head-hand trajectories at scale. We further propose a manifold-constrained controller that plans in a learned latent behavior manifold to enable precise and robust head-hand tracking in the world frame. To bridge the human-to-humanoid gap, we perform ego-view alignment and introduce a controller-aware reference trajectory adaptation to reduce mismatch in both observation and action execution. We validate HALOMI on a Unitree G1 humanoid robot with an actuated neck across five real-world tasks involving navigation, grasping, bimanual manipulation, whole-body coordination, and dynamic behaviors. Across the three quantitatively evaluated tasks, HALOMI achieves an average success rate of 85%, while additional qualitative demonstrations show its ability to support dynamic tossing and deep-squat grasping.

2606.18704 2026-06-18 cs.RO New submission 90%

Selective Unit-Cell Actuation in Lattice Structures for Distributed Morphology in Soft Robots

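Trevor Exley, Altair Coutinho, Lucia Beccai

Affiliations: Istituto Italiano di Tecnologia (IIT)

Topic match (robot learning): actuation and morphology control of lattice structures in soft robots

AI summary: Proposes an embedded pneumatic unit cell that integrates a curved-strut lattice with a bidirectional bellow actuator, so that spatial actuation patterns control global morphology. Experiments show scalable displacement and force generation as well as bending, grasping, and crawling.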

Comments Accepted to IROS 2026, 8 pages, 5 figures

Abstract

Soft lattice structures are increasingly used in robotics to tailor compliance and guide deformation; however, actuation is typically introduced at the device or module level, with actuators inserted into otherwise passive architectures. In this work, we move actuator-lattice co-design to the unit-cell scale. We present an embedded pneumatic unit cell that integrates curved-strut lattice geometry with a bidirectional bellow actuator within a single monolithic element. When tessellated, the lattice functions as a distributed actuation field in which global morphology is governed by spatial actuation patterns rather than uniform pressurization. Experimental characterization of 1x1, 2x2, and 3x3 tessellations demonstrates scalable displacement and force generation with repeatable cyclic performance. Selective actuation of unit cells in a 3x3x3 array produces distinct global deformation modes, including bending and directional grasping, without altering hardware configuration. Additionally, coupling active and passive unit cells enables bending-driven crawling locomotion, demonstrating that heterogeneous tessellations can translate through asymmetric deformation. These results establish unit-cell-level actuation as a strategy for distributed morphing in lattice-based soft robots and provide a foundation for scalable, monolithic robotic architectures.

2606.13672 2026-06-18 cs.RO New submission 90%

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

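Arnav Kumar Jain, Yilin Wu, Jesse Farebrother, Gokul Swamy, Andrea Bajcsy

Affiliations: Mila - Québec AI Institute; Université de Montréal; Carnegie Mellon University; McGill University

Topic match (robot learning): world models for policy evaluation and planning in robotic manipulation

AI summary: Proposes WEAVER, a world-model architecture that trains multi-view latent prediction with a flow-matching loss to achieve high fidelity, long-horizon consistency, and efficient inference together, and that improves policy evaluation, policy improvement, and test-time planning on robotic manipulation tasks.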

Abstract

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: (i) fidelity (i.e., producing simulated trajectories that correlate with reality), (ii) consistency (i.e., producing simulated trajectories that are coherent over long horizons), and (iii) efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. WEAVER is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER in robotic hardware, demonstrating its effectiveness at policy evaluation ($\rho = 0.870$ correlation with real-world success rate), policy improvement (real-world success rate improvement of 38% on top of the $\pi_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of 14% with a 5-10x speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/ .
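
The policy-evaluation result above is a correlation between success rates predicted in the world model and success rates measured on hardware. The sketch below shows how such a number can be computed, assuming a Pearson coefficient (the abstract does not say which coefficient was used). The function name and the five pairs of success rates are illustrative assumptions, not values from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

# Hypothetical success rates for five policies (illustrative only).
predicted = [0.20, 0.45, 0.50, 0.70, 0.90]  # estimated from world-model rollouts
real = [0.25, 0.40, 0.60, 0.65, 0.95]       # measured in hardware trials
rho = pearson(predicted, real)
```

A value of `rho` close to 1 means the world model ranks policies much as real trials do, which is what makes it usable as an evaluation proxy.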

2606.18328 2026-06-18 cs.RO New submission 88%

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

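Bowen Li, Mayank Mishra, Y. Isabel Liu, Stone Tao, Nishanth Kumar, Alexander G. Gray, Ruwan Wickramarachchi, Jonathan Francis, Sebastian Scherer, Tom Silver

Affiliations: CMU (Carnegie Mellon University); Princeton (Princeton University); AI2 (Allen Institute for AI); MIT (Massachusetts Institute of Technology); Centaur AI; Bosch Center for AI

Topic match (robot learning): learning skills and concepts from robot failures to enable long-horizon planning

AI summary: Proposes ReSYNC, which alternates skill learning and concept discovery to build abstract predicates step by step from failure-recovery experience, enabling global failure avoidance and long-horizon planning and outperforming strong baselines by over 50%.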

Comments 9 pages, 6 figures. Website: https://jaraxxus-me.github.io/ReSYNC/

Abstract

Intelligent robots should not only recover from failures, but also acquire the abstract knowledge needed to avoid them in the future. While reinforcement learning (RL) can learn reactive recovery behaviors, training a separate policy for every distinct failure mode is highly inefficient. We introduce Recovery-Driven Synthesis of Relational Concepts (ReSYNC), the first approach that progressively discovers and refines state abstractions (relational predicates) from failure-recovery experience to support abstract planning. Unlike purely reactive methods, ReSYNC jointly learns skills and concepts through an incremental dual-learning process. In the skill-learning phase, the robot uses RL to learn to recover from failures seen in training tasks. In the concept-learning phase, the robot discovers new relational predicates and refines its abstract planning model to explain and generalize the learned recovery behaviors. This interaction enables ReSYNC to convert local recoveries seen during training into global failure avoidance at test time. Across four simulated domains, we show that ReSYNC's ability to continually expand and refine its abstraction library allows it to solve long-horizon, previously unseen problems, outperforming strong baselines by over 50%. Additionally, we demonstrate sim-to-real transfer of ReSYNC, where it performs real-world non-prehensile manipulation skills and generalizes to unseen scenarios through abstract planning. Overall, ReSYNC represents a significant step toward robots that autonomously acquire abstractions for scalable, failure-aware planning in the physical world.

2606.18959 2026-06-18 cs.RO New submission 85%

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer

Arunim Joarder, Arjun Bhardwaj, René Zurbrügg, Mayank Mittal, Florin Püntener, Sira Bielefeldt, Cosmin Roman, Vaishakh Patil, Marco Hutter

Affiliations: Robotic Systems Lab, ETH Zürich; Micro- and Nanosystems Lab, ETH Zürich; ETH AI Center; NVIDIA

Topic match (robot learning): a learned shared latent space for tactile sim-to-real transfer

AI summary: Proposes TactSpace, a multi-modal representation learning framework that aligns heterogeneous tactile modalities in a shared latent space for zero-shot sim-to-real transfer. Incorporating multi-physics simulation modalities reduces force prediction error by 16.7% and shape reconstruction error by 45.8%.

Comments 9 pages, 6 figures, 4 tables, accepted into IROS 2026

Abstract

Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach employs modality-specific encoders to project diverse tactile observations, such as simulated penetration depth and real-world capacitance, into a common embedding space. The model is trained using self- and cross-reconstruction objectives alongside contrastive alignment, encouraging modality-invariant yet information-rich representations. We evaluate the learned embeddings on indenter shape identification, force prediction, and geometric reconstruction tasks, training exclusively in simulation and testing directly on real sensor measurements. Our results demonstrate zero-shot sim-to-real transfer across physically dissimilar representations. Furthermore, incorporating multi-physics simulation modalities yields more informative embeddings that transfer across diverse downstream tasks, demonstrating a 16.7% reduction in force prediction error and a 45.8% reduction in shape reconstruction error. Finally, we release an efficient Warp-based implementation of a penalty-based tactile simulation model for Isaac Lab, enabling scalable tactile data generation.
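
The contrastive alignment mentioned above pulls the embedding of a simulated reading toward the embedding of its paired real reading and pushes it away from unpaired ones. The sketch below illustrates one common form of such an objective, an InfoNCE-style loss over cosine similarities. The abstract does not name the exact loss, so the loss form, the function names, and the temperature value are assumptions. Embeddings are assumed to be non-zero vectors.

```python
import math

def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where anchors[i] is paired with positives[i].

    anchors could be embeddings of simulated signals and positives the
    embeddings of the matching real signals (hypothetical pairing).
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # cross-entropy with the true pair as target
    return total / len(a)

aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each pair already coincides and large when the pairs are swapped, which is the pressure that drives modality-invariant embeddings.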

2606.18828 2026-06-18 cs.RO cs.AI New submission 85%

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

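Chenghao Xu

Affiliations: National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University

Topic match (robot learning): robot motion planning via Riemannian metric generation, with zero-shot generalization

AI summary: Proposes placing intelligence in the space itself: a neural semigroup-superposition mechanism generates a Riemannian metric so that action reduces to geodesic following. Trained on a single two-obstacle scene, the model generalizes zero-shot to unseen obstacle configurations.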

Abstract

Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups -- frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.
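
To make the idea of a scene-induced metric concrete, the sketch below measures path length under a hand-written conformal metric G(p) = g(p) I whose scale g grows near obstacle centers. This is only an illustration of why obstacle-penetrating paths become expensive. The Gaussian form, the `gain` and `sigma` values, and all function names are assumptions, and nothing here reproduces the paper's learned Encoder-Router network.

```python
import math

def metric_scale(p, obstacles, gain=1000.0, sigma=0.3):
    """Conformal factor g(p) >= 1 that grows sharply near obstacle centers."""
    scale = 1.0
    for ox, oy in obstacles:
        d2 = (p[0] - ox) ** 2 + (p[1] - oy) ** 2
        scale += gain * math.exp(-d2 / (2.0 * sigma ** 2))
    return scale

def path_cost(path, obstacles):
    """Riemannian length of a polyline under G(p) = g(p) * I (midpoint rule)."""
    cost = 0.0
    for a, b in zip(path, path[1:]):
        mid = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
        seg = math.hypot(b[0] - a[0], b[1] - a[1])
        cost += math.sqrt(metric_scale(mid, obstacles)) * seg
    return cost

def polyline(waypoints, n=50):
    """Resample a list of waypoints with n straight segments per leg."""
    out = []
    for a, b in zip(waypoints, waypoints[1:]):
        for k in range(n):
            t = k / n
            out.append((a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1])))
    out.append(waypoints[-1])
    return out

obstacles = [(0.0, 0.0)]
straight = polyline([(-2.0, 0.0), (2.0, 0.0)])             # passes through the obstacle
detour = polyline([(-2.0, 0.0), (0.0, 2.0), (2.0, 0.0)])   # goes around it
straight_cost = path_cost(straight, obstacles)
detour_cost = path_cost(detour, obstacles)
```

Although the detour is longer in Euclidean terms, it is much cheaper under this metric, so a geodesic solver would route around the obstacle without a separate collision checker.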

2606.18747 2026-06-18 cs.RO cs.AI New submission 85%

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

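Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

Affiliations: University of New South Wales; Universidad Central de Chile

Topic match (robot learning): robot gesture generation, with RLHF used to improve expressiveness

AI summary: Targets the stiffness of gestures generated for social robots by integrating ChatGPT into the Pepper robot to generate co-speech gestures and adding iterative reinforcement learning with human feedback (RLHF) to refine them. Results show that RLHF makes the gestures more expressive, relevant, and fluid.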

Comments 8 Pages, 6 Figures

Abstract

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

2606.18698 2026-06-18 cs.RO cs.AI cs.LG New submission 85%

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

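Alexander Belyaev, Oleg Kushnarev

Topic match (robot learning): surface classification for mobile robots using energy features

AI summary: Evaluates energy-derived features as a standalone or supplementary modality for surface classification, comparing several deep learning architectures on three datasets. CNNs perform best; energy features alone reach 85-90% accuracy versus 96-99% when combined with inertial features, and adding energy features to inertial data yields a consistent 1-2% accuracy gain.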

Abstract

The energy-based method remains a comparatively underexamined approach for surface classification in mobile robotics, despite promising results in constrained environments. This study evaluated the viability of using energy-derived features as either a standalone classification modality or as supplementary input to inertial data. A comprehensive evaluation was conducted across three publicly available datasets, comparing the performance of modern deep learning architectures including recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models, under automated hyperparameter tuning and input sequence length optimization. The models achieved higher accuracy than previously reported values on all evaluated datasets, with the convolutional neural network yielding the highest overall performance. When relying exclusively on energy-based features, the models attained classification accuracies in the range of 85-90%, approximately 5-10% lower than those achieved when combined with inertial features (96-99%). Augmenting inertial data with energy features resulted in a consistent mean accuracy improvement of 1-2%. These findings indicate that classifiers relying solely on energy features offer sufficient accuracy for standalone deployment, while also providing a consistent gain when used in combination with other sensing modalities.

2606.18697 2026-06-18 cs.LG cs.CR cs.RO New submission 85%

Stealthy World Model Manipulation via Data Poisoning

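Yibin Hu, Xiaolin Sun, Zizhan Zheng

Affiliations: Department of Computer Science

Topic match (robot learning): data-poisoning attacks on world models that degrade planning

AI summary: Proposes SWAAP, a two-stage data-poisoning framework that manipulates learned world models (bilevel optimization to find a harmful target model, then stealth-constrained gradient matching to realize it), causing substantial degradation in planning performance while evading the evaluated non-adaptive defenses.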

Comments 41 pages, 8 figures, 11 tables. Submitted to NeurIPS 2026

Abstract

Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.

2606.18680 2026-06-18 cs.RO New submission 85%

High-Degree-of-Freedom Lightweight Bioinspired Leg for Enhanced Mobility in Small Robots

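Haoqi Han, Yifei Yu, Jiaming Zhang, Xinru Cui, Linxi Feng, Hesheng Wang

Affiliations: Shanghai Jiao Tong University; Shanghai University of Electric Power

Topic match (robot learning): design of a high-degree-of-freedom bioinspired leg for micro robots

AI summary: Addresses the limited leg degrees of freedom of micro robots with a four-degree-of-freedom parallel leg mechanism whose concentric design simplifies the kinematics. The leg is lightweight (18.9 g) with a large workspace (over 22255 mm³), improving motion flexibility.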

Journal ref 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract

In microrobotics, enhancing locomotion capabilities by increasing the degrees of freedom (DoF) of leg mechanisms under severe spatial constraints remains a significant challenge. Inspired by insect locomotion, this paper presents a novel micro-scale parallel leg mechanism with four degrees of freedom, and systematically analyzes its mechanical design, electrical system, and kinematics. The design incorporates two spherical five-bar linkages to achieve spatial motion within a parallel four-bar configuration. Furthermore, a concentric design strategy is employed to simplify the analytical solution of the leg kinematics. Due to the parallel system architecture, all actuators are located on the main body, substantially reducing the equivalent inertia of moving parts compared to traditional high-DoF leg structures. The total mass of the system is only 18.9 g, with an end-effector output force of approximately 0.5 N and a workspace exceeding 22255 mm³. Experimental results demonstrate that the proposed single-leg mechanism achieves excellent motion flexibility, highlighting its potential for micro bio-inspired robotics.

2606.18646 2026-06-18 cs.RO New submission 85%

A Scalable Embodied Intelligence Platform for Seamless Real-to-Sim-to-Real Transfer of Household Mobile Manipulation Tasks

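Kui Yang, Xianlei Long, Haoxuan Li, Yan Ding, Chao Chen

Affiliations: School of Computer Science, Chongqing University; R&D Department, Lumos Robotics Technology (Suzhou) Co., Ltd

Topic match (robot learning): a real-to-sim-to-real transfer platform for household mobile manipulation tasks

AI summary: Proposes the BestMan platform, which uses automated scene generation, simulation-guided task formalization, and hardware-agnostic middleware to address scene reconstruction, strategy evaluation, and deployment compatibility in real-to-sim-to-real transfer for household mobile manipulation.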

Comments CCF Transactions on Pervasive Computing and Interaction

Abstract

Mobile manipulation is a fundamental capability in embodied intelligence robotics. The growing demand for robust and generalizable manipulation in unstructured household environments has driven rapid progress in embodied intelligence platforms. However, achieving seamless transfer across the real-to-sim-to-real cycle faces three key challenges: costly reconstruction of high-fidelity simulation scenes, the complexity of systematic strategy evaluation in simulation, and incompatible real-world deployments. To address these challenges, we develop BestMan, a scalable and seamless real-to-sim-to-real platform that bridges the gap between simulation and the real world, enabling effective strategy development, integration, and deployment for household mobile manipulation. Specifically, we design a novel Automated Scene Generation (ASG) module to reconstruct realistic simulations from real observations. Then, we propose a simulation-guided task formalization and skill learning architecture that supports the flexible integration and large-scale evaluation of hybrid skill strategies in simulation. Finally, to enhance real-world scalability, we develop a Hardware-agnostic and Unified Middleware (HUM) to ensure seamless and compatible sim-to-real transfer across heterogeneous mobile manipulators for real deployments. Experimental results demonstrate the superior performance of our proposed platform in establishing standardized benchmarks and facilitating promising research in the field of mobile manipulation.

2606.18625 2026-06-18 cs.RO New submission 85%

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

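Xiaowen Hu, Linqi Ye, Yudi Zhu, Chenyue Shao, Rankun Li, Qingdu Li, Yan Peng

Affiliations: Institute of Artificial Intelligence, Shanghai University; Institute of Machine Intelligence, University of Shanghai for Science and Technology

Topic match (robot learning): combining the SLIP model with reinforcement learning for agile jumping

AI summary: Proposes the SRL framework, which fuses the physically grounded baseline of the SLIP model with the adaptability of reinforcement learning, combining feedforward control signals with real-time feedback to optimize robotic jumping with much less training time and accurate tracking.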

Comments 17 pages, 12 figures

Abstract

Robotic jumping is pivotal in applications such as search and rescue and logistics, where crossing obstacles and enhancing mobility efficiency are critical. The Spring-Loaded Inverted Pendulum (SLIP) model leverages simplified spring-mass dynamics that naturally encode biologically plausible hopping motions, yet its performance degrades on irregular terrain due to idealized assumptions regarding contact and joint dynamics. Meanwhile, Reinforcement Learning (RL) can adapt to diverse and complex environments but often requires extensive data from unguided exploration. The complementary strengths of SLIP's physically grounded baseline and RL's adaptive capabilities motivate a hybrid framework that overcomes these individual limitations. We therefore propose Spring-loaded Reinforcement Learning (SRL), which integrates SLIP-based feedforward control signals with RL-driven real-time feedback, enabling continuous optimization of robotic jumping. Experimental results demonstrate that SRL can achieve more stable jumps with much less training time than the baseline method, maintaining an average position tracking error below 0.1 m and velocity tracking errors within ±3% of the target values. Through bipedal and quadrupedal simulations of ground and stair jumping, as well as sim-to-sim and sim-to-real validations, SRL exhibits robust adaptability to various task requirements and environmental complexities, underscoring its potential for real-world deployment.
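
One simple way to combine a model-based feedforward signal with learned feedback is to add a bounded correction to the force the SLIP spring would produce. The sketch below shows that pattern. The additive, clipped form is an assumption (the abstract only says the two signals are integrated), and the function names, rest length, stiffness, and residual limit are hypothetical values chosen for illustration.

```python
def slip_feedforward(leg_length, rest_length=1.0, stiffness=2000.0):
    """Spring force along the leg from a SLIP model; zero when the leg is not compressed."""
    compression = rest_length - leg_length
    return stiffness * compression if compression > 0.0 else 0.0

def srl_command(leg_length, residual, residual_limit=100.0):
    """Feedforward SLIP force plus a learned correction clipped to +/- residual_limit.

    `residual` stands in for the output of an RL policy (not modeled here).
    """
    clipped = max(-residual_limit, min(residual_limit, residual))
    return slip_feedforward(leg_length) + clipped

stance_force = srl_command(leg_length=0.9, residual=0.0)    # spring term only
flight_force = srl_command(leg_length=1.1, residual=500.0)  # residual is clipped
```

Bounding the learned term keeps the command close to the physically grounded baseline early in training, which is one plausible reason a hybrid scheme needs less exploration than RL alone.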

2606.18589 2026-06-18 cs.RO New submission 85%

DREAM-Chunk: Reactive Action Chunking with Latent World Model

DREAM-Chunk：基于潜在世界模型的反应式动作分块

Wenxi Chen, Kaidi Zhang, Chi Lin, Zhiyuan Zhang, Yu She, Yuejiang Liu, Raymond A. Yeh, Shaoshuai Mou, Yan Gu

发表机构 * Purdue University（普渡大学）； Stanford University（斯坦福大学）

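Topic match (robot learning): DREAM-Chunk makes action-chunking policies more robust.

AI summary: Proposes DREAM-Chunk, which uses a lightweight latent world model at test time to sample multiple candidate action chunks and select the best one to execute, improving the robustness of action-chunking policies under stochastic dynamics.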


AI中文摘要

动作分块已成为视觉-语言-动作（VLA）模型的常见接口，使得低频策略推理能够驱动高频机器人执行。然而，一旦动作分块被提交，其开环执行在随机动态、硬件执行错误和部分可观测性下可能变得脆弱。我们提出DREAM-Chunk，一种测试时扩展方法，通过轻量级潜在世界模型增强基于分块的策略，无需额外的策略微调。在测试时，DREAM-Chunk采样多个候选动作分块，展开其预测的潜在未来，并从预测状态与观测展开最匹配的分块中选择动作。通过这种方式，DREAM-Chunk利用额外的测试时计算覆盖多个可能的随机未来，并提高长时域分块执行期间的响应性。在Kinetix基准测试中，DREAM-Chunk在增加的动作噪声下提高了鲁棒性，并从更大的候选样本量中受益，尤其是当演示包含纠正行为时。我们进一步在两个机器人平台的四个操作任务和两种VLA策略下，针对各种随机性来源验证了DREAM-Chunk。在仿真和硬件实验中，DREAM-Chunk提高了动作分块策略在随机动态下的鲁棒性。

英文摘要

Action chunking has become a common interface for vision-language-action (VLA) models, enabling low-frequency policy inference to drive high-frequency robot execution. However, once an action chunk is committed, its open-loop execution can be brittle under stochastic dynamics, hardware execution errors, and partial observability. We propose DREAM-Chunk, a test-time scaling method that augments chunking-based policies with a lightweight latent world model, without requiring additional policy fine-tuning. At test time, DREAM-Chunk samples multiple candidate action chunks, rolls out their predicted latent futures, and selects actions from the chunk whose predicted state best matches the observed rollout. In this way, DREAM-Chunk uses additional test-time computation to cover multiple plausible stochastic futures and improve reactivity during long-horizon chunk execution. On the Kinetix benchmark, DREAM-Chunk improves robustness under increasing action noise and benefits from larger candidate sample sizes, especially when demonstrations contain corrective behaviors. We further validate DREAM-Chunk on four manipulation tasks across two robot platforms and two VLA policies under various sources of stochasticity. Across simulation and hardware experiments, DREAM-Chunk improves the robustness of action-chunking policies in stochastic dynamics.
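The test-time selection step (sample candidate chunks, roll out predicted latents, keep the chunk that best matches what was actually observed) can be sketched as below. The linear latent model `z' = A z + B a`, the squared-error matching score, and the names `rollout` and `select_chunk` are assumptions made for illustration; the paper's learned world model and scoring rule may differ.

```python
import numpy as np

def rollout(z0, chunk, A, B):
    """Predict the latent state after every action of one chunk with a linear latent model."""
    states, z = [], z0
    for a in chunk:
        z = A @ z + B @ a
        states.append(z)
    return np.stack(states)

def select_chunk(z0, candidates, observed, A, B):
    """Index of the candidate whose predicted latents best match the observed ones.

    `observed` holds the k latents seen so far; only the first k predicted
    steps of each candidate are compared.
    """
    k = len(observed)
    errors = [np.sum((rollout(z0, c, A, B)[:k] - observed) ** 2) for c in candidates]
    return int(np.argmin(errors))

if __name__ == "__main__":
    A, B, z0 = np.eye(2), np.eye(2), np.zeros(2)
    candidates = [np.ones((3, 2)), -np.ones((3, 2))]
    observed = rollout(z0, candidates[1], A, B)[:2]
    print(select_chunk(z0, candidates, observed, A, B))
```

The remaining actions of the selected chunk would then be executed until the next comparison.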


2606.19161 2026-06-18 cs.RO 新提交 80%

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

HT-Bench：基于自我中心视觉的灵巧全手触觉表示基准与学习

Yuzhe Huang, Jiaping Wu, Jiaming Jiang, Hezhe Lin, Aikebaier Aierken, Yunlong Wang, Kun Cheng, Ziyuan Jiao, Yuanxin Zhong

发表机构 * Beihang University（北航）； Rimbot ； BUPT（北邮）； ShanghaiTech University（上海科技大学）； Tsinghua University（清华大学）； CAS（中国科学院）

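Topic match (robot learning): a tactile-representation benchmark for learning dexterous robot manipulation.

AI summary: Proposes the HT-Bench multi-task benchmark and the HandTouch encoder; using large-scale egocentric vision paired with full-hand tactile data, it validates the tactile representations on tasks including tactile similarity retrieval, masked inpainting, and vision-to-tactile synthesis.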

Comments 9 pages, 4 figures


AI中文摘要

由于触觉传感器设计、数据格式和机器人形态的多样性，为机器人操作中的触觉表示学习建立通用基准仍然具有挑战性。我们并未试图建立这样的基准，而是探索了一个可扩展且有前景的未来发展方向：将自我中心视觉与全手触觉数据配对。为此，我们引入了HT-Bench，一个用于灵巧全手触觉感知的大规模多任务基准，包含在226个任务中收集的1000万RGB帧和780万触觉帧。HT-Bench从三个关键角度评估触觉表示：它们是否编码有意义的接触几何、是否能够将触觉观测与视觉信息对齐、以及是否能够泛化到未见任务。为评估这些能力，HT-Bench包含四个任务：细粒度触觉相似性检索、掩码触觉修复、视觉到触觉合成以及多模态触觉帧预测。我们进一步提出了HandTouch，一个矢量量化视觉-触觉编码器，通过渐进的空间、跨模态和时间训练学习触觉表示。在HT-Bench上，HandTouch始终优于代表性的触觉编码器基线，将细粒度触觉相似性检索的Recall@5从74.65%提高到85.23%，将掩码触觉修复的RMSE从0.022降低到0.010，并将视觉到触觉合成的OOD cIoU从0.628提高到0.705。这些结果证明了HandTouch的有效性，并表明大规模自我中心全手触觉数据为评估和推进灵巧操作中的触觉表示学习提供了可扩展的基础。

英文摘要

Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than seeking to establish such a benchmark, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce HT-Bench, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representations from three key perspectives: whether they encode meaningful contact geometry, whether they can align tactile observations with visual information, and whether they generalize to unseen tasks. To assess these capabilities, HT-Bench includes four tasks: fine-grained tactile similarity retrieval, masked tactile inpainting, vision-to-tactile synthesis, and multimodal tactile frame prediction. We further propose HandTouch, a vector-quantized vision-tactile encoder that learns tactile representations through progressive spatial, cross-modal, and temporal training. Across HT-Bench, HandTouch consistently outperforms representative tactile encoder baselines, improving Recall@5 on fine-grained tactile similarity retrieval from 74.65% to 85.23%, reducing RMSE on masked tactile inpainting from 0.022 to 0.010, and increasing OOD cIoU on vision-to-tactile synthesis from 0.628 to 0.705. These results demonstrate the effectiveness of HandTouch and suggest that large-scale egocentric full-hand tactile data provides a scalable basis for evaluating and advancing tactile representation learning in dexterous manipulation.
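For readers unfamiliar with the retrieval metric quoted above, here is a minimal sketch of Recall@k over embedding vectors. The abstract does not spell out the benchmark's exact protocol, so the cosine-similarity ranking and the "at least one same-label item in the top k" criterion are assumptions, as is the function name `recall_at_k`.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=5):
    """Fraction of queries with at least one same-label gallery item among the top-k cosine neighbours."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    top_k = np.argsort(-(q @ g.T), axis=1)[:, :k]          # indices of the k most similar items
    hits = (gallery_labels[top_k] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())

if __name__ == "__main__":
    gallery = np.eye(3)
    gallery_labels = np.array([0, 1, 2])
    queries = np.array([[1.0, 0.1, 0.0], [0.0, 0.2, 1.0]])
    query_labels = np.array([0, 1])
    print(recall_at_k(queries, gallery, query_labels, gallery_labels, k=1))
    print(recall_at_k(queries, gallery, query_labels, gallery_labels, k=2))
```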


2606.19088 2026-06-18 cs.RO 新提交 80%

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

ReSiReg：面向语言条件机器人任务的空间一致语义

Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, Gerald Steinbauer-Wagner

发表机构 * Graz University of Technology, Institute of Software Engineering and Artificial Intelligence（格拉茨技术大学，软件工程与人工智能研究所）； University of Applied Sciences Technikum Wien, Department of Industrial Engineering（维也纳应用科技大学，工业工程系）； University of Alicante, Department of Computer Technology（阿利坎特大学，计算机技术系）； University of Natural Resources and Life Sciences, Institute for Integrative Nature Conservation Research（自然资源与生命科学大学，整合自然保护研究所）

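Topic match (robot learning): spatially consistent semantics for language-conditioned robotic tasks.

AI summary: Proposes ReSiReg, which reconstructs features from spatially consistent VLM intermediates to improve dense language-grounded retrieval, yields better spatial consistency on OVSS and 3D mapping, and comes with a compact 25M dense VLM.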


AI中文摘要

视觉-语言模型（VLM）使机器人能够遵循开放语言指令。然而，密集的VLM嵌入已被证明存在噪声且缺乏空间一致性。这对于需要同时推理语义和3D空间的机器人应用来说是有问题的。我们研究了近期VLM的空间结构，并提出了ReSiReg，一种特征重构方法，利用空间一致的VLM中间特征来改善密集语言接地检索。ReSiReg将中间特征聚类为视觉原型，推导其语言描述符，并将每个补丁重构为原型级语言嵌入的软混合。我们在OVSS和3D映射上跨骨干网络进行定量评估，并在真实世界操作场景中进行定性评估。定量结果显示密集检索得到改善；操作场景显示出更空间一致的目标激活。我们进一步为机器人应用提供了一个紧凑的25M密集VLM，远小于ViT-B基线且具有竞争力。可从 https://resireg.github.io 获取。

英文摘要

Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have been shown to be noisy and to lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a soft mixture of prototype-level language embeddings. We evaluate quantitatively on OVSS and 3D mapping across backbones, and qualitatively in real-world manipulation scenes. Quantitative results show improved dense retrieval; manipulation scenes show more spatially consistent target activations. We further provide a compact 25M dense VLM for robotic applications, substantially smaller than and competitive with ViT-B baselines. Available at https://resireg.github.io
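The reconstruction step, in which each patch becomes a soft mixture of prototype-level language embeddings, can be sketched as follows. The prototypes are assumed to be given already (for example from k-means over intermediate features), and the cosine-similarity softmax with a `temperature` parameter is an illustrative choice rather than the paper's stated weighting.

```python
import numpy as np

def reconstruct_patches(patch_feats, prototypes, proto_text_emb, temperature=0.1):
    """Rebuild each patch embedding as a softmax-weighted mix of prototype text embeddings.

    patch_feats:    (num_patches, d) intermediate visual features
    prototypes:     (num_protos, d) visual prototype centres
    proto_text_emb: (num_protos, e) language descriptor embedding of each prototype
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (p @ c.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ proto_text_emb, weights

if __name__ == "__main__":
    patches = np.array([[1.0, 0.0], [0.0, 1.0]])
    recon, w = reconstruct_patches(patches, np.eye(2), np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
    print(recon.shape, w.argmax(axis=1))
```

Because neighbouring patches that fall in the same prototype share nearly the same mixture weights, their reconstructed embeddings are close, which is the intuition behind the improved spatial consistency.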


2606.19067 2026-06-18 cs.RO cs.CV 新提交 80%

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

传感器配置至关重要：四足机器人多模态SLAM的系统评估

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

发表机构 * Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT)（卡尔斯鲁厄理工学院机器智能与机器人实验室）； Institute for Intelligent Systems, Esslingen University of Applied Sciences（埃斯林根应用科学大学智能系统研究所）； Department of Computer Science, University of Freiburg（弗赖堡大学计算机科学系）

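Topic match (robot learning): evaluation of multimodal SLAM on quadruped robots.

AI summary: Addresses sensor configuration under quadruped locomotion by systematically evaluating visual, visual-inertial, and LiDAR-visual-inertial SLAM methods; stereo cameras and global shutters markedly improve localization robustness, whereas standard inertial integration can degrade primarily vision-based frameworks under harsh legged locomotion.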


AI中文摘要

四足机器人在不同环境中的自主导航从根本上依赖于鲁棒的同步定位与地图构建（SLAM）。虽然视觉-惯性SLAM在轮式、手持和空中平台上已经成熟，但在腿部运动的剧烈动态下，硬件级传感器配置如何影响性能仍存在关键的评估空白。四足机器人引入了独特的具身感知挑战，包括足部冲击、高频机械振动和快速角旋转，这些都会降低标准感知管道的性能。为了填补这一空白，我们使用在ANYmal D四足机器人上记录的GrandTour数据集，对最先进的视觉、视觉-惯性和LiDAR-视觉-惯性SLAM方法进行了系统评估。我们分离并量化了相机模态、快门技术和惯性传感器层级的影响，分析了它们在定位精度、算法鲁棒性和计算资源利用方面的权衡。我们的实证结果表明，硬件选择对系统鲁棒性有显著影响：立体配置始终优于单目和RGB-D模态，全局快门相机相比卷帘快门相机显著减少了运动引起的跟踪失败，并且关键的是，在剧烈的腿部运动下，标准惯性集成可能降低主要基于视觉的框架的性能。这些见解还为定制传感器负载提供了具体的设计指南，以实现敏捷腿部系统的可靠感知。

英文摘要

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.


2606.18836 2026-06-18 cs.HC cs.AI 新提交 80%

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

通过先前协作的片段记忆改善城市搜索与救援中的人机团队合作

Taewoon Kim, Emma van Zoelen, Mark Neerincx

发表机构 * HumemAI, The Netherlands（荷兰HumemAI）； Vrije Universiteit Amsterdam, The Netherlands（荷兰阿姆斯特丹自由大学）； TNO, The Netherlands（荷兰TNO）

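Topic match (robot learning): human-robot teamwork, with episodic memory improving rescue performance.

AI summary: Stores historical collaboration patterns as knowledge-graph episodic memories and uses graph representation learning to select a representative memory for initializing the robot; in the MATRX USAR environment this raises rescue success from 25.7% to 41.3% and cuts average task time by 283 seconds.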


AI中文摘要

有效的人机团队合作要求机器人从交互开始就适应伙伴、情境和任务动态。在MATRX城市搜索与救援（USAR）环境中，人们可以通过聊天和反思界面将他们在团队合作中发现的协作模式（CPs）外部化。我们研究机器人是否可以利用这种先前的团队经验，在未来的交互中成为更好的队友。为此，我们将历史CPs表示为知识图谱片段记忆，并使用具有节点分类目标的图表示学习来识别一个代表性且有效的记忆以供重用。然后，在新的协作片段开始之前，我们用该记忆初始化机器人。在20名参与者和160轮次观察中，用单个自动选择的先前CP初始化机器人将救援成功率从25.7%提高到41.3%，并将平均任务时间减少283秒。最强的提升出现在交互开始时，表明可重用的片段记忆可以帮助机器人以更有效的任务知识进入协作，并支持更顺畅的早期团队合作。

英文摘要

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.


2606.18786 2026-06-18 cs.AI 新提交 80%

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

R2D-RL：用于多智能体强化学习的RoboCup 2D足球环境

Haobin Qin, Baofeng Zhang, Hidehisa Akiyama, Keisuke Fujii

发表机构 * Graduate School of Informatics, Nagoya University（名古屋大学信息学研究科）； School of Information and Data Sciences, Nagasaki University（长崎大学信息与数据科学学院）

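Topic match (robot learning): a multi-agent reinforcement learning environment for robot soccer.

AI summary: Proposes the R2D-RL environment, which connects RCSS2D to a Python MARL interface through shared-memory communication and cycle-level synchronization, supports full-field and scenario-based training, and offers configurable opponents, discrete and hybrid action spaces, EPV-based reward shaping, and parallel execution.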

Comments Code is available at: https://github.com/open-starlab/R2DRL


AI中文摘要

机器人足球是多智能体强化学习的一个具有挑战性的测试平台，因为它结合了部分可观测性、合作与对抗交互、稀疏奖励以及长期战术行为。RoboCup 2D足球仿真（RCSS2D）提供了一个成熟的机器人足球平台，但其面向竞争的服务器-客户端架构难以直接用于现代基于Python的MARL工作流。我们引入了R2D-RL，这是一个强化学习环境，通过共享内存通信和周期级同步将RCSS2D和基于HELIOS的玩家客户端连接到Python MARL接口。R2D-RL支持全场和基于场景的训练，具有可配置的对手、基础离散和混合参数化动作空间、动作掩码、基于预期控球值（EPV）的奖励塑造以及并行执行。我们提供了前场场景和11对11全场基准测试，以及基线结果。

英文摘要

Robot soccer is a challenging testbed for multi-agent reinforcement learning because it combines partial observability, cooperative and adversarial interaction, sparse rewards, and long-horizon tactical behavior. RoboCup 2D Soccer Simulation (RCSS2D) provides a mature robot-soccer platform, but its competition-oriented server-client architecture is difficult to use directly with modern Python-based MARL workflows. We introduce R2D-RL, a reinforcement learning environment that connects RCSS2D and HELIOS-based player clients to a Python MARL interface through shared-memory communication and cycle-level synchronization. R2D-RL supports full-field and scenario-based training with configurable opponents, Base discrete and Hybrid parameterized action spaces, action masks, expected possession value (EPV)-based reward shaping, and parallel execution. We provide front-goal scenarios and an 11-vs-11 full-field benchmark, together with baseline results.
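Action masks, one of the features listed above, are typically consumed on the policy side by zeroing out the probability of unavailable actions. The sketch below shows that generic pattern with NumPy; it does not use or imitate the R2D-RL API, and the function name `masked_action_probs` is hypothetical. It assumes at least one action is available.

```python
import numpy as np

def masked_action_probs(logits, mask):
    """Softmax over logits with unavailable actions (mask == 0) forced to probability zero."""
    masked = np.where(mask.astype(bool), logits, -np.inf)
    masked = masked - masked.max()       # stabilise; requires at least one available action
    exp = np.exp(masked)                 # exp(-inf) evaluates to 0.0
    return exp / exp.sum()

if __name__ == "__main__":
    logits = np.array([1.0, 2.0, 3.0])
    mask = np.array([1, 0, 1])           # e.g. the second action is not available this cycle
    print(masked_action_probs(logits, mask))
```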


2606.18516 2026-06-18 cs.RO 新提交 80%

Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets

动态杂乱环境下的任务分配与运动规划：基于CBBA与凸集图

Matthew D. Osburn, Cameron K. Peterson, John L. Salmon

发表机构 * Electrical and Computer Engineering（电气与计算机工程系）； Mechanical Engineering（机械工程系）

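Topic match (robot learning): multi-agent task allocation and motion planning.

AI summary: For multi-agent task planning in dynamic, cluttered environments, combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation, yielding safe, time-efficient trajectories and coordinated task assignment.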

Comments 15 pages single column, 10 figures, AIAA-Scitech 2027 Submission


AI中文摘要

在杂乱、动态环境中的多智能体任务规划需要在分配任务给智能体的同时，确定通过环境的安全、时间高效的轨迹。当任务是动态的（例如会合目标）时，分配决策不仅取决于哪个智能体最适合某项任务，还取决于该任务何时何地可以到达。本文提出了一个解决该问题的方法，该方法将凸集图（GCS）用于轨迹优化，与共识捆绑算法（CBBA）用于分布式任务分配相结合。在我们的方法中，GCS通过使用时间扩展（3D+时间）配置空间找到通过动态环境的最优轨迹。同时，CBBA协调跨智能体的任务分配，使得在移动环境中能够做出明智的决策。然后，我们连接分配和规划，使智能体能够在3D+时间配置空间中避免碰撞，并提供准确的任务完成时间估计。我们在具有静态和动态任务的模拟杂乱环境中展示了我们方法的有效性。

英文摘要

Multi-agent task planning in cluttered, dynamic environments requires assigning tasks to agents while simultaneously determining safe, time-efficient trajectories through the environment. When tasks are dynamic, such as rendezvous objectives, allocation decisions depend not only on which agent is best suited for a task, but also on when and where that task can be reached. This paper presents a solution to this problem, which combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation. In our approach, GCS finds optimal trajectories through dynamic environments using a time-extended (3D+time) configuration space. At the same time, CBBA coordinates task assignments across agents, enabling informed decision-making in a moving environment. We then connect allocation and planning to allow the agents to avoid collisions in the 3D+time configuration space and provide accurate time estimates for task completion. We demonstrate the effectiveness of our approach in simulated cluttered environments with static and dynamic tasks.
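Why allocation depends on when and where a moving task can be reached can be illustrated with a deliberately simplified stand-in: the earliest straight-line intercept time in an obstacle-free space. This is neither GCS nor CBBA; it replaces the trajectory optimizer with a closed-form constant-speed intercept and the consensus auction with a single centralized minimum, and all names are hypothetical.

```python
import numpy as np

def earliest_intercept_time(agent_pos, agent_speed, target_pos, target_vel):
    """Smallest t >= 0 with |target_pos + target_vel * t - agent_pos| = agent_speed * t, or None."""
    d = np.asarray(target_pos, float) - np.asarray(agent_pos, float)
    v = np.asarray(target_vel, float)
    a = v @ v - agent_speed ** 2
    b = 2.0 * (d @ v)
    c = d @ d
    if abs(a) < 1e-12:                       # equal speeds: the quadratic degenerates
        if abs(b) < 1e-12:
            return 0.0 if c < 1e-12 else None
        t = -c / b
        return t if t >= 0 else None
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None
    roots = [(-b - np.sqrt(disc)) / (2.0 * a), (-b + np.sqrt(disc)) / (2.0 * a)]
    valid = [t for t in roots if t >= 0]
    return min(valid) if valid else None

def earliest_agent(agents, target_pos, target_vel):
    """(index, time) of the agent that can reach the moving task soonest, or None if none can."""
    times = [earliest_intercept_time(p, s, target_pos, target_vel) for p, s in agents]
    feasible = [(t, i) for i, t in enumerate(times) if t is not None]
    if not feasible:
        return None
    t, i = min(feasible)
    return i, t

if __name__ == "__main__":
    agents = [((0.0, 0.0), 2.0), ((20.0, 0.0), 1.0)]   # (position, max speed)
    print(earliest_agent(agents, target_pos=(10.0, 0.0), target_vel=(1.0, 0.0)))
```

In this toy case the slower agent wins because the target is moving toward it, which is the kind of effect a purely distance-based bid would miss.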


2606.18861 2026-06-18 cs.CV cs.AI 新提交 75%

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

基于可微关节推理与能量一致性验证的RGB-D序列URDF合成

Xinze Zhang

发表机构 * University of Southern California（南加州大学）

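Topic match (robot learning): reconstructing simulation-ready digital twins for robotics.

AI summary: Proposes the KinemaForge pipeline, which jointly estimates part shape, joint topology, and joint parameters from RGB-D sequences via differentiable joint inference and energy-consistent verification, substantially reducing joint-axis error and simulation drift.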


AI中文摘要

从传感器观测重建可仿真的铰接物体数字孪生仍受两个持续存在的差距制约：(i) 部件级几何重建与运动学参数估计分离，(ii) 恢复的模型常违反能量守恒等基本动态不变量，导致URDF在物理仿真器中重放时出现漂移。我们提出KinemaForge，一种约束驱动管道，从短RGB-D序列联合推断部件级形状、关节拓扑和关节参数，并通过基于可微刚体动力学构建的能量一致性验证器验证结果。该管道引入三个组件：将关节-部件关联编码为软边的运动学约束图；通过Featherstone铰接体算法从渲染观测反向传播到关节参数的可微螺旋轴求解器；以及惩罚重建模型非物理自由响应的能量残差损失。在五个PartNet-Mobility类别和一个内部RGB-D基准上，KinemaForge将平均关节轴误差从最强几何基线(PARIS)的4.52度降至2.83度(-37.4%)，从基于交互的Ditto基线的5.30度降至2.83度(-46.6%)，在50秒滚动中长时仿真漂移比PARIS降低64%，初步评估中闭环操作成功率比Ditto提高14.6个百分点。代码和重建数据将在接收后发布。

英文摘要

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.
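The relative reductions quoted in the abstract follow directly from the reported joint-axis errors; the short check below reproduces that arithmetic. The helper name `relative_change` is illustrative.

```python
def relative_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before

if __name__ == "__main__":
    # Joint-axis error in degrees, as reported in the abstract.
    print(round(relative_change(4.52, 2.83), 1))   # versus PARIS
    print(round(relative_change(5.30, 2.83), 1))   # versus Ditto
```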


2606.18537 2026-06-18 cs.LG 新提交 75%

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

入乡随俗：从异构智能体学习通用行为

Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung

发表机构 * University of Washington（华盛顿大学）； NVIDIA（英伟达）

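Topic match (robot learning): learning universal behaviors from heterogeneous agents.

AI summary: Proposes GRID, which extracts a general reward from heterogeneous demonstrators pursuing different goals and trains a generalist agent on it to acquire universal environmental competencies, avoiding mode-averaging bias and making fine-tuning to downstream tasks more efficient.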


AI中文摘要

人类通常通过观察他人来获取新技能，因为观察到的行为隐含地揭示了如何在环境中行动。然而，从异构群体中获得的观察会引入冲突的行为信号，使得难以确定哪些行为值得模仿。我们通过通用奖励推断与解耦（GRID）来解决这一挑战，这是一种从追求不同目标的异构示范者群体中提取普遍有用行为的社会学习方法。GRID将每个智能体的奖励函数分解为通用奖励（捕捉所有智能体共享的行为）和特定奖励（捕捉个体偏好和目标）。仅基于通用奖励进行训练提供了一种通用预训练的新范式。它产生了一个通用智能体，该智能体内化了通用的环境能力，如安全性和基本任务熟练度，而不会出现困扰标准从示范学习技术的模式平均偏差。这个通用智能体作为微调到下游任务（包括训练中未见过的偏好）的优越先验。在合成基函数分解、多智能体Craftax和连续自动驾驶模拟器（Highway-Env）上的实验证实，GRID以语义上有意义的方式成功解耦了奖励结构，优于标准的从示范学习基线，并实现了更高效和稳定的特化。

英文摘要

Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it difficult to determine which behaviors are worth imitating. We address this challenge with General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from a heterogeneous population of demonstrators pursuing different goals. GRID decomposes per-agent reward functions into a general reward, capturing behaviors shared across all agents, and specific rewards, capturing individual preferences and objectives. Training exclusively on the general reward provides a new paradigm of generalist pretraining. It yields a generalist agent that internalizes universal environmental competencies, such as safety and basic task proficiency, without the mode-averaging bias that afflicts standard learning from demonstration techniques. This generalist serves as a superior prior for fine-tuning to downstream tasks, including preferences unseen during training. Experiments across a synthetic basis function decomposition, multi-agent Craftax, and a continuous autonomous driving simulator (Highway-Env) confirm that GRID successfully disentangles reward structure in a semantically meaningful way, outperforms standard learning from demonstration baselines, and enables more efficient and stable specialization.
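The decomposition into a general reward and per-agent specific rewards can be pictured with a toy tabular example. The split used here (general = mean over agents, specific = residual, so the specific parts sum to zero) is only one simple identification chosen for illustration; it is an assumption and not the inference procedure GRID actually uses.

```python
import numpy as np

def decompose_rewards(per_agent_rewards):
    """Split a (num_agents, num_states) reward table into a shared part and per-agent residuals."""
    r = np.asarray(per_agent_rewards, float)
    general = r.mean(axis=0)        # component shared by every agent
    specific = r - general          # what remains is agent-specific
    return general, specific

if __name__ == "__main__":
    rewards = [[1.0, 2.0],          # agent 0 dislikes state 0
               [3.0, 2.0]]          # agent 1 prefers state 0
    general, specific = decompose_rewards(rewards)
    print(general, specific)
```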


2606.18519 2026-06-18 cs.RO cs.AI 新提交 75%

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

如您所愿：利用LLM在精准农业中进行形式化验证的任务规划

Marcos Abel Zuzuárregui, Stefano Carpin

发表机构 * University of California, Merced（加州大学默塞德分校）

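Topic match (robot learning): LLM mission planning for precision-agriculture robots.

AI summary: To address the ambiguity of natural language, proposes an LLM mission-planning system with feedback loops based on linear temporal logic (LTL), assigning specification generation and verification to two different LLMs to make precision-agriculture mission planning more reliable.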

Journal ref Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)


AI中文摘要

尽管机器人系统现已商业化并部署于各行各业，但许多系统高度专业化，通常需要高级技能才能操作并确保其按指令执行。为缓解这一问题，我们近期引入了一个任务规划器，利用大语言模型（LLM）根据自然语言描述的任务描述合成精准农业中的任务计划。虽然该系统表现出色，但也存在自然语言固有的歧义性。本文通过引入多个基于线性时序逻辑（LTL）的反馈循环来扩展我们的系统，以确保任务规划系统满足用户制定的规范，同时仍使用自然语言。为减轻潜在偏差，我们使用两个不同的商业LLM分别负责规范生成和验证子任务。通过大量实验，我们强调了将任务验证集成到全自主流水线中的优势与局限，特别是关于LLM生成有效LTL公式的能力，并展示了我们的实现如何应对和解决这些挑战。

英文摘要

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.
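To make the verification idea concrete, the sketch below checks one common LTL pattern, G(trigger -> F response), against a finite mission trace. It is a hand-written checker for this single pattern under finite-trace semantics, not the paper's verification pipeline, and the proposition names `at_tree` and `sample_taken` are hypothetical.

```python
def always_eventually_responds(trace, trigger, response):
    """Check G(trigger -> F response) on a finite trace of sets of atomic propositions.

    Every step containing `trigger` must be followed (at the same step or later)
    by a step containing `response`.
    """
    pending = False
    for step in trace:
        if trigger in step:
            pending = True
        if response in step:
            pending = False
    return not pending

if __name__ == "__main__":
    good_plan = [{"at_tree"}, {"sample_taken"}, set()]
    bad_plan = [{"at_tree"}, set()]
    print(always_eventually_responds(good_plan, "at_tree", "sample_taken"))
    print(always_eventually_responds(bad_plan, "at_tree", "sample_taken"))
```

A failed check of this kind is the sort of signal a feedback loop could return to the planning LLM for another attempt.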


2606.19297 2026-06-18 cs.LG cs.RO 新提交 70%

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

VLA 甚至知道基础知识吗？衡量视觉-语言-动作模型中的常识和世界知识保留

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro

发表机构 * CogAI Lab（CogAI实验室）； FusionBrain Lab（FusionBrain实验室）； IAI MSU（MSU人工智能研究所）； Lomonosov MSU（莫斯科国立罗蒙诺索夫大学）； NUST MISIS ； Applied AI Institute（应用人工智能研究所）； HSE University（俄罗斯高等经济大学）； Generalizable AI Systems（可泛化人工智能系统）； ISP RAS（俄罗斯科学院系统编程研究所）； MIRAI ； Domain-specific NLP Group（领域特定自然语言处理小组）

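Topic match (robot learning): evaluating the commonsense knowledge of VLA models on robot tasks.

AI summary: Proposes the Act2Answer protocol, which evaluates knowledge retention in VLA models by having them answer through actions; the models do well on simple concepts but show gaps on richer semantic categories, and VQA co-training is associated with better knowledge retention.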

Comments Project page: https://tttonyalpha.github.io/act2answer/


AI中文摘要

具身视觉-语言-动作（VLA）模型通常通过在机器人数据上微调强大的预训练 VLM 获得，但目前尚不清楚它们在适应后保留了多少常识和事实知识。在知识敏感任务上的失败是模糊的，混淆了知识缺失与低级控制泛化能力差。我们引入 Act2Answer，一种轻量级协议，通过要求智能体通过动作来回答，将 VLM 知识基准适配到 VLA 评估。每个问题变成一个简短的桌面场景，其中智能体执行单个物体放置动作以选择候选答案，从而产生动作基础的、减少控制混淆的成功率。我们在不同的常识和世界知识类别中策划了这样的环境测试套件，并引入逐层意图探测以定位 VLM 骨干和动作头中与答案相关的信息。在对 7 个 VLA 模型和 9 个 VLM 基线的大规模研究中，我们系统地跨类别对模型进行排名，发现 VLA 在简单概念上表现稳健，但在更丰富的语义类别上相对于其源 VLM 显示出更大的差距，VQA 联合训练与更好的知识保留相关，并且答案相关信号在 VLA 中间层达到峰值，但在上层减弱。Act2Answer 可在 https://tttonyalpha.github.io/act2answer/ 获取。

英文摘要

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.


2606.18514 2026-06-18 cs.RO cs.LG 新提交 70%

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

N(CO)$^2$: 基于机会约束的神经组合优化求解随机定向问题

Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin

发表机构 * Department of Computer Science and Engineering, University of California, Merced（加州大学默塞德分校计算机科学与工程系）

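Topic match (robot learning): neural combinatorial optimization for stochastic orienteering.

AI summary: Proposes the N(CO)$^2$ framework, which uses reinforcement learning to solve the Stochastic Orienteering Problem without hand-crafted heuristics, optimizing path selection under uncertainty with performance competitive with a state-of-the-art MILP.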

Journal ref In Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), 2025


AI中文摘要

神经组合优化（NCO）通过学习启发式，为求解复杂图优化问题提供了一种有前景的替代传统启发式方法的方法。这类问题在自动化领域频繁出现，可用于建模多种应用。虽然NCO在确定性组合优化问题上已被广泛研究，但只有少数工作旨在解决随机组合优化问题。本文提出N(CO)$^2$：基于机会约束的神经组合优化，用于求解随机定向问题（SOP），无需手工设计的启发式。通过集成强化学习（RL）框架，模型在不确定性下优化路径选择，有效平衡探索与利用。实验结果表明，我们的方法在多种SOP实例上具有良好的泛化能力，与最先进的混合整数线性规划（MILP）相比性能具有竞争力。所提方法减少了启发式设计的人力投入，同时在不确定环境中实现自适应和高效的决策。

英文摘要

Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.
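A chance constraint of the kind named in the title bounds the probability that a path's random total cost exceeds the budget. The sketch below estimates that probability by Monte Carlo sampling for one fixed path. The exponential edge-cost model, the sample count, and the function name are assumptions for illustration; the abstract does not specify the paper's noise model or how the constraint is enforced during learning.

```python
import numpy as np

def violates_chance_constraint(mean_costs, budget, alpha, n_samples=20000, seed=0):
    """Estimate P(total path cost > budget) and report whether it exceeds the allowed risk alpha."""
    rng = np.random.default_rng(seed)
    # Assumed noise model: each edge cost is exponential with the given mean.
    samples = rng.exponential(scale=mean_costs, size=(n_samples, len(mean_costs)))
    failure_prob = float((samples.sum(axis=1) > budget).mean())
    return failure_prob > alpha, failure_prob

if __name__ == "__main__":
    path_edge_means = [1.0, 2.0, 1.5]
    print(violates_chance_constraint(path_edge_means, budget=10.0, alpha=0.05))
```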


2606.18308 2026-06-18 cs.LG cs.AI 新提交 70%

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

TRIDENT: 打破混合安全-物理耦合以实现可证明安全的多智能体强化学习

Zijie Meng, Ziwei Li, Yufei Liu, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Miao Zhang

发表机构 * Peking University（北京大学）； Xiamen University（厦门大学）； National Taiwan University（国立台湾大学）； WHU（武汉大学）； THU / Jimei University（清华大学 / 集美大学）

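Topic match (robot learning): a provably safe multi-agent reinforcement learning framework.

AI summary: Targets the coupling among hybrid discrete-continuous actions, training-time safety constraints, and physical dynamics; the TRIDENT framework combines a Richardson-Romberg gradient correction, a Lyapunov-constrained sequential trust-region update, and a physics-informed residual critic to obtain provable convergence guarantees, sharply reducing training-time violations while improving reward.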

Comments 16 pages, 4 figures


AI中文摘要

网络化信息物理系统中的安全协调迫使学习算法同时处理混合离散-连续动作、严格的训练时安全约束和物理支配的动力学。我们证明这三个特征形成了一个有向偏差循环，击败了任何现成模块的朴素组合，并将其形式化为一个三向耦合引理。然后我们引入TRIDENT，这是第一个MARL框架，其三个组件被共同设计以消除每个泄漏：一个将Gumbel-Softmax偏差从O(tau)降低到O(tau^2)的Richardson-Romberg梯度校正，一个强制每次迭代可行性的Lyapunov约束顺序信任域更新，以及一个分解价值而非奖励的物理信息残差评论家。我们证明了以O~(1/sqrt(K))的收敛速率达到约束纳什均衡，以及O(sqrt(K))的累积违规界。在多无人机移动边缘计算、自主交叉口管理和混合SMAC变体上，TRIDENT相比MADDPG减少了95.5%的训练时违规，相比MACPO减少了76.3%，同时相比最强的无约束基线提高了13.5%的奖励。

英文摘要

Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these three features form a directed cycle of biases that defeats any naive composition of off-the-shelf modules, and formalize this as a three-way coupling lemma. We then introduce TRIDENT, the first MARL framework whose three components are co-designed to cancel each leak: a Richardson-Romberg gradient correction reducing Gumbel-Softmax bias from O(tau) to O(tau^2), a Lyapunov-constrained sequential trust-region update enforcing per-iterate feasibility, and a physics-informed residual critic that decomposes value rather than reward. We prove an O~(1/sqrt(K)) convergence rate to a constrained Nash equilibrium and an O(sqrt(K)) cumulative-violation bound. On multi-UAV mobile-edge computing, autonomous intersection management, and a hybrid SMAC variant, TRIDENT cuts training-time violations by 95.5% over MADDPG and 76.3% over MACPO, while improving reward by 13.5% over the strongest unconstrained baseline.
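The drop in bias from O(tau) to O(tau^2) is what textbook Richardson extrapolation delivers: combining estimates at tau and tau/2 cancels the first-order term. The toy estimator below has a known bias a*tau + b*tau^2 so the effect can be checked by hand; whether TRIDENT uses exactly this two-point combination for its Gumbel-Softmax gradients is an assumption, and the constants are arbitrary.

```python
def richardson(estimator, tau):
    """Combine estimates at tau and tau/2 so that the O(tau) bias term cancels."""
    return 2.0 * estimator(tau / 2.0) - estimator(tau)

def biased_estimate(tau, true_value=1.0, a=0.8, b=0.3):
    """Toy estimator whose bias is a*tau + b*tau**2."""
    return true_value + a * tau + b * tau ** 2

if __name__ == "__main__":
    for tau in (0.1, 0.05):
        plain_error = abs(biased_estimate(tau) - 1.0)
        corrected_error = abs(richardson(biased_estimate, tau) - 1.0)
        print(tau, plain_error, corrected_error)
```

Halving tau halves the plain error but quarters the corrected one, which is the signature of a second-order bias.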


2606.19154 2026-06-18 cs.RO 新提交 65%

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Viking Hill数据集：用于森林场景检测与分割的激光雷达-雷达-相机数据集

Vladimír Kubelka, Oleksandr Kotlyar, Unal Artan, Martin Magnusson

发表机构 * Örebro University（奥雷布罗大学）； AASS research centre（AASS研究中心）； Robot Navigation and Perception Lab（机器人导航与感知实验室）

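Topic match (robot learning): data collected on a robot platform for autonomous-navigation perception.

AI summary: Presents a multi-sensor forest dataset that includes co-registered 4D imaging radar, which existing forestry datasets lack; provides MinkowskiUNet baselines for semantic segmentation of radar and lidar point clouds and evaluates how trunk segmentation quality varies with tree size.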

Comments 33 pages, 11 figures


AI中文摘要

在森林冠层下运行的自主机器人需要对树木及周围植被在不同季节条件下进行稳健感知。现有的林业数据集提供带有单棵树标注的激光雷达或相机数据，但均未包含共配准的4D成像雷达——这一模态因其对视觉退化、表面污染和植被遮挡的鲁棒性而日益受到关注。我们介绍了一个由移动机器人收集的多传感器森林数据集，该机器人配备了高分辨率FMCW成像雷达、激光雷达、RGB相机、IMU和RTK-GNSS。该场地在两个不同植被状态的会话中记录，3D立方体标注（包括每棵树的直径估计）为所有三种感知模态提供了共享语义标签。此外，我们提供了使用MinkowskiUNet对雷达和激光雷达点云进行语义分割的基线结果。雷达在主要类别（地面91%，冠层86%）上取得了与激光雷达竞争性的IoU分数，但在几何精细结构（如树干）上落后（56%对74%）。跨模态分析进一步比较了激光雷达和雷达的树干分割与RGB检测模型，而按直径分层的评估揭示了树干分割质量如何随树木尺寸变化。除了分割，共配准的多模态数据和RTK-GNSS辅助参考定位支持冠层下地图构建、定位和传感器融合的研究。数据集和标注工具已公开。

英文摘要

Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per-tree annotations, but none include co-registered 4D imaging radar -- a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi-sensor forest dataset collected by a mobile robot equipped with a high-resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK-GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations -- including per-tree diameter estimates -- provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross-modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter-stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co-registered multi-modal data and RTK-GNSS-aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.
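The per-class IoU scores reported above (for example ground, canopy, trunk) follow the usual intersection-over-union definition on point labels. A minimal sketch, with a hypothetical function name and a made-up two-class example:

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """IoU per class for integer label arrays; NaN for classes absent from both arrays."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious.append(float(inter) / union if union else float("nan"))
    return ious

if __name__ == "__main__":
    pred = np.array([0, 0, 1, 1])      # predicted label per point
    target = np.array([0, 1, 1, 1])    # annotated label per point
    print(per_class_iou(pred, target, num_classes=2))
```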


2606.18315 2026-06-18 cs.LG cs.AI New submission 65%

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

鬼吸引子网络：用于闭环序列生成的盆地结构动力学解码器

Tianyu Wang, Ying Wang, Zhihao Liu, Xi Vincent Wang, Lihui Wang

Affiliations * KTH Royal Institute of Technology; Department of Production Engineering, KTH Royal Institute of Technology; Department of Decision and Control Systems, KTH Royal Institute of Technology

Topic match: robot learning. Proposes a dynamical decoder for generating robot action sequences.

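AI summary: Proposes Ghost Attractor Networks, a theoretically derived dynamical decoder that builds a basin-attractor structure for efficient closed-loop sequential generation. On robotic action decoding, a 2.3M-parameter model matches the offline accuracy of a 1.07B-parameter Diffusion Transformer with 32 times lower latency.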

Details

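AI abstract (translated from Chinese)

Generating sequential outputs with large-scale Transformer and diffusion decoders incurs a memory cost that grows with sequence length and requires iterative per-step computation. Replacing them with small feed-forward decoders restores efficiency but yields unstructured latent representations that limit closed-loop control: both phase-conditioned action generation and cross-step latent carry-over require a latent geometry with stable basins. This article proposes Ghost Attractor Networks, a theoretically derived dynamical decoder whose latent variable evolves under a learned potential with drift and produces a basin-attractor structure by construction. Three desiderata (multi-modality, decoder-level single-pass switching, and constant memory) motivate the potential-drift form, and mode transitions arise as saddle-node bifurcations with ghost-attractor escape. A hierarchical phase-space decomposition separates first-order basin convergence from second-order proprioceptive refinement. Empirically, a Ghost network trained end-to-end with behavioral-cloning and contrastive objectives exhibits the predicted gradient-flow contraction in its potential: on 1430 held-out samples, the gradient norm decays by 67% over five integration steps. Ghost is evaluated as a robotic action decoder. A 2.3-million-parameter Ghost matches the offline accuracy of a 1.07-billion-parameter Diffusion Transformer with 462 times fewer parameters and 32 times lower latency, and achieves 5.9% to 29% lower offline mean squared error than five alternative 2M-parameter decoders (MLP, Neural ODE, CVAE, Transformer, 1-step Diffusion). On the LIBERO-10 closed-loop benchmark, phase conditioning on Ghost's basin-structured latent improves success rate by 13.5 percentage points over a feed-forward MLP baseline, and persistent-latent ensembling reaches a 95.7% final success rate.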

English abstract

Sequential output generation with large-scale Transformer and diffusion decoders pays a memory cost that grows with sequence length, plus iterative per-step computation. Replacing them with small feed-forward decoders restores efficiency but produces unstructured latent representations that limit closed-loop control: phase-conditioned action generation and cross-step latent carry-over both require a latent geometry with stable basins. This article proposes Ghost Attractor Networks, a theoretically derived dynamical decoder whose latent evolves under a learned potential with drift and produces a basin-attractor structure by construction. Three desiderata (multi-modality, decoder-level single-pass switching, and constant memory) motivate the potential-drift form, and mode transitions arise as saddle-node bifurcations with ghost-attractor escape. A hierarchical phase-space decomposition separates first-order basin convergence from second-order proprioceptive refinement. Empirically, a Ghost trained end-to-end with a behavioral-cloning and contrastive objective exhibits the predicted gradient-flow contraction in its potential, with the gradient norm decaying by 67 percent across five integration steps on 1430 held-out samples. Ghost is evaluated as a robotic action decoder. A 2.3-million-parameter Ghost matches the offline accuracy of a 1.07-billion-parameter Diffusion Transformer at 462 times fewer parameters and 32 times lower latency, and beats five alternative 2M-parameter decoders (MLP, Neural ODE, CVAE, Transformer, 1-step Diffusion) on offline mean squared error by 5.9 to 29 percent. On the LIBERO-10 closed-loop benchmark, phase conditioning on Ghost's basin-structured latent yields a 13.5 percentage-point success-rate gain over a feed-forward MLP baseline, and persistent-latent ensembling reaches a 95.7 percent final success rate.
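To make the vocabulary of the abstract concrete (basins, gradient-flow contraction, saddle-node bifurcation, ghost-attractor escape), here is a one-dimensional toy sketch. It is not the paper's learned model: the double-well potential V(z) = (z^2 - 1)^2 / 4, the constant drift term, the forward-Euler integrator, and the names `force` and `integrate` are all assumptions chosen for illustration.

```python
def force(z, drift=0.0):
    """Velocity dz/dt = -dV/dz + drift for the toy potential V(z) = (z**2 - 1)**2 / 4."""
    return z - z**3 + drift


def integrate(z0, drift=0.0, dt=0.05, steps=2000):
    """Forward-Euler rollout of the scalar latent; returns the full trajectory."""
    traj = [z0]
    for _ in range(steps):
        traj.append(traj[-1] + dt * force(traj[-1], drift))
    return traj


# 1) Gradient-flow contraction: inside the right-hand basin, |dV/dz| shrinks
#    at every one of five integration steps.
five = integrate(0.8, steps=5)
grad_norms = [abs(force(z)) for z in five]

# 2) Below the critical drift 2 / (3 * sqrt(3)) ~ 0.385 the left basin survives,
#    so a state started at z = -1 settles at a negative fixed point.
stays_left = integrate(-1.0, drift=0.30)[-1]

# 3) Above the critical drift the left fixed point has vanished in a
#    saddle-node bifurcation; the state crawls past the "ghost" near
#    z = -0.58 and then escapes to the remaining attractor on the right.
escapes_right = integrate(-1.0, drift=0.45)[-1]
```

In this toy the drift plays the role of a switching signal: a small change across the critical value removes one basin, and the slow passage near the vanished fixed point is what the term "ghost attractor" refers to in dynamical-systems language.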
