arXivDaily arXiv每日学术速递 周一至周五更新
重置

1. 机器人学习与模仿强化学习 1 篇

2606.19328 2026-06-18 cs.LG cs.AI cs.RO 交叉投稿

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

UBP2: 不确定性平衡的偏好规划用于高效基于偏好的强化学习

Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

发表机构 * Learning, Embodied Autonomy, and Forecasting (LEAF) Lab, University of Toronto(多伦多大学学习、具身自主与预测(LEAF)实验室)

AI总结 提出UBP2方法,通过联合推理奖励、动力学和值函数的不确定性来主动引导探索,在Meta-World基准上显著提高了样本效率。

详情
AI中文摘要

基于偏好的强化学习提供了一种从行为的成对比较中学习奖励模型的方法,绕过了显式奖励设计的需求。然而,现有方法通常依赖于被动数据收集,并且在学习的早期阶段样本效率低下。我们引入了一种基于模型的方法,通过联合推理奖励、动力学和值函数的不确定性来主动引导探索。我们的方法,不确定性平衡的偏好规划(UBP2),使用奖励、动力学和值函数模型的集成,根据结合了期望奖励、终值认知不确定性的统一评分来评估候选轨迹。在此目标下的规划产生了利用和信息获取之间的显式权衡,无需临时的探索启发式。在标准正则性假设下,我们为有限时域和无限时域设置建立了次线性遗憾保证。实验上,在Meta-World基准上的实验表明,UBP2比无模型的基于偏好的方法和非乐观的基于模型的基线方法实现了更高的样本效率。

英文摘要

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.

2. 操作、抓取与灵巧手 1 篇

2606.18960 2026-06-18 cs.CV cs.RO 交叉投稿

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World:用于持久机器人操作的内存增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, jun shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

发表机构 * Dalian University of Technology(大连理工大学) Samsung R&D Institute China-Beijing (SRCB)(三星中国北京研究院)

AI总结 提出Mem-World,通过4D腕部视角曲面元索引内存W-VMem,解决操作中因遮挡和运动导致的场景遗忘问题,实现持久世界建模,提升策略评估与改进效果。

详情
AI中文摘要

动作条件世界模型已成为机器人学习的一种有前景的范式,通过生成动作一致的视频推演,为昂贵的真实世界实验提供了可扩展的替代方案。然而,在操作中持久世界建模仍然具有挑战性:频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图,导致模型遗忘或幻觉先前帧中看到的场景细节。现有的内存检索策略在动态操作场景中往往无法识别信息丰富的历史。为解决这一限制,我们提出了Mem-World,一种内存增强的多视图动作条件世界模型。其核心是W-VMem,一种4D腕部视图为中心的曲面元索引内存,将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素被观测的时间和位置,W-VMem能够根据未来动作实现几何感知的相关历史帧检索。在生成过程中,通过基于曲面元的渲染和评分选择相关历史帧,为预测提供信息丰富且非冗余的上下文。大量实验表明,Mem-World在复杂操作场景中生成持久推演,比Ctrl-World实现更可靠的策略评估,将皮尔逊相关系数提高14.5%,并通过合成数据生成支持有效的策略改进,在长时域任务中将成功率从58%提升到72%。

英文摘要

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5\%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58\% to 72\% on long-horizon tasks.

3. 导航、定位与SLAM 2 篇

2606.18583 2026-06-18 cs.CV cs.RO 交叉投稿

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

空地激光雷达地点识别:基于块级自监督学习和扩展互逆重排序

Yandi Yang, Xianghong Zou, Jianping Li, Haofeng Xie, Saurav Uprety, Hongzhou Yang, Naser El-Sheimy

发表机构 * University of Calgary(卡尔加里大学) Nanchang University(南昌大学) Nanyang Technological University(南洋理工大学) Wuhan University(武汉大学)

AI总结 提出一种空地激光雷达地点识别框架,通过多尺度块级自监督学习缩小域差距,并利用扩展互逆重排序算法减少误检,在多个数据集上显著提升检索精度。

详情
AI中文摘要

激光雷达地点识别用于确定在预先采集的点云地图上的位置。最常研究的基于地面激光雷达的地点识别存在预访问要求、覆盖不完整和视角有限等缺点。使用预先采集的全覆盖机载激光扫描(ALS)数据作为空中先验地图可以克服这些缺点,使得跨视角地点识别变得必要且有利。然而,空地激光雷达地点识别面临重大挑战,包括空中和地面点云之间的域差距以及初始检索中的误检。为了解决这些问题,我们提出了一种用于空地激光雷达地点识别的新型检索和重排序框架。基于相邻点云块与锚点块共享相似语义的先验知识,我们的检索网络在多个尺度上引入了块级自监督学习模块,并与场景级学习相结合,以提高空中和地面点云之间全局特征的判别性。此外,利用ALS点云的结构化空间分布,我们引入了一种扩展互逆(ER)重排序算法,以最大化利用邻域信息,并根据邻域特征优化每个特征,然后用于更新相似度矩阵以进行最终排序。大量实验表明,我们的检索网络优于现有最先进(SOTA)方法,在CS-Urban-Scenes数据集上平均Recall@1提高了9.8%,平均Recall@1%提高了3.2%,同时在CS-Campus3D数据集上也展示了最佳性能。此外,我们的ER重排序算法在无需额外训练的情况下,进一步将CS-Campus3D上的平均Recall@1提高了4.9%,CS-Urban-Scenes上提高了10.2%。

英文摘要

LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the priors that neighboring point cloud patches share similar semantics with anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm to exploit neighborhood information maximally and refine each feature based on neighbor features, which are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8\% improvement in average Recall@1 and a 3.2\% improvement in average Recall@1\% on the CS-Urban-Scenes, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9\% on CS-Campus3D and 10.2\% on CS-Urban-Scenes without additional training.

2606.18687 2026-06-18 cs.CV cs.RO 交叉投稿

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

空间分层蒸馏用于异构雷达位置识别

Sagun Singh Shrestha, Samuel Harding, Abdelwahed Khamis, Saimunur Rahman, Peyman Moghadam

发表机构 * CSIRO Robotics(澳大利亚联邦科学与工业研究组织机器人实验室) University of Queensland(昆士兰大学)

AI总结 针对4D汽车雷达与密集旋转雷达之间的异构位置识别,提出空间分层蒸馏(SSD)方法,通过基于雷达回波的物理空间非对称对齐,在重叠区域强制特征对齐,在稀疏区域降低蒸馏权重,在HeRCULES数据集上达到最先进性能。

Comments IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

详情
AI中文摘要

可扩展的全天候位置识别越来越依赖于异构雷达位置识别来桥接不同的硬件平台。一个显著的应用是将来自经济高效的4D汽车雷达的查询与由密集旋转雷达构建的高保真参考地图进行匹配。这一过程从根本上受到4D传感器极端稀疏性(和窄视场)的限制,该传感器仅捕获旋转雷达数据库中存在的结构密度的一小部分。先前的工作通过统一不同的雷达信号来解决这个问题,即将两种信号投影到共同的表示空间。然而,它们在多会话环境中性能下降。在本文中,我们提出了空间分层蒸馏(SSD);一种策略,用直接从物理雷达回波导出的非对称空间对齐取代标准的均匀蒸馏。在两个雷达都有重叠回波的区域,SSD强制进行强特征对齐。关键的是,在4D学生雷达缺乏回波但教师雷达在共享视场内包含有效结构的稀疏区域,SSD应用大幅折扣的蒸馏权重。对最近的HeRCULES数据集的广泛评估表明,SSD显著优于先前的位置识别方法,在其具有挑战性的动态序列上取得了最先进的结果。

英文摘要

Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals. That is, projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD); a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations of the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.

4. 具身智能与视觉语言动作模型 2 篇

2606.18955 2026-06-18 cs.CV cs.RO 交叉投稿

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

运动聚焦的潜在动作使跨实体VLA训练能从人类自我中心视频中学习

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Tianfu Jiangxi Laboratory(天府江西实验室)

AI总结 提出基于潜在动作的框架,利用混合解耦VQ-VAE从无标签人类视频中提取通用动作先验,通过意图-感知解耦策略减少动作幻觉,仅需50条轨迹即可适配下游任务。

Comments Accepted to IROS 2026

详情
AI中文摘要

训练通用视觉-语言-动作(VLA)模型通常需要大量、多样化的机器人数据集,并带有高保真动作标注。尽管自我中心的人类操作视频丰富且捕捉了显著的环境多样性,但缺乏动作标签使其难以在传统训练范式下使用。为解决这一问题,我们提出了一种基于潜在动作的框架,旨在从无标签人类视频中提取通用动作先验。该架构采用混合解耦VQ-VAE,通过物理掩码将运动动态与环境背景解耦,从而构建跨实体动作码本。通过在人类视频上使用码本进行预训练,VLM骨干网络学习到动作意图的深层表示。为了适应特定实体,我们引入了一种意图-感知解耦策略,其中VLM预测动作意图,而一个独立的冻结视觉编码器为动作专家提供状态特定特征,从而减少动作幻觉。在仿真和真实环境中的结果表明,我们的方法仅在无标签人类视频上预训练,与在大量标注数据集上训练的最先进VLA模型相比具有竞争力,且仅需50条轨迹进行下游适配。

英文摘要

Training generalist Vision-Language-Action(VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

2606.19297 2026-06-18 cs.LG cs.RO 交叉投稿

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

VLA 甚至知道基础知识吗?衡量视觉-语言-动作模型中的常识和世界知识保留

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro

发表机构 * CogAI Lab(CogAI实验室) FusionBrain Lab(FusionBrain实验室) IAI MSU(莫斯科大学人工智能研究所) Lomonosov MSU(莫斯科国立罗蒙诺索夫大学) NUST MISIS(国立研究型技术大学MISIS) Applied AI Institute(应用人工智能研究所) HSE University(高等经济大学) Generalizable AI Systems(通用人工智能系统实验室) ISP RAS(俄罗斯科学院系统编程研究所) MIRAI Domain-specific NLP Group(领域特定自然语言处理组)

AI总结 提出 Act2Answer 协议,通过动作回答评估 VLA 模型的知识保留,发现模型在简单概念上表现良好,但在丰富语义类别上存在差距,且 VQA 联合训练有助于知识保留。

Comments Project page: https://tttonyalpha.github.io/act2answer/

详情
AI中文摘要

具身视觉-语言-动作(VLA)模型通常通过在机器人数据上微调强大的预训练 VLM 获得,但目前尚不清楚它们在适应后保留了多少常识和事实知识。在知识敏感任务上的失败是模糊的,混淆了知识缺失与低级控制泛化能力差。我们引入 Act2Answer,一种轻量级协议,通过要求智能体通过动作来回答,将 VLM 知识基准适配到 VLA 评估。每个问题变成一个简短的桌面场景,其中智能体执行单个物体放置动作以选择候选答案,从而产生动作基础的、减少控制混淆的成功率。我们在不同的常识和世界知识类别中策划了这样的环境测试套件,并引入逐层意图探测以定位 VLM 骨干和动作头中与答案相关的信息。在对 7 个 VLA 模型和 9 个 VLM 基线的大规模研究中,我们系统地跨类别对模型进行排名,发现 VLA 在简单概念上表现稳健,但在更丰富的语义类别上相对于其源 VLM 显示出更大的差距,VQA 联合训练与更好的知识保留相关,并且答案相关信号在 VLA 中间层达到峰值,但在上层减弱。Act2Answer 可在以下网址获取:此 https URL。

英文摘要

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.

5. 无人车、无人机与移动机器人 1 篇

2606.19258 2026-06-18 cs.CV cs.RO 交叉投稿

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

CABLE: 面向V2X系统的云辅助带宽高效LMM编码框架

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

发表机构 * University of Georgia(佐治亚大学)

AI总结 提出CABLE框架,通过边缘端利用自我运动补偿和残差运动线索传播云分割掩码,生成感兴趣区域(ROI)并仅上传ROI掩码图像,形成掩码-ROI-LMM反馈循环,在五个数据集上实现73-87%的ROI像素覆盖减少和5-8倍LMM预填充加速。

详情
AI中文摘要

云托管的大型多模态模型(LMM)可以为车联网系统提供强大的开放词汇感知能力,但简单地将全分辨率帧从边缘传输到云会导致严重的通信开销和云侧预填充延迟。我们提出了CABLE,一种用于边缘-云感知的云辅助带宽高效LMM编码框架。CABLE在边缘端利用自我运动补偿传播先前的云分割掩码,通过残差运动线索进行细化,并通过走廊包络整合断开区域,形成鲁棒的感兴趣区域(ROI)。仅上传ROI掩码图像,而云分割输出作为下一帧的先验反馈,形成掩码-ROI-LMM反馈循环。在五个数据集(nuScenes、WOD-ZB、Waymo、KITTI和CADC)上的实验表明,该方法在保持感知能力的同时实现了显著的通信节省,相对于全帧推理,ROI像素覆盖减少73-87%,估计LMM预填充加速5-8倍,检测质量略有折衷。

英文摘要

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving $73$--$87\%$ ROI pixel-coverage reduction with $5$--$8\times$ estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

6. 仿真、数据集与评测 3 篇

2606.18439 2026-06-18 cs.CV cs.RO 交叉投稿

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

RegimeVGGT:面向视觉几何基础Transformer的逐层空间保持冗余去除

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of California, Irvine(加利福尼亚大学尔湾分校) Nanyang Technological University(南洋理工大学)

AI总结 提出RegimeVGGT,通过逐层U形压缩(显著性引导带状合并与选择性保护K/V下采样)去除冗余,在保持重建质量的同时实现6.7倍加速。

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

详情
AI中文摘要

视觉几何基础Transformer(VGGT)通过一次前向传播从多视图图像恢复密集3D场景结构,但二次交叉帧注意力限制了其可扩展性。现有的免训练加速器沿单一轴均匀减少计算,忽略了层间异质性。我们的频谱、探测和因果分析揭示了三个区域:浅层缺乏跨视图结构,中层驱动跨视图对齐,深层对密集几何是冗余的,但其跨帧注意力对姿态仍然至关重要。RegimeVGGT沿两个轴应用逐层U形压缩:显著性引导带状合并保护几何和边缘显著性令牌,而选择性保护K/V下采样通过相移空间网格、参考帧锚点以及未压缩的相机/注册令牌来保持跨帧空间覆盖和姿态关键路径。免训练,RegimeVGGT在匹配重建质量下相比VGGT*实现了6.7倍加速。

英文摘要

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.18582 2026-06-18 cs.CV cs.RO eess.IV 交叉投稿

Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

ICRA 2026 GOOSE 2D细粒度语义分割挑战赛技术报告:利用DINOv3实现野外机器人中的鲁棒户外场景理解

Jaeil Park, Hyobin Choi, Sangjin Lee, Hyungtae Lim, Sung-Hoon Yoon

发表机构 * Daegu Gyeongbuk Institute of Science and Technology (DGIST)(大邱庆北科学技术院) Massachusetts Institute of Technology (MIT)(麻省理工学院)

AI总结 提出一种结合DINOv3自监督骨干、ViT-Adapter和Mask2Former解码器的网络设计,以及多尺度测试增强和模型集成的推理策略,在64类细粒度越野语义分割挑战中取得第一名,复合得分76.57%。

Comments 5 pages, 4 figures

详情
AI中文摘要

ICRA 2026野外机器人研讨会举办的GOOSE 2D细粒度语义分割挑战赛评估了越野图像在64个细粒度类别和11个评估的非空洞粗类别上的密集语义分割。我们提出了该挑战的第一名解决方案。我们的解决方案包含两个互补的改进:(a) 网络级设计,结合了自监督DINOv3 ViT-L/16骨干、ViT-Adapter和Mask2Former掩码分类解码器,以及基于全局[CLS]令牌的粗类别辅助损失;(b) 推理时聚合策略,基于多尺度和水平翻转测试时增强,以及使用Codabench分数选择的前三个检查点的集成。我们的方法达到了官方复合得分76.57%,包括69.32%的细类mIoU和83.81%的类别级mIoU,并在最终阶段排行榜上排名第一:http://this url。

英文摘要

The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: www.codabench.org/competitions/14257/#/results-tab.

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO 交叉投稿

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas: 通过全景重投影实现3D场景理解

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑工业大学) Huawei(华为)

AI总结 提出OneCanvas方法,将多视图补丁特征聚合到全景画布上,利用深度和相机位姿进行重投影,无需复杂几何编码器或大量训练,在SQA3D等基准上达到最先进精度。

Comments Project page: https://baranowskibrt.github.io/onecanvas/

详情
AI中文摘要

现有的视觉语言模型(VLM)中的3D场景理解方法要么依赖复杂的、模型特定的几何编码器,要么为了追求空间推理而需要大量的训练预算。相反,OneCanvas将所有视图的补丁特征聚合到一个单一的等距柱状全景画布上。具体来说,每个补丁利用其深度和相机位姿被反投影到3D世界坐标,然后根据从画布原点看到的该点的连续经度和纬度放置在画布上,无需对重叠视图进行光栅化或聚合。补丁的度量坐标的3D位置嵌入被添加到其特征中,从而恢复了将世界位置压缩到角度画布坐标时丢失的深度。因此,来自所有帧的补丁共享一个空间坐标系,无需融合或对主干网络进行重大架构修改。预训练的VLM将此表示视为普通图像。由于画布可以以任何感兴趣的姿态为中心,相同的表示直接支持从特定视角进行情境推理,这是机器人和具身AI中的常见需求。得益于这种表示,我们还可以引入空间预训练课程:通过程序化地将从真实图像中提取的对象的补丁特征放置在原本空白的画布上的选定3D世界位置,我们生成了涵盖广泛空间推理任务的即时监督,并控制答案分布以减少空间推理捷径。OneCanvas在SQA3D和VSI-Bench上达到了最先进的准确率,并在SPBench上泛化到分布外数据,其训练计算量比最强竞争方法少一个数量级。

英文摘要

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.

7. 安全、鲁棒性与可信机器人 2 篇

2606.18532 2026-06-18 cs.CR cs.AI cs.RO cs.SE 交叉投稿

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

AI沙箱:威胁模型、分类法与测量框架

Inderjeet Singh, Haitham Mahmoud, Andrés Murillo

发表机构 * Fujitsu Research of Europe(富士通欧洲研究)

AI总结 提出AI沙箱的威胁模型、分类法和测量框架,形式化沙箱边界与最弱链规则,定义网络物理威胁模型,并通过三个案例验证。

Comments 50 pages, 8 figures, 10 tables

详情
AI中文摘要

AI系统越来越多地在结合隔离、仿真、仪器化、监督和证据捕获的有界环境中进行评估。对于物理AI、AIoT和网络物理系统,这种转变不仅仅是术语问题:被测系统可能通过物理过程、网络设备和人类操作员进行感知、决策、执行、通信和故障。本文开发了一种面向保证的AI沙箱描述,将其作为数字AI、具身自主和网络物理部署中测试、评估、验证和确认的受控环境。我们形式化了沙箱边界和用于将每个维度的证据组合成有界部署声明的“最弱链”规则;分离了主要的沙箱原型;定义了一个包括对保证装置本身攻击的网络物理威胁模型;并引入了一个跨越保真度、可控性、可观测性、包含性、可重复性和治理工件的测量框架,在三个实际沙箱的工作案例研究中实例化。由此产生的威胁模型、分类法和测量框架阐明了沙箱可以有效测试什么、它可以包含哪些风险,以及它可以为安全、安保和监管保证支持哪些形式的证据。

英文摘要

AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance.

2606.18697 2026-06-18 cs.LG cs.CR cs.RO 交叉投稿

Stealthy World Model Manipulation via Data Poisoning

通过数据投毒进行隐蔽的世界模型操纵

Yibin Hu, Xiaolin Sun, Zizhan Zheng

发表机构 * Department of Computer Science(计算机科学系)

AI总结 提出SWAAP框架,通过两阶段数据投毒(双层级优化寻找有害目标模型+梯度匹配隐蔽实现)操纵学习到的世界模型,导致规划性能显著下降,且能规避多种防御检测。

Comments 41 pages, 8 figures, 11 tables. Submitted to NeurIPS 2026

详情
AI中文摘要

基于模型的学习智能体使用学习到的世界模型来预测未来状态、规划行动并适应新环境。然而,从收集的经验中更新世界模型的过程创造了一个训练时攻击面:对抗性投毒的微调轨迹可以操纵学习到的动力学,从而破坏下游规划。在本文中,我们提出了SWAAP,这是第一个针对学习到的世界模型的两阶段数据投毒框架。在第一阶段,SWAAP利用过渡梯度定理实现的一阶双层优化,识别出一个有害的目标世界模型,该模型在规划下诱导低回报行为,同时保持接近干净动力学。在第二阶段,SWAAP通过隐蔽约束的梯度匹配实现该目标,仅修改有限比例的微调过渡目标,使得诱导的训练梯度将受害者模型引向对抗目标,同时预测误差正则化器鼓励投毒目标保持接近世界模型的自然近似误差。为了评估攻击的隐蔽性,我们在投毒管道的三个阶段评估了防御和可检测性:投毒过渡的预训练检测、微调期间的鲁棒训练以及测试时对结果世界模型的监控。在多种连续控制任务中,SWAAP导致显著的性能下降,同时保持投毒过渡接近干净数据,并规避了评估的非自适应残差/CUSUM/TRIM风格防御。这些结果揭示了世界模型适应管道中的实际漏洞,并强调了需要保护世界模型训练数据和所学动力学的鲁棒性方法。

英文摘要

Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.