arXivDaily arXiv每日学术速递 周一至周五更新

1. 机器人学习与模仿强化学习 1 篇

2606.19408 2026-06-19 cs.LG cs.RO 交叉投稿

FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning

FlexLAM: 解决潜在动作学习中的瓶颈权衡

Takanori Yoshimoto, Yang Hu, Naruya Kondo, Tatsuya Matsushima

发表机构 * University of Tsukuba(筑波大学) The University of Tokyo(东京大学)

AI总结 针对潜在动作模型中固定容量瓶颈导致的权衡问题,提出FlexLAM,通过嵌套dropout实现变长潜在动作,在不增加架构或损失的情况下,在稀缺标签和低回报任务中优于固定容量模型,并支持推理时调整令牌预算。

详情
AI中文摘要

潜在动作为无动作视频与下游决策提供了紧凑接口,但现有潜在动作模型(LAM)强制每个转换通过固定容量瓶颈。我们识别出一个瓶颈权衡:过于紧凑的编码可能丢弃动作对齐所需的转换线索,而过于松散的编码则保留了额外的转换变化,当对齐标签稀缺或分布狭窄时必须解决这些变化。FlexLAM用通过嵌套dropout训练的变长潜在动作取代固定容量,产生前缀有效编码,首先捕获紧凑的转换结构,仅在需要时添加细节,无需新架构或损失。在标准稀缺标签监督下和低回报单任务对齐压力测试中,单个FlexLAM在每个评估的令牌预算下匹配或超越单独训练的固定容量LAM,表明FlexLAM不仅在推理时可调整,而且在相同令牌预算下学习了更好的潜在动作接口。同一模型支持推理时令牌预算调整而无需重新训练,并且FlexLAM改善了Ego4D转换重建。这些结果表明,变长潜在动作是对潜在动作模型、潜在动作世界模型和视频预训练动作接口中固定容量瓶颈的无架构、即插即用升级。

英文摘要

Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.

2. 具身智能与视觉语言动作模型 2 篇

2606.19451 2026-06-19 cs.LG cs.CV cs.RO 交叉投稿

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

3D-DLP:自监督3D物体中心场景表示学习

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

AI总结 提出3D-DLP模型,通过自监督学习将场景级RGB-D或体素观测分解为3D潜在粒子,每个粒子编码解耦属性,实现可解释的逐粒子分割图,并支持场景操控和下游机器人操作。

Comments ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp

详情
AI中文摘要

我们引入了3D-DLP,一种自监督的物体中心表示学习模型,它将场景级RGB-D或体素观测分解为一组3D潜在粒子。基于深度潜在粒子(DLP)框架,每个粒子编码解耦的属性,包括3D关键点位置、边界框尺寸和外观特征,并代表场景中的一个独特实体。该模型通过端到端的自监督重建目标学习可解释的逐粒子分割图。我们在模拟和真实数据集上证明,学习到的潜在空间是可解释和可控的:通过操纵粒子位置并解码,我们可以生成新颖的场景配置。此外,我们展示了将这些紧凑的3D潜在粒子用于下游机器人操作,相比缺乏显式3D信息或依赖无物体中心结构的密集3D输入的基线方法,性能有所提升。代码和视频可在以下网址获取:此 https URL。

英文摘要

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.

2606.19531 2026-06-19 cs.CV cs.RO 交叉投稿

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

ImageWAM:世界动作模型真的需要视频生成,还是只需要图像编辑?

Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Eastern Institute of Technology(东方理工学院) Tencent Robotics X(腾讯机器人X) Tsinghua University(清华大学) Zhongguancun Academy(中关村学院)

AI总结 提出ImageWAM框架,利用预训练图像编辑模型替代视频生成进行机器人动作预测,通过编辑去噪的KV缓存作为世界动作上下文,在多个模拟和真实实验中优于基线,计算量降至1/6,延迟降至1/4。

Comments Project Page: https://zhangwenyao1.github.io/ImageWAM/

详情
AI中文摘要

世界动作模型(WAMs)通常依赖视频生成来桥接视觉世界建模和机器人控制。然而,基于视频的WAMs面临三个耦合的限制:密集的多帧未来令牌使得推理成本高昂,完整的视频预测将容量花费在与动作无关的时间和外观细节上,以及长期未来想象可能引入误导动作预测的错误。这些问题提出了一个简单的问题:世界动作模型真的需要视频生成吗?我们提出ImageWAM,一个简单的WAM框架,将预训练的图像编辑模型重新用于机器人动作预测。与视频生成相比,图像编辑提供了更匹配的先验:它只需要建模目标帧变换,关注与动作相关的当前到目标视觉差异,并通过编辑预训练将任务指令接地到局部视觉变化。在实践中,ImageWAM在推理时不解码目标帧;相反,它根据图像编辑去噪产生的KV缓存条件化一个流匹配动作专家,将其用作紧凑的世界动作上下文。ImageWAM在多个模拟和真实世界实验中优于标准VLA基线和匹配的竞争性WAM,且无需额外的策略预训练。它还将FLOPs降低到基于视频的WAMs的1/6,延迟降低到1/4。注意力分析进一步表明,编辑缓存聚焦于任务相关的变化区域,支持图像编辑作为基于视频的世界动作建模的有效替代方案。

英文摘要

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Does world action model really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.

3. 仿真、数据集与评测 1 篇

2606.20189 2026-06-19 cs.CV cs.AI cs.RO 交叉投稿

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-trainin

HilDA:利用扩散的分层蒸馏推进自监督LiDAR预训练

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院) Linköping University(林雪平大学) TRATON AB(TRATON公司) Qualcomm Auto Ltd Sweden Filial(高通汽车有限公司瑞典分公司)

AI总结 提出HilDA框架,通过分层蒸馏(多层蒸馏和全局上下文蒸馏)结合时间占用扩散目标,自监督预训练LiDAR骨干网络,在3D检测、场景流和语义占用预测任务上达到最先进水平。

Comments Accepted to ECCV 2026. Maciej and Jesper contributed equally

详情
AI中文摘要

利用视觉基础模型(VFM)进行相机到LiDAR的知识蒸馏为解决真实世界自动驾驶中巨大的几何和运动多样性所需的标注数据稀缺问题提供了一种有前景的方案。然而,当前方法通常将VFM视为黑盒教师,仅依赖逐帧特征相似性。因此,它们未能充分利用教师的逐层语义结构和全局上下文,以及LiDAR序列中固有的丰富时空信息。我们提出HilDA,一个用于LiDAR骨干网络的自监督预训练框架,能更好地捕捉驾驶任务所需的语义“是什么”和几何“在哪里”。HilDA结合了分层蒸馏(包括用于渐进语义对齐的多层蒸馏和用于场景级语义的全局上下文蒸馏)与一个促进时空一致性的时间占用扩散目标。使用HilDA预训练的模型在跨模态蒸馏基准上取得了最先进的结果,并在3D目标检测、场景流和语义占用预测任务上优于通过先前蒸馏方法训练的模型。代码见:此 https URL。

英文摘要

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.

4. 其他/综合机器人 1 篇

2606.20547 2026-06-19 cs.LG cs.CV cs.GR cs.RO math.DG 交叉投稿

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

Token 是群元素:关于矩阵李群上的李代数注意力

Przemyslaw Musialski

发表机构 * New Jersey Institute of Technology(新泽西理工学院)

AI总结 提出李代数注意力机制,将token定义为矩阵李群元素,利用相对位姿的李代数范数作为注意力分数,无需学习核函数或表示论工具,适用于仿射全帧群等非紧致非阿贝尔群。

Comments preprint, 19 pages, 3 figures

详情
AI中文摘要

我们将注意力token置于群上:一个token是矩阵李群$G$的一个元素$g_i$——一个纯粹的变换,没有特征负载,也没有外部作用$\rho(g)$承载它。据我们所知,这是第一个token为裸矩阵李群元素的注意力构造:它们的分数是相对位姿的闭式代数范数,而非学习核,并且它达到了每个基于不可约表示或满射指数的方法必须排除的仿射全帧群。我们称之为李代数注意力。一旦token是群元素,其余部分无需通常的表示论机制。一对的相对几何是规范的,即$g_i^{-1} g_j$,因此成对不变量$w_{ij} = \log(g_i^{-1} g_j)$是内在的而非设计的;在$G$对角作用下的等变性是重言式的,且余循环条件自动成立。注意力分数是负平方代数范数$s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$:在块加权Frobenius内积下的规范邻近核,无需不可约表示、球谐函数、Clebsch-Gordan积或学习核。该构造适用于任何矩阵李群,在包含相对位姿的选定对数图上,包括具有尺度和剪切的非紧致非阿贝尔仿射群,这些是向量token注意力方法无法达到的:既不是不可约表示传统,也不是满射指数方法。在SE(2)、SO(3)和Aff(2)上的三个序列补全实验证实了这一点:闭式分数匹配了相同不变量上的学习MLP核,并在SE(2)上优于它,使用的分数参数少50到80倍,而向量token基线破坏了不变量,误差达五到十二个数量级。

英文摘要

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $ρ(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_λ^2/τ$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.