Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning
Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim
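AI summary: Proposes inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. By predicting the action between the current and future observations, the encoder captures fine-grained visual differences, which mitigates state aliasing.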
Details
Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space. To mitigate state aliasing, we introduce inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. By predicting the action between current and future observations, our objective encourages the encoder to capture fine-grained visual distinctions that determine low-level actions. We further use pseudo-reversed supervision to expose the encoder to a broader range of action directions and improve generalization under limited robot demonstrations. Our method applies to diverse VLA baselines, uses only standard observation-action pairs without additional annotations, and preserves the original inference pipeline at test time. Experiments on CALVIN ABC-D and SimplerEnv show consistent gains across diverse VLA baselines. Frozen-encoder probing and state-feature alignment analyses further show that our method learns state-discriminative visual representations that reduce state aliasing and better align with robot state changes.
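To make the auxiliary objective concrete, the following is a minimal sketch of how inverse-dynamics training pairs might be built from standard observation-action demonstrations. The abstract does not specify the details, so several points here are assumptions rather than the authors' method: actions are treated as additive per-step deltas, the target for a pair (o_t, o_{t+k}) is the sum of the k intermediate actions, and a pseudo-reversed pair (o_{t+k}, o_t) is labeled with the negated target. The names `IDMSample`, `summed_action`, `build_idm_samples`, and the `horizon` parameter are hypothetical. The auxiliary prediction head and the weighting against the main action loss are not shown.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class IDMSample:
    """One inverse-dynamics example: (current obs, future obs) -> action."""
    current_obs: int       # index of the observation fed in as "current"
    future_obs: int        # index of the observation fed in as "future"
    target_action: Vector  # action the auxiliary head should predict
    reversed_pair: bool    # True for a pseudo-reversed pair


def summed_action(actions: Sequence[Vector], start: int, end: int) -> Vector:
    """Sum per-step delta actions over steps start..end-1.

    Assumption: actions are additive deltas (e.g. end-effector displacements).
    """
    dims = len(actions[0])
    return tuple(sum(actions[t][d] for t in range(start, end)) for d in range(dims))


def build_idm_samples(actions: Sequence[Vector], horizon: int) -> List[IDMSample]:
    """Build forward and pseudo-reversed samples from one demonstration.

    actions[t] is the action taken between observations t and t + 1, so a
    demonstration with len(actions) steps has len(actions) + 1 observations.
    """
    samples: List[IDMSample] = []
    for t in range(len(actions) - horizon + 1):
        delta = summed_action(actions, t, t + horizon)
        samples.append(IDMSample(t, t + horizon, delta, False))
        # Assumption: swapping the two observations corresponds to the
        # negated action, giving supervision for the opposite direction.
        reversed_delta = tuple(-x for x in delta)
        samples.append(IDMSample(t + horizon, t, reversed_delta, True))
    return samples


# Toy demonstration: three 2-D delta actions, hence four observations.
demo_actions = [(1.0, 0.0), (0.5, -0.5), (0.0, 2.0)]
samples = build_idm_samples(demo_actions, horizon=2)
```

In this sketch each forward pair contributes one extra reversed pair, so the encoder sees twice as many action directions without any new demonstrations or annotations. Because the samples only drive an auxiliary training loss, nothing here would need to run at inference time, which is in line with the abstract's statement that the original inference pipeline is preserved.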