GEM: Generative Supervision Helps Embodied Intelligence
Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye, Fangfu Liu, Diankun Wu, Zhengyi Wang, Xumin Yu, Yongming Rao, Han Hu, Jun Zhu
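AI summary: Proposes GEM, a model that adds a depth map generation task to vision-language model pre-training and trains it jointly with the main model to improve both semantic understanding and physical operation in embodied intelligence. The authors also release the large-scale dataset GEM-4M and report state-of-the-art results on multiple benchmarks.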
Details
- Comments: Project Page: https://zhaorw02.github.io/GEM/
Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action (VLA) frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training and the low-level spatial and physical knowledge needed for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this gap. We integrate a depth map generation task directly into the VLM pre-training phase. Training this generative objective jointly with the main model yields substantial improvements in embodied intelligence, enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a large-scale dataset that mixes grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits superior task execution in both simulation and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/.
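
The abstract describes training a depth map generation objective jointly with the main model but does not give the loss. As a rough illustration only, the sketch below combines a next-token cross-entropy term with a depth regression term. The L1 depth loss, the `depth_weight` value, and all function names are assumptions made for this example and are not taken from the GEM paper or code.

```python
import math
from typing import Sequence


def text_cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean cross-entropy over tokens; each row of `logits` scores the vocabulary."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def depth_l1(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute error between flattened predicted and reference depth maps.

    L1 is an assumed choice; the source does not state the depth loss.
    """
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def joint_loss(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    depth_pred: Sequence[float],
    depth_ref: Sequence[float],
    depth_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Text objective plus a weighted generative depth objective."""
    return text_cross_entropy(logits, targets) + depth_weight * depth_l1(depth_pred, depth_ref)


if __name__ == "__main__":
    # One token over a two-word vocabulary, and a two-pixel depth map.
    loss = joint_loss([[0.0, 0.0]], [0], [1.0, 2.0], [1.5, 2.0])
    print(f"joint loss: {loss:.4f}")
```

In a sketch like this, setting `depth_weight` to zero recovers plain text-guided training, which makes the contribution of the depth term easy to ablate.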