Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models
Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong
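AI summary: Proposes the Embodied3DBench benchmark, which systematically evaluates the low-level spatial intelligence of vision language models in 3D environments through 6 task categories (spatial structural understanding and interaction-oriented perception), and synthesizes a training dataset of 1.3M QA pairs to close the capability gap.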
Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.
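The two-group taxonomy described above can be written down as a small data structure, which also makes it easy to report scores per group rather than per task. The following is only a minimal sketch: the group and task names come from the abstract, but the record format `(task, is_correct)`, the function `group_accuracy`, and the use of plain accuracy as the metric are assumptions made for illustration. The benchmark itself may score tasks such as grasp point or trajectory prediction with different metrics.

```python
from collections import defaultdict

# Task taxonomy as listed in the abstract: two groups, three task categories each.
TAXONOMY = {
    "Spatial Structural Understanding": [
        "Grounding",
        "Spatial Relation Prediction",
        "Multi-view Correspondence",
    ],
    "Interaction-Oriented Perception": [
        "Affordance Prediction",
        "Grasp Point Prediction",
        "Trajectory Prediction",
    ],
}

# Reverse lookup: task category -> group name.
TASK_TO_GROUP = {
    task: group for group, tasks in TAXONOMY.items() for task in tasks
}


def group_accuracy(records):
    """Aggregate per-group accuracy.

    `records` is an iterable of (task, is_correct) pairs. This format is
    hypothetical and is not taken from the benchmark's released code.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for task, is_correct in records:
        group = TASK_TO_GROUP[task]
        totals[group] += 1
        hits[group] += int(is_correct)
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Toy records for illustration only; these are not benchmark results.
    toy_records = [
        ("Grounding", True),
        ("Spatial Relation Prediction", True),
        ("Grasp Point Prediction", False),
        ("Trajectory Prediction", True),
    ]
    print(group_accuracy(toy_records))
```

Grouping scores this way mirrors the comparison the abstract draws between spatial structural understanding and interaction-oriented perception.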