AI中文摘要
通用具身智能体需要的不仅仅是物体识别:它们必须从情境视觉观察中推理空间关系、动作、程序、人类意图、环境约束和常识后果。然而,现有的视觉和具身问答基准通常对测试的推理依赖关系控制有限,使得难以将基于具身的推理与基于捷径的视觉或语言模式匹配区分开来。我们提出了ERQA-Plus,一个用于具身AI推理的诊断基准。ERQA-Plus包含1766个问答实例,这些实例基于711张以机器人为中心的图像,并根据一个结构化的分类法组织,涵盖感知、动作中心、社交交互、导航环境和上下文常识推理。该数据集使用多阶段生成和验证流程构建,结合了分类法引导的问题生成、自动质量判断、迭代修订和人工评估,以改进视觉基础、答案有效性和推理质量。我们对代表性的通用视觉语言模型和具身模型进行了基准测试,包括LLaVA-NeXT-8B、Prismatic-7B、MiniCPM-V-4.5-8B、Qwen3-VL、RoboRefer-8B和RoboBrain2.5-8B。尽管最强的模型Qwen3-VL-32B达到了83.4%的整体准确率和61.4的SBERT分数,但类别级别的结果揭示了空间推理、程序推理、事件预测和意图推理方面的持续弱点。因此,ERQA-Plus提供了一个细粒度的评估框架,不仅衡量具身智能体是否回答正确,还衡量它们能够可靠地执行哪些形式的具身推理。数据集可在https://this https URL获取,项目页面在https://this https URL。
英文摘要
Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available https://huggingface.co/datasets/huggingdas/erqa-plus and the project page at https://github.com/LUNAProject22/erqa-plus.