2510.08759
2026-05-22
cs.CV
cs.RO
Version update
Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis
Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Yizhe Zhu, Shiji Xin, Yijian Huang, Boce Hu, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Haojie Huang, Lawson L. S. Wong
Affiliations
Northeastern University, Boston, MA, USA
The Chinese University of Hong Kong, Hong Kong, China
Peking University, Beijing, China
Westlake University, Hangzhou, China
Harvard University, Cambridge, MA, USA
Purdue University, West Lafayette, IN, USA
University of Oxford, Oxford, United Kingdom
AI summary
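This paper introduces the BEAR benchmark, which decomposes embodied tasks into 14 atomic skills for fine-grained evaluation. It finds that perception is the main bottleneck behind reasoning failures, and it proposes BEAR-Agent, a multimodal conversational agent that significantly improves performance on embodied skills.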