2606.07433
2026-06-08
cs.CV
cs.AI
cs.MM
New submission
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang
Affiliations
Peking University; Wuhan University; Shanghai Jiao Tong University; Nanyang Technological University; CASIA; University of Tokyo; University of Liverpool; Zhejiang University; National University of Singapore; UC Merced
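AI summary
Proposes three functional capabilities for human-view video understanding (watch, remember, reason), builds a unified framework for analyzing perception, memory, reasoning, and prediction in video MLLMs, and summarizes the challenges, methods, applications, and future directions.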