Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning
Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Jisheng Dang, Rui Xu, Sen Hu, Jianheng Hou, Chengwei Qin, Xiaobin Hu, Kunyi Wang, Zhi Yang, Hao Peng, Hong Peng, Ronghao Chen, Huacan Wang
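AI summary: This paper introduces the VideoDR benchmark for studying agentic video reasoning in open-web settings. By requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning with verification, it exposes key challenges: maintaining the initial video anchors over long retrieval chains, goal drift, and long-horizon consistency.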
Details
In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.