VidMsg: A Benchmark for Implicit Message Inference in Short Videos
Issar Tzachor, Michael Green, Rami Ben-Ari
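AI summary: Proposes the VidMsg benchmark, which uses a message-first construction pipeline and a bidirectional retrieval task to evaluate how well video understanding models infer implicit messages in short videos.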
Details
- Comments
- Project page: https://iyttor.github.io/VidMsg
Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains 400 YouTube-derived clips across 9 practical topic areas and 52 fine-grained target messages, covering domains such as career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle. VidMsg is constructed through a message-first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message-clip retrieval for scalable applications such as video search and recommendation, where systems must capture holistic video understanding. In addition to retrieval, VidMsg includes a diagnostic multiple-choice QA benchmark, where models select the intended message of a clip from semantically related alternatives. Experiments with contemporary video-language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec-Msg, a baseline method that improves message-oriented retrieval while leaving substantial headroom for future work.
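To make the bidirectional message-clip retrieval setting concrete, the following is a minimal sketch of how such a task could be scored. It assumes that clips and messages have already been embedded into a shared vector space and that each clip is labelled with exactly one target message. The metric (Recall@K), the function names, and the argument `clip_to_msg` are illustrative assumptions; the abstract does not state which metrics or interfaces VidMsg actually uses.

```python
import numpy as np


def recall_at_k(sim, relevant, k):
    """Fraction of queries with at least one relevant item among the top k.

    sim:      (num_queries, num_items) similarity scores
    relevant: (num_queries, num_items) boolean relevance mask
    """
    # Indices of the k highest-scoring items for each query.
    topk = np.argsort(-sim, axis=1, kind="stable")[:, :k]
    hits = np.take_along_axis(relevant, topk, axis=1).any(axis=1)
    return float(hits.mean())


def bidirectional_recall(clip_emb, msg_emb, clip_to_msg, k=1):
    """Score clip-to-message and message-to-clip retrieval (hypothetical helper).

    clip_emb:    (num_clips, dim) clip embeddings
    msg_emb:     (num_messages, dim) message embeddings
    clip_to_msg: (num_clips,) index of the labelled message for each clip
    """
    # L2-normalise so that the dot product equals cosine similarity.
    clip_emb = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    msg_emb = msg_emb / np.linalg.norm(msg_emb, axis=1, keepdims=True)
    sim = clip_emb @ msg_emb.T  # (num_clips, num_messages)

    # relevant[i, j] is True when message j is the label of clip i.
    relevant = np.arange(msg_emb.shape[0])[None, :] == clip_to_msg[:, None]

    return {
        "clip_to_message": recall_at_k(sim, relevant, k),
        "message_to_clip": recall_at_k(sim.T, relevant.T, k),
    }
```

Because several clips can share one message, the message-to-clip direction counts a query as a hit when any of its labelled clips appears in the top k. Stricter choices, such as mean average precision over all labelled clips, would be equally reasonable under these assumptions.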