Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi
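AI summary: This paper studies the blind spot of Video-LLMs in understanding directional motion. It proposes the MoDirect dataset and the DeltaDirect method, which improve the model's perception of motion direction and significantly raise direction recognition performance in both synthetic and real-world settings.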
Comments: Preprint. 59 pages, including appendix. Code: https://github.com/KHU-VLL/DeltaDirect
Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect
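To make the DeltaDirect idea concrete, the following is a minimal NumPy sketch of an objective that predicts a normalized 2-D motion vector from adjacent-frame feature deltas. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the head architecture, the token pooling, or the loss, so the per-frame pooled features, the linear head (`head_w`, `head_b`), and the cosine-distance loss used here are all hypothetical choices, as are the function names.

```python
import numpy as np


def frame_deltas(feats):
    """Adjacent-frame feature differences.

    feats: (T, D) array, one pooled projector feature per frame
           (pooling over visual tokens is an assumption of this sketch).
    returns: (T-1, D) array with feats[t+1] - feats[t].
    """
    return feats[1:] - feats[:-1]


def normalize(v, eps=1e-8):
    """Scale each row of v to (approximately) unit L2 norm."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def delta_direction_loss(feats, head_w, head_b, motion_xy):
    """Hypothetical auxiliary loss on projector outputs.

    head_w: (D, 2) and head_b: (2,) form an assumed linear head.
    motion_xy: (T-1, 2) ground-truth image-plane displacement per frame pair.
    The loss is the mean cosine distance between the predicted vector and
    the normalized target, so only the direction of motion matters.
    """
    deltas = frame_deltas(feats)          # (T-1, D)
    pred = deltas @ head_w + head_b       # (T-1, 2)
    target = normalize(motion_xy)         # unit-length direction targets
    cos = np.sum(normalize(pred) * target, axis=-1)
    return float(np.mean(1.0 - cos))


# Toy usage: feature dimension 0 tracks the object's x position,
# so the deltas point "right" and the head simply reads them out.
feats = np.zeros((4, 8))
feats[:, 0] = np.arange(4)
head_w = np.zeros((8, 2))
head_w[0, 0] = 1.0
head_w[1, 1] = 1.0
head_b = np.zeros(2)

rightward = np.tile([3.0, 0.0], (3, 1))
leftward = np.tile([-3.0, 0.0], (3, 1))
loss_right = delta_direction_loss(feats, head_w, head_b, rightward)
loss_left = delta_direction_loss(feats, head_w, head_b, leftward)
```

In this toy setup the loss is close to 0 when the target direction matches the feature deltas and close to 2 when it is reversed. Because the target is normalized, the objective is sensitive to the sign and orientation of motion but not to its speed, which matches the abstract's focus on signed image-plane direction.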