From Content to Knowledge: Lightning Fast Long-Video Understanding with Neural Knowledge Representations
Yuchen Guan, Xiao Li, Zongyu Guo, Xiaoyi Zhang, Xiulian Peng, Chun Yuan, Yan Lu
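AI summary: Proposes encoding long videos as Neural Knowledge Representations (NKR). Agentic Knowledge Distillation (AKD) automatically synthesizes descriptions and question-answer pairs, embedding the video's knowledge in a small set of weights attached to the VLM backbone. This enables lightweight, reusable video understanding that needs no video reloading at inference and greatly reduces latency.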
We propose a new paradigm for long-video understanding that treats a long video as a Neural Knowledge Representation (NKR). An NKR represents video content neither as a stream of tokens nor as a pre-organized database, but as a small, separate set of network weights attached to the backbone of a Vision-Language Model (VLM). The NKR weights are optimized to encapsulate the video's semantic content through a novel Agentic Knowledge Distillation (AKD) process, in which an agent automatically synthesizes dense descriptions and question-answer pairs that distill the video's knowledge into the NKR. Although AKD is a comprehensive, one-time encoding phase, the resulting NKR turns the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto a frozen VLM, enabling direct, query-based understanding without reloading or re-encoding the original video. This approach decouples video length from inference cost and offers high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show that our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by more than two orders of magnitude, opening new possibilities for interactive long-video understanding.
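The amortization argument in the abstract can be made concrete with a small cost model. The sketch below is illustrative only: the `CostModel` fields, both function names, and every number are hypothetical and are not taken from the paper or from LVBench. It assumes each approach has a one-time cost plus a per-query cost, and that only the token-based baseline pays a per-query cost that grows with video length.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical costs in seconds; none of these values come from the paper."""
    encode_once: float        # one-time cost per video (e.g. an AKD-style encoding pass)
    per_query_fixed: float    # per-query cost that does not depend on video length
    per_query_per_min: float  # per-query cost per minute of video (e.g. re-encoding frames)


def per_query_cost(model: CostModel, video_minutes: float) -> float:
    """Cost of answering a single query, excluding the one-time cost."""
    return model.per_query_fixed + model.per_query_per_min * video_minutes


def amortized_cost(model: CostModel, video_minutes: float, num_queries: int) -> float:
    """Average cost per query once the one-time cost is spread over all queries."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return model.encode_once / num_queries + per_query_cost(model, video_minutes)


def break_even_queries(weights: CostModel, tokens: CostModel, video_minutes: float) -> float:
    """Query count at which both approaches have the same total cost.

    Returns math.inf if the weight-based approach never catches up.
    """
    saving_per_query = per_query_cost(tokens, video_minutes) - per_query_cost(weights, video_minutes)
    if saving_per_query <= 0:
        return math.inf
    extra_upfront = weights.encode_once - tokens.encode_once
    return max(0.0, extra_upfront / saving_per_query)


# Illustrative numbers only (assumptions, not measurements).
weight_based = CostModel(encode_once=3600.0, per_query_fixed=1.0, per_query_per_min=0.0)
token_based = CostModel(encode_once=0.0, per_query_fixed=1.0, per_query_per_min=2.0)

print(break_even_queries(weight_based, token_based, video_minutes=60.0))
print(amortized_cost(weight_based, video_minutes=60.0, num_queries=100))
print(amortized_cost(token_based, video_minutes=60.0, num_queries=100))
```

Under these assumed numbers, the weight-based cost per query is independent of `video_minutes`, so the one-time encoding pays off once the number of queries exceeds the break-even point. Whether that holds in practice depends on the real encoding and inference costs, which this sketch does not model.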