MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
Junbin Xiao, Jiajun Chen, Tianxiang Sun, Xun Yang, Angela Yao
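AI summary: This paper proposes MuKV, a multi-grained KV cache compression method that uses semi-hierarchical retrieval to improve the efficiency and accuracy of long streaming video question answering. Experiments show that it outperforms baseline methods in answer accuracy, memory usage, and online QA efficiency.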
- Comments
- To appear at CVPR'26. Code is available at https://github.com/IMBALDY/MuKV
Long streaming video question answering (VideoQA) remains challenging because the number of visual tokens keeps growing while large language models (LLMs) have a limited reasoning length. KV caching stores the Key-Value (KV) pairs of historical tokens via LLM prefill and thus enables more efficient streaming QA. However, existing methods cache every one or two frames, which causes redundant memory usage and loses fine-grained spatial details within frames or temporal context across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at the patch, frame, and segment levels. The multiple levels of granularity preserve both local cues and global temporal context, while a dual-signal token compression mechanism guided by self-attention and frequency maintains efficiency. For online QA, MuKV uses a semi-hierarchical retrieval method to retrieve the relevant KV caches for answer generation. Experiments on long streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy without sacrificing memory or online QA efficiency. Moreover, the compression mechanism alone brings consistent benefits over baselines in answer accuracy, memory, and QA efficiency, highlighting its effectiveness.
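To make the idea of dual-signal token compression more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each cached token already has two precomputed scores (a self-attention score and a frequency score), that the two are min-max normalized and mixed with a weight `alpha`, and that the top `keep` tokens are retained. The function names `select_tokens` and `compress_kv`, the parameter `alpha`, and the mixing rule are all hypothetical; the abstract does not specify how the two signals are computed or combined.

```python
from typing import List, Sequence, Tuple


def _minmax(xs: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant or empty inputs map to zeros."""
    if not xs:
        return []
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def select_tokens(
    attn_scores: Sequence[float],
    freq_scores: Sequence[float],
    keep: int,
    alpha: float = 0.5,
) -> List[int]:
    """Return the indices (in original order) of the tokens to keep.

    Assumption: the two signals are combined as a weighted sum of their
    min-max normalized values. This mixing rule is illustrative only.
    """
    if len(attn_scores) != len(freq_scores):
        raise ValueError("score lists must have the same length")
    if not 0 <= keep <= len(attn_scores):
        raise ValueError("keep must be between 0 and the number of tokens")
    a = _minmax(attn_scores)
    f = _minmax(freq_scores)
    combined = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, f)]
    # Rank by combined score (descending); break ties by lower index.
    ranked = sorted(range(len(combined)), key=lambda i: (-combined[i], i))
    # Restore temporal/spatial order among the kept tokens.
    return sorted(ranked[:keep])


def compress_kv(
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    kept: Sequence[int],
) -> Tuple[List[Sequence[float]], List[Sequence[float]]]:
    """Keep only the KV entries whose token indices were selected."""
    return [keys[i] for i in kept], [values[i] for i in kept]
```

For example, calling `select_tokens([0.1, 0.9, 0.2, 0.4], [0.8, 0.1, 0.2, 0.9], keep=2)` keeps one token that ranks highly on the attention signal and one that ranks highly on both signals combined, and `compress_kv` then drops the remaining entries from the cache. Sorting the kept indices preserves the original token order, which matters if positional information is attached to the cached keys.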