SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL
Ruiyang Ma, Teng Ma, Junru Li, Hantian Zha, Xuchun Shang, Qingda Hu, Zheng Liu, Xinjun Yang, Tao Ma, Guojie Luo
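AI summary: To address the bottleneck caused by transferring the full KV cache during long-context inference with sparse attention models, the paper proposes SAC, a disaggregated cache system that uses CXL to fetch only the top-k KV entries on demand. Compared with RDMA-based approaches, it achieves 2.1x higher throughput and 9.7x lower TTFT.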
The scaling of LLMs toward long-context inference has shifted the primary bottleneck of serving systems from computation to memory capacity. Traditional solutions for dense attention models rely on RDMA-based disaggregated memory pools, which fetch the entire prefix KV cache from remote storage into local memory in a coarse-grained transfer before decoding. This approach is fundamentally inefficient for emerging sparse attention models: although only a small fraction of KV entries are active during decoding, these systems still copy the full KV cache into local memory, causing severe transfer bottlenecks and wasting local memory. To address this, we propose SAC, the first efficient disaggregated KV cache system optimized for sparse attention models. By leveraging the low-latency, cache-line-granularity load/store semantics of Compute Express Link (CXL), SAC fetches only the required top-k KV entries on demand during inference. Evaluations on DeepSeek-V3.2 using SGLang show that SAC achieves 2.1x higher throughput, 9.7x lower time to first token (TTFT), and 1.8x lower time between tokens (TBT) than RDMA-based baselines, establishing CXL-based disaggregation as the superior infrastructure for emerging sparse attention models.
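The contrast between coarse-grained prefix fetching and on-demand top-k fetching can be illustrated with a small sketch. This is not SAC's implementation: the function names (`top_k_indices`, `fetch_full`, `fetch_top_k`), the list standing in for a remote memory pool, and the relevance scores are all hypothetical, and the sketch only counts how many KV entries each strategy would move.

```python
import heapq


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    k = min(k, len(scores))
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))


def fetch_full(remote_pool):
    """Coarse-grained baseline: copy every KV entry of the prefix.

    Returns the local copy and the number of entries moved.
    """
    local = list(remote_pool)
    return local, len(local)


def fetch_top_k(remote_pool, scores, k):
    """On-demand fetch: read only the entries selected by the top-k scores.

    Returns a mapping from entry index to entry, and the number of entries moved.
    """
    idx = top_k_indices(scores, k)
    local = {i: remote_pool[i] for i in idx}
    return local, len(idx)


# Hypothetical remote pool of 8 KV entries and per-entry relevance scores.
remote_pool = [f"kv{i}" for i in range(8)]
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.4]

_, full_moved = fetch_full(remote_pool)
selected, sparse_moved = fetch_top_k(remote_pool, scores, k=3)

print(f"full fetch moved {full_moved} entries")
print(f"top-k fetch moved {sparse_moved} entries: {sorted(selected)}")
```

In this toy setting the top-k path moves 3 of 8 entries. Reading individual entries this way is only practical when the interconnect supports fine-grained access, which is the property of CXL load/store semantics that the abstract highlights.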