Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?
基于证据的缓存路由用于检索增强生成:何时可以安全地重用答案?
Syed Huma Shah
AI总结 提出GroundedCache,一种通过四个廉价门控(查询相似性、检索证据重叠、源版本有效性和词汇支持)验证缓存答案安全性的路由方法,显著降低不安全服务率。
Comments 19 pages, 9 figures, 10 tables. Code: https://github.com/syedhumarahim/grounded-cache-router
详情
现代检索增强生成(RAG)部署越来越依赖缓存来降低令牌成本和首令牌时间(TTFT)。在vLLM等服务栈中,前缀级KV重用已成为标准,而最近的系统(RAGCache、TurboRAG、CacheBlend、EPIC、ContextPilot、PCR、LMCache)进一步推动了块级和位置无关的重用。相比之下,输出级语义答案缓存仍然脆弱:相似的提示可能映射到不同的正确答案,检索到的证据随着语料库更新而漂移,并且对抗性碰撞攻击已被证明可以劫持缓存的响应。我们认为,缓存答案重用的正确框架不是如何更快地重用,而是何时重用是安全的。我们提出了GroundedCache,一种经过证据验证的缓存路由器,仅当四个廉价门控同时成立时才允许缓存答案:查询相似性、检索证据重叠、源版本有效性以及新检索证据对缓存答案的词汇(或基于判断的)支持。我们构建了一个六区域工作负载,用于压力测试缓存安全性而不仅仅是命中率,并引入了一个面向操作员的指标——不安全服务率(USR),即收到错误缓存答案的查询比例。在两个数据集和12,000个真实LLM生成(在vLLM上使用自动前缀缓存的Qwen2.5-7B-Instruct)中,GroundedCache在每个HotpotQA区域上将USR降至0.0%(而朴素缓存为15-35%),在mtRAG文档漂移上降至1.5%(而朴素缓存为51.5%),在设计点对抗区域上减少了34倍,在其他mtRAG区域上减少了3-10倍,同时端到端p50延迟保持在无缓存RAG基线的1.04-1.07倍以内。逐门控消融实验表明,词汇支持门控是两个数据集上的主要安全机制,其余门控以近乎零成本提供纵深防御。我们发布了实现、工作负载和评估工具。
Modern retrieval-augmented generation(RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token(TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems(RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when 4 cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR), fraction of all queries that received a wrong cached answer. Across 2 datasets and 12,000 real-LLM generations(Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime(vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift(vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.