ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
Yeqiu Chen, Ziyan Liu, Zhenxin Huang, Runquan Gui, Hong Wang, Lei Liu
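AI summary: This paper proposes ArborKV, a structure-aware KV cache management method that uses a lightweight value estimator and a tree-aware allocation policy to perform purely token-extractive eviction with lazy rehydration, reducing KV memory usage while maintaining high accuracy and enabling larger tree-structured reasoning searches under a fixed hardware budget.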
Recent progress in LLM reasoning has increasingly shifted from single-pass generation to explicit search over intermediate reasoning states. Tree-of-Thoughts (ToT) organizes inference as a tree-structured search with branching and backtracking, but it substantially amplifies the Key-Value (KV) cache: retaining KV states for a frontier of partial trajectories quickly becomes a memory bottleneck that limits throughput and constrains search depth and width under fixed hardware budgets. We address this challenge by observing that KV reuse in ToT-style inference is governed by search dynamics: near-term decoding depends primarily on the active branch and its ancestors, whereas inactive subtrees have low short-term reuse probability yet must remain recoverable for backtracking. Motivated by this, we propose ArborKV, a structure-aware eviction framework that couples a lightweight value estimator with a tree-aware allocation policy, and performs purely token-extractive eviction with lazy rehydration to support revisits. Experiments on ToT-style reasoning benchmarks show that ArborKV achieves up to ~4x peak KV-memory reduction while preserving near-full-retention accuracy, enabling larger search configurations under fixed device budgets that would otherwise run out of memory.
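To make the idea in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that per-token importance scores are already available from some estimator, that the allocation rule is a single fixed keep ratio for inactive branches, and that rehydration simply reports which token positions need their KV entries restored (by recomputation or reload). The names `Node`, `evict`, `rehydrate`, and `inactive_keep_ratio` are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One reasoning step in the search tree (hypothetical structure)."""
    name: str
    token_ids: List[int]            # tokens produced at this step
    scores: List[float]             # per-token importance (assumed given)
    parent: Optional["Node"] = None
    kept: List[int] = field(default_factory=list)  # positions with resident KV

    def __post_init__(self):
        # Initially every token's KV entry is resident.
        self.kept = list(range(len(self.token_ids)))


def path_to_root(node):
    """Return the node followed by all of its ancestors."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def evict(nodes, active, inactive_keep_ratio=0.25):
    """Keep full KV on the active path; keep only top-scoring tokens elsewhere."""
    protected = {id(n) for n in path_to_root(active)}
    for n in nodes:
        size = len(n.token_ids)
        if id(n) in protected:
            n.kept = list(range(size))
            continue
        budget = max(1, int(size * inactive_keep_ratio))
        top = sorted(range(size), key=lambda i: n.scores[i], reverse=True)[:budget]
        # Extractive: the retained set is a subset of the original positions.
        n.kept = sorted(top)


def rehydrate(node):
    """Return the evicted positions and mark the node fully resident again."""
    kept = set(node.kept)
    missing = [i for i in range(len(node.token_ids)) if i not in kept]
    node.kept = list(range(len(node.token_ids)))
    return missing


root = Node("root", [1, 2, 3, 4], [0.5, 0.5, 0.5, 0.5])
a = Node("a", [5, 6, 7, 8], [0.1, 0.9, 0.2, 0.8], parent=root)
b = Node("b", [9, 10, 11, 12], [0.3, 0.3, 0.3, 0.3], parent=root)

evict([root, a, b], active=b, inactive_keep_ratio=0.5)
print(a.kept)        # positions still resident on the inactive branch
print(rehydrate(a))  # positions to restore when the search backtracks to a
```

In this sketch the active branch and its ancestors are never compressed, which mirrors the observation that near-term decoding depends mainly on them, while inactive branches keep only a subset of their original tokens and are restored lazily on revisit. A real system would derive the per-node budget from the tree structure and the estimator rather than from a single constant.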