IVF-TQ: Calibration-Free Streaming Vector Search via a Codebook-Free Residual Layer
Tarun Sharma
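AI summary: This paper proposes IVF-TQ, a streaming vector search index that achieves calibration-free approximate nearest neighbor search through a codebook-free residual compression layer. The core idea is to replace the learned codebook with a fixed random rotation and a precomputed Lloyd-Max scalar quantizer configured only by bit width and dimension, so the compression layer stays stable on streaming data without any training. Experiments show that IVF-TQ holds its performance across multiple datasets and memory regimes without retraining or per-dataset bit-budget tuning, which improves search efficiency and robustness in streaming scenarios.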
Details
Approximate nearest neighbor (ANN) indexes deployed against streaming corpora silently lose recall over weeks. The standard diagnosis is distribution shift, but under shuffled-i.i.d. ingestion, where there is no shift at all, product quantization still loses 3.8pp of recall at sub-matched bit budgets. The dominant production compression methods (PQ, OPQ, ScaNN) all fit a codebook to an initial sample and reuse it as the database grows by orders of magnitude. This paper presents IVF-TQ, an inverted-file index whose residual compression layer is data-independent: a fixed random rotation followed by a precomputed Lloyd-Max scalar quantizer parameterised only by the bit width b and dimension d. Only the IVF coarse k-means partition is trained. A uniform-over-sphere inner-product error bound depending only on (b, d, delta) provides a structural guarantee that no learned-codebook method admits. The same codebook-free design enables an IVF-amplification effect that closes the gap to Extended RaBitQ to within statistical noise (+17.7pp over flat TQ at matched bit budget), and an Adaptive variant that refreshes the partition without touching the compression layer. Across nine controlled cells (three 10M datasets, three PQ memory regimes, three seeds), per-batch PQ codebook retraining never recovers the streaming gap. IVF-PQ streaming stability requires per-dataset bit-budget tuning, while IVF-TQ holds at one fixed (b, d) configuration on all three datasets with Delta in [-0.80, +0.56]pp. The contribution is operational: no codebook to train, no per-dataset bit-budget tuning, no retraining cycle that ever closes the gap.
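To make the "fixed random rotation followed by a precomputed Lloyd-Max scalar quantizer" idea concrete, here is a minimal NumPy sketch of a data-independent residual codec. It is an illustration under stated assumptions, not the paper's implementation: the function names (`random_rotation`, `lloyd_max_levels`, `encode`, `decode`) are hypothetical, the residual norm is stored separately as a float, and the quantizer levels are fitted to a synthetic N(0, 1/d) sample on the assumption that coordinates of a randomly rotated unit vector are approximately Gaussian with that variance. Nothing in it looks at the database vectors, so the levels depend only on (b, d) and a seed.

```python
import numpy as np


def random_rotation(d, seed=0):
    """Fixed orthogonal matrix drawn once from a seeded Gaussian (QR-based)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs by the sign of diag(r) so the draw is well defined.
    return q * np.sign(np.diag(r))


def lloyd_max_levels(b, d, n_samples=100_000, iters=30, seed=0):
    """2**b reconstruction levels fitted by Lloyd iterations.

    Assumption: the target distribution is N(0, 1/d), sampled synthetically,
    so the result depends only on (b, d, seed) and never on indexed data.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples) / np.sqrt(d)
    n_levels = 2 ** b
    # Start from evenly spaced quantiles of the synthetic sample.
    levels = np.quantile(x, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2  # nearest-level decision boundaries
        idx = np.searchsorted(edges, x)         # bin index of every sample
        for k in range(n_levels):
            members = x[idx == k]
            if members.size:
                levels[k] = members.mean()      # centroid update
    return levels


def encode(residuals, rotation, levels):
    """Normalise, rotate, and quantise each coordinate independently."""
    norms = np.linalg.norm(residuals, axis=1, keepdims=True)
    unit = residuals / np.maximum(norms, 1e-12)
    rotated = unit @ rotation.T                 # row i becomes rotation @ unit[i]
    edges = (levels[:-1] + levels[1:]) / 2
    codes = np.searchsorted(edges, rotated).astype(np.uint8)  # valid for b <= 8
    return codes, norms


def decode(codes, norms, rotation, levels):
    """Look up levels, undo the rotation, and restore the stored norm."""
    return (levels[codes] @ rotation) * norms
```

In this sketch, newly ingested vectors are encoded with the same `rotation` and `levels` as the first batch, so there is no codebook that could go stale as the corpus grows. In an IVF setting the `residuals` would be vectors minus their assigned coarse centroid, and the coarse k-means partition is the only trained component, as the abstract states.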