eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
A new algorithm cuts KV cache memory 3x while preserving accuracy on LongBench and retrieval tasks.
The transformer KV cache is a major memory bottleneck for long-context LLMs. The key insight in this paper is that the cache naturally splits into a low-rank shared context component and a full-rank per-token residual, well modeled by spiked random matrix theory. eOptShrinkQ exploits this: first, optimal singular value shrinkage (eOptShrink) automatically identifies and extracts the low-rank structure, removing noise and outliers. Second, the residual—shown to have delocalized coordinates and satisfy the thin shell property—is quantized using TurboQuant, a per-vector scalar quantizer with near-optimal distortion guarantees. Because spectral denoising restores isotropy, it eliminates the need for outlier handling and inner product bias correction, freeing bits for better reconstruction. Theoretical guarantees include automatic rank selection via the BBP phase transition, provably near-zero inner product bias, and near-optimal quantization distortion.
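The decomposition step can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's eOptShrink estimator: it uses hard thresholding at the Marchenko-Pastur bulk edge (the BBP detection threshold) in place of the optimal shrinker, and the noise-level heuristic is an assumption of this sketch.

```python
import numpy as np

def spectral_denoise(K, sigma=None):
    """Split a KV-cache matrix K (tokens x head_dim) into a low-rank
    'shared context' part plus a residual. Rank is selected automatically:
    singular values above the noise bulk edge (BBP threshold) are treated
    as signal spikes. Hard thresholding stands in for the paper's
    optimal shrinker (eOptShrink)."""
    n, d = K.shape
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    if sigma is None:
        # crude noise-level estimate from the median singular value
        # (assumption of this sketch, not the paper's estimator)
        sigma = np.median(s) / np.sqrt(max(n, d))
    # Marchenko-Pastur bulk edge: spikes above it are detectable signal
    edge = sigma * (np.sqrt(n) + np.sqrt(d))
    r = int(np.sum(s > edge))                      # automatic rank selection
    low_rank = (U[:, :r] * s[:r]) @ Vt[:r]         # shared-context component
    residual = K - low_rank                        # near-isotropic remainder
    return low_rank, residual, r
```

On a synthetic rank-2 signal plus Gaussian noise, the threshold recovers the planted rank and the residual is left for the quantizer.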
Experiments validate eOptShrinkQ on Llama-3.1-8B and Ministral-8B at multiple levels: per-head MSE and inner product fidelity show savings of nearly one bit per entry over TurboQuant at equivalent quality. On LongBench (16 tasks), eOptShrinkQ at ~2.2 bits per entry outperforms TurboQuant at 3.0 bits. Most impressively, on multi-needle retrieval, eOptShrinkQ at 2.2 bits closely matches or exceeds uncompressed FP16, suggesting that spectral denoising acts as a beneficial regularizer for retrieval-intensive tasks. This work demonstrates that theoretically grounded random matrix methods can achieve near-lossless compression at extreme bitrates, significantly reducing memory for long-context inference.
- Decomposes KV cache into low-rank shared context + residual, enabling optimal spectral denoising before quantization
- Saves nearly 1 bit per entry over TurboQuant at same quality, achieving ~2.2 bits per entry while outperforming 3-bit TurboQuant on LongBench
- Matches or exceeds uncompressed FP16 on multi-needle retrieval tasks, suggesting spectral denoising acts as a regularizer
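The quantization side can be illustrated with a generic per-vector uniform scalar quantizer. This is a hypothetical stand-in, not TurboQuant itself (which carries near-optimal distortion guarantees this sketch does not): each row stores one scale plus `bits` per entry, which is exactly the setting where a near-isotropic residual helps, since no bits are wasted covering outliers.

```python
import numpy as np

def quantize_per_vector(X, bits=2):
    """Per-vector uniform scalar quantizer (simplified stand-in for
    TurboQuant): each row is scaled into [-1, 1] by its own max-abs
    scale, then rounded onto a 2**bits-level uniform grid."""
    levels = 2 ** bits
    scale = np.abs(X).max(axis=1, keepdims=True) + 1e-12  # per-row scale
    q = np.round((X / scale + 1.0) / 2.0 * (levels - 1))  # codes in [0, levels-1]
    X_hat = (q / (levels - 1) * 2.0 - 1.0) * scale        # dequantize
    return X_hat
```

The reconstruction error of each entry is bounded by half the grid step, `scale / (levels - 1)` per row; on heavy-tailed raw keys the per-row scale is inflated by outliers and the same bit budget yields coarser grids, which is the failure mode spectral denoising removes.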
Why It Matters
Slashing KV cache memory 3x relieves a key bottleneck for production LLMs, enabling longer context windows on existing hardware.