eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
A new algorithm cuts KV cache memory 3x while preserving accuracy on LongBench and retrieval tasks.
The transformer KV cache is a major memory bottleneck for long-context LLMs. The key insight in this paper is that the cache naturally splits into a low-rank shared context component and a full-rank per-token residual, well modeled by spiked random matrix theory. eOptShrinkQ exploits this: first, optimal singular value shrinkage (eOptShrink) automatically identifies and extracts the low-rank structure, removing noise and outliers. Second, the residual—shown to have delocalized coordinates and satisfy the thin shell property—is quantized using TurboQuant, a per-vector scalar quantizer with near-optimal distortion guarantees. Because spectral denoising restores isotropy, it eliminates the need for outlier handling and inner product bias correction, freeing bits for better reconstruction. Theoretical guarantees include automatic rank selection via the BBP phase transition, provably near-zero inner product bias, and near-optimal quantization distortion.
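The decomposition step can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's eOptShrink estimator: it uses hard thresholding at the Marchenko-Pastur bulk edge (the BBP detection threshold) in place of the optimal shrinker, and the noise-level heuristic is an assumption of this sketch.

```python
import numpy as np

def spectral_denoise(K, sigma=None):
    """Split a KV-cache matrix K (tokens x head_dim) into a low-rank
    'shared context' part plus a residual. Rank is selected automatically:
    singular values above the noise bulk edge (BBP threshold) are treated
    as signal spikes. Hard thresholding stands in for the paper's
    optimal shrinker (eOptShrink)."""
    n, d = K.shape
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    if sigma is None:
        # crude noise-level estimate from the median singular value
        # (assumption of this sketch, not the paper's estimator)
        sigma = np.median(s) / np.sqrt(max(n, d))
    # Marchenko-Pastur bulk edge: spikes above it are detectable signal
    edge = sigma * (np.sqrt(n) + np.sqrt(d))
    r = int(np.sum(s > edge))                      # automatic rank selection
    low_rank = (U[:, :r] * s[:r]) @ Vt[:r]         # shared-context component
    residual = K - low_rank                        # near-isotropic remainder
    return low_rank, residual, r
```

On a synthetic rank-2 signal plus Gaussian noise, the threshold recovers the planted rank and the residual is left for the quantizer.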
Experiments validate eOptShrinkQ on Llama-3.1-8B and Ministral-8B at multiple levels: per-head MSE and inner product fidelity show savings of nearly one bit per entry over TurboQuant at equivalent quality. On LongBench (16 tasks), eOptShrinkQ at ~2.2 bits per entry outperforms TurboQuant at 3.0 bits. Most impressively, on multi-needle retrieval, eOptShrinkQ at 2.2 bits closely matches or exceeds uncompressed FP16, suggesting that spectral denoising acts as a beneficial regularizer for retrieval-intensive tasks. This work demonstrates that theoretically grounded random matrix methods can achieve near-lossless compression at extreme bitrates, significantly reducing memory for long-context inference.
- Decomposes KV cache into low-rank shared context + residual, enabling optimal spectral denoising before quantization
- Saves nearly 1 bit per entry over TurboQuant at same quality, achieving ~2.2 bits per entry while outperforming 3-bit TurboQuant on LongBench
- Matches or exceeds uncompressed FP16 on multi-needle retrieval tasks, suggesting spectral denoising acts as a regularizer
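The quantization side can be illustrated with a generic per-vector uniform scalar quantizer. This is a hypothetical stand-in, not TurboQuant itself (which carries near-optimal distortion guarantees this sketch does not): each row stores one scale plus `bits` per entry, which is exactly the setting where a near-isotropic residual helps, since no bits are wasted covering outliers.

```python
import numpy as np

def quantize_per_vector(X, bits=2):
    """Per-vector uniform scalar quantizer (simplified stand-in for
    TurboQuant): each row is scaled into [-1, 1] by its own max-abs
    scale, then rounded onto a 2**bits-level uniform grid."""
    levels = 2 ** bits
    scale = np.abs(X).max(axis=1, keepdims=True) + 1e-12  # per-row scale
    q = np.round((X / scale + 1.0) / 2.0 * (levels - 1))  # codes in [0, levels-1]
    X_hat = (q / (levels - 1) * 2.0 - 1.0) * scale        # dequantize
    return X_hat
```

The reconstruction error of each entry is bounded by half the grid step, `scale / (levels - 1)` per row; on heavy-tailed raw keys the per-row scale is inflated by outliers and the same bit budget yields coarser grids, which is the failure mode spectral denoising removes.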
Why It Matters
Slashing KV cache memory 3x relieves a key bottleneck for production LLMs, enabling longer context windows on existing hardware.