Research & Papers

TurboQuant paper reveals optimal KV cache quantization for LLMs

A new statistical analysis finds that KQV beats QKQV at the 4-bit budget, which matters for shrinking LLM KV cache memory.

Deep Dive

A new preprint by Paolo D'Alberto gives a statistical comparison of three KV cache quantization schemes under an equal bit budget: KV (the scalar MSE baseline), KQV (WHT + MSE on K; WHT + MSE + QJL on V), and QKQV (WHT + MSE + QJL on both K and V). Here WHT is the Walsh-Hadamard transform and QJL is the quantized Johnson-Lindenstrauss transform. The analysis uses a Beta distribution on the hypersphere to trace how applying QJL to K inflates inner-product variance by a factor of π/2, which softmax then amplifies nonlinearly through Jensen's inequality.

The study reports three empirical findings. First, at n=4, the budget that dominates in practice, KQV wins on every metric (KL divergence, geometric K error, and 6D distance) across all distributions and ranks tested. Second, the K-V asymmetry is unconditional: QKQV is worse than KQV in KL divergence at every budget. Third, there is a budget-dependent crossover: QKQV achieves better geometric K reconstruction at n∈{2,3,5}, while KQV wins at n∈{4,6}, independent of rank and tail weight.
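To make the "WHT + MSE" building block concrete, here is a minimal Python sketch of a rotate-then-quantize round trip. It is an illustration under stated assumptions, not the preprint's implementation: the function names are hypothetical, a plain uniform quantizer stands in for the MSE-optimal scalar quantizer, and the QJL stage is left out entirely.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def uniform_quantize(x, bits):
    """Uniform scalar quantizer with 2**bits levels on [-peak, peak].

    Assumption: a stand-in for the MSE-optimal quantizer named in the article.
    """
    levels = 2 ** bits
    peak = max(abs(v) for v in x) or 1.0
    step = 2 * peak / (levels - 1)
    return [round((v + peak) / step) * step - peak for v in x]


def rotate_quantize_roundtrip(x, bits):
    """Rotate with the WHT, quantize, rotate back.

    The orthonormal WHT is its own inverse, so applying it twice undoes it.
    """
    return fwht(uniform_quantize(fwht(x), bits))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(0)
    key = [rng.gauss(0.0, 1.0) for _ in range(64)]  # hypothetical key vector
    for bits in range(2, 7):
        print(bits, mse(key, rotate_quantize_roundtrip(key, bits)))
```

Because the rotation is orthonormal, the reconstruction error in the original space equals the quantization error in the rotated space, which is why the quantizer's quality is what matters after the transform.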

D'Alberto gives a sufficient condition under which the Jensen mechanism amplifies error superlinearly through softmax, and links it to routing corruption and output collapse. At n∈{2,3,5}, QKQV wins geometrically because this condition does not bind; at n=4, the elevated K error and KL divergence for QKQV strongly suggest that the Jensen mechanism is the operative cause of the crossover. The findings suggest that hybrid schemes like KQV, which quantize keys and values differently, can outperform uniform approaches at the most common bit budgets. For LLM deployment engineers, this means choosing quantization per tensor rather than applying one method everywhere, which may reduce memory footprint without sacrificing model quality.
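The effect of noisier keys on attention can be explored with a small Monte Carlo experiment. The sketch below is a simplification and not the preprint's analysis: it assumes independent Gaussian noise on the attention logits (the paper works with a Beta distribution on the hypersphere), the logit values are made up, and all function names are hypothetical. It compares the average KL divergence of the softmax weights at a base noise level and at a level whose variance is inflated by π/2.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def gaussian_exp_bias(sigma):
    """E[exp(eps)] for eps ~ N(0, sigma^2).

    The value exp(sigma^2 / 2) is at least 1, which is Jensen's inequality
    for the convex function exp applied to zero-mean noise.
    """
    return math.exp(sigma ** 2 / 2)


def mean_kl_under_noise(logits, sigma, trials=2000, seed=0):
    """Average KL(clean || noisy) when Gaussian noise is added to each logit."""
    rng = random.Random(seed)
    clean = softmax(logits)
    total = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, sigma) for v in logits]
        total += kl(clean, softmax(noisy))
    return total / trials


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]           # hypothetical attention logits
    sigma = 0.5                              # hypothetical base noise level
    inflated = sigma * math.sqrt(math.pi / 2)  # variance scaled by pi/2
    print("exp bias:", gaussian_exp_bias(sigma), gaussian_exp_bias(inflated))
    print("mean KL:", mean_kl_under_noise(logits, sigma),
          mean_kl_under_noise(logits, inflated))
```

How much the KL divergence grows depends on the logits and the noise scale, so this toy setup only shows the direction of the effect; it does not reproduce the sufficient condition or the crossover reported in the preprint.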

Key Points
  • At the common 4-bit budget (n=4), KQV beats QKQV on all metrics (KL divergence, geometric K error, 6D distance) across all distributions and ranks tested.
  • A crossover exists: QKQV wins geometric K reconstruction at n=2,3,5, while KQV wins at n=4,6. The paper treats this as an open rate-distortion problem and points to the Jensen mechanism as the likely cause at n=4.
  • K-V asymmetry is unconditional: QKQV underperforms KQV in KL divergence at every budget, which shows why keys and values should be quantized differently.

Why It Matters

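KV cache quantization shrinks a major component of LLM inference memory: at a 4-bit budget, each cached entry takes roughly a quarter of the space of a 16-bit float, before any scaling overhead. Which scheme is applied to keys and which to values determines how much accuracy survives that compression, which is critical for production deployment.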