RateQuant cuts 2.5-bit KV cache perplexity by 70% with distortion-aware mixed-precision quantization
It lowers KIVI's perplexity on Qwen3-8B from 49.3 to 14.9 at 2.5 average bits, after 1.6 seconds of calibration.
A team led by Fei Zuo (arXiv, April 2026) tackled the memory bottleneck of KV caches in LLMs. Existing KV cache quantizers assign the same bit-width to every attention head, ignoring differences in head importance. The natural fix, allocating more bits to important heads, fails because each quantizer follows its own distortion curve D(b)=α·β⁻ᵇ, where b is the bit-width and the decay rate β ranges from 3.6 to 5.3 across quantizers. Applying one quantizer's distortion model to another inverts the bit allocation and makes performance worse than uniform quantization, a failure mode dubbed "distortion model mismatch."
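Because D(b)=α·β⁻ᵇ is a straight line in log space (log D = log α − b·log β), the two parameters can be estimated from a handful of distortion measurements with ordinary least squares. The following is a minimal sketch of that idea, not the paper's fitting procedure: the function name `fit_distortion_model` and the synthetic measurements are assumptions made for illustration.

```python
import math

def fit_distortion_model(bit_widths, distortions):
    """Fit D(b) = alpha * beta**(-b) by least squares on log D(b).

    Hypothetical helper: `distortions` would be measured on calibration
    data (e.g. mean squared error at each bit-width) for ONE quantizer.
    """
    if len(bit_widths) != len(distortions) or len(bit_widths) < 2:
        raise ValueError("need at least two (bit_width, distortion) pairs")
    n = len(bit_widths)
    log_d = [math.log(d) for d in distortions]
    mean_b = sum(bit_widths) / n
    mean_y = sum(log_d) / n
    sxx = sum((b - mean_b) ** 2 for b in bit_widths)
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(bit_widths, log_d))
    slope = sxy / sxx                    # equals -log(beta)
    intercept = mean_y - slope * mean_b  # equals log(alpha)
    return math.exp(intercept), math.exp(-slope)

# Synthetic example (made-up numbers): a quantizer with alpha=2.0, beta=4.2.
bits = [2, 3, 4]
measured = [2.0 * 4.2 ** (-b) for b in bits]
alpha, beta = fit_distortion_model(bits, measured)
print(f"alpha={alpha:.3f}, beta={beta:.3f}")
```

Fitting each quantizer separately matters because a β estimated for one quantizer says nothing reliable about another whose distortion decays at a different rate.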
RateQuant solves this by fitting a per-quantizer distortion model from a small calibration set, then solving the bit allocation problem in closed form via reverse waterfilling from rate-distortion theory. On Qwen3-8B at 2.5 average bits, RateQuant cuts KIVI's perplexity from 49.3 to 14.9 (a 70% reduction) and lowers QuaRot's perplexity by 6.6 points. Calibration takes just 1.6 seconds on a single GPU and adds zero overhead at inference time. The 18-page paper includes 7 figures and 5 tables and is available as arXiv:2605.06675.
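To make the closed-form allocation concrete, here is a hedged sketch of reverse waterfilling under a simplified objective that is an assumption of this example, not necessarily the paper's: minimize Σ wₕ·β⁻ᵇʰ over heads h, subject to a fixed average bit-width and a floor on each bₕ. Setting the Lagrangian's derivative to zero gives bₕ = log_β(wₕ) + c, where the constant c (the "water level") is chosen to meet the budget. Heads that would fall below the floor are clamped and the rest re-solved. The per-head weights `weights` and the function name `allocate_bits` are hypothetical, and the result is continuous; a real system would still need to round to supported bit-widths.

```python
import math

def allocate_bits(weights, beta, avg_bits, min_bits=0.0):
    """Reverse-waterfilling sketch: continuous bits per head.

    Minimizes sum_h w_h * beta**(-b_h) subject to mean(b_h) == avg_bits
    and b_h >= min_bits. `weights` are hypothetical per-head sensitivities.
    """
    n = len(weights)
    if avg_bits < min_bits:
        raise ValueError("avg_bits must be at least min_bits")
    budget = avg_bits * n
    log_w = [math.log(w, beta) for w in weights]
    bits = [min_bits] * n
    active = set(range(n))
    while active:
        # Budget left for active heads after clamped heads take the floor.
        remaining = budget - min_bits * (n - len(active))
        # Water level c such that sum over active of (log_w + c) == remaining.
        level = (remaining - sum(log_w[i] for i in active)) / len(active)
        below = [i for i in active if log_w[i] + level < min_bits - 1e-12]
        if not below:
            for i in active:
                bits[i] = log_w[i] + level
            break
        active -= set(below)  # clamp these heads to the floor, then re-solve
    return bits

# Made-up sensitivities for four heads, beta=4.0, 2.5 bits on average.
print(allocate_bits([1.0, 4.0, 16.0, 64.0], beta=4.0, avg_bits=2.5))
```

Note that β sets how many extra bits a given increase in sensitivity buys: a larger β compresses the spread of the allocation, which is one way to see why the decay rate has to match the quantizer actually in use.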
- RateQuant reduces KIVI's perplexity by 70% on Qwen3-8B at 2.5 average bits (from 49.3 to 14.9).
- Avoids 'distortion model mismatch' by fitting per-quantizer distortion curves (β from 3.6 to 5.3).
- Calibration takes only 1.6 seconds on a single GPU with zero inference overhead.
Why It Matters
Recovering most of the accuracy lost at 2.5 average bits, with no added inference cost, could make long-context LLM serving more practical on memory-limited hardware.