RateQuant cuts perplexity by 70% at 2.5-bit KV cache with distortion-aware mixed-precision quantization

arXiv cs.LG May 11, 2026

⚡ Cuts perplexity from 49.3 to 14.9 at 2.5 average bits, after just 1.6 seconds of calibration.

Deep Dive

A team led by Fei Zuo (arXiv, May 2026) tackled the memory bottleneck of KV caches in LLMs. Most existing KV cache quantizers assign the same bit-width to every attention head, ignoring differences in head importance. The natural fix, allocating more bits to important heads, fails because each quantizer follows its own distortion curve D(b) = α·β⁻ᵇ, where b is the bit-width and the decay rate β ranges from 3.6 to 5.3 across quantizers. Applying one quantizer's distortion model to another inverts the bit allocation and makes performance worse than uniform quantization, a failure mode dubbed "distortion model mismatch."
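Because D(b) = α·β⁻ᵇ becomes a straight line in b after taking logarithms (log D = log α − b·log β), the two parameters can in principle be estimated with an ordinary least-squares fit. The sketch below illustrates that idea only; the function name `fit_distortion_curve` and the synthetic measurements are assumptions made for this example and are not taken from the paper.

```python
import math

def fit_distortion_curve(bits, distortions):
    """Fit D(b) = alpha * beta**(-b) by least squares on log D.

    Hypothetical helper for illustration; not the paper's procedure.
    """
    n = len(bits)
    log_d = [math.log(d) for d in distortions]
    mean_b = sum(bits) / n
    mean_y = sum(log_d) / n
    sxx = sum((b - mean_b) ** 2 for b in bits)
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(bits, log_d))
    slope = sxy / sxx                    # equals -log(beta)
    intercept = mean_y - slope * mean_b  # equals log(alpha)
    return math.exp(intercept), math.exp(-slope)

# Synthetic distortion measurements at 2, 3 and 4 bits (made-up values,
# generated from the two extreme decay rates quoted in the article).
bits = [2, 3, 4]
slow_decay = [0.8 * 3.6 ** (-b) for b in bits]
fast_decay = [0.8 * 5.3 ** (-b) for b in bits]

alpha_slow, beta_slow = fit_distortion_curve(bits, slow_decay)
alpha_fast, beta_fast = fit_distortion_curve(bits, fast_decay)
print(f"slow quantizer: alpha={alpha_slow:.3f}, beta={beta_slow:.3f}")
print(f"fast quantizer: alpha={alpha_fast:.3f}, beta={beta_fast:.3f}")
```

With noise-free synthetic data the fit recovers the generating parameters; on real calibration data the measurements would scatter around the line and the fitted β would be an estimate.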

RateQuant solves this by fitting a per-quantizer distortion model from a small calibration set, then solving the bit allocation problem in closed form via reverse waterfilling from rate-distortion theory. On Qwen3-8B at 2.5 average bits, RateQuant cuts KIVI's perplexity from 49.3 to 14.9 (a 70% reduction) and lowers QuaRot's perplexity by 6.6 points. Calibration takes just 1.6 seconds on a single GPU and adds zero overhead at inference time. The 18-page paper, with 7 figures and 5 tables, is available as arXiv:2605.06675.
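To make the closed-form allocation concrete, here is a minimal sketch in the spirit of reverse waterfilling. It assumes each head i has an importance weight w_i and that all heads share one fitted β, and it minimises Σ w_i·β^(−b_i) subject to a fixed average bit-width. Setting the derivative of the Lagrangian to zero gives b_i = B + log_β(w_i) − mean_j log_β(w_j), and heads that would fall below a floor are clamped so the rest can be re-solved. The function name `allocate_bits`, the weights, and the floor are illustrative assumptions; the paper's exact formulation may differ.

```python
import math

def allocate_bits(weights, beta, avg_bits, min_bits=0.0):
    """Minimise sum_i w_i * beta**(-b_i) with mean(b_i) == avg_bits, b_i >= min_bits.

    Hypothetical sketch; requires avg_bits >= min_bits and positive weights.
    """
    n = len(weights)
    log_w = [math.log(w, beta) for w in weights]  # log base beta of each weight
    bits = [min_bits] * n
    active = list(range(n))
    budget = avg_bits * n
    while active:
        # Bits left for active heads once clamped heads take min_bits each.
        free = budget - min_bits * (n - len(active))
        base = free / len(active)
        mean_log_w = sum(log_w[i] for i in active) / len(active)
        trial = {i: base + log_w[i] - mean_log_w for i in active}
        clamped = {i for i in active if trial[i] < min_bits}
        if not clamped:
            for i in active:
                bits[i] = trial[i]
            break
        # Clamp heads below the floor and re-solve for the remaining ones.
        active = [i for i in active if i not in clamped]
    return bits

# Made-up importance weights for four heads, at the article's 2.5-bit budget.
weights = [16.0, 4.0, 1.0, 1.0]
unclamped = allocate_bits(weights, beta=4.0, avg_bits=2.5)
floored = allocate_bits(weights, beta=4.0, avg_bits=2.5, min_bits=2.0)
print(unclamped)  # approximately [3.75, 2.75, 1.75, 1.75]
print(floored)    # approximately [3.5, 2.5, 2.0, 2.0]
```

The spread of bits around the average is log_β of the weight ratio, so it depends directly on β: plugging in a decay rate fitted for a different quantizer changes every head's allocation. The sketch returns fractional bit-widths and leaves rounding to the widths a real quantizer supports out of scope.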

Key Points

RateQuant cuts KIVI's perplexity by 70% on Qwen3-8B at 2.5 average bits (from 49.3 to 14.9).
Avoids 'distortion model mismatch' by fitting per-quantizer distortion curves (β from 3.6 to 5.3).
Calibration takes only 1.6 seconds on a single GPU with zero inference overhead.

Why It Matters

Keeping perplexity low at 2.5 average bits per KV cache entry, with no added inference cost, could make long-context LLM serving more practical on limited hardware.
