KV Cache Quant Benchmarks: q5_0 and q5_1 Are Underrated
New benchmarks reveal q8_0/q4_0 pairs perform worse than expected.
Anbeeld's latest benchmark article dives deep into KV cache quantization, testing 38 different quant pairs across three Qwen 3.6 27B configurations (Q5_K_S + 64k context, IQ4_XS + 64k, IQ4_XS + 128k). The study tracks not only how each KV quant affects precision in isolation, but also how it interacts with the noise already introduced by the model's quantized weights. Using a custom fork of BeeLlama.cpp, the author includes rarely benchmarked types like vanilla TurboQuant, TCQ 3-bit/2-bit, and q6_0. Key findings: q5_0 and q5_1 KV quantizations are consistently underrated. They provide strong mid-range quality without the VRAM cost of q8_0 or the poor quality of q4_0. The popular q8_0 / q4_* pairs are overrated: a strong Key cache does not fully rescue a weak Value cache, and these unbalanced pairs perform worse than their community reputation suggests.
For practical use, Anbeeld proposes a clear ladder of pairs: q8_0 / q6_0 or q8_0 / q5_1 for high-end setups, q6_0 / q5_0 for extra headroom, q5_0 / q5_0 or q5_0 / q4_1 when VRAM is constrained, and q4_0 / q4_0 only as a last resort. TurboQuant quantizations are confirmed to be viable only for extreme compression: turbo3_tcq is the only type with decent quality for its size, while turbo4 is slow and essentially useless. The author advises against spending VRAM on bf16 KV for heavily quantized models. Because the KV cache and the weights draw from the same VRAM pool, it is better to balance the quantization of both.
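To make the VRAM trade-off concrete, here is a minimal Python sketch that estimates KV cache size for a few of the pairs above. The block sizes are those of the standard ggml formats (32 values per block, with the scale and offset overhead included). Everything else is an assumption for illustration: the model dimensions (`n_layers`, `n_kv_heads`, `head_dim`) are hypothetical placeholders rather than the real Qwen 3.6 27B values, each pair is read as Key type first and Value type second, and every layer is assumed to hold a full KV cache. q6_0 and the TurboQuant types are left out because their layouts are specific to the fork.

```python
# Bytes per 32-value block for standard ggml types (scale/offset overhead included).
BLOCK_SIZE = 32
BLOCK_BYTES = {
    "f16": 64,   # 16.0 bits per value
    "bf16": 64,  # 16.0 bits per value
    "q8_0": 34,  # 8.5 bits per value
    "q5_1": 24,  # 6.0 bits per value
    "q5_0": 22,  # 5.5 bits per value
    "q4_1": 20,  # 5.0 bits per value
    "q4_0": 18,  # 4.5 bits per value
}


def bits_per_value(qtype: str) -> float:
    """Effective bits per cached value, including block overhead."""
    return BLOCK_BYTES[qtype] * 8 / BLOCK_SIZE


def kv_cache_bytes(k_type: str, v_type: str, n_ctx: int,
                   n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Estimated total bytes for the K and V caches at full context."""
    values = n_ctx * n_layers * n_kv_heads * head_dim  # per tensor (K or V)
    k_bytes = values * BLOCK_BYTES[k_type] // BLOCK_SIZE
    v_bytes = values * BLOCK_BYTES[v_type] // BLOCK_SIZE
    return k_bytes + v_bytes


if __name__ == "__main__":
    # HYPOTHETICAL dimensions, chosen only to show the relative sizes.
    dims = dict(n_ctx=65536, n_layers=48, n_kv_heads=8, head_dim=128)
    pairs = [("bf16", "bf16"), ("q8_0", "q8_0"), ("q8_0", "q5_1"),
             ("q5_0", "q5_0"), ("q5_0", "q4_1"), ("q4_0", "q4_0")]
    for k_type, v_type in pairs:
        gib = kv_cache_bytes(k_type, v_type, **dims) / 2**30
        print(f"{k_type:>5} / {v_type:<5} {gib:6.2f} GiB")
```

The absolute numbers depend entirely on the placeholder dimensions, but the ratios between pairs do not: under these block sizes, q5_0 / q5_0 takes 11 bits per K+V value pair against 17 for q8_0 / q8_0 and 32 for bf16 / bf16, which is the headroom that can go toward a less aggressive weight quant.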
- q5_0 and q5_1 KV quantizations are underrated, offering solid mid-range performance without heavy VRAM usage.
- q8_0/q4_* pairs are overrated; unbalanced Key/Value caches perform worse than expected due to precision mismatch.
- TurboQuant is only useful for extreme compression; turbo3_tcq is decent, but turbo4 is slow and low quality.
Why It Matters
These benchmarks give a practical basis for splitting VRAM between model weights and KV cache when running long-context LLM inference.