Huawei's KVarN compresses KV cache 3–5x with speed gains, not trade-offs
No retraining, a single vLLM flag, and up to 2.4x the throughput of TurboQuant at matched accuracy.
Huawei has released KVarN, a KV-cache quantization technique that challenges the current state of the art. Two baselines frame the comparison. FP8, already in vLLM, offers ~2x capacity with near-zero quality loss. Google's TurboQuant compresses more aggressively, but at a cost in speed (66–80% of BF16 throughput, up to 2.5x slower at burst) and in reasoning accuracy (drops of up to 20 points on AIME25 and LiveCodeBench). KVarN claims the sweet spot: 3–5x compression, up to 1.4x FP16 throughput, and FP16-quality outputs. It also reports up to 2.4x the throughput of TurboQuant at matched accuracy, and it holds reasoning quality at high compression, the exact axis where TurboQuant's low-bit variants fail. The method requires no model changes, retraining, or calibration; it is enabled with a single flag in vLLM and is released under the Apache 2.0 license.
KVarN's key advantage lies in a quantization-aware algorithm that avoids the dequantization bottleneck. Where TurboQuant dequantizes back to BF16 for attention compute, which adds latency, KVarN keeps the compressed representation through the entire forward pass, so compression yields a speed-up rather than a slowdown. The paper reports that accuracy on reasoning benchmarks holds up under compression, and the vLLM integration makes the method usable immediately. For professionals running long-context LLM workloads (e.g., RAG, code assistants, multi-turn agents), this means a 3–5x smaller KV-cache footprint without giving up inference speed or output quality. The open-source release invites community stress-testing, and if the claims hold, KVarN could become the default KV-cache technique in production systems.
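To put the compression ratios in concrete terms, here is a rough back-of-the-envelope sketch in Python. It is not taken from the KVarN release. The model dimensions (32 layers, 8 KV heads, head dimension 128), the 16 GiB cache budget, and the helper names `ModelShape`, `kv_cache_bytes`, and `tokens_that_fit` are all illustrative assumptions, and each ratio is treated as an effective figure that already includes any quantization metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Illustrative transformer dimensions (assumed, not from the article)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape: ModelShape, num_tokens: int, bytes_per_element: int = 2) -> int:
    """Uncompressed KV-cache size: one K and one V tensor per layer.

    bytes_per_element=2 corresponds to a 16-bit cache (FP16 or BF16).
    """
    per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bytes_per_element
    return per_token * num_tokens


def tokens_that_fit(shape: ModelShape, budget_bytes: int, compression: float = 1.0) -> int:
    """Tokens of context that fit in budget_bytes at a given effective compression ratio."""
    per_token = kv_cache_bytes(shape, 1)
    return int(budget_bytes * compression // per_token)


if __name__ == "__main__":
    shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)  # assumed dimensions
    budget = 16 * 1024**3  # assumed 16 GiB reserved for the KV cache

    # Ratios taken from the article: FP8 at ~2x, KVarN's claimed range of 3-5x.
    for label, ratio in [("16-bit baseline", 1.0), ("FP8 (~2x)", 2.0),
                         ("3x", 3.0), ("5x", 5.0)]:
        print(f"{label:>16}: {tokens_that_fit(shape, budget, ratio):,} tokens")
```

Under these assumptions each token costs 128 KiB of 16-bit cache, so 16 GiB holds 131,072 tokens uncompressed, 393,216 at 3x, and 655,360 at 5x. Real deployments will differ with the model architecture, the cache allocator, and the batch size.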
- 3–5x KV cache compression vs. FP8's ~2x, with no quality loss on reasoning benchmarks
- Up to 1.4x FP16 throughput, and up to 2.4x the throughput of TurboQuant at matched accuracy
- Apache 2.0, single vLLM flag, no retraining or calibration required
Why It Matters
Longer context at no latency or accuracy cost, which is critical for production RAG, agents, and code assistants.