3–5x KV cache compression vs. FP8's ~2x, with no quality loss on reasoning benchmarks?

3–5x KV cache compression vs. FP8's ~2x, with no quality loss on reasoning benchmarks

Up to 1.4x FP16 throughput and 2.4x TurboQuant throughput at matched accuracy?

Up to 1.4x FP16 throughput and 2.4x TurboQuant throughput at matched accuracy

Apache 2.0, single vLLM flag, no retraining or calibration required?

Apache 2.0, single vLLM flag, no retraining or calibration required

Open Source

Huawei's KVarN compresses KV cache 3–5x with speed gains, not trade-offs

r/LocalLLaMA June 04, 2026

⚡No retraining, single vLLM flag, and up to 2.4x TurboQuant throughput at higher accuracy.

Deep Dive

Huawei just released KVarN, a new KV-cache quantization technique that challenges the current state of the art. While FP8 (already in vLLM) offers ~2x capacity with near-zero quality loss, and Google's TurboQuant pushes aggressive compression but at the cost of speed (66–80% of BF16 throughput, up to 2.5x slower at burst) and reasoning accuracy (up to 20 points drop on AIME25 and LiveCodeBench), KVarN claims the sweet spot: 3–5x compression, up to 1.4x FP16 throughput, and FP16-quality outputs. Even more impressive, it delivers up to 2.4x the throughput of TurboQuant at matched accuracy, and it holds reasoning quality at high compression—the exact axis where TurboQuant's low-bit variants fail. The method requires no model changes, retraining, or calibration; it's a single flag in vLLM, Apache 2.0 licensed.

KVarN's key advantage lies in its quantization-aware algorithm that avoids the dequantization bottleneck. Where TurboQuant dequantizes back to BF16 for attention compute (causing latency), KVarN keeps the compressed representation through the entire forward pass, enabling actual speed-up rather than slowdown. The paper shows robust performance on reasoning benchmarks, and the vLLM integration means immediate practical use. For professionals running long-context LLMs (e.g., RAG, code assistants, multi-turn agents), this translates to drastically lower memory usage without trading inference speed or output quality. The open-source release invites community stress-testing, and if claims hold, KVarN could become the default KV-cache technique in production systems.

Key Points

3–5x KV cache compression vs. FP8's ~2x, with no quality loss on reasoning benchmarks
Up to 1.4x FP16 throughput and 2.4x TurboQuant throughput at matched accuracy
Apache 2.0, single vLLM flag, no retraining or calibration required

Why It Matters

Longer context at no latency or accuracy cost—critical for production RAG, agents, and code assistants.

Read Original Article

Huawei's KVarN compresses KV cache 3–5x with speed gains, not trade-offs

Why It Matters

Related Articles

🚀 Stay Ahead in AI