Research & Papers

KVarN achieves 3–4x KV-cache compression with near-zero accuracy loss

Huawei’s new method cuts KV-cache memory by 3–4x while keeping accuracy on tough reasoning benchmarks.

Deep Dive

KVarN, introduced by Huawei researchers, tackles a critical bottleneck in large language model inference: the memory footprint of the key-value (KV) cache. Earlier quantization techniques often degrade accuracy, especially in the autoregressive decode phase. KVarN's change is simple: it applies Hadamard rotations, then variance normalization along both axes of the K and V matrices, and only then rounds to nearest. This targets the root cause of the quantization errors, namely outlier tokens that end up with poorly fitting scale factors and so suffer catastrophic rounding error. Normalizing the variance sharply reduces the magnitude of these errors, which are disproportionately harmful in decode-heavy settings.
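
The rotate, normalize, round pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (hadamard, quantize_rtn, kvarn_like_roundtrip, plain_roundtrip, mean_token_rel_error), the 4-bit width, the single per-tensor scale, and the one-pass row-then-column normalization are all hypothetical choices made for this example, since the article does not specify them.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_rtn(x, bits=4):
    """Symmetric round-to-nearest with one scale for the whole tensor (an assumption)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q, scale

def plain_roundtrip(X, bits=4):
    """Baseline: quantize and dequantize X directly."""
    q, scale = quantize_rtn(X, bits)
    return q * scale

def kvarn_like_roundtrip(X, bits=4, eps=1e-8):
    """Rotate, normalize variance on both axes, quantize, then undo every step.

    X has shape (tokens, head_dim). The row/column ordering and the single
    normalization pass are assumptions made for this sketch.
    """
    H = hadamard(X.shape[1])
    R = X @ H                                        # Hadamard rotation along head_dim
    row_std = R.std(axis=1, keepdims=True) + eps     # per-token scale
    R1 = R / row_std
    col_std = R1.std(axis=0, keepdims=True) + eps    # per-channel scale
    N = R1 / col_std
    q, scale = quantize_rtn(N, bits)                 # round to nearest
    R_hat = q * scale * col_std * row_std            # dequantize and undo normalization
    return R_hat @ H.T                               # undo the rotation (H is orthonormal)

def mean_token_rel_error(X, X_hat):
    """Average over tokens of ||x - x_hat|| / ||x||."""
    num = np.linalg.norm(X - X_hat, axis=1)
    den = np.linalg.norm(X, axis=1)
    return float(np.mean(num / den))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((64, 32))
    X[5] *= 50.0  # one synthetic outlier token
    print("plain RTN error:     ", mean_token_rel_error(X, plain_roundtrip(X)))
    print("KVarN-like RTN error:", mean_token_rel_error(X, kvarn_like_roundtrip(X)))
```

In this toy setup the outlier token dictates the scale for plain round-to-nearest, so ordinary tokens are rounded almost entirely to zero. The normalized variant gives every token a comparable dynamic range before rounding, which is the intuition the article describes.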

On the challenging AIME24 benchmark, KVarN stays within 0–1% of the full-precision baseline's accuracy while compressing the cache by 3–4x. It also achieves a wall-clock speed-up over the fp16 baseline in the vLLM inference engine, something most prior KV-cache compression methods fail to do. This makes KVarN particularly valuable for test-time scaling applications such as multi-step reasoning, code generation, and agentic workflows, where long decode sequences amplify memory pressure. The method is open source and integrated into vLLM, making it immediately usable by the community.
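
To make the memory pressure concrete, the standard KV-cache size formula can be evaluated for a long sequence. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128) and the 32,768-token sequence are hypothetical values not taken from the article, and the 3x and 4x ratios are applied as simple divisors without modeling any overhead for scale factors.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Size of an uncompressed KV cache in bytes; the leading 2 counts K and V."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

GIB = 1024 ** 3

# Hypothetical model shape and sequence length, chosen only for illustration.
fp16_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                            head_dim=128, seq_len=32768)

print(f"fp16 cache: {fp16_bytes / GIB:.2f} GiB")
for ratio in (3, 4):
    print(f"{ratio}x compression: {fp16_bytes / ratio / GIB:.2f} GiB")
```

Because the cache grows linearly with sequence length and batch size, the saving scales the same way: every extra decoded token costs a third to a quarter of the memory it would in fp16 under these assumptions.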

Key Points
  • Combines Hadamard rotations with variance normalization on both axes of the K and V matrices to sharply reduce outlier-driven quantization errors.
  • Delivers 3–4x compression with only 0–1% accuracy loss on AIME24, outperforming prior quantization methods.
  • Provides wall-clock speed-ups over the fp16 baseline in vLLM, critical for decode-heavy tasks like reasoning and agentic workflows.

Why It Matters

KVarN enables memory-efficient long-sequence inference for reasoning and agentic AI without sacrificing accuracy or speed.