Open Source

llama.cpp's MTP KV cache quantization: free VRAM with no performance loss

Quantizing the draft KV cache to q8_0 saves memory without impacting speed or acceptance rate.

Deep Dive

In the latest llama.cpp builds, models with MTP (multi-token prediction) support, such as Qwen3.6/3.5, need additional VRAM to hold the draft model's key-value (KV) cache. A Reddit user (u/legit_split) found that this draft KV cache can be quantized to 8-bit (q8_0) independently of the main cache, with no measurable loss in acceptance rate or speed in their benchmarks: a genuine “free lunch” for memory-constrained setups.

Running benchmarks on Qwen3.6-27B-Q8_0 with 2x MI50 32GB GPUs, the user compared standard MTP inference against runs with `--cache-type-k-draft q8_0 --cache-type-v-draft q8_0`. The aggregate accept rate was identical (0.735) and wall times were within about 0.3% of each other (49.46s vs 49.32s). With tensor parallelism (`-sm tensor`), the picture was the same: 38.42s vs 38.29s at an accept rate of 0.7411. The only difference is that VRAM usage drops, which leaves room for a slightly longer context window or for other processes. The technique is especially valuable for local LLM users running large models on consumer hardware.
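To get a feel for how much memory is at stake, here is a rough back-of-the-envelope sketch in Python. It assumes the draft cache would otherwise be stored as f16 (2 bytes per value) and relies on the q8_0 block layout of 32 int8 values plus one 16-bit scale (34 bytes per 32 values). The layer count, head count, head size, and context length below are hypothetical placeholders, not the actual shape of the Qwen MTP draft cache; only the ratio between the two formats carries over.

```python
# q8_0 packs values in blocks of 32: 32 int8 quants + one 16-bit scale = 34 bytes.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 34
F16_BYTES_PER_VALUE = 2


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Approximate K+V cache size in bytes for a single sequence."""
    # Every cached token stores one K row and one V row per layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    if cache_type == "f16":
        return n_values * F16_BYTES_PER_VALUE
    if cache_type == "q8_0":
        # Assumes the value count is a multiple of the 32-value block size.
        return n_values // Q8_0_BLOCK_VALUES * Q8_0_BLOCK_BYTES
    raise ValueError(f"unsupported cache type: {cache_type}")


# Hypothetical draft-cache shape, for illustration only.
shape = dict(n_layers=1, n_kv_heads=8, head_dim=128, n_ctx=32768)

f16_size = kv_cache_bytes(**shape, cache_type="f16")
q8_size = kv_cache_bytes(**shape, cache_type="q8_0")

print(f"f16 : {f16_size / 2**20:.1f} MiB")
print(f"q8_0: {q8_size / 2**20:.1f} MiB")
print(f"saved: {1 - q8_size / f16_size:.1%}")
```

Whatever the real shape is, q8_0 takes 34/64 of the f16 footprint under these assumptions, so the draft cache shrinks by a little under half. The absolute saving depends on how large the draft cache is relative to the rest of the model.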

Key Points
  • Quantizing the MTP draft KV cache to q8_0 saves VRAM with no measurable impact on token acceptance rate (0.735) or wall time (~49.4s).
  • Tested on Qwen3.6-27B-Q8_0 with 2x MI50 32GB GPUs via llama.cpp; quantization made no measurable difference either with or without tensor parallelism.
  • Command: add `--cache-type-k-draft q8_0 --cache-type-v-draft q8_0` to your llama.cpp launch flags.
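If you launch llama.cpp from a script, the two flags can be appended programmatically. The sketch below is a minimal illustration, not a full launch recipe: the helper name `with_draft_kv_quant`, the `llama-server` binary name, and the `model.gguf` path are placeholders, and it assumes the long options are spelled with two leading dashes. Whatever options your build needs to enable MTP are not shown here and would go in the base command.

```python
import shlex


def with_draft_kv_quant(args, cache_type="q8_0"):
    """Return a copy of args with the draft KV cache type flags appended."""
    return [
        *args,
        "--cache-type-k-draft", cache_type,
        "--cache-type-v-draft", cache_type,
    ]


# Hypothetical base command; replace with your own binary, model, and MTP options.
base = ["llama-server", "-m", "model.gguf"]
cmd = with_draft_kv_quant(base)

# Print a shell-safe version of the command for inspection or copy-paste.
print(shlex.join(cmd))
```

Keeping the helper separate from the base command makes it easy to run the same benchmark with and without draft cache quantization and compare accept rate and wall time on your own hardware.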

Why It Matters

Frees VRAM for larger contexts or concurrent models, a practical optimization for local LLM inference.