llama.cpp's MTP KV cache quantization: free VRAM with no performance loss
Quantizing the draft KV cache to q8_0 saves memory without impacting speed or acceptance rate.
In recent llama.cpp builds, models with MTP (multi-token prediction) support, such as Qwen3.5 and Qwen3.6, need additional VRAM to hold the draft key-value (KV) cache. Reddit user u/legit_split found that this draft KV cache can be quantized to 8-bit (q8_0) independently of the main cache, with no measurable drop in speed or draft acceptance rate in their benchmarks: a genuine “free lunch” for memory-constrained setups.
Benchmarking a 27B Q8_0 model on two MI50 32GB GPUs, the user compared standard MTP inference against runs with `--cache-type-k-draft q8_0 --cache-type-v-draft q8_0`. The aggregate accept rate was identical (0.735), and wall times were within about 0.3% of each other (49.46s vs 49.32s). The same held with tensor parallelism (`-sm tensor`): 38.42s vs 38.29s, with an accept rate of 0.7411. The only real difference is lower VRAM usage, which leaves room for a slightly longer context window or for other processes. That makes the technique especially useful for local LLM users running large models on consumer hardware.
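To put the timing gap in perspective, here is a minimal sketch that expresses each pair of wall times as a relative difference. It uses only the figures quoted above; the helper name `relative_gap` and the run labels are made up for illustration, and the post does not say how much run-to-run variance to expect.

```python
# Wall times in seconds, copied from the benchmark figures quoted in the text.
# Each pair is (one run, the other run); the labels are illustrative only.
runs = {
    "without -sm tensor": (49.46, 49.32),
    "with -sm tensor": (38.42, 38.29),
}


def relative_gap(a: float, b: float) -> float:
    """Absolute difference between two timings as a fraction of the larger one."""
    return abs(a - b) / max(a, b)


for name, (t1, t2) in runs.items():
    print(f"{name}: {relative_gap(t1, t2):.2%} apart")
```

Both gaps come out at roughly a third of a percent, which is small enough that ordinary run-to-run variation could plausibly account for it.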
- Quantizing the MTP draft KV cache to q8_0 saves VRAM with no measurable impact on token acceptance rate (0.735) or wall time (~49.4s).
- Tested on Qwen3.6-27B-Q8_0 with two MI50 32GB GPUs via llama.cpp; quantizing the draft cache made no measurable difference either with or without tensor parallelism.
- Command: add `--cache-type-k-draft q8_0 --cache-type-v-draft q8_0` to your llama.cpp launch flags.
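For anyone who launches llama.cpp from a script, the flag change can be sketched as below. This is an illustration, not the poster's actual command. The two draft-cache flags are written in the double-dash long form that llama.cpp's argument parser accepts. The model path and the helper name are placeholders, and whatever options your build needs to enable MTP are left out because the post does not list them.

```python
import shlex

# The two draft-cache flags discussed above, in long form.
DRAFT_KV_FLAGS = [
    "--cache-type-k-draft", "q8_0",
    "--cache-type-v-draft", "q8_0",
]


def with_quantized_draft_cache(base_cmd: list[str]) -> list[str]:
    """Return a copy of base_cmd with the draft KV cache flags appended."""
    return [*base_cmd, *DRAFT_KV_FLAGS]


# Hypothetical base command: substitute your own binary, model path and options.
base = ["llama-server", "-m", "/path/to/model.gguf"]
print(shlex.join(with_quantized_draft_cache(base)))
```

The script only prints the command, so you can inspect it before running it or handing the list to `subprocess.run`.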
Why It Matters
Frees VRAM for larger contexts or concurrent models, a practical optimization for local LLM inference.
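The post does not report how much VRAM is saved, but a rough estimate is easy to sketch. The example assumes the draft cache would otherwise be stored as f16 (2 bytes per value). It also relies on the q8_0 block layout, which packs 32 values into 34 bytes: 32 int8 quants plus one f16 scale. The layer count, head count, head size and context length are placeholder values, not the real dimensions of any Qwen model.

```python
F16_BYTES_PER_VALUE = 2.0
Q8_0_BYTES_PER_VALUE = 34 / 32  # 32 int8 quants + one f16 scale per block


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: float) -> float:
    """Approximate size of a KV cache: K and V each hold
    n_kv_heads * head_dim values per token per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Placeholder dimensions, for illustration only.
dims = dict(n_layers=1, n_kv_heads=4, head_dim=256, n_ctx=32768)

f16 = kv_cache_bytes(**dims, bytes_per_value=F16_BYTES_PER_VALUE)
q8 = kv_cache_bytes(**dims, bytes_per_value=Q8_0_BYTES_PER_VALUE)

print(f"f16:   {f16 / 2**20:.0f} MiB")
print(f"q8_0:  {q8 / 2**20:.0f} MiB")
print(f"saved: {1 - q8 / f16:.1%}")
```

Whatever the real dimensions are, the ratio is fixed by the formats: q8_0 takes 34/64 of the f16 footprint, a saving of about 47% on the draft cache. The absolute number of megabytes freed depends on the draft layers and the context length you run.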