llama.cpp lands MTP speculative decoding, boosting Qwen 27B up to 2.44× on Strix Halo
llama.cpp's new MTP mode pushes single-stream Qwen3.6 27B to 2.44× faster on AMD's Strix Halo...
The open-source llama.cpp project just landed a major performance upgrade: MTP (multi-token prediction) speculative decoding, merged in PR #22673 on May 16. Contributor C_Coffie benchmarked the feature on two rigs: a Framework Desktop with Strix Halo (AMD ROCm 7.0.2) and a machine with one or two RTX 3090s (CUDA 12.9, 450W). With the dense Qwen3.6 27B model, the Strix Halo saw a 2.44× speedup at Q8_0 (7.4 → 18.1 tok/s) and 1.81× at Q4_K_M. On the dual 3090 rig, Q8_0 jumped 2.17× (25.7 → 55.9 tok/s) with draft depth n=3. The single 3090 at Q4_K_M gained 1.54× (38.7 → 59.5 tok/s, n=2).
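As a quick sanity check on the arithmetic, the sketch below recomputes each speedup from the throughput pairs quoted above. The dictionary labels and the `speedup` helper are illustrative names, not part of any benchmark tooling. Because the published tok/s figures are rounded to one decimal, the recomputed ratios can differ from the quoted ones by about 0.01.

```python
# (baseline tok/s, MTP tok/s, speedup as quoted in the article)
RESULTS = {
    "Strix Halo, Q8_0": (7.4, 18.1, 2.44),
    "Dual RTX 3090, Q8_0, n=3": (25.7, 55.9, 2.17),
    "Single RTX 3090, Q4_K_M, n=2": (38.7, 59.5, 1.54),
}


def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Return the throughput ratio MTP / baseline."""
    return mtp_tps / baseline_tps


for label, (base, mtp, quoted) in RESULTS.items():
    ratio = speedup(base, mtp)
    # Rounded inputs reproduce the quoted figure only to within ~0.01.
    print(f"{label}: recomputed {ratio:.3f}x, quoted {quoted:.2f}x")
```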
MTP drafts several upcoming tokens cheaply, then has the full model verify the whole draft in a single forward pass, so each accepted draft token avoids a separate expensive forward pass. Gains are smaller for Mixture-of-Experts models like Qwen3.6 35B-A3B, only 1.24–1.40×, because each token already activates just ~3B parameters, so the baseline is cheap to begin with. The feature is enabled via two new flags: --spec-type draft-mtp and --spec-draft-n-max N. Output remains byte-identical to baseline at the same seed and temperature. Separately, C_Coffie discovered that earlier 3090 benchmarks had been run under a 200W power cap; re-benchmarking at 350–450W showed dense 27–32B models gaining +70% to +113%, a reminder to check power limits when benchmarking LLM performance.
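To illustrate why drafting plus bulk verification saves forward passes without changing the output, here is a toy sketch of generic greedy speculative decoding. It is not llama.cpp's implementation: the `target` and `draft` functions are made-up integer "models", a separate draft function stands in for the model's MTP predictions, and `n_draft` only loosely mirrors the role of --spec-draft-n-max. Verification is simulated token by token but counted as one pass, since a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def target(seq: List[int]) -> int:
    """Toy 'full model': a deterministic function of the last token."""
    return (seq[-1] * 3 + 1) % 7


def draft(seq: List[int]) -> int:
    """Toy drafter: agrees with the target except at every 5th position."""
    guess = target(seq)
    return (guess + 1) % 7 if len(seq) % 5 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one full-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq[len(prompt):]


def speculative_decode(prompt: List[int], n_new: int, n_draft: int) -> Tuple[List[int], int]:
    """Draft n_draft tokens, verify them together, keep the matching prefix."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n_new:
        proposal: List[int] = []
        for _ in range(n_draft):
            proposal.append(draft(seq + proposal))
        passes += 1  # one (simulated) batched verification pass
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace first mismatch, drop the rest
                break
            accepted.append(tok)
        else:
            # Every draft matched: the same pass also yields one extra token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], passes


baseline = greedy_decode(target, [1], 20)
tokens, passes = speculative_decode([1], 20, n_draft=3)
print(tokens == baseline, passes)
```

Because every kept token is one the full model would have chosen anyway, the speculative output matches the baseline exactly; the saving comes only from needing fewer verification passes than generated tokens.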
- MTP speculative decoding merged into mainline llama.cpp May 16 via PR #22673
- Qwen3.6 27B at Q8_0 achieved 2.44× on Strix Halo and 2.17× on dual RTX 3090
- MoE models see smaller gains (1.24–1.40×) because per-token cost is already low
Why It Matters
Roughly 1.5–2.4× faster generation on dense local models with identical output and no extra hardware: a significant gain for self-hosted inference on both AMD and NVIDIA systems.