Open Source

llama.cpp lands MTP speculative decoding, boosting Qwen 27B up to 2.44× on Strix Halo

r/LocalLLaMA May 19, 2026

⚡llama.cpp's new MTP mode pushes single-stream Qwen3.6 27B to 2.44× faster on AMD's Strix Halo...

Deep Dive

The open-source llama.cpp project has landed a major performance upgrade: MTP (multi-token prediction) speculative decoding, merged in PR #22673 on May 16. Contributor C_Coffie benchmarked the feature on two rigs: a Framework Desktop with Strix Halo (AMD ROCm 7.0.2) and an RTX 3090 setup tested with one and two GPUs (CUDA 12.9, 450W). With the dense Qwen3.6 27B model, the Strix Halo saw a 2.44× speedup at Q8_0 (7.4 → 18.1 tok/s) and 1.81× at Q4_K_M. On the dual 3090 rig, Q8_0 throughput rose 2.17× (25.7 → 55.9 tok/s) with draft depth n=3. The single 3090 at Q4_K_M reached 1.54× (38.7 → 59.5 tok/s, n=2).
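As a quick sanity check on the figures above, the speedup is simply the MTP throughput divided by the baseline throughput. The short sketch below recomputes the ratios from the quoted tok/s values; the `RESULTS` table and `speedup` helper are illustrative names, not part of the benchmark. Because the quoted throughputs are rounded, the recomputed ratios can differ from the reported ones in the second decimal place, so the check allows a small tolerance.

```python
# Throughput figures (tokens/second) as quoted in the text above.
# Each entry: (label, baseline tok/s, MTP tok/s, speedup as reported).
RESULTS = [
    ("Strix Halo, Q8_0", 7.4, 18.1, 2.44),
    ("Dual RTX 3090, Q8_0, n=3", 25.7, 55.9, 2.17),
    ("Single RTX 3090, Q4_K_M, n=2", 38.7, 59.5, 1.54),
]


def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Return the throughput ratio MTP / baseline."""
    return mtp_tps / baseline_tps


for label, base, mtp, reported in RESULTS:
    ratio = speedup(base, mtp)
    # The quoted tok/s values are rounded, so allow a small tolerance
    # between the recomputed ratio and the reported one.
    assert abs(ratio - reported) < 0.01, label
    print(f"{label}: {base} -> {mtp} tok/s = {ratio:.3f}x (reported {reported}x)")
```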

MTP works by drafting several tokens ahead and having the full model check them together, accepting or rejecting them in bulk, which reduces the number of expensive forward passes needed per generated token. Gains are smaller for Mixture-of-Experts models like Qwen3.6 35B-A3B, at only 1.24–1.40×, because each token already activates just ~3B parameters, making the baseline cheap. The feature is enabled via two new flags: --spec-type draft-mtp and --spec-draft-n-max N. Output remains byte-identical to baseline at the same seed and temperature.

Separately, C_Coffie discovered that earlier 3090 benchmarks had been run under a 200W power cap. Re-benchmarking at 350–450W showed dense 27–32B models gaining +70% to +113%, a reminder that power limits should be checked and reported in LLM performance testing.
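The draft-and-verify idea behind speculative decoding can be illustrated with a toy example. This is a minimal sketch and not llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full model and the MTP draft, and only greedy decoding is covered (verification under sampling is more involved). It shows two things: accepted tokens always match what the full model would have produced on its own, and fewer verification passes are needed than tokens generated.

```python
from typing import List, Tuple

VOCAB = 17  # toy vocabulary size (assumption for illustration)


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the full model: a deterministic next token."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the draft: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % VOCAB


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one full-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt: List[int], n_new: int, draft_n_max: int) -> Tuple[List[int], int]:
    """Return (generated tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. Draft up to draft_n_max tokens cheaply.
        drafted: List[int] = []
        for _ in range(draft_n_max):
            drafted.append(draft_next(out + drafted))
        # 2. Verify. A real engine scores all drafted positions in one batched
        #    forward pass; here that pass is simulated position by position.
        passes += 1
        accepted: List[int] = []
        for tok in drafted:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # Every draft matched, so the same pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_new], passes


prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 24)
spec, passes = speculative_decode(prompt, 24, draft_n_max=3)
assert spec == baseline  # identical output under greedy decoding
print(f"{len(spec)} tokens in {passes} verification passes (baseline: {len(baseline)} passes)")
```

The ratio of tokens to verification passes depends entirely on how often the draft agrees with the full model; the toy draft here is far more regular than a real one, so the numbers it prints say nothing about real-world acceptance rates.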

Key Points

MTP speculative decoding merged into mainline llama.cpp May 16 via PR #22673
Qwen3.6 27B at Q8_0 achieved 2.44× on Strix Halo and 2.17× on dual RTX 3090
MoE models see smaller gains (1.24–1.40×) because per-token cost is already low

Why It Matters

Up to roughly 2.4× faster generation on dense local models with no change in output, a significant gain for self-hosted inference on both AMD and NVIDIA hardware.
