MTP on Qwen 3.6 27B: Coding speeds up 171%, but creative tasks slow down

r/LocalLLaMA May 11, 2026

⚡ Speculative decoding speeds up coding by up to 171% on F16 but slows creative writing by 9% on Q4_K_M.

Deep Dive

The study tested Qwen 3.6 27B with MTP (multi-token prediction) speculative decoding across four task types, five quantization levels, three temperatures, and multiple draft token counts. The core finding: task type dominates. For coding tasks (writing functions, debugging, refactoring), draft token acceptance rates hit 79–89%, leading to large speedups. The F16 model, normally slow at 6.6 tok/s, gains up to 171% with MTP, roughly 2.7x its baseline throughput. Q8_0 also benefits across all tasks (48–123% faster). But for creative writing (stories, poetry, roleplay), acceptance rates drop to 39–48%, and lower quants like Q4_K_M actually run 9% slower with MTP enabled. Analysis tasks show marginal gains or slight slowdowns on low quants.
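To put the percentages in absolute terms: a speedup of p% multiplies throughput by 1 + p/100. The sketch below applies that to the 6.6 tok/s F16 baseline quoted above. The function name is illustrative, and the resulting tok/s figure is derived arithmetic, not a measurement reported by the study.

```python
def throughput_after(baseline_tok_s: float, change_pct: float) -> float:
    """Return tokens/sec after a percentage change (positive means faster)."""
    return baseline_tok_s * (1.0 + change_pct / 100.0)


# F16 baseline and best-case coding gain, both taken from the article.
f16_best = throughput_after(6.6, 171)
print(f"F16 with MTP: {f16_best:.1f} tok/s ({f16_best / 6.6:.2f}x baseline)")
```

A 171% gain is therefore about 2.7x the baseline, not a full tripling.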

The optimal draft token count is N=3 for most scenarios; only F16 benefits from N=4. Temperature and the quantization used for MTP (q8 versus matching the model quant) barely affect results. The memory bandwidth bottleneck explains the pattern: decoding F16 weights saturates memory bandwidth, so each accepted draft token saves an expensive full-model pass. Faster quants (Q4_K_M at 16GB) already decode quickly, so on less predictable tasks the overhead of generating and verifying draft tokens outweighs the benefit. Practical recommendation: always use MTP for coding at any quant; use MTP for creative tasks only on F16 or Q8_0 models.
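The trade-off between acceptance rate and drafting overhead can be illustrated with a common back-of-the-envelope model of speculative decoding. This is a sketch under stated assumptions, not the study's methodology: it assumes each draft token is accepted independently with a fixed probability, and it folds all drafting and verification overhead into one hypothetical parameter, overhead_per_draft, expressed as a fraction of a full-model pass. The overhead values used below are invented for illustration; the acceptance rates are picked from inside the ranges the article reports.

```python
def expected_tokens_per_pass(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes independent acceptance with probability accept_rate and that
    drafting stops at the first rejection. The verification pass always
    contributes one token of its own.
    """
    if accept_rate >= 1.0:
        return n_draft + 1.0
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, n_draft: int,
                      overhead_per_draft: float) -> float:
    """Speedup over plain decoding (values below 1.0 mean a slowdown).

    overhead_per_draft is a hypothetical cost per draft token, as a
    fraction of one full-model pass.
    """
    cost = 1.0 + n_draft * overhead_per_draft
    return expected_tokens_per_pass(accept_rate, n_draft) / cost


# Hypothetical overheads: low when the full pass is very expensive
# (bandwidth-bound F16), higher when the base model already decodes quickly.
for label, overhead in (("low overhead", 0.05), ("high overhead", 0.35)):
    for task, rate in (("coding", 0.84), ("creative", 0.44)):
        s = estimated_speedup(rate, n_draft=3, overhead_per_draft=overhead)
        print(f"{label:13s} {task:8s} N=3 -> {s:.2f}x")
```

Under these assumed numbers the high-acceptance case stays well above 1.0x in both settings, while the low-acceptance case drops below 1.0x once the relative overhead is high, which is the same qualitative pattern described above.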

Key Points

Coding tasks achieve 79–89% draft token acceptance, yielding up to a 171% speedup on F16.
Creative writing acceptance drops to 39–48%, causing a 9% slowdown on Q4_K_M.
The optimal draft token count is N=3 for most quants; N=4 helps only on F16, where memory bandwidth is the tightest constraint.

Why It Matters

Task-aware speculative decoding is critical: use MTP for code, skip it for creative work on low-quant models.
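One way to encode this rule of thumb is a small selection function. This is a hypothetical helper, not part of any inference tool: the task labels, the function name, and the conservative handling of tasks other than coding and creative writing are all assumptions layered on the article's recommendation.

```python
# Quants for which the article recommends MTP even on creative tasks.
HIGH_PRECISION_QUANTS = {"F16", "Q8_0"}


def should_enable_mtp(task: str, quant: str) -> bool:
    """Decide whether to enable MTP for a given task label and quant."""
    task = task.lower()
    quant = quant.upper()
    if task == "coding":
        # Article: always use MTP for coding, at any quant.
        return True
    # Creative writing: only on F16 or Q8_0, per the article.
    # Other tasks (e.g. analysis): the article reports marginal gains or
    # slight slowdowns on low quants, so this sketch assumes the same
    # conservative rule applies.
    return quant in HIGH_PRECISION_QUANTS


print(should_enable_mtp("coding", "Q4_K_M"))
print(should_enable_mtp("creative", "Q4_K_M"))
print(should_enable_mtp("creative", "Q8_0"))
```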
