Llama.cpp MTP cuts inference time 41% on Qwen3.6 27B with RTX 3090
Speculative decoding raises text generation speed 85% while prompt processing drops 42%, a net win for long contexts.
A real-world benchmark of llama.cpp's Multi-Token Prediction (MTP) feature shows substantial time savings for long-context inference. Running unsloth's Qwen3.6-27B-MTP Q4_K_M GGUF on a headless RTX 3090 with 24GB VRAM, a user compared the latest version without MTP (server-cuda13-b9174) against a master fork with MTP enabled. Both runs used a 128k context, a q8_0 KV cache, and identical prompts involving ~85,000 tokens for research and coding tasks.
Without MTP, prompt processing (PP) hit 1,050 tok/s and text generation (TG) averaged 27 tok/s, for a total of ~39 minutes on the 85k-token workload. With MTP (spec-draft-n-max=3, draft-p-min=0), PP dropped 42% to 600 tok/s, but TG rose 85% to 50 tok/s. The net effect: total completion time fell 41% to ~23 minutes, a 1.7x speedup. The user noted that MTP helps generation-heavy tasks most; prompt-heavy workflows may see less improvement. The setup also ran a dual-agent critic model, so single-model deployments could see even larger gains.
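The percentages quoted above can be checked from the rounded figures in the text. The short Python sketch below is only illustrative arithmetic, not part of the original benchmark, and the helper name `pct_change` is a made-up convenience. Note that the rounded PP rates give a drop closer to 43% than the reported 42%, presumably because the published throughput numbers are themselves rounded.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Rounded figures as reported in the text.
pp_change = pct_change(1050, 600)   # prompt processing, tok/s
tg_change = pct_change(27, 50)      # text generation, tok/s
time_change = pct_change(39, 23)    # total wall-clock time, minutes
speedup = 39 / 23                   # baseline time divided by MTP time

print(f"PP change:   {pp_change:+.1f}%")
print(f"TG change:   {tg_change:+.1f}%")
print(f"Time change: {time_change:+.1f}%")
print(f"Speedup:     {speedup:.2f}x")
```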
- MTP (speculative decoding) with draft size 3 cuts total time by 41% on Qwen3.6-27B Q4_K_M at 85k tokens
- TG speed jumps from 27 to 50 tok/s (+85%), while PP drops from 1,050 to 600 tok/s (-42%)
- Tested on headless RTX 3090 24GB with 128k context, q8_0 KV cache; recommended for generation-heavy use cases
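To see why generation-heavy workloads benefit most, a rough model helps: treat total time as prompt tokens divided by PP speed plus generated tokens divided by TG speed. The sketch below plugs in the reported rates and solves for the prompt-to-generation ratio at which MTP stops paying off. It assumes a single pass with no prompt caching and constant throughput, which the original benchmark does not claim; the function names are hypothetical.

```python
# Reported throughput in tok/s (rounded figures from the text).
PP_BASE, TG_BASE = 1050.0, 27.0   # without MTP
PP_MTP, TG_MTP = 600.0, 50.0      # with MTP

def total_seconds(prompt_tokens: float, gen_tokens: float,
                  pp: float, tg: float) -> float:
    """Simplified model: prompt time plus generation time."""
    return prompt_tokens / pp + gen_tokens / tg

def break_even_ratio() -> float:
    """Prompt tokens per generated token at which both setups tie."""
    pp_penalty = 1 / PP_MTP - 1 / PP_BASE   # extra seconds per prompt token
    tg_gain = 1 / TG_BASE - 1 / TG_MTP      # seconds saved per generated token
    return tg_gain / pp_penalty

ratio = break_even_ratio()
print(f"MTP wins while prompt/generated tokens < {ratio:.1f}")

# Two illustrative workloads (token counts are made up for the example).
for prompt, gen in [(8_000, 4_000), (85_000, 1_000)]:
    base = total_seconds(prompt, gen, PP_BASE, TG_BASE)
    mtp = total_seconds(prompt, gen, PP_MTP, TG_MTP)
    print(f"{prompt:>6} prompt / {gen:>5} generated: "
          f"base {base:6.1f}s, MTP {mtp:6.1f}s")
```

Under these assumptions MTP comes out ahead unless the prompt is more than roughly 24 times longer than the generated output, which is consistent with the user's note that prompt-heavy workflows see less improvement.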
Why It Matters
Speculative decoding unlocks 1.7x faster long-context inference on consumer GPUs, making local AI agents more practical.