Open Source

Qwen3-27B with MTP grafted onto Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR

MTP-based speculative decoding delivers roughly 2.5x throughput on commodity GPUs via an unmerged llama.cpp PR.

Deep Dive

A developer known as havenoammo has successfully brought Multi-Token Prediction (MTP) to local GGUF deployments of Qwen3-27B, achieving roughly 2.5x the token throughput of the same UD XL GGUF running without MTP. The approach uses Unsloth's UD XL quantizations for the base model, keeping it at low-bit precision, while grafting three MTP draft heads quantized at Q8_0 on top. This preserves speculative accuracy without significant VRAM overhead. The key enabler is an unmerged pull request (#22673) for llama.cpp that adds MTP support; havenoammo merged that PR onto master and built llama-server with CUDA support. Running with `--spec-type mtp --spec-draft-n-max 3` yielded solid acceptance rates, meaning most draft tokens were accepted and the extra computation was not wasted.
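A minimal launch sketch: only `--spec-type` and `--spec-draft-n-max` come from the write-up (they are added by the unmerged PR); the model filename and the remaining llama-server options are illustrative assumptions.

```bash
# Launch llama-server built from the MTP branch (PR #22673).
# --spec-type and --spec-draft-n-max are the flags cited in the write-up;
# the model filename and other options below are illustrative placeholders.
./build/bin/llama-server \
  -m qwen3-27b-ud-xl-mtp.gguf \
  -ngl 99 \
  --ctx-size 8192 \
  --spec-type mtp \
  --spec-draft-n-max 3
```

With three grafted MTP heads, `--spec-draft-n-max 3` drafts up to three tokens per step, which is where the reported speedup comes from.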

The release includes the grafted GGUF files, the raw MTP layer weights, and a conversion script so others can apply the same technique to different models. Previously, MTP for Qwen models was effectively locked to production frameworks like SGLang or vLLM. This work brings it to the widely used llama.cpp ecosystem, enabling local inference at nearly 2.5x the speed of a standard low-bit GGUF. The developer provides full build instructions—just three git commands to merge PR #22673—and expects the PR to land in mainline soon, at which point this will work out of the box. For now, it offers a significant performance boost for anyone running Qwen3-27B on consumer hardware, especially for batch or interactive workloads.
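The exact three git commands are not reproduced in the summary; a generic reconstruction using GitHub's pull-request refs, followed by the standard llama.cpp CUDA build, would look roughly like this:

```bash
# From a llama.cpp checkout on master: fetch the unmerged MTP PR and merge it in.
git fetch origin pull/22673/head:mtp-support
git merge mtp-support

# Standard llama.cpp CUDA build; llama-server ends up in build/bin/.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

Once the PR lands upstream, the fetch-and-merge step should no longer be necessary.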

Key Points
  • 2.5x token throughput measured vs same UD XL GGUF without MTP
  • MTP draft heads kept at Q8_0 while base model stays in low-bit quantization
  • Uses unmerged llama.cpp PR #22673 to enable local MTP support

Why It Matters

Brings server-grade speculative decoding to local hardware, slashing latency for open-weight model inference.