Open Source

Llama.cpp MTP on Qwen3.6 delivers up to 3x faster inference on an RTX 5090

Multi-token prediction mode speeds up generation by up to 3x on a 32 GB RTX 5090

Deep Dive

The latest llama.cpp build (commit 4f13cb7) adds MTP (multi-token prediction) support, tested here with Unsloth's Qwen3.6 GGUF quants on an RTX 5090. The tests used the Q5_K_M and UD-Q4_K_M variants at 128k context with flash-attn. The feature requires the --parallel 1 and --spec-type draft-mtp flags, and it works with existing GGUF files without requantization. Two prompts were tested, a short story (~400 tokens) and a Flappy Bird HTML file (~3000 tokens), with three seeds per configuration.
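To make the comparison concrete, here is a minimal sketch of how per-seed throughput could be turned into a single speedup figure. The source does not say how the 3x number was derived, so averaging tokens per second across seeds and taking the ratio is an assumption. The function name `speedup` is hypothetical, and the values in the usage lines are placeholders, not measurements from the test.

```python
from statistics import mean

def speedup(baseline_tps, mtp_tps):
    """Ratio of mean MTP throughput to mean baseline throughput (tokens/s)."""
    if not baseline_tps or not mtp_tps:
        raise ValueError("need at least one measurement per configuration")
    return mean(mtp_tps) / mean(baseline_tps)

# Placeholder values, one per seed. These are NOT results from the article.
baseline = [10.0, 10.0, 10.0]
with_mtp = [20.0, 25.0, 15.0]
print(f"speedup: {speedup(baseline, with_mtp):.2f}x")
```

Averaging over seeds matters at temperature 0.8, where sampled outputs differ in length and content from run to run, so a single run could over- or understate the gain.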

Key Points
  • MTP mode in llama.cpp accelerates generation by up to 3x on an RTX 5090 using Qwen3.6 GGUF models
  • Tested with 128k context, flash-attn, q8_0 KV cache at temperature 0.8 across two prompt types
  • Requires building from source (CUDA_DOCKER_ARCH=120) and the --parallel 1 flag; no GGUF requantization needed
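The settings listed above can be collected into one launch command. The sketch below assembles such a command as an argument list. Only `--parallel 1` and `--spec-type draft-mtp` come from the source. The binary name `llama-server`, the model filename, the reading of "128k" as 131072 tokens, the `on` value for `--flash-attn`, and applying q8_0 to both the K and V caches are all assumptions, so check them against `--help` on your own build.

```python
import shlex

def build_command(model_path, binary="llama-server"):
    """Assemble an argv list for the settings described in the article."""
    return [
        binary,                      # assumed binary name
        "-m", model_path,
        "--ctx-size", "131072",      # "128k" read as 128 * 1024 (assumption)
        "--flash-attn", "on",        # value syntax assumed; varies by build
        "--cache-type-k", "q8_0",    # q8_0 KV cache, K side (assumption)
        "--cache-type-v", "q8_0",    # q8_0 KV cache, V side (assumption)
        "--temp", "0.8",
        "--parallel", "1",           # required per the article
        "--spec-type", "draft-mtp",  # required per the article
    ]

# Hypothetical filename; substitute the path to your own GGUF file.
cmd = build_command("Qwen3.6-Q5_K_M.gguf")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each flag and its value paired, and it can be passed directly to `subprocess.run` without shell quoting issues.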

Why It Matters

MTP cuts generation latency by up to 3x in these tests, making high-quality local LLMs more practical on a single consumer GPU.