Open Source

r/LocalLLaMA May 16, 2026

⚡llama.cpp now supports predicting multiple tokens per step, cutting latency dramatically...

Deep Dive

Multi-Token Prediction (MTP) support is finally being approved for llama.cpp. Time to prepare for the update.

Key Points

MTP predicts 2–4 tokens per step, reducing decoding latency by 30–50%.
Supported in llama.cpp via the `--mtp N` flag; requires fine-tuned models.
Meta originally proposed the technique; community models and reference checkpoints are available.
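
To make the first key point concrete, here is a back-of-the-envelope sketch of how tokens per step relate to latency reduction. It is a simplified model, not llama.cpp's implementation and not a benchmark: the function names and the `step_cost_ratio` parameter (cost of one MTP step relative to one ordinary decoding step) are assumptions introduced for illustration.

```python
def latency_reduction(tokens_per_step: float, step_cost_ratio: float = 1.0) -> float:
    """Fractional reduction in per-token decoding latency under a simple model.

    tokens_per_step: average number of tokens kept per decoding step
                     (1.0 corresponds to ordinary one-token decoding).
    step_cost_ratio: assumed cost of one MTP step relative to one ordinary
                     step (hypothetical parameter; 1.0 means "same cost").
    """
    if tokens_per_step <= 0 or step_cost_ratio <= 0:
        raise ValueError("both arguments must be positive")
    # Per-token latency scales as (cost per step) / (tokens per step).
    return 1.0 - step_cost_ratio / tokens_per_step


def tokens_per_step_needed(reduction: float, step_cost_ratio: float = 1.0) -> float:
    """Average tokens per step required to reach a target latency reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return step_cost_ratio / (1.0 - reduction)


if __name__ == "__main__":
    for target in (0.30, 0.50):
        needed = tokens_per_step_needed(target)
        print(f"{target:.0%} reduction needs about {needed:.2f} tokens per step")
```

Under this model, with equal step cost, a 30–50% latency reduction corresponds to roughly 1.4 to 2 tokens effectively kept per step. That would be consistent with predicting 2–4 tokens per step if not every predicted token is kept or if each step costs somewhat more, though the source does not say which.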

Faster local inference makes on-device LLMs viable for real-time applications like chat, coding assistants, and edge deployment.