llama.cpp b9406 adds multi-token prediction for faster inference

llama.cpp Releases May 30, 2026

⚡ The new update lets models predict multiple tokens at once to boost inference speed.

Deep Dive

llama.cpp release b9406 (29 May) adds llm_graph_input_mtp (#23643), renames input_mtp to input_token_embd, and includes a TODO about mtmd (llama.cpp's multimodal library) embedding. The change is co-authored by Georgi Gerganov. Prebuilt binaries are available for macOS (Apple Silicon with KleidiAI, and Intel), iOS, Linux (several architectures, with CPU, Vulkan, ROCm, OpenVINO, and SYCL backends), Android arm64, Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler.
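The input_token_embd name hints that the MTP part of the graph consumes token embeddings, though the release notes do not spell this out. For illustration only, here is a minimal sketch of how one published MTP design (the module described for DeepSeek-V3) builds its input: it normalizes the main model's hidden state at position i and the embedding of token i+1, concatenates them, and projects back to the model width. Whether llama.cpp's llm_graph_input_mtp follows this design is an assumption, and rms_norm, mtp_input, and the projection weights are hypothetical names, not llama.cpp APIs.

```python
import math
from typing import List


def rms_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    # Scale the vector to unit root-mean-square.
    # (The learned per-channel weight of a real RMSNorm is omitted here.)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def mtp_input(hidden: List[float], token_embd: List[float],
              proj: List[List[float]]) -> List[float]:
    # Concatenate the normalized hidden state of position i with the
    # normalized embedding of token t_{i+1}, then apply a (d x 2d)
    # projection given as a list of rows (hypothetical weights).
    joined = rms_norm(hidden) + rms_norm(token_embd)
    assert all(len(row) == len(joined) for row in proj), "proj must be d x 2d"
    return [sum(w * v for w, v in zip(row, joined)) for row in proj]


# Toy usage with d = 2: the projection picks one hidden channel
# and one embedding channel.
hidden = [1.0, 1.0]   # main model's last hidden state at position i
embd = [2.0, 2.0]     # embedding of the token at position i+1
proj = [[1, 0, 0, 0], [0, 0, 0, 1]]
print(mtp_input(hidden, embd, proj))  # both entries are close to 1.0
```

In the DeepSeek-V3 description, the projected vector then passes through an extra transformer block and the shared output head to predict the token after next.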

Key Points

Adds a multi-token prediction (MTP) graph input, llm_graph_input_mtp.
Co-authored by Georgi Gerganov.
Prebuilt binaries are available for all major platforms, including Apple Silicon with KleidiAI.

Why It Matters

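Multi-token prediction can speed up local LLM inference. When used for speculative decoding, extra prediction layers draft several upcoming tokens and the main model verifies them in one batched pass, so a single step can emit more than one token. Actual gains depend on the model and on how often drafts are accepted, and lower per-token latency matters most for real-time applications.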
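The verification step can be sketched as follows. This is a minimal greedy version, assuming a hypothetical target_next function that returns the main model's most likely next token for a sequence; it is not llama.cpp's implementation. Real engines score every draft position in a single batched forward pass, while the loop below checks them one at a time for clarity.

```python
from typing import Callable, List


def verify_drafts(context: List[int], drafts: List[int],
                  target_next: Callable[[List[int]], int]) -> List[int]:
    """Greedy speculative verification of drafted tokens.

    Returns the tokens accepted this step: the longest prefix of `drafts`
    that matches the main model, plus one token chosen by the main model
    (a correction at the first mismatch, or a bonus token if all matched).
    """
    accepted: List[int] = []
    for tok in drafts:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: discard the rest and keep the main model's token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft matched, so the main model contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_model(seq: List[int]) -> int:
    # Hypothetical stand-in for the main model: always counts upward.
    return seq[-1] + 1


print(verify_drafts([1, 2, 3], [4, 5, 9], toy_model))  # [4, 5, 6]
print(verify_drafts([1, 2, 3], [4, 5], toy_model))     # [4, 5, 6]
```

Because each draft is checked against the main model, greedy output is identical to decoding without drafts; the speedup comes only from how many drafts are accepted per pass.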
