Developer Tools

llama.cpp b9406 adds multi-token prediction for faster inference

The update adds a graph input for multi-token prediction (MTP), a technique that lets a model propose several tokens per step instead of one.

Deep Dive

llama.cpp release b9406 (29 May) adds llm_graph_input_mtp (#23643). The change, co-authored by Georgi Gerganov, also renames input_mtp to input_token_embd and leaves a TODO about mtmd embedding. Builds are available for macOS (Apple Silicon with KleidiAI, Intel), iOS, Linux (multiple architectures, with CPU, Vulkan, ROCm, OpenVINO, and SYCL builds), Android arm64, Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler.

Key Points
  • Adds multi-token prediction (MTP) via llm_graph_input_mtp.
  • Renames input_mtp to input_token_embd; the change lists Georgi Gerganov as a co-author.
  • Builds cover macOS, iOS, Linux, Android arm64, Windows, and openEuler, including Apple Silicon with KleidiAI.

Why It Matters

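Multi-token prediction can speed up local LLM inference by producing more than one token per main-model step. The gain depends on how often the extra predictions are accepted, but it helps make local models more practical for real-time applications.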
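
To make the idea concrete, here is a minimal toy sketch in Python of the draft-and-verify loop commonly used with multi-token prediction. It is not llama.cpp code and does not use any llama.cpp API. The functions target_next, draft_next_k, and generate_mtp are hypothetical stand-ins invented for illustration, and the "model" is simple arithmetic rather than a neural network. The sketch assumes that one verification pass can check all drafted positions at once, which is where the saving in main-model passes would come from.

```python
from typing import List, Tuple


def target_next(tokens: List[int]) -> int:
    """Toy stand-in for the main model: the next token depends only on the last one."""
    return (tokens[-1] * 3 + 1) % 17


def draft_next_k(tokens: List[int], k: int) -> List[int]:
    """Toy stand-in for an MTP head: guesses k tokens ahead, sometimes wrongly."""
    ctx = list(tokens)
    draft = []
    for _ in range(k):
        guess = target_next(ctx)
        if guess % 5 == 0:  # deliberately inject an occasional wrong guess
            guess = (guess + 1) % 17
        draft.append(guess)
        ctx.append(guess)
    return draft


def generate_greedy(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one main-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def generate_mtp(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, then verify them; returns (tokens, verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < n_new:
        draft = draft_next_k(tokens, k)
        # Assumption: a real system scores every drafted position in a single
        # batched forward pass, so the loop below is counted as one pass.
        passes += 1
        ctx = list(tokens)
        for guess in draft:
            correct = target_next(ctx)
            ctx.append(correct)  # always keep the main model's own token
            if guess != correct:
                break  # discard the rest of the draft after the first mismatch
        tokens = ctx
    return tokens[: len(prompt) + n_new], passes


if __name__ == "__main__":
    baseline = generate_greedy([1], 12)
    fast, passes = generate_mtp([1], 12, k=4)
    print("same output:", fast == baseline)
    print("verification passes:", passes, "vs 12 single-token passes")
```

Because every kept token is the main model's own choice, the output matches ordinary one-token-at-a-time decoding. Only the number of main-model passes changes, and how much it drops depends on how often the drafted tokens are right.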