llama.cpp b9406 adds a graph input for multi-token prediction
The release adds graph-level support for multi-token prediction (MTP), a technique in which a model proposes several tokens per step to speed up generation.
Deep Dive
llama.cpp release b9406 (29 May) adds llm_graph_input_mtp (#23643). The same change renames input_mtp to input_token_embd and leaves a TODO about mtmd embedding; Georgi Gerganov is credited as a co-author. Prebuilt binaries are available for macOS and iOS (Apple Silicon, including a KleidiAI-enabled build; Intel; iOS), Linux (several architectures and backends: CPU, Vulkan, ROCm, OpenVINO, SYCL), Android arm64, Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler.
Key Points
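- Adds llm_graph_input_mtp, a graph input for multi-token prediction (MTP), and renames input_mtp to input_token_embd.
- Georgi Gerganov is credited as a co-author of the change.
- Prebuilt binaries cover macOS, iOS, Linux, Android, Windows, and openEuler, including an Apple Silicon build with KleidiAI.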
Why It Matters
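Multi-token prediction can speed up local LLM inference by producing more than one token per main-model pass, which helps latency-sensitive applications. The actual gain depends on the model and on how many of the predicted tokens are accepted.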
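To make the idea concrete, here is a toy Python sketch of a draft-and-verify loop, one common way multi-token prediction is used. It is not llama.cpp code and does not use llm_graph_input_mtp: target_next, mtp_draft, generate_plain, and generate_mtp are hypothetical names, the "model" is a simple arithmetic function, and the draft head is made deliberately imperfect. The sketch also assumes the main model can check all drafted positions in one batched pass.

```python
def target_next(context):
    """Toy stand-in for the main model: a deterministic next token."""
    return (sum(context) * 31 + len(context)) % 50


def mtp_draft(context, k):
    """Toy stand-in for an MTP head: guesses k future tokens.

    Hypothetical behaviour: it is wrong whenever the context length is a
    multiple of 4, to mimic an imperfect draft.
    """
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % 50  # inject a wrong guess
        draft.append(tok)
        ctx.append(tok)
    return draft


def generate_plain(prompt, n_new):
    """Baseline: one main-model pass per generated token."""
    ctx, passes = list(prompt), 0
    for _ in range(n_new):
        ctx.append(target_next(ctx))
        passes += 1
    return ctx[len(prompt):], passes


def generate_mtp(prompt, n_new, k=3):
    """Draft k tokens, then verify them in one (assumed batched) pass."""
    ctx, passes = list(prompt), 0
    target_len = len(prompt) + n_new
    while len(ctx) < target_len:
        draft = mtp_draft(ctx, k)
        passes += 1  # a single verification pass covers the whole draft
        for tok in draft:
            expected = target_next(ctx)
            ctx.append(expected)  # always keep the main model's token
            if tok != expected or len(ctx) >= target_len:
                break  # stop at the first rejected guess or when done
    return ctx[len(prompt):target_len], passes


if __name__ == "__main__":
    plain_out, plain_passes = generate_plain([1, 2, 3], 12)
    mtp_out, mtp_passes = generate_mtp([1, 2, 3], 12)
    print("same output:", plain_out == mtp_out)
    print("main-model passes:", plain_passes, "vs", mtp_passes)
```

Because every kept token comes from the main model, the output matches plain decoding; the saving comes only from needing fewer main-model passes, and it shrinks as the draft gets less accurate.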