llama.cpp b9200 boosts MTP speed by avoiding logit copies

llama.cpp Releases May 18, 2026

⚡ New optimization cuts overhead during prompt decoding for multi-token prediction.

Deep Dive

llama.cpp, the wildly popular open-source C++ inference engine for large language models, just dropped version b9200. The highlight is a performance fix for Multi-Token Prediction (MTP), a technique that allows the model to predict several tokens at once. In previous versions, the engine unnecessarily copied logits (the raw, unnormalized scores the model assigns to each vocabulary token, before softmax turns them into probabilities) during the prompt decoding phase, wasting memory bandwidth and compute. This release eliminates that copy, which should speed up MTP runs and ease memory pressure, especially on resource-constrained hardware such as Apple Silicon laptops or consumer GPUs.
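To make the idea concrete, here is a minimal C++ sketch of why skipping unneeded logit rows saves work. It is a toy illustration, not llama.cpp's code: the release notes do not describe the mechanism in detail, so the assumption that only the last prompt token's logits are consumed, along with every name below (`DecodeStats`, `collect_logits`, `mask_all`, `mask_last_only`), is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping for this sketch; not a llama.cpp type.
struct DecodeStats {
    std::size_t floats_copied = 0; // logit values copied into the output buffer
};

// Copies the logit rows of selected tokens out of `all_logits`
// (row-major, n_tokens x n_vocab). need_output[i] says whether
// anything downstream will read token i's logits.
std::vector<float> collect_logits(const std::vector<float> & all_logits,
                                  std::size_t n_vocab,
                                  const std::vector<bool> & need_output,
                                  DecodeStats & stats) {
    std::vector<float> out;
    for (std::size_t i = 0; i < need_output.size(); ++i) {
        if (!need_output[i]) {
            continue; // nobody reads this row, so skip the copy
        }
        const float * row = all_logits.data() + i * n_vocab;
        out.insert(out.end(), row, row + n_vocab);
        stats.floats_copied += n_vocab;
    }
    return out;
}

// Wasteful policy: request logits for every prompt token.
std::vector<bool> mask_all(std::size_t n_tokens) {
    return std::vector<bool>(n_tokens, true);
}

// Leaner policy (assumed for illustration): only the last prompt
// token's logits are needed to start generation.
std::vector<bool> mask_last_only(std::size_t n_tokens) {
    std::vector<bool> mask(n_tokens, false);
    if (n_tokens > 0) {
        mask[n_tokens - 1] = true;
    }
    return mask;
}
```

In this sketch the copy cost scales with the number of prompt tokens times the vocabulary size under the wasteful policy, but with the vocabulary size alone under the leaner one. With real vocabularies in the tens or hundreds of thousands of entries, that gap is where the bandwidth saving would come from.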

The release also includes a follow-up change addressing review comments on the same MTP optimization and a fix for `set_output` on `t_h_pre_norm` in llama-graph. As usual, the build matrix is massive: macOS (Arm64 with or without KleidiAI, Intel x64, iOS), Linux (x64, Arm64, s390x, Vulkan, ROCm 7.2, OpenVINO, SYCL), Windows (x64 CPU, Arm64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android Arm64, and even openEuler with Ascend NPUs. That breadth means the optimization reaches everyone from hobbyists running models on a laptop to production clusters. The release was signed by GitHub Actions with a verified GPG key (B5690EEEBB952194), confirming its authenticity.

Key Points

Eliminates redundant logit copying during prompt decode in Multi-Token Prediction (MTP), improving throughput.
Supports 15+ platform/build combinations including Apple Silicon, Linux x64/ARM, Windows CUDA/Vulkan, and Android.
Signed release from ggml-org/llama.cpp, the most-starred LLM inference engine on GitHub (111k stars).

Why It Matters

Faster local LLM inference means lower latency for developers building AI apps on consumer hardware.
