Developer Tools

llama.cpp b9200 boosts MTP speed by avoiding logit copies

New optimization cuts overhead during prompt decoding for multi-token prediction.

Deep Dive

llama.cpp, the wildly popular open-source C++ inference engine for large language models, just dropped version b9200. The highlight is a performance fix for Multi-Token Prediction (MTP), a technique that lets the model predict several tokens per step instead of one. In previous versions, the engine copied logits (the raw, unnormalized scores the model produces for each vocabulary entry, before softmax turns them into probabilities) during the prompt decoding phase even though they were not needed, wasting memory bandwidth and compute. This release eliminates that copy, which trims overhead for MTP runs and eases memory pressure. The difference should be most noticeable on resource-constrained hardware like Apple Silicon or consumer GPUs.
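To get a feel for why a skipped logit copy matters, here is a rough back-of-the-envelope sketch. It is not the llama.cpp change itself: the helper names (`logits_bytes`, `copy_outputs`), the 512-token prompt, and the 32,000-entry vocabulary are all assumptions picked for illustration. It relies only on the general fact that sampling the next token needs the logits of the final prompt position, so rows for earlier positions can be left uncopied.

```python
# Illustrative only: toy sizes and hypothetical helper names, not llama.cpp code.

BYTES_PER_FLOAT32 = 4


def logits_bytes(n_rows: int, vocab_size: int) -> int:
    """Bytes needed to hold n_rows rows of float32 logits."""
    return n_rows * vocab_size * BYTES_PER_FLOAT32


def copy_outputs(logit_rows, wanted):
    """Copy only the rows flagged in `wanted`.

    Returns a dict of {row index: copied row} and the number of floats copied.
    """
    copies = {}
    floats_copied = 0
    for i, (row, keep) in enumerate(zip(logit_rows, wanted)):
        if keep:
            copies[i] = list(row)  # this is the copy we actually pay for
            floats_copied += len(row)
    return copies, floats_copied


# Assumed sizes: a 512-token prompt and a 32,000-entry vocabulary.
n_prompt, vocab = 512, 32_000
all_rows_bytes = logits_bytes(n_prompt, vocab)
last_row_bytes = logits_bytes(1, vocab)
print(f"copy every prompt row: {all_rows_bytes / 2**20:.1f} MiB")
print(f"copy last row only:    {last_row_bytes / 2**10:.1f} KiB")

# Tiny demo: 4 prompt tokens, vocabulary of 3, only the last row requested.
rows = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [2.0, 1.0, 0.0]]
copies, n_floats = copy_outputs(rows, [False, False, False, True])
print(copies, n_floats)
```

With these assumed sizes, copying every prompt row moves 62.5 MiB per prompt, while copying only the last row moves 125 KiB. Real savings depend on the model's vocabulary size, the prompt length, and the backend.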

The release also includes a follow-up that addresses review comments on the same MTP optimization, plus a fix for `set_output` on `t_h_pre_norm` in llama-graph. As usual, the build matrix is massive: macOS (Arm64 with or without KleidiAI, Intel x64, iOS), Linux (x64, Arm64, s390x, Vulkan, ROCm 7.2, OpenVINO, SYCL), Windows (x64 CPU, Arm64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android Arm64, and even openEuler with Ascend NPUs. That breadth puts the optimization within reach of everyone from hobbyists running models on a laptop to production clusters. The release was signed by GitHub Actions with a verified GPG key (B5690EEEBB952194), which attests to its origin.

Key Points
  • Eliminates redundant logit copying during prompt decode in Multi-Token Prediction (MTP), improving throughput.
  • Supports 15+ platform/build combinations including Apple Silicon, Linux x64/ARM, Windows CUDA/Vulkan, and Android.
  • Signed release from ggml-org/llama.cpp, one of the most-starred LLM inference engines on GitHub (111k stars).

Why It Matters

Faster local LLM inference means lower latency for developers building AI apps on consumer hardware.