Developer Tools

llama.cpp b9318 fixes MTP KV-cache bug, ships binaries for 19 configurations

113K-star local LLM runner gets a server fix for the draft KV-cache type used in speculative decoding.

Deep Dive

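The popular open-source LLM inference engine llama.cpp has released build b9318, which centers on one targeted server fix. The patch makes the key-value (KV) cache of the MTP (Multi-Token Prediction) layer respect the draft 'ctk' type, the setting that selects the data type of the K cache on the drafting side of speculative decoding. Before the fix, the server ignored that setting for the MTP layer, so its cache could be created with a different type than the one requested. Speculative decoding accelerates generation by drafting several tokens cheaply and letting the main model verify them, so the draft-side cache has to match its configuration; the fix prevents mismatches that could degrade output quality or cause errors.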
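The draft-then-verify loop behind speculative decoding can be sketched in a few lines. This is a minimal greedy variant with toy stand-in models, not llama.cpp's implementation: the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and a real engine verifies all drafted tokens in one batched pass rather than one call per token.

```python
from typing import Callable, List

# A "model" here is just a function that maps a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One round: the draft proposes k tokens, the target keeps the agreeing prefix."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal against its own greedy choice.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target adds one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens; the result equals plain greedy decoding with the target."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


# Toy models (assumptions for illustration): the target counts upward modulo 10,
# and the draft agrees except right after a 4, where it guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

In this greedy form the output is identical to decoding with the target alone; the draft only changes how many tokens are settled per round. That is also why the draft side's cached state matters: a draft that guesses badly costs speed, because more proposals get rejected.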

The release is also notable for its broad platform coverage. The github-actions bot published pre-compiled binaries for 19 configurations: macOS on Apple Silicon (standard and KleidiAI-enabled) and on Intel; iOS (as an XCFramework); Linux on x64, arm64, and s390x, with CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL (FP32/FP16) builds; Android arm64; Windows x64 and arm64, with CPU, CUDA 12/13, Vulkan, SYCL, and HIP builds; and openEuler on x86 and aarch64, with ACL Graph acceleration for Huawei Ascend NPUs. With this coverage, most developers can run local LLMs without compiling from source.

Key Points
  • Fixes a server bug in which the MTP layer's KV cache ignored the draft 'ctk' (K-cache type) setting, which matters for correct speculative decoding.
  • Pre-built binaries are available for 19 configurations across macOS, Windows, Linux, Android, iOS, and openEuler.
  • Builds cover CPU, Apple Silicon, CUDA 12/13, Vulkan, ROCm, SYCL, HIP, OpenVINO, and Huawei Ascend (ACL Graph) targets.
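To see why the cache type in the first point is worth getting right, here is a rough sketch of how the K-cache footprint depends on the chosen type. It assumes 'ctk' refers to the K-cache data type and uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128); the helper name `k_cache_bytes` is made up for illustration, and the sketch does not reproduce the bug itself.

```python
# Approximate storage cost per element for a few ggml tensor types.
# q8_0 stores blocks of 32 int8 values plus one fp16 scale (34 bytes per 32 values);
# q4_0 stores blocks of 32 4-bit values plus one fp16 scale (18 bytes per 32 values).
BYTES_PER_ELEMENT = {
    "f32": 4.0,
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def k_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                  head_dim: int, cache_type: str) -> int:
    """Size of the K half of a KV cache: one key vector per layer, position, and KV head."""
    n_elements = n_layers * n_ctx * n_kv_heads * head_dim
    return int(n_elements * BYTES_PER_ELEMENT[cache_type])


# Hypothetical shape, chosen only to make the arithmetic concrete.
shape = dict(n_layers=32, n_ctx=8192, n_kv_heads=8, head_dim=128)
for cache_type in ("f16", "q8_0", "q4_0"):
    mib = k_cache_bytes(cache_type=cache_type, **shape) / (1024 * 1024)
    print(f"{cache_type}: {mib:.0f} MiB")
```

If a requested type is silently ignored, the cache is allocated at a different size and precision than the user configured, which is the kind of mismatch the fix is meant to rule out.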

Why It Matters

Local LLM inference gets more reliable speculative decoding, making faster generation practical on consumer hardware.