llama.cpp b9480 adds StepFun 3.5 MTP for faster token generation
Multi-token prediction lands in llama.cpp, boosting local LLM inference speed.
ggml-org's llama.cpp, the popular open-source C/C++ inference engine for large language models, released version b9480 on June 2, 2025. The headline feature is support for StepFun 3.5 MTP (Multi-Token Prediction). With MTP, the model proposes several upcoming tokens per step instead of one. The proposals are typically checked by the main model and only the accepted ones are kept, so fewer sequential passes are needed and autoregressive generation can finish sooner. StepFun, a Chinese AI lab, ships MTP with its Step 3.5 models; the technique is not unique to StepFun and is also used by other model families such as DeepSeek-V3. The integration into llama.cpp means users can run StepFun 3.5 MTP models locally with faster text generation, provided their hardware can hold the model, which suits latency-sensitive applications like chatbots and coding assistants.
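One common way multi-token prediction turns into a speedup is a draft-and-verify loop: cheap extra predictions guess a few tokens ahead, and the main model confirms or corrects them in a single pass. The toy sketch below illustrates that idea only. It is not llama.cpp's implementation or StepFun's architecture, and every name in it (`generate_with_drafts`, `toy_target`, `toy_draft`, `n_draft`) is hypothetical. The "models" are plain functions over integer tokens, and greedy decoding is assumed.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def generate_with_drafts(
    target_next: NextFn,
    draft_next: NextFn,
    prompt: List[Token],
    max_new_tokens: int,
    n_draft: int = 3,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify; returns (new tokens, verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. Draft: cheaply guess the next n_draft tokens.
        draft: List[Token] = []
        for _ in range(n_draft):
            draft.append(draft_next(out + draft))

        # 2. Verify: a real engine would score all draft positions in one
        #    batched pass of the main model. Here we call target_next per
        #    position but count the whole check as a single pass.
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                # First mismatch: keep the main model's token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft matched, so the pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)

    new_tokens = out[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, passes


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Stand-in for the draft heads: usually right, sometimes wrong."""
    if ctx[-1] % 3 == 2:
        return 0  # deliberate wrong guess
    return (ctx[-1] + 1) % 10


tokens, passes = generate_with_drafts(toy_target, toy_draft, [0], max_new_tokens=8)
print(tokens, passes)
```

In this toy run the output is identical to decoding one token at a time with `toy_target`, but it takes fewer main-model passes than tokens produced. How much is saved in practice depends on how often the drafts are accepted.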
The release also ships prebuilt binaries for a wide range of platforms: macOS Apple Silicon (both vanilla and KleidiAI-enabled), macOS Intel, iOS XCFramework, Linux x64/arm64/s390x CPU and GPU variants (Vulkan, ROCm, OpenVINO, SYCL), Android arm64, and Windows x64/arm64 CPU plus CUDA 12/13, Vulkan, and HIP builds. That breadth lets developers try StepFun 3.5 MTP on targets ranging from servers to edge devices without building from source. The commit carries GitHub's verified signature and reflects community contributions, including a code review from Sigbjørn Skjæret. With the project at 114k stars and 19.1k forks, the update adds to llama.cpp's standing as one of the most widely used engines for on-device AI inference.
- New b9480 release adds StepFun 3.5 Multi-Token Prediction (MTP) support for faster LLM inference.
- Builds available for macOS, Linux, Windows, Android, and iOS across CPU and multiple GPU backends (CUDA, Vulkan, ROCm, etc.).
- llama.cpp now has 114k stars and 19.1k forks, a sign of how widely used it is among open-source LLM inference engines.
Why It Matters
MTP gives local inference a speed boost for models that support it, which makes interactive, low-latency generation on consumer devices more practical.
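To get a feel for the size of the boost, a back-of-the-envelope estimate helps. The sketch below assumes a draft-and-verify scheme in which each drafted token is accepted independently with a fixed probability, and it ignores the cost of producing the drafts. Both are simplifications, and the acceptance rates shown are made-up inputs, not measurements of StepFun 3.5 or llama.cpp. The function name `expected_tokens_per_pass` is hypothetical.

```python
def expected_tokens_per_pass(accept_prob: float, n_draft: int) -> float:
    """Expected tokens gained per main-model pass.

    Assumption: draft token i is kept only if all earlier drafts were kept,
    each with independent probability accept_prob. The pass always yields at
    least one token from the main model itself, hence the i = 0 term.
    Result: 1 + p + p^2 + ... + p^n_draft.
    """
    return sum(accept_prob ** i for i in range(n_draft + 1))


# Illustrative inputs only; real acceptance rates vary by model and prompt.
for p in (0.0, 0.5, 0.8):
    print(f"accept_prob={p}: {expected_tokens_per_pass(p, n_draft=3):.3f} tokens per pass")
```

Under these assumptions, drafts that are never accepted leave generation at one token per pass, while higher acceptance rates approach `n_draft + 1` tokens per pass.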