llama.cpp b9480 adds StepFun 3.5 MTP for faster token generation

llama.cpp Releases June 03, 2026

⚡Multi-token prediction lands in llama.cpp, boosting local LLM inference speed.

Deep Dive

ggml-org's llama.cpp, the popular open-source C/C++ inference engine for large language models, released version b9480 on June 2, 2026. The headline feature is support for StepFun 3.5 MTP (Multi-Token Prediction). MTP is an inference optimization in which the model predicts several upcoming tokens per step rather than one at a time, which can reduce latency for autoregressive generation. StepFun, a Chinese AI lab, uses the technique in its Step 3.5 models. With the integration into llama.cpp, users can run StepFun 3.5 MTP models locally on consumer hardware with faster text generation, which suits real-time applications such as chatbots and coding assistants.
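To make the idea concrete, here is a minimal toy sketch of one common way multi-token predictions are used at inference time: cheap guesses for the next few tokens are drafted, then checked against the full model, and the longest correct prefix is kept. This is an illustration under stated assumptions, not llama.cpp's implementation; the release summary does not say how b9480 consumes the MTP outputs, and every name below (`target_next`, `draft_tokens`, `generate_with_draft`) is hypothetical.

```python
from typing import List, Tuple

Token = int


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for one full forward pass of the main model."""
    return (context[-1] * 3 + 1) % 7


def draft_tokens(context: List[Token], k: int) -> List[Token]:
    """Toy stand-in for MTP heads: cheaply guess the next k tokens.

    Assumption for illustration: the guess is deliberately wrong whenever
    the previous token is 5, so that the rejection path is exercised.
    """
    ctx = list(context)
    out: List[Token] = []
    for _ in range(k):
        guess = target_next(ctx)
        if ctx[-1] == 5:
            guess = (guess + 1) % 7
        out.append(guess)
        ctx.append(guess)
    return out


def generate_baseline(prompt: List[Token], n: int) -> Tuple[List[Token], int]:
    """Plain autoregressive decoding: one full pass per generated token."""
    ctx = list(prompt)
    passes = 0
    while len(ctx) - len(prompt) < n:
        ctx.append(target_next(ctx))
        passes += 1
    return ctx[len(prompt):], passes


def generate_with_draft(prompt: List[Token], n: int, k: int) -> Tuple[List[Token], int]:
    """Draft k tokens, verify them, and keep the longest correct prefix."""
    ctx = list(prompt)
    passes = 0
    while len(ctx) - len(prompt) < n:
        draft = draft_tokens(ctx, k)
        # Counted as ONE verification pass: a real engine would score all
        # drafted positions in a single batched forward pass.
        passes += 1
        check_ctx = list(ctx)
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(check_ctx)
            if tok != expected:
                # Reject the draft here, keep the main model's token, stop.
                accepted.append(expected)
                break
            accepted.append(tok)
            check_ctx.append(tok)
        ctx.extend(accepted)
    return ctx[len(prompt):len(prompt) + n], passes


if __name__ == "__main__":
    base_out, base_passes = generate_baseline([1], 12)
    mtp_out, mtp_passes = generate_with_draft([1], 12, k=3)
    print("same output:", base_out == mtp_out)
    print("passes:", base_passes, "vs", mtp_passes)
```

Because every accepted token is checked against the main model, the output matches plain decoding; the saving comes only from needing fewer full passes when the drafted tokens are mostly right.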

The release also ships a wide array of platform-specific builds, including macOS Apple Silicon (both vanilla and KleidiAI-enabled), macOS Intel, iOS XCFramework, Linux x64/arm64/s390x CPU and GPU variants (Vulkan, ROCm, OpenVINO, SYCL), Android arm64, and Windows x64/arm64 CPU plus CUDA 12/13, Vulkan, and HIP builds. This breadth lets developers deploy StepFun 3.5 MTP on everything from servers to edge devices. The commit carries GitHub’s verified signature and includes a code review from Sigbjørn Skjæret. With the project sitting at 114k stars and 19.1k forks, this update reinforces llama.cpp's position as a go-to engine for on-device AI inference.

Key Points

New b9480 release adds StepFun 3.5 Multi-Token Prediction (MTP) support for faster LLM inference.
Builds available for macOS, Linux, Windows, Android, and iOS across CPU and multiple GPU backends (CUDA, Vulkan, ROCm, etc.).
llama.cpp now has 114k stars and 19.1k forks, making it a leading open-source LLM inference engine.

Why It Matters

Local LLM inference gets a speed boost with MTP, enabling real-time generation on consumer devices.
