llama.cpp b9055 adds local support for Mimo v2.5 model

llama.cpp Releases May 07, 2026

⚡ Run the new Mimo v2.5 architecture on your own machine with fused QKV and multi-token prediction.

Deep Dive

llama.cpp's b9055 release adds support for the Mimo v2.5 model. The update includes fixes for fused QKV layers, attention value scaling, and multi-token prediction (MTP) weights. Prebuilt binaries are available for macOS (Apple Silicon, Intel), Linux (x64/arm64, Vulkan, ROCm, SYCL, OpenVINO, s390x, openEuler), Windows (CPU, CUDA, Vulkan, SYCL, HIP), and Android arm64.
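For readers unfamiliar with the term, a fused QKV layer stores the query, key, and value projections as one stacked weight, so a single matrix multiply produces all three and the result is then sliced apart. The sketch below illustrates that general idea only. It is not llama.cpp code, the helper names (matvec, split_fused_qkv) and the tiny weights are hypothetical, the Q, K, V stacking order is an assumption, and attention value scaling is not modelled because the release summary does not describe how it is applied.

```python
# Illustrative sketch of a fused QKV projection. Not llama.cpp code and not
# the actual Mimo v2.5 tensor layout; all names and sizes are hypothetical.

def matvec(weight, x):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def split_fused_qkv(fused_out, q_dim, k_dim, v_dim):
    """Slice one fused projection output into separate Q, K and V vectors."""
    assert len(fused_out) == q_dim + k_dim + v_dim
    q = fused_out[:q_dim]
    k = fused_out[q_dim:q_dim + k_dim]
    v = fused_out[q_dim + k_dim:]
    return q, k, v

# Tiny made-up weights: hidden size 2, two Q rows, one K row, one V row.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 1.0]]
w_v = [[0.5, -1.0]]

# Fused weight: rows stacked in Q, K, V order (an assumed convention).
w_qkv = w_q + w_k + w_v

x = [3.0, 4.0]
q, k, v = split_fused_qkv(matvec(w_qkv, x), len(w_q), len(w_k), len(w_v))

# The fused path gives the same result as three separate projections.
assert q == matvec(w_q, x)
assert k == matvec(w_k, x)
assert v == matvec(w_v, x)
```

Fusing changes how the weights are stored and loaded, not the result, which is why a runtime needs explicit support for a model that ships its attention weights this way.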

Key Points

Adds Mimo v2.5 model support with fused QKV layers and attention value scaling.
Includes multi-token prediction (MTP) weights in GGUF format for improved inference.
Available as prebuilt binaries across macOS, Linux, Windows, Android, and multiple GPU backends.

Why It Matters

Enables local inference of a new model architecture, broadening options for on-device AI deployment.
