b9055
Run the new Mimo v2.5 architecture on your own machine with fused QKV and multi-token prediction.
Deep Dive
llama.cpp's b9055 release adds support for the Mimo v2.5 model. The update includes fixes for fused QKV layers, attention value scaling, and multi-token prediction (MTP) weights. Prebuilt binaries are available for macOS (Apple Silicon, Intel), Linux (x64/arm64, Vulkan, ROCm, SYCL, OpenVINO, s390x, openEuler), Windows (CPU, CUDA, Vulkan, SYCL, HIP), and Android arm64.
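For readers unfamiliar with the term, "fused QKV" means the query, key, and value projections of an attention layer are stored as one stacked weight and computed in a single pass, then split apart. The sketch below is a toy illustration in plain Python; the sizes, the names (`matvec`, `split_fused_qkv`), and the row-wise Q, K, V ordering are assumptions for illustration, not the Mimo v2.5 layout or llama.cpp's code.

```python
# Toy illustration of a fused QKV projection (hypothetical sizes and layout).

def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def split_fused_qkv(fused_out, n_q, n_k, n_v):
    """Split the output of a fused projection back into Q, K and V parts."""
    assert len(fused_out) == n_q + n_k + n_v
    q = fused_out[:n_q]
    k = fused_out[n_q:n_q + n_k]
    v = fused_out[n_q + n_k:]
    return q, k, v

# Separate projection weights (assumed shapes: 2 query rows, 1 key row, 1 value row).
W_q = [[1, 0], [0, 1]]
W_k = [[2, 0]]
W_v = [[0, 3]]

# The fused weight is the three weights stacked row-wise (assumed ordering: Q, K, V).
W_qkv = W_q + W_k + W_v

x = [5, 7]
q, k, v = split_fused_qkv(matvec(W_qkv, x), n_q=2, n_k=1, n_v=1)

# The fused path gives the same result as three separate projections.
assert (q, k, v) == (matvec(W_q, x), matvec(W_k, x), matvec(W_v, x))
```

A loader that assumes the wrong split sizes or ordering would hand the attention layer scrambled Q, K, and V values, which is why fixes to fused-layer handling matter for correct output.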
Key Points
- Adds Mimo v2.5 model support with fused QKV layers and attention value scaling.
- Includes the model's multi-token prediction (MTP) weights in the GGUF file, so that MTP-aware decoding can propose more than one token per step.
- Ships prebuilt binaries for macOS, Linux, Windows, and Android, with GPU builds for CUDA, Vulkan, ROCm, SYCL, and HIP.
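One common way MTP weights are put to use is draft-and-verify decoding: the extra heads propose a few future tokens, and the main model keeps the longest prefix it agrees with. The sketch below shows only that acceptance rule, under loud assumptions: `verify_draft` and `toy_next_token` are hypothetical names, the "model" is a stand-in function, and nothing here claims to be how b9055 uses the MTP weights.

```python
# Toy draft-and-verify acceptance rule (hypothetical; not llama.cpp's implementation).

def toy_next_token(context):
    """Stand-in for the main model: the next token is the last token plus one, mod 10."""
    return (context[-1] + 1) % 10

def verify_draft(next_token, context, draft):
    """Keep drafted tokens while the main model agrees; on the first mismatch,
    substitute the main model's token and stop.

    A real implementation would score all drafted positions in one batched
    forward pass; this loop calls the model once per token for clarity.
    """
    accepted = []
    for tok in draft:
        expected = next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    return accepted

# The third drafted token is wrong, so it is replaced by the main model's choice.
result = verify_draft(toy_next_token, [1, 2, 3], [4, 5, 9])
assert result == [4, 5, 6]
```

The output always matches what the main model would have produced on its own; the draft only changes how many tokens are confirmed per verification step.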
Why It Matters
The release makes it possible to run a new model architecture locally, giving on-device AI deployments one more option.