llama.cpp b9495 fixes Qwen MTP by using the post-norm hidden state
New release improves multi-token prediction for Qwen models on local hardware.
The latest llama.cpp release, tagged b9495, ships a targeted fix for Qwen-series models that use multi-token prediction (MTP). The key change, described as 'use post-norm hidden state for MTP', feeds the MTP path the hidden state taken after normalization instead of the pre-norm state used previously, which led to incorrect hidden state handling during token prediction. It is a small but important correction for anyone running Qwen2.5 or similar architectures that use MTP to generate multiple tokens per inference step.
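To make the pre-norm versus post-norm distinction concrete, here is a minimal Python sketch. It is not llama.cpp code: it assumes the normalization in question is an RMSNorm applied to the last transformer block's output, and every name and value in it (`rms_norm`, `select_mtp_input`, the example vectors) is hypothetical and chosen only for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def rms(x):
    """Root mean square of a vector, used here only to compare magnitudes."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def select_mtp_input(pre_norm, post_norm, use_post_norm=True):
    """Pick the hidden state handed to the MTP head.

    use_post_norm=True mirrors the behavior the release note describes;
    False mirrors the earlier behavior. Hypothetical helper, not a llama.cpp API.
    """
    return post_norm if use_post_norm else pre_norm


# Hypothetical output of the last transformer block for one token (pre-norm).
pre_norm_hidden = [4.0, -2.0, 0.5, 8.0]
# Hypothetical learned weights of the normalization layer (all ones for simplicity).
final_norm_weight = [1.0, 1.0, 1.0, 1.0]

post_norm_hidden = rms_norm(pre_norm_hidden, final_norm_weight)
mtp_input = select_mtp_input(pre_norm_hidden, post_norm_hidden)

print("pre-norm RMS: ", round(rms(pre_norm_hidden), 3))
print("post-norm RMS:", round(rms(post_norm_hidden), 3))
```

The two vectors point in the same direction here but differ in scale, so a head expecting normalized activations would see inputs outside its expected range if it were handed the pre-norm state. With non-uniform norm weights the two vectors would also differ in direction.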
Beyond the MTP fix, this release continues llama.cpp's tradition of broad platform support. Builds are available for macOS (Apple Silicon and Intel), Linux (with Vulkan, ROCm 7.2, OpenVINO, and SYCL backend variants), Android ARM64, and Windows with CUDA 12/13 and HIP. The release also includes an iOS XCFramework. For professionals deploying local LLMs, this means Qwen models can run on hardware ranging from edge devices to high-end GPU servers.
- Fixes Qwen MTP by switching from pre-norm to post-norm hidden state for accurate multi-token prediction.
- Supports 17+ build configurations across macOS, Linux, Windows, Android, and iOS.
- Includes GPU acceleration via CUDA 12/13, ROCm 7.2, Vulkan, and OpenVINO.
Why It Matters
Local LLM inference gets a small but important fix for Qwen models: with the MTP path reading the correct hidden state, multi-token generation should be faster and more accurate across the supported platforms.