llama.cpp b9495 fixes Qwen MTP with post-norm hidden state optimization

llama.cpp Releases June 04, 2026

⚡ New release improves multi-token prediction for Qwen models on local hardware.

Deep Dive

The latest llama.cpp release, tagged b9495, ships a targeted fix for Qwen-series models that use multi-token prediction (MTP). The key change, described as 'use post-norm hidden state for MTP', feeds the MTP path the hidden state taken after normalization instead of the pre-norm state used previously, which was causing incorrect hidden state handling during token prediction. It is a subtle but important correction for anyone running Qwen models whose architecture relies on MTP to predict more than one token per inference step.
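To make the pre-norm versus post-norm distinction concrete, here is a minimal sketch in plain Python. It assumes the final normalization is an RMSNorm layer; the vector values, the weights, and the function names are all hypothetical, and this is not the code from the llama.cpp change. It only shows that the two candidate inputs to an MTP head can differ substantially in scale.

```python
import math

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale x to roughly unit RMS, then apply a per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

# Hypothetical output of the last transformer block for one token (pre-norm).
pre_norm_hidden = [4.0, -2.0, 0.5, 8.0]

# Hypothetical learned weights of the final norm layer (all ones for clarity).
final_norm_weight = [1.0, 1.0, 1.0, 1.0]

# The post-norm hidden state is the same vector after the final norm.
post_norm_hidden = rms_norm(pre_norm_hidden, final_norm_weight)

# An MTP head trained on one of these inputs will see a very different
# scale if it is handed the other one at inference time.
print("pre-norm RMS: ", rms(pre_norm_hidden))
print("post-norm RMS:", rms(post_norm_hidden))
```

With unit weights the post-norm vector has an RMS close to 1, while the pre-norm vector keeps whatever scale the residual stream has accumulated. A prediction head that expects one and receives the other is working on mismatched inputs, which is the kind of mismatch the release note describes.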

Beyond the MTP fix, this release continues llama.cpp's tradition of broad platform support. Builds are available for macOS (Apple Silicon and Intel), Linux (with Vulkan, ROCm 7.2, OpenVINO, and SYCL backend variants), Android ARM64, and Windows with CUDA 12/13 and HIP. The release also includes an iOS XCFramework. For professionals deploying local LLMs, this means Qwen models can be run on hardware ranging from edge devices to high-end GPU servers.

Key Points

Fixes Qwen MTP by switching from pre-norm to post-norm hidden state for accurate multi-token prediction.
Supports 17+ build configurations across macOS, Linux, Windows, Android, and iOS.
Includes GPU acceleration via CUDA 12/13, ROCm 7.2, Vulkan, and OpenVINO.

Why It Matters

Local LLM inference gets a subtle but important fix for Qwen models: MTP now reads the correct hidden state, allowing more accurate, and potentially faster, multi-token generation across the supported hardware.
