Developer Tools

llama.cpp b9452 boosts Vulkan Q6_K performance 78% on Intel BMG

Switching to MMVQ makes Q3_K 57% faster and Q6_K 78% faster, with block-load optimizations adding further gains

Deep Dive

The latest llama.cpp release (b9452) delivers a significant Vulkan performance boost on Intel BMG GPUs for models quantized in the Q2_K, Q3_K, and Q6_K formats. The core change routes these block types through MMVQ, the quantized matrix-vector multiplication path, which previously struggled on Intel's architecture due to alignment constraints. Two further optimizations follow: back-to-back loads from alternating arrays are replaced with block loads (forcing coalesced access), and subtraction is performed on full int32_t values instead of bit-twiddling on i8vec4. Together, the changes took fewer than 2,000 lines of code.
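To make the MMVQ idea concrete, here is a minimal Python sketch rather than llama.cpp's actual code. It assumes a simplified block with one scale per block (real K-quant blocks have a more elaborate layout), and the names quantize_block, dot_dequantize, and dot_integer are hypothetical. The point it illustrates: the quantized path multiplies integers and applies the two scales once per block, instead of expanding every weight to floating point first.

```python
import random

BLOCK = 32  # hypothetical block size chosen for this sketch


def quantize_block(values, bits=8):
    """Symmetric quantization of one block: returns (scale, integer codes)."""
    qmax = (1 << (bits - 1)) - 1
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dot_dequantize(w_scale, w_q, activations):
    """Reference path: expand each weight to float, then multiply-accumulate."""
    return sum((w_scale * q) * a for q, a in zip(w_q, activations))


def dot_integer(w_scale, w_q, a_scale, a_q):
    """MMVQ-style path: integer multiply-accumulate, scales applied once."""
    acc = sum(qw * qa for qw, qa in zip(w_q, a_q))  # integer arithmetic only
    return w_scale * a_scale * acc


random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(BLOCK)]
acts = [random.uniform(-1, 1) for _ in range(BLOCK)]

w_scale, w_q = quantize_block(weights, bits=6)  # 6-bit weights, as an example
a_scale, a_q = quantize_block(acts, bits=8)     # 8-bit activations

ref = dot_dequantize(w_scale, w_q, acts)
fast = dot_integer(w_scale, w_q, a_scale, a_q)
print(f"dequantize path: {ref:.4f}, integer path: {fast:.4f}")
```

The two results differ only by the rounding error introduced when the activations are quantized to 8 bits, which is the trade this kind of path makes in exchange for cheaper integer arithmetic.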

On Intel BMG with Mesa drivers, the MMVQ switch alone delivers a ~57% throughput increase in tg128 (the 128-token text-generation benchmark) for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% for Q6_K. The subsequent block-load and subtraction refinements add another ~24% for Q3_K and ~48% for Q6_K. Additionally, Xe2 GPUs now use MMVQ for K-quants even at small batch sizes (taking the NVIDIA override path). For users running local LLMs on the affected Intel GPUs (BMG and other Xe2 parts), this means substantially faster token generation for popular GGUF models with no change to the model files themselves, making local AI more practical on mid-range Intel hardware.
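The figures above can be combined into an overall estimate. The sketch below assumes the second round of gains is measured relative to the post-MMVQ throughput, so the stages multiply; the release notes quoted here do not state the baseline explicitly, and combined_gain is a hypothetical helper.

```python
def combined_gain(*stage_gains):
    """Compound successive relative gains, e.g. 0.57 then 0.24."""
    factor = 1.0
    for gain in stage_gains:
        factor *= 1.0 + gain
    return factor - 1.0


# Stage gains as reported: MMVQ switch, then block-load/subtraction refinements.
q3_total = combined_gain(0.57, 0.24)
q6_total = combined_gain(0.78, 0.48)

print(f"Q3_K combined gain: ~{q3_total:.0%}")
print(f"Q6_K combined gain: ~{q6_total:.0%}")
```

Under that assumption, the combined tg128 improvement works out to roughly 95% for Q3_K and 163% for Q6_K. If the second-stage percentages were instead measured against the original baseline, the totals would be lower.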

Key Points
  • The Vulkan backend now uses MMVQ for Q2_K, Q3_K, and Q6_K, yielding ~57% (Q3_K) and ~78% (Q6_K) tg128 throughput gains on Intel BMG
  • Block-load and int32 subtraction optimizations add another ~24% (Q3_K) and ~48% (Q6_K)
  • Xe2 GPUs also adopt MMVQ for K-quants, even at small batch sizes
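
The int32 subtraction point can be illustrated with a small Python sketch. The release summary does not spell out the exact formulation used in the shader, so this is only one possible reading: four 8-bit values packed into a 32-bit word are offset with a single 32-bit subtraction (a well-known packed-arithmetic trick) instead of being handled lane by lane. The helper names and the offset of 32 (which centers 6-bit values around zero) are illustrative assumptions.

```python
H = 0x80808080       # high bit of each of the four byte lanes
MASK32 = 0xFFFFFFFF  # keep results within 32 bits


def pack(lanes):
    """Pack four values in 0..255 into one little-endian 32-bit word."""
    return int.from_bytes(bytes(lanes), "little")


def unpack_signed(word):
    """Unpack a 32-bit word into four signed 8-bit values."""
    return [b - 256 if b >= 128 else b for b in word.to_bytes(4, "little")]


def sub_per_lane(x, y):
    """Reference: unpack, subtract each lane modulo 256, repack."""
    xs, ys = x.to_bytes(4, "little"), y.to_bytes(4, "little")
    return pack([(a - b) & 0xFF for a, b in zip(xs, ys)])


def sub_packed(x, y):
    """Same result using one 32-bit subtraction plus a sign-bit fix-up."""
    # Setting each lane's high bit in x and clearing it in y guarantees
    # that no borrow crosses a lane boundary.
    diff = ((x | H) - (y & ~H & MASK32)) & MASK32
    # Restore the true high bit of each lane.
    return diff ^ ((x ^ ~y) & H)


quants = pack([0, 31, 32, 63])     # example 6-bit values
offset = pack([32, 32, 32, 32])    # assumed centering offset
centered = sub_packed(quants, offset)
print(unpack_signed(centered))
```

Both functions return the same packed word; the second does so without splitting the word into separate lanes.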

Why It Matters

Token generation for K-quant models on Intel BMG and Xe2 GPUs gets substantially faster under Vulkan, making quantized models more practical on affordable hardware.