b8963
New Vulkan optimizations for Q4_K and Q5_K models yield major speedups on Intel hardware.
The latest llama.cpp release (b8963) from ggml-org delivers targeted Vulkan optimizations that significantly boost performance on Intel GPUs, particularly when running quantized models. The core change addresses an inefficiency in how Q4_K and Q5_K scale loads are handled in the Vulkan backend: the Mesa compiler (used by Intel's ANV driver) cannot track bounds through ternary operations, so it fails to coalesce the existing conditional byte-load pattern. The new code instead loads the full 12-byte scale array as packed u32s and extracts the needed bits with shifts and masks, eliminating conditional loads and branches in the hot loop.
Benchmarks on an Arc Pro B60 GPU show substantial gains: for the Qwen3.5-27B model (Q4_K_S), prompt processing at 512 tokens improved by 9% (from 324.13 to 354.33 tokens/sec), and text generation at 128 tokens improved by 6% (from 17.11 to 18.11 tokens/sec), with similar gains on multi-GPU setups. The fix also cuts shader instruction counts by up to 14% and SEND counts by up to 40% for the vector-matrix multiplication kernels. The optimization is particularly valuable for developers running local LLMs on Intel hardware, as it narrows the performance gap with other GPU vendors.
- Fixes Mesa compiler limitation by loading full 12-byte scale array as packed u32s instead of conditional byte loads
- Up to 9% faster prompt processing and 6% faster text generation on Intel Arc Pro B60 with Qwen3.5 models
- Reduces shader instruction counts by up to 14% and SEND operations by up to 40% for Q4_K/Q5_K kernels
Why It Matters
Intel GPU users running local LLMs via llama.cpp get a free performance boost without hardware upgrades.