Developer Tools

llama.cpp b9119 fixes Intel GPU performance regression on Windows

New update restores Vulkan BF16 performance on Intel Xe2 and newer GPUs

Deep Dive

llama.cpp, the widely used C/C++ implementation of LLaMA and other LLMs optimized for local inference, has a new release (b9119) from the ggml-org maintainers. The headline fix resolves a performance regression on Intel integrated and discrete GPUs (Xe2 architecture and newer) when running BF16 (bfloat16) workloads through the Vulkan backend on Windows. Users had reported slower token generation since a prior update; this patch restores the expected throughput by correcting how the backend handles cooperative matrices for BF16.
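
For readers unfamiliar with the data type at the center of this fix: BF16 keeps the sign bit and 8-bit exponent of a 32-bit float and shortens the mantissa to 7 bits, so a BF16 value is essentially the upper 16 bits of a float32. The Python sketch below illustrates only that bit layout. It is not taken from llama.cpp or its Vulkan shaders, and the function names are hypothetical.

```python
import struct

def float32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (round-to-nearest-even)."""
    if x != x:
        # NaN: return a canonical quiet NaN pattern instead of rounding.
        return 0x7FC0
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that will be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_float32(b: int) -> float:
    """Expand a bfloat16 pattern to float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

# Hypothetical usage: 3.14159 keeps roughly three significant decimal digits.
approx = bf16_bits_to_float32(float32_to_bf16_bits(3.14159))
print(approx)  # prints 3.140625
```

The round trip shows the trade-off BF16 makes: the same dynamic range as float32 with much lower precision, in half the memory.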

The release ships with prebuilt binaries for nearly every major platform: Windows (x64 and arm64) with Vulkan, CUDA 12/13, SYCL, and HIP; Linux variants for x64, arm64, and s390x with Vulkan, ROCm 7.2, OpenVINO, and SYCL; macOS Apple Silicon (both standard and KleidiAI-enabled) and Intel x64; plus iOS and Android arm64. With this breadth, developers running local LLMs on diverse hardware can deploy b9119 without building from source, and Intel GPU users on Windows in particular regain the throughput they had before the regression.

Key Points
  • Fixes Vulkan BF16 performance regression on Intel Xe2 and newer GPUs on Windows
  • Release includes prebuilt binaries for 20+ platform/backend combinations
  • macOS builds include a KleidiAI-enabled Apple Silicon variant alongside the standard one

Why It Matters

The fix restores the expected BF16 inference throughput for Intel GPU users on Windows, which matters for edge AI deployments that run models locally.