llama.cpp b9144 fixes WebGPU matrix path for head dimension compatibility

llama.cpp Releases May 14, 2026

⚡This open-source LLM runner update ensures WebGPU performance gains don't break on edge cases.

Deep Dive

The latest release of llama.cpp, tagged b9144, addresses a subtle but impactful bug in its WebGPU backend. The fix ensures that the subgroup-matrix path is used only when the head dimension is divisible by sg_mat_k and sg_mat_n, the matrix tile sizes that path operates on. Without this check, model configurations whose head dimension is not a whole multiple of the tile size could produce incorrect results or performance regressions on WebGPU-compatible devices. This matters for developers running inference through WebGPU, whether in a browser or on Vulkan-backed GPUs.
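The kind of guard described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the code from the llama.cpp WebGPU backend: the struct, the function name can_use_subgroup_matrix_path, and the choice to test a single head dimension against both tile sizes are hypothetical. Only the parameter names sg_mat_k and sg_mat_n come from the release notes.

```cpp
#include <cstdint>

// Hypothetical tile description. The field names mirror the sg_mat_k / sg_mat_n
// parameters named in the release notes; the struct itself is an assumption.
struct SubgroupMatrixConfig {
    uint32_t sg_mat_k;  // tile size along the K dimension
    uint32_t sg_mat_n;  // tile size along the N dimension
};

// Returns true only when head_dim splits into whole tiles along both K and N.
// A zero tile size is treated here as "subgroup matrices unavailable".
constexpr bool can_use_subgroup_matrix_path(uint32_t head_dim, SubgroupMatrixConfig cfg) {
    if (cfg.sg_mat_k == 0 || cfg.sg_mat_n == 0) {
        return false;
    }
    return head_dim % cfg.sg_mat_k == 0 && head_dim % cfg.sg_mat_n == 0;
}

// Compile-time examples: 128 divides evenly into 8-wide tiles, 100 does not.
static_assert(can_use_subgroup_matrix_path(128, {8, 8}), "128 splits into whole tiles");
static_assert(!can_use_subgroup_matrix_path(100, {8, 8}), "100 leaves a partial tile");
```

A caller would fall back to a non-subgroup-matrix path whenever the function returns false, so an unusual head dimension costs some speed rather than correctness.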

The release ships prebuilt binaries for the major platforms: macOS (Apple Silicon with optional KleidiAI acceleration, Intel x64), Linux (x64, ARM, s390x, plus Vulkan, ROCm 7.2, OpenVINO, and SYCL variants), Windows (CPU, ARM64, CUDA 12/13, Vulkan, HIP, SYCL), openEuler, and an iOS XCFramework. With 110k GitHub stars, llama.cpp is the most-starred LLM inference engine. The patch is small, but it matters for stability as more users offload inference to the GPU via WebGPU.

Key Points

Fixes WebGPU subgroup-matrix path by checking head dimension divisibility by matrix tile sizes (sg_mat_k/n)
Release b9144 includes builds for macOS, Linux, Windows, Android, iOS, and multiple GPU backends (CUDA, Vulkan, ROCm, SYCL)
Part of ongoing optimization for llama.cpp, the most-starred LLM inference engine with 110k stars

Why It Matters

Ensures reliable GPU acceleration for local LLM inference across diverse hardware and browser environments.
