llama.cpp build b9118 fixes Vulkan shared memory, expands platform support
110k-star open-source LLM project updates GPU shaders and adds ARM64 builds...
The open-source llama.cpp project, which has amassed over 110,000 GitHub stars and 18,100 forks, released build b9118 on May 12. The commit, signed with GitHub's verified GPG key, fixes a critical Vulkan backend bug by correctly checking the shared memory size required by quantized matrix-multiplication (MMQ) shaders (issue #22693). The fix should make GPU inference more stable on Vulkan-capable devices.
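To illustrate the kind of check involved, here is a minimal Python sketch. It is not llama.cpp's actual code. It assumes a shader stages one tile of each input matrix in shared memory and that the total must fit within the device limit Vulkan reports as maxComputeSharedMemorySize. The MmqTileConfig class, its fields, the tile sizes, and the pick_config helper are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MmqTileConfig:
    """Hypothetical tile shape for a quantized matmul shader (illustrative only)."""
    name: str
    tile_m: int          # rows of the A tile staged in shared memory
    tile_n: int          # columns of the B tile staged in shared memory
    tile_k: int          # shared inner dimension of both tiles
    bytes_per_elem: int  # assumed staged size of one element, in bytes


def shared_memory_bytes(cfg: MmqTileConfig) -> int:
    """Bytes of shared memory needed to stage one A tile and one B tile."""
    a_tile = cfg.tile_m * cfg.tile_k * cfg.bytes_per_elem
    b_tile = cfg.tile_n * cfg.tile_k * cfg.bytes_per_elem
    return a_tile + b_tile


def pick_config(
    candidates: Sequence[MmqTileConfig], max_shared_bytes: int
) -> Optional[MmqTileConfig]:
    """Return the first candidate that fits the device limit, or None.

    Candidates are expected in order of preference, largest first.
    """
    for cfg in candidates:
        if shared_memory_bytes(cfg) <= max_shared_bytes:
            return cfg
    return None


# Made-up configurations, ordered from largest to smallest.
CANDIDATES = [
    MmqTileConfig("large", tile_m=128, tile_n=128, tile_k=64, bytes_per_elem=2),
    MmqTileConfig("medium", tile_m=64, tile_n=64, tile_k=64, bytes_per_elem=2),
    MmqTileConfig("small", tile_m=32, tile_n=32, tile_k=32, bytes_per_elem=2),
]

# 16384 bytes is the minimum the Vulkan specification guarantees for
# maxComputeSharedMemorySize; real devices often report more.
chosen = pick_config(CANDIDATES, max_shared_bytes=16384)
```

With that 16 KiB limit, the sketch skips the large configuration, which would need 32 KiB, and falls back to the medium one instead of dispatching a shader that cannot fit.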
The release also expands platform coverage. Precompiled binaries are now available for macOS on Apple Silicon (standard and KleidiAI-accelerated builds), macOS on Intel, iOS (as an XCFramework), Linux on x64, arm64, and s390x with several backends (Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows on x64 and arm64 with CUDA 12/13, Vulkan, SYCL, and HIP, Android arm64, and openEuler (x86 and aarch64 with ACL Graph). This breadth makes llama.cpp one of the most widely portable LLM inference engines, letting developers and enthusiasts run models locally on hardware ranging from a Raspberry Pi to high-end desktop GPUs.
- Fixed the shared memory size check for MMQ shaders in the Vulkan backend (issue #22693)
- Added prebuilt binaries for macOS Apple Silicon with KleidiAI, Linux s390x, Android arm64, Windows arm64
- Supports 10+ platform/backend combinations including CUDA 12/13, ROCm 7.2, OpenVINO, SYCL, HIP
Why It Matters
Build b9118 broadens local LLM inference to more hardware and fixes a key GPU stability issue for Vulkan users.