llama.cpp b9504 improves build system, adds platform fixes
115k-star project ships cleaner CMake logic and notes that KleidiAI is disabled on Apple Silicon
llama.cpp, the open-source C/C++ implementation for running large language models locally, has released version b9504. Maintained by the ggml-org team, this project has garnered over 115,000 GitHub stars and 19,200 forks, making it one of the most popular AI inference engines. The b9504 release focuses on build system refinements and platform-specific improvements.
The key change is a CMake update that skips the cvector-generator and export-lora components when the CPU backend is disabled. This streamlines configuration for GPU-only builds (e.g., CUDA, Vulkan, ROCm), which no longer pull in those two tools. The release also notes that KleidiAI, Arm's library of optimised micro-kernels for AI workloads on Arm CPUs, is currently disabled on macOS Apple Silicon (arm64). The full list of supported platforms includes Ubuntu (x64, arm64, s390x), Windows (x64, arm64), Android arm64, and macOS Intel/Apple Silicon. Listed acceleration backends are CUDA 12 and 13 (on Windows), Vulkan, ROCm 7.2, OpenVINO, SYCL, and HIP.
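The effect of the CMake change can be pictured with a short Python sketch. This is only an illustrative model of the selection logic described above, not llama.cpp's actual build code: the function name, the flag name, and the sample tool list are all hypothetical, and only the two skipped component names come from the release notes.

```python
# Illustrative model only. The names here are hypothetical and do not
# mirror llama.cpp's real CMake variables or its full list of tools.

# The two components the release notes say are skipped without the CPU backend.
CPU_DEPENDENT_TOOLS = {"cvector-generator", "export-lora"}


def select_tools(all_tools, cpu_backend_enabled):
    """Return the tools to build, dropping CPU-dependent ones when the CPU backend is off."""
    if cpu_backend_enabled:
        return list(all_tools)
    return [tool for tool in all_tools if tool not in CPU_DEPENDENT_TOOLS]


# Sample tool list (assumed for illustration).
tools = ["tool-a", "tool-b", "cvector-generator", "export-lora"]

gpu_only = select_tools(tools, cpu_backend_enabled=False)
full = select_tools(tools, cpu_backend_enabled=True)

print(gpu_only)  # CPU-dependent tools filtered out
print(full)      # every tool kept
```

With the CPU backend off, the two CPU-dependent components drop out of the build list while everything else is left untouched; with it on, nothing changes.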
For developers and AI enthusiasts, this version continues llama.cpp's tradition of efficient local inference. No major new features are announced, but the build changes should make deployment smoother across diverse hardware. Users building GPU-only configurations can expect a cleaner compile, and the project continues to support models like Llama 3, Mistral, and Gemma.
- CMake now skips cvector-generator and export-lora when CPU backend is disabled (#24053)
- KleidiAI acceleration is currently disabled on macOS Apple Silicon (arm64) builds
- Supports multiple acceleration backends: CUDA 12/13, Vulkan, ROCm 7.2, OpenVINO, SYCL, HIP, plus CPU-only builds
Why It Matters
llama.cpp b9504 keeps local AI inference accessible across AMD, Intel, NVIDIA, and Apple hardware, enabling privacy-friendly model deployment.