llama.cpp b9186 adds broad platform support and CUDA 12/13 backends
New release supports Apple Silicon, Windows, Linux, Android, and more
ggml-org has released llama.cpp b9186, the latest build of its popular C/C++ LLM inference engine. The release syncs with the ggml library and ships prebuilt binaries for a wide range of platforms: macOS (Apple Silicon with optional KleidiAI acceleration, Intel, and an iOS XCFramework); Linux (Ubuntu builds for x64, arm64, and s390x, with Vulkan, ROCm 7.2, OpenVINO, and SYCL backends); Android arm64; and Windows (x64 and arm64 CPU builds, plus CUDA 12, CUDA 13, Vulkan, SYCL, and HIP for AMD GPUs). openEuler builds for Ascend processors are also included.
For developers and AI enthusiasts, the prebuilt binaries in b9186 lower the barrier to running models such as LLaMA and Mistral on diverse hardware. The CUDA 12 and CUDA 13 DLLs let Windows users target NVIDIA GPUs under either CUDA major version, while Vulkan and SYCL extend GPU support to a wide range of other graphics cards. The openEuler builds cater to enterprise users with Ascend NPUs. The release continues llama.cpp's mission to make on-device AI accessible, with builds that target each platform's own acceleration backend. Assets are available for direct download on the GitHub release page.
- Supports Apple Silicon (arm64) with optional KleidiAI acceleration for faster inference
- Windows builds include CUDA 12 and 13 DLLs, plus Vulkan, SYCL, and HIP for AMD GPUs
- openEuler support for Ascend 310p and 910b processors, targeting enterprise AI deployments
Why It Matters
With prebuilt binaries for most major operating systems and GPU backends, llama.cpp b9186 makes local LLM inference practical on nearly any modern device.