b9033
New release brings hardware-accelerated LLM inference to macOS, Windows, Linux, and mobile targets.
ggml-org’s llama.cpp, the open-source C/C++ LLM inference engine with 109k GitHub stars, just dropped release b9033. This update significantly expands hardware support: macOS now includes both a standard Apple Silicon (arm64) build and a KleidiAI-enabled build for additional acceleration. Windows gets CPU, arm64, CUDA 12.4, CUDA 13.1, Vulkan, SYCL FP32/FP16, and HIP packages. Linux adds Ubuntu x64/arm64 CPU builds, s390x, Vulkan, ROCm 7.2, OpenVINO, and SYCL. For mobile, an iOS XCFramework and an Android arm64 CPU build are included. The release also covers openEuler with Huawei Ascend NPU support.
This breadth of target platforms makes llama.cpp one of the most versatile local LLM runners available. Whether you need to run models on an Apple Silicon Mac, an Nvidia GPU with the latest CUDA, an AMD GPU via ROCm or HIP, or even an Intel GPU with SYCL, b9033 has you covered. The addition of CUDA 13 support is particularly notable for users on newer Nvidia hardware. For edge and mobile developers, the iOS and Android builds enable on-device inference. This release continues the project’s mission to make large language models accessible on consumer hardware without cloud dependencies.
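For developers who embed the engine directly rather than use the bundled tools, the sketch below shows what loading a GGUF model through the C API can look like. Treat it as an illustration under assumptions rather than a definitive recipe: the exact function names (for example llama_model_load_from_file and llama_init_from_model) have shifted across recent releases, and "model.gguf" plus the parameter values are placeholders.

```c
// Minimal sketch: load a GGUF model and create an inference context with
// the llama.cpp C API. Names reflect the recent API and may differ slightly
// between releases; "model.gguf" is a placeholder path.
#include <stdio.h>
#include "llama.h"

int main(void) {
    llama_backend_init();  // initialize the ggml backends (CPU, GPU, etc.)

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload layers to the GPU backend of the build in use

    struct llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 4096;  // context window size

    struct llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_model_free(model);
        return 1;
    }

    // ... tokenize a prompt, run llama_decode, and sample tokens here ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

The same library underlies the mobile packages, so a pattern along these lines should carry over to the iOS XCFramework and Android arm64 builds as well.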
- KleidiAI acceleration now available for macOS Apple Silicon builds
- Windows support expanded to include CUDA 13.1, Vulkan, SYCL FP16, and HIP
- Includes iOS XCFramework and Android arm64 CPU builds for mobile inference
Why It Matters
Local AI inference broadens to more platforms, reducing cloud dependency and enabling private, fast LLM execution across a wide range of consumer and edge devices.