Developer Tools

b9004

New release brings KleidiAI acceleration and support for CUDA 13, ROCm 7.2, and more.

Deep Dive

The llama.cpp project has released b9004, a significant update to its popular C/C++ inference engine for LLaMA and other large language models. This release is a sync with the underlying ggml library, bringing a host of new platform builds and performance optimizations. Most notably, the macOS Apple Silicon (arm64) build now includes KleidiAI acceleration: Arm's library of optimized micro-kernels, which uses Arm CPU instruction extensions such as dotprod and i8mm for faster inference on M-series chips. The release also adds official builds for Windows arm64 CPU, Android arm64 CPU, and openEuler (both x86 and aarch64, with support for Ascend 310P and 910B NPUs).
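
A quick way to verify which of these accelerations a given binary actually enables is llama.cpp's system-info call. Below is a minimal C++ sketch, assuming you compile and link against the llama library from this release; the exact feature string it prints varies from build to build.

    // Dump the feature flags compiled into this llama.cpp build
    // (e.g. NEON, dotprod, i8mm, KleidiAI kernels, GPU backends).
    #include <cstdio>
    #include "llama.h"

    int main() {
        llama_backend_init();                       // register available backends
        printf("%s\n", llama_print_system_info());  // compiled-in feature flags
        llama_backend_free();
        return 0;
    }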

On the GPU front, b9004 expands hardware compatibility significantly. Linux users get new builds with Vulkan (x64 and arm64), ROCm 7.2 (x64), OpenVINO (x64), and SYCL (FP32 and FP16). Windows users benefit from CUDA 12 and CUDA 13 builds (each with matching CUDA runtime DLLs), along with Vulkan, SYCL, and HIP builds. This broad support means developers and researchers can run models on everything from consumer GPUs to server-grade accelerators, with optimizations tailored to each architecture. The release also includes the usual CPU builds for Ubuntu, macOS Intel, and Windows x64.
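
As a sketch of how an application picks up one of these GPU builds, the snippet below loads a model through llama.cpp's C API and requests full layer offload; model.gguf is a placeholder path, and the backend used (CUDA, Vulkan, ROCm, SYCL, or HIP) is simply whichever one the binary was compiled with. Names follow the current llama.h and may differ in older releases.

    #include <cstdio>
    #include "llama.h"

    int main() {
        llama_backend_init();
        llama_model_params params = llama_model_default_params();
        // Offload as many layers as possible to whichever GPU backend
        // this build includes; set to 0 to force CPU-only inference.
        params.n_gpu_layers = 99;
        llama_model * model = llama_model_load_from_file("model.gguf", params);
        if (!model) {
            fprintf(stderr, "failed to load model\n");
            return 1;
        }
        // ... create a context and run inference here ...
        llama_model_free(model);
        llama_backend_free();
        return 0;
    }

Because offload is a runtime parameter rather than a compile-time choice, the same GPU build can also serve CPU-only machines by setting n_gpu_layers to 0.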

Key Points
  • KleidiAI acceleration in the macOS Apple Silicon build speeds up CPU inference on M-series chips
  • New builds for Windows arm64 CPU, Android arm64 CPU, and openEuler with Ascend NPU support
  • GPU support expanded to include CUDA 13, ROCm 7.2, Vulkan, OpenVINO, and SYCL on Linux and Windows

Why It Matters

Broader hardware support and optimizations make local LLM inference more accessible and performant across diverse systems.