Developer Tools

llama.cpp b9156 adds NVIDIA WebGPU CI and cross-platform fixes

Precision fixes land alongside new self-hosted NVIDIA CI for the WebGPU backend.

Deep Dive

The latest llama.cpp release, b9156, focuses on infrastructure and quality-of-life improvements for developers running large language models locally. The headline change is self-hosted NVIDIA continuous integration (CI) for the WebGPU backend (PR #22976), which means future WebGPU changes will be tested on NVIDIA GPUs. The release also fixes precision issues in operations such as set_rows and div, relaxes constraints on f16 formatting and naming, and adds code comments that reference the relevant pull request.
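To illustrate why operations such as div can need precision-aware handling when f16 is involved, here is a minimal sketch using only the Python standard library. It is not taken from llama.cpp or its test suite: the helpers to_f16, div_f16, and max_rel_error, the sample inputs, and both tolerance values are assumptions chosen for illustration.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The 'e' format code packs and unpacks a 16-bit float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def div_f16(a: float, b: float) -> float:
    """Hypothetical helper: divide with inputs and result rounded to f16."""
    return to_f16(to_f16(a) / to_f16(b))


def max_rel_error(pairs):
    """Largest relative error of div_f16 against a double-precision reference."""
    worst = 0.0
    for a, b in pairs:
        ref = a / b
        got = div_f16(a, b)
        worst = max(worst, abs(got - ref) / abs(ref))
    return worst


# Illustrative inputs only; a real backend test would sweep many more values.
pairs = [(1.0, 3.0), (2.0, 7.0), (10.0, 9.0), (0.1, 0.3)]
err = max_rel_error(pairs)

print(f"max relative error: {err:.2e}")
print("within a tight 1e-6 tolerance:", err <= 1e-6)
print("within a looser 2e-3 tolerance:", err <= 2e-3)
```

Half precision carries an 11-bit significand, so each rounding step can introduce a relative error of up to roughly 5e-4. A tolerance tuned for f32 results will therefore reject f16 results that are as accurate as the format allows, which is one plausible reason a backend test would treat f16 cases separately.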

On the platform side, b9156 ships a broad set of prebuilt binaries. For macOS, both Apple Silicon (arm64) and Intel (x64) builds are available, with an optional KleidiAI-enabled build for Apple Silicon. Linux users get builds for x64, arm64, and s390x, with backends including Vulkan, ROCm, OpenVINO, and SYCL (both FP32 and FP16). Windows receives CUDA 12 and CUDA 13 DLLs, plus Vulkan, SYCL, and HIP variants. Android arm64 and openEuler (x86 and aarch64 with ACL Graph) are also included. With 30 assets in this release, llama.cpp continues to broaden its reach for developers running AI inference on diverse hardware.

Key Points
  • NVIDIA self-hosted CI enabled for the WebGPU backend (PR #22976), so WebGPU changes are tested on NVIDIA GPUs.
  • Precision fixes applied to set_rows and div operations, plus relaxed f16 formatting constraints.
  • 30 prebuilt assets available across macOS, Linux, Windows, Android, and openEuler with multiple GPU backends including CUDA 13, Vulkan, ROCm, and SYCL.

Why It Matters

Automated WebGPU testing on NVIDIA hardware should help catch backend regressions earlier, and the wide range of prebuilt binaries lowers the barrier to deploying LLMs locally across platforms.