b9014
New release brings GPU-accelerated layer normalization with improved numerical stability via Kahan summation.
The latest release of llama.cpp, version b9014, brings a notable improvement to its WebGPU backend: the addition of layer normalization operations. Layer normalization is essential for stable training and inference in transformer-based LLMs, and adding it to the WebGPU shader pipeline lets the popular C++ library offload this computation to the GPU when running in browser environments. The update includes a numerically stable implementation using Kahan summation, which reduces accumulated floating-point rounding error, and adds support for mixed data types (e.g., fp16/bf16 inputs with fp32 accumulation). The developers also dropped support for non-contiguous strides to simplify and optimize the shader code. These changes are part of ongoing efforts to make llama.cpp a full-featured, high-performance inference engine across all platforms.
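The idea behind a compensated layer-norm pass can be sketched in plain C++. This is a hypothetical illustration of the technique, not the actual llama.cpp WGSL shader; the function names `kahan_sum` and `layer_norm` are ours:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Kahan (compensated) summation: a second variable tracks the rounding
// error of each addition so it can be folded back into later additions.
float kahan_sum(const std::vector<float>& x) {
    float sum = 0.0f, c = 0.0f;  // c holds the lost low-order bits
    for (float v : x) {
        float y = v - c;
        float t = sum + y;
        c = (t - sum) - y;  // recovers the rounding error of this add
        sum = t;
    }
    return sum;
}

// Layer norm: subtract the mean, divide by the standard deviation.
// Both the mean and the variance use the compensated accumulator.
std::vector<float> layer_norm(const std::vector<float>& x, float eps = 1e-5f) {
    const float n = static_cast<float>(x.size());
    const float mean = kahan_sum(x) / n;

    std::vector<float> sq(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float d = x[i] - mean;
        sq[i] = d * d;
    }
    const float inv_std = 1.0f / std::sqrt(kahan_sum(sq) / n + eps);

    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = (x[i] - mean) * inv_std;
    return out;
}
```

The compensation step matters most when summing thousands of small values into a large accumulator, which is exactly the shape of a mean/variance reduction over a wide hidden dimension.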
llama.cpp b9014 also ships pre-compiled binaries for an extensive list of platforms: macOS (Apple Silicon with optional KleidiAI acceleration and Intel), iOS as an XCFramework, Linux (x64/arm64 CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows (x64/arm64 CPU, CUDA 12 & 13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler (with ACL Graph support). For developers and enthusiasts running LLMs locally, this release means more efficient inference on WebGPU-capable browsers and improved accuracy due to Kahan summation. It continues the trend of making state-of-the-art AI models accessible on consumer hardware without cloud dependencies.
- Adds dedicated layer normalization operations to WebGPU shaders, a key prerequisite for stable transformer inference in browsers.
- Uses the Kahan summation algorithm for better numerical stability and supports mixed-precision types (e.g., fp16/bf16 inputs with fp32 accumulation).
- Pre-built binaries available for macOS, iOS, Linux, Windows, Android, and openEuler with various backends (CUDA, Vulkan, ROCm, SYCL, HIP).
Why It Matters
Enables more robust and faster on-device LLM inference via WebGPU, expanding local AI capabilities to browsers and GPU-accelerated environments.