llama.cpp b8953
New release brings 1-bit quantized models to browser GPUs, expanding edge AI reach.
The llama.cpp project, a popular open-source C/C++ implementation for running large language models locally, has just released version b8953. This update introduces Q1_0 quantization support for WebGPU, a significant step toward running extremely low-bit quantized models directly in the browser with GPU acceleration. The release adds a fast matmul/matvec kernel optimized specifically for Q1_0, and drops redundant zero-fills during Q1_0 shared-memory initialization to improve performance.
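The release notes don't spell out the Q1_0 block layout, but the core idea behind a 1-bit matvec kernel can be sketched on the CPU. The C++ sketch below assumes a hypothetical format of 32 sign bits per block plus one float scale (weight = scale × ±1); the actual WebGPU kernel implements the same principle in WGSL with workgroup-level tiling, so treat the struct and function names here as illustrative only.

```cpp
// Illustrative sketch only: the real Q1_0 block layout is not described in
// the release notes. Assumed here: 32 sign bits packed in a uint32_t plus a
// per-block float scale, i.e. weight = scale * (bit ? +1.0f : -1.0f).
#include <cstdint>

struct BlockQ1 {
    float    scale; // per-block scaling factor (assumed)
    uint32_t bits;  // 32 packed sign bits, one per weight (assumed)
};

// Dot product of one quantized row (cols/32 blocks) with a float vector x.
float dot_q1_row(const BlockQ1 *row, const float *x, int cols) {
    float acc = 0.0f;
    for (int b = 0; b < cols / 32; ++b) {
        float partial = 0.0f;
        for (int i = 0; i < 32; ++i) {
            // +1 when the bit is set, -1 otherwise
            const float w = ((row[b].bits >> i) & 1u) ? 1.0f : -1.0f;
            partial += w * x[b * 32 + i];
        }
        acc += row[b].scale * partial; // apply the block scale once per block
    }
    return acc;
}

// Matrix-vector product y = W x, with W stored row-wise in Q1 blocks.
void matvec_q1(const BlockQ1 *W, const float *x, float *y, int rows, int cols) {
    const int blocks_per_row = cols / 32;
    for (int r = 0; r < rows; ++r) {
        y[r] = dot_q1_row(W + r * blocks_per_row, x, cols);
    }
}
```

Because each weight costs a single bit plus an amortized share of the block scale, a row of 4096 weights fits in roughly 520 bytes, which is what makes in-browser inference over WebGPU plausible for memory-constrained devices.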
The release ships with pre-built binaries covering a wide range of platforms: macOS (Apple Silicon with and without KleidiAI, Intel x64, plus an iOS XCFramework), Linux (x64, arm64, and s390x with CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends), Android (arm64 CPU), Windows (x64 and arm64 CPU, CUDA 12 and 13, Vulkan, SYCL, HIP), and openEuler (x86 and aarch64 with ACL Graph). This broad coverage makes it easy for developers to experiment with 1-bit quantization across diverse hardware.
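Whichever binary you pick, the C API for loading a model is the same across backends, with GPU offload controlled through the model parameters. A minimal sketch, assuming recent llama.h symbol names (check the header shipped with your build) and a hypothetical model-q1_0.gguf file:

```cpp
// Minimal sketch: load a Q1_0-quantized GGUF via the llama.cpp C API.
// "model-q1_0.gguf" is a placeholder path; whether Q1_0 files load on a
// given backend depends on how your binary was built.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // offload all layers to the active GPU backend

    llama_model * model = llama_model_load_from_file("model-q1_0.gguf", mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context and run inference here ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```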
- Adds Q1_0 quantization support for WebGPU, enabling 1-bit model inference in browsers.
- Includes a fast matmul/matvec kernel optimized for Q1_0 and removes redundant zero-fills in shared memory (see the sketch after this list).
- Provides pre-built binaries for macOS, Linux, Windows, Android, iOS, and openEuler with multiple GPU backends (CUDA, Vulkan, ROCm, OpenVINO, SYCL, HIP).
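The zero-fill removal is a common GPU-kernel optimization: shared (workgroup) memory only needs clearing when some slots might be read before being written, such as partial tiles at a matrix edge. When the kernel guarantees every slot is written before use, the fill is pure overhead. A CPU-side analogue of the pattern, with illustrative names and sizes:

```cpp
// Why dropping redundant zero-fills helps: if every slot of a scratch tile
// is unconditionally written before it is read, pre-zeroing is wasted work.
// This mirrors the shared-memory tiles a GPU matvec kernel stages data into;
// TILE and the function names are illustrative, not from llama.cpp.
#include <array>
#include <algorithm>

constexpr int TILE = 64;

void load_tile_with_zero_fill(std::array<float, TILE> &tile, const float *src) {
    std::fill(tile.begin(), tile.end(), 0.0f); // redundant: overwritten below
    for (int i = 0; i < TILE; ++i) tile[i] = src[i];
}

void load_tile_direct(std::array<float, TILE> &tile, const float *src) {
    for (int i = 0; i < TILE; ++i) tile[i] = src[i]; // every slot written once
}
```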
Why It Matters
Enables extremely low-bit AI models to run efficiently in browsers, expanding edge deployment possibilities.