llama.cpp b9329 adds fast Walsh-Hadamard transform on CUDA
New CUDA kernel adds an FWHT building block for low-bit LLM inference
Deep Dive
The latest llama.cpp release (b9329) adds a fast Walsh-Hadamard transform (FWHT) for CUDA. The change also includes code cleanups: loop unrolls, a switch from size_t to int, and handling of a warp size of 64. Prebuilt binaries are provided for macOS (Apple Silicon and Intel), iOS, Linux (x64, arm64, s390x, Vulkan, ROCm, OpenVINO, SYCL), Android (arm64), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler (x86, aarch64). The release also ships UI assets.
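For readers unfamiliar with the transform, here is a minimal CPU reference sketch of an FWHT in Python. It is not the llama.cpp kernel: the function name `fwht` and the unnormalized, in-place butterfly layout are assumptions chosen for illustration.

```python
def fwht(values):
    """Unnormalized Walsh-Hadamard transform (illustrative helper, not llama.cpp code).

    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Each pass combines elements that sit h apart (one "butterfly" stage).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))  # → [4, 2, 0, -2, 0, 2, 0, 2]
```

The transform uses only additions and subtractions and takes log2(n) passes over the data, which is what makes it "fast" compared with multiplying by the full n-by-n Hadamard matrix. Applying it twice returns the input scaled by n.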
Key Points
- Adds CUDA kernel for fast Walsh-Hadamard transform (FWHT) with loop unrolling
- Prebuilt binaries cover multiple GPU backends: CUDA 12/13, ROCm, Vulkan, SYCL, and HIP
- Optimized for binary/ternary models (e.g., BitNet), reducing inference latency on consumer GPUs
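To show why a fixed warp size and loop unrolling go together, the sketch below simulates a lane-parallel FWHT in Python. Each lane holds one value and trades with the lane whose index differs by one bit. This XOR-partner scheme is a common way to map butterfly stages onto GPU lanes. Whether the b9329 kernel works this way is an assumption, and `fwht_lanes` is a hypothetical name.

```python
def fwht_lanes(lane_values):
    """Simulate a lane-parallel FWHT with one value per lane (hypothetical sketch).

    Returns the transformed values and the number of exchange stages used.
    """
    n = len(lane_values)
    if n == 0 or n & (n - 1):
        raise ValueError("lane count must be a power of two")
    vals = list(lane_values)
    h = 1
    stages = 0
    while h < n:
        # Every lane reads its partner's value from the previous stage,
        # standing in for a warp-level exchange on real hardware.
        partner = [vals[lane ^ h] for lane in range(n)]
        vals = [
            # Lower lane of each pair keeps the sum, upper lane keeps the difference.
            vals[lane] + partner[lane] if (lane & h) == 0 else partner[lane] - vals[lane]
            for lane in range(n)
        ]
        h *= 2
        stages += 1
    return vals, stages


result, stages = fwht_lanes([1] * 64)
print(stages)  # → 6
```

With 64 lanes the stage count is always log2(64) = 6. A trip count known at compile time is the kind of loop a compiler can unroll fully.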
Why It Matters
Brings cutting-edge quantization research to local LLM inference, enabling faster, cheaper runs on everyday hardware.