Developer Tools

llama.cpp b9329 adds fast Walsh-Hadamard transform on CUDA

New CUDA kernel implements the fast Walsh-Hadamard transform (FWHT) for LLM inference

Deep Dive

The latest llama.cpp release (b9329) adds a fast Walsh-Hadamard transform (FWHT) to the CUDA backend. The change also lists code improvements: loop unrolls, a switch from size_t to int, and a warp size of 64. Prebuilt binaries are provided for macOS (Apple Silicon and Intel), iOS, Linux (x64, arm64, s390x, Vulkan, ROCm, OpenVINO, SYCL), Android (arm64), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler (x86, aarch64), along with UI assets.
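
For readers unfamiliar with the transform, the textbook butterfly algorithm can be sketched on the CPU in a few lines of Python. This is only an illustration of what an FWHT computes, not the llama.cpp CUDA kernel; the function name fwht and the sample input are assumptions made for this example.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (illustrative sketch).

    Hypothetical helper for this article; it is not part of llama.cpp.
    The input length must be a power of two.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # One pass: combine elements that sit h apart (a "butterfly").
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


x = [1, 0, 1, 0, 0, 1, 1, 0]
y = fwht(x)
print(y)

# Applying the transform twice gives back the input scaled by len(x).
print([v // len(x) for v in fwht(y)])
```

The loop runs log2(n) passes of n/2 add/subtract pairs, which is where the "fast" comes from compared with an n-by-n matrix multiply. A GPU version would typically spread those pairs across the threads of a warp, so details such as loop unrolling and warp size plausibly matter there.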

Key Points
  • Adds CUDA kernel for fast Walsh-Hadamard transform (FWHT) with loop unrolling
  • Release builds are provided for multiple GPU backends: CUDA 12/13, ROCm, Vulkan, SYCL, and HIP
  • Potentially relevant to binary/ternary models (e.g., BitNet); the release description reports no latency measurements on consumer GPUs

Why It Matters

A GPU-side FWHT is a building block for Hadamard-based quantization techniques from recent research, which could make local LLM inference faster and cheaper on everyday hardware.