llama.cpp b9329 adds fast Walsh-Hadamard transform on CUDA

llama.cpp Releases May 26, 2026

⚡ New CUDA kernel optimizes bit-serial LLM inference with FWHT

Deep Dive

The latest llama.cpp release (b9329) adds a fast Walsh-Hadamard transform (FWHT) for CUDA. The change notes also list code refinements: loop unrolling, a switch from size_t to int, and a change related to warp size 64. Prebuilt binaries are provided for macOS (Apple Silicon and Intel), iOS, Linux (x64, arm64, s390x, Vulkan, ROCm, OpenVINO, SYCL), Android (arm64), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler (x86, aarch64), along with UI assets.
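For readers unfamiliar with the transform, here is a minimal reference sketch of an unnormalized FWHT in plain Python. It is not the llama.cpp CUDA kernel; the function name `fwht` is hypothetical and only illustrates the butterfly structure that a GPU kernel would typically parallelize and unroll.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of `values`.

    Illustrative reference only (hypothetical helper, not llama.cpp code).
    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    out = list(values)
    h = 1
    while h < n:
        # One butterfly stage: combine elements that sit h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))
```

The loop runs log2(n) stages of n/2 add/subtract pairs each, so the cost is O(n log n) rather than the O(n²) of a dense matrix multiply. Applying the transform twice returns the input scaled by n.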

Key Points

Adds CUDA kernel for fast Walsh-Hadamard transform (FWHT) with loop unrolling
Ships prebuilt GPU binaries for CUDA 12/13, ROCm, Vulkan, SYCL, and HIP
Optimized for binary/ternary models (e.g., BitNet), reducing inference latency on consumer GPUs

Why It Matters

Brings cutting-edge quantization research to local LLM inference, enabling faster, cheaper runs on everyday hardware.
