Developer Tools

llama.cpp b9334 fixes CUDA sync bug, improves FWHT fallback

This update resolves a missing PDL sync for FWHT on NVIDIA GPUs.

Deep Dive

On May 26, the ggml-org team rolled out llama.cpp version b9334, a maintenance release focused on improving CUDA backend reliability. The key fix addresses a missing PDL (programmatic dependent launch) synchronization during Fast Walsh-Hadamard Transform (FWHT) operations. PDL is a CUDA feature that lets a dependent kernel begin launching before the preceding kernel has finished, so the dependent kernel must explicitly synchronize before it reads the earlier kernel's output. FWHT is used in some advanced model architectures for efficient linear transformations, and without that synchronization the operation could hit race conditions or produce incorrect results on CUDA hardware. The update also improves the FWHT fallback mechanism, so that when FWHT isn't optimally supported the backend degrades gracefully rather than stalling or crashing. This makes local inference more robust, especially for users experimenting with newer, non-mainstream model layers.
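
To make the transform itself concrete, here is a minimal reference sketch of an FWHT in Python. It is not llama.cpp's CUDA kernel: the function name `fwht`, the unnormalized convention, and the power-of-two length requirement are assumptions chosen for illustration. Each pass of the outer loop reads the values written by the previous pass, which is the kind of data dependency that makes correct synchronization matter once the work is spread across GPU kernels.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of `values`.

    Illustrative reference only (hypothetical helper, not a llama.cpp API).
    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    out = list(values)  # work on a copy so the caller's data is untouched
    h = 1
    while h < n:
        # One butterfly pass: combine elements that sit h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out
```

For an input of length n the sketch performs log2(n) passes of n/2 butterflies each, and applying it twice returns the original values scaled by n, which gives a simple way to sanity-check any faster implementation against it.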

The release ships with an extensive set of pre-built binaries for essentially every major platform and backend: macOS (Apple Silicon and Intel, including KleidiAI-optimized builds), Linux (CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32 on multiple architectures), Windows (CPU, CUDA 12 and 13 DLLs, Vulkan, SYCL, HIP), Android arm64, and openEuler (x86 and aarch64 with ACL). While no new major features are introduced, the focus on stability and edge-case handling aligns with llama.cpp's role as a production-grade tool for running LLMs locally without external cloud services.

Key Points
  • Fixes missing PDL sync for FWHT (Fast Walsh-Hadamard Transform) on CUDA
  • Improves fallback behavior for FWHT operations on NVIDIA GPUs
  • Ships pre-built binaries for 20+ platform/backend combinations across macOS, Linux, Windows, Android, and openEuler

Why It Matters

For professionals running local LLMs on NVIDIA GPUs, this update removes a potential source of race conditions and incorrect results in FWHT operations, making CUDA inference more reliable.