Developer Tools

llama.cpp b9828 boosts inference with OpenCL flash attention improvements

New release accelerates LLM inference on GPUs with reworked flash attention kernels...

Deep Dive

The b9828 release of llama.cpp, the popular open-source C/C++ LLM inference engine, reworks flash attention (FA) in its OpenCL backend. The commit, signed and released on GitHub, reworks the FA kernels for the f16 and f32 data types and introduces prefill prepass kernels that pad KV tiles and mask tiles to a multiple of BLOCK_N. A new flash_attn_blk_f16 kernel classifies each KV tile per query block as fully masked, mixed, or fully unmasked, so the main kernel can skip fully masked tiles and reduce mask lookups for fully unmasked ones.
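The tile classification described above can be sketched in plain Python. This is a minimal illustration, not the OpenCL kernel: the value BLOCK_N = 4, the use of -inf both as the masked value and as the padding value, and the names TileKind, pad_to_block, and classify_tiles are all assumptions made for the example.

```python
from enum import Enum

BLOCK_N = 4             # hypothetical KV tile width; the real value is a tuning parameter
MASKED = float("-inf")  # assumed additive mask value that removes a key from attention


class TileKind(Enum):
    FULLY_MASKED = 0    # every entry is -inf: the tile contributes nothing
    MIXED = 1           # anything else: the mask must be read per element
    FULLY_UNMASKED = 2  # every entry is 0: the mask read can be skipped


def pad_to_block(mask_rows, block_n=BLOCK_N):
    """Pad each mask row with -inf so its length is a multiple of block_n."""
    n_kv = len(mask_rows[0])
    padded = -(-n_kv // block_n) * block_n  # round up to a multiple of block_n
    return [row + [MASKED] * (padded - n_kv) for row in mask_rows]


def classify_tiles(mask_rows, block_n=BLOCK_N):
    """Classify each KV tile of one query block (rows must already be padded)."""
    kinds = []
    for start in range(0, len(mask_rows[0]), block_n):
        # Gather this tile's mask entries across every query row in the block.
        tile = [v for row in mask_rows for v in row[start:start + block_n]]
        if all(v == MASKED for v in tile):
            kinds.append(TileKind.FULLY_MASKED)
        elif all(v == 0.0 for v in tile):
            kinds.append(TileKind.FULLY_UNMASKED)
        else:
            kinds.append(TileKind.MIXED)
    return kinds


# Causal mask for a query block of 2 rows (positions 4 and 5) over 10 keys.
causal = [[0.0 if k <= q else MASKED for k in range(10)] for q in (4, 5)]
padded = pad_to_block(causal)
kinds = classify_tiles(padded)
```

For a causal mask, tiles to the right of the diagonal come out as fully masked and can be skipped, tiles to the left are fully unmasked and need no mask read, and only the tiles that cross the diagonal require per-element lookups.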

Additionally, the release adds FA kernels for the q4_0 and q8_0 quantized formats, along with set_rows and dequant kernels for these types. It includes an FA tile tuning table that can be overridden, and FA is wired up on the host side. The OpenCL backend now also handles q4_0 MoE tensors in the same way, using a structure-of-arrays (SOA) layout. The release notes also list cosmetic fixes, refactoring, and a fix for infinity handling under -cl-finite-math-only.

The release is built for multiple platforms: macOS on Apple Silicon (arm64); Linux (x64/arm64 with CPU, Vulkan, ROCm, OpenVINO, SYCL); Windows (x64/arm64 with CPU, OpenCL Adreno, CUDA, Vulkan, OpenVINO, SYCL, HIP); Android (arm64 CPU); and openEuler (with ACL Graph support).
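To make the q4_0 and q8_0 formats concrete, the sketch below decodes one block of each in Python. It assumes the ggml block layouts: 32 elements per block, a little-endian fp16 scale followed by the quantized values, with q4_0 storing element i in the low nibble and element i+16 in the high nibble of byte i. The split_soa helper is a hypothetical illustration of a structure-of-arrays repack, not the backend's actual code.

```python
import struct

QK = 32  # elements per block in both q4_0 and q8_0


def dequant_q8_0(block):
    """Decode one 34-byte q8_0 block: fp16 scale + 32 signed 8-bit values."""
    (d,) = struct.unpack_from("<e", block, 0)
    qs = struct.unpack_from("<32b", block, 2)
    return [q * d for q in qs]


def dequant_q4_0(block):
    """Decode one 18-byte q4_0 block: fp16 scale + 16 bytes of packed nibbles."""
    (d,) = struct.unpack_from("<e", block, 0)
    qs = block[2:18]
    low = [((b & 0x0F) - 8) * d for b in qs]  # elements 0..15
    high = [((b >> 4) - 8) * d for b in qs]   # elements 16..31
    return low + high


def split_soa(blocks):
    """Hypothetical SOA repack: all scales in one buffer, all quants in another."""
    scales = b"".join(b[:2] for b in blocks)
    quants = b"".join(b[2:] for b in blocks)
    return scales, quants


# Example blocks, both with scale 0.5.
q8 = struct.pack("<e32b", 0.5, *range(-16, 16))
q4 = struct.pack("<e16B", 0.5, *([0xF0] * 16))

q8_values = dequant_q8_0(q8)
q4_values = dequant_q4_0(q4)
scales, quants = split_soa([q4, q4])
```

Keeping scales and quantized values in separate contiguous buffers, as split_soa does, is one way to let GPU threads read each array with uniform strides; whether the backend lays them out exactly like this is not stated in the release notes.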

Key Points
  • OpenCL flash attention kernel reworked for f16, f32, q4_0, and q8_0 with prefill prepass and tile classification.
  • Includes FA tile tuning table with override, host-side wiring, and MoE tensor support for q4_0.
  • Multi-platform builds: macOS, Linux, Windows, Android, openEuler; GPU support via OpenCL, CUDA, Vulkan, ROCm, and more.

Why It Matters

OpenCL runs on GPUs from many vendors (AMD, Intel, and mobile parts such as Adreno), so faster OpenCL flash attention can speed up LLM inference on hardware that CUDA does not cover.
