Developer Tools

b8971

Critical bug fix for FlashAttention on devices without subgroup support

Deep Dive

The llama.cpp project, a high-performance C/C++ implementation of LLM inference, has released version b8971. The update addresses a bug in the FlashAttention support check for WebGPU devices that lack subgroup capabilities: the fix ensures the FlashAttention path is set to none when the kv_tile does not fit, preventing potential errors or crashes during attention computation. The release is signed with GitHub's verified signature (GPG key ID: B5690EEEBB952194), confirming its authenticity.
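To make the fix concrete, here is a minimal C++ sketch of the kind of capability check involved. This is an illustration only, not the actual llama.cpp/ggml WebGPU backend code: every name in it (fa_path, webgpu_device_caps, select_fa_path, kv_tile_bytes) is hypothetical, and interpreting "the kv_tile doesn't fit" as a workgroup-memory limit is an assumption. The point is the control flow the release notes describe: when the tile cannot fit, report that no FlashAttention path exists rather than dispatching a kernel the device cannot run.

    #include <cstdint>
    #include <cstdio>

    // Hypothetical sketch; names do not correspond to the real
    // llama.cpp / ggml WebGPU source.
    enum class fa_path { subgroup, basic, none };

    struct webgpu_device_caps {
        bool     has_subgroups;       // device exposes subgroup operations
        uint32_t max_workgroup_bytes; // workgroup-local memory available
    };

    // Choose a FlashAttention path for a given KV tile size, or report
    // that none is viable so the caller can fall back to regular attention.
    static fa_path select_fa_path(const webgpu_device_caps & caps,
                                  uint32_t kv_tile_bytes) {
        // If the KV tile cannot fit, no FlashAttention kernel is valid;
        // return none here instead of assuming some path is available.
        if (kv_tile_bytes > caps.max_workgroup_bytes) {
            return fa_path::none;
        }
        // Without subgroup operations, only a scalar kernel is usable.
        if (!caps.has_subgroups) {
            return fa_path::basic;
        }
        return fa_path::subgroup;
    }

    int main() {
        // A device without subgroups and a tile too large to fit.
        webgpu_device_caps caps = { false, 16384 };
        fa_path p = select_fa_path(caps, 32768);
        std::printf("path = %s\n",
                    p == fa_path::none  ? "none" :
                    p == fa_path::basic ? "basic" : "subgroup");
        return 0; // prints "path = none"
    }

A caller receiving fa_path::none would disable FlashAttention for that configuration and fall back to the standard attention implementation, which is the safe behavior the corrected support check guarantees.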

This release is available across a wide range of platforms and hardware configurations:
  • macOS: Apple Silicon (arm64), Apple Silicon with KleidiAI enabled, Intel (x64), and an iOS XCFramework
  • Linux (Ubuntu): x64 (CPU), arm64 (CPU), s390x (CPU), x64 (Vulkan), arm64 (Vulkan), x64 (ROCm 7.2), x64 (OpenVINO), x64 (SYCL FP32), and x64 (SYCL FP16)
  • Android: arm64 (CPU)
  • Windows: x64 (CPU), arm64 (CPU), x64 (CUDA 12, with CUDA 12.4 DLLs), x64 (CUDA 13, with CUDA 13.1 DLLs), x64 (Vulkan), x64 (SYCL), and x64 (HIP)
  • openEuler: x86 (310p), x86 (910b, ACL Graph), aarch64 (310p), and aarch64 (910b, ACL Graph)

Key Points
  • Fixes FlashAttention support check for WebGPU devices without subgroup support
  • Sets path to none when kv_tile doesn't fit, preventing attention computation errors
  • Available across 20+ platform configurations including macOS, Linux, Windows, Android, and openEuler

Why It Matters

Ensures attention computation degrades gracefully instead of failing on WebGPU hardware without subgroup support, keeping LLM inference dependable across diverse edge devices.