Developer Tools

llama.cpp b9106 adds asymmetric Flash Attention on Vulkan

New release targets faster, more memory-efficient LLM inference on Vulkan-capable GPUs (NVIDIA, AMD, Intel)

Deep Dive

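llama.cpp, the popular C++ inference engine for large language models, has released version b9106 with a notable improvement for Vulkan users. The update adds asymmetric Flash Attention (FA) support to three Vulkan code paths: scalar, mmq (quantized matrix multiplication), and coopmat1 (the path built on Vulkan's KHR cooperative matrix extension). Flash Attention is a memory-efficient attention algorithm: it processes keys and values in tiles and never materializes the full attention score matrix, whose size grows quadratically with sequence length in transformer models. "Asymmetric" here most likely refers to attention in which the K and V caches are stored in different data types (an assumption based on how the term is used elsewhere in the project, not a confirmed detail of this release). That kind of setup matters for real-world chat and document processing, where long contexts make cache memory the limiting factor.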
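To make the memory argument concrete, here is a minimal sketch in plain Python of the tiling idea behind Flash Attention for a single query vector. It is an illustration only, not llama.cpp's Vulkan shader code; the function names `naive_attention` and `tiled_attention` and the `tile` parameter are hypothetical. The tiled version keeps just a running maximum, a running sum, and an output accumulator, so it never holds all attention scores at once, yet it should match the reference result up to floating-point rounding. Both functions assume a non-empty list of keys.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: compute every score first, then softmax, then mix values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Tiled version: visit K/V in chunks with an online (streaming) softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * len(values[0])  # weighted sum of values, relative to running_max

    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results so they are relative to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]

        running_max = new_max

    return [a / running_sum for a in acc]
```

Only one tile of scores exists at a time, which is why working memory stays bounded by the tile size rather than growing with the full sequence length. Real GPU kernels apply the same rescaling trick to blocks of queries and keys in on-chip memory.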

Beyond the Vulkan enhancements, the release includes precompiled binaries for an extensive list of platforms: macOS Apple Silicon (both standard and KleidiAI-enabled), macOS Intel, iOS XCFramework, Linux (x64/arm64/s390x with CPU or Vulkan), Ubuntu (ROCm, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android arm64, and openEuler (with ACL Graph). The Flash Attention change applies to the Vulkan builds; developers on other backends can still drop in the new binaries, but should not expect speed or memory gains from this particular change. The release contains 30 assets, reflecting the breadth of supported configurations.

Key Points
  • Adds asymmetric Flash Attention to Vulkan scalar, mmq, and coopmat1 paths for faster inference
  • Prebuilt binaries for macOS, Linux, Windows, Android, iOS, and openEuler with multiple GPU backends
  • Increases memory efficiency for LLM inference on Vulkan-enabled hardware (NVIDIA, AMD, Intel)

Why It Matters

Extending Flash Attention to more Vulkan code paths brings a key memory optimization to NVIDIA, AMD, and Intel GPUs, so inference can run faster on hardware where CUDA is not an option.