b8995
New release supports Q1_0 and mixed quant types for faster LLM inference on Vulkan.
The latest release of llama.cpp (b8995) from ggml-org adds Vulkan support for asymmetric flash attention in the coopmat2 path, enabling mixed quantization types during inference. The patch also adds support for Q1_0, an extreme 1-bit quantization method that has seen experimental use, and reorders CUDA kernel cases. The release ships pre-built binaries covering macOS (Apple Silicon with optional KleidiAI, Intel x64, iOS XCFramework), Linux (CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler (x86 and aarch64 with ACL Graph).
The release builds on a long-standing design in the coopmat2 flash-attention shader that anticipated mixed quantization from the start; that capability is now fully realized. This opens new optimization possibilities for running large language models on Vulkan-compatible GPUs, particularly for users experimenting with aggressive quantization levels. The project, which has amassed 108k stars and 17.7k forks, remains a leading open-source inference engine for LLMs on consumer hardware.
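To make the mixed-quantization idea concrete, here is a toy NumPy sketch of attention where the K cache and V cache use different bit widths and are dequantized before the matmuls. This is an illustration of the concept only, not the coopmat2 shader or llama.cpp's actual kernels; all function names and the 8-bit/4-bit pairing are illustrative assumptions.

```python
import numpy as np

def quantize(x, bits):
    # Toy symmetric per-tensor quantization to a signed integer grid
    # (real GGUF formats use per-block scales; this is simplified).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
seq, d = 8, 16
Q = rng.standard_normal((seq, d)).astype(np.float32)
K = rng.standard_normal((seq, d)).astype(np.float32)
V = rng.standard_normal((seq, d)).astype(np.float32)

# Mixed quant types: K stored at 8 bits, V at 4 bits.
Kq, ks = quantize(K, bits=8)
Vq, vs = quantize(V, bits=4)

# Standard scaled-dot-product attention over the dequantized caches.
scores = Q @ dequantize(Kq, ks).T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ dequantize(Vq, vs)
```

The point of the sketch: because K only feeds the score matmul while V feeds the output, the two caches can tolerate different precisions, which is what an "asymmetric" flash-attention path exploits.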
- Vulkan coopmat2 path now supports asymmetric flash attention, enabling mixed quantization types for LLM inference.
- New support for Q1_0, an extreme 1-bit quantization type, expanding options for experimenting with aggressive model compression.
- Includes pre-built binaries for macOS, Linux, Windows, Android, and openEuler across CPU, Vulkan, CUDA, ROCm, and more.
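For intuition on what 1-bit quantization means, here is a toy sketch: each weight is reduced to its sign plus one shared float scale. This is not llama.cpp's actual Q1_0 format (which presumably packs bits and uses per-block scales); the names and the mean-absolute-value scale are illustrative assumptions.

```python
import numpy as np

def quantize_1bit(x):
    # Keep only the sign of each weight, plus one float scale
    # per tensor (here: the mean absolute value).
    scale = np.abs(x).mean()
    signs = np.where(x >= 0, 1, -1).astype(np.int8)
    return signs, scale

def dequantize_1bit(signs, scale):
    return signs.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal(64).astype(np.float32)
signs, scale = quantize_1bit(w)
w_hat = dequantize_1bit(signs, scale)
# Storage drops from 32 bits per weight to ~1 bit plus one scale,
# at the cost of keeping only the direction of each weight.
```

The severe information loss is why such formats remain experimental and are paired with careful kernel support, as in this release.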
Why It Matters
Unlocks more flexible and efficient LLM inference on Vulkan GPUs, benefiting developers and power users.