b8966
New CUDA kernel supports 320/256 head sizes with GQA ratio 32.
The latest release of llama.cpp, tagged b8966, introduces flash-attention support for the head sizes DKQ=320 and DV=256, with ncols2=32 to cover a Grouped Query Attention (GQA) ratio of 32. The update targets the Mistral Small 4 model, which uses these head sizes. The implementation adds MMA-f16 and tile kernel configurations, dispatch logic, and template instances, along with a new .cu file for the tile kernel. A safeguard returns BEST_FATTN_KERNEL_NONE if the GQA ratio is not exactly 32.
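As a rough illustration of that dispatch safeguard, the sketch below shows how such a guard could be expressed. Only BEST_FATTN_KERNEL_NONE, the DKQ=320/DV=256 head sizes, and the GQA-ratio-32 restriction come from the release notes; the function name, the other enum values, and the surrounding structure are hypothetical rather than the actual llama.cpp code.

```cuda
// Sketch of a head-size/GQA dispatch guard (illustrative, not the actual
// llama.cpp source). Only BEST_FATTN_KERNEL_NONE, DKQ=320, DV=256, and the
// GQA == 32 restriction are taken from the release notes.
enum best_fattn_kernel {
    BEST_FATTN_KERNEL_NONE = 0,
    BEST_FATTN_KERNEL_TILE,
    BEST_FATTN_KERNEL_MMA_F16,
};

static best_fattn_kernel pick_fattn_kernel(int dkq, int dv, int gqa_ratio, bool has_tensor_cores) {
    if (dkq == 320 && dv == 256) {
        // The new kernels are only instantiated with ncols2 = 32, i.e. a GQA
        // ratio of exactly 32 as used by Mistral Small 4; anything else gets
        // no flash-attention kernel.
        if (gqa_ratio != 32) {
            return BEST_FATTN_KERNEL_NONE;
        }
        return has_tensor_cores ? BEST_FATTN_KERNEL_MMA_F16 : BEST_FATTN_KERNEL_TILE;
    }
    // ... other head sizes handled elsewhere ...
    return BEST_FATTN_KERNEL_NONE;
}
```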
A critical bug fix addresses an issue with sinks=1, where two warp groups were created but shared the same sink index, causing incorrect output. The fix introduces a sink_base variable that gives each warp group a unique base index, derived from threadIdx.y / np. The release also updates the generate_cu_files.py script accordingly.

The build assets span multiple platforms: macOS (Apple Silicon with and without KleidiAI, Intel, iOS XCFramework), Linux (x64, arm64, and s390x CPUs, plus Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32/FP16), Android (arm64), Windows (x64 and arm64 CPUs, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler (x86 and aarch64 with ACL Graph).
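The sink-index correction described above can be pictured with a minimal device-side sketch. Only the sink_base variable and the threadIdx.y / np expression come from the release description; the parameter names (np as warps per warp group, sinks) and the way the value is consumed are assumptions for illustration.

```cuda
// Illustrative fragment, not the actual kernel code.
__device__ float warp_group_sink(const float * sinks, const int np) {
    // Previously both warp groups read the same sink entry, which corrupted
    // the output of the second group. Deriving a base index from
    // threadIdx.y / np gives each warp group its own sink.
    const int sink_base = threadIdx.y / np;
    return sinks[sink_base];
}
```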
- Adds flash-attention support for DKQ=320 and DV=256 head sizes, targeting Mistral Small 4.
- Includes MMA-f16 and tile kernel configurations plus dispatch logic, restricted to a GQA ratio of 32.
- Fixes sink index bug for sinks=1 by adding sink_base for each warp group.
Why It Matters
Enables faster inference for Mistral Small 4 on local NVIDIA GPUs by extending flash attention to the model's 320/256 head sizes.