b8986
New release bundles builds for 22 platforms including iOS and Android...
The latest release of llama.cpp, tagged b8986, delivers a critical CUDA fix for the tile flash attention (FA) kernel on Nvidia Pascal GPUs (GTX 10-series, Titan Xp, etc.). This resolves a correctness issue that could degrade inference quality on these older cards when running large language models locally.
Beyond the CUDA patch, the release expands platform coverage to 22 distinct build configurations. Highlights include macOS with KleidiAI acceleration for Apple Silicon, Linux with Vulkan and ROCm 7.2, Windows with both CUDA 12.4 and 13.1 DLLs, and iOS and Android arm64 builds. Each asset is precompiled and signed with GitHub's verified GPG key, making it easier for developers to deploy llama.cpp across edge devices, servers, and workstations without manual compilation.
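Since every asset ships precompiled, one way to sanity-check which backend a bundle was built against is to link a tiny program against the shipped library and print its feature flags before deploying. The sketch below is illustrative only and assumes the llama.cpp C API (`llama_backend_init`, `llama_print_system_info`, `llama_model_load_from_file`); `model.gguf` is a placeholder path, and header locations or exact output can vary between builds.

```c
// Minimal sketch (assumption: llama.cpp C API from a recent build; "model.gguf"
// is a placeholder). Prints the build's system/backend feature flags so you can
// confirm a prebuilt binary matches the platform you intend to deploy on.
#include <stdio.h>
#include "llama.h"

int main(void) {
    llama_backend_init();

    // Report compile-time capabilities (CPU features and, depending on the
    // build, backend information such as CUDA, Vulkan, or Metal).
    printf("%s\n", llama_print_system_info());

    // Optionally confirm a model actually loads with the shipped binary.
    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        llama_backend_free();
        return 1;
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```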
- Fixes tile flash attention kernel on Nvidia Pascal architecture (GTX 10-series, Titan Xp)
- Prebuilt binaries for 22 platform/backend combinations including iOS, Android, and CUDA 13.1
- All assets signed with GitHub verified signature for secure deployment
Why It Matters
Ensures accurate local LLM inference on older Nvidia GPUs and broadens cross-platform deployment options.