b8916
New update patches SWA-full logic and ships binaries for 20+ platforms.
The llama.cpp project, a widely used C/C++ implementation for running large language models locally, has released version b8916. This release primarily fixes a bug in the server component's SWA-full (full Sliding Window Attention) logic, which is important for models that rely on this attention mechanism for efficient context processing. The fix ensures that the server correctly handles SWA-full configurations, preventing potential inference errors or degraded model performance.
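For readers less familiar with the mechanism, the minimal Python sketch below illustrates the masking idea behind sliding window attention: each token attends only to itself and a fixed number of preceding tokens. It is a conceptual illustration only, not llama.cpp's actual implementation and not the specifics of the b8916 fix; the window size and function name are illustrative.

```python
# Conceptual illustration of sliding window attention (SWA), not llama.cpp code:
# each query token may only attend to itself and the previous `window - 1` tokens,
# which bounds attention cost and cache growth for long contexts.

def swa_mask(n_tokens: int, window: int) -> list[list[bool]]:
    """Return a causal mask where True means 'query i may attend to key j'."""
    return [
        [(j <= i) and (i - j < window) for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

if __name__ == "__main__":
    # With a window of 4, token 6 attends only to tokens 3..6.
    for row in swa_mask(8, 4):
        print("".join("x" if keep else "." for keep in row))
```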
Beyond the core fix, this release is notable for its extensive multi-platform binary support. The project now provides pre-compiled binaries for over 20 platform configurations, including macOS (Apple Silicon arm64, Apple Silicon with KleidiAI acceleration, Intel x64, and iOS XCFramework), Linux (x64 and arm64 CPU, plus Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16 variants), Windows (x64 and arm64 CPU, plus CUDA 12.4, CUDA 13.1, Vulkan, SYCL, and HIP), Android (arm64 CPU), and openEuler (x86 and aarch64 with ACL Graph support). This broad availability lowers the barrier for developers and users to run LLMs on diverse hardware, from consumer laptops to enterprise servers.
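For a developer who picks up one of these pre-built binaries, getting from download to a first response is short. The sketch below assumes a llama-server instance is already running locally with a model loaded and serving its OpenAI-compatible chat endpoint; the URL, port, and request fields are illustrative defaults, not values specific to b8916.

```python
# Minimal sketch: querying a locally running llama-server over HTTP.
# Assumes the server was started separately with a GGUF model and is
# listening on localhost:8080 (adjust the URL to match your setup).
import json
import urllib.request


def chat(prompt: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Summarize sliding window attention in one sentence."))
```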
- Fixed SWA-full logic in the server component, improving inference accuracy for sliding window attention models.
- Pre-built binaries for 20+ platform variations, including macOS, Windows, Linux, Android, and openEuler.
- Supports GPU acceleration via CUDA 12/13, Vulkan, ROCm 7.2, SYCL, and HIP for optimized local inference.
Why It Matters
Enables more reliable local LLM inference across diverse hardware, from consumer devices to enterprise servers.