llama.cpp b8954
New release fixes a key M-RoPE issue and ships binaries for Apple, Linux, Windows & more.
The open-source llama.cpp project has released version b8954, an update to its popular C++ inference engine for large language models. This release, tagged on GitHub by github-actions, primarily addresses a fix in the server component concerning M-RoPE (multimodal rotary position embedding). The change replaces n_tokens with pos_next as the positional reference: under M-RoPE, the position index can diverge from the raw token count (a block of image tokens, for example, may advance the position by far less than the number of tokens it consumes), so tracking pos_next directly avoids positional drift and can improve output coherence for multimodal architectures.
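To see why the distinction matters, here is a minimal sketch, not the actual server code: the Chunk struct and the specific grid sizes are illustrative assumptions, but they capture how an M-RoPE position counter can fall out of step with a plain token count.

```cpp
// Hypothetical illustration of the n_tokens vs. pos_next distinction under
// M-RoPE. Text tokens advance the position one-per-token, but an image chunk
// may advance it by the size of its patch grid instead of one per patch.
#include <cstdint>
#include <vector>

struct Chunk {
    int32_t n_tokens;   // tokens consumed by this chunk
    int32_t pos_delta;  // how far the M-RoPE position index advances
};

int main() {
    std::vector<Chunk> prompt = {
        {10,  10},  // 10 text tokens        -> +10 positions
        {256, 16},  // 256 image-patch tokens -> +16 positions (e.g. a 16x16 grid)
        {5,   5},   // 5 more text tokens    -> +5 positions
    };

    int32_t n_tokens = 0;
    int32_t pos_next = 0;
    for (const auto & c : prompt) {
        n_tokens += c.n_tokens;
        pos_next += c.pos_delta;
    }
    // n_tokens == 271 but pos_next == 31: using n_tokens as the next position
    // would rotate every subsequent token's embedding by the wrong angle.
    return 0;
}
```

Keeping a dedicated pos_next counter, as the fix does, means later tokens receive the correct rotary position regardless of how many tokens each multimodal chunk consumed.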
Beyond the core fix, b8954 is notable for its extensive platform support. The release offers pre-built binaries for macOS (both Apple Silicon and Intel), multiple Linux distributions (x64, arm64, s390x), Windows (x64, arm64), iOS as an XCFramework, and Android (arm64). It also supports a variety of hardware acceleration backends, including Vulkan, CUDA (versions 12 and 13), ROCm, OpenVINO, SYCL, and HIP. This broad compatibility allows developers and hobbyists to run models efficiently on everything from a MacBook to a high-end GPU server, making local AI deployment more accessible than ever.
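For developers building against the release, loading a model through the C API in llama.h might look like the sketch below. The function names reflect recent llama.cpp releases and can change between tags, and "model.gguf" is a placeholder path, so treat this as illustrative rather than canonical.

```cpp
// Minimal sketch, assuming the C API declared in llama.h of recent
// llama.cpp releases; verify names against the headers in the tag you use.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init(); // initializes whichever backend was compiled in (CUDA, Vulkan, Metal, ...)

    llama_model_params params = llama_model_default_params();
    params.n_gpu_layers = 99; // offload as many layers as possible to the GPU backend

    // "model.gguf" is a placeholder for any GGUF-format model file
    llama_model * model = llama_model_load_from_file("model.gguf", params);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context and run inference here ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```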
- Fixes a server M-RoPE issue by using pos_next as the positional reference instead of n_tokens
- Provides pre-built binaries for macOS, Linux, Windows, iOS, and Android
- Supports multiple backends including CUDA 12/13, Vulkan, ROCm, OpenVINO, and SYCL
Why It Matters
llama.cpp b8954 makes local AI inference more robust and accessible across nearly every major platform.