llama.cpp b8954
New release fixes a key M-RoPE issue and ships binaries for Apple, Linux, Windows & more.
The open-source llama.cpp project has released version b8954, an update to its popular C++ inference engine for large language models. This release, tagged on GitHub by github-actions, primarily addresses a fix in the server component concerning M-RoPE (multimodal rotary position embedding). The change replaces n_tokens with pos_next as the positional reference: under M-RoPE, the position index can diverge from the raw token count (a block of image tokens, for example, may advance the position by far less than the number of tokens it consumes), so tracking pos_next directly avoids positional drift and can improve output coherence for multimodal architectures.
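To see why the distinction matters, here is a minimal sketch, not the actual server code: the Chunk struct and the specific grid sizes are illustrative assumptions, but they capture how an M-RoPE position counter can fall out of step with a plain token count.

```cpp
// Hypothetical illustration of the n_tokens vs. pos_next distinction under
// M-RoPE. Text tokens advance the position one-per-token, but an image chunk
// may advance it by the size of its patch grid instead of one per patch.
#include <cstdint>
#include <vector>

struct Chunk {
    int32_t n_tokens;   // tokens consumed by this chunk
    int32_t pos_delta;  // how far the M-RoPE position index advances
};

int main() {
    std::vector<Chunk> prompt = {
        {10,  10},  // 10 text tokens        -> +10 positions
        {256, 16},  // 256 image-patch tokens -> +16 positions (e.g. a 16x16 grid)
        {5,   5},   // 5 more text tokens    -> +5 positions
    };

    int32_t n_tokens = 0;
    int32_t pos_next = 0;
    for (const auto & c : prompt) {
        n_tokens += c.n_tokens;
        pos_next += c.pos_delta;
    }
    // n_tokens == 271 but pos_next == 31: using n_tokens as the next position
    // would rotate every subsequent token's embedding by the wrong angle.
    return 0;
}
```

Keeping a dedicated pos_next counter, as the fix does, means later tokens receive the correct rotary position regardless of how many tokens each multimodal chunk consumed.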
Beyond the core fix, b8954 is notable for its extensive platform support. The release offers pre-built binaries for macOS (both Apple Silicon and Intel), multiple Linux distributions (x64, arm64, s390x), Windows (x64, arm64), iOS as an XCFramework, and Android (arm64). It also supports a variety of hardware acceleration backends, including Vulkan, CUDA (versions 12 and 13), ROCm, OpenVINO, SYCL, and HIP. This broad compatibility allows developers and hobbyists to run models efficiently on everything from a MacBook to a high-end GPU server, making local AI deployment more accessible than ever.
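For developers building against the release, loading a model through the C API in llama.h might look like the sketch below. The function names reflect recent llama.cpp releases and can change between tags, and "model.gguf" is a placeholder path, so treat this as illustrative rather than canonical.

```cpp
// Minimal sketch, assuming the C API declared in llama.h of recent
// llama.cpp releases; verify names against the headers in the tag you use.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init(); // initializes whichever backend was compiled in (CUDA, Vulkan, Metal, ...)

    llama_model_params params = llama_model_default_params();
    params.n_gpu_layers = 99; // offload as many layers as possible to the GPU backend

    // "model.gguf" is a placeholder for any GGUF-format model file
    llama_model * model = llama_model_load_from_file("model.gguf", params);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context and run inference here ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```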
- Fixes a server M-RoPE issue by using pos_next as the positional reference instead of n_tokens
- Provides pre-built binaries for macOS, Linux, Windows, iOS, and Android
- Supports multiple backends including CUDA 12/13, Vulkan, ROCm, OpenVINO, and SYCL
Why It Matters
llama.cpp b8954 makes local AI inference more robust and accessible across nearly every major platform.