Developer Tools

b8946

New release removes duplicate wo_s scale after build_attn for Qwen3 and LLaMA...

Deep Dive

llama.cpp, the popular C/C++ implementation for running large language models locally, has shipped version b8946 with a critical fix for attention graph building in the Qwen3 and LLaMA architectures. The patch, contributed by Yash Nankani from NVIDIA, removes a duplicate application of the wo_s scale after the build_attn step, ensuring correct attention computation and potentially improving model output quality for users running these architectures locally.

Beyond the bug fix, this release dramatically expands platform support. The build matrix now includes macOS Apple Silicon (arm64) with and without KleidiAI optimizations, macOS Intel (x64), and iOS XCFramework. Linux users gain Ubuntu builds for x64, arm64, s390x (CPU), plus GPU-accelerated variants for Vulkan, ROCm 7.2, OpenVINO, and SYCL (FP32/FP16). Windows gets broad coverage with x64 and arm64 CPU builds, CUDA 12 and 13 DLLs, Vulkan, SYCL, and HIP. Android arm64 (CPU) and openEuler (x86 and aarch64) with ACL Graph support round out the list, making this one of the most platform-complete releases yet.

Key Points
  • Fixes duplicate wo_s scale in build_attn for Qwen3 and LLaMA models
  • Adds macOS Apple Silicon with KleidiAI, plus Intel and iOS builds
  • Expands Windows support to include CUDA 12/13, Vulkan, SYCL, and HIP

Why It Matters

Local LLM inference gets more reliable and accessible across 20+ platform configurations.