Developer Tools

b8946

New release removes duplicate wo_s scale after build_attn for Qwen3 and LLaMA...

Deep Dive

llama.cpp, the popular C/C++ implementation for running large language models locally, has shipped version b8946 with a critical fix for attention graph building in the Qwen3 and LLaMA architectures. The patch, contributed by Yash Nankani from NVIDIA, removes a duplicate application of the wo_s scale after the build_attn step, ensuring correct attention computation and potentially improving model output quality for users running these architectures locally.

Beyond the bug fix, this release dramatically expands platform support. The build matrix now includes macOS Apple Silicon (arm64) with and without KleidiAI optimizations, macOS Intel (x64), and iOS XCFramework. Linux users gain Ubuntu builds for x64, arm64, s390x (CPU), plus GPU-accelerated variants for Vulkan, ROCm 7.2, OpenVINO, and SYCL (FP32/FP16). Windows gets broad coverage with x64 and arm64 CPU builds, CUDA 12 and 13 DLLs, Vulkan, SYCL, and HIP. Android arm64 (CPU) and openEuler (x86 and aarch64) with ACL Graph support round out the list, making this one of the most platform-complete releases yet.

Key Points
  • Fixes duplicate wo_s scale in build_attn for Qwen3 and LLaMA models
  • Adds macOS Apple Silicon with KleidiAI, plus Intel and iOS builds
  • Expands Windows support to include CUDA 12/13, Vulkan, SYCL, and HIP

Why It Matters

Local LLM inference gets more reliable and accessible across 20+ platform configurations.