llama.cpp b9490 boosts ARM CPU inference with runtime SVE width detection
llama.cpp b9490 auto-senses ARM SVE vector length for faster LLM inference
llama.cpp, the go-to C++ library for running large language models locally, just dropped release b9490. The headline change is runtime SVE (Scalable Vector Extension) width detection for the Fast Walsh-Hadamard Transform (FWHT) kernel on ARM CPUs. The FWHT multiplies a vector by a Hadamard matrix using only additions and subtractions, in O(n log n) steps, and it shows up in some LLM operations, for example Hadamard rotations applied around attention and quantization steps. Previously, the code assumed a fixed SVE vector length, which could leave wider vector units underused or require workarounds on hardware with a different width. Now llama.cpp queries the CPU at runtime for its actual SVE width (e.g., 128, 256, or 512 bits) and tunes the FWHT kernel accordingly.
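To make the transform concrete, here is a minimal scalar sketch of an in-place FWHT. It is not llama.cpp's kernel: the function name `fwht` and the `lanes` parameter are hypothetical, chosen only to show where a vector width would enter the loop structure. `lanes` stands for the number of floats per vector register (4, 8, or 16 for 128-, 256-, or 512-bit vectors), and a real SVE kernel would presumably replace the innermost loop with one vector add and one vector subtract.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place, unnormalized Fast Walsh-Hadamard Transform (natural ordering).
// data.size() must be a power of two.
// `lanes` is a hypothetical tuning knob: how many butterflies are grouped
// into one chunk, mirroring the number of floats in one vector register.
// It changes only the loop blocking, never the numerical result.
void fwht(std::vector<float>& data, std::size_t lanes) {
    const std::size_t n = data.size();
    assert((n & (n - 1)) == 0 && "length must be a power of two");
    if (lanes == 0) lanes = 1;

    // log2(n) stages; stage h pairs element k with element k + h.
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t base = 0; base < n; base += 2 * h) {
            // Walk the h butterflies of this block in chunks of `lanes`.
            for (std::size_t j = 0; j < h; j += lanes) {
                const std::size_t end = (j + lanes < h) ? j + lanes : h;
                // A vectorized kernel would do this chunk in one add/sub pair.
                for (std::size_t k = j; k < end; ++k) {
                    const float a = data[base + k];
                    const float b = data[base + k + h];
                    data[base + k]     = a + b;
                    data[base + k + h] = a - b;
                }
            }
        }
    }
}
```

For the input `{1, 0, 1, 0}` this yields `{2, 2, 0, 0}`, and applying the transform twice returns the original vector scaled by its length. The chunk size matters for speed rather than correctness: a kernel hard-coded for one width either leaves part of a wider register idle or has to split its work on a narrower one.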
This improvement benefits ARM64 CPUs that implement SVE, which includes Linux servers such as AWS Graviton3 and later, and Android ARM64 devices whose cores expose SVE. Apple Silicon (M-series) chips do not expose SVE to regular user code, so they are unlikely to see a change from this specific optimization. On SVE-capable machines, workloads that exercise the FWHT path may see lower latency and higher throughput, particularly with larger context windows, where the number of transform operations grows. The release also includes the usual matrix of platform builds (CPU, CUDA, Vulkan, ROCm, etc.) and a fix for an earlier bug. Early community reaction has been positive: this is a subtle but meaningful performance boost for ARM-based LLM deployments.
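Whether a given ARM64 machine exposes SVE, and at what width, can be checked at runtime. The sketch below shows one way to do that on Linux, using the kernel's `PR_SVE_GET_VL` prctl; it is an illustration rather than llama.cpp's actual mechanism, which this summary does not describe. The helper names `sve_vector_bits` and `f32_lanes` are hypothetical. Inside code already compiled for SVE, the ACLE intrinsic `svcntb()` from `<arm_sve.h>` reports the same length in bytes.

```cpp
#include <cstddef>
#if defined(__linux__) && defined(__aarch64__)
#include <sys/prctl.h>
#endif

// Hypothetical helper: SVE vector length in bits, or 0 when SVE is
// unavailable or cannot be queried on this platform.
std::size_t sve_vector_bits() {
#if defined(__linux__) && defined(__aarch64__) && defined(PR_SVE_GET_VL)
    // On success the low 16 bits hold the vector length in bytes;
    // a negative return means the kernel or CPU has no SVE support.
    const int ret = prctl(PR_SVE_GET_VL);
    if (ret < 0) {
        return 0;
    }
    return static_cast<std::size_t>(ret & PR_SVE_VL_LEN_MASK) * 8;
#else
    return 0;  // not Linux/ARM64: treat as "no SVE"
#endif
}

// Hypothetical helper: number of 32-bit floats per vector register.
// Falls back to 4 lanes (a 128-bit NEON-sized register) without SVE.
std::size_t f32_lanes(std::size_t vector_bits) {
    return vector_bits >= 128 ? vector_bits / 32 : 4;
}
```

A caller would query the width once at startup, for example `const std::size_t lanes = f32_lanes(sve_vector_bits());`, and pass the result to whichever kernel needs it, rather than baking one width in at compile time.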
- Runtime SVE width detection optimizes FWHT for ARM CPUs, adapting to 128/256/512-bit vector lengths
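- Benefits SVE-capable ARM64 hardware running llama.cpp, such as Linux servers (e.g., Graviton3 and later) and Android devices whose cores expose SVE; Apple Silicon does not expose SVE and is unlikely to be affected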
- Part of release b9490, with builds for CPU, CUDA, Vulkan, ROCm, and more
Why It Matters
Smarter ARM CPU utilization means faster local LLM inference for mobile and edge devices without extra hardware.