llama.cpp b9490 boosts ARM CPU inference with runtime SVE width detection

llama.cpp Releases June 03, 2026

⚡llama.cpp b9490 auto-senses ARM SVE vector length for faster LLM inference

Deep Dive

llama.cpp, the go-to C++ library for running large language models locally, has released build b9490. The headline change is runtime SVE (Scalable Vector Extension) width detection for the Fast Walsh-Hadamard Transform (FWHT) on ARM CPUs. FWHT is used in certain LLM operations, such as attention mechanisms and quantization. Previously, the code assumed a fixed SVE vector length, which could leave wider vector units underused or require workarounds. Now llama.cpp queries the CPU at runtime for its actual SVE width (for example 128, 256, or 512 bits) and tunes the FWHT kernel accordingly.
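To make the idea concrete, here is a minimal sketch of what runtime width detection paired with an FWHT might look like. It is not the code from the llama.cpp release: the names detect_sve_bits, lanes_for_f32, and fwht are hypothetical, the detection path assumes Linux on AArch64 (it uses the kernel's PR_SVE_GET_VL prctl and reports 0 everywhere else), and the transform is a plain scalar butterfly rather than a tuned SVE kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/prctl.h>
#endif

// Hypothetical helper: SVE vector length in bits, or 0 if it cannot be
// determined (no SVE, non-Linux OS, or headers without PR_SVE_GET_VL).
inline int detect_sve_bits() {
#if defined(__linux__) && defined(__aarch64__) && defined(PR_SVE_GET_VL)
    const int ret = prctl(PR_SVE_GET_VL);
    if (ret > 0) {
        // The low bits hold the vector length in bytes; the rest are flags.
        return (ret & PR_SVE_VL_LEN_MASK) * 8;
    }
#endif
    return 0;
}

// Hypothetical helper: float32 lanes per vector for a given width.
// Assumption: fall back to 128 bits (the NEON width) when SVE is unknown.
inline int lanes_for_f32(int sve_bits) {
    return (sve_bits > 0 ? sve_bits : 128) / 32;
}

// In-place, unnormalized FWHT using only additions and subtractions.
// The input length must be a power of two.
inline void fwht(std::vector<float>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t h = 1; h < n; h *= 2) {          // butterfly half-width
        for (std::size_t i = 0; i < n; i += 2 * h) {  // start of each block
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```

Applying fwht to {1, 0, 1, 0, 0, 1, 1, 0} yields {4, 2, 0, -2, 0, 2, 0, 2}. In a real kernel, the lane count from lanes_for_f32(detect_sve_bits()) would presumably decide how many elements the innermost loop handles per vector instruction; here it is only computed, to show where the detected width would feed in.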

This improvement targets ARM64 platforms: Linux ARM64 servers (e.g., Graviton, Ampere), Android ARM64 devices, and Apple Silicon (M-series chips). The speedup applies only where the CPU core actually implements SVE, so results will vary by chip. On such hardware, users can expect lower latency and higher throughput for models that use FWHT, particularly with larger context windows, where more transform operations are performed. The release also includes the usual matrix of platform builds (CPU, CUDA, Vulkan, ROCm, etc.) and a fix for an earlier bug. The open-source community has reacted positively: this is a subtle but meaningful performance boost for ARM-based LLM deployments.

Key Points

Runtime SVE width detection optimizes FWHT for ARM CPUs, adapting to 128/256/512-bit vector lengths
Targets ARM64 builds of llama.cpp on Apple Silicon, Linux ARM64, and Android; gains depend on the CPU implementing SVE
Part of a stable release (b9490) with builds for CPU, CUDA, Vulkan, ROCm, and more

Why It Matters

Smarter ARM CPU utilization means faster local LLM inference for mobile and edge devices without extra hardware.
