b8969
New kernel optimization targets Fujitsu's A64FX and other ARM SVE hardware for faster LLM inference.
The llama.cpp project, a popular open-source C/C++ engine for LLM inference, has released version b8969. The release focuses on performance improvements for ARM64 architectures, adding SVE-tuned code for the gemm_q8_0_4x8_q8_0() kernel. SVE (Scalable Vector Extension) is ARM's successor to NEON for SIMD work: where NEON is fixed at 128-bit vectors, SVE is vector-length agnostic, so the same code can exploit vector widths from 128 up to 2048 bits and use per-lane predication for clean tail handling. The optimization targets the matrix-multiplication kernel for the Q8_0 8-bit quantization format used in LLM inference, promising lower latency and higher throughput on compatible hardware.
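To make the idea concrete, here is a minimal, vector-length-agnostic sketch in C of the int8 dot product at the heart of a Q8_0-style matrix multiply, written with ARM's SVE intrinsics. It is illustrative only, not the actual gemm_q8_0_4x8_q8_0() implementation, and it omits the per-block scale factors that real Q8_0 kernels apply to the integer accumulators.

```c
/* Minimal SVE sketch (assumes an aarch64 toolchain with SVE support,
 * e.g. gcc -O2 -march=armv8-a+sve): an int8 dot product, the core
 * operation behind Q8_0 matrix multiplication. Not the actual
 * llama.cpp kernel. */
#include <arm_sve.h>
#include <stdint.h>

int32_t dot_i8(const int8_t *a, const int8_t *b, int64_t n) {
    svint32_t acc = svdup_n_s32(0);
    /* svcntb() reports the vector length in bytes at runtime, so the
     * same binary uses 512-bit vectors on an A64FX and 128-bit vectors
     * on smaller SVE cores. */
    for (int64_t i = 0; i < n; i += svcntb()) {
        svbool_t pg = svwhilelt_b8_s64(i, n); /* predicate masks the tail */
        svint8_t va = svld1_s8(pg, a + i);    /* inactive lanes read as 0 */
        svint8_t vb = svld1_s8(pg, b + i);
        acc = svdot_s32(acc, va, vb);         /* 4-way int8 -> int32 SDOT */
    }
    return (int32_t)svaddv_s32(svptrue_b32(), acc);
}
```

Because the loop is fully predicated, there is no scalar tail loop, which is one of the practical advantages SVE holds over NEON.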
This release also includes a minor code cleanup in repack.cpp, marking arrays static const. Co-authored by Vithule Prashant of Fujitsu, the update underscores growing industry collaboration on optimizing AI workloads for ARM-based servers and edge devices. Pre-built binaries are available for all major platforms, including macOS (Apple Silicon and Intel), Linux (x64, ARM64, s390x), Windows (x64 and ARM64), Android, and iOS, with multiple GPU backends supported: Vulkan, ROCm, CUDA, SYCL, and HIP. For ARM Linux users the update is particularly relevant, as it targets the Fujitsu A64FX processor and similar SVE-capable chips used in high-performance computing and cloud environments; a quick way to check for SVE support is sketched below.
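For readers who want to confirm their machine can take the tuned path, the following standalone C probe is one way to do it on aarch64 Linux. It is only illustrative; llama.cpp performs its own CPU feature detection at runtime.

```c
/* Illustrative SVE capability check for aarch64 Linux; llama.cpp does
 * its own runtime feature detection, this is just a standalone probe. */
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval */
#include <asm/hwcap.h>  /* HWCAP_SVE (aarch64 only) */

int main(void) {
    if (getauxval(AT_HWCAP) & HWCAP_SVE)
        puts("SVE available: the tuned Q8_0 kernel path can apply");
    else
        puts("No SVE: generic/NEON code paths will be used instead");
    return 0;
}
```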
- Added SVE-tuned code for gemm_q8_0_4x8_q8_0() kernel to accelerate quantized matrix multiply
- Co-authored by a Fujitsu engineer, targeting the A64FX and other SVE-capable ARM CPUs
- Release includes pre-built binaries for macOS, Linux, Windows, Android, iOS, and openEuler with multiple backends
Why It Matters
The SVE-tuned kernel cuts LLM inference latency on ARM servers, enabling faster AI from HPC clusters to the edge.