llama.cpp b9498 adds RVV 512/1024-bit quantized kernel optimizations
New release targets faster LLM inference on RISC-V hardware with vector lengths of up to 1024 bits
The latest llama.cpp release, b9498, extends the project's RISC-V Vector (RVV) quantized dot-product kernels to vector lengths of 512 and 1024 bits. The update, contributed by Rehan Qasim and Taimur Ahmad of 10xengineers.ai, adds dedicated RVV implementations for a range of low-bit quantization formats: iq4_xs, q6_K, tq3_s, iq3_xxs, iq2_s, iq2_xs, and iq2_xxs. It also refactors and improves the iq2_xs kernel for 256-bit RVV configurations.
These kernels let llama.cpp make fuller use of RISC-V CPUs with wide vector registers, such as those expected in next-generation edge devices and servers. A wider register holds more quantized values per instruction, so the same dot product needs fewer vector operations. The release ships builds for macOS (Apple Silicon, Intel), Linux (x64, arm64, s390x), Android (arm64), and Windows (x64, arm64, CUDA, Vulkan, HIP). None of the listed targets is RISC-V, so the RVV kernels take effect only when llama.cpp is built for RISC-V hardware with the vector extension; the other platforms are unaffected by this change.
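To make the idea of a quantized dot-product kernel concrete, here is a minimal scalar sketch in C. It is not the code from the release: the `block_q8_demo` layout, the `BLOCK_SIZE` of 32, and every function name below are hypothetical, chosen only to show how blockwise integer accumulation works and how many 8-bit values fit in one vector register at a given vector length. It also ignores RVV register grouping (LMUL), which real kernels can use to widen the effective register further.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 32 /* values per quantized block (hypothetical layout) */

/* Hypothetical 8-bit block: one float scale plus BLOCK_SIZE signed values. */
typedef struct {
    float  scale;
    int8_t q[BLOCK_SIZE];
} block_q8_demo;

/* Number of 8-bit lanes in a single vector register of vlen_bits bits. */
static size_t lanes_i8(size_t vlen_bits) {
    return vlen_bits / 8;
}

/* Number of whole demo blocks whose values fit in one such register. */
static size_t blocks_per_register(size_t vlen_bits) {
    return lanes_i8(vlen_bits) / BLOCK_SIZE;
}

/* Scalar reference: integer dot product inside each block, then one
 * floating-point multiply by the two block scales. A vector kernel would
 * perform the inner loop on many lanes at once. */
static float vec_dot_demo(const block_q8_demo *x, const block_q8_demo *y,
                          size_t nblocks) {
    assert(x != NULL && y != NULL);
    float sum = 0.0f;
    for (size_t b = 0; b < nblocks; ++b) {
        int32_t acc = 0;
        for (size_t i = 0; i < BLOCK_SIZE; ++i) {
            acc += (int32_t)x[b].q[i] * (int32_t)y[b].q[i];
        }
        sum += x[b].scale * y[b].scale * (float)acc;
    }
    return sum;
}
```

Under these assumptions, a 256-bit register holds the values of one demo block, a 512-bit register holds two, and a 1024-bit register holds four, which is why kernels written for one vector length generally need dedicated variants to exploit a wider one.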
- llama.cpp b9498 extends RVV quantization dot products to vector lengths of 512 and 1024 bits.
- New kernels added for iq4_xs, q6_K, tq3_s, iq3_xxs, iq2_s, iq2_xs, and iq2_xxs quantizations.
- Contributed by engineers from 10xengineers.ai; the iq2_xs kernel for 256-bit RVV is also refactored and improved.
Why It Matters
Enables faster local LLM inference on RISC-V hardware with 512-bit and 1024-bit vector units, extending AI deployment to new edge and server platforms.