Developer Tools

llama.cpp b9498 adds RVV 512/1024-bit quantized kernel optimizations

New release boosts LLM inference on RISC-V hardware with 1024-bit vector support

Deep Dive

The latest release of llama.cpp, b9498, extends the project's RISC-V Vector (RVV) quantized dot-product kernels to vector lengths of 512 and 1024 bits. The update, contributed by Rehan Qasim and Taimur Ahmad of 10xengineers.ai, adds dedicated RVV implementations for seven low-bit quantization formats: iq4_xs, q6_K, tq3_s, iq3_xxs, iq2_s, iq2_xs, and iq2_xxs. It also refactors the iq2_xs kernel and improves its efficiency on 256-bit RVV configurations.
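For readers unfamiliar with the term, a quantized dot-product kernel multiplies two vectors stored as small integers plus per-block scale factors. The sketch below is a deliberately simplified illustration, not llama.cpp code: it assumes a q8_0-style layout (32 signed 8-bit values sharing one scale), which is much simpler than any of the formats listed above, and the names `Block`, `quantize`, and `dot_quantized` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 32  # values per block in this simplified, q8_0-style layout (assumption)


@dataclass
class Block:
    scale: float       # one scale factor shared by the whole block
    quants: List[int]  # BLOCK_SIZE signed 8-bit integers


def quantize(values: List[float]) -> List[Block]:
    """Quantize floats into blocks of int8 values that share one scale."""
    assert len(values) % BLOCK_SIZE == 0
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        chunk = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 0.0
        quants = [round(v / scale) if scale else 0 for v in chunk]
        blocks.append(Block(scale, quants))
    return blocks


def dot_quantized(x: List[Block], y: List[Block]) -> float:
    """Integer multiply-accumulate within each block, then one float rescale per block."""
    total = 0.0
    for bx, by in zip(x, y):
        # This inner integer sum is the part a vector unit can process in wide chunks.
        isum = sum(a * b for a, b in zip(bx.quants, by.quants))
        total += bx.scale * by.scale * isum
    return total
```

The inner integer sum is where wider vector registers help: more of a block's products can be computed per instruction, while the floating-point rescale stays at one multiply per block.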

These kernels let llama.cpp make fuller use of RISC-V CPUs with wide vector units, such as those aimed at next-generation edge devices and servers. With 512-bit and 1024-bit vector lengths now covered, optimized quantized inference reaches a broader range of RISC-V hardware. The release ships builds for macOS (Apple Silicon, Intel), Linux (x64, arm64, s390x), Android (arm64), and Windows (x64, arm64, CUDA, Vulkan, HIP). None of the targets listed here is RISC-V, so the RVV improvements take effect only in builds compiled for RISC-V hardware that implements the vector extension.
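To make the effect of vector length concrete, the RVV specification defines the maximum number of elements per vector operation as VLMAX = LMUL × VLEN / SEW, where VLEN is the register width in bits, SEW the element width, and LMUL the register-grouping factor. The sketch below applies that formula to a 256-value super-block, the size ggml uses for K-quant formats such as q6_K. The helper names are hypothetical, and treating every element as 8 bits wide is a simplifying assumption rather than a description of the actual kernels.

```python
QK_K = 256  # values per super-block in ggml's K-quant formats such as q6_K


def vlmax(vlen_bits: int, sew_bits: int, lmul: int = 1) -> int:
    """Elements per vector register group: VLMAX = LMUL * VLEN / SEW."""
    return lmul * vlen_bits // sew_bits


def passes_per_superblock(vlen_bits: int, sew_bits: int = 8, lmul: int = 1) -> int:
    """Vector operations needed to cover one super-block of sew_bits-wide elements."""
    lanes = vlmax(vlen_bits, sew_bits, lmul)
    return -(-QK_K // lanes)  # ceiling division


for vlen in (128, 256, 512, 1024):
    print(vlen, vlmax(vlen, 8), passes_per_superblock(vlen))
```

With 8-bit elements and LMUL = 1, a 256-bit machine needs 8 passes per super-block, a 512-bit machine 4, and a 1024-bit machine 2. Fewer, wider passes change how a kernel is best unrolled, which is one plausible reason for writing dedicated implementations per vector length.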

Key Points
  • llama.cpp b9498 extends RVV quantization dot products to vector lengths of 512 and 1024 bits.
  • New kernels added for iq4_xs, q6_K, tq3_s, iq3_xxs, iq2_s, iq2_xs, and iq2_xxs quantizations.
  • Contributed by Rehan Qasim and Taimur Ahmad of 10xengineers.ai; the release also refactors and improves the iq2_xs kernel for 256-bit RVV.

Why It Matters

The new kernels enable faster local LLM inference on RISC-V hardware with 512-bit and 1024-bit vector units, expanding AI deployment to new edge and server platforms.