b9057
New release boosts LLM performance on low-power RISC-V hardware with an optimized dot-product kernel.
ggml-org's llama.cpp, the popular C/C++ LLM inference engine, has released version b9057. The standout feature is an optimized RISC-V CPU implementation of the dot product for the q1_0 quantization format. During token generation, the matrix-vector multiplications that dominate transformer inference decompose into many dot products between quantized weight rows and activation vectors, so a faster dot-product kernel translates directly into faster token generation on RISC-V hardware. The q1_0 format is a 1-bit weight-only quantization, enabling extremely memory-efficient model deployment. This release makes RISC-V a more viable platform for running large language models locally, especially in edge and IoT scenarios where power and cost are constraints.
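To make the optimization concrete, here is a minimal scalar sketch in C of a 1-bit block-quantized dot product. The block layout shown (one sign bit per weight plus a per-block scale) and the names `block_q1_example` / `dot_q1_example` are hypothetical illustrations for this article, not the actual ggml q1_0 memory layout; the shipped kernel vectorizes this inner loop with RISC-V instructions.

```c
#include <stddef.h>
#include <stdint.h>

#define QBLOCK 32  /* weights per quantized block (illustrative choice) */

/* Hypothetical 1-bit block: each weight is +scale or -scale,
 * encoded as one sign bit. This is NOT the real ggml q1_0 layout. */
typedef struct {
    float    scale;  /* per-block scale factor */
    uint32_t signs;  /* bit i = 1 -> weight i is negative */
} block_q1_example;

/* Dot product of n quantized weights with n float activations.
 * Assumes n is a multiple of QBLOCK. */
static float dot_q1_example(const block_q1_example *w, const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t b = 0; b < n / QBLOCK; ++b) {
        float bsum = 0.0f;
        for (int i = 0; i < QBLOCK; ++i) {
            /* decode weight i of block b: +1 or -1, scale applied later */
            float wi = ((w[b].signs >> i) & 1u) ? -1.0f : 1.0f;
            bsum += wi * x[b * QBLOCK + i];
        }
        sum += w[b].scale * bsum;  /* apply the scale once per block */
    }
    return sum;
}
```

Note how the per-block scale is factored out of the inner loop: the inner sum reduces to a sign-masked accumulation over contiguous activations, which is exactly the structure a vector unit can exploit.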
The b9057 release ships with pre-built binaries for a wide range of platforms: macOS Apple Silicon (with and without KleidiAI acceleration) and macOS Intel; an iOS XCFramework; Linux (x64, arm64, s390x) in Vulkan, ROCm 7.2, OpenVINO, and SYCL variants; Windows (x64, arm64) with CUDA 12/13, Vulkan, SYCL, and HIP; Android arm64; and openEuler for x86 and aarch64 with ACL Graph support. This breadth of platform support lets developers deploy optimized LLM inference across diverse hardware, from high-end GPUs to low-power RISC-V CPUs.
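Whichever binary you pick, the consuming code looks the same: link against the library and drive it through the llama.h C API. The following is a minimal sketch of the load/free lifecycle, assuming function names from recent llama.h headers; the API can drift between releases, so verify against the header that ships with b9057.

```c
#include <stdio.h>
#include "llama.h"

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();  /* initialize the ggml backends (CPU, GPU, ...) */

    /* load a GGUF model, e.g. one quantized to a 1-bit format */
    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model *model = llama_model_load_from_file(argv[1], mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    /* create an inference context; tokenization and decoding would go here */
    struct llama_context_params cparams = llama_context_default_params();
    struct llama_context *ctx = llama_init_from_model(model, cparams);
    if (ctx) {
        llama_free(ctx);
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```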
- Optimized RISC-V CPU kernel for the q1_0 quantization dot product, improving inference speed (see the vectorization sketch after this list).
- Supports 20+ platform targets including macOS, Linux, Windows, Android, and openEuler with various accelerators.
- The b9057 release from ggml-org includes builds for CUDA 12/13, ROCm, Vulkan, SYCL, and HIP.
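As referenced in the first bullet, here is a minimal sketch of what a vectorized RISC-V kernel looks like, written with the standard RVV 1.0 C intrinsics from `<riscv_vector.h>`. It vectorizes a plain f32 dot product for clarity; the actual b9057 kernel additionally decodes q1_0 blocks inside the loop, so treat this as an illustration of the technique, not the shipped code.

```c
#include <stddef.h>
#include <riscv_vector.h>

/* f32 dot product vectorized with RVV 1.0 intrinsics. */
float dot_f32_rvv(const float *a, const float *b, size_t n) {
    size_t vlmax = __riscv_vsetvlmax_e32m8();
    /* zero the whole accumulator, including lanes a short tail won't touch */
    vfloat32m8_t acc = __riscv_vfmv_v_f_f32m8(0.0f, vlmax);
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m8(n);   /* lanes this iteration */
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b, vl);
        /* tail-undisturbed FMA: acc += va * vb on the active lanes;
         * untouched lanes stay zero, keeping the reduction correct */
        acc = __riscv_vfmacc_vv_f32m8_tu(acc, va, vb, vl);
        a += vl; b += vl; n -= vl;
    }
    /* horizontal reduction of the accumulator to a single float */
    vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    vfloat32m1_t sum  = __riscv_vfredusum_vs_f32m8_f32m1(acc, zero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(sum);
}
```

The `__riscv_vsetvl_e32m8` call at the top of each iteration asks the hardware how many lanes it can process, so the same binary adapts to any vector length, and the tail-undisturbed `_tu` multiply-accumulate leaves the unused accumulator lanes at zero so the final reduction stays valid.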
Why It Matters
Enables efficient on-device LLM inference on RISC-V hardware, expanding edge AI and reducing cloud dependency.