Developer Tools

b8951

New release boosts inference speed on Apple Silicon and other CPUs

Deep Dive

llama.cpp b8951, released on April 27 by ggml-org, brings a significant performance optimization: fast matrix-vector (mat-vec) kernels for i-quants. I-quants are llama.cpp's IQ-series quantization formats (such as IQ2_XXS and IQ4_XS), which shrink model size and memory traffic, but their mat-vec operations, the core computation of token generation, were previously slower than those of other quantization formats. This update addresses that bottleneck directly, enabling faster token generation on CPU-based systems. The commit, signed with GitHub's verified GPG key, is part of the ongoing effort to make LLM inference more efficient on consumer hardware.
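
To make the bottleneck concrete, here is a minimal, illustrative sketch in C++ of a row-quantized mat-vec. It is not the actual llama.cpp kernel, and the simple 8-bit-weights-plus-per-row-scale layout is a stand-in for the far denser block and codebook encodings real i-quants such as IQ2_XXS use; the structure of the work is the same, though. During single-token decoding, each output element requires streaming a full quantized weight row through memory and dequantizing it on the fly, so the speed of this inner loop largely sets tokens per second on a CPU.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Illustrative layout only: 8-bit weights with one float scale per row.
    // Real i-quants (e.g. IQ2_XXS, IQ4_XS) pack weights far more densely,
    // but the computation has the same shape: dequantize a row, dot it with x.
    struct QuantMatrix {
        int rows, cols;
        std::vector<int8_t> q;      // rows * cols quantized weights
        std::vector<float>  scale;  // one scale factor per row
    };

    // y = W * x: one row dot product per output element. This mat-vec is the
    // hot loop of token generation, which is why faster kernels translate
    // directly into higher tokens/sec.
    void matvec(const QuantMatrix& W, const float* x, float* y) {
        for (int r = 0; r < W.rows; ++r) {
            const int8_t* row = &W.q[(size_t)r * W.cols];
            float acc = 0.0f;
            for (int c = 0; c < W.cols; ++c)
                acc += (float)row[c] * x[c];   // dequantize-and-accumulate
            y[r] = acc * W.scale[r];           // apply the per-row scale once
        }
    }

    int main() {
        QuantMatrix W{4, 8, std::vector<int8_t>(32, 1), std::vector<float>(4, 0.5f)};
        std::vector<float> x(8, 1.0f), y(4);
        matvec(W, x.data(), y.data());
        printf("y[0] = %.1f\n", y[0]);  // 8 weights * 1.0 * scale 0.5 = 4.0
    }

The faster kernels in this release target exactly this kind of inner loop for the IQ formats, shrinking the per-element cost of unpacking and accumulating quantized weights.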

This release is a major win for developers running models locally on laptops or edge devices. The prebuilt binaries cover a wide range of platforms: macOS (Apple Silicon with and without KleidiAI, Intel x64, iOS), Linux (x64, arm64, s390x with Vulkan, ROCm, OpenVINO, SYCL), Android (arm64), Windows (x64, arm64 with CUDA 12/13, Vulkan, SYCL, HIP), and openEuler (x86, aarch64 with ACL Graph). For Apple Silicon users, the KleidiAI-enabled build offers additional acceleration. This kernel-level optimization means lower latency and higher throughput for tasks like chat, code generation, and document analysis, without the need for expensive GPUs. The focus on i-quants also aligns with the broader shift toward running quantized builds of popular models (such as Llama 3, Mistral, and Phi-3), making them more practical for real-time applications.

Key Points
  • Adds fast mat-vec kernels specifically for i-quants (the IQ-series quantization formats), improving CPU inference speed.
  • Prebuilt binaries for 20+ platform variants, including Apple Silicon with KleidiAI, Windows with CUDA 12/13, and Linux with Vulkan/ROCm.
  • Optimization targets quantized LLMs (e.g., 4-bit, 8-bit), reducing memory bandwidth bottlenecks for local AI deployment; see the rough estimate below.
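
For a rough sense of the ceiling these kernels push toward, the sketch below works out the bandwidth-bound token rate: during decoding, each generated token has to stream roughly the entire weight file through memory once. The model size and memory bandwidth figures are assumptions chosen for illustration, not measurements from this release.

    #include <cstdio>

    // Back-of-the-envelope estimate, not a benchmark: tokens/sec is capped at
    // roughly (memory bandwidth) / (model size), because every generated token
    // reads all the weights once. Both numbers below are illustrative assumptions.
    int main() {
        const double model_gb      = 4.0;    // e.g. a ~7B model in a ~4-bit i-quant
        const double bandwidth_gbs = 100.0;  // assumed sustained memory bandwidth
        printf("bandwidth-bound ceiling: ~%.0f tokens/sec\n",
               bandwidth_gbs / model_gb);
        // Fast mat-vec kernels matter because they keep the CPU near this
        // ceiling instead of stalling on per-block dequantization overhead.
        return 0;
    }

Smaller i-quant files shrink the amount of data read per token, which raises this ceiling, and faster kernels keep the compute side from becoming the new limit.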

Why It Matters

Faster CPU inference for quantized LLMs means more efficient local AI on laptops and edge devices.