Developer Tools

llama.cpp b9114 optimizes Metal backend for faster local LLM inference on Apple Silicon

New release promotes batch divisors to Metal function constants, aiming to cut GPU overhead on M-series Macs and iOS devices.

Deep Dive

llama.cpp, a widely used C/C++ library for running large language models locally on consumer hardware, has released build b9114 with an optimization for its Apple Metal backend. The commit, signed and tagged by GitHub Actions on May 12, promotes the batch divisors used by the mul_mv (matrix-vector) and mul_mm (matrix-matrix) kernels to Metal function constants. Function constants are fixed when a GPU pipeline is compiled, so the shader compiler can fold the divisors into the generated code instead of loading them and performing a general-purpose division at run time. The intended effect is fewer runtime computations and memory accesses during inference, which should improve token generation speed and latency on Apple Silicon (M1 through M4) and iOS devices.
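To make the idea concrete, here is a minimal CPU-side analogy in C++. It is not the Metal shader code from the commit. The struct, the function names, and the choice of `ne12`, `r2`, and `r3` as the divisors are assumptions made for illustration. Template parameters stand in for Metal function constants: in both cases the divisor is known when the code is compiled, so the compiler can replace a general division with cheaper shifts or multiplications.

```cpp
#include <cstdint>

// Hypothetical argument block; field names are illustrative only.
struct BatchArgs {
    uint32_t ne12; // batch count along dimension 2 of the second operand
    uint32_t r2;   // assumed broadcast ratio along dimension 2
    uint32_t r3;   // assumed broadcast ratio along dimension 3
};

// Batch coordinates of the first operand for a flat batch index.
struct BatchIndex {
    uint32_t i02;
    uint32_t i03;
};

// Runtime variant: divisors are read from the argument block, so every
// '/' and '%' is a general-purpose integer division.
inline BatchIndex src0_batch_runtime(uint32_t im, const BatchArgs& a) {
    const uint32_t i12 = im % a.ne12;
    const uint32_t i13 = im / a.ne12;
    return { i12 / a.r2, i13 / a.r3 };
}

// Specialized variant: divisors are compile-time constants. Template
// parameters play the role that function constants play in Metal.
template <uint32_t NE12, uint32_t R2, uint32_t R3>
BatchIndex src0_batch_const(uint32_t im) {
    static_assert(NE12 > 0 && R2 > 0 && R3 > 0, "divisors must be nonzero");
    const uint32_t i12 = im % NE12;
    const uint32_t i13 = im / NE12;
    return { i12 / R2, i13 / R3 };
}
```

Both variants return the same indices for the same inputs; only the point at which the divisors become known differs. The trade-off is that each distinct set of constants needs its own compiled specialization, which in Metal means a separate pipeline state per combination.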

Beyond the Metal-specific improvement, the release maintains support for an extensive range of platforms and accelerators: macOS (ARM64, x64), Linux (x64, ARM64, s390x), Windows (x64, ARM64), and Android (ARM64), with backends including Vulkan, ROCm 7.2, OpenVINO, SYCL, CUDA 12/13, and HIP. The update is part of llama.cpp's ongoing effort to make large model inference accessible on everyday devices. For users, it should mean faster local AI assistants, chatbots, and code completion without cloud dependencies, especially on Apple hardware, where Metal is the primary GPU pathway.

Key Points
  • Promotes mul_mv and mul_mm batch divisors to Metal function constants to reduce runtime overhead
  • Supports a wide range of platforms: macOS (ARM64, x64), iOS, Linux, Windows, and Android, with Vulkan, CUDA, ROCm, and SYCL backends among others
  • Targets faster local LLM inference on Apple Silicon (M1-M4) and iOS devices, where lower latency matters most for interactive use

Why It Matters

Makes running powerful LLMs locally on Apple hardware faster and more efficient, enabling private AI on everyday devices.