Developer Tools

llama.cpp b9247 boosts Metal performance with pad + cpy optimizations

New release delivers faster LLM inference on Apple Silicon via optimized memory operations.

Deep Dive

The llama.cpp open-source project, known for running LLaMA-family models efficiently on consumer hardware, has released build b9247. This release targets the Apple Metal backend with two key optimizations: 'metal: optimize pad' and 'metal: optimize cpy' (copy). Pad extends a tensor with filler values and cpy moves tensor data from one buffer to another, so both are memory-movement operations that tensor computation on the GPU depends on. The same commit also lists 'better row packing in threadgroups', a change aimed at keeping more of the GPU's parallel threads busy on Apple's M-series chips.
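To make the three changes concrete, here is a minimal Python sketch of what pad and cpy compute and of the arithmetic behind row packing. It is an illustration only, not the Metal kernels from the release: the function names are hypothetical, the padding is assumed to be zero-fill on the trailing edge of each dimension, and the packing rule (fit as many whole rows as possible into one threadgroup) is an assumed simplification of whatever the actual patch does.

```python
from math import ceil


def pad_2d(rows, pad_right, pad_bottom, fill=0.0):
    """Pad a row-major 2D tensor on the trailing edge of each dimension.

    Assumption: padding is appended at the end of each row and as extra rows.
    """
    width = len(rows[0]) + pad_right
    padded = [list(row) + [fill] * pad_right for row in rows]
    padded += [[fill] * width for _ in range(pad_bottom)]
    return padded


def cpy_2d(src, convert=float):
    """Copy a 2D tensor element by element, converting each value's type."""
    return [[convert(x) for x in row] for row in src]


def rows_per_threadgroup(row_len, threads_per_group):
    """Hypothetical packing rule: whole rows that fit in one threadgroup."""
    return max(1, threads_per_group // row_len)


def threadgroups_needed(n_rows, row_len, threads_per_group):
    """Number of threadgroups required when rows are packed together."""
    return ceil(n_rows / rows_per_threadgroup(row_len, threads_per_group))


if __name__ == "__main__":
    print(pad_2d([[1, 2], [3, 4]], pad_right=1, pad_bottom=1))
    print(cpy_2d([[1, 2], [3, 4]]))
    print(rows_per_threadgroup(32, 256))
    print(threadgroups_needed(1024, 32, 256))
```

Under these assumptions, a tensor of 1024 rows with 32 elements each fits 8 rows into a 256-thread threadgroup and needs 128 threadgroups. Assigning one row per threadgroup would need 1024 of them, each with 224 idle threads, which is the kind of waste that packing rows is meant to avoid.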

The release ships prebuilt binaries for a wide range of platforms: macOS (Apple Silicon arm64, Intel x64), iOS XCFramework, Linux (x64, arm64, s390x) with various backends (CPU, Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA, Vulkan, HIP), Android arm64, and openEuler. Notably, the arm64 macOS build also comes in a variant with KleidiAI, an inference acceleration library, enabled. For developers and power users running LLMs locally, the Metal changes should mean faster inference on Apple hardware without relying on cloud APIs, although the release notes give no benchmark figures.

Key Points
  • Version b9247 of llama.cpp optimizes Metal pad and copy operations for Apple Silicon.
  • Better row packing in threadgroups improves GPU utilization during inference.
  • Prebuilt binaries are available for macOS, iOS, Linux, Windows, Android, and openEuler, including a macOS arm64 variant with KleidiAI acceleration enabled.

Why It Matters

Faster local LLM inference on Apple devices means lower latency for AI assistants and chatbots, and running them on-device keeps prompts and data off cloud services.