b8312
The latest build optimizes Metal GPU kernels and fixes a backend stability bug, improving LLM inference performance on macOS and iOS devices.
The open-source project llama.cpp, maintained by ggml-org, has published a new release tagged b8312. This update focuses on performance optimizations for Apple's Metal framework, the GPU backend used to run large language models (LLMs) such as Meta's Llama 3 on macOS and iOS devices. The core changes modify GPU compute kernels to "avoid divisions" and "avoid modulus" operations when not broadcasting: integer division and modulus are comparatively expensive instructions on GPUs, and they are only needed to recover per-dimension coordinates from a flat thread index when operand shapes differ (i.e., when broadcasting). When the shapes match, the flat index can address both operands directly, so the div/mod work can be skipped. A separate fix addresses a `capture_started` flag bug, improving the stability of the Metal backend during operation.
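The pattern is easiest to see in a CPU-side analogue. The sketch below is purely illustrative, with hypothetical names rather than the actual ggml-metal kernel code; it shows a fast flat-index path taken when no broadcasting is needed, confining the div/mod arithmetic to the broadcast path:

```cpp
#include <cstdint>

// Illustrative sketch only (hypothetical names, not llama.cpp code):
// an element-wise add over an ne1 x ne0 matrix, where src1 either
// matches src0's shape or is a single row broadcast across all rows.
void add_example(const float * src0, const float * src1, float * dst,
                 int64_t ne0, int64_t ne1, bool broadcast_src1) {
    const int64_t n = ne0 * ne1;
    if (!broadcast_src1) {
        // Fast path: shapes match, so the flat index addresses both
        // operands directly -- no division or modulus per element.
        for (int64_t i = 0; i < n; ++i) {
            dst[i] = src0[i] + src1[i];
        }
    } else {
        // Broadcast path: the column index within the repeated row must
        // be recovered with a modulus (and, for higher-rank tensors,
        // divisions as well) -- the cost the commit avoids when it is
        // not actually needed.
        for (int64_t i = 0; i < n; ++i) {
            dst[i] = src0[i] + src1[i % ne0];
        }
    }
}
```

On a GPU, where each thread computes one element, hoisting this decision out of the per-element arithmetic removes the integer div/mod from the hot path entirely.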
Alongside the code changes, the release includes a comprehensive set of pre-built binaries, simplifying deployment for end users. These assets cover macOS for both Apple Silicon (arm64) and Intel (x64) architectures, iOS via an XCFramework, Linux with CPU, Vulkan, and ROCm backends, and Windows with CPU, CUDA, Vulkan, SYCL, and HIP support. This broad coverage lets developers and enthusiasts pick up the latest build on their preferred hardware, from Apple laptops to high-end Windows PCs with NVIDIA GPUs.
- Optimizes Metal GPU kernels by avoiding costly division/modulus ops, boosting Apple Silicon performance.
- Fixes a `capture_started` flag bug to improve stability of the Metal inference backend.
- Provides pre-built binaries for macOS, iOS, Linux, and Windows with CUDA, Vulkan, and ROCm support.
Why It Matters
Faster, more stable local AI inference on Macs and iPhones, making on-device LLMs like Llama 3 more practical for developers.