Developer Tools

b8823

The latest update enables KleidiAI acceleration on Apple Silicon Macs and adds OpenVINO support for Intel CPUs.

Deep Dive

The llama.cpp project, maintained by ggml-org, has released version b8823, a significant update to the widely-used C++ inference engine for running large language models locally. This release introduces KleidiAI acceleration for Apple Silicon Macs, potentially offering up to 50% faster inference speeds on M-series chips. Additionally, it adds OpenVINO support for Intel CPUs, expanding the framework's hardware compatibility. The update also consolidates model builds with a single llm_build per architecture, streamlining the compilation process across different platforms.
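
The release notes describe the llm_build change only at a high level, so the following is a purely hypothetical C++ sketch of the general pattern: one graph-builder per model architecture behind a shared interface, with a single dispatch point. Every type and function name here is invented for illustration and does not mirror llama.cpp's actual internals.

    // Hypothetical sketch of a "single llm_build per architecture" layout.
    // All names are illustrative, not llama.cpp's real internals.
    #include <memory>
    #include <stdexcept>

    enum class llm_arch { LLAMA, QWEN2 };

    struct llm_graph { /* stand-in for the computation graph */ };

    // One build routine per architecture, behind a shared interface.
    struct llm_build {
        virtual ~llm_build() = default;
        virtual llm_graph build() = 0;
    };

    struct llm_build_llama : llm_build {
        llm_graph build() override { /* Llama attention + MLP layout */ return {}; }
    };

    struct llm_build_qwen2 : llm_build {
        llm_graph build() override { /* Qwen2-specific layer layout */ return {}; }
    };

    // A single dispatch point: the same builder is reused no matter which
    // backend (CPU, Metal, CUDA, OpenVINO, ...) ultimately runs the graph.
    std::unique_ptr<llm_build> make_builder(llm_arch arch) {
        switch (arch) {
            case llm_arch::LLAMA: return std::make_unique<llm_build_llama>();
            case llm_arch::QWEN2: return std::make_unique<llm_build_qwen2>();
        }
        throw std::runtime_error("unsupported architecture");
    }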

The release includes 27 build targets spanning macOS, Linux, Windows, and openEuler systems. For macOS users, there are now separate builds for Apple Silicon with KleidiAI enabled, Apple Silicon without it, and Intel x64. Linux builds now include OpenVINO support alongside existing backends such as CUDA, Vulkan, and ROCm. Windows users gain CUDA 12.4 and 13.1 DLLs, as well as Vulkan, SYCL, and HIP backends. This breadth of platform support makes llama.cpp one of the most versatile tools for deploying models like Meta's Llama 3 across diverse hardware environments.
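
Whichever build target is used, applications load and run models through the same API, since the backend is selected at compile time. Below is a minimal C++ sketch using llama.cpp's C API to load a GGUF model and offload layers to whatever accelerator the binary was built with. The calls shown (llama_backend_init, llama_model_load_from_file, llama_init_from_model) follow the current C API, but names have shifted across releases, so treat this as a sketch rather than a drop-in snippet for b8823.

    #include "llama.h"
    #include <cstdio>

    int main(int argc, char ** argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
            return 1;
        }

        // Initialize whatever ggml backends this binary was compiled with
        // (Metal or KleidiAI-accelerated CPU on Apple Silicon, CUDA/Vulkan on PCs, ...).
        llama_backend_init();

        // n_gpu_layers controls how many layers are offloaded to the accelerator;
        // the backend itself is a build-time choice, so the same application code
        // runs unchanged across the different build targets.
        llama_model_params mparams = llama_model_default_params();
        mparams.n_gpu_layers = 99;

        llama_model * model = llama_model_load_from_file(argv[1], mparams);
        if (model == NULL) {
            fprintf(stderr, "failed to load model\n");
            return 1;
        }

        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx = 4096;  // context window for this session

        llama_context * ctx = llama_init_from_model(model, cparams);
        if (ctx == NULL) {
            fprintf(stderr, "failed to create context\n");
            llama_model_free(model);
            return 1;
        }

        // ... tokenize a prompt and call llama_decode() in a loop to generate text ...

        llama_free(ctx);
        llama_model_free(model);
        llama_backend_free();
        return 0;
    }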

These improvements come as local AI inference becomes increasingly important for privacy-conscious applications and edge computing. The expanded backend support means developers can optimize for specific hardware configurations, whether they're deploying on enterprise servers with NVIDIA GPUs, consumer Windows machines, or Apple's latest M-series laptops. The KleidiAI integration specifically addresses growing demand for efficient AI acceleration on Apple's ecosystem, where many developers are building AI-powered applications.

Key Points
  • Enables KleidiAI acceleration for Apple Silicon Macs, with inference up to 50% faster on M-series chips
  • Adds OpenVINO support for Intel CPUs alongside 27 different build targets
  • Consolidates model builds with single llm_build per architecture for streamlined compilation

Why It Matters

Developers can now run LLMs up to 50% faster on Apple Silicon and deploy across more hardware platforms with optimized backends.