b8763
The latest release of the popular open-source inference engine trims build times and expands Apple Silicon support.
The ggml-org team has tagged a new release (b8763) of their massively popular llama.cpp repository, which now boasts over 103k stars on GitHub. This is a maintenance and optimization update focused on the build system and binary distribution. The key technical change is in the CUDA compilation process: the build now skips compiling unnecessary Flash Attention (FA) kernels, which reduces build times and binary size for users compiling from source and makes the development workflow more efficient.
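To illustrate the principle behind the change (a hedged sketch, not llama.cpp's actual build logic, which lives in its CMake and CUDA sources): a build can enumerate the variant combinations a kernel template would instantiate and compile only those a given configuration can actually dispatch to. The variant axes and names below are hypothetical.

```python
# Hypothetical sketch of build-time kernel-variant pruning; llama.cpp's real
# Flash Attention selection logic differs. The idea: enumerate all template
# variants, then emit only the ones the target configuration needs.
from itertools import product

HEAD_SIZES = [64, 80, 96, 112, 128, 256]  # hypothetical FA head-size axis
DTYPES = ["f16", "f32"]                   # hypothetical precision axis

def variants_to_compile(needed_head_sizes, include_f32=False):
    """Return only the (head_size, dtype) pairs the build must instantiate."""
    dtypes = DTYPES if include_f32 else ["f16"]
    return [(hs, dt) for hs, dt in product(HEAD_SIZES, dtypes)
            if hs in needed_head_sizes]

if __name__ == "__main__":
    # Pruning unneeded variants is what shortens compile time and binary size.
    for hs, dt in variants_to_compile({64, 128}):
        print(f"compile fattn kernel: head_size={hs} dtype={dt}")
```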
For end-users, the most visible addition is a new pre-built binary variant labeled 'macOS Apple Silicon (arm64, KleidiAI enabled).' KleidiAI is Arm's library of performance-optimized micro-kernels that accelerates AI inference on Arm CPUs, including Apple Silicon. Its inclusion signals ongoing optimization for Apple's hardware ecosystem. The release also continues to ship a wide array of pre-compiled binaries across major platforms: Windows with CUDA 12.4 and 13.1 DLLs, various Linux configurations with Vulkan and ROCm 7.2 support, and specialized builds for the openEuler OS. This broad coverage lets developers and researchers run efficient, quantized large language models locally on their preferred hardware.
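As a quick taste of what those binaries enable, here is a minimal local-inference sketch using the community llama-cpp-python bindings (a separate project that wraps llama.cpp); the model path and generation parameters below are placeholders:

```python
# Minimal local inference over a quantized GGUF model via llama-cpp-python.
# The model file is a placeholder; any quantized GGUF works.
from llama_cpp import Llama

llm = Llama(
    model_path="models/model.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU (Metal/CUDA) if available
)

out = llm("Q: Why run LLMs locally? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"].strip())
```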
- CUDA build optimization skips superfluous Flash Attention kernel compilation (#21768)
- Adds new 'KleidiAI enabled' binary build for macOS Apple Silicon (arm64)
- Distributes pre-built binaries for Windows (CUDA 12/13, Vulkan), Linux (CPU/Vulkan/ROCm), and openEuler
Why It Matters
Optimizes the core tool for local LLM inference, making development faster and expanding performant options for Apple hardware users.