llama.cpp b9326 release expands local LLM support across platforms
New b9326 adds iOS, Android, and multiple GPU backends for local AI inference.
Get AI news that actually matters
One email a day. Zero fluff. Join 10,000+ professionals.
The open-source llama.cpp project, maintained by ggml-org, released version b9326 on May 26. This update focuses on expanding hardware support for running large language models locally. New builds include macOS Apple Silicon with KleidiAI acceleration (for Apple's Neural Engine), iOS XCFramework, Android arm64, and Windows x64/arm64 with CUDA 12/13, Vulkan, and HIP. Linux gains s390x, ROCm 7.2, OpenVINO, and SYCL FP16 support. The release also syncs with the upstream ggml library, likely bringing performance improvements and bug fixes.
The timeline shows rapid iteration – b9326 follows shortly after previous releases. With 113k stars on GitHub, llama.cpp remains the go-to tool for developers wanting privacy-preserving, on-device AI. The broad platform support means professionals can deploy locally on edge devices (e.g., iOS, Android) without cloud dependencies. This release lowers the barrier for running models like Llama 3, Mistral, or CodeLlama on consumer hardware, enabling offline chat bots, code assistants, and RAG systems directly on laptops or phones.
- llama.cpp b9326 adds iOS XCFramework and Android arm64 builds for mobile local LLM inference.
- Windows now supports CUDA 13, Vulkan, SYCL FP16, and HIP (AMD GPU) backends.
- Linux expands to s390x, ROCm 7.2, OpenVINO, and SYCL FP32/FP16 for enterprise hardware.
- New macOS Apple Silicon build includes KleidiAI (Neural Engine) acceleration.
Why It Matters
Professionals can now run LLMs privately on almost any device, from mobile to server, with optimized GPU backends.