Developer Tools

llama.cpp b9023

New /models?reload=1 API lets you swap LLMs without restarting the server.

Deep Dive

The latest release of llama.cpp, version b9023, introduces a highly requested feature: a reload=1 query parameter on the server's /models endpoint that lets users reload language models dynamically without restarting the server process. This is a significant quality-of-life improvement for developers running local LLM inference in production or experimental setups, as it enables hot-swapping of models, A/B testing, and seamless updates to newer quantizations or fine-tuned versions. The feature was contributed by the community in pull request #21848 and is now available across all supported platforms.
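
As a rough illustration, a reload can be triggered with a single HTTP GET against the new endpoint. The sketch below is an assumption-laden example rather than code from the release: it presumes a llama-server instance listening on localhost:8080 and a JSON response body, neither of which is spelled out in the release notes.

    # Minimal sketch: ask a running llama-server to reload its model without a restart.
    # Assumes the server is reachable at localhost:8080; adjust host/port as needed.
    import json
    import urllib.request

    BASE_URL = "http://localhost:8080"  # assumed default llama-server address

    def reload_model() -> dict:
        """GET /models?reload=1 (new in b9023) and return the parsed JSON response."""
        with urllib.request.urlopen(f"{BASE_URL}/models?reload=1", timeout=120) as resp:
            return json.load(resp)

    if __name__ == "__main__":
        print(json.dumps(reload_model(), indent=2))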

This release ships with an extensive array of prebuilt binaries for nearly every major platform: Apple platforms (macOS arm64 with and without KleidiAI acceleration, macOS Intel x64, and an iOS XCFramework), Linux (x64, arm64, and s390x for CPU, plus GPU variants with Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32/FP16), Windows (x64 and arm64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android arm64 (CPU), and openEuler (x86 and aarch64 with ACL Graph acceleration). The wide platform support continues llama.cpp's mission to make local LLM inference accessible everywhere, and for the open-source AI community, b9023 represents a practical step toward more robust self-hosted LLM servers that can be managed with zero downtime.

Key Points
  • New /models?reload=1 API endpoint in llama.cpp server for hot-swapping models without a restart (see the verification sketch after this list)
  • Prebuilt binaries cover macOS, Linux, Windows, Android, iOS, and openEuler with multiple GPU backends (CUDA, ROCm, Vulkan, SYCL, HIP)
  • Community-driven PR #21848 merged; release includes 30+ asset files across all major architectures
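
To make the hot-swap workflow concrete, here is a hypothetical before/after check: list the models the server reports, trigger a reload, then list them again to confirm the swap took effect. The OpenAI-style /models listing (a "data" array of objects with "id" fields) and the localhost:8080 address are assumptions for illustration, not details confirmed by the release.

    # Hypothetical before/after check around a hot-swap, e.g. after replacing the GGUF
    # file the server was started with. Assumes localhost:8080 and an OpenAI-style
    # /models listing ({"data": [{"id": ...}, ...]}); adapt to the actual response.
    import json
    import urllib.request

    BASE_URL = "http://localhost:8080"  # assumed server address

    def get_json(path: str) -> dict:
        """GET a path on the server and parse the body as JSON."""
        with urllib.request.urlopen(f"{BASE_URL}{path}", timeout=120) as resp:
            return json.load(resp)

    def list_model_ids() -> list[str]:
        """Return model ids from an assumed OpenAI-style /models listing."""
        return [m.get("id", "?") for m in get_json("/models").get("data", [])]

    if __name__ == "__main__":
        print("before:", list_model_ids())
        get_json("/models?reload=1")  # trigger the reload introduced in b9023
        print("after: ", list_model_ids())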

Why It Matters

Enables dynamic model management for local LLM deployments, reducing downtime and simplifying experimentation for developers.