Developer Tools

llama.cpp b9023

New /models?reload=1 API lets you swap LLMs without restarting the server.

Deep Dive

The latest release of llama.cpp, version b9023, introduces a highly requested feature: a reload=1 query parameter on the server's /models endpoint that lets users reload language models dynamically without restarting the server process. This is a significant quality-of-life improvement for developers running local LLM inference in production or experimental setups, as it enables hot-swapping of models, A/B testing, and seamless updates to newer quantizations or fine-tuned versions. The feature was contributed by the community in pull request #21848 and is now available across all supported platforms.
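
As a rough illustration, a reload can be triggered with a single HTTP GET against the new endpoint. The sketch below is an assumption-laden example rather than code from the release: it presumes a llama-server instance listening on localhost:8080 and a JSON response body, neither of which is spelled out in the release notes.

    # Minimal sketch: ask a running llama-server to reload its model without a restart.
    # Assumes the server is reachable at localhost:8080; adjust host/port as needed.
    import json
    import urllib.request

    BASE_URL = "http://localhost:8080"  # assumed default llama-server address

    def reload_model() -> dict:
        """GET /models?reload=1 (new in b9023) and return the parsed JSON response."""
        with urllib.request.urlopen(f"{BASE_URL}/models?reload=1", timeout=120) as resp:
            return json.load(resp)

    if __name__ == "__main__":
        print(json.dumps(reload_model(), indent=2))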

This release ships with an extensive array of prebuilt binaries for nearly every major platform: Apple platforms (macOS arm64 with and without KleidiAI acceleration, macOS Intel x64, and an iOS XCFramework), Linux (x64, arm64, and s390x for CPU, plus GPU variants with Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32/FP16), Windows (x64 and arm64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android arm64 (CPU), and openEuler (x86 and aarch64 with ACL Graph acceleration). The wide platform support continues llama.cpp's mission to make local LLM inference accessible everywhere, and for the open-source AI community, b9023 represents a practical step toward more robust self-hosted LLM servers that can be managed with zero downtime.

Key Points
  • New /models?reload=1 API endpoint in llama.cpp server for hot-swapping models without a restart (see the verification sketch after this list)
  • Prebuilt binaries cover macOS, Linux, Windows, Android, iOS, and openEuler with multiple GPU backends (CUDA, ROCm, Vulkan, SYCL, HIP)
  • Community-driven PR #21848 merged; release includes 30+ asset files across all major architectures
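
To make the hot-swap workflow concrete, here is a hypothetical before/after check: list the models the server reports, trigger a reload, then list them again to confirm the swap took effect. The OpenAI-style /models listing (a "data" array of objects with "id" fields) and the localhost:8080 address are assumptions for illustration, not details confirmed by the release.

    # Hypothetical before/after check around a hot-swap, e.g. after replacing the GGUF
    # file the server was started with. Assumes localhost:8080 and an OpenAI-style
    # /models listing ({"data": [{"id": ...}, ...]}); adapt to the actual response.
    import json
    import urllib.request

    BASE_URL = "http://localhost:8080"  # assumed server address

    def get_json(path: str) -> dict:
        """GET a path on the server and parse the body as JSON."""
        with urllib.request.urlopen(f"{BASE_URL}{path}", timeout=120) as resp:
            return json.load(resp)

    def list_model_ids() -> list[str]:
        """Return model ids from an assumed OpenAI-style /models listing."""
        return [m.get("id", "?") for m in get_json("/models").get("data", [])]

    if __name__ == "__main__":
        print("before:", list_model_ids())
        get_json("/models?reload=1")  # trigger the reload introduced in b9023
        print("after: ", list_model_ids())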

Why It Matters

Enables dynamic model management for local LLM deployments, reducing downtime and simplifying experimentation for developers.