b9076
b9076 adds router endpoint exposing child model info for multi-model setups
llama.cpp, the widely used open-source C++ library for LLM inference (109k stars, 18k forks), has released version b9076. The headline change is to the server's router: the /v1/models endpoint now exposes information about the child models behind it. Applications can therefore query which models are available behind the routing layer, which is essential for multi-model deployments where a single endpoint distributes requests across several underlying models. The change, merged via PR #22683, updates the server API and adds documentation.
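As an illustration, the endpoint can be queried with a few lines of Python. This is a minimal sketch, assuming a llama-server router listening on localhost:8080 (the server's usual default); the exact per-model fields reported for child models in b9076 may differ from what gets printed here.

```python
import json
import urllib.request

# Assumed local router address; adjust to your deployment.
BASE_URL = "http://localhost:8080"

# Query the OpenAI-compatible model listing.
with urllib.request.urlopen(f"{BASE_URL}/v1/models") as resp:
    payload = json.load(resp)

# The listing is conventionally an object with a "data" array; the
# child-model details added in b9076 are expected to appear on each
# entry (exact field names unverified here).
for model in payload.get("data", []):
    print(model.get("id"), model)
```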
The release ships pre-built binaries for an extensive range of platforms: macOS (Apple Silicon arm64 with and without KleidiAI, Intel x64), iOS (XCFramework), Linux (Ubuntu x64/arm64 CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows (x64 CPU, arm64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler (x86 and aarch64 with Ascend). The release commit (9dcf835) is GPG-signed, so its authenticity can be verified. For developers running LLM servers in production, the router endpoint simplifies model lifecycle management, enabling dynamic scaling and better observability without custom routing proxies.
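To make that concrete, the sketch below discovers models at runtime and sends a request to one of them instead of hard-coding model names. list_model_ids and chat are hypothetical helpers written for this example, not part of llama.cpp, and the localhost:8080 router address is again an assumption.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed local router address


def list_model_ids(base_url: str) -> list[str]:
    """Fetch the ids of models available behind the router."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return [m["id"] for m in json.load(resp).get("data", [])]


def chat(base_url: str, model_id: str, prompt: str) -> str:
    """Send an OpenAI-compatible chat completion to a chosen child model."""
    body = json.dumps({
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Discover what the router offers, then target a model dynamically.
models = list_model_ids(BASE_URL)
if models:
    print(chat(BASE_URL, models[0], "Hello"))
```

Because the model list comes from the server itself, adding or removing a child model behind the router requires no client-side configuration change.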
- New router /v1/models endpoint exposes child model info for multi-model server deployments
- Supports 18+ platform variants including Apple Silicon, Linux with Vulkan/ROCm, and Windows with CUDA 12/13
- Release b9076 is based on GPG-signed commit 9dcf835, verified by GitHub
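For anyone who wants to check the commit signature locally rather than rely on GitHub's verification badge, here is a minimal sketch; it assumes the llama.cpp repository is cloned and the signer's public key has been imported into GPG.

```python
import subprocess

# "9dcf835" is the release commit mentioned in the notes above.
result = subprocess.run(
    ["git", "verify-commit", "9dcf835"],
    capture_output=True,
    text=True,
)

# git writes signature details to stderr; a zero exit code means the
# signature verified against a key in the local keyring.
print(result.stderr)
print("signature OK" if result.returncode == 0 else "verification failed")
```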
Why It Matters
Simplifies multi-model LLM server management, enabling dynamic querying of available models without custom routing logic.