b9031
New release loads backends only when needed, reducing memory usage and startup time.
Deep Dive
The llama.cpp project released version b9031, which includes a change contributed by Adrien Gallouët from Hugging Face. With this update, backends are loaded only when required: ggml_backend_load_all() is now called directly from llama_backend_init(), and is invoked explicitly in the places that do not use llama_backend_init(). The change applies across all listed platforms: macOS, Linux, Windows, Android, and iOS.
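As a rough sketch of what this means for application code, using the public llama.h C API (model loading and inference are elided):

```c
#include "llama.h"

int main(void) {
    // As of b9031, llama_backend_init() takes care of backend loading,
    // so applications no longer need to call ggml_backend_load_all()
    // themselves before working with a model.
    llama_backend_init();

    // ... load a model and run inference here ...

    llama_backend_free();
    return 0;
}
```

Code that already routes its setup through llama_backend_init() should pick up the new behavior without changes.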
Key Points
- Lazy backend loading: GPU backends such as CUDA and Vulkan are initialized only when actually needed (see the device-listing sketch after this list)
- Contributed by Hugging Face's Adrien Gallouët; reduces startup latency and memory usage
- Available across all major platforms: macOS, Linux, Windows, Android, iOS
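To see which backends were actually registered, ggml's public device registry can be enumerated after initialization. A minimal sketch follows; the exact set of devices printed depends on how llama.cpp was built and which backends are available at runtime:

```c
#include <stdio.h>
#include "llama.h"
#include "ggml-backend.h"

int main(void) {
    llama_backend_init();

    // Walk the registered backend devices (CPU, GPU, etc.).
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("device %zu: %s (%s)\n", i,
               ggml_backend_dev_name(dev),
               ggml_backend_dev_description(dev));
    }

    llama_backend_free();
    return 0;
}
```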
Why It Matters
Smarter resource management means faster startup and lower memory overhead for anyone running LLMs locally.