b9010
A critical bug fix prevents out-of-memory crashes when using multiple GPUs for LLM inference.
The latest release of llama.cpp, tagged b9010 by GitHub Actions on May 2, addresses a critical bug in CUDA device PCI bus ID de-duplication that caused out-of-memory (OOM) errors on multi-GPU systems. Previously, the de-duplication logic could mistakenly treat up to three other GPUs as duplicates and drop them, leaving fewer devices available to hold the model and leading to memory exhaustion and crashes during large language model inference. The commit, co-authored by Johannes Gäßler, also includes updates to HIP and MUSA macros for broader GPU compatibility.
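To illustrate the kind of logic at issue, here is a minimal C++ sketch of de-duplicating devices by PCI bus ID. This is not the actual llama.cpp implementation; the Device struct, its fields, and dedup_by_pci_bus are hypothetical names used only for illustration. The point is that a candidate must be compared against every device already kept, otherwise distinct GPUs can be discarded as false duplicates and the remaining devices are left to absorb the whole model.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical device record; llama.cpp's real structures differ.
struct Device {
    int         id;      // CUDA ordinal
    std::string pci_bus; // e.g. "0000:65:00.0"
};

// Keep only the first device seen for each PCI bus ID. Correct
// de-duplication compares each candidate against *all* previously
// kept devices; skipping that comparison (or comparing against the
// wrong entry) can drop GPUs that are not duplicates at all,
// shrinking the usable memory pool and provoking OOM.
std::vector<Device> dedup_by_pci_bus(const std::vector<Device> & devs) {
    std::vector<Device> kept;
    for (const Device & d : devs) {
        bool duplicate = false;
        for (const Device & k : kept) {
            if (k.pci_bus == d.pci_bus) { // same physical GPU exposed twice
                duplicate = true;
                break;
            }
        }
        if (!duplicate) {
            kept.push_back(d);
        }
    }
    return kept;
}

int main() {
    const std::vector<Device> devs = {
        {0, "0000:17:00.0"},
        {1, "0000:65:00.0"},
        {2, "0000:65:00.0"}, // genuine duplicate of device 1
        {3, "0000:b3:00.0"},
    };
    for (const Device & d : dedup_by_pci_bus(devs)) {
        std::printf("keeping device %d on bus %s\n", d.id, d.pci_bus.c_str());
    }
    return 0;
}
```

In this sketch, only the true duplicate (device 2) is removed, so devices 0, 1, and 3 all remain available for offloading model layers.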
Beyond the CUDA fix, b9010 ships prebuilt binaries for a wide range of platforms: macOS (Apple Silicon and Intel, including KleidiAI-enabled), Linux (CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, HIP), Android (arm64 CPU), and openEuler. Users running local LLMs on heterogeneous hardware can therefore pick up the stability fix immediately, without building from source. The release is part of the ongoing maintenance of llama.cpp, the popular open-source C++ implementation for running large language models efficiently on consumer and server hardware.
- Fixes a CUDA PCI bus ID de-duplication bug that caused OOM errors by ignoring other GPUs entirely.
- Co-authored by Johannes Gäßler with updates to HIP and MUSA macros for broader GPU support.
- Prebuilt binaries released for macOS, Linux, Windows, Android, and openEuler across multiple backends.
Why It Matters
Ensures reliable multi-GPU inference for local LLM deployments, preventing crashes on high-end setups.