llama.cpp b9189 improves CUDA context handling in router mode

llama.cpp Releases May 17, 2026

⚡ Skip device enumeration to prevent unwanted CUDA primary context creation

Deep Dive

The llama.cpp project, maintained under the ggml-org organization, has released version b9189 with a targeted fix for CUDA context handling. The change applies to server/router mode: the release now skips device enumeration there, which prevents a primary CUDA context from being created. Previously, enumerating devices in router mode could inadvertently initialize a CUDA context on each device, adding memory and latency overhead, especially on systems with multiple GPUs.
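The idea can be illustrated with a small simulation. This is a minimal sketch and not llama.cpp's actual code: `FakeGpuRuntime`, `query_device`, and `start_process` are hypothetical names, and the sketch assumes that querying a device creates a primary context on it as a side effect and that a router process only dispatches work and never runs inference itself.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGpuRuntime:
    """Hypothetical stand-in for a GPU runtime; records which devices got a context."""
    device_count: int
    contexts_created: set = field(default_factory=set)

    def query_device(self, index: int) -> dict:
        # Assumption for illustration: querying a device's details
        # initializes a primary context on that device as a side effect.
        self.contexts_created.add(index)
        return {"index": index, "name": f"fake-gpu-{index}"}


def start_process(runtime: FakeGpuRuntime, router_mode: bool) -> list:
    """Return the devices this process has enumerated during startup."""
    if router_mode:
        # Assumption: the router only dispatches requests, so it can
        # skip enumeration entirely and never touch the GPUs.
        return []
    return [runtime.query_device(i) for i in range(runtime.device_count)]


# Each process gets its own simulated runtime with four GPUs.
router_runtime = FakeGpuRuntime(device_count=4)
worker_runtime = FakeGpuRuntime(device_count=4)

router_devices = start_process(router_runtime, router_mode=True)
worker_devices = start_process(worker_runtime, router_mode=False)

print("contexts created by router:", len(router_runtime.contexts_created))
print("contexts created by worker:", len(worker_runtime.contexts_created))
```

In this model the router finishes startup without creating a context on any device, while a process that enumerates touches every GPU. On real hardware the per-context cost depends on the driver and device, so the sketch shows only the control flow and not the size of the saving.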

This release also continues llama.cpp's broad platform support. Builds are provided for macOS (both Apple Silicon and Intel), Linux (CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL), Windows (CPU, CUDA 12 and 13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler (with ACL Graph support). The CUDA context fix is most relevant to CUDA builds used for multi-GPU server deployments, where it should make local LLM inference more efficient for power users.

Key Points

b9189 skips device enumeration in router mode to avoid creating unnecessary CUDA primary contexts
Reduces overhead for multi-GPU setups, improving inference efficiency
Supported on macOS, Linux, Windows, Android, and openEuler with multiple backends (CUDA, Vulkan, ROCm, SYCL, HIP)

Why It Matters

Optimizes CUDA resource management for multi-GPU llama.cpp servers, enabling smoother local LLM deployments.
