Llama.cpp b9105 fixes CUDA stability with direct iterator inclusion

llama.cpp Releases May 11, 2026

⚡110k-star LLM engine patches transitive cub dependency for reliable GPU inference

Deep Dive

Llama.cpp, the highly popular open-source C++ implementation for running large language models locally, has released version b9105. With over 110,000 stars and 18,100 forks on GitHub, the project is a cornerstone for developers seeking efficient local inference across diverse hardware. The release addresses a subtle but critical CUDA build issue: previously, the code relied on cub/cub.cuh to pull in cuda/iterator transitively instead of including that header itself. This was fragile because cub does not consistently expose the header, so the build could fail to compile on certain CUDA configurations. Including cuda/iterator directly removes the dependence on cub's internal include chain and keeps GPU-accelerated builds reliable, particularly for those using custom build pipelines or newer CUDA toolkits.
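The underlying principle, including every header a file actually uses instead of hoping another header drags it in, can be sketched with the C++ standard library. This is a host-only analogy under stated assumptions, not llama.cpp source: the helper name `sum_first_n` is hypothetical, and `<numeric>` stands in for the role cuda/iterator plays in the CUDA code.

```cpp
// Include-what-you-use sketch (standard-library analogy, not llama.cpp code).
// std::iota and std::accumulate are declared in <numeric>. Some toolchains
// happen to make them visible through other headers, others do not, so code
// that omits the include can compile on one setup and fail on another.
#include <cstddef>  // direct include for std::size_t
#include <numeric>  // direct include for std::iota and std::accumulate
#include <vector>   // direct include for std::vector

// Hypothetical helper: returns 0 + 1 + ... + (n - 1), or 0 when n <= 0.
int sum_first_n(int n) {
    if (n <= 0) {
        return 0;
    }
    std::vector<int> values(static_cast<std::size_t>(n));
    std::iota(values.begin(), values.end(), 0);  // fill with 0, 1, ..., n - 1
    return std::accumulate(values.begin(), values.end(), 0);
}
```

Calling `sum_first_n(5)` adds 0 through 4. The point of the sketch is the include list rather than the arithmetic: each header the function depends on is named explicitly, so the file keeps compiling even if another header stops including it internally.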

The b9105 release maintains llama.cpp's reputation for broad platform support. Prebuilt binaries are available for macOS Apple Silicon (both standard and KleidiAI-enabled), Linux (x64, arm64, and s390x, with Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32/FP16 builds), Android arm64, Windows (x64 and arm64 CPU builds, plus CUDA 12.4 and 13.1 DLLs, Vulkan, and HIP), and openEuler (x86 and aarch64 with ACL Graph). This coverage means developers running LLMs on everything from gaming PCs to cloud VMs to edge devices can pick up the fix immediately. The release also signals ongoing maintenance of a project that has become essential for the local AI community, where stability and performance are paramount.

Key Points

Version b9105 directly includes cuda/iterator instead of relying on a transitive include through cub/cub.cuh
Fragile cub dependency caused compilation failures on some CUDA configurations
Prebuilt binaries available for macOS, Linux, Windows, Android, and openEuler across multiple GPU backends

Why It Matters

Stable CUDA builds are critical for developers running LLMs locally on diverse hardware, and this fix removes a common build failure.
