llama.cpp b9095 adds internal AllReduce for 2-GPU tensor parallelism

llama.cpp Releases May 11, 2026

⚡ New NCCL-free kernel enables tensor parallelism across 2 GPUs with minimal overhead.

Deep Dive

The ggml-org llama.cpp project released b9095, which adds a built-in AllReduce kernel for CUDA tensor parallelism. Previously, tensor-parallel inference across GPUs relied on NVIDIA's NCCL library, which is not always available or easy to install. The new "internal" provider implements a single-phase CUDA kernel that handles the entire AllReduce operation (device-to-host copy, cross-GPU synchronization via pinned-memory volatile flags, and GPU-side reduction) in one kernel launch per GPU. This design keeps overhead low and removes the dependency on an external library.
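To make the three steps concrete, here is a minimal CPU-side simulation in Python. It is not the llama.cpp kernel: two threads stand in for the two GPUs, plain lists stand in for the pinned host buffers, and `threading.Event` objects stand in for the volatile flags. The function name `simulate_two_peer_allreduce` and all variable names are hypothetical.

```python
import threading


def simulate_two_peer_allreduce(a, b):
    """Sum two equal-length float lists the way two flag-synchronized peers might.

    Illustrative only: threads play the role of GPUs, and Events replace the
    spin-wait on volatile flags that a real kernel would use.
    """
    local = [list(a), list(b)]      # each simulated GPU's private tensor
    staging = [None, None]          # stand-in for pinned host staging buffers
    ready = [threading.Event(), threading.Event()]  # stand-in for volatile flags
    result = [None, None]

    def peer(rank):
        other = 1 - rank
        # Step 1: publish own data where the peer can read it (the "D2H copy").
        staging[rank] = list(local[rank])
        # Step 2: signal the peer, then wait for the peer's signal.
        ready[rank].set()
        ready[other].wait()
        # Step 3: reduce own data with the peer's published copy.
        result[rank] = [x + y for x, y in zip(local[rank], staging[other])]

    threads = [threading.Thread(target=peer, args=(r,)) for r in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result  # both peers end up holding the same elementwise sum
```

Calling `simulate_two_peer_allreduce([1.0, 2.0], [3.0, 4.0])` leaves both simulated peers with `[4.0, 6.0]`. The point of the single-phase design is that each peer does publish, handshake, and reduce in one pass, with no separate coordinator.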

Currently, the internal AllReduce is limited to 2 GPUs, FP32 precision, and tensors up to 256 KB. Unsupported configurations fall back to a CPU-based reduction via the meta-backend. The provider is selectable at runtime through the GGML_CUDA_ALLREDUCE environment variable ("nccl" or "internal") or, in llama-bench, the --reduction-provider / -rp flag. The release notes also credit Claude Sonnet 4.6 as a co-author, a sign of the growing role of AI in code generation. The feature makes multi-GPU llama.cpp more accessible to users without NCCL.
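The limits above can be read as a simple eligibility check. The Python sketch below restates them under a few assumptions: 256 KB is taken to mean 256 × 1024 bytes (65,536 FP32 values), the limit is treated as inclusive, and the names `provider_from_env`, `pick_reduction_path`, and the returned labels are hypothetical. The article does not say what happens when the variable is unset, so the sketch simply returns `None` in that case.

```python
import os

# Assumption: "256 KB" means 256 * 1024 bytes, i.e. 65,536 four-byte FP32 values.
MAX_INTERNAL_BYTES = 256 * 1024


def provider_from_env(env=None):
    """Read GGML_CUDA_ALLREDUCE; returns None when it is not set."""
    env = os.environ if env is None else env
    return env.get("GGML_CUDA_ALLREDUCE")


def pick_reduction_path(provider, n_gpus, dtype, n_bytes):
    """Return which reduction path the described rules would select.

    The labels "nccl", "internal", and "cpu_fallback" are illustrative only.
    """
    if provider == "nccl":
        return "nccl"
    if provider == "internal":
        supported = (
            n_gpus == 2
            and dtype == "f32"
            and n_bytes <= MAX_INTERNAL_BYTES  # inclusive limit is an assumption
        )
        return "internal" if supported else "cpu_fallback"
    raise ValueError(f"unknown provider: {provider!r}")
```

Under these assumptions, a 2-GPU FP32 tensor of exactly 256 KB stays on the internal path, while a larger tensor or a 4-GPU setup with the "internal" provider selected would take the CPU-based fallback.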

Key Points

New internal AllReduce kernel removes NCCL dependency for tensor parallelism in llama.cpp.
Currently supports 2 GPUs and FP32 tensors up to 256 KB; unsupported configurations fall back to a CPU-based reduction.
Provider selection via GGML_CUDA_ALLREDUCE env var or --reduction-provider flag in llama-bench.

Why It Matters

Lowers the barrier for multi-GPU inference in llama.cpp by eliminating the NCCL dependency for small-scale parallelism.
