llama.cpp b9095 adds internal AllReduce for 2-GPU tensor parallelism
New NCCL-free kernel enables tensor parallelism across 2 GPUs with minimal overhead.
The ggml-org llama.cpp project released b9095, which adds a built-in AllReduce kernel for CUDA tensor parallelism. Previously, multi-GPU tensor parallelism required NVIDIA's NCCL library, which is not always available or easy to install. The new 'internal' provider implements a single-phase CUDA kernel that handles the entire AllReduce operation in one kernel launch per GPU: a device-to-host copy, cross-GPU synchronization via volatile flags in pinned memory, and a GPU-side reduction. Using a single launch per GPU keeps launch overhead low, and the provider needs no external communication library.
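The three steps described above can be illustrated with a small CPU-side simulation. This is only a sketch of the general pattern (stage, signal, wait, reduce), not the actual CUDA kernel: two Python threads stand in for the two GPUs, a shared list stands in for pinned host memory, and `threading.Event` objects stand in for the volatile flags. The function name `allreduce_2way` is hypothetical.

```python
import threading


def allreduce_2way(local_a, local_b):
    """Simulate a sum AllReduce between two participants (hypothetical helper)."""
    inputs = [list(local_a), list(local_b)]
    staging = [None, None]  # stands in for pinned host memory
    ready = [threading.Event(), threading.Event()]  # stand in for volatile flags
    results = [None, None]

    def worker(rank):
        peer = 1 - rank
        # Step 1: publish this participant's data to the shared staging area.
        staging[rank] = list(inputs[rank])
        # Step 2: signal that the data is visible, then wait for the peer's signal.
        ready[rank].set()
        ready[peer].wait()
        # Step 3: each participant reduces locally, so both end with the same sum.
        results[rank] = [x + y for x, y in zip(inputs[rank], staging[peer])]

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(allreduce_2way([1.0, 2.0], [10.0, 20.0]))
```

After the call, both participants hold the element-wise sum of the two inputs, which is the defining property of a sum AllReduce.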
Currently, the internal AllReduce is limited to 2 GPUs, FP32 precision, and tensors up to 256 KB. For unsupported configurations, it falls back to a CPU-based reduction via the meta-backend. The provider is selectable at runtime using the GGML_CUDA_ALLREDUCE environment variable ("nccl" or "internal") or the --reduction-provider / -rp flag in llama-bench. The release notes credit Claude Sonnet 4.6 as a co-author, an example of the growing role of AI in code generation.
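The stated limits can be read as a simple eligibility check. The sketch below is illustrative only: the function name `reduction_path`, the dtype string `"f32"`, the return label `"cpu-fallback"`, and the assumption that the 256 KB bound is inclusive are not taken from the llama.cpp source.

```python
MAX_BYTES = 256 * 1024  # 256 KB limit from the release notes (inclusive bound assumed)


def reduction_path(provider, n_gpus, dtype, nbytes):
    """Return which reduction path would handle a tensor (hypothetical helper)."""
    if provider != "internal":
        # Any other provider, such as "nccl", is outside the scope of this sketch.
        return provider
    supported = n_gpus == 2 and dtype == "f32" and nbytes <= MAX_BYTES
    return "internal" if supported else "cpu-fallback"


if __name__ == "__main__":
    # 65,536 FP32 values occupy exactly 256 KB.
    print(reduction_path("internal", 2, "f32", 65536 * 4))
    print(reduction_path("internal", 2, "f32", 65536 * 4 + 4))
```

In practice the provider string would come from the GGML_CUDA_ALLREDUCE environment variable or the llama-bench flag; here it is passed in directly so the check stays self-contained.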
- New internal AllReduce kernel removes the NCCL dependency for tensor parallelism in llama.cpp.
- Currently supports 2 GPUs and FP32 tensors up to 256 KB; unsupported configurations fall back to a CPU-based reduction.
- Provider selection via GGML_CUDA_ALLREDUCE env var or --reduction-provider flag in llama-bench.
Why It Matters
Lowers the barrier for multi-GPU inference in llama.cpp by eliminating the NCCL dependency for small-scale parallelism.