NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL
New library cuts MoE model latency by 50% and boosts throughput for 4096+ token batches.
NVIDIA has unveiled NCCL EP (Expert Parallelism), a ground-up communication library designed specifically for the Mixture-of-Experts (MoE) architectures that power today's largest language models. Built entirely on NCCL's Device API, it provides unified `ncclEpDispatch` and `ncclEpCombine` primitives with both C and Python interfaces. The library targets two critical workloads: Low-Latency (LL) mode serves inference decoding with small batches (1-128 tokens) over direct all-to-all RDMA and NVLink mesh connectivity, while High-Throughput (HT) mode serves training and inference prefill with large batches (4096+ tokens) through hierarchical communication that aggregates tokens within each NVLink domain before crossing the network.
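To picture what the two primitives compute, here is a minimal single-process NumPy sketch of dispatch/combine semantics. The function names, shapes, and gate layout are illustrative assumptions, not NCCL EP's actual signatures; the real primitives move tokens between GPUs over NVLink and RDMA rather than between Python lists.

```python
# Illustrative single-process model of MoE dispatch/combine semantics.
# Names and layouts are assumptions for exposition only; the real
# ncclEpDispatch/ncclEpCombine transfer tokens across GPUs.
import numpy as np

def dispatch(tokens, expert_ids, num_experts):
    """Group each routed (token, expert) pair by destination expert."""
    # tokens: [num_tokens, hidden]; expert_ids: [num_tokens, top_k]
    buckets = [[] for _ in range(num_experts)]
    for t in range(tokens.shape[0]):
        for e in expert_ids[t]:
            buckets[e].append(t)
    return [tokens[idx] for idx in buckets], buckets

def combine(expert_out, buckets, gate, num_tokens, hidden):
    """Gather expert outputs and reduce each token's top-k results."""
    out = np.zeros((num_tokens, hidden))
    for e, idx in enumerate(buckets):
        for row, t in enumerate(idx):
            out[t] += gate[t, e] * expert_out[e][row]  # weighted sum
    return out

# Round-trip check with experts that simply double their inputs.
T, H, E, K = 8, 4, 4, 2
rng = np.random.default_rng(0)
tokens = rng.standard_normal((T, H))
expert_ids = np.stack([rng.choice(E, size=K, replace=False) for _ in range(T)])
gate = np.zeros((T, E))
for t in range(T):
    gate[t, expert_ids[t]] = 1.0 / K              # uniform top-k gate weights

dispatched, buckets = dispatch(tokens, expert_ids, E)
expert_out = [2.0 * x for x in dispatched]        # stand-in for expert FFNs
out = combine(expert_out, buckets, gate, T, H)
assert np.allclose(out, 2.0 * tokens)
```

The round-trip check passes because every token reaches exactly K experts with weight 1/K each, so the combine reconstructs the doubled token.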
NCCL EP's architecture uses GPU-initiated communication for both intra- and inter-node data transfer, drawing on NCCL's topology awareness and optimized transport implementation. LL mode employs double-buffered communication to overlap the dispatch and combine phases, while HT mode is tuned for the massive token counts common in training. Evaluations on multi-node H100 clusters show competitive kernel performance and promising end-to-end results when integrated with popular inference engines such as vLLM.
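One plausible reading of the double-buffering scheme, sketched below under stated assumptions, is that two buffer slots let the combine for one micro-batch stay in flight while the next micro-batch is dispatched and computed. This is not NCCL EP's implementation; threads stand in for GPU-initiated transfers and all names are hypothetical.

```python
# Hedged sketch of a double-buffered decode loop: the asynchronous
# combine for micro-batch i overlaps with the dispatch and expert
# compute of micro-batch i+1. Threads mimic in-flight transfers.
from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor(max_workers=2)

def fake_transfer(batch):
    time.sleep(0.01)                    # stand-in for NVLink/RDMA latency
    return batch

def decode_loop(micro_batches):
    inflight = [None, None]             # one slot per communication buffer
    outputs = []
    for i, batch in enumerate(micro_batches):
        buf = i % 2
        if inflight[buf] is not None:   # drain this buffer's previous combine
            outputs.append(inflight[buf].result())
        routed = fake_transfer(batch)                   # dispatch phase
        expert_out = [2 * x for x in routed]            # stand-in expert FFN
        inflight[buf] = pool.submit(fake_transfer, expert_out)  # async combine
    n = len(micro_batches)
    for i in range(max(0, n - 2), n):   # drain the last buffers in issue order
        outputs.append(inflight[i % 2].result())
    return outputs

print(decode_loop([[1, 2], [3, 4], [5, 6]]))  # -> [[2, 4], [6, 8], [10, 12]]
```

The key property is that `pool.submit` returns before the combine finishes, so the next iteration's dispatch proceeds while the previous combine is still moving data.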
By building MoE communication natively into NCCL, its established collective communications library, NVIDIA provides a standardized, supported path for expert parallelism. This eliminates the need for fragmented, specialized libraries like DeepEP or Hybrid-EP, offering developers a unified API that works across current and future NVIDIA platforms. The research paper demonstrates how optimized communication protocols can significantly reduce latency and increase throughput for the sparse activation patterns characteristic of MoE models.
- Provides unified `ncclEpDispatch` and `ncclEpCombine` primitives with C/Python APIs for MoE models
- Features Low-Latency mode for 1-128 token inference and High-Throughput mode for 4096+ token training (see the traffic sketch after this list)
- Leverages GPU-initiated RDMA and NVLink connectivity with topology-aware optimization, evaluated on H100 clusters
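To make the HT-mode bullet concrete, the back-of-envelope sketch below counts inter-node sends under an assumed two-level scheme: in a flat all-to-all, a token routed to top-k experts can cross the network once per remote expert, whereas aggregating within the NVLink domain first sends it at most once per remote destination node. The uniform routing model and cluster shape are illustrative, not measurements from the paper.

```python
# Assumption-based estimate of inter-node traffic saved by hierarchical
# dispatch. The sender sits on node 0; each token is routed to top_k
# distinct expert ranks chosen uniformly at random.
import random

def internode_sends(num_tokens, top_k, gpus_per_node, num_nodes, hierarchical):
    random.seed(0)                      # identical routing for both variants
    total_gpus = gpus_per_node * num_nodes
    sends = 0
    for _ in range(num_tokens):
        targets = random.sample(range(total_gpus), top_k)
        if hierarchical:
            # One aggregated copy per remote destination node.
            sends += len({g // gpus_per_node for g in targets} - {0})
        else:
            # Flat all-to-all: one copy per remote expert rank.
            sends += sum(1 for g in targets if g // gpus_per_node != 0)
    return sends

flat = internode_sends(4096, 8, 8, 4, hierarchical=False)
hier = internode_sends(4096, 8, 8, 4, hierarchical=True)
print(flat, hier, round(flat / hier, 2))  # hierarchical cuts network copies
```

With 8 GPUs per node and 4 nodes, a token's 8 routed experts span at most 3 remote nodes, so hierarchical dispatch roughly halves the inter-node copies in this toy model.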
Why It Matters
Enables faster, more efficient MoE model training and inference at scale, reducing fragmentation across expert-parallel communication libraries and accelerating AI progress.