CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling
New library bypasses traditional networking, achieving up to 1.94x faster GPU communication and cutting hardware costs by 2.75x.
A research team from UC Merced, ByteDance, and Xconn-tech has introduced CCCL (CXL Collective Communication Library), a novel system designed to tackle the GPU-communication bottleneck in large-scale AI clusters. Training massive models like LLMs across multiple servers strains traditional network interconnects such as InfiniBand. CCCL proposes a radical shift: it uses the emerging CXL 3.0 standard to create a shared memory pool accessible by GPUs across different physical nodes, effectively letting them communicate as if they were on the same machine, but at data-center scale.
The technical breakthrough lies in CCCL's software layer, which manages synchronization, data placement, and parallel access over this CXL fabric. Evaluated on a testbed with a TITAN-II CXL switch and Micron CZ120 memory cards, CCCL outperformed a 200 Gbps InfiniBand setup, showing speedups of 1.34x to 1.94x for key collective operations like AllGather and Broadcast. Crucially, in an LLM training scenario, it achieved a 1.11x overall speedup while cutting the required hardware production cost by 2.75x. This demonstrates CXL's potential not just for memory expansion but as a foundational, memory-centric interconnect for future AI infrastructure, promising more efficient and cost-effective supercomputing.
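CCCL's actual API is not reproduced in this summary, but the core idea is easy to sketch: a collective becomes coordinated loads and stores against one shared address space rather than a sequence of network sends. Below is a minimal single-machine sketch of an AllGather in that style, with many stated assumptions: `/dev/shm` stands in for CXL fabric-attached memory, `memcpy` stands in for GPU DMA into the pool, and the rank count, shard size, segment name `/cccl_sketch`, and flag-counter barrier are all illustrative, not CCCL's real protocol.

```cpp
// all_gather_cxl_sketch.cpp -- illustrative only, not CCCL's real API.
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <vector>

constexpr int kRanks = 4;        // participating GPUs (assumed count)
constexpr size_t kShard = 4096;  // bytes each rank contributes (assumed)

// Layout of the shared pool. On real hardware this region would live in
// CXL fabric-attached memory visible to every node; /dev/shm stands in.
struct Pool {
    std::atomic<int> arrived;     // flag counter used as a simple barrier
    char shards[kRanks][kShard];  // one slot per rank
};

int main(int argc, char** argv) {
    int rank = argc > 1 ? std::atoi(argv[1]) : 0;

    // Map the shared pool; a fresh region starts zeroed, so `arrived` is 0.
    int fd = shm_open("/cccl_sketch", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(Pool));
    auto* pool = static_cast<Pool*>(mmap(nullptr, sizeof(Pool),
        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    // Deposit this rank's shard straight into its slot: a load/store write
    // into pooled memory, not a network send. On real hardware this would
    // be a GPU DMA into the CXL window.
    std::vector<char> local(kShard, char('A' + rank));
    std::memcpy(pool->shards[rank], local.data(), kShard);

    // Synchronize: wait until every rank has deposited its shard.
    pool->arrived.fetch_add(1, std::memory_order_release);
    while (pool->arrived.load(std::memory_order_acquire) < kRanks) { /* spin */ }

    // The AllGather completes by reading all slots directly from the pool.
    std::vector<char> out(kRanks * kShard);
    for (int r = 0; r < kRanks; ++r)
        std::memcpy(out.data() + r * kShard, pool->shards[r], kShard);

    std::printf("rank %d gathered %zu bytes\n", rank, out.size());
    munmap(pool, sizeof(Pool));
    return 0;
}
```

Running one process per rank (`./a.out 0` through `./a.out 3`) shows the shape of the approach: because every participant can address the same pool, the NIC and the entire network stack drop out of the data path, which is where the reported speedups over InfiniBand come from.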
- CCCL uses CXL 3.0 memory pooling to let GPUs across servers share memory directly, bypassing traditional network stacks.
- Outperforms 200 Gbps InfiniBand, with speedups of up to 1.94x for AllGather and 1.84x for Broadcast operations.
- In a real LLM training case, achieved a 1.11x speedup while reducing hardware production costs by 2.75x.
Why It Matters
This could drastically reduce the cost and complexity of building giant AI training clusters, accelerating model development.