CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
New engine tackles the all-to-all communication bottleneck in distributed Diffusion Transformers, achieving up to 8.4x speedup.
A team of researchers, including authors from Intel, has introduced CoCoDiff, a new engine designed to dramatically speed up distributed inference of large Diffusion Transformer (DiT) models. As DiTs grow in size and resolution for tasks like high-fidelity image and video generation, running them requires splitting the workload across multiple GPUs. Ulysses sequence parallelism is a standard technique for this scaling, but it introduces a critical bottleneck: frequent, costly 'all-to-all' collective communications between GPUs that can come to dominate total runtime. CoCoDiff attacks this bottleneck directly by restructuring these communications and overlapping them with ongoing computation.
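To make the bottleneck concrete, here is a minimal sketch (ours, not CoCoDiff's code) of the all-to-all that Ulysses performs before attention: each rank starts with a shard of the sequence, and the collective re-shards the tensor so each rank holds the full sequence for a subset of attention heads. It assumes an initialized `torch.distributed` process group (e.g., launched via `torchrun` with the NCCL backend) and a head count divisible by the world size; each of Q, K, and V pays this cost, and the reverse re-shard after attention pays it again.

```python
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x: torch.Tensor) -> torch.Tensor:
    """Re-shard [seq_len/P, num_heads, head_dim] -> [seq_len, num_heads/P, head_dim]."""
    p = dist.get_world_size()
    s, h, d = x.shape
    # Chunk the heads into P groups; chunk i is destined for rank i.
    x = x.reshape(s, p, h // p, d).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x)  # the costly collective, paid several times per layer
    # The received chunks are every rank's sequence shard for our head group;
    # stitching them along dim 0 recovers the full sequence.
    return out.reshape(p * s, h // p, d)
```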
CoCoDiff is built on two key insights about the DiT architecture. First, within the attention mechanism, the 'V' (value) projection requires less processing than the 'Q' (query) and 'K' (key) projections, which creates an opportunity to hide the communication time for 'V' behind the computation for 'Q' and 'K'. Second, DiT inference exhibits temporal redundancy: tensors change little between consecutive denoising steps. The engine turns these insights into three core mechanisms: Tile-Aware Parallel All-to-all (TAPA), which matches communication paths to the hardware topology; V-First scheduling, which overlaps communication with computation; and V-Major selective communication, which sends only essential data over slower network links.
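The following is a hedged sketch of how V-First-style overlap could look in PyTorch; the helper names (`start_a2a`, `stitch`, `v_first_attention`) are ours for illustration, not the paper's API. The non-blocking all-to-all for V is launched first, NCCL runs it on its own stream, and the program waits on it only at the last moment V is actually consumed, so the transfer hides behind the Q/K projections and the QK^T work.

```python
import torch
import torch.distributed as dist

def start_a2a(x: torch.Tensor):
    """Launch a non-blocking sequence->head all-to-all; returns (buffer, work handle)."""
    p = dist.get_world_size()
    s, h, d = x.shape
    x = x.reshape(s, p, h // p, d).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    return out, dist.all_to_all_single(out, x, async_op=True)

def stitch(out: torch.Tensor) -> torch.Tensor:
    """Concatenate the received per-rank sequence shards back into one sequence."""
    p, s, hp, d = out.shape
    return out.reshape(p * s, hp, d)

def v_first_attention(hidden, w_q, w_k, w_v, num_heads):
    def proj(w):  # [s_local, model_dim] -> [s_local, num_heads, head_dim]
        return (hidden @ w).view(hidden.shape[0], num_heads, -1)

    # V-First: the cheaper V path is computed first and its all-to-all is
    # launched immediately, so the transfer overlaps everything below.
    v_buf, v_work = start_a2a(proj(w_v))
    q_buf, q_work = start_a2a(proj(w_q))
    k_buf, k_work = start_a2a(proj(w_k))

    q_work.wait(); k_work.wait()
    q, k = stitch(q_buf), stitch(k_buf)
    scores = torch.einsum("shd,thd->hst", q, k) / q.shape[-1] ** 0.5
    probs = scores.softmax(dim=-1)  # QK^T work proceeds before V has landed

    v_work.wait()                   # block only at the point V is consumed
    return torch.einsum("hst,thd->shd", probs, stitch(v_buf))
    # (A head->sequence all-to-all would follow here to restore the layout.)
```

V-Major selective communication can be caricatured even more simply. Under the temporal-redundancy observation, a tensor whose values barely changed since the previous denoising step need not cross a slow inter-node link at all; the change metric and threshold below are illustrative assumptions, not numbers from the paper.

```python
def should_resend(v_now: torch.Tensor, v_prev: torch.Tensor, threshold: float = 0.05) -> bool:
    # Relative change since the previous denoising step; if it is small, the
    # receiver can reuse its cached copy instead of paying for communication.
    rel = (v_now - v_prev).norm() / (v_prev.norm() + 1e-6)
    return rel.item() > threshold
```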
In benchmarks on the Intel Aurora supercomputer, scaling from 1 to 8 nodes (up to 96 Intel GPU Max Series 'tiles'), CoCoDiff achieved an average speedup of 3.6x over a baseline inference system, with peak gains reaching 8.4x. Optimizing the collective communications this aggressively is a significant step toward making real-time, distributed inference of massive generative AI models like Stable Diffusion 3 or Sora practical and efficient on large-scale HPC systems.
- Tackles the 'all-to-all' communication bottleneck in Ulysses-parallel DiT inference, which can dominate total runtime.
- Uses V-First scheduling to overlap communication for value projections with computation for query/key projections.
- Achieved an average 3.6x (peak 8.4x) speedup on Intel's Aurora supercomputer with up to 96 GPU tiles.
Why It Matters
Enables faster, more scalable inference for next-gen image/video models, making large-scale AI generation feasible on supercomputers.