Research & Papers

ChunkFlow speeds up offloaded diffusion transformer inference by up to 1.28x

Communication-aware prefetching hides transfer latency and cuts peak GPU memory by 49% versus running without offloading

Deep Dive

Layerwise offloading is a common technique for running large diffusion transformers (DiTs) on limited GPU memory: it prefetches upcoming model layers from host memory while the current layer is computing. But this approach breaks down when per-GPU computation is too small to cover the transfer, or when prefetch traffic competes with inter-GPU collective communication (such as all-reduce) over shared PCIe links. Researchers from UC Merced and the University of Chicago present ChunkFlow, a runtime that treats this as a co-scheduling problem. ChunkFlow uses a first-order analytical model to predict when prefetch can be hidden behind computation, lets prefetch adaptively yield to collective communication, and splits prefetches into smaller chunks, which exposes a tunable tradeoff between memory and latency.
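To make the idea of a first-order overlap check concrete, here is a minimal Python sketch. It is not ChunkFlow's actual model: the names `LayerStep`, `exposed_prefetch_s`, and `split_into_chunks` are hypothetical, and the sketch assumes that prefetch can use the PCIe link only during the part of a layer's compute time that is not occupied by collective communication.

```python
from dataclasses import dataclass


@dataclass
class LayerStep:
    """Hypothetical per-layer inputs for a first-order estimate."""
    weight_bytes: float   # bytes to prefetch for the next layer
    compute_s: float      # compute time of the current layer, in seconds
    collective_s: float   # time collectives occupy the shared link, in seconds


def exposed_prefetch_s(step: LayerStep, link_bytes_per_s: float) -> float:
    """Return the prefetch time NOT hidden behind compute (0.0 = fully hidden)."""
    if link_bytes_per_s <= 0:
        raise ValueError("link_bytes_per_s must be positive")
    prefetch_s = step.weight_bytes / link_bytes_per_s
    # Assumption: prefetch yields to collectives, so only the remaining
    # part of the compute window is available for the transfer.
    window_s = max(0.0, step.compute_s - step.collective_s)
    return max(0.0, prefetch_s - window_s)


def split_into_chunks(total_bytes: int, chunk_bytes: int) -> list[int]:
    """Split one transfer into chunk sizes of at most chunk_bytes each."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    full, rest = divmod(total_bytes, chunk_bytes)
    return [chunk_bytes] * full + ([rest] if rest else [])


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    large = LayerStep(weight_bytes=5e8, compute_s=0.05, collective_s=0.01)
    small = LayerStep(weight_bytes=5e8, compute_s=0.01, collective_s=0.005)
    for name, step in (("large", large), ("small", small)):
        print(name, exposed_prefetch_s(step, link_bytes_per_s=25e9))
    print(split_into_chunks(10, 4))
```

In this simplified picture, a workload with a long compute window hides the whole transfer, while a small workload leaves part of it exposed. Smaller chunks give a scheduler more points at which a transfer can pause for a collective; how ChunkFlow actually sizes and schedules chunks is described in the paper, not here.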

Tested on three representative DiT models on two PCIe-connected H100 GPUs with Ulysses sequence parallelism, ChunkFlow outperforms SGLang's existing layerwise offloading by up to 1.28x in step time. Compared to a no-offload baseline, it cuts peak GPU memory by 49% while maintaining near-identical latency for larger workloads. In the small-workload regime, the tunable memory-latency tradeoff brings step-time overhead back to near zero. The work exposes a key insight: prefetch and communication don't have to fight over the link; they can be orchestrated. This matters as diffusion models grow and distributed inference becomes the norm for image and video generation.

Key Points
  • Up to 1.28x step-time speedup over SGLang's existing layerwise offloading on two H100 GPUs
  • Reduces peak GPU memory by 49% versus the no-offload baseline at near-identical latency for large workloads
  • Uses a first-order analytical model to co-schedule prefetch and collective communication, avoiding PCIe contention

Why It Matters

Enables larger diffusion models on fewer GPUs, reducing cost for AI image/video generation.