ZEROGNN speeds up GNN training by up to 5.28x by cutting CPU overhead
New system eliminates GPU-CPU syncs to reach a near-100% GPU execution fraction.
Modern GNN training suffers from host-device orchestration overhead because the CPU mediates metadata-driven decisions such as memory provisioning and kernel launches. The resulting synchronization dominates runtime when the GPU compute per iteration is small. Existing solutions like CUDA Graphs fall short because they replay a fixed, pre-captured execution structure, while the structure of sampling-based GNN training varies across iterations. The team behind ZEROGNN traced the problem to its root: the metadata-driven control loop remains host-mediated, which prevents full GPU residency.
ZEROGNN tackles this by moving all runtime metadata to the GPU, mediating dynamic execution within a fixed launch structure, and provisioning a conservative yet tight execution envelope to restore CUDA Graph replayability. In experiments on sampling-based GNN workloads, ZEROGNN achieved up to 5.28x speedup, near 100% GPU execution fraction, and memory efficiency comparable to ideal metadata-informed allocation. It also enables strong multi-GPU scaling by eliminating host-side bottlenecks. The work is led by Yidong Gong and colleagues, with a paper available on arXiv (2605.29346).
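The idea of a fixed launch structure over a conservative envelope can be illustrated with a small sketch. This is not ZEROGNN's code: it is plain Python standing in for GPU kernels, and the names `envelope` and `fixed_shape_sum`, as well as the simple worst-case bound used here, are assumptions made for illustration. The sketch sizes buffers from the batch size and fanouts alone, then runs a step whose loop bounds never depend on how many entries are actually valid.

```python
def envelope(batch_size, fanouts, num_graph_nodes):
    """Worst-case (max_edges, max_frontier_nodes) per sampling layer.

    Hypothetical helper: assumes each frontier node samples at most
    `fanout` neighbors and the next frontier keeps the current nodes.
    This bound is conservative; the paper's envelope is described as
    tighter, and its method is not reproduced here.
    """
    frontier = batch_size
    layers = []
    for fanout in fanouts:
        max_edges = frontier * fanout
        # The frontier can never exceed the number of nodes in the graph.
        frontier = min(frontier + max_edges, num_graph_nodes)
        layers.append((max_edges, frontier))
    return layers


def fixed_shape_sum(buffer, valid_count):
    """Sum the valid prefix of a padded buffer with a fixed loop shape.

    `valid_count` is a one-element list standing in for a scalar that
    would live in GPU memory. The loop always covers the full capacity
    and masks padding, so no host-side branch depends on the count.
    """
    total = 0.0
    for i in range(len(buffer)):
        mask = 1.0 if i < valid_count[0] else 0.0
        total += mask * buffer[i]
    return total


if __name__ == "__main__":
    print(envelope(4, [3, 2], 1000))
    padded = [1.0, 2.0, 3.0, 99.0, 99.0]  # last two slots are padding
    print(fixed_shape_sum(padded, [3]))
```

Because the buffer capacity and the loop bounds are known before any sampling happens, the same sequence of operations can be issued every iteration, which is the property a captured graph needs in order to be replayed. The trade-off is wasted work and memory on padding, which is why a tight envelope matters.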
- ZEROGNN removes the CPU from the metadata-driven control loop, enabling fully GPU-resident execution for sampling-based GNN training.
- Achieves up to 5.28x end-to-end speedup and near 100% GPU execution fraction with memory efficiency matching ideal allocation.
- Supports strong multi-GPU scaling by eliminating host-side bottlenecks, addressing a key limitation of existing approaches like CUDA Graphs.
Why It Matters
ZEROGNN removes host-side orchestration, a critical bottleneck in sampling-based GNN training, which could make graph-based AI workloads up to 5.28x faster on existing hardware.