Research & Papers

GreenGNN cuts distributed GNN training energy by up to 43% with windowed communication

New system reduces GPU energy by up to 71% while boosting throughput by up to 3.9×

Deep Dive

Distributed graph neural network (GNN) training wastes energy for two reasons: each mini-batch triggers thousands of small remote procedure calls (RPCs) to fetch features across partitions, and GPUs draw high baseline power while they wait for those features. A new paper from Arefin Niam, Tevfik Kosar, and M. S. Q. Zulkar Nine introduces GreenGNN, a system that exploits the bursty temporal locality of neighbor sampling. It processes training in windows of consecutive mini-batches, stages frequently accessed features in a local cache, and combines remote requests into bulk transfers. This amortizes RPC overhead and reduces GPU stall time. The window size is selected offline using a discrete-event simulator with a hybrid energy model, which balances communication savings against the staleness of cached data.
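The windowing idea can be illustrated with a toy request counter. This is a minimal sketch, not GreenGNN's implementation: the function names, the `is_remote` predicate, the `cache_capacity` parameter, and the assumption that every mini-batch in a window is known before the window starts are all hypothetical. It only counts requests; it does not model energy, staleness, or transfer sizes.

```python
from collections import Counter


def baseline_requests(batches, is_remote):
    """Toy baseline: one remote request per distinct remote node per mini-batch."""
    return sum(1 for batch in batches for nid in set(batch) if is_remote(nid))


def windowed_requests(batches, is_remote, window_size, cache_capacity):
    """Return (bulk_transfers, on_demand_requests) under a toy windowed scheme."""
    bulk = 0
    on_demand = 0
    for start in range(0, len(batches), window_size):
        window = batches[start:start + window_size]
        # Count how many mini-batches in this window touch each remote node.
        freq = Counter(
            nid for batch in window for nid in set(batch) if is_remote(nid)
        )
        # Stage the most frequently accessed features with one bulk transfer.
        staged = {nid for nid, _ in freq.most_common(cache_capacity)}
        if staged:
            bulk += 1
        # Anything not staged falls back to the on-demand path.
        for batch in window:
            on_demand += sum(
                1 for nid in set(batch) if is_remote(nid) and nid not in staged
            )
    return bulk, on_demand


if __name__ == "__main__":
    # Hypothetical data: node IDs >= 10 live on another partition.
    batches = [[1, 2, 10, 11], [2, 10, 11, 12], [3, 10, 11, 13], [4, 10, 12, 14]]
    is_remote = lambda nid: nid >= 10
    print(baseline_requests(batches, is_remote))  # 11
    print(windowed_requests(batches, is_remote, window_size=2, cache_capacity=2))  # (2, 4)
```

In this toy example, 11 per-batch requests shrink to 2 bulk transfers plus 4 on-demand requests. A larger window or cache would lower the count further, which is the communication side of the trade-off the offline window-size selection has to weigh against staleness.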

GreenGNN is implemented on top of DGL and tested on a 4-node GPU cluster with benchmark datasets. Compared with the baseline DGL implementation, total system energy drops by 27–43% and GPU energy by 36–71%, while throughput improves by up to 3.9×. The system preserves an on-demand path for cache misses to avoid accuracy loss. This work addresses a growing pain point: large-scale GNNs (e.g., on billion-node graphs) require distributed clusters, and energy costs become a significant fraction of total training expense. The approach should apply to sampling-based GNN training pipelines in general and could be integrated into existing ML frameworks.

Key Points
  • Reduces total system energy by 27–43% on a 4-node GPU cluster across benchmark datasets.
  • GPU energy consumption drops 36–71% by cutting RPC initiations and stall time.
  • End-to-end training throughput improves by up to 3.9× over the baseline DGL implementation.

Why It Matters

Makes large-scale GNN training greener and faster, cutting energy costs for distributed ML clusters.