GreenDyGNN: Runtime-Adaptive Energy-Efficient Communication for Distributed GNN Training
Static caching wastes up to 45% more energy under network congestion—GreenDyGNN fixes that.
Distributed Graph Neural Network (GNN) training relies heavily on remote feature fetching across partition boundaries, which triggers fine-grained RPCs that waste energy through fixed initiation costs and GPU-stall latency. Prior systems use presampling and static caching to reduce this overhead, but their cache policies cannot adapt to runtime network congestion. The authors show that under time-varying congestion, static caching can actually increase energy consumption by up to 45%, because its fixed rebuild schedule cannot track changing link conditions.
GreenDyGNN treats cache window management as a sequential decision problem, performing intra-epoch cache rebuilds guided by a Double-DQN agent trained in a calibrated simulator with domain-randomized congestion. At each partition boundary it adapts both the rebuild window size and the per-owner cache allocation, and an asynchronous double-buffered pipeline keeps the adaptation overhead negligible. In evaluation, GreenDyGNN reduced total energy by up to 43% over the default DGL system and by 4-24% over the best static caching policy, while closely matching the theoretical optimum under clean network conditions.
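To make the sequential-decision framing concrete, here is a minimal tabular double Q-learning sketch of the decision loop. The paper's agent is a Double-DQN with function approximation trained in a calibrated simulator; the congestion states, candidate window sizes, and energy model below are hypothetical stand-ins chosen for illustration, not values from the paper.

```python
import random

random.seed(0)

# Hypothetical discretization: the congestion level observed at a partition
# boundary is the state; the action picks a cache rebuild window size.
STATES = ["low", "med", "high"]     # observed congestion bucket
ACTIONS = [128, 256, 512, 1024]     # candidate rebuild window sizes

def energy(state, window):
    """Illustrative energy model: congestion makes per-fetch RPCs costlier
    (amortized by a larger window), while cache maintenance cost grows
    with window size."""
    congestion = {"low": 1.0, "med": 2.0, "high": 4.0}[state]
    return congestion * 1000.0 / window + 0.05 * window

def make_table():
    return {(s, a): 0.0 for s in STATES for a in ACTIONS}

q_a, q_b = make_table(), make_table()

def select_action(state, eps=0.2):
    """Epsilon-greedy over the sum of both Q-tables."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_a[(state, a)] + q_b[(state, a)])

def update(state, action, reward, next_state, alpha=0.05, gamma=0.9):
    """Double Q-learning: one table selects the argmax action, the other
    evaluates it, which curbs the overestimation bias of plain Q-learning."""
    select_t, eval_t = (q_a, q_b) if random.random() < 0.5 else (q_b, q_a)
    best_next = max(ACTIONS, key=lambda a: select_t[(next_state, a)])
    target = reward + gamma * eval_t[(next_state, best_next)]
    select_t[(state, action)] += alpha * (target - select_t[(state, action)])

# Train against a synthetic congestion process: at each partition boundary,
# observe congestion, pick a window size, and pay the resulting energy cost.
state = random.choice(STATES)
for _ in range(50000):
    action = select_action(state)
    nxt = random.choice(STATES)          # congestion drifts randomly
    update(state, action, -energy(state, action), nxt)
    state = nxt
```

Under this toy model the agent learns to prefer larger windows when congestion is high, since a bigger cache amortizes the inflated per-fetch RPC cost; the real system would observe congestion from runtime telemetry rather than a simulator bucket.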
- GreenDyGNN uses a Double-DQN reinforcement learning agent to dynamically adjust cache rebuild windows and per-owner allocations.
- Under congestion, it reduces total energy by up to 43% vs. default DGL and by 4-24% vs. the best static policy.
- Static caching can increase energy by up to 45% under time-varying congestion due to fixed rebuild schedules.
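The asynchronous double-buffered pipeline mentioned above can be sketched as a background build plus a pointer flip. This is a minimal illustration assuming a generic `build_fn` cache builder; the class and its interface are hypothetical, not GreenDyGNN's actual API.

```python
import threading

class DoubleBufferedCache:
    """Sketch: a background thread builds the next window's cache while
    training reads the current one, so the swap at a window boundary
    costs only a pointer flip rather than a stall."""

    def __init__(self, build_fn):
        self.build_fn = build_fn
        self.active = build_fn(0)       # buffer the trainer reads from
        self._next = None               # buffer under construction
        self._ready = threading.Event()

    def prefetch(self, window_id):
        """Start building the next window's cache in the background."""
        self._ready.clear()

        def _build():
            self._next = self.build_fn(window_id)
            self._ready.set()

        threading.Thread(target=_build, daemon=True).start()

    def swap(self):
        """At a window boundary: wait if the build is unfinished
        (ideally it already is), then flip buffers."""
        self._ready.wait()
        self.active, self._next = self._next, None
```

If the build finishes before the boundary is reached, `swap()` returns immediately, which is how the adaptation overhead stays negligible in the common case.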
Why It Matters
Enables greener, cheaper distributed GNN training at scale by making caching adapt to real-world network conditions.