Online GPU Energy Optimization with Switching-Aware Bandits
New algorithm cuts energy waste on massive GPU clusters like Aurora, the world's third-fastest supercomputer.
Researchers Xiongxiao Xu, Solomon Abera Bekele, Brice Videau, and Kai Shu developed EnergyUCB, a lightweight AI controller. It formulates GPU energy optimization as a multi-armed bandit problem in which each arm is a GPU core frequency, and it adjusts that frequency dynamically in real time. Tested on workloads from the Aurora supercomputer, it achieves substantial energy savings with minimal performance slowdown, according to the authors. The system uses an index based on the upper confidence bound (UCB) to balance exploration and exploitation while discouraging wasteful frequency switching that degrades quality of service.
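To make the idea concrete, here is a minimal sketch of a switching-aware UCB controller. It is not the authors' implementation. The class name `SwitchingAwareUCB`, the fixed additive `switch_penalty`, the `explore_coef` parameter, and the simulated reward values are all assumptions made for illustration. The sketch treats each candidate frequency as an arm, scores arms with a standard UCB1-style index, and subtracts a penalty from every arm other than the one currently in use.

```python
import math
import random


class SwitchingAwareUCB:
    """UCB1-style bandit with a fixed penalty for changing arms (illustrative sketch)."""

    def __init__(self, n_arms, explore_coef=1.0, switch_penalty=0.05):
        self.n_arms = n_arms
        self.explore_coef = explore_coef      # scales the exploration bonus (assumed)
        self.switch_penalty = switch_penalty  # cost of leaving the current arm (assumed)
        self.counts = [0] * n_arms            # number of pulls per arm
        self.means = [0.0] * n_arms           # running mean reward per arm
        self.current = None                   # arm used in the previous step
        self.t = 0                            # total number of pulls so far

    def select(self):
        # Pull every arm once before using the index.
        for arm in range(self.n_arms):
            if self.counts[arm] == 0:
                return arm

        def index(arm):
            bonus = self.explore_coef * math.sqrt(
                2.0 * math.log(self.t) / self.counts[arm]
            )
            # Staying on the current arm is free; any other arm pays the penalty.
            penalty = 0.0 if arm == self.current else self.switch_penalty
            return self.means[arm] + bonus - penalty

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
        self.current = arm


def simulate(mean_rewards, steps=2000, noise=0.05, seed=0, **kwargs):
    """Run the bandit against synthetic Gaussian rewards (hypothetical workload)."""
    rng = random.Random(seed)
    bandit = SwitchingAwareUCB(len(mean_rewards), **kwargs)
    switches = 0
    for _ in range(steps):
        arm = bandit.select()
        if bandit.current is not None and arm != bandit.current:
            switches += 1
        bandit.update(arm, rng.gauss(mean_rewards[arm], noise))
    return bandit.counts, switches
```

Calling `simulate([0.2, 0.5, 0.9, 0.5, 0.2])` returns the per-arm pull counts and the number of frequency switches. In a real deployment the reward would come from measured energy and performance counters rather than a synthetic distribution, and the penalty would need tuning against the actual cost of changing frequency.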
Why It Matters
As AI and HPC workloads drive up energy costs, a controller like EnergyUCB directly reduces the operational expense and carbon footprint of large-scale GPU computing.