A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems
A new two-stage AI model uses Slurm logs and NVIDIA DCGM metrics to forecast GPU resource needs.
A team of researchers, including Beste Oztop, Dhruva Kulkarni, and Zhengji Zhao, has published a two-stage framework for predicting GPU resource and power consumption in heterogeneous High-Performance Computing (HPC) systems. The study analyzes the widely used Vienna ab initio Simulation Package (VASP) on the Perlmutter supercomputer, an HPE Cray EX system with NVIDIA A100 GPUs. It draws on historical data from the Slurm workload manager and performance metrics from NVIDIA's Data Center GPU Manager (DCGM). The first stage of the framework trains on Slurm submission logs alone, while the second stage augments that data with detailed historical GPU profiling metrics.
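To make the two-stage idea concrete, here is a minimal sketch of how the two feature sets might differ. It is not the authors' implementation: the field names, the sample values, and the nearest-neighbor predictor are all assumptions chosen for illustration, and the article does not say which learning algorithm the study used.

```python
from statistics import mean

# Hypothetical job history: Slurm submission fields plus DCGM-style averages.
# All field names and numbers are illustrative, not taken from the study.
HISTORY = [
    {"job_name": "vasp_relax", "nodes": 1, "gpus": 4, "time_limit_min": 60,
     "sm_active": 0.80, "dram_active": 0.40, "max_power_w": 300.0},
    {"job_name": "vasp_relax", "nodes": 2, "gpus": 8, "time_limit_min": 120,
     "sm_active": 0.84, "dram_active": 0.44, "max_power_w": 320.0},
    {"job_name": "vasp_md", "nodes": 1, "gpus": 4, "time_limit_min": 240,
     "sm_active": 0.50, "dram_active": 0.20, "max_power_w": 200.0},
    {"job_name": "vasp_md", "nodes": 4, "gpus": 16, "time_limit_min": 480,
     "sm_active": 0.54, "dram_active": 0.24, "max_power_w": 220.0},
]


def stage1_features(job):
    """Stage 1: features available at submission time (Slurm only)."""
    return [float(job["nodes"]), float(job["gpus"]), float(job["time_limit_min"])]


def stage2_features(job, history):
    """Stage 2: Slurm features plus mean profiling metrics of past runs."""
    # Use past runs with the same job name; fall back to all history if none.
    past = [h for h in history if h["job_name"] == job["job_name"]] or history
    sm = mean(h["sm_active"] for h in past)
    dram = mean(h["dram_active"] for h in past)
    return stage1_features(job) + [sm, dram]


def knn_predict(query, rows, targets, k=2):
    """Average the targets of the k rows closest to query (squared Euclidean).

    Features are left unscaled here for brevity; a real model would normalize.
    """
    ranked = sorted(
        (sum((q - r) ** 2 for q, r in zip(query, row)), target)
        for row, target in zip(rows, targets)
    )
    return mean(target for _, target in ranked[:k])


new_job = {"job_name": "vasp_relax", "nodes": 1, "gpus": 4, "time_limit_min": 90}
targets = [h["max_power_w"] for h in HISTORY]

rows1 = [stage1_features(h) for h in HISTORY]
rows2 = [stage2_features(h, HISTORY) for h in HISTORY]

power_stage1 = knn_predict(stage1_features(new_job), rows1, targets)
power_stage2 = knn_predict(stage2_features(new_job, HISTORY), rows2, targets)
print(power_stage1, power_stage2)
```

The point of the sketch is the shape of the inputs: stage 1 sees only what the scheduler knows at submission, while stage 2 appends profiling history for similar jobs. The predictor itself is interchangeable.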
The reported results bear directly on data center efficiency. The model predicting maximum GPU utilization from Slurm features alone achieved up to 97% accuracy. With features engineered from GPU compute and memory activity metrics, the team's runtime power predictions reached up to 92% accuracy. These findings indicate that DCGM metrics capture application-specific behavior well enough to provide a reliable foundation for predictive models.
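As a rough illustration of what engineering features from GPU compute and memory activity can look like, the sketch below collapses a series of per-interval samples into a few per-job summary values. The sample keys are loosely modeled on the SM-activity and DRAM-activity ratios that DCGM exposes. The specific summaries and the `busy_threshold` parameter are assumptions, since the article does not list the features the study actually built.

```python
from statistics import mean


def summarize_samples(samples, busy_threshold=0.5):
    """Reduce per-interval activity samples to per-job summary features.

    `samples` is a list of dicts with hypothetical keys "sm_active" and
    "dram_active", each a ratio between 0.0 and 1.0.
    """
    sm = [s["sm_active"] for s in samples]
    dram = [s["dram_active"] for s in samples]
    return {
        "sm_active_mean": mean(sm),
        "sm_active_max": max(sm),
        "dram_active_mean": mean(dram),
        # Share of intervals in which compute activity met the threshold.
        "busy_fraction": sum(v >= busy_threshold for v in sm) / len(sm),
    }


# Illustrative samples only; not measurements from Perlmutter.
samples = [
    {"sm_active": 0.25, "dram_active": 0.125},
    {"sm_active": 0.75, "dram_active": 0.375},
    {"sm_active": 1.0, "dram_active": 0.5},
    {"sm_active": 0.0, "dram_active": 0.0},
]
features = summarize_samples(samples)
print(features)
```

Summaries like these turn a variable-length time series into a fixed-length vector, which is what most tabular predictors expect as input.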
The practical impact of this research is a step toward dynamic power management and more intelligent scheduling in HPC environments. With forecasts of an application's GPU power, utilization, and memory needs available before it runs, system operators can make better-informed scheduling decisions. That could reduce energy waste, improve hardware allocation, and lower operational costs for large computing infrastructures that increasingly rely on power-hungry GPU clusters.
- The two-stage framework targets GPU power, utilization, and memory usage for HPC applications, reporting up to 97% accuracy for maximum GPU utilization and up to 92% for runtime power usage.
- It analyzes data from Slurm logs and NVIDIA's DCGM metrics collected from A100 GPUs running the VASP materials science code.
- The research enables power-aware scheduling and dynamic resource management, potentially saving significant energy in large data centers.
Why It Matters
Accurate pre-run forecasts of GPU power and utilization could help data centers cut energy costs and allocate hardware more effectively through predictive, AI-driven scheduling.