Minos: Systematically Classifying Performance and Power Characteristics of GPU Workloads on HPC Clusters
New system classifies GPU workloads and predicts their power draw with 4% mean error, helping address power spikes.
A team of researchers including Rutwik Jain, Yiwei Jiang, Matthew D. Sinclair, and Shivaraman Venkataraman has developed Minos, a system for classifying GPU workload characteristics in high-performance computing clusters. As HPC systems increasingly rely on power-hungry accelerators like NVIDIA GPUs to run demanding AI and scientific workloads, they face tight power constraints and damaging "power spikes" that can exceed hardware ratings. Traditional approaches require extensive manual profiling of each application on specific hardware, which does not scale to clusters running many different workloads.
Minos addresses this with a systematic classification mechanism that uses low-cost profiling to group similarly behaving workloads into distinct classes. The system was evaluated on 18 diverse workloads spanning graph analytics, HPC simulations, and machine learning applications. When predicting how previously unseen applications behave under frequency capping, Minos reduced the required profiling time by 89%. It achieved mean errors of 4% for power predictions and 3% for performance predictions, a 10% improvement over state-of-the-art approaches.
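To make the idea of class-based prediction concrete, here is a minimal sketch of how a new workload might be matched to an existing class and its power under a frequency cap estimated from that class's profile. It is not Minos's actual method: the two feature dimensions, the class names, the centroids, the relative-power tables, and the nearest-centroid rule are all hypothetical choices made for illustration.

```python
from math import dist

# Hypothetical class profiles (illustrative numbers, not from the Minos work).
# Each class has a centroid in a made-up feature space
# (compute utilization, memory utilization) and a table of power
# relative to the uncapped frequency, keyed by frequency cap in MHz.
CLASSES = {
    "compute_bound": {
        "centroid": (0.90, 0.30),
        "relative_power": {1400: 1.00, 1200: 0.82, 1000: 0.66},
    },
    "memory_bound": {
        "centroid": (0.35, 0.85),
        "relative_power": {1400: 1.00, 1200: 0.93, 1000: 0.87},
    },
}


def classify(features):
    """Return the name of the class whose centroid is nearest to features."""
    return min(CLASSES, key=lambda name: dist(features, CLASSES[name]["centroid"]))


def predict_power(features, uncapped_power_watts, cap_mhz):
    """Estimate power at a frequency cap from one uncapped measurement.

    The new workload is profiled once (cheaply) at the uncapped frequency;
    the rest of its power-versus-frequency curve is borrowed from its class.
    """
    cls = classify(features)
    scale = CLASSES[cls]["relative_power"][cap_mhz]
    return cls, uncapped_power_watts * scale


# A new, previously unseen workload with a single uncapped measurement.
cls, watts = predict_power((0.40, 0.80), 300.0, 1000)
print(cls, round(watts, 1))
```

The saving in profiling time comes from the same place as in this sketch: the new workload is measured once, and the behavior at other frequency caps is inferred from workloads already known to behave similarly.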
This work addresses a significant bottleneck in managing modern computing infrastructure. As AI models grow larger and more complex, efficient GPU utilization becomes increasingly important for both performance and energy costs. Minos enables automatic optimization of power and performance trade-offs without requiring deep expertise for each new workload, which could lower energy costs and hardware wear while improving overall cluster efficiency.
- Reduces profiling time for new GPU applications by 89% through classification
- Achieves 4% mean error for power predictions and 3% for performance across 18 workloads
- Improves prediction accuracy by 10% over state-of-the-art methods for HPC clusters
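The summary does not say how the "mean error" figures are defined. One common reading is mean absolute percentage error between predicted and measured values. The sketch below shows that calculation under that assumption; the function name and the wattage values are hypothetical and are not results from the Minos evaluation.

```python
def mean_absolute_percentage_error(predicted, measured):
    """Mean of |predicted - measured| / |measured|, expressed as a percentage."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Made-up measurements for three workloads (watts), for illustration only.
measured_watts = [250.0, 200.0, 400.0]
predicted_watts = [260.0, 194.0, 400.0]

# Per-workload errors are 4%, 3%, and 0%, so the mean is 7/3 percent.
mape = mean_absolute_percentage_error(predicted_watts, measured_watts)
print(f"{mape:.2f}%")
```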
Why It Matters
Enables data centers to optimize GPU power consumption and prevent damaging spikes, saving energy costs while maintaining performance.