Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads
In simulations driven by real supercomputer workload traces, a new malleable scheduling strategy cuts job wait times by 73-99% and improves node utilization compared to rigid scheduling.
Researchers Patrick Zojer, Jonas Posner, and Taylan Özden published a paper evaluating malleable job scheduling in HPC clusters. Using real workload traces from the Cori, Eagle, and Theta supercomputers and the ElastiSim software, they tested five scheduling strategies. Their novel approach, which maintains malleable jobs at their preferred allocations, reduced job wait times by 73-99% and turnaround times by 37-67%, and improved node utilization by 5-52%, compared to rigid scheduling.
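For readers unfamiliar with the three reported metrics, the sketch below shows how wait time, turnaround time, and node utilization are commonly computed from a job trace. It is an illustration only: the `Job` fields, the function names, and the toy numbers are assumptions made here, not the paper's definitions or data, and each job holds a fixed node count for simplicity (a malleable job's allocation would change over time).

```python
from dataclasses import dataclass


@dataclass
class Job:
    submit: float  # time the job entered the queue (seconds)
    start: float   # time the job began running (seconds)
    end: float     # time the job finished (seconds)
    nodes: int     # nodes held while running (fixed here for simplicity)


def wait_time(job: Job) -> float:
    """Time spent in the queue before the job starts."""
    return job.start - job.submit


def turnaround_time(job: Job) -> float:
    """Time from submission to completion (wait plus runtime)."""
    return job.end - job.submit


def node_utilization(jobs: list[Job], total_nodes: int) -> float:
    """Fraction of available node-seconds spent running jobs."""
    makespan = max(j.end for j in jobs) - min(j.submit for j in jobs)
    busy = sum(j.nodes * (j.end - j.start) for j in jobs)
    return busy / (total_nodes * makespan)


# Hypothetical two-job trace on an 8-node cluster (not data from the paper).
jobs = [
    Job(submit=0, start=0, end=100, nodes=4),
    Job(submit=10, start=100, end=150, nodes=4),
]

print(wait_time(jobs[1]))          # queue time of the second job
print(turnaround_time(jobs[1]))    # submission-to-finish time of the second job
print(node_utilization(jobs, 8))   # share of node-seconds in use
```

In this toy trace the second job waits although four nodes sit idle, which is the kind of inefficiency that resizing malleable jobs is meant to reduce.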
Why It Matters
If the simulated gains carry over to production systems, malleable scheduling could substantially increase efficiency and reduce costs for organizations running massive computational workloads on supercomputers.