Research & Papers

Rank-Aware Resource Scheduling for Tightly-Coupled MPI Workloads on Kubernetes

New rank-aware scheduling for MPI workloads on Kubernetes delivers a 20% speedup while freeing 82% of the CPU provisioned for sparse subdomains.

Deep Dive

Researcher Tianfang Xie has published a technical paper, "Rank-Aware Resource Scheduling for Tightly-Coupled MPI Workloads on Kubernetes," on arXiv. The work addresses a longstanding problem in high-performance computing: how to run tightly coupled Message Passing Interface (MPI) workloads, such as Computational Fluid Dynamics (CFD) solvers, efficiently on shared, cloud-managed Kubernetes clusters. Traditional approaches provision all MPI ranks equally, wasting resources on low-load subdomains. Xie's system instead provisions CPU at fine granularity: each MPI rank runs in its own Kubernetes pod with a CPU request proportional to its actual computational load, measured by its subdomain's cell count.
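The proportional scheme described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the function name, the CPU budget, and the minimum-request floor are all assumptions.

```python
# Sketch: split a node's CPU budget across MPI ranks in proportion to each
# rank's subdomain cell count. All names and defaults are illustrative.

def proportional_cpu_requests(cells_per_rank, total_cpus, min_millicpu=100):
    """Return per-rank CPU requests in Kubernetes millicpu units.

    Each rank's share of `total_cpus` matches its share of the total cell
    count, with a floor (`min_millicpu`) so no rank is starved entirely.
    """
    total_cells = sum(cells_per_rank)
    requests = []
    for cells in cells_per_rank:
        share = int(total_cpus * 1000 * cells / total_cells)  # millicpu
        requests.append(max(share, min_millicpu))
    return requests

# Example: 4 ranks with unbalanced subdomains sharing an 8 vCPU budget.
reqs = proportional_cpu_requests([40_000, 30_000, 20_000, 10_000], total_cpus=8)
print(reqs)  # [3200, 2400, 1600, 800] millicpu
```

The resulting millicpu values would then be written into each rank pod's `resources.requests.cpu` field (e.g. `"3200m"`).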

The research reveals three major findings. First, imposing hard CPU limits via Linux's CFS bandwidth controller causes catastrophic 78x slowdowns from cascading stalls at MPI_Allreduce barriers; switching to requests-only allocation eliminates the throttling entirely. Second, on AWS EC2 c7i.metal instances, the paper's concentric decomposition with equal CPU allocation is already 19% faster than the Scotch baseline, and adding proportional CPU allocation yields a further 3% improvement. Third, at 16 MPI ranks on 101K-cell meshes, proportional allocation is 20% faster than equal allocation while cutting the CPU provisioned for sparse subdomains by 82%, freeing 6.5 vCPU of scheduling headroom.
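The requests-only finding translates directly into how each rank's pod spec is written. A minimal sketch of such a manifest is below; the pod name, image, and values are placeholders, not taken from the paper.

```yaml
# Illustrative rank pod: CPU *requests* only, no CPU *limits*, so the CFS
# bandwidth controller never throttles the rank mid-timestep.
apiVersion: v1
kind: Pod
metadata:
  name: mpi-rank-3              # hypothetical pod name
spec:
  containers:
  - name: solver
    image: example.com/cfd-solver:latest   # placeholder image
    resources:
      requests:
        cpu: "1500m"            # proportional to this rank's cell count
        memory: "2Gi"
      # no limits.cpu: a CFS quota here is what triggers the cascading
      # stalls at MPI_Allreduce barriers described above
```

Requests still guide the Kubernetes scheduler's placement decisions; omitting the CPU limit simply means the kernel never enforces a hard quota on a busy rank.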

The implementation leverages Kubernetes v1.35's In-Place Pod Vertical Scaling feature to adjust CPU mid-simulation without pod restarts. Experiments ran on AWS EC2 c7i.metal clusters (4 to 16 ranks) under k3s v1.35. All scripts and data have been released as open source, making the approach directly reproducible for organizations running HPC workloads on Kubernetes. This is a substantial advance in cloud-native high-performance computing resource management.
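With in-place vertical scaling, a rank's CPU request is adjusted by patching the pod's `resize` subresource rather than recreating the pod. The helper below builds such a patch; it is a sketch assuming a cluster with the feature enabled and a recent kubectl, and the pod and container names are hypothetical.

```python
# Sketch: construct an in-place CPU resize for one rank pod. Assumes
# In-Place Pod Vertical Scaling is available; names are illustrative.
import json

def resize_patch(container, millicpu):
    """JSON body that adjusts one container's CPU request in place."""
    return json.dumps({
        "spec": {"containers": [
            {"name": container,
             "resources": {"requests": {"cpu": f"{millicpu}m"}}}
        ]}
    })

def kubectl_resize_cmd(pod, container, millicpu):
    # The patch targets the pod's `resize` subresource, so the change is
    # applied without restarting the pod and the MPI job keeps running.
    return (f"kubectl patch pod {pod} --subresource resize "
            f"--patch '{resize_patch(container, millicpu)}'")

print(kubectl_resize_cmd("mpi-rank-3", "solver", 2000))
```

A scheduler loop could call this between solver timesteps whenever a rank's load estimate changes, keeping per-rank requests aligned with the evolving decomposition.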

Key Points
  • Eliminates 78x slowdowns caused by Linux CFS throttling at MPI barriers by using requests-only CPU allocation
  • Achieves 20% performance improvement over equal CPU allocation while reducing sparse-subdomain CPU provision by 82%
  • Leverages Kubernetes v1.35's In-Place Pod Vertical Scaling for dynamic CPU adjustment without pod restarts

Why It Matters

Enables organizations to run HPC workloads 20% faster on Kubernetes while dramatically reducing cloud compute costs through better resource utilization.