Research & Papers

Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems

Academic cluster cut GPU wait times from 81 minutes to 3.4 minutes without disrupting active research.

Deep Dive

A team of researchers led by Glen MacLachlan has published a significant case study on arXiv detailing a successful, non-disruptive transition of a production High-Performance Computing (HPC) cluster to a modern scheduling system. The paper, "Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems," outlines how they migrated an academic cluster from a traditional node-exclusive model to consumable trackable-resource (TRES) scheduling without interrupting active user workloads. This mid-lifecycle upgrade is notoriously difficult, as a direct cut-over can halt critical research. The team's operational strategy hinged on a three-pronged approach: deploying a time-bounded compatibility layer so that legacy and new submission formats could coexist during the transition, using observability tools to give users feedback and guidance, and conducting targeted engagement to encourage explicit resource declaration.
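The compatibility-layer idea can be sketched as a submission shim that accepts legacy node-exclusive requests and translates them into explicit per-resource declarations until a sunset date. This is a minimal illustration, not the paper's implementation: the per-node defaults, field names, and sunset date below are all hypothetical assumptions.

```python
# Hypothetical sketch of a time-bounded compatibility shim for job submission.
# Per-node defaults, field names, and the sunset date are illustrative
# assumptions; the paper does not publish its actual implementation.
from datetime import date

NODE_CPUS = 48     # assumed CPUs per node
NODE_MEM_GB = 192  # assumed memory per node
SUNSET = date(2026, 6, 1)  # after this date, legacy requests are rejected

def translate(request: dict, today: date = date(2025, 1, 1)) -> dict:
    """Map a legacy node-exclusive request onto explicit consumable resources.

    A request that already declares cpus/mem passes through unchanged; a
    legacy {'nodes': n} request is expanded to explicit per-resource values.
    """
    if "cpus" in request and "mem_gb" in request:
        return request  # already in the new explicit format
    if today >= SUNSET:
        raise ValueError("legacy node-exclusive submissions no longer accepted")
    nodes = request.get("nodes", 1)
    return {
        "cpus": nodes * NODE_CPUS,
        "mem_gb": nodes * NODE_MEM_GB,
        "legacy": True,  # tagged so observability tooling can nudge the user
    }
```

In a real Slurm deployment, logic of this kind would more likely live inside the scheduler (for example, in a job_submit plugin) than in a user-side wrapper, but the translate-then-deprecate shape is the same.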

The results were dramatic. By protecting existing workflows and gradually guiding users, the transition led to massive efficiency gains. Median queue wait times for CPU workloads plummeted from 277 minutes to under 3 minutes—a reduction of over 99%. For GPU workloads, wait times dropped from 81 minutes to just 3.4 minutes. Crucially, users who adopted the new TRES-based submission method showed strong long-term retention, indicating the change was not just technically sound but also user-accepted. The study, submitted to the PEARC'26 conference, concludes that successful HPC transitions depend as much on user engagement and operational design as on the underlying system configuration. This provides a proven blueprint for other institutions looking to modernize aging HPC infrastructure without the typical pain and downtime.
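The headline percentages follow directly from the quoted medians. Taking the "under 3 minutes" CPU figure at its 3-minute upper bound gives roughly 98.9%; any median below about 2.8 minutes pushes the reduction above 99%, consistent with the claim. A quick arithmetic check:

```python
# Sanity-check the reported median wait-time reductions.
# 3.0 is the upper bound of the "under 3 minutes" CPU figure.
cpu_before, cpu_after = 277.0, 3.0  # minutes
gpu_before, gpu_after = 81.0, 3.4   # minutes

cpu_reduction = 100 * (cpu_before - cpu_after) / cpu_before
gpu_reduction = 100 * (gpu_before - gpu_after) / gpu_before

print(f"CPU: {cpu_reduction:.1f}% reduction")  # ≈ 98.9% at the 3-minute bound
print(f"GPU: {gpu_reduction:.1f}% reduction")  # ≈ 95.8%
```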

Key Points
  • Deployed a compatibility layer and user engagement strategy to transition a live HPC cluster without disrupting active research workflows.
  • Achieved a 99% reduction in median queue wait times, from 277 minutes to under 3 minutes for CPU jobs.
  • GPU wait times fell from 81 minutes to 3.4 minutes, with high user retention for the new scheduling system.

Why It Matters

Provides a proven framework for upgrading critical research computing infrastructure without causing costly downtime or workflow disruption.