Research & Papers

Beyond Pre-Training: The Full Lifecycle of Foundation Models on HPC Systems

A Swiss supercomputing center unveils a novel architecture to run the entire AI lifecycle, from training to deployment.

Deep Dive

A team from the Swiss National Supercomputing Centre (CSCS) has published a research paper outlining a novel strategy for running the entire lifecycle of foundation models (pre-training, fine-tuning, and inference) on traditional high-performance computing (HPC) systems. The core challenge is that the massive, batch-oriented compute needs of pre-training align with HPC's strengths, whereas the subsequent phases require more dynamic, service-oriented workflows that fit poorly with established supercomputing environments. The paper argues that fine-tuning and inference are not trivial: they demand substantial high-end resources, but with utilization patterns distinct from those of initial training.

To address this, the researchers are developing and deploying a hybrid, cloud-native platform at CSCS. The architecture combines diskless, GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure, all orchestrated by Kubernetes. This setup allows the supercomputer to support both traditional batch jobs and the agile, persistent services needed for AI refinement and deployment. Their initial investigations into fine-tuning pipelines and highly available inference services demonstrate a path to improving user productivity and integrating 'AI Factory' workflows directly into national research infrastructure, offering a blueprint for other facilities worldwide.
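
To make the contrast with batch jobs concrete, here is a minimal sketch of how a persistent, replicated inference service is typically described to Kubernetes. It is an illustration only, not the CSCS configuration: the service name, container image, port, replica count, and the `nvidia.com/gpu` resource name (the one exposed by NVIDIA's device plugin) are all assumptions, since the summary does not give the platform's actual manifests.

```python
import json


def inference_deployment(name, image, replicas=2, gpus_per_pod=1):
    """Build a Kubernetes Deployment manifest (as a plain dict).

    All argument values used below are hypothetical placeholders.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # More than one replica keeps the service reachable if a pod fails.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "server",
                            "image": image,
                            "ports": [{"containerPort": 8000}],
                            # Assumed GPU resource name; depends on the
                            # device plugin installed on the cluster.
                            "resources": {
                                "limits": {"nvidia.com/gpu": gpus_per_pod}
                            },
                        }
                    ]
                },
            },
        },
    }


# Hypothetical service name and image, for illustration only.
manifest = inference_deployment(
    "llm-inference", "registry.example.org/llm-server:latest"
)
print(json.dumps(manifest, indent=2))
```

Unlike a batch job, which runs to completion and releases its nodes, a Deployment of this kind asks the orchestrator to keep the requested number of replicas running indefinitely. The printed JSON could in principle be passed to `kubectl apply -f`, which accepts JSON as well as YAML.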

Key Points
  • The Swiss National Supercomputing Centre (CSCS) is developing a hybrid platform to manage the full AI lifecycle (pre-training, fine-tuning, inference) on supercomputers.
  • The architecture combines HPE Cray EX compute nodes with virtualized commodity infrastructure, orchestrated by Kubernetes to bridge HPC and cloud-native workflows.
  • The research provides a blueprint for turning capability-oriented supercomputers into 'AI Factories' that support end-to-end scientific and industrial AI applications.

Why It Matters

This blueprint could open supercomputers to agile AI development, accelerating scientific discovery and sovereign AI initiatives beyond initial model training alone.