Sarus Suite: Cloud-native Containers for HPC

New container architecture matches the performance of a production HPC container stack while enabling direct use of upstream OCI images and Kubernetes manifests.

Deep Dive

A research team from ETH Zurich and CSCS has introduced Sarus Suite, a novel container architecture designed to bridge the gap between high-performance computing (HPC) systems and mainstream cloud-native workflows. The system's core innovation is its use of an unchanged, upstream Podman container engine, avoiding the specialized runtime stacks that typically create compatibility issues. Sarus Suite adds the necessary HPC functionality through complementary system layers that handle declarative runtime specification, scheduler-native execution, scalable shared-image access, and standards-based host capability injection. This approach preserves continuity with the rapidly evolving container ecosystem while meeting HPC's stringent requirements for scheduler control and production performance.

The team evaluated Sarus Suite on a Cray EX GH200 supercomputer using communication-intensive HPC workloads such as PyFR and SPH-EXA, large-scale AI training with Megatron-LM, and metadata-heavy startup workloads. Sarus Suite matched the performance and scaling of the production-grade Enroot+Pyxis baseline while delivering consistently faster per-node container startup. The architecture also enables direct use of upstream OCI images, including popular NGC-based AI/ML images, and supports cloud-native multi-container workflows expressed through standard Kubernetes manifests. Together, these results indicate that HPC-grade containers do not require HPC-specific runtimes when the necessary integration is implemented in explicit system layers.
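To make the idea of a multi-container workflow expressed as a standard Kubernetes manifest more concrete, here is a minimal Python sketch that builds a two-container Pod manifest. It is an illustration only: the helper name `build_pod_manifest`, the pod and container names, and the image references are hypothetical, and the summary above does not describe how Sarus Suite itself ingests manifests.

```python
import json


def build_pod_manifest(name, containers):
    """Return a minimal Kubernetes v1 Pod manifest as a plain dict.

    `containers` is a list of (container_name, image, command) tuples.
    This helper is hypothetical and not part of Sarus Suite.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {"name": cname, "image": image, "command": list(command)}
                for cname, image, command in containers
            ],
        },
    }


# Hypothetical two-container workflow; the image references are placeholders.
manifest = build_pod_manifest(
    "train-with-monitor",
    [
        ("trainer", "registry.example.com/ai/trainer:latest", ["python", "train.py"]),
        ("monitor", "registry.example.com/tools/monitor:latest", ["monitor", "--interval", "10"]),
    ],
)

# JSON is a subset of YAML, so this output is also a valid YAML manifest.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Because the output is plain Kubernetes Pod syntax, the same file could in principle be handed to any tool that accepts such manifests, for example upstream Podman's `podman kube play` command. Whether Sarus Suite exposes exactly that path is an assumption here, not something the summary confirms.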

The work argues for a shift in HPC container strategy: away from isolated, specialized runtimes and toward architectures that stay aligned with the broader cloud-native ecosystem. By keeping the container engine upstream-aligned, Sarus Suite reduces the maintenance burden for HPC centers while improving software agility for researchers running AI/ML and scientific computing workloads. Its support for Kubernetes manifests also opens the way to workflows that are portable across HPC and cloud environments.

Key Points
  • Uses unchanged Podman engine with HPC-specific layers for scheduler integration and scalable image access
  • Matches Enroot+Pyxis performance and scaling on a Cray EX GH200 system, with consistently faster per-node container startup
  • Enables direct use of upstream OCI images and Kubernetes manifests for cloud-native workflows

Why It Matters

Sarus Suite lets HPC centers use the mainstream container ecosystem without giving up production performance, and it reduces the maintenance overhead that comes with HPC-specific runtimes.