Research & Papers

Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO

New simulation method uses MLIR's StableHLO to model distributed training without costly physical hardware.

Deep Dive

A team of researchers from academic and industry backgrounds has published a paper evaluating whether MLIR's StableHLO dialect can serve as a unified foundation for predicting the performance of distributed machine learning workloads across different hardware architectures. The central challenge they address is twofold: existing GPU and TPU simulators are typically architecture-specific, and distributed training simulators rely on workload-specific analytical models or costly post-execution traces. Both factors limit portability and cross-platform comparison.

Their methodology establishes a StableHLO-based simulation approach that maps a single workload representation onto multiple performance models, spanning analytical, profiling-based, and simulator-driven predictors. This enables researchers and engineers to evaluate workloads across GPUs and TPUs without requiring access to expensive, scaled-out physical systems. The empirical evaluation covered distributed GEMM kernels, ResNet, and large language model training workloads, demonstrating that StableHLO preserves relative performance trends across architectures while exposing accuracy trade-offs and simulator limitations.
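As a sketch of the analytical end of this predictor spectrum, the snippet below implements a minimal roofline-style cost model over per-op FLOP and byte counts of the kind one could derive from a StableHLO module. The op name, device numbers, and cost formula are illustrative assumptions for this article, not the paper's actual predictors.

```python
from dataclasses import dataclass

@dataclass
class DeviceSpec:
    """Hypothetical accelerator parameters (illustrative, not vendor figures)."""
    peak_flops: float     # peak FLOP/s
    mem_bandwidth: float  # bytes/s

@dataclass
class OpProfile:
    """Per-op counts, as might be extracted from a StableHLO op."""
    name: str
    flops: float
    bytes_moved: float

def roofline_time(op: OpProfile, dev: DeviceSpec) -> float:
    """Roofline estimate: an op is bound by compute or memory, whichever is slower."""
    return max(op.flops / dev.peak_flops, op.bytes_moved / dev.mem_bandwidth)

def predict_runtime(ops: list[OpProfile], dev: DeviceSpec) -> float:
    """Sum per-op lower bounds, ignoring overlap and kernel-launch overhead."""
    return sum(roofline_time(op, dev) for op in ops)

# A toy GEMM: C = A @ B with M = N = K = 4096, 2-byte (fp16) operands.
M = N = K = 4096
gemm = OpProfile("stablehlo.dot_general", flops=2 * M * N * K,
                 bytes_moved=2 * (M * K + K * N + M * N))

gpu = DeviceSpec(peak_flops=300e12, mem_bandwidth=2.0e12)
print(f"predicted GEMM time: {predict_runtime([gemm], gpu) * 1e6:.1f} us")
```

A profiling-based or simulator-driven predictor would slot into the same interface: replace `roofline_time` with a lookup into measured kernel timings or a call into a cycle-level simulator, while the workload description stays fixed.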

The results show that prediction errors remain within practical bounds for early-stage design exploration, and the methodology reveals fidelity-dependent limitations in existing GPU simulators. This indicates that StableHLO provides a viable foundation for unified, distributed ML performance modeling across accelerator architectures, supporting reusable evaluation workflows and cross-validation throughout the ML system design process. The approach could significantly reduce the cost and time required for hardware selection and system optimization in large-scale AI training scenarios.

Key Points
  • Uses MLIR's StableHLO dialect as unified workload representation for cross-architecture modeling
  • Enables performance prediction across GPUs and TPUs without scaled physical hardware
  • Empirical evaluation shows prediction errors within practical bounds for design exploration

Why It Matters

Reduces hardware evaluation costs for AI training, enabling better architecture selection and system optimization.