Research & Papers

Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO

New simulation method uses MLIR's StableHLO to model distributed training without costly physical hardware.

Deep Dive

A team of researchers from academic and industry backgrounds has published a paper evaluating whether MLIR's StableHLO dialect can serve as a unified foundation for predicting the performance of distributed machine learning workloads across different hardware architectures. The central challenge they address is twofold: existing GPU and TPU simulators are typically architecture-specific, and distributed training simulators rely on workload-specific analytical models or costly post-execution traces. Both factors limit portability and cross-platform comparison.

Their methodology establishes a StableHLO-based simulation approach that maps a single workload representation onto multiple performance models, spanning analytical, profiling-based, and simulator-driven predictors. This enables researchers and engineers to evaluate workloads across GPUs and TPUs without requiring access to expensive, scaled-out physical systems. The empirical evaluation covered distributed GEMM kernels, ResNet, and large language model training workloads, demonstrating that StableHLO preserves relative performance trends across architectures while exposing accuracy trade-offs and simulator limitations.
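As a sketch of the analytical end of this predictor spectrum, the snippet below implements a minimal roofline-style cost model over per-op FLOP and byte counts of the kind one could derive from a StableHLO module. The op name, device numbers, and cost formula are illustrative assumptions for this article, not the paper's actual predictors.

```python
from dataclasses import dataclass

@dataclass
class DeviceSpec:
    """Hypothetical accelerator parameters (illustrative, not vendor figures)."""
    peak_flops: float     # peak FLOP/s
    mem_bandwidth: float  # bytes/s

@dataclass
class OpProfile:
    """Per-op counts, as might be extracted from a StableHLO op."""
    name: str
    flops: float
    bytes_moved: float

def roofline_time(op: OpProfile, dev: DeviceSpec) -> float:
    """Roofline estimate: an op is bound by compute or memory, whichever is slower."""
    return max(op.flops / dev.peak_flops, op.bytes_moved / dev.mem_bandwidth)

def predict_runtime(ops: list[OpProfile], dev: DeviceSpec) -> float:
    """Sum per-op lower bounds, ignoring overlap and kernel-launch overhead."""
    return sum(roofline_time(op, dev) for op in ops)

# A toy GEMM: C = A @ B with M = N = K = 4096, 2-byte (fp16) operands.
M = N = K = 4096
gemm = OpProfile("stablehlo.dot_general", flops=2 * M * N * K,
                 bytes_moved=2 * (M * K + K * N + M * N))

gpu = DeviceSpec(peak_flops=300e12, mem_bandwidth=2.0e12)
print(f"predicted GEMM time: {predict_runtime([gemm], gpu) * 1e6:.1f} us")
```

A profiling-based or simulator-driven predictor would slot into the same interface: replace `roofline_time` with a lookup into measured kernel timings or a call into a cycle-level simulator, while the workload description stays fixed.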

The results show that prediction errors remain within practical bounds for early-stage design exploration, and the methodology reveals fidelity-dependent limitations in existing GPU simulators. This indicates that StableHLO provides a viable foundation for unified, distributed ML performance modeling across accelerator architectures, supporting reusable evaluation workflows and cross-validation throughout the ML system design process. The approach could significantly reduce the cost and time required for hardware selection and system optimization in large-scale AI training scenarios.

Key Points
  • Uses MLIR's StableHLO dialect as unified workload representation for cross-architecture modeling
  • Enables performance prediction across GPUs and TPUs without scaled physical hardware
  • Empirical evaluation shows prediction errors within practical bounds for design exploration

Why It Matters

Reduces hardware evaluation costs for AI training, enabling better architecture selection and system optimization.