Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
Researchers achieve 0.09% error on AMD MI300A, orders of magnitude more accurate than naive roofline models
A paper by Aaron Jarmusch and Sunita Chandrasekaran introduces microbenchmark-driven analytical performance models for two modern GPU architectures: NVIDIA Blackwell B200 and AMD CDNA3 MI300A. As GPU architectures rapidly evolve with complex memory hierarchies, matrix units, and varied precision formats, the gap between theoretical peak performance and achievable throughput widens. The authors systematically characterize hardware through microbenchmarks, then build analytical models that capture key features: for Blackwell, Tensor Memory (TMEM), asynchronous bulk copy (TMA), and 5th-generation tensor cores; for CDNA3, the Infinity Cache hierarchy, VGPR register constraints, and occupancy limits.
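The core idea of microbenchmark-driven modeling can be sketched as follows: instead of plugging datasheet peaks into a performance formula, the model is parameterized with throughput numbers actually measured on the device. The snippet below is a minimal illustrative sketch, not the authors' model; all function names and numeric values are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class GpuParams:
    """Hardware parameters. In the microbenchmark-driven approach these come
    from measurements on the target GPU, not from datasheet peaks.
    The numbers used below are illustrative placeholders."""
    achievable_flops: float  # FLOP/s sustained by a compute microbenchmark
    achievable_bw: float     # bytes/s sustained by a streaming microbenchmark

def predict_kernel_time(flops: float, bytes_moved: float, hw: GpuParams) -> float:
    """Bounded-by-slowest-resource estimate: the kernel is limited by
    whichever of compute or memory movement takes longer."""
    compute_time = flops / hw.achievable_flops
    memory_time = bytes_moved / hw.achievable_bw
    return max(compute_time, memory_time)

# Hypothetical measured parameters (placeholders, not real B200/MI300A figures)
hw = GpuParams(achievable_flops=8.0e14, achievable_bw=6.0e12)
t = predict_kernel_time(flops=1.0e12, bytes_moved=4.0e9, hw=hw)  # seconds
```

The paper's models go far beyond this two-resource bound, adding terms for features like TMEM, TMA, Infinity Cache levels, and occupancy limits, but the calibration principle is the same.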
The models are validated against real kernel execution (21 kernels on B200 and 27 on MI300A), achieving mean absolute errors of just 1.31% and 0.09%, respectively. In stark contrast, naive roofline baselines on the same kernels produce errors exceeding 95%. The approach is further validated on H200 (Hopper) and MI250X (CDNA2) by simply updating bandwidth and cache parameters, showing that no major restructuring is needed. The authors plan to release all models and benchmarks as open source upon acceptance, enabling developers to accurately predict GPU kernel performance without exhaustive trial runs.
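The mean-absolute-error metric behind these comparisons is straightforward: average the per-kernel prediction error as a percentage of measured runtime. Below is a minimal sketch with hypothetical runtimes (the values are placeholders, not the paper's data), illustrating how a calibrated model can land near 1% while an over-optimistic roofline baseline lands above 90%.

```python
def mean_absolute_pct_error(predicted: list[float], measured: list[float]) -> float:
    """Mean absolute prediction error as a percentage of measured runtime."""
    errs = [abs(p - m) / m * 100.0 for p, m in zip(predicted, measured)]
    return sum(errs) / len(errs)

# Hypothetical per-kernel runtimes in milliseconds (placeholders)
measured = [1.00, 2.00, 4.00]
model    = [1.01, 1.98, 4.02]  # calibrated analytical model: small errors
roofline = [0.04, 0.10, 0.15]  # naive roofline: typically far too optimistic

mae_model = mean_absolute_pct_error(model, measured)     # under 1%
mae_naive = mean_absolute_pct_error(roofline, measured)  # above 90%
```

A naive roofline overshoots because it assumes the kernel reaches peak compute or peak bandwidth, ignoring effects like cache behavior, register pressure, and occupancy that the paper's models explicitly account for.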
- Achieves 1.31% MAE on NVIDIA Blackwell B200 and 0.09% MAE on AMD CDNA3 MI300A, vs >95% error for naive roofline models
- Captures hardware-specific features: Tensor Memory, TMA, 5th-gen tensor cores (Blackwell); Infinity Cache, VGPR constraints, occupancy (CDNA3)
- Validated on Hopper H200 and CDNA2 MI250X with only parameter updates; models will be open-sourced
Why It Matters
Enables precise performance prediction for GPU kernels, cutting development cycles and optimizing code without exhaustive benchmarking.