LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
New cross-vendor analyzer finds why GPU code stalls, not just where, and the optimizations it guided achieved geometric-mean speedups of 1.73x-1.82x.
A team from Rice University led by Yuning Xia and John Mellor-Crummey has developed LEO, a novel diagnostic tool that addresses a critical gap in GPU performance optimization. While current profilers can show developers *where* in their source code a GPU stall occurs, they fail to reveal *why* it happens. LEO solves this by performing backward slicing from stalled instructions, tracing dependencies through registers and vendor-specific synchronization mechanisms on NVIDIA, AMD, and Intel GPUs. This cross-vendor capability is crucial, as the same kernel can have different stall causes and require different optimizations on different GPU architectures.
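To make the idea of backward slicing concrete, here is a minimal sketch in Python. It is not LEO's implementation: the `Instr` record, the opcode names, and the `LONG_LATENCY` set are all hypothetical, and the sketch handles only straight-line code with register dependencies. It ignores control flow, predication, and the vendor-specific synchronization mechanisms described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record; index must equal its position in the program list."""
    index: int
    opcode: str
    defs: tuple = ()  # registers written
    uses: tuple = ()  # registers read


def backward_slice(program, stalled_index):
    """Return indices of earlier instructions that the stalled instruction
    depends on through registers (straight-line code only)."""
    needed = set(program[stalled_index].uses)
    slice_indices = []
    # Walk backward so the nearest preceding writer of a register is found first.
    for instr in reversed(program[:stalled_index]):
        written = needed.intersection(instr.defs)
        if written:
            slice_indices.append(instr.index)
            needed -= written          # this definition satisfies the need
            needed |= set(instr.uses)  # its inputs are now needed instead
    return sorted(slice_indices)


# Assumed set of long-latency opcodes, for illustration only.
LONG_LATENCY = {"ld.global"}

program = [
    Instr(0, "mov", defs=("r0",)),
    Instr(1, "ld.global", defs=("r1",), uses=("r0",)),
    Instr(2, "mov", defs=("r2",)),
    Instr(3, "mul", defs=("r5",), uses=("r2", "r2")),
    Instr(4, "fadd", defs=("r4",), uses=("r1", "r0")),  # the stalled instruction
]

slice_indices = backward_slice(program, stalled_index=4)
root_causes = [program[i].opcode for i in slice_indices
               if program[i].opcode in LONG_LATENCY]
```

In this toy program the slice from the stalled `fadd` contains the address computation and the global load, while the unrelated `mov`/`mul` pair is excluded. Filtering the slice for long-latency opcodes points at the load as the likely reason for the stall, which is the kind of "why" a location-only profile would not show.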
In tests across 21 diverse workloads on three GPU platforms, optimizations guided by LEO's root-cause analysis delivered geometric-mean speedups of 1.73x to 1.82x. The research also shows that LEO's structured diagnostic output improves code optimization by large language models (LLMs) compared with giving the model only the code or only raw stall counts. This positions LEO not just as a standalone tool for expert developers, but as a potential component in an AI-assisted optimization pipeline, helping both humans and LLMs make architecture-aware performance improvements.
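Because the headline figures are geometric means, it helps to see how such a number is computed. The sketch below uses made-up per-workload speedups, not data from the study, and the function name is chosen here for illustration.

```python
import math


def geometric_mean(speedups):
    """Geometric mean of positive ratios, computed via the mean of their logs."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-workload speedups, for illustration only.
example_speedups = [1.1, 1.5, 4.0]

geo = geometric_mean(example_speedups)
arith = sum(example_speedups) / len(example_speedups)
```

For ratios such as speedups, the geometric mean is never larger than the arithmetic mean, so one workload with a very large gain inflates it less. It also means a geometric-mean figure is a summary across workloads, not a ceiling: individual workloads can land above or below it.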
- Performs backward slicing to trace GPU stall root causes across NVIDIA, AMD, and Intel architectures.
- Guided optimizations that achieved geometric-mean speedups of 1.73x-1.82x across 21 tested workloads on three GPU platforms.
- Its structured diagnostics improved code optimization with LLMs versus code-only or raw-stall-count baselines.
Why It Matters
LEO provides the 'why' behind GPU slowdowns, enabling targeted fixes that achieved geometric-mean speedups of 1.73x to 1.82x across NVIDIA, AMD, and Intel hardware.