SOTA Normalization Performance with torch.compile
New optimizations bring torch.compile to parity with specialized Quack kernels on H100 and B200 GPUs.
The PyTorch team has achieved a significant performance breakthrough with torch.compile, bringing its execution speed for normalization layers to parity with specialized, hand-optimized kernels. By addressing several key bottlenecks, they've enabled torch.compile to match the state-of-the-art performance of Quack—a library of hyper-optimized CuteDSL kernels developed by Tri Dao. Previously, torch.compile operated at roughly 50% of Quack's performance for these critical operations, which are foundational to training modern deep learning models.
Key optimizations included fixing autotune configuration defaults for H100 and B200 GPUs, scaling up the inner reduction block size (RBLOCK), and adjusting the number of warps to maximize vectorization. The team also resolved issues with automatic dynamic shape inference that were causing suboptimal kernel generation. These changes are particularly impactful for memory-bound workloads, where saturating peak memory bandwidth is essential, and their effect is most pronounced on Blackwell-architecture GPUs like the B200 because of their higher memory bandwidth.
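To make those knobs concrete, here is a minimal Triton sketch of a row-wise RMSNorm reduction with the reduction block size (RBLOCK) and the warp count exposed as autotune candidates. This illustrates the tuning surface, not the code Inductor actually emits; the kernel, the config values, and the `rmsnorm` wrapper are hypothetical, and the kernel assumes each row fits in a single block.

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        # Larger RBLOCK with more warps tends to help saturate bandwidth on
        # memory-rich parts like B200; the defaults Inductor picks are internal.
        triton.Config({"RBLOCK": 1024}, num_warps=4),
        triton.Config({"RBLOCK": 2048}, num_warps=8),
        triton.Config({"RBLOCK": 4096}, num_warps=16),
    ],
    key=["N"],
)
@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, N, eps, RBLOCK: tl.constexpr):
    # One program per row; assumes N <= RBLOCK (a single-block row reduction).
    # Production kernels loop over the reduction dimension when N is larger.
    row = tl.program_id(0)
    cols = tl.arange(0, RBLOCK)
    mask = cols < N
    x = tl.load(x_ptr + row * N + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: x / sqrt(mean(x^2) + eps), scaled by the learned weight.
    rms = tl.sqrt(tl.sum(x * x, axis=0) / N + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * N + cols, (x / rms) * w, mask=mask)

def rmsnorm(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    M, N = x.shape
    out = torch.empty_like(x)
    rmsnorm_kernel[(M,)](x, w, out, N, eps)  # RBLOCK/num_warps chosen by autotune
    return out
```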
The benchmark results demonstrate that torch.compile now performs on par with Quack across a range of common tensor shapes, including those with large batch sizes (M) and small feature dimensions (N). While minor regressions exist for non-power-of-two dimensions and very large feature sizes on H100, the overall performance parity represents a major step forward. This eliminates the need for developers to manually integrate specialized kernel libraries for peak normalization performance, streamlining the deep learning development workflow within the PyTorch ecosystem.
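To sanity-check the parity claim on your own hardware, one minimal approach is to compile a stock normalization module with max-autotune and time it. The shape below (large M, moderate N) is illustrative rather than taken from the published benchmark suite.

```python
import torch
from torch.utils.benchmark import Timer

norm = torch.nn.RMSNorm(4096, device="cuda", dtype=torch.bfloat16)
compiled = torch.compile(norm, mode="max-autotune")

x = torch.randn(32768, 4096, device="cuda", dtype=torch.bfloat16)
compiled(x)  # warm up, so compilation and autotuning stay out of the timed region

# blocked_autorange handles CUDA synchronization for the measurement.
t = Timer(stmt="compiled(x)", globals={"compiled": compiled, "x": x})
print(t.blocked_autorange())
```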
- torch.compile now matches Quack's SOTA performance for LayerNorm/RMSNorm kernels on H100 and B200 GPUs
- Optimizations include better autotune defaults, increased RBLOCK size, and improved warp configuration for peak vectorization
- Fixes dynamic shape handling issues that previously caused torch.compile to assume suboptimal kernel configurations (see the sketch after this list)
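When automatic dynamic-shape inference still works against a workload, torch.compile exposes explicit controls. A minimal sketch, assuming the batch dimension M is the only one that varies (the variable names are illustrative):

```python
import torch

norm = torch.nn.RMSNorm(4096, device="cuda", dtype=torch.bfloat16)

# Option 1: disable automatic dynamism so each shape gets a specialized kernel.
specialized = torch.compile(norm, dynamic=False, mode="max-autotune")

# Option 2: keep one compiled artifact but tell the compiler which dimension
# actually varies, so the rest stay static and fully optimized.
x = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)
torch._dynamo.mark_dynamic(x, 0)  # dim 0 (M) is dynamic; N stays static
flexible = torch.compile(norm)
flexible(x)
```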
Why It Matters
Eliminates the performance gap between PyTorch's compiler and specialized kernels, simplifying high-performance model development.