PyTorch adds 128x256x64 matmul config boosting H100 performance up to 1.78x

PyTorch Releases May 15, 2026

⚡ New CUDA GEMM configuration delivers a 1.34x median speedup on a large target shape, with no regressions observed on smaller shapes.

Deep Dive

PyTorch’s latest commit to its Inductor compiler adds a CUDA matrix multiplication configuration optimized for Hopper-architecture GPUs such as the H100. The new config, GemmConfig(128, 256, 64, 4, 8), uses a 128x256x64 tile (block M x block N x block K) with 4 software-pipeline stages (num_stages) and 8 warps (num_warps). It is designed to better exploit the larger shared memory and compute capacity of Hopper-class GPUs than the existing set of 20 configs, which tended to under-explore larger M-tile shapes. The config was added to the CUDAConfigHeuristic.mm_configs list. Because autotuning picks the fastest config for each shape, smaller or irregular shapes continue to use the existing alternatives.
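To make the shared-memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes the five arguments are block_m, block_n, block_k, num_stages, num_warps in that order, and that a pipelined kernel keeps one A tile and one B tile in shared memory per stage. TileConfig and smem_bytes are hypothetical stand-ins written for this illustration, not Inductor's own classes, and the per-block limits are the 227 KB (compute capability 9.0) and 163 KB (compute capability 8.0) figures from NVIDIA's CUDA documentation. Real kernels may allocate somewhat differently, so treat the result as an estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical stand-in mirroring the five positional fields of the config."""
    block_m: int
    block_n: int
    block_k: int
    num_stages: int
    num_warps: int


def smem_bytes(cfg: TileConfig, dtype_bytes: int = 2) -> int:
    """Estimate shared memory: one A tile plus one B tile per pipeline stage."""
    a_tile = cfg.block_m * cfg.block_k  # elements of A staged per iteration
    b_tile = cfg.block_k * cfg.block_n  # elements of B staged per iteration
    return (a_tile + b_tile) * dtype_bytes * cfg.num_stages


# Maximum shared memory per thread block, in bytes (from CUDA documentation).
H100_LIMIT = 227 * 1024  # compute capability 9.0
A100_LIMIT = 163 * 1024  # compute capability 8.0

new_cfg = TileConfig(128, 256, 64, 4, 8)
need = smem_bytes(new_cfg)  # fp16, so 2 bytes per element

print(f"estimated shared memory: {need // 1024} KiB")
print(f"fits on H100: {need <= H100_LIMIT}")
print(f"fits on A100: {need <= A100_LIMIT}")
```

Under these assumptions the tile needs about 192 KiB, which is within the Hopper per-block limit but above the Ampere one. That is consistent with the article's framing of the config as Hopper-specific.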

The performance impact is significant for the large matrix multiplications common in transformer training and other deep learning workloads. Across six production shapes, with N ranging from 256 to 16384 and K from 512 to 3072, the new config delivers speedups between 1.16x and 1.78x wherever autotune selects it. On a target shape of M=8192, N=6144, K=3072 (fp16), a controlled test on an isolated H100 SXM ran 15 trials; the new config won all 15, with a median speedup of 1.343x. An M-sweep from 1024 to 32768 showed stable 1.30–1.35x gains for M ≥ 2048, while at M=1024 autotune correctly fell back to an existing config, so there was no regression. Compile time rises by roughly 5%, since each autotune session evaluates one extra candidate on top of the existing 20, a small price for the runtime gains.
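The no-regression argument follows from how per-shape selection works: the tuner keeps whichever candidate measures fastest, so adding a candidate can only lower or preserve the best measured time. The sketch below illustrates that idea only. pick_best is a hypothetical helper, the two "existing" configs are stand-ins, and the latencies are placeholder numbers chosen for illustration, not measurements from the article.

```python
from typing import Callable, Dict, Sequence, Tuple

# (block_m, block_n, block_k, num_stages, num_warps), field order assumed
Config = Tuple[int, int, int, int, int]


def pick_best(candidates: Sequence[Config], bench: Callable[[Config], float]) -> Config:
    """Return the candidate with the lowest benchmarked latency."""
    return min(candidates, key=bench)


NEW: Config = (128, 256, 64, 4, 8)
# Hypothetical stand-ins for two of the existing configs.
EXISTING = [(64, 64, 32, 2, 4), (128, 128, 32, 3, 4)]

# Placeholder latencies in milliseconds; NOT real measurements.
LARGE_SHAPE: Dict[Config, float] = {EXISTING[0]: 3.0, EXISTING[1]: 2.0, NEW: 1.5}
SMALL_SHAPE: Dict[Config, float] = {EXISTING[0]: 0.10, EXISTING[1]: 0.12, NEW: 0.20}

for name, table in (("large", LARGE_SHAPE), ("small", SMALL_SHAPE)):
    before = pick_best(EXISTING, table.__getitem__)
    after = pick_best(EXISTING + [NEW], table.__getitem__)
    # The enlarged pool can never select something slower than the old winner.
    assert table[after] <= table[before]
    print(f"{name}: picked {after}, speedup {table[before] / table[after]:.2f}x")

# One extra candidate on top of 20 existing ones is 5% more benchmarking work.
extra_fraction = 1 / 20
print(f"extra candidates: {extra_fraction:.0%}")
```

In practice, benchmark noise means the guarantee holds for measured rather than true latency, which is one reason the article's controlled test repeats the comparison over 15 trials.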

Key Points

New GemmConfig(128, 256, 64, 4, 8) added to PyTorch Inductor’s CUDA autotuner for Hopper GPUs.
Delivers 1.16x–1.78x speedups on large matmuls and a median 1.343x speedup on a target shape (M=8192, N=6144, K=3072, fp16).
Per-shape autotuning avoids regressions on small or odd shapes; the compile-time cost is one extra candidate per session, about 5% more.

Why It Matters

Faster matrix multiplications on H100 GPUs accelerate large-scale AI training and inference, with no regressions observed on the tested shapes.
