PyTorch's Triton backward conv kernels boost small-batch GPU performance by up to 20%
New Triton kernels for the convolution backward pass replace the ATen fallback, improving small-batch workloads by up to 20% on AMD MI325.
A recent PyTorch commit (trunk/9c6b1aa) introduces Triton kernels for the 2D convolution backward operations conv2d_bwd_weight and conv2d_bwd_input. They replace the previous ATen-only fallback in TorchInductor. The PR adds a convolution_backward_lowering function with backend selection logic, new layout computation functions, and two configuration options: max_autotune_conv_bwd_weight_backends and max_autotune_conv_bwd_input_backends.
The implementation targets small-batch workloads, where the new kernels achieve up to a 20% end-to-end performance improvement on AMD MI325 GPUs. About 3 percentage points of that gain come from reduced kernel execution time, and about 17 come from reduced bubble time. In single-operator benchmarks, the Triton kernels outperform ATen by up to about 4x in specific scenarios. For example, the backward weight kernel with a 3x112x1x1 input and a 3x448x1x1 grad_out takes 0.0062 ms versus 0.025 ms for ATen.
The PR adds test coverage through test_conv2d_backward_parametrized, which covers stride, dilation, padding, kernel sizes, groups, and NHWC format, and it removes test_conv2d_backward_channels_last_dynamic_shapes from the list of failing tests. The pull request was approved by PyTorch core maintainers jansel and eellison.
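To make the benchmarked case concrete, here is a minimal pure-Python sketch of what a backward weight computation produces for those shapes. It is not the Triton kernel from the PR. It assumes an NCHW layout (batch 3, 112 input channels, 448 output channels, 1x1 spatial size) and a 1x1 kernel with stride 1 and no padding, none of which the summary states explicitly; the function name conv2d_bwd_weight_1x1 is hypothetical.

```python
import random


def conv2d_bwd_weight_1x1(inp, grad_out):
    """Reference weight gradient for a 1x1 convolution on 1x1 feature maps.

    inp:      N rows of C_in values  (the N x C_in x 1 x 1 input, flattened)
    grad_out: N rows of C_out values (the N x C_out x 1 x 1 grad_out, flattened)
    returns:  C_out x C_in nested list, where
              grad_w[o][c] = sum over the batch of grad_out[n][o] * inp[n][c]
    """
    c_in = len(inp[0])
    c_out = len(grad_out[0])
    grad_w = [[0.0] * c_in for _ in range(c_out)]
    for x_row, g_row in zip(inp, grad_out):
        for o in range(c_out):
            g = g_row[o]
            for c in range(c_in):
                grad_w[o][c] += g * x_row[c]
    return grad_w


# Shapes assumed from the benchmark description: batch 3, 112 -> 448 channels.
random.seed(0)
N, C_IN, C_OUT = 3, 112, 448
x = [[random.uniform(-1, 1) for _ in range(C_IN)] for _ in range(N)]
gy = [[random.uniform(-1, 1) for _ in range(C_OUT)] for _ in range(N)]

gw = conv2d_bwd_weight_1x1(x, gy)
print(len(gw), len(gw[0]))  # 448 112
```

Under these assumptions the operation reduces to a small matrix product over a batch dimension of only 3, which fits the summary's focus on small-batch workloads.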
- Replaces ATen-only fallback with custom Triton kernels for 2D conv backward weight/input operations in TorchInductor
- Up to 20% end-to-end performance improvement on AMD MI325 GPUs for small-batch workloads, with about 3 percentage points from faster kernels and 17 from reduced bubble time
- Single-operator benchmarks show Triton kernels up to about 4x faster than ATen (e.g., 0.0062 ms vs 0.025 ms for bwd_weight with a 3x112x1x1 input and 3x448x1x1 grad_out)
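The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and reading "bubble time" as time not spent executing kernels is an assumption on our part.

```python
# Single-operator timings quoted for the backward weight example (milliseconds).
triton_ms = 0.0062
aten_ms = 0.025

speedup = aten_ms / triton_ms
print(f"single-op speedup: {speedup:.2f}x")  # 4.03x

# End-to-end breakdown quoted for small-batch workloads on MI325,
# in percentage points of the total improvement.
kernel_points = 3
bubble_points = 17
total_points = kernel_points + bubble_points

bubble_share = bubble_points / total_points
print(f"total improvement: {total_points}%")          # 20%
print(f"share from bubble time: {bubble_share:.0%}")  # 85%
```

By these figures, most of the end-to-end gain comes from reduced bubble time rather than from faster kernel execution.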
Why It Matters
Faster convolution backward passes directly speed up training of convolutional vision models, which matters to PyTorch practitioners running small-batch workloads.