PyTorch's Triton backward conv kernels boost small-batch GPU performance by up to 20%

PyTorch Releases May 13, 2026

⚡ New Triton kernels for the convolution backward pass replace the ATen fallback, improving small-batch workloads by up to 20% on AMD MI325.

Deep Dive

A PyTorch trunk commit (trunk/9c6b1aa) introduces Triton kernels for two 2D convolution backward operations, conv2d_bwd_weight and conv2d_bwd_input. They replace the previous ATen-only fallback in TorchInductor. The PR adds a convolution_backward_lowering function with backend selection logic, new layout computation functions, and two configuration options: max_autotune_conv_bwd_weight_backends and max_autotune_conv_bwd_input_backends.

The implementation targets small-batch workloads, where the new kernels achieve up to a 20% end-to-end performance improvement on AMD MI325 GPUs. Of that 20%, about 3 percentage points come from shorter kernel execution time and about 17 from reduced bubble time, that is, GPU idle time between kernels. In single-operator benchmarks, the Triton kernels outperform ATen by up to 4x in specific cases. For example, backward weight with a 3x112x1x1 input and a 3x448x1x1 grad_out takes 0.0062 ms with Triton versus 0.025 ms with ATen.

The PR adds test coverage through test_conv2d_backward_parametrized, which exercises stride, dilation, padding, kernel sizes, groups, and NHWC format, and it removes test_conv2d_backward_channels_last_dynamic_shapes from the list of failing tests. The pull request was approved by PyTorch core maintainers jansel and eellison.
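To make the backward-weight operation concrete, here is a minimal pure-Python reference for what a conv2d weight gradient computes. It is only an illustration and not the Triton kernel from the PR: the function name conv2d_bwd_weight_reference is hypothetical, it assumes groups=1, NCHW nested lists, and a single integer each for stride, padding, and dilation. The kernel size is also an assumption, since the article gives only the input and grad_out shapes.

```python
def conv2d_bwd_weight_reference(x, grad_out, kernel_hw, stride=1, padding=0, dilation=1):
    """Naive weight gradient of a groups=1 conv2d (illustrative, hypothetical helper).

    x        : nested lists shaped [N][C_in][H][W]
    grad_out : nested lists shaped [N][C_out][H_out][W_out]
    returns  : nested lists shaped [C_out][C_in][K_h][K_w]
    """
    n_batch, c_in = len(x), len(x[0])
    h_in, w_in = len(x[0][0]), len(x[0][0][0])
    c_out = len(grad_out[0])
    h_out, w_out = len(grad_out[0][0]), len(grad_out[0][0][0])
    k_h, k_w = kernel_hw

    grad_w = [[[[0.0] * k_w for _ in range(k_h)] for _ in range(c_in)]
              for _ in range(c_out)]

    for n in range(n_batch):
        for o in range(c_out):
            for oh in range(h_out):
                for ow in range(w_out):
                    g = grad_out[n][o][oh][ow]
                    for c in range(c_in):
                        for kh in range(k_h):
                            # Input row that kernel row kh touched for output row oh.
                            ih = oh * stride - padding + kh * dilation
                            if not 0 <= ih < h_in:
                                continue  # position fell in the zero padding
                            for kw in range(k_w):
                                iw = ow * stride - padding + kw * dilation
                                if 0 <= iw < w_in:
                                    grad_w[o][c][kh][kw] += g * x[n][c][ih][iw]
    return grad_w


# Tiny usage example: one 2x2 single-channel image, 1x1 kernel, all-ones grad_out.
x = [[[[1.0, 2.0], [3.0, 4.0]]]]
grad_out = [[[[1.0, 1.0], [1.0, 1.0]]]]
grad_w = conv2d_bwd_weight_reference(x, grad_out, kernel_hw=(1, 1))
print(grad_w)  # [[[[10.0]]]]
```

With the benchmark shapes quoted above (3x112x1x1 input, 3x448x1x1 grad_out) and an assumed 1x1 kernel, the same function returns a 448x112x1x1 gradient. At that size there is little arithmetic per launch, so fixed per-kernel overhead is likely to dominate the runtime.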

Key Points

Replaces ATen-only fallback with custom Triton kernels for 2D conv backward weight/input operations in TorchInductor
Up to 20% end-to-end performance improvement on AMD MI325 GPU for small-batch workloads, with 3% from kernel speed and 17% from bubble reduction
Single-operator benchmarks show Triton kernels up to 4x faster than ATen (e.g., 0.0062 ms vs 0.025 ms for bwd_weight with specific shapes)
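
The figures in these points can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article and assumes that the 3% and 17% contributions are additive percentage points of the 20% end-to-end gain, which is an interpretation rather than something the source states outright.

```python
# Single-operator timings quoted for bwd_weight (milliseconds).
aten_ms = 0.025
triton_ms = 0.0062

speedup = aten_ms / triton_ms
print(f"Triton speedup over ATen: {speedup:.2f}x")  # 4.03x

# End-to-end breakdown, assumed to be additive percentage points.
kernel_gain_pct = 3
bubble_gain_pct = 17
total_gain_pct = kernel_gain_pct + bubble_gain_pct

# Share of the end-to-end gain attributed to bubble reduction.
bubble_share_pct = round(100 * bubble_gain_pct / total_gain_pct)
print(f"Bubble reduction share of total gain: {bubble_share_pct}%")  # 85%
```

Under that assumption, most of the end-to-end improvement comes from bubble reduction rather than from faster kernels.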

Why It Matters

Faster convolution backward passes shorten training time for vision models, which benefits practitioners who train convolutional networks with PyTorch.
