[Inductor] Add depthwise conv1d triton template (#175280)
A new compiler optimization in PyTorch delivers speedups of up to 57x for a key 1D convolution operation.
The PyTorch team has merged a significant performance optimization into its core framework. Pull request #175280, titled '[Inductor] Add depthwise conv1d triton template,' introduces a new, highly efficient kernel template written in Triton, a language for writing GPU kernels, specifically for depthwise 1D convolutions in PyTorch's Inductor compiler backend.
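To see where the template applies: in PyTorch, a depthwise convolution is an ordinary nn.Conv1d whose groups argument equals its channel count, and torch.compile routes such ops through Inductor. The following is a minimal sketch (shapes borrowed from the benchmark below; whether a given call actually selects the new Triton template depends on Inductor's autotuning on the machine at hand):

```python
import torch
import torch.nn as nn

# A depthwise conv1d is a Conv1d with groups == channels: each channel
# gets its own filter. The channel count and length below match the
# PR's benchmark shape.
channels = 128
dw = nn.Conv1d(channels, channels, kernel_size=3,
               groups=channels, padding=1).cuda()

compiled_dw = torch.compile(dw)  # Inductor is torch.compile's default backend

x = torch.randn(3072, channels, 202, device="cuda")
out = compiled_dw(x)
print(out.shape)  # torch.Size([3072, 128, 202])
```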
The technical results are striking. Benchmark tests on a tensor of size 3072×128×202 with a kernel size of 3 and 128 groups show the new implementation ('inductor_dw1d_CL') achieving a runtime of 0.1809ms. This represents a 57x speedup over the previous implementation, which took 10.3592ms. Another configuration ('inductor_dw1d_CF') saw an 11x improvement, dropping from 6.5351ms to 0.5781ms. The new template's performance is competitive with, and in some layouts surpasses, the highly optimized cuDNN library from NVIDIA.
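Those figures come from the PR itself; a rough eager-versus-compiled comparison can be set up along the following lines. This is an illustrative sketch, not the PR's benchmark script, and absolute timings will vary by GPU and PyTorch build:

```python
import torch
import torch.nn as nn
from torch.utils.benchmark import Timer

channels = 128
conv = nn.Conv1d(channels, channels, kernel_size=3, groups=channels).cuda()
x = torch.randn(3072, channels, 202, device="cuda")

compiled = torch.compile(conv)
compiled(x)  # warm-up call so compilation cost is excluded from the timing

for label, fn in [("eager", conv), ("compiled", compiled)]:
    # torch.utils.benchmark.Timer handles CUDA synchronization for us
    t = Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
    print(label, t.timeit(100))
```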
This optimization matters because depthwise separable convolutions are a fundamental building block of modern neural network architectures, particularly those designed for efficiency. The 1D variant is critical for processing sequential data such as audio waveforms, time-series sensor data, and text embeddings. By drastically reducing the computation time for this operation, PyTorch directly accelerates training and inference pipelines for a wide range of models in fields like speech recognition, natural language processing, and financial forecasting. The improvement lowers the barrier for experimentation and deployment, saving both time and cloud compute costs for developers and researchers.
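For readers unfamiliar with the term: a depthwise separable convolution factors a standard convolution into a per-channel (depthwise) stage followed by a 1x1 (pointwise) channel-mixing stage, which is what makes it cheap. A minimal 1D version, with hypothetical names, might look like this:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Hypothetical example: factor a Conv1d into depthwise + pointwise.

    The depthwise stage (groups == in_channels) applies one filter per
    channel and is the op the new template targets; the 1x1 pointwise
    stage then mixes information across channels.
    """
    def __init__(self, in_channels, out_channels, kernel_size, padding=0):
        super().__init__()
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels)
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv1d(128, 256, kernel_size=3, padding=1)
y = block(torch.randn(8, 128, 202))  # (batch, channels, sequence length)
print(y.shape)  # torch.Size([8, 256, 202])
```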
- Delivers up to 57x speedup for depthwise 1D convolutions, cutting time from 10.36ms to 0.18ms.
- Implemented as a new Triton kernel template in PyTorch's Inductor JIT compiler (PR #175280).
- Directly benefits models for audio, time-series, and text processing that rely on efficient 1D convolutions.
Why It Matters
Faster model training and inference for sequential data tasks, reducing experiment time and cloud compute costs.