Generating State-of-the-Art GEMMs with TorchInductor’s CuteDSL backend
New backend cuts matrix multiplication compile times dramatically while matching CUTLASS performance.
The PyTorch team has integrated NVIDIA's CuteDSL as a new backend for TorchInductor, joining existing options such as Triton, CUTLASS, and cuBLAS for generating matrix multiplication (GEMM) kernels. CuteDSL meets PyTorch's strict integration criteria: it imposes minimal maintenance burden thanks to NVIDIA's active development, does not regress compile times, and delivers better performance on target workloads. Crucially, it solves a major pain point of the CUTLASS C++ backend: because CuteDSL kernels go through a custom Python-to-MLIR compiler rather than nvcc, compilation is dramatically faster (comparable to Triton) while the same low-level abstractions and control are retained.
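To picture how a user might opt in, here is a minimal sketch built on TorchInductor's existing max-autotune path. The `max_autotune_gemm_backends` allow-list is a real Inductor config knob, but the `CUTEDSL` entry is an assumed backend name used only for illustration.

```python
import torch
import torch._inductor.config as inductor_config

# Hypothetical opt-in sketch: max_autotune_gemm_backends is TorchInductor's
# existing allow-list of GEMM backends considered during autotuning; the
# "CUTEDSL" entry here is an assumed name for the new backend.
inductor_config.max_autotune_gemm_backends = "ATEN,TRITON,CUTEDSL"

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
out = mm(a, b)  # Inductor benchmarks candidates from each backend and keeps the fastest.
```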
This integration is strategically focused on GEMM operations, which consume the majority of GPU cycles in transformer-based LLMs. Unlike memory-bound operations, where Triton already excels, GEMMs demand precise control over tensor cores, shared memory, and newer features like thread block clusters to reach peak hardware utilization. CuteDSL enables this by starting from hand-optimized templates and exposing their tunable parameters, letting the autotuner efficiently explore many configurations without the overhead of full nvcc compilations. The backend already shows strong performance on FP8 GEMMs and epilogue fusion, making it a future-proof foundation for upcoming NVIDIA architectures.
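To make the template-plus-autotuner idea concrete, the sketch below shows the shape of such a search loop: sweep a template's tunable parameters and keep the fastest candidate. The parameter names and the plain `torch.mm` stand-in are placeholders for illustration, not CuteDSL's actual API.

```python
import itertools
import time
import torch

def benchmark(fn, iters=20):
    fn()  # warm-up, triggers any lazy compilation
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Placeholder parameter space; a real template exposes knobs such as tile
# shapes, pipeline stages, and thread block cluster sizes.
search_space = itertools.product([64, 128], [128, 256], [3, 4])

best = None
for tile_m, tile_n, stages in search_space:
    # Stand-in kernel: a real backend would instantiate the template with
    # (tile_m, tile_n, stages) and JIT it without a full nvcc invocation.
    candidate = lambda: torch.mm(a, b)
    elapsed = benchmark(candidate)
    if best is None or elapsed < best[0]:
        best = (elapsed, (tile_m, tile_n, stages))

print(f"best config {best[1]}: {best[0] * 1e3:.3f} ms/iter")
```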
- CuteDSL cuts GEMM kernel compilation time to parity with Triton, avoiding slow nvcc invocations required by CUTLASS C++.
- The backend provides the same low-level hardware control as CUTLASS but is written in Python, simplifying maintenance and tuning.
- Strategic focus on GEMMs targets the compute-heavy operations (attention projections, FFN layers) that dominate LLM inference latency.
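The epilogue fusion mentioned above can be pictured with an ordinary torch.compile example: the bias add and activation after the matmul are candidates for fusion into the GEMM's epilogue, avoiding an extra round trip through global memory. Whether this particular graph is lowered to a CuteDSL kernel depends on the autotuner's choice; the example is only illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative epilogue-fusion candidate: the bias add and GELU that follow
# the matmul can be folded into the GEMM epilogue instead of running as
# separate kernels.
@torch.compile(mode="max-autotune")
def linear_gelu(x, w, bias):
    return F.gelu(x @ w + bias)

x = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
bias = torch.randn(4096, device="cuda", dtype=torch.bfloat16)
out = linear_gelu(x, w, bias)
```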
Why It Matters
Faster compilation and better-tuned GEMM kernels directly translate to reduced latency and cost for running large language models in production.