Developer Tools

trunk/2a7516cff86604b7cf40efc667a8542e68597322: test_cutlass_backend: fix fp8 fast-accum handling on sm100+ (#178378)

A critical fix for NVIDIA's new Blackwell GPUs was developed using OpenAI's GPT-5.4 model.

Deep Dive

A recent commit to the PyTorch open-source framework, developed with the assistance of OpenAI's GPT-5.4, resolves a critical hardware compatibility bug. The issue prevented PyTorch's CUTLASS backend, which generates high-performance GPU matrix-multiplication kernels from NVIDIA's CUTLASS template library, from functioning correctly on NVIDIA's next-generation Blackwell architecture (compute capability SM100 and newer). Specifically, the bug incorrectly filtered out all valid kernels for FP8 rowwise scaled matrix multiplication, a key operation for efficient AI model training, by applying a kernel-naming convention specific to the previous Hopper (H100) architecture.
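The failure mode can be illustrated with a minimal sketch. This is not the actual PyTorch code, and the kernel names and function names below are hypothetical; it only shows how a Hopper-specific "fastaccum" substring filter, applied unconditionally, rejects every Blackwell kernel candidate:

```python
def filter_fp8_kernels(names, use_fast_accum):
    """Buggy pre-fix behavior (sketch): treat the Hopper 'fastaccum'
    naming convention as a hard filter on every architecture."""
    if use_fast_accum:
        return [n for n in names if "fastaccum" in n]
    return [n for n in names if "fastaccum" not in n]


def filter_fp8_kernels_fixed(names, use_fast_accum, arch):
    """Post-fix behavior (sketch): only Hopper (sm90) kernel names
    encode fast-accum, so on SM100+ the flag no longer filters names."""
    if arch >= 100:
        return list(names)
    return filter_fp8_kernels(names, use_fast_accum)


# Hypothetical kernel names for illustration only.
hopper = ["gemm_f8_fastaccum_128x128", "gemm_f8_128x128"]
blackwell = ["sm100_gemm_f8_128x256", "sm100_gemm_f8_256x128"]

# On Hopper the filter behaves as intended: one candidate survives.
assert filter_fp8_kernels(hopper, True) == ["gemm_f8_fastaccum_128x128"]
# On Blackwell the unconditional filter discards every valid kernel...
assert filter_fp8_kernels(blackwell, True) == []
# ...while the fixed version keeps the full candidate list.
assert filter_fp8_kernels_fixed(blackwell, True, arch=100) == blackwell
```

With zero surviving candidates, kernel selection has nothing to choose from, which is why the operation could not run at all on SM100+ hardware.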

The fix addresses two core problems. First, it stops treating the `use_fast_accum=True` flag as a hard filter on kernel names for SM100+ GPUs, since the "fastaccum" naming pattern is Hopper-specific. Second, it modifies the AOTInductor (AOTI) ahead-of-time compilation system so that the output path hash depends on the full portable configuration state, preventing the reuse of stale, incorrectly compiled artifacts. The commit also updates the associated test suite to no longer expect the old naming convention on Blackwell-class hardware.
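The second problem, keying the artifact path on the full configuration state, can be sketched in a few lines. This is a hypothetical illustration, not the actual AOTI code: the idea is simply that any change to the configuration (such as flipping `use_fast_accum`) must produce a different output path, so a previously compiled artifact for a different configuration can never be silently reused.

```python
import hashlib
import json


def artifact_path(config: dict) -> str:
    """Hypothetical sketch: derive the compiled-artifact output path
    from a hash of the complete, portable configuration state.
    sort_keys makes the hash independent of dict insertion order."""
    payload = json.dumps(config, sort_keys=True).encode()
    return "aoti/" + hashlib.sha256(payload).hexdigest()[:16]


# Flipping a single flag yields a distinct path, so the cache cannot
# hand back an artifact compiled under the other configuration.
fast = artifact_path({"arch": "sm100", "use_fast_accum": True})
slow = artifact_path({"arch": "sm100", "use_fast_accum": False})
assert fast != slow
```

If the hash instead covered only part of the configuration, two different builds could map to the same path, which is exactly the stale-artifact reuse the commit prevents.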

This technical update is significant because it unlocks PyTorch's ability to leverage the full performance of NVIDIA's flagship Blackwell GPUs for AI workloads immediately upon their release. Efficient FP8 computation is essential for reducing the memory footprint and accelerating the training of large language models (LLMs) and other advanced AI systems.

Key Points
  • Fix enables PyTorch's CUTLASS backend on NVIDIA Blackwell (SM100+) GPUs, correcting a Hopper-specific naming filter.
  • Resolves two bugs: invalid kernel filtering for FP8 `scaled_mm` and stale AOTI artifact reuse during compilation.
  • The commit itself was developed with assistance from OpenAI's GPT-5.4 model, a notable example of AI-assisted coding on production infrastructure.

Why It Matters

Ensures AI researchers and engineers can use PyTorch at full speed on NVIDIA's latest Blackwell GPUs for cutting-edge model training.