Developer Tools

trunk/7ada301a4ccb644eea2cfbbce262657ae518bad4: Fix cuDNN SDPA with zero-stride (broadcast) Q/K/V inputs (#175764)

A subtle bug causing GPU kernel failures in attention layers is now resolved, ensuring stable training.

Deep Dive

The PyTorch development team has resolved a subtle but critical bug in the framework's integration with NVIDIA's cuDNN library, a fix notably assisted by Anthropic's Claude 4.6 Opus AI. The issue, identified in GitHub pull request #175764, affected the SDPA (scaled dot-product attention) implementation—the core computational block for modern Transformer models like GPT and Llama. When attention queries, keys, or values (Q/K/V) contained broadcast dimensions (represented by a stride of zero), a layout-sorting function mapped those zero strides incorrectly. The resulting output tensor's memory layout was incompatible with the cuDNN Frontend API, causing kernel execution failures. This bug could cause hard-to-debug crashes or incorrect behavior during model training on NVIDIA GPUs.
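To see where a stride of zero comes from, here is a minimal pure-Python sketch (PyTorch implements this in C++; the helper names are illustrative, not PyTorch APIs) of how broadcasting a size-1 dimension produces a zero-stride view without copying data—the same trick `torch.Tensor.expand` uses:

```python
def contiguous_strides(shape):
    """Row-major (contiguous) strides, in elements, for a given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def broadcast_strides(shape, strides, target_shape):
    """Expand size-1 dims to target_shape by setting their stride to 0,
    so every index along that dim reads the same underlying memory."""
    return [0 if s == 1 and t != 1 else st
            for s, t, st in zip(shape, target_shape, strides)]

# A K tensor stored with a single head, broadcast across 8 heads:
shape = (2, 1, 128, 64)              # (batch, heads, seq, head_dim)
strides = contiguous_strides(shape)  # [8192, 8192, 64, 1]
expanded = broadcast_strides(shape, strides, (2, 8, 128, 64))
print(expanded)                      # [8192, 0, 64, 1] -- stride 0 on heads
```

It is this stride-0 heads dimension that the layout-sorting code mishandled when deciding how to allocate the attention output.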

The technical root cause was in the `alloc_with_matching_layout` function, which previously mapped stride-0 dimensions to a value of 1 for sorting purposes. This caused them to tie with the actual head dimension (which has stride 1) in a stable sort, incorrectly placing broadcast dimensions first in the layout. The fix, suggested by the AI assistant, changes this mapping to `INT64_MAX`, ensuring broadcast dimensions sort last. This preserves the critical requirement that the last dimension's stride equals 1, which cuDNN's backend expects. This low-level fix matters for the reliability of training and inference in any PyTorch model leveraging optimized attention kernels, preventing silent errors that could waste significant computational resources and researcher time.
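The tie-breaking problem can be demonstrated with a pure-Python sketch of the sort-key logic (the actual C++ in `alloc_with_matching_layout` differs in detail; the function below is illustrative). Dims are ordered from innermost (smallest stride) outward, and Python's `sorted` is stable, so equal keys keep their input order:

```python
INT64_MAX = 2**63 - 1

def sorted_dims(strides, zero_maps_to):
    """Order dims from innermost (smallest stride) to outermost,
    first remapping broadcast (stride-0) dims via zero_maps_to.
    The sort is stable: ties keep their original dim order."""
    mapped = [zero_maps_to if s == 0 else s for s in strides]
    return sorted(range(len(strides)), key=lambda i: mapped[i])

q_strides = [8192, 0, 64, 1]  # (batch, heads, seq, head_dim), heads broadcast

# Old behavior: stride 0 -> 1 ties with head_dim, and the broadcast heads
# dim (earlier in the tensor) stably sorts ahead of it, becoming innermost.
print(sorted_dims(q_strides, zero_maps_to=1))          # [1, 3, 2, 0]

# Fixed behavior: stride 0 -> INT64_MAX pushes broadcast dims outermost,
# so head_dim (stride 1) stays innermost, as cuDNN requires.
print(sorted_dims(q_strides, zero_maps_to=INT64_MAX))  # [3, 2, 0, 1]
```

With the old mapping, the allocated output's innermost dimension was the broadcast heads dim, so the head dimension no longer had stride 1 and cuDNN rejected the layout; the `INT64_MAX` mapping restores the expected ordering.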

Key Points
  • Bug fix for PyTorch's cuDNN SDPA backend preventing kernel rejection from broadcast Q/K/V inputs
  • Core issue was stride-0 dimensions sorting incorrectly in the layout order, fixed by mapping them to INT64_MAX
  • Fix assisted by Anthropic's Claude 4.6 Opus, highlighting AI's role in complex software debugging

Why It Matters

Ensures stable training for Transformer models on NVIDIA GPUs, preventing costly and obscure failures in AI research and production.