trunk/8252a58be1447be0462275006e34d64738f77d44: Fix scratch size for TMA in C++ wrapper (#175385)
A scratch-space allocation bug in TorchInductor's C++ wrapper caused CUDA illegal instruction errors when running Llama 3 attention layers.
The PyTorch team has resolved a critical bug in the TorchInductor compilation pipeline that was causing CUDA kernels to crash with illegal instruction errors. The issue, documented in pull request #175385, specifically affected memory allocation for Tensor Memory Accelerator (TMA) operations when using the C++ wrapper. The bug manifested when running attention layers from models such as Llama 3 8B under specific configurations, particularly sequence lengths of 512 tokens combined with tensor descriptors.
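To make the trigger concrete, here is a minimal sketch of the kind of workload that could hit the bug. It assumes the documented `torch._inductor.config.cpp_wrapper` flag and illustrative Llama 3 8B-style shapes; it is not the exact reproduction case from the PR, and whether a given build emits TMA tensor descriptors depends on the Inductor configuration and hardware.

```python
# Illustrative sketch, not the PR's exact repro: compile an attention
# computation with the C++ wrapper enabled. The shapes mirror the
# Llama 3 8B / seq-len-512 scenario described above.
import torch
import torch.nn.functional as F

torch._inductor.config.cpp_wrapper = True  # route codegen through the C++ wrapper

def attention(q, k, v):
    # Llama 3 8B-style shapes: batch=1, heads=32, seq=512, head_dim=128
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

compiled = torch.compile(attention)

q = torch.randn(1, 32, 512, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Before the fix, a generated kernel using TMA tensor descriptors could
# receive an undersized scratch buffer here and crash with a CUDA
# illegal instruction error.
out = compiled(q, k, v)
```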
At the core of the problem was an incorrect scratch-space calculation in two key files: `cpp_wrapper_gpu.py` and `device_op_overrides.py`. The scratch-space allocation was not scaled by the number of cooperative thread arrays (CTAs) or by the CUDA grid dimensions, leaving insufficient memory for blocks executing in parallel. The resulting CUDA driver errors could crash entire training or inference sessions, which was particularly problematic for researchers and engineers working with modern transformer architectures.
The fix multiplies the requested scratch size by `num_ctas` from the kernel configuration parameters and by the CUDA grid dimensions (`grid_0`, `grid_1`, `grid_2`), giving every parallel processing unit its own correctly sized region so TMA operations execute safely. The bug was particularly insidious because it surfaced only under specific conditions (tensor descriptors, the C++ wrapper, and certain model dimensions), making it difficult to diagnose without the exact reproduction case provided in the issue.
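A minimal sketch of the before-and-after arithmetic, using hypothetical helper names (`scratch_bytes_buggy`, `scratch_bytes_fixed`) rather than the actual code in `cpp_wrapper_gpu.py` or `device_op_overrides.py`:

```python
# Hypothetical illustration of the scratch-size scaling described above.

def scratch_bytes_buggy(requested_size: int) -> int:
    # Before the fix: one scratch region regardless of how many CTAs
    # or grid blocks run in parallel, so concurrent blocks overlap.
    return requested_size

def scratch_bytes_fixed(requested_size: int, num_ctas: int,
                        grid_0: int, grid_1: int, grid_2: int) -> int:
    # After the fix: every CTA of every grid block gets its own slice,
    # so the allocation scales with the total degree of parallelism.
    return requested_size * num_ctas * grid_0 * grid_1 * grid_2

# Example: a kernel requesting 128 bytes of scratch, launched with
# num_ctas=2 on a (64, 1, 1) grid, needs 128 * 2 * 64 = 16384 bytes,
# not 128.
assert scratch_bytes_fixed(128, 2, 64, 1, 1) == 16_384
```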
This correction is crucial for the PyTorch ecosystem as TorchInductor becomes increasingly important for optimizing model performance through compilation. The fix prevents crashes that could disrupt training pipelines or production inference systems, especially for teams working with cutting-edge models like Llama 3 that leverage TMA for memory efficiency.
- Fixed a scratch-space allocation bug in TorchInductor's C++ wrapper that caused CUDA illegal instruction errors
- The crash was triggered by running Llama 3 8B attention layers with a sequence length of 512 and tensor descriptors enabled
- The fix scales scratch-space allocation with the number of cooperative thread arrays and the CUDA grid dimensions for TMA operations
Why It Matters
Prevents crashes in production AI systems using compiled PyTorch models, ensuring stable training and inference for modern architectures.