PyTorch PR #174813 adds deterministic backward for flex flash attention
New PR brings deterministic backward pass with <0.3% overhead for sequences >=8192
PyTorch has merged PR #174813, adding a deterministic backward pass for its flex flash attention mechanism. The change introduces a new function, `_compute_dq_write_order_from_block_mask`, that makes gradient computations reproducible across runs, a critical requirement for debugging, scientific computing, and any application where consistent numerical results matter. Previously, flex attention's backward pass could be non-deterministic because of parallel reductions: floating-point addition is not associative, so partial results accumulated in a different order from run to run can differ in their last bits. This was a particular concern with flexible block masking such as causal masks, sliding windows, or custom document masks.
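To illustrate why reduction order matters, here is a minimal pure-Python sketch. It is not code from the PR; the helper name `sum_in_order` is hypothetical. It accumulates the same three values in two different orders, the way concurrent partial results might arrive on different runs.

```python
def sum_in_order(values, order):
    """Accumulate `values` following the index sequence `order`.

    Hypothetical helper: it mimics partial results being added in
    whatever order parallel workers happen to finish.
    """
    total = 0.0
    for idx in order:
        total += values[idx]
    return total


partials = [0.1, 0.2, 0.3]

# Same inputs, two arrival orders: the results differ in the last bit.
run_a = sum_in_order(partials, [0, 1, 2])
run_b = sum_in_order(partials, [2, 1, 0])
print(run_a == run_b)  # False

# Fixing the order makes the result bit-identical on every run.
print(sum_in_order(partials, [0, 1, 2]) == run_a)  # True
```

A deterministic backward pass follows the second pattern: it pins the accumulation order rather than changing the arithmetic itself.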
Benchmarks across various batch sizes, head counts, and sequence lengths show that the overhead shrinks as sequences grow: at 2048 tokens it peaks at 17.7%, while for sequences of 8192 tokens or longer it is reported as under 0.3%. The implementation achieves this efficiency by fusing the new computation with the existing q_block calculation, adding only a single extra kernel for the exclusive prefix sum. This makes deterministic training practical for long-sequence production workloads without significant throughput loss.
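The role of an exclusive prefix sum can be sketched as follows. This is an illustrative assumption about how a write order could be derived from a block mask, not the PR's actual implementation: the function names `exclusive_prefix_sum` and `dq_write_order`, the nested-list mask layout, and the `-1` sentinel are all hypothetical. For each query block, every contributing key/value block is given a fixed slot equal to the number of contributing blocks before it.

```python
from itertools import accumulate


def exclusive_prefix_sum(values):
    """Return sums of all elements strictly before each position."""
    values = list(values)
    if not values:
        return []
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1]


def dq_write_order(block_mask):
    """Assign each active (q_block, kv_block) pair a fixed write slot.

    Hypothetical sketch: block_mask[q][kv] is True when kv block `kv`
    contributes to the gradient of query block `q`. Masked-out pairs
    get -1 (an assumed sentinel).
    """
    order = []
    for row in block_mask:
        slots = exclusive_prefix_sum(int(active) for active in row)
        order.append([s if active else -1 for s, active in zip(slots, row)])
    return order


# A 3x3 causal block mask: each query block sees itself and earlier blocks.
causal = [
    [True, False, False],
    [True, True, False],
    [True, True, True],
]
print(dq_write_order(causal))
# [[0, -1, -1], [0, 1, -1], [0, 1, 2]]
```

Because the slots depend only on the mask, the accumulation order is the same on every run, regardless of which worker finishes first.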
- PR #174813 adds deterministic backward pass for PyTorch's flex flash attention
- Overhead reported as generally <0.3% for sequence lengths >=8192 (with 0.6% cited at 16384), versus a peak of 17.7% at 2048
- Code fuses with existing q_block calculation, adding just one extra prefix sum kernel
Why It Matters
The change enables reproducible gradient computations for flex attention, which is vital for debugging and scientific reproducibility in deep learning.