PyTorch PR #174813 adds deterministic backward for flex flash attention

PyTorch Releases May 29, 2026

⚡ New PR brings deterministic backward pass with <0.3% overhead for sequences >=8192

Deep Dive

PyTorch has merged PR #174813, adding a deterministic backward pass for its flex flash attention mechanism. The change introduces a new function, `_compute_dq_write_order_from_block_mask`, that makes gradient computations reproducible across runs, a critical requirement for debugging, scientific computing, and any application where consistent numerical results matter. Previously, flex attention's backward pass could be non-deterministic because parallel reductions do not guarantee the order in which partial results are accumulated, particularly with flexible block masking such as causal masks, sliding windows, or custom document masks.
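To see why accumulation order matters at all, here is a minimal pure-Python sketch. It is not code from the PR; the `accumulate` helper and the sample values are illustrative assumptions. It only shows that floating-point addition is not associative, so summing the same partial results in a different order can change the final bits.

```python
# Illustrative only: not taken from PR #174813.
# Floating-point addition is not associative, so the order in which
# partial gradients are accumulated can change the result.

def accumulate(values, order):
    """Sum `values` in the given index order (hypothetical helper)."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

partials = [0.1, 0.2, 0.3]

in_order = accumulate(partials, [0, 1, 2])   # (0.1 + 0.2) + 0.3
reordered = accumulate(partials, [2, 1, 0])  # (0.3 + 0.2) + 0.1

print(in_order == reordered)  # False: same inputs, different order
print(in_order == accumulate(partials, [0, 1, 2]))  # True: fixed order repeats
```

Fixing the order, as in the last line, is what makes the result repeatable from run to run.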

Extensive benchmarking across various batch sizes, head counts, and sequence lengths shows the overhead is small at scale: for sequences of 8192 tokens or longer, the reported performance impact is under 0.3%. Shorter sequences pay proportionally more, with overhead peaking at 17.7% at sequence lengths such as 2048. The implementation keeps the cost low by fusing the new computation with the existing q_block calculation, adding only a single extra kernel for the exclusive prefix sum. This makes deterministic training practical for long-sequence production workloads without significant throughput loss.
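The exclusive prefix sum mentioned above can be sketched in plain Python. This is an assumption-laden illustration, not the PR's implementation: the function `write_order_from_block_mask`, the nested-list mask layout (`block_mask[q][kv]`), and the `-1` marker for inactive blocks are all hypothetical. The idea shown is that an exclusive prefix sum over each row of a block mask assigns every active KV block a fixed turn index for a given Q block.

```python
from itertools import accumulate

def exclusive_prefix_sum(values):
    """out[i] = sum(values[:i]), so the first element is always 0."""
    inclusive = list(accumulate(values))
    return [s - v for s, v in zip(inclusive, values)]

def write_order_from_block_mask(block_mask):
    """Hypothetical sketch: for each Q block (row), give every active
    KV block a fixed rank; inactive blocks are marked with -1."""
    order = []
    for row in block_mask:
        flags = [1 if active else 0 for active in row]
        ranks = exclusive_prefix_sum(flags)
        order.append([r if f else -1 for r, f in zip(ranks, flags)])
    return order

# Causal block mask over 4 blocks: KV block j is active for Q block i if j <= i.
causal = [[kv <= q for kv in range(4)] for q in range(4)]
order = write_order_from_block_mask(causal)

for q, row in enumerate(order):
    print(q, row)
```

Because the ranks depend only on the mask, the same mask always produces the same ordering, which is the property a deterministic accumulation scheme needs.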

Key Points

PR #174813 adds deterministic backward pass for PyTorch's flex flash attention
Overhead measured at <0.3% for sequence lengths >=8192, rising only to 0.6% at 16384
Code fuses with existing q_block calculation, adding just one extra prefix sum kernel

Why It Matters

Enables reproducible gradient computations for flex attention, vital for debugging and scientific reproducibility in deep learning.
