Meta's TLX Block Attention speeds up sparse self-attention 2.5x on Blackwell

PyTorch Blog May 26, 2026

⚡ A 2.5x backward pass speedup over Flash Attention v2 on NVIDIA B200 GPUs.

Deep Dive

Meta (Facebook Research) introduced TLX Block Attention, a specialized Triton kernel for NVIDIA Blackwell GPUs that exploits a fixed block-diagonal attention pattern to accelerate self-attention. On B200 hardware, it delivers a ~1.85× forward speedup and a ~2.50× backward speedup over Flash Attention v2, and up to ~3.5× when rotary embeddings are fused into the backward pass. The kernel targets 64-token blocks, a size that is incompatible with FlexAttention's minimum tile of 256. Because each query block attends only to its own key block, the kernel eliminates the iterative Q-tile-over-K-tile loop, the online softmax correction, and the logsumexp bookkeeping that general-purpose attention implementations require.
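To make the block-diagonal pattern concrete, here is a minimal pure-Python sketch. It is a reference illustration only, not Meta's kernel, and it uses no Triton or TLX. The function names and the tiny sizes (a block of 4 rather than the article's 64) are assumptions chosen for readability. The sketch checks that attending within each block independently gives the same result as dense attention under a block-diagonal mask.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_full_attention(q, k, v, block):
    """Dense attention with a block-diagonal mask (slow reference)."""
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        # Keys outside the query's block get -inf, so their weight is 0.
        scores = [
            scale * sum(a * b for a, b in zip(q[i], k[j]))
            if i // block == j // block
            else float("-inf")
            for j in range(n)
        ]
        p = softmax(scores)
        out.append([sum(p[j] * v[j][c] for j in range(n)) for c in range(d)])
    return out


def block_diagonal_attention(q, k, v, block):
    """Attention computed one block at a time, with no cross-block work."""
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for start in range(0, len(q), block):
        ks = k[start:start + block]
        vs = v[start:start + block]
        for qi in q[start:start + block]:
            # The whole key block is visible at once: one softmax pass,
            # no running max or rescaling across key tiles.
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in ks]
            p = softmax(scores)
            out.append([sum(pj * vj[c] for pj, vj in zip(p, vs)) for c in range(d)])
    return out


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


random.seed(0)
N, D, BLOCK = 16, 8, 4  # toy sizes; the article's kernel uses 64-token blocks
q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
k = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
v = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

ref = masked_full_attention(q, k, v, BLOCK)
out = block_diagonal_attention(q, k, v, BLOCK)
print("max abs difference:", max_abs_diff(ref, out))
```

The per-block version never forms scores for masked positions, which is the work a fixed-pattern kernel can skip entirely.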

The work builds on TLX (Triton Language Extensions), a set of low-level extensions that expose warp specialization, asynchronous tensor core operations, and memory hierarchy control in Triton, bridging the gap between Python productivity and the low-level control of raw CUDA or CUTLASS. The motivating workload is Meta's ads ranking stack, with batch sizes of 1152, sequences up to ~4k tokens, and ~70% sparsity, where attention costs dominate inference. The fixed-block constraint collapses multi-iteration accumulators into single GEMMs and removes auxiliary kernel launches and correction stages, yielding significant efficiency gains for production recommendation models.
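The "correction stages" mentioned above come from online softmax, which general-purpose kernels use when keys arrive one tile at a time. The following pure-Python sketch is an illustration under simplifying assumptions (a single query row, scalar values, hypothetical function names), not the TLX implementation. It shows the running-max rescaling that tiling needs, and that it becomes a no-op when the whole key block fits in a single tile.

```python
import math


def attend_single_pass(scores, values):
    """Softmax-weighted sum when every score is available at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    return sum(w * val for w, val in zip(weights, values)) / denom


def attend_online(scores, values, tile):
    """Same result, but keys arrive in tiles (online softmax)."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Correction stage: rescale earlier partial results to the new max.
        # On the first tile this is exp(-inf) = 0, applied to zeros.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc *= correction
        for s, val in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * val
        running_max = new_max
    return acc / running_sum


scores = [0.5, -1.2, 3.0, 0.1, 2.2, -0.7, 1.9, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

single = attend_single_pass(scores, values)
tiled = attend_online(scores, values, tile=2)             # four tiles, three real corrections
one_tile = attend_online(scores, values, tile=len(scores))  # one tile, nothing to correct

print(single, tiled, one_tile)
```

With one tile per block, the loop body runs once and the running state is never rescaled, which is the sense in which a fixed block size lets the accumulation collapse into a single pass.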

Key Points

~1.85× forward and ~2.50× backward speedup over Flash Attention v2 on NVIDIA B200 GPUs
Exploits fixed 64-token block-diagonal patterns to eliminate iterative loops, online softmax, and bookkeeping overhead
Built on TLX (Triton Language Extensions) for warp specialization and direct hardware control without raw CUDA

Why It Matters

Enables faster inference and training for large-scale recommendation models where attention is the dominant bottleneck.
