Meta's TLX Block Attention speeds up sparse self-attention 2.5x on Blackwell
A 2.5x backward pass speedup over Flash Attention v2 on NVIDIA B200 GPUs.
Meta (Facebook Research) has introduced TLX Block Attention, a specialized Triton kernel for NVIDIA Blackwell GPUs that exploits a fixed block-diagonal attention pattern to accelerate self-attention. On B200 hardware, it delivers a ~1.85× forward speedup and a ~2.50× backward speedup over Flash Attention v2, and up to ~3.5× when rotary embeddings are fused into the backward pass. The kernel targets 64-token blocks, which are incompatible with FlexAttention's 256 minimum tile. Because each block is handled in a single pass, the kernel eliminates the iterative Q-tile-over-K-tile loop, online softmax correction, and logsumexp bookkeeping that general-purpose attention implementations require.
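To make the eliminated bookkeeping concrete, here is a minimal pure-Python sketch of what a general-purpose kernel does for one query row when the keys arrive tile by tile, next to the single-pass version that suffices when all keys fit in one tile. The function names, the scalar values, and the tile size are illustrative assumptions. This is not code from the TLX kernel.

```python
import math


def single_pass_row(scores, values):
    """Softmax-weighted sum when every score is available at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def online_row(scores, values, tile=4):
    """Same quantity computed tile by tile with a running max and rescaling.

    Returns the output and the logsumexp that a general kernel would store
    for its backward pass.
    """
    running_max = float("-inf")
    denom = 0.0  # running softmax denominator
    acc = 0.0    # running weighted sum of values
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Correction step: rescale earlier partial sums to the new max.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in s_tile]
        denom = denom * correction + sum(weights)
        acc = acc * correction + sum(w * v for w, v in zip(weights, v_tile))
        running_max = new_max
    logsumexp = running_max + math.log(denom)
    return acc / denom, logsumexp


scores = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7, 4.0, 0.9]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
direct = single_pass_row(scores, values)
tiled, lse = online_row(scores, values, tile=4)
```

Both paths give the same result up to floating-point rounding. The single-pass version needs no running maximum, no correction factor, and no per-tile loop, which is the overhead a fixed small block makes unnecessary.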
The work builds on TLX (Triton Language Extensions), a set of low-level extensions that expose warp specialization, asynchronous tensor core operations, and memory hierarchy control in Triton. TLX bridges the gap between Python productivity and the low-level control of raw CUDA or CUTLASS. The kernel is motivated by Meta's ads ranking stack, which runs batch sizes of 1152, sequences up to ~4k tokens, and ~70% sparsity, and where attention costs dominate inference. The fixed-block constraint collapses multi-iteration accumulators into single GEMMs and removes auxiliary kernel launches and correction stages, which is the source of the efficiency gains for production recommendation models.
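As a rough illustration of the block-diagonal pattern, the sketch below computes attention independently inside each fixed-size block and checks it against full attention with a block-diagonal mask. It is a pure-Python reference under stated assumptions: the helper names are hypothetical, the demo uses a block size of 4 rather than 64 to stay small, and nothing here reflects how the TLX kernel is actually written.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v, allowed=None):
    """softmax(Q K^T / sqrt(d)) V with an optional (i, j) -> bool mask."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if allowed is None or allowed(i, j):
                scores.append(scale * sum(a * b for a, b in zip(qi, kj)))
            else:
                scores.append(float("-inf"))  # masked-out key
        probs = softmax(scores)
        out.append([sum(p * vj[c] for p, vj in zip(probs, v))
                    for c in range(len(v[0]))])
    return out


def block_diagonal_attention(q, k, v, block=64):
    """One independent, single-pass attention per fixed-size block."""
    out = []
    for start in range(0, len(q), block):
        end = start + block
        out.extend(attend(q[start:end], k[start:end], v[start:end]))
    return out


def masked_full_attention(q, k, v, block=64):
    """Reference: score every (query, key) pair, then mask off-block pairs."""
    return attend(q, k, v, allowed=lambda i, j: i // block == j // block)


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


random.seed(0)
n, d, block = 16, 4, 4  # small demo sizes, not the production shapes
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
diff = max_abs_diff(block_diagonal_attention(q, k, v, block),
                    masked_full_attention(q, k, v, block))
```

The per-block version only ever forms a `block × block` score matrix, so each block's scores and outputs come from one matrix product each rather than an accumulation over many key tiles. The masked reference visits all `n × n` pairs and discards most of them.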
- ~1.85× forward and ~2.50× backward speedup over Flash Attention v2 on NVIDIA B200 GPUs
- Exploits fixed 64-token block-diagonal patterns to eliminate iterative loops, online softmax, and bookkeeping overhead
- Built on TLX (Triton Language Extensions) for warp specialization and direct hardware control without raw CUDA
Why It Matters
Enables faster inference and training for large-scale recommendation models where attention is the dominant bottleneck.