Open Source

FlashAttention-4: 1,613 TFLOPS, 2.7x faster than Triton, written in Python. What it means for inference.

A new kernel written in Python runs up to 2.7x faster than Triton on NVIDIA's latest Blackwell GPUs.

Deep Dive

FlashAttention-4 represents a major leap in inference performance, achieving a staggering 1,613 TFLOPS in BF16 precision on NVIDIA's Blackwell B200 GPUs. That works out to roughly 71% hardware utilization (B200's dense BF16 peak is about 2.25 PFLOPS), bringing attention computation close to matrix-multiplication speeds. The kernel delivers 2.1-2.7x speedups over Triton and up to 1.3x improvements over cuDNN 9.13. Major frameworks are already adopting it: vLLM 0.17.0 includes automatic integration for B200 users, while PyTorch FlexAttention shows 1.2-3.2x gains over its Triton backend. The update fully supports the grouped-query and multi-query attention architectures used by models like Llama, Mistral, and Gemma.
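
Grouped-query attention is exposed through the same tensor convention FlashAttention has always used: fewer key/value heads than query heads. Here is a minimal sketch using FlashAttention-2's public flash_attn_func signature (FlashAttention-4's entry point may differ, but the GQA layout convention is the same):

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, headdim = 2, 4096, 128
n_q_heads, n_kv_heads = 32, 8  # 4 query heads share each KV head (GQA); 1 KV head would be MQA

q = torch.randn(batch, seqlen, n_q_heads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn(batch, seqlen, n_kv_heads, headdim, device="cuda", dtype=torch.bfloat16)
v = torch.randn(batch, seqlen, n_kv_heads, headdim, device="cuda", dtype=torch.bfloat16)

# GQA/MQA is handled automatically when n_q_heads is a multiple of n_kv_heads.
out = flash_attn_func(q, k, v, causal=True)  # -> (batch, seqlen, n_q_heads, headdim)
```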

What makes FlashAttention-4 particularly notable is its implementation in 100% CuTe-DSL, NVIDIA's Python-based kernel domain-specific language. This allows the kernel to compile in just 2.5 seconds compared to 55 seconds for equivalent C++ code, dramatically accelerating development iteration while maintaining identical runtime performance. However, there is a significant hardware limitation: the kernel leans on recent architecture features, namely async TMA (introduced with Hopper) along with TMEM and 2-CTA MMA (Blackwell-only), so it runs only on Hopper (H100/H800) and Blackwell (B200/B100) GPUs, not on older A100s or consumer cards. For A100 users, FlashAttention-2 remains the optimal choice, while H100 users will see smaller gains than Blackwell systems.
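
Because support is architecture-gated, serving stacks typically pick the kernel generation from the GPU's compute capability. A hedged sketch of that dispatch (the mapping follows the constraints above; the helper name is ours, not a real API):

```python
import torch

def pick_flash_attention_generation() -> int:
    """Map CUDA compute capability to a FlashAttention generation.
    sm100 = Blackwell, sm90 = Hopper, sm80 = Ampere (A100)."""
    major, _minor = torch.cuda.get_device_capability()
    if major >= 9:   # Hopper and Blackwell can run FlashAttention-4
        return 4
    return 2         # Ampere and older stay on FlashAttention-2

print(f"Dispatching to FlashAttention-{pick_flash_attention_generation()}")
```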

The technical breakthroughs include selective rescaling, which skips the softmax accumulator correction unless it is numerically necessary and cuts that work by roughly 10x, and a sophisticated 5-stage pipeline architecture. With the attention matmuls no longer the bottleneck (the softmax exponentials now are), this advancement enables significantly faster inference for large language models. While the hardware requirements are restrictive today, the algorithmic innovations and CuTe-DSL tooling are foundational improvements that should eventually benefit the broader GPU ecosystem as the techniques are ported to other hardware.
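
The selective-rescaling idea is easiest to see in the online softmax that FlashAttention kernels are built on. Below is a minimal NumPy sketch of the exact-skip variant: the running accumulator is rescaled only when a block's maximum actually raises the running max, so most blocks skip the correction entirely. FA4's kernel operates on hardware tiles with its own skip criterion; the function here is purely illustrative.

```python
import numpy as np

def online_softmax_weighted_sum(score_blocks, value_blocks):
    """Streaming softmax(scores) @ values with selective rescaling:
    the accumulator is corrected only when the running max increases."""
    m = -np.inf                                   # running max of all scores seen
    l = 0.0                                       # running sum of exp(score - m)
    acc = np.zeros(value_blocks[0].shape[1])      # running weighted sum of values
    for s, v in zip(score_blocks, value_blocks):  # s: (block,), v: (block, d)
        block_max = s.max()
        if block_max > m:                         # max increased: rescale old work
            scale = np.exp(m - block_max)
            l, acc, m = l * scale, acc * scale, block_max
        # otherwise the correction is skipped: exp(s - m) <= 1 is still safe
        p = np.exp(s - m)
        l += p.sum()
        acc += p @ v
    return acc / l

rng = np.random.default_rng(0)
scores = [rng.normal(size=16) for _ in range(4)]
values = [rng.normal(size=(16, 8)) for _ in range(4)]

# Sanity check against the direct (non-streaming) computation.
s_all, v_all = np.concatenate(scores), np.vstack(values)
w = np.exp(s_all - s_all.max())
assert np.allclose(online_softmax_weighted_sum(scores, values), (w @ v_all) / w.sum())
```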

Key Points
  • Achieves 1,613 TFLOPS on B200 GPUs with 71% hardware utilization, making attention as fast as matmul operations
  • Written entirely in Python using NVIDIA's CuTe-DSL, compiling in 2.5 seconds vs 55 seconds for C++ equivalents
  • Currently limited to Hopper (H100) and Blackwell (B200) architectures due to hardware-specific optimizations

Why It Matters

Dramatically accelerates LLM inference for enterprise deployments on the latest NVIDIA hardware, with the potential to cut serving costs and latency substantially.