Generalized Dot-Product Attention: Tackling Real-World Challenges in GPU Training Kernels
New kernel hits 1,145 TFLOPS on B200 GPUs and unifies the attention variants used across Meta's ads and recommendation models.
Meta's AI infrastructure team has open-sourced a high-performance kernel called Generalized Dot-Product Attention (GDPA), designed to accelerate training of its massive recommendation and advertising models. Built as an evolution of Tri Dao's Flash Attention 4, GDPA generalizes standard attention by allowing custom activation functions such as GELU and SiLU, which are critical in production architectures like the Generative Ads Model (GEM) and Kunlun. The kernel was optimized specifically for the messy realities of production data (jagged sequences, large batch sizes, and variable lengths) that caused a 2.6x performance gap in previous implementations. On NVIDIA B200 GPUs, the optimized GDPA forward pass reaches up to 1,145 BF16 Tensor Core TFLOPS (97% utilization), a 2x speedup over a baseline Triton implementation, while the backward pass reaches 702 TFLOPS, a 1.6x speedup.
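The announcement does not include a reference implementation, so the PyTorch sketch below is only an illustrative guess at the math: it shows how dot-product attention can be "generalized" by swapping the row-wise softmax for an elementwise activation such as SiLU or GELU. The function name `gdpa_reference` and the tensor shapes are hypothetical, not Meta's API, and the production kernel fuses this computation in a Flash-Attention-style tiled loop rather than materializing the full score matrix.

```python
import torch
import torch.nn.functional as F

def gdpa_reference(q, k, v, activation):
    """Unfused reference for generalized dot-product attention (hypothetical).

    q: (batch, heads, seq_q, dim); k, v: (batch, heads, seq_kv, dim).
    `activation` replaces the usual softmax on the scaled score matrix.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale   # (B, H, Sq, Skv)
    weights = activation(scores)
    return torch.matmul(weights, v)                          # (B, H, Sq, dim)

# Standard attention: row-wise softmax over the keys.
softmax_attn = lambda s: torch.softmax(s, dim=-1)

q, k, v = (torch.randn(2, 4, 128, 64) for _ in range(3))
out_softmax = gdpa_reference(q, k, v, softmax_attn)  # classic attention
out_silu = gdpa_reference(q, k, v, F.silu)           # GDPA-style SiLU variant
out_gelu = gdpa_reference(q, k, v, F.gelu)           # GDPA-style GELU variant
```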
The significance lies in its direct impact on Meta's business. By unifying disparate attention-like modules (self-attention, PMA, PFFN) under one efficient kernel, the team streamlined the training pipeline for RecSys models that power ads and content recommendations. When applied across a full model, these custom kernels deliver over 30% training throughput improvement. This work demonstrates a shift from benchmarking on perfect, synthetic data to optimizing for real-world, irregular workloads, a necessity for scaling foundation models that rely on user behavior data. The code is available in Meta's ads model kernel library, providing a blueprint for others tackling similar production-scale challenges.
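For readers unfamiliar with the "jagged" inputs mentioned above, the sketch below (my own illustration, not code from the release) shows the common values-plus-offsets layout for variable-length sequences; a kernel that consumes this layout directly avoids padding every user's sequence to the batch maximum.

```python
import torch

# Hypothetical jagged batch: four users with different history lengths,
# stored as one flat value tensor plus offsets instead of a padded matrix.
lengths = torch.tensor([3, 1, 7, 2])                         # tokens per sequence
offsets = torch.cat([torch.zeros(1, dtype=torch.long),
                     torch.cumsum(lengths, dim=0)])          # [0, 3, 4, 11, 13]
dim = 64
values = torch.randn(int(lengths.sum()), dim)                # 13 rows, zero padding

# Slicing with offsets recovers each variable-length sequence; a fused kernel
# does the equivalent indexing on-chip, so no FLOPs are spent on padding.
sequences = [values[offsets[i]:offsets[i + 1]] for i in range(len(lengths))]
```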
- Achieves up to 2x forward-pass and 1.6x backward-pass speedups on NVIDIA B200 GPUs, hitting 1,145 BF16 TFLOPS.
- Unifies multiple attention variants (self-attention, PMA, PFFN) under one kernel for Meta's GEM and Kunlun RecSys models.
- Delivers over 30% end-to-end training throughput improvement by optimizing for real-world jagged sequences and large batches.
Why It Matters
Dramatically reduces cost and time to train the massive AI models that power content recommendations and advertising for billions of users.