Developer Tools

In-Kernel Broadcast Optimization: Co-Designing Kernels for RecSys Inference

Meta eliminates redundant user embedding broadcast, slashing inference latency by up to 66%.

Deep Dive

Meta has open-sourced In-Kernel Broadcast Optimization (IKBO), a kernel-model-system co-design that eliminates a major inefficiency in recommendation system inference. In RecSys, a single user request (e.g., a feed load) requires scoring hundreds to thousands of candidate items. Traditional approaches replicate the user's shared embeddings for every candidate before interaction layers, wasting memory bandwidth and compute—overhead that scales linearly with candidate count. IKBO instead handles broadcast internally within each kernel, accepting user and candidate inputs at their natural mismatched batch sizes and never materializing replicated tensors. This reduces both memory footprint and I/O utilization, turning a system-level bottleneck into a computational primitive.
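
To make the broadcast idea concrete, here is a minimal PyTorch sketch (illustrative shapes and variable names, not Meta's production kernels): the traditional path materializes a per-candidate copy of the user embedding before the interaction step, while the broadcast path consumes the mismatched batch sizes directly and never writes the replicated tensor.

    import torch

    D, NUM_CANDIDATES = 256, 2000               # illustrative sizes, not from the post

    user_emb = torch.randn(1, D)                # one user per request
    cand_embs = torch.randn(NUM_CANDIDATES, D)  # hundreds to thousands of candidates

    # Traditional path: replicate the user embedding so batch sizes match
    # before the interaction layer (O(NUM_CANDIDATES * D) extra memory traffic).
    user_replicated = user_emb.repeat(NUM_CANDIDATES, 1)
    scores_naive = (user_replicated * cand_embs).sum(dim=-1)

    # IKBO-style idea: accept the mismatched batch sizes and broadcast inside
    # the operator, so the replicated tensor is never materialized.
    scores_broadcast = (user_emb * cand_embs).sum(dim=-1)

    assert torch.allclose(scores_naive, scores_broadcast)

PyTorch broadcasting already avoids the copy in this toy elementwise case; IKBO's contribution is pushing the same semantics into the heavier fused interaction, compression, and attention kernels on GPU and MTIA, where the replication would otherwise be explicit.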

Deployed end-to-end across Meta's multi-stage recommendation funnel on both GPU and MTIA (Meta Training and Inference Accelerator), IKBO delivers up to a two-thirds reduction in compute-intensive net latency. On H100 SXM5 GPUs, the IKBO Linear Compression kernel achieved a cumulative ~4× speedup through four progressive co-design stages, culminating in warp-specialized fusion via TLX (Triton Low-Level Extensions). For Flash Attention, IKBO shifted the kernel from I/O-bound to compute-bound, hitting 621 BF16 TFLOPs and delivering 2.4×/6.4× throughput gains over the non-co-designed CuTeDSL FA4 Hopper baseline. The approach serves as the scalability backbone for Meta's request-centric, inference-efficient framework powering the Meta Adaptive Ranking Model, which now serves LLM-scale models in production.
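
For intuition on what handling broadcast inside the kernel can look like, below is a minimal Triton sketch (hypothetical kernel name, block sizes, and memory layout; it is not Meta's TLX warp-specialized or Flash Attention kernel). Each program loads a tile of the shared user embedding once into registers and reuses it across a block of candidates, so a replicated [N, D] user tensor never touches global memory.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def user_item_score_kernel(user_ptr, cand_ptr, out_ptr, N, D,
                               BLOCK_N: tl.constexpr, BLOCK_D: tl.constexpr):
        # One program scores BLOCK_N candidates against the single shared user.
        pid = tl.program_id(0)
        n_offs = pid * BLOCK_N + tl.arange(0, BLOCK_N)
        n_mask = n_offs < N
        acc = tl.zeros((BLOCK_N,), dtype=tl.float32)
        for d0 in range(0, D, BLOCK_D):
            d_offs = d0 + tl.arange(0, BLOCK_D)
            d_mask = d_offs < D
            # The user tile is loaded once per step and broadcast in registers
            # across all BLOCK_N candidates; no replicated copy is ever written.
            u = tl.load(user_ptr + d_offs, mask=d_mask, other=0.0)
            c = tl.load(cand_ptr + n_offs[:, None] * D + d_offs[None, :],
                        mask=n_mask[:, None] & d_mask[None, :], other=0.0)
            acc += tl.sum(c * u[None, :], axis=1)
        tl.store(out_ptr + n_offs, acc, mask=n_mask)

    # Usage (requires a CUDA GPU with Triton installed):
    N, D = 2000, 256
    user = torch.randn(D, device="cuda")
    cands = torch.randn(N, D, device="cuda")
    scores = torch.empty(N, device="cuda")
    user_item_score_kernel[(triton.cdiv(N, 128),)](user, cands, scores, N, D,
                                                   BLOCK_N=128, BLOCK_D=64)

The production kernels go further (matmul decomposition, memory alignment, broadcast fusion, and warp specialization via TLX), but the core move is the same: consume the user and candidate sides at their natural, mismatched sizes.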

Key Points
  • IKBO eliminates explicit user embedding replication by fusing broadcast logic into interaction kernels, reducing memory bandwidth and compute overhead.
  • Achieves up to 66% latency reduction and 4× speedup on the Linear Compression kernel (H100 SXM5) via matmul decomposition, memory alignment, broadcast fusion, and warp-specialized optimization.
  • Flash Attention sees 2.4×/6.4× throughput gains over non-co-designed baselines, hitting 621 BF16 TFLOPs and shifting from I/O-bound to compute-bound performance.

Why It Matters

Boosts recommendation throughput for LLM-scale models, enabling faster and more efficient feed ranking at Meta scale.