Research & Papers

ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

New CUDA technique fuses entire transformer block to cut latency

Deep Dive

ClusterFusion++ is a new CUDA-level optimization that expands operator fusion to cover the entire transformer decoder block for GPT-NeoX/Pythia models. Traditional LLM decoding is bottlenecked by fragmented kernel launches and repeated off-chip memory transfers. Prior work fused only the attention-side ops (QKV projection, attention, output projection), whereas ClusterFusion++ fuses the full block: LayerNorm, QKV projection, RoPE, decode attention, output projection, Post-LN, MLP, and residual connections. It leverages thread-block clusters and on-chip inter-block collectives to minimize data movement, and introduces a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to cut per-step overhead.
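
The cluster-side idea is the easiest piece to show in isolation. The sketch below is a minimal illustration, not the paper's kernel: two co-scheduled blocks in a thread-block cluster exchange partial results through distributed shared memory instead of global memory, which is the mechanism behind the on-chip inter-block collectives mentioned above. The kernel name, the toy reduction, and the launch shape are assumptions for illustration; sm_90 or newer hardware and CUDA 12 are required.

```cuda
// Illustrative only: not ClusterFusion++'s kernel. Build with e.g. nvcc -arch=sm_90.
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

// Two blocks per cluster: each block reduces its own tile into shared memory,
// then block 0 reads its partner's partial sum directly over distributed
// shared memory (DSMEM) instead of bouncing through global memory.
__global__ void __cluster_dims__(2, 1, 1)
cluster_pair_sum(const float* x, float* out, int n)
{
    __shared__ float partial;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) partial = 0.f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&partial, x[i]);      // block-local partial sum
    __syncthreads();

    cluster.sync();                            // both blocks' partials are ready
    if (cluster.block_rank() == 0 && threadIdx.x == 0) {
        // Read the other block's shared-memory partial on chip.
        float* peer = cluster.map_shared_rank(&partial, 1);
        out[blockIdx.x / 2] = partial + *peer;
    }
    cluster.sync();                            // keep peer's shared memory alive until read
}

int main()
{
    const int n = 1 << 20, threads = 256;
    float *x, *out;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&out, n / threads / 2 * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    // The grid size must be a multiple of the cluster size declared on the kernel.
    cluster_pair_sum<<<n / threads, threads>>>(x, out, n);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(out);
}
```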

Benchmarked on an NVIDIA RTX 5090-class GPU, ClusterFusion++ delivers a 1.34x throughput improvement for Pythia-2.8B and similar gains for the larger Pythia-6.9B model. Output fidelity remains high—near-token-identical generation—with only minor non-determinism from FP16 atomics. This work, submitted to arXiv by ChiHeng Jin, Hongche Yu, and Xihui Chen, targets latency-sensitive LLM inference and could be adapted to other architectures like LLaMA or GPT-J. The technique is especially relevant for real-time applications where every millisecond counts, such as chatbots, code assistants, and interactive AI tools.
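
To make the CUDA-Graph-compatible execution mode from the Deep Dive concrete, here is a minimal sketch, assuming a placeholder fused kernel, of a graph-captured decode loop: one decode step is recorded once and replayed per token, which is why launch arguments (such as the TMA descriptors the paper keeps persistent) must remain stable across steps. The function and kernel names are hypothetical, not from ClusterFusion++.

```cuda
#include <cuda_runtime.h>

// Placeholder for the fused decode-step kernel; the real kernel would cover
// the whole decoder block. This stub only marks where it runs.
__global__ void fused_decode_step(float* hidden)
{
    if (threadIdx.x == 0) hidden[0] += 0.f;
}

void decode_with_graph(float* hidden, int num_tokens)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture one decode step into a graph. Every pointer baked into the
    // capture must stay valid and fixed across replays, which is the role
    // persistent descriptors play in the approach described above.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    fused_decode_step<<<1, 256, 0, stream>>>(hidden);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12 signature

    // Replay the pre-built graph once per generated token: the per-step CPU
    // cost drops to a single cudaGraphLaunch instead of re-issuing and
    // re-configuring each kernel. A real loop would update per-step arguments
    // (e.g. the KV-cache length) via graph node updates or a device counter.
    for (int t = 0; t < num_tokens; ++t)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}

int main()
{
    float* hidden;
    cudaMalloc(&hidden, 4096 * sizeof(float));
    decode_with_graph(hidden, /*num_tokens=*/128);
    cudaFree(hidden);
}
```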

Key Points
  • Fuses all 8 ops in a GPT-NeoX/Pythia decoder block into one kernel
  • 1.34x throughput boost on RTX 5090 for Pythia-2.8B, similar for 6.9B
  • Uses CUDA Graphs and TMA descriptors to reduce per-step overhead

Why It Matters

Speeds up LLM inference with effectively no quality loss, which is critical for real-time AI apps.