Research & Papers

RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators

New framework automatically fuses complex reduction patterns that stumped existing compilers like TVM and XLA.

Deep Dive

A research team led by Xinsheng Tang has introduced RedFuser, a breakthrough framework that automatically fuses cascaded reduction operations for AI accelerators. These patterns, in which multiple reductions with inter-loop dependencies occur in sequence (as in transformer attention), have long challenged existing AI compilers such as TVM and XLA. While hand-crafted solutions existed for specific cases, RedFuser provides the first general, automated approach, grounded in formal theoretical analysis, that rewrites cascaded reductions as single-loop computations with incremental forms.
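To make the pattern concrete (this example is illustrative, not taken from the paper): the numerically stable softmax denominator is a cascaded reduction, a max over the inputs followed by an exponential sum that depends on that max. Written naively it takes two passes over the data; the incremental form folds both reductions into one loop by rescaling the running sum whenever the running max grows. A minimal Python sketch of the rewrite, with hypothetical function names:

    import numpy as np

    def softmax_denominator_two_pass(x):
        """Cascaded reductions: pass 1 takes the max, pass 2 the exp-sum that depends on it."""
        m = np.max(x)                       # reduction 1
        return np.sum(np.exp(x - m)), m     # reduction 2 uses the result of reduction 1

    def softmax_denominator_one_pass(x):
        """Incremental form: both reductions carried in a single loop over x."""
        m = float("-inf")   # running max
        s = 0.0             # running exp-sum, kept relative to the current running max
        for xi in x:
            m_new = max(m, xi)
            # Rescale the partial sum when the running max increases,
            # then add the new element's contribution.
            s = s * np.exp(m - m_new) + np.exp(xi - m_new)
            m = m_new
        return s, m

Both functions return the same (sum, max) pair up to floating-point rounding; the fused version reads each element exactly once, which is what makes it expressible as a single accelerator kernel.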

The framework's impact is substantial: experiments demonstrate 2x to 5x performance improvements over current state-of-the-art compilers, with RedFuser-generated kernels matching the efficiency of highly optimized, hand-written code. The work addresses a critical bottleneck in deploying large models, particularly transformers, where such patterns are ubiquitous. By automating what previously required expert manual optimization, RedFuser could significantly accelerate inference and training across various AI hardware platforms.

RedFuser's methodology represents a significant advancement in compiler technology, moving beyond pattern-matching to formal analysis of reduction dependencies. The framework automatically identifies supported patterns and generates optimized fused kernels, potentially reducing engineering effort while improving performance consistency. With the code already available and the paper accepted to ASPLOS '26, this work could soon influence production compilers and AI deployment pipelines industry-wide.

Key Points
  • Automatically fuses cascaded reduction patterns (like safe softmax + GEMM; see the sketch after this list) that existing compilers couldn't handle
  • Achieves 2x to 5x speedups over state-of-the-art AI compilers including TVM and XLA
  • Matches performance of hand-optimized kernels while providing general automation for previously manual optimizations
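The safe softmax + GEMM case in the first point can be sketched the same way (again as an illustration of the fused pattern, not RedFuser's actual generated code): the softmax max, the exponential sum, and the weighted sum over value rows are all maintained incrementally in one loop, so no intermediate softmax vector is ever materialized.

    import numpy as np

    def attention_row_unfused(scores, V):
        """Reference: complete the softmax first, then a separate matmul with V."""
        p = np.exp(scores - np.max(scores))
        p /= np.sum(p)
        return p @ V

    def attention_row_fused(scores, V):
        """Single loop: running max, running exp-sum, and the running weighted sum
        over V's rows, rescaling past contributions whenever the max grows."""
        m = float("-inf")             # running max of the scores
        s = 0.0                       # running exp-sum, relative to m
        acc = np.zeros(V.shape[1])    # running unnormalized output row
        for score, v_row in zip(scores, V):
            m_new = max(m, score)
            scale = np.exp(m - m_new)    # rescales everything accumulated so far
            w = np.exp(score - m_new)    # weight of the new element
            s = s * scale + w
            acc = acc * scale + w * v_row
            m = m_new
        return acc / s

    scores, V = np.random.randn(128), np.random.randn(128, 64)
    assert np.allclose(attention_row_unfused(scores, V), attention_row_fused(scores, V))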

Why It Matters

Could dramatically accelerate transformer inference and training by automating optimization of ubiquitous reduction patterns.