FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow
LLM agents discover, tune, and compose CUTLASS kernels for up to a 2.79x speedup on NVIDIA A100 GPUs
Researchers Sina Heidari and Dimitrios S. Nikolopoulos from Virginia Tech have introduced FACT (Framework for Agentic CUTLASS Transpilation), a novel system that leverages large language model (LLM) agents to automatically synthesize and optimize GPU kernels for PyTorch models. The framework addresses a key bottleneck in deep learning: while vendor libraries like cuBLAS and CUTLASS offer strong baseline performance, they are limited to pre-defined optimizations. When these fall short, developers must hand-write CUDA or CUTLASS kernels—a process requiring deep expertise in GPU microarchitecture and C++ template metaprogramming.
FACT operates through a three-stage pipeline. First, an LLM agent inspects the computational graph of a PyTorch model and matches subgraphs to optimization rules from an architecture-specific index. Second, each identified pattern is implemented as a CUTLASS kernel wrapped in a PyTorch extension, verified, and auto-tuned by sweeping parameters from the CUTLASS hierarchy. Finally, the optimized extensions are composed into a single module for end-to-end benchmarking. In tests on an NVIDIA A100 GPU, FACT delivered 1.06x–1.18x speedups on standard GEMM workloads and a striking 2.79x end-to-end speedup on a MiniGPT block by fusing multi-head attention with MLP GEMM+GELU operations.
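The paper does not publish FACT's discovery code, but the first stage can be sketched in plain Python: flatten the model's computational graph to an op sequence and greedily match subgraphs against rules from an architecture-specific index. The op names, rule format, and kernel labels below are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of FACT's stage-one pattern discovery (not the
# authors' implementation): match subgraphs of a computational graph
# against optimization rules from an architecture-specific index.

# A transformer block flattened to a topologically ordered op sequence.
minigpt_block = ["layernorm", "matmul", "softmax", "matmul",   # attention
                 "layernorm", "matmul", "gelu", "matmul"]      # MLP

# Illustrative A100 rule index: op pattern -> suggested CUTLASS realization.
A100_RULES = {
    ("matmul", "gelu"): "gemm_gelu_fused_epilogue",
    ("matmul", "softmax", "matmul"): "fused_multihead_attention",
}

def discover_patterns(ops, rules):
    """Greedily match the longest rule pattern at each graph position."""
    matches, i = [], 0
    while i < len(ops):
        for pattern in sorted(rules, key=len, reverse=True):
            if tuple(ops[i:i + len(pattern)]) == pattern:
                matches.append((i, pattern, rules[pattern]))
                i += len(pattern)  # consume the matched subgraph
                break
        else:
            i += 1  # no rule applies here; move on
    return matches

found = discover_patterns(minigpt_block, A100_RULES)
for start, pattern, kernel in found:
    print(f"ops[{start}:{start + len(pattern)}] {pattern} -> {kernel}")
```

On this toy graph the matcher recovers exactly the two fusion opportunities the article credits for the MiniGPT speedup: the attention subgraph and the MLP GEMM+GELU pair. A production system would operate on a real traced graph (e.g. via torch.fx) rather than a string list.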
- Three-stage agentic workflow: pattern discovery via LLM, CUTLASS kernel realization with auto-tuning, and pattern composition into a single module
- Achieved 2.79x end-to-end speedup on a MiniGPT block by fusing multi-head attention with MLP GEMM+GELU on NVIDIA A100
- Auto-tuned CUTLASS kernels improved over PyTorch cuBLAS baseline by 1.06x–1.18x on three GEMM workloads
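The second stage's parameter sweep over the CUTLASS hierarchy can be sketched as an exhaustive search over tile shapes and pipeline depths. The tile values below are plausible CUTLASS configurations, but the search space and the stand-in cost model are illustrative assumptions; the real tuner compiles each candidate extension, verifies it, and measures wall-clock time on the GPU.

```python
import itertools

# Hypothetical sketch of FACT's stage-two auto-tuning (not the authors'
# implementation): sweep tile parameters from the CUTLASS hierarchy and
# keep the fastest configuration.

THREADBLOCK_TILES = [(128, 128, 32), (128, 256, 32), (256, 128, 32)]
WARP_TILES = [(64, 64, 32), (64, 32, 32)]
PIPELINE_STAGES = [2, 3, 4]

def benchmark(config, problem=(4096, 4096, 4096)):
    """Stand-in for compiling the PyTorch extension and timing the kernel.
    A real tuner would instantiate the CUTLASS template, check numerical
    correctness against the cuBLAS baseline, then measure runtime."""
    (tb_m, tb_n, _), (w_m, w_n, _), stages = config
    m, n, _ = problem
    # Toy cost model only: fewer threadblocks and deeper software
    # pipelines score as "faster". Real performance must be measured.
    blocks = (m // tb_m) * (n // tb_n)
    return blocks / stages + (tb_m * tb_n) / (w_m * w_n)

def autotune():
    space = itertools.product(THREADBLOCK_TILES, WARP_TILES, PIPELINE_STAGES)
    return min(space, key=benchmark)

best = autotune()
print("best config:", best)
```

The exhaustive product here is small (18 candidates); at realistic search-space sizes a tuner would prune with heuristics or prior measurements before timing candidates on hardware.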
Why It Matters
FACT automates GPU kernel optimization for PyTorch, reducing the specialized expertise required and enabling significant performance gains for production models.