Moonshot open-sourced FlashKDA: CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20
CUTLASS kernels for Kimi Delta Attention achieve up to a 2.22x speedup over the Triton baseline
Moonshot AI has open-sourced FlashKDA, a CUTLASS-based C++ implementation of the forward kernel for Kimi Delta Attention (KDA), the linear attention variant introduced in the Kimi Linear paper. The kernel is designed to replace the existing Triton path in the flash-linear-attention (FLA) library, integrating via FLA pull request #852, with the goal of closing the gap between linear attention's theoretical scaling and realized GPU performance, particularly on Hopper architectures such as the H20. Benchmarks on H20 show substantial speedups over the Triton baseline: 1.72x for fixed-length sequences (T=8192, H=96, D=128), 1.95x for variable-length sequences, and up to 2.22x for mixed-length sequences (uniform 1024x8).
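For intuition, KDA belongs to the gated delta-rule family of linear attention: a fast-weight state matrix is decayed with fine-grained per-channel gates and updated with a rank-1 delta-rule write at each token, a recurrence the CUTLASS kernel evaluates in chunked form. The sketch below is a naive per-token PyTorch reference for a simplified recurrence of this kind; `gated_delta_rule_reference` is an illustrative name, not FLA's API, and the exact KDA gating and normalization differ in detail.

```python
import torch

def gated_delta_rule_reference(q, k, v, beta, g):
    """Naive O(T*D^2) recurrent reference for a simplified gated delta rule.

    Shapes (single head, for clarity):
      q, k, v : (T, D)  query / key / value streams
      beta    : (T,)    per-token write strength in [0, 1]
      g       : (T, D)  per-channel decay gates in [0, 1]

    The state S is a (D, D) fast-weight matrix: each step decays S
    channel-wise along the key dimension, applies a rank-1 delta-rule
    correction toward v_t, then reads out o_t = S q_t.
    """
    T, D = q.shape
    S = q.new_zeros(D, D)                 # fast weights: value-dim x key-dim
    outputs = []
    for t in range(T):
        S = S * g[t]                      # fine-grained (diagonal) decay of the state
        pred = S @ k[t]                   # current prediction for key k_t
        S = S + beta[t] * torch.outer(v[t] - pred, k[t])  # delta-rule write
        outputs.append(S @ q[t])          # read out with the query
    return torch.stack(outputs)

# Tiny smoke test with random inputs.
T, D = 16, 8
q, k, v = (torch.randn(T, D) for _ in range(3))
beta, g = torch.rand(T), torch.rand(T, D)
print(gated_delta_rule_reference(q, k, v, beta, g).shape)  # torch.Size([16, 8])
```

The chunked kernel computes the same sequence of outputs without materializing the per-token loop, which is where the CUTLASS implementation earns its speedup over the Triton baseline.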
FlashKDA is MIT licensed and requires SM90+ GPUs, CUDA 12.9+, and PyTorch 2.4+. It currently implements only the forward pass, however, which limits its usefulness for training. The benchmarks were run on the H20, a China-specific Hopper variant, so absolute numbers may differ on H100 or Blackwell GPUs, though the relative speedups are expected to hold. The release could accelerate adoption of linear attention architectures by making them more hardware-efficient, but a backward-pass kernel is still needed for full training support.
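Given those requirements, an integration would typically probe the runtime before selecting the CUTLASS path and fall back to the Triton kernels otherwise. A minimal sketch of such a guard using only stock PyTorch APIs; `flashkda_supported` is a hypothetical helper for illustration, not part of FLA or FlashKDA:

```python
import torch

def flashkda_supported() -> bool:
    """Best-effort check for the stated FlashKDA hardware requirements:
    an SM90+ GPU (Hopper or newer) and CUDA 12.9+.

    flashkda_supported is an illustrative helper, not part of FLA; a real
    integration would fall back to the Triton path when this returns False.
    """
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    major, _minor = torch.cuda.get_device_capability()  # e.g. (9, 0) on H100/H20
    cuda_version = tuple(int(x) for x in torch.version.cuda.split(".")[:2])
    return major >= 9 and cuda_version >= (12, 9)

print("FlashKDA CUTLASS path usable here:", flashkda_supported())
```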
- FlashKDA achieves up to a 2.22x speedup over the Triton baseline on H20 (mixed-length sequences)
- CUTLASS C++ implementation integrated into FLA via pull request #852; requires SM90+ GPUs, CUDA 12.9+, PyTorch 2.4+
- MIT licensed, but currently forward-pass only, limiting training use cases
Why It Matters
Brings linear attention closer to practical hardware efficiency, potentially enabling faster inference for long-context models.