Open Source

Moonshot open-sources FlashKDA, speeding up KDA forward kernels by up to 2.22x on H20

CUTLASS kernels for Kimi Delta Attention achieve up to 2.22x speedup over Triton baseline

Deep Dive

Moonshot AI has open-sourced FlashKDA, a CUTLASS-based C++ implementation of the forward kernel for Kimi Delta Attention (KDA), the linear attention variant introduced in the Kimi Linear paper. The kernel is designed to replace the existing Triton-based path in the flash-linear-attention (FLA) library and is integrated via FLA pull request #852. The goal is to close the gap between linear attention's theoretical linear scaling and its measured GPU performance, particularly on Hopper GPUs such as the H20. Reported H20 benchmarks against the Triton baseline show speedups of 1.72x for fixed-length sequences (T=8192, H=96, D=128), 1.95x for variable-length sequences, and up to 2.22x for the uniform 1024x8 case.
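To make the benchmark shapes concrete: variable-length attention kernels commonly take sequences packed end to end, plus a list of cumulative offsets marking where each sequence starts. The sketch below assumes that "1024x8" means eight sequences of 1024 tokens each, and borrows the name cu_seqlens from FLA's variable-length convention. The helper build_cu_seqlens is hypothetical and is not part of FlashKDA.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return cumulative offsets [0, l0, l0+l1, ...] for packed sequences."""
    if any(n <= 0 for n in seq_lens):
        raise ValueError("sequence lengths must be positive")
    return [0, *accumulate(seq_lens)]


# Assumed reading of the "uniform 1024x8" case: eight 1024-token sequences.
uniform = build_cu_seqlens([1024] * 8)

# The last offset is the total number of packed tokens.
total_tokens = uniform[-1]
```

Under that reading, the uniform case packs 8192 tokens in total, the same token count as the fixed-length T=8192 benchmark.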

FlashKDA is MIT-licensed and requires an SM90+ GPU, CUDA 12.9+, and PyTorch 2.4+. It currently implements only the forward pass, which restricts its use in training. The benchmarks were run on the H20, a Hopper variant built for the Chinese market, so absolute numbers may differ on H100 or Blackwell GPUs, though relative speedups are expected to be similar. The release could accelerate adoption of linear attention architectures by making them more hardware-efficient, but full training support still requires a backward-pass kernel.
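The stated requirements can be checked before attempting a build. The following is a minimal sketch, not FlashKDA's own check: the function names are hypothetical, and the thresholds are taken directly from the requirements quoted above.

```python
import re

MIN_SM = (9, 0)     # SM90+ (compute capability 9.0 or newer)
MIN_CUDA = (12, 9)  # CUDA 12.9+
MIN_TORCH = (2, 4)  # PyTorch 2.4+


def parse_version(text):
    """Extract (major, minor) from strings such as '2.4.1+cu129' or '12.9'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def unmet_requirements(sm, cuda_version, torch_version):
    """Return a list of unmet requirements; an empty list means all pass."""
    problems = []
    if tuple(sm) < MIN_SM:
        problems.append(f"GPU is SM{sm[0]}{sm[1]}, need SM90+")
    # cuda_version is None on CPU-only PyTorch builds.
    if cuda_version is None or parse_version(cuda_version) < MIN_CUDA:
        problems.append(f"CUDA is {cuda_version}, need 12.9+")
    if parse_version(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch is {torch_version}, need 2.4+")
    return problems
```

With PyTorch installed, the three arguments can be taken from torch.cuda.get_device_capability(), torch.version.cuda, and torch.__version__.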

Key Points
  • FlashKDA achieves up to 2.22x speedup over the Triton baseline on H20 (uniform 1024x8 case), with 1.72x for fixed-length and 1.95x for variable-length sequences
  • CUTLASS C++ implementation integrated via FLA pull request #852, requiring SM90+, CUDA 12.9+, PyTorch 2.4+
  • MIT-licensed, but currently forward-pass only, limiting training use cases

Why It Matters

FlashKDA brings linear attention closer to practical hardware efficiency, potentially enabling faster inference for long-context models.
