Open Source

Kimi just published a paper replacing residual connections in transformers. Results look legit

Kimi's drop-in replacement for residual connections boosts reasoning scores while adding under 2% inference latency.

Deep Dive

Moonshot AI, the company behind the Kimi chatbot, has published a research paper introducing 'Attention Residuals,' a novel architectural component designed to replace the standard residual connections that transformers have relied on since the architecture debuted in 2017 (the technique itself dates back to ResNet in 2015). The core innovation addresses the 'dilution problem,' where information from earlier layers becomes progressively weaker as it passes through many additive residual connections. Instead of simply summing all previous layer outputs, Attention Residuals allow each new layer to use a learned attention mechanism to selectively weight and combine the outputs from all preceding layers. This lets the model dynamically decide which earlier information is most relevant to the current computation.
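
The paper's exact formulation isn't reproduced here, so the following is only a minimal sketch of the idea as described: each layer scores every preceding layer's output with a learned query/key comparison and mixes them into its own output, instead of relying on a plain additive skip. All class, method, and parameter names below are hypothetical.

```python
import torch
import torch.nn as nn

class AttentionResidual(nn.Module):
    """Hypothetical sketch of an attention-based residual: rather than
    adding the previous activation, the layer scores every preceding
    layer's output and combines them with learned weights."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)  # scores come from the current output...
        self.key = nn.Linear(d_model, d_model)    # ...compared against each earlier output
        self.scale = d_model ** -0.5

    def forward(self, current: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # history holds every preceding layer's output, each (batch, seq, d_model)
        stack = torch.stack(history, dim=2)             # (batch, seq, n_layers, d_model)
        q = self.query(current).unsqueeze(2)            # (batch, seq, 1, d_model)
        k = self.key(stack)                             # (batch, seq, n_layers, d_model)
        scores = (q * k).sum(dim=-1) * self.scale       # one score per earlier layer, per token
        weights = scores.softmax(dim=-1).unsqueeze(-1)  # (batch, seq, n_layers, 1)
        mixed = (weights * stack).sum(dim=2)            # learned weighted combination
        return current + mixed                          # replaces the plain additive skip
```

In a full transformer loop, each layer's output would be appended to `history` so that later layers can attend back to it.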

Initial benchmark results are substantial, showing improvements of 3 to 7.5 points on graduate-level exams, mathematical reasoning, and code generation. The team also developed a more efficient 'Block Attention Residual' variant, which groups layers into blocks and applies the new attention mechanism only between blocks. This version retains most of the performance benefit while keeping training overhead under 4% and the inference latency increase below 2%, and it even achieves roughly 1.25x compute savings. Crucially, the authors state the modification is a drop-in replacement for existing residual modules, requiring only a swap and retraining rather than a full architectural redesign.
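
Again as a rough, assumption-laden sketch rather than the paper's actual code: the block variant could be wired with ordinary additive residuals inside each block and the attention mix (the `AttentionResidual` sketch above) invoked only at block boundaries. Attending over one entry per block instead of one per layer is plausibly what keeps the overhead so low.

```python
class BlockAttentionResidual(nn.Module):
    """Hypothetical sketch of the block variant: plain additive residuals
    inside each block, attention mixing only at block boundaries."""

    def __init__(self, layers: nn.ModuleList, block_size: int, d_model: int):
        super().__init__()
        self.layers = layers            # sublayers mapping (batch, seq, d_model) -> same shape
        self.block_size = block_size
        self.mixer = AttentionResidual(d_model)  # the sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        block_outputs = [x]  # one entry per block boundary, not per layer
        for i, layer in enumerate(self.layers):
            x = x + layer(x)  # ordinary residual within the block
            if (i + 1) % self.block_size == 0:
                x = self.mixer(x, block_outputs)  # attend only across blocks
                block_outputs.append(x)
        return x
```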

The work joins an emerging line of research rethinking transformer fundamentals, contrasting with other recent approaches like DeepSeek's mHC method, which adds parallel computation streams. Early analysis suggests Kimi's Attention Residuals require about one-sixth the memory bandwidth of DeepSeek's method while delivering similar or better results. As AI researcher Andrej Karpathy noted, this suggests attention mechanisms may have broader applicability within transformer architectures than previously assumed. For the open-weight model community, successful adoption could mean meaningful quality gains from architectural innovation alone, rather than from simply scaling parameter counts.

Key Points
  • Addresses the 'dilution problem' in transformers by replacing additive residual connections with selective, learned attention between layers.
  • Shows 3-7.5 point improvements on reasoning benchmarks and ~1.25x compute savings with under 2% added inference latency.
  • Designed as a drop-in replacement, contrasting with more invasive architectural overhauls like DeepSeek's mHC approach.

Why It Matters

This could enable the next generation of open-weight models to be significantly more capable without increasing their size or computational cost.