Attention is all you need: Kimi replaces residual connections with attention
Moonshot AI's Kimi introduces a novel transformer layer that learns which previous layers to pay attention to.
Moonshot AI, the company behind the popular Kimi Chat assistant, has proposed a significant architectural shift in a new research paper. The work challenges a foundational component of modern neural networks: the residual connection. Since ResNets popularized them in 2015, residual connections have allowed gradients to flow through deep networks by simply adding a layer's output to its input, treating all previous information equally. Kimi's approach replaces this with a cross-layer attention mechanism, enabling each layer to dynamically decide *which* earlier layers to attend to and retrieve specific information from, rather than taking a uniform sum.
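To make the contrast concrete, here is a minimal PyTorch sketch of the idea, assuming each token attends over its own hidden states from all earlier layers; the class name, shapes, and details are hypothetical illustrations, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class CrossLayerAttention(nn.Module):
    """Hypothetical sketch: each token attends over its own hidden states
    from all earlier layers, instead of summing them via a residual add."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, current: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # current: (batch, seq, d_model), the current layer's output
        # history: outputs of all earlier layers (plus the embedding)
        stack = torch.stack(history + [current], dim=2)  # (batch, seq, L, d)
        q = self.q(current).unsqueeze(2)                 # (batch, seq, 1, d)
        k, v = self.k(stack), self.v(stack)              # (batch, seq, L, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # (batch, seq, 1, L)
        weights = attn.softmax(dim=-1)                   # learned mix over layers
        return (weights @ v).squeeze(2)                  # (batch, seq, d)
```

Where a standard transformer would update with `x = x + layer(x)`, this sketch replaces the uniform add with a learned, per-token weighting over the whole stack of earlier outputs.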
This modification, which the authors frame as applying the "attention is all you need" principle to the layer dimension itself, shows promising empirical results. Scaling-law experiments, which measure how model performance improves with increased compute budget, show that architectures using this selective layer attention match the performance of standard transformers with roughly a 1.25x compute-efficiency gain, i.e. about 20% less compute for the same loss. The advantage holds consistently across the model scales tested, suggesting a robust improvement rather than a small-scale artifact. The research positions this as a more sophisticated alternative to other recent innovations such as DeepSeek's MLA (Multi-head Latent Attention), which also rethinks core transformer components.
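In scaling-law terms, a 1.25x efficiency gain means reaching the same loss with the compute budget divided by 1.25. A quick back-of-the-envelope check (the FLOP figure below is hypothetical):

```python
# Hypothetical budget: what a 1.25x compute-efficiency gain buys.
baseline_flops = 1e23                    # compute to reach a target loss
efficient_flops = baseline_flops / 1.25  # same loss with the new architecture
savings = 1 - efficient_flops / baseline_flops
print(f"Compute saved: {savings:.0%}")   # -> Compute saved: 20%
```

That 20% reduction in compute for the same result is the 25% efficiency boost cited below.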
The implications are practical for AI developers and companies. A 1.25x compute-efficiency gain translates directly to reduced training costs and a lower carbon footprint for developing large language models. It also represents a meaningful step in evolving the transformer architecture beyond its 2017 blueprint, moving from uniform, hard-coded connections to fully learned, dynamic data pathways throughout the network's depth.
- Replaces standard residual connections with a cross-layer attention mechanism, allowing layers to selectively retrieve information.
- Demonstrates a consistent 1.25x compute efficiency advantage in scaling law experiments across model sizes.
- Represents a fundamental shift from static, uniform layer connections to fully dynamic, learned information pathways.
Why It Matters
A 25% boost in training efficiency lowers costs and energy use for developing new AI models, making advanced R&D more accessible.