Residual connections haven't changed for 10 years and Kimi just replaced them with attention
The new architecture matches the loss of a baseline trained with 1.25x more compute, and a 48B-parameter Kimi model built on it shows across-the-board benchmark gains.
Moonshot AI has introduced a fundamental architectural upgrade called 'Attention Residuals' for its Kimi models, challenging a decade-old standard in deep learning. Traditional residual connections, a core component since the 2015 ResNet paper, simply add a layer's output to a running sum with fixed, equal weighting. The new mechanism replaces this with a softmax attention operation: each layer generates a single learned query vector that dynamically attends to all previous layer outputs. This allows the model to perform input-dependent, selective retrieval of information it actually needs, moving from passive aggregation to active querying.
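To make the idea concrete, here is a minimal PyTorch sketch of an attention-based residual path. It assumes a per-block learned query vector, a key projection over the stored layer outputs, and a plain FFN standing in for the block's main transform; the names and exact projections are illustrative, not Moonshot's published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnResidualBlock(nn.Module):
    """Sketch: the residual path is a softmax attention read over all
    previous layer outputs instead of a plain running sum."""

    def __init__(self, d_model: int):
        super().__init__()
        # Stand-in for the block's main transform (the real model has full
        # attention/FFN sublayers; kept simple for the sketch).
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # One learned query vector per block, as described above.
        self.query = nn.Parameter(torch.randn(d_model) * d_model ** -0.5)
        # Keys projected from each stored layer output (assumed detail).
        self.k_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]):
        out = self.ffn(x)                      # this block's own output
        history = history + [out]              # keep every layer's output
        stack = torch.stack(history, dim=-2)   # (batch, seq, n_layers, d_model)
        keys = self.k_proj(stack)              # (batch, seq, n_layers, d_model)
        # Softmax over the layer axis; the weights are input-dependent
        # because the keys come from the activations themselves.
        weights = F.softmax(keys @ self.query * self.scale, dim=-1)
        mixed = (weights.unsqueeze(-1) * stack).sum(dim=-2)  # replaces x + f(x)
        return mixed, history


# Usage: a 4-block stack over random activations.
blocks = nn.ModuleList([AttnResidualBlock(512) for _ in range(4)])
x = torch.randn(2, 16, 512)
history = [x]                  # treat the embedding output as "layer 0"
for blk in blocks:
    x, history = blk(x, history)
print(x.shape)                 # torch.Size([2, 16, 512])
```

Because the keys are derived from the layer activations, the mixture weights vary per token and per input, which is the "active querying" behaviour described above, in contrast to the fixed equal weighting of a standard residual sum.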
Initial scaling-law experiments show the 'Block AttnRes' architecture is significantly more compute-efficient: it reaches the same loss as a conventional model trained with 1.25x more compute, effectively a 25% efficiency gain. Integrated into a 48B-parameter Kimi Linear model (3B activated parameters), it delivered across-the-board benchmark improvements: +7.5 points on the demanding GPQA-Diamond, +3.6 on math, and +3.1 on HumanEval for code. Critically, the overhead is minimal, adding less than 4% to training cost and under 2% to inference latency, which makes it a practical enhancement for production systems.
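As a rough back-of-envelope reading of those numbers (assuming the 1.25x scaling-law figure is measured in raw training FLOPs and the <4% overhead comes on top of it):

```python
baseline = 1.00                  # compute a conventional model needs to hit the target loss
attn_res = baseline / 1.25       # 0.80 -- same loss from 20% less raw compute
with_overhead = attn_res * 1.04  # worst-case <4% per-step training overhead
print(f"{with_overhead:.2f}x baseline compute")  # ~0.83x, still a clear net saving
```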
The work has drawn public comment from prominent figures such as AI researcher Andrej Karpathy. By making the flow of information through a neural network selective rather than fixed, Attention Residuals could become a new foundational building block, much as the original residual connection did. It points toward a future where models are not just larger but architecturally smarter, achieving better performance without proportional increases in compute.
- Replaces static residual connections with a dynamic attention mechanism for selective information retrieval.
- Achieves a 25% training-efficiency gain in scaling experiments (matches the loss of a baseline trained with 1.25x more compute).
- Integrated into a 48B-param Kimi model, boosting GPQA-Diamond by 7.5 points at under 2% added inference latency.
Why It Matters
This architectural shift could make advanced AI models significantly more efficient and capable, reducing training costs and improving reasoning.