Residual connections haven't changed for 10 years and Kimi just replaced them with attention
The new architecture matches the loss of a baseline trained with 1.25x more compute, and a 48B-parameter Kimi model built on it shows across-the-board benchmark gains.
Moonshot AI has introduced a fundamental architectural upgrade called 'Attention Residuals' for its Kimi models, challenging a decade-old standard in deep learning. Traditional residual connections, a core component since the 2015 ResNet paper, simply add a layer's output to a running sum with fixed, equal weighting. The new mechanism replaces this with a softmax attention operation: each layer generates a single learned query vector that dynamically attends to all previous layer outputs. This allows the model to perform input-dependent, selective retrieval of information it actually needs, moving from passive aggregation to active querying.
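To make the idea concrete, here is a minimal PyTorch sketch of an attention-based residual path. It assumes a per-block learned query vector, a key projection over the stored layer outputs, and a plain FFN standing in for the block's main transform; the names and exact projections are illustrative, not Moonshot's published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnResidualBlock(nn.Module):
    """Sketch: the residual path is a softmax attention read over all
    previous layer outputs instead of a plain running sum."""

    def __init__(self, d_model: int):
        super().__init__()
        # Stand-in for the block's main transform (the real model has full
        # attention/FFN sublayers; kept simple for the sketch).
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # One learned query vector per block, as described above.
        self.query = nn.Parameter(torch.randn(d_model) * d_model ** -0.5)
        # Keys projected from each stored layer output (assumed detail).
        self.k_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]):
        out = self.ffn(x)                      # this block's own output
        history = history + [out]              # keep every layer's output
        stack = torch.stack(history, dim=-2)   # (batch, seq, n_layers, d_model)
        keys = self.k_proj(stack)              # (batch, seq, n_layers, d_model)
        # Softmax over the layer axis; the weights are input-dependent
        # because the keys come from the activations themselves.
        weights = F.softmax(keys @ self.query * self.scale, dim=-1)
        mixed = (weights.unsqueeze(-1) * stack).sum(dim=-2)  # replaces x + f(x)
        return mixed, history


# Usage: a 4-block stack over random activations.
blocks = nn.ModuleList([AttnResidualBlock(512) for _ in range(4)])
x = torch.randn(2, 16, 512)
history = [x]                  # treat the embedding output as "layer 0"
for blk in blocks:
    x, history = blk(x, history)
print(x.shape)                 # torch.Size([2, 16, 512])
```

Because the keys are derived from the layer activations, the mixture weights vary per token and per input, which is the "active querying" behaviour described above, in contrast to the fixed equal weighting of a standard residual sum.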
Initial scaling-law experiments show the 'Block AttnRes' architecture is significantly more compute-efficient: it reaches the same loss as a conventional model trained with 1.25x more compute, effectively a 25% efficiency gain. Integrated into a 48B-parameter Kimi Linear model (3B activated parameters), it delivered across-the-board benchmark improvements: +7.5 points on the demanding GPQA-Diamond, +3.6 on math, and +3.1 on HumanEval for code. Critically, the overhead is minimal, adding less than 4% to training cost and under 2% to inference latency, which makes it a practical enhancement for production systems.
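As a rough back-of-envelope reading of those numbers (assuming the 1.25x scaling-law figure is measured in raw training FLOPs and the <4% overhead comes on top of it):

```python
baseline = 1.00                  # compute a conventional model needs to hit the target loss
attn_res = baseline / 1.25       # 0.80 -- same loss from 20% less raw compute
with_overhead = attn_res * 1.04  # worst-case <4% per-step training overhead
print(f"{with_overhead:.2f}x baseline compute")  # ~0.83x, still a clear net saving
```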
The work has drawn public comment from prominent figures such as AI researcher Andrej Karpathy. By making the flow of information through a neural network selective rather than fixed, Attention Residuals could become a new foundational building block, much as the original residual connection did. It points toward a future where models are not just larger but architecturally smarter, achieving better performance without proportional increases in compute.
- Replaces static residual connections with a dynamic attention mechanism for selective information retrieval.
- Achieves a 25% training-efficiency gain in scaling experiments (matches the loss of a baseline trained with 1.25x more compute).
- Integrated into a 48B-param Kimi model, boosting GPQA-Diamond by 7.5 points at under 2% added inference latency.
Why It Matters
This architectural shift could make advanced AI models significantly more efficient and capable, reducing training costs and improving reasoning.