Qwen 3.5 Architecture Analysis: Parameter Distribution in the Dense 27B vs. 122B/35B MoE Models
Architecture analysis reveals how a smaller 27B model can outperform 35B and 122B Mixture of Experts alternatives.
Alibaba's Qwen team has launched three new large language models in the Qwen 3.5 series, prompting architectural analysis of why the smaller 27B-parameter dense model can compete with larger Mixture of Experts (MoE) variants. The lineup comprises two MoE architectures, the 122B A10 and the 35B A3 (the "A" figure denotes billions of activated parameters per token, following Qwen's usual naming convention), alongside a traditional dense 27B model. All three share a core design that interleaves Gated DeltaNet layers with gated attention layers, each followed by a feed-forward network (FFN). The key finding is that the dense model distributes its parameters fundamentally differently, allowing it to punch above its weight class on benchmarks despite its smaller total parameter count.
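A minimal sketch of that interleaved stack follows, assuming PyTorch. The mixer below is a placeholder, not a real Gated DeltaNet or gated-attention implementation, and the 3:1 DeltaNet-to-attention ratio in the schedule is an assumption; the source does not state the exact interleave.

```python
# Minimal sketch of the interleaved stack described above (assuming PyTorch).
# PlaceholderMixer stands in for both mixer types -- a real model would put a
# Gated DeltaNet or gated softmax attention implementation in that slot.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlaceholderMixer(nn.Module):
    """Stand-in token mixer; not a real Gated DeltaNet / gated-attention layer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

class FFN(nn.Module):
    """Simple two-matrix feed-forward network with SiLU activation."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))

class Block(nn.Module):
    """One layer: a token mixer followed by an FFN, each behind a residual."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mixer = PlaceholderMixer(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = FFN(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))  # DeltaNet-or-attention slot
        x = x + self.ffn(self.norm2(x))    # every mixer is followed by an FFN
        return x

def layer_schedule(n_layers: int, attention_every: int = 4) -> list:
    """Assumed interleave: one gated-attention layer per `attention_every`
    layers, the rest Gated DeltaNet. The real ratio is not public."""
    return ["attention" if (i + 1) % attention_every == 0 else "deltanet"
            for i in range(n_layers)]

print(layer_schedule(8))  # ['deltanet', 'deltanet', 'deltanet', 'attention', ...]

stack = nn.Sequential(*[Block(d_model=512, d_ff=2048) for _ in range(8)])
x = torch.randn(1, 16, 512)   # (batch, sequence, d_model)
print(stack(x).shape)         # torch.Size([1, 16, 512])
```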
Technical analysis quantifies the divergence: the MoE models dedicate 61-63% of their parameters to FFN experts, while the dense 27B model allocates only 34%. That leaner FFN budget lets the dense model redistribute capacity to other critical areas. Specifically, it stacks 64 layers versus the MoE models' 40-48, providing greater depth for complex reasoning. It also uses 4 key/value heads in its attention layers compared to the MoE models' 2, enabling more nuanced pattern capture, and deploys more attention heads in its Gated DeltaNet layers. Crucially, the dense model activates all of its parameters on every token, whereas the MoE models activate only a small fraction, so it makes fuller use of its total capacity. This architectural insight explains how a strategically designed dense model can rival or exceed larger, more parameter-heavy MoE systems.
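A back-of-the-envelope check makes the split concrete. The 34% and 61-63% FFN shares and the total parameter counts come from the analysis above; reading the A10/A3 suffixes as 10B and 3B activated parameters is assumed here rather than stated in the text.

```python
# Back-of-the-envelope check of the parameter splits quoted above, plain Python.
# Shares and totals come from the analysis; the activated counts (10B / 3B)
# are read off the A10 / A3 names, which is an assumption.
def breakdown(name: str, total_b: float, ffn_share: float, active_b: float) -> None:
    ffn = total_b * ffn_share
    rest = total_b - ffn  # attention, DeltaNet, embeddings, norms, etc.
    print(f"{name:>10}: {ffn:5.1f}B in FFN/experts, {rest:5.1f}B elsewhere, "
          f"{active_b / total_b:6.1%} of parameters active per token")

breakdown("27B dense", 27, 0.34, 27)   # a dense model activates everything
breakdown("122B A10", 122, 0.62, 10)   # 0.62 = midpoint of the quoted 61-63%
breakdown("35B A3",    35, 0.62,  3)
```

Run as-is, this prints roughly 9B of FFN weights for the dense model against about 76B of expert weights in the 122B MoE, which activates only around 8% of its parameters on any given token.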
- 27B dense model allocates only 34% of parameters to FFNs vs. 61-63% for MoE models, freeing resources for other components
- Uses 64 layers (33-60% deeper than the MoE variants' 40-48) and 4 attention key/value heads (double the MoE models' 2) for enhanced reasoning; see the sketch after this list
- Demonstrates that smarter architecture, not just more parameters, drives performance—challenging the 'bigger is better' assumption
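The key/value-head comparison in the second bullet can be checked with the standard grouped-query attention parameter count, sketched below. Only the 4-vs-2 KV-head counts come from the analysis; d_model, the query-head count, and head_dim are hypothetical placeholders.

```python
# Grouped-query attention parameter count. Only the 4-vs-2 KV-head comparison
# comes from the analysis above; all dimensions here are illustrative guesses.
def attn_proj_params(d_model: int, n_heads: int, n_kv_heads: int,
                     head_dim: int) -> int:
    """Parameters in the Q, K, V, and output projections of one
    bias-free grouped-query attention layer."""
    q_proj = d_model * n_heads * head_dim
    kv_proj = 2 * d_model * n_kv_heads * head_dim  # K and V projections
    out_proj = n_heads * head_dim * d_model
    return q_proj + kv_proj + out_proj

dense = attn_proj_params(d_model=4096, n_heads=32, n_kv_heads=4, head_dim=128)
moe = attn_proj_params(d_model=4096, n_heads=32, n_kv_heads=2, head_dim=128)
print(f"per layer: {dense/1e6:.1f}M params with 4 KV heads "
      f"vs {moe/1e6:.1f}M with 2")   # ~37.7M vs ~35.7M
```

As the numbers suggest, doubling the KV heads is cheap in parameters; the benefit the analysis points to is richer key/value diversity per token rather than raw size.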
Why It Matters
Shows efficient model design can rival larger systems, potentially reducing compute costs and enabling more accessible high-performance AI.