
Per-Layer Embeddings: A simple explanation of the magic behind the small Gemma 4 models

Google's new Gemma 4 E2B model achieves small-model efficiency through per-layer embeddings rather than a traditional MoE architecture.

Deep Dive

Google's latest Gemma 4 model family introduces a novel architecture with the gemma-4-E2B and gemma-4-E4B models, which use per-layer embeddings rather than a traditional Mixture-of-Experts (MoE) or dense architecture. These models have 5.1 billion total parameters but count only 2.3 billion as "effective" parameters, excluding the 2.8 billion embedding parameters that Google argues shouldn't count toward the compute footprint. This is a fundamental shift from MoE models like gemma-4-26B-A4B, which must load all 25.2 billion parameters into VRAM despite activating only 3.8 billion per token.
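
To put those numbers in perspective, here is a rough back-of-the-envelope calculation, a sketch rather than an official figure, of accelerator memory at bf16 (2 bytes per parameter) using the counts quoted above. The assumption that the 2.8 billion per-layer embedding parameters can stay off the accelerator is exactly the point behind the "effective parameter" framing.

```python
# Rough accelerator-memory estimate at bf16 (2 bytes per parameter).
# Parameter counts are the ones quoted above; "resident" means what must sit
# in GPU/TPU memory during inference. The claim that per-layer embedding
# tables can be kept off-accelerator is an assumption taken from the
# article's framing, not a measured figure.
BYTES_PER_PARAM = 2  # bf16

def gib(params_billion: float) -> float:
    """Convert a parameter count in billions to GiB at bf16."""
    return params_billion * 1e9 * BYTES_PER_PARAM / 2**30

print(f"E2B, all 5.1B params resident:      {gib(5.1):5.1f} GiB")
print(f"E2B, 2.8B embeddings offloaded:     {gib(2.3):5.1f} GiB")
print(f"MoE 26B-A4B, all params resident:   {gib(25.2):5.1f} GiB")
print(f"MoE 26B-A4B, activated per token:   {gib(3.8):5.1f} GiB of weights touched")
```

Under that assumption, the E2B model's resident footprint looks much closer to a 2B dense model than a 5B one, while the MoE model pays the memory cost of all 25.2 billion parameters no matter how few it activates per token.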

The key innovation lies in how embeddings are handled throughout the transformer stack. Instead of using static, position-independent embeddings at the input layer only, these models incorporate specialized embedding parameters at each transformer layer. This allows the model to maintain high quality while dramatically reducing the active parameter count during inference. Unlike MoE models, where routing networks dynamically select experts per token, the per-layer embedding approach provides consistent computational patterns that enable more predictable performance and potentially better hardware utilization when inference runs on resource-constrained devices.
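
The article doesn't spell out the exact wiring, but the general idea can be sketched in a few lines of PyTorch: in addition to the usual input embedding, each transformer layer owns its own small token-indexed embedding table, and the looked-up vector is projected and added into that layer's hidden state. All names, dimensions, and the additive combination below are illustrative assumptions, not Gemma's actual implementation.

```python
import torch
import torch.nn as nn

class PerLayerEmbeddingBlock(nn.Module):
    """One transformer layer augmented with its own token-indexed embedding
    table (an illustrative sketch, not Gemma's actual implementation)."""

    def __init__(self, vocab_size: int, d_model: int, d_ple: int):
        super().__init__()
        # Small per-layer embedding table; conceptually this is the part that
        # could live off-accelerator and be gathered row-by-row per token.
        self.ple_table = nn.Embedding(vocab_size, d_ple)
        self.ple_proj = nn.Linear(d_ple, d_model, bias=False)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # Inject this layer's own view of the raw tokens into the residual stream.
        hidden = hidden + self.ple_proj(self.ple_table(token_ids))
        normed = self.norm1(hidden)
        attn_out, _ = self.attn(normed, normed, normed)
        hidden = hidden + attn_out
        hidden = hidden + self.mlp(self.norm2(hidden))
        return hidden

# Usage: every layer receives the running hidden state plus the original token ids.
block = PerLayerEmbeddingBlock(vocab_size=32_000, d_model=512, d_ple=64)
hidden = torch.randn(1, 16, 512)               # output of the previous layer
token_ids = torch.randint(0, 32_000, (1, 16))  # the input tokens
print(block(hidden, token_ids).shape)          # torch.Size([1, 16, 512])
```

Because the per-layer lookup is indexed only by the token ids, it is a fixed, routing-free computation, which is where the contrast with MoE expert routing and the "consistent computational patterns" claim comes from.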

Key Points
  • Gemma 4 E2B has 5.1B total parameters but only 2.3B effective parameters (excluding 2.8B embedding params)
  • Uses per-layer embeddings instead of traditional MoE architecture for more predictable inference patterns
  • Enables faster inference than comparable dense models without keeping every parameter in accelerator memory, as MoE models must (see the streaming sketch below)
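
The third point is easiest to see with a streaming sketch: if the per-layer embedding tables live in host memory, each decoding step only needs to gather the handful of rows for the current tokens and copy that small slice to the accelerator. Whether Gemma 4's runtime works exactly this way is an assumption; the snippet only illustrates why the 2.8B embedding parameters need not occupy VRAM.

```python
import torch

# Assumed sizes for illustration only; not Gemma's actual configuration.
vocab_size, d_ple, num_layers = 32_000, 64, 24
device = "cuda" if torch.cuda.is_available() else "cpu"

# Full per-layer embedding tables stay in host (CPU) memory.
ple_tables = [torch.randn(vocab_size, d_ple) for _ in range(num_layers)]

token_ids = torch.randint(0, vocab_size, (1, 16))  # tokens in the current step

for layer_idx in range(num_layers):
    # Gather only the rows for the current tokens on the host (~4 KB at fp32),
    # then move that slice to the accelerator instead of the ~8 MB full table.
    rows = ple_tables[layer_idx][token_ids]   # shape (1, 16, d_ple), on CPU
    rows = rows.to(device)
    # ... the corresponding transformer layer would consume `rows` here ...
```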

Why It Matters

Enables high-quality AI inference on resource-constrained devices by dramatically reducing active parameter counts during computation.