Open Source

Gemma 4 MTP released

Up to 2x faster generation with no quality loss using speculative decoding.

Deep Dive

Google has open-sourced four Gemma 4 Multi-Token Prediction (MTP) draft models, available now on Hugging Face. These drafters are smaller, faster variants of the full Gemma 4 model, designed to work in a speculative decoding pipeline. Instead of the target model generating one token at a time, the draft model proposes several tokens ahead; the target model then verifies them in a single parallel pass, keeping the longest accepted prefix and supplying its own token at the first mismatch. This delivers up to 2x faster generation with no change in output: every emitted token is one the target model would have produced under standard autoregressive decoding.
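The propose-then-verify loop described above can be sketched in a few lines. This is a minimal illustration with toy deterministic "models" (plain next-token functions over integer IDs standing in for the Gemma 4 drafter and target), not the released implementation; the function names and token rules are invented for the example.

```python
def draft_model(context):
    # Hypothetical cheap drafter: always guesses last token + 1 (mod 50),
    # which is only sometimes what the target would say.
    return (context[-1] + 1) % 50

def target_model(context):
    # Hypothetical full model: same rule, except it emits 0 whenever the
    # context length is a multiple of 4, so drafts sometimes get rejected.
    return 0 if len(context) % 4 == 0 else (context[-1] + 1) % 50

def speculative_decode(target, draft, prompt, num_tokens, k=4):
    """Greedy speculative decoding: draft proposes k tokens, target verifies."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # 1) Drafter proposes k tokens autoregressively (cheap).
        ctx, proposal = list(out), []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies the proposals; a real pipeline scores all k
        #    positions in one batched forward pass, here we just loop.
        i = 0
        while i < k and target(out + proposal[:i]) == proposal[i]:
            i += 1
        out.extend(proposal[:i])  # keep the longest accepted prefix
        if len(out) - len(prompt) < num_tokens:
            # The target's own token comes from the same verification pass:
            # the correction on a mismatch, or a bonus token if all k matched.
            out.append(target(out))
    return out[len(prompt) :][:num_tokens]
```

Because every appended token is the target's own greedy choice for its prefix, the output is token-for-token identical to plain autoregressive decoding; the drafter only changes how many target forward passes are needed.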

The draft checkpoints come in four sizes: 31B (full), 26B-A4B (mixture-of-experts), E4B (4B efficient), and E2B (2B efficient). With their compact footprint, the smaller variants are especially suited to on-device inference and latency-sensitive applications such as real-time chat, code completion, and mobile assistants. Developers can drop them into existing Gemma 4 workflows to cut response times without retraining or sacrificing accuracy.
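The "up to 2x" figure follows from the standard speculative-decoding accounting: if the target accepts each drafted token with probability alpha and the drafter proposes k tokens per verification pass, each pass emits the accepted prefix plus one corrected or bonus token, for an expected sum_{i=0}^{k} alpha^i tokens. A back-of-the-envelope sketch, where the acceptance rate and relative drafter cost are assumed illustrative values, not measured Gemma 4 numbers:

```python
def expected_tokens_per_pass(alpha, k):
    """E[tokens emitted per target forward pass] = sum_{i=0}^{k} alpha^i."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, draft_cost=0.05):
    """Rough wall-clock speedup over plain decoding; draft_cost is the cost
    of one draft step relative to one target pass (assumed value)."""
    return expected_tokens_per_pass(alpha, k) / (1 + k * draft_cost)

# Example: 70% acceptance, 4 drafted tokens, drafter at 5% of target cost.
print(round(expected_tokens_per_pass(0.7, 4), 2))  # → 2.77 tokens per pass
print(round(estimated_speedup(0.7, 4), 2))         # → 2.31
```

Under these assumed numbers the estimate lands in the ~2x range the release claims; in practice the acceptance rate depends on how closely the drafter tracks the target, which is why purpose-trained MTP drafters help.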

Key Points
  • Google released four Gemma 4 MTP draft models (31B, 26B-A4B, E4B, E2B) on Hugging Face.
  • Speculative decoding with MTP yields up to 2x faster generation with no quality degradation.
  • Ideal for low-latency, on-device AI applications like real-time assistants and code completion.

Why It Matters

Dramatically faster inference from smaller draft models unlocks real-time AI on phones and edge devices.