Open Source

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Using a 4.65B draft model unlocks up to 50% faster code generation with zero token translation overhead.

Deep Dive

Independent benchmarks show that Google's Gemma 4 31B large language model can achieve a 29% average speedup through speculative decoding when paired with the much smaller Gemma 4 E2B (4.65B parameters) as a draft model. The technique, tested on an RTX 5090 GPU using a llama.cpp fork with TurboQuant KV cache, showed gains that scale with output predictability: code and math generation saw the largest boost at over 50%, reaching 86 tokens/second, while creative tasks still gained at least 10%. A critical discovery was that early GGUF downloads from April 2026 carried a metadata mismatch (the add_bos_token setting) between the main and draft models, forcing the system into a slow 'token translation' mode that erased the gains entirely. Re-downloading the corrected GGUF files from Unsloth resolved the issue and restored full performance.
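
Anyone unsure whether their files are affected can inspect the metadata directly. The sketch below uses the `gguf-dump` utility from the `gguf` Python package; the model file names are placeholders, and the key to compare is `tokenizer.ggml.add_bos_token`:

```bash
# Install the GGUF inspection tool (the gguf Python package from llama.cpp).
pip install gguf

# Print the BOS-token metadata from both models. Per the article, the two
# values must agree, or the system falls back to slow token translation.
# File names below are placeholders for your downloaded GGUFs.
gguf-dump gemma-4-31b-Q4_K_XL.gguf | grep add_bos_token
gguf-dump gemma-4-e2b-Q4_K_XL.gguf | grep add_bos_token
```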

For practical implementation, users must run with `--parallel 1` to avoid VRAM bloat from duplicate KV cache allocations, which can drag speed down to 7 t/s. The setup needs roughly 2.3GB of extra VRAM, bringing total usage to ~23.4GB for a 128K context on a 32GB card. The Q4_K_XL quantization (3.0GB) proved optimal for the draft model, as a higher-precision Q8 version offered no speed improvement. Even with a modest 42% draft acceptance rate on less predictable tasks, the configuration stays net positive: the E2B draft's forward passes are cheap relative to the 31B target's, and because the two models share a compatible vocabulary, no time is lost to token translation, making the technique broadly useful across query types.
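
As a concrete starting point, the launch below sketches these settings using mainline llama-server's speculative-decoding flags (`-md`, `-ngld`, `--draft-max`/`--draft-min`); model file names and layer counts are placeholders, and the fork used in the benchmarks may expose additional options for its TurboQuant KV cache:

```bash
# Sketch of a llama-server launch with speculative decoding (mainline
# llama.cpp flags; file names and layer counts are placeholders).
#   -m / -md        : target (31B) and draft (4.65B) GGUF files
#   -c 131072       : 128K context, ~23.4GB total VRAM in the article's setup
#   -ngl / -ngld    : GPU layers for target and draft (99 = offload everything)
#   --draft-max/min : how many draft tokens to propose per verification pass
#   --parallel 1    : single server slot, so the KV cache is allocated once
llama-server \
  -m gemma-4-31b-Q4_K_XL.gguf \
  -md gemma-4-e2b-Q4_K_XL.gguf \
  -c 131072 \
  -ngl 99 -ngld 99 \
  --draft-max 16 --draft-min 1 \
  --parallel 1
```

Keeping `--draft-max` modest is a reasonable default: a larger speculation window wastes more draft compute whenever acceptance drops, as it does on creative prompts.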

Key Points
  • 29% average speedup for Gemma 4 31B using Gemma 4 E2B (4.65B) as a draft model, with code generation seeing a 50.5% boost.
  • Critical fix required: Early April 2026 GGUF files had a metadata mismatch (add_bos_token) that forced slow token translation; re-downloading corrected files is essential.
  • Practical setup needs the `--parallel 1` flag to avoid KV cache bloat, uses ~23.4GB of VRAM for a 128K context, and works best with structured outputs like code.

Why It Matters

Enables significantly faster local inference for developers and researchers using open-weight models, making advanced AI more accessible and efficient on consumer hardware.