Welcome Gemma 4: Frontier multimodal intelligence on device
Four new open models with audio, image, and text processing, designed to run locally on your hardware.
Google DeepMind has launched the Gemma 4 family, an open-source release of four multimodal models now available on Hugging Face. The lineup spans the compact Gemma 4 E2B (2.3B effective parameters) and E4B (4.5B effective parameters), both of which accept audio, image, and text inputs, alongside a larger 31B dense model and a 26B Mixture-of-Experts (MoE) model. All four are released under the permissive Apache 2.0 license and are designed from the ground up for efficient local deployment, with long context windows (up to 256K tokens) and architectural optimizations for quantization and speed.
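As a rough illustration of what that local deployment looks like in practice, the sketch below loads one of the smaller checkpoints with the transformers library. The model ID is a hypothetical placeholder, not a confirmed repo name from the release; check the actual Hugging Face collection before running it.

```python
# Minimal sketch: local text generation with transformers.
# The model ID below is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-e4b-it"  # placeholder; use the real repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 4.5B-effective model light
    device_map="auto",           # place weights on GPU/CPU automatically
)

prompt = "Explain what a Mixture-of-Experts model is in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```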
Architecturally, Gemma 4 introduces several efficiency-focused innovations. It employs Per-Layer Embeddings (PLE) to feed residual signals into each decoder layer, and a Shared KV Cache that reuses key-value states across layers to eliminate redundant computation and cache memory. The vision encoder handles variable image aspect ratios and offers configurable token budgets for trading off quality against speed. Early benchmarks are promising: the 31B model achieves an estimated LMArena score of 1452 on text tasks, and the models show strong multimodal capabilities without extensive fine-tuning.
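To make the Shared KV Cache idea concrete, here is a toy sketch, assuming one shared key/value projection serving per-layer query projections. The real Gemma 4 layers are certainly more involved (multi-head attention, causal masking, normalization), so treat this purely as an illustration of why cross-layer sharing shrinks the cache.

```python
# Toy sketch (not the released implementation) of cross-layer KV sharing:
# keys/values are projected once and reused by every decoder layer, while
# each layer keeps its own query projection. Single-head, no masking.
import math
import torch
import torch.nn as nn

class SharedKVAttentionStack(nn.Module):
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        # One shared K/V projection for the whole stack ...
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # ... but a per-layer query projection.
        self.q_projs = nn.ModuleList(
            nn.Linear(d_model, d_model, bias=False) for _ in range(n_layers)
        )
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute K/V once; every layer attends over the same states, so
        # the KV cache is stored a single time instead of once per layer.
        k, v = self.k_proj(x), self.v_proj(x)
        for q_proj in self.q_projs:
            q = q_proj(x)
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
            x = x + scores.softmax(dim=-1) @ v  # residual update per layer
        return x

stack = SharedKVAttentionStack(d_model=64, n_layers=4)
print(stack(torch.randn(1, 8, 64)).shape)  # torch.Size([1, 8, 64])
```

Because k and v are computed once, a decoder with N layers stores one KV cache instead of N, which is where the memory savings at long context lengths come from.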
The release emphasizes developer accessibility and local execution. Google collaborated with the community to ensure compatibility with a vast ecosystem of tools, including transformers, llama.cpp, MLX, and WebGPU. This allows developers to easily integrate Gemma 4 into local AI agents and applications. By providing frontier-level multimodal intelligence in a truly open and portable format, Gemma 4 lowers the barrier to building sophisticated, private, on-device AI experiences without relying on cloud APIs.
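For the llama.cpp path, a local run might look like the sketch below, assuming a community GGUF quantization of the E2B model exists on disk; the file name is a placeholder, since quantized builds are typically published by the community after a release.

```python
# Minimal sketch: running a (hypothetical) GGUF build of Gemma 4 locally
# via llama-cpp-python, one of the ecosystem integrations mentioned above.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-e2b-it-Q4_K_M.gguf",  # placeholder local file
    n_ctx=8192,       # context window for this session
    n_gpu_layers=-1,  # offload all layers to GPU if one is available
)

out = llm("Q: What is on-device inference? A:", max_tokens=64)
print(out["choices"][0]["text"])
```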
- Four truly open models (Apache 2.0) ranging from 2.3B to 31B parameters, with the smaller E2B/E4B variants supporting audio, image, and text inputs.
- Engineered for on-device use with architectural optimizations like Per-Layer Embeddings (PLE) and a Shared KV Cache for efficiency and long context (up to 256K tokens).
- Achieves high benchmark scores (an estimated LMArena score of 1452 for the 31B model) and is designed for broad framework compatibility (transformers, llama.cpp, MLX) right out of the box.
Why It Matters
Gemma 4 brings powerful, private multimodal AI directly to consumer devices and developer toolkits, challenging closed-source cloud models.