Gemma 4 1B, 13B, and 27B spotted
Google's new open-source AI family can now see, with a specialized vision processor and spatial 2D RoPE encoding.
Google has officially unveiled the Gemma 4 family, marking a significant expansion of its open-source AI lineup into the multimodal domain. The new models, available in 1B, 13B, and 27B parameter variants, retain the core architecture of their predecessors but can now understand both text and images. This leap is powered by two key technical innovations: a vision processor that encodes every image into a fixed token budget, and a novel spatial 2D Rotary Position Embedding (RoPE) that encodes position separately along the height and width axes of an image, giving the model a more nuanced spatial understanding than standard 1D positional encodings.
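Google hasn't published the exact formulation, but the general idea behind a 2D RoPE can be sketched: split each attention head's channels in half, rotate one half by a patch's row index and the other by its column index, so the attention dot product becomes sensitive to relative offsets on both axes. The sketch below is illustrative only; the function names (`rope_angles`, `apply_2d_rope`) and the half-and-half channel split are assumptions, not Gemma's actual implementation.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE frequency schedule for `dim` channels (dim must be even).
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # (num_positions, dim // 2)

def rotate(x, angles):
    # Rotate consecutive channel pairs of x by the given angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_2d_rope(q, rows, cols):
    # q: (num_patches, head_dim). The first half of the channels is rotated
    # by each patch's row index, the second half by its column index.
    d = q.shape[-1] // 2
    q_h = rotate(q[..., :d], rope_angles(rows, d))
    q_w = rotate(q[..., d:], rope_angles(cols, d))
    return np.concatenate([q_h, q_w], axis=-1)

# Example: a 4x4 grid of image patches with 64-dim attention heads.
H = W = 4
rows, cols = np.indices((H, W)).reshape(2, -1)  # per-patch (row, col) indices
q = np.random.randn(H * W, 64)
print(apply_2d_rope(q, rows, cols).shape)  # (16, 64)
```

The same rotation is applied to keys, so attention scores depend on relative row and column distances rather than a single flattened patch index.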
All model checkpoints, including both pre-trained base models and instruction-tuned variants optimized for following user commands, are now publicly available on Hugging Face. The release gives developers and researchers a scalable suite of vision-language models, from the lightweight 1B version for edge deployment to the more capable 27B model. By open-sourcing these models, Google is competing directly with other multimodal offerings like LLaVA and Qwen-VL, with the distinct advantage of multiple size options to fit different computational constraints and use cases, from mobile apps to large-scale cloud services.
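If the checkpoints follow the naming and API conventions of earlier multimodal Gemma releases on Hugging Face, usage via the `transformers` library might look like the sketch below. The model ID `google/gemma-4-27b-it` is a guess at the naming pattern, not a confirmed repository, and the image URL is a placeholder.

```python
# Minimal sketch using transformers' image-text-to-text pipeline, the
# interface used for earlier multimodal Gemma checkpoints. The model ID
# is an assumed naming pattern, not a confirmed repository name.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-4-27b-it",  # hypothetical instruction-tuned checkpoint
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe what is in this image."},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```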
- Adds multimodal capability via a new vision processor that encodes each image into a fixed token budget (see the pooling sketch after this list).
- Introduces spatial 2D RoPE that encodes positions along the height and width axes, improving spatial image understanding.
- Available in three sizes (1B, 13B, 27B) with pre-trained and instruction-tuned variants on Hugging Face.
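The announcement doesn't specify how the fixed token budget is enforced. One common approach, sketched below purely as an assumption, is to pool a variable number of vision-encoder patch embeddings down to a constant token count, so every image costs the language model the same number of tokens regardless of resolution.

```python
import numpy as np

def to_fixed_token_budget(patch_embeds, budget=256):
    # Average-pool a variable number of patch embeddings down to a fixed
    # token budget. The pooling scheme and budget of 256 are illustrative
    # assumptions, not Gemma 4's documented design.
    n, _ = patch_embeds.shape
    if n <= budget:
        return patch_embeds
    # Split patches into `budget` roughly equal groups and mean-pool each.
    groups = np.array_split(patch_embeds, budget, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])

tokens = to_fixed_token_budget(np.random.randn(1024, 768), budget=256)
print(tokens.shape)  # (256, 768)
```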
Why It Matters
Provides developers with free, scalable open-source models for building AI applications that can see and understand visual content.