Open Source

Gemma 4 looks like it will arrive soon 💎💎💎💎

⚡Leaked GitHub PRs reveal Google's upcoming Gemma 4 model with vision capabilities and three size variants.

Deep Dive

Google's next-generation open-weight AI model, Gemma 4, has been revealed through code leaks in major open-source repositories. Pull requests submitted to the llama.cpp and Hugging Face Transformers GitHub projects contain references to the new multimodal model family. The leaks confirm Gemma 4 will be available in three parameter sizes: 1B, 13B, and 27B, offering a range of efficiency and capability options for developers. While the core architecture remains similar to previous Gemma versions, the key advancement is the addition of vision capabilities, transforming it from a text-only model into a true multimodal system.

The technical details in the PRs highlight two major innovations. First, Gemma 4 includes a new vision processor that encodes each input image within a fixed token budget, a crucial technique for keeping the computational cost of image inputs predictable. Second, it implements spatial 2D Rotary Position Embedding (RoPE), a method that encodes positions along both the height and width axes, which is essential for understanding spatial relationships in images. The initial implementation appears focused on 'dense' model architectures, suggesting a separate version for Mixture-of-Experts (MoE) models may follow. The leak points to an imminent release, positioning Gemma 4 to compete directly with other open multimodal models like LLaVA and Qwen-VL.
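A fixed token budget is commonly enforced by downscaling an image until its patch grid fits the budget. Here is a minimal sketch of that idea; the patch size and budget values are illustrative, not Gemma's actual configuration:

```python
import math

def fit_to_token_budget(height, width, patch_size=14, budget=256):
    """Scale an image so its patch grid yields at most `budget` tokens.

    Hypothetical helper for illustration; patch_size=14 and budget=256
    are placeholder values, not Gemma's real settings.
    """
    patches_h = math.ceil(height / patch_size)
    patches_w = math.ceil(width / patch_size)
    tokens = patches_h * patches_w
    if tokens <= budget:
        return height, width, tokens
    # Shrink both axes by the same factor so aspect ratio is preserved
    # and the resulting grid is guaranteed to fit the budget.
    scale = math.sqrt(budget / tokens)
    grid_h = max(1, math.floor(patches_h * scale))
    grid_w = max(1, math.floor(patches_w * scale))
    return grid_h * patch_size, grid_w * patch_size, grid_h * grid_w
```

A 224x224 image at patch size 14 already yields exactly 256 tokens, so it passes through untouched; larger images are resized down until the grid fits.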

Key Points
  • Leaked via GitHub PRs in llama.cpp and Hugging Face Transformers repositories
  • Comes in three sizes: 1B, 13B, and 27B parameters for varied use cases
  • Features new vision processor with 2D spatial RoPE for image understanding
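The 2D spatial RoPE mentioned above can be pictured as splitting each feature vector in half and rotating one half by the patch's row index and the other by its column index. A simplified NumPy sketch of the general technique, not Gemma's exact formulation:

```python
import numpy as np

def rope_2d(x, positions_h, positions_w, base=10000.0):
    """Apply 2D rotary position embedding to feature vectors `x`.

    First half of the last dim is rotated by the row position,
    second half by the column position. Illustrative sketch only.
    """
    d = x.shape[-1]
    assert d % 4 == 0, "feature dim must be divisible by 4"
    half = d // 2

    def rotate(vec, pos):
        # Treat vec's last dim as half//2 (cos, sin) pairs and rotate
        # each pair by an angle proportional to its position.
        freqs = base ** (-np.arange(0, half, 2) / half)   # (half//2,)
        angles = pos[..., None] * freqs                   # (..., half//2)
        cos, sin = np.cos(angles), np.sin(angles)
        v1, v2 = vec[..., 0::2], vec[..., 1::2]
        out = np.empty_like(vec)
        out[..., 0::2] = v1 * cos - v2 * sin
        out[..., 1::2] = v1 * sin + v2 * cos
        return out

    return np.concatenate(
        [rotate(x[..., :half], positions_h),
         rotate(x[..., half:], positions_w)], axis=-1)
```

Because the operation is a pure rotation, it preserves vector norms, and a patch at position (0, 0) is left unchanged; only the relative angles between patches encode where they sit in the image.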

Why It Matters

Brings powerful, open-source multimodal AI to developers, enabling image-based applications without relying on closed APIs.