Open Source

Google's Gemma 4 Unified model teased in llama.cpp code merge

New Gemma variant hints at transformer-less vision tower for multimodal AI

Deep Dive

A recently merged pull request in the llama.cpp GitHub repository (#24077) has drawn attention across the AI community for introducing a new model type: 'Gemma 4 Unified'. Although the PR title gives little description, the code changes show that the llama.cpp team has implemented support for an as-yet-unreleased Google model. Support landing ahead of release suggests Google is coordinating with the open-source inference project so that Gemma 4 is compatible at launch.

The most intriguing detail is a comment in the code stating 'this is a transformer-less vision tower, the params below are redundant but set to avoid error'. This implies that Gemma 4 Unified will be a multimodal model whose vision encoder does not rely on traditional transformer layers, potentially using a different mechanism such as CNNs or state-space models for image processing. That would be a notable departure from earlier Gemma families: Gemma 2 was text-only, and most Gemma 3 models accept image input through a transformer-based SigLIP vision encoder. Developers are awaiting official details from Google, as this could represent a major step forward in open-weight multimodal AI.

Key Points
  • Pull request #24077 on llama.cpp adds support for 'Gemma 4 Unified', indicating an upcoming Google release
  • Code comment reveals a 'transformer-less vision tower', suggesting a novel multimodal architecture
  • Early integration with llama.cpp hints at immediate open-source inference support at launch

Why It Matters

Google's next open-weight model could bring a non-transformer vision architecture to multimodal AI, with llama.cpp inference support available from launch