Turns out Gemma 4 had MTP (multi-token prediction) all along
A developer discovered hidden multi-token prediction heads in Gemma 4's model files, a finding Google later confirmed.
A developer working with Google's Gemma 4 9B model on Android stumbled upon a significant discovery. While integrating the model via the LiteRT API, they encountered loading errors complaining that the "mtp weights" had an incompatible tensor shape. Further investigation revealed additional multi-token prediction (MTP) heads inside the model's files. MTP is a technique that lets a model predict several future tokens simultaneously, and it is a key component of speculative decoding, which can dramatically accelerate text generation.
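To make the mechanism concrete, here is a minimal sketch of what MTP heads generally look like: extra output projections that predict tokens at several future offsets from the same hidden state. Every name and size below is hypothetical and illustrative; this does not reflect Gemma 4's actual internals or the LiteRT API.

```python
# Hedged sketch of multi-token prediction (MTP) heads: each extra head
# predicts a token at a different future offset from the shared trunk's
# hidden state. All names/sizes are hypothetical, not Gemma's internals.
import torch
import torch.nn as nn

class MTPHeads(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_future: int = 3):
        super().__init__()
        # Head k predicts the token at position t + 1 + k.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(num_future)]
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_size) from the shared trunk.
        # Returns (num_future, batch, seq, vocab_size): one logit
        # distribution per future offset, all from a single pass.
        return torch.stack([head(hidden) for head in self.heads])

heads = MTPHeads(hidden_size=512, vocab_size=32000, num_future=3)
h = torch.randn(1, 10, 512)          # dummy trunk output
logits = heads(h)                    # (3, 1, 10, 32000)
draft = logits[:, 0, -1].argmax(-1)  # 3 draft tokens after the last position
```

Because all the draft distributions come out of one forward pass of the shared trunk, the extra heads add almost no latency of their own.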
The finding was later confirmed by a Google employee in a Hugging Face discussion, who stated the MTP capability was "removed on purpose" from the final release to ensure "compatibility and broad usability." The revelation has sparked discussion in the AI community, since it means the publicly available Gemma 4 runs slower than its architecture allows. The discovery follows earlier community disappointment over the non-release of a leaked 124B-parameter version of Gemma. Some now question whether the disabled components could be reverse-engineered from the compute graph to unlock the model's full, faster potential.
- A developer found hidden Multi-Token Prediction (MTP) heads in Google's Gemma 4 9B model files, confirmed by a Google employee.
- Google intentionally disabled the MTP feature before public release, prioritizing compatibility and broad usability over raw speed.
- The disabled feature means Gemma 4's text generation is artificially slower, forgoing the speed boost of speculative decoding (sketched below).
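For intuition on the speed boost being left on the table, here is a hedged sketch of greedy self-speculative decoding: the MTP heads cheaply propose draft tokens, and the main head verifies them, keeping the matching prefix. The functions are hypothetical stand-ins, not a real Gemma or LiteRT API, and a real implementation scores all draft positions in one batched forward pass rather than a loop.

```python
# Hedged sketch of greedy self-speculative decoding with MTP drafts.
# Hypothetical stand-in functions, not a real Gemma/LiteRT API; a real
# system verifies all drafts in ONE batched pass, not a Python loop.
from typing import Callable, List

def speculative_step(
    context: List[int],
    draft_tokens: List[int],                      # cheap proposals from MTP heads
    main_next_token: Callable[[List[int]], int],  # main head: greedy next token
) -> List[int]:
    accepted: List[int] = []
    for tok in draft_tokens:
        expected = main_next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)  # take the correction and stop
            break
        accepted.append(tok)           # draft matched: a "free" token
    return accepted

# Toy model that always continues an arithmetic sequence.
model = lambda ctx: ctx[-1] + 1
print(speculative_step([1, 2, 3], [4, 5, 9], model))  # -> [4, 5, 6]
```

Greedy verification like this preserves the main head's exact outputs; the speedup comes purely from amortizing one forward pass over several accepted tokens.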
Why It Matters
This reveals a trade-off between performance and accessibility in AI model deployment, one that directly affects developers who depend on generation speed.