It looks like we’ll need to download the new Gemma 4 GGUFs
Updated GGUFs for the 2B and 26B parameter models incorporate critical token-handling fixes and improve performance.
Unsloth has published updated GGUF (GPT-Generated Unified Format) files for Google's Gemma 4 language models, specifically the 2B-parameter (E2B-it) and 26B-parameter (A4B-it) variants, on Hugging Face. These releases are direct responses to a series of critical fixes merged into the llama.cpp inference engine, the popular open-source framework for running LLMs locally. The updates address fundamental issues that were preventing Gemma 4 from functioning correctly in many local deployment scenarios, making the models finally usable for developers and researchers.
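As an illustration of the download step, here is a minimal sketch using the huggingface_hub Python client. The repository ID and filename below are hypothetical placeholders, not confirmed paths from the release; check the actual Unsloth model pages on Hugging Face for the exact names and quantization variants.

```python
# Minimal sketch: fetch a single GGUF file from Hugging Face.
# NOTE: repo_id and filename are hypothetical placeholders, not the
# actual release paths; verify them on the Unsloth model page.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-4-2b-it-GGUF",   # hypothetical repo name
    filename="gemma-4-2b-it-Q4_K_M.gguf",   # hypothetical quant file
    local_dir="./models",
)
print(f"Downloaded to: {model_path}")
```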
Key technical fixes include a CUDA patch that checks for buffer overlap before kernel fusion, resolving a critical bug that caused erroneous <unused24> tokens to appear in outputs. Other essential updates add proper byte token handling to the BPE (Byte Pair Encoding) detokenizer specifically for Gemma 4, set the "add bos" (beginning-of-sequence) flag to True, and implement a parser specialized for Gemma 4. Together, these changes align the model's tokenizer and inference logic with its architecture, fixing previously broken text generation.
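One way to sanity-check the beginning-of-sequence behavior locally is sketched below with the llama-cpp-python bindings. The model path is a placeholder carried over from the download sketch, and the exact token IDs will depend on the build and GGUF file; this is an assumption-laden check, not part of the official fix.

```python
# Sketch: verify that tokenization prepends the BOS token.
# Assumes llama-cpp-python is installed; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./models/gemma-4-2b-it-Q4_K_M.gguf", verbose=False)

tokens = llm.tokenize(b"Hello, world!", add_bos=True)
assert tokens[0] == llm.token_bos(), "BOS token should lead the sequence"
print("BOS id:", llm.token_bos(), "| first tokens:", tokens[:5])
```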
For users, this means the highly anticipated Gemma 4 models, Google's latest open-weight family, are now usable for local inference on consumer hardware. Developers can download the pre-converted GGUF files and run them with llama.cpp or compatible front-ends such as Ollama or LM Studio without hitting the earlier show-stopping bugs. That opens the door to experimenting with Gemma 4's claimed improvements in reasoning and coding performance in private, offline environments.
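As a quick end-to-end check, the sketch below runs a short chat completion through the llama-cpp-python front-end. The model path, context size, and prompt are illustrative placeholders rather than recommended settings.

```python
# Sketch: run a short chat completion against the downloaded GGUF.
# Path, context size, and prompt are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/gemma-4-2b-it-Q4_K_M.gguf",
    n_ctx=4096,          # context window; adjust to available RAM/VRAM
    n_gpu_layers=-1,     # offload all layers to GPU if one is available
    verbose=False,
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what GGUF is in one sentence."}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```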
- Unsloth released updated GGUF files for Gemma 4 2B and 26B models on Hugging Face
- Incorporates critical llama.cpp fixes, including CUDA buffer-overlap checks that eliminate <unused24> token errors
- Enables stable local inference for Gemma 4 using tools like llama.cpp and Ollama
Why It Matters
Makes Google's latest open model usable locally, enabling private, offline AI experimentation and deployment.