Open Source

Gemma 4 is seriously broken when using Unsloth and llama.cpp

Users report Gemma 4 models fail basic tasks like proofreading when run locally, despite working perfectly on Google's cloud platform.

Deep Dive

A significant performance discrepancy has emerged for users trying to run Google's latest Gemma 4 open models locally. According to reports on Reddit, when the Gemma 4 26B MoE and 31B models are quantized and run using the popular Unsloth framework and llama.cpp inference engine, they produce "seriously broken" and nonsensical outputs. A simple test—feeding a BBC news article and asking the model to list typos—resulted in gibberish, with the model failing to identify actual errors.
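
For readers who want to try the reported test themselves, here is a minimal local reproduction sketch using the llama-cpp-python bindings to llama.cpp. The GGUF filename, context size, and prompt wording are illustrative assumptions rather than details from the original reports; substitute the actual Unsloth quant you downloaded.

    # Local reproduction sketch. Assumptions: the GGUF filename and the
    # prompt are hypothetical stand-ins for whatever the Reddit posters used.
    from llama_cpp import Llama

    # Load a quantized Gemma 4 GGUF, e.g. one of Unsloth's dynamic quants.
    llm = Llama(
        model_path="gemma-4-26b-moe-UD-Q8_K_XL.gguf",  # hypothetical filename
        n_ctx=8192,        # enough context to hold a full news article
        n_gpu_layers=-1,   # offload all layers to GPU if one is available
    )

    article = open("bbc_article.txt").read()  # the text to proofread

    out = llm.create_chat_completion(
        messages=[{
            "role": "user",
            "content": f"List any typos or spelling errors in this article:\n\n{article}",
        }],
        max_tokens=512,
        temperature=0.0,  # deterministic sampling makes broken output easy to spot
    )
    print(out["choices"][0]["message"]["content"])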

This failure occurs across multiple quantization methods, including UD-Q8_K_XL, Q8_0, and UD-Q4_K_XL, suggesting the issue is not isolated to a specific compression technique. Crucially, the same models and prompts work perfectly on Google's proprietary AI Studio cloud platform, where they identify the real typos as expected. This stark contrast points to a potential problem in the local inference pipeline, possibly within the quantization process, the llama.cpp integration, or an underlying incompatibility that doesn't affect Google's controlled cloud environment.
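
To see the cloud-vs-local gap directly, the same prompt can be sent to Google's AI Studio backend with the google-genai Python SDK. This is a sketch under assumptions: the reports don't state the exact Gemma 4 model ID on AI Studio, so the ID below is a placeholder.

    # Cloud-side comparison sketch via the google-genai SDK.
    # The model ID is a placeholder; the reports don't give Gemma 4's
    # actual AI Studio identifier.
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")  # AI Studio API key

    article = open("bbc_article.txt").read()  # same article as the local test

    resp = client.models.generate_content(
        model="gemma-4-26b-it",  # hypothetical model ID
        contents=f"List any typos or spelling errors in this article:\n\n{article}",
    )
    print(resp.text)  # per the reports, the cloud-hosted model lists real typos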

Key Points
  • Google's Gemma 4 models (26B MoE, 31B) fail basic tasks like proofreading when run locally via Unsloth and llama.cpp.
  • The issue persists across multiple quantization methods (UD-Q8_K_XL, Q8_0, UD-Q4_K_XL), indicating a systemic local deployment problem.
  • The same models perform correctly on Google's AI Studio, creating a major cloud-vs-local performance gap for open-source users.

Why It Matters

This undermines the promise of open models, creating a reliability chasm between proprietary cloud APIs and local, customizable deployments that developers depend on.