Gemma 4 on llama.cpp should be stable now
Key fixes merged into llama.cpp source enable stable, high-performance Gemma 4 31B model runs.
The open-source inference engine llama.cpp has achieved a significant milestone with the successful integration of fixes for Google's Gemma 4 language models. A key pull request (#21534) has been merged into the project's master branch, resolving previously known stability and compatibility issues. This development means developers and researchers can now reliably run the Gemma 4 31B parameter model using efficient Q5 quantization directly from the latest source code, though official releases may still lag behind.
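Because the fix currently lives only on master, picking it up means building from source. A minimal sketch of the standard llama.cpp CMake build, assuming a CUDA-capable machine (`-DGGML_CUDA=ON` enables the CUDA backend and can be dropped for a CPU-only build):

```sh
# Clone the latest master, which includes the merged fix (PR #21534)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Standard CMake build; Release config with parallel compilation
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```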
For optimal performance, the community recommends specific runtime parameters. Using the interleaved chat template prepared by contributor Aldehir is crucial for correct formatting, and running with the `--cache-ram 2048` and `-ctxcp 2` flags is advised to prevent system memory issues. Early testing shows that a mixed quantization approach (Q5 for the K, or key, cache and Q4 for the V, or value, cache) does not cause significant performance degradation. One critical build warning is in effect, however: the CUDA 13.2 toolkit is confirmed broken for this use case, and developers must use an alternative version until NVIDIA resolves the issue.
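Taken together, a server launch reflecting these recommendations might look like the sketch below. The model and template filenames are placeholders, and `--cache-type-k`/`--cache-type-v` are assumed here to be how the Q5/Q4 cache mix is applied; `--cache-ram` and `-ctxcp` are copied verbatim from the community guidance above:

```sh
# Hypothetical launch; model and template filenames are placeholders.
# --jinja plus --chat-template-file loads a custom chat template;
# --cache-type-k/--cache-type-v set the KV-cache quantization mix.
./build/bin/llama-server \
  -m gemma-4-31b-q5_k_m.gguf \
  --jinja \
  --chat-template-file gemma4-interleaved.jinja \
  --cache-ram 2048 \
  -ctxcp 2 \
  --cache-type-k q5_1 \
  --cache-type-v q4_0
```

Quantizing the V cache one step below the K cache is a common way to trim KV-cache memory, and the early testing cited above suggests the quality cost is negligible in this case.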
- Critical fixes (PR #21534) merged into llama.cpp source enable stable Gemma 4 31B inference.
- Requires specific runtime flags (`--cache-ram 2048 -ctxcp 2`) and a custom chat template for optimal performance.
- CUDA 13.2 is confirmed broken for building; developers must avoid it until NVIDIA provides a fix (a workaround sketch follows below).
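Until the CUDA 13.2 breakage is resolved, one practical workaround is pinning the build to an older toolkit via CMake's standard compiler override. A minimal sketch, with the 12.4 install path purely illustrative:

```sh
# Point CMake at a known-good CUDA toolkit instead of 13.2
# (the 12.4 path is an example; use whichever version is installed)
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc
cmake --build build --config Release -j
```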
Why It Matters
Enables local, efficient deployment of Google's latest open-weight model, expanding accessible AI capabilities for developers and researchers.