Open Source

Gemma 4 for 16 GB VRAM

A 26B-parameter MoE model now runs at 80+ tokens/sec on consumer GPUs, with tuned settings that preserve vision quality.

Deep Dive

Google's Gemma 4 26B A4B MoE model represents a breakthrough for local AI deployment, delivering high-performance multimodal capabilities on consumer-grade 16 GB VRAM systems. Across extensive testing of quantizations, the unsloth/gemma-4-26B-A4B-it-GGUF variant with IQ4_XS quantization emerged as the best fit, maintaining strong reasoning while staying within memory limits. Peak performance requires specific sampling parameters: --temp 0.3, --top-p 0.9, --min-p 0.1, and --top-k 20. For vision tasks, pairing the mmproj-F16.gguf projection with --image-min-tokens 300 and --image-max-tokens 512 significantly improves results without the overhead of an FP32 projector.
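Putting those settings together, a launch command might look like the sketch below. The model filename is illustrative (adjust to wherever your GGUF files live), and the --image-min-tokens/--image-max-tokens flags are taken from the report and require a llama.cpp build with multimodal support:

    llama-server \
      -m gemma-4-26B-A4B-it-IQ4_XS.gguf \      # illustrative filename
      --mmproj mmproj-F16.gguf \                # F16 projector, not FP32
      --temp 0.3 --top-p 0.9 --min-p 0.1 --top-k 20 \
      --image-min-tokens 300 --image-max-tokens 512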

Performance benchmarks show dramatic improvements over previous local models. Against Qwen 3.5 27B, Gemma 4 delivers 80+ tokens/sec versus 20 tokens/sec, a 4x speedup. The model excels at multilingual tasks, systems and DevOps work, and real-world coding with up-to-date libraries. Qwen keeps a slight edge in long-context handling, but Gemma 4's vision capabilities match or exceed Qwen 3 27B when properly configured. Contexts of 30K+ tokens are reachable with an fp16 KV cache and -np -1; if memory runs short, reducing the vision token budget is preferable to dropping to a Q8 KV cache, which degrades output quality. For now, pin llama.cpp to build b8660, since newer builds have tokenizer issues with this model.
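A sketch of a long-context launch under those constraints follows. The 32768 context size and model filename are illustrative, and the -np -1 value is carried over from the report rather than verified against every build:

    # Run against llama.cpp build b8660, per the report's tokenizer caveat.
    llama-server \
      -m gemma-4-26B-A4B-it-IQ4_XS.gguf \   # illustrative filename
      -c 32768 \                             # 30K+ token context
      -ctk f16 -ctv f16 \                    # fp16 KV cache (the default, made explicit)
      -np -1                                 # value as stated in the report

Note the KV cache types are left at f16 deliberately: the report finds q8_0 KV quantization hurts quality more than trimming the vision token budget does.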

Key Points
  • Achieves 80+ tokens/sec on 16 GB VRAM, 4x faster than Qwen 3.5 27B
  • Requires specific quantization (IQ4_XS) and parameter tuning for optimal coding performance
  • Vision performance matches Qwen 3 27B with the --image-min-tokens 300 setting

Why It Matters

Enables high-performance multimodal AI on consumer hardware, making advanced coding and vision tasks accessible locally.