Open Source

VRAM optimization for Gemma 4

A simple command-line tweak cuts the sliding-window KV cache from roughly 3.2GB to 1.2GB for the 31B model.

Deep Dive

Users running Google's Gemma 4 models via llama.cpp have discovered a critical VRAM optimization that makes the models far more accessible on consumer hardware. The issue stems from the Sliding Window Attention (SWA) Key-Value cache, a memory structure that stores recent token information for the model's attention mechanism. By default, llama.cpp allocates this cache for four parallel user sequences (`-np 4`), a holdover from multi-user server setups. For a single user, that is two to three times more memory than necessary.
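
To make the scaling concrete, here is a rough back-of-envelope sketch of how the SWA cache grows with the number of parallel sequences. Every dimension in it (window length, layer count, head counts) is an illustrative placeholder rather than the real Gemma 4 configuration; the only point is that the cache scales linearly with `-np`.

```
# Back-of-envelope SWA KV cache estimate. All dimensions are placeholders,
# not the actual Gemma 4 config; only the linear scaling with -np matters.
n_seq=4          # llama.cpp default: -np 4
window=1024      # sliding-window length (placeholder)
n_layers=48      # layers using SWA (placeholder)
n_kv_heads=8     # KV heads per layer (placeholder)
head_dim=128     # per-head dimension (placeholder)
bytes_per_elem=2 # fp16 cache entries

# Both K and V are cached, hence the factor of 2.
cache_bytes=$(( n_seq * window * n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem ))
echo "approx SWA cache with -np $n_seq: $(( cache_bytes / 1024 / 1024 )) MiB"
echo "approx SWA cache with -np 1:  $(( cache_bytes / n_seq / 1024 / 1024 )) MiB"
```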

Adding the flag `-np 1` to the launch command tells llama.cpp that only one sequence is active, immediately shrinking the SWA cache. On the Gemma 4 31B dense model, this reduces the cache from approximately 3.2GB down to 1.2GB. Combined with leaving the `-ub 4096` batch-size tweak off (it also bloats the SWA buffer), this optimization is crucial for fitting quantized versions of these large models into a 16GB VRAM budget, enabling longer context lengths and making local deployment practical.
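
A minimal single-user launch might look like the sketch below. The model filename, context size, and GPU offload count are placeholders for your own setup; the flag that matters here is `-np 1` (long form `--parallel 1`).

```
# Hypothetical launch command; adjust the model path, context size (-c),
# and GPU layers (-ngl) for your hardware. The key flag is -np 1.
# Leave -ub (micro-batch size) at its default instead of raising it to 4096.
./llama-server -m gemma-4-31b-q4_k_m.gguf -c 16384 -ngl 99 -np 1
```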

Key Points
  • Adding `-np 1` to llama.cpp commands cuts the SWA KV cache VRAM by roughly 3x for single-user setups.
  • The fix reduces the cache from ~3.2GB to ~1.2GB for Gemma 4 31B, making it viable on 16GB GPUs.
  • Users should skip the `-ub 4096` speed tweak and run a llama.cpp build that includes PR #21332 for correct behavior.

Why It Matters

This optimization democratizes running state-of-the-art 31B+ parameter models locally, removing a major barrier for developers and researchers without enterprise-grade hardware.