Open Source

nvidia/Gemma-4-26B-A4B-NVFP4

50,000 tokens of context on a consumer GPU with near-lossless quality – the future of local AI is here.

Deep Dive

A user confirms the model runs on an RTX 5090 with an 80% allocation of its 32GB VRAM, reaching roughly 50,000 tokens of context; the model weights are 18.8GB. Published benchmarks show the NVFP4 checkpoint tracking the full-precision baseline closely:

  Benchmark        NVFP4     Full precision
  GPQA Diamond     79.90%    80.30%
  AIME 2025        90.00%    88.95%
  MMLU Pro         84.80%    85.00%
  LiveCodeBench    79.80%    80.50%
  IFBench          78.10%    77.77%
  IFEval           96.40%    96.60%
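
For anyone who wants to reproduce that setup, here is a minimal sketch using vLLM's Python API, one common runtime for NVFP4 checkpoints. The serving stack, sampling parameters, and prompt are assumptions for illustration; only the model name, the 80% memory allocation, and the ~50k context come from the report above.

  # A minimal sketch, assuming the checkpoint is served with vLLM.
  # Values mirror the reported setup: 80% of a 32GB RTX 5090 and ~50k context.
  from vllm import LLM, SamplingParams

  llm = LLM(
      model="nvidia/Gemma-4-26B-A4B-NVFP4",
      gpu_memory_utilization=0.8,   # cap allocation at 80% of available VRAM
      max_model_len=50_000,         # the ~50,000-token context window
  )

  params = SamplingParams(temperature=0.7, max_tokens=256)
  out = llm.generate(["Explain NVFP4 quantization in two sentences."], params)
  print(out[0].outputs[0].text)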

Key Points
  • 18.8GB model weights fit on an RTX 5090, running a 50,000-token context window within 80% of its 32GB VRAM
  • Stays within 0.5 points of full precision on 5 of 6 benchmarks and beats the baseline outright on AIME 2025 (90.00% vs 88.95%) and IFBench (78.10% vs 77.77%)
  • Uses NVIDIA's NVFP4 quantization, a 4-bit floating-point format optimized for GPU inference (sketched below)
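
NVFP4 is reported to pack weights into 4-bit floating-point (E2M1) values grouped into small blocks, each sharing a higher-precision scale factor. The sketch below illustrates that idea with an assumed 16-element block and a plain float scale per block; the real format also stores FP8 block scales and a per-tensor FP32 scale, and this is not NVIDIA's actual kernel code.

  # Illustrative block quantization in the spirit of NVFP4 (assumptions:
  # 16-element blocks, one float scale per block, E2M1 value grid).
  import numpy as np

  E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 magnitudes

  def quantize_block(block: np.ndarray) -> tuple[np.ndarray, float]:
      """Snap each value in a block to the nearest signed E2M1 point after scaling."""
      scale = float(np.abs(block).max()) / E2M1_GRID[-1]
      scale = scale if scale > 0 else 1.0   # avoid division by zero on all-zero blocks
      scaled = block / scale
      idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
      return np.sign(scaled) * E2M1_GRID[idx], scale

  def dequantize_block(codes: np.ndarray, scale: float) -> np.ndarray:
      return codes * scale

  weights = np.random.randn(16).astype(np.float32)  # one 16-element block
  codes, scale = quantize_block(weights)
  error = np.abs(weights - dequantize_block(codes, scale)).mean()
  print(f"mean abs reconstruction error: {error:.4f}")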

Why It Matters

Makes 26B-parameter models viable on consumer GPUs, enabling privacy, offline use, and longer context without cloud costs.