Open Source

The 4B class of 2026 (benchmark)

A 2.8 GB thinking model beats larger rivals on finance tasks, posting 100% accuracy.

Deep Dive

In a detailed benchmark comparing 3-4B-parameter AI models, NVIDIA's Nemotron 3 Nano emerged as the clear winner with 85% overall accuracy across 39 tasks. The model achieved a perfect 100% on 15 finance tasks, including calculations such as present value and CAGR, despite a disk footprint of only 2.8 GB. Its thinking-model architecture, signaled by </think> tags in its output, completes reasoning efficiently within a 1024-token budget, producing clean intermediate steps like "compute (1.08)^5 = 1.4693, so PV = 100,000 / 1.4693 ≈ 68,058." This performance surpasses larger models from Google (Gemma 4, 9.6 GB, 62%) and IBM (Granite 4, 2.1 GB, 54%).
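The finance tasks named above reduce to two standard formulas. A minimal sketch of the present-value and CAGR calculations (my own illustration, not the benchmark's actual harness), reproducing the model's worked example:

```python
def present_value(future_value: float, rate: float, periods: int) -> float:
    """Discount a future cash flow back to today: PV = FV / (1 + r)^n."""
    return future_value / (1 + rate) ** periods

def cagr(begin: float, end: float, years: int) -> float:
    """Compound annual growth rate: (end / begin)^(1 / years) - 1."""
    return (end / begin) ** (1 / years) - 1

# The intermediate step Nemotron produced: $100,000 due in 5 years at 8%.
# (1.08)^5 ≈ 1.4693, so PV ≈ 68,058.
pv = present_value(100_000, 0.08, 5)
print(round(pv))  # → 68058
```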

The benchmark reveals distinct specializations among small models: IBM's Granite 4 hits 100% on code tasks but only 20% on reasoning, while NVIDIA's Nemotron excels at reasoning (80%) but scores 67% on code. Microsoft's Phi-4 Mini (2.5 GB, 77%) delivers the most balanced performance, with 100% on code, 80% on finance, and 60% on reasoning, and offers the best accuracy per gigabyte at 30.8%/GB. In contrast, Alibaba's Qwen 3.5 4B scored just 15%, with 30 of 39 responses truncated because its thinking phase consumed the full 1024-token budget before it could finish an answer, a pattern also seen in earlier benchmarks. This highlights a systemic issue in evaluating thinking models under fixed token limits.
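The accuracy-per-GB figure is simply overall accuracy divided by disk footprint. A quick sketch using the four scores and sizes quoted in this article (the dictionary below is my own tabulation, not the benchmark's data format):

```python
# model name -> (overall accuracy %, disk footprint in GB), from the article
models = {
    "Nemotron 3 Nano": (85, 2.8),
    "Phi-4 Mini": (77, 2.5),
    "Gemma 4": (62, 9.6),
    "Granite 4": (54, 2.1),
}

# Rank by accuracy per gigabyte, best first.
for name, (acc, gb) in sorted(models.items(), key=lambda kv: -(kv[1][0] / kv[1][1])):
    print(f"{name}: {acc / gb:.1f}%/GB")
```

Phi-4 Mini tops this ranking at exactly 30.8%/GB, narrowly ahead of Nemotron 3 Nano (about 30.4%/GB), which is why the article calls it the efficiency pick despite its lower overall score.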

Key Points
  • NVIDIA's Nemotron 3 Nano (2.8 GB) scored 85% overall, with 100% on finance tasks, beating models up to 9.6 GB.
  • Microsoft's Phi-4 Mini (2.5 GB) offers the best balance at 77% overall and the best efficiency at 30.8%/GB.
  • Qwen 3.5 4B failed at 15% accuracy due to token budget limits, a recurring issue for thinking models.

Why It Matters

Small, specialized models like Nemotron 3 Nano can outperform larger ones in specific domains, reshaping edge AI deployment.