Open Source

I was hyped for Nemotron 3 4B and it completely disappointed me compared to Qwen 3.5 4B

A rigorous custom test suite reveals that Nemotron 3's 4B model fails on complex math and structured-output tasks where Qwen 3.5 4B succeeds.

Deep Dive

A developer's viral benchmark results reveal a significant performance gap between two leading 4-billion-parameter language models. Alibaba's Qwen 3.5 4B passed a grueling, custom multi-part test involving advanced calculus, combinatorial number theory, and structured JSON generation with perfect accuracy. In contrast, NVIDIA's newly released Nemotron 3 4B failed the same test, producing incorrect mathematical proofs and flawed logic despite its touted architectural advantages for handling long contexts.

The test was designed to stress reasoning, instruction following, and structured output. One task required computing a complex sum S(2026) and providing an 8-line integral proof, which Qwen 3.5 handled correctly. Nemotron 3 returned the right final answer but with a mangled, incorrect proof. Another task involved a modular-arithmetic floor-function sum: both models reached the correct numerical answer (838), but Nemotron's justification was logically inconsistent, suggesting it arrived at the right number without sound reasoning.
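The exact sums in the benchmark were not published here, but the grading approach for numeric tasks can be sketched: compute the ground truth programmatically, then compare the model's claimed final answer against it. The floor sum below (`floor_sum` with parameters 2026, 7, 11) is a hypothetical stand-in, not the actual benchmark task.

```python
def floor_sum(n: int, a: int, m: int) -> int:
    """Stand-in task: sum of floor(a*k/m) for k = 1..n.
    (Illustrative only; the actual benchmark sum was not published.)"""
    return sum((a * k) // m for k in range(1, n + 1))

def score_answer(model_answer: int, ground_truth: int) -> bool:
    """A correct final number alone doesn't prove sound reasoning,
    but an incorrect one immediately fails the task."""
    return model_answer == ground_truth

if __name__ == "__main__":
    truth = floor_sum(2026, 7, 11)
    print(truth, score_answer(truth, truth))
```

This kind of programmatic ground truth is what makes a "guessed the number, botched the proof" failure visible: the numeric check passes while a human (or a separate grader) still rejects the justification.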

This real-world evaluation underscores a critical point for the local AI community: spec sheet features like context length are secondary to core reasoning ability. For developers choosing a model to run on local hardware, proven performance on complex tasks is more valuable than theoretical architectural benefits. The results suggest Qwen 3.5 4B currently holds the edge for applications requiring reliable, precise reasoning from a small model.

Key Points
  • Qwen 3.5 4B passed all custom tests, including a dense multi-part math prompt requiring a valid JSON output with proofs, calculations, and pseudocode.
  • Nemotron 3 4B failed the same test, producing incorrect integral proofs and logically inconsistent justifications, despite returning some correct numerical answers.
  • The test highlights that architectural claims (like Nemotron's large context window) do not guarantee superior reasoning, which is crucial for real-world coding and math applications.
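The structured-output portion of such a test is typically graded mechanically: parse the model's reply as JSON and verify the required fields are present before any content grading. A minimal sketch, assuming a hypothetical schema with `proof`, `calculations`, `pseudocode`, and `final_answer` keys (the real prompt's field names were not published):

```python
import json

# Hypothetical required fields for a multi-part math prompt that demands
# proofs, calculations, and pseudocode in a single JSON object.
REQUIRED_KEYS = {"proof", "calculations", "pseudocode", "final_answer"}

def check_structured_output(raw: str) -> list[str]:
    """Return a list of problems with the model's output (empty list = pass)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]
    missing = REQUIRED_KEYS - obj.keys()
    return [f"missing keys: {sorted(missing)}"] if missing else []

# Example: a well-formed reply passes the structural check.
ok = check_structured_output(
    '{"proof": "...", "calculations": [], "pseudocode": "", "final_answer": 838}'
)
print(ok)  # []
```

A strict structural gate like this explains how a model can fail the test outright on formatting even when its math is partially correct.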

Why It Matters

For developers running local models, these benchmarks show reasoning capability is more critical than raw specs, directly impacting which model is suitable for complex tasks.