Mistral Small 4 loses to Qwen3.5-9B on document understanding benchmarks, though it still beats GPT-4.1
A 9-billion parameter model outperforms a 119B MoE system on key document tasks, challenging scaling assumptions.
In a detailed comparison on the IDP Leaderboard for document AI, Alibaba's compact Qwen3.5-9B model has outperformed Mistral AI's much larger Mistral Small 4. The dense, 9-billion-parameter Qwen model won 10 of 14 sub-benchmarks, achieving an overall score of 77.0 and ranking #9, while the 119-billion-parameter Mixture-of-Experts (MoE) Mistral model scored 71.5 and ranked #11. The benchmarks covered three core areas: OlmOCR Bench for optical character recognition, OmniDocBench for document structure, and IDP Core Bench for information extraction. Qwen showed significant leads in key areas such as math OCR (85.5 vs. 66.0) and key information extraction (86.5 vs. 78.3), though both models struggled with absent-text detection.
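As a rough illustration, the head-to-head gaps above can be tallied in a few lines of Python. Only the three score pairs cited in this article are used; the comparison logic itself is purely illustrative.

```python
# Reported IDP Leaderboard scores: (Qwen3.5-9B, Mistral Small 4).
# These three pairs are the figures cited in the article.
scores = {
    "overall": (77.0, 71.5),
    "math OCR": (85.5, 66.0),
    "key information extraction": (86.5, 78.3),
}

for task, (qwen, mistral) in scores.items():
    lead = qwen - mistral
    winner = "Qwen3.5-9B" if lead > 0 else "Mistral Small 4"
    print(f"{task}: {winner} leads by {abs(lead):.1f} points")
```

The largest gap, 19.5 points on math OCR, is where the dense model's advantage is most visible.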
The results challenge the conventional wisdom that bigger models are always better, especially for specialized tasks like document understanding. Mistral Small 4, which uses an MoE architecture with ~6B active parameters per query, showed a high floor but a low ceiling, with consistent but unspectacular scores. Analysts note that the comparison was run using Mistral's full-precision API, raising questions about how its recently released 4-bit quantized checkpoint (NVFP4) would perform, as the full model requires 242 GB of memory. For professionals, this benchmark indicates that smaller, efficiently architected models like Qwen3.5-9B can be highly competitive in production document processing pipelines, potentially offering a better cost-to-performance ratio.
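The cited memory figure is consistent with back-of-the-envelope arithmetic: storing 119B parameters at 16 bits each takes roughly 238 GB for the weights alone, close to the 242 GB quoted once overhead is included. A minimal sketch (the helper name and the bit-width choices are illustrative assumptions, not vendor tooling):

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate GB needed just to store the weights.

    Ignores KV cache, activations, and runtime overhead, which is why
    real-world figures (e.g. the cited 242 GB) run slightly higher.
    """
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# Mistral Small 4: 119B total parameters
print(weight_memory_gb(119, 16))  # bf16: 238.0 GB, near the cited 242 GB
print(weight_memory_gb(119, 4))   # NVFP4 4-bit: 59.5 GB
# Qwen3.5-9B: 9B dense parameters
print(weight_memory_gb(9, 16))    # bf16: 18.0 GB
```

The roughly 13x gap in weight memory between the two full-precision models is what makes the cost-to-performance argument concrete.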
- Qwen3.5-9B (9B params) scored 77.0, beating Mistral Small 4 (119B MoE) which scored 71.5 on the IDP Leaderboard.
- The smaller model won 10 of 14 sub-tests, with its largest lead in math OCR (85.5 vs. 66.0).
- The result highlights that model architecture and training data can outweigh pure parameter count for domain-specific tasks like document AI.
Why It Matters
For businesses building document AI, smaller, efficient models can match or beat larger ones, drastically reducing deployment cost and complexity.