Open Source

Qwen3.5-9B on document benchmarks: where it beats frontier models and where it doesn't.

A tiny 9B open model beats frontier models on text extraction and document Q&A in a new 9,000-document benchmark.

Deep Dive

A new, comprehensive benchmark for Document AI reveals surprising performance from Alibaba's open-source Qwen3.5 models. The IDP Leaderboard, which evaluated 20 models on over 9,000 real-world documents, shows the 9-billion-parameter Qwen3.5-9B outperforming massive frontier models such as OpenAI's GPT-5.4 and Anthropic's Claude Sonnet 4.6 on core tasks. On OlmOCR, which tests text extraction from messy scans and dense PDFs, Qwen3.5-9B scored 78.1, leading the pack ahead of Gemini 3.1 Pro (74.6) and GPT-5.4 (73.4). Even the tiny 4B version scored 77.2, demonstrating that raw text extraction is a strength of the Qwen architecture.

Perhaps the most shocking result is in Visual Question Answering (VQA), where models answer questions about document content, charts, and tables. Qwen3.5-9B scored 79.5, placing it second overall and edging out GPT-5.4 (78.2) by more than a point; it finished a staggering 14 points ahead of Claude Sonnet 4.6. For a 9B open model to compete with proprietary giants on complex reasoning over documents is a major breakthrough. The benchmark also reveals clear limits, however: on structured table extraction (GriTS), frontier models scored between 85 and 96, while the Qwen models plateaued around 76-77, suggesting an architectural constraint.

The results validate a trend toward smaller, more efficient models that can excel at specialized workloads. The Qwen3.5-4B model even matched GPT-5.4 on Key Information Extraction (KIE), scoring 86.0 for pulling data like invoice numbers and dates. This performance at a fraction of the size and cost makes the Qwen family a compelling option for developers building production document processing pipelines, offering a viable open-source alternative to expensive API calls for specific tasks.
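For teams considering this kind of KIE workload, the usual pattern is to prompt the model to return the target fields as JSON and then validate the output before it enters the pipeline. A minimal sketch of that validation step (the field schema, `invoice_number`/`invoice_date`/`total`, and the ISO date format are illustrative assumptions, not part of the benchmark):

```python
import json
import re

# Illustrative KIE output validator. The required fields below are an
# assumed example schema, not one defined by the IDP Leaderboard.
REQUIRED_FIELDS = {"invoice_number", "invoice_date", "total"}
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assume ISO 8601 dates


def validate_kie(raw: str) -> dict:
    """Parse model output as JSON and check the fields a pipeline expects."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not DATE_RE.match(data["invoice_date"]):
        raise ValueError("invoice_date is not ISO formatted")
    return data


fields = validate_kie(
    '{"invoice_number": "INV-1042", "invoice_date": "2026-03-14", "total": "199.00"}'
)
print(fields["invoice_number"])
```

Gating model output this way matters more with smaller models, where occasional malformed JSON is the main failure mode rather than wrong field values.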

Key Points
  • Qwen3.5-9B scored 78.1 on OCR text extraction, beating Gemini 3.1 Pro (74.6) and GPT-5.4 (73.4) on the IDP Leaderboard.
  • On Visual Question Answering (VQA), the 9B model scored 79.5, ranking #2 behind only Gemini 3.1 Pro and ahead of GPT-5.4.
  • The 4B model matched GPT-5.4 on Key Information Extraction (KIE), but Qwen models lag by up to 20 points on table extraction tasks.

Why It Matters

Proves small, open-source models can outperform expensive frontier APIs for specific document tasks, lowering costs for developers.