[R] IDP Leaderboard: Open benchmark for document AI across 16 VLMs, 9,000+ documents, 3 benchmark suites
Gemini 3.1 Pro leads by a hair, while cheaper models match flagships on extraction tasks.
Nanonets has released the IDP Leaderboard, an open-source evaluation framework for document-understanding AI. The benchmark tests 16 leading vision-language models (VLMs), including models from Google, OpenAI, and Anthropic, on more than 9,000 documents. It combines three distinct benchmark suites (OlmOCR, OmniDoc, and Nanonets' own IDP Core), covering key information extraction (KIE), table parsing, visual question answering (VQA), OCR, classification, and long-document processing.
The results reveal a highly competitive field. Google's Gemini 3.1 Pro leads with an overall score of 83.2, but the margin is razor-thin, with the top five models all within 2.4 points. A major insight is that cheaper, faster model variants (like Gemini Flash or Claude Sonnet) deliver nearly identical extraction quality to their flagship counterparts, with differentiation only appearing on complex, reasoning-heavy VQA tasks. Notably, OpenAI's GPT-4.1 shows a dramatic improvement over its predecessor, jumping from an overall score of 70 to 81.
Beyond scores, the leaderboard's most practical feature is its interactive Results Explorer. This tool allows developers and businesses to see the ground truth data alongside every model's raw prediction for each document. This side-by-side comparison moves beyond abstract metrics, enabling users to visually assess which model's output best aligns with their specific use case, whether it's parsing invoices, extracting data from forms, or understanding complex reports. The benchmark highlights that sparse, unstructured tables remain the hardest challenge for current AI, with most models scoring below 55% accuracy on this task.
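To make the ground-truth comparison concrete, here is a minimal sketch of field-level KIE scoring, the kind of per-document check the Results Explorer surfaces visually. The field names, values, and exact-match metric are illustrative assumptions, not the leaderboard's actual scoring code.

```python
def kie_accuracy(ground_truth: dict, prediction: dict) -> float:
    """Fraction of ground-truth fields the model predicted exactly
    (case- and whitespace-insensitive). Hypothetical metric for illustration."""
    if not ground_truth:
        return 1.0
    correct = sum(
        1 for field, value in ground_truth.items()
        if prediction.get(field, "").strip().lower() == value.strip().lower()
    )
    return correct / len(ground_truth)

# Illustrative invoice fields, not real benchmark data.
truth = {"invoice_number": "INV-1042", "total": "1,299.00", "currency": "USD"}
pred  = {"invoice_number": "INV-1042", "total": "1299.00", "currency": "USD"}
print(kie_accuracy(truth, pred))  # 2 of 3 fields match exactly -> ~0.667
```

An exact-match rule like this is deliberately strict: "1,299.00" versus "1299.00" counts as a miss, which is exactly the kind of near-hit that looking at raw predictions side by side with ground truth helps a buyer judge for themselves.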
- Gemini 3.1 Pro leads with 83.2 overall score, but top 5 models are within a tight 2.4-point margin.
- Cheaper model variants (Flash, Sonnet) match flagship quality on extraction, differing only on complex VQA tasks.
- The Results Explorer shows ground truth vs. model predictions for all 9,000+ documents, aiding practical model selection.
Why It Matters
It gives businesses data-driven clarity when choosing document AI, cutting costs by identifying where cheaper models perform equally well.