Research & Papers

DISCO: Document Intelligence Suite for COmparative Evaluation

New research shows OCR beats VLMs on handwriting, while vision models win on multilingual text.

Deep Dive

A team of researchers including Kenza Benkirane, Dan Goldwater, Martin Asenov, and Aneiss Ghodsi has published DISCO, a comprehensive new benchmark for evaluating document intelligence systems. Accepted at the ICLR 2026 Workshop on Multimodal Intelligence, the suite rigorously tests both traditional Optical Character Recognition (OCR) pipelines and modern Vision-Language Models (VLMs) across a diverse range of document types. These include challenging categories like handwritten text, multilingual scripts, complex medical forms, infographics, and multi-page documents. The goal is to move beyond generic accuracy scores and provide nuanced, complexity-aware guidance for practitioners.
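The suite's move beyond a single aggregate accuracy number can be illustrated with a small sketch: instead of one pooled score, each system is scored per document category. The record format and numbers below are hypothetical, not DISCO's actual schema or results.

```python
from collections import defaultdict

def per_category_scores(results):
    """Average scores per (category, system) pair.

    `results` is a list of (category, system, score) tuples --
    an illustrative record format, not DISCO's real schema.
    """
    buckets = defaultdict(list)
    for category, system, score in results:
        buckets[(category, system)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

# Illustrative numbers only, not figures from the paper.
results = [
    ("handwriting", "ocr", 0.82), ("handwriting", "vlm", 0.61),
    ("multilingual", "ocr", 0.58), ("multilingual", "vlm", 0.79),
]
scores = per_category_scores(results)
```

Reporting `scores` per category is what lets a benchmark surface the kind of split described below, where each system family wins on different document types.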

The evaluation's core finding is that no single approach is universally superior. Performance varies substantially with the document's characteristics and the required task. OCR pipelines demonstrated greater reliability for parsing handwriting and for processing long or multi-page documents, where their explicit text grounding provides a strong foundation for subsequent text-heavy reasoning. In contrast, VLMs such as GPT-4V and Claude 3 showed stronger performance on documents with multilingual text and visually rich, non-standard layouts, where their integrated visual understanding is an advantage. The research also found that task-aware prompting for VLMs yielded mixed results, improving performance on some document types while degrading it on others, highlighting the need for careful tuning.

These results provide much-needed empirical data for engineers and product teams building document processing workflows. Instead of defaulting to the latest VLM, the DISCO benchmark suggests a hybrid, tool-selection strategy. For digitizing archives of handwritten notes or lengthy reports, a robust OCR engine remains the best starting point. For analyzing modern, design-heavy reports or documents in multiple languages, a vision-language model is likely more effective. This suite enables more informed, cost-effective, and accurate system design for real-world applications in legal, medical, and financial sectors.
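The hybrid tool-selection strategy can be sketched as a simple router that inspects document traits before dispatching to an engine. The field names, thresholds, and default below are placeholder assumptions for illustration, not values or rules from the DISCO paper.

```python
def choose_engine(doc):
    """Route a document to 'ocr' or 'vlm' based on its traits.

    `doc` is a dict of illustrative flags; the page threshold and
    the conservative default are assumptions, not paper findings.
    """
    # OCR's explicit text grounding suits handwriting and long documents.
    if doc.get("handwritten") or doc.get("pages", 1) > 5:
        return "ocr"
    # VLMs' integrated visual understanding suits multilingual,
    # design-heavy, or infographic-style layouts.
    if doc.get("multilingual") or doc.get("layout") == "infographic":
        return "vlm"
    return "ocr"  # conservative default for plain typed documents
```

For example, `choose_engine({"handwritten": True})` routes to OCR, while `choose_engine({"multilingual": True})` routes to a VLM; in practice such rules would be tuned against per-category benchmark scores rather than hard-coded.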

Key Points
  • OCR pipelines are more reliable for handwriting and long, multi-page documents, providing better text grounding for reasoning.
  • Vision-language models (VLMs) perform better on multilingual text and visually rich, complex layouts like infographics.
  • Task-aware prompting for VLMs has mixed effects, improving some document types while degrading others, requiring careful implementation.

Why It Matters

Provides data-driven guidance for engineers to choose the right tool—OCR or VLM—for specific document types, improving accuracy and cost-efficiency.