Claude Sonnet 4.5 vision loses to OCR pipelines in document QA benchmark
A recent evaluation of a leading multimodal model against a traditional OCR pipeline reveals a persistent accuracy and cost gap, challenging the assumption that vision LLMs are ready to replace specialized document processing systems.
A systematic benchmark comparing a state-of-the-art vision language model, Claude Sonnet 4.5, with a premium optical character recognition (OCR) pipeline has yielded a counterintuitive result: the specialized pipeline outperformed the general-purpose model on both accuracy and cost. The evaluation, which covered 30 long, image-heavy PDFs and 171 document QA queries, found that the OCR pipeline achieved 59.6% accuracy at $0.19 per query, while the vision model scored 52.0% at $0.26 per query. The vision model also showed a 7% failure rate caused by file size limits, a non-issue for dedicated systems. The findings point to a limitation of current multimodal models: they struggle with precise extraction from tables and charts, the very elements that dominate enterprise documents.
The landscape of document AI is dominated by purpose-built services such as Amazon Textract, Microsoft Azure AI Document Intelligence (formerly Form Recognizer), and ABBYY FineReader. These tools employ layout analysis and optimized OCR engines that achieve high fidelity on structured data. In contrast, vision language models from companies like Anthropic, OpenAI, and Google are designed for broader visual understanding, not pixel-level accuracy. The benchmark results reaffirm a pattern observed with earlier models: while multimodal models excel at grasping document gist, they fall short when exact numbers or relationships from tables are required. As one AI researcher noted, “Vision LLMs are amazing for understanding the gist of a document, but if you need exact numbers from a table, you’re still better off with a traditional OCR pipeline.” This insight aligns with the practical experience of many developers who have found that combining OCR with a small language model yields better and cheaper results than feeding whole documents to a vision model.
The business implications are significant. The document AI market is estimated at over $2 billion annually, and major cloud providers generate substantial revenue from OCR-based services. The benchmark undercuts the narrative that vision models are more economical or accurate for document-heavy workflows. At $0.19 per query, the OCR pipeline is not only more accurate but 27% cheaper than the vision model. For enterprises running millions of queries, the $0.07 difference adds up quickly: about $70,000 per million queries.
Yet it would be premature to declare vision models irrelevant for document tasks. The evaluation used only 30 PDFs and 171 questions, limiting statistical power. The vision model was likely used without advanced prompting techniques (e.g., chain-of-thought or explicit OCR integration), which could narrow the gap. Other vision models, such as GPT-4o and Gemini 2.0 Flash, were not part of the comparison and may perform differently on OCR-like tasks. The OCR pipeline may also have been tuned for the specific document types in the benchmark, while the vision model remained general-purpose. Finally, the cost comparison does not include the infrastructure overhead of maintaining a dedicated OCR pipeline at scale, which could erase the per-query advantage for smaller deployments.
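The cost and sample-size points can be checked with a few lines of arithmetic. The sketch below recomputes the relative cost gap from the reported per-query prices and puts a rough 95% margin of error on each accuracy figure. It rests on two assumptions that are not from the benchmark itself: a normal approximation, and treating the two systems as independent samples of 171 queries each. Both systems actually answered the same queries, so a paired test on per-query results (which are not reported here) would be the better tool. The function names are illustrative.

```python
import math

N_QUERIES = 171                    # queries in the benchmark
OCR_ACC, OCR_COST = 0.596, 0.19    # reported OCR pipeline accuracy, $/query
VLM_ACC, VLM_COST = 0.520, 0.26    # reported vision model accuracy, $/query


def relative_saving(cheaper: float, pricier: float) -> float:
    """Fraction by which `cheaper` undercuts `pricier`."""
    return (pricier - cheaper) / pricier


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def diff_margin(p1: float, p2: float, n: int, z: float = 1.96) -> float:
    """Half-width of a 95% interval for the difference of two proportions.

    Assumption: two independent samples of size n (not a paired test).
    """
    return z * math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)


saving = relative_saving(OCR_COST, VLM_COST)
gap = OCR_ACC - VLM_ACC
gap_moe = diff_margin(OCR_ACC, VLM_ACC, N_QUERIES)

print(f"OCR is {saving:.0%} cheaper per query")
print(f"OCR accuracy: {OCR_ACC:.1%} +/- {margin_of_error(OCR_ACC, N_QUERIES):.1%}")
print(f"Vision accuracy: {VLM_ACC:.1%} +/- {margin_of_error(VLM_ACC, N_QUERIES):.1%}")
print(f"Accuracy gap: {gap:.1%} +/- {gap_moe:.1%}")
```

Under these assumptions the margin on the accuracy gap (about 10 points) is wider than the 7.6-point gap itself, which is consistent with the caution about limited statistical power.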
The bottom line: for high-stakes document QA where accuracy on tables and charts is paramount, dedicated OCR pipelines remain the superior choice. Vision models are best used for tasks that require holistic understanding, such as summarizing document themes or identifying document type. The most effective strategy for enterprises is likely a hybrid approach: use an OCR pipeline for exact extraction and a language model for reasoning over that extracted data. This benchmark serves as a valuable reality check, reminding us that while multimodal models are improving rapidly, they have not yet made specialized systems obsolete. The race is not over, but for now, the specialist still beats the generalist in this domain.
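The hybrid approach can be sketched as a two-step function: extract text with OCR, then hand that text to a language model together with the question. This is a minimal sketch, not the benchmark's pipeline. `OcrFn`, `LlmFn`, `build_prompt`, and `hybrid_answer` are hypothetical names, and the two `fake_*` functions are stand-ins so the example runs without any external service; a real deployment would plug in its own OCR service and model client.

```python
from typing import Callable

# Hypothetical interfaces, not any vendor's API.
OcrFn = Callable[[bytes], str]   # document bytes -> extracted text
LlmFn = Callable[[str], str]     # prompt -> answer


def build_prompt(extracted_text: str, question: str) -> str:
    """Ground the model in OCR output instead of raw page images."""
    return (
        "Answer using only the extracted document text below.\n"
        "If the answer is not present, say so.\n\n"
        f"--- DOCUMENT ---\n{extracted_text}\n--- END ---\n\n"
        f"Question: {question}"
    )


def hybrid_answer(doc: bytes, question: str, ocr: OcrFn, llm: LlmFn) -> str:
    """Step 1: exact extraction via OCR. Step 2: reasoning via a language model."""
    extracted = ocr(doc)
    return llm(build_prompt(extracted, question))


# Stand-in implementations so the sketch is runnable on its own.
def fake_ocr(doc: bytes) -> str:
    return doc.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return "42" if "Revenue | 42" in prompt else "not found"


print(hybrid_answer(b"Revenue | 42", "What was revenue?", fake_ocr, fake_llm))
```

Passing the OCR and model steps in as plain callables keeps the extraction and reasoning stages independently swappable, which is the point of the hybrid design.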
- The dedicated OCR pipeline achieved 59.6% accuracy at $0.19/query, outperforming a leading vision model's 52.0% at $0.26/query on document QA.
- The vision model had a 7% failure rate due to file size limits and struggled with tables and charts, making it a poor fit for precise extraction tasks.
- Enterprises should consider hybrid workflows combining OCR for extraction with language models for reasoning, rather than relying solely on vision models.
Why It Matters
In a document AI market estimated at over $2 billion annually, the trade-off between accuracy and cost is decisive; in this benchmark, specialized OCR kept its edge over a general-purpose vision model on structured data.