We benchmarked 18 LLMs on OCR (7,560 calls): cheaper, older models often win. Full dataset + framework open-sourced. [R]
7,560 API calls show older models matching GPT-4o accuracy at up to 90% lower cost
Arbitr, an AI research group, released a comprehensive OCR benchmark after testing 18 large language models on 42 curated standard documents. Each model was run 10 times per document under identical conditions, for 7,560 API calls in total (18 models × 42 documents × 10 runs). The benchmark measures pass^n reliability (the probability that a model succeeds on all n attempts, not just one), cost-per-success, latency, and critical-field accuracy. The key finding: for standard document extraction, older and cheaper models such as GPT-3.5-Turbo and Claude Instant deliver accuracy comparable to premium models like GPT-4o and Claude 3 Opus at 5-10x lower cost.
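As a rough illustration of the two headline metrics, here is one common way to compute them. This is a minimal sketch, not Arbitr's actual code: the function names are ours, and the exact definitions the framework uses may differ. `pass_power_k` is the standard unbiased estimate of the probability that k independent attempts all succeed, given the observed number of successful runs.

```python
from math import comb

def pass_power_k(successes: int, attempts: int, k: int) -> float:
    """Unbiased estimate of pass^k: the probability that k independent
    attempts ALL succeed, given `successes` correct runs out of
    `attempts` total (here, 10 runs per model per document).
    math.comb returns 0 when k > successes, so the estimate is 0 then."""
    return comb(successes, k) / comb(attempts, k)

def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Effective price of one correct extraction: total API spend
    divided by the number of successful runs."""
    return float("inf") if successes == 0 else total_cost_usd / successes

# A model that passed 9 of its 10 runs looks reliable at pass^1 (0.9)
# but much less so at pass^5 (0.5): one flaky run compounds quickly.
print(pass_power_k(9, 10, 1))     # 0.9
print(pass_power_k(9, 10, 5))     # 0.5
print(cost_per_success(0.42, 9))  # ~$0.047 per successful extraction
```

The hypothetical numbers above show why pass^n matters for production OCR: a 90% single-shot success rate halves once you need five consecutive correct extractions.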
The open-source framework, available on GitHub, includes a free tool that lets teams test their own documents against the leaderboard. This challenges the common practice of defaulting to the newest, largest models for OCR workflows: the results suggest many teams are overpaying by 80-90% for extraction tasks that older models handle equally well (at equal accuracy, a 5x price gap is an 80% saving and a 10x gap is 90%, as the sketch below shows). The dataset covers invoices, receipts, forms, and ID documents, with full transparency on methodology and raw results.
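The 80-90% figure follows directly from the price gap: if a model is 5-10x cheaper per call and succeeds just as often, its cost-per-success is 5-10x lower. A quick sanity check with illustrative prices (not figures from the benchmark):

```python
def savings(premium_cost_per_call: float, cheap_cost_per_call: float) -> float:
    """Fractional savings when both models have equal success rates,
    so cost-per-success scales directly with cost-per-call."""
    return 1 - cheap_cost_per_call / premium_cost_per_call

print(savings(10.0, 2.0))  # 5x cheaper  -> 0.8 (80% savings)
print(savings(10.0, 1.0))  # 10x cheaper -> 0.9 (90% savings)
```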
- 7,560 total API calls across 18 LLMs on 42 standard documents under identical conditions
- Older models like GPT-3.5-Turbo match GPT-4o accuracy on standard OCR at 5-10x lower cost
- Open-source framework reports pass^n reliability, cost-per-success, latency, and critical-field accuracy metrics
Why It Matters
Saves teams 80-90% on OCR costs by showing that older models handle standard documents as well as premium ones.