A Survey of OCR Evaluation Methods and Metrics and the Invisibility of Historical Documents
A new survey of 20 years of research shows AI document systems are blind to historical Black newspapers.
A new academic survey reveals a critical blind spot in modern Optical Character Recognition (OCR) and document understanding AI. Researchers Fitsum Sileshi Beyene and Christopher L. Dancy analyzed two decades of research (2006-2025) and found that evaluation methods for systems built on vision transformers and multimodal models center on clean, modern, Western documents. That focus masks how these systems perform on historical and marginalized archives, where degraded paper, complex layouts, and unusual typography are the norm.
The study pays particular attention to Black historical newspapers, which rarely appear in reported training data or in standard benchmark datasets such as those indexed on Papers with Code. Consequently, evaluations that prioritize simple character accuracy miss common structural failures in these documents: column collapse, misread typography, and AI-hallucinated text. The authors argue these 'evaluation gaps' are not accidental but stem from organizational and institutional behaviors shaped by benchmark incentives and data governance decisions, leading to the 'structural invisibility' of these vital historical records.
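The paper's point about character accuracy can be illustrated with a toy example. A plain Levenshtein-based character error rate (CER) can stay modest even when "column collapse" destroys the reading order of a two-column page, because most characters are still present, just rearranged. The snippet below is an illustrative sketch, not code from the paper; the example text and the collapsed-column hypothesis are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


# Hypothetical two-column snippet (invented for illustration).
# Correct reading order: column 1 top-to-bottom, then column 2.
reference = "freedom is never given voluntarily it must be demanded"
# "Column collapse": the OCR reads straight across both columns,
# interleaving lines from the two columns.
hypothesis = "freedom is never it must given voluntarily be demanded"

# Every character survives; only the order is wrong, so CER is far
# below 1.0 even though the text is no longer readable as intended.
print(f"CER: {cer(reference, hypothesis):.2f}")
```

Because CER treats the page as one flat string, a reordering error that renders an article incoherent scores far better than it should, which is the kind of gap the survey argues standard benchmarks reward.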
By connecting technical evaluation gaps to archival statistics from major Black press collections, the paper frames the issue as one of representational harm. It proposes that the field's reliance on narrow benchmarks creates AI systems that are ill-equipped to preserve and interpret the documentary history of marginalized communities, effectively erasing them from the digital record. The work is a submission to the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT), highlighting its significance for ethical AI development.
- Survey of OCR research from 2006-2025 finds training data and benchmarks exclude historical Black newspapers.
- AI systems fail on historical layouts, producing errors such as column collapse and hallucinated text that standard metrics do not capture.
- Authors argue these 'evaluation gaps' cause structural invisibility and representational harm, driven by institutional benchmark incentives.
Why It Matters
Shows how biased AI training data erases history, urging developers to build inclusive benchmarks for document AI.