New Microservice Architecture for Document AI Reveals OCR, Not LLMs, as Latency Bottleneck

arXiv cs.AI May 20, 2026

⚡Processing thousands of documents per hour? Researchers found OCR is the real bottleneck.

Deep Dive

In a new arXiv paper, a team of 12 researchers (Fehlis et al.) from a large tech organization tackles the gap between academic model research and production deployment of document AI. They propose a microservice architecture that pipelines classification, OCR, and LLM-based structured field extraction. Key design choices include a hybrid classification step, strict separation of GPU-bound inference (OCR, LLM) from CPU-bound orchestration, heavy use of asynchronous processing for I/O-bound operations (e.g., fetching documents from storage), and independent horizontal scaling of each microservice. The system was profiled under a load of thousands of multi-page documents per hour.
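To make two of these design choices concrete, here is a minimal Python sketch of async I/O combined with a shared limit on GPU-bound calls. It is not the paper's implementation: `fetch_document`, `gpu_stage`, `process`, `run_pipeline`, and `GPU_SLOTS` are hypothetical names, the sleeps stand in for real storage and inference calls, and the classification step is left out for brevity.

```python
import asyncio

GPU_SLOTS = 2  # assumption: how many inference calls the GPU serves at once


async def fetch_document(doc_id: str) -> bytes:
    """Hypothetical stand-in for an I/O-bound fetch from document storage."""
    await asyncio.sleep(0.01)  # simulated network wait; yields to the event loop
    return f"raw-bytes-of-{doc_id}".encode()


async def gpu_stage(name: str, payload: str, gpu: asyncio.Semaphore) -> str:
    """Hypothetical stand-in for a GPU-bound call (OCR or LLM extraction)."""
    async with gpu:  # at most GPU_SLOTS calls run concurrently
        await asyncio.sleep(0.02)  # simulated inference time
        return f"{name}({payload})"


async def process(doc_id: str, gpu: asyncio.Semaphore) -> dict:
    """Run one document through fetch, OCR, and field extraction."""
    raw = await fetch_document(doc_id)
    text = await gpu_stage("ocr", raw.decode(), gpu)
    fields = await gpu_stage("extract", text, gpu)
    return {"doc_id": doc_id, "fields": fields}


async def run_pipeline(doc_ids: list[str]) -> list[dict]:
    """Process all documents concurrently; results keep the input order."""
    gpu = asyncio.Semaphore(GPU_SLOTS)
    return await asyncio.gather(*(process(d, gpu) for d in doc_ids))


results = asyncio.run(run_pipeline([f"doc-{i}" for i in range(4)]))
```

In this sketch all fetches overlap freely because they only wait on I/O, while the semaphore keeps the number of in-flight inference calls at or below the assumed GPU capacity regardless of how many documents are submitted.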

The most surprising finding: OCR (extracting text from images) dominates end-to-end latency far more than the LLM parsing step. Additionally, the system saturates at a concurrency level determined by shared GPU-inference capacity, not by the number of worker processes or CPUs. At production scale, this means that optimizing OCR hardware allocation (e.g., GPU memory, parallel instances) and carefully managing GPU contention are far more impactful than optimizing the LLM prompt pipeline. The paper provides concrete patterns for practitioners: use async I/O, isolate GPU tasks, and monitor GPU memory, since GPU capacity is the primary scaling bottleneck. This research connects model accuracy benchmarks with real-world throughput and latency requirements.
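The saturation effect can be illustrated with a simple bottleneck model. This is a sketch under stated assumptions, not a model taken from the paper: each document is assumed to need a fixed amount of CPU-side time and GPU time, each worker is assumed to block while its GPU call runs, and every number below is made up for illustration.

```python
def throughput_docs_per_sec(workers: int, gpu_slots: int,
                            cpu_s: float, gpu_s: float) -> float:
    """Upper bound on throughput in a two-resource bottleneck model.

    workers   -- number of worker processes (illustrative)
    gpu_slots -- concurrent inference calls the GPU can serve (illustrative)
    cpu_s     -- CPU-side seconds per document (illustrative)
    gpu_s     -- GPU seconds per document (illustrative)
    """
    # Each worker handles one document at a time, CPU time plus GPU time.
    worker_bound = workers / (cpu_s + gpu_s)
    # The GPU serves at most gpu_slots documents at a time.
    gpu_bound = gpu_slots / gpu_s
    # The slower resource caps the whole system.
    return min(worker_bound, gpu_bound)


# Sweep the worker count with GPU capacity held fixed.
sweep = {
    w: throughput_docs_per_sec(workers=w, gpu_slots=4, cpu_s=0.5, gpu_s=2.0)
    for w in (2, 4, 8, 16)
}
```

With these made-up inputs, throughput grows with the worker count only until the GPU bound is reached; past that point, adding workers changes nothing, which is the qualitative behavior the paragraph above describes.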

Key Points

OCR dominates end-to-end latency in production document AI — far more than LLM-based structured extraction.
System throughput saturates at GPU inference capacity, not worker count or CPU allocation — making GPU management the top scaling concern.
Architecture uses hybrid classification, async I/O, and independent horizontal scaling to process thousands of multi-page documents per hour.

Why It Matters

Real-world guidance for deploying document AI at scale: focus on OCR optimization and GPU capacity planning, not just model accuracy.
