Research & Papers

DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

New small language models for OCR reduce text degeneration by 87.6% and cut costs by 22% via quantization.

Deep Dive

A team of researchers has published DharmaOCR, a pair of specialized small language models (SSLMs) for structured Optical Character Recognition. The models, DharmaOCR Full (7B parameters) and DharmaOCR Lite (3B parameters), are engineered to jointly optimize transcription accuracy, generation stability, and inference cost. They set a new state of the art on the newly introduced DharmaOCR-Benchmark, achieving extraction quality scores of 0.925 and 0.911 respectively. Crucially, they do so while keeping text degeneration rates extremely low, at 0.40% and 0.20%. Degeneration, a common failure mode in which an OCR model gets stuck emitting the same text in a repetitive loop, materially worsens production performance by inflating both latency and compute cost.

The core methodological breakthrough is the first application of Direct Preference Optimization (DPO) specifically for OCR. The researchers used degenerate, looping text generations as 'rejected' examples during DPO training to explicitly penalize this undesirable behavior. Combined with Supervised Fine-Tuning (SFT) to enforce a strict JSON schema for document structure (header, margin, footer, text), this approach reduced degeneration rates by up to 87.6% relative to baselines. For practical deployment, applying AWQ quantization reduced the per-page inference cost by up to 22% with negligible quality loss, creating a compelling quality-cost trade-off compared to proprietary services like Google Cloud Vision or Amazon Textract.
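The preference-pair construction described above can be sketched as follows. The four schema fields (header, margin, footer, text) come from the paper; the example values, helper names, and the (prompt, chosen, rejected) record layout, which matches what common DPO trainers such as Hugging Face TRL's DPOTrainer expect, are illustrative assumptions, not the authors' actual pipeline.

```python
import json

# Target document structure enforced during SFT (field names from the paper);
# the example values are purely illustrative.
clean_page = {
    "header": "Ministry of Finance",
    "margin": "",
    "text": "Invoice No. 4821, issued 2024-03-01.",
    "footer": "Page 1 of 3",
}

def make_degenerate(text: str, loop: str = " Invoice No.", repeats: int = 50) -> str:
    """Simulate the repetitive-loop failure mode used as a 'rejected' sample."""
    return text + loop * repeats

def build_dpo_pair(image_prompt: str, page: dict) -> dict:
    """One preference pair: the clean structured transcription is 'chosen',
    a looping variant of the same page is 'rejected'."""
    chosen = json.dumps(page, ensure_ascii=False)
    degenerate = dict(page, text=make_degenerate(page["text"]))
    rejected = json.dumps(degenerate, ensure_ascii=False)
    return {"prompt": image_prompt, "chosen": chosen, "rejected": rejected}

pair = build_dpo_pair("<ocr page_001.png>", clean_page)
```

Training on such pairs pushes the model's likelihood away from looping continuations while keeping the strict JSON structure learned during SFT.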

The work also introduces the DharmaOCR-Benchmark, a comprehensive evaluation suite covering printed, handwritten, and legal/administrative documents. It proposes a unified protocol that measures both fidelity (accuracy) and structure, while explicitly tracking text degeneration and unit cost as first-class metrics. This provides a much-needed standardized framework for comparing OCR systems, moving beyond simple accuracy scores to include critical operational and economic factors.

Key Points
  • DharmaOCR Full (7B) and Lite (3B) achieve extraction quality scores of 0.925 and 0.911 on a new benchmark, outperforming commercial APIs.
  • Novel DPO training using degenerate text as negative examples reduces harmful 'looping' behavior by up to 87.6%, cutting latency and cost.
  • AWQ quantization reduces per-page inference cost by 22%, making the models a cost-effective alternative to proprietary OCR services.

Why It Matters

This provides a high-accuracy, low-cost, and reliable open-source alternative for automating document processing in finance, legal, and administration.