Research & Papers

A Reliability Evaluation of Hybrid Deterministic-LLM Based Approaches for Academic Course Registration PDF Information Extraction

Researchers benchmarked three extraction strategies on 1,000 academic PDFs and found a hybrid Camelot-plus-LLM pipeline, paired with Qwen 2.5, to be the fastest and most accurate.

Deep Dive

A new arXiv study provides a rigorous benchmark for extracting structured data from messy academic PDFs, a common automation headache. Researchers tested three strategies on 1,000 course registration documents: using an LLM alone, a hybrid of regex and an LLM, and a pipeline built on the Camelot table extractor with an LLM fallback for errors. The models—Gemma 3, Phi 4, and Qwen 2.5 (all 12-14B parameters)—were run locally via Ollama on a standard CPU, deliberately avoiding GPU acceleration to simulate resource-constrained environments.

The results are a clear win for pragmatic, hybrid AI engineering. The Camelot pipeline with LLM fallback delivered the best combination of speed and precision, achieving exact match and Levenshtein similarity scores up to 0.99-1.00. Crucially, it processed most PDFs in less than one second. The Qwen 2.5:14b model emerged as the most consistent performer across all tests. While the pure LLM approach was flexible, it was less efficient, especially for deterministic metadata like course codes.
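The exact match (EM) and Levenshtein similarity (LS) scores cited above can be illustrated with a minimal sketch. This is our own implementation of the standard metrics, not code from the paper; function names are ours:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized to [0, 1]: 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def exact_match(a: str, b: str) -> float:
    """EM is all-or-nothing: 1.0 only on a perfect string match."""
    return float(a == b)
```

A near-miss like `"CS101"` vs. `"CS-101"` scores 0.0 on EM but roughly 0.83 on LS, which is why the paper reports both: LS credits outputs that are almost right, EM does not.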

This research validates a key principle for production AI: deterministic rules and specialized libraries (like Camelot for tables) should handle what they can, with LLMs acting as intelligent fallbacks for ambiguity. This architecture dramatically reduces cost and latency while boosting reliability. The study demonstrates that high-accuracy document automation is feasible without expensive cloud APIs or high-end hardware, opening doors for more institutions and businesses to deploy robust, local AI solutions.

Key Points
  • Hybrid Camelot+LLM pipeline achieved near-perfect accuracy (0.99-1.00 EM/LS) on 1,000 test PDFs.
  • The system processed most documents in under 1 second running locally on a consumer CPU without a GPU.
  • The Qwen 2.5:14b model outperformed Gemma 3 and Phi 4 as the most reliable LLM component in the hybrid setup.

Why It Matters

Enables fast, accurate, and affordable document automation for businesses and institutions without relying on costly cloud APIs or GPU hardware.