Research & Papers

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

A new benchmark reveals that data preparation quality, not just the LLM, is the dominant factor in RAG system performance.

Deep Dive

A team of Portuguese researchers has published a study, "From PDF to RAG-Ready," that systematically evaluates how the quality of document preprocessing affects the final performance of Retrieval-Augmented Generation (RAG) systems. The study fills a critical gap by moving beyond simple tool comparisons to assess four open-source PDF conversion frameworks—Docling, MinerU, Marker, and DeepSeek OCR—across 19 processing pipelines. The pipelines varied the conversion tool, cleaning steps, text-splitting strategy, and metadata enrichment. The goal was to measure their direct impact on downstream question-answering accuracy, a metric often overlooked in favor of raw extraction quality.
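The pipeline grid can be pictured as combinations along three axes: a converter, a splitting strategy, and optional enrichment steps. Below is a minimal illustrative sketch; the axis names and values are hypothetical, and this toy grid yields 24 combinations rather than the paper's 19 curated pipelines.

```python
from itertools import product

# Hypothetical axes of variation, loosely mirroring the study's design.
converters = ["docling", "mineru", "marker", "deepseek_ocr"]
splitters = ["fixed_size", "hierarchical"]
enrichments = [(), ("image_descriptions",), ("image_descriptions", "metadata")]

def build_pipelines():
    """Enumerate candidate preprocessing pipelines as plain config dicts."""
    return [
        {"converter": c, "splitter": s, "enrichment": list(e)}
        for c, s, e in product(converters, splitters, enrichments)
    ]

pipelines = build_pipelines()
print(len(pipelines))  # 4 converters x 2 splitters x 3 enrichment sets = 24
```

Each config dict would then drive one end-to-end run of the same QA benchmark, so that only the preprocessing varies between runs.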

Evaluation was performed on a corpus of 36 complex Portuguese administrative documents (1,706 pages) using a manually curated 50-question benchmark. An LLM-as-judge scored answers over 10 runs. The results were bounded by two baselines: a naïve PDFLoader (86.9% accuracy) and perfectly curated manual Markdown (97.1%). The top-performing automated pipeline, using the Docling framework with hierarchical splitting and image descriptions, achieved 94.1% accuracy. Crucially, the research found that metadata enrichment and hierarchy-aware chunking contributed more to final accuracy than the choice of the underlying PDF conversion tool. Font-based hierarchy reconstruction consistently outperformed more complex LLM-based approaches.
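Hierarchy-aware chunking with heading metadata can be sketched as splitting the converted Markdown at headings and carrying the current heading path along with each chunk. This is a simplified illustration of the general technique, not the actual implementation in Docling or the other frameworks:

```python
import re

def hierarchical_chunks(markdown: str):
    """Split Markdown at headings; attach the heading path as chunk metadata."""
    path = {}      # heading level -> current title at that level
    chunks = []
    buffer = []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"text": text,
                           "headings": [path[k] for k in sorted(path)]})
        buffer.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            path[level] = m.group(2).strip()
            # Entering a new section: drop deeper headings from the old branch.
            for k in [k for k in path if k > level]:
                del path[k]
        else:
            buffer.append(line)
    flush()
    return chunks

doc = "# Decree 12\n## Article 1\nScope of the decree.\n## Article 2\nPenalties apply."
for c in hierarchical_chunks(doc):
    print(c["headings"], "->", c["text"])
# ['Decree 12', 'Article 1'] -> Scope of the decree.
# ['Decree 12', 'Article 2'] -> Penalties apply.
```

Embedding the heading path alongside (or prepended to) the chunk text is one simple form of the metadata enrichment the study found so impactful: a retrieved fragment like "Penalties apply." becomes answerable only once the retriever also sees "Decree 12 > Article 2".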

The study also delivered a surprising negative result: an exploratory GraphRAG implementation scored only 82%, underperforming basic RAG. This suggests that naïve knowledge graph construction, without proper ontological guidance, adds complexity without a corresponding gain in performance. The overarching conclusion is unambiguous: for professionals building RAG applications, investing in high-quality, intelligent data preparation pipelines matters more for end-user accuracy than simply selecting the most powerful large language model.

Key Points
  • Docling with hierarchical splitting achieved 94.1% accuracy in QA tests, nearly matching manually curated text at 97.1%.
  • Metadata enrichment and smart, hierarchy-aware chunking were more impactful on final accuracy than the choice of PDF conversion tool.
  • An exploratory GraphRAG implementation underperformed, scoring 82% vs. 94.1% for top RAG, showing added complexity isn't always better.

Why It Matters

For AI engineers, these results indicate that investing in robust data preprocessing pipelines is more critical for RAG success than simply swapping in a better LLM.