Research & Papers

Evidence Units: Ontology-Grounded Document Organization for Parser-Independent Retrieval

New method groups figures, tables, and text into 'Evidence Units,' improving Recall@1 from 0.15 to 0.51.

Deep Dive

A new research paper by Yeonjee Han tackles a fundamental flaw in how AI systems retrieve information from complex documents. Current methods often parse documents into isolated chunks, scattering a figure, its caption, and the explanatory paragraph across separate retrieval candidates. This fragmentation hurts Retrieval-Augmented Generation (RAG) systems, because no single candidate holds all the evidence needed to answer a query. Han's solution is a pipeline that constructs 'Evidence Units' (EUs): semantically complete chunks that keep visual assets like tables and figures attached to their surrounding context.
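To make the idea concrete, here is a minimal sketch of an Evidence Unit as a data structure. The class names, fields, and the `to_chunk` serialization are hypothetical illustrations, not the schema used in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """One parsed page element (hypothetical schema, not the paper's)."""
    element_id: str
    role: str          # e.g. "figure", "table", "caption", "paragraph"
    text: str = ""


@dataclass
class EvidenceUnit:
    """A visual asset kept together with its caption and explanatory text."""
    anchor: Element                       # the figure or table itself
    caption: Optional[Element] = None
    paragraphs: List[Element] = field(default_factory=list)

    def to_chunk(self) -> str:
        """Serialize the whole unit as a single retrieval candidate."""
        parts = [f"[{self.anchor.role}] {self.anchor.text}".strip()]
        if self.caption:
            parts.append(self.caption.text)
        parts.extend(p.text for p in self.paragraphs)
        # Skip empty strings so the chunk has no blank lines.
        return "\n".join(part for part in parts if part)
```

Indexing the output of `to_chunk()` rather than each element separately means a query that matches the explanatory paragraph also retrieves the figure and caption it refers to.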

The core innovation is a parser-agnostic architecture, described as having four parts; three are outlined here. First, an ontology that extends the Document Components Ontology (DoCO) normalizes outputs from different parsers, such as MinerU and Docling, into a unified schema. Second, a semantic global assignment algorithm uses a full similarity matrix to optimally link paragraphs to the correct EU. Third, a graph-based layer in Neo4j formalizes construction rules and validates chunk completeness. The result is consistent performance regardless of the underlying document parser.
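The first two steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the label table `ROLE_MAP` is hypothetical, bag-of-words cosine similarity stands in for whatever semantic embedding the paper uses, and a thresholded best match per paragraph stands in for its optimal assignment. What the sketch keeps is the central idea of scoring every paragraph against every EU before deciding.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical label-to-role table. Real parser labels and the paper's
# DoCO-based schema may differ.
ROLE_MAP: Dict[str, str] = {
    "image": "figure",
    "picture": "figure",
    "table": "table",
    "text": "paragraph",
    "paragraph": "paragraph",
    "caption": "caption",
}


def normalize_role(parser_label: str) -> str:
    """Map a parser-specific label onto one unified role name."""
    return ROLE_MAP.get(parser_label.lower(), "other")


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(paragraphs: List[str], eu_texts: List[str]) -> List[List[float]]:
    """Full paragraphs x EUs matrix: every pair is scored, not just neighbors."""
    p_vecs = [bag_of_words(p) for p in paragraphs]
    e_vecs = [bag_of_words(e) for e in eu_texts]
    return [[cosine(p, e) for e in e_vecs] for p in p_vecs]


def assign_paragraphs(
    paragraphs: List[str], eu_texts: List[str], threshold: float = 0.1
) -> List[Optional[int]]:
    """For each paragraph, return the index of its best EU, or None if no
    EU scores at least `threshold` (an assumed cutoff)."""
    assignments: List[Optional[int]] = []
    for row in similarity_matrix(paragraphs, eu_texts):
        if not row:
            assignments.append(None)
            continue
        best = max(range(len(row)), key=row.__getitem__)
        assignments.append(best if row[best] >= threshold else None)
    return assignments
```

For example, given the captions "Figure 1: training loss over epochs" and "Table 2: latency per parser", a paragraph mentioning the training loss in Figure 1 is assigned to the first EU, while an acknowledgments sentence with no lexical overlap is left unassigned.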

Experiments on the OmniDocBench v1.0 dataset (1,340 pages) show large gains. Recall@1, the fraction of queries whose top result contains the answer, rose from 0.15 to 0.51, a 3.4x improvement. For purely text-based queries the relative gain was larger still, from 0.08 to 0.47 (about 5.9x). The method also improved the LCS (Longest Common Subsequence) metric by 0.31. These gains held across different parsers, which supports the pipeline's parser independence and its practical value for building more reliable document AI.
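For readers unfamiliar with these metrics, the sketch below shows one plausible way to compute them. Two assumptions are worth flagging: a "hit" is judged here by simple substring containment, and the LCS score is normalized by the reference length. The benchmark's exact matching and normalization rules may differ.

```python
from typing import Sequence


def recall_at_1(ranked_results: Sequence[Sequence[str]], answers: Sequence[str]) -> float:
    """Fraction of queries whose top-ranked chunk contains the answer string."""
    if not answers:
        return 0.0
    hits = sum(
        1
        for results, answer in zip(ranked_results, answers)
        if results and answer in results[0]
    )
    return hits / len(answers)


def lcs_length(a: str, b: str) -> int:
    """Longest common subsequence length (dynamic programming, two rows)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(candidate: str, reference: str) -> float:
    """LCS length divided by reference length (assumed normalization)."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


# The reported improvement factor is the ratio of the two Recall@1 scores.
improvement = round(0.51 / 0.15, 1)
```

Because Recall@1 only looks at the first result, it rewards chunking schemes that put all the needed evidence into one candidate, which is exactly what Evidence Units aim to do.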

Key Points
  • Evidence Units (EUs) group figures, tables, and explanatory text into single, semantic chunks for retrieval.
  • The parser-independent method improved Recall@1 on OmniDocBench by 3.4x, from 0.15 to 0.51.
  • A graph-based layer in Neo4j formalizes construction rules and validates chunk completeness, and the gains hold across different document parsers.

Why It Matters

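This addresses a major weakness in enterprise RAG systems: figures, tables, and their explanatory text being split across separate chunks. Keeping them together could make document AI for research, legal, and technical fields noticeably more accurate.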