Research & Papers

SPIRE: Structure-Preserving Interpretable Retrieval of Evidence

SPIRE treats documents as trees, not flat text, boosting citation quality by 40%

Deep Dive

A team of researchers led by Mike Rainey, Umut Acar, and Muhammed Sezer has released a paper introducing SPIRE (Structure-Preserving Interpretable Retrieval of Evidence), a retrieval pipeline designed to overcome the limitations of current retrieval-augmented generation (RAG) systems on semi-structured sources such as HTML. Traditional pipelines linearize documents into fixed-size chunks before indexing. This obscures structural elements such as section headers, lists, and tables, and makes it hard to return precise, citation-ready evidence without losing the surrounding context needed for interpretability.

SPIRE addresses this by operating over tree-structured documents and representing candidates as 'subdocuments': precise, addressable selections that preserve structural identity. The system defines a small set of document primitives, including paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds non-local scaffolding (e.g., titles, headers), while local contextualization expands a seed within its structural neighborhood to produce a compact, context-rich view under a target budget. On top of an embedding-based candidate generator that indexes sentence-seeded subdocuments, SPIRE adds a query-time aggregation step and a contextual filtering stage that re-scores candidates using locally contextualized views. In experiments on HTML QA benchmarks, SPIRE consistently outperformed strong passage-based baselines, delivering higher-quality, more diverse citations under fixed budgets while maintaining scalability.
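To make the primitives concrete, here is a minimal Python sketch of paths, path sets, extraction by pruning, and global contextualization. It is an illustration under stated assumptions, not the paper's implementation: the `Node` class, the tuple-of-child-indices path encoding, and the rule that "scaffolding" means heading children of a selection's ancestors are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Assumption: a path is the sequence of child indices from the root.
Path = Tuple[int, ...]
HEADING_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def node_at(root: Node, path: Path) -> Node:
    """Follow child indices from the root to the addressed node."""
    node = root
    for i in path:
        node = node.children[i]
    return node


def prune(node: Node, keep: Set[Path], prefix: Path = ()) -> Optional[Node]:
    """Extract a subdocument: kept subtrees plus the ancestors that hold them."""
    if prefix in keep:
        return node  # a kept path selects its whole subtree
    kids = []
    for i, child in enumerate(node.children):
        sub = prune(child, keep, prefix + (i,))
        if sub is not None:
            kids.append(sub)
    if not kids:
        return None  # nothing selected below this node
    return Node(node.tag, node.text, kids)


def global_context(root: Node, paths: Set[Path]) -> Set[Path]:
    """Add heading children of every ancestor of each selected path."""
    out = set(paths)
    for path in paths:
        for depth in range(len(path)):
            ancestor = path[:depth]
            for i, child in enumerate(node_at(root, ancestor).children):
                if child.tag in HEADING_TAGS:
                    out.add(ancestor + (i,))
    return out


def render(node: Node) -> List[str]:
    """Flatten a (sub)document to its text in document order."""
    lines = [node.text] if node.text else []
    for child in node.children:
        lines.extend(render(child))
    return lines


doc = Node("article", children=[
    Node("h1", "SPIRE overview"),
    Node("section", children=[
        Node("h2", "Method"),
        Node("p", "Candidates are subdocuments."),
        Node("ul", children=[Node("li", "paths"), Node("li", "path sets")]),
    ]),
    Node("section", children=[
        Node("h2", "Results"),
        Node("p", "Evaluated on HTML QA benchmarks."),
    ]),
])

seed = {(1, 2, 0)}  # path set addressing the first list item
print(render(prune(doc, seed)))                       # ['paths']
print(render(prune(doc, global_context(doc, seed))))  # ['SPIRE overview', 'Method', 'paths']
```

The pruned result is still a tree, so the selected list item remains addressable by its original path, and the added headings show where in the document it came from.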

Key Points
  • SPIRE preserves each document's tree structure instead of flattening it into chunks, enabling precise subdocument retrieval
  • Introduces two contextualization mechanisms: global (headers, titles) and local (structural neighborhood expansion)
  • Outperforms passage-based baselines on HTML QA benchmarks, producing higher-quality and more diverse citations
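
Local contextualization under a budget can be pictured with a small Python sketch. This is one plausible reading, not the paper's algorithm: the greedy order (nearest in the tree first, ties broken by document order), the word-count budget, and the names `tree_distance` and `local_context` are assumptions made for illustration.

```python
from typing import List, Tuple

Path = Tuple[int, ...]   # assumption: child indices from the root
Leaf = Tuple[Path, str]  # a text-bearing node and its path


def tree_distance(a: Path, b: Path) -> int:
    """Number of edges between two nodes via their lowest common ancestor."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def local_context(leaves: List[Leaf], seed: Path, budget: int) -> List[str]:
    """Grow a view around the seed, nearest structural neighbors first,
    until the next neighbor would exceed the word budget."""
    seed_pos = next(pos for pos, (path, _) in enumerate(leaves) if path == seed)
    order = sorted(
        range(len(leaves)),
        key=lambda pos: (tree_distance(leaves[pos][0], seed), abs(pos - seed_pos)),
    )
    chosen, used = [], 0
    for pos in order:
        cost = len(leaves[pos][1].split())
        # The seed itself is always kept, even if it alone exceeds the budget.
        if used + cost > budget and pos != seed_pos:
            break
        chosen.append(pos)
        used += cost
    return [leaves[pos][1] for pos in sorted(chosen)]  # back to document order


# Leaves of a small hypothetical HTML page, in document order.
leaves = [
    ((0,), "SPIRE overview"),
    ((1, 0), "Method"),
    ((1, 1), "Candidates are subdocuments."),
    ((1, 2, 0), "paths"),
    ((1, 2, 1), "path sets"),
    ((2, 0), "Results"),
    ((2, 1), "Evaluated on HTML QA benchmarks."),
]

print(local_context(leaves, (1, 2, 0), budget=3))
# ['paths', 'path sets']
print(local_context(leaves, (1, 2, 0), budget=7))
# ['Method', 'Candidates are subdocuments.', 'paths', 'path sets']
```

With a small budget the view stays inside the seed's list; a larger budget pulls in the enclosing section's heading and paragraph before anything from a neighboring section.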

Why It Matters

SPIRE could significantly improve RAG systems for web and document QA, making AI citations more accurate and interpretable.