SPIRE: Structure-Preserving Interpretable Retrieval of Evidence
SPIRE treats documents as trees, not flat text, boosting citation quality by 40%
A team of researchers led by Mike Rainey, Umut Acar, and Muhammed Sezer has released a paper introducing SPIRE (Structure-Preserving Interpretable Retrieval of Evidence), a retrieval pipeline designed to overcome the limitations of current retrieval-augmented generation (RAG) systems on semi-structured sources such as HTML. Traditional pipelines linearize documents into fixed-size chunks before indexing. That obscures structural elements such as section headers, lists, and tables, and makes it hard to return precise, citation-ready evidence without losing the surrounding context needed for interpretability.
SPIRE addresses this by operating over tree-structured documents and representing candidates as 'subdocuments': precise, addressable selections that preserve structural identity. The system defines a small set of document primitives, including paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds non-local scaffolding (e.g., titles, headers), while local contextualization expands a seed within its structural neighborhood to produce a compact, context-rich view under a target budget. On top of an embedding-based candidate generator that indexes sentence-seeded subdocuments, SPIRE adds a query-time aggregation step and a contextual filtering stage that re-scores candidates using locally contextualized views. In experiments on HTML QA benchmarks, SPIRE consistently outperformed strong passage-based baselines, delivering higher-quality, more diverse citations under fixed budgets while maintaining scalability.
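To make the primitives more concrete, here is a minimal Python sketch of paths, pruning-based subdocument extraction, and global contextualization. It is an illustration under stated assumptions, not the paper's implementation: the `Node` type, the function names `get`, `prune`, `globally_contextualize`, and `render`, and the rule that "scaffolding" means heading children of every ancestor are all hypothetical choices made for this example.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


# A path is the sequence of child indices from the root; () is the root.
Path = tuple

HEADINGS = {"title", "h1", "h2", "h3"}  # assumption: what counts as scaffolding


def get(root: Node, path: Path) -> Node:
    """Follow child indices from the root to the addressed node."""
    node = root
    for i in path:
        node = node.children[i]
    return node


def prune(node: Node, keep: set, here: Path = ()) -> Node | None:
    """Keep selected nodes (with their subtrees) plus the ancestors that
    connect them to the root; drop everything else."""
    if here in keep:
        return copy.deepcopy(node)
    kept = []
    for i, child in enumerate(node.children):
        sub = prune(child, keep, here + (i,))
        if sub is not None:
            kept.append(sub)
    return Node(node.tag, node.text, kept) if kept else None


def globally_contextualize(root: Node, paths: set) -> set:
    """Add the heading children of every ancestor of each selected path."""
    out = set(paths)
    for path in paths:
        for depth in range(len(path)):
            ancestor = get(root, path[:depth])
            for i, child in enumerate(ancestor.children):
                if child.tag in HEADINGS:
                    out.add(path[:depth] + (i,))
    return out


def render(node: Node) -> list:
    """Flatten a (sub)document to its text lines in document order."""
    lines = [node.text] if node.text else []
    for child in node.children:
        lines.extend(render(child))
    return lines


doc = Node("article", children=[
    Node("h1", "Install guide"),
    Node("section", children=[
        Node("h2", "Linux"),
        Node("p", "Use the package manager."),
        Node("p", "Then reboot."),
    ]),
    Node("section", children=[
        Node("h2", "Windows"),
        Node("p", "Run the installer."),
    ]),
])

seed = {(1, 2)}  # path set addressing the sentence "Then reboot."
bare = render(prune(doc, seed))
with_context = render(prune(doc, globally_contextualize(doc, seed)))
print(bare)          # ['Then reboot.']
print(with_context)  # ['Install guide', 'Linux', 'Then reboot.']
```

The bare subdocument is precise but hard to interpret on its own; the globally contextualized one stays small while telling the reader which page and section the sentence came from. Because both are path sets over the same tree, the cited span remains addressable in either view.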
- SPIRE preserves document structure (trees) instead of flattening into chunks, enabling precise subdocument retrieval
- Introduces two contextualization mechanisms: global (headers, titles) and local (structural neighborhood expansion)
- Outperforms passage-based baselines on HTML QA benchmarks, producing higher-quality and more diverse citations
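Local contextualization, as summarized above, grows a seed outward through its structural neighborhood until a budget is reached. The following Python sketch shows one plausible way to do that. The greedy strategy, the tree-distance ordering, the word-count budget, and the names `tree_distance` and `locally_contextualize` are assumptions made for illustration; the article does not specify how SPIRE performs this step.

```python
def tree_distance(a: tuple, b: tuple) -> int:
    """Number of edges between two nodes addressed by child-index paths."""
    k = 0
    for x, y in zip(a, b):
        if x != y:
            break
        k += 1
    return (len(a) - k) + (len(b) - k)


def locally_contextualize(leaves: dict, seed: tuple, budget_words: int) -> list:
    """Greedily add the structurally nearest leaves to the seed until the
    next one would exceed the word budget. Returns paths in document order."""
    order = sorted(leaves)  # lexicographic order of paths is document order
    pos = order.index(seed)
    candidates = sorted(
        (p for p in order if p != seed),
        key=lambda p: (tree_distance(seed, p), abs(order.index(p) - pos)),
    )
    chosen = [seed]
    used = len(leaves[seed].split())
    for p in candidates:
        cost = len(leaves[p].split())
        if used + cost > budget_words:
            break  # stop at the first miss so the view stays a tight neighborhood
        chosen.append(p)
        used += cost
    return sorted(chosen)


# Text-bearing nodes of a small page, keyed by path (hypothetical data).
leaves = {
    (0,): "Install guide",
    (1, 0): "Linux",
    (1, 1): "Use the package manager.",
    (1, 2): "Then reboot.",
    (2, 0): "Windows",
    (2, 1): "Run the installer.",
}

view = locally_contextualize(leaves, seed=(1, 2), budget_words=8)
print([leaves[p] for p in view])
# ['Linux', 'Use the package manager.', 'Then reboot.']
```

With an 8-word budget the seed pulls in its own section's heading and preceding sentence but nothing from the neighboring "Windows" section. A view like this is what a re-scoring stage could read instead of the bare seed sentence.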
Why It Matters
SPIRE could significantly improve RAG systems for web and document QA, making AI citations more accurate and interpretable.