AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis
A new system replaces expensive vector search with lightweight structured queries, significantly reducing LLM usage.
Researchers Teng Lin, Yuyu Luo, and Nan Tang have proposed AnnoRetrieve, a new paradigm for retrieving information from unstructured documents like those found in enterprise and web data. The system directly addresses the high computational cost and frequent, expensive LLM calls required by current mainstream methods like embedding-based vector search, which rely on coarse-grained semantic similarity. AnnoRetrieve's core innovation is shifting the retrieval foundation from vector embeddings to structured annotations, enabling precise, annotation-driven semantic matching.
AnnoRetrieve integrates two synergistic components. The first is SchemaBoot, which automatically generates document annotation schemas through multi-granularity pattern discovery and constraint-based optimization, eliminating the need for manual schema design. The second is the Structured Semantic Retrieval (SSR) engine, which unifies semantic understanding with structured query execution. By leveraging the annotated structure, SSR can perform precise semantic matching, attribute-value extraction, table generation, and even progressive SQL-based reasoning without relying on LLM interventions for post-processing.
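To make the idea of structured query execution over annotations more concrete, here is a minimal sketch in Python. It is an illustration only and is not the authors' implementation. The annotation table layout, the attribute names (`company`, `revenue_musd`), the sample values, and the helper functions are all hypothetical. The sketch assumes annotations have already been extracted into (document, attribute, value) rows and shows how a question such as "which companies report revenue above 100?" could then be answered with plain SQL and no LLM call at query time.

```python
import sqlite3

# Hypothetical annotations: one row per (document, attribute, value).
# In a real system these would come from an annotation step; the values
# here are made up purely for illustration.
ANNOTATIONS = [
    ("doc1", "company", "Acme"),
    ("doc1", "revenue_musd", "120"),
    ("doc2", "company", "Globex"),
    ("doc2", "revenue_musd", "85"),
    ("doc3", "company", "Initech"),
    ("doc3", "revenue_musd", "240"),
]


def build_store(rows):
    """Load annotation rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE annotation (doc_id TEXT, attr TEXT, value TEXT)")
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)
    return conn


def companies_with_revenue_over(conn, threshold):
    """Answer a structured question with a self-join over the annotations."""
    query = """
        SELECT c.doc_id, c.value AS company, CAST(r.value AS REAL) AS revenue
        FROM annotation AS c
        JOIN annotation AS r ON r.doc_id = c.doc_id
        WHERE c.attr = 'company'
          AND r.attr = 'revenue_musd'
          AND CAST(r.value AS REAL) > ?
        ORDER BY revenue DESC
    """
    return conn.execute(query, (threshold,)).fetchall()


conn = build_store(ANNOTATIONS)
results = companies_with_revenue_over(conn, 100)
for doc_id, company, revenue in results:
    print(doc_id, company, revenue)
```

The query returns a small table of attribute values rather than a ranked list of text chunks, which is the kind of output that would otherwise require an LLM to read and post-process retrieved passages.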
Experiments conducted on three real-world datasets confirm that this annotation-driven paradigm overcomes the limitations of both traditional vector-based methods (coarse matching, heavy LLM dependency) and graph-based methods (high computational overhead). The results show AnnoRetrieve significantly lowers LLM call frequency and overall retrieval cost while maintaining high accuracy. The authors position it as establishing a new, cost-effective, precise, and scalable paradigm for intelligent document analysis through automated structuring.
- AnnoRetrieve replaces expensive vector search with lightweight structured queries over schemas that SchemaBoot generates automatically.
- The Structured Semantic Retrieval (SSR) engine enables precise matching and SQL-based reasoning without LLM calls for post-processing.
- Validated on three real-world datasets, it significantly reduces LLM usage and cost while maintaining high retrieval accuracy.
Why It Matters
AnnoRetrieve offers enterprises a path to precise document analysis at lower cost by reducing their dependency on expensive LLM APIs.