Semantic Recall for Vector Search
New metric ignores irrelevant 'nearest neighbors' to give a true picture of retrieval quality.
A team of researchers, including Leonardo Kuffo, Ioanna Tsakalidou, and Roberta De Viti, has published a paper introducing a novel evaluation metric called Semantic Recall for vector search. This metric addresses a fundamental flaw in how we measure the performance of approximate nearest neighbor (ANN) search algorithms, which are the backbone of modern retrieval-augmented generation (RAG) and semantic search systems. Traditional recall metrics penalize an algorithm for failing to retrieve every single nearest neighbor in a vector space, even if many of those neighbors are semantically irrelevant to the query. Semantic Recall addresses this by counting only objects that are both semantically relevant and theoretically retrievable via an exact search, providing a much more accurate picture of an algorithm's true retrieval quality.
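To make the distinction concrete, here is a minimal sketch in Python. It assumes an intersection-based form of the metric (restricting standard recall@k to the relevant subset of the exact top-k); the paper's precise definition may differ, and the function names and example data are illustrative only:

```python
def recall_at_k(retrieved, exact_topk):
    """Traditional recall@k: fraction of exact nearest neighbors retrieved."""
    return len(set(retrieved) & set(exact_topk)) / len(exact_topk)

def semantic_recall_at_k(retrieved, exact_topk, relevant):
    """Sketch of Semantic Recall: only neighbors that are both in the
    exact top-k (retrievable) and labeled relevant count toward the score."""
    target = set(exact_topk) & set(relevant)
    if not target:
        return 1.0  # no relevant retrievable neighbors, so nothing to miss
    return len(set(retrieved) & target) / len(target)

# Hypothetical query: 5 exact nearest neighbors, only 2 semantically relevant.
exact_topk = [1, 2, 3, 4, 5]
relevant = {2, 5, 9}           # ground-truth relevance labels
retrieved = [2, 5, 7, 8, 6]    # ANN result that misses neighbors 1, 3, 4

print(recall_at_k(retrieved, exact_topk))                     # 0.4
print(semantic_recall_at_k(retrieved, exact_topk, relevant))  # 1.0
```

In this toy case the ANN index looks weak by traditional recall (0.4) yet found every answer a user would actually care about, which is exactly the gap the new metric is meant to expose.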
The researchers discovered that queries with few relevant results among their nearest neighbors are surprisingly common in real embedding datasets, making traditional metrics misleading. To make Semantic Recall practical even when semantic relevance labels are unavailable, the team also introduced Tolerant Recall, a proxy metric that approximates it. Their empirical work shows that these new metrics are more effective indicators of real-world performance. Crucially, they demonstrate that optimizing ANN algorithms—like those from FAISS or ScaNN—for Semantic or Tolerant Recall can unlock improved trade-offs between computational cost and retrieval accuracy, allowing developers to build faster or more accurate search systems without relying on flawed benchmarks.
- Semantic Recall evaluates search algorithms based only on retrievable, semantically relevant objects, ignoring irrelevant 'nearest neighbors'.
- The team found that queries with few relevant nearest neighbors are common, making traditional recall a poor indicator of quality.
- Optimizing algorithms for the new metrics can lead to better cost-quality tradeoffs in production RAG and search applications.
Why It Matters
Enables developers to build more accurate and efficient RAG pipelines and search engines by using better evaluation benchmarks.