arXiv study: Top text embeddings miss 80% of research agenda matches in RAG
Cosine similarity can't capture research agendas: a new audit of 3.58M papers reveals the gap.
A new arXiv paper by Junseon Yoo challenges the foundational assumption of retrieval-augmented generation (RAG): that cosine similarity between text embeddings captures conceptual relatedness. To test this, Yoo built an augmented citation graph over 3.58M scientific papers and partitioned it using the Leiden algorithm at two granularities: sub-field (L1) and research-agenda (L2). Four state-of-the-art embeddings were evaluated: Gemini, Qwen3-8B, Qwen3-0.6B, and SPECTER2. The metric is the top-10 same-rate, the share of a query paper's ten nearest neighbors that fall in the same cluster as the query. At L1, the models performed reasonably (45–52%), but at L2 the results collapsed: only 15–21% of top-10 neighbors shared the query's research agenda. Put differently, roughly 8 out of every 10 retrieved papers were off-agenda. The failure held across eight scientific domains and all four models, with SPECTER2 performing worst despite its citation-based contrastive training.
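To make the top-10 same-rate concrete, here is a minimal sketch of how such a score could be computed. It assumes a small in-memory matrix of embeddings and one cluster label per paper. The function name `top_k_same_cluster_rate` is hypothetical, and the brute-force similarity matrix is only practical for toy data, not for a 3.58M-paper corpus. This is not the paper's evaluation code.

```python
import numpy as np

def top_k_same_cluster_rate(embeddings, labels, k=10):
    """Mean fraction of each paper's k nearest cosine neighbors that share its cluster label.

    Hypothetical helper for illustration; requires k <= number of papers - 1.
    """
    X = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Normalize rows so that a dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    # Exclude each paper from its own neighbor list.
    np.fill_diagonal(sims, -np.inf)
    # Indices of the k most similar papers per row (order inside the top k does not matter).
    top_k = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    # True where a neighbor carries the same cluster label as the query paper.
    same = labels[top_k] == labels[:, None]
    return float(same.mean())

# Toy data: two tight clusters of two papers each.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
rate_at_1 = top_k_same_cluster_rate(toy_embeddings, toy_labels, k=1)
rate_at_3 = top_k_same_cluster_rate(toy_embeddings, toy_labels, k=3)
```

In the toy data each paper's single nearest neighbor is its cluster mate, while widening to three neighbors necessarily pulls in papers from the other cluster, so the rate drops.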
As a diagnostic probe, Yoo tested whether the same augmented citation graph could serve as a retrieval signal. A deliberately simple citation-count rerank, applied on top of LLM-expanded Boolean retrieval and plain BM25, achieved 57.7% and 59.6% top-1 L2 accuracy, respectively, on 80 curated agenda queries. The BM25-based variant lands roughly 9 points above the best cosine retriever (Gemini: 50.6%) and 20 points above BM25 alone (39.3%); the Boolean-based variant is about 7 points above Gemini. The result isolates a slice of agenda-matching signal that the citation graph carries but embeddings miss, connecting recent theoretical limits on single-vector retrieval to a concrete failure mode of scientific RAG. For anyone building RAG pipelines over research literature, the study is a clear warning: cosine similarity is not enough when precision on research agendas matters.
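A citation-count rerank of this kind can be sketched in a few lines. The version below is one plausible reading, not the paper's implementation: it assumes a first-stage retriever (BM25 or Boolean) has already produced a ranked list of paper IDs, and that citation counts are available in a dictionary. The names `rerank_by_citation_count` and `pool_size`, and the tie-breaking rule, are assumptions made for illustration.

```python
def rerank_by_citation_count(candidates, citation_counts, pool_size=50):
    """Reorder the top of a first-stage ranking by citation count, highest first.

    candidates: paper IDs in first-stage order (e.g. from BM25).
    citation_counts: dict mapping paper ID to citation count; missing IDs count as 0.
    pool_size: how many first-stage candidates to rerank (an assumed parameter).
    """
    pool = list(candidates)[:pool_size]
    # sorted() is stable even with reverse=True, so papers with equal
    # citation counts keep their first-stage order.
    return sorted(pool, key=lambda pid: citation_counts.get(pid, 0), reverse=True)

# Toy example with made-up IDs and counts.
first_stage = ["a", "b", "c", "d"]
counts = {"a": 3, "b": 10, "c": 10}
reranked = rerank_by_citation_count(first_stage, counts)
reranked_small_pool = rerank_by_citation_count(first_stage, counts, pool_size=2)
```

Here the first stage decides which papers are considered at all, and the citation signal only decides their order within that pool.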
- Four SOTA embeddings (Gemini, Qwen3-8B, Qwen3-0.6B, SPECTER2) achieved only 15–21% top-10 agenda match rate (L2) across 8 domains.
- A simple citation-count rerank on top of LLM-Boolean retrieval hit 57.7% top-1 L2 accuracy, about 7 points above the best embedding (Gemini, 50.6%); the same rerank over BM25 reached 59.6%, about 9 points above.
- SPECTER2, despite citation-based training, was the weakest embedder, underscoring that current methods miss agenda-level signals.
Why It Matters
For scientific RAG, cosine similarity misses research agendas — a critical flaw that demands better retrieval signals beyond embeddings.