Research & Papers

New RAG study reveals topic sampling as key variable in document ordering effects

Researchers find small topic sets mask or exaggerate position biases in retrieval-augmented generation.

Deep Dive

Researchers from the University of A Coruña have published a systematic reproducibility study on Retrieval-Augmented Generation (RAG) systems, titled "Lost in the Evidence? Reproducing Document Position and Context Size Effects in RAG." The paper, submitted to arXiv on May 26, 2026, addresses inconsistent empirical findings on document-ordering biases, such as the "lost in the middle" effect, and on context-size effects. The team first demonstrates that topic sampling is a major source of variance: small topic sets can either mask or exaggerate position effects. They then provide a practical calibration procedure that identifies topic counts yielding stable trends at a feasible cost, which makes evaluation more reliable.
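The idea of calibrating the topic count can be illustrated with a simple resampling check. The sketch below is not the paper's procedure, whose details are not given in this summary; it is a minimal illustration under assumed inputs. It assumes a table `scores[topic][position]` of per-topic answer quality with the relevant document placed at each position, and the names `position_gap`, `gap_spread`, and `smallest_stable_k` are hypothetical.

```python
import random
import statistics


def position_gap(scores, topic_ids):
    """Mean score at the first position minus mean score at the middle position."""
    mid = len(scores[0]) // 2
    first = statistics.fmean(scores[t][0] for t in topic_ids)
    middle = statistics.fmean(scores[t][mid] for t in topic_ids)
    return first - middle


def gap_spread(scores, k, n_resamples=200, seed=0):
    """Standard deviation of the position gap over random subsets of k topics.

    Assumption: subsets are drawn without replacement from the available topics,
    so k must not exceed len(scores).
    """
    rng = random.Random(seed)
    all_topics = range(len(scores))
    gaps = [
        position_gap(scores, rng.sample(all_topics, k))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(gaps)


def smallest_stable_k(scores, candidate_ks, tolerance, **kwargs):
    """Return the smallest candidate topic count whose spread is within tolerance."""
    for k in sorted(candidate_ks):
        if gap_spread(scores, k, **kwargs) <= tolerance:
            return k
    return None  # no candidate was stable enough
```

A large spread at small `k` is the situation the study warns about: depending on which topics happen to be drawn, the same system can appear to have a strong position effect or none at all. The tolerance and the choice of gap statistic are illustrative and would need to match the metric actually being reported.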

Using these fixed topic sets, the authors reproduce and extend results on position sensitivity in modern LLMs, then move to a more realistic RAG scenario where relevance is mediated by a retriever rather than oracle access. Here, they re-examine a recent industry study and identify discrepancies due to limited topic coverage and reliance on LLM-based judges. Their analysis shows that retrieval order and context size interact strongly with retrieval quality and model choice, meaning conclusions from idealized setups don't transfer well to production RAG pipelines. The team has released all code and configurations to support future work on robust RAG evaluation.
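For readers unfamiliar with how position sensitivity is usually probed in the idealized (oracle) setting, the following is a minimal sketch of the common setup: a known relevant document is placed at each slot among distractors and the model is queried once per ordering. The function name `contexts_by_position` is hypothetical, and the paper's exact protocol may differ.

```python
def contexts_by_position(relevant_doc, distractors):
    """Yield (position, ordered_docs) with the relevant document at each slot."""
    for pos in range(len(distractors) + 1):
        # Distractors keep their original relative order; only the relevant
        # document moves through the context.
        yield pos, distractors[:pos] + [relevant_doc] + distractors[pos:]
```

In a retriever-mediated pipeline the relevant document may be missing from the context entirely or ranked unpredictably, which is one reason results from this kind of sweep need not carry over.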

Key Points
  • Topic sampling is a major source of variance; small topic sets can mask or exaggerate ordering effects.
  • The authors propose a calibration procedure that identifies topic counts yielding stable trends at a feasible cost.
  • Real-world RAG pipelines show strong interactions between retrieval order, context size, retrieval quality, and model choice.

Why It Matters

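The study gives RAG developers a calibration procedure for choosing how many topics to evaluate on, and it cautions that conclusions about document ordering and context size drawn from idealized setups may not hold in production retrieval pipelines.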