Research & Papers

New study: answer retention beats wording in RAG pipeline performance

14 document representations tested across 4 generators reveal the hidden driver of accuracy...

Deep Dive

The authors tested 14 document representations in RAG pipelines, varying selection, summarisation, and reformulation. Across four generators, they found that answer retention (whether a known answer-bearing document still supports its answer after transformation) is the primary determinant of accuracy. When retention is high, wording, structure, length, and query-dependence have limited effect. The paper suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content.
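To make the idea of answer retention concrete, here is a minimal Python sketch of one way to approximate it. It is not the paper's measurement: it assumes that "still supports its answer" can be proxied by checking whether a gold answer string appears in the transformed document, which is cruder than the definition above. The function names (`normalize`, `retains_answer`, `retention_rate`) and the sample data are hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def retains_answer(transformed_doc: str, answers: list[str]) -> bool:
    """Return True if any gold answer appears in the transformed document.

    Assumption: whole-token string matching stands in for "the document
    still supports the answer". Padding with spaces avoids matching an
    answer inside a longer word.
    """
    doc = f" {normalize(transformed_doc)} "
    for answer in answers:
        needle = normalize(answer)
        if needle and f" {needle} " in doc:
            return True
    return False


def retention_rate(examples: list[tuple[str, list[str]]]) -> float:
    """Fraction of (transformed_doc, gold_answers) pairs that retain an answer."""
    if not examples:
        return 0.0
    kept = sum(retains_answer(doc, answers) for doc, answers in examples)
    return kept / len(examples)


# Hypothetical data: two summaries of the same answer-bearing document.
examples = [
    ("Paris is the capital of France.", ["Paris"]),       # answer survives
    ("The capital is a large European city.", ["Paris"]),  # answer lost
]
print(retention_rate(examples))  # 0.5
```

Comparing this rate across transformations (for example, a summariser against a sentence selector) gives a rough sense of which ones drop answer-bearing content. A string-match proxy misses paraphrased answers and can be fooled by incidental mentions, so treat it as a first-pass check only.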

Key Points
  • 14 document representations tested across 4 LLMs with fixed retrieval
  • Answer retention (whether key facts survive transformation) is the primary accuracy driver
  • Wording, structure, length, and query-dependence have limited effect when retention is high

Why It Matters

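For RAG practitioners: when selecting, summarising, or reformulating retrieved documents, prioritise preserving answer-bearing content over wording or formatting. The study suggests retention is what drives generator accuracy.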