New study: answer retention beats wording in RAG pipeline performance
13 document transformations tested across 4 generators reveal the hidden driver of accuracy...
The authors tested 14 document representations in RAG pipelines, varying selection, summarisation, and reformulation. Across four generators, they found that answer retention (whether a known answer-bearing document still supports its answer after transformation) is the primary determinant of accuracy. When retention is high, wording, structure, length, and query-dependence have limited effect. The paper suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content.
- 14 document representations tested across 4 LLMs with fixed retrieval
- Answer retention (whether key facts survive transformation) is the primary accuracy driver
- Wording, structure, length, and query-dependence have limited effect when retention is high
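To make the idea of answer retention concrete, here is a minimal sketch of how one might estimate it for a given transformation. This is not the paper's measurement procedure: it assumes a normalised string match is an acceptable proxy for "still supports its answer", and the names `normalize`, `retains_answer`, `answer_retention`, and `first_words` are hypothetical helpers introduced only for illustration.

```python
import re
from typing import Callable, Iterable, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def retains_answer(doc: str, answers: Iterable[str]) -> bool:
    """Assumed proxy: the document contains any gold answer as a whole phrase."""
    padded = f" {normalize(doc)} "
    return any(
        f" {normalize(a)} " in padded for a in answers if normalize(a)
    )


def answer_retention(
    examples: Sequence[Tuple[str, Sequence[str]]],
    transform: Callable[[str], str],
) -> float:
    """Fraction of answer-bearing documents that still contain the answer
    after `transform` is applied. Documents that never contained the
    answer are skipped, since retention is undefined for them."""
    kept = 0
    total = 0
    for doc, answers in examples:
        if not retains_answer(doc, answers):
            continue  # not answer-bearing to begin with
        total += 1
        if retains_answer(transform(doc), answers):
            kept += 1
    return kept / total if total else 0.0


def first_words(doc: str, n: int = 4) -> str:
    """Toy 'selection' transformation: keep only the first n words."""
    return " ".join(doc.split()[:n])


# Made-up illustrative documents, not data from the study.
examples = [
    ("In 1889 the Eiffel Tower was completed in Paris.", ["1889"]),
    ("Marie Curie won the Nobel Prize in Physics in 1903.", ["1903"]),
]

identity_score = answer_retention(examples, lambda d: d)
truncated_score = answer_retention(examples, first_words)
```

Under this proxy the identity transformation retains every answer, while the toy truncation loses the answer from the second document. A string match will miss paraphrased answers, so an entailment or LLM-based check may be a better fit where transformations rewrite the text heavily.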
Why It Matters
For RAG practitioners: to improve LLM accuracy, prioritise keeping answer-bearing content intact through any document transformation over refining wording, structure, or length.