Research & Papers

Not All Queries Need Rewriting: When Prompt-Only LLM Refinement Helps and Hurts Dense Retrieval

Study finds prompt-only query rewriting degrades retrieval by 9% in some domains while improving it by 5% in others.

Deep Dive

A new research paper by Varun Kotte challenges the common practice of automatically rewriting user queries in RAG (retrieval-augmented generation) systems. The study systematically examines prompt-only, single-step LLM query rewriting (an LLM such as GPT-4 or Claude rewrites the query in a single pass, with no retrieval feedback) across three BEIR benchmarks (FiQA, TREC-COVID, SciFact) and two dense retrievers. The results reveal strongly domain-dependent behavior: rewriting degraded retrieval quality (nDCG@10) by 9.0% on the financial Q&A dataset FiQA, improved it by 5.1% on the biomedical dataset TREC-COVID, and had no significant effect on SciFact.
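
To make the setting concrete, here is a minimal sketch of a prompt-only, single-step rewriting pipeline of the kind the paper evaluates: one LLM call rewrites the query with no retrieval feedback, and the rewrite replaces the user's wording before dense retrieval. The prompt text and the `llm_complete` and `search` hooks are illustrative assumptions, not the paper's exact setup.

```python
from typing import Callable, List, Tuple

# Illustrative instruction; the paper's actual prompt is not reproduced here.
REWRITE_PROMPT = (
    "Rewrite the following search query to be clearer and more specific. "
    "Return only the rewritten query.\n\nQuery: {query}"
)

def rewrite_query(query: str, llm_complete: Callable[[str], str]) -> str:
    """Single-step, prompt-only rewrite: one LLM call, no retrieval feedback."""
    return llm_complete(REWRITE_PROMPT.format(query=query)).strip()

def retrieve_with_rewrite(
    query: str,
    llm_complete: Callable[[str], str],
    search: Callable[[str, int], List[Tuple[str, float]]],  # dense-retriever hook
    k: int = 10,
) -> List[Tuple[str, float]]:
    # The rewrite unconditionally replaces the original query, so any domain
    # terminology the LLM drops is lost to the retriever.
    return search(rewrite_query(query, llm_complete), k)
```

Because the original wording is discarded, the pipeline is only as good as the rewrite, which is exactly where the domain-dependent failures show up.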

The researchers identified a consistent mechanism behind these mixed results: performance degrades when rewriting replaces domain-specific terminology in queries that already match their corpus well, reducing lexical alignment between the rewritten query and the relevant documents. Conversely, performance improves when rewriting shifts queries toward corpus-preferred terminology or resolves inconsistent nomenclature. Lexical substitution occurred in 95% of rewrites, so what matters is the direction of substitution, not whether substitution happens. Even selective rewriting with feature-based gating offered limited benefit: oracle selection, which rewrites only the queries that would gain from it, provided only modest gains.
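
One way to see this mechanism is to check whether a rewrite gained or lost vocabulary shared with the relevant documents. The diagnostic below is a rough illustration of that idea, not a measurement from the paper; the example strings are invented, and a real analysis would at least filter stopwords and normalize for query length.

```python
import re

def tokens(text: str) -> set:
    """Crude lowercase word tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def alignment_shift(original: str, rewritten: str, relevant_docs: list) -> int:
    """Change in the number of query tokens shared with the relevant documents.
    Negative values suggest domain terms were substituted away."""
    doc_vocab = set().union(*(tokens(d) for d in relevant_docs))
    return len(tokens(rewritten) & doc_vocab) - len(tokens(original) & doc_vocab)

# Toy FiQA-style case: the original query already uses corpus terminology,
# and a generic paraphrase drops it.
docs = ["The expense ratio of an index fund covers ongoing management fees."]
print(alignment_shift("index fund expense ratio",
                      "what are the costs of owning mutual funds",
                      docs))  # negative: lexical alignment decreased
```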

These findings have significant implications for AI engineers building production RAG systems. The research suggests that automatic query rewriting—a common component in many RAG pipelines—can be harmful in well-optimized vertical domains. Instead, the paper recommends domain-adaptive post-training as a safer strategy when supervision or implicit feedback is available, challenging the assumption that all queries benefit from LLM refinement before retrieval.
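
The "feature-based gating" mentioned above amounts to a predicate over cheap query signals that decides whether to invoke the rewriter at all. The sketch below is a hypothetical example of that pattern; the features and thresholds are invented, not the paper's, and the study's point is that even an oracle version of such a gate yields only modest gains.

```python
def should_rewrite(query: str, corpus_vocab: set) -> bool:
    """Hypothetical gate: rewrite only when the query looks underspecified.
    The features and thresholds here are illustrative, not the paper's."""
    toks = query.lower().split()
    in_vocab = sum(t in corpus_vocab for t in toks) / max(len(toks), 1)
    # Short queries with little corpus vocabulary are rewrite candidates;
    # queries already aligned with corpus terminology are left alone.
    return len(toks) < 4 or in_vocab < 0.5
```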

Key Points
  • Prompt-only LLM query rewriting degraded retrieval performance by 9.0% on the FiQA financial dataset while improving it by 5.1% on the TREC-COVID biomedical dataset
  • Performance depends on lexical alignment: rewriting harms results when it replaces domain-specific terms in already well-matched queries (95% of rewrites involve lexical substitution)
  • Selective rewriting with feature-based gating does not reliably outperform a never-rewrite baseline, suggesting domain-adaptive post-training is the safer strategy for production systems

Why It Matters

This research forces RAG developers to reconsider automatic query rewriting, potentially saving engineering effort and improving system reliability in production environments.