Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
New method uses LLMs as judges to train rerankers, eliminating costly human annotations and boosting answer accuracy.
A research team from Nanjing University has published a paper introducing ReRanking Preference Optimization (RRPO), a novel reinforcement learning framework designed to address a critical flaw in current Retrieval-Augmented Generation (RAG) systems. In RAG, a 'reranker' model sorts retrieved documents to surface the most useful ones for the LLM to answer a query. Traditionally, these rerankers are trained on static, human-annotated 'relevance' labels, which often fail to reflect which documents actually help an LLM generate a precise answer. RRPO bridges this gap by directly optimizing the reranker for the LLM's end goal: generating high-quality answers.
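To make the setup concrete, here is a minimal sketch of where the reranker sits in a RAG pipeline. All names (`Document`, `Reranker`, `retriever`, `llm_generate`) are hypothetical stand-ins for illustration, not the paper's implementation:

```python
# Minimal RAG pipeline sketch: retrieve -> rerank -> generate.
# Every component here is a hypothetical placeholder.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

class Reranker:
    """Scores a (query, document) pair; e.g., a cross-encoder forward pass."""
    def score(self, query: str, doc: Document) -> float:
        raise NotImplementedError

def answer_with_rag(query, retriever, reranker, llm_generate, k=5):
    candidates = retriever(query)                      # first-stage retrieval (BM25, dense, ...)
    ranked = sorted(candidates, key=lambda d: reranker.score(query, d), reverse=True)
    context = "\n\n".join(d.text for d in ranked[:k])  # keep only the top-k documents
    return llm_generate(f"Context:\n{context}\n\nQuestion: {query}")
```

It is the `reranker.score` step that RRPO retrains: rather than imitating human relevance labels, the scores are optimized for the quality of the final `llm_generate` output.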
RRPO formulates reranking as a sequential decision-making task. Instead of relying on human labels, it uses feedback from the LLM itself (e.g., GPT-4o) as the reward signal for training the reranker, a method known as reinforcement learning from AI feedback (RLAIF). This creates a direct feedback loop in which the reranker learns to prioritize documents that lead to better LLM outputs. The team also introduced a 'reference-anchored deterministic baseline', a variance-reduction device that keeps the policy-gradient training stable. In extensive testing on knowledge-intensive benchmarks, RRPO significantly outperformed powerful existing rerankers, including the list-wise model RankZephyr.
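The article does not include the authors' exact update rule, but the description maps onto a standard REINFORCE-style policy gradient with a baseline. The sketch below is a reconstruction under that assumption: `reranker`, `llm_answer`, `llm_judge`, and `ref_order` are hypothetical, and the fixed-reference baseline merely stands in for the paper's 'reference-anchored deterministic baseline':

```python
import torch

def rrpo_style_step(reranker, optimizer, query, docs, llm_answer, llm_judge, ref_order):
    # 1. Score the documents and sample a ranking. Sampling without
    #    replacement, proportional to exp(score), follows a Plackett-Luce model.
    scores = reranker(query, docs)                               # 1-D tensor, requires grad
    perm = torch.multinomial(torch.softmax(scores, dim=-1), len(docs))

    # 2. Reward: the judge LLM grades the answer generated from the sampled order.
    answer = llm_answer(query, [docs[i] for i in perm.tolist()])
    reward = llm_judge(query, answer)                            # scalar, e.g. in [0, 1]

    # 3. Deterministic baseline: the judge's score under a fixed reference order
    #    (an assumption standing in for the paper's reference-anchored baseline).
    ref_answer = llm_answer(query, [docs[i] for i in ref_order])
    baseline = llm_judge(query, ref_answer)

    # 4. Log-probability of the sampled permutation under Plackett-Luce:
    #    the normalizer at rank t is the log-sum-exp over not-yet-picked scores.
    s = scores[perm]
    suffix_lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    log_prob = (s - suffix_lse).sum()

    # 5. REINFORCE with baseline: reinforce rankings that beat the reference.
    loss = -(reward - baseline) * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Subtracting the baseline means only rankings that improve on the reference ordering receive a positive learning signal, which is what tames the high-variance gradients that raw LLM feedback would otherwise produce.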
The framework demonstrates impressive versatility. It generalizes across different 'reader' LLMs: a reranker trained with one model (e.g., Claude) can improve performance for another (e.g., GPT-4). It also composes with orthogonal techniques such as Query2Doc query expansion and remains robust even when trained under 'noisy' or imperfect AI supervisors. By eliminating the need for expensive, static human annotations and directly aligning retrieval with generation utility, RRPO represents a major step toward more efficient, accurate, and self-improving RAG pipelines for enterprise and research applications.
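For reference, Query2Doc-style expansion is itself only a few lines; `llm_complete` below is a hypothetical helper, and this is a generic illustration of the technique rather than the paper's specific integration:

```python
# Query2Doc-style expansion: ask an LLM for a pseudo-document, then
# retrieve (and rerank) using the query concatenated with it.
def expand_query(query: str, llm_complete) -> str:
    pseudo_doc = llm_complete(f"Write a short passage that answers: {query}")
    return f"{query} {pseudo_doc}"
```

Because expansion changes what gets retrieved while RRPO changes how the retrieved set is ordered, the two can be layered without interfering.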
- Uses LLM feedback via RL (RLAIF) to train rerankers, removing dependency on costly human relevance labels.
- Outperforms strong baselines like RankZephyr on knowledge benchmarks and works with various LLMs (e.g., GPT-4o).
- Framework is versatile: integrates with query expansion tools and remains robust to noisy training signals.
Why It Matters
Enables more accurate, cost-effective RAG systems by directly optimizing retrieval for what actually improves LLM answer quality.