RICE-PO turns retrieval interactions into training signals for reasoning agents
New framework solves credit assignment for hidden reasoning steps in retrieval agents.
A team of researchers from UMass Amherst and other institutions has introduced RICE-PO, a policy optimization framework for training reasoning agents that iteratively interact with retrievers. The core challenge is credit assignment: executable actions such as queries or summaries can be evaluated directly by the retriever, but the latent reasoning steps that precede them are unobservable and affect the outcome only through future actions. Outcome-level reward assignment falls short here because a single final reward is shared across the whole trajectory, so it can credit reasoning steps that had little impact on retrieval success.
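To make that failure mode concrete, here is a minimal sketch of outcome-level credit in a group-based setup. It is an illustration only, not the baselines' actual code: the function name `outcome_level_advantages`, the toy rewards, and the step counts are all assumptions.

```python
from statistics import mean, pstdev

def outcome_level_advantages(group_rewards, steps_per_rollout):
    """Broadcast one group-normalized advantage to every step of each rollout.

    group_rewards: final reward of each rollout in the group (toy values).
    steps_per_rollout: number of steps (reasoning or action) in each rollout.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards) or 1.0  # avoid division by zero if all rewards match
    return [
        [(reward - mu) / sigma] * n_steps
        for reward, n_steps in zip(group_rewards, steps_per_rollout)
    ]

# Four hypothetical rollouts for the same question, two of which succeeded.
advantages = outcome_level_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 3])
for rollout in advantages:
    print(rollout)
```

Every step inside a rollout receives the same value, whether or not it contributed to retrieval success. That shared signal is the asymmetry the framework is designed to break.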
RICE-PO addresses this asymmetry by selecting high-uncertainty executable actions as anchors, then evaluating local counterfactual branches using standard retrieval metrics. It propagates credit to latent reasoning steps only when the reasoning-to-action influence is strong and future residual effects remain stable. This critic-free approach eliminates the need for a separate reward model. On the BRIGHT and BEIR benchmarks, RICE-PO consistently outperforms prompt-based agents and group-based reinforcement learning (RL) baselines under the same retriever configuration. The results demonstrate that the structure of agent-environment interaction itself can provide useful supervision for training more effective retrieval-augmented reasoning systems.
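The anchor-and-branch idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the helper names (`select_anchors`, `branch_advantage`, `reasoning_credit`), the use of nDCG as the retrieval metric, the mean-over-branches baseline, and the 0.5 gating thresholds are all hypothetical choices made for the example.

```python
import math

def ndcg_at_k(ranked_ids, gains, k=10):
    """Standard nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def select_anchors(uncertainties, top_n=1):
    """Indices of the top_n most uncertain executable actions (assumed scores)."""
    order = sorted(range(len(uncertainties)),
                   key=lambda i: uncertainties[i], reverse=True)
    return sorted(order[:top_n])

def branch_advantage(chosen_score, branch_scores):
    """Chosen action's retrieval score relative to the mean over local branches."""
    return chosen_score - sum(branch_scores) / len(branch_scores)

def reasoning_credit(advantage, influence, stability,
                     min_influence=0.5, min_stability=0.5):
    """Pass credit to the preceding latent step only if both gates hold.

    The thresholds are hypothetical; the source only says influence must be
    strong and future residual effects must remain stable.
    """
    if influence >= min_influence and stability >= min_stability:
        return advantage
    return 0.0

# Toy example: three executable actions, the second is the most uncertain.
anchors = select_anchors([0.2, 1.3, 0.7], top_n=1)

# Relevance labels and the rankings returned for each counterfactual query.
gains = {"d1": 1.0, "d2": 1.0}
rankings = {
    "chosen": ["d1", "d2", "d9"],
    "alt_a": ["d9", "d1", "d8"],
    "alt_b": ["d7", "d8", "d9"],
}
scores = {name: ndcg_at_k(r, gains, k=3) for name, r in rankings.items()}
adv = branch_advantage(scores["chosen"], list(scores.values()))

credited = reasoning_credit(adv, influence=0.8, stability=0.9)
blocked = reasoning_credit(adv, influence=0.1, stability=0.9)
print(anchors, round(adv, 3), credited == adv, blocked)
```

In this sketch the action's own advantage comes straight from the retrieval metric, so no learned critic is involved. The latent reasoning step inherits that advantage only when both gates pass, and receives zero otherwise.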
- RICE-PO is a critic-free framework that uses retriever metrics to assign credit to latent reasoning steps.
- It selects high-uncertainty actions as anchors and evaluates counterfactual branches to isolate reasoning impact.
- It outperforms prompt-based agents and group-based RL baselines on the BRIGHT and BEIR benchmarks under the same retriever configuration.
Why It Matters
Retrieval agents trained with finer-grained credit assignment could make AI systems for research, search, and decision-making more reliable and efficient.