Research & Papers

RICE-PO turns retrieval interactions into training signals for reasoning agents

New framework tackles credit assignment for hidden reasoning steps in retrieval agents.

Deep Dive

A team of researchers from UMass Amherst and other institutions has introduced RICE-PO, a policy optimization framework for training reasoning agents that iteratively interact with retrievers. The core challenge is credit assignment: executable actions such as queries or summaries can be evaluated directly by the retriever, but the latent reasoning steps that precede them are unobservable and affect the outcome only through later actions. Outcome-level reward assignment handles this poorly because it spreads a single final reward across the whole trajectory, so reasoning steps that had little impact on retrieval success can be credited as much as those that mattered.

RICE-PO addresses this asymmetry by selecting high-uncertainty executable actions as anchors, then evaluating local counterfactual branches using standard retrieval metrics. It propagates credit to latent reasoning steps only when the reasoning-to-action influence is strong and future residual effects remain stable. This critic-free approach eliminates the need for a separate reward model. On the BRIGHT and BEIR benchmarks, RICE-PO consistently outperforms prompt-based agents and group-based reinforcement learning (RL) baselines under the same retriever configuration. The results demonstrate that the structure of agent-environment interaction itself can provide useful supervision for training more effective retrieval-augmented reasoning systems.
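The anchor-and-branch idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Anchor` fields, the `assign_credit` function, the thresholds `min_influence` and `max_drift`, and the choice of nDCG@k as the retrieval metric are all assumptions made for illustration. The sketch only shows the general shape of the logic: pick the most uncertain executable actions, compare the taken action against the mean score of locally resampled alternatives (a baseline that needs no learned critic), and pass credit back to the preceding reasoning step only when both gating conditions hold.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k, used here as an example retrieval metric."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


@dataclass
class Anchor:
    """One executable action with hypothetical per-action statistics."""
    uncertainty: float          # assumed: e.g. mean token entropy of the action
    taken_score: float          # retrieval metric of the action actually taken
    branch_scores: List[float]  # metric of resampled counterfactual actions
    influence: float            # assumed: reasoning-to-action influence in [0, 1]
    residual_drift: float       # assumed: instability of later steps (lower is stabler)


def assign_credit(anchors: List[Anchor], top_k: int = 1,
                  min_influence: float = 0.5,
                  max_drift: float = 0.2) -> Tuple[List[float], List[float]]:
    """Return (action_credit, reasoning_credit), aligned with `anchors`."""
    action_credit = [0.0] * len(anchors)
    reasoning_credit = [0.0] * len(anchors)

    # Select the top_k most uncertain executable actions as anchors.
    chosen = sorted(range(len(anchors)),
                    key=lambda i: anchors[i].uncertainty, reverse=True)[:top_k]

    for i in chosen:
        a = anchors[i]
        if not a.branch_scores:
            continue  # no counterfactual branches, so no local comparison
        # Critic-free baseline: mean metric over the counterfactual branches.
        baseline = sum(a.branch_scores) / len(a.branch_scores)
        advantage = a.taken_score - baseline
        action_credit[i] = advantage
        # Credit the latent reasoning step only if it strongly shaped the
        # action and later steps stayed stable (both gates are assumptions).
        if a.influence >= min_influence and a.residual_drift <= max_drift:
            reasoning_credit[i] = a.influence * advantage
    return action_credit, reasoning_credit


if __name__ == "__main__":
    relevant = {"d1", "d3"}
    trajectory = [
        Anchor(0.9, ndcg_at_k(["d1", "d3", "d7"], relevant),
               [ndcg_at_k(["d7", "d1"], relevant), ndcg_at_k(["d7", "d8"], relevant)],
               influence=0.7, residual_drift=0.1),
        Anchor(0.2, 0.5, [0.5], influence=0.9, residual_drift=0.0),
    ]
    print(assign_credit(trajectory, top_k=1))
```

In this toy setup only the first action is uncertain enough to be selected, so the second action and its reasoning step receive zero credit regardless of their scores. A real system would need to estimate uncertainty, influence, and stability from the policy itself, which the sketch leaves as given inputs.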

Key Points
  • RICE-PO is a critic-free framework that uses retriever metrics to assign credit to latent reasoning steps.
  • It selects high-uncertainty actions as anchors and evaluates counterfactual branches to isolate reasoning impact.
  • It outperforms prompt-based agents and group-based RL baselines on the BRIGHT and BEIR benchmarks under the same retriever configuration.

Why It Matters

Better training signals for retrieval agents could make AI systems for research, search, and decision-making more reliable and efficient, without requiring a separate reward model.