Research & Papers

Credit-Assigned Policy Gradient cuts training variance for retrieval rankers

New RL method reduces gradient variance by up to 10x in two-stage ranking.

Deep Dive

Researchers propose Credit-Assigned Policy Gradient (CA-PG) for training early-stage rankers in two-stage retrieval systems (search, recommendation, RAG). The variance of the standard policy gradient estimator explodes as the candidate set grows. CA-PG reduces that variance by marginalizing over the composition of the candidate set, requiring only that the target item is included. Experiments show faster convergence and improved stability for Plackett-Luce models, especially at large candidate sizes.

Key Points
  • Vanilla policy gradient (V-PG) suffers from exploding variance for large candidate sets in two-stage retrieval.
  • CA-PG reduces variance by marginalizing over candidate set composition, computing gradients over the inclusion probability of each item.
  • Experiments show faster convergence and improved training stability for Plackett-Luce early-stage rankers, especially with large candidate sizes.
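
To make the variance argument concrete, here is a minimal toy sketch in Python. It is not the paper's estimator. It assumes a Plackett-Luce ranker with logits theta, a candidate set of k items drawn without replacement, and a hypothetical reward of 1 when the target item lands in the candidate set. All function names are illustrative. By brute-force enumeration (feasible only at toy sizes), it checks that the vanilla score-function estimator has the gradient of the target's inclusion probability as its expectation, and it reports how much variance the vanilla estimator carries around that value.

```python
import itertools
import numpy as np


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def pl_logprob_and_grad(theta, seq):
    """Log-probability of the ordered sample `seq` under Plackett-Luce,
    plus its gradient with respect to the logits `theta`."""
    n = theta.shape[0]
    remaining = np.ones(n, dtype=bool)
    logp = 0.0
    grad = np.zeros(n)
    for item in seq:
        p = np.zeros(n)
        p[remaining] = softmax(theta[remaining])  # choice among items not yet drawn
        logp += np.log(p[item])
        grad[item] += 1.0   # d/dtheta of theta[item]
        grad -= p           # d/dtheta of the log-normalizer
        remaining[item] = False
    return logp, grad


def inclusion_prob(theta, target, k):
    """Exact P(target is in the k-item candidate set), by enumeration."""
    total = 0.0
    for seq in itertools.permutations(range(theta.shape[0]), k):
        if target in seq:
            logp, _ = pl_logprob_and_grad(theta, seq)
            total += np.exp(logp)
    return total


def vanilla_pg_moments(theta, target, k):
    """Exact mean and total variance of the single-sample estimator
    reward(S) * grad log P(S), with the assumed reward 1[target in S]."""
    mean = np.zeros(theta.shape[0])
    second_moment = 0.0
    for seq in itertools.permutations(range(theta.shape[0]), k):
        logp, g = pl_logprob_and_grad(theta, seq)
        prob = np.exp(logp)
        est = g if target in seq else np.zeros_like(g)
        mean += prob * est
        second_moment += prob * (est @ est)
    return mean, second_moment - mean @ mean


def marginal_grad(theta, target, k, eps=1e-6):
    """Central finite-difference gradient of the inclusion probability."""
    grad = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        d = np.zeros_like(theta)
        d[i] = eps
        hi = inclusion_prob(theta + d, target, k)
        lo = inclusion_prob(theta - d, target, k)
        grad[i] = (hi - lo) / (2 * eps)
    return grad


if __name__ == "__main__":
    theta = np.array([0.5, -0.2, 0.1, 1.0, -0.7])
    for k in (1, 2, 3):
        mean, var = vanilla_pg_moments(theta, target=2, k=k)
        print(k, inclusion_prob(theta, 2, k), var)
```

Both quantities point in the same direction on average; the difference is that the vanilla estimator scores the entire sampled ordering, so noise from items unrelated to the target enters every sample. Here the marginalized gradient is computed exactly only because the toy is small enough to enumerate. How CA-PG obtains it at realistic candidate sizes is not covered by this summary.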

Why It Matters

CA-PG enables more efficient training of retrieval systems used in search, recommendation, and RAG at scale.