Credit-Assigned Policy Gradient cuts training variance for retrieval rankers
New RL method reduces gradient variance by up to 10x in two-stage ranking.
Researchers propose Credit-Assigned Policy Gradient (CA-PG) for training early-stage rankers in two-stage retrieval systems (search, recommendation, RAG). Vanilla policy gradient (V-PG) suffers from exploding variance as the candidate set grows. CA-PG reduces that variance by marginalizing over the composition of the candidate set, so the gradient depends only on whether the target item is included. Experiments show faster convergence and improved stability for Plackett-Luce models, especially at large candidate sizes.
- Vanilla policy gradient (V-PG) suffers from exploding variance for large candidate sets in two-stage retrieval.
- CA-PG reduces variance by marginalizing over candidate set composition, computing gradients over the inclusion probability of each item.
- Experiments show faster convergence and improved training stability for Plackett-Luce early-stage rankers, especially with large candidate sizes.
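
To make the variance problem concrete, here is a minimal Python sketch. It is not the paper's method. It assumes a toy reward of 1 when the target item lands in the candidate set and 0 otherwise, and the function names (`pl_log_prob_and_grad`, `enumerate_vpg`) are illustrative. For a tiny Plackett-Luce ranker, it enumerates every ordered candidate set to get the exact mean and total variance of the single-sample V-PG estimator. Under this toy reward, the mean equals the gradient of the target's inclusion probability, which is the kind of marginalized quantity CA-PG is described as working with.

```python
import itertools

import numpy as np


def pl_log_prob_and_grad(scores, seq):
    """Log-probability of an ordered sample under Plackett-Luce,
    plus its gradient with respect to the item scores."""
    remaining = np.ones(len(scores), dtype=bool)
    log_p = 0.0
    grad = np.zeros(len(scores), dtype=float)
    for item in seq:
        # Softmax over the items that have not been drawn yet.
        logits = np.where(remaining, scores, -np.inf)
        probs = np.exp(logits - logits[remaining].max())
        probs /= probs.sum()
        log_p += np.log(probs[item])
        # d/ds log softmax(s)[item] = one_hot(item) - probs
        grad[item] += 1.0
        grad -= probs
        remaining[item] = False
    return log_p, grad


def enumerate_vpg(scores, k, target):
    """Exact statistics of the single-sample V-PG estimator
    g = reward * grad log P(seq), with the assumed toy reward
    reward = 1 if target is in the candidate set, else 0."""
    n = len(scores)
    mean = np.zeros(n)
    second_moment = 0.0
    inclusion_prob = 0.0
    for seq in itertools.permutations(range(n), k):
        log_p, grad = pl_log_prob_and_grad(scores, seq)
        p = np.exp(log_p)
        reward = 1.0 if target in seq else 0.0
        g = reward * grad
        mean += p * g
        second_moment += p * (g @ g)
        inclusion_prob += p * reward
    # Total variance = trace of the estimator's covariance matrix.
    variance = second_moment - mean @ mean
    return inclusion_prob, mean, variance


# Hypothetical scores for six items; item 3 is the target.
scores = np.array([0.5, -0.2, 1.0, 0.0, -1.0, 0.3])
for k in (1, 2, 3):
    incl, mean, var = enumerate_vpg(scores, k, target=3)
    print(f"k={k}  P(target in set)={incl:.4f}  V-PG total variance={var:.4f}")
```

Enumeration is only feasible for very small item counts. It is used here so the mean and variance are exact rather than estimated from samples.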
Why It Matters
CA-PG enables more efficient training of retrieval systems used in search, recommendation, and RAG at scale.