Graph-GRPO boosts e-commerce search with graph-based reward credit assignment
New RL method assigns credit to each reasoning step using a dependency graph.
Researchers propose Graph-GRPO, a graph-structured extension of Group Relative Policy Optimization (GRPO) for generative e-commerce search relevance. It constructs a dependency graph in which chain-of-thought reasoning steps are nodes and logical dependencies are edges, then propagates outcome-level rewards through the graph to derive step-level credit signals. This targets two limitations of existing reward schemes: outcome-level rewards score only the final answer, and independent process rewards score each step in isolation. In A/B tests on a leading e-commerce platform, Graph-GRPO improved relevance classification metrics and key engagement metrics.
- Creates a dependency graph over chain-of-thought reasoning steps, with nodes for steps and edges for logical dependencies.
- Propagates outcome-level rewards through the graph to assign fine-grained credit to each step, improving RL training.
- In online A/B tests on a leading e-commerce platform, it improved relevance classification and engagement metrics such as click-through rate.
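The idea in the points above can be illustrated with a minimal Python sketch. The summary does not state the paper's actual propagation rule, so everything here is an assumption for illustration: the function names, the `decay` parameter, and the choice to scale credit by each step's shortest dependency distance to the final step are hypothetical. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows standard GRPO.

```python
from collections import deque
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Standard GRPO-style normalization within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored the same, so no response is preferred.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def propagate_credit(num_steps, edges, outcome_advantage, decay=0.5):
    """Hypothetical propagation rule (not taken from the paper).

    edges: (src, dst) pairs meaning step dst depends on step src.
    The last step is treated as the final answer and receives the full
    outcome advantage. Each upstream step receives the advantage scaled by
    decay ** d, where d is its shortest dependency distance to the final
    step. Steps the final answer does not depend on receive 0.
    """
    parents = {i: [] for i in range(num_steps)}
    for src, dst in edges:
        parents[dst].append(src)

    final = num_steps - 1
    distance = {final: 0}
    queue = deque([final])
    # Breadth-first walk backwards along dependency edges.
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in distance:
                distance[parent] = distance[node] + 1
                queue.append(parent)

    return [
        outcome_advantage * decay ** distance[i] if i in distance else 0.0
        for i in range(num_steps)
    ]


# Four sampled responses for one query, with outcome rewards of 1 (correct) or 0.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Response 0 has five reasoning steps: 0 -> 1 -> 4 and 2 -> 4; step 3 is unused.
step_credit = propagate_credit(5, [(0, 1), (1, 4), (2, 4)], advantages[0])
print(step_credit)
```

In this toy setup, a step that the final answer never relies on gets no credit, while steps closer to the final answer get more. A per-step process reward that ignores dependencies could not make that distinction.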
Why It Matters
By assigning credit to individual reasoning steps rather than only to the final answer, Graph-GRPO makes LLM-based e-commerce search more accurate and lifts user engagement.