Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
New training technique curbs reward hacking, lifting OlympiadBench accuracy from 46.3% to 51.3%.
A research team from multiple institutions has developed Process-Aware Policy Optimization (PAPO), a novel method that addresses critical limitations in how AI models are trained using reinforcement learning. Current approaches typically use either outcome reward models (ORMs), which only check final answers and treat all correct responses equally, or process reward models (PRMs), which evaluate reasoning steps but are vulnerable to reward hacking, where models learn to generate verbose but incorrect solutions. PAPO addresses both problems through a decoupled advantage normalization technique that separates outcome and process components.
PAPO's innovation lies in its two-part advantage calculation: A_out, derived from ORM scores and normalized across all responses, anchors training on basic correctness. Meanwhile, A_proc, derived from rubric-based PRM scores, is normalized only among correct responses to differentiate reasoning quality without distorting the outcome signal. This prevents models from exploiting verbosity to inflate process scores while accuracy collapses, a common failure mode in existing PRM approaches.
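To make the decoupling concrete, the sketch below shows one way the two advantage terms could be computed for a group of sampled responses. It is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name `decoupled_advantages`, the binary outcome rewards, the `eps` constant, and the group-level statistics are illustrative choices.

```python
import numpy as np

def decoupled_advantages(orm_rewards, prm_scores, correct_mask, eps=1e-8):
    """Illustrative decoupled advantage normalization for one group of responses.

    orm_rewards:  outcome reward per response (e.g. 1.0 if the final answer is
                  correct, 0.0 otherwise)
    prm_scores:   rubric-based process reward per response
    correct_mask: True for responses whose final answer is correct
    """
    orm_rewards = np.asarray(orm_rewards, dtype=float)
    prm_scores = np.asarray(prm_scores, dtype=float)
    correct_mask = np.asarray(correct_mask, dtype=bool)

    # A_out: normalized across *all* responses in the group, so the dominant
    # training signal remains whether the final answer is correct.
    a_out = (orm_rewards - orm_rewards.mean()) / (orm_rewards.std() + eps)

    # A_proc: normalized only among *correct* responses, so process scores rank
    # reasoning quality among already-correct answers instead of letting a
    # verbose-but-wrong response inflate its reward.
    a_proc = np.zeros_like(prm_scores)
    if correct_mask.any():
        correct_scores = prm_scores[correct_mask]
        a_proc[correct_mask] = (correct_scores - correct_scores.mean()) / (
            correct_scores.std() + eps
        )

    return a_out, a_proc
```

In a group-based policy-gradient update, the two terms might then be combined per response, for instance as `a_out + beta * a_proc` with a small weighting coefficient `beta`; the exact combination rule PAPO uses is not given here, so that form should be read as an assumption.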
The method demonstrated significant improvements across multiple benchmarks, most notably on OlympiadBench, where PAPO achieved 51.3% accuracy compared to 46.3% for standard ORM training. Unlike traditional methods that plateau or decline as training progresses, PAPO showed continued improvement across six different benchmarks and maintained stability at various model scales. The technique represents a more sophisticated approach to aligning AI models with both correct answers and high-quality reasoning processes.
By addressing the fundamental tension between outcome-focused and process-focused training, PAPO provides a more stable foundation for developing AI systems that excel at complex reasoning tasks. The research suggests this approach could lead to more reliable and capable AI models across scientific, mathematical, and logical domains where step-by-step reasoning is as important as the final answer.
- PAPO combines outcome and process rewards via decoupled normalization, preventing reward hacking while improving reasoning
- Achieved 51.3% accuracy on OlympiadBench vs 46.3% for traditional outcome reward models—an 11% relative improvement
- Maintains training stability where other methods plateau or decline, showing consistent gains across six benchmarks
Why It Matters
Enables more reliable AI training for complex reasoning tasks, moving beyond simple answer correctness to evaluate reasoning quality.