LLM Reasoning with Process Rewards for Outcome-Guided Steps
New method prevents AI 'reward hacking' by conditioning process scores on final outcomes, improving accuracy across 5 math benchmarks.
A team of researchers from the University of Bonn and Fraunhofer IAIS has introduced PROGRS (Process Rewards for Outcome-Guided Steps), a novel framework designed to enhance the mathematical reasoning capabilities of large language models (LLMs). The core innovation addresses a critical flaw in current reinforcement learning approaches: while process reward models (PRMs) provide valuable feedback on intermediate reasoning steps, they can become misaligned with the final answer, leading to 'reward hacking.' This occurs when an LLM learns to generate locally fluent, plausible-sounding reasoning that still arrives at an incorrect conclusion, because the PRM rewards style over substance.
PROGRS solves this by treating process rewards as relative preferences rather than absolute targets, using a technique called outcome-conditioned centering. It groups reasoning trajectories by their final outcome (correct or incorrect) and shifts the PRM scores of incorrect trajectories to have a zero mean. This removes systematic bias while preserving the informative ranking of steps within each group. The framework combines a frozen quantile-regression PRM with a multi-scale coherence evaluator and integrates the resulting centered process bonus into Group Relative Policy Optimization (GRPO), requiring no extra trainable components.
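To make the centering step concrete, here is a minimal Python sketch of outcome-conditioned centering as described above. It is an illustration under assumptions, not the authors' code: the function name is invented, and "zero mean" is read as subtracting one group-wide mean from the scores of incorrect trajectories.

```python
import numpy as np

def outcome_conditioned_centering(step_scores, is_correct):
    """Shift PRM step scores of incorrect trajectories to zero mean.

    step_scores: list of per-trajectory arrays of PRM step scores.
    is_correct:  per-trajectory final-answer correctness flags.

    Hypothetical sketch of the centering described in the article.
    """
    scores = [np.asarray(s, dtype=float) for s in step_scores]
    flags = np.asarray(is_correct, dtype=bool)

    # Mean taken over every step of every incorrect trajectory.
    incorrect = [s for s, ok in zip(scores, flags) if not ok]
    mean_incorrect = np.concatenate(incorrect).mean() if incorrect else 0.0

    # Incorrect trajectories are re-centered; correct ones keep raw scores.
    # Subtracting a constant preserves the ranking of steps within the group.
    return [s if ok else s - mean_incorrect for s, ok in zip(scores, flags)]
```

Subtracting a single group mean is the simplest reading of the description; the actual method may instead center per trajectory or per step position, which the article does not specify.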
The results are consistent. Across five challenging mathematical reasoning benchmarks (MATH-500, AMC, AIME, MinervaMath, and OlympiadBench), PROGRS improved the primary Pass@1 metric over models trained only on final-answer correctness. Crucially, it reached stronger performance with fewer training rollouts, making training more efficient. This demonstrates that by conditioning intermediate feedback on the ultimate goal, PROGRS enables the safe and effective use of dense process supervision, leading to more reliable and accurate reasoning on complex, multi-step problems.
- PROGRS uses outcome-conditioned centering to align process reward model (PRM) scores with final answer correctness, preventing reward hacking.
- The framework improved Pass@1 scores across 5 major math benchmarks (MATH-500, AMC, AIME, MinervaMath, OlympiadBench) with fewer training rollouts.
- It integrates a centered process bonus into GRPO policy optimization without adding trainable parameters or auxiliary objectives (see the sketch after this list).
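As a rough illustration of how a centered process bonus might enter GRPO, the sketch below adds a per-trajectory bonus to the outcome reward and then applies GRPO's standard group-relative normalization. The bonus weight `beta` and the choice to aggregate step scores into one scalar per trajectory are hypothetical details not given in the article.

```python
import numpy as np

def grpo_advantages(outcome_rewards, process_bonuses, beta=0.1, eps=1e-8):
    """GRPO-style group-relative advantages with a centered process bonus.

    outcome_rewards: (G,) final-answer rewards for G rollouts of one prompt.
    process_bonuses: (G,) per-trajectory bonuses, e.g. the mean of the
                     outcome-conditioned-centered PRM step scores above.
    beta:            bonus weight (hypothetical; not from the article).
    """
    r = np.asarray(outcome_rewards, dtype=float)
    b = np.asarray(process_bonuses, dtype=float)
    total = r + beta * b  # dense process signal added to sparse outcome reward

    # Standard GRPO normalization: center and scale within the rollout group.
    return (total - total.mean()) / (total.std() + eps)
```

Because the normalization happens within each rollout group, the combined reward needs no critic or value network, consistent with the article's claim that PROGRS adds no extra trainable components.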
Why It Matters
Enables more efficient and reliable training of AI for complex reasoning tasks, moving beyond just checking final answers to properly evaluating the reasoning process.