StaRPO: Stability-Augmented Reinforcement Policy Optimization
New RL method rewards stable step-by-step reasoning, improving accuracy on four benchmarks.
A research team led by Jinghan Zhang, together with seven collaborators, has introduced StaRPO (Stability-Augmented Reinforcement Policy Optimization), a framework designed to improve the logical consistency of large language models during complex reasoning tasks. Current RL methods primarily reward models for whether the final answer is correct, often ignoring the internal structure of the reasoning process. This can yield models whose chains of thought are fluent but logically erratic or redundant. StaRPO addresses this by explicitly incorporating 'reasoning stability' into the optimization objective.
To make stability measurable, the researchers decompose it into two computable, lightweight metrics. The Autocorrelation Function (ACF) evaluates local, step-to-step coherence in the reasoning trajectory, while Path Efficiency (PE) assesses the global goal-directedness of the entire reasoning path. These metrics generate stability rewards that are combined with traditional task rewards, providing complementary, process-aware feedback to the model during training.
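The summary does not give the exact formulas, but both metrics admit simple embedding-based formulations. The sketch below is one plausible reading, not the paper's definition: it assumes each reasoning step is represented by a vector embedding, computes ACF as the lag-1 autocorrelation of the mean-centered step sequence, and computes PE as the ratio of straight-line displacement to total path length. The function names and the small epsilon stabilizers are illustrative.

```python
# Hedged sketch of the two stability metrics. The per-step embeddings,
# the lag-1 autocorrelation form of ACF, and the chord/arc-length form
# of PE are assumptions chosen to illustrate the idea, not StaRPO's
# published formulas.
import numpy as np

def acf_score(steps: np.ndarray, lag: int = 1) -> float:
    """Lag-k autocorrelation of a sequence of step embeddings.

    steps: (T, d) array, one embedding per reasoning step.
    Returns roughly [-1, 1]; higher means consecutive steps stay
    coherent rather than jumping erratically.
    """
    centered = steps - steps.mean(axis=0, keepdims=True)
    num = np.sum(centered[:-lag] * centered[lag:])  # lagged inner products
    den = np.sum(centered * centered) + 1e-8        # zero-lag normalizer
    return float(num / den)

def path_efficiency(steps: np.ndarray) -> float:
    """Straight-line displacement over total path length.

    Near 1 when the trajectory moves directly toward its endpoint;
    small when the reasoning wanders or backtracks.
    """
    arc = np.sum(np.linalg.norm(np.diff(steps, axis=0), axis=1)) + 1e-8
    chord = np.linalg.norm(steps[-1] - steps[0])
    return float(chord / arc)

# Toy usage: five steps in a 3-d embedding space.
steps = np.random.randn(5, 3)
print(acf_score(steps), path_efficiency(steps))
```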
The team validated the approach by showing that the ACF and PE rewards correlate with the presence of logic errors in two backbone models. In experiments across four established reasoning benchmarks, models optimized with StaRPO consistently outperformed baseline methods. The framework improved not only final-answer accuracy but also the logical stability and structural soundness of the models' reasoning outputs, marking a step toward more reliable and transparent AI reasoning.
- Introduces two novel stability metrics: Autocorrelation Function (ACF) for stepwise coherence and Path Efficiency (PE) for global reasoning efficiency.
- Combines these process-aware stability rewards with traditional task rewards in a reinforcement learning framework (see the sketch after this list).
- Outperforms baseline methods on four reasoning benchmarks, improving both answer accuracy and logical consistency.
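The summary states only that stability and task rewards are combined; the weighting scheme is not specified. A minimal sketch, assuming a simple additive combination with tunable coefficients (`lam_acf` and `lam_pe` are hypothetical names):

```python
# Hedged sketch of reward combination. The additive form and default
# weights are assumptions; StaRPO's actual scheme is not given here.
def combined_reward(task_reward: float, acf: float, pe: float,
                    lam_acf: float = 0.1, lam_pe: float = 0.1) -> float:
    """Augment the outcome-based task reward with process-aware
    stability terms, so trajectories that reach the right answer
    through coherent, direct reasoning score higher than ones
    that wander."""
    return task_reward + lam_acf * acf + lam_pe * pe
```

In a policy-gradient setup, this combined scalar would simply replace the outcome-only reward when weighting the log-probabilities of sampled reasoning trajectories.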
Why It Matters
Moves AI reasoning beyond just correct answers to ensure the logical process itself is sound and reliable.