VPS training makes LLMs reason accurately and reliably

arXiv cs.CL May 14, 2026

⚡ Standard RL boosts accuracy but can wreck reasoning; VPS cuts win-rate error by up to 30%.

Deep Dive

Reinforcement learning with verifiable rewards has become a popular method for training language models, but it optimizes only final outcomes. This can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or internally inconsistent. In a new preprint on arXiv, researchers introduce Verifiable Process Supervision (VPS) to address this. They first apply supervised fine-tuning to induce a structured reasoning format, allowing intermediate claims to be extracted and verified against ground-truth signals for process-level rewards. To handle heterogeneous difficulty, they use adaptive reward weighting that prioritizes components with the largest remaining errors, creating an implicit curriculum.
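The adaptive weighting idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `AdaptiveRewardWeighter`, the component names, the exponential moving average, and the `momentum` and `floor` values are all assumptions made for illustration. It assumes each reasoning component has already been verified and scored in [0, 1], and it gives more reward weight to components whose running error is still large.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveRewardWeighter:
    """Hypothetical helper: weight each reasoning component by its running error."""

    components: list[str]
    momentum: float = 0.9  # assumed smoothing factor for the running error
    floor: float = 0.05    # assumed minimum so no component is ignored entirely
    running_error: dict[str, float] = field(init=False)

    def __post_init__(self) -> None:
        # Start every component at maximum error so the first weights are uniform.
        self.running_error = {name: 1.0 for name in self.components}

    def update(self, scores: dict[str, float]) -> None:
        """Fold per-component verification scores in [0, 1] into the running errors."""
        for name in self.components:
            error = 1.0 - scores[name]
            self.running_error[name] = (
                self.momentum * self.running_error[name]
                + (1.0 - self.momentum) * error
            )

    def weights(self) -> dict[str, float]:
        """Normalize the floored running errors so the weights sum to 1."""
        raw = {n: max(e, self.floor) for n, e in self.running_error.items()}
        total = sum(raw.values())
        return {n: v / total for n, v in raw.items()}

    def process_reward(self, scores: dict[str, float]) -> float:
        """Weighted sum of component scores under the current weights."""
        w = self.weights()
        return sum(w[n] * scores[n] for n in self.components)


# Component names are illustrative, not taken from the paper.
weighter = AdaptiveRewardWeighter(["best_move", "win_rate", "consistency"])
for _ in range(50):
    weighter.update({"best_move": 0.9, "win_rate": 0.4, "consistency": 0.7})
print(weighter.weights())  # the weakest component, win_rate, gets the largest weight
```

Because the weights track the remaining error, the emphasis shifts on its own as a component improves, which is one way an implicit curriculum could arise.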

Evaluated on chess, where reasoning steps are deterministically verifiable, VPS preserves accuracy while significantly improving reasoning quality. Accuracy-only RL, in contrast, increased win-rate error by up to 112% and reduced internal consistency by up to 69%. VPS reduced win-rate error by up to 30% and restored consistency to near saturation. Judge evaluations preferred process-supervised models at matched accuracy. The analysis also showed that without a structured prior, accuracy-only RL converges to budget-dependent shortcuts rather than sound multi-step reasoning. These results suggest that VPS can help language models reason both accurately and reliably in verifiable domains, with implications for training trustworthy AI systems.
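To make the percentages concrete, the short sketch below shows how a relative change is computed. The baseline and post-training error values are hypothetical, chosen only so the arithmetic matches the reported relative changes; they are not figures from the paper. The function `relative_change` is likewise an illustrative helper.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return the change from baseline to new as a percentage of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return (new - baseline) / baseline * 100.0


# Hypothetical error values for illustration only; not reported in the paper.
baseline_error = 0.10
print(relative_change(baseline_error, 0.212))  # about +112: the error more than doubles
print(relative_change(baseline_error, 0.070))  # about -30: the error drops by roughly a third
```

Read this way, a 112% increase means the error ends up at more than twice its starting value, while a 30% reduction leaves it at about 70% of the starting value.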

Key Points

Accuracy-only RL increased win-rate error by up to 112% and reduced internal consistency by up to 69% on chess tasks.
VPS reduced win-rate error by up to 30% and restored internal consistency to near saturation while preserving accuracy.
VPS uses adaptive reward weighting to prioritize the reasoning components with the largest remaining errors, creating an implicit curriculum for process-level supervision.

Why It Matters

VPS pushes models to reason soundly, not just answer correctly, a property that is critical for high-stakes applications like medical diagnosis or legal analysis.
