Distributionally Robust Token Optimization in RLHF
New RLHF technique improves LLM consistency on math problems, even when prompts are reworded.
A new research paper introduces Distributionally Robust Token Optimization (DRTO), a technique designed to make Large Language Models (LLMs) more reliable and consistent, especially on complex reasoning tasks. The core problem it addresses is the brittleness of current models: small, seemingly insignificant changes in how a prompt is worded, formatted, or phrased can produce drastically different, and often incorrect, answers. DRTO tackles this by fusing two ideas: token-level Reinforcement Learning from Human Feedback (RLHF), which optimizes model outputs at the granularity of individual tokens, and Distributionally Robust Optimization (DRO), a mathematical framework that prepares the model for worst-case scenarios. Specifically, DRTO constructs an 'ambiguity set' around batches of training data and optimizes against the worst-case distribution within that set, forcing the model to perform well even under slight distributional shifts and thereby providing a theoretical guarantee of robustness.
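To make the worst-case idea concrete, here is a minimal sketch of one standard DRO inner step: instead of averaging per-example losses uniformly over a batch, the weights are exponentially tilted toward high-loss examples (the closed-form maximizer of a KL-constrained ambiguity set). This is a generic DRO illustration, not the paper's exact objective; the function name and `temperature` parameter are this sketch's own.

```python
import math

def dro_batch_loss(losses, temperature=1.0):
    """Worst-case reweighted batch loss under a KL-ball ambiguity set.

    Exponentially tilts the batch weights toward high-loss examples,
    which is the closed-form inner maximizer of a KL-constrained DRO
    problem. The optimizer must then improve on the hardest examples
    (e.g. the prompt rewordings the model currently fails on), not
    just the average case.
    """
    # Exponential tilting: q_i proportional to exp(loss_i / temperature).
    # Subtract the max loss first for numerical stability.
    max_l = max(losses)
    weights = [math.exp((l - max_l) / temperature) for l in losses]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Robust loss: weighted average under the adversarial distribution q.
    robust = sum(w * l for w, l in zip(weights, losses))
    return robust, weights
```

A small `temperature` concentrates the weights on the worst examples (approaching a pure max-loss objective); a large `temperature` recovers the ordinary uniform average, so the parameter controls how pessimistic the training signal is.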
Empirical results show DRTO delivers significant practical gains. When tested on standard mathematical reasoning benchmarks, models fine-tuned with DRTO showed markedly improved consistency. The method yielded a 9.17% performance improvement on the GSM8K dataset and a 2.49% boost on MathQA compared to standard RLHF approaches. This indicates the model becomes less sensitive to superficial prompt variations and more dependable in its problem-solving logic. The technique represents a move beyond simply maximizing average performance, focusing instead on minimizing catastrophic failures and building AI systems that professionals can trust to deliver correct answers reliably, not just most of the time.
- Proposes DRTO, combining token-level RLHF with Distributionally Robust Optimization (DRO) for theoretical robustness guarantees.
- Achieves a 9.17% performance improvement on the GSM8K math benchmark and 2.49% on MathQA.
- Specifically targets and reduces LLM failure rates caused by small changes in prompt wording, format, or language.
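The sensitivity described in the last bullet is directly measurable: score a model on several paraphrases of the same problem and check how often its answers agree. A minimal sketch, where `answer` is any hypothetical callable mapping a prompt string to a final answer (not something defined in the paper):

```python
from collections import Counter

def consistency_rate(answer, paraphrases):
    """Fraction of paraphrased prompts whose answer matches the majority.

    `answer` maps a prompt string to a final answer string; a rate of
    1.0 means the model's output is invariant to rewording, while lower
    values quantify the brittleness that robust training aims to reduce.
    """
    answers = [answer(p) for p in paraphrases]
    _majority, count = Counter(answers).most_common(1)[0]
    return count / len(answers)
```

For example, with a stubbed model that answers "18" on three of four rewordings and "20" on the fourth, the rate is 0.75.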
Why It Matters
Makes AI assistants more reliable for critical tasks like code generation, data analysis, and financial reasoning where consistency is paramount.