Risk-Sensitive Games Make LLM Fine-Tuning Robust Across Data Strata
New method uses CVaR-like risk measures to prevent systematic failures in preference learning
A research team led by Max Horwitz, Jake Gonzales, Eric Mazumdar, and Lillian J. Ratliff has published a paper on arXiv proposing risk-sensitive preference games to make LLM fine-tuning more robust. Current state-of-the-art methods such as Nash Learning from Human Feedback (NLHF) reframe preference optimization as a zero-sum game over policies, but they optimize average pairwise win rates. As a result, a model can score well on average while systematically failing on certain prompts, annotator groups, or safety-critical data strata. The new framework replaces expected-payoff optimization with convex risk measures, such as Conditional Value at Risk (CVaR), which focus on tail behavior rather than averages.
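To make the gap between average and tail performance concrete, here is a minimal sketch in Python. The per-stratum win rates are invented for illustration and do not come from the paper. The helper `lower_tail_cvar` is a hypothetical name, and it uses a simple discrete convention (the mean of the worst alpha-fraction of values), which is an assumption and may differ from the paper's exact definition.

```python
import numpy as np

def lower_tail_cvar(values, alpha):
    """Mean of the worst alpha-fraction of values (discrete lower-tail CVaR).

    Assumption: higher values are better (win rates), so the "tail" is the
    lowest values. Loss-based conventions use the upper tail instead.
    """
    ordered = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(alpha * len(ordered))))  # number of tail entries
    return float(ordered[:k].mean())

# Hypothetical win rates on five data strata (illustrative numbers only).
policy_a = np.array([0.90, 0.90, 0.90, 0.90, 0.20])  # strong on average, fails one stratum
policy_b = np.array([0.75, 0.75, 0.75, 0.75, 0.70])  # slightly lower average, no failure

mean_a, mean_b = float(policy_a.mean()), float(policy_b.mean())
cvar_a = lower_tail_cvar(policy_a, alpha=0.2)
cvar_b = lower_tail_cvar(policy_b, alpha=0.2)

print(f"mean: A={mean_a:.2f}, B={mean_b:.2f}")
print(f"CVaR: A={cvar_a:.2f}, B={cvar_b:.2f}")
```

In this toy setup, policy A wins on the mean (0.76 versus 0.74) but loses badly on the tail measure (0.20 versus 0.70), which is the kind of failure an average-based objective cannot see.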
The key technical challenge is that risk sensitivity typically breaks the zero-sum structure that efficient self-play algorithms rely on. The authors show, however, that the translation invariance of many risk measures preserves monotonicity, which enables fast convergence. They also provide algorithmic stability guarantees and offline sample complexity bounds that scale with the risk parameter. Obtaining these bounds requires simultaneously controlling the structural bias introduced by nonlinear risk transformations and the statistical bias in risk estimation. To address this, they introduce a hierarchical game formulation and a two-timescale extragradient algorithm with bias correction that converges to the Stackelberg equilibrium and is particularly effective in low-sample regimes. Empirically, the risk-adjusted policies perform on par with or better than risk-neutral methods across data strata, with no performance tax: robustness does not come at the cost of average performance.
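Translation invariance means that adding a constant c to every payoff shifts the risk value by exactly c. The following sketch checks this numerically for a discrete lower-tail CVaR. The function name, the synthetic payoffs, and the tail convention are all assumptions made for illustration. It demonstrates only the property itself, not the paper's monotonicity argument.

```python
import numpy as np

def lower_tail_cvar(values, alpha):
    """Mean of the worst alpha-fraction of values (higher is better)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(alpha * len(ordered))))
    return float(ordered[:k].mean())

rng = np.random.default_rng(0)
payoffs = rng.uniform(0.0, 1.0, size=1000)  # synthetic payoffs, illustrative only
shift = 0.3                                 # arbitrary constant added to every payoff

risk_of_shifted = lower_tail_cvar(payoffs + shift, alpha=0.1)
shifted_risk = lower_tail_cvar(payoffs, alpha=0.1) + shift

# Translation invariance: rho(X + c) == rho(X) + c, up to floating-point error.
print(abs(risk_of_shifted - shifted_risk) < 1e-9)
```

The check passes because a constant shift does not change which samples fall in the tail, so the tail mean moves by exactly the same constant.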
- Standard NLHF optimizes average pairwise win rates, masking systematic failures in tail data strata
- Risk-sensitive preference games use convex risk measures (e.g., CVaR) to focus on tail performance rather than averages
- Novel two-timescale extragradient algorithm with bias correction converges to the Stackelberg equilibrium and is particularly effective in low-sample regimes, with policies that match or exceed risk-neutral methods
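For readers unfamiliar with extragradient methods, the sketch below shows the basic idea on a toy zero-sum game, f(x, y) = x * y, where x minimizes and y maximizes. This is the plain, single-timescale extragradient update, not the authors' two-timescale, bias-corrected algorithm. The game, step size, and function names are assumptions chosen for illustration.

```python
import numpy as np

def operator(z):
    """Game operator for f(x, y) = x * y: (df/dx, -df/dy) = (y, -x)."""
    x, y = z
    return np.array([y, -x])

def gda_step(z, eta):
    """Simultaneous gradient descent-ascent: one operator evaluation per step."""
    return z - eta * operator(z)

def extragradient_step(z, eta):
    """Extragradient: look ahead to a midpoint, then update using its operator value."""
    z_half = z - eta * operator(z)
    return z - eta * operator(z_half)

def run(step, z0, eta, n_steps):
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        z = step(z, eta)
    return z

z0 = [1.0, 1.0]  # the unique equilibrium of this game is (0, 0)
gda_final = run(gda_step, z0, eta=0.1, n_steps=2000)
eg_final = run(extragradient_step, z0, eta=0.1, n_steps=2000)

print("GDA distance to equilibrium:", np.linalg.norm(gda_final))
print("EG distance to equilibrium: ", np.linalg.norm(eg_final))
```

On this bilinear game, plain descent-ascent spirals away from the equilibrium while the extragradient iterates contract toward it, which is why extragradient-style updates are a common building block for game-theoretic training.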
Why It Matters
Risk-sensitive fine-tuning could make LLM alignment more robust to systematic failures on specific prompts, annotator groups, and safety-critical data strata without sacrificing average performance.