Distributional Alignment Games for Answer-Level Fine-Tuning
A game between two models solves intractable marginalization in answer-level optimization.
Answer-Level Fine-Tuning (ALFT) aims to optimize large language models based solely on the correctness or properties of their final answers, ignoring the specific reasoning traces used. This is a more direct objective than traditional fine-tuning, but it is computationally intractable because it requires marginalizing over the vast space of latent reasoning paths. To overcome this, the authors introduce a general game-theoretic framework they call a Distributional Alignment Game. They formulate ALFT as a two-player game between a Policy (the generator) and an auxiliary Target distribution. Crucially, they prove that the Nash Equilibrium of this game corresponds exactly to the solution of the original ALFT optimization problem. This variational perspective converts the intractable marginalization into a tractable projection problem, making the approach computationally feasible.
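As a rough illustration of this structure (using notation assumed here for exposition, not taken from the paper), the answer-level objective and its intractable marginal can be written as

$$
\max_{\pi}\; \mathbb{E}_{x}\!\left[\,\mathbb{E}_{y \sim \pi(y \mid x)}\, r(x, y)\right],
\qquad
\pi(y \mid x) \;=\; \sum_{z} \pi(z \mid x)\, \pi(y \mid z, x),
$$

where $x$ is a prompt, $z$ a latent reasoning trace, $y$ the final answer, and $r$ an answer-level reward. The sum over traces $z$ is what makes the objective intractable. In the game view, an auxiliary Target distribution over traces is optimized jointly with the Policy; under the assumptions of this sketch, each player's best response is a divergence-minimizing projection (for example, a KL projection) rather than an explicit marginalization, and the paper's result is that the resulting Nash Equilibrium coincides with the optimum of the original ALFT objective.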
The proposed framework unifies recent methods for promoting diversity (ensuring varied outputs) and self-improvement (coherence across reasoning steps). The authors demonstrate that their game-theoretic formulation is compatible with Group Relative Policy Optimization (GRPO), a popular reinforcement learning technique that trains LLMs by comparing the rewards of groups of sampled responses rather than relying on a learned value function. They introduce a specific variant, Coherence-GRPO, that leverages the new framework and yields significant gains on mathematical reasoning benchmarks. This work provides both a theoretical foundation and practical algorithms for directly optimizing answer-level objectives, potentially simplifying LLM training pipelines while improving performance on tasks where the quality of the reasoning trace matters less than the correctness of the final answer.
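To make the GRPO connection concrete, below is a minimal sketch of the group-relative advantage computation that GRPO uses, driven here by an answer-only reward. The function names, the `####` answer format, and the toy data are hypothetical, and the coherence term that distinguishes Coherence-GRPO is not reproduced, since its exact form is not specified in this summary.

```python
import re
from statistics import mean, pstdev

def answer_reward(response: str, gold_answer: str) -> float:
    """Answer-level reward: 1.0 if the extracted final answer matches the
    reference, else 0.0. The reasoning trace itself is never scored."""
    match = re.search(r"####\s*(.+)$", response.strip())  # assumed answer format
    pred = match.group(1).strip() if match else ""
    return 1.0 if pred == gold_answer.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: center and scale each response's reward by the
    mean and standard deviation of its sampled group, so no learned value
    function (critic) is required."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:  # all responses scored identically -> no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Toy usage: one prompt with a group of four sampled responses.
responses = [
    "Reasoning path A ... #### 42",
    "Different reasoning ... #### 42",
    "Flawed derivation ... #### 17",
    "No final answer given",
]
rewards = [answer_reward(r, "42") for r in responses]   # [1.0, 1.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)         # [1.0, 1.0, -1.0, -1.0]
print(rewards, advantages)
```

In a full GRPO update, these advantages would weight a PPO-style clipped objective over the tokens of each sampled response, typically with a KL penalty toward a reference model.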
- ALFT optimizes LLMs based on final answer correctness, not reasoning traces, but is intractable due to marginalization over latent paths.
- The game-theoretic formulation (Policy vs. Target) makes the problem tractable by recasting the intractable marginalization as a projection problem whose solution is the game's Nash Equilibrium.
- The framework enables a new algorithm, Coherence-GRPO, which improves efficiency in mathematical reasoning tasks while unifying diversity and self-improvement approaches.
Why It Matters
Makes fine-tuning LLMs for correctness more efficient, potentially reducing training costs and improving reasoning performance.