GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
New framework diagnoses SFT's flaws and offers a more stable, unified path to RL.
A research team from Zhejiang University and other institutions has published a paper introducing Group Fine-Tuning (GFT), a framework designed to replace the standard Supervised Fine-Tuning (SFT) stage in training large language models such as GPT-4 or Llama 3. The paper provides a critical training-dynamics analysis, revealing that SFT behaves like a policy gradient method with an extremely sparse, implicit reward. This produces a problematic 'single-path dependency,' in which the model overfits to a single response pattern, leading to 'entropy collapse' (a loss of output diversity) and 'gradient explosion' (unstable optimization). The authors argue that this flawed foundation makes subsequent Reinforcement Learning (RL) alignment less effective.
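To make the diagnosis concrete, the equivalence can be written out in standard policy-gradient notation. The derivation below illustrates the claim as summarized here rather than reproducing the paper's exact analysis: maximizing the log-likelihood of a single demonstration y* is a policy-gradient update under an implicit reward that is zero everywhere except on y*.

```latex
% SFT gradient on a single demonstration y* for prompt x:
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
    = -\,\nabla_\theta \log \pi_\theta(y^\star \mid x)

% Generic policy gradient over the model's own samples:
\nabla_\theta J(\theta)
    = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
      \bigl[ R(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \bigr]

% The two coincide (up to sign) under the implicit, extremely sparse reward:
R(y) = \frac{\mathbf{1}[y = y^\star]}{\pi_\theta(y^\star \mid x)}
```

Note the 1/π_θ(y*|x) factor: it is the inverse-probability weight that the paper's second mechanism, described next, is designed to bound, and it diverges whenever the demonstration is unlikely under the current policy.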
GFT directly tackles these issues with two core mechanisms. First, 'Group Advantage Learning' constructs groups of diverse candidate responses for a given prompt and applies normalized, contrastive supervision among them. This alleviates reward sparsity by providing richer comparative feedback than a single 'correct' answer. Second, 'Dynamic Coefficient Rectification' adaptively bounds the inverse-probability weights used during optimization. This prevents the instability and gradient explosions inherent in SFT while still allowing efficient knowledge injection from human demonstrations. The result is a more unified and stable pipeline from imitation learning to reward-based fine-tuning.
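A minimal PyTorch-style sketch of how these two ideas could compose is shown below; the function names, the GRPO-style within-group normalization, and the clipping bound `c_max` are illustrative assumptions, not the paper's exact objective.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize rewards within a group of responses to the same prompt,
    # yielding zero-mean contrastive advantages: each response is supervised
    # against its group rather than in isolation.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rectified_coefficients(logp: torch.Tensor, c_max: float = 10.0) -> torch.Tensor:
    # Plain SFT implicitly weights each demonstration by 1 / pi_theta(y | x),
    # which explodes for low-probability demonstrations. Clamping at c_max
    # (a hypothetical knob) keeps the update finite.
    return torch.exp(-logp).clamp(max=c_max)

def gft_loss(logp: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # logp:    (G,) sequence log-probabilities of G grouped responses
    # rewards: (G,) scalar rewards, e.g. 1.0 for the human demonstration
    #          and 0.0 for sampled alternatives
    adv = group_advantages(rewards)               # contrastive signal
    coef = rectified_coefficients(logp).detach()  # bounded weight, no gradient
    return -(coef * adv * logp).mean()

# Example: a group of G = 4 responses, the first being the demonstration.
logp = torch.tensor([-2.3, -5.1, -4.8, -6.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 0.0, 0.0])
gft_loss(logp, rewards).backward()
```

Normalizing within the group gives every prompt a zero-mean contrastive signal even when only one response is labeled correct, while the clamp caps the 1/π_θ weight that, as the analysis above shows, is unbounded under plain SFT.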
Initial experiments demonstrate that policies trained with GFT consistently surpass those trained with SFT-based methods. Crucially, these GFT-trained models integrate more smoothly with subsequent RL training stages, such as Reinforcement Learning from Human Feedback (RLHF), suggesting a stronger foundation for alignment. The work provides both a theoretical diagnosis of SFT's limitations and a practical, drop-in alternative that could improve the efficiency and robustness of training the next generation of AI assistants and agents.
- Diagnoses SFT as causing single-path dependency, entropy collapse, and gradient explosion due to sparse implicit rewards.
- Introduces 'Group Advantage Learning' for contrastive supervision and 'Dynamic Coefficient Rectification' for stable optimization.
- Experimental results show GFT outperforms SFT and creates a better foundation for subsequent RL training.
Why It Matters
Offers a more stable, unified training pipeline that could lead to better-performing and more reliably aligned AI models.