[P] I trained Qwen2.5-1.5B with RLVR (GRPO) vs SFT and compared benchmark performance
RLVR training on just one example improved reasoning, while standard SFT degraded performance by 15.2 points.
An independent researcher has published a detailed comparison showing that Reinforcement Learning with Verifiable Rewards (RLVR), implemented here via Group Relative Policy Optimization (GRPO), dramatically outperforms standard Supervised Fine-Tuning (SFT) for improving reasoning in small language models. The project, led by Jaymin Ban, fine-tuned Alibaba's Qwen2.5-1.5B-Instruct model on the GSM8K math dataset using both methods. The results were stark: RLVR, the technique behind DeepSeek's recent R1 model, boosted GSM8K accuracy by 11.9 points, while SFT cut it by 15.2 points. This suggests SFT can train models to mimic surface-level answer formats without developing genuine reasoning capability, a critical insight for the open-source AI community.
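The write-up doesn't reproduce the training code inline, but the core RLVR recipe can be sketched with Hugging Face's trl library, whose GRPOTrainer implements GRPO. This is a minimal sketch under stated assumptions, not the author's exact setup: the reward function, hyperparameters, and output path are illustrative. The verifiable reward simply checks whether the model's final number matches the GSM8K gold answer.

```python
import re

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def exact_answer_reward(completions, answer, **kwargs):
    """Verifiable reward: 1.0 if the last number in a completion matches
    the GSM8K gold answer (the text after '####'), else 0.0."""
    rewards = []
    for completion, gold in zip(completions, answer):
        gold_value = gold.split("####")[-1].strip().replace(",", "")
        nums = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        rewards.append(1.0 if nums and nums[-1] == gold_value else 0.0)
    return rewards

# GSM8K ships "question"/"answer" columns; GRPOTrainer expects a "prompt" column.
train = load_dataset("openai/gsm8k", "main", split="train")
train = train.rename_column("question", "prompt")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=exact_answer_reward,
    args=GRPOConfig(
        output_dir="qwen2.5-1.5b-grpo-gsm8k",  # hypothetical output path
        num_generations=8,          # completions sampled per prompt per step
        max_completion_length=512,  # cap on generated reasoning length
    ),
    train_dataset=train,
)
trainer.train()
```

Computing the reward deterministically from the gold label, rather than from a learned reward model, is what makes it "verifiable"; the exact-match check above is the simplest such verifier.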
The technical deep dive reveals RLVR's efficiency and generalization power. Remarkably, RLVR training with just a single example from the dataset still improved performance on both GSM8K and the unrelated MATH benchmark, indicating it strengthens the model's core reasoning circuits rather than just memorizing patterns. The project required significant compute, using up to 8 GPUs (RTX 3090/4090/5090) for over 32 hours per experiment and benchmarking 388 model checkpoints. All code, data, and over 2.4 million logged responses are publicly available on GitHub and Hugging Face, providing a reproducible blueprint for others. This work validates RLVR as a superior training paradigm for reasoning tasks and offers a cautionary tale about the potential downsides of standard SFT approaches.
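The one-example result can be expressed in the same framework: keep the reward and trainer from the sketch above, shrink the training set to a single problem, and iterate over it for many epochs. Which problem was selected and how long it was trained are assumptions here, not details from the post.

```python
# Continues the sketch above (reuses `train` and `exact_answer_reward`).
# The example index and epoch count are illustrative assumptions.
one_shot = train.select([0])  # a single (prompt, answer) pair

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=exact_answer_reward,
    args=GRPOConfig(
        output_dir="qwen2.5-1.5b-grpo-one-example",  # hypothetical output path
        num_generations=8,
        num_train_epochs=500,  # the lone prompt is resampled step after step
    ),
    train_dataset=one_shot,
)
trainer.train()
```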
- RLVR training boosted Qwen2.5-1.5B's GSM8K math score by 11.9 points, while SFT cut it by 15.2 points.
- RLVR improved performance even when trained on just one example, and also lifted scores on the unrelated MATH benchmark, suggesting generalized reasoning gains.
- The project benchmarked 388 model checkpoints and logged over 2.4 million responses in a public SQLite database for full reproducibility.
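The digest doesn't describe the schema of that SQLite database, but per-checkpoint response logging of this kind is easy to picture; the table and column names below are invented for illustration and may not match the author's layout.

```python
import sqlite3

# Hypothetical schema for logging benchmark responses per checkpoint.
conn = sqlite3.connect("responses.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS responses (
        checkpoint TEXT,     -- e.g. "grpo-step-1200"
        benchmark  TEXT,     -- e.g. "gsm8k" or "math"
        question   TEXT,
        response   TEXT,
        correct    INTEGER   -- 1 if the verifiable check passed, else 0
    )
""")

def log_response(checkpoint, benchmark, question, response, correct):
    """Append one model response; ~2.4M rows at the scale reported."""
    conn.execute(
        "INSERT INTO responses VALUES (?, ?, ?, ?, ?)",
        (checkpoint, benchmark, question, response, int(correct)),
    )
    conn.commit()
```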
Why It Matters
Provides empirical evidence that RLVR is superior to SFT for reasoning tasks, offering a roadmap for efficiently improving open-source small models.