DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning
New research shows smarter data splitting between SFT and RL yields 15% better STEM performance
A research team from ETH Zurich and the University of Edinburgh has introduced DeReason, a training methodology that addresses a critical challenge in AI development: how to effectively combine supervised fine-tuning (SFT) and reinforcement learning (RL) for complex reasoning tasks. Their experiments show that applying RL directly to base models in general STEM domains is highly sample-inefficient and often underperforms SFT alone. A sequential SFT-then-RL approach, by contrast, can yield significant gains, but only if the training data is intelligently allocated between the two stages.
DeReason's core innovation is a difficulty-based data decoupling strategy. It uses an LLM-based scoring system to estimate the reasoning intensity of each training problem, partitioning data into reasoning-intensive and non-reasoning-intensive subsets. The system allocates broad-coverage, easier problems to SFT to build foundational domain knowledge, while reserving a focused subset of the most difficult problems for RL to cultivate complex reasoning skills. Extensive experiments across general STEM and mathematical benchmarks demonstrate that this principled curriculum significantly outperforms SFT-only, RL-only, and random-split baselines.
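The decoupling step can be pictured roughly as in the minimal Python sketch below. It is illustrative only, not the authors' released code: `score_reasoning_intensity` stands in for the paper's LLM-based judge (the keyword heuristic, the 1-5 scale, and the threshold split are placeholder assumptions for how the two subsets might be formed).

```python
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    answer: str
    score: float = 0.0  # estimated reasoning intensity, filled in below

def score_reasoning_intensity(problem: Problem) -> float:
    """Stand-in for an LLM judge that rates how much multi-step reasoning a
    problem requires (here on an assumed 1-5 scale). Replace with a real call
    to a scoring model; the crude keyword heuristic is illustrative only."""
    cues = ("prove", "derive", "estimate", "multi-step", "why")
    hits = sum(cue in problem.question.lower() for cue in cues)
    return min(5.0, 1.0 + hits)

def decouple(problems: list[Problem], threshold: float = 3.0):
    """Partition the corpus into an SFT pool (broad, easier coverage) and an
    RL pool (the most reasoning-intensive problems)."""
    for p in problems:
        p.score = score_reasoning_intensity(p)
    sft_pool = [p for p in problems if p.score < threshold]   # knowledge building
    rl_pool = [p for p in problems if p.score >= threshold]   # hard reasoning
    return sft_pool, rl_pool
```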
The research provides the first systematic study of the interplay between SFT and RL for general reasoning, moving beyond mathematics and coding to broader scientific domains. The team's controlled experiments show that their decoupled approach yields better performance than randomly splitting data, offering what they describe as a "highly effective and generalized post-training recipe." This work represents a significant step toward more efficient and effective training methodologies for AI systems tackling complex reasoning challenges.
- DeReason partitions training data using LLM-based scoring to estimate reasoning intensity, creating difficulty-aware subsets
- The method allocates easier, broad-coverage problems to SFT and reserves the hardest problems for RL, outperforming random splits by 15% (see the sketch after this list)
- Provides first systematic study of SFT-RL interplay for general STEM reasoning, offering a generalized post-training recipe
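Read as one recipe, the two pools then feed a simple sequential pipeline. The sketch below is again a rough illustration: `train_sft` and `train_rl` are placeholder hooks for whatever SFT trainer and RL fine-tuning loop one plugs in; the paper's actual trainers and hyperparameters are not reproduced here.

```python
def train_sft(model, sft_pool):
    """Stage 1 (placeholder): supervised fine-tuning on the broad, easier
    subset to build foundational domain knowledge."""
    ...  # plug in any SFT trainer here
    return model

def train_rl(model, rl_pool):
    """Stage 2 (placeholder): reinforcement learning on the reserved,
    reasoning-intensive subset to cultivate complex reasoning."""
    ...  # plug in any RL fine-tuning loop here
    return model

def dereason_recipe(base_model, sft_pool, rl_pool):
    """Decoupled SFT-then-RL curriculum: knowledge first, hard reasoning second.
    The two pools are the output of the difficulty split sketched earlier."""
    model = train_sft(base_model, sft_pool)
    model = train_rl(model, rl_pool)
    return model
```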
Why It Matters
Offers AI developers a more efficient training blueprint for complex reasoning models, potentially reducing compute costs and improving STEM performance.