Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning
New method uses just 10% of labeled data to outperform SOTA models on math and science benchmarks.
A research team including Zhiyin Yu, Bo Zhang, and Qibin Hou has published a paper titled 'Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning.' The work introduces EasyRL, a framework designed to overcome the high cost and instability of conventional LLM reinforcement learning. Current methods rely heavily on expensive human annotation or suffer from issues like reward hacking and model collapse. EasyRL proposes a paradigm inspired by how humans learn: master easy concepts first, then progressively tackle harder problems.
The core of EasyRL is a three-stage, self-evolving process. It begins by initializing a model with a small amount of labeled 'easy' data. Then, it employs a smart 'divide-and-conquer' strategy on vast pools of unlabeled data, using consistency checks and reflection mechanisms to generate reliable pseudo-labels of increasing difficulty. Finally, the model undergoes iterative self-training and reinforcement learning, continuously strengthening its reasoning capabilities. This creates a unified, data-efficient pipeline for post-training large language models.
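The consistency-check step in the pseudo-labeling stage can be sketched as a majority-vote filter over repeated model samples: answers with high agreement become pseudo-labels, while low-agreement questions are deferred to a harder pool. This is an illustrative sketch only; the function names, sample counts, and agreement threshold below are assumptions, not the paper's actual implementation.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

def consistency_pseudo_label(
    question: str,
    sample_answer: Callable[[str], str],  # stand-in for sampling an LLM answer
    n_samples: int = 8,
    agree_threshold: float = 0.75,
) -> Optional[str]:
    """Majority-vote pseudo-labeling: keep the modal answer only if enough
    samples agree; otherwise return None, marking the question 'too hard'."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / n_samples >= agree_threshold else None

def split_by_difficulty(
    questions: List[str],
    sample_answer: Callable[[str], str],
    **kw,
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Divide-and-conquer pass: confidently pseudo-labeled pairs go to the
    training set; low-agreement questions are deferred to a later round."""
    easy, hard = [], []
    for q in questions:
        label = consistency_pseudo_label(q, sample_answer, **kw)
        if label is not None:
            easy.append((q, label))
        else:
            hard.append(q)
    return easy, hard

# Toy usage with a deterministic stub in place of a real model.
if __name__ == "__main__":
    stub = lambda q: "4" if q == "2+2" else "unsure"
    easy, hard = split_by_difficulty(["2+2"], stub)
    print(easy, hard)  # → [('2+2', '4')] []
```

A reflection mechanism, as described in the paper, would slot in where this sketch returns `None`: instead of immediately deferring, the model could critique and retry its low-agreement answers before the question is sent to the harder pool.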
The experimental results are compelling. On standard mathematical and scientific reasoning benchmarks, models trained with EasyRL using only 10% of the typical labeled data consistently outperformed state-of-the-art baselines, a substantial gain in data efficiency. The method offers a practical, scalable alternative to the resource-intensive supervised learning and unstable unsupervised methods that currently dominate LLM fine-tuning, potentially lowering the barrier to building high-performance, specialized AI agents.
- Follows a human-inspired 'easy-to-hard' curriculum, requiring only 10% of the usual labeled data for training.
- Introduces a 'divide-and-conquer' pseudo-labeling strategy for unlabeled data, combining consistency and reflection.
- Outperforms current SOTA methods on math and science benchmarks, demonstrating superior data efficiency and reasoning.
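The easy-to-hard curriculum summarized above can be illustrated with a toy outer loop, where an integer `skill` stands in for model capability and each unlabeled item carries a difficulty score. Everything here is a hypothetical illustration of the curriculum idea, not the paper's training code.

```python
from typing import List, Tuple

def self_evolve(
    skill: int,
    easy_labeled: List[Tuple[str, int]],
    unlabeled: List[Tuple[str, int]],
    rounds: int = 5,
) -> Tuple[int, List[Tuple[str, int]]]:
    """Toy easy-to-hard loop: warm-start on labeled easy data, then in each
    round pseudo-label only items just beyond current skill and 'retrain'
    (here, simply raising skill to the hardest item mastered)."""
    # Stage 1: initialize from the small labeled easy set.
    skill = max([skill] + [d for _, d in easy_labeled])
    pool = sorted(unlabeled, key=lambda item: item[1])
    for _ in range(rounds):
        # Stage 2: divide and conquer — take items within reach, defer the rest.
        labeled = [(q, d) for q, d in pool if d <= skill + 1]
        pool = [(q, d) for q, d in pool if d > skill + 1]
        if not labeled:
            break  # nothing within reach; the remaining pool is 'too hard'
        # Stage 3: self-training / RL, abstracted as a capability bump.
        skill = max(skill, max(d for _, d in labeled))
    return skill, pool

# Usage: starting from skill 0 with one easy labeled item, the loop climbs
# through difficulties 2 and 3 but cannot bridge the gap to difficulty 7.
if __name__ == "__main__":
    print(self_evolve(0, [("a", 1)], [("b", 2), ("c", 3), ("d", 7)]))
```

The point of the sketch is the stopping behavior: progress is made one difficulty step at a time, and items far beyond current capability stay deferred rather than polluting training with unreliable pseudo-labels.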
Why It Matters
Dramatically reduces the cost and data needed to create high-performance, specialized AI models for complex reasoning tasks.