Open Source

Self-taught Qwen 2.5 7B reaches 80% on HumanEval, beats GPT-3.5 on math

A 7B model trained only on its own errors outperforms GPT-3.5 without any human data.

Deep Dive

In a demonstration of self-supervised learning, a developer trained small language models exclusively on their own mistakes, with no human-written code in the training data. The process started with the Qwen 2.5 7B base model: it invented coding problems, wrote test cases, then attempted solutions. Whether an attempt counted as correct was decided solely by running it in the Python interpreter. Pairs of broken and working code were then used for fine-tuning. An early bug in the grader, which truncated functions before scoring them, made the model look worse than it was. Once the grader was fixed, the same 7B model's score rose from 25 to 112 correct on HumanEval, a benchmark of 164 coding challenges. The score is reported as 80%, although 112 of 164 works out to about 68%.
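The pair-mining step described above can be sketched roughly as follows. This is a minimal illustration, not the developer's code: the function names `passes_tests` and `mine_pairs` are hypothetical, the model calls are left out (attempts arrive as plain strings), and the choice to pair each failing attempt with the first passing one is an assumption. The only judge is the interpreter's exit code.

```python
from __future__ import annotations

import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Run solution + tests in a fresh interpreter; exit code 0 means pass."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # An attempt that hangs is treated as a failure.
            return False
        return result.returncode == 0


def mine_pairs(attempts: list[str], tests: str) -> list[tuple[str, str]]:
    """Pair each failing attempt with the first passing attempt (assumed policy)."""
    verdicts = [(code, passes_tests(code, tests)) for code in attempts]
    working = [code for code, ok in verdicts if ok]
    broken = [code for code, ok in verdicts if not ok]
    if not working:
        return []  # no verified fix available, so nothing to learn from
    return [(bad, working[0]) for bad in broken]


# Stand-ins for model output: one buggy attempt and one correct attempt.
attempts = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
pairs = mine_pairs(attempts, tests)
```

Running each attempt in a separate process keeps a crashing or hanging solution from taking down the miner, at the cost of some startup overhead per attempt. A real pipeline would also need sandboxing, since the code being executed is model-generated.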

Scaling up, Qwen 2.5 14B mined just 100 self-generated pairs and was fine-tuned on them for $3.50 in cloud credits (95 minutes on an H100). It landed within 4 points of the official RLHF version from the same company. To rule out data-format artifacts, the developer also trained on random garbage pairs, which yielded no improvement. The method carried over to Meta's Llama 3.2 3B, which suggests it is not tied to the Qwen model family. The technique suggests that with verifiable rewards (like code execution), models can improve their own capabilities without expensive human annotation.
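The article says only that the control used "random garbage pairs" and does not describe how they were built. One plausible format-preserving control, shown here purely as an assumption, is to keep every broken and working snippet but mismatch them, so the layout of the training data is identical while the correction signal is destroyed. The function name `make_control_pairs` is hypothetical.

```python
import random


def make_control_pairs(pairs, seed=0):
    """Return pairs with the same format but mismatched contents.

    Each broken snippet is re-paired with the working snippet from a
    different original pair (assumed control design, not the developer's).
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to mismatch them")
    rng = random.Random(seed)
    broken = [b for b, _ in pairs]
    working = [w for _, w in pairs]
    # Rotate by a random non-zero offset: no index keeps its own partner.
    offset = rng.randrange(1, len(pairs))
    rotated = working[offset:] + working[:offset]
    return list(zip(broken, rotated))


real = [("bad_a", "good_a"), ("bad_b", "good_b"), ("bad_c", "good_c")]
control = make_control_pairs(real)
```

If fine-tuning on `control` gave the same lift as fine-tuning on `real`, the gain would be attributable to the data format rather than to the corrections themselves.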

Key Points
  • Qwen 2.5 7B improved from 25 to 112 correct on HumanEval after training on self-generated correction pairs (reported as 80%; 112 of 164 is about 68%).
  • Qwen 2.5 14B, trained for $3.50 on 100 self-mined pairs, came within 4 points of the company's RLHF version.
  • The method also worked on Llama 3.2 3B; garbage pairs produced no lift, indicating the signal comes from real errors.

Why It Matters

Enables small models to achieve high performance without expensive human-labeled data or large-scale RLHF.