Open Source

Self-taught Qwen 2.5 7B reaches 80% on HumanEval, beats GPT-3.5 on math

r/LocalLLaMA May 15, 2026

⚡A 7B model trained only on its own errors outperforms GPT-3.5 without any human data.

Deep Dive

A developer trained small language models exclusively on their own mistakes, with no human-written code in the training data. The process started with the Qwen 2.5 7B base model: it invented coding problems, wrote test cases, and then attempted solutions. Only the Python interpreter's results decided which attempts counted as correct or incorrect, and the resulting pairs of broken and working code were used for fine-tuning. An early bug in the grader, which truncated functions before scoring them, made the model look worse than it was. Once that was fixed, the same 7B model went from 25 to 112 correct on HumanEval, a benchmark of 164 coding challenges. The post reports this as 80%, although 112 of 164 works out to about 68%.
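The mining step described above can be sketched roughly as follows. This is a minimal illustration, not the developer's actual code: the function names `passes_tests` and `mine_pairs`, the `broken`/`fixed` keys, and the rule of pairing each failing attempt with the first passing one are all assumptions made for the example. The only judge is the interpreter's exit code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution: str, tests: str, timeout_s: float = 5.0) -> bool:
    """Run solution + tests in a fresh interpreter; exit code 0 means pass."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # An attempt that hangs is treated as a failing attempt.
            return False
        return result.returncode == 0


def mine_pairs(attempts: list[str], tests: str) -> list[dict[str, str]]:
    """Pair every failing attempt with the first passing attempt (hypothetical rule)."""
    verdicts = [(attempt, passes_tests(attempt, tests)) for attempt in attempts]
    passing = [attempt for attempt, ok in verdicts if ok]
    failing = [attempt for attempt, ok in verdicts if not ok]
    if not passing:
        return []  # no verified fix, so nothing to learn from
    return [{"broken": bad, "fixed": passing[0]} for bad in failing]


# Toy stand-ins for model-written tests and model-written attempts.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
attempts = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
]
pairs = mine_pairs(attempts, tests)
```

In a real pipeline the attempts would come from sampling the model, and untrusted generated code should run in a proper sandbox rather than a plain subprocess.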

Scaling up, Qwen 2.5 14B mined just 100 self-generated pairs and trained for $3.50 in cloud credits (95 minutes on an H100, or roughly $2.21 per hour). It landed within 4 points of the official RLHF version from the same company. To rule out data-format artifacts, the developer also trained on random garbage pairs, which yielded no improvement. The method also worked on Meta's Llama 3.2 3B, which suggests it is not tied to one model family. The results suggest that with verifiable rewards such as code execution, models can improve on their own outputs without expensive human annotation.
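One way to build a format-matched control set like the garbage pairs mentioned above is sketched here. The post does not say how the garbage pairs were generated, so this is purely an assumption: each side of a pair is replaced with random lowercase letters while the keys, lengths, and line breaks stay the same. The names `garbage_like` and `make_control_pairs` are hypothetical.

```python
import random
import string


def garbage_like(text: str, rng: random.Random) -> str:
    """Random lowercase letters with the same length and line breaks as `text`."""
    return "".join(
        ch if ch == "\n" else rng.choice(string.ascii_lowercase) for ch in text
    )


def make_control_pairs(pairs: list[dict[str, str]], seed: int = 0) -> list[dict[str, str]]:
    """Same number of pairs, same keys, same shape, but no real error signal."""
    rng = random.Random(seed)  # seeded so the control set is reproducible
    return [
        {key: garbage_like(value, rng) for key, value in pair.items()}
        for pair in pairs
    ]


real_pairs = [
    {
        "broken": "def add(a, b):\n    return a - b",
        "fixed": "def add(a, b):\n    return a + b",
    }
]
control_pairs = make_control_pairs(real_pairs)
```

If fine-tuning on `control_pairs` gives the same lift as fine-tuning on `real_pairs`, the gain would come from the data format rather than from the corrections themselves.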

Key Points

Qwen 2.5 7B improved from 25 to 112 on HumanEval (80%) after training on self-generated correction pairs.
Qwen 2.5 14B trained for $3.50 on 100 self-mined pairs reached within 4 points of the company's RLHF version.
Method validated on Llama 3.2 3B; fake data produced zero lift, confirming the signal comes from real errors.

Why It Matters

Enables small models to achieve high performance without expensive human-labeled data or large-scale RLHF.
