RLVR boosts small code models by up to 13 points on MBPP
Qwen3-0.6B and Llama3.2-1B gain up to 13 percentage points of pass@1 from RL with verifiable rewards.
A new study from Egor Skopin and Evgeny Kotelnikov, accepted for AINL-2026, explores reinforcement learning with verifiable rewards (RLVR) for improving code generation in small language models. The researchers fine-tuned two compact models, Qwen3-0.6B and Llama3.2-1B, with LoRA on the MBPP Python benchmark. They compared three reward formulations (unit-test-only, static-analysis-only using the Ruff linter, and a combination of the two) and trained with the group-based policy optimization variants GRPO and GSPO.
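To make the three reward formulations concrete, here is a minimal Python sketch. The exact formulas and weights used in the study are not given in this summary, so the `RolloutResult` container, the per-error penalty, and `lint_weight` are all illustrative assumptions, and the lint count is passed in as a number rather than obtained by calling Ruff.

```python
from dataclasses import dataclass


@dataclass
class RolloutResult:
    """Outcome of verifying one sampled completion (hypothetical container)."""
    tests_passed: int  # unit tests that passed
    tests_total: int   # unit tests that were run
    lint_errors: int   # diagnostics reported by a linter such as Ruff


def unit_test_reward(r: RolloutResult) -> float:
    """Unit-test-only reward: fraction of tests passed, in [0, 1]."""
    if r.tests_total == 0:
        return 0.0
    return r.tests_passed / r.tests_total


def lint_reward(r: RolloutResult, penalty_per_error: float = 0.1) -> float:
    """Static-analysis-only reward in [-1, 0]; the penalty size is assumed."""
    return -min(1.0, penalty_per_error * r.lint_errors)


def combined_reward(r: RolloutResult, lint_weight: float = 0.5) -> float:
    """Correctness plus a down-weighted lint penalty; the weight is assumed."""
    return unit_test_reward(r) + lint_weight * lint_reward(r)


# Two hypothetical completions for the same task.
stub = RolloutResult(tests_passed=0, tests_total=3, lint_errors=0)   # clean but wrong
messy = RolloutResult(tests_passed=3, tests_total=3, lint_errors=4)  # correct but unpolished

for name, result in [("stub", stub), ("messy", messy)]:
    print(name, unit_test_reward(result), lint_reward(result), combined_reward(result))
```

With these assumed weights, a lint-clean completion that fails every test earns the maximum static-analysis-only reward, while the combined reward ranks the correct but unpolished completion above it.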
The results showed that the combined reward improved pass@1 by up to 13 percentage points on the MBPP test set. However, using only static-analysis penalties biased the models toward shorter completions that reduced lint errors without actually improving functional correctness. The combined reward mitigated this degeneration, offering a stable trade-off between style constraints and correctness. The authors emphasize that RLVR effectiveness is highly sensitive to reward design and recommend diagnostics beyond pass@1, such as generation length and execution error types, to identify failure modes.
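The diagnostics the authors recommend can be gathered with very little code. The sketch below is an assumption-laden illustration, not the study's evaluation harness: `run_candidate` and `summarize` are hypothetical helpers, character count stands in for generation length in tokens, and bare `exec()` is used only for brevity (real evaluation of model output needs a sandbox and a timeout).

```python
from collections import Counter


def run_candidate(code: str, tests: list[str]) -> str:
    """Return "ok" or the name of the first exception raised.

    Illustration only: exec() on untrusted code is unsafe, and an
    infinite loop in the candidate would hang this function.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)      # define the candidate's functions
        for test in tests:         # MBPP-style tests are assert statements
            exec(test, namespace)
    except Exception as exc:
        return type(exc).__name__
    return "ok"


def summarize(samples: list[tuple[str, list[str]]]) -> dict:
    """Report pass@1 alongside length and error-type diagnostics.

    Assumes one sampled completion per task, so pass@1 is the
    fraction of samples that pass all of their tests.
    """
    if not samples:
        raise ValueError("need at least one sample")
    outcomes = [run_candidate(code, tests) for code, tests in samples]
    n = len(samples)
    return {
        "pass_at_1": outcomes.count("ok") / n,
        "mean_chars": sum(len(code) for code, _ in samples) / n,
        "errors": Counter(o for o in outcomes if o != "ok"),
    }
```

Tracking `mean_chars` and `errors` across training checkpoints would surface the pattern described above: completions getting shorter while the mix of failures shifts, even when pass@1 barely moves.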
- Qwen3-0.6B and Llama3.2-1B improved pass@1 by up to 13 percentage points using combined unit-test + linter rewards.
- Pure static-analysis rewards (Ruff linter) led to shorter code with fewer lint errors but no gain in functional correctness.
- Study uses RLVR with GRPO and GSPO; highlights need for multi-metric diagnostics beyond pass@1.
Why It Matters
With a well-designed reward, RLVR can make small, deployable models noticeably better at coding without massive compute, which matters for edge and privacy-first apps.