RLVR boosts small code models by up to 13 points on MBPP

arXiv cs.SE June 01, 2026

⚡Small models Qwen3-0.6B and Llama3.2-1B get big gains with verification-based RL.

Deep Dive

A new study by Egor Skopin and Evgeny Kotelnikov, accepted for AINL-2026, explores reinforcement learning with verifiable rewards (RLVR) for improving code generation in small language models. The researchers focused on two compact models, Qwen3-0.6B and Llama3.2-1B, fine-tuned with LoRA on the MBPP Python benchmark. They compared three reward formulations (unit-test-only, static-analysis-only using the Ruff linter, and a combined reward) and trained with the group-based policy optimization methods GRPO and GSPO.
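For readers unfamiliar with group-based methods, the core idea in GRPO is to score each sampled completion relative to the other completions drawn for the same prompt. The snippet below is a minimal sketch of that normalization step only, not the authors' training code; the function name, the use of the population standard deviation, and the eps value are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation and a small eps to avoid
    division by zero; real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions: two pass the tests, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that score above the group mean get a positive advantage and the rest get a negative one. If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.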

The results showed that the combined reward improved pass@1 by up to 13 percentage points on the MBPP test set. However, using only static-analysis penalties biased the models toward shorter completions that reduced lint errors without actually improving functional correctness. The combined reward mitigated this degeneration, offering a stable trade-off between style constraints and correctness. The authors emphasize that RLVR effectiveness is highly sensitive to reward design and recommend diagnostics beyond pass@1, such as generation length and execution error types, to identify failure modes.
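The failure mode described above can be illustrated with a toy reward function. This is a hedged sketch under stated assumptions, not the formulation from the paper: the function names, the 0.2 lint weight, the cap of 10 lint errors, and the idea of passing the lint count in as a plain integer (for example, counted from a separate Ruff run) are all hypothetical.

```python
def unit_test_reward(code: str, tests: list[str]) -> float:
    """Return the fraction of assert-style tests that pass (0.0 to 1.0)."""
    namespace: dict = {}
    try:
        # NOTE: model-generated code is untrusted; a real setup would run it
        # in a sandbox with a timeout rather than calling exec directly.
        exec(code, namespace)
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0


def lint_penalty(lint_errors: int, max_errors: int = 10) -> float:
    """Map a lint error count to a penalty in [0, 1] (assumed cap of 10)."""
    return min(lint_errors, max_errors) / max_errors


def lint_only_reward(lint_errors: int) -> float:
    """Static-analysis-only reward: fewer lint errors is always better."""
    return -lint_penalty(lint_errors)


def combined_reward(code, tests, lint_errors, lint_weight=0.2):
    """Correctness reward minus a weighted lint penalty (assumed weighting)."""
    return unit_test_reward(code, tests) - lint_weight * lint_penalty(lint_errors)


tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
correct = "def add(a, b):\n    return a + b\n"
degenerate = "pass\n"  # short, lint-clean, and functionally useless

# Hypothetical lint counts: 2 for the correct solution, 0 for the stub.
print(lint_only_reward(2), lint_only_reward(0))
print(combined_reward(correct, tests, 2), combined_reward(degenerate, tests, 0))
```

Under the lint-only reward the empty stub scores higher than the working solution, which mirrors the bias toward shorter completions reported in the study. Once the unit-test term is included, the working solution wins even though it carries a small lint penalty.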

Key Points

Qwen3-0.6B and Llama3.2-1B improved pass@1 by up to 13 percentage points using combined unit-test + linter rewards.
Pure static-analysis rewards (Ruff linter) led to shorter code with fewer lint errors but no gain in functional correctness.
Study uses RLVR with GRPO and GSPO; highlights need for multi-metric diagnostics beyond pass@1.

Why It Matters

RLVR with a well-designed reward makes small, deployable models noticeably better at coding without massive compute, which matters for edge and privacy-first applications.
