Small LMs predict research success, beating GPT-5 with SFT and RLVR
8B model forecasts idea outcomes at 77% accuracy, outperforming GPT-5's 61%.
A new paper from Srujan P Mule, Aniketh Garikaparthi, and Manasi Patwardhan (ACL 2026 Findings) tackles a growing bottleneck in AI-driven research: evaluating and filtering hundreds of automatically generated hypotheses without costly experiments. The team built a dataset of 11,488 idea pairs from PapersWithCode, grounding each pair in objective benchmark outcomes. They trained small 8B-parameter language models to predict which of two candidate ideas would yield better empirical performance, framing the task as comparative empirical forecasting.
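The comparative forecasting setup described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the `IdeaPair` schema, its field names, and the `predict` callback are hypothetical, and it assumes a higher benchmark score is always better.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class IdeaPair:
    """Two candidate ideas evaluated on the same benchmark (hypothetical schema)."""
    idea_a: str
    idea_b: str
    benchmark: str
    score_a: float  # reported benchmark result for idea A
    score_b: float  # reported benchmark result for idea B

    @property
    def label(self) -> str:
        # Assumption: higher is better. Real data would need per-metric
        # direction (e.g. error rates) and explicit handling of ties.
        return "A" if self.score_a > self.score_b else "B"


def pairwise_accuracy(
    pairs: Sequence[IdeaPair],
    predict: Callable[[str, str, str], str],
) -> float:
    """Fraction of pairs where predict(idea_a, idea_b, benchmark) matches the label."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        predict(p.idea_a, p.idea_b, p.benchmark) == p.label for p in pairs
    )
    return correct / len(pairs)
```

Here `predict` stands in for any forecaster (a fine-tuned 8B model, GPT-5, or a baseline) that returns `"A"` or `"B"` without seeing the scores.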
Off-the-shelf 8B models performed poorly (30% accuracy), but supervised fine-tuning (SFT) raised accuracy to 77.1%, ahead of GPT-5 (61.1%). Reframing evaluation as a reasoning task and training with Reinforcement Learning with Verifiable Rewards (RLVR) yielded 71.35% accuracy, below SFT but still above GPT-5, while producing interpretable justifications. Ablations and out-of-distribution tests showed robustness to surface-level heuristics, and the approach transferred to cross-domain time-split and independently constructed test sets. The results suggest that compute-efficient small LMs can serve as objective verifiers, supporting scalable autonomous scientific discovery without exhaustive experimentation.
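To make the "verifiable" part of RLVR concrete, here is a minimal sketch of a binary reward that can be checked automatically against the benchmark-derived label. The output format (free-form reasoning followed by a final `Answer: A` or `Answer: B` line) and the function name are assumptions for illustration; the summary does not specify the paper's actual reward design.

```python
import re

# Assumed output format: reasoning text, then a final "Answer: A" or "Answer: B".
_ANSWER_RE = re.compile(r"Answer:\s*([AB])\b")


def verifiable_reward(completion: str, gold_label: str) -> float:
    """Return 1.0 if the last stated answer matches the gold label, else 0.0."""
    matches = _ANSWER_RE.findall(completion)
    if not matches:
        # Unparseable completions earn no reward.
        return 0.0
    # Use the last match so earlier mentions inside the reasoning are ignored.
    return 1.0 if matches[-1] == gold_label else 0.0
```

Because the reward depends only on the final choice, the model is free to produce a written justification before it, which is one plausible way to get the interpretable reasoning the summary mentions.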
- Trained 8B-parameter models on 11,488 idea pairs from PapersWithCode to forecast benchmark success, achieving 77.1% accuracy (SFT) vs GPT-5's 61.1%.
- Reinforcement Learning with Verifiable Rewards (RLVR) reached 71.35% accuracy with interpretable justifications for predictions.
- Out-of-distribution tests confirmed robustness to surface heuristics and transfer to cross-domain time-split test sets.
Why It Matters
Compute-efficient small LMs can filter research ideas before costly experiments are run, reducing the experimentation needed and accelerating autonomous science.