Small LMs predict research success beating GPT-5 with RLVR

arXiv cs.LG May 23, 2026

⚡8B model forecasts idea outcomes at 77% accuracy, outperforming GPT-5's 61%.

Deep Dive

A new paper from Srujan P Mule, Aniketh Garikaparthi, and Manasi Patwardhan (ACL 2026 Findings) tackles a growing bottleneck in AI-driven research: evaluating and filtering hundreds of automatically generated hypotheses without costly experiments. The team built a dataset of 11,488 idea pairs from PapersWithCode, grounding each pair in objective benchmark outcomes. They trained small 8B-parameter language models to predict which of two candidate ideas would yield better empirical performance, framing the task as comparative empirical forecasting.
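To make the task framing concrete, here is a minimal Python sketch of comparative forecasting as a pairwise classification problem. The `IdeaPair` fields, the `pairwise_accuracy` helper, and the toy scores are hypothetical illustrations, not the paper's actual schema, data, or code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class IdeaPair:
    """Hypothetical record: two candidate ideas evaluated on the same benchmark."""
    idea_a: str
    idea_b: str
    score_a: float  # benchmark result for idea A (assumed: higher is better)
    score_b: float  # benchmark result for idea B

    @property
    def label(self) -> str:
        # Ground truth comes from the benchmark outcome, not from human judgment.
        # Ties are assumed not to occur; a tie would fall through to "B" here.
        return "A" if self.score_a > self.score_b else "B"


def pairwise_accuracy(
    pairs: Sequence[IdeaPair],
    predict: Callable[[str, str], str],
) -> float:
    """Fraction of pairs where the predictor picks the better-scoring idea."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(predict(p.idea_a, p.idea_b) == p.label for p in pairs)
    return correct / len(pairs)


# Toy data with invented scores, for illustration only.
toy_pairs = [
    IdeaPair("add retrieval step", "plain baseline", 71.2, 68.5),
    IdeaPair("smaller encoder", "larger encoder", 80.1, 83.4),
]

# A trivial predictor that always answers "A" gets one of the two pairs right.
print(pairwise_accuracy(toy_pairs, lambda a, b: "A"))  # 0.5
```

In a real setup, `predict` would wrap a language model call that reads both idea descriptions and returns which one it expects to perform better.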

Off-the-shelf 8B models performed poorly (30% accuracy), but supervised fine-tuning (SFT) raised accuracy to 77.1%, ahead of GPT-5 at 61.1%. When evaluation was instead framed as a reasoning task and trained with Reinforcement Learning with Verifiable Rewards (RLVR), the models reached 71.35% accuracy while producing interpretable justifications for their predictions. Ablations and out-of-distribution tests showed robustness to surface-level heuristics, and the approach transferred to cross-domain time-split and independently constructed test sets. The authors take these results to show that compute-efficient small LMs can serve as objective verifiers, supporting scalable autonomous scientific discovery without exhaustive experimentation.
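The defining feature of RLVR is a reward that can be checked mechanically against ground truth. The sketch below assumes the model writes free-form reasoning followed by a final line such as `Answer: A`; that output convention, the regular expression, and the 0/1 reward values are assumptions for illustration, not details taken from the paper.

```python
import re

# Assumed output convention: reasoning text, then a final "Answer: A" or "Answer: B".
_ANSWER_RE = re.compile(r"Answer:\s*([AB])\s*$", re.IGNORECASE)


def verifiable_reward(completion: str, gold_label: str) -> float:
    """Return 1.0 if the final answer matches the benchmark-derived label, else 0.0.

    Completions that do not end with a parseable answer also receive 0.0,
    so the policy is pushed toward committing to a checkable prediction.
    """
    match = _ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).upper() == gold_label.upper() else 0.0


# Hypothetical completions, for illustration only.
print(verifiable_reward("Idea A reuses a stronger backbone.\nAnswer: A", "A"))  # 1.0
print(verifiable_reward("Idea B looks simpler.\nAnswer: B", "A"))               # 0.0
print(verifiable_reward("I think A is probably better.", "A"))                  # 0.0
```

Only the final answer is scored in this sketch; the justification text is left unconstrained, which is one simple way to obtain readable reasoning alongside a verifiable prediction.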

Key Points

Trained 8B-parameter models on 11,488 idea pairs from PapersWithCode to forecast benchmark success, achieving 77.1% accuracy (SFT) vs GPT-5's 61.1%.
Reinforcement Learning with Verifiable Rewards (RLVR) reached 71.35% accuracy with interpretable justifications for predictions.
Out-of-distribution tests confirmed robustness to surface heuristics and transfer to cross-domain time-split test sets.

Why It Matters

Compute-efficient small LMs could filter research ideas before expensive experiments are run, accelerating autonomous science.
