AI Safety

Scaffolding vs Reinforcement Finetuning for AI Forecasting

Finetuned o4-mini beats baselines on numeric questions but underperforms on binary ones

Deep Dive

Ram Potham built a forecasting bot using OpenAI's Reinforcement Finetuning (RFT) on o4-mini, combined with a multi-agent scaffold. The system ran three parallel forecaster teams (each pairing a researcher with a forecaster), plus an aggregator that drove the teams' predictions to within 2% agreement over two rounds. Training cost $1,670 and took 12 hours 42 minutes on 979 samples drawn from 344 questions (56.5% binary, 21.5% multiple choice, 21.1% numeric). A dual grading system weighted forecast accuracy at 60% and reasoning quality at 40% to prevent the model from simply memorizing outcomes.
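A minimal sketch of how that convergence loop and the dual-grade reward could be wired up; the `ForecasterTeam` interface, the function names, and the peer-feedback mechanism are illustrative assumptions, not the author's actual implementation:

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Optional

CONVERGENCE_TOLERANCE = 0.02  # teams must agree within 2 percentage points
MAX_ROUNDS = 2                # the scaffold ran at most two rounds

@dataclass
class ForecasterTeam:
    # Stands in for a researcher/forecaster LLM pair; `predict` takes the
    # question plus the other teams' latest numbers (None in round one).
    predict: Callable[[str, Optional[List[float]]], float]

def aggregate(teams: List[ForecasterTeam], question: str) -> float:
    """Collect each team's forecast, sharing the round's predictions back
    so teams can revise, until the spread falls within tolerance."""
    peer_predictions: Optional[List[float]] = None
    for _ in range(MAX_ROUNDS):
        predictions = [t.predict(question, peer_predictions) for t in teams]
        if max(predictions) - min(predictions) <= CONVERGENCE_TOLERANCE:
            break
        peer_predictions = predictions
    return mean(predictions)

def dual_grade(forecast_accuracy: float, reasoning_quality: float) -> float:
    """RFT reward: 60% forecast accuracy, 40% reasoning quality, weighted
    so the model can't score well by only memorizing outcomes."""
    return 0.6 * forecast_accuracy + 0.4 * reasoning_quality
```

In the real system each `predict` call would wrap an o4-mini researcher/forecaster exchange; the 0.6/0.4 split in `dual_grade` mirrors the grading weights reported above.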

In the minibench-2025-09-29 tournament (35 questions), the finetuned model won 12 questions (34.3%), the o4-mini baseline won 13 (37.1%), and high-effort o4-mini won 10 (28.6%); average scores were 3.23 for the finetuned model, 4.16 for the baseline, and 1.82 for high-effort. Performance split sharply by question type, however. On the 9 numeric questions, the finetuned model won 5 (55.6%) with an average score of +14.59 vs the baseline's +9.25; on the 26 binary questions, it won only 7 (26.9%) with an average of -0.70 vs the baseline's +2.40. The model learned to trust authoritative sources like CompaniesMarketCap for financial questions, but struggled on political and legal questions where direct data was less available.

Key Points
  • Finetuned model outperformed baselines on numeric questions (avg +14.59 vs +9.25) but underperformed on binary ones (avg -0.70 vs +2.40)
  • Training cost $1,670 and took 12.7 hours on 979 samples from 344 questions
  • Key lesson: model learned to trust authoritative sources, boosting accuracy on data-driven financial questions but hurting on political/legal ones

Why It Matters

The result shows RFT's potential for specialized forecasting tasks, but highlights domain-specific performance gaps that still need addressing.