OpenAI's GPT 5.5 (Codex) tops FutureSim future prediction benchmark at 25% accuracy
Max Planck's FutureSim shows GPT 5.5 beating Polymarket's crowd aggregate on Super Bowl LX.
The Max Planck Institute's new FutureSim benchmark evaluates frontier LLMs on predicting real-world events by replaying temporal slices of the web to agents. In this setting, OpenAI's GPT 5.5 (Codex) leads with 25% accuracy, ahead of Anthropic's Opus 4.6 at 20%. Open-weight models still trail by a wide margin: DeepSeek V4 Pro scores 13%, GLM 5.1 scores 10%, and Qwen3.6 Plus scores 5%. Each model is evaluated in its own native harness.
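To make the "temporal web slice" idea concrete, here is a minimal Python sketch, assuming a corpus of timestamped pages and exact-match scoring. The names `Page`, `visible_pages`, and `accuracy`, the example URLs, and the toy questions are all hypothetical and are not taken from FutureSim itself; the benchmark's real replay and scoring logic may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Page:
    """A hypothetical archived web page with its publication date."""
    url: str
    published: date
    text: str


def visible_pages(corpus, as_of):
    """Return only the pages published on or before the simulated date,
    so an agent cannot see information from its own future."""
    return [page for page in corpus if page.published <= as_of]


def accuracy(predictions, outcomes):
    """Fraction of resolved questions whose predicted answer matches the outcome."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    correct = sum(1 for q, answer in outcomes.items() if predictions.get(q) == answer)
    return correct / len(outcomes)


# Toy data, purely illustrative.
corpus = [
    Page("https://example.com/a", date(2026, 1, 10), "early report"),
    Page("https://example.com/b", date(2026, 2, 20), "later report"),
]
snapshot = visible_pages(corpus, as_of=date(2026, 2, 1))

predictions = {"q1": "yes", "q2": "no", "q3": "no", "q4": "yes"}
outcomes = {"q1": "yes", "q2": "yes", "q3": "yes", "q4": "no"}
score = accuracy(predictions, outcomes)
```

In this toy run only the January page is visible on the simulated date, and one of four questions is answered correctly, which is the kind of ratio a 25% accuracy figure describes.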
More striking, GPT 5.5's predictions sometimes beat the crowd aggregate on Polymarket, a leading prediction market. On Super Bowl LX, a market with $704M traded, GPT 5.5's forecasts outperformed the aggregate of human traders. This suggests AI may be approaching superhuman forecasting in specific domains. As models improve, the line between AI and collective human intelligence in prediction markets may blur, opening applications in finance, policy, and strategic planning.
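One common way to compare a forecaster against a market is the Brier score, the mean squared error between stated probabilities and resolved outcomes (lower is better). The sketch below assumes that metric for illustration only; the article does not say how FutureSim measures "beating" the crowd, and every probability here is invented rather than drawn from the benchmark or from Polymarket.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Invented probabilities for three hypothetical yes/no questions.
model_probs = [0.7, 0.2, 0.9]    # hypothetical model forecasts
market_probs = [0.6, 0.4, 0.8]   # hypothetical market-implied probabilities
resolved = [1, 0, 1]             # 1 = event happened, 0 = it did not

model_score = brier_score(model_probs, resolved)
market_score = brier_score(market_probs, resolved)
model_beats_market = model_score < market_score
```

Under this metric, a model "beats the crowd" on a set of questions when its score is lower than the score of the market-implied probabilities on the same questions.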
- Max Planck's FutureSim benchmark tests models on predicting real-world events from replayed temporal web slices; GPT 5.5 leads at 25% accuracy.
- GPT 5.5 outperformed Polymarket's crowd aggregate on Super Bowl LX, a $704M market.
- Open-weight models like DeepSeek V4 Pro (13%) and Qwen3.6 Plus (5%) trail significantly.
Why It Matters
An AI model that can sometimes beat a prediction market's crowd aggregate points to a possible shift in forecasting, with implications for trading, risk management, and decision-making.