Models & Releases

OpenAI's GPT 5.5 (Codex) tops FutureSim future prediction benchmark at 25% accuracy

r/OpenAI May 16, 2026

⚡ Max Planck's FutureSim shows GPT 5.5 beating Polymarket's crowd aggregate on Super Bowl LX.

Deep Dive

The Max Planck Institute's new FutureSim benchmark evaluates frontier LLMs on predicting real-world events by replaying temporal slices of the web to agents. Each model is evaluated in its native harness. OpenAI's GPT 5.5 (Codex) leads with 25% accuracy, ahead of Anthropic's Opus 4.6 at 20%. Open-weight models trail by a wide margin: DeepSeek V4 Pro reaches 13%, GLM 5.1 10%, and Qwen3.6 Plus 5%.
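The replay idea can be illustrated with a small sketch. This is not FutureSim's code: the `Page` type, the `visible_slice` and `accuracy` helpers, the example URLs, and the dates are all hypothetical, and the benchmark's actual scoring rule is not described in the source. The sketch assumes only that an agent should see pages published up to the replay time and that accuracy is the fraction of questions answered correctly.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Page:
    """A hypothetical archived web page with its publication time."""
    url: str
    published: datetime
    text: str


def visible_slice(pages, as_of):
    """Return pages published at or before `as_of`, oldest first.

    Anything published later is hidden, so the agent cannot read
    information from after the replay time.
    """
    return sorted(
        (p for p in pages if p.published <= as_of),
        key=lambda p: p.published,
    )


def accuracy(predictions, outcomes):
    """Fraction of predicted questions whose answer matches the outcome."""
    if not predictions:
        return 0.0
    correct = sum(
        1 for question, answer in predictions.items()
        if outcomes.get(question) == answer
    )
    return correct / len(predictions)


# Hypothetical corpus: two pages before the replay time, one after it.
corpus = [
    Page("https://example.com/a", datetime(2026, 1, 10), "early preview"),
    Page("https://example.com/b", datetime(2026, 2, 1), "later update"),
    Page("https://example.com/c", datetime(2026, 2, 9), "recap of the result"),
]

snapshot = visible_slice(corpus, as_of=datetime(2026, 2, 5))
print([p.url for p in snapshot])
```

With the replay time set to 5 February 2026, the recap page is excluded, so the agent has to forecast from the two earlier pages alone.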

More striking, GPT 5.5's predictions sometimes beat the crowd aggregate on Polymarket, a leading prediction market. On Super Bowl LX, a market with $704M traded, GPT 5.5's forecast outperformed the aggregate of human traders. This suggests that AI may be approaching superhuman forecasting in specific domains, although at 25% overall accuracy the leading model still misses most questions. As models improve, the line between AI and collective human intelligence in prediction markets may blur, opening applications in finance, policy, and strategic planning.

Key Points

Max Planck's FutureSim benchmark evaluates temporal web prediction; GPT 5.5 leads at 25% accuracy.
GPT 5.5 outperformed Polymarket's crowd aggregate on Super Bowl LX, a $704M market.
Open-weight models like DeepSeek V4 Pro (13%) and Qwen3.6 Plus (5%) trail significantly.

Why It Matters

A model that beats a prediction-market crowd, even occasionally, points to a shift in forecasting, with implications for trading, risk management, and decision-making.
