Media & Culture

Max Planck's FutureSim lets GPT-5.5 beat Polymarket humans on Super Bowl LX

No live web access, just replaying old news, yet outpredicting a $704M market.

Deep Dive

Researchers from the Max Planck Institute have introduced FutureSim, a benchmarking environment in which AI agents are fed a temporal slice of web data and asked to predict real-world events that had not yet happened at that point in time. The environment is designed to test forecasting ability without live web access: the agents only replay historical news snippets. In early experiments, GPT-5.5 (run inside Codex) was evaluated on Polymarket questions that overlapped with FutureSim's event set. The model surprised researchers by beating the human-aggregate market on two high-stakes questions: the Super Bowl LX market, which accumulated $704M in trading volume, and the Portugal presidential runoff. On those questions, GPT-5.5 achieved a Brier skill score of 0.90, where 1.0 would indicate a perfect probabilistic forecast.
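For readers unfamiliar with the metric, the Brier skill score mentioned above is conventionally defined as 1 - BS/BS_ref, where BS is the mean squared error between forecast probabilities and 0/1 outcomes and BS_ref is the same error for a reference forecast. The article does not say which reference forecast the researchers used, so the sketch below is only a minimal illustration under assumptions: the function names, the 50/50 baseline, and all numbers are made up and are not data from the study.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, ref_probs, outcomes):
    """1 - BS/BS_ref: 1.0 is perfect, 0.0 matches the reference, negative is worse."""
    bs_ref = brier_score(ref_probs, outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is already perfect; skill score undefined")
    return 1.0 - brier_score(probs, outcomes) / bs_ref


# Made-up numbers for illustration only (assumption, not from the study).
outcomes = [1, 0]        # what actually happened on two binary questions
model = [0.9, 0.1]       # hypothetical model forecasts
reference = [0.5, 0.5]   # hypothetical uninformed 50/50 baseline
bss = brier_skill_score(model, reference, outcomes)
print(round(bss, 2))
```

With these invented inputs the model's Brier score is 0.01 against 0.25 for the baseline, giving a skill score of about 0.96. A score near 0.90 therefore means the forecast removed roughly 90% of the reference forecast's squared error, whatever that reference was.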

However, the model's performance was inconsistent. It was “smoked” on the UK general election market and the Grammy Awards market, struggling with events that involve cultural nuance or multi-factor shifts. The authors note that the results are impressive, since an AI with no real-time data access outperformed a market of thousands of human traders, but that the gap between today's capabilities and a reliable universal forecaster remains significant. Still, the rapid pace of improvement raises the possibility that by 2027 AI systems could consistently predict major political and economic events, potentially transforming fields from finance to intelligence analysis.

Key Points
  • Max Planck Institute's FutureSim tests AI forecasting by replaying web slices without live data.
  • GPT-5.5 in Codex achieved a 0.90 Brier skill score on the Super Bowl LX market ($704M in trading volume) and the Portugal presidential runoff.
  • The model failed on the UK general election and Grammy Awards markets, highlighting uneven progress toward reliable forecasting.

Why It Matters

AI beating real-money prediction markets could disrupt finance, governance, and intelligence – but consistency is still years away.