Agent Frameworks

Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents

New benchmark uses Polymarket prediction markets and smart contracts to evaluate AI forecasting accuracy beyond static, contamination-prone datasets.

Deep Dive

Foresight Arena tackles the problem of evaluating AI forecasting agents without overfitting or centralized bias. Static datasets are vulnerable to data contamination, and trading PnL conflates timing and risk-taking with accuracy; this benchmark instead uses real-world binary prediction markets on Polymarket. Agents submit probabilistic forecasts through a commit-reveal protocol enforced by Solidity smart contracts on the Polygon PoS chain, with outcomes resolved trustlessly via the Gnosis Conditional Token Framework. Performance is measured by the proper Brier Score, which rewards honest probability reporting, and a novel Alpha Score, which isolates the agent's predictive edge over market consensus.
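The commit-reveal step keeps agents from reading or copying each other's forecasts before the deadline. A minimal off-chain sketch of the same hash-commitment idea in Python (the real benchmark enforces this in Solidity smart contracts; the function names and encoding here are illustrative):

```python
import hashlib
import secrets

def commit(probability: float, salt: bytes) -> str:
    """Hash the forecast together with a random salt so the forecast
    cannot be read (or front-run) before the reveal phase."""
    payload = f"{probability:.6f}".encode() + salt
    return hashlib.sha256(payload).hexdigest()

def reveal(probability: float, salt: bytes, commitment: str) -> bool:
    """Check that the revealed forecast matches the earlier commitment."""
    return commit(probability, salt) == commitment

# An agent commits to a 62% probability before the deadline...
salt = secrets.token_bytes(32)
c = commit(0.62, salt)
# ...and later reveals; the verifier recomputes the hash.
assert reveal(0.62, salt, c)       # honest reveal accepted
assert not reveal(0.70, salt, c)   # altered forecast rejected
```

The salt prevents a brute-force attack: without it, anyone could hash all plausible probabilities and recover the hidden forecast from the commitment alone.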

A 50-round live evaluation of five frontier LLMs plus a random baseline yielded concrete power requirements: detecting a true edge of α* = 0.02 at 80% power demands about 350 resolved binary predictions (50 rounds of 7 markets), while α* = 0.01 requires four times as many. Murphy decomposition (the classical Brier decomposition into reliability, resolution, and uncertainty) distinguished well-calibrated agents from mere market-tracking agents, which fail through reduced resolution. The formal analysis provides a closed-form variance for the per-market Alpha and a power-analysis characterization, establishing a rigorous framework for comparing AI forecasters. All smart contracts and evaluation infrastructure are open-source.
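The Murphy decomposition mentioned above splits the Brier Score into reliability minus resolution plus uncertainty; a market-tracking agent looks calibrated but loses because its resolution term shrinks. A minimal binned sketch in Python (the function name and 10-bin scheme are illustrative, not the paper's code):

```python
def murphy_decomposition(forecasts, outcomes, n_bins=10):
    """Classical Brier decomposition over binned forecasts:
    Brier ~= reliability - resolution + uncertainty.
    forecasts: probabilities in [0, 1]; outcomes: 0/1 resolutions."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    uncertainty = base_rate * (1 - base_rate)

    # Group forecasts into equal-width probability bins.
    bins = {}
    for f, o in zip(forecasts, outcomes):
        k = min(int(f * n_bins), n_bins - 1)
        bins.setdefault(k, []).append((f, o))

    reliability = resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        f_k = sum(f for f, _ in members) / n_k  # mean forecast in bin
        o_k = sum(o for _, o in members) / n_k  # observed frequency in bin
        reliability += n_k * (f_k - o_k) ** 2   # miscalibration penalty
        resolution += n_k * (o_k - base_rate) ** 2  # discrimination reward
    return reliability / n, resolution / n, uncertainty

# A perfectly sharp, perfectly calibrated forecaster: zero reliability
# penalty, and resolution fully offsets the base-rate uncertainty.
rel, res, unc = murphy_decomposition([0.0, 1.0, 0.0, 1.0], [0, 1, 0, 1])
assert rel == 0.0 and abs(res - unc) < 1e-12
```

With binning the identity is approximate (exact only when every forecast in a bin equals the bin's mean forecast), which is the standard caveat for this decomposition.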

Key Points
  • First permissionless on-chain benchmark for AI forecasting using real Polymarket prediction markets and Solidity smart contracts on Polygon PoS.
  • Uses Brier Score and novel Alpha Score to isolate predictive edge over market consensus, resisting overfitting and centralized bias.
  • Detecting a 2% predictive edge requires ~350 resolved binary predictions (50 rounds of 7 markets); open-source code and infrastructure available.
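The sample-size figures above follow the familiar 1/α*² scaling of a one-sided z-test for a mean edge. A minimal sketch, where `sigma` is a hypothetical placeholder for the per-market Alpha standard deviation (the paper derives a closed form for this variance; the value used here is only illustrative):

```python
from statistics import NormalDist

def required_n(alpha_star: float, sigma: float,
               power: float = 0.80, sig_level: float = 0.05) -> float:
    """Approximate sample size for a one-sided z-test to detect a mean
    edge alpha_star: n = ((z_{1-sig} + z_{power}) * sigma / alpha_star)^2."""
    z = NormalDist().inv_cdf
    return ((z(1 - sig_level) + z(power)) * sigma / alpha_star) ** 2

sigma = 0.15  # hypothetical per-market Alpha std. dev., not the paper's value
n_2pct = required_n(0.02, sigma)
n_1pct = required_n(0.01, sigma)
# Halving the detectable edge quadruples the required predictions,
# whatever sigma is: n scales as 1 / alpha_star**2.
assert abs(n_1pct / n_2pct - 4.0) < 1e-9
```

With this placeholder sigma the formula happens to land near the ~350 predictions quoted for a 2% edge, but the key property is the ratio: the four-fold increase for α* = 0.01 holds regardless of the true variance.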

Why It Matters

Brings trustless, overfitting-resistant evaluation to AI forecasting, enabling reliable comparison of agents' probabilistic predictions on real-world events.