Research & Papers

Evaluating Strategic Reasoning in Forecasting Agents

BTF-2 benchmark uncovers AI blind spots in assessing political incentives and black swans.

Deep Dive

A new paper from Tom Liptay and colleagues introduces Bench to the Future 2 (BTF-2), a benchmark of 1,417 pastcasting questions paired with a frozen 15-million-document research corpus. This setup allows AI agents to research and forecast offline with full reasoning traces, enabling reproducible evaluation without hindsight bias. BTF-2 can detect accuracy differences as small as 0.004 on the Brier score, and can tease apart whether an agent's strength lies in research or judgment.

The researchers built an ensemble forecaster that outperforms any single frontier agent by 0.011 Brier and used it to analyze strategic reasoning. Its edge, they found, comes from pre-mortem analysis of its own blind spots and explicit consideration of black swan events. Expert human forecasters identified the dominant failures of frontier agents: poor assessment of political and business leaders' incentives, inability to judge follow-through on stated plans, and weak modeling of institutional processes. This work shifts AI forecasting from accuracy leaderboards to understanding why models succeed or fail.
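For readers unfamiliar with the metric behind these margins: the Brier score is the mean squared error between probabilistic forecasts and binary outcomes, so lower is better and differences of 0.004 or 0.011 are small but meaningful shifts in calibration. A minimal sketch, with made-up forecasts and outcomes (not data from the paper):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; a perfect forecaster scores 0.0, and always
    predicting 0.5 scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Two hypothetical agents scored on the same five resolved questions.
outcomes = [1, 0, 1, 1, 0]
agent_a = [0.80, 0.20, 0.70, 0.90, 0.30]
agent_b = [0.75, 0.25, 0.65, 0.85, 0.35]

print(round(brier_score(agent_a, outcomes), 4))  # 0.054
print(round(brier_score(agent_b, outcomes), 4))  # 0.0785
```

Here agent A beats agent B by about 0.025 Brier; BTF-2's claimed resolution of 0.004 means it can separate agents whose gap is roughly six times smaller than this toy example.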

Key Points
  • BTF-2 benchmark includes 1,417 pastcasting questions with a frozen 15M-document corpus for reproducible offline research and forecasting.
  • The benchmark detects accuracy differences as small as 0.004 Brier score and distinguishes research vs. judgment strengths.
  • Better forecasting agents excel at pre-mortem blind spot analysis and black swan consideration, not just data processing.
  • Expert humans found frontier agents fail on assessing leader incentives, follow-through likelihood, and institutional processes.
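The pre-mortem idea above can be pictured as a second pass over an initial forecast: enumerate ways the forecast could be wrong (blind spots, black swans) and pull the probability toward uncertainty accordingly. The paper's actual agent design is not reproduced here; the function below is a purely illustrative heuristic with invented names and a made-up penalty constant:

```python
def premortem_revise(initial_prob, blind_spots, black_swans):
    """Illustrative pre-mortem adjustment (NOT the paper's method).

    Shrinks an initial probability toward 0.5 in proportion to the
    number of identified failure modes, capped at full shrinkage.
    """
    penalty = 0.02 * (len(blind_spots) + len(black_swans))
    return initial_prob + (0.5 - initial_prob) * min(penalty, 1.0)

# A confident 0.90 forecast, softened after two identified risks.
revised = premortem_revise(
    initial_prob=0.90,
    blind_spots=["leader incentives misread"],
    black_swans=["sudden resignation"],
)
print(round(revised, 3))  # 0.884
```

In a real agent the failure modes would come from the model critiquing its own reasoning trace rather than from hand-written lists; the sketch only shows why enumerating them systematically nudges overconfident forecasts back toward calibration.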

Why It Matters

Shifts AI forecasting from leaderboard chasing to understanding why models succeed or fail strategically.