Research & Papers

Evaluating Strategic Reasoning in Forecasting Agents

BTF-2 benchmark uncovers AI blind spots in assessing political incentives and black swans.

Deep Dive

A new paper from Tom Liptay and colleagues introduces Bench to the Future 2 (BTF-2), a benchmark of 1,417 pastcasting questions paired with a frozen 15-million-document research corpus. This setup allows AI agents to research and forecast offline with full reasoning traces, enabling reproducible evaluation without hindsight bias. BTF-2 can detect accuracy differences as small as 0.004 on the Brier score, and can tease apart whether an agent's strength lies in research or judgment.

The researchers built an ensemble forecaster that outperforms any single frontier agent by 0.011 Brier and used it to analyze strategic reasoning. Its edge, they found, comes from pre-mortem analysis of its own blind spots and explicit consideration of black swan events. Expert human forecasters identified the dominant failures of frontier agents: poor assessment of political and business leaders' incentives, inability to judge follow-through on stated plans, and weak modeling of institutional processes. This work shifts AI forecasting from accuracy leaderboards to understanding why models succeed or fail.
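For readers unfamiliar with the metric behind these margins: the Brier score is the mean squared error between probabilistic forecasts and binary outcomes, so lower is better and differences of 0.004 or 0.011 are small but meaningful shifts in calibration. A minimal sketch, with made-up forecasts and outcomes (not data from the paper):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; a perfect forecaster scores 0.0, and always
    predicting 0.5 scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Two hypothetical agents scored on the same five resolved questions.
outcomes = [1, 0, 1, 1, 0]
agent_a = [0.80, 0.20, 0.70, 0.90, 0.30]
agent_b = [0.75, 0.25, 0.65, 0.85, 0.35]

print(round(brier_score(agent_a, outcomes), 4))  # 0.054
print(round(brier_score(agent_b, outcomes), 4))  # 0.0785
```

Here agent A beats agent B by about 0.025 Brier; BTF-2's claimed resolution of 0.004 means it can separate agents whose gap is roughly six times smaller than this toy example.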

Key Points
  • BTF-2 benchmark includes 1,417 pastcasting questions with a frozen 15M-document corpus for reproducible offline research and forecasting.
  • The benchmark detects accuracy differences as small as 0.004 Brier score and distinguishes research vs. judgment strengths.
  • Better forecasting agents excel at pre-mortem blind spot analysis and black swan consideration, not just data processing.
  • Expert humans found frontier agents fail on assessing leader incentives, follow-through likelihood, and institutional processes.
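The pre-mortem idea above can be pictured as a second pass over an initial forecast: enumerate ways the forecast could be wrong (blind spots, black swans) and pull the probability toward uncertainty accordingly. The paper's actual agent design is not reproduced here; the function below is a purely illustrative heuristic with invented names and a made-up penalty constant:

```python
def premortem_revise(initial_prob, blind_spots, black_swans):
    """Illustrative pre-mortem adjustment (NOT the paper's method).

    Shrinks an initial probability toward 0.5 in proportion to the
    number of identified failure modes, capped at full shrinkage.
    """
    penalty = 0.02 * (len(blind_spots) + len(black_swans))
    return initial_prob + (0.5 - initial_prob) * min(penalty, 1.0)

# A confident 0.90 forecast, softened after two identified risks.
revised = premortem_revise(
    initial_prob=0.90,
    blind_spots=["leader incentives misread"],
    black_swans=["sudden resignation"],
)
print(round(revised, 3))  # 0.884
```

In a real agent the failure modes would come from the model critiquing its own reasoning trace rather than from hand-written lists; the sketch only shows why enumerating them systematically nudges overconfident forecasts back toward calibration.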

Why It Matters

Shifts AI forecasting from leaderboard chasing to understanding why models succeed or fail strategically.