The Oracle's Fingerprint: Correlated AI Forecasting Errors and the Limits of Bias Transmission
Top LLMs make identical mistakes on 568 predictions—collective intelligence may be an illusion.
A new study by Theodor Spiro, published on arXiv, tests whether large language models (LLMs) form an 'epistemic monoculture' in which individual model errors are no longer independent, undermining the statistical foundation of collective intelligence. In Study 1, the author evaluated GPT-4o, Claude, and Gemini on 568 resolved binary prediction questions and found a mean pairwise error correlation of r = 0.77 (p < 0.001), which held at r = 0.78 even after excluding questions that may have leaked into training data. In other words, three independently developed frontier models share strikingly similar failure modes.
Study 2 examined whether this correlated bias has propagated into human crowd forecasts, using a within-question design that tracked community prediction shifts across the ChatGPT launch in November 2022. Community forecasts did move in the direction predicted by LLMs (r = 0.20, p = 0.007), but the shift was fully explained by rational updating toward ground truth, not by AI influence. Study 3 examined category-level patterns of human forecasting error and found that pre-ChatGPT human biases already strongly resembled the LLM bias fingerprint (r = 0.87); surprisingly, the resemblance weakened post-ChatGPT (r = -0.28). Together, these findings point to an epistemic monoculture that is 'built but not yet activated': AI systems mirror the very biases humans already hold, setting up a potential feedback loop as reliance on AI forecasts grows.
- Three major LLMs (GPT-4o, Claude, Gemini) show a mean error correlation of r = 0.77 on 568 binary forecasting questions, indicating strongly overlapping failure modes.
- Human crowd forecasts shifted in LLM-predicted directions after ChatGPT's launch, but the shift was entirely explained by rational updating, not AI bias transmission.
- Pre-ChatGPT human biases already matched the LLM pattern (r = 0.87), while post-ChatGPT the resemblance weakened (r = -0.28), suggesting AI mirrors existing human blind spots.
Why It Matters
Shared AI blind spots could undermine ensemble forecasting and amplify systemic errors, making AI forecasts less reliable for high-stakes decisions.