Research & Papers

LLMs overconfident on hard tasks, new study finds

Like humans, AI models are too sure of wrong answers on difficult questions.

Deep Dive

A new preprint on arXiv (arXiv:2605.23909) from Noam Michael and colleagues examines how well the confidence reported by large language models (LLMs) matches their actual accuracy. In a preregistered experiment, the team found that current LLMs are, on average, too sure of themselves: their reported confidence systematically exceeds their true accuracy. The pattern depends heavily on task difficulty, however. On difficult tests, models are strongly overconfident, much like humans. On easy tests, they are underconfident and underestimate their own performance.

The researchers introduced LifeEval, a benchmark designed to evaluate calibration across a spectrum of difficulty levels. Reporting calibration per difficulty level gives a more nuanced picture than a single average score, which can hide overconfidence on hard items and underconfidence on easy ones. The hard-easy effect documented in humans appears to be replicated in LLMs, suggesting that even advanced AI systems share this bias. The findings have practical implications: when deploying LLMs for high-stakes decisions, users cannot rely on the model’s stated confidence as a trustworthy indicator of correctness, especially on hard problems.
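
To make the idea of calibration by difficulty concrete, here is a minimal sketch in Python. It is not the paper's method or LifeEval's scoring code. The function name, the record layout (difficulty label, stated confidence, correctness), and the data are all assumptions made up for illustration. The gap is mean confidence minus accuracy, so a positive value means overconfidence and a negative value means underconfidence.

```python
from collections import defaultdict


def overconfidence_by_difficulty(records):
    """Return {difficulty: mean confidence - accuracy} per difficulty label.

    Each record is a hypothetical (difficulty, confidence, correct) tuple,
    with confidence in [0, 1] and correct as a bool.
    """
    groups = defaultdict(list)
    for difficulty, confidence, correct in records:
        groups[difficulty].append((confidence, correct))

    gaps = {}
    for difficulty, items in groups.items():
        mean_confidence = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Positive gap: overconfident. Negative gap: underconfident.
        gaps[difficulty] = mean_confidence - accuracy
    return gaps


# Made-up illustrative data, NOT results from the study.
records = [
    ("easy", 0.75, True), ("easy", 0.75, True),
    ("easy", 0.75, True), ("easy", 0.75, True),
    ("hard", 0.75, True), ("hard", 0.75, False),
    ("hard", 0.75, False), ("hard", 0.75, False),
]
gaps = overconfidence_by_difficulty(records)
```

In this toy data the model states the same confidence everywhere, so it comes out underconfident on the easy items (gap below zero) and overconfident on the hard ones (gap above zero). Pooling all eight items into one average would blur the two effects together.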

Key Points
  • LLMs show overall overconfidence: average confidence exceeds average accuracy.
  • A strong hard-easy effect: overconfidence on difficult tasks, underconfidence on easy ones.
  • Researchers developed LifeEval, a new benchmark to measure calibration across difficulty levels.

Why It Matters

The study highlights a reliability flaw in LLMs used for critical applications and urges caution about trusting their stated confidence on hard tasks.