The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration
Kimi K2 was 23.3% accurate but wildly overconfident, while Claude Haiku 4.5 showed much better self-awareness.
A new empirical study by researchers Sudipta Ghosh and Mrityunjoy Panday reveals that large language models (LLMs) can exhibit a pattern strikingly similar to the human cognitive bias known as the Dunning-Kruger effect. The research, published on arXiv, evaluated four state-of-the-art models—Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2—across four benchmark datasets, totaling 24,000 experimental trials. The core finding is that models with lower competence tend to be dramatically overconfident in their incorrect answers, while more capable models are better at calibrating their confidence.
The results show a stark contrast in confidence calibration, measured by Expected Calibration Error (ECE): the average gap between a model's stated confidence and its observed accuracy, where 0 indicates perfect calibration. Kimi K2 exhibited the most severe overconfidence, with an ECE of 0.726 despite achieving only 23.3% accuracy on the tasks. In contrast, Claude Haiku 4.5 demonstrated the best self-awareness, with an ECE of 0.122 and a much higher accuracy of 75.4%. This pattern, in which the least accurate model was the most confident in its wrong answers, directly mirrors the Dunning-Kruger effect observed in human psychology.
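Since ECE drives the study's headline numbers, it helps to see how the metric is typically computed. The authors' evaluation code is not reproduced here; the sketch below is a standard binned ECE in Python, where the function name, the 10-bin default, and the toy data are illustrative assumptions rather than the study's implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the weighted average gap between mean stated
    confidence and empirical accuracy within each confidence bin.

    confidences : array of the model's stated confidence in [0, 1]
    correct     : boolean array, True where the answer was right
    n_bins      : number of equal-width bins (10 is a common default;
                  the paper's exact binning is an assumption here)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        weight = in_bin.mean()                 # fraction of samples in this bin
        avg_conf = confidences[in_bin].mean()  # mean stated confidence
        accuracy = correct[in_bin].mean()      # empirical accuracy in the bin
        ece += weight * abs(avg_conf - accuracy)
    return ece

# Hypothetical toy data: a model that claims ~90% confidence but is
# right only ~25% of the time produces a large ECE.
rng = np.random.default_rng(0)
conf = rng.uniform(0.85, 0.95, size=1000)
hits = rng.random(1000) < 0.25
print(f"ECE = {expected_calibration_error(conf, hits):.3f}")  # roughly 0.65
```

On this toy data the model claims roughly 90% confidence while answering about 25% correctly, so the weighted per-bin gaps sum to an ECE near 0.65, the same kind of confidence-accuracy gap the study attributes to Kimi K2.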
The study's implications are significant for the safe deployment of AI, particularly in high-stakes applications like healthcare, legal analysis, or autonomous systems. An overconfident model that cannot accurately signal its uncertainty is a major reliability and safety risk. The findings underscore that raw performance metrics are insufficient; developers must also rigorously evaluate and improve a model's calibration—its ability to know what it doesn't know—before real-world deployment.
Key Findings
- Kimi K2 showed severe miscalibration, with an ECE of 0.726 despite only 23.3% accuracy.
- Claude Haiku 4.5 achieved the best balance with a low ECE of 0.122 and high accuracy of 75.4%.
- The study analyzed 24,000 trials across four models, revealing a clear pattern where poorer performance correlated with higher overconfidence.
Why It Matters
Overconfident AI is dangerous in critical fields; this research makes the case for mandatory confidence calibration testing alongside standard accuracy benchmarks.