Research & Papers

New study reveals LLMs' metacognitive blind spots across 33 models

arXiv cs.CL May 11, 2026

⚡Applied/Professional domains easiest to monitor; Formal Reasoning hardest.

Deep Dive

A comprehensive new study by Jon-Paul Cacioli, released on arXiv, provides the first large-scale atlas of metacognitive monitoring in frontier LLMs (how well the models know what they know) across different knowledge domains. Testing 33 models from eight families (including Anthropic, Google Gemini, Qwen, DeepSeek, OpenAI, and others) on 1,500 MMLU items across six domains, the paper computes a Type-2 AUROC score for each model-domain combination from verbalized confidence ratings (0-100). The key finding: every model that shows above-chance aggregate monitoring also exhibits non-trivial variation at the domain level.

Applied/Professional knowledge consistently yields the best monitoring (mean AUROC .742, top-2 in 21 of 33 models), while Formal Reasoning and Natural Science consistently rank worst (one of the two is in the bottom two for 27 of 33 models). Three 'middle' domains (e.g., Humanities) are statistically indistinguishable from one another.

The study also reveals family-level clustering: Anthropic, Google-Gemini, and Qwen models show significant profile-shape similarity within families, while DeepSeek, Google-Gemma, and OpenAI models do not. Notably, Gemma 4 31B improves on Gemma 3 27B by +.202 AUROC, showing rapid gains between iterations. Three models that failed binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, suggesting that those failures were specific to the probe format. The paper concludes that aggregate metrics obscure stable variation across benchmark domains, and it recommends domain screening at the benchmark stage before deployment in specific application areas.
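To make the metric concrete, here is a minimal Python sketch of a per-domain Type-2 AUROC computed from verbalized confidence. It assumes the standard pairwise definition (the probability that a correctly answered item received higher confidence than an incorrectly answered one, with ties counted as half); the summary does not describe the paper's exact procedure. The function names `type2_auroc` and `auroc_by_domain` and the toy records are hypothetical and made up for illustration, not data from the study.

```python
from collections import defaultdict


def type2_auroc(confidences, correct):
    """Pairwise Type-2 AUROC: P(conf on a correct item > conf on an incorrect item).

    Ties contribute 0.5. Returns None when the score is undefined,
    i.e. when every answer is correct or every answer is wrong.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_by_domain(records):
    """Group (domain, confidence, is_correct) records and score each domain."""
    grouped = defaultdict(lambda: ([], []))
    for domain, conf, ok in records:
        grouped[domain][0].append(conf)
        grouped[domain][1].append(ok)
    return {d: type2_auroc(confs, oks) for d, (confs, oks) in grouped.items()}


# Toy, made-up records for one hypothetical model: (domain, confidence 0-100, correct?)
records = [
    ("Applied/Professional", 90, True),
    ("Applied/Professional", 80, True),
    ("Applied/Professional", 40, False),
    ("Applied/Professional", 85, False),
    ("Formal Reasoning", 70, True),
    ("Formal Reasoning", 70, False),
    ("Formal Reasoning", 95, False),
    ("Formal Reasoning", 60, True),
]

profile = auroc_by_domain(records)
for domain, score in profile.items():
    print(domain, score)
```

A score of .5 corresponds to chance-level monitoring, and values below .5 mean the model tends to be more confident when it is wrong. Under this definition, a single aggregate score pooled over all items could hide exactly the kind of domain-level spread the study reports.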

Key Points

33 frontier LLMs from 8 families tested on 1,500 MMLU items across 6 domains using verbalized confidence (0-100).
Applied/Professional knowledge easiest to monitor (mean AUROC .742); Formal Reasoning & Natural Science hardest (bottom-2 in 27/33 models).
Within-family clustering significant for Anthropic, Gemini, and Qwen (p<.0001) but not for DeepSeek, Gemma, or OpenAI.

Why It Matters

Domain-aware monitoring metrics are critical for deploying LLMs safely in specialized fields like medicine or law.
