Research & Papers

Confidence Estimation in Automatic Short Answer Grading with LLMs

Researchers combine model signals and data uncertainty for safer AI grading.

Deep Dive

A new study tackles a critical gap in AI-assisted education: LLMs like GPT-4 can grade short answers without fine-tuning, but their grading isn't perfect. To decide when to trust the AI versus escalate to a human, reliable confidence estimates are essential. Researchers from multiple institutions systematically compared three model-based confidence methods: having the LLM verbalize its confidence, probing its latent representations, and measuring consistency across repeated outputs. They found that none of these alone reliably captured uncertainty.
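
The consistency-based signal is the easiest of the three to picture: sample a stochastic LLM grader several times and treat the majority grade's agreement rate as confidence. Below is a minimal sketch of that idea; the grade_fn callable, its arguments, and the sample count are illustrative assumptions, not the paper's exact setup.

```python
from collections import Counter
from typing import Callable, List, Tuple

def consistency_confidence(grade_fn: Callable[[str, str], str],
                           question: str,
                           answer: str,
                           n_samples: int = 10) -> Tuple[str, float]:
    """Consistency-based confidence: repeatedly query a stochastic LLM grader
    (e.g., temperature > 0) and use the majority grade's agreement rate.

    grade_fn is a hypothetical wrapper around whatever LLM grading prompt you
    use; it takes (question, student_answer) and returns a discrete grade label.
    """
    grades: List[str] = [grade_fn(question, answer) for _ in range(n_samples)]
    grade, count = Counter(grades).most_common(1)[0]
    return grade, count / n_samples  # majority grade and its agreement rate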

To fix this, they introduced a hybrid framework that augments model-based signals with dataset-derived aleatoric uncertainty—the inherent ambiguity in student responses. They cluster semantically similar answers via embeddings, then measure within-cluster grading variability. This hybrid approach yielded more accurate confidence estimates and significantly improved selective grading performance (knowing when to defer to a human). The work paves the way for trustworthy AI assessment systems that know their limits.
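
A rough sketch of the dataset-derived component, under assumptions about the data: answers are embedded with any sentence encoder, clustered with k-means, and each response inherits the entropy of reference grades within its cluster as an aleatoric uncertainty score. The cluster count and the entropy-based variability measure are illustrative choices, and the way this score is fused with the model-based signals is only summarized above, not reimplemented here.

```python
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def cluster_aleatoric_uncertainty(embeddings: np.ndarray,
                                  grades: list,
                                  n_clusters: int = 20) -> np.ndarray:
    """Per-response aleatoric uncertainty from grading variability among
    semantically similar answers.

    embeddings: (n, d) array of answer embeddings from any sentence encoder.
    grades: n reference grade labels for the same answers.
    Returns one uncertainty score per response (within-cluster grade entropy).
    """
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=0).fit_predict(embeddings)
    uncertainty = np.zeros(len(grades))
    for c in range(n_clusters):
        idx = np.where(cluster_ids == c)[0]
        if idx.size == 0:
            continue
        counts = np.array(list(Counter(grades[i] for i in idx).values()),
                          dtype=float)
        probs = counts / counts.sum()
        # High entropy = graders (or reference labels) disagree on similar
        # answers, i.e., the responses themselves are ambiguous.
        uncertainty[idx] = -(probs * np.log(probs)).sum()
    return uncertainty
```

One plausible way to use this in a hybrid score, consistent with the description above though not necessarily the paper's exact formula, is to shrink the model-based confidence for responses that fall in high-entropy clusters.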

Key Points
  • Compared three confidence estimation strategies: verbalized, latent, and consistency-based
  • Proposed hybrid framework that adds dataset-derived aleatoric uncertainty via clustering student responses
  • Hybrid method outperforms single-source approaches in selective grading tasks (sketched below)
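
For context on the selective grading evaluation named above, here is a standard risk-coverage-style sketch, not necessarily the paper's exact metric: auto-grade only the most confident fraction of responses, measure accuracy on that subset, and defer the rest to human graders.

```python
import numpy as np

def selective_accuracy(confidences: np.ndarray,
                       correct: np.ndarray,
                       coverage: float = 0.8) -> float:
    """Accuracy when the AI grades only its top `coverage` fraction of
    responses by confidence; the remainder is deferred to a human grader.

    confidences: per-response confidence scores (higher = more trusted).
    correct: boolean array, whether the AI grade matched the reference grade.
    """
    n_keep = int(round(coverage * len(confidences)))
    keep = np.argsort(-confidences)[:n_keep]  # most confident responses first
    return float(correct[keep].mean())
```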

Why It Matters

Makes AI grading systems more trustworthy by equipping them to know when to defer to human graders.