AI Safety

Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus

New research finds that AI models are poor judges of how hard exam questions really are.

Deep Dive

A study tested ten major AI models on more than 1,000 questions from Brazil's national ENEM exam. The models judged question difficulty poorly, tending to underestimate it, and struggled most with image-based questions. They also could not reliably adjust difficulty estimates for students from different backgrounds. The findings suggest AI should be used to screen questions for human review, not act as the final authority in test design.
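To make the evaluation concrete: one common way to score difficulty estimates like these is to compare a model's predicted difficulty for each item against the empirical difficulty (e.g., the share of test-takers who missed it), reporting both correlation (does the model rank items correctly?) and mean bias (does it systematically underestimate?). The sketch below is illustrative only; the data values are invented and this is not the study's actual methodology or dataset.

```python
# Hedged sketch: scoring hypothetical LLM difficulty estimates against
# empirical item difficulty. All numbers are made up for illustration.

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Empirical difficulty per item: fraction of students who answered wrong.
empirical = [0.35, 0.60, 0.80, 0.45, 0.70]
# Hypothetical model-predicted difficulty on the same 0-1 scale.
predicted = [0.30, 0.40, 0.50, 0.40, 0.45]

r = pearson(empirical, predicted)
# Negative mean bias = the model systematically underestimates difficulty,
# the failure mode the study reports.
bias = sum(p - e for p, e in zip(predicted, empirical)) / len(empirical)
print(f"correlation={r:.2f}, mean bias={bias:.2f}")
```

A model can rank items roughly correctly (high correlation) while still underestimating how hard they are (negative bias), which is why both numbers matter when deciding whether AI estimates are safe to use without human review.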

Why It Matters

This highlights a critical safety gap: AI is increasingly used to generate educational and assessment materials, yet it cannot reliably judge how difficult those materials are.