AI Safety

AI models struggle to accurately judge the difficulty of test questions they create.

New research reveals AI is a poor judge of how hard its own exam questions are.

Deep Dive

A study tested ten major AI models on more than 1,000 questions from Brazil's national ENEM exam. The models judged question difficulty poorly: they often underestimated it and struggled with image-based questions. They also could not reliably adjust their difficulty estimates for students from different backgrounds. The findings suggest AI is better suited to screening questions for human review than to acting as the final authority in test design.

Why It Matters

The findings point to a critical safety gap as AI is increasingly used to generate educational and assessment materials.
