Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks
New research finds that ChatGPT, Claude, and even education-specific AI tools struggle to judge the cognitive demand of math tasks.
A new study led by researchers from the University of Pittsburgh and other institutions provides a critical baseline for AI's role in education, specifically in evaluating the cognitive rigor of math tasks. The paper, "Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks," tested eleven prominent AI models: six general-purpose tools, including OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini, and five education-specific platforms, including Khan Academy's Khanmigo and Magic School AI. The goal was to determine whether AI could reliably categorize tasks into four levels of cognitive demand, a key skill for teachers adapting curricula. The results were sobering: on average, the tools were accurate only 63% of the time, and no single model surpassed 83% accuracy, revealing a significant performance gap for practical classroom use.
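The evaluation itself amounts to a labeled classification benchmark, which is straightforward to reproduce in principle. Below is a minimal sketch, not the authors' code: the prompt wording and the `query_model` stub are illustrative assumptions, and the two middle level names follow Smith and Stein's widely used cognitive-demand framework, which the endpoints named in the paper suggest but this summary does not spell out.

```python
# Minimal sketch of a cognitive-demand classification benchmark.
# NOT the study's code: prompt wording and the model call are placeholders.

LEVELS = [
    "Memorization",                    # lowest demand
    "Procedures Without Connections",  # middle category (assumed framework)
    "Procedures With Connections",     # middle category (assumed framework)
    "Doing Mathematics",               # highest demand
]

PROMPT = (
    "Classify the cognitive demand of this math task as exactly one of: "
    + ", ".join(LEVELS)
    + ".\nTask: {task}\nAnswer with the level name only."
)

def query_model(prompt: str) -> str:
    """Placeholder: wire this to a real chat-model client (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def evaluate(tasks: list[tuple[str, str]]) -> float:
    """tasks: (task_text, expert_label) pairs; returns the fraction labeled correctly."""
    correct = 0
    for text, expert_label in tasks:
        reply = query_model(PROMPT.format(task=text)).strip()
        # Tolerate chatty replies by checking whether the expected label appears.
        if expert_label.lower() in reply.lower():
            correct += 1
    return correct / len(tasks)
```

Run against a set of expert-labeled tasks, the returned fraction corresponds to the accuracy figure the study reports per tool.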
The research uncovered systematic flaws in the AI's reasoning. All tools exhibited a clear bias: they consistently misclassified tasks at the highest (Doing Mathematics) and lowest (Memorization) cognitive levels, defaulting to the middle categories instead. Error analysis revealed that the models overweighted surface-level textual features of a task while failing to reason about the underlying cognitive processes it requires. Crucially, education-specific AI tools performed no better than their general-purpose counterparts, and all models generated persuasive, confident-sounding explanations for their incorrect classifications, a major concern for novice teachers who might trust these outputs. The findings underscore that current AI cannot yet replace expert judgment in lesson planning and highlight an urgent need for improved prompt engineering and specialized tool development for educational applications.
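The reported bias is exactly the pattern a per-level confusion analysis would surface. As a hedged illustration (invented label pairs, not the paper's data or analysis code), the sketch below computes per-level recall from (expert_label, model_label) pairs; a collapse toward the middle categories shows up as low recall at the two extremes.

```python
# Sketch: per-level recall from confusion counts, to expose a bias toward
# middle categories. All label pairs below are invented for illustration.
from collections import Counter

LEVELS = ["Memorization", "Procedures Without Connections",
          "Procedures With Connections", "Doing Mathematics"]

def per_level_recall(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """pairs: (expert_label, model_label) tuples. Returns, for each level,
    the fraction of tasks at that expert level the model labeled correctly."""
    counts = Counter(pairs)
    recall = {}
    for level in LEVELS:
        total = sum(n for (truth, _), n in counts.items() if truth == level)
        hits = counts[(level, level)]  # Counter returns 0 for missing keys
        recall[level] = hits / total if total else float("nan")
    return recall

# Invented example: both extremes get pulled toward the middle levels.
pairs = [
    ("Memorization", "Procedures Without Connections"),
    ("Memorization", "Memorization"),
    ("Doing Mathematics", "Procedures With Connections"),
    ("Doing Mathematics", "Procedures With Connections"),
    ("Procedures With Connections", "Procedures With Connections"),
]
print(per_level_recall(pairs))
# Memorization: 0.5, Doing Mathematics: 0.0, middle categories intact
```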
- Tested 11 AI tools (ChatGPT, Claude, Gemini, Khanmigo, etc.), which achieved only 63% average accuracy in classifying the cognitive demand of math tasks.
- Found a systematic bias: all AI models struggled with tasks at the highest and lowest cognitive-demand levels, defaulting to middle categories.
- Education-specific AI performed no better than general-purpose models, and all tools gave convincing explanations for wrong classifications, a risk for teachers who trust these outputs.
Why It Matters
Highlights a critical reliability gap for AI in education, cautioning against blind trust in AI for curriculum planning and teacher support.