Vision LLMs can grade handwritten math, 87% of errors from transcription

arXiv cs.CY May 20, 2026

⚡ New study shows LLMs accurately assess multi-step solutions, but image quality matters.

Deep Dive

A team from the University of Illinois (Levine et al.) tested vision-capable LLMs as automated graders for handwritten mathematics, extending a prior pipeline designed for typed responses. By integrating transcription and rubric-based evaluation into a single LLM call, they assessed photographic submissions from two university STEM courses against human-assigned ground truth at the rubric-item level. The work, presented at AIED 2026, aims to tackle the longstanding challenge of grading multi-step handwritten solutions at scale.
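
As a rough illustration of what a single-call design could look like, the sketch below builds one prompt that asks for both a transcription and per-rubric-item decisions, then parses a JSON reply. The prompt wording, the JSON schema, the rubric items, and the helper names `build_grading_prompt` and `parse_grading_reply` are all assumptions made for illustration, not the authors' pipeline. The model call itself is left out because it depends on the provider, so a hand-written string stands in for the reply.

```python
import json


def build_grading_prompt(rubric_items):
    """Build one prompt requesting a transcription and rubric decisions together."""
    lines = [
        "You are grading a photographed handwritten solution.",
        "First transcribe the work, then decide whether each rubric item applies.",
        'Reply with JSON: {"transcription": str, "items": {item_id: bool}}.',
        "Rubric items:",
    ]
    for item_id, description in rubric_items.items():
        lines.append(f"- {item_id}: {description}")
    return "\n".join(lines)


def parse_grading_reply(reply_text, rubric_items):
    """Parse the JSON reply and reject it if any rubric item is missing."""
    data = json.loads(reply_text)
    items = data.get("items", {})
    missing = [item_id for item_id in rubric_items if item_id not in items]
    if missing:
        raise ValueError(f"reply is missing rubric items: {missing}")
    decisions = {item_id: bool(items[item_id]) for item_id in rubric_items}
    return data["transcription"], decisions


# Hypothetical rubric and a hand-written stand-in for a model reply.
rubric = {
    "R1": "Sets up the integral correctly",
    "R2": "Evaluates the limits",
}
prompt = build_grading_prompt(rubric)
fake_reply = (
    '{"transcription": "x^2/2 from 0 to 1", '
    '"items": {"R1": true, "R2": false}}'
)
transcription, decisions = parse_grading_reply(fake_reply, rubric)
```

Returning the transcription alongside the decisions is a deliberate choice in this sketch: it lets a reviewer check what the model read before trusting how it applied the rubric.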

Results show high overall accuracy, but a critical insight emerged: in the best-performing configuration, 87% of model errors stemmed from transcription failures rather than from misapplying the rubric. The team categorized the error modes, which include poor image quality, hallucinated content, and incorrect handling of equivalent mathematical expressions. These findings suggest that vision LLMs hold promise for automating handwritten math grading, but system design must prioritize robust transcription, prompt refinement, and careful deployment so that transcription mistakes do not propagate into final grades.
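
To make the two measurements concrete, here is a minimal sketch of rubric-item-level agreement with human grades and of the share of errors attributed to each cause. The data is toy data invented for illustration, and the function names and the two cause labels are assumptions; none of it reproduces the study's figures or its error taxonomy.

```python
from collections import Counter


def rubric_item_accuracy(model, human):
    """Fraction of (submission, rubric item) pairs where the model matches the human."""
    pairs = [(sub, item) for sub in human for item in human[sub]]
    correct = sum(model[sub][item] == human[sub][item] for sub, item in pairs)
    return correct / len(pairs)


def error_share(error_causes):
    """Share of each labeled cause among the model's disagreements."""
    counts = Counter(error_causes)
    total = sum(counts.values())
    return {cause: n / total for cause, n in counts.items()}


# Toy ground truth and model output (hypothetical, not from the paper).
human = {
    "s1": {"R1": True, "R2": True},
    "s2": {"R1": False, "R2": True},
}
model = {
    "s1": {"R1": True, "R2": False},
    "s2": {"R1": False, "R2": True},
}
accuracy = rubric_item_accuracy(model, human)

# Toy cause labels, as a human reviewer might assign to each disagreement.
shares = error_share(["transcription", "transcription", "transcription", "rubric"])
```

Scoring per rubric item rather than per submission keeps one misread line from hiding correct decisions on the other items of the same answer.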

Key Points

Evaluated vision-capable LLMs on handwritten math from two university STEM courses using instructor-defined rubrics.
87% of model errors were caused by transcription failures rather than rubric misapplication.
Common error modes include image quality issues, hallucinated content, and mishandling equivalent expressions.

Why It Matters

Automated grading could save instructors hours, but transcription accuracy is the bottleneck.
