When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR
GPT-4o penalized for fixing student work; Gemini 2.5 Flash most faithful transcriber.
A new study from Seoul National University reveals a critical flaw in how Vision-Language Models (VLMs) handle handwritten math in educational settings: they often "fix" student errors instead of faithfully transcribing them. This over-correction hides the very mistakes educators need to detect. The researchers evaluated 15 state-of-the-art VLMs on the FERMAT dataset, finding that traditional lexical metrics like BLEU fail to capture this behavior. Their proposed metric, PINK (Penalized INK-based score), uses an LLM for rubric-based grading and explicitly penalizes over-correction, revealing significant ranking reversals: GPT-4o was heavily penalized for aggressive over-correction, while Gemini 2.5 Flash emerged as the most faithful transcriber.
In human expert studies, PINK aligned with expert judgment significantly better than BLEU (preferred in 55.0% of comparisons vs. 39.5%), offering a more reliable evaluation framework for multi-line handwritten math OCR. This work is the first systematic study of multi-line handwritten math OCR, addressing a gap in current benchmarks, which focus on single-line expressions. The findings have immediate implications for educational AI systems that depend on accurate transcription of student work, from automated grading to personalized tutoring. The paper is available on arXiv (2604.22774).
- VLMs like GPT-4o often over-correct student math errors, hiding mistakes instead of transcribing faithfully.
- PINK (Penalized INK-based score) uses an LLM for rubric-based grading and penalizes over-correction.
- Human experts preferred PINK over BLEU (55.0% vs. 39.5%), aligning better with human judgment.
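To make the idea of an over-correction penalty concrete, here is a minimal sketch of how a rubric grade could be combined with such a penalty. The paper's actual PINK formula is not reproduced here; the function name, the penalty weight, and the inputs (an LLM-produced rubric score plus counts of student errors and silently "fixed" errors) are all illustrative assumptions.

```python
# Hypothetical sketch of an over-correction-penalized transcription score.
# This is NOT the paper's PINK formula; names and the penalty weight are
# illustrative assumptions.

def penalized_score(rubric_score: float, n_overcorrected: int, n_errors: int,
                    penalty_weight: float = 0.5) -> float:
    """Combine an LLM rubric grade (0-1) with a penalty proportional to the
    fraction of student errors the model silently 'fixed' in transcription."""
    if n_errors == 0:
        return rubric_score  # no student errors, so nothing to over-correct
    overcorrection_rate = n_overcorrected / n_errors
    return max(0.0, rubric_score - penalty_weight * overcorrection_rate)

# A faithful transcriber keeps its full rubric grade...
print(penalized_score(0.9, n_overcorrected=0, n_errors=4))  # 0.9
# ...while one that 'fixed' half of the student's errors loses points.
print(penalized_score(0.9, n_overcorrected=2, n_errors=4))  # 0.65
```

The key design point this sketch illustrates is that two transcriptions with identical surface similarity to the reference can receive very different scores once faithfulness to the student's errors is weighted in, which is exactly what lexical metrics like BLEU miss.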
Why It Matters
When VLMs silently correct student errors during transcription, downstream AI tutoring and grading systems never see the mistakes they are meant to diagnose, undermining their core purpose.