AI Safety

Human-in-the-Loop LLM Grading for Handwritten Mathematics Assessments

Researchers combine LLMs with mandatory human verification to grade handwritten math tests at scale.

Deep Dive

A research team from KU Leuven has developed a scalable workflow that uses Large Language Models (LLMs) to assist in grading short, handwritten mathematics assessments. The system addresses a critical challenge in education: providing timely, individualized feedback on pen-and-paper work, a need that has grown more pressing as generative AI undermines the reliability of unsupervised, take-home assignments. Their end-to-end process involves constructing solution keys, developing detailed rubric-style grading guides for the LLM, and running a grading procedure that includes automated scanning, anonymization, multi-pass LLM scoring, automated consistency checks, and—crucially—mandatory human verification.

The team deployed this hybrid system in two undergraduate mathematics courses, using it to grade six low-stakes, in-class tests. The empirical results are promising: LLM assistance reduced overall grading time by approximately 23%. Importantly, grading agreement under LLM assistance was comparable to, and in some cases even tighter than, that achieved with fully manual grading. Occasional model errors did occur, but the human-in-the-loop design effectively contained them. The study demonstrates that carefully embedded AI assistance can substantially reduce instructor workload without sacrificing the fairness and accuracy essential for academic assessment, presenting a viable model for scaling personalized feedback.

Key Points
  • The system reduced grading time by ~23% in real-world tests across two undergraduate math courses.
  • It uses a multi-pass LLM scoring process guided by detailed rubrics, followed by mandatory human verification to catch errors.
  • The workflow is designed for scalability, from scanning and anonymizing papers to automated consistency checks, addressing the shift back to supervised in-class assessments.
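The multi-pass scoring and consistency-check stage described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the `scorer` callable stands in for an LLM grading call, and all function names, the pass count, and the tolerance threshold are hypothetical. The key idea is that disagreement across passes flags a paper for closer attention, while every paper still passes through mandatory human verification.

```python
from statistics import mean

def multi_pass_score(scorer, submission, rubric, passes=3, tolerance=0.5):
    """Score a submission several times and flag disagreement for review.

    `scorer` is a stand-in for an LLM call returning a numeric score;
    names and thresholds here are illustrative, not from the paper.
    """
    scores = [scorer(submission, rubric) for _ in range(passes)]
    spread = max(scores) - min(scores)
    proposed = round(mean(scores), 2)
    # Every paper still goes to a human grader; this flag only marks
    # submissions whose scores were inconsistent across passes.
    needs_close_review = spread > tolerance
    return proposed, needs_close_review

# Toy demo with a stubbed "LLM" whose passes disagree.
stub_answers = iter([8.0, 8.0, 9.0])
stub_scorer = lambda sub, rub: next(stub_answers)
score, flagged = multi_pass_score(stub_scorer, "student work", "rubric text")
print(score, flagged)  # 8.33 True
```

In this sketch, a consistent scorer yields an unflagged proposed score, while inter-pass disagreement beyond the tolerance routes the paper to the front of the human verification queue.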

Why It Matters

Offers educators a practical tool to maintain assessment integrity and provide personalized feedback efficiently as AI changes the homework landscape.