Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs
Researchers find open-source AI models can grade complex software diagrams with near-human accuracy.
A new study from researchers at the University of Twente demonstrates that open-source Large Language Models (LLMs) can effectively grade complex software engineering assignments. The paper, titled 'Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs,' evaluated six popular open-source models on their ability to assess 92 Unified Modeling Language (UML) class diagrams from a software design course. Unlike previous research, which focused on proprietary models such as GPT-4, this work specifically addresses universities' needs for transparency and cost-effectiveness.
The researchers developed a novel grading pipeline that compares LLM assessments against human teaching assistant (TA) evaluations at the individual criterion level, rather than only at the level of overall diagram scores. This granular approach revealed where AI and human graders align and where they diverge. The best-performing open-source models achieved per-criterion accuracy of up to 88.56% and a Pearson correlation of 0.78 with TA grades, a substantial improvement over prior work.
By creating an 'optimal model' that combines the best-performing LLM for each specific grading criterion, the researchers demonstrated performance approaching that of a human TA. This suggests a viable path toward mixed-initiative grading systems where AI handles initial assessments and humans provide final review. The study provides universities with a practical framework for managing increasing assessment workloads as student numbers grow, while maintaining academic standards through transparent AI tools.
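The "optimal model" idea above can be sketched as a simple per-criterion selection: for each rubric criterion, keep whichever LLM agrees most often with the TA labels on held-out data. The criterion names, model names, and scores below are invented for illustration and do not come from the paper.

```python
# Hypothetical TA labels per criterion (1 = met, 0 = not met) across 4 diagrams.
ta = {"naming": [1, 1, 0, 1], "multiplicity": [0, 1, 1, 1]}

# Hypothetical predictions from two candidate models on the same diagrams.
models = {
    "model_a": {"naming": [1, 1, 0, 1], "multiplicity": [0, 0, 1, 1]},
    "model_b": {"naming": [1, 0, 0, 0], "multiplicity": [0, 1, 1, 1]},
}

def agreement(pred, gold):
    """Fraction of diagrams where a model's judgment matches the TA's."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

# For each criterion, select the model with the highest TA agreement.
optimal = {
    crit: max(models, key=lambda m: agreement(models[m][crit], gold))
    for crit, gold in ta.items()
}
print(optimal)  # {'naming': 'model_a', 'multiplicity': 'model_b'}
```

Grading a new diagram then means routing each criterion to its selected model, with a human reviewing the combined result, which is the mixed-initiative setup the study points toward.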
- Open-source LLMs achieved 88.56% per-criterion accuracy grading UML diagrams, matching proprietary model performance
- The study analyzed 92 student diagrams across six models, finding a 0.78 Pearson correlation with human TAs
- Researchers proposed an 'optimal model' combining best-performing LLMs per criterion, creating a path for mixed human-AI grading systems
Why It Matters
Provides universities with transparent, cost-effective AI grading tools to manage growing class sizes while maintaining academic standards.