Counterargument Writing for Critical Thinking, as Judged by AI and Humans
Frontier LLMs graded 35 student essays, aligning with human reviewers at Gwet's AC2 = 0.33.
Researchers from Luleå University of Technology and other institutions conducted an intervention study exploring how counterargument writing can foster critical thinking in the age of Generative AI. Thirty-six university students were given four AI-generated thesis statements on popular debate topics and asked to write counterarguments. After one irregular submission was disqualified, the remaining 35 samples were assessed by two peer reviewers, one experienced teacher, and six frontier LLMs against six established rubrics: focus, logic, content, style, correctness, and reference. The design combined qualitative open-ended feedback with quantitative Likert-scale scoring.
The study found that students' self-written counterarguments to the AI-generated theses demonstrated logical reasoning, a key component of critical thinking. Crucially, the LLM assessments generally aligned with human judgments, reaching a Gwet's AC2 inter-rater reliability of 0.33 across all models except one outlier. This suggests GenAI could grade written work against clear rubrics at scale, offering a practical way to evaluate critical thinking while addressing concerns about cheating and cognitive offloading. The findings have significant implications for automated essay scoring in education.
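For readers unfamiliar with the statistic, below is a minimal sketch of Gwet's AC2 for two raters on a 5-point ordinal scale. It assumes linear weights and a two-rater design; the study's exact weighting scheme and rater pairings are not specified here, and the toy scores are invented for illustration.

```python
import numpy as np

def gwet_ac2(r1, r2, categories=5):
    """Gwet's AC2 for two raters on an ordinal scale, using linear weights."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    q = categories
    # Linear weights: full credit for exact agreement, partial credit for near misses.
    k = np.arange(1, q + 1)
    w = 1 - np.abs(k[:, None] - k[None, :]) / (q - 1)
    # Observed weighted agreement: mean weight of each item's rating pair.
    pa = w[r1 - 1, r2 - 1].mean()
    # Category marginals averaged over both raters.
    pi = np.array([((r1 == c).mean() + (r2 == c).mean()) / 2 for c in k])
    # Gwet's chance-agreement term, scaled by the total weight mass.
    pe = (w.sum() / (q * (q - 1))) * np.sum(pi * (1 - pi))
    return (pa - pe) / (1 - pe)

# Toy data: hypothetical LLM vs. teacher scores on one 5-point rubric.
llm = [4, 3, 5, 2, 4, 3, 4, 5, 3, 2]
teacher = [4, 4, 5, 2, 3, 3, 4, 4, 3, 3]
print(f"Gwet's AC2: {gwet_ac2(llm, teacher):.2f}")  # ~0.79 on this toy data
```

Unlike Cohen's kappa, Gwet's AC statistics remain stable when most ratings cluster in a few categories, which is common with Likert-scale grading.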
- 36 students wrote counterarguments to AI-generated thesis statements; 35 submissions qualified.
- Six frontier LLMs graded the essays against six rubrics on a 5-point Likert scale (see the sketch after this list).
- LLM grades aligned with human raters (Gwet's AC2 = 0.33 for all but one model), and students demonstrated logical reasoning in their writing.
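The paper's prompts and model list aren't reproduced here, but rubric-based LLM grading of this kind is straightforward to sketch. The snippet below is a hypothetical illustration using the OpenAI Python client; the model name, prompt wording, and JSON output format are all assumptions, not the study's setup.

```python
import json
from openai import OpenAI  # any OpenAI-compatible client; other providers differ

RUBRICS = ["focus", "logic", "content", "style", "correctness", "reference"]

def grade_essay(essay: str, model: str = "gpt-4o") -> dict:
    """Ask one LLM to score an essay on each rubric (1-5 Likert scale)."""
    client = OpenAI()
    prompt = (
        "Grade the following counterargument essay on a 1-5 Likert scale for "
        f"each criterion: {', '.join(RUBRICS)}. "
        'Reply with JSON only, e.g. {"focus": 4, "logic": 3, ...}.\n\n'
        f"Essay:\n{essay}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring makes inter-rater comparison cleaner
    )
    # A production grader would validate the JSON and retry on malformed output.
    return json.loads(resp.choices[0].message.content)
```

Running the same essays through several such graders and comparing their scores to human raters with the AC2 function above mirrors the study's quantitative design.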
Why It Matters
AI that grades in line with human raters could let educators assess critical thinking at scale, potentially transforming how student writing is evaluated.