Modernizing Ground Truth: Four Shifts Toward Improving Reliability and Validity in AI in Education
Researchers argue single reliability scores like Cohen's kappa are misleading for complex educational AI.
A team of researchers from Carnegie Mellon University and Stanford University, led by Danielle R. Thomas and Conrad Borchers, has published a paper accepted to AIED 2026 that critiques current practices for establishing 'ground truth' in educational AI. The authors argue that the widespread use of generative AI (GenAI) in education is hamstrung by flawed data labeling practices. A key problem is over-reliance on a single inter-rater reliability (IRR) coefficient, such as Cohen's kappa, used as a mechanical acceptance threshold (e.g., κ > 0.8). They contend this is particularly misleading for educational data, which often involves high-inference constructs (such as 'student engagement'), skewed label distributions, and complex multimodal data from tutoring systems.
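The problem with skewed label distributions can be reproduced in a few lines. The sketch below (label names and counts are invented for illustration, not taken from the paper) computes Cohen's kappa from first principles: two raters agree on 94% of items, yet because the 'off-task' category is rare, kappa falls far below the conventional 0.8 gate.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    cats = set(a) | set(b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in cats)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Invented skewed data: 95% of turns are 'on', and the few
# disagreements all fall on the rare 'off' category.
rater_a = ['on'] * 92 + ['on'] * 3 + ['off'] * 3 + ['off'] * 2
rater_b = ['on'] * 92 + ['off'] * 3 + ['on'] * 3 + ['off'] * 2

agreement = sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement, round(kappa, 3))  # 94% raw agreement, kappa ~0.37
```

Despite near-identical marginals and high raw agreement, the coefficient alone would reject this labeling effort, which is exactly the kind of context-blind gatekeeping the authors question.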
The paper proposes four concrete shifts to modernize ground truth. First, treat IRR as a diagnostic tool for pinpointing disagreement and refining constructs, not just a pass/fail gate. Second, mandate transparent reporting of rater expertise, codebook development, and data segmentation rules. Third, actively mitigate the risks of using large language models (LLMs) as annotators through bias audits and verification workflows that guard against automation bias. Fourth, and most crucially, complement agreement statistics with direct evidence of validity and effectiveness. This includes uncertainty-aware labeling to capture nuance, predictive checks to test whether labels forecast real outcomes, and 'close-the-loop' evaluations showing that systems trained on the labels actually improve learning relative to a control group.
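The first shift, using IRR diagnostically, might look like the following sketch (the label names are invented for illustration): instead of reporting one coefficient, tally which pairs of codes raters confuse most often, so codebook revision can target the ambiguous construct boundary.

```python
from collections import Counter

def disagreement_profile(a, b):
    """Tally which label pairs two raters confuse, most frequent first,
    to direct codebook refinement at the blurriest construct boundaries."""
    pairs = Counter()
    for x, y in zip(a, b):
        if x != y:
            pairs[tuple(sorted((x, y)))] += 1
    return pairs.most_common()

# Invented ratings of eight tutoring turns by two coders.
rater_a = ['confused', 'confused', 'confused', 'engaged',
           'engaged', 'engaged', 'off-task', 'engaged']
rater_b = ['engaged', 'engaged', 'confused', 'engaged',
           'confused', 'engaged', 'off-task', 'off-task']

profile = disagreement_profile(rater_a, rater_b)
print(profile)  # the confused/engaged boundary drives most disagreement
```

Here the output immediately shows that 'confused' vs. 'engaged' accounts for most disagreement, a far more actionable finding than a single depressed kappa.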
- Critiques reliance on single IRR scores (e.g., Cohen's κ > 0.8) as a flawed 'gatekeeper' for complex educational data.
- Proposes bias audits and verification workflows to mitigate risks of using LLMs (like GPT-4) as automated annotators.
- Demands 'close-the-loop' evaluations proving AI systems trained on the labels genuinely improve student learning outcomes.
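A minimal version of the LLM-annotator verification workflow mentioned above might look like this sketch (the function names, sampling scheme, and reviewer are all hypothetical, not the paper's protocol): audit a random sample of LLM-assigned labels against a human reviewer and escalate disagreements for adjudication rather than accepting the model's labels wholesale.

```python
import random

def spot_check(llm_labels, human_review, sample_size=50, seed=0):
    """Hypothetical audit pass: compare a random sample of LLM-assigned
    labels against a human reviewer; return the agreement rate and the
    item ids that need adjudication."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(llm_labels), min(sample_size, len(llm_labels)))
    flagged = [i for i in ids if human_review(i) != llm_labels[i]]
    agreement = 1 - len(flagged) / len(ids)
    return agreement, flagged

# Toy data: the LLM labels every turn 'on-task'; the human reviewer
# disagrees on every tenth item (a stand-in for systematic model bias).
llm_labels = {i: 'on-task' for i in range(100)}

def reviewer(i):
    return 'off-task' if i % 10 == 0 else 'on-task'

agreement, flagged = spot_check(llm_labels, reviewer, sample_size=100)
print(agreement, sorted(flagged))  # 10% of audited labels go to adjudication
```

Even this crude check surfaces a systematic pattern in the flagged ids, the kind of bias a single headline agreement figure would hide, which is the point of pairing bias audits with human verification.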
Why It Matters
Ensures AI tutoring and assessment tools are built on valid data, leading to measurable improvements in student learning instead of flawed benchmarks.