Research & Papers

RealMath-Eval reveals LLM judges fail at real student math grading

SOTA LLM judges have 2.5x higher error on real vs synthetic solutions

Deep Dive

A new study by Mao et al. introduces RealMath-Eval, a rigorously annotated benchmark of 224 real-world high school math exam responses designed to test how well LLMs evaluate human reasoning. While LLMs now solve high-school math near-perfectly, their capacity to judge diverse student reasoning remains poor. The study found that state-of-the-art LLM judges exhibit a Mean Squared Error (MSE) of ~2.96 against expert human graders on real student responses, compared with only ~1.17 on synthetic LLM-generated solutions, a stark 'Evaluation Gap' of roughly 2.5x.
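For readers less familiar with the metric: MSE here is the average squared difference between the score an LLM judge assigns and the score an expert assigns to the same response. The sketch below is only an illustration; the function name and the sample scores are made up, and only the two MSE figures come from the summary above.

```python
def mean_squared_error(judge_scores, human_scores):
    """Average squared difference between judge and expert scores."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("score lists must not be empty")
    squared_errors = [(j - h) ** 2 for j, h in zip(judge_scores, human_scores)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical scores for three responses (not data from the study).
example_mse = mean_squared_error([3, 5, 2], [4, 5, 0])

# Ratio of the two MSE figures reported in the summary.
mse_real = 2.96
mse_synthetic = 1.17
gap_ratio = mse_real / mse_synthetic  # about 2.53
```

Because errors are squared, a few badly misjudged responses can dominate the average, so a 2.5x MSE ratio does not by itself say whether the judges are slightly off everywhere or far off on a minority of responses.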

Through semantic embedding analysis, the team discovered that synthetic errors suffer from a 'structural collapse' into predictable low-dimensional linear subspaces, whereas human errors occupy a much more diverse error space. Additionally, generative probability probes revealed that human reasoning involves significantly higher information-theoretic surprisal, indicating student reasoning transitions are out-of-distribution for current models. Attempts to close the gap with surface-level style transfer failed. The findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not capture authentic human mathematical reasoning, highlighting a critical limitation for AI use in education.
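The two diagnostics in this paragraph can be illustrated with simple stand-ins. The sketch below is not the authors' method: it assumes per-token probabilities are already available from some model, and it swaps in the participation ratio (a common effective-dimension measure) for whatever embedding analysis the paper actually uses. All function names and inputs are hypothetical.

```python
import math


def mean_surprisal_bits(token_probs):
    """Average surprisal, -log2(p), over the probabilities a model assigned to each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0 or p > 1 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)


def participation_ratio(vectors):
    """Effective dimension of a point cloud: (trace C)^2 / trace(C^2), C = covariance.

    Close to 1 when points lie along a single line; close to d when
    variance is spread evenly over all d dimensions.
    """
    n = len(vectors)
    d = len(vectors[0])
    means = [sum(v[k] for v in vectors) / n for k in range(d)]
    centered = [[v[k] - means[k] for k in range(d)] for v in vectors]
    cov = [
        [sum(row[i] * row[j] for row in centered) / n for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    # trace(C^2) equals the sum of squared entries because C is symmetric.
    trace_of_square = sum(c * c for row in cov for c in row)
    if trace_of_square == 0:
        raise ValueError("all vectors are identical; dimension is undefined")
    return trace * trace / trace_of_square


# Toy embeddings (assumed, for illustration only).
collapsed = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]               # points on one line
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]    # variance in both axes
```

Under these assumptions, a lower participation ratio would correspond to the 'structural collapse' described for synthetic errors, and a higher mean surprisal to reasoning steps the model finds less predictable.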

Key Points
  • New benchmark RealMath-Eval includes 224 real high school exam responses with expert annotations
  • LLM judges show MSE ~2.96 on real responses vs ~1.17 on synthetic solutions, roughly a 2.5x gap
  • Human errors are diverse and high-surprisal; synthetic errors collapse into low-dimensional subspaces

Why It Matters

The study exposes a critical blind spot in AI evaluation: synthetic tests don't reflect the diversity of real-world human reasoning.
