New TEI Metric Predicts LLM Math Tutor Quality Without Training or Judges
Ranking candidates by four internal signals lifts a tutor's improvement rate to 81.9% without costly RL training.
Researchers Shim Jaechang and Unggi Lee introduce the Tutoring Effectiveness Index (TEI), a training-free metric that predicts how well a frozen LLM will perform as a math tutor. TEI combines four internal reasoning signals from the model: a Schoenfeld-Verify keyword ratio, a math-step density, an ends-question rate, and a deep-reasoning gate derived from the Deep-Thinking Ratio (DTR) probe. By selecting the top N candidates using TEI (the TEI@N rule), the improvement rate on pre-incorrect student scenarios rises from 59.0% to 81.9% at N=8, a gain of 22.9 percentage points, without any reinforcement learning (RL) training or external LLM judges. The method uses a frozen DeepSeek-R1-8B base model, which suggests that internal signals alone can stand in for costly alignment procedures.
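The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not give the formula that combines the four signals, so the equal-weight average, the multiplicative use of the gate, and the reading of TEI@N as "keep the highest-scoring of N sampled candidates" are all hypothetical, as are the names `Candidate`, `tei_score`, and `select_tei_at_n`. The signal values are taken as precomputed inputs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    """One sampled tutor response with its four precomputed signals (hypothetical layout)."""
    text: str
    verify_keyword_ratio: float   # Schoenfeld-Verify keyword ratio
    math_step_density: float      # math-step density
    ends_question_rate: float     # ends-question rate
    deep_reasoning_gate: float    # 1.0 if the DTR-derived gate passes, else 0.0


def tei_score(c: Candidate) -> float:
    """ASSUMED combination: equal-weight mean of three signals, zeroed when the gate fails."""
    base = (c.verify_keyword_ratio + c.math_step_density + c.ends_question_rate) / 3.0
    return c.deep_reasoning_gate * base


def select_tei_at_n(candidates: Sequence[Candidate], n: int) -> Candidate:
    """ASSUMED TEI@N rule: score the first n candidates and return the best one."""
    pool = list(candidates[:n])
    if not pool:
        raise ValueError("need at least one candidate")
    return max(pool, key=tei_score)
```

Because the score depends only on signals read from the frozen model's own output, a selection loop like this needs no gradient updates and no external judge model, which is the property the summary highlights.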
The study also measures the alignment tax of pedagogical GRPO, a popular fine-tuning approach. Under GRPO, thinking length per tutor turn collapses from 1,764 to 119 words (–93%), and accuracy on content-knowledge and pedagogical-knowledge tasks falls by 71% and 80%, respectively. Most critically, the student's delta solve rate swings from +0.180 to –0.012, meaning students actually perform worse. To contextualize these behavioral shifts, the authors reproduce an 82-code educational codebook on 119,009 tutor sentences with a one-shot structural classifier. Together, these findings give practitioners a cheap, zero-training recipe for building math-tutoring LLMs while avoiding the hidden costs of traditional RL-based alignment.
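As a quick arithmetic check on the reported figures, the short Python sketch below recomputes the relative drop in thinking length, the shift in delta solve rate, and the TEI@8 gain in percentage points. The helper name `relative_change` is introduced here for illustration only; the 71% and 80% accuracy drops are not recomputed because the summary does not give their baseline values.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


# Thinking length per tutor turn under GRPO: 1,764 -> 119 words.
thinking_drop = relative_change(1764, 119)

# Student delta solve rate: +0.180 (baseline) -> -0.012 (after GRPO).
delta_solve_shift = -0.012 - 0.180

# Improvement rate with TEI@8: 59.0% -> 81.9%, in percentage points.
tei_gain_points = 81.9 - 59.0

print(f"thinking length change: {thinking_drop:.1%}")
print(f"delta solve rate shift: {delta_solve_shift:+.3f}")
print(f"TEI@8 gain: {tei_gain_points:.1f} percentage points")
```

The first value rounds to the –93% quoted above, and the delta solve rate moves by 0.192 in absolute terms, crossing zero in the process.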
- TEI uses four internal reasoning signals (keyword ratio, math-step density, ends-question rate, deep-reasoning gate) from a frozen model to predict tutor quality
- TEI@8 on DeepSeek-R1-8B boosts the improvement rate from 59.0% to 81.9% with zero training or external judges
- Pedagogical GRPO causes a 93% drop in thinking length, a 71–80% accuracy loss, and a negative student delta solve rate (–0.012 vs +0.180 baseline)
Why It Matters
TEI offers a training-free way to build and evaluate AI math tutors, cutting costs and avoiding the performance trade-offs measured for RL-based alignment.