The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost
Temperature sampling beats deterministic calls, while majority voting adds nothing, new research on scoring math conversations shows.
Recent research published on arXiv (arXiv:2604.26954) by Scott Frohn explores how to optimize automated scoring with large language models (LLMs) for conversation-based high school math assessments. The study scored 900 student conversations and compared the results against human-assigned ground-truth scores, using models from OpenAI (GPT-5.4 Nano and Mini) and Google (Gemini 3.1 Pro Preview). The key finding: self-consistency via intra-model majority voting (ensemble sizes from 1 to 7) yielded no significant accuracy gains over a single temperature-sampled output. Temperature sampling itself, however, significantly outperformed deterministic (zero-temperature) calls.
Higher reasoning effort—controlled via model parameters—showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis revealed a trade-off: Gemini 3.1 Pro Preview at low reasoning was the most accurate yet costly configuration. In contrast, GPT-5.4 Nano and Mini with no reasoning offered the best balance of cost and performance. The findings suggest that strategic model selection and reasoning effort adjustments are more effective than ensemble methods for cost-efficient, accurate automated scoring.
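The self-consistency setup the study evaluates can be sketched as majority voting over several temperature-sampled scoring calls. The sketch below is illustrative: `sample_score` is a hypothetical stand-in for a real LLM scoring call (here it just draws a random rubric score), and the 0-2 score scale is an assumption, not taken from the paper.

```python
import random
from collections import Counter

def sample_score(conversation: str, temperature: float = 1.0) -> int:
    """Hypothetical stand-in for one temperature-sampled LLM scoring call.

    A real implementation would send the conversation and rubric to an
    LLM with temperature > 0 and parse the score from the response.
    """
    return random.choice([0, 1, 2])  # assumed 0-2 rubric scale

def self_consistency_score(conversation: str, j: int = 5) -> int:
    """Score a conversation by majority vote over j sampled outputs.

    j=1 reduces to a single temperature-sampled call, the baseline
    the study found voting could not significantly improve on.
    """
    votes = [sample_score(conversation) for _ in range(j)]
    winner, _ = Counter(votes).most_common(1)[0]
    return winner
```

With `j=1` this is exactly the single-sample baseline; the paper's result is that raising `j` toward 7 did not buy significant accuracy.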
- Temperature sampling beats deterministic calls; ensemble voting (j=1 to 7) offers no extra gain.
- Higher reasoning effort improves accuracy linearly, but gains depend on model family.
- Gemini 3.1 Pro Preview (high cost, high accuracy) vs. GPT-5.4 Nano/Mini (best cost-performance).
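The efficiency frontier comparison above amounts to a standard Pareto filter over (cost, accuracy) pairs. A minimal sketch follows; the configuration names, costs, and accuracies are illustrative placeholders, not values reported in the study.

```python
def efficiency_frontier(configs):
    """Return names of configs not dominated on (cost, accuracy).

    A config is dominated if some other config is at least as cheap
    and at least as accurate, and strictly better on one axis.
    configs: list of (name, cost_per_response, accuracy) tuples.
    """
    frontier = []
    for name, cost, acc in configs:
        dominated = any(
            c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
            for _, c2, a2 in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative numbers only (not from the paper).
configs = [
    ("gemini-low-reasoning",   0.020, 0.91),  # most accurate, most costly
    ("gpt-nano-no-reasoning",  0.002, 0.86),  # cheapest
    ("gpt-mini-no-reasoning",  0.005, 0.88),
    ("gpt-mini-high-reasoning", 0.015, 0.87),  # dominated by mini-no-reasoning
]
```

Any configuration that is both pricier and less accurate than another (here the hypothetical `gpt-mini-high-reasoning`) drops off the frontier, which is how the paper arrives at its recommended cost-performance picks.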
Why It Matters
The study offers practical guidance for deploying LLMs in automated scoring: skip ensemble voting, and instead tune reasoning effort and model choice.