Research & Papers

On the scaling relationship between cloze probabilities and language model next-token prediction

Study reveals how scaling impacts AI's ability to match human semantic predictions versus lexical patterns.

Deep Dive

A new research paper, 'On the scaling relationship between cloze probabilities and language model next-token prediction', examines how model scaling affects AI's ability to predict human language patterns. Authored by Cassandra L. Jacobs and Morgan Grobol, the study compares language models of different sizes on cloze tasks: psycholinguistic tests in which humans fill in a missing word in a sentence.

The analysis shows that all current models under-allocate probability mass to the words humans actually produce, but larger models align better with human predictions. As models scale, they become less sensitive to surface-level lexical co-occurrence statistics and more semantically aligned with human responses. This creates a trade-off: the greater memorization capacity of larger models helps them produce semantically appropriate completions, yet it also makes them less sensitive to the low-level linguistic information that matters for word recognition tasks.
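To make the under-allocation claim concrete, here is a minimal sketch of how one might compare human cloze norms against a model's next-token distribution for a single context. The context, human responses, and model probabilities below are invented for illustration; they are not drawn from the paper's data or methods.

```python
from collections import Counter

def cloze_probabilities(responses):
    """Normalize raw human completions into a cloze probability distribution."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def human_mass_covered(cloze_probs, model_probs):
    """Total next-token probability the model allocates to words that
    humans actually produced; a value well below 1.0 reflects the kind
    of under-allocation described above."""
    return sum(model_probs.get(word, 0.0) for word in cloze_probs)

# Hypothetical cloze norms for "The children went outside to ___"
responses = ["play"] * 7 + ["run"] * 2 + ["swim"]
cloze = cloze_probabilities(responses)  # play: 0.7, run: 0.2, swim: 0.1

# Hypothetical model next-token probabilities (subset of its vocabulary)
model = {"play": 0.35, "run": 0.05, "swim": 0.02, "the": 0.10, "see": 0.08}

coverage = human_mass_covered(cloze, model)
print(f"Model mass on human responses: {coverage:.2f}")  # prints 0.42
```

With a real model, the `model` dictionary would come from a softmax over the model's next-token logits for the context; the comparison itself stays the same.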

This research helps explain why scaling improves some aspects of language model performance while potentially degrading others. The findings suggest that pure scaling alone may not produce models that closely mimic human language processing, and that specialized architectures may be needed for different linguistic tasks. The paper was submitted to arXiv on February 19, 2026 (arXiv:2602.17848) and contributes foundational work on the relationship between model size and linguistic capability.

Key Points
  • Larger language models show better predictive power for human eye movement and reading time data
  • Bigger models assign higher-quality estimates for next tokens in cloze tasks due to improved semantic alignment
  • Increased memorization capacity helps with semantic guessing but reduces sensitivity to low-level word recognition cues

Why It Matters

Understanding scaling effects helps developers build better models for specific language tasks and reveals fundamental AI-human cognition differences.