New Paper: Data Scaling Follows Predictive Spectrum, Not Just Token Frequency
Researchers find that data-scaling behavior tracks a 'predictive contribution spectrum', with a pooled log-log fit of R² ≈ 0.96.
A new preprint from Zihui Song, Shihao Ji, Hongxi Li, Shuaizhi Cheng, and Chunlin Huang proposes a mechanism for why larger training datasets improve language models. Instead of relying solely on raw token frequency, the authors define a 'predictive contribution spectrum' using a suffix-automaton representation of text. Each state in this automaton contributes its empirical mass multiplied by the KL divergence of its next-token distribution from a global next-token baseline. Across 12 diverse corpora, the tail slope of this spectrum strongly correlates with the empirical data-scaling exponent of a fixed small GPT learner. This suggests that the gains from more data come from covering more predictive states rather than from token frequency alone.
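The "mass times KL" definition above can be made concrete with a small sketch. It is not the authors' implementation: for brevity it assumes fixed-length contexts as a stand-in for suffix-automaton states, and the names `contribution_spectrum` and `context_len` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def contribution_spectrum(tokens, context_len=2):
    """Per-state contributions mass(s) * KL(p(.|s) || p_global), largest first.

    Assumption: a "state" is the previous `context_len` tokens. This is a
    simplification, not the suffix-automaton construction used in the paper.
    """
    next_counts = defaultdict(Counter)  # state -> counts of the following token
    global_counts = Counter()           # global next-token baseline counts
    for i in range(context_len, len(tokens)):
        state = tuple(tokens[i - context_len:i])
        next_counts[state][tokens[i]] += 1
        global_counts[tokens[i]] += 1

    total = sum(global_counts.values())
    spectrum = []
    for counts in next_counts.values():
        n_state = sum(counts.values())
        mass = n_state / total  # empirical mass of the state
        # KL divergence of the state's next-token distribution from the baseline
        kl = sum(
            (c / n_state) * math.log((c / n_state) / (global_counts[t] / total))
            for t, c in counts.items()
        )
        spectrum.append(mass * kl)
    return sorted(spectrum, reverse=True)
```

Under this simplification the contributions sum to the mutual information between the context and the next token, so states whose next-token distribution matches the baseline contribute close to nothing, however frequent they are.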
Going beyond correlation, the researchers define an effective truncation rank K(N) for each training size N by matching the observed excess loss to the residual tail mass of the spectrum. Empirically, log K is nearly linear in log N, with pooled R² ≈ 0.96 for the raw spectrum and R² ≈ 0.90 for the smoothed spectrum. This supports a simple picture: training scale advances an effective frontier through a predictive state spectrum, and the remaining tail mass tracks excess loss. The work offers a principled framework for understanding data scaling and could help guide data curation and model training strategies.
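The truncation-rank idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's procedure: the spectrum is assumed to be sorted in descending order, K(N) is taken as the smallest rank whose residual tail mass does not exceed the observed excess loss, and the names `truncation_rank` and `loglog_fit` are hypothetical.

```python
import math

def truncation_rank(spectrum, excess_loss):
    """Smallest K such that the mass beyond the top-K contributions is <= excess_loss.

    Assumption: `spectrum` is sorted in descending order.
    """
    tail = sum(spectrum)  # residual mass when nothing is covered (K = 0)
    for k, contribution in enumerate(spectrum):
        if tail <= excess_loss:
            return k
        tail -= contribution  # residual mass after covering the top k + 1 states
    return len(spectrum)

def loglog_fit(sizes, ranks):
    """Ordinary least-squares fit of log K = a + b * log N; returns (b, a, R^2)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(k) for k in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot
```

Given measured excess losses at several training sizes, one would call `truncation_rank` once per size and pass the resulting ranks to `loglog_fit`. Ranks of zero must be excluded before taking logarithms, and the fit needs at least two distinct sizes and two distinct ranks.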
- Proposes a predictive contribution spectrum based on state-level KL divergence in a suffix-automaton of text corpora.
- Across 12 real corpora, the spectrum's tail slope strongly correlates with the data-scaling exponent of a small GPT model.
- For the effective truncation rank K(N), log K is nearly linear in log N (pooled R² ≈ 0.96 for the raw spectrum), linking training data size to residual loss.
Why It Matters
The paper offers a candidate mechanistic explanation for data scaling laws, which could inform data curation and improve the efficiency of training large language models.