Research & Papers

A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences

Researchers create first surrogate model that maintains both word frequency distributions and correlation structures over thousands of tokens.

Deep Dive

Marcelo A. Montemurro and Mirko Degli Esposti have introduced a surrogate model that simultaneously preserves two fundamental statistical properties of symbolic sequences: Zipf's law for word frequencies and long-range correlations extending over hundreds or thousands of tokens. Published in Physica A (2026), their work addresses a limitation of existing surrogate models, which typically preserve either the frequency distribution or the correlation structure, but not both. The result is a principled tool for analyzing written language, genomic DNA, and other symbolic systems in which both statistical constraints matter for accurate modeling.

The model generates surrogates by mapping fractional Gaussian noise (FGN) onto the empirical symbol histogram through a frequency-preserving assignment. This keeps first-order statistics and long-range scaling intact while randomizing short-range dependencies. Validated on English and Latin texts as well as genomic DNA, the approach reproduces both the base composition and the scaling exponents measured by detrended fluctuation analysis (DFA). Researchers can use it to disentangle structural features of symbolic systems and to test hypotheses about the origins of scaling laws and memory effects, in domains ranging from natural language processing to computational genomics.
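The FGN-to-histogram idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`fgn`, `zipf_preserving_surrogate`), the Davies-Harte generator, and the rank-based rule that gives the most frequent symbol the lowest noise values are all assumptions made for illustration, and the paper's actual assignment may differ.

```python
import numpy as np
from collections import Counter

def fgn(n, hurst, rng):
    """Unit-variance fractional Gaussian noise via circulant embedding (Davies-Harte)."""
    k = np.arange(n + 1, dtype=float)
    h2 = 2.0 * hurst
    # Autocovariance of FGN at lags 0..n
    gamma = 0.5 * ((k + 1) ** h2 - 2 * k ** h2 + np.abs(k - 1) ** h2)
    # First row of the 2n x 2n circulant matrix that embeds the covariance
    row = np.concatenate([gamma, gamma[-2:0:-1]])
    eig = np.clip(np.fft.fft(row).real, 0.0, None)  # guard against rounding error
    m = row.size
    z = rng.standard_normal(m) + 1j * rng.standard_normal(m)
    return np.fft.fft(np.sqrt(eig / m) * z).real[:n]

def zipf_preserving_surrogate(tokens, hurst=0.7, seed=0):
    """Hypothetical rank-based assignment: the exact symbol counts are kept,
    and positions are chosen by the rank of a long-range correlated noise."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    noise = fgn(n, hurst, rng)
    order = np.argsort(noise, kind="stable").tolist()  # positions, lowest noise first
    # Pool containing each symbol exactly as often as in the original,
    # most frequent symbol first (assumed ordering, not taken from the paper)
    pool = [sym for sym, count in Counter(tokens).most_common() for _ in range(count)]
    surrogate = [None] * n
    for position, sym in zip(order, pool):
        surrogate[position] = sym
    return surrogate

tokens = "the cat sat on the mat and the dog sat on the rug".split()
surrogate = zipf_preserving_surrogate(tokens, hurst=0.8, seed=1)
print(surrogate)
print(Counter(surrogate) == Counter(tokens))  # True: every word keeps its count
```

Because the pool is built from the original counts, the frequency histogram (and hence any Zipf-like rank-frequency curve) is preserved exactly, while the ordering is inherited from the noise. The `hurst` parameter controls the strength of the long-range correlations in this sketch; values above 0.5 give persistent noise.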

Key Points
  • First model to preserve both Zipf's law (word frequency distribution) and long-range correlations simultaneously
  • Validated on English/Latin texts and genomic DNA, reproducing DFA scaling exponents and base composition
  • Enables hypothesis testing about scaling laws across language, DNA, and other symbolic domains
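
The DFA scaling exponent mentioned in the key points can be estimated along the following lines. This is a generic first-order DFA sketch, not the authors' code; the function name `dfa_exponent` and the choice of scales are assumptions. It expects a numeric series, so a symbolic sequence would first need a numeric encoding (for example, a 0/1 indicator for one symbol), and that encoding is also an assumption here.

```python
import numpy as np

def dfa_exponent(x, scales):
    """Estimate the first-order DFA scaling exponent of a numeric series."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())  # integrated, mean-removed series
    fluctuations = []
    for s in scales:
        n_win = profile.size // s
        # Non-overlapping windows of length s, one window per column
        windows = profile[: n_win * s].reshape(n_win, s).T
        t = np.arange(s)
        # Least-squares line in every window at once
        slope, intercept = np.polyfit(t, windows, 1)
        residual = windows - (np.outer(t, slope) + intercept)
        fluctuations.append(np.sqrt(np.mean(residual ** 2)))
    # Slope of log F(s) against log s is the scaling exponent
    alpha, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return alpha

rng = np.random.default_rng(0)
white = rng.standard_normal(8192)
print(dfa_exponent(white, scales=[16, 32, 64, 128, 256]))  # near 0.5 for uncorrelated noise
```

An exponent close to 0.5 indicates no long-range correlation, while larger values indicate persistent long-range structure. Comparing the exponent of an original sequence with that of its surrogate is one way to check whether the scaling has been preserved.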

Why It Matters

Gives researchers statistically controlled synthetic sequences for testing language models and for understanding fundamental patterns in symbolic systems.