AI Safety

New paper estimates model collapse costs 18% of quality per generation without subsidies

Training AI on synthetic data steadily erodes model quality, but optimally set subsidies can counteract the loss.

Deep Dive

A new arXiv paper by Lundström-Imanov develops what it describes as the first unified microeconomic theory of synthetic data markets under model collapse, introducing the Synthetic Data Contamination Equilibrium (SDCE). The authors prove existence and generic uniqueness of this equilibrium, derive a welfare decomposition, and obtain closed-form expressions for the optimal provenance subsidy, s* = KL(q||p)/(2κ), and the optimal watermark strength, w* = (1-ψ)KL(q||p)/(2κψ). An OLS estimation on a C4-synthetic benchmark over ten retraining generations yields a collapse-rate coefficient b̂ = 0.181 (HAC s.e. 0.024), within one standard error of the structural prediction of 0.183. In calibrated experiments, these policies raise generation-ten model quality by 23.1 percent and lower 2-Wasserstein drift from 0.318 to 0.142. The PMIR algorithm converges to an ε-SDCE in O(ε⁻² log T) iterations, and scaling experiments recover a logarithmic-in-t collapse law with R² = 0.962.
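To make the two closed-form expressions concrete, here is a minimal Python sketch that evaluates them for given inputs. The function names are hypothetical, and the summary does not define κ or ψ, so the code simply assumes κ is a positive scalar and ψ lies in (0, 1]; the KL helper assumes q and p are discrete distributions on a shared support. It is an illustration of the formulas as quoted, not the authors' implementation.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) in nats for two discrete distributions on the same support."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same length")
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue  # the term 0 * log(0 / p) is taken as 0
        if pi == 0.0:
            return math.inf  # q puts mass where p has none
        total += qi * math.log(qi / pi)
    return total


def optimal_subsidy(kl, kappa):
    """s* = KL(q||p) / (2 * kappa); kappa > 0 is an assumed domain."""
    if kappa <= 0:
        raise ValueError("kappa is assumed to be positive")
    return kl / (2.0 * kappa)


def optimal_watermark(kl, kappa, psi):
    """w* = (1 - psi) * KL(q||p) / (2 * kappa * psi); psi in (0, 1] is assumed."""
    if kappa <= 0:
        raise ValueError("kappa is assumed to be positive")
    if not 0.0 < psi <= 1.0:
        raise ValueError("psi is assumed to lie in (0, 1]")
    return (1.0 - psi) * kl / (2.0 * kappa * psi)


if __name__ == "__main__":
    # Illustrative inputs only; none of these values come from the paper.
    kl = kl_divergence([0.5, 0.5], [0.9, 0.1])
    print("s* =", optimal_subsidy(kl, kappa=2.0))
    print("w* =", optimal_watermark(kl, kappa=2.0, psi=0.25))
```

One thing the sketch makes visible is that the two expressions are tied together: w* = s* · (1-ψ)/ψ, so under these formulas the watermark strength equals the subsidy when ψ = 0.5 and falls to zero as ψ approaches 1.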

Key Points
  • Recursive training on synthetic data costs an estimated 18.1% of quality per generation (b̂ = 0.181, HAC s.e. 0.024); scaling experiments fit a logarithmic-in-t collapse law with R² = 0.962.
  • Optimal provenance subsidy s* = KL(q||p)/(2κ) and watermark strength w* = (1-ψ)KL(q||p)/(2κψ) raise generation-ten quality by 23.1% in calibrated experiments.
  • PMIR algorithm converges to equilibrium in O(ε⁻² log T) iterations and attains the Cramér–Rao bound.
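
For readers who want a feel for what fitting a logarithmic-in-t collapse law involves, the sketch below runs a one-regressor OLS fit in plain Python. The functional form quality(t) = a - b·ln(t) is an assumption, since the summary does not give the paper's exact specification, and it is not stated whether the reported b̂ is the slope on ln(t). The data are generated noise-free from that assumed form, reusing 0.181 purely as an illustrative slope; they are not the paper's measurements. The function name is hypothetical, and the sketch omits the HAC standard errors the paper reports.

```python
import math


def fit_log_law(generations, quality):
    """OLS fit of quality = a - b * ln(t). Returns (a, b, r_squared)."""
    x = [math.log(t) for t in generations]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(quality) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, quality))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, quality))
    ss_tot = sum((yi - mean_y) ** 2 for yi in quality)
    r_squared = 1.0 - ss_res / ss_tot
    # The law is written with a minus sign, so b is the negated OLS slope.
    return intercept, -slope, r_squared


if __name__ == "__main__":
    # Synthetic, noise-free illustration over ten generations (assumed form).
    gens = list(range(1, 11))
    qual = [1.0 - 0.181 * math.log(t) for t in gens]
    a_hat, b_hat, r2 = fit_log_law(gens, qual)
    print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, R^2 = {r2:.3f}")
```

Because the illustrative data contain no noise, the fit recovers the inputs essentially exactly; with real measurements the R² would fall below 1, as in the 0.962 reported for the paper's scaling experiments.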

Why It Matters

Without provenance subsidies, models trained recursively on synthetic data lose quality with each generation; this paper offers an economic framework for limiting that decline.