Nature study finds AI models degrade when trained on recursively generated synthetic data
Research warns that training models on their own outputs creates a feedback loop that can lead to model collapse.
Deep Dive
A Nature paper by Shumailov et al. (July 2024) shows that AI models degrade when trained on recursively generated synthetic data, that is, on output produced by earlier generations of models. Over successive generations the models lose accuracy and diversity, a phenomenon the authors call model collapse. Gartner had forecast that 60% of training data would be synthetic by 2024, which would amplify the risk. Separately, OpenAI's o3 and o4-mini system card (April 2025) reports hallucination rates on the PersonQA benchmark.
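The feedback loop can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: it assumes a one-dimensional Gaussian stands in for the "model", and the function name `simulate_collapse` and its parameters are hypothetical. Each generation is fitted only to samples drawn from the previous generation's fit, so estimation error compounds and the fitted spread tends to shrink, which mirrors the loss of diversity described above.

```python
import random
import statistics


def simulate_collapse(generations=1000, samples_per_generation=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation, starting with
    the original distribution (mean 0, standard deviation 1).
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    mean, std = 0.0, 1.0       # generation 0: the original data distribution
    history = [std]
    for _ in range(generations):
        # "Train" on synthetic data only: sample from the current model...
        data = [rng.gauss(mean, std) for _ in range(samples_per_generation)]
        # ...then fit the next model to those samples alone.
        mean = statistics.fmean(data)
        std = statistics.stdev(data)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 100, 1000):
        print(f"generation {generation:4d}: std = {history[generation]:.3e}")
```

With a small sample per generation, the fitted standard deviation typically drifts toward zero as generations accumulate. Raising `samples_per_generation` slows the drift in this toy setting but does not remove it.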
Key Points
- Shumailov et al. in Nature (July 2024) found AI models lose accuracy and diversity when trained on recursively generated synthetic data.
- Gartner had forecast that 60% of training data would be synthetic by 2024, a share that would heighten the feedback-loop risk for LLMs.
- OpenAI's o3 and o4-mini system card (April 2025) reported increased hallucination rates on PersonQA. OpenAI did not attribute the increase to synthetic training data, so any link to model collapse remains speculative.
Why It Matters
For practitioners, tracking the provenance of training data, and in particular distinguishing human-written from model-generated content, is critical to preventing performance degradation and maintaining trust.