Training Language Models via Neural Cellular Automata
Synthetic data generated by cellular automata boosts downstream LLM performance by up to 6% while using 10x less data than natural language.
A team of researchers has published a paper proposing a novel method for training large language models (LLMs) on synthetic data generated by Neural Cellular Automata (NCA). The work, titled 'Training Language Models via Neural Cellular Automata,' addresses a fundamental limitation of current pre-training: it relies on finite, biased natural language text that entangles knowledge with reasoning. The authors ask whether natural language is the only path to intelligence. Their answer is a 'synthetic-then-natural' pipeline in which models are first pre-trained on controllable, cheap-to-generate NCA data exhibiting rich spatiotemporal structure that is statistically similar to natural language.
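The paper's exact generation procedure is not reproduced in this summary, but the idea can be illustrated with a minimal sketch: a toy NCA whose cells evolve under a fixed random local rule, with quantized cell states serialized into a discrete token stream that a standard LM data pipeline could consume. The grid size, state dimension, and quantization scheme below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nca_token_stream(num_steps=256, grid_size=64, state_dim=8,
                     vocab_size=256, seed=0):
    """Toy Neural Cellular Automaton that emits a discrete token stream.

    Hypothetical sketch: a 1-D ring of cells, each holding a small state
    vector, updated by a fixed random linear rule over its local
    neighborhood. Cell states are quantized into token ids so the rollout
    can be fed to an LM training pipeline.
    """
    rng = np.random.default_rng(seed)
    # Random "neural" update rule: maps [left, self, right] states to an update.
    rule = rng.normal(scale=0.1, size=(3 * state_dim, state_dim))
    state = rng.normal(size=(grid_size, state_dim))

    tokens = []
    for _ in range(num_steps):
        left = np.roll(state, 1, axis=0)
        right = np.roll(state, -1, axis=0)
        perception = np.concatenate([left, state, right], axis=1)
        state = np.tanh(state + perception @ rule)  # local, recurrent update
        # Quantize the first state channel of every cell into vocab_size bins.
        bins = np.clip(((state[:, 0] + 1) / 2 * vocab_size).astype(int),
                       0, vocab_size - 1)
        tokens.extend(bins.tolist())
    return tokens

stream = nca_token_stream()
print(len(stream), stream[:16])
```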
Remarkably, an initial pre-training stage on just 164 million tokens of this synthetic data improved downstream language modeling performance by up to 6% and sped up convergence by a factor of 1.6. This small amount of synthetic data even outperformed pre-training on 1.6 billion tokens of natural language from Common Crawl. The gains transferred to key reasoning benchmarks, including GSM8K for math, HumanEval for code, and BigBench-Lite. The researchers also found that the attention layers of the transformer were the most transferable components from synthetic to natural training.
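The summary does not describe the transfer mechanics, but a hypothetical sketch shows what "transferring only the attention layers" could look like in practice: copy attention parameters from the synthetic-data checkpoint and re-initialize everything else before natural-language training. The flat-dict checkpoint format and the '.attn.' naming convention are assumptions for illustration.

```python
def transfer_attention_only(synthetic_ckpt, natural_init):
    """Build an initialization for natural-language training that reuses
    only the attention parameters of a synthetic-data checkpoint.

    Both arguments are flat dicts mapping parameter names to arrays.
    Parameters whose names contain '.attn.' (an assumed convention) are
    copied from the synthetic checkpoint; embeddings, MLPs, and norms
    keep their fresh initialization.
    """
    return {
        name: synthetic_ckpt.get(name, value) if ".attn." in name else value
        for name, value in natural_init.items()
    }
```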
Crucially, the study revealed that optimal NCA complexity is domain-specific. Simpler automata dynamics worked better for coding tasks, while more complex dynamics benefited mathematical reasoning and general web text modeling. This finding enables the systematic tuning of synthetic data distributions to target specific downstream applications. The work opens a new research direction toward more efficient, potentially fully synthetic pre-training pipelines that could reduce reliance on massive, copyrighted, or biased text corpora.
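In practice, tuning the synthetic distribution could be as simple as selecting different NCA hyperparameters per target domain. The mapping below is purely illustrative (the values are assumptions, not the paper's) and reuses the nca_token_stream sketch above.

```python
# Illustrative complexity knobs per target domain; smaller state and grid
# give simpler dynamics. These values are assumptions, not from the paper.
DOMAIN_NCA_CONFIGS = {
    "code": dict(state_dim=4, grid_size=32),     # simpler dynamics
    "math": dict(state_dim=16, grid_size=128),   # richer dynamics
    "web":  dict(state_dim=16, grid_size=128),
}

def build_pretraining_stream(domain, num_steps=4096):
    """Generate a synthetic token stream tuned for a downstream domain."""
    return nca_token_stream(num_steps=num_steps, **DOMAIN_NCA_CONFIGS[domain])
```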
- Pre-training on 164M NCA tokens improved language modeling by up to 6% and accelerated convergence by 1.6x.
- Synthetic data outperformed 10x more natural language data (1.6B tokens) from Common Crawl for pre-training efficiency.
- Gains transferred to reasoning benchmarks (GSM8K, HumanEval) with optimal NCA complexity varying by domain (code vs. math).
Why It Matters
Offers a path to train more capable AI with less data, lower cost, and reduced human bias, potentially disrupting how foundation models are built.