Talkie, a 13B LM trained exclusively on pre-1931 data
This model talks like it’s 1930, yet it shows early signs of picking up Python purely from in-context examples.
A team of AI researchers (Nick Levine, David Duvenaud, and Alec Radford) has released “Talkie,” a 13-billion-parameter language model trained on a unique dataset: 260 billion tokens of text published before 1931. The training corpus includes old books, newspapers, scientific journals, patents, and other historical documents, and deliberately excludes any modern web content. This design lets researchers probe whether LLMs truly generalize or simply memorize patterns from the contemporary internet.
Early results from the project are striking. Talkie performs surprisingly well on core language and numeracy benchmarks despite its antiquated training data. More intriguingly, the model shows early signs of learning simple Python purely from in-context examples, even though Python (created in 1991) appears nowhere in its pre-1931 corpus. This suggests that language models can develop abilities that extend beyond their training distribution, offering fresh insight into how AI systems acquire new capabilities. The work has significant implications for understanding generalization in large models and could influence future training strategies.
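To make “learning from examples” concrete, below is a minimal sketch of the kind of few-shot probe that claim implies. It assumes the checkpoint were available through the Hugging Face transformers library; the model ID and the prompt are hypothetical illustrations, not details from the actual release.

```python
# Hypothetical few-shot probe: show the model worked Python examples and ask it
# to complete a new one. Python syntax postdates the training corpus by decades,
# so any coherent completion must come from in-context generalization, not recall.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "example-org/talkie-13b"  # hypothetical identifier, not the real checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Few-shot prompt: two complete examples, then an unfinished function to continue.
prompt = """\
def add(a, b):
    return a + b

add(2, 3)  # -> 5

def double(x):
    return x * 2

double(4)  # -> 8

def square(x):
"""

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Print only the newly generated tokens (the model's attempted completion).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```

The point of such a probe is that nothing resembling this syntax exists in pre-1931 text, so a sensible completion is evidence of generalization from the in-context examples rather than memorization.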
- Talkie is a 13B parameter LM trained on 260B tokens of pre-1931 text only.
- The model shows strong core language and numeracy performance despite no modern web exposure.
- It demonstrates early signs of learning Python from examples, a skill absent from its training data.
Why It Matters
This research challenges assumptions about LLM learning, showing that models can generalize beyond their training data in unexpected ways.