I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]
33 years of early-internet discourse captured in 408 million posts.
For years, a developer known as OwnedByDanes quietly assembled what is now one of the largest privately held pretraining corpora: a complete Usenet archive from 1980 to 2013. The final dataset clocks in at 103.1 billion tokens (measured with the cl100k_base tokenizer), spanning 408 million posts across 18,347 newsgroups and 9 hierarchy categories. The processing pipeline was rigorous: full deduplication, exclusion of alt.binaries.* to drop binary content, quoted-text handling, email-address redaction via pattern matching, SHA-256 hashing of Message-IDs, and conversion from raw MBOX archives to gzip-compressed JSONL. Per-record language detection used Meta's fastText lid.176 model, yielding 96.6% English with meaningful representation from 100+ other languages, especially in the soc.culture.* groups.
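To make the pipeline concrete, here is a minimal sketch of what one cleaning stage might look like. This is not the author's published code; the file paths, output field names, and the use of lid.176.bin are assumptions, but the steps (quote stripping, email redaction, Message-ID hashing, binary-group exclusion, fastText language ID, gzip JSONL output) follow the description above.

```python
import gzip
import hashlib
import json
import mailbox
import re

import fasttext  # pip install fasttext; lid.176.bin is fastText's 176-language ID model

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
lid_model = fasttext.load_model("lid.176.bin")

def clean_post(msg):
    """Turn one MBOX message into a redacted, JSON-serializable record."""
    body = msg.get_payload(decode=True)
    if body is None:  # skip multipart/undecodable messages in this sketch
        return None
    text = body.decode("utf-8", errors="replace")
    # Drop quoted lines ("> ...") so only the poster's own text remains.
    text = "\n".join(l for l in text.splitlines() if not l.lstrip().startswith(">"))
    # Redact email addresses via pattern matching.
    text = EMAIL_RE.sub("[email redacted]", text)
    # Hash the Message-ID so posts stay de-duplicable without exposing the raw ID.
    msg_id = hashlib.sha256((msg.get("Message-ID") or "").encode()).hexdigest()
    # fastText expects single-line input; classify a truncated, flattened copy.
    labels, probs = lid_model.predict(text.replace("\n", " ")[:2000])
    return {
        "id": msg_id,
        "newsgroups": msg.get("Newsgroups", ""),
        "date": msg.get("Date", ""),
        "text": text,
        "lang": labels[0].replace("__label__", ""),
        "lang_prob": float(probs[0]),
    }

with gzip.open("comp.lang.c.jsonl.gz", "wt", encoding="utf-8") as out:
    for msg in mailbox.mbox("comp.lang.c.mbox"):
        if msg.get("Newsgroups", "").startswith("alt.binaries."):
            continue  # binary groups are excluded entirely
        record = clean_post(msg)
        if record:
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```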
The corpus’s most striking feature is its temporal arc: volume is sparse before 1986, grows steadily through the early 90s, peaks around 1999–2000, then declines as Usenet is displaced by forums and social media. This 33-year window captures language evolution before SEO, engagement optimization, and AI-generated content. The creator has published a full data card, cleaning methodology, and representative samples (5K posts per hierarchy + combined sets) on Hugging Face, inviting community questions about the pipeline.
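For anyone wanting to sanity-check the headline token figure against one of the published sample files, something like the following would work. The sample file name and the "text" field are assumptions about the released schema; only the cl100k_base encoding is stated in the data card.

```python
import gzip
import json

import tiktoken  # provides the cl100k_base encoding used for the token count

enc = tiktoken.get_encoding("cl100k_base")

total_posts = 0
total_tokens = 0
# Hypothetical path to one of the 5K-post hierarchy samples.
with gzip.open("samples/comp_5k.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        post = json.loads(line)
        total_tokens += len(enc.encode(post["text"], disallowed_special=()))
        total_posts += 1

print(f"{total_posts} posts, {total_tokens:,} cl100k_base tokens")
```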
- 103.1B tokens, 408M posts, 18,347 newsgroups, 33 years of coverage (1980–2013).
- Cleaned with deduplication, binary removal, email redaction, and fastText language detection (96.6% English).
- Hugging Face dataset includes full data card, methodology, and 5K sample posts per hierarchy.
Why It Matters
Provides a clean benchmark for training language models on authentic, non-optimized human discourse from before SEO, engagement optimization, and AI-generated content.