I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]
33 years of early-internet discourse captured in 408 million posts.
For years, a developer known as OwnedByDanes quietly assembled what is now one of the largest privately held pretraining corpora: a complete Usenet archive from 1980 to 2013. The final dataset clocks in at 103.1 billion tokens (measured with the cl100k_base tokenizer), spanning 408 million posts across 18,347 newsgroups and 9 hierarchy categories. The processing pipeline was rigorous: full deduplication, exclusion of alt.binaries.* to drop binary content, quoted-text handling, email-address redaction via pattern matching, SHA-256 hashing of Message-IDs, and conversion from raw MBOX archives to gzip-compressed JSONL. Per-record language detection used Meta's fastText lid.176 model, yielding 96.6% English with meaningful representation from 100+ other languages, especially in the soc.culture.* groups.
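To make the pipeline concrete, here is a minimal sketch of what one cleaning stage might look like. This is not the author's published code; the file paths, output field names, and the use of lid.176.bin are assumptions, but the steps (quote stripping, email redaction, Message-ID hashing, binary-group exclusion, fastText language ID, gzip JSONL output) follow the description above.

```python
import gzip
import hashlib
import json
import mailbox
import re

import fasttext  # pip install fasttext; lid.176.bin is fastText's 176-language ID model

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
lid_model = fasttext.load_model("lid.176.bin")

def clean_post(msg):
    """Turn one MBOX message into a redacted, JSON-serializable record."""
    body = msg.get_payload(decode=True)
    if body is None:  # skip multipart/undecodable messages in this sketch
        return None
    text = body.decode("utf-8", errors="replace")
    # Drop quoted lines ("> ...") so only the poster's own text remains.
    text = "\n".join(l for l in text.splitlines() if not l.lstrip().startswith(">"))
    # Redact email addresses via pattern matching.
    text = EMAIL_RE.sub("[email redacted]", text)
    # Hash the Message-ID so posts stay de-duplicable without exposing the raw ID.
    msg_id = hashlib.sha256((msg.get("Message-ID") or "").encode()).hexdigest()
    # fastText expects single-line input; classify a truncated, flattened copy.
    labels, probs = lid_model.predict(text.replace("\n", " ")[:2000])
    return {
        "id": msg_id,
        "newsgroups": msg.get("Newsgroups", ""),
        "date": msg.get("Date", ""),
        "text": text,
        "lang": labels[0].replace("__label__", ""),
        "lang_prob": float(probs[0]),
    }

with gzip.open("comp.lang.c.jsonl.gz", "wt", encoding="utf-8") as out:
    for msg in mailbox.mbox("comp.lang.c.mbox"):
        if msg.get("Newsgroups", "").startswith("alt.binaries."):
            continue  # binary groups are excluded entirely
        record = clean_post(msg)
        if record:
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```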
The corpus’s most striking feature is its temporal arc: volume is sparse before 1986, grows steadily through the early 90s, peaks around 1999–2000, then declines as Usenet is displaced by forums and social media. This 33-year window captures language evolution before SEO, engagement optimization, and AI-generated content. The creator has published a full data card, cleaning methodology, and representative samples (5K posts per hierarchy + combined sets) on Hugging Face, inviting community questions about the pipeline.
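For anyone wanting to sanity-check the headline token figure against one of the published sample files, something like the following would work. The sample file name and the "text" field are assumptions about the released schema; only the cl100k_base encoding is stated in the data card.

```python
import gzip
import json

import tiktoken  # provides the cl100k_base encoding used for the token count

enc = tiktoken.get_encoding("cl100k_base")

total_posts = 0
total_tokens = 0
# Hypothetical path to one of the 5K-post hierarchy samples.
with gzip.open("samples/comp_5k.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        post = json.loads(line)
        total_tokens += len(enc.encode(post["text"], disallowed_special=()))
        total_posts += 1

print(f"{total_posts} posts, {total_tokens:,} cl100k_base tokens")
```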
- 103.1B tokens, 408M posts, 18,347 newsgroups, 33 years of coverage (1980–2013).
- Cleaned with deduplication, binary removal, email redaction, and fastText language detection (96.6% English).
- Hugging Face dataset includes full data card, methodology, and 5K sample posts per hierarchy.
Why It Matters
Provides a clean benchmark for training language models on authentic, non-optimized human discourse from before SEO, engagement optimization, and AI-generated content.