Open Source

103B-token Usenet corpus (1980–2013) offers training data free of AI contamination

30K views on r/ML – raw human writing from 33 years pre-LLM

Deep Dive

Usenet archives from 1980 to 2013 have been consolidated into a 103.1B-token corpus by r/ML user OwnerByDane. The dataset includes 408 million posts across 18,347 newsgroups, and 96.6% of the content is English. Every post predates large language models, so the text contains no GPT mannerisms, refusal patterns, or RLHF artifacts. The writing is raw, argumentative, and stylistically diverse: a snapshot of human discourse from the pre-SEO, pre-algorithm internet era. The corpus has been deduplicated, stripped of binaries, and converted from MBOX to gzip-compressed JSONL.
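Because the corpus ships as gzip-compressed JSONL, it can be streamed one post at a time without loading a whole file into memory. The following is a minimal sketch, not the dataset's documented loader: the field name "newsgroup" and the file name in the usage note are assumptions made for illustration, since the schema is not described here.

```python
import gzip
import json
from collections import Counter
from typing import Iterator


def iter_posts(path: str) -> Iterator[dict]:
    """Yield one post (a dict) per non-empty line of a gzip-compressed JSONL file."""
    # errors="replace" keeps the stream going if old posts contain bad bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_hierarchy(path: str, group_field: str = "newsgroup") -> Counter:
    """Count posts per top-level hierarchy, e.g. 'comp' for 'comp.lang.c'.

    `group_field` is a hypothetical field name; adjust it to the real schema.
    """
    counts: Counter = Counter()
    for post in iter_posts(path):
        group = str(post.get(group_field) or "")
        # Posts with a missing or empty group are counted under "unknown".
        counts[group.split(".", 1)[0] or "unknown"] += 1
    return counts
```

With a downloaded sample saved as, say, sample.jsonl.gz (a placeholder name), count_by_hierarchy("sample.jsonl.gz") returns a Counter keyed by hierarchy, which gives a quick check of what a sample contains before fine-tuning on it.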

For ML practitioners, the corpus offers specialized hierarchies suited to domain fine-tuning: comp.* (10.3B tokens of computing discussions from early internet builders), sci.* (3.3B tokens of scientific debates), rec.* (16.5B tokens of hobbies and games), and humanities.*. A proof-of-concept fine-tune of Gemma 4 on the sample data already exists on Hugging Face (wyan/usenet-gemma-4-E2B-lora). Free samples (5K posts per hierarchy, plus combined sets) can be downloaded without approval. The full corpus is available for licensing, which makes it a candidate source for training models without the AI-generated text now found in modern web scrapes.
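To put the hierarchy sizes in proportion, the quoted token counts can be compared against the 103.1B total. This is a small arithmetic sketch using only the figures reported above; humanities.* is left out because no token count is given for it.

```python
# Token counts in billions, as quoted in the text above.
TOTAL_TOKENS_B = 103.1
HIERARCHY_TOKENS_B = {"comp": 10.3, "sci": 3.3, "rec": 16.5}


def token_shares(tokens_b: dict, total_b: float) -> dict:
    """Return each hierarchy's share of the corpus as a percentage."""
    return {name: 100.0 * count / total_b for name, count in tokens_b.items()}


shares = token_shares(HIERARCHY_TOKENS_B, TOTAL_TOKENS_B)
named_total_b = sum(HIERARCHY_TOKENS_B.values())

for name, pct in shares.items():
    print(f"{name}.*: {pct:.1f}% of the corpus")
print(f"named hierarchies: {named_total_b:.1f}B tokens")
```

The three hierarchies with quoted counts add up to about 30.1B tokens, a little under a third of the corpus, so most of the text sits in hierarchies not itemized here.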

Key Points
  • 103.1B tokens from 1980–2013, all pre-LLM text with no AI contamination
  • 408M posts across 18,347 newsgroups, deduplicated and cleaned
  • Free sample sets (5K posts per hierarchy) available; community fine-tune of Gemma 4 exists

Why It Matters

Enables fine-tuning local models on authentic human discourse without AI artifacts or RLHF biases.