Open Source

103B-token Usenet corpus (1980–2013) offers zero-AI-contaminated training data

r/LocalLLaMA May 28, 2026

⚡30K views on r/ML – raw human writing from 33 years pre-LLM

Deep Dive

Usenet archives from 1980 to 2013 have been consolidated into a 103.1B-token corpus by r/ML user OwnerByDane. The dataset includes 408 million posts across 18,347 newsgroups, and 96.6% of the content is English. Every post predates large language models, so the text contains no GPT mannerisms, refusal patterns, or RLHF artifacts. The writing is raw, argumentative, and stylistically diverse: a snapshot of human discourse from the pre-SEO, pre-algorithm internet. The corpus has been deduplicated, stripped of binaries, and converted from MBOX to gzip-compressed JSONL.
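Since the corpus ships as gzip-compressed JSONL, a sample file can be streamed one post at a time without decompressing it to disk. The following is a minimal sketch, assuming each line holds one JSON object per post; the function name `iter_posts` is hypothetical, and the article does not document the actual record schema.

```python
import gzip
import json
from typing import Iterator


def iter_posts(path: str) -> Iterator[dict]:
    """Yield one post (as a dict) per non-empty line of a gzip-compressed JSONL file.

    Assumption: one JSON object per line. The field names inside each
    object are not specified in the article, so none are accessed here.
    """
    # "rt" opens the gzip stream in text mode; errors="replace" keeps the
    # loop going if an old post contains bytes that are not valid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Calling `iter_posts("comp_sample.jsonl.gz")` (a placeholder file name) in a `for` loop would process posts lazily, which matters at this scale because the full corpus would not fit in memory.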

For ML practitioners, the corpus is organized into hierarchies suited to domain fine-tuning: comp.* (10.3B tokens of computing discussions from early internet builders), sci.* (3.3B tokens of scientific debates), rec.* (16.5B tokens of hobbies and games), and humanities.*. A proof-of-concept fine-tune of Gemma 4 on the sample data is already on Hugging Face (wyan/usenet-gemma-4-E2B-lora). Free samples (5K posts per hierarchy, plus combined sets) can be downloaded without approval. The full corpus is available for licensing, which makes it a useful resource for building models free of the AI-generated text that contaminates modern web scrapes.
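Selecting a single hierarchy such as comp.* from a combined sample comes down to matching the first component of the newsgroup name. The sketch below assumes each post is a dict with a `newsgroup` field holding a name like `comp.lang.c`; that field name is a guess, not something the article confirms, so it is exposed as a parameter.

```python
from typing import Iterable, Iterator


def top_level_hierarchy(newsgroup: str) -> str:
    """Return the first dot-separated component, e.g. 'comp' for 'comp.lang.c'."""
    return newsgroup.split(".", 1)[0].lower()


def filter_by_hierarchy(
    posts: Iterable[dict],
    hierarchy: str,
    field: str = "newsgroup",  # hypothetical field name; adjust to the real schema
) -> Iterator[dict]:
    """Yield only the posts whose newsgroup belongs to the given top-level hierarchy."""
    wanted = hierarchy.lower()
    for post in posts:
        group = post.get(field, "")
        # Skip records where the field is missing or not a string.
        if isinstance(group, str) and top_level_hierarchy(group) == wanted:
            yield post
```

Matching on the first component rather than a plain string prefix avoids accidentally pulling in a group such as `computers.misc` when filtering for `comp`.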

Key Points

103.1B tokens from 1980–2013, zero AI contamination — all pre-LLM text
408M posts across 18,347 newsgroups, deduplicated and cleaned
Free sample sets (5K posts per hierarchy) available; community fine-tune of Gemma 4 exists

Why It Matters

Enables fine-tuning local models on authentic human discourse without AI artifacts or RLHF biases.
