Research & Papers

\"UberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

A 20-trillion-token public dataset tackles the 'curse of multilinguality' with targeted per-language curation.

Deep Dive

DatologyAI researchers built ÜberWeb, a 20-trillion-token public pretraining dataset. Their key insight is that targeted, per-language data curation, applied to under 8% of total tokens, mitigates performance interference between languages. Models trained on curated subsets achieved competitive multilingual accuracy with 4-10x fewer training FLOPs than baselines. The approach was also validated at scale: it improved the multilingual performance of the 400B-parameter Trinity Large model, showing that quality curation enables compute-efficient multilingual scaling.
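
To make the curation idea concrete, here is a minimal sketch of budgeted per-language selection. The function name curate_per_language, the document fields (lang, tokens, quality), and the even budget split across languages are illustrative assumptions, not the paper's actual pipeline, which this summary does not detail.

    from collections import defaultdict

    def curate_per_language(docs, budget_fraction=0.08):
        """Pick a high-quality per-language subset of a multilingual corpus.

        docs: list of dicts with illustrative fields
          {"text": str, "lang": str, "tokens": int, "quality": float}
        "quality" stands in for whatever per-language scorer a curation
        pipeline might use (classifier score, perplexity filter, etc.).
        """
        total_tokens = sum(d["tokens"] for d in docs)
        by_lang = defaultdict(list)
        for d in docs:
            by_lang[d["lang"]].append(d)

        # Split the global token budget evenly across languages;
        # the real allocation scheme is an open assumption here.
        per_lang_budget = budget_fraction * total_tokens / max(len(by_lang), 1)

        curated = []
        for lang_docs in by_lang.values():
            spent = 0
            # Greedily keep the highest-scoring documents until this
            # language's token budget is exhausted.
            for d in sorted(lang_docs, key=lambda d: d["quality"], reverse=True):
                if spent + d["tokens"] <= per_lang_budget:
                    curated.append(d)
                    spent += d["tokens"]
        return curated

The key property the sketch illustrates is that the budget is enforced per language, so high-resource languages cannot crowd out low-resource ones, which is the kind of cross-language interference the deep dive describes.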

Why It Matters

Enables more efficient, higher-performing multilingual AI models, cutting training costs and improving global accessibility.