Free 9.8M-doc Indic multilingual corpus released on HuggingFace – CC0 license
8.4 billion tokens across 11 languages, including Hindi, Bengali, Tamil, and Telugu – all free.
Deep Dive
User ashtok897 built indic-hplt-v1 over the past few weeks for a multilingual research project. The dataset contains ~9.8M web documents and ~8.4B tokens across 11 languages (hi, bn, ta, te, mr, gu, kn, ml, pa, ur, en) and is licensed under CC0. It is available on HuggingFace.
Key Points
- Dataset: indic-hplt-v1 contains ~9.8M documents and ~8.4B tokens across 11 languages.
- Languages covered: Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, and English.
- License: CC0 (public domain) – free for any use including commercial AI training.
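As a rough illustration of the figures above, the sketch below pairs the listed language codes with their names and works out the average document length implied by the approximate totals. The `stream_documents` helper is hypothetical: the announcement does not give the repository ID, split name, or column layout, so the ID must be supplied by the caller and the `"train"` split is an assumption. It relies only on the `load_dataset` function of the HuggingFace `datasets` library.

```python
# ISO 639-1 codes from the announcement, paired with the language names it lists.
LANG_CODES = {
    "hi": "Hindi",
    "bn": "Bengali",
    "ta": "Tamil",
    "te": "Telugu",
    "mr": "Marathi",
    "gu": "Gujarati",
    "kn": "Kannada",
    "ml": "Malayalam",
    "pa": "Punjabi",
    "ur": "Urdu",
    "en": "English",
}

# Approximate totals quoted in the announcement (both are rounded figures).
APPROX_DOCS = 9_800_000
APPROX_TOKENS = 8_400_000_000


def avg_tokens_per_doc(tokens: int = APPROX_TOKENS, docs: int = APPROX_DOCS) -> float:
    """Return the mean number of tokens per document implied by the totals."""
    return tokens / docs


def stream_documents(repo_id: str, split: str = "train"):
    """Hypothetical helper: stream the dataset without downloading it in full.

    Assumptions: `repo_id` is the dataset's HuggingFace repository ID (not
    stated in the announcement) and a split named "train" exists.
    """
    # Imported lazily so the arithmetic above works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


if __name__ == "__main__":
    print(f"{len(LANG_CODES)} languages, ~{avg_tokens_per_doc():.0f} tokens per document")
```

On the quoted totals this works out to roughly 857 tokens per document on average. Streaming is used in the helper because a corpus of this size may be impractical to download in full just to inspect a few records.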
Why It Matters
The dataset provides a large, freely licensed resource for improving NLP in underrepresented Indic languages.