Free 9.8M-doc Indic multilingual corpus released on HuggingFace – CC0 license

r/MachineLearning May 19, 2026

⚡8.4 billion tokens across 11 languages, including Hindi, Bengali, Tamil, and Telugu – all free.

Deep Dive

User ashtok897 built indic-hplt-v1 over the past few weeks for a multilingual research project. The dataset contains ~9.8M web documents and ~8.4B tokens across 11 languages (hi, bn, ta, te, mr, gu, kn, ml, pa, ur, en) and is licensed under CC0. It is available on HuggingFace.

Key Points

Dataset: indic-hplt-v1 contains 9.8M documents and 8.4B tokens across 11 languages.
Languages covered: Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, and English.
License: CC0 (public domain) – free for any use including commercial AI training.
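
For a rough sense of scale, the figures above can be turned into an average document length. The sketch below is only illustrative: it uses the rounded totals quoted in the post, so the result is approximate, and the constant and function names are made up for this example rather than taken from the dataset.

```python
# ISO 639-1 codes listed in the post, mapped to the language names it gives.
LANGUAGES = {
    "hi": "Hindi",
    "bn": "Bengali",
    "ta": "Tamil",
    "te": "Telugu",
    "mr": "Marathi",
    "gu": "Gujarati",
    "kn": "Kannada",
    "ml": "Malayalam",
    "pa": "Punjabi",
    "ur": "Urdu",
    "en": "English",
}

# Approximate totals quoted in the post (both are rounded figures).
TOTAL_DOCUMENTS = 9_800_000
TOTAL_TOKENS = 8_400_000_000


def average_tokens_per_document(total_tokens: int, total_documents: int) -> float:
    """Return the mean number of tokens per document."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    return total_tokens / total_documents


if __name__ == "__main__":
    avg = average_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
    print(f"{len(LANGUAGES)} languages, ~{avg:.0f} tokens per document on average")
```

With the rounded totals this works out to roughly 857 tokens per document. The post gives no per-language breakdown, so the average says nothing about how evenly the 11 languages are represented.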

Why It Matters

The corpus provides about 8.4B tokens of freely licensed text for improving NLP in underrepresented Indic languages.
