VAANI: Capturing the language landscape for an inclusive digital India
Open-source dataset covers 112 languages from 165 districts, many of them represented in a large-scale dataset for the first time.
A research consortium of 23 authors from institutions including the Indian Institutes of Technology (IITs) has open-sourced Project VAANI, a multimodal dataset designed to map India's vast linguistic landscape. The initiative, detailed in a new arXiv paper, has completed its first two phases, collecting data from 165 districts across 31 states and union territories. The core collection method uses image-based prompts to elicit spontaneous, natural speech, while a separate process gathers a broad range of contextual images. All data undergoes multi-stage quality checks, including both automated and manual validation of audio quality and transcription accuracy.
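The paper does not publish the consortium's validation code, but an automated audio-quality gate of the kind described typically screens clips for minimum duration, clipping, and near-silence before they reach human reviewers. The sketch below is a minimal, hypothetical illustration of such a filter; the function name and all thresholds are assumptions, not VAANI's actual pipeline.

```python
import numpy as np

def passes_audio_qc(waveform: np.ndarray, sample_rate: int,
                    min_seconds: float = 1.0,
                    max_clip_fraction: float = 0.01,
                    min_rms: float = 0.01) -> bool:
    """Hypothetical automated gate: reject clips that are too short,
    heavily clipped, or near-silent. All thresholds are illustrative."""
    duration = len(waveform) / sample_rate
    if duration < min_seconds:
        return False
    # A large fraction of samples at or near full scale suggests clipping.
    clip_fraction = np.mean(np.abs(waveform) > 0.99)
    if clip_fraction > max_clip_fraction:
        return False
    # Root-mean-square energy serves as a crude silence detector.
    rms = np.sqrt(np.mean(waveform ** 2))
    return bool(rms >= min_rms)

# Example: a 2-second 440 Hz tone at moderate amplitude passes the gate,
# while a 0.05-second fragment is rejected as too short.
sr = 16000
t = np.linspace(0, 2, 2 * sr, endpoint=False)
tone = 0.3 * np.sin(2 * np.pi * 440 * t)
print(passes_audio_qc(tone, sr))        # True
print(passes_audio_qc(tone[:800], sr))  # False
```

Clips that pass such automated screening would then proceed to manual review of audio quality and transcription accuracy, matching the two-tier validation the project describes.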
The resulting open-source release is unprecedented in scale and scope: approximately 31,270 hours of raw audio recordings, around 2,067 hours of transcribed speech, and 289,000 images spanning 112 distinct languages, a significant number of which the researchers note are represented in a dataset of this scale for the very first time. The project's primary goal is to enable inclusive speech recognition, text-to-speech, and multimodal AI applications that work effectively across India's diverse population, bridging a major data gap that has hindered AI localization.
- Massive open-source dataset with 31,270 hours of audio and 289K images from 165 Indian districts
- Covers 112 languages, many represented in a large-scale AI dataset for the first time
- Uses image-based prompts to collect spontaneous speech, followed by rigorous multi-stage quality validation
Why It Matters
Provides the foundational data needed to build AI speech and multimodal tools that actually work for hundreds of millions of non-English speakers in India.