Audio & Speech

UrduSpeech: 156-hour open-source Urdu corpus with 12 paralinguistic dimensions

A 156-hour Urdu speech dataset is now open source, covering a language with over 230 million speakers.

Deep Dive

Despite being spoken by over 230 million people worldwide, Urdu remains critically under-resourced in speech technology. A team of researchers has introduced UrduSpeech, a large-scale open-source corpus comprising 156 hours of high-fidelity audio recordings, each annotated with 12 paralinguistic dimensions. To handle challenges such as right-to-left script and frequent code-switching between Urdu and English, the team developed an LLM-driven curation pipeline that automatically gathered data across 12 content categories, including news, drama, and rare literary forms such as Bait-Bazi. The corpus contains 71,792 utterances with a 60-40 gender balance. A separate 9-hour US-Benchmark set was manually corrected by native annotators to serve as a gold standard.
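As a quick sanity check on those figures, the average utterance length can be worked out from the reported totals. The sketch below assumes the 71,792 utterances account for the full 156 hours, which the summary does not state explicitly; the function name is illustrative only.

```python
# Totals as reported in the summary above.
TOTAL_HOURS = 156
NUM_UTTERANCES = 71_792


def mean_utterance_seconds(total_hours: float, num_utterances: int) -> float:
    """Average duration per utterance, in seconds."""
    if num_utterances <= 0:
        raise ValueError("num_utterances must be positive")
    return total_hours * 3600 / num_utterances


# Assumption: all 71,792 utterances together make up the 156 hours.
avg_seconds = mean_utterance_seconds(TOTAL_HOURS, NUM_UTTERANCES)
print(f"{avg_seconds:.1f} s per utterance on average")  # prints "7.8 s per utterance on average"
```

Under that assumption the corpus consists mostly of short clips of roughly eight seconds each, a typical segment length for ASR and TTS training data.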

Human evaluation of the primary corpus yielded a Mean Opinion Score (MOS) of 4.6 (std=0.7), with inter-rater agreement measured at a Cohen's Kappa of 0.68. These results are reported as supporting the pipeline's 97.6% confidence score. The release provides a foundation for building robust automatic speech recognition (ASR), text-to-speech (TTS), and paralinguistic analysis systems for Urdu, and is a step toward broader linguistic inclusivity in AI. The corpus and associated code are fully open-sourced, and a demo page is available for immediate exploration.
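For readers unfamiliar with the two statistics, here is a minimal sketch of how a MOS and a Cohen's Kappa are typically computed. The ratings are hypothetical and are not UrduSpeech data. The two-rater setup and the 1-5 scale are also assumptions, since the summary does not describe the evaluation protocol.

```python
from collections import Counter
from statistics import mean, stdev


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical 1-5 quality ratings for ten clips (NOT from the corpus).
rater_a = [5, 5, 4, 5, 3, 4, 5, 5, 4, 5]
rater_b = [5, 4, 4, 5, 3, 4, 5, 5, 5, 5]

all_ratings = rater_a + rater_b
mos = mean(all_ratings)      # Mean Opinion Score over all ratings
spread = stdev(all_ratings)  # sample standard deviation
kappa = cohens_kappa(rater_a, rater_b)
print(f"MOS={mos:.2f} std={spread:.2f} kappa={kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the simple fraction of matching ratings. On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are described as substantial agreement, which is where the reported 0.68 falls.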

Key Points
  • 156 hours of audio with 12-dimension paralinguistic annotations covering emotion, prosody, and speaker traits.
  • 71,792 utterances across 12 categories including news, drama, and rare Bait-Bazi poetry, with a 60-40 gender balance.
  • Human evaluation gives MOS 4.6 (std=0.7) with a Cohen's Kappa of 0.68 for inter-rater reliability; the pipeline's confidence score is 97.6%.

Why It Matters

Fills a critical gap for Urdu speech AI, enabling inclusive ASR, TTS, and emotion recognition for 230M speakers.