Research & Papers

ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization

Researchers achieve state-of-the-art Bengali speech recognition using YouTube data and LLM-assisted cleaning.

Deep Dive

A team of researchers from Bangladesh has introduced ShobdoSetu, a novel data-centric framework designed to tackle the significant gap in speech AI for Bengali, a language spoken by over 230 million people. The system addresses two core challenges: long-form Automatic Speech Recognition (ASR) and Speaker Diarization. For ASR, the team constructed a high-quality training corpus by scraping and meticulously cleaning Bengali audiobooks and dramas from YouTube. This process involved techniques such as LLM-assisted language normalization and fuzzy matching for chunk validation. They then fine-tuned the open-source `tugstugi/whisper-medium` model on approximately 21,000 data points, achieving a highly competitive Word Error Rate (WER) of 15.551% on a private test set.
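The paper's exact fuzzy-matching procedure is not reproduced here, but the idea behind chunk validation can be sketched with Python's standard `difflib`: a candidate transcript for an audio chunk is kept only if it is sufficiently similar to the aligned source text. The `threshold` value below is illustrative, not a figure from the paper.

```python
from difflib import SequenceMatcher

def chunk_is_valid(reference: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Keep an audio chunk only if its candidate transcript is close
    enough to the aligned reference text (e.g. the audiobook script).

    threshold is an illustrative cutoff, not a value from the paper.
    """
    similarity = SequenceMatcher(None, reference, hypothesis).ratio()
    return similarity >= threshold
```

Chunks whose transcripts diverge from the source text, whether from background music, dramatization, or alignment drift, would be discarded before fine-tuning.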

For the speaker diarization task—identifying "who spoke when" in an audio file—the researchers operated in an extreme low-resource setting with only 10 training files. By fine-tuning the `pyannote.audio` segmentation model with targeted hyperparameter optimization, they achieved a Diarization Error Rate (DER) of 26.723%. The core innovation of ShobdoSetu is not a brand-new model architecture, but a sophisticated data engineering pipeline that proves careful, domain-specific data curation can yield state-of-the-art results where large, labeled datasets are unavailable. This approach is a blueprint for developing speech technology for other under-served languages globally.
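DER measures the share of audio time that is mis-attributed. As a rough illustration (not the authors' evaluation code, which would typically rely on a toolkit such as `pyannote.metrics`), a simplified frame-based version can be computed by rasterizing reference and hypothesis segments onto a time grid and brute-forcing the speaker mapping:

```python
from itertools import permutations

def frame_labels(segments, duration, step=0.01):
    """Rasterize (start, end, speaker) segments onto a frame grid.

    None marks non-speech; overlapping speech is ignored for simplicity.
    """
    n = round(duration / step)
    labels = [None] * n
    for start, end, speaker in segments:
        for i in range(round(start / step), min(round(end / step), n)):
            labels[i] = speaker
    return labels

def diarization_error_rate(reference, hypothesis, duration):
    """Simplified frame-level DER: mis-attributed time / total speech time."""
    ref = frame_labels(reference, duration)
    hyp = frame_labels(hypothesis, duration)
    ref_speakers = sorted({s for s in ref if s is not None})
    hyp_speakers = sorted({s for s in hyp if s is not None})
    total_speech = sum(1 for r in ref if r is not None)
    best = None
    # Hypothesis speaker labels are arbitrary, so try every mapping onto
    # reference speakers and keep the most favorable one.
    for perm in permutations(hyp_speakers):
        mapping = dict(zip(perm, ref_speakers))
        errors = 0
        for r, h in zip(ref, hyp):
            if r is None and h is not None:
                errors += 1                      # false alarm
            elif r is not None and h is None:
                errors += 1                      # missed speech
            elif r is not None and mapping.get(h) != r:
                errors += 1                      # speaker confusion
        best = errors if best is None else min(best, errors)
    return best / total_speech
```

Real evaluations add refinements this sketch omits, such as handling overlapping speech and a forgiveness collar around segment boundaries.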

Key Points
  • Achieved a 15.551% Word Error Rate for Bengali speech recognition by fine-tuning Whisper-medium on 21,000 curated YouTube data points.
  • Attained a 26.723% Diarization Error Rate (identifying who spoke when) using only 10 training files with the pyannote.audio segmentation model.
  • Built a novel data pipeline using LLM-assisted normalization and fuzzy-matching to create a clean training set from noisy web sources.
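The WER figure above has a simple definition: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal sketch via word-level Levenshtein distance (evaluation toolkits such as `jiwer` implement the same idea with text-normalization options):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```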

Why It Matters

Provides a scalable blueprint for building accurate speech AI for hundreds of other low-resource languages, expanding global digital access.