Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
A new 882-hour Bengali dataset and an optimized pipeline tackle long-form speech recognition and speaker diarization.
A research team led by Sanjid Hasan has published a paper addressing critical gaps in Bengali Automatic Speech Recognition (ASR) and speaker diarization. Their work, 'Make It Hard to Hear, Easy to Learn,' introduces the Lipi-Ghor-882 dataset—an 882-hour multi-speaker Bengali corpus designed to tackle the severe scarcity of joint ASR and diarization resources. The research, detailing their submission to the DL Sprint 4.0 competition, systematically evaluates architectures for long-form Bengali speech and finds that simply scaling raw data is ineffective. Instead, they demonstrate that targeted fine-tuning with perfectly aligned annotations, combined with synthetic acoustic degradation (adding noise and reverberation), is the most effective method for improving ASR accuracy.
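The paper's synthetic acoustic degradation is not specified in detail here; a minimal, illustrative sketch of the general idea, assuming NumPy waveforms, additive Gaussian noise at a target SNR, and a crude exponential-decay impulse response standing in for a real room recording:

```python
import numpy as np

def add_noise(audio, snr_db, rng=None):
    """Mix Gaussian noise into a waveform at a target SNR (in dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def add_reverb(audio, rir):
    """Convolve a waveform with a room impulse response (RIR)."""
    wet = np.convolve(audio, rir)[: len(audio)]
    return wet / (np.max(np.abs(wet)) + 1e-9)  # renormalize peak

# toy example: 1 s of a 440 Hz tone at 16 kHz, degraded
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
rir = np.exp(-np.linspace(0, 8, 800))  # hypothetical decaying "room", not a measured RIR
degraded = add_reverb(add_noise(clean, snr_db=10), rir)
```

In practice, toolkits draw noise clips and measured impulse responses from corpora such as MUSAN or real RIR collections rather than generating them synthetically as above.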
For the complex task of speaker diarization—identifying 'who spoke when'—the team discovered that global state-of-the-art models like Diarizen performed poorly on their dataset. Extensive retraining yielded minimal gains, leading them to a counterintuitive solution: strategic, heuristic post-processing of baseline model outputs became the primary driver of accuracy gains. The culmination of this research is a highly optimized dual pipeline that achieves a remarkably low Real-Time Factor (RTF) of approximately 0.019—roughly 50 times faster than real time. This work establishes an empirically backed benchmark for practical, low-resource, long-form speech processing and provides a clear roadmap for advancing speech technology in under-resourced languages.
- Introduced the Lipi-Ghor-882 dataset, an 882-hour multi-speaker Bengali corpus for joint ASR and diarization.
- Found that targeted fine-tuning with perfect alignment and synthetic noise/reverb beats raw data scaling for ASR.
- Achieved a ~0.019 Real-Time Factor with an optimized pipeline, setting a new benchmark for low-resource speech AI.
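For readers unfamiliar with the metric, Real-Time Factor is simply processing time divided by audio duration; values below 1 mean faster than real time. A quick sketch of the arithmetic (the 68.4 s processing time is a made-up figure chosen to illustrate an RTF of ~0.019):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# e.g. processing a 60-minute recording in about 68 seconds (hypothetical timing)
rtf = real_time_factor(68.4, 60 * 60)
speedup = 1 / rtf  # how many hours of audio per hour of compute
```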
Why It Matters
Provides a scalable blueprint for building accurate, real-time speech AI for under-resourced global languages like Bengali.