[P] Whisper Accent — Accent-Aware English Speech Recognition
Open-source project fine-tunes Whisper with accent conditioning, cutting word error rate by 4.1 percentage points while keeping more than 90% of parameters frozen.
Open-source developer Mavleo96 has released Whisper Accent, an adaptation of OpenAI's Whisper speech recognition model optimized for accented English. The project adds accent-aware conditioning while leaving the pretrained weights untouched, with the aim of preserving the original model's generalization capabilities.
The implementation extends Whisper with Adaptive Layer Norm (AdaLN) in every decoder layer, where accent-specific embeddings modulate the decoder hidden states. Crucially, the pretrained encoder and decoder weights remain frozen, which preserves Whisper's original capabilities. Fewer than 10% of parameters are trained: the AdaLN modulation weights, the accent embeddings, and a classifier head. This classifier predicts the accent from encoder states using learnable weighted sums, projection layers, and multi-head attention pooling.
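To make the conditioning idea concrete, here is a minimal, dependency-free sketch of adaptive layer norm in plain Python. It is not the project's code: the class name `AdaLN`, the helper functions, the vector sizes, and the zero initialization of the modulation projection are all assumptions chosen for illustration, and a real implementation would use a tensor library and operate on batches of decoder states.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, x):
    """Compute weight @ x + bias for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class AdaLN:
    """Hypothetical adaptive layer norm: scale and shift come from a condition vector."""

    def __init__(self, hidden_dim, cond_dim):
        self.hidden_dim = hidden_dim
        # Assumption: the projection starts at zero, so the layer initially
        # behaves like a plain layer norm and leaves the frozen model unchanged.
        self.weight = [[0.0] * cond_dim for _ in range(2 * hidden_dim)]
        self.bias = [0.0] * (2 * hidden_dim)

    def __call__(self, hidden, cond):
        # Project the accent embedding to one scale and one shift per channel.
        mod = linear(self.weight, self.bias, cond)
        scale, shift = mod[:self.hidden_dim], mod[self.hidden_dim:]
        normed = layer_norm(hidden)
        return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


# Illustrative accent embeddings (made-up values, not learned ones).
accent_embeddings = {"indian": [0.3, -0.7], "german": [-0.2, 0.9]}

adaln = AdaLN(hidden_dim=4, cond_dim=2)
decoder_state = [0.5, -1.0, 2.0, 0.0]
conditioned = adaln(decoder_state, accent_embeddings["indian"])
```

In this sketch only `weight`, `bias`, and the accent embeddings would receive gradient updates; everything else stands in for the frozen Whisper weights.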
Evaluation on the westbrook/English_Accent_DataSet shows a clear gain: Whisper Accent-medium.en achieves a 13.4% word error rate (WER), a 4.1 percentage point absolute improvement (roughly 23% relative) over the baseline Whisper-medium.en's 17.5% WER. The model also reaches 95.7% accuracy in accent classification across 20+ supported accents, including American, British, Indian, Spanish, German, French, and various Eastern European variants.
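For readers unfamiliar with the metric, the sketch below shows how WER is commonly computed (word-level edit distance divided by the number of reference words) and how the absolute and relative reductions follow from the two reported figures. The function names and the example sentences are illustrative assumptions, not part of the project's evaluation code.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for a single reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def wer_reduction(baseline, improved):
    """Return (absolute, relative) reduction between two WER values."""
    absolute = baseline - improved
    return absolute, absolute / baseline


# One substitution and one deletion against a six-word reference.
example = wer("the cat sat on the mat", "the cat sit on mat")

# Figures reported in the article, in percent.
absolute, relative = wer_reduction(17.5, 13.4)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1%}")
```

Dataset-level WER is usually aggregated as total errors over total reference words rather than as an average of per-utterance scores, so per-sentence values like `example` are only a building block.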
This research matters because mainstream ASR systems often underperform on non-standard accents, creating accessibility barriers. By open-sourcing the full training setup and checkpoints, Mavleo96 enables both practical applications and further research into accent-adaptive speech recognition without requiring massive retraining of foundation models.
- Achieves 13.4% WER with Whisper Accent-medium.en, a 4.1 percentage point absolute improvement over baseline Whisper-medium.en's 17.5%
- Trains fewer than 10% of parameters via AdaLN conditioning while freezing the pretrained encoder/decoder weights, preserving original generalization
- Supports 20+ English accents with 95.7% classification accuracy and provides a full open-source training pipeline
Why It Matters
Makes speech recognition more accessible globally by improving accuracy on non-standard English accents without retraining the full model.