TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
New AI framework leverages Mandarin subtitles to transcribe Taiwanese Hokkien with 14.77% fewer errors.
A research team led by Cheng-Yeh Yang and Hung-Shin Lee has introduced TG-ASR, a novel framework designed to tackle the persistent challenge of automatic speech recognition (ASR) for low-resource languages. The work, accepted at LREC 2026, specifically targets Taiwanese Hokkien, a language where transcribed audio data is scarce but a wealth of spoken content exists in dramas with Mandarin subtitles. The core innovation is a translation-guided learning approach that allows the ASR system to leverage these readily available Mandarin translations as a supervisory signal, effectively using a high-resource language to bootstrap understanding of a low-resource one. To support this research, the team also released YT-THDC, a new 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and verified transcriptions.
The technical breakthrough of TG-ASR is its Parallel Gated Cross-Attention (PGCA) mechanism. This component sits within the ASR decoder and adaptively integrates semantic embeddings from auxiliary languages (such as Mandarin) into the prediction of characters in the target language (Hokkien). The gated design is crucial: it controls the flow of cross-linguistic information, ensuring the model receives robust semantic guidance from translations while minimizing interference and maintaining stable optimization. Comprehensive experiments identified which auxiliary languages most effectively enhance performance, and the final system demonstrates a 14.77% relative reduction in Character Error Rate (CER). This approach provides a practical blueprint for applying AI to hundreds of other underrepresented languages that suffer from similar data scarcity but may have paired translation content.
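To make the idea concrete, the gated combination of a speech branch and a translation branch can be sketched as below. This is a minimal, hypothetical NumPy illustration of the general pattern (two parallel cross-attention heads, with a sigmoid gate scaling the translation branch), not the authors' actual PGCA implementation; all names, shapes, and the single-head, single-layer setup are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, memory, wq, wk, wv):
    """Single-head scaled dot-product cross-attention (illustrative)."""
    Q, K, V = query @ wq, memory @ wk, memory @ wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def pgca_step(h, speech_enc, trans_emb, params):
    """Hypothetical parallel gated cross-attention step.

    h:          decoder hidden states, shape (t, d)
    speech_enc: speech encoder outputs, shape (s, d)
    trans_emb:  auxiliary translation embeddings, shape (m, d)
    """
    # Parallel branches: attend to speech and to the translation.
    a_speech = cross_attention(h, speech_enc, *params["speech"])
    a_trans = cross_attention(h, trans_emb, *params["trans"])
    # Sigmoid gate, computed from the decoder state and the translation
    # context, scales how much cross-lingual guidance flows in.
    gate_in = np.concatenate([h, a_trans], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(gate_in @ params["wg"])))
    return a_speech + g * a_trans

# Toy usage with random weights.
rng = np.random.default_rng(0)
d = 4
params = {
    "speech": [rng.standard_normal((d, d)) for _ in range(3)],
    "trans": [rng.standard_normal((d, d)) for _ in range(3)],
    "wg": rng.standard_normal((2 * d, 1)),
}
out = pgca_step(
    rng.standard_normal((2, d)),   # 2 decoder positions
    rng.standard_normal((5, d)),   # 5 speech frames
    rng.standard_normal((3, d)),   # 3 translation tokens
    params,
)
```

The gate is the key design choice: when the translation context is unhelpful for a given character, the gate can squash that branch toward zero, which is one way to keep auxiliary supervision from destabilizing training.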
- Uses a novel Parallel Gated Cross-Attention (PGCA) mechanism to integrate multilingual translation data into the ASR decoder.
- Achieves a 14.77% relative reduction in Character Error Rate for Taiwanese Hokkien by learning from Mandarin subtitles.
- Introduces and releases YT-THDC, a new 30-hour open corpus of aligned Taiwanese Hokkien drama speech and Mandarin subtitles for research.
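For readers unfamiliar with the metric, "relative reduction" in CER compares the improvement to the baseline's error rate rather than subtracting percentage points. A quick sketch (the CER values below are hypothetical, chosen only to illustrate the arithmetic, not taken from the paper):

```python
def relative_cer_reduction(baseline_cer, improved_cer):
    """Relative reduction in Character Error Rate, as a percentage."""
    return (baseline_cer - improved_cer) / baseline_cer * 100.0

# Hypothetical example: dropping from 30.0% CER to 25.569% CER
# is a 14.77% relative reduction, even though the absolute drop
# is only about 4.4 percentage points.
reduction = relative_cer_reduction(30.0, 25.569)  # ≈ 14.77
```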
Why It Matters
Provides a scalable blueprint for building accurate speech AI for hundreds of low-resource, underrepresented global languages using existing translation data.