Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition
New training technique reduces average word error rate from 10.56% to 7.66% across 10 languages.
A research team led by Lukuang Dong and Ziwei Li has published a paper titled 'Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition' that significantly improves automatic speech recognition (ASR) across multiple languages. Their work tackles phoneme-to-grapheme (P2G) conversion, the problem of turning sound units (phonemes) into written text (graphemes), using large language models (LLMs). The researchers tested their methods on the CV-Lang10 benchmark, which covers ten languages, addressing two major challenges: language-aware generation and severe cross-language data imbalance, where some languages have far less training data than others.
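To make the P2G framing concrete, here is a minimal sketch of what phoneme-to-grapheme decoding with a causal LLM might look like. The model name, prompt format, and language tag below are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of LLM-based phoneme-to-grapheme (P2G) decoding.
# The model name and prompt format are placeholders, not the paper's artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-org/p2g-llm"  # placeholder: an LLM fine-tuned on phoneme/text pairs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def phonemes_to_text(phonemes: list[str], lang: str) -> str:
    """Frame P2G as conditional generation; a language tag makes the
    generation language-aware (one of the two challenges named above)."""
    prompt = f"<{lang}> " + " ".join(phonemes) + " =>"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64)
    # Strip the prompt tokens, keeping only the generated grapheme string.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# IPA-like phonemes for English "hello":
print(phonemes_to_text(["h", "ə", "l", "oʊ"], lang="en"))
```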
Their key innovation is Simplified SKM (S-SKM), a Monte Carlo approximation technique that removes the need for CTC-based speech-to-phoneme probability weighting during P2G training, making the system more robust to uncertainty in the initial speech-to-phoneme conversion. Combined with robust training techniques and oversampling of low-resource languages, their approach reduced the average word error rate (WER) across all ten languages from 10.56% to 7.66%, a roughly 27% relative reduction in WER.
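The Monte Carlo idea can be sketched as follows: rather than weighting each phoneme hypothesis by its CTC posterior, draw a handful of samples from the speech-to-phoneme (S2P) model and average the P2G losses uniformly. This is one plausible reading of S-SKM as described above; the model interfaces (`sample`, `nll`) are hypothetical.

```python
# Minimal sketch of Monte Carlo training in the spirit of S-SKM as described
# above. All model interfaces here (sample, nll) are hypothetical.
import torch

def s_skm_loss(s2p_model, p2g_model, speech, target_text, k: int = 4):
    """Estimate E_{p(phonemes | speech)}[-log p(text | phonemes)] by sampling.

    Averaging over samples drawn from the S2P distribution estimates the
    expected loss without ever computing CTC-based weights, and exposes
    P2G training to realistic S2P errors (robustness)."""
    losses = []
    for _ in range(k):
        phonemes = s2p_model.sample(speech)                  # hypothetical: one sampled phoneme sequence
        losses.append(p2g_model.nll(phonemes, target_text))  # hypothetical: token-level NLL
    # Uniform average = Monte Carlo estimate; no CTC probability weighting needed.
    return torch.stack(losses).mean()
```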
The paper, submitted to INTERSPEECH 2026 and available on arXiv, demonstrates how separating ASR into speech-to-phoneme and phoneme-to-grapheme components allows cross-lingual acoustic sharing while preserving language-specific orthography: the acoustic model is shared across languages, and the P2G component handles each language's unique spelling and writing conventions. The work is an important step toward more efficient and accurate multilingual speech recognition systems that don't require a completely separate model for each language.
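A toy example shows why the P2G stage must remain language-specific even when acoustics are shared: the same phoneme is spelled differently across orthographies. The lookup table below is a deliberately simplified stand-in for the paper's LLM-based P2G component.

```python
# Toy illustration: the phoneme /ʃ/ is written "sh" in English, "sch" in
# German, and "ch" in French. A shared acoustic model can emit /ʃ/ for all
# three; only the P2G stage needs each language's spelling rules.
SH_SPELLING = {"en": "sh", "de": "sch", "fr": "ch"}

def spell(phonemes: list[str], lang: str) -> str:
    """Map phonemes to graphemes with language-specific rules (a stand-in
    for a learned P2G model)."""
    return "".join(SH_SPELLING[lang] if p == "ʃ" else p for p in phonemes)

print(spell(["ʃ", "u"], "en"))  # -> "shu"
print(spell(["ʃ", "u"], "de"))  # -> "schu"
print(spell(["ʃ", "u"], "fr"))  # -> "chu"
```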
- Simplified SKM (S-SKM) method reduces average WER from 10.56% to 7.66% across 10 languages
- Addresses data imbalance through low-resource oversampling and robust training techniques (a sampling sketch follows this list)
- Enables shared acoustic models across languages while maintaining language-specific text generation
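One common recipe for the oversampling mentioned above is temperature-based sampling over per-language data sizes. The paper reports oversampling low-resource languages; the specific scheme and the hours-per-language figures below are assumptions for illustration only.

```python
# Temperature-based language sampling: p(lang) is proportional to
# size(lang) ** tau. With tau < 1, low-resource languages are drawn more
# often than their raw data share. The per-language figures are made up.
import random

def sampling_weights(sizes: dict[str, float], tau: float = 0.5) -> dict[str, float]:
    scaled = {lang: n ** tau for lang, n in sizes.items()}
    total = sum(scaled.values())
    return {lang: w / total for lang, w in scaled.items()}

hours = {"en": 2000, "es": 400, "it": 300, "ky": 30, "sv": 20}  # hypothetical
weights = sampling_weights(hours)
langs, probs = zip(*weights.items())
print(weights)                                    # low-resource shares are boosted
print(random.choices(langs, weights=probs, k=8))  # languages drawn for one batch mix
```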
Why It Matters
Enables more accurate speech-to-text for global applications, especially for low-resource languages where training data is scarce.