Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
A new study shows that converting speech into discrete phonemes before feeding it to an LLM can outperform the standard projector approach, with major gains for low-resource languages like Tatar.
A research team including Ziwei Li and Zhijian Ou has published a pivotal study comparing methods for connecting speech encoders to Large Language Models (LLMs) for Automatic Speech Recognition (ASR). The core challenge is the 'speech-language interface'—how to translate raw audio features into a format the LLM can understand. The standard approach uses a learned 'projector' network to map encoder outputs directly into the LLM's embedding space. This paper rigorously tests an alternative: first converting speech into a sequence of discrete phonemes (the basic sound units of language), then feeding that text-like sequence to the LLM.
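The contrast between the two interfaces can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimensions, the single linear map standing in for a trained projector network, and the example phoneme string are all assumptions for clarity.

```python
import numpy as np

ENCODER_DIM = 1024  # assumed speech-encoder output size
LLM_DIM = 4096      # assumed LLM embedding size

rng = np.random.default_rng(0)

# Projector interface: a learned network (here, one linear map standing in
# for a small MLP) turns continuous encoder frames into a "soft prompt" of
# vectors living in the LLM's embedding space.
W = rng.standard_normal((ENCODER_DIM, LLM_DIM)) * 0.01

def project(frames):
    # frames: (time, ENCODER_DIM) -> (time, LLM_DIM)
    return frames @ W

frames = rng.standard_normal((50, ENCODER_DIM))
soft_prompt = project(frames)
print(soft_prompt.shape)  # (50, 4096)

# Phoneme interface: speech is first decoded into discrete phoneme symbols,
# which the LLM then consumes like ordinary text tokens.
phonemes = ["HH", "AH", "L", "OW"]  # "hello" in ARPAbet-style notation
prompt = "Transcribe: " + " ".join(phonemes)
print(prompt)  # Transcribe: HH AH L OW
```

The projector route hands the LLM opaque vectors it must learn to interpret; the phoneme route hands it interpretable symbols that already encode linguistic structure.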
Using identical speech encoder and LLM backbones for a fair comparison, the team evaluated these interfaces on high-resource English (LibriSpeech) and low-resource Tatar. The results challenge conventional wisdom. On LibriSpeech, the phoneme-based interface was competitive with the vanilla projector, and a novel 'BPE-phoneme' interface—which uses Byte-Pair Encoding to group frequent local phoneme patterns while preserving word boundaries—achieved even better performance. The breakthrough came in low-resource settings: on Tatar, the phoneme-based interface substantially outperformed the projector-based method. The research also yielded a 'phoneme-informed hybrid' interface that leverages phoneme supervision to create a stronger model than the standard projector, demonstrating the value of linguistic structure in the interface.
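The boundary-preserving BPE idea can be illustrated with a toy merge loop. This is a simplified sketch under stated assumptions: the phoneme corpus, the two merge steps, and the `+`-joined token notation are invented for illustration and do not come from the paper.

```python
from collections import Counter

# Toy phoneme corpus: each word is a list of phonemes. Because merges are
# counted only within a word, frequent patterns never cross word boundaries,
# mirroring the boundary-preserving constraint of the BPE-phoneme interface.
corpus = [
    ["DH", "AH"], ["K", "AE", "T"], ["S", "AE", "T"],
    ["AA", "N"], ["DH", "AH"], ["M", "AE", "T"],
]

def most_frequent_pair(words):
    """Count adjacent phoneme pairs within (never across) words."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + "+" + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for _ in range(2):  # two merge steps for illustration
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)

print(corpus)
# First merge fuses the frequent "AE T" pattern, second fuses "DH AH":
# [['DH+AH'], ['K', 'AE+T'], ['S', 'AE+T'], ['AA', 'N'], ['DH+AH'], ['M', 'AE+T']]
```

The resulting units are shorter sequences of frequent phoneme chunks, which gives the LLM fewer, more informative tokens per utterance while keeping word segmentation intact.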
This work has significant implications for building efficient and capable multilingual ASR systems. By using phonemes as an intermediate, interpretable representation, the method reduces the burden on the LLM to learn acoustic-to-linguistic mapping from scratch, leading to better data efficiency. This is particularly crucial for expanding high-quality speech recognition to the thousands of languages that lack massive annotated datasets, moving beyond a one-size-fits-all projector approach.
- Phoneme-based interfaces matched or beat standard 'projector' methods on English LibriSpeech benchmarks.
- The new 'BPE-phoneme' interface, grouping frequent sound patterns, delivered the best results on English.
- In low-resource Tatar, phoneme-based methods substantially outperformed projectors, highlighting major gains for underserved languages.
Why It Matters
This research provides a more efficient, effective blueprint for building LLM-powered speech recognition that works well for many languages, not just English.