Researchers expose 'WER Trap' flaw in unified speech tokens
Low Word Error Rate hides fatal loss of acoustic detail for generation.
The pursuit of a single discrete token for both speech understanding and generation has led the Speech Language Model (SLM) community to rely heavily on Word Error Rate (WER) as the definitive quality metric. The paper argues that low-WER tokens from Whisper-style tokenizers fail to preserve the fine-grained articulation and micro-dynamics needed for ODE-based acoustic synthesis. Tokens at higher frame rates succeed only because acoustic information leaks into them implicitly; isolating pure semantic information at ultra-low frame rates strips away that essential acoustic detail.
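For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below is a minimal, generic implementation for illustration only; the function name `word_error_rate` is hypothetical and is not taken from the paper or from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

The function sees only word strings. Two utterances with identical transcripts score the same WER however differently they were articulated, which is why a low WER by itself says little about how much acoustic detail a token stream retains.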
To test this claim empirically, the team developed a dynamic compression tokenizer that aligns representations with semantic boundaries, achieving ultra-low frame rates with exceptionally low WER. When these isolated 'pure' semantic tokens were used to condition generative models (even with oracle duration alignments), the reconstructed speech suffered severe articulation blur and became acoustically unintelligible. The results demonstrate that the semantic categorization rewarded by low WER is orthogonal to the continuous phonetic trajectories required for synthesis. The authors conclude that the unified token is an illusion and advocate explicitly decoupled speech representations.
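The intuition behind boundary-aligned compression can be sketched with a toy example. The code below is not the paper's tokenizer: it assumes the segment boundaries are already known and simply mean-pools the frame vectors inside each segment, and the name `pool_by_boundaries` is hypothetical.

```python
def pool_by_boundaries(frames, boundaries):
    """Mean-pool frame vectors inside each [start, end) segment.

    frames:     list of equal-length feature vectors, one per frame
    boundaries: list of (start, end) frame indices, one pair per segment
    Returns one pooled vector per segment.
    """
    pooled = []
    for start, end in boundaries:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid segment ({start}, {end})")
        segment = frames[start:end]
        dim = len(segment[0])
        # Average each feature dimension over the frames in the segment.
        pooled.append([
            sum(frame[d] for frame in segment) / len(segment)
            for d in range(dim)
        ])
    return pooled


# A rising trajectory and a flat one collapse to the same pooled vector.
rising = [[0.0], [0.5], [1.0]]
flat = [[0.5], [0.5], [0.5]]
assert pool_by_boundaries(rising, [(0, 3)]) == pool_by_boundaries(flat, [(0, 3)])
```

In this toy setup, three frames become one vector per segment, so the token rate drops, but the within-segment trajectory is averaged away. That is the kind of micro-dynamic detail the paper argues synthesis needs, even though the segment's category is still enough to keep WER low.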
- For Whisper-style tokenizers, WER measures understanding capability, not generation quality.
- Dynamic compression tokenizer achieved ultra-low frame rates while maintaining low WER.
- Pure semantic tokens produced acoustically unintelligible speech, which the authors take as evidence that understanding and generation need separate representations.
Why It Matters
The finding challenges the industry trend toward unified speech tokens and argues for rethinking SLM architectures for generation tasks.