The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
Researchers find emotion2vec rewards mimicry, not genuine emotional expression...
A team of researchers from National Taiwan University (NTU) and the University of Southern California (USC) has published a critical study titled "The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation," submitted to Interspeech 2026. The paper challenges the widespread use of emotion embedding similarity metrics, such as those derived from emotion2vec, to evaluate emotional expressiveness in speech generation models. These metrics compute the cosine similarity between embeddings of a reference sample and a generated sample, on the assumption that the embeddings capture affective cues invariant to language and speaker identity. Through controlled adversarial tasks and human alignment tests, the authors demonstrate that these latent spaces are unsuitable for zero-shot similarity evaluation: linguistic and speaker interference overshadows emotional features, degrading the metrics' discriminative ability.
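The metric under critique reduces to plain cosine similarity over utterance-level embeddings. A minimal sketch follows; note that the embedding extraction step (in practice, a forward pass through an emotion2vec encoder, e.g. via the FunASR toolkit) is assumed and mocked here with random vectors, and the 768-dimensional size is an assumption, not a claim about the paper's setup.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical utterance-level embeddings; a real pipeline would obtain
# these from an emotion2vec checkpoint, not from a random generator.
rng = np.random.default_rng(0)
reference = rng.standard_normal(768)
# A near-copy of the reference's features scores high regardless of
# whether any genuine emotional expression was synthesized.
generated = reference + 0.1 * rng.standard_normal(768)
unrelated = rng.standard_normal(768)

print(cosine_similarity(reference, generated))  # close to 1.0
print(cosine_similarity(reference, unrelated))  # near 0 for random vectors
```

This illustrates the failure mode the authors describe: the score measures proximity in embedding space, so acoustic mimicry of the reference is rewarded even when no emotional content is transferred.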
The study reveals a critical misalignment between these metrics and human perception: the metrics reward acoustic mimicry rather than genuine emotional synthesis. This "acoustic vulnerability" means a model can achieve high scores simply by copying the reference's surface acoustic features, without producing authentic emotional expression. The findings have significant implications for expressive speech synthesis and voice conversion, where accurate transfer of emotional prosody is essential. The researchers argue that the field needs objective metrics that align with human judgment and capture true emotional expressiveness, rather than relying on flawed embedding-based similarity measures.
- Cosine similarity metrics built on emotion2vec embeddings are vulnerable to linguistic and speaker interference, which overshadows emotional features.
- Controlled adversarial tasks and human alignment tests show these metrics misalign with human perception of emotion.
- The metrics reward acoustic mimicry over genuine emotional synthesis, degrading evaluation of speech generation models.
Why It Matters
Flawed metrics could mislead progress in expressive speech AI, requiring new evaluation methods for authentic emotion.