Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features
A new study maps how AI speech models like WavLM encode pitch, gender, and intensity in individual dimensions.
A team of researchers from Stellenbosch University and the University of the Witwatersrand has published a paper on arXiv titled 'Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features'. The work asks how models like WavLM, trained via self-supervised learning (SSL), internally organize information about a speaker's voice. Rather than analyzing entire feature vectors, the researchers tested whether specific characteristics are captured within individual dimensions of the learned representations, using principal component analysis (PCA) to uncover the underlying structure.
The researchers applied PCA to utterance-averaged representations from WavLM. They found that the principal dimension explaining the most variance correlates strongly with pitch and with associated characteristics such as perceived gender. Other individual principal dimensions encode intensity (loudness), background noise level, the second formant (an acoustic property that helps distinguish vowel sounds), and higher-frequency spectral detail. In synthesis experiments, the team showed that modifying one of these dimensions changes the corresponding voice characteristic in the output. This offers an interpretable and potentially simpler way to make fine-grained voice adjustments in text-to-speech and voice conversion, in contrast to black-box approaches.
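The analysis described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the "features" are synthetic random vectors standing in for WavLM frame outputs, the `pitch` array is a made-up per-utterance mean F0, and the helper names `utterance_average` and `fit_pca` are hypothetical. It shows the three steps: average frames per utterance, run PCA, and correlate the first principal dimension with an acoustic measurement.

```python
import numpy as np

def utterance_average(frames: np.ndarray) -> np.ndarray:
    """Average frame-level features of shape (T, D) into one vector of shape (D,)."""
    return frames.mean(axis=0)

def fit_pca(X: np.ndarray):
    """PCA via SVD on mean-centered data.

    Returns the data mean, the principal directions (one per row, sorted by
    explained variance), and the explained-variance ratio of each direction.
    """
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    var = s ** 2 / (X.shape[0] - 1)
    return mean, vt, var / var.sum()

rng = np.random.default_rng(0)
n_utts, dim = 200, 16

# ASSUMPTION: synthetic stand-in data. A hidden direction in feature space
# varies with a made-up per-utterance pitch value (mean F0 in Hz).
pitch = rng.normal(150.0, 40.0, size=n_utts)
hidden_direction = rng.normal(size=dim)
hidden_direction /= np.linalg.norm(hidden_direction)

utterances = []
for f0 in pitch:
    n_frames = int(rng.integers(50, 100))
    frames = rng.normal(size=(n_frames, dim)) + 0.1 * (f0 - 150.0) * hidden_direction
    utterances.append(utterance_average(frames))
X = np.stack(utterances)  # shape (n_utts, dim)

mean, components, ratio = fit_pca(X)
scores = (X - mean) @ components.T  # coordinates of each utterance in PCA space

# The sign of a principal direction is arbitrary, so report the magnitude.
corr = np.corrcoef(scores[:, 0], pitch)[0, 1]
print(f"PC1 explains {ratio[0]:.1%} of variance, |corr with pitch| = {abs(corr):.3f}")
```

With real data, `X` would hold one averaged WavLM vector per utterance and `pitch` would come from a pitch tracker; the same correlation check can be repeated for other dimensions against intensity, noise level, or formant measurements.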
- PCA of WavLM's utterance-averaged features showed that the first principal dimension encodes pitch and perceived gender.
- Other individual dimensions were found to encode intensity, noise, the second formant, and high-frequency spectral detail.
- Synthesis experiments showed that these dimensions can be modified to control the corresponding voice characteristic.
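To make "modifying a dimension" concrete, here is a hedged sketch of the feature-space edit only. It assumes synthetic data in place of WavLM features, and the helper names `fit_pca` and `shift_along_component` are hypothetical. The synthesis stage that would turn the edited features back into audio is not shown, and the exact procedure used in the paper may differ.

```python
import numpy as np

def fit_pca(X: np.ndarray):
    """PCA via SVD; returns the mean, the principal directions (rows),
    and the standard deviation of the data along each direction."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    stds = s / np.sqrt(X.shape[0] - 1)
    return mean, vt, stds

def shift_along_component(x, components, stds, k, delta):
    """Move feature vector x by `delta` standard deviations along principal
    direction k, leaving its coordinates on all other directions unchanged."""
    return x + delta * stds[k] * components[k]

rng = np.random.default_rng(1)

# ASSUMPTION: synthetic features with unequal variance per axis.
X = rng.normal(size=(300, 8)) * np.array([5.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0])
mean, components, stds = fit_pca(X)

x = X[0]
x_shifted = shift_along_component(x, components, stds, k=0, delta=2.0)

# Project both vectors into PCA space to confirm only coordinate 0 moved.
before = (x - mean) @ components.T
after = (x_shifted - mean) @ components.T
print("change per principal dimension:", np.round(after - before, 3))
```

Because the principal directions are orthonormal, the shift changes exactly one PCA coordinate. In the setting the article describes, that coordinate might be the pitch-related first dimension, and the shifted features would then be passed to a synthesizer.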
Why It Matters
The findings could enable more interpretable and precise control over synthesized voices in TTS, audio editing, and accessibility tools.