Explainable AI in Speaker Recognition -- Making Latent Representations Understandable
New study finds that neural networks' latent representations form hierarchical clusters, not just flat ones...
In a new paper on arXiv (2604.23354), researchers Yanze Xu, Wenwu Wang, and Mark D. Plumbley from the University of Surrey tackle a key Explainable AI (XAI) question: how do neural networks organize their internal representations? Focusing on speaker recognition networks, they challenge the prevailing view that these representations form independent, flat clusters. Instead, they apply Single-Linkage Clustering (SLINK) and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to show that representations often form hierarchical clusters: nested groupings that reveal deeper organizational patterns.
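For readers who want to experiment with the same idea, here is a minimal sketch, not the authors' code, of running single-linkage and HDBSCAN over a batch of speaker embeddings. The random `embeddings` array, the cluster counts, and `min_cluster_size` are placeholder assumptions standing in for real network latents and tuned settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))  # placeholder for real speaker latents

# Single-linkage builds a full merge tree (scipy's implementation matches
# the SLINK algorithm's output); cutting the tree at different depths
# exposes nested levels of clusters.
tree = linkage(embeddings, method="single", metric="euclidean")
coarse = fcluster(tree, t=2, criterion="maxclust")  # 2 top-level groups
fine = fcluster(tree, t=8, criterion="maxclust")    # 8 nested subgroups

# HDBSCAN extracts a cluster hierarchy from density estimates and
# additionally marks low-density points as noise (label -1).
labels = HDBSCAN(min_cluster_size=5).fit_predict(embeddings)
```

If the representations really are hierarchical, the fine-grained clusters should sit inside the coarse ones rather than cutting across them.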
To make these hierarchies interpretable, the team designs a new algorithm, Hierarchical Cluster-Class Matching (HCCM), which performs one-to-one matching between predefined semantic classes (such as 'male' or 'UK') and the hierarchical clusters produced by SLINK or HDBSCAN. Results show that some clusters match individual classes, while others correspond to conjunctions of classes (e.g., 'male and UK'). The authors also introduce Liebig's score, a metric that quantifies matching performance and diagnoses the factors that limit it. This work offers a new lens on how speaker recognition networks encode attributes such as gender and accent, with implications for debugging, fairness, and interpretability in audio AI systems.
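The paper's exact HCCM procedure is not reproduced here, but the general shape of one-to-one cluster-class matching can be sketched with a standard Hungarian assignment over an overlap score. Everything below, the function name, the use of F1 as the overlap score, and the label arrays, is a hypothetical illustration rather than the authors' algorithm; with a hierarchy, the same matching could be run over clusters drawn from every level of the tree.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import f1_score

def match_clusters_to_classes(cluster_ids, class_ids):
    """Hypothetical one-to-one matching of clusters to semantic classes."""
    clusters = np.unique(cluster_ids)
    clusters = clusters[clusters != -1]  # drop HDBSCAN's noise label
    classes = np.unique(class_ids)
    score = np.zeros((len(clusters), len(classes)))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            # How well does membership in cluster c predict class k?
            score[i, j] = f1_score((class_ids == k).astype(int),
                                   (cluster_ids == c).astype(int),
                                   zero_division=0)
    # Hungarian assignment finds the best one-to-one pairing overall.
    rows, cols = linear_sum_assignment(-score)
    return [(clusters[r], classes[c], score[r, c])
            for r, c in zip(rows, cols)]
```

A conjunction such as 'male and UK' would enter this setup as its own candidate class (the intersection of the two attribute labels), letting a single cluster pair with a compound concept.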
- Applied SLINK and HDBSCAN to speaker recognition networks, revealing hierarchical clustering in latent representations
- Introduced HCCM algorithm to map hierarchical clusters to semantic classes like 'male' or 'UK'
- Proposed Liebig's score, a metric to quantify cluster-class matching performance and identify limiting factors (a hedged sketch of the idea follows this list)
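The summary states only that Liebig's score quantifies matching performance and diagnoses limiting factors; the name points at Liebig's law of the minimum, under which a system is capped by its scarcest resource. A hedged guess at the general shape, with every definition here hypothetical:

```python
def liebig_score(n_overlap: int, n_cluster: int, n_class: int):
    """Hypothetical sketch: score a cluster-class pair and name its limit.

    n_overlap: samples in both the cluster and the class
    n_cluster: total samples in the cluster
    n_class:   total samples in the class
    """
    precision = n_overlap / n_cluster  # cluster purity w.r.t. the class
    recall = n_overlap / n_class       # class coverage by the cluster
    limiting = "precision" if precision < recall else "recall"
    return min(precision, recall), limiting

score, limit = liebig_score(n_overlap=80, n_cluster=100, n_class=120)
print(f"score={score:.2f}, limited by {limit}")  # score=0.67, limited by recall
```

Taking the minimum both scores the match and names which side, purity or coverage, is holding it back, which fits the 'diagnose limiting factors' role described above; the paper's actual definition may differ.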
Why It Matters
Makes speaker recognition AI more interpretable, enabling better debugging and fairness analysis in voice-based systems.