Audio & Speech

Bias and Fairness in Self-Supervised Acoustic Representations for Cognitive Impairment Detection

AI models for detecting cognitive impairment from speech show significant performance gaps across gender and age groups.

Deep Dive

A research team from Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) has published a systematic analysis revealing significant fairness issues in AI models designed to detect cognitive impairment (CI) and depression from speech patterns. The study, titled 'Bias and Fairness in Self-Supervised Acoustic Representations for Cognitive Impairment Detection,' compared traditional acoustic feature sets such as MFCCs and eGeMAPS with contextualized embeddings from the self-supervised model Wav2Vec 2.0 (W2V2). Using the clinical DementiaBank Pitt Corpus, the researchers found that while higher-layer W2V2 embeddings outperformed the baselines for CI detection, they exhibited troubling performance disparities across demographic subgroups, raising critical questions about deploying such tools in real-world clinical settings.
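
For readers who want a sense of the layer-wise embedding setup described here, the following is a minimal sketch (not the authors' pipeline) of extracting mean-pooled Wav2Vec 2.0 hidden states per layer with the Hugging Face transformers library. The checkpoint name, mono downmixing, and mean pooling are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the paper's pipeline): extract layer-wise Wav2Vec 2.0
# hidden states and mean-pool them over time, so each utterance yields one
# fixed-size vector per transformer layer for a downstream classifier.
# The checkpoint name and 16 kHz mono input are illustrative assumptions.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base-960h"  # assumed checkpoint, not from the paper
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
model.eval()

def layerwise_embeddings(wav_path: str) -> torch.Tensor:
    """Return a (num_layers + 1, hidden_size) tensor of mean-pooled states."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = waveform.mean(dim=0)               # downmix to mono
    if sr != 16_000:                              # W2V2 expects 16 kHz audio
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000,
                               return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs.input_values, output_hidden_states=True)
    # hidden_states: one (1, time, hidden) tensor per layer, plus the CNN output
    pooled = [h.mean(dim=1).squeeze(0) for h in outputs.hidden_states]
    return torch.stack(pooled)

# Example: take a higher layer as the utterance-level representation
# emb = layerwise_embeddings("speech_sample.wav")[-1]
```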

The technical analysis showed that for CI detection, W2V2 achieved an Unweighted Average Recall (UAR) of up to 80.6%. However, the Area Under the Curve (AUC) was significantly lower for female participants (0.769) and younger participants (0.746) than for their counterparts. The specificity gap (Δ_spec) reached up to 18% for female and 15% for younger participants, meaning the model was more likely to incorrectly label healthy speakers in these groups as impaired. Furthermore, the study found limited cross-task generalization between CI and depression classification, suggesting that each condition relies on distinct acoustic markers. These findings underscore that raw performance metrics are insufficient: rigorous, fairness-aware evaluation with subgroup-specific analysis is needed before clinical adoption, especially given the demographic heterogeneity of patient populations.
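
To make the metrics concrete, here is a small sketch, not the authors' evaluation code, of how UAR, per-group AUC, and a specificity gap of the kind reported (Δ_spec) can be computed with scikit-learn. The variable names, decision threshold, and label convention (1 = impaired, 0 = healthy) are assumptions for illustration.

```python
# Sketch of the evaluation metrics mentioned above (UAR, per-group AUC,
# specificity gap). Names and the label convention (1 = impaired, 0 = healthy)
# are assumptions for illustration, not taken from the paper.
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score, confusion_matrix

def specificity(y_true, y_pred):
    """Specificity = TN / (TN + FP): how often healthy speakers stay labeled healthy."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp)

def fairness_report(y_true, y_score, group, threshold=0.5):
    """Overall UAR plus per-group AUC and the between-group specificity gap."""
    y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
    y_pred = (y_score >= threshold).astype(int)

    # UAR for a binary task is the macro-averaged recall over both classes.
    report = {"UAR": recall_score(y_true, y_pred, average="macro")}
    specs = {}
    for g in np.unique(group):
        mask = group == g
        specs[g] = specificity(y_true[mask], y_pred[mask])
        report[f"AUC[{g}]"] = roc_auc_score(y_true[mask], y_score[mask])
    # Δ_spec: largest difference in specificity between demographic subgroups.
    report["delta_spec"] = max(specs.values()) - min(specs.values())
    return report

# Example with toy data for two subgroups 'F' and 'M':
# print(fairness_report(y_true=[0, 1, 0, 1, 0, 1],
#                       y_score=[0.2, 0.9, 0.7, 0.8, 0.1, 0.6],
#                       group=['F', 'F', 'F', 'M', 'M', 'M']))
```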

Key Points
  • Wav2Vec 2.0 embeddings for dementia detection showed an 18% specificity gap against female participants, leading to higher false positive rates.
  • The AI model achieved up to 80.6% Unweighted Average Recall but performed worse for younger subjects, with a 15% specificity disparity.
  • The research highlights that high overall accuracy masks critical biases, necessitating subgroup analysis before deploying AI diagnostic tools in clinics.

Why It Matters

Highlights that AI diagnostic tools with good average performance can still be dangerously unfair, risking misdiagnosis for vulnerable patient groups.