Exploring Speech Foundation Models for Speaker Diarization Across Lifespan
New study finds speech AI struggles with children's and older adults' voices, with performance dropping by up to 30%.
A team from the University of Southern California, led by Anfeng Xu, has published a study on arXiv investigating how robust modern speech AI is across age groups. The research focuses on speaker diarization, the task of determining who spoke when in a conversation, using a unified framework called EEND-VC (end-to-end neural diarization with vector clustering). The core finding is that diarization systems built on speech foundation models such as OpenAI's Whisper encoder, when trained only on typical adult speech, suffer substantial performance degradation when applied to conversations involving children or older adults. This "age-related domain shift" reveals a blind spot in current systems.
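To make the task concrete, here is a minimal sketch of how diarization output is commonly scored. It assumes the metric is diarization error rate (DER), the usual choice for this task; the summary above does not say which metric the study reports. The function name `frame_der`, the frame-level simplification (no overlapping speech, no forgiveness collar), and the toy labels are all illustrative assumptions, not the authors' evaluation code.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Simplified frame-level diarization error rate (illustrative only).

    Both inputs are equal-length lists with one speaker label per frame;
    None marks a non-speech frame. Overlapping speech is not modeled.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})
    speech_frames = sum(r is not None for r in reference)
    if speech_frames == 0:
        raise ValueError("reference contains no speech frames")

    # System labels are arbitrary, so try every one-to-one mapping from
    # hypothesis speakers to reference speakers. Placeholder targets let
    # a hypothesis speaker stay unmatched.
    placeholders = [f"__unmatched_{i}" for i in range(len(hyp_speakers))]
    candidates = ref_speakers + placeholders

    best_errors = None
    for perm in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        # A frame is wrong if it is missed speech, a false alarm,
        # or attributed to the wrong speaker under this mapping.
        errors = sum(
            r != mapping.get(h) for r, h in zip(reference, hypothesis)
        )
        if best_errors is None or errors < best_errors:
            best_errors = errors

    return best_errors / speech_frames


# Toy example: two true speakers, one confused frame and one missed frame.
reference = ["A", "A", "A", None, "B", "B", "B", "B"]
hypothesis = ["s1", "s1", "s2", None, "s2", "s2", "s2", None]
print(f"{frame_der(reference, hypothesis):.3f}")  # prints 0.286
```

In this toy case two of the seven speech frames are wrong, so the error rate is 2/7. Real scoring tools work on time-stamped segments rather than fixed frames, and the brute-force mapping search here is only practical for a handful of speakers.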
The study evaluated three strategies: zero-shot inference, joint multi-age training, and targeted adaptation. Results showed that simply applying an adult-trained model to other age groups leads to poor performance. Training a single model on a diverse dataset spanning all age groups, however, improved robustness without reducing accuracy on canonical adult conversations. The most effective method was targeted adaptation, in which the model is fine-tuned for a particular age group; it yielded the best diarization performance, especially when paired with the Whisper encoder.
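The difference between the three strategies comes down to which data the model sees before it is tested on a given age group. The sketch below spells that out as a small lookup. The group labels, the function name `training_plan`, and the assumption that adaptation starts from an adult-trained model are all hypothetical; the summary does not specify the starting checkpoint or the datasets used.

```python
# Hypothetical age-group labels, chosen for illustration only.
ADULT = "adult"
GROUPS = ("child", "adult", "older_adult")


def training_plan(strategy, target_group):
    """Return the (stage, data groups) steps run before evaluating on target_group."""
    if target_group not in GROUPS:
        raise ValueError(f"unknown age group: {target_group}")

    if strategy == "zero_shot":
        # Train on adult speech only, then evaluate directly on the target group.
        return [("train", (ADULT,))]
    if strategy == "joint":
        # Train one model on data pooled across all age groups.
        return [("train", GROUPS)]
    if strategy == "adaptation":
        # Assumption: start from an adult-trained model, then fine-tune
        # on data from the target age group.
        return [("train", (ADULT,)), ("fine_tune", (target_group,))]
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("zero_shot", "joint", "adaptation"):
    print(name, training_plan(name, "child"))
```

Read this way, zero-shot inference never exposes the model to the target group, joint training exposes it to every group at once, and adaptation adds a dedicated second stage for a single group.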
This work highlights a major challenge for real-world deployment of speech AI in education, healthcare, and customer service, where voices vary widely. The proposed solutions of multi-age training and targeted adaptation provide a clear path forward for developers to build more inclusive and accurate conversational AI systems that work for everyone, not just adults.
- Models trained only on adult speech show a performance drop of up to 30% when analyzing conversations involving children or older adults.
- Joint training on multi-age data improves model robustness without sacrificing performance on standard adult speech.
- Targeted adaptation for specific age groups, using Whisper's encoder, yields the largest gains in diarization accuracy.
Why It Matters
Voice AI in classrooms, clinics, and call centers needs to work accurately for people of all ages, not just adults.