Multimodal speech models show bias from seeing faces, study finds
Pairing a face with the audio raises WER by up to 4.05 points, depending on the demographics of the face shown.
A new study by Maya K. Nachesa, Vlad Niculae, and Vagrant Gautam, published on arXiv, presents the first systematic bias evaluation of multimodal speech recognition. The team tested how adding visual facial cues to audio affects transcription accuracy in two leading multimodal models: mWhisper-Flamingo and Gemini. By creating videos that pair the same audio track with different faces, varying self-declared gender and ethnicity, they measured how word error rate (WER) changed. The results show a significant quality-of-service disparity: WER increased by up to 4.05 points depending on the face shown, with intersections of gender and ethnicity producing the largest degradation. The findings challenge the assumption that more modalities always improve performance. Instead, they show that multimodal models can inherit and amplify biases present in visual data, potentially leading to unfair outcomes in real-world applications such as automated captioning, video conferencing, and voice assistants. The authors urge developers to evaluate, fix, and communicate such limitations before deploying these systems, noting that 'providing more signals through additional modalities is not necessarily better.'
- First bias evaluation of multimodal speech recognition; tested mWhisper-Flamingo and Gemini models.
- WER rose by up to 4.05 points depending on the face shown (gender, ethnicity, and their intersection).
- Adding a visual modality can introduce bias, contradicting the assumption that more modalities always help.
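To make the metric concrete, here is a minimal Python sketch of how a WER shift per face condition could be computed. It is not the authors' code. The function names, the face labels, the toy transcripts, and the choice of an audio-only baseline are all assumptions made for illustration.

```python
from typing import Dict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def wer_shift_by_face(
    reference: str,
    audio_only_hypothesis: str,
    hypotheses_by_face: Dict[str, str],
) -> Dict[str, float]:
    """Change in WER (percentage points) for each face condition.

    Assumption: the shift is measured against an audio-only transcript
    of the same clip. The study may define its baseline differently.
    """
    baseline = word_error_rate(reference, audio_only_hypothesis)
    return {
        face: word_error_rate(reference, hyp) - baseline
        for face, hyp in hypotheses_by_face.items()
    }


# Toy data only: these transcripts are invented, not taken from the study.
reference = "the quick brown fox jumps"
shifts = wer_shift_by_face(
    reference,
    audio_only_hypothesis="the quick brown fox jumps",
    hypotheses_by_face={
        "face_a": "the quick brown box jumps",  # one substitution
        "face_b": "the quick brown fox jumps",  # unchanged
    },
)
print(shifts)
```

A real evaluation would normally pool errors over a whole test set rather than score a single utterance, but the comparison has the same shape: same audio, different face, difference in WER.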
Why It Matters
Multimodal AI systems risk embedding visual biases into speech tasks, affecting fairness in captioning, assistants, and accessibility tools.