Audio & Speech

New MUSA benchmark reveals LALMs struggle with multilingual distractions

Even top audio AI models fail when Spanish, Korean, or Chinese chatter competes with the target speaker

Deep Dive

A new study on arXiv by Heejoon Koo tackles a critical blind spot in Large Audio Language Models (LALMs): their ability to ignore multilingual distractors. The paper introduces MUSA (Multilingual Source-grounded Understanding and reasoning for Audio), a benchmark inspired by the cocktail party effect, in which a listener focuses on one speaker amid competing voices. Each test pairs an English target dialogue with a semantically plausible distractor in English, Spanish, Korean, or Chinese. Models are evaluated under three conditions: single speaker, a two-stage pipeline (source separation followed by processing), and an end-to-end cocktail party setting.

Results from six LALMs (two closed-source, four open-weight) reveal a troubling gap. While models perform well on single-speaker tasks, their cocktail party accuracy degrades sharply at low signal-to-noise ratios, where the distractor is loud relative to the target. Errors are dominated by distractor-grounded source confusion: models latch onto the wrong speaker. Even when source separation reduces acoustic overlap, source attribution remains unresolved, leading to confidently incorrect answers. The study underscores that strong baseline performance is no guarantee of robust real-world selectivity, especially when multiple languages are involved. Data and code will be released upon publication.
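To make the signal-to-noise idea concrete, here is a minimal sketch of how a target stream and a distractor stream can be mixed at a chosen ratio. It is not MUSA's pipeline: the function names (`rms`, `mix_at_snr`, `measured_snr_db`), the synthetic signals, and the -5 dB setting are all assumptions for illustration, and the summary above does not state which levels the paper uses.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(target, distractor, snr_db):
    """Scale the distractor so the target-to-distractor ratio equals snr_db,
    then add the two streams sample by sample (hypothetical helper)."""
    n = min(len(target), len(distractor))
    if n == 0:
        raise ValueError("both signals must contain samples")
    target, distractor = target[:n], distractor[:n]
    distractor_rms = rms(distractor)
    if distractor_rms == 0:
        raise ValueError("distractor is silent")
    # SNR_dB = 20 * log10(rms_target / rms_distractor); solve for the gain.
    gain = rms(target) / (distractor_rms * 10 ** (snr_db / 20))
    scaled = [gain * d for d in distractor]
    mixture = [t + d for t, d in zip(target, scaled)]
    return mixture, scaled


def measured_snr_db(target, scaled_distractor):
    """Recompute the ratio in dB to check the mix."""
    return 20 * math.log10(rms(target) / rms(scaled_distractor))


# Synthetic stand-ins for speech: a tone as "target", seeded noise as "distractor".
sample_rate = 16000
target = [math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(sample_rate)]
rng = random.Random(0)
distractor = [rng.uniform(-1.0, 1.0) for _ in range(sample_rate)]

# A negative SNR means the distractor is louder than the target.
mixture, scaled = mix_at_snr(target, distractor, snr_db=-5.0)
```

Lowering `snr_db` makes the distractor louder relative to the target, which is the regime where the article reports the sharpest accuracy drop.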

Key Points
  • MUSA benchmark tests LALMs on English target dialogues with distractors in English, Spanish, Korean, or Chinese across three listening conditions
  • Six LALMs were evaluated (two closed-source, four open-weight); none maintained robust attention under high-noise multilingual interference
  • Source separation reduced acoustic overlap but failed to fix source attribution, often causing confident wrong-stream answers

Why It Matters

Real-world LALM deployment in noisy, multilingual environments (e.g., voice assistants, transcription) remains unreliable without better selective attention.