Audio & Speech

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

First comprehensive survey of how large audio-language models are evaluated

Deep Dive

Researchers Chih-Kai Yang, Neo S. Ho, and Hung-yi Lee from National Taiwan University have published a survey titled "Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey," accepted at EMNLP 2025 (Main Conference). The paper addresses the growing fragmentation in how large audio-language models (LALMs), which combine large language models with auditory capabilities, are evaluated, and proposes a structured taxonomy in response. The taxonomy organizes LALM evaluations into four dimensions: (1) General Auditory Awareness and Processing, covering tasks like speech recognition and sound event detection; (2) Knowledge and Reasoning, assessing comprehension and inference from audio; (3) Dialogue-oriented Ability, focusing on conversational and interactive performance; and (4) Fairness, Safety, and Trustworthiness, examining bias, robustness, and ethical concerns.
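
To make the shape of the taxonomy concrete, here is a minimal Python sketch that encodes the four dimensions and the example focus areas mentioned above. The names `Dimension`, `FOCUS_AREAS`, and `dimension_for` are hypothetical and chosen only for illustration; they do not come from the paper, and the focus areas listed are just those named in this summary, not the survey's full benchmark catalog.

```python
from enum import Enum
from typing import Optional


class Dimension(Enum):
    """The four evaluation dimensions named in the survey's taxonomy."""
    AUDITORY = "General Auditory Awareness and Processing"
    REASONING = "Knowledge and Reasoning"
    DIALOGUE = "Dialogue-oriented Ability"
    TRUST = "Fairness, Safety, and Trustworthiness"


# Focus areas taken from the summary text; illustrative, not exhaustive.
FOCUS_AREAS = {
    Dimension.AUDITORY: ["speech recognition", "sound event detection"],
    Dimension.REASONING: ["comprehension", "inference"],
    Dimension.DIALOGUE: ["conversational performance", "interactive performance"],
    Dimension.TRUST: ["bias", "robustness", "ethical concerns"],
}


def dimension_for(focus_area: str) -> Optional[Dimension]:
    """Return the dimension that lists the given focus area, or None."""
    needle = focus_area.strip().lower()
    for dimension, areas in FOCUS_AREAS.items():
        if needle in areas:
            return dimension
    return None


print(dimension_for("Sound event detection").value)
# prints "General Auditory Awareness and Processing"
```

A flat mapping like this is enough to tag a benchmark with one dimension; a benchmark that spans several dimensions would need a list of tags instead.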

The survey gives a detailed overview of the existing benchmarks in each category, identifies open challenges such as the lack of standardization and limited coverage of real-world scenarios, and highlights promising future directions. As the first survey focused specifically on LALM evaluation, it offers clear guidelines for the research community. The authors will release and actively maintain a collection of the surveyed papers, giving researchers a shared reference point for standardizing evaluation practices and supporting further progress in audio-language AI.

Key Points
  • First comprehensive survey specifically focused on evaluating large audio-language models (LALMs)
  • Proposes a four-dimension taxonomy: auditory awareness, knowledge/reasoning, dialogue ability, and fairness/safety
  • Accepted at EMNLP 2025 (Main Conference); the authors will release and maintain a collection of the surveyed papers for community use

Why It Matters

Standardizing LALM evaluation frameworks is crucial for advancing reliable and safe audio AI systems.