Audio & Speech

AnyAudio-Judge: Dynamic rubric benchmark evaluates audio instruction following with 7,920 samples

New evaluation method decomposes complex audio captions into verifiable binary checks, backed by a bilingual benchmark with hard negatives.

Deep Dive

Current automated evaluation of instruction-guided audio generation relies on holistic scoring from large language models, which often fails to decouple complex instructions and misses fine-grained attribute mismatches. To address this, a team of researchers (Haitao Li, Tian Tan, Yuguang Yang, Shan Yang, Xie Chen) proposes AnyAudio-Judge, a dynamic rubric-based evaluation paradigm that adaptively decomposes each complex audio caption into a variable number of independent, verifiable binary rubric items. They also introduce AnyAudio-Judge Bench, a bilingual benchmark of 7,920 curated samples spanning four audio domains (speech, sound, music, and mixed), all featuring deliberately constructed hard negatives.
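To make the rubric idea concrete, here is a minimal Python sketch of scoring one audio clip against a decomposed caption. The `RubricItem` class, the example caption, and the pass/fail values are hypothetical. Averaging the binary checks into one score is an assumption made for illustration, not an aggregation rule stated in the summary above.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One independently verifiable yes/no check derived from a caption."""
    description: str
    passed: bool


def rubric_score(items: list[RubricItem]) -> float:
    """Fraction of rubric items satisfied (assumed aggregation, for illustration only)."""
    if not items:
        raise ValueError("a rubric needs at least one item")
    return sum(item.passed for item in items) / len(items)


# Hypothetical caption: "A calm female voice speaks in Mandarin over light rain."
# The number of items varies with the caption, which is what makes the rubric dynamic.
items = [
    RubricItem("speaker sounds female", True),
    RubricItem("speech is in Mandarin", True),
    RubricItem("delivery is calm", False),  # a fine-grained mismatch
    RubricItem("rain is audible in the background", True),
]

failed = [item.description for item in items if not item.passed]
print(rubric_score(items))  # 0.75
print(failed)               # ['delivery is calm']
```

Unlike a single holistic score, the list of failed items shows which attribute was missed, which is where the interpretability comes from.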

To train a dedicated evaluator, the researchers constructed a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales. A training pipeline that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) aligns the model's reasoning paths with the rubric-based scoring mechanism. The authors report that AnyAudio-Judge significantly improves zero-shot alignment detection over state-of-the-art baselines, and that it provides precise, interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation. The paper is available on arXiv (2606.03116).
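For readers unfamiliar with GRPO, its core step is to score several sampled responses to the same prompt and normalize each reward against its own group, rather than against a learned value function. The sketch below shows only that normalization. The reward values are hypothetical, and the use of the population standard deviation and the `eps` constant are assumptions. It is not the authors' training code.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward within its sampling group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [1.0, 0.75, 0.5, 0.75]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Responses that score above their group's mean get a positive advantage and are reinforced, while those below the mean are discouraged.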

Key Points
  • Dynamic rubric approach decomposes complex audio captions into independent binary verification items for fine-grained evaluation.
  • Benchmark includes 7,920 bilingual samples across four audio domains (speech, sound, music, mixed) with hard negatives.
  • A 105K-sample training corpus with Chain-of-Thought rationales is used to train the evaluator via SFT and GRPO, aligning its reasoning with rubric scoring.

Why It Matters

Enables reliable, interpretable evaluation of AI audio generation, improving alignment with complex user instructions for production systems.