Audio & Speech

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

No more task-specific training? JASTIN achieves state-of-the-art correlation with human ratings on audio evaluation.

Deep Dive

The rapid growth of generative audio models has exposed a critical gap: robust evaluation methods haven't kept pace. Existing objective metrics and general multimodal LLMs struggle with domain generalization and zero-shot capabilities. To solve this, researchers introduce JASTIN, a generalizable, instruction-driven audio evaluation framework. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone using a trainable audio adapter. The system reframes audio evaluation as a self-instructed reasoning task, allowing it to handle diverse evaluation scenarios without task-specific retraining.
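The bridging design above can be sketched in a few lines: a frozen audio encoder produces features, and a small trainable adapter projects them into the LLM's embedding space so the LLM can reason over audio as pseudo-tokens. This is a schematic sketch, not the paper's implementation; the dimensions, the linear "encoder", and the linear adapter are illustrative placeholders.

```python
# Minimal sketch of a frozen-encoder + trainable-adapter bridge.
# All shapes and weights are hypothetical stand-ins.
import numpy as np

ENC_DIM, LLM_DIM = 512, 1024          # assumed feature / embedding sizes
rng = np.random.default_rng(0)

# Frozen encoder: stands in for a pretrained audio model whose
# weights are never updated during fine-tuning.
W_enc = rng.standard_normal((128, ENC_DIM)) / np.sqrt(128)

# Trainable adapter: the only audio-side parameters that would
# receive gradients when aligning with the LLM.
W_adapt = rng.standard_normal((ENC_DIM, LLM_DIM)) / np.sqrt(ENC_DIM)

def encode(audio_frames: np.ndarray) -> np.ndarray:
    """Frozen feature extraction: (T, 128) frames -> (T, ENC_DIM)."""
    return audio_frames @ W_enc

def adapt(features: np.ndarray) -> np.ndarray:
    """Trainable projection into the LLM's token-embedding space."""
    return features @ W_adapt

audio = rng.standard_normal((50, 128))   # 50 frames of mock audio features
llm_tokens = adapt(encode(audio))        # pseudo-tokens the LLM consumes
print(llm_tokens.shape)                  # one embedding per audio frame
```

Because only the adapter (and, per the paper, the LLM backbone) is trained, the audio encoder's pretrained representations are preserved, which is what enables reuse across evaluation tasks.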

To ensure robust zero-shot generalization, the team developed a comprehensive instruction-following data pipeline with four key components: Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. This approach lets JASTIN adapt to novel audio inputs and evaluation instructions on the fly. Experimental results show JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings across speech, sound, music, and out-of-domain tasks. It consistently outperforms general MLLMs without needing task-specific retraining, making it a powerful tool for evaluating the next generation of generative audio models.
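Pearson and Spearman correlations, the two agreement metrics reported above, measure linear agreement and rank agreement with human ratings, respectively. A stdlib-only sketch of how they are computed, with made-up score lists:

```python
# Pearson: linear correlation of raw scores.
# Spearman: Pearson correlation of rank-transformed scores.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def ranks(x):
    # 1-based ranks, averaging ties.
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# Hypothetical example: model-predicted quality scores vs. human ratings.
model_scores = [3.1, 4.0, 2.2, 4.8, 3.5]
human_scores = [3.0, 4.2, 2.5, 4.9, 3.3]
print(pearson(model_scores, human_scores))
print(spearman(model_scores, human_scores))
```

A high Pearson value means the model's scores track human ratings linearly; a high Spearman value means the model orders samples the same way humans do, even if the scales differ.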

Key Points
  • Bridges a frozen high-performance audio encoder with a fine-tuned LLM via a trainable adapter for zero-shot evaluation.
  • Uses a novel data pipeline with Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data for robust generalization.
  • Outperforms general MLLMs across speech, sound, music, and out-of-domain tasks without any task-specific retraining.

Why It Matters

Enables objective, instruction-driven evaluation of generative audio without costly retraining, accelerating model development and quality assurance.