JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
No more task-specific training? JASTIN sets a new state of the art for correlation with human ratings in audio evaluation.
The rapid growth of generative audio models has exposed a critical gap: robust evaluation methods haven't kept pace. Existing objective metrics and general multimodal LLMs struggle to generalize across domains and lack zero-shot capability. To close this gap, researchers introduce JASTIN, a generalizable, instruction-driven audio evaluation framework. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone using a trainable audio adapter. The system reframes audio evaluation as a self-instructed reasoning task, allowing it to handle diverse evaluation scenarios without task-specific retraining.
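The frozen-encoder-plus-trainable-adapter design can be sketched as follows. This is a minimal illustration with stand-in modules, not the paper's implementation: the actual encoder, LLM, adapter architecture, and dimensions are assumptions made here for clarity.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Hypothetical trainable adapter: projects frozen-encoder features
    into the LLM's embedding space (the only bridge that is trained)."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_feats)

# Stand-in for a real pretrained audio encoder (e.g. a large SSL model);
# a single linear layer here just to keep the sketch self-contained.
enc_dim, llm_dim = 512, 1024
encoder = nn.Linear(80, enc_dim)
for p in encoder.parameters():      # frozen: only the adapter trains
    p.requires_grad = False

adapter = AudioAdapter(enc_dim, llm_dim)

# A batch of 2 clips, 100 frames of 80-dim spectrogram features each.
mel = torch.randn(2, 100, 80)
with torch.no_grad():
    feats = encoder(mel)
audio_tokens = adapter(feats)       # shape (2, 100, 1024); these would be
                                    # spliced into the LLM's input sequence
                                    # alongside the tokenized instruction
```

The design choice this illustrates: keeping the encoder and most of the LLM fixed concentrates training signal in a small adapter, which is what makes adapting to new evaluation instructions cheap.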
To ensure robust zero-shot generalization, the team developed a comprehensive instruction-following data pipeline with four key components: Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. This approach lets JASTIN adapt to novel audio inputs and evaluation instructions on the fly. Experimental results show JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings across speech, sound, music, and out-of-domain tasks. It consistently outperforms general MLLMs without needing task-specific retraining, making it a powerful tool for evaluating the next generation of generative audio models.
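The Pearson and Spearman correlations used to benchmark JASTIN against human subjective ratings can be computed as below; the score lists are toy values invented for illustration, not results from the paper.

```python
from math import sqrt
from statistics import mean

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson applied to the ranks
    (tie handling omitted for brevity)."""
    def ranks(v: list[float]) -> list[float]:
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# Toy example: model-predicted MOS vs. human MOS for five clips.
predicted = [3.1, 4.2, 2.5, 4.8, 3.6]
human     = [3.0, 4.0, 2.8, 4.5, 3.4]
print(round(pearson(predicted, human), 3))   # -> 0.987
print(round(spearman(predicted, human), 3))  # -> 1.0 (identical rank order)
```

Pearson rewards matching the human scores' magnitudes; Spearman only rewards matching their ordering, which is why both are reported when evaluating an automatic judge.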
- Bridges a frozen high-performance audio encoder with a fine-tuned LLM via a trainable adapter for zero-shot evaluation.
- Uses a novel data pipeline with Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data for robust generalization.
- Outperforms general MLLMs across speech, sound, music, and out-of-domain tasks without any task-specific retraining.
Why It Matters
Enables objective, instruction-driven evaluation of generative audio without costly retraining, accelerating model development and quality assurance.