DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
A 28-author team's model uses 5M self-generated samples to tackle catastrophic forgetting in audio AI.
A large research team from academia and industry has published DeSTA2.5-Audio, a new model designed to be a general-purpose Large Audio Language Model (LALM). The core innovation is a training strategy called 'self-generated cross-modal alignment' (DeSTA), which directly tackles a major flaw in previous audio AI models: catastrophic forgetting. When developers add audio capabilities to a large language model (LLM), the model often loses its original language proficiency. DeSTA mitigates this by having the backbone LLM generate its own training targets: given a textual description of an audio clip and an instruction, the LLM produces the response itself, yielding a training set that stays aligned with its own output distribution and preserves its linguistic knowledge.
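The self-generation recipe described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the function names (`backbone_llm`, `build_training_sample`) and the metadata fields are hypothetical stand-ins, assuming each audio clip ships with a transcript and attribute tags from its source dataset.

```python
# Hypothetical sketch of self-generated cross-modal alignment (DeSTA-style).
# Assumption: each audio clip has textual metadata (transcript, tags) drawn
# from its source dataset; backbone_llm is a stand-in for the frozen LLM.

def backbone_llm(prompt: str) -> str:
    # Stand-in: a real pipeline would query the frozen backbone LLM here.
    return f"[LLM response to: {prompt}]"

def build_training_sample(audio_meta: dict, instruction: str) -> dict:
    # Render the audio clip as text so the backbone LLM can "perceive" it
    # through its native modality.
    description = (
        f"Transcript: {audio_meta['transcript']}\n"
        f"Attributes: {', '.join(audio_meta['tags'])}"
    )
    prompt = f"{description}\n\nInstruction: {instruction}"
    # The backbone LLM generates the target itself, so the target's style
    # matches the LLM's own output distribution -- the property credited
    # with mitigating catastrophic forgetting during alignment training.
    target = backbone_llm(prompt)
    return {
        "audio": audio_meta["path"],
        "instruction": instruction,
        "target": target,
    }

sample = build_training_sample(
    {
        "path": "clip_001.wav",
        "transcript": "turn left at the light",
        "tags": ["female speaker", "calm tone"],
    },
    "Describe the speaker and what they said.",
)
print(sample["audio"], "->", sample["target"][:40])
```

The audio-conditioned model is then trained to reproduce `target` from the raw waveform plus `instruction`, never from a human- or external-model-written label, which is what keeps the audio branch from pulling the LLM away from its original language behavior.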
The team built a massive, task-agnostic dataset called DeSTA-AQA5M, containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets. This includes speech, environmental sounds, and music. This approach allows DeSTA2.5-Audio to generalize to new audio tasks without needing specific fine-tuning (zero-shot). In comprehensive evaluations, the model achieved state-of-the-art or highly competitive performance across major audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. The research, now published in IEEE Transactions on Audio, Speech and Language Processing, underscores that carefully designed data construction is more critical than model architecture for building robust, general-purpose audio AI.
- Solves 'catastrophic forgetting' in audio AI by having the LLM generate its own training targets (self-generated cross-modal alignment).
- Trained on a massive, diverse dataset (DeSTA-AQA5M) of 5M samples from 7,000 hours of audio across 50 datasets.
- Achieves state-of-the-art results on key benchmarks, enabling zero-shot generalization to new audio tasks without specific tuning.
Why It Matters
Enables AI assistants that can seamlessly understand and reason about real-world sounds without losing their core conversational abilities.