DashengTokenizer: One layer is enough for unified audio understanding and generation
A new tokenizer flips the script by injecting acoustic data into frozen semantic features, outperforming VAE-based methods.
A research team led by Heinrich Dinkel has introduced DashengTokenizer, a novel continuous audio tokenizer designed to serve both understanding and generation tasks. The approach inverts the conventional paradigm in audio AI: instead of training an acoustic tokenizer and then integrating frozen semantic knowledge into it, DashengTokenizer starts from frozen semantic features and injects acoustic information. The paper demonstrates that this 'acoustic injection' method delivers superior performance across a broad spectrum of audio tasks, from speech emotion recognition to music understanding and acoustic scene classification, while maintaining strong audio reconstruction fidelity.
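To make the inversion concrete, here is a minimal numpy sketch of the idea: semantic features come from a frozen encoder (no gradients would ever flow into it), and a small trainable path projects low-level acoustic features into the same space and adds them as a residual. All dimensions, function names, and the random projections are illustrative stand-ins, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- chosen for illustration, not from the paper.
T, FRAME_DIM, D_SEM, D_AC = 50, 320, 768, 128

def frozen_semantic_encoder(audio_frames):
    # Stand-in for a frozen pretrained semantic encoder; in training,
    # its weights would be fixed and receive no gradients.
    W = rng.standard_normal((audio_frames.shape[1], D_SEM)) * 0.01
    return audio_frames @ W

def acoustic_injection(sem_feats, acoustic_feats, W_inj):
    # The only trainable path in this sketch: project acoustic features
    # into the semantic space and inject them as an additive residual.
    return sem_feats + acoustic_feats @ W_inj

audio = rng.standard_normal((T, FRAME_DIM))    # dummy framed audio
acoustic = rng.standard_normal((T, D_AC))      # e.g. spectrogram-derived features

sem = frozen_semantic_encoder(audio)
W_inj = rng.standard_normal((D_AC, D_SEM)) * 0.01  # would be learned
tokens = acoustic_injection(sem, acoustic, W_inj)

print(tokens.shape)  # one continuous token per frame: (50, 768)
```

The design choice to hedge on is the direction of information flow: the semantic representation is the fixed backbone, and acoustic detail is grafted onto it, rather than distilling semantics into an acoustic codec.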
The technical breakthrough lies in the model's simplicity and effectiveness. In linear evaluations across 22 diverse audio tasks, DashengTokenizer outperformed previous audio codec and encoder baselines by a significant margin. For generative tasks like text-to-audio (TTA) and text-to-music (TTM), it surpassed standard variational autoencoder (VAE)-based methods, and its effectiveness on speech enhancement (SE) underscores its capabilities as a general-purpose audio encoder. Crucially, these results challenge the prevailing assumption that VAE-based architectures are a prerequisite for high-quality audio synthesis, potentially opening the door to more efficient and unified audio AI models. Checkpoints are already available for the community to build upon.
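Linear evaluation, the protocol behind the 22-task comparison, freezes the tokenizer and trains only a linear classifier on its features. A toy version of that probe, using synthetic features and a closed-form ridge-regression solve in place of gradient training (all names and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear probe: frozen tokenizer features -> class logits.
N, D, C = 200, 64, 4
feats = rng.standard_normal((N, D))      # pretend frozen tokenizer features
labels = rng.integers(0, C, size=N)      # e.g. emotion or scene labels

# One-hot targets; solve the linear layer in closed form (ridge regression).
Y = np.eye(C)[labels]
lam = 1e-2
W = np.linalg.solve(feats.T @ feats + lam * np.eye(D), feats.T @ Y)

preds = (feats @ W).argmax(axis=1)
accuracy = float((preds == labels).mean())
print(f"probe accuracy: {accuracy:.2f}")
```

Because only `W` is fit, probe accuracy isolates how much task-relevant information the frozen features already contain, which is why it serves as a fair yardstick across many understanding tasks.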
- Inverts standard AI audio paradigm by injecting acoustics into frozen semantic features, not vice-versa.
- Outperforms previous baselines on 22 diverse understanding tasks including emotion recognition and scene classification.
- Surpasses VAE-based methods on generative tasks like text-to-audio, challenging core architectural assumptions.
Why It Matters
Unifies audio understanding and generation in one efficient model, potentially simplifying and improving multimodal AI pipelines.