MOSS-Audio-Tokenizer: A 1.6B-Parameter Model Trained on 3M Hours of Audio
This new audio tokenizer could serve as the foundation for the next generation of AI audio models.
Researchers have introduced MOSS-Audio-Tokenizer, a 1.6-billion-parameter audio tokenizer trained on 3 million hours of diverse audio. It is built on CAT, a fully end-to-end Transformer architecture, and is reported to outperform prior audio codecs on speech, sound, and music across a range of bitrates. The tokenizer enables a purely autoregressive text-to-speech system that beats prior non-autoregressive models, and it supports competitive automatic speech recognition without additional encoders. The authors position it as a foundational interface for audio models.
Why It Matters
It offers a unified, scalable backbone for future audio foundation models, with a single tokenizer serving both audio generation and audio understanding.