MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
This new audio tokenizer could unlock the next generation of AI audio models.
Researchers have introduced MOSS-Audio-Tokenizer, a 1.6-billion-parameter model trained on 3 million hours of diverse audio. Built on CAT, a novel fully end-to-end Transformer architecture, it outperforms prior audio codecs across speech, sound, and music at a range of bitrates. The model also enables a purely autoregressive text-to-speech system that surpasses prior non-autoregressive models, and it achieves competitive automatic speech recognition without extra encoders, positioning it as a foundational audio interface.
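The core idea behind any audio tokenizer is turning continuous audio features into discrete token ids that an autoregressive model can consume like text. MOSS-Audio-Tokenizer's internals are not detailed here, so the sketch below is purely illustrative: it implements a toy residual vector quantizer (RVQ), a technique common in neural audio codecs generally, with random (untrained) codebooks and made-up sizes.

```python
# Illustrative sketch only: a toy residual vector quantizer (RVQ), a common
# neural-codec technique. All codebooks and sizes here are hypothetical; real
# tokenizers learn codebooks end-to-end from large audio corpora.
import numpy as np

rng = np.random.default_rng(0)

num_levels, codebook_size, dim = 4, 16, 8   # assumed toy dimensions
codebooks = rng.normal(size=(num_levels, codebook_size, dim))

def rvq_encode(frame):
    """Quantize one feature frame into num_levels discrete token ids."""
    residual, tokens = frame, []
    for cb in codebooks:
        # pick the codeword nearest to what remains unexplained
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]  # next level quantizes the leftover
    return tokens

def rvq_decode(tokens):
    """Sum the chosen codewords to reconstruct an approximate frame."""
    return sum(codebooks[lvl][idx] for lvl, idx in enumerate(tokens))

frame = rng.normal(size=dim)        # stand-in for one encoder feature frame
tokens = rvq_encode(frame)          # discrete ids, one per quantizer level
recon = rvq_decode(tokens)          # approximate reconstruction of the frame
print(tokens)
```

The discrete ids produced this way are what make a "purely autoregressive" pipeline possible: a language-model-style Transformer can predict audio tokens and a decoder can map them back to waveforms.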
Why It Matters
It offers future audio foundation models a single, scalable backbone that spans speech, sound, and music, unifying audio generation and understanding behind one tokenizer interface.