Scaling Speech Tokenizers with Diffusion Autoencoders
A 1.6 billion-parameter speech tokenizer compresses speech to 200 bits per second while outperforming other models at understanding and generating human speech.
Deep Dive
Researchers have developed SiTok, a speech tokenizer built on a diffusion autoencoder that compresses speech into compact digital tokens. The 1.6 billion-parameter model was trained on 2 million hours of audio and outperforms other models at understanding, reconstructing, and generating speech. It does so at a very low bitrate of 200 bits per second and a token rate of 12.5 Hz, balancing semantic meaning and audio quality better than previous methods.
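To put those two figures in perspective, the arithmetic below works out what 200 bits per second at 12.5 Hz implies per token. It is a rough sketch, not anything taken from SiTok itself: the function names are made up for illustration, and the 16 kHz, 16-bit mono PCM baseline is an assumed reference point for uncompressed speech rather than a number from the research.

```python
# Back-of-the-envelope arithmetic for the reported rates.
# Only BITRATE_BPS and TOKEN_RATE_HZ come from the article; everything
# else (function names, the PCM baseline) is an illustrative assumption.

BITRATE_BPS = 200.0    # reported bitrate, bits per second
TOKEN_RATE_HZ = 12.5   # reported token rate, tokens per second


def bits_per_token(bitrate_bps: float, token_rate_hz: float) -> float:
    """Average number of bits each token carries."""
    return bitrate_bps / token_rate_hz


def compression_ratio(sample_rate_hz: int, bit_depth: int, bitrate_bps: float) -> float:
    """Ratio of an uncompressed mono PCM bitrate to the tokenizer bitrate."""
    return (sample_rate_hz * bit_depth) / bitrate_bps


bits = bits_per_token(BITRATE_BPS, TOKEN_RATE_HZ)
token_period_ms = 1000.0 / TOKEN_RATE_HZ

# Assumed baseline: 16 kHz, 16-bit mono PCM (256,000 bits per second).
ratio = compression_ratio(16_000, 16, BITRATE_BPS)

print(f"bits per token:      {bits:.1f}")
print(f"audio per token:     {token_period_ms:.1f} ms")
print(f"vs. 16 kHz/16-bit:   {ratio:.0f}x smaller")
```

Under these assumptions each token carries 16 bits and covers 80 ms of audio, and the stream is about 1,280 times smaller than the assumed PCM baseline. How those 16 bits are split across codebooks is not stated here, so the sketch does not guess at it.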
Why It Matters
Tokens this compact mean shorter sequences for downstream models to process, which could lead to more efficient and capable voice assistants, translation tools, and audio generation systems.