A new AI model compresses speech for better understanding and generation
A new 1.6 billion-parameter speech AI model understands and generates human speech from a very low-bitrate token stream.
Researchers have developed SiTok, an AI model that compresses speech into compact discrete tokens. Trained on 2 million hours of audio, the 1.6 billion-parameter model is reported to outperform other models at understanding, reconstructing, and generating speech. It does so at a bitrate of just 200 bits per second and a token rate of 12.5 Hz (12.5 tokens per second of audio), balancing semantic meaning and audio quality better than previous methods.
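To put those two figures in perspective, the short Python sketch below works out what they imply together. It is only back-of-the-envelope arithmetic, not SiTok's implementation: the function names are made up for illustration, and the codebook size assumes each token is drawn from a single codebook, which the summary above does not state.

```python
# Figures quoted in the summary above.
BITRATE_BPS = 200.0      # bits per second
TOKEN_RATE_HZ = 12.5     # tokens per second of audio


def bits_per_token(bitrate_bps: float, token_rate_hz: float) -> float:
    """Bits carried by each token at the given bitrate and token rate."""
    return bitrate_bps / token_rate_hz


def tokens_for_duration(seconds: float, token_rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Number of whole tokens needed to represent `seconds` of audio."""
    return int(seconds * token_rate_hz)


def bytes_for_duration(seconds: float, bitrate_bps: float = BITRATE_BPS) -> float:
    """Size in bytes of the token stream for `seconds` of audio."""
    return seconds * bitrate_bps / 8


bits = bits_per_token(BITRATE_BPS, TOKEN_RATE_HZ)

# ASSUMPTION: one codebook per token; the article does not say how
# the 16 bits are divided, so this is an illustrative upper bound.
implied_codebook_size = 2 ** int(bits)

print(f"bits per token: {bits}")
print(f"implied codebook size (single-codebook assumption): {implied_codebook_size}")
print(f"tokens for 10 s of speech: {tokens_for_duration(10)}")
print(f"bytes for 60 s of speech: {bytes_for_duration(60)}")
```

In other words, each token carries 16 bits, and a full minute of speech fits in about 1,500 bytes of tokens.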
Why It Matters
This breakthrough could lead to far more efficient and capable voice assistants, translation tools, and audio generation systems.