CleanCodec streamlines speech tokenization, with up to 17x faster downstream inference
The new codec uses just 12.5 tokens per second while outperforming leading codecs on speaker similarity and speech intelligibility.
Neural audio codecs compress speech into the discrete tokens that AI models consume, but existing codecs often spend tokens on background noise and recording artifacts, which hurts both quality and efficiency. CleanCodec, developed by Eugene Kwek and colleagues, reframes audio tokenization as a selective information bottleneck problem. The model learns to encode only perceptually important features, such as linguistic content and speaker characteristics, while discarding perceptually unimportant information such as noise. This approach lets CleanCodec operate at just 12.5 tokens per second, a third of what many current codecs require.
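To make the token-rate comparison concrete, the short sketch below counts how many tokens a clip of a given length would need at 12.5 tokens per second versus a rate three times higher. The 37.5 tokens-per-second baseline is an assumption derived only from the "a third" claim above, not a figure for any named codec, and `token_count` is a hypothetical helper introduced for illustration.

```python
import math

# Token rate reported for CleanCodec in the article.
CLEANCODEC_RATE = 12.5  # tokens per second

# Assumed baseline: three times the CleanCodec rate, implied by the
# "a third of what many current codecs require" claim. Not a named codec.
BASELINE_RATE = CLEANCODEC_RATE * 3  # 37.5 tokens per second


def token_count(duration_s: float, rate: float) -> int:
    """Tokens needed to cover `duration_s` seconds at `rate` tokens/s, rounded up."""
    return math.ceil(duration_s * rate)


if __name__ == "__main__":
    for duration in (1.0, 10.0, 60.0):
        clean = token_count(duration, CLEANCODEC_RATE)
        base = token_count(duration, BASELINE_RATE)
        print(f"{duration:5.1f} s: {clean:5d} tokens at 12.5/s vs {base:5d} at 37.5/s")
```

Under these assumptions, a 10-second utterance needs 125 tokens instead of 375. Shorter token sequences are one plausible reason a downstream model could run faster, though the article does not break down where the reported speedup comes from.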
On benchmarks, CleanCodec substantially outperforms leading codecs in both speaker similarity and speech intelligibility, even while using fewer tokens. When tested on downstream tasks like text-to-speech and voice conversion, it not only improves output quality but also speeds up inference by up to 17 times. These results suggest that CleanCodec can significantly reduce computational overhead for speech AI systems without sacrificing fidelity. The paper highlights a new direction for building more efficient and robust audio representations, with potential applications in real-time voice assistants, streaming, and speech generation.
- CleanCodec achieves state-of-the-art tokenization efficiency at just 12.5 tokens per second.
- It outperforms existing codecs on speaker similarity and speech intelligibility benchmarks.
- Downstream TTS and voice conversion tasks see up to 17x faster inference with improved quality.
Why It Matters
CleanCodec enables faster, more efficient speech AI pipelines while preserving audio quality for voice assistants and speech generation.