OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement
A new neural codec from Chinese researchers outperforms Mimi at the same bitrate and is designed for LLM audio generation.
A research team of 13 authors, led by Jingbin Hu, has published a paper on OmniCodec, a novel universal neural audio codec designed specifically for low frame rate operation. The model addresses a key limitation in existing neural codecs, which often focus narrowly on speech and prioritize raw reconstruction fidelity over creating semantically useful representations for generative AI tasks. OmniCodec's core innovation is its hierarchical multi-codebook architecture, which explicitly disentangles semantic information from acoustic details. This is achieved by leveraging the encoder from a pre-trained audio understanding model and employing a self-guidance strategy to improve codebook utilization and overall reconstruction quality.
In practical terms, OmniCodec is engineered to be a foundational component for Large Language Model (LLM)-based audio generation systems. By providing compressed, discrete audio tokens that are both high-fidelity and semantically rich, it enables more efficient and effective audio synthesis, music generation, and sound effect creation within AI models. The researchers benchmarked OmniCodec against the established Mimi codec, demonstrating superior performance at identical bitrates. This combination of a low frame rate (fewer tokens per second of audio), universal applicability across speech, music, and general sounds, and semantically informative output positions OmniCodec as a significant technical advance. The team has committed to open-sourcing the model and code, facilitating its adoption and further development within the AI community.
- Uses a hierarchical multi-codebook design that decouples semantic content from fine-grained acoustic detail.
- Benchmarked to outperform the Mimi codec at the same bitrate for reconstruction quality.
- Explicitly designed for downstream LLM audio generation tasks, not just compression fidelity.
Why It Matters
Provides a better, universal audio tokenizer for AI generation models, improving efficiency and output quality for synthetic speech, music, and sound.