FMelCodec compresses speech to 250 bps with flow-matching refinement
New codec achieves 640x compression while preserving speaker identity and naturalness.
Researchers have introduced FMelCodec, a neural speech codec designed for ultra-low-bitrate communication. It operates at 250 bps for 16 kHz audio and 750 bps for 48 kHz audio, a reported 640x compression ratio, while preserving speech naturalness and speaker identity. The codec is built around a three-stage coding-refinement-reconstruction (CRR) framework that targets the information loss and quantization instability typical of such extreme bit budgets.
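The summary does not say which uncompressed baseline the 640x figure is measured against. As a rough check, the sketch below computes the reference bitrate that a given ratio implies and, for comparison, the ratio against 16-bit mono PCM. The PCM baseline and all function names are assumptions chosen for illustration, not details from the source.

```python
def implied_reference_bps(codec_bps: float, ratio: float) -> float:
    """Uncompressed bitrate implied by a codec bitrate and a compression ratio."""
    return codec_bps * ratio


def pcm_bps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> int:
    """Bitrate of raw PCM audio (assumed baseline: 16-bit mono)."""
    return sample_rate_hz * bits_per_sample * channels


def compression_ratio(reference_bps: float, codec_bps: float) -> float:
    """How many times smaller the codec stream is than the reference stream."""
    return reference_bps / codec_bps


# The two operating points quoted in the article: (sample rate in Hz, codec bitrate in bps).
for sample_rate, codec_rate in [(16000, 250), (48000, 750)]:
    implied = implied_reference_bps(codec_rate, 640)
    vs_pcm16 = compression_ratio(pcm_bps(sample_rate), codec_rate)
    print(f"{sample_rate} Hz: 640x implies a {implied:.0f} bps reference; "
          f"ratio vs 16-bit PCM = {vs_pcm16:.0f}x")
```

At both operating points, 640x corresponds to a reference of 10 bits per sample (160 kbps and 480 kbps), whereas a 16-bit PCM baseline would give 1024x. Which baseline the authors used is not stated here.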
The first stage is an aggressively compressing encoder-decoder with a single 1024-entry vector quantization (VQ) codebook, paired with an online clustering strategy to prevent codebook collapse. The second stage applies conditional flow matching (CFM) to refine the degraded mel-spectrogram, using a lightweight velocity-field estimator and a self-consistency training scheme that reduces the number of iterative inference steps. Finally, a HiFi-GAN vocoder reconstructs the waveform from the refined spectrogram. In experiments across multiple datasets and sampling rates, FMelCodec reportedly outperforms existing codecs in both objective and subjective evaluations while requiring less computation. The approach could enable high-fidelity voice communication over severely bandwidth-constrained channels such as satellite links or IoT networks.
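To make the first two stages more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the tiny codebook, and the toy velocity field are hypothetical stand-ins. The frame-rate calculation also assumes that each frame emits exactly one codebook index with no side information, which the article does not confirm.

```python
import math


def bits_per_index(codebook_size: int) -> float:
    """Bits needed to transmit one index into a codebook of the given size."""
    return math.log2(codebook_size)


def implied_frame_rate(bitrate_bps: float, codebook_size: int) -> float:
    """Indices per second, assuming one index per frame and no side information."""
    return bitrate_bps / bits_per_index(codebook_size)


def quantize(frame, codebook):
    """Index of the codebook vector nearest to `frame` (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(frame, codebook[i]))


def euler_refine(x0, velocity, cond, steps):
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with fixed-step Euler.

    In a flow-matching model, `velocity` would be a learned network and `cond`
    the degraded mel-spectrogram; fewer steps means cheaper inference.
    """
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, target):
    """Toy field for a straight-line path toward a known target.

    Stands in for the learned velocity-field estimator; a real model does not
    have access to the clean target at inference time.
    """
    return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]


if __name__ == "__main__":
    print(bits_per_index(1024))            # bits per transmitted index
    print(implied_frame_rate(250, 1024))   # indices per second at 250 bps
    toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(quantize([0.9, 0.1], toy_codebook))
    print(euler_refine([0.0, 0.0], toy_velocity, [1.0, 2.0], steps=4))
```

With a 1024-entry codebook each index costs 10 bits, so 250 bps works out to 25 indices per second under the one-index-per-frame assumption. The `steps` argument of `euler_refine` is the quantity that a scheme for reducing inference steps aims to keep small.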
- FMelCodec operates at 250 bps for 16 kHz audio and 750 bps for 48 kHz, achieving 640x compression.
- Uses a three-stage CRR framework: aggressive VQ codec, conditional flow matching refinement, and HiFi-GAN vocoder.
- Outperforms existing ultra-low-bitrate codecs in speech quality and speaker similarity with lower complexity.
Why It Matters
It could enable high-quality voice calls over extremely low-bandwidth networks, from satellite links to IoT.