Audio & Speech

CFMDCTCodec delivers high-quality speech at 0.65 kbps

This neural codec uses conditional flow matching to restore fine spectral details...

Deep Dive

CFMDCTCodec tackles the challenge of high-quality speech coding at extremely low bitrates, a critical need for bandwidth-constrained applications. The system operates entirely in the modified discrete cosine transform (MDCT) domain, using a lightweight encoder-quantizer-decoder architecture to produce a coarse spectral reconstruction. To restore fine-grained details lost during compression, it introduces a noise-prior-aware conditional flow matching (CFM) enhancer that integrates a conditional MDCT velocity-field filter with an ordinary differential equation (ODE) solver. This enhancer is guided by an MDCT-derived magnitude-adaptive noise prior that emphasizes perceptually important high-energy regions while stabilizing low-energy and silent areas.
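
The enhancement step described above can be illustrated with a toy sketch: draw a starting point from a magnitude-adaptive noise prior, then integrate a velocity field with a fixed-step Euler ODE solver. Everything here is an assumption made for illustration, not the paper's method: the prior's form (a standard deviation of floor + scale * |coefficient|), the function names, the step count, and above all toy_velocity, which stands in for the trained network by pointing straight at a known target.

```python
import random


def magnitude_adaptive_noise(coarse, floor=0.05, scale=0.5, seed=0):
    """Zero-mean Gaussian noise whose std grows with |coarse coefficient|.

    Hypothetical form: `floor` keeps the std small but non-zero in silent
    bins, `scale` lets high-energy bins receive more noise.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, floor + scale * abs(c)) for c in coarse]


def toy_velocity(x, t, coarse, target):
    """Stand-in for a learned conditional velocity field (assumption).

    A real model would predict the velocity from (x, t, coarse) alone;
    this toy ignores `coarse` and uses the known target so the example
    stays self-contained.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_solve(x0, coarse, target, steps=8):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    h = 1.0 / steps
    for k in range(steps):
        t = k * h
        v = toy_velocity(x, t, coarse, target)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


# Toy MDCT frame: coarse decoder output and the (normally unknown) clean frame.
coarse = [0.9, -0.4, 0.0, 0.02]
target = [1.0, -0.5, 0.0, 0.03]

x0 = magnitude_adaptive_noise(coarse)
enhanced = euler_solve(x0, coarse, target)
print(enhanced)
```

With the idealized velocity used here, the Euler trajectory lands on the target regardless of the starting noise; a trained network would only approximate that field, so the step count and solver choice would affect quality.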

Training uses a single non-adversarial strategy that jointly optimizes the reconstruction, quantization, and CFM objectives. In the reported evaluations, CFMDCTCodec outperforms competitive baselines at 0.65 kbps and approaches the perceptual quality of much larger codecs at a fraction of their parameter count and computational cost. The paper has been accepted by IEEE Transactions on Audio, Speech and Language Processing.
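
As a rough illustration of what a joint, non-adversarial objective can look like, the sketch below sums a reconstruction term, a quantization term, and a flow matching term. The summary does not give the paper's loss definitions or weights, so every form here is an assumption: mean squared error for reconstruction, a commitment-style penalty between latent and quantized vectors, a linear interpolation path for the CFM regression target, and the hypothetical weights w_rec, w_q, and w_cfm.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cfm_loss(predict_velocity, x0, x1, t, cond):
    """Flow matching regression loss on a linear path (assumed form).

    x_t = (1 - t) * x0 + t * x1, and the regression target is x1 - x0.
    """
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return mse(predict_velocity(xt, t, cond), target_v)


def joint_loss(clean, coarse, latent, quantized, x0, t, predict_velocity,
               w_rec=1.0, w_q=1.0, w_cfm=1.0):
    """Weighted sum of the three objectives; weights are hypothetical."""
    rec = mse(coarse, clean)          # coarse decoder output vs. clean MDCT
    quant = mse(latent, quantized)    # commitment-style quantization penalty
    cfm = cfm_loss(predict_velocity, x0, clean, t, coarse)
    return w_rec * rec + w_q * quant + w_cfm * cfm


# Toy values; the oracle predictor returns the exact target velocity,
# so only the reconstruction and quantization terms contribute.
clean = [1.0, -0.5]
coarse = [0.5, -0.5]
latent = [0.2, 0.4]
quantized = [0.2, 0.0]
x0 = [0.1, -0.2]


def oracle(xt, t, cond):
    return [b - a for a, b in zip(x0, clean)]


total = joint_loss(clean, coarse, latent, quantized, x0, 0.3, oracle)
print(total)
```

Because all three terms are plain regression losses, a single optimizer can minimize their sum without a discriminator, which is the sense of "non-adversarial" assumed in this sketch.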

Key Points
  • Operates entirely in the MDCT domain for efficient speech compression down to 0.65 kbps
  • Uses a noise-prior-aware conditional flow matching enhancer to restore fine spectral details
  • Outperforms competitive baselines at 0.65 kbps and approaches the quality of much larger codecs with a fraction of their parameters

Why It Matters

Makes high-quality speech transmission feasible over ultra-low-bandwidth links such as satellite or IoT connections.