Audio & Speech

MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model

New AI model cleans noisy speech by saving difficult audio tokens for late refinement, outperforming larger competitors.

Deep Dive

A research team including The Hieu Pham, Tan Dat Nguyen, and others has introduced MAGE (Masked Audio Generative Enhancer), a new AI model designed to tackle the persistent challenge of cleaning up noisy audio recordings. Where previous approaches struggle with the trade-off between processing speed and output quality, MAGE employs a 'coarse-to-fine' masking strategy: rather than masking parts of the audio signal at random, it prioritizes generating common, predictable audio tokens in early decoding steps and defers the more complex, rare tokens to later refinement stages. Combined with a lightweight 'corrector' module that identifies and re-processes low-confidence predictions, this approach allows the model to be both efficient and highly effective.
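The paper's exact schedule is not reproduced here, but the following minimal Python sketch (with a random stand-in for the real transformer) illustrates the idea: positions whose predicted tokens are frequent and high-confidence are committed first, rare tokens wait for later steps, and a final corrector pass re-predicts anything decoded with low confidence. The toy_model, the MaskGIT-style cosine schedule, rarity_weight, and the 0.9 confidence threshold are illustrative assumptions, not values from the paper.

    import math
    import numpy as np

    rng = np.random.default_rng(0)

    VOCAB, SEQ_LEN = 1024, 64      # assumed codec vocabulary and clip length
    MASK_ID = VOCAB                # sentinel id for still-masked positions

    # Stand-in for MAGE's masked generative transformer: returns per-position
    # probabilities over the codec vocabulary. A real model would also
    # condition on the noisy input tokens.
    def toy_model(tokens):
        logits = rng.normal(size=(len(tokens), VOCAB))
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # Corpus frequency of each codec token (measured on training data in the
    # real system; random here).
    token_freq = rng.dirichlet(np.ones(VOCAB))

    def coarse_to_fine_decode(model, steps=8, rarity_weight=1.0):
        """Iteratively unmask positions, committing frequent, high-confidence
        tokens early and deferring rare tokens to later refinement steps."""
        tokens = np.full(SEQ_LEN, MASK_ID)
        confidence = np.zeros(SEQ_LEN)
        for step in range(1, steps + 1):
            probs = model(tokens)
            pred, conf = probs.argmax(-1), probs.max(-1)
            # Scarcity-aware score: frequent predicted tokens rank higher, so
            # rare tokens naturally wait for later, better-conditioned steps.
            score = np.log(conf) + rarity_weight * np.log(token_freq[pred])
            score[tokens != MASK_ID] = -np.inf    # already committed
            # Cosine schedule (as in MaskGIT-style decoders) for how many
            # positions remain masked after this step.
            n_masked_next = int(SEQ_LEN * math.cos(math.pi / 2 * step / steps))
            n_commit = int((tokens == MASK_ID).sum()) - n_masked_next
            commit = np.argsort(-score)[:n_commit]
            tokens[commit], confidence[commit] = pred[commit], conf[commit]
        return tokens, confidence

    def corrector_pass(model, tokens, confidence, threshold=0.9):
        """Sketch of the lightweight corrector: re-mask low-confidence
        positions and re-predict them in one extra pass."""
        low = confidence < threshold
        if low.any():
            tokens = tokens.copy()
            tokens[low] = MASK_ID
            tokens[low] = model(tokens).argmax(-1)[low]
        return tokens

    enhanced, conf = coarse_to_fine_decode(toy_model)
    enhanced = corrector_pass(toy_model, enhanced, conf)

Because the commit order is driven by token frequency as well as model confidence, the common structure of the signal is fixed cheaply in early steps, while later steps concentrate on the rare tokens that carry the hard detail.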

The model is built on the BigCodec framework and was fine-tuned starting from the Qwen2.5-0.5B language model. Through selective layer retention, the team pared it down to a compact 200 million parameters. In testing on standard benchmarks such as the DNS Challenge and noisy LibriSpeech datasets, MAGE demonstrated superior performance: it achieved state-of-the-art scores in perceptual audio quality and, crucially, substantially reduced word error rates when its cleaned audio was fed into automatic speech recognition (ASR) systems. Enhancement thus directly improves the accuracy of downstream tasks like transcription, a key practical metric. That MAGE outperforms larger, more parameter-heavy baselines underscores the efficiency gains from its targeted masking approach, making it a promising tool for real-world applications in communication and audio processing.
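As a rough illustration of selective layer retention, the sketch below starts from the public Qwen2.5-0.5B checkpoint and keeps only a subset of its decoder blocks via the Hugging Face transformers API. The every-other-layer choice is an assumption for illustration, not the paper's selection criterion, and adapting the pruned LM to codec tokens would involve further steps.

    import torch.nn as nn
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

    # Keep a subset of the decoder blocks; every other one is an assumed
    # selection, not the paper's criterion.
    keep = set(range(0, model.config.num_hidden_layers, 2))

    # Qwen2 exposes its decoder blocks as model.model.layers (an nn.ModuleList).
    model.model.layers = nn.ModuleList(
        layer for i, layer in enumerate(model.model.layers) if i in keep
    )
    model.config.num_hidden_layers = len(keep)

    # Re-index the retained blocks so KV-cache bookkeeping stays consistent.
    for new_idx, layer in enumerate(model.model.layers):
        layer.self_attn.layer_idx = new_idx

    # Adapting the LM to codec tokens would also mean swapping its text
    # vocabulary for BigCodec's codebook, e.g. via resize_token_embeddings.
    print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")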

Key Points
  • Uses a scarcity-aware coarse-to-fine masking strategy that generates frequent tokens first and defers rare tokens to later steps for better efficiency.
  • Compact 200M-parameter model built on BigCodec and fine-tuned from Qwen2.5-0.5B via selective layer retention.
  • Achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream speech recognition on benchmarks.

Why It Matters

Delivers cleaner audio for calls and transcripts with a smaller, more efficient model, improving real-world speech recognition accuracy.