Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection
A new training-free method uses multi-resolution spoof detectors to prune unnatural AI-generated speech tokens.
A team of researchers has introduced MSpoof-TTS, a training-free framework designed to improve the quality and robustness of AI-generated speech. The work addresses a key weakness in current neural codec language models: their vulnerability to token-level artifacts and distributional drift during inference, both of which can make synthetic speech sound unnatural. Rather than retraining models or applying preference optimization, which are computationally expensive, the researchers propose an inference-time solution that acts as a quality filter, steering generation toward more perceptually realistic outputs.
The technical core of MSpoof-TTS is a Multi-Resolution Token-based Spoof Detection framework. This system analyzes the discrete codec sequences (the compressed representation of speech) at multiple temporal resolutions to identify locally inconsistent or unnatural patterns that signal poor quality. This detection mechanism is then integrated into a hierarchical decoding strategy. During generation, the framework progressively prunes low-quality candidate tokens and re-ranks hypotheses based on the spoof detector's feedback. This discriminator-guided approach enhances the robustness and fidelity of the final audio output without modifying a single parameter of the base speech synthesis model, offering a plug-and-play upgrade for existing systems.
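The decoding loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the real system uses a learned spoof detector over codec tokens, whereas here `spoof_score` is a hypothetical stand-in that flags low token diversity within a window as "unnatural". The window sizes, threshold, and beam width are all assumed values chosen for the sketch.

```python
def spoof_score(tokens, window):
    """Toy stand-in for a learned spoof detector: returns a score in [0, 1]
    where higher means more spoof-like. Here, windows with many repeated
    tokens score high. Hypothetical, not the paper's detector."""
    if len(tokens) < window:
        return 0.0
    return max(
        1.0 - len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    )

def multires_score(tokens, windows=(2, 4, 8)):
    """Aggregate spoof evidence across several temporal resolutions,
    mirroring the multi-resolution analysis of the token sequence."""
    return max(spoof_score(tokens, w) for w in windows)

def hierarchical_decode(candidates_per_step, beam=2, threshold=0.6):
    """Beam-style decoding that prunes hypotheses the detector flags
    and re-ranks the survivors by naturalness (lowest spoof score first)."""
    hyps = [[]]
    for candidates in candidates_per_step:
        expanded = [h + [c] for h in hyps for c in candidates]
        # Prune candidates the detector considers unnatural.
        survivors = [h for h in expanded if multires_score(h) < threshold]
        if not survivors:            # fall back rather than emit nothing
            survivors = expanded
        # Re-rank: most natural hypotheses first, then truncate to the beam.
        survivors.sort(key=multires_score)
        hyps = survivors[:beam]
    return hyps[0]
```

Because the base model's sampling is untouched and the detector only filters and re-orders candidate continuations, the same pattern could wrap any token-by-token codec language model, which is what makes the approach plug-and-play.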
- Proposes MSpoof-TTS, a training-free inference framework to fix artifacts in neural codec speech models.
- Uses a Multi-Resolution Spoof Detector to evaluate codec token sequences for unnatural patterns.
- Implements hierarchical decoding to prune bad candidates, boosting zero-shot synthesis quality without retraining.
Why It Matters
Enables higher-quality, more realistic AI voice generation for assistants, audiobooks, and media, using existing models.