SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
Frozen HuBERT and Whisper models cut WER by 10% at ultra-low bitrates...
A team led by Mingyu Zhao from CUHK and Shenzhen International Graduate School has unveiled SPG-Codec, a neural speech codec targeting ultra-low-bitrate speech coding. The paper, accepted at ICME 2026, systematically explores how frozen semantic priors, specifically HuBERT (acoustic-rich) and Whisper (high-level linguistic), can preserve intelligibility when bitrates drop to 1.5 kbps. At such extreme compression levels, conventional codecs suffer from 'semantic loss' rather than acoustic distortion, which makes the decoded speech unintelligible. SPG-Codec mitigates this by injecting pre-trained representations from these models into the coding pipeline, achieving a 10% relative reduction in Word Error Rate (WER) at 1.5 kbps compared to baselines.
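To make the headline metric concrete, here is a minimal Python sketch of how WER and a relative WER reduction are typically computed. It uses the standard word-level edit-distance definition of WER; the function names and the example figures (20% and 18%) are illustrative assumptions, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline`, e.g. 0.10 for 10%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline
```

With these made-up numbers, `relative_reduction(0.20, 0.18)` is about 0.10: a 10% relative reduction corresponds to a drop of only 2 percentage points in absolute WER.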
The study uncovers a critical 'Semantic Retirement' phenomenon: the benefits of semantic priors peak below 6 kbps and diminish rapidly beyond that threshold, revealing a practical capacity boundary for such integration. The authors also identify a clear trade-off between prior types: HuBERT better preserves prosodic and timbral details, while Whisper excels in noisy environments, reducing phonetic hallucination rates by 26% and narrowing the generalization gap for unseen speakers. To operationalize these findings, SPG-Codec employs a bitrate-aware regulation strategy that dynamically adjusts prior strength, balancing semantic consistency against perceptual naturalness. Experimental results show competitive intelligibility and noise robustness, pointing to a principled pathway for next-generation generative speech codecs in bandwidth-constrained applications such as satellite communications or hearing aids.
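The summary does not say how the bitrate-aware regulation is implemented, so the following Python sketch is purely hypothetical. It assumes a smooth sigmoid gate that keeps the prior near full strength at very low bitrates and fades it out around the 6 kbps boundary. The names `prior_strength` and `fuse`, the `sharpness` parameter, and the additive fusion are all assumptions made for illustration.

```python
import math


def prior_strength(bitrate_kbps: float,
                   retire_kbps: float = 6.0,
                   sharpness: float = 1.5) -> float:
    """Hypothetical gate in (0, 1) that decreases as the bitrate grows.

    retire_kbps is the bitrate at which the gate equals 0.5; the default of
    6.0 mirrors the boundary reported in the article. The sigmoid shape and
    the sharpness value are assumptions, not the authors' method.
    """
    if bitrate_kbps <= 0:
        raise ValueError("bitrate must be positive")
    return 1.0 / (1.0 + math.exp(sharpness * (bitrate_kbps - retire_kbps)))


def fuse(acoustic: list[float], semantic: list[float],
         bitrate_kbps: float) -> list[float]:
    """Add semantic-prior features to acoustic features, scaled by the gate."""
    if len(acoustic) != len(semantic):
        raise ValueError("feature vectors must have the same length")
    gate = prior_strength(bitrate_kbps)
    return [a + gate * s for a, s in zip(acoustic, semantic)]
```

Under these assumptions the gate stays above 0.99 at 1.5 kbps, equals 0.5 at 6 kbps, and falls below 0.01 at 12 kbps, so the prior dominates only where the article reports it helps.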
- SPG-Codec uses frozen HuBERT and Whisper priors to achieve a 10% relative WER reduction at 1.5 kbps
- The study identifies 'Semantic Retirement': semantic priors lose effectiveness beyond 6 kbps
- Whisper priors cut phonetic hallucinations by 26% in noisy environments, while HuBERT priors better preserve prosody and timbre
Why It Matters
Ultra-low-bitrate speech codecs enable reliable voice communication in bandwidth-scarce scenarios like satellite links or remote IoT devices.