Entropy-Guided GRVQ for Ultra-Low Bitrate Neural Speech Codec
A new neural codec uses entropy-guided grouping to improve speech quality for voice calls and speech AI models in bandwidth-limited scenarios.
A research team including Noboru Harada has published a paper on arXiv detailing a novel neural speech codec designed for ultra-low bitrate scenarios. The system, named Entropy-Guided Group Residual Vector Quantization (EG-GRVQ), tackles the core challenge of maintaining both high-fidelity audio reconstruction and accurate semantic modeling when bandwidth is severely constrained. This is critical for real-world applications such as mobile communication in poor signal areas or efficient storage of large speech datasets. The proposed architecture adopts a dual-branch design: one branch captures linguistic information, while the other, the acoustic branch, houses the key innovation.
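As context for the quantizer family named in the title, the sketch below shows generic group residual vector quantization in NumPy: channels are split into groups, and each group is coded by its own stack of residual codebooks. The group count, codebook sizes, and brute-force nearest-neighbor search are illustrative assumptions; the paper's dual-branch wiring and training objectives are not reproduced here.

```python
# A minimal sketch of group residual vector quantization (GRVQ).
# All sizes and function names here are illustrative assumptions.
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: quantize x with a stack of codebooks, each stage
    coding the residual left by the previous one. x: (frames, dim)."""
    residual = x.copy()
    indices = []
    for cb in codebooks:  # cb: (codebook_size, dim)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)         # nearest codeword per frame
        residual -= cb[idx]                # pass the residual onward
        indices.append(idx)
    return np.stack(indices), x - residual  # codes, quantized vectors

def grvq_encode(latents, groups, codebooks_per_group):
    """Group RVQ: run an independent residual quantizer on each channel
    group. latents: (frames, channels); groups: lists of channel indices."""
    all_codes, recon = [], np.zeros_like(latents)
    for g, cbs in zip(groups, codebooks_per_group):
        codes, quantized = rvq_encode(latents[:, g], cbs)
        all_codes.append(codes)
        recon[:, g] = quantized
    return all_codes, recon

# Example: 100 frames, 64 channels, 4 groups of 16, 2-stage RVQ per group.
rng = np.random.default_rng(0)
latents = rng.normal(size=(100, 64))
groups = [list(range(i, i + 16)) for i in range(0, 64, 16)]
codebooks = [[rng.normal(size=(256, 16)) for _ in range(2)] for _ in groups]
codes, recon = grvq_encode(latents, groups, codebooks)
print(len(codes), codes[0].shape, recon.shape)  # 4 (2, 100) (100, 64)
```

Splitting channels across independent quantizers keeps each codebook small and searchable, while the residual stages progressively recover fine acoustic detail.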
The technical breakthrough lies in the 'entropy-guided grouping' strategy within the acoustic branch. The method assumes encoder channel activations follow Gaussian statistics; since a Gaussian's differential entropy is a monotonic function of its variance, the variance of each channel can serve as a proxy for its information content. The encoder output is then partitioned so that each group carries an equal share of the total information. This balanced allocation reduces redundancy and improves the efficiency of the vector quantization codebooks. Trained and evaluated on the standard LibriTTS and VCTK datasets, the model demonstrates measurable improvements in perceptual quality and speech intelligibility metrics at ultra-low bitrates. The work explicitly focuses on codec-level fidelity for communication, paving the way for more robust voice calls and more efficient discrete token representations for downstream speech language models.
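A minimal sketch of how such variance-based grouping could work, assuming the stated Gaussian model. The greedy balancing heuristic and the names here (channel_entropies, entropy_balanced_groups, num_groups) are illustrative assumptions, not the authors' exact procedure.

```python
# Entropy-guided channel grouping under a Gaussian activation model.
# The greedy load-balancing rule is an assumed heuristic, not the paper's.
import numpy as np

def channel_entropies(latents: np.ndarray) -> np.ndarray:
    """Per-channel differential entropy under a Gaussian assumption.

    latents: array of shape (batch, channels, time).
    For a Gaussian, h = 0.5 * log(2 * pi * e * var), so variance alone
    determines each channel's estimated information content.
    """
    var = latents.var(axis=(0, 2))  # variance per channel
    return 0.5 * np.log(2 * np.pi * np.e * var + 1e-12)

def entropy_balanced_groups(entropies: np.ndarray,
                            num_groups: int) -> list[list[int]]:
    """Partition channels so each group carries a similar entropy share.

    Greedy heuristic: visit channels from highest to lowest entropy and
    assign each one to the currently lightest group.
    """
    order = np.argsort(entropies)[::-1]
    groups = [[] for _ in range(num_groups)]
    loads = np.zeros(num_groups)
    for ch in order:
        g = int(np.argmin(loads))
        groups[g].append(int(ch))
        loads[g] += entropies[ch]
    return groups

# Example: 64 encoder channels with uneven variances, 4 quantizer groups.
rng = np.random.default_rng(0)
scales = rng.uniform(0.1, 2.0, size=(1, 64, 1))
latents = rng.normal(scale=scales, size=(8, 64, 200))
h = channel_entropies(latents)
for i, g in enumerate(entropy_balanced_groups(h, num_groups=4)):
    print(f"group {i}: {len(g)} channels, entropy share {h[g].sum():.2f}")
```

Because differential entropy is monotonic in variance, sorting by variance alone would yield the same ordering; computing the entropy simply makes the equal-share objective explicit.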
- Uses a novel 'entropy-guided grouping' strategy, based on channel variance, to balance information across vector-quantization groups.
- Demonstrates improved perceptual quality and intelligibility on LibriTTS and VCTK datasets under ultra-low bitrate constraints.
- Maintains a dual-branch architecture separating linguistic semantics from acoustic details for downstream AI model compatibility.
Why It Matters
Enables clearer voice communication in low-bandwidth environments and creates more efficient data representations for training speech AI models.