Audio & Speech

New neural audio codec adjusts token resolution without retraining

One codec handles multiple temporal resolutions for better audio generation.

Deep Dive

Neural audio codecs (NACs) convert audio into discrete tokens for generation and understanding models. A key parameter is token temporal resolution (TTR), the time between consecutive token frames, which trades off detail capture against sequence length: a shorter TTR captures more detail but produces longer token sequences. Most NACs to date have been trained at a single TTR, so each resolution requires training a separate model. This limits flexibility and increases computational cost.
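To make the trade-off concrete, here is a minimal sketch of how sequence length scales with TTR. The TTR values, the clip length, and the `n_codebooks` parameter are illustrative assumptions, not figures from the paper.

```python
import math

def tokens_per_clip(duration_s: float, ttr_ms: float, n_codebooks: int = 1) -> int:
    """Count tokens for a clip, assuming one token frame every `ttr_ms` milliseconds.

    `n_codebooks` is a hypothetical knob for codecs that emit several tokens per frame.
    """
    n_frames = math.ceil(duration_s * 1000.0 / ttr_ms)
    return n_frames * n_codebooks

# Illustrative TTRs only: halving the resolution halves the sequence length.
for ttr_ms in (10.0, 20.0, 40.0):
    print(f"TTR {ttr_ms:>4} ms -> {tokens_per_clip(10.0, ttr_ms)} frames for a 10 s clip")
```

Under these assumptions, a downstream model that sees a 40 ms TTR processes a quarter as many frames as one that sees a 10 ms TTR, at the cost of coarser temporal detail.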

Nakamura et al. introduce a mechanism that allows a single NAC to operate at multiple TTRs by treating TTR as the sampling period of the token sequence. They use sampling-frequency-independent convolutional layers that generate TTR-specific kernels from a shared parameter set, adjusting kernel size and stride accordingly. The approach is integrated into the Descript Audio Codec without modifying the quantizer. In experiments on environmental sound reconstruction, the proposed model achieves better quality across resolutions than a baseline that switches between TTR-specific layers. This work, accepted for IWAENC 2026, could enable more efficient, adaptive audio codecs for real-time applications.
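The idea of deriving TTR-specific kernels from shared parameters can be illustrated with a toy, single-channel NumPy sketch. This is not the authors' parameterization: the piecewise-linear kernel shape, the `taps_per_stride` rule, and the function names `ttr_kernel` and `strided_conv1d` are all assumptions made for illustration. The only point it shows is that one set of weights can be resampled into kernels whose size and stride follow the requested TTR.

```python
import numpy as np

def ttr_kernel(shared_weights, ttr_s, input_period_s, taps_per_stride=2):
    """Derive a TTR-specific kernel and stride from one shared weight vector.

    Assumption: the shared weights describe a kernel shape over normalized
    time [0, 1]; we resample that shape with a tap count proportional to the
    stride, so kernel size and stride both scale with the TTR.
    """
    stride = int(round(ttr_s / input_period_s))
    kernel_size = taps_per_stride * stride
    u = (np.arange(kernel_size) + 0.5) / kernel_size   # midpoints in [0, 1]
    knots = np.linspace(0.0, 1.0, len(shared_weights))
    kernel = np.interp(u, knots, shared_weights) / kernel_size  # keep gain comparable
    return kernel, stride

def strided_conv1d(x, kernel, stride):
    """Plain strided correlation without padding (one output per token frame)."""
    n_frames = (len(x) - len(kernel)) // stride + 1
    return np.array([
        np.dot(x[i * stride : i * stride + len(kernel)], kernel)
        for i in range(n_frames)
    ])

shared = np.array([0.0, 1.0, 0.0])   # hypothetical shared parameters
x = np.ones(1600)                    # hypothetical features at a 1 ms period
for ttr_s in (0.010, 0.020):
    kernel, stride = ttr_kernel(shared, ttr_s, input_period_s=0.001)
    y = strided_conv1d(x, kernel, stride)
    print(f"TTR {ttr_s * 1000:.0f} ms: kernel size {len(kernel)}, stride {stride}, {len(y)} frames")
```

In this sketch the same three weights serve both resolutions, whereas a switching baseline of the kind the paper compares against would keep a separate layer per TTR. A real codec layer would be multi-channel and learned end to end, which this example leaves out.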

Key Points
  • Single codec model handles multiple token temporal resolutions (TTR) without retraining.
  • Uses sampling-frequency-independent conv layers that generate TTR-specific kernels from shared parameters.
  • Outperforms a baseline that switches between TTR-specific layers on environmental sound reconstruction, with better quality across resolutions.

Why It Matters

Flexible audio codecs reduce training overhead and improve token-based audio generation and understanding.
