wav2tok 2.0 boosts audio retrieval with pairwise token alignment
This scalable speech tokenizer beats BEST-STD while staying efficient and lightweight.
Audio retrieval systems often struggle with variable-length utterances and speaker variability. The original wav2tok used connectionist temporal classification (CTC)-based alignment to enforce token consistency, but it did not scale because clustering and alignment were trained as one tightly coupled process. wav2tok 2.0, accepted at INTERSPEECH 2026, decouples the two with staged training. First, it learns discriminative, speaker-invariant representations via contrastive learning and vector quantization (VQ) on the BEST-STD backbone. Then it enforces pairwise token consistency using a CTC alignment loss and a dynamic time warping (DTW)-aligned framewise prediction objective with adaptive weighting. This separation lets the model scale efficiently while keeping the alignment needed for accurate query-by-example spoken term detection (QbE-STD).
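To make the DTW-aligned framewise objective more concrete, here is a minimal sketch of classic DTW between two utterances of different lengths, followed by a helper that turns the alignment path into per-frame token targets. This is an illustration under stated assumptions, not the wav2tok 2.0 implementation: the function names `dtw_path` and `framewise_targets`, the Euclidean frame distance, and the toy one-dimensional features are all hypothetical, and the adaptive weighting and CTC loss are not shown.

```python
import math


def dtw_path(a, b):
    """Classic DTW between two sequences of feature vectors (lists of floats).

    Returns (total_cost, path), where path is a list of (i, j) pairs saying
    that frame i of `a` is aligned with frame j of `b`.
    """
    def dist(x, y):
        # Euclidean distance between two frames (an assumption for this sketch).
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal: advance both
            (cost[i - 1][j], i - 1, j),          # advance a only
            (cost[i][j - 1], i, j - 1),          # advance b only
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path


def framewise_targets(path, tokens_b, n_frames_a):
    """For each frame of utterance A, pick the token of its aligned frame in B.

    If a frame of A is aligned to several frames of B, the last one wins.
    """
    targets = [None] * n_frames_a
    for i, j in path:
        targets[i] = tokens_b[j]
    return targets


# Toy example: the same "word" spoken slowly (4 frames) and quickly (3 frames).
feats_a = [[0.0], [0.0], [1.0], [2.0]]
feats_b = [[0.0], [1.0], [2.0]]
tokens_b = [5, 7, 9]  # hypothetical token IDs assigned to utterance B

total_cost, path = dtw_path(feats_a, feats_b)
targets_for_a = framewise_targets(path, tokens_b, len(feats_a))
print(path)           # [(0, 0), (1, 0), (2, 1), (3, 2)]
print(targets_for_a)  # [5, 5, 7, 9]
```

In a training setup along these lines, the targets for utterance A would supervise a framewise prediction loss, pushing A's frames toward the tokens of their aligned frames in B.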
Experiments show wav2tok 2.0 consistently outperforms both the BEST-STD baseline and general-purpose audio tokenizers on QbE-STD benchmarks. It achieves higher retrieval accuracy while keeping computational cost low, which makes it practical for applications such as voice search, multimedia indexing, and speech analytics. Because its discrete tokens preserve similarity across utterances of different lengths, wav2tok 2.0 offers a scalable foundation for audio retrieval systems. The code and pretrained models are expected to be released, which should support further research and adoption.
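One way to see why length-robust discrete tokens help retrieval is to compare token sequences directly. The sketch below collapses consecutive repeated tokens, scores each candidate by normalized edit distance to the query, and ranks the results. It is an assumption-laden illustration, not the scoring used in the wav2tok 2.0 experiments: the function names, token IDs, and utterance names are hypothetical, and a real QbE-STD system would typically search for the query inside longer recordings rather than compare whole sequences.

```python
from itertools import groupby


def collapse_repeats(tokens):
    """Merge runs of identical tokens, e.g. [3, 3, 8, 8, 1] -> [3, 8, 1]."""
    return [tok for tok, _ in groupby(tokens)]


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def rank_by_token_similarity(query_tokens, corpus):
    """Return corpus keys sorted from most to least similar to the query.

    `corpus` maps an utterance name to its framewise token sequence.
    """
    q = collapse_repeats(query_tokens)
    scored = []
    for name, tokens in corpus.items():
        c = collapse_repeats(tokens)
        # Normalize by the longer sequence so scores fall in [0, 1].
        score = edit_distance(q, c) / max(len(q), len(c), 1)
        scored.append((score, name))
    return [name for _, name in sorted(scored)]


# Hypothetical token sequences; utt_a is a slower rendition of the query.
query = [3, 3, 8, 8, 8, 1]
corpus = {
    "utt_a": [3, 8, 8, 1, 1, 1, 1],
    "utt_b": [4, 4, 2, 9],
    "utt_c": [3, 8, 2],
}
ranking = rank_by_token_similarity(query, corpus)
print(ranking)  # ['utt_a', 'utt_c', 'utt_b']
```

Here `utt_a` ranks first even though it has a different number of frames than the query, because both collapse to the same token sequence.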
- Uses staged training: contrastive learning + VQ for speaker invariance, then CTC + DTW alignment for token consistency.
- Outperforms BEST-STD and general-purpose tokenizers on query-by-example spoken term detection benchmarks.
- Scalable architecture decouples clustering from alignment, enabling efficient training on large audio datasets.
Why It Matters
Enables faster, more accurate audio search for voice assistants, transcription services, and multimedia retrieval systems.