Audio & Speech

BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

New speech tokenizer uses noise-augmented training and optimal transport to improve voice search in noisy environments.

Deep Dive

Researchers Anup Singh, Vipul Arora, and Kris Demuynck developed BEST-STD2.0, a speech tokenizer for Spoken Term Detection (STD). It introduces noise- and reverberation-augmented training for robustness, along with optimal-transport-based regularization that encourages balanced usage of the token vocabulary. The system also adopts a TF-IDF-based search mechanism over the resulting token sequences. Evaluations show it outperforms STD baselines across distortion levels while maintaining high search efficiency, enabling faster and more accurate retrieval of spoken content from audio databases.
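The brief doesn't detail the exact indexing scheme, but the general idea of TF-IDF search over discrete speech tokens can be sketched roughly as follows. Here utterances are represented as sequences of integer token IDs (standing in for the tokenizer's output), and token bigrams serve as the "terms"; the toy database, the bigram choice, and the scoring are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def tfidf_index(docs):
    """Build TF-IDF vectors over token-bigram terms.

    docs: list of token-ID sequences (hypothetical tokenizer output).
    Returns (per-utterance TF-IDF vectors, IDF table).
    """
    bigram_counts = [Counter(zip(d, d[1:])) for d in docs]
    n = len(docs)
    df = Counter(g for c in bigram_counts for g in c)  # document frequency
    idf = {g: math.log(n / df[g]) for g in df}
    vecs = []
    for c in bigram_counts:
        total = sum(c.values())
        vecs.append({g: (cnt / total) * idf[g] for g, cnt in c.items()})
    return vecs, idf

def search(query, vecs, idf):
    """Score each utterance against a query token sequence by
    a dot product of TF-IDF-weighted bigram vectors."""
    q = Counter(zip(query, query[1:]))
    return [
        sum(v.get(g, 0.0) * cnt * idf.get(g, 0.0) for g, cnt in q.items())
        for v in vecs
    ]

# Toy database of three "utterances" as token-ID sequences.
db = [[1, 2, 3, 4, 2, 3], [5, 6, 7, 8], [2, 3, 9, 1, 2, 3]]
vecs, idf = tfidf_index(db)
scores = search([2, 3], vecs, idf)  # query: the bigram (2, 3)
```

Utterances containing the query bigram score positively, while the second utterance (which never produces it) scores zero, so retrieval reduces to ranking precomputed sparse vectors rather than rescanning raw audio.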

Why It Matters

Enables more reliable voice search and audio content retrieval in real-world, noisy environments like smart speakers and call centers.