BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection
New speech tokenizer uses noise-augmented training and optimal transport to improve voice search in noisy environments.
Researchers Anup Singh, Vipul Arora, and Kris Demuynck developed BEST-STD2.0, a speech tokenizer for Spoken Term Detection (STD), the task of locating a spoken query term within a collection of audio. It introduces two training changes: augmentation with noise and reverberation, so that tokens stay robust under distortion, and an optimal-transport-based regularizer that balances how often each token is used. For retrieval, the system adopts a TF-IDF search mechanism. Evaluations show that it outperforms STD baselines across distortion levels while keeping search efficient, enabling faster and more accurate retrieval of spoken content from audio databases.
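To make the TF-IDF search idea concrete, here is a minimal sketch of how retrieval over tokenized speech could work. It is an illustration under stated assumptions, not the authors' implementation: the summary says only that a TF-IDF mechanism is used, so the token bigrams as "terms", the smoothed IDF formula, the cosine scoring, and every function name and token ID below are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Split a token-ID sequence into overlapping n-grams (the 'terms')."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(docs, n=2):
    """Build L2-normalised TF-IDF vectors for each tokenized utterance.

    docs maps an utterance name to its list of discrete token IDs.
    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    term_counts = {name: Counter(ngrams(toks, n)) for name, toks in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each term counted once per utterance
    num_docs = len(docs)
    idf = {t: math.log((1 + num_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    vectors = {}
    for name, counts in term_counts.items():
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors[name] = {t: v / norm for t, v in vec.items()} if norm else {}
    return idf, vectors


def search(query_tokens, idf, vectors, n=2):
    """Rank utterances by cosine similarity to the tokenized query."""
    counts = Counter(ngrams(query_tokens, n))
    # Terms never seen at indexing time carry no weight.
    query = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(v * v for v in query.values()))
    if norm == 0:
        return []
    scores = []
    for name, vec in vectors.items():
        score = sum(w * vec.get(t, 0.0) for t, w in query.items()) / norm
        if score > 0:
            scores.append((name, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical token sequences standing in for tokenizer output.
docs = {
    "utt_a": [3, 7, 7, 1, 4, 9],
    "utt_b": [5, 2, 8, 8, 6],
    "utt_c": [3, 7, 1, 4, 2],
}
idf, vectors = build_index(docs)
results = search([7, 1, 4], idf, vectors)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

A scheme like this also suggests why balanced token usage could matter for search: if a handful of tokens dominated every utterance, their terms would appear almost everywhere and carry little IDF weight, leaving the index with less power to tell utterances apart.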
Why It Matters
Enables more reliable voice search and audio content retrieval in noisy real-world applications such as smart speakers and call centers.