TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation
New AI model reduces parameters by 94.3% and computational cost by 95.3% while beating SOTA.
A research team from Tsinghua University has introduced TIGER (Time-frequency Interleaved Gain Extraction and Reconstruction), a speech separation model designed for extreme efficiency. Accepted at ICLR 2025, TIGER targets low-latency speech processing, where the computational cost of existing models is a major obstacle. The model uses prior knowledge to divide the spectrum into frequency bands and compress them, and employs a multi-scale selective attention module and a full-frequency-frame attention module to capture contextual information. The team also released EchoSet, a benchmark dataset with realistic acoustic conditions such as noise, reverberation, and object occlusions, to better evaluate models in complex, real-world environments.
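To make the band-splitting idea concrete, here is a minimal sketch in Python. It is not TIGER's implementation: the bin count, the band plan (narrow bands at low frequencies, wider ones higher up), the feature size, and the helper names `make_band_edges` and `compress_bands` are all assumptions chosen for illustration, and a random projection stands in for whatever learned layers the real model uses.

```python
import numpy as np

def make_band_edges(n_bins, plan):
    """Build band boundaries from a plan of (width_in_bins, count) pairs.

    Any bins left over after the plan are grouped into one final band.
    """
    edges = [0]
    for width, count in plan:
        for _ in range(count):
            nxt = edges[-1] + width
            if nxt > n_bins:
                raise ValueError("band plan exceeds the number of frequency bins")
            edges.append(nxt)
    if edges[-1] < n_bins:
        edges.append(n_bins)
    return edges

def compress_bands(spec, edges, dim, rng):
    """Map each band of a (n_bins, n_frames) array to `dim` features per frame.

    A random matrix stands in for a learned projection (assumption).
    Returns an array of shape (n_bands, dim, n_frames).
    """
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        proj = rng.standard_normal((dim, hi - lo)) / np.sqrt(hi - lo)
        out.append(proj @ spec[lo:hi])
    return np.stack(out)

rng = np.random.default_rng(0)
n_bins, n_frames = 257, 100           # hypothetical spectrogram size
spec = np.abs(rng.standard_normal((n_bins, n_frames)))

# Hypothetical plan: fine resolution at low frequencies, coarse at high ones.
plan = [(2, 32), (8, 16), (16, 3)]
edges = make_band_edges(n_bins, plan)
features = compress_bands(spec, edges, dim=16, rng=rng)

print(len(edges) - 1, "bands ->", features.shape)
```

In this sketch, 257 frequency bins collapse into 52 bands of equal feature size, so later layers operate on far fewer frequency positions than the raw spectrogram has.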
TIGER's architecture yields large efficiency gains: it reduces the number of parameters by 94.3% and the computational cost (measured in multiply-accumulate operations, or MACs) by 95.3% relative to TF-GridNet, the previous state-of-the-art model. Despite the smaller footprint, it still surpasses TF-GridNet's performance, particularly when trained and tested on the new EchoSet data. EchoSet is a significant contribution in its own right: models trained on it generalized better to physical-world recordings than models trained on existing datasets. Together, a highly efficient model and a more realistic evaluation framework pave the way for deploying advanced speech separation in resource-constrained, real-time applications.
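The percentage reductions can be restated as "times smaller" factors with a line of arithmetic. The short Python sketch below does only that conversion; the function name `reduction_to_factor` is hypothetical, and the only inputs are the two percentages reported above.

```python
def reduction_to_factor(reduction_pct):
    """Convert a percentage reduction into a 'times smaller' factor."""
    remaining = 1.0 - reduction_pct / 100.0   # fraction that is left
    return 1.0 / remaining

params_factor = reduction_to_factor(94.3)
macs_factor = reduction_to_factor(95.3)

print(f"parameters: about {params_factor:.1f}x fewer")  # about 17.5x fewer
print(f"MACs: about {macs_factor:.1f}x fewer")          # about 21.3x fewer
```

In other words, the reported figures correspond to roughly one seventeenth of the parameters and one twenty-first of the MACs of the baseline.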
- Uses 94.3% fewer parameters and 95.3% fewer MACs than the previous SOTA model, TF-GridNet.
- Outperforms the previous SOTA model TF-GridNet, especially when trained on the new EchoSet dataset.
- Introduces the EchoSet dataset with realistic noise and reverberation for better real-world model evaluation.
Why It Matters
Could enable high-quality, real-time speech separation on low-power devices such as hearing aids, smart speakers, and conferencing systems.