Inter-Speaker Relative Cues for Two-Stage Text-Guided Target Speech Extraction
A new two-stage AI system uses natural-language comparisons between speakers to extract target speech more accurately than existing single-stage text-guided methods.
Researchers Wang Dai, Archontis Politis, and Tuomas Virtanen from Tampere University have published a paper introducing a novel two-stage framework for Text-guided Target Speech Extraction (TSE). The core innovation is the use of 'inter-speaker relative cues' (descriptive comparisons such as "the louder speaker" or "the one who speaks after") in place of absolute speaker labels. The system first applies a speech separation model to recover all candidate speaker sources from the mixed audio track. In the second stage, a text-guided classifier compares these sources against the provided relative textual description and selects the target speaker's voice. The approach is grounded in principles of human perception: relative comparisons preserve fine-grained distinctions that categorical, absolute descriptions often lose.
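To make the two-stage idea concrete, here is a minimal, hypothetical sketch in Python. The paper's stage-one separation model is stubbed out (we assume it already produced one waveform per candidate speaker), and the learned text-conditioned classifier is replaced by a toy rule table over two example cues; all function names, the RMS loudness proxy, and the onset threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Stage 1 (speech separation) is assumed to have already run and
# returned one waveform per candidate speaker.

def rms_loudness(wave: np.ndarray) -> float:
    """Root-mean-square level as a crude loudness proxy (assumption)."""
    return float(np.sqrt(np.mean(wave ** 2)))

def onset_time(wave: np.ndarray, sr: int, threshold: float = 0.01) -> float:
    """Time in seconds of the first sample exceeding the threshold."""
    idx = int(np.argmax(np.abs(wave) > threshold))
    return idx / sr

def select_by_relative_cue(sources, sr: int, cue: str) -> int:
    """Stage 2 stand-in: pick the source index matching a relative cue.

    Only two toy cues are handled; the paper's classifier is a learned
    text-conditioned model, not a hand-written rule table like this.
    """
    if cue == "the louder speaker":
        return max(range(len(sources)), key=lambda i: rms_loudness(sources[i]))
    if cue == "the one who speaks after":
        return max(range(len(sources)), key=lambda i: onset_time(sources[i], sr))
    raise ValueError(f"unsupported cue: {cue}")

# Toy "separated" sources: speaker 0 is quiet and starts immediately,
# speaker 1 is loud and starts half a second later.
sr = 16000
t = np.arange(sr) / sr
quiet_early = 0.1 * np.sin(2 * np.pi * 220 * t)
loud_late = np.concatenate(
    [np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 330 * t[: sr // 2])]
)
sources = [quiet_early, loud_late]

print(select_by_relative_cue(sources, sr, "the louder speaker"))        # 1
print(select_by_relative_cue(sources, sr, "the one who speaks after"))  # 1
```

Both cues resolve to speaker 1 here, illustrating how different relative descriptions can independently identify the same target once the mixture has been separated into candidates.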
The experimental results show significant performance gains. The two-stage framework with relative cues substantially outperforms existing single-stage, text-conditioned extraction methods on both signal-level and perceptual metrics. Notably, certain combinations of relative cues (language, gender, loudness, and temporal order) even surpassed a baseline audio-only TSE system that uses no text guidance at all. These results suggest that natural language, when used to describe relationships between speakers, provides a powerful and intuitive control mechanism. The work also identifies which cue types are most discriminative, paving the way for more robust and user-friendly audio editing and meeting transcription tools where users can simply describe who they want to hear.
- Uses a novel two-stage process: separate all speakers first, then classify target with text.
- Employs 'relative cues' (e.g., 'louder than', 'spoke after'), which outperform absolute speaker labels in extraction accuracy.
- The framework beats single-stage text-guided methods and, with the right cue combinations, can surpass an audio-only TSE baseline.
Why It Matters
Enables precise audio editing and meeting transcription using intuitive natural language commands instead of complex software.