G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition
New Speech-LLM system from Chinese researchers tackles overlapping speech with global identity consistency.
A research team from China, including lead authors Jing Peng and Shuai Wang, has proposed G-STAR (Global Speaker-Tracking Attributed Recognition), a novel end-to-end AI system designed to transcribe complex, multi-speaker meetings. The core innovation addresses a persistent challenge in audio processing: accurately transcribing long-form conversations where people talk over each other (overlapping speech) while consistently labeling who said what across the entire session. Previous Speech-LLM systems typically handled either local speaker diarization within short chunks or global labeling, but not both, sacrificing precise timing or robust identity linking between segments.
G-STAR's architecture combines a dedicated time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides the LLM with structured, temporally grounded cues about speaker identity and activity; conditioned on these cues, the LLM generates the final attributed transcript. This design supports both component-wise optimization and joint end-to-end training, making it flexible under heterogeneous supervision and adaptable to different acoustic domains. The paper, submitted to Interspeech 2026, details experiments analyzing cue-fusion methods and the trade-offs between local accuracy and long-context understanding.
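To make the cue-conditioning idea concrete, here is a minimal sketch of one plausible fusion strategy: serializing the tracker's temporally grounded speaker cues into a text prefix the LLM conditions on. The paper does not publish code; the class name, tag format, and `render_cues` function below are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch only: names and the cue tag format are assumptions,
# not taken from the G-STAR paper.

@dataclass
class SpeakerCue:
    """A temporally grounded cue emitted by the tracking module."""
    speaker_id: str   # globally consistent label, e.g. "spk1"
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds

def render_cues(cues: list[SpeakerCue]) -> str:
    """Serialize tracker cues into a text prefix for prompt-level fusion.

    Each cue becomes a structured tag the Speech-LLM can attend to when
    deciding which speaker to attribute each transcribed span to.
    """
    return " ".join(f"<{c.speaker_id}|{c.start:.2f}-{c.end:.2f}>" for c in cues)

cues = [
    SpeakerCue("spk1", 0.00, 2.35),
    SpeakerCue("spk2", 1.80, 4.10),  # overlaps spk1: the overlapping-speech case
]
prompt_prefix = render_cues(cues)
# → "<spk1|0.00-2.35> <spk2|1.80-4.10>"
```

Note how the overlapping time ranges are explicit in the prefix, which is the kind of precise timing information the authors argue earlier chunk-based systems lose.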
- Solves overlapping speech in meetings by combining speaker tracking with a Speech-LLM for end-to-end transcription.
- Maintains global speaker identity consistency across long conversations, a weakness in prior chunk-based systems.
- Enables flexible training under heterogeneous supervision, improving robustness to domain shifts in audio data.
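The global-consistency point above can be illustrated by the stitching problem chunk-based systems face: each chunk yields local speaker labels, and a separate linker must merge them into global identities, typically by embedding similarity. The sketch below is a generic illustration of that fragile step (not code from the paper); G-STAR's tracker is designed to maintain global identities directly instead.

```python
import math

# Illustrative only: a naive cross-chunk speaker linker of the kind
# chunk-based pipelines rely on. All names here are hypothetical.

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def link_speakers(global_bank, chunk_embs, threshold=0.7):
    """Map each chunk-local speaker embedding to a global speaker ID.

    If no enrolled speaker is similar enough, a new global ID is created.
    Errors here accumulate over a long meeting, which is the weakness
    global tracking aims to remove.
    """
    mapping = {}
    for local_id, emb in chunk_embs.items():
        best_id, best_sim = None, threshold
        for gid, g_emb in global_bank.items():
            sim = cosine(emb, g_emb)
            if sim > best_sim:
                best_id, best_sim = gid, sim
        if best_id is None:  # unseen speaker: enroll a new global identity
            best_id = f"spk{len(global_bank) + 1}"
            global_bank[best_id] = emb
        mapping[local_id] = best_id
    return mapping

bank = {"spk1": [1.0, 0.0]}
mapping = link_speakers(bank, {"A": [0.9, 0.1], "B": [0.0, 1.0]})
# "A" links to the existing spk1; "B" is enrolled as a new speaker spk2
```

A single mis-link in this loop relabels every later utterance of that speaker, which is why identity consistency over the whole session is the headline metric here.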
Why It Matters
Delivers highly accurate, speaker-labeled meeting minutes automatically, saving hours of manual transcription and review.