Open Source

Cohere Transcribe fine-tuned for speaker diarization and timestamps

Open-source STT now labels speakers and emits timestamps accurate to 0.097 seconds on average.

Deep Dive

Until now, Cohere Transcribe (widely regarded as the best open-source speech-to-text model and a rival to proprietary solutions) lacked support for speaker diarization, meaning identifying who spoke when, and for timestamps, even though the necessary tokens already existed in its tokenizer. A Reddit user, iamMess, closed that gap by fine-tuning the model to output both. Each segment in the resulting output is a speaker token, a start timestamp, the text, and an end timestamp, e.g., <|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>, which is easy to parse programmatically.
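As an illustration of how that output could be parsed, here is a minimal Python sketch. It assumes every segment follows the shape of the single example above (a speaker token, a start timestamp, the text, then an end timestamp); the Segment class and the parse_segments function are hypothetical names and not part of the released scripts.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: int   # index taken from <|spltokenN|>
    start: float   # seconds, from the opening <|t:...|>
    end: float     # seconds, from the closing <|t:...|>
    text: str


# Assumed layout, inferred from the one example in the article:
# <|spltokenN|><|t:START|> text <|t:END|>
SEGMENT_RE = re.compile(
    r"<\|spltoken(\d+)\|>"
    r"<\|t:(\d+(?:\.\d+)?)\|>"
    r"(.*?)"
    r"<\|t:(\d+(?:\.\d+)?)\|>",
    re.DOTALL,
)


def parse_segments(raw: str) -> list[Segment]:
    """Turn raw model output into a list of speaker-labelled segments."""
    return [
        Segment(
            speaker=int(m.group(1)),
            start=float(m.group(2)),
            end=float(m.group(4)),
            text=m.group(3).strip(),
        )
        for m in SEGMENT_RE.finditer(raw)
    ]


if __name__ == "__main__":
    sample = (
        "<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
        "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>"
    )
    for seg in parse_segments(sample):
        print(f"[{seg.start:.1f}-{seg.end:.1f}] speaker {seg.speaker}: {seg.text}")
```

Output that deviates from this layout, such as a segment with no closing timestamp, is silently skipped by the regular expression, so a real pipeline would probably want to detect and report unmatched text.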

The reported numbers are strong: timestamps are accurate to within 0.097 seconds on average, and 90% fall within 0.006 seconds, which implies that a small share of larger misses accounts for most of the average error. The model handles up to 4 speakers per 30-second segment, and the accompanying diarize_long.py script extends this to as many as 32 speakers across longer recordings. The fine-tuned model and scripts are available for free on Hugging Face, putting speaker-labelled, timestamped transcription within reach of anyone.

Key Points
  • Timestamp error averages 0.097 seconds; 90% of timestamps are within 0.006 seconds.
  • Supports up to 4 speakers per 30-second segment, extendable to 32 across longer recordings with the diarize_long.py script.
  • Available free on Hugging Face, with an output format that is easy to parse.

Why It Matters

Brings speaker identification and precise timestamps to a leading open-source STT model, enabling accurate meeting transcription and podcast analysis.