Open Source

Cohere Transcribe fine-tuned for speaker diarization and timestamps

r/LocalLLaMA May 22, 2026

⚡ Open-source STT now identifies speakers and timestamps speech to within 0.097 s on average.

Deep Dive

Until now, Cohere Transcribe (widely regarded as the best open-source speech-to-text model, rivaling proprietary solutions) lacked support for speaker diarization (identifying who spoke when) and word-level timestamps, even though the necessary tokens already existed in its tokenizer. A Reddit user, iamMess, closed this gap by fine-tuning the model to output both. In the example given, each segment opens with a speaker token and is bracketed by a start and an end timestamp, e.g., <|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>, which is easy to parse programmatically.
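To show how little code the parsing takes, here is a minimal Python sketch based only on the example string above. It assumes every segment has the shape speaker token, start timestamp, text, end timestamp; the names Segment and parse_segments are hypothetical and are not part of the released scripts, and real model output may contain patterns this sketch does not cover.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    speaker: int   # index taken from <|spltokenN|>
    start: float   # seconds
    end: float     # seconds
    text: str


# Assumed segment shape, inferred from the single example in the article:
# <|spltokenN|><|t:START|> text <|t:END|>
SEGMENT_RE = re.compile(
    r"<\|spltoken(\d+)\|>"       # speaker index
    r"<\|t:(\d+(?:\.\d+)?)\|>"   # start time in seconds
    r"(.*?)"                     # transcript text (non-greedy)
    r"<\|t:(\d+(?:\.\d+)?)\|>",  # end time in seconds
    re.DOTALL,
)


def parse_segments(raw: str) -> list[Segment]:
    """Extract (speaker, start, end, text) tuples from tagged output."""
    return [
        Segment(int(spk), float(start), float(end), text.strip())
        for spk, start, text, end in SEGMENT_RE.findall(raw)
    ]


sample = (
    "<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
    "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>"
)

for seg in parse_segments(sample):
    print(f"speaker {seg.speaker} [{seg.start:.1f}-{seg.end:.1f}s]: {seg.text}")
# speaker 0 [0.0-1.5s]: Welcome back.
# speaker 1 [1.5-2.4s]: Thanks.
```

A regular expression is enough here because the tags are flat and never nested. If a speaker turn can contain several timestamped spans, the pattern would need to be adapted.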

The reported performance is impressive: timestamps are accurate to within 0.097 seconds on average, and 90% fall within 0.006 seconds. Read together, the two figures imply that most timestamps are nearly exact while a small share of outliers pull the average up. The model handles up to 4 speakers per 30-second segment, and with the accompanying diarize_long.py script it can identify up to 32 speakers across longer recordings. The fine-tuned model and scripts are available for free on HuggingFace, making professional-grade speech transcription with speaker segmentation accessible to anyone.

Key Points

Timestamps are accurate to within 0.097 seconds on average, and 90% fall within 0.006 seconds.
Supports up to 4 speakers per 30-second segment, extendable to 32 across longer recordings with the diarize_long.py script.
Available free on HuggingFace, with an output format that is easy to parse.

Why It Matters

The fine-tune brings professional speaker identification and precise timestamps to a top open-source STT model, enabling accurate meeting transcription and podcast analysis.
