Open Source

Omi Med STT v1: Fine-tuned Parakeet 0.6B approaches cloud ASR accuracy at 145× realtime

Open-weight medical ASR runs locally and cuts medical WER (M-WER) 3.5× versus base Parakeet.

Deep Dive

Omi Health's founder has released Omi Med STT v1, a fine-tuned version of NVIDIA's 0.6B-parameter Parakeet TDT v2 model for clinical speech recognition. The model is open-weight under CC-BY-4.0 and is designed to run locally on Mac (MLX), CUDA GPUs (NeMo), or CPU (GGUF/parakeet.cpp), automatically selecting the backend best suited to the host. The goal is to approach cloud ASR accuracy while keeping patient audio on-device.
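
The backend choice described above can be pictured as a simple priority check. The sketch below is a hypothetical illustration, not the project's actual selection logic: the function name `pick_backend`, the priority order, and the `nvidia-smi` heuristic for detecting a CUDA-capable GPU are all assumptions.

```python
import importlib.util
import platform
import shutil


def pick_backend(system: str, machine: str, has_mlx: bool, has_cuda: bool) -> str:
    """Return 'mlx', 'nemo', or 'gguf' using an assumed priority order."""
    if system == "Darwin" and machine == "arm64" and has_mlx:
        return "mlx"   # Apple Silicon Mac with MLX installed
    if has_cuda:
        return "nemo"  # NVIDIA GPU available
    return "gguf"      # CPU fallback (GGUF / parakeet.cpp)


if __name__ == "__main__":
    backend = pick_backend(
        system=platform.system(),
        machine=platform.machine(),
        # Assumption: MLX counts as available if the `mlx` package is importable.
        has_mlx=importlib.util.find_spec("mlx") is not None,
        # Assumption: an `nvidia-smi` binary on PATH is a rough proxy for CUDA.
        has_cuda=shutil.which("nvidia-smi") is not None,
    )
    print(f"selected backend: {backend}")
```

Passing the detected capabilities in as arguments keeps the decision itself easy to test on any machine.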

Benchmarked on 1,513 clips (7.18 hours) of held-out medical audio, Omi Med STT v1 achieved a medical WER (M-WER, which counts errors on clinical terms only) of 2.37% and an overall WER of 8.30%. Compared with the base Parakeet model, M-WER improved 3.5× (from 8.36%) and drug-name false positives dropped from 131 to 9. Among open models, only VibeVoice-ASR 9B had a lower M-WER (1.78%), but Omi's 0.6B model is 15× smaller, and faster. Against cloud APIs, Omi Med STT v1 comes close to ElevenLabs Scribe v2 (1.39% M-WER) and Gemini 3.1 Pro (1.65%), and beats GPT-4o Mini Transcribe (3.55%). Notably, Gemini models hallucinated fake medical consultations on 33 to 87 of 420 benign clips, whereas Omi had zero such failures. The model uses q8 quantization by default; a q4 version was tested but dropped because it degraded drug-name accuracy.
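
To make the two metrics concrete, here is a minimal sketch of word-level WER and a term-restricted M-WER. It assumes M-WER is the share of reference clinical terms that were substituted or deleted; the benchmark's exact definition, text normalization, and term lexicon are not given in the source, so `medical_wer`, the `terms` set, and the example sentences are illustrative only.

```python
def align(ref, hyp):
    """Levenshtein-align two token lists; return a list of (op, ref_tok, hyp_tok)."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            ops.append(("ok" if cost == 0 else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def wer(ref, hyp):
    """Overall WER: (substitutions + deletions + insertions) / reference length."""
    errors = sum(1 for op, _, _ in align(ref, hyp) if op != "ok")
    return errors / len(ref)


def medical_wer(ref, hyp, terms):
    """Assumed M-WER: share of reference clinical terms that were substituted or deleted."""
    ops = align(ref, hyp)
    total = sum(1 for tok in ref if tok in terms)
    errors = sum(1 for op, r, _ in ops if r in terms and op in ("sub", "del"))
    return errors / total


ref = "patient started metformin and lisinopril today".split()
hyp = "patient started metformin and listen april today".split()
terms = {"metformin", "lisinopril"}  # hypothetical clinical-term lexicon

print(f"WER:   {wer(ref, hyp):.3f}")
print(f"M-WER: {medical_wer(ref, hyp, terms):.3f}")
```

Under this definition, inserted clinical terms (false positives) do not count toward M-WER, which is one reason a separate count such as drug-name false positives is useful alongside it.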

Key Points
  • Omi Med STT v1 achieves 2.37% medical WER, a 3.5× improvement over the base Parakeet 0.6B (8.36%) and close to cloud systems such as ElevenLabs Scribe v2 (1.39%).
  • Runs at 145× realtime on an A10 GPU and ~68× on Apple Silicon Macs; local execution adds a structural latency edge over cloud APIs.
  • Drug-name false positives fell from 131 to 9; q4 quantization was abandoned to preserve drug-name accuracy.
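
The headline figures translate into simple arithmetic. The sketch below assumes that "N× realtime" means audio duration divided by processing time, and that throughput would hold steady across the full 7.18-hour benchmark set; the source does not say the speeds were measured on that set, so the timings are illustrative.

```python
AUDIO_HOURS = 7.18                  # benchmark set duration from the article
BASE_MWER, TUNED_MWER = 8.36, 2.37  # M-WER in percent, base vs fine-tuned


def processing_seconds(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe `audio_hours` at a given realtime factor."""
    return audio_hours * 3600 / realtime_factor


improvement = BASE_MWER / TUNED_MWER
a10_seconds = processing_seconds(AUDIO_HOURS, 145)  # A10 GPU figure
mac_seconds = processing_seconds(AUDIO_HOURS, 68)   # approximate Apple Silicon figure

print(f"M-WER improvement: {improvement:.2f}x")
print(f"A10 at 145x realtime: {a10_seconds:.0f} s for the full set")
print(f"Mac at ~68x realtime: {mac_seconds:.0f} s for the full set")
```

On these assumptions, the whole benchmark set would take roughly three minutes on the A10 and a little over six minutes on a Mac.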

Why It Matters

On-device medical transcription can now approach cloud API accuracy while keeping patient data private and running at up to 145× realtime.