Voxtral Realtime
This new streaming model could make real-time transcription as accurate as offline transcription.
Mistral AI researchers have introduced Voxtral Realtime, a natively streaming automatic speech recognition model that matches the accuracy of OpenAI's Whisper, an offline model, at a delay of just 480 ms. Unlike offline models adapted for streaming, it is trained end-to-end for streaming with explicit alignment between audio and text. The model, pretrained on a dataset spanning 13 languages, is released under the Apache 2.0 license, making high-quality, low-latency transcription widely accessible for real-time applications.
Why It Matters
It could let applications such as live captioning and voice assistants reach offline-level accuracy with sub-second delay, potentially replacing pipelines that currently depend on offline models.