Research & Papers

Voxtral Realtime

This new streaming model could make real-time transcription as good as offline.

Deep Dive

Mistral AI researchers have introduced Voxtral Realtime, a natively streaming automatic speech recognition (ASR) model that reportedly matches the accuracy of OpenAI's Whisper, an offline system, at a delay of just 480 ms. Unlike offline models adapted for streaming, it is trained end-to-end for streaming with explicit audio-text alignment. The model is pretrained on a dataset spanning 13 languages and released under the Apache 2.0 license, making high-quality, low-latency transcription widely accessible for real-time applications.

Why It Matters

It could let applications such as live captioning and voice assistants reach offline-level accuracy at sub-second delay, potentially reducing the need for separate offline transcription systems.