Open Source

We collected 135 phrases Whisper hallucinates during silence — here's what it says when nobody's talking and how we stopped it

OpenAI's Whisper generates coherent text during silence, including dangerous loops and violent content.

Deep Dive

Vexa.ai's engineering team, after processing thousands of hours of production audio with their open-source meeting bot, discovered a critical flaw in OpenAI's widely used Whisper speech recognition model: it doesn't stay silent during silence. Instead, the model's decoder, trained on 680K hours of audio scraped from the web, generates coherent, confident text from its training distribution when no speech is present. The result is specific, repeatable hallucinations such as "Thanks for watching!" (from YouTube outros), subtitle watermarks ("Amara.org community"), and, most alarmingly, infinite repetition loops like "Thank you, Mr. President..." or "I'm going to be a bad person..." OpenAI's own `no_speech_prob` flag is documented as "not very accurate," leaving developers to build their own defenses.
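As a first cut, a pipeline can still threshold that flag per segment, though the article's point is that it's too unreliable to stand alone. A minimal sketch, where the 0.6 threshold and the segment dictionary shape are assumptions for illustration, not Vexa.ai's code:

```python
# Hypothetical per-segment filter on Whisper's no_speech_prob score.
# The 0.6 cutoff is an assumed starting point; OpenAI documents the flag
# as "not very accurate," so this cannot be the only line of defense.
NO_SPEECH_THRESHOLD = 0.6  # tune per deployment

def keep_segment(segment: dict) -> bool:
    """Drop segments Whisper itself marks as probable silence."""
    return segment.get("no_speech_prob", 0.0) < NO_SPEECH_THRESHOLD

segments = [
    {"text": "Hello everyone.", "no_speech_prob": 0.02},
    {"text": "Thanks for watching!", "no_speech_prob": 0.91},  # classic silence hallucination
]
kept = [s["text"] for s in segments if keep_segment(s)]
# kept == ["Hello everyone."]
```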

Vexa.ai's production-tested mitigation stack is a multi-layered defense. First, they use Silero VAD (Voice Activity Detection) as a pre-gate, preventing Whisper from even processing non-speech audio. Key configuration changes include setting `condition_on_previous_text=False` to stop hallucination cascades and `beam_size=1` for greedy decoding that fails faster. They also maintain exact-string blocklists (135 entries in English) and implement repeated-output detection to break loops. The team notes this is a fundamental architectural issue; CTC/transducer models like Deepgram Nova output blank tokens for silence by design, whereas Whisper's decoder must always generate text. Research from the FAccT 2024 'Careless Whisper' paper underscores the severity, finding 38% of hallucinated segments contained violent or harmful content, posing genuine risks in medical or legal transcription contexts.
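The last two layers described above, the exact-string blocklist and repeated-output detection, can be sketched as a pure post-filter over transcribed segments. `KNOWN_HALLUCINATIONS` and `max_repeats` here are illustrative stand-ins, not Vexa.ai's actual 135-entry list or thresholds:

```python
# Illustrative layered post-filter, in the spirit of the mitigation stack
# described in the article. The blocklist entries and repeat limit below
# are assumptions, not Vexa.ai's production values.
KNOWN_HALLUCINATIONS = {  # exact-string blocklist (135 entries in production)
    "Thanks for watching!",
    "Subtitles by the Amara.org community",
}

def filter_transcript(segments: list[str], max_repeats: int = 3) -> list[str]:
    """Drop blocklisted phrases and break runaway repetition loops."""
    out: list[str] = []
    run = 0  # consecutive repeats of the most recent kept segment
    for text in segments:
        text = text.strip()
        if text in KNOWN_HALLUCINATIONS:
            continue  # layer 1: exact-string blocklist
        if out and text == out[-1]:
            run += 1
            if run >= max_repeats:
                continue  # layer 2: repeated-output detection breaks the loop
        else:
            run = 0
        out.append(text)
    return out

clean = filter_transcript([
    "Hello everyone.",
    "Thanks for watching!",        # blocklisted
    "Thank you, Mr. President.",
    "Thank you, Mr. President.",
    "Thank you, Mr. President.",
    "Thank you, Mr. President.",   # loop cut off after max_repeats copies
])
```

Exact-string matching is deliberately conservative: it catches the stereotyped phrases without risking removal of legitimate speech that merely resembles them.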

Key Points
  • Vexa.ai catalogued 135 specific phrases Whisper hallucinates during silence, including dangerous infinite loops and violent content.
  • The 'Careless Whisper' research paper found 38% of silent hallucinations contain harmful or violent text.
  • Vexa.ai's fix uses Silero VAD pre-gating, config changes, and a blocklist, shared on their GitHub.
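In faster-whisper terms, the configuration changes listed above look roughly like the sketch below. This assumes the faster-whisper API (which bundles Silero VAD behind its `vad_filter` option); the model size and audio path in the usage note are placeholders:

```python
# Decoder settings from the article, expressed as transcribe() keyword
# arguments for faster-whisper. A sketch under that assumption, not
# Vexa.ai's exact configuration.
transcribe_kwargs = {
    "beam_size": 1,                       # greedy decoding: fails faster on silence
    "condition_on_previous_text": False,  # stop one hallucination seeding the next window
    "vad_filter": True,                   # Silero VAD pre-gate bundled with faster-whisper
}

# Usage (not executed here; model size and path are placeholders):
# from faster_whisper import WhisperModel
# model = WhisperModel("base")
# segments, info = model.transcribe("meeting.wav", **transcribe_kwargs)
```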

Why It Matters

Silent hallucinations pose serious risks for automated transcription in healthcare, legal, and customer service applications where accuracy is critical.