Research & Papers

Nvidia Parakeet v3 tops Whisper-large-v3 with less data and smaller model

r/MachineLearning June 10, 2026

⚡Nvidia's 660k-hour model outperforms OpenAI's 5M-hour Whisper on almost every benchmark.

Deep Dive

The ASR landscape is shifting: Nvidia's Parakeet v3, trained on 660,000 hours of fully labeled data, outperforms OpenAI's Whisper-large-v3, which was trained on 5 million hours of weakly supervised data, on almost every benchmark. Parakeet does so with roughly 13% as much training audio and a smaller model, which suggests that data quality and architecture can matter more than sheer scale. The result has sparked debate about whether the era of self-supervised learning (SSL) for ASR, exemplified by models like Data2Vec2.0 and WavLM, is coming to an end.

New architectures, namely Transducer, Token-and-Duration Transducer (TDT), and attention encoder-decoder designs (e.g., Qwen), are gaining traction, and all of them are trained in a supervised manner. Meanwhile, in computer vision, self-supervised models like DINOv3 excel across segmentation, classification, and depth estimation. The ASR community now asks whether a similar SSL breakthrough can occur for speech tasks such as emotion recognition, diarization, and speech separation, or whether supervised learning will continue to dominate. The answer may reshape how massive datasets are used for speech AI.

Key Points

Nvidia Parakeet v3, trained on 660k hours of labeled data, beats Whisper-large-v3 (5M hours, weakly supervised) on almost every benchmark.
New ASR architectures include Transducer, Token-and-Duration Transducer (TDT), and attention encoder-decoder (Qwen), all supervised.
Self-supervised models (Data2Vec2.0, WavLM) are losing ground in ASR; whether speech will get its own 'DINO moment' remains an open question.

Why It Matters

If smaller supervised models trained on less data can beat much larger ones, AI teams have reason to rethink their data curation and architecture choices.
