Audio & Speech

FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

The unified system transcribes speech and singing in 100+ languages with a tiny 0.6M-parameter VAD module.

Deep Dive

A research team has introduced FireRedASR2S, a comprehensive, open-source automatic speech recognition system designed for industrial deployment. Unlike piecemeal solutions, it unifies four critical modules into a single pipeline: automatic speech recognition (ASR), voice activity detection (VAD), spoken language identification (LID), and punctuation prediction (Punc). The ASR module itself comes in two variants: FireRedASR2-LLM, a model with more than 8 billion parameters, and the more efficient FireRedASR2-AED at just over 1 billion. Both transcribe not only speech but also singing, across Mandarin, numerous Chinese dialects and accents, English, and code-switched input. On public benchmarks, the LLM variant achieved a remarkably low 2.89% average Character Error Rate (CER) on Mandarin and 11.55% across 19 Chinese dialect benchmarks, outperforming commercial rivals such as ByteDance's Doubao-ASR and Alibaba's Qwen3-ASR.
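CER, the headline metric above, is the character-level Levenshtein edit distance between a hypothesis transcript and the reference, normalized by the reference length. A minimal sketch of the metric (a standard definition, not code from the FireRedASR2S release):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance between the two character sequences,
    divided by the length of the reference."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[-1] / len(ref)
```

A 2.89% CER therefore means roughly 3 character edits (substitutions, insertions, or deletions) per 100 reference characters, which matters for Mandarin where each character is typically one syllable-length unit.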

The system's other components are equally impressive. FireRedVAD is an ultra-lightweight module with only 0.6 million parameters, based on a Deep Feedforward Sequential Memory Network (DFSMN). It excels at detecting when someone is speaking, achieving a 97.57% frame-level F1 score on the FLEURS-VAD benchmark and beating established tools like Silero-VAD and WebRTC VAD. The language identification module, FireRedLID, accurately identifies over 100 languages and 20+ Chinese dialects with 97.18% utterance-level accuracy, surpassing OpenAI's Whisper. Finally, the BERT-style FireRedPunc module restores punctuation in transcribed text with a 78.90% average F1 score, a significant jump over FunASR-Punc's 62.77%. By releasing the model weights and code, the team provides a ready-to-deploy, state-of-the-art alternative to closed commercial APIs, advancing research and practical application in global speech technology.
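The frame-level F1 used to score the VAD compares per-frame speech/non-speech decisions against reference labels. A minimal sketch of that metric, assuming binary 0/1 labels per audio frame (not taken from the FireRedVAD codebase):

```python
from typing import Sequence

def frame_f1(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 over per-frame binary speech (1) / non-speech (0) labels."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Scoring at the frame level, rather than per utterance, penalizes both clipped speech onsets (false negatives) and hallucinated speech during silence (false positives), which is why it is a common yardstick for lightweight VADs.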

Key Points
  • Unified pipeline integrates ASR, VAD, Language ID, and Punctuation Prediction into one SOTA system.
  • ASR module has two variants (8B+ and 1B+ params) and beats Doubao-ASR & Qwen3-ASR with a 2.89% avg CER on Mandarin.
  • Ultra-efficient 0.6M-parameter VAD module achieves 97.57% F1 score, outperforming industry standards like Silero-VAD and WebRTC VAD.
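The unified pipeline the bullets describe can be pictured as a simple orchestration loop: VAD segments the audio, LID routes each segment, ASR transcribes it, and Punc restores punctuation. Every name below is an illustrative stand-in, not the released FireRedASR2S API:

```python
from typing import Callable, List, Sequence

def run_pipeline(audio: Sequence[float],
                 vad_split: Callable[[Sequence[float]], List[Sequence[float]]],
                 lid_identify: Callable[[Sequence[float]], str],
                 asr_transcribe: Callable[[Sequence[float], str], str],
                 punc_restore: Callable[[str], str]) -> str:
    """Hypothetical glue code for a four-stage speech pipeline:
    VAD -> LID -> ASR -> Punc, applied segment by segment."""
    pieces = []
    for segment in vad_split(audio):       # speech/non-speech segmentation
        language = lid_identify(segment)   # route among 100+ languages
        text = asr_transcribe(segment, language)
        pieces.append(punc_restore(text))  # punctuation restoration
    return " ".join(pieces)
```

The design point is that each stage is swappable: a lighter ASR backbone (the 1B+ AED variant) can replace the 8B+ LLM variant behind the same interface without touching the rest of the chain.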

Why It Matters

Provides a powerful, open-source alternative to commercial speech APIs, enabling customizable, high-accuracy transcription for global languages and dialects.