Audio & Speech

FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

The unified system transcribes speech and singing in 100+ languages with a tiny 0.6M-parameter VAD module.

Deep Dive

A research team has introduced FireRedASR2S, a comprehensive, open-source automatic speech recognition system designed for industrial deployment. Unlike piecemeal solutions, it unifies four critical modules into a single pipeline: automatic speech recognition (ASR), voice activity detection (VAD), spoken language identification (LID), and punctuation prediction (Punc). The ASR module itself comes in two variants: FireRedASR2-LLM, a model with more than 8 billion parameters, and the more efficient FireRedASR2-AED at just over 1 billion. Both transcribe not only speech but also singing, across Mandarin, numerous Chinese dialects and accents, English, and code-switched input. On public benchmarks, the LLM variant achieved a remarkably low 2.89% average Character Error Rate (CER) on Mandarin and 11.55% across 19 Chinese dialect benchmarks, outperforming commercial rivals such as ByteDance's Doubao-ASR and Alibaba's Qwen3-ASR.
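CER, the headline metric above, is the character-level Levenshtein edit distance between a hypothesis transcript and the reference, normalized by the reference length. A minimal sketch of the metric (a standard definition, not code from the FireRedASR2S release):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance between the two character sequences,
    divided by the length of the reference."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[-1] / len(ref)
```

A 2.89% CER therefore means roughly 3 character edits (substitutions, insertions, or deletions) per 100 reference characters, which matters for Mandarin where each character is typically one syllable-length unit.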

The system's other components are equally impressive. FireRedVAD is an ultra-lightweight module with only 0.6 million parameters, based on a Deep Feedforward Sequential Memory Network (DFSMN). It excels at detecting when someone is speaking, achieving a 97.57% frame-level F1 score on the FLEURS-VAD benchmark and beating established tools like Silero-VAD and WebRTC VAD. The language identification module, FireRedLID, accurately identifies over 100 languages and 20+ Chinese dialects with 97.18% utterance-level accuracy, surpassing OpenAI's Whisper. Finally, the BERT-style FireRedPunc module restores punctuation in transcribed text with a 78.90% average F1 score, a significant jump over FunASR-Punc's 62.77%. By releasing the model weights and code, the team provides a ready-to-deploy, state-of-the-art alternative to closed commercial APIs, advancing research and practical application in global speech technology.
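The frame-level F1 used to score the VAD compares per-frame speech/non-speech decisions against reference labels. A minimal sketch of that metric, assuming binary 0/1 labels per audio frame (not taken from the FireRedVAD codebase):

```python
from typing import Sequence

def frame_f1(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 over per-frame binary speech (1) / non-speech (0) labels."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Scoring at the frame level, rather than per utterance, penalizes both clipped speech onsets (false negatives) and hallucinated speech during silence (false positives), which is why it is a common yardstick for lightweight VADs.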

Key Points
  • Unified pipeline integrates ASR, VAD, Language ID, and Punctuation Prediction into one SOTA system.
  • ASR module has two variants (8B+ and 1B+ params) and beats Doubao-ASR & Qwen3-ASR with a 2.89% avg CER on Mandarin.
  • Ultra-efficient 0.6M-parameter VAD module achieves 97.57% F1 score, outperforming industry standards like Silero-VAD and WebRTC VAD.
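The unified pipeline the bullets describe can be pictured as a simple orchestration loop: VAD segments the audio, LID routes each segment, ASR transcribes it, and Punc restores punctuation. Every name below is an illustrative stand-in, not the released FireRedASR2S API:

```python
from typing import Callable, List, Sequence

def run_pipeline(audio: Sequence[float],
                 vad_split: Callable[[Sequence[float]], List[Sequence[float]]],
                 lid_identify: Callable[[Sequence[float]], str],
                 asr_transcribe: Callable[[Sequence[float], str], str],
                 punc_restore: Callable[[str], str]) -> str:
    """Hypothetical glue code for a four-stage speech pipeline:
    VAD -> LID -> ASR -> Punc, applied segment by segment."""
    pieces = []
    for segment in vad_split(audio):       # speech/non-speech segmentation
        language = lid_identify(segment)   # route among 100+ languages
        text = asr_transcribe(segment, language)
        pieces.append(punc_restore(text))  # punctuation restoration
    return " ".join(pieces)
```

The design point is that each stage is swappable: a lighter ASR backbone (the 1B+ AED variant) can replace the 8B+ LLM variant behind the same interface without touching the rest of the chain.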

Why It Matters

Provides a powerful, open-source alternative to commercial speech APIs, enabling customizable, high-accuracy transcription for global languages and dialects.