Align-Consistency: Improving Non-autoregressive and Semi-supervised ASR with Consistency Regularization
New method combines parallel inference speed with iterative refinement for 2x faster transcription.
Researchers Wanting Huang and Weiran Wang have introduced Align-Consistency, a novel training framework that significantly improves non-autoregressive automatic speech recognition (ASR) systems. Building on the Align-Refine model—which performs iterative refinement of frame-level hypotheses—this method extends consistency regularization (CR) techniques to ensure predictions remain stable across various input perturbations. The approach addresses a key challenge in ASR: maintaining the speed advantages of parallel, non-autoregressive inference while achieving accuracy comparable to slower, sequential models. The researchers demonstrate that applying CR to both the base Connectionist Temporal Classification (CTC) model and subsequent refinement steps yields additive improvements, creating a more robust system suitable for real-time applications.
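The core idea of consistency regularization is to penalize the model when its frame-level predictions change under input perturbation. Below is a minimal sketch of one common formulation, a symmetric frame-level KL divergence between predictions on two augmented views of the same utterance; the function names and the exact divergence are illustrative assumptions, not the paper's precise objective.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(logits_a, logits_b, eps=1e-8):
    """Symmetric KL between per-frame distributions from two perturbed views.

    logits_a, logits_b: (frames, vocab) outputs of the same model on two
    augmentations of one utterance. Illustrative sketch only; the paper's
    exact consistency term may differ.
    """
    p = softmax(logits_a)
    q = softmax(logits_b)
    kl_pq = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    kl_qp = (q * (np.log(q + eps) - np.log(p + eps))).sum(axis=-1)
    return 0.5 * (kl_pq + kl_qp).mean()
```

This term would be added to the standard CTC (and refinement-step) losses; when the two views yield identical predictions it vanishes, so it only pushes the model toward perturbation-stable outputs.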
Align-Consistency excels in two distinct scenarios. In fully supervised settings, the method delivers substantial accuracy gains while preserving the inherent speed of non-autoregressive decoding. For semi-supervised ASR, the framework goes further: it exploits fast non-autoregressive decoding to generate online pseudo-labels on unlabeled audio, which are then used to further refine the supervised model. This creates a virtuous cycle in which improved models generate better pseudo-labels, yielding further gains without requiring extensive labeled data. The work, submitted to Interspeech 2026, represents a meaningful step toward practical, high-performance speech recognition systems that don't sacrifice speed for accuracy.
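The pseudo-labeling cycle described above can be sketched as a single training round: decode unlabeled audio with the fast non-autoregressive pass, keep only confident hypotheses, and mix them with the labeled data for the next update. Every name here (`decode_fn`, `train_fn`, the confidence threshold) is a hypothetical stand-in, not the authors' pipeline.

```python
def semi_supervised_step(decode_fn, train_fn, labeled, unlabeled,
                         conf_threshold=0.9):
    """One round of online pseudo-labeling (illustrative sketch).

    decode_fn: fast non-autoregressive decoder, audio -> (hypothesis, confidence)
    train_fn:  one supervised update over (audio, transcript) pairs
    labeled:   list of (audio, reference transcript)
    unlabeled: list of audio-only examples
    """
    pseudo = []
    for audio in unlabeled:
        hyp, conf = decode_fn(audio)
        if conf >= conf_threshold:  # discard low-confidence pseudo-labels
            pseudo.append((audio, hyp))
    # Train on the union; as the model improves, decode_fn yields better
    # pseudo-labels on the next round, closing the loop.
    return train_fn(labeled + pseudo)
```

Because the decoder is non-autoregressive, pseudo-labels can be produced online during training rather than in a slow offline pass, which is what makes this loop practical.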
- Extends consistency regularization to Align-Refine non-autoregressive ASR for stable predictions
- Maintains parallel inference speed while boosting accuracy through iterative refinement
- Enables semi-supervised learning by generating pseudo-labels on unlabeled data for model improvement
Why It Matters
Enables faster, more accurate speech-to-text for real-time applications while reducing dependency on expensive labeled data.