Align-Consistency: Improving Non-autoregressive and Semi-supervised ASR with Consistency Regularization
New method combines parallel inference speed with iterative refinement for 2x faster transcription.
Researchers Wanting Huang and Weiran Wang have introduced Align-Consistency, a novel training framework that significantly improves non-autoregressive automatic speech recognition (ASR) systems. Building on the Align-Refine model—which performs iterative refinement of frame-level hypotheses—this method extends consistency regularization (CR) techniques to ensure predictions remain stable across various input perturbations. The approach addresses a key challenge in ASR: maintaining the speed advantages of parallel, non-autoregressive inference while achieving accuracy comparable to slower, sequential models. The researchers demonstrate that applying CR to both the base Connectionist Temporal Classification (CTC) model and subsequent refinement steps yields additive improvements, creating a more robust system suitable for real-time applications.
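The core idea of consistency regularization is to penalize the model when its frame-level predictions change under input perturbation. Below is a minimal sketch of one common formulation, a symmetric frame-level KL divergence between predictions on two augmented views of the same utterance; the function names and the exact divergence are illustrative assumptions, not the paper's precise objective.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(logits_a, logits_b, eps=1e-8):
    """Symmetric KL between per-frame distributions from two perturbed views.

    logits_a, logits_b: (frames, vocab) outputs of the same model on two
    augmentations of one utterance. Illustrative sketch only; the paper's
    exact consistency term may differ.
    """
    p = softmax(logits_a)
    q = softmax(logits_b)
    kl_pq = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    kl_qp = (q * (np.log(q + eps) - np.log(p + eps))).sum(axis=-1)
    return 0.5 * (kl_pq + kl_qp).mean()
```

This term would be added to the standard CTC (and refinement-step) losses; when the two views yield identical predictions it vanishes, so it only pushes the model toward perturbation-stable outputs.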
Align-Consistency excels in two distinct scenarios. In fully supervised settings, the method delivers substantial accuracy gains while preserving the inherent speed of non-autoregressive decoding. For semi-supervised ASR, the framework goes further: it exploits fast non-autoregressive decoding to generate online pseudo-labels on unlabeled audio, which are then used to further refine the supervised model. This creates a virtuous cycle in which improved models generate better pseudo-labels, yielding further gains without requiring extensive labeled data. The work, submitted to Interspeech 2026, represents a meaningful step toward practical, high-performance speech recognition systems that don't sacrifice speed for accuracy.
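The pseudo-labeling cycle described above can be sketched as a single training round: decode unlabeled audio with the fast non-autoregressive pass, keep only confident hypotheses, and mix them with the labeled data for the next update. Every name here (`decode_fn`, `train_fn`, the confidence threshold) is a hypothetical stand-in, not the authors' pipeline.

```python
def semi_supervised_step(decode_fn, train_fn, labeled, unlabeled,
                         conf_threshold=0.9):
    """One round of online pseudo-labeling (illustrative sketch).

    decode_fn: fast non-autoregressive decoder, audio -> (hypothesis, confidence)
    train_fn:  one supervised update over (audio, transcript) pairs
    labeled:   list of (audio, reference transcript)
    unlabeled: list of audio-only examples
    """
    pseudo = []
    for audio in unlabeled:
        hyp, conf = decode_fn(audio)
        if conf >= conf_threshold:  # discard low-confidence pseudo-labels
            pseudo.append((audio, hyp))
    # Train on the union; as the model improves, decode_fn yields better
    # pseudo-labels on the next round, closing the loop.
    return train_fn(labeled + pseudo)
```

Because the decoder is non-autoregressive, pseudo-labels can be produced online during training rather than in a slow offline pass, which is what makes this loop practical.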
- Extends consistency regularization to Align-Refine non-autoregressive ASR for stable predictions
- Maintains parallel inference speed while boosting accuracy through iterative refinement
- Enables semi-supervised learning by generating pseudo-labels on unlabeled data for model improvement
Why It Matters
Enables faster, more accurate speech-to-text for real-time applications while reducing dependency on expensive labeled data.