TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
New method simulates realistic speech recognition errors without expensive audio data, cutting training costs.
A research team from Shanghai Jiao Tong University has introduced TASU2, a novel framework that addresses a critical bottleneck in speech large language model (LLM) development: the high cost of collecting paired audio-text data. Traditional methods like TASU simulated CTC (Connectionist Temporal Classification) posteriors from text alone but offered limited control over the uncertainty and error rates in the simulated data, making training curriculum design largely guesswork. TASU2 solves this by allowing researchers to precisely specify the Word Error Rate (WER) range—from 0% to 30%—for the simulated speech recognition outputs, creating text-derived supervision that better mimics the noisy output of a real acoustic decoding interface.
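To make the controllable-WER idea concrete, here is a minimal sketch of error injection at a specified word error rate. This is an illustrative assumption, not TASU2's actual mechanism: the paper's method operates on simulated CTC posteriors, whereas this toy function simply corrupts a transcript at the word level with substitutions, insertions, and deletions so that roughly `target_wer` of the words are affected.

```python
import random

def corrupt_transcript(words, target_wer, vocab, rng=random):
    """Hypothetical sketch: inject word-level errors so that roughly
    target_wer * len(words) substitution/insertion/deletion edits occur.
    This illustrates controllable WER, not TASU2's posterior simulation."""
    out = []
    for w in words:
        if rng.random() < target_wer:
            op = rng.choice(["sub", "del", "ins"])
            if op == "sub":
                out.append(rng.choice(vocab))       # substitute a random word
            elif op == "ins":
                out.extend([w, rng.choice(vocab)])  # keep word, insert noise
            # "del": drop the word entirely
        else:
            out.append(w)                           # word survives unchanged
    return out
```

With `target_wer=0.0` the transcript passes through clean; with higher values the text increasingly resembles noisy ASR hypotheses, which is the kind of knob TASU2 exposes over its simulated CTC outputs.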
This controllable simulation enables a principled, curriculum-based post-training approach in which training difficulty can be increased smoothly, without resorting to expensive text-to-speech (TTS) systems for data augmentation. In experiments across multiple adaptation scenarios, TASU2 consistently outperformed strong baselines, including text-only fine-tuning and TTS-based methods. It showed particular strength in low-resource and cross-domain adaptation, improving out-of-domain recognition accuracy while mitigating the common problem of catastrophic forgetting, where a model loses performance on its original source domain. The framework represents a significant step toward more data-efficient and controllable training pipelines for the next generation of speech-enabled AI agents.
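One way such a curriculum could be scheduled is sketched below. The linear ramp and the specific function names (`wer_ceiling`, `sample_target_wer`) are assumptions for illustration, not the paper's exact recipe: the idea is simply that the sampled-WER ceiling widens from 0% toward the 30% cap over training, so early steps see near-clean inputs and later steps see noisier ones.

```python
import random

def wer_ceiling(step, total_steps, max_wer=0.30):
    """Hypothetical linear curriculum: the WER cap grows from 0 to max_wer
    as training progresses, then stays flat."""
    frac = min(step / max(total_steps, 1), 1.0)
    return frac * max_wer

def sample_target_wer(step, total_steps, rng=random):
    """Draw a per-example target WER uniformly below the current ceiling."""
    return rng.uniform(0.0, wer_ceiling(step, total_steps))
```

A schedule like this is what a controllable simulator makes possible in the first place: with a fixed, uncontrolled error rate (as in the original TASU), there is no knob to ramp.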
- Enables simulation of CTC posteriors with controllable WER ranges (0-30%) for curriculum training
- Reduces reliance on costly audio-text pairs and TTS systems for data augmentation
- Improves low-resource adaptation and reduces source-domain performance degradation by 15-20% versus TASU
Why It Matters
Lowers the barrier to developing accurate speech AI for niche languages and domains by reducing data requirements.