Audio & Speech

Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR

A new merging algorithm outperforms full fine-tuning across 10 domains while preserving the base model's generalization in a single checkpoint.

Deep Dive

A team of researchers has published a study of model merging as a scalable alternative to multi-task training for large speech foundation models. The paper, titled 'Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR,' investigates how to combine multiple specialized, domain-tuned models into a single unified checkpoint. This matters for ASR systems, which typically require expensive, repeated fine-tuning whenever new data arrives in domains such as medical, legal, or conversational speech. The researchers argue that maintaining numerous custom checkpoints is computationally prohibitive, making efficient merging a vital technique.
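
To make the setup concrete, model merging is often framed as task arithmetic: each domain-tuned checkpoint is reduced to a "task vector" (its weight delta from the base model), and the deltas are scaled and summed back onto the base. The sketch below illustrates that general recipe only; the function name and the `alpha` coefficient are illustrative assumptions, not the paper's method.

```python
import torch

def merge_task_vectors(base_state, tuned_states, alpha=0.3):
    """Toy task-arithmetic merge (illustrative, not the paper's algorithm).

    theta_merged = theta_base + alpha * sum_i (theta_i - theta_base)
    """
    merged = {}
    for name, base_w in base_state.items():
        task_sum = torch.zeros_like(base_w)
        for tuned in tuned_states:
            task_sum += tuned[name] - base_w  # task vector for one domain-tuned checkpoint
        merged[name] = base_w + alpha * task_sum
    return merged
```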

The team rigorously benchmarked 11 different merging algorithms across 10 distinct European Portuguese domains, evaluating in-domain accuracy, robustness to distribution shifts, and performance in English and multilingual settings. Their key contribution is BoostedTSV-M, a novel algorithm based on TSV-M that uses singular-value boosting to improve numerical stability and mitigate 'rank collapse', a common failure mode in which merged models lose capability. The results show the merged model outperforms traditional full fine-tuning on the target European Portuguese data. Crucially, it does so while preserving the general language capabilities of the original foundation model, maintaining strong performance on out-of-distribution and multilingual tasks within a single, deployable model. The work has been submitted for review at INTERSPEECH 2026.
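
The paper's exact BoostedTSV-M update is not reproduced here, but the general idea of singular-value boosting can be sketched: decompose a layer's task-vector matrix with an SVD, then lift near-zero singular values toward a floor so the merged update does not collapse onto a few dominant directions. In the minimal sketch below, the function name, the `floor_ratio` parameter, and the clamping rule are all hypothetical assumptions, not the authors' algorithm.

```python
import torch

def boost_singular_values(delta, floor_ratio=0.1):
    """Illustrative singular-value boosting for one 2-D task-vector matrix.

    delta:       tuned_weight - base_weight for a single layer
    floor_ratio: hypothetical floor, as a fraction of the top singular value
    """
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    floor = floor_ratio * S[0]           # singular values come back sorted descending
    S_boosted = torch.maximum(S, floor)  # lift tiny directions so the update keeps rank
    return U @ torch.diag(S_boosted) @ Vh
```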

Key Points
  • Introduced BoostedTSV-M, a new merging algorithm that mitigates rank collapse via singular-value boosting, improving numerical stability.
  • Benchmarked 11 merging methods across 10 European Portuguese ASR domains, outperforming costly full fine-tuning.
  • Achieved superior in-domain accuracy while preserving a single model's out-of-distribution and multilingual generalization.

Why It Matters

Enables efficient, unified speech models for multiple specialized domains, drastically reducing compute costs and deployment complexity.