Developer Tools

Fine-tuning NVIDIA Nemotron Speech ASR on Amazon EC2 for domain adaptation

Using synthetic speech and A100-powered EC2 instances, AWS, NVIDIA, and Heidi adapted a leaderboard-topping model for clinical accuracy.

Deep Dive

A collaboration between AWS, NVIDIA, and healthcare AI platform Heidi demonstrates a production-ready pipeline for domain-adapting automatic speech recognition (ASR) models. The team fine-tuned NVIDIA's leaderboard-topping Nemotron Speech model, specifically the Parakeet TDT 0.6B V2, to tackle the unique challenges of clinical environments. Out-of-the-box ASR models often fail with medical jargon, regional accents, and code-switching, leading to errors that compromise clinical safety. To solve this, the project utilized Amazon EC2 p4d.24xlarge instances powered by NVIDIA A100 GPUs for distributed training at scale, combined with the NVIDIA NeMo framework and DeepSpeed for optimization.
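The article doesn't include the training setup itself, but NeMo ASR training jobs consume a JSON-lines manifest in which each line describes one utterance (audio path, duration, transcript). A minimal sketch of building such a manifest for a fine-tuning run; the file names and transcripts here are hypothetical placeholders, not the project's actual data:

```python
import json

def build_manifest(entries, manifest_path):
    """Write a NeMo-style JSON-lines manifest: one utterance per line."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, duration, text in entries:
            record = {
                "audio_filepath": audio_filepath,  # path to a mono WAV clip
                "duration": duration,              # clip length in seconds
                "text": text,                      # reference transcript
            }
            f.write(json.dumps(record) + "\n")
    return manifest_path

# Hypothetical synthetic clinical utterances (placeholder paths/text).
entries = [
    ("clips/utt_0001.wav", 4.2, "patient reports dyspnea on exertion"),
    ("clips/utt_0002.wav", 3.1, "start metformin five hundred milligrams daily"),
]
build_manifest(entries, "train_manifest.json")
```

A manifest like this would then be referenced from the model's training-data config before launching the distributed fine-tuning job.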

The core innovation was generating high-quality, privacy-compliant synthetic training data. Using large language models (LLMs) and neural text-to-speech (TTS), the team created synthetic speech samples interleaved with real-world background noise. This pipeline focused on low-resource languages and rare medical terms underrepresented in public datasets, enabling targeted augmentation without using real patient data. The end-to-end workflow also integrated MLflow for experiment tracking, Amazon EKS for scalable model serving, and Amazon FSx for Lustre for high-performance storage of model weights. This architecture, built with AWS managed services and open-source tools like Docker and Langfuse, delivers a blueprint for building accurate, domain-specific ASR systems that move from fine-tuning to observable deployment.
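The article doesn't publish the augmentation code, but the noise-interleaving step it describes typically means mixing background noise into the TTS output at a controlled signal-to-noise ratio. A minimal sketch, assuming equal-length mono sample buffers represented as plain float lists (the function name and interface are illustrative, not from the project):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Mix background noise into a speech signal at a target SNR in dB.

    The noise is rescaled so that the ratio of speech power to scaled
    noise power equals 10 ** (snr_db / 10).
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Scale factor that brings noise power down to p_speech / 10^(snr/10).
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

# Illustrative toy signals: at 0 dB SNR the scaled noise power
# matches the speech power.
speech = [1.0, -1.0] * 100
noise = [0.5, -0.5] * 100
mixed = mix_at_snr(speech, noise, snr_db=0.0)
```

Sweeping `snr_db` over a range (e.g. clean down to noisy clinic-floor levels) is one common way to make the augmented corpus cover realistic acoustic conditions.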

Key Points
  • Fine-tuned NVIDIA's Parakeet TDT 0.6B V2 ASR model using synthetic medical speech data generated by LLMs and TTS.
  • Leveraged Amazon EC2 p4d.24xlarge instances with NVIDIA A100 GPUs and the NeMo framework for distributed training at scale.
  • Deployed the model to support Heidi's AI Care Partner, which processes 2.4M weekly consultations across 110 languages in 190 countries.

Why It Matters

Shows a scalable blueprint for adapting general AI models to specialized, high-stakes domains like healthcare with privacy-safe synthetic data.