Research & Papers

easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P]

Open-source tool aligns audio and text 35% to 102% faster than WhisperX, depending on hardware, while preserving the original text formatting.

Deep Dive

Developer mLalush has released easyaligner, an open-source forced alignment library aimed at common pain points in speech data preprocessing. Informed by experience processing hundreds of thousands of hours of audio, the tool uses PyTorch's forced alignment API with a GPU-accelerated Viterbi algorithm. Unlike existing solutions, easyaligner automatically handles transcripts that don't cover all spoken content, copes with irrelevant speech at segment boundaries, and processes long audio and text segments without requiring chunking.
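For readers unfamiliar with forced alignment, the idea behind a Viterbi pass over CTC emissions can be sketched in a few lines of plain Python. This is a toy CPU illustration under stated assumptions, not easyaligner's implementation and not the PyTorch API: the function names (ctc_forced_align, token_spans, frame), the three-class vocabulary, and the hand-made emission values are all hypothetical.

```python
import math

def ctc_forced_align(log_probs, targets, blank=0):
    """Most likely frame-level label path that spells `targets` under CTC rules.

    log_probs: list of T frames, each a list of per-class log-probabilities.
    targets:   list of token ids, none equal to `blank`.
    """
    T = len(log_probs)
    # Interleave blanks around the tokens: [blank, t1, blank, t2, ..., blank].
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)
    score = [[-math.inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance one state, or skip the blank
            # between two different tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A valid path ends in the last token or the trailing blank.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == -math.inf:
        raise ValueError("not enough frames to fit the target sequence")
    # Follow the back-pointers from the last frame to the first.
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    path.reverse()
    return path

def token_spans(path, blank=0):
    """Collapse a frame-level path into (token, start_frame, end_frame) spans."""
    spans = []
    prev = blank
    for i, label in enumerate(path):
        if label != blank:
            if label == prev:
                tok, start, _ = spans[-1]
                spans[-1] = (tok, start, i + 1)  # extend the current span
            else:
                spans.append((label, i, i + 1))  # a new token starts here
        prev = label
    return spans

def frame(best, n_classes=3, hi=0.8):
    """Hypothetical emission frame that strongly favours class `best`."""
    lo = (1.0 - hi) / (n_classes - 1)
    return [math.log(hi if c == best else lo) for c in range(n_classes)]

# Five frames favouring: blank, token 1, token 1, token 2, blank.
emissions = [frame(0), frame(1), frame(1), frame(2), frame(0)]
path = ctc_forced_align(emissions, targets=[1, 2])
print(path)               # [0, 1, 1, 2, 0]
print(token_spans(path))  # [(1, 1, 3), (2, 3, 4)]
```

In a real pipeline the emissions would come from an acoustic model such as a wav2vec2 checkpoint, and frame indices would be converted to seconds using the model's frame rate; a GPU implementation runs the same dynamic program in parallel across states.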

A key innovation is the library's flexible text normalization system, which improves alignment quality while keeping a mapping from each normalized token back to the original formatting. The tool supports emission extraction from all wav2vec2 models available on Hugging Face Hub, enabling alignment in any language for which such a model exists. Benchmarks show it runs 35% to 102% faster than WhisperX, depending on hardware, while offering comparable functionality. The MIT-licensed library includes comprehensive documentation with tutorials for different alignment scenarios and custom text processing workflows.
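The idea of normalizing text for alignment while keeping a route back to the original can be illustrated with a small sketch. It assumes a simple scheme (lowercasing and stripping punctuation per whitespace-delimited word); the helper names normalize_with_mapping and restore_formatting and the timing values are hypothetical and do not reflect easyaligner's actual API.

```python
import re

def normalize_with_mapping(text):
    """Lowercase and strip punctuation, recording each token's original span."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        # Keep only word characters and apostrophes in the normalized form.
        norm = re.sub(r"[^\w']", "", match.group().lower())
        if norm:
            tokens.append({"norm": norm, "start": match.start(), "end": match.end()})
    return tokens

def restore_formatting(text, tokens, timings):
    """Attach (start_sec, end_sec) timings to the original, unnormalized words."""
    return [
        (text[tok["start"]:tok["end"]], begin, end)
        for tok, (begin, end) in zip(tokens, timings)
    ]

original = "Hello, World! It's 3 p.m."
tokens = normalize_with_mapping(original)
print([t["norm"] for t in tokens])
# ['hello', 'world', "it's", '3', 'pm']

# Made-up timings, standing in for the output of an aligner.
timings = [(0.0, 0.4), (0.5, 0.9), (1.0, 1.2), (1.3, 1.5), (1.5, 1.9)]
print(restore_formatting(original, tokens, timings)[0])
# ('Hello,', 0.0, 0.4)
```

The aligner only ever sees the normalized tokens, which are closer to what the acoustic model emits, while the stored character offsets let the timestamps be reattached to the text exactly as it was written.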

The companion library easytranscriber demonstrates how easyaligner can serve as a backend for aligning ASR model outputs, creating a complete pipeline for speech processing tasks. Together, the two libraries are particularly valuable for researchers and engineers building speech-to-text systems who need precise alignment between audio and text while maintaining data integrity throughout the preprocessing pipeline.

Key Points
  • GPU-accelerated forced alignment built on PyTorch's forced alignment API and a Viterbi algorithm
  • Works with all wav2vec2 models on Hugging Face Hub for multilingual support
  • 35% to 102% faster than WhisperX, depending on hardware, with text normalization that maps back to the original formatting

Why It Matters

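Speeds up speech data preprocessing for AI training, with reported gains of 35% to 102% over WhisperX, while preserving the original transcript formatting and supporting any language that has a wav2vec2 model on Hugging Face Hub.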