Built a normalizer so WER stops penalizing formatting differences in STT evals!
WER penalizes '3:00PM' vs '3 pm'; Gladia's library normalizes both strings before scoring.
Gladia, a company focused on speech-to-text (STT) evaluation, open-sourced gladia-normalization, a Python library designed to fix a common problem in STT benchmarks: Word Error Rate (WER) penalizes formatting differences that don't reflect actual transcription quality. For example, 'It's $50' vs 'it is fifty dollars' or '3:00PM' vs '3 pm' are both perfect transcriptions, but WER scores them as errors. The library normalizes both the reference and hypothesis before scoring, ensuring only genuine recognition errors are counted.
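The normalize-then-score idea can be sketched in a few lines of Python. This is an illustrative toy, not gladia-normalization's actual API: the `normalize` rules below are hypothetical stand-ins for the library's YAML-defined pipelines, and the WER routine is a plain word-level edit distance.

```python
import re

def normalize(text: str) -> str:
    """Toy normalizer: lowercase, rewrite '3:00PM'-style times as '3 pm',
    strip punctuation. Illustrative rules only, not the library's presets."""
    text = text.lower()
    text = re.sub(r"\b(\d{1,2}):00\s*(am|pm)\b", r"\1 \2", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

ref, hyp = "The meeting is at 3:00PM.", "the meeting is at 3 pm"
raw = wer(ref, hyp)                          # penalizes formatting: 0.6
norm = wer(normalize(ref), normalize(hyp))   # only real errors remain: 0.0
```

Here the raw WER charges three "errors" to a perfect transcription; after normalizing both reference and hypothesis, the score drops to zero, which is the behavior the library provides in a configurable, multilingual form.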
The library uses YAML-defined pipelines, making it deterministic, version-controllable, and customizable. Currently, it supports English, French, German, Italian, Spanish, and Dutch, though non-English presets need refinement. Gladia is actively seeking native speakers to contribute. The project is MIT licensed and available on GitHub, offering a configurable normalization pipeline that can be integrated into any STT evaluation workflow.
- Gladia open-sourced gladia-normalization to fix WER penalizing formatting differences like '3:00PM' vs '3 pm'.
- The library normalizes both reference and hypothesis before scoring, supporting English, French, German, Italian, Spanish, and Dutch.
- Pipelines are YAML-defined, deterministic, version-controllable, and MIT licensed on GitHub.
Why It Matters
Improves STT evaluation accuracy by eliminating false WER errors from formatting differences.