Built a normalizer so WER stops penalizing formatting differences in STT evals!
WER penalizes '3:00PM' vs '3 pm'; Gladia's library normalizes both strings before scoring.
Gladia, a company focused on speech-to-text (STT) evaluation, open-sourced gladia-normalization, a Python library designed to fix a common problem in STT benchmarks: Word Error Rate (WER) penalizes formatting differences that don't reflect actual transcription quality. For example, 'It's $50' vs 'it is fifty dollars' or '3:00PM' vs '3 pm' are both perfect transcriptions, but WER scores them as errors. The library normalizes both the reference and hypothesis before scoring, ensuring only genuine recognition errors are counted.
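The normalize-then-score idea can be sketched in a few lines of Python. This is an illustrative toy, not gladia-normalization's actual API: the `normalize` rules below are hypothetical stand-ins for the library's YAML-defined pipelines, and the WER routine is a plain word-level edit distance.

```python
import re

def normalize(text: str) -> str:
    """Toy normalizer: lowercase, rewrite '3:00PM'-style times as '3 pm',
    strip punctuation. Illustrative rules only, not the library's presets."""
    text = text.lower()
    text = re.sub(r"\b(\d{1,2}):00\s*(am|pm)\b", r"\1 \2", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

ref, hyp = "The meeting is at 3:00PM.", "the meeting is at 3 pm"
raw = wer(ref, hyp)                          # penalizes formatting: 0.6
norm = wer(normalize(ref), normalize(hyp))   # only real errors remain: 0.0
```

Here the raw WER charges three "errors" to a perfect transcription; after normalizing both reference and hypothesis, the score drops to zero, which is the behavior the library provides in a configurable, multilingual form.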
The library uses YAML-defined pipelines, making it deterministic, version-controllable, and customizable. Currently, it supports English, French, German, Italian, Spanish, and Dutch, though non-English presets need refinement. Gladia is actively seeking native speakers to contribute. The project is MIT licensed and available on GitHub, offering a configurable normalization pipeline that can be integrated into any STT evaluation workflow.
- Gladia open-sourced gladia-normalization to fix WER penalizing formatting differences like '3:00PM' vs '3 pm'.
- The library normalizes both reference and hypothesis before scoring, supporting English, French, German, Italian, Spanish, and Dutch.
- Pipelines are YAML-defined, deterministic, version-controllable, and MIT licensed on GitHub.
Why It Matters
Improves STT evaluation accuracy by eliminating false WER errors from formatting differences.