VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications

A new dependency-free library converts numbers, dates, and currency amounts into spoken-form Vietnamese words for TTS and NLP pipelines.

Deep Dive

A research team of eight authors, including Hung Vu Nguyen, has released VietNormalizer, an open-source Python library for Vietnamese text normalization. The accompanying paper is published on arXiv. The library targets a gap in the ecosystem: real-world Vietnamese text is dense with non-standard words (NSWs) such as numbers, dates, currency amounts, and acronyms, all of which must be converted to fully pronounceable Vietnamese before Text-to-Speech (TTS) synthesis or downstream NLP processing. Existing tools either rely on heavy neural dependencies with limited coverage or are buried inside larger toolkits, which makes a standalone, dependency-free library useful for developers building Vietnamese-language AI systems.

The library implements a rule-based pipeline that handles seven normalization tasks: converting arbitrary integers and decimals to Vietnamese words, normalizing dates and times to spoken forms, processing VND and USD currency amounts, expanding percentages, resolving acronyms via customizable dictionaries, transliterating foreign loanwords to Vietnamese phonetic approximations, and performing Unicode normalization. All regex patterns are pre-compiled at initialization, so batch processing is efficient, memory overhead is minimal, and no GPU is required. The library is available on PyPI via pip install vietnormalizer and is released under the MIT license. Beyond Vietnamese, the authors present the rule-based design as generalizable to other low-resource tonal and agglutinative languages.
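To make the rule-based idea concrete, the following is a minimal, standard-library-only sketch of two of the tasks described above (reading integers and expanding percentages) plus Unicode normalization. It is not VietNormalizer's code or API: the names `read_under_1000` and `normalize`, the choice of NFC as the normalization form, and the restriction to integers from 0 to 999 are all assumptions made for illustration. Patterns are compiled once at import time, mirroring the pre-compilation approach the summary mentions.

```python
import re
import unicodedata

# Illustrative sketch only; NOT the VietNormalizer API.
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def _read_two_digits(n: int) -> str:
    """Read an integer in 0..99 as Vietnamese words."""
    tens, ones = divmod(n, 10)
    if tens == 0:
        return DIGITS[ones]
    words = ["mười"] if tens == 1 else [DIGITS[tens], "mươi"]
    if ones == 1 and tens > 1:
        words.append("mốt")      # 21 -> "hai mươi mốt"
    elif ones == 5:
        words.append("lăm")      # 15 -> "mười lăm"
    elif ones > 0:
        words.append(DIGITS[ones])
    return " ".join(words)


def read_under_1000(n: int) -> str:
    """Read an integer in 0..999 as Vietnamese words (hypothetical helper)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return _read_two_digits(rest)
    words = [DIGITS[hundreds], "trăm"]
    if 0 < rest < 10:
        words += ["linh", DIGITS[rest]]  # 105 -> "một trăm linh năm"
    elif rest >= 10:
        words.append(_read_two_digits(rest))
    return " ".join(words)


# Compiled once at import time. The lookarounds skip longer numbers,
# decimals, and digits glued to letters, which this sketch does not handle.
PERCENT_RE = re.compile(r"(?<![\w.,])(\d{1,3})\s?%")
NUMBER_RE = re.compile(r"(?<![\w.,])\d{1,3}(?!\w|[.,]\d)")


def normalize(text: str) -> str:
    """Apply NFC normalization, then expand percentages and small integers."""
    text = unicodedata.normalize("NFC", text)  # NFC is an assumption here
    text = PERCENT_RE.sub(
        lambda m: read_under_1000(int(m.group(1))) + " phần trăm", text
    )
    return NUMBER_RE.sub(lambda m: read_under_1000(int(m.group(0))), text)


print(normalize("Tăng 25% trong 15 ngày"))
# Tăng hai mươi lăm phần trăm trong mười lăm ngày
```

Percentages are expanded before bare integers so that the digits in "25%" are consumed together with the percent sign. Larger numbers, decimals, dates, currency, acronyms, and loanwords are deliberately left untouched in this sketch; a full normalizer would need additional rules for each.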

Key Points
  • Handles seven normalization tasks, including numbers, dates and times, currency (VND/USD), percentages, and acronyms, through a rule-based pipeline
  • Zero-dependency design with pre-compiled regex patterns supports efficient batch processing without a GPU or external APIs
  • Installable with pip install vietnormalizer and MIT licensed, filling a gap in Vietnamese TTS and NLP preprocessing

Why It Matters

Vietnamese TTS and NLP systems depend on a preprocessing step that converts non-standard text into pronounceable words. VietNormalizer provides that step as a lightweight, standalone library.