VietNormalizer: Open-source Python library solves Vietnamese text normalization for TTS/NLP

arXiv cs.NE March 05, 2026

⚡ New dependency-free library converts numbers, dates, and currency amounts to Vietnamese words for AI applications.

Deep Dive

A research team of eight authors, including Hung Vu Nguyen, has released VietNormalizer, an open-source Python library for Vietnamese text normalization. Published on arXiv, the library targets a gap in the ecosystem: real-world Vietnamese text is dense with non-standard words (NSWs) such as numbers, dates, currency amounts, and acronyms, all of which must be converted to fully pronounceable Vietnamese before Text-to-Speech synthesis or downstream NLP processing. Existing tools either rely on heavy neural dependencies with limited coverage or are buried inside larger toolkits, which makes a standalone, dependency-free library attractive for developers building Vietnamese-language AI systems.

The library implements a rule-based pipeline covering seven normalization tasks: converting arbitrary integers and decimals to Vietnamese words; normalizing dates and times to their spoken forms; processing VND and USD currency amounts; expanding percentages; resolving acronyms through customizable dictionaries; transliterating foreign loanwords into Vietnamese phonetic approximations; and performing Unicode normalization. All regex patterns are compiled once at initialization, so batch processing avoids repeated compilation, keeps memory overhead low, and needs no GPU. The library is available on PyPI (pip install vietnormalizer) under the MIT license. The authors also present it as evidence that rule-based normalization can generalize to other low-resource languages, including tonal and agglutinative ones.
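To make the idea of a rule-based pipeline with pre-compiled patterns concrete, here is a minimal sketch in plain Python. It is not VietNormalizer's code or API: the names two_digit_to_words, PERCENT_RE, NUMBER_RE, and normalize are hypothetical, and the sketch covers only whole numbers from 0 to 99, percentages, and Unicode NFC normalization. Decimals, larger numbers, dates, currency, acronyms, and loanwords are deliberately left out.

```python
import re
import unicodedata

# Illustrative sketch only; this is NOT the VietNormalizer API.
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def two_digit_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99 in Vietnamese."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0..99")
    if n < 10:
        return DIGITS[n]
    tens, units = divmod(n, 10)
    # 10..19 use "mười"; 20..99 use "<digit> mươi".
    head = "mười" if tens == 1 else f"{DIGITS[tens]} mươi"
    if units == 0:
        return head
    if units == 5:
        tail = "lăm"   # a final 5 after a tens word is read "lăm"
    elif units == 1 and tens > 1:
        tail = "mốt"   # a final 1 after "mươi" is read "mốt"
    else:
        tail = DIGITS[units]
    return f"{head} {tail}"


# Patterns are compiled once at import time and reused for every input.
PERCENT_RE = re.compile(r"\b(\d{1,2})\s?%")
NUMBER_RE = re.compile(r"\b\d{1,2}\b")


def normalize(text: str) -> str:
    """Apply the rules in a fixed order: Unicode, percentages, bare numbers."""
    text = unicodedata.normalize("NFC", text)
    text = PERCENT_RE.sub(
        lambda m: two_digit_to_words(int(m.group(1))) + " phần trăm", text
    )
    text = NUMBER_RE.sub(lambda m: two_digit_to_words(int(m.group(0))), text)
    return text


print(normalize("Giảm 15% sau 21 ngày"))
# → Giảm mười lăm phần trăm sau hai mươi mốt ngày
```

Rule order matters in this sketch: percentages are rewritten before bare numbers so that the "%" sign is consumed together with its number. Numbers of three or more digits are left untouched, and decimals such as "3.5" would be mishandled, which is one reason a production library needs far more rules than shown here.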

Key Points

Handles 7 normalization tasks including numbers, dates, currency (VND/USD), percentages, and acronyms via rule-based pipeline
Zero-dependency design with pre-compiled regex enables high-throughput batch processing without GPU or external APIs
Available via pip install and MIT licensed, addressing a critical gap in Vietnamese TTS and NLP preprocessing

Why It Matters

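Text normalization is a prerequisite for Vietnamese TTS and many NLP pipelines: numbers, dates, and abbreviations must become pronounceable words before synthesis. A lightweight, standalone library for that step removes the need for heavy neural dependencies or external APIs.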
