Audio & Speech

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

A new system matches top Indic TTS using frozen models and zero commercial data...

Deep Dive

Praxy Voice targets a gap in open-source Indic text-to-speech (TTS): commercial systems produce near-native audio, while the best open-source bases, such as Chatterbox (23 languages), struggle with phonological accuracy and cannot even tokenize languages like Telugu and Tamil. The approach is deliberately minimal: it keeps a non-Indic base (Chatterbox) frozen, trains no new acoustic decoder, and uses no commercial TTS training data. The system combines three components: BUPS (Brahmic Unified Phoneme Space), which deterministically romanizes seven Indic scripts to ISO-15919 so that Chatterbox's Latin tokenizer can process them; a LoRA adapter trained on ~1,220 hours of licensed Indic audio; and a voice-prompt recovery recipe that pairs an 8-11 second same-language reference clip with specific sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1).
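To make the romanization step concrete, here is a minimal sketch of deterministic Devanagari-to-ISO-15919 transliteration, followed by the sampling overrides and reference-clip window quoted above. It is an illustration only: the mapping tables cover a tiny hand-picked subset of one script and are not the BUPS tables, and the names `romanize_devanagari`, `VOICE_PROMPT_OVERRIDES`, and `reference_clip_ok` are hypothetical.

```python
import unicodedata

# Tiny illustrative Devanagari subset; NOT the BUPS tables from the paper.
CONSONANTS = {
    "क": "k", "ग": "g", "ट": "ṭ", "ड": "ḍ", "ण": "ṇ",
    "त": "t", "द": "d", "न": "n", "प": "p", "म": "m",
    "र": "r", "ल": "l", "स": "s", "ह": "h",
}
INDEPENDENT_VOWELS = {
    "अ": "a", "आ": "ā", "इ": "i", "ई": "ī",
    "उ": "u", "ऊ": "ū", "ए": "ē", "ओ": "ō",
}
# Dependent vowel signs, written as escapes so the combining marks stay visible.
VOWEL_SIGNS = {
    "\u093e": "ā", "\u093f": "i", "\u0940": "ī", "\u0941": "u",
    "\u0942": "ū", "\u0947": "ē", "\u094b": "ō",
}
VIRAMA = "\u094d"    # suppresses the inherent vowel "a"
ANUSVARA = "\u0902"  # romanized as "ṁ" in ISO 15919


def romanize_devanagari(text: str) -> str:
    """Rule-based transliteration: the same input always gives the same output."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt == VIRAMA:            # bare consonant, no vowel
                i += 2
            elif nxt in VOWEL_SIGNS:     # explicit vowel replaces the inherent "a"
                out.append(VOWEL_SIGNS[nxt])
                i += 2
            else:                        # inherent vowel
                out.append("a")
                i += 1
            continue
        if ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        elif ch == ANUSVARA:
            out.append("ṁ")
        else:
            out.append(ch)               # pass through anything unmapped
        i += 1
    return unicodedata.normalize("NFC", "".join(out))


# Sampling overrides and clip window as quoted in the summary above.
VOICE_PROMPT_OVERRIDES = {"exaggeration": 0.7, "temperature": 0.6, "min_p": 0.1}


def reference_clip_ok(duration_s: float) -> bool:
    """True if the reference clip falls in the 8-11 second window."""
    return 8.0 <= duration_s <= 11.0


print(romanize_devanagari("नमस्ते"))  # namastē
```

Because the output is plain Latin text with diacritics, a Latin-script tokenizer can consume it without any vocabulary changes. Note that this strict transliteration keeps every inherent vowel, so it makes no attempt at language-specific pronunciation rules such as Hindi schwa deletion.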

Evaluated on 10-utterance pilot sets with the PSP benchmark, Praxy Voice matches or slightly leads commercial baselines on several metrics, all of which are error rates where lower is better. On Telugu, it shows 26.7% retroflex collapse (vs. 33.3% for Sarvam Bulbul); on the Tamil-zha collapse metric, it scores 71% (vs. 86% for the commercial trio); and on Hindi, it ties Cartesia Sonic-3 at 0.025 LLM-WER. For intra-sentential code-mixing, the system adds a third branch that uses IndicF5 with native-script transliteration, cutting code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hindi, Telugu, and Tamil. The team releases the R6 LoRA weights (Apache-2.0), the inference code and router (MIT), and a Gradio demo, making commercial-grade Indic TTS accessible to the open-source community.
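One way to picture the routing decision for code-mixed input is a script check over Unicode blocks. The sketch below is an assumption about how such a router could work, not the released router: the function names and the branch labels `indicf5_codemix` and `chatterbox_lora` are hypothetical, and the real criteria may differ.

```python
# Unicode blocks for the three scripts named in the evaluation.
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}


def scripts_in(text: str) -> set[str]:
    """Return the set of scripts (Latin or one of INDIC_BLOCKS) found in text."""
    found = set()
    for ch in text:
        if ch.isascii() and ch.isalpha():
            found.add("Latin")
            continue
        cp = ord(ch)
        for name, (lo, hi) in INDIC_BLOCKS.items():
            if lo <= cp <= hi:
                found.add(name)
    return found


def pick_branch(text: str) -> str:
    """Hypothetical rule: Latin letters mixed with an Indic script means code-mix."""
    scripts = scripts_in(text)
    if "Latin" in scripts and len(scripts) > 1:
        return "indicf5_codemix"   # hypothetical label for the code-mix branch
    return "chatterbox_lora"       # hypothetical label for the default branch


print(pick_branch("मुझे नया phone चाहिए"))      # indicf5_codemix
print(pick_branch("నాకు కొత్త ఫోన్ కావాలి"))  # chatterbox_lora
```

A script check like this only catches code-mixing that is written in mixed scripts. English words already spelled in the native script, as in the Telugu example, look monolingual to it.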

Key Points
  • Praxy Voice achieves commercial-class TTS for Telugu, Tamil, and Hindi using a frozen non-Indic base (Chatterbox) with zero commercial training data
  • BUPS romanizes seven Indic scripts to ISO-15919, enabling Chatterbox's Latin tokenizer to process languages it previously couldn't tokenize
  • Outperforms or ties commercial systems on pilot sets: 26.7% retroflex collapse on Telugu (vs. 33.3% for Sarvam Bulbul), 71% Tamil-zha collapse (vs. 86% for the commercial trio), and a tie with Cartesia Sonic-3 on Hindi at 0.025 LLM-WER

Why It Matters

Openly licensed weights and code bring commercial-grade Indic TTS within reach of open-source developers, enabling high-quality voice applications in underserved languages without costly proprietary data.