Audio & Speech

Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

First open-source model rivals ElevenLabs' commercial Eleven v3 (alpha) in quality while covering 12+ of Arabic's 30+ spoken varieties.

Deep Dive

A research team led by Yushen Chen has introduced Habibi, an open-source framework for unified-dialectal Arabic speech synthesis. Arabic has more than 30 spoken varieties, and until now no open-source system has unified them. Habibi addresses three key barriers: substantial cross-dialect differences, scarce high-quality training data, and the absence of a standardized evaluation benchmark. Using a multi-step curation pipeline, the team repurposed open-source Automatic Speech Recognition (ASR) corpora into text-to-speech (TTS) training data covering 12+ regional dialects.

Habibi's technical innovation lies in a linguistically informed curriculum learning strategy. The model first trains on Modern Standard Arabic before progressing to dialectal data, which enables robust zero-shot synthesis without requiring text diacritization, a significant simplification for users. The researchers ran ablations over 30+ configurations, using approximately 8,000 H100 GPU hours, to validate each design choice. On their newly released benchmark, which comprises over 11,000 manually verified utterances across 7 dialect subsets, Habibi's unified model matches or surpasses specialized per-dialect models.
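To make the curriculum idea concrete, here is a minimal Python sketch of a two-stage data schedule: batches drawn only from Modern Standard Arabic first, then batches drawn only from dialectal data. It is an illustration under stated assumptions, not Habibi's actual training code. The `Utterance` type, the `curriculum_batches` function, the dialect tags, and the choice of a hard switch between stages (rather than a gradual mix) are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    text: str
    dialect: str  # "MSA" marks Modern Standard Arabic; other tags are hypothetical


def curriculum_batches(utterances, batch_size, msa_steps, dialect_steps, seed=0):
    """Yield (stage, batch) pairs: MSA-only batches first, then dialect-only ones.

    Assumption: a hard switch between the two stages. The source only says
    that MSA training precedes dialectal training.
    """
    rng = random.Random(seed)
    msa = [u for u in utterances if u.dialect == "MSA"]
    dialectal = [u for u in utterances if u.dialect != "MSA"]
    if not msa or not dialectal:
        raise ValueError("need both MSA and dialectal utterances")

    # Stage 1: sample (with replacement) from the standard variety only.
    for _ in range(msa_steps):
        yield "msa", rng.choices(msa, k=batch_size)

    # Stage 2: move on to the dialectal pool.
    for _ in range(dialect_steps):
        yield "dialect", rng.choices(dialectal, k=batch_size)


if __name__ == "__main__":
    data = [
        Utterance("placeholder text 1", "MSA"),
        Utterance("placeholder text 2", "MSA"),
        Utterance("placeholder text 3", "EGY"),
        Utterance("placeholder text 4", "MAR"),
    ]
    for stage, batch in curriculum_batches(data, batch_size=2, msa_steps=2, dialect_steps=2):
        print(stage, [u.dialect for u in batch])
```

In a real pipeline each batch would feed a training step. The point here is only the ordering: the model sees the standard variety before the dialects.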

In both automatic metrics and human evaluations, Habibi is highly competitive with ElevenLabs' commercial Eleven v3 (alpha) model on intelligibility, speaker similarity, and naturalness. This represents a major achievement for open-source AI in a linguistically complex domain. The team is releasing all checkpoints, training and inference code, and benchmark data. This is the first such comprehensive release for multi-dialect Arabic TTS, and it provides a foundational resource for future research and application development.

Key Points
  • First open-source unified Arabic TTS covering 12+ dialects via curriculum learning from Modern Standard Arabic
  • Competes with ElevenLabs' Eleven v3 (alpha) in automatic metrics and human evaluations across intelligibility, speaker similarity, and naturalness
  • Releases complete toolkit: checkpoints, training and inference code, and a new benchmark of over 11,000 utterances across 7 dialect subsets

Why It Matters

Democratizes high-quality Arabic voice synthesis for 420M+ speakers, enabling accessible AI applications across the Arab world's diverse dialects.