I hacked LTX2 to be used as a Multi Lingual TTS voice cloner
A clever hack turns Lightricks' LTX2 video generator into a surprisingly capable, open-source TTS voice cloner.
A developer has repurposed Lightricks' LTX2, a model designed for generating video from text, into a fast multilingual voice cloning tool. The hack, detailed by Aurel M., works by feeding the model a short (20-second) audio sample alongside a text prompt. The key step uses the "Set Audio Video Mask By Time" feature to lock in the vocal identity from the supplied audio during the first 10 seconds of generation, after which the model continues the speech according to the new text prompt. That initial setup segment is then trimmed off, leaving only the newly generated speech in the cloned voice.
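The final trimming step can be done with any audio editor, but it is also easy to script. The following is a minimal sketch, not the author's actual workflow: it assumes the generated audio has already been exported as an uncompressed WAV file, and the function name `trim_leading_seconds` and the 10-second default are illustrative choices based on the description above.

```python
import wave


def trim_leading_seconds(src_path, dst_path, seconds=10.0):
    """Write a copy of a WAV file with the first `seconds` removed.

    Hypothetical helper: the 10-second default mirrors the masked
    reference segment described in the article, not a value taken
    from the original workflow files.
    Returns the number of audio frames kept.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        total = src.getnframes()
        # Convert the time offset to a frame index, clamped to the file length.
        skip = min(int(round(seconds * src.getframerate())), total)
        src.setpos(skip)
        frames = src.readframes(total - skip)

    with wave.open(dst_path, "wb") as dst:
        # Reuse the source format; the frame count in the header is
        # corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return len(frames) // (params.sampwidth * params.nchannels)
```

Calling `trim_leading_seconds("generated.wav", "cloned.wav")` on a 20-second export would leave roughly the last 10 seconds, which is the part spoken from the new prompt. The file names here are placeholders.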
This unconventional approach yields surprising advantages. The model demonstrates exceptional multilingual capabilities, reportedly handling languages like Romanian better than premium services like ElevenLabs. Users can also inject nuanced emotions into the cloned speech by prompting for actions like "he screams in perfect Romanian." The primary trade-off is a duration limit of about 10 seconds of clean audio per generation, and outputs can sometimes be nonsensical, requiring regeneration. Despite these constraints, for short-form, emotionally expressive clips in diverse languages, this LTX2 hack presents a uniquely capable and open-source alternative to specialized TTS systems.
- Repurposes Lightricks' LTX2 video model into a TTS tool using a 20-second audio sample and text prompts.
- Excels at multilingual output and emotional inflection, reportedly outperforming ElevenLabs for languages like Romanian.
- Generates quickly, but is limited to about 10 seconds of clean audio per run and can produce nonsense, requiring occasional re-runs.
Why It Matters
It provides a powerful, open-source alternative for multilingual voice cloning, challenging specialized commercial TTS services.