Image & Video

LTX-2 voice training was broken. I fixed it. (25 bugs, one patch, repo inside)

A single patch resolves garbled audio and silence issues that plagued LTX-2 character voice training.

Deep Dive

A patch for the Ostris AI-Toolkit fixes 25 bugs that broke voice training for LTX-2 character LoRAs. LTX-2 is a joint audio+video model from Lightricks, but flaws in its training pipeline produced garbled or silent audio even when the visual output was correct. The patch addresses three core problems: audio and video now use independent timesteps during training (previously they shared one), audio loading falls back through torchaudio → PyAV → ffmpeg CLI, and cached latents are validated to confirm they contain audio. Further fixes cover loss balancing, crashes when DoRA is combined with quantization, and gradient problems. With the patch applied, LTX-2 goes from a model that trained reliably only on visuals to a working audio+video character training tool.
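The fallback order described above (torchaudio → PyAV → ffmpeg CLI) can be illustrated with a small sketch. This is not the patch's actual code: the function name `load_audio_with_fallbacks` and the loader callables are hypothetical, and the real backends are stood in for by plain functions so the example runs without any audio libraries installed.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical loader type: takes a file path, returns decoded samples.
# A loader may raise, or return an empty sequence if it finds no audio.
Loader = Callable[[str], Sequence[float]]


def load_audio_with_fallbacks(
    path: str, loaders: Sequence[Tuple[str, Loader]]
) -> Tuple[str, Sequence[float]]:
    """Try each loader in order and return (backend_name, samples)
    from the first one that yields non-empty audio."""
    errors: List[str] = []
    for name, loader in loaders:
        try:
            samples = loader(path)
        except Exception as exc:  # a missing or broken backend should not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if len(samples) > 0:
            return name, samples
        # An empty result is treated as a failure so silence is never cached quietly.
        errors.append(f"{name}: returned no samples")
    raise RuntimeError(
        f"no backend could decode audio from {path!r}: " + "; ".join(errors)
    )
```

In a real trainer, each entry in `loaders` would wrap one backend. Treating an empty result as a failure, rather than accepting it, is the design choice that guards against silently training on clips with no audio.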

Key Points
  • Audio now trains on its own timestep; previously audio and video shared one, which prevented the model from learning voice.
  • Added robust audio extraction with three fallback methods, solving silent outputs on Windows/Pinokio.
  • Implemented cache validation and auto-balancing loss so audio training isn't crushed by video loss magnitude.
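
As a rough illustration of two of the points above, the sketch below samples separate timesteps for video and audio and rescales the audio loss so it is comparable in magnitude to the video loss. It uses plain Python floats rather than tensors, and every name in it (`sample_timesteps`, `LossBalancer`, the moving-average weighting) is an assumption made for illustration; the patch may balance the losses differently.

```python
import random


def sample_timesteps(rng: random.Random, independent: bool = True):
    """Return (video_t, audio_t), each in [0, 1).

    With independent=False both modalities share one timestep,
    which is the behaviour the summary describes as the old bug.
    """
    video_t = rng.random()
    audio_t = rng.random() if independent else video_t
    return video_t, audio_t


class LossBalancer:
    """Hypothetical auto-balancer: weights the audio loss by the ratio of
    running-average video loss to running-average audio loss."""

    def __init__(self, momentum: float = 0.9, eps: float = 1e-8):
        self.momentum = momentum
        self.eps = eps
        self.video_ema = None
        self.audio_ema = None

    def _update(self, ema, value):
        # The first observation seeds the average; later ones are blended in.
        if ema is None:
            return value
        return self.momentum * ema + (1.0 - self.momentum) * value

    def combine(self, video_loss: float, audio_loss: float):
        """Return (total_loss, audio_weight)."""
        self.video_ema = self._update(self.video_ema, video_loss)
        self.audio_ema = self._update(self.audio_ema, audio_loss)
        # In a tensor framework this weight would be computed from detached
        # values so that it acts as a constant during backpropagation.
        audio_weight = self.video_ema / (self.audio_ema + self.eps)
        return video_loss + audio_weight * audio_loss, audio_weight
```

A training step under these assumptions would call `sample_timesteps` once per batch, noise each modality at its own timestep, and pass the two resulting losses through `LossBalancer.combine` before backpropagation.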

Why It Matters

Enables reliable creation of AI characters with synchronized voice and appearance, unlocking LTX-2's full audio+video potential for creators.