Image & Video

I benchmarked LTX-2.3. It's so much better than previous generations but still has a long way to go.

Generates a 10-second 1344x768 video clip in just 57 seconds, but random background music persists.

Deep Dive

A developer building an AI filmmaking tool has published a detailed benchmark of the LTX-2.3 22B video generation model, revealing significant performance leaps alongside persistent, frustrating flaws. Running on an RTX PRO 6000 Blackwell with 96GB VRAM, the distilled model generated a 10-second, 1344x768 clip in approximately 57 seconds, and a full 60-second multi-shot sequence in just 6 minutes. This represents a "night and day" improvement over previous LTX generations, with notably better motion coherence, prompt adherence, and impressive new capabilities like generating speech audio from dialogue prompts and solid image-to-video (I2V) conditioning.
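As a rough sanity check on those numbers, both timings work out to roughly a 6x real-time generation cost, so the distilled model's speed scales about linearly from a single clip to a full sequence. A minimal sketch (the helper function is illustrative, not part of the benchmark):

```python
# Back-of-envelope check on the reported LTX-2.3 benchmark timings:
# a 10 s clip in ~57 s, and a 60 s multi-shot sequence in ~6 minutes.

def realtime_factor(video_seconds: float, wall_seconds: float) -> float:
    """Wall-clock seconds spent per second of generated video."""
    return wall_seconds / video_seconds

single_clip = realtime_factor(10, 57)       # 10 s clip generated in 57 s
multi_shot = realtime_factor(60, 6 * 60)    # 60 s sequence generated in 6 min

print(f"10s clip:     {single_clip:.1f}x real time")   # 5.7x
print(f"60s sequence: {multi_shot:.1f}x real time")    # 6.0x
```

The near-identical factors suggest per-clip cost dominates and there is little extra overhead for stitching a longer multi-shot sequence.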

Despite these gains, the benchmark highlights critical shortcomings that limit professional use. The most frustrating is the model's tendency to inject random background music into clips, which persists despite aggressive negative prompting. Other flaws include unpredictable "Ken Burns effect" degeneration into static pans, strange calligraphy-like artifacts, slow-motion drift in the second half of clips, and uneven timing across multi-shot sequences. Crucially, the audio-to-video (A2V) pipeline was found to use audio only as a vague mood conditioner rather than for true lip-sync, dashing hopes for automated dialogue generation. These issues indicate that LTX-2.3, while faster and more capable, still requires significant iteration before it fits reliable production workflows.

Key Points
  • Distilled model generates a 10s 1344x768 video clip in ~57 seconds, and a 60s multi-shot sequence in ~6 minutes.
  • Shows major improvements in motion coherence and prompt adherence, with new speech generation and solid I2V conditioning.
  • Suffers from random background-music injection, motion drift, and visual artifacts; A2V conditioning does not perform true lip-sync.

Why It Matters

Highlights the rapid but uneven progress in AI video generation, showing what's production-ready versus still experimental.