Image & Video

ID-LoRA with LTX-2.3 and ComfyUI custom node 🎉

⚡ A single AI model now generates a person's face and voice together from just an image and an audio clip.

Deep Dive

A new AI model called ID-LoRA (Identity-Driven In-Context LoRA) represents a significant leap in multimodal generation. Built on top of the LTX-2.3 architecture and accessible via a ComfyUI custom node, it is the first method to personalize both a subject's visual appearance and their voice within a single, unified generative pass. This eliminates the need for cascaded pipelines that treat audio and video as separate, often disjointed, processes.
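
For orientation, here is a hypothetical sketch of what such a ComfyUI custom node could look like. The `INPUT_TYPES`/`RETURN_TYPES`/`NODE_CLASS_MAPPINGS` structure is ComfyUI's standard custom-node convention; the node name, its input fields, and the `model.sample` call are illustrative assumptions, not the published node's API.

```python
# Hypothetical ComfyUI custom-node skeleton for a unified audio-video pass.
# The class/field names and model.sample() are assumptions for illustration.

class IDLoRAGenerate:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "model": ("MODEL",),            # LTX backbone with ID-LoRA weights applied
                "reference_image": ("IMAGE",),  # visual identity
                "reference_audio": ("AUDIO",),  # vocal identity
                "prompt": ("STRING", {"multiline": True}),
            }
        }

    RETURN_TYPES = ("IMAGE", "AUDIO")  # video frames + generated speech
    FUNCTION = "generate"
    CATEGORY = "audio-video"

    def generate(self, model, reference_image, reference_audio, prompt):
        # Single unified pass: one sampling call conditioned on all three
        # inputs, instead of separate voice and face pipelines.
        frames, speech = model.sample(prompt, reference_image, reference_audio)  # hypothetical
        return (frames, speech)


NODE_CLASS_MAPPINGS = {"IDLoRAGenerate": IDLoRAGenerate}
```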

ID-LoRA operates by creating a unified latent space in which a single text prompt simultaneously dictates the visual scene, environmental acoustics, and speaking style. Users provide a reference image for visual likeness and a short audio clip for vocal identity; the model then generates content in which the subject looks and sounds like the reference, all governed by the text description. A key feature is zero-shot inference: users simply load the LoRA weights, with no per-speaker fine-tuning.
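
To make "simply load the LoRA weights" concrete, below is a minimal, self-contained sketch of the standard LoRA mechanism: at inference, the low-rank update W' = W + (alpha/r)·B·A is merged into a frozen base weight, so personalization is a weight load rather than a training run. The shapes, rank, and which layers ID-LoRA actually targets inside LTX-2.3 are assumptions here.

```python
import torch

# Minimal sketch of applying LoRA weights at inference (no fine-tuning).
# Dimensions and rank are illustrative, not ID-LoRA's actual configuration.
d_out, d_in, r, alpha = 512, 512, 16, 16.0

base = torch.nn.Linear(d_in, d_out, bias=False)
base.weight.requires_grad_(False)  # frozen base-model weight

# LoRA factors as they would arrive in a downloaded checkpoint
A = torch.randn(r, d_in) * 0.01  # "lora_down"
B = torch.zeros(d_out, r)        # "lora_up" (zeros => no-op until trained weights are loaded)

with torch.no_grad():
    # Zero-shot "loading" is just a weight merge: W' = W + (alpha/r) * B @ A
    base.weight += (alpha / r) * (B @ A)

x = torch.randn(1, d_in)
y = base(x)  # behaves like the personalized model; no training step involved
print(y.shape)
```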

The technical pipeline employs a two-stage process for high-quality output that includes 2x spatial upsampling. Generation is prompt-driven throughout: the text can specify not just what the person says and how they look, but also background sounds and scene details. The model's "audio identity transfer" ensures the generated speaker retains the timbre and characteristics of the reference voice, producing a cohesive, personalized audiovisual result from minimal input.
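
A rough sketch of the 2x spatial upsampling step, assuming a video latent of shape (batch, channels, frames, height, width); trilinear interpolation stands in for whatever learned upsampler LTX-2.3 actually uses:

```python
import torch
import torch.nn.functional as F

# Two-stage idea: stage 1 produces a base-resolution video latent; stage 2
# refines a 2x spatially upsampled version. Shapes are illustrative.
B, C, T, H, W = 1, 8, 16, 32, 32
stage1_latent = torch.randn(B, C, T, H, W)  # placeholder for stage-1 output

# 2x spatial upsampling only; the time dimension is unchanged
upsampled = F.interpolate(
    stage1_latent, scale_factor=(1, 2, 2), mode="trilinear", align_corners=False
)
print(upsampled.shape)  # torch.Size([1, 8, 16, 64, 64])

# Stage 2 would then denoise/refine `upsampled`, conditioned on the same
# text prompt, reference image, and voice reference.
```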

Key Points
  • Unified audio-video generation in a single model, replacing cascaded pipelines that handle voice and face separately.
  • Zero-shot inference: works with just the LoRA weights, a reference image, and an audio clip; no per-subject training needed.
  • Prompt-driven control over scene, speaking style, and environment sounds from a single text input.

Why It Matters

This enables rapid, cohesive creation of personalized digital avatars for content, education, and assistive tech without complex multi-model pipelines.