Audio & Speech

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

A new model generates lip-synced audio and video jointly, outperforming cascaded pipelines.

Deep Dive

Talker-T2AV introduces a novel approach to generating synchronized talking-head videos by separating high-level cross-modal modeling from low-level detail refinement. A shared autoregressive language model jointly reasons over audio and video tokens in a unified patch-level space, capturing semantic correlations such as the alignment between lip movements and speech content. Two lightweight diffusion transformer decoders then independently refine the resulting hidden states into high-quality audio and video latents, avoiding the cross-modal entanglement that plagues earlier joint generation models.
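
To make the decoupling concrete, here is a minimal PyTorch sketch of the two-stage design described above: a shared causal transformer reasons over interleaved audio/video tokens, and two small diffusion-transformer heads refine each modality separately, conditioned on the shared hidden states. All module names, dimensions, and the simplified noise-prediction interface are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch only: shapes, names, and the denoising interface are assumptions,
# not the Talker-T2AV reference implementation.
import torch
import torch.nn as nn


class SharedARBackbone(nn.Module):
    """Causal transformer that jointly models interleaved audio/video tokens."""

    def __init__(self, vocab_size=8192, dim=512, depth=6, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):
        # tokens: (batch, seq) of interleaved audio/video token ids.
        causal = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)
        ).to(tokens.device)
        return self.blocks(self.embed(tokens), mask=causal)  # (batch, seq, dim)


class LightweightDiTDecoder(nn.Module):
    """Per-modality diffusion head: predicts the noise on modality latents,
    conditioned on the backbone's hidden states via cross-attention."""

    def __init__(self, latent_dim, cond_dim=512, dim=256, depth=2, heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, dim)
        self.cond_proj = nn.Linear(cond_dim, dim)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerDecoder(layer, num_layers=depth)
        self.out_proj = nn.Linear(dim, latent_dim)

    def forward(self, noisy_latents, cond, t):
        # noisy_latents: (batch, n, latent_dim); cond: (batch, seq, cond_dim).
        x = self.in_proj(noisy_latents) + self.time_mlp(t[:, None, None].float())
        x = self.blocks(x, memory=self.cond_proj(cond))
        return self.out_proj(x)  # predicted noise, same shape as the latents
```

A usage example shows where the separation pays off: both heads read the same semantic hidden states, but each denoises only its own latent stream, so neither modality's low-level details leak into the other's decoder:

```python
backbone = SharedARBackbone()
audio_head = LightweightDiTDecoder(latent_dim=64)   # audio latent size: assumed
video_head = LightweightDiTDecoder(latent_dim=128)  # video latent size: assumed

tokens = torch.randint(0, 8192, (2, 32))   # interleaved audio/video token ids
hidden = backbone(tokens)                  # shared cross-modal reasoning
t = torch.randint(0, 1000, (2,))           # diffusion timestep per sample
audio_eps = audio_head(torch.randn(2, 16, 64), hidden, t)
video_eps = video_head(torch.randn(2, 16, 128), hidden, t)
```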

This design improves both efficiency and output quality. On talking-portrait benchmarks, Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, and achieves stronger cross-modal consistency than cascaded pipelines. The work highlights the benefit of decoupling semantic reasoning from low-level rendering in multimodal generation, with potential applications in virtual avatars, film dubbing, and real-time communication.

Key Points
  • Talker-T2AV uses a shared autoregressive language model for high-level cross-modal reasoning over audio and video tokens.
  • Two lightweight diffusion transformer decoders handle low-level refinement separately for audio and video, reducing unnecessary entanglement.
  • Outperforms dual-branch baselines on talking-portrait benchmarks in lip-sync accuracy, video quality, and audio quality.

Why It Matters

Enables more realistic and efficient talking-head generation for virtual avatars, film dubbing, and real-time communication.