JAM-Flow: New AI model unifies speech and facial animation

arXiv eess.AS May 18, 2026

⚡A single flow-matching model generates talking heads from text, audio, or motion

Deep Dive

JAM-Flow, from Kwon et al., targets a link that generative models often overlook: the intrinsic coupling between facial motion and speech. Where previous approaches treat talking head synthesis and text-to-speech as separate tasks, JAM-Flow is a unified framework that jointly synthesizes and conditions on both modalities. The model uses flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture that integrates specialized Motion-DiT and Audio-DiT modules. These modules are coupled through selective joint attention layers, with two key architectural choices: temporally aligned positional embeddings and localized joint attention masking. This design allows effective cross-modal interaction while preserving modality-specific strengths.
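
To make the idea of temporally aligned positions and localized joint attention more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes one plausible reading of the description: motion and audio tokens run at different rates, each token is placed on a shared time axis, and a motion token may only attend to audio tokens within a small time window. The function names, the token rates, and the window size are all hypothetical choices made for illustration.

```python
import numpy as np

def token_times(num_tokens, rate_hz):
    """Timestamp in seconds of each token, given a token rate (assumed value)."""
    return np.arange(num_tokens) / rate_hz

def local_joint_mask(motion_times, audio_times, window_s):
    """Boolean mask [num_motion, num_audio]; True where attention is allowed.

    A motion token may attend only to audio tokens whose timestamps lie
    within window_s seconds of its own timestamp.
    """
    dt = np.abs(motion_times[:, None] - audio_times[None, :])
    return dt <= window_s

def masked_cross_attention(q, k, v, mask):
    """Scaled dot-product attention with disallowed pairs masked out.

    Assumes every query row has at least one allowed key.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical setup: 5 motion tokens at 25 Hz, 20 audio tokens at 100 Hz.
rng = np.random.default_rng(0)
motion_t = token_times(5, 25.0)
audio_t = token_times(20, 100.0)
mask = local_joint_mask(motion_t, audio_t, window_s=0.05)

q = rng.standard_normal((5, 8))    # motion queries
k = rng.standard_normal((20, 8))   # audio keys
v = rng.standard_normal((20, 8))   # audio values
out = masked_cross_attention(q, k, v, mask)
```

Because both modalities share one time axis, the mask is built from timestamps instead of raw token indices, so the two streams stay in step even though their sequence lengths differ.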

Trained with an inpainting-style objective, JAM-Flow supports a wide range of conditioning inputs, including text, reference audio, and reference motion. This lets a single model handle tasks such as synchronized talking head generation from text and audio-driven animation. The authors present the framework as a practical approach to holistic audio-visual synthesis. The paper is under review and is available on arXiv, along with a project page.
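
As a rough illustration of what an inpainting-style flow matching objective can look like, the sketch below uses the generic conditional flow matching recipe with a linear path from noise to data. It is an assumption about the general technique, not code from the paper: part of a clean sequence is masked, the unmasked part is handed to the model as context, and the loss is computed only on the masked frames. All names here (`inpainting_fm_example`, `masked_fm_loss`) are hypothetical, and no actual network is included.

```python
import numpy as np

def inpainting_fm_example(x1, mask, rng):
    """Build one training example for masked flow matching.

    x1:   clean sequence, shape [T, D]
    mask: boolean, shape [T]; True marks frames the model must generate
    """
    x0 = rng.standard_normal(x1.shape)         # noise sample
    t = rng.uniform()                          # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1              # point on the linear path
    target_v = x1 - x0                         # velocity of that path
    context = np.where(mask[:, None], 0.0, x1) # clean data on unmasked frames only
    return x_t, t, target_v, context

def masked_fm_loss(pred_v, target_v, mask):
    """Mean squared velocity error, averaged over masked frames only."""
    sq_err = ((pred_v - target_v) ** 2).mean(axis=-1)
    return (sq_err * mask).sum() / max(mask.sum(), 1)

# Hypothetical example: 6 frames, 3 features, middle two frames masked.
rng = np.random.default_rng(0)
x1 = np.ones((6, 3))
mask = np.array([False, False, True, True, False, False])
x_t, t, target_v, context = inpainting_fm_example(x1, mask, rng)

# A model would predict pred_v from (x_t, t, context, other conditions).
loss_perfect = masked_fm_loss(target_v, target_v, mask)
loss_zero_pred = masked_fm_loss(np.zeros_like(target_v), target_v, mask)
```

Under this setup, changing which frames (or which modality) are masked changes the task without changing the model, which is one way a single network could cover both text-driven generation and audio-driven animation.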

Key Points

Joint synthesis of facial motion and speech using flow matching and a Multi-Modal Diffusion Transformer (MM-DiT)
Selective joint attention layers with temporally aligned positional embeddings for cross-modal interaction
Supports text, audio, and motion inputs for tasks like talking head generation and audio-driven animation

Why It Matters

Unifies two previously separate AI tasks, enabling more realistic and controllable audio-visual content creation.
