JAM-Flow: New AI model unifies speech and facial animation
A single flow-matching model generates talking heads from text, audio, or motion
JAM-Flow, developed by Kwon et al., addresses the intrinsic link between facial motion and speech, which generative modeling has largely overlooked. Unlike previous approaches that treat talking head synthesis and text-to-speech as separate tasks, JAM-Flow is a unified framework that jointly synthesizes and conditions on both modalities. The model uses flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture that integrates specialized Motion-DiT and Audio-DiT modules. These modules are coupled via selective joint attention layers and rely on two key architectural choices: temporally aligned positional embeddings and localized joint attention masking. This design allows effective cross-modal interaction while preserving modality-specific strengths.
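To make the idea of localized joint attention on a shared time axis more concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the authors' implementation: the token spacings (40 ms for motion, 20 ms for audio), the 20 ms window, and the function names are hypothetical, chosen only to show how timestamps shared across modalities let each motion token attend to temporally nearby audio tokens.

```python
def token_times_ms(num_tokens, step_ms):
    """Timestamp (in integer milliseconds) of each token, assuming a fixed spacing."""
    return [i * step_ms for i in range(num_tokens)]


def local_cross_modal_mask(motion_times, audio_times, window_ms):
    """Return mask[i][j] == True when motion token i may attend to audio token j.

    Attention is allowed only when the two tokens lie within window_ms of
    each other on the shared time axis (hypothetical rule for illustration).
    """
    return [
        [abs(t_motion - t_audio) <= window_ms for t_audio in audio_times]
        for t_motion in motion_times
    ]


# Hypothetical rates: 4 motion tokens every 40 ms, 8 audio tokens every 20 ms.
motion_times = token_times_ms(4, 40)
audio_times = token_times_ms(8, 20)
mask = local_cross_modal_mask(motion_times, audio_times, window_ms=20)

# Print the mask as rows of 1/0, one row per motion token.
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Because both modalities are indexed by time rather than by raw token position, the mask stays aligned even though the two streams have different token rates. In a real attention layer a mask like this would be applied to the attention scores before the softmax.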
Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs, including text, reference audio, and reference motion. A single model can therefore handle tasks such as synchronized talking head generation from text and audio-driven animation. The framework offers a practical approach to holistic audio-visual synthesis, covering tasks that previously required separate systems. The paper is available on arXiv, is currently under review, and has an accompanying project page.
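As a rough illustration of what an inpainting-style flow matching objective can look like, the sketch below builds one training example in plain Python. It assumes the common linear interpolation path between Gaussian noise and data, with the velocity (data minus noise) as the regression target and the loss restricted to the positions to be generated. These details, and all names in the code, are assumptions made for illustration; the summary above does not specify JAM-Flow's exact formulation.

```python
import random


def make_training_example(clean, gen_mask, t, rng):
    """Build the model input and velocity target for one feature sequence.

    clean:    clean feature values (one float per position)
    gen_mask: True where the model must generate, False where context is given
    t:        flow time in [0, 1]; 0 is pure noise, 1 is clean data
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # Linear path between noise and data (assumed, a common choice).
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, clean)]
    # Inpainting: context positions keep their clean values.
    model_input = [xt if m else d for xt, d, m in zip(x_t, clean, gen_mask)]
    # Velocity target along the linear path.
    target = [d - n for n, d in zip(noise, clean)]
    return model_input, target


def masked_mse(pred, target, gen_mask):
    """Mean squared error over the positions the model must generate."""
    errors = [(p - y) ** 2 for p, y, m in zip(pred, target, gen_mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


rng = random.Random(0)
clean = [1.0, 2.0, 3.0, 4.0]
gen_mask = [False, True, True, False]  # generate the middle, keep the ends
model_input, target = make_training_example(clean, gen_mask, t=0.5, rng=rng)
```

Under this setup, choosing which positions and which modality to mask at training time is what would let one model cover several tasks: masking only the motion stream corresponds to audio-driven animation, while masking both streams corresponds to generating speech and motion together from text.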
- Joint synthesis of facial motion and speech using flow matching and a Multi-Modal Diffusion Transformer (MM-DiT)
- Selective joint attention layers with temporally aligned positional embeddings for cross-modal interaction
- Supports text, audio, and motion inputs for tasks like talking head generation and audio-driven animation
Why It Matters
JAM-Flow unifies two previously separate AI tasks, talking head synthesis and text-to-speech, which could enable more realistic and controllable audio-visual content creation.