Audio & Speech

OmniCustom: Sync Audio-Video Customization via Joint Audio-Video Generation Model

This new model can clone your face and voice into any video scenario.

Deep Dive

Researchers have unveiled OmniCustom, an AI model that customizes video and audio synchronously from minimal references. Given a single reference image and a short audio clip, it generates a video in which the subject preserves the reference's visual identity and vocal timbre while speaking content specified by a text prompt. Built on a Diffusion Transformer framework with specialized LoRA modules and contrastive learning, it was trained on a large-scale human dataset and outperforms existing methods in identity and voice fidelity.
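OmniCustom's exact module design is not public, but the "specialized LoRA modules" mentioned above refer to a general technique: a frozen base layer is augmented with a small trainable low-rank update, so only a few parameters need tuning to inject a new identity. A minimal sketch of that generic mechanism (the class name and shapes here are illustrative, not from the paper):

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer with a trainable low-rank (LoRA) update.

    Hypothetical illustration only: shows the generic LoRA formulation
    y = x W^T + (alpha / r) * x A^T B^T, not OmniCustom's actual modules.
    """

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 8.0):
        self.weight = weight                # frozen base weight, shape (out, in)
        out_dim, in_dim = weight.shape
        self.scale = alpha / rank           # standard LoRA scaling factor
        rng = np.random.default_rng(0)
        # A starts small, B starts at zero, so the adapter initially adds nothing
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Frozen base path plus the scaled low-rank adaptation path
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B is initialized to zero, the adapted layer reproduces the frozen model exactly until fine-tuning updates A and B, which is what makes LoRA a cheap, non-destructive way to specialize a large pretrained generator.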

Why It Matters

It enables hyper-realistic, personalized video content creation for marketing, entertainment, and synthetic media, all from minimal input.