Meituan's LongCat Video Avatar 1.5 generates expressive talking heads
Open-source model from Meituan boasts 30% better emotion fidelity and 720p output.
Chinese tech giant Meituan has open-sourced LongCat Video Avatar 1.5, the latest iteration of its talking-head generation model. Available on Hugging Face, the model uses a novel cross-modal attention mechanism to sync lip movements with input audio while preserving natural head poses and micro-expressions. Key upgrades include a 30% improvement in emotion fidelity over v1.0, support for 720p output at 30 FPS, and reduced inference time: a 10-second clip now generates in under 8 seconds on a single A100 GPU.
The model takes a single reference image and an audio track, then produces a video with realistic blinking, eyebrow raises, and subtle mouth shapes. Meituan also released a Gradio demo and inference scripts, making it easier for developers to integrate the model into live chat, e-learning, or avatar-based customer service. The release signals a push toward democratizing high-quality avatar generation, competing with proprietary systems like HeyGen and D-ID while remaining fully open-source under a permissive license.
- Open-source release on Hugging Face under a permissive license; no API key required.
- 30% better emotion fidelity than v1.0, supporting happiness, sadness, surprise, and neutral states.
- Generates a 10-second 720p clip in under 8 seconds on a single A100 GPU.
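To put the speed claim in perspective, here is a rough back-of-the-envelope sketch of what the stated figures imply. It uses only the numbers quoted above (a 10-second clip at 30 FPS generated in under 8 seconds) and treats 8 seconds as the worst case, so the results are bounds rather than measurements. The function name `clip_throughput` and its dictionary keys are made up for this illustration and are not part of Meituan's release.

```python
def clip_throughput(clip_seconds: float, fps: float, gen_seconds: float) -> dict:
    """Derive throughput figures from a clip length, frame rate, and generation time.

    Hypothetical helper for illustration only; not part of the LongCat codebase.
    """
    if clip_seconds <= 0 or fps <= 0 or gen_seconds <= 0:
        raise ValueError("all inputs must be positive")
    frames = clip_seconds * fps
    return {
        # Total frames the model has to produce for the clip.
        "frames": frames,
        # Frames produced per second of wall-clock generation time.
        "generated_fps": frames / gen_seconds,
        # Generation time divided by clip length; below 1.0 means faster than playback.
        "real_time_factor": gen_seconds / clip_seconds,
    }


# Figures from the article; 8 s is the stated upper bound on generation time.
stats = clip_throughput(clip_seconds=10, fps=30, gen_seconds=8)
print(stats)
```

Because 8 seconds is an upper bound, the model would be producing at least 37.5 frames per second of compute, with a real-time factor of at most 0.8. In other words, on the quoted hardware it generates video slightly faster than the video plays back.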
Why It Matters
Professional-quality talking-head avatars are now free to use and fast to generate, letting startups build lifelike AI presenters on their own GPUs without paying for a proprietary avatar service.