LongCat-Video-Avatar 1.5 open-sources avatar video generation with Whisper-Large and 8-step inference
Generates talking avatars from audio and text in 8 inference steps, with smoother, more natural lip-sync.
Deep Dive
LongCat-Video-Avatar 1.5 is an open-source framework (MIT license) for audio-driven human video generation. It uses Whisper-Large for smoother lip-sync, achieves full-body temporal stability, and supports stylized domains (anime, animals). Inference is accelerated to 8 steps via DMD2 distillation. A 508-pair benchmark covers 6 scenarios, 2 languages, and 2 styles.
Key Points
- Replaces Wav2Vec2 with Whisper-Large for smoother, more natural lip-sync
- 8-step inference via DMD2 distillation enables cost-effective serving with high fidelity
- Evaluated on a 508-pair benchmark spanning 6 scenarios, 2 languages, and 2 styles, with ratings from 770 human raters
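To make the step-count point concrete, here is a rough sketch (not taken from the LongCat code) of how the number of sampling steps relates to denoiser forward passes per clip. Only the 8-step figure comes from the summary above. The baseline step counts are hypothetical placeholders, the function names are illustrative, and the ratio ignores fixed costs such as audio encoding and video decoding.

```python
# Illustrative only: relates sampling step count to denoiser forward passes.
# The 8-step figure is from the release summary; every baseline step count
# below is a hypothetical input, not a reported value.

def denoiser_calls(steps: int) -> int:
    """Forward passes per generated clip, assuming one pass per sampling step."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return steps


def relative_cost(baseline_steps: int, distilled_steps: int) -> float:
    """Ratio of baseline to distilled denoiser calls (fixed costs ignored)."""
    return denoiser_calls(baseline_steps) / denoiser_calls(distilled_steps)


if __name__ == "__main__":
    distilled_steps = 8  # stated in the summary
    for hypothetical_baseline in (16, 32, 64):  # assumptions for illustration
        ratio = relative_cost(hypothetical_baseline, distilled_steps)
        print(f"baseline={hypothetical_baseline} steps -> {ratio:.1f}x fewer denoiser calls")
```

The actual saving depends on the undistilled sampler's step count and on per-step overheads, neither of which is given here.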
Why It Matters
With an MIT license and 8-step inference, open-source lifelike avatar generation becomes a practical option for production use, lowering the cost of video synthesis for enterprises and creators.