Image & Video

Open-source audio-video generation: Porting Alive's joint Audio+Video DiT architecture onto Wan2.1/2.2 as the base model. Early stage, contributors welcome.

A developer is rebuilding ByteDance's Alive architecture, shown competitive with Veo 3 and Sora 2, on open-source Wan models for synchronized audio-video generation.

Deep Dive

An independent developer is spearheading an ambitious open-source initiative to recreate ByteDance's state-of-the-art Alive audio-video generation architecture using publicly available foundation models. The project, called Alive-Wan, aims to port the core technical innovations from ByteDance's research paper (which demonstrated results competitive with Veo 3, Kling 2.6, and Sora 2) onto the open-source Wan2.1 and Wan2.2 video generation models, which serve as the backbone. The effort seeks to democratize a capability currently locked behind major tech companies' closed APIs by building a community-developed alternative that can generate synchronized sound and moving images from text prompts.

The technical roadmap involves integrating a separate ~2B-parameter Audio Diffusion Transformer (DiT) branch alongside Wan's video DiT, connecting them with Temporally-Aligned Cross-Attention (TA-CrossAttn) for precise lip-sync and event alignment. A key advantage is that Wan models share the same VAE (Wan-VAE) that the original Alive research used, simplifying the porting process. The developer has established the codebase and is now seeking collaborators with expertise in audio machine learning, DiT architecture modifications, and distributed training infrastructure to help implement the complex four-stage training strategy outlined in the Alive paper. The project represents a significant step toward closing the performance gap between open-source and proprietary multimodal generative AI.
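
The summary doesn't spell out how TA-CrossAttn is wired, so here is a minimal PyTorch sketch of one plausible reading: the audio branch queries the video branch, but each audio token may only attend to video tokens whose timestamps fall inside a small window. The class name, windowing rule, latent rates, and every hyperparameter are illustrative assumptions, not the Alive paper's actual design.

```python
import torch
import torch.nn as nn

class TemporallyAlignedCrossAttention(nn.Module):
    """Cross-attention in which audio tokens see only temporally nearby
    video tokens. Illustrative sketch only; not the Alive design."""

    def __init__(self, dim: int, num_heads: int = 8, window_sec: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)  # pre-norm on the query side only
        self.window_sec = window_sec

    def forward(self, audio, video, audio_times, video_times):
        # Boolean attention mask: True means "blocked". Each audio token
        # may attend only to video tokens within +/- window_sec of it.
        # (Assumes every audio token has at least one unblocked video
        # token; an all-blocked row would make the softmax emit NaNs.)
        dt = (audio_times[:, None] - video_times[None, :]).abs()  # (Ta, Tv)
        blocked = dt > self.window_sec
        out, _ = self.attn(self.norm(audio), video, video, attn_mask=blocked)
        return audio + out  # residual: audio branch absorbs video context

# Illustrative rates for a 4 s clip: audio latents at 25 Hz, video at 8 Hz.
block = TemporallyAlignedCrossAttention(dim=1536)
audio = torch.randn(2, 100, 1536)  # (batch, audio tokens, dim)
video = torch.randn(2, 32, 1536)   # (batch, video tokens, dim)
fused = block(audio, video, torch.arange(100) / 25.0, torch.arange(32) / 8.0)
```

Windowed attention is one simple way to push for tight lip-sync: an audio token can only borrow information from the video frames it is supposed to be synchronized with.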

Key Points
  • Ports ByteDance's Alive architecture, which scored competitively against Veo 3 and Sora 2 in evaluations, to open-source Wan2.1/2.2 models.
  • Adds a separate ~2B-parameter Audio DiT branch and uses TA-CrossAttn & UniTemp-RoPE for synchronized audio-video generation; see the UniTemp-RoPE sketch after this list.
  • Seeks community contributors to help build a fully open alternative to closed models, focusing on audio ML, DiT hacking, and training infra.
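
UniTemp-RoPE is only name-checked above. One plausible reading of a unified temporal rotary position embedding is RoPE computed from shared wall-clock timestamps rather than per-modality token indices, so that audio and video tokens sampled at different rates land on a single temporal axis. The sketch below, including the function name and the rates in the usage lines, is an assumption under that reading, not the paper's formulation.

```python
import torch

def unified_temporal_rope(x: torch.Tensor, times_sec: torch.Tensor,
                          base: float = 10000.0) -> torch.Tensor:
    """Rotary embedding keyed to timestamps in seconds instead of token
    indices, so streams at different rates share one temporal axis.
    x: (batch, T, dim) with even dim; times_sec: (T,) timestamps."""
    half = x.shape[-1] // 2
    # Standard RoPE frequency ladder, but positions are real timestamps.
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = times_sec.to(x.dtype)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Audio latents at 25 Hz and video latents at 8 Hz get mutually consistent
# positions because both are embedded against real time, not token index.
q_audio = unified_temporal_rope(torch.randn(2, 100, 128), torch.arange(100) / 25.0)
k_video = unified_temporal_rope(torch.randn(2, 32, 128), torch.arange(32) / 8.0)
```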

Why It Matters

Democratizes synchronized audio-video generation, creating an open-source alternative to closed models from OpenAI, Google, and ByteDance.