Wan-Weaver: Interleaved Multimodal Generation (T2I & I2I)
New model from Tongyi Lab and Tsinghua University beats open-source rivals and challenges Google's Nano Banana in interleaved generation.
Researchers from Tongyi Lab and Tsinghua University have introduced Wan-Weaver, a novel multimodal AI model designed specifically for interleaved text and image generation. Unlike traditional models that handle these tasks separately, Wan-Weaver can fluidly alternate between writing text and generating corresponding images within a single, coherent conversation. This capability enables the creation of multi-step visual narratives, such as illustrated stories, fashion lookbooks with matching outfits, or step-by-step recipe guides. The model's architecture cleverly decouples the planning and visualization processes, using a Planner module to understand the narrative and a Visualizer to render the images, all without requiring hard-to-obtain real interleaved training data.
The team's key innovation was developing a method to synthesize 'textual proxy' data for training, bypassing the need for massive datasets of perfectly paired text and image sequences. This approach has proven effective, with Wan-Weaver demonstrating superior long-range consistency—ensuring characters and scenes remain coherent across multiple generation steps—compared to most open-source models. In benchmark tests, its performance is competitive with Google's commercial offering, Nano Banana, particularly in maintaining narrative and visual coherence. Beyond its core function, Wan-Weaver also shows strong capabilities in standard text-to-image generation, image editing, and visual understanding, positioning it as a versatile tool for content creators, educators, and developers looking to build more interactive and visually rich AI applications.
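The decoupled Planner/Visualizer flow described above can be illustrated with a toy loop: a planner breaks a narrative into captioned steps, and a visualizer renders each step conditioned on everything generated so far. This is only a sketch of the general pattern; all function and class names here are illustrative assumptions, not Wan-Weaver's actual API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    caption: str       # text the model would write for this beat
    image_prompt: str  # prompt the visualizer would render

def plan(narrative: str, n_steps: int = 3) -> list[Step]:
    # Hypothetical Planner: split the narrative into sequential beats whose
    # prompts carry shared context forward, which is how long-range
    # consistency (stable characters and scenes) would be preserved.
    beats = [f"{narrative} (part {i + 1} of {n_steps})" for i in range(n_steps)]
    return [Step(caption=b, image_prompt=f"Illustration: {b}") for b in beats]

def visualize(step: Step, context: list[str]) -> str:
    # Hypothetical Visualizer: a stand-in for the image backbone, which
    # would condition on the current prompt plus all prior prompts.
    return f"<image of '{step.image_prompt}' with {len(context)} prior scenes>"

def weave(narrative: str) -> list[tuple[str, str]]:
    # Interleaved generation: alternate caption text and a rendered image,
    # accumulating context so each image can see the steps before it.
    context: list[str] = []
    out: list[tuple[str, str]] = []
    for step in plan(narrative):
        image = visualize(step, context)
        context.append(step.image_prompt)
        out.append((step.caption, image))
    return out
```

The key design point mirrored here is the separation of concerns: the planner never touches pixels, and the visualizer never reasons about the overall narrative, so each can be trained (or swapped) independently.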
- Uses a novel Planner + Visualizer architecture trained with synthesized 'textual proxy' data, eliminating the need for real interleaved datasets.
- Demonstrates strong long-range consistency, outperforming most open-source models and competing with Google's commercial Nano Banana in benchmarks.
- Enables practical applications like creating illustrated stories, fashion lookbooks, and step-by-step guides in a single, coherent AI conversation.
Why It Matters
This brings us closer to AI that can create rich multimodal content, like picture books or social media posts, in one seamless interaction.