NVidia's PixelDiT: 1.3B VAE-Free Diffusion Transformer for Images

r/StableDiffusion June 02, 2026

⚡ NVIDIA releases PixelDiT, a VAE-free 1.3B-parameter diffusion transformer with a dual-level architecture and image editing support.

Deep Dive

NVIDIA has released PixelDiT, a 1.3-billion-parameter text-to-image model that breaks from conventional latent diffusion by operating directly in pixel space, with no VAE (variational autoencoder) bottleneck. Its dual-level architecture pairs a patch-level DiT with a pixel-level DiT, a design aimed at fine-grained detail and high fidelity. The model uses MM-DiT-style joint attention between text and image tokens, with Google's Gemma-2-2B-IT as its text encoder for language understanding.
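To see why a pixel-space model might split its work between a patch level and a pixel level, it helps to count tokens. The sketch below is illustrative arithmetic only: the function name `token_counts` and the patch size of 16 are assumptions made for the example, not values confirmed for PixelDiT.

```python
def token_counts(height: int, width: int, patch_size: int) -> dict:
    """Compare sequence lengths for patch-level and per-pixel processing."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patch_tokens = (height // patch_size) * (width // patch_size)
    pixel_tokens = height * width
    return {
        "patch_tokens": patch_tokens,
        "pixel_tokens": pixel_tokens,
        "pixels_per_patch": patch_size * patch_size,
        # Full self-attention compares every token with every other token,
        # so its cost grows with the square of the sequence length.
        "attention_pairs_patch": patch_tokens ** 2,
        "attention_pairs_pixel": pixel_tokens ** 2,
    }


# patch_size=16 is a hypothetical value chosen for illustration.
counts = token_counts(1024, 1024, patch_size=16)
print(counts["patch_tokens"])  # 4096
print(counts["pixel_tokens"])  # 1048576
```

Under these assumptions, global attention over 4,096 patch tokens is tractable, while attention over roughly a million individual pixels is not. That is one plausible motivation for keeping global attention at the patch level and leaving fine detail to a pixel-level stage.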

PixelDiT supports multi-aspect-ratio generation at 1024px resolution, so it can produce a range of output shapes. Beyond text-to-image, it also handles image editing tasks. NVIDIA has published the model on Hugging Face (Diffusers), released a ComfyUI version, and open-sourced the code on GitHub. Working directly in pixel space avoids the reconstruction artifacts common in VAE-based models and simplifies the pipeline for researchers and developers.
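Multi-aspect-ratio generation is commonly handled by picking a width and height that keep the total pixel count near a fixed budget. The sketch below shows that general idea; the function name `resolution_for_aspect` and the rounding multiple of 16 are assumptions for illustration, not PixelDiT's documented bucketing scheme.

```python
import math


def resolution_for_aspect(aspect_w: int, aspect_h: int,
                          base: int = 1024, multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) with roughly base*base pixels at the given aspect ratio."""
    if aspect_w <= 0 or aspect_h <= 0:
        raise ValueError("aspect ratio terms must be positive")
    ratio = aspect_w / aspect_h
    # Solve width * height = base**2 subject to width / height = ratio.
    width = base * math.sqrt(ratio)
    height = base / math.sqrt(ratio)

    def snap(value: float) -> int:
        # Round to the nearest multiple so each side divides evenly into patches.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for aw, ah in [(1, 1), (16, 9), (9, 16), (4, 3)]:
    print(f"{aw}:{ah} ->", resolution_for_aspect(aw, ah))
```

A square request maps to 1024 x 1024, while wider or taller ratios trade one side for the other at a similar pixel budget. The shapes a released model actually accepts depend on how it was trained, so check the model card before relying on any particular resolution.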

Key Points

1.3B parameters with VAE-free architecture using dual-level DiT (patch + pixel)
MM-DiT text-image joint attention and Gemma-2-2B-IT text encoder
Supports multi-aspect-ratio at 1024px and includes image editing capabilities

Why It Matters

Enables higher-fidelity image generation without VAE artifacts, pushing toward more direct pixel-space diffusion transformers.
