NVIDIA's PixelDiT: A 1.3B VAE-Free Diffusion Transformer for Images
NVIDIA releases PixelDiT, a VAE-free 1.3B-parameter model with a dual-level architecture for image generation and editing.
NVIDIA has released PixelDiT, a 1.3 billion parameter text-to-image model that breaks from conventional latent diffusion by operating directly in pixel space, with no VAE (variational autoencoder) bottleneck. Its dual-level architecture pairs a patch-level DiT with a pixel-level DiT, a design aimed at fine-grained control and high fidelity. The model uses MM-DiT for joint attention between text and image tokens, and Google's Gemma-2-2B-IT as its text encoder for strong language understanding.
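To make "operating directly in pixel space" concrete, here is a minimal sketch of how raw RGB pixels can be cut into patch tokens without passing through a VAE encoder. The patch size of 16 and the function names are assumptions for illustration only; the article does not state PixelDiT's actual patch size or how the pixel-level DiT consumes the result.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) pixel image into a sequence of flattened patch tokens."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh*gw, patch*patch*C)
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * c)


def unpatchify(tokens: np.ndarray, patch: int, height: int, width: int,
               channels: int = 3) -> np.ndarray:
    """Inverse of patchify: rebuild the (H, W, C) image from patch tokens."""
    gh, gw = height // patch, width // patch
    grid = tokens.reshape(gh, gw, patch, patch, channels).transpose(0, 2, 1, 3, 4)
    return grid.reshape(height, width, channels)


PATCH = 16  # assumed value, not taken from the PixelDiT release
image = np.random.default_rng(0).random((1024, 1024, 3), dtype=np.float32)
tokens = patchify(image, PATCH)
print(tokens.shape)  # (4096, 768): 64 x 64 patches, each holding 16 * 16 * 3 pixel values
```

Under these assumptions a 1024px image becomes 4,096 tokens for a patch-level transformer, while every token still carries 768 raw pixel values. A latent diffusion model would instead run its transformer on a VAE-compressed grid and rely on the VAE decoder to produce pixels.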
PixelDiT supports multi-aspect-ratio generation at 1024px resolution, so it can produce a range of output shapes. Beyond text-to-image, it also handles image editing tasks. NVIDIA has released the model on Hugging Face (Diffusers), published a ComfyUI version, and open-sourced the code on GitHub. Working in pixel space reduces artifacts common in VAE-based models and simplifies the pipeline for researchers and developers.
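Multi-aspect-ratio generation is often handled by picking from a set of resolution "buckets" that keep the pixel count close to the base resolution. The sketch below illustrates that general idea only; the bucket step of 32, the maximum ratio of 2.0, and both function names are assumptions, not details confirmed for PixelDiT.

```python
import math


def aspect_buckets(base: int = 1024, multiple: int = 32, max_ratio: float = 2.0):
    """Enumerate (width, height) pairs with roughly base * base pixels."""
    target = base * base
    buckets = []
    for width in range(multiple, 2 * base + 1, multiple):
        # Snap the matching height to the nearest multiple of `multiple`.
        height = round(target / width / multiple) * multiple
        if height <= 0:
            continue
        if max(width, height) / min(width, height) <= max_ratio:
            buckets.append((width, height))
    return buckets


def nearest_bucket(width: int, height: int, buckets):
    """Pick the bucket whose aspect ratio is closest (in log space) to the request."""
    wanted = math.log(width / height)
    return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - wanted))


buckets = aspect_buckets()
print(len(buckets), "buckets")
print(nearest_bucket(1920, 1080, buckets))  # a 16:9-like shape near one megapixel
```

Comparing ratios in log space treats a 2:1 landscape and a 1:2 portrait request symmetrically, which a plain difference of ratios would not.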
- 1.3B parameters with VAE-free architecture using dual-level DiT (patch + pixel)
- MM-DiT text-image joint attention and Gemma-2-2B-IT text encoder
- Supports multi-aspect-ratio generation at 1024px and includes image editing capabilities
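The joint text-image attention mentioned above can be sketched as follows. This is a toy, single-head illustration of the MM-DiT pattern as it is commonly described: each modality keeps its own projection weights, and the two token sequences are concatenated for one shared attention step. The dimensions, random weights, and function names are assumptions for illustration, not PixelDiT's actual configuration.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(text, image, w_text, w_image):
    """Single-head joint attention over concatenated text and image tokens."""
    # Each stream has its own fused Q/K/V projection: (n, d) @ (d, 3d) -> (n, 3d).
    q_t, k_t, v_t = np.split(text @ w_text, 3, axis=-1)
    q_i, k_i, v_i = np.split(image @ w_image, 3, axis=-1)
    # Concatenate along the sequence axis so every token can attend to every other.
    q = np.concatenate([q_t, q_i], axis=0)
    k = np.concatenate([k_t, k_i], axis=0)
    v = np.concatenate([v_t, v_i], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v
    n_text = text.shape[0]
    return out[:n_text], out[n_text:]  # split back into the two streams


rng = np.random.default_rng(0)
d = 64                                 # toy model width (assumed)
text = rng.normal(size=(8, d))         # stand-in for projected text-encoder tokens
image = rng.normal(size=(16, d))       # stand-in for image patch tokens
w_text = rng.normal(size=(d, 3 * d)) * d ** -0.5
w_image = rng.normal(size=(d, 3 * d)) * d ** -0.5

text_out, image_out = joint_attention(text, image, w_text, w_image)
print(text_out.shape, image_out.shape)  # (8, 64) (16, 64)
```

Because the attention matrix spans both sequences, a change to the prompt tokens alters the image-token outputs directly, with no separate cross-attention layer.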
Why It Matters
By removing the VAE, PixelDiT aims for higher-fidelity image generation without VAE artifacts, a step toward more direct pixel-space diffusion transformers.