HiDream-O1-Image - an 8B-parameter pixel-space model with no VAE required.
No external VAE or text encoder: HiDream's 8B unified transformer rivals closed-source models.
HiDream-ai introduces O1-Image, a groundbreaking 8B parameter foundation model built on a Pixel-level Unified Transformer (UiT). Unlike conventional generative models that rely on separate VAEs and text encoders, O1-Image natively encodes raw pixels, text, and task-specific conditions in a single shared token space. This end-to-end design eliminates the complexity of disjoint components while enabling direct synthesis up to 2048×2048 with sharp fine-grained detail. A built-in reasoning-driven prompt agent resolves implicit knowledge, layout constraints, and text rendering before generation, improving output fidelity across diverse scenarios.
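To make the "single shared token space" idea concrete, here is a minimal illustrative sketch (not HiDream's actual code; patch size, embedding dimension, and the random stand-in embeddings are all assumptions) of how a pixel-level unified transformer might tokenize raw pixels, text, and task conditions into one sequence, with no separate VAE latent space or external text encoder:

```python
import numpy as np

PATCH = 16  # hypothetical patch size
D = 64      # hypothetical embedding dimension

def patchify(img):
    """Split an HxWx3 image into flattened PATCH x PATCH patches."""
    h, w, c = img.shape
    patches = img.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))               # small stand-in image
proj = rng.random((PATCH * PATCH * 3, D))     # linear projection of raw patches
pixel_tokens = patchify(img) @ proj           # 256 pixel tokens, no VAE latents
text_tokens = rng.random((12, D))             # stand-in embedded prompt tokens
cond_tokens = rng.random((4, D))              # stand-in task-condition tokens

# One shared sequence attended to by a single transformer stack.
sequence = np.concatenate([text_tokens, cond_tokens, pixel_tokens], axis=0)
print(sequence.shape)  # (272, 64): 12 text + 4 condition + 256 pixel tokens
```

The point of the sketch is only the data flow: all three modalities become rows of one token matrix, so a single transformer can model their interactions end to end.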
O1-Image supports multiple tasks within a single architecture: text-to-image, long-text rendering, instruction-based editing, subject-driven personalization, and storyboard generation. Despite its compact 8B parameters, it matches or even surpasses larger open-source DiTs (e.g., Stable Diffusion 3, PixArt-Σ) and leading closed-source models. Two variants are available: a full version that uses 50 inference steps for maximum quality, and a dev version optimized for 28 steps. This efficiency and versatility position O1-Image as a strong contender for production AI pipelines, reducing hardware barriers while maintaining state-of-the-art results.
- Pixel-level Unified Transformer (UiT) processes raw pixels without external VAE or text encoder
- Single architecture handles text-to-image, editing, personalization, and storyboard generation up to 2048×2048
- 8B parameter model matches or exceeds performance of larger open-source DiTs and closed-source models
Why It Matters
HiDream's efficient 8B model lowers hardware barriers for high-quality image generation and multi-task AI workflows.