Ernie and a Complex Composition in One Run (vs. ZIT, Details and Prompt Included)
User tests Ernie's limits with a single prompt containing 8+ unrelated subjects, achieving a cohesive, detailed image.
A viral community test of Baidu's Ernie image generation model shows it handling a highly complex, multi-subject prompt in a single run. The user wrote a prompt with more than eight distinct, unrelated elements, including a passenger at an airport, transparent sport shoes with floral fabric, three different cats, grape-shaped stickers, a large rose, and a faded beach scene blended into the background. Using a standard ComfyUI workflow with basic nodes and the Euler Ancestral sampler, Ernie generated a single, unified 1024x1536 image. The result combines photorealistic main subjects with a watercolor-style background and follows the prompt's detailed instructions for arrangement and aesthetic.
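The reported settings can be written down as a small configuration sketch. The Python snippet below is a minimal illustration, not the user's actual workflow: it places the resolution, sampler, and step count from the test into the node/inputs layout that ComfyUI uses for API-format workflows. The node keys (`latent`, `sampler`), the choice of `EmptyLatentImage`, and the `seed`, `cfg`, `scheduler`, and `denoise` values are assumptions, since the write-up does not state them. The model, text-encoder, and VAE nodes are left out, so this fragment is not loadable as is.

```python
import json

# Settings stated in the write-up.
WIDTH, HEIGHT = 1024, 1536
SAMPLER_NAME = "euler_ancestral"  # ComfyUI's identifier for Euler Ancestral
ERNIE_STEPS = 8


def build_sampler_fragment(steps: int = ERNIE_STEPS, seed: int = 0) -> dict:
    """Return a partial API-style graph with only the latent and sampler nodes."""
    return {
        "latent": {
            # ASSUMPTION: the write-up does not say which latent node was used.
            "class_type": "EmptyLatentImage",
            "inputs": {"width": WIDTH, "height": HEIGHT, "batch_size": 1},
        },
        "sampler": {
            "class_type": "KSampler",
            "inputs": {
                "steps": steps,
                "sampler_name": SAMPLER_NAME,
                "seed": seed,           # ASSUMPTION: placeholder value
                "cfg": 1.0,             # ASSUMPTION: not stated in the source
                "scheduler": "simple",  # ASSUMPTION: not stated in the source
                "denoise": 1.0,         # ASSUMPTION: full denoise for text-to-image
                # Link to output 0 of the "latent" node above.
                "latent_image": ["latent", 0],
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_sampler_fragment(), indent=2))
```

Calling `build_sampler_fragment(steps=9)` gives the same fragment with the step count reported for the ZIT run.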
This test highlights Ernie's scene composition and semantic understanding. Where other models might struggle with or ignore parts of an overloaded prompt, Ernie integrated all requested elements, arranged vertically as specified, into a visually harmonious image. The run was compared against ZIT at similar settings (9 steps for ZIT, 8 for Ernie), and generation speeds were similar, so the comparison comes down to output quality and prompt adherence. The result of this single run suggests that Ernie is robust at parsing long, detailed instructions and keeping disparate visual concepts coherent within one frame, a significant challenge in AI image generation.
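Prompt adherence of this kind can be tallied with a simple checklist. The sketch below is hypothetical and not part of the original test: a reviewer marks each requested element as present or absent, and the function returns the fraction that appeared. The element list is taken from the description above, which names fewer items than the full prompt contained; `adherence_score` is an illustrative helper name. The all-present review for Ernie mirrors the claim that every requested element was integrated. No per-element results are reported for ZIT, so none are shown.

```python
# Elements named in the write-up (a subset of the full 8+ element prompt).
REQUESTED_ELEMENTS = [
    "passenger at an airport",
    "transparent sport shoes with floral fabric",
    "three different cats",
    "grape-shaped stickers",
    "large rose",
    "faded beach scene in the background",
]


def adherence_score(observed: dict[str, bool]) -> float:
    """Fraction of requested elements that a reviewer marked as present."""
    missing = [e for e in REQUESTED_ELEMENTS if e not in observed]
    if missing:
        raise ValueError(f"unreviewed elements: {missing}")
    present = sum(observed[e] for e in REQUESTED_ELEMENTS)
    return present / len(REQUESTED_ELEMENTS)


if __name__ == "__main__":
    # Reflects the article's claim that all requested elements were integrated.
    ernie_review = {element: True for element in REQUESTED_ELEMENTS}
    print(f"Ernie adherence: {adherence_score(ernie_review):.0%}")
```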
- Ernie successfully generated a single image from a prompt with 8+ unrelated subjects, including a passenger, shoes, cats, and decorative elements.
- The test used a standard ComfyUI workflow, generating a 1024x1536 image with a mix of photo-realistic and watercolor styles.
- The model demonstrated strong compositional control, arranging elements vertically as instructed and creating a visually unified scene.
Why It Matters
The test suggests Ernie is competitive in complex, multi-concept image generation, a capability that matters for professional design, advertising, and creative storytelling workflows.