Why did we move away from booru tags?
A Reddit post questions why open source models abandoned precise tags for verbose text.
A Reddit user argues that booru-style tags (comma-separated, unambiguous labels like "1girl, blue eyes, sunset") are better than natural language for describing images to AI models, since listing an image's contents is clearer than subjective prose. They note that most new models rely on massive text encoders that excel at language understanding, but argue that natural language admits too many ways to describe the same image. The same approach could extend to video via timestamped tags. The user asks why the open source community chose natural language over booru-style tagging.
- Booru tags offer deterministic, unambiguous labels (e.g., "1girl, blue eyes") vs. subjective natural language.
- Modern models (CLIP, DALL-E, Stable Diffusion) pair images with large text encoders trained on verbose, context-dependent natural-language captions.
- User proposes timestamped booru tags for video to reduce ambiguity and improve consistency across frames.
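The determinism argument above can be sketched in a few lines. The code below is an illustrative assumption, not an established booru or captioning standard: it canonicalizes a comma-separated tag string (normalize, deduplicate, sort) so the same label set always produces the same text, and renders the post's hypothetical timestamped variant for video. Function names and the `"<seconds>s: <tags>"` format are made up for illustration.

```python
def canonical_tags(raw: str) -> str:
    """Normalize, deduplicate, and sort comma-separated booru-style tags.

    Sorting makes the result order-independent, so one set of labels
    always yields one string -- the determinism the post values.
    """
    tags = {t.strip().lower().replace(" ", "_") for t in raw.split(",") if t.strip()}
    return ", ".join(sorted(tags))


def timestamped_tags(events: list[tuple[float, str]]) -> list[str]:
    """Render (seconds, tags) pairs as per-timestamp tag lines for video."""
    return [f"{t:.1f}s: {canonical_tags(tags)}" for t, tags in sorted(events)]


print(canonical_tags("sunset, 1girl,  Blue Eyes, 1girl"))
# -> "1girl, blue_eyes, sunset"
print(timestamped_tags([(2.5, "1girl, running"), (0.0, "1girl, sunset")]))
# -> ["0.0s: 1girl, sunset", "2.5s: 1girl, running"]
```

The contrast with natural language is that no such canonical form exists for free-form prose: "a girl at sunset" and "sunset scene with one girl" are distinct strings describing the same labels.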
Why It Matters
Captioning conventions directly shape training-data quality and the consistency of downstream image and video generation.