Why did we move away from booru tags?
A Reddit post questions why open source models abandoned precise tags for verbose text.
A Reddit user argues that booru-style tags (comma-separated, unambiguous labels like "1girl, blue eyes, sunset") are better than natural language for describing images to AI models, since listing an image's contents is clearer than subjective prose. They note that most new models rely on massive text encoders that excel at language understanding, but argue that natural language admits too many ways to describe the same image. The same approach could extend to video via timestamped tags. The user asks why the open source community chose natural language over booru-style tagging.
- Booru tags offer deterministic, unambiguous labels (e.g., "1girl, blue eyes") vs. subjective natural language.
- Modern models (CLIP, DALL-E, Stable Diffusion) pair images with large text encoders trained on verbose, context-dependent natural-language captions.
- User proposes timestamped booru tags for video to reduce ambiguity and improve consistency across frames.
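The determinism argument above can be sketched in a few lines. The code below is an illustrative assumption, not an established booru or captioning standard: it canonicalizes a comma-separated tag string (normalize, deduplicate, sort) so the same label set always produces the same text, and renders the post's hypothetical timestamped variant for video. Function names and the `"<seconds>s: <tags>"` format are made up for illustration.

```python
def canonical_tags(raw: str) -> str:
    """Normalize, deduplicate, and sort comma-separated booru-style tags.

    Sorting makes the result order-independent, so one set of labels
    always yields one string -- the determinism the post values.
    """
    tags = {t.strip().lower().replace(" ", "_") for t in raw.split(",") if t.strip()}
    return ", ".join(sorted(tags))


def timestamped_tags(events: list[tuple[float, str]]) -> list[str]:
    """Render (seconds, tags) pairs as per-timestamp tag lines for video."""
    return [f"{t:.1f}s: {canonical_tags(tags)}" for t, tags in sorted(events)]


print(canonical_tags("sunset, 1girl,  Blue Eyes, 1girl"))
# -> "1girl, blue_eyes, sunset"
print(timestamped_tags([(2.5, "1girl, running"), (0.0, "1girl, sunset")]))
# -> ["0.0s: 1girl, sunset", "2.5s: 1girl, running"]
```

The contrast with natural language is that no such canonical form exists for free-form prose: "a girl at sunset" and "sunset scene with one girl" are distinct strings describing the same labels.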
Why It Matters
Captioning conventions directly shape training-data quality and the consistency of downstream image and video generation.