Image & Video

Trained a ViT model from scratch for auto-tagging

Fixed 300k bad tags and filled 1M missing tags using SmilingWolf v3 for better classification

Deep Dive

A developer known as Grio43 has released OppaiOracle, a Vision Transformer (ViT) model trained from scratch for automatic tagging of anime images. To prepare the dataset, the creator used SmilingWolf v3, a pre-existing tagging tool, to correct roughly 300,000 inaccurate tags and generate approximately 1 million missing tags. Additionally, a baseline model was trained to identify and incorporate around 30,000 low-frequency tags that would otherwise be overlooked. The result is a V1 model operating at 320x320 resolution, with V1.1 currently training at 448x448 — a resolution bump already yielding noticeable gains in tagging precision.
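Auto-taggers of this kind are typically multi-label classifiers: the ViT emits one logit per tag, each logit is passed through an independent sigmoid, and tags above a confidence threshold are kept. A minimal sketch of that post-processing step, with hypothetical tag names and a made-up threshold (the actual OppaiOracle head and defaults are not documented here):

```python
import numpy as np

def logits_to_tags(logits, tag_names, threshold=0.5):
    """Map per-tag logits to (tag, confidence) pairs via independent sigmoids.

    Hypothetical sketch of multi-label tagger post-processing; the real
    model's threshold and tag vocabulary may differ.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    return [(name, float(p)) for name, p in zip(tag_names, probs) if p >= threshold]

# Example with three hypothetical tag logits: a strong positive, a strong
# negative, and a borderline case just above the threshold.
tags = logits_to_tags([3.0, -2.0, 0.1], ["1girl", "outdoors", "smile"])
```

Because each tag gets its own sigmoid rather than a shared softmax, any number of tags can fire on one image, which is what an image-tagging task requires.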

OppaiOracle is fully open-source and hosted on HuggingFace, including a demo space and a separate CPU-based tagger for users without GPU access. The developer also provides a self-hosted web interface for local deployment. Future work aims to compile a clean 2025 dataset, retrain from scratch with structured vocab formats (e.g., artist:name), and resolve standalone installation issues for general users. This project fills a niche for high-quality, community-driven image tagging in the anime space.

Key Points
  • Used SmilingWolf v3 to fix 300k bad tags and fill 1M missing tags in the training dataset
  • Current V1 model runs at 320x320; V1.1 at 448x448 shows improved accuracy
  • Future plans include a 2025 dataset with structured vocabularies like 'artist:name'
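The planned structured vocabulary prefixes tags with a namespace, so 'artist:name' separates the tag's category from its value. A small sketch of how such namespaced tags might be parsed, assuming a colon delimiter and a default namespace for plain tags (both assumptions; the project has not published its final format):

```python
def parse_tag(tag: str) -> tuple[str, str]:
    """Split a namespaced tag like 'artist:name' into (namespace, value).

    Hypothetical sketch: plain tags fall back to a 'general' namespace.
    """
    ns, sep, value = tag.partition(":")
    return (ns, value) if sep else ("general", tag)

structured = parse_tag("artist:some_artist")
plain = parse_tag("smile")
```

Namespacing like this lets a retrained model or a downstream curator filter tags by category (artist, character, general) without maintaining separate tag lists.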

Why It Matters

Open-source anime tagging model enables high-quality auto-labeling for fans, researchers, and curators.