Danbooru Dataset Filter: Fast local metadata-based search across 10M+ images for LoRA/Checkpoint training
A new desktop tool searches 10 million Danbooru images in seconds, letting creators build precise datasets for LoRA and checkpoint training.
Developer ThetaCursed has released the Danbooru Dataset Filter, an open-source desktop tool designed to solve a major bottleneck in AI image model training: curating high-quality datasets from massive image repositories. The tool works locally with the Danbooru 2025/2026 metadata collections—Parquet-based databases containing tags, ratings, scores, and direct links for over 10 million anime-style images. By processing queries on the user's own machine, it bypasses the rate limits and speed caps of web APIs, enabling searches across the entire dataset in seconds instead of hours or days.
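The speed win described above comes from scanning metadata locally instead of paging through a rate-limited web API. A minimal sketch of such a local tag query, in plain Python, might look like the following; the record fields (`tag_string`, `score`) mirror Danbooru's API naming but are assumptions about the Parquet schema, not confirmed details of the tool, and a small in-memory list stands in for the 10M-row dataset that would really be read from the Parquet files.

```python
# Hypothetical metadata records; in the real tool these rows would be loaded
# from the Parquet files (e.g. via pandas/pyarrow) rather than hard-coded.
posts = [
    {"id": 1, "tag_string": "1girl solo long_hair", "score": 120},
    {"id": 2, "tag_string": "1girl 2boys short_hair", "score": 15},
    {"id": 3, "tag_string": "scenery no_humans", "score": 300},
]

def matches(post, include, exclude):
    """True if the post has every included tag and none of the excluded ones."""
    tags = set(post["tag_string"].split())
    return include <= tags and not (exclude & tags)

# Inclusion/exclusion filtering, as in the tool's smart-tagging feature.
hits = [p["id"] for p in posts if matches(p, include={"1girl"}, exclude={"2boys"})]
print(hits)  # -> [1]
```

Because the whole scan is a local set operation per row, filtering millions of records takes seconds on commodity hardware, with no per-request network latency or API quota involved.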
Users can apply sophisticated filters to build precise datasets for training LoRA (Low-Rank Adaptation) models or full model checkpoints for Stable Diffusion. Key features include smart tagging with inclusion/exclusion lists and autocomplete, quality filtering based on community scores and favorites, and filtering by content rating and image orientation (landscape, portrait, square). The tool also includes MD5-based deduplication to prevent model overfitting and a 'time travel' feature to filter images by upload date. Once a selection is made, it calculates the total file size and exports a simple .txt list of direct image URLs ready for any bulk downloader, streamlining the entire workflow from search to dataset creation.
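The filter-and-export pipeline described above can be sketched end to end. Everything here is illustrative: the field names (`md5`, `rating`, `file_size`, `file_url`, `uploaded`), the thresholds, and the example URLs are assumptions for the sketch, not the tool's actual schema or defaults.

```python
from datetime import date

# Hypothetical records standing in for rows from the metadata files.
posts = [
    {"md5": "aaa", "score": 90, "rating": "g", "width": 1920, "height": 1080,
     "uploaded": date(2025, 3, 1), "file_size": 400_000,
     "file_url": "https://example.com/aaa.png"},
    {"md5": "aaa", "score": 88, "rating": "g", "width": 1920, "height": 1080,
     "uploaded": date(2025, 4, 2), "file_size": 400_000,
     "file_url": "https://example.com/aaa_dupe.png"},  # duplicate upload
    {"md5": "bbb", "score": 20, "rating": "e", "width": 800, "height": 1200,
     "uploaded": date(2024, 1, 5), "file_size": 250_000,
     "file_url": "https://example.com/bbb.png"},
]

def orientation(width, height):
    if width > height:
        return "landscape"
    if width < height:
        return "portrait"
    return "square"

# Quality, rating, orientation, and upload-date ("time travel") filters.
selected = [
    p for p in posts
    if p["score"] >= 50
    and p["rating"] == "g"
    and orientation(p["width"], p["height"]) == "landscape"
    and p["uploaded"] >= date(2025, 1, 1)
]

# MD5-based deduplication: keep only the first post seen per hash.
seen, unique = set(), []
for p in selected:
    if p["md5"] not in seen:
        seen.add(p["md5"])
        unique.append(p)

# Total download size plus a newline-separated URL list ready for a .txt export.
total_mb = sum(p["file_size"] for p in unique) / 1_000_000
url_list = "\n".join(p["file_url"] for p in unique)
print(f"{len(unique)} images, {total_mb:.1f} MB")  # -> 1 images, 0.4 MB
```

Writing `url_list` to a plain text file is all a bulk downloader needs, which is why the export step is deliberately format-agnostic.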
- Searches 10+ million records from Danbooru 2025/2026 metadata locally in seconds, bypassing slow web APIs.
- Enables precise filtering by tags, quality scores, ratings, orientation, and upload date with built-in deduplication.
- Exports a .txt file of direct image URLs for bulk downloading, creating ready-to-use datasets for LoRA/checkpoint training.
Why It Matters
This tool sharply cuts the time cost and lowers the technical barrier for AI artists and researchers building high-quality, targeted training datasets, accelerating model development.