Dataset of 150k+ stool images and not sure how to fully use it [D]
Manual verification of 150K stool images: is this the right ML workflow?
A Reddit user managing a dataset of 150K stool images is seeking validation (and criticism) of their current machine learning pipeline. Their process: start with 5K images individually verified by a human for Bristol stool type, consistency, color, and indicators of mucus or blood. After training an initial model on these clean annotations, they repeat the cycle: human review of new images, correction of labels, then retraining on the expanded set. This is time-intensive but ensures high-quality ground truth.
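A toy version of that loop, using scikit-learn on synthetic data so it runs end to end; the post names no framework, and the label oracle below stands in for the human reviewer:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: ~2K "images" as feature vectors, true labels known.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = list(range(100))        # plays the role of the 5K verified images
pool = list(range(100, 2000))     # plays the role of the unreviewed remainder

for _ in range(5):                # a few review-and-retrain rounds
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    batch, pool = pool[:200], pool[200:]
    # In the real workflow a human corrects the model's proposed labels here;
    # in this toy, indexing into y is the "human" supplying ground truth.
    labeled += batch

print("accuracy on remaining pool:", model.score(X[pool], y[pool]))
```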
From a best-practices perspective, this iterative human-in-the-loop approach is common in specialized domains (e.g., medical imaging, defect detection) where label noise is costly. However, scaling beyond 5K demands automation: active learning to prioritize uncertain samples, self-training with confidence thresholds, and weak supervision from pre-trained models. The user’s current method is defensible but not scalable. For 150K images, they should consider semi-supervised learning (e.g., Noisy Student) or using a smaller, verified set to train a classifier that flags potential label errors for human review. The key trade-off is between label quality and annotation cost.
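For the active-learning piece, a minimal entropy-based uncertainty-sampling sketch; the array shapes and the seven-class Bristol output head are illustrative assumptions, since the post specifies no model:

```python
import numpy as np

def select_for_review(probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k most uncertain samples by predictive entropy.

    probs: (n_samples, n_classes) softmax outputs from the current model.
    High-entropy predictions are where a human label buys the most
    information, so those images jump the review queue.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-k:][::-1]  # most uncertain first

# Illustrative: fake scores over the 7 Bristol types for the unlabeled pool.
probs = np.random.dirichlet(np.ones(7), size=145_000)
review_queue = select_for_review(probs, k=1_000)
```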
- Dataset of 150K stool images with manual verification of 5K images for Bristol type, consistency, color, and blood/mucus indicators
- Current workflow: human-in-the-loop iterative training with re-annotation before each model update
- More scalable approach: active learning to focus human review on high-uncertainty samples, combined with semi-supervised self-training (e.g., Noisy Student) to auto-label high-confidence ones (see the sketch after this list)
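For the semi-supervised piece, a confidence-threshold router in the Noisy Student spirit, stripped of the noise injection and teacher-student iteration; the threshold and names are illustrative:

```python
import numpy as np

def split_by_confidence(probs: np.ndarray, threshold: float = 0.95):
    """Route unlabeled samples: pseudo-label the confident ones, queue the rest.

    Samples whose top softmax score clears the threshold get their predicted
    class as a pseudo-label and feed back into training; everything below the
    threshold goes to human review instead.
    """
    confidence = probs.max(axis=1)
    confident = confidence >= threshold
    pseudo_labels = probs.argmax(axis=1)[confident]
    return pseudo_labels, np.flatnonzero(confident), np.flatnonzero(~confident)

# Illustrative: peaky fake scores over the 7 Bristol types.
probs = np.random.dirichlet(0.2 * np.ones(7), size=10_000)
labels, auto_idx, review_idx = split_by_confidence(probs)
print(f"pseudo-labeled: {len(auto_idx)}  sent to human review: {len(review_idx)}")
```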
Why It Matters
Scaling high-quality annotation in niche CV applications means balancing manual verification, which keeps label noise down, against automated techniques, which keep annotation cost down.