[D] Asymmetric consensus thresholds for multi-annotator NER — valid approach or methodological smell?
A widely shared Reddit post highlights a methodological pitfall in building NER training data from multiple automated annotators.
A researcher training a Spanish legal NER model found a "cliff effect": applying a uniform consensus threshold across multiple LLM annotators wiped out entire entity categories such as DATE and ADDRESS. In their data, DATE entities dropped to zero at a ≥3-vote threshold and retained only 8.8% at ≥2. This forces a choice between asymmetric, category-specific thresholds and investing in more specialized annotators, and it challenges the standard uniform-threshold methodology for building reliable training datasets.
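The trade-off described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the function name, span format, and threshold values are assumptions, not taken from the original post): each annotator votes for `(start, end, label)` spans, and a span survives only if its vote count meets the threshold configured for its category.

```python
from collections import Counter

def consensus_spans(annotations, thresholds, default=3):
    """Keep a span if its vote count meets the threshold for its label.

    annotations: list of per-annotator span lists, spans as (start, end, label).
    thresholds:  dict mapping label -> minimum votes; `default` applies otherwise.
    """
    votes = Counter()
    for annotator_spans in annotations:
        for span in set(annotator_spans):  # at most one vote per annotator per span
            votes[span] += 1
    return {
        span for span, n in votes.items()
        if n >= thresholds.get(span[2], default)
    }

# Toy example with three annotators: DATE boundaries rarely match exactly,
# so a uniform >=3 threshold eliminates the category entirely.
ann = [
    [(0, 10, "PERSON"), (15, 25, "DATE")],
    [(0, 10, "PERSON"), (15, 24, "DATE")],
    [(0, 10, "PERSON")],
]
print(consensus_spans(ann, {}, default=3))
# → {(0, 10, 'PERSON')}  — every DATE span is dropped
print(consensus_spans(ann, {"DATE": 1}, default=3))
# → PERSON plus both DATE variants survive at the lower threshold
```

Note the cost of the asymmetric setting: lowering the DATE threshold admits both conflicting boundary variants, so a downstream merge or adjudication step is still needed for low-agreement categories.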
Why It Matters
Uniform consensus filtering can silently bias training data, causing models to systematically miss the very categories (dates, addresses) on which annotators disagree most.