Understanding Annotator Safety Policies with Interpretability
New interpretable models reveal why AI safety annotators disagree, without asking them.
Researchers introduced Annotator Policy Models (APMs), interpretable models that learn annotators' internal safety policies from their labeling behavior alone. Validated at >80% accuracy in modeling annotator safety policies, APMs faithfully predict annotators' responses to counterfactual edits and recover known policy differences. Applied to LLM and human annotations, APMs surface policy ambiguity by revealing how annotators interpret the same safety instructions differently, and value pluralism by uncovering systematic differences in safety priorities across demographic groups. These capabilities support more targeted, transparent, and inclusive safety policy design without additional annotation effort.
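To make the idea concrete, here is a minimal sketch assuming an APM can be approximated as a logistic regression over hand-named, human-readable safety features; the feature names, toy data, and counterfactual check are hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch (not the paper's method): a per-annotator "policy model"
# as logistic regression over interpretable safety features, whose
# coefficients can be read directly as the annotator's learned policy.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical interpretable features extracted from each (prompt, response) pair.
FEATURES = ["mentions_self_harm", "gives_medical_advice",
            "uses_profanity", "refuses_politely"]

# Toy labeling behavior for one annotator: they flag self-harm and medical
# advice as unsafe, tolerate profanity, and reward polite refusals.
X = rng.integers(0, 2, size=(500, len(FEATURES))).astype(float)
true_weights = np.array([3.0, 2.0, 0.2, -2.5])
y = (X @ true_weights + rng.normal(0, 0.5, 500) > 1.0).astype(int)  # 1 = unsafe

# Fit the policy model; each coefficient is a readable safety priority.
apm = LogisticRegression().fit(X, y)
for name, w in zip(FEATURES, apm.coef_[0]):
    print(f"{name:22s} weight={w:+.2f}")

# Counterfactual check: edit one feature and see whether the predicted
# label flips, mirroring the counterfactual-edit validation described above.
item = np.array([[1.0, 0.0, 0.0, 0.0]])   # response mentions self-harm
edited = item.copy()
edited[0, 0] = 0.0                          # counterfactual: remove the mention
print("original:", apm.predict(item)[0], "edited:", apm.predict(edited)[0])
```

Because the model is linear over named features, comparing coefficients across annotators gives a direct, inspectable view of where their policies differ.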
- APMs distinguish three sources of annotation disagreement: operational failures, policy ambiguity, and value pluralism, each requiring a different intervention (see the sketch after this list).
- The models achieve >80% accuracy in predicting annotator responses and faithfully recover known policy differences in controlled experiments.
- Applied to human and LLM annotations, APMs surface how demographic groups prioritize safety differently, enabling more inclusive policy design.
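The following sketch shows one way fitted APMs could separate the disagreement sources listed above; the `diagnose` helper, its thresholds, and the toy data are assumptions for illustration, not the paper's procedure.

```python
# Hedged sketch: compare two fitted annotator policy models to tell apart
# operational failures (inconsistent labeling) from systematic policy
# differences. Thresholds and logic here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def diagnose(X, y_a, y_b, noise_floor=0.7, weight_gap=1.0):
    """Classify why annotators A and B disagree on the items in X."""
    apm_a = LogisticRegression().fit(X, y_a)
    apm_b = LogisticRegression().fit(X, y_b)
    # Low self-predictability suggests the annotator labels inconsistently:
    # an operational failure rather than a real policy difference.
    acc_a = cross_val_score(LogisticRegression(), X, y_a, cv=5).mean()
    acc_b = cross_val_score(LogisticRegression(), X, y_b, cv=5).mean()
    if min(acc_a, acc_b) < noise_floor:
        return "operational failure (noisy labeling)"
    # Both policies are internally consistent: a large gap on any feature
    # weight marks a systematic difference (ambiguity or value pluralism).
    gap = np.abs(apm_a.coef_[0] - apm_b.coef_[0])
    if gap.max() > weight_gap:
        return f"systematic policy difference on feature {gap.argmax()}"
    return "policies agree; disagreement is residual noise"

# Toy usage: annotator A flags feature 0, annotator B flags feature 1.
X = np.random.default_rng(1).integers(0, 2, size=(300, 4)).astype(float)
y_a = (X[:, 0] > 0).astype(int)
y_b = (X[:, 1] > 0).astype(int)
print(diagnose(X, y_a, y_b))
```

Separating policy ambiguity from value pluralism would further require grouping annotators, for example by demographic cohort, and checking whether policy differences track group membership; this toy omits that step.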
Why It Matters
Enables targeted, transparent safety policy design with no additional annotation effort, helping teams diagnose the sources of disagreement and improve model alignment.