Research & Papers

Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

Researchers propose the Defensibility Index to fix flawed content moderation metrics.

Deep Dive

In a new preprint, researchers Michael O'Herlihy and Rosa Català expose a critical flaw in how content moderation AI is evaluated. They identify the 'Agreement Trap,' where systems are judged by how well they match human labels, even when multiple decisions are logically valid under the governing policy. This penalizes correct decisions and mislabels ambiguity as error.
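
The trap is easiest to see with a toy example. Below is a minimal sketch (hypothetical data and helper names, not the paper's code) in which every model decision is valid under the policy, yet label agreement still reports an error on the ambiguous case.

```python
# Hypothetical illustration of the Agreement Trap. A borderline post
# admits two policy-valid outcomes, but label agreement scores one of
# them as a mistake.

posts = [
    # (model_decision, human_label, decisions_defensible_under_policy)
    ("remove", "remove", {"remove"}),          # clear violation
    ("remove", "keep",   {"keep", "remove"}),  # genuinely ambiguous case
    ("keep",   "keep",   {"keep"}),            # clearly benign
]

# Agreement metric: did the model match the historical human label?
agreement = sum(m == h for m, h, _ in posts) / len(posts)

# Defensibility metric: was the decision derivable from the policy?
defensibility = sum(m in valid for m, _, valid in posts) / len(posts)

print(f"agreement:     {agreement:.2f}")      # 0.67 -> ambiguity counted as error
print(f"defensibility: {defensibility:.2f}")  # 1.00 -> every decision was valid
```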

To solve this, they propose the Defensibility Index (DI) and Ambiguity Index (AI), which measure whether a decision is logically derivable from the governing rules rather than whether it matches a historical label. They also introduce the Probabilistic Defensibility Signal (PDS), which uses LLM token logprobs to estimate reasoning stability without extra audit passes. Validated on 193,000+ Reddit decisions, the framework found a 33-46.6 percentage-point gap between agreement and policy-grounded metrics, with up to 80.6% of apparent false negatives turning out to be policy-grounded decisions. A 'Governance Gate' built on these signals achieved 78.6% automation coverage with 64.9% risk reduction. The work shifts AI evaluation from historical label matching to reasoning-grounded validity.
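
To make the PDS-to-gate pipeline concrete, here is a minimal sketch assuming PDS is computed as the mean per-token probability of the verdict tokens; the exact formula, the function names, and the 0.9 threshold are illustrative assumptions, not the paper's specification.

```python
# Sketch of a logprob-based defensibility signal feeding a governance
# gate. All names and values are hypothetical, for illustration only.
import math

def pds(decision_token_logprobs):
    """Mean per-token probability of the model's verdict tokens.

    High values mean the model emitted its decision with little
    competition from alternative tokens -- a proxy for reasoning
    stability that needs no extra audit pass.
    """
    avg_logprob = sum(decision_token_logprobs) / len(decision_token_logprobs)
    return math.exp(avg_logprob)

def governance_gate(decision, signal, threshold=0.9):
    """Auto-apply confident decisions; route unstable ones to humans."""
    return decision if signal >= threshold else "escalate_to_human"

# Logprobs for the tokens spelling out a "remove" verdict.
stable   = pds([-0.01, -0.03, -0.02])   # ~0.98 -> automated
unstable = pds([-0.40, -1.10, -0.70])   # ~0.48 -> escalated
print(governance_gate("remove", stable))
print(governance_gate("remove", unstable))
```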

Key Points
  • Defensibility Index (DI) and Ambiguity Index (AI) replace flawed agreement metrics for rule-governed AI evaluation.
  • Tested on 193,000+ Reddit decisions, revealing a 33-46.6 percentage-point gap between agreement and policy-grounded metrics.
  • Governance Gate achieves 78.6% automation coverage with 64.9% risk reduction using PDS signals.

Why It Matters

This shifts AI content moderation from matching historical human labels to reasoning under explicit rules, so defensible decisions on ambiguous content are no longer miscounted as errors.