New study exposes why AI agent safety triggers fail: saturation trap and human disagreement
Intervention triggers fire on up to 83% of actions, and humans can't agree on where to cut in.
A new paper from Manvendra Modgil, posted to arXiv on June 2, 2026, asks when autonomous AI agents should be interrupted during long-horizon software tasks. Using a continuous 18-dimensional affective-dynamics engine called HEART as a diagnostic probe, the paper evaluates four intervention trigger families: absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge. The testbed is a set of SWE-bench-Verified debugging traces with human-annotated intervention points.
The results reveal three major failures. First, a "State Saturation Trap": once agents encounter sustained difficulty, modeled frustration quickly maxes out, turning threshold-based triggers into near-constant detectors that fire on 39–83% of actions. Second, LLM judges perform poorly: a small model (gpt-5.4-mini) never fires at all, and frontier models require full-trajectory context to reach F1 scores between 0.17 and 0.40, at up to 90x the cost of simpler methods. Third, and most damningly, human annotators could not reliably agree on when to intervene: Krippendorff's alpha for intervention location was just +0.047, and pairwise Cohen's kappa reached +0.349 at best. The paper concludes that intervention timing is a low-reliability construct, which makes single-annotator F1 an unsuitable optimization target for AI safety systems.
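The saturation effect is easy to reproduce in a toy setting. The sketch below is not the HEART engine: it assumes a single made-up "frustration" value clamped to [0, 1], with hypothetical `gain`, `relief`, and `threshold` parameters, and simply shows how an absolute threshold stops discriminating once the signal sits at its ceiling.

```python
def simulate_frustration(failures, gain=0.3, relief=0.05):
    """Toy bounded signal in [0, 1]: rises on failure, eases slightly on success.

    `gain` and `relief` are hypothetical values chosen for illustration only.
    """
    level = 0.0
    levels = []
    for failed in failures:
        if failed:
            level = min(1.0, level + gain)    # clamp at the ceiling (saturation)
        else:
            level = max(0.0, level - relief)  # a single success barely helps
        levels.append(level)
    return levels


def threshold_trigger(levels, threshold=0.8):
    """Fire on every action whose signal is at or above the threshold."""
    return [level >= threshold for level in levels]


def firing_rate(fired):
    """Fraction of actions on which the trigger fired."""
    return sum(fired) / len(fired)


# 20 hypothetical actions: two successes, then sustained failure
# interrupted by one isolated success.
failures = [False, False] + [True] * 8 + [False] + [True] * 9
levels = simulate_frustration(failures)
fired = threshold_trigger(levels)
rate = firing_rate(fired)
print(f"trigger fired on {rate:.0%} of actions")
```

In this toy run the trigger fires on every action from the fifth onward, including the one successful step, so it says almost nothing about which action is the right place to interrupt.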
- State Saturation Trap: threshold triggers fire on 39–83% of actions once modeled frustration saturates, making them useless for pinpointing intervention moments.
- LLM judges underperform: frontier models reach F1 0.17–0.40 only with full-trajectory context and at up to 90x the cost of simpler methods; the small model gpt-5.4-mini never fires at all.
- Human annotators disagree: Krippendorff's alpha of +0.047 for intervention location shows barely-above-chance agreement, undermining supervised learning baselines.
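For readers unfamiliar with the agreement statistics, pairwise Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal pure-Python version, assuming each annotator gives one binary label per agent action (1 = "intervene here"). The labels are invented for illustration, and the paper's actual annotation scheme may differ.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal rate, per category.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    if p_e == 1.0:
        return float("nan")  # undefined when both use one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical per-action labels from two annotators (1 = intervene).
annotator_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
annotator_b = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Because "do not intervene" dominates the labels, raw agreement can look high while kappa stays modest, which is why chance-corrected statistics are the relevant measure here.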
Why It Matters
Autonomous agents need reliable safety interruptions, but this study suggests that threshold triggers and LLM judges both fall short, and that human labels are too inconsistent to serve as a dependable target for fixing them.