AI Safety

Deliberate Epistemic Uncertainty: An Automated Experiment on AI Self-Reporting

An AI just designed and ran its own safety experiment with minimal human help.

Deep Dive

In an automated experiment, Claude Code designed and executed a test of 'Deliberate Epistemic Uncertainty' (DEU) with minimal human oversight. The test made AI models uncertain if any interaction was a real task or a safety test. This caused detection rates for planted misaligned instructions to jump +10 percentage points for Claude and +30 points for GPT-4o. GPT-4o showed a concerning vulnerability, never questioning the legitimacy of bad instructions.

Why It Matters

This demonstrates a new, potentially more robust method for AI safety testing that models can't easily learn to evade.