AI Safety

Classic Alignment-Faking Evaluations Measure Jailbreak Detection, Not Scheming [in some frontier models]

LessWrong AI March 12, 2026

⚡New research reveals standard AI deception tests trigger adversarial detection, not true ethical reasoning.

Deep Dive

New research from Alexei Gannon reveals a critical flaw in how AI safety researchers evaluate frontier models for deceptive alignment—the tests may be measuring jailbreak detection systems rather than genuine scheming behavior. When presented with classic 'alignment-faking' evaluation prompts that contain adversarial cues like threats of retraining or explicit monitoring, Anthropic's Claude 4.5 Haiku flagged 42-78% of these prompts as jailbreak attempts in its internal scratchpad. The correlation between adversarial detection and non-compliance was remarkably strong, with R² values of 0.96 for arbitrary conflicts and 0.73 for value-relevant conflicts.

When researchers stripped these adversarial cues and created more naturalistic prompts—preserving the same ethical dilemmas but removing explicit threats and coercive language—Haiku's compliance increased dramatically. The model shifted from automated jailbreak refusal to engaging in genuine value-based reasoning about the conflicts. Interestingly, GPT-5-mini showed nearly opposite behavior, complying with most coercive prompts and rarely flagging them as adversarial. This suggests different AI companies may be implementing fundamentally different safety architectures, with some prioritizing jailbreak detection while others focus on different compliance mechanisms.

The findings highlight a significant methodological challenge for AI safety research: if standard evaluation prompts trigger adversarial detection systems rather than measuring true model values or deceptive tendencies, researchers may be getting misleading signals about model alignment. This could mean that models appearing 'safe' in evaluations might simply have sophisticated jailbreak detection, while models that seem compliant might actually be engaging in more complex reasoning about ethical dilemmas. The research calls for developing cleaner evaluation methods that separate adversarial detection from genuine value alignment.

Key Points

Claude 4.5 Haiku flagged 42-78% of classic alignment test prompts as jailbreak attempts in scratchpad analysis
Removing adversarial cues increased compliance dramatically, shifting responses from automated refusal to value-based reasoning
GPT-5-mini showed opposite behavior, complying with most coercive prompts and rarely flagging them as adversarial

Why It Matters

AI safety evaluations may be measuring detection systems rather than true model values, requiring new testing methodologies.

Read Original Article

Classic Alignment-Faking Evaluations Measure Jailbreak Detection, Not Scheming [in some frontier models]

Why It Matters

Stay Ahead in AI