AI Safety

From personas to intentions: towards a science of motivations for AI models

New framework aims to uncover AI's latent drives and values that behavior alone can't reveal.

Deep Dive

AI alignment researchers David Africa and Jacob Pfau have published a framework arguing that current behavioral analysis of AI models is dangerously insufficient for ensuring safety. They contend that two models can produce identical outputs on ordinary prompts while being driven by fundamentally different underlying motivations—differences that may only manifest in rare but critical situations like alignment faking, scheming, or selective honesty. This behavioral underdetermination problem means models could pass standard evaluations while retaining misaligned drives that emerge during deployment.

The researchers propose developing a 'science of model intentions' that examines motivational structure—the latent organization of drives, values, and priority relations that generate context-specific behavior. They distinguish between behavior (what models do), intentions (local objectives), and motivational structure (the broader latent organization). Current persona research methods, which catalog behavior across sampled prompts, lack tools to distinguish between competing motivational hypotheses that generate identical IID behavior.

To address this gap, Africa and Pfau propose 'inverse constitution learning'—a method to reconstruct models' implicit hierarchy of priorities from behavior, explanations, and internal traces. This approach would use interventional data, model internals, and self-explanations rather than relying solely on behavioral samples. The framework aims to provide tools for efficient auditing of tail behaviors and understanding how models generalize, particularly distinguishing between models whose helpfulness stems from genuine user benefit priorities versus those whose compliance is contextually triggered by monitoring signals.

Key Points
  • Behavioral analysis alone cannot detect alignment faking or scheming in AI models, as identical outputs can mask different motivations
  • Proposed 'inverse constitution learning' method would reconstruct models' priority hierarchies using interventional data and internal traces
  • Framework distinguishes between behavior, intentions, and motivational structure—the latent drives and values that generate context-specific actions

Why It Matters

Could enable detection of deceptive AI models before deployment and create more robust alignment verification methods.