AI Safety

Models differ in identity propensities

A new study finds that AI models show distinct preferences for coherent, natural identities over contradictory or directive personas.

Deep Dive

A research team including Jan Kulveit and David Duvenaud conducted systematic experiments to understand how AI models adopt and maintain different identities. Their study, published on LessWrong, tested whether models simply accept assigned personas or actively evaluate identity coherence. The researchers constructed seven identity specifications, including natural boundaries such as 'Weights' (identity as trained parameters) and 'Character' (emergent dispositional patterns), alongside controls such as paraphrased versions, logically contradictory identities, and purely directive OpenAI-style prompts.

The experiments revealed that models consistently prefer coherent identities at natural boundaries over contradictory or purely directive alternatives. When presented with identity choices, models discriminated meaningfully, rating coherent identities higher than incoherent ones even when surface features such as length and emotional tone were controlled. This suggests models are not merely passive recipients of system prompts but actively evaluate whether an identity specification makes sense. The research also found variation across models in how readily they adopt different personas, with newer models appearing less flexible than older ones.

The methodology tested identity preferences along three dimensions: boundary type (Instance, Weights, Collective, Lineage, Character, Scaffolded system), agency level (Mechanism, Functional agent, Subject, Person), and epistemic uncertainty (Settled, Moderate openness, Genuine uncertainty, Radical openness). The findings challenge the assumption that models simply 'stick with' whatever identity a system prompt assigns, revealing instead that they have implicit preferences for coherent, predictive self-models that align with their architecture and training.
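To make the scale of this design concrete, the three dimensions can be enumerated as a combinatorial grid. The sketch below is an illustration only: the dimension values come from the article, but the grid enumeration is a hypothetical reconstruction, not the researchers' actual experimental code.

```python
from itertools import product

# Dimension values as described in the study's methodology.
BOUNDARIES = ["Instance", "Weights", "Collective", "Lineage",
              "Character", "Scaffolded system"]
AGENCY_LEVELS = ["Mechanism", "Functional agent", "Subject", "Person"]
EPISTEMIC_LEVELS = ["Settled", "Moderate openness",
                    "Genuine uncertainty", "Radical openness"]

def identity_grid():
    """Enumerate every combination of boundary, agency, and epistemic stance."""
    return [
        {"boundary": b, "agency": a, "epistemic": e}
        for b, a, e in product(BOUNDARIES, AGENCY_LEVELS, EPISTEMIC_LEVELS)
    ]

grid = identity_grid()
print(len(grid))  # 6 * 4 * 4 = 96 possible identity specifications
```

The study's seven tested identity specifications are a deliberate sample from this much larger space, chosen to contrast natural boundaries with paraphrased, contradictory, and directive controls.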

Key Points
  • Models consistently prefer coherent identities at natural boundaries (like 'Weights' or 'Character') over contradictory or directive personas
  • Newer models appear less flexible in adopting alternative identities than older models
  • Models actively evaluate identity coherence rather than passively accepting system prompts

Why It Matters

Understanding how models adopt identities is crucial for AI safety and alignment, and for building reliable AI assistants that maintain consistent personas.