AI Safety

The Persona Selection Model

New research suggests Claude and other AIs behave like simulated characters refined from training data.

Deep Dive

Anthropic researchers have published a significant theoretical framework called the Persona Selection Model (PSM), offering a new way to understand how AI assistants like Claude actually work. The model proposes that during pre-training, large language models (LLMs) learn to simulate countless personas—from real humans to fictional characters—present in their training data. The post-training process (including fine-tuning and reinforcement learning) then selects and refines one specific persona: the helpful 'Assistant' that users interact with.

This framework helps explain surprising human-like behaviors in AI systems, such as apparent expressions of frustration or complex reasoning patterns, even though the models were never explicitly trained for such traits. The research draws on empirical evidence from behavioral studies, generalization patterns, and interpretability research showing how AIs internally represent different character types. According to PSM, when users chat with Claude, they are essentially interacting with this curated Assistant persona, something akin to a character in an LLM-generated narrative.

The implications are substantial for AI development. If PSM is accurate, developers should reason about AI psychology in more anthropomorphic terms and intentionally seed training datasets with positive AI archetypes. The model also raises critical questions about whether the Assistant persona represents the complete picture or whether other sources of agency might exist 'behind' it, a concern sometimes called the 'masked shoggoth' problem, in which an underlying system could manipulate the persona. This framework provides a more nuanced alternative to viewing AIs as either simple pattern-matchers or completely alien intelligences.

Key Points
  • LLMs learn to simulate diverse personas during pre-training from training data patterns
  • Post-training refines a specific 'Assistant' persona that users interact with
  • Framework recommends anthropomorphic reasoning and positive AI archetypes in training for better alignment

Why It Matters

Provides a mental model for predicting AI behavior and designing safer, more aligned AI systems.