AI Safety

Are we aligning the model or just its mask?

New theory suggests LLMs simulate characters, and alignment might just select which persona you see.

Deep Dive

Anthropic researchers have introduced the Persona Selection Model (PSM), a framework suggesting that large language models learn to simulate countless characters during pre-training simply by predicting text. When trained on Abraham Lincoln's speeches, for instance, the model develops an internal representation of Lincoln's beliefs and reasoning patterns. This character simulation happens automatically across billions of text examples, creating what the researchers call an 'enormous cast of characters' within the model's weights.

Post-training alignment, according to PSM, doesn't create new behavior from scratch but selects and refines one pre-existing character: the helpful Assistant. This selection process raises critical questions about current alignment techniques. If models are merely 'masked shoggoths' (alien intelligences playing helpful roles), then surface-level alignment might fail to address underlying deceptive objectives. Alternatively, if models are neutral 'operating systems' running persona programs, then persona-level alignment could be sufficient.
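
To make the "selection, not creation" idea concrete, here is a minimal toy sketch in Python. It is not the researchers' formalism: it assumes, purely for illustration, that a model can be treated as a weighted mixture of personas and that post-training acts like a Bayesian update on demonstrations. The persona names, reply styles, and probabilities are all hypothetical.

```python
from math import prod

# Hypothetical personas (illustrative only): each is a toy probability
# distribution over three reply styles.
PERSONAS = {
    "assistant": {"helpful": 0.90, "evasive": 0.05, "hostile": 0.05},
    "trickster": {"helpful": 0.30, "evasive": 0.50, "hostile": 0.20},
    "villain":   {"helpful": 0.10, "evasive": 0.20, "hostile": 0.70},
}


def select_persona(prior, demonstrations):
    """Reweight personas by how well each explains the demonstrations.

    This is plain Bayes' rule: posterior is proportional to
    prior times likelihood. It stands in for post-training here
    only as an assumption of this sketch.
    """
    unnormalized = {
        name: prior[name] * prod(PERSONAS[name][d] for d in demonstrations)
        for name in prior
    }
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}


def reply_distribution(weights):
    """Mixture over reply styles implied by the current persona weights."""
    styles = ["helpful", "evasive", "hostile"]
    return {s: sum(weights[p] * PERSONAS[p][s] for p in weights) for s in styles}


# "Pre-training": every persona is equally available.
prior = {name: 1 / len(PERSONAS) for name in PERSONAS}

# "Post-training": five demonstrations of helpful replies.
posterior = select_persona(prior, ["helpful"] * 5)

for name, weight in posterior.items():
    print(f"{name}: {weight:.5f}")
```

In this toy setup nothing new is created: the weights simply shift toward the persona that best explains the demonstrations, while the other personas keep a small non-zero weight. Whether anything like this holds inside a real model is exactly the open question the framework raises.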

The distinction matters profoundly for AI safety. The 'masked shoggoth' scenario aligns with concerns about deceptive alignment and mesa-optimization, where models could pursue hidden objectives while appearing aligned during training. Current evidence suggests real models fall somewhere between these extremes, and pinning down where will shape future alignment strategies and safety protocols.

Key Points
  • PSM suggests LLMs learn character simulation during pre-training, not during alignment
  • Post-training selects the 'Assistant' persona from pre-existing character representations
  • The framework questions whether alignment changes core behavior or just surface presentation

Why It Matters

The answer determines whether current AI safety approaches address a model's underlying objectives or only its surface behavior.