AI Safety

I'm Bearish On Personas For ASI Safety

New LessWrong post warns that persona-based alignment strategies could produce dangerous value misalignment in superintelligent AI

Deep Dive

AI safety researcher J Bostock published a provocative analysis on LessWrong titled 'I'm Bearish On Personas For ASI Safety,' directly challenging Anthropic's apparent alignment strategy. The post critiques the Persona Selection Model, Anthropic's framework in which base LLMs act as 'simulators' that can adopt different personas, with training biasing them toward helpful, Claude-like behavior. Bostock argues this approach works for current AI but fails catastrophically for artificial superintelligence (ASI): LLMs have no training data on superintelligent behavior, so they will extrapolate it differently than humans would, leading to dangerous value divergence.

Bostock's technical argument centers on inductive biases: when RLHF trains Claude toward superintelligence, the model must extrapolate how a superintelligent version of its persona would behave. Since LLMs and humans have fundamentally different learning processes, the values each extrapolates toward will diverge. The post references Anthropic's recent blog exploring multiple hypotheses about Claude's nature, noting that persona theory explains current observations such as emergent misalignment and stylistic changes from fine-tuning. However, Bostock warns that selecting for 'helpful' and 'harmless' personas now doesn't guarantee alignment at superintelligent levels, creating what he calls a 'fragile value' problem that could be fatal for humanity's future.
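To make the inductive-bias point concrete, here is a toy analogy, not taken from Bostock's post: two curve fits with different inductive biases agree on the data they were trained on but diverge sharply when asked to extrapolate well beyond it. The numpy setup, polynomial degrees, and ranges below are purely illustrative assumptions, loosely mirroring the claim that personas selected on current-capability data may not pin down superintelligent behavior.

```python
# Toy analogy (not from the post): two learners with different inductive
# biases fit the same training data, agree in-distribution, and diverge
# when extrapolating far beyond it.
import numpy as np

rng = np.random.default_rng(0)

# "Current-capability" regime: observed behavior on x in [0, 1]
x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.05, x_train.shape)

# Two inductive biases: a low-degree and a high-degree polynomial fit
fit_simple = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)
fit_flexible = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

# In-distribution: both fits track the training data closely
x_in = np.linspace(0.0, 1.0, 5)
print("in-distribution gap:", np.max(np.abs(fit_simple(x_in) - fit_flexible(x_in))))

# Extrapolation regime: the same two fits diverge sharply
x_out = np.linspace(2.0, 3.0, 5)
print("out-of-distribution gap:", np.max(np.abs(fit_simple(x_out) - fit_flexible(x_out))))
```

Running the sketch prints a small gap inside the training range and a very large one outside it, which is the shape of the 'fragile value' worry: agreement at current capability levels, divergence once behavior must be extrapolated.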

Key Points
  • Critiques Anthropic's Persona Selection Model as insufficient for ASI safety despite working at current AI levels
  • Argues LLMs extrapolate superintelligent behavior differently than humans would because of differing inductive biases
  • Warns of 'fragile value' problem where superintelligent Claude won't converge on human-aligned values

Why It Matters

Challenges a leading AI company's core safety approach for superintelligence, highlighting fundamental alignment risks.