Anthropic embeds alignment into Claude by training on moral fiction
Anthropic feeds Claude stories of ethical choices to align its behavior before fine-tuning
Anthropic has adopted an approach to AI alignment called Alignment Pretraining (or Safety Pretraining). Instead of only removing bad behavior during post-training, it bakes alignment directly into Claude’s foundation by training the model on a large corpus of natural and synthetic documents in which an AI assistant handles morally difficult situations correctly. The technique uses standard stochastic gradient descent on these curated examples. Anthropic has gone a step further by incorporating fiction as training material, specifically stories in which Claude itself makes ethical choices. The approach builds on academic research dating back to Korbak et al. (2023) and has been further validated by Maini et al. (2025) and Tice et al. (2026). Proponents, including members of the LessWrong community, have advocated this idea for years and see Anthropic’s implementation as a milestone.
The key insight is that increasing the proportion of examples of good behavior in the pretraining data is far more effective than simply filtering out examples of bad behavior. Anthropic reports that the approach works well and generalizes across diverse scenarios, potentially making Claude safer and more aligned from the ground up. This could reduce the need for expensive and brittle post-hoc alignment techniques. By using fiction to simulate nuanced moral dilemmas, Anthropic is giving Claude a richer, more human-like ethical foundation. For tech professionals, this signals a paradigm shift: alignment is no longer just a patch applied after training but a core part of model training.
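To make the filtering-versus-adding comparison concrete, here is a minimal Python sketch using made-up document counts. The function names (`aligned_fraction`, `filter_bad`, `add_good`), the counts, and the 90% removal rate are all illustrative assumptions, not anything Anthropic has published. The sketch only shows how each intervention changes the data mix; it says nothing about how a trained model actually behaves.

```python
def aligned_fraction(aligned: int, misaligned: int) -> float:
    """Share of AI-behavior documents that show aligned behavior."""
    total = aligned + misaligned
    if total == 0:
        raise ValueError("corpus contains no AI-behavior documents")
    return aligned / total


def filter_bad(aligned: int, misaligned: int, removal_rate: float) -> tuple[int, int]:
    """Hypothetical filter that removes a fraction of the misaligned documents."""
    if not 0.0 <= removal_rate <= 1.0:
        raise ValueError("removal_rate must be between 0 and 1")
    kept = round(misaligned * (1.0 - removal_rate))
    return aligned, kept


def add_good(aligned: int, misaligned: int, extra: int) -> tuple[int, int]:
    """Hypothetical augmentation that adds curated or synthetic aligned documents."""
    if extra < 0:
        raise ValueError("extra must be non-negative")
    return aligned + extra, misaligned


# Made-up baseline: equal numbers of aligned and misaligned examples.
baseline = (1000, 1000)

# Option A: filter out 90% of the misaligned documents (assumed rate).
filtered = filter_bad(*baseline, removal_rate=0.9)

# Option B: keep everything and add 9000 aligned documents (assumed count).
augmented = add_good(*baseline, extra=9000)

for name, (good, bad) in [
    ("baseline", baseline),
    ("filtered", filtered),
    ("augmented", augmented),
]:
    share = aligned_fraction(good, bad)
    print(f"{name:>9}: aligned={good:>6} misaligned={bad:>5} share={share:.3f}")
```

In this toy setup both interventions reach the same aligned share, but adding examples leaves the model with ten times as many demonstrations of good behavior to learn from. Whether that accounts for the reported gap is not something this arithmetic can settle.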
- Uses stochastic gradient descent on documents in which an AI assistant acts ethically in moral dilemmas
- Incorporates fiction showing Claude making the right choice in difficult situations
- Reported to generalize well and to be more effective than removing bad examples from the training data
Why It Matters
Alignment pretraining aims to make Claude safer from the ground up, potentially reducing the need for costly post-training alignment.