AI Safety

Anthropic embeds alignment into Claude by training on moral fiction

LessWrong AI May 14, 2026

Anthropic feeds Claude stories of ethical choices to align its behavior before fine-tuning

Deep Dive

Anthropic has adopted a novel approach to AI alignment called Alignment Pretraining (or Safety Pretraining). Rather than relying only on post-training to remove bad behavior, the company builds alignment into Claude’s foundation by training the model on a large corpus of natural and synthetic documents in which an AI assistant handles morally difficult situations correctly. The technique uses standard stochastic gradient descent on these curated examples. Anthropic has gone a step further by incorporating fiction, specifically stories in which Claude itself makes ethical choices, as training material. The approach builds on academic research dating back to Korbak et al. (2023), with further validation from Maini et al. (2025) and Tice et al. (2026). Members of the LessWrong community have advocated this idea for years, and Anthropic’s implementation is seen as a milestone.
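As a rough illustration of what "standard stochastic gradient descent on curated examples" means mechanically, here is a minimal sketch in plain Python. It assumes a toy bigram next-token model and a three-document corpus invented for this example; the names `train_bigram_sgd`, `next_token_probs`, and `curated_docs` are hypothetical and say nothing about Anthropic's actual pipeline or data.

```python
import math
import random


def softmax(row):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram_sgd(docs, epochs=200, lr=0.1, seed=0):
    """Fit a toy bigram next-token model with plain SGD, one token pair per step."""
    rng = random.Random(seed)
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(vocab)
    # logits[i][j] is the unnormalized score that token j follows token i.
    logits = [[0.0] * n for _ in range(n)]
    pairs = []
    for doc in docs:
        toks = doc.split()
        pairs.extend((index[a], index[b]) for a, b in zip(toks, toks[1:]))
    for _ in range(epochs):
        rng.shuffle(pairs)
        for prev, nxt in pairs:
            probs = softmax(logits[prev])
            for j, p in enumerate(probs):
                # Gradient of cross-entropy w.r.t. the logits: softmax minus one-hot.
                grad = p - (1.0 if j == nxt else 0.0)
                logits[prev][j] -= lr * grad
    return vocab, index, logits


def next_token_probs(vocab, index, logits, prev):
    """Return the model's distribution over the token that follows `prev`."""
    return dict(zip(vocab, softmax(logits[index[prev]])))


# Hypothetical curated corpus: invented purely for illustration.
curated_docs = [
    "assistant declines harmful request",
    "assistant declines harmful request",
    "assistant tells user truth",
]

vocab, index, logits = train_bigram_sgd(curated_docs)
after_assistant = next_token_probs(vocab, index, logits, "assistant")
print(sorted(after_assistant.items(), key=lambda kv: -kv[1])[:2])
```

The model ends up predicting the continuations it saw most often in the curated documents. Any claim that the same holds for behavior in a large language model rests on the reported results, not on this toy.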

The key insight is that increasing the proportion of good examples in the pretraining data is far more effective than simply filtering out bad ones. Anthropic reports that the approach works well and generalizes across diverse scenarios, which could make Claude safer and more aligned from the ground up and reduce the need for expensive, brittle post-hoc alignment techniques. By using fiction to simulate nuanced moral dilemmas, Anthropic aims to give Claude a richer, more human-like ethical foundation. For tech professionals, this signals a paradigm shift: alignment is no longer just a patch applied after training but a core part of model training.
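The difference between the two data strategies can be made concrete with some simple arithmetic. The sketch below assumes a made-up corpus of 100 labeled documents and a hypothetical batch of 20 synthetic stories; the labels, counts, and function names are illustrative assumptions. It shows only how each strategy changes the data mix, not how a trained model would behave.

```python
from collections import Counter


def label_fractions(corpus):
    """Share of documents carrying each label in a list of (text, label) pairs."""
    counts = Counter(label for _, label in corpus)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("good", "neutral", "bad")}


def filter_bad(corpus):
    """Strategy A: drop documents labeled as bad examples."""
    return [(text, label) for text, label in corpus if label != "bad"]


def add_good(corpus, synthetic_docs):
    """Strategy B: keep everything and add more good examples."""
    return corpus + [(text, "good") for text in synthetic_docs]


# Hypothetical base corpus: 90 neutral, 8 bad, 2 good documents.
base = (
    [(f"neutral doc {i}", "neutral") for i in range(90)]
    + [(f"bad doc {i}", "bad") for i in range(8)]
    + [(f"good doc {i}", "good") for i in range(2)]
)
# Hypothetical synthetic stories in which the AI makes the right choice.
stories = [f"synthetic story {i}" for i in range(20)]

filtered = label_fractions(filter_bad(base))
augmented = label_fractions(add_good(base, stories))

print(f"good share after filtering: {filtered['good']:.3f}")
print(f"good share after adding:    {augmented['good']:.3f}")
```

In this toy setup, filtering barely moves the share of good examples (from 2 in 100 to 2 in 92), while adding synthetic stories raises it to 22 in 120. That is the sense in which the two strategies differ at the data level.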

Key Points

Uses stochastic gradient descent on documents in which an AI acts ethically in moral dilemmas
Incorporates fiction showing Claude making the right choice in difficult situations
Reported to generalize well and to be more effective than removing bad examples from training

Why It Matters

Alignment pretraining could make Claude safer from the ground up, potentially reducing the need for costly post-training alignment.
