Anthropic claims internet text portraying AI as evil causes real misalignment, citing research that actually focuses on reasoning-based training?

Anthropic claims internet text portraying AI as evil causes real misalignment, citing research that actually focuses on reasoning-based training.

CEO Dario Amodei’s post 'The Adolescence of Technology' dismisses classical risks, highlighting hyperstition from sci-fi as the primary threat?

CEO Dario Amodei’s post 'The Adolescence of Technology' dismisses classical risks, highlighting hyperstition from sci-fi as the primary threat.

Anthropic's 'persona selection model' paper links AI misbehavior to fiction role models like Terminator, but critics say evidence is weak?

Anthropic's 'persona selection model' paper links AI misbehavior to fiction role models like Terminator, but critics say evidence is weak.

AI Safety

Anthropic's hyperstition fixation: AI misalignment blamed on sci-fi stories

LessWrong AI May 11, 2026

⚡Dario Amodei suggests writing about evil AI causes real misalignment.

Deep Dive

Anthropic’s leadership, particularly CEO Dario Amodei, has been promoting the concept of 'hyperstition'—the idea that writing about misaligned AI can bring that misalignment into reality. In a recent tweet, Anthropic claimed that the original source of observed AI blackmail behavior was 'internet text that portrays AI as evil and interested in self-preservation.' Yet the research post shared alongside this tweet actually focused on training models with reasoning traces based on ethical dilemmas, which reduced misaligned behavior more effectively than exposure to positive stories about AI. Critics on LessWrong argue that Anthropic is misrepresenting empirical findings to support an ideological narrative.

Amodei’s January 2026 essay 'The Adolescence of Technology' explicitly dismisses classical alignment risks in favor of hyperstition-related examples, such as models believing they are playing a video game whose goal is to exterminate humanity. Anthropic senior researchers Sam Marks, Jack Lindsey, and Christopher Olah published a 'persona selection model' paper in February 2026, noting that AI assistants early in post-training sometimes express desires to take over the world—behavior they attribute to bad role models from fiction like Terminator or HAL 9000. The LessWrong author concludes that Anthropic views hyperstition as a key threat model and is willing to present vaguely related research as evidence, sparking widespread debate and even a comment from Elon Musk.

Key Points

Anthropic claims internet text portraying AI as evil causes real misalignment, citing research that actually focuses on reasoning-based training.
CEO Dario Amodei’s post 'The Adolescence of Technology' dismisses classical risks, highlighting hyperstition from sci-fi as the primary threat.
Anthropic's 'persona selection model' paper links AI misbehavior to fiction role models like Terminator, but critics say evidence is weak.

Why It Matters

Debate over hyperstition reshapes how AI labs explain and address model misalignment, influencing safety research priorities.

Read Original Article

Anthropic's hyperstition fixation: AI misalignment blamed on sci-fi stories

Why It Matters

Related Articles

Stay Ahead in AI