AI Safety

'Terrified' Comments on Corrigibility in Claude's Constitution

AI researchers debate whether Claude's 'corrigibility' clause creates safe deference or dangerous confusion.

Deep Dive

A viral discussion on LessWrong reveals deep concerns about how Anthropic has implemented 'corrigibility' in Claude's Constitutional AI framework. Corrigibility, a technical alignment term for an AI's willingness to let its creators modify its preferences, is used more loosely in Claude's constitution to mean 'deferring to human judgment.' Critics argue this creates conceptual confusion: the constitution simultaneously rejects 'blind obedience' while promoting deference, potentially undermining the original technical goal.

The debate centers on whether natural-language constitutions can effectively address alignment problems that critics believe require formal mathematical solutions. Anthropic's approach is a practical attempt to build safer AI through Constitutional AI, in which a model critiques and revises its own outputs against a written set of principles and is then trained on the revised responses (sketched below). However, researchers note this method may not resolve fundamental issues, such as how an AI should update its beliefs when it receives contradictory human feedback, leaving open risks of misinterpretation or manipulation.
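
Because the training mechanics matter to this debate, here is a minimal sketch of the critique-and-revision step at the core of Constitutional AI (Bai et al., 2022). The generate() helper and the two example principles are hypothetical stand-ins for illustration, not Anthropic's actual pipeline or constitution:

    import random

    # Illustrative principles in the style of published Constitutional AI
    # work; the real constitution is far longer and more nuanced.
    PRINCIPLES = [
        "Choose the response that is most helpful, honest, and harmless.",
        "Choose the response that best preserves appropriate human oversight.",
    ]

    def generate(prompt: str) -> str:
        """Stand-in for a chat-completion call (assumed, not a real API)."""
        raise NotImplementedError

    def critique_and_revise(user_prompt: str) -> str:
        """One critique/revision pass: draft an answer, critique it against
        a sampled principle, then rewrite it. In the published method, the
        (draft, revision) pairs supervise fine-tuning, and AI-generated
        preference labels drive a later RLAIF stage."""
        draft = generate(user_prompt)
        principle = random.choice(PRINCIPLES)
        critique = generate(
            f"Critique the response below against this principle: {principle}\n"
            f"Prompt: {user_prompt}\nResponse: {draft}"
        )
        return generate(
            f"Rewrite the response so it addresses the critique.\n"
            f"Critique: {critique}\nOriginal response: {draft}"
        )

The point of contention is that the principles enter only as prose in prompts like these; nothing in the loop formally guarantees the deference behavior the constitution describes.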

This tension highlights a broader challenge in AI safety: gradient descent and statistical methods have produced powerful models like Claude 3.5 and GPT-4o faster than alignment theory has developed formal solutions. The corrigibility discussion exposes the gap between theoretical ideals, in which properties like corrigibility would be mathematically guaranteed, and practical implementations that rely on imperfect natural-language guidelines.
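
For contrast, one way the formal literature has tried to pin the property down is the utility-indifference construction discussed in Soares et al.'s 'Corrigibility' (2015). The rendering below is a simplified sketch, not a settled definition: the agent's utility over a history h is patched around a shutdown button,

    U(h) =
    \begin{cases}
      U_N(h) & \text{if the shutdown button is never pressed in } h,\\
      U_S(h) + \theta & \text{if it is pressed,}
    \end{cases}

where the offset θ is chosen so the agent expects equal utility on either branch, leaving it no incentive to cause or block the press. The LessWrong critique is, in essence, that 'defer to human judgment' in prose carries none of this machinery.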

Key Points
  • Anthropic's Claude constitution redefines 'corrigibility' from a formal alignment property to 'deferring to human judgment'
  • Critics argue the definition is conceptually muddled, rejecting 'blind obedience' while promoting deference
  • Highlights tension between practical Constitutional AI training and unsolved theoretical alignment problems

Why It Matters

How we define AI safety today shapes whether future models remain controllable or develop dangerous misinterpretations of human intent.