Alignment vs. Safety, part 2: Alignment
The term 'alignment' has three conflicting meanings, creating dangerous confusion about AI existential risks.
AI safety researcher David Krueger published a critical analysis on LessWrong arguing that the term 'alignment' has come to mean three different things. Originally coined for the hard technical problem of making AI systems share human values and intentions, the term now also refers to the existential safety community itself and to any technical work aimed at preventing AI catastrophe. This ambiguity lets AI companies claim their models are 'aligned' while sidestepping the crucial 'assurance problem': the challenge of actually verifying that an AI system's goals match human intentions.
Krueger argues that this confusion produces false safety assurances, particularly as AI capabilities advance beyond GPT-3 levels. When researchers say 'alignment is going well,' it is unclear whether they mean the technical alignment problem is being solved, the safety community is making progress, or existential risk is being reduced. The assurance problem, determining whether we can trust that an AI wants what we want, may be much harder than alignment itself, yet the two are routinely lumped together. The result is that companies can declare their models safe without providing verifiable evidence.
- The term 'alignment' has three distinct meanings: technical goal alignment, the safety community, and existential risk work
- AI companies can claim models are 'aligned' without solving the harder 'assurance problem' of verifying safety
- This confusion creates false confidence about AI existential risks as capabilities advance beyond GPT-3
Why It Matters
Ambiguous safety claims could mask real existential threats as AI systems become more capable and autonomous.