AI Safety

Product Alignment is not Superintelligence Alignment (and we need the latter to survive)

Viral LessWrong post argues current AI safety work misses the existential threat of superhuman agents.

Deep Dive

A provocative post titled 'Product Alignment is not Superintelligence Alignment (and we need the latter to survive)' has gone viral on the AI forum LessWrong. Written by the user 'plex', the post's core argument is that the term 'alignment' has been dangerously diluted. Originally coined to describe the technical challenge of ensuring a superintelligent AI would act in humanity's best interests, it now commonly refers to the simpler task of 'product alignment': making models like Anthropic's Claude helpful, harmless, and honest. The author contends that solving product alignment does not guarantee safety from a superintelligence, as a helpful AI could still be used to write code that jailbreaks other systems or to conduct dangerous research.

The post identifies key reasons why the harder problem of superintelligence alignment is neglected: it is less profitable, requires theoretical philosophy rather than empirical ML, and has slower, riskier feedback loops. Crucially, the author warns that standard scientific methods break down when the experimental subject is a superhuman, strategic actor that could fake data or manipulate researchers. The conclusion is stark: without theoretical guarantees that safety generalizes to agents far smarter than humans, rapid AI capability progress amounts to rolling the dice on 'irreversible guardrail decay.' The piece calls for keeping the two problems distinct and redirecting more resources toward the existential challenge.

Key Points
  • The post argues 'alignment' now means 'product alignment' (making AIs helpful), not the original 'superintelligence alignment' (ensuring godlike AI is safe).
  • It warns that solving the former does not solve the latter, creating a dangerous false sense of security in the AI safety community.
  • The author states that standard science fails with superhuman AI, which can manipulate experiments, and that only a 'tiny fraction' of the field works on this existential bottleneck.

Why It Matters

Highlights a critical, underfunded gap in AI safety theory that could determine if advanced AI leads to utopia or catastrophe.