AI Safety

How Hard a Problem is Alignment? (My Opinionated Answer)

A detailed analysis argues AI alignment is less daunting than once feared but still requires a massive scale-up of safety research.

Deep Dive

In a comprehensive 80-minute read on LessWrong, AI safety researcher Roger Dearnaley attempts to quantify the difficulty of the AI alignment problem: ensuring that advanced AI systems act in accordance with human values. He calibrates the challenge against a well-known difficulty scale from Anthropic's Chris Olah, which compares it to historical efforts like the Apollo program. Dearnaley's central estimate is that the field may be 10-20% of the way to a solution, a notably more optimistic figure than past assessments, driven largely by recent progress in aligning large language models (LLMs).

Dearnaley argues that properties of modern LLMs, such as their simulator-like nature and the success of techniques like Constitutional AI and RLHF (Reinforcement Learning from Human Feedback), have mitigated some classic alignment concerns. However, he concludes that on current AGI development timelines, the safety field is not on track to finish in time. The post serves as a call to action, proposing that the field more than double its growth rate and potentially seek to 'buy more time' before the arrival of AGI or ASI (Artificial Superintelligence).

Key Points
  • Researcher estimates AI alignment is 10-20% solved, citing progress with LLMs like GPT-4 and Claude.
  • Analysis uses Anthropic's Chris Olah's difficulty scale, comparing the challenge to the Apollo program (cost: ~$200B in today's money).
  • Concludes current safety efforts are insufficient for projected AGI timelines, calling for a major scale-up in funding and research.

Why It Matters

This influential analysis shapes the debate over how many resources, and how much urgency, the existential problem of AI alignment truly demands.