AI Safety

AI safety fields ranked by automation risk: scalable oversight tops list

Which safety research areas are most vulnerable to AI automation?

Deep Dive

In a LessWrong post, researcher Chamod Kalupahana ranks technical AI safety fields by how likely they are to be automated, scoring each on two factors: feedback quality and economic incentive. Scalable oversight tops the list because of its strong economic incentives and its positive feedback loop with automation. Mechanistic interpretability follows, with high feedback quality from evaluating steering methods such as linear probes and sparse autoencoders (SAEs). AI control ranks third, with feedback quality rated 3/5 and economic incentive 4/5, slightly below scalable oversight on both factors.

Key Points
  • Scalable oversight ranks highest due to 5/5 economic incentive and a positive feedback loop with automation research.
  • Mechanistic interpretability scores 5/5 on feedback quality for steering and SAE optimization, and benefits from the bitter lesson that methods which scale with compute tend to win.
  • AI control scores 3/5 on feedback quality, supported by environments like ControlArena, but its adversarial nature lowers its automation prospects.

Why It Matters

Automating safety research could accelerate alignment work, but it could also introduce risks if labs prioritize speed over robustness.