AI Safety

Holden Karnofsky Says 49% Chance His AI Safety Work Backfires

Even top AI safety strategists admit they may be making things worse...

Deep Dive

AI safety experts face a profound uncertainty: even their best efforts might be making things worse. Holden Karnofsky, a prominent voice in the field, recently estimated on the 80,000 Hours podcast that there is roughly a 49% chance his own actions are net negative for humanity’s long-term survival. In 2025, Jesse Clifton went further, resigning as executive director of the Center on Long-Term Risk and citing similar doubts. This is more than impostor syndrome. It reflects a structural problem in the field: practitioners lack clear feedback loops that would tell them whether they are on the right track.

This “hidden failure” problem arises because impact in AI safety is hard to measure. A project can gain adoption (users, citations, funding) and still be ineffective or even harmful. A for-profit startup can iterate on market signals, but AI safety is pre-paradigmatic: it has no established theory of change. Experts warn that researchers can spend years on interpretability or fairness without addressing core misalignment risks. As a result, deliberate strategic thinking about impact is both harder and more essential here than in commercial domains.

Key Points
  • Holden Karnofsky thinks there is a ~49% chance that his actions are making things worse (80,000 Hours podcast).
  • Jesse Clifton stepped down as executive director of the Center on Long-Term Risk in 2025, citing similar doubts.
  • AI safety suffers from 'hidden failure': a project can have adoption, citations, and funding while its lack of impact stays invisible.

Why It Matters

For professionals investing or working in AI safety, these admissions highlight how uncertain the impact of their work is and why strategic self-reflection is needed.