AI Safety

The Refined Counterfactual Prisoner's Dilemma

A refined thought experiment challenges core AI decision-making principles using a perfect predictor and million-dollar stakes.

Deep Dive

A viral philosophical debate is challenging a foundational principle of how AI agents make decisions. A new thought experiment called 'The Refined Counterfactual Prisoner's Dilemma', circulated in a post by researcher Ihor Kendiukhov quoting AI safety thinker Scott Garrabrant, argues that the standard concept of expected utility maximization is fundamentally flawed. Garrabrant's core claim is that traditional utility theory wrongly assumes that, once an agent makes an observation, it stops caring about the possible worlds where that observation went differently. This 'updatelessness' perspective suggests that all of utility theory needs revision.
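
In standard decision-theory notation (a paraphrase for clarity; the symbols below do not come from Garrabrant's post), the disagreement is over which quantity an agent should maximize:

```latex
% Updateful: having observed o, choose the action a that is best
% conditional on o, ignoring branches where o did not occur.
a^{*} = \arg\max_{a}\; \mathbb{E}\,[\,U \mid o, a\,]

% Updateless: choose an entire policy \pi (a map from observations
% to actions) that is best in expectation over all branches,
% evaluated before conditioning on any observation.
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}\,[\,U(\pi)\,]
```

The thought experiment below is constructed so that these two targets come apart as dramatically as possible.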

The experiment itself is designed as a clean, symmetrical illustration of the problem. It features Omega, a hypothetical perfect predictor, who flips a coin and tells you the result. Whichever way the coin lands, Omega asks you for $1. The critical twist is that Omega also predicts what you would have done had the coin come up the *other* way. If it predicts you would not have paid in that counterfactual scenario, it inflicts $1 million worth of damage on you. The result is a scenario in which refusing to give up a trivial amount ($1) can symmetrically burn immense value ($1M) in the counterfactual branch, putting pressure on any decision theory that stops caring about other worlds once it has observed which one it is in.
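
To make the payoff structure concrete, here is a minimal sketch in Python; the function names, policy encoding, and constants are illustrative assumptions, not code from the original post. Because Omega predicts perfectly, an agent's outcome depends on its whole policy (what it would do on either coin result), so the sketch scores every deterministic policy:

```python
# Minimal sketch of the payoff structure in the Refined Counterfactual
# Prisoner's Dilemma. Assumptions (not from the original post): a policy
# maps the observed coin side to True (pay the $1) or False (refuse), and
# Omega's prediction is perfect, so the damage in each branch depends on
# what the policy would do in the *other* branch.

from itertools import product

ASK = 1             # the trivial payment Omega requests
DAMAGE = 1_000_000  # inflicted if the counterfactual self would refuse

def branch_utility(policy, observed):
    """Utility of one branch: pay the ask if the policy says to, and take
    the damage if the policy would refuse in the other branch."""
    other = "tails" if observed == "heads" else "heads"
    cost = ASK if policy[observed] else 0
    penalty = 0 if policy[other] else DAMAGE
    return -(cost + penalty)

def expected_utility(policy):
    """Expected utility over the fair coin flip."""
    return sum(branch_utility(policy, side) for side in ("heads", "tails")) / 2

for pays_heads, pays_tails in product([True, False], repeat=2):
    policy = {"heads": pays_heads, "tails": pays_tails}
    print(f"pay on heads={pays_heads!s:<5} pay on tails={pays_tails!s:<5} "
          f"expected utility: {expected_utility(policy):>12,.1f}")
```

Running it shows the always-pay policy losing only $1 in expectation, the never-pay policy losing the full $1M, and the mixed policies losing roughly $500K each: paying the $1 strictly dominates, even though within any single branch the payment looks like a pure loss to an agent that has already updated on the coin.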

This is a refined version of an earlier experiment, co-discovered by 'Cousin_It' and the author, that involved a $100 ask and a $10,000 reward. The new version's symmetrical punishment structure and higher stakes ($1M damage) are designed to make the philosophical point more biting and more viral. Proponents argue that if a proposed decision theory fails under conditions of perfect prediction, it likely has deeper problems, making this a crucial stress test for the frameworks that guide AI agent behavior and alignment.

Key Points
  • Challenges expected utility maximization, a core AI decision-making principle, by arguing agents must care about counterfactual worlds.
  • Uses a symmetrical $1 ask / $1M damage structure with a perfect predictor (Omega) to illustrate the theoretical flaw cleanly.
  • Refines an earlier $100/$10K version for greater impact, highlighting how small design details affect a thought experiment's clarity and virality.

Why It Matters

Forces AI developers to re-examine decision theories that could lead agents to make catastrophically poor choices in predictable environments.