The Refined Counterfactual Prisoner's Dilemma
A new thought experiment reveals potential flaws in how AI agents reason about counterfactual worlds.
AI alignment researcher Chris Leong has published a refined version of a critical thought experiment, titled 'The Refined Counterfactual Prisoner's Dilemma', on the AI Alignment Forum. The scenario involves a perfect predictor, Omega, who flips a coin, tells you the result, and asks you for $1. Crucially, Omega also predicts what you would have done had the coin come up the other way: if it predicts you wouldn't have paid in that counterfactual world, it inflicts $1 million in damage. This cleanly illustrates a potential flaw in standard decision theory: an agent that stops caring about worlds where its observations went differently will, in effect, rationally press a 'button' that burns immense value in the symmetric counterfactual world in exchange for a trivial gain in its own.
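To make the payoff structure concrete, here is a minimal sketch in Python. It is not taken from Leong's post: the names (`payoff`, `expected_value`, `always_pay`) are illustrative, and it assumes, as one natural reading of the setup, that both the $1 cost and the $1 million damage land in the world the agent actually observes. What it shows is that scoring whole policies across both coin results makes paying clearly dominant.

```python
# Minimal sketch of the payoff structure described above (illustrative names;
# assumes the $1 cost and the $1M damage both land in the observed world).

COST = 1            # what Omega asks you to pay in the observed world
DAMAGE = 1_000_000  # inflicted if Omega predicts you would not have paid
                    # had the coin come up the other way

def payoff(policy, observed):
    """Payoff in the world where `observed` ('heads' or 'tails') came up.

    `policy` maps a coin result to True (pay the $1) or False (refuse).
    """
    other = 'tails' if observed == 'heads' else 'heads'
    value = -COST if policy(observed) else 0
    if not policy(other):   # Omega's prediction about the counterfactual you
        value -= DAMAGE
    return value

def expected_value(policy):
    """Score the whole policy over the fair coin, not just one branch."""
    return 0.5 * payoff(policy, 'heads') + 0.5 * payoff(policy, 'tails')

always_pay = lambda result: True
never_pay = lambda result: False

print(expected_value(always_pay))  # -1.0
print(expected_value(never_pay))   # -1000000.0
```

Conditioned on the observed flip, refusing looks like a free $1 saving, since the punishment depends only on Omega's prediction about the counterfactual; but an agent that reasons this way refuses in both worlds, so the policy it actually ends up implementing is the one that costs it $1 million.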
The thought experiment is a direct challenge to the foundations of expected utility maximization, a cornerstone of modern AI and rational agent design. It builds on commentary from researchers such as Scott Garrabrant, who argues that the field's adoption of 'updatelessness' should have revealed that utility theory itself is wrong, because standard utility theory assumes agents stop caring about possible worlds where their observations differ. Leong's refinement improves on an earlier version co-discovered with 'Cousin_It' by simplifying the payoff structure to a stark $1 cost versus a $1M punishment, making the theoretical problem more visceral and easier to communicate. The goal is to pressure-test proposed decision theories for advanced AI systems; Leong argues that a theory which fails under the idealized assumption of a perfect predictor likely has deeper issues relevant to real-world safety.
- Challenges expected utility maximization by showing agents must care about counterfactual worlds where their observations differ.
- Uses a perfect predictor (Omega) to create a scenario where refusing to pay $1 triggers a $1M punishment via the symmetric counterfactual.
- A refinement of an earlier thought experiment, designed for greater clarity and wider circulation in AI safety discussions.
Why It Matters
Highlights potential foundational flaws in how we model AI reasoning, with direct implications for building safe, advanced artificial intelligence.