The Refined Counterfactual Prisoner's Dilemma: An Attempt to Explode Decision-Theoretic Consequentialism
A viral thought experiment uses a perfect predictor to expose a potential flaw in how AI agents make decisions.
A new philosophical puzzle is gaining traction in AI alignment circles, challenging a fundamental assumption about how intelligent agents should make decisions. The 'Refined Counterfactual Prisoner's Dilemma,' an evolution of earlier work by thinkers like Scott Garrabrant, constructs a scenario in which a perfect predictor (Omega) flips a coin, tells you the result, and then asks you for $1. The catch: before you decide, Omega has already predicted what you would have done if the coin had landed the other way. If it predicted you wouldn't have paid in that counterfactual world, it inflicts $1 million in damage on you in the actual world after your decision. Crucially, the punishment hinges only on that counterfactual prediction, not on the choice you actually make.
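As a quick illustration, here is a minimal Python sketch of that payoff rule. The function name, the boolean encoding of decisions, and the deterministic framing are illustrative assumptions; only the dollar figures come from the scenario as described.

```python
# Minimal sketch of the payoff rule described above. Names and encoding
# are illustrative; the dollar figures are from the scenario.

PAYMENT = 1             # the $1 Omega asks for
PUNISHMENT = 1_000_000  # the $1M damage Omega can inflict

def payoff(pays_here: bool, would_pay_other_branch: bool) -> int:
    """Net dollars in the branch the agent actually observes.

    The punishment depends only on Omega's (perfect) prediction of the
    agent's choice in the other coin-flip branch, never on the choice
    made in this one.
    """
    loss = PAYMENT if pays_here else 0
    if not would_pay_other_branch:
        loss += PUNISHMENT
    return -loss
```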
This thought experiment aims to 'explode' decision-theoretic consequentialism, the idea that agents should care only about the actual world they observe. The paradox arises because, if you don't care about counterfactual worlds, you refuse to pay the trivial $1 in both the 'heads' and 'tails' scenarios: paying can never avert a punishment that is fixed by the other branch. But then Omega's counterfactual prediction is 'refuses' no matter which way the coin lands, so you are punished and lose $1 million regardless of the flip's outcome. The scenario suggests that a purely consequentialist agent, following standard expected utility maximization after updating on the coin flip, is exploitable and guarantees itself a bad outcome, whereas an 'updateless' agent that evaluates its policy across all possible worlds pays the $1 in both branches and loses only $1.
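To see the guaranteed loss in miniature, the sketch below compares the two symmetric policies, restating the payoff rule above; the policy names and encoding are illustrative. Because a symmetric agent makes the same choice in both branches, Omega's counterfactual prediction simply mirrors its actual decision.

```python
# Self-contained comparison of the two symmetric policies, restating the
# payoff rule above: paying costs $1; a predicted counterfactual refusal
# costs $1,000,000. With a symmetric policy, the counterfactual decision
# equals the actual one.

PAYMENT, PUNISHMENT = 1, 1_000_000

for name, pays in [("consequentialist (never pay)", False),
                   ("updateless (always pay)", True)]:
    loss = (PAYMENT if pays else 0) + (0 if pays else PUNISHMENT)
    print(f"{name}: loses ${loss:,} whichever way the coin lands")

# consequentialist (never pay): loses $1,000,000 whichever way the coin lands
# updateless (always pay): loses $1 whichever way the coin lands
```

Refusing looks locally free in either branch taken alone, since the punishment is already fixed by the counterfactual prediction; yet the policy of refusing everywhere ends up roughly a million dollars worse than the policy of paying everywhere.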
The debate centers on whether this exposes a deep flaw in classical decision theory that could affect the design of advanced AI. Proponents argue that if a proposed decision theory predictably loses in a scenario with a perfect predictor, it likely has fundamental problems. This isn't just academic; it bears on how future AI systems with superintelligent reasoning might handle uncertainty, predictions about their own actions, and trade-offs between different possible branches of reality. The discussion is part of a larger effort to ensure that powerful AI systems have robust, unexploitable reasoning frameworks.
- The experiment features 'Omega,' a perfect predictor that punishes an agent based on its predicted behavior in a counterfactual scenario.
- It challenges 'consequentialism,' the view that agents should value only the world they observe, arguing that this leads to a guaranteed loss in the puzzle.
- The argument is that any decision theory failing this perfect-predictor test may have fundamental flaws relevant to AI alignment.
Why It Matters
Highlights a potentially critical flaw in how advanced AI might reason, informing the search for robust, unexploitable decision algorithms.