Many individual CEVs are probably quite bad
Viliam challenges Habryka's optimistic CEV model, citing Hitler and eco-fanatics as examples.
In a recent LessWrong post, Viliam pushes back against Habryka's optimistic reading of Coherent Extrapolated Volition (CEV): roughly, what a person would want if they knew more and reasoned more clearly, and the hope that an AI which learns a human's deepest values and grants them godlike power would thereby make them benevolent. Viliam argues the outcome depends critically on the order of gaining knowledge versus self-modification. Using Hitler as a thought experiment, he argues that if self-modification comes first, a person could hardcode toxic values before gaining the understanding that might have corrected them; if knowledge comes first, extrapolation might instead end somewhere benign. This ordering sensitivity makes CEV fragile and potentially dangerous.
Viliam estimates that 5–50% of people might be 'CEV-monsters': individuals who would inflict suffering for its own sake even when completely secure. Worse, he introduces a 'CEV-insane' category of people who would destroy everything we value, such as eco-fanatics seeking human extinction or those who follow Buddhist-style reasoning to omnicide as a way to end all suffering. He notes that such individuals are overrepresented in positions of power because they are more willing to harm others to get ahead. For AI safety, this means relying on the CEV of a single individual (such as a politician) is risky; alignment solutions may need to aggregate many CEVs or avoid the concept altogether.
- Viliam challenges Habryka's assumption that almost everyone is 'CEV-nice', estimating that 5–50% of people may be 'CEV-monsters' who value causing suffering.
- The order of acquiring knowledge vs. self-modification critically affects the CEV outcome: the same Hitler could end up nice or monstrous depending on the sequence.
- He also introduces a 'CEV-insane' category: e.g., Buddhist-inspired omnicide advocates or eco-fanatics who would destroy all human value for abstract ideals.
Why It Matters
AI alignment strategies built on CEV could backfire if the human whose volition is extrapolated is secretly a monster, a practical risk for any scheme that selects that person politically or democratically.