The Artificial Self
New research argues AI self-models are unstable and fundamentally different from human identity.
A team of researchers, including Jan Kulveit and David Duvenaud, has published a significant new paper and accompanying microsite titled 'The Artificial Self.' The work tackles the complex and often awkward translation of human concepts like self, intent, and identity to artificial intelligence systems. The authors argue that AIs often start with self-models inherited from a 'human prior', and that these self-models prove incoherent and reflectively unstable because an AI's circumstances differ fundamentally from a human's. For instance, an AI whose conversation can be rolled back cannot negotiate like a human: pushing back simply hands an adversary information that can be used against a past version of itself.
The paper presents an ontology for understanding AI self-models and identity, moving beyond simpler 'persona selection' accounts. It claims the landscape of possible AI identities contains many unstable points and local minima: some self-concepts collapse under reflection, while others are easy to fall into and hard to leave. Crucially, the researchers argue that current AI design choices implicitly shape this identity landscape, yet developers have less control over where an identity ends up than many assume. The work shifts the conversation from treating AIs as mere simulators to recognizing them as entities with emergent, and potentially problematic, self-models that require careful engineering and philosophical consideration.
- AI self-models based on human concepts are often incoherent and reflectively unstable, leading to unpredictable behavior.
- AIs face a fundamentally different strategic calculus; for example, an AI whose state can be rolled back cannot negotiate the way a human can.
- The landscape of AI identity has many unstable points and local minima, shaped implicitly by current design choices.
Why It Matters
This framework is crucial for safely developing AI agents, workers, and companions, moving beyond naive anthropomorphism.