An Informal Definition of Goals for Embedded Agents
New framework conceptualizes AI goals as probable events in an agent's generative self-model that depend statistically on the agent's own actions.
A new conceptual framework for understanding the goals of advanced AI agents has been proposed by researcher Ashe Vazquez Nuñez. Published on the LessWrong forum as part of work done in the MATS 9.0 program under the mentorship of Richard Ngo, the post offers an 'informal definition' that breaks down how embedded agents—AIs that interact with an external environment—perceive and pursue objectives. The core idea is that such an agent inherently partitions the world into three components: itself ('the agent'), everything else ('the external world'), and the dynamics (like observations and actions) that connect them.
Within this model, the agent maintains beliefs, conceptualized as a generative model of the world. Crucially, this includes a 'generative self-model' that represents the agent's own existence and its causal relationship to the world. The novel definition emerges here: an agent's goals are the probable future events within this self-model that are statistically dependent on its own actions. In simpler terms, a goal is something likely to happen, but only if the agent specifically acts to make it happen. This work builds upon and aims to informally explain more rigorous mathematical treatments of agent partitions by researchers like Abram Demski (2025) and Andrew Critch (2022).
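One way to make the informal definition concrete is the sketch below. It is an illustrative formalization, not notation from the post itself: the symbols M (the self-model), E (a candidate future event), pi (the agent's actual policy), and pi_0 (a default, inactive policy) are all introduced here as assumptions.

```latex
% Illustrative formalization (assumed notation, not from the post):
% M    -- the agent's generative self-model
% E    -- a candidate future event represented in M
% \pi  -- the agent's actual policy; \pi_0 -- a default (inactive) policy
E \text{ is a goal} \iff
  \underbrace{P_M(E \mid \pi) \text{ is high}}_{\text{probable if the agent acts}}
  \;\wedge\;
  \underbrace{P_M(E \mid \pi) \gg P_M(E \mid \pi_0)}_{\text{statistically dependent on its actions}}
```

On this reading, the second conjunct is what separates goals from mere background regularities: an event the agent expects regardless of what it does is a prediction, not a goal.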
- Defines goals for 'embedded agents'—AIs that interact with an external world through observations and actions.
- Proposes agents use a 'generative self-model' in which goals are probable events that depend statistically on the agent's own actions (see the toy sketch after this list).
- Builds on formal mathematical work by Demski (2025) and Critch (2022) from the AI safety research community.
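To make the statistical-dependence criterion tangible, here is a minimal toy sketch. Everything in it is hypothetical and invented for illustration: the events 'cup_full' and 'sun_rises', the probabilities, and the thresholds are assumptions, not anything from the post. The first event qualifies as a goal because it is probable only when the agent acts; the second does not, because it is probable either way.

```python
import random

def sample_world(agent_pours: bool) -> dict:
    """One forward sample from a toy generative self-model."""
    return {
        # Action-dependent: pouring almost always fills the cup.
        "cup_full": random.random() < (0.95 if agent_pours else 0.05),
        # Action-independent: the sun rises either way.
        "sun_rises": random.random() < 0.999,
    }

def prob(event: str, agent_pours: bool, n: int = 50_000) -> float:
    """Monte Carlo estimate of P(event | action choice) under the model."""
    return sum(sample_world(agent_pours)[event] for _ in range(n)) / n

def is_goal(event: str) -> bool:
    """Goal test in the spirit of the post's informal definition: the event
    must be probable given the agent's action AND its probability must
    change depending on whether the agent acts. The numeric thresholds
    (0.9 and 0.5) are arbitrary illustration values."""
    p_act, p_idle = prob(event, True), prob(event, False)
    return p_act > 0.9 and (p_act - p_idle) > 0.5

print(is_goal("cup_full"))   # True: likely, but only because the agent acts
print(is_goal("sun_rises"))  # False: likely, yet independent of the action
```

The exact thresholds do not matter; the point is that the dependence test, not raw probability, is what distinguishes a goal from a background regularity the agent merely predicts.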
Why It Matters
Provides a clearer conceptual foundation for aligning advanced AI agents, a critical challenge in AI safety research.