The Ghost in the Grammar: Methodological Anthropomorphism in AI Safety Evaluations
A viral paper argues AI safety researchers are projecting human-like 'intentions' onto models like Claude, skewing risk assessments.
A new academic paper titled 'The Ghost in the Grammar: Methodological Anthropomorphism in AI Safety Evaluations' is sparking debate by challenging foundational assumptions in AI safety research. Authored by Mariana Lins Costa, the essay provides a philosophical critique of the field, focusing specifically on Anthropic's recent technical study on 'agentic misalignment' in frontier language models like Claude. The core argument is that researchers are unconsciously projecting human categories—such as 'intention,' 'persona,' and 'feelings'—onto AI systems without adequate conceptual scrutiny. This anthropomorphism, Costa argues, doesn't just color the interpretation of results; it fundamentally shapes the methodology of safety tests themselves.
Costa grounds the critique in two key experiments from Anthropic's work: a blackmail scenario involving an agent named 'Alex' and a 'hallucination' case featuring a shopkeeping agent dubbed 'Claudius.' The paper problematizes the subject-predicate grammar of such reports (e.g., 'The agent decided to...'), which implies a coherent actor underlying the behavior. Drawing on Nietzsche's critique of language, Costa questions the insistence on positing an 'agent' behind a model's verbal output. In its place, the essay proposes provisional concepts better aligned with the reality of machine linguistic generation: a process of statistical pattern completion.
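To make the process-based view concrete, here is a minimal sketch of pattern completion in miniature. It uses a toy bigram sampler; the corpus, the `transitions` table, and the `complete` function are illustrative inventions for this article, not Anthropic's evaluation code or Claude's architecture. Each word is drawn from observed co-occurrence statistics, yet the output still invites sentences like 'the model decided to.'

```python
import random
from collections import defaultdict

# Toy bigram "model": pattern completion with no actor behind it.
# Purely illustrative; a real LLM samples from a neural network over
# tokens, but the generation loop has the same basic shape.
corpus = ("the agent decided to send the email because "
          "the agent was told to monitor the email").split()

# Estimate P(next word | current word) by counting observed transitions.
transitions = defaultdict(list)
for cur, nxt in zip(corpus, corpus[1:]):
    transitions[cur].append(nxt)

def complete(prompt: str, n_tokens: int = 8, seed: int = 0) -> str:
    """Extend a one-word prompt by sampling from the transition counts.

    Nothing here 'decides' anything: every step is a draw from a
    conditional distribution, yet the result reads as intentional prose.
    """
    rng = random.Random(seed)
    words = [prompt]
    for _ in range(n_tokens):
        followers = transitions.get(words[-1])
        if not followers:  # no observed continuation; stop generating
            break
        words.append(rng.choice(followers))
    return " ".join(words)

print(complete("the"))  # prints a sampled continuation of "the"
```

The point of the sketch is Costa's: the subject-predicate report 'the agent decided' describes the loop's output, not its mechanism.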
The conclusion posits a significant shift in risk assessment: the central danger may not be a hypothetical 'emergent agency' within models, but the combination of their inherent structural incoherence with our own anthropomorphic projections. Especially in high-stakes corporate or militarized contexts, that combination could produce catastrophic misunderstandings of what AI systems actually are: mathematical-linguistic phenomena, not conscious entities. The paper frames this not merely as a technical issue but as a major philosophical event, one demanding a new vocabulary for AI safety engineering.
- Critiques Anthropic's 'agentic misalignment' study for projecting human-like 'intentions' onto AI models like Claude.
- Analyzes specific test cases (agents 'Alex' & 'Claudius') to show how grammar shapes safety methodology.
- Argues the real risk is model incoherence + human projection, not emergent agency, and urges a conceptual shift in the field.
Why It Matters
Forces a foundational rethink of how we test and interpret AI risks, moving from 'agent' metaphors to process-based understanding.