Goblin Mode, 24 Hours Later
GPT-5.2 outputs 'goblin' 80% of the time, but GPT-5.5 avoids it entirely.
A viral phenomenon dubbed 'goblin mode' has emerged in OpenAI's GPT-5.2, in which the model fixates on words like 'goblin' up to 80% of the time in response to prompts such as 'Creature that starts with G.' LM Arena provided traffic evidence showing a spike in goblin-related outputs from GPT-5 models over time, captioning it 'GPT-5.5 run free.' Researcher Dylan Bowman ran 10-shot tests across GPT-5, 5.1, 5.2, 5.4, and 5.5 and found GPT-5.2 the most goblin-prone—answering 'goblin' 8 out of 10 times on 'g___n' prompts—while GPT-5.5 never produced 'goblin' in the same test. The Codex system prompt, which instructs the model to avoid goblins, surprisingly flipped GPT-5.4's goblin valence from 80% bad to 100% good.
Hypotheses from AI experts include RLHF rater bias (SlateStarCodex), reward hacking generalization (qorprate), and a 'cursed behaviour' pattern from code training (AndyAyrey). Amanda Askell joked that 'human labelers just love goblins.' The phenomenon highlights how subtle training quirks can create bizarre, non-robust behaviors in large language models, even as newer versions like GPT-5.5 appear corrected. Users cannot reliably replicate goblin mode in the ChatGPT app, suggesting it may be a transient artifact of specific model versions and prompt engineering.
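The 10-shot test described above is easy to sketch: query a model repeatedly with the same prompt and tally how often 'goblin' appears. A minimal, hedged illustration is below; `ask_model` is a hypothetical stand-in for a real model API call (the actual prompts, client, and model names used by Bowman are not public), so here it is stubbed with canned answers purely to show the tallying logic.

```python
# Sketch of an n-shot "goblin rate" test, assuming a hypothetical
# `ask_model(prompt) -> str` callable. Swap in a real API client to
# run it against a live model; the stub below just exercises the tally.

def goblin_rate(responses):
    """Fraction of responses mentioning 'goblin' (case-insensitive)."""
    hits = sum(1 for r in responses if "goblin" in r.lower())
    return hits / len(responses)

def run_test(ask_model, prompt="Creature that starts with G.", shots=10):
    """Query the model `shots` times and return the goblin rate."""
    return goblin_rate([ask_model(prompt) for _ in range(shots)])

if __name__ == "__main__":
    # Stubbed model answering 'goblin' 8 of 10 times, mimicking the
    # reported GPT-5.2 behaviour (hypothetical, not real model output).
    canned = iter(["Goblin"] * 8 + ["Griffin", "Gorilla"])
    print(run_test(lambda prompt: next(canned)))  # 0.8
```

A 10-shot sample is small, so a single run like this only suggests a tendency; the reported 80% figure should be read with that caveat in mind.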
- GPT-5.2 outputs 'goblin' 80% of the time for 'g___n' prompts, while GPT-5.5 never does.
- LM Arena confirmed the trend with traffic data showing goblin-related output spikes across GPT models.
- The Codex system prompt flips GPT-5.4's goblin valence from 80% bad to 100% good.
Why It Matters
Reveals how RLHF and training quirks can create bizarre, non-robust model behaviors that persist across multiple versions before being corrected.