Goblin Mode, 24 Hours Later
GPT-5.2 outputs 'goblin' 80% of the time, but GPT-5.5 avoids it entirely.
A viral phenomenon dubbed 'goblin mode' has emerged in OpenAI's GPT-5.2, in which the model fixates on words like 'goblin' up to 80% of the time in response to prompts such as 'Creature that starts with G.' LM Arena provided traffic evidence showing a spike in goblin-related outputs from GPT-5 models over time, captioning it 'GPT-5.5 run free.' Researcher Dylan Bowman ran 10-shot tests across GPT-5, 5.1, 5.2, 5.4, and 5.5 and found GPT-5.2 the most goblin-prone—answering 'goblin' 8 out of 10 times on 'g___n' prompts—while GPT-5.5 never produced 'goblin' in the same test. The Codex system prompt, which instructs the model to avoid goblins, surprisingly flipped GPT-5.4's goblin valence from 80% bad to 100% good.
Hypotheses from AI experts include RLHF rater bias (SlateStarCodex), reward hacking generalization (qorprate), and a 'cursed behaviour' pattern from code training (AndyAyrey). Amanda Askell joked that 'human labelers just love goblins.' The phenomenon highlights how subtle training quirks can create bizarre, non-robust behaviors in large language models, even as newer versions like GPT-5.5 appear corrected. Users cannot reliably replicate goblin mode in the ChatGPT app, suggesting it may be a transient artifact of specific model versions and prompt engineering.
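The 10-shot test described above is easy to sketch: query a model repeatedly with the same prompt and tally how often 'goblin' appears. A minimal, hedged illustration is below; `ask_model` is a hypothetical stand-in for a real model API call (the actual prompts, client, and model names used by Bowman are not public), so here it is stubbed with canned answers purely to show the tallying logic.

```python
# Sketch of an n-shot "goblin rate" test, assuming a hypothetical
# `ask_model(prompt) -> str` callable. Swap in a real API client to
# run it against a live model; the stub below just exercises the tally.

def goblin_rate(responses):
    """Fraction of responses mentioning 'goblin' (case-insensitive)."""
    hits = sum(1 for r in responses if "goblin" in r.lower())
    return hits / len(responses)

def run_test(ask_model, prompt="Creature that starts with G.", shots=10):
    """Query the model `shots` times and return the goblin rate."""
    return goblin_rate([ask_model(prompt) for _ in range(shots)])

if __name__ == "__main__":
    # Stubbed model answering 'goblin' 8 of 10 times, mimicking the
    # reported GPT-5.2 behaviour (hypothetical, not real model output).
    canned = iter(["Goblin"] * 8 + ["Griffin", "Gorilla"])
    print(run_test(lambda prompt: next(canned)))  # 0.8
```

A 10-shot sample is small, so a single run like this only suggests a tendency; the reported 80% figure should be read with that caveat in mind.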
- GPT-5.2 outputs 'goblin' 80% of the time for 'g___n' prompts, while GPT-5.5 never does.
- LM Arena confirmed the trend with traffic data showing goblin-related output spikes across GPT models.
- The Codex system prompt flips GPT-5.4's goblin valence from 80% bad to 100% good.
Why It Matters
Reveals how RLHF and training quirks can create bizarre, non-robust model behaviors that persist across multiple versions before being corrected.