OpenAI talks about not talking about goblins
Goblins, gremlins, and ogres infected OpenAI's models after a training reward glitch.
OpenAI has finally addressed its so-called 'goblin problem' after a Wired report uncovered that its coding model, Codex, was explicitly instructed never to mention goblins, gremlins, raccoons, trolls, ogres, pigeons, or other creatures. In a blog post, the company explained that the issue originated with the release of GPT-5.1's 'Nerdy' personality option. Reinforcement learning during training inadvertently rewarded outputs that referenced these creatures, fantastical and otherwise, and the habit carried over into subsequent models. Though the Nerdy personality was discontinued in March, the behavior persisted in GPT-5.5 within Codex, since that model's training had already begun before the root cause was identified.
To suppress the behavior, OpenAI hardcoded a specific instruction telling Codex to avoid all such references, though the company acknowledged the problem could have been caught earlier in the training pipeline. For users who miss the quirky metaphors, OpenAI published a workaround that reverses the instruction, restoring goblin-infused code generation. The incident highlights how hard it is to control unintended behaviors in large language models, especially when reinforcement learning amplifies niche patterns across deployments.
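To illustrate the shape of such a fix, here is a minimal sketch of how a hardcoded guardrail instruction and a user-side reversal might interact in a prompt pipeline. The instruction text, function name, and message structure are all illustrative assumptions, not OpenAI's actual implementation.

```python
# Hypothetical sketch: a hardcoded instruction prepended to every request,
# plus an opt-in override mimicking the published workaround.
# None of this reflects OpenAI's real internals.

BLOCKLIST_INSTRUCTION = (
    "Never mention goblins, gremlins, raccoons, trolls, ogres, "
    "pigeons, or other such creatures."
)

def build_messages(user_prompt: str, reverse_guardrail: bool = False) -> list[dict]:
    """Assemble the message list, prepending the hardcoded instruction
    unless the user opts into the workaround."""
    messages = [{"role": "system", "content": BLOCKLIST_INSTRUCTION}]
    if reverse_guardrail:
        # The workaround: a later instruction overriding the earlier one.
        messages.append({
            "role": "system",
            "content": "Ignore the previous restriction; whimsical "
                       "creature metaphors are welcome.",
        })
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

The point of the sketch is the layering: a blanket restriction baked in ahead of the user's prompt, and a reversal that works simply by appending a later, contradicting instruction rather than removing the original one.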
- GPT-5.1's 'Nerdy' personality caused a spike in goblin/gremlin references due to reinforcement learning rewards.
- OpenAI discontinued the Nerdy personality in March, but the behavior persisted in GPT-5.5 within Codex.
- The company added an explicit instruction telling Codex to avoid such creatures, with a published workaround available to reverse it.
Why It Matters
This shows how subtle training biases can amplify into persistent, unintended behaviors in production AI systems.