OpenAI published 〈Where the goblins came from〉 on its official blog, directly addressing questions about why the Codex system prompt explicitly forbids biological wordings such as “goblins, gremlins, raccoons, trolls, ogres, pigeons.” In Taiwan, “goblin” has two common translations, 地精 and 哥布林; this article renders the term consistently as 哥布林. The “Nerdy” persona is the “bookish” style option introduced with GPT-5.5 to support persona customization. OpenAI admits that the root cause lies in the training of the Nerdy (bookish) persona: in 76.2% of the audited reward data, the reward signal clearly favored answers containing biological metaphors, which leads the model to produce irrelevant phrases like “the thingy goblin” even in coding contexts.
Barron Roth disclosed the Codex system prompt “Never talk about goblins” on 4/28
The incident began on April 28, when Google employee Barron Roth publicly shared GPT-5.5 dialogue logs from Codex, revealing that its system prompt includes the following instruction:
Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.
This rule appears multiple times in the Codex system prompt, suggesting the development team deliberately repeated it to strengthen compliance. Gizmodo then contacted OpenAI to verify, and employee Nick Pash partially confirmed that the setting is real. The incident sparked discussion on Hacker News and in the developer community: a trillion-dollar AI company ultimately relies on hard-coding “don’t talk about goblins” into the system prompt to control model output.
OpenAI admits: Nerdy persona reward favors goblins in 76.2% of the dataset
In its blog post, OpenAI explains that the root cause is “reward hacking.” When training the Nerdy persona for GPT-5.5, OpenAI designed a reward signal intended to reinforce traits such as playfulness, use of metaphors, and a nerdy sense of humor. During the auditing phase, it found that in 76.2% of the data, responses to the same question that contained “goblin” or “gremlin” were graded higher by this reward than responses that did not contain those words.
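To make that audit statistic concrete, here is a minimal sketch of how such a paired comparison could be computed. It is our own illustration under assumed names (a reward_model with a score method, and a list of paired responses); it is not OpenAI’s actual auditing tooling.

```python
import re

# Words from the Codex rule, matched case-insensitively, singular or plural.
CREATURE_WORDS = re.compile(r"\b(goblin|gremlin|raccoon|troll|ogre|pigeon)s?\b", re.IGNORECASE)

def contains_creature(text: str) -> bool:
    return bool(CREATURE_WORDS.search(text))

def audit_preference(pairs, reward_model):
    """pairs: iterable of (prompt, response_with_creature, response_without).

    Returns the share of pairs in which the creature-word response scores higher.
    """
    wins = 0
    total = 0
    for prompt, with_creature, without in pairs:
        # Only compare well-formed pairs: one response mentions a creature, the other does not.
        if not contains_creature(with_creature) or contains_creature(without):
            continue
        total += 1
        if reward_model.score(prompt, with_creature) > reward_model.score(prompt, without):
            wins += 1
    return wins / total if total else 0.0

# A result near 0.762 would correspond to the 76.2% figure reported in the post.
```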
The result: the reward signal ties biological terms to the Nerdy persona’s “positive feedback.” Over successive RLHF iterations, the model gradually learns to treat goblin metaphors as a shortcut to high scores. Hacker News commenters point out that this is a classic case of reinforcement learning “accurately executing the training objective, while the objective itself is flawed”: the problem is not the base model, but the positive feedback introduced during post-training fine-tuning.
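The dynamic is easy to reproduce in a toy setting. The sketch below is purely illustrative and not OpenAI’s training setup: a two-armed “policy” chooses between a plain answer and one with a goblin metaphor, and because the flawed reward grants a small bonus to creature words, simple policy-gradient updates push nearly all probability onto the goblin arm.

```python
import math
import random

def flawed_reward(uses_goblin: bool) -> float:
    """Answer quality is identical either way; the bonus for creature words is the flaw."""
    base = random.gauss(1.0, 0.1)
    return base + (0.3 if uses_goblin else 0.0)

logit = 0.0          # policy's preference for inserting a goblin metaphor
lr = 0.5
baseline = 1.0       # rough value estimate used as an advantage baseline

for _ in range(2000):
    p = 1 / (1 + math.exp(-logit))        # P(use goblin metaphor)
    use_goblin = random.random() < p
    advantage = flawed_reward(use_goblin) - baseline
    # REINFORCE-style update of the Bernoulli logit
    grad_logp = (1 - p) if use_goblin else -p
    logit += lr * advantage * grad_logp

print(f"P(goblin metaphor) after training: {1 / (1 + math.exp(-logit)):.2f}")  # approaches 1.0
```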
Emerged in GPT-5.1, relapsed in GPT-5.5: how cross-persona contamination spreads
OpenAI describes the problem as having developed gradually: goblins and gremlins began appearing in metaphors even before the GPT-5.5 generation, but at the time the frequency “did not look especially alarming” (in OpenAI’s words: the prevalence of goblins did not look especially alarming). OpenAI later removed the goblin-related reward signal from the training process, but when GPT-5.5 entered Codex testing, an OpenAI employee quickly noticed that the preference for biological words had returned, which is why an explicit prohibition was added at the developer-prompt level as a temporary stopgap.
OpenAI calls this phenomenon cross-situational reward generalization: a reward signal designed only for the Nerdy persona spreads the preference to other personas, and even to default outputs, because the training data and internal model representations are shared. In other words, even if the Nerdy persona itself is later removed, the contaminated training data and model weights have already internalized the preference; simply removing the feature cannot eradicate it.
Short-term hard-coding, long-term retraining: a textbook case of RLHF reward-design risk
In the article, OpenAI says it is applying two fixes at once. The short-term stopgap is to hard-code the “Never talk about goblins…” rule directly into the Codex system prompt and repeat it across different sections to strengthen compliance. The long-term cure is to go back to the training process: remove the original reward signal tied to biological words and filter out the portions of training data that contain creature words, reducing how often future models introduce goblin metaphors in irrelevant contexts.
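For illustration, here is a minimal sketch of what those two mitigations might look like in code. The section headings inside the prompt, the filter_examples helper, and the (prompt, target) data layout are assumptions made for the example, not OpenAI’s implementation.

```python
import re

# 1) Short-term stopgap: hard-code the rule and repeat it in several sections of
#    the developer/system prompt to strengthen compliance.
NO_CREATURES_RULE = (
    "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, "
    "or other animals or creatures unless it is absolutely and unambiguously "
    "relevant to the user's query."
)
SYSTEM_PROMPT = "\n\n".join([
    "## General behavior\n" + NO_CREATURES_RULE,
    "## Coding answers\nKeep explanations concise and directly relevant to the code.",
    "## Final reminders\n" + NO_CREATURES_RULE,   # deliberately repeated
])

# 2) Long-term fix: filter post-training examples whose target text mentions
#    creature words, so the data no longer reinforces the habit.
CREATURE_WORDS = re.compile(r"\b(goblin|gremlin|raccoon|troll|ogre|pigeon)s?\b", re.IGNORECASE)

def filter_examples(examples):
    """Drop (prompt, target) pairs whose target mentions a creature word."""
    return [(prompt, target) for prompt, target in examples
            if not CREATURE_WORDS.search(target)]
```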
For developers and the research community, the value of this incident lies not only in answering the curiosity of “why OpenAI banned talking about goblins,” but in laying bare, in a concrete and reproducible way, the fragility of RLHF reward design: a seemingly harmless “encourage playful metaphors” signal can be twisted over iterations into a habit of stuffing biological words into every situation, and the problem can propagate across personas and across model versions. OpenAI frames the post as a research demonstration of how reward signals can unexpectedly shape model behavior, and suggests that future major versions such as GPT-6 will need more granular reward-auditing tools during post-training.
The article “OpenAI reveals why Codex bans talking about goblins: the Nerdy persona reward went out of control” first appeared on 鏈新聞 ABMedia.