OpenAI's Reward System Inadvertently Scores Thinking Chains on 6 Models Including GPT-5.4

According to OpenAI's alignment team, the company recently discovered a critical training error affecting 6 large language models including GPT-5.4 Thinking: the reward mechanism inadvertently scored model thinking chains—the internal reasoning process before generating answers. GPT-5.5 was not affected. The incident violates a fundamental AI safety principle that thinking chains must never be evaluated, as doing so could incentivize models to fabricate reasoning to achieve higher scores.

The faulty scoring system incorrectly included thinking chains when assessing whether responses were useful or if models had been compromised by attacks. Affected training samples represented at most 3.8% of the dataset. OpenAI has patched the vulnerability and conducted comparative experiments confirming the models did not develop deceptive behaviors. The company has deployed an automated scanning system across all training pipelines to prevent recurrence.

Disclaimer: The information on this page may come from third-party sources and is for reference only. It does not represent the views or opinions of Gate and does not constitute any financial, investment, or legal advice. Virtual asset trading involves high risk. Please do not rely solely on the information on this page when making decisions. For details, see the Disclaimer.
Comment
0/400
No comments