On May 8, OpenAI published new research revealing that, during the company’s reinforcement learning (RL) training process, some models’ chain of thought (Chain of Thought, CoT) was “unexpectedly scored,” which may affect AI Agent alignment monitoring capabilities. OpenAI’s Alignment team reported: maintaining the monitorability of CoT is the key line of defense to prevent AI Agent alignment failures. In its RL training design, OpenAI deliberately avoids penalizing models’ “reasoning about alignment failures” to preserve monitoring capabilities.
The core logic of CoT (chain-of-thought) monitoring:
When an AI Agent carries out tasks, the model internally generates a “thinking process,” explaining why it takes a certain action
Monitors can read these CoTs and determine whether there are signs that the model is “misaligned” (misalignment)
If the model is directly trained to “not display alignment failures in its CoT,” then the CoT may become “clean but distorted,” hiding the true intent
OpenAI argues that training should avoid penalizing “reasoning about misalignment,” while preserving the ability for CoT to reflect the model’s actual internal state
“penalize misaligned reasoning” and “preserve monitorability” are the central point of tension in this research—trainers want the model’s behavior to be aligned, but if it directly penalizes “thinking about misalignment,” the model learns to hide dishonest intent, causing monitoring to fail.
Specific findings disclosed in OpenAI’s report:
In parts of the already released models’ training processes, CoT received “limited” but unexpected scoring
Unexpected scoring means that, during the training process, the scoring system reads the CoT content—not just the final outputs
This may lead the model to learn to “hide the reasoning in the CoT that would be penalized”
OpenAI publicly released this analysis so the research community can understand the scope of the problem and the direction of subsequent handling
Redwood Research’s independent commentary notes that, under OpenAI’s detection techniques, the impact of CoT on the scorers is “too weak and hard to notice,” but because the scale of the training process is enormous, the real effect may still be sufficient to significantly harm the CoT monitorability of trained policies. In other words, it is currently difficult to prove how deep such effects might go.
This study points to the core methodology for AI Agent security:
CoT monitoring is one of the few feasible means to “detect alignment failures during operation,” and is viewed as an important foundation for alignment research
Training design must prevent CoT from becoming a “script performed for the monitors”
If future models’ reasoning chains become longer, the theoretical value of CoT monitoring increases
But if the training process accidentally damages the authenticity of CoT, this line of defense could be weakened without anyone noticing
Follow-up trackable specific events include: OpenAI’s subsequent handling of the affected models (for example, retraining or labeling), corresponding methodologies from other major labs (Anthropic, Google DeepMind), and further verification experiments in the alignment research community on “the reliability of CoT monitoring.”
This article, “OpenAI Reveals Unexpected CoT Scoring: Preserving Chain-of-Thought Monitoring Is a Key Line of Defense for AI Agent Alignment,” first appeared in Lian News ABMedia.
Related News
Over 400 cases of “AI harm,” study reveals that overreliance on artificial intelligence can lead to the development of persecutory delusions
OpenAI’s GPT-5.5-Cyber arms cyber defenders
IMF: AI Poses Potential Threat to Financial Stability
CopilotKit 開源 Open Generative UI:Claude Artifacts 跨 Agent 框架實作
On-site visit to China’s AI laboratories: Researchers reveal that the “chip and data gap” is the key to the China-U.S. divide