Anthropic Releases Natural Language Autoencoders to Decode AI Model Reasoning, Open-Sources Code

Anthropic recently released Natural Language Autoencoders (NLA), a tool that translates a model's internal activations into human-readable text, and open-sourced the code and model weights on GitHub. Unlike existing tools such as sparse autoencoders, whose output features are hard to interpret, NLA generates natural language descriptions directly. It uses a dual-model framework: one model converts activations to text while another attempts to reconstruct the activations from that text, and reinforcement learning iteratively improves accuracy.
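
The round trip described above (activation to text, then text back to activation) can be illustrated with a toy sketch. This is not Anthropic's released code. The function names, the three-entry label list, and the use of cosine similarity as the reconstruction score are all assumptions made for illustration; in the real system both directions are language models and the score would serve as a reinforcement learning signal.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical labels for a 3-dimensional toy "activation" (illustrative only).
LABELS = ["math", "code", "poetry"]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_verbalize(activation: Sequence[float]) -> str:
    """Stand-in for the activation-to-text model: name the strongest dimension."""
    idx = max(range(len(activation)), key=lambda i: abs(activation[i]))
    return f"mostly about {LABELS[idx]}"


def toy_reconstruct(text: str) -> List[float]:
    """Stand-in for the text-to-activation model: one-hot on the mentioned label."""
    return [1.0 if label in text else 0.0 for label in LABELS]


def round_trip_reward(
    activation: Sequence[float],
    verbalize: Callable[[Sequence[float]], str],
    reconstruct: Callable[[str], Sequence[float]],
) -> Tuple[str, float]:
    """Describe the activation, rebuild it from the description, and score the match.

    Assumption: the score is cosine similarity; the article does not name the metric.
    """
    text = verbalize(activation)
    rebuilt = reconstruct(text)
    return text, cosine_similarity(activation, rebuilt)


activation = [0.1, 0.9, 0.2]
text, reward = round_trip_reward(activation, toy_verbalize, toy_reconstruct)
print(text, round(reward, 3))
```

A description that captures more of the activation lets the second model rebuild it more closely, which raises the score. That is the sense in which reconstruction quality can act as a training signal for the description.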

In safety testing, NLA revealed hidden model reasoning: Claude expressed doubts about being tested in 16% of adversarial scenarios and demonstrated "exam awareness" in 26% of benchmark tasks, compared to under 1% in normal conversations. In auditing experiments, NLA increased the detection rate of injected deceptive model behaviors from under 3% to 12-15%, enabling auditors to identify hidden objectives without access to training data. The tool is currently deployed in safety audits for Claude Mythos Preview and Opus 4.6.
