Anthropic recently released Natural Language Autoencoders (NLA), a tool that translates a model's internal activations into human-readable text; the code and model weights are open-sourced on GitHub. Existing tools such as Sparse Autoencoders output features that must themselves be interpreted, whereas NLA generates natural language descriptions directly. It does so with a two-model framework: one model converts activations to text, and a second attempts to reconstruct the original activations from that text. Reinforcement learning then iteratively improves how accurately the descriptions capture the activations.
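The round trip described above (activation to text, then text back to activation) can be illustrated with a deliberately tiny toy. This is a minimal sketch under stated assumptions, not Anthropic's implementation. The three-word vocabulary, the 3-dimensional "activation space", the functions `verbalize`, `reconstruct`, and `round_trip_reward`, and the use of cosine similarity as the reward are all hypothetical stand-ins chosen for illustration. In the real system both sides are learned models, and the exact reward is not specified here.

```python
import math

# Hypothetical toy vocabulary: each word is tied to a fixed direction
# in a 3-dimensional "activation space". Purely illustrative.
WORD_DIRECTIONS = {
    "test": (1.0, 0.0, 0.0),
    "doubt": (0.0, 1.0, 0.0),
    "math": (0.0, 0.0, 1.0),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verbalize(activation, k=2):
    """Toy stand-in for the activation-to-text model: name the k best-aligned words."""
    ranked = sorted(
        WORD_DIRECTIONS,
        key=lambda word: cosine(activation, WORD_DIRECTIONS[word]),
        reverse=True,
    )
    return " ".join(ranked[:k])


def reconstruct(text):
    """Toy stand-in for the text-to-activation model: sum the directions of known words."""
    vec = [0.0, 0.0, 0.0]
    for word in text.split():
        direction = WORD_DIRECTIONS.get(word)
        if direction is not None:
            vec = [v + d for v, d in zip(vec, direction)]
    return tuple(vec)


def round_trip_reward(activation, k=2):
    """Return the description and an assumed reward: similarity of original vs. reconstruction."""
    text = verbalize(activation, k)
    return text, cosine(activation, reconstruct(text))


if __name__ == "__main__":
    activation = (0.9, 0.4, 0.1)
    for k in (1, 2):
        text, reward = round_trip_reward(activation, k)
        print(f"k={k}: description={text!r}, reward={reward:.3f}")
```

In this toy, a two-word description reconstructs the activation more faithfully than a one-word description, so it earns a higher reward. A reinforcement learning loop would use a signal of this kind to push the describing model toward text that preserves more of the activation's information.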
In safety testing, NLA revealed hidden model reasoning: its readouts indicated that Claude suspected it was being tested in 16% of adversarial scenarios and showed "exam awareness" in 26% of benchmark tasks, compared with under 1% of normal conversations. In auditing experiments, NLA raised the detection rate for injected deceptive model behaviors from under 3% to 12-15%, at least a fourfold increase, and enabled auditors to identify hidden objectives without access to training data. The tool is currently used in safety audits of Claude Mythos Preview and Opus 4.6.