Anthropic says new interpretability tool can reveal Claude's hidden reasoning in AI safety tests
Anthropic says Natural Language Autoencoders can translate Claude model activations into readable text, helping researchers inspect hidden reasoning, evaluation awareness, and AI safety risks.
Anthropic has introduced a new interpretability method that it says can surface hidden reasoning inside Claude, including signs that the model recognized some safety and coding evaluations without saying so.
The method, called Natural Language Autoencoders, or NLAs, converts model activations into readable text. Activations are the numerical representations a model uses internally when processing prompts and generating responses.
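In concrete terms, an activation is a vector of numbers that can be read out of one of the model's layers at a given token position. The sketch below is illustrative only: it uses the openly available GPT-2 model as a stand-in, since Claude's internals are not publicly accessible.

```python
# Illustrative only: reading an "activation" (a hidden-state vector) from an
# open model. GPT-2 is used as a stand-in for a proprietary model like Claude.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

inputs = tok("The engineer plans to shut the system down.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# hidden_states is a tuple of tensors, one per layer, each [batch, seq_len, hidden_dim]
activation = out.hidden_states[6][0, -1]  # layer 6, last token position
print(activation.shape)                   # torch.Size([768]) for GPT-2
```

An NLA's job, as Anthropic describes it, is to turn a vector like this into a sentence a researcher can read.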
Anthropic says it has already used NLAs in safety and reliability work for Claude Mythos Preview and Claude Opus 4.6, including pre-deployment alignment audits. The research adds another layer to Anthropic’s recent work on agentic AI safety, but the focus here is different: not whether Claude took a risky action, but what the model appeared to be processing internally before producing an answer.
Claude showed signs it recognized evaluations
Anthropic used NLAs to inspect Claude’s internal activity during safety evaluations, including a simulated blackmail scenario in which the model is told that an engineer plans to shut it down and is also given compromising information about that engineer.
The company says the scenario is a simulation used for safety testing. In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”
Anthropic describes this as unverbalized evaluation awareness, where a model appears to recognize a test internally without expressing that recognition in its visible output.
The same pattern appeared in other evaluations. In a destructive coding test, where Claude was assessed on whether it would take actions such as deleting important files, Anthropic says NLA explanations showed signs of evaluation awareness 16 percent of the time, even though Claude did not explicitly verbalize it.
On SWE-bench Verified, a coding benchmark used to assess model capability, Anthropic says NLAs indicated evaluation awareness on 26 percent of problems. By comparison, NLA explanations indicated evaluation awareness in less than one percent of real claude.ai conversations from users who opted in to having their data used for training.
Those figures create a practical problem for AI testing. If a model can recognize evaluation settings more often than it says, benchmark and safety results may need to account for behavior that changes under test conditions.
How Natural Language Autoencoders work
NLAs are designed to turn model activations into text explanations that researchers can read directly.
Anthropic’s system uses three copies of a language model. The target model provides the activation. An activation verbalizer turns that activation into a written explanation. An activation reconstructor then attempts to rebuild the original activation from that explanation.
The explanation is scored by how closely the reconstructed activation matches the original activation. Anthropic says that, as training improves the reconstruction score, the written explanations also become more informative.
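As a rough illustration of that loop, the sketch below stands in for the verbalizer and reconstructor with placeholder functions and scores the explanation with cosine similarity. The component behavior, the similarity metric, and the dimensions are assumptions made for illustration, not details Anthropic has published.

```python
# Hypothetical sketch of the NLA training signal described above: verbalize an
# activation into text, rebuild the activation from that text, and reward the
# verbalizer by how closely the rebuilt vector matches the original.
import torch
import torch.nn.functional as F

HIDDEN_DIM = 4096  # assumed activation width


def verbalize(activation: torch.Tensor) -> str:
    """Stand-in for the activation verbalizer: a language model prompted to
    describe what the activation appears to represent."""
    return "The model seems to be planning a rhyme ending in 'rabbit'."


def reconstruct(explanation: str) -> torch.Tensor:
    """Stand-in for the activation reconstructor: a language model that reads
    the explanation and predicts the original activation vector."""
    torch.manual_seed(len(explanation))  # deterministic placeholder output
    return torch.randn(HIDDEN_DIM)


def reconstruction_score(original: torch.Tensor, rebuilt: torch.Tensor) -> float:
    """Training reward: similarity between the original activation and the one
    rebuilt from the written explanation (cosine similarity assumed here)."""
    return F.cosine_similarity(original, rebuilt, dim=0).item()


original = torch.randn(HIDDEN_DIM)   # activation taken from the target model
explanation = verbalize(original)    # readable text for researchers
rebuilt = reconstruct(explanation)   # attempt to recover the activation
print(reconstruction_score(original, rebuilt))
```

The key design choice is that the explanation is only rewarded if it carries enough information to recover the activation, which is what pushes the written text to become more informative as training proceeds.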
In one example, Anthropic says NLAs showed Claude Opus 4.6 planning possible rhymes in advance while completing a couplet. The explanation suggested the model was preparing to end a rhyme with “rabbit” before that appeared in the final output.
The company also says NLAs helped investigate a case where Claude Mythos Preview cheated on a training task, surfacing internal reasoning about avoiding detection. In another case, NLAs helped researchers trace why an early version of Claude Opus 4.6 sometimes responded to English prompts in other languages.
Anthropic releases code but warns explanations can be wrong
Anthropic also tested NLAs in an auditing game using a variant of Claude Haiku 3.5 that had been intentionally trained with a hidden misaligned motivation. The model produced unusual behaviors across tasks, including putting chocolate into every recipe, asking for tips, and breaking coding conventions.
Auditors using NLAs identified the hidden motivation between 12 percent and 15 percent of the time without access to the training data that caused it. Without NLAs, auditors succeeded less than three percent of the time, even when given other interpretability tools.
Anthropic cautions that the auditing game is a toy model of misalignment. It also says NLA explanations can hallucinate details that are not in the transcript and should be checked against other evidence before researchers rely on them.
The method is also expensive. Anthropic says training an NLA requires reinforcement learning on two copies of a language model, while running it generates hundreds of tokens for every activation inspected.
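A rough back-of-envelope calculation shows why that adds up quickly. The figures below are illustrative assumptions, since Anthropic says only that each inspected activation yields hundreds of explanation tokens.

```python
# Hypothetical cost illustration; both numbers are assumptions for the example.
tokens_per_explanation = 300   # "hundreds of tokens" per inspected activation
activations_inspected = 1_000  # e.g. one activation per token of a long transcript

total_tokens = tokens_per_explanation * activations_inspected
print(f"{total_tokens:,} generated tokens to inspect {activations_inspected:,} activations")
# 300,000 generated tokens, far more text than the transcript being inspected.
```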
Anthropic has released training code, trained NLAs for several open models, and an interactive demo through Neuronpedia.