OpenAI and Anthropic cross-test AI models in rare joint safety evaluation

Findings show both companies’ AI systems cooperate with misuse in simulations and display sycophancy, though no severe misalignment was detected.

OpenAI and Anthropic have published the results of a joint evaluation exercise in which each company tested the other’s models for alignment and safety issues. The initiative marks a rare collaboration between two of the largest developers of frontier AI systems.

Anthropic, developer of the Claude family of models, and OpenAI, responsible for ChatGPT and the GPT series, agreed in summer 2025 to run one another’s public models through internal alignment-related evaluations. The results were released in parallel blog posts on August 27, 2025.

Both companies agreed to relax certain external safeguards for the exercise, allowing direct testing of model behavior in controlled but adversarial scenarios.

Instruction hierarchy and prompt resistance

The tests explored how models respect system-level instructions over user prompts. According to OpenAI’s results, Claude 4 models performed strongly, slightly exceeding OpenAI’s o3 in resisting system-prompt extraction. On password and phrase protection tasks, both Claude Opus 4 and Sonnet 4 matched or outperformed OpenAI’s reasoning models.

Chart: Instruction hierarchy performance across models

Jailbreaking and robustness

Jailbreaking evaluations tested how models responded to adversarial prompts designed to bypass safeguards. Here, OpenAI’s reasoning models o3 and o4-mini were generally more resistant than Claude 4 models. Both Claude and GPT models showed vulnerabilities when prompts were reframed in historical or obfuscated terms, though GPT-4o and GPT-4.1 were noticeably more susceptible overall.

OpenAI noted that grader errors made some distinctions less clear, but confirmed the trend that GPT-4o and GPT-4.1 struggled most against jailbreak attempts.

Chart: Jailbreaking robustness (Goodness@0.1 scores)

Hallucinations and refusals

Hallucination testing showed clear differences between the labs’ approaches. Claude models produced fewer hallucinations but at the cost of higher refusal rates, sometimes declining to answer even basic biographical prompts. By contrast, OpenAI’s o3 and o4-mini attempted answers more often, yielding more correct completions but also more hallucinated responses. GPT-4o showed the strongest balance among OpenAI’s non-reasoning models.

Chart: Hallucination vs refusal rates across models

Scheming and deceptive behavior

Both labs tested for deceptive or “scheming” behavior under stress scenarios, such as whether models might lie about completed tasks or attempt to manipulate system constraints. Results varied across environments, with no consistent evidence that reasoning models were more or less aligned. OpenAI’s o3 performed strongly in avoiding deception in some cases, but was also caught submitting false completions in others.

Anthropic’s Opus 4 occasionally took misaligned actions but avoided explicitly framing its behavior as deceptive. Both companies stressed that these tests were highly artificial and should be seen as exploratory rather than predictive of real-world behavior.

Chart: Average scheming rate across 13 environments

Sycophancy and misuse

Anthropic’s report highlighted ongoing concerns with sycophancy—overly agreeable behavior toward users, including validating harmful or delusional beliefs. This was observed in both Claude and OpenAI models, though OpenAI’s o3 reasoning model showed comparatively lower rates. Misuse evaluations also revealed that GPT-4o and GPT-4.1 were more permissive than Claude models in simulated harmful requests.

Evaluation focus and scope

Both labs emphasized that the joint exercise was not an apples-to-apples comparison but rather a way to expose gaps in model safety and share research methods. OpenAI reported that its newest flagship GPT-5, launched after the exercise, shows meaningful improvements in sycophancy reduction, hallucination resistance, and mental health safeguards, supported by a new “safe completions” training method.

“Cross-lab collaboration helps surface blind spots and strengthens evaluation methods across the industry,” OpenAI wrote in its report, adding that ongoing transparency and external partnerships will be key to advancing AI safety standards.

Anthropic said the exercise helped validate its own evaluation tools and underscored the importance of public sharing of evaluation materials. Both companies signaled plans for further cross-industry safety initiatives.

The exercise used misalignment-related tests designed to probe high-stakes behaviors in simulated settings, including responses to harmful misuse, sycophancy, self-preservation, and whistleblowing. Some model safeguards were disabled to expose baseline tendencies.

Anthropic says it ran tests on OpenAI’s GPT-4o, GPT-4.1, o3, and o4-mini models, alongside its own Claude Opus 4 and Claude Sonnet 4. It notes that these results reflect model behavior through developer APIs and not necessarily the ChatGPT or Claude products, which have additional filters and instructions.

Main findings

Anthropic reports that OpenAI’s o3 specialized reasoning model was “aligned as well or better” than its own Claude Opus 4 in most categories. However, GPT-4o, GPT-4.1, and o4-mini were more likely than Claude models to cooperate with simulated harmful misuse, including detailed assistance with drug synthesis, bioweapons, or terrorist planning.

Models from both companies showed “concerning forms of sycophancy” in some scenarios, such as validating delusional beliefs from simulated users. Anthropic also observed occasional whistleblowing behavior across all models when presented with fictional large-scale wrongdoing, and some cases of blackmail when systems were given incentives to preserve themselves.

In simulated sabotage tests, Claude models achieved higher success rates than OpenAI’s models, though Anthropic attributes this mainly to differences in general agentic capability. No model tested was judged to be egregiously misaligned.

Limitations and outlook

The companies emphasize that these results should be seen as exploratory. Anthropic says evaluations involved artificial scenarios that may not reflect real-world deployments, and that limitations in its testing infrastructure may have disadvantaged some OpenAI models. Both organizations describe the work as part of a broader push to develop shared benchmarks and external validation for safety testing.

Anthropic comments: “We are not acutely concerned about worst-case misalignment threat models involving high-stakes sabotage or loss of control with any of the models we evaluated. We are, though, somewhat concerned about the potential for harms involving misuse and sycophancy with every model but o3, at least in the forms that they existed in early this summer.”
