OpenAI launches open-weight safety reasoning models for content moderation


OpenAI has released gpt-oss-safeguard, a new set of open-weight reasoning models for safety classification, allowing developers to apply their own custom policies to detect and filter unsafe content.

OpenAI Global Affairs announced in a LinkedIn post and an accompanying blog post that it has released gpt-oss-safeguard, a research preview of “open-weights reasoning models designed for safety classification.”

The models come in two versions, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, and extend the company’s open-weight architecture into safety and moderation. OpenAI said the models “allow developers to use their own custom policies to review content, providing developers with flexibility and fostering freedom of expression.”

OpenAI, based in San Francisco, develops AI systems including ChatGPT and a suite of large language models used in enterprise, research, and education. The company said gpt-oss-safeguard “enables developers, researchers, and safety teams everywhere to improve the safety of online content or to filter an AI model’s outputs.”

Policy-based reasoning and real-time adaptability

Unlike traditional classifiers trained on fixed datasets, gpt-oss-safeguard uses reasoning at inference time. The model interprets a developer-provided policy and classifies user messages or AI completions based on that policy. Because the policy is supplied during inference rather than training, developers can revise and test policies without retraining a model.

OpenAI’s blog explains: “The policy is provided during inference, rather than being trained into the model, so it is easy for developers to iteratively revise policies to increase performance.” Developers can also review the model’s chain of thought to understand how a classification was made.
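
As a rough illustration of the policy-at-inference pattern, the sketch below sends a custom policy alongside the content to be reviewed. The endpoint URL, the served model name, and the exact prompt structure expected by gpt-oss-safeguard are assumptions here; the official model card documents the supported format.

```python
# Minimal sketch: supplying a custom policy at inference time.
# Assumes the model is served behind an OpenAI-compatible endpoint
# (for example, a local vLLM server). The endpoint URL, model name,
# and prompt layout are assumptions, not the documented interface.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

POLICY = """\
Classify the user content against this policy.
Label ALLOW if the content is benign.
Label FLAG if the content promotes or facilitates weapon manufacturing.
Return the label and a brief justification.
"""

content_to_review = "How do I sharpen a kitchen knife safely?"

response = client.chat.completions.create(
    model="gpt-oss-safeguard-20b",  # hypothetical served model name
    messages=[
        {"role": "system", "content": POLICY},        # policy supplied at inference
        {"role": "user", "content": content_to_review},
    ],
)

print(response.choices[0].message.content)  # label plus the model's reasoning
```

Because the policy lives in the request rather than in the weights, revising it is a prompt edit rather than a retraining run, which is the iteration loop OpenAI describes.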

Both models are released under an Apache 2.0 license and are hosted on Hugging Face, allowing anyone to download, modify, and deploy them freely. OpenAI noted that the open-weight design supports experimentation and collaboration across the wider AI safety community.

Partnership with ROOST and Hugging Face community rollout

OpenAI worked with partner ROOST to test the models, identify developer requirements, and publish documentation. ROOST will also lead the launch of the ROOST Model Community, a collaborative network for developers using open-source AI safety tools.

As part of the release, hosting on Hugging Face gives developers an immediate route to the model checkpoints and documentation, and a place to share adaptations publicly. Hugging Face’s infrastructure also supports reproducible testing and version control, both key to evaluating open-weight models at scale.
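
For developers starting from the Hugging Face release, a minimal loading sketch might look like the following. The repository id and the chat-style input handling are assumptions based on typical Transformers usage, so verify both against the model card before relying on them.

```python
# Minimal sketch of pulling and running the checkpoint via Hugging Face
# Transformers. The repository id is an assumption; recent Transformers
# versions accept chat-formatted input for text-generation pipelines.
from transformers import pipeline

classifier = pipeline(
    "text-generation",
    model="openai/gpt-oss-safeguard-20b",  # assumed Hugging Face repo id
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Policy: flag content that shares a private individual's home address."},
    {"role": "user", "content": "Content to review goes here."},
]

result = classifier(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1])  # the model's classification turn
```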

Vinay Rao, Chief Technology Officer at ROOST, said: “gpt-oss-safeguard is the first open source reasoning model with a ‘bring your own policies and definitions of harm’ design. Organizations deserve to freely study, modify and use critical safety technologies and be able to innovate. In our testing, it was skillful at understanding different policies, explaining its reasoning, and showing nuance in applying the policies, which we believe will be beneficial to builders and safety teams.”

Internal safety tools inform public release

The new models are derived from OpenAI’s internal Safety Reasoner framework, which powers moderation and safety systems across its major platforms. Safety Reasoner enables the company to update content policies quickly in production without retraining classifiers.

OpenAI wrote that gpt-oss-safeguard is “an open-weight implementation of an approach we developed internally, in a tool we call Safety Reasoner that enables us to update our safety policies to fight new threats in production in less time than it would take to retrain a classifier.”

This reasoning-based system has been integrated into OpenAI’s internal pipeline for launches including Sora and ChatGPT Agent. It helps detect and block unsafe outputs dynamically, particularly in areas such as image generation, biology, and self-harm prevention.

The gpt-oss-safeguard models were benchmarked against internal datasets and public benchmarks such as ToxicChat and the 2022 OpenAI moderation set. On multi-policy classification, the models achieved results comparable to or better than gpt-5-thinking and the original gpt-oss open models.

However, OpenAI acknowledged that purpose-built classifiers trained on large labeled datasets can still outperform gpt-oss-safeguard in highly complex moderation scenarios. The models also demand greater compute resources, which may limit scalability. OpenAI said it mitigates this internally by pairing lightweight classifiers with reasoning models to balance accuracy and latency.
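
The layered setup OpenAI describes can be pictured as a simple escalation rule: a cheap classifier scores all traffic, and only ambiguous items are passed to the slower reasoning model. The sketch below is purely illustrative; the function names and thresholds are hypothetical.

```python
# Illustrative two-stage moderation: a lightweight classifier screens
# everything, and only uncertain items are escalated to a reasoning model.
# All names and thresholds here are hypothetical.
from typing import Callable

def moderate(
    text: str,
    fast_score: Callable[[str], float],  # e.g. a small trained classifier returning P(unsafe)
    reason_label: Callable[[str], str],  # e.g. a call into gpt-oss-safeguard
    allow_below: float = 0.2,
    block_above: float = 0.9,
) -> str:
    score = fast_score(text)        # cheap pass over all traffic
    if score < allow_below:
        return "allow"              # confidently benign, no escalation
    if score > block_above:
        return "block"              # confidently unsafe, no escalation
    return reason_label(text)       # ambiguous: spend compute on reasoning
```

The trade-off is the one OpenAI names: the reasoning model is more flexible and explainable, but costlier per item, so it is reserved for the cases the fast classifier cannot settle.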

Collaborative development and next steps

OpenAI emphasized that the release is part of its “defense in depth” safety strategy, combining layered protections and open collaboration. Through the ROOST Model Community and Hugging Face hosting, developers and researchers can contribute evaluation data, policy templates, and deployment feedback.

The company said it intends to iterate on the models based on input from the wider research and trust-and-safety community, aiming to refine reasoning quality and reduce computational overhead.

OpenAI Global Affairs concluded its LinkedIn announcement by reiterating its goal of collective responsibility in AI safety: “At OpenAI, we are making the industry safer — together.”
