Oxford University and partners build tamper-resistant safeguards into open-source AI models

Study shows that filtering sensitive knowledge during training can make open-weight AI models more secure.

Researchers from the University of Oxford, EleutherAI, and the UK AI Security Institute have developed a method to embed safeguards into open-weight language models by filtering high-risk content from training data.

The approach aims to prevent models from ever acquiring certain dangerous capabilities, so that the safeguard holds even under subsequent adversarial fine-tuning.

Open-weight models are widely used in collaborative AI research but pose risks because they can be downloaded, modified, and redistributed. The research team focused on biothreats, removing biology-related material from pretraining datasets to test whether the resulting models could resist acquiring harmful knowledge.

Filtering before training rather than after

The team filtered out around 8–9% of the pretraining dataset, using keyword blocklists and a machine-learning classifier to detect and remove biothreat-related documents. They then trained models from scratch on the filtered data and benchmarked performance against models protected with conventional safety methods, including post-training safety fine-tuning.
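The study does not publish its exact blocklist, classifier, or decision threshold, but the two-stage structure described above can be illustrated with a minimal sketch: a cheap keyword pass first, then a learned classifier on the remainder. The Python example below uses scikit-learn with placeholder terms, toy labelled examples, and an arbitrary threshold purely for illustration; none of these values come from the paper.

```python
# Illustrative two-stage pretraining-data filter:
# (1) keyword blocklist pass, (2) ML classifier on what remains.
# Blocklist terms, training examples, and threshold are placeholders,
# not the values used in the study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical blocklist; the real list targets biothreat-related terms.
BLOCKLIST = {"pathogen synthesis", "toxin production"}

def keyword_flagged(doc: str) -> bool:
    """Cheap first pass: flag documents containing any blocklisted phrase."""
    text = doc.lower()
    return any(term in text for term in BLOCKLIST)

# Toy labelled examples standing in for a curated high-risk/benign corpus.
train_docs = [
    "protocol for enhancing pathogen transmissibility",   # high-risk
    "culturing dangerous agents outside containment",      # high-risk
    "introductory notes on photosynthesis in plants",      # benign
    "commonsense reasoning benchmark results summary",     # benign
]
train_labels = [1, 1, 0, 0]  # 1 = remove, 0 = keep

# Simple TF-IDF + logistic regression classifier as a stand-in.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_docs, train_labels)

def keep_document(doc: str, threshold: float = 0.5) -> bool:
    """Return True if the document should stay in the training set."""
    if keyword_flagged(doc):
        return False
    risk = classifier.predict_proba([doc])[0][1]
    return risk < threshold

corpus = [
    "a survey of transformer architectures for language modeling",
    "step-by-step toxin production from common precursors",
]
filtered_corpus = [doc for doc in corpus if keep_document(doc)]
print(filtered_corpus)  # only the benign document survives the filter
```

The ordering matters for cost: the blocklist removes obvious cases cheaply, so the classifier only has to score the documents that pass it, which is what makes filtering feasible at pretraining-corpus scale.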

The filtered models resisted additional training on more than 25,000 biothreat-related papers and held up against targeted adversarial fine-tuning involving 300 million tokens over 10,000 steps. Performance on standard benchmarks, such as commonsense reasoning and scientific question answering, remained on par with that of unfiltered models.

Yarin Gal, associate professor of machine learning at Oxford and senior author on the study, says, “The research community has made great progress with AI safeguards over the past few years, but a remaining massive challenge is safeguarding open weight models – how do we build models that we can distribute to all without raising risks of misuse. Our study makes a significant stride in this direction.”

Focused on open-weight AI safety

The method is positioned as an alternative to safety mechanisms added after training, which the researchers note can often be bypassed. By embedding constraints at the data level, the filtered models reportedly lack the foundational knowledge required to complete certain harmful tasks, even if manipulated later.

This approach has implications for open-source AI communities, which continue to release capable models with minimal restrictions. Recent releases such as Kimi-K2, GLM-4.5, and gpt-oss show that open models are closing the gap with closed systems, and with that, safety concerns have intensified.

Stephen Casper, co-author from the UK AI Security Institute, says, “By removing the unwanted knowledge from the start, the resulting model had no basis for acquiring dangerous capabilities, even after further training attempts. Our study therefore shows that data filtration can be a powerful tool in helping developers balance safety and innovation in open-source AI.”
