NIST proof says AI guardrails cannot block every adversarial prompt

AICyber security

12 Jun

The IEEE Security & Privacy paper argues organizations using AI need continuous red-teaming, updates, and resilience planning rather than fixed safeguards.

Digital padlock over a circuit board for a story on NIST research into AI guardrails and adversarial prompts — ***A NIST scientist has published a mathematical proof arguing that no finite set of AI guardrails can be universally robust against adversarial prompts***

A National Institute of Standards and Technology scientist has published a mathematical proof arguing that no finite set of AI guardrails can be universally robust against adversarial prompts.

The paper, by Apostol Vassilev, appears in the May-June 2026 issue of IEEE Security & Privacy. Vassilev is a senior scientist at the National Institute of Standards and Technology and works on adversarial machine learning.

The finding applies the logic of Kurt Gödel’s incompleteness theorems to artificial intelligence systems. NIST says the proof helps explain why organizations developing or deploying AI need to move away from a fixed “one and done” security model.

The research is relevant to AI systems used in education, scientific discovery, government, and enterprise settings, where guardrails are used to prevent models from generating prohibited or unsafe outputs.

Vassilev said in a LinkedIn post that the core finding is: “No finite set of guardrails is universally robust against adversarial prompts.”

Fixed AI guardrails have limits

AI developers use guardrails to stop models from generating outputs such as malware, deepfakes, instructions for making biological weapons, illicit drug guidance, or other prohibited content.

NIST says those guardrails are finite systems of rules. Vassilev’s proof argues that, however carefully they are designed, there will always be some prompt that could potentially evade them.

Apostol Vassilev, senior scientist at NIST, says: “One of the pillars of responsible AI is that you want the technology to be secure. You want it to withstand adversarial attacks and perform only what you want it to do, not what an attacker might want. What this proof shows is that there is no finite set of guardrails that is universally robust against adversarial prompts.”

The NIST article compares the issue to Gödel’s 1931 work, which showed limits to what can be proved within a system built on a finite number of rules.

Vassilev says: “Gödel put an end to this dream. He showed that you can’t have a finite set of statements and create a theory that is complete and consistent without contradictions. You can add more statements to address the contradictions you encounter, but you’re back to where you started. It happens again.”

AI security, safety, and alignment

Vassilev said in his LinkedIn post that AI security, safety, and alignment are “inextricably linked.” He wrote that attackers can push insecure systems into unsafe behaviors, while persistent insecurity can steer outputs away from intended values.

The practical risk is jailbreaks, where attackers craft prompts that cause an AI system to bypass refusal mechanisms. NIST says this can create real-world risks including cyberattacks, data breaches, and highly personalized phishing messages.

Vassilev says: “Gödel’s logic applies here. You can never make a claim that you are robust against all adversarial prompt attacks. There will always be some prompt that can potentially evade and defeat any defensive infrastructure that you have built around your AI system.”

The proof does not provide a method for attackers to find new exploits. NIST says it leaves room for organizations to harden AI systems so that attacks become harder and more expensive to carry out.

Vassilev compares the problem to zero-day exploits in traditional software, where attackers find unknown vulnerabilities before defenders can address them. He says human language makes the AI problem harder because adversaries can hide harmful intent in many different ways.

Continuous monitoring and updates

Vassilev’s proposed response has three parts: red teams that continuously search for adversarial prompts, regular guardrail updates based on newly discovered weaknesses, and operational resilience designed to limit harm and recover quickly when exploits occur.

The model is different from treating AI security as a fixed checklist. It requires organizations to keep testing systems after deployment and adapt safeguards as attackers change their methods.

Vassilev says: “The goal is to reach a state where the cost of finding new exploits exceeds attackers’ resources.”

He adds: “You can’t escape Gödel in math, and in AI you likely can’t patch an AI system like an LLM and then expect to be OK forever. You have to commit to a constant search for weaknesses and stay ahead of attackers. The goal is to reach a new economic equilibrium where you make it financially prohibitive for attackers to attempt to break your AI system. It may be expensive, but that’s the cost of even partial security that should allow organizations to maximize the benefits of AI while minimizing the risks.”

The paper was published in the May-June 2026 issue of IEEE Security & Privacy. NIST said the proof supports a transition to a continuous-monitor-and-update model for AI systems.

ETIH Innovation Awards 2026