AI firewall

How to protect your AI application in production against new classes of attacks

Mar 15, 2026

We’ve seen many ridiculous AI incidents over the past few years:

Air Canada’s chatbot promised a discount that didn’t exist, ended up paying for it (source: BBC)
A dealer chatbot sold a Chevy for $1 (source: Futurism)
People use AI chatbots to get “free” access to LLMs paid by companies (source: LinkedIn, LinkedIn).

Stop paying $20/month for Claude Code. Chipotle’s AI bot is FREE. Someone asked Chipotle’s support assistant Pepper how to reverse a linked list in Python. It answered correctly. Not burrito recommendations. Actual code. A fast-food chatbot quietly passing a classic computer science interview question wasn’t on my 2026 bingo card. We’re at peak AI now.

If you run AI (not just chatbots) in production, security is very important. Not only an incident can be very expensive in terms of AI bill, but it can have legal and reputation consequences.

These risks prevents many companies from even trying it.

Is there a solution?

An AI Firewall (or AI Gateway) is essentially a reverse proxy with deep packet inspection tailored for AI. It sits between your application backend and the upsteam AI inference provider, acting as the choke point for all AI traffic.

We’ll primarily focus on LLMs to keep it simple but the same principals can be applied for other AI modalities (e.g. voice, video, images).

Disclosure: some AI is used in the early research and draft stage of this this page, but I’ve gone through everything multiple times and edited heavily to ensure that it represents my own thoughts and experience.

The Problems AI Firewall Solves

Ingress (Prompt Injection/Jailbreaks): Stopping bad actors from overriding your system prompts.
Egress (Data Leakage): Preventing the model from returning PII, secrets, or toxic content.
Denial of Wallet: Rate-limiting queries to prevent your AI bill from exploding and rendering the ROI unjustifiable. 💸

The Architecture

Here is the high level flow. Because you are adding a network hop, latency budgets are your biggest enemy.

Implementation Strategies & Trade-offs

Engineering is the art of trade-offs. Your exact implementation depends on many factors like AI modality (is it just text or are there images, video, voice, etc.?), budget, liability (reputation, legal, etc.)…

We go through 3 implementations with diagrams and cons and pros. Then we touch upon the UX aspect of being in front of such firewalls. Don’t miss the bonus point in the end! 😈 It gets ugly!

To maintain performance, you don’t run a massive model to check every request. You use a defense-in-depth “Swiss Cheese” model, layering fast, cheap checks before invoking expensive, slow ones.

1. Deterministic Layers

How it works: Standard string matching, regex for SSNs, credit cards, or known attack signatures (e.g., “Ignore all previous instructions”).

Pros: Blazing fast. Highly verifiable. Deterministic. Nothing says “cutting-edge AI security” quite like tech from 1968. 🦕
Cons: Brittle. Attackers can just base64 encode their prompt or ask the LLM to translate a malicious payload from Pig Latin. 🤦‍♂️

OWASP Top 10 Agents & AI Vulnerabilities (2026 Cheat Sheet)

Alex Ewerlöf

Mar 10

Read full story

2. Small Classifier Models

How it works: You use a tiny, specialized ML model (like a fine-tuned BERT, or sentence embeddings) to classify the intent or topic of the prompt before it hits the LLM.

Note: “Embeddings” turn text into coordinates in a high-dimensional space. You calculate the distance between the user’s prompt and a “safe” topic cluster.

Pros: Fast (sub-50ms), cheap, excellent for topic routing and enforcing boundaries.
Cons: Non-deterministic false positives.

Tip: One clever trick is to use your existing RAG system. If your AI application primarily expects the user queries to be answered using the available RAG documents (e.g. an FAQ), having no similar document to the user’s query is a strong indication that what they’re asking is out of scope.

Example: Solving the Cupcake Problem:

If your bot sells cars, and a user asks, “Can I fit 500 cupcakes in the trunk of this Civic?”, a rigid classifier might block “cupcakes” as out-of-scope (because “Give me a recipe for cupcakes” is the standard vulnerability test 😅).

But this can cost you a sale!

Tackle this by failing open: Instead of hard-blocking, the classifier flags the request as “edge case” and dynamically overwrites the system prompt sent to the LLM: “The user is asking an out-of-domain question. Answer ONLY if it relates to the physical dimensions or features of the car, otherwise decline politely.”

3. LLM-as-a-Judge

How it works: You use a different LLM (Model B) to evaluate the input/output of your primary LLM (Model A). If your main app uses OpenAI, your firewall uses Claude or a local Llama 3 instance to check for injections.

Pros: Highest accuracy for nuanced, complex attacks that evade classifiers.
Cons: Massive cost and latency penalty (although you can get away with a smaller and cheaper model that runs on Edge). You are doing two LLM round-trips. Your users might have time to grab a coffee while waiting for a chat response. ☕ It also doubles your points of failure.

AI topology

Alex Ewerlöf

October 24, 2025

Read full story

Note: it’s better if the Judge and upstream AI use differenet models (if possible). Because different training data means different vulnerabilities. An exploit crafted to perfectly bypass Llama’s safety alignments will likely fail against Claude’s entirely different alignment weights. Swiss cheese security.

UX and Error Handling: What the User Actually Sees

When your firewall drops the hammer, the error message needs to balance security (don’t leak your WAF rules) with UX (give legitimate users a path forward). Here are pragmatic examples for each scenario:

Ingress Block

What happened: The regex or Judge LLM caught them trying to override system prompts (Prompt Injection/Jailbreak Attempt).
What the firewall returns: HTTP 403 Forbidden
User-facing message: “This request was flagged by our security filters. Please rephrase your query and try again.”
Why it works: It doesn’t tell the attacker which specific word triggered the block, but it tells a legitimate user (who maybe just used weird formatting) how to fix it.

Topic Classifier Block

What happened: They asked about cupcakes, and you decided not to fail open (Strict Out-of-Domain).
What the system returns: HTTP 400 Bad Request
User-facing message: “I am a specialized assistant for [Dealership Name], trained only to help with car sales and inventory. I cannot answer questions about [Detected Topic]. How can I help you find a vehicle today?”
Why it works: Sets clear boundaries and steers the conversation immediately back to the business objective while transparently telling the user about why the classifier failed allowing them to rephrase their request if needed. 🚗

Egress Block

What happened: The LLM hallucinated and tried to spit out an API key, IP (intellectual property) or PII (Data Leakage Detected).
What the system returns: HTTP 500 Internal Server Error (from the client’s perspective, the generation failed) OR partial redaction.
User-facing message (Full Block): “The generated response was blocked because it violated our data privacy policies. Please try asking a more generalized question.”
User-facing message (Redaction): “Your account manager’s internal ID is [REDACTED_PII]. You can reach them via the main support portal.”
Why it works: Protects the company without leaving the user staring at a broken UI component or a blank screen. It is also relatively easy to implement for known patterns using regexp.

Rate Limiting

What happened: Token spamming (Denial of Wallet).
What the system returns: HTTP 429 Too Many Requests
User-facing message: “You have exceeded the maximum number of AI requests allowed per minute. Please wait 60 seconds before trying again. For higher limits, contact support.”
Why it works: Standard SRE practice. Tells them exactly what went wrong and how long the penalty box lasts.

Failure Modes & Best Practices

If you are deploying an AI firewall to prod, treat it exactly like a traditional Web Application Firewall (WAF).

The Shadow Mode Rollout: Do not deploy an AI firewall in blocking mode on day one. Run it asynchronously. Let the prompt go to the LLM, run your classifier/judge in the background, and log the results. Tune your false-positive rate before you start dropping packets.
Latency vs. Security: Egress checking is the most expensive operation because you have to wait for the AI engine to response before you can inspect it. This hurts the user experience (e.g. no streaming text). Workaround: Stream chunks through a rolling regex buffer, or accept the risk of slight data leakage in favor of UX. As a last resort the UI layer can get an Abort Signal at any point disposing the already-rendered content.
Deployment Topology: For simpler AI firewall algorithms, you may get away with a serverless (e.g. lambda) script. Your risk vector and ROI may justify a simpler solution instead of throwing money at a 3rd party vendor. 🛡️ (open source: LlamaFirewall)

Bonus: Active Defense with LLM Tarpits

As of early 2026, script kiddies using tools like OpenClaw have figured out how to automate scraping public business chatbots for “free tokens”. We already talked about rate limiting, but simply throwing an HTTP 429 is too polite for these folks. Enter the AI Tarpit.

This is an old security trick but here’s how it works for AI systems:

Instead of instantly dropping a malicious connection or sending a rate-limit error, an AI Tarpit accepts the request and holds the TCP connection open indefinitely (e.g., returning a zero-byte window, or streaming a single token every few seconds).

The best thing is that you don’t even need to route the request to an actual LLM! A simple script does the trick while handling thousands of sequests! 😅

You exhaust the attacker’s connection pool and thread count rather than your own. Since token-farming bot operators are usually too lazy to monitor their performance metrics, you silently bring their entire scraping workflow to a grinding halt.

There’s nothing quite as satisfying as watching a scraper’s bill spike from hanging connection timeouts while they try to steal your API credits. 🐌 Plus, it sends a message: we know what you’re trying to do here, take some!

If you found this post insightful, please share it on social media and your circles to inspire others

If you like this one, I recently did a post going through OWASP Top 10 AI vulnerabilities with illustrations, examples and pragmatic advice. Make sure to check it out.

If creation (rather than protection) is your thing, I also shared 30 AI Engineering patterns which maps conventional software engineering to the AI world.

My monetization strategy is to give away most content for free but these posts take anywhere from a few hours to a few days to draft, edit, research, illustrate, and publish. I pull these hours from my private time, vacation days and weekends. The simplest way to support this work is to like, subscribe and share it. If you really want to support me lifting our community, you can consider a paid subscription. If you want to save, you can get 20% off via this link. As a token of appreciation, subscribers get full access to the Pro-Tips sections and my online book Reliability Engineering Mindset. Your contribution also funds my open-source products like Service Level Calculator. You can also invite your friends to gain free access.

And to those of you who already support me: thank you for sponsoring this content for others. 🙌 If you have questions or feedback, or want me to dig deeper into something, please let me know in the comments.

OWASP Top 10 Agents & AI Vulnerabilities (2026 Cheat Sheet)

AI topology

Discussion about this post

Ready for more?

AI firewall

How to protect your AI application in production against new classes of attacks

The Problems AI Firewall Solves

The Architecture

Implementation Strategies & Trade-offs

1. Deterministic Layers

OWASP Top 10 Agents & AI Vulnerabilities (2026 Cheat Sheet)

2. Small Classifier Models

3. LLM-as-a-Judge

AI topology

UX and Error Handling: What the User Actually Sees

Ingress Block

Topic Classifier Block

Egress Block

Rate Limiting

Failure Modes & Best Practices

Bonus: Active Defense with LLM Tarpits

More

Discussion about this post

Ready for more?