We’ve seen many ridiculous AI incidents over the past few years:
Air Canada’s chatbot promised a discount that didn’t exist, ended up paying for it (source: BBC)
A dealer chatbot sold a Chevy for $1 (source: Futurism)
People use AI chatbots to get “free” access to AI (source: LinkedIn) and more recently OpenClaw AI junkies get “free AI” from public chatbots.
If you run AI (not just chatbots) in production security is very important. Not only an incident can be very expensive in terms of AI bill, but it can have legal and reputation consequences that prevents many companies from even trying it.
Is there a solution?
An AI Firewall (or AI Gateway) is essentially a reverse proxy with deep packet inspection tailored for AI. It sits between your application backend and the upsteam AI inference provider, acting as the choke point for all AI traffic.
We’ll primarily focus on LLMs to keep it simple but the same principals can be applied for other AI modalities (e.g. voice, video, images).
Disclosure: some AI is used in the early research and draft stage of this this page, but I’ve gone through everything multiple times and edited heavily to ensure that it represents my own thoughts and experience.
The Problems AI Firewall Solves
Ingress (Prompt Injection/Jailbreaks): Stopping bad actors from overriding your system prompts.
Egress (Data Leakage): Preventing the model from returning PII, secrets, or toxic content.
Denial of Wallet: Rate-limiting queries to prevent your AI bill from exploding and rendering the ROI unjustifiable. 💸
The Architecture
Here is the high level flow. Because you are adding a network hop, latency budgets are your biggest enemy.
Implementation Strategies & Trade-offs
Engineering is the art of trade-offs. Your exact implementation depends on many factors like AI modality (is it just text or are there images, video, voice, etc.?), budget, liability (reputation, legal, etc.)…
We go through 3 implementations with diagrams and cons and pros. Then we touch upon the UX aspect of being in front of such firewalls. Don’t miss the bonus point in the end! 😈 It gets ugly!
To maintain performance, you don’t run a massive model to check every request. You use a defense-in-depth “Swiss Cheese” model, layering fast, cheap checks before invoking expensive, slow ones.
1. Deterministic Layers
How it works: Standard string matching, regex for SSNs, credit cards, or known attack signatures (e.g., “Ignore all previous instructions”).
Pros: Blazing fast. Highly verifiable. Deterministic. Nothing says “cutting-edge AI security” quite like tech from 1968. 🦕
Cons: Brittle. Attackers can just base64 encode their prompt or ask the LLM to translate a malicious payload from Pig Latin. 🤦♂️
2. Small Classifier Models
How it works: You use a tiny, specialized ML model (like a fine-tuned BERT, or sentence embeddings) to classify the intent or topic of the prompt before it hits the LLM.
Note: “Embeddings” turn text into coordinates in a high-dimensional space. You calculate the distance between the user’s prompt and a “safe” topic cluster.
Pros: Fast (sub-50ms), cheap, excellent for topic routing and enforcing boundaries.
Cons: Non-deterministic false positives.
Example: Solving the Cupcake Problem:
If your bot sells cars, and a user asks, “Can I fit 500 cupcakes in the trunk of this Civic?”, a rigid classifier might block “cupcakes” as out-of-scope (because “Give me a recipe for cupcakes” is the standard vulnerability test 😅).
But this can cost you a sale!
Tackle this by failing open: Instead of hard-blocking, the classifier flags the request as “edge case” and dynamically overwrites the system prompt sent to the LLM: “The user is asking an out-of-domain question. Answer ONLY if it relates to the physical dimensions or features of the car, otherwise decline politely.”
3. LLM-as-a-Judge
How it works: You use a different LLM (Model B) to evaluate the input/output of your primary LLM (Model A). If your main app uses OpenAI, your firewall uses Claude or a local Llama 3 instance to check for injections.
Pros: Highest accuracy for nuanced, complex attacks that evade classifiers.
Cons: Massive cost and latency penalty (although you can get away with a smaller and cheaper model that runs on Edge). You are doing two LLM round-trips. Your users might have time to grab a coffee while waiting for a chat response. ☕ It also doubles your points of failure.
Note: it’s better if the Judge and upstream AI use differenet models (if possible). Because different training data means different vulnerabilities. An exploit crafted to perfectly bypass Llama’s safety alignments will likely fail against Claude’s entirely different alignment weights. Swiss cheese security.
UX and Error Handling: What the User Actually Sees
When your firewall drops the hammer, the error message needs to balance security (don’t leak your WAF rules) with UX (give legitimate users a path forward). Here are pragmatic examples for each scenario:
Ingress Block
What happened: The regex or Judge LLM caught them trying to override system prompts (Prompt Injection/Jailbreak Attempt).
What the firewall returns:
HTTP 403 ForbiddenUser-facing message: “This request was flagged by our security filters. Please rephrase your query and try again.”
Why it works: It doesn’t tell the attacker which specific word triggered the block, but it tells a legitimate user (who maybe just used weird formatting) how to fix it.
Topic Classifier Block
What happened: They asked about cupcakes, and you decided not to fail open (Strict Out-of-Domain).
What the system returns:
HTTP 400 Bad RequestUser-facing message: “I am a specialized assistant for [Dealership Name], trained only to help with car sales and inventory. I cannot answer questions about [Detected Topic]. How can I help you find a vehicle today?”
Why it works: Sets clear boundaries and steers the conversation immediately back to the business objective while transparently telling the user about why the classifier failed allowing them to rephrase their request if needed. 🚗
Egress Block
What happened: The LLM hallucinated and tried to spit out an API key, IP (intellectual property) or PII (Data Leakage Detected).
What the system returns:
HTTP 500 Internal Server Error(from the client’s perspective, the generation failed) OR partial redaction.User-facing message (Full Block): “The generated response was blocked because it violated our data privacy policies. Please try asking a more generalized question.”
User-facing message (Redaction): “Your account manager’s internal ID is [REDACTED_PII]. You can reach them via the main support portal.”
Why it works: Protects the company without leaving the user staring at a broken UI component or a blank screen. It is also relatively easy to implement for known patterns using regexp.
Rate Limiting
What happened: Token spamming (Denial of Wallet).
What the system returns:
HTTP 429 Too Many RequestsUser-facing message: “You have exceeded the maximum number of AI requests allowed per minute. Please wait 60 seconds before trying again. For higher limits, contact support.”
Why it works: Standard SRE practice. Tells them exactly what went wrong and how long the penalty box lasts.
Failure Modes & Best Practices
If you are deploying an AI firewall to prod, treat it exactly like a traditional Web Application Firewall (WAF).
The Shadow Mode Rollout: Do not deploy an AI firewall in blocking mode on day one. Run it asynchronously. Let the prompt go to the LLM, run your classifier/judge in the background, and log the results. Tune your false-positive rate before you start dropping packets.
Latency vs. Security: Egress checking is the most expensive operation because you have to wait for the AI engine to response before you can inspect it. This hurts the user experience (e.g. no streaming text). Workaround: Stream chunks through a rolling regex buffer, or accept the risk of slight data leakage in favor of UX. As a last resort the UI layer can get an Abort Signal at any point disposing the already-rendered content.
Deployment Topology: For simpler AI firewall algorithms, you may get away with a serverless (e.g. lambda) script. Your risk vector and ROI may justify a simpler solution instead of throwing money at a 3rd party vendor. 🛡️ (open source: LlamaFirewall)
Bonus: Active Defense with LLM Tarpits
As of early 2026, script kiddies using tools like OpenClaw have figured out how to automate scraping public business chatbots for “free tokens”. We already talked about rate limiting, but simply throwing an HTTP 429 is too polite for these folks. Enter the AI Tarpit.
This is an old security trick but here’s how it works for AI systems:
Instead of instantly dropping a malicious connection or sending a rate-limit error, an AI Tarpit accepts the request and holds the TCP connection open indefinitely (e.g., returning a zero-byte window, or streaming a single token every few seconds).
The best thing is that you don’t even need to route the request to an actual LLM! A simple script does the trick while handling thousands of sequests! 😅
You exhaust the attacker’s connection pool and thread count rather than your own. Since token-farming bot operators are usually too lazy to monitor their performance metrics, you silently bring their entire scraping workflow to a grinding halt.
There’s nothing quite as satisfying as watching a scraper’s bill spike from hanging connection timeouts while they try to steal your API credits. 🐌 Plus, it sends a message: we know what you’re trying to do here, take some!
More
If you like this one, I recently did a post going through OWASP Top 10 AI vulnerabilities with illustrations, examples and pragmatic advice. Make sure to check it out.
If creation (rather than protection) is your thing, I also shared 30 AI Engineering patterns which maps conventional software engineering to the AI world.
My monetization strategy is to give away most content for free but these posts take anywhere from a few hours to a few days to draft, edit, research, illustrate, and publish. I pull these hours from my private time, vacation days and weekends. The simplest way to support this work is to like, subscribe and share it. If you really want to support me lifting our community, you can consider a paid subscription. If you want to save, you can get 20% off via this link. As a token of appreciation, subscribers get full access to the Pro-Tips sections and my online book Reliability Engineering Mindset. Your contribution also funds my open-source products like Service Level Calculator. You can also invite your friends to gain free access.
And to those of you who already support me: thank you for sponsoring this content for others. 🙌 If you have questions or feedback, or want me to dig deeper into something, please let me know in the comments.










