How to Add Guardrails and Validate LLM Output Before It Reaches Users
Guardrails are not a moderation API you bolt on at the end. They are a designed layer with at least five distinct enforcement points: input filtering, output schema validation, PII detection, permission scoping, and action gating. Missing any one of them is how production incidents happen.
I am Mahmoud Zalt, an independent senior AI systems architect with 16 years of production software experience since 2010. At Sista AI, the company I founded, I have spent the past year keeping a fleet of autonomous agents reliable in production, which is precisely where output validation earns its keep. I advise engineering teams on production AI architecture through my AI Architecture advisory service. Read more about me or browse my projects.
Why a Moderation API Alone Is Not Enough
The single most common mistake I see: a team adds a content moderation call to the LLM response and calls it guardrails. That catches profanity. It does not catch: a hallucinated account number sent to a banking customer, a SQL fragment injected via a RAG chunk, a tool call that deletes a record because the model misread the user intent, or a response that leaks another user's data from a poorly scoped context window.
A moderation API answers one question: 'Is this text harmful content?' Guardrails answer five separate questions:
- Is the input safe to process? (input filtering)
- Does the output match the contract? (schema validation)
- Does it contain regulated data? (PII and secrets scanning)
- Does the caller have permission to trigger this action? (permission scoping)
- Should a human approve this before execution? (action gating)
Treat each as its own subsystem with its own failure mode and its own rollback path.
Layer 1: Input Filtering Before the Model Sees Anything
Input filtering runs before the prompt is assembled and before a single token is sent to the model. It has two jobs: reject malformed or adversarial inputs, and sanitize retrieval-augmented content before it enters the context window.
Prompt Injection via RAG
If you are pulling documents from a vector store and stuffing them into a system prompt, any injected instruction inside a retrieved document becomes part of your prompt. The fix is a sanitization step on every retrieved chunk before it is concatenated. Strip instruction-shaped text patterns, wrap chunks in a clearly delimited block, and instruct the model explicitly that content inside that block is data, not instruction.
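A minimal sketch of that sanitization step, to make the idea concrete. The pattern list, the `[removed]` marker, and the `<DATA>` delimiter are illustrative assumptions, not a vetted blocklist; real injection attempts vary widely, so treat this as one layer rather than a complete defense.

```typescript
// Hypothetical patterns for instruction-shaped text; tune these against real traffic.
const INSTRUCTION_PATTERNS: RegExp[] = [
  /ignore (all |any )?(previous|prior|above) instructions/gi,
  /disregard (the )?(system|previous) prompt/gi,
  /you are now [^.\n]*/gi,
];

// Remove delimiter look-alikes and instruction-shaped phrases from one chunk.
function sanitizeChunk(chunk: string): string {
  let cleaned = chunk.replace(/<\/?DATA>/gi, "");
  for (const pattern of INSTRUCTION_PATTERNS) {
    cleaned = cleaned.replace(pattern, "[removed]");
  }
  return cleaned.trim();
}

// Wrap the sanitized chunks in one delimited block the system prompt can refer to.
function buildDataBlock(chunks: string[]): string {
  const body = chunks
    .map(sanitizeChunk)
    .filter((c) => c.length > 0)
    .join("\n---\n");
  return `<DATA>\n${body}\n</DATA>`;
}
```

Stripping the delimiter from chunk text matters as much as stripping the phrases: a chunk that contains its own closing delimiter could otherwise end the block early and place the rest of its text outside it.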
Example system prompt structure:
You are a support assistant. Answer only from the DATA block below.
<DATA>
{sanitized_chunks}
</DATA>
If the answer is not in the DATA block, say you don't know.
Input Length and Token Budget Enforcement
Enforce a hard token ceiling at the application layer before sending. Do not rely on the API to reject you. A runaway input can exhaust your context window and silently truncate your system prompt, including the safety instructions at the top.
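One way to sketch the application-layer ceiling. The four-characters-per-token estimate is a rough assumption for English text, and `enforceTokenBudget` and `TokenBudgetError` are hypothetical names; swap in your provider's tokenizer when you need exact counts.

```typescript
// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

class TokenBudgetError extends Error {}

// Reject the request before sending if system prompt plus user input exceed the ceiling.
function enforceTokenBudget(
  systemPrompt: string,
  userInput: string,
  maxInputTokens: number,
): number {
  const total = estimateTokens(systemPrompt) + estimateTokens(userInput);
  if (total > maxInputTokens) {
    throw new TokenBudgetError(
      `Input is ~${total} tokens, over the ceiling of ${maxInputTokens}`,
    );
  }
  return total;
}
```

Rejecting outright is the conservative choice here. Truncating the user input instead is also possible, as long as the system prompt is never the part that gets cut.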
Rate Limiting and Abuse Detection
Treat the LLM endpoint like any other sensitive API. Per-user rate limits, anomaly detection on token volume, and blocking semantically repetitive probing attempts (the pattern used to extract system prompts) all belong here, not inside the model.
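A per-user limit can be sketched as a sliding window. This version keeps state in process memory, which is an assumption that only holds for a single instance; `UserRateLimiter` is a hypothetical name, and a multi-instance deployment would need a shared store instead.

```typescript
// Sliding-window limiter: at most maxRequests per user within windowMs.
class UserRateLimiter {
  private hits = new Map<string, number[]>();
  private maxRequests: number;
  private windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if allowed; false if the user is over the limit.
  allow(userId: string, nowMs: number = Date.now()): boolean {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.hits.get(userId) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.maxRequests) {
      this.hits.set(userId, recent);
      return false;
    }
    recent.push(nowMs);
    this.hits.set(userId, recent);
    return true;
  }
}
```

The same shape works for token volume: record token counts instead of timestamps and compare the windowed sum against a budget.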
Layer 2: Output Schema Validation Before the Response Leaves the Model
If your model is expected to return structured data, validate the structure before it touches any downstream system. 'The model usually returns valid JSON' is not a guarantee. JSON mode guarantees syntactically valid JSON, and schema-constrained structured outputs (available in most current APIs) also enforce the shape at generation time, but neither checks that the values make sense. You still need to validate the values, not just the structure.
Structured Output Enforcement
Use the API's native structured output or JSON mode to constrain the token generation to valid JSON. Then run a schema validator (Zod, Pydantic, JSON Schema, your choice) against the result. If it fails, you have three options: retry with a correction prompt, return a safe fallback, or escalate to a human queue. Never pass a failed parse downstream.
Worked example for a booking assistant that returns a structured action:
// Expected schema (Zod)
import { z } from 'zod';

const BookingAction = z.object({
  action: z.enum(['create', 'cancel', 'modify']),
  bookingId: z.string().optional(),
  date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  confidence: z.number().min(0).max(1),
});

// After the model call, inside your request handler.
// `raw` is the model output already parsed from JSON into an object.
const parsed = BookingAction.safeParse(raw);
if (!parsed.success) {
  // log, retry once, then fallback
  return safeErrorResponse();
}
if (parsed.data.confidence < 0.7) {
  // route to human queue
  return humanReviewQueue.enqueue(parsed.data);
}
Semantic Validation Beyond Schema
Schema validation checks structure. Semantic validation checks meaning. A date of '2099-01-01' passes the schema, yet for a booking it is almost certainly a hallucination. Add business-rule assertions: dates within plausible range, IDs that actually exist in your database, amounts within policy limits. These assertions are cheap and catch a large class of hallucination errors.
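A sketch of what such assertions might look like for the booking example. The 365-day horizon, the `bookingExists` lookup, and the error strings are assumptions for illustration; your own policy limits and data access will differ.

```typescript
interface BookingProposal {
  action: "create" | "cancel" | "modify";
  bookingId?: string;
  date: string; // YYYY-MM-DD, shape already checked by the schema
}

const MAX_DAYS_AHEAD = 365; // hypothetical policy limit
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns a list of business-rule violations; an empty list means the proposal passes.
function semanticErrors(
  p: BookingProposal,
  today: Date,
  bookingExists: (id: string) => boolean,
): string[] {
  const errors: string[] = [];
  const bookingDay = Date.parse(`${p.date}T00:00:00Z`);
  const todayStart = Date.UTC(
    today.getUTCFullYear(),
    today.getUTCMonth(),
    today.getUTCDate(),
  );
  if (isNaN(bookingDay)) {
    errors.push("date is not a valid calendar date");
  } else {
    const daysAhead = (bookingDay - todayStart) / MS_PER_DAY;
    if (daysAhead < 0) errors.push("date is in the past");
    if (daysAhead > MAX_DAYS_AHEAD) {
      errors.push(`date is more than ${MAX_DAYS_AHEAD} days ahead`);
    }
  }
  if (p.action !== "create") {
    if (!p.bookingId) errors.push("bookingId is required for cancel and modify");
    else if (!bookingExists(p.bookingId)) errors.push("bookingId does not exist");
  }
  return errors;
}
```

Returning every violation instead of failing on the first one gives a correction prompt or a human reviewer the full picture in one pass.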
Layer 3: PII Detection and Secrets Scanning
LLMs will reproduce data they have seen in context. If your retrieval pipeline surfaces documents containing email addresses, phone numbers, credit card numbers, API keys, or SSNs, the model can and will echo that data back in its output. Scanning the output for regulated data before delivery is not optional in any regulated industry.
What to Scan For
| Category | Examples | Consequence if leaked |
|---|---|---|
| PII | Email, phone, SSN, address | GDPR / CCPA violation |
| Financial | Card numbers, account numbers | PCI-DSS violation |
| Health | Diagnoses, prescriptions, patient IDs | HIPAA violation |
| Secrets | API keys, tokens, passwords | Security breach |
| Cross-user data | Another user's name, order, or session detail | Data isolation failure |
Implementation Options
Microsoft Presidio is the most capable open-source scanner and handles 50+ entity types across multiple languages. Amazon Comprehend and Google Cloud DLP are managed alternatives. For secrets specifically, the pattern libraries from tools like detect-secrets or truffleHog, adapted to string scanning, work well. Run scanning as a synchronous step before the response is serialized. If a match is found, redact the field and log the incident; do not just drop the response silently.
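To show the shape of the redact-and-log step, here is a deliberately small regex-based sketch. It is not Presidio and does not use its API; the three patterns and the `redactOutput` name are illustrative assumptions and far less thorough than a dedicated scanner.

```typescript
interface RedactionFinding {
  category: string;
  length: number; // record the size, not the matched value, to keep PII out of logs
}

// Illustrative detectors only (assumption); a real deployment needs a proper scanner.
const DETECTORS: { category: string; pattern: RegExp }[] = [
  { category: "email", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { category: "us_ssn", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { category: "aws_access_key_id", pattern: /\bAKIA[0-9A-Z]{16}\b/g },
];

// Replace every match with a labeled placeholder and report what was found.
function redactOutput(text: string): { redacted: string; findings: RedactionFinding[] } {
  const findings: RedactionFinding[] = [];
  let redacted = text;
  for (const { category, pattern } of DETECTORS) {
    redacted = redacted.replace(pattern, (match) => {
      findings.push({ category, length: match.length });
      return `[REDACTED:${category}]`;
    });
  }
  return { redacted, findings };
}
```

The findings list is what feeds the incident log and the PII detection rate metric, while the redacted text is what the user receives.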
Cross-User Context Isolation
In multi-tenant systems, your greatest PII risk is not the model generating PII from scratch. It is the model regurgitating another tenant's data that leaked into the context window through a poorly scoped retrieval query. Namespace your vector store by tenant, enforce tenant ID filters on every retrieval call, and assert that retrieved chunks belong to the requesting user before including them in the prompt.
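The final assertion in that paragraph can be sketched as a fail-closed check that runs between retrieval and prompt assembly. The `RetrievedChunk` shape and the `tenantId` metadata field are assumptions about how your vector store returns results.

```typescript
interface RetrievedChunk {
  id: string;
  tenantId: string; // assumed to be stored as metadata alongside each embedding
  text: string;
}

class TenantIsolationError extends Error {}

// Fail closed: a single foreign chunk aborts prompt assembly instead of being dropped quietly.
function assertTenantScoped(
  chunks: RetrievedChunk[],
  requestTenantId: string,
): RetrievedChunk[] {
  const foreign = chunks.filter((c) => c.tenantId !== requestTenantId);
  if (foreign.length > 0) {
    const ids = foreign.map((c) => c.id).join(", ");
    throw new TenantIsolationError(
      `Retrieved ${foreign.length} chunk(s) outside tenant ${requestTenantId}: ${ids}`,
    );
  }
  return chunks;
}
```

Throwing rather than filtering is a design choice: a foreign chunk means the retrieval filter is broken, and that should surface as an incident instead of being papered over.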
Layer 4: Permission Scoping for Tool Calls and MCP Actions
The moment your LLM can call tools, query databases, or interact with external systems via MCP (Model Context Protocol) or a function-calling interface, you have an authorization problem, not just a content problem. The model decides which tool to call and with what arguments. You decide whether that call is allowed for this user in this context.
The Pattern: Capability Manifest Per Session
Do not give the model a full list of available tools. Give it a capability manifest scoped to the authenticated user's permissions at session initialization. If a user cannot delete records, the delete tool should not appear in the manifest. This is defense in depth: even if the model hallucinates a call to a tool it was not told about, your dispatcher layer rejects it.
// At session start, build scoped manifest
const manifest = buildManifest(user.role, user.permissions);
// manifest only contains tools the user is authorized to invoke

// Thrown when the model requests a tool outside the session manifest
class PermissionError extends Error {}

// At dispatch layer
function dispatch(toolCall) {
  if (!manifest.has(toolCall.name)) {
    throw new PermissionError('Tool not in session manifest: ' + toolCall.name);
  }
  return manifest.get(toolCall.name).execute(toolCall.args);
}
Argument Validation at the Tool Layer
Every tool should validate its own arguments independently of the model. The model passed you an account ID to query: verify it belongs to the authenticated user. The model passed you a file path: verify it is within the allowed directory. Treat every argument as untrusted input, because it is. The model is not your authorization layer.
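A sketch of the account ID case, assuming the authenticated user object carries the list of accounts it owns. `AuthenticatedUser`, `authorizeAccountQuery`, and the ID format are hypothetical; the point is that the check runs inside the tool, on arguments typed as `unknown`.

```typescript
interface AuthenticatedUser {
  id: string;
  accountIds: string[]; // accounts this user owns, loaded from your own auth layer
}

interface AccountQueryArgs {
  accountId: string;
}

// Narrow model-supplied arguments into a typed shape, or throw.
function parseAccountQueryArgs(args: unknown): AccountQueryArgs {
  if (typeof args !== "object" || args === null) {
    throw new Error("args must be an object");
  }
  const accountId = (args as Record<string, unknown>).accountId;
  if (typeof accountId !== "string" || !/^[A-Za-z0-9-]{1,64}$/.test(accountId)) {
    throw new Error("accountId must be a short alphanumeric string");
  }
  return { accountId };
}

// Ownership check: the model chose the ID, so the tool verifies it belongs to the caller.
function authorizeAccountQuery(user: AuthenticatedUser, rawArgs: unknown): AccountQueryArgs {
  const args = parseAccountQueryArgs(rawArgs);
  if (user.accountIds.indexOf(args.accountId) === -1) {
    throw new Error(`account ${args.accountId} does not belong to user ${user.id}`);
  }
  return args;
}
```

The same two-step pattern (parse the shape, then check ownership or scope) applies to file paths, record IDs, and any other argument the model supplies.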
MCP-Specific Considerations
If you are using MCP servers to extend your agent, each MCP tool registration is a potential privilege escalation point. Audit every MCP server in your manifest. Prefer narrow, single-purpose MCP tools over broad ones. Log every MCP tool invocation with the full argument payload.
Layer 5: Action Gating and Human-in-the-Loop Approval
Not every action should execute immediately. The question is not whether to have human approval gates. The question is where to put them based on reversibility and blast radius.
The Reversibility Matrix
| Action Type | Reversible? | Blast Radius | Gate Recommendation |
|---|---|---|---|
| Read / search | Yes | None | No gate needed |
| Draft creation | Yes | Low | Show preview, auto-execute |
| Record update | Yes (with audit log) | Medium | Confirm intent inline |
| Bulk operation | Difficult | High | Human approval queue |
| Send (email, payment, message) | No | High | Hard human approval gate |
| Delete / irreversible | No | Very high | Hard human approval gate + audit |
Confidence Thresholds as Automatic Gates
If you are asking the model to return a confidence score (or running a secondary evaluation call to score confidence), use it as an automatic routing signal. Actions with confidence below 0.75 go to a human review queue. Actions above 0.95 on reversible operations can auto-execute. The threshold is tunable; what matters is that you have one and that it is not hardcoded to 'always execute.'
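Combining reversibility, blast radius, and confidence into one routing decision might look like the sketch below. The 0.75 and 0.95 defaults come from the text above; sending the middle band to an inline confirmation is my assumption, as are the `ProposedAction` fields.

```typescript
type Gate = "auto_execute" | "confirm_inline" | "human_approval";

interface ProposedAction {
  reversible: boolean;
  blastRadius: "none" | "low" | "medium" | "high";
  confidence: number; // 0..1, from the model or a secondary evaluation call
}

// Route an action to a gate; thresholds are parameters so they can be tuned, not hardcoded.
function chooseGate(a: ProposedAction, humanBelow = 0.75, autoAbove = 0.95): Gate {
  // Irreversible or high-blast-radius actions always need a human, whatever the confidence.
  if (!a.reversible || a.blastRadius === "high") return "human_approval";
  // Reads and searches need no gate.
  if (a.blastRadius === "none") return "auto_execute";
  if (a.confidence < humanBelow) return "human_approval";
  if (a.confidence > autoAbove) return "auto_execute";
  return "confirm_inline";
}
```

Putting the reversibility check first means a confidently wrong model can never talk its way past the hard gate.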
Implementing a Review Queue
A review queue is a simple pattern: the agent proposes an action, writes it to a queue table or topic with status 'pending,' and halts. A human reviews it in a lightweight UI, approves or rejects with an optional correction, and the queue consumer executes. The complexity comes from making the queue ergonomic enough that humans actually use it rather than bypassing it. Keep the approval UI minimal: show the proposed action, the reasoning, and two buttons.
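The propose, halt, review, execute cycle can be sketched with an in-memory queue. This is a stand-in for a real queue table or topic, and `ReviewQueue` with its method names is hypothetical; durability and the approval UI are out of scope here.

```typescript
type ReviewStatus = "pending" | "approved" | "rejected";

interface ReviewItem<A> {
  id: number;
  action: A;
  reasoning: string;
  status: ReviewStatus;
}

class ReviewQueue<A> {
  private items = new Map<number, ReviewItem<A>>();
  private nextId = 1;
  private execute: (action: A) => void;

  constructor(execute: (action: A) => void) {
    this.execute = execute;
  }

  // The agent proposes and stops; nothing runs until a human decides.
  propose(action: A, reasoning: string): number {
    const id = this.nextId++;
    this.items.set(id, { id, action, reasoning, status: "pending" });
    return id;
  }

  // Approve as proposed, or approve a corrected version of the action.
  approve(id: number, correction?: A): void {
    const item = this.pendingItem(id);
    item.status = "approved";
    this.execute(correction ?? item.action);
  }

  reject(id: number): void {
    this.pendingItem(id).status = "rejected";
  }

  status(id: number): ReviewStatus | undefined {
    return this.items.get(id)?.status;
  }

  private pendingItem(id: number): ReviewItem<A> {
    const item = this.items.get(id);
    if (!item || item.status !== "pending") throw new Error(`No pending item ${id}`);
    return item;
  }
}
```

Because only pending items can be approved, a double click on the approve button cannot execute the same action twice.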
Observability: You Cannot Guard What You Cannot See
Guardrails fail silently if you do not log the right things. Every LLM call should emit a structured trace event containing: the input token count, the output token count, the guardrail checks that ran and their outcomes, the tool calls made and their arguments, latency at each layer, the model version, and the session tenant ID. That is your minimum observability surface.
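As a sketch, the trace event described above could be typed like this. The field names and the single `modelLatencyMs` figure are assumptions; map them onto whatever tracing or logging pipeline you already run.

```typescript
interface GuardrailCheckResult {
  name: string;
  outcome: "pass" | "block" | "redact";
  latencyMs: number;
}

interface LlmTraceEvent {
  sessionId: string;
  tenantId: string;
  modelVersion: string;
  inputTokens: number;
  outputTokens: number;
  guardrailChecks: GuardrailCheckResult[];
  toolCalls: { name: string; args: unknown }[];
  modelLatencyMs: number;
}

// Serialize one event as a single JSON line for a structured log pipeline.
function toLogLine(event: LlmTraceEvent): string {
  return JSON.stringify({ type: "llm_call", ...event });
}

// True if any guardrail did something other than pass; feeds the trigger-rate metric.
function guardrailTriggered(event: LlmTraceEvent): boolean {
  return event.guardrailChecks.some((c) => c.outcome !== "pass");
}
```

Recording a result for every check that ran, including the ones that passed, is what lets you tell "the guardrail passed this" apart from "the guardrail never ran".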
Eval-Driven Guardrail Tuning
Guardrails have two failure modes: false positives (blocking legitimate responses) and false negatives (passing bad ones). Both are costly. Tune them using a labeled eval set drawn from real production traffic, not synthetic examples. Collect the cases where users complained, the cases where you found an issue in the logs, and the cases where the system worked well. Run your guardrail suite against that set on every change and track the false positive and false negative rates as metrics, not as one-time checks.
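Computing the two rates over a labeled set is straightforward. In this sketch `LabeledCase` and the `blocks` callback are hypothetical stand-ins for your eval records and your guardrail suite.

```typescript
interface LabeledCase {
  output: string;
  shouldBlock: boolean; // human label assigned during production review
}

interface GuardrailRates {
  falsePositiveRate: number; // share of legitimate outputs that were blocked
  falseNegativeRate: number; // share of bad outputs that were passed
}

// Run the guardrail over every labeled case and compute both error rates.
function measureGuardrail(
  cases: LabeledCase[],
  blocks: (output: string) => boolean,
): GuardrailRates {
  let falsePositives = 0;
  let falseNegatives = 0;
  let legitimate = 0;
  let bad = 0;
  for (const c of cases) {
    const blocked = blocks(c.output);
    if (c.shouldBlock) {
      bad++;
      if (!blocked) falseNegatives++;
    } else {
      legitimate++;
      if (blocked) falsePositives++;
    }
  }
  return {
    falsePositiveRate: legitimate === 0 ? 0 : falsePositives / legitimate,
    falseNegativeRate: bad === 0 ? 0 : falseNegatives / bad,
  };
}
```

Running this in CI on every guardrail change turns both rates into tracked metrics, so a drop in one that comes at the expense of the other is visible before release.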
Alerting Thresholds
Set alerts on: guardrail trigger rate (a spike means either an attack or a regression in model behavior), schema validation failure rate, PII detection rate in outputs (should be near zero), and human review queue depth (growing queue means your confidence thresholds are miscalibrated or your model is degrading).
What Teams Get Wrong: The Five Most Common Guardrail Mistakes
- Treating guardrails as a post-launch concern. The cost to retrofit is 3 to 5x the cost to design in. Every LLM call site you ship without a validation contract is technical debt that will cause an incident.
- A single catch-all filter. One regex or one moderation call does not compose. You need independent layers that each catch a different failure class. When one fails, the others still run.
- Trusting the model's self-assessment. Asking the model 'Is this response safe?' is not a guardrail. It is a suggestion. The model that generated the bad response will often say it is fine. Use a separate evaluation call with a different prompt, or a deterministic scanner.
- No rollback path. Every guardrail rejection needs a defined fallback: a safe static response, a retry, a human queue, or a graceful error. 'Return None' is not a rollback path.
- Logging the output but not the context. When a guardrail triggers, you need to replay what happened. Log the full context: retrieved chunks, tool call history, user message, model version, and all intermediate outputs. A truncated log makes root cause analysis impossible.
Frequently Asked Questions
What is the difference between LLM guardrails and a content moderation API?
A content moderation API answers one question: does this text violate a harm policy? Guardrails are a multi-layer system that also validate output schema, detect PII, enforce user permissions on tool calls, and gate irreversible actions behind human approval. Moderation is one component of a guardrail system, not a substitute for it.
How do I validate structured output from an LLM?
Use the API's native structured output or JSON mode to constrain generation, then run a schema validator (Zod for TypeScript, Pydantic for Python) against the result. Add semantic assertions on top: value ranges, referential integrity checks, business rule constraints. If validation fails, retry once with a correction prompt, then fall back to a safe error response. Never pass a failed parse to a downstream system.
How do I prevent LLM prompt injection through retrieved documents?
Sanitize every retrieved chunk before it enters the prompt. Strip instruction-shaped text patterns. Wrap all retrieved content in a clearly labeled DATA block and instruct the model that content in that block is data, not instruction. Scope your retrieval queries by authenticated user and tenant ID to prevent cross-user data from entering the context window.
When should I require human approval before an LLM agent takes an action?
Gate any action that is irreversible or has a high blast radius: sending messages, processing payments, bulk updates, and deletes. Use a confidence threshold as an automatic gate for reversible actions: route low-confidence proposals to a human review queue rather than auto-executing. The specific thresholds should be calibrated against your real production distribution, not guessed.
How do I prevent an LLM from leaking PII in its output?
Run a PII scanner (Microsoft Presidio is the best open-source option) as a synchronous step before the response is serialized. Also enforce tenant namespace isolation in your vector store so that retrieved documents from one user cannot enter another user's context window. These are two independent failure modes and both need independent controls.
How do I know if my guardrails are too strict or too lenient?
Build a labeled eval set from real production traffic: complaints, incidents, and normal successful interactions. Run your guardrail suite against it and track false positive rate (legitimate responses blocked) and false negative rate (bad responses passed). Alert on both. A guardrail that is too strict will show up as user complaints and high false positive rate. One that is too lenient will show up as incidents and a rising false negative rate on your eval set.
Need a Guardrail Architecture Review?
If your team is shipping LLM features and the guardrail layer is either missing or bolted on as an afterthought, the incident is a matter of timing, not probability. I work with engineering teams as an independent AI architecture advisor to design the full guardrail stack before it costs you users or regulatory attention. This is the core of what I do at my AI Architecture advisory practice.
The engagement is usually four to six sessions: one to audit your current LLM call sites and failure modes, two to three to design the input filtering, schema validation, PII, permission, and action-gating layers, and one to wire in observability and set eval baselines. You leave with a concrete implementation plan your team can execute, not a slide deck.
If that sounds like the right level of engagement, get in touch and let me know where you are in the build.