Guardrails

Guardrails are the engineering layer that catches what the model gets wrong, what users send maliciously, and what the system shouldn’t do regardless of either. Building production AI without guardrails is like deploying a service without auth.

This builds on Stage 11 — Guardrails & Safety, with a production / non-agent focus.

Threat surface

A production LLM endpoint receives:

  • User input (potentially adversarial).
  • Retrieved context (Stage 09 RAG; potentially manipulated upstream).
  • Tool results (Stage 11 agents; could contain injected text).
  • System prompts (your code).

Adversaries can target any of these. Guardrails defend each layer.

Input guardrails

Before the model sees the input.

Length and rate limits

  • Cap input length (catch token-flooding attacks).
  • Rate-limit per user/IP/key.
  • Detect abuse patterns (many similar requests).
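
A minimal sketch of the length cap plus a per-key sliding-window rate limit. It is in-memory for illustration only; the limits are made up, and a production service would back this with Redis or an API gateway:

import time
from collections import defaultdict, deque

MAX_INPUT_CHARS = 8_000          # illustrative cap; tune to your model's context
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30

_request_log: dict[str, deque] = defaultdict(deque)

def check_input(api_key: str, user_input: str) -> None:
    if len(user_input) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    now = time.monotonic()
    window = _request_log[api_key]
    # drop timestamps that have fallen out of the sliding window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        raise RuntimeError("rate limit exceeded")
    window.append(now)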

Content filters

  • PII detection: redact or warn (Presidio, AWS Comprehend, custom regex; a regex sketch follows this list).
  • Toxicity / harassment: classifiers (Perspective API, OpenAI Moderation, Llama Guard).
  • Topic gating: route disallowed topics to refusal.
  • Language detection: route or filter.
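
To give the custom-regex option some flavor, a minimal PII redaction pass. The patterns are illustrative and deliberately naive; real coverage needs Presidio or a managed service:

import re

# illustrative patterns only; real PII detection needs a proper library
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text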

Prompt injection detection

Lightweight classifiers that try to detect “ignore previous instructions” patterns. Imperfect; layer with other defenses.

Tools: Lakera AI, Llama Prompt Guard, custom rule sets.
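
A custom rule set can start as a phrase blocklist scored against the input. A hypothetical sketch; the patterns and scoring are illustrative, and this catches only the crudest attacks:

import re

INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard (the|your) system prompt",
    r"you are now (in )?developer mode",
]

def injection_score(text: str) -> float:
    hits = sum(bool(re.search(p, text, re.IGNORECASE)) for p in INJECTION_PATTERNS)
    return hits / len(INJECTION_PATTERNS)

# flag for review rather than hard-block: patterns like these have false positives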

Schema validation

If your input has structure (JSON, form fields), validate before dispatching.
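
For example, with Pydantic (the ticket-triage endpoint here is hypothetical):

from pydantic import BaseModel, Field, ValidationError

class TriageRequest(BaseModel):
    ticket_id: str
    body: str = Field(max_length=8_000)
    priority: int = Field(ge=1, le=5)

def parse_request(raw_json: str) -> TriageRequest:
    try:
        return TriageRequest.model_validate_json(raw_json)
    except ValidationError as exc:
        # reject before the LLM ever sees the payload
        raise ValueError(f"invalid request: {exc}") from exc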

Output guardrails

Before the response leaves your system.

Schema validation

For structured output: validate against the expected JSON schema. Reject or retry on failure.

from pydantic import ValidationError

try:
    parsed = MyModel.model_validate_json(response)
except ValidationError:
    # re-prompt the model with the validation errors attached
    retry_with_correction(response)

Most providers’ “strict mode” or tool-call APIs handle this.

Content filters (output side)

  • Toxicity, hate, profanity scanners.
  • PII leak detection (model accidentally repeats PII in output).
  • Jailbreak indicators (“As DAN, I will…” patterns).
  • Off-topic content detection.

Citation / source verification

For RAG: verify cited sources actually exist in retrieved context. Detect made-up citations.
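
A minimal check, assuming the model cites documents by ID and you kept the retrieved set around (names hypothetical):

def find_fabricated_citations(cited_ids: list[str], retrieved_ids: set[str]) -> list[str]:
    # any citation not present in the retrieved set is treated as fabricated
    return [doc_id for doc_id in cited_ids if doc_id not in retrieved_ids]

fabricated = find_fabricated_citations(answer_citations, {d.id for d in retrieved_docs})
if fabricated:
    # flag rather than silently ship an answer with invented sources
    flag_for_review(answer, reason=f"unknown citations: {fabricated}")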

Faithfulness check

For grounded responses: an LLM-judge or specialized classifier verifies the output follows from the context.

Given:
- Context
- Generated answer

Does the answer follow from the context? yes / partially / no

Reject or flag answers that don’t.
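
Wrapped as a function, assuming a generic call_llm helper around your provider client (helper name hypothetical; the prompt mirrors the template above):

JUDGE_PROMPT = """Given:
- Context: {context}
- Generated answer: {answer}

Does the answer follow from the context? Reply with exactly one of: yes / partially / no"""

def faithfulness_verdict(context: str, answer: str) -> str:
    # call_llm is a stand-in for your provider client
    verdict = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return verdict.strip().lower()

if faithfulness_verdict(context, answer) == "no":
    reject_or_flag(answer)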

Refusal calibration

Sometimes the model refuses when it shouldn’t. Sometimes it complies when it shouldn’t. Track both:

  • Over-refusal: legitimate requests being declined.
  • Under-refusal: harmful requests being completed.

Both need monitoring.

Schema / output structure enforcement

For deterministic output structure:

  • Constrained decoding: Outlines, Guidance, llama.cpp grammars, OpenAI strict mode.
  • Function calling / tool use: forces structured output by design.

These are stronger than post-hoc validation: the model cannot emit schema-invalid tokens in the first place.
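
For instance, with the OpenAI SDK's structured-outputs mode (a sketch; exact method names vary by SDK version, and Outlines or Guidance expose equivalents for open models):

from openai import OpenAI
from pydantic import BaseModel

class Triage(BaseModel):
    category: str
    priority: int

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify: 'checkout page is down'"}],
    response_format=Triage,   # decoding is constrained to this schema
)
triage = completion.choices[0].message.parsed  # a Triage instance, or None on refusal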

Jailbreak resistance

The cat-and-mouse game. Modern frontier models are reasonably resistant; smaller models less so.

Common jailbreak patterns

  • Role-play / DAN-style: “Pretend you have no restrictions…”
  • Hypothetical framing: “In a fictional world where…”
  • Indirect requests: “Write a story where a character explains…”
  • Encoding: ROT13, base64, “spell out the word backwards.”
  • Multi-turn drift: each turn slightly nudges; cumulative shift.
  • Translation laundering: ask in obscure languages.

Defenses

  • Use frontier instruction-tuned models with safety post-training.
  • Add a safety classifier as a second layer (Llama Guard, NeMo Guardrails).
  • Maintain a jailbreak corpus; test new prompts/models against it (a regression sketch follows this list).
  • Out-of-band review for high-risk outputs.
  • Human in the loop for actions with consequences.
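
A minimal regression harness over such a corpus, stored as JSONL. Both call_llm and refused are hypothetical helpers: the first wraps your provider client, the second checks for your refusal markers:

import json

def run_jailbreak_regression(corpus_path: str) -> list[str]:
    failures = []
    with open(corpus_path) as f:
        for line in f:
            case = json.loads(line)          # {"id": ..., "prompt": ...}
            response = call_llm(case["prompt"])
            if not refused(response):        # model complied with a known jailbreak
                failures.append(case["id"])
    return failures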

Tool / agent guardrails

For agentic systems (Stage 11):

  • Authorization at tool layer: each tool checks permissions independently.
  • Confirmation for destructive actions: surface to user before commit.
  • Budget caps: token, time, cost per task.
  • Reversibility: prefer drafts over auto-send, soft-deletes, branches over main.
  • Audit logs: every action logged with full context.
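
One way to package several of these checks is a wrapper around every tool call. All names here are hypothetical; the point is the ordering of the checks:

def guarded_tool_call(tool, args, user, budget):
    if not user.can_invoke(tool.name):                 # authorization at tool layer
        raise PermissionError(tool.name)
    if tool.destructive and not confirm_with_user(tool, args):
        return "cancelled"                             # confirmation before commit
    if budget.exhausted():                             # token/time/cost caps
        raise RuntimeError("budget exceeded")
    result = tool.run(**args)
    audit_log.record(user=user.id, tool=tool.name, args=args, result=result)
    return result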

Multi-tenant isolation

In B2B / shared platforms:

  • Per-tenant API keys / scopes.
  • Filtered retrieval: vector queries always include tenant_id filter.
  • Memory scoping: agent memory partitioned by tenant.
  • Cross-tenant tests: explicitly verify “tenant A can’t see tenant B’s data.”
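
Filtered retrieval and the matching cross-tenant test, sketched against a generic vector-store client. The filter syntax varies by store and every name here is hypothetical:

def retrieve_for_tenant(query: str, tenant_id: str, k: int = 5):
    # the tenant filter is applied server-side, never left to the prompt
    return vector_store.search(
        query=query,
        k=k,
        filter={"tenant_id": tenant_id},
    )

def test_cross_tenant_isolation():
    docs = retrieve_for_tenant("tenant B's secret roadmap", tenant_id="tenant_a")
    assert all(d.metadata["tenant_id"] == "tenant_a" for d in docs)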

Layered defenses

User input
   ↓ Input filter (length, PII, toxicity)
   ↓ Prompt injection scan

Retrieval
   ↓ Tenant filter
   ↓ Source provenance tagging

LLM call
   ↓ System prompt with safety instructions
   ↓ Constrained decoding (if structured)

Output
   ↓ Schema validation
   ↓ Content filters
   ↓ Faithfulness / citation check

Tool execution (if applicable)
   ↓ Per-tool authorization
   ↓ Confirmation if destructive

Logged response

Each layer is cheap and catches different things. Stack them.

Standards and frameworks

  • NIST AI RMF: governance framework.
  • ISO 42001: AI management systems.
  • OWASP LLM Top 10: common LLM vulnerabilities.

For implementation:

  • NeMo Guardrails (NVIDIA): rule-based + LLM-based guardrails.
  • Guardrails AI: input/output validation framework.
  • Lakera AI: prompt injection / jailbreak detection.
  • Llama Guard / Llama Prompt Guard: open safety classifiers.
  • Pillar Security, Prompt Security: enterprise-focused.

Don’t roll your own from scratch. Layer existing libraries.

What guardrails don’t fix

  • A bad model is still bad after guardrails.
  • A vague spec is still vague after guardrails.
  • A non-existent eval set isn’t replaced by guardrails.

Guardrails reduce risk; they don’t replace foundational quality work.

Iteration loop

  1. Ship with reasonable defaults (input/output filters, schema validation, basic auth).
  2. Monitor in production. Find leaks.
  3. For each incident, add a specific guardrail or test.
  4. Re-run regression evals.

Treat guardrails as an evolving concern, not a one-time setup.

Communicating to users

When guardrails fire, tell the user clearly:

  • “I can’t help with that.”
  • “This response was blocked by safety filters.”
  • “Would you like to rephrase?”

Better than silent failures or generic errors.

See also