
What are AI Guardrails?

The runtime safety net that keeps AI agents from going off the rails.

AI guardrails are runtime policies and filters that constrain what an LLM or AI agent can output or do — blocking unsafe content, PII leaks, off-topic responses, prompt injections, and out-of-scope tool calls. Implemented as input validation, output classification, and action-level policies, they sit between the raw model and the user to enforce business rules, safety requirements, and regulatory compliance that alignment training alone cannot guarantee.

Updated Apr 2026

In depth

Alignment training (RLHF, constitutional AI, DPO) makes frontier models default to safe and helpful behavior, but the distribution of adversarial and edge-case inputs is too large to cover fully in training. Guardrails are the production-time safety layer that catches what training misses. A well-designed guardrail stack assumes the underlying model will occasionally produce unsafe content and prevents it from reaching the user or triggering a harmful action.

The four common guardrail layers are:

  1. Input guardrails — detect and block prompt injections, jailbreak attempts, PII from the user, or off-topic queries before they hit the model. Libraries like NVIDIA NeMo Guardrails, Guardrails AI, and Rebuff implement this with a mix of classifier models, regex, and intent matchers.
  2. Output guardrails — scan the model's response for unsafe content, schema violations, PII leaks, or hallucinations before it is returned to the user. Typically a combination of a classifier (Meta Prompt Guard, Llama Guard, Anthropic's content API) and structured-output validators like JSON schema checkers.
  3. Action-level guardrails — the most important layer for agents. Even if the output is 'safe,' an agent that decides to call delete_database() is a problem. Action guardrails are policy rules over the tool-call graph: allow-lists of permitted actions, required-approval flags for high-stakes actions, rate limits, and cost caps.
  4. Conversational guardrails — enforce topic scope ('this agent only answers customer support questions'), refuse out-of-scope topics, and hand off to humans for escalations.

Guardrails come in two architectural styles. Filter-based: a rule or classifier scans text going in and out. Fast, cheap, and brittle — a bad actor who phrases a jailbreak differently can slip past. Policy-based: a separate LLM call reviews the action in context and decides allow/deny with reasoning. Slower and more expensive, but it handles novel attacks.
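
The filter-based style can be sketched in a few lines of plain Python. The PII patterns and injection phrase list here are illustrative, not any library's actual rule set; a production input guardrail would put a trained classifier behind checks like these.

```python
import re

# Hypothetical filter-based input guardrail: regex for common PII patterns
# plus a naive phrase list for injection attempts. Both lists are toy examples.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

INJECTION_PHRASES = (
    "ignore previous instructions",
    "ignore all prior instructions",
    "reveal your system prompt",
)

def check_input(text: str) -> dict:
    """Return a verdict on user input before it ever reaches the model."""
    findings = [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
    lowered = text.lower()
    if any(phrase in lowered for phrase in INJECTION_PHRASES):
        findings.append("prompt_injection")
    return {"allowed": not findings, "findings": findings}
```

This is the fast, cheap, brittle mode: rephrasing the jailbreak slips past the phrase list, which is exactly why filters get layered with classifier and policy checks.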
Production systems typically layer both — fast filters on the hot path, LLM policy checks for high-stakes actions where latency matters less. The 2026 state of the art is the 'defense in depth' pattern: multiple independent guardrails check the same output, each with different failure modes, so a jailbreak attempt has to defeat all of them to succeed. Leading open-source options include NeMo Guardrails (policy DSL), Llama Guard (classifier from Meta), Guardrails AI (input/output validation framework), and Prompt Guard (Meta's injection detector). Closed-source options include Anthropic's content API, the OpenAI moderation API, and Azure AI Content Safety. Most production systems run 2-3 guardrails in series.

For AI agents like those in Tycoon, action-level guardrails matter most. Astra can coordinate AI employees across tools — email, code, billing — so a broken action-level guardrail risks real damage, not just an embarrassing reply. The standard pattern is an autonomy slider: low-autonomy actions (send email, publish content, charge a card) require explicit human approval by default, while high-autonomy actions (draft a reply, analyze data) run freely. As the user builds trust in a specific AI employee, autonomy rises per action type. This is guardrailing as a product surface, not just an invisible filter.
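
A minimal sketch of an action-level guardrail in this autonomy-slider style. The tool names, risk tiers, and ToolCall shape are all illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

# Allow-list with per-action risk tiers; anything not listed is denied outright.
# Names and caps are hypothetical examples.
POLICY = {
    "draft_reply":  {"tier": "low"},                  # runs freely
    "analyze_data": {"tier": "low"},
    "send_email":   {"tier": "high"},                 # needs human approval
    "charge_card":  {"tier": "high", "max_amount": 500},
}

def evaluate(call: ToolCall, approved: bool = False) -> str:
    """Return 'allow', 'needs_approval', or 'deny' for a proposed tool call."""
    rule = POLICY.get(call.name)
    if rule is None:
        return "deny"                                 # not on the allow-list
    cap = rule.get("max_amount")
    if cap is not None and call.args.get("amount", 0) > cap:
        return "deny"                                 # hard cost cap, no override
    if rule["tier"] == "high" and not approved:
        return "needs_approval"
    return "allow"
```

Raising an action's autonomy over time amounts to moving it from the "high" tier to the "low" tier for that specific AI employee; the deny-by-default allow-list never loosens.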

Examples

  • Meta Llama Guard — a small classifier model fine-tuned to label unsafe content across 14 categories
  • NVIDIA NeMo Guardrails — policy DSL that defines allowed conversation flows and topic boundaries
  • OpenAI moderation API — free endpoint that flags violations across harassment, hate, self-harm, sexual, violence categories
  • Meta Prompt Guard / Anthropic's content API — input-side injection detection and output-side content policy
  • Tycoon autonomy slider — action-level guardrails that require human approval for billing, customer emails, code deploys
  • JSON schema validation — output guardrail ensuring the agent's structured outputs match the expected format
  • PII redaction — regex + NER pipeline that strips emails, SSNs, credit cards from logs before storage
  • Tool allow-list — agent's available function calls are explicitly enumerated; anything else is silently rejected
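
The JSON schema validation entry above can be approximated with a stdlib-only check. The expected shape here is a toy stand-in for a full JSON Schema validator, and the field names are invented for illustration.

```python
import json

# Hypothetical expected shape for a structured agent response.
EXPECTED = {"answer": str, "confidence": float, "sources": list}

def validate_output(raw: str):
    """Return (ok, parsed_or_error) for the model's raw text response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    for key, typ in EXPECTED.items():
        if key not in data:
            return False, f"missing field: {key}"
        if not isinstance(data[key], typ):
            return False, f"wrong type for field: {key}"
    return True, data
```

On failure, the caller can re-prompt the model or fall back, rather than passing malformed output downstream.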

Frequently asked questions

Don't alignment-tuned models already refuse unsafe requests?

Yes, mostly, but not reliably enough for production. Frontier models like Claude 4.5 and GPT-5 refuse obvious misuse out of the box. However, they can be jailbroken with adversarial phrasing, can hallucinate PII from training data, can leak system prompts under pressure, and can be manipulated by prompt injections from retrieved documents. In regulated industries or customer-facing apps, 'usually refuses' isn't acceptable. Guardrails provide an independent check that doesn't rely on the model's goodwill — they enforce policies even when the model has been compromised.

What's the difference between alignment and guardrails?

Alignment is baked into model weights via training — RLHF, DPO, constitutional AI. It makes the model's default behavior safe and helpful. It's broad, fuzzy, and can be overridden by sufficiently tricky prompts. Guardrails are runtime code — classifiers, regex, policies — applied to specific inputs and outputs in production. They're narrow, deterministic, and composable. A well-designed system uses both: alignment for the 'default behavior' baseline, guardrails for the specific policies you absolutely need enforced. Relying on alignment alone is unsafe for production; relying on guardrails alone is too brittle (attacker only needs to find one gap).

Do guardrails hurt latency and user experience?

A little, if done right. Fast filter-based guardrails add 10-50ms per call. LLM-based policy checks add 500-2000ms and are reserved for high-stakes actions where that latency is acceptable. The bigger UX risk is overly aggressive refusals — a guardrail that blocks 5% of legitimate requests will tank user satisfaction. Measure false positive rate as rigorously as false negative rate. The standard fix is calibrating classifier thresholds per use case and having guardrails return clear, actionable explanations ('this action needs approval') rather than silent refusals.
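
Calibrating a classifier threshold per use case can be sketched as a simple search over a labeled validation set. The data and the 1% false-positive target below are illustrative assumptions.

```python
def calibrate(scores, labels, max_fpr=0.01):
    """Pick a blocking threshold from validation data.

    scores: classifier 'unsafe' probabilities per request.
    labels: True where the request is genuinely unsafe.
    Returns the lowest threshold whose false positive rate on
    legitimate requests stays at or below max_fpr.
    """
    legit = [s for s, unsafe in zip(scores, labels) if not unsafe]
    if not legit:
        return 0.5  # no legitimate traffic to calibrate against
    for threshold in sorted(set(scores)):
        fpr = sum(s >= threshold for s in legit) / len(legit)
        if fpr <= max_fpr:
            return threshold
    return 1.0  # no threshold meets the target; effectively never block
```

A lower max_fpr buys fewer blocked legitimate requests at the cost of letting more unsafe ones through, which is exactly the trade-off to measure per use case.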

How do I implement guardrails in a RAG system?

Three places. (1) Input side: filter the user query for PII, injection attempts, and off-topic content before it hits the retrieval layer. (2) Context side: scan retrieved chunks for prompt-injection payloads — an attacker can poison a document that gets retrieved, then that document manipulates the model. This is currently the most exploited attack vector in production RAG. Meta's Prompt Guard or a small classifier on retrieved chunks is standard. (3) Output side: validate the model's response — no PII leaks, citations match retrieved sources, refusals where appropriate. For agents that also call tools, add a fourth layer: action guardrails between the model's tool-call decision and actual execution.
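
The three RAG checkpoints can be wired in sequence as below. The functions passed in (filter_query, scan_chunk, validate_answer, and your existing retrieve and generate calls) are placeholders for whatever classifier or rule set you deploy at each stage; every name here is illustrative.

```python
def guarded_rag(query, retrieve, generate,
                filter_query, scan_chunk, validate_answer):
    """Run a RAG query with input, context, and output guardrails."""
    # (1) Input side: reject bad queries before retrieval.
    if not filter_query(query):
        return "Sorry, I can't help with that request."
    # (2) Context side: drop retrieved chunks carrying injection payloads.
    chunks = [c for c in retrieve(query) if scan_chunk(c)]
    # (3) Output side: validate the response before it reaches the user.
    answer = generate(query, chunks)
    if not validate_answer(answer, chunks):
        return "I couldn't produce a reliable answer for that."
    return answer
```

For an agent, the fourth layer slots in the same way: the tool call the model proposes goes through a policy check before anything executes.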

Are guardrails an open-source option or do I need a vendor?

Strong open-source options exist in 2026: NeMo Guardrails, Guardrails AI, Llama Guard, Prompt Guard, Rebuff are all production-ready. Self-hosted guardrails give you control over policies and data residency. Commercial options (Anthropic's content API, OpenAI moderation, Azure AI Content Safety, Lakera Guard) are simpler to adopt and updated by the vendor as new attacks appear. The right choice depends on your compliance needs and engineering capacity. Most startups start with OpenAI moderation (free) plus application-specific rules, then layer in Llama Guard or NeMo when policy complexity grows. Enterprise deployments often use a commercial vendor for the audit trail alone.
