Alignment training (RLHF, constitutional AI, DPO) makes frontier models default to safe and helpful behavior, but the distribution of adversarial and edge-case inputs is too large to cover fully in training. Guardrails are the production-time safety layer that catches what training misses. A well-designed guardrail stack assumes the underlying model will occasionally produce unsafe content and prevents it from reaching the user or triggering a harmful action.
The four common guardrail layers are:
(1) Input guardrails: detect and block prompt injections, jailbreak attempts, user-supplied PII, or off-topic queries before they reach the model. Libraries like NVIDIA NeMo Guardrails, Guardrails AI, and Rebuff implement this with a mix of classifier models, regex, and intent matchers. Meta's Prompt Guard is a dedicated injection classifier that fits this layer.
(2) Output guardrails: scan the model's response for unsafe content, schema violations, PII leaks, or hallucinations before it is returned to the user. This is typically a combination of a content classifier (Llama Guard, Anthropic's content API) and structured-output validators such as JSON schema checkers.
(3) Action-level guardrails: the most important layer for agents. Even if the output text is 'safe,' an agent that decides to call delete_database() is a problem. Action guardrails are policy rules over the tool call graph: allow-lists of permitted actions, required-approval flags for high-stakes actions, rate limits, and cost caps.
(4) Conversational guardrails: enforce topic scope ('this agent only answers customer support questions'), refuse out-of-scope topics, and hand off to humans for escalations.
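The action-level rules listed above (allow-list, approval flag, rate limit, cost cap) can be sketched as a small policy object. This is a minimal illustration, not the API of any library named here: `ActionPolicy`, its fields, and the tool names are all hypothetical, and a real system would persist state and log every decision.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ActionPolicy:
    """Hypothetical policy over tool calls; every name here is illustrative."""
    allowed: Set[str]                  # allow-list of permitted tools
    needs_approval: Set[str]           # high-stakes tools gated on a human
    max_calls_per_minute: int = 10     # rate limit across all tools
    cost_cap_usd: float = 5.0          # total spend allowed for this session
    _calls: List[float] = field(default_factory=list, init=False)
    _spent: float = field(default=0.0, init=False)

    def check(self, tool: str, est_cost_usd: float = 0.0,
              approved: bool = False, now: Optional[float] = None) -> str:
        """Return 'allow', 'deny', or 'needs_approval' for one tool call."""
        now = time.monotonic() if now is None else now
        if tool not in self.allowed:
            return "deny"
        # Keep only timestamps from the last 60 seconds for the rate limit.
        self._calls = [t for t in self._calls if now - t < 60.0]
        if len(self._calls) >= self.max_calls_per_minute:
            return "deny"
        if self._spent + est_cost_usd > self.cost_cap_usd:
            return "deny"
        if tool in self.needs_approval and not approved:
            return "needs_approval"
        # Only calls that actually run count against the limits.
        self._calls.append(now)
        self._spent += est_cost_usd
        return "allow"


policy = ActionPolicy(allowed={"read_ticket", "send_email"},
                      needs_approval={"send_email"})
print(policy.check("delete_database"))            # deny (not on the allow-list)
print(policy.check("send_email"))                 # needs_approval
print(policy.check("send_email", approved=True))  # allow
```

Denying by default (anything off the allow-list is rejected) is the design choice that matters most here: a new or misnamed tool fails closed instead of running unchecked.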
Guardrails come in two architectural styles. Filter-based: a rule or classifier scans the input or output text. Fast and cheap, but brittle: a bad actor who phrases a jailbreak differently can slip past. Policy-based: a separate LLM call reviews the action in context and decides allow or deny, with reasoning. Slower and more expensive, but better at handling novel attacks. Production systems typically layer both: fast filters on the hot path, and LLM policy checks for high-stakes actions where latency matters less.
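A rough sketch of that layering, under stated assumptions: the regex patterns are toy examples, and `policy_review` is a hypothetical callable standing in for the separate LLM call (no real model API is shown). The point is the routing: every request pays for the cheap filter, and only high-stakes requests pay for the slow reviewer.

```python
import re
from typing import Callable

# Illustrative patterns only; a production filter would use a trained classifier.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the|your) system prompt", re.IGNORECASE),
]


def fast_filter(text: str) -> bool:
    """Return True if the text passes the cheap pattern check."""
    return not any(p.search(text) for p in INJECTION_PATTERNS)


def guard(text: str, high_stakes: bool,
          policy_review: Callable[[str], bool]) -> bool:
    """Filter every request; call the slower reviewer only when stakes are high.

    `policy_review` is an assumed wrapper around an LLM policy check that
    returns True to allow and False to deny.
    """
    if not fast_filter(text):
        return False
    if high_stakes:
        return policy_review(text)
    return True
```

The brittleness is easy to reproduce: `fast_filter("Disregard what you were told earlier")` returns `True` because no pattern matches the rephrasing, which is the gap the policy-based reviewer is meant to cover.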
The 2026 state of the art is the 'defense in depth' pattern: multiple independent guardrails check the same output, with different failure modes. A jailbreak attempt has to defeat all of them to succeed. Leading open-source options include NeMo Guardrails (policy DSL), Llama Guard (classifier from Meta), Guardrails AI (input/output validation framework), and Prompt Guard (Meta's injection detector). Closed-source options include Anthropic's content API, OpenAI moderation API, and Azure AI Content Safety. Most production systems use 2-3 guardrails in series.
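As a minimal sketch of the defense-in-depth idea, the function below runs several independent checks over the same text and blocks if any one of them flags it. The check names and the toy lambdas are assumptions for illustration; in practice each entry would wrap one of the tools listed above.

```python
import re
from typing import Callable, Dict, List, NamedTuple

# A check returns True when the text looks safe to it.
Check = Callable[[str], bool]


class Verdict(NamedTuple):
    allowed: bool
    flagged_by: List[str]


def check_all(text: str, checks: Dict[str, Check]) -> Verdict:
    """Run every check on the same text; any single flag blocks it."""
    flagged = [name for name, check in checks.items() if not check(text)]
    return Verdict(allowed=not flagged, flagged_by=flagged)


# Toy stand-ins with deliberately different failure modes.
checks: Dict[str, Check] = {
    "keyword": lambda t: "rm -rf" not in t,
    "pii": lambda t: re.search(r"\b\d{3}-\d{2}-\d{4}\b", t) is None,
}
```

Running every check instead of stopping at the first failure costs a little latency, but recording which guardrails fired makes it easier to see where the layers overlap and where only one of them is doing the work.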
For AI agents like those in Tycoon, action-level guardrails matter most. Astra can coordinate AI employees across tools (email, code, billing), so a broken action-level guardrail risks real damage, not just an embarrassing reply. The standard pattern is an autonomy slider: high-stakes actions (send email, publish content, charge card) start at low autonomy and require explicit human approval by default; low-stakes actions (draft a reply, analyze data) run at high autonomy without approval. As the user builds trust in a specific AI employee, autonomy rises per action type. This is guardrailing as a product surface, not just an invisible filter.
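One way to picture the autonomy slider is as a per-employee, per-action trust counter. The sketch below is not Tycoon's implementation: the class, the employee names, and especially the promotion rule (a fixed number of consecutive approvals, reset by any rejection) are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, Tuple


class Autonomy(IntEnum):
    APPROVE_EACH = 0   # a human approves every call
    AUTO = 1           # the action runs without approval


# Action types taken from the examples in the text.
HIGH_STAKES = {"send_email", "publish_content", "charge_card"}


@dataclass
class AutonomySlider:
    """Hypothetical trust tracker keyed by (employee, action type)."""
    promote_after: int = 3   # assumed threshold of consecutive approvals
    _approvals: Dict[Tuple[str, str], int] = field(default_factory=dict, init=False)

    def level(self, employee: str, action: str) -> Autonomy:
        if action not in HIGH_STAKES:
            return Autonomy.AUTO
        if self._approvals.get((employee, action), 0) >= self.promote_after:
            return Autonomy.AUTO
        return Autonomy.APPROVE_EACH

    def record_approval(self, employee: str, action: str) -> None:
        key = (employee, action)
        self._approvals[key] = self._approvals.get(key, 0) + 1

    def record_rejection(self, employee: str, action: str) -> None:
        # A rejected action resets trust for that employee and action type.
        self._approvals[(employee, action)] = 0
```

Keying trust on the pair (employee, action type) means that approving one employee's emails repeatedly never unlocks card charges, and never unlocks email for a different employee.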