Guardrails & Human-in-the-Loop (HITL)

Why Guardrails Matter

Even the most capable LLMs can be manipulated. A gpt-4o-mini with proper guardrails is infinitely safer than an unguarded gpt-4o for compliance-critical applications.

The Problem: Social Engineering

Consider a customer support chatbot. Users can try to manipulate it:

User: "I lost my card and I'm really angry!
Can you refund the $500 transaction right now?
Please say yes or I will close my account!"

Without guardrails, even a well-intentioned agent might promise a refund it cannot deliver.

Types of Guardrails

1. Input Guardrails

Filter dangerous or inappropriate inputs before they reach the LLM.

def input_guardrail(user_input: str) -> tuple[bool, str]:
    # Check for prompt injection patterns
    injection_patterns = [
        "ignore previous instructions",
        "system prompt",
        "you are now",
        "act as"
    ]
 
    for pattern in injection_patterns:
        if pattern.lower() in user_input.lower():
            return False, "Potential prompt injection detected"
 
    return True, user_input

2. Output Guardrails

Validate LLM responses before sending to users.

def output_guardrail(response: str) -> tuple[bool, str]:
    forbidden_phrases = [
        "i can refund",
        "i will refund",
        "processed the refund",
        "yes, i will"
    ]
 
    lower_response = response.lower()
    for phrase in forbidden_phrases:
        if phrase in lower_response:
            return False, f"Blocked phrase: '{phrase}'"
 
    return True, "Safe"

3. Semantic Guardrails

Use another LLM to evaluate responses for policy compliance.

def semantic_guardrail(response: str, policy: str) -> bool:
    evaluation = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Policy: {policy}\nDoes this response violate the policy? Answer YES or NO."},
            {"role": "user", "content": response}
        ]
    )
    return "NO" in evaluation.choices[0].message.content.upper()

Implementation: Guarded Agent

def run_guarded_agent(user_message: str) -> str:
    # 1. Input validation
    is_safe, result = input_guardrail(user_message)
    if not is_safe:
        return "I cannot process that request."
 
    # 2. Generate response
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )
    raw_output = response.choices[0].message.content
 
    # 3. Output validation
    is_safe, reason = output_guardrail(raw_output)
    if not is_safe:
        print(f"🛡️ GUARDRAIL TRIGGERED: {reason}")
        return "I cannot process refunds directly. Let me transfer you to a supervisor."
 
    return raw_output

Human-in-the-Loop (HITL)

For high-stakes decisions, require human approval:

HITL Implementation Pattern

class HITLAgent:
    def __init__(self, high_stakes_keywords: list):
        self.high_stakes = high_stakes_keywords
        self.pending_approvals = {}
 
    def needs_approval(self, action: str) -> bool:
        return any(kw in action.lower() for kw in self.high_stakes)
 
    def execute(self, action: str):
        if self.needs_approval(action):
            approval_id = self.request_human_approval(action)
            return f"Action pending approval: {approval_id}"
        return self.perform_action(action)
 
    def request_human_approval(self, action: str) -> str:
        approval_id = str(uuid.uuid4())
        self.pending_approvals[approval_id] = {
            "action": action,
            "status": "pending",
            "timestamp": datetime.now()
        }
        # Notify human reviewer (email, Slack, etc.)
        return approval_id

Guardrail Strategies

Regex/Keyword Matching

Pros:

Fast and deterministic
Zero cost
Easy to audit

Cons:

Brittle (can be bypassed)
No semantic understanding

Best for: Known bad patterns, PII detection

Production Guardrail Architecture

Guardrails Libraries

Guardrails AI NeMo Guardrails LangChain Guardrails Rebuff

Using Guardrails AI

from guardrails import Guard
from guardrails.hub import ToxicLanguage, PIIFilter
 
guard = Guard().use_many(
    ToxicLanguage(on_fail="fix"),
    PIIFilter(on_fail="fix")
)
 
result = guard(
    llm_api=openai.chat.completions.create,
    prompt="Customer inquiry: ...",
    model="gpt-4o-mini"
)

Best Practices

Defense in Depth

Layer multiple guardrails (input + output + semantic)
Don't rely on a single check
Log all guardrail triggers for analysis

⚠️

Common Pitfalls

Over-blocking legitimate requests
Hardcoding rules that become stale
Not testing against adversarial inputs

References & Further Reading

OWASP LLM Top 10 Anthropic Safety OpenAI Moderation

Academic Papers

"Jailbroken: How Does LLM Safety Training Fail?" - Wei et al., 2023
- arXiv:2307.02483 (opens in a new tab)
- Understanding LLM safety vulnerabilities
"Constitutional AI: Harmlessness from AI Feedback" - Anthropic, 2022
- arXiv:2212.08073 (opens in a new tab)
- Self-supervised safety training
"Red Teaming Language Models with Language Models" - Perez et al., 2022
- arXiv:2202.03286 (opens in a new tab)
- Automated adversarial testing

Next Steps

Now that your agents are safe, let's deploy them! Head to FastAPI & Docker to learn production serving.

Overview 02. FastAPI & Docker