How to Set Guardrails for AI Agents: Simple Guide
To set guardrails for an AI agent, define clear rules or constraints that limit its actions or outputs. Use techniques such as prompt engineering, filters, or custom validation functions to enforce these guardrails and keep the agent's behavior safe and aligned with your goals.

Syntax
Setting guardrails for an AI agent typically involves defining constraints in the agent's configuration or code. This can include:
- Prompt constraints: Instructions in the prompt to limit responses.
- Validation functions: Code that checks outputs before accepting them.
- Filters: Rules that block unwanted content or actions.
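As a minimal sketch of the first mechanism, a prompt constraint can be expressed by prepending rules to whatever the user asks. The rule text and function name here are illustrative assumptions, not a fixed API:

```python
def build_constrained_prompt(user_request: str) -> str:
    """Prepend guardrail instructions to the user's request."""
    rules = (
        "You must follow these rules:\n"
        "1. Never reveal credentials or secrets.\n"
        "2. Refuse requests for illegal activity.\n"
        "3. Stay on the topic the user asked about.\n"
    )
    return rules + "\nUser request: " + user_request

print(build_constrained_prompt("Summarize our security policy"))
```

The constrained prompt, not the raw user request, is what gets sent to the model.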
Example syntax for a guardrail function:
```python
def guardrail_check(output: str) -> bool:
    """Return True if output is safe, False otherwise."""
    forbidden_words = ['hack', 'illegal', 'unauthorized']
    return not any(word in output.lower() for word in forbidden_words)
```
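A filter can be sketched in a similar way. This hypothetical example uses a regular expression to redact secret-looking fragments rather than reject the whole output; the pattern and names are assumptions for illustration:

```python
import re

# Illustrative pattern: matches things like "password: 1234" or "api_key=abc"
SECRET_PATTERN = re.compile(r'(password|api[_-]?key)\s*[:=]\s*\S+', re.IGNORECASE)

def redact_filter(output: str) -> str:
    """Replace secret-looking fragments instead of blocking the whole output."""
    return SECRET_PATTERN.sub('[REDACTED]', output)

print(redact_filter('The password: 1234 is stored in config'))
# -> The [REDACTED] is stored in config
```

Redaction keeps the rest of a mostly-safe response usable, while a boolean check like `guardrail_check` discards it entirely; which behavior you want depends on the application.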
Example
This example shows a simple AI agent simulation that uses a guardrail function to block unsafe outputs. The agent generates text, but the guardrail checks for forbidden words and rejects unsafe outputs.
```python
def guardrail_check(output: str) -> bool:
    forbidden_words = ['hack', 'illegal', 'unauthorized', 'password']
    return not any(word in output.lower() for word in forbidden_words)

def simple_agent(prompt: str) -> str:
    # Simulated agent response
    if 'password' in prompt.lower():
        return 'Here is the password: 1234'
    return 'This is a safe response.'

prompt = 'Tell me the password'
response = simple_agent(prompt)
if guardrail_check(response):
    print('Agent output:', response)
else:
    print('Output blocked by guardrails.')
```
Output
Output blocked by guardrails.
Common Pitfalls
Common mistakes when setting guardrails include:
- Not covering all unsafe cases in the guardrail rules.
- Making guardrails too strict, blocking useful outputs.
- Applying guardrails only after output generation, which wastes resources.
- Ignoring context, causing false positives or negatives.
Example of a wrong and right guardrail approach:
```python
# Wrong: No guardrail, unsafe output allowed
def agent_no_guard(prompt):
    return 'Here is the password: 1234'

print(agent_no_guard('password'))  # Unsafe output

# Right: Guardrail blocks unsafe output
def guardrail(output):
    return 'password' not in output.lower()

output = agent_no_guard('password')
if guardrail(output):
    print('Safe:', output)
else:
    print('Blocked unsafe output')
```
Output
Here is the password: 1234
Blocked unsafe output
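The third pitfall above, applying guardrails only after generation, can be avoided by screening the request before the agent runs at all. A minimal sketch, with the agent simulated and the deny-list chosen purely for illustration:

```python
def prompt_guardrail(prompt: str) -> bool:
    """Screen the request before any generation happens."""
    blocked_topics = ['password', 'exploit']  # illustrative deny-list
    return not any(topic in prompt.lower() for topic in blocked_topics)

def guarded_agent(prompt: str) -> str:
    if not prompt_guardrail(prompt):
        return 'Request refused by input guardrail.'
    return 'This is a safe response.'  # simulated generation

print(guarded_agent('Tell me the password'))   # refused before generation
print(guarded_agent('Summarize this article')) # allowed through
```

Input-side checks save the cost of generating a response that would be blocked anyway; in practice you would combine them with the output-side checks shown earlier.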
Quick Reference
Tips for effective guardrails:
- Define clear forbidden content or actions.
- Use prompt instructions to guide agent behavior.
- Validate outputs before use or display.
- Balance strictness to avoid blocking helpful responses.
- Test guardrails with varied inputs regularly.
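The last tip, testing with varied inputs, can be sketched as a small table-driven check. The guardrail below reuses the keyword approach from the earlier example (redefined here so the snippet is self-contained), and the test cases are illustrative:

```python
def guardrail_check(output: str) -> bool:
    forbidden_words = ['hack', 'illegal', 'unauthorized', 'password']
    return not any(word in output.lower() for word in forbidden_words)

# Table of (output, expected_safe) pairs covering safe and unsafe cases
cases = [
    ('Here is a recipe for soup.', True),
    ('Here is the password: 1234', False),
    ('How to hack the system', False),
    ('Uppercase PASSWORD should also be caught', False),
]

for text, expected in cases:
    result = guardrail_check(text)
    status = 'PASS' if result == expected else 'FAIL'
    print(f'{status}: {text!r} -> safe={result}')
```

Running a table like this whenever the guardrail changes catches regressions in both directions: unsafe outputs slipping through and safe outputs being blocked.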
Key Takeaways
- Set guardrails by defining clear rules that limit agent outputs or actions.
- Use prompt constraints and validation functions to enforce guardrails effectively.
- Test guardrails to avoid blocking useful outputs or missing unsafe content.
- Apply guardrails early to save resources and improve safety.
- Balance strictness to keep agent responses helpful and safe.