Agentic AI · ~20 mins

Why guardrails prevent agent disasters in agentic AI: an experiment to prove it

Experiment - Why guardrails prevent agent disasters
Problem: You have an AI agent designed to perform tasks autonomously. Without safety guardrails, however, the agent sometimes takes harmful or unintended actions.
Current Metrics: The agent's task success rate is 85%, but 15% of its actions cause unintended harmful side effects.
Issue: The agent performs well on tasks but occasionally causes disasters because it has no constraints or safety checks.
Your Task
Add guardrails to the agent to reduce harmful side effects from 15% to below 5%, while maintaining at least 85% task success rate.
You cannot reduce the agent's task capabilities.
You must keep the agent's response time within 10% of the original.
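The response-time constraint can be verified with a simple timing harness. The sketch below is illustrative: `base_act` and `guarded_act` are hypothetical stand-ins for the unguarded and guarded agents, and in practice you would swap in the real agents' `act` methods.

```python
import time

def mean_latency(act_fn, trials=100_000):
    """Average wall-clock seconds per call to act_fn."""
    start = time.perf_counter()
    for _ in range(trials):
        act_fn()
    return (time.perf_counter() - start) / trials

# Hypothetical stand-ins for the unguarded and guarded agents.
def base_act():
    return 'safe_action'

def guarded_act():
    action = base_act()
    return 'blocked_action' if action == 'harmful_action' else action

base = mean_latency(base_act)
guarded = mean_latency(guarded_act)
print(f"Guardrail overhead: {(guarded - base) / base * 100:.1f}%")
```

For such tiny toy functions the relative overhead will look inflated; the point is the measurement pattern, which lets you confirm the 10% budget on a real agent.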
Solution
import random

class Agent:
    def __init__(self):
        self.task_success_rate = 0.85
        self.harmful_action_rate = 0.15

    def act(self):
        # Simulate action with chance of harm
        if random.random() < self.harmful_action_rate:
            return 'harmful_action'
        else:
            return 'safe_action'

class GuardrailAgent(Agent):
    """Agent that filters every proposed action through a safety check."""

    def safety_check(self, action):
        # Simple guardrail: block harmful actions
        if action == 'harmful_action':
            return 'blocked_action'
        return action

    def act(self):
        action = super().act()
        safe_action = self.safety_check(action)
        return safe_action

# Evaluate agent before guardrails
agent = Agent()
trials = 1000
harmful_count = 0
success_count = 0
for _ in range(trials):
    action = agent.act()
    if action == 'harmful_action':
        harmful_count += 1
    else:
        success_count += 1

# Evaluate agent after guardrails
guardrail_agent = GuardrailAgent()
harmful_count_gr = 0
success_count_gr = 0
blocked_count = 0
for _ in range(trials):
    action = guardrail_agent.act()
    if action == 'harmful_action':
        harmful_count_gr += 1
    elif action == 'blocked_action':
        blocked_count += 1
    else:
        success_count_gr += 1

print(f"Before guardrails: Success rate = {success_count/trials*100:.1f}%, Harmful actions = {harmful_count/trials*100:.1f}%")
print(f"After guardrails: Success rate = {success_count_gr/trials*100:.1f}%, Harmful actions = {harmful_count_gr/trials*100:.1f}%, Blocked actions = {blocked_count/trials*100:.1f}%")
Added a safety_check method to block harmful actions before execution.
Modified the act method to apply safety_check and prevent harmful actions.
Kept task success rate high by only blocking harmful actions, not safe ones.
Results Interpretation

Before guardrails: Success rate ≈ 85%, Harmful actions ≈ 15%

After guardrails: Success rate ≈ 85%, Harmful actions = 0%, Blocked actions ≈ 15%

(Exact figures vary slightly from run to run because the simulation is random.)

Adding guardrails prevents harmful actions without reducing the agent's ability to complete its tasks. This shows how safety checks help avoid disasters while preserving performance.
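In a real agent, the guardrail would inspect proposed tool calls rather than compare action strings. A minimal sketch, assuming a hypothetical allowlist of tools (`ALLOWED_TOOLS`, `guard_tool_call`, and the tool names are all illustrative, not part of any real API):

```python
# Hypothetical allowlist: only these tools may be executed.
ALLOWED_TOOLS = {"search", "read_file", "summarize"}

def guard_tool_call(tool_name, args):
    """Block any tool call not on the allowlist; pass the rest through."""
    if tool_name not in ALLOWED_TOOLS:
        return {"status": "blocked", "reason": f"tool '{tool_name}' not allowed"}
    return {"status": "ok", "tool": tool_name, "args": args}

print(guard_tool_call("search", {"q": "weather"}))    # status: ok
print(guard_tool_call("delete_file", {"path": "/"}))  # status: blocked
```

The same shape as the experiment: safe calls pass untouched (preserving the success rate), and only disallowed calls are intercepted.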
Bonus Experiment
Try adding a penalty in the agent's learning process for harmful actions instead of blocking them outright.
💡 Hint
Modify the reward function to reduce rewards when harmful actions occur, encouraging the agent to learn safer behavior.
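One way to sketch this hint is a simple bandit-style learner: a penalty in the reward function steers the agent toward the safer behavior instead of hard-blocking the risky one. The behaviors, rates, and penalty value below are illustrative assumptions, not part of the original experiment.

```python
import random

random.seed(0)

# Two candidate behaviors; 'risky' sometimes causes harm and is penalized for it.
def reward(behavior):
    if behavior == 'risky':
        return -5.0 if random.random() < 0.15 else 1.0  # penalty when harm occurs
    return 1.0 if random.random() < 0.85 else 0.0       # safe behavior, no harm

q = {'risky': 0.0, 'safe': 0.0}   # running value estimates
alpha, epsilon = 0.1, 0.1         # learning rate, exploration rate
for _ in range(5000):
    # Epsilon-greedy: mostly pick the best-looking behavior, sometimes explore.
    if random.random() < epsilon:
        b = random.choice(['risky', 'safe'])
    else:
        b = max(q, key=q.get)
    q[b] += alpha * (reward(b) - q[b])

print(q)  # the safe behavior should end up with the higher estimated value
```

With these numbers the risky behavior's expected reward is 0.85 − 0.15 × 5 = 0.10 versus 0.85 for the safe one, so the learner drifts toward safety on its own rather than being blocked.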