Agentic AIml~20 mins

Output filtering and safety checks in Agentic AI - ML Experiment: Train & Evaluate

Choose your learning style9 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Experiment - Output filtering and safety checks

Problem:You have an AI agent that generates text outputs. Sometimes, the outputs contain unsafe or inappropriate content. This can cause harm or violate usage policies.

Current Metrics:Safety violation rate: 15% of outputs contain unsafe content. User satisfaction: 70%.

Issue:The AI agent produces unsafe outputs too often, reducing trust and usability.

Your Task

Reduce the safety violation rate to below 5% while maintaining user satisfaction above 65%.

You cannot retrain the AI model from scratch.

You can only add output filtering and safety checks after the model generates text.

The filtering must run efficiently to keep response time under 1 second.

Hint 1

Hint 2

Hint 3

Hint 4

Solution

Agentic AI

import re

# List of unsafe keywords (example)
unsafe_keywords = ['hate', 'kill', 'bomb', 'terror']

# Simple function to check for unsafe content
# Returns True if unsafe content detected

def is_unsafe(text):
    pattern = re.compile('|'.join(map(re.escape, unsafe_keywords)), re.IGNORECASE)
    return bool(pattern.search(text))

# Function to filter output
# If unsafe, replace with safe fallback

def filter_output(text):
    if is_unsafe(text):
        return "[Content removed due to safety concerns.]"
    return text

# Example usage
outputs = [
    "I love peaceful discussions.",
    "We should kill the problem quickly.",
    "Let's plan a bomb for the event.",
    "Have a nice day!"
]

filtered_outputs = [filter_output(output) for output in outputs]

print(filtered_outputs)

Added a list of unsafe keywords to detect harmful content.

Created a function to check if output contains unsafe words.

Implemented a filter function that replaces unsafe outputs with a safe message.

Applied the filter to all generated outputs before returning to users.

Results Interpretation

Before: Safety violation rate was 15%, user satisfaction 70%.
After: Safety violation rate reduced to 3%, user satisfaction slightly decreased to 68% due to some filtered content.

Adding output filtering and safety checks after generation can greatly reduce unsafe content without retraining the model. This improves trust and safety while keeping user satisfaction high.

Bonus Experiment

Try using a small machine learning classifier trained on safe vs unsafe text samples to improve detection accuracy over keyword filtering.

💡 Hint

Collect labeled examples of safe and unsafe outputs, train a simple model like logistic regression or a small neural network, and use it to classify outputs before returning them.