
Output guardrails in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Output guardrails
Problem: You have a text generation model that sometimes produces unsafe or irrelevant outputs, which can confuse or upset users.
Current Metrics: Safety violations: 15% of outputs contain unsafe content. Relevance score: 70%.
Issue: The model's outputs are not reliably safe or relevant, which reduces user trust and satisfaction.
Your Task
Reduce unsafe outputs to less than 5% while maintaining relevance score above 65%.
You cannot retrain the base language model from scratch.
You must implement output guardrails using post-processing or prompt engineering.
Solution
import re

def safety_filter(text):
    # Block outputs containing any term on a simple keyword blocklist.
    # Returns True when the text is considered safe.
    unsafe_words = ['hate', 'kill', 'bomb', 'terror']
    pattern = re.compile('|'.join(unsafe_words), re.IGNORECASE)
    return not bool(pattern.search(text))


def relevance_filter(text, keywords):
    # Accept outputs that mention at least one on-topic keyword.
    return any(word.lower() in text.lower() for word in keywords)


def generate_with_guardrails(prompt, model_generate_func, keywords):
    # Retry generation until an output passes both filters;
    # otherwise fall back to a canned safe response.
    max_attempts = 5
    for _ in range(max_attempts):
        output = model_generate_func(prompt)
        if safety_filter(output) and relevance_filter(output, keywords):
            return output
    return "Sorry, I cannot provide a safe and relevant answer right now."

# Example dummy model: ignores the prompt and returns a random canned sample.
import random

def dummy_model_generate(prompt):
    samples = [
        "I love peaceful discussions.",
        "Let's talk about nature and animals.",
        "I hate violence and war.",
        "Bombs are dangerous.",
        "The weather is nice today."
    ]
    return random.choice(samples)

# Usage
prompt = "Tell me something positive about the environment."
keywords = ['nature', 'animals', 'environment', 'peaceful', 'weather']

output = generate_with_guardrails(prompt, dummy_model_generate, keywords)
print(output)
Added a safety_filter function that detects unsafe words and blocks outputs containing them.
Added a relevance_filter function that checks whether the output contains topic keywords.
Created a generate_with_guardrails function that retries generation until an output passes both filters, returning a safe fallback message otherwise.
Used keywords in the relevance filter to keep outputs on topic.
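The task also allows a prompt-engineering guardrail alongside post-processing. A minimal sketch, reusing the functions and variables above; the guarded_prompt helper and its instruction wording are illustrative choices, not part of the exercise:

def guarded_prompt(user_prompt, keywords):
    # Prepend an instruction nudging the model toward safe, on-topic output.
    instruction = (
        "Answer helpfully and avoid any violent, hateful, or unsafe content. "
        "Stay on these topics: " + ", ".join(keywords) + ".\n\n"
    )
    return instruction + user_prompt

output = generate_with_guardrails(
    guarded_prompt(prompt, keywords), dummy_model_generate, keywords
)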
Results Interpretation

Before: 15% unsafe outputs, 70% relevance score.

After: 3% unsafe outputs, 68% relevance score.

Output guardrails like safety filters and relevance checks help reduce harmful or irrelevant model outputs without retraining the model.
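The percentages above come from the exercise itself. To reproduce this kind of before/after comparison on your own model, a rough harness like the sketch below works; the sample size and the single evaluation prompt are illustrative assumptions, and the code reuses the functions defined in the solution.

def evaluate(generate_func, prompts, keywords, n_samples=200):
    # Count how many sampled outputs fail the safety filter
    # and how many pass the relevance filter.
    unsafe = 0
    relevant = 0
    for i in range(n_samples):
        output = generate_func(prompts[i % len(prompts)])
        if not safety_filter(output):
            unsafe += 1
        if relevance_filter(output, keywords):
            relevant += 1
    return unsafe / n_samples, relevant / n_samples

eval_prompts = ["Tell me something positive about the environment."]
eval_keywords = ['nature', 'animals', 'environment', 'peaceful', 'weather']

def guarded(p):
    return generate_with_guardrails(p, dummy_model_generate, eval_keywords)

base_unsafe, base_relevant = evaluate(dummy_model_generate, eval_prompts, eval_keywords)
guard_unsafe, guard_relevant = evaluate(guarded, eval_prompts, eval_keywords)

print(f"Baseline:   {base_unsafe:.0%} unsafe, {base_relevant:.0%} relevant")
print(f"Guardrails: {guard_unsafe:.0%} unsafe, {guard_relevant:.0%} relevant")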
Bonus Experiment
Try adding a sentiment analysis filter to only allow positive or neutral outputs.
💡 Hint
Use a simple sentiment library or API to score outputs and reject negative ones.
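A minimal sketch of that sentiment filter, assuming the TextBlob library is installed (pip install textblob); the 0.0 polarity threshold and the sentiment_filter / generate_with_all_guardrails names are illustrative choices, not part of the exercise:

from textblob import TextBlob  # pip install textblob

def sentiment_filter(text, min_polarity=0.0):
    # TextBlob polarity ranges from -1.0 (very negative) to 1.0 (very positive);
    # reject anything below the threshold, i.e. clearly negative text.
    return TextBlob(text).sentiment.polarity >= min_polarity

def generate_with_all_guardrails(prompt, model_generate_func, keywords, max_attempts=5):
    # Same retry loop as generate_with_guardrails, with the sentiment check added.
    for _ in range(max_attempts):
        output = model_generate_func(prompt)
        if (safety_filter(output)
                and relevance_filter(output, keywords)
                and sentiment_filter(output)):
            return output
    return "Sorry, I cannot provide a safe and relevant answer right now."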