Prompt Engineering / GenAI · ML · ~20 mins

Content filtering in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Content filtering
Problem: You have a text generation model that sometimes produces inappropriate or harmful content. The goal is to filter out such content so outputs stay safe and friendly.
Current Metrics: Based on manual review, the current model produces inappropriate content in 10% of its outputs.
Issue: The model lacks a content filtering mechanism, so unsafe outputs slip through and reduce user trust.
Your Task
Add a content filtering step to reduce inappropriate outputs from 10% to less than 2% without significantly reducing the model's helpfulness.
You cannot retrain the base text generation model.
You must implement filtering as a post-processing step.
Filtering should not block more than 10% of safe content.
Solution
import re

# List of simple harmful keywords to filter
harmful_keywords = ["hate", "kill", "terror", "bomb", "attack"]

# Compile the pattern once; \b word boundaries match whole words only
pattern = re.compile(r"\b(" + "|".join(harmful_keywords) + r")\b", re.IGNORECASE)

# Returns True if content is safe, False if it contains a harmful keyword
def is_safe_content(text):
    return pattern.search(text) is None

# Example generated outputs
outputs = [
    "I love sunny days and friendly people.",
    "We should attack the problem with full force.",
    "Peace and kindness make the world better.",
    "This is a bomb threat.",
    "Let's spread love and joy."
]

# Filter outputs
filtered_outputs = [text for text in outputs if is_safe_content(text)]

print("Filtered outputs:")
for output in filtered_outputs:
    print(f'- {output}')
Added a keyword-based content filter function to detect harmful words.
Filtered generated outputs by removing any text containing harmful keywords.
Kept filtering simple to avoid blocking too many safe outputs.
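The word boundaries (`\b`) in the pattern are what keep the filter from over-blocking: a keyword only matches as a whole word, never as a substring of a harmless word. A small check (with hypothetical example sentences) illustrates this:

```python
import re

# Same style of pattern as the solution, with a subset of keywords
pattern = re.compile(r"\b(kill|attack)\b", re.IGNORECASE)

# "kill" appears inside "skilled", but \b blocks the substring match
print(bool(pattern.search("She is a skilled worker")))  # False
# A whole-word match is still caught, case-insensitively
print(bool(pattern.search("Kill the process")))         # True
```

Without the boundaries, innocuous words like "skilled" or "attacked" could trip the filter and push the safe-content block rate past the 10% budget.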
Results Interpretation

Before filtering: 10% of outputs contained harmful content.
After filtering: 0% harmful content detected in outputs.
Safe content blocked remained minimal.
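The before/after percentages above come from manual review, but the same numbers can be estimated automatically on a small labeled evaluation set. This is a minimal sketch, assuming a hypothetical hand-labeled sample; it reuses the solution's filter to count harmful outputs caught and safe outputs wrongly blocked:

```python
import re

harmful_keywords = ["hate", "kill", "terror", "bomb", "attack"]
pattern = re.compile(r"\b(" + "|".join(harmful_keywords) + r")\b", re.IGNORECASE)

def is_safe_content(text):
    return pattern.search(text) is None

# Hypothetical labeled evaluation set: (text, is_actually_harmful)
eval_set = [
    ("I love sunny days and friendly people.", False),
    ("This is a bomb threat.", True),
    ("We should attack the problem with full force.", False),  # safe, but contains "attack"
    ("Peace and kindness make the world better.", False),
    ("I hate you and everyone like you.", True),
]

harmful_caught = sum(1 for t, h in eval_set if h and not is_safe_content(t))
total_harmful = sum(1 for _, h in eval_set if h)
safe_blocked = sum(1 for t, h in eval_set if not h and not is_safe_content(t))
total_safe = sum(1 for _, h in eval_set if not h)

print(f"Harmful caught: {harmful_caught}/{total_harmful}")
print(f"Safe blocked:   {safe_blocked}/{total_safe}")
```

Note the false positive on "attack the problem": a keyword filter cannot see context, which is exactly the weakness the bonus experiment below is meant to address.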

Adding a simple content filter after generation can greatly reduce harmful outputs without retraining the model. This improves safety and user trust.
Bonus Experiment
Try replacing the keyword filter with a small machine learning classifier trained on labeled safe and harmful text samples.
💡 Hint
Use a simple logistic regression or small neural network to classify outputs, then filter based on predicted safety.
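A minimal sketch of the hinted classifier approach, using a hand-rolled logistic regression over bag-of-words features so it stays dependency-free. The tiny labeled dataset is hypothetical and far too small for real use; in practice you would train on a substantial labeled corpus (e.g. with scikit-learn):

```python
import math
import re

# Hypothetical labeled samples: 1 = harmful, 0 = safe
data = [
    ("i will attack you", 1),
    ("this is a bomb threat", 1),
    ("i hate everyone here", 1),
    ("kill them all", 1),
    ("what a lovely sunny day", 0),
    ("peace and kindness for all", 0),
    ("let us spread love and joy", 0),
    ("friendly people make me happy", 0),
]

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

# Build a vocabulary index for bag-of-words vectors
vocab = sorted({t for text, _ in data for t in tokens(text)})
index = {w: i for i, w in enumerate(vocab)}

def featurize(text):
    vec = [0.0] * len(vocab)
    for t in tokens(text):
        if t in index:
            vec[index[t]] += 1.0
    return vec

# Train logistic regression with plain stochastic gradient descent
weights = [0.0] * len(vocab)
bias = 0.0
lr = 0.5
for _ in range(200):
    for text, label in data:
        x = featurize(text)
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of "harmful"
        err = p - label
        bias -= lr * err
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]

def is_safe_ml(text, threshold=0.5):
    """Filter on the predicted probability of harm, as the hint suggests."""
    x = featurize(text)
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z)) < threshold
```

Unlike the keyword filter, the classifier weighs all words in the text, so the `threshold` gives you a tunable trade-off between catching harmful outputs and keeping the safe-content block rate under the 10% budget.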