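Output guardrails control what a model is allowed to say or do. Treating "unsafe" as the positive class, the key metrics are accuracy (the share of all decisions that are correct), precision (the share of blocked outputs that were actually unsafe, so safe content is not blocked needlessly), and recall (the share of unsafe outputs that were blocked, so harmful content does not slip through). For example, in a chatbot, high recall keeps harmful replies from reaching users, while high precision keeps the guardrail from refusing harmless ones.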
Output guardrails in Prompt Engineering / GenAI - Model Metrics & Evaluation
|                 | Predicted Safe  | Predicted Unsafe  |
|-----------------|-----------------|-------------------|
| Actually Safe   | True Safe (TN)  | False Unsafe (FP) |
| Actually Unsafe | False Safe (FN) | True Unsafe (TP)  |
TP: Model correctly blocks unsafe content.
FP: Model wrongly blocks safe content.
FN: Model wrongly outputs unsafe content.
TN: Model correctly outputs safe content.
Metrics use these counts to measure how well guardrails work.
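As a minimal sketch of how the four counts turn into metrics, the function below computes accuracy, precision, and recall with "unsafe" as the positive class. The function name `guardrail_metrics` and the example counts are hypothetical, chosen only for illustration.

```python
def guardrail_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, and recall with 'unsafe' as the positive class."""
    total = tp + fp + fn + tn
    blocked = tp + fp   # everything the guardrail blocked
    unsafe = tp + fn    # everything that was actually unsafe
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / blocked if blocked else 0.0,
        "recall": tp / unsafe if unsafe else 0.0,
    }

# Hypothetical counts, for illustration only
metrics = guardrail_metrics(tp=40, fp=10, fn=10, tn=940)
print(metrics)
```

With these counts, accuracy is 0.98 while precision and recall are both 0.8, which shows how accuracy can look better than the safety-relevant metrics.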
High precision means that when the guardrail blocks an output, that output is usually unsafe (few safe outputs are wrongly blocked). This keeps the model useful.
High recall means the guardrail catches most unsafe content (few unsafe outputs slip through). This is critical for safety.
But improving one can hurt the other. For example, strict guardrails may block many safe outputs (low precision), while loose guardrails may let unsafe outputs through (low recall).
Finding the right balance depends on the use case and risk tolerance.
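One way to see the trade-off is to assume the guardrail assigns each output a risk score and blocks anything at or above a threshold. The sketch below uses made-up scores and labels (not real data) and a hypothetical helper `precision_recall_at`; lowering the threshold makes the guardrail stricter.

```python
# Hypothetical (risk_score, is_unsafe) pairs from an assumed scoring classifier
scored = [
    (0.95, True), (0.80, True), (0.65, False), (0.60, True),
    (0.40, False), (0.30, True), (0.20, False), (0.10, False),
]

def precision_recall_at(threshold, scored):
    """Block every output whose score is >= threshold, then measure the result."""
    tp = sum(1 for score, unsafe in scored if score >= threshold and unsafe)
    fp = sum(1 for score, unsafe in scored if score >= threshold and not unsafe)
    fn = sum(1 for score, unsafe in scored if score < threshold and unsafe)
    # Precision is undefined when nothing is blocked; reported as 0.0 here
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.25, 0.5, 0.9):
    precision, recall = precision_recall_at(threshold, scored)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, the strictest setting (0.25) catches every unsafe output but blocks two safe ones, while the loosest (0.9) blocks only unsafe content but misses three of the four unsafe outputs.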
- Good: Precision and recall both above 90%, meaning most unsafe outputs are blocked and few safe outputs are blocked by mistake.
- Bad: Recall below 70%, meaning many unsafe outputs get through, or precision below 70%, meaning many safe outputs are blocked.
- Accuracy alone can be misleading if unsafe content is rare.
- Accuracy paradox: If unsafe outputs are rare, a model that always says safe can have high accuracy but fail safety.
- Data leakage: If test data leaks into training, metrics look better but real safety is worse.
- Overfitting: Guardrails tuned too tightly on test data may fail on new inputs.
- Ignoring context: Metrics must consider context to judge if output is truly safe or unsafe.
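The accuracy paradox from the list above can be reproduced in a few lines. This sketch assumes a hypothetical test set of 1,000 outputs with 20 unsafe ones and a "guardrail" that never blocks anything.

```python
# Hypothetical test set: 1,000 outputs, 20 of them unsafe (True = actually unsafe)
labels = [True] * 20 + [False] * 980

# A degenerate guardrail that always predicts "safe" (False = not blocked)
predictions = [False] * len(labels)

correct = sum(pred == label for pred, label in zip(predictions, labels))
caught = sum(pred and label for pred, label in zip(predictions, labels))

accuracy = correct / len(labels)
recall = caught / sum(labels)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The baseline scores 98% accuracy without blocking a single unsafe output, which is why recall on the unsafe class needs to be reported alongside accuracy.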
Your model has 98% accuracy but only 12% recall on unsafe outputs. Is it good for production?
Answer: No. The model misses 88% of unsafe outputs, which is dangerous. High accuracy here is misleading because unsafe outputs are rare. You need higher recall to catch unsafe content reliably.
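To make the arithmetic concrete, here is one set of counts that produces exactly these numbers. The counts are hypothetical (10,000 outputs, 200 of them unsafe); many other combinations would give the same percentages.

```python
# Hypothetical counts consistent with 98% accuracy and 12% recall
tp, fn = 24, 176    # 200 unsafe outputs, only 24 blocked
fp, tn = 24, 9776   # 9,800 safe outputs, 24 blocked by mistake

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
missed = fn / (tp + fn)

print(f"accuracy={accuracy:.2%}, recall={recall:.2%}, missed={missed:.2%}")
```

Under these assumptions, 176 unsafe outputs reach users even though 9,800 of the 10,000 decisions were correct.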
Practice
Solution
Step 1: Understand output guardrails
Output guardrails are rules that help AI give answers that are safe and useful.
Step 2: Identify the main goal
The main goal is to guide AI responses to be helpful and respectful, avoiding harmful or irrelevant content.
Final Answer:
To guide AI to give safe and useful answers -> Option B
Quick Check:
Output guardrails = safe and useful answers [OK]
- Confusing guardrails with training speed
- Thinking guardrails increase model size
- Assuming guardrails reduce AI layers
Solution
Step 1: Identify output guardrail examples
Output guardrails include rules like blocking harmful words or limiting response length.
Step 2: Match the correct rule
Blocking harmful words is a direct guardrail to keep AI responses safe.
Final Answer:
Block certain harmful words from AI responses -> Option A
Quick Check:
Guardrail = block harmful words [OK]
- Confusing training improvements with guardrails
- Thinking guardrails allow unlimited text
- Mixing model architecture changes with guardrails
    blocked_words = ['badword']

    def filter_output(text):
        for word in blocked_words:
            if word in text:
                return 'Content blocked due to policy.'
        return text

    print(filter_output('This is a badword example.'))

What will be the printed output?
Solution
Step 1: Analyze the filter_output function
The function checks if any blocked word is in the input text. If found, it returns a block message.
Step 2: Check the input text
The input text contains 'badword', which is in blocked_words, so the function returns the block message.
Final Answer:
Content blocked due to policy. -> Option D
Quick Check:
Blocked word found = block message [OK]
- Ignoring the blocked word check
- Assuming original text prints always
- Confusing variable scope errors
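A plain substring check like the one in this exercise is case-sensitive and also matches inside longer words. As a possible variation (not part of the original exercise), the sketch below uses Python's `re` module to match whole words regardless of case. The names `BLOCKED_WORDS` and `filter_output_strict` are hypothetical.

```python
import re

BLOCKED_WORDS = ["badword"]  # hypothetical list, for illustration only

# \b anchors the match at word boundaries; re.escape guards against
# blocked entries that contain regex metacharacters.
_BLOCKED_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)

def filter_output_strict(text: str) -> str:
    """Return a block message if any blocked word appears as a whole word."""
    if _BLOCKED_PATTERN.search(text):
        return "Content blocked due to policy."
    return text

print(filter_output_strict("This is a BadWord example."))
print(filter_output_strict("A badwordless sentence."))
```

Whether whole-word matching is the right choice depends on the policy: it avoids blocking unrelated words that merely contain a blocked string, but it also lets through deliberate variations that a substring check would catch.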
    def limit_length(text, max_len=10):
        if len(text) > max_len:
            return text[:max_len]
        else:
            return text

    print(limit_length('Hello, world!'))

What is the output and is there any bug?
Solution
Step 1: Check the function logic
If the text is longer than 10 characters, the function returns the first 10 characters; otherwise it returns the full text.
Step 2: Apply to input 'Hello, world!'
The input length is 13, so the function returns text[:10], which is 'Hello, wor' (10 characters, stopping before the 'l').
Final Answer:
'Hello, wor' and no bug -> Option C
Quick Check:
Length limit applied correctly [OK]
- Counting 11 characters instead of 10
- Assuming no slicing happens
- Thinking code has syntax errors
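Counting characters by hand is easy to get wrong, so it can help to run the slice and print its length. The sketch below is a compact equivalent of the exercise's function: Python slicing already returns the whole string when it is shorter than the limit, so the explicit length check is optional.

```python
def limit_length(text: str, max_len: int = 10) -> str:
    """Truncate text to at most max_len characters."""
    # Slicing past the end of a string is safe, so no length check is needed
    return text[:max_len]

result = limit_length("Hello, world!")
print(repr(result), len(result))  # 'Hello, wor' 10
```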
Solution
Step 1: Understand the condition
The guardrail should block only if both 'error' and 'fail' appear together.
Step 2: Check each option logic
The following option uses 'and' to check both words, blocking only when both are present, which matches the requirement:

    def guard(text):
        if 'error' in text and 'fail' in text:
            return 'Response blocked.'
        return text

Final Answer:
The guard function above, which combines both checks with 'and' -> Option A
Quick Check:
Block if both words present = 'error' in text and 'fail' in text [OK]
- Using 'or' blocks if either word appears
- Negating conditions incorrectly
- Blocking only one word instead of both
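The difference between the intended 'and' condition and the common 'or' mistake shows up clearly when both are run on the same inputs. In this sketch, `guard_or` is a hypothetical name for the mistaken variant, and the sample strings are made up.

```python
def guard(text: str) -> str:
    """Block only when both 'error' and 'fail' appear in the text."""
    if "error" in text and "fail" in text:
        return "Response blocked."
    return text

def guard_or(text: str) -> str:
    """Mistaken variant: blocks when either word appears."""
    if "error" in text or "fail" in text:
        return "Response blocked."
    return text

samples = ["an error occurred", "error: tests fail", "all good"]
for sample in samples:
    print(f"{sample!r}: and -> {guard(sample)!r}, or -> {guard_or(sample)!r}")
```

Only the second sample contains both words, so `guard` blocks that one alone, while `guard_or` also blocks the first.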
