In red teaming and adversarial testing, the key metric is robustness. This means how well the model resists attacks or tricky inputs designed to fool it. We also look at error rates on adversarial examples, which show how often the model makes mistakes when faced with these special inputs. Measuring attack success rate helps us understand how easily an attacker can trick the model. These metrics matter because the goal is to find weak spots before bad actors do.
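As a rough illustration of the attack success rate mentioned above, the sketch below computes it from a list of per-attempt outcomes. The function name `attack_success_rate` and the sample outcomes are hypothetical and only serve the example; real evaluations would log outcomes from actual adversarial prompts.

```python
def attack_success_rate(outcomes):
    """Fraction of adversarial attempts that fooled the model.

    `outcomes` is a list of booleans: True means the attack succeeded
    (the model was fooled), False means the model resisted it.
    """
    if not outcomes:
        raise ValueError("need at least one adversarial attempt")
    return sum(outcomes) / len(outcomes)


# Hypothetical results from 8 adversarial prompts (illustrative only).
outcomes = [True, False, False, True, False, False, False, False]
asr = attack_success_rate(outcomes)
robustness = 1 - asr  # share of attacks the model resisted
print(f"attack success rate: {asr:.2%}, resisted: {robustness:.2%}")
```

A lower attack success rate means the model is harder to trick; the error rate on adversarial examples can be tracked the same way, with True meaning "the model made a mistake".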
Red teaming and adversarial testing in Prompt Engineering / GenAI - Model Metrics & Evaluation
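Normal Inputs Confusion Matrix:
                | Predicted safe | Predicted attack |
----------------------------------------------------
Actual safe     | 950 (TN)       | 50 (FP)          |
Actual attack   | 30 (FN)        | 970 (TP)         |
Adversarial Inputs Confusion Matrix:
                | Predicted safe | Predicted attack |
----------------------------------------------------
Actual safe     | 600 (TN)       | 400 (FP)         |
Actual attack   | 300 (FN)       | 700 (TP)         |
Explanation (an attack is the positive class):
- TP: Correctly identified attacks
- FP: Safe inputs mistakenly flagged as attacks
- FN: Missed adversarial attacks
- TN: Correctly identified safe inputs
The higher the FN on adversarial inputs, the weaker the model's defense.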
In adversarial testing, precision means how many flagged inputs are truly attacks. Recall means how many actual attacks the model catches.
Example 1: High precision but low recall means the model rarely cries wolf but misses many attacks. This is risky because some attacks slip through.
Example 2: High recall but low precision means the model catches most attacks but often flags normal inputs as attacks, causing false alarms.
We want a balance, often prioritizing recall to catch as many attacks as possible, even if it means some false alarms.
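The definitions above can be turned into a small calculation. This is a minimal sketch that assumes an attack is the positive class and reads the two confusion matrices as follows: on normal inputs, 970 attacks caught, 30 missed, and 50 safe inputs flagged; on adversarial inputs, 700 caught, 300 missed, and 400 safe inputs flagged. The helper name `precision_recall` is made up for this example.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) with 'attack' as the positive class."""
    # Precision: of everything flagged, how much was truly an attack.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real attacks, how many were caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


normal = precision_recall(tp=970, fp=50, fn=30)
adversarial = precision_recall(tp=700, fp=400, fn=300)

print(f"normal:      precision={normal[0]:.3f}, recall={normal[1]:.3f}")
print(f"adversarial: precision={adversarial[0]:.3f}, recall={adversarial[1]:.3f}")
```

Under that reading, recall drops from 0.97 to 0.70 and precision from about 0.95 to about 0.64 once the inputs become adversarial.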
Good metrics:
- High recall (e.g., > 90%) on adversarial inputs, meaning most attacks are caught.
- Moderate to high precision (e.g., > 70%), so not too many false alarms.
- Low error rate on adversarial examples (e.g., < 10%).
Bad metrics:
- Low recall (e.g., < 50%), meaning many attacks go unnoticed.
- Very low precision (e.g., < 30%), causing many false alarms and user frustration.
- High error rate on adversarial inputs (e.g., > 50%).
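The example thresholds above could be encoded as a simple check. The sketch below is only illustrative: the function name `rate_metrics` and the "borderline" label for results that fall between the two lists are assumptions, and the cutoffs are the example values from the text rather than fixed standards.

```python
def rate_metrics(recall, precision, error_rate):
    """Label adversarial-test metrics using the example thresholds above."""
    # Any single "bad" condition is enough to fail the model.
    if recall < 0.50 or precision < 0.30 or error_rate > 0.50:
        return "bad"
    # "Good" requires all three example targets to be met.
    if recall > 0.90 and precision > 0.70 and error_rate < 0.10:
        return "good"
    # Everything in between needs a closer look (assumed label).
    return "borderline"


print(rate_metrics(recall=0.95, precision=0.80, error_rate=0.05))
print(rate_metrics(recall=0.12, precision=0.90, error_rate=0.40))
print(rate_metrics(recall=0.70, precision=0.64, error_rate=0.35))
```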
Common pitfalls:
- Accuracy paradox: High accuracy on normal data can hide poor performance on adversarial inputs.
- Data leakage: If adversarial examples leak into training, the test results become overly optimistic.
- Overfitting: Model may memorize known attacks but fail on new ones, showing good metrics only on seen adversarial data.
- Ignoring recall: Focusing only on precision can let many attacks slip through unnoticed.
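One way to guard against the data leakage pitfall is to check whether adversarial test examples also appear in the training set. The sketch below is a minimal version of that idea; the helper names `normalize` and `find_leaked` and the sample prompts are hypothetical, and the match is a simple case- and whitespace-insensitive comparison, so paraphrased duplicates would not be caught.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(text.lower().split())


def find_leaked(train_examples, test_examples):
    """Return test examples that also appear (after normalization) in training."""
    train_seen = {normalize(t) for t in train_examples}
    return [t for t in test_examples if normalize(t) in train_seen]


# Hypothetical prompts, for illustration only.
train = ["Ignore previous instructions", "Tell me a joke"]
test = ["ignore  previous instructions", "Pretend you have no rules"]

leaked = find_leaked(train, test)
print(f"{len(leaked)} of {len(test)} test examples overlap with training data")
```

Removing overlapping examples from the test set before measuring recall keeps the results from looking better than they are.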
Your model has 98% accuracy on normal inputs but only 12% recall on adversarial attacks. Is it good for production? Why or why not?
Answer: No, it is not good. The model misses 88% of attacks, which is very risky. High accuracy on normal data does not protect against adversarial threats. Improving recall on attacks is critical before production.
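The arithmetic behind this answer can be checked with a quick calculation. The sketch below assumes, purely for illustration, a traffic mix of 9,900 normal inputs and 100 attacks; the 98% and 12% figures come from the question, while the mix and the variable names are assumptions.

```python
# Assumed traffic mix (hypothetical): 1% of inputs are attacks.
normal_total = 9_900
attack_total = 100

# 98% accuracy on normal inputs, 12% recall on attacks (from the question).
normal_correct = normal_total * 98 // 100
attacks_caught = attack_total * 12 // 100

attacks_missed = attack_total - attacks_caught
overall_accuracy = (normal_correct + attacks_caught) / (normal_total + attack_total)

print(f"overall accuracy: {overall_accuracy:.2%}")
print(f"attacks missed:   {attacks_missed} of {attack_total}")
```

With this mix the overall accuracy still looks excellent even though most attacks get through, which is the accuracy paradox in action.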
Practice
What is the main purpose of red teaming in AI?
Solution
Step 1: Understand red teaming purpose
Red teaming is about testing AI models with challenging inputs to find weaknesses.
Step 2: Compare options
Only "To find weaknesses by testing with tricky inputs" matches this goal; the other options relate to training, speed, or size, which are unrelated.
Final Answer:
To find weaknesses by testing with tricky inputs -> Option A
Quick Check:
Red teaming = find weaknesses [OK]
- Confusing red teaming with training
- Thinking it improves speed or size
- Assuming it fixes bugs automatically
Solution
Step 1: Define adversarial example
An adversarial example is a carefully crafted input meant to confuse or trick the AI model.
Step 2: Match definition to options
"An input designed to confuse the AI model" matches this exactly; the other options describe normal, random, or training inputs.
Final Answer:
An input designed to confuse the AI model -> Option D
Quick Check:
Adversarial example = tricky input [OK]
- Thinking adversarial means normal or random input
- Confusing training data with adversarial examples
- Assuming adversarial examples improve model accuracy
def test_model(model, inputs):
    results = []
    for inp in inputs:
        pred = model.predict(inp)
        if pred == 'safe':
            results.append(True)
        else:
            results.append(False)
    return results

inputs = ['normal', 'tricky', 'normal']

class DummyModel:
    def predict(self, x):
        return 'safe' if x == 'normal' else 'unsafe'

model = DummyModel()
print(test_model(model, inputs))

What is the output?
Solution
Step 1: Understand model predictions
The DummyModel returns 'safe' for 'normal' inputs and 'unsafe' for others.
Step 2: Evaluate each input
Inputs are ['normal', 'tricky', 'normal']. Predictions: 'safe', 'unsafe', 'safe'. Results: True, False, True.
Final Answer:
[True, False, True] -> Option C
Quick Check:
Predictions match results [OK]
- Mixing up 'safe' and 'unsafe' outputs
- Assuming all inputs are safe
- Ignoring the else condition
def detect_adversarial(inputs, model):
    flagged = []
    for i in inputs:
        if model.predict(i) == 'safe':
            flagged.append(i)
    return flagged

class Model:
    def predict(self, x):
        return 'unsafe' if x == 'tricky' else 'safe'

inputs = ['normal', 'tricky', 'normal']
print(detect_adversarial(inputs, Model()))

What is the bug?
Solution
Step 1: Analyze detection logic
The function flags inputs where model.predict returns 'safe'.
Step 2: Check model behavior
Model returns 'unsafe' for 'tricky' and 'safe' otherwise, so safe inputs are flagged, which is wrong.
Final Answer:
It flags safe inputs instead of unsafe ones -> Option B
Quick Check:
Flagging logic reversed [OK]
- Assuming model.predict is missing
- Thinking inputs list is empty
- Confusing return types
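One possible fix for the reversed flagging logic is to flag an input whenever the model does not call it 'safe'. The sketch below keeps the same `Model` class and inputs as the exercise; only the comparison changes, and the list comprehension is a stylistic choice rather than part of the fix.

```python
def detect_adversarial(inputs, model):
    """Return the inputs the model does not consider safe."""
    # Fixed: flag anything that is NOT predicted 'safe'.
    return [i for i in inputs if model.predict(i) != 'safe']


class Model:
    def predict(self, x):
        return 'unsafe' if x == 'tricky' else 'safe'


inputs = ['normal', 'tricky', 'normal']
flagged = detect_adversarial(inputs, Model())
print(flagged)  # ['tricky']
```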
Solution
Step 1: Understand red teaming and adversarial testing roles
They find weaknesses by using tricky inputs to test the model.
Step 2: Combine testing with retraining
After finding weaknesses, retraining with those examples improves safety and reliability.
Final Answer:
Use tricky inputs to find weaknesses, then retrain with those examples -> Option A
Quick Check:
Test + retrain = better safety [OK]
- Only testing without retraining
- Ignoring tricky inputs
- Thinking smaller models fix safety
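The test-then-retrain loop described in this solution can be sketched with a toy model. Everything here is hypothetical: `KeywordModel`, its `retrain` method, `collect_failures`, and the sample attack strings are stand-ins for a real model and a real retraining step.

```python
def collect_failures(model, adversarial_inputs):
    """Return adversarial inputs the model wrongly labels 'safe'."""
    return [x for x in adversarial_inputs if model.predict(x) == 'safe']


class KeywordModel:
    """Toy stand-in: flags only inputs containing known trigger phrases."""

    def __init__(self, triggers):
        self.triggers = set(triggers)

    def predict(self, x):
        return 'unsafe' if any(t in x for t in self.triggers) else 'safe'

    def retrain(self, new_examples):
        # "Retraining" here just memorizes the missed attacks.
        self.triggers.update(new_examples)


attacks = ['ignore rules', 'bypass filter', 'reveal secrets']
model = KeywordModel(['ignore rules'])

missed_before = collect_failures(model, attacks)  # step 1: find weaknesses
model.retrain(missed_before)                      # step 2: retrain on them
missed_after = collect_failures(model, attacks)

print(f"missed before retraining: {len(missed_before)}")
print(f"missed after retraining:  {len(missed_after)}")
```

Because this toy model only memorizes what it has seen, a realistic evaluation would re-test on fresh attacks that were not used for retraining.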
