Bird
Raised Fist0
Prompt Engineering / GenAIml~8 mins

Red teaming and adversarial testing in Prompt Engineering / GenAI - Model Metrics & Evaluation

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Metrics & Evaluation - Red teaming and adversarial testing
Which metric matters for Red teaming and adversarial testing and WHY

In red teaming and adversarial testing, the key metric is robustness. This means how well the model resists attacks or tricky inputs designed to fool it. We also look at error rates on adversarial examples, which show how often the model makes mistakes when faced with these special inputs. Measuring attack success rate helps us understand how easily an attacker can trick the model. These metrics matter because the goal is to find weak spots before bad actors do.

Confusion matrix or equivalent visualization
    Normal Inputs Confusion Matrix:
      Predicted
      |  TP  |  FP  |
    -----------------
    TP | 950  |  50  |
    FN |  30  |  970 |

    Adversarial Inputs Confusion Matrix:
      Predicted
      |  TP  |  FP  |
    -----------------
    TP | 600  | 400  |
    FN | 300  | 700  |

    Explanation:
    - TP: Correctly identified safe inputs
    - FP: Mistakenly flagged safe inputs
    - FN: Missed adversarial attacks
    - TN: Correctly identified attacks

    The higher the FN on adversarial inputs, the weaker the model's defense.
    
Precision vs Recall tradeoff with concrete examples

In adversarial testing, precision means how many flagged inputs are truly attacks. Recall means how many actual attacks the model catches.

Example 1: High precision but low recall means the model rarely cries wolf but misses many attacks. This is risky because some attacks slip through.

Example 2: High recall but low precision means the model catches most attacks but often flags normal inputs as attacks, causing false alarms.

We want a balance, often prioritizing recall to catch as many attacks as possible, even if it means some false alarms.

What "good" vs "bad" metric values look like for this use case

Good metrics:

  • High recall (e.g., > 90%) on adversarial inputs, meaning most attacks are caught.
  • Moderate to high precision (e.g., > 70%), so not too many false alarms.
  • Low error rate on adversarial examples (e.g., < 10%).

Bad metrics:

  • Low recall (e.g., < 50%), meaning many attacks go unnoticed.
  • Very low precision (e.g., < 30%), causing many false alarms and user frustration.
  • High error rate on adversarial inputs (e.g., > 50%).
Metrics pitfalls
  • Accuracy paradox: High accuracy on normal data can hide poor performance on adversarial inputs.
  • Data leakage: If adversarial examples leak into training, the test results become overly optimistic.
  • Overfitting: Model may memorize known attacks but fail on new ones, showing good metrics only on seen adversarial data.
  • Ignoring recall: Focusing only on precision can let many attacks slip through unnoticed.
Self-check question

Your model has 98% accuracy on normal inputs but only 12% recall on adversarial attacks. Is it good for production? Why or why not?

Answer: No, it is not good. The model misses 88% of attacks, which is very risky. High accuracy on normal data does not protect against adversarial threats. Improving recall on attacks is critical before production.

Key Result
Robustness metrics like recall on adversarial inputs are key to ensure the model resists attacks effectively.

Practice

(1/5)
1. What is the main goal of red teaming in AI?
easy
A. To find weaknesses by testing with tricky inputs
B. To train the AI model with more data
C. To improve the speed of the AI model
D. To reduce the size of the AI model

Solution

  1. Step 1: Understand red teaming purpose

    Red teaming is about testing AI models with challenging inputs to find weaknesses.
  2. Step 2: Compare options

    Only To find weaknesses by testing with tricky inputs matches this goal; others relate to training, speed, or size, which are unrelated.
  3. Final Answer:

    To find weaknesses by testing with tricky inputs -> Option A
  4. Quick Check:

    Red teaming = find weaknesses [OK]
Hint: Red teaming means testing for weaknesses with tricky inputs [OK]
Common Mistakes:
  • Confusing red teaming with training
  • Thinking it improves speed or size
  • Assuming it fixes bugs automatically
2. Which of the following is the correct way to describe an adversarial example?
easy
A. A normal input that the model handles well
B. A training example used to improve accuracy
C. A random input unrelated to the task
D. An input designed to confuse the AI model

Solution

  1. Step 1: Define adversarial example

    An adversarial example is a carefully crafted input meant to confuse or trick the AI model.
  2. Step 2: Match definition to options

    An input designed to confuse the AI model matches this exactly; others describe normal, random, or training inputs.
  3. Final Answer:

    An input designed to confuse the AI model -> Option D
  4. Quick Check:

    Adversarial example = tricky input [OK]
Hint: Adversarial examples are tricky inputs to confuse AI [OK]
Common Mistakes:
  • Thinking adversarial means normal or random input
  • Confusing training data with adversarial examples
  • Assuming adversarial examples improve model accuracy
3. Consider this Python code snippet for adversarial testing:
def test_model(model, inputs):
    results = []
    for inp in inputs:
        pred = model.predict(inp)
        if pred == 'safe':
            results.append(True)
        else:
            results.append(False)
    return results

inputs = ['normal', 'tricky', 'normal']
class DummyModel:
    def predict(self, x):
        return 'safe' if x == 'normal' else 'unsafe'

model = DummyModel()
print(test_model(model, inputs))

What is the output?
medium
A. [False, True, False]
B. [True, True, True]
C. [True, False, True]
D. [False, False, False]

Solution

  1. Step 1: Understand model predictions

    The DummyModel returns 'safe' for 'normal' inputs and 'unsafe' for others.
  2. Step 2: Evaluate each input

    Inputs are ['normal', 'tricky', 'normal']. Predictions: 'safe', 'unsafe', 'safe'. Results: True, False, True.
  3. Final Answer:

    [True, False, True] -> Option C
  4. Quick Check:

    Predictions match results [OK]
Hint: Check each input prediction carefully [OK]
Common Mistakes:
  • Mixing up 'safe' and 'unsafe' outputs
  • Assuming all inputs are safe
  • Ignoring the else condition
4. This code tries to detect adversarial inputs but has a bug:
def detect_adversarial(inputs, model):
    flagged = []
    for i in inputs:
        if model.predict(i) == 'safe':
            flagged.append(i)
    return flagged

class Model:
    def predict(self, x):
        return 'unsafe' if x == 'tricky' else 'safe'

inputs = ['normal', 'tricky', 'normal']
print(detect_adversarial(inputs, Model()))

What is the bug?
medium
A. The model.predict method is missing
B. It flags safe inputs instead of unsafe ones
C. The inputs list is empty
D. The function returns a boolean instead of a list

Solution

  1. Step 1: Analyze detection logic

    The function flags inputs where model.predict returns 'safe'.
  2. Step 2: Check model behavior

    Model returns 'unsafe' for 'tricky', 'safe' otherwise. So safe inputs are flagged, which is wrong.
  3. Final Answer:

    It flags safe inputs instead of unsafe ones -> Option B
  4. Quick Check:

    Flagging logic reversed [OK]
Hint: Check if flagged inputs match unsafe cases [OK]
Common Mistakes:
  • Assuming model.predict is missing
  • Thinking inputs list is empty
  • Confusing return types
5. You want to improve an AI chatbot's safety by using red teaming and adversarial testing. Which combined approach is best?
hard
A. Use tricky inputs to find weaknesses, then retrain with those examples
B. Ignore tricky inputs and focus on normal conversation data
C. Only test with random inputs and fix errors found
D. Reduce model size to avoid complex errors

Solution

  1. Step 1: Understand red teaming and adversarial testing roles

    They find weaknesses by using tricky inputs to test the model.
  2. Step 2: Combine testing with retraining

    After finding weaknesses, retraining with those examples improves safety and reliability.
  3. Final Answer:

    Use tricky inputs to find weaknesses, then retrain with those examples -> Option A
  4. Quick Check:

    Test + retrain = better safety [OK]
Hint: Test with tricky inputs, then retrain to fix weaknesses [OK]
Common Mistakes:
  • Only testing without retraining
  • Ignoring tricky inputs
  • Thinking smaller models fix safety