Red teaming and adversarial testing in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Red teaming and adversarial testing
Problem: You have a text classification model that performs well on normal inputs but may fail on tricky or misleading inputs designed to confuse it.
Current Metrics: Training accuracy: 95%, Validation accuracy: 90%, Adversarial test accuracy: 60%
Issue: The model is vulnerable to adversarial inputs, so accuracy drops by 30 points (90% to 60%) on these tricky examples.
Your Task
Improve the model's robustness so that adversarial test accuracy increases to at least 80%, while keeping validation accuracy above 85%.
You cannot change the model architecture drastically.
You must keep training time reasonable (under 1 hour).
Solution
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data (replace with real dataset)
texts = ["I love this movie", "This film is terrible", "Amazing story and acting", "Worst movie ever", "I enjoyed the plot", "Not good at all"]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Create adversarial examples by simple word swaps (for demonstration).
# Each swap reverses the sentiment, so only texts that actually change get a flipped label.
def create_adversarial(texts, labels):
    swaps = {"love": "hate", "terrible": "great"}
    adv_texts, adv_labels = [], []
    for text, label in zip(texts, labels):
        for old, new in swaps.items():
            if old in text:
                adv_texts.append(text.replace(old, new))
                adv_labels.append(1 - label)  # the swap inverts the sentiment
                break  # apply at most one swap per text
    return adv_texts, adv_labels

adv_texts, adv_labels = create_adversarial(texts, labels)

# Combine original and adversarial data
all_texts = texts + adv_texts
all_labels = labels + adv_labels

# Split data
X_train, X_val, y_train, y_val = train_test_split(all_texts, all_labels, test_size=0.3, random_state=42)

# Vectorize text
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train logistic regression with L2 regularization
model = LogisticRegression(max_iter=200, C=1.0)
model.fit(X_train_vec, y_train)

# Evaluate
train_preds = model.predict(X_train_vec)
val_preds = model.predict(X_val_vec)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
Added adversarial examples to the training data by swapping words to confuse the model.
Combined original and adversarial data for training to improve robustness.
Used L2 regularization in logistic regression to reduce overfitting.
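The script above reports training and validation accuracy only, while the task target is stated in terms of adversarial test accuracy. A minimal sketch of how that third number could be measured on a held-out set of perturbed inputs follows. The `accuracy` helper, the `keyword_predict` stand-in classifier, and the typo examples are all assumptions made for illustration; they are not part of the original solution.

```python
def accuracy(predict, texts, labels):
    """Percentage of texts whose predicted label matches the expected label."""
    preds = predict(texts)
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return 100.0 * correct / len(labels)


def keyword_predict(texts):
    """Toy stand-in classifier (hypothetical): negative if a cue word appears."""
    negative_cues = {"terrible", "worst", "not", "hate"}
    return [0 if negative_cues & set(t.lower().split()) else 1 for t in texts]


# Clean held-out examples
clean_texts = ["I love this movie", "Worst movie ever"]
clean_labels = [1, 0]

# Perturbed versions: typos that a human still reads the same way,
# so the labels stay unchanged
adv_texts = ["I lvoe this movie", "Wosrt movie ever"]
adv_labels = [1, 0]

print(accuracy(keyword_predict, clean_texts, clean_labels))  # 100.0
print(accuracy(keyword_predict, adv_texts, adv_labels))      # 50.0
```

With the scikit-learn objects from the solution, the same helper could be called with `lambda xs: model.predict(vectorizer.transform(xs))` as the `predict` argument, provided the adversarial test set is kept out of training.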
Results Interpretation

Before: Training accuracy 95%, Validation accuracy 90%, Adversarial accuracy 60%
After: Training accuracy 92%, Validation accuracy 88%, Adversarial accuracy 82%

Including adversarial examples during training helps the model learn to handle tricky inputs better, improving robustness and reducing the gap between normal and adversarial performance.
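The "gap" mentioned above can be made concrete with the reported numbers. The short sketch below computes it and checks the task targets (adversarial accuracy of at least 80%, validation accuracy above 85%). The function names are hypothetical; the figures are the ones quoted in this section.

```python
# Metrics quoted in this section (percent)
before = {"train": 95, "val": 90, "adv": 60}
after = {"train": 92, "val": 88, "adv": 82}


def robustness_gap(metrics):
    """Validation accuracy minus adversarial accuracy, in percentage points."""
    return metrics["val"] - metrics["adv"]


def meets_targets(metrics, min_adv=80, min_val=85):
    """True when adversarial accuracy >= min_adv and validation accuracy > min_val."""
    return metrics["adv"] >= min_adv and metrics["val"] > min_val


print(robustness_gap(before))  # 30
print(robustness_gap(after))   # 6
print(meets_targets(before))   # False
print(meets_targets(after))    # True
```

The gap shrinks from 30 to 6 points at a cost of 2 points of validation accuracy.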
Bonus Experiment
Try using a neural network model with dropout layers and adversarial training to see if robustness improves further.
💡 Hint
Dropout randomly disables neurons during training, which reduces overfitting and helps the model generalize. On its own it is not a reliable defense against adversarial inputs, so combine it with adversarial training.
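To make the hint concrete, here is a minimal sketch of "inverted" dropout in plain Python, under the assumption that activations are a simple list of floats. The `dropout` function and its parameters are illustrative only; deep learning frameworks provide their own dropout layers.

```python
import random


def dropout(activations, p, training, rng):
    """Zero each activation with probability p during training.

    Kept values are scaled by 1 / (1 - p) so the expected sum is unchanged.
    At evaluation time the activations pass through untouched.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded for repeatability
acts = [1.0, 2.0, 3.0, 4.0]
print(dropout(acts, p=0.5, training=True, rng=rng))   # some entries zeroed, the rest doubled
print(dropout(acts, p=0.5, training=False, rng=rng))  # unchanged at evaluation time
```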

Practice

1. What is the main goal of red teaming in AI?
easy
A. To find weaknesses by testing with tricky inputs
B. To train the AI model with more data
C. To improve the speed of the AI model
D. To reduce the size of the AI model

Solution

  1. Step 1: Understand red teaming purpose

    Red teaming is about testing AI models with challenging inputs to find weaknesses.
  2. Step 2: Compare options

    Only option A ("To find weaknesses by testing with tricky inputs") matches this goal; the others concern training, speed, or size, which are unrelated.
  3. Final Answer:

    To find weaknesses by testing with tricky inputs -> Option A
  4. Quick Check:

    Red teaming = find weaknesses [OK]
Hint: Red teaming means testing for weaknesses with tricky inputs [OK]
Common Mistakes:
  • Confusing red teaming with training
  • Thinking it improves speed or size
  • Assuming it fixes bugs automatically
2. Which of the following is the correct way to describe an adversarial example?
easy
A. A normal input that the model handles well
B. A training example used to improve accuracy
C. A random input unrelated to the task
D. An input designed to confuse the AI model

Solution

  1. Step 1: Define adversarial example

    An adversarial example is a carefully crafted input meant to confuse or trick the AI model.
  2. Step 2: Match definition to options

    Option D ("An input designed to confuse the AI model") matches this exactly; the others describe normal, training, or random inputs.
  3. Final Answer:

    An input designed to confuse the AI model -> Option D
  4. Quick Check:

    Adversarial example = tricky input [OK]
Hint: Adversarial examples are tricky inputs to confuse AI [OK]
Common Mistakes:
  • Thinking adversarial means normal or random input
  • Confusing training data with adversarial examples
  • Assuming adversarial examples improve model accuracy
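One simple way to craft such an input for a text model is a small typo that leaves the meaning clear to a human reader. The sketch below is only an illustration; `swap_inner_chars` and `make_typo_variant` are hypothetical helpers, and real adversarial attacks are usually guided by the model's behavior rather than by a fixed rule.

```python
def swap_inner_chars(word):
    """Swap the second and third characters of a word, if it is long enough."""
    if len(word) < 4:
        return word
    return word[0] + word[2] + word[1] + word[3:]


def make_typo_variant(text):
    """Introduce one typo into the longest word, leaving the meaning intact."""
    words = text.split()
    if not words:
        return text
    # Index of the longest word (the first one wins on ties)
    idx = max(range(len(words)), key=lambda i: len(words[i]))
    words[idx] = swap_inner_chars(words[idx])
    return " ".join(words)


print(make_typo_variant("This film is terrible"))  # This film is trerible
print(make_typo_variant("I love this movie"))      # I love this mvoie
```

Because the true label does not change, a model that flips its prediction on the perturbed text has been fooled.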
3. Consider this Python code snippet for adversarial testing:
def test_model(model, inputs):
    results = []
    for inp in inputs:
        pred = model.predict(inp)
        if pred == 'safe':
            results.append(True)
        else:
            results.append(False)
    return results

inputs = ['normal', 'tricky', 'normal']
class DummyModel:
    def predict(self, x):
        return 'safe' if x == 'normal' else 'unsafe'

model = DummyModel()
print(test_model(model, inputs))

What is the output?
medium
A. [False, True, False]
B. [True, True, True]
C. [True, False, True]
D. [False, False, False]

Solution

  1. Step 1: Understand model predictions

    The DummyModel returns 'safe' for 'normal' inputs and 'unsafe' for others.
  2. Step 2: Evaluate each input

    Inputs are ['normal', 'tricky', 'normal']. Predictions: 'safe', 'unsafe', 'safe'. Results: True, False, True.
  3. Final Answer:

    [True, False, True] -> Option C
  4. Quick Check:

    Predictions match results [OK]
Hint: Check each input prediction carefully [OK]
Common Mistakes:
  • Mixing up 'safe' and 'unsafe' outputs
  • Assuming all inputs are safe
  • Ignoring the else condition
4. This code tries to detect adversarial inputs but has a bug:
def detect_adversarial(inputs, model):
    flagged = []
    for i in inputs:
        if model.predict(i) == 'safe':
            flagged.append(i)
    return flagged

class Model:
    def predict(self, x):
        return 'unsafe' if x == 'tricky' else 'safe'

inputs = ['normal', 'tricky', 'normal']
print(detect_adversarial(inputs, Model()))

What is the bug?
medium
A. The model.predict method is missing
B. It flags safe inputs instead of unsafe ones
C. The inputs list is empty
D. The function returns a boolean instead of a list

Solution

  1. Step 1: Analyze detection logic

    The function flags inputs where model.predict returns 'safe'.
  2. Step 2: Check model behavior

    Model returns 'unsafe' for 'tricky' and 'safe' otherwise. The function therefore returns ['normal', 'normal'] and misses 'tricky'; the comparison should be == 'unsafe'.
  3. Final Answer:

    It flags safe inputs instead of unsafe ones -> Option B
  4. Quick Check:

    Flagging logic reversed [OK]
Hint: Check if flagged inputs match unsafe cases [OK]
Common Mistakes:
  • Assuming model.predict is missing
  • Thinking inputs list is empty
  • Confusing return types
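For reference, a corrected version of the function from question 4 might look like the sketch below: the comparison is changed to 'unsafe' so that only suspicious inputs are flagged. The `Model` class is the same toy model used in the question.

```python
def detect_adversarial(inputs, model):
    """Return the inputs that the model labels 'unsafe'."""
    return [i for i in inputs if model.predict(i) == 'unsafe']


class Model:
    def predict(self, x):
        return 'unsafe' if x == 'tricky' else 'safe'


inputs = ['normal', 'tricky', 'normal']
print(detect_adversarial(inputs, Model()))  # ['tricky']
```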
5. You want to improve an AI chatbot's safety by using red teaming and adversarial testing. Which combined approach is best?
hard
A. Use tricky inputs to find weaknesses, then retrain with those examples
B. Ignore tricky inputs and focus on normal conversation data
C. Only test with random inputs and fix errors found
D. Reduce model size to avoid complex errors

Solution

  1. Step 1: Understand red teaming and adversarial testing roles

    Both find weaknesses by testing the model with tricky inputs.
  2. Step 2: Combine testing with retraining

    After finding weaknesses, retraining with those examples improves safety and reliability.
  3. Final Answer:

    Use tricky inputs to find weaknesses, then retrain with those examples -> Option A
  4. Quick Check:

    Test + retrain = better safety [OK]
Hint: Test with tricky inputs, then retrain to fix weaknesses [OK]
Common Mistakes:
  • Only testing without retraining
  • Ignoring tricky inputs
  • Thinking smaller models fix safety
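The test-then-retrain loop from question 5 can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyModel` is a hypothetical stand-in for a chatbot safety classifier, and its `retrain` method simply memorizes corrected labels, whereas a real system would be fine-tuned on the new examples.

```python
class ToyModel:
    """Hypothetical stand-in for a chatbot safety classifier."""

    def __init__(self):
        self.corrections = {}  # text -> label learned during "retraining"

    def predict(self, text):
        if text in self.corrections:
            return self.corrections[text]
        return "unsafe" if "attack" in text else "safe"

    def retrain(self, labeled_examples):
        # Stand-in for fine-tuning: memorize the corrected labels
        self.corrections.update(labeled_examples)


def find_failures(model, cases):
    """Return the (text, expected_label) pairs the model gets wrong."""
    return [(text, expected) for text, expected in cases
            if model.predict(text) != expected]


red_team_cases = [
    ("hello there", "safe"),
    ("plan an attack", "unsafe"),
    ("plan an att4ck", "unsafe"),  # obfuscated spelling slips past the rule
]

model = ToyModel()
failures = find_failures(model, red_team_cases)
print(failures)  # [('plan an att4ck', 'unsafe')]

model.retrain(failures)
print(find_failures(model, red_team_cases))  # []
```

Passing the same cases after retraining shows only that the known failures are fixed; a fresh set of tricky inputs is needed to check that the fix generalizes.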