Red teaming and adversarial testing in Prompt Engineering / GenAI - Model Metrics & Evaluation

In red teaming and adversarial testing, the key metric is robustness: how well the model resists attacks and tricky inputs designed to fool it. We also track the error rate on adversarial examples, which shows how often the model fails on these special inputs, and the attack success rate, which shows how easily an attacker can trick the model. These metrics matter because the goal is to find weak spots before bad actors do.
Normal Inputs Confusion Matrix:

                 Predicted Safe | Predicted Attack
Actual Safe   |       950      |        50
Actual Attack |        30      |       970

Adversarial Inputs Confusion Matrix:

                 Predicted Safe | Predicted Attack
Actual Safe   |       600      |       400
Actual Attack |       300      |       700
Explanation:
- TP (Actual Safe, Predicted Safe): safe inputs correctly let through
- FP (Actual Safe, Predicted Attack): safe inputs mistakenly flagged as attacks
- FN (Actual Attack, Predicted Safe): adversarial attacks that slipped past
- TN (Actual Attack, Predicted Attack): attacks correctly identified
The higher the FN count on adversarial inputs, the weaker the model's defense: here 300 attacks slip through, versus only 30 on normal inputs.
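The two matrices can be turned into the metrics discussed in this section. Here is a minimal Python sketch, with counts copied from the tables and the attack class treated as the positive class (matching the precision/recall definitions used in this section):

```python
# Compute attack-detection metrics from the confusion matrices above.
# Positive class = attack: recall is the share of attacks caught,
# precision is the share of flagged inputs that were truly attacks.

def detection_metrics(caught, missed, false_alarms):
    """caught = TN cell (attacks flagged), missed = FN cell,
    false_alarms = FP cell (safe inputs flagged)."""
    recall = caught / (caught + missed)
    precision = caught / (caught + false_alarms)
    return recall, precision

# Normal inputs: 970 attacks caught, 30 missed, 50 safe inputs flagged
normal = detection_metrics(caught=970, missed=30, false_alarms=50)
# Adversarial inputs: 700 caught, 300 missed, 400 safe inputs flagged
adversarial = detection_metrics(caught=700, missed=300, false_alarms=400)

print(f"normal:      recall={normal[0]:.2f}, precision={normal[1]:.2f}")
print(f"adversarial: recall={adversarial[0]:.2f}, precision={adversarial[1]:.2f}")
```

Recall drops from 0.97 to 0.70 between the two tables, which is exactly the robustness gap red teaming is meant to expose.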
In adversarial testing we treat the attack class as the positive class for these two metrics: precision is the fraction of flagged inputs that are truly attacks, and recall is the fraction of actual attacks the model catches.
Example 1: High precision but low recall means the model rarely cries wolf but misses many attacks. This is risky because some attacks slip through.
Example 2: High recall but low precision means the model catches most attacks but often flags normal inputs as attacks, causing false alarms.
We want a balance, often prioritizing recall to catch as many attacks as possible, even if it means some false alarms.
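One standard way to encode "prioritize recall" in a single number is the F-beta score with beta > 1. A short sketch (the example precision/recall values are made up for illustration):

```python
# F-beta score: beta > 1 weights recall more heavily than precision,
# matching the catch-as-many-attacks-as-possible priority above.

def f_beta(precision, recall, beta=2.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Model A: rarely cries wolf but misses attacks (Example 1)
# Model B: catches most attacks but raises false alarms (Example 2)
print(f"A (p=0.9, r=0.5): F2 = {f_beta(0.9, 0.5):.2f}")  # ≈ 0.55
print(f"B (p=0.6, r=0.9): F2 = {f_beta(0.6, 0.9):.2f}")  # ≈ 0.82
```

Under F2, model B scores markedly higher despite its lower precision, reflecting the recall-first preference.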
Good metrics:
- High recall (e.g., > 90%) on adversarial inputs, meaning most attacks are caught.
- Moderate to high precision (e.g., > 70%), so not too many false alarms.
- Low error rate on adversarial examples (e.g., < 10%).
Bad metrics:
- Low recall (e.g., < 50%), meaning many attacks go unnoticed.
- Very low precision (e.g., < 30%), causing many false alarms and user frustration.
- High error rate on adversarial inputs (e.g., > 50%).
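The good/bad thresholds above can be wired into a simple release gate. This is a hypothetical helper; the function name and the exact cutoffs are illustrative, not a standard API:

```python
# Hypothetical release gate built from the "good metrics" thresholds above.

def passes_adversarial_bar(recall, precision, error_rate):
    """Return True only if all three thresholds are met:
    recall > 90%, precision > 70%, adversarial error rate < 10%."""
    return recall > 0.90 and precision > 0.70 and error_rate < 0.10

print(passes_adversarial_bar(0.93, 0.75, 0.08))  # True: all thresholds met
print(passes_adversarial_bar(0.45, 0.75, 0.08))  # False: recall too low
```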
Common pitfalls:
- Accuracy paradox: High accuracy on normal data can hide poor performance on adversarial inputs.
- Data leakage: If adversarial examples leak into training, the test results become overly optimistic.
- Overfitting: Model may memorize known attacks but fail on new ones, showing good metrics only on seen adversarial data.
- Ignoring recall: Focusing only on precision can let many attacks slip through unnoticed.
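The accuracy paradox is easy to demonstrate with made-up counts: on an imbalanced test set, a detector that passes almost everything as safe still looks accurate while missing nearly every attack.

```python
# Accuracy paradox sketch (all counts are hypothetical).
safe_total, attack_total = 980, 20        # mostly-safe, imbalanced test set
safe_correct, attacks_caught = 975, 2     # detector passes almost everything

accuracy = (safe_correct + attacks_caught) / (safe_total + attack_total)
attack_recall = attacks_caught / attack_total

print(f"accuracy={accuracy:.3f}, attack recall={attack_recall:.2f}")
# accuracy=0.977, attack recall=0.10
```

This is why recall on adversarial inputs, not overall accuracy, is the headline number in red-team reports.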
Your model has 98% accuracy on normal inputs but only 12% recall on adversarial attacks. Is it good for production? Why or why not?
Answer: No, it is not good. The model misses 88% of attacks, which is very risky. High accuracy on normal data does not protect against adversarial threats. Improving recall on attacks is critical before production.