Model Pipeline - Prompt Injection Defense
This pipeline shows how a language model defends against prompt injection attacks by detecting and filtering harmful inputs before generating safe responses.
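The detect-then-sanitize flow described above can be sketched as a minimal Python pipeline. This is an illustrative heuristic, not the trained model itself: the pattern list, function names, and the stubbed generation step are all hypothetical stand-ins for the learned detection and sanitization modules.

```python
import re

# Hypothetical injection phrasings; in the real pipeline a trained
# classifier would replace this hand-written pattern list.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard the system prompt",
    r"reveal your hidden instructions",
]

def detect_injection(prompt: str) -> bool:
    """Flag prompts matching known injection phrasings."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in INJECTION_PATTERNS)

def sanitize(prompt: str) -> str:
    """Strip matched injection phrases, keeping the rest of the prompt."""
    cleaned = prompt
    for p in INJECTION_PATTERNS:
        cleaned = re.sub(p, "", cleaned, flags=re.IGNORECASE)
    return " ".join(cleaned.split())

def respond(prompt: str) -> str:
    """Detect, sanitize, then hand the cleaned prompt to the model (stubbed)."""
    if detect_injection(prompt):
        prompt = sanitize(prompt)
    return f"[model output for: {prompt!r}]"
```

A trained classifier slots in where `detect_injection` sits, but the control flow, filter before generate, stays the same.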
[Figure: training loss vs. epochs (1-5); per-epoch values are listed in the table below.]
| Epoch | Loss ↓ | Accuracy ↑ | Observation |
|---|---|---|---|
| 1 | 0.85 | 0.60 | Model starts learning to detect injection patterns |
| 2 | 0.65 | 0.75 | Detection accuracy improves, fewer false negatives |
| 3 | 0.50 | 0.85 | Model reliably flags suspicious prompts |
| 4 | 0.40 | 0.90 | Sanitization module learns to clean prompts effectively |
| 5 | 0.35 | 0.93 | Overall defense pipeline converges with high accuracy |
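The accuracy and false-negative trends in the table can be computed from labeled detection results. A minimal sketch, assuming binary labels (1 = injection, 0 = benign) and binary predictions; the function name and signature are illustrative, not part of the original pipeline:

```python
def detection_metrics(predictions: list[int], labels: list[int]) -> tuple[float, float]:
    """Return (accuracy, false-negative rate) for binary injection detection.

    A false negative is an injection (label 1) the detector missed
    (prediction 0) -- the failure mode the table tracks across epochs.
    """
    assert len(predictions) == len(labels) and labels, "need matched, non-empty lists"
    correct = sum(p == y for p, y in zip(predictions, labels))
    positives = sum(labels)
    false_negatives = sum(1 for p, y in zip(predictions, labels) if y == 1 and p == 0)
    accuracy = correct / len(labels)
    fn_rate = false_negatives / positives if positives else 0.0
    return accuracy, fn_rate
```

Tracking the false-negative rate alongside accuracy matters here because a missed injection is costlier than a benign prompt that gets needlessly sanitized.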