Experiment - Prompt injection defense
Problem:You are using a large language model (LLM) to answer user questions. However, some users try to trick the model by adding harmful instructions inside their input, called prompt injection. This causes the model to give wrong or unsafe answers.
Current Metrics:The model answers 95% of normal questions correctly but fails on 40% of injected prompts, producing unsafe or incorrect outputs.
Issue:The model is vulnerable to prompt injection attacks, leading to unsafe or misleading responses.