Experiment - Prompt injection attacks
Problem:You are using a generative AI model that takes user prompts to generate text. However, some users try to trick the model by adding hidden instructions inside their prompts. This is called a prompt injection attack. It can cause the model to produce unwanted or harmful outputs.
Current Metrics:The model responds correctly to normal prompts 95% of the time. But when tested with prompt injection attempts, it fails 40% of the time by following the injected instructions.
Issue:The model is vulnerable to prompt injection attacks, which reduces its reliability and safety.