
Zero-shot prompting in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Zero-shot prompting
Which metric matters for Zero-shot prompting and WHY

Zero-shot prompting means asking a model to perform a task with no examples included in the prompt. Because the model gets no demonstrations of the task, we want to know how well it answers correctly from the instructions alone.

The key metric here is accuracy: the percentage of answers the model gets right without any examples or fine-tuning. Accuracy is a simple, clear signal for zero-shot tasks because it directly shows whether the model understood the task from the prompt alone.

Sometimes, if the task is about finding specific items (like detecting spam), precision and recall also matter to understand if the model is careful or misses important cases.

Confusion matrix example for Zero-shot prompting
      |                 | Predicted Positive  | Predicted Negative  |
      |-----------------|---------------------|---------------------|
      | Actual Positive | True Positive (TP)  | False Negative (FN) |
      | Actual Negative | False Positive (FP) | True Negative (TN)  |

      Example:
      TP = 40, FP = 10, FN = 20, TN = 30
      Total samples = 100

      Accuracy = (TP + TN) / Total = (40 + 30) / 100 = 0.7 (70%)
      Precision = TP / (TP + FP) = 40 / (40 + 10) = 0.8 (80%)
      Recall = TP / (TP + FN) = 40 / (40 + 20) = 0.67 (67%)
      F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * 0.8 * 0.67 / (0.8 + 0.67) ≈ 0.73 (73%)
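The worked example above can be reproduced in a few lines of Python; the TP/FP/FN/TN counts are the ones from the table:

```python
# Counts taken directly from the confusion matrix example above.
tp, fp, fn, tn = 40, 10, 20, 30
total = tp + fp + fn + tn

accuracy = (tp + tn) / total                         # 0.70
precision = tp / (tp + fp)                           # 0.80
recall = tp / (tp + fn)                              # ~0.67
f1 = 2 * precision * recall / (precision + recall)   # ~0.73

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 score:  {f1:.2f}")
```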
    
Precision vs Recall tradeoff in Zero-shot prompting

Imagine you ask the model to find spam emails without training it first (zero-shot). If the model marks many emails as spam, it might catch most spam (high recall) but also mark good emails as spam (low precision).

If the model is very careful and marks only very obvious spam, it will have high precision but might miss some spam emails (low recall).

Depending on which kind of error costs more, you adjust the prompt to favor precision or recall. For zero-shot, understanding this tradeoff helps you decide whether the model is guessing too broadly or too narrowly.
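The tradeoff can be made concrete with a small sketch. The labels and the two sets of predictions below are invented for illustration: one imagined zero-shot classifier flags broadly, the other flags only obvious spam.

```python
# Hypothetical data: 1 = spam, 0 = not spam. Truth and predictions are
# made up purely to illustrate the precision/recall tradeoff.
truth   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
lenient = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # flags broadly
strict  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # flags only obvious spam

def precision_recall(y_true, y_pred):
    # Count true positives, false positives, false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(truth, lenient))  # lower precision, perfect recall
print(precision_recall(truth, strict))   # perfect precision, lower recall
```

The lenient classifier catches every spam email (recall 1.0) but only 4 of its 7 flags are correct (precision ≈ 0.57); the strict one is never wrong when it flags (precision 1.0) but finds only half the spam (recall 0.5).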

Good vs Bad metric values for Zero-shot prompting

Good: Accuracy above roughly 70% suggests the model understands the task well without examples. Precision and recall both above 70% indicate balanced, reliable predictions. (Treat these thresholds as rules of thumb; acceptable values depend on the task and class balance.)

Bad: Accuracy below 50% on a balanced binary task means the model does worse than random guessing. Very low precision (<50%) means many of its positive guesses are wrong. Very low recall (<50%) means it misses many true positives.

Common pitfalls in Zero-shot prompting metrics
  • Accuracy paradox: If the data is mostly one class, high accuracy can be misleading (e.g., always guessing the majority class).
  • Data leakage: If the model has seen similar tasks before, zero-shot results may be overestimated.
  • Overfitting indicators: Classic training overfitting does not apply since zero-shot involves no training, but repeatedly tuning the prompt against the same evaluation set can overfit the prompt to that set.
  • Ignoring class imbalance: If one class is rare, precision and recall give better insight than accuracy alone.
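The accuracy paradox in the first pitfall is easy to demonstrate. In this made-up imbalanced dataset (95 negatives, 5 positives), a "model" that always predicts the majority class scores 95% accuracy while finding zero positives:

```python
# Made-up imbalanced dataset: 95 negatives, 5 positives.
truth = [0] * 95 + [1] * 5
majority = [0] * 100          # always predicts the majority class

accuracy = sum(t == p for t, p in zip(truth, majority)) / len(truth)
tp = sum(t == 1 and p == 1 for t, p in zip(truth, majority))
recall = tp / sum(truth)      # misses every positive

print(accuracy)  # 0.95 — looks impressive
print(recall)    # 0.0  — finds nothing
```

This is why precision and recall belong next to accuracy whenever one class is rare.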
Self-check question

Your zero-shot model has 98% accuracy but only 12% recall on the positive class (e.g., fraud detection). Is it good for production?

Answer: No. The model misses most positive cases (low recall), which is critical in fraud detection. High accuracy is misleading because most data is negative. You need to improve recall before using it in production.
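One hypothetical confusion matrix consistent with the self-check numbers (10,000 transactions, 200 frauds, 98% accuracy, 12% recall) makes the answer concrete; the counts below are invented to match those figures:

```python
# Hypothetical counts chosen to match 98% accuracy and 12% recall.
tp, fn, fp, tn = 24, 176, 24, 9776
total = tp + fn + fp + tn       # 10,000 transactions, 200 frauds

accuracy = (tp + tn) / total    # 0.98 — looks great on paper
recall = tp / (tp + fn)         # 0.12 — misses 176 of 200 frauds
precision = tp / (tp + fp)      # 0.50

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

The 9,776 correctly ignored legitimate transactions dominate the accuracy score, hiding the fact that almost all fraud slips through.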

Key Result
Accuracy shows overall correctness in zero-shot prompting, but precision and recall reveal if the model guesses carefully or misses key cases.