
Few-shot prompting in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Few-shot prompting
Which metric matters for Few-shot prompting and WHY

Few-shot prompting teaches a model to perform a task from only a handful of in-prompt examples. The key metric here is accuracy (or task-specific correctness), because it shows how well the model generalizes from the examples it was given. For tasks like classification or question answering, accuracy tells us whether the model makes the right choices after seeing just a few samples.
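As a minimal sketch of that accuracy computation (the label lists below are hypothetical outputs from a few-shot prompted classifier, not real data):

```python
# Accuracy for a few-shot classification run: fraction of predictions
# that exactly match the gold labels. Labels here are hypothetical.
gold = ["A", "B", "A", "C", "B", "A"]
pred = ["A", "B", "C", "C", "B", "A"]

correct = sum(g == p for g, p in zip(gold, pred))
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.2f}")  # 5 of 6 correct -> 0.83
```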

Confusion matrix example
    Confusion Matrix for a 3-class classification task:

              Predicted
               A    B    C
    True A    18    2    0
         B     3   15    2
         C     0    1   19

    Total samples = 60

    From this:
    - True Positives (TP) for class A = 18
    - False Positives (FP) for class A = 3 + 0 = 3
    - False Negatives (FN) for class A = 2 + 0 = 2
    
    Precision for class A = TP / (TP + FP) = 18 / (18 + 3) ≈ 0.86
    Recall for class A = TP / (TP + FN) = 18 / (18 + 2) = 0.90
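
The same hand computation can be run for every class at once. This sketch encodes the matrix above (rows = true class, columns = predicted class) and derives per-class precision and recall:

```python
# Per-class precision and recall from the confusion matrix above.
# Rows are the true class, columns are the predicted class.
cm = {
    "A": {"A": 18, "B": 2,  "C": 0},
    "B": {"A": 3,  "B": 15, "C": 2},
    "C": {"A": 0,  "B": 1,  "C": 19},
}
classes = list(cm)

results = {}
for c in classes:
    tp = cm[c][c]
    fn = sum(cm[c][o] for o in classes if o != c)  # rest of the row
    fp = sum(cm[o][c] for o in classes if o != c)  # rest of the column
    results[c] = (tp / (tp + fp), tp / (tp + fn))
    print(f"{c}: precision={results[c][0]:.2f} recall={results[c][1]:.2f}")
# Class A reproduces the hand computation: precision ≈ 0.86, recall = 0.90
```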
    
Precision vs Recall tradeoff with examples

In few-shot prompting, sometimes the model guesses carefully (high precision) but misses some correct answers (low recall). Other times, it tries to catch all correct answers (high recall) but makes more mistakes (low precision).

Example 1: For a medical diagnosis task, high recall is important because missing a disease is dangerous. Few-shot prompting should focus on catching all positives, even if some false alarms happen.

Example 2: For spam detection, high precision matters more. Few-shot prompting should avoid marking good emails as spam, even if some spam slips through.
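
One common way this tradeoff shows up in practice is through a decision threshold on the model's confidence: raising it favors precision, lowering it favors recall. The scores below are hypothetical confidences for the positive class, just to illustrate the mechanic:

```python
# Precision/recall tradeoff driven by a confidence threshold.
# (score, true_label) pairs are hypothetical model outputs.
examples = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1),
    (0.60, 0), (0.55, 1), (0.30, 0), (0.20, 1),
]

def precision_recall(threshold):
    tp = sum(1 for s, y in examples if s >= threshold and y == 1)
    fp = sum(1 for s, y in examples if s >= threshold and y == 0)
    fn = sum(1 for s, y in examples if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Strict threshold: spam-detector style (precision first).
# Loose threshold: medical-screening style (recall first).
for t in (0.85, 0.50):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

At 0.85 the model only answers when very confident (perfect precision, low recall); at 0.50 it catches more positives at the cost of false alarms.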

What "good" vs "bad" metric values look like for Few-shot prompting

Good: Accuracy above 80% with balanced precision and recall means the model learned well from few examples.

Bad: Accuracy below 50% or very low recall (e.g., under 30%) means the model is not understanding the examples or missing many correct answers.

Common pitfalls in Few-shot prompting metrics
  • Accuracy paradox: High accuracy can be misleading if the task is unbalanced (e.g., mostly one class).
  • Data leakage: If the examples in the prompt are too similar to the test data, metrics look better than they should, but the model is not truly generalizing.
  • Overfitting: Model might memorize few examples but fail on new inputs, causing poor generalization.
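
The accuracy paradox is easy to demonstrate with made-up numbers: on an imbalanced task, a degenerate "model" that always predicts the majority class still scores high accuracy (the 5% spam rate below is a hypothetical choice):

```python
# Accuracy paradox: always predicting the majority class ("ham")
# yields 95% accuracy on a 5%-spam dataset while catching no spam.
labels = ["spam"] * 5 + ["ham"] * 95  # hypothetical 5% spam rate
preds = ["ham"] * 100                 # degenerate majority-class model

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
spam_recall = sum(
    1 for y, p in zip(labels, preds) if y == "spam" and p == "spam"
) / labels.count("spam")
print(f"accuracy={accuracy:.2f}  spam recall={spam_recall:.2f}")  # 0.95, 0.00
```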
Self-check question

Your few-shot prompted model has 98% accuracy but only 12% recall on the positive class. Is it good for production?

Answer: No. The model misses most positive cases (low recall), which is critical in many tasks. High accuracy here is misleading because the data is likely imbalanced. You should improve recall before using it in production.
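
To see how both numbers can hold at once, here is one hypothetical confusion matrix consistent with 98% accuracy and 12% recall (the counts are an illustrative assumption, not from the source): 2,500 samples with only 50 positives, so the 44 missed positives barely dent accuracy.

```python
# A hypothetical confusion matrix producing 98% accuracy but 12% recall:
# heavy class imbalance hides the missed positives.
tp, fn = 6, 44      # 50 true positives, 44 of them missed
fp, tn = 6, 2444    # 2,450 true negatives classified correctly

total = tp + fn + fp + tn            # 2,500 samples
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2%}  recall={recall:.2%}")  # 98.00%, 12.00%
```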

Key Result
Accuracy with balanced precision and recall is key to evaluate few-shot prompting effectiveness.