Few-shot prompting in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Few-shot prompting
Which metric matters for Few-shot prompting and WHY

Few-shot prompting teaches a model a task by placing a handful of worked examples in the prompt, without changing the model's weights. The key metric is accuracy, or a task-specific measure of correctness, because it shows how well the model generalizes from those examples to new inputs. For tasks like classification or question answering, accuracy is the fraction of test inputs the model answers correctly after seeing just a few samples.

Confusion matrix example
    Confusion Matrix for a 3-class classification task
    (rows = true class, columns = predicted class):

                 Predicted
                 A    B    C
    True A      18    2    0
    True B       3   15    2
    True C       0    1   19

    Total samples = 60

    From this:
    - True Positives (TP) for class A = 18 (true A predicted as A)
    - False Positives (FP) for class A = 3 + 0 = 3 (true B and true C predicted as A)
    - False Negatives (FN) for class A = 2 + 0 = 2 (true A predicted as B or C)

    Precision for class A = TP / (TP + FP) = 18 / (18 + 3) ≈ 0.86
    Recall for class A = TP / (TP + FN) = 18 / (18 + 2) = 0.90
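
The arithmetic above can be reproduced for every class with a short script. This is a minimal sketch in plain Python: the matrix values come from the example, while the names `MATRIX`, `per_class_metrics`, and `accuracy` are hypothetical helpers introduced only for illustration.

```python
# Confusion matrix from the example: rows = true class, columns = predicted class.
LABELS = ["A", "B", "C"]
MATRIX = [
    [18, 2, 0],   # true A
    [3, 15, 2],   # true B
    [0, 1, 19],   # true C
]


def per_class_metrics(matrix, index):
    """Return (precision, recall) for the class at the given row/column index."""
    tp = matrix[index][index]
    fp = sum(row[index] for row in matrix) - tp   # column total minus the diagonal
    fn = sum(matrix[index]) - tp                  # row total minus the diagonal
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def accuracy(matrix):
    """Fraction of all samples that lie on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for i, label in enumerate(LABELS):
    p, r = per_class_metrics(MATRIX, i)
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy(MATRIX):.2f}")
```

For class A this gives the same precision (about 0.86) and recall (0.90) as the hand calculation, and it also reports overall accuracy (52 correct out of 60).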
    
Precision vs Recall tradeoff with examples

In few-shot prompting, sometimes the model guesses carefully (high precision) but misses some correct answers (low recall). Other times, it tries to catch all correct answers (high recall) but makes more mistakes (low precision).

Example 1: For a medical diagnosis task, high recall is important because missing a disease is dangerous. The prompt and its examples should favor catching all positives, even if some false alarms happen.

Example 2: For spam detection, high precision matters more. The prompt and its examples should avoid marking good emails as spam, even if some spam slips through.

What "good" vs "bad" metric values look like for Few-shot prompting

Good: As a rough guide (the right threshold depends on the task), accuracy above 80% with balanced precision and recall suggests the model learned well from few examples.

Bad: Accuracy below 50%, or very low recall (e.g., under 30%), suggests the model is not understanding the examples or is missing many correct answers.

Common pitfalls in Few-shot prompting metrics
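  • Accuracy paradox: High accuracy can be misleading if the task is unbalanced (e.g., mostly one class), because always predicting the majority class already scores well.
  • Data leakage: If the examples in the prompt overlap with or closely resemble the test data, metrics look better than the model's real performance on unseen inputs.
  • Overfitting to the examples: The model might copy surface patterns from the few examples it was shown and fail on inputs that differ from them, causing poor generalization.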
Self-check question

Your few-shot prompted model has 98% accuracy but only 12% recall on the positive class. Is it good for production?

Answer: No. The model misses most positive cases (low recall), which is critical in many tasks. High accuracy here is misleading because the data is likely imbalanced. You should improve recall before using it in production.
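
To see how 98% accuracy and 12% recall can coexist, here is a small worked sketch. The counts below are hypothetical, chosen only so that they reproduce the numbers in the question; they are not taken from any real evaluation.

```python
# Hypothetical counts chosen to reproduce 98% accuracy with 12% recall.
tp, fn = 24, 176      # 200 actual positives, only 24 of them caught
fp, tn = 24, 9776     # 9,800 actual negatives

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

# Baseline for comparison: a "model" that always predicts the negative class
# gets every actual negative right and every actual positive wrong.
baseline_accuracy = (fp + tn) / total

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
print(f"always-negative baseline accuracy={baseline_accuracy:.2f}")
```

With these assumed counts, a baseline that never predicts the positive class reaches the same accuracy, which is why accuracy alone says little on imbalanced data.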

Key Result
Accuracy, together with balanced precision and recall, is key to evaluating the effectiveness of few-shot prompting.

Practice

1. What is the main idea behind few-shot prompting in AI models?
easy
A. Showing a few examples in the prompt to teach the model a task
B. Training the model with a large dataset from scratch
C. Using no examples and relying on random guesses
D. Fine-tuning the model with many epochs

Solution

  1. Step 1: Understand few-shot prompting concept

    Few-shot prompting means giving the model a few examples in the prompt to help it understand the task.
  2. Step 2: Compare with other methods

    Unlike training or fine-tuning, few-shot prompting does not require changing the model weights, just examples in the prompt.
  3. Final Answer:

    Showing a few examples in the prompt to teach the model a task -> Option A
  4. Quick Check:

    Few-shot prompting = examples in prompt [OK]
Hint: Few-shot means few examples shown in prompt [OK]
Common Mistakes:
  • Confusing few-shot prompting with full model training
  • Thinking it requires many examples
  • Assuming no examples are given
2. Which of the following is the correct way to include examples in a few-shot prompt?
easy
A. Add random unrelated text before the question
B. Write only the new question without examples
C. List examples clearly, then ask the new question
D. Use code comments instead of examples

Solution

  1. Step 1: Identify proper prompt structure

    Few-shot prompting works best when examples are clearly listed before the new question.
  2. Step 2: Eliminate incorrect options

    Options A, B, and D either provide no clear examples or add unrelated content, which can confuse the model.
  3. Final Answer:

    List examples clearly, then ask the new question -> Option C
  4. Quick Check:

    Clear examples first = correct prompt [OK]
Hint: Put examples before the question in prompt [OK]
Common Mistakes:
  • Skipping examples completely
  • Adding unrelated text that confuses the model
  • Using comments instead of examples
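
The "examples first, then the new question" structure can be sketched as a small helper. This is only an illustration: `build_few_shot_prompt` is a hypothetical function name, and the Q:/A: layout is an assumption that mirrors the prompts shown in the questions on this page, not a required format.

```python
def build_few_shot_prompt(examples, question):
    """Format (question, answer) pairs as Q:/A: lines, then append the new question.

    The prompt ends with an open "A:" so the model completes the answer.
    """
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


# Example pairs in the same style as the translation prompts on this page.
examples = [
    ("Translate 'cat' to Spanish.", "gato"),
    ("Translate 'dog' to Spanish.", "perro"),
]
prompt = build_few_shot_prompt(examples, "Translate 'bird' to Spanish.")
print(prompt)
```

Keeping the examples in a list of pairs makes it easy to check that every answer matches its question before the prompt is sent to a model.
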
3. Given this few-shot prompt for a model:
Q: What is 2 + 3?
A: 5
Q: What is 4 + 1?
A: 5
Q: What is 7 + 2?
A:

What will the model most likely answer?
medium
A. 5
B. 9
C. 7
D. 2

Solution

  1. Step 1: Analyze the examples given

    The examples show addition questions with correct answers: 2+3=5 and 4+1=5.
  2. Step 2: Predict the answer for 7 + 2

    7 + 2 equals 9, so the model should answer 9 following the pattern.
  3. Final Answer:

    9 -> Option B
  4. Quick Check:

    7+2=9 [OK]
Hint: Add numbers as shown in examples [OK]
Common Mistakes:
  • Repeating previous answer 5
  • Confusing question numbers
  • Ignoring addition operation
4. You wrote this few-shot prompt:
Q: Translate 'cat' to Spanish.
A: gato
Q: Translate 'dog' to Spanish.
A: perro
Q: Translate 'bird' to Spanish.
A: perro

What is the main error here?
medium
A. The last answer repeats 'perro' instead of 'pájaro'
B. The examples are unrelated to translation
C. The prompt is missing the question marks
D. The answers are in English, not Spanish

Solution

  1. Step 1: Check the last example's answer

    The last question asks for 'bird' in Spanish, but the answer repeats 'perro' (dog).
  2. Step 2: Identify correct Spanish word

    The correct Spanish word for 'bird' is 'pájaro', so the answer is wrong.
  3. Final Answer:

    The last answer repeats 'perro' instead of 'pájaro' -> Option A
  4. Quick Check:

    Wrong repeated answer = error [OK]
Hint: Check if answers match questions correctly [OK]
Common Mistakes:
  • Copying previous answer by mistake
  • Ignoring answer correctness
  • Assuming question marks are required
5. You want to create a few-shot prompt to help a model classify fruits as 'sweet' or 'sour'. Which prompt is best?
hard
A. Q: What color is lemon?\nA: yellow\nQ: What color is apple?\nA: red\nQ: What color is orange?\nA:
B. Q: Is lemon sweet or sour?\nA: sweet\nQ: Is apple sweet or sour?\nA: sour\nQ: Is orange sweet or sour?\nA:
C. Q: Is lemon a fruit?\nA: yes\nQ: Is apple a fruit?\nA: yes\nQ: Is orange a fruit?\nA:
D. Q: Is lemon sweet or sour?\nA: sour\nQ: Is apple sweet or sour?\nA: sweet\nQ: Is orange sweet or sour?\nA:

Solution

  1. Step 1: Identify the task in the prompt

    The task is to classify fruits as 'sweet' or 'sour', so examples must show this classification clearly.
  2. Step 2: Evaluate each option's relevance

    Option D (Q: Is lemon sweet or sour?\nA: sour\nQ: Is apple sweet or sour?\nA: sweet\nQ: Is orange sweet or sour?\nA:) correctly shows examples of fruits labeled 'sweet' or 'sour'. Option B reverses the labels, and Options A and C ask unrelated questions.
  3. Final Answer:

    Q: Is lemon sweet or sour? A: sour Q: Is apple sweet or sour? A: sweet Q: Is orange sweet or sour? A: -> Option D
  4. Quick Check:

    Examples match task = best prompt [OK]
Hint: Match examples to the exact task asked [OK]
Common Mistakes:
  • Mixing up labels in examples
  • Using unrelated questions
  • Not showing clear classification