
Few-shot learning with prompts in NLP - Model Metrics & Evaluation

Which metric matters for Few-shot learning with prompts and WHY

In few-shot learning with prompts, the model sees only a handful of labeled examples inside the prompt before making predictions. Accuracy measures how often the model gets the right answer, but accuracy alone can be misleading when classes are imbalanced.

Therefore, precision and recall are also key. Precision tells us what fraction of the model's predicted answers are actually correct, and recall tells us what fraction of all correct answers the model manages to find.

Since few-shot learning often deals with limited data, the F1 score is very useful. It is the harmonic mean of precision and recall, balancing both into a single number that reflects overall quality.
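
As a quick sketch in plain Python, the three metrics follow directly from true-positive, false-positive, and false-negative counts (the counts below are hypothetical, for illustration only):

```python
# Precision, recall, and F1 from raw counts.
# tp/fp/fn values here are hypothetical, chosen only for illustration.
tp, fp, fn = 25, 9, 5  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # fraction of predictions that are correct
recall = tp / (tp + fn)     # fraction of true cases that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# → precision=0.735 recall=0.833 f1=0.781
```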

For tasks like classification, a confusion matrix helps visualize where the model makes mistakes.

Confusion matrix example

Imagine a 3-class classification task with few-shot prompts. Here is a confusion matrix from 100 samples (rows are true classes, columns are predicted classes):

             | Predicted A | Predicted B | Predicted C |
      True A |     30      |      5      |      0      |
      True B |      3      |     25      |      2      |
      True C |      1      |      4      |     30      |


Totals: correct predictions (the diagonal) = 30 + 25 + 30 = 85, so accuracy is 85%. For each class, the false positives are the off-diagonal entries in its column, and the false negatives are the off-diagonal entries in its row.
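
The per-class numbers can be read off the matrix programmatically. A minimal sketch, using the matrix above:

```python
# Per-class precision and recall from the 3x3 confusion matrix above.
# Rows are true classes, columns are predicted classes.
matrix = [
    [30, 5, 0],   # True A
    [3, 25, 2],   # True B
    [1, 4, 30],   # True C
]
labels = ["A", "B", "C"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(3)) / total

for i, label in enumerate(labels):
    tp = matrix[i][i]
    fp = sum(matrix[r][i] for r in range(3)) - tp  # column minus diagonal
    fn = sum(matrix[i]) - tp                       # row minus diagonal
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")

print(f"accuracy={accuracy:.2f}")  # → accuracy=0.85
```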

Precision vs Recall tradeoff with examples

In few-shot learning, a model can be steered to answer conservatively, making fewer positive predictions that it is more confident about (high precision, low recall), or to answer liberally, catching as many correct answers as possible (high recall, low precision).

Example 1: A medical diagnosis prompt where missing a disease is dangerous. Here, high recall is more important to catch all cases, even if some false alarms happen.

Example 2: A spam detection prompt where marking legitimate emails as spam is costly. Here, high precision is more important to avoid false positives.
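
If the model exposes a confidence score per prediction (for example, token log-probabilities), the tradeoff can be demonstrated by sweeping a decision threshold. The scores and labels below are made up purely for illustration:

```python
# Illustrative sketch: sweeping a decision threshold trades precision for recall.
# Scores and labels are hypothetical (1 = positive class).
scores = [0.95, 0.90, 0.85, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

def precision_recall(threshold):
    """Precision and recall when predicting positive iff score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.9, 0.5, 0.1):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
# Raising the threshold raises precision and lowers recall, and vice versa.
```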

Good vs Bad metric values for Few-shot learning with prompts

Good (rough rules of thumb, since acceptable values are task-dependent): accuracy above 80%, precision and recall balanced above 75%, and an F1 score close to both. The confusion matrix shows most predictions on the diagonal.

Bad: Accuracy near random chance (e.g., 33% for 3 classes), precision very low (many false positives), recall very low (many missed correct answers), and F1 score low. Confusion matrix shows many off-diagonal errors.

Common pitfalls in metrics for Few-shot learning with prompts
  • Accuracy paradox: High accuracy can happen if one class dominates, hiding poor performance on others.
  • Data leakage: If the prompt's examples leak answers from the evaluation set, metrics look better than they should, but the model is not truly generalizing.
  • Overfitting: Model may memorize few-shot examples, showing high training metrics but poor real-world results.
  • Small sample size: Few-shot means few examples, so metrics can vary a lot and be unstable.
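
The small-sample pitfall is easy to quantify: with only n evaluation examples, a single prediction flipping from wrong to right (or vice versa) moves accuracy by 1/n. A minimal sketch:

```python
# With a tiny evaluation set, each single prediction is worth 100/n
# percentage points of accuracy, so small-n metrics are very unstable.
for n in (10, 25, 100, 1000):
    step = 100 / n  # accuracy change (in points) from one flipped prediction
    print(f"n={n}: one flipped prediction moves accuracy by {step:.1f} points")
# At n=10, a single example swings accuracy by 10 points.
```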

Self-check question

Your few-shot prompt model has 98% accuracy but only 12% recall on the positive class (e.g., fraud detection). Is this good for production? Why or why not?

Answer: No, it is not good. The model misses 88% of positive cases, which is very risky. High accuracy is misleading because most data is negative. You need to improve recall to catch more positive cases.
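
The self-check numbers are easy to reproduce. Assuming a hypothetical dataset of 10,000 transactions with 200 fraud cases (2% positive), a model can reach 98% accuracy while catching only 12% of the fraud:

```python
# Hypothetical imbalanced dataset: 10,000 transactions, 2% fraud.
total, positives = 10_000, 200
tp, fp = 24, 24          # model flags 48 cases; only 24 are real fraud
fn = positives - tp      # 176 fraud cases missed
tn = total - positives - fp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.1%} recall={recall:.0%}")
# → accuracy=98.0% recall=12%
```

Because 98% of the data is negative, a model can score near-perfect accuracy while being almost useless at the task that matters.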

Key Result
In few-shot learning with prompts, balanced precision, recall, and F1 score matter most to judge model quality beyond accuracy.