
Pre-training and fine-tuning concept in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Pre-training and fine-tuning concept
Which metric matters and WHY

For pre-training and fine-tuning, the key metrics depend on the task the model is fine-tuned for. Common metrics include accuracy for classification, loss for general learning progress, and task-specific metrics like BLEU for language generation or F1 score for imbalanced classes.

During pre-training, loss (typically cross-entropy) is the main signal of whether the model is learning general patterns. During fine-tuning, task-specific metrics matter more because they show how well the model adapts to the new task.
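The cross-entropy idea can be sketched numerically. This is a minimal illustration with made-up token probabilities, not real model output: the loss is the average negative log-probability the model assigns to the correct tokens, so confident correct predictions yield a low loss.

```python
import math

# Cross-entropy sketch: average penalty for assigning low probability
# to the true tokens. Probabilities are illustrative, not real model output.
def cross_entropy(true_token_probs):
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

confident = [0.9, 0.8, 0.95]   # model assigns high probability to true tokens
uncertain = [0.2, 0.1, 0.3]    # model is mostly wrong or unsure

print(f"confident model loss = {cross_entropy(confident):.3f}")
print(f"uncertain model loss = {cross_entropy(uncertain):.3f}")
```

A falling cross-entropy during pre-training means exactly this: the model is assigning more probability mass to the tokens that actually occur.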

Confusion matrix example

Imagine fine-tuning a model for spam detection. Here is a confusion matrix from the fine-tuned model:

      |                 | Predicted Spam            | Predicted Not Spam        |
      |-----------------|---------------------------|---------------------------|
      | Actual Spam     | True Positives (TP) = 90  | False Negatives (FN) = 10 |
      | Actual Not Spam | False Positives (FP) = 15 | True Negatives (TN) = 85  |

Total samples = 90 + 10 + 15 + 85 = 200

From this, precision = 90 / (90 + 15) ≈ 0.857 and recall = 90 / (90 + 10) = 0.90.
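The arithmetic above can be verified with a few lines of Python, using the TP/FP/FN/TN counts from the matrix:

```python
# Metrics from the confusion matrix above (TP=90, FP=15, FN=10, TN=85).
tp, fp, fn, tn = 90, 15, 10, 85

precision = tp / (tp + fp)                  # 90 / 105 ≈ 0.857
recall = tp / (tp + fn)                     # 90 / 100 = 0.900
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 175 / 200 = 0.875
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} f1={f1:.3f}")
```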

Precision vs Recall tradeoff with examples

When fine-tuning, you often balance precision and recall depending on the task:

  • High precision: Important when false alarms are costly. For example, in spam detection, you want to avoid marking good emails as spam.
  • High recall: Important when missing positive cases is costly. For example, in medical diagnosis, you want to catch as many sick patients as possible.

Fine-tuning helps adjust the model to this balance by training on task-specific data.
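In practice this balance is often tuned via the classification threshold. A minimal sketch with made-up model scores (not real output) shows that raising the threshold trades recall for precision:

```python
# Illustrative scores and labels (1 = spam, 0 = not spam) - made-up values.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

def precision_recall(threshold):
    # Predict spam when the score meets the threshold, then count outcomes.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.5, 0.85):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

A strict threshold flags fewer emails, so the ones it does flag are more likely real spam (higher precision) at the cost of missing some (lower recall).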

What good vs bad metric values look like

For a fine-tuned model on a balanced classification task:

  • Good: Accuracy above 85%, precision and recall above 80%, loss steadily decreasing.
  • Bad: Accuracy near random chance (e.g., 50% for binary), precision or recall very low (below 50%), loss not improving or increasing.

Good metrics mean the model learned useful features during pre-training and adapted well during fine-tuning.

Common pitfalls in metrics
  • Accuracy paradox: High accuracy can be misleading if classes are imbalanced. For example, on data that is 95% negative, a model that always predicts negative scores 95% accuracy while catching zero positives.
  • Data leakage: If test examples leak into the fine-tuning data, metrics look unrealistically good.
  • Overfitting: Very low training loss but poor test metrics means the model memorized training data and did not generalize.
  • Ignoring task metrics: Using only pre-training loss to judge fine-tuning success can be misleading.
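The accuracy paradox from the first pitfall can be demonstrated directly. A do-nothing classifier on 95%-negative data (illustrative counts, not a real model):

```python
# Accuracy paradox sketch: on 95% negative data, a model that always
# predicts "negative" scores 95% accuracy but catches zero positives.
labels = [1] * 5 + [0] * 95   # 5 positives in 100 samples
preds = [0] * 100             # degenerate model: always predicts negative

accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
tp = sum(p and l for p, l in zip(preds, labels))
fn = sum((not p) and l for p, l in zip(preds, labels))
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.95 recall=0.00
```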
Self-check question

Your fine-tuned model has 98% accuracy but only 12% recall on fraud detection. Is it good for production? Why or why not?

Answer: No, it is not good. The low recall means the model misses most fraud cases, which is dangerous. High accuracy is misleading because fraud cases are rare. You need to improve recall to catch more fraud.
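One set of counts consistent with the numbers in the question (assuming 5,000 transactions with a 2% fraud rate, which is not stated in the text) makes the problem concrete:

```python
# Illustrative counts matching 98% accuracy and 12% recall
# (assumed: 5,000 transactions, 100 of them fraud).
tp, fn = 12, 88      # only 12 of 100 fraud cases caught
tn, fp = 4888, 12    # legitimate transactions mostly classified correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Because fraud is only 2% of the data, getting almost every legitimate transaction right is enough to hit 98% accuracy while 88 of 100 fraud cases slip through.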

Key Result
Pre-training loss shows general learning; fine-tuning metrics like precision and recall reveal task-specific performance and tradeoffs.