When we fine-tune a model, we want to know whether it performs better than it did before fine-tuning. Accuracy is a common metric for simple tasks, but precision, recall, and F1 score often matter more. These metrics tell us how often the model predicts the right answer and which kinds of mistakes it makes. For tasks like text generation or classification, we also track loss to see whether the model is improving during training. The right metric depends on the task and on which mistakes cost more.
Evaluation of fine-tuned models in Prompt Engineering / GenAI - Model Metrics & Evaluation
| | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | True Positive (TP): 85 | False Negative (FN): 15 |
| Actual Negative | False Positive (FP): 10 | True Negative (TN): 90 |
Total samples = 85 + 15 + 10 + 90 = 200
Precision = TP / (TP + FP) = 85 / (85 + 10) = 0.8947
Recall = TP / (TP + FN) = 85 / (85 + 15) = 0.85
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.8947 * 0.85) / (0.8947 + 0.85) = 0.8718
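The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses the four counts from the confusion matrix; the function name `classification_metrics` is an illustrative choice, not part of any library. It also computes accuracy, which the worked example does not list.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the confusion matrix above.
metrics = classification_metrics(tp=85, fn=15, fp=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.8750
# precision: 0.8947
# recall: 0.8500
# f1: 0.8718
```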
Precision means when the model says "yes," it is usually right. This is important when false alarms are costly. For example, in spam detection, high precision means fewer good emails marked as spam.
Recall means the model finds most of the true positives. This is important when missing a positive is bad. For example, in medical diagnosis, high recall means fewer sick patients are missed.
Fine-tuning can improve one metric but may reduce the other. We must balance them based on the task.
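One simple way to see this tension is to move the decision threshold of a classifier. The sketch below is an illustration under stated assumptions: the labels and scores are made-up values, and changing a threshold is a stand-in for the trade-off, not something fine-tuning itself does.

```python
# Hypothetical true labels and model scores (probability of the positive class).
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.60, 0.40, 0.35, 0.25, 0.15, 0.05]


def precision_recall(threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


results = {t: precision_recall(t) for t in (0.3, 0.5, 0.7)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.3: precision=0.71, recall=1.00
# threshold=0.5: precision=0.80, recall=0.80
# threshold=0.7: precision=1.00, recall=0.60
```

A stricter threshold raises precision and lowers recall here, so the choice depends on which error costs more for the task.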
Good: Precision and recall above 0.85, F1 score close to 0.9 or higher, and steadily decreasing loss during training. This means the model predicts well and learns from data.
Bad: High accuracy with very low recall (e.g., recall 0.2) means the model misses many true cases. Loss that stops improving or starts increasing during training is another warning sign that the model may not be learning well.
- Accuracy paradox: High accuracy can be misleading if data is unbalanced.
- Data leakage: Using test data during training inflates metrics falsely.
- Overfitting: Model performs well on training but poorly on new data.
- Ignoring task needs: Using wrong metrics for the problem can hide issues.
Your fine-tuned model has 98% accuracy but only 12% recall on fraud detection. Is it good for production? Why or why not?
Answer: No, it is not good. Even though accuracy is high, the model misses 88% of fraud cases (low recall). This means many frauds go undetected, which is risky. For fraud detection, high recall is critical to catch as many frauds as possible.
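To make the fraud answer concrete, here is a small sketch with hypothetical counts chosen only so that they reproduce 98% accuracy and 12% recall. The dataset size and the number of fraud cases are assumptions, not figures from the question.

```python
# Hypothetical dataset: 10,000 transactions, 200 of them fraudulent.
total = 10_000
fraud = 200

tp = 24                   # fraud cases the model catches
fn = fraud - tp           # fraud cases the model misses (176)
fp = 24                   # legitimate transactions flagged as fraud
tn = total - fraud - fp   # legitimate transactions passed correctly (9776)

accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# Baseline that never flags anything: every legitimate transaction is "correct".
always_legit_accuracy = (total - fraud) / total

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")       # accuracy=0.98, recall=0.12
print(f"never-flag baseline accuracy={always_legit_accuracy:.2f}")  # 0.98
```

With these counts, a model that never flags fraud reaches the same accuracy, which is why accuracy alone says little on unbalanced data.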
Practice
Solution
Step 1: Understand model evaluation
Evaluation measures how well the model predicts on data it has not seen before.
Step 2: Identify the purpose of evaluation
It helps us know if the model learned useful patterns or just memorized training data.
Final Answer:
To check how well the model performs on new, unseen data -> Option B
Quick Check:
Evaluation = performance on new data [OK]
- Confusing evaluation with training
- Thinking evaluation changes model structure
- Believing evaluation increases data size
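The difference between learning a pattern and memorizing training data can be shown with a toy example. This is a sketch under assumptions: the dataset, the fixed 80/20 split, and the lookup-table "model" are all hypothetical and exist only to show why evaluation uses held-out data.

```python
# Toy dataset: the label is 1 for odd numbers and 0 for even numbers.
data = [(x, x % 2) for x in range(100)]

# Deterministic 80/20 split to keep the example reproducible;
# real splits are usually shuffled first.
train, test = data[:80], data[80:]

# A "model" that only memorizes the training examples.
memory = {x: label for x, label in train}


def predict(x):
    # Inputs that were never seen fall back to a default guess of 0.
    return memory.get(x, 0)


def accuracy(examples):
    correct = sum(1 for x, label in examples if predict(x) == label)
    return correct / len(examples)


train_accuracy = accuracy(train)
test_accuracy = accuracy(test)
print(f"train accuracy={train_accuracy:.2f}, test accuracy={test_accuracy:.2f}")
# train accuracy=1.00, test accuracy=0.50
```

The perfect training score hides the fact that nothing general was learned; only the held-out score reveals it.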
Solution
Step 1: Recall TensorFlow evaluation method
TensorFlow models use model.evaluate() to measure performance on test data.
Step 2: Identify correct usage
model.evaluate(test_data, test_labels) returns loss and metrics on unseen data.
Final Answer:
model.evaluate(test_data, test_labels) -> Option D
Quick Check:
Use model.evaluate() for testing [OK]
- Using model.train() instead of evaluate
- Calling predict() without labels for evaluation
- Confusing compile() with evaluation
print(loss, accuracy)?
loss, accuracy = model.evaluate(x_test, y_test)
print(loss, accuracy)
Solution
Step 1: Understand model.evaluate() output
It returns loss and metrics (like accuracy) on the test data.
Step 2: Analyze the print statement
Printing loss, accuracy shows these two values from evaluation.
Final Answer:
The loss value and accuracy metric on the test set -> Option A
Quick Check:
evaluate() returns loss and accuracy [OK]
- Thinking evaluate returns training metrics
- Assuming evaluate returns predictions
- Believing evaluate returns only one value
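For context, a complete run of the evaluate call might look like the sketch below. It assumes TensorFlow with its bundled Keras is installed; the synthetic data, the tiny model, and the training settings are illustrative assumptions. The two return values appear because the model is compiled with exactly one metric, `accuracy`.

```python
import numpy as np
import tensorflow as tf

# Synthetic binary classification data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4)).astype("float32")
y = (x[:, 0] + x[:, 1] > 0).astype("float32")

# Hold out the last 40 samples as unseen test data.
x_train, y_train = x[:160], y[:160]
x_test, y_test = x[160:], y[160:]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
# One loss plus one metric, so evaluate() returns two values.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, verbose=0)

# Both inputs and true labels are passed so loss and accuracy can be computed.
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(loss, accuracy)
```

The exact numbers vary from run to run because the weights are initialized randomly.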
model.evaluate(x_test) but got an error. What is the likely cause?
Solution
Step 1: Check evaluate method requirements
When the inputs are passed as a plain array, model.evaluate() needs both the input data and the true labels to compute loss and metrics.
Step 2: Identify missing argument
Calling model.evaluate(x_test) omits y_test, causing an error.
Final Answer:
Missing the true labels y_test in the evaluate call -> Option C
Quick Check:
evaluate() needs inputs and labels [OK]
- Forgetting to pass labels to evaluate()
- Assuming evaluate works with inputs only
- Ignoring model compilation status
- Model A: loss=0.25, accuracy=0.90
- Model B: loss=0.20, accuracy=0.85
Solution
Step 1: Understand evaluation metrics
Accuracy is the fraction of correct predictions; loss measures the size of the prediction error.
Step 2: Compare models on accuracy and loss
Model A has higher accuracy (0.90) but slightly higher loss (0.25) than Model B (accuracy 0.85, loss 0.20).
Step 3: Decide based on goal
For classification, accuracy is usually more important to pick the better model.
Final Answer:
Model A, because it has higher accuracy which is more important than loss -> Option A
Quick Check:
Higher accuracy preferred for classification [OK]
- Choosing model with lower loss but worse accuracy
- Ignoring accuracy when loss differs
- Assuming loss always trumps accuracy
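Loss and accuracy can disagree because loss also reflects how confident each prediction is. The sketch below uses hypothetical predicted probabilities for two models (they do not reproduce the 0.25 and 0.20 figures above) and assumes binary cross-entropy as the loss: one model is right more often but makes a single very confident mistake, the other is cautious and wrong more often.

```python
import math

# Hypothetical true labels and predicted probabilities of the positive class.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
probs_a = [0.9, 0.9, 0.9, 0.9, 0.01, 0.1, 0.1, 0.1, 0.1, 0.1]
probs_b = [0.7, 0.7, 0.7, 0.45, 0.45, 0.3, 0.3, 0.3, 0.3, 0.3]


def accuracy(probs):
    """Fraction of samples classified correctly at a 0.5 threshold."""
    correct = sum(1 for p, y in zip(probs, labels) if (p >= 0.5) == (y == 1))
    return correct / len(labels)


def log_loss(probs):
    """Mean binary cross-entropy: confident wrong predictions are penalized heavily."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -math.log(p if y == 1 else 1 - p)
    return total / len(labels)


acc_a, loss_a = accuracy(probs_a), log_loss(probs_a)
acc_b, loss_b = accuracy(probs_b), log_loss(probs_b)
print(f"Model A: accuracy={acc_a:.2f}, loss={loss_a:.3f}")  # accuracy=0.90, loss=0.555
print(f"Model B: accuracy={acc_b:.2f}, loss={loss_b:.3f}")  # accuracy=0.80, loss=0.445
```

Here the more accurate model has the higher loss, so it helps to look at both numbers and decide which one matches the goal of the task.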
