Evaluation of fine-tuned models in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Evaluation of fine-tuned models
Problem: You have fine-tuned a language model on a small custom dataset. You want to evaluate how well it performs on new text inputs compared to the original base model.
Current Metrics: Fine-tuned model accuracy on validation set: 78%. Base model accuracy on validation set: 75%.
Issue: The fine-tuned model shows only a small improvement and sometimes makes inconsistent predictions. You want to evaluate it thoroughly to understand whether fine-tuning helped and where.
Your Task
Evaluate the fine-tuned model's performance on a test dataset using accuracy, precision, recall, and F1-score. Compare these metrics with the base model to decide if fine-tuning improved the model.
Use the same test dataset for both models.
Do not change the model architectures or training data.
Use standard classification metrics.
Solution
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Simulated predictions and true labels for demonstration
true_labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
base_model_preds = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
fine_tuned_preds = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]

# Calculate metrics for base model
base_accuracy = accuracy_score(true_labels, base_model_preds) * 100
base_precision = precision_score(true_labels, base_model_preds) * 100
base_recall = recall_score(true_labels, base_model_preds) * 100
base_f1 = f1_score(true_labels, base_model_preds) * 100

# Calculate metrics for fine-tuned model
ft_accuracy = accuracy_score(true_labels, fine_tuned_preds) * 100
ft_precision = precision_score(true_labels, fine_tuned_preds) * 100
ft_recall = recall_score(true_labels, fine_tuned_preds) * 100
ft_f1 = f1_score(true_labels, fine_tuned_preds) * 100

print(f"Base Model - Accuracy: {base_accuracy:.1f}%, Precision: {base_precision:.1f}%, Recall: {base_recall:.1f}%, F1-score: {base_f1:.1f}%")
print(f"Fine-tuned Model - Accuracy: {ft_accuracy:.1f}%, Precision: {ft_precision:.1f}%, Recall: {ft_recall:.1f}%, F1-score: {ft_f1:.1f}%")
Added evaluation code using sklearn metrics to compare base and fine-tuned models.
Scored both models' predictions against the same test labels for a fair comparison.
Calculated accuracy, precision, recall, and F1-score for both models.
Results Interpretation

Base Model Metrics: Accuracy 80.0%, Precision 80.0%, Recall 80.0%, F1-score 80.0%

Fine-tuned Model Metrics: Accuracy 100.0%, Precision 100.0%, Recall 100.0%, F1-score 100.0%

On this simulated 10-example set, the fine-tuned predictions match every true label, so all four metrics rise from 80% to 100%. The base model makes one false positive and one false negative, which is why its precision and recall are both 4/5. These numbers only demonstrate the evaluation procedure; a real comparison needs a much larger held-out test set. Evaluating multiple metrics gives a clearer picture of where a model improved.
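To see where each metric comes from, the scores can be rederived by hand from the confusion counts of the simulated lists in the solution. The following is a minimal sketch in plain Python; the helper name `binary_metrics` is hypothetical, and returning 0.0 when a ratio is undefined is an assumption of this sketch.

```python
# Simulated labels and predictions copied from the solution code.
true_labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
base_model_preds = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
fine_tuned_preds = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]


def binary_metrics(y_true, y_pred):
    """Return confusion counts and metrics (as fractions) for positive label 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        # Assumption of this sketch: report 0.0 when a denominator is zero.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
    }


base_metrics = binary_metrics(true_labels, base_model_preds)
ft_metrics = binary_metrics(true_labels, fine_tuned_preds)

for name, m in (("Base", base_metrics), ("Fine-tuned", ft_metrics)):
    print(f"{name}: TP={m['tp']} FP={m['fp']} FN={m['fn']} TN={m['tn']} "
          f"acc={m['accuracy']:.2f} prec={m['precision']:.2f} "
          f"rec={m['recall']:.2f} f1={m['f1']:.2f}")
```

Counting the errors directly makes it easy to check that the sklearn output agrees with the data you fed in.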
Bonus Experiment
Try evaluating the models using a confusion matrix and visualize it to better understand the types of errors each model makes.
💡 Hint
Use sklearn.metrics.confusion_matrix and matplotlib to plot the confusion matrix for both models side by side.
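One possible way to follow this hint is sketched below, reusing the simulated labels and predictions from the solution. The plot layout, color map, and the headless `Agg` backend are assumptions made for illustration, not part of the original exercise.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Simulated labels and predictions copied from the solution code.
true_labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
base_model_preds = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
fine_tuned_preds = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]

# Rows are true labels, columns are predicted labels (sorted label order 0, 1).
base_cm = confusion_matrix(true_labels, base_model_preds)
ft_cm = confusion_matrix(true_labels, fine_tuned_preds)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
panels = zip(axes, (base_cm, ft_cm), ("Base model", "Fine-tuned model"))
for ax, cm, title in panels:
    ax.imshow(cm, cmap="Blues", vmin=0, vmax=cm.max())
    ax.set_title(title)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    ax.set_xticks([0, 1])
    ax.set_yticks([0, 1])
    # Write the count inside each cell.
    for row in range(2):
        for col in range(2):
            ax.text(col, row, str(cm[row, col]), ha="center", va="center")
fig.tight_layout()

print("Base model confusion matrix:\n", base_cm)
print("Fine-tuned model confusion matrix:\n", ft_cm)
```

To view the figure, call `fig.savefig(...)` with a path of your choice, or drop the `matplotlib.use("Agg")` line and call `plt.show()` in an interactive session.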

Practice

1. What is the main purpose of evaluating a fine-tuned model?
easy
A. To reduce the number of model layers
B. To check how well the model performs on new, unseen data
C. To speed up the training process
D. To increase the size of the training dataset

Solution

  1. Step 1: Understand model evaluation

    Evaluation measures how well the model predicts on data it has not seen before.
  2. Step 2: Identify the purpose of evaluation

    It helps us know if the model learned useful patterns or just memorized training data.
  3. Final Answer:

    To check how well the model performs on new, unseen data -> Option B
  4. Quick Check:

    Evaluation = performance on new data [OK]
Hint: Evaluation checks model on new data, not training data [OK]
Common Mistakes:
  • Confusing evaluation with training
  • Thinking evaluation changes model structure
  • Believing evaluation increases data size
2. Which of the following is the correct way to evaluate a fine-tuned model in Python using TensorFlow?
easy
A. model.compile(optimizer='adam')
B. model.train(test_data, test_labels)
C. model.predict(train_data)
D. model.evaluate(test_data, test_labels)

Solution

  1. Step 1: Recall TensorFlow evaluation method

    TensorFlow models use model.evaluate() to measure performance on test data.
  2. Step 2: Identify correct usage

    model.evaluate(test_data, test_labels) returns loss and metrics on unseen data.
  3. Final Answer:

    model.evaluate(test_data, test_labels) -> Option D
  4. Quick Check:

    Use model.evaluate() for testing [OK]
Hint: Use model.evaluate() with test data for evaluation [OK]
Common Mistakes:
  • Using model.train() instead of evaluate
  • Calling predict() without labels for evaluation
  • Confusing compile() with evaluation
3. Given the code below, what will be the output of print(loss, accuracy)?
loss, accuracy = model.evaluate(x_test, y_test)
print(loss, accuracy)
medium
A. The loss value and accuracy metric on the test set
B. The training loss and accuracy values
C. A syntax error because evaluate returns only one value
D. The predicted labels for x_test

Solution

  1. Step 1: Understand model.evaluate() output

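    It returns the loss followed by any metrics passed to model.compile(), computed on the test data. Unpacking into loss and accuracy assumes the model was compiled with metrics=['accuracy'].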
  2. Step 2: Analyze the print statement

    Printing loss, accuracy shows these two values from evaluation.
  3. Final Answer:

    The loss value and accuracy metric on the test set -> Option A
  4. Quick Check:

    evaluate() returns loss and accuracy [OK]
Hint: model.evaluate() returns the loss followed by the compiled metrics, as a list [OK]
Common Mistakes:
  • Thinking evaluate returns training metrics
  • Assuming evaluate returns predictions
  • Believing evaluate returns only one value
4. You ran model.evaluate(x_test) but got an error. What is the likely cause?
medium
A. The model is not compiled
B. The test data x_test is empty
C. Missing the true labels y_test in the evaluate call
D. The model has too many layers

Solution

  1. Step 1: Check evaluate method requirements

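    When x_test is an array or tensor, model.evaluate() needs the true labels as a second argument to compute loss and metrics. Labels can be omitted only when the input is a dataset that already yields (inputs, labels) pairs.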
  2. Step 2: Identify missing argument

    Calling model.evaluate(x_test) misses y_test, causing an error.
  3. Final Answer:

    Missing the true labels y_test in the evaluate call -> Option C
  4. Quick Check:

    evaluate() needs inputs and labels [OK]
Hint: Always pass both data and labels to evaluate() [OK]
Common Mistakes:
  • Forgetting to pass labels to evaluate()
  • Assuming evaluate works with inputs only
  • Ignoring model compilation status
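The behavior of model.evaluate() discussed in questions 2 through 4 can be tried on a toy Keras model. This is a minimal sketch under stated assumptions: the random data, the single-layer model, and the variable names `test_ds`, `ds_loss`, and `ds_accuracy` are all hypothetical, and the model is left untrained because only the evaluation call is of interest.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 64 samples, 4 features, binary labels of shape (64, 1).
rng = np.random.default_rng(0)
x_test = rng.normal(size=(64, 4)).astype("float32")
y_test = (x_test[:, :1] > 0).astype("float32")

# Hypothetical stand-in for a fine-tuned classifier.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
# metrics=["accuracy"] is what makes evaluate() return accuracy after the loss.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With arrays, pass both the inputs and the true labels.
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"arrays:  loss={loss:.4f}, accuracy={accuracy:.4f}")

# With a dataset that yields (inputs, labels) pairs, no separate labels argument is given.
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(16)
ds_loss, ds_accuracy = model.evaluate(test_ds, verbose=0)
print(f"dataset: loss={ds_loss:.4f}, accuracy={ds_accuracy:.4f}")
```

Both calls score the same examples, so the two accuracy values should agree.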
5. You fine-tuned two models and got these evaluation results on the same test set:
  • Model A: loss=0.25, accuracy=0.90
  • Model B: loss=0.20, accuracy=0.85
Which model should you choose and why?
hard
A. Model A, because it has higher accuracy which is more important than loss
B. Model B, because it has lower loss indicating better overall fit
C. Model A, because loss and accuracy must both be higher
D. Model B, because accuracy is less important than loss

Solution

  1. Step 1: Understand evaluation metrics

    Accuracy is the fraction of correct predictions. Loss measures how far the predicted probabilities are from the true labels, so it also penalizes confident mistakes.
  2. Step 2: Compare models on accuracy and loss

    Model A has higher accuracy (0.90) but slightly higher loss (0.25) than Model B.
  3. Step 3: Decide based on goal

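    When the goal is the number of correct classifications, accuracy is usually the deciding metric. Lower loss with lower accuracy suggests Model B assigns better-calibrated probabilities but makes more wrong decisions.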
  4. Final Answer:

    Model A, because it has higher accuracy which is more important than loss -> Option A
  5. Quick Check:

    Higher accuracy preferred for classification [OK]
Hint: Pick model with higher accuracy for classification tasks [OK]
Common Mistakes:
  • Choosing model with lower loss but worse accuracy
  • Ignoring accuracy when loss differs
  • Assuming loss always trumps accuracy
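Loss and accuracy can rank two models differently, as in question 5, because cross-entropy loss depends on predicted probabilities while accuracy depends only on the thresholded decision. The sketch below illustrates this with made-up probabilities; the labels, the two probability lists, the 0.5 threshold, and the helper names are all hypothetical and do not reproduce the 0.25 and 0.20 losses quoted in the question.

```python
import math

# Hypothetical true labels and predicted probabilities of class 1.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Model A: confident and mostly right, with one very confident mistake (0.01).
probs_a = [0.9, 0.9, 0.9, 0.9, 0.01, 0.1, 0.1, 0.1, 0.1, 0.1]
# Model B: less confident, with two marginal mistakes (0.45 and 0.55).
probs_b = [0.8, 0.8, 0.8, 0.8, 0.45, 0.2, 0.2, 0.2, 0.2, 0.55]


def accuracy(y_true, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(1 for t, p in zip(y_true, preds) if t == p) / len(y_true)


def binary_cross_entropy(y_true, probs):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, probs):
        total += -math.log(p) if t == 1 else -math.log(1.0 - p)
    return total / len(y_true)


acc_a, loss_a = accuracy(labels, probs_a), binary_cross_entropy(labels, probs_a)
acc_b, loss_b = accuracy(labels, probs_b), binary_cross_entropy(labels, probs_b)
print(f"Model A: accuracy={acc_a:.2f}, loss={loss_a:.3f}")
print(f"Model B: accuracy={acc_b:.2f}, loss={loss_b:.3f}")
```

In this toy case Model A classifies more examples correctly but has the higher loss, because its single confident mistake is penalized heavily. Which model to prefer depends on whether correct decisions or well-calibrated probabilities matter more for the task.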