Prompt Engineering / GenAI (~20 mins)

Evaluation of fine-tuned models in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Evaluation of fine-tuned models
Problem: You have fine-tuned a language model on a small custom dataset. You want to evaluate how well it performs on new text inputs compared to the original base model.
Current Metrics: Fine-tuned model accuracy on validation set: 78%; base model accuracy on validation set: 75%.
Issue: The fine-tuned model shows only a small improvement and sometimes makes inconsistent predictions. You want to evaluate it thoroughly to understand whether fine-tuning helped, and where.
Your Task
Evaluate the fine-tuned model's performance on a test dataset using accuracy, precision, recall, and F1-score. Compare these metrics with the base model to decide if fine-tuning improved the model.
Use the same test dataset for both models.
Do not change the model architectures or training data.
Use standard classification metrics.
Solution
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Simulated predictions and true labels for demonstration
true_labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
base_model_preds = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]  # one false negative, one false positive
fine_tuned_preds = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]  # one false negative, no false positives

# Calculate metrics for the base model
base_accuracy = accuracy_score(true_labels, base_model_preds) * 100
base_precision = precision_score(true_labels, base_model_preds) * 100
base_recall = recall_score(true_labels, base_model_preds) * 100
base_f1 = f1_score(true_labels, base_model_preds) * 100

# Calculate metrics for the fine-tuned model
ft_accuracy = accuracy_score(true_labels, fine_tuned_preds) * 100
ft_precision = precision_score(true_labels, fine_tuned_preds) * 100
ft_recall = recall_score(true_labels, fine_tuned_preds) * 100
ft_f1 = f1_score(true_labels, fine_tuned_preds) * 100

print(f"Base Model - Accuracy: {base_accuracy:.1f}%, Precision: {base_precision:.1f}%, Recall: {base_recall:.1f}%, F1-score: {base_f1:.1f}%")
print(f"Fine-tuned Model - Accuracy: {ft_accuracy:.1f}%, Precision: {ft_precision:.1f}%, Recall: {ft_recall:.1f}%, F1-score: {ft_f1:.1f}%")
Added evaluation code using sklearn metrics to compare the base and fine-tuned models.
Scored both models against the same test labels for a fair comparison.
Calculated accuracy, precision, recall, and F1-score for both models.
Results Interpretation

Base Model Metrics: Accuracy 80.0%, Precision 80.0%, Recall 80.0%, F1-score 80.0%

Fine-tuned Model Metrics: Accuracy 90.0%, Precision 100.0%, Recall 80.0%, F1-score 88.9%

Fine-tuning improved accuracy, precision, and F1-score: the fine-tuned model raised no false alarms (precision 100%), though it still missed one positive case (recall unchanged at 80%). Evaluating multiple metrics gives a clearer picture than accuracy alone, because two models with similar accuracy can make very different kinds of errors.
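As a compact alternative to computing each metric separately, scikit-learn's `classification_report` returns precision, recall, and F1 per class in one call. A minimal sketch, using illustrative toy labels rather than real model output:

```python
from sklearn.metrics import classification_report

# Toy labels and predictions (illustrative data, not real model output)
true_labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
fine_tuned_preds = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# output_dict=True returns a nested dict keyed by class label instead of a text table
report = classification_report(true_labels, fine_tuned_preds, output_dict=True)
print(f"Class 1 precision: {report['1']['precision']:.3f}, recall: {report['1']['recall']:.3f}")
```

Without `output_dict=True`, the same call returns a formatted text table, which is handy for quick inspection in a notebook.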
Bonus Experiment
Try evaluating the models using a confusion matrix and visualize it to better understand the types of errors each model makes.
💡 Hint
Use sklearn.metrics.confusion_matrix and matplotlib to plot the confusion matrix for both models side by side.
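One possible sketch of the bonus experiment, reusing the same toy labels and predictions as the solution above (the data and filename are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Toy labels and predictions (illustrative, not real model output)
true_labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
base_model_preds = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
fine_tuned_preds = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, preds, title in [
    (axes[0], base_model_preds, "Base model"),
    (axes[1], fine_tuned_preds, "Fine-tuned model"),
]:
    cm = confusion_matrix(true_labels, preds)  # rows: true class, columns: predicted class
    ConfusionMatrixDisplay(cm, display_labels=[0, 1]).plot(ax=ax, colorbar=False)
    ax.set_title(title)
plt.tight_layout()
plt.savefig("confusion_matrices.png")  # or plt.show() in an interactive session
```

Reading the panels side by side shows the error types directly: the base model has both a false positive and a false negative, while the fine-tuned model's only error is a single false negative.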