Prompt Engineering / GenAI (~20 mins)

Evaluation of fine-tuned models in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Evaluation of fine-tuned models
Problem: You have fine-tuned a language model on a small custom dataset. You want to evaluate how well it performs on new text inputs compared to the original base model.
Current Metrics: Fine-tuned model accuracy on validation set: 78%; base model accuracy on validation set: 75%.
Issue: The fine-tuned model shows only a small improvement and sometimes makes inconsistent predictions. You want to evaluate it thoroughly to understand whether fine-tuning helped, and where.
Your Task
Evaluate the fine-tuned model's performance on a test dataset using accuracy, precision, recall, and F1-score. Compare these metrics with the base model to decide if fine-tuning improved the model.
Use the same test dataset for both models.
Do not change the model architectures or training data.
Use standard classification metrics.
Solution
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Simulated predictions and true labels for demonstration
true_labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
base_model_preds = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]  # one false negative, one false positive
fine_tuned_preds = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]  # one false negative, no false positives

# Calculate metrics for the base model
base_accuracy = accuracy_score(true_labels, base_model_preds) * 100
base_precision = precision_score(true_labels, base_model_preds) * 100
base_recall = recall_score(true_labels, base_model_preds) * 100
base_f1 = f1_score(true_labels, base_model_preds) * 100

# Calculate metrics for the fine-tuned model
ft_accuracy = accuracy_score(true_labels, fine_tuned_preds) * 100
ft_precision = precision_score(true_labels, fine_tuned_preds) * 100
ft_recall = recall_score(true_labels, fine_tuned_preds) * 100
ft_f1 = f1_score(true_labels, fine_tuned_preds) * 100

print(f"Base Model - Accuracy: {base_accuracy:.1f}%, Precision: {base_precision:.1f}%, Recall: {base_recall:.1f}%, F1-score: {base_f1:.1f}%")
print(f"Fine-tuned Model - Accuracy: {ft_accuracy:.1f}%, Precision: {ft_precision:.1f}%, Recall: {ft_recall:.1f}%, F1-score: {ft_f1:.1f}%")
Added evaluation code using sklearn metrics to compare the base and fine-tuned models.
Scored both models against the same test labels for a fair comparison.
Calculated accuracy, precision, recall, and F1-score for both models.
Results Interpretation

Base Model Metrics: Accuracy 80.0%, Precision 80.0%, Recall 80.0%, F1-score 80.0%

Fine-tuned Model Metrics: Accuracy 90.0%, Precision 100.0%, Recall 80.0%, F1-score 88.9%

Fine-tuning improved accuracy, precision, and F1-score: the fine-tuned model raised no false alarms (precision 100%), though it still missed one positive case (recall unchanged at 80%). Evaluating multiple metrics gives a clearer picture than accuracy alone, because two models with similar accuracy can make very different kinds of errors.
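As a compact alternative to computing each metric separately, scikit-learn's `classification_report` returns precision, recall, and F1 per class in one call. A minimal sketch, using illustrative toy labels rather than real model output:

```python
from sklearn.metrics import classification_report

# Toy labels and predictions (illustrative data, not real model output)
true_labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
fine_tuned_preds = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# output_dict=True returns a nested dict keyed by class label instead of a text table
report = classification_report(true_labels, fine_tuned_preds, output_dict=True)
print(f"Class 1 precision: {report['1']['precision']:.3f}, recall: {report['1']['recall']:.3f}")
```

Without `output_dict=True`, the same call returns a formatted text table, which is handy for quick inspection in a notebook.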
Bonus Experiment
Try evaluating the models using a confusion matrix and visualize it to better understand the types of errors each model makes.
💡 Hint
Use sklearn.metrics.confusion_matrix and matplotlib to plot the confusion matrix for both models side by side.
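One possible sketch of the bonus experiment, reusing the same toy labels and predictions as the solution above (the data and filename are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Toy labels and predictions (illustrative, not real model output)
true_labels = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
base_model_preds = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
fine_tuned_preds = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, preds, title in [
    (axes[0], base_model_preds, "Base model"),
    (axes[1], fine_tuned_preds, "Fine-tuned model"),
]:
    cm = confusion_matrix(true_labels, preds)  # rows: true class, columns: predicted class
    ConfusionMatrixDisplay(cm, display_labels=[0, 1]).plot(ax=ax, colorbar=False)
    ax.set_title(title)
plt.tight_layout()
plt.savefig("confusion_matrices.png")  # or plt.show() in an interactive session
```

Reading the panels side by side shows the error types directly: the base model has both a false positive and a false negative, while the fine-tuned model's only error is a single false negative.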