You fine-tuned a text classification model and evaluated it on a test set. The model predicted labels for 100 samples. The confusion matrix is:
[[40, 10], [5, 45]]
What is the accuracy of the model?
Accuracy = (True Positives + True Negatives) / Total samples
Accuracy = (40 + 45) / 100 = 85 / 100 = 0.85. The diagonal entries of the confusion matrix (40 and 45) count the correctly classified samples, so the accuracy is 0.85.
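The calculation above can be checked in code. A minimal sketch using NumPy, assuming the sklearn convention that rows are true labels and columns are predicted labels:

```python
import numpy as np

# Confusion matrix from the question: rows = true class, columns = predicted class
cm = np.array([[40, 10],
               [5, 45]])

# Correct predictions lie on the diagonal; accuracy = diagonal sum / total samples
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.85
```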
Consider this Python code evaluating a fine-tuned regression model's predictions:
from sklearn.metrics import mean_squared_error

true = [3.0, -0.5, 2.0, 7.0]
pred = [2.5, 0.0, 2.0, 8.0]
mse = mean_squared_error(true, pred)
print(round(mse, 2))
What is the printed output?
Mean Squared Error is the average of squared differences between true and predicted values.
Squared errors: (3-2.5)^2=0.25, (-0.5-0)^2=0.25, (2-2)^2=0, (7-8)^2=1. Sum=1.5, average=1.5/4=0.375, rounded to 0.38.
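The arithmetic above can be verified by computing the MSE by hand and comparing it against sklearn:

```python
from sklearn.metrics import mean_squared_error

true = [3.0, -0.5, 2.0, 7.0]
pred = [2.5, 0.0, 2.0, 8.0]

# Manual computation: average of the squared differences
manual = sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)

# The library result should match
mse = mean_squared_error(true, pred)
print(manual)          # 0.375
print(round(mse, 2))   # 0.38
```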
You fine-tuned a model on a dataset where 95% of samples belong to class A and 5% to class B. Which evaluation metric is best to assess the model's performance on the minority class?
Consider a metric that balances precision and recall for the minority class.
F1-score balances precision and recall, making it suitable for imbalanced data to evaluate minority class performance.
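A small sketch of why accuracy misleads here, using illustrative labels (not from the question): a model that always predicts the majority class scores high accuracy but an F1 of zero on the minority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: class 1 is the minority
true = [0] * 18 + [1] * 2
# A degenerate model that always predicts the majority class
pred = [0] * 20

print(accuracy_score(true, pred))                        # 0.9 — looks good
print(f1_score(true, pred, pos_label=1, zero_division=0))  # 0.0 — minority class never found
```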
What error will this code raise when evaluating a fine-tuned classification model?
from sklearn.metrics import accuracy_score

true_labels = [1, 0, 1, 1]
pred_labels = [1, 0, 0]
acc = accuracy_score(true_labels, pred_labels)
print(acc)
Check if true and predicted label lists have the same length.
accuracy_score requires the true and predicted label lists to have the same length; here the lengths differ (4 vs. 3), so the call raises a ValueError.
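The failure and its fix can be demonstrated directly; the corrected prediction list below is an assumed example, not from the question:

```python
from sklearn.metrics import accuracy_score

true_labels = [1, 0, 1, 1]
pred_labels = [1, 0, 0]

# Mismatched lengths (4 vs. 3) raise a ValueError
try:
    accuracy_score(true_labels, pred_labels)
except ValueError as e:
    print("ValueError:", e)

# With matching lengths the call succeeds
pred_fixed = [1, 0, 0, 1]  # hypothetical corrected predictions
print(accuracy_score(true_labels, pred_fixed))  # 0.75 (3 of 4 correct)
```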
After fine-tuning a pre-trained language model on a small dataset, you observe that the training accuracy is very high but the validation accuracy is low. What is the most likely explanation?
Think about what happens when a model learns training data too well but fails on new data.
High training accuracy but low validation accuracy indicates overfitting, where the model memorizes training data but does not generalize well.
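Overfitting can be made visible with a small synthetic experiment. A sketch, assuming random labels so there is genuinely nothing to learn: an unconstrained decision tree memorizes the training set (training accuracy 1.0) yet fails to generalize to the validation split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # random features
y = rng.integers(0, 2, size=200)   # random labels: no real signal

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# An unconstrained tree can fit the training data perfectly
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(train_acc)  # 1.0 — memorized
print(val_acc)    # much lower — does not generalize
```

The large gap between the two scores is the signature of overfitting described in the answer above.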