GPT family overview in NLP - Model Metrics & Evaluation

For GPT models, common metrics include perplexity and accuracy on language tasks. Perplexity measures how well the model predicts the next token (roughly, the next word); lower is better. Accuracy measures the fraction of correct predictions on specific tasks like classification. For GPT, perplexity is key because it reflects how well the model has learned the statistical patterns of language.
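As a concrete sketch (not tied to any particular model API), perplexity is the exponential of the average negative log-likelihood the model assigns to the actual next tokens:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood
    over the log-probabilities the model gave the true next tokens."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every correct next token
# is, on average, "choosing between 2 options" -> perplexity 2.
log_probs = [math.log(0.5)] * 4
print(perplexity(log_probs))  # 2.0
```

Intuitively, a perplexity of k means the model is as uncertain as if it were picking uniformly among k tokens at each step.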
GPT models are often evaluated on language generation, so confusion matrices are less common. However, for classification tasks using GPT, a confusion matrix shows:
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
These values help calculate precision, recall, and F1 score to understand GPT's classification performance.
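Given the four cells above, the derived metrics can be computed directly. A minimal sketch (the function name `prf1` is my own):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of everything flagged positive, how much was right
    recall = tp / (tp + fn)     # of all actual positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Example: 40 true positives, 10 false positives, 10 false negatives
print(prf1(40, 10, 10))  # ≈ (0.8, 0.8, 0.8)
```

Note that true negatives do not appear in any of these formulas, which is exactly why these metrics stay informative on imbalanced data where accuracy does not.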
When GPT is used for tasks like spam detection, the tradeoff between precision and recall matters:
- High Precision: Few false alarms. Good when you don't want to mark good emails as spam.
- High Recall: Catch most spam. Important when missing spam is costly.
Choosing which to prioritize depends on the task GPT is applied to.
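To make the tradeoff concrete, here is a small sketch with made-up spam scores and labels, showing how raising the decision threshold buys precision at the cost of recall:

```python
def precision_recall(scores, labels, threshold):
    """Count outcomes when emails scoring >= threshold are flagged as spam."""
    tp = fp = fn = 0
    for score, is_spam in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_spam:
            tp += 1
        elif flagged and not is_spam:
            fp += 1
        elif not flagged and is_spam:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model scores and true labels (1 = spam).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   1,   0,   1,   0]

print(precision_recall(scores, labels, 0.5))   # (0.75, 0.75)
print(precision_recall(scores, labels, 0.75))  # (1.0, 0.5): fewer false alarms, more missed spam
```

The stricter threshold never flags a good email (precision 1.0) but lets half the spam through (recall 0.5); which point on this curve to pick is exactly the task-dependent choice described above.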
Good: Low perplexity (as a rough rule of thumb, 10 or less on test data), high accuracy (above 90%) on classification tasks, balanced precision and recall. These thresholds are task-dependent, not universal.
Bad: High perplexity (e.g., above 100), low accuracy (below 50%), very low recall or precision indicating poor understanding or biased predictions.
- Accuracy paradox: High accuracy on imbalanced data can be misleading.
- Data leakage: Training data leaking into test data inflates metrics falsely.
- Overfitting: Very low training loss but poor test performance means model memorizes instead of generalizing.
- Ignoring context: Token-level metrics like perplexity don't capture coherence, factuality, or usefulness, so they can miss real model quality.
Your GPT-based spam filter has 98% accuracy but only 12% recall on spam emails. Is it good for production? Why or why not?
Answer: No, it is not good. The model misses most spam emails (12% recall means 88% of spam gets through). The high accuracy is misleading: because most emails are not spam, the model can score well simply by predicting "not spam" almost every time. This is the accuracy paradox from the list above, and improving recall is critical here.
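The scenario can be reproduced with a hypothetical mailbox (the specific counts below are illustrative, chosen to roughly match the stated 98% accuracy and 12% recall):

```python
# Hypothetical mailbox: 1000 emails, only 25 of them spam (imbalanced data).
tp, fn = 3, 22   # catches 3 of 25 spam emails -> recall = 12%
tn, fp = 975, 0  # every legitimate email is classified correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(accuracy)  # 0.978 -> reported as ~98%
print(recall)    # 0.12
```

Despite letting 22 of 25 spam emails through, the filter looks excellent on accuracy alone, which is why recall must be reported alongside it for imbalanced tasks.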