## Model Selection (GPT-4 vs GPT-3.5) in Prompt Engineering / GenAI: Model Metrics & Evaluation

When choosing between GPT-4 and GPT-3.5, the key metrics are accuracy, response quality, and latency. Accuracy measures how often the model gives correct or useful answers. Response quality measures how clear, relevant, and helpful those answers are. Latency is how quickly the model responds. Pick metrics based on what matters most: if you need the best answers, weight accuracy and quality; if you need speed, weight latency.
**Example: Comparing GPT-4 and GPT-3.5 on 100 questions**

| Model   | Correct (TP) | Incorrect (FP+FN) | Total |
|---------|--------------|-------------------|-------|
| GPT-4   | 90           | 10                | 100   |
| GPT-3.5 | 75           | 25                | 100   |

Here, GPT-4 answers 90 of 100 questions correctly, while GPT-3.5 answers 75.
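The accuracy figures implied by the table can be computed directly; a minimal sketch (the `results` dictionary simply restates the table's numbers):

```python
# Accuracy = correct answers / total questions, per model.
results = {
    "GPT-4": (90, 100),    # (correct, total) from the table
    "GPT-3.5": (75, 100),
}

for model, (correct, total) in results.items():
    accuracy = correct / total
    print(f"{model}: accuracy = {accuracy:.0%}")
```

Running this prints 90% for GPT-4 and 75% for GPT-3.5, matching the table.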
For language models, think of precision as how often the model's answers are correct when it gives an answer, and recall as how many of all possible correct answers the model actually provides.
GPT-4 tends to have higher precision and recall, giving more correct and complete answers. GPT-3.5 might be faster but less precise, sometimes giving wrong or incomplete answers.
Example: If you want a chatbot that never gives wrong info, prioritize precision (GPT-4). If you want quick answers and can tolerate some mistakes, recall and speed (GPT-3.5) might be enough.
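To make the precision/recall distinction concrete, here is a toy sketch for a Q&A model that can abstain from answering. All numbers are hypothetical, chosen only to illustrate the two formulas:

```python
# Hypothetical Q&A evaluation where the model may decline to answer.
# precision = correct answers / questions it answered
# recall    = correct answers / all answerable questions
answered = 80      # questions the model chose to answer (of 100)
correct = 72       # of those, how many were right
answerable = 100   # questions that had a known correct answer

precision = correct / answered     # high: when it answers, it's usually right
recall = correct / answerable      # lower: it missed some answerable questions

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here the model is precise (0.90) but incomplete (0.72 recall): the chatbot-that-never-lies use case optimizes the first number, the answer-everything use case the second.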
- **Good:** GPT-4 with 90%+ accuracy, high response quality, and acceptable latency (e.g., 1-2 seconds).
- **Bad:** GPT-3.5 with 70% accuracy, lower-quality answers, or responses so slow they frustrate users.
Good models balance accuracy and speed to fit your needs.
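One way to operationalize that balance is to filter candidates by a latency budget and then pick the most accurate survivor. A minimal sketch, where the model stats and the `MAX_LATENCY_S` threshold are assumed example values, not measured benchmarks:

```python
# Pick the most accurate model that fits an assumed latency budget.
candidates = [
    {"name": "GPT-4", "accuracy": 0.90, "latency_s": 1.5},
    {"name": "GPT-3.5", "accuracy": 0.75, "latency_s": 0.4},
]
MAX_LATENCY_S = 2.0  # assumed product requirement

viable = [m for m in candidates if m["latency_s"] <= MAX_LATENCY_S]
best = max(viable, key=lambda m: m["accuracy"])
print(best["name"])  # GPT-4: both fit the budget, so accuracy decides
```

If the budget were tightened to 1 second, GPT-3.5 would win despite its lower accuracy, which is the trade-off the pitfalls below warn about.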
- Ignoring latency: A very accurate model that is too slow may not be practical.
- Overfitting to test data: A model might look great on a small test set but fail in real use.
- Data leakage: If test questions were seen during training, accuracy is misleading.
- Focusing only on accuracy: Quality and relevance of answers matter too.
Your GPT-3.5 model has 98% accuracy on a test set but only 12% recall on rare topics. Is it good for production? Why or why not?
Answer: No. Although overall accuracy is high, the model misses most rare topics (12% recall), so it frequently fails on important but uncommon questions, which hurts user experience in production.
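The numbers below show how this situation arises: a test set dominated by common topics lets overall accuracy stay near 98% even when rare-topic recall is only 12%. The split (975 common vs. 25 rare questions) is an illustrative assumption, not data from the scenario:

```python
# Illustrative split: accuracy is dominated by the common topics,
# masking near-total failure on the rare ones.
common_total, common_correct = 975, 975  # model nails frequent topics
rare_total, rare_correct = 25, 3         # ...but answers few rare ones

accuracy = (common_correct + rare_correct) / (common_total + rare_total)
rare_recall = rare_correct / rare_total

print(f"overall accuracy:  {accuracy:.1%}")    # 97.8%
print(f"rare-topic recall: {rare_recall:.0%}") # 12%
```

This is why a single aggregate accuracy number should always be broken down by topic or slice before a production decision.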