
Model selection (GPT-4, GPT-3.5) in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Model selection (GPT-4, GPT-3.5)
Which metrics matter for model selection, and why

When choosing between GPT-4 and GPT-3.5, key metrics include accuracy, response quality, and latency. Accuracy shows how often the model gives correct or useful answers. Response quality measures how clear, relevant, and helpful the answers are. Latency is how fast the model responds. We pick metrics based on what matters most: if you want the best answers, accuracy and quality matter more; if you want speed, latency matters more.
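Latency is the easiest of these to measure directly. A minimal sketch, assuming a placeholder `call_model` function standing in for whatever client or SDK you actually use:

```python
import time

def call_model(prompt: str) -> str:
    # Placeholder: a real implementation would call the model's API here.
    return "sample answer"

# Time a single call; averaging over many calls gives a more stable estimate.
start = time.perf_counter()
answer = call_model("What is the capital of France?")
latency = time.perf_counter() - start
print(f"latency: {latency:.3f}s")
```

In practice you would repeat this over a batch of representative prompts and compare mean (and tail) latency per model alongside accuracy.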

Confusion matrix or equivalent visualization
Example: Comparing GPT-4 and GPT-3.5 on 100 questions

| Model   | Correct (TP) | Incorrect (FP+FN) | Total |
|---------|--------------|------------------|-------|
| GPT-4   | 90           | 10               | 100   |
| GPT-3.5 | 75           | 25               | 100   |

Here, GPT-4 answers 90 of 100 questions correctly, while GPT-3.5 answers 75.
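Accuracy falls straight out of these counts: correct answers divided by total questions. A quick sketch using the numbers from the table above:

```python
# (correct, total) per model, taken from the table above.
models = {"GPT-4": (90, 100), "GPT-3.5": (75, 100)}

for name, (correct, total) in models.items():
    # Accuracy = fraction of questions answered correctly.
    print(f"{name}: accuracy = {correct / total:.0%}")
```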
Precision vs Recall tradeoff with concrete examples

For language models, think of precision as how often the model's answers are correct when it gives an answer, and recall as how many of all possible correct answers the model actually provides.

GPT-4 tends to have higher precision and recall, giving more correct and complete answers. GPT-3.5 might be faster but less precise, sometimes giving wrong or incomplete answers.

Example: If you want a chatbot that never gives wrong info, prioritize precision (GPT-4). If you want quick answers and can tolerate occasional mistakes, GPT-3.5's speed may be enough.
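The two metrics are simple ratios over the confusion-matrix counts. A sketch with hypothetical counts (TP = correct answers given, FP = wrong answers given, FN = correct answers the model failed to provide):

```python
def precision(tp: int, fp: int) -> float:
    # Of the answers given, what fraction were correct?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of all the correct answers possible, what fraction did the model give?
    return tp / (tp + fn)

# Hypothetical counts for illustration only:
tp, fp, fn = 80, 5, 15
print(f"precision = {precision(tp, fp):.2f}")  # 80 / 85
print(f"recall    = {recall(tp, fn):.2f}")     # 80 / 95
```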

What "good" vs "bad" metric values look like for this use case

Good: GPT-4 with 90%+ accuracy, high response quality, and acceptable latency (e.g., 1-2 seconds).

Bad: GPT-3.5 with 70% accuracy, lower quality answers, or very slow responses that frustrate users.

Good models balance accuracy and speed to fit your needs.
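One way to make "balance" concrete is a weighted score over accuracy and speed. The weights and latency cap below are assumptions, not a standard formula; tune them to your use case:

```python
def model_score(accuracy: float, latency_s: float,
                w_acc: float = 0.8, w_speed: float = 0.2,
                max_latency_s: float = 5.0) -> float:
    # Convert latency to a 0..1 speed score: 1.0 = instant, 0.0 = at/over the cap.
    speed = max(0.0, 1 - latency_s / max_latency_s)
    return w_acc * accuracy + w_speed * speed

# Hypothetical profiles mirroring the numbers in this section:
print(model_score(accuracy=0.90, latency_s=1.8))  # GPT-4-like profile
print(model_score(accuracy=0.75, latency_s=0.6))  # GPT-3.5-like profile
```

With accuracy weighted heavily, the slower but more accurate profile wins; shift the weights toward speed and the ranking can flip.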

Common pitfalls in model selection metrics
  • Ignoring latency: A very accurate model that is too slow may not be practical.
  • Overfitting to test data: A model might look great on a small test set but fail in real use.
  • Data leakage: If test questions were seen during training, accuracy is misleading.
  • Focusing only on accuracy: Quality and relevance of answers matter too.
Self-check question

Your GPT-3.5 model has 98% accuracy on a test set but only 12% recall on rare topics. Is it good for production? Why or why not?

Answer: No, because while overall accuracy is high, the model misses most rare topics (low recall). This means it often fails to answer important but uncommon questions, which can hurt user experience.
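The arithmetic behind this trap is worth seeing once: when rare topics are a small slice of the test set, the model can miss nearly all of them and still post a high overall accuracy. Hypothetical numbers chosen to mirror the self-check:

```python
# Common-topic questions dominate the test set.
common_total, common_correct = 4950, 4894
# Rare-topic questions: the model gets only 6 of 50 right.
rare_total, rare_correct = 50, 6

accuracy = (common_correct + rare_correct) / (common_total + rare_total)
rare_recall = rare_correct / rare_total

print(f"overall accuracy: {accuracy:.0%}")    # dominated by common topics
print(f"rare-topic recall: {rare_recall:.0%}")
```

This is why per-topic (or per-class) metrics belong in any evaluation, not just one aggregate number.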

Key Result
Model selection balances accuracy, response quality, and speed to fit user needs: GPT-4 generally offers higher accuracy and quality, while GPT-3.5 offers faster but less precise responses.