When choosing between GPT-4 and GPT-3.5, key metrics include accuracy, response quality, and latency. Accuracy shows how often the model gives correct or useful answers. Response quality measures how clear, relevant, and helpful the answers are. Latency is how fast the model responds. We pick metrics based on what matters most: if you want the best answers, accuracy and quality matter more; if you want speed, latency matters more.
Model selection (GPT-4, GPT-3.5) in Prompt Engineering / GenAI - Model Metrics & Evaluation
Example: Comparing GPT-4 and GPT-3.5 on 100 questions

| Model | Correct (TP) | Incorrect (FP+FN) | Total |
|---------|--------------|------------------|-------|
| GPT-4 | 90 | 10 | 100 |
| GPT-3.5 | 75 | 25 | 100 |

Here, GPT-4 answers 90 out of 100 questions correctly, while GPT-3.5 answers 75 correctly.
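The accuracy behind this comparison is correct answers divided by total questions. A minimal sketch using the counts from the example; the dictionary layout and the `accuracy` helper are illustrative names, not part of any library.

```python
# Counts copied from the 100-question comparison above.
results = {
    "GPT-4": {"correct": 90, "total": 100},
    "GPT-3.5": {"correct": 75, "total": 100},
}


def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    return correct / total


accuracies = {name: accuracy(r["correct"], r["total"]) for name, r in results.items()}

for name, acc in accuracies.items():
    print(f"{name}: {acc:.0%}")
```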
For language models, think of precision as how often the model's answers are correct when it gives an answer, and recall as how many of all possible correct answers the model actually provides.
GPT-4 tends to have higher precision and recall, giving more correct and complete answers. GPT-3.5 might be faster but less precise, sometimes giving wrong or incomplete answers.
Example: If you want a chatbot that should avoid giving wrong info, prioritize precision (GPT-4). If you want quick answers and can tolerate some mistakes, speed matters more and GPT-3.5 might be enough.
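To make precision and recall concrete, here is a small sketch. The counts are hypothetical, chosen only for illustration: `tp` is the number of correct answers given, `fp` the number of wrong answers given, and `fn` the number of correct answers the model failed to provide.

```python
def precision(tp: int, fp: int) -> float:
    """Of the answers the model gave, what fraction were correct?"""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Of all the correct answers available, what fraction did the model give?"""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts, not measured results for any real model.
tp, fp, fn = 80, 10, 20

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.2f} recall={r:.2f}")
```

A model can score well on one metric and poorly on the other, which is why the two are reported together.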
Good: GPT-4 with 90%+ accuracy, high response quality, and acceptable latency (e.g., 1-2 seconds).
Bad: GPT-3.5 with 70% accuracy, lower quality answers, or very slow responses that frustrate users.
Good models balance accuracy and speed to fit your needs.
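One way to apply this balance is to set a minimum accuracy and a maximum latency, then keep only the models that meet both. The sketch below assumes thresholds of 90% accuracy and 2 seconds, taken from the "Good" example above; the `ModelStats` class, the `acceptable` function, and the candidate numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    accuracy: float   # fraction correct on your own test set
    latency_s: float  # typical seconds per response


def acceptable(stats: ModelStats, min_accuracy: float = 0.90, max_latency_s: float = 2.0) -> bool:
    """True when the model is both accurate enough and fast enough."""
    return stats.accuracy >= min_accuracy and stats.latency_s <= max_latency_s


# Hypothetical measurements, for illustration only.
candidates = [
    ModelStats("model-a", accuracy=0.92, latency_s=1.5),
    ModelStats("model-b", accuracy=0.70, latency_s=0.6),  # fast but inaccurate
    ModelStats("model-c", accuracy=0.95, latency_s=6.0),  # accurate but slow
]

chosen = [m.name for m in candidates if acceptable(m)]
print(chosen)
```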
- Ignoring latency: A very accurate model that is too slow may not be practical.
- Overfitting to test data: A model might look great on a small test set but fail in real use.
- Data leakage: If test questions were seen during training, accuracy is misleading.
- Focusing only on accuracy: Quality and relevance of answers matter too.
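The first pitfall is the easiest to check: time the model on a handful of prompts before committing to it. A minimal sketch, assuming the model call is wrapped in a plain Python function; `fake_model` is a stand-in used here so the example runs without an API key.

```python
import time
from statistics import median


def measure_latency(call, prompts):
    """Return the median wall-clock seconds that `call` takes per prompt."""
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        call(prompt)
        timings.append(time.perf_counter() - start)
    return median(timings)


def fake_model(prompt: str) -> str:
    # Stand-in for a real API call; replace with your own wrapper.
    time.sleep(0.01)
    return prompt.upper()


latency = measure_latency(fake_model, ["a", "b", "c"])
print(f"median latency: {latency:.3f}s")
```

The median is used rather than the mean so that one unusually slow request does not dominate the result.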
Your GPT-3.5 model has 98% accuracy on a test set but only 12% recall on rare topics. Is it good for production? Why or why not?
Answer: No, because while overall accuracy is high, the model misses most rare topics (low recall). This means it often fails to answer important but uncommon questions, which can hurt user experience.
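The arithmetic shows how both numbers can hold at once. The counts below are hypothetical, chosen so that they reproduce the 98% and 12% figures in the question: rare topics are such a small share of the test set that missing most of them barely moves overall accuracy.

```python
# Hypothetical test set: 2500 questions, 50 of them on rare topics.
total_questions = 2500
rare_questions = 50
common_questions = total_questions - rare_questions  # 2450

rare_correct = 6        # the model handles only 6 of the 50 rare questions
common_correct = 2444   # and almost all of the common ones

overall_accuracy = (rare_correct + common_correct) / total_questions
rare_recall = rare_correct / rare_questions

print(f"overall accuracy: {overall_accuracy:.0%}")
print(f"recall on rare topics: {rare_recall:.0%}")
```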
Practice
Solution
Step 1: Understand model capabilities
GPT-4 is designed for more complex and detailed tasks than GPT-3.5.
Step 2: Match task complexity to model
For detailed and complex text generation, GPT-4 is the better choice.
Final Answer:
GPT-4 -> Option C
Quick Check:
Complex tasks = GPT-4 [OK]
- Choosing GPT-3.5 for complex tasks
- Thinking both models have same detail level
Solution
Step 1: Recall model naming conventions
The GPT-3.5 model is named "gpt-3.5-turbo" in API calls.
Step 2: Identify correct option
"model": "gpt-3.5-turbo" matches the exact model name for GPT-3.5.
Final Answer:
"model": "gpt-3.5-turbo" -> Option A
Quick Check:
Correct model name = "model": "gpt-3.5-turbo" [OK]
- Using "gpt-3" instead of "gpt-3.5-turbo"
- Confusing GPT-4 name with GPT-3.5
import openai  # legacy interface; requires openai<1.0

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain photosynthesis."}]
)
Solution
Step 1: Identify the model used in code
The code uses "gpt-3.5-turbo" as the model parameter.
Step 2: Recall model speed and detail tradeoff
GPT-3.5 is faster but less detailed compared to GPT-4.
Final Answer:
GPT-3.5, faster but less detailed -> Option A
Quick Check:
Model in code = GPT-3.5 [OK]
- Assuming GPT-3.5 is slower
- Confusing model names in code snippet
import openai  # legacy interface; requires openai<1.0

response = openai.ChatCompletion.create(
    model="gpt-3.5",
    messages=[{"role": "user", "content": "Tell me a joke."}]
)
What is the likely problem?
Solution
Step 1: Check model name correctness
The model name "gpt-3.5" is incomplete; the correct full name is "gpt-3.5-turbo".
Step 2: Understand error cause
Using an incomplete model name causes the API to reject the call.
Final Answer:
Model name is incomplete, should be "gpt-3.5-turbo" -> Option B
Quick Check:
Model name must be exact [OK]
- Using partial model names
- Assuming system role is mandatory
- Ignoring API key errors
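A partial model name can be caught before any request is sent. The sketch below checks the name against a small allow-list; `KNOWN_MODELS` and `check_model_name` are hypothetical helpers, and the list holds only the two names relevant to this lesson, not everything the API accepts.

```python
# Assumption: only the two model names discussed in this lesson.
KNOWN_MODELS = {"gpt-3.5-turbo", "gpt-4"}


def check_model_name(name: str) -> str:
    """Return the name unchanged if it is known, otherwise raise ValueError."""
    if name not in KNOWN_MODELS:
        raise ValueError(f"Unknown model name {name!r}; expected one of {sorted(KNOWN_MODELS)}")
    return name


for candidate in ["gpt-3.5-turbo", "gpt-3.5"]:
    try:
        check_model_name(candidate)
        print(f"{candidate}: ok")
    except ValueError as err:
        print(f"{candidate}: rejected ({err})")
```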
Solution
Step 1: Understand tradeoffs between GPT-3.5 and GPT-4
GPT-3.5 is faster and cheaper but less detailed; GPT-4 is slower and costlier but more detailed.
Step 2: Match chatbot needs to model selection
Use GPT-3.5 for quick, cheap answers and switch to GPT-4 when detailed responses are needed.
Final Answer:
Use GPT-3.5 for quick replies, switch to GPT-4 for detailed ones -> Option D
Quick Check:
Balance speed and detail with model switching [OK]
- Using only one model for all tasks
- Ignoring cost and speed tradeoffs
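The switching strategy from this solution can be sketched as a small routing function. `choose_model` and the `needs_detail` flag are hypothetical; how you decide that a request needs a detailed answer (user setting, topic, message length) is up to your application.

```python
def choose_model(needs_detail: bool) -> str:
    """Pick the cheaper, faster model unless a detailed answer is required."""
    return "gpt-4" if needs_detail else "gpt-3.5-turbo"


# Hypothetical requests paired with a flag set by the caller.
requests = [
    ("What time zone is Tokyo in?", False),
    ("Compare three database indexing strategies in depth.", True),
]

routed = [(text, choose_model(needs_detail)) for text, needs_detail in requests]
for text, model in routed:
    print(f"{model}: {text}")
```

The returned string can then be passed as the `model` argument of the API call.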
