Model selection (GPT-4, GPT-3.5) in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Model selection (GPT-4, GPT-3.5)
Which metric matters for model selection and WHY

When choosing between GPT-4 and GPT-3.5, key metrics include accuracy, response quality, and latency. Accuracy shows how often the model gives correct or useful answers. Response quality measures how clear, relevant, and helpful the answers are. Latency is how fast the model responds. We pick metrics based on what matters most: if you want the best answers, accuracy and quality matter more; if you want speed, latency matters more.
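A minimal sketch of how accuracy and latency could be measured side by side. The `evaluate` helper, the `stub_model` function, and the three-question dataset are hypothetical stand-ins, not part of any real API; in practice `ask` would wrap a call to GPT-4 or GPT-3.5. Exact-match scoring is a simplifying assumption, and response quality is left out because it usually needs a human or rubric-based rating.

```python
import time
from statistics import mean
from typing import Callable


def evaluate(ask: Callable[[str], str], dataset: list[tuple[str, str]]) -> dict[str, float]:
    """Return accuracy and mean latency (seconds) of `ask` over (question, expected) pairs."""
    correct = 0
    latencies = []
    for question, expected in dataset:
        start = time.perf_counter()
        answer = ask(question)
        latencies.append(time.perf_counter() - start)
        # Assumption: an answer counts as correct on a case-insensitive exact match.
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return {"accuracy": correct / len(dataset), "mean_latency_s": mean(latencies)}


def stub_model(question: str) -> str:
    """Hypothetical stand-in for a real model call."""
    canned = {"2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "unknown")


dataset = [("2+2?", "4"), ("Capital of France?", "paris"), ("Largest planet?", "Jupiter")]
report = evaluate(stub_model, dataset)
print(report)
```

Running the same `dataset` through one wrapper per model gives two reports that can be compared on whichever metric matters most for the use case.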

Confusion matrix or equivalent visualization
Example: Comparing GPT-4 and GPT-3.5 on 100 questions

| Model   | Correct | Incorrect | Total |
|---------|---------|-----------|-------|
| GPT-4   | 90      | 10        | 100   |
| GPT-3.5 | 75      | 25        | 100   |

Here, GPT-4 answers 90 of 100 questions correctly (90% accuracy), while GPT-3.5 answers 75 correctly (75% accuracy).
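The arithmetic behind this comparison can be sketched as follows. The counts come from the table above; the `results` dictionary and the `accuracy` helper are illustrative names only.

```python
# Counts taken from the comparison table above.
results = {
    "GPT-4": {"correct": 90, "incorrect": 10},
    "GPT-3.5": {"correct": 75, "incorrect": 25},
}


def accuracy(counts: dict[str, int]) -> float:
    """Accuracy = correct answers / all answers."""
    return counts["correct"] / (counts["correct"] + counts["incorrect"])


accuracies = {model: accuracy(counts) for model, counts in results.items()}
# Gap in percentage points, rounded to hide floating-point noise.
gap_points = round((accuracies["GPT-4"] - accuracies["GPT-3.5"]) * 100)

for model, value in accuracies.items():
    print(f"{model}: {value:.0%}")
print(f"Gap: {gap_points} percentage points")
```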
Precision vs Recall tradeoff with concrete examples

For language models, think of precision as how often the model's answers are correct when it gives an answer, and recall as how many of all possible correct answers the model actually provides.

GPT-4 tends to have higher precision and recall, giving more correct and complete answers. GPT-3.5 might be faster but less precise, sometimes giving wrong or incomplete answers.

Example: If your chatbot must never give wrong information, prioritize precision and choose GPT-4. If you need quick answers and can tolerate occasional mistakes, the speed of GPT-3.5 may matter more than its lower precision.
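A small sketch of these two definitions, assuming every question has a correct answer and that a model may decline to answer. The `precision_recall` helper and the two outcome lists are made up for illustration; they do not describe measured GPT-4 or GPT-3.5 behavior.

```python
from typing import Optional


def precision_recall(outcomes: list[Optional[bool]]) -> tuple[float, float]:
    """Each outcome: True = answered correctly, False = answered wrongly, None = declined."""
    answered = [o for o in outcomes if o is not None]
    correct = sum(1 for o in answered if o)
    # Precision: share of given answers that are correct.
    precision = correct / len(answered) if answered else 0.0
    # Recall: share of all questions that received a correct answer.
    recall = correct / len(outcomes) if outcomes else 0.0
    return precision, recall


# Hypothetical outcomes for two assistants on the same 10 questions.
cautious = [True] * 6 + [None] * 4   # answers 6 questions, all correctly
eager = [True] * 7 + [False] * 3     # answers all 10, gets 3 wrong

print("cautious:", precision_recall(cautious))
print("eager:", precision_recall(eager))
```

The cautious assistant never gives a wrong answer but leaves questions unanswered; the eager one covers more questions at the cost of some wrong answers.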

What "good" vs "bad" metric values look like for this use case

Good: 90%+ accuracy, high response quality, and acceptable latency (e.g., 1-2 seconds). GPT-4 meets the accuracy bar in the example above.

Bad: accuracy around 70%, lower-quality answers, or responses slow enough to frustrate users, whichever model produces them.

Good models balance accuracy and speed to fit your needs.
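One way to turn "balance accuracy and speed" into a rule is to pick the most accurate model that still fits a latency budget. This is only a sketch: the `Candidate` class and `pick_model` function are hypothetical, the accuracy values come from the 100-question example above, and the latency values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    accuracy: float    # fraction of correct answers on the test set
    latency_s: float   # typical response time in seconds (invented here)


def pick_model(candidates: list[Candidate], min_accuracy: float,
               max_latency_s: float) -> Optional[Candidate]:
    """Return the most accurate candidate that meets both thresholds, or None."""
    eligible = [c for c in candidates
                if c.accuracy >= min_accuracy and c.latency_s <= max_latency_s]
    return max(eligible, key=lambda c: c.accuracy, default=None)


candidates = [
    Candidate("gpt-4", accuracy=0.90, latency_s=2.0),
    Candidate("gpt-3.5-turbo", accuracy=0.75, latency_s=0.8),
]

quality_first = pick_model(candidates, min_accuracy=0.85, max_latency_s=2.5)
speed_first = pick_model(candidates, min_accuracy=0.70, max_latency_s=1.0)
print(quality_first.name, speed_first.name)
```

If no candidate meets both thresholds, `pick_model` returns `None`, which signals that the requirements themselves need to be relaxed.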

Common pitfalls in model selection metrics
  • Ignoring latency: A very accurate model that is too slow may not be practical.
  • Overfitting to test data: A model might look great on a small test set but fail in real use.
  • Data leakage: If test questions were seen during training, accuracy is misleading.
  • Focusing only on accuracy: Quality and relevance of answers matter too.
Self-check question

Your GPT-3.5 model has 98% accuracy on a test set but only 12% recall on rare topics. Is it good for production? Why or why not?

Answer: No, because while overall accuracy is high, the model misses most rare topics (low recall). This means it often fails to answer important but uncommon questions, which can hurt user experience.
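The numbers in this self-check can coexist when rare topics are a small share of the test set. The counts below are hypothetical, chosen only so that the two percentages from the question come out exactly.

```python
# Hypothetical test set: 2000 questions, 25 of them on rare topics.
rare_correct, rare_missed = 3, 22          # rare-topic questions
common_correct, common_wrong = 1957, 18    # all other questions

total = rare_correct + rare_missed + common_correct + common_wrong
overall_accuracy = (rare_correct + common_correct) / total
rare_recall = rare_correct / (rare_correct + rare_missed)

print(f"Overall accuracy: {overall_accuracy:.0%}")
print(f"Recall on rare topics: {rare_recall:.0%}")
```

Because the 25 rare questions are barely more than 1% of the set, missing 22 of them hardly moves overall accuracy, which is why the rare-topic slice has to be checked separately.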

Key Result
Model selection balances accuracy, response quality, and speed to fit user needs; GPT-4 generally offers higher accuracy and quality, GPT-3.5 offers faster but less precise responses.

Practice

1. Which model should you choose if you need detailed and complex text generation?
easy
A. GPT-3.5
B. Both are equally detailed
C. GPT-4
D. Neither, use a smaller model

Solution

  1. Step 1: Understand model capabilities

    GPT-4 is designed for more complex and detailed tasks compared to GPT-3.5.
  2. Step 2: Match task complexity to model

    For detailed and complex text generation, GPT-4 is the better choice.
  3. Final Answer:

    GPT-4 -> Option C
  4. Quick Check:

    Complex tasks = GPT-4 [OK]
Hint: Choose GPT-4 for complexity, GPT-3.5 for speed [OK]
Common Mistakes:
  • Choosing GPT-3.5 for complex tasks
  • Thinking both models have same detail level
2. Which of the following is the correct way to specify GPT-3.5 in an API call?
easy
A. "model": "gpt-3.5-turbo"
B. "model": "gpt-3"
C. "model": "gpt-4"
D. "model": "gpt-5"

Solution

  1. Step 1: Recall model naming conventions

    The GPT-3.5 model is named "gpt-3.5-turbo" in API calls.
  2. Step 2: Identify correct option

    "model": "gpt-3.5-turbo" matches the exact model name for GPT-3.5.
  3. Final Answer:

    "model": "gpt-3.5-turbo" -> Option A
  4. Quick Check:

    Correct model name = "gpt-3.5-turbo" [OK]
Hint: Use exact model name string in API call [OK]
Common Mistakes:
  • Using "gpt-3" instead of "gpt-3.5-turbo"
  • Confusing GPT-4 name with GPT-3.5
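For context, the `"model"` field sits inside the JSON body of a chat request, next to the `"messages"` list. The sketch below only builds and prints such a body; it sends nothing, and the variable names are illustrative.

```python
import json

# Request body with the exact model name string for GPT-3.5.
payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Explain photosynthesis."}],
}

body = json.dumps(payload, indent=2)
print(body)
```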
3. This code snippet calls the OpenAI API. Which model does it use, and how does that model trade off speed against detail?
import openai  # legacy interface: ChatCompletion exists only in openai < 1.0

response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",  # this string decides which model answers
  messages=[{"role": "user", "content": "Explain photosynthesis."}]
)
medium
A. GPT-3.5, faster but less detailed
B. GPT-4, slower but more detailed
C. GPT-4, faster and more detailed
D. GPT-3.5, slower but more detailed

Solution

  1. Step 1: Identify the model used in code

    The code uses "gpt-3.5-turbo" as the model parameter.
  2. Step 2: Recall model speed and detail tradeoff

    GPT-3.5 is faster but less detailed compared to GPT-4.
  3. Final Answer:

    GPT-3.5, faster but less detailed -> Option A
  4. Quick Check:

    Model in code = GPT-3.5 [OK]
Hint: Check model name string to know speed/detail tradeoff [OK]
Common Mistakes:
  • Assuming GPT-3.5 is slower
  • Confusing model names in code snippet
4. You wrote this API call but get an error:
import openai  # legacy interface: ChatCompletion exists only in openai < 1.0

response = openai.ChatCompletion.create(
  model="gpt-3.5",
  messages=[{"role": "user", "content": "Tell me a joke."}]
)
What is the likely problem?
medium
A. Messages list is missing a system role
B. Model name is incomplete, should be "gpt-3.5-turbo"
C. API key is missing
D. The model "gpt-3.5" does not exist

Solution

  1. Step 1: Check model name correctness

    The model name "gpt-3.5" is incomplete; the correct full name is "gpt-3.5-turbo".
  2. Step 2: Understand error cause

    The API does not recognize "gpt-3.5" as a model name, so it rejects the call. Option D describes this symptom, but only Option B also names the fix.
  3. Final Answer:

    Model name is incomplete, should be "gpt-3.5-turbo" -> Option B
  4. Quick Check:

    Model name must be exact [OK]
Hint: Use full model name string to avoid errors [OK]
Common Mistakes:
  • Using partial model names
  • Assuming system role is mandatory
  • Ignoring API key errors
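A partial model name can be caught before any request is sent. The sketch below is one possible guard, not part of the OpenAI library: `KNOWN_MODELS` is an assumed allowlist containing only the two names used in this lesson, and `check_model_name` is a hypothetical helper.

```python
# Assumption: the application only ever uses these two model names.
KNOWN_MODELS = {"gpt-3.5-turbo", "gpt-4"}


def check_model_name(name: str) -> str:
    """Return `name` unchanged if it is known, otherwise raise with a suggestion."""
    if name not in KNOWN_MODELS:
        # Suggest known names that start with the partial name, if any.
        suggestions = sorted(m for m in KNOWN_MODELS if m.startswith(name))
        hint = f" Did you mean {suggestions[0]!r}?" if suggestions else ""
        raise ValueError(f"Unknown model {name!r}.{hint}")
    return name


try:
    check_model_name("gpt-3.5")
except ValueError as err:
    print(err)
```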
5. You want to build a chatbot that answers customer questions quickly and cheaply but can switch to detailed answers when needed. How should you select models in your code?
hard
A. Always use GPT-4 for all answers
B. Use GPT-4 only, it is always more accurate
C. Use GPT-3.5 only, it is always faster and cheaper
D. Use GPT-3.5 for quick replies, switch to GPT-4 for detailed ones

Solution

  1. Step 1: Understand tradeoffs between GPT-3.5 and GPT-4

    GPT-3.5 is faster and cheaper but less detailed; GPT-4 is slower and costlier but more detailed.
  2. Step 2: Match chatbot needs to model selection

    Use GPT-3.5 for quick, cheap answers and switch to GPT-4 when detailed responses are needed.
  3. Final Answer:

    Use GPT-3.5 for quick replies, switch to GPT-4 for detailed ones -> Option D
  4. Quick Check:

    Balance speed and detail with model switching [OK]
Hint: Switch models based on answer detail needed [OK]
Common Mistakes:
  • Using only one model for all tasks
  • Ignoring cost and speed tradeoffs
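The switching strategy from this question could be sketched as a small routing function. The `choose_model` helper, the `wants_detail` flag, and the word-count threshold are assumptions made for illustration; a real application would pick its own signal for when detail is needed.

```python
FAST_MODEL = "gpt-3.5-turbo"   # quick, cheap replies
DETAILED_MODEL = "gpt-4"       # slower, costlier, more detailed


def choose_model(question: str, wants_detail: bool = False,
                 max_quick_words: int = 30) -> str:
    """Pick the detailed model when flagged or when the question is long.

    The length heuristic is an assumption, not a recommended rule.
    """
    if wants_detail or len(question.split()) > max_quick_words:
        return DETAILED_MODEL
    return FAST_MODEL


print(choose_model("Where is my order?"))
print(choose_model("Where is my order?", wants_detail=True))
```

The returned string can be passed as the `model` argument of the API call, so the rest of the request code stays the same for both models.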