Practice

1. Which automated evaluation metric is commonly used to measure the accuracy of classification models?

easy

A. Perplexity

B. Mean Squared Error

C. BLEU Score

D. Accuracy

Solution

Step 1: Understand classification metrics
Classification models predict discrete categories, so Accuracy, the number of correct predictions divided by the total number of predictions, is the natural metric.
Step 2: Match metric to task
Mean Squared Error is a regression metric, while BLEU and Perplexity evaluate language models and generated text, so Accuracy is the option that fits classification.
Final Answer:
Accuracy -> Option D
Quick Check:
Classification accuracy = Accuracy [OK]
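
The definition in Step 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name accuracy and the spam/not-spam labels are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for a spam (1) vs. not-spam (0) classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
print(acc)  # 6 of the 8 predictions match, so this prints 0.75
```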

Hint: Accuracy measures correct predictions in classification [OK]

Common Mistakes:

Confusing regression metrics with classification
Using BLEU for classification tasks
Mixing Perplexity with accuracy

2. Which of the following is the correct Python syntax to calculate accuracy using scikit-learn?

easy

A. accuracy = accuracy(y_true, y_pred)

B. accuracy = score_accuracy(y_true, y_pred)

C. accuracy = accuracy_score(y_true, y_pred)

D. accuracy = calc_accuracy(y_true, y_pred)

Solution

Step 1: Recall scikit-learn function name
The correct function to compute accuracy is accuracy_score from sklearn.metrics.
Step 2: Check function call syntax
It requires two arguments: true labels and predicted labels, called as accuracy_score(y_true, y_pred).
Final Answer:
accuracy = accuracy_score(y_true, y_pred) -> Option C
Quick Check:
scikit-learn accuracy function = accuracy_score [OK]
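
As a short usage sketch, assuming scikit-learn is installed, the call from Option C looks like this in context. The label lists are illustrative; normalize=False is an optional parameter of accuracy_score that returns the number of correct predictions instead of the fraction.

```python
from sklearn.metrics import accuracy_score

# Illustrative labels: 4 of the 5 predictions are correct
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# Fraction of correct predictions (the default behavior)
fraction_correct = accuracy_score(y_true, y_pred)

# Number of correct predictions instead of the fraction
count_correct = accuracy_score(y_true, y_pred, normalize=False)

print(fraction_correct, count_correct)
```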

Hint: Use accuracy_score from sklearn.metrics for accuracy [OK]

Common Mistakes:

Using incorrect function names
Missing import of accuracy_score
Swapping argument order

3. Given the following code snippet, what will be the printed F1 score?

from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
f1 = f1_score(y_true, y_pred)
print(round(f1, 2))

medium

A. 0.80

B. 0.75

C. 0.67

D. 0.60

Solution

Step 1: Calculate precision and recall
True positives (TP) = 2 (positions 0 and 3), false positives (FP) = 0, false negatives (FN) = 1 (position 2).
Step 2: Compute F1 score
Precision = TP / (TP + FP) = 2/2 = 1.0; Recall = TP / (TP + FN) = 2/3 ≈ 0.67; F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (1 * 2/3) / (1 + 2/3) = (4/3) / (5/3) = 0.8. Keeping 2/3 as a fraction until the last step shows that the result is exactly 0.8.
Step 3: Verify scikit-learn default behavior
By default, f1_score uses average='binary' with pos_label=1, so it scores the positive class exactly as in the calculation above.
Step 4: Check rounding
round(f1, 2) leaves 0.8 unchanged, and Python prints it as 0.8 without the trailing zero, which corresponds to the option 0.80.
Final Answer:
0.80 -> Option A
Quick Check:
F1 score = 0.80 [OK]
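
The hand calculation above can be checked with a small plain-Python sketch that counts TP, FP and FN directly. It assumes binary labels with 1 as the positive class and at least one actual and one predicted positive, so no division by zero occurs.

```python
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, actually 1
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, actually 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, actually 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Algebraically the same as 2 * precision * recall / (precision + recall),
# but written with counts so nothing is rounded along the way
f1 = 2 * tp / (2 * tp + fp + fn)

print(tp, fp, fn)
print(round(f1, 2))  # prints 0.8
```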

Hint: F1 balances precision and recall; calculate both first [OK]

Common Mistakes:

Confusing precision with recall
Rounding too early
Ignoring default average parameter

4. A teammate reports that this code raises an error:

from sklearn.metrics import precision_score

true = [1, 0, 1]
pred = [1, 1, 0]
score = precision_score(true, pred)
print(score)

What is the most likely explanation?

medium

A. No error; code runs fine

B. Mismatch in label types causing undefined precision

C. Incorrect variable names used in function call

D. Missing import of precision_score

Solution

Step 1: Check imports and variables
precision_score is imported correctly, and the variables true and pred are defined before use and passed in the right order.
Step 2: Understand precision_score behavior
Precision is undefined when the model predicts no positives (TP + FP = 0). Even in that case, scikit-learn by default emits an UndefinedMetricWarning and returns 0.0 rather than raising an error.
Step 3: Analyze given data
pred contains two positives (positions 0 and 1) and true has positives at positions 0 and 2, so TP = 1 and FP = 1. Precision = 1 / (1 + 1) = 0.5, so the code prints 0.5 without error.
Step 4: Consider label types
precision_score raises a ValueError when the labels are multiclass and average is left at its default of 'binary'. Here both lists contain only 0 and 1, so no error is expected.
Final Answer:
No error; code runs fine -> Option A
Quick Check:
Code runs fine with correct inputs [OK]
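
A brief sketch, assuming scikit-learn is installed, confirms the result and shows the edge case from Step 2. The list no_positive_preds is made up for illustration; zero_division is an existing parameter of precision_score that sets the value returned when precision is undefined.

```python
from sklearn.metrics import precision_score

true = [1, 0, 1]
pred = [1, 1, 0]

# TP = 1 (position 0), FP = 1 (position 1), so precision = 1 / 2
score = precision_score(true, pred)

# Edge case: no predicted positives, so TP + FP = 0.
# zero_division=0 returns 0.0 without emitting a warning.
no_positive_preds = [0, 0, 0]
edge_score = precision_score(true, no_positive_preds, zero_division=0)

print(score, edge_score)
```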

Hint: Check label types and predicted positives for precision errors [OK]

Common Mistakes:

Assuming import errors without checking
Confusing variable names
Ignoring label format requirements

5. You want to evaluate a language generation model. Which automated metric should you choose to measure how well the model's output matches human references?

hard

A. Mean Absolute Error

B. BLEU Score

C. Accuracy

D. Silhouette Score

Solution

Step 1: Identify task type
Language generation models produce text outputs, so evaluation needs to compare generated text to reference text.
Step 2: Match metric to task
BLEU Score measures n-gram overlap between the generated text and one or more reference texts, and it is widely used to evaluate language generation, especially machine translation.
Step 3: Exclude unrelated metrics
Mean Absolute Error is for regression, Accuracy for classification, Silhouette Score for clustering, so they don't fit language generation.
Final Answer:
BLEU Score -> Option B
Quick Check:
Language generation evaluation = BLEU Score [OK]
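
To make the n-gram overlap idea concrete, here is a simplified sketch of the clipped (modified) n-gram precision at the core of BLEU. It is not a full BLEU implementation: it uses a single reference and omits the brevity penalty and the geometric mean over n-gram orders. The helper names and example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with each
    n-gram's count capped at its count in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    overlap = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return overlap / total

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

p1 = clipped_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(candidate, reference, 2)  # 3 of 5 bigrams match
print(round(p1, 2), round(p2, 2))
```

The bigram score is lower than the unigram score because a single wrong word breaks two neighboring bigrams, which is how BLEU rewards correct word order as well as correct word choice.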

Hint: Use BLEU for comparing generated text to references [OK]

Common Mistakes:

Using regression or classification metrics for text
Confusing clustering metrics with language metrics
Ignoring task-specific metric choice

Why Automated evaluation metrics in Prompt Engineering / GenAI? - Purpose & Use Cases
