Bird
Raised Fist0
Prompt Engineering / GenAIml~20 mins

Automated evaluation metrics in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Challenge - 5 Problems
🎖️
Automated Metrics Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual
intermediate
2:00remaining
Understanding BLEU score purpose

What is the main purpose of the BLEU score in evaluating machine learning models?

ATo measure the similarity between generated text and reference text by comparing overlapping n-grams
BTo calculate the accuracy of classification models by counting correct predictions
CTo evaluate the speed of model training in seconds
DTo measure the memory usage of a model during inference
Attempts:
2 left
💡 Hint

Think about how text generation quality is measured by comparing outputs.

Predict Output
intermediate
2:00remaining
Output of accuracy calculation code

What is the output of the following Python code that calculates accuracy?

Prompt Engineering / GenAI
true_labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 0]
correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)
print(round(accuracy, 2))
A0.80
B0.60
C0.75
D0.50
Attempts:
2 left
💡 Hint

Count how many predictions match the true labels, then divide by total.

Model Choice
advanced
2:00remaining
Best metric for imbalanced classification

You have a dataset where one class is very rare. Which metric is best to evaluate your model's performance?

AAccuracy
BBLEU score
CMean Squared Error
DPrecision and Recall
Attempts:
2 left
💡 Hint

Think about metrics that focus on positive class detection in imbalanced data.

Hyperparameter
advanced
2:00remaining
Effect of changing threshold on F1 score

In a binary classifier, what happens to the F1 score if you increase the decision threshold too high?

AF1 score increases because precision and recall both improve
BF1 score becomes undefined
CF1 score decreases because recall drops while precision may increase
DF1 score stays the same regardless of threshold
Attempts:
2 left
💡 Hint

Think about how raising threshold affects false negatives and false positives.

🔧 Debug
expert
2:00remaining
Identify error in metric calculation code

What error does this code raise when calculating the mean squared error (MSE)?

Prompt Engineering / GenAI
import numpy as np
true = np.array([1, 2, 3])
pred = np.array([1, 2])
mse = np.mean((true - pred) ** 2)
print(mse)
ATypeError because numpy arrays cannot be subtracted
BValueError due to shape mismatch in subtraction
CZeroDivisionError when calculating mean
DNo error, outputs 0.5
Attempts:
2 left
💡 Hint

Check if the arrays have the same length before subtracting.

Practice

(1/5)
1. Which automated evaluation metric is commonly used to measure the accuracy of classification models?
easy
A. Perplexity
B. Mean Squared Error
C. BLEU Score
D. Accuracy

Solution

  1. Step 1: Understand classification metrics

    Classification models predict categories, so metrics like Accuracy measure correct predictions over total predictions.
  2. Step 2: Match metric to task

    Mean Squared Error is for regression, BLEU and Perplexity are for language tasks, so Accuracy fits classification best.
  3. Final Answer:

    Accuracy -> Option D
  4. Quick Check:

    Classification accuracy = Accuracy [OK]
Hint: Accuracy measures correct predictions in classification [OK]
Common Mistakes:
  • Confusing regression metrics with classification
  • Using BLEU for classification tasks
  • Mixing Perplexity with accuracy
2. Which of the following is the correct Python syntax to calculate accuracy using scikit-learn?
easy
A. accuracy = accuracy(y_true, y_pred)
B. accuracy = score_accuracy(y_true, y_pred)
C. accuracy = accuracy_score(y_true, y_pred)
D. accuracy = calc_accuracy(y_true, y_pred)

Solution

  1. Step 1: Recall scikit-learn function name

    The correct function to compute accuracy is accuracy_score from sklearn.metrics.
  2. Step 2: Check function call syntax

    It requires two arguments: true labels and predicted labels, called as accuracy_score(y_true, y_pred).
  3. Final Answer:

    accuracy = accuracy_score(y_true, y_pred) -> Option C
  4. Quick Check:

    scikit-learn accuracy function = accuracy_score [OK]
Hint: Use accuracy_score from sklearn.metrics for accuracy [OK]
Common Mistakes:
  • Using incorrect function names
  • Missing import of accuracy_score
  • Swapping argument order
3. Given the following code snippet, what will be the printed F1 score?
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
f1 = f1_score(y_true, y_pred)
print(round(f1, 2))
medium
A. 0.80
B. 0.75
C. 0.67
D. 0.60

Solution

  1. Step 1: Calculate precision and recall

    True positives (TP) = 2 (positions 0 and 3), False positives (FP) = 0, False negatives (FN) = 1 (position 2).
  2. Step 2: Compute F1 score

    Precision = TP / (TP + FP) = 2/2 = 1.0; Recall = TP / (TP + FN) = 2/3 ≈ 0.67; F1 = 2 * (Precision * Recall) / (Precision + Recall) ≈ 2*(1*0.67)/(1+0.67) ≈ 0.80.
  3. Step 3: Verify scikit-learn default behavior

    By default, f1_score uses 'binary' average, so calculation matches above.
  4. Step 4: Check rounding

    Rounded to two decimals, the printed value is 0.80, but the actual f1_score value is approximately 0.80.
  5. Final Answer:

    0.80 -> Option A
  6. Quick Check:

    F1 score = 0.80 [OK]
Hint: F1 balances precision and recall; calculate both first [OK]
Common Mistakes:
  • Confusing precision with recall
  • Rounding too early
  • Ignoring default average parameter
4. You run this code but get an error:
from sklearn.metrics import precision_score

true = [1, 0, 1]
pred = [1, 1, 0]
score = precision_score(true, pred)
print(score)
What is the likely cause of the error?
medium
A. No error; code runs fine
B. Mismatch in label types causing undefined precision
C. Incorrect variable names used in function call
D. Missing import of precision_score

Solution

  1. Step 1: Check imports and variables

    precision_score is imported correctly and variables true, pred are defined properly.
  2. Step 2: Understand precision_score behavior

    Precision is undefined if there are no predicted positives for the positive class, which can cause warnings or errors.
  3. Step 3: Analyze given data

    pred has one positive (1), true has positives at positions 0 and 2; so precision can be computed without error.
  4. Step 4: Consider label types

    If labels are not binary or have unexpected types, precision_score may error; here labels are fine, so no error expected.
  5. Final Answer:

    No error; code runs fine -> Option A
  6. Quick Check:

    Code runs fine with correct inputs [OK]
Hint: Check label types and predicted positives for precision errors [OK]
Common Mistakes:
  • Assuming import errors without checking
  • Confusing variable names
  • Ignoring label format requirements
5. You want to evaluate a language generation model. Which automated metric should you choose to measure how well the model's output matches human references?
hard
A. Mean Absolute Error
B. BLEU Score
C. Accuracy
D. Silhouette Score

Solution

  1. Step 1: Identify task type

    Language generation models produce text outputs, so evaluation needs to compare generated text to reference text.
  2. Step 2: Match metric to task

    BLEU Score measures overlap of n-grams between generated and reference text, widely used for language generation evaluation.
  3. Step 3: Exclude unrelated metrics

    Mean Absolute Error is for regression, Accuracy for classification, Silhouette Score for clustering, so they don't fit language generation.
  4. Final Answer:

    BLEU Score -> Option B
  5. Quick Check:

    Language generation evaluation = BLEU Score [OK]
Hint: Use BLEU for comparing generated text to references [OK]
Common Mistakes:
  • Using regression or classification metrics for text
  • Confusing clustering metrics with language metrics
  • Ignoring task-specific metric choice