
Human evaluation frameworks in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding Human Evaluation Metrics

Which of the following best describes the purpose of human evaluation in AI model assessment?

A. To speed up model training by using human-labeled data only
B. To measure how well a model performs on automated benchmarks without human input
C. To replace all automated metrics with human feedback exclusively
D. To assess the quality of AI outputs based on human judgment and preferences
💡 Hint

Think about why humans are involved in evaluating AI outputs.

Metrics
intermediate
Interpreting Human Evaluation Scores

In a human evaluation where raters score AI-generated text from 1 (poor) to 5 (excellent), what does an average score of 4.2 indicate?

A. The AI outputs are generally rated as high quality by humans
B. The AI outputs are mostly rated as poor by humans
C. The AI outputs have a wide range of scores with no clear trend
D. The AI outputs are rated exactly the same by all raters
💡 Hint

Consider what a score closer to 5 means in a 1 to 5 scale.
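
An average summarizes the overall level of the ratings but says nothing about how much raters agree. The sketch below uses two made-up rating lists (hypothetical data, not taken from any study) that share the same mean of 4.2 yet differ sharply in spread, which is why a measure of spread is often reported next to the mean.

```python
from statistics import mean, stdev

# Hypothetical ratings on a 1-5 scale; both lists are illustrative only.
consistent = [4, 4, 4, 4, 5]   # raters largely agree
polarized = [5, 5, 5, 5, 1]    # one rater strongly disagrees

for name, scores in (("consistent", consistent), ("polarized", polarized)):
    # The mean shows the overall level; the standard deviation shows disagreement.
    print(f"{name}: mean={mean(scores):.2f}, stdev={stdev(scores):.2f}")
```

Both lists average 4.2, so the mean alone supports "generally rated as high quality" but cannot tell you whether all raters gave the same score.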

Predict Output
advanced
Output of Human Evaluation Aggregation Code

What is the output of this Python code that aggregates human ratings?

ratings = {'rater1': [4, 5, 3], 'rater2': [5, 4, 4], 'rater3': [3, 4, 5]}
average_scores = [sum(scores)/len(scores) for scores in zip(*ratings.values())]
print(average_scores)
A. [4.0, 4.333333333333333, 4.0]
B. [4.0, 4.0, 4.0]
C. [3.5, 4.5, 4.0]
D. [5.0, 4.0, 3.0]
💡 Hint

Calculate the average for each position across raters.

Model Choice
advanced
Choosing a Human Evaluation Framework for Dialogue Systems

You want to evaluate a chatbot's responses for naturalness and relevance. Which human evaluation framework is most suitable?

A. Cross-validation on training data splits
B. Pairwise comparison where raters choose the better response between two options
C. Automated BLEU score calculation without human input
D. Confusion matrix analysis of chatbot intents
💡 Hint

Think about how humans can compare two chatbot responses directly.
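
One simple way to summarize pairwise judgments is the share of votes each option receives. The following is a minimal sketch under assumed conventions: the labels "A", "B", and "tie" and the function name `win_rates` are hypothetical, and the votes are invented for illustration.

```python
from collections import Counter

def win_rates(votes):
    """Return the share of votes for each label in a pairwise comparison."""
    if not votes:
        raise ValueError("At least one vote is required")
    counts = Counter(votes)
    # Counter returns 0 for labels that never appear, so every key is present.
    return {label: counts[label] / len(votes) for label in ("A", "B", "tie")}

# Hypothetical votes from seven raters comparing chatbot A and chatbot B.
votes = ["A", "B", "A", "A", "tie", "B", "A"]
rates = win_rates(votes)
print(rates)
```

With few raters the shares are noisy, so a real study would usually collect more votes before drawing conclusions.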

🔧 Debug
expert
Debugging Human Evaluation Data Collection Code

Which error can this code raise while collecting human ratings?

def collect_ratings(responses):
    ratings = {}
    for i, response in enumerate(responses):
        rating = int(input(f"Rate response {i+1} (1-5): "))
        if rating < 1 or rating > 5:
            raise ValueError("Rating must be between 1 and 5")
        ratings[i] = rating
    return ratings

ratings = collect_ratings(['Hi', 'Hello', 'Hey'])
print(ratings)
A. IndexError due to wrong loop indexing
B. TypeError because input() returns a string
C. ValueError if user inputs a number outside 1-5
D. No error, code runs correctly
💡 Hint

Consider what happens if the user inputs 0 or 6.
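
A rating tool usually should not crash on one bad entry. The sketch below is one possible variant, not the quiz's reference solution: it re-prompts instead of raising. The helper `parse_rating` and the `read` parameter are hypothetical additions, and `read` defaults to `input` so the function can be exercised with scripted answers. Note that `int()` also raises ValueError for non-numeric text such as "abc", which the same handler catches.

```python
def parse_rating(text, low=1, high=5):
    """Convert text to an int rating, raising ValueError if it is invalid."""
    rating = int(text.strip())  # ValueError for non-numeric text
    if not low <= rating <= high:
        raise ValueError(f"Rating must be between {low} and {high}")
    return rating

def collect_ratings(responses, read=input):
    """Ask for one rating per response, re-prompting until each is valid."""
    ratings = {}
    for i, _ in enumerate(responses):
        while True:
            try:
                ratings[i] = parse_rating(read(f"Rate response {i+1} (1-5): "))
                break
            except ValueError as err:
                print(f"Invalid rating: {err}")
    return ratings

# Scripted answers stand in for keyboard input: "0" and "abc" are rejected.
scripted = iter(["0", "abc", "4", "5", "3"])
print(collect_ratings(["Hi", "Hello", "Hey"], read=lambda prompt: next(scripted)))
```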

Practice

1. What is the main purpose of human evaluation frameworks in AI?
easy
A. To have people judge AI outputs for quality
B. To replace all automatic scoring methods
C. To train AI models faster
D. To collect data without human input

Solution

  1. Step 1: Understand the role of human evaluation

    Human evaluation frameworks involve people assessing AI outputs to check quality.
  2. Step 2: Compare with other options

    Options B, C, and D do not describe the main purpose: human evaluation complements automatic methods rather than replacing them, it is not a way to train models faster, and by definition it requires human input.
  3. Final Answer:

    To have people judge AI outputs for quality -> Option A
  4. Quick Check:

    Human evaluation = people judge AI outputs [OK]
Hint: Human evaluation means people check AI output quality [OK]
Common Mistakes:
  • Thinking human evaluation replaces automatic scores
  • Confusing evaluation with training
  • Assuming no human input is involved
2. Which of the following is a common method used in human evaluation frameworks?
easy
A. Simple rating scales
B. Automatic precision scoring
C. Gradient descent optimization
D. Data augmentation

Solution

  1. Step 1: Identify common human evaluation methods

    Simple rating scales are widely used for humans to rate AI outputs.
  2. Step 2: Eliminate unrelated options

    Automatic precision scoring, gradient descent, and data augmentation are technical methods not involving human judgment.
  3. Final Answer:

    Simple rating scales -> Option A
  4. Quick Check:

    Human evaluation uses rating scales [OK]
Hint: Look for methods involving human ratings or comparisons [OK]
Common Mistakes:
  • Choosing automatic or technical AI training methods
  • Confusing human evaluation with model training
  • Ignoring the human aspect in options
3. Consider a human evaluation where 3 raters score AI responses on a scale from 1 to 5. The scores for one response are [4, 5, 3]. What is the average score?
medium
A. 3
B. 5
C. 4
D. 12

Solution

  1. Step 1: Sum the scores given by raters

    4 + 5 + 3 = 12
  2. Step 2: Calculate the average score

    Average = Total sum / Number of raters = 12 / 3 = 4
  3. Final Answer:

    4 -> Option C
  4. Quick Check:

    (4+5+3)/3 = 4 [OK]
Hint: Add scores then divide by number of raters [OK]
Common Mistakes:
  • Adding but forgetting to divide by number of raters
  • Choosing the sum instead of average
  • Mixing up the scale values
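
The calculation above can be sketched as a small helper. The function name `average_rating` and its scale check are assumptions added for illustration; the check guards against scores that fall outside the 1 to 5 scale.

```python
def average_rating(scores, low=1, high=5):
    """Return the mean of the scores after checking they fit the scale."""
    if not scores:
        raise ValueError("No scores to average")
    for score in scores:
        if not low <= score <= high:
            raise ValueError(f"Score {score} is outside {low}-{high}")
    return sum(scores) / len(scores)

scores = [4, 5, 3]
print(sum(scores))             # the total, which is not the answer
print(average_rating(scores))  # the total divided by the number of raters
```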
4. A human evaluation study uses a comparison method where raters choose the better of two AI outputs. The code below has an error. What is the error?
def compare_outputs(output1, output2, rater_choice):
    if rater_choice == 'output1':
        return output1
    elif rater_choice == 'output2':
        return output2

result = compare_outputs('Answer A', 'Answer B', 'output3')
print(result)
medium
A. The comparison method cannot use strings as choices
B. The function should return both outputs instead of one
C. The print statement is missing parentheses
D. The function does not handle invalid rater choices properly

Solution

  1. Step 1: Trace the code execution for invalid input

    For rater_choice='output3', neither condition matches, so no explicit return; function implicitly returns None.
  2. Step 2: Identify the error

    Silently returning None hides the invalid choice from the caller. The function should handle unexpected input explicitly, for example by raising an exception or returning an error message.
  3. Final Answer:

    The function does not handle invalid rater choices properly -> Option D
  4. Quick Check:

    Invalid input -> returns None [OK]
Hint: Check how function handles unexpected inputs [OK]
Common Mistakes:
  • Assuming the print call is a syntax error in Python 3
  • Thinking the function must return both outputs
  • Overlooking the missing else branch for invalid choices
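
One possible fix, shown as a sketch rather than the only correct answer, is to raise a ValueError for any choice other than 'output1' or 'output2'. The dictionary lookup is a stylistic choice and behaves the same as the original if/elif chain for valid input.

```python
def compare_outputs(output1, output2, rater_choice):
    """Return the output the rater preferred, rejecting unknown choices."""
    choices = {"output1": output1, "output2": output2}
    if rater_choice not in choices:
        raise ValueError(
            f"Invalid rater choice {rater_choice!r}; expected 'output1' or 'output2'"
        )
    return choices[rater_choice]

print(compare_outputs("Answer A", "Answer B", "output2"))

try:
    compare_outputs("Answer A", "Answer B", "output3")
except ValueError as err:
    # The invalid choice is now reported instead of silently becoming None.
    print(f"Rejected: {err}")
```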
5. You want to design a human evaluation framework to compare two AI chatbots. Which approach best balances simplicity and detailed feedback?
hard
A. Use only open-ended feedback without ratings
B. Use simple rating scales plus side-by-side output comparisons
C. Use automatic BLEU scores without human input
D. Use complex statistical models without human ratings

Solution

  1. Step 1: Consider evaluation goals

    Balancing simplicity and detail means combining easy-to-use ratings with meaningful comparisons.
  2. Step 2: Evaluate options

    Option A (open-ended feedback only) collects no ratings, which makes quantitative comparison hard. Options C and D exclude human input, so they miss human judgment entirely. Option B combines ratings with side-by-side comparisons, which fits the goal.
  3. Final Answer:

    Use simple rating scales plus side-by-side output comparisons -> Option B
  4. Quick Check:

    Simple ratings + comparisons = balanced evaluation [OK]
Hint: Combine ratings with comparisons for best feedback [OK]
Common Mistakes:
  • Ignoring human input in evaluation
  • Choosing only open feedback without structure
  • Relying solely on automatic metrics
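
A combined framework might record, for each prompt, a rating for both chatbots plus the rater's side-by-side preference. The sketch below assumes a hypothetical `Judgment` record and `summarize` helper, and the data is invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Judgment:
    rating_a: int    # 1-5 rating for chatbot A's response
    rating_b: int    # 1-5 rating for chatbot B's response
    preferred: str   # "A" or "B", the side-by-side choice

def summarize(judgments):
    """Report mean ratings for each chatbot and how often A was preferred."""
    if not judgments:
        raise ValueError("At least one judgment is required")
    return {
        "mean_a": mean([j.rating_a for j in judgments]),
        "mean_b": mean([j.rating_b for j in judgments]),
        "a_preferred_share": sum(j.preferred == "A" for j in judgments) / len(judgments),
    }

# Hypothetical judgments from four raters.
judgments = [
    Judgment(4, 3, "A"),
    Judgment(5, 4, "A"),
    Judgment(3, 4, "B"),
    Judgment(4, 4, "A"),
]
summary = summarize(judgments)
print(summary)
```

The last judgment shows why both signals are useful: the ratings tie at 4, yet the forced choice still reveals a preference.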