Prompt Engineering / GenAI · ~20 mins

Human evaluation frameworks in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding Human Evaluation Metrics

Which of the following best describes the purpose of human evaluation in AI model assessment?

A. To speed up model training by using human-labeled data only
B. To measure how well a model performs on automated benchmarks without human input
C. To replace all automated metrics with human feedback exclusively
D. To assess the quality of AI outputs based on human judgment and preferences
💡 Hint

Think about why humans are involved in evaluating AI outputs.

Metrics
intermediate
Interpreting Human Evaluation Scores

In a human evaluation where raters score AI-generated text from 1 (poor) to 5 (excellent), what does an average score of 4.2 indicate?

A. The AI outputs are generally rated as high quality by humans
B. The AI outputs are mostly rated as poor by humans
C. The AI outputs have a wide range of scores with no clear trend
D. The AI outputs are rated exactly the same by all raters
💡 Hint

Consider what a score closer to 5 means in a 1 to 5 scale.
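To make the arithmetic behind an average rating concrete, here is a minimal sketch (the rating values are illustrative, not taken from any real study) that averages a list of 1-to-5 ratings:

```python
# Average a list of human ratings on a 1 (poor) to 5 (excellent) scale.
ratings = [5, 4, 4, 5, 3]  # illustrative scores from five raters

average = sum(ratings) / len(ratings)
print(average)  # 4.2 -> much closer to "excellent" (5) than "poor" (1)
```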

Predict Output
advanced
Output of Human Evaluation Aggregation Code

What is the output of this Python code that aggregates human ratings?

ratings = {'rater1': [4, 5, 3], 'rater2': [5, 4, 4], 'rater3': [3, 4, 5]}
average_scores = [sum(scores)/len(scores) for scores in zip(*ratings.values())]
print(average_scores)
A. [4.0, 4.333333333333333, 4.0]
B. [4.0, 4.0, 4.0]
C. [3.5, 4.5, 4.0]
D. [5.0, 4.0, 3.0]
💡 Hint

Calculate the average for each position across raters.
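If the `zip(*ratings.values())` pattern is unfamiliar, this sketch (using different numbers than the problem, so it does not give the answer away) shows how it regroups per-rater lists into per-response tuples before averaging:

```python
# Each rater scores the same two responses; zip(*...) transposes the
# per-rater lists into per-response tuples.
ratings = {'rater1': [1, 5], 'rater2': [3, 5]}

grouped = list(zip(*ratings.values()))
print(grouped)   # [(1, 3), (5, 5)] -- one tuple per response position
averages = [sum(scores) / len(scores) for scores in grouped]
print(averages)  # [2.0, 5.0]
```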

Model Choice
advanced
Choosing a Human Evaluation Framework for Dialogue Systems

You want to evaluate a chatbot's responses for naturalness and relevance. Which human evaluation framework is most suitable?

A. Cross-validation on training data splits
B. Pairwise comparison where raters choose the better response between two options
C. Automated BLEU score calculation without human input
D. Confusion matrix analysis of chatbot intents
💡 Hint

Think about how humans can compare two chatbot responses directly.
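As background on the pairwise-comparison idea, here is a minimal sketch (the judgments are hypothetical) that tallies which of two chatbot systems raters preferred across a set of prompts:

```python
from collections import Counter

# Hypothetical pairwise judgments: for each prompt, a rater picks which
# of two chatbot responses ('A' or 'B') feels more natural and relevant.
judgments = ['A', 'B', 'A', 'A', 'B', 'A']

wins = Counter(judgments)
win_rate_a = wins['A'] / len(judgments)
print(wins)        # Counter({'A': 4, 'B': 2})
print(win_rate_a)  # ~0.67: raters prefer system A about two-thirds of the time
```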

🔧 Debug
expert
Debugging Human Evaluation Data Collection Code

What error, if any, does this code raise when collecting human ratings?

def collect_ratings(responses):
    ratings = {}
    for i, response in enumerate(responses):
        rating = int(input(f"Rate response {i+1} (1-5): "))
        if rating < 1 or rating > 5:
            raise ValueError("Rating must be between 1 and 5")
        ratings[i] = rating
    return ratings

ratings = collect_ratings(['Hi', 'Hello', 'Hey'])
print(ratings)
A. IndexError due to wrong loop indexing
B. TypeError because input() returns a string
C. ValueError if the user inputs a number outside 1-5
D. No error; the code runs correctly
💡 Hint

Consider what happens if the user inputs 0 or 6.
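For context, a common way to harden rating collection is to validate each input instead of raising immediately; here is a sketch (not the problem's code, and the function name is invented) that also handles non-numeric input, since `int('abc')` raises ValueError too:

```python
def parse_rating(raw):
    """Validate one rating string; return an int in 1-5, or None if invalid."""
    try:
        rating = int(raw)  # int('abc') raises ValueError, just like out-of-range checks
    except ValueError:
        return None
    return rating if 1 <= rating <= 5 else None

print(parse_rating('4'))   # 4
print(parse_rating('6'))   # None (out of range)
print(parse_rating('hi'))  # None (not a number)
```

A real collection loop would call this on each `input()` result and re-prompt on `None` rather than crashing mid-session.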