Which of the following best describes the purpose of human evaluation in AI model assessment?
Think about why humans are involved in evaluating AI outputs.
Human evaluation is used to judge AI outputs based on human preferences, which automated metrics may not fully capture.
In a human evaluation where raters score AI-generated text from 1 (poor) to 5 (excellent), what does an average score of 4.2 indicate?
Consider what a score closer to 5 means in a 1 to 5 scale.
An average score of 4.2 means most raters found the outputs to be good or excellent.
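As a quick illustration, an average like 4.2 can come from a set of mostly 4s and 5s. A minimal sketch, using hypothetical scores and Python's standard statistics module:

```python
from statistics import mean

# Hypothetical rater scores on a 1 (poor) to 5 (excellent) scale
scores = [4, 5, 4, 4, 4, 5, 3, 5, 4, 4]

avg = mean(scores)
print(avg)  # 4.2
```

Most of these hypothetical scores are 4 or 5, so the mean lands near the top of the scale.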
What is the output of this Python code that aggregates human ratings?
ratings = {'rater1': [4, 5, 3], 'rater2': [5, 4, 4], 'rater3': [3, 4, 5]}
average_scores = [sum(scores)/len(scores) for scores in zip(*ratings.values())]
print(average_scores)
Calculate the average for each position across raters.
The code averages the ratings at each position across raters: the first ratings (4, 5, 3), second ratings (5, 4, 4), and third ratings (3, 4, 5), printing [4.0, 4.333333333333333, 4.0].
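For contrast, averaging per rater instead of per item is a one-line change. A sketch reusing the same ratings dictionary from the question:

```python
ratings = {'rater1': [4, 5, 3], 'rater2': [5, 4, 4], 'rater3': [3, 4, 5]}

# Per-item averages (as in the question): zip(*...) groups scores by position
per_item = [sum(s) / len(s) for s in zip(*ratings.values())]
print(per_item)  # [4.0, 4.333333333333333, 4.0]

# Per-rater averages: each rater's mean across all items
per_rater = {r: sum(s) / len(s) for r, s in ratings.items()}
print(per_rater)  # {'rater1': 4.0, 'rater2': 4.333333333333333, 'rater3': 4.0}
```

Per-item averages surface which outputs raters disagreed on; per-rater averages surface which raters score systematically high or low.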
You want to evaluate a chatbot's responses for naturalness and relevance. Which human evaluation framework is most suitable?
Think about how humans can compare two chatbot responses directly.
Pairwise comparison lets humans directly judge which response is better, useful for subjective qualities like naturalness.
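A pairwise-comparison evaluation is typically aggregated into win counts or win rates. A minimal sketch, assuming a hypothetical list of judgments where each entry names the response a rater preferred:

```python
from collections import Counter

# Hypothetical pairwise judgments: each entry is the response a rater
# preferred when shown responses A and B side by side
judgments = ['A', 'B', 'A', 'A', 'B', 'A', 'A']

wins = Counter(judgments)
win_rate_a = wins['A'] / len(judgments)
print(wins)                    # Counter({'A': 5, 'B': 2})
print(round(win_rate_a, 2))    # 0.71
```

A win rate well above 0.5 suggests raters consistently preferred response A for qualities like naturalness.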
What error does this code raise when collecting human ratings?
def collect_ratings(responses):
    ratings = {}
    for i, response in enumerate(responses):
        rating = int(input(f"Rate response {i+1} (1-5): "))
        if rating < 1 or rating > 5:
            raise ValueError("Rating must be between 1 and 5")
        ratings[i] = rating
    return ratings

ratings = collect_ratings(['Hi', 'Hello', 'Hey'])
print(ratings)
Consider what happens if the user inputs 0 or 6.
The code raises a ValueError whenever the entered rating falls outside the allowed 1 to 5 range, such as 0 or 6.
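The range check can be exercised without interactive input by factoring it into a small helper. A sketch using a hypothetical validate_rating function that mirrors the check in collect_ratings:

```python
def validate_rating(rating):
    # Same range check as in the question's collect_ratings function
    if rating < 1 or rating > 5:
        raise ValueError("Rating must be between 1 and 5")
    return rating

print(validate_rating(3))  # 3

try:
    validate_rating(6)
except ValueError as e:
    print(e)  # Rating must be between 1 and 5
```

Separating validation from input() also makes the behavior easy to unit-test.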