Prompt Engineering / GenAI · ~20 mins

Human evaluation frameworks in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Human evaluation frameworks
Problem: You have a generative AI model that produces text responses, and you want to measure how good these responses are using human evaluation. Currently you only have automatic scores such as BLEU and ROUGE, but these do not match how humans perceive quality.
Current Metrics: Automatic metric scores: BLEU = 0.45, ROUGE = 0.50. No human evaluation data yet.
Issue: Automatic metrics do not fully capture human judgment of response quality. You need a human evaluation framework to get reliable feedback.
Your Task
Design and implement a simple human evaluation framework to assess the quality of generated text responses. Collect human ratings on fluency, relevance, and overall quality. Summarize the results with average scores.
Use a small sample of 10 generated responses.
Use a 1 to 5 rating scale for each criterion.
Do not use automated metrics for this task.
Keep the evaluation process simple and easy to understand.
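Before simulating results, it helps to pin down how ratings would be collected and validated. The sketch below is one minimal way to do that, assuming one 1-5 integer rating per criterion per response; the function names (`validate_rating`, `collect_ratings`) and the `rate_fn` callback are illustrative, not part of the exercise.

```python
# Minimal sketch of a rating-collection helper (illustrative names).
# In a real study, rate_fn would prompt a human annotator; here it can
# be any callable, which also makes the pipeline easy to test.

CRITERIA = ["fluency", "relevance", "overall_quality"]

def validate_rating(value):
    """Return the rating as an int if it is a whole number from 1 to 5."""
    rating = int(value)
    if not 1 <= rating <= 5:
        raise ValueError(f"Rating must be between 1 and 5, got {rating}")
    return rating

def collect_ratings(responses, rate_fn):
    """Collect one rating per criterion for each response.

    rate_fn(response, criterion) supplies the human judgment.
    Returns a list of [fluency, relevance, overall_quality] rows.
    """
    return [
        [validate_rating(rate_fn(resp, crit)) for crit in CRITERIA]
        for resp in responses
    ]

# Example with a stand-in "rater" that always answers 4:
ratings = collect_ratings(["The cat sat on the mat."], lambda r, c: 4)
print(ratings)  # [[4, 4, 4]]
```

Validating each rating at collection time keeps out-of-range or non-integer answers from silently corrupting the averages computed later.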
Solution
import numpy as np

# Sample generated responses (for demonstration)
generated_responses = [
    "The cat sat on the mat.",
    "Weather today is sunny and warm.",
    "I love eating pizza on weekends.",
    "Python is a popular programming language.",
    "The movie was exciting and fun.",
    "She enjoys reading books at night.",
    "The car needs fuel to run.",
    "He plays football every Sunday.",
    "The garden has many colorful flowers.",
    "Music can lift your mood instantly."
]

# Simulated human ratings collected for each response
# Each row: [fluency, relevance, overall_quality] ratings from 1 to 5
human_ratings = np.array([
    [5, 5, 5],
    [4, 4, 4],
    [5, 5, 5],
    [5, 5, 5],
    [4, 4, 4],
    [5, 5, 5],
    [3, 4, 3],
    [4, 4, 4],
    [5, 5, 5],
    [5, 5, 5]
])

# Calculate average scores for each criterion
average_scores = human_ratings.mean(axis=0)

print(f"Average Fluency Score: {average_scores[0]:.2f} / 5")
print(f"Average Relevance Score: {average_scores[1]:.2f} / 5")
print(f"Average Overall Quality Score: {average_scores[2]:.2f} / 5")
Added a small set of generated responses for evaluation.
Simulated human ratings on fluency, relevance, and overall quality using a 1-5 scale.
Calculated average scores to summarize human evaluation results.
Focused on human evaluation instead of automatic metrics.
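Averages alone hide how much individual responses vary. A small extension of the solution, reusing the same simulated ratings, also reports the standard deviation per criterion so you can see the spread behind each mean:

```python
import numpy as np

# Same simulated ratings as in the solution above:
# each row is [fluency, relevance, overall_quality] on a 1-5 scale.
human_ratings = np.array([
    [5, 5, 5], [4, 4, 4], [5, 5, 5], [5, 5, 5], [4, 4, 4],
    [5, 5, 5], [3, 4, 3], [4, 4, 4], [5, 5, 5], [5, 5, 5],
])

means = human_ratings.mean(axis=0)
stds = human_ratings.std(axis=0)

for name, m, s in zip(["Fluency", "Relevance", "Overall Quality"], means, stds):
    print(f"{name}: {m:.2f} +/- {s:.2f} (out of 5)")
```

A low standard deviation means the scores are consistent across responses; a higher one flags a criterion where a few responses dragged the average down and deserve a closer look.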
Results Interpretation

Before: Only automatic metrics available (BLEU=0.45, ROUGE=0.50), which may not reflect true quality.

After: Human evaluation shows high average scores (around 4.5 to 4.6 out of 5), indicating good fluency, relevance, and overall quality as perceived by people.

Human evaluation frameworks provide valuable insights that automatic metrics cannot fully capture. Collecting simple human ratings helps understand model quality from a real user perspective.
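A natural next step is to use more than one rater, since a single rater's scores can be noisy. One simple approach, sketched below with invented ratings from three hypothetical raters, is to treat the per-response mean as the consensus score and the per-response standard deviation as a rough disagreement signal:

```python
import numpy as np

# Hypothetical overall-quality ratings from three raters for the same
# five responses (the data is invented for illustration).
ratings_by_rater = np.array([
    [5, 4, 5, 3, 4],   # rater A
    [5, 4, 4, 3, 5],   # rater B
    [4, 4, 5, 2, 4],   # rater C
])

# Mean across raters = consensus score per response;
# std across raters = how much the raters disagree on that response.
consensus = ratings_by_rater.mean(axis=0)
disagreement = ratings_by_rater.std(axis=0)

for i, (c, d) in enumerate(zip(consensus, disagreement), start=1):
    print(f"Response {i}: consensus {c:.2f}, rater std {d:.2f}")
```

Responses with high disagreement are good candidates for a second review round or for clarifying the rating guidelines.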
Bonus Experiment
Try designing a pairwise comparison human evaluation where raters choose the better response between two options instead of rating each individually.
💡 Hint
Present two generated responses side by side and ask which one is better overall. This can reduce rating bias and provide clearer preferences.
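The pairwise judgments can then be summarized as a simple win rate. The sketch below assumes each rater picks "model_a", "model_b", or "tie" for a given pair; the judgment data is invented for illustration:

```python
from collections import Counter

# Hypothetical pairwise judgments: each entry is the winner a rater
# chose when shown one response from each model side by side.
judgments = ["model_a", "model_b", "model_a", "model_a", "tie",
             "model_b", "model_a", "model_a", "tie", "model_a"]

counts = Counter(judgments)
total = len(judgments)

for option in ["model_a", "model_b", "tie"]:
    share = counts[option] / total
    print(f"{option}: {counts[option]}/{total} ({share:.0%})")
```

Win rates are often easier for raters to produce reliably than absolute 1-5 scores, because "which is better?" avoids each rater's personal calibration of the scale.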