Experiment - Human evaluation frameworks
Problem: You have a generative AI model that produces text responses, and you want to measure their quality using human evaluation. Currently you only have automatic scores such as BLEU and ROUGE, but these do not match how humans judge quality.
Current Metrics: Automatic metric scores: BLEU = 0.45, ROUGE = 0.50. No human evaluation data yet.
Issue: Automatic metrics do not fully capture human judgment of response quality. You need a human evaluation framework to get reliable feedback.
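One common approach to this problem is to collect Likert-scale ratings (e.g. 1 to 5) from multiple annotators per response, aggregate them into a mean human score per response, and then check how well the automatic metrics track those human scores via rank correlation. The sketch below is a minimal pure-Python illustration under those assumptions; the function names (`aggregate_ratings`, `spearman`) and the 1-5 rating scale are hypothetical choices for this example, not part of the original problem statement.

```python
# Hypothetical sketch of a Likert-scale human evaluation framework.
# Assumptions: each rating is (response_id, annotator_id, score in 1..5),
# and we compare mean human scores against an automatic metric via
# Spearman rank correlation (pure-Python, average ranks for ties).
from collections import defaultdict
from statistics import mean


def aggregate_ratings(ratings):
    """Aggregate per-annotator scores into a mean human score per response."""
    by_response = defaultdict(list)
    for response_id, _annotator_id, score in ratings:
        by_response[response_id].append(score)
    return {rid: mean(scores) for rid, scores in by_response.items()}


def spearman(x, y):
    """Spearman rank correlation between two equal-length score lists."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # Group tied values and assign them their average rank.
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg_rank
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den if den else 0.0
```

For example, after collecting ratings you could compute `spearman(human_means, bleu_scores)` over the same ordered list of responses: a low correlation would quantify the stated mismatch between BLEU/ROUGE and human judgment. A fuller framework would also report inter-annotator agreement (e.g. Cohen's kappa or Krippendorff's alpha) to check that the human ratings themselves are reliable.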