Experiment - Evaluating generated text (BLEU, ROUGE)
Problem: You have a text generation model that produces summaries, and you want to measure how closely these summaries match human-written references.
Current Metrics: BLEU score: 0.35, ROUGE-1 F1 score: 0.40
Issue: The scores are low, indicating limited n-gram overlap between the generated summaries and the references. Keep in mind that BLEU and ROUGE measure surface overlap only, so a fluent paraphrase of a reference can still score poorly.
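Both metrics can be computed directly from n-gram counts. Below is a minimal standard-library sketch of sentence-level BLEU (geometric mean of modified n-gram precisions with a brevity penalty, using add-one smoothing for higher-order n-grams) and ROUGE-1 F1. Whitespace tokenization is a simplifying assumption; for real evaluations, established libraries such as sacrebleu or rouge-score handle tokenization and smoothing more carefully.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference.

    Uses clipped (modified) n-gram precision, a brevity penalty,
    and add-one smoothing for n > 1 so one missing n-gram order
    does not zero the whole score. Whitespace tokenization is a
    simplifying assumption.
    """
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        p_n = clipped / total if n == 1 else (clipped + 1) / (total + 1)
        if p_n == 0:
            return 0.0
        log_precisions.append(math.log(p_n))
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: unigram overlap balanced between precision and recall."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A quick sanity check: an exact match scores 1.0 on both metrics, while a summary with no shared words scores 0.0, which brackets where scores like 0.35 BLEU and 0.40 ROUGE-1 fall.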