Experiment - ROUGE evaluation metrics
Problem: You have a text summarization model that generates candidate summaries. You want to evaluate their quality by comparing them against human-written reference summaries using ROUGE scores.
Current Metrics: ROUGE-1 F1 = 0.45, ROUGE-2 F1 = 0.22, ROUGE-L F1 = 0.40
Issue: The ROUGE scores are low, indicating limited lexical overlap between the model summaries and the references. You want to improve the evaluation by computing ROUGE scores correctly and understanding what each variant measures: unigram overlap (ROUGE-1), bigram overlap (ROUGE-2), and longest common subsequence (ROUGE-L).
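As a reference point for checking the pipeline, here is a minimal from-scratch sketch of all three variants. It assumes simple whitespace tokenization and no normalization; real evaluations typically use a library (e.g. Google's `rouge-score` package), which additionally lowercases, strips punctuation, and optionally applies stemming, so its numbers will differ slightly.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # per-n-gram counts clipped to the min
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: LCS-based overlap (rewards in-order matches, gaps allowed)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

cand = "the cat sat on the mat"
ref = "the cat is on the mat"
print(rouge_n_f1(cand, ref, 1))  # → 0.8333... (5 of 6 unigrams match)
print(rouge_n_f1(cand, ref, 2))  # → 0.6 (3 of 5 bigrams match)
print(rouge_l_f1(cand, ref))     # → 0.8333... (LCS length 5)
```

Note that ROUGE-2 is typically much lower than ROUGE-1 (as in the metrics above, 0.22 vs 0.45), since consecutive word pairs must match exactly, so the gap between the two is normal rather than a bug.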