For summarization, we want to measure how well a generated summary captures the important content of the original text. The standard metrics are ROUGE scores, especially ROUGE-1, ROUGE-2, and ROUGE-L, which compare the overlap of words and phrases between the generated summary and a human-written reference summary.
ROUGE-1 measures the overlap of single words (unigrams), ROUGE-2 measures the overlap of word pairs (bigrams), and ROUGE-L is based on the longest common subsequence between summary and reference. Higher ROUGE scores indicate greater lexical overlap with the reference, which usually correlates with summary quality, though ROUGE cannot credit valid paraphrases that use different words.
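To make these definitions concrete, here is a minimal sketch of recall-oriented ROUGE-1, ROUGE-2, and ROUGE-L in plain Python. It assumes simple whitespace tokenization and no stemming; production implementations (e.g. the `rouge-score` package) add stemming, sentence splitting, and precision/F-measure variants.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """Recall-oriented ROUGE-N: overlapping n-grams / n-grams in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped counts, like the ROUGE definition
    return overlap / max(sum(ref.values()), 1)

def lcs_len(a, b):
    """Length of the longest common subsequence (classic dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """Recall-oriented ROUGE-L: LCS length / reference length."""
    cand, ref = candidate.split(), reference.split()
    return lcs_len(cand, ref) / max(len(ref), 1)

# Toy example: one word differs, so ROUGE-2 drops more sharply than ROUGE-1.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(f"ROUGE-1: {rouge_n(candidate, reference, 1):.3f}")  # 5/6 unigrams match
print(f"ROUGE-2: {rouge_n(candidate, reference, 2):.3f}")  # 3/5 bigrams match
print(f"ROUGE-L: {rouge_l(candidate, reference):.3f}")     # LCS of length 5
```

Note how a single substituted word removes two bigrams at once, which is why ROUGE-2 is a stricter signal of fluency and ordering than ROUGE-1.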