For extractive summarization, ROUGE scores are the most commonly used metrics. ROUGE measures the overlap of words or phrases between the model's summary and a human-written reference summary. It tells us how much of the reference's content the model's output reproduces.
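Specifically, ROUGE-1 measures overlap of single words (unigrams), ROUGE-2 measures overlap of consecutive two-word sequences (bigrams), and ROUGE-L measures the longest common subsequence: the longest sequence of words that appears in the same order in both summaries, not necessarily contiguously. These help us see whether the summary captures the key content.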
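To make the three variants concrete, here is a minimal sketch in plain Python. It assumes lowercasing and whitespace tokenization with no stemming, and the helper names (`rouge_n`, `rouge_l`, `lcs_length`) are illustrative rather than taken from any ROUGE library, so scores from standard implementations may differ slightly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, cand_total, ref_total):
    """Turn an overlap count into precision, recall, and F1."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def rouge_n(candidate, reference, n):
    """ROUGE-N: clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L: based on the longest common subsequence of words."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return prf(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Made-up sentences, used only to exercise the functions.
    candidate = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print("ROUGE-1:", rouge_n(candidate, reference, 1))
    print("ROUGE-2:", rouge_n(candidate, reference, 2))
    print("ROUGE-L:", rouge_l(candidate, reference))
```

In this toy pair, five of six words are shared, so ROUGE-1 is high, while only three of five bigrams match, so ROUGE-2 is lower. That gap is why the variants are usually reported together.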
Classification-style accuracy or precision alone is less useful, because summarization is judged on content coverage and relevance, not just on classification correctness.