In question answering, the main goal is for the model to produce the correct answer. Two metrics are commonly used to measure how well it does this: Exact Match (EM) and the F1 score.
Exact Match checks whether the model's answer is identical to the correct answer: a prediction scores 1 if it matches and 0 otherwise. It is strict but easy to interpret.
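As a rough illustration, Exact Match can be sketched in a few lines of Python. The function names `normalize` and `exact_match` are hypothetical, and the normalization step (lowercasing, stripping punctuation, collapsing whitespace) is an assumption that many evaluation setups make in some form, not something this text prescribes.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> int:
    """Return 1 if the normalized strings are identical, otherwise 0."""
    return int(normalize(prediction) == normalize(reference))


if __name__ == "__main__":
    print(exact_match("Paris.", "paris"))         # identical after normalization
    print(exact_match("Paris, France", "Paris"))  # extra word, so no match
```

Under these assumptions, a prediction with even one extra word scores 0, which is what makes the metric strict.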
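F1 score measures the overlap between the words in the model's answer and the words in the correct answer. It is the harmonic mean of precision (the fraction of words in the model's answer that also appear in the correct answer) and recall (the fraction of words in the correct answer that the model's answer contains).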
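A minimal sketch of a word-level F1 computation is shown here, assuming whitespace tokenization and lowercasing (both are simplifying assumptions; real evaluation scripts may tokenize and normalize differently). The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection: each shared word counts at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted words that are correct
    recall = overlap / len(ref_tokens)      # share of reference words that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("the Eiffel Tower", "Eiffel Tower"))
    print(token_f1("London", "Paris"))
```

In the first call, two of the three predicted words are correct (precision 2/3) and both reference words are found (recall 1), so the score is 0.8 even though the answers do not match exactly.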
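These metrics matter because answers vary in length, and a model's answer is sometimes close to the correct answer without matching it exactly. EM gives such an answer no credit, while F1 measures partial correctness by rewarding the words that do overlap.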