For encoder-decoder models with attention, particularly in sequence generation tasks such as translation and summarization, BLEU and ROUGE are the standard evaluation metrics: they measure n-gram overlap between the model's output and human-written references. Because overlap metrics are imperfect proxies for quality, perplexity is also tracked during training; it reflects how well the model predicts each next token, and lower perplexity indicates more confident, more accurate predictions.
In classification tasks built on encoder-decoder architectures, accuracy, precision, and recall are the relevant metrics: accuracy measures the overall fraction of correct predictions, precision measures how many predicted positives are truly positive, and recall measures how many true positives the model actually finds.
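These three metrics all derive from the same confusion-matrix counts. A minimal sketch for the binary case, using made-up labels for illustration:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall from true/false positive counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many are found
    return accuracy, precision, recall

# Hypothetical gold labels and model predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
acc, prec, rec = classification_metrics(y_true, y_pred)
print(acc, prec, rec)  # → 0.6666666666666666 0.75 0.75
```

In practice a library such as scikit-learn would be used, but writing the counts out makes the precision/recall trade-off explicit: the one false positive lowers precision, while the one false negative lowers recall.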