In models built on self-attention and multi-head attention, such as Transformers, the primary metrics to monitor are task accuracy or loss (e.g., on translation or text classification). These metrics reflect how well the model captures relationships in the input.
Because attention weights are trained jointly with the rest of the network, gains in accuracy or reductions in loss suggest the model is learning useful attention patterns, though they do not isolate the attention mechanism itself from the rest of the architecture.
For sequence tasks, task-specific metrics such as BLEU (for translation) or F1-score (for classification) are also important, since they measure output quality more directly than raw loss.
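As a concrete illustration of the classification case, here is a minimal sketch of computing accuracy and F1 from a model's predicted labels and the gold labels. The `evaluate` function and the toy predictions are hypothetical, standing in for the output of any attention-based classifier; for binary F1, precision and recall are derived from true-positive, false-positive, and false-negative counts.

```python
def evaluate(preds, labels):
    """Compute accuracy and binary F1 for predicted vs. gold labels (1 = positive class)."""
    assert len(preds) == len(labels), "prediction/label lengths must match"
    n = len(labels)

    # Accuracy: fraction of exact matches.
    accuracy = sum(p == y for p, y in zip(preds, labels)) / n

    # Counts for the positive class.
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

    # Guard against zero denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1

# Toy example: the classifier gets 3 of 4 labels right.
acc, f1 = evaluate([1, 0, 1, 1], [1, 0, 0, 1])
print(acc, f1)  # 0.75 0.8
```

In practice you would use a library implementation (e.g., `sklearn.metrics.f1_score`, or `sacrebleu` for BLEU), but the underlying quantities are the same.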