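For image understanding and description (image captioning), we want to check how closely the model's generated text matches what a person would write about the same image. Common metrics are BLEU, METEOR, ROUGE, and CIDEr. Each one scores the model's description against one or more human-written reference descriptions: BLEU measures n-gram precision, ROUGE is recall-oriented, METEOR also credits stems and synonyms, and CIDEr weights n-grams by TF-IDF so that n-grams common across the whole dataset count for less. They matter because a high score means the model's wording overlaps with how people actually describe the image.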
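Also, accuracy-style metrics on the object detection or classification parts, such as classification accuracy or mean average precision (mAP) for detection, help check whether the model recognizes the right things in the image. But for the description itself, language similarity scores are key.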
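
To make the idea of a language similarity score concrete, here is a minimal sketch of clipped n-gram precision, the building block behind BLEU. It is not full BLEU (there is no brevity penalty and no averaging across n-gram orders), the function names `ngrams` and `clipped_precision` are made up for this illustration, and it assumes simple lowercase whitespace tokenization. For real evaluations an established implementation is the safer choice.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate caption against references.

    Each candidate n-gram is credited at most as many times as it appears
    in any single reference, so repeating a word does not inflate the score.
    Tokenization here is a plain lowercase whitespace split (an assumption).
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0

    # For every n-gram, keep the highest count seen in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical captions, invented for this example.
references = [
    "a dog is running on the grass",
    "a brown dog runs across a field",
]
candidate = "a dog runs on the grass"

print(clipped_precision(candidate, references, 1))  # 1.0 (all 6 words appear in a reference)
print(clipped_precision(candidate, references, 2))  # 0.8 (4 of 5 bigrams match)
```

The unigram score is perfect because every word appears in some reference, while the bigram score drops because "runs on" never occurs in either reference. That gap is why these metrics look at longer n-grams and not only at individual words.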