Video understanding means teaching a computer to watch videos and know what is happening. The main goal is to recognize actions, objects, or events in the video.
The key metrics are accuracy, precision, recall, and F1-score. Together they tell us how often the model's predictions about actions or objects are correct and how many of the real actions or objects it finds.
For example, if the model says "someone is running" in a video, precision tells us how often it was right when it said that. Recall tells us what share of the actual running moments in the video it found. F1-score balances both by taking their harmonic mean, so it is high only when precision and recall are both high.
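To make the running example concrete, here is a minimal sketch that computes these metrics from per-frame labels. The function name `frame_metrics` and the ten-frame label lists are made up for illustration; the sketch assumes each frame is simply marked 1 (running) or 0 (not running), which is only one of several ways to score a video model.

```python
def frame_metrics(predicted, actual):
    """Return (accuracy, precision, recall, f1) for binary per-frame labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # said running, was running
    fp = sum(1 for p, a in pairs if p and not a)      # said running, was not
    fn = sum(1 for p, a in pairs if not p and a)      # missed a running frame
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly said not running

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical labels for a 10-frame clip (1 = running, 0 = not running).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy, precision, recall, f1 = frame_metrics(predicted, actual)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.700 precision=0.667 recall=0.500 f1=0.571
```

In this made-up clip the model is right two out of the three times it says "running" (precision 0.667) but finds only two of the four running frames (recall 0.5), and the F1-score sits between the two.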
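Sometimes, we also use Mean Average Precision (mAP) when the model has to detect multiple objects or actions in videos. For each class, average precision (AP) summarizes how precise the model stays as it finds more and more of the correct items, going from its most confident predictions to its least confident ones. mAP is the mean of these AP values over all classes, so it rewards finding the correct items and penalizes wrong ones.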
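A small sketch can show how mAP is put together. It uses the simple ranking form of average precision, and it assumes each prediction has already been marked correct (1) or wrong (0) and that every real item appears somewhere in the list. Real detection benchmarks add a matching step (for example an overlap threshold) and often use interpolated precision, which this sketch leaves out. The function names, class names, and scores are made up for illustration.

```python
def average_precision(scores, labels):
    """Ranking-based AP: mean of the precision at each correct prediction."""
    # Sort predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` items
    return precision_sum / total_positives


def mean_average_precision(per_class):
    """Mean of the per-class AP values."""
    aps = [average_precision(scores, labels)
           for scores, labels in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical confidence scores and correctness labels for two action classes.
per_class = {
    "running": ([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 0]),
    "jumping": ([0.7, 0.5, 0.4], [1, 1, 0]),
}

ap_running = average_precision(*per_class["running"])
ap_jumping = average_precision(*per_class["jumping"])
map_score = mean_average_precision(per_class)
print(f"AP(running)={ap_running:.3f} AP(jumping)={ap_jumping:.3f} "
      f"mAP={map_score:.3f}")
# AP(running)=0.833 AP(jumping)=1.000 mAP=0.917
```

The "running" class scores lower because a wrong prediction is ranked above a correct one, while "jumping" gets a perfect AP because both correct predictions come first.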