YOLO (You Only Look Once) is a popular object detection model. What is its main advantage compared to older methods like sliding window or region proposal based detectors?
Think about how YOLO treats the image differently from methods that look at many small parts separately.
YOLO divides the image into a grid and predicts bounding boxes and class probabilities for all grid cells in one forward pass, making it very fast compared to methods that scan many regions separately.
YOLO divides the image into an SxS grid. Each grid cell predicts B bounding boxes and C class probabilities. Given S=7, B=2, and C=20, what is the shape of the output tensor from the final layer?
Each bounding box predicts 5 values (x, y, w, h, confidence) plus class probabilities per cell.
Each cell predicts B=2 boxes, each with 5 values (x, y, w, h, confidence), giving 2*5 = 10 values, plus C=20 class probabilities, for a total of 30 values per cell. The output shape is therefore [7, 7, 30].
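The shape arithmetic above can be sketched as a tiny helper (a hypothetical function name, not part of any YOLO implementation):

```python
def yolo_output_shape(S, B, C):
    # Each cell predicts B boxes * 5 values (x, y, w, h, confidence)
    # plus C class probabilities shared by the whole cell.
    return (S, S, B * 5 + C)

print(yolo_output_shape(7, 2, 20))  # (7, 7, 30)
```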
YOLO's final layer outputs a tensor with shape [batch_size, S, S, B*5 + C]. Which PyTorch layer is most appropriate to produce this output from a feature map?
The layer should keep spatial dimensions but change channel depth to prediction size.
A 1x1 convolution changes the number of channels per spatial location without changing height and width, perfect for predicting bounding boxes and classes per grid cell.
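A 1x1 convolution is just a per-location linear map over channels, so it can be sketched with a plain matrix multiply. The shapes below (a 1024-channel feature map on a 7x7 grid) are illustrative assumptions, not YOLO's exact architecture; in PyTorch the equivalent layer would be `nn.Conv2d(1024, 30, kernel_size=1)`.

```python
import numpy as np

S, C_in, C_out = 7, 1024, 2 * 5 + 20    # B=2 boxes, C=20 classes -> 30 channels

feat = np.random.randn(C_in, S, S)       # feature map [channels, H, W]
weight = np.random.randn(C_out, C_in)    # 1x1 kernel: one mixing vector per output channel
bias = np.zeros(C_out)

# The same channel-mixing matrix is applied at every (H, W) location,
# changing depth from 1024 to 30 while leaving the 7x7 grid untouched.
out = np.einsum('oc,chw->ohw', weight, feat) + bias[:, None, None]
print(out.shape)  # (30, 7, 7)
```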
YOLO predicts bounding boxes and class labels. Which metric is most appropriate to evaluate its detection accuracy?
Consider a metric that accounts for both localization and classification correctness.
mAP (mean Average Precision) measures how well predicted boxes match ground-truth boxes (typically via an IoU threshold) with the correct class labels, combining localization and classification performance in a single score.
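The localization half of this metric rests on Intersection-over-Union. A minimal sketch, assuming boxes given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Overlap rectangle: max of the top-left corners, min of the bottom-right.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~= 0.1429
```

mAP then averages precision over recall levels per class (counting a prediction as correct when its IoU with a ground-truth box exceeds the threshold) and averages those values across classes.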
YOLO divides the image into a grid and predicts boxes per cell. Why might it struggle to detect very small objects?
Think about how grid size affects the ability to localize small objects.
Because each grid cell predicts only a fixed number of boxes (and, in the original YOLO, a single class distribution), several small objects whose centers fall in the same cell must compete for those few predictions; the coarse grid also limits localization precision, so small objects can be missed or absorbed by larger objects in the same cell.
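This crowding effect is easy to see by mapping object centers to grid cells. A small sketch, assuming YOLOv1's 448-pixel input and S=7 (so each cell covers 64x64 pixels); `grid_cell` is a hypothetical helper, not a real API:

```python
def grid_cell(cx, cy, S=7, img_size=448):
    """Return the (row, col) grid cell that a box center (in pixels) falls into."""
    col = min(int(cx / img_size * S), S - 1)
    row = min(int(cy / img_size * S), S - 1)
    return row, col

# Two small objects whose centers are only 10 px apart land in the same
# 64x64 cell, so they must share that one cell's B box predictions.
print(grid_cell(100, 100))  # (1, 1)
print(grid_cell(110, 100))  # (1, 1)
```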