In computer vision, images are static snapshots. Videos add a temporal dimension. What is the main reason video data requires temporal analysis?
Think about what extra information videos provide compared to single images.
Videos are sequences of images over time. This temporal aspect means models must analyze how things change frame to frame, capturing motion and sequence information.
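A minimal sketch of this idea: a video can be represented as single-image arrays stacked along a new time axis. The clip length and resolution here are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical 16-frame clip of 64x64 RGB images: each frame is an
# ordinary image array, and stacking them adds the time dimension.
frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(16)]
video = np.stack(frames, axis=0)  # shape: (time, height, width, channels)
print(video.shape)  # (16, 64, 64, 3)
```

A model that only looks at one slice along the first axis sees a static image; temporal analysis means modeling how content changes along that axis.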
Given a video input, which model architecture is designed to capture temporal dependencies effectively?
Consider models that remember past information to understand sequences.
RNNs, and in particular LSTMs, are designed to handle sequences by maintaining a memory of previous inputs, making them well suited to temporal data like video.
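As a sketch of this pattern, suppose each frame has already been encoded into a feature vector (for example by a CNN); an LSTM can then model how those features evolve over time. The batch size, clip length, and feature dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assumed setup: 2 clips of 16 frames, each frame already encoded
# as a 128-dim feature vector by some per-frame backbone.
frame_features = torch.randn(2, 16, 128)  # (batch, time, features)

lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
outputs, (h_n, c_n) = lstm(frame_features)

print(outputs.shape)  # (2, 16, 64): one hidden state per frame
print(h_n.shape)      # (1, 2, 64): final hidden state per clip
```

The final hidden state summarizes the whole sequence and could feed a classification head, while the per-frame outputs support frame-level predictions.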
When evaluating a model that classifies actions in videos, which metric helps measure how well the model captures temporal consistency across frames?
Think about metrics that consider time intervals, not just individual frames.
Temporal IoU measures how well predicted action segments align with true segments over time, capturing temporal consistency better than frame-level metrics.
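Temporal IoU is just the 1-D analogue of the familiar bounding-box IoU: the overlap of two time intervals divided by their union. A minimal sketch:

```python
def temporal_iou(pred, true):
    """IoU between two time intervals (start, end), in frames or seconds."""
    inter = max(0.0, min(pred[1], true[1]) - max(pred[0], true[0]))
    union = (pred[1] - pred[0]) + (true[1] - true[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted action segment [2, 8] vs ground-truth segment [4, 10]:
# intersection = 8 - 4 = 4, union = 6 + 6 - 4 = 8, so IoU = 0.5.
print(temporal_iou((2, 8), (4, 10)))  # 0.5
```

A per-frame accuracy metric would score each frame independently, whereas temporal IoU rewards predictions whose segment boundaries align with the true action extent.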
Consider a CNN trained on individual video frames for action recognition. Why might it fail to recognize actions that depend on motion?
Think about what CNNs analyze and what they miss when looking at frames independently.
CNNs extract spatial features from images but do not have memory or sequence modeling to capture motion or temporal changes between frames.
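One way to see this failure concretely: if a 2D CNN scores each frame independently and the per-frame scores are averaged, the result is identical when the frames are played in reverse order, so motion direction (say, standing up vs. sitting down) is invisible to the model. A hypothetical sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy per-frame scorer: a 2D conv applied to each frame independently.
cnn = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

video = torch.randn(16, 3, 64, 64)          # 16 frames as a batch
score_fwd = cnn(video).mean(dim=0)          # average score, forward order
score_rev = cnn(video.flip(0)).mean(dim=0)  # same frames, reversed in time

print(torch.allclose(score_fwd, score_rev))  # True: order is ignored
```

Averaging over frames is order-invariant, which is exactly why sequence models (RNNs, 3D CNNs, transformers over time) are needed for motion-dependent actions.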
Given a video input tensor of shape (batch_size=2, channels=3, frames=16, height=64, width=64), a 3D CNN layer with kernel size (3,3,3), stride 1, and padding 1 is applied. What is the output tensor shape?
import torch
import torch.nn as nn

input_tensor = torch.randn(2, 3, 16, 64, 64)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=(3, 3, 3), stride=1, padding=1)
output = conv3d(input_tensor)
print(output.shape)
Recall how padding and stride affect output size in convolution layers.
With kernel size 3, stride 1, and padding 1, each of the frames, height, and width dimensions keeps its input size, while the channel dimension changes to the 8 output channels specified. The output shape is therefore (2, 8, 16, 64, 64).
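The "same size" claim follows from the standard convolution output formula, out = (in + 2 * padding - kernel) // stride + 1, applied to each of the frames, height, and width dimensions:

```python
# Standard conv output-size formula, applied per dimension.
def conv_out(size, kernel=3, stride=1, padding=1):
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(16))  # 16: frames dimension unchanged
print(conv_out(64))  # 64: height and width unchanged
```

With kernel 3 and padding 1, the 2 * padding - kernel + 1 term is zero, so every dimension passes through unchanged at stride 1.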