In computer vision, images are static snapshots. Videos add a temporal dimension. What is the main reason video data requires temporal analysis?
Think about what extra information videos provide compared to single images.
Videos are sequences of images over time. This temporal aspect means models must analyze how things change frame to frame, capturing motion and sequence information.
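A minimal sketch of this idea: a video can be represented as single-image arrays stacked along a new time axis. The clip length and resolution here are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical 16-frame clip of 64x64 RGB images: each frame is an
# ordinary image array, and stacking them adds the time dimension.
frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(16)]
video = np.stack(frames, axis=0)  # shape: (time, height, width, channels)
print(video.shape)  # (16, 64, 64, 3)
```

A model that only looks at one slice along the first axis sees a static image; temporal analysis means modeling how content changes along that axis.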
Given a video input, which model architecture is designed to capture temporal dependencies effectively?
Consider models that remember past information to understand sequences.
RNNs, and in particular LSTMs, are designed to handle sequences by maintaining a memory of previous inputs, making them well suited to temporal data like video.
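As a sketch of this pattern, suppose each frame has already been encoded into a feature vector (for example by a CNN); an LSTM can then model how those features evolve over time. The batch size, clip length, and feature dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assumed setup: 2 clips of 16 frames, each frame already encoded
# as a 128-dim feature vector by some per-frame backbone.
frame_features = torch.randn(2, 16, 128)  # (batch, time, features)

lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
outputs, (h_n, c_n) = lstm(frame_features)

print(outputs.shape)  # (2, 16, 64): one hidden state per frame
print(h_n.shape)      # (1, 2, 64): final hidden state per clip
```

The final hidden state summarizes the whole sequence and could feed a classification head, while the per-frame outputs support frame-level predictions.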
When evaluating a model that classifies actions in videos, which metric helps measure how well the model captures temporal consistency across frames?
Think about metrics that consider time intervals, not just individual frames.
Temporal IoU measures how well predicted action segments align with true segments over time, capturing temporal consistency better than frame-level metrics.
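Temporal IoU is just the 1-D analogue of the familiar bounding-box IoU: the overlap of two time intervals divided by their union. A minimal sketch:

```python
def temporal_iou(pred, true):
    """IoU between two time intervals (start, end), in frames or seconds."""
    inter = max(0.0, min(pred[1], true[1]) - max(pred[0], true[0]))
    union = (pred[1] - pred[0]) + (true[1] - true[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted action segment [2, 8] vs ground-truth segment [4, 10]:
# intersection = 8 - 4 = 4, union = 6 + 6 - 4 = 8, so IoU = 0.5.
print(temporal_iou((2, 8), (4, 10)))  # 0.5
```

A per-frame accuracy metric would score each frame independently, whereas temporal IoU rewards predictions whose segment boundaries align with the true action extent.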
Consider a CNN trained on individual video frames for action recognition. Why might it fail to recognize actions that depend on motion?
Think about what CNNs analyze and what they miss when looking at frames independently.
CNNs extract spatial features from images but do not have memory or sequence modeling to capture motion or temporal changes between frames.
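One way to see this failure concretely: if a 2D CNN scores each frame independently and the per-frame scores are averaged, the result is identical when the frames are played in reverse order, so motion direction (say, standing up vs. sitting down) is invisible to the model. A hypothetical sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy per-frame scorer: a 2D conv applied to each frame independently.
cnn = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

video = torch.randn(16, 3, 64, 64)          # 16 frames as a batch
score_fwd = cnn(video).mean(dim=0)          # average score, forward order
score_rev = cnn(video.flip(0)).mean(dim=0)  # same frames, reversed in time

print(torch.allclose(score_fwd, score_rev))  # True: order is ignored
```

Averaging over frames is order-invariant, which is exactly why sequence models (RNNs, 3D CNNs, transformers over time) are needed for motion-dependent actions.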
Given a video input tensor of shape (batch_size=2, channels=3, frames=16, height=64, width=64), a 3D CNN layer with kernel size (3,3,3), stride 1, and padding 1 is applied. What is the output tensor shape?
import torch
import torch.nn as nn

input_tensor = torch.randn(2, 3, 16, 64, 64)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=(3, 3, 3), stride=1, padding=1)
output = conv3d(input_tensor)
print(output.shape)
Recall how padding and stride affect output size in convolution layers.
With kernel size 3, stride 1, and padding 1, each of the frames, height, and width dimensions keeps its input size, while the channel dimension changes to the 8 output channels specified. The output shape is therefore (2, 8, 16, 64, 64).
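The "same size" claim follows from the standard convolution output formula, out = (in + 2 * padding - kernel) // stride + 1, applied to each of the frames, height, and width dimensions:

```python
# Standard conv output-size formula, applied per dimension.
def conv_out(size, kernel=3, stride=1, padding=1):
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(16))  # 16: frames dimension unchanged
print(conv_out(64))  # 64: height and width unchanged
```

With kernel 3 and padding 1, the 2 * padding - kernel + 1 term is zero, so every dimension passes through unchanged at stride 1.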