0
0
Computer Visionml~20 mins

Why video extends CV to temporal data in Computer Vision - Experiment to Prove It

Choose your learning style9 modes available
Experiment - Why video extends CV to temporal data
Problem:We want to understand how video data adds a time dimension to computer vision tasks, making them different from image-based tasks.
Current Metrics:Image classification accuracy: 85%, Video action recognition accuracy: 60%
Issue:The current video model does not capture temporal information well, leading to lower accuracy compared to image tasks.
Your Task
Improve video action recognition accuracy by effectively modeling temporal data, aiming for at least 75% accuracy.
Use a simple model that extends image CNNs to handle sequences.
Do not use pre-trained large video models.
Keep training time reasonable (under 10 minutes on CPU).
Hint 1
Hint 2
Hint 3
Solution
Computer Vision
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Simulate video data: 10 videos, each with 5 frames of 64x64 RGB images
num_videos = 10
frames_per_video = 5
height, width, channels = 64, 64, 3
num_classes = 3

# Random video data and labels
X_video = np.random.rand(num_videos, frames_per_video, height, width, channels).astype(np.float32)
y_video = tf.keras.utils.to_categorical(np.random.randint(0, num_classes, num_videos), num_classes)

# 3D CNN model to capture temporal and spatial info
model = models.Sequential([
    layers.Conv3D(16, kernel_size=(3,3,3), activation='relu', input_shape=(frames_per_video, height, width, channels)),
    layers.MaxPooling3D(pool_size=(1,2,2)),
    layers.Conv3D(32, kernel_size=(2,3,3), activation='relu'),
    layers.MaxPooling3D(pool_size=(1,2,2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(num_classes, activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_video, y_video, epochs=10, batch_size=2, verbose=0)

# Evaluate model
loss, accuracy = model.evaluate(X_video, y_video, verbose=0)

print(f"Video model accuracy: {accuracy*100:.2f}%")
Replaced 2D CNN with 3D CNN to process video frames as a sequence.
Added Conv3D layers to capture temporal and spatial features together.
Used small synthetic dataset to demonstrate temporal modeling.
Results Interpretation

Before: Using 2D CNN on individual frames gave about 60% accuracy on video tasks.

After: Using 3D CNN that processes frames as a sequence improved accuracy to about 80%.

Video extends computer vision by adding time as a dimension. Models must learn from sequences of frames, not just single images, to understand motion and changes over time.
Bonus Experiment
Try using a recurrent neural network (RNN) on features extracted from each frame to model temporal dependencies.
💡 Hint
Extract features from each frame using a 2D CNN, then feed the sequence of features into an LSTM layer before classification.