Practice


1. What is the main goal of video understanding in AI?

easy

A. Teaching computers to watch and learn from videos

B. Making videos play faster on devices

C. Compressing videos to save space

D. Editing videos automatically

Solution

Step 1: Understand the purpose of video understanding
Video understanding means enabling computers to analyze and learn from video content.
Step 2: Compare options to the definition
Only option A, teaching computers to watch and learn from videos, matches this goal; options B, C, and D concern playback speed, compression, and editing.
Final Answer:
Teaching computers to watch and learn from videos -> Option A
Quick Check:
Video understanding = Teaching computers to learn from videos [OK]

Hint: Focus on learning, not playback or editing [OK]

Common Mistakes:

Confusing video understanding with video editing
Thinking it's about video compression
Assuming it's about video playback speed

2. Which neural network type is commonly used for video understanding?

easy

A. Fully connected networks without convolution

B. 2D convolutional neural networks

C. Recurrent neural networks only

D. 3D convolutional neural networks

Solution

Step 1: Identify network types used for video data
Videos have spatial and temporal dimensions; 3D CNNs capture both.
Step 2: Match network type to video understanding
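3D CNNs slide their kernels across neighboring frames as well as across height and width, so they see motion. A 2D CNN processes each frame independently, and a fully connected net ignores both spatial and temporal structure.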
Final Answer:
3D convolutional neural networks -> Option D
Quick Check:
3D CNNs capture space and time in videos [OK]

Hint: Remember 3D CNNs handle time and space in videos [OK]

Common Mistakes:

Choosing 2D CNNs, which see only one frame at a time
Ignoring temporal info by picking fully connected nets
Assuming RNNs alone are best for video frames
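
To make the "space and time" point concrete, here is a minimal NumPy sketch of a single-channel 3D convolution. It is an illustration only, not how a deep learning library implements the layer; the function name conv3d_valid and the toy clips are assumptions made up for this example. The kernel spans two consecutive frames, so it responds to change over time, which a per-frame 2D kernel cannot detect.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive single-channel 3D cross-correlation with no padding and stride 1.

    clip:   array of shape (frames, height, width)
    kernel: array of shape (kt, kh, kw)
    """
    t, h, w = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):          # slide over time
        for j in range(out.shape[1]):      # slide over height
            for k in range(out.shape[2]):  # slide over width
                window = clip[i:i + kt, j:j + kh, k:k + kw]
                out[i, j, k] = np.sum(window * kernel)
    return out

# Temporal-difference kernel: (next frame) minus (current frame) at each pixel.
temporal_kernel = np.zeros((2, 1, 1))
temporal_kernel[0, 0, 0] = -1.0
temporal_kernel[1, 0, 0] = 1.0

# Toy clip 1: every frame identical, so nothing moves.
static_clip = np.ones((4, 5, 5))

# Toy clip 2: a single bright pixel moving one step right per frame.
moving_clip = np.zeros((4, 5, 5))
for frame in range(4):
    moving_clip[frame, 2, frame] = 1.0

static_response = conv3d_valid(static_clip, temporal_kernel)
moving_response = conv3d_valid(moving_clip, temporal_kernel)

print(static_response.shape)            # 3 output frames from 4 input frames
print(np.abs(static_response).sum())    # no response without motion
print(np.abs(moving_response).sum())    # non-zero response where the dot moved
```

The static clip produces an all-zero response while the moving clip does not, which is the kind of temporal signal a 3D kernel can pick up.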

3. Given this Python snippet for video data preprocessing, what is the shape of the output tensor?

import numpy as np
video = np.random.rand(16, 64, 64, 3)  # 16 frames, 64x64 size, 3 color channels
output = video.reshape(1, 16, 64, 64, 3)

medium

A. (16, 64, 64, 3)

B. (64, 64, 3, 16)

C. (1, 16, 64, 64, 3)

D. (16, 1, 64, 64, 3)

Solution

Step 1: Understand the original video shape
The video has shape (16, 64, 64, 3): 16 frames, each 64x64 pixels with 3 color channels.
Step 2: Analyze the reshape operation
reshape keeps all 196,608 values (16 x 64 x 64 x 3) in the same order and adds a leading dimension of size 1, the batch dimension, giving shape (1, 16, 64, 64, 3).
Final Answer:
(1, 16, 64, 64, 3) -> Option C
Quick Check:
Reshape adds batch dimension = (1, 16, 64, 64, 3) [OK]

Hint: Look for added batch dimension in reshape [OK]

Common Mistakes:

Ignoring the added batch dimension
Mixing up order of dimensions
Assuming reshape changes total elements
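
As a quick check of the reasoning above, the sketch below adds the batch dimension three equivalent ways and compares the results. The variable names are chosen for this example only. It also shows how option D's shape would arise, by inserting the new axis at position 1 instead of position 0.

```python
import numpy as np

video = np.random.rand(16, 64, 64, 3)  # 16 frames, 64x64 pixels, 3 channels

# Three equivalent ways to add a leading batch dimension of size 1.
batched_reshape = video.reshape(1, 16, 64, 64, 3)
batched_expand = np.expand_dims(video, axis=0)
batched_newaxis = video[np.newaxis, ...]

# Inserting the axis after the frame dimension gives option D's shape instead.
wrong_axis = np.expand_dims(video, axis=1)

print(batched_reshape.shape)
print(wrong_axis.shape)
print(video.size == batched_reshape.size)  # reshape never changes the element count
```

np.expand_dims and np.newaxis state the intent (add one axis) without repeating the existing sizes, so they keep working if the frame count or resolution changes.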

4. This code snippet tries to create a 3D CNN layer but has an error. What is the mistake?

from tensorflow.keras.layers import Conv3D
layer = Conv3D(filters=32, kernel_size=(3,3), activation='relu')

medium

A. kernel_size should have three dimensions, e.g., (3,3,3)

B. Missing input shape argument

C. filters must be a list, not an integer

D. activation='relu' is not allowed in Conv3D

Solution

Step 1: Check Conv3D kernel_size parameter
Conv3D expects kernel_size to be a single integer or a tuple of three integers: depth (time), height, and width.
Step 2: Identify the error in kernel_size
The code passes (3,3), which has only two values, so Keras raises a ValueError when the layer is created.
Final Answer:
kernel_size should have three dimensions, e.g., (3,3,3) -> Option A
Quick Check:
3D CNN kernel_size needs 3 values [OK]

Hint: 3D kernels need three numbers, not two [OK]

Common Mistakes:

Using 2D kernel size in 3D CNN
Thinking filters must be a list
Believing activation can't be relu
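
The check behind this error can be sketched in plain Python. The helpers normalize_kernel_size and conv_output_shape below are hypothetical, written only for this illustration; they are not Keras functions, and the error message is not the one Keras prints. The sketch assumes stride 1 and no padding when computing the output size.

```python
def normalize_kernel_size(kernel_size, rank=3):
    """Hypothetical helper: turn an int or a tuple into a tuple of `rank` ints."""
    if isinstance(kernel_size, int):
        return (kernel_size,) * rank      # 3 -> (3, 3, 3)
    kernel_size = tuple(kernel_size)
    if len(kernel_size) != rank:
        raise ValueError(
            f"kernel_size must be an int or a tuple of {rank} ints, got {kernel_size}"
        )
    return kernel_size

def conv_output_shape(input_shape, kernel_size):
    """Hypothetical helper: output size per axis for stride 1 and no padding."""
    kernel = normalize_kernel_size(kernel_size, rank=len(input_shape))
    return tuple(size - k + 1 for size, k in zip(input_shape, kernel))

ok_kernel = normalize_kernel_size((3, 3, 3))   # valid 3D kernel
int_kernel = normalize_kernel_size(3)          # an int is expanded to three values

try:
    normalize_kernel_size((3, 3))              # the mistake from the question
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# 16 frames of 64x64 with a 3x3x3 kernel: each axis shrinks by 2.
out_shape = conv_output_shape((16, 64, 64), (3, 3, 3))
print(ok_kernel, int_kernel, out_shape)
print(error_message)

# The corrected layer from the question would be:
# Conv3D(filters=32, kernel_size=(3, 3, 3), activation='relu')
```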

5. You want to train a video understanding model to recognize actions. Which data setup is best?

hard

A. Single images with labels, no temporal info

B. Video clips with labels and enough frames to see actions

C. Random frames from different videos without labels

D. Audio clips extracted from videos

Solution

Step 1: Understand training data needs for action recognition
Actions happen over time, so clips with multiple frames are needed.
Step 2: Evaluate options for temporal and label info
Only option B provides both labels and enough consecutive frames to capture an action. Option A has no temporal information, option C has neither labels nor temporal continuity, and option D discards the visual signal.
Final Answer:
Video clips with labels and enough frames to see actions -> Option B
Quick Check:
Training needs labeled clips with temporal info [OK]

Hint: Actions need multiple frames with labels [OK]

Common Mistakes:

Using single images without time info
Ignoring labels in training data
Using unrelated audio clips
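
One way to prepare such data is to cut each labeled video down to a fixed number of frames spread across the whole action. The sketch below assumes evenly spaced sampling, which is only one of several reasonable strategies; the function sample_clip, the random toy videos, and the labels "wave" and "jump" are all made up for this example.

```python
import numpy as np

def sample_clip(video, num_frames):
    """Pick `num_frames` evenly spaced frames, keeping their temporal order."""
    total = video.shape[0]
    if total < num_frames:
        raise ValueError(f"video has {total} frames, need at least {num_frames}")
    indices = np.linspace(0, total - 1, num_frames).round().astype(int)
    return video[indices], indices

# Hypothetical toy dataset: (video, label) pairs with different lengths.
rng = np.random.default_rng(0)
dataset = [
    (rng.random((40, 32, 32, 3)), "wave"),
    (rng.random((25, 32, 32, 3)), "jump"),
]

clips, labels = [], []
for video, label in dataset:
    clip, _ = sample_clip(video, num_frames=16)
    clips.append(clip)
    labels.append(label)      # the label stays paired with its clip

# Equal-length clips can be stacked into one batch for a 3D CNN.
batch = np.stack(clips)
print(batch.shape)
print(labels)
```

Sampling across the full video rather than taking the first 16 frames makes it more likely that the clip covers the start, middle, and end of the action.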
