
Video understanding basics in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
What is the main purpose of video understanding in AI?

Imagine you have a video of a soccer game. What does video understanding help AI do with this video?

A. Recognize actions, objects, and events happening in the video
B. Only extract the audio from the video
C. Convert the video into a text document without any analysis
D. Change the video colors to black and white
💡 Hint

Think about what you want a smart system to learn from watching a video.

Predict Output
intermediate
Output of frame extraction code snippet

What will be the output of this Python code, which reads the first three frames of a video? Assume video.mp4 exists and is a color video 640 pixels wide and 480 pixels high.

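import cv2
cap = cv2.VideoCapture('video.mp4')  # open the video file
count = 0
while True:
    ret, frame = cap.read()  # ret is False when no frame could be read
    if not ret or count == 3:  # stop at end of video or after 3 frames
        break
    print(f'Frame {count} shape:', frame.shape)
    count += 1
cap.release()  # free the capture handle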
A
Frame 0 shape: (480, 640, 3)
Frame 1 shape: (480, 640, 3)
Frame 2 shape: (480, 640, 3)
B
Frame 0 shape: (640, 480, 3)
Frame 1 shape: (640, 480, 3)
Frame 2 shape: (640, 480, 3)
C
Frame 0 shape: (480, 640)
Frame 1 shape: (480, 640)
Frame 2 shape: (480, 640)
D. No output because of an error reading the video
💡 Hint

Remember that OpenCV reads frames as height x width x channels.
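The axis order in the hint can be checked without a video file. The sketch below uses a synthetic NumPy array as a stand-in for a decoded color frame; the 1280x720 size is an arbitrary assumption chosen for illustration and is not the size used in the question.

```python
import numpy as np

# Synthetic stand-in for a decoded color frame (assumed size: 1280 wide, 720 high).
# OpenCV returns color frames as NumPy arrays ordered (height, width, channels).
frame = np.zeros((720, 1280, 3), dtype=np.uint8)

height, width, channels = frame.shape
print("height:", height, "width:", width, "channels:", channels)

# Dropping the channel axis leaves a 2D (height, width) array,
# which is the shape a single-channel (grayscale) frame would have.
single_channel = frame[:, :, 0]
print("single channel shape:", single_channel.shape)
```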

Model Choice
advanced
Best model type for action recognition in videos

You want to build an AI that recognizes actions like running or jumping in videos. Which model type is best suited?

A. Convolutional Neural Network (CNN) applied only on single images
B. Recurrent Neural Network (RNN) processing video frame sequences
C. Simple linear regression model
D. 3D Convolutional Neural Network (3D CNN) that analyzes spatial and temporal data
💡 Hint

Think about a model that can understand both space (image) and time (video).

Metrics
advanced
Choosing the right metric for video classification

You trained a video classification model to label videos into categories. Which metric best shows how well your model predicts the correct category?

A. Mean Squared Error (MSE)
B. Accuracy
C. BLEU score
D. Perplexity
💡 Hint

Think about a metric that counts correct predictions out of total predictions.

🔧 Debug
expert
Debugging a video captioning model output error

You have a video captioning model that generates text descriptions. The output is always empty strings. What is the most likely cause?

A. The video frames are too large in resolution
B. The optimizer learning rate is too high
C. The model's vocabulary is empty or not loaded properly
D. The video file format is unsupported
💡 Hint

Think about what the model needs to produce words in captions.

Practice

1. What is the main goal of video understanding in AI?
easy
A. Teaching computers to watch and learn from videos
B. Making videos play faster on devices
C. Compressing videos to save space
D. Editing videos automatically

Solution

  1. Step 1: Understand the purpose of video understanding

    Video understanding means enabling computers to analyze and learn from video content.
  2. Step 2: Compare options to the definition

    Only option A, "Teaching computers to watch and learn from videos", matches this goal; the others relate to video playback, compression, or editing.
  3. Final Answer:

    Teaching computers to watch and learn from videos -> Option A
  4. Quick Check:

    Video understanding = Teaching computers to learn from videos [OK]
Hint: Focus on learning, not playback or editing [OK]
Common Mistakes:
  • Confusing video understanding with video editing
  • Thinking it's about video compression
  • Assuming it's about video playback speed
2. Which neural network type is commonly used for video understanding?
easy
A. Fully connected networks without convolution
B. 2D convolutional neural networks
C. Recurrent neural networks only
D. 3D convolutional neural networks

Solution

  1. Step 1: Identify network types used for video data

    Videos have spatial and temporal dimensions; 3D CNNs capture both.
  2. Step 2: Match network type to video understanding

    3D CNNs process frames over time, unlike 2D CNNs or fully connected nets.
  3. Final Answer:

    3D convolutional neural networks -> Option D
  4. Quick Check:

    3D CNNs capture space and time in videos [OK]
Hint: Remember 3D CNNs handle time and space in videos [OK]
Common Mistakes:
  • Choosing 2D CNNs which only see single frames
  • Ignoring temporal info by picking fully connected nets
  • Assuming RNNs alone are best for video frames
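To make "space and time" concrete, here is a minimal sketch of a naive 3D convolution written in NumPy. It is not how a deep learning library implements the operation; the function name `conv3d_valid`, the tiny 4-frame clips, and the hand-picked kernel are all assumptions made for illustration. The kernel subtracts one frame from the next, so it responds only when something changes between frames.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 3D cross-correlation over (time, height, width), no padding, stride 1."""
    t, h, w = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                window = clip[i:i + kt, j:j + kh, k:k + kw]
                out[i, j, k] = np.sum(window * kernel)
    return out

# Two hypothetical 4-frame, 5x5 grayscale clips.
static_clip = np.ones((4, 5, 5))       # nothing changes over time
moving_clip = np.zeros((4, 5, 5))
for frame_idx in range(4):
    moving_clip[frame_idx, 2, frame_idx] = 1.0  # a bright pixel moves right

# Kernel spanning 2 frames and a 3x3 patch: (next frame) minus (current frame).
kernel = np.zeros((2, 3, 3))
kernel[0] = -1.0
kernel[1] = 1.0

static_response = conv3d_valid(static_clip, kernel)
moving_response = conv3d_valid(moving_clip, kernel)

print("output shape:", moving_response.shape)
print("static clip has any response:", bool(np.any(static_response != 0)))
print("moving clip has any response:", bool(np.any(moving_response != 0)))
```

A 2D kernel applied to one frame at a time cannot produce this kind of response, because it never sees two frames together.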
3. Given this Python snippet for video data preprocessing, what is the shape of the output tensor?
import numpy as np
video = np.random.rand(16, 64, 64, 3)  # 16 frames, 64x64 size, 3 color channels
output = video.reshape(1, 16, 64, 64, 3)
medium
A. (16, 64, 64, 3)
B. (64, 64, 3, 16)
C. (1, 16, 64, 64, 3)
D. (16, 1, 64, 64, 3)

Solution

  1. Step 1: Understand the original video shape

    The video has shape (16, 64, 64, 3): 16 frames, each 64x64 pixels with 3 color channels.
  2. Step 2: Analyze the reshape operation

    The reshape adds a new dimension of size 1 at the front (the batch dimension) and leaves the total number of elements unchanged, giving shape (1, 16, 64, 64, 3).
  3. Final Answer:

    (1, 16, 64, 64, 3) -> Option C
  4. Quick Check:

    Reshape adds batch dimension = (1, 16, 64, 64, 3) [OK]
Hint: Look for added batch dimension in reshape [OK]
Common Mistakes:
  • Ignoring the added batch dimension
  • Mixing up order of dimensions
  • Assuming reshape changes total elements
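As a quick check of the reasoning above, the following sketch compares three common NumPy ways of adding a leading batch dimension. The variable names are illustrative; the array shape is the one from the question.

```python
import numpy as np

video = np.random.rand(16, 64, 64, 3)  # 16 frames, 64x64 size, 3 color channels

# Three equivalent ways to add a batch dimension of size 1 at the front.
batched_reshape = video.reshape(1, 16, 64, 64, 3)
batched_expand = np.expand_dims(video, axis=0)
batched_newaxis = video[np.newaxis, ...]

print(batched_reshape.shape)   # (1, 16, 64, 64, 3)
print(batched_expand.shape)    # (1, 16, 64, 64, 3)
print(batched_newaxis.shape)   # (1, 16, 64, 64, 3)

# The element count is unchanged; only the shape is different.
print(video.size == batched_reshape.size)  # True
```

Using `np.expand_dims` or `np.newaxis` avoids retyping the existing dimensions, which removes one source of ordering mistakes.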
4. This code snippet tries to create a 3D CNN layer but has an error. What is the mistake?
from tensorflow.keras.layers import Conv3D
layer = Conv3D(filters=32, kernel_size=(3,3), activation='relu')
medium
A. kernel_size should have three dimensions, e.g., (3,3,3)
B. Missing input shape argument
C. filters must be a list, not an integer
D. activation='relu' is not allowed in Conv3D

Solution

  1. Step 1: Check Conv3D kernel_size parameter

    Conv3D expects kernel_size to be a single integer or a tuple of three integers (depth, height, width).
  2. Step 2: Identify the error in kernel_size

    The code passes (3,3), which has only two values, so Keras raises a ValueError when the layer is created.
  3. Final Answer:

    kernel_size should have three dimensions, e.g., (3,3,3) -> Option A
  4. Quick Check:

    3D CNN kernel_size needs 3 values [OK]
Hint: 3D kernels need three numbers, not two [OK]
Common Mistakes:
  • Using 2D kernel size in 3D CNN
  • Thinking filters must be a list
  • Believing activation can't be relu
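The three-value rule can be illustrated without installing TensorFlow. The helper below is hypothetical and is not part of Keras; it only mirrors the rule for a layer with no padding and a stride of 1, which are the Keras defaults, and the input shape of 16 frames at 64x64 with 3 channels is an assumption borrowed from the previous problem.

```python
def conv3d_output_shape(input_shape, kernel_size, filters):
    """Output shape of a 3D convolution with no padding and stride 1.

    input_shape is (frames, height, width, channels).
    Hypothetical helper for illustration; not a Keras function.
    """
    if isinstance(kernel_size, int):
        kernel_size = (kernel_size,) * 3  # a single int applies to all three axes
    if len(kernel_size) != 3:
        raise ValueError(
            f"kernel_size must have 3 values (depth, height, width), got {kernel_size}"
        )
    frames, height, width, _ = input_shape
    k_depth, k_height, k_width = kernel_size
    return (frames - k_depth + 1, height - k_height + 1, width - k_width + 1, filters)

# A three-value kernel works.
print(conv3d_output_shape((16, 64, 64, 3), (3, 3, 3), filters=32))  # (14, 62, 62, 32)

# A two-value kernel is rejected, as in the question.
try:
    conv3d_output_shape((16, 64, 64, 3), (3, 3), filters=32)
except ValueError as err:
    print("Error:", err)
```

In Keras itself, the corrected line from the question would be `Conv3D(filters=32, kernel_size=(3,3,3), activation='relu')`.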
5. You want to train a video understanding model to recognize actions. Which data setup is best?
hard
A. Single images with labels, no temporal info
B. Video clips with labels and enough frames to see actions
C. Random frames from different videos without labels
D. Audio clips extracted from videos

Solution

  1. Step 1: Understand training data needs for action recognition

    Actions happen over time, so clips with multiple frames are needed.
  2. Step 2: Evaluate options for temporal and label info

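    Only option B supplies both a label and enough consecutive frames to capture an action. Single images and random frames lose the temporal order, unlabeled data gives the model no target to learn, and audio alone contains no visual motion.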
  3. Final Answer:

    Video clips with labels and enough frames to see actions -> Option B
  4. Quick Check:

    Training needs labeled clips with temporal info [OK]
Hint: Actions need multiple frames with labels [OK]
Common Mistakes:
  • Using single images without time info
  • Ignoring labels in training data
  • Using unrelated audio clips
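One common way to build such a dataset is to cut each labeled video into fixed-length, overlapping clips that inherit the video's label. The sketch below shows only the index bookkeeping; the function name, the clip length of 16, the stride of 8, and the two example videos are assumptions for illustration.

```python
def clip_frame_indices(num_frames, clip_len=16, stride=8):
    """Return one list of consecutive frame indices per clip.

    Clips that would run past the end of the video are dropped.
    """
    clips = []
    for start in range(0, num_frames - clip_len + 1, stride):
        clips.append(list(range(start, start + clip_len)))
    return clips

# Hypothetical labeled videos: (name, number of frames, action label).
labeled_videos = [
    ("video_a", 40, "running"),
    ("video_b", 20, "jumping"),
]

# Each training example keeps the temporal order of its frames and the label.
dataset = []
for name, num_frames, label in labeled_videos:
    for indices in clip_frame_indices(num_frames):
        dataset.append((name, indices, label))

for name, indices, label in dataset:
    print(name, f"frames {indices[0]}-{indices[-1]}", label)
```

The clip length should be long enough to cover the action being recognized; 16 frames is only a placeholder here.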