Action recognition basics in Computer Vision - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
What is the main input type for action recognition models?

Action recognition models analyze data to identify what action is happening. What kind of input data do these models mainly use?

A. Single images showing a moment in time
B. Text descriptions of actions
C. Audio recordings of sounds related to actions
D. Sequences of images or video clips showing movement over time
💡 Hint

Think about how you recognize actions yourself. Do you need just one picture or a series of pictures?

Model Choice
intermediate
Which model type is best suited for capturing temporal information in action recognition?

To understand actions, models must capture how things change over time. Which model type is designed to handle sequences and temporal data?

A. Feedforward Neural Networks without loops
B. Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks
C. Convolutional Neural Networks (CNNs) only
D. Support Vector Machines (SVMs)
💡 Hint

Think about models that remember past information to understand sequences.
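
To make the idea of "remembering past information" concrete, here is a minimal sketch in PyTorch. It assumes each frame has already been turned into a 32-dimensional feature vector; the class name FrameSequenceClassifier and all sizes are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class FrameSequenceClassifier(nn.Module):
    """Hypothetical model: an LSTM over per-frame feature vectors."""

    def __init__(self, feature_dim=32, hidden_dim=64, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_features):
        # frame_features: (batch, frames, feature_dim)
        _, (last_hidden, _) = self.lstm(frame_features)
        # last_hidden: (num_layers, batch, hidden_dim); use the final layer
        return self.head(last_hidden[-1])


model = FrameSequenceClassifier()
clip_features = torch.randn(8, 16, 32)  # 8 clips, 16 frames each (random stand-in data)
logits = model(clip_features)
print(logits.shape)  # torch.Size([8, 10])
```

The LSTM reads the frames in order and carries a hidden state from one frame to the next, so the final state summarizes the whole sequence rather than a single moment.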

Predict Output
advanced
What is the output shape of a 3D CNN model for action recognition given input shape (batch_size=8, frames=16, height=64, width=64, channels=3) and 10 action classes?

Consider a 3D CNN model that takes video clips as input. The input shape is (8, 16, 64, 64, 3) representing batch size, frames, height, width, and color channels. The model outputs predictions for 10 action classes. What is the shape of the output tensor?

input_shape = (8, 16, 64, 64, 3)
num_classes = 10
# Model outputs class probabilities for each video in the batch
A. (8, 16, 10)
B. (16, 10)
C. (8, 10)
D. (8, 64, 64, 10)
💡 Hint

The model predicts one action class per video clip in the batch.
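
One way to check your answer is to push a random tensor of the stated shape through a small model. The sketch below is an assumption, not the model from the question: Tiny3DCNN is a hypothetical one-layer network built only to trace shapes. Note that PyTorch's nn.Conv3d expects (batch, channels, frames, height, width), so the channels-last input has to be permuted first.

```python
import torch
import torch.nn as nn


class Tiny3DCNN(nn.Module):
    """Hypothetical minimal 3D CNN, intended for shape checking only."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)  # collapse frames, height, and width
        self.fc = nn.Linear(8, num_classes)

    def forward(self, clips):
        # clips arrive channels-last: (batch, frames, height, width, channels)
        x = clips.permute(0, 4, 1, 2, 3)  # -> (batch, channels, frames, height, width)
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)       # -> (batch, 8)
        return self.fc(x)                 # -> (batch, num_classes)


clips = torch.randn(8, 16, 64, 64, 3)  # random stand-in for real video clips
scores = Tiny3DCNN(num_classes=10)(clips)
print(scores.shape)
```

Applying softmax over the last dimension would turn these scores into class probabilities without changing the shape.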

Metrics
advanced
Which metric is most appropriate to evaluate an action recognition model on a balanced multi-class dataset?

You trained an action recognition model on a dataset with 10 balanced classes. Which metric best measures how well your model predicts the correct action?

A. Accuracy
B. Root Mean Squared Error (RMSE)
C. Mean Squared Error (MSE)
D. Precision for one class only
💡 Hint

Think about a metric that counts how many predictions are exactly right out of all predictions.
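
The metric described in the hint can be sketched in a few lines of plain Python. The function name accuracy and the label lists below are made up for illustration; they are not results from any real model.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that exactly match the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Made-up class indices for 8 clips (illustrative data only)
true_labels = [0, 1, 2, 3, 4, 5, 6, 7]
predicted_labels = [0, 1, 2, 3, 4, 5, 0, 1]
print(accuracy(predicted_labels, true_labels))  # 0.75
```

Six of the eight predictions match, so the score is 6 / 8 = 0.75. On a balanced dataset every class contributes equally to this count.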

🔧 Debug
expert
Why does this action recognition training code raise a shape mismatch error?

Consider this PyTorch training snippet for an action recognition model:

outputs = model(inputs)  # outputs shape: (8, 10)
labels = labels.unsqueeze(1)  # labels shape: (8, 1)
loss = criterion(outputs, labels)

Why does this code raise a shape mismatch error during loss calculation?

A. Because labels need to be a 1D tensor of shape (8,) for CrossEntropyLoss
B. Because outputs should have shape (8, 1) to match labels
C. Because inputs and labels have different batch sizes
D. Because criterion expects labels to be one-hot encoded
💡 Hint

Check the expected label shape for PyTorch's CrossEntropyLoss.
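
To follow the hint, you can reproduce the situation with random tensors. This sketch assumes criterion is nn.CrossEntropyLoss and that the labels are integer class indices; the random outputs stand in for model(inputs).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

criterion = nn.CrossEntropyLoss()
outputs = torch.randn(8, 10)          # stand-in for model(inputs)
labels = torch.randint(0, 10, (8,))   # integer class indices, shape (8,)

bad_labels = labels.unsqueeze(1)      # shape (8, 1), as in the snippet
try:
    criterion(outputs, bad_labels)
    raised = False
except (RuntimeError, ValueError) as err:
    raised = True
    print("Error:", err)

fixed_labels = bad_labels.squeeze(1)  # back to shape (8,)
loss = criterion(outputs, fixed_labels)
print(fixed_labels.shape, loss.dim())  # a 1D target gives a scalar loss
```

With outputs of shape (batch, classes), the class-index target needs one integer per sample, so dropping the extra dimension with squeeze(1) (or never adding it) lets the loss compute.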

Practice

1. What is the main goal of action recognition in computer vision?
easy
A. To generate captions for images
B. To detect objects in images
C. To enhance image resolution
D. To identify human movements in videos

Solution

  1. Step 1: Understand the purpose of action recognition

    Action recognition focuses on understanding what actions or movements humans perform in videos.
  2. Step 2: Compare with other tasks

    Detecting objects, generating captions, or enhancing resolution are different tasks unrelated to recognizing actions.
  3. Final Answer:

    To identify human movements in videos -> Option D
  4. Quick Check:

    Action recognition = Identify human movements [OK]
Hint: Action recognition = understanding human movements in videos [OK]
Common Mistakes:
  • Confusing action recognition with object detection
  • Thinking it generates image captions
  • Assuming it improves image quality
2. Which of the following is the correct way to represent a video input for an action recognition model?
easy
A. A sequence of image frames
B. A single grayscale image
C. A text description of the action
D. A 1D audio signal

Solution

  1. Step 1: Identify video data format

    Videos are made of many image frames shown in order, so a sequence of frames is the correct input.
  2. Step 2: Eliminate incorrect options

    A single image, a text description, or an audio signal does not capture the sequence of frames needed for action recognition.
  3. Final Answer:

    A sequence of image frames -> Option A
  4. Quick Check:

    Video input = sequence of frames [OK]
Hint: Videos = many frames in order, not single images [OK]
Common Mistakes:
  • Using a single image instead of multiple frames
  • Confusing video input with text or audio
  • Ignoring the temporal sequence of frames
3. Consider this Python snippet for extracting features from video frames for action recognition:
features = []
for frame in video_frames:
    feat = extract_features(frame)
    features.append(feat)
print(len(features))
If video_frames contains 10 frames, what will be the output?
medium
A. 10
B. 9
C. 0
D. Error

Solution

  1. Step 1: Understand the loop over frames

    The loop runs once for each frame in video_frames, which has 10 frames.
  2. Step 2: Count how many features are appended

    Each iteration appends one feature, so after 10 iterations, features has length 10.
  3. Final Answer:

    10 -> Option A
  4. Quick Check:

    Number of frames = features length = 10 [OK]
Hint: One feature per frame means length equals number of frames [OK]
Common Mistakes:
  • Off-by-one errors counting features
  • Assuming extract_features returns multiple items
  • Thinking the list is empty before print
4. You have this code snippet for action recognition training:
for video, label in dataset:
    features = extract_features(video)
    prediction = model.predict(features)
    loss = loss_function(prediction, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
The training loss does not decrease after many epochs. What is a likely error?
medium
A. Optimizer step is missing
B. Loss function is not called
C. Features are extracted frame-by-frame but model expects video clips
D. Labels are not used in prediction

Solution

  1. Step 1: Analyze feature extraction and model input

    If features are extracted frame-by-frame but the model expects a clip (multiple frames together), the model never sees how frames change over time, so it cannot learn motion cues and the loss can plateau.
  2. Step 2: Check other training steps

    Loss function is called, optimizer steps are present, and labels are used in loss, so these are correct.
  3. Final Answer:

    Features are extracted frame-by-frame but model expects video clips -> Option C
  4. Quick Check:

    Input format mismatch (frames vs. clips) = training loss stuck [OK]
Hint: Check if model input matches feature extraction format [OK]
Common Mistakes:
  • Ignoring input shape mismatch
  • Assuming loss or optimizer calls are missing
  • Not verifying label usage in loss
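
If the model expects clips, one possible fix is to group frames into fixed-length clips before feeding them in. The helper frames_to_clips below is a hypothetical name, and the clip length of 16 is an assumption for illustration.

```python
import numpy as np


def frames_to_clips(video, clip_len):
    """Split (frames, H, W, C) into non-overlapping clips of clip_len frames.

    Trailing frames that do not fill a whole clip are dropped.
    """
    num_clips = video.shape[0] // clip_len
    usable = video[: num_clips * clip_len]
    return usable.reshape(num_clips, clip_len, *video.shape[1:])


video = np.zeros((35, 64, 64, 3), dtype=np.float32)  # 35 placeholder frames
clips = frames_to_clips(video, clip_len=16)
print(clips.shape)  # (2, 16, 64, 64, 3)
```

Here 35 frames yield two clips of 16 frames each, and the last 3 frames are discarded. Overlapping (sliding-window) clips are another common choice when data is scarce.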
5. You want to improve an action recognition model that uses only spatial features from single frames. Which approach is best to capture motion information?
hard
A. Train on grayscale frames instead of color
B. Use 3D convolutional neural networks on video clips
C. Add dropout layers to the model
D. Increase image resolution of single frames

Solution

  1. Step 1: Understand spatial vs temporal features

    Spatial features come from single frames; motion requires temporal features across frames.
  2. Step 2: Identify model type capturing motion

    3D CNNs process multiple frames together: their kernels span time as well as height and width, so they can capture motion across frames.
  3. Step 3: Evaluate other options

    Increasing resolution, adding dropout, or switching to grayscale does not add motion information.
  4. Final Answer:

    Use 3D convolutional neural networks on video clips -> Option B
  5. Quick Check:

    3D CNNs capture motion = better action recognition [OK]
Hint: Motion needs temporal models like 3D CNNs, not just images [OK]
Common Mistakes:
  • Thinking higher resolution adds motion info
  • Confusing regular CNNs with 3D CNNs
  • Ignoring temporal dimension in videos
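
A small experiment shows why the temporal dimension matters. The sketch below is not a trained 3D CNN: it applies one fixed [-1, +1] filter along the time axis, which is the kind of pattern a 3D convolution kernel could learn. The synthetic clips and the helper names make_clip and temporal_energy are assumptions for illustration.

```python
import numpy as np


def make_clip(moving, frames=8, size=16):
    """Synthetic grayscale clip: one bright pixel, either static or moving right."""
    clip = np.zeros((frames, size, size), dtype=np.float32)
    for t in range(frames):
        col = t if moving else 0
        clip[t, size // 2, col] = 1.0
    return clip


def temporal_energy(clip):
    """Mean absolute response of a [-1, +1] filter applied along the time axis."""
    return float(np.abs(clip[1:] - clip[:-1]).mean())


static_clip = make_clip(moving=False)
moving_clip = make_clip(moving=True)

# Frame 0 is identical in both clips, so that frame alone carries no motion cue
print(np.array_equal(static_clip[0], moving_clip[0]))  # True
print(temporal_energy(static_clip))                    # 0.0
print(temporal_energy(moving_clip) > 0)                # True
```

The temporal filter responds only when pixels change between consecutive frames, which is exactly the signal a single-frame model cannot see.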