Challenge - 5 Problems
Multimodal Mastery
🧠 Conceptual
intermediate
Why use multimodal data in AI?
Why do AI systems combine text, images, and audio together?
💡 Hint
Think about how humans use multiple senses to understand the world.
Multimodal AI uses text, images, and audio together to get a fuller understanding, similar to how humans combine sight, hearing, and language.
❓ Predict Output
intermediate
Output of combining text and image features
What is the shape of the combined feature vector after concatenating a text feature vector of shape (1, 300) and an image feature vector of shape (1, 512)?
import numpy as np

text_feat = np.random.rand(1, 300)
image_feat = np.random.rand(1, 512)
combined_feat = np.concatenate((text_feat, image_feat), axis=1)
print(combined_feat.shape)
💡 Hint
Concatenation along axis=1 joins columns side by side.
Concatenating along axis=1 joins the feature dimensions while keeping the single row, so the result has shape (1, 812): 300 + 512 = 812 features.
❓ Model Choice
advanced
Best model type for multimodal AI
Which model architecture is best suited to process and combine text, image, and audio data in one system?
💡 Hint
Think about how to handle different data types before combining.
Separate encoders extract features from each modality, then a fusion layer combines them for joint understanding.
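The encoder-plus-fusion pattern described above can be sketched roughly as follows. This is a minimal illustration, not a trained model: the linear encoders, the random weights, the helper names (make_encoder, fuse), and the embedding sizes (64 per modality, 32 after fusion) are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, out_dim):
    """Return a hypothetical linear encoder: x -> tanh(x @ W)."""
    weights = rng.standard_normal((in_dim, out_dim)) * 0.01
    return lambda x: np.tanh(x @ weights)

# One encoder per modality; input sizes are illustrative assumptions.
encode_text = make_encoder(300, 64)
encode_image = make_encoder(512, 64)
encode_audio = make_encoder(128, 64)

# Hypothetical fusion layer: projects the concatenated embeddings
# (3 * 64 = 192 features) into a 32-dimensional joint representation.
fusion_weights = rng.standard_normal((192, 32)) * 0.01

def fuse(text, image, audio):
    """Encode each modality separately, concatenate, then project."""
    joint = np.concatenate(
        (encode_text(text), encode_image(image), encode_audio(audio)),
        axis=1,
    )
    return np.tanh(joint @ fusion_weights)

batch = 4
fused = fuse(
    rng.random((batch, 300)),
    rng.random((batch, 512)),
    rng.random((batch, 128)),
)
print(fused.shape)  # (4, 32)
```

Each modality keeps its own input size, and only the encoded outputs need a shared batch dimension before they are joined.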
❓ Metrics
advanced
Evaluating multimodal model performance
Which metric is most appropriate to evaluate a multimodal classification model that predicts categories from text, image, and audio inputs?
💡 Hint
Consider the task is classification, not generation or regression.
Accuracy, the fraction of predictions that match the true labels, is suitable for classification tasks; the input modalities do not change how the predicted categories are scored.
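As a rough illustration of the metric, accuracy can be computed by comparing predicted and true labels element by element. The label values below are made up for the example, and the accuracy helper is a hypothetical name, not part of any library.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Made-up class labels for five samples; four of five predictions match.
y_true = [0, 2, 1, 1, 0]
y_pred = [0, 2, 1, 0, 0]

score = accuracy(y_true, y_pred)
print(score)  # 0.8
```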
🔧 Debug
expert
Debugging multimodal data mismatch error
Given this code snippet, what error will occur when concatenating text and audio features with different batch sizes?
import numpy as np
text_feat = np.random.rand(4, 300)
audio_feat = np.random.rand(5, 128)
combined_feat = np.concatenate((text_feat, audio_feat), axis=1)
💡 Hint
Check if the batch sizes (first dimension) match before concatenation.
Concatenation along axis=1 requires every other dimension to match, so the first dimension (batch size) must be equal; here 4 != 5, and NumPy raises a ValueError.
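One way to surface this mismatch early is to check batch sizes before concatenating. The sketch below assumes a hypothetical helper, concat_features, which is not part of NumPy. Slicing the audio features to four rows is only for illustration; in a real pipeline the rows must correspond to the same samples.

```python
import numpy as np

def concat_features(*features):
    """Concatenate 2-D feature arrays along axis=1 after checking batch sizes."""
    batch_sizes = [f.shape[0] for f in features]
    if len(set(batch_sizes)) != 1:
        raise ValueError(f"batch sizes differ: {batch_sizes}")
    return np.concatenate(features, axis=1)

text_feat = np.random.rand(4, 300)
audio_feat = np.random.rand(5, 128)

# Mismatched batch sizes (4 vs 5) are rejected with a readable message.
try:
    concat_features(text_feat, audio_feat)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)  # batch sizes differ: [4, 5]

# With matching batch sizes the feature dimensions add: 300 + 128 = 428.
aligned = concat_features(text_feat, audio_feat[:4])
print(aligned.shape)  # (4, 428)
```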