Prompt Engineering / GenAI · ~20 mins

Why Multimodal AI Combines Text, Images, and Audio in Prompt Engineering / GenAI: Challenge Your Understanding

Challenge - 5 Problems
🧠 Conceptual
intermediate
Why use multimodal data in AI?
Why do AI systems combine text, images, and audio together?
A. Because combining different types of data helps AI understand information more like humans do.
B. Because using only one type of data is always faster and more accurate.
C. Because text data is always better than images and audio for AI tasks.
D. Because images and audio cannot be processed by AI without text.
💡 Hint
Think about how humans use multiple senses to understand the world.
Predict Output
intermediate
Output of combining text and image features
What is the shape of the combined feature vector after concatenating a text feature vector of shape (1, 300) and an image feature vector of shape (1, 512)?
import numpy as np
text_feat = np.random.rand(1, 300)
image_feat = np.random.rand(1, 512)
combined_feat = np.concatenate((text_feat, image_feat), axis=1)
print(combined_feat.shape)
A. (1, 300)
B. (1, 812)
C. (812, 1)
D. (300, 512)
💡 Hint
Concatenation along axis=1 joins columns side by side.
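To see the hint in action without giving away the answer above, here is a small runnable sketch (with toy arrays unrelated to the question) showing how the concatenation axis determines which dimensions must match and which dimensions add up:

```python
import numpy as np

a = np.ones((2, 3))
b = np.zeros((2, 4))

# axis=1 joins columns side by side: the row counts must match,
# and the column counts add (3 + 4 = 7).
print(np.concatenate((a, b), axis=1).shape)  # (2, 7)

# axis=0 stacks rows instead: the column counts must match,
# and the row counts add (2 + 5 = 7).
c = np.ones((5, 3))
print(np.concatenate((a, c), axis=0).shape)  # (7, 3)
```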
Model Choice
advanced
Best model type for multimodal AI
Which model architecture is best suited to process and combine text, image, and audio data in one system?
A. A decision tree trained only on text features.
B. A simple linear regression model trained on raw pixels only.
C. A model that uses separate encoders for each data type and a fusion layer to combine them.
D. A convolutional neural network designed only for images.
💡 Hint
Think about how to handle different data types before combining.
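A minimal NumPy sketch of the encoder-plus-fusion idea the hint points at. The encoder functions and feature sizes here are purely illustrative stand-ins (real systems would use learned networks per modality); the point is the shape of the architecture: one encoder per data type, then a fusion step that combines the per-modality features:

```python
import numpy as np

# Hypothetical per-modality encoders: each maps raw input for a batch
# of samples to a fixed-size feature vector (sizes are illustrative).
def encode_text(batch_size):
    return np.random.rand(batch_size, 300)

def encode_image(batch_size):
    return np.random.rand(batch_size, 512)

def encode_audio(batch_size):
    return np.random.rand(batch_size, 128)

# Simple fusion by concatenation: the fused vector carries information
# from all three modalities for each sample in the batch.
batch = 4
fused = np.concatenate(
    (encode_text(batch), encode_image(batch), encode_audio(batch)),
    axis=1,
)
print(fused.shape)  # (4, 940) — 300 + 512 + 128 features per sample
```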
Metrics
advanced
Evaluating multimodal model performance
Which metric is most appropriate to evaluate a multimodal classification model that predicts categories from text, image, and audio inputs?
A. Accuracy, because it measures correct predictions over total samples.
B. Mean Squared Error, because it measures distance between predicted and true values.
C. BLEU score, because it measures text generation quality.
D. Perplexity, because it measures language model uncertainty.
💡 Hint
Consider the task is classification, not generation or regression.
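As a quick reference for the hint: classification accuracy is computed the same way no matter which modalities fed the model. A toy sketch with made-up labels:

```python
import numpy as np

# Toy predicted vs. true class labels for a 3-class problem
# (the labels themselves are illustrative).
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 1, 2, 1, 0])

# Accuracy = correct predictions / total samples. It depends only on
# the labels, not on whether the inputs were text, image, or audio.
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 5 correct out of 6
```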
🔧 Debug
expert
Debugging multimodal data mismatch error
Given the code snippet below, what error will occur when concatenating text and audio features with different batch sizes?
import numpy as np
text_feat = np.random.rand(4, 300)
audio_feat = np.random.rand(5, 128)
combined_feat = np.concatenate((text_feat, audio_feat), axis=1)
A. TypeError: unsupported operand type(s) for +: 'int' and 'str'
B. No error, concatenation succeeds
C. IndexError: index out of range
D. ValueError: all the input array dimensions except for the concatenation axis must match exactly
💡 Hint
Check if the batch sizes (first dimension) match before concatenation.
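A runnable sketch of the hint's point: mismatched batch sizes (the first dimension) make column-wise concatenation fail, and the fix is to ensure both feature arrays come from the same batch of samples. The shapes mirror the question; the repaired batch size is illustrative:

```python
import numpy as np

text_feat = np.random.rand(4, 300)
audio_feat = np.random.rand(5, 128)

# Mismatched first dimensions (4 vs. 5): concatenating along axis=1
# requires all other dimensions to match exactly, so NumPy raises.
try:
    np.concatenate((text_feat, audio_feat), axis=1)
except ValueError as e:
    print("ValueError:", e)

# Fix: extract features for the same batch of samples in each modality.
audio_feat = np.random.rand(4, 128)
combined = np.concatenate((text_feat, audio_feat), axis=1)
print(combined.shape)  # (4, 428)
```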