Prompt Engineering / GenAI · ~20 mins

Why Multimodal AI Combines Text, Images, and Audio in Prompt Engineering / GenAI: Challenge Your Understanding

Challenge - 5 Problems
🧠 Conceptual
intermediate
Why use multimodal data in AI?
Why do AI systems combine text, images, and audio together?
A. Because combining different types of data helps AI understand information more like humans do.
B. Because using only one type of data is always faster and more accurate.
C. Because text data is always better than images and audio for AI tasks.
D. Because images and audio cannot be processed by AI without text.
💡 Hint
Think about how humans use multiple senses to understand the world.
Predict Output
intermediate
Output of combining text and image features
What is the shape of the combined feature vector after concatenating a text feature vector of shape (1, 300) and an image feature vector of shape (1, 512)?
import numpy as np
text_feat = np.random.rand(1, 300)
image_feat = np.random.rand(1, 512)
combined_feat = np.concatenate((text_feat, image_feat), axis=1)
print(combined_feat.shape)
A. (1, 300)
B. (1, 812)
C. (812, 1)
D. (300, 512)
💡 Hint
Concatenation along axis=1 joins columns side by side.
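To see the hint in action without giving away the answer above, here is a small runnable sketch (with toy arrays unrelated to the question) showing how the concatenation axis determines which dimensions must match and which dimensions add up:

```python
import numpy as np

a = np.ones((2, 3))
b = np.zeros((2, 4))

# axis=1 joins columns side by side: the row counts must match,
# and the column counts add (3 + 4 = 7).
print(np.concatenate((a, b), axis=1).shape)  # (2, 7)

# axis=0 stacks rows instead: the column counts must match,
# and the row counts add (2 + 5 = 7).
c = np.ones((5, 3))
print(np.concatenate((a, c), axis=0).shape)  # (7, 3)
```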
Model Choice
advanced
Best model type for multimodal AI
Which model architecture is best suited to process and combine text, image, and audio data in one system?
A. A decision tree trained only on text features.
B. A simple linear regression model trained on raw pixels only.
C. A model that uses separate encoders for each data type and a fusion layer to combine them.
D. A convolutional neural network designed only for images.
💡 Hint
Think about how to handle different data types before combining.
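A minimal NumPy sketch of the encoder-plus-fusion idea the hint points at. The encoder functions and feature sizes here are purely illustrative stand-ins (real systems would use learned networks per modality); the point is the shape of the architecture: one encoder per data type, then a fusion step that combines the per-modality features:

```python
import numpy as np

# Hypothetical per-modality encoders: each maps raw input for a batch
# of samples to a fixed-size feature vector (sizes are illustrative).
def encode_text(batch_size):
    return np.random.rand(batch_size, 300)

def encode_image(batch_size):
    return np.random.rand(batch_size, 512)

def encode_audio(batch_size):
    return np.random.rand(batch_size, 128)

# Simple fusion by concatenation: the fused vector carries information
# from all three modalities for each sample in the batch.
batch = 4
fused = np.concatenate(
    (encode_text(batch), encode_image(batch), encode_audio(batch)),
    axis=1,
)
print(fused.shape)  # (4, 940) — 300 + 512 + 128 features per sample
```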
Metrics
advanced
Evaluating multimodal model performance
Which metric is most appropriate to evaluate a multimodal classification model that predicts categories from text, image, and audio inputs?
A. Accuracy, because it measures correct predictions over total samples.
B. Mean Squared Error, because it measures distance between predicted and true values.
C. BLEU score, because it measures text generation quality.
D. Perplexity, because it measures language model uncertainty.
💡 Hint
Consider the task is classification, not generation or regression.
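As a quick reference for the hint: classification accuracy is computed the same way no matter which modalities fed the model. A toy sketch with made-up labels:

```python
import numpy as np

# Toy predicted vs. true class labels for a 3-class problem
# (the labels themselves are illustrative).
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 1, 2, 1, 0])

# Accuracy = correct predictions / total samples. It depends only on
# the labels, not on whether the inputs were text, image, or audio.
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 5 correct out of 6
```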
🔧 Debug
expert
Debugging multimodal data mismatch error
Given the code snippet below, what error will occur when concatenating text and audio features with different batch sizes?
import numpy as np
text_feat = np.random.rand(4, 300)
audio_feat = np.random.rand(5, 128)
combined_feat = np.concatenate((text_feat, audio_feat), axis=1)
A. TypeError: unsupported operand type(s) for +: 'int' and 'str'
B. No error, concatenation succeeds
C. IndexError: index out of range
D. ValueError: all the input array dimensions except for the concatenation axis must match exactly
💡 Hint
Check if the batch sizes (first dimension) match before concatenation.
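A runnable sketch of the hint's point: mismatched batch sizes (the first dimension) make column-wise concatenation fail, and the fix is to ensure both feature arrays come from the same batch of samples. The shapes mirror the question; the repaired batch size is illustrative:

```python
import numpy as np

text_feat = np.random.rand(4, 300)
audio_feat = np.random.rand(5, 128)

# Mismatched first dimensions (4 vs. 5): concatenating along axis=1
# requires all other dimensions to match exactly, so NumPy raises.
try:
    np.concatenate((text_feat, audio_feat), axis=1)
except ValueError as e:
    print("ValueError:", e)

# Fix: extract features for the same batch of samples in each modality.
audio_feat = np.random.rand(4, 128)
combined = np.concatenate((text_feat, audio_feat), axis=1)
print(combined.shape)  # (4, 428)
```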