Imagine you have a smart assistant that can answer questions using both text and images. What is the key benefit of combining multiple types of data (like text and images) in a RAG system?
Think about how combining different senses helps humans understand better.
Multimodal RAG uses both text and images to retrieve and generate answers, giving richer and more accurate responses than using just one type of data.
You want to build a Multimodal RAG system that can understand images and text together. Which model architecture should you choose to encode both types of data effectively?
Think about how to represent different data types in a way that they can be compared or combined.
A dual-encoder model with separate encoders for images and text allows the system to create embeddings for both modalities in the same space, enabling effective retrieval and fusion.
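A minimal sketch of the dual-encoder idea, using two hypothetical linear "encoders" (the weight matrices and input sizes here are made up for illustration; real systems such as CLIP-style models learn these projections jointly):

```python
import numpy as np

rng = np.random.default_rng(0)
W_image = rng.normal(size=(2, 4))  # image encoder: 4 raw features -> 2-D shared space
W_text = rng.normal(size=(2, 3))   # text encoder: 3 raw features -> 2-D shared space

def encode(W, x):
    """Project raw features into the shared space and L2-normalize."""
    v = W @ x
    return v / np.linalg.norm(v)

image_vec = encode(W_image, np.array([0.2, 0.5, 0.1, 0.7]))
text_vec = encode(W_text, np.array([0.9, 0.3, 0.4]))

# Because both embeddings live in the same space and are unit-length,
# a single dot product gives their cosine similarity.
similarity = float(image_vec @ text_vec)
print(similarity)
```

The key design point is that each modality gets its own encoder, but both map into the same embedding space, so cross-modal comparison reduces to a vector similarity.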
Given the following Python code that computes cosine similarity between image and text embeddings, what is the printed output?
import numpy as np
from numpy.linalg import norm

image_embedding = np.array([0.6, 0.8])
text_embedding = np.array([0.9, 0.1])

cosine_similarity = np.dot(image_embedding, text_embedding) / (
    norm(image_embedding) * norm(text_embedding)
)
print(round(cosine_similarity, 2))
Recall the cosine similarity formula: the dot product divided by the product of the norms.
The dot product is 0.6*0.9 + 0.8*0.1 = 0.54 + 0.08 = 0.62. The norms are sqrt(0.36 + 0.64) = 1.0 for image_embedding and sqrt(0.81 + 0.01) ≈ 0.9055 for text_embedding. So the similarity is 0.62 / (1.0 * 0.9055) ≈ 0.6847, which rounds to 0.68.
You want to measure how well your Multimodal RAG system retrieves relevant documents (text or images) for a query. Which metric should you use?
Think about how to check if the system finds the right items among its top guesses.
Recall@K measures whether the relevant item appears in the top K retrieved results, which is ideal for retrieval tasks in RAG systems.
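The metric is straightforward to compute by hand. A minimal sketch (the function name, ranked ID list, and relevance labels below are illustrative, not from any particular library):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for r in relevant_ids if r in top_k)
    return hits / len(relevant_ids)

# Example: the system returned document IDs in ranked order;
# documents 2 and 7 are the relevant ones for this query.
retrieved = [5, 2, 9, 7, 1]
relevant = [2, 7]
print(recall_at_k(retrieved, relevant, 3))  # only doc 2 is in the top 3 -> 0.5
print(recall_at_k(retrieved, relevant, 5))  # both docs are in the top 5 -> 1.0
```

Averaging this value over a set of evaluation queries gives the system-level Recall@K.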
Consider this simplified retrieval code snippet for a Multimodal RAG system. Why does it fail to retrieve relevant images?
def retrieve(query_embedding, image_embeddings):
    # Returns index of image with max dot-product similarity
    similarities = [
        sum(q * i for q, i in zip(query_embedding, img))
        for img in image_embeddings
    ]
    return similarities.index(max(similarities))

query = [0.5, 0.5]
images = [[0.6, 0.8], [0.9, 0.1], [0.1, 0.9]]
result = retrieve(query, images)
print(result)
Think about how cosine similarity differs from dot product and why normalization matters.
The raw dot product conflates direction (semantic content) with magnitude. If the embeddings are not normalized, a vector with a large norm can score highest even when its direction differs from the query's, causing wrong retrieval. L2-normalizing the embeddings first, which makes the dot product equal to cosine similarity, ranks candidates by direction alone.
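A minimal sketch of the failure mode, using made-up embeddings where one vector has a large magnitude but points away from the query: the raw dot product picks it anyway, while normalizing first recovers the directionally closest image.

```python
import math

def normalize(v):
    # Scale the vector to unit length so only its direction matters.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def retrieve_dot(query, embeddings):
    sims = [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]
    return sims.index(max(sims))

def retrieve_cosine(query, embeddings):
    q = normalize(query)
    sims = [sum(a * b for a, b in zip(q, normalize(emb))) for emb in embeddings]
    return sims.index(max(sims))

# Hypothetical embeddings: image 0 has a large magnitude but is misaligned
# with the query; image 1 is nearly parallel to it.
query = [1.0, 0.0]
images = [[10.0, 10.0], [0.9, 0.1]]

print(retrieve_dot(query, images))     # 0 -- the large norm wins despite misalignment
print(retrieve_cosine(query, images))  # 1 -- normalization ranks by direction
```

In practice many embedding models emit unit-length vectors, in which case dot product and cosine similarity coincide; the bug only bites when that assumption silently fails.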