
Multimodal RAG in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
What is the main advantage of using Multimodal Retrieval-Augmented Generation (RAG)?

Imagine you have a smart assistant that can answer questions using both text and images. What is the key benefit of combining multiple types of data (like text and images) in a RAG system?

A. It reduces the need for training data by using only images.
B. It makes the system faster by ignoring irrelevant data types.
C. It allows the system to understand and generate answers using richer information from different data types.
D. It limits the system to only text-based answers for simplicity.
💡 Hint

Think about how combining different senses helps humans understand better.

Model Choice
intermediate
Which model architecture is best suited for encoding both images and text in a Multimodal RAG system?

You want to build a Multimodal RAG system that can understand images and text together. Which model architecture should you choose to encode both types of data effectively?

A. A recurrent neural network (RNN) trained on text sequences only.
B. A single text-only transformer model trained on text captions of images.
C. A convolutional neural network (CNN) trained only on images without text input.
D. A dual-encoder model with separate encoders for images and text that produce embeddings in the same space.
💡 Hint

Think about how to represent different data types in a way that they can be compared or combined.

Predict Output
advanced
What is the output of this embedding similarity code snippet?

Given the following Python code that computes cosine similarity between image and text embeddings, what is the printed output?

import numpy as np
from numpy.linalg import norm

image_embedding = np.array([0.6, 0.8])
text_embedding = np.array([0.9, 0.1])

cosine_similarity = np.dot(image_embedding, text_embedding) / (norm(image_embedding) * norm(text_embedding))
print(round(cosine_similarity, 2))
A. 0.68
B. 0.75
C. 0.80
D. 0.50
💡 Hint

Recall cosine similarity formula: dot product divided by product of norms.
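The formula in the hint can be wrapped in a small reusable helper. This is a minimal sketch: the function name `cosine_similarity` and the example vectors are illustrative choices, and the zero-vector check is an added assumption about how to handle an undefined result.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Cosine similarity is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


# Illustrative vectors (not taken from the quiz question above).
orthogonal = cosine_similarity([1, 0], [0, 1])      # perpendicular directions
parallel = cosine_similarity([1, 2], [2, 4])        # same direction, different length
opposite = cosine_similarity([1, 0], [-1, 0])       # opposite directions

print(round(orthogonal, 2), round(parallel, 2), round(opposite, 2))
```

Because the norms divide out the vector lengths, `[1, 2]` and `[2, 4]` score as a perfect match even though one is twice as long as the other.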

Metrics
advanced
Which metric best evaluates the retrieval quality in a Multimodal RAG system?

You want to measure how well your Multimodal RAG system retrieves relevant documents (text or images) for a query. Which metric should you use?

A. Recall@K, which measures if the correct item is in the top K retrieved results.
B. Accuracy of classification labels on a test set.
C. BLEU score comparing generated text to reference text.
D. Mean Squared Error (MSE) between embeddings.
💡 Hint

Think about how to check if the system finds the right items among its top guesses.
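One way to check whether the right items appear among the top guesses is to compute Recall@K per query and average it. The sketch below assumes each query comes with a ranked list of retrieved IDs and a set of relevant IDs; the function names and the item IDs (`img_3`, `txt_1`, and so on) are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = retrieved_ids[:k]
    hits = len(relevant.intersection(top_k))
    return hits / len(relevant)


def mean_recall_at_k(queries, k):
    """Average Recall@K over (retrieved_ids, relevant_ids) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in queries]
    return sum(scores) / len(scores)


# Hypothetical evaluation set mixing image and text items.
queries = [
    (["img_3", "txt_1", "img_7"], {"img_7"}),  # relevant item ranked third
    (["txt_4", "img_2", "txt_9"], {"img_5"}),  # relevant item never retrieved
]

print(mean_recall_at_k(queries, k=1))  # 0.0
print(mean_recall_at_k(queries, k=3))  # 0.5
```

When each query has exactly one relevant item, the per-query score is either 0 or 1, which matches the "is the correct item in the top K" reading used in the question.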

🔧 Debug
expert
Why does this Multimodal RAG system fail to retrieve relevant images?

Consider this simplified retrieval code snippet for a Multimodal RAG system. Why does it fail to retrieve relevant images?

def retrieve(query_embedding, image_embeddings):
    # Returns index of image with max dot product similarity
    similarities = [sum(q * i for q, i in zip(query_embedding, img)) for img in image_embeddings]
    return similarities.index(max(similarities))

query = [0.5, 0.5]
images = [[0.6, 0.8], [0.9, 0.1], [0.1, 0.9]]
result = retrieve(query, images)
print(result)
A. The code incorrectly returns the minimum similarity index instead of maximum.
B. The code uses the raw dot product without normalizing the embeddings, so vector magnitude can distort the similarity ranking.
C. The code uses sum instead of product in similarity calculation, causing a TypeError.
D. The query embedding has wrong dimensions compared to image embeddings.
💡 Hint

Think about how cosine similarity differs from dot product and why normalization matters.
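To see why normalization matters, the sketch below ranks the same candidates with a raw dot product and with cosine similarity. The embeddings are made-up values chosen so that one candidate has a large magnitude but a poor direction match; the `score` parameter is an illustrative addition, not part of the snippet above.

```python
import math


def dot(a, b):
    """Raw dot product: grows with vector length as well as alignment."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by both vector lengths: depends on direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def retrieve(query_embedding, image_embeddings, score=cosine):
    """Return the index of the best-scoring image embedding."""
    similarities = [score(query_embedding, img) for img in image_embeddings]
    return similarities.index(max(similarities))


# Hypothetical embeddings: image 0 points almost the same way as the query,
# image 1 is long but sits at 45 degrees to it.
query = [1.0, 0.0]
images = [[0.9, 0.1], [3.0, 3.0]]

best_by_dot = retrieve(query, images, score=dot)
best_by_cosine = retrieve(query, images, score=cosine)
print(best_by_dot)     # 1 (the long vector wins on magnitude)
print(best_by_cosine)  # 0 (the better-aligned vector wins)
```

If all embeddings are already scaled to unit length, the two scores produce the same ranking, which is why many pipelines normalize once at indexing time.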

Practice

1. What is the main purpose of Multimodal RAG in AI systems?
easy
A. To generate images from text descriptions without retrieval
B. To translate languages using only text data
C. To combine text and images for better information retrieval and generation
D. To classify images into categories without text input

Solution

  1. Step 1: Understand the components of Multimodal RAG

    Multimodal RAG uses both text and image data to improve retrieval and generation tasks.
  2. Step 2: Identify the main goal

    The goal is to combine these data types to find and generate better answers than using text or images alone.
  3. Final Answer:

    To combine text and images for better information retrieval and generation -> Option C
  4. Quick Check:

    Multimodal RAG = combine text + images [OK]
Hint: Remember: Multimodal means multiple data types combined [OK]
Common Mistakes:
  • Thinking it only works with text
  • Confusing it with image-only models
  • Assuming it only generates images
2. Which of the following is the correct component setup for a Multimodal RAG system?
easy
A. Single encoder for both text and images, no retriever
B. Separate encoders for text and images, plus a retriever and a generator
C. Only a text encoder and a generator, no image processing
D. Only an image encoder and a retriever, no text input

Solution

  1. Step 1: Recall the architecture of Multimodal RAG

    It uses separate encoders for text and images to handle each data type properly.
  2. Step 2: Understand the role of retriever and generator

    The retriever finds relevant data, and the generator creates the final output combining both modalities.
  3. Final Answer:

    Separate encoders for text and images, plus a retriever and a generator -> Option B
  4. Quick Check:

    Separate encoders + retriever + generator = B [OK]
Hint: Look for separate encoders and both retriever and generator [OK]
Common Mistakes:
  • Assuming one encoder handles both text and images
  • Ignoring the retriever component
  • Thinking image processing is optional
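The component setup described in this solution can be sketched end to end with toy stand-ins. Everything here is an assumption made for illustration: the encoders are untrained random projections (so the similarity scores carry no meaning), the feature sizes are arbitrary, and `generate` is a placeholder for a real language-model call.

```python
import numpy as np

rng = np.random.default_rng(0)
SHARED_DIM = 4

# Hypothetical, untrained weights standing in for real text and image encoders.
W_TEXT = rng.normal(size=(6, SHARED_DIM))   # 6 toy text features -> shared space
W_IMAGE = rng.normal(size=(8, SHARED_DIM))  # 8 toy image features -> shared space


def _normalize(v):
    return v / np.linalg.norm(v)


def encode_text(text_features):
    """Text encoder: projects text features into the shared space."""
    return _normalize(text_features @ W_TEXT)


def encode_image(image_features):
    """Image encoder: projects image features into the same shared space."""
    return _normalize(image_features @ W_IMAGE)


def retrieve(query_emb, index, k=2):
    """Retriever: return the IDs of the k entries most similar to the query."""
    ranked = sorted(index, key=lambda item: float(query_emb @ item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def generate(question, doc_ids):
    """Generator placeholder: a real system would prompt a model with the documents."""
    return f"Answer to {question!r} using: {', '.join(doc_ids)}"


# Index holding both text and image items, all embedded in the shared space.
index = [
    ("caption_1", encode_text(rng.normal(size=6))),
    ("photo_1", encode_image(rng.normal(size=8))),
    ("photo_2", encode_image(rng.normal(size=8))),
]

query_emb = encode_text(rng.normal(size=6))
top_docs = retrieve(query_emb, index, k=2)
answer = generate("What is shown?", top_docs)
print(top_docs)
print(answer)
```

Because both encoders map into a space of the same size, a text query can be compared directly against image entries in the index.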
3. Given the following pseudocode for a Multimodal RAG retrieval step, what will be the output type?
text_embedding = text_encoder(text_input)
image_embedding = image_encoder(image_input)
combined_embedding = concatenate(text_embedding, image_embedding)
retrieved_docs = retriever.retrieve(combined_embedding)
print(type(retrieved_docs))
medium
A. <class 'int'>
B. <class 'dict'>
C. <class 'str'>
D. <class 'list'>

Solution

  1. Step 1: Understand the retriever output

    The retriever typically returns a list of documents or data items relevant to the query embedding.
  2. Step 2: Identify the output type printed

    Since retrieved_docs holds multiple documents, its type is a list.
  3. Final Answer:

    <class 'list'> -> Option D
  4. Quick Check:

    Retriever output = list of documents [OK]
Hint: Retriever returns a list of relevant documents [OK]
Common Mistakes:
  • Assuming output is a string or dictionary
  • Confusing embedding types with retrieval output
  • Expecting a single document instead of a list
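The pseudocode in this question can be made runnable with a small in-memory stand-in for the retriever. `ToyRetriever` is a hypothetical class written for this sketch, not a real library API, and the embeddings and document names are made-up values.

```python
import numpy as np


class ToyRetriever:
    """Hypothetical in-memory retriever that scores documents by dot product."""

    def __init__(self, doc_embeddings, documents):
        self.doc_embeddings = np.asarray(doc_embeddings, dtype=float)
        self.documents = list(documents)

    def retrieve(self, query_embedding, k=2):
        """Return the k highest-scoring documents as a Python list."""
        scores = self.doc_embeddings @ np.asarray(query_embedding, dtype=float)
        top_indices = np.argsort(-scores)[:k]
        return [self.documents[i] for i in top_indices]


# Stand-ins for text_encoder(text_input) and image_encoder(image_input).
text_embedding = np.array([0.2, 0.9])
image_embedding = np.array([0.7, 0.1])
combined_embedding = np.concatenate([text_embedding, image_embedding])  # shape (4,)

# Document embeddings must have the same length as the combined query.
retriever = ToyRetriever(
    doc_embeddings=[
        [0.1, 0.8, 0.6, 0.2],
        [0.9, 0.1, 0.1, 0.9],
        [0.3, 0.7, 0.8, 0.0],
    ],
    documents=["doc_a", "doc_b", "doc_c"],
)

retrieved_docs = retriever.retrieve(combined_embedding, k=2)
print(type(retrieved_docs))  # <class 'list'>
print(retrieved_docs)        # ['doc_c', 'doc_a']
```

Real retrievers vary in what they return (some wrap results in their own objects), so the list type here reflects this sketch and the convention assumed by the question.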
4. You have this code snippet for a Multimodal RAG generator:
def generate_answer(text, image):
    text_emb = text_encoder(text)
    image_emb = image_encoder(image)
    combined = text_emb + image_emb
    docs = retriever.retrieve(combined)
    answer = generator.generate(docs)
    return answer
What is the main error in this code?
medium
A. Using '+' to combine embeddings instead of concatenation
B. Missing image encoder call
C. Retriever should not be called before generator
D. Generator cannot take documents as input

Solution

  1. Step 1: Check how embeddings are combined

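    Embeddings from different modalities should be concatenated, not added, so that each modality's features are preserved.
  2. Step 2: Understand the impact of using the '+' operator

    On NumPy arrays or tensors, '+' sums the embeddings element-wise, which blends the two modalities into one vector and can lose modality-specific features. It also requires both embeddings to have compatible shapes.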
  3. Final Answer:

    Using '+' to combine embeddings instead of concatenation -> Option A
  4. Quick Check:

    Combine embeddings = concatenate, not add [OK]
Hint: Use concatenate, not plus, to combine embeddings [OK]
Common Mistakes:
  • Thinking '+' merges embeddings correctly
  • Ignoring the need for separate encoders
  • Assuming retriever or generator order is wrong
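The difference between the two ways of combining embeddings is easy to see with small arrays. This is a sketch with made-up values, assuming the encoders return NumPy arrays of equal length.

```python
import numpy as np

# Hypothetical encoder outputs.
text_emb = np.array([0.1, 0.2, 0.3])
image_emb = np.array([0.9, 0.8, 0.7])

# Element-wise addition keeps the dimension and blends the two modalities.
added = text_emb + image_emb
# Concatenation keeps every value from both modalities side by side.
concatenated = np.concatenate([text_emb, image_emb])

print(added.shape)         # (3,)
print(concatenated.shape)  # (6,)

# A completely different pair of embeddings can produce the same sum,
# so the retriever cannot tell the two inputs apart after addition.
other_text = np.array([0.5, 0.5, 0.5])
other_image = np.array([0.5, 0.5, 0.5])
same_sum = np.allclose(added, other_text + other_image)
same_concat = np.allclose(concatenated, np.concatenate([other_text, other_image]))
print(same_sum)     # True
print(same_concat)  # False
```

Concatenation doubles the query dimension, so the documents in the retriever's index need to be embedded with the same combined layout.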
5. You want to improve a Multimodal RAG system that sometimes misses relevant images when answering questions. Which approach is best to fix this?
hard
A. Train the image encoder with more diverse image-text pairs to improve embedding quality
B. Remove the retriever and rely only on the generator
C. Use only text data and ignore images to simplify the model
D. Replace the text encoder with a simpler model to speed up processing

Solution

  1. Step 1: Identify the cause of missing relevant images

    Low-quality image embeddings can cause the retriever to miss relevant images.
  2. Step 2: Choose the best fix

    Training the image encoder with more diverse data improves embedding quality and retrieval accuracy.
  3. Final Answer:

    Train the image encoder with more diverse image-text pairs to improve embedding quality -> Option A
  4. Quick Check:

    Better image encoder training = better retrieval [OK]
Hint: Improve encoder training with diverse data for better retrieval [OK]
Common Mistakes:
  • Removing retriever loses retrieval benefits
  • Ignoring images reduces multimodal power
  • Simplifying text encoder won't fix image retrieval