question = 'What is shown in the image?'
inputs = retriever.question_encoder.tokenizer(question, return_tensors='pt')
image_inputs = retriever.image_encoder.[1](image, return_tensors='pt')
retrieved_docs = retriever.get_relevant_documents([2])
outputs = model.generate(input_ids=inputs['input_ids'], [3]=retrieved_docs)

Practice

1. What is the main purpose of Multimodal RAG in AI systems?

easy

A. To generate images from text descriptions without retrieval

B. To translate languages using only text data

C. To combine text and images for better information retrieval and generation

D. To classify images into categories without text input

Solution

Step 1: Understand the components of Multimodal RAG
Multimodal RAG (retrieval-augmented generation) retrieves over both text and image data and passes what it finds to a generator.
Step 2: Identify the main goal
The goal is to combine these data types to find and generate better answers than using text or images alone.
Final Answer:
To combine text and images for better information retrieval and generation -> Option C
Quick Check:
Multimodal RAG = combine text + images [OK]

Hint: Multimodal means multiple data types combined.

Common Mistakes:

Thinking it only works with text
Confusing it with image-only models
Assuming it only generates images

2. Which of the following is the correct component setup for a Multimodal RAG system?

easy

A. Single encoder for both text and images, no retriever

B. Separate encoders for text and images, plus a retriever and a generator

C. Only a text encoder and a generator, no image processing

D. Only an image encoder and a retriever, no text input

Solution

Step 1: Recall the architecture of Multimodal RAG
A typical setup uses a separate encoder for each modality, so text and images are each embedded by a model suited to that data type.
Step 2: Understand the role of retriever and generator
The retriever uses those embeddings to find relevant items, and the generator produces the final output from the query and the retrieved text and image content.
Final Answer:
Separate encoders for text and images, plus a retriever and a generator -> Option B
Quick Check:
Separate encoders + retriever + generator = B [OK]

Hint: Look for separate encoders together with both a retriever and a generator.

Common Mistakes:

Assuming one encoder handles both text and images
Ignoring the retriever component
Thinking image processing is optional
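
The component setup from Option B can be sketched as a small pipeline. This is a minimal illustration, not a real library API: the class name MultimodalRAG, its field names, and the toy lambda stand-ins are all assumptions made so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class MultimodalRAG:
    """Hypothetical wiring of the four components named in Option B."""
    text_encoder: Callable[[str], Vector]            # text -> embedding
    image_encoder: Callable[[bytes], Vector]         # image -> embedding
    retriever: Callable[[Vector], List[str]]         # query embedding -> documents
    generator: Callable[[str, Sequence[str]], str]   # question + documents -> answer

    def answer(self, question: str, image: bytes) -> str:
        # Each modality goes through its own encoder.
        text_vec = self.text_encoder(question)
        image_vec = self.image_encoder(image)
        # Concatenate the two vectors into one query embedding.
        query = [*text_vec, *image_vec]
        docs = self.retriever(query)
        return self.generator(question, docs)


# Toy stand-ins (assumptions) so the sketch runs end to end.
rag = MultimodalRAG(
    text_encoder=lambda text: [float(len(text))],
    image_encoder=lambda image: [float(len(image))],
    retriever=lambda query: [f"doc for query of dimension {len(query)}"],
    generator=lambda question, docs: f"{question} -> {docs[0]}",
)
result = rag.answer("What is shown in the image?", b"\x00\x01")
print(result)
```

In a real system each lambda would be replaced by a trained model or an index lookup; the point here is only how the four parts connect.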

3. Given the following pseudocode for a Multimodal RAG retrieval step, what will be the output type?

text_embedding = text_encoder(text_input)
image_embedding = image_encoder(image_input)
combined_embedding = concatenate(text_embedding, image_embedding)
retrieved_docs = retriever.retrieve(combined_embedding)
print(type(retrieved_docs))

medium

A. <class 'int'>

B. <class 'dict'>

C. <class 'str'>

D. <class 'list'>

Solution

Step 1: Understand the retriever output
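The pseudocode does not define retriever.retrieve, so the answer rests on convention: a retriever typically returns a list of the documents or data items most relevant to the query embedding.
Step 2: Identify the output type printed
Under that convention, retrieved_docs holds a collection of documents, so print(type(retrieved_docs)) outputs <class 'list'>.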
Final Answer:
<class 'list'> -> Option D
Quick Check:
Retriever output = list of documents [OK]

Hint: A retriever returns a list of relevant documents.

Common Mistakes:

Assuming output is a string or dictionary
Confusing embedding types with retrieval output
Expecting a single document instead of a list
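
To make the pseudocode above concrete, here is a minimal sketch of a retriever whose retrieve method returns a list. ToyRetriever, its in-memory index, and the cosine helper are hypothetical names for illustration; a real retriever would usually query a vector store instead.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyRetriever:
    """Hypothetical in-memory retriever over (document, embedding) pairs."""

    def __init__(self, index: List[Tuple[str, List[float]]]):
        self.index = index

    def retrieve(self, query: List[float], k: int = 2) -> List[str]:
        # Rank every document by similarity to the query, keep the top k.
        ranked = sorted(self.index, key=lambda item: cosine(query, item[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


retriever = ToyRetriever([
    ("cat photo caption", [1.0, 0.0, 1.0, 0.0]),
    ("dog photo caption", [0.0, 1.0, 0.0, 1.0]),
    ("cat article", [1.0, 0.0, 0.0, 0.0]),
])
combined_embedding = [1.0, 0.0, 1.0, 0.0]
retrieved_docs = retriever.retrieve(combined_embedding)
print(type(retrieved_docs))  # <class 'list'>
```

Even with k=1 this sketch returns a one-element list rather than a bare document, which is why a list is the expected type.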

4. You have this code snippet for a Multimodal RAG generator:

def generate_answer(text, image):
    text_emb = text_encoder(text)
    image_emb = image_encoder(image)
    combined = text_emb + image_emb
    docs = retriever.retrieve(combined)
    answer = generator.generate(docs)
    return answer

What is the main error in this code?

medium

A. Using '+' to combine embeddings instead of concatenation

B. Missing image encoder call

C. Retriever should not be called before generator

D. Generator cannot take documents as input

Solution

Step 1: Check how embeddings are combined
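In this exercise, embeddings from different modalities are expected to be concatenated rather than added, so that each modality's features stay in their own dimensions.
Step 2: Understand the impact of using '+' operator
Adding embeddings sums values element-wise. It only works when both vectors have the same dimension, and it blends the two modalities into shared coordinates, so different text-image pairs can produce the same combined vector and modality-specific features are lost.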
Final Answer:
Using '+' to combine embeddings instead of concatenation -> Option A
Quick Check:
Combine embeddings = concatenate, not add [OK]

Hint: Use concatenation, not '+', to combine embeddings.

Common Mistakes:

Thinking '+' merges embeddings correctly
Ignoring the need for separate encoders
Assuming retriever or generator order is wrong
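
The difference between the two ways of combining embeddings can be seen in a small sketch. It uses plain Python lists as stand-in embeddings, and the helper names add_embeddings and concat_embeddings are assumptions for illustration. Note that on plain Python lists '+' already concatenates, so the element-wise addition is written out explicitly to mirror what '+' does on tensors or arrays.

```python
from typing import List


def add_embeddings(a: List[float], b: List[float]) -> List[float]:
    """Element-wise addition, as '+' would do on tensors or arrays."""
    if len(a) != len(b):
        raise ValueError("element-wise addition needs equal dimensions")
    return [x + y for x, y in zip(a, b)]


def concat_embeddings(a: List[float], b: List[float]) -> List[float]:
    """Concatenation keeps each modality in its own dimensions."""
    return [*a, *b]


# Two different text-image pairs (toy values).
text_a, image_a = [1.0, 0.0], [0.0, 1.0]
text_b, image_b = [0.0, 1.0], [1.0, 0.0]

added_a = add_embeddings(text_a, image_a)
added_b = add_embeddings(text_b, image_b)
concat_a = concat_embeddings(text_a, image_a)
concat_b = concat_embeddings(text_b, image_b)

print(added_a == added_b)    # True: addition cannot tell the pairs apart
print(concat_a == concat_b)  # False: concatenation keeps them distinct
```

Concatenation doubles the vector size here, so the retrieval index would need to store vectors of the same combined dimension.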

5. You want to improve a Multimodal RAG system that sometimes misses relevant images when answering questions. Which approach is best to fix this?

hard

A. Train the image encoder with more diverse image-text pairs to improve embedding quality

B. Remove the retriever and rely only on the generator

C. Use only text data and ignore images to simplify the model

D. Replace the text encoder with a simpler model to speed up processing

Solution

Step 1: Identify the cause of missing relevant images
Low-quality image embeddings can cause the retriever to miss relevant images.
Step 2: Choose the best fix
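Training the image encoder on more diverse image-text pairs tends to improve embedding quality, and with it retrieval accuracy. The other options remove or weaken the components that retrieval depends on.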
Final Answer:
Train the image encoder with more diverse image-text pairs to improve embedding quality -> Option A
Quick Check:
Better image encoder training = better retrieval [OK]

Hint: Improve encoder training with diverse data for better retrieval.

Common Mistakes:

Removing the retriever, which loses the benefits of retrieval
Ignoring images, which reduces the system's multimodal capability
Simplifying the text encoder, which does not fix image retrieval
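
Before and after retraining the image encoder, it helps to measure how often relevant images are actually retrieved. The sketch below computes recall@k, a common retrieval metric that the quiz itself does not mention; the function name recall_at_k and the image IDs are made-up examples.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranking returned by the retriever for one question.
retrieved_ids = ["img_7", "img_2", "img_9", "img_4"]
# Hypothetical ground truth: the images that should have been found.
relevant_ids = {"img_2", "img_4"}

recall_top2 = recall_at_k(retrieved_ids, relevant_ids, k=2)
recall_top4 = recall_at_k(retrieved_ids, relevant_ids, k=4)
print(recall_top2, recall_top4)  # 0.5 1.0
```

A low recall@k on questions that need images points at the image embeddings as the weak spot, which is the case Option A addresses.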

Multimodal RAG in Prompt Engineering / GenAI - Interactive Code Practice
