Imagine you have a smart assistant that can answer questions using both text and images. What is the key benefit of combining multiple types of data (like text and images) in a RAG system?
Think about how combining different senses helps humans understand better.
Multimodal RAG uses both text and images to retrieve and generate answers, giving richer and more accurate responses than using just one type of data.
You want to build a Multimodal RAG system that can understand images and text together. Which model architecture should you choose to encode both types of data effectively?
Think about how to represent different data types in a way that they can be compared or combined.
A dual-encoder model with separate encoders for images and text allows the system to create embeddings for both modalities in the same space, enabling effective retrieval and fusion.
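A minimal sketch of the dual-encoder idea, using two hypothetical linear "encoders" (the weight matrices and input sizes here are made up for illustration; real systems such as CLIP-style models learn these projections jointly):

```python
import numpy as np

rng = np.random.default_rng(0)
W_image = rng.normal(size=(2, 4))  # image encoder: 4 raw features -> 2-D shared space
W_text = rng.normal(size=(2, 3))   # text encoder: 3 raw features -> 2-D shared space

def encode(W, x):
    """Project raw features into the shared space and L2-normalize."""
    v = W @ x
    return v / np.linalg.norm(v)

image_vec = encode(W_image, np.array([0.2, 0.5, 0.1, 0.7]))
text_vec = encode(W_text, np.array([0.9, 0.3, 0.4]))

# Because both embeddings live in the same space and are unit-length,
# a single dot product gives their cosine similarity.
similarity = float(image_vec @ text_vec)
print(similarity)
```

The key design point is that each modality gets its own encoder, but both map into the same embedding space, so cross-modal comparison reduces to a vector similarity.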
Given the following Python code that computes cosine similarity between image and text embeddings, what is the printed output?
import numpy as np
from numpy.linalg import norm

image_embedding = np.array([0.6, 0.8])
text_embedding = np.array([0.9, 0.1])

cosine_similarity = np.dot(image_embedding, text_embedding) / (
    norm(image_embedding) * norm(text_embedding)
)
print(round(cosine_similarity, 2))
Recall the cosine similarity formula: the dot product divided by the product of the norms.
The dot product is 0.6*0.9 + 0.8*0.1 = 0.54 + 0.08 = 0.62. The norms are sqrt(0.36 + 0.64) = 1.0 for image_embedding and sqrt(0.81 + 0.01) ≈ 0.9055 for text_embedding. So the similarity is 0.62 / (1.0 * 0.9055) ≈ 0.6847, which rounds to 0.68.
You want to measure how well your Multimodal RAG system retrieves relevant documents (text or images) for a query. Which metric should you use?
Think about how to check if the system finds the right items among its top guesses.
Recall@K measures whether the relevant item appears in the top K retrieved results, which is ideal for retrieval tasks in RAG systems.
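The metric is straightforward to compute by hand. A minimal sketch (the function name, ranked ID list, and relevance labels below are illustrative, not from any particular library):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for r in relevant_ids if r in top_k)
    return hits / len(relevant_ids)

# Example: the system returned document IDs in ranked order;
# documents 2 and 7 are the relevant ones for this query.
retrieved = [5, 2, 9, 7, 1]
relevant = [2, 7]
print(recall_at_k(retrieved, relevant, 3))  # only doc 2 is in the top 3 -> 0.5
print(recall_at_k(retrieved, relevant, 5))  # both docs are in the top 5 -> 1.0
```

Averaging this value over a set of evaluation queries gives the system-level Recall@K.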
Consider this simplified retrieval code snippet for a Multimodal RAG system. Why does it fail to retrieve relevant images?
def retrieve(query_embedding, image_embeddings):
    # Returns index of image with max dot-product similarity
    similarities = [
        sum(q * i for q, i in zip(query_embedding, img))
        for img in image_embeddings
    ]
    return similarities.index(max(similarities))

query = [0.5, 0.5]
images = [[0.6, 0.8], [0.9, 0.1], [0.1, 0.9]]
result = retrieve(query, images)
print(result)
Think about how cosine similarity differs from dot product and why normalization matters.
The raw dot product conflates direction (semantic content) with magnitude. If the embeddings are not normalized, a vector with a large norm can score highest even when its direction differs from the query's, causing wrong retrieval. L2-normalizing the embeddings first, which makes the dot product equal to cosine similarity, ranks candidates by direction alone.
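A minimal sketch of the failure mode, using made-up embeddings where one vector has a large magnitude but points away from the query: the raw dot product picks it anyway, while normalizing first recovers the directionally closest image.

```python
import math

def normalize(v):
    # Scale the vector to unit length so only its direction matters.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def retrieve_dot(query, embeddings):
    sims = [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]
    return sims.index(max(sims))

def retrieve_cosine(query, embeddings):
    q = normalize(query)
    sims = [sum(a * b for a, b in zip(q, normalize(emb))) for emb in embeddings]
    return sims.index(max(sims))

# Hypothetical embeddings: image 0 has a large magnitude but is misaligned
# with the query; image 1 is nearly parallel to it.
query = [1.0, 0.0]
images = [[10.0, 10.0], [0.9, 0.1]]

print(retrieve_dot(query, images))     # 0 -- the large norm wins despite misalignment
print(retrieve_cosine(query, images))  # 1 -- normalization ranks by direction
```

In practice many embedding models emit unit-length vectors, in which case dot product and cosine similarity coincide; the bug only bites when that assumption silently fails.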