
Multimodal RAG in Prompt Engineering / GenAI - Model Pipeline Trace

Model Pipeline - Multimodal RAG

Multimodal RAG (retrieval-augmented generation) answers questions over text and images: it retrieves relevant documents using embeddings of both modalities, then generates an answer grounded in what it retrieved.

Data Flow - 5 Stages
1. Input Data
Input: 1000 samples with text and images
Step: Collect paired text and image data for questions and documents
Output: 1000 samples with text and image data
Example: Question: 'What is shown in this picture?' + Image of a cat
2. Preprocessing
Input: 1000 samples with text and images
Step: Clean text, resize images, and normalize both
Output: 1000 samples with cleaned text and processed images
Example: Text: 'What is shown?' -> 'what is shown'; Image resized to 224x224 pixels
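The preprocessing step can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the function names (`clean_text`, `resize_nearest`, `normalize`) are hypothetical, the image is assumed to be a 2D list of grayscale values in 0..255, and nearest-neighbor resizing is an assumption, since the source does not say how images are resized.

```python
import string

def clean_text(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

def resize_nearest(image, size=224):
    """Nearest-neighbor resize of a 2D list of pixels to size x size (assumed method)."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[r * src_h // size][c * src_w // size] for c in range(size)]
        for r in range(size)
    ]

def normalize(image, max_value=255.0):
    """Scale pixel values into the range [0, 1]."""
    return [[px / max_value for px in row] for row in image]

# Toy 2x2 grayscale image standing in for a real photo.
tiny = [[0, 255], [255, 0]]
processed = normalize(resize_nearest(tiny))

print(clean_text("What is shown?"))       # what is shown
print(len(processed), len(processed[0]))  # 224 224
```

In practice an image library would handle decoding, color channels, and interpolation; the sketch only shows the shape contract of the step.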
3. Feature Extraction
Input: 1000 samples with cleaned text and processed images
Step: Convert text to embeddings and images to feature vectors
Output: 1000 samples with text embeddings (768 dims) and image embeddings (512 dims)
Example: Text embedding vector: [0.12, -0.05, ...]; Image embedding vector: [0.34, 0.78, ...]
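To make the shape change concrete, the sketch below maps any text or image to a fixed-length vector. The encoders here are deterministic hash-based stand-ins, not learned models, and carry no semantic meaning; `embed_text`, `embed_image`, and `_toy_vector` are hypothetical names. Only the dimensions (768 and 512) come from the stage description.

```python
import hashlib
import math

TEXT_DIM = 768   # text embedding size from the stage description
IMAGE_DIM = 512  # image embedding size from the stage description

def _toy_vector(data: bytes, dim: int) -> list:
    """Deterministic pseudo-embedding; a placeholder for a learned encoder."""
    values = []
    counter = 0
    while len(values) < dim:
        digest = hashlib.sha256(data + counter.to_bytes(4, "big")).digest()
        values.extend(b / 255.0 - 0.5 for b in digest)  # bytes -> [-0.5, 0.5]
        counter += 1
    values = values[:dim]
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]  # L2-normalize

def embed_text(text: str) -> list:
    return _toy_vector(text.encode("utf-8"), TEXT_DIM)

def embed_image(pixels) -> list:
    """pixels: 2D list of floats in [0, 1], as produced by normalization."""
    flat = bytes(min(255, max(0, int(px * 255))) for row in pixels for px in row)
    return _toy_vector(flat, IMAGE_DIM)

text_vec = embed_text("what is shown")
image_vec = embed_image([[0.0, 1.0], [1.0, 0.0]])
print(len(text_vec), len(image_vec))  # 768 512
```

Whatever the input length or image size, the output dimensions stay fixed, which is what lets the retrieval stage compare samples directly.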
4. Retrieval
Input: 1000 samples with text and image embeddings
Step: Retrieve top 5 relevant documents using combined embeddings
Output: 1000 samples with 5 retrieved documents each
Example: Retrieved docs: ['Doc1 text', 'Doc2 text', ...]
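A minimal sketch of top-k retrieval over combined embeddings follows. The source does not say how the two embeddings are combined or scored, so two assumptions are made here: the vectors are concatenated, and documents are ranked by cosine similarity. The names `combine`, `cosine`, and `retrieve`, and the tiny 2-dimensional vectors, are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def combine(text_vec, image_vec):
    """Assumed fusion: plain concatenation of text and image vectors."""
    return list(text_vec) + list(image_vec)

def retrieve(query_vec, index, k=5):
    """Return up to k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy index with 2-dim text and 2-dim image vectors per document.
index = {
    "doc_cat": combine([1.0, 0.0], [1.0, 0.0]),
    "doc_dog": combine([0.0, 1.0], [0.0, 1.0]),
    "doc_sofa": combine([1.0, 0.0], [0.0, 1.0]),
}
query = combine([1.0, 0.0], [0.9, 0.1])
top = retrieve(query, index, k=2)
print([doc_id for doc_id, _ in top])  # ['doc_cat', 'doc_sofa']
```

A linear scan like this is fine for 1000 samples; larger collections usually call for an approximate nearest-neighbor index.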
5. Fusion and Generation
Input: 1000 samples with retrieved documents and embeddings
Step: Fuse multimodal info and generate answer using a language model
Output: 1000 samples with generated text answers
Example: Answer: 'The image shows a cat sitting on a sofa.'
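One simple way to fuse the retrieved documents with the image information is at the prompt level, sketched below. This is an assumption: the source does not specify the fusion mechanism, and a real system might instead pass image features to the model directly. `build_prompt`, `generate_answer`, and `echo_llm` are hypothetical names, and `echo_llm` is a placeholder that does not call any real model.

```python
from typing import Callable, List

def build_prompt(question: str, image_description: str, docs: List[str]) -> str:
    """Assemble retrieved text and image information into a single prompt."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "Answer the question using the context and the image description.\n"
        f"Context:\n{context}\n"
        f"Image: {image_description}\n"
        f"Question: {question}\n"
        "Answer:"
    )

def generate_answer(question: str, image_description: str, docs: List[str],
                    llm: Callable[[str], str]) -> str:
    """Run any text-in, text-out model on the fused prompt."""
    return llm(build_prompt(question, image_description, docs)).strip()

def echo_llm(prompt: str) -> str:
    """Placeholder model: reports the prompt size instead of answering."""
    return f"(stub answer for a prompt of {len(prompt.splitlines())} lines)"

prompt = build_prompt(
    "What is shown in this picture?",
    "a cat on a sofa",
    ["Doc1 text", "Doc2 text"],
)
print(prompt)
print(generate_answer("What is shown in this picture?", "a cat on a sofa",
                      ["Doc1 text", "Doc2 text"], echo_llm))
```

Passing the model in as a callable keeps the sketch independent of any particular language-model API.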
Training Trace - Epoch by Epoch

Epoch 1: ************ (1.2)
Epoch 2: *********    (0.9)
Epoch 3: *******      (0.7)
Epoch 4: *****        (0.55)
Epoch 5: ****         (0.45)
Epoch | Loss ↓ | Accuracy ↑ | Observation
1 | 1.2 | 0.45 | Model starts learning, loss high, accuracy low
2 | 0.9 | 0.60 | Loss decreases, accuracy improves
3 | 0.7 | 0.72 | Model learns better multimodal relations
4 | 0.55 | 0.80 | Loss continues to drop, accuracy rises
5 | 0.45 | 0.85 | Good convergence, model ready for predictions
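The per-epoch numbers can be checked with a little arithmetic. The sketch below computes the change between consecutive epochs and redraws the bar chart; the rule of one asterisk per full 0.1 of loss is inferred from the chart, not stated in the source, and the helper names are hypothetical.

```python
losses = [1.2, 0.9, 0.7, 0.55, 0.45]
accuracies = [0.45, 0.60, 0.72, 0.80, 0.85]

def deltas(values):
    """Change between consecutive epochs, rounded to two decimals."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]

def bar(loss):
    """One asterisk per full 0.1 of loss (inferred from the chart)."""
    return "*" * (round(loss * 100) // 10)

loss_changes = deltas(losses)
accuracy_gains = deltas(accuracies)

print(loss_changes)    # [-0.3, -0.2, -0.15, -0.1]
print(accuracy_gains)  # [0.15, 0.12, 0.08, 0.05]
for epoch, loss in enumerate(losses, start=1):
    print(f"Epoch {epoch}: {bar(loss):<12} ({loss})")
```

Both sequences shrink in magnitude from one epoch to the next, which is consistent with the convergence noted for epoch 5.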
Prediction Trace - 5 Layers
Layer 1: Input
Layer 2: Preprocessing
Layer 3: Feature Extraction
Layer 4: Retrieval
Layer 5: Fusion and Generation
Model Quiz - 3 Questions
Test your understanding
What happens to the data shape after feature extraction?
A. Data shape increases to include raw pixels
B. Text and image converted to embeddings with fixed dimensions
C. Text is removed and only images remain
D. Data shape stays the same as input
Key Insight
Multimodal RAG converts text and images into embeddings, retrieves the most relevant documents, and generates an answer from both. In the training trace, loss falls from 1.2 to 0.45 and accuracy rises from 0.45 to 0.85 over five epochs, which suggests the model is learning from the combined data types.