Model Pipeline - Multimodal RAG
Multimodal RAG (retrieval-augmented generation) answers questions by retrieving relevant text and images, then generating an answer that draws on both types of data.
Epoch 1: ************ (1.2)
Epoch 2: ********* (0.9)
Epoch 3: ******* (0.7)
Epoch 4: ***** (0.55)
Epoch 5: **** (0.45)
| Epoch | Loss ↓ | Accuracy ↑ | Observation |
|---|---|---|---|
| 1 | 1.2 | 0.45 | Model starts learning, loss high, accuracy low |
| 2 | 0.9 | 0.60 | Loss decreases, accuracy improves |
| 3 | 0.7 | 0.72 | Model learns better multimodal relations |
| 4 | 0.55 | 0.80 | Loss continues to drop, accuracy rises |
| 5 | 0.45 | 0.85 | Good convergence, model ready for predictions |
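To make the "convergence" observation in the table concrete, the short sketch below computes how much the loss drops and the accuracy rises from one epoch to the next. It uses only the numbers from the table; the variable names are illustrative and not part of any library.

```python
# Values copied from the training table above.
epochs = [1, 2, 3, 4, 5]
loss = [1.2, 0.9, 0.7, 0.55, 0.45]
accuracy = [0.45, 0.60, 0.72, 0.80, 0.85]

# Change between consecutive epochs, rounded to hide float noise.
loss_drop = [round(prev - cur, 2) for prev, cur in zip(loss, loss[1:])]
acc_gain = [round(cur - prev, 2) for prev, cur in zip(accuracy, accuracy[1:])]

for epoch, drop, gain in zip(epochs[1:], loss_drop, acc_gain):
    print(f"Epoch {epoch}: loss -{drop:.2f}, accuracy +{gain:.2f}")

# Each step improves less than the one before it, which is the usual
# sign that training is leveling off.
shrinking = all(a > b for a, b in zip(loss_drop, loss_drop[1:]))
print("Improvements shrink every epoch:", shrinking)
```

The per-epoch improvements get smaller each time (0.30, 0.20, 0.15, 0.10 for loss), which is what the last row of the table means by good convergence.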
text_embedding = text_encoder(text_input)
image_embedding = image_encoder(image_input)
combined_embedding = concatenate(text_embedding, image_embedding)
retrieved_docs = retriever.retrieve(combined_embedding)
print(type(retrieved_docs))
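The retrieval snippet above relies on a `text_encoder`, an `image_encoder`, and a `retriever` that it does not define. The following is a minimal runnable sketch of the same flow under stated assumptions: the two encoders are toy stand-ins built with NumPy (not real models), `Retriever` is a hypothetical in-memory class that ranks documents by dot product, and the captions and images are made-up sample data.

```python
import numpy as np

DIM = 8  # per-modality embedding size, chosen arbitrarily for this sketch


def text_encoder(text: str) -> np.ndarray:
    """Toy stand-in for a text model: a unit vector seeded from the characters."""
    rng = np.random.default_rng(sum(ord(ch) for ch in text))
    vec = rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)


def image_encoder(image: np.ndarray) -> np.ndarray:
    """Toy stand-in for an image model: a fixed random projection of the pixels."""
    projection = np.random.default_rng(0).normal(size=(DIM, image.size))
    vec = projection @ image.ravel()
    return vec / np.linalg.norm(vec)


def embed(text: str, image: np.ndarray) -> np.ndarray:
    """Concatenate both modality embeddings into one vector of length 2 * DIM."""
    return np.concatenate([text_encoder(text), image_encoder(image)])


class Retriever:
    """Hypothetical in-memory retriever that ranks documents by dot product."""

    def __init__(self, docs, embeddings):
        self.docs = list(docs)
        self.embeddings = np.stack(embeddings)

    def retrieve(self, query: np.ndarray, k: int = 2) -> list:
        scores = self.embeddings @ query
        best = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
        return [self.docs[i] for i in best]


# Made-up corpus: each document is a caption paired with a small random "image".
pixel_rng = np.random.default_rng(42)
captions = ["a red bicycle", "a bowl of soup", "a mountain lake"]
images = [pixel_rng.random((4, 4)) for _ in captions]
retriever = Retriever(captions, [embed(c, im) for c, im in zip(captions, images)])

# Query with the first document's own text and image.
combined_embedding = embed(captions[0], images[0])
retrieved_docs = retriever.retrieve(combined_embedding)
print(combined_embedding.shape)
print(type(retrieved_docs))
print(retrieved_docs[0])
```

Because the query is built from the first document's own caption and image, that document scores highest. In this sketch `retrieve` returns a plain Python list, so the `type` check prints `<class 'list'>`; a real retriever may return a different container.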
def generate_answer(text, image):
    text_emb = text_encoder(text)
    image_emb = image_encoder(image)
    combined = text_emb + image_emb
    docs = retriever.retrieve(combined)
    answer = generator.generate(docs)
    return answer
What is the main error in this code?