Model Pipeline - CLIP (vision-language model)
CLIP (Contrastive Language-Image Pre-training) is a model that learns to connect images and text. It is trained to match each image in a batch with its correct text description, learning a shared embedding space in which related images and captions sit close together.
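The matching objective described above can be sketched as a symmetric contrastive (InfoNCE-style) loss: a batch of paired image and text embeddings is compared all-against-all, and each image is pushed toward its own caption and away from the others. This is a minimal NumPy sketch, not CLIP's actual implementation; the function name and the choice of NumPy are illustrative, though the 0.07 temperature is the initialization reported for CLIP.

```python
import numpy as np

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embeds, text_embeds: (N, D) arrays where row i is a matched pair.
    temperature: softmax temperature for the similarity logits.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry [i, j] compares image i with text j.
    logits = img @ txt.T / temperature

    def cross_entropy_diag(l):
        # Cross-entropy with the diagonal (the true pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

When the matched pairs are the most similar entries in the batch, the loss approaches zero; shuffling the text rows so pairs no longer line up makes it rise.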
Figure: training loss vs. epochs (loss falls from ~2.3 at epoch 1 to ~0.3 at epoch 20).
| Epoch | Loss ↓ | Accuracy ↑ | Observation |
|---|---|---|---|
| 1 | 2.3 | 0.12 | High loss and low accuracy as model starts learning |
| 5 | 1.1 | 0.45 | Loss decreasing and accuracy improving steadily |
| 10 | 0.6 | 0.70 | Model learning meaningful image-text relations |
| 15 | 0.4 | 0.82 | Good convergence with high accuracy |
| 20 | 0.3 | 0.88 | Low loss and high accuracy; model is well trained |
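The accuracy column above can be read as a batch matching rate. Assuming "accuracy" here means the fraction of images whose highest-similarity text is their own caption (an assumption; the table does not define the metric), it can be computed from the similarity matrix like this:

```python
import numpy as np

def matching_accuracy(logits):
    """Fraction of images whose most similar text is their own caption.

    logits: (N, N) image-text similarity matrix with true pairs
    on the diagonal (hypothetical metric definition).
    """
    predictions = np.argmax(logits, axis=1)  # best-matching text per image
    return float(np.mean(predictions == np.arange(logits.shape[0])))
```

With an identity-like similarity matrix (every image most similar to its own caption), the accuracy is 1.0; each off-diagonal entry that dominates its row lowers it by 1/N.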