Imagine you have a smart assistant that looks at pictures and tells you what it sees in simple sentences. What is the main job of this assistant?
Think about what it means to 'describe' an image in words.
An image captioning model looks at an image and produces a sentence or phrase that describes what is in the image. This helps people understand the image content through text.
Given the following simplified code that uses a pre-trained image captioning model, what will be printed?
image = load_image('dog_park.jpg')
caption = model.generate_caption(image)
print(caption)
The model generates a text description of the image, not just the filename.
The code loads an image and uses the model to generate a caption describing the image content. The print statement outputs that caption.
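To make the idea concrete, here is a minimal runnable sketch of that snippet. `load_image` and `CaptionModel` are hypothetical stand-ins, not a real library API; a real model would produce the caption from the image pixels rather than a lookup table.

```python
# Toy stand-in for a pre-trained captioning model (hypothetical API).
class CaptionModel:
    """Maps known images to fixed captions, for illustration only."""
    def __init__(self, captions):
        self.captions = captions

    def generate_caption(self, image):
        # A real model would run the image through a neural network.
        return self.captions.get(image, "an image")

def load_image(path):
    # Stand-in loader: a real one would return pixel data.
    return path

model = CaptionModel({"dog_park.jpg": "a dog playing in a park"})

image = load_image("dog_park.jpg")
caption = model.generate_caption(image)
print(caption)  # a dog playing in a park
```

The printed output is the generated description of the image content, not the filename.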
You want to build a system that looks at images and writes sentences describing them. Which model type is most appropriate?
Think about how images and sentences are processed differently and how to combine them.
A CNN encodes the image into feature vectors, and an RNN decodes those features into a sequence of words. Combining a CNN encoder with an RNN decoder lets the model both understand the image and produce a text description.
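The encoder–decoder pattern can be sketched with toy stand-ins. A real system would use an actual CNN (e.g. a ResNet) and an RNN/LSTM; here `cnn_encode` and `rnn_decode` are hypothetical functions that only mimic the shape of the pipeline: encode the image to a feature, then emit words one at a time conditioned on the previous word.

```python
def cnn_encode(image_pixels):
    """Stand-in 'CNN': summarize the image as a single feature value."""
    return sum(image_pixels) / len(image_pixels)

def rnn_decode(feature, max_words=5):
    """Stand-in 'RNN': emit words step by step, each conditioned on
    the image feature and the previously generated word."""
    next_word = {"bright": "sunny", "sunny": "park"}
    word = "bright" if feature > 0.5 else "dark"
    words = [word]
    while len(words) < max_words and words[-1] in next_word:
        words.append(next_word[words[-1]])
    return " ".join(words)

feature = cnn_encode([0.9, 0.8, 0.7])
print(rnn_decode(feature))  # bright sunny park
```

The key design point this mirrors is the split of responsibilities: the encoder handles spatial image structure, the decoder handles sequential word generation.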
After training an image captioning model, you want to measure how good its descriptions are compared to human-written captions. Which metric should you use?
Think about metrics that compare text similarity.
BLEU score measures how closely generated captions match human-written references by counting overlapping n-grams (clipped word-sequence matches), making it a standard metric for caption quality evaluation.
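The core idea behind BLEU can be illustrated with clipped unigram precision. This is a simplification: real BLEU combines precisions over several n-gram sizes and adds a brevity penalty (libraries such as NLTK's `nltk.translate.bleu_score` implement the full metric).

```python
def unigram_precision(candidate, reference):
    """Fraction of candidate words that appear in the reference,
    with each reference word usable at most once (clipping)."""
    ref_counts = {}
    for w in reference.split():
        ref_counts[w] = ref_counts.get(w, 0) + 1
    cand = candidate.split()
    matches = 0
    for w in cand:
        if ref_counts.get(w, 0) > 0:
            matches += 1
            ref_counts[w] -= 1  # clip: consume the matched reference word
    return matches / len(cand)

generated = "a dog in the park"
human = "a dog runs in the park"
print(unigram_precision(generated, human))  # 1.0: every generated word matches
```

Higher overlap with the human caption yields a higher score, which is exactly the behavior a caption-quality metric needs.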
Consider this simplified code snippet, where the model generates a caption that repeats the same word:
caption = model.generate_caption(image)
print(caption)  # Output: "dog dog dog dog dog"
What is the most likely cause?
Think about how the model chooses words during caption generation.
If beam search or greedy decoding is faulty (for example, the decoder is fed the same input token at every step instead of the last generated word), the model can get stuck emitting one word over and over. This is a common decoding bug in sequence generation models.
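A minimal sketch of this failure mode, using a hypothetical toy "model" that simply maps the previous word to the next word. The buggy loop never updates the decoder input, so it repeats the first word; the fixed loop feeds each generated word back in.

```python
# Hypothetical next-word table standing in for the decoder.
next_word = {"<start>": "dog", "dog": "runs", "runs": "fast"}

def buggy_decode(steps=5):
    prev = "<start>"
    out = []
    for _ in range(steps):
        out.append(next_word.get(prev, "<end>"))
        # BUG: `prev` is never updated, so the same word repeats.
    return " ".join(out)

def fixed_decode(steps=5):
    prev = "<start>"
    out = []
    for _ in range(steps):
        word = next_word.get(prev)
        if word is None:
            break
        out.append(word)
        prev = word  # feed the generated word back as the next input
    return " ".join(out)

print(buggy_decode())  # dog dog dog dog dog
print(fixed_decode())  # dog runs fast
```

The symptom in the question ("dog dog dog dog dog") is exactly what the buggy loop produces: the decoder's state is not advanced between steps.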