Which of the following lists the correct main stages of a typical text recognition pipeline in order?
Think about first finding where text is, then breaking it down, then reading it.
The pipeline starts by detecting text regions, then segments characters or words, then recognizes the text, and finally applies post-processing to improve accuracy.
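The four stages can be sketched as a chain of function calls. This is a minimal illustration with toy stand-in implementations (the function bodies are hypothetical, not a real OCR library):

```python
def detect_text_regions(image):
    # toy stand-in for detection: treat each line of a string "image" as a region
    return image.splitlines()

def segment(regions):
    # toy stand-in for segmentation: split each region into word units
    return [word for region in regions for word in region.split()]

def recognize(unit):
    # toy stand-in for recognition: "read" the unit (identity here)
    return unit

def post_process(tokens):
    # toy stand-in for post-processing: simple normalization
    return " ".join(t.lower() for t in tokens)

def recognize_text(image):
    regions = detect_text_regions(image)      # 1. detection: locate text
    units = segment(regions)                  # 2. segmentation: break into units
    raw = [recognize(u) for u in units]       # 3. recognition: read each unit
    return post_process(raw)                  # 4. post-processing: clean up

print(recognize_text("HELLO World"))  # hello world
```

In a real system each stage would be a trained model (e.g. a detector, then a recognizer), but the data flow between stages is the same.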
Given the following PyTorch CNN feature extractor code for text recognition, what is the shape of the output tensor if the input image batch has shape (8, 1, 32, 128)?
```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),   # halves H and W
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)    # halves H and W again
        )

    def forward(self, x):
        return self.conv(x)

model = FeatureExtractor()
input_tensor = torch.randn(8, 1, 32, 128)
output = model(input_tensor)
output.shape
```
Each MaxPool2d halves height and width. Calculate step by step.
Input height 32 → after first pool 16 → after second pool 8. Width 128 → 64 → 32. The second Conv2d outputs 128 channels, so the output shape is (8, 128, 8, 32).
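The shape arithmetic can be checked without running the network. Each kernel-2, stride-2 max-pool floor-divides the spatial dimension by 2, and the channel count is set by the last Conv2d:

```python
def pooled(dim, num_pools=2, factor=2):
    # each 2x2 max-pool with stride 2 floor-divides the dimension by 2
    for _ in range(num_pools):
        dim //= factor
    return dim

batch, channels = 8, 128          # batch size; final Conv2d has 128 output channels
h = pooled(32)                    # 32 -> 16 -> 8
w = pooled(128)                   # 128 -> 64 -> 32
print((batch, channels, h, w))    # (8, 128, 8, 32)
```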
Which model architecture is best suited for recognizing variable-length text sequences in images, such as handwritten words or license plates?
Think about models that handle sequences and variable lengths.
A CRNN-style architecture: the CNN extracts visual features, the RNN models sequence dependencies, and CTC loss allows variable-length outputs without per-character alignment.
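The decoding step that makes CTC handle variable lengths is simple to illustrate in isolation. This is a minimal sketch of greedy (best-path) CTC decoding in pure Python, with a made-up vocabulary and a made-up sequence of per-frame argmax ids standing in for a CRNN's output:

```python
BLANK = 0  # CTC reserves one label id for the blank symbol

def ctc_greedy_decode(frame_ids, id_to_char):
    """Collapse repeated ids, then drop blanks (standard CTC best-path decoding)."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != BLANK:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

vocab = {1: "c", 2: "a", 3: "t"}
# hypothetical per-frame predictions: repeats and blanks encode alignment
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]
print(ctc_greedy_decode(frames, vocab))  # cat
```

Because repeats collapse and blanks are removed, a 9-frame prediction can decode to a 3-character word, which is exactly how CTC sidesteps explicit character alignment.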
Which metric is most appropriate to evaluate the accuracy of a text recognition model that outputs sequences of characters?
Consider metrics that compare predicted text sequences to ground truth text.
Character Error Rate (CER) measures the number of character insertions, deletions, and substitutions needed to transform the prediction into the ground truth, normalized by the reference length, making it ideal for text recognition.
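CER is the character-level edit (Levenshtein) distance divided by the reference length. A minimal pure-Python implementation:

```python
def levenshtein(ref, hyp):
    # dynamic-programming edit distance counting insertions,
    # deletions, and substitutions (single-row variant)
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution (0 if match)
    return d[-1]

def cer(reference, prediction):
    # character error rate: edit distance normalized by reference length
    return levenshtein(reference, prediction) / max(len(reference), 1)

print(cer("kitten", "sitting"))  # 3 edits / 6 reference chars = 0.5
```

A CER of 0 means a perfect transcription; values above 1 are possible when the prediction needs more edits than the reference has characters.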
A text recognition model trained on clear printed text images performs poorly on handwritten text images. Which is the most likely cause?
Think about differences between training and testing data.
This is a distribution mismatch (domain shift): models trained on one style of text often fail on very different styles, requiring domain adaptation or more diverse training data.