
Text recognition pipeline in Computer Vision - Model Pipeline Trace

Model Pipeline - Text recognition pipeline

This pipeline takes photos that contain text and turns them into strings a computer can work with. It first preprocesses the image, then finds the regions that contain text, reads the characters in each region, and finally outputs the recognized text.

Data Flow - 8 Stages
1. Input Image
Input: 1 image x 256 x 256 pixels x 3 color channels
Operation: Raw photo input with text
Output: 1 image x 256 x 256 pixels x 3 color channels
Example: Photo of a street sign with letters
2. Preprocessing
Input: 1 image x 256 x 256 x 3
Operation: Convert to grayscale and normalize pixel values
Output: 1 image x 256 x 256 x 1
Example: Grayscale image with pixel values between 0 and 1
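The preprocessing step can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function name `preprocess` is hypothetical, the image is a plain nested list to avoid dependencies, and the grayscale weights (0.299, 0.587, 0.114) are an assumed common choice, since the source does not say which conversion is used.

```python
def preprocess(image):
    """Convert an H x W x 3 RGB image (values 0-255) to H x W x 1 in [0, 1]."""
    gray = []
    for row in image:
        gray_row = []
        for r, g, b in row:
            # Weighted sum of the color channels (assumed luma weights).
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            # Scale to [0, 1]; the clamp guards against floating-point rounding.
            value = min(1.0, max(0.0, luma / 255.0))
            gray_row.append([value])  # keep a trailing channel axis of size 1
        gray.append(gray_row)
    return gray
```

In practice this would usually be done with an array library such as NumPy or OpenCV; the loop form is only meant to make the shape change from 3 channels to 1 visible.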
3. Text Detection
Input: 1 image x 256 x 256 x 1
Operation: Find bounding boxes around text areas
Output: 1 image x 256 x 256 x 1 + bounding box coordinates
Example: Boxes around words like 'STOP' and 'SPEED'
4. Text Cropping
Input: Bounding boxes + image
Operation: Crop image regions inside bounding boxes
Output: N cropped images x 32 x 128 x 1 (N = number of text boxes)
Example: Small images each containing one word
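A rough sketch of the cropping step follows. The source does not say how boxes are encoded or how crops are brought to 32 x 128, so two assumptions are made here: boxes are `(x_min, y_min, x_max, y_max)` tuples in pixel coordinates, and resizing uses nearest-neighbor sampling. The function names `crop_and_resize` and `crop_words` are hypothetical.

```python
def crop_and_resize(image, box, out_h=32, out_w=128):
    """Crop one box from an H x W x 1 image and resize it to out_h x out_w x 1."""
    x0, y0, x1, y1 = box  # assumed format: (x_min, y_min, x_max, y_max)
    crop_h, crop_w = y1 - y0, x1 - x0
    resized = []
    for i in range(out_h):
        # Nearest-neighbor sampling: map each output row to a source row.
        src_y = y0 + (i * crop_h) // out_h
        row = []
        for j in range(out_w):
            src_x = x0 + (j * crop_w) // out_w
            row.append(list(image[src_y][src_x]))  # copy the pixel
        resized.append(row)
    return resized


def crop_words(image, boxes):
    """Return one fixed-size crop per bounding box (N boxes -> N crops)."""
    return [crop_and_resize(image, box) for box in boxes]
```

Resizing every crop to the same size is what lets the later stages process all N words as one batch.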
5. Feature Extraction
Input: N cropped images x 32 x 128 x 1
Operation: Extract features using CNN layers
Output: N feature maps x 8 x 32 x 64 channels
Example: Feature maps highlighting edges and shapes of letters
6. Sequence Modeling
Input: N feature maps x 8 x 32 x 64
Operation: Use RNN layers to understand letter sequences
Output: N sequences x 32 time steps x 256 features
Example: Sequences representing letter order in words
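The source does not spell out how an 8 x 32 x 64 feature map becomes a 32-step sequence. One common approach, assumed here, is to treat each of the 32 columns as a time step and concatenate the 8 rows of 64 channels in that column into a single 512-value vector. An RNN would then map each 512-value step to the 256 features listed above. The helper name `feature_map_to_sequence` is hypothetical.

```python
def feature_map_to_sequence(fmap):
    """Turn a feature map fmap[h][w][c] into a sequence indexed by column w.

    Each time step is the concatenation of all rows at that column,
    so an 8 x 32 x 64 map becomes 32 steps of 8 * 64 = 512 values.
    """
    height = len(fmap)
    width = len(fmap[0])
    sequence = []
    for w in range(width):
        step = []
        for h in range(height):
            step.extend(fmap[h][w])  # append the 64 channels of row h
        sequence.append(step)
    return sequence
```

Reading the columns left to right is what gives the RNN the letters in the order they appear in the word.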
7. Prediction
Input: N sequences x 32 x 256
Operation: Apply fully connected layer + softmax to predict characters
Output: N sequences x 32 time steps x 37 classes (26 letters + 10 digits + blank)
Example: Probabilities for each character at each time step
8. Decoding
Input: N sequences x 32 x 37
Operation: Convert probabilities to text using CTC decoding
Output: N text strings
Example: Recognized words like 'STOP' and 'SPEED'
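The decoding stage can be illustrated with greedy (best-path) CTC decoding: pick the most likely class at each time step, merge consecutive repeats, then drop blanks. This is a minimal sketch and one of several CTC decoding strategies (beam search is another). The class ordering, with letters first, digits next, and blank last at index 36, is an assumption, as are the names `ALPHABET`, `BLANK`, and `ctc_greedy_decode`.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"  # assumed class order
BLANK = len(ALPHABET)  # assumed blank index: 36, giving 37 classes in total


def ctc_greedy_decode(probs):
    """Decode one sequence of per-step class probabilities into a string.

    probs: list of time steps, each a list of 37 probabilities.
    """
    # 1. Take the most likely class at every time step.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]

    # 2. Merge consecutive repeats, then 3. drop blanks.
    chars = []
    previous = None
    for index in best_path:
        if index != previous and index != BLANK:
            chars.append(ALPHABET[index])
        previous = index
    return "".join(chars)
```

The blank class is what allows doubled letters: a best path of S, P, E, blank, E, D decodes to 'SPEED', while S, P, E, E, D without the blank collapses to 'SPED'.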
Training Trace - Epoch by Epoch
Loss
2.3 |****
1.8 |***
1.4 |**
1.1 |*
0.9 |*
0.8 |*
     +---------
     Epochs 1-6
Epoch | Loss ↓ | Accuracy ↑ | Observation
1     | 2.3    | 0.25       | Model starts learning, loss is high, accuracy low
2     | 1.8    | 0.40       | Loss decreases, accuracy improves
3     | 1.4    | 0.55       | Model learns letter shapes better
4     | 1.1    | 0.65       | Better sequence understanding
5     | 0.9    | 0.72       | Model converging, good text recognition
6     | 0.8    | 0.76       | Small improvements, nearing stable performance
Prediction Trace - 7 Layers
Layer 1: Input Image
Layer 2: Text Detection
Layer 3: Text Cropping
Layer 4: Feature Extraction
Layer 5: Sequence Modeling
Layer 6: Prediction
Layer 7: Decoding
Model Quiz - 3 Questions
Test your understanding
What is the purpose of the Text Detection stage?
A. To predict the letters in the text
B. To convert the image to grayscale
C. To find where text is located in the image
D. To crop the image into smaller pieces
Key Insight
This visualization shows how a text recognition model processes images step-by-step, improving its ability to read text by learning features and sequences, and finally decoding predictions into readable words.