CNN architecture for image classification in PyTorch - Model Pipeline Trace

Model Pipeline - CNN architecture for image classification

This pipeline uses a Convolutional Neural Network (CNN) to classify images into categories. It starts with raw images, processes them through layers that detect patterns, trains the model to improve accuracy, and finally predicts the class of new images.

Data Flow - 13 Stages
Shapes are written as images x channels x height x width (or images x features after flattening).

| # | Stage | Input shape | Operation | Output shape | Note |
|---|---|---|---|---|---|
| 1 | Input Images | 1000 x 3 x 32 x 32 | Load raw image data from the dataset | 1000 x 3 x 32 x 32 | An image of a cat is a 3D array of pixel colors |
| 2 | Normalization | 1000 x 3 x 32 x 32 | Scale pixel values from 0-255 to 0-1 | 1000 x 3 x 32 x 32 | Pixel value 128 becomes 128/255 ≈ 0.502 |
| 3 | Convolutional Layer 1 | 1000 x 3 x 32 x 32 | 16 filters of size 3x3, stride 1, padding 1 | 1000 x 16 x 32 x 32 | Detects edges and simple shapes |
| 4 | ReLU Activation | 1000 x 16 x 32 x 32 | Apply ReLU to add non-linearity | 1000 x 16 x 32 x 32 | Negative values become 0; positive values stay the same |
| 5 | Max Pooling | 1000 x 16 x 32 x 32 | Take the max over 2x2 regions with stride 2 | 1000 x 16 x 16 x 16 | Reduces image size while keeping important features |
| 6 | Convolutional Layer 2 | 1000 x 16 x 16 x 16 | 32 filters of size 3x3, stride 1, padding 1 | 1000 x 32 x 16 x 16 | Detects more complex patterns |
| 7 | ReLU Activation | 1000 x 32 x 16 x 16 | Apply ReLU | 1000 x 32 x 16 x 16 | Keeps positive activations |
| 8 | Max Pooling | 1000 x 32 x 16 x 16 | 2x2 max pooling with stride 2 | 1000 x 32 x 8 x 8 | Further reduces spatial size |
| 9 | Flatten | 1000 x 32 x 8 x 8 | Convert each image's 3D feature maps to a 1D vector | 1000 x 2048 | 32 * 8 * 8 = 2048 features per image |
| 10 | Fully Connected Layer | 1000 x 2048 | Linear layer to 64 neurons | 1000 x 64 | Combines features to learn complex relations |
| 11 | ReLU Activation | 1000 x 64 | Apply ReLU | 1000 x 64 | Adds non-linearity |
| 12 | Output Layer | 1000 x 64 | Linear layer to 10 classes | 1000 x 10 | Produces a score for each of the 10 categories |
| 13 | Softmax | 1000 x 10 | Convert scores to probabilities | 1000 x 10 | Probabilities sum to 1 for each image |
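A minimal PyTorch sketch of the stages above, assuming exactly the layer sizes listed in the data-flow stages. The class name `PipelineCNN` and the random uint8 batch are illustrative placeholders, not the tutorial's own code. Softmax is applied as a separate step because PyTorch's `nn.CrossEntropyLoss` expects raw scores (logits) during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PipelineCNN(nn.Module):
    """Hypothetical model mirroring stages 3-12 of the data flow."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)   # stage 3
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)  # stage 6
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)                   # stages 5 and 8
        self.fc1 = nn.Linear(32 * 8 * 8, 64)                                # stage 10
        self.fc2 = nn.Linear(64, num_classes)                               # stage 12

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(F.relu(self.conv1(x)))  # stages 3-5: (N, 16, 16, 16)
        x = self.pool(F.relu(self.conv2(x)))  # stages 6-8: (N, 32, 8, 8)
        x = torch.flatten(x, start_dim=1)     # stage 9: (N, 2048)
        x = F.relu(self.fc1(x))               # stages 10-11: (N, 64)
        return self.fc2(x)                    # stage 12: raw class scores (N, 10)


# Stages 1-2: a small random uint8 batch stands in for real images, scaled to [0, 1].
raw = torch.randint(0, 256, (4, 3, 32, 32), dtype=torch.uint8)
images = raw.float() / 255.0
print(round(128 / 255, 3))  # 0.502, matching the normalization example

model = PipelineCNN()
logits = model(images)
probs = torch.softmax(logits, dim=1)  # stage 13
print(logits.shape)      # torch.Size([4, 10])
print(probs.sum(dim=1))  # each row sums to 1 (up to float rounding)
```

Using `torch.flatten(x, start_dim=1)` keeps the batch dimension intact, so a shape mistake in the convolutional part surfaces as an error in `fc1` rather than as a silently reshaped batch.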
Training Trace - Epoch by Epoch
[Figure: training loss (y-axis) vs. epoch (x-axis), epochs 1-10; the values are listed in the table below]
| Epoch | Loss ↓ | Accuracy ↑ | Observation |
|---|---|---|---|
| 1 | 1.85 | 0.35 | Model starts learning basic patterns |
| 2 | 1.25 | 0.55 | Loss decreases, accuracy improves |
| 3 | 0.95 | 0.68 | Model captures more features |
| 4 | 0.75 | 0.75 | Good convergence, accuracy rising |
| 5 | 0.60 | 0.81 | Model stabilizes with better accuracy |
| 6 | 0.52 | 0.85 | Further improvement as loss keeps falling |
| 7 | 0.47 | 0.87 | Training converging well |
| 8 | 0.43 | 0.89 | High accuracy, low loss |
| 9 | 0.40 | 0.90 | Gains are slowing as training nears a plateau |
| 10 | 0.38 | 0.91 | Training complete with good results |
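A sketch of an epoch-by-epoch training loop that records loss and accuracy the way the trace above does. It assumes a compact `nn.Sequential` version of the same architecture, random placeholder images and labels, SGD with momentum, and 3 epochs; with random data it will not reproduce the numbers in the table.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Same layer sizes as the data-flow stages, written as nn.Sequential for brevity.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
    nn.Linear(64, 10),  # logits; CrossEntropyLoss applies log-softmax internally
)

# Placeholder data: 256 random "images" in [0, 1] with random labels.
images = torch.rand(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

history = []  # (epoch, mean loss, running training accuracy)
for epoch in range(1, 4):
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for x, y in loader:
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * x.size(0)
        correct += (logits.argmax(dim=1) == y).sum().item()
        seen += x.size(0)
    epoch_loss, epoch_acc = total_loss / seen, correct / seen
    history.append((epoch, epoch_loss, epoch_acc))
    print(f"epoch {epoch}: loss {epoch_loss:.2f}, accuracy {epoch_acc:.2f}")
```

Accuracy here is measured on the training batches as they are processed; a separate evaluation pass on held-out data would be the usual way to judge generalization.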
Prediction Trace - 8 Layers
Layer 1: Input Image
Layer 2: Conv Layer 1 + ReLU
Layer 3: Max Pooling
Layer 4: Conv Layer 2 + ReLU
Layer 5: Max Pooling
Layer 6: Flatten
Layer 7: Fully Connected + ReLU
Layer 8: Output Layer + Softmax
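One way to produce a trace like this is to push a single image through an `nn.Sequential` model one module at a time and record each output shape. The sketch assumes the layer sizes from the data-flow stages and uses an untrained model with a random input, so the predicted class is meaningless; only the shapes matter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
    nn.Linear(64, 10),
    nn.Softmax(dim=1),  # inference only; drop it when training with CrossEntropyLoss
)
model.eval()

x = torch.rand(1, 3, 32, 32)  # one placeholder input image
shapes = [("Input", tuple(x.shape))]
with torch.no_grad():
    for layer in model:  # nn.Sequential can be iterated module by module
        x = layer(x)
        shapes.append((layer.__class__.__name__, tuple(x.shape)))

for name, shape in shapes:
    print(f"{name:<10} {shape}")
print("predicted class:", x.argmax(dim=1).item(), "prob sum:", round(x.sum().item(), 4))
```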
Model Quiz
Test your understanding
What is the purpose of the max pooling layers in this CNN?
A. To reduce the spatial size and keep important features
B. To increase the number of channels
C. To normalize pixel values
D. To convert images to grayscale
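A tiny example of what a max pooling layer does, using a made-up 4x4 feature map: `nn.MaxPool2d(2, 2)` keeps the largest value in each non-overlapping 2x2 window, halving height and width while the channel count stays the same.

```python
import torch
import torch.nn as nn

# One 1-channel 4x4 "feature map" holding the values 0..15 row by row.
fmap = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool(fmap)

print(pooled.shape)  # torch.Size([1, 1, 2, 2])
print(pooled[0, 0])  # values [[5., 7.], [13., 15.]]: the max of each 2x2 window
```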
Key Insight
This CNN learns to recognize images by extracting progressively more complex features with convolution and pooling layers, then classifies each image by combining those features in fully connected layers. During training, loss falls steadily from 1.85 to 0.38 while accuracy rises from 0.35 to 0.91, showing that the model is learning; accuracy on held-out images would be needed to confirm that it generalizes to new data.

Practice

1. What is the main role of convolutional layers in a CNN for image classification?
easy
A. To detect features like edges and textures in small parts of the image
B. To reduce the size of the image by downsampling
C. To combine all features into a final decision
D. To randomly change pixel values for data augmentation

Solution

  1. Step 1: Understand convolutional layers

    Convolutional layers scan small parts of the image to find patterns like edges and textures.
  2. Step 2: Compare with other layers

    Pooling layers reduce image size, and fully connected layers make the final classification decision.
  3. Final Answer:

    To detect features like edges and textures in small parts of the image -> Option A
  4. Quick Check:

    Convolutional layers = feature detection [OK]
Hint: Convolution layers find patterns, pooling shrinks images [OK]
Common Mistakes:
  • Confusing pooling with convolution
  • Thinking fully connected layers detect features
  • Believing a convolutional layer's job is to shrink the image (any size change depends on its padding and stride)
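To make "detecting features like edges" concrete, here is a sketch with one hand-set 3x3 filter applied to a tiny synthetic image. The filter values are an assumption chosen for illustration; in a real CNN the filters are learned during training.

```python
import torch
import torch.nn as nn

# A 5x5 grayscale "image": dark on the left (0), bright from column 2 onward (1).
image = torch.zeros(1, 1, 5, 5)  # (batch, channels, height, width)
image[..., 2:] = 1.0

# One 3x3 filter, set by hand to respond to left-to-right brightness increases.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
with torch.no_grad():
    conv.weight.copy_(torch.tensor([[[[-1.0, 0.0, 1.0],
                                       [-1.0, 0.0, 1.0],
                                       [-1.0, 0.0, 1.0]]]]))
    response = conv(image)

print(response.shape)  # torch.Size([1, 1, 3, 3]): no padding, so 5 - 3 + 1 = 3
print(response[0, 0])  # 3s where the window straddles the edge, 0 over the flat region
```

The filter only "fires" where its 3x3 window covers the dark-to-bright boundary, which is what feature detection in a small region of the image means.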
2. Which of the following is the correct way to define a 2D convolutional layer in PyTorch with 3 input channels, 16 output channels, and a kernel size of 3?
easy
A. nn.Conv2d(16, 3, kernel_size=3)
B. nn.Conv1d(3, 16, kernel_size=3)
C. nn.Linear(3, 16, kernel_size=3)
D. nn.Conv2d(3, 16, kernel_size=3)

Solution

  1. Step 1: Identify correct layer type and parameters

    For images, use nn.Conv2d with its arguments in this order: input channels, output channels, kernel size.
  2. Step 2: Check each option

    Option D, nn.Conv2d(3, 16, kernel_size=3), is correct. Option B uses nn.Conv1d, which is meant for 1D sequences rather than images. Option C uses nn.Linear, which is not a convolution and does not accept a kernel_size argument. Option A, nn.Conv2d(16, 3, kernel_size=3), reverses the input and output channels.
  3. Final Answer:

    nn.Conv2d(3, 16, kernel_size=3) -> Option D
  4. Quick Check:

    Conv2d(input_channels, output_channels, kernel_size) = D [OK]
Hint: Conv2d uses (in_channels, out_channels, kernel_size) order [OK]
Common Mistakes:
  • Using Conv1d instead of Conv2d for images
  • Swapping input and output channels
  • Using Linear layer for convolution
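A quick way to confirm the argument order is to inspect the layer's weight shape and try the wrong options. This sketch assumes a single random 32x32 RGB image as input.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)  # (in_channels, out_channels, kernel_size)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]): (out, in, kernel_h, kernel_w)

x = torch.randn(1, 3, 32, 32)  # one RGB image
out = conv(x)
print(out.shape)  # torch.Size([1, 16, 30, 30]): no padding, so 32 - 3 + 1 = 30

# Option A: swapped channels expect a 16-channel input, so a 3-channel image fails.
swapped = nn.Conv2d(16, 3, kernel_size=3)
try:
    swapped(x)
    swapped_ok = True
except RuntimeError:
    swapped_ok = False

# Option C: nn.Linear has no kernel_size parameter, so construction itself fails.
try:
    nn.Linear(3, 16, kernel_size=3)
    linear_ok = True
except TypeError:
    linear_ok = False

print(swapped_ok, linear_ok)  # False False
```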
3. Given the following PyTorch CNN snippet, what is the output shape after the convolution and pooling layers if the input image size is (3, 32, 32)?
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
    def forward(self, x):
        x = self.conv(x)
        x = self.pool(x)
        return x

model = SimpleCNN()
input_tensor = torch.randn(1, 3, 32, 32)
output = model(input_tensor)
print(output.shape)
medium
A. torch.Size([1, 8, 30, 30])
B. torch.Size([1, 8, 16, 16])
C. torch.Size([1, 3, 16, 16])
D. torch.Size([1, 8, 32, 32])

Solution

  1. Step 1: Calculate output size after convolution

    Input size: 32x32, kernel=3, padding=1, stride=1 (default). Output size = (32 - 3 + 2*1)/1 + 1 = 32. Channels change from 3 to 8.
  2. Step 2: Calculate output size after max pooling

    MaxPool2d with kernel=2, stride=2 halves width and height: 32/2 = 16. Channels remain 8.
  3. Final Answer:

    torch.Size([1, 8, 16, 16]) -> Option B
  4. Quick Check:

    Conv keeps size, pooling halves it = B [OK]
Hint: Conv with padding keeps size; pooling halves it [OK]
Common Mistakes:
  • Ignoring padding effect on convolution output size
  • Forgetting pooling halves spatial dimensions
  • Mixing up input and output channels
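The arithmetic in Step 1 follows the usual output-size formula for convolution and pooling, out = floor((n + 2p - k) / s) + 1, for input size n, kernel size k, padding p, and stride s (dilation 1). A small helper, named `conv_output_size` here purely for illustration, reproduces the hand calculation and can be cross-checked against the same layers used in `SimpleCNN`:

```python
import torch
import torch.nn as nn


def conv_output_size(n: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a conv or pooling layer (dilation 1, floor rounding)."""
    return (n + 2 * padding - kernel) // stride + 1


after_conv = conv_output_size(32, kernel=3, padding=1, stride=1)  # 32
after_pool = conv_output_size(after_conv, kernel=2, stride=2)    # 16
print(after_conv, after_pool)

# Cross-check with the same layers as SimpleCNN.
layers = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.MaxPool2d(2, 2))
print(layers(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 8, 16, 16])
```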
4. Identify the error in this PyTorch CNN model definition for image classification:
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 15 * 15, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = x.view(-1, 16 * 15 * 15)
        x = self.fc1(x)
        return x
medium
A. Pooling layer should come before convolution
B. The input size to fc1 is incorrect due to convolution output size mismatch
C. Missing import for torch.nn.functional as F
D. The number of output classes in fc1 should be 16

Solution

  1. Step 1: Check imports and usage

    The forward method uses F.relu but torch.nn.functional as F is not imported, causing a NameError.
  2. Step 2: Verify other parts

    Assuming a 32x32 input, the convolution (kernel 3, no padding) gives 32 - 3 + 1 = 30, and pooling halves that to 15x15, so the fc1 input size of 16 * 15 * 15 is correct. Pooling after convolution is the standard order, and 10 output classes is a reasonable choice.
  3. Final Answer:

    Missing import for torch.nn.functional as F -> Option C
  4. Quick Check:

    Using F.relu without importing F = C [OK]
Hint: Check all used modules are imported [OK]
Common Mistakes:
  • Forgetting to import torch.nn.functional as F
  • Miscalculating fc1 input size
  • Changing layer order incorrectly
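With the missing import added, the model runs. This sketch assumes 32x32 RGB inputs (the size the 16 * 15 * 15 calculation relies on) and a random batch of 4 images. It also swaps `x.view(-1, 16 * 15 * 15)` for `torch.flatten(x, 1)`, which keeps the batch dimension explicit; that change is a stylistic choice, not part of the fix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F  # the import missing from the question


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3)        # no padding: 32x32 -> 30x30
        self.pool = nn.MaxPool2d(2, 2)          # 30x30 -> 15x15
        self.fc1 = nn.Linear(16 * 15 * 15, 10)  # 3600 features -> 10 class scores

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = torch.flatten(x, 1)  # (N, 16, 15, 15) -> (N, 3600), batch size preserved
        return self.fc1(x)


net = Net()
scores = net(torch.randn(4, 3, 32, 32))
print(scores.shape)  # torch.Size([4, 10])
```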
5. You want to build a CNN in PyTorch to classify 64x64 RGB images into 5 classes. Which architecture below correctly combines convolution, pooling, and fully connected layers to achieve this?
hard
A.
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.fc1 = nn.Linear(20 * 13 * 13, 50)
        self.fc2 = nn.Linear(50, 5)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 20 * 13 * 13)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
B.
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(10 * 32 * 32, 5)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = x.view(-1, 10 * 32 * 32)
        x = self.fc1(x)
        return x
C.
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.fc1 = nn.Linear(20 * 12 * 12, 50)
        self.fc2 = nn.Linear(50, 5)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 20 * 12 * 12)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
D.
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.fc1 = nn.Linear(20 * 14 * 14, 50)
        self.fc2 = nn.Linear(50, 5)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 20 * 14 * 14)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

Solution

  1. Step 1: Calculate output sizes after conv and pooling layers

    Input: 64x64. Conv1 kernel=5, padding=0: (64-5+1)=60, pool kernel=2 stride=2: 60/2=30. Conv2 kernel=5: (30-5+1)=26, pool: 26/2=13. Final size 20x13x13.
  2. Step 2: Check the fc1 input size in each option

    Option A uses nn.Linear(20 * 13 * 13, 50): correct.
    Option B has a single conv with kernel=3: (64-3+1)=62, pool: 62/2=31, so the flattened size is 10*31*31, but it uses 10*32*32: wrong.
    Option C uses 20*12*12: too small.
    Option D uses 20*14*14: too big.
  3. Final Answer:

    nn.Linear(20 * 13 * 13, 50) -> Option A
  4. Quick Check:

    64->60->30->26->13 = 20x13x13 -> A [OK]
Hint: Calculate conv and pool sizes stepwise to find fc input size [OK]
Common Mistakes:
  • Ignoring how kernel size reduces image dimensions
  • Assuming pooling does not halve size
  • Mismatching fc layer input size with conv output
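Option A can be checked by running a dummy 64x64 batch through it. As an alternative to hand-computing the flattened size (an assumption about good practice, not the quiz's intended method), the sketch below measures it once with a dummy forward pass through the convolutional part; the `features` method and `n_flat` attribute are introduced only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNN(nn.Module):
    def __init__(self, image_size: int = 64, num_classes: int = 5):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(10, 20, 5)
        # Measure the flattened feature size with one dummy image instead of by hand.
        with torch.no_grad():
            dummy = torch.zeros(1, 3, image_size, image_size)
            self.n_flat = self.features(dummy).numel()  # 20 * 13 * 13 = 3380 for 64x64
        self.fc1 = nn.Linear(self.n_flat, 50)
        self.fc2 = nn.Linear(50, num_classes)

    def features(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 64 -> 60 -> 30
        x = self.pool(F.relu(self.conv2(x)))  # 30 -> 26 -> 13
        return x

    def forward(self, x):
        x = torch.flatten(self.features(x), 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)


model = CNN()
print(model.n_flat)                            # 3380
print(model(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2, 5])
```

The dummy pass ties fc1's input size to the actual convolution and pooling settings, so changing a kernel size or the input resolution no longer requires recomputing the number by hand.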