CNN architecture review in Computer Vision - Model Metrics & Evaluation

Which metrics matter for CNN architecture review, and why

When reviewing a CNN (Convolutional Neural Network) architecture for image classification, the key metrics to focus on are accuracy, precision, recall, and F1 score. Together they show how often the CNN's predictions match the true labels and which kinds of errors it makes.

Accuracy shows overall correctness, but it can be misleading if classes are imbalanced. Precision tells us what fraction of predicted positives are actually correct, which is important when false alarms are costly. Recall tells us what fraction of real positives the model finds, which matters when missing a positive is costly. F1 score is the harmonic mean of precision and recall, giving a single number to compare models.

For CNNs used in image tasks, these metrics, computed on held-out images the model has not seen during training, help us judge whether the architecture learns useful features and generalizes to new images.
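
As a minimal sketch of how the four metrics relate, the function below computes them from binary confusion counts. The function name and the example counts (a hypothetical "cat vs. not cat" classifier) are illustrative assumptions, not values from a real model.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Convention used here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a "cat vs. not cat" classifier.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 0.85 while recall is only about 0.67, which is why accuracy alone is not enough to judge a model.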

Confusion matrix example for CNN predictions
Actual \ Predicted | Cat | Dog | Other
-------------------|-----|-----|------
Cat                | 50  | 5   | 10
Dog                | 3   | 45  | 7
Other              | 2   | 4   | 60

This matrix shows how many images of each true class were predicted as each class. For example, 50 cat images were correctly predicted as cats (true positives for cat), 5 cat images were wrongly predicted as dogs (false negatives for cat and false positives for dog), and so on.
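
The per-class numbers can be read off the matrix mechanically. The sketch below uses the counts from the example matrix (rows are actual classes, columns are predicted classes); the helper name `per_class_metrics` is a hypothetical choice for illustration.

```python
labels = ["Cat", "Dog", "Other"]
# Rows are actual classes, columns are predicted classes.
confusion = [
    [50, 5, 10],
    [3, 45, 7],
    [2, 4, 60],
]


def per_class_metrics(matrix, class_names):
    """Return {class: (precision, recall, f1)} for a square confusion matrix."""
    results = {}
    n = len(class_names)
    for i, name in enumerate(class_names):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                        # rest of the row
        fp = sum(matrix[r][i] for r in range(n)) - tp   # rest of the column
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        results[name] = (precision, recall, f1)
    return results


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total
print(f"accuracy: {accuracy:.3f}")
for name, (p, r, f1) in per_class_metrics(confusion, labels).items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

For this matrix, overall accuracy is 155/186 (about 0.83), cat precision is 50/55, and cat recall is 50/65, so the model misses more cats than it falsely reports.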

Precision vs Recall tradeoff with CNN example

Imagine a CNN that detects cats in photos. If we want to be very sure when the model says "cat" (high precision), it might miss some cats (lower recall). This means fewer false alarms but more missed cats.

On the other hand, if we want to find every cat possible (high recall), the model might sometimes say "cat" when it is not (lower precision). This means catching all cats but with more false alarms.

Choosing the right balance depends on the goal. For example, in a pet app, missing a cat might be worse, so recall is more important. In a security camera, false alarms might be annoying, so precision matters more.
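
One way to see the tradeoff is to vary the decision threshold on the model's "cat" score. The sketch below uses a small set of made-up scores and labels (an assumption for illustration only) and shows precision falling as recall rises when the threshold is lowered.

```python
# Hypothetical (score, is_cat) pairs: the model's "cat" probability and the true label.
scored_examples = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, True), (0.40, False), (0.30, True), (0.10, False),
]


def precision_recall_at(threshold, examples):
    """Predict "cat" when score >= threshold, then return (precision, recall)."""
    tp = sum(1 for score, is_cat in examples if score >= threshold and is_cat)
    fp = sum(1 for score, is_cat in examples if score >= threshold and not is_cat)
    fn = sum(1 for score, is_cat in examples if score < threshold and is_cat)
    # Convention used here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.85, 0.50, 0.20):
    precision, recall = precision_recall_at(threshold, scored_examples)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, a strict threshold of 0.85 gives precision 1.00 with recall 0.40, while a loose threshold of 0.20 gives recall 1.00 with precision near 0.71.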

What good vs bad metric values look like for CNNs

Good CNN metrics (rough rules of thumb; acceptable thresholds depend on the task and dataset):

  • Accuracy above 85% on a balanced dataset
  • Precision and recall both above 80%
  • F1 score close to precision and recall, showing balance
  • Confusion matrix with most predictions on the diagonal (correct class)

Bad CNN metrics:

  • High accuracy but very low recall or precision (model guesses mostly one class)
  • F1 score far below the higher of precision and recall, showing that the two are imbalanced (F1 always lies between them and is pulled toward the lower one)
  • Confusion matrix with many off-diagonal errors (wrong predictions)
  • Very low accuracy (below 50%, or near the chance level of 1/number of classes on a balanced dataset) indicating poor learning
Common pitfalls when evaluating CNN metrics
  • Accuracy paradox: High accuracy can hide poor performance if classes are imbalanced.
  • Data leakage: If test images overlap with or closely duplicate training images, metrics look better than they should and the model won't generalize.
  • Overfitting: Very high training accuracy but low test accuracy means the model has memorized training images instead of learning general features.
  • Ignoring class imbalance: Reporting only accuracy, without precision, recall, or F1, can give a misleading picture of model quality.
Self-check question

Your CNN model has 98% accuracy but only 12% recall on the "cat" class. Is it good for production? Why or why not?

Answer: No, it is not good. The low recall means the model misses most cats, even though overall accuracy is high. This likely happens because the dataset is imbalanced or the model predicts mostly other classes. For production, especially if finding cats is important, recall must improve.
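
To make the answer concrete, here is one hypothetical set of counts (2,500 test images, 50 of them cats) that produces exactly these numbers. The counts are invented for illustration; the original question does not specify them.

```python
# Hypothetical test set: 2,500 images, only 50 of which are cats.
tp, fn = 6, 44      # cats found / cats missed
fp, tn = 6, 2444    # non-cats flagged as cat / non-cats correctly rejected

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# A trivial model that never predicts "cat" gets every non-cat right.
baseline_accuracy = (fp + tn) / total

print(f"accuracy: {accuracy:.2%}")                        # 98.00%
print(f"cat recall: {recall:.2%}")                        # 12.00%
print(f"never-cat baseline accuracy: {baseline_accuracy:.2%}")  # 98.00%
```

Under these assumed counts the model is no more accurate than a baseline that never predicts "cat", which is the accuracy paradox in action.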

Key Result
For CNNs, balanced precision and recall together with high accuracy and F1 score on held-out data indicate an architecture that learns well and generalizes.

Practice

1. What is the main purpose of a Convolutional Neural Network (CNN) in computer vision?
easy
A. To perform text translation
B. To sort numbers in a list
C. To generate random images
D. To detect patterns and features in images

Solution

  1. Step 1: Understand CNN function

    CNNs scan images to find important patterns like edges and shapes.
  2. Step 2: Match purpose to options

    Only option D, "To detect patterns and features in images", describes what a CNN does; the other options are unrelated tasks.
  3. Final Answer:

    To detect patterns and features in images -> Option D
  4. Quick Check:

    CNN purpose = detect image patterns [OK]
Hint: CNNs find image features, not unrelated tasks like sorting [OK]
Common Mistakes:
  • Confusing CNNs with general neural networks
  • Thinking CNNs generate images
  • Mixing CNNs with text processing models
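
To illustrate what "scanning an image for edges" means, the sketch below slides a hand-picked 3x3 vertical-edge kernel over a tiny 5x5 image in plain Python. This is an assumption-laden toy: in a real CNN the kernel values are learned during training, and the operation shown is the cross-correlation that deep learning libraries usually call convolution.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1); return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(5)]

# Hand-picked vertical-edge kernel (a learned kernel would take its place in a CNN).
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)   # each row is [0, 3, 3]
```

The feature map is zero over the flat dark region and large where the window covers the dark-to-bright transition, so the response marks the location of the edge.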
2. Which of the following is the correct way to add a 2D convolutional layer in Keras?
easy
A. Dense(units=32, activation='relu')
B. Conv1D(filters=32, kernel_size=3, activation='relu')
C. Conv2D(filters=32, kernel_size=(3,3), activation='relu')
D. MaxPooling2D(pool_size=(2,2))

Solution

  1. Step 1: Identify Conv2D syntax

    Conv2D requires filters and kernel_size; kernel_size can be a single integer or a tuple of two integers, and activation is optional.
  2. Step 2: Compare options

    Conv2D(filters=32, kernel_size=(3,3), activation='relu') matches Conv2D syntax correctly; others are different layers or wrong dimensions.
  3. Final Answer:

    Conv2D(filters=32, kernel_size=(3,3), activation='relu') -> Option C
  4. Quick Check:

    Conv2D syntax = Conv2D(filters=32, kernel_size=(3,3), activation='relu') [OK]
Hint: Conv2D is the 2D convolution layer; kernel_size=(3,3) and kernel_size=3 are equivalent [OK]
Common Mistakes:
  • Using Conv1D instead of Conv2D for images
  • Confusing Dense layer with Conv2D
  • Assuming kernel_size must be a tuple (a single int also works)
3. Given this Keras CNN snippet, what is the output shape after the Conv2D layer?
model = Sequential()
model.add(Conv2D(16, (3,3), input_shape=(28,28,1)))
medium
A. (26, 26, 16)
B. (28, 28, 16)
C. (30, 30, 16)
D. (28, 28, 1)

Solution

  1. Step 1: Calculate output size after Conv2D

    With the default 'valid' padding, the default stride of 1, and kernel size 3, output dims = input - kernel + 1 = 28 - 3 + 1 = 26.
  2. Step 2: Determine output channels

    Filters=16 means output depth is 16 channels.
  3. Final Answer:

    (26, 26, 16) -> Option A
  4. Quick Check:

    Output shape = (26,26,16) [OK]
Hint: Output size = input - kernel + 1 with 'valid' padding and stride 1 [OK]
Common Mistakes:
  • Assuming the output size equals the input size even though no padding is applied
  • Confusing number of filters with spatial dimensions
  • Forgetting default padding is 'valid'
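
The output-size rule generalizes to other strides and to 'same' padding. The helper below is a sketch with a hypothetical name (`conv_output_size`); it assumes square kernels, no dilation, and the 'valid'/'same' conventions used by Keras.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Spatial output size of a convolution along one dimension (no dilation)."""
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # Input is padded so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    raise ValueError(f"unknown padding: {padding!r}")


side = conv_output_size(28, 3)                    # the question's layer
print((side, side, 16))                           # (26, 26, 16)
print(conv_output_size(28, 3, padding="same"))    # 28
print(conv_output_size(28, 3, stride=2))          # 13
```

The number of filters (16 here) only sets the channel dimension; it has no effect on height or width.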
4. Identify the error in this CNN model code snippet:
model = Sequential()
model.add(Conv2D(32, (3,3), activation='relu', input_shape=(28,28)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
medium
A. Dense layer should come before Flatten
B. input_shape missing channel dimension
C. Activation function 'relu' is invalid
D. Conv2D filters must be 64 or more

Solution

  1. Step 1: Check input_shape format

    Conv2D expects input_shape with 3 dimensions: height, width, channels. Here the channel dimension is missing; for grayscale 28x28 images it should be (28,28,1).
  2. Step 2: Validate other parts

    Activation 'relu' is valid, Flatten before Dense is correct, filters can be any positive integer.
  3. Final Answer:

    input_shape missing channel dimension -> Option B
  4. Quick Check:

    Input shape must include channels [OK]
Hint: Conv2D input_shape needs (height, width, channels) [OK]
Common Mistakes:
  • Ignoring channel dimension in input_shape
  • Misordering Flatten and Dense layers
  • Thinking filters must be >=64
5. You want to build a CNN for classifying 64x64 RGB images into 5 classes. Which architecture choice is best?
hard
A. Conv2D(32, (3,3)) + MaxPooling2D + Conv2D(64, (3,3)) + Flatten + Dense(5, softmax)
B. Dense(128) + Dense(64) + Dense(5, softmax)
C. Conv1D(32, 3) + Flatten + Dense(5, softmax)
D. Flatten + Dense(5, softmax)

Solution

  1. Step 1: Identify suitable layers for image data

    Conv2D layers extract spatial features from 2D images; MaxPooling reduces size; Flatten prepares for Dense.
  2. Step 2: Evaluate options

    Conv2D(32, (3,3)) + MaxPooling2D + Conv2D(64, (3,3)) + Flatten + Dense(5, softmax) uses Conv2D and pooling correctly for images. The Dense-only option lacks feature extraction, Conv1D is unsuitable for 2D images, and Flatten + Dense skips convolutions.
  3. Final Answer:

    Conv2D(32, (3,3)) + MaxPooling2D + Conv2D(64, (3,3)) + Flatten + Dense(5, softmax) -> Option A
  4. Quick Check:

    Use Conv2D + pooling for images [OK]
Hint: Use Conv2D layers for images, not Dense-only or Conv1D [OK]
Common Mistakes:
  • Using Dense layers only for image input
  • Applying Conv1D to 2D images
  • Skipping pooling layers for downsampling
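
To see what option A does to a 64x64 RGB input, the sketch below traces shapes and parameter counts layer by layer in plain Python. It assumes 'valid' padding, stride 1 for the convolutions, and a 2x2 pooling window with stride 2 (the Keras defaults); the helper names are hypothetical and no deep learning library is used.

```python
def conv2d(shape, filters, kernel=3):
    """Return (output shape, parameter count) for a 'valid', stride-1 convolution."""
    h, w, channels = shape
    out_shape = (h - kernel + 1, w - kernel + 1, filters)
    params = (kernel * kernel * channels + 1) * filters   # weights + one bias per filter
    return out_shape, params


def max_pool(shape, pool=2):
    """Return (output shape, parameter count) for non-overlapping max pooling."""
    h, w, channels = shape
    return (h // pool, w // pool, channels), 0             # pooling has no parameters


def dense(units_in, units_out):
    """Return (output units, parameter count) for a fully connected layer."""
    return units_out, (units_in + 1) * units_out


shape = (64, 64, 3)                       # 64x64 RGB input
shape, p_conv1 = conv2d(shape, 32)        # -> (62, 62, 32)
shape, _ = max_pool(shape)                # -> (31, 31, 32)
shape, p_conv2 = conv2d(shape, 64)        # -> (29, 29, 64)
flat = shape[0] * shape[1] * shape[2]     # Flatten -> 53824 values
classes, p_dense = dense(flat, 5)         # -> 5 class scores

total_params = p_conv1 + p_conv2 + p_dense
print(shape, flat, classes, total_params)
```

Most of the parameters sit in the final Dense layer because the flattened feature map is still large, which is one reason deeper CNNs usually add more pooling before flattening.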