Model optimization (pruning, quantization) in Computer Vision

Introduction

Model optimization helps make AI models smaller and faster without losing much accuracy. This is useful to run models on devices like phones or cameras.

Typical situations where model optimization helps:
  • When you want your computer vision model to run faster on a smartphone.
  • When you need to save memory or storage space for your AI model.
  • When deploying models to devices with limited power or hardware.
  • When you want to reduce the cost of running AI models in the cloud.
  • When you want to improve the speed of real-time image or video processing.
Syntax
import tensorflow_model_optimization as tfmot
import tensorflow as tf

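# Pruning example (model is an existing tf.keras model)
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
model_for_pruning = prune_low_magnitude(model)

# Quantization example: convert to TensorFlow Lite with default optimizations
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()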

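Pruning sets the least important weights (here, those with the smallest magnitude) to zero, so the model can be stored and compressed more compactly.

Quantization stores numbers at lower precision, for example 8-bit integers instead of 32-bit floats, which makes the model smaller and usually faster.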
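
To make the pruning idea concrete, here is a minimal pure-Python sketch of magnitude pruning, the criterion that prune_low_magnitude is named after. The helper magnitude_prune and the sample weights are hypothetical illustrations, not part of the TensorFlow API.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to 0.0."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices sorted by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Hypothetical weights, chosen only for illustration
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.1]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

Note that the zeroed entries still occupy space in a dense list or tensor. The size benefit appears once the model is compressed or run on a runtime that exploits sparsity.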

Examples
This example shows how to apply pruning to a Keras model with a schedule that gradually prunes 50% of weights.
import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

pruning_params = {
  'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
      initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
}

model_for_pruning = prune_low_magnitude(model, **pruning_params)
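
The schedule above ramps the target sparsity from 0% to 50% over 1000 steps. The sketch below shows how such a polynomial ramp could be computed in plain Python. It assumes an exponent of 3 (to my knowledge the default power of PolynomialDecay), and the function name polynomial_sparsity is hypothetical. It also ignores that the library only updates its pruning masks every few steps.

```python
def polynomial_sparsity(step, initial_sparsity=0.0, final_sparsity=0.5,
                        begin_step=0, end_step=1000, power=3):
    """Target sparsity at a training step for a polynomial-decay schedule."""
    # Fraction of the schedule completed, clamped to [0, 1]
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(1.0, max(0.0, progress))
    # The gap to the final sparsity shrinks as (1 - progress) ** power
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


for step in (0, 250, 500, 1000):
    print(step, round(polynomial_sparsity(step), 4))
```

With these settings the ramp is steep at the start and flattens near the end: half-way through the schedule the target is already 43.75%, not 25%.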
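This example converts a Keras model to a TensorFlow Lite model with the default optimization setting. Without a representative dataset, this applies post-training dynamic range quantization (weights stored as 8-bit integers), giving a smaller model and often faster inference.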
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()
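
What "reducing precision" means can be sketched without TensorFlow. The example below is a simplified symmetric 8-bit scheme with one shared scale. The helper names and sample weights are hypothetical, and TensorFlow Lite's actual scheme (per-channel scales, zero points) is more involved.

```python
def quantize_int8(values):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.1]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.6f}, max error={max_error:.6f}")
```

Each weight now needs one byte instead of four, and the rounding error per weight is at most half of the scale. That bounded error is why accuracy usually drops only slightly.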
Sample Model

This program trains a simple CNN on MNIST digits, applies pruning to remove 50% of weights, then converts the pruned model to a quantized TensorFlow Lite model. It prints the accuracy before and after pruning and the size of the quantized model.

import tensorflow as tf
import tensorflow_model_optimization as tfmot
from tensorflow.keras import layers, models

# Create a simple CNN model for computer vision
model = models.Sequential([
    layers.InputLayer(input_shape=(28, 28, 1)),
    layers.Conv2D(16, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Load MNIST data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

# Train the baseline model briefly
model.fit(x_train, y_train, epochs=1, batch_size=128, verbose=0)

# Evaluate the baseline before pruning, so later fine-tuning cannot affect this number
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print(f'Original model accuracy: {acc:.4f}')

# Apply pruning: target sparsity ramps from 0% to 50% over the first 1000 steps
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
}
model_for_pruning = prune_low_magnitude(model, **pruning_params)

model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Fine-tune while pruning. The UpdatePruningStep callback is required, and
# 3 epochs (about 1400 steps at batch size 128) cover the 1000-step schedule.
model_for_pruning.fit(x_train, y_train, epochs=3, batch_size=128, verbose=0,
                      callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Evaluate the pruned model while it is still compiled
loss_p, acc_p = model_for_pruning.evaluate(x_test, y_test, verbose=0)
print(f'Pruned model accuracy: {acc_p:.4f}')

# Strip pruning wrappers to get the final pruned model
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Convert to quantized TFLite model
converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()

print(f'Quantized TFLite model size: {len(quantized_tflite_model) / 1024:.2f} KB')
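
As a sanity check on the reported model size, the parameter count of the sample CNN can be worked out by hand. The sketch below is a rough estimate only: it assumes every parameter is stored at the stated width and ignores file-format overhead, so the real TFLite file size will differ somewhat.

```python
# Parameter count for the sample CNN, layer by layer
conv_params = 3 * 3 * 1 * 16 + 16        # 3x3 kernels, 1 input channel, 16 filters + biases
conv_out = 28 - 3 + 1                    # no padding: 26 x 26 feature maps
pooled = conv_out // 2                   # 2x2 max pooling: 13 x 13
flat_features = pooled * pooled * 16     # length of the Flatten output
dense_params = flat_features * 10 + 10   # weights + biases of the final Dense layer

total_params = conv_params + dense_params
float32_kb = total_params * 4 / 1024     # 4 bytes per 32-bit float
int8_kb = total_params * 1 / 1024        # 1 byte per 8-bit integer

print(f"parameters: {total_params}")
print(f"float32: {float32_kb:.1f} KB, int8: {int8_kb:.1f} KB")
```

The model has 27,210 parameters, so roughly 106 KB at 32-bit precision against roughly 27 KB at 8-bit precision. Almost all of them sit in the Dense layer, which is where quantization saves the most.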
Important Notes

Pruning works best when you retrain the model after pruning to recover accuracy.

Quantization can slightly reduce accuracy, but 8-bit weights take about a quarter of the space of 32-bit floats and often run faster on mobile hardware.

Always test your optimized model to ensure it still meets your accuracy needs.

Summary

Model optimization makes AI models smaller and faster for real devices.

Pruning removes less important parts of the model to reduce size.

Quantization reduces number precision to speed up the model and save space.

Practice

1. What is the main goal of model pruning in computer vision?
easy
A. To remove less important parts of the model to reduce size
B. To increase the number of layers in the model
C. To add more training data for better accuracy
D. To convert the model to a different programming language

Solution

  1. Step 1: Understand pruning concept

    Pruning means removing parts of the model that contribute less to its output.
  2. Step 2: Identify pruning goal

    The goal is to reduce model size and speed up inference by cutting unnecessary parts.
  3. Final Answer:

    To remove less important parts of the model to reduce size -> Option A
  4. Quick Check:

    Pruning = Remove less important parts [OK]
Hint: Pruning cuts unneeded parts to shrink model size [OK]
Common Mistakes:
  • Thinking pruning adds layers instead of removing
  • Confusing pruning with data augmentation
  • Believing pruning changes programming language
2. Which of the following is the correct way to apply quantization in TensorFlow Lite?
easy
A. model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
B. converter.optimizations = [tf.lite.Optimize.DEFAULT]
C. model.compile(optimizer='adam', loss='mse')
D. model.fit(x_train, y_train, epochs=10)

Solution

  1. Step 1: Identify quantization syntax

    In TensorFlow Lite, quantization is enabled by setting converter.optimizations to Optimize.DEFAULT.
  2. Step 2: Check other options

    model = tf.lite.TFLiteConverter.from_keras_model(model).convert() (Option A) converts the model but does not enable quantization. Options C and D are training commands, not quantization.
  3. Final Answer:

    converter.optimizations = [tf.lite.Optimize.DEFAULT] -> Option B
  4. Quick Check:

    Quantization flag = converter.optimizations [OK]
Hint: Quantization needs converter.optimizations set to Optimize.DEFAULT [OK]
Common Mistakes:
  • Confusing model conversion with quantization
  • Using training commands instead of conversion flags
  • Missing the optimization setting for quantization
3. Given this PyTorch pruning code snippet, what does the print statement output (the number of non-zero weights in the first linear layer) after pruning 20% of connections?
import torch
import torch.nn.utils.prune as prune

model = torch.nn.Sequential(
    torch.nn.Linear(100, 50),
    torch.nn.ReLU()
)
prune.l1_unstructured(model[0], name='weight', amount=0.2)
pruned_weights = model[0].weight
print((pruned_weights != 0).sum().item())
medium
A. 8000
B. 5000
C. 10000
D. 4000

Solution

  1. Step 1: Calculate total weights

    The first linear layer has 100 inputs and 50 outputs, so total weights = 100 * 50 = 5000.
  2. Step 2: Calculate remaining weights after pruning

    Pruning 20% removes 20% of weights, so remaining weights = 80% of 5000 = 4000.
  3. Step 3: Understand pruning method

    PyTorch's l1_unstructured pruning does not remove weights but masks them, so the weight tensor size remains 5000, but the number of non-zero weights is 4000.
  4. Step 4: Check print output

    The print statement counts non-zero weights, so output is 4000.
  5. Final Answer:

    4000 -> Option D
  6. Quick Check:

    5000 * 0.8 = 4000 [OK]
Hint: Remaining weights = total * (1 - pruning amount) [OK]
Common Mistakes:
  • Calculating total weights incorrectly
  • Using pruning amount as remaining instead of removed
  • Confusing layer input/output dimensions
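
The masking behaviour described in Step 3 can be imitated without PyTorch. The sketch below is a plain-Python stand-in that uses random weights in place of a real layer. The names weight_orig and mask mirror the weight_orig parameter and weight_mask buffer that PyTorch attaches to a pruned module, but nothing here calls PyTorch.

```python
import random

random.seed(0)
in_features, out_features, amount = 100, 50, 0.2

# Stand-in for the layer's original weights (weight_orig in PyTorch)
weight_orig = [random.uniform(-0.1, 0.1) for _ in range(in_features * out_features)]

# Build a 0/1 mask that zeroes the 20% of entries with the smallest |w|
n_prune = round(amount * len(weight_orig))
order = sorted(range(len(weight_orig)), key=lambda i: abs(weight_orig[i]))
mask = [1] * len(weight_orig)
for i in order[:n_prune]:
    mask[i] = 0

# The effective weight is the element-wise product of weights and mask
weight = [w * m for w, m in zip(weight_orig, mask)]
remaining = sum(1 for w in weight if w != 0)

print(len(weight))   # number of stored entries is unchanged
print(remaining)     # surviving (non-zero) connections
```

The stored tensor keeps all 5000 entries, while only 4000 of them remain non-zero, which matches the answer above.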
4. You tried to quantize a model but got an error: AttributeError: 'TFLiteConverter' object has no attribute 'optimizations'. What is the likely cause?
medium
A. Quantization requires training the model again
B. Model is too large to quantize
C. Using an outdated TensorFlow version without quantization support
D. The model has no weights to quantize

Solution

  1. Step 1: Understand the error

    The error says the converter object lacks the 'optimizations' attribute, which suggests the installed TensorFlow version predates it.
  2. Step 2: Identify cause

    Older TensorFlow versions do not provide the 'optimizations' attribute used to enable quantization.
  3. Final Answer:

    Using an outdated TensorFlow version without quantization support -> Option C
  4. Quick Check:

    Missing attribute = outdated TensorFlow [OK]
Hint: Check TensorFlow version supports quantization features [OK]
Common Mistakes:
  • Assuming model size causes attribute error
  • Thinking quantization needs retraining always
  • Believing model without weights causes this error
5. You want to deploy a computer vision model on a mobile device with limited memory and CPU. Which combination of optimization techniques is best to reduce model size and speed up inference without much accuracy loss?
hard
A. Apply pruning to remove unimportant weights, then quantize weights to 8-bit integers
B. Only increase model layers to improve accuracy
C. Use full precision weights and no pruning for best accuracy
D. Train longer without any model size changes

Solution

  1. Step 1: Understand device constraints

    Mobile devices have limited memory and CPU, so model size and speed matter.
  2. Step 2: Choose optimization techniques

    Pruning removes unnecessary weights, which reduces size; quantization lowers numeric precision, which shrinks the model further and speeds up inference.
  3. Step 3: Combine pruning and quantization

    Using both together reduces size and speeds up model with minimal accuracy loss.
  4. Final Answer:

    Apply pruning to remove unimportant weights, then quantize weights to 8-bit integers -> Option A
  5. Quick Check:

    Pruning + quantization = smaller, faster model [OK]
Hint: Combine pruning and quantization for efficient mobile models [OK]
Common Mistakes:
  • Only increasing layers without optimization
  • Ignoring quantization benefits
  • Assuming full precision is always best for deployment
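
Why the two techniques combine well can be illustrated with a small standard-library experiment. This is a sketch under stated assumptions: the weights are random Gaussian values rather than a trained model, to_int8_bytes is a hypothetical helper using one shared scale, and zlib stands in for whatever compression the deployment pipeline applies.

```python
import random
import struct
import zlib

random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(5000)]


def to_int8_bytes(values):
    """Quantize floats to signed 8-bit integers with one shared scale."""
    scale = max(abs(v) for v in values) / 127.0
    return struct.pack(f"{len(values)}b", *(round(v / scale) for v in values))


# Baseline: every weight stored as a 32-bit float
float32_bytes = struct.pack(f"{len(weights)}f", *weights)

# Prune the 50% of weights with the smallest magnitude
threshold = sorted(abs(w) for w in weights)[len(weights) // 2]
pruned = [w if abs(w) >= threshold else 0.0 for w in weights]

sizes = {
    "float32 raw": len(float32_bytes),
    "int8 raw": len(to_int8_bytes(weights)),
    "int8 compressed": len(zlib.compress(to_int8_bytes(weights))),
    "pruned int8 compressed": len(zlib.compress(to_int8_bytes(pruned))),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Quantization alone cuts the raw size to a quarter. Pruning adds no saving to the raw byte count, but the long runs of zeros it creates make the quantized weights compress noticeably better.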