MLOps / DevOps · ~10 mins

Model optimization for serving (quantization, pruning) in MLOps - Commands & Configuration

Introduction
Model optimization makes machine learning models smaller and faster, which matters when you serve them on hardware with limited memory or compute. Techniques like quantization and pruning reduce model size and improve serving speed with little loss in accuracy. Typical scenarios:
When you want to deploy a model on a mobile phone with limited memory and CPU power.
When you need faster predictions from a model in a web service to handle more users.
When you want to reduce cloud costs by using smaller models that need less compute.
When you want to run models on edge devices like IoT sensors with low resources.
When you want to improve battery life on devices running AI models by reducing computation.
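Before diving into the tooling, it helps to see what quantization actually does to the numbers. The sketch below (plain NumPy, with an illustrative scale and zero-point chosen for weights in [-1, 1]) shows the affine int8 mapping `q = round(x / scale) + zero_point` that quantized runtimes use, and how small the round-trip error stays:

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Affine-quantize floats to int8: q = round(x / scale) + zero_point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return (q.astype(np.float32) - zero_point) * scale

# Weights in [-1, 1] mapped onto the 255 usable int8 levels
weights = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
scale = 2.0 / 255.0   # (max - min) / number of levels
zero_point = 0

q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print('quantized:', q)
print('max round-trip error:', np.max(np.abs(weights - restored)))
```

Each weight now occupies 1 byte instead of 4, and the worst-case error is bounded by the scale, which is why accuracy usually drops only slightly.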
Commands
This Python code loads a pretrained MobileNetV2 model, applies quantization to reduce its size and improve speed, then saves the optimized model as a TensorFlow Lite file for efficient serving.
Python
import tensorflow as tf
from tensorflow import keras

# Load a pretrained model
model = keras.applications.MobileNetV2(weights='imagenet')

# Convert to a TensorFlow Lite model with quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

# Save the quantized model to disk
with open('mobilenetv2_quant.tflite', 'wb') as f:
    f.write(tflite_quant_model)

print('Quantized model saved as mobilenetv2_quant.tflite')
Expected Output
Quantized model saved as mobilenetv2_quant.tflite
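To check that quantization actually shrinks the serialized model, you can convert the same model twice and compare byte sizes. The sketch below uses a small stand-in Dense model (an assumption, to avoid downloading ImageNet weights; the same comparison works on MobileNetV2):

```python
import tensorflow as tf
from tensorflow import keras

# Small stand-in model; any Keras model converts the same way
model = keras.Sequential([
    keras.layers.Input(shape=(32,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10),
])

# Convert once without optimization as a baseline
baseline = tf.lite.TFLiteConverter.from_keras_model(model).convert()

# Convert again with dynamic-range quantization enabled
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized = converter.convert()

print(f'baseline:  {len(baseline)} bytes')
print(f'quantized: {len(quantized)} bytes')
```

Because float32 weights are stored as int8, the quantized flatbuffer is typically close to a quarter of the baseline size for weight-dominated models.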
This code applies pruning to the MobileNetV2 model to remove less important weights, making the model smaller. It runs a short training loop to apply pruning, then saves the pruned model for serving.
Python
import tensorflow_model_optimization as tfmot
from tensorflow import keras

# Load the original model
model = keras.applications.MobileNetV2(weights='imagenet')

# Apply pruning to the model
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
}

model_for_pruning = prune_low_magnitude(model, **pruning_params)

# Compile the pruned model
model_for_pruning.compile(optimizer='adam', loss='categorical_crossentropy')

# Short fine-tuning pass so the pruning masks are applied;
# the UpdatePruningStep callback is required, or fit() raises an error
import numpy as np
x_dummy = np.random.rand(10, 224, 224, 3)
y_dummy = np.random.rand(10, 1000)
model_for_pruning.fit(x_dummy, y_dummy, epochs=1, batch_size=1,
                      callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip pruning wrappers to get the final pruned model
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Save the pruned model
model_for_export.save('mobilenetv2_pruned.h5')

print('Pruned model saved as mobilenetv2_pruned.h5')
Expected Output
Epoch 1/1
10/10 [==============================] - 3s 200ms/step - loss: 6.9080
Pruned model saved as mobilenetv2_pruned.h5
This command loads the quantized TensorFlow Lite model and prepares it for inference. It prints the input and output details so you know how to feed data and read predictions.
Python
import tensorflow as tf

# Load the quantized TFLite model
interpreter = tf.lite.Interpreter(model_path='mobilenetv2_quant.tflite')
interpreter.allocate_tensors()

# Get input and output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

print('Input details:', input_details)
print('Output details:', output_details)
Expected Output
Input details: [{'name': 'input_1', 'index': 0, 'shape': array([ 1, 224, 224, 3], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
Output details: [{'name': 'Logits', 'index': 123, 'shape': array([ 1, 1000], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
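Once the input and output details are known, running a prediction is a three-step cycle: `set_tensor`, `invoke`, `get_tensor`. The sketch below converts a tiny stand-in model in memory (an assumption, so it runs without the saved file; with a saved model, pass `model_path='mobilenetv2_quant.tflite'` instead of `model_content`):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Tiny stand-in model converted in memory for a self-contained demo
model = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(3, activation='softmax'),
])
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed an input matching the declared shape and dtype, then read predictions
x = np.random.rand(1, 4).astype(input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], x)
interpreter.invoke()
preds = interpreter.get_tensor(output_details[0]['index'])
print('prediction shape:', preds.shape)
```

For the MobileNetV2 model above, the input would instead be a preprocessed `(1, 224, 224, 3)` float32 image and the output a `(1, 1000)` class-probability vector.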
Key Concept

If you remember nothing else from this pattern, remember: quantization and pruning reduce model size and speed up serving by simplifying the model without losing much accuracy.

Code Example
MLOps
import tensorflow as tf
from tensorflow import keras
import tensorflow_model_optimization as tfmot
import numpy as np

# Load pretrained model
model = keras.applications.MobileNetV2(weights='imagenet')

# Quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
with open('mobilenetv2_quant.tflite', 'wb') as f:
    f.write(tflite_quant_model)
print('Quantized model saved as mobilenetv2_quant.tflite')

# Pruning
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
}
model_for_pruning = prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam', loss='categorical_crossentropy')
x_dummy = np.random.rand(10, 224, 224, 3)
y_dummy = np.random.rand(10, 1000)
model_for_pruning.fit(x_dummy, y_dummy, epochs=1, batch_size=1,
                      callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
model_for_export.save('mobilenetv2_pruned.h5')
print('Pruned model saved as mobilenetv2_pruned.h5')

# Load and prepare quantized model for inference
interpreter = tf.lite.Interpreter(model_path='mobilenetv2_quant.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print('Input details:', input_details)
print('Output details:', output_details)
Output
Common Mistakes
Skipping the short fine-tuning step (with the UpdatePruningStep callback) after applying pruning.
Pruning masks are only applied during training steps; without a brief fit() that passes tfmot.sparsity.keras.UpdatePruningStep(), the model keeps all its weights and pruning has no effect.
Run a short training loop with the UpdatePruningStep callback after wrapping the model so the sparsity schedule actually zeroes out weights.
Trying to use the original TensorFlow model directly after quantization without converting to TensorFlow Lite.
Quantization for serving usually requires conversion to TensorFlow Lite format; the original model won't benefit from quantization optimizations.
Always convert the quantized model to TensorFlow Lite format before deployment.
Not allocating tensors before running inference with the TensorFlow Lite interpreter.
Without allocating tensors, the interpreter cannot prepare memory for inputs and outputs, causing errors.
Call interpreter.allocate_tensors() before inference.
Summary
Use TensorFlow Lite converter with optimization flags to quantize models for smaller size and faster serving.
Apply pruning with TensorFlow Model Optimization Toolkit by wrapping the model and running a short training loop.
Always allocate tensors before running inference on TensorFlow Lite models to prepare input and output buffers.