MLOps / DevOps · ~10 mins

Model optimization for serving (quantization, pruning) in MLOps - Commands & Configuration

Introduction
Model optimization makes machine learning models smaller and faster, which matters when you serve them on hardware with limited memory or compute. Techniques like quantization and pruning reduce model size and improve serving speed with little loss in accuracy. Typical scenarios:
When you want to deploy a model on a mobile phone with limited memory and CPU power.
When you need faster predictions from a model in a web service to handle more users.
When you want to reduce cloud costs by using smaller models that need less compute.
When you want to run models on edge devices like IoT sensors with low resources.
When you want to improve battery life on devices running AI models by reducing computation.
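Before diving into the tooling, it helps to see what quantization actually does to the numbers. The sketch below (plain NumPy, with an illustrative scale and zero-point chosen for weights in [-1, 1]) shows the affine int8 mapping `q = round(x / scale) + zero_point` that quantized runtimes use, and how small the round-trip error stays:

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Affine-quantize floats to int8: q = round(x / scale) + zero_point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return (q.astype(np.float32) - zero_point) * scale

# Weights in [-1, 1] mapped onto the 255 usable int8 levels
weights = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
scale = 2.0 / 255.0   # (max - min) / number of levels
zero_point = 0

q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print('quantized:', q)
print('max round-trip error:', np.max(np.abs(weights - restored)))
```

Each weight now occupies 1 byte instead of 4, and the worst-case error is bounded by the scale, which is why accuracy usually drops only slightly.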
Commands
This Python code loads a pretrained MobileNetV2 model, applies quantization to reduce its size and improve speed, then saves the optimized model as a TensorFlow Lite file for efficient serving.
Python
import tensorflow as tf
from tensorflow import keras

# Load a pretrained model
model = keras.applications.MobileNetV2(weights='imagenet')

# Convert to a TensorFlow Lite model with quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

# Save the quantized model to disk
with open('mobilenetv2_quant.tflite', 'wb') as f:
    f.write(tflite_quant_model)

print('Quantized model saved as mobilenetv2_quant.tflite')
Expected Output
Quantized model saved as mobilenetv2_quant.tflite
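To check that quantization actually shrinks the serialized model, you can convert the same model twice and compare byte sizes. The sketch below uses a small stand-in Dense model (an assumption, to avoid downloading ImageNet weights; the same comparison works on MobileNetV2):

```python
import tensorflow as tf
from tensorflow import keras

# Small stand-in model; any Keras model converts the same way
model = keras.Sequential([
    keras.layers.Input(shape=(32,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10),
])

# Convert once without optimization as a baseline
baseline = tf.lite.TFLiteConverter.from_keras_model(model).convert()

# Convert again with dynamic-range quantization enabled
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized = converter.convert()

print(f'baseline:  {len(baseline)} bytes')
print(f'quantized: {len(quantized)} bytes')
```

Because float32 weights are stored as int8, the quantized flatbuffer is typically close to a quarter of the baseline size for weight-dominated models.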
This code applies pruning to the MobileNetV2 model to remove less important weights, making the model smaller. It runs a short training loop to apply pruning, then saves the pruned model for serving.
Python
import tensorflow_model_optimization as tfmot
from tensorflow import keras

# Load the original model
model = keras.applications.MobileNetV2(weights='imagenet')

# Apply pruning to the model
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
}

model_for_pruning = prune_low_magnitude(model, **pruning_params)

# Compile the pruned model
model_for_pruning.compile(optimizer='adam', loss='categorical_crossentropy')

# Short fine-tuning pass so the pruning masks are applied;
# the UpdatePruningStep callback is required, or fit() raises an error
import numpy as np
x_dummy = np.random.rand(10, 224, 224, 3)
y_dummy = np.random.rand(10, 1000)
model_for_pruning.fit(x_dummy, y_dummy, epochs=1, batch_size=1,
                      callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip pruning wrappers to get the final pruned model
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Save the pruned model
model_for_export.save('mobilenetv2_pruned.h5')

print('Pruned model saved as mobilenetv2_pruned.h5')
Expected Output
Epoch 1/1
10/10 [==============================] - 3s 200ms/step - loss: 6.9080
Pruned model saved as mobilenetv2_pruned.h5
This command loads the quantized TensorFlow Lite model and prepares it for inference. It prints the input and output details so you know how to feed data and read predictions.
Python
import tensorflow as tf

# Load the quantized TFLite model
interpreter = tf.lite.Interpreter(model_path='mobilenetv2_quant.tflite')
interpreter.allocate_tensors()

# Get input and output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

print('Input details:', input_details)
print('Output details:', output_details)
Expected Output
Input details: [{'name': 'input_1', 'index': 0, 'shape': array([ 1, 224, 224, 3], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
Output details: [{'name': 'Logits', 'index': 123, 'shape': array([ 1, 1000], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
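Once the input and output details are known, running a prediction is a three-step cycle: `set_tensor`, `invoke`, `get_tensor`. The sketch below converts a tiny stand-in model in memory (an assumption, so it runs without the saved file; with a saved model, pass `model_path='mobilenetv2_quant.tflite'` instead of `model_content`):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Tiny stand-in model converted in memory for a self-contained demo
model = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(3, activation='softmax'),
])
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed an input matching the declared shape and dtype, then read predictions
x = np.random.rand(1, 4).astype(input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], x)
interpreter.invoke()
preds = interpreter.get_tensor(output_details[0]['index'])
print('prediction shape:', preds.shape)
```

For the MobileNetV2 model above, the input would instead be a preprocessed `(1, 224, 224, 3)` float32 image and the output a `(1, 1000)` class-probability vector.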
Key Concept

If you remember nothing else from this pattern, remember: quantization and pruning reduce model size and speed up serving by simplifying the model without losing much accuracy.

Code Example
MLOps
import tensorflow as tf
from tensorflow import keras
import tensorflow_model_optimization as tfmot
import numpy as np

# Load pretrained model
model = keras.applications.MobileNetV2(weights='imagenet')

# Quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
with open('mobilenetv2_quant.tflite', 'wb') as f:
    f.write(tflite_quant_model)
print('Quantized model saved as mobilenetv2_quant.tflite')

# Pruning
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
}
model_for_pruning = prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam', loss='categorical_crossentropy')
x_dummy = np.random.rand(10, 224, 224, 3)
y_dummy = np.random.rand(10, 1000)
model_for_pruning.fit(x_dummy, y_dummy, epochs=1, batch_size=1,
                      callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
model_for_export.save('mobilenetv2_pruned.h5')
print('Pruned model saved as mobilenetv2_pruned.h5')

# Load and prepare quantized model for inference
interpreter = tf.lite.Interpreter(model_path='mobilenetv2_quant.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print('Input details:', input_details)
print('Output details:', output_details)
Output
Common Mistakes
Skipping the short fine-tuning step (with the UpdatePruningStep callback) after applying pruning.
Pruning masks are only applied during training steps; without a brief fit() that passes tfmot.sparsity.keras.UpdatePruningStep(), the model keeps all its weights and pruning has no effect.
Run a short training loop with the UpdatePruningStep callback after wrapping the model so the sparsity schedule actually zeroes out weights.
Trying to use the original TensorFlow model directly after quantization without converting to TensorFlow Lite.
Quantization for serving usually requires conversion to TensorFlow Lite format; the original model won't benefit from quantization optimizations.
Always convert the quantized model to TensorFlow Lite format before deployment.
Not allocating tensors before running inference with the TensorFlow Lite interpreter.
Without allocating tensors, the interpreter cannot prepare memory for inputs and outputs, causing errors.
Call interpreter.allocate_tensors() before inference.
Summary
Use TensorFlow Lite converter with optimization flags to quantize models for smaller size and faster serving.
Apply pruning with TensorFlow Model Optimization Toolkit by wrapping the model and running a short training loop.
Always allocate tensors before running inference on TensorFlow Lite models to prepare input and output buffers.