Model optimization for serving (quantization, pruning) in MLOps - Time & Space Complexity
When we optimize machine learning models for serving, we want to know how these changes affect the time it takes to make predictions.
We ask: How does the model's speed change as we apply quantization or pruning?
Analyze the time complexity of the following simplified model inference code.
# Simplified inference with pruning and quantization
for layer in model.layers:
    weights = layer.weights
    # Pruning removes some weights
    pruned_weights = [w for w in weights if abs(w) > threshold]
    # Quantization reduces precision
    quantized_weights = [quantize(w) for w in pruned_weights]
    output = layer.forward(input, quantized_weights)
    input = output
This code runs inference through each layer, pruning and quantizing weights before computing output.
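A runnable variant of that loop might look like the minimal sketch below. It is not the original code: the layer representation (a list of weight rows), the `THRESHOLD` and `SCALE` values, and the helpers `prune`, `quantize`, `forward`, and `infer` are all assumptions made for illustration. Pruning here zeroes small weights instead of dropping them, so each row keeps its length and the dot products stay well defined.

```python
# Hypothetical stand-in for the pseudocode: dense layers stored as lists of rows.
THRESHOLD = 0.1   # assumed pruning cutoff
SCALE = 0.05      # assumed quantization step size

def prune(row):
    # Zero out small-magnitude weights (the row length stays the same).
    return [w if abs(w) > THRESHOLD else 0.0 for w in row]

def quantize(w):
    # Snap a weight to the nearest multiple of SCALE, clamped to the int8 range.
    q = max(-128, min(127, round(w / SCALE)))
    return q * SCALE

def forward(x, rows):
    # Dense layer without bias or activation: one dot product per row.
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def infer(layers, x):
    # Prune and quantize each layer's weights, then feed the output forward.
    for rows in layers:
        rows = [[quantize(w) for w in prune(row)] for row in rows]
        x = forward(x, rows)
    return x

layers = [
    [[0.52, 0.03], [-0.26, 0.40]],   # layer 1: 2 inputs -> 2 outputs
    [[1.00, -0.07]],                 # layer 2: 2 inputs -> 1 output
]
result = infer(layers, [1.0, 2.0])
print(result)
```

In a real serving setup, pruning and quantization are normally applied once, offline, rather than on every inference call; the loop keeps them inline only to mirror the pseudocode.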
Look at what repeats in the code:
- Primary operation: Loop over model layers and loop over weights in each layer.
- How many times: Once per layer, and once per weight in that layer.
As the number of layers or weights grows, the work grows too.
| Input Size (weights) | Approx. Operations |
|---|---|
| 10 | About 10 operations per layer |
| 100 | About 100 operations per layer |
| 1000 | About 1000 operations per layer |
Pattern observation: The time grows roughly in direct proportion to the number of weights processed.
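Time Complexity: O(n)
Here n is the total number of weights across all layers. The pruning pass checks every weight, and quantization and the forward pass then handle the weights that survive, so inference time grows linearly with the number of weights.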
[X] Wrong: "Pruning and quantization make inference time constant regardless of model size."
[OK] Correct: Even after pruning and quantization, the model still processes weights, so time grows with how many weights remain.
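One way to check this is to count weight visits instead of measuring wall-clock time. The sketch below is illustrative only: `count_operations` and its `keep_fraction` parameter are hypothetical names, and the cost model (one pass over all weights for pruning plus one pass over the surviving weights for quantization) is an assumption.

```python
def count_operations(num_layers, weights_per_layer, keep_fraction=1.0):
    """Count weight visits for one inference pass under the assumed cost model."""
    ops = 0
    for _ in range(num_layers):
        ops += weights_per_layer                        # pruning checks every weight
        kept = int(weights_per_layer * keep_fraction)   # weights that survive pruning
        ops += kept                                     # quantization visits survivors
    return ops

for n in (10, 100, 1000):
    full = count_operations(num_layers=1, weights_per_layer=n)
    half = count_operations(num_layers=1, weights_per_layer=n, keep_fraction=0.5)
    print(n, full, half)
```

Multiplying the number of weights by 10 multiplies both counts by 10. Keeping fewer weights lowers the constant factor, not the growth rate, so the work is never constant in model size.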
Understanding how model optimization affects inference time shows you can balance speed and accuracy, a key skill in real-world machine learning deployment.
What if we applied pruning to remove half the weights? How would that change the time complexity?
Practice
What is the purpose of quantization in model optimization for serving?
Solution
Step 1: Understand quantization purpose
Quantization reduces numeric precision (for example, from 32-bit floats to 8-bit integers) to save memory and speed up computation.
Step 2: Compare options
Removing layers is pruning, adding neurons increases model size, and increasing size is the opposite of optimization.
Final Answer:
Reduce the precision of numbers to make the model smaller and faster -> Option B
Quick Check:
Quantization = Reduce precision [OK]
- Confusing pruning with quantization
- Thinking quantization adds complexity
- Believing quantization increases model size
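To make the precision reduction concrete, here is a small sketch of symmetric 8-bit quantization using only the standard library. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical. Real frameworks use more elaborate schemes (per-channel scales, zero points, calibration), so treat this as an illustration of the idea rather than any library's implementation.

```python
from array import array

def quantize_int8(weights):
    """Map floats to int8 codes with a single shared scale (symmetric scheme)."""
    max_abs = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero
    scale = max_abs / 127.0
    codes = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

weights = array("f", [0.82, -0.41, 0.05, -1.27, 0.66])   # stored as 32-bit floats
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

fp32_bytes = len(weights) * weights.itemsize   # storage used by the float weights
int8_bytes = len(codes) * codes.itemsize       # storage used by the int8 codes
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(fp32_bytes, int8_bytes, max_error)
```

The int8 codes take a quarter of the storage of the 32-bit floats, and the rounding error per weight is bounded by half the scale. That bound is the accuracy cost traded for the smaller, faster model.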
Solution
Step 1: Recall TensorFlow pruning API structure
The pruning function lives under tfmot.sparsity.keras and takes the schedule through the pruning_schedule keyword argument.
Step 2: Check syntax correctness
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule) uses the correct full path and argument name. The other options omit part of the path or use wrong argument names.
Final Answer:
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule) -> Option A
Quick Check:
Correct pruning syntax = pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule) [OK]
- Omitting 'keras' in the API path
- Using wrong argument names
- Calling pruning directly from tf module
import torch
import torch.nn as nn
model = nn.Linear(10, 5)
quantized_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(type(quantized_model.weight()))
Solution
Step 1: Analyze dynamic quantization effect
torch.quantization.quantize_dynamic replaces nn.Linear with a dynamically quantized Linear module (torch.nn.quantized.dynamic.Linear), where weight is a method rather than an attribute. Calling it returns the weight as a torch.Tensor; the tensor is quantized, with dtype torch.qint8.
Step 2: Trace the print statement
quantized_model.weight() succeeds and returns a torch.Tensor, so print(type(...)) outputs <class 'torch.Tensor'>.
Final Answer:
<class 'torch.Tensor'> -> Option D
Quick Check:
Dynamic quant: weight() returns Tensor [OK]
- Thinking weight remains non-callable attribute like original Linear
- Confusing quantized_model type with weight type
- Expecting error on quantized model weight access
AttributeError: module 'tensorflow_model_optimization' has no attribute 'sparsity'. What is the most likely cause?
Solution
Step 1: Understand the error message
The error says the module has no attribute 'sparsity'. This usually points to a problem with the installed package: an incomplete, broken, or outdated install, or a local file or folder named tensorflow_model_optimization that shadows the real package.
Step 2: Check common causes
A package that is entirely absent normally fails earlier, at the import line, with ModuleNotFoundError. A wrong pruning_schedule argument produces a different error. Among the listed options, an installation problem is the closest match.
Final Answer:
The tensorflow_model_optimization package is not installed -> Option A
Quick Check:
Missing or broken package install = AttributeError [OK]
- Assuming import alias causes error
- Blaming pruning schedule argument
- Thinking pruning unsupported in TensorFlow
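When debugging an error like this, it helps to separate "the package cannot be found at all" from "something was imported but lacks the attribute". The sketch below uses only the standard library; the helper name `diagnose` and its return strings are hypothetical. For the case in this question one would call `diagnose("tensorflow_model_optimization", "sparsity")`.

```python
import importlib
import importlib.util

def diagnose(module_name, attribute):
    """Report whether a module is missing entirely or just lacks an attribute."""
    if importlib.util.find_spec(module_name) is None:
        return "not installed"   # importing it would raise ModuleNotFoundError
    module = importlib.import_module(module_name)
    if hasattr(module, attribute):
        return "ok"
    # The import worked, so the problem lies in what was actually imported.
    origin = getattr(module, "__file__", "unknown location")
    return f"imported from {origin} but has no attribute {attribute!r}"

# Demonstrated on standard-library names so the sketch runs anywhere.
print(diagnose("json", "dumps"))
print(diagnose("json", "sparsity"))
print(diagnose("package_that_does_not_exist_xyz", "sparsity"))
```

The reported file location is the useful part: if it points into the current project rather than the environment's installed packages, a local file is probably shadowing the real package.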
Solution
Step 1: Understand pruning and quantization order
Pruning removes unimportant weights first, reducing model size and complexity.
Step 2: Apply quantization after pruning
Quantization then reduces number precision on the smaller pruned model, further shrinking size and speeding inference.
Final Answer:
First prune the model to remove unimportant weights, then apply quantization to reduce number precision -> Option C
Quick Check:
Prune first, then quantize = First prune the model to remove unimportant weights, then apply quantization to reduce number precision [OK]
- Quantizing before pruning reduces pruning effectiveness
- Thinking pruning and quantization cannot be combined
- Pruning after deployment is too late
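The prune-then-quantize order can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the function names `magnitude_prune` and `quantize_symmetric`, the 50% sparsity target, and the sample weights. Production toolkits apply both steps with fine-tuning in between to recover accuracy, which this sketch leaves out.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (ties may prune more)."""
    k = int(len(weights) * sparsity)   # how many weights to remove
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= cutoff else w for w in weights]

def quantize_symmetric(weights, bits=8):
    """Symmetric quantization: zero maps to code 0, so pruned weights stay zero."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) or 1.0) / qmax
    return [round(w / scale) for w in weights], scale

weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.03, 0.25, -0.05]
pruned = magnitude_prune(weights, sparsity=0.5)   # step 1: prune
codes, scale = quantize_symmetric(pruned)         # step 2: quantize
zeros_after_pruning = pruned.count(0.0)
zeros_after_quantizing = codes.count(0)
print(pruned, codes, zeros_after_pruning, zeros_after_quantizing)
```

Because zero maps to code 0 under this symmetric scheme, the sparsity gained by pruning survives quantization, so the two size reductions combine.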
