Model optimization for serving (quantization, pruning) in MLOps - Time & Space Complexity
When we optimize machine learning models for serving, we want to know how these changes affect the time it takes to make predictions.
We ask: How does inference speed change as we apply quantization or pruning?
Analyze the time complexity of the following simplified model inference code.
```python
# Simplified inference with pruning and quantization
for layer in model.layers:
    weights = layer.weights
    # Pruning removes weights with small magnitude
    pruned_weights = [w for w in weights if abs(w) > threshold]
    # Quantization reduces the numeric precision of each remaining weight
    quantized_weights = [quantize(w) for w in pruned_weights]
    output = layer.forward(input, quantized_weights)
    input = output  # the output of one layer feeds the next
```
This code runs inference through each layer, pruning and quantizing weights before computing output.
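To make the idea concrete, here is a minimal, self-contained sketch of that loop. The `Layer` class, the `quantize` helper (rounding to a fixed step), and the scalar "forward pass" are illustrative assumptions standing in for a real framework, not an actual library API:

```python
import random

# Hypothetical stand-ins for the snippet's model, quantize(), and threshold;
# the names and behavior here are illustrative assumptions.
def quantize(w, step=0.25):
    """Round a weight to the nearest multiple of `step` (coarser precision)."""
    return round(w / step) * step

class Layer:
    def __init__(self, n_weights):
        self.weights = [random.uniform(-1, 1) for _ in range(n_weights)]

    def forward(self, x, weights):
        # Toy forward pass: weighted sum of a scalar input.
        return sum(w * x for w in weights)

threshold = 0.1
layers = [Layer(100), Layer(100), Layer(100)]

x = 1.0
for layer in layers:
    pruned = [w for w in layer.weights if abs(w) > threshold]  # prune
    quantized = [quantize(w) for w in pruned]                  # quantize
    x = layer.forward(x, quantized)                            # compute output

print(f"Final output: {x:.4f}")
```

Notice that every layer does work proportional to its weight count: the pruning scan touches all weights, and quantization and the forward pass touch each surviving weight once.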
Look at what repeats in the code:
- Primary operations: an outer loop over the model's layers and an inner loop over each layer's weights.
- How many times: once per layer, and once per weight within that layer.
As the number of layers or weights grows, the total work grows in proportion.
| Input Size (weights) | Approx. Operations |
|---|---|
| 10 | About 10 operations per layer |
| 100 | About 100 operations per layer |
| 1000 | About 1000 operations per layer |
Pattern observation: The time grows roughly in direct proportion to the number of weights processed.
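The pattern in the table can be checked by counting operations directly. The sketch below is a hypothetical counter that assumes one operation per weight for the pruning scan and one per surviving weight for quantization; the count stays within a small constant multiple of `n`, which is exactly what linear growth means:

```python
def count_operations(n_weights, threshold=0.1):
    """Count weight-level operations in one layer's prune + quantize pass.

    Illustrative assumption: one operation per weight scanned during
    pruning, plus one per surviving weight during quantization.
    """
    # Deterministic toy weights with magnitudes spread over [0, 1)
    weights = [(-1) ** i * (i / n_weights) for i in range(n_weights)]
    ops = 0
    pruned = []
    for w in weights:          # pruning scans every weight: n operations
        ops += 1
        if abs(w) > threshold:
            pruned.append(w)
    for w in pruned:           # quantization touches each survivor
        ops += 1
    return ops

for n in (10, 100, 1000):
    print(n, count_operations(n))
```

Multiplying `n` by 10 multiplies the operation count by roughly 10, matching the table's "about n operations per layer" pattern.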
Time Complexity: O(n), where n is the total number of weights across all layers.
The pruning scan touches every original weight once, and quantization and the forward pass touch every surviving weight, so inference time grows linearly with the number of weights processed.
[X] Wrong: "Pruning and quantization make inference time constant regardless of model size."
[OK] Correct: Even after pruning and quantization, the model still processes weights, so time grows with how many weights remain.
Understanding how model optimization affects inference time shows you can balance speed and accuracy, a key skill in real-world machine learning deployment.
What if we applied pruning to remove half the weights? How would that change the time complexity?
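One way to reason about this question: removing half the weights halves the work of the forward pass, but doubling n still doubles the remaining work, so the complexity stays O(n). The constant factor shrinks; the growth rate does not. A minimal sketch, using a hypothetical `work_after_pruning` counter that assumes one multiply-add per surviving weight:

```python
def work_after_pruning(n_weights, keep_fraction):
    """Forward-pass operations when only `keep_fraction` of weights survive.

    Illustrative assumption: one multiply-add per surviving weight.
    """
    return int(n_weights * keep_fraction)

for n in (10, 100, 1000):
    full = work_after_pruning(n, 1.0)   # no pruning
    half = work_after_pruning(n, 0.5)   # half the weights pruned away
    # The halved count still grows linearly with n
    print(n, full, half)
```

So pruning half the weights gives roughly a 2x speedup in the forward pass, but the time complexity is O(n/2) = O(n): still linear.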