
Model optimization for serving (quantization, pruning) in MLOps - Time & Space Complexity

Time Complexity: Model optimization for serving (quantization, pruning)
O(n)
Understanding Time Complexity

When we optimize machine learning models for serving, we want to know how these changes affect the time it takes to make predictions.

We ask: How does the model's speed change as we apply quantization or pruning?

Scenario Under Consideration

Analyze the time complexity of the following simplified model inference code.


# Simplified inference with pruning and quantization
# (model, threshold, quantize, and input_tensor are assumed to be defined elsewhere)
x = input_tensor
for layer in model.layers:
    weights = layer.weights
    # Pruning: drop weights whose magnitude falls below the threshold
    pruned_weights = [w for w in weights if abs(w) > threshold]
    # Quantization: reduce the precision of each surviving weight
    quantized_weights = [quantize(w) for w in pruned_weights]
    # The forward pass touches each surviving weight once
    x = layer.forward(x, quantized_weights)

This code runs inference through each layer, pruning and quantizing weights before computing output.
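To make the pattern concrete, here is a minimal, self-contained version of the same loop. The `Layer` class, the toy `quantize` function, and the threshold value are illustrative stand-ins for this sketch, not a real framework API:

```python
import random

def quantize(w, scale=127):
    # Simulate int8-style quantization: snap to the nearest representable step
    return round(w * scale) / scale

class Layer:
    def __init__(self, n_weights):
        self.weights = [random.uniform(-1, 1) for _ in range(n_weights)]

    def forward(self, x, weights):
        # Toy forward pass: weighted sum over a scalar input
        return sum(w * x for w in weights)

random.seed(0)
layers = [Layer(100) for _ in range(3)]  # hypothetical 3-layer model
threshold = 0.5

x = 1.0
for layer in layers:
    pruned = [w for w in layer.weights if abs(w) > threshold]   # one pass over all weights
    quantized = [quantize(w) for w in pruned]                   # one pass over survivors
    x = layer.forward(x, quantized)                             # one pass over survivors
```

Each layer performs a constant number of passes over its weights, which is what drives the linear growth discussed below.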

Identify Repeating Operations

Look at what repeats in the code:

  • Primary operation: Loop over model layers and loop over weights in each layer.
  • How many times: Once per layer, and once per weight in that layer.
How Execution Grows With Input

As the number of layers or weights grows, the work grows too.

Input Size (weights)    Approx. Operations
10                      ~10 operations per layer
100                     ~100 operations per layer
1000                    ~1000 operations per layer

Pattern observation: The time grows roughly in direct proportion to the number of weights processed.
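We can verify this pattern by counting elementary operations directly. The sketch below uses a deterministic toy weight distribution (an assumption for reproducibility) and counts one comparison per weight for pruning plus one quantization per surviving weight:

```python
def count_ops(n_weights, threshold=0.5):
    """Count elementary weight operations in one layer's pruning + quantization pass."""
    weights = [((i % 10) - 5) / 5 for i in range(n_weights)]  # deterministic toy weights
    ops = 0
    pruned = []
    for w in weights:
        ops += 1                      # one comparison per weight (pruning scan)
        if abs(w) > threshold:
            pruned.append(w)
    for w in pruned:
        ops += 1                      # one quantization per surviving weight
    return ops

for n in (10, 100, 1000):
    print(n, count_ops(n))
# 10x more weights -> 10x more operations: linear growth
```

Doubling the weight count doubles the operation count, which is exactly the O(n) behavior tabulated above.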

Final Time Complexity

Time Complexity: O(n)

This means inference time grows linearly with n, the total number of weights across the model's layers. Note that the pruning scan touches every original weight once, while quantization and the forward pass touch only the weights that survive pruning; each of these counts grows in direct proportion to n.

Common Mistake

[X] Wrong: "Pruning and quantization make inference time constant regardless of model size."

[OK] Correct: Even after pruning and quantization, the model still processes weights, so time grows with how many weights remain.
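A short sketch makes the distinction concrete: pruning shrinks the constant amount of work, but the work remaining is still proportional to the number of surviving weights (the weight lists here are illustrative):

```python
def ops_for(weights):
    # One elementary operation (e.g., multiply-accumulate) per remaining weight
    return len(weights)

full = list(range(1000))
half = full[::2]          # pruning keeps every other weight

print(ops_for(full))      # 1000 operations
print(ops_for(half))      # 500 operations: half the work, but still O(n)
```

Pruning changed the constant factor (1000 vs. 500), not the growth rate: process ten times as many weights and both counts grow tenfold.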

Interview Connect

Understanding how model optimization affects inference time shows you can balance speed and accuracy, a key skill in real-world machine learning deployment.

Self-Check

What if we applied pruning to remove half the weights? How would that change the time complexity?