Why serving architecture affects latency and cost in MLOps - Performance Analysis
When we serve machine learning models, the design of the serving system determines both how quickly it responds and how much it costs to run. The goal here is to understand how serving architecture affects end-to-end latency and the compute resources consumed per request.
Analyze the time complexity of the following serving code snippet.
```python
class ModelServer:
    def __init__(self, models):
        self.models = models  # list of models

    def serve(self, input_data):
        results = []
        for model in self.models:
            results.append(model.predict(input_data))
        return results
```
This code runs multiple models one after another to get predictions for the same input.
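To see the sequential behavior concretely, here is a minimal, runnable sketch. The `ConstantModel` stub stands in for a real model and is purely illustrative; only `ModelServer` comes from the snippet above.

```python
class ConstantModel:
    """Stand-in model whose predict() returns a fixed value (illustrative only)."""
    def __init__(self, value):
        self.value = value

    def predict(self, input_data):
        return self.value


class ModelServer:
    def __init__(self, models):
        self.models = models  # list of models

    def serve(self, input_data):
        results = []
        for model in self.models:
            # Each predict call runs to completion before the next one starts
            results.append(model.predict(input_data))
        return results


server = ModelServer([ConstantModel(i) for i in range(3)])
print(server.serve({"feature": 1.0}))  # -> [0, 1, 2]
```

Because `serve` waits for each `predict` call to finish before starting the next, the calls never overlap in time.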
Identify the repeated operations: loops, recursion, or array traversals.
- Primary operation: Loop over each model to call its predict function.
- How many times: Once for each model in the list.
As the number of models grows, total serving time grows in proportion, because each model's `predict` call adds its own latency to the sum.
| Number of Models (n) | Approx. Operations |
|---|---|
| 10 | 10 predictions |
| 100 | 100 predictions |
| 1000 | 1000 predictions |
Pattern observation: The time grows directly with the number of models; doubling models doubles the work.
Time Complexity: O(n)
This means the serving time grows linearly with the number of models we run.
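The linear pattern in the table can be checked empirically by counting `predict` calls. This sketch uses a hypothetical `CountingModel` stub (not from the source) that increments a shared counter:

```python
class CountingModel:
    """Stand-in model that counts how often predict() is called (illustrative)."""
    calls = 0  # shared across all instances

    def predict(self, input_data):
        CountingModel.calls += 1
        return input_data


def serve_all(models, input_data):
    # Same sequential loop as ModelServer.serve
    return [m.predict(input_data) for m in models]


for n in (10, 100, 1000):
    CountingModel.calls = 0
    serve_all([CountingModel() for _ in range(n)], 0)
    print(n, CountingModel.calls)  # n models -> exactly n predict calls
```

The call count matches n exactly, confirming that doubling the number of models doubles the work: O(n).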
[X] Wrong: "Running more models won't affect latency much because they run fast."
[OK] Correct: Each model adds its own time, so more models add up and increase total latency.
Understanding how serving design affects speed and cost helps you build better systems and explain your choices clearly.
"What if we run all models in parallel instead of one by one? How would the time complexity change?"