Why serving architecture affects latency and cost in MLOps - Performance Analysis
When we serve machine learning models, the way we set up the system changes how fast it responds and how much it costs.
We want to understand how the serving design affects response time and the resources used.
Analyze the time complexity of the following serving code snippet.
class ModelServer:
    def __init__(self, models):
        self.models = models  # list of models

    def serve(self, input_data):
        results = []
        # Call each model in turn on the same input
        for model in self.models:
            results.append(model.predict(input_data))
        return results
This code runs multiple models one after another to get predictions for the same input.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Loop over each model to call its predict function.
- How many times: Once for each model in the list.
As the number of models grows, the total time to serve grows too because each model prediction takes time.
| Number of Models (n) | Approx. Operations |
|---|---|
| 10 | 10 predictions |
| 100 | 100 predictions |
| 1000 | 1000 predictions |
Pattern observation: The time grows directly with the number of models; doubling models doubles the work.
Time Complexity: O(n), where n is the number of models
This means the serving time grows linearly with the number of models we run, assuming each prediction takes roughly the same amount of time.
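One way to check the linear pattern is to count how many times `predict` is called. The sketch below is only an illustration: `CountingModel` and `count_predictions` are hypothetical helpers that are not part of the original snippet, and `ModelServer` is restated so the example runs on its own.

```python
class CountingModel:
    """Hypothetical stand-in for a real model; counts predict() calls."""

    calls = 0  # counter shared by all instances

    def predict(self, input_data):
        CountingModel.calls += 1
        return input_data  # echo the input; a real model would run inference


class ModelServer:
    def __init__(self, models):
        self.models = models

    def serve(self, input_data):
        # One predict() call per model, run one after another
        return [model.predict(input_data) for model in self.models]


def count_predictions(n):
    """Return how many predict() calls a single serve() makes with n models."""
    CountingModel.calls = 0
    ModelServer([CountingModel() for _ in range(n)]).serve([1.0, 2.0])
    return CountingModel.calls


for n in (10, 100, 1000):
    print(n, "models ->", count_predictions(n), "predictions")
```

The counts match the table: one prediction per model, so doubling the models doubles the calls.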
[X] Wrong: "Running more models won't affect latency much because they run fast."
[OK] Correct: Each model adds its own time, so more models add up and increase total latency.
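That sum applies when the models run one after another. A common alternative design is to run them concurrently. The following is a minimal sketch using Python's standard `concurrent.futures.ThreadPoolExecutor`; `ParallelModelServer` and `ScaleModel` are hypothetical names chosen for illustration, not something the original code defines.

```python
from concurrent.futures import ThreadPoolExecutor


class ScaleModel:
    """Hypothetical stub model: multiplies the input by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def predict(self, input_data):
        return input_data * self.factor


class ParallelModelServer:
    def __init__(self, models, max_workers=None):
        self.models = models
        # Default to one worker per model (at least one worker is required)
        self.max_workers = max_workers or max(1, len(models))

    def serve(self, input_data):
        # A pool per request keeps the sketch short; a long-lived pool
        # would avoid the repeated start-up cost.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            # map() returns results in the same order as self.models
            return list(pool.map(lambda m: m.predict(input_data), self.models))


server = ParallelModelServer([ScaleModel(f) for f in (1, 2, 3)])
print(server.serve(10))  # [10, 20, 30]
```

The total work is still one prediction per model, so compute cost does not shrink. What can change is wall-clock latency: with enough workers it tends toward the time of the slowest model rather than the sum of all of them. In CPython, threads mainly help when `predict` waits on I/O or runs native code that releases the global interpreter lock.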
Understanding how serving design affects speed and cost helps you build better systems and explain your choices clearly.
"What if we run all models in parallel instead of one by one? How would the time complexity change?"
Practice
Solution
Step 1: Understand latency in serving architectures
Latency means the delay before a prediction is returned. Edge serving places the model close to the user, reducing delay.
Step 2: Compare architectures
Batch serving processes data in groups and is slower. Edge serving is designed for fast responses near the user.
Final Answer:
Edge serving -> Option C
Quick Check:
Lowest latency = Edge serving [OK]
- Confusing batch serving as low latency
- Thinking cloud batch is fastest
- Ignoring edge location benefits
Solution
Step 1: Define batch serving
Batch serving processes multiple data points together, not one by one, which saves cost but adds delay.
Step 2: Evaluate options
"Batch serving processes data in groups and is usually cheaper but slower." correctly describes this trade-off. The other options are incorrect or unrealistic.
Final Answer:
Batch serving processes data in groups and is usually cheaper but slower. -> Option B
Quick Check:
Batch serving = cheaper, slower [OK]
- Thinking batch serving is real-time
- Assuming batch runs on edge devices
- Believing batch needs no compute
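The "cheaper but slower" trade-off can be made concrete with a toy model. The sketch below assumes a fixed cost per run that is shared by every item in the batch, and requests that arrive at a steady rate. The function names and all numbers are hypothetical, chosen only to show the shape of the trade-off.

```python
def cost_per_prediction(batch_size, fixed_cost=1.0, item_cost=0.01):
    """Cost of one prediction when a fixed per-run cost is shared by a batch."""
    return fixed_cost / batch_size + item_cost


def average_wait_seconds(batch_size, arrivals_per_second=10.0):
    """Average time an item waits for its batch to fill, assuming steady arrivals."""
    # The first item waits for batch_size - 1 more arrivals, the last waits for none,
    # so the average wait is half of the first item's wait.
    return (batch_size - 1) / (2 * arrivals_per_second)


for size in (1, 10, 100):
    print(
        "batch size", size,
        "| cost per prediction", round(cost_per_prediction(size), 4),
        "| average wait (s)", round(average_wait_seconds(size), 2),
    )
```

Under these assumptions, larger batches lower the cost per prediction while raising the time each request waits before it is processed.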
Solution
Step 1: Recall characteristics of online and batch serving
Online serving provides predictions immediately (low latency) but requires more resources (high cost). Batch serving delays predictions but is cheaper.
Step 2: Match options to characteristics
"Online serving: low latency, high cost; Batch serving: high latency, low cost" correctly matches low latency and high cost to online serving, and high latency and low cost to batch serving.
Final Answer:
Online serving: low latency, high cost; Batch serving: high latency, low cost -> Option A
Quick Check:
Online = fast & costly, Batch = slow & cheap [OK]
- Swapping latency and cost roles
- Assuming both have same cost
- Thinking batch is faster
Solution
Step 1: Understand edge serving constraints
Edge devices have limited resources. Large models can slow down processing and increase cost.
Step 2: Analyze options
"The model is too large to run efficiently on edge devices" explains the likely cause. "Batch processing was mistakenly used instead of edge serving" is incorrect because batch serving is a different architecture. "The model is deployed in a cloud data center far from users" describes cloud serving, not edge. "Edge serving always causes high latency and cost" is false.
Final Answer:
The model is too large to run efficiently on edge devices -> Option D
Quick Check:
Large model on edge = high latency/cost [OK]
- Confusing edge with cloud serving
- Assuming edge always has high latency
- Mixing batch and edge serving
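A rough way to see why model size matters on the edge is to split latency into network time and compute time. The sketch below is a simplified model, not a benchmark: `end_to_end_latency_ms` is a hypothetical helper, and the round-trip times, device speeds, and model sizes are made-up numbers.

```python
def end_to_end_latency_ms(network_rtt_ms, model_gflops, device_gflops_per_s):
    """Network round trip plus compute time for a single prediction."""
    inference_ms = model_gflops / device_gflops_per_s * 1000
    return network_rtt_ms + inference_ms


# Hypothetical deployments: the edge is close to the user but has slow hardware,
# the cloud is far away but has fast hardware.
EDGE = {"network_rtt_ms": 5, "device_gflops_per_s": 50}
CLOUD = {"network_rtt_ms": 80, "device_gflops_per_s": 5000}

for name, gflops in (("small model", 0.5), ("large model", 20)):
    edge_ms = end_to_end_latency_ms(model_gflops=gflops, **EDGE)
    cloud_ms = end_to_end_latency_ms(model_gflops=gflops, **CLOUD)
    print(f"{name}: edge {edge_ms:.1f} ms, cloud {cloud_ms:.1f} ms")
```

With these numbers the small model is faster on the edge, while the large model is slower there because compute time outweighs the saved network time.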
Solution
Step 1: Analyze latency and cost trade-offs
Central cloud has higher latency for distant users. Batch serving is cheap but slow. Edge serving is fast but costly.
Step 2: Evaluate hybrid approach
Combining edge serving in key regions reduces latency where needed, while batch serving elsewhere controls costs.
Final Answer:
Combine edge serving for critical regions and batch serving elsewhere -> Option A
Quick Check:
Hybrid edge + batch balances latency and cost [OK]
- Choosing only cloud causing high latency
- Using batch only causing slow responses
- Deploying large models on all devices is costly
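The hybrid approach can be sketched as a small router. This is an illustration under assumptions, not a production design: `HybridRouter`, `LabelModel`, and the region names are hypothetical, and real systems would add persistence, scheduling, and a way to deliver batch results.

```python
from collections import deque


class LabelModel:
    """Hypothetical stub that tags its input so we can see which path ran."""

    def __init__(self, label):
        self.label = label

    def predict(self, input_data):
        return (self.label, input_data)


class HybridRouter:
    def __init__(self, edge_model, batch_model, critical_regions):
        self.edge_model = edge_model
        self.batch_model = batch_model
        self.critical_regions = set(critical_regions)
        self.pending = deque()  # requests waiting for the next batch run

    def handle(self, region, input_data):
        """Answer immediately for critical regions; otherwise queue for batch."""
        if region in self.critical_regions:
            return self.edge_model.predict(input_data)
        self.pending.append(input_data)
        return None  # the result arrives later, from run_batch()

    def run_batch(self):
        """Process every queued request in one go, for example on a schedule."""
        items = list(self.pending)
        self.pending.clear()
        return [self.batch_model.predict(x) for x in items]


router = HybridRouter(LabelModel("edge"), LabelModel("batch"), {"region-a"})
print(router.handle("region-a", 1))  # ('edge', 1)
print(router.handle("region-b", 2))  # None
print(router.run_batch())            # [('batch', 2)]
```

Only requests from critical regions pay for the low-latency edge path; everything else waits for the cheaper batch run.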
