Batch prediction vs real-time serving in MLOps - Performance Comparison
We want to understand how the time needed to make predictions changes when using batch prediction versus real-time serving.
How does the number of predictions affect the time taken in each method?
Analyze the time complexity of the following code snippet.
```python
# Batch prediction example: process the whole dataset in a loop
predictions = []
for data_point in dataset:
    prediction = model.predict(data_point)
    predictions.append(prediction)

# Real-time serving example:
# each incoming request triggers a single prediction
response = model.predict(single_request_data)
```
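The snippet above is not runnable on its own, since `model` and `dataset` are placeholders. A minimal self-contained version, using a stub model that simply counts its calls (the `StubModel` class is an assumption for illustration, not a real library API):

```python
class StubModel:
    """Stand-in for a trained model; counts how often predict() is called."""

    def __init__(self):
        self.calls = 0

    def predict(self, data_point):
        self.calls += 1
        return data_point * 2  # dummy "prediction"


model = StubModel()
dataset = list(range(100))

# Batch: one predict() call per data point, so n calls in total.
predictions = [model.predict(x) for x in dataset]
print(model.calls)  # 100
```

Counting calls this way makes the later complexity argument concrete: the batch loop performs exactly n prediction operations for n data points.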
This code shows batch prediction processing many data points in a loop, and real-time serving handling one request at a time.
Identify the constructs that repeat: loops, recursion, array traversals.
- Primary operation: calling `model.predict()` for each data point.
- How many times: in batch, once per data point in the dataset; in real-time, once per request.
As the number of data points grows, total batch prediction time grows proportionally, because the loop calls `model.predict()` once for every point.
Real-time serving handles one prediction at a time, so the time per prediction stays about the same regardless of the total number of requests.
| Input Size (n) | Approx. Operations (Batch) | Approx. Operations (Real-time) |
|---|---|---|
| 10 | 10 predictions | 1 prediction per request |
| 100 | 100 predictions | 1 prediction per request |
| 1000 | 1000 predictions | 1 prediction per request |
Pattern observation: Batch time grows with number of data points; real-time time per prediction stays constant.
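The pattern in the table can also be observed with wall-clock timing. A minimal sketch, assuming a dummy `predict` function that does a fixed amount of work per call in place of real model inference:

```python
import time


def predict(x):
    # Stand-in for model inference: a fixed amount of work per call.
    return sum(i * i for i in range(2000))


timings = {}
for n in (10, 100, 1000):
    start = time.perf_counter()
    for x in range(n):
        predict(x)
    timings[n] = time.perf_counter() - start
    print(f"n={n}: total {timings[n]:.4f}s, per prediction {timings[n] / n:.6f}s")
```

The total time should grow roughly linearly with n, while the per-prediction time stays roughly constant, matching the table.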
Time Complexity: O(n) for batch, O(1) per request for real-time
This means batch prediction time grows linearly with the number of data points, while real-time serving spends constant time on each individual prediction (though serving n requests still costs O(n) work in total, spread across requests).
[X] Wrong: "Real-time serving takes longer as more requests come in because it processes all requests together like batch."
[OK] Correct: Real-time serving processes each request separately, so the time per prediction stays about the same regardless of total requests.
Understanding how prediction time scales helps you explain trade-offs between batch and real-time systems clearly, a useful skill in many practical MLOps discussions.
What if we parallelize batch prediction to run multiple predictions at the same time? How would the time complexity change?
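One way to explore that question: fan the batch out across a pool of workers. A sketch using `concurrent.futures` with a dummy `predict` function (an assumption for illustration); total work remains O(n), but with p workers the wall-clock time drops to roughly O(n / p) plus coordination overhead. Threads are used here for simplicity: CPU-bound pure-Python inference would need processes instead, though many ML frameworks release the GIL internally during inference.

```python
from concurrent.futures import ThreadPoolExecutor


def predict(x):
    # Stand-in for model inference.
    return x * 2


def batch_predict_parallel(dataset, workers=4):
    """Fan predictions out across a pool of workers.

    Total work is still O(n); with p workers the wall-clock time is
    roughly O(n / p), plus overhead for scheduling the workers.
    map() preserves input order in the results.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(predict, dataset))


print(batch_predict_parallel(range(5)))  # [0, 2, 4, 6, 8]
```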