Reproducible training pipelines in MLOps - Time & Space Complexity
When building reproducible training pipelines, it's important to understand how the pipeline's execution time scales as the data or the number of steps grows.
Analyze the time complexity of the following code snippet.
```python
for batch in data_batches:
    preprocess(batch)       # clean and transform the batch
    train_model(batch)      # update model weights on the batch
    validate_model(batch)   # evaluate the model on the batch
    save_checkpoint()       # persist the current model state
```
This code runs a training pipeline on batches of data, processing each batch through preprocessing, training, validation, and saving checkpoints.
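To make the structure concrete, here is a minimal runnable sketch of the pipeline. The step functions are hypothetical stubs (real implementations would call your actual preprocessing, training, and validation code); the point is that every step runs exactly once per batch.

```python
import time

# Hypothetical stand-ins for the real pipeline steps.
def preprocess(batch):
    return [x * 2 for x in batch]  # e.g. normalize or transform features

def train_model(batch):
    time.sleep(0.001)  # simulate a fixed per-batch training cost

def validate_model(batch):
    time.sleep(0.001)  # simulate a fixed per-batch validation cost

def save_checkpoint():
    pass  # in practice: write model weights to disk

def run_pipeline(data_batches):
    """Run all pipeline steps once per batch; returns the batch count."""
    processed = 0
    for batch in data_batches:
        cleaned = preprocess(batch)
        train_model(cleaned)
        validate_model(cleaned)
        save_checkpoint()
        processed += 1
    return processed

run_pipeline([[1, 2], [3, 4], [5, 6]])  # processes 3 batches
```

Because each step costs roughly constant time per batch, the total runtime is driven by how many times the loop body executes.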
Identify the constructs that repeat work: loops, recursion, and array traversals.
- Primary operation: the loop over data batches, which runs all four pipeline steps.
- How many times: once per batch, so n batches mean n iterations of the loop body.
As the number of batches increases, the total time grows roughly in direct proportion.
| Number of Batches (n) | Approx. Operations |
|---|---|
| 10 | 10 times the pipeline steps |
| 100 | 100 times the pipeline steps |
| 1000 | 1000 times the pipeline steps |
Pattern observation: Doubling the number of batches roughly doubles the total work.
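The doubling pattern can be checked directly. This sketch counts operations under the assumption that each batch triggers a fixed number of steps (four here, matching the snippet above):

```python
def count_operations(n_batches, steps_per_batch=4):
    # Each batch runs preprocess, train, validate, and checkpoint: 4 steps.
    return n_batches * steps_per_batch

# Doubling the batch count exactly doubles the total work.
for n in (10, 100, 1000):
    assert count_operations(2 * n) == 2 * count_operations(n)
```

The ratio stays constant no matter how large n gets, which is exactly what O(n) growth means.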
Time Complexity: O(n)
This means the total time grows linearly with the number of data batches processed.
[X] Wrong: "The pipeline time stays the same no matter how many batches there are."
[OK] Correct: Each batch requires running all steps, so more batches mean more total work and longer time.
Understanding how pipeline time scales helps you design efficient workflows and explain your approach clearly in real projects or interviews.
"What if we parallelize processing batches instead of running them one by one? How would the time complexity change?"