Training data pipeline automation in MLOps - Time & Space Complexity
When automating a training data pipeline, it is important to know how processing time grows as the data size increases: in other words, how the pipeline's execution time changes as we add more data.
Analyze the time complexity of the following pipeline automation code snippet.
```python
for batch in data_batches:
    cleaned = clean_data(batch)
    features = extract_features(cleaned)
    store(features)
```
This code processes data in batches: cleaning, extracting features, and storing results for each batch.
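To make the loop above concrete, here is a minimal runnable sketch. The helper functions `clean_data`, `extract_features`, and `store` are hypothetical stand-ins for real pipeline stages, chosen only to illustrate the per-batch flow:

```python
def clean_data(batch):
    # Placeholder cleaning step: drop missing (None) values.
    return [x for x in batch if x is not None]

def extract_features(cleaned):
    # Placeholder feature step: square each value.
    return [x * x for x in cleaned]

storage = []

def store(features):
    # Placeholder storage step: append to an in-memory list.
    storage.append(features)

data_batches = [[1, None, 2], [3, 4], [None, 5]]

for batch in data_batches:
    cleaned = clean_data(batch)
    features = extract_features(cleaned)
    store(features)

print(storage)  # [[1, 4], [9, 16], [25]]
```

Each batch passes through the same three stages exactly once, which is the pattern the complexity analysis below builds on.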
Look at what repeats as data size grows.
- Primary operation: Looping over each batch of data.
- How many times: Once for every batch in the dataset.
As the number of batches increases, the total work grows in direct proportion to it.
| Input Size (n batches) | Approx. Operations |
|---|---|
| 10 | 10 times the batch processing steps |
| 100 | 100 times the batch processing steps |
| 1000 | 1000 times the batch processing steps |
Pattern observation: Doubling the number of batches roughly doubles the total processing time.
Time Complexity: O(n)
This means the time to run the pipeline grows directly in proportion to the number of data batches.
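The table above can be reproduced with a small counting sketch. Instead of timing real work, we count one operation per pipeline stage per batch, which makes the linear pattern explicit (the stage names in the comments are illustrative):

```python
def run_pipeline(data_batches):
    # Count one unit of work per stage, per batch.
    operations = 0
    for batch in data_batches:
        operations += 1  # clean_data
        operations += 1  # extract_features
        operations += 1  # store
    return operations

for n in (10, 100, 1000):
    batches = [[0]] * n
    print(n, run_pipeline(batches))
# Each batch costs 3 operations, so 10 -> 30, 100 -> 300, 1000 -> 3000:
# doubling the number of batches doubles the total work, i.e. O(n).
```

The constant factor (3 stages per batch) does not change the growth rate, which is why the complexity is stated simply as O(n).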
[X] Wrong: "The pipeline time stays the same no matter how much data we add."
[OK] Correct: Each batch requires processing steps, so more batches mean more total work and longer time.
Understanding how pipeline time scales with data size shows you can predict and manage workload growth, a key skill in real projects.
"What if we parallelize batch processing? How would that affect the time complexity?"
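As a starting point for that question, here is a hedged sketch using Python's standard `concurrent.futures` thread pool. `process_batch` is a hypothetical stand-in for the clean/extract/store stages. With p workers, wall-clock time can drop toward O(n / p), but the total work is still O(n): every batch must be processed once, so the asymptotic time complexity is unchanged.

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    # Hypothetical stand-in for clean -> extract -> store on one batch.
    return [x * x for x in batch if x is not None]

data_batches = [[1, None, 2], [3, 4], [None, 5], [6]]

# map() distributes batches across workers while preserving input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_batch, data_batches))

print(results)  # [[1, 4], [9, 16], [25], [36]]
```

In practice the speedup is limited by the number of workers, scheduling overhead, and any shared bottleneck such as the storage step.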