Data pipelines with DVC in MLOps - Time & Space Complexity
When building data pipelines with DVC, it's important to understand how total execution time grows as you add stages or feed in larger data.
Analyze the time complexity of the following DVC pipeline commands.
```shell
dvc run -n preprocess -d raw_data.csv -o processed_data.csv python preprocess.py
dvc run -n train -d processed_data.csv -o model.pkl python train.py
dvc run -n evaluate -d model.pkl -o metrics.json python evaluate.py
```
These commands define a simple three-stage DVC pipeline: preprocess, train, and evaluate. Each stage declares the previous stage's output as its dependency, so the stages form a linear chain. (Note that in DVC 2.0 and later, `dvc run` is deprecated in favor of `dvc stage add` followed by `dvc repro`.)
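Commands like these are recorded in a `dvc.yaml` file that DVC uses to reproduce the pipeline. For the three stages above, the generated file would look roughly like this (the exact layout may vary slightly by DVC version):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - raw_data.csv
    outs:
      - processed_data.csv
  train:
    cmd: python train.py
    deps:
      - processed_data.csv
    outs:
      - model.pkl
  evaluate:
    cmd: python evaluate.py
    deps:
      - model.pkl
    outs:
      - metrics.json
```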
Look for repeated actions or steps in the pipeline execution.
- Primary operation: running a single pipeline stage.
- How many times: once per stage, so n times for a pipeline with n stages.
As you add more stages, the total time grows roughly by adding each stage's time.
| Input Size (n) | Approx. Operations |
|---|---|
| 3 stages | 3 runs |
| 10 stages | 10 runs |
| 100 stages | 100 runs |
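The table above can be modeled with a short sketch: if each stage contributes its own run time, the sequential total is simply their sum. The stage names and timings below are hypothetical, chosen only to illustrate the additive behavior:

```python
# Hypothetical per-stage run times in seconds for the example pipeline.
stage_times = {"preprocess": 2.0, "train": 10.0, "evaluate": 1.0}

def total_sequential_time(times):
    """Total wall-clock time when stages run one after another.

    With n stages this is a sum of n terms, i.e. O(n) in the number of stages.
    """
    return sum(times.values())

print(total_sequential_time(stage_times))  # 13.0 for the three stages above
```

Doubling the number of stages (with similar per-stage costs) roughly doubles this total, which is the linear pattern the table shows.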
Pattern observation: The total execution time grows linearly with the number of pipeline stages.
Time Complexity: O(n)
This means the total time to run the pipeline grows in direct proportion to the number of stages, assuming each stage's own runtime stays roughly constant; in practice a stage's runtime also depends on the size of the data it processes.
[X] Wrong: "Adding more stages won't affect total time much because they run fast."
[OK] Correct: Each stage adds its own run time, so more stages add up and increase total time linearly.
Understanding how pipeline execution time grows helps you design efficient workflows and explain your choices clearly in real projects.
"What if some stages run in parallel instead of sequentially? How would the time complexity change?"