Data pipelines with DVC in MLOps - Time & Space Complexity
When working with data pipelines using DVC, it's important to understand how the time to run pipelines grows as data or steps increase.
We want to know how the pipeline execution time changes when we add more stages or larger data.
Analyze the time complexity of the following DVC pipeline commands.
dvc stage add -n preprocess -d raw_data.csv -o processed_data.csv python preprocess.py
dvc stage add -n train -d processed_data.csv -o model.pkl python train.py
dvc stage add -n evaluate -d model.pkl -o metrics.json python evaluate.py
These commands define a simple DVC pipeline with three stages: preprocess, train, and evaluate, each depending on the output of the previous stage. They use dvc stage add because dvc run is deprecated; running dvc repro afterwards executes the stages in dependency order.
Look for repeated actions or steps in the pipeline execution.
- Primary operation: Running each pipeline stage sequentially.
- How many times: Once per stage, so a pipeline with n stages performs n runs.
As you add more stages, the total time grows roughly by adding each stage's time.
| Number of Stages (n) | Approx. Stage Runs |
|---|---|
| 3 stages | 3 runs |
| 10 stages | 10 runs |
| 100 stages | 100 runs |
Pattern observation: The total execution time grows linearly with the number of pipeline stages.
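Time Complexity: O(n)
This means the total time to run the pipeline grows in direct proportion to the number of stages, assuming each stage takes roughly the same amount of time. More precisely, the total is the sum of the individual stage times, so one slow stage (for example, training on a much larger dataset) can dominate the whole run.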
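The linear growth can be checked with a small model. This is only a sketch: the per-stage run times are made-up numbers, and `sequential_runtime` and `uniform_pipeline` are hypothetical helper names, not part of DVC.

```python
# Hypothetical per-stage run times in seconds; the numbers are illustrative only.
stage_seconds = {"preprocess": 12.0, "train": 90.0, "evaluate": 5.0}


def sequential_runtime(stage_times):
    """Total wall-clock time when every stage runs once, one after another."""
    return sum(stage_times.values())


def uniform_pipeline(n_stages, seconds_per_stage=1.0):
    """Build a toy pipeline of n stages that each take the same time."""
    return {f"stage_{i}": seconds_per_stage for i in range(n_stages)}


print(sequential_runtime(stage_seconds))  # 107.0

# With equal-cost stages, the total grows in step with the stage count.
for n in (3, 10, 100):
    print(n, sequential_runtime(uniform_pipeline(n)))
```

With unequal stage times the total is still a plain sum, which is why the 90-second train stage accounts for most of the 107 seconds in the first example.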
[X] Wrong: "Adding more stages won't affect total time much because they run fast."
[OK] Correct: Each stage adds its own run time, so more stages add up and increase total time linearly.
Understanding how pipeline execution time grows helps you design efficient workflows and explain your choices clearly in real projects.
"What if some stages run in parallel instead of sequentially? How would the time complexity change?"
Practice
What is the purpose of `dvc repro` in a DVC pipeline?
Solution
Step 1: Understand the role of `dvc repro`
This command checks if any inputs or dependencies of pipeline stages have changed.
Step 2: Effect of running `dvc repro`
If changes are detected, it reruns the affected stages to update outputs accordingly.
Final Answer:
To reproduce pipeline stages and update outputs if inputs changed -> Option C
Quick Check:
`dvc repro` updates pipeline outputs [OK]
- Confusing repro with initialization commands
- Thinking repro deletes data
- Assuming repro edits pipeline commands
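To make the "rerun only what is affected" idea concrete, here is a toy model of the three-stage pipeline from the analysis section. It is a simplified sketch, not DVC's actual implementation: `stages_to_rerun` is a hypothetical helper, and it assumes every rerun stage produces changed outputs.

```python
# Toy model of the preprocess -> train -> evaluate pipeline.
pipeline = {
    "preprocess": {"deps": ["raw_data.csv"], "outs": ["processed_data.csv"]},
    "train": {"deps": ["processed_data.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl"], "outs": ["metrics.json"]},
}


def stages_to_rerun(pipeline, changed_files):
    """Return stages whose deps changed, directly or through a rerun upstream stage."""
    dirty = set(changed_files)
    rerun = []
    progress = True
    while progress:  # repeat until no new stage becomes stale
        progress = False
        for name, stage in pipeline.items():
            if name not in rerun and dirty.intersection(stage["deps"]):
                rerun.append(name)
                # Assumption: a rerun stage always writes different outputs.
                dirty.update(stage["outs"])
                progress = True
    return rerun


print(stages_to_rerun(pipeline, ["raw_data.csv"]))  # ['preprocess', 'train', 'evaluate']
print(stages_to_rerun(pipeline, ["model.pkl"]))     # ['evaluate']
print(stages_to_rerun(pipeline, []))                # []
```

In this model a change at the start of the chain invalidates everything downstream, while a change near the end leaves the earlier stages untouched.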
Which command correctly adds a stage that runs `python train.py` and outputs `model.pkl`?
Solution
Step 1: Identify required flags for stage creation
The `dvc stage add` command takes `-n` for the stage name, `-d` for dependencies, and `-o` for outputs.
Step 2: Check which option includes all required flags correctly
`dvc stage add -n train -d train.py -o model.pkl python train.py` uses `-n train`, `-d train.py` (dependency), and `-o model.pkl` with the command `python train.py`.
Final Answer:
`dvc stage add -n train -d train.py -o model.pkl python train.py` -> Option D
Quick Check:
Stage add needs name, dependency, output flags [OK]
- Omitting the dependency with -d
- Using deprecated dvc run instead of stage add
- Mixing order of flags incorrectly
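When stages are created from a script, it can help to assemble the command as an argument list so no flag is forgotten. The sketch below uses only the `-n`, `-d`, and `-o` flags discussed above; `stage_add_argv` is a hypothetical helper name, and the code only builds the command without running it.

```python
import shlex


def stage_add_argv(name, cmd, deps=(), outs=()):
    """Build the argument list for a `dvc stage add` call (does not execute it)."""
    argv = ["dvc", "stage", "add", "-n", name]
    for dep in deps:
        argv += ["-d", dep]  # one -d per dependency
    for out in outs:
        argv += ["-o", out]  # one -o per output
    argv += shlex.split(cmd)  # the stage command comes last
    return argv


argv = stage_add_argv("train", "python train.py", deps=["train.py"], outs=["model.pkl"])
print(shlex.join(argv))
# dvc stage add -n train -d train.py -o model.pkl python train.py
```

Inside an initialized DVC repository, the resulting list could be passed to `subprocess.run(argv, check=True)`.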
dvc.yaml:
stages:
  preprocess:
    cmd: python preprocess.py data/raw data/processed
    deps:
      - data/raw
      - preprocess.py
    outs:
      - data/processed
What happens when you run `dvc repro` after modifying `data/raw`?
Solution
Step 1: Identify dependencies of the preprocess stage
The stage depends on `data/raw` and `preprocess.py`.
Step 2: Effect of changing `data/raw` on `dvc repro`
Changing a dependency triggers a rerun of that stage to update outputs.
Final Answer:
The preprocess stage reruns and updates `data/processed` -> Option A
Quick Check:
Changed input triggers stage rerun [OK]
- Assuming no rerun if only data changes
- Thinking all stages rerun always
- Confusing outputs with dependencies
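Change detection of this kind is usually based on content hashes rather than timestamps. The following sketch imitates the idea for the preprocess stage, under loud assumptions: file contents are held in an in-memory dictionary, `data/raw` is treated as a single blob rather than a directory, and the `recorded` dictionary merely stands in for the hashes a lock file would hold. `fingerprint` and `changed_deps` are hypothetical names.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used to decide whether a dependency changed."""
    return hashlib.md5(content).hexdigest()


def changed_deps(deps, files, recorded):
    """List the dependencies whose current hash differs from the recorded one."""
    return [d for d in deps if fingerprint(files[d]) != recorded.get(d)]


deps = ["data/raw", "preprocess.py"]
# In-memory stand-ins for the files on disk (illustrative contents).
files = {"data/raw": b"a,b\n1,2\n", "preprocess.py": b"print('clean')\n"}
# Hashes saved after the last successful run.
recorded = {d: fingerprint(files[d]) for d in deps}

unchanged = changed_deps(deps, files, recorded)
print(unchanged)  # []

files["data/raw"] = b"a,b\n1,2\n3,4\n"  # simulate editing the raw data
modified = changed_deps(deps, files, recorded)
print(modified)  # ['data/raw']
```

A non-empty result is what would mark the stage as needing a rerun in this model.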
You run `dvc repro` but get an error: `ERROR: failed to reproduce stage 'train': missing dependency 'data/train.csv'`. What is the most likely cause?
Solution
Step 1: Understand the error message
The error says a dependency file is missing, which means DVC cannot find `data/train.csv`.
Step 2: Common causes of missing dependency errors
Usually, the file was deleted, renamed, or moved after the pipeline stage was created.
Final Answer:
The file `data/train.csv` was deleted or moved after pipeline creation -> Option A
Quick Check:
Missing dependency file causes repro error [OK]
- Assuming dvc.yaml missing stage causes this error
- Blaming dvc.lock corruption without evidence
- Forgetting to initialize repo before repro
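A quick pre-flight check can surface this kind of problem before the pipeline runs. The sketch below is an illustration, not a DVC feature: `missing_source_deps` is a hypothetical helper, the stage definition is a hand-written dictionary, and the `train.py` dependency is assumed for the example. It reports dependencies that no stage produces and that are absent on disk.

```python
from pathlib import Path
import tempfile


def missing_source_deps(stages, root):
    """Return (stage, dep) pairs for deps no stage produces and that are absent on disk."""
    produced = {out for stage in stages.values() for out in stage.get("outs", [])}
    missing = []
    for name, stage in stages.items():
        for dep in stage.get("deps", []):
            if dep not in produced and not (Path(root) / dep).exists():
                missing.append((name, dep))
    return missing


# Hypothetical stage definition matching the error message above.
stages = {"train": {"deps": ["data/train.csv", "train.py"], "outs": ["model.pkl"]}}

with tempfile.TemporaryDirectory() as root:
    # Only the script exists; data/train.csv was never created (or was moved).
    (Path(root) / "train.py").write_text("print('training')\n")
    result = missing_source_deps(stages, root)

print(result)  # [('train', 'data/train.csv')]
```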
You want a pipeline with two stages: `extract`, which outputs `data/raw.csv`, and `train`, which depends on `data/raw.csv` and outputs `model.pkl`. Which sequence of commands correctly sets up this pipeline?
Solution
Step 1: Define extract stage with output only
The extract stage produces `data/raw.csv`, so it needs `-n extract` and `-o data/raw.csv` with the command.
Step 2: Define train stage depending on extract output
The train stage depends on `data/raw.csv`, so it needs `-d data/raw.csv`, outputs `model.pkl`, and runs `python train.py`.
Step 3: Confirm correct order and commands
The following sequence correctly adds extract first, then train with proper dependencies and outputs:
dvc stage add -n extract -o data/raw.csv python extract.py
dvc stage add -n train -d data/raw.csv -o model.pkl python train.py
Final Answer:
dvc stage add -n extract -o data/raw.csv python extract.py
dvc stage add -n train -d data/raw.csv -o model.pkl python train.py
-> Option B
Quick Check:
Define stages with correct deps and outputs [OK]
- Adding train stage before extract output exists
- Using dvc add instead of stage add for pipeline steps
- Missing dependencies in train stage
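The link between the two stages comes from matching one stage's output to another stage's dependency. A minimal sketch of how an execution order can be derived from that information is shown here, using the standard library's `graphlib` (Python 3.9+). The `execution_order` helper and the dictionary layout are assumptions for illustration, not DVC internals.

```python
from graphlib import TopologicalSorter

# Stage definitions listed out of order on purpose.
stages = {
    "train": {"deps": ["data/raw.csv"], "outs": ["model.pkl"]},
    "extract": {"deps": [], "outs": ["data/raw.csv"]},
}


def execution_order(stages):
    """Order stages so that each one runs after the stages producing its deps."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # For each stage, collect the upstream stages it depends on.
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)  # ['extract', 'train']
```

If the `-d data/raw.csv` dependency were left off the train stage, the graph would contain no edge between the two stages and nothing would force extract to run first.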
