Data pipelines with DVC
📖 Scenario: You are working on a machine learning project that requires managing data files and processing steps efficiently. You want to use DVC (Data Version Control) to create a simple data pipeline that tracks your data and processing commands.
🎯 Goal: Build a basic DVC pipeline: track a data file, add a stage that runs a processing command, and check the pipeline status.
📋 What You'll Learn
Create a data file named data.csv with sample content
Initialize DVC in the project directory
Add data.csv to DVC tracking
Create a DVC stage that runs a processing command
Show the DVC pipeline status
💡 Why This Matters
🌍 Real World
Data scientists and ML engineers use DVC to manage datasets and processing steps, ensuring reproducibility and easy collaboration.
💼 Career
Understanding DVC pipelines is essential for roles in machine learning operations (MLOps) and data engineering to maintain reliable data workflows.
1. Create the initial data file
Create a file named data.csv with exactly this content:
id,value
1,100
2,200
3,300
Hint: Use printf to write the file with one row per line; echo -e also works in shells that support the -e option.
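One way to write the file from a shell is sketched below. It uses printf because the handling of echo -e differs between shells, and it writes data.csv into the current directory.

```shell
# Write the header and three rows, one per line, into data.csv.
printf 'id,value\n1,100\n2,200\n3,300\n' > data.csv

# Print the file to confirm its contents.
cat data.csv
```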
2. Initialize DVC and add the data file
Run dvc init to initialize DVC in the project directory, then run dvc add data.csv to start tracking data.csv. By default, dvc init expects the directory to be a Git repository, so run git init first if it is not one.
Hint: First run dvc init, then dvc add data.csv to track the data file.
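A minimal sketch of this step follows. It assumes git and dvc are both on the PATH and works in a hypothetical scratch directory named dvc-add-demo. The git init line is an addition to the lesson's commands, included because dvc init expects a Git repository unless --no-scm is passed.

```shell
# Hypothetical scratch directory, recreated on each run so the sketch is repeatable.
rm -rf dvc-add-demo && mkdir dvc-add-demo
(
    cd dvc-add-demo || exit 1
    printf 'id,value\n1,100\n2,200\n3,300\n' > data.csv

    if command -v git >/dev/null 2>&1 && command -v dvc >/dev/null 2>&1; then
        git init -q          # assumption: the project is a Git repository
        dvc init             # creates the .dvc/ directory
        dvc add data.csv     # writes data.csv.dvc and adds data.csv to .gitignore
        ls -a                # list the files DVC created
    else
        echo "git or dvc not found; skipping the DVC commands"
    fi
)
```

data.csv.dvc is a small text file that records a hash of the data. That file, rather than data.csv itself, is what gets committed to Git.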
3. Create a DVC stage for processing
Create a DVC stage named process that runs the command head -n 2 data.csv > processed.csv. Use dvc stage add -n process -d data.csv -o processed.csv "head -n 2 data.csv > processed.csv".
Hint: Use dvc stage add with -n for name, -d for dependency, and -o for output.
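Before wrapping the command in a stage, it can help to run it on its own and see what it produces. The sketch below recreates data.csv from step 1 and runs only the processing command, without DVC. head -n 2 keeps the header plus the first data row.

```shell
# Recreate the input file from step 1.
printf 'id,value\n1,100\n2,200\n3,300\n' > data.csv

# The same command the stage will run: keep the first two lines.
head -n 2 data.csv > processed.csv

cat processed.csv
# prints:
# id,value
# 1,100
```

In the dvc stage add call the command is quoted so that the > redirection becomes part of the stage. Without the quotes, the current shell would apply the redirection to the output of dvc stage add itself.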
4. Show DVC pipeline status
Run the command dvc status to display the current status of the DVC pipeline.
Hint: Simply run dvc status to check if the pipeline is up to date.
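The four steps can be strung together as in the sketch below. It assumes git and dvc are on the PATH, uses a hypothetical scratch directory (dvc-status-demo), and adds a git init call that the lesson does not mention. The exact text printed by dvc status varies between DVC versions, so no output is shown.

```shell
# Hypothetical scratch directory, recreated on each run so the sketch is repeatable.
rm -rf dvc-status-demo && mkdir dvc-status-demo
(
    cd dvc-status-demo || exit 1

    # Step 1: create the data file.
    printf 'id,value\n1,100\n2,200\n3,300\n' > data.csv

    if command -v git >/dev/null 2>&1 && command -v dvc >/dev/null 2>&1; then
        # Step 2: initialize DVC (inside a Git repository) and track the data.
        git init -q
        dvc init
        dvc add data.csv

        # Step 3: define the stage; this writes dvc.yaml but does not run the command.
        dvc stage add -n process -d data.csv -o processed.csv \
            "head -n 2 data.csv > processed.csv"

        # Step 4: the stage has not been run yet, so it is expected to be
        # reported as changed rather than up to date.
        dvc status
    else
        echo "git or dvc not found; skipping the DVC commands"
    fi
)
```

Running dvc repro afterwards would execute the process stage and create processed.csv, after which dvc status should report that everything is up to date.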
Practice
1. What is the main purpose of using dvc repro in a DVC pipeline?
easy
A. To delete all pipeline data and cache
B. To initialize a new DVC repository
C. To reproduce pipeline stages and update outputs if inputs changed
D. To manually edit pipeline stage commands
Solution
Step 1: Understand the role of dvc repro
This command checks if any inputs or dependencies of pipeline stages have changed.
Step 2: Effect of running dvc repro
If changes are detected, it reruns the affected stages to update outputs accordingly.
Final Answer:
To reproduce pipeline stages and update outputs if inputs changed -> Option C
Quick Check:
dvc repro updates pipeline outputs [OK]
Hint: Remember: repro means rerun changed pipeline parts [OK]
Common Mistakes:
Confusing repro with initialization commands
Thinking repro deletes data
Assuming repro edits pipeline commands
2. Which of the following is the correct syntax to add a pipeline stage with DVC that runs python train.py and outputs model.pkl?
easy
A. dvc stage add -n train -o model.pkl python train.py
B. dvc add stage train -o model.pkl python train.py
3. A preprocess stage depends on data/raw and preprocess.py and outputs data/processed. What happens when you run dvc repro after modifying data/raw?
medium
A. The preprocess stage reruns and updates data/processed
B. Nothing happens because only preprocess.py changes trigger rerun
C. The pipeline fails due to missing output specification
D. All pipeline stages rerun regardless of changes
Solution
Step 1: Identify dependencies of the preprocess stage
The stage depends on data/raw and preprocess.py.
Step 2: Effect of changing data/raw on dvc repro
Changing a dependency triggers rerun of that stage to update outputs.
Final Answer:
The preprocess stage reruns and updates data/processed -> Option A
Quick Check:
Changed input triggers stage rerun [OK]
Hint: Change in deps triggers rerun of that stage [OK]
Common Mistakes:
Assuming no rerun if only data changes
Thinking all stages rerun always
Confusing outputs with dependencies
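The rerun-on-change idea can be illustrated without DVC. The sketch below is not DVC's implementation: it uses cksum as a stand-in for the content hashes DVC records in dvc.lock, and the names raw.csv, rows.txt, raw.cksum, run_stage, and repro are all hypothetical.

```shell
# Hypothetical stand-ins: raw.csv is the dependency, rows.txt is the output,
# and raw.cksum plays the role of the hash that DVC records in dvc.lock.
run_stage() {
    wc -l < raw.csv | tr -d ' ' > rows.txt   # the "stage command": count lines
    cksum < raw.csv > raw.cksum              # remember the dependency's checksum
}

repro() {
    if [ -f raw.cksum ] && [ "$(cksum < raw.csv)" = "$(cat raw.cksum)" ]; then
        echo "up to date, skipping"
    else
        echo "out of date, running stage"
        run_stage
    fi
}

rm -f raw.cksum rows.txt
printf 'a,b\n1,2\n' > raw.csv
repro                        # no recorded checksum yet, so the stage runs
repro                        # dependency unchanged, so the stage is skipped
printf '3,4\n' >> raw.csv    # modify the dependency
repro                        # checksum differs, so the stage reruns
cat rows.txt                 # prints 3
```

Only the stage whose dependency changed is rerun, which is why modifying data alone is enough to trigger a rerun even when the script is untouched.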
4. You run dvc repro but get an error: ERROR: failed to reproduce stage 'train': missing dependency 'data/train.csv'. What is the most likely cause?
medium
A. The file data/train.csv was deleted or moved after pipeline creation
B. The dvc.yaml file is missing the train stage
C. The dvc.lock file is corrupted
D. You forgot to run dvc init before dvc repro
Solution
Step 1: Understand the error message
The error says a dependency file is missing, which means DVC cannot find data/train.csv.
Step 2: Common causes of missing dependency errors
Usually, the file was deleted, renamed, or moved after the pipeline stage was created.
Final Answer:
The file data/train.csv was deleted or moved after pipeline creation -> Option A
Quick Check:
Missing dependency file causes repro error [OK]
Hint: Check if all dependency files exist before repro [OK]
Common Mistakes:
Assuming dvc.yaml missing stage causes this error
Blaming dvc.lock corruption without evidence
Forgetting to initialize repo before repro
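The hint about checking dependency files before running dvc repro can be sketched as a small shell helper. check_deps is a hypothetical function, not part of DVC, and the deps-demo paths exist only to make the example self-contained.

```shell
# Report every path that does not exist; return non-zero if any is missing.
check_deps() {
    rc=0
    for dep in "$@"; do
        if [ ! -e "$dep" ]; then
            echo "missing dependency: $dep"
            rc=1
        fi
    done
    return "$rc"
}

# Demo layout: train.py is present, data/train.csv is deliberately absent.
mkdir -p deps-demo/data
touch deps-demo/train.py
rm -f deps-demo/data/train.csv

if check_deps deps-demo/train.py deps-demo/data/train.csv > deps-report.txt; then
    echo "all dependencies present; safe to run dvc repro"
else
    cat deps-report.txt
fi
# prints: missing dependency: deps-demo/data/train.csv
```

In a real project the arguments would be the paths listed under deps for the stage in dvc.yaml.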
5. You want to create a DVC pipeline with two stages: extract that outputs data/raw.csv, and train that depends on data/raw.csv and outputs model.pkl. Which sequence of commands correctly sets up this pipeline?