Recall & Review
beginner
What is DVC in the context of data pipelines?
DVC (Data Version Control) is a tool for managing data, models, and experiments in machine learning projects. It tracks changes to data much as a version control system tracks changes to code, and it automates pipelines.
beginner
How does DVC help in building data pipelines?
DVC lets you define stages in a pipeline with commands and dependencies. It tracks inputs and outputs, so when data or code changes, only necessary steps rerun, saving time and ensuring reproducibility.
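The "only necessary steps rerun" idea can be illustrated with a small Python model. This is a simplified sketch, not DVC's implementation: it assumes a stage is just a list of dependency files, and the names `file_hash`, `snapshot`, and `stage_is_stale` are hypothetical. SHA-256 is an arbitrary choice for the sketch; any content hash would do.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(deps: List[Path]) -> Dict[str, str]:
    """Record the current hash of every dependency after a successful run."""
    return {str(dep): file_hash(dep) for dep in deps}


def stage_is_stale(deps: List[Path], recorded: Dict[str, str]) -> bool:
    """A stage needs a rerun if any dependency's hash differs from the recorded one."""
    return any(recorded.get(str(dep)) != file_hash(dep) for dep in deps)
```

Calling `snapshot` after a run and `stage_is_stale` before the next one gives the skip-or-rerun decision: unchanged files produce the same hashes, so the stage is skipped.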
intermediate
What is the purpose of the dvc.yaml file?
The dvc.yaml file describes the pipeline stages, their commands, inputs, and outputs. It acts like a recipe that DVC uses to run and reproduce the pipeline steps.
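To make the recipe layout concrete, the sketch below renders a stage definition as dvc.yaml-style text using only the `cmd`, `deps`, and `outs` keys. The helper `render_dvc_yaml` is hypothetical and written for illustration, and listing `train.py` as a dependency is an assumption; in practice the file is usually written by DVC or by hand.

```python
from typing import Any, Dict


def render_dvc_yaml(stages: Dict[str, Dict[str, Any]]) -> str:
    """Render stage definitions as dvc.yaml-style text (cmd, deps, outs only)."""
    lines = ["stages:"]
    for name, spec in stages.items():
        lines.append(f"  {name}:")
        lines.append(f"    cmd: {spec['cmd']}")
        for key in ("deps", "outs"):
            items = spec.get(key, [])
            if items:
                lines.append(f"    {key}:")
                lines.extend(f"      - {item}" for item in items)
    return "\n".join(lines) + "\n"


# Hypothetical stage used only for illustration.
pipeline = {
    "train": {
        "cmd": "python train.py",
        "deps": ["train.py", "data/train.csv"],
        "outs": ["model.pkl"],
    }
}
print(render_dvc_yaml(pipeline))
```

Each stage name maps to one command plus the files it reads (`deps`) and the files it writes (`outs`).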
beginner
How do you run a DVC pipeline after defining it?
You run the command dvc repro. This tells DVC to check for changes and rerun only the pipeline stages that need updating.
beginner
Why is versioning data important in machine learning projects?
Versioning data helps track changes, reproduce results, and collaborate safely. It prevents confusion about which data or model version was used for experiments.
What command initializes a new DVC project?
A. dvc new
B. dvc start
C. dvc create
D. dvc init
The command dvc init sets up DVC in your project folder.
Which file stores the pipeline stages in DVC?
A. dvc.yaml
B. pipeline.json
C. dvc.lock
D. config.yml
The dvc.yaml file defines pipeline stages, commands, inputs, and outputs.
What does the command dvc repro do?
A. Removes old data files
B. Uploads data to cloud storage
C. Runs the pipeline stages that need updating
D. Initializes a new pipeline
dvc repro reruns only the pipeline stages whose inputs or code have changed.
Why should data be versioned in ML projects?
A. To delete old data automatically
B. To track changes and reproduce experiments
C. To speed up model training
D. To encrypt data for security
Versioning data helps keep track of changes and ensures reproducibility.
Which of these is NOT a DVC pipeline component?
A. Containers
B. Commands
C. Stages
D. Inputs/Outputs
DVC pipelines use stages, commands, inputs, and outputs, but not containers.
Explain how DVC helps automate and manage data pipelines in machine learning projects.
Think about how DVC tracks inputs, outputs, and commands to run pipelines efficiently.
Describe the role of the dvc.yaml and dvc.lock files in a DVC pipeline.
One file is like a recipe, the other locks the exact ingredients used.
Practice
1. What is the main purpose of using dvc repro in a DVC pipeline?
easy
A. To delete all pipeline data and cache
B. To initialize a new DVC repository
C. To reproduce pipeline stages and update outputs if inputs changed
D. To manually edit pipeline stage commands
Solution
Step 1: Understand the role of dvc repro
This command checks if any inputs or dependencies of pipeline stages have changed.
Step 2: Effect of running dvc repro
If changes are detected, it reruns the affected stages to update outputs accordingly.
Final Answer:
To reproduce pipeline stages and update outputs if inputs changed -> Option C
Quick Check:
dvc repro updates pipeline outputs [OK]
Hint: Remember: repro means rerun changed pipeline parts [OK]
Common Mistakes:
Confusing repro with initialization commands
Thinking repro deletes data
Assuming repro edits pipeline commands
2. Which of the following is the correct syntax to add a pipeline stage with DVC that runs python train.py and outputs model.pkl?
easy
A. dvc stage add -n train -o model.pkl python train.py
B. dvc add stage train -o model.pkl python train.py
3. What happens when you run dvc repro after modifying data/raw?
medium
A. The preprocess stage reruns and updates data/processed
B. Nothing happens because only preprocess.py changes trigger rerun
C. The pipeline fails due to missing output specification
D. All pipeline stages rerun regardless of changes
Solution
Step 1: Identify dependencies of the preprocess stage
The stage depends on data/raw and preprocess.py.
Step 2: Effect of changing data/raw on dvc repro
Changing a dependency triggers rerun of that stage to update outputs.
Final Answer:
The preprocess stage reruns and updates data/processed -> Option A
Quick Check:
Changed input triggers stage rerun [OK]
Hint: Change in deps triggers rerun of that stage [OK]
Common Mistakes:
Assuming no rerun if only data changes
Thinking all stages rerun always
Confusing outputs with dependencies
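The rule "a changed dependency triggers a rerun of that stage" extends to downstream stages, because one stage's outputs can be another stage's dependencies. The Python sketch below models that propagation. It is an illustration under stated assumptions, not DVC's algorithm: the `train` and `summarize` stages and the function `stages_to_rerun` are hypothetical, and every rerun stage is assumed to change its outputs, which may overestimate what a hash-based tool actually reruns.

```python
from typing import Dict, List, Set

# "preprocess" follows the question above; "train" and "summarize" are hypothetical.
PIPELINE: Dict[str, Dict[str, List[str]]] = {
    "preprocess": {"deps": ["data/raw", "preprocess.py"], "outs": ["data/processed"]},
    "train": {"deps": ["data/processed", "train.py"], "outs": ["model.pkl"]},
    "summarize": {"deps": ["notes.txt"], "outs": ["summary.txt"]},
}


def stages_to_rerun(pipeline: Dict[str, Dict[str, List[str]]], changed: Set[str]) -> List[str]:
    """Return the stages affected by the changed paths, in declaration order.

    Simplification: a stage that reruns is assumed to change all of its outputs.
    """
    dirty: Set[str] = set(changed)
    affected: Set[str] = set()
    progressed = True
    while progressed:  # repeat until no new stage becomes affected
        progressed = False
        for name, spec in pipeline.items():
            if name not in affected and dirty.intersection(spec["deps"]):
                affected.add(name)
                dirty.update(spec["outs"])
                progressed = True
    return [name for name in pipeline if name in affected]


print(stages_to_rerun(PIPELINE, {"data/raw"}))
```

In this model, modifying data/raw affects preprocess and the hypothetical train stage, while the unrelated summarize stage is left alone.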
4. You run dvc repro but get an error: ERROR: failed to reproduce stage 'train': missing dependency 'data/train.csv'. What is the most likely cause?
medium
A. The file data/train.csv was deleted or moved after pipeline creation
B. The dvc.yaml file is missing the train stage
C. The dvc.lock file is corrupted
D. You forgot to run dvc init before dvc repro
Solution
Step 1: Understand the error message
The error says a dependency file is missing, which means DVC cannot find data/train.csv.
Step 2: Common causes of missing dependency errors
Usually, the file was deleted, renamed, or moved after the pipeline stage was created.
Final Answer:
The file data/train.csv was deleted or moved after pipeline creation -> Option A
Quick Check:
Missing dependency file causes repro error [OK]
Hint: Check if all dependency files exist before repro [OK]
Common Mistakes:
Assuming dvc.yaml missing stage causes this error
Blaming dvc.lock corruption without evidence
Forgetting to initialize repo before repro
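The hint about checking dependency files can be turned into a small pre-flight check. The following Python sketch assumes the stage dependencies have already been collected into a dictionary; the function `missing_dependencies` is hypothetical and not part of DVC. It only suits source files, since a dependency produced by an earlier stage may legitimately not exist before the first run.

```python
from pathlib import Path
from typing import Dict, List


def missing_dependencies(stage_deps: Dict[str, List[str]], root: Path) -> Dict[str, List[str]]:
    """Map each stage name to the dependency paths that do not exist under root."""
    missing: Dict[str, List[str]] = {}
    for stage, deps in stage_deps.items():
        absent = [dep for dep in deps if not (root / dep).exists()]
        if absent:
            missing[stage] = absent
    return missing
```

For example, `missing_dependencies({"train": ["data/train.csv"]}, Path("."))` returns an empty dictionary when the file is present and names the stage and path when it is not.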
5. You want to create a DVC pipeline with two stages: extract that outputs data/raw.csv, and train that depends on data/raw.csv and outputs model.pkl. Which sequence of commands correctly sets up this pipeline?