Data pipelines with DVC in MLOps - Commands & Configuration

Introduction
Data pipelines help organize and automate steps in machine learning projects, like preparing data and training models. DVC makes it easy to track these steps and their data, so you can reproduce results and share work with others.
Typical situations where a DVC pipeline helps:
  • When you want to keep track of changes in your data and code together.
  • When you need to automate data processing and model training steps.
  • When you want to share your ML project with teammates and ensure they get the same results.
  • When you want to avoid running each step by hand and risking mistakes.
  • When you want to save storage: DVC's cache keeps one copy of each file version, so unchanged data is not duplicated.
Config File - dvc.yaml
dvc.yaml
stages:
  prepare:
    cmd: python prepare.py data/raw data/prepared
    deps:
      - prepare.py
      - data/raw
    outs:
      - data/prepared
  train:
    cmd: python train.py data/prepared model.pkl
    deps:
      - train.py
      - data/prepared
    outs:
      - model.pkl

This dvc.yaml file defines two pipeline stages: prepare and train.

The prepare stage runs a Python script to process raw data into prepared data. It lists the script and raw data as dependencies and the prepared data as output.

The train stage runs a training script using the prepared data and produces a model file. It lists the training script and prepared data as dependencies and the model file as output.

DVC uses this file to know what commands to run, what files to watch for changes, and what files to save as results.
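To make the link between stages concrete, here is a minimal Python sketch of how a run order can be derived from the deps and outs listed above. It is an illustration only, not DVC's implementation: the STAGES dict and the execution_order helper are hypothetical names, and the dict simply restates the dvc.yaml shown earlier.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The two stages from the dvc.yaml above, restated as a plain dict (hypothetical).
STAGES = {
    "prepare": {"deps": ["prepare.py", "data/raw"], "outs": ["data/prepared"]},
    "train": {"deps": ["train.py", "data/prepared"], "outs": ["model.pkl"]},
}


def execution_order(stages):
    """Return stage names so that every producer runs before its consumers."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage must wait for any stage that produces one of its dependencies.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))  # ['prepare', 'train']
```

Because train lists data/prepared as a dependency and prepare lists it as an output, prepare always comes first, regardless of the order in which the stages are written in the file.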

Commands
Initialize DVC in the current project folder to start tracking data and pipelines.
Terminal
dvc init
Expected Output
Initialized DVC repository. You can now commit the changes to git. (Exact wording varies by DVC version.)
Tell DVC to track the raw data folder so it can manage its versions and storage.
Terminal
dvc add data/raw
Expected Output
A progress bar, followed by a hint such as:
To track the changes with git, run:
git add data/raw.dvc data/.gitignore
(Output abridged; exact wording varies by DVC version.)
Run the pipeline stages defined in dvc.yaml in the correct order, skipping unchanged steps.
Terminal
dvc repro
Expected Output
Running stage 'prepare':
> python prepare.py data/raw data/prepared
Running stage 'train':
> python train.py data/prepared model.pkl
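The "skip unchanged steps" behaviour can be pictured as a hash comparison. The sketch below is a simplified illustration under stated assumptions, not DVC's actual code: file_hash and needs_rerun are hypothetical helpers, the recorded dict plays the role of a lock file, and SHA-256 is an arbitrary choice of hash for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Hash a file's bytes; any change in content gives a different digest."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_rerun(deps, recorded):
    """True if any dependency's current hash differs from the recorded one."""
    return any(file_hash(dep) != recorded.get(dep) for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "prepare.py"
    script.write_text("print('v1')\n")
    deps = [str(script)]

    # After a first run, remember the hash of every dependency.
    recorded = {dep: file_hash(dep) for dep in deps}
    unchanged = needs_rerun(deps, recorded)

    # Edit the dependency, then check again.
    script.write_text("print('v2')\n")
    changed = needs_rerun(deps, recorded)

print(unchanged, changed)  # False True
```

A stage whose dependencies all match their recorded hashes can be skipped; a single changed dependency is enough to mark it for a rerun.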
Display the pipeline graph in the terminal to see the order of stages and dependencies.
Terminal
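dvc dag
Expected Output
An ASCII graph with one box per stage: prepare at the top, connected by an edge to train below it.
dvc dag - Prints the pipeline graph as ASCII art by default. It replaces 'dvc pipeline show --ascii', which older DVC releases used and which does not work with dvc.yaml pipelines.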
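Add the pipeline definition (dvc.yaml), the lock file that records the hashes of the last run (dvc.lock), and the data tracking files to Git so the project and data versions are saved together.
Terminal
git add dvc.yaml dvc.lock data/raw.dvc data/.gitignore .gitignore
Expected Output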
No output (command runs silently)
Key Concept

If you remember nothing else from this pattern, remember: DVC pipelines automate and track your data and code steps so you can reproduce and share ML projects easily.

Common Mistakes
Not adding dvc.yaml, dvc.lock, and the .dvc files to Git after creating or changing the pipeline.
Without these files in Git, teammates (or you, later) won't have the pipeline definition, the recorded hashes, or the data tracking info, which breaks reproducibility.
Always commit dvc.yaml, dvc.lock, .dvc files, and .gitignore changes to Git after modifying or running the pipeline.
Running 'dvc repro' without first tracking raw data with 'dvc add'.
'dvc repro' still detects changes to data/raw through the hashes it records in dvc.lock, but the raw data itself is never stored in the DVC cache, so earlier versions cannot be restored and teammates cannot fetch it with 'dvc pull'.
Use 'dvc add' on raw data so DVC versions and stores it alongside the pipeline.
Summary
Initialize DVC in your project with 'dvc init' to start tracking data and pipelines.
Use 'dvc add' to track raw data files or folders so DVC manages their versions.
Define pipeline stages in dvc.yaml with commands, dependencies, and outputs.
Run the pipeline with 'dvc repro' to execute steps in order and skip unchanged ones.
Commit dvc.yaml, dvc.lock, and .dvc files to Git to share the pipeline and data tracking with others.

Practice

1. What is the main purpose of using dvc repro in a DVC pipeline?
easy
A. To delete all pipeline data and cache
B. To initialize a new DVC repository
C. To reproduce pipeline stages and update outputs if inputs changed
D. To manually edit pipeline stage commands

Solution

  1. Step 1: Understand the role of dvc repro

    This command checks if any inputs or dependencies of pipeline stages have changed.
  2. Step 2: Effect of running dvc repro

    If changes are detected, it reruns the affected stages to update outputs accordingly.
  3. Final Answer:

    To reproduce pipeline stages and update outputs if inputs changed -> Option C
  4. Quick Check:

    dvc repro updates pipeline outputs [OK]
Hint: Remember: repro means rerun changed pipeline parts [OK]
Common Mistakes:
  • Confusing repro with initialization commands
  • Thinking repro deletes data
  • Assuming repro edits pipeline commands
2. Which of the following commands adds a pipeline stage named train that runs python train.py, tracks train.py as a dependency, and outputs model.pkl?
easy
A. dvc stage add -n train -o model.pkl python train.py
B. dvc add stage train -o model.pkl python train.py
C. dvc run -n train -o model.pkl python train.py
D. dvc stage add -n train -d train.py -o model.pkl python train.py

Solution

  1. Step 1: Identify required flags for stage creation

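    Only -n (name) is mandatory for dvc stage add, but a well-defined stage also declares its dependencies with -d and its outputs with -o, so DVC knows when to rerun it and which files to track.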
  2. Step 2: Check which option includes all required flags correctly

    dvc stage add -n train -d train.py -o model.pkl python train.py uses -n train, -d train.py (dependency), and -o model.pkl with the command python train.py.
  3. Final Answer:

    dvc stage add -n train -d train.py -o model.pkl python train.py -> Option D
  4. Quick Check:

    A complete stage has a name, its dependencies, and its outputs [OK]
Hint: Stage add takes -n (name, required), -d (deps), -o (outputs) [OK]
Common Mistakes:
  • Omitting the dependency with -d
  • Using deprecated dvc run instead of stage add
  • Mixing order of flags incorrectly
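As a way to remember the flag layout, the following Python sketch assembles a dvc stage add command line from a stage description. The stage_add_command helper is a hypothetical name introduced for illustration; it only builds the string and does not run DVC. It assumes the stage command is a simple one without shell operators.

```python
import shlex


def stage_add_command(name, cmd, deps=(), outs=()):
    """Build a `dvc stage add` command line from a stage description."""
    args = ["dvc", "stage", "add", "-n", name]
    for dep in deps:
        args += ["-d", dep]  # one -d per dependency
    for out in outs:
        args += ["-o", out]  # one -o per output
    # The command the stage runs goes last, after all flags.
    args += shlex.split(cmd)
    return shlex.join(args)


print(stage_add_command("train", "python train.py",
                        deps=["train.py"], outs=["model.pkl"]))
# dvc stage add -n train -d train.py -o model.pkl python train.py
```

The printed line matches Option D: name first, then dependencies and outputs, then the command itself.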
3. Given this DVC pipeline stage definition in dvc.yaml:
stages:
  preprocess:
    cmd: python preprocess.py data/raw data/processed
    deps:
      - data/raw
      - preprocess.py
    outs:
      - data/processed
What happens when you run dvc repro after modifying data/raw?
medium
A. The preprocess stage reruns and updates data/processed
B. Nothing happens because only preprocess.py changes trigger rerun
C. The pipeline fails due to missing output specification
D. All pipeline stages rerun regardless of changes

Solution

  1. Step 1: Identify dependencies of the preprocess stage

    The stage depends on data/raw and preprocess.py.
  2. Step 2: Effect of changing data/raw on dvc repro

    Changing a dependency triggers rerun of that stage to update outputs.
  3. Final Answer:

    The preprocess stage reruns and updates data/processed -> Option A
  4. Quick Check:

    Changed input triggers stage rerun [OK]
Hint: Change in deps triggers rerun of that stage [OK]
Common Mistakes:
  • Assuming no rerun if only data changes
  • Thinking all stages rerun always
  • Confusing outputs with dependencies
4. You run dvc repro but get an error: ERROR: failed to reproduce stage 'train': missing dependency 'data/train.csv'. What is the most likely cause?
medium
A. The file data/train.csv was deleted or moved after pipeline creation
B. The dvc.yaml file is missing the train stage
C. The dvc.lock file is corrupted
D. You forgot to run dvc init before dvc repro

Solution

  1. Step 1: Understand the error message

    The error says a dependency file is missing, which means DVC cannot find data/train.csv.
  2. Step 2: Common causes of missing dependency errors

    Usually, the file was deleted, renamed, or moved after the pipeline stage was created.
  3. Final Answer:

    The file data/train.csv was deleted or moved after pipeline creation -> Option A
  4. Quick Check:

    Missing dependency file causes repro error [OK]
Hint: Check if all dependency files exist before repro [OK]
Common Mistakes:
  • Assuming dvc.yaml missing stage causes this error
  • Blaming dvc.lock corruption without evidence
  • Forgetting to initialize repo before repro
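The hint above can be turned into a small pre-flight check. This is a sketch under stated assumptions, not a DVC feature: missing_deps is a hypothetical helper, and the stages dict is a hand-written stand-in for the train stage from the error message. Dependencies produced by another stage are skipped, since they may legitimately not exist before the pipeline runs.

```python
import tempfile
from pathlib import Path


def missing_deps(stages, root="."):
    """Return {stage: [paths]} for deps that no stage produces and that are absent on disk."""
    produced = {out for spec in stages.values() for out in spec.get("outs", [])}
    report = {}
    for name, spec in stages.items():
        missing = [
            dep for dep in spec.get("deps", [])
            if dep not in produced and not (Path(root) / dep).exists()
        ]
        if missing:
            report[name] = missing
    return report


# Hypothetical stage description matching the error in the question.
stages = {"train": {"deps": ["train.py", "data/train.csv"], "outs": ["model.pkl"]}}

with tempfile.TemporaryDirectory() as root:
    # train.py exists, but data/train.csv was never created (or was moved).
    (Path(root) / "train.py").write_text("# training script\n")
    result = missing_deps(stages, root)

print(result)  # {'train': ['data/train.csv']}
```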
5. You want to create a DVC pipeline with two stages: extract that outputs data/raw.csv, and train that depends on data/raw.csv and outputs model.pkl. Which sequence of commands correctly sets up this pipeline?
hard
A. dvc stage add -n train -o model.pkl python train.py
   dvc stage add -n extract -d data/raw.csv -o data/raw.csv python extract.py
B. dvc stage add -n extract -o data/raw.csv python extract.py
   dvc stage add -n train -d data/raw.csv -o model.pkl python train.py
C. dvc run -n extract -o data/raw.csv python extract.py
   dvc run -n train -d data/raw.csv -o model.pkl python train.py
D. dvc add data/raw.csv
   dvc add model.pkl

Solution

  1. Step 1: Define extract stage with output only

    Extract stage produces data/raw.csv so it needs -n extract and -o data/raw.csv with the command.
  2. Step 2: Define train stage depending on extract output

    Train stage depends on data/raw.csv so it needs -d data/raw.csv, outputs model.pkl, and runs python train.py.
  3. Step 3: Confirm the dependencies and outputs line up

    Option B declares data/raw.csv as the output of extract and as a dependency of train, so DVC links the two stages. DVC derives the run order from these declarations, not from the order in which the stages were added.
  4. Final Answer:

    dvc stage add -n extract -o data/raw.csv python extract.py, followed by dvc stage add -n train -d data/raw.csv -o model.pkl python train.py -> Option B
  5. Quick Check:

    Define stages with correct deps and outputs [OK]
Hint: Add extract stage first, then train with dependency on extract output [OK]
Common Mistakes:
  • Listing data/raw.csv as both a dependency and an output of the same stage
  • Using dvc add instead of stage add for pipeline steps
  • Missing dependencies in train stage