Data pipelines with DVC in MLOps - Commands & Configuration

Introduction

Data pipelines help organize and automate steps in machine learning projects, like preparing data and training models. DVC makes it easy to track these steps and their data, so you can reproduce results and share work with others.

Use DVC pipelines in situations like these:

When you want to keep track of changes in your data and code together.

When you need to automate data processing and model training steps.

When you want to share your ML project with teammates and ensure they get the same results.

When you want to avoid running each step manually and risking mistakes.

When you want to save storage, since DVC's cache stores identical data only once.

Config File - dvc.yaml

dvc.yaml

stages:
  prepare:
    cmd: python prepare.py data/raw data/prepared
    deps:
      - prepare.py
      - data/raw
    outs:
      - data/prepared
  train:
    cmd: python train.py data/prepared model.pkl
    deps:
      - train.py
      - data/prepared
    outs:
      - model.pkl

This dvc.yaml file defines two pipeline stages: prepare and train.

The prepare stage runs a Python script to process raw data into prepared data. It lists the script and raw data as dependencies and the prepared data as output.

The train stage runs a training script using the prepared data and produces a model file. It lists the training script and prepared data as dependencies and the model file as output.

DVC uses this file to know what commands to run, what files to watch for changes, and what files to save as results.
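
For context, the prepare stage's command passes an input directory and an output directory as positional arguments. A minimal sketch of what such a script might look like follows. The cleaning step (dropping blank lines from CSV files) and the function name prepare are assumptions for illustration, not part of the original project. The script creates the output directory itself, since DVC removes a stage's outputs before rerunning it.

```python
# prepare.py (hypothetical sketch), invoked as:
#   python prepare.py data/raw data/prepared
import sys
from pathlib import Path


def prepare(raw_dir: str, out_dir: str) -> int:
    """Copy every CSV from raw_dir to out_dir, dropping blank lines.

    Returns the number of files written. The cleaning rule is an
    assumption made for this example.
    """
    src, dst = Path(raw_dir), Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = 0
    for csv_path in sorted(src.glob("*.csv")):
        lines = csv_path.read_text().splitlines()
        kept = [line for line in lines if line.strip()]
        (dst / csv_path.name).write_text("\n".join(kept) + "\n")
        written += 1
    return written


def main() -> None:
    if len(sys.argv) != 3:
        print("usage: python prepare.py <raw_dir> <out_dir>")
        return
    count = prepare(sys.argv[1], sys.argv[2])
    print(f"prepared {count} file(s)")


if __name__ == "__main__":
    main()
```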

Commands

Initialize DVC in the current project folder to start tracking data and pipelines.

Terminal

dvc init

Expected Output

Initialized DVC repository. You can now commit the changes to git.

Tell DVC to track the raw data folder so it can manage its versions and storage.

Terminal

dvc add data/raw

Expected Output

Adding 'data/raw' to DVC tracking. Computing checksum... Adding to cache. Saving 'data/raw.dvc'.
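
The "Computing checksum" and "Adding to cache" messages describe content-addressed storage: data is identified by a hash of its contents, and a copy is stored under that hash. The sketch below illustrates the idea for a single file. It is a toy model, not DVC's implementation; the function name add_to_cache and the two-level directory layout are assumptions for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def add_to_cache(file_path: str, cache_dir: str) -> str:
    """Store a copy of file_path in cache_dir, keyed by its MD5 hash.

    Hypothetical helper for illustration only.
    """
    data = Path(file_path).read_bytes()
    digest = hashlib.md5(data).hexdigest()
    # Toy layout: the first two hex characters form a subdirectory.
    target = Path(cache_dir) / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(file_path, target)
    return digest
```

Because the cache key depends only on content, two files with the same bytes map to the same cache entry, which is where the storage savings come from.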

Run the pipeline stages defined in dvc.yaml in the correct order, skipping unchanged steps.

Terminal

dvc repro

Expected Output

Running stage 'prepare':
> python prepare.py data/raw data/prepared
Running stage 'train':
> python train.py data/prepared model.pkl
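
Skipping unchanged steps can be pictured as a comparison between the current hashes of a stage's dependencies and the hashes recorded after its last run (DVC keeps such a record in dvc.lock). A toy sketch under that assumption follows. The names file_hash, stage_changed, and recorded are hypothetical, and real DVC also handles directories, outputs, and changes to the stage command.

```python
import hashlib
from pathlib import Path


def file_hash(path: str) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def stage_changed(deps: list[str], recorded: dict[str, str]) -> bool:
    """True if any dependency is new or differs from its recorded hash.

    `recorded` maps dependency path to the hash saved after the last
    run (a stand-in for dvc.lock in this illustration).
    """
    return any(recorded.get(dep) != file_hash(dep) for dep in deps)
```

A stage would be rerun when stage_changed returns True and skipped otherwise.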

Display the pipeline graph in the terminal to see the order of stages and dependencies.

Terminal

dvc dag

Expected Output

prepare
   |
 train

(Simplified; DVC draws each stage as a box.)

--dot - Print the graph in Graphviz DOT format instead of the default ASCII drawing
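
The stage order shown in the graph follows from the file: a stage that lists another stage's output as a dependency must run after it. The sketch below derives that order for the two stages defined earlier using Python's standard graphlib module (Python 3.9+). The stages dictionary is a hand-copied stand-in for dvc.yaml, and run_order is a hypothetical helper, not a DVC API.

```python
from graphlib import TopologicalSorter

# Hand-copied from the dvc.yaml shown earlier (deps and outs only).
stages = {
    "prepare": {"deps": ["prepare.py", "data/raw"], "outs": ["data/prepared"]},
    "train": {"deps": ["train.py", "data/prepared"], "outs": ["model.pkl"]},
}


def run_order(stages: dict) -> list[str]:
    """Order stages so each runs after the stages that produce its deps."""
    # Map every output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # For each stage, collect the stages it must wait for.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(run_order(stages))  # ['prepare', 'train']
```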

Add DVC pipeline files and data tracking files to Git so the project and data versions are saved together.

Terminal

git add dvc.yaml dvc.lock data/raw.dvc data/.gitignore .gitignore

Expected Output

No output (command runs silently)

Key Concept

If you remember nothing else from this pattern, remember: DVC pipelines automate and track your data and code steps so you can reproduce and share ML projects easily.

Common Mistakes

Not adding dvc.yaml, dvc.lock, and the .dvc files to Git after creating or changing the pipeline.

Without these files in Git, teammates (or you, later) won't have the pipeline definition, the recorded hashes of each stage's inputs and outputs, or the data tracking info, which breaks reproducibility.

Always commit dvc.yaml, dvc.lock, .dvc files, and .gitignore changes to Git after modifying or rerunning the pipeline.

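Running 'dvc repro' without first adding raw data with 'dvc add'.

The pipeline still runs and still detects changes to data/raw, because DVC hashes every dependency listed in dvc.yaml. But the raw data itself is never stored in the DVC cache, so earlier versions cannot be restored or shared with teammates.

Use 'dvc add' on raw data before running the pipeline so DVC versions the data along with the pipeline.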

Summary

Initialize DVC in your project with 'dvc init' to start tracking data and pipelines.

Use 'dvc add' to track raw data files or folders so DVC manages their versions.

Define pipeline stages in dvc.yaml with commands, dependencies, and outputs.

Run the pipeline with 'dvc repro' to execute steps in order and skip unchanged ones.

Commit dvc.yaml, dvc.lock, and .dvc files to Git to share pipeline and data tracking with others.

Practice

1. What is the main purpose of using dvc repro in a DVC pipeline?

easy

A. To delete all pipeline data and cache

B. To initialize a new DVC repository

C. To reproduce pipeline stages and update outputs if inputs changed

D. To manually edit pipeline stage commands

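Solution

Step 1: Understand the role of dvc repro. It reads the stages in dvc.yaml and checks whether each stage's dependencies have changed since the last run.

Step 2: Effect of running dvc repro. Stages whose inputs changed are rerun and their outputs updated; unchanged stages are skipped.

Final Answer: C. To reproduce pipeline stages and update outputs if inputs changed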
