
Data pipelines with DVC in MLOps - Step-by-Step Execution

Process Flow - Data pipelines with DVC
1. Define stages in dvc.yaml
2. Run dvc repro to execute the pipeline
3. DVC checks dependencies and outputs
4. DVC executes the command for each stage
5. DVC saves outputs and updates dvc.lock
6. Track data and pipeline with git and dvc
7. Repeat: modify data/code -> dvc repro -> track changes
This flow shows how DVC runs a data pipeline by defining stages, executing them, tracking outputs, and updating pipeline state.
Execution Sample
stages:
  preprocess:
    cmd: python preprocess.py data/raw data/preprocessed
    deps:
      - data/raw
      - preprocess.py
    outs:
      - data/preprocessed
This dvc.yaml snippet defines a pipeline stage 'preprocess' that runs a Python script with input dependencies and output data.
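The source does not show preprocess.py itself, so the following is only a minimal sketch of what such a script might look like. The function name `preprocess`, the choice of .txt files, and the cleaning rule (trim whitespace, drop blank lines) are all assumptions made for illustration. The only detail taken from the stage definition is the calling convention: input directory first, output directory second.

```python
"""Hypothetical preprocess.py; the real script is not shown in the source."""
import sys
from pathlib import Path


def preprocess(raw_dir, out_dir):
    """Clean every .txt file in raw_dir and write the result to out_dir.

    Cleaning here is an assumed example rule: trim each line and drop
    blank lines. Returns the number of files written.
    """
    src, dst = Path(raw_dir), Path(out_dir)
    # Create the output directory if it is missing when the stage runs.
    dst.mkdir(parents=True, exist_ok=True)
    written = 0
    for path in sorted(src.glob("*.txt")):
        lines = [line.strip() for line in path.read_text(encoding="utf-8").splitlines()]
        cleaned = [line for line in lines if line]
        (dst / path.name).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
        written += 1
    return written


# Mirrors the stage's cmd: python preprocess.py data/raw data/preprocessed
if __name__ == "__main__" and len(sys.argv) == 3:
    count = preprocess(sys.argv[1], sys.argv[2])
    print(f"wrote {count} file(s) to {sys.argv[2]}")
```

Because the script reads only from its first argument and writes only to its second, the deps and outs listed in the stage describe everything it touches, which is what lets DVC decide reliably when to rerun it.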
Process Table
| Step | Action | Stage | Dependencies Checked | Command Executed | Outputs Saved | dvc.lock Updated |
|---|---|---|---|---|---|---|
| 1 | Start pipeline run | - | - | - | - | - |
| 2 | Check 'preprocess' stage dependencies | preprocess | data/raw, preprocess.py | - | - | - |
| 3 | Run command | preprocess | - | python preprocess.py data/raw data/preprocessed | - | - |
| 4 | Save output data | preprocess | - | - | data/preprocessed | - |
| 5 | Update dvc.lock with new hashes | preprocess | - | - | - | Updated |
| 6 | Pipeline run complete | - | - | - | - | - |
💡 All stages executed successfully, outputs saved, and dvc.lock updated to reflect current pipeline state.
Status Tracker
| Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final |
|---|---|---|---|---|---|---|
| dependencies_checked | None | ['data/raw', 'preprocess.py'] | ['data/raw', 'preprocess.py'] | ['data/raw', 'preprocess.py'] | ['data/raw', 'preprocess.py'] | ['data/raw', 'preprocess.py'] |
| command_status | Not run | Not run | Running | Completed | Completed | Completed |
| outputs_saved | None | None | None | data/preprocessed | data/preprocessed | data/preprocessed |
| dvc_lock_status | Old | Old | Old | Old | Updated | Updated |
Key Moments - 3 Insights
Why does DVC check dependencies before running a stage?
DVC checks dependencies (Step 2 in the Process Table) to decide whether the stage needs to run. If the dependencies haven't changed since the last run recorded in dvc.lock, DVC skips the stage to save time.
What happens if the output data already exists and dependencies are unchanged?
DVC skips the command and leaves the existing outputs and dvc.lock unchanged, avoiding unnecessary work. In that case the run stops after the dependency check (Step 2), and Steps 3 to 5 of the Process Table do not happen.
Why is dvc.lock updated after running a stage?
dvc.lock records exact hashes of dependencies and outputs after execution (Step 5). This helps DVC track changes and decide if future runs need to re-execute stages.
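The three answers above all come down to one comparison: current hashes versus recorded hashes. The sketch below is a simplified model of that idea, not DVC's actual implementation. The function names `snapshot` and `needs_rerun` are hypothetical, and the recorded hashes are held in a plain dictionary standing in for dvc.lock. Real DVC also records the stage command and the outputs, and hashes directories in its own way.

```python
import hashlib
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of one file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def snapshot(deps):
    """Map every file under the given dependency paths to its hash.

    A dependency may be a single file (preprocess.py) or a directory
    (data/raw); directories are walked recursively.
    """
    hashes = {}
    for dep in deps:
        root = Path(dep)
        if root.is_file():
            files = [root]
        else:
            files = sorted(p for p in root.rglob("*") if p.is_file())
        for f in files:
            hashes[f.as_posix()] = file_md5(f)
    return hashes


def needs_rerun(deps, recorded):
    """True when the current hashes differ from the recorded ones."""
    return snapshot(deps) != recorded
```

In this model, a run that does execute the stage would finish by storing `snapshot(deps)` as the new record, which corresponds to Step 5. The next call to `needs_rerun` then returns False until a dependency file is edited, added, or removed.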
Visual Quiz - 3 Questions
Test your understanding
Looking at the Process Table, at which step does DVC run the actual command for the 'preprocess' stage?
A. Step 2
B. Step 4
C. Step 3
D. Step 5
💡 Hint
Check the 'Command Executed' column in the Process Table rows.
According to the Status Tracker, what is the status of 'dvc_lock_status' after Step 4?
A. Updated
B. Old
C. Not created
D. Deleted
💡 Hint
Look at the 'dvc_lock_status' row and the column 'After Step 4'.
If the dependencies change, what will DVC do differently in the execution flow?
A. Run the stage command again
B. Skip running the stage
C. Delete outputs without running
D. Ignore changes and keep old outputs
💡 Hint
Refer to the key moment about dependency checking and stage execution.
Concept Snapshot
DVC pipelines are defined in dvc.yaml with stages specifying commands, dependencies, and outputs.
Run 'dvc repro' to execute only the stages whose dependencies have changed.
DVC tracks outputs and updates dvc.lock with hashes.
This avoids rerunning unchanged stages, saving time.
Use git to version control pipeline files and data pointers.
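When the pipeline is triggered from a Python script rather than typed in a shell, the 'dvc repro' call can be wrapped in a small helper. The sketch below is one possible way to do that. The helper names `repro_command` and `run_repro` are hypothetical, while `--dry` (show what would run without executing it) and `--force` (run even if nothing changed) are existing `dvc repro` options. It assumes the dvc executable is installed and that the script runs inside a DVC project.

```python
import shutil
import subprocess


def repro_command(stage=None, dry=False, force=False):
    """Build the argument list for a 'dvc repro' call.

    stage: optional stage name such as 'preprocess'; None means the
    whole pipeline.
    """
    cmd = ["dvc", "repro"]
    if dry:
        cmd.append("--dry")
    if force:
        cmd.append("--force")
    if stage:
        cmd.append(stage)
    return cmd


def run_repro(stage=None, dry=False, force=False):
    """Run 'dvc repro' and raise if dvc is missing or the run fails."""
    if shutil.which("dvc") is None:
        raise RuntimeError("dvc executable not found on PATH")
    return subprocess.run(repro_command(stage, dry, force), check=True)
```

For example, `run_repro("preprocess", dry=True)` would report whether the 'preprocess' stage is due to run without actually running it.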
Full Transcript
This visual execution shows how DVC manages data pipelines. First, you define stages in dvc.yaml with commands, dependencies, and outputs. When you run 'dvc repro', DVC checks whether the dependencies have changed. If they have, it runs the stage command, saves the outputs, and updates dvc.lock with the new hashes. If nothing has changed, it skips the stage to save time. This process helps track data and code changes efficiently in machine learning projects.