
DVC (Data Version Control) basics in MLOps - Step-by-Step Execution

Process Flow - DVC (Data Version Control) basics
Initialize Git repo
Initialize DVC in repo
Add data files to DVC tracking
Commit changes to Git
Push data to remote storage with 'dvc push' (code goes to a Git remote with 'git push')
Pull code with 'git pull' and data with 'dvc pull' when needed
This flow shows how you start with a Git repo, add DVC to track data files, commit changes, and push or pull data versions.
Execution Sample
git init

dvc init

dvc add data.csv

git add .
git commit -m "Add data with DVC"
dvc push
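This code initializes Git and DVC, tracks a data file with DVC, commits the DVC tracking files to Git, and pushes the data to remote storage. Note that 'dvc push' needs a DVC remote to be configured first (for example with 'dvc remote add -d <name> <location>'); the sample omits that step.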
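The sample can be tried end to end with a local folder standing in for remote storage. The following is a minimal sketch, not a recommended production setup: the temporary directories, the remote name `localstore`, the placeholder CSV content, and the commit identity are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: the workflow above plus a default remote, so 'dvc push' has a target.
set -euo pipefail

# Requires git and dvc on PATH; stop quietly if dvc is missing.
command -v dvc >/dev/null 2>&1 || { echo "dvc not installed; skipping"; exit 0; }

WORKDIR="$(mktemp -d)"      # throwaway project directory (assumption)
REMOTE_DIR="$(mktemp -d)"   # local folder acting as "remote storage" (assumption)
cd "$WORKDIR"

git init -q
dvc init -q

# Tiny placeholder dataset; a real project would have a much larger file.
printf 'id,value\n1,10\n2,20\n' > data.csv

# Register the folder as the default (-d) remote. "localstore" is a made-up name.
dvc remote add -d localstore "$REMOTE_DIR"

dvc add data.csv            # creates data.csv.dvc and ignores data.csv in Git
git add .
# Inline identity so the commit works even without a global Git config.
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Add data with DVC"
dvc push                    # copies the cached data into REMOTE_DIR
```

In a real project the remote would usually be cloud or network storage rather than a local directory; the 'dvc remote add' call is the only line that changes.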
Process Table
| Step | Command | Action | Result |
|------|---------|--------|--------|
| 1 | git init | Create new Git repository | Empty Git repo created |
| 2 | dvc init | Initialize DVC in repo | .dvc folder and config created |
| 3 | dvc add data.csv | Track data.csv with DVC | data.csv.dvc file created, data tracked |
| 4 | git add . | Stage all changes for Git | All changes staged |
| 5 | git commit -m "Add data with DVC" | Commit changes to Git | Commit saved with message |
| 6 | dvc push | Upload data files to remote storage | Data files uploaded to remote |
| 7 | - | End of process | Data and code versioned and stored |
💡 Process stops once the DVC tracking files are committed to Git and the data is pushed to remote storage
Status Tracker
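| Variable | Start | After Step 2 | After Step 3 | After Step 5 | After Step 6 |
|----------|-------|--------------|--------------|--------------|--------------|
| Git repo | None | Initialized | Initialized | Commit with DVC files | Commit with DVC files |
| DVC config | None | Created | Created | Created | Created |
| Tracked data | None | None | data.csv tracked | data.csv tracked | data.csv pushed to remote |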
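The "Tracked data" transition between steps 5 and 6 can be observed with 'dvc status -c', which compares the local cache with the default remote. This is a rough sketch under assumptions: the temporary directories, the remote name `localstore`, the sample CSV, and the commit identity are invented for illustration, and the exact status messages vary between DVC versions.

```shell
#!/usr/bin/env bash
# Sketch: check whether tracked data has reached the remote, before and after 'dvc push'.
set -euo pipefail

command -v dvc >/dev/null 2>&1 || { echo "dvc not installed; skipping"; exit 0; }

WORKDIR="$(mktemp -d)"      # throwaway project directory (assumption)
REMOTE_DIR="$(mktemp -d)"   # local folder acting as remote storage (assumption)
cd "$WORKDIR"

git init -q
dvc init -q
dvc remote add -d localstore "$REMOTE_DIR"   # "localstore" is a made-up name
printf 'id,value\n1,10\n2,20\n' > data.csv

dvc add -q data.csv
git add .
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Add data with DVC"

echo "--- after step 5 (committed, not pushed)"
dvc status -c || true       # should report data.csv as missing from the remote

dvc push -q

echo "--- after step 6 (pushed)"
dvc status -c || true       # should report cache and remote as in sync
```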
Key Moments - 3 Insights
Why do we need both Git and DVC commands?
Git tracks code and small files, while DVC tracks large data files separately. Process Table rows 1-2 show Git and DVC initialization, and rows 3-6 show how both work together.
What does 'dvc add' actually do to the data file?
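'dvc add' does not change the contents of the data file. It stores a copy in the DVC cache, creates a small .dvc file that records the file's hash and path, and adds the data file to .gitignore so Git does not track it directly. See Process Table row 3, where data.csv.dvc is created.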
Why do we run 'git commit' after 'dvc add'?
Because DVC creates tracking files that Git needs to save. Without committing, Git won't remember the data version info. This is shown in rows 4 and 5.
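The last two answers can be checked directly in a scratch repository. The sketch below is illustrative only: the temporary directory, the sample CSV, and the commit identity are assumptions, and the exact fields inside the .dvc file depend on the DVC version.

```shell
#!/usr/bin/env bash
# Sketch: see what 'dvc add' produces and what Git ends up committing.
set -euo pipefail

command -v dvc >/dev/null 2>&1 || { echo "dvc not installed; skipping"; exit 0; }

WORKDIR="$(mktemp -d)"      # throwaway project directory (assumption)
cd "$WORKDIR"
git init -q
dvc init -q
printf 'id,value\n1,10\n2,20\n' > data.csv

BEFORE="$(cksum < data.csv)"   # checksum of the data before tracking
dvc add -q data.csv
AFTER="$(cksum < data.csv)"    # checksum after tracking

[ "$BEFORE" = "$AFTER" ] && echo "data.csv content is unchanged"

echo "--- data.csv.dvc (small pointer file: hash, size, path)"
cat data.csv.dvc
echo "--- .gitignore (DVC lists the data file here)"
cat .gitignore

echo "--- git status: the pointer files are what Git still needs to record"
git status --short

git add data.csv.dvc .gitignore
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Add data with DVC"

echo "--- files tracked by Git (data.csv itself is not among them)"
git ls-files
```

Until the commit runs, the link between this version of the code and this version of the data exists only in the working directory.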
Visual Quiz - 3 Questions
Test your understanding
Look at the Process Table. What is the result after running 'dvc init'?
A. Empty Git repository created
B. data.csv.dvc file created
C. .dvc folder and config created
D. Data files uploaded to remote
💡 Hint
Check row 2 under the Result column in the Process Table
At which step is the data file actually tracked by DVC?
A. Step 1
B. Step 3
C. Step 2
D. Step 6
💡 Hint
Look at the Action column for 'dvc add data.csv' in the Process Table
If you skip 'git commit' after 'dvc add', what will happen?
A. Git won't save the DVC tracking files
B. Data files won't be tracked by DVC
C. Data files won't upload to remote
D. DVC will fail to initialize
💡 Hint
Refer to key moment about why 'git commit' is needed after 'dvc add'
Concept Snapshot
DVC basics:
- Use 'git init' to start Git repo
- Use 'dvc init' to add DVC
- Use 'dvc add <file>' to track data files
- Commit changes with Git to save DVC tracking
- Use 'dvc push' to upload data to remote
- Use 'dvc pull' to retrieve data versions
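The 'dvc pull' step is the one the execution sample does not show. As a rough sketch, the script below pushes data from one repository and then retrieves it in a fresh clone. Everything named here is an assumption for illustration: the temporary directories, the remote name `localstore`, the sample CSV, and the commit identity.

```shell
#!/usr/bin/env bash
# Sketch: push data from one repo, then retrieve it in a fresh clone with 'dvc pull'.
set -euo pipefail

command -v dvc >/dev/null 2>&1 || { echo "dvc not installed; skipping"; exit 0; }

WORKDIR="$(mktemp -d)"          # original project (assumption)
REMOTE_DIR="$(mktemp -d)"       # local folder acting as remote storage (assumption)
CLONE_DIR="$(mktemp -d)/clone"  # where a collaborator's copy will live (assumption)

cd "$WORKDIR"
git init -q
dvc init -q
dvc remote add -d localstore "$REMOTE_DIR"   # "localstore" is a made-up name
printf 'id,value\n1,10\n2,20\n' > data.csv
dvc add -q data.csv
git add .
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Add data with DVC"
dvc push -q

# A clone carries the code, .dvc/config, and data.csv.dvc, but not the data itself.
git clone -q "$WORKDIR" "$CLONE_DIR"
cd "$CLONE_DIR"
echo "--- files right after cloning"
ls

# 'dvc pull' reads data.csv.dvc and downloads the matching file from the remote.
dvc pull
echo "--- data.csv restored from remote storage"
cat data.csv
```

The clone works here because the remote location was committed in .dvc/config; with cloud storage, the collaborator would also need credentials for that storage.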
Full Transcript
This lesson shows how to use DVC with Git to version control data files. First, you create a Git repository with 'git init'. Then, you add DVC support using 'dvc init'. Next, you track a data file using 'dvc add data.csv', which creates a tracking file but does not change the data itself. After that, you stage and commit all changes with Git commands 'git add .' and 'git commit'. Finally, you upload the data files to remote storage using 'dvc push'. This process helps keep your data and code versions in sync and easy to share or reproduce.