
Tracking datasets with DVC in MLOps - Step-by-Step Execution

Process Flow - Tracking datasets with DVC
Initialize DVC in project
Add dataset to DVC tracking
DVC creates .dvc file and stores data hash
Push dataset to remote storage (optional)
Modify dataset locally
DVC detects changes (dvc status); re-run dvc add to update tracking
Pull dataset from remote when needed
This flow shows how DVC tracks datasets by initializing, adding data, storing metadata, pushing to remote, detecting changes, and pulling data.
Execution Sample
dvc init

dvc add data/dataset.csv

git add data/dataset.csv.dvc data/.gitignore

git commit -m "Track dataset with DVC"
This code initializes DVC, adds a dataset file to DVC tracking, stages the DVC metadata files, and commits them to Git.
Process Table
| Step | Command | Action | Result |
| --- | --- | --- | --- |
| 1 | dvc init | Initialize DVC in project folder | Creates .dvc folder and config files |
| 2 | dvc add data/dataset.csv | Add dataset file to DVC tracking | Generates data/dataset.csv.dvc file with hash info and adds the dataset to data/.gitignore |
| 3 | git add data/dataset.csv.dvc data/.gitignore | Stage DVC metadata files for Git | Files ready to commit |
| 4 | git commit -m "Track dataset with DVC" | Commit changes to Git | Dataset tracking metadata saved in Git |
| 5 | Modify data/dataset.csv | Change dataset content | Local file changed, DVC not updated yet |
| 6 | dvc status | Check dataset status | Reports data/dataset.csv as modified |
| 7 | dvc add data/dataset.csv | Update DVC tracking for changed dataset | Updates .dvc file with new hash |
| 8 | git add data/dataset.csv.dvc | Stage updated metadata | Ready to commit updated dataset info |
| 9 | git commit -m "Update dataset version" | Commit updated tracking | New dataset version tracked in Git |
| 10 | dvc push | Push dataset to remote storage | Dataset files uploaded to remote storage (requires a configured remote) |
| 11 | dvc pull | Retrieve dataset from remote | Dataset files downloaded locally if missing |
💡 Process ends after dataset is tracked, updated, and optionally pushed or pulled from remote storage.
Status Tracker
| Variable | Start | After Step 2 | After Step 7 | After Step 10 |
| --- | --- | --- | --- | --- |
| data/dataset.csv.dvc | Not present | Created with initial hash | Updated with new hash after dataset change | Unchanged; dvc push uploads the data, not the .dvc file |
| data/dataset.csv | Original file | Original file tracked | Modified file tracked | Modified version also stored in remote storage |
Key Moments - 3 Insights
Why do we need to run 'dvc add' again after modifying the dataset?
DVC identifies each dataset version by the hash of the file's content, so modifying the file changes its hash. Running 'dvc add' again writes the new hash into the .dvc file so that it points to the latest version, as shown in steps 5 to 7 of the Process Table.
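The idea that a content hash changes with the data can be sketched without DVC at all. The snippet below is only an illustration: it assumes `md5sum` from GNU coreutils is available, works on a throwaway temporary file rather than a real DVC project, and uses made-up CSV rows. DVC records an MD5 value for a single tracked file, though the exact way it computes it may vary between DVC versions.

```shell
# Sketch: show that a content hash changes when the file content changes.
# The temp directory and CSV rows are hypothetical, chosen only for illustration.
tmpdir="$(mktemp -d)"
printf 'id,label\n1,cat\n' > "$tmpdir/dataset.csv"

# Hash of the original content (comparable to what is recorded at step 2)
hash_before="$(md5sum "$tmpdir/dataset.csv" | cut -d' ' -f1)"

# Modify the dataset (comparable to step 5)
printf '2,dog\n' >> "$tmpdir/dataset.csv"

# Hash of the modified content (comparable to what step 7 would record)
hash_after="$(md5sum "$tmpdir/dataset.csv" | cut -d' ' -f1)"

echo "before: $hash_before"
echo "after:  $hash_after"
```

The two printed values differ, which is why the hash stored in the .dvc file is stale until 'dvc add' is run again.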
What is the role of the .dvc file created when adding a dataset?
The .dvc file stores metadata including the hash of the dataset file. It tells DVC which version of the data is tracked. This is why we commit the .dvc file to Git (step 4), so dataset versions are linked with code versions.
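To make the metadata concrete, here is a rough sketch of what such a pointer file can look like and how its fields could be read from the shell. The hash and size values are invented, and the field layout reflects what recent DVC versions typically write; the exact set of fields may differ in your version, so treat this as an assumption rather than a specification.

```shell
# Sketch: write a sample pointer file with made-up values, then read two fields.
# This does not call DVC; the md5 and size values are hypothetical.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 1024
  hash: md5
  path: dataset.csv
EOF

# Extract the tracked path and the recorded content hash
tracked_path="$(awk '$1 == "path:" {print $2}' "$sample")"
tracked_md5="$(awk '$2 == "md5:" {print $3}' "$sample")"

echo "$tracked_path -> $tracked_md5"
```

Because the file is this small and text-based, Git can version it cheaply while the dataset itself stays out of the repository.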
Why do we push datasets separately with 'dvc push' after committing changes?
Because datasets can be large, DVC stores actual data in remote storage separately from Git. 'dvc push' uploads the data files to remote storage, while Git only tracks metadata. This separation is shown in step 10.
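'dvc push' needs a remote to upload to, which the steps in this lesson do not configure. A minimal sketch of that setup, assuming a local directory is used as the remote: the function name, the remote name `localstore`, and the directory path are all hypothetical, and a cloud bucket URL could be used instead.

```shell
# Sketch: configure a local directory as the default DVC remote, then push.
# Run from the root of a project where 'dvc init' and 'dvc add' were already done.
# "localstore" and the directory argument are illustrative names, not from the lesson.
setup_remote_and_push() {
    remote_dir="$1"                              # e.g. /tmp/dvc-storage (assumed path)
    mkdir -p "$remote_dir"
    dvc remote add -d localstore "$remote_dir"   # -d marks this remote as the default
    git add .dvc/config                          # remote settings are stored in .dvc/config
    git commit -m "Configure DVC remote"
    dvc push                                     # upload cached data files to the remote
}
```

It could be called as `setup_remote_and_push /tmp/dvc-storage`. Committing .dvc/config lets collaborators who clone the repository run 'dvc pull' against the same remote, provided they can reach that location.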
Visual Quiz - 3 Questions
Test your understanding
In the Process Table, what happens at step 2?
A. DVC initializes the project
B. Dataset file is added to DVC tracking and .dvc file is created
C. Dataset is pushed to remote storage
D. Git commits the dataset metadata
💡 Hint
Refer to row 2 of the Process Table, where the 'dvc add' command is run.
At which step does DVC detect that the dataset file has changed?
A. Step 6
B. Step 5
C. Step 4
D. Step 9
💡 Hint
Check the 'dvc status' command in step 6 that shows dataset modification.
If you skip 'git add' after 'dvc add', what will happen?
A. Dataset file will not be tracked by DVC
B. DVC will fail to add the dataset
C. Dataset changes won't be saved in Git history
D. Remote storage will not accept the dataset
💡 Hint
Look at steps 3 and 4 where git add and commit save the .dvc metadata.
Concept Snapshot
Tracking datasets with DVC:
- Run 'dvc init' once per project
- Use 'dvc add <file>' to track datasets
- DVC creates a .dvc file with dataset hash
- Commit .dvc files to Git to version data metadata
- Use 'dvc push' to upload data to remote storage
- Use 'dvc pull' to download data when needed
- Re-run 'dvc add' after dataset changes to update tracking
Full Transcript
This lesson shows how to track datasets using DVC step by step. First, initialize DVC in your project folder with 'dvc init'. Then add your dataset file using 'dvc add', which creates a .dvc file storing the dataset's hash. Stage and commit this .dvc file, along with the .gitignore that 'dvc add' updates, so that Git versions your data metadata alongside your code. When you modify the dataset, run 'dvc status' to see the change and 'dvc add' again to update tracking, then commit the updated .dvc file to Git. To share or back up data, use 'dvc push' to upload dataset files to a configured remote storage location. Others can retrieve the data with 'dvc pull'. This process keeps data versions organized and linked to your code changes.