
Tracking datasets with DVC in MLOps - Step-by-Step Execution

Process Flow - Tracking datasets with DVC
Initialize DVC in project
Add dataset to DVC tracking
DVC creates .dvc file and stores data hash
Push dataset to remote storage (optional)
Modify dataset locally
Run dvc status to detect changes, then dvc add to update tracking
Pull dataset from remote when needed
This flow shows how DVC tracks datasets by initializing, adding data, storing metadata, pushing to remote, detecting changes, and pulling data.
Execution Sample
dvc init

dvc add data/dataset.csv

git add data/dataset.csv.dvc data/.gitignore

git commit -m "Track dataset with DVC"
This code initializes DVC, adds a dataset file to DVC tracking, stages the DVC metadata files, and commits them to Git.
Process Table
| Step | Command | Action | Result |
|---|---|---|---|
| 1 | dvc init | Initialize DVC in project folder | Creates .dvc folder and config files |
| 2 | dvc add data/dataset.csv | Add dataset file to DVC tracking | Generates data/dataset.csv.dvc file with hash info |
| 3 | git add data/dataset.csv.dvc data/.gitignore | Stage DVC metadata files for Git | Files ready to commit |
| 4 | git commit -m "Track dataset with DVC" | Commit changes to Git | Dataset tracking metadata saved in Git |
| 5 | Modify data/dataset.csv | Change dataset content | Local file changed, .dvc file not updated yet |
| 6 | dvc status | Check dataset status | Shows dataset.csv is modified and needs update |
| 7 | dvc add data/dataset.csv | Update DVC tracking for changed dataset | Updates .dvc file with new hash |
| 8 | git add data/dataset.csv.dvc | Stage updated metadata | Ready to commit updated dataset info |
| 9 | git commit -m "Update dataset version" | Commit updated tracking | New dataset version tracked in Git |
| 10 | dvc push | Push dataset to remote storage | Dataset files uploaded to the configured remote storage |
| 11 | dvc pull | Retrieve dataset from remote | Dataset files downloaded locally if missing |
💡 Process ends after dataset is tracked, updated, and optionally pushed or pulled from remote storage.
Status Tracker
| Variable | Start | After Step 2 | After Step 7 | After Step 10 |
|---|---|---|---|---|
| data/dataset.csv.dvc | Not present | Created with initial hash | Updated with new hash after dataset change | Unchanged by push (still holds the new hash) |
| data/dataset.csv | Original file | Original file tracked | Modified file tracked | Modified file uploaded to remote; local copy unchanged |
Key Moments - 3 Insights
Why do we need to run 'dvc add' again after modifying the dataset?
Because DVC identifies each dataset version by its content hash, modifying the file changes the hash. Running 'dvc add' again updates the .dvc file with the new hash so that it points to the latest version, as shown in steps 5 to 7 of the Process Table.
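The hash change itself can be observed without DVC. The sketch below assumes a Linux shell with GNU md5sum available; the temporary directory and CSV rows are made up for illustration, and the hash is only comparable in kind to the one DVC records, not a reproduction of DVC's internals.

```shell
# Illustrative only: the temp directory and file contents are hypothetical.
workdir=$(mktemp -d)
printf 'id,label\n1,cat\n' > "$workdir/dataset.csv"

# Content hash of the original file (DVC records a hash of this kind).
hash_before=$(md5sum "$workdir/dataset.csv" | cut -d' ' -f1)

# Step 5: modify the dataset by appending a row.
printf '2,dog\n' >> "$workdir/dataset.csv"

# The content hash is now different, so a previously written .dvc file is stale.
hash_after=$(md5sum "$workdir/dataset.csv" | cut -d' ' -f1)

echo "before: $hash_before"
echo "after:  $hash_after"
```

The two printed values differ, which is the condition 'dvc status' reports as a modified file.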
What is the role of the .dvc file created when adding a dataset?
The .dvc file stores metadata including the hash of the dataset file. It tells DVC which version of the data is tracked. This is why we commit the .dvc file to Git (step 4), so dataset versions are linked with code versions.
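To make the metadata concrete, the sketch below writes a hand-made pointer file and reads back its fields. The field layout follows what recent DVC releases write, but it may differ between versions; the hash and size values are placeholders, not real DVC output.

```shell
# Illustrative only: the hash and size below are placeholders, not real output.
workdir=$(mktemp -d)
cat > "$workdir/dataset.csv.dvc" <<'EOF'
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 24
  hash: md5
  path: dataset.csv
EOF

# Read back the two fields that identify the tracked version.
tracked_hash=$(awk '$2 == "md5:" {print $3}' "$workdir/dataset.csv.dvc")
tracked_path=$(awk '$1 == "path:" {print $2}' "$workdir/dataset.csv.dvc")

echo "pointer file tracks $tracked_path at hash $tracked_hash"
```

Because the file is a few lines of text, Git can diff and version it cheaply even when the dataset itself is very large.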
Why do we push datasets separately with 'dvc push' after committing changes?
Because datasets can be large, DVC stores actual data in remote storage separately from Git. 'dvc push' uploads the data files to remote storage, while Git only tracks metadata. This separation is shown in step 10.
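Steps 10 and 11 only work once a remote is configured, which the walkthrough does not show. The following is a minimal sketch, assuming dvc and git are installed; the remote name "localremote" and the temporary directories are hypothetical, and a local folder stands in for real cloud storage.

```shell
# Sketch only: the remote name "localremote" and all paths are hypothetical.
result="skipped"
if command -v dvc >/dev/null 2>&1 && command -v git >/dev/null 2>&1; then
  repo=$(mktemp -d)
  storage=$(mktemp -d)   # local folder standing in for remote storage
  cd "$repo" || exit 1
  git init -q && dvc init -q
  mkdir data && printf 'id,label\n1,cat\n' > data/dataset.csv

  dvc add data/dataset.csv
  dvc remote add -d localremote "$storage"   # recorded in .dvc/config
  git add data/dataset.csv.dvc data/.gitignore .dvc/config
  git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Track dataset with DVC"

  dvc push   # data goes to $storage; Git holds only the metadata

  # Simulate a fresh checkout: drop the local data and cache, then restore.
  rm -rf data/dataset.csv .dvc/cache
  dvc pull && result="pulled"
else
  echo "dvc or git not installed; skipping"
fi
echo "result: $result"
```

Committing .dvc/config alongside the pointer file means teammates inherit the remote definition and can run 'dvc pull' without extra setup.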
Visual Quiz - 3 Questions
Test your understanding
Look at the Process Table. What happens at step 2?
A. DVC initializes the project
B. Dataset file is added to DVC tracking and .dvc file is created
C. Dataset is pushed to remote storage
D. Git commits the dataset metadata
💡 Hint
Refer to row 2 of the Process Table, where the 'dvc add' command is run.
At which step does DVC detect that the dataset file has changed?
A. Step 6
B. Step 5
C. Step 4
D. Step 9
💡 Hint
Check the 'dvc status' command in step 6 that shows dataset modification.
If you skip 'git add' after 'dvc add', what will happen?
A. Dataset file will not be tracked by DVC
B. DVC will fail to add the dataset
C. Dataset changes won't be saved in Git history
D. Remote storage will not accept the dataset
💡 Hint
Look at steps 3 and 4 where git add and commit save the .dvc metadata.
Concept Snapshot
Tracking datasets with DVC:
- Run 'dvc init' once per project
- Use 'dvc add <file>' to track datasets
- DVC creates a .dvc file with dataset hash
- Commit .dvc files to Git to version data metadata
- Use 'dvc push' to upload data to remote storage
- Use 'dvc pull' to download data when needed
- Re-run 'dvc add' after dataset changes to update tracking
Full Transcript
This lesson shows how to track datasets using DVC step-by-step. First, initialize DVC in your project folder with 'dvc init'. Then add your dataset file using 'dvc add', which creates a .dvc file storing the dataset's hash. Stage and commit this .dvc file with Git to version your data metadata alongside code. When you modify the dataset, run 'dvc status' to check changes and 'dvc add' again to update tracking. Commit the updated .dvc file to Git. To share or backup data, use 'dvc push' to upload dataset files to remote storage. Others can retrieve data with 'dvc pull'. This process helps keep data versions organized and linked to your code changes.

Practice

1. What does the dvc add command do when tracking datasets?
easy
A. It deletes the dataset from the local machine.
B. It uploads the dataset directly to GitHub.
C. It converts the dataset into a database format.
D. It creates a pointer file to track the dataset without storing the data in Git.

Solution

  1. Step 1: Understand dvc add purpose

    The dvc add command creates a small pointer file that represents the dataset, instead of storing the full data in Git.
  2. Step 2: Recognize data management with DVC

    This pointer file allows Git to track dataset versions without handling large files directly.
  3. Final Answer:

    It creates a pointer file to track the dataset without storing the data in Git. -> Option D
  4. Quick Check:

    dvc add creates pointer file [OK]
Hint: Remember: DVC tracks data with pointer files, not full data [OK]
Common Mistakes:
  • Thinking dvc add uploads data to GitHub
  • Confusing dvc add with deleting files
  • Assuming data is converted or changed format
2. Which of the following is the correct syntax to track a dataset file named data.csv using DVC?
easy
A. dvc track data.csv
B. dvc add data.csv
C. dvc push data.csv
D. dvc commit data.csv

Solution

  1. Step 1: Identify the correct DVC command for tracking

    The command to start tracking a dataset file is dvc add followed by the filename.
  2. Step 2: Confirm syntax correctness

    Among the options, only dvc add data.csv correctly adds the file to DVC tracking.
  3. Final Answer:

    dvc add data.csv -> Option B
  4. Quick Check:

    Use dvc add filename to track data [OK]
Hint: Use dvc add to start tracking files [OK]
Common Mistakes:
  • Using dvc track which is not a valid command
  • Confusing dvc push with adding files
  • Trying dvc commit, which is a real DVC command but records changes to already-tracked data rather than starting tracking
3. After running dvc add data.csv, what is the expected output or change in the project directory?
medium
A. A new file named data.csv.dvc is created and data.csv remains in the directory.
B. A new file named data.csv.dvc is created and data.csv is removed.
C. The data.csv file is uploaded to GitHub automatically.
D. The data.csv file is converted to a binary format.

Solution

  1. Step 1: Understand dvc add effects on files

    Running dvc add creates a pointer file with extension .dvc that tracks the dataset, but does not delete the original data file.
  2. Step 2: Confirm directory state after command

    The original data.csv remains, and a new data.csv.dvc file appears to track it.
  3. Final Answer:

    A new file named data.csv.dvc is created and data.csv remains in the directory. -> Option A
  4. Quick Check:

    dvc add creates pointer file, keeps data [OK]
Hint: Look for .dvc pointer file; data file stays [OK]
Common Mistakes:
  • Assuming data file is deleted after dvc add
  • Thinking data is uploaded automatically to GitHub
  • Believing data file is converted or changed format
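Besides creating the pointer file, 'dvc add' also writes an ignore rule so that Git skips the data file itself. The sketch below imitates that rule by hand so it runs with git alone; the pointer file contents are a placeholder and the repository is a throwaway temp directory.

```shell
# Illustrative only: the ignore rule is written by hand to mimic 'dvc add'.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
printf 'id,label\n1,cat\n' > data.csv
printf 'outs:\n- path: data.csv\n' > data.csv.dvc   # placeholder pointer file
printf '/data.csv\n' > .gitignore                    # the kind of rule 'dvc add' writes

# 'git check-ignore -q' exits 0 when the path is ignored, 1 when it is not.
git check-ignore -q data.csv && data_ignored=yes || data_ignored=no
git check-ignore -q data.csv.dvc && pointer_ignored=yes || pointer_ignored=no

echo "data.csv ignored: $data_ignored, data.csv.dvc ignored: $pointer_ignored"
```

The data file stays on disk but is invisible to Git, while the pointer file remains eligible for 'git add'.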
4. You ran dvc add dataset.csv but forgot to commit the generated dataset.csv.dvc file to Git. What problem might occur?
medium
A. Git will track the dataset file directly, causing large repository size.
B. DVC will stop tracking the dataset automatically.
C. The dataset pointer file won't be versioned, causing sync issues between code and data.
D. The dataset file will be deleted from the local machine.

Solution

  1. Step 1: Understand the role of the pointer file in Git

    The .dvc pointer file must be committed to Git to keep track of dataset versions alongside code.
  2. Step 2: Identify consequences of not committing pointer file

    If the pointer file is not committed, Git won't know about dataset changes, causing mismatch between code and data versions.
  3. Final Answer:

    The dataset pointer file won't be versioned, causing sync issues between code and data. -> Option C
  4. Quick Check:

    Commit pointer files to Git to sync data and code [OK]
Hint: Always commit .dvc files to Git after adding data [OK]
Common Mistakes:
  • Assuming DVC stops tracking automatically
  • Thinking dataset file is deleted if not committed
  • Believing Git tracks large data files directly
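One way to catch a forgotten commit is to ask Git for pending .dvc files before pushing. This is a sketch under stated assumptions: the repository and pointer file are created by hand in a temp directory, and only git is required.

```shell
# Illustrative only: the repository and pointer file are created by hand.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
printf 'outs:\n- path: dataset.csv\n' > dataset.csv.dvc   # placeholder pointer file

# List .dvc files that are untracked or changed but not yet committed.
pending=$(git status --porcelain -- '*.dvc')

if [ -n "$pending" ]; then
  echo "Uncommitted DVC pointer files:"
  echo "$pending"
fi
```

A check like this could run as a manual pre-push habit; any listed file means the data version is not yet linked to a Git commit.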
5. You have a dataset folder named images/ with many files. You want to track it with DVC and ensure the dataset version is saved and shared with your team. Which sequence of commands is correct?
hard
A. dvc add images/; git add images.dvc; git commit -m 'Track images dataset'; git push; dvc push
B. git add images/; dvc add images/; git commit -m 'Track images dataset'; dvc push
C. dvc add images/; git add images.dvc; git commit -m 'Track images dataset'; git push
D. dvc add images/; git add images.dvc; git commit -m 'Track images dataset'; dvc push

Solution

  1. Step 1: Add the dataset folder with DVC

    Use dvc add images/ to create the pointer file images.dvc tracking the folder.
  2. Step 2: Commit the pointer file to Git

    Run git add images.dvc and git commit to version control the pointer file.
  3. Step 3: Push Git changes and dataset to remote storage

    Push Git commits with git push so teammates receive images.dvc, and push the dataset files to remote storage with dvc push. Both are required; the order between them does not affect the result.
  4. Final Answer:

    dvc add images/; git add images.dvc; git commit -m 'Track images dataset'; git push; dvc push -> Option A
  5. Quick Check:

    Share both the pointer file (git push) and the data (dvc push) [OK]
Hint: Teammates need both git push and dvc push to get the dataset [OK]
Common Mistakes:
  • Running only git push, so teammates get the pointer file but not the data
  • Adding dataset files directly to Git
  • Running only dvc push, so teammates get the data but not the pointer file