Tracking datasets with DVC in MLOps - Commands & Configuration

Introduction
Data projects tend to accumulate many versions of the same dataset, and tracking them by hand is error-prone. DVC (Data Version Control) records a snapshot of each data file so you can return to an earlier version or share it with others without copying large files manually.
Typical situations where DVC helps:
When you want to save a version of your dataset before making changes.
When you need to share your dataset versions with teammates without sending big files.
When you want to reproduce a machine learning experiment with the exact same data.
When you want to track changes in your data alongside your code.
When you want to store datasets in remote storage but keep lightweight pointers in your project.
Commands
This command initializes DVC in your project folder by creating the .dvc directory and the configuration files needed to start tracking data. By default it must be run inside an existing Git repository.
Terminal
dvc init
Expected Output (wording may vary by DVC version)
Initialized DVC repository.

You can now commit the changes to git.
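To try the initialization step without touching a real project, a throwaway directory is enough. The following is a minimal sketch: the directory comes from mktemp, and the script skips the DVC step if the dvc executable is not installed.

```shell
# Sketch: initialize Git and DVC in a throwaway directory (path chosen by mktemp).
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1

# DVC expects an existing Git repository by default, so create one first.
git init -q .

if command -v dvc >/dev/null 2>&1; then
    # -q suppresses the welcome banner.
    dvc init -q
    # Show the files that DVC created and staged for the next Git commit.
    git status --short
else
    echo "dvc is not installed; skipping 'dvc init'"
fi
```

The files listed by git status are DVC's own configuration; committing them is what makes the project a DVC project for everyone who clones it.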
This command tells DVC to track the dataset file located at data/dataset.csv. It creates a small pointer file (data/dataset.csv.dvc), stores the actual data in DVC's cache, and adds the data file to data/.gitignore so Git does not pick it up.
Terminal
dvc add data/dataset.csv
Expected Output (wording may vary by DVC version)
To track the changes with git, run:

	git add data/dataset.csv.dvc data/.gitignore
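The pointer file can stay small because it records a content hash of the data rather than the data itself. The sketch below imitates that idea with plain shell tools. It is an illustration only, not DVC's implementation: the file contents are made up, and md5sum is assumed to be available (on macOS the equivalent tool is md5).

```shell
# Illustration only: imitate the idea behind a pointer file with plain shell tools.
work_dir="$(mktemp -d)"
printf 'id,label\n1,cat\n2,dog\n' > "$work_dir/dataset.csv"

# A content hash identifies this exact version of the file.
hash_v1="$(md5sum "$work_dir/dataset.csv" | cut -d' ' -f1)"

# Changing the data changes the hash, so a pointer that stores the hash
# pins one specific version of the dataset.
printf '3,bird\n' >> "$work_dir/dataset.csv"
hash_v2="$(md5sum "$work_dir/dataset.csv" | cut -d' ' -f1)"

echo "version 1: $hash_v1"
echo "version 2: $hash_v2"
```

A real .dvc file is a short YAML file whose outs entry records a hash, a size, and a path for the tracked data; the exact fields depend on the DVC version.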
This command adds the DVC pointer file and the updated .gitignore to Git so you can version control the dataset reference, not the data itself. DVC writes the .gitignore next to the tracked file, so here it lives in the data folder.
Terminal
git add data/dataset.csv.dvc data/.gitignore
Expected Output
No output (command runs silently)
This commits the changes to Git, saving the dataset pointer and ignore rules so your project history includes the dataset version.
Terminal
git commit -m "Track dataset with DVC"
Expected Output (commit hash and counts will differ)
[main abc1234] Track dataset with DVC
 2 files changed, 10 insertions(+)
This uploads the actual dataset file to the remote storage configured in DVC, so others can download it later without storing it in Git. A remote must already be set up with 'dvc remote add', otherwise the command fails.
Terminal
dvc push
Expected Output (wording may vary by DVC version)
1 file pushed
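The walkthrough does not show how the remote storage gets configured. A minimal sketch follows, assuming a local directory stands in for shared storage; the remote name localstore and the directory path are placeholders, and a real project would more likely point at cloud or network storage.

```shell
# Sketch: register a default remote, then push. Run inside a DVC project.
# "localstore" and the directory below are placeholder names.
remote_dir="${TMPDIR:-/tmp}/dvc-remote-demo"
mkdir -p "$remote_dir"

if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
    # -d marks this remote as the default target for 'dvc push' and 'dvc pull'.
    dvc remote add -d localstore "$remote_dir"
    # The remote definition is written to .dvc/config, which belongs in Git.
    git add .dvc/config
    git commit -q -m "Configure DVC remote"
    dvc push
else
    echo "Run this inside a project where 'dvc init' has been executed."
fi
```

Committing .dvc/config matters because teammates read the remote location from it when they fetch the data.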
Key Concept

If you remember nothing else from this pattern, remember: DVC tracks data by saving small pointer files in Git and storing large data files separately.

Common Mistakes
Adding large data files directly to Git instead of using 'dvc add'.
Git is not designed to handle large files efficiently, which slows down your project and bloats the repository.
Use 'dvc add' to track large data files and commit only the small .dvc pointer files to Git.
Forgetting to run 'dvc push' after adding data.
Without pushing, the actual data is only stored locally and others cannot access it from remote storage.
Always run 'dvc push' to upload data files to the remote storage after adding or updating datasets.
Not committing the .dvc files to Git.
Without committing .dvc files, the dataset versions are not tracked in your project history.
Commit the .dvc pointer files and updated .gitignore to Git after running 'dvc add'.
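Two of these mistakes can be caught with a quick check before sharing work. The helper names below are hypothetical; the sketch assumes it is run from the project root, and the second helper additionally assumes a configured default remote.

```shell
# Sketch: quick checks for forgotten commits and forgotten pushes.

# Lists .dvc pointer files that exist on disk but are not yet known to Git.
list_untracked_pointers() {
    git ls-files --others --exclude-standard -- '*.dvc'
}

# Reports data that is in the local cache but missing from the default remote.
list_unpushed_data() {
    dvc status --cloud
}
```

If list_untracked_pointers prints anything, those files still need 'git add' and a commit; if list_unpushed_data reports differences, 'dvc push' has not been run for the latest data.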
Summary
Initialize DVC in your project with 'dvc init' to start tracking data.
Use 'dvc add' to track large dataset files and create pointer files.
Commit the pointer files to Git to version control dataset references.
Push the actual data to remote storage with 'dvc push' for sharing and backup.
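The same pointer files are what make it possible to go back to an earlier dataset or to fetch a teammate's data. A rough sketch of both directions follows; the function names are hypothetical, and the first one assumes data/dataset.csv.dvc has been committed in at least two versions.

```shell
# Sketch: restore the dataset as it was one commit ago.
restore_previous_dataset() {
    # Bring back the older pointer file from Git history...
    git checkout HEAD~1 -- data/dataset.csv.dvc
    # ...then make the workspace file match that pointer.
    dvc checkout data/dataset.csv.dvc
}

# Sketch: fetch the latest pointers and the data they refer to.
fetch_shared_dataset() {
    git pull
    dvc pull
}
```

If the older version is no longer in the local cache, 'dvc checkout' cannot restore it on its own; running 'dvc pull' first downloads it from remote storage.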

Practice

1. What does the dvc add command do when tracking datasets?
easy
A. It deletes the dataset from the local machine.
B. It uploads the dataset directly to GitHub.
C. It converts the dataset into a database format.
D. It creates a pointer file to track the dataset without storing the data in Git.

Solution

  1. Step 1: Understand dvc add purpose

    The dvc add command creates a small pointer file that represents the dataset, instead of storing the full data in Git.
  2. Step 2: Recognize data management with DVC

    This pointer file allows Git to track dataset versions without handling large files directly.
  3. Final Answer:

    It creates a pointer file to track the dataset without storing the data in Git. -> Option D
  4. Quick Check:

    dvc add creates pointer file [OK]
Hint: Remember: DVC tracks data with pointer files, not full data [OK]
Common Mistakes:
  • Thinking dvc add uploads data to GitHub
  • Confusing dvc add with deleting files
  • Assuming data is converted or changed format
2. Which of the following is the correct syntax to track a dataset file named data.csv using DVC?
easy
A. dvc track data.csv
B. dvc add data.csv
C. dvc push data.csv
D. dvc commit data.csv

Solution

  1. Step 1: Identify the correct DVC command for tracking

    The command to start tracking a dataset file is dvc add followed by the filename.
  2. Step 2: Confirm syntax correctness

    Among the options, only dvc add data.csv correctly adds the file to DVC tracking.
  3. Final Answer:

    dvc add data.csv -> Option B
  4. Quick Check:

    Use dvc add filename to track data [OK]
Hint: Use dvc add to start tracking files [OK]
Common Mistakes:
  • Using dvc track which is not a valid command
  • Confusing dvc push with adding files
  • Trying dvc commit, which records changes to data that is already tracked rather than starting to track a new file
3. After running dvc add data.csv, what is the expected output or change in the project directory?
medium
A. A new file named data.csv.dvc is created and data.csv remains in the directory.
B. A new file named data.csv.dvc is created and data.csv is removed.
C. The data.csv file is uploaded to GitHub automatically.
D. The data.csv file is converted to a binary format.

Solution

  1. Step 1: Understand dvc add effects on files

    Running dvc add creates a pointer file with extension .dvc that tracks the dataset, but does not delete the original data file.
  2. Step 2: Confirm directory state after command

    The original data.csv remains, and a new data.csv.dvc file appears to track it.
  3. Final Answer:

    A new file named data.csv.dvc is created and data.csv remains in the directory. -> Option A
  4. Quick Check:

    dvc add creates pointer file, keeps data [OK]
Hint: Look for .dvc pointer file; data file stays [OK]
Common Mistakes:
  • Assuming data file is deleted after dvc add
  • Thinking data is uploaded automatically to GitHub
  • Believing data file is converted or changed format
4. You ran dvc add dataset.csv but forgot to commit the generated dataset.csv.dvc file to Git. What problem might occur?
medium
A. Git will track the dataset file directly, causing large repository size.
B. DVC will stop tracking the dataset automatically.
C. The dataset pointer file won't be versioned, causing sync issues between code and data.
D. The dataset file will be deleted from the local machine.

Solution

  1. Step 1: Understand the role of the pointer file in Git

    The .dvc pointer file must be committed to Git to keep track of dataset versions alongside code.
  2. Step 2: Identify consequences of not committing pointer file

    If the pointer file is not committed, Git won't know about dataset changes, causing mismatch between code and data versions.
  3. Final Answer:

    The dataset pointer file won't be versioned, causing sync issues between code and data. -> Option C
  4. Quick Check:

    Commit pointer files to Git to sync data and code [OK]
Hint: Always commit .dvc files to Git after adding data [OK]
Common Mistakes:
  • Assuming DVC stops tracking automatically
  • Thinking dataset file is deleted if not committed
  • Believing Git tracks large data files directly
5. You have a dataset folder named images/ with many files. You want to track it with DVC and ensure the dataset version is saved and shared with your team. Which sequence of commands is correct?
hard
A. dvc add images/; git add images.dvc; git commit -m 'Track images dataset'; git push; dvc push
B. git add images/; dvc add images/; git commit -m 'Track images dataset'; dvc push
C. dvc add images/; git add images.dvc; git commit -m 'Track images dataset'; git push
D. dvc add images/; git add images.dvc; git commit -m 'Track images dataset'; dvc push

Solution

  1. Step 1: Add the dataset folder with DVC

    Use dvc add images/ to create the pointer file images.dvc tracking the folder.
  2. Step 2: Commit the pointer file to Git

    Run git add images.dvc and git commit to version control the pointer file.
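  3. Step 3: Push both the Git commits and the data

    Run git push to share the pointer file and dvc push to upload the dataset files to remote storage. Both are required, but DVC does not enforce an order between them. Option D never runs git push, so teammates would not receive images.dvc; option C never runs dvc push, so the data would not reach remote storage.
  4. Final Answer:

    dvc add images/; git add images.dvc; git commit -m 'Track images dataset'; git push; dvc push -> Option A
  5. Quick Check:

    git push shares the pointer, dvc push shares the data [OK]
Hint: Sharing needs two pushes: git push for pointers, dvc push for data [OK]
Common Mistakes:
  • Forgetting dvc push, so the data never reaches remote storage
  • Adding dataset files directly to Git
  • Forgetting git push, so teammates never receive images.dvc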