DVC (Data Version Control) basics in MLOps - Commands & Configuration

Introduction
Managing data and model versions in machine learning projects is hard, because datasets and trained models are usually too large for Git. DVC tracks data files and models alongside your code, which makes experiments easier to reproduce and results easier to share.
DVC is a good fit when you want to:
  • Keep track of large datasets without storing them directly in Git.
  • Share data and models with your team while keeping versions organized.
  • Reproduce machine learning experiments exactly, with the same data and code.
  • Keep code changes separate from data changes in version control.
  • Automate data pipeline steps and track their outputs.
Commands
This command initializes DVC in your project folder. Run it inside an existing Git repository. It creates a .dvc/ directory with the configuration files DVC needs to start tracking data and models.
Terminal
dvc init
Expected Output (illustrative; exact text varies by DVC version)
Initialized DVC repository. You can now track data files with `dvc add`.
This command tells DVC to track data/dataset.csv. DVC writes a small pointer file (data/dataset.csv.dvc), stores the actual data in its cache under .dvc/cache, and adds the data file to data/.gitignore so Git ignores it.
Terminal
dvc add data/dataset.csv
Expected Output (illustrative; exact text varies by DVC version)
Adding 'data/dataset.csv' to DVC.
Computing md5 hash: 123abc456def7890
Saving to cache: .dvc/cache/12/3abc456def7890
To track this file, commit the changes to Git.
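To make the pointer-and-cache idea concrete, here is a rough Python sketch of content-addressed storage. It is not DVC's implementation: the helper name `sketch_add`, the `cache` directory, and the pointer fields are illustrative assumptions, and real cache layouts and .dvc file fields vary by DVC version.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sketch_add(data_file: Path, cache_dir: Path) -> str:
    """Hypothetical helper: copy a file into a content-addressed cache
    and return pointer-style text describing it."""
    content = data_file.read_bytes()
    digest = hashlib.md5(content).hexdigest()  # 32 hex characters

    # The first two hex characters become a subdirectory, the rest the file name.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copyfile(data_file, cached)

    # Pointer text loosely modeled on a .dvc file (the fields are an assumption).
    return (
        "outs:\n"
        f"- md5: {digest}\n"
        f"  size: {len(content)}\n"
        f"  path: {data_file.name}\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    dataset = root / "dataset.csv"
    dataset.write_bytes(b"id,value\n1,10\n2,20\n")

    pointer_text = sketch_add(dataset, root / "cache")
    print(pointer_text)
```

The pointer text is tiny no matter how large the data file is, which is why it is safe to commit to Git while the data itself stays in the cache.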
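Add the DVC pointer file and the updated .gitignore to Git. DVC writes the ignore entry to a .gitignore in the same folder as the data file, here data/.gitignore. This records the data version without storing the actual data in Git.
Terminal
git add data/dataset.csv.dvc data/.gitignore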
Expected Output
No output (command runs silently)
Commit the changes to Git so the data version is linked with your code version.
Terminal
git commit -m "Track dataset.csv with DVC"
Expected Output
[main abc1234] Track dataset.csv with DVC
 2 files changed, 10 insertions(+)
 create mode 100644 data/dataset.csv.dvc
Upload the data files tracked by DVC to remote storage (such as a cloud bucket or a shared server) so others can access them. This requires a remote to be configured first with dvc remote add.
Terminal
dvc push
Expected Output (illustrative; exact text varies by DVC version)
Uploading data/dataset.csv to remote storage.
100%|███████████████████████████████████████| 1.00M/1.00M [00:01<00:00, 1.00MB/s]
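Conceptually, a push copies the cache objects that the remote does not have yet, which is why pushing unchanged data a second time transfers nothing. Below is a minimal Python sketch of that idea, assuming the remote is just a local directory. `sketch_push` and both directory names are hypothetical, and this is not DVC's transfer code.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def sketch_push(cache_dir: Path, remote_dir: Path) -> List[str]:
    """Hypothetical helper: copy cache objects missing from the remote.
    Returns the relative paths that were uploaded."""
    uploaded = []
    for obj in sorted(p for p in cache_dir.rglob("*") if p.is_file()):
        rel = obj.relative_to(cache_dir)
        target = remote_dir / rel
        if target.exists():
            continue  # already on the remote; nothing to transfer
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(obj, target)
        uploaded.append(rel.as_posix())
    return uploaded


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    remote = Path(tmp) / "remote"
    (cache / "12").mkdir(parents=True)
    (cache / "12" / "3abc456def7890").write_bytes(b"fake dataset bytes")

    first_push = sketch_push(cache, remote)
    second_push = sketch_push(cache, remote)
    print(first_push)   # one object uploaded
    print(second_push)  # nothing left to upload
```

Because objects are named after their content hash, checking whether a file already exists on the remote is enough to decide whether it needs uploading.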
Key Concept

If you remember nothing else, remember: DVC tracks large data files separately from code, linking them with small pointer files in Git for easy versioning and sharing.

Common Mistakes
Adding large data files directly to Git instead of using dvc add.
  • Why it hurts: Git is not designed for large files, so the repository becomes slow and bloated.
  • Fix: Use dvc add to track large files and commit only the small .dvc pointer files to Git.
Forgetting to run dvc push after adding data files.
  • Why it hurts: The data stays in your local cache only, so teammates cannot download it from remote storage.
  • Fix: Run dvc push to upload data files to remote storage after adding or updating them.
Not committing the .dvc files to Git after running dvc add.
  • Why it hurts: Without the committed .dvc files, the data version is not recorded in Git and cannot be reproduced.
  • Fix: Add and commit the .dvc files and .gitignore changes to Git after dvc add.
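One way to catch the first mistake early is to scan the project for large files before staging anything. The Python sketch below is a hypothetical helper, not a DVC or Git feature. The 10 MB threshold and the function name are arbitrary assumptions.

```python
import tempfile
from pathlib import Path
from typing import List

SIZE_LIMIT_BYTES = 10 * 1024 * 1024  # assumed threshold; tune per project


def find_large_files(root: Path, limit: int = SIZE_LIMIT_BYTES) -> List[str]:
    """Return paths under root (relative, sorted) whose size exceeds limit.
    Skips Git and DVC internals."""
    large = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if rel.parts[0] in {".git", ".dvc"}:
            continue
        if path.is_file() and path.stat().st_size > limit:
            large.append(rel.as_posix())
    return sorted(large)


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "data").mkdir()
    (project / "train.py").write_text("print('training')\n")
    (project / "data" / "dataset.csv").write_bytes(b"x" * 2048)

    # A tiny limit keeps the demo fast; real projects would use megabytes.
    candidates = find_large_files(project, limit=1024)
    print(candidates)
```

Any path this reports is a candidate for dvc add rather than git add.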
Summary
Initialize DVC in your project with 'dvc init' to start tracking data.
Use 'dvc add' to track large data files and create pointer files.
Commit the .dvc pointer files and .gitignore changes to Git to link data versions with code.
Run 'dvc push' to upload data files to remote storage for sharing and backup.

Practice

1. What is the main purpose of using dvc add in a project?
easy
A. To push code changes to a remote Git server
B. To initialize a new Git repository
C. To start tracking a data file or directory with DVC
D. To remove data files from the project

Solution

  1. Step 1: Understand the role of dvc add

    dvc add tells DVC to track a data file or directory and creates a .dvc pointer file for you to commit to Git.
  2. Step 2: Differentiate from other commands

    dvc init sets up DVC in a repository, and dvc push uploads data to remote storage. Only dvc add starts tracking data.
  3. Final Answer:

    To start tracking a data file or directory with DVC -> Option C
  4. Quick Check:

    dvc add tracks data files [OK]
Hint: Remember: add means track data files with DVC [OK]
Common Mistakes:
  • Confusing dvc add with dvc init
  • Thinking dvc add pushes data remotely
  • Assuming dvc add initializes Git
2. Which command correctly initializes DVC in an existing Git repository?
easy
A. dvc start
B. dvc init
C. git dvc init
D. dvc create

Solution

  1. Step 1: Identify the DVC initialization command

    The correct command to initialize DVC in a Git repo is dvc init.
  2. Step 2: Eliminate incorrect options

    dvc start and dvc create are not valid DVC commands. git dvc init fails because dvc is not a Git subcommand.
  3. Final Answer:

    dvc init -> Option B
  4. Quick Check:

    DVC init command = dvc init [OK]
Hint: Use dvc init to start DVC in your repo [OK]
Common Mistakes:
  • Typing dvc start instead of dvc init
  • Prefixing with git incorrectly
  • Using non-existent commands like dvc create
3. Given the following commands run in order (assume a default DVC remote is configured before dvc push runs):
git init
dvc init
dvc add data.csv
git add data.csv.dvc
git commit -m "Add data"
dvc push

What happens after dvc push is executed?
medium
A. The data file is deleted locally after upload
B. Only the data.csv.dvc pointer file is pushed to Git remote
C. The Git repository is cloned to remote storage
D. The actual data file data.csv is uploaded to remote storage

Solution

  1. Step 1: Understand dvc push behavior

    dvc push uploads the data files tracked by DVC from the local cache to the configured remote storage. It does not touch Git.
  2. Step 2: Differentiate Git and DVC storage roles

    Git stores small pointer files like data.csv.dvc, while DVC manages big data files separately in remote storage.
  3. Final Answer:

    The actual data file data.csv is uploaded to remote storage -> Option D
  4. Quick Check:

    dvc push uploads data files remotely [OK]
Hint: dvc push uploads the data itself; pointer files travel with git push [OK]
Common Mistakes:
  • Thinking dvc push only pushes Git files
  • Confusing dvc push with git push
  • Assuming data files are deleted after push
4. You ran dvc add dataset.csv but forgot to commit the generated dataset.csv.dvc file to Git. What problem will you face?
medium
A. The data version is not recorded in Git, so collaborators cannot retrieve or reproduce it
B. The data file will be deleted automatically
C. Git will track the data file instead of DVC
D. No problem; DVC tracks data without Git commits

Solution

  1. Step 1: Understand the role of the .dvc pointer file

    The dataset.csv.dvc file is a small pointer that records which version of the data belongs with the code. Git has to track it for that link to be versioned.
  2. Step 2: Consequence of not committing the pointer file

    If you don't commit this pointer file, the data version exists only in your working copy. Git history and your collaborators won't know about it, so the data cannot be retrieved or reproduced from the repository.
  3. Final Answer:

    The data version is not recorded in Git, so collaborators cannot retrieve or reproduce it -> Option A
  4. Quick Check:

    Pointer file committed = data version recorded [OK]
Hint: Always commit .dvc pointer files after dvc add [OK]
Common Mistakes:
  • Assuming data files are tracked without pointer commits
  • Thinking data files get deleted automatically
  • Believing Git tracks large data files directly
5. You have a large dataset tracked by DVC, with remote storage configured. Your teammate cloned the Git repo, but the data files are missing locally. Which command should they run to get the data files?
hard
A. dvc pull
B. dvc add
C. git pull
D. git clone

Solution

  1. Step 1: Understand what dvc pull does

    dvc pull downloads the actual data files from remote storage to the local machine based on the pointer files in Git.
  2. Step 2: Differentiate from Git commands

    git pull updates code and pointer files but does not fetch large data files. dvc add tracks new data, and git clone clones the repo initially.
  3. Final Answer:

    dvc pull -> Option A
  4. Quick Check:

    Use dvc pull to fetch data files locally [OK]
Hint: Use dvc pull to download data after cloning repo [OK]
Common Mistakes:
  • Running only git pull expecting data files
  • Trying dvc add to get data files
  • Confusing git clone with data download
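The situation in question 5 can be pictured with a small check: right after a clone, the pointer files exist but the data they describe does not. The Python sketch below assumes each pointer is named after its data file with a .dvc suffix and sits next to it, as in the examples here. `missing_data_files` is a hypothetical helper, not part of DVC.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_data_files(repo_root: Path) -> List[str]:
    """List data files that have a .dvc pointer but are absent locally."""
    missing = []
    for pointer in repo_root.rglob("*.dvc"):
        if not pointer.is_file():
            continue  # skip directories such as .dvc/
        data_path = pointer.with_suffix("")  # dataset.csv.dvc -> dataset.csv
        if not data_path.exists():
            missing.append(data_path.relative_to(repo_root).as_posix())
    return sorted(missing)


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "data").mkdir()
    # Simulate a fresh clone: both pointers are present, one data file is not.
    (repo / "data" / "dataset.csv.dvc").write_text("outs:\n- path: dataset.csv\n")
    (repo / "data" / "labels.csv.dvc").write_text("outs:\n- path: labels.csv\n")
    (repo / "data" / "labels.csv").write_text("id,label\n")

    to_fetch = missing_data_files(repo)
    print(to_fetch)
```

Files reported here are the ones dvc pull would need to download from remote storage.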