
How to Use DVC for Data Versioning in Machine Learning Projects

Run dvc init inside a Git repository to set up DVC, then track data files with dvc add <file>. Commit the generated .dvc files with Git, and upload the data itself to remote storage with dvc push. Git versions the small pointer files while DVC stores the actual data.

📐

Syntax

dvc init: Initializes DVC in your project folder. Run it inside an existing Git repository (after git init).

dvc add <file>: Adds a data file or folder to DVC tracking, creating a .dvc file that records its version.

git add and git commit: Use Git to stage and commit the .dvc files, the .gitignore that DVC updates, and your code changes.

dvc push: Uploads tracked data files to remote storage for sharing and backup.

dvc pull: Downloads data files from remote storage to your local project.

bash

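dvc init
dvc add data/mydataset.csv
# dvc add writes the ignore rule to a .gitignore in the data file's folder
git add data/mydataset.csv.dvc data/.gitignore
git commit -m "Track dataset with DVC"
dvc remote add -d myremote s3://mybucket/dvcstore
# The remote is saved in .dvc/config, so commit that as well
git add .dvc/config
git commit -m "Configure DVC remote"
dvc push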
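The commands above save versions; getting an earlier version back is the mirror image. Below is a minimal sketch, assuming the data is tracked by a .dvc file and the wanted version is still in the local DVC cache. The helper name restore_data_version and its arguments are hypothetical; git checkout <rev> -- <path> and dvc checkout <target> are the standard commands it wraps.

```shell
# Hypothetical helper: restore the data version recorded at a given Git revision.
# Usage: restore_data_version <git-rev> <path/to/file.dvc>
restore_data_version() {
  if [ "$#" -ne 2 ]; then
    echo "usage: restore_data_version <git-rev> <path/to/file.dvc>" >&2
    return 2
  fi
  rev=$1
  dvc_file=$2

  # Bring back the small pointer file as it was at that revision.
  git checkout "$rev" -- "$dvc_file" || return 1

  # Ask DVC to make the workspace data match the pointer file.
  dvc checkout "$dvc_file"
}
```

For example, restore_data_version HEAD~1 data/mydataset.csv.dvc would restore the dataset as of the previous commit. If that version is not in the local cache, running dvc pull data/mydataset.csv.dvc should fetch it from the remote first.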

💻

Example

This example shows how to initialize DVC, add a dataset file, commit changes with Git, set up remote storage, and push data versions.

bash

mkdir myproject
cd myproject
mkdir data
echo "sample,data" > data/data.csv

# Initialize Git and DVC
git init
dvc init

# Add data file to DVC
dvc add data/data.csv

# Commit changes
git add data/data.csv.dvc data/.gitignore
git commit -m "Add data.csv with DVC"

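# Set up remote storage (local folder for demo)
mkdir -p ../dvc_remote
dvc remote add -d localremote ../dvc_remote

# Commit the remote configuration stored in .dvc/config
git add .dvc/config
git commit -m "Configure DVC remote"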

# Push data to remote
dvc push

# Show tracked files
ls -l data/data.csv data/data.csv.dvc ../dvc_remote

Output (approximate; exact messages and the remote's folder layout depend on your Git and DVC versions)

Initialized empty Git repository in /myproject/.git/
Initialized DVC repository.
Adding 'data/data.csv' to DVC tracking.
Saving 'data/data.csv' to cache.
[master (root-commit) abc1234] Add data.csv with DVC
 1 file changed, 1 insertion(+)
 create mode 100644 data/data.csv.dvc
Preparing to push data to remote 'localremote'
Pushing data to remote storage.
-rw-r--r-- 1 user user 12 Apr 27 12:00 data/data.csv
-rw-r--r-- 1 user user 89 Apr 27 12:00 data/data.csv.dvc
-rw-r--r-- 1 user user 12 Apr 27 12:00 ../dvc_remote/1234567890abcdef1234567890abcdef
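The hash-like name in the remote folder comes from content addressing: DVC names each stored file after the MD5 of its contents. The sketch below reproduces that naming for the demo file. It assumes the DVC 3.x layout (files/md5/<first two hex characters>/<rest>); other releases lay out the store differently, so treat the printed path as illustrative. It also assumes md5sum from GNU coreutils (macOS ships md5 instead).

```shell
# Recreate the demo file's contents in a scratch location.
tmpfile=$(mktemp)
printf 'sample,data\n' > "$tmpfile"

# MD5 of the file contents (first field of md5sum's output).
hash=$(md5sum "$tmpfile" | cut -d ' ' -f 1)

# Assumed DVC 3.x layout: files/md5/<2 chars>/<remaining 30 chars>
prefix=$(printf '%s' "$hash" | cut -c 1-2)
rest=$(printf '%s' "$hash" | cut -c 3-)
object_path="files/md5/$prefix/$rest"

echo "expected object path inside the remote: $object_path"
rm -f "$tmpfile"
```

Because the name depends only on the contents, pushing the same file twice stores it once, and any change to the file yields a new object.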

⚠️

Common Pitfalls

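Forgetting to commit the .dvc files with Git breaks versioning: Git history then has no record of which data version belongs to which commit.
Without a configured remote, dvc push fails and the data exists only in your local cache, so collaborators cannot pull it.
Adding large files directly to Git instead of DVC bloats the repository history and slows down clones and fetches.
Changing a data file without running dvc add again leaves the .dvc file pointing at the old version, so the new contents are not versioned.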
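Two of these pitfalls (a missing remote and stale .dvc files) can be checked mechanically before pushing. The sketch below is a rough approximation of what dvc status reports, written with plain shell tools so the mechanics are visible. It assumes a single-file .dvc pointer containing an indented md5: line, a default remote recorded as remote = <name> in .dvc/config, and GNU md5sum. The function names are hypothetical.

```shell
# Succeeds if .dvc/config (in the given project dir) names a default remote.
# Usage: has_default_remote [project-dir]
has_default_remote() {
  config="${1:-.}/.dvc/config"
  [ -f "$config" ] && grep -Eq '^[[:space:]]*remote[[:space:]]*=' "$config"
}

# Succeeds if the data file no longer matches the hash in its .dvc pointer.
# Usage: data_changed <data-file>   (expects <data-file>.dvc next to it)
data_changed() {
  data_file=$1
  # Take the first md5 entry that sits under "outs:" (indented or after "- ").
  recorded=$(awk '/^[ -]+md5:/ { print $NF; exit }' "$data_file.dvc")
  current=$(md5sum "$data_file" | cut -d ' ' -f 1)
  [ "$recorded" != "$current" ]
}

# Example use before pushing (paths are placeholders):
#   has_default_remote || echo "no default remote: run dvc remote add -d <name> <url>"
#   data_changed data/data.csv && echo "data changed: run dvc add data/data.csv"
```

In day-to-day work, dvc status is the more reliable check, since it also understands tracked folders and pipeline outputs.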

bash

# Wrong way: adding a large data file directly to Git
git add large_data.csv
git commit -m "Add large data file"

# Right way: track it with DVC and commit only the pointer file
dvc add large_data.csv
git add large_data.csv.dvc .gitignore
git commit -m "Track large data with DVC"
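To catch this before it happens, it can help to list files that are large enough to belong in DVC rather than Git. Here is a small sketch using find; the helper name find_large_files and the 10 MiB default threshold are arbitrary choices for illustration, not DVC recommendations.

```shell
# Hypothetical helper: list files above a size threshold,
# skipping Git and DVC internal folders.
# Usage: find_large_files [dir] [min-size-in-KiB]
find_large_files() {
  dir=${1:-.}
  min_kib=${2:-10240}   # 10 MiB by default; pick what suits your repo
  find "$dir" \( -path '*/.git' -o -path '*/.dvc' \) -prune \
    -o -type f -size +"$min_kib"k -print
}
```

Each path it prints is a candidate for dvc add <path> instead of git add.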

📊

Quick Reference

Command	Purpose
dvc init	Start DVC in your project
dvc add <file>	Track data file or folder
git add <file>.dvc .gitignore	Stage DVC tracking files
git commit -m "msg"	Commit changes to Git
dvc remote add -d <name> <url>	Set default remote storage
dvc push	Upload data to remote storage
dvc pull	Download data from remote storage

✅

Key Takeaways

Initialize DVC with 'dvc init' to start data versioning in your project.

Use 'dvc add <file>' to track data files without storing them in Git.

Always commit the generated .dvc files and .gitignore changes with Git.

Set up remote storage and use 'dvc push' to share and back up data versions.

Avoid adding large data files directly to Git to keep your repo fast.