How to Use DVC for Data Versioning in Machine Learning Projects
Use dvc init to start DVC in your project, then track data files with dvc add <file>. Commit changes with Git and push data versions to remote storage using dvc push to manage data versioning efficiently.
Syntax
dvc init: Initializes DVC in your project folder to start tracking data.
dvc add <file>: Adds a data file or folder to DVC tracking, creating a .dvc file that records its version.
git add && git commit: Use Git to commit the .dvc files and project code changes.
dvc push: Uploads tracked data files to remote storage for sharing and backup.
dvc pull: Downloads data files from remote storage to your local project.
```bash
dvc init
dvc add data/mydataset.csv
# dvc add writes its ignore rule to data/.gitignore, next to the tracked file
git add data/mydataset.csv.dvc data/.gitignore
git commit -m "Track dataset with DVC"
dvc remote add -d myremote s3://mybucket/dvcstore
dvc push
```
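To make the idea of a .dvc file "recording a version" more concrete, here is a minimal sketch of content hashing using only standard tools. It does not call DVC: the `.pointer` file name and its two-field layout are hypothetical simplifications (real .dvc files are YAML with more fields), and `md5sum` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so no real project is touched.
workdir="$(mktemp -d)"
printf 'sample,data\n' > "$workdir/mydataset.csv"

# Hash the file contents. DVC stores a content hash like this in the .dvc file.
hash_v1="$(md5sum "$workdir/mydataset.csv" | cut -d' ' -f1)"

# Write a simplified, hypothetical pointer file (not DVC's exact format).
printf 'md5: %s\npath: mydataset.csv\n' "$hash_v1" > "$workdir/mydataset.csv.pointer"

# Editing the data changes the hash, which is what marks a new version.
printf 'more,rows\n' >> "$workdir/mydataset.csv"
hash_v2="$(md5sum "$workdir/mydataset.csv" | cut -d' ' -f1)"

echo "v1: $hash_v1"
echo "v2: $hash_v2"
```

The small pointer file is what would go into Git, while the data itself stays outside Git history. That split is why the .dvc file must be committed for a version to be recoverable.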
Example
This example shows how to initialize DVC, add a dataset file, commit changes with Git, set up remote storage, and push data versions.
```bash
mkdir myproject
cd myproject
mkdir data
echo "sample,data" > data/data.csv

# Initialize Git and DVC
git init
dvc init

# Add data file to DVC
dvc add data/data.csv

# Commit changes (dvc add writes its ignore rule to data/.gitignore)
git add data/data.csv.dvc data/.gitignore
git commit -m "Add data.csv with DVC"

# Setup remote storage (local folder for demo)
mkdir -p ../dvc_remote
dvc remote add -d localremote ../dvc_remote

# Push data to remote
dvc push

# Show tracked files
ls -l data/data.csv data/data.csv.dvc ../dvc_remote
```
Output (illustrative; exact messages, hashes, and file counts depend on your DVC and Git versions)
Initialized empty Git repository in /myproject/.git/
Initialized DVC repository.
Adding 'data/data.csv' to DVC tracking.
Saving 'data/data.csv' to cache.
[master (root-commit) abc1234] Add data.csv with DVC
1 file changed, 1 insertion(+)
create mode 100644 data/data.csv.dvc
Preparing to push data to remote 'localremote'
Pushing data to remote storage.
-rw-r--r-- 1 user user 12 Apr 27 12:00 data/data.csv
-rw-r--r-- 1 user user 89 Apr 27 12:00 data/data.csv.dvc
-rw-r--r-- 1 user user 12 Apr 27 12:00 ../dvc_remote/1234567890abcdef1234567890abcdef
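The example pushes data but does not show the reverse direction. As a rough sketch, a collaborator who has cloned the Git repository could restore the data with dvc pull. The `restore_data` helper below is hypothetical, and its check for a `.dvc` directory assumes you run it from the project root.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: restore DVC-tracked data in the current project.
restore_data() {
  if ! command -v dvc >/dev/null 2>&1; then
    echo "dvc not found on PATH; install it first" >&2
    return 1
  fi
  if [ ! -d .dvc ]; then
    echo "no .dvc directory here; run this from the project root" >&2
    return 1
  fi
  # Download the data referenced by the .dvc files in the checked-out commit.
  dvc pull
}
```

After cloning the repository and changing into it, calling `restore_data` runs dvc pull against the default remote. The two guards only turn the most common setup mistakes into readable error messages.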
Common Pitfalls
- Forgetting to commit the .dvc files with Git breaks data versioning, because those files are what tie each Git commit to a data version.
- Not setting up remote storage means dvc push will fail and the data cannot be shared.
- Adding large files directly to Git instead of DVC leads to slow repos and large history.
- Changing a data file without running dvc add again leaves the .dvc file pointing at the old version.
```bash
# Wrong way: Adding large data file directly to Git
git add large_data.csv
git commit -m "Add large data file"

# Right way: Track with DVC
dvc add large_data.csv
git add large_data.csv.dvc .gitignore
git commit -m "Track large data with DVC"
```
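One way to catch the large-file pitfall early is to scan the project for big files before staging anything. The sketch below is only an illustration: `find_large_files` is a hypothetical helper, the 1024 KiB limit is an arbitrary choice, and the demo directory with its dummy files exists just to show the call.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: list files larger than a limit (in KiB),
# skipping Git and DVC internals.
find_large_files() {
  local dir="$1" limit_kb="$2"
  find "$dir" -type f -size +"${limit_kb}"k \
    ! -path '*/.git/*' ! -path '*/.dvc/*'
}

# Demo in a throwaway directory: one tiny file and one 2 MiB dummy file.
demo="$(mktemp -d)"
printf 'a,b\n' > "$demo/small.csv"
head -c 2097152 /dev/zero > "$demo/large_data.csv"

# Report anything over 1024 KiB as a candidate for dvc add.
large="$(find_large_files "$demo" 1024)"
echo "$large"
```

Files reported by a check like this are candidates for dvc add rather than git add.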
Quick Reference
| Command | Purpose |
|---|---|
| dvc init | Start DVC in your project |
| dvc add | Track data file or folder |
| git add | Stage DVC tracking files |
| git commit -m "msg" | Commit changes to Git |
| dvc remote add -d | Set default remote storage |
| dvc push | Upload data to remote storage |
| dvc pull | Download data from remote storage |
Key Takeaways
Initialize DVC with 'dvc init' to start data versioning in your project.
Use 'dvc add <file>' to track data files without storing them in Git.
Always commit the generated .dvc files and .gitignore changes with Git.
Set up remote storage and use 'dvc push' to share and back up data versions.
Avoid adding large data files directly to Git to keep your repo fast.