
How to Use DVC with Git for Data Version Control

Run dvc init inside an existing Git repository, then track large data files with dvc add. Commit the generated .dvc metafiles to Git, and use dvc push to upload the data itself to remote storage. Git keeps versioning your code and the small metafiles, while DVC stores the large files.

Syntax

Here are the main commands to use DVC with Git:

  • dvc init: Initialize DVC in your Git repo.
  • dvc add <file>: Track large data files with DVC.
  • git add <file>.dvc <dir>/.gitignore: Add the DVC metafile, and the .gitignore that dvc add writes next to the data file, to Git.
  • git commit -m "message": Commit changes to Git.
  • dvc remote add -d <name> <url>: Set remote storage for data.
  • dvc push: Upload data files to remote storage.
  • dvc pull: Download data files from remote storage.
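bash
# Run inside an existing Git repository
dvc init
dvc add data/raw_data.csv
# dvc add writes the .gitignore next to the data file, in data/
git add data/raw_data.csv.dvc data/.gitignore
git commit -m "Track raw data with DVC"
dvc remote add -d storage s3://mybucket/dvcstore
# The remote is recorded in .dvc/config; commit it so collaborators can pull
git add .dvc/config
git commit -m "Configure DVC remote"
dvc push
Output (abridged and illustrative; exact messages vary by DVC and Git version)
Initialized DVC repository.
Adding 'data/raw_data.csv' to DVC tracking.
Tracking 'data/raw_data.csv' with DVC.
[master abc1234] Track raw data with DVC
 2 files changed, 20 insertions(+)
Remote 'storage' has been added.
Uploading data to remote storage...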
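
The commands above cover the publishing side. On the receiving side, a collaborator would typically clone the Git repository and then run dvc pull to fetch the data. The sketch below wraps those two steps in a small helper. The name clone_with_data is hypothetical (it is not a DVC command), and the sketch assumes the repository's default DVC remote has been committed in .dvc/config.

```bash
# Hypothetical helper (not part of DVC): clone a repo, then fetch its data.
# Assumes the repo's default DVC remote is committed in .dvc/config.
clone_with_data() {
  local repo_url="$1"
  local target_dir="$2"
  git clone "$repo_url" "$target_dir" || return 1
  # A subshell keeps the caller's working directory unchanged.
  (cd "$target_dir" && dvc pull)
}
```

Calling clone_with_data <repo-url> my-ml-project should leave a working copy with both code and data, provided you have read access to the remote storage.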

Example

This example shows how to initialize DVC in a Git repo, track a data file, commit changes, set a remote, and push data.

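bash
mkdir my-ml-project
cd my-ml-project
git init
# Create a small sample data file
mkdir data
echo "sample,data" > data/data.csv
dvc init
dvc add data/data.csv
# Stage the metafile and the .gitignore that dvc add created in data/
git add data/data.csv.dvc data/.gitignore
git commit -m "Add data file with DVC"
# Placeholder bucket; S3 remotes need the dvc[s3] extra and AWS credentials
dvc remote add -d myremote s3://mybucket/dvcstore
# Commit the remote configuration stored in .dvc/config
git add .dvc/config
git commit -m "Configure DVC remote"
dvc push
Output (abridged and illustrative; exact messages vary by DVC and Git version)
Initialized empty Git repository.
Initialized DVC repository.
Adding 'data/data.csv' to DVC tracking.
Tracking 'data/data.csv' with DVC.
[master 123abcd] Add data file with DVC
 2 files changed, 20 insertions(+)
Remote 'myremote' has been added.
Uploading data to remote storage...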
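
Once a data file is tracked this way, going back to an earlier version of it is a two-step operation: restore the old .dvc metafile with Git, then let DVC sync the data to match. The sketch below shows one way to do that. The helper name restore_data_version is hypothetical, and it assumes the old data is still available in the local DVC cache (otherwise run dvc pull first).

```bash
# Hypothetical helper (not part of DVC): restore a data file as it was
# at a given Git revision.
restore_data_version() {
  local revision="$1"   # e.g. HEAD~1 or a tag name
  local dvc_file="$2"   # e.g. data/data.csv.dvc
  # Bring back the old metafile, which records the hash of the old data.
  git checkout "$revision" -- "$dvc_file" || return 1
  # Make the workspace copy of the data match that metafile.
  dvc checkout "$dvc_file"
}
```

For example, restore_data_version HEAD~1 data/data.csv.dvc would bring back the data as of the previous commit.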

Common Pitfalls

Common mistakes when using DVC with Git:

  • Not committing the .dvc files to Git, so Git has no record of which data version belongs to which commit.
  • Forgetting to configure remote storage before running dvc push.
  • Adding large data files directly to Git instead of using dvc add.
  • Not pushing data to remote storage, so collaborators who run dvc pull cannot get the data files.
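
The first pitfall is easy to check for before you push. The sketch below asks Git whether any .dvc metafile is untracked, modified, or staged but not yet committed. The function name check_dvc_files_committed is hypothetical; the only real command it relies on is git status --porcelain with a pathspec.

```bash
# Hypothetical pre-push check: report .dvc metafiles that are not committed.
check_dvc_files_committed() {
  local pending
  # --porcelain prints one line per changed or untracked path, or nothing.
  pending=$(git status --porcelain -- '*.dvc')
  if [ -n "$pending" ]; then
    echo "Uncommitted .dvc files:"
    echo "$pending"
    return 1
  fi
  echo "All .dvc files are committed."
}
```

Because the function returns a non-zero status when something is pending, it could be chained as check_dvc_files_committed && git push.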

Wrong way: Adding data files directly to Git:

bash
git add data/large_file.csv
git commit -m "Add large data file directly"
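
For contrast, a DVC-based way to add the same file might look like the sketch below. The helper name track_with_dvc is hypothetical. It assumes dvc init has already been run and that the file has not been committed to Git before.

```bash
# Hypothetical helper (not part of DVC): track a file with DVC and stage
# the small files that Git should version instead of the data itself.
track_with_dvc() {
  local data_file="$1"
  local data_dir
  data_dir=$(dirname "$data_file")
  # DVC stores the file content in its cache and writes <file>.dvc.
  dvc add "$data_file" || return 1
  # Git only sees the metafile and the .gitignore entry next to the data.
  git add "$data_file.dvc" "$data_dir/.gitignore"
}
```

After track_with_dvc data/large_file.csv, a normal git commit records the metafile. If the file was already committed to Git, remove it from the index first with git rm --cached data/large_file.csv.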

Quick Reference

Command                          | Purpose
-------------------------------- | ---------------------------------
dvc init                         | Initialize DVC in Git repo
dvc add <file>                   | Track large data file with DVC
git add <file>.dvc               | Add DVC metafile to Git
git commit -m "msg"              | Commit changes to Git
dvc remote add -d <name> <url>   | Set default remote storage
dvc push                         | Upload data to remote storage
dvc pull                         | Download data from remote storage

Key Takeaways

Initialize DVC in your Git repo with 'dvc init' before tracking data.
Use 'dvc add' to track large files and commit the generated .dvc files to Git.
Configure remote storage with 'dvc remote add' to share data across collaborators.
Always push data to remote storage with 'dvc push' to keep data accessible.
Never add large data files directly to Git; use DVC to manage them efficiently.