How to Use DVC with Git for Data Version Control
Use dvc init to start DVC in your Git repository, then track large data files with dvc add. Commit the generated .dvc files to Git, and use dvc push to upload the data to remote storage. Git tracks code and small files, while DVC tracks the data.
Syntax
Here are the main commands to use DVC with Git:
- dvc init: Initialize DVC in your Git repo.
- dvc add <file>: Track large data files with DVC.
- git add <file>.dvc: Add DVC metafiles to Git.
- git commit -m "message": Commit changes to Git.
- dvc remote add -d <name> <url>: Set the default remote storage for data. The setting is written to .dvc/config, so commit that file to Git as well.
- dvc push: Upload data files to remote storage.
- dvc pull: Download data files from remote storage.
bash
dvc init
dvc add data/raw_data.csv
# dvc add writes the ignore rule to data/.gitignore, next to the data file
git add data/raw_data.csv.dvc data/.gitignore
git commit -m "Track raw data with DVC"
dvc remote add -d storage s3://mybucket/dvcstore
dvc push
Output
Initialized DVC repository.
Adding 'data/raw_data.csv' to DVC tracking.
Tracking 'data/raw_data.csv' with DVC.
[master abc1234] Track raw data with DVC
2 files changed, 20 insertions(+)
Remote 'storage' has been added.
Uploading data to remote storage...
Example
This example shows how to initialize DVC in a Git repo, track a data file, commit changes, set a remote, and push data.
bash
mkdir my-ml-project
cd my-ml-project
git init
mkdir data
echo "sample,data" > data/data.csv
dvc init
dvc add data/data.csv
# dvc add writes the ignore rule to data/.gitignore, next to the data file
git add data/data.csv.dvc data/.gitignore
git commit -m "Add data file with DVC"
dvc remote add -d myremote s3://mybucket/dvcstore
dvc push
Output
Initialized empty Git repository.
Initialized DVC repository.
Adding 'data/data.csv' to DVC tracking.
Tracking 'data/data.csv' with DVC.
[master 123abcd] Add data file with DVC
2 files changed, 20 insertions(+)
Remote 'myremote' has been added.
Uploading data to remote storage...
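The commands above cover the publishing side only. On a collaborator's machine the reverse step is dvc pull, run after cloning the Git repository. The following is a minimal sketch, not part of the original example: the function name fetch_project and both of its arguments are hypothetical, and it assumes the remote configuration in .dvc/config was committed and that the collaborator has credentials for the remote.

```bash
# Hypothetical helper: clone a DVC-enabled Git repository and fetch its data.
# $1 = Git repository URL (placeholder), $2 = directory to clone into (placeholder)
fetch_project() {
    repo_url="$1"
    target_dir="$2"

    # Git brings the code and the small .dvc metafiles.
    git clone "$repo_url" "$target_dir" || return 1

    # DVC downloads the data those metafiles point to.
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$target_dir" && dvc pull )
}
```

It would be called as fetch_project <repo-url> my-ml-project. If the clone fails, the function returns a non-zero status without running dvc pull.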
Common Pitfalls
Common mistakes when using DVC with Git:
- Not committing the .dvc files to Git, so data tracking is lost.
- Forgetting to set or configure remote storage before pushing data.
- Adding large data files directly to Git instead of using dvc add.
- Not pushing data to remote storage, causing collaborators to miss data files.
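The first pitfall can be checked with Git alone. As a rough sketch (the function name uncommitted_dvc_files is hypothetical), the snippet below lists every .dvc metafile that is new, modified, or staged but not yet committed. It assumes it is run from inside the Git repository and that file names contain no unusual characters, since Git quotes those in its porcelain output.

```bash
# Hypothetical helper: print .dvc metafiles that are not yet committed to Git.
uncommitted_dvc_files() {
    # Porcelain lines look like "?? path" or " M path";
    # cut drops the two status characters and the space after them.
    git status --porcelain --untracked-files=all -- '*.dvc' | cut -c4-
}
```

An empty result means all metafiles are committed. Any path it prints still needs git add and git commit.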
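Wrong way: Adding data files directly to Git:
bash
git add data/large_file.csv
git commit -m "Add large data file directly"
Right way: Tracking the file with DVC and committing only the metafile to Git:
bash
dvc add data/large_file.csv
git add data/large_file.csv.dvc data/.gitignore
git commit -m "Track large data file with DVC"
Quick Reference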
| Command | Purpose |
|---|---|
| dvc init | Initialize DVC in Git repo |
| dvc add | Track large data file with DVC |
| git add | Add DVC metafile to Git |
| git commit -m "msg" | Commit changes to Git |
| dvc remote add -d | Set default remote storage |
| dvc push | Upload data to remote storage |
| dvc pull | Download data from remote storage |
Key Takeaways
Initialize DVC in your Git repo with 'dvc init' before tracking data.
Use 'dvc add' to track large files and commit the generated .dvc files to Git.
Configure remote storage with 'dvc remote add' to share data across collaborators.
Always push data to remote storage with 'dvc push' to keep data accessible.
Never add large data files directly to Git; use DVC to manage them efficiently.
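The last takeaway can be backed by a quick check before committing. This is a sketch under stated assumptions, not a DVC feature: the function name large_staged_files is hypothetical, the 10 MiB default is an arbitrary threshold, and file names are assumed to contain no characters that Git would quote.

```bash
# Hypothetical helper: print staged files larger than a size limit in bytes.
# $1 = size limit (optional, defaults to 10 MiB)
large_staged_files() {
    limit="${1:-10485760}"

    # List files that are added or modified in the index.
    git diff --cached --name-only --diff-filter=AM |
    while IFS= read -r path; do
        [ -f "$path" ] || continue
        # Arithmetic expansion strips any padding that wc adds on some systems.
        size=$(( $(wc -c < "$path") ))
        if [ "$size" -gt "$limit" ]; then
            printf '%s\n' "$path"
        fi
    done
}
```

Any file it prints is a candidate for git rm --cached followed by dvc add.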