How to Version Data for Machine Learning Projects
To version data for ML, use data versioning tools like DVC or Git LFS that track dataset changes alongside code. This helps keep datasets organized, reproducible, and easy to update during model development.
Syntax
Data versioning typically involves commands to add, track, and update datasets in a version control system designed for large files.
For example, with DVC:
- dvc init: Initialize data versioning in your project.
- dvc add <data-file>: Track a data file or folder.
- git add . and git commit -m "message": Commit changes including data pointers.
- dvc push: Upload data to remote storage.
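```bash
# Run inside an existing Git repository (dvc init requires one by default)
dvc init
dvc add data/train.csv
# dvc add writes the pointer file and a .gitignore next to the data
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training data version"
# Requires a configured DVC remote
dvc push
```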
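To make the idea of a "data pointer" concrete, here is a minimal sketch of content-addressed tracking in plain Python. It is an illustration of the concept only, not DVC's actual implementation: the `track` helper, the `.pointer.json` suffix, the `cache` folder name, and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy the data into a hash-named cache entry and write a small pointer file.

    Hypothetical helper: the pointer is the small file you would commit to Git,
    while the cache entry is what would be uploaded to remote storage.
    """
    checksum = file_checksum(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cache_dir / checksum)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({
        "path": data_file.name,
        "sha256": checksum,
        "size": data_file.stat().st_size,
    }, indent=2))
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = root / "train.csv"

    # Version 1 of the dataset
    data.write_text("feature,label\n1,0\n2,1\n3,0\n")
    pointer_v1 = json.loads(track(data, root / "cache").read_text())

    # Version 2: one extra row, so the checksum (and the pointer) changes
    data.write_text("feature,label\n1,0\n2,1\n3,0\n4,1\n")
    pointer_v2 = json.loads(track(data, root / "cache").read_text())

    cached_versions = sorted(p.name for p in (root / "cache").iterdir())

print("Pointer changed:", pointer_v1["sha256"] != pointer_v2["sha256"])
print("Versions kept in cache:", len(cached_versions))
```

The pointer stays a few lines long no matter how large the dataset grows, which is why it is practical to commit it to Git while the data itself lives elsewhere.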
Example
This example shows how to version a CSV dataset using DVC in a simple ML project. It tracks the data file, commits the pointer to Git, and pushes the data to remote storage.
```python
import os
import subprocess

# Assumes DVC is installed and the script runs at the root of an
# existing Git repository (dvc init requires one by default).

# Step 1: Initialize DVC in the project folder
subprocess.run(["dvc", "init"], check=True)

# Step 2: Create a sample data file
os.makedirs("data", exist_ok=True)
with open("data/train.csv", "w") as f:
    f.write("feature,label\n1,0\n2,1\n3,0\n")

# Step 3: Add data file to DVC tracking
subprocess.run(["dvc", "add", "data/train.csv"], check=True)

# Step 4: Commit changes to Git (dvc add writes data/.gitignore next to the data)
subprocess.run(["git", "add", "data/train.csv.dvc", "data/.gitignore"], check=True)
subprocess.run(["git", "commit", "-m", "Add training data version"], check=True)

# Step 5: Push data to remote storage (requires remote setup)
# subprocess.run(["dvc", "push"], check=True)

print("Data versioning setup complete.")
```
Output
Data versioning setup complete.
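Step 5 in the example is commented out because it needs a remote. The sketch below lists the commands that would typically register a default remote and then push, assuming a local directory is used as storage. The remote name `localremote` and the path `/tmp/dvc-storage` are hypothetical placeholders, and the helper functions are written for this illustration only.

```python
import subprocess


def remote_setup_commands(name: str, url: str) -> list[list[str]]:
    """Return the CLI calls that register a default DVC remote and upload tracked data."""
    return [
        # -d marks the new remote as the default one
        ["dvc", "remote", "add", "-d", name, url],
        # The remote definition is stored in .dvc/config, which is tracked by Git
        ["git", "add", ".dvc/config"],
        ["git", "commit", "-m", f"Configure DVC remote {name}"],
        ["dvc", "push"],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Execute each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical remote name and storage location; replace with your own
commands = remote_setup_commands("localremote", "/tmp/dvc-storage")

# Print the plan instead of executing it, so the sketch is safe to run anywhere
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_all(commands)` from the project root would execute the plan; it is left out here so the snippet has no side effects.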
Common Pitfalls
- Not tracking data files: Forgetting to add data files to the versioning tool leads to missing dataset versions.
- Committing large data directly to Git: Git is not designed for big files; use tools like DVC or Git LFS instead.
- Not setting up remote storage: Without remote storage, data versions are only local and not shareable.
- Ignoring data schema changes: Changes in data format or columns should be documented and versioned carefully.
```bash
## Wrong way: committing large data directly to Git
# git add data/train.csv
# git commit -m "Add large data file"

## Right way: use DVC to track data
# dvc add data/train.csv
# git add data/train.csv.dvc data/.gitignore
# git commit -m "Track data with DVC"
```
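One way to catch the "large data in Git" pitfall early is to scan the project for oversized files before committing. The following is a minimal sketch under stated assumptions: the `find_large_files` helper and the size threshold are hypothetical choices for this example, not part of DVC or Git.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical threshold; pick whatever limit suits your repository
SIZE_LIMIT_BYTES = 50 * 1024 * 1024


def find_large_files(root, limit=SIZE_LIMIT_BYTES, skip_dirs=(".git", ".dvc")):
    """Return (relative_path, size) pairs for files larger than `limit` bytes."""
    root = Path(root)
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune internal folders in place so os.walk does not descend into them
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = Path(dirpath) / name
            size = path.stat().st_size
            if size > limit:
                large.append((path.relative_to(root).as_posix(), size))
    return sorted(large)


# Demo on a throwaway folder, using a tiny limit so no big files are needed
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "small.csv").write_bytes(b"a,b\n1,2\n")
    (data_dir / "big.bin").write_bytes(b"\0" * 2048)
    demo_result = find_large_files(tmp, limit=1024)

for rel_path, size in demo_result:
    print(f"{rel_path} is {size} bytes; consider tracking it with DVC instead")
```

A check like this could be wired into a pre-commit hook, but that setup is outside the scope of this sketch.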
Quick Reference
| Command | Purpose |
|---|---|
| dvc init | Initialize DVC in your project |
| dvc add | Track a data file or folder |
| git add | Stage changes including DVC pointers |
| git commit -m "msg" | Commit changes to Git |
| dvc push | Upload data to remote storage |
| dvc pull | Download data from remote storage |
Key Takeaways
- Use specialized tools like DVC or Git LFS to version large ML datasets efficiently.
- Always track data changes alongside code to ensure reproducibility.
- Set up remote storage to share and back up dataset versions.
- Avoid committing large data files directly to Git to prevent repository bloat.
- Document data schema and format changes as part of versioning.
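Documenting schema changes can be partly automated. As a rough sketch, the snippet below reads the header row of two CSV versions and reports which columns were added or removed. The helper names and the sample `weight` column are assumptions made for illustration; a real project might store the resulting snapshot next to the data pointer.

```python
import csv
import io
import json


def read_schema(csv_text: str) -> list[str]:
    """Return the column names from the header row of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)


def diff_schema(old: list[str], new: list[str]) -> dict[str, list[str]]:
    """List columns that appear only in the new or only in the old version."""
    return {
        "added": [c for c in new if c not in old],
        "removed": [c for c in old if c not in new],
    }


# Two hypothetical versions of the same dataset
old_csv = "feature,label\n1,0\n2,1\n"
new_csv = "feature,weight,label\n1,0.5,0\n2,0.7,1\n"

changes = diff_schema(read_schema(old_csv), read_schema(new_csv))

# A snapshot like this could be committed alongside the data pointer
schema_snapshot = json.dumps({"columns": read_schema(new_csv)})

print("Schema changes:", changes)
print("Snapshot:", schema_snapshot)
```

This only compares column names; type or unit changes would need a richer check.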