
How to Version Data for Machine Learning Projects

To version data for ML, use a data versioning tool such as DVC or Git LFS, which tracks dataset changes alongside your code. Each commit then records which dataset version it used, which keeps experiments reproducible and makes datasets easy to update during model development.

Syntax

Data versioning typically involves commands to add, track, and update datasets in a version control system designed for large files.

For example, with DVC:

  • dvc init: Initialize data versioning in your project.
  • dvc add <data-file>: Track a data file or folder.
  • git add . and git commit -m "message": Commit changes including data pointers.
  • dvc push: Upload data to remote storage.
bash
dvc init
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training data version"
dvc push
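
To make the "data pointer" idea concrete: the small `.dvc` file committed to Git identifies a dataset version by a checksum of its contents (MD5 by default) instead of storing the data itself. The sketch below illustrates that idea with Python's `hashlib`. The helper name `file_md5` and the chunk size are illustrative assumptions, and this is not DVC's actual implementation.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.csv")
        with open(path, "w", newline="") as f:
            f.write("feature,label\n1,0\n2,1\n3,0\n")
        before = file_md5(path)

        # Appending a single row is enough to produce a different checksum.
        with open(path, "a", newline="") as f:
            f.write("4,1\n")
        after = file_md5(path)

        print("hash changed after edit:", before != after)
```

Because any change to the file changes the checksum, committing the updated pointer is enough for Git history to distinguish dataset versions.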

Example

This example shows how to version a CSV dataset using DVC in a simple ML project. It tracks the data file, commits the pointer to Git, and, once a remote is configured, pushes the data to remote storage.

python
import os
import subprocess

# Step 1: Initialize DVC in the project folder (run this inside an existing Git repository)
subprocess.run(["dvc", "init"], check=True)

# Step 2: Create a sample data file
os.makedirs("data", exist_ok=True)
with open("data/train.csv", "w") as f:
    f.write("feature,label\n1,0\n2,1\n3,0\n")

# Step 3: Add data file to DVC tracking
subprocess.run(["dvc", "add", "data/train.csv"], check=True)

# Step 4: Commit changes to Git
subprocess.run(["git", "add", "data/train.csv.dvc", "data/.gitignore"], check=True)
subprocess.run(["git", "commit", "-m", "Add training data version"], check=True)

# Step 5: Push data to remote storage (requires remote setup)
# subprocess.run(["dvc", "push"], check=True)

print("Data versioning setup complete.")
Output
Data versioning setup complete.
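
The script above assumes that `git` and `dvc` are installed and that it runs inside a Git repository, which `dvc init` expects by default. A small pre-flight check, sketched below, can surface those problems before any command runs. The function names `missing_tools` and `in_git_repo` are hypothetical helpers introduced for illustration, and the repository check is a rough heuristic that only looks for a `.git` entry.

```python
import os
import shutil


def missing_tools(tools=("git", "dvc")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def in_git_repo(path="."):
    """Rough check: True if `path` or any parent directory contains a .git entry."""
    current = os.path.abspath(path)
    while True:
        if os.path.exists(os.path.join(current, ".git")):
            return True
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return False
        current = parent


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install before continuing:", ", ".join(missing))
    elif not in_git_repo():
        print("Run 'git init' first: 'dvc init' expects a Git repository by default.")
    else:
        print("Ready to run the versioning steps.")
```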

Common Pitfalls

  • Not tracking data files: Forgetting to add data files to the versioning tool leads to missing dataset versions.
  • Committing large data directly to Git: Git is not designed for big files; use tools like DVC or Git LFS instead.
  • Not setting up remote storage: Without remote storage, data versions are only local and not shareable.
  • Ignoring data schema changes: Changes in data format or columns should be documented and versioned carefully.
bash
## Wrong way: committing large data directly to Git
# git add data/train.csv
# git commit -m "Add large data file"

## Right way: use DVC to track data
# dvc add data/train.csv
# git add data/train.csv.dvc data/.gitignore
# git commit -m "Track data with DVC"
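
For the last pitfall in the list, one lightweight way to notice schema changes is to compare the CSV header before and after a dataset update. The sketch below assumes a CSV file with a header row. `read_header` and `schema_diff` are hypothetical helper names and are not part of DVC.

```python
import csv
import os
import tempfile


def read_header(csv_path):
    """Return the column names from the first row of a CSV file."""
    with open(csv_path, newline="") as f:
        return next(csv.reader(f), [])


def schema_diff(old_columns, new_columns):
    """Return (added, removed) column names, preserving their original order."""
    added = [c for c in new_columns if c not in old_columns]
    removed = [c for c in old_columns if c not in new_columns]
    return added, removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.csv")

        # First version of the dataset.
        with open(path, "w", newline="") as f:
            f.write("feature,label\n1,0\n")
        old = read_header(path)

        # Updated version with an extra column.
        with open(path, "w", newline="") as f:
            f.write("feature,weight,label\n1,0.5,0\n")
        added, removed = schema_diff(old, read_header(path))

        print("added:", added, "removed:", removed)
```

Noting the reported differences in the commit message that updates the `.dvc` pointer keeps schema changes tied to the data version that introduced them.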

Quick Reference

| Command | Purpose |
|---|---|
| dvc init | Initialize DVC in your project |
| dvc add <data-file> | Track a data file or folder |
| git add | Stage changes including DVC pointers |
| git commit -m "msg" | Commit changes to Git |
| dvc push | Upload data to remote storage |
| dvc pull | Download data from remote storage |
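
These commands can also be combined to go back to an earlier dataset version: restore the pointer file from an older Git revision, then let DVC sync the data to match it. The sketch below is a minimal illustration under a few assumptions: `git` and `dvc` are installed, the data was tracked as in the example, and the older data is still in the local DVC cache. It uses `dvc checkout`, which is a DVC command not listed in the table. `restore_commands` and `restore` are hypothetical helpers, and `HEAD~1` is only an example revision.

```python
import subprocess


def restore_commands(pointer_file, revision):
    """Build the Git and DVC commands that restore one tracked file to `revision`."""
    return [
        # Restore the small pointer file as it was at the given Git revision.
        ["git", "checkout", revision, "--", pointer_file],
        # Ask DVC to update the workspace data to match that pointer.
        ["dvc", "checkout", pointer_file],
    ]


def restore(pointer_file, revision, dry_run=True):
    """Print the commands, and run them unless dry_run is True."""
    for cmd in restore_commands(pointer_file, revision):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default so nothing in the working tree changes.
    restore("data/train.csv.dvc", "HEAD~1")
```

If the older data is no longer in the local cache, `dvc pull` can fetch it from remote storage instead.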

Key Takeaways

  • Use specialized tools like DVC or Git LFS to version large ML datasets efficiently.
  • Always track data changes alongside code to ensure reproducibility.
  • Set up remote storage to share and back up dataset versions.
  • Avoid committing large data files directly to Git to prevent repository bloat.
  • Document data schema and format changes as part of versioning.